In [1]:
import logging

logging.basicConfig(level=logging.ERROR)
logging.getLogger().setLevel(level=logging.ERROR)

import warnings
warnings.simplefilter("ignore")

# Load the dataset

In [2]:
from mlprimitives import datasets

dataset = datasets.load_newsgroups()
dataset.describe()

20 News Groups dataset.

    The data of this dataset is a 1d numpy array vector containing the texts
    from 11314 newsgroups posts, and the target is a 1d numpy integer array
    containing the label of one of the 20 topics that they are about.
    
Data Modality: text
Task Type: classification
Task Subtype: multiclass
Data shape: (11314,)
Target shape: (11314,)
Metric: accuracy_score
Extras: 


# Split the dataset into train/test

In [3]:
X_train, X_test, y_train, y_test = dataset.get_splits(1)

The `X` variables contain the raw texts.

In [4]:
X_train[0]

'From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject: Re: ARMENIA SAYS IT COULD SHOOT DOWN TURKISH PLANES (Henrik)\nLines: 53\nNntp-Posting-Host: alban.dsv.su.se\nReply-To: hilmi-er@dsv.su.se (Hilmi Eren)\nOrganization: Dept. of Computer and Systems Sciences, Stockholm University\n\n\n  \n|>      henrik@quayle.kpc.com writes:\n\n\n|>\tThe Armenians in Nagarno-Karabagh are simply DEFENDING their RIGHTS\n|>        to keep their homeland and it is the AZERIS that are INVADING their \n|>        territorium...\n\t\n\n\tHomeland? First Nagarno-Karabagh was Armenians homeland today\n\tFizuli, Lacin and several villages (in Azerbadjan)\n\tare their homeland. Can\'t you see the\n\tthe  "Great Armenia" dream in this? With facist methods like\n\tkilling, raping and bombing villages. The last move was the \n\tblast of a truck with 60 kurdish refugees, trying to\n\tescape the from Lacin, a city that was "given" to the Kurds\n\tby the Armenians. \n\n\n|>       However, I hope that the Armenians WILL forc

The `y` variables contain the category of the corresponding articles.

In [5]:
y_train[0:5]

array([17,  6,  0, 14, 14])

Our goal is to predict the value of the `y` variable from the text contents.
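Before modelling, it can help to see how the labels are distributed. The snippet below is a minimal sketch using only the five labels printed above; `sample_labels` is a stand-in name, and passing `y_train` instead would give the counts for the full training set.

```python
from collections import Counter

# The first five labels shown in the output above.
# Replace with y_train (e.g. y_train.tolist()) to inspect the full training set.
sample_labels = [17, 6, 0, 14, 14]

# Count how many times each category id appears.
label_counts = Counter(sample_labels)

# Number of distinct categories present in the sample.
num_distinct = len(label_counts)

for label, count in sorted(label_counts.items()):
    print(f"category {label}: {count} article(s)")
print("distinct categories:", num_distinct)
```

On the full dataset, the number of distinct categories should be 20, matching the dataset description.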

# Build the Pipeline

To build the pipeline, we specify the list of primitives that we want to use, the additional initialization arguments that some of the primitives need, and information about how the primitives interact with each other (which inputs each one reads and how its outputs are named).

In [6]:
from mlblocks import MLPipeline

primitives = [
    "mlprimitives.custom.counters.UniqueCounter",
    "mlprimitives.custom.text.TextCleaner",
    "mlprimitives.custom.counters.VocabularyCounter",
    "keras.preprocessing.text.Tokenizer",
    "keras.preprocessing.sequence.pad_sequences",
    "keras.Sequential.LSTMTextClassifier"
]
init_params = {
    "mlprimitives.custom.counters.VocabularyCounter#1": {
        "add": 1
    },
    "mlprimitives.custom.text.TextCleaner#1": {
        "language": "en"
    },
    "keras.preprocessing.sequence.pad_sequences#1": {
        "maxlen": 100
    },
    "keras.Sequential.LSTMTextClassifier#1": {
        "input_length": 100
    }
}
input_names = {
    "mlprimitives.custom.counters.UniqueCounter#1": {
        "X": "y"
    }
}
output_names = {
    "mlprimitives.custom.counters.UniqueCounter#1": {
        "counts": "classes"
    },
    "mlprimitives.custom.counters.VocabularyCounter#1": {
        "counts": "vocabulary_size"
    }
}

pipeline = MLPipeline(primitives, init_params=init_params,
                      input_names=input_names, output_names=output_names)

Using TensorFlow backend.
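To make the data flow through the preprocessing steps more concrete, here is a rough pure-Python sketch of what they produce. It is not the actual implementation of the primitives: it assumes that `UniqueCounter` counts distinct values, that `VocabularyCounter` counts distinct words and adds the `add` offset, and that `pad_sequences` runs with its Keras defaults (padding and truncating at the start, fill value 0). The toy corpus, the regular-expression tokenization, the alphabetical word indexing and the `pad_pre` helper are all hypothetical.

```python
import re

# Hypothetical toy corpus and labels, standing in for X_train and y_train.
texts = ["the cat sat", "the dog sat down"]
labels = [0, 1]

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"\w+", text.lower())

# UniqueCounter-like step: reads "y" as its input and exposes the count as "classes".
classes = len(set(labels))

# VocabularyCounter-like step: distinct words plus the "add": 1 offset,
# exposed as "vocabulary_size" (assumed to leave room for the padding index 0).
vocabulary = {tok for text in texts for tok in tokenize(text)}
vocabulary_size = len(vocabulary) + 1

# Tokenizer-like step: map each word to a positive integer.
# (The real Keras Tokenizer orders words by frequency; this sketch sorts alphabetically.)
word_index = {word: i for i, word in enumerate(sorted(vocabulary), start=1)}
sequences = [[word_index[tok] for tok in tokenize(text)] for text in texts]

def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items and left-pad with `value` up to `maxlen`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# pad_sequences-like step: every sequence ends up with the same length.
padded = [pad_pre(seq, maxlen=5) for seq in sequences]

print(classes, vocabulary_size)
print(padded)
```

In the pipeline above, `maxlen` for `pad_sequences` and `input_length` for the classifier are both set to 100, so the padded sequences have the length the classifier is configured for.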


# Train the pipeline

To train the pipeline, we pass it the training `X` and `y` variables.

In [8]:
pipeline.fit(X=X_train, y=y_train)

# Make predictions

To make predictions using the fitted pipeline, we simply pass it the test `X` variable.

In [None]:
predictions = pipeline.predict(X=X_test)

In [None]:
predictions[0:5]

# Evaluate the performance

We can now use the `score` method of the dataset object to evaluate the performance
of the pipeline.

In [None]:
dataset.score(y_test, predictions)
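The dataset description lists `accuracy_score` as the metric, so the score is presumably the fraction of test articles whose predicted category matches the true one. The following is a minimal sketch of that computation, not the library's code; the `accuracy` helper and the `y_pred` values are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)

# True labels taken from the y_train sample shown earlier;
# the predictions are made up for illustration only.
y_true = [17, 6, 0, 14, 14]
y_pred = [17, 6, 3, 14, 0]

print(accuracy(y_true, y_pred))  # 3 of the 5 predictions match
```

With 20 categories, random guessing would score around 0.05, which gives a baseline for judging the value returned by the pipeline.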