## Adding New Primitives

First, import the `AutoMLClassifier` class:

In [1]:
from alpha_automl import AutoMLClassifier
import pandas as pd

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, using the sentiment dataset.

In [2]:
output_path = 'tmp/'
train_dataset = pd.read_csv('datasets/sentiment/train_data.csv')
test_dataset = pd.read_csv('datasets/sentiment/test_data.csv')

Removing the target column to obtain the features of the training dataset:

In [3]:
target_column = 'sentiment'
X_train = train_dataset.drop(columns=[target_column])
X_train

Unnamed: 0,text,Time of Tweet,Age of User,Country
0,"I`d have responded, if I were going",morning,0-20,Afghanistan
1,Sooo SAD I will miss you here in San Diego!!!,noon,21-30,Albania
2,my boss is bullying me...,night,31-45,Algeria
3,what interview! leave me alone,morning,46-60,Andorra
4,"Sons of ****, why couldn`t they put them on t...",noon,60-70,Angola
...,...,...,...,...
27476,wish we could come see u on Denver husband l...,night,31-45,Ghana
27477,I`ve wondered about rake to. The client has ...,morning,46-60,Greece
27478,Yay good for both of you. Enjoy the break - y...,noon,60-70,Grenada
27479,But it was worth it ****.,night,70-100,Guatemala


Selecting the target column of the training dataset:

In [4]:
y_train = train_dataset[[target_column]]
y_train

Unnamed: 0,sentiment
0,neutral
1,negative
2,negative
3,negative
4,negative
...,...
27476,negative
27477,negative
27478,positive
27479,positive


### Adding New Primitives into AlphaAutoML's Search Space

In [5]:
automl = AutoMLClassifier(output_path, time_bound=10, verbose=True)

In [6]:
from my_module import MyEmbedder

my_embedder = MyEmbedder()
automl.add_primitives([(my_embedder, 'TEXT_ENCODER')])

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: xlm-roberta-base
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/xlm-roberta-base HTTP/1.1" 200 2929
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/.gitattributes HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/README.md HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/config.json HT

Some weights of the model checkpoint at /Users/rlopez/.cache/torch/sentence_transformers/xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
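The source does not show the contents of `my_module`, so the following is only a minimal sketch of what a text-encoder primitive might look like, assuming AlphaAutoML accepts objects that follow the scikit-learn `fit`/`transform` convention. The class name `HashingTextEmbedder` and its `n_features` parameter are hypothetical. It hashes tokens into a fixed-size count vector instead of loading a SentenceTransformer model as the log output above suggests `MyEmbedder` does.

```python
import hashlib


class HashingTextEmbedder:
    """Hypothetical stand-in for MyEmbedder: one fixed-size count vector per text."""

    def __init__(self, n_features=16):
        self.n_features = n_features

    def fit(self, X, y=None):
        # Nothing is learned here; returning self allows chained calls.
        return self

    def _bucket(self, token):
        # A stable hash keeps the vectors identical across Python processes
        # (the built-in hash() is randomized per process for strings).
        digest = hashlib.sha256(token.encode('utf-8')).hexdigest()
        return int(digest, 16) % self.n_features

    def transform(self, X):
        vectors = []
        for text in X:
            vector = [0.0] * self.n_features
            for token in str(text).lower().split():
                vector[self._bucket(token)] += 1.0
            vectors.append(vector)
        return vectors

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Two texts taken from the training data shown earlier.
texts = ['my boss is bullying me...', 'what interview! leave me alone']
embeddings = HashingTextEmbedder(n_features=16).fit_transform(texts)
print(len(embeddings), len(embeddings[0]))
```

A real primitive would probably return a NumPy array rather than nested lists. The point of the sketch is only the shape of the interface: one fixed-length numeric vector per input text.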


### Searching for Pipelines

In [7]:
automl.fit(X_train, y_train)

INFO:datamart_profiler.core:Setting column names from header
INFO:datamart_profiler.core:Identifying types, 4 columns...
INFO:datamart_profiler.core:Processing column 0 'text'...


  y = column_or_1d(y, warn=True)


INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Text]
INFO:datamart_profiler.core:Processing column 1 'Time of Tweet'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:datamart_profiler.core:Processing column 2 'Age of User'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:datamart_profiler.core:Processing column 3 'Country'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:alpha_automl.data_profiler:Results of profiling data: non-numeric features = dict_keys(['TEXT_ENCODER', 'CATEGORICAL_ENCODER']), useless columns = [], missing values = True
INFO:alpha_automl.utils:Sampling down data from 27481 to 2000
INFO:alpha_automl.pipeline_synthesis.setup_search:Creating a manual grammar
INFO:alpha_automl.primitive_loader:Hierarchy of all primitives loaded
INFO:alpha_automl.grammar_loader:Creating task g

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:09, scoring...
INFO:alpha_automl.pipeline_synthesis.pipeline_builder:New pipelined created:
Pipeline(steps=[('sklearn.impute.SimpleImputer',
                 SimpleImputer(strategy='most_frequent')),
                ('sklearn.compose.ColumnTransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('sklearn.feature_extraction.text.CountVectorizer-text',
                                                  CountVectorizer(), 0),
                                                 ('sklearn.preprocessing.OneHotEncoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  [1, 2, 3])])),
                ('sklearn.preprocessing.MaxAbsScaler', MaxAbsScaler()),
                ('sklearn.linear_model.LogisticRegression',
                 LogisticRegression())])
INFO:alpha_automl.scorer:Score: 0.5

INFO:alpha_automl.automl_manager:Pipeline scored successfully, score=0.5919080192111774
INFO:alpha_automl.automl_api:Scored pipeline, score=0.5919080192111774
INFO:alpha_automl.automl_manager:Found new pipeline
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:10, scoring...
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/xlm-roberta-base HTTP/1.1" 200 2929
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/.gitattributes HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/README.md HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingfa

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


INFO:alpha_automl.scorer:Score: 0.621161402998108
INFO:alpha_automl.automl_manager:Pipeline scored successfully, score=0.621161402998108
INFO:alpha_automl.automl_api:Scored pipeline, score=0.621161402998108


  board = Variable(board, volatile=True)
  board = Variable(board, volatile=True)
  board = Variable(board, volatile=True)
Some weights of the model checkpoint at /Users/rlopez/.cache/torch/sentence_transformers/xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: xlm-roberta-base
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/xlm-roberta-base HTTP/1.1" 200 2929
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/.gitattributes HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/42f548f32366559214515ec137cdd16002968bf6/README.md HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-rober



Batches:  38%|███▊      | 18/47 [00:44<01:00,  2.08s/it]

INFO:alpha_automl.pipeline_synthesis.setup_search:Receiving signal, terminating process
INFO:alpha_automl.automl_manager:Found 3 pipelines
INFO:alpha_automl.automl_manager:Search done
INFO:alpha_automl.automl_api:Found 3 pipelines



After the pipeline search is complete, we can display the leaderboard:

In [9]:
automl.plot_leaderboard()

ranking,pipeline,accuracy
1,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, LogisticRegression",0.678795
2,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, PassiveAggressiveClassifier",0.621161
3,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, MultinomialNB",0.591908
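How `plot_leaderboard` builds its ranking is not shown here. The sketch below assumes that pipelines are simply ordered by descending accuracy. The scores are copied from the table above, the pipelines are abbreviated to their final estimator, and the helper `build_leaderboard` is a hypothetical name.

```python
def build_leaderboard(scores):
    """Return (rank, name, score) tuples, with the highest score ranked first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(rank, name, score) for rank, (name, score) in enumerate(ordered, start=1)]


# Scores from the leaderboard above, deliberately listed out of order.
scores = {
    'MultinomialNB': 0.591908,
    'LogisticRegression': 0.678795,
    'PassiveAggressiveClassifier': 0.621161,
}

leaderboard = build_leaderboard(scores)
for rank, name, score in leaderboard:
    print(rank, name, score)
```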


Removing the target column to obtain the features of the test dataset:

In [10]:
X_test = test_dataset.drop(columns=[target_column])
X_test

Unnamed: 0,text,Time of Tweet,Age of User,Country
0,Last session of the day http://twitpic.com/67ezh,morning,0-20,Afghanistan
1,Shanghai is also really exciting (precisely -...,noon,21-30,Albania
2,"Recession hit Veronique Branquinho, she has to...",night,31-45,Algeria
3,happy bday!,morning,46-60,Andorra
4,http://twitpic.com/4w75p - I like it!!,noon,60-70,Angola
...,...,...,...,...
3529,"its at 3 am, im very tired but i can`t sleep ...",noon,21-30,Nicaragua
3530,All alone in this old house again. Thanks for...,night,31-45,Niger
3531,I know what you mean. My little dog is sinkin...,morning,46-60,Nigeria
3532,_sutra what is your next youtube video gonna b...,noon,60-70,North Korea


Selecting the target column of the test dataset:

In [11]:
y_test = test_dataset[[target_column]]
y_test

Unnamed: 0,sentiment
0,neutral
1,positive
2,negative
3,positive
4,positive
...,...
3529,negative
3530,positive
3531,negative
3532,positive


Pipeline predictions are accessed with:

In [12]:
y_pred = automl.predict(X_test)
y_pred

array(['neutral', 'positive', 'negative', ..., 'negative', 'positive',
       'neutral'], dtype=object)

The pipeline can be evaluated against a held-out dataset with:

In [13]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: accuracy, Score: 0.6825127334465195


  y = column_or_1d(y, dtype=self.classes_.dtype, warn=True)


{'metric': 'accuracy', 'score': 0.6825127334465195}
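Accuracy is the fraction of rows whose predicted label matches the true label. The sketch below is not AlphaAutoML's scoring code, only an illustration of that definition on four made-up labels. The function name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if not y_true:
        raise ValueError('cannot compute accuracy on empty input')
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Illustrative labels only, not taken from the test dataset.
y_true_example = ['neutral', 'positive', 'negative', 'positive']
y_pred_example = ['neutral', 'positive', 'negative', 'negative']
print(accuracy(y_true_example, y_pred_example))  # 0.75
```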

### Visualizing Pipelines Using PipelineProfiler

To explore the produced pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that lets users compare and explore the pipelines generated by AlphaAutoML.

After the pipeline search process is completed, we can use PipelineProfiler with:

In [None]:
automl.plot_comparison_pipelines()

For more information about how to use PipelineProfiler, see [this article](https://towardsdatascience.com/exploring-auto-sklearn-models-with-pipelineprofiler-5b2c54136044). A [video demo](https://www.youtube.com/watch?v=2WSYoaxLLJ8) is also available.