## Adding New Primitives

First, import the `AutoMLClassifier` class:

In [1]:
from alpha_automl import AutoMLClassifier
import pandas as pd

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, using the sentiment dataset.

In [2]:
output_path = 'tmp/'
train_dataset = pd.read_csv('datasets/sentiment/train_data.csv')
test_dataset = pd.read_csv('datasets/sentiment/test_data.csv')

Remove the target column from the features of the training dataset:

In [3]:
target_column = 'sentiment'
X_train = train_dataset.drop(columns=[target_column])
X_train

Unnamed: 0,text,Time of Tweet,Age of User,Country
0,"I`d have responded, if I were going",morning,0-20,Afghanistan
1,Sooo SAD I will miss you here in San Diego!!!,noon,21-30,Albania
2,my boss is bullying me...,night,31-45,Algeria
3,what interview! leave me alone,morning,46-60,Andorra
4,"Sons of ****, why couldn`t they put them on t...",noon,60-70,Angola
...,...,...,...,...
27476,wish we could come see u on Denver husband l...,night,31-45,Ghana
27477,I`ve wondered about rake to. The client has ...,morning,46-60,Greece
27478,Yay good for both of you. Enjoy the break - y...,noon,60-70,Grenada
27479,But it was worth it ****.,night,70-100,Guatemala


Select the target column of the training dataset:

In [4]:
y_train = train_dataset[[target_column]]
y_train

Unnamed: 0,sentiment
0,neutral
1,negative
2,negative
3,negative
4,negative
...,...
27476,negative
27477,negative
27478,positive
27479,positive
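The `y = column_or_1d(y, warn=True)` line that appears in the `fit` output further down suggests that scikit-learn received the target as a one-column table rather than a 1-D vector. The sketch below, on a tiny hypothetical DataFrame (`toy` is not part of the notebook), shows the feature/target split and the difference between selecting the target with double brackets and with single brackets. Whether AlphaAutoML prefers one form over the other is an assumption not confirmed here.

```python
import pandas as pd

# Hypothetical two-row stand-in for the sentiment training data.
toy = pd.DataFrame({
    "text": ["my boss is bullying me...", "But it was worth it ****."],
    "sentiment": ["negative", "positive"],
})
target_column = "sentiment"

# Features: every column except the target.
X = toy.drop(columns=[target_column])

# Double brackets keep a DataFrame (2-D); single brackets give a Series (1-D).
y_frame = toy[[target_column]]
y_series = toy[target_column]

print(X.columns.tolist())   # ['text']
print(y_frame.shape)        # (2, 1)
print(y_series.shape)       # (2,)
```

Both forms hold the same labels; only the shape differs.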


### Adding New Primitives into AlphaAutoML's Search Space

In [5]:
automl = AutoMLClassifier(output_path, time_bound=10, verbose=True)

In [6]:
from alpha_automl.base_primitive import BasePrimitive
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('xlm-roberta-base')

class MyEmbedder(BasePrimitive):
    # On Windows or in a CUDA environment, define this class in an external module and import it.

    def fit(self, X, y=None):
        return self

    def transform(self, texts):
        text_list = texts.tolist()
        embeddings = embedder.encode(text_list)

        return np.array(embeddings)

my_embedder = MyEmbedder()
automl.add_primitives([(my_embedder, 'TEXT_ENCODER')])

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: xlm-roberta-base
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "GET /api/models/xlm-roberta-base HTTP/1.1" 200 3217
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/77de1f7a7e5e737aead1cd880979d4f1b3af6668/.gitattributes HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/77de1f7a7e5e737aead1cd880979d4f1b3af6668/README.md HTTP/1.1" 200 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443
DEBUG:urllib3.connectionpool:https://huggingface.co:443 "HEAD /xlm-roberta-base/resolve/77de1f7a7e5e737aead1cd880979d4f1b3af6668/config.json HT

Some weights of the model checkpoint at /Users/rlopez/.cache/torch/sentence_transformers/xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
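To make the `fit`/`transform` pattern used by `MyEmbedder` easier to inspect without downloading a model, here is a minimal sketch with a hypothetical stand-in class, `ToyEmbedder`. It does not inherit from `BasePrimitive` and uses two hand-made features instead of real embeddings. It only illustrates the shape that `transform` returns in the cell above: a 2-D NumPy array with one row per input text.

```python
import numpy as np


class ToyEmbedder:
    """Hypothetical stand-in for MyEmbedder; needs neither alpha_automl nor a model."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, so return self for chaining.
        return self

    def transform(self, texts):
        # Accept a list, a NumPy array, or a pandas Series of strings.
        text_list = [str(t) for t in np.asarray(texts, dtype=object).ravel()]
        # Two toy features per text: character count and word count.
        features = [[len(t), len(t.split())] for t in text_list]
        return np.array(features, dtype=float)


samples = ["my boss is bullying me...", "what interview! leave me alone"]
out = ToyEmbedder().fit(samples).transform(samples)
print(out.shape)  # (2, 2): one row per input text
```

Swapping the toy features for `embedder.encode(text_list)` gives back the behavior of the original cell.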


### Searching for Pipelines

In [7]:
automl.fit(X_train, y_train)

INFO:datamart_profiler.core:Setting column names from header
INFO:datamart_profiler.core:Identifying types, 4 columns...
INFO:datamart_profiler.core:Processing column 0 'text'...


  y = column_or_1d(y, warn=True)


INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Text]
INFO:datamart_profiler.core:Processing column 1 'Time of Tweet'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:datamart_profiler.core:Processing column 2 'Age of User'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:datamart_profiler.core:Processing column 3 'Country'...
INFO:datamart_profiler.core:Column type http://schema.org/Text [http://schema.org/Enumeration]
INFO:alpha_automl.data_profiler:Results of profiling data: non-numeric features = dict_keys(['TEXT_ENCODER', 'CATEGORICAL_ENCODER']), useless columns = [], missing values = True
INFO:alpha_automl.utils:Sampling down data from 27481 to 2000
INFO:alpha_automl.pipeline_synthesis.setup_search:Creating a manual grammar
INFO:alpha_automl.primitive_loader:Hierarchy of all primitives loaded
INFO:alpha_automl.grammar_loader:Creating task g

  board = Variable(board, volatile=True)


INFO:alpha_automl.pipeline_search.MCTS:Prediction 0.505688
INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 2
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: S
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: S -> IMPUTATION ENCODERS FEATURE_SCALING FEATURE_SELECTION CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: IMPUTATION|ENCODERS|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.MCTS:Prediction 0.50579166
INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 3
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: S
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: S -> IMPUTATION ENCODERS FEATURE_SCALING FEATURE_SELECTION CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: IMPUTATION|ENCODERS|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: EN

INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: FEATURE_SELECTION -> sklearn.feature_selection.GenericUnivariateSelect
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: TEXT_ENCODER -> sklearn.feature_extraction.text.CountVectorizer
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.feature_extraction.text.CountVectorizer|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: CATEGORICAL_ENCODER -> sklearn.preprocessing.OneHotEncoder
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.fe

INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 5
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: IMPUTATION|ENCODERS|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: ENCODERS -> TEXT_ENCODER CATEGORICAL_ENCODER
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: IMPUTATION|TEXT_ENCODER|CATEGORICAL_ENCODER|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: IMPUTATION -> sklearn.impute.SimpleImputer
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: FEATURE_SCALING -> sklearn.preprocessing.MaxAbsScaler
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_

INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: FEATURE_SELECTION -> sklearn.feature_selection.GenericUnivariateSelect
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: TEXT_ENCODER -> sklearn.feature_extraction.text.CountVectorizer
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.feature_extraction.text.CountVectorizer|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: CATEGORICAL_ENCODER -> sklearn.preprocessing.OneHotEncoder
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.fe

INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: CLASSIFICATION -> sklearn.tree.DecisionTreeClassifier
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.feature_extraction.text.CountVectorizer|sklearn.preprocessing.OneHotEncoder|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|sklearn.tree.DecisionTreeClassifier
INFO:alpha_automl.pipeline_synthesis.pipeline_builder:New pipelined created:
Pipeline(steps=[('sklearn.impute.SimpleImputer',
                 SimpleImputer(strategy='most_frequent')),
                ('sklearn.compose.ColumnTransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('sklearn.feature_extraction.text.CountVectorizer-text',
                                                  CountVectorizer(), 0),
                                                 ('sklearn.preprocessing.OneHotEncoder',
         

  board = Variable(board, volatile=True)


INFO:alpha_automl.automl_api:Found pipeline, time=0:00:04, scoring...




INFO:alpha_automl.pipeline_search.MCTS:Prediction 0.50545466
INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 4
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: IMPUTATION|TEXT_ENCODER|CATEGORICAL_ENCODER|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: IMPUTATION -> sklearn.impute.SimpleImputer
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|FEATURE_SCALING|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: FEATURE_SCALING -> sklearn.preprocessing.MaxAbsScaler
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: FEATURE_SELECTION -> sklearn.feature_select

INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: TEXT_ENCODER -> sklearn.feature_extraction.text.CountVectorizer
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.feature_extraction.text.CountVectorizer|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: CLASSIFICATION -> sklearn.naive_bayes.GaussianNB
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.feature_extraction.text.CountVectorizer|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|sklearn.naive_bayes.GaussianNB
INFO:alpha_automl.pipeline_search.MCTS:Prediction 0.50545925
INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 3
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.Simple

INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|sklearn.feature_extraction.text.CountVectorizer|sklearn.preprocessing.OneHotEncoder|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|sklearn.discriminant_analysis.LinearDiscriminantAnalysis
INFO:alpha_automl.pipeline_search.MCTS:MCTS SIMULATION 2
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|FEATURE_SELECTION|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:MOVE ACTION: FEATURE_SELECTION -> sklearn.feature_selection.GenericUnivariateSelect
INFO:alpha_automl.pipeline_search.pipeline.PipelineGame:PIPELINE: sklearn.impute.SimpleImputer|TEXT_ENCODER|CATEGORICAL_ENCODER|sklearn.preprocessing.MaxAbsScaler|sklearn.feature_selection.GenericUnivariateSelect|CLASSIFICATION
INFO:alpha_automl.pipeline_search.pipeline.PipelineLogic:

Batches:   0%|          | 0/47 [00:00<?, ?it/s]

INFO:alpha_automl.scorer:Score: 0.4252656090816475
INFO:alpha_automl.automl_manager:Pipeline scored successfully, score=0.4252656090816475
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.pipeline_synthesis.setup_search:Receiving signal, terminating process
INFO:alpha_automl.automl_manager:Found 1 pipelines
INFO:alpha_automl.automl_manager:Search done
INFO:alpha_automl.automl_api:Found 1 pipelines
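The `MOVE ACTION` lines in the log can be read as grammar productions that rewrite one symbol of the `PIPELINE` string at a time. The sketch below replays the productions that lead to the scored pipeline. It is a toy illustration, not AlphaAutoML's search code: `apply_move` is a hypothetical helper, and the order of the moves is chosen here for readability rather than taken from the MCTS simulations.

```python
# Productions copied from the MOVE ACTION lines in the log above.
PRODUCTIONS = [
    ("S", "IMPUTATION ENCODERS FEATURE_SCALING FEATURE_SELECTION CLASSIFICATION"),
    ("ENCODERS", "TEXT_ENCODER CATEGORICAL_ENCODER"),
    ("IMPUTATION", "sklearn.impute.SimpleImputer"),
    ("FEATURE_SCALING", "sklearn.preprocessing.MaxAbsScaler"),
    ("FEATURE_SELECTION", "sklearn.feature_selection.GenericUnivariateSelect"),
    ("TEXT_ENCODER", "sklearn.feature_extraction.text.CountVectorizer"),
    ("CATEGORICAL_ENCODER", "sklearn.preprocessing.OneHotEncoder"),
    ("CLASSIFICATION", "sklearn.tree.DecisionTreeClassifier"),
]


def apply_move(pipeline, symbol, expansion):
    """Replace one symbol in a '|'-separated pipeline string with its expansion."""
    steps = pipeline.split("|")
    index = steps.index(symbol)  # raises ValueError if the symbol is absent
    steps[index:index + 1] = expansion.split()
    return "|".join(steps)


pipeline = "S"
history = [pipeline]
for symbol, expansion in PRODUCTIONS:
    pipeline = apply_move(pipeline, symbol, expansion)
    history.append(pipeline)
    print(pipeline)
```

Once no uppercase placeholder such as `TEXT_ENCODER` remains, the string names a concrete pipeline. A primitive registered under `'TEXT_ENCODER'` with `add_primitives` presumably becomes one more possible expansion of that symbol.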


### Exploring Pipelines

After the pipeline search is complete, we can display the leaderboard:

In [8]:
automl.plot_leaderboard()


ranking,pipeline,accuracy_score
1,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, GenericUnivariateSelect, DecisionTreeClassifier",0.425


In [None]:
automl.plot_comparison_pipelines()