## Adding Fasttext Primitives

First, import the `AutoMLClassifier` class:

In [1]:
from alpha_automl import AutoMLClassifier
import pandas as pd

### Generating Pipelines for CSV Datasets

This example generates pipelines for a CSV dataset, using the sentiment dataset.

In [2]:
output_path = 'tmp/'
train_dataset = pd.read_csv('datasets/sentiment/train_data.csv')
test_dataset = pd.read_csv('datasets/sentiment/test_data.csv')

Removing the target column from the features of the training dataset:

In [3]:
target_column = 'sentiment'
X_train = train_dataset.drop(columns=[target_column])
X_train

Unnamed: 0,text,Time of Tweet,Age of User,Country
0,"I`d have responded, if I were going",morning,0-20,Afghanistan
1,Sooo SAD I will miss you here in San Diego!!!,noon,21-30,Albania
2,my boss is bullying me...,night,31-45,Algeria
3,what interview! leave me alone,morning,46-60,Andorra
4,"Sons of ****, why couldn`t they put them on t...",noon,60-70,Angola
...,...,...,...,...
27476,wish we could come see u on Denver husband l...,night,31-45,Ghana
27477,I`ve wondered about rake to. The client has ...,morning,46-60,Greece
27478,Yay good for both of you. Enjoy the break - y...,noon,60-70,Grenada
27479,But it was worth it ****.,night,70-100,Guatemala


Selecting the target column of the training dataset:

In [4]:
y_train = train_dataset[[target_column]]
y_train

Unnamed: 0,sentiment
0,neutral
1,negative
2,negative
3,negative
4,negative
...,...
27476,negative
27477,negative
27478,positive
27479,positive
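The two selections above return different kinds of objects, which is easy to miss. The sketch below uses a made-up two-row frame (not the sentiment data) to show that `drop(columns=...)` returns a new frame and leaves the original untouched, and that double brackets yield a one-column `DataFrame` while single brackets yield a `Series`. The names `toy`, `features`, `target_frame`, and `target_series` are illustrative only.

```python
import pandas as pd

# Made-up data for illustration; not taken from the sentiment dataset.
toy = pd.DataFrame({
    "text": ["great day", "awful service"],
    "sentiment": ["positive", "negative"],
})
target_column = "sentiment"

# drop() returns a new DataFrame; `toy` still contains the target column.
features = toy.drop(columns=[target_column])

# Double brackets keep the result two-dimensional (a DataFrame).
target_frame = toy[[target_column]]

# Single brackets return a one-dimensional Series.
target_series = toy[target_column]

print(type(target_frame).__name__, target_frame.shape)
print(type(target_series).__name__, target_series.shape)
```

The notebook passes the `DataFrame` form to `fit`, so the toy example keeps that form as `target_frame`.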


### Adding New Primitives into AlphaAutoML's Search Space

In [5]:
automl = AutoMLClassifier(output_path, time_bound=10)

In [6]:
# Download the pre-trained English fastText model if it is not already present
import os
import fasttext
import fasttext.util


fasttext.util.download_model('en', if_exists='ignore')  # English
# Adjust this path if the model was downloaded to a different location
fasttext_model_path = os.path.join(os.getcwd(), 'cc.en.300.bin')
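Because the model path depends on where the download landed, it can help to fail early with a clear message if the file is not there. The following is a minimal sketch using only the standard library; `resolve_model_path` is a hypothetical helper, not part of fastText or AlphaAutoML, and the demonstration uses an empty placeholder file rather than the real model.

```python
import tempfile
from pathlib import Path


def resolve_model_path(directory, filename="cc.en.300.bin"):
    """Return the absolute path of the model file, or raise if it is missing."""
    path = Path(directory).resolve() / filename
    if not path.is_file():
        raise FileNotFoundError(f"fastText model not found at {path}")
    return str(path)


# Demonstration with an empty placeholder file instead of the real model.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "cc.en.300.bin").touch()
    found = resolve_model_path(tmp_dir)

    try:
        resolve_model_path(tmp_dir, filename="missing.bin")
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

print(found)
print(missing_detected)
```

In the notebook, the equivalent call would be `resolve_model_path(os.getcwd())`, assuming the model was downloaded to the current working directory.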



In [7]:
# Wrap the fastText model as a primitive and add it to the AutoML search space
from alpha_automl.wrapper_primitives.fasttext import FastTextEmbedder
fasttext_embedder = FastTextEmbedder(fasttext_model_path)
automl.add_primitives([(fasttext_embedder, 'TEXT_ENCODER')])
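To give a feel for what a text-encoder primitive does, here is a toy stand-in that turns each string into a fixed-length numeric vector. It is not the library's `FastTextEmbedder`, and it is an assumption (not confirmed by this notebook) that a primitive follows the scikit-learn `fit`/`transform` convention; `HashingTextEmbedder` and its `n_features` parameter are hypothetical names chosen for illustration.

```python
import hashlib


class HashingTextEmbedder:
    """Toy text encoder: maps each string to a fixed-length token-count vector."""

    def __init__(self, n_features=8):
        self.n_features = n_features

    def fit(self, X, y=None):
        # Nothing is learned; the method exists to match the fit/transform convention.
        return self

    def _bucket(self, token):
        # Use a stable hash so results do not change between Python sessions.
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.n_features

    def transform(self, X):
        rows = []
        for text in X:
            vector = [0] * self.n_features
            for token in str(text).lower().split():
                vector[self._bucket(token)] += 1
            rows.append(vector)
        return rows


texts = ["happy bday!", "my boss is bullying me..."]
embedder = HashingTextEmbedder(n_features=8)
vectors = embedder.fit(texts).transform(texts)
print(vectors)
```

A real embedder such as fastText produces dense vectors that capture word meaning, whereas this sketch only counts hashed tokens; the shared idea is that every text ends up as a vector of the same length.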

### Searching Pipelines

In [8]:
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:03, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:04, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:05, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.31596565274341437




INFO:alpha_automl.automl_api:Found pipeline, time=0:00:32, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.36646776306214524
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:16, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4217726677339543
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:24, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4045990394411294
INFO:alpha_automl.automl_api:Found pipeline, time=0:02:06, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6277106680250327
INFO:alpha_automl.automl_api:Found pipeline, time=0:02:20, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.43559889390190654
INFO:alpha_automl.automl_api:Found pipeline, time=0:03:24, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.31596565274341437
INFO:alpha_automl.automl_api:Found pipeline, time=0:03:24, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6186872362101586
INFO:alpha_automl.automl_api:Found pipeline, time=0:03:27, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4102750691311308
INFO:alpha_automl.automl_api:Found pipeline, time=0:04:10, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.5280163003929559
INFO:alpha_automl.automl_api:Found pipeline, time=0:04:10, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:04:10, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.36675884150778637
INFO:alpha_automl.automl_api:Found pipeline, time=0:04:56, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:04:58, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6787949352350459
INFO:alpha_automl.automl_api:Found pipeline, time=0:05:00, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:05:10, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4111483044680541
INFO:alpha_automl.automl_api:Found pipeline, time=0:05:58, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6144665987483627
INFO:alpha_automl.automl_api:Found pipeline, time=0:05:59, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:06:03, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4142046281472857
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:02, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:03, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:04, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.605006549265027
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:18, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.3169844273031582
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:19, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6453209139863193
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:20, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:21, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4037258041042061
INFO:alpha_automl.automl_api:Found pipeline, time=0:07:21, scoring...
INFO:alpha_aut



INFO:alpha_automl.automl_api:Scored pipeline, score=0.5613447824188619
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:13, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4222092854024159
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:13, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6917479260660748
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:14, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.5268519866103915
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:15, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.42337359918498035
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:17, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.39790423519138407
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:18, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.6409547373017028
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:38, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:53, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:54, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4252656090816475
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:55, scoring...




INFO:alpha_automl.automl_api:Scored pipeline, score=0.6718090525396594
INFO:alpha_automl.automl_api:Found 40 pipelines
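The search log can be summarized programmatically. The sketch below is an illustration only: it parses a few `Scored pipeline` lines copied from the output above with a regular expression and reports the best score. `extract_scores` is a hypothetical helper, and the approach assumes the log format stays exactly as shown here.

```python
import re

# Matches the score value in lines such as "...Scored pipeline, score=0.69".
SCORE_PATTERN = re.compile(r"Scored pipeline, score=([0-9.]+)")


def extract_scores(log_text):
    """Return every pipeline score found in the log text, in order."""
    return [float(value) for value in SCORE_PATTERN.findall(log_text)]


# A short excerpt copied from the search output above.
sample_log = """\
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:13, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.4222092854024159
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:13, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.6917479260660748
INFO:alpha_automl.automl_api:Found pipeline, time=0:08:14, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.5268519866103915
"""

scores = extract_scores(sample_log)
best_score = max(scores)
print(len(scores), round(best_score, 3))
```

For this excerpt the best score rounds to 0.692, which corresponds to the top entry of the leaderboard.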


### Exploring Pipelines

After the pipeline search is complete, we can display the leaderboard:

In [9]:
automl.plot_leaderboard()



ranking,pipeline,accuracy_score
1,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, SelectPercentile, LogisticRegression",0.692
2,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, LogisticRegression",0.679
3,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, BaggingClassifier",0.672
4,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, SelectPercentile, PassiveAggressiveClassifier",0.645
5,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, GradientBoostingClassifier",0.641
6,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, DecisionTreeClassifier",0.628
7,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, SelectPercentile, DecisionTreeClassifier",0.619
8,"SimpleImputer, ColumnTransformer, CountVectorizer, OneHotEncoder, MaxAbsScaler, PassiveAggressiveClassifier",0.614
9,"SimpleImputer, ColumnTransformer, TfidfVectorizer, OneHotEncoder, MaxAbsScaler, DecisionTreeClassifier",0.605
10,"SimpleImputer, ColumnTransformer, FastTextEmbedder, OneHotEncoder, MaxAbsScaler, PassiveAggressiveClassifier",0.561


In [None]:
automl.plot_comparison_pipelines()

### Testing Pipelines

Removing the target column from the features of the test dataset:

In [10]:
X_test = test_dataset.drop(columns=[target_column])
X_test

Unnamed: 0,text,Time of Tweet,Age of User,Country
0,Last session of the day http://twitpic.com/67ezh,morning,0-20,Afghanistan
1,Shanghai is also really exciting (precisely -...,noon,21-30,Albania
2,"Recession hit Veronique Branquinho, she has to...",night,31-45,Algeria
3,happy bday!,morning,46-60,Andorra
4,http://twitpic.com/4w75p - I like it!!,noon,60-70,Angola
...,...,...,...,...
3529,"its at 3 am, im very tired but i can`t sleep ...",noon,21-30,Nicaragua
3530,All alone in this old house again. Thanks for...,night,31-45,Niger
3531,I know what you mean. My little dog is sinkin...,morning,46-60,Nigeria
3532,_sutra what is your next youtube video gonna b...,noon,60-70,North Korea


Selecting the target column of the test dataset:

In [11]:
y_test = test_dataset[[target_column]]
y_test

Unnamed: 0,sentiment
0,neutral
1,positive
2,negative
3,positive
4,positive
...,...
3529,negative
3530,positive
3531,negative
3532,positive


Pipeline predictions are accessed with:

In [12]:
y_pred = automl.predict(X_test)
y_pred

array(['neutral', 'positive', 'negative', ..., 'neutral', 'positive',
       'positive'], dtype=object)

The pipeline can be evaluated against a held-out dataset by calling `score`:

In [13]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: accuracy_score, Score: 0.7026032823995473


{'metric': 'accuracy_score', 'score': 0.7026032823995473}
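The reported metric, `accuracy_score`, is the fraction of predictions that match the true labels. As a rough illustration of that definition (not the library's implementation), the sketch below computes it by hand on five made-up labels; the `accuracy` function and the toy lists are assumptions introduced for this example.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels for illustration; four of the five predictions are correct.
y_true_toy = ["neutral", "positive", "negative", "positive", "positive"]
y_pred_toy = ["neutral", "positive", "negative", "neutral", "positive"]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)
```

Read this way, the score of about 0.703 above means that roughly 70% of the test tweets were assigned the correct sentiment.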