## Solving Semi-Supervised Classification Tasks

First, import the `AutoMLSemiSupervisedClassifier` class:

In [1]:
import pandas as pd

from alpha_automl import AutoMLSemiSupervisedClassifier

### Generating Pipelines for CSV Datasets

In this example, we generate pipelines for a CSV dataset, the [LL0_1053_jm1 dataset](https://datasets.datadrivendiscovery.org/d3m/datasets/-/tree/master/training_datasets/seed_datasets_archive/SEMI_1053_jm1).

In [2]:
train_dataset = pd.read_csv('datasets/LL0_1053_jm1/train_data.csv')
test_dataset = pd.read_csv('datasets/LL0_1053_jm1/test_data.csv')

Remove the target column from the training dataset to obtain the features:

In [3]:
target_column = 'defects'
X_train = train_dataset.drop(columns=[target_column])
X_train

Unnamed: 0,loc,v(g),ev(g),iv(g),n,v,l,d,i,e,...,t,lOCode,lOComment,lOBlank,locCodeAndComment,uniq_Op,uniq_Opnd,total_Op,total_Opnd,branchCount
0,1.1,1.4,1.4,1.4,1.3,1.30,1.30,1.30,1.30,1.30,...,1.30,2,2,2,2,1.2,1.2,1.2,1.2,1.4
1,1.0,1.0,1.0,1.0,1.0,1.00,1.00,1.00,1.00,1.00,...,1.00,1,1,1,1,1.0,1.0,1.0,1.0,1.0
2,72.0,7.0,1.0,6.0,198.0,1134.13,0.05,20.31,55.85,23029.10,...,1279.39,51,10,8,1,17.0,36.0,112.0,86.0,13.0
3,190.0,3.0,1.0,3.0,600.0,4348.76,0.06,17.06,254.87,74202.67,...,4122.37,129,29,28,2,17.0,135.0,329.0,271.0,5.0
4,37.0,4.0,1.0,4.0,126.0,599.12,0.06,17.19,34.86,10297.30,...,572.07,28,1,6,0,11.0,16.0,76.0,50.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7479,18.0,4.0,1.0,4.0,52.0,241.48,0.14,7.33,32.93,1770.86,...,98.38,13,0,2,0,10.0,15.0,30.0,22.0,7.0
7480,9.0,2.0,1.0,2.0,30.0,129.66,0.12,8.25,15.72,1069.68,...,59.43,5,0,2,0,12.0,8.0,19.0,11.0,3.0
7481,42.0,4.0,1.0,2.0,103.0,519.57,0.04,26.40,19.68,13716.72,...,762.04,29,1,10,0,18.0,15.0,59.0,44.0,7.0
7482,10.0,1.0,1.0,1.0,36.0,147.15,0.12,8.44,17.44,1241.57,...,68.98,6,0,2,0,9.0,8.0,21.0,15.0,1.0


Select the target column of the training dataset. Rows with an empty `defects` value are unlabeled:

In [4]:
y_train = train_dataset[[target_column]]
y_train

Unnamed: 0,defects
0,
1,
2,
3,True
4,
...,...
7479,
7480,False
7481,
7482,
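Before searching, it can help to check how many training rows actually carry a label. The following is a minimal sketch on a toy stand-in for `y_train`; it assumes that unlabeled rows are stored as missing values, as the blank cells above suggest. The names `y_toy` and `labeled_mask` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for y_train (assumption: missing values mark unlabeled rows).
y_toy = pd.DataFrame({'defects': [np.nan, np.nan, np.nan, True, np.nan, False]})

# True where a label is present, False where the row is unlabeled.
labeled_mask = y_toy['defects'].notna()

n_labeled = int(labeled_mask.sum())
n_unlabeled = int((~labeled_mask).sum())
labeled_fraction = n_labeled / len(y_toy)

print(f'labeled: {n_labeled}, unlabeled: {n_unlabeled}, fraction: {labeled_fraction:.2f}')
```

The same three lines applied to the real `y_train` would report the labeled fraction of the training set.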


### Searching Pipelines

In [5]:
automl = AutoMLSemiSupervisedClassifier(time_bound=10)
automl.fit(X_train, y_train)

INFO:alpha_automl.automl_api:Found pipeline, time=0:00:09, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.755211117049706
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:12, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7439871726349546
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:51, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7846071619454837
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:52, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7969000534473544
INFO:alpha_automl.automl_api:Found pipeline, time=0:00:53, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.7867450561197221
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:44, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.8011758417958311
INFO:alpha_automl.automl_api:Found pipeline, time=0:01:47, scoring...
INFO:alpha_automl.automl_api:Scored pipeline, score=0.797434526990914
INFO:alpha_auto

### Exploring Pipelines

After the pipeline search is complete, we can display the leaderboard:

In [6]:
automl.plot_leaderboard()

ranking,pipeline,accuracy_score
1,"SimpleImputer, MaxAbsScaler, SelfTrainingClassifier, LogisticRegression",0.808
2,"SimpleImputer, RobustScaler, SelfTrainingClassifier, LinearDiscriminantAnalysis",0.806
3,"SimpleImputer, StandardScaler, AutonBox, LinearDiscriminantAnalysis",0.802
4,"SimpleImputer, MaxAbsScaler, AutonBox, LinearDiscriminantAnalysis",0.801
5,"SimpleImputer, MaxAbsScaler, SelfTrainingClassifier, MultinomialNB",0.8
6,"SimpleImputer, MaxAbsScaler, SelfTrainingClassifier, SGDClassifier",0.8
7,"SimpleImputer, SelfTrainingClassifier, LinearDiscriminantAnalysis",0.799
8,"SimpleImputer, MaxAbsScaler, AutonBox, MultinomialNB",0.799
9,"SimpleImputer, MaxAbsScaler, AutonBox, LGBMClassifier",0.798
10,"SimpleImputer, MaxAbsScaler, SelfTrainingClassifier, LinearDiscriminantAnalysis",0.797
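To make the leaderboard entries more concrete, here is a rough scikit-learn sketch of what the top-ranked combination (SimpleImputer, MaxAbsScaler, SelfTrainingClassifier, LogisticRegression) might look like when built by hand. It is not the pipeline object that `alpha_automl` produces: the synthetic data, the hyperparameters, and the assumption that `SelfTrainingClassifier` wraps `LogisticRegression` are all illustrative. Note that scikit-learn marks unlabeled samples with `-1`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic binary classification data (stand-in for the real features).
X, y_true = make_classification(n_samples=300, n_features=8, random_state=0)

# Hide roughly 70% of the labels; scikit-learn expects -1 for unlabeled rows.
rng = np.random.default_rng(0)
y_partial = y_true.copy()
y_partial[rng.random(len(y_true)) < 0.7] = -1

# Impute, scale, then self-train a logistic regression on the partial labels.
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    MaxAbsScaler(),
    SelfTrainingClassifier(LogisticRegression(max_iter=1000)),
)
pipeline.fit(X, y_partial)

predictions = pipeline.predict(X)
print(predictions[:10])
```

Self-training fits the base classifier on the labeled rows, then repeatedly adds its most confident predictions on unlabeled rows as pseudo-labels and refits.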


To explore the generated pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that enables users to compare and explore the pipelines generated by the AlphaAutoML system:

In [None]:
automl.plot_comparison_pipelines()

### Testing Pipelines

Remove the target column from the test dataset to obtain the features:

In [8]:
X_test = test_dataset.drop(columns=[target_column])
X_test

Unnamed: 0,loc,v(g),ev(g),iv(g),n,v,l,d,i,e,...,t,lOCode,lOComment,lOBlank,locCodeAndComment,uniq_Op,uniq_Opnd,total_Op,total_Opnd,branchCount
0,49,6,6,3,171,927.89,0.04,25.33,36.63,23506.58,...,1305.92,34,0,13,0,19.0,24.0,107.0,64.0,11.0
1,187,35,26,16,526,3296.33,0.02,42.56,77.45,140300.03,...,7794.45,164,1,16,0,21.0,56.0,299.0,227.0,69.0
2,6,2,1,1,21,82.04,0.17,6.00,13.67,492.27,...,27.35,4,0,0,0,10.0,5.0,15.0,6.0,3.0
3,44,10,8,10,0,0.00,0.00,0.00,0.00,0.00,...,0.00,0,0,0,0,0.0,0.0,0.0,0.0,19.0
4,8,1,1,1,15,51.89,0.34,2.92,17.79,151.35,...,8.41,4,0,2,0,5.0,6.0,8.0,7.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1961,127,16,16,1,198,998.79,0.02,43.80,22.80,43747.00,...,2430.39,74,36,14,0,18.0,15.0,125.0,73.0,29.0
1962,9,1,1,1,16,55.35,0.11,9.00,6.15,498.16,...,27.68,4,0,3,0,9.0,2.0,12.0,4.0,1.0
1963,3,1,1,1,6,15.51,0.50,2.00,7.75,31.02,...,1.72,1,0,0,0,4.0,2.0,4.0,2.0,1.0
1964,27,4,1,3,99,447.83,0.05,21.27,21.05,9526.62,...,529.26,23,0,1,1,12.0,11.0,60.0,39.0,7.0


Select the target column of the test dataset:

In [9]:
y_test = test_dataset[[target_column]]
y_test

Unnamed: 0,defects
0,True
1,True
2,True
3,True
4,True
...,...
1961,False
1962,False
1963,False
1964,False


Pipeline predictions are accessed with:

In [10]:
y_pred = automl.predict(X_test)
y_pred

array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]], dtype=object)

The pipeline is scored on the test dataset with:

In [11]:
automl.score(X_test, y_test)

INFO:alpha_automl.automl_api:Metric: accuracy_score, Score: 0.7960325534079349


{'metric': 'accuracy_score', 'score': 0.7960325534079349}
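The reported `accuracy_score` is the fraction of test rows whose predicted label matches the true label. As a sanity check, it can be recomputed by hand. The sketch below uses small toy arrays shaped like the `y_pred` output above (a column vector with `dtype=object`); the array values are made up for illustration.

```python
import numpy as np

# Toy stand-ins shaped like the notebook's y_pred and y_test (column vectors).
y_pred_toy = np.array([[False], [False], [True], [False]], dtype=object)
y_test_toy = np.array([[True], [False], [True], [False]], dtype=object)

# Flatten to 1-D and cast to bool so the comparison is elementwise.
matches = y_pred_toy.ravel().astype(bool) == y_test_toy.ravel().astype(bool)

# Accuracy is the share of matching predictions.
accuracy = float(matches.mean())
print(accuracy)
```

With the real data, the equivalent would compare `y_pred.ravel()` against `y_test[target_column].to_numpy()`.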