## Working with non-D3M datasets

First, import the class `AutoML` from `d3m-interface`:

In [1]:
from d3m_interface import AutoML

### Generating pipelines for CSV datasets

This example generates pipelines for a CSV dataset. It uses the [credit dataset](https://gitlab.com/ViDA-NYU/d3m/d3m_interface/-/tree/master/examples/csv_dataset), which consists of loan records. The task is to predict whether a loan applicant will fully repay or default on a loan.
<div class="alert alert-info"><b>Important</b>
    
Note that the path to the dataset directories must be an absolute path and not a relative path. 
</div>
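
If your paths start out relative, they can be converted before being passed to `AutoML`. The following is a minimal sketch using only the Python standard library; the helper name `to_absolute` and the relative path are hypothetical and not part of `d3m_interface`.

```python
import os

def to_absolute(path_str):
    """Return an absolute, normalized version of a possibly relative path."""
    # expanduser() handles a leading '~'; abspath() anchors a relative
    # path at the current working directory and normalizes it.
    return os.path.abspath(os.path.expanduser(path_str))

# Hypothetical relative path, used only for illustration.
relative_example = 'examples/credit_train.csv'
absolute_example = to_absolute(relative_example)
print(absolute_example)
```

The result depends on the directory the notebook is started from, so it is worth printing the path once to confirm it points where you expect.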

In [2]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = '/Users/rlopez/D3M/examples/credit_train.csv'
test_dataset = '/Users/rlopez/D3M/examples/credit_test.csv'

automl = AutoML(output_path, 'AlphaD3M')
automl.plot_summary_dataset(train_dataset)

In [3]:
automl.search_pipelines(train_dataset, time_bound=5, target='Loan Status')

INFO: Reiceving a raw dataset, converting to D3M format
INFO: Initializing NYU TA2...
INFO: NYU TA2 initialized!
INFO: Found pipeline, id=3340a0ae-e9d0-46ab-9afd-4f027400b953, accuracy=0.793, time=0:00:26.749123
INFO: Found pipeline, id=b0e9a9ce-fa9a-40b1-ade8-9a467338606c, accuracy=0.79, time=0:00:44.954807
INFO: Found pipeline, id=69ed0a9c-d534-47c9-aa01-41f29b505f47, accuracy=0.827, time=0:01:06.084196
INFO: Found pipeline, id=cf42b613-27f1-4611-aa9e-101949280576, accuracy=0.823, time=0:01:30.380957
INFO: Found pipeline, id=1e146bab-d61c-44a9-8115-3e309d2407cc, accuracy=0.718, time=0:01:45.495946
INFO: Found pipeline, id=5a9f71d1-2f75-4e3d-8e66-c10321bff032, accuracy=0.824, time=0:02:06.562136
INFO: Found pipeline, id=25f620b6-18ac-4d06-a97c-2d75aaf82784, accuracy=0.775, time=0:02:27.712013
INFO: Found pipeline, id=8370b5c5-f312-4aad-9bfb-35e21a526f14, accuracy=0.817, time=0:02:48.943258
INFO: Found pipeline, id=6792fd8a-b439-453c-8f01-c33387b1bea4, accuracy=0.65, time=0:03:07.06041

After the pipeline search is complete, we can display the leaderboard:

In [4]:
automl.plot_leaderboard()

| | ranking | id | summary | accuracy |
|---|---|---|---|---|
| 0 | 1 | 69ed0a9c-d534-47c9-aa01-41f29b505f47 | add_semantic_types.common, imputer.sklearn, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, gradient_boosting.sklearn | 0.827 |
| 1 | 2 | 5a9f71d1-2f75-4e3d-8e66-c10321bff032 | add_semantic_types.common, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, imputer.sklearn, ada_boost.sklearn | 0.824 |
| 2 | 3 | 414a69b4-9276-4810-a8ed-a1058f74e69a | add_semantic_types.common, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, imputer.sklearn, ada_boost.sklearn | 0.824 |
| 3 | 4 | cf42b613-27f1-4611-aa9e-101949280576 | add_semantic_types.common, imputer.sklearn, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, linear_svc.sklearn | 0.823 |
| 4 | 5 | 8370b5c5-f312-4aad-9bfb-35e21a526f14 | add_semantic_types.common, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, imputer.sklearn, linear_svc.sklearn | 0.817 |
| 5 | 6 | 3340a0ae-e9d0-46ab-9afd-4f027400b953 | add_semantic_types.common, imputer.sklearn, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, random_forest.sklearn | 0.793 |
| 6 | 7 | b0e9a9ce-fa9a-40b1-ade8-9a467338606c | add_semantic_types.common, imputer.sklearn, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, extra_trees.sklearn | 0.79 |
| 7 | 8 | 0c2a317b-74ea-4143-b796-5b4949f07224 | add_semantic_types.common, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, imputer.sklearn, extra_trees.sklearn | 0.776 |
| 8 | 9 | 25f620b6-18ac-4d06-a97c-2d75aaf82784 | add_semantic_types.common, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, imputer.sklearn, bernoulli_naive_bayes.sklearn | 0.775 |
| 9 | 10 | 97a1b20e-6201-4ed6-89d2-ed74bb8344a6 | add_semantic_types.common, encoder.distiltextencoder, encoder.dsbox, robust_scaler.sklearn, imputer.sklearn, joint_mutual_information.autorpi, bernoulli_naive_bayes.sklearn | 0.775 |
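
The leaderboard above orders pipelines by accuracy, highest first. As a rough illustration of that ordering (not how `d3m_interface` builds the leaderboard internally), the sketch below parses a few of the `Found pipeline` log lines from the search output and ranks them by score. The helper `rank_pipelines` and the regular expression are assumptions written for this example.

```python
import re

# Three log lines copied from the search output shown earlier.
log_lines = [
    "INFO: Found pipeline, id=3340a0ae-e9d0-46ab-9afd-4f027400b953, accuracy=0.793, time=0:00:26.749123",
    "INFO: Found pipeline, id=69ed0a9c-d534-47c9-aa01-41f29b505f47, accuracy=0.827, time=0:01:06.084196",
    "INFO: Found pipeline, id=cf42b613-27f1-4611-aa9e-101949280576, accuracy=0.823, time=0:01:30.380957",
]

# Captures the pipeline id, the metric name, and the score from one line.
PATTERN = re.compile(r"id=(?P<id>[0-9a-f-]+), (?P<metric>\w+)=(?P<score>[0-9.]+)")

def rank_pipelines(lines):
    """Return (pipeline_id, score) pairs sorted from best to worst score."""
    found = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.append((match.group('id'), float(match.group('score'))))
    # For accuracy, higher is better, so sort in descending order.
    return sorted(found, key=lambda item: item[1], reverse=True)

ranked = rank_pipelines(log_lines)
for position, (pipeline_id, score) in enumerate(ranked, start=1):
    print(position, pipeline_id, score)
```

For an error metric such as root mean squared error, where lower is better, the sort direction would need to be reversed.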


A pipeline found during the search needs to be trained on the full training data. Train it by passing its id to `train`:

In [5]:
model_id = automl.train('69ed0a9c-d534-47c9-aa01-41f29b505f47')

INFO: Training model...
INFO: Training finished!


To get the trained model's predictions on the test dataset, call `test`:

In [6]:
predictions = automl.test(model_id, test_dataset)
predictions

INFO: Reiceving a raw dataset, converting to D3M format
INFO: Testing model...
INFO: Testing finished!


| | d3mIndex | Loan Status |
|---|---|---|
| 0 | 0 | Fully Paid |
| 1 | 1 | Fully Paid |
| 2 | 10 | Fully Paid |
| 3 | 100 | Charged Off |
| 4 | 1000 | Fully Paid |
| ... | ... | ... |
| 1753 | 995 | Fully Paid |
| 1754 | 996 | Fully Paid |
| 1755 | 997 | Fully Paid |
| 1756 | 998 | Fully Paid |
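
The `d3mIndex` values in the output above (0, 1, 10, 100, 1000, ..., 995, 996, 997, 998) appear to be ordered as text rather than as numbers. The sketch below uses a small made-up list of indices to show how sorting strings produces that pattern and how a numeric sort key restores the natural order. It does not touch the actual `predictions` object.

```python
# Made-up index values, chosen to mimic the pattern in the output above.
indices = [0, 1, 2, 10, 100, 995, 1000]

# Sorting the values as strings compares them character by character,
# so '10', '100' and '1000' all come before '2'.
as_text = sorted(str(i) for i in indices)
print(as_text)

# Sorting with key=int compares the numeric values instead.
as_numbers = sorted(as_text, key=int)
print(as_numbers)
```

If the predictions need to line up with another table, joining on `d3mIndex` is safer than relying on row position.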


The pipeline can be evaluated against a held-out dataset with `score`:

In [7]:
automl.score('69ed0a9c-d534-47c9-aa01-41f29b505f47', test_dataset)

INFO: Reiceving a raw dataset, converting to D3M format


('accuracy', 0.82651)
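
Accuracy is the fraction of rows whose predicted label matches the true label. The sketch below computes it by hand on four made-up labels. It is only meant to show what the number means and is not the scoring code used by `d3m_interface`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels for illustration; not taken from the credit dataset.
y_true = ['Fully Paid', 'Fully Paid', 'Charged Off', 'Fully Paid']
y_pred = ['Fully Paid', 'Fully Paid', 'Fully Paid', 'Fully Paid']

print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```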

### Generating pipelines for OpenML datasets

This example generates pipelines for an OpenML dataset. It uses the [German Credit dataset](https://www.openml.org/d/31), which classifies people described by a set of attributes as good or bad credit risks.
<div class="alert alert-info"><b>Important</b>
    
Note that the path to the dataset directories must be an absolute path and not a relative path. 
</div>

In [9]:
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = 'https://www.openml.org/d/31' # credit risks dataset
automl = AutoML(output_path, 'AlphaD3M')
automl.search_pipelines(train_dataset, time_bound=1, target='class', metric='f1Macro')

INFO: Reiceving a raw dataset, converting to D3M format
DEBUG: Data pickle file already exists and is up to date.
INFO: Initializing NYU TA2...
INFO: NYU TA2 initialized!
INFO: Found pipeline, id=e334e21c-022a-440c-957e-8c608b69fea9, f1_macro=0.64148, time=0:00:23.835888
INFO: Found pipeline, id=1986142d-11b5-4bf5-849c-abe6f3ba6693, f1_macro=0.63608, time=0:00:45.034725
INFO: Found pipeline, id=c687c324-04ca-4514-a151-fc8103078f62, f1_macro=0.71179, time=0:01:06.215613
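
The search above is scored with macro-averaged F1, which computes F1 separately for each class and then takes the unweighted mean, so a minority class counts as much as the majority class. The sketch below works this out by hand on six made-up labels. The function names are hypothetical and this is not the implementation used by the AutoML system.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)

# Made-up labels for illustration; not taken from the German Credit dataset.
y_true = ['good', 'good', 'good', 'bad', 'bad', 'good']
y_pred = ['good', 'good', 'bad', 'bad', 'good', 'good']

print(f1_macro(y_true, y_pred))  # mean of 0.75 ('good') and 0.5 ('bad')
```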


### Generating pipelines for Sklearn datasets

This example generates pipelines for a Sklearn dataset. It uses the [Diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html), which consists of 10 physiological variables and an indication of disease progression after one year.
<div class="alert alert-info"><b>Important</b>
    
Note that the path to the dataset directories must be an absolute path and not a relative path. 
</div>

In [9]:
from sklearn.datasets import load_diabetes
output_path = '/Users/rlopez/D3M/examples/tmp/'
train_dataset = load_diabetes
automl = AutoML(output_path, 'AlphaD3M')
automl.search_pipelines(train_dataset, time_bound=1, target='column 10', task_keywords=['regression'])

INFO: Reiceving a raw dataset, converting to D3M format
INFO: Initializing NYU TA2...
INFO: NYU TA2 initialized!
INFO: Found pipeline, id=0e5cc36a-b6fd-4922-bb80-8369f70bcd23, root_mean_squared_error=60.75112, time=0:00:14.181106
INFO: Found pipeline, id=5771e5cf-6442-4389-95e4-08691604afd6, root_mean_squared_error=60.36269, time=0:00:26.286293
INFO: Found pipeline, id=95f49c0c-1d82-4ba4-bd29-3ceb82725f13, root_mean_squared_error=56.78704, time=0:00:38.394248
INFO: Found pipeline, id=366032d8-fd2e-455c-9bde-eb21b6ccec34, root_mean_squared_error=62.1996, time=0:00:50.511736
INFO: Found pipeline, id=6b739b04-14d0-4c97-9529-11b2c19cbb3c, root_mean_squared_error=56.68418, time=0:01:02.608088
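
The regression pipelines above are reported with root mean squared error (RMSE), where lower values are better, so the pipeline scoring 56.68418 is the best of the five listed. The sketch below computes RMSE by hand on three made-up values to show what the metric measures. It is not the scoring code used by the AutoML system.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration; not taken from the diabetes dataset.
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 230.0]

# Errors are 10, 10 and 30, so the mean squared error is 1100 / 3.
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a single large miss (30 here) raises the score more than several small ones.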


### Visualizing pipelines using Pipeline Profiler

To explore the produced pipelines, we can use [PipelineProfiler](https://github.com/VIDA-NYU/PipelineVis), a visualization tool that lets users compare and explore the pipelines generated by AutoML systems.

After the pipeline search process is completed, we can use PipelineProfiler with:

In [10]:
automl.plot_comparison_pipelines()

INFO: Inputs for PipelineProfiler created!


After the analysis is complete, end the session to stop the Docker container and clean up temporary files:

In [15]:
automl.end_session()

INFO: Ending session...
INFO: Session ended!
