# Why EvalML is one of the best AutoML libraries you can get your hands on

## What is AutoML?

**According to Wikipedia:**

Automated machine learning (*AutoML*) is the process of automating the tasks of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model. AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. The high degree of automation in AutoML allows non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models. AutoML has been used to compare the relative importance of each factor in a prediction model.

# What is EvalML?
EvalML is an open-source automated machine learning library created by Alteryx's Innovation team. It is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

In short, EvalML provides a simple, low-code interface for creating machine learning models and using them to generate insights and make accurate predictions.

## What I liked the most about EvalML

- EvalML cuts down the manual work of model training and tuning, including data quality checks and cross-validation.

- Data checks and warnings: EvalML helps you identify problems in the data before you set it up for modelling.

- Pipeline building: EvalML helps you construct a highly optimised pipeline that includes state-of-the-art data preprocessing, feature engineering, feature selection and a wide range of modelling techniques.

- Model understanding: like SHAP, ELI5, LIME and other model explainability libraries, EvalML provides a broad understanding of the model you are building, which is useful for presentations.

- Domain-specific: this is the missing link in most AutoML libraries, namely the ability to define the objective of the problem. Once you have determined the objective for your business, you can provide it to EvalML to optimize by defining a custom objective function.
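To make the idea of a domain-specific objective concrete, here is a minimal plain-Python sketch. It does not use EvalML's API: the function name `misdiagnosis_cost` and the cost figures are hypothetical, chosen only to show how a business cost can rank two models differently than plain accuracy does.

```python
def misdiagnosis_cost(y_true, y_pred, fn_cost=10.0, fp_cost=1.0):
    """Total cost of errors; the costs are made-up illustration values."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:    # missed positive (false negative)
            cost += fn_cost
        elif truth == 0 and pred == 1:  # false alarm (false positive)
            cost += fp_cost
    return cost


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 0]  # one false negative
model_b = [1, 1, 1, 1, 1, 0, 0, 0]  # two false positives

acc_a, acc_b = accuracy(y_true, model_a), accuracy(y_true, model_b)
cost_a = misdiagnosis_cost(y_true, model_a)
cost_b = misdiagnosis_cost(y_true, model_b)
print(f"accuracy: A={acc_a:.3f} B={acc_b:.3f}")
print(f"cost:     A={cost_a:.1f} B={cost_b:.1f}")
```

Model A wins on accuracy, but model B is cheaper under this cost function. That difference is what a domain-specific objective is meant to capture.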

# Let's get started with EvalML

## How to install 

> **Note:** EvalML includes several optional dependencies. The xgboost and catboost packages support pipelines built around those modeling libraries. The plotly and ipywidgets packages support plotting functionality in AutoML searches. These dependencies are recommended and are included with EvalML by default, but they are not required in order to install and use EvalML.

In [19]:
!pip install evalml



## Loading a dataset

Loading a dataset for EvalML follows the usual process, and any library can be used for it (pandas, for example). Here I use the breast cancer dataset, loaded through EvalML's built-in demo loader in the cell below.
Dataset link: [Breast Cancer Dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)

In [20]:
import evalml

# Load the demo breast cancer dataset bundled with EvalML
X, y = evalml.demos.load_breast_cancer()
# Split the data into training and test sets using evalml's "split_data" function
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary')
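For intuition about what a train/test split for a classification problem does, here is a minimal standard-library sketch of a stratified split. It is not EvalML's implementation. The helper `stratified_split`, the 20% test size and the toy labels are all assumptions made for illustration, and whether `split_data` stratifies in your version is worth checking in its documentation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with class proportions kept roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60% benign, 40% malignant (made-up proportions)
labels = ["benign"] * 60 + ["malignant"] * 40
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx), test_labels.count("malignant"))
```

Because the split is done per class, the test set keeps the same 60/40 class balance as the full data.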

Note: EvalML uses DataTables as its standard data format, but you can read a regular .csv dataset and it gets converted using Woodwork (another Alteryx project).
EvalML also accepts and works well with pandas DataFrames. However, using a DataTable makes it easy to control how EvalML treats each feature: as a numeric feature, a categorical feature, a text feature or another type of feature. Woodwork's DataTable also includes type inference, such as detecting when a categorical feature should be treated as a text feature.

In [21]:
X_train.head()

| Data Column | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Physical Type | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | ... | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| Logical Type | Double | Double | Double | Double | Double | Double | Double | Double | Double | Double | ... | Double | Double | Double | Double | Double | Double | Double | Double | Double | Double |
| Semantic Tag(s) | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ... | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] | ['numeric'] |
| 381 | 11.04 | 14.93 | 70.67 | 372.7 | 0.07987 | 0.07079 | 0.03546 | 0.02074 | 0.2003 | 0.06246 | ... | 12.09 | 20.83 | 79.73 | 447.1 | 0.1095 | 0.1982 | 0.1553 | 0.06754 | 0.3202 | 0.07287 |
| 144 | 10.75 | 14.97 | 68.26 | 355.3 | 0.07793 | 0.05139 | 0.02251 | 0.007875 | 0.1399 | 0.05688 | ... | 11.95 | 20.72 | 77.79 | 441.2 | 0.1076 | 0.1223 | 0.09755 | 0.03413 | 0.23 | 0.06769 |
| 136 | 11.71 | 16.67 | 74.72 | 423.6 | 0.1051 | 0.06095 | 0.03592 | 0.026 | 0.1339 | 0.05945 | ... | 13.33 | 25.48 | 86.16 | 546.7 | 0.1271 | 0.1028 | 0.1046 | 0.06968 | 0.1712 | 0.07343 |
| 116 | 8.95 | 15.76 | 58.74 | 245.2 | 0.09462 | 0.1243 | 0.09263 | 0.02308 | 0.1305 | 0.07163 | ... | 9.414 | 17.07 | 63.34 | 270.0 | 0.1179 | 0.1879 | 0.1544 | 0.03846 | 0.1652 | 0.07722 |
| 567 | 20.6 | 29.33 | 140.1 | 1265.0 | 0.1178 | 0.277 | 0.3514 | 0.152 | 0.2397 | 0.07016 | ... | 25.74 | 39.42 | 184.6 | 1821.0 | 0.165 | 0.8681 | 0.9387 | 0.265 | 0.4087 | 0.124 |


## Automated pipeline search

You can use AutoMLSearch() to search for the best pipeline. EvalML uses Bayesian optimization to find the best pipeline for the defined objective.

In [22]:
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')

Using default limit of max_batches=1.

Generating pipelines to search over...


Calling search() on the AutoMLSearch object starts the search for the best pipeline. EvalML takes care of much of the data wrangling, so you can load the data and start searching as soon as the features and the outcome variable are defined. Let's find the best pipeline now.

In [23]:
automl.search()


*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary. 
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of 9 pipelines. 
Allowed model families: catboost, linear_model, decision_tree, random_forest, extra_trees, xgboost, lightgbm



[Figure: interactive plot of "Best Score" (widget output omitted)]

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 12.904

*****************************
* Evaluating Batch Number 1 *
*****************************

Elastic Net Classifier w/ Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.506
Decision Tree Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 2.432
	High coefficient of variation (cv >= 0.2) within cross validation scores.
	Decision Tree Classifier w/ Imputer may not perform as estimated on unseen data.
Random Forest Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.120
LightGBM Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.133
Logistic Regression Classifier w/ Imputer + Standa

So from the above output, the best pipeline is "*Logistic Regression Classifier w/ Imputer + Standard Scaler*", with a log loss of 0.094.
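Since the search ranks pipelines by log loss, it helps to see how that score behaves. The following is a minimal sketch of binary log loss in plain Python, not EvalML's own implementation. The function name, the clipping constant `eps` and the toy predictions are assumptions for illustration.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1]
confident_right = binary_log_loss(y_true, [0.9, 0.1, 0.8])
confident_wrong = binary_log_loss(y_true, [0.1, 0.9, 0.2])
print(f"{confident_right:.4f} vs {confident_wrong:.4f}")
```

Confident correct predictions give a loss near zero, while confident wrong ones are penalised heavily, which is why a lower score is better here.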

After the search, we can rank the pipelines by their scores. This takes a single attribute: `automl.rankings`

In [24]:
automl.rankings

| | id | pipeline_name | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 5 | Logistic Regression Classifier w/ Imputer + St... | 0.094015 | 0.033791 | 0.060529 | 99.271446 | True | {'Imputer': {'categorical_impute_strategy': 'm... |
| 1 | 6 | XGBoost Classifier w/ Imputer | 0.113098 | 0.038613 | 0.069048 | 99.123568 | True | {'Imputer': {'categorical_impute_strategy': 'm... |
| 2 | 3 | Random Forest Classifier w/ Imputer | 0.119972 | 0.019487 | 0.099614 | 99.070299 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 3 | 4 | LightGBM Classifier w/ Imputer | 0.132722 | 0.024842 | 0.110679 | 98.971496 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 4 | 7 | Extra Trees Classifier w/ Imputer | 0.136959 | 0.022862 | 0.111169 | 98.938661 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 5 | 8 | CatBoost Classifier w/ Imputer | 0.386387 | 0.011583 | 0.374338 | 97.005774 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 6 | 1 | Elastic Net Classifier w/ Imputer + Standard S... | 0.505862 | 0.008317 | 0.496767 | 96.079926 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 7 | 2 | Decision Tree Classifier w/ Imputer | 2.431916 | 0.531935 | 2.726782 | 81.15435 | True | {'Imputer': {'categorical_impute_strategy': 'm... |
| 8 | 0 | Mode Baseline Binary Classification Pipeline | 12.904388 | 0.082537 | 12.952041 | 0.0 | False | {'Baseline Classifier': {'strategy': 'mode'}} |


According to the table, Logistic Regression is the best pipeline, with the lowest mean_cv_score and validation_score (for log loss, lower is better). We can get a description of the pipeline with `automl.describe_pipeline(5)`, where 5 is the pipeline ID.
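Two columns of the rankings table can be reproduced by hand. The sketch below assumes that `percent_better_than_baseline` is the relative improvement over the baseline score and that `high_variance_cv` flags a coefficient of variation (standard deviation divided by mean) of at least 0.2, as the search log suggests. These formulas are inferred from the numbers shown, not taken from EvalML's source.

```python
def percent_better_than_baseline(score, baseline):
    """Relative improvement over the baseline for a lower-is-better objective."""
    return 100.0 * (baseline - score) / abs(baseline)


def coefficient_of_variation(std, mean):
    """Spread of the cross-validation scores relative to their mean."""
    return std / abs(mean)


baseline = 12.904388  # mean_cv_score of the Mode Baseline pipeline

# (mean_cv_score, standard_deviation_cv_score) copied from the rankings table
rows = {
    "Logistic Regression": (0.094015, 0.033791),
    "Random Forest": (0.119972, 0.019487),
    "Decision Tree": (2.431916, 0.531935),
}

summary = {
    name: (
        round(percent_better_than_baseline(mean, baseline), 2),
        coefficient_of_variation(std, mean) >= 0.2,
    )
    for name, (mean, std) in rows.items()
}
for name, (pct, high_variance) in summary.items():
    print(f"{name}: {pct}% better, high variance: {high_variance}")
```

Under these assumptions the results agree with the table: Logistic Regression scores best on average, but its fold-to-fold variation is large enough to be flagged.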



In [25]:
automl.describe_pipeline(5)


***************************************************************
* Logistic Regression Classifier w/ Imputer + Standard Scaler *
***************************************************************

Problem Type: binary
Model Family: Linear

Pipeline Steps
1. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : mean
	 * categorical_fill_value : None
	 * numeric_fill_value : None
2. Standard Scaler
3. Logistic Regression Classifier
	 * penalty : l2
	 * C : 1.0
	 * n_jobs : -1
	 * multi_class : auto
	 * solver : lbfgs

Training
Training for binary problems.
Total training time (including CV): 3.7 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary  Sensitivity at Low Alert Rates # Training # Validation
0                      0.061       0.958 0.997      0.966 0.974                     0.981            0.980                           0.412        303          152
1

We can easily check all the parameters of any pipeline using its pipeline ID.

In [26]:
pipeline = automl.get_pipeline(1)
print(pipeline.name)
print(pipeline.parameters)

Elastic Net Classifier w/ Imputer + Standard Scaler
{'Imputer': {'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'Elastic Net Classifier': {'alpha': 0.5, 'l1_ratio': 0.5, 'n_jobs': -1, 'max_iter': 1000, 'penalty': 'elasticnet', 'loss': 'log'}}


# Retrieve the best pipeline to build the model

In [27]:
best_pipeline = automl.best_pipeline

# Evaluate the pipeline's performance on the holdout data

In [28]:
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])

OrderedDict([('AUC', 0.9933862433862434),
             ('F1', 0.963855421686747),
             ('Precision', 0.975609756097561),
             ('Recall', 0.9523809523809523)])
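To see how precision, recall and F1 relate, here is a small sketch that computes them from confusion-matrix counts. The counts (40 true positives, 1 false positive, 2 false negatives) are an assumption: they are one set of counts that reproduces the scores above, not values reported by EvalML.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts, chosen because they match the reported scores
precision, recall, f1 = precision_recall_f1(tp=40, fp=1, fn=2)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```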

## From here you can change the objective that EvalML optimizes for

In [29]:
automl_auc = AutoMLSearch(X_train=X_train, y_train=y_train,
                          problem_type='binary',
                          objective='auc',
                          additional_objectives=['f1', 'precision'],
                          max_batches=1,
                          optimize_thresholds=True)

automl_auc.search()

Generating pipelines to search over...

*****************************
* Beginning pipeline search *
*****************************

Optimizing for AUC. 
Greater score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 1 batches for a total of 9 pipelines. 
Allowed model families: catboost, linear_model, decision_tree, random_forest, extra_trees, xgboost, lightgbm



[Figure: interactive plot of "Best Score" (widget output omitted)]

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean AUC: 0.500

*****************************
* Evaluating Batch Number 1 *
*****************************

Elastic Net Classifier w/ Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean AUC: 0.985
Decision Tree Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean AUC: 0.923
Random Forest Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean AUC: 0.992
LightGBM Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean AUC: 0.991
Logistic Regression Classifier w/ Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean AUC: 0.991
XGBoost Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean AUC: 0.991
Extra Trees Classifier w/ Impute

The `objective` argument is the objective to optimize for. It is used to propose and rank pipelines, but not for optimizing each pipeline during fit-time.
When set to 'auto', it chooses:
- LogLossBinary for binary classification problems,
- LogLossMulticlass for multiclass classification problems, and
- R^2 for regression problems.
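For the AUC objective used in this second search, a greater score is better. As a rough illustration, AUC can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score. This is a minimal pairwise sketch, not EvalML's implementation, and the toy labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
auc_example = roc_auc(y_true, [0.1, 0.4, 0.35, 0.8])
auc_constant = roc_auc(y_true, [0.5, 0.5, 0.5, 0.5])
print(auc_example, auc_constant)
```

A model that gives every row the same score gets an AUC of 0.5, which is the score the mode baseline received in the output above.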

# Save the model by pickling it

In [30]:
best_pipeline.save("model.pkl")

# Load the saved model and test it against the test data

In [31]:
check_model=automl.load('model.pkl')
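The save and load steps amount to a pickle round trip. The sketch below shows the same idea with the standard-library `pickle` module and a plain dictionary standing in for a fitted pipeline. The dictionary contents are hypothetical, and EvalML's own serialization may differ in detail.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted state of a model
model_state = {"threshold": 0.5, "classes": ["benign", "malignant"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save: serialize the object to disk
    with open(path, "wb") as f:
        pickle.dump(model_state, f)

    # Load: rebuild an equal object from the file
    with open(path, "rb") as f:
        restored_state = pickle.load(f)

print(restored_state == model_state)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.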

In [32]:
check_model.predict_proba(X_test).to_dataframe()

| | benign | malignant |
| --- | --- | --- |
| 0 | 9.996252e-01 | 0.000375 |
| 1 | 9.845724e-01 | 0.015428 |
| 2 | 7.749595e-01 | 0.225040 |
| 3 | 9.907312e-01 | 0.009269 |
| 4 | 9.998272e-01 | 0.000173 |
| ... | ... | ... |
| 109 | 9.990961e-01 | 0.000904 |
| 110 | 7.981366e-01 | 0.201863 |
| 111 | 9.999924e-01 | 0.000008 |
| 112 | 1.082727e-08 | 1.000000 |
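Class probabilities like these become class labels once a decision threshold is applied. The sketch below uses three rows copied from the output above. The helper `to_label` and the two thresholds (0.5 and 0.2) are assumptions for illustration, not what EvalML applies internally.

```python
# Rows 0, 2 and 112 from the predict_proba output above
probabilities = [
    {"benign": 9.996252e-01, "malignant": 0.000375},
    {"benign": 7.749595e-01, "malignant": 0.225040},
    {"benign": 1.082727e-08, "malignant": 1.000000},
]


def to_label(row, threshold=0.5):
    """Label a row malignant when its malignant probability reaches the threshold."""
    return "malignant" if row["malignant"] >= threshold else "benign"


default_labels = [to_label(row) for row in probabilities]
cautious_labels = [to_label(row, threshold=0.2) for row in probabilities]
print(default_labels)
print(cautious_labels)
```

Lowering the threshold flags the borderline second row as malignant. That trades more false alarms for fewer missed cases, a choice that depends on the cost of each kind of error.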
