# Why EvalML is one of the best AutoML libraries you can get your hands on

## What is AutoML?

**According to Wikipedia:**

Automated machine learning (*AutoML*) is the process of automating the tasks of applying machine learning to real-world problems. AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model. AutoML was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. The high degree of automation in AutoML allows non-experts to make use of machine learning models and techniques without requiring them to become experts in machine learning. Automating the process of applying machine learning end-to-end additionally offers the advantages of producing simpler solutions, faster creation of those solutions, and models that often outperform hand-designed models. AutoML has been used to compare the relative importance of each factor in a prediction model.

# What is EvalML?
EvalML is an open-source automated machine learning library created by Alteryx's Innovation team. It builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

In short, EvalML provides a simple, low-code interface for creating machine learning models and using them to generate insights and make accurate predictions.

## What I liked the most about EvalML

- EvalML cuts down the work of training and tuning models by hand; this includes data quality checks and cross-validation.

- Data checks and warnings: EvalML helps you identify problems in the data before you set it up for modelling.

- Pipeline building: EvalML helps you construct a highly optimised pipeline that includes state-of-the-art data preprocessing, feature engineering, feature selection and a variety of modelling techniques.

- Model understanding: Like SHAP, ELI5, LIME and other model explainability libraries, EvalML provides a broad understanding of the model you are building, which is useful for presentations.

- Domain-specific: This is the missing link in most AutoML libraries: the ability to define the objective of the problem. Once you have determined the objective for your business, you can pass it to EvalML to optimize by defining a custom objective function.
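To make the idea of a domain-specific objective concrete, here is a minimal sketch in plain Python. It does not use EvalML's objective API; the function name `business_cost` and the cost figures are hypothetical and only illustrate scoring a model by business cost instead of a generic metric.

```python
def business_cost(y_true, y_pred, false_negative_cost=500.0, false_positive_cost=50.0):
    """Total cost of prediction errors (hypothetical costs); lower is better."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    cost = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 0:
            cost += false_negative_cost  # missed a positive case
        elif actual == 0 and predicted == 1:
            cost += false_positive_cost  # raised a false alarm
    return cost


y_true = [1, 0, 1, 0, 1]
model_a = [1, 0, 0, 0, 1]  # one false negative, higher accuracy
model_b = [1, 1, 1, 1, 1]  # two false positives, lower accuracy

print(business_cost(y_true, model_a))
print(business_cost(y_true, model_b))
```

In this toy example the model with the lower accuracy is the cheaper one, which is exactly the kind of trade-off a generic metric hides.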

# Let's get started with EvalML

## How to install 

> **Note:** EvalML includes several optional dependencies. The xgboost and catboost packages support pipelines built around those modeling libraries. The plotly and ipywidgets packages support plotting functionality in AutoML searches. These dependencies are recommended and are included with EvalML by default, but they are not required in order to install and use EvalML.

In [None]:
!pip install evalml

## Loading a dataset

Loading a dataset for EvalML follows the usual process, and you can use any library for it, such as pandas. Here I use the breast cancer dataset, loaded with EvalML's built-in demo loader.
Dataset Link: [Breast Cancer Dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data)

In [None]:
import evalml
X, y = evalml.demos.load_breast_cancer()
# Split the data into training and test sets using split_data from evalml's preprocessing module
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary')

Note: EvalML uses data tables as its standard data format, but you can read a regular .csv dataset and it gets converted using Woodwork (another Alteryx project).
EvalML also accepts and works well with pandas DataFrames, but using a DataTable makes it easy to control how EvalML treats each feature: as a numeric feature, a categorical feature, a text feature or another type. Woodwork's DataTable includes features such as inferring when a categorical feature should be treated as a text feature.
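The train/test split performed by `split_data` above can be pictured with a small standard-library sketch. It assumes a stratified split, a common choice for classification problems that keeps class proportions the same in both sets; the helper `stratified_split` is hypothetical and is not EvalML's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 60 benign and 40 malignant samples.
labels = ["benign"] * 60 + ["malignant"] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.2)
test_labels = [labels[i] for i in test_idx]

print(len(train_idx), len(test_idx))
print(test_labels.count("benign"), test_labels.count("malignant"))
```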

In [None]:
X_train.head()

## Automated pipeline search

You can use AutoMLSearch() to search for the best pipeline. EvalML uses Bayesian optimization to find the best pipeline for the defined objective.

In [None]:
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')

Calling search() on the AutoMLSearch object starts the search for the best pipeline. EvalML takes care of much of the data wrangling: you can load the data and start searching for the best pipeline as soon as you have defined the features and the outcome variable. Let's find the best pipeline now.

In [None]:
automl.search()

From the output above, the best pipeline is "*Logistic Regression Classifier w/ Imputer + Standard Scaler*", with a log loss of 0.094.
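To get a feel for what a log loss value means, here is a minimal sketch of the standard binary log loss formula in plain Python. The function name and the toy probabilities are illustrative and are not taken from the search output.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels; lower is better."""
    total = 0.0
    for actual, prob in zip(y_true, y_prob):
        p = min(max(prob, eps), 1 - eps)  # clip to avoid log(0)
        total += -(actual * math.log(p) + (1 - actual) * math.log(1 - p))
    return total / len(y_true)


# Confident, correct probabilities give a low loss; hesitant ones give a higher loss.
confident = binary_log_loss([1, 0, 1], [0.9, 0.1, 0.9])
unsure = binary_log_loss([1, 0, 1], [0.6, 0.4, 0.6])
print(round(confident, 3), round(unsure, 3))
```

A loss near 0.1 therefore corresponds roughly to a model that assigns about 90% probability to the correct class on average.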

After the search, we can rank the pipelines by score using `automl.rankings`:

In [None]:
automl.rankings

According to the table, Logistic Regression is the best pipeline on both mean_cv_score and validation_score (for log loss, lower is better). We can now get a description of the pipeline with `automl.describe_pipeline(5)`, where 5 is the pipeline ID.
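As a rough illustration of how a rankings table like this can be built, the sketch below averages per-fold scores and sorts the candidates. The pipeline names and fold scores are made up for the example, and `rank_pipelines` is a hypothetical helper, not part of EvalML.

```python
from statistics import mean

# Hypothetical per-fold log loss values for three candidate pipelines.
fold_scores = {
    "Logistic Regression": [0.09, 0.10, 0.08],
    "Random Forest": [0.12, 0.15, 0.12],
    "Baseline": [0.66, 0.66, 0.66],
}


def rank_pipelines(scores_by_pipeline, greater_is_better=False):
    """Sort pipelines by mean cross-validation score, best first."""
    averaged = [(name, mean(scores)) for name, scores in scores_by_pipeline.items()]
    return sorted(averaged, key=lambda item: item[1], reverse=greater_is_better)


for name, score in rank_pipelines(fold_scores):
    print(f"{name}: {score:.3f}")
```

For an objective such as AUC, where higher is better, the same helper would be called with `greater_is_better=True`.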



In [None]:
automl.describe_pipeline(5)

We can easily check the parameters of any pipeline using its pipeline ID.

In [None]:
pipeline = automl.get_pipeline(1)
print(pipeline.name)
print(pipeline.parameters)

# Get the best pipeline to build a model with

In [None]:
best_pipeline = automl.best_pipeline

# Evaluate the pipeline's performance on the holdout data

In [None]:
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])
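Of the objectives scored above, AUC is the least intuitive. One way to read it is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; `roc_auc` is an illustrative helper with toy data, not the routine EvalML uses.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for a, s in zip(y_true, y_score) if a == 1]
    negatives = [s for a, s in zip(y_true, y_score) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.10, 0.35, 0.40, 0.80, 0.65, 0.55]
print(round(roc_auc(y_true, y_score), 3))
```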

## From here you can change the objective used for the search

In [None]:
automl_auc = AutoMLSearch(X_train=X_train, y_train=y_train,
                          problem_type='binary',
                          objective='auc',
                          additional_objectives=['f1', 'precision'],
                          max_batches=1,
                          optimize_thresholds=True)

automl_auc.search()

The `objective` parameter is the objective to optimize for. It is used to propose and rank pipelines, but not for optimizing each pipeline during fit-time.
When set to 'auto', it chooses:
- LogLossBinary for binary classification problems,
- LogLossMulticlass for multiclass classification problems, and
- R^2 for regression problems.
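The search above also sets `optimize_thresholds=True`, which concerns the decision threshold used to turn predicted probabilities into class labels for binary problems. The general idea can be sketched as a simple grid search that picks the threshold with the best F1 score. This is a plain-Python illustration with hypothetical helper names and toy data, not EvalML's actual procedure.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score obtained when probabilities >= threshold are labelled positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.10, 0.35, 0.40, 0.80, 0.65, 0.55]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]

chosen = best_threshold(y_true, y_prob, candidates)
print(chosen, round(f1_at_threshold(y_true, y_prob, chosen), 3))
```

The default threshold of 0.5 is not the best choice for this toy data, which is why tuning it against the objective you care about can pay off.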

# Save the model by pickling it

In [None]:
best_pipeline.save("model.pkl")
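Saving a model this way amounts to serializing a Python object to disk and reading it back later. The round trip can be sketched with the standard library's `pickle` module; the dictionary below is only a stand-in for a fitted model, and the serializer EvalML uses internally may differ.

```python
import os
import pickle
import tempfile

# A stand-in for a fitted model; the values are made up for illustration.
model = {"name": "toy-logistic-model", "weights": [0.4, -1.2], "threshold": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Write the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.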

# Load the saved model and test it against the test data

In [None]:
check_model = automl.load('model.pkl')

In [None]:
check_model.predict_proba(X_test).to_dataframe()