# **AIDI-1010**

# **Contents:**
**Module: MLJAR**

1.   Installing MLJAR (Google Colab)
2.   Binary-Class Classification Example
3.   Multi-Class Classification Example
4.   Regression Example
5.   Takeaways & Homework





# **1 - Installing MLJAR (Google Colab)**

In [None]:
#1.1 - Install Linux Dependencies & Module (1-2mins)
!pip install mljar-supervised

**Note:** Restart the runtime, then re-run the cell below.



In [None]:
#1.2 - Re-Check Linux Dependencies & Module (5-10secs)
!pip install mljar-supervised

# **2 - Binary-Class Classification Example**

In [None]:
#Credit: https://towardsdatascience.com/binary-classification-with-automated-machine-learning-1a36e78ba50f & https://github.com/mljar/mljar-supervised#examples

#2.1 - Load Modules
import pandas as pd

#the MLJAR AutoML class
from supervised import AutoML 

# scikit-learn utilities
from sklearn.datasets import make_classification #to create our own random data
from sklearn.model_selection import train_test_split #to split our own random data into train/test splits

#AutoML parameters we can play around with
#mode = Explain, Perform, Compete, or Optuna
#algorithms = "auto" or a list of algorithm names
#results_path = path where results are stored
#total_time_limit = time limit in seconds for training
#train_ensemble = whether an ensemble is created at the end of training
#stack_models = whether a stack of models is created
#eval_metric = the metric to optimize; "auto" uses logloss for classification and rmse for regression
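
The parameters listed above are passed as keyword arguments to the `AutoML` constructor. The sketch below shows one possible configuration. The specific values (the `AutoML_2_demo` folder name, the 120-second budget, the choice of `accuracy`) are illustrative assumptions, not recommendations, and the `try`/`except` is only there so the block also runs where `mljar-supervised` is not installed.

```python
# Illustrative AutoML settings; every value here is an assumption chosen for the demo.
automl_kwargs = {
    "mode": "Explain",                      # one of Explain, Perform, Compete, Optuna
    "algorithms": ["Xgboost", "LightGBM"],  # omit this key to let MLJAR choose
    "results_path": "AutoML_2_demo",        # hypothetical output folder
    "total_time_limit": 120,                # training budget in seconds
    "train_ensemble": True,                 # build an ensemble at the end of training
    "stack_models": False,                  # skip model stacking
    "eval_metric": "accuracy",              # overrides the default logloss
}

try:
    from supervised import AutoML
    automl_demo = AutoML(**automl_kwargs)   # nothing is trained until .fit() is called
except ImportError:
    automl_demo = None                      # mljar-supervised is not installed here

print(automl_kwargs)
```

Keeping the settings in a dictionary makes it easy to change one value at a time and compare the resulting reports.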

In [None]:
#2.2 - Create Dataset
X, y = make_classification(n_samples=100000, n_features=20, n_redundant=2)
print(X, "\n", "The Shape of X is: ", X.shape, "\n")
print(y, "\n", "The Shape of y is: ", y.shape, "\n")

In [None]:
#2.3 - Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
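
Neither `make_classification` nor `train_test_split` is seeded in the cells above, so every run produces different data and a different split. The sketch below shows one way to make both reproducible. The seed value, the smaller `n_samples`, and the helper name `make_split` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary value chosen for this sketch

def make_split(seed):
    # Smaller n_samples than the notebook, only to keep the demo fast.
    X_demo, y_demo = make_classification(
        n_samples=1000, n_features=20, n_redundant=2, random_state=seed
    )
    # stratify keeps the class ratio the same in both splits.
    return train_test_split(
        X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=seed
    )

first = make_split(SEED)
second = make_split(SEED)

# Same seed, same rows: every returned array matches across the two calls.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Identical splits:", same)
print("Train/test sizes:", first[0].shape, first[1].shape)
```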

In [None]:
#2.4.1 - Create MLJAR model 2A, defining algorithms
automl_2A = AutoML(algorithms=["CatBoost", "Xgboost", "LightGBM"])

#2.4.2 - Fit data into Model 2A (1-2mins)
automl_2A.fit(X_train, y_train)

#2.4.3 - Summarize Performance of 2A
automl_2A.report()

In [None]:
#2.5.1 - Create MLJAR model 2B, automated algorithms
automl_2B = AutoML()

#2.5.2 - Fit data into Model 2B (1-2mins)
automl_2B.fit(X_train, y_train)

#2.5.3 - Summarize Performance of 2B
automl_2B.report()
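
The cells in this section fit both models but never score them on `X_test`. A minimal sketch of a hold-out check follows. The helper name `holdout_accuracy` is hypothetical, and a scikit-learn `DummyClassifier` stands in for the fitted AutoML object so the block runs without any training. In the notebook, `automl_2A` or `automl_2B` would be passed instead, assuming their `predict` method returns class labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def holdout_accuracy(model, X_holdout, y_holdout):
    """Score any fitted model exposing .predict() on held-out data."""
    predicted = np.asarray(model.predict(X_holdout)).astype(int)
    return accuracy_score(y_holdout, predicted)

# Tiny made-up dataset: 6 training rows, 4 hold-out rows.
X_tr = np.zeros((6, 2))
y_tr = np.array([0, 0, 0, 0, 1, 1])
X_ho = np.zeros((4, 2))
y_ho = np.array([0, 0, 1, 0])

# Stand-in model: always predicts the most frequent training class (0).
stand_in = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

print("Hold-out accuracy:", holdout_accuracy(stand_in, X_ho, y_ho))
```

Calling the same helper on both models gives a like-for-like comparison on data neither has seen.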

**Takeaway1:** Can you reach a higher accuracy? How?

**Takeaway2:** What's the accuracy on newer data (if you created any)? What other useful parameters were you able to find?

**Takeaway3:** Do you get the same results in different modes?

# **3 - Multi-Class Classification Example**

In [None]:
#Credit: https://github.com/mljar/mljar-supervised#examples

#Scenario: Optical recognition of handwritten digits dataset. Running this code for under 30 minutes usually yields a test accuracy of about 98%.

#3.1 - Load Modules
import pandas as pd 

# scikit-learn utilities
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# mljar-supervised package
from supervised import AutoML

In [None]:
#3.2 - Load Dataset
digits3 = load_digits()
print(digits3)

In [None]:
#3.3.1 - Split data into input/output elements
X3_train, X3_test, y3_train, y3_test = train_test_split(
    pd.DataFrame(digits3.data),
    digits3.target, 
    stratify=digits3.target, 
    test_size=0.25,
    random_state=123
    )

#3.3.2 - Print shapes of splits
print("X3 Train", X3_train.shape)
print("y3 Train", y3_train.shape)
print("X3 Test", X3_test.shape)
print("y3 Test", y3_test.shape)
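
The `stratify` argument asks `train_test_split` to keep each digit's share roughly equal in the train and test sets. The sketch below checks that claim directly with the same split settings as cell 3.3.1. The `_demo` variable names are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits_demo = load_digits()
_, _, y_tr_demo, y_te_demo = train_test_split(
    digits_demo.data,
    digits_demo.target,
    stratify=digits_demo.target,
    test_size=0.25,
    random_state=123,
)

# Fraction of each digit (0-9) in the train and test labels.
train_share = np.bincount(y_tr_demo, minlength=10) / len(y_tr_demo)
test_share = np.bincount(y_te_demo, minlength=10) / len(y_te_demo)

for digit in range(10):
    print(f"digit {digit}: train {train_share[digit]:.3f}  test {test_share[digit]:.3f}")
```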

In [None]:
#3.4.1 - Define Model
automl_3A = AutoML(mode="Explain")

#3.4.2 - Fit & Train Data into Model 3A (3-4mins)
automl_3A.fit(X3_train, y3_train)

#3.4.3 - Markdown Report of 3A                                            
automl_3A.report()

In [None]:
#3.4.4 - Predict on test data using Model 3A
predictions_3A = automl_3A.predict_all(X3_test)
print(predictions_3A.head())

In [None]:
#3.4.5 - Summarize Performance of 3A
print("Test accuracy:", accuracy_score(y3_test,predictions_3A["label"].astype(int)))
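
A single accuracy figure can hide which digits are being confused with each other. The sketch below uses made-up labels to show what `accuracy_score` computes and how a confusion matrix breaks it down per class. With the notebook's data, `y3_test` and the integer-cast `label` column would take the place of the two demo arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_true_demo = np.array([0, 1, 2, 2, 1, 0])
y_pred_demo = np.array([0, 2, 2, 2, 1, 0])

# Accuracy is the fraction of positions where prediction equals truth.
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
print("accuracy_score:", accuracy_score(y_true_demo, y_pred_demo))
print("manual:", manual_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class recall: correct predictions on the diagonal over row totals.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print("per-class recall:", per_class_recall)
```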

In [None]:
#3.5.1 - Define Model/Search
automl_3B = AutoML(mode="Perform")

#3.5.2 - Fit & Train Data into Model 3B (up to 1hour)
automl_3B.fit(X3_train, y3_train)

#3.5.3 - Markdown Report of 3B
automl_3B.report()

In [None]:
#3.5.4 - Predict on test data using Model 3B
predictions_3B = automl_3B.predict_all(X3_test)
print(predictions_3B.head())

In [None]:
#3.5.5 - Summarize Performance of 3B
print("Test accuracy: ", accuracy_score(y3_test,predictions_3B["label"].astype(int)))

**Takeaway1:** Can you reach a higher accuracy? How?

**Takeaway2:** What's the accuracy on newer data (if you created any)? What other useful parameters were you able to find?

**Takeaway3:** Do you get the same results in different modes?

# **4 - Regression Example**

In [None]:
#Credit: https://github.com/mljar/mljar-supervised#examples

#Scenario: Regression example on Boston house prices data. On test data it scores a mean squared error (MSE) of about 10.85.

#4.1 - Load Modules
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# scikit-learn utilities
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# mljar-supervised package
from supervised import AutoML

In [None]:
#4.2 - Load Dataset (load_boston was removed in scikit-learn 1.2; this cell needs an older version)
housing4 = datasets.load_boston()
X4 = pd.DataFrame(housing4["data"],columns=housing4["feature_names"])
y4 = pd.Series(housing4["target"], name="target")

print("This is X", "\n", X4, "\n")
print("This is y", "\n", y4)
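
On scikit-learn 1.2 or newer, where `load_boston` is no longer available, a different bundled regression dataset can stand in so the rest of the section still runs. The sketch below uses `load_diabetes`. This is a substitute chosen for this example, not the notebook's dataset, so the MSE quoted in the scenario does not apply to it. The `_alt` variable names are hypothetical.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns a pandas DataFrame / Series, matching X4 and y4 in layout.
diabetes = load_diabetes(as_frame=True)
X4_alt = diabetes.data
y4_alt = diabetes.target.rename("target")

print("X shape:", X4_alt.shape)
print("y shape:", y4_alt.shape)
```

Assigning `X4_alt` and `y4_alt` to `X4` and `y4` would let the later cells run unchanged.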

In [None]:
#4.3.1 - Split data into input/output elements
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y4, test_size=0.2)

#4.3.2 - Print shapes of splits
print("X4 Train", X4_train.shape)
print("y4 Train", y4_train.shape)
print("X4 Test", X4_test.shape)
print("y4 Test", y4_test.shape)

In [None]:
#4.4.1 - Define Model
automl_4A = AutoML(mode="Explain")

#4.4.2 - Fit & Train Data into Model 4A (3-4mins)
automl_4A.fit(X4_train, y4_train)

#4.4.3 - Markdown Report of 4A
automl_4A.report()

In [None]:
#4.5 - Predict on test data using Model 4A
predictions_4A = automl_4A.predict_all(X4_test)
print(predictions_4A.head())

In [None]:
#4.6 - Summarize Performance of Model 4A
print("Test MSE:", mean_squared_error(y4_test, predictions_4A["prediction"]))

In [None]:
#4.7 - Plot of Predictions from Model 4A
plt.plot(y4_test, predictions_4A.prediction, '.')
plt.xlabel("True Value")
plt.ylabel("Predicted Value")

In [None]:
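#4.8 - MAE on test data (NumPy arrays avoid pandas index misalignment between y4_test and the predictions)
np.mean(np.abs(y4_test.to_numpy() - predictions_4A["prediction"].to_numpy()))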
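
When two pandas Series are subtracted, values are matched by index label rather than by position. After a shuffled `train_test_split`, the true values keep their original row labels, while a freshly built prediction column is typically numbered from 0, so a plain subtraction can pair the wrong rows. The sketch below reproduces the effect with made-up numbers.

```python
import numpy as np
import pandas as pd

# True values with the shuffled index a random split leaves behind.
y_true = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
# Predictions in the same row order, but with a fresh 0..n-1 index.
y_pred = pd.Series([11.0, 22.0, 33.0])

# Label-aligned subtraction pairs 20 with 11, 30 with 22, and 10 with 33.
aligned_mae = np.mean(np.abs(y_true - y_pred))

# Positional subtraction pairs each true value with its own prediction.
positional_mae = np.mean(np.abs(y_true.to_numpy() - y_pred.to_numpy()))

print("Index-aligned MAE:", aligned_mae)
print("Positional MAE:", positional_mae)
```

Converting both sides with `.to_numpy()`, or passing them to `sklearn.metrics.mean_absolute_error`, keeps the comparison positional.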

# **5 - Takeaways & Homework**

How many times have you run the model? Are you getting consistent results?

Income classification: a binary classification task on census data.

https://github.com/mljar/mljar-examples/blob/master/Income_classification/Income_classification.ipynb

Iris classification: a multiclass classification task on Iris flower data.

https://github.com/mljar/mljar-examples/blob/master/Iris_classification/Iris_classification.ipynb