An introduction to the scikit-learn library, covering a range of common and useful features.
Documentation: https://scikit-learn.org/stable/user_guide.html

Scikit-learn (sklearn) is an open-source Python machine learning library. It is built on NumPy, SciPy, and Matplotlib.

The flow of the notebook is as follows (supervised learning):
1. Getting the data ready
2. Choosing the right machine learning algorithm/model for the problem
3. Fitting the chosen model to the data and using it to make predictions
4. Evaluating the model
5. Improving predictions (hyperparameter tuning)
6. Saving and loading a pretrained model
7. Putting it all together in a pipeline

In [2]:
# importing the libraries needed
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

# 1. Get the data ready

In [6]:
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [7]:
heart_disease.shape

(303, 14)

Here, each row is a different patient, and every column except target is a patient characteristic. The target column indicates whether the patient has heart disease (target=1) or not (target=0).

In [8]:
# Create the feature matrix (every column except the target)
X = heart_disease.drop("target", axis=1)

# Create the target column
y = heart_disease["target"]
X.shape, y.shape

((303, 13), (303,))

In [10]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [11]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [12]:
y.value_counts()

1    165
0    138
Name: target, dtype: int64

## Splitting the data into training and testing datasets

In [15]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((227, 13), (227,), (76, 13), (76,))
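
The 227/76 split above follows from the default test_size of 0.25. A minimal sketch of making the split explicit and repeatable with test_size, random_state and stratify. It uses a synthetic dataset from make_classification as a stand-in for the heart-disease data (an assumption, so the block runs without the CSV file), and the names X_demo, y_demo, X_tr and X_te are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart-disease data: 303 rows, 13 features.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# test_size=0.25 is the default; random_state fixes the shuffle so the split
# is repeatable, and stratify keeps class proportions similar in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

split_shapes = (X_tr.shape, X_te.shape)
print(split_shapes)
```

Without random_state, each run shuffles differently, so scores reported later in the notebook can change from run to run.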

# 2. Choosing the model and hyperparameters
We use scikit-learn's RandomForestClassifier.

In [17]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

In [18]:
# Inspect the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

# 3. Fit the Model to data and make predictions 
If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.
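
A minimal sketch of the difference in how fit is called in the two settings. The data is synthetic (make_blobs), and KMeans is used here only as one example of an unsupervised estimator; neither is part of the notebook's heart-disease workflow:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Two synthetic clusters (hypothetical toy data).
X_toy, y_toy = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

# Supervised: fit() receives both the features and the labels.
supervised = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Unsupervised: fit() receives only the features; KMeans groups similar samples.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

print(supervised.predict(X_toy[:5]))  # predicted labels
print(unsupervised.labels_[:5])       # cluster ids (the numbering is arbitrary)
```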

In [19]:
clf.fit(X_train, y_train)

## Use the model to make a prediction

In [20]:
y_preds = clf.predict(X_test)

# 4. Evaluate the model
Using the model's built-in score method, which for classifiers returns the mean accuracy on the given data and labels.

In [21]:
# Evaluate the model on training data
clf.score(X_train, y_train)

1.0

In [23]:
clf.score(X_test, y_test)

0.868421052631579

## Other evaluation methods

In [25]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [27]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.83      0.92      0.88        38
           1       0.91      0.82      0.86        38

    accuracy                           0.87        76
   macro avg       0.87      0.87      0.87        76
weighted avg       0.87      0.87      0.87        76



In [29]:
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[35,  3],
       [ 7, 31]])

In [30]:
accuracy_score(y_test, y_preds)

0.868421052631579
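
The numbers in the classification report and the accuracy score can be recovered by hand from the confusion matrix shown above. A short worked sketch, treating class 1 as the positive class (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix from the cell above: rows are true classes, columns are predictions.
conf_mat_demo = np.array([[35, 3],
                          [7, 31]])

# For a 2x2 matrix, ravel() yields the counts in this order.
tn, fp, fn, tp = conf_mat_demo.ravel()

accuracy = float((tp + tn) / conf_mat_demo.sum())  # 66 / 76
precision_1 = float(tp / (tp + fp))                # 31 / 34
recall_1 = float(tp / (tp + fn))                   # 31 / 38

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2))
```

These values round to the 0.87 accuracy and the 0.91 precision and 0.82 recall reported for class 1.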

# 5. Experiment to improve
Different models have different hyperparameters you can tune. For our model, RandomForestClassifier, we'll start by trying different values of n_estimators (the number of trees in the forest).

In [32]:
# Trying different numbers of estimators(trees)... (no cross-validation)

np.random.seed(42)

for i in range(10,100,10):
    print(f"Estimators: {i}")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set:{model.score(X_test, y_test) * 100}")
    print("--------------------------------------------------------------")

Estimators: 10
Model accuracy on test set:88.1578947368421
--------------------------------------------------------------
Estimators: 20
Model accuracy on test set:80.26315789473685
--------------------------------------------------------------
Estimators: 30
Model accuracy on test set:88.1578947368421
--------------------------------------------------------------
Estimators: 40
Model accuracy on test set:88.1578947368421
--------------------------------------------------------------
Estimators: 50
Model accuracy on test set:88.1578947368421
--------------------------------------------------------------
Estimators: 60
Model accuracy on test set:89.47368421052632
--------------------------------------------------------------
Estimators: 70
Model accuracy on test set:84.21052631578947
--------------------------------------------------------------
Estimators: 80
Model accuracy on test set:90.78947368421053
--------------------------------------------------------------
Estimators: 90
Model

In [33]:
from sklearn.model_selection import cross_val_score

# With cross-validation
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {model.score(X_test, y_test) * 100}%")
    print(f"Cross-validation score: {np.mean(cross_val_score(model, X, y, cv=5)) * 100}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 88.1578947368421%
Cross-validation score: 78.53551912568305%

Trying model with 20 estimators...
Model accuracy on test set: 84.21052631578947%
Cross-validation score: 79.84699453551912%

Trying model with 30 estimators...
Model accuracy on test set: 82.89473684210526%
Cross-validation score: 80.50819672131148%

Trying model with 40 estimators...
Model accuracy on test set: 86.8421052631579%
Cross-validation score: 82.15300546448088%

Trying model with 50 estimators...
Model accuracy on test set: 88.1578947368421%
Cross-validation score: 81.1639344262295%

Trying model with 60 estimators...
Model accuracy on test set: 85.52631578947368%
Cross-validation score: 83.47540983606557%

Trying model with 70 estimators...
Model accuracy on test set: 88.1578947368421%
Cross-validation score: 81.83060109289617%

Trying model with 80 estimators...
Model accuracy on test set: 89.47368421052632%
Cross-validation score: 82.81420765027322
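
The cross-validation score printed in the loop is the mean of five per-fold accuracies. A minimal sketch of what cross_val_score returns before averaging, using a synthetic dataset as a stand-in for the heart-disease data (an assumption; the names X_demo, y_demo and forest are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# cross_val_score clones the estimator and fits the clone on each training fold,
# so the estimator passed in does not need to be fitted beforehand.
forest = RandomForestClassifier(n_estimators=20, random_state=42)
fold_scores = cross_val_score(forest, X_demo, y_demo, cv=5)

print(fold_scores)           # one accuracy value per fold
print(np.mean(fold_scores))  # the single averaged number
```

Because each fold is held out once, the averaged score is usually less dependent on one particular train/test split than a single test-set accuracy.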

In [34]:
# another way to do it with GridSearchCV
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over 
param_grid = {'n_estimators': list(range(10, 100, 10))}

# Setup the grid search
grid = GridSearchCV(RandomForestClassifier(),
                   param_grid,
                   cv=5)

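# Fit the grid search to the data
# Note: fitting on all of X and y lets the test rows influence the chosen
# parameters; fit on X_train, y_train to keep the test set unseen.
grid.fit(X, y)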

# Find the best parameters
grid.best_params_

{'n_estimators': 80}

In [35]:
# Set the model to be the best estimator
clf = grid.best_estimator_
clf

In [37]:
# Refit the best model on the training split only (best_estimator_ was fitted on all of X and y)
clf = clf.fit(X_train, y_train)

In [38]:
# Accuracy on the test set
clf.score(X_test, y_test)

0.8947368421052632
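
A sketch of the same search run on the training split only, so that the test split stays unseen until the final score. It assumes a synthetic dataset in place of the heart-disease data and a smaller, illustrative grid to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# A smaller grid than in the notebook, to keep the sketch quick to run.
param_grid_demo = {"n_estimators": [10, 30, 50]}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_demo, cv=5)
search.fit(X_tr, y_tr)  # the test split is not seen during the search

# With the default refit=True, best_estimator_ is already refitted on X_tr, y_tr.
print(search.best_params_)
print(search.best_score_)        # mean cross-validated accuracy of the best setting
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)             # accuracy on the held-out test split
```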

# 6. Save the Model - for future use
Using Python's built-in pickle module.

In [40]:
import pickle

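# Save an existing model to file. Note: `model` is the last estimator fitted in
# the loop in step 5, not `clf`; pass `clf` instead to save the grid-search model.
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)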

In [41]:
# Load a saved model and evaluate it on the test set
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.9078947368421053
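
A minimal sketch of checking that a pickled model survives the round trip unchanged. It assumes a small synthetic dataset and a hypothetical file name (demo_model.pkl) inside a temporary directory, so nothing is written next to the notebook:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=0)
fitted = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Context managers close the file handles even if an error occurs.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(fitted.predict(X_demo), restored.predict(X_demo)))
print(same_predictions)
```

Pickle files should only be loaded from sources you trust, and a model is generally best loaded with the same scikit-learn version that saved it.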