# Introduction to Scikit-Learn

Scikit-learn is a popular Python machine learning library that provides ready-to-use modules for preparing data, training models, and evaluating them.

What to cover:

- [0. An end-to-end scikit-learn workflow](#0.-an-end-to-end-scikit-learn-workflow)
- [1. Getting the data ready](#1.-getting-the-data-ready)
- [2. Choosing the right estimator/ML model](#2.-choosing-the-right-estimator-ml-model)
- [3. Fit the model/algorithm/estimator and use it to make predictions](#3.-fit-the-model-algo-estimator-and-use-it-to-make-predictions)
- [4. Evaluating a model](#4.-evaluating-a-modules)
- [5. Improve the model](#5.-improve-the-model)
- [6. Save and load a trained model](#6.-save-and-load-a-trained-model)
- [7. Putting it all together](#7.-putting-it-all-together)


## 0. An end-to-end scikit-learn workflow

## 1. Get the data ready


In [1]:
# 1. get the data ready
import pandas as pd

heart_d = pd.read_csv(r"C:\Users\core i5\Desktop\GitHub\DataScience\datascience\ZTM - Data Science and Machine Learning\data\13.1 heart-disease.csv")

heart_d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [2]:
print(heart_d.head())

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  


In [3]:
# Create the features matrix X (a DataFrame): every column except the target column

import numpy as np
X = heart_d.drop("target", axis=1)

# create y (labels)

y = heart_d["target"]

## 2. Choose the right model and hyperparameters


In [4]:
# 2. Choose the right model and hyperparameters

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# keep default hyperparameters

clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
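
The defaults listed above can be overridden when the estimator is created, or later with `set_params`. A minimal sketch, assuming only that scikit-learn is installed; the name `clf_custom` and the values chosen are arbitrary illustrations, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Override two defaults at construction time (illustrative values only).
clf_custom = RandomForestClassifier(n_estimators=50, max_depth=5)

# get_params() reflects whatever was passed to the constructor.
params = clf_custom.get_params()
print(params["n_estimators"], params["max_depth"])

# set_params() changes hyperparameters on an existing estimator.
# A model that was already fitted has to be fitted again for this to matter.
clf_custom.set_params(n_estimators=200)
print(clf_custom.get_params()["n_estimators"])
```

Every other hyperparameter keeps its default value.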

## 3. Fit the model to the training data


In [5]:
# 3. Fit the model to the training data

from sklearn.model_selection import train_test_split

# Shuffle the rows, then put 80% of them in X_train/y_train and the remaining 20% in X_test/y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [6]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
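
Because the split is random, it changes on every run unless the shuffle is seeded. A minimal sketch of a repeatable split, using a synthetic stand-in with the same shape as the heart-disease data; `X_demo`, `y_demo`, and the seed values are made up for illustration, and `stratify` is an optional extra that was not used above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 303 rows, 13 feature columns, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = rng.integers(0, 2, size=303)

# random_state fixes the shuffle so the split is the same on every run;
# stratify keeps the class balance similar in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)  # (242, 13) (61, 13)
```

With 303 rows and `test_size=0.2`, the test set gets 61 rows (60.6 rounded up) and the training set gets the remaining 242.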

In [7]:
X_train

     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal
181   65    0   0       150   225    0        0      114      0      1.0      1   3     3
284   61    1   0       140   207    0        0      138      1      1.9      2   1     3
151   71    0   0       112   149    0        1      125      0      1.6      1   0     2
182   61    0   0       130   330    0        0      169      0      0.0      2   0     2
191   58    1   0       128   216    0        0      131      1      2.2      1   3     3
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...    ...  ..   ...
169   53    1   0       140   203    1        0      155      1      3.1      0   0     3
261   52    1   0       112   230    0        1      160      0      0.0      2   1     2
201   60    1   0       125   258    0        0      141      1      2.8      1   1     3
31    65    1   0       120   177    0        1      140      0      0.4      2   0     3


In [8]:
y_train

181    0
284    0
151    1
182    0
191    0
      ..
169    0
261    0
201    0
31     1
145    1
Name: target, Length: 242, dtype: int64

In [9]:
# Train (fit) the model on the training data
clf.fit(X_train, y_train)

RandomForestClassifier()

In [10]:
# Make predictions: estimate the target value (0 or 1) for each row of X_test.

# X_test must have the same feature columns as the data the model was trained on.
y_preds = clf.predict(X_test)
y_preds

array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0], dtype=int64)
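
As a quick check of the shape requirement on `predict`, here is a minimal sketch on synthetic data. The arrays and the names `model` and `mismatch_error` are made up for illustration; the exact wording of the error message depends on the scikit-learn version:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_fit = rng.normal(size=(100, 13))    # 13 features, like the heart-disease data
y_fit = rng.integers(0, 2, size=100)

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_fit, y_fit)

# Any number of rows is fine as long as there are 13 columns.
print(model.predict(rng.normal(size=(5, 13))).shape)

# With one column missing, predict() raises a ValueError.
try:
    model.predict(rng.normal(size=(5, 12)))
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("predict failed:", err)
```

The number of rows can differ between training and prediction; only the feature columns have to match.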

## 4. Evaluate the model.


In [11]:
# 4. Evaluate the model: how well does the model we just trained perform? Score it on both the training and the test data. The training score will probably be high, because the model has already seen that data.
 
clf.score(X_train, y_train)

1.0

In [12]:
# Check the model's accuracy on the test data, which it has not seen before

clf.score(X_test, y_test)

0.8688524590163934

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.88      0.82      0.85        28
           1       0.86      0.91      0.88        33

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61



In [14]:
print(confusion_matrix(y_test, y_preds))

[[23  5]
 [ 3 30]]
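
The figures in the classification report can be recomputed by hand from the confusion matrix above. A short worked example, relying on scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are chosen here for illustration:

```python
# Confusion matrix copied from the output above.
cm = [[23, 5],
      [3, 30]]
tn, fp = cm[0]   # true class 0: 23 predicted as 0, 5 predicted as 1
fn, tp = cm[1]   # true class 1: 3 predicted as 0, 30 predicted as 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 53 / 61
precision_1 = tp / (tp + fp)                 # 30 / 35
recall_1 = tp / (tp + fn)                    # 30 / 33
precision_0 = tn / (tn + fn)                 # 23 / 26
recall_0 = tn / (tn + fp)                    # 23 / 28

print(f"accuracy: {accuracy:.2f}")                                    # 0.87
print(f"class 0: precision {precision_0:.2f}, recall {recall_0:.2f}")  # 0.88, 0.82
print(f"class 1: precision {precision_1:.2f}, recall {recall_1:.2f}")  # 0.86, 0.91
```

These values agree with the classification report: 53 of the 61 test rows are classified correctly.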


In [15]:
accuracy_score(y_test, y_preds)

0.8688524590163934

## 5. Improve a model


In [16]:
# 5. Improve a model: try different values of n_estimators. The loop below trains a new RandomForestClassifier for each value.

np.random.seed(42)
for i in range(10, 100, 10):
    # Show which n_estimators value is being tried
    print(f"trying model with {i} estimators...")
    # Create a model with n_estimators=i and fit (train) it in one step
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    # Print the model's accuracy on the test set
    print(f"model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print(" ")
    

trying model with 10 estimators...
model accuracy on test set: 80.33%
 
trying model with 20 estimators...
model accuracy on test set: 90.16%
 
trying model with 30 estimators...
model accuracy on test set: 83.61%
 
trying model with 40 estimators...
model accuracy on test set: 81.97%
 
trying model with 50 estimators...
model accuracy on test set: 86.89%
 
trying model with 60 estimators...
model accuracy on test set: 83.61%
 
trying model with 70 estimators...
model accuracy on test set: 86.89%
 
trying model with 80 estimators...
model accuracy on test set: 85.25%
 
trying model with 90 estimators...
model accuracy on test set: 88.52%
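
With only 61 test rows, a single misclassified row moves the accuracy by about 1.6 percentage points (1/61), so some of the variation between these runs may be noise rather than a real effect of `n_estimators`. One way to get a steadier comparison is cross-validation. A minimal sketch on synthetic stand-in data, not the heart-disease data; the three `n_estimators` values and the name `cv_means` are picked for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the heart-disease features.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

cv_means = {}
for n in (10, 50, 90):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five train/validate splits instead of one.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean() * 100:.2f}%")
```

Each mean is averaged over five held-out folds, so it depends less on which rows happened to land in a single test split.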
 


## 6. Save a model and load it


In [17]:
# 6. save a model and load it

import pickle 

# Pass the model (clf) and a file opened in "wb" (write binary) mode to pickle.dump
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [18]:
# Load the saved model

# pickle.load reads the model back from the file, opened in "rb" (read binary) mode
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Check the loaded model's accuracy on the test set. It should match the last model trained in the loop above, which used n_estimators=90
loaded_model.score(X_test, y_test)

0.8852459016393442
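
To confirm that a save and load round trip preserves a model, its predictions can be compared before and after. A minimal sketch on synthetic data; the file name `demo_model.pkl` and the variable names are hypothetical, and a temporary directory is used so that no files are left behind:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 13))
y_demo = rng.integers(0, 2, size=100)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(model_path, "wb") as f:   # the with-block closes the file afterwards
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should not be loaded at all.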

# Getting the data ready <a id ="1.-getting-the-data-ready"></a>


In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 1.1 Getting our data ready to be used with an ML model

- Split the data into features and labels (usually "X" & "y")
- Fill (impute) or discard missing values
- Convert non-numerical values to numerical ones (for example, with one-hot encoding)
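
The three steps above can be sketched on a tiny made-up table. All values and the names `toy`, `X_toy`, and `y_toy` are invented for illustration. This sketch uses pandas (`fillna` and `get_dummies`) rather than scikit-learn's own imputer and `OneHotEncoder`, simply to keep it short:

```python
import numpy as np
import pandas as pd

# Tiny invented table in the style of a car-sales dataset.
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota", "Honda"],
    "Doors": [4, 5, np.nan, 4],
    "Price": [15000, 20000, 13000, 16000],
})

# 1. Split into features (X) and labels (y).
X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# 2. Fill the missing Doors value with the most common value (4 here).
X_toy["Doors"] = X_toy["Doors"].fillna(X_toy["Doors"].mode()[0])

# 3. One-hot encode the text column: one 0/1 column per category.
X_toy = pd.get_dummies(X_toy, columns=["Make"], dtype=int)

print(X_toy)
```

After these steps every column in `X_toy` is numeric and has no missing values, which is the form scikit-learn estimators expect.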


In [4]:
# 1.1 Make sure that all the data you are working with is numerical

carsales_ex = pd.read_csv(
    r"C:\Users\core i5\Documents\GitHub\DataScience\datascience\ZTM - Data Science and Machine Learning\data\car-sales-extended.csv"
)
carsales_ex.info()
carsales_ex.head(), carsales_ex.shape


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           1000 non-null   object
 1   Colour         1000 non-null   object
 2   Odometer (KM)  1000 non-null   int64 
 3   Doors          1000 non-null   int64 
 4   Price          1000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 39.2+ KB


(     Make Colour  Odometer (KM)  Doors  Price
 0   Honda  White          35431      4  15323
 1     BMW   Blue         192714      5  19943
 2   Honda  White          84714      4  28343
 3  Toyota  White         154365      4  13434
 4  Nissan   Blue         181577      3  14043,
 (1000, 5))

In [25]:
# Split into X and y

from sklearn.model_selection import train_test_split

X = carsales_ex.drop("Price", axis=1)
y = carsales_ex["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)