# Intro to sklearn


What we are going to cover:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together


Cheat sheet: https://images.datacamp.com/image/upload/v1676302389/Marketing/Blog/Scikit-Learn_Cheat_Sheet.pdf

## 0. An end-to-end Scikit-Learn workflow

In [1]:
#Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
#1 Getting the data ready
import pandas as pd
import numpy as np
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
#Create X (features matrix)
X = heart_disease.drop("target", axis=1)

#Create y (labels)
y = heart_disease["target"]



In [4]:
import sklearn
sklearn.show_versions()


System:
    python: 3.10.10 | packaged by Anaconda, Inc. | (main, Mar 21 2023, 18:39:17) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\Franz\miniconda3\python.exe
   machine: Windows-10-10.0.19044-SP0

Python dependencies:
      sklearn: 1.2.2
          pip: 23.1.2
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.11.0
       Cython: None
       pandas: 1.5.2
   matplotlib: 3.6.2
       joblib: 1.3.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\Franz\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\numpy\.libs\libopenblas.FB5AE2TYXYH2IJRDKGDGQ3XBKLKTF43H.gfortran-win_amd64.dll
        version: 0.3.20
threading_layer: pthreads
   architecture: Zen
    num_threads: 16

       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Users\Franz\AppDat

In [5]:
import warnings
warnings.filterwarnings("default")

In [6]:
conda update sklearn



Note: you may need to restart the kernel to use updated packages.



PackageNotInstalledError: Package is not installed in prefix.
  prefix: C:\Users\Franz\miniconda3
  package name: sklearn


  return process_handler(cmd, _system_body)


In [7]:
%pip install scikit-learn


Note: you may need to restart the kernel to use updated packages.




In [8]:
#2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

#We'll keep the default hyperparameters
clf.get_params()


{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [10]:
clf.fit(X_train, y_train);


In [11]:
#make a prediction
y_preds = clf.predict(X_test)
y_preds

array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0], dtype=int64)

In [12]:
y_test

168    0
242    0
57     1
85     1
102    1
      ..
157    1
279    0
72     1
178    0
186    0
Name: target, Length: 61, dtype: int64

In [13]:
#4. Evaluate the model
clf.score(X_train, y_train)

1.0

In [14]:
clf.score(X_test, y_test)

0.8524590163934426

In [15]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.87      0.84      0.85        31
           1       0.84      0.87      0.85        30

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.85      0.85      0.85        61



In [16]:
confusion_matrix(y_test, y_preds)

array([[26,  5],
       [ 4, 26]], dtype=int64)

In [17]:
accuracy_score(y_test, y_preds)

0.8524590163934426
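
The numbers in the classification report can be recomputed by hand from the confusion matrix shown above. The following is a minimal sketch that only uses the four counts printed by "confusion_matrix()" (rows are actual labels, columns are predicted labels); the variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix output above
tn, fp = 26, 5   # actual label 0: predicted 0, predicted 1
fn, tp = 4, 26   # actual label 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / total

# Precision for class 1: of everything predicted as 1, how much really is 1
precision_1 = tp / (tp + fp)

# Recall for class 1: of everything that really is 1, how much was found
recall_1 = tp / (tp + fn)

# F1 score: harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:    {accuracy:.4f}")
print(f"precision_1: {precision_1:.2f}")
print(f"recall_1:    {recall_1:.2f}")
print(f"f1_1:        {f1_1:.2f}")
```

The rounded values agree with the row for class 1 in the report and with the accuracy returned by "clf.score()" and "accuracy_score()".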

In [18]:
#5. Improve the model
#Try different values of n_estimators

np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print(" ")

Trying model with 10 estimators...
Model accuracy on test set: 81.97%
 
Trying model with 20 estimators...
Model accuracy on test set: 83.61%
 
Trying model with 30 estimators...
Model accuracy on test set: 85.25%
 
Trying model with 40 estimators...
Model accuracy on test set: 83.61%
 
Trying model with 50 estimators...
Model accuracy on test set: 86.89%
 
Trying model with 60 estimators...
Model accuracy on test set: 83.61%
 
Trying model with 70 estimators...
Model accuracy on test set: 81.97%
 
Trying model with 80 estimators...
Model accuracy on test set: 88.52%
 
Trying model with 90 estimators...
Model accuracy on test set: 85.25%
 


In [19]:
# 6. Save the model and load it
import pickle

# Use a context manager so the file handle is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)


In [20]:
# Load the saved model back and check that it scores the same on the test set
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)


0.8524590163934426
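
To check that saving and loading really preserves a model, one can compare the predictions before and after the round trip. This is a small sketch on synthetic data; "make_classification", the temporary directory and all variable names are assumptions made for the example and are not part of the heart disease workflow.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical, only for this example)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should make exactly the same predictions
same_predictions = np.array_equal(demo_clf.predict(X_demo),
                                  restored_clf.predict(X_demo))
print(same_predictions)
```

Pickle files can execute code when loaded, so only load model files from sources you trust.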

## 1. Getting our data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (usually "X" and "y")
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)

In [21]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [22]:
X = heart_disease.drop("target", axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [23]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [24]:
# Split the data into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [25]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
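
The shapes above follow from "test_size=0.2": with 303 rows, the test set gets 61 rows and the training set the remaining 242. The sketch below reproduces this on a placeholder array with the same number of rows. It also uses the "random_state" parameter of "train_test_split", which is an alternative to calling "np.random.seed()" when a reproducible split is wanted. The array contents and names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the heart disease table
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303)

# Passing random_state makes the shuffle reproducible for this call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Same random_state, same rows in the test set
identical_split = np.array_equal(split_a[1], split_b[1])
print(identical_split)
```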

### 1.1. Make sure it is all numerical

In [26]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [27]:
len(car_sales)

1000

In [28]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [29]:
#Split the data into X/y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

#Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [30]:
#Build machine learning model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Toyota'

In [None]:
#Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                 remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
pd.DataFrame(transformed_X)

In [None]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies
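
One detail worth knowing about "pd.get_dummies()": by default it only encodes text-like columns, so an integer column such as "Doors" is passed through unchanged. The sketch below shows the difference on a tiny made-up table; passing the "columns" argument is one way to force the encoding.

```python
import pandas as pd

# Tiny made-up table (hypothetical values)
toy = pd.DataFrame({"Make": ["Honda", "BMW", "Honda"],
                    "Doors": [4, 5, 4]})

# Default behaviour: only the text column is one-hot encoded
default_dummies = pd.get_dummies(toy)
print(list(default_dummies.columns))

# Naming the columns explicitly also encodes the integer column
forced_dummies = pd.get_dummies(toy, columns=["Make", "Doors"])
print(list(forced_dummies.columns))
```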

In [None]:
#Let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X,
                                                   y,
                                                   test_size=0.2)
model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")

In [None]:
car_sales_missing.isna().sum()

In [None]:
#Create a new X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [None]:
# Let's try and convert data to numbers

#Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                 remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

#### Option 1: Fill missing data with pandas

In [None]:
#Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

#Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

#Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

#Fill the "Doors" column with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [None]:
#check out dataframe
car_sales_missing.isna().sum()

In [None]:
#Remove rows with missing price value

car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing) #The 50 rows with a missing Price value have been removed

In [None]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [None]:
# Let's try and convert data to numbers

#Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                 remainder="passthrough")

#Transform only the features (X) so that Price does not end up among the inputs
transformed_X = transformer.fit_transform(X)
transformed_X

#### Option 2: Fill missing values with Scikit-Learn

In [None]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
#Drop the rows with no Price value
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

In [None]:
#Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [None]:
X.isna().sum()

In [None]:
#Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

#Fill Categorical values with missing and numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

#Define columns
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

#Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

#Transform the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
car_sales_filled = pd.DataFrame(filled_X, columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled

In [None]:
car_sales_filled.isna().sum()

In [None]:
X = car_sales_filled
y = car_sales_missing["Price"] #rows without a Price were already dropped above


In [None]:
# Let's try and convert data to numbers

#Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                 remainder="passthrough")

transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

In [None]:
#Now we have got our data as numbers and filled (no missing values)

np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)
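
The imputing and encoding steps above can also be chained with the model so that they are fitted on the training rows only and then reused on the test rows. The following is a sketch of that idea using "Pipeline", on a small synthetic table. The data, the column choices and the imputing strategies are assumptions for the example, not the car sales data itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for a car sales table (hypothetical values)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota"], size=n), dtype=object),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n).astype(float),
})
toy["Price"] = 30_000 - 0.05 * toy["Odometer (KM)"] + rng.normal(0, 500, size=n)

# Knock out some feature values to simulate missing data
toy.loc[toy.index[::10], "Make"] = np.nan
toy.loc[toy.index[::15], "Odometer (KM)"] = np.nan

# Categorical column: fill with "missing", then one-hot encode
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

# Numerical columns: fill with the column mean
preprocess = ColumnTransformer([
    ("cat", categorical, ["Make"]),
    ("num", SimpleImputer(strategy="mean"), ["Doors", "Odometer (KM)"]),
])

model_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# fit() learns the imputation values and categories from the training rows only
model_pipeline.fit(X_tr, y_tr)
toy_preds = model_pipeline.predict(X_te)
toy_r2 = model_pipeline.score(X_te, y_te)
print(toy_preds.shape, toy_r2)
```

Fitting the preprocessing inside the pipeline keeps information from the test rows (for example their mean odometer reading) out of the training step.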

## 2. Choosing the right estimator/algorithm for your problem

Some things to note:
* Scikit-Learn refers to machine learning models and algorithms as estimators
* Classification problem - predicting a category (heart disease or not)
* Sometimes you will see "clf" (short for classifier) used as the name of a classification estimator
* Regression problem - predicting a number (selling price of a car)

If you are working on a machine learning problem with Scikit-Learn and are not sure which model to use, look up the Scikit-Learn machine learning map


### 2.1. Picking a machine learning model for a regression problem

Let's use the California Housing data set

In [None]:
# Get California Housing data set

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

In [None]:
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df

In [None]:
housing_df["target"] = housing["target"]
housing_df.head()

In [None]:
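#The target is already stored in the "target" column; drop a leftover "MedHouseVal" column only if one exists
housing_df = housing_df.drop("MedHouseVal", axis=1, errors="ignore")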

In [None]:
housing_df

In [None]:
# Import algorithm
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

#Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"] #median house price in $100,000s

#Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, y_train)

#Check the score of the model (on the test set)
model.score(X_test, y_test)


What if "Ridge" didn't work or the score didn't fit our needs?

Well, we could always try a different model...

How about we try an ensemble model

An ensemble combines the predictions of several smaller models instead of relying on a single model. This combination can yield better results.
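
To make the idea of "combining smaller models" concrete: a fitted "RandomForestRegressor" keeps its individual decision trees in the "estimators_" attribute, and its prediction is the average of the trees' predictions. The sketch below checks this on synthetic data; "make_regression" and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical, only for this example)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# One row of predictions per individual tree
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own prediction
manual_average = tree_preds.mean(axis=0)
matches_forest = np.allclose(manual_average, forest.predict(X_demo))
print(tree_preds.shape, matches_forest)
```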

In [None]:
#import the RandomForestRegressor model class from the ensemble module 

from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

#Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Create RandomForest model

model = RandomForestRegressor()
model.fit(X_train, y_train )

#Check the score of the model (on the test set)

model.score(X_test, y_test)

### 2.2. Picking a machine learning model for a classification problem

Let's go to the map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html



In [None]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()


Consulting the map suggests trying LinearSVC.

In [None]:
#Import the LinearSVC estimator class
from sklearn.svm import LinearSVC 

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate LinearSVC
clf = LinearSVC()
clf.fit(X_train, y_train)

#Evaluate the model
clf.score(X_test, y_test)


In [None]:
heart_disease["target"].value_counts()

In [None]:
#Import the RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier 

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

#Evaluate the model
clf.score(X_test, y_test)

Tidbit:

    1. If you have structured data, use ensemble methods
    
    2. If you have unstructured data, use deep learning or transfer learning

## 3. Fit the model/algorithm on our data and use it to make predictions

### 3.1 Fitting the model to the data

* X = features, data, feature variables
* y = labels, targets, target variables

In [None]:
#Import the RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier 

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

#Fit the model to the data (training the ML model)
clf.fit(X_train, y_train)

#Evaluate the Random Forest Classifier (use the patterns the model has learned)
clf.score(X_test, y_test)

In [None]:
X.head()


In [None]:
y.head()

### 3.2 Make predictions using a machine learning model

2 ways to make predictions:
1. predict()

2. predict_proba()


In [None]:
X_test

In [None]:
# Use a trained model to make predictions

clf.predict(X_test)

In [None]:
np.array(y_test)

In [None]:
# Compare predictions to truth labels to evaluate the model

y_preds = clf.predict(X_test)
test = y_preds == y_test
np.array(test) #The comparison yields an array of booleans [True, False, False, ...]
np.mean(test) #The arithmetic mean of the booleans is the fraction of correct predictions (accuracy)

In [None]:
clf.score(X_test, y_test) #Mean accuracy on the test set

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

Make predictions with 'predict_proba()'


In [None]:
# predict_proba() returns the predicted probability of each classification label
clf.predict_proba(X_test[:5])

In [None]:
# Let's predict on the same data
clf.predict(X_test[:5])

In [None]:
X_test[:5]
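
The two methods are closely related: each row returned by "predict_proba()" holds one probability per class, and "predict()" returns the class with the highest probability. The following sketch checks this relationship for a random forest on synthetic data; the data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (hypothetical, only for this example)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

probabilities = demo_clf.predict_proba(X_te)   # shape: (n_samples, n_classes)
labels = demo_clf.predict(X_te)                # shape: (n_samples,)

# Each row of probabilities sums to 1
rows_sum_to_one = np.allclose(probabilities.sum(axis=1), 1.0)

# Picking the most probable class per row gives the same labels as predict()
labels_from_proba = demo_clf.classes_[np.argmax(probabilities, axis=1)]
same_labels = np.array_equal(labels, labels_from_proba)

print(probabilities[:5])
print(labels[:5])
```

The probabilities are useful when the confidence of a prediction matters, for example when choosing a decision threshold other than 0.5.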

'predict()' can also be used for regression models

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

#Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

#Split data into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Create model instance
model = RandomForestRegressor()

#Fit the model to the data
model.fit(X_train, y_train)

#Make predictions
y_preds = model.predict(X_test)

In [None]:
y_preds[:10]

In [None]:
np.array(y_test[:10])

In [None]:
len(y_preds)

In [None]:
len(y_test)

In [None]:
#Compare the predictions to the truth

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

In [None]:
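#y_preds only covers the test rows, so compare it with y_test instead of adding it to housing_df
preds_df = pd.DataFrame({"actual": y_test, "preds": y_preds})
preds_df.head()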

## 4. Evaluating a machine learning model

Three ways to evaluate Scikit-Learn models/estimators:

1. Estimator's built-in "score()" method
2. The "scoring" parameter
3. The problem-specific metric functions
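
As a quick orientation, the three options can be placed side by side. This is a sketch on synthetic data with made-up variable names: the "score()" method and the metric function "accuracy_score()" report the same accuracy for a classifier on the same test set, while "cross_val_score()" with the "scoring" parameter returns one score per cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (hypothetical, only for this example)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# 1. Built-in score() method (mean accuracy for classifiers)
score_method = demo_clf.score(X_te, y_te)

# 2. The scoring parameter of cross_val_score (one value per fold)
cv_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring="accuracy")

# 3. A problem-specific metric function
metric_function = accuracy_score(y_te, demo_clf.predict(X_te))

print(score_method, metric_function, cv_scores)
```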

You can read more about this here:
https://scikit-learn.org/stable/modules/model_evaluation.html


### 4.1 Evaluating a model with the score method

In [None]:
#Import the RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier 

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

#Fit the model to the data (training the ML model)
clf.fit(X_train, y_train)

#Evaluate the Random Forest Classifier (use the patterns the model has learned)
clf.score(X_test, y_test)

In [None]:
clf.score(X_train, y_train) #The model was trained on X_train and y_train, hence the model's score here is 100%

In [None]:
y_train

Let's use the score() on our regression problem...

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

#Create the data
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

#Split data into test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Create model instance
model = RandomForestRegressor(n_estimators=100)

#Fit the model to the data
model.fit(X_train, y_train)



In [None]:
model.score(X_test, y_test) #The default score metric for regression estimators is R^2 (1.0 is the highest possible value)


### 4.2 Evaluating a model using the scoring parameter


In [None]:
#Import score library
from sklearn.model_selection import cross_val_score

#Import the RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier 

#Setup random seed
np.random.seed(42)

#Make the data
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


#Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

#Fit the model to the data (training the ML model)
clf.fit(X_train, y_train)


In [None]:
clf.score(X_test, y_test)

In [None]:
cross_val_score(clf, X, y, cv=5)

In [None]:
np.random.seed(42)

#Single training and test split score
clf_single_score = clf.score(X_test, y_test)

#Take the mean of 5-fold cross-validation score
clf_cross_val_score = np.mean(cross_val_score(clf, X, y, cv=5))

#compare the two

clf_single_score, clf_cross_val_score

In [None]:
#Default scoring parameter of classifier = mean accuracy
clf.score(X_test, y_test)

In [None]:
# Scoring parameter set to None by default
cross_val_score(clf, X, y, cv=5, scoring=None)
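
What "cv=5" does can be spelled out by hand: the data is split into five folds, and each fold is used once as the test set while a fresh copy of the model is trained on the other four. The sketch below writes this loop explicitly and compares it with "cross_val_score()". It assumes a classifier with an integer "cv", in which case Scikit-Learn uses stratified folds without shuffling; the data and names are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (hypothetical, only for this example)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A fixed random_state makes every refit of the model deterministic
demo_clf = RandomForestClassifier(n_estimators=20, random_state=42)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # Train a fresh, unfitted copy of the model on four folds
    fold_clf = clone(demo_clf).fit(X_demo[train_idx], y_demo[train_idx])
    # Score it on the held-out fold
    manual_scores.append(fold_clf.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Because every row is used for testing exactly once, the mean of the five scores is usually a more reliable estimate than the score of a single train/test split.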

### 4.2.1. Classification  model evaluation metrics

1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification Report

**Accuracy**

In [None]:
heart_disease.head()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target", axis=1)

y = heart_disease["target"]

clf = RandomForestClassifier()
cv_scores = cross_val_score(clf, X, y, cv=5) #Use a new name so the cross_val_score function is not overwritten
cv_scores

In [None]:
np.mean(cv_scores)

In [None]:
print(f"Heart Disease Classifier Accuracy: {np.mean(cv_scores) * 100:.2f}%")


**Area under the receiver operating characteristic curve (AUC/ROC)**

* Area under curve (AUC)
* ROC curve

ROC curves compare a model's true positive rate (tpr) with its false positive rate (fpr) at different classification thresholds.

* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1
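
The four definitions above translate directly into the two rates that a ROC curve plots. The following is a small worked example in plain Python, using made-up labels and probabilities, that computes the true positive rate and false positive rate for a given threshold. The helper "rates_at()" is a hypothetical name for this sketch.

```python
# Made-up truth labels and predicted probabilities of class 1
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1, 0.7, 0.3]


def rates_at(threshold):
    """Return (tpr, fpr) when predicting 1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    tpr = tp / (tp + fn)  # share of actual positives that were found
    fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
    return tpr, fpr


for threshold in (0.95, 0.5, 0.35):
    print(threshold, rates_at(threshold))
```

Lowering the threshold can only keep or raise both rates; plotting the (fpr, tpr) pairs over all thresholds gives the ROC curve.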

In [None]:
# Create X_test...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from sklearn.metrics import roc_curve 

clf.fit(X_train, y_train)


#Make predictions with probabilities
y_probs = clf.predict_proba(X_test)

y_probs[:10], len(y_probs)

In [None]:
y_probs_positive = y_probs[:, 1]
y_probs_positive[:10]

In [None]:
# Calculate fpr, tpr and thresholds

fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)

#Check the false positive rates, true positive rates and thresholds
fpr, tpr, thresholds

In [None]:
# Create a function for plotting ROC curves

import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr):
    """
    Plots a ROC curve given the false positive rate (fpr)
    and true positive rate (tpr) of a model
    """
    
    # Plot roc curve
    plt.plot(fpr, tpr, color="orange", label="ROC")
    
    #Plot line with no predictive power (baseline)
    plt.plot([0, 1], [0, 1], color="darkblue", linestyle="--", label="Guessing")
    
    #Customize the plot
    plt.xlabel("False positive rate (fpr)")
    plt.ylabel("True positive rate (tpr)")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend()
    plt.show()

plot_roc_curve(fpr, tpr)

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_probs_positive)

In [None]:
# Plot perfect ROC curve and AUC score

fpr, tpr, thresholds = roc_curve(y_test, y_test)
plot_roc_curve(fpr, tpr)

In [None]:
# Perfect AUC score
roc_auc_score(y_test, y_test)

**Confusion Matrix**

A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict. 

In essence, it gives you an idea of where the model is getting confused.

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

In [None]:
from sklearn.metrics import confusion_matrix 

y_preds = clf.predict(X_test)

confusion_matrix(y_test, y_preds)

In [None]:
# Visualize confusion matrix with pd.crosstab()

pd.crosstab(y_test, y_preds, 
            rownames=["Actual Label"],
            colnames=["Predicted Labels"])

In [None]:
!conda activate C:\Users\Franz\Desktop\sample_project_1\env


In [None]:
!conda env list


In [None]:
!conda install --yes seaborn

In [None]:
# Make our confusion matrix more visual with seaborn heatmap()
import seaborn as sns

#Set the font scale 
sns.set(font_scale=1.5)

#Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)

#Plot using seaborn
sns.heatmap(conf_mat);

### Creating a confusion matrix using Scikit-Learn

The ConfusionMatrixDisplay.from_estimator() and ConfusionMatrixDisplay.from_predictions() methods used below require Scikit-Learn 1.0 or later.

In [None]:
clf

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(estimator=clf, X=X, y=y)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=y_preds)

**Classification Report**

In [None]:
from sklearn.metrics import classification_report 

print(classification_report(y_test, y_preds))

In [None]:
# Where precision and recall become valuable
disease_true = np.zeros(10000)
disease_true[0] = 1 # only one positive case


disease_preds = np.zeros(10000)

pd.DataFrame(classification_report(disease_true, disease_preds, output_dict=True))

**To summarize classification metrics:**
    
   * Accuracy is a good measure to start with if all classes are balanced (e.g. the same number of samples per class)

   * Precision and recall become more important when classes are imbalanced

   * If false positive predictions are worse than false negatives, aim for higher precision

   * If false negative predictions are worse than false positives, aim for higher recall

   * F1 score is the harmonic mean of precision and recall
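
The imbalanced disease example from the cell above can be worked through by hand to see why accuracy alone is misleading. This sketch uses plain Python and the same setup (one positive case in 10,000, a model that always predicts 0); treating an undefined 0/0 ratio as 0.0 is a convention chosen here for the example.

```python
# One positive case among 10,000 samples; the "model" always predicts 0
y_true = [1] + [0] * 9999
y_pred = [0] * 10000

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)


def safe_divide(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


accuracy = correct / len(y_true)
precision = safe_divide(tp, tp + fp)
recall = safe_divide(tp, tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = safe_divide(2 * precision * recall, precision + recall)

print(f"accuracy={accuracy}, precision={precision}, recall={recall}, f1={f1}")
```

The accuracy looks excellent, yet the recall for the positive class shows that the single real case was missed.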

### 4.2.2. Regression model evaluation metrics
https://scikit-learn.org/stable/modules/model_evaluation.html

The ones we are going to cover are:

1. R^2 or coefficient of determination
2. Mean absolute error (MAE)
3. Mean squared error (MSE)


In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df["target"] = housing["target"]
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df.drop("target", axis=1)
y = housing_df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)

In [None]:
housing_df.head()

In [None]:
y_test.mean()

In [None]:
from sklearn.metrics import r2_score

#Fill an array with y_test mean()
y_test_mean = np.full(len(y_test), y_test.mean())

In [None]:
y_test_mean[:10]

In [None]:
r2_score(y_test, y_test_mean) #If a model only predicts the mean of the targets, its R^2 is 0. A perfect model achieves 1.0
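
The behaviour above follows from the definition of R^2 as 1 minus the ratio of the squared prediction errors to the squared deviations of the targets from their mean. Below is a minimal sketch in plain Python on made-up numbers; "r_squared()" is a hypothetical helper written for this example, not a Scikit-Learn function.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]

r2_mean_only = r_squared(actual, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
r2_perfect = r_squared(actual, actual)                  # predict every value exactly
r2_reversed = r_squared(actual, [4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean

print(r2_mean_only, r2_perfect, r2_reversed)
```

The last case shows that R^2 has no lower bound of 0: a model that does worse than always predicting the mean gets a negative score.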

**Mean absolute error (MAE)**

MAE is the average of the absolute differences between predictions and actual values.
It gives you an idea of how wrong your model's predictions are.

In [None]:
#MAE
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)

mae = mean_absolute_error(y_test, y_preds)
mae

In [None]:
y_preds

In [None]:
y_test

In [None]:
df = pd.DataFrame(data={"actual values": y_test,
                       "predicted values": y_preds})

df["differences"] = df["predicted values"] - df["actual values"]

df.head(10)

In [None]:
np.abs(df["differences"]).mean() #Same as the MAE (average of the absolute differences between predicted and true values)

**Mean Squared Error (MSE)**

MSE is the mean of the squared errors, i.e. the squared differences between predictions and actual values.
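
Because the errors are squared before averaging, MSE punishes a few large errors much more than MAE does. The sketch below illustrates this with plain Python on made-up numbers: two sets of predictions with the same MAE but different MSE. The helper names are chosen for this example.

```python
def mean_abs_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_sq_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


actual = [10, 10, 10, 10]
preds_even = [11, 9, 11, 9]       # every prediction is off by 1
preds_outlier = [10, 10, 10, 14]  # three exact predictions, one off by 4

mae_even = mean_abs_error(actual, preds_even)
mae_outlier = mean_abs_error(actual, preds_outlier)
mse_even = mean_sq_error(actual, preds_even)
mse_outlier = mean_sq_error(actual, preds_outlier)

print(mae_even, mae_outlier)  # identical MAE
print(mse_even, mse_outlier)  # MSE is four times larger with the single big error
```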


In [None]:
#Mean squared error

from sklearn.metrics import mean_squared_error

y_preds = model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
mse

In [None]:
df["squared differences"] = np.square(df["differences"])

df.head()

In [None]:
#Calculate MSE by hand

squared = np.square(df["differences"])

squared.mean()

In [None]:
df_large_error = df.copy()
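#Set the squared difference of the first row to a large value (the column name contains a space)
df_large_error.iloc[0, df_large_error.columns.get_loc("squared differences")] = 16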

In [None]:
df_large_error.head()

In [None]:
#calculate MSE with large error

df_large_error["squared differences"].mean()

In [None]:
df_large_error2 = df_large_error.copy()

df_large_error2.iloc[1:1000] = 20
df_large_error2.head(100)

In [None]:
df_large_error2["squared differences"].mean()

### 4.2.3. Finally, using the scoring parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


clf = RandomForestClassifier(n_estimators=100)

In [None]:
np.random.seed(42)

#Cross-validation accuracy

cv_acc = cross_val_score(clf, X, y, cv=5, scoring=None)  #If scoring=None, the estimator's default scoring metric is used (accuracy for classifiers)

cv_acc

In [None]:
# Cross-validated accuracy 
print(f"The cross-validated accuracy is: {np.mean(cv_acc)*100:.2f}%")

In [None]:
np.random.seed(42)

cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
# Cross-validated accuracy 
print(f"The cross-validated accuracy is: {np.mean(cv_acc)*100:.2f}%")

In [None]:
#Precision
np.random.seed(42)
cv_precision = cross_val_score(clf, X, y, cv=5, scoring="precision")
cv_precision


In [None]:
print(f"The cross-validated precision is: {np.mean(cv_precision)*100:.2f}%")

In [None]:
#Recall 
np.random.seed(42)
cv_recall = cross_val_score(clf, X, y, cv=5, scoring="recall")
cv_recall

In [None]:
# Cross-validated recall
print(f"The cross-validated recall is: {np.mean(cv_recall)*100:.2f}%")

Let's see the "scoring" parameter being used for a regression problem...

In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df["target"] = housing["target"]
housing_df.head()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df.drop("target", axis=1)

y = housing_df["target"]

model = RandomForestRegressor(n_estimators=100)



In [None]:
np.random.seed(42)
#scoring=None uses the regressor's default metric (R^2)
cv_r2 = cross_val_score(model, X, y, cv=3, scoring=None)
np.mean(cv_r2)


In [None]:
cv_r2

In [None]:
#Mean squared error
np.random.seed(42)
cv_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

np.mean(cv_mse)

In [None]:
cv_mse

In [None]:
# Mean absolute error

np.random.seed(42)
cv_mae = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
np.mean(cv_mae)

In [None]:
cv_mae
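The `neg_` prefix is there because scikit-learn scorers follow a "higher is better" convention, so error metrics are returned negated. A minimal sketch of turning the scores back into ordinary errors follows. It assumes synthetic data from `make_regression` and a `LinearRegression` model purely to keep it fast, neither of which is used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption, not the California housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_squared_error")

# Flip the sign to recover the usual (non-negative) MSE per fold
mse_per_fold = -neg_mse

# Root mean squared error is in the same units as the target
rmse_per_fold = np.sqrt(mse_per_fold)

print(mse_per_fold.mean(), rmse_per_fold.mean())
```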

## 4.3. Using different evaluation metrics as Scikit-Learn functions

The 3rd way to evaluate scikit-learn machine learning models/estimators is to use the functions in the sklearn.metrics module. See https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


np.random.seed(42)
#Create X & y


X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#split the data 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


#Create the model
clf = RandomForestClassifier(n_estimators=100)

#Fit the model
clf.fit(X_train, y_train)

#Make predictions
y_preds = clf.predict(X_test)

#Evaluate the model
print("Classifier metrics on the test set")
print(f"Accuracy: {accuracy_score(y_test, y_preds)*100:.2f}%")
print(f"Precision: {precision_score(y_test, y_preds)*100:.2f}%")
print(f"Recall: {recall_score(y_test, y_preds)*100:.2f}%")
print(f"F1: {f1_score(y_test, y_preds)*100:.2f}%")
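To make the four classification metrics less of a black box, here is a small sketch that derives them from true/false positive counts on made-up labels and checks them against the `sklearn.metrics` functions. The arrays are illustrative, not model output.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative labels only
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 0, 1, 1, 1, 0])

tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))  # predicted 1, actually 1
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))  # predicted 1, actually 0
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))  # predicted 0, actually 1
tn = np.sum((y_true_demo == 0) & (y_pred_demo == 0))  # predicted 0, actually 0

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
print(accuracy_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```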

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create X & y

X = housing_df.drop("target", axis=1)
y = housing_df["target"]


#Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Create model
model = RandomForestRegressor()

#Fit the model
model.fit(X_train, y_train)

#Make predictions
y_preds = model.predict(X_test)

#Evaluate the model using evaluation functions

print("Regression metrics on the test set")
print(f"R2: {r2_score(y_test, y_preds)}")
print(f"MAE: {mean_absolute_error(y_test, y_preds)}")
print(f"MSE: {mean_squared_error(y_test, y_preds)}")
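For intuition about R2, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, again on made-up numbers rather than the housing data. It also shows that always predicting the mean of the targets gives an R2 of 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # what the model gets wrong
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # spread around the mean
r2_by_hand = 1 - ss_res / ss_tot

# Baseline: predict the mean for every sample
y_mean_preds = np.full_like(y_true_demo, y_true_demo.mean())
r2_mean_baseline = r2_score(y_true_demo, y_mean_preds)

print(r2_by_hand, r2_score(y_true_demo, y_pred_demo), r2_mean_baseline)
```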

## 5. Improving a model

First predictions = baseline predictions.

First model = baseline model.

**From a data perspective:**
* Could we collect more data? (generally, the more data, the better)
* Could we improve our data?

**From a model perspective:**
* Is there a better model we could use?
* Could we improve the current model?

**Hyperparameters vs. Parameters**

Parameters = the patterns a model finds in the data during fitting.

Hyperparameters = settings on a model that you can adjust to (potentially) improve its ability to find patterns.
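One way to see the difference in code: hyperparameters are available before fitting through `get_params()`, while learned attributes (scikit-learn names them with a trailing underscore) only exist after `fit()`. This is a minimal sketch on synthetic data, not the heart disease set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

clf_demo = RandomForestClassifier(n_estimators=15, random_state=42)

# Hyperparameter: chosen by us, readable before any data is seen
n_estimators_setting = clf_demo.get_params()["n_estimators"]

# Learned attributes do not exist yet
fitted_before = hasattr(clf_demo, "estimators_")

clf_demo.fit(X_demo, y_demo)

# After fitting: the trees and the feature importances learned from the data
n_trees = len(clf_demo.estimators_)
importances = clf_demo.feature_importances_

print(n_estimators_setting, fitted_before, n_trees, importances)
```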

Three ways to adjust hyperparameters:
1. By hand
2. Randomly with RandomizedSearchCV
3. Exhaustively with GridSearchCV 

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [None]:
#Finding Hyperparameters of a model
clf.get_params()

### 5.1 Tuning hyperparameters by hand

Let's make 3 sets: training, validation and test.


In [None]:
clf.get_params()

We are going to try and adjust: 
   * max_depth
   * max_features
   * min_samples_leaf
   * min_samples_split
   * n_estimators

In [None]:
def evaluate_preds(y_true, y_preds):
    """
    Compares y_true labels with y_preds labels for a classification model
    and returns accuracy, precision, recall and F1 as a dict.
    
    """
    
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metrics_dict = {"accuracy": round(accuracy, 2),
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1: {f1:.2f}")
    return metrics_dict

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

#Shuffle the data
heart_disease_shuffled = heart_disease.sample(frac=1)

#Split into X and y

X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

#Split the data into train, validation & test sets
train_split = round(0.7 * len(heart_disease_shuffled)) #70% of data
valid_split = round(train_split + 0.15 * len(heart_disease_shuffled))
X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:valid_split], y[train_split:valid_split]
X_test, y_test = X[valid_split:], y[valid_split:]

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

#Make baseline predictions
y_preds = clf.predict(X_valid)


#Evaluate the classifier on the validation set
baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics
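The slicing above builds the three sets by hand. A possible alternative, sketched here on synthetic data with hypothetical variable names, is to call `train_test_split` twice: once to hold out 30% of the rows, then again to cut that held-out part in half.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# First cut: roughly 70% train, 30% held out
X_train_d, X_temp, y_train_d, y_temp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Second cut: split the held-out rows evenly into validation and test
X_valid_d, X_test_d, y_valid_d, y_test_d = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train_d), len(X_valid_d), len(X_test_d))
```

`train_test_split` shuffles by default, so a separate `sample(frac=1)` step is not needed with this approach.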

In [None]:
clf.get_params()

In [None]:
np.random.seed(42)

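#Create a second classifier, setting n_estimators explicitly
#(100 is already the default in scikit-learn 0.22 and later, so results should be close to the baseline)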

clf_2 = RandomForestClassifier(n_estimators=100)
clf_2.fit(X_train, y_train)

#Make prediction with different hyperparameters
y_preds2 = clf_2.predict(X_valid)

#Evaluate the 2nd classifier
clf_2_metrics = evaluate_preds(y_valid, y_preds2)

In [None]:
clf_3 = RandomForestClassifier(n_estimators=100,
                              max_depth=10)
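Tuning by hand usually means repeating the fit, predict and evaluate steps once per candidate setting. A minimal sketch of that loop for `max_depth`, using synthetic data and hypothetical variable names, could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Try each candidate depth and record validation accuracy
valid_scores = {}
for depth in [None, 5, 10]:
    clf_demo = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    clf_demo.fit(X_tr, y_tr)
    valid_scores[depth] = accuracy_score(y_val, clf_demo.predict(X_val))

# Pick the depth with the highest validation accuracy
best_depth = max(valid_scores, key=valid_scores.get)
print(valid_scores, best_depth)
```

This quickly becomes tedious with several hyperparameters, which is what motivates the automated searches in the following sections.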

### 5.2. Hyperparameter tuning with RandomizedSearchCV

In [None]:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200], 
       "max_depth": [None, 5, 10, 20, 30],
       "max_features": ["sqrt", "log2"],  #"auto" was deprecated in scikit-learn 1.1 and removed in 1.3
       "min_samples_split": [2, 4, 6],
       "min_samples_leaf": [1, 2, 4]}

np.random.seed(42)

#Split into X and y 

X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

#Split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

#Setup Randomized SearchCV
rs_clf = RandomizedSearchCV(estimator=clf, 
                           param_distributions=grid,
                           n_iter=10, #number of models to try
                           cv=5,
                           verbose=2)

#Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

In [None]:
#Which hyperparameters gave the best cross-validated results?
#When we call predict(), rs_clf uses the model refit with these hyperparameters
rs_clf.best_params_


In [None]:
#Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)

#Evaluate the metrics
rs_metrics = evaluate_preds(y_test, rs_y_preds)
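Beyond `best_params_`, a fitted search object also exposes `best_score_` and a `cv_results_` dictionary describing every combination that was tried. The sketch below uses a deliberately tiny search on synthetic data. The grid values and variable names are assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative grid (12 combinations in total)
grid_demo = {"n_estimators": [10, 20, 50],
             "max_depth": [None, 5],
             "min_samples_split": [2, 4]}

rs_demo = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=42),
                             param_distributions=grid_demo,
                             n_iter=4,          # sample 4 of the 12 combinations
                             cv=3,
                             random_state=42)
rs_demo.fit(X_demo, y_demo)

# One row per sampled combination, ranked by mean cross-validated score
results_df = pd.DataFrame(rs_demo.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]].sort_values("rank_test_score")

print(rs_demo.best_score_)
print(results_df)
```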

### 5.3. Hyperparameter tuning with GridSearchCV

In [None]:
grid

In [None]:
grid_2 = {'n_estimators': [100, 200, 500],
          'max_depth': [None],
          'max_features': ['sqrt', 'log2'],  #'auto' was deprecated in scikit-learn 1.1 and removed in 1.3
          'min_samples_split': [6],
          'min_samples_leaf': [1, 2]}

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)

#Split into X and y 

X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

#Split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

#Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf, 
                      param_grid=grid_2,
                      cv=5,
                      verbose=2)

#Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);


In [None]:
gs_y_preds = gs_clf.predict(X_test)

#evaluate the predictions
gs_metrics = evaluate_preds(y_test, gs_y_preds)
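The reason for shrinking the grid before running `GridSearchCV` is cost: it fits every combination once per fold. A rough sketch of that arithmetic using `ParameterGrid` follows. The dictionaries mirror the shapes of `grid` and `grid_2` above, with `max_features` options chosen for illustration, since only the number of options per key affects the count.

```python
from sklearn.model_selection import ParameterGrid

# Same number of options per key as the larger grid above
grid_demo = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
             "max_depth": [None, 5, 10, 20, 30],
             "max_features": ["sqrt", "log2"],
             "min_samples_split": [2, 4, 6],
             "min_samples_leaf": [1, 2, 4]}

# Same number of options per key as the reduced grid above
grid_2_demo = {"n_estimators": [100, 200, 500],
               "max_depth": [None],
               "max_features": ["sqrt", "log2"],
               "min_samples_split": [6],
               "min_samples_leaf": [1, 2]}

cv_folds = 5
n_full = len(ParameterGrid(grid_demo))     # every combination of the full grid
n_small = len(ParameterGrid(grid_2_demo))  # every combination of the reduced grid

print(f"Full grid: {n_full} combinations, {n_full * cv_folds} cross-validation fits")
print(f"Reduced grid: {n_small} combinations, {n_small * cv_folds} cross-validation fits")
print(f"Randomized search with n_iter=10: {10 * cv_folds} cross-validation fits")
```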

Let's compare our different models' metrics.

In [None]:
compare_metrics = pd.DataFrame({"baseline": baseline_metrics,
                               "clf_2": clf_2_metrics,
                               "random_search": rs_metrics,
                               "grid search": gs_metrics}
                              )

compare_metrics.plot.bar(figsize=(10,8))

## 6. Saving and loading machine learning models

Two ways to save and load machine learning models:
1. Python "pickle" module
2. With the "joblib" module

**Pickle**

In [None]:
import pickle

# Save an existing model to file ("wb" means write binary)
# The with-block closes the file handle once the model is written
with open("gs_random_forest_model_1.pkl", "wb") as f:
    pickle.dump(gs_clf, f)

In [None]:
#Load a saved model ("rb" means read binary)
with open("gs_random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

In [None]:
#Make prediction
pickle_y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_test, pickle_y_preds)

**Joblib**

In [None]:
from joblib import dump, load
#Save model to file

dump(gs_clf, filename="gs_random_forest_model_1.joblib")


In [None]:
#Import a saved joblib model

loaded_job_model = load(filename="gs_random_forest_model_1.joblib")

In [None]:
#Make and evaluate joblib prediction

joblib_y_preds = loaded_job_model.predict(X_test)
evaluate_preds(y_test, joblib_y_preds)
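A useful check after saving is that the reloaded model predicts exactly what the original did. The sketch below runs that round trip for both modules on a small synthetic model, writing to a temporary directory so that no files are left behind. The file names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model (assumption: stands in for gs_clf)
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Round trip with pickle
    pkl_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(pkl_path, "wb") as f:
        pickle.dump(clf_demo, f)
    with open(pkl_path, "rb") as f:
        from_pickle = pickle.load(f)

    # Round trip with joblib
    joblib_path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(clf_demo, joblib_path)
    from_joblib = load(joblib_path)

original_preds = clf_demo.predict(X_demo)
same_pickle = np.array_equal(original_preds, from_pickle.predict(X_demo))
same_joblib = np.array_equal(original_preds, from_joblib.predict(X_demo))
print(same_pickle, same_joblib)
```

Both formats execute arbitrary code on load, so only load model files from sources you trust, and ideally with the same scikit-learn version that saved them.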

## 7. Putting it all together!

In [None]:
data = pd.read_csv("car-sales-extended-missing-data.csv")

In [None]:
data

In [None]:
data.dtypes


In [None]:
data.isna().sum()

Steps we want to do (all in one cell):

1. Fill missing data
2. Convert data to numbers
3. Build a model on the data

In [None]:
#Getting data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

#Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

#Setup random seed
import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv("car-sales-extended-missing-data.csv")
data.dropna(subset=["Price"], inplace=True)

#Define different features and transformer pipeline
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))
])

numeric_feature = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

#Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_feature)
    ])

# Creating a preprocessing and modeling pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor())
])

#Split data 
X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#Fit and score the model
model.fit(X_train, y_train)
model.score(X_test, y_test)


It's also possible to use "GridSearchCV" or "RandomizedSearchCV" with our Pipeline.
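When searching over a Pipeline, each hyperparameter is addressed by chaining step names with double underscores, for example `preprocessor__num__imputer__strategy`. The sketch below rebuilds a reduced version of the pipeline (numeric column only, an assumption made for brevity) and lists the available names with `get_params()`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Reduced version of the pipeline above: numeric feature only
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])
preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, ["Odometer (KM)"])])
demo_model = Pipeline(steps=[("preprocessor", preprocessor),
                             ("model", RandomForestRegressor())])

# Every tunable name: <step>__<sub-step>__...__<hyperparameter>
param_names = sorted(demo_model.get_params().keys())
print([name for name in param_names if name.endswith("__strategy")])

# The same names work with set_params (and therefore in a search grid)
demo_model.set_params(preprocessor__num__imputer__strategy="median")
new_strategy = demo_model.get_params()["preprocessor__num__imputer__strategy"]
print(new_strategy)
```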

In [None]:
#Use GridSearchCV with our regression Pipeline
from sklearn.model_selection import GridSearchCV

pipe_grid = {
    "preprocessor__num__imputer__strategy" : ["mean", "median"],
    "model__n_estimators": [100, 1000],
    "model__max_depth": [None, 5],
    "model__max_features": [1.0],  #all features; replaces "auto", which was removed in scikit-learn 1.3
    "model__min_samples_split": [2, 4]
}

gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose=2)
gs_model.fit(X_train, y_train)

In [None]:
gs_model.score(X_test, y_test)