# Data Science Nigeria: Introductory Machine Learning

![](../Images/banner.jpeg)

# INTRODUCTION TO CLASSIFICATION

## Course Overview 

Upon completion of this study unit, you should be able to:

- Distinguish the different types of classification based on the number of classes

- List types of classification algorithms

- Build classification models using scikit-learn

- Evaluate a classification model's performance



In classification, we predict the category that a data point belongs to, i.e. classification algorithms are used to predict labels. Examples include:
* Spam Detection
* Churn Prediction
* Sentiment Analysis
* Dog Breed Detection

### TYPES OF CLASSIFICATION TASK

* Binary classification, e.g. e-mail spam detection (1 → spam; 0 → not spam), biometric identification, or whether a customer will default or not
* Multi-class classification, e.g. digit recognition (where classes go from 0 to 9) or predicting which party wins an election
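
As a small illustration of the two task types, the sketch below asks scikit-learn which kind of target a list of labels represents, using `type_of_target` from `sklearn.utils.multiclass`. The label lists are made up for this example and are not taken from any dataset in this lesson.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical labels, made up for illustration
spam_labels = [1, 0, 0, 1, 0]        # 1 = spam, 0 = not spam
digit_labels = [0, 3, 9, 4, 3, 7]    # digits between 0 and 9

# type_of_target inspects the labels and names the kind of task
spam_task = type_of_target(spam_labels)
digit_task = type_of_target(digit_labels)

print(spam_task)
print(digit_task)
```

Two distinct label values give a binary task, while more than two give a multi-class task.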

<center><img src="..\Images\class.png" style="width: 800px; height:400px"/></center>


### Types of Classification Algorithms
- Logistic Regression         
- Naive Bayes Classifier
- Nearest Neighbor			
- Support Vector Machines
- Decision Trees				
- Boosted Trees
- Random Forest	            


## The Scikit-learn Library

Scikit-learn is a Python library that provides many supervised and unsupervised learning algorithms. It builds on NumPy and SciPy and works well with packages you are already familiar with, like Pandas and Matplotlib!

The functionality that scikit-learn provides includes:

- Regression

- Classification

- Clustering

- Model selection

- Preprocessing

### Installation

The easiest way to install scikit-learn is using:

`pip install -U scikit-learn`

or 

`conda install -c conda-forge scikit-learn`


### Importing Scikit-learn Module


Some of the classification models that can be imported from the sklearn library include:

* **Logistic Regression**: `from sklearn.linear_model import LogisticRegression`
* **K Nearest Neighbor**: `from sklearn.neighbors import KNeighborsClassifier`
* **Support Vector Machine**: `from sklearn.svm import SVC`
* **Decision Trees Classifier**: `from sklearn.tree import DecisionTreeClassifier`
* **Random Forest Classifier**: `from sklearn.ensemble import RandomForestClassifier`
* **Gradient Boost Classifier**: `from sklearn.ensemble import GradientBoostingClassifier`
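
All of these classifiers share the same basic interface: create the object, call `fit()` on training data, then call `predict()`. The sketch below shows this on synthetic data generated with `make_classification`; the data, the variable names and the choice of three models are assumptions made for illustration, not part of the insurance example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (not the insurance dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K Nearest Neighbor": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Every classifier is trained and used in exactly the same way
demo_predictions = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    demo_predictions[name] = model.predict(X_demo[:5])
    print(name, demo_predictions[name])
```

Because the interface is shared, swapping one algorithm for another usually means changing only the line that creates the model.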

## Building Classification Machine Learning Model for AXA Mansard Medical Insurance 

### Problem statement

You work as an analyst in the marketing department of a company that provides various medical insurance products in Nigeria. Your manager is unhappy with the low sales volume of a specific kind of insurance. The data engineer provides you with a sample dataset of people who visited the company website for medical insurance.

The dataset contains the following columns:

- User ID
- Gender
- Age
- EstimatedSalary
- Purchased: An indicator of whether the user purchased (1) or did not purchase (0) a particular product.

We plan to use the following classifiers to predict whether a person who visits the insurance company will buy or not.

- Logistic regression

- Random forest

- Naive Bayes

- XGBoost

- Support Vector Machine

### Import Python modules

We need to import some packages that will enable us to explore the data and build machine learning models

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
insurance = pd.read_csv("../Data/Medical_insurance_dataset.csv")

insurance.head(20)

In [None]:
insurance.shape

We have 5 variables and 400 instances in this data, one for each person who visited the company for medical insurance. The User ID is a random number generated for every customer who comes to the company for medical insurance. Therefore, it is not useful in predicting whether the person will buy medical insurance or not, and we will remove that variable from the data.

In [None]:
insurance.drop(["User ID"], axis= "columns", inplace= True)

In [None]:
insurance.head()

We want to recode the label `Purchased` to have $1$ for those who bought the insurance and $0$ for those who did not. This makes the output variable (label) numeric.

In [None]:
insurance["Purchased"] = insurance["Purchased"].apply(lambda x: 1 if x == "purchased" else 0)

In [None]:
insurance.head()

Now we have 3 features, `Gender`, `Age`, and `EstimatedSalary`, while `Purchased` is the label in this data. Since the label has just two classes or categories (purchased (1) and not-purchased (0)), this is a binary classification problem.

# Exploratory Data Analysis

Facts generated by exploratory data analysis will help us identify the features that can predict whether a person will purchase medical insurance or not. Let us start by visualizing the proportion of those who bought medical insurance and those who did not.

In [None]:
sns.countplot(x = "Purchased", data = insurance);

As you can see, the majority of those who visited the medical insurance company did not buy the insurance. This is an example of class imbalance. That is, the proportions of those who bought and those who did not are not equal.
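
To put a number on the imbalance rather than reading it off the plot, one option is `value_counts(normalize=True)`, which returns the share of each class. The sketch below uses a small made-up Series; with the real data you would apply the same call to `insurance["Purchased"]`.

```python
import pandas as pd

# Hypothetical labels for illustration; replace with insurance["Purchased"]
purchased_demo = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Purchased")

# Share of each class; the values sum to 1
class_share = purchased_demo.value_counts(normalize=True)
print(class_share)
```

In this toy Series, 70% of the labels are 0 and 30% are 1, so the classes are clearly not balanced.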

In [None]:
sns.countplot(x = "Gender", data = insurance);

The proportion of males is almost the same as that of females.

In [None]:
sns.countplot(x = "Gender" , hue = "Purchased", data = insurance);

It seems that females were more likely to purchase the insurance than males.

In [None]:
sns.boxplot(x = "Purchased", y = "Age", data = insurance);

From the look of things, older people purchased the insurance more often than younger people.

In [None]:
sns.boxplot(x = "Purchased", y = "EstimatedSalary", data = insurance);

People who earned a higher salary tended to purchase the insurance, while those who earned less tended not to. This is expected: you are more likely to purchase medical insurance when you can afford it.
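
The boxplots can be complemented with a numeric summary per class. The sketch below groups a few made-up rows by `Purchased` and takes the median of `Age` and `EstimatedSalary`; the rows are hypothetical and only the column names follow the dataset used in this lesson.

```python
import pandas as pd

# Hypothetical rows for illustration; column names follow the tutorial's dataset
demo = pd.DataFrame({
    "Age": [22, 25, 47, 52, 30, 58],
    "EstimatedSalary": [20000, 35000, 90000, 120000, 40000, 110000],
    "Purchased": [0, 0, 1, 1, 0, 1],
})

# Median age and salary for non-buyers (0) and buyers (1)
group_summary = demo.groupby("Purchased")[["Age", "EstimatedSalary"]].median()
print(group_summary)
```

Running the same `groupby` on `insurance` gives the medians that the boxes in the plots are drawn around.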

## Model building

- Importing machine learning models

In [None]:
from sklearn import metrics
from sklearn.model_selection import train_test_split

## Data Preprocessing 
- Separating features and the label from the data

Now is the time to build machine learning models for the task of predicting whether the customers will buy medical insurance or not. First, we separate the set of features (`X`) from the label (`y`).

In [None]:
# split data into features and target

X = insurance.drop(["Purchased"], axis= "columns") # dropping the label variable (Purchased) from the data

y = insurance["Purchased"]

In [None]:
X.head()

In [None]:
y.head()

- One-hot encoding

As discussed in Part 3, we need to create a one-hot encoding for all the categorical features in the data because some algorithms cannot work with categorical data directly. They require all input variables and output variables to be numeric. In this case, we will create a one-hot encoding for the gender feature by using `pd.get_dummies()`.

In [None]:
pd.get_dummies(insurance["Gender"])

In fact, `pd.get_dummies()` can locate the categorical features in a DataFrame by itself and create a one-hot encoding for them. For example:

In [None]:
pd.get_dummies(X)

We now save this result of one-hot encoding into X.

In [None]:
X = pd.get_dummies(X)
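
To make the effect of `pd.get_dummies()` concrete, here is a minimal sketch on a three-row DataFrame. The rows are made up for illustration; only the idea carries over to `X`.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 26],
})

# Numeric columns are kept; each category of Gender becomes its own indicator column
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
print(toy_encoded)
```

The `Gender` column is replaced by one indicator column per category (`Gender_Female` and `Gender_Male`), while `Age` is left untouched.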

- Split the data into training and test set

As discussed in A, we will split our dataset (features (X) and label (y)) into training and test data by using the `train_test_split()` function from sklearn. The training set will be $80\%$ of the data while the test set will be $20\%$. Setting `random_state` to 1234 makes the split reproducible, so that all of us get the same training and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state= 1234)

We now have the pair of training data `(X_train, y_train)` and test data `(X_test, y_test)`
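
Because the classes are imbalanced, a plain random split can leave the training and test sets with different class proportions. An optional variation, not used in the rest of this lesson, is to pass `stratify=y` to `train_test_split()` so that both sets keep the same proportions. The sketch below shows this on synthetic labels (70 zeros and 30 ones, chosen only for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 70 zeros and 30 ones
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo keeps the 70/30 class ratio in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Both printed proportions are 0.3, which matches the share of ones in the full set of labels.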

- Model training

We will use the training data to build the model and then use the test data to make predictions and evaluate them.

## Logistic regression

Let's train a Logistic regression model with our training data. We need to import Logistic regression from the sklearn module.

In [None]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression

We now create an object of the `LogisticRegression` class and train it on the training data.

In [None]:
logisticmodel = LogisticRegression()

logisticmodel.fit(X_train, y_train)

`logisticmodel.fit()` trained the Logistic regression model. The model is now ready to make predictions for the unknown labels by using only the features from the test data (`X_test`).

In [None]:
logisticmodel.predict(X_test)

Let's save the prediction result into `logistic_prediction`. This is what the model predicted for us.

In [None]:
logistic_prediction = logisticmodel.predict(X_test)

- Model evaluation

Since we know the true label in the test set (i.e. `y_test`), we can compare this prediction with it, hence evaluate the logistic model. I have created a function that will help you visualize a confusion matrix for the logistic model and you can call on it henceforth to check the performance of any model.

In [None]:
def ConfusionMatrix(ytest, ypred, label=("Negative", "Positive")):
    "Plot a confusion matrix heatmap to check the model performance"
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    import seaborn as sns
    # Rows of cm are the true classes, columns are the predicted classes
    cm = confusion_matrix(ytest, ypred)
    plt.figure(figsize=(7, 5))
    sns.heatmap(cm, annot=True, cbar=False, fmt='d', cmap='YlGn')
    plt.xlabel('Predicted', fontsize=13)
    plt.xticks([0.5, 1.5], label)
    plt.yticks([0.5, 1.5], label)
    plt.ylabel('Truth', fontsize=13)
    plt.title('A confusion matrix');

By using the `ConfusionMatrix()` function, we have:

In [None]:
ConfusionMatrix(y_test, logistic_prediction, label= ["not-purchased", "purchased"])

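## Interpretation of the Logistic regression model evaluation performance

There are 53 True Negatives (TN): predicting that the customer will not buy the insurance and truly the customer did not buy the insurance.

There are 27 False Negatives (FN): predicting that the customer will not buy the insurance and the customer actually bought the insurance.

These two counts already account for all 80 customers in the test set, so there are no True Positives (TP) or False Positives (FP): the model predicted not-purchased for every customer.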

**We can check the accuracy by using:**

In [None]:
metrics.accuracy_score(y_test, logistic_prediction)

The accuracy of the model is $66.25\%$. We cannot trust this accuracy because the classes are imbalanced: a model that always predicts the majority class (not-purchased) already scores this high. Therefore, we are going to use the F1 score instead.

In [None]:
metrics.f1_score(y_test, logistic_prediction)

As you can see from the confusion matrix and the F1 score of $0$, this model is not effective at predicting whether or not a customer will buy the insurance.
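
The gap between accuracy and F1 can be reproduced without any trained model. The sketch below rebuilds test labels from the counts reported in the interpretation (53 customers who did not buy and 27 who did) and scores a hypothetical "model" that always predicts 0. The `zero_division=0` argument is an addition here that only silences the warning raised when there are no positive predictions.

```python
import numpy as np
from sklearn import metrics

# Labels reconstructed from the reported counts:
# 53 customers did not buy (0) and 27 bought (1)
y_true = np.array([0] * 53 + [1] * 27)

# A "model" that always predicts not-purchased (0)
y_always_zero = np.zeros_like(y_true)

baseline_accuracy = metrics.accuracy_score(y_true, y_always_zero)
baseline_f1 = metrics.f1_score(y_true, y_always_zero, zero_division=0)

print(baseline_accuracy, baseline_f1)
```

The accuracy comes out at 0.6625 while the F1 score is 0, which is why accuracy alone is misleading on imbalanced data.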

## Naive Bayes model

Let's train a Naive Bayes model with our training data. We need to import the Naive Bayes model from the sklearn module.

In [None]:
from sklearn.naive_bayes import GaussianNB

naivemodel = GaussianNB()

naivemodel.fit(X_train, y_train)

`naivemodel.fit()` trained the Naive Bayes model. The model is now ready to make prediction for the unknown label by using only the features from the test data (`X_test`).

In [None]:
naivemodel_prediction = naivemodel.predict(X_test)

You can call on `naivemodel_prediction` to see the predictions.

In [None]:
naivemodel_prediction

By using the `ConfusionMatrix()` function, we can see how the model performed:

In [None]:
ConfusionMatrix(y_test, naivemodel_prediction, label= ["not-purchased", "purchased"])

## Interpretation of the Naive Bayes model evaluation performance

There are 48 True Negatives (TN): predicting that the customer will not buy the insurance and truly the customer did not buy the insurance.

There are 20 True Positives (TP): predicting that the customer will buy the insurance and truly the customer did buy the insurance.

There are 7 False Negatives (FN): predicting that the customer will not buy the insurance and the customer actually bought the insurance.

There are 5 False Positives (FP): predicting that the customer will buy the insurance and the customer did not buy the insurance.
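
The four counts are all that is needed to compute accuracy and the F1 score by hand. The sketch below applies the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of precision and recall) to the Naive Bayes counts listed here; the variable names are chosen for this example.

```python
# Counts read off the Naive Bayes confusion matrix
tn, tp, fn, fp = 48, 20, 7, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # of predicted buyers, how many bought
recall = tp / (tp + fn)                      # of actual buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(f1, 4))
```

This prints `0.85 0.7692`, which agrees with the values returned by `metrics.accuracy_score()` and `metrics.f1_score()` for this model.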

## Evaluation metrics

We are going to check the **accuracy** and **F1** score of the model.

**We can check the accuracy by using:**

In [None]:
metrics.accuracy_score(y_test, naivemodel_prediction)

The accuracy of the model is $85\%$

**We can check the F1 score by using:**

In [None]:
metrics.f1_score(y_test, naivemodel_prediction)

The F1 score of the model is $76.9\%$

As you can see, this model seems good at predicting whether a customer will buy insurance or not.

## Random Forest Model

Let's train a Random Forest model with our training data. We need to import the Random Forest model from the sklearn module

In [None]:
from sklearn.ensemble import RandomForestClassifier

randomforestmodel = RandomForestClassifier()

randomforestmodel.fit(X_train, y_train)

`randomforestmodel.fit()` trained the Random Forest model on the training data. The model is now ready to make prediction for the unknown label by using only the features from the test data (`X_test`).

In [None]:
randomforestmodel_prediction = randomforestmodel.predict(X_test)

You can call on `randomforestmodel_prediction` to see the predictions.

In [None]:
randomforestmodel_prediction

By using the `ConfusionMatrix()` function, we can see how the model performed:

In [None]:
ConfusionMatrix(y_test, randomforestmodel_prediction, label= ["not-purchased", "purchased"])

## Interpretation of the Random Forest model evaluation performance

There are 44 True Negatives (TN): predicting that the customer will not buy the insurance and truly the customer did not buy the insurance.

There are 23 True Positives (TP): predicting that the customer will buy the insurance and truly the customer did buy the insurance.

There are 4 False Negatives (FN): predicting that the customer will not buy the insurance and the customer actually bought the insurance.

There are 9 False Positives (FP): predicting that the customer will buy the insurance and the customer did not buy the insurance.

## Evaluation metrics

We are going to check the **accuracy** and **F1** score of the model.

**We can check the accuracy by using:**

In [None]:
metrics.accuracy_score(y_test, randomforestmodel_prediction)

The accuracy of the model is $83.75\%$

**We can check the F1 score by using:**

In [None]:
metrics.f1_score(y_test, randomforestmodel_prediction)

The F1 score of the model is $77.97\%$

As you can see, this model seems good at predicting whether a customer will buy insurance or not.

## Extreme Gradient Boost (XGBoost) Model

Let's train an XGBoost model with our training data. XGBoost is not part of sklearn; it is a separate package, so we need to install it before we can import the model.

## How to install XGBoost

Go to your terminal and type:

`pip install xgboost`

![](../Images/install_XGboost.jpeg)

After installation, you can now import it as follows:

In [None]:
from xgboost import XGBClassifier

xgboostmodel = XGBClassifier(use_label_encoder=False)

xgboostmodel.fit(X_train, y_train)

`xgboostmodel.fit()` trained the XGBoost model on the training data. The model is now ready to make prediction for the unknown label by using only the features from the test data (`X_test`).

In [None]:
xgbboostmodel_prediction = xgboostmodel.predict(X_test)

You can call on `xgbboostmodel_prediction` to see the prediction

In [None]:
xgbboostmodel_prediction

By using the `ConfusionMatrix()` function, we can see how the model performed:

In [None]:
ConfusionMatrix(y_test, xgbboostmodel_prediction, label= ["not-purchased", "purchased"])

## Interpretation of the XGBoost model evaluation performance

There are 45 True Negatives (TN): predicting that the customer will not buy the insurance and truly the customer did not buy the insurance.

There are 21 True Positives (TP): predicting that the customer will buy the insurance and truly the customer did buy the insurance.

There are 6 False Negatives (FN): predicting that the customer will not buy the insurance and the customer actually bought the insurance.

There are 8 False Positives (FP): predicting that the customer will buy the insurance and the customer did not buy the insurance.

## Evaluation metrics

We are going to check the **accuracy** and **F1** score of the model.

**We can check the accuracy by using:**

In [None]:
metrics.accuracy_score(y_test, xgbboostmodel_prediction)

The accuracy of the model is $82.5\%$

**We can check the F1 score by using:**

In [None]:
metrics.f1_score(y_test, xgbboostmodel_prediction)

The F1 score of the model is $75\%$

As you can see, this model seems good at predicting whether a customer will buy insurance or not.

## Support Vector Machine (SVM)

Let's train a Support Vector Machine model with our training data. We need to import the Support Vector Machine model from the sklearn module

In [None]:
from sklearn.svm import SVC

SVMmodel = SVC()

SVMmodel.fit(X_train, y_train)

`SVMmodel.fit()` trained the Support Vector Machine on the training data. The model is now ready to make prediction for the unknown label by using only the features from the test data (`X_test`).

In [None]:
SVMmodel_prediction = SVMmodel.predict(X_test)

You can call on `SVMmodel_prediction` to see what has been predicted.

In [None]:
SVMmodel_prediction

By using the `ConfusionMatrix()` function, we can see how the model performed:

In [None]:
ConfusionMatrix(y_test, SVMmodel_prediction, label= ["not-purchased", "purchased"])

## Interpretation of the Support Vector Machine model evaluation performance

There are 50 True Negatives (TN): predicting that the customer will not buy the insurance and truly the customer did not buy the insurance.

There are 14 True Positives (TP): predicting that the customer will buy the insurance and truly the customer did buy the insurance.

There are 13 False Negatives (FN): predicting that the customer will not buy the insurance and the customer actually bought the insurance.

There are 3 False Positives (FP): predicting that the customer will buy the insurance and the customer did not buy the insurance.

## Evaluation metrics

We are going to check the **accuracy** and **F1** score of the model. 

**We can check the accuracy by using:**

In [None]:
metrics.accuracy_score(y_test, SVMmodel_prediction)

The accuracy of the model is $80\%$

**We can check the F1 score by using:**

In [None]:
metrics.f1_score(y_test, SVMmodel_prediction)

The F1 score of the model is $63.6\%$

As you can see, this model has a reasonable accuracy, but its F1 score is lower than that of the previous three models because it misses 13 of the 27 customers who actually bought the insurance.
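
Support vector machines are sensitive to the scale of the features, and `Age` and `EstimatedSalary` are on very different scales. A possible follow-up experiment, which is not part of the results reported in this lesson, is to standardize the features before fitting. The sketch below shows the mechanics with `make_pipeline` and `StandardScaler` on synthetic data; the data and the labelling rule are made up, and no claim is made about how much this would change the scores on the insurance data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data for illustration: an age-like and a salary-like column
rng = np.random.default_rng(0)
age = rng.integers(18, 60, size=200)
salary = rng.integers(15000, 150000, size=200)
X_demo = np.column_stack([age, salary])
y_demo = (age > 40).astype(int)  # made-up rule, only to have labels to fit

# The pipeline standardizes the features, then fits the SVM on the scaled values
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_demo, y_demo)

demo_pred = scaled_svm.predict(X_demo[:5])
print(demo_pred)
```

With the real data, the pipeline would be fitted on `X_train` and `y_train` and evaluated on `X_test` exactly like the other models.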

## Models Summary

+-----------------------+----------------------+-----------------------+
| Model                 | Accuracy (%)         | F1-score (%)          |
+=======================+======================+=======================+
| Logistic regression   | 66.25                | 0                     |
+-----------------------+----------------------+-----------------------+
| Naive Bayes           | 85                   | 76.92                 |
+-----------------------+----------------------+-----------------------+
| Random Forest         | 83.75                | 77.97                 |
+-----------------------+----------------------+-----------------------+
| XGBoost               | 82.5                 | 75                    |
+-----------------------+----------------------+-----------------------+
| SVM                   | 80                   | 63.63                 |
+-----------------------+----------------------+-----------------------+

![](../Images/metrics.jpeg)

Having trained all five (5) models, we can see that the Random Forest model has the highest F1 score ($77.97\%$), so it is the best model for predicting whether a customer will buy the insurance or not, even though Naive Bayes has a slightly higher accuracy ($85\%$).
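
The comparison can also be done programmatically. The sketch below copies the figures from the summary table into a DataFrame and ranks the models by F1 score; the DataFrame and its column names are introduced only for this example.

```python
import pandas as pd

# Figures copied from the summary table (in percent)
summary = pd.DataFrame({
    "Model": ["Logistic regression", "Naive Bayes", "Random Forest", "XGBoost", "SVM"],
    "Accuracy": [66.25, 85, 83.75, 82.5, 80],
    "F1": [0, 76.92, 77.97, 75, 63.63],
})

# Rank by F1 score, best first
ranked = summary.sort_values("F1", ascending=False).reset_index(drop=True)
best_by_f1 = ranked.loc[0, "Model"]
best_by_accuracy = summary.loc[summary["Accuracy"].idxmax(), "Model"]

print(ranked)
print(best_by_f1, "|", best_by_accuracy)
```

Ranking by F1 puts Random Forest first, while ranking by accuracy would put Naive Bayes first, so the choice of metric matters.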

## Class Activities

## Importing Scikit-learn Module

Use the following models to predict whether a customer will buy insurance or not. Your teacher has also included how to import those models for you.

* **K Nearest Neighbor**: `from sklearn.neighbors import KNeighborsClassifier`

* **Decision Trees Classifier**: `from sklearn.tree import DecisionTreeClassifier`

* **Gradient Boost Classifier**: `from sklearn.ensemble import GradientBoostingClassifier`

Which of the three (3) models is the best in terms of the F1 score?