# Supervised Learning
**Author**: Gabriel Lorenzo I. Santos (gsantos@ateneo.edu)

----------
The MIT License (MIT)

Copyright (c) 2023 Gabriel Lorenzo I. Santos

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

---------

This notebook covers the following Supervised Learning techniques: K-Nearest Neighbors, Decision Trees, Naive-Bayes, and Logistic Regression. Here, we will use the __[scikit-learn](https://scikit-learn.org/stable/index.html)__ library to create and run the models.  As for our dataset, we will use the output of the previous __[Data Preprocessing](https://github.com/gabosantos/1920cs129.15/blob/master/2%20Titanic%20Data%20Preprocessing/script/Data%20Preprocessing%20in%20Python.ipynb)__ notebook.

***Note: This notebook assumes that you are familiar with Python scripting and Jupyter notebook.  If you wish to learn the basics, check out the __[Data Preprocessing](https://github.com/gabosantos/1920cs129.15/blob/master/2%20Titanic%20Data%20Preprocessing/script/Data%20Preprocessing%20in%20Python.ipynb)__ notebook first.***

Let's import pandas to load and hold our data:

In [1]:
import pandas as pd

Then, we import our cleaned Titanic survivor dataset and save it to the variable _df_.  Let's set the _PassengerId_ column as the index:

In [2]:
df = pd.read_csv('../input/titanic_cleaned.csv', index_col="PassengerId")

Let's view the top 5 rows of our dataframe:

In [3]:
df.head()

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
1,0.0,1.0,0.375,0.2,0.0,0.111538,0.0,1.0,0.0,0.0,1.0
3,1.0,1.0,0.458333,0.0,0.0,0.121923,1.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.645833,0.2,0.0,0.816923,1.0,0.0,0.0,0.0,1.0
5,0.0,1.0,0.645833,0.0,0.0,0.123846,0.0,1.0,0.0,0.0,1.0
6,0.0,1.0,0.535398,0.0,0.0,0.130128,0.0,1.0,0.0,1.0,0.0


### Splitting the dataset
Before we create and use a model, we need to split our dataset into a training set and a test set.  The training set is used for model creation, and the test set is used for model usage and evaluation.

First, we need to separate the target/dependent variable from the features/independent variables:

In [4]:
features = df.drop('Survived', axis=1)
target = df['Survived']

**drop()** is used to remove the column *Survived* from the original dataframe.  The resulting dataframe is then saved to the variable *features*.  

The second line selects the *Survived* column alone from the original dataframe and saves it to the variable *target*. Note that **drop()** returns a new dataframe and leaves the original untouched, so the *Survived* column is still available in *df* for the second line.
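The behavior of **drop()** described above can be checked on a tiny dataframe. This is a minimal sketch; the dataframe *demo_df* and its values are made up for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical two-row dataframe, used only to illustrate drop().
demo_df = pd.DataFrame({"Survived": [0, 1], "Age": [0.375, 0.458]})

# drop() returns a new dataframe without the column...
demo_features = demo_df.drop("Survived", axis=1)

# ...while the original dataframe still has it, so we can select it afterwards.
demo_target = demo_df["Survived"]

print(list(demo_features.columns))
print(list(demo_df.columns))
```

If you do want to modify the dataframe itself, you would have to reassign the result back to the same variable.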

We will now import our first Python module to help us split the dataset into training and test sets:

In [5]:
from sklearn.model_selection import train_test_split

__[**train_test_split()**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)__ randomly selects rows from the features and target datasets and splits them into training and test datasets.  By default, 75% of the rows go to the training set and 25% go to the test set.  You can change the split by adding the parameter _'test_size = X'_, where X is a number between 0 and 1 indicating the proportion of rows assigned to the test set.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(features, target)

The function returns a list of four objects: the training features, the test features, the training labels, and the test labels.  In the code above, they are assigned to the *X_train*, *X_test*, *y_train*, and *y_test* variables, respectively.
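As a rough illustration of the *test_size* parameter, the sketch below splits a small made-up dataset 80:20. The arrays *demo_features* and *demo_target* are hypothetical stand-ins for the Titanic data. The *random_state* and *stratify* arguments are optional extras not used in the cell above: the first makes the split reproducible, and the second keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, 40 positive and 60 negative labels.
rng = np.random.default_rng(0)
demo_features = rng.random((100, 3))
demo_target = np.array([1] * 40 + [0] * 60)

# test_size=0.2 reserves 20% of the rows for the test set.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_features,
    demo_target,
    test_size=0.2,
    random_state=42,       # fixes the shuffle so the split is repeatable
    stratify=demo_target,  # keeps the share of each class equal in both sets
)

print(Xd_train.shape, Xd_test.shape)
print(yd_train.mean(), yd_test.mean())
```

Without *random_state*, each run produces a different split, so the accuracy rates reported later in a notebook like this one will vary from run to run.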

### Evaluating the Model
When testing the performance of a model, accuracy rate is one of the metrics being tracked.  **Accuracy Rate** refers to the percentage of correct predictions made by the model (both true positives and true negatives) out of all predictions made.  In order to get the true positives and true negatives, we will import and prepare our next Python module for later use:

In [7]:
from sklearn.metrics import confusion_matrix

------
## K-Nearest Neighbors
The K-Nearest Neighbors algorithm assigns an unlabeled or test datapoint with the label of the majority of its *K* neighbors. To use K-Nearest Neighbors, we will import the __[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)__ from scikit-learn.

In [8]:
from sklearn.neighbors import KNeighborsClassifier

Once imported, we will then create the KNeighborsClassifier object.  Here, we will input a very important parameter: *n_neighbors*, which dictates how many neighbors the classifier should look at:

In [9]:
knn = KNeighborsClassifier(n_neighbors = 3)

In the code above, we set *k* to 3 and left the other parameters at their default values.  You can pass additional parameters, such as *metric*, if you want to use a different distance measure, for example when you are working with spatial data points where Euclidean distance might be impractical.
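To make the remark about distance measures concrete, here is a minimal sketch that builds one classifier with the default distance and one with Manhattan distance through the *metric* parameter. The data comes from scikit-learn's *make_classification()* and is a synthetic stand-in, not the Titanic dataset; the variable names are chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 5 numeric features, binary labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Default settings: Minkowski distance with p=2, which is Euclidean distance.
knn_default = KNeighborsClassifier(n_neighbors=3)

# Same k, but neighbors are ranked by Manhattan (city-block) distance.
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric="manhattan")

knn_default.fit(X_demo, y_demo)
knn_manhattan.fit(X_demo, y_demo)

# The two models may disagree on some rows because their neighborhoods differ.
pred_default = knn_default.predict(X_demo)
pred_manhattan = knn_manhattan.predict(X_demo)
print((pred_default != pred_manhattan).sum())
```

Which metric works better depends on the data, so it is worth comparing them on the test set rather than assuming one is superior.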

The code below runs the first step in supervised learning: the **model creation**. As discussed in the lecture, it is usually done using the method *fit()* and it accepts two parameters: training features and training labels.

In [10]:
knn.fit(X_train, y_train)

The second step in supervised learning is the **model usage**, where we will use the model created from the first step to start making guesses on our test set.  The main purpose of doing this is to test the accuracy of our model.  As discussed in the lecture, it is usually done using the method *predict()* and it accepts one parameter: test features.

In [11]:
knn.predict(X_test)

array([1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.,
       1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.,
       1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 1., 1., 0.])

The output of the *predict()* method is the predicted classification of the data points in our test set. This is what we pass to the __[*confusion_matrix()*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)__ function to evaluate the model.  We include two parameters: the test labels and the predicted labels.

In [12]:
confusion_matrix(y_test, knn.predict(X_test))

array([[106,  13],
       [ 23,  35]])

For binary classification, the confusion matrix is a 2x2 array whose rows are the actual classes and whose columns are the predicted classes.  Read row by row, it contains the true negatives, false positives, false negatives, and true positives.  We can unpack the array into these 4 individual variables to calculate the accuracy rate:

In [13]:
tn, fp, fn, tp = confusion_matrix(y_test, knn.predict(X_test)).ravel()
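The order in which *ravel()* returns the four counts is easy to mix up, so here is a small sketch with five hand-made labels. The lists *y_true_demo* and *y_pred_demo* are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for five data points.
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

# Rows are actual classes (0, then 1); columns are predicted classes (0, then 1).
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# ravel() flattens the matrix row by row: tn, fp, fn, tp.
tn_demo, fp_demo, fn_demo, tp_demo = cm_demo.ravel()
print(tn_demo, fp_demo, fn_demo, tp_demo)  # 1 1 1 2
```

Here one actual 0 was predicted correctly (true negative), one actual 0 was predicted as 1 (false positive), one actual 1 was predicted as 0 (false negative), and two actual 1s were predicted correctly (true positives).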

For us to easily calculate the accuracy rate for the next algorithms, we will create our own function which accepts a confusion matrix as a parameter:

In [14]:
def accuracy_rate(confusion_matrix):
    tn, fp, fn, tp = confusion_matrix.ravel()
    return (tp + tn) / (tp + fp + fn + tn)

accuracy_rate(confusion_matrix(y_test, knn.predict(X_test)))

0.7966101694915254

This tells us that our K-Nearest Neighbors classifier is 79.66% accurate.
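As a cross-check of the formula, the sketch below rebuilds label arrays that reproduce the confusion matrix shown earlier (106, 13, 23, 35) and compares the manual calculation with scikit-learn's *accuracy_score()* function. The arrays are reconstructed for illustration and are not the actual test labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Reconstructed labels matching the matrix [[106, 13], [23, 35]]:
# 106 true negatives, 13 false positives, 23 false negatives, 35 true positives.
y_true_demo = np.array([0] * 106 + [0] * 13 + [1] * 23 + [1] * 35)
y_pred_demo = np.array([0] * 106 + [1] * 13 + [0] * 23 + [1] * 35)

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

# Manual accuracy: correct predictions over all predictions.
manual_accuracy = (tp + tn) / (tp + fp + fn + tn)

# The same quantity computed directly by scikit-learn.
builtin_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, builtin_accuracy)
```

Both values equal 141/177, which matches the 79.66% reported above.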

-----
## Decision Trees
The Decision Tree algorithm creates a tree of rules that determines how each data point is classified based on its features. To use decision trees, we will import the __[DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)__ from scikit-learn.

In [15]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

Similar to the previous classifier, we will create a DecisionTreeClassifier object with a parameter *random_state=0* to control the randomness of the splitting. 

In [16]:
dt = DecisionTreeClassifier(random_state = 0)

We will proceed with the model creation process.

In [17]:
dt.fit(X_train, y_train)

Then, we will proceed with using the model.

In [18]:
dt.predict(X_test)

array([1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0.,
       1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1.,
       0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0.,
       1., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0.])

Let's measure the accuracy of our decision tree classifier:

In [19]:
accuracy_rate(confusion_matrix(y_test, dt.predict(X_test)))

0.7570621468926554

This shows that our decision tree classifier is 75.71% accurate, a bit lower than the accuracy of our K-Nearest Neighbors classifier.
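To see the tree of rules that a decision tree learns, one option is scikit-learn's *export_text()* function, which prints the rules as indented text. The sketch below is only an illustration: it uses synthetic data from *make_classification()*, made-up feature names, and a *max_depth* of 2 (an assumption, chosen to keep the printout short) instead of the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four numeric features and binary labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names_demo = ["f0", "f1", "f2", "f3"]  # hypothetical column names

# A shallow tree so that the printed rules stay readable.
dt_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
dt_demo.fit(X_demo, y_demo)

# Each line is a threshold test on a feature; the leaves show the predicted class.
rules = export_text(dt_demo, feature_names=feature_names_demo)
print(rules)
```

With a dataframe such as *X_train*, you could pass `list(X_train.columns)` as the feature names so that the rules mention the real column names.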

-----
## Naive-Bayes Classifier
The Naive-Bayes Classifier assumes that each feature is conditionally independent of the other features given the class, and that all features contribute equally to the outcome. For example, if we want to get the probability that a data point belongs to class $y$, given features/evidences $x_1, x_2, \dots, x_n$, we start from Bayes' theorem: $$P\left(y|x_1, x_2, \dots, x_n\right) = \dfrac{P\left(x_1, x_2, \dots, x_n|y\right)\cdot P\left(y\right)}{P\left(x_1, x_2, \dots, x_n\right)}$$
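The conditional independence assumption is what makes this expression practical to compute. A short derivation, using the same symbols as the formula above plus $\hat{y}$ for the predicted class (a symbol introduced here for illustration):

```latex
% Conditional independence: the joint likelihood factorizes into per-feature terms.
P\left(x_1, x_2, \dots, x_n|y\right) = \prod_{i=1}^{n} P\left(x_i|y\right)

% The denominator does not depend on y, so it can be dropped when comparing classes.
P\left(y|x_1, x_2, \dots, x_n\right) \propto P\left(y\right)\cdot\prod_{i=1}^{n} P\left(x_i|y\right)

% The predicted class is the one with the largest value of this product.
\hat{y} = \arg\max_{y} \; P\left(y\right)\cdot\prod_{i=1}^{n} P\left(x_i|y\right)
```

Each factor $P\left(x_i|y\right)$ only involves a single feature, so it can be estimated separately from the training data.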

To use a Naive-Bayes classifier, we will import the __[GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)__ from scikit-learn. Note that this classifier assumes that our predictors follow a Gaussian or normal distribution within each class.  For other types of distribution, check the rest of the NB classifiers __[here](https://scikit-learn.org/stable/modules/naive_bayes.html)__.

In [20]:
from sklearn.naive_bayes import GaussianNB

We will proceed with GaussianNB object creation, model creation, model usage, and model evaluation:

In [21]:
# Instantiate the classifier object
nb = GaussianNB()

# Create the model
nb.fit(X_train, y_train)

# Use the model
nb.predict(X_test)

# Evaluate the model
accuracy_rate(confusion_matrix(y_test, nb.predict(X_test)))

0.8135593220338984

This shows that our Naive-Bayes classifier is 81.36% accurate, a little bit higher than our K-Nearest Neighbors classifier and much higher than our Decision Tree classifier.

------
## Logistic Regression
Logistic Regression is a type of regression that, instead of predicting an unbounded continuous variable, predicts the probability that a data point belongs to a class, bounded from 0 to 1. To use a Logistic Regression classifier, we will import the __[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)__ from scikit-learn.

In [22]:
from sklearn.linear_model import LogisticRegression

We then instantiate our classifier object, create our model, use it and evaluate it:

In [23]:
LR = LogisticRegression()
LR.fit(X_train, y_train)
LR.predict(X_test)
accuracy_rate(confusion_matrix(y_test, LR.predict(X_test)))

0.8192090395480226

This shows that among the classifiers used in this notebook, the Logistic Regression classifier is the most accurate at 81.92%.
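The bounded probability mentioned at the start of this section can be inspected directly. The sketch below, which uses synthetic data from *make_classification()* as a stand-in for the Titanic features, computes the unbounded linear score with *decision_function()*, applies the logistic (sigmoid) function by hand, and compares the result with *predict_proba()*. It assumes a binary problem, as in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 200 rows, 5 numeric features, binary labels.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

lr_demo = LogisticRegression()
lr_demo.fit(X_demo, y_demo)

# Unbounded linear score for each row (weights times features, plus intercept).
scores = lr_demo.decision_function(X_demo)

# The logistic function maps any real score to a value between 0 and 1.
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# predict_proba() has one column per class; column 1 is the probability of class 1.
model_proba = lr_demo.predict_proba(X_demo)[:, 1]

print(np.allclose(manual_proba, model_proba))
print(model_proba.min(), model_proba.max())
```

The *predict()* method then assigns class 1 to the rows whose probability of class 1 is above 0.5, and class 0 to the rest.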

----------
# Lab # 2: Supervised Learning
Your objective is to create a classifier with a better accuracy than the ones created in this notebook.  To improve the accuracy, the following can be done:
- **Re-process your working dataset**
    - Perform data preprocessing again and select which columns should be included.
    - Remove highly-correlated features and leave the one with higher correlation with the target variable.
    - Try using the original dataset with categorical variables and compare with a dataset with purely numeric figures.
    - Try using an unstandardized dataset and compare to a standardized one.
- **Change the test size**: Try other ratios of training and test set such as 70:30, 80:20, and 90:10.
- **Change the model parameters**: Learn the different parameters that you can configure for each classifier and explore other configurations.
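One possible way to organize such experiments is a simple loop over configurations. The sketch below tries three test sizes and three values of *n_neighbors* and records the accuracy of each combination. It runs on synthetic data from *make_classification()*, and the specific values tried are arbitrary examples, so treat it as a starting template and not as a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with your own features and target.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

results = {}
for test_size in (0.3, 0.2, 0.1):  # 70:30, 80:20, and 90:10 splits
    Xd_train, Xd_test, yd_train, yd_test = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0
    )
    for k in (3, 5, 7):  # example values for n_neighbors
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(Xd_train, yd_train)
        accuracy = accuracy_score(yd_test, model.predict(Xd_test))
        results[(test_size, k)] = accuracy

# Pick the configuration with the highest test accuracy.
best_config = max(results, key=results.get)
print(best_config, results[best_config])
```

Keep in mind that a smaller test set gives a noisier accuracy estimate, so a small difference between two configurations may not be meaningful.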

Remember: There is no single perfect model for everything. Try different combinations and configurations!

**Deliverables**:
1. Jupyter notebook containing your code, results, and insights. 
    - Use the markdown cells to write your explanation as to why you selected such parameter/model configuration/dataset format. 
2. Certificate of Authorship: DISCS students should know where to get this. If you can't find it, please send me an email.

**Due Date**: 21 April 2023