<h1 align='center'> COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security</h1>

<h2 align='center'> Lab 06 - Machine Learning - II</h2>

*****

In this lab, we will first guide you through some required interfaces of `sklearn`. We will implement and study **Decision Trees** and **Nearest-Neighbours Regression** implimented using `sklearn`.

In [54]:
# Important Imports
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from scipy import stats

## Decision Tree

A **Decision Tree** is a **supervised** learning algorithm used for classification problems. In this algorithm, we split the original data into two or more homogeneous sets. This is done based on most significant attributes to make the resulting groups as distinct as possible. In decision analysis, a decision tree is used to visually and explicitly represent decisions and decision making. It uses a tree-like model of decisions. A decision tree is drawn with its **root** at the top and **branches** at the bottom. The branch end that doesn’t split anymore is the **decision / leaf**.

### Algorithm

1. Pick an attribute to split at a non-terminal node (based on which attribute will provide the greatest **Information Gain**).
2. Split examples into groups based on attribute value.
3. For each group:
    * **If** no examples – return majority from parent
    * **Else If** all examples in same class – return class
    * **Else** loop to Step 1


### Properties

* Internal nodes test attributes.
* Branching is determined by attribute value.
* Leaf nodes are outputs (class assignments).



Let's try to make some predictions with such a decision tree!

## Exercise 1: Sink or Float

We will use the **Titanic Passenger Dataset** which has data related to 891 passengers and 11 features + the target variable (Survived) in the dataset. The dataset is provided under the `data` folder and has the following features:

| **Feature**              |**Description**                                                                   |
|--------------------------|----------------------------------------------------------------------------------|
| PassengerId              | ID of Passenger in dataset                                                       |
| Survived                 | Whether the passenger survived or not {0: No, 1: Yes}                            |
| Pclass                   | Class of Travel (1,2 or 3)                                                       |
| Name                     | Name of Passenger                                                                |
| Sex                      | Gender of Passenger (Male or Female)                                             |
| Age                      | Age of Passenger                                                                 |
| Sibsp                    | Number of Siblings/Spouse aboard                                                 |
| Parch                    | Number of Parent/Child aboard                                                    |
| Ticket                   | Ticket Number                                                                    |
| Fare                     | Cost of the Ticket                                                               |
| Cabin                    | Cabin number of Passenger                                                        |
| Embarked                 | Port which the Passenger embarked {C: Cherbourg, S: Southhampton, Q: Queenstown} |


We are going to split this dataset into **train** and **test** data. We aim to build a **decision tree to predict whether a passenger will survive or not in the titanic crash**.

1. Complete the following steps to prepare the data for use:
    - Load the dataset `titanic.csv` into a dataframe. Do a quick check on the type of data each column has. 
    - To make life easier for the decision tree, alter the **Sex** column such that `male` is replaced with `0` & `female` is replaced with `1`
    - Clean up the **Age** column to remove any `NaN` values (**HINT**: Replace them with the mean of the entire column).
    - For this exercise we would need only need the feature columns `Pclass`, `Sex`, `Age`, `SibSp` and `Parch`, and the result `Survived`. Split the DataFrame into two new variables, one being a DataFrame with the feature columns and the other holding the result data
    - Using `sklearn`'s train-test-split function, split the data into a training set and testing set to evaluate your model

In [71]:
# YOUR ANSWER HERE
df1 = pd.read_csv('data/titanic.csv')
df1['Sex'] = df1['Sex'].map({'female': 1, 'male': 0})
age_mean = df1['Age'].mean()
df1 = df1.fillna(age_mean)


x = df1[['Pclass','Sex','Age','SibSp']]
y = df1[['Survived']]


x_train, y_train, x_test, y_test = train_test_split(x, y,
        test_size=0.2, random_state=None, shuffle=True)


     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25          

### What can you do with a decision tree object?

| Object Method | Description |
| --- |:---:|
| `dt.apply()` | **Returns the index of the leaf that each sample is predicted as.** |
| `dt.decision_path()` | **Return the decision path in the tree** |
| `dt.fit()` | **Build a decision tree classifier from the training set.** |
| `dt.predict()` | **Predict class value for X.** |
| `dt.score()` | **Returns the mean accuracy on the given test data and labels.** |

<br/>

2. Using the `sklearn` module, make a decision tree object and fit it to the `training` dataset. Determine the accuracy of this model on the `test` dataset.

In [56]:
# YOUR ANSWER HERE

# Labels | Y values
dty_labels = df1['Survived']

# Features | X values
dt_features = df1[['Pclass','Sex','Age','SibSp']]

# Classification
dt_classifier = DecisionTreeClassifier(criterion="entropy", presort=True)

dtx_train, dtx_test, dty_train, dty_test = train_test_split(dt_features, dty_labels, test_size=0.2)
dt_classifier.fit(dtx_train, dty_train)
applied = dt_classifier.apply(dtx_test)
print(applied)
dotf = export_graphviz(dt_classifier, out_file='out.dot')

[130 179  69  65 289  23 130  63 136  92 237  93 207  98 130  92 193 122
 229 106 130 220 130 174 188 177 261 130 261 216 169 188 211  65  85 188
 179 121  83 178 163 160 193 220 165 190 112  93 111 188   5  96 111 264
 190 102 174 148 130  65 174 261 145 157 278 266 207 234 178  56 131 190
 130  48 102 160 190 198 289 174  65  65 160  98 190 131  98 190  98 133
 250  33 237 193  25 130 278  65 278 190 190 291 160  39 130  33 122 190
 145 179 130  48 261  34 174 223 125 117  33 141 130  98  36  87  24 253
  93  36 190  85 111  34 190 190 188  65 111  36 142 190  92 133 261 130
 160 278 188 193 167 190 125 178 160  36  33  39  98  92 228  98 130  98
 178   8 106 169 130 200 158 130 190 234 261 125 261  98 106  33 250]


3. How would the accuracy be affected if you increase or decrease the `DecisionTreeClassifier` object parameters like `max_depth` and `min_samples_leaf`? You are welcome to try making the `DecisionTreeClassifier` object with different values of `max_depth` and `min_samples_leaf` and compare your model's accuracy. But, make sure you can justify the change in accuracy for the increase or decrease in these parameters.

In [57]:
# YOUR ANSWER HERE
# Classification
dt_classifier = DecisionTreeClassifier(max_depth=10, min_samples_leaf=5, criterion="entropy", presort=True)

dtx_train, dtx_test, dty_train, dty_test = train_test_split(dt_features, dty_labels, test_size=0.2)
dt_classifier.fit(dtx_train, dty_train)
applied = dt_classifier.apply(dtx_test)
print(applied)
dotf = export_graphviz(dt_classifier, out_file='out.dot')

[ 92  73  85  63  63  47 105  81 106  47  52  47  55   7 104  47  85  85
  47  47  22  47 106  20  25  31  82  81  82  57  85  83   8  19  53  85
  47  47 107  63  91  85  72  20  39  47  47  71  85  54 104  52  47  93
  47  39  91  52  47  47  54  20  68  47  68  71  47  71 116 107  93  39
  53 106  82  47  52  52  55  52  85 115  20  25  47  87  57  96  39  81
  52  47  29  47 119   8 119  31  49 106 122   8 109  39 106  47  85  71
  62  62  88 121  20  48 106  47  47  68  20  81  20 119  20  71  47  47
 109  29  39  91  68  68  47  82  72  39   4  47  47 107  81  18  47  47
  20  68  57 115  81  83  81  52  16  81  85  79  41 107  48  47  81   4
  25 116 107  71  52  81  85  39  52 104  81  24  71  52  63  85   5]


#### Justification
- As necessary. Use this to put down notes that you can come back to quickly later when you're studying !

4. Scikit-Learn provides you with an easy way to visualize and export this **Decision Tree**. Export this decision tree to your machine and visually inspect it (**HINT**: This was performed in the lectures)

Note: If you're doing this at home, you may require extra files to compile the dot file to visually inspect. The steps are shown below.

In [58]:
# YOUR ANSWER HERE

#copy code of out.dot, go to graphviz online and paste

#### Compiling the `.dot` file (using Graphviz)

##### OSX (Using Homebrew)
- In an open terminal window:
    - `brew install graphviz`
    - `dot -Tpng decision_tree.dot -o decision_tree.png` (when in the directory of the `.dot` file)

##### Ubuntu 
- In an open terminal window
    - `sudo apt install graphviz` (if not already installed, can sometimes come standard) **(On CECS computers, it will be automatically installed)**
    - `dot -Tpng decision_tree.dot -o decision_tree.png` (when in the directory of the `.dot` file)
    
##### Windows
- Install the appropiate packages from (here)[https://graphviz.gitlab.io/_pages/Download/Download_windows.html]
- Ensure Graphiz is within your PATH
- Using Powershell
    - `dot -Tpng decision_tree.dot -o decision_tree.png` (when in the directory of the `.dot` file)

<br/>

## k-Nearest Neighbours

A **k-Nearest Neighbours** is a **supervised learning** algorithm. It is extremely easy to implement in its most basic form, and yet performs quite complex classification tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying a new data point or instance. 

KNN is a **non-parametric learning algorithm**, which means that it doesn't assume anything about the underlying data. This is an extremely useful feature since most of the real world data doesn't really follow any theoretical assumption e.g. linear-separability, uniform distribution, etc.

<img src='./images/knn_classification.jpg'>

Source: [E_blog - K-Nearest Neighbour(KNN) Classification](https://zeidigital.wordpress.com/2016/08/13/k-nearest-neighbour-classification-algorithm-implementation-in-python/)

Given a dataset with **x and y**, Nearest Neighbours Regression can be used to:

* Build a predictive model to predict future values of **x<sub>i</sub>** without a **y<sub>i</sub>** value.
* Build a model without assuming any parameters, **K** is hyperparameter.
* It can be seen as a baseline method to compare your complex models.

Lets try doing some predictions using a k-Nearest Neighbours algorithm

## Exercise 2: Want to buy

We have provided a dataset of 200 consumers and 3 features and a target variable in the dataset. The dataset is provided under the `data` folder and has the following features:

| **Feature**              |**Description**                                                                   |
|--------------------------|----------------------------------------------------------------------------------|
| CustomerID               | Unique ID assigned to the customer                                               |
| Gender                   | Gender of the Shopper (Male or Female)                                           |
| Age                      | Age of the Shopper                                                               |
| Annual Income            | Annual Income of the customer (in thousands)                                     |
| Will Buy                 | Whether the customer will buy item x {Yes: 1, No: 0}                             |

#### Scenario
You run a local grocery store and have been approached regarding stocking a products. Product-x is being marketed as the biggest thing since sliced bread in the next town over, although you're unsure if your customers will be as responsive. You decide to purchase a test batch and allow your 200 loyalty members to review the product, and use their buying statistics to determine whether you should stock the product permanently. Loyalty memebers will inform you whether they brought the product or not, and you can use this to predict whether other customers will do the same.

Note: This is another example of a binary classification problem, as you are trying to determine whether the answer will be "yes" or "no" given a question.

1. Complete the following steps to prepare the data for use:
    - Load the dataset `buying_cx.csv` into a dataframe, using CustomerID as the index column
    - As with the previous task, alter the `Gender` column to reflect 0 meaning Male & 1 meaning Female (impliment this using a different method to the previous question) 
        - **Extension**: Use a LabelEncoder or the like from the `sklearn` module to change the values.
    - Split the data into a training and testing dataset

In [106]:
# YOUR ANSWER HERE
df2 = pd.read_csv('data/buying_cx.csv')

df2['Gender'] = df2['Gender'].map({'Female': 1, 'Male': 0})

# Features | X values
X = df2[['Gender','Age','Annual Income (k$)']]

# Labels | Y values
y = df2[['Will Buy']]
y = np.ravel(y)


#length = X.shape[0]
#print(length)
#X_train = X[:int(length*0.8)]
#y_train = y[:int(length*0.8)]
#X_test = X[int(length*0.8):]
#y_test = y[int(length*0.8):]
print(X.shape)
print(y.shape)


X_train, X_test, y_train, y_test = train_test_split(X, y,
        test_size=0.2, random_state=None, shuffle=True) #train80%, test 20%

(200, 3)
(200,)


### What can you do with a k-Nearest Neighbors object?
While there are other methods, the main functions you will require for this exercise are the following:

| Method | Description |
| --- |:---:|
| `.fit()` | **Build a decision tree classifier from the training set.** |
| `.predict()` | **Predict class value for X.** |
| `.score()` | **Returns the mean accuracy on the given test data and labels.** |

<br/>

2. Using the k-Nearest Neighbors class from `sklearn`, impliment a k-Nearest Neighbors classifier where it checks the 5 closest neighbours, and determine the accuracy of the model

In [111]:

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_predictions = knn.predict(X_test)

# compare visually
print(y_test)
print(y_predictions)

print("Accuracy:",metrics.accuracy_score(y_test, y_predictions))

[0 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 1
 0 0 0]
[0 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 0 0 0 1
 0 0 1]
Accuracy: 0.65


3. As k is a hyperparameter that we specify when we create the model, it is possible to adjust the number of neighbours the model will check for a solution. Create 3 more models, each with a different k value (k=1, k=50, k=150) and compare the scores of each model.

In [108]:
# YOUR ANSWER HERE

#k = 1
#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=1)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_predictions = knn.predict(X_test)

# compare visually
#print(y_test)
#print(y_predictions)

print("Accuracy:",metrics.accuracy_score(y_test, y_predictions))

Accuracy: 0.7


In [109]:
#k = 50

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=50)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_predictions = knn.predict(X_test)

# compare visually
#print(y_test)
#print(y_predictions)

print("Accuracy:",metrics.accuracy_score(y_test, y_predictions))

Accuracy: 0.55


In [110]:
#k = 150

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=150)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_predictions = knn.predict(X_test)

# compare visually
#print(y_test)
#print(y_predictions)

print("Accuracy:",metrics.accuracy_score(y_test, y_predictions))

Accuracy: 0.7


Comparing these scores to when we tried for the 5 closest neighbours, what does this tell us about the optimal number of neighbours? Will the optimal number of neighbours remain constant over different divisions of testing and training data? Discuss this with other members of the course or your tutor.

#### NOTES
- Insert Notes from your discussion here as necessary