# APS1070
#### Project 1 --- Basic Principles and Models 
**Deadline: Jun 4, 11PM - 10 percent**

**Academic Integrity**

This project is individual - it is to be completed on your own. If you have questions, please post your query in the APS1070 Piazza Q&A forums (the answer might be useful to others!).

Do not share your code with others, or post your work online. Do not submit code that you have not written yourself. Students suspected of plagiarism on a project, midterm or exam will be referred to the department for formal discipline for breaches of the Student Code of Conduct.

Name: Yusuf Olonade

Student ID: 1006814743

## **Marking Scheme:**

This project is worth **10 percent** of your final grade.

Draw a plot or table where necessary to summarize your findings. 

**Practice Vectorized coding**: If you need to write a loop in your solution, think about how you can implement the same functionality with vectorized operations. Try to avoid loops as much as possible (in some cases, loops are inevitable).
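
As a small illustration of the point above, the sketch below counts the entries of an array that exceed a cutoff, first with a loop and then with a vectorized comparison. The array `values` and the cutoff are made-up stand-ins, not data from this project.

```python
import numpy as np

# Hypothetical values standing in for one numeric column of a dataset
values = np.array([350.0, 820.5, 699.9, 1200.0, 700.0, 910.3])

# Loop version: visit each element and count matches one by one
count_loop = 0
for v in values:
    if v > 700:
        count_loop += 1

# Vectorized version: the comparison yields a boolean array,
# and summing it counts the True entries
count_vectorized = int((values > 700).sum())

print(count_loop, count_vectorized)
```

Both versions give the same count; the vectorized form pushes the iteration into NumPy and usually reads more directly.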




### How to submit **(HTML + IPYNB)**

1. Download your notebook: `File -> Download .ipynb`

2. Click on the Files icon on the far left menu of Colab

3. Select & upload your `.ipynb` file you just downloaded, and then obtain its path (right click) (you might need to hit the Refresh button before your file shows up)


4. Execute the following in a Colab cell:
```
%%shell
jupyter nbconvert --to html /PATH/TO/YOUR/NOTEBOOKFILE.ipynb
```

5. An HTML version of your notebook will appear in the files, so you can download it.

6. Submit **both** <font color='red'>`HTML` and `IPYNB`</font>  files on Quercus for grading.



Ref: https://stackoverflow.com/a/64487858 



# Project 1 [10 Marks] 
Let's apply the tools we have learned in Tutorial 1 to a new dataset.

We're going to work with a breast cancer dataset. Download it using the cell below:

In [102]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

## Part 1: Getting started [2 Marks]
First off, take a look at the `data`, `target` and `feature_names` entries in the `dataset` dictionary. They contain the information we'll be working with here. Then, create a Pandas DataFrame called `df` containing the data and the targets with the feature names as column headings. If you need help, see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for more details on how to achieve this. **[0.4]**
* How many features do we have in this dataset? 30
* How many observations have a 'mean area' of greater than 700? 171
* How many participants tested `Malignant`? 212
* How many participants tested `Benign`? 357

In [None]:
dataset

In [104]:
import numpy as np
import sklearn
import pandas as pd
import matplotlib.pyplot as plt 

In [105]:
feature_names = dataset.feature_names
feature_data = dataset.data
target_names = dataset.target_names
target_data = dataset.target

In [None]:
feature_names

In [None]:
df_1 = pd.DataFrame(data = feature_data, columns = feature_names)

df_1["target"] = target_data

df_1

In [None]:
print("No of features:", feature_data.shape[1])
print("No of observations with mean area greater than 700:", (df_1['mean area'] > 700).sum())
print("No of participants tested Malignant:", (df_1['target'] == 0).sum())
print("No of participants tested Benign:", (df_1['target'] == 1).sum())

### Splitting the data
It is best practice to have a training set (from which there is a rotating validation subset) and a test set. Our aim here is to (eventually) obtain the best accuracy we can on the test set (we'll do all our tuning on the training/validation sets, however.) 

**Split the dataset** into a train and a test set **"70:30"**, use **``random_state=0``**. The test set is set aside (untouched) for final evaluation, once hyperparameter optimization is complete. **[0.5]**

In [109]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feature_data, target_data, test_size=0.3, random_state=0)

### Effect of Standardization (Visual)
Use `seaborn.lmplot` ([help here](https://seaborn.pydata.org/generated/seaborn.lmplot.html)) to visualize a few features of the training set. Draw a plot where the x-axis is ``worst smoothness``, the y-axis is ``worst fractal dimension``, and the color of each datapoint indicates its class.  **[0.5]**

Standardizing the data is often critical in machine learning. Show a plot as above, but with two features with very different scales. Standardize the data and plot those features again. What's different? Based on your observation, what is the advantage of standardization? **[0.6]**






In [None]:
# Training set in Pandas DataFrame

df_2 = pd.DataFrame(data = X_train, columns = feature_names)

df_2["target"] = y_train

df_2

In [None]:
# Plot 1: worst smoothness vs worst fractal dimension

import seaborn as sns; sns.set_theme(color_codes=True)
g_1 = sns.lmplot(x = "worst smoothness", y = "worst fractal dimension", hue = "target", data = df_2)

In [None]:
# Plot 2: mean concave points vs mean area before standardization

g_2 = sns.lmplot(x = "mean concave points", y = "mean area", hue = "target", data = df_2)

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train) #Fitting the scaler on X_train

print("Scaler Mean:", scaler.mean_)
print("Scaler Var:",  scaler.var_ , "\n")

# Standardizing X_train & X_test
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

In [None]:
# Casting the standardized training set in a Pandas DataFrame

df_3 = pd.DataFrame(data = X_train_scaled, columns = feature_names)

df_3["target"] = y_train

df_3

In [None]:
# Plot 3: mean concave points vs mean area after standardization

g_3 = sns.lmplot(x = "mean concave points", y = "mean area", hue = "target", data = df_3)

**Answer:** In Plot 2, before standardization, the scale of `mean area` is far larger than that of `mean concave points`. After standardization, in Plot 3, both features are on the same scale. Standardization prevents features with large values from dominating features with small values, which can be just as important for the model's predictions.
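
To make the "dominating" effect concrete for a distance-based model such as KNN, here is a minimal sketch with two made-up samples. The feature values, means and standard deviations are hypothetical and only mimic one large-scale and one small-scale feature. It reports how much each feature contributes to the squared Euclidean distance before and after standardization.

```python
import numpy as np

# Two hypothetical samples: column 0 mimics an area-like feature (hundreds),
# column 1 mimics a proportion-like feature (hundredths)
a = np.array([500.0, 0.02])
b = np.array([520.0, 0.12])

# Share of the squared Euclidean distance contributed by each feature, raw scale
raw_sq = (a - b) ** 2
raw_share = raw_sq / raw_sq.sum()

# Hypothetical training-set mean and standard deviation for the two columns
mean = np.array([650.0, 0.05])
std = np.array([350.0, 0.04])

# Same computation after standardizing both samples with those statistics
scaled_sq = ((a - mean) / std - (b - mean) / std) ** 2
scaled_share = scaled_sq / scaled_sq.sum()

print("raw share:   ", raw_share.round(4))
print("scaled share:", scaled_share.round(4))
```

On the raw scale nearly all of the distance comes from the first feature, even though the two samples differ by only a small fraction of a standard deviation there. After standardization, the second feature, where the samples differ by 2.5 standard deviations, carries most of the distance.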

## Part 2: KNN Classifier without Standardization [2 Marks]
Normally, standardizing data is a key step in preparing data for a KNN classifier. However, for educational purposes, let's first try to build a model without standardization. Let's create a KNN classifier to predict whether a patient has a malignant or benign tumor. 

Follow these steps: 

1.   Train a KNN Classifier using cross-validation on the dataset. Sweep `k` (number of neighbours) from 1 to 100, and show a plot of the mean cross-validation accuracy vs `k`. **[1]**
2.   What is the best `k`? **k = 10**. What is the highest cross-validation accuracy? **0.9346518987341772** **[0.5]**
3. Comment on which ranges of `k` lead to underfitted or overfitted models (hint: compare training and validation curves!). **[0.5]** **Ans:** Models with `k` between 75 and 100 are underfitted because both their training and validation scores are low (below 0.9), while models with `k` between 1 and 4 are overfitted because they have high training scores (above 0.95) but lower validation scores (below 0.92).
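
One way to compare the two curves numerically is scikit-learn's `validation_curve`, which returns per-fold training and validation scores for each value of a hyperparameter. The sketch below is an alternative to the grid search used in this notebook, not a replacement for it. It uses a coarse, illustrative grid of `k` values and prints the gap between the mean training and validation accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Coarse, illustrative grid of k values (the sweep in this part covers 1 to 100)
k_values = np.arange(1, 101, 10)

# Per-fold training and validation accuracy for every k, using 5-fold CV
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X_train, y_train,
    param_name="n_neighbors", param_range=k_values, cv=5,
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# A large positive gap suggests overfitting; low values on both curves suggest underfitting
for k, tr_acc, va_acc in zip(k_values, train_mean, val_mean):
    print(f"k={k:3d}  train={tr_acc:.3f}  val={va_acc:.3f}  gap={tr_acc - va_acc:.3f}")
```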




In [None]:
# Validating the kNN model with k = 1

from sklearn import neighbors
from sklearn.model_selection import cross_validate

knn_test = neighbors.KNeighborsClassifier(n_neighbors=1)
scores = cross_validate(knn_test, X_train, y_train, cv=5, return_train_score=True)

print('Mean Train Accuracy:',scores['train_score'].mean()) 
print('Mean Validation Accuracy:', scores['test_score'].mean())

In [None]:
from sklearn.model_selection import GridSearchCV

# Creates a new knn model
knn_1 = neighbors.KNeighborsClassifier()

# Creates a dictionary of all k-values we want to test
param_grid = {'n_neighbors': np.arange(1, 101)}

# Uses gridsearch to test all k-values we defined in dictionary
clf_1 = GridSearchCV(knn_1, param_grid, return_train_score = True)

# Fit model to our training data
clf_1.fit(X_train, y_train)

clf_1.cv_results_

In [None]:
# Cross validation results imported into a pd Dataframe

df_4 = pd.DataFrame(clf_1.cv_results_)[['param_n_neighbors','mean_train_score', 'mean_test_score']]

df_4

In [None]:
# Plot 4: KNN Model Validation Curve

df_4.plot.scatter(x ='param_n_neighbors', y = 'mean_test_score', c = 'Red', title = 'KNN Model Validation Curve')

In [None]:
# Plot 5: KNN Model Training Curve

df_4.plot.scatter(x ='param_n_neighbors', y='mean_train_score', c = 'Green', title = 'KNN Model Training Curve')

In [None]:
# Returns the best k parameter

clf_1.best_params_

In [None]:
# Returns the mean validation score (accuracy) for the best k parameter

Full_feature_val_score = clf_1.best_score_
Full_feature_val_score

## Part 3: Feature Selection [3 Marks]
In this part, we aim to investigate the importance of each feature on the final classification accuracy. 
If we wanted to try every possible combination of features, we would have to test $2^F$ different cases, where F is the number of features, and in each case we would have to do a hyperparameter search (finding `k` in KNN using cross-validation). That would take days!
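
For a sense of scale, the arithmetic can be checked directly. The sketch below compares the number of feature subsets in an exhaustive search with the number of feature sets visited when features are removed one at a time until a single feature remains, using F = 30 as in this dataset.

```python
F = 30  # number of features in this dataset

# Exhaustive search: every feature is either kept or dropped
exhaustive_subsets = 2 ** F

# Greedy backward elimination: drop one feature per step until one remains
greedy_models = F - 1

print(f"exhaustive: {exhaustive_subsets:,} subsets")
print(f"greedy:     {greedy_models} candidate feature sets")
```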

To find the more important features, we will use a decision tree. Based on a decision tree, we can compute feature importance, which serves as the metric for our feature selection (code is provided below).

You can use [this link](https://machinelearningmastery.com/calculate-feature-importance-with-python/) to get familiar with extracting the feature importance order of machine learning algorithms in Python.

After we identified and removed the least important feature and evaluated a new KNN model on the new set of features, if the stop conditions (see step 7 below) are not met, we need to repeat the process and remove another feature.


Design a function ( `Feature_selector`) that accepts your dataset (X_train , y_train) and a threshold as inputs and: **[1]**
1. Fits a decision tree classifier on the training set.

2. Extracts the feature importance order of the decision tree model.

3. Removes the least important feature based on step 2. 
4. Then, a KNN model is trained on the remaining features. The number of neighbors (`k`) for each KNN model should be tuned using a 5-fold cross-validation.
5. Store the best `mean cross-validation` score, the corresponding `k` (number of neighbours) value, and the removed feature in three lists.
6. Repeat Steps 3-5 until you meet the stop condition (step 7). 
 
7. We will stop this process when (1) there is only one feature left, or (2) our cross-validation accuracy has dropped significantly compared to a model that uses all the features. In this function, we accept a threshold as an input argument. For example, if threshold=0.95, we do not continue removing features if our mean cross-validation accuracy after tuning `k` is below **0.95 $\times$ Full Feature cross-validation accuracy**.

8. Your function returns the list of removed features, the list of corresponding mean cross-validation accuracies, and the list of `k` values when a feature was removed (i.e., the lists that were appended to in Step 5).

* Visualize your results by plotting the best mean cross-validation accuracy (based on the best value of `k`) on y axis vs. the number of features (x axis). This plot describes: what is the best cv score with 1 feature, 2 features, 3 features ... and all the features. **[0.5]**

* Plot the best value of `k` (y-axis) vs. the number of features. This plot explains the trend of number of neighbours with respect to the number of features.  **[0.5]**

* State what is the number of essential features for classification and justify your answer. **[1]**
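
The two plots requested above could be produced along the following lines. This is a sketch under stated assumptions: the helper name `plot_vs_num_features` is hypothetical, it assumes that entry `i` of each list was recorded after the `(i+1)`-th feature was removed, and the demo numbers are invented purely to exercise the function.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_vs_num_features(values, n_total_features, ylabel, full_feature_value=None, ax=None):
    """Plot one value per removal step against the number of features left.

    values[i] is assumed to be the result recorded after the (i+1)-th feature
    was removed, so it corresponds to n_total_features - (i + 1) features.
    """
    values = np.asarray(values, dtype=float)
    n_features = n_total_features - np.arange(1, len(values) + 1)
    if full_feature_value is not None:
        # Prepend the all-features result so the curve starts at n_total_features
        n_features = np.concatenate(([n_total_features], n_features))
        values = np.concatenate(([full_feature_value], values))
    if ax is None:
        _, ax = plt.subplots()
    ax.plot(n_features, values, marker="o")
    ax.set_xlabel("Number of features")
    ax.set_ylabel(ylabel)
    return n_features, values

# Made-up numbers for illustration only; these are not results from this notebook
demo_scores = [0.93, 0.94, 0.92, 0.90]
demo_k = [10, 8, 12, 15]
x_scores, _ = plot_vs_num_features(demo_scores, 30, "Best mean CV accuracy", full_feature_value=0.93)
x_k, _ = plot_vs_num_features(demo_k, 30, "Best k")
```

Passing the same `ax` to two calls would overlay two runs, for example raw and standardized data, in a single figure.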
  
  







 

You can use the following piece of code to start training a decision tree classifier and obtain its feature importance order. 
```
from sklearn import tree
dt = tree.DecisionTreeClassifier()
dt.fit(X_train,y_train)
importance = dt.feature_importances_
```


In [123]:
# Creates KNN model that is to be trained with selected features
knn_2 = neighbors.KNeighborsClassifier()
param_grid = {'n_neighbors': np.arange(1, 101)}
clf_2 = GridSearchCV(knn_2, param_grid)

In [None]:
from sklearn import tree

def Feature_selector (X_train, y_train, tr=0.95, full_score=0.93):
  # full_score: full-feature cross-validation accuracy after tuning k (Part 2);
  # pass the matching baseline when calling this on differently prepared data
  remaining = list(range(X_train.shape[1]))                             # Column indices still in use
  list_best_score = []
  list_best_param = []
  removed_features = []
  while len(remaining) > 1:                                             # Stop (1): only one feature left
    dt = tree.DecisionTreeClassifier()
    dt.fit(X_train[:, remaining], y_train)                              # Refits the tree on the remaining features
    worst = int(np.argmin(dt.feature_importances_))                     # Position of the least important feature
    removed_features.append(feature_names[remaining.pop(worst)])        # Removes it and records its name
    clf = GridSearchCV(knn_2, param_grid, cv=5)                         # Tunes k with 5-fold cross-validation
    clf.fit(X_train[:, remaining], y_train)
    list_best_score.append(clf.best_score_)
    list_best_param.append(int(clf.best_params_['n_neighbors']))
    if clf.best_score_ < tr * full_score:                               # Stop (2): accuracy fell below the threshold
      break
  return list_best_score, list_best_param, removed_features

Feature_selector (X_train, y_train, tr=0.95)

## Part 4: Standardization [1 Mark]

Standardizing the data usually means scaling our data to have a mean of zero and a standard deviation of one. 

**Note:** When we standardize a dataset, do we care if the data points are in our training set or test set? Yes! The training set is available for us to train a model - we can use it however we want. The test set, however, represents a subset of data that is not available for us during training. For example, the test set can represent the data that someone who bought our model would use to see how the model performs (which they are not willing to share with us).
Therefore, we cannot compute the mean or standard deviation of the whole dataset to standardize it - we can only calculate the mean and standard deviation of the training set. However, when we sell a model to someone, we can say what our scalers (mean and standard deviation of our training set) were. They can scale their data (test set) with our training set's mean and standard deviation. Of course, there is no guarantee that the test set would have a mean of zero and a standard deviation of one, but the model should still work well enough.

**To summarize: We fit the StandardScaler only on the training set. We transform both training and test sets with that scaler.**
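
The rule above can be checked by hand with NumPy. The sketch below uses synthetic stand-in splits (`X_train_demo` and `X_test_demo` are hypothetical, not the project data), computes the statistics on the training split only, and applies them to both splits. `np.std` with its default `ddof=0` matches the population standard deviation that `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training and a test split (two features on different scales)
X_train_demo = rng.normal(loc=[650.0, 0.05], scale=[350.0, 0.04], size=(200, 2))
X_test_demo = rng.normal(loc=[650.0, 0.05], scale=[350.0, 0.04], size=(80, 2))

# Statistics come from the training split only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0)

# The same training statistics transform both splits
X_train_std = (X_train_demo - mu) / sigma
X_test_std = (X_test_demo - mu) / sigma

print("train mean/std:", X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
print("test  mean/std:", X_test_std.mean(axis=0).round(3), X_test_std.std(axis=0).round(3))
```

The standardized training split has mean zero and standard deviation one by construction, while the test split is only approximately centred and scaled.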

1. Standardize the training  and test data ([Help](https://scikit-learn.org/stable/modules/preprocessing.html)) 

2. Call your ``Feature_selector`` function on the standardized training data with a threshold of 95\%. 
 * Plot the Cross validation accuracy when we have the standardized data (this part) and the original training data (last part) vs. the Number of features in a single plot (to compare them easily).

3. Discuss how standardization (helped/hurt) your model and its performance? Discuss which cases lead to a higher cross validation accuracy (how many features? which features? What K?)


In [None]:
# 1. Standardizing the training and test set

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train) #Fitting the scaler on X_train

print("Scaler Mean:", scaler.mean_)
print("Scaler Var:",  scaler.var_ , "\n")

# Standardizing X_train & X_test
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)

In [None]:
# Feature_selector function called on the standardized data above

Feature_selector (X_train_scaled, y_train, tr=0.95)

## Part 5: Decision Tree Classifier [1.5 Marks]

Train a decision tree classifier on the standardized dataset (read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and check the example there.) Tune the `max_depth` and `min_samples_split` parameters of the tree using cross-validation (CV).
 * Compare the decision tree's performance (mean CV score) with KNN, both using all the features. **Ans**: The KNN algorithm had a mean CV score of **0.97** while the decision tree scored **0.92**, so the KNN algorithm performed better.


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Creates a decision tree classifier object
dt_2 = DecisionTreeClassifier()

# Creates a dictionary of all max_depth and min_samples_split we want to tune
param_grid_2 = {'max_depth': np.arange(1, 11), 'min_samples_split': np.arange(2, 21)}

# Uses GridSearchCV to tune the parameters we defined in dictionary
clf_2 = GridSearchCV(dt_2, param_grid_2, return_train_score = True)

# Fit decision tree algorithm to our standardized training data
clf_2.fit(X_train_scaled, y_train)

clf_2.cv_results_

In [None]:
# Returns the best parameter, observe that parameters were changing for each run

clf_2.best_params_

In [None]:
# Returns the mean validation score (accuracy) for the best max_depth and min_samples_split combination

clf_2.best_score_

In [None]:
# Fitting the decision tree algorithm on the standardized training data after tuning max_depth to be 4 and min_samples_split to be 3

from sklearn.model_selection import cross_val_score

dt_4_3 = DecisionTreeClassifier(max_depth = 4, min_samples_split = 3)

score_2 = cross_val_score(dt_4_3, X_train_scaled, y_train, cv=5)

score_2.mean()

In [None]:
# Fitting the KNN algorithm on the standardized training data after tuning k to be 10

knn_10 = neighbors.KNeighborsClassifier(n_neighbors=10)

score_3 = cross_val_score(knn_10, X_train_scaled, y_train, cv=5)

score_3.mean()

## Part 6: Test Data [0.5 Mark]

Now that you've created several models, pick your best one (highest CV accuracy) and apply it to the test dataset you had initially set aside. Discuss your results. **Ans:** The test accuracy was 95.91%, which is very close to the training score of 96.74%. From this, we see that our model is neither overfitted nor underfitted and performs satisfactorily.

In [None]:
from sklearn.metrics import accuracy_score

knn_10.fit(X_train_scaled, y_train)
accuracy = accuracy_score(y_test, knn_10.predict(X_test_scaled))
print ("Test set accuracy: ", accuracy * 100, "%")

References:

https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052

https://www.analyticsvidhya.com/blog/2021/02/machine-learning-101-decision-tree-algorithm-for-classification/