# HW 3: Classification, Evaluation, and Deployment

In this homework, you will have a chance to experience a complete machine learning cycle. You will prepare a dataset, make a model, evaluate models to find the best fit, and deploy it to a simple web page. Our main objective is to make students *learn* classification and evaluation methods, so we will apply essential data preprocessing techniques but mainly focus on classification and evaluation.

We will use the **Pima Indians Diabetes Database**, which is publicly available from UCI. However, we removed and changed some parts of the dataset for the homework evaluation, so **please use the one in the zip file provided in ilearn**.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on specific diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

According to the dataset documentation, it has eight attributes and one binary class. A brief explanation of the attributes is as follows:

- Pregnancies: Number of times pregnant.

- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

- BloodPressure: Diastolic blood pressure (mm Hg).

- SkinThickness: Triceps skin fold thickness (mm).

- Insulin: 2-Hour serum insulin (mu U/ml).

- BMI: Body mass index (weight in kg/(height in m)^2).

- DiabetesPedigreeFunction: Diabetes pedigree function.

- Age: Age (years).

The binary class, Outcome, is either 0 (healthy) or 1 (diabetes).

**NOTE**: Unlike the labs, each function you make here will be **graded**, so it is important to *strictly* follow the input and output instructions stated in the skeleton code.

## Contents

0. Preparation
 - Load the dataset.
 - Task 1: Changing zero value into mean (not graded).
1. Classification
 - Task 2: Random forest (graded, 0.5 pt).
 - Task 3: SVM with diverse kernels (graded, 0.5 pt).
 - Task 4: Decision tree implementation (graded, **advanced**, 2 pt).
2. Evaluation
 - Task 5: Precision, Recall, F1-score (graded, 0.3 pt).
 - Task 6: AUC/AUPRC (graded, 0.2 pt).
 - Task 7: Apply them together with scikit-learn (graded, 0.5 pt).
 - Task 8: Task 5 implementation (graded, **advanced**, 1 pt).
3. Deployment
 - Save models into a file using pickle.

# 0. Preparation

##### Student information

Please provide your information for automatic grading.

In [None]:
STUD_SUID = 'nyna2000'
STUD_NAME = 'Nyamgarig Naranbaatar'
STUD_EMAIL = 'nyna2000@student.su.se'

#### Basic libraries

These libraries will be frequently used throughout the homework!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from HW3_helper import *
RANDOM_STATE = 12345 #Do not change it!
np.random.seed(RANDOM_STATE) #Do not change it!

#### Load the dataset

Use the **diabetes** dataset located on ilearn, and load it here using pandas.

In [None]:
diabetes = pd.read_csv("datasets/diabetes.csv")

Here you can find out some basic information by calling *info(), head()*, and *describe()*.

In [None]:
diabetes.info()

In [None]:
diabetes.head()

In [None]:
diabetes.describe()

It seems like there is no null data. However, if you count the zero values in the dataset, there are many of them. Is it plausible for a person's BMI to be zero? You may want to replace the zero values with a more reasonable value, such as the mean or median. The only attribute that can legitimately be zero is **Pregnancies**. Let's first make a function that replaces zero values with the mean of the column.

In [None]:
(diabetes==0).sum(axis=0)
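
As a small illustration of the idea, the sketch below treats zeros as missing values and fills them with the column mean. The frame `toy` and its column names are hypothetical and are not part of the homework dataset. The mean is computed after the zeros are hidden as NaN, so the zeros do not pull the mean down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the homework data); zeros stand for missing measurements.
toy = pd.DataFrame({"a": [0.0, 2.0, 4.0], "b": [1.0, 0.0, 3.0]})

# Hide zeros as NaN first so that they are excluded from the column means.
toy_nan = toy.replace(0, np.nan)

# mean() skips NaN by default; fillna() with a Series fills column by column.
toy_filled = toy_nan.fillna(toy_nan.mean())
print(toy_filled)
```

This is only one possible way to express the three steps described in the task; `apply()` or scikit-learn's imputer would work as well.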

#### Task 1: Changing zero value into mean (not graded)

In [None]:
def imputation(df, columns):
    """
     A function to change nan value (or zero value) to the mean of the attribute
        
        - Step 1: Get a part of dataframe using columns received as a parameter.
        - Step 2: Change the zero values in the columns to nan (np.nan)
        - Step 3: Change the nan values to the mean of each attribute (column). 
                  You can use apply() or fillna() function.
        
        Input:
          df: A dataframe that needs to apply imputation
          columns: A list of columns that need to apply imputation
          
        Output:
          An imputed dataframe
    
    """
    return

In [None]:
diabetes = imputation(diabetes, ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"])

Now you can see that the minimum values of the columns have changed, except for Pregnancies.

In [None]:
# `!= None` compares element-wise on a DataFrame, so test identity instead.
if diabetes is not None:
    print(diabetes.describe())

After this simple data preprocessing, let's proceed to our main task: classification.

If you want to skip this part, you may want to use the imputation code from scikit-learn. If you are interested in imputation, you can find more information [here](https://scikit-learn.org/stable/modules/impute.html).
**NOTE**: The imputation function itself is not graded, but it is required for you to run imputation as it will affect other functions' results that are graded. 

<span style="color:blue"> **You HAVE TO run imputation for the next tasks even though it is not graded. It is your responsibility to follow the instructions. If you want to skip implementing it yourself, uncomment the code below and run it to perform imputation.**</span>

In [None]:
# Remove the comments and run the code below if you skip the parts above 

# from sklearn.impute import SimpleImputer
# columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
# df_parts = diabetes.copy()[columns]
# df_parts[df_parts==0] = np.nan
# imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# df_converted = pd.DataFrame(imp.fit_transform(df_parts), columns=columns)
# diabetes[columns] = df_converted
# diabetes.describe()

# 1. Classification

In this assignment, we will run a random forest (RF) and support vector machines (SVM) with different kernels using scikit-learn. As an extra task, you also have a chance to understand the decision tree in detail by implementing it from scratch. We will continue to use **the pre-processed diabetes dataset**!

#### Task 2: Random forest (graded, 0.5 pt)

Here you will run the random forest algorithm using scikit-learn, together with cross-validation. Detailed information about the random forest in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

Your task is as follows:
 1. Create a random forest classifier with the random state stated above (RANDOM_STATE).
 2. Report an average cross-validation score with stratified k-fold with **k=5** into the variable called **rf_cross_val_score** (0.2 pt).
 3. Run grid search with a dictionary having two elements: `max_depth` from 1 to 10, and `min_samples_split` from 2 to 10. Report the best classifier (or the best estimator) into the variable called **rf_best_classifier**. Use the same random forest classifier instance with the random state. Set **k=5** for grid search cross-validation. Since grid search uses stratified k-fold inside, you should pass the complete dataset, not a split training set (0.3 pt).
 
* For clarification, **Task 2** and **Task 3** are independent. You have to run stratified k-fold for **Task 2**, but it will not be used in **Task 3**.
* There are no further partial points for each subtask, so please read the instructions carefully.
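
To show how the scikit-learn pieces fit together, here is a minimal sketch on synthetic data. The data from `make_classification`, the reduced parameter grid, `n_estimators=20`, and all `toy_*` names are assumptions chosen to keep the example fast; they are not the settings required by the task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_rf = RandomForestClassifier(n_estimators=20, random_state=0)

# Stratified k-fold keeps the class ratio roughly equal in every fold.
skf = StratifiedKFold(n_splits=5)
toy_cv_scores = cross_val_score(toy_rf, X_toy, y_toy, cv=skf)
toy_mean_score = toy_cv_scores.mean()

# Grid search tries every parameter combination with its own cross-validation.
toy_grid = GridSearchCV(
    toy_rf,
    {"max_depth": [2, 4], "min_samples_split": [2, 5]},
    cv=5,
)
toy_grid.fit(X_toy, y_toy)
toy_best = toy_grid.best_estimator_  # refitted on all of X_toy by default
print(toy_mean_score, toy_grid.best_params_)
```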

In [None]:
# Import required libraries if needed.


In [None]:
rf = None # CHANGE IT

In [None]:
rf_cross_val_score = None # CHANGE IT

In [None]:
rf_best_classifier = None # CHANGE IT

#### Task 3: SVM with diverse kernels (graded, 0.5 pt)

We already tried a simple SVC with the RBF kernel before. Here you will rerun SVM, but trying different kernels, together with cross-validation. Detailed information about SVC in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC).

Your task is as follows:

  1. Create a standard SVC classifier without setting any parameter.
  2. You may want to re-scale the dataset so that all the attributes have the same range of values. Apply StandardScaler without changing any parameters. You can find information [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). *Please do not apply Scaler to the label* (0.1 pt).
  3. Report the test score of the SVC model with a **holdout test** with **test set ratio = 30%** into the variable called svm_ho_score. This means that you train the model using the training set and report the score using the test set. Since the train_test_split function shuffles the dataset, do not forget to pass the random state stated above (`RANDOM_STATE`) (0.2 pt).
  4. Run grid search with a dictionary stating kernels ['linear', 'poly', 'rbf'] and C = [1, 10, 100], and put the best classifier into the variable called svm_best_classifier. Set **k=10** for grid search cross-validation. Since grid search uses stratified k-fold inside, you should pass the complete dataset, not a split training set (0.2 pt).
  
  
  * For clarification, **Task 3** and **Task 4** are independent. Any produced work in **Task 3** will not be used in **Task 4**.
  * There are no further partial points for each subtask, so please read the instructions carefully. Failing to apply StandardScaler affects the scores of parts 3 and 4, as those are automatically graded.
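
The scaling and holdout steps can be sketched as follows. This is an illustration on synthetic data, not the graded solution: the arrays from `make_classification`, the seed `0`, and the `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Scale the attributes only; the labels y_demo are left untouched.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# 30% holdout; a fixed random_state makes the shuffled split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo_scaled, y_demo, test_size=0.3, random_state=0
)

# score() on a classifier returns the accuracy on the given data.
demo_score = SVC().fit(X_tr, y_tr).score(X_te, y_te)
print(demo_score)
```

After scaling, every column has mean 0 and standard deviation 1, which matters for SVMs because the kernels depend on distances between samples.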

In [None]:
# Apply StandardScaler to change the datasets here



In [None]:
svc = None # CHANGE IT

In [None]:
svm_ho_score = None # CHANGE IT

In [None]:
svm_best_classifier = None # CHANGE IT

#### Task 4: Decision tree implementation (graded, advanced, 2 pt)

This task is extra for those who want to get extra points! We will now implement a decision tree from scratch. Follow the instructions carefully so that your functions return the correct results, which are what we grade. We will also offer a simple test function so you can validate your implementation.

We have two different grading options:

  - 4-1. Implement a decision tree without any constraints (1 pt)
  - 4-2. Allow two main parameters: max_depth (0.5 pt) and min_samples_split (0.5 pt) (in total 1 pt)

Here you can see our structure. Unlike in the labs, we do not provide a class structure for this graded task, since it could cause additional confusion for some students. We have seven separate functions, and here is a brief description of each:

- **dt_fit**: This function is first called with the dataset and creates the tree's root node. It also calls a recursive function to grow the tree.
- **dt_score**: A function returning the accuracy scores of the received dataset and labels.
- **dt_predict**: A recursive function that predicts a row's label by going through the trained tree.
- **find_best_split**: This function examines the best split by trying to split based on each attribute and a specific value.
- **gini_index**: This function receives two groups (left, right) and calculates a Gini index of these two groups based on outcome distribution.
- **leaf_final_value**: This function receives one group and returns the most common label (outcome) so that the tree can terminate with its final decision.
- **split**: This function is a recursive function that calculates the best split and splits the node into two parts until specific criteria are met, such as minimum samples or max depth of the tree.



* Unfortunately, there are no further partial points for each subtask, so please read the instructions carefully.
* Part 2 (4-2) is only counted when you successfully finish part 1 (4-1). So prioritize finishing 4-1 first to get scores.
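
To make the Gini index concrete, here is a worked example on toy labels. The helper `weighted_gini` and its signature (a list of label lists) are hypothetical and deliberately differ from the graded `gini_index(children, classes)`; treat it as a sketch of the arithmetic only.

```python
from collections import Counter


def weighted_gini(groups):
    """Weighted Gini impurity of a split; `groups` is a list of label lists."""
    total = sum(len(labels) for labels in groups)
    gini = 0.0
    for labels in groups:
        if not labels:  # an empty child contributes nothing
            continue
        counts = Counter(labels)
        # Sum of squared class proportions inside this child.
        score = sum((count / len(labels)) ** 2 for count in counts.values())
        # Impurity of the child, weighted by its share of the samples.
        gini += (1.0 - score) * (len(labels) / total)
    return gini


# Left child: two 0s and one 1; right child: two 1s (pure).
toy_left, toy_right = [0, 0, 1], [1, 1]
toy_gini = weighted_gini([toy_left, toy_right])
print(toy_gini)
```

By hand: the left child has proportions 2/3 and 1/3, so its impurity is 1 - (4/9 + 1/9) = 4/9, weighted by 3/5. The right child is pure and contributes 0, giving 4/15 in total. A lower value means a cleaner split.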

In [None]:
def dt_fit(X, y, min_samples_split=1, max_depth=np.inf):
    """
    This function works as an entrance function of the recursive process of training the tree.

    - Step 1: Run find_best_split function to guess the best split for the root node.
    - Step 2: Run split function with the best split information you get from find_best_split. 
              Put two constraints (min_samples_split, max_depth) along with the dataset and its labels into the split function.
    - Step 3: Return the root node, and this root node should contain the complete tree information.

    Input:
      X: Training dataset.
      y: Training labels.
      min_samples_split: Constraint. A node with fewer samples than this value is not split further.
      max_depth: Constraint. Maximum depth from the root at which the algorithm stops splitting.
      * X and y should have the same size.

    Output:
      root: A root node having the whole information of the tree after completing recursion.
    """
    return
        
def find_best_split(X, y):
    """
    A function to find out the best split option of the current node.
    The input should be the datasets only belonging to each node!

    - Step 1: Get possible unique labels of the current node
    - Step 2: Iterate each column and the possible unique values of each column (double loops),
              and try dividing a node into two parts by the specific value of the chosen column (the current two values of the loop).
    - Step 3: Calculate a gini index of the node with the separated parts by chosen column and value.
              Since we are dealing with continuous values, divide the datasets with the following criteria:
               - if the value of the chosen column is lower than the chosen value -> assign it to the left node
               - otherwise (higher or equal to) -> assign it to the right node.
               - Then we can call gini_index function with those two nodes' information.
    - Step 4: By calling gini_index function for every (column, value) pair,
              get the best gini index by iterating all the values and columns from the dataset that the node has.
    - Step 5: With chosen criteria, create a node structure with the following information: [column, value to split, children]
              It is up to you to specify the structure, but here is an example.
              {'index': A chosen column name having the best gini index score,
               'value': A chosen value in the index column having the best gini index score,
               'children': A list that contains splitted groups [left, right]}.

    Input:
      X: A part of training dataset belonging to a specific node.
      y: A part of training labels belonging to a specific node.
      * X and y should have the same size.

    Output:
      A dictionary data structure containing
      {feature to split (as an column name or index of dataframe), value to split, children}.
      
    """
    return

def gini_index(children, classes):
    """
    A function that calculates a gini index value by receiving two groups separated by a specific value and column.

    - Step 1: Save the size of the whole children (size of the left node + size of the right node).
    
    * For each child
    - Step 2: Take the size of the dataset in a child.
    - Step 3: Get a proportion of each class in the child and add the square of the proportion to the score.
    - Step 4: Calculate a gini index for the child: (1 - score) * (size of the child / total size of children).
    
    - Step 5: Do [step 2-step 4] for each child and sum all the gini values and return it.

    * You can also refer to the lecture note to get details of the gini index.
    
    Input
      children: A list that contains the split groups [left, right].
      classes: Possible outcomes of the part of dataset.
    Output
      A gini index value.
    """
    return 


def leaf_final_value(y):
    """
    A function that returns the most common label given labels in a specific node.

    Input
      y: A list of labels of the part of dataset.
    Output
      The most common label in the input series.
    """
    return

def split(node, depth, min_samples_split=1, max_depth=np.inf):
    """
    A recursive function to split the node into two parts based on [the result from find_best_split function].
    This function will create left and right children in the node structure (so the current node will have all its children).
    
    If you only developed part 4-1, just follow Step 1 and Step 4.

    - Step 1: Termination 1: Check whether the size of left and right child is zero.
              If so, call leaf_final_value for **both children** to finalize the node.
    - Step 2: Termination 2: If the depth of current node reaches the maximum depth parameter (max_depth),
              again call leaf_final_value to finalize the node.
    
    * Step 3-4 should be applied to each child separately.
    - Step 3: Termination 3: If the number of samples in the left or right node is smaller than our threshold (min_samples_split),
              again call leaf_final_value to finalize the corresponding child (left or right).
    - Step 4: Otherwise, call find_best_split function to find out the current optimal spliter values. 
              (find_best_split receives the dataset and its labels separately)
              Save the result into the current target node (left or right).
              Then call split function using the same node object to grow the tree. The node should already have its children information from find_best_split function.
              In this case, we need to add one to the current depth.
              Also, if you developed min_samples_split and max_depth logic, don't forget to send those parameters as well.

    Input
      node: Current node information (returned from find_best_split).
            A data structure containing [column, column value to split, children].
            Refer to find_best_split for more information.
      depth: Current level of the tree.
      min_samples_split: Constraint. A node with fewer samples than this value is not split further.
      max_depth: Constraint. Maximum depth from the root at which the algorithm stops splitting.
    Output
      None. A 'root' node structure will recursively grow having left, right children with a full tree information.

    """
    return

def dt_score(tree, X, y):
    """
    A function returning the accuracy scores of the received dataset and labels.

    - Step 1: Iterate over each row in the test dataset and predict the label by calling the dt_predict function.
    - Step 2: Collect all predicted labels in order and return accuracy score by comparing the labels to the truth.
     * You can use scikit-learn's accuracy_score() method for your convenience.

    Input:
      tree: A trained tree returned by the dt_fit function.
      X: A test dataset.
      y: Test labels.
    Output:
      An accuracy score.

    """
    return
    
def dt_predict(node, row):
    """
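    A recursive function that predicts a row's label by going through the trained tree.

    - Step 1: Start from the given node (initially the root of the trained tree) and check its split criterion.
              If the row's value in the split column is lower than the split value, follow the left child;
              otherwise (higher or equal), follow the right child.
    - Step 2: If the chosen child is a leaf (a final label), return it.
              Otherwise, call dt_predict again with the chosen child as the node.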

    Input:
      node: A current node to check for splitting.
      row: A single row in a dataset.
    Output:
      A predicted label.

    """
    return

# 2. Evaluation 

#### Task 5: Precision, Recall, F1-score (graded, 0.3 pt)

You will evaluate the random forest and the support vector machine classifier with various performance measures besides accuracy such as precision, recall, and F1-score, also using scikit-learn. Here we continue to use the Pima Indians Diabetes Database dataset.

Your task is as follows:

1. Scale the attributes in the dataset using *StandardScaler*. Please don't apply it to the labels.
2. Create an instance of the SVC classifier without setting any constraint.
3. Divide the dataset into two parts: a training set and a test set using the train_test_split method. Assign 30% of the dataset to the test set. 
  Please **turn off** shuffling the data.
4. Fit the model using the training set.
5. Report the precision score (0.1 pt), recall score (0.1 pt), and F1-score (0.1 pt) using the test set, and save them into the variables called *recall_score_svc, precision_score_svc, f1_score_svc*. You can find information about the performance measures [here](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics). We require you to calculate the scores using the following functions: *precision_score, recall_score, f1_score*.


* Scaling the data can affect the results and therefore your scores. Please follow the instructions carefully.
* There are no partial points except for the ones mentioned in part 5. No points are given if the result is incorrect, so you should solve parts 1-4 correctly as well.
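
As a quick check of what the three metric functions compute, and of what turning off shuffling does, consider the toy sketch below. The label arrays and the `toy_*` names are made up for illustration and have nothing to do with the diabetes data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true_toy = np.array([1, 0, 1, 1, 0, 0])
y_pred_toy = np.array([1, 0, 0, 1, 1, 0])

# Each function takes the true labels first, then the predictions.
toy_precision = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP)
toy_recall = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN)
toy_f1 = f1_score(y_true_toy, y_pred_toy)                # harmonic mean of both

# With shuffle=False the split keeps the row order: the last 30% become the test set.
ordered = np.arange(10)
first_part, last_part = train_test_split(ordered, test_size=0.3, shuffle=False)
print(toy_precision, toy_recall, toy_f1, last_part)
```

Here precision, recall, and F1 all equal 2/3, and the unshuffled split puts the last three elements in the test set.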

In [None]:
recall_score_svc = None # CHANGE IT
precision_score_svc = None # CHANGE IT
f1_score_svc = None # CHANGE IT

#### Task 6: AUC / AUPRC (graded, 0.2 pt)

You will evaluate the random forest and the support vector machine classifier with various performance measures related to the ROC curve, such as the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPRC).

Your task is as follows:

1. Create an instance of a random forest classifier without setting any constraint. Don't forget to set the random state to our value RANDOM_STATE.
2. Divide the dataset into two parts: a training set and a test set using the train_test_split method. Assign 30% of the dataset to the test set. As the method will shuffle the data, please again set the random state to our value RANDOM_STATE.
3. Fit the model using the training set. Please note that we no longer use the scaled dataset from the previous task. Use the original dataset here.
4. Report AUC (0.1 pt) and AUPRC (0.1 pt) using the test set, and save them into the variables called *auc_rf, auprc_rf*. You can find information about the performance measures [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) for the AUC score, and [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) for the AUPRC score. AUPRC has many names, and it is supported as *average precision score* in scikit-learn. We require you to calculate the scores using the following functions: *roc_auc_score, average_precision_score*.
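
Both functions rank samples by a continuous score for the positive class rather than by a hard 0/1 prediction. The sketch below uses hand-written scores to show this; the arrays and `toy_*` names are assumptions for illustration. For a fitted classifier, such scores would typically come from something like `predict_proba(X_test)[:, 1]`.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

toy_truth = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical scores for class 1

# AUC: fraction of (positive, negative) pairs that are ranked correctly.
toy_auc = roc_auc_score(toy_truth, toy_scores)

# AUPRC: precision averaged over the recall steps of the ranking.
toy_auprc = average_precision_score(toy_truth, toy_scores)
print(toy_auc, toy_auprc)
```

Three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. The average precision is 0.5 * 1 + 0.5 * (2/3), which is about 0.83.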

In [None]:
auc_rf = None # CHANGE IT
auprc_rf = None # CHANGE IT

#### Task 7: Apply them together with scikit-learn (graded, 0.5 pt)

Here you will apply grid search using the performance measures you tried in Task 5 and Task 6, and pick the best-performing model in terms of a specific performance measure.

Our dataset is imbalanced, meaning that healthy patients are the majority class. Therefore, we can expect the best model to differ depending on the measure, and we may also need to use AUPRC to get the most suitable model.

Your task is as follows:

1. Create an instance of a kNN classifier without setting any constraint. 
2. Run grid search with a dictionary stating n_neighbors from 1 to 10, and use two different scoring measures: AUPRC (average_precision) and F1-score (f1). So you may want to run two separate grid searches.
3. Put the best classifiers into the respective variables called *auprc_best_classifier* (0.25 pt) and *f1_best_classifier* (0.25 pt). Set cv=5 for grid search cross-validation. Since grid search uses stratified k-fold inside, you should pass the complete dataset, not a split training set.


* Unfortunately, there are no further partial points for each subtask, so please read the instructions carefully.
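
The `scoring` argument is what makes grid search optimise a measure other than accuracy. A minimal sketch on synthetic, mildly imbalanced data is shown below; the data, the short `n_neighbors` list, and the names `X_imb`, `y_imb`, and `best_by_metric` are assumptions, not the task's required settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with roughly 70% negatives (assumption).
X_imb, y_imb = make_classification(
    n_samples=300, n_features=6, weights=[0.7, 0.3], random_state=0
)

toy_neighbors = {"n_neighbors": [1, 3, 5]}
best_by_metric = {}

# One search per scoring string; the winner may differ between measures.
for metric in ("average_precision", "f1"):
    search = GridSearchCV(KNeighborsClassifier(), toy_neighbors, scoring=metric, cv=5)
    search.fit(X_imb, y_imb)
    best_by_metric[metric] = search.best_estimator_
    print(metric, search.best_params_, round(search.best_score_, 3))
```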

In [None]:
auprc_best_classifier = None
f1_best_classifier = None

#### Task 8: Task 5 implementation (graded, advanced, 1 pt)

This extra task requires you to implement the following performance measures:
 - Accuracy (0.25 pt)
 - Precision (0.25 pt)
 - Recall (0.25 pt)
 - F1-score (0.25 pt)
 
All inputs will be NumPy arrays, so you can use any NumPy array methods to calculate the scores.
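
All four measures are built from the same confusion-matrix counts. As a hint, the sketch below counts them with boolean masks on toy arrays. It assumes that the positive class is labelled 1, and the arrays and variable names are made up; turning the counts into the four scores is left to you.

```python
import numpy as np

toy_predicted = np.array([1, 0, 1, 1, 0])
toy_truth = np.array([1, 0, 0, 1, 1])

# Element-wise comparisons give boolean arrays; & combines them, sum() counts True.
tp = int(np.sum((toy_predicted == 1) & (toy_truth == 1)))  # true positives
fp = int(np.sum((toy_predicted == 1) & (toy_truth == 0)))  # false positives
fn = int(np.sum((toy_predicted == 0) & (toy_truth == 1)))  # false negatives
tn = int(np.sum((toy_predicted == 0) & (toy_truth == 0)))  # true negatives
print(tp, fp, fn, tn)
```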

In [None]:
def accuracy_manual(predicted, truth):
    # Write a logic and return accuracy
    return

In [None]:
def precision_manual(predicted, truth):
    # Write a logic and return precision
    return 

In [None]:
def recall_manual(predicted, truth):
    # Write a logic and return recall
    return 

In [None]:
def f1_score_manual(predicted, truth):
    # Write a logic and return f1 score
    return

Once you have completed the functions, you can run the following line to check whether they are correct. Note that we will evaluate your functions with different data, so please implement them carefully.

In [None]:
check_scores(accuracy_manual, precision_manual, recall_manual, f1_score_manual)

# 3. Deployment

You will learn how to pick the best model using cross-validation and deploy the best model as a file. This task will only be graded if you intend to do an extra task (HW 3.3), as loading the best model from this lab is one of the requirements of the next extra assignment. This part will be taught in Lab 5.

Your task is as follows:

1. Scale the attributes in the dataset using *StandardScaler*.
2. Create an instance of an SVC classifier without setting any constraint.
3. Run grid search with a dictionary stating a list of C values [1, 10, 100], and kernels ['linear', 'poly', 'rbf']. When examining the 'poly' kernel, please also find the best classifier by testing degree = [2, 3, 4]. You may need to make more than one dictionary. Please use **AUPRC** as the scoring measure. Set cv=5 for grid search cross-validation. Since grid search uses stratified k-fold inside, you should pass the complete dataset, not a split training set.
4. Save the best classifier into the variable called *svm_best_classifier_2* and train this classifier on the whole dataset using the fit method of the best classifier returned by grid search.
5. Save the trained model using pickle and use this model as your deployed model for the Dash visualization. Detailed instruction can be found in Lab 5.
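
When a parameter only makes sense for some settings (such as `degree` for the polynomial kernel), scikit-learn accepts a list of dictionaries as the grid. The sketch below uses `ParameterGrid` to show how such a list expands; the shortened value lists and the names `toy_param_grid` and `combos` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Each dictionary is expanded on its own, then the results are concatenated.
toy_param_grid = [
    {"kernel": ["linear", "rbf"], "C": [1, 10]},
    {"kernel": ["poly"], "C": [1, 10], "degree": [2, 3]},
]

combos = list(ParameterGrid(toy_param_grid))
for combo in combos:
    print(combo)
```

The first dictionary yields 2 x 2 = 4 combinations and the second 1 x 2 x 2 = 4, so 8 candidates are tried in total, and `degree` is only varied for the polynomial kernel. `GridSearchCV` accepts the same list-of-dictionaries form for its parameter grid.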

Completing only this task will not be graded. To get one extra point, you need to use the **best model** from this task to show the **Dash** application. We will check the following points:

 1) Whether the student successfully finds out the best classifier by following the instruction correctly.
 
 2) Whether the student deploys the model successfully using the Dash framework with the given dataset.
 

- It is highly recommended to finish Lab 5 before starting this section. You need to modify the files provided in Lab 5 (dash_example_web, helper_dash_example) to suit this task, that is, to handle a different dataset with different columns and a different target label. This does not explicitly require knowledge of HTML or web programming.
- You do not need to change the whole appearance, but the application should work with the dataset in this homework and the best model derived in this task. This means that the deployed website should classify new user input.
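
Saving and reloading a fitted model with pickle can be sketched as follows. The tiny training arrays, the file name `toy_model.pkl`, and the use of a temporary directory are assumptions for illustration; in the homework you would choose a path that your Dash application can read.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Tiny hypothetical training set, just to have a fitted model to save.
X_small = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_small = np.array([0, 0, 1, 1])
model = SVC().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    # Pickle files must be opened in binary mode ("wb" / "rb").
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored object behaves like the original fitted model.
same_predictions = bool((restored.predict(X_small) == model.predict(X_small)).all())
print(same_predictions)
```

If the model was trained on scaled attributes, new user input has to be scaled in the same way before prediction, so it can be convenient to save the fitted scaler alongside the model.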

In [None]:
svm_best_classifier_2 = None # CHANGE IT