# HW 3: Classification, Evaluation, and Deployment

In this homework, you will work through a complete machine learning cycle, from data preparation to deployment. You will prepare a dataset, build models, evaluate them to find the best fit, and deploy the result to a simple web page. The main objective is to practice classification and evaluation methods, so we apply only the essential data preprocessing techniques and focus on classification and evaluation.

We will use the **Adult** dataset from the UCI repository; more information about the data is available [here](http://archive.ics.uci.edu/ml/datasets/Adult). Since we have removed and changed parts of the dataset for grading purposes, use the version that we provide on ilearn.

The dataset contains census information used to check whether a person's income exceeds $50K/yr. It consists of 14 attributes and one binary class variable:

- income: >50K, <=50K

- age: continuous.

- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

- fnlwgt: continuous.

- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

- education-num: continuous.

- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

- sex: Female, Male.

- capital-gain: continuous.

- capital-loss: continuous.

- hours-per-week: continuous.

- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

The binary class `income` takes the value `>50K` or `<=50K`.

**NOTE**
- Unlike the labs, each function you make here will be **graded**, so it is important to *strictly* follow the instructions.
- **Import** all necessary libraries yourself whenever needed. Code that fails to run can affect your grade.

## Contents

- **Total points: 6.0 pt**

0. Preparation (0.7 pt)
 - Task 1: Drop missing values (0.1 pt).
 - Task 2: Assign X and y (0.1 pt).
 - Task 3: One-hot encoding (0.1 pt).
 - Task 4: Train test split (0.2 pt).
 - Task 5: Standardization (0.2 pt).
1. Classification (2.9 pt)
 - Task 6: Random forest (0.5 pt).
 - Task 7: SVM with diverse kernels (0.4 pt).
 - Task 8: Decision tree and Random Forest (2.0 pt).
2. Evaluation (1.8 pt)
 - Task 9: Accuracy, Precision, Recall, F1-score (0.3 pt).
 - Task 10: AUC/AUPRC (0.3 pt).
 - Task 11: Apply them together with scikit-learn (0.4 pt).
 - Task 12: Manual implementation of performance metrics (0.8 pt).
3. Deployment (0.6 pt)
 - Task 13: Save models into a file using pickle (0.3 pt).
 - Task 14: DASH deployment (0.3 pt).

# 0. Preparation

##### Student information

Please provide your information for automatic grading.

In [None]:
STUD_SUID = 'lobe2042'
STUD_NAME = 'Longho Bernard Che'
STUD_EMAIL = 'lobe2042@student.su.se'

#### Basic libraries

These libraries will be frequently used throughout the homework. Do not change the block below.

In [None]:
import numpy as np
import pandas as pd

RANDOM_STATE = 12345  #Do not change it!
np.random.seed(RANDOM_STATE)  #Do not change it!

#### Load the dataset

Use the **Adult** dataset located on ilearn, and load it here using pandas.

In [None]:
adult = pd.read_csv("datasets/adult.data", sep=",", header=None, skipinitialspace=True)

You can run the line below to give the dataframe proper column names.

In [None]:
adult.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation',
                 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
                 'income']

Here you can find out some basic information by calling *info(), head()*, and *describe()*.

In [None]:
adult.info()

In [None]:
adult.head()

In [None]:
adult.describe()

It seems like there is no null data. However, if you read the description of the dataset, it says that there are missing parts represented as "?". You can count them by using the same technique we used for checking nulls in the previous lab. We have missing values in specific columns only, and it is about 5% of data records.

In [None]:
(adult == "?").sum()

#### Task 1: Drop missing values (0.1 pt)

There are many ways to handle missing values, such as imputing the median or mean, but we will practice the simplest way: removing the rows with missing values.

- Complete the function below which removes any rows with missing values.

In [None]:
def drop_missing_values(df, miss):
    """
    Input: 
      df: the dataframe (adult in our case)
      miss: a character to represent missing value ("?" in our case)
      
    Output: the dataframe without the missing values

    Step 1: Replace the value 'miss' with np.nan.
    Step 2: Drop the nan values and store the result in data_dropped.
    Step 3: Return data_dropped
    
    """
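    # Replace the placeholder with NaN on a copy, so the caller's dataframe is not modified.
    data_replaced = df.replace(to_replace=miss, value=np.nan)
    data_dropped = data_replaced.dropna(axis=0)  # drop any row that contains np.nan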
    return data_dropped

Apply your function to our dataset `adult` and save the result to `adult_dropped`.

In [None]:
adult_dropped = drop_missing_values(adult, "?")

The output of the function should have the same attributes but fewer rows. Check how many rows were removed. Your dataset should have 30,162 rows!

In [None]:
adult_dropped.shape

#### Task 2: Assign X and y (0.1 pt)

Then let's split our dataset into two parts (`X` for attributes and `y` for labels) to use scikit-learn's various methods.
- Use `adult_dropped`.
- `X` should have all the attributes without the labels (the last column).
- `y` should be a Pandas Series only with the labels.

In [None]:
X = adult_dropped.iloc[:, :-1]  # all attributes except the label (the last column)
y = adult_dropped.iloc[:, -1]  # the label column as a pandas Series
print(y)

Check the type and size here.

In [None]:
(X.shape, y.shape, type(X), type(y))

#### Task 3: One-hot encoding (0.1 pt)

Unfortunately, scikit-learn does not support categorical attributes very well, even for decision trees, so we need to convert them into a reasonable numerical form to fit the algorithms. One way to do this is one-hot encoding, which transforms a categorical column into multiple numerical columns, one for each possible value. There are various ways to apply it, for example using [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) or [pandas](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html), but here we will use the pandas function to keep the dataframe structure.
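
To make the transformation concrete, here is a minimal sketch on a made-up two-column frame (the `toy` data is illustrative and not part of the Adult dataset). It shows that `pd.get_dummies` leaves numerical columns untouched and expands a categorical column, and that `drop_first=True` drops one redundant level.

```python
import pandas as pd

# Toy frame: one numerical and one categorical column (illustrative values only).
toy = pd.DataFrame({"age": [25, 40, 31], "sex": ["Male", "Female", "Male"]})

# Without drop_first, every category gets its own indicator column.
full = pd.get_dummies(toy)

# With drop_first=True, the first level in sorted order ("Female") is dropped,
# because it is implied when all remaining indicators are 0.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

The numerical column `age` passes through unchanged in both cases; only the categorical column is expanded.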

- Finish the `one_hot_encoding` function, which applies one-hot encoding to a given dataframe.


In [None]:


def one_hot_encoding(df):
    """
    Input:
        df: the attributes (X in our case)
    Output: one-hot encoded dataframe
    
    Step 1: Use pd.get_dummies to convert df to a one-hot-encoded form. 
            Enable an option called drop_first to remove duplication.
    Step 2: Return the one-hot-encoded dataframe.
    
    * Those steps and suggested method are just for your convenience. You can use your own choice of methods.
    """
    df_onehot = pd.get_dummies(data=df, drop_first=True)
    return df_onehot

- Create `X_onehot` by calling `one_hot_encoding` function with `X`.

In [None]:
X_onehot = one_hot_encoding(X)

Check your result by calling any methods you learned. If you successfully followed the instruction, the output (`X_onehot`) should have 96 columns.

In [None]:
X_onehot.head()

#### Task 4: Train test split (0.2 pt)

We also need to split our dataset further into four parts for evaluation.

- Use scikit-learn's `train_test_split` function to divide the dataset into four parts.
- Follow the instructions below carefully to get the points.
    - Use `X_onehot` and `y`.
    - Assign 30% to a test set.
    - Use our random state (`RANDOM_STATE`)
    - Enable stratify option.
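
As a rough illustration of what the stratify option does, the sketch below splits a made-up imbalanced label vector (the `toy_X` and `toy_y` arrays are assumptions for illustration only) and prints the class ratio in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 200 samples, 20% positives (illustrative only).
toy_X = np.arange(200).reshape(-1, 1)
toy_y = np.array([1] * 40 + [0] * 160)

# test_size is a fraction here: 0.3 means 30% of the rows.
# (An integer such as 30 would instead mean exactly 30 rows.)
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=12345, stratify=toy_y
)

print(len(y_tr), len(y_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positives in each part
```

With stratification, the share of positives is the same in the training and test parts as in the full label vector.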

In [None]:
from sklearn.model_selection import train_test_split

# Split the one-hot encoded data: 30% test set, fixed random state, stratified on the labels.
X_train, X_test, y_train, y_test = train_test_split(X_onehot, y, test_size=0.3, random_state=RANDOM_STATE,
                                                    stratify=y)

Check the type and size here.

In [None]:
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#### Task 5: Standardization (0.2 pt)

After removing the missing values and splitting the data into X and y, we need to take care of our numerical attributes. As you can check with the `describe()` function, we have six numerical attributes, and they all have different means and standard deviations. Not all machine learning models require standardized numerical attributes, but some do. In this homework, SVM is the model that may require standardization, so it is better to prepare a standardized version of the data during data preparation.

- One-hot encoded data does not need to be standardized, so you need to select the numerical columns only:
 - ["age", "fnlwgt",  "education-num", "capital-gain", "capital-loss", "hours-per-week"]
 
- For this task, you need to import sklearn's `StandardScaler`.
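
The reason for fitting the scaler on the training set only can be sketched with a tiny made-up example (the arrays below are illustrative assumptions). The scaler learns the mean and standard deviation from the training data, and the test data is then transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical column (illustrative values only).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
scaler.fit(train)                   # learn mean and std from the training set only

train_st = scaler.transform(train)  # training set ends up with mean 0 and std 1
test_st = scaler.transform(test)    # test set is scaled with the *training* statistics

print(scaler.mean_, scaler.scale_)
print(train_st.ravel(), test_st.ravel())
```

Calling `fit_transform` on the test set instead would re-estimate the statistics from the test data, so the two sets would no longer be on the same scale.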

In [None]:
NUMERICAL_ATTRIBUTES = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

In [None]:
from sklearn.preprocessing import StandardScaler
from pandas import DataFrame


def standardize(X_train, X_test, numerical=NUMERICAL_ATTRIBUTES):
    """
    Input:
        - X_train: A split training set from Task 4
        - X_test: A split test set from Task 4
        - numerical: Numerical columns that should be standardized
    Output:
        - X_train_st: Standardized numerical attributes of the training set (ndarray)
        - X_test_st: Standardized numerical attributes of the test set (ndarray)
    
    Step 1: Initialize StandardScaler.
    Step 2: Create X_train_numerical and X_test_numerical by selecting the numerical columns from the original
            X_train and X_test.
    Step 3: Fit StandardScaler on the numerical columns of the training set only.
    Step 4: Use the fitted StandardScaler to transform the numerical columns of both sets, and assign the
            results to X_train_st (for the training set) and X_test_st (for the test set). This standardizes
            both sets based on the statistics of the training set.
    Step 5: Return X_train_st, X_test_st
    
    """

    # Step 1
    sc = StandardScaler()

    # Step 2
    X_train_numerical = X_train[numerical]
    X_test_numerical = X_test[numerical]

    # Step 3
    # Fit on the training set only, so that no test-set statistics leak into the scaler.
    sc.fit(X_train_numerical)

    # Step 4
    # Transform both sets with the statistics learned from the training set.
    X_train_st = sc.transform(X_train_numerical)
    X_test_st = sc.transform(X_test_numerical)

    # Step 5
    # Note that these two variables should only contain the numerical attributes, not all columns.
    return X_train_st, X_test_st

In [None]:
def standardize_wrapper(X_train, X_test):
    # DO NOT CHANGE THIS FUNCTION
    # This function is to ensure that the datasets keep the Pandas DataFrame format.
    if X_train.shape == (0, 0): return pd.DataFrame([0]), pd.DataFrame([0])

    numerical = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

    X_train_st, X_test_st = standardize(X_train, X_test)

    X_train_st_df = X_train.copy()
    X_train_st_df[numerical] = X_train_st
    X_test_st_df = X_test.copy()
    X_test_st_df[numerical] = X_test_st

    return X_train_st_df, X_test_st_df

The line below will apply your standardization function to the datasets. Run the block and check the result. 
DO NOT change `standardize_wrapper`, just implement `standardize`.

In [None]:
X_train_st, X_test_st = standardize_wrapper(X_train, X_test)

Your numerical attributes (["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]) should have a mean near zero and a standard deviation near one.

In [None]:
X_train_st.describe()

After finishing this simple data preprocessing, let's proceed to our main task: classification.

# 1. Classification

In this assignment, we will run random forest (RF) and support vector machine (SVM) with different kernels using scikit-learn. Then we will implement score functions for decision trees and main functions for random forests to understand the concepts better. We will continue to use the pre-processed Adult dataset from section zero (Tasks 1-5).

#### Task 6: Random forest (graded, 0.5 pt)

Here you will run the random forest algorithm using scikit-learn, together with cross-validation. Detailed information about the random forest in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

Your task is as follows:
 1. Create a random forest classifier `rf` with the random state defined above (`RANDOM_STATE`). Do not specify any other parameters.
 2. Report an average cross-validation score `rf_cross_val_score` with stratified k-fold with **cv=5**. You should report the average score, not a list of the scores. Use `X_onehot` and `y`, not the training or test set (0.2 pt). 
 - ***Note that you are reporting an average cross validation score, not a list of scores.***

 3. Run grid search `gs` with a single dictionary `grid_dict` with two keys, 1) `max_depth` from 1 to 3 (inclusive) and 2) `min_samples_split` from 2 to 4 (inclusive), and save the best classifier to the variable `rf_best_classifier`. Set **cv=5** for grid search cross-validation. Since grid search uses stratified k-fold internally, you should pass the complete dataset, not a split training set (use `X_onehot` and `y`). This task can take more than five minutes depending on computing power (0.3 pt).

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV

grid_dict = {
    'max_depth': [1, 2, 3],
    'min_samples_split': [2, 3, 4]
}


In [None]:
rf = RandomForestClassifier(random_state=RANDOM_STATE)

# train the model
rf.fit(X=X_onehot, y=y)

# predict the labels
predictions = rf.predict(X=X_onehot)
#print(predictions)

In [None]:
rf_cross_val_score_list = cross_val_score(estimator=rf, X=X_onehot, y=y, cv=5)
rf_cross_val_score = np.mean(rf_cross_val_score_list)

Run this line to check your score. Your score should be above 0.80.

In [None]:
rf_cross_val_score

In [None]:
gs = GridSearchCV(estimator=rf, param_grid=grid_dict, cv=5)
gs.fit(X_onehot, y)

In [None]:
rf_best_classifier = gs.best_estimator_

Report your best classifier here.

In [None]:
rf_best_classifier

#### Task 7: SVM with diverse kernels (graded, 0.4 pt)

We already tried a simple SVC with the RBF kernel before. Here you will run SVM again, but with different kernels, and together with cross-validation. Detailed information about SVC in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC).

Your task is as follows:

  1. Create a standard SVC classifier `svc` without setting any parameter.
  2. Report a test score to `svm_ho_score` using your standardized training set `X_train_st` and test set `X_test_st` (0.1 pt).
  3. Run grid search with a list of two parameter dictionaries, one with kernel = ['linear', 'poly', 'rbf'] and the other one with C = [1, 10, 100]. This means you have to create a list containing two different dictionaries. Save the best classifier to the variable `svm_best_classifier`. Set **cv=3** for grid search cross-validation. Use the standardized training set `X_train_st` (with `y_train`) for the search. This task can take more than five minutes depending on computing power (0.3 pt).
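
Note the difference between a list of two dictionaries and a single dictionary with two keys. A minimal sketch using scikit-learn's `ParameterGrid`, which enumerates the parameter combinations of a grid, makes this visible (the variable names `separate` and `combined` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# A list of two dictionaries: each dictionary is expanded separately, so the
# candidates are 3 kernels (with the default C) plus 3 values of C (with the default kernel).
separate = [
    {"kernel": ["linear", "poly", "rbf"]},
    {"C": [1, 10, 100]},
]

# A single dictionary with both keys: every kernel/C combination is a candidate.
combined = {"kernel": ["linear", "poly", "rbf"], "C": [1, 10, 100]}

print(len(ParameterGrid(separate)))
print(len(ParameterGrid(combined)))
```

The list form therefore fits fewer candidates than the full cross product, which matters for the running time of this task.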

In [None]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X=X_train_st, y=y_train)

In [None]:
svm_ho_score = svc.score(X=X_test_st, y=y_test)

Run this line to check your score. Your score should be above 0.80.

In [None]:
svm_ho_score

In [None]:
grid_dict = [
    {'kernel': ['linear', 'poly', 'rbf']},
    {'C': [1, 10, 100]}
]

gs = GridSearchCV(estimator=svc, param_grid=grid_dict, cv=3)  # cv=3, as required by the task
gs.fit(X=X_train_st, y=y_train)
svm_best_classifier = gs.best_estimator_


In [None]:
svm_best_classifier

#### Task 8: Decision tree and Random Forest (2.0 pt)

We will now implement a few modules for decision tree and random forest. Follow the instruction carefully so that you can return a correct result. This task is composed of two subtasks as follows:

  - 8-1. Entropy, gini index, and information gain (0.7 pt)
  - 8-2. Random forest implementation (1.3 pt)

The first section of this task is to create three functions used to evaluate and grow the tree, which are covered in the lecture. Entropy and the gini index are the two main impurity scores. Information gain is the final score used to choose a feature for splitting a node. These scores are essential for a decision tree to work properly; a wrong score can lead to choosing features that are not suitable for building a high-performing tree.

- For simplicity, you will not use the **adult** dataset in this task but will use a simple **playgolf** dataset with categorical attributes.

- Task 8 is a chained task, and the grade is evaluated from the results of the functions. Since one function calls other functions in the task, failing to develop one function can affect the whole grade.

Import playgolf dataset to `playgolf`. You can find it in the homework file.

In [None]:
playgolf = pd.read_csv('datasets/playgolf.csv')

In [None]:
playgolf.columns

In [None]:
playgolf.head()

- Subtask 1: Create a gini index function.
 - The function receives a list and calculates its gini index, which is one minus the sum of the squared probabilities of each class.
 - You should double check the lecture slides and the examples below to make sure you made a correct function.
 - You can use collections.Counter() to count labels of the dataset.

In [66]:
from collections import Counter

def gini(dataset):
    """
    A function that calculates gini index of a given list.
    
    Input
     - dataset: a list of labels.
    Output
     - impurity: gini index of the list.
    
    You do not need to keep the output name of this function, the grade only depends on the correct outputs.
    
    """
    sum_of_squared_probabilities = 0.0
    number_of_items = len(dataset)
    counter = Counter(dataset)
    for item in counter:
        probability = counter[item]/number_of_items
        #print(f"Item = {item}, count = {counter[item]}, probability = {probability}")
        sum_of_squared_probabilities += (probability*probability)

    impurity = 1 - sum_of_squared_probabilities
    return round(impurity, 4)

Your gini index is expected to have the following results:
- `0.5` for `[0,0,1,1]`
- `0.4082` for `[0,0,0,0,0,1,1]`
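
For reference, the definition can be written as a formula and checked by hand against the second expected value. Here $p_k$ denotes the proportion of labels that belong to class $k$ (the notation is ours and may differ from the lecture slides):

$$
G = 1 - \sum_{k} p_k^2
$$

For `[0,0,0,0,0,1,1]` the proportions are $p_0 = 5/7$ and $p_1 = 2/7$, so

$$
G = 1 - \left(\tfrac{5}{7}\right)^2 - \left(\tfrac{2}{7}\right)^2 = 1 - \tfrac{25}{49} - \tfrac{4}{49} = \tfrac{20}{49} \approx 0.4082
$$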

In [67]:
gini([0, 0, 1, 1])

0.5

Report a gini score of the `Temp` attribute of **playgolf** to `gini_score` (0.2 pt).

In [None]:
gini_score = gini(playgolf['Temp'])  # CHANGE IT

Print your score here!

In [None]:
gini_score

- Subtask 2: Create an entropy function (0.2 pt).
 - You should double check the lecture slides and the examples below to make sure you made a correct function.
 - You can use collections.Counter() to count labels of the dataset.

In [None]:
def entropy(dataset):
    """
    A function that calculates the entropy of a given list.
    
    Input
     - dataset: a list of labels.
    Output
     - impurity: entropy value of the list.
    
    You do not need to keep the output name of this function, the grade only depends on the correct outputs.
    
    """
    impurity = None  # CHANGE IT
    return impurity

Your entropy is expected to have the following results:
- `1.0` for `[0,0,1,1]`
- `0.8631` for `[0,0,0,0,0,1,1]`
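
As a sanity check on these expected values, here is a minimal sketch of the Shannon entropy with a base-2 logarithm. It is one possible implementation, under the assumption that the lecture uses base 2 (which the expected value `1.0` for a balanced binary list suggests). The name `entropy_sketch` is hypothetical and chosen so that it does not clash with the graded `entropy` function.

```python
from collections import Counter
from math import log2


def entropy_sketch(labels):
    """Shannon entropy (base 2) of a list of labels, rounded to 4 decimals."""
    total = len(labels)
    # Sum -p * log2(p) over the classes that actually occur in the list.
    value = -sum((count / total) * log2(count / total)
                 for count in Counter(labels).values())
    return round(value, 4)


print(entropy_sketch([0, 0, 1, 1]))
print(entropy_sketch([0, 0, 0, 0, 0, 1, 1]))
```

Rounding to four decimals mirrors what the `gini` function above does, so the results can be compared directly with the expected values.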

In [None]:
entropy([0, 0, 1, 1])

In [None]:
entropy([0, 0, 0, 0, 0, 1, 1])

Report the entropy of the `Windy` attribute of **playgolf** to `entropy_score` (0.2 pt).

In [None]:
entropy_score = entropy(playgolf['Windy'])  # CHANGE IT

Print your score here!

In [None]:
entropy_score

- Subtask 3: Create an information gain function. 

  - **DO NOT use entropy but only use the gini index for scores.**
  - Check the lecture slides and the examples below to make sure you made a correct function.

In [None]:
def information_gain(labels_start, labels_split):
    """
    Calculate information gain when we have an information of label distribution before and after split operation.
    This information gain function receives two values:
    
    Input:
      - labels_start: A single list of all current labels
        e.g.) [0,0,0,0,1,1,1,1]
      - labels_split: A list of lists representing split 
        e.g.) [ [0,0,1,1], [1,1,0,0] ]
    
    Then we can calculate information gain by calculating the gini index before splitting,
    and subtracting (gini index * proportion of the subset) for each list after splitting.
    
    Output:
      - info_gain: Information gain
    
    You do not need to keep the output name of this function, the grade only depends on the correct outputs.
    
    """
    info_gain = None  # CHANGE IT
    return info_gain

In [None]:
information_gain([0, 0, 0, 0, 1, 1, 1, 1], [[0, 0, 1, 0], [1, 1, 0, 1]])

Your information gain is expected to have the following results:
- `0.0` for `[0,0,0,0,1,1,1,1], [[0,0,1,1],[0,0,1,1]]`
- `0.5` for `[0,0,0,0,1,1,1,1], [[0,0,0,0],[1,1,1,1]]`
- `0.125` for `[0,0,0,0,1,1,1,1], [[0,0,1,0],[1,0,1,1]]`
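
The three expected values can be reproduced with a short self-contained sketch. It follows the description in the docstring (the gini index before the split minus the size-weighted gini index of each subset). The helper names `gini_sketch` and `information_gain_sketch` are hypothetical and do not replace the graded functions.

```python
from collections import Counter


def gini_sketch(labels):
    """Gini index: one minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def information_gain_sketch(labels_start, labels_split):
    """Gini index before the split minus the weighted gini index of the subsets."""
    total = len(labels_start)
    weighted = sum(len(subset) / total * gini_sketch(subset)
                   for subset in labels_split)
    return gini_sketch(labels_start) - weighted


start = [0, 0, 0, 0, 1, 1, 1, 1]
print(information_gain_sketch(start, [[0, 0, 1, 1], [0, 0, 1, 1]]))
print(information_gain_sketch(start, [[0, 0, 0, 0], [1, 1, 1, 1]]))
print(information_gain_sketch(start, [[0, 0, 1, 0], [1, 0, 1, 1]]))
```

A split that leaves the class mix unchanged gains nothing, while a split into pure subsets recovers the full impurity of the parent.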

Here we have the labels before and after a split. Use these two values to calculate the information gain and report it to `info_gain_score` using your own `information_gain` function (0.2 pt).

In [None]:
labels_start = [1, 2, 1, 2, 2, 1, 2, 1, 3, 3, 3]
labels_split = [[3, 3, 3], [1, 2, 1, 1], [2, 2, 1, 2]]

In [None]:
info_gain_score = information_gain(labels_start, labels_split)

Print your score here!

In [None]:
info_gain_score

Now it is time for random forests. We give you a basic `split` function for the algorithms you will develop. It receives the attributes (`X`), the labels (`y`), and the name of one attribute (`attr`), splits the whole dataset by the categories of the selected attribute, and returns the resulting data subsets and label subsets. Using these split values, you will write a few functions needed for a random forest.

- Note that this assignment does not make a whole working random forest but the core functions to understand the algorithm.

In [None]:
def split(X, y, attr):
    split_attrs = []
    split_labels = []

    for val in X[attr].unique():
        attr_subset = []
        label_subset = []

        for idx, row in X.iterrows():

            if row[attr] == val:
                attr_subset.append(row)
                label_subset.append(y[idx])

        split_attrs.append(pd.DataFrame(attr_subset))
        split_labels.append(label_subset)

    return split_attrs, split_labels

Check out the result by running the function below and also check the Windy column to understand what the function does.

In [None]:
split(playgolf.drop('Play Golf', axis=1), playgolf['Play Golf'], 'Windy')

In [None]:
playgolf['Windy']

Subtask 4: Now we can make a function for choosing the best feature to split on, given the dataset of the node (this is the full dataset if we run the function on the root node). The function `best_split` receives the datasets (`X`, `y`) and the number of attributes to choose from the dataset, and returns the best feature among the chosen ones together with its information gain. This is one of the core steps of the random forest.
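
The idea of picking the best feature among a random subset can be sketched on a tiny made-up dataset. This is not the graded `best_split`: it uses a pandas `groupby` instead of the provided `split` function and `np.random.default_rng` instead of `np.random.choice`, and all names (`gini_sketch`, `gain_for_attribute`, `X_toy`, `y_toy`) are hypothetical.

```python
from collections import Counter

import numpy as np
import pandas as pd


def gini_sketch(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gain_for_attribute(X, y, attr):
    """Gini-based gain obtained by splitting on every category of `attr`."""
    weighted = sum(len(part) / len(y) * gini_sketch(list(part))
                   for _, part in y.groupby(X[attr]))
    return gini_sketch(list(y)) - weighted


# Toy data (illustrative): "windy" separates the labels perfectly, "temp" does not.
X_toy = pd.DataFrame({"windy": ["yes", "yes", "no", "no"],
                      "temp": ["hot", "cool", "hot", "cool"]})
y_toy = pd.Series(["stay", "stay", "play", "play"])

# Sample candidate attributes without replacement, as a random forest does per node.
rng = np.random.default_rng(12345)
candidates = [str(a) for a in rng.choice(list(X_toy.columns), size=2, replace=False)]

# Keep the candidate with the highest gain.
gains = {attr: gain_for_attribute(X_toy, y_toy, attr) for attr in candidates}
best_attr = max(gains, key=gains.get)
print(best_attr, gains[best_attr])
```

Because both columns are sampled here, the result does not depend on the random draw; with fewer candidates than columns, the best feature can change from node to node.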

In [None]:
# 0.4 pt
def best_split(X, y, num_attr):
    """
    Input
        - X: Attributes of the node.
        - y: dataset labels.
        - num_attr: the number of attributes that the algorithm chooses.
    Output
        - best_attr: The best feature in terms of information gain.
        - best_info_gain: The information gain value when the dataset is split by the best feature.
        
    Step 1: Choose 'num_attr' column names from X.columns without allowing replacement. Use np.random.choice. 
            Assign the result to 'attributes'
    Step 2: Set best_info_gain to zero and best_attr to None.
    Step 3: Iterate over the attributes chosen in Step 1.
            For each chosen attribute, 'split' the dataset using the split function we have provided.
            Save the split attributes and labels.
    Step 4: Examine the information gain of the current trial. Use the information_gain function you created.
    Step 5: Compare it to the current best gain; if the new gain is higher, update best_info_gain and best_attr.
    Step 6: Return best_attr, best_info_gain.
    
    
    """
    # Step 1
    attributes = None  # CHANGE IT

    # Step 2
    best_info_gain = None  # CHANGE IT
    best_attr = None

    # Step 3 - You should create a for loop and Step 4 and 5 will run inside the loop
    # Step 4
    # Step 5

    # Step 6
    return best_attr, best_info_gain

Try to find the best split of the playgolf dataset with `num_attr` = 2. Report the best feature and best gain to `best_attr_playgolf` and `best_gain_playgolf` (0.4 pt).

In [None]:
np.random.seed(RANDOM_STATE)
best_attr_playgolf, best_gain_playgolf = None, None  # CHANGE IT

In [None]:
# TEST YOUR RESULT HERE
best_attr_playgolf, best_gain_playgolf

Subtask 5: Now we have functions 1) to split the dataset based on one chosen feature (`split`) and 2) to choose the best feature to split on (`best_split`). The next step is to create one tree with all the information we have. Finish the function `build`, which makes one tree of the random forest using the two previous functions. Since this function is recursive, it returns a complete tree, not a single node.

In [None]:
# 0.5 pt

def build(X, y, num_attr, tol=0.00001):
    """
    Input
        - X: Attributes of the data
        - y: dataset labels
        - num_attr: the number of attributes that the algorithm chooses for each node.
        - tol: information gain tolerance value.
    Output
        - node: a leaf or middle node.
        
    Step 1: Run the best split function to get the best attributes and the best information gain for the node.
    Step 2: Examine the best information gain value. If it is lower than the tolerance value (tol), 
            return the node with the best information gain value. The node should be a dictionary form 
            {"type": "leaf", "gain": the best information gain}.
    Step 3: If the best information gain is higher, split the dataset with the chosen best attribute.
    Step 4: Create an empty list called "branches" to save all the branches of the current node.
    Step 5: For each split attributes and labels, run this 'build' function recursively and store the result
            to the  "branches" list.
    Step 6: After all the recursion process is done, return the root node with its best attribute, branch information,
            and the best information gain.
    """

    # Step 1
    best_attr, best_info_gain = None, None  # Change it

    # Step 2 - Change the if condition and the return value
    if None:
        return {}  # CHANGE IT

    # Step 3
    split_attrs, split_labels = None, None  # CHANGE IT

    # Step 4
    branches = []

    # Step 5 - You should create a for loop following the manual

    # Step 6 - Change None values to the correct values
    return {
        "type": "node",
        "best_feature": None,
        "branches": None,
        "value": None
    }

Build a single tree using the whole playgolf dataset and save it to the variable `single_tree`. Set the number of attributes to three. You do not need to specify the tolerance value (0.5 pt).

In [None]:
np.random.seed(RANDOM_STATE)
single_tree = None  # CHANGE IT

In [None]:
single_tree

Subtask 6: Finally, it is time to generate a random forest with multiple trees. Complete the `random_forest` function, which draws samples from the original dataset and creates multiple trees forming a 'forest'.
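
The sampling step that gives each tree its own dataset (bootstrap sampling) can be sketched as follows. The toy arrays and the use of `np.random.default_rng` are assumptions for illustration; the graded function works on the playgolf dataframe instead.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Toy data (illustrative): five rows and their labels.
X_toy = np.array([[10], [20], [30], [40], [50]])
y_toy = np.array(["a", "b", "c", "d", "e"])

# Draw as many row indices as there are rows, with replacement:
# some rows may appear several times, others not at all.
idx = rng.choice(len(X_toy), size=len(X_toy), replace=True)

# Index X and y with the same indices so that rows and labels stay aligned.
X_boot, y_boot = X_toy[idx], y_toy[idx]
print(idx, X_boot.ravel(), y_boot)
```

Sampling indices rather than values is what keeps each attribute row paired with its own label.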

In [None]:
# 0.4 pt

def random_forest(X, y, num_tree, num_attr, tol=0.00001):
    """
    Input
        - X: Attributes of the data
        - y: dataset labels
        - num_tree: the number of trees to build for the forest.
        - num_attr: the number of attributes that the algorithm chooses for each node.
        - tol: information gain tolerance value.
    Output
        - trees: collection of trees
    
    Step 1: Create a list called 'trees' to save all trees generated during the process
    Step 2: Create a for loop iterating num_tree times.  Repeat Step 3 - 5 for 'num_tree' times in a for loop. 
            After the for loop finishes, trees list should have num_tree different trees inside.
    Step 3: For each loop, choose indices (not values) from the dataset X of the same size, but with allowing replacement.
            For example, you can pick [1,2,2,3,3] from the data [1,2,3,4,5]. 
            The size after random sampling should be the same as the vertical size of the dataset X.
    Step 4: Use the same indices as X to pick the labels from y with replacement. 
            If you use DataFrame, you may need to reset the indices (reset_index).
    Step 5: Build a tree using the subsets of X and y and save it to the list.
    Step 6: Return 'trees'.
    
    """

    trees = None
    return trees

Create one random forest on the playgolf dataset. Set the number of trees to three, and the number of attributes to three as well. You do not need to change the tolerance value from the default (0.4 pt).

In [None]:
np.random.seed(RANDOM_STATE)
forest = None  # CHANGE IT

In [None]:
forest

# 2. Evaluation 

#### Task 9: Accuracy, Precision, Recall, F1-score (0.3 pt)

We now return to our original dataset. You will evaluate the random forest and the support vector machine classifiers with performance measures beyond accuracy, such as precision, recall, and F1-score, using scikit-learn.

Your task is as follows:

1. Use standardized datasets (`X_train_st`, `y_train`...) throughout the task. To use the various score functions here, you need to convert the labels of `y_train` and `y_test` (`<=50K`, `>50K`) to numerical ones (0 or 1) since the score functions will not recognize the categorical labels. Create `y_train_numerical` and `y_test_numerical` with the converted labels (`<=50K` to 0, and `>50K` to 1). Refer to the previous labs. (0.1 pt)
2. Create an instance of an SVC classifier with the **polynomial** kernel.
3. Fit the model using the training set.
4. Report the precision score, recall score, and F1-score using the test set, and save them into the variables `precision_score_svc`, `recall_score_svc`, and `f1_score_svc`. You can find information about the performance measures [here](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics). We require you to calculate the scores using the following functions in scikit-learn: `precision_score`, `recall_score`, and `f1_score`. No partial points are given if only some of them are correct (0.2 pt).
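
The label conversion and the three score functions can be tried out on a small made-up example before applying them to the real predictions. The label series below are illustrative assumptions, not outputs of any model in this homework.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels in the same string format as the income column (illustrative only).
y_true_str = pd.Series(["<=50K", "<=50K", ">50K", ">50K", ">50K"])
y_pred_str = pd.Series(["<=50K", ">50K", ">50K", ">50K", "<=50K"])

# Map the categorical labels to 0/1 so that 1 is the positive class.
mapping = {"<=50K": 0, ">50K": 1}
y_true = y_true_str.map(mapping)
y_pred = y_pred_str.map(mapping)

# In this toy example TP = 2, FP = 1, FN = 1.
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(precision, recall, f1)
```

All three scores treat the label `1` as the positive class by default, which is why `>50K` is mapped to 1.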

In [None]:
# Subtask 1: 0.1 pt
y_train_numerical = pd.Series(0)  # CHANGE IT
y_test_numerical = pd.Series(0)  # CHANGE IT

Check if you successfully replaced the values here.

In [None]:
y_train_numerical.unique(), y_test_numerical.unique()

In [None]:
# Subtask 2: 0.2 pt
recall_score_svc = None  # CHANGE IT
precision_score_svc = None  # CHANGE IT
f1_score_svc = None  # CHANGE IT

Print three scores here.

In [None]:
precision_score_svc, recall_score_svc, f1_score_svc

#### Task 10: AUC / AUPRC (0.3 pt)

You will evaluate the random forest and the support vector machine classifiers with performance measures related to the ROC curve, such as the area under the ROC curve (AUC) and the area under the precision-recall curve (AUPRC).

Your task is as follows:

1. Use the same datasets as in Task 9 (`X_train_st` and `y_train_numerical`). The AUPRC and AUC scores also recognize only numerical labels.
2. Create an instance of a random forest classifier without setting any constraint. Don't forget to set the random state to our value `RANDOM_STATE`.
3. Fit the model on the training set.
4. Save the accuracy on the test set into `accuracy_rf` (0.1 pt).
5. Report the AUC and AUPRC on the test set, and save them into the variables `auc_rf` and `auprc_rf`. You can find information about these performance measures [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) for the AUC score and [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) for the AUPRC score. AUPRC has many names; scikit-learn provides it as the *average precision score*. We require you to calculate the scores using the following functions: `roc_auc_score` and `average_precision_score`. There is no partial credit if only some of them are correct (0.2 pt).
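
The steps above can be sketched on synthetic data as follows. The `_demo` names, the seed `SEED_DEMO`, and the data are hypothetical. The text does not say whether the scores should be computed from hard predictions or from predicted probabilities; this sketch assumes probabilities of the positive class, which is the usual input for ranking-based measures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

SEED_DEMO = 0  # hypothetical seed, standing in for RANDOM_STATE

# Hypothetical imbalanced stand-in data with 0/1 labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=SEED_DEMO
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=SEED_DEMO, stratify=y_demo
)

# Random forest with default settings apart from the random state.
rf_demo = RandomForestClassifier(random_state=SEED_DEMO)
rf_demo.fit(X_tr, y_tr)

# Accuracy is computed from hard class predictions.
accuracy_demo = accuracy_score(y_te, rf_demo.predict(X_te))

# Assumption: AUC and AUPRC are computed from the probability of class 1.
proba_pos = rf_demo.predict_proba(X_te)[:, 1]
auc_demo = roc_auc_score(y_te, proba_pos)
auprc_demo = average_precision_score(y_te, proba_pos)
print(accuracy_demo, auc_demo, auprc_demo)
```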

In [None]:
accuracy_rf = None  # CHANGE IT
auc_rf = None  # CHANGE IT
auprc_rf = None  # CHANGE IT

Print your scores here

In [None]:
(accuracy_rf, auc_rf, auprc_rf)

#### Task 11: Apply them together with scikit-learn (0.4 pt)

Here you will apply grid search using the performance measures you tried in Task 9 and Task 10, and pick the best-performing model in terms of specific performance measures.

Our dataset is imbalanced, meaning that one class (`<=50K`) is dominant. The best model may therefore differ depending on the performance measure, and we may need to use AUPRC to select the most suitable model.

Your task is as follows:

1. Use the same datasets as in Task 9 (`X_train_st` and `y_train_numerical`). The AUPRC and AUC scores also recognize only numerical labels.
2. Create an instance of a kNN classifier without setting any constraint.
3. Run grid search with a dictionary specifying `n_neighbors` from 1 to 10, and use two different scoring measures: AUPRC (`average_precision`) and F1-score (`f1`).
4. Put the best classifiers into the respective variables `auprc_best_classifier` and `f1_best_classifier`. Set `cv=5` for grid search cross-validation. Since grid search uses stratified k-fold internally, you should pass the complete dataset, not a split of the training set.
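
A minimal sketch of this procedure on synthetic data is shown below, assuming one separate `GridSearchCV` run per scoring measure. The `_demo` names, the dictionary `best_by_metric`, and the data are hypothetical and not part of the homework.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced stand-in data with 0/1 labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Candidate values for n_neighbors: 1, 2, ..., 10.
param_grid = {"n_neighbors": list(range(1, 11))}

# Run one grid search per scoring measure and keep each refitted best model.
best_by_metric = {}
for scoring in ("average_precision", "f1"):
    search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring=scoring, cv=5)
    search.fit(X_demo, y_demo)
    best_by_metric[scoring] = search.best_estimator_

print(best_by_metric)
```

The two searches can select different values of `n_neighbors`, since each measure rewards a different aspect of the predictions.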

Once you have completed the method, you can run the following lines to check whether your functions are correct. Note that we will evaluate your functions with different data, so implement them carefully.

In [None]:
auprc_best_classifier = None  # CHANGE IT
f1_best_classifier = None  # CHANGE IT

Check your classifiers here!

In [None]:
(auprc_best_classifier, f1_best_classifier)

#### Task 12: Task 9 implementation (0.8 pt)

This task requires you to implement the following performance measures:
 - Accuracy (0.2 pt)
 - Precision (0.2 pt)
 - Recall (0.2 pt)
 - F1-score (0.2 pt)
 
All inputs will be NumPy arrays, so you can use any NumPy array methods to calculate the scores.
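
For reference, the standard definitions of these measures in terms of the confusion-matrix counts (true positives $TP$, true negatives $TN$, false positives $FP$, false negatives $FN$) are given below. They treat label 1 as the positive class, which is an assumption consistent with the `>50K` to 1 mapping used earlier.

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
$$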

In [None]:
def accuracy_manual(truth, predicted):
    return None  # CHANGE IT

In [None]:
def precision_manual(truth, predicted):
    return None  # CHANGE IT

In [None]:
def recall_manual(truth, predicted):
    return None  # CHANGE IT

In [None]:
def f1_score_manual(truth, predicted):
    return None  # CHANGE IT

Assign the results of your four functions on the two arrays (`truth`, `predicted`).

In [None]:
truth = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1]
predicted = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

In [None]:
accuracy_score_manual = None  # CHANGE IT
precision_score_manual = None  # CHANGE IT
recall_score_manual = None  # CHANGE IT
f1_score_manual = None  # CHANGE IT (this rebinds the name of the f1_score_manual function above)

Show your results here!

In [None]:
(accuracy_score_manual, precision_score_manual, recall_score_manual, f1_score_manual)

# 3. Deployment

#### Task 13: Save models into a file using pickle (0.3 pt)

In this task, you need to pick the best model using cross-validation and deploy it as a pickle file. For this task, we will use the diabetes data that we used for Homework 1. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. You can find it in the homework folder.

- Load the dataset into `diabetes`.

In [None]:
diabetes = pd.read_csv("datasets/diabetes.csv")

In [None]:
diabetes.head()

- Split the dataset into two parts: attributes (`X`) and labels (Outcome, `y`).

In [None]:
X = None  # CHANGE IT
y = None  # CHANGE IT

Your task is as follows:

1. Create an instance of an SVC classifier without setting any constraint.
2. Run grid search with a list of two dictionaries. In the first dictionary, examine the 'poly' kernel with degree = [2, 3, 4]. In the second dictionary, test two kernels ['linear', 'rbf'] with a list of C values [10, 100]. Use **AUPRC** as the scoring measure. Set `cv=5` for grid search cross-validation. Since grid search uses stratified k-fold internally, you should pass the complete dataset.
3. Save the best classifier into the variable called `svm_best_classifier_2` and save the trained model into `model_diabetes.pickle` using pickle. When saving your model, do not specify any folder.
 - **Do not use your own specific name for the model!**
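
The sketch below illustrates a grid search over a list of two parameter dictionaries followed by a pickle round trip, using synthetic data. The `_demo` names, the data, and the file name `model_demo.pickle` (written to a temporary folder) are hypothetical and chosen so the example does not clash with the required `model_diabetes.pickle`.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# Each dictionary is expanded separately: 3 poly settings + 2 x 2 other settings.
param_grid = [
    {"kernel": ["poly"], "degree": [2, 3, 4]},
    {"kernel": ["linear", "rbf"], "C": [10, 100]},
]
search = GridSearchCV(SVC(), param_grid, scoring="average_precision", cv=5)
search.fit(X_demo, y_demo)
best_demo = search.best_estimator_  # refitted on all of X_demo, y_demo

# Save the fitted model, then load it back to confirm the round trip.
demo_path = os.path.join(tempfile.mkdtemp(), "model_demo.pickle")
with open(demo_path, "wb") as f:
    pickle.dump(best_demo, f)
with open(demo_path, "rb") as f:
    restored_demo = pickle.load(f)

print(search.best_params_)
```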

In [None]:
svm_best_classifier_2 = None  # CHANGE IT

Show your classifier here!

In [None]:
svm_best_classifier_2

#### Task 14: DASH deployment (0.3 pt)

You will run a DASH application that uses the model `model_diabetes.pickle` you exported, together with the project files in the `webplatform_dash` folder. Place the model in the same folder as this Jupyter notebook, then go into the `webplatform_dash` folder. There you can run your own DASH application as you learned in Lab 5. Note that you should **not** move the model file into `webplatform_dash`.

- Submit a screenshot together with your model file as a single zip file through the separate submission form for your DASH project. For details of the DASH deployment, see Lab 5.