# CPSC 330 hw3

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import tree
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

from plot_classifier import plot_classifier

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.students.cs.ubc.ca/cpsc330-2019w-t2/home/blob/master/docs/homework_instructions.md). 

**NEW REQUIREMENT**: if you are working with a partner, you must write a few sentences explaining the contribution of each team member. You should refer to yourselves by your CSIDs (because seeing names can cause bias during grading). Here is an example:

> a1b2c did Exercise 1, checked over Exercise 2, and pair-programmed for Exercise 3. z9y8x checked over Exercise 1, did Exercise 2, and pair-programmed for Exercise 3. 

Our ideal scenario is that you worked together on all the exercises, but you are not required to do so, and for now we are only collecting this information because we are curious. If you are working alone, you can ignore this section.

_YOUR TEAMWORK CONTRIBUTION STATEMENT GOES HERE_

## Exercise 1: Data and preprocessing <a name="1"></a>
We will focus on the classification task of predicting the presence or absence of heart disease (the response) from a set of 13 biophysical measures (the features). Classifying heart disease in patients is important for cardiovascular disease diagnosis and prevention, and machine learning offers potentially effective methods for building predictive models from heart disease data. The dataset you will be working with has been made available by the UCI Machine Learning Repository [here](https://archive.ics.uci.edu/ml/datasets/Heart+Disease). A slightly modified version of this dataset is available in your repo as `heart_disease.csv`. It contains 303 observations (patients) and 14 columns (13 features and 1 response).



In [None]:
heart_df = pd.read_csv('heart_disease.csv', index_col=0)
heart_df.head()

Note: many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.

### 1.1 Ordering the steps
rubric={points:5}

Your first task is to wrangle this dataset into a format suitable for use with the `scikit-learn` library. This includes:

1. Loading the dataset;
2. Feature preprocessing (one-hot encoding and scaling); and,
3. Splitting data into train/validation/test sets.

To help you understand this wrangling process, the code required to perform the pre-processing tasks above is provided. The code has been arranged into different blocks performing the tasks above but these blocks are in the wrong order. Rearrange the code to correctly wrangle the data and add a short comment to each block to describe what the code is doing.

In [None]:
# YOUR COMMENT HERE
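X_train = pd.DataFrame(preprocessor.fit_transform(X_train),
                       index=X_train.index,
                       columns=(numeric_features +
                                # get_feature_names_out needs scikit-learn >= 1.0;
                                # older versions call this method get_feature_names
                                list(preprocessor.named_transformers_['ohe']
                                     .get_feature_names_out(categorical_features))))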

X_valid = pd.DataFrame(preprocessor.transform(X_valid),
                       index=X_valid.index,
                       columns=X_train.columns)

# YOUR COMMENT HERE
numeric_features = ['age', 'resting_blood_pressure', 'cholesterol',
                    'max_heart_rate_achieved', 'st_depression', 'num_major_vessels']
categorical_features = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg',
                        'exercise_induced_angina', 'st_slope', 'thalassemia']

# YOUR COMMENT HERE
X_train, X_valid, y_train, y_valid = train_test_split(X_train,
                                                      y_train,
                                                      test_size=0.2,
                                                      random_state=50)

# YOUR COMMENT HERE
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), numeric_features),
        ('ohe', OneHotEncoder(drop="first"), categorical_features)])

# YOUR COMMENT HERE
heart_df = pd.read_csv('heart_disease.csv', index_col=0)

# YOUR COMMENT HERE
X_train, X_test, y_train, y_test = train_test_split(heart_df.drop(columns='target'),
                                                    heart_df['target'],
                                                    test_size=0.1,
                                                    random_state=50)

In [None]:
X_train.head()

### 1.2 Exploring the one-hot encoding
rubric={points:3}

The original dataset had a feature called `st_slope`. 

1. What were the possible values of this feature? 
2. What new binary feature(s) were created to replace this feature? 
3. For each possible value of the original feature, how is this value represented in the transformed data? For example, the original feature `rest_ecg` had two values, "normal" and "abnormal". In the transformed data, the new feature is called `rest_ecg_normal`, where "normal" is represented as 1.0 and "abnormal" is represented as 0.0.
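
To see what `OneHotEncoder(drop="first")` does in general, here is a minimal sketch on a made-up column. The column name `colour` and its values are hypothetical and have nothing to do with the heart disease data. With `drop="first"`, the first category in sorted order gets no column of its own and is represented by zeros in every remaining column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, unrelated to heart_disease.csv
toy = pd.DataFrame({"colour": ["blue", "green", "red", "blue"]})

encoder = OneHotEncoder(drop="first")
# fit_transform returns a sparse matrix by default, so convert it for display
encoded = encoder.fit_transform(toy).toarray()

# categories_ lists the values found in each column, in sorted order
categories = list(encoder.categories_[0])
kept = categories[1:]  # the first category ("blue") is dropped

encoded_df = pd.DataFrame(encoded, columns=[f"colour_{c}" for c in kept])
print(encoded_df)
```

Here a row of all zeros means "blue", because that is the dropped category.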

## Exercise 2: Logistic regression <a name="2"></a>
In this exercise you will work with a type of linear model known as *logistic regression*. Recall that logistic regression, despite the name, is used for classification tasks. Typically it is used to model the relationship between one dependent binary variable (the target) and one or more numerical or categorical independent variables (features). 

### 2.1 Train a logistic regression model
rubric={points:1}

Fit a [Logistic Regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) called `lgr` on the train split of the heart disease data. You can use all default hyperparameters. If you get a `FutureWarning`, you can ignore it.

### 2.2 Test the model
rubric={points:2}

1. Test the `lgr` model on the training split of the heart disease data.
2. Test the `lgr` model on the validation split of the heart disease data.

### 2.3 Interpret the test outputs
rubric={points:1}

Based on your results from **Q2.2**, would you say your logistic regression model is overfit? Why/why not?

### 2.4 Predicting probabilities
rubric={points:6}

A logistic regression model outputs a probability between 0 and 1, where (typically) probabilities less than 0.5 are assigned to class 0 and probabilities greater than 0.5 are assigned to class 1. The predicted probabilities of the logistic regression model are available through the `.predict_proba()` method.

1. What is the predicted probability that the first observation in the validation set (observation 11) has heart disease (target = 1)?
1. What is the largest predicted probability across the entire validation set?
1. What is the ID of the patient with this highest predicted probability of heart disease in the validation set (give the actual ID number, not the index location)?
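
As a rough illustration of the shape of the `.predict_proba()` output, the sketch below fits a model on synthetic data from `make_classification` (an assumption made for illustration only, not the heart disease data). It shows that there is one column per class, ordered as in `classes_`, and that `.predict()` agrees with picking the more probable class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (for illustration only)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# One row per observation, one column per class, ordered as in model.classes_
proba = model.predict_proba(X)
print(model.classes_)  # column order of proba
print(proba[:3])       # probabilities for the first three observations

# Column 1 holds the predicted probability of class 1
prob_class_1 = proba[:, 1]

# predict() corresponds to picking the class with the larger probability
manual_pred = model.classes_[np.argmax(proba, axis=1)]
```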

### 2.5 Most important features for predicting heart disease
rubric={points:5}

We can investigate the coefficient values of our logistic regression model to help understand the importance of the different features. The coefficients are exposed by the `coef_` attribute of your `lgr` model (see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for more help). What are the 3 most important features in the model according to the absolute value of the coefficients?

What is the difference between a positive and negative coefficient in your `lgr` model?
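
For reference, the following sketch shows how `coef_` and `intercept_` connect to the predicted probabilities of a binary `LogisticRegression`. It uses synthetic data (an assumption for illustration) and ranks the features by coefficient magnitude. Keep in mind that comparing magnitudes is only meaningful when the features are on comparable scales.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (for illustration only)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary problem
coefs = model.coef_[0]

# Linear score, then the sigmoid maps it to a probability of class 1
scores = X @ coefs + model.intercept_[0]
manual_proba = 1 / (1 + np.exp(-scores))

# Rank features by coefficient magnitude, largest first
order = np.argsort(-np.abs(coefs))
for j in order:
    print(f"feature {j}: coefficient {coefs[j]:+.3f}")
```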

## Exercise 3: Support vector machine (SVM) <a name="3"></a>

In this exercise, you will train an SVM on the heart disease dataset and compare results to the logistic regression model from **Exercise 2**. 

### 3.1 Kernels in SVM classification
rubric={points:5}

The sklearn [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) allows several different values for its `kernel` argument, including `'linear'`, `'poly'`, and `'rbf'`. For each of these kernels, train a model and report the training and validation accuracy. Make sure you use a `for` loop instead of repeating your code 3 times. To avoid issues with the newer sklearn, set `gamma='auto'`.
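
The general pattern of looping over kernels can be sketched as follows. This is only an illustration on synthetic two-moons data; the dataset, the split, and the names `X_tr`, `X_va`, and `results` are assumptions and not part of the assignment.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic 2-d data (for illustration only)
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for kernel in ["linear", "poly", "rbf"]:
    model = SVC(kernel=kernel, gamma="auto").fit(X_tr, y_tr)
    # score() returns mean accuracy for classifiers
    results[kernel] = (model.score(X_tr, y_tr), model.score(X_va, y_va))
    print(f"{kernel}: train={results[kernel][0]:.3f}, valid={results[kernel][1]:.3f}")
```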

### 3.2 Interpreting results
rubric={points:3}

How do the train and validation accuracies compare to the logistic regression classifier in **Exercise 2**? Are any of the models overfit? Based on your results, why do you think the `'rbf'` kernel is the default in `scikit-learn`?

### 3.3 Visualizing results
rubric={points:3}

To understand the effect of the different `SVC` kernels it may be helpful to visualize them. We can easily visualize decision boundaries in two dimensions (i.e., with 2 input features). The code below visualizes the 3 different kernels you tried above using a subset of the training data with only 2 features and only 30% of the examples. Run the code and *briefly* describe the behaviour of the three different kernels.

In [None]:
# Extract some sample 2-d data, set the random_state for repeatability and matching X/y samples
sample = (X_train[['age', 'cholesterol']].join(y_train)
                                         .sample(frac=0.3, random_state=123)
                                         .dropna())
sample_X, sample_y = sample.drop(columns="target"), sample["target"]

In [None]:
# Kernel names from Exercise 3.1 (defined here so this cell runs on its own)
kernels = ['linear', 'poly', 'rbf']

plt.figure(figsize=(12, 4))
for i, kernel in enumerate(kernels):
    SVC_model = SVC(kernel=kernel, gamma="auto")
    SVC_model.fit(sample_X, sample_y)
    plt.subplot(1, 3, i + 1)
    ax = plt.gca()
    plot_classifier(sample_X, sample_y, SVC_model, ax=ax, ticks=True)
    ax.set_title(kernel)
    ax.set_xlabel('age (scaled)')
    if i == 0:
        ax.set_ylabel('cholesterol (scaled)')

### 3.4 The RBF kernel hyperparameters
rubric={points:5}

`gamma` and `C` are hyperparameters of the `SVC` model: `C` applies to every kernel, while `gamma` is used by the RBF kernel (as well as `'poly'` and `'sigmoid'`). Both of them affect the complexity of the model. In this exercise we'll focus on `gamma` specifically.

Your task is to explore different values of `gamma`: 

1. Set `gamma` to $0.001, 0.01, 0.1, 1$ and $10$, once again using a `for` loop instead of repeating your code. Inside the loop, fit the model on the original training data and report the training and validation accuracy. 
2. Furthermore, for each `gamma`, fit another `SVC` model on only the smaller dataset from the previous exercise, and produce a decision boundary plot similar to the ones above. 
3. Then, discuss your results. How does `gamma` influence your results? Can you relate this to the fundamental trade-off of ML models? Do you get more complicated surfaces from larger `gamma`, or smaller?

Note: your accuracy printouts won't correspond exactly to the plots, because the accuracy scores come from the full dataset and the plots are only made using the subsetted data.
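
As background for thinking about `gamma`, recall that scikit-learn's RBF kernel is $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below evaluates it for one fixed pair of points (chosen arbitrarily for illustration) across the same `gamma` values, and checks the hand computation against `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary points (for illustration only)
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])
sq_dist = np.sum((a - b) ** 2)  # squared Euclidean distance between a and b

gammas = [0.001, 0.01, 0.1, 1, 10]
manual_values, library_values = [], []
for gamma in gammas:
    manual = float(np.exp(-gamma * sq_dist))
    library = float(rbf_kernel(a, b, gamma=gamma)[0, 0])
    manual_values.append(manual)
    library_values.append(library)
    print(f"gamma={gamma}: manual={manual:.6g}, sklearn={library:.6g}")
```

The similarity between the two fixed points shrinks as `gamma` grows, which is a useful starting point when interpreting your plots.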

## Exercise 4: Open-Ended <a name="4"></a>

### 4.1 Try to maximize validation accuracy
rubric={points:6}

Using any of `LogisticRegression`, `SVC`, `DecisionTreeClassifier` and `RandomForestClassifier`, try to get the best validation accuracy that you can on this same data set. You'll want to fiddle with the hyperparameters. When you are done, briefly describe what you tried and what worked best.

Note: this question is quite open-ended. I recommend not spending more than 20 minutes on it, and just submitting what you have after 20 min. 

### 4.2 Test set
rubric={points:3}

Evaluate your final model on the test set. How does your test accuracy compare to your validation accuracy? If they are different: do you think this is because you "overfitted on the validation set", or simply random luck? Discuss your answer in the context of the size of the data set.

### 4.3 Cross-validation
rubric={points:1}

Would this problem be a good candidate for using cross-validation, instead of a train/validation split? Briefly justify your answer.
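
If you have not used it before, the mechanics of cross-validation in scikit-learn look roughly like this. The sketch uses synthetic data and a default `LogisticRegression` purely as placeholders (both are assumptions for illustration); `cross_val_score` returns one score per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data (for illustration only)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# 5-fold cross-validation: five fits, each scored on a different held-out fold
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```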