# CPSC 330 hw4

Meta-commentary: this assignment contains more new material that I created specifically for CPSC 330, whereas previous assignments relied more on adapting materials from other courses. As a result, in this assignment it is more likely that we will encounter typos, bugs, or other frustrations. Please be patient as we work through these issues.

Following the style of the lectures, this assignment is centred around a particular dataset and is also somewhat more open-ended than the previous ones. This reflects the direction that I'm trying to take the course, namely to give you practice with, and build good habits for, the end-to-end ML process. However, if this turns out to be too messy or too difficult to grade, I may revert to less open-ended assignments in the future. Note that, given this style, there are many possible correct answers - you should not expect exactly the same results as your classmates.

It is also a bit hard for me to gauge the right length and difficulty for a new assignment, so please do provide feedback in your repo's README!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets
from sklearn.dummy import DummyClassifier

from sklearn.model_selection import train_test_split

In [None]:
plt.rcParams['font.size'] = 14

## Instructions
rubric={points:5}

Follow the [homework submission instructions](https://github.students.cs.ubc.ca/cpsc330-2019w-t2/home/blob/master/docs/homework_instructions.md). 

**Additional requirement**: if you are working with a partner, please write a couple of sentences explaining the contribution of each team member. You should refer to yourselves by your CSIDs (because seeing names can cause bias during grading). Here is an example:

> a1b2c did Exercise 1, checked over Exercise 2, and pair-programmed for Exercise 3. z9y8x checked over Exercise 1, did Exercise 2, and pair-programmed for Exercise 3. 

Our ideal scenario is that you worked together on all the exercises, but you are not required to do so, and for now we are only collecting this information because we are curious. If you are working alone, you can ignore this section.

_YOUR TEAMWORK CONTRIBUTION STATEMENT GOES HERE_

## Writing quality/quantity
rubric={points:5}

The TAs have reported a couple of issues with the first few assignments. In some cases, submissions simply show the code output with no commentary; please write at least a sentence explaining your output in each question. In other cases, the TAs have come across multi-paragraph answers where a couple of sentences would have sufficed. Thus, we are now allocating the above points for well-structured answers of a reasonable length. In general, 1-3 sentences is good.

## Exercise 1: implementing `DummyClassifier`
rubric={points:20}

In this course (unlike CPSC 340) you will generally not be asked to implement the methods we talk about, like logistic regression. However, this exercise is an exception: you will implement the simplest possible classifier, `DummyClassifier`.

Below you will find starter code for a class called `MyDummyClassifier`, which has methods `fit()`, `predict()`, and `predict_proba()`. Your task is to fill in those three methods. The code block after the starter code has some tests you can use to assess whether your code is working.

I suggest starting with `fit` and `predict`, and making sure those are working before moving on to `predict_proba`. For `predict_proba`, you should return the frequency of each class in the training data. Again, you can compare with `DummyClassifier` using the code below.

To simplify this question, you can assume **binary classification**, and furthermore that these classes are **encoded as 0 and 1**. In other words, you can assume that `y` contains only 0s and 1s. The real `DummyClassifier` works when you have more than two classes, and also works if the target values are encoded differently, for example as "cat", "dog", "mouse", etc.
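As a point of reference, the following sketch shows the behaviour that scikit-learn's `DummyClassifier` exhibits with `strategy="prior"` on a tiny made-up dataset. The arrays `X_toy` and `y_toy` are illustrative assumptions and are not part of the assignment data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (hypothetical): 4 examples, 3 of class 0 and 1 of class 1.
X_toy = np.zeros((4, 2))   # the features are ignored by the dummy model
y_toy = np.array([0, 0, 0, 1])

dc = DummyClassifier(strategy="prior")
dc.fit(X_toy, y_toy)

toy_pred = dc.predict(X_toy)         # the most frequent class, for every row
toy_proba = dc.predict_proba(X_toy)  # the training class frequencies, for every row

print(toy_pred)   # every prediction is 0
print(toy_proba)  # every row is [0.75, 0.25]
```

Note that `predict_proba` returns one row per example in `X` and one column per class, even though every row is identical.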

In [None]:
class MyDummyClassifier:
    """
    A baseline classifier that predicts the most common class.
    The predicted probabilities come from the relative frequencies
    of the classes in the training data.
    
    This implementation only works when y only contains 0s and 1s.
    """
    def fit(self, X, y):
        pass # your code here
    
        
        
    def predict(self, X):
        pass # your code here
    
        
    
    def predict_proba(self, X):
        pass # your code here
    
        

Below are some tests for `predict`. You may want to run the cell a few times to make sure you explore the different cases (or automate this with a loop or random seeds).

In [None]:
# For testing, generate random data
n_train = 101
n_valid = 21
d = 5
X_train_dummy = np.random.randn(n_train, d)
X_valid_dummy = np.random.randn(n_valid, d)
y_train_dummy = np.random.randint(2, size=n_train)
y_valid_dummy = np.random.randint(2, size=n_valid)

my_dc = MyDummyClassifier()
sk_dc = DummyClassifier(strategy="prior")

my_dc.fit(X_train_dummy, y_train_dummy);
sk_dc.fit(X_train_dummy, y_train_dummy);

assert np.array_equal(my_dc.predict(X_train_dummy), sk_dc.predict(X_train_dummy))
assert np.array_equal(my_dc.predict(X_valid_dummy), sk_dc.predict(X_valid_dummy))

Below are some tests for `predict_proba`.

In [None]:
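# Probabilities are floats, so check the shape exactly and the values with a tolerance
for X_check in (X_train_dummy, X_valid_dummy):
    my_proba, sk_proba = my_dc.predict_proba(X_check), sk_dc.predict_proba(X_check)
    assert np.shape(my_proba) == np.shape(sk_proba) and np.allclose(my_proba, sk_proba)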

## Exercise 2: Precision and recall by hand

Below is the confusion matrix of a machine learning system that predicts whether a cancer is malignant or not. Let's consider malignant to be the "positive class".

|    Actual/Predicted      | Predicted Benign | Predicted Malignant |
| :------------- | -----------------------: | -----------------------: |
| **Actual Benign**       | 6 | 238 |
| **Actual Malignant**       | 20 | 194 |
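As a reminder of the definitions, here is a minimal worked example using hypothetical counts that are deliberately different from the table above. The variable names `tp`, `fp`, `fn`, and `tn` (true positives, false positives, false negatives, true negatives) are just illustrative.

```python
# Hypothetical counts, NOT taken from the confusion matrix above.
tp, fp, fn, tn = 30, 10, 20, 40

precision = tp / (tp + fp)  # of the predicted positives, the fraction that are truly positive
recall = tp / (tp + fn)     # of the actual positives, the fraction that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
# precision=0.750, recall=0.600, f1=0.667
```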

#### 2(a)
rubric={points:2}

Would you consider this an imbalanced dataset? Why or why not? Max 2 sentences.

    

#### 2(b)
rubric={points:2}

Based on the above confusion matrix, what is the recall? 

    

#### 2(c)
rubric={points:5}

Do you consider this to be a good classifier? What additional information might you need to answer this question? Briefly discuss in 1-3 sentences.

    

## Exercise 3: Customer churn data

[Customer churn](https://en.wikipedia.org/wiki/Customer_attrition) refers to the notion of customers leaving a subscription service like Netflix. In this exercise, we will try to predict customer churn in a dataset where most of the customers stay with the service and a small minority cancel their subscription. To start, please download the [Kaggle telecom customer churn dataset](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset). **Do not push the CSV to your repo** (you may want to create a .gitignore file). Once you have the data, you should be able to run the following code:

In [None]:
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv', encoding='latin-1')

In [None]:
df_train, df_test = train_test_split(df, test_size=0.1, random_state=100)
df_train, df_valid = train_test_split(df_train, test_size=0.25, random_state=100)

In [None]:
df_train.head()

The last column (`churn`) is the target. "True" means the customer left the subscription (churned) and "False" means they stayed.

**Note**: in this exercise you are welcome to copy/paste/adapt code from the **lecture notes** without attribution. However, you are **not** permitted to copy any other code from online sources without attribution.

**Note**: if available, you are welcome to use scikit-learn functions for any of the tasks below, such as confusion matrix. You are not required to implement them yourselves. 

#### 3(a)  
rubric={points:8} 

Perform some exploratory data analysis on the training set. In particular:

- How many rows and columns are there?
- How many True/False target values are there?

Come up with **two** more questions you would like to answer (similar to the above two), and explore those as well. Briefly discuss your results in 1-3 sentences.
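The first two questions can be answered with standard pandas calls. The sketch below uses a small made-up DataFrame called `toy` as a stand-in for `df_train`; the column names other than `churn` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for df_train; only the `churn` column name comes from the dataset.
toy = pd.DataFrame({
    "minutes": [120.5, 98.0, 210.3, 45.1, 160.0],
    "plan": ["basic", "basic", "premium", "basic", "premium"],
    "churn": [False, False, True, False, False],
})

n_rows, n_cols = toy.shape                   # (number of rows, number of columns)
target_counts = toy["churn"].value_counts()  # how many True and how many False values

print(n_rows, n_cols)
print(target_counts)
```

Other calls that may be useful for your own questions include `toy.describe()` and `toy.dtypes`.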

#### 3(b)
rubric={points:20}

In preparation for building a classifier, perform whatever feature transformations you deem sensible. Use `ColumnTransformer` to combine the transformations (see Lecture 6 or 8). This can include dropping features if you think they are not helpful. 

In each case, briefly explain your rationale with 1-2 sentences. You do not need an explanation for every feature, but for every group of features that are being transformed the same way. For example, "I am doing transformation X to the following categorical features: `a`, `b`, `c` because of reason Y," etc.

Warning: as discussed in lecture, make sure not to violate the golden rule; you should not be calling `fit` or `fit_transform` on any test data!
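A minimal sketch of the pattern is given here, under the assumption of one numeric column, one categorical column, and one identifier column. The frames `toy_train` and `toy_valid` and all of their column names are hypothetical; your choice of transformations for the real data may well differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frames standing in for the real train/validation splits.
toy_train = pd.DataFrame({
    "minutes": [100.0, 200.0, 300.0, 400.0],
    "plan": ["basic", "premium", "basic", "premium"],
    "customer_id": [11, 12, 13, 14],
})
toy_valid = pd.DataFrame({
    "minutes": [250.0, 50.0],
    "plan": ["basic", "gold"],   # "gold" never appears in the training data
    "customer_id": [15, 16],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["minutes"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    ],
    remainder="drop",  # columns not listed above (here customer_id) are dropped
)

# Fit on the training split only, then reuse the fitted transformer on other splits.
toy_train_t = preprocessor.fit_transform(toy_train)
toy_valid_t = preprocessor.transform(toy_valid)

print(toy_train_t.shape, toy_valid_t.shape)  # (4, 3) (2, 3)
```

Here `handle_unknown="ignore"` encodes the unseen category as all zeros instead of raising an error, which is one way to cope with categories that only show up after fitting.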

#### 3(c)
rubric={points:10}

"Train" a `DummyClassifier` on your transformed data, using `strategy='prior'` as in Exercise 1. Report the following:

1. Train and validation accuracy.
2. Confusion matrix on the validation data.
3. Precision, recall, F1-score on the validation data.

Briefly comment on your results (2 sentences max).
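The metrics listed above are all available in `sklearn.metrics`. The following sketch computes them on hypothetical label vectors `y_true` and `y_pred`, which are made up for illustration and unrelated to the churn data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions (1 = positive class).
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(acc)             # 0.75
print(cm)              # [[4 1], [1 2]]
print(prec, rec, f1)   # each is 2/3 for this toy example
```

If a model never predicts the positive class, precision has a zero denominator; scikit-learn then reports 0 with a warning, and the `zero_division` argument of these functions controls that behaviour.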

#### 3(d)
rubric={points:20} 

Train a logistic regression classifier on your transformed data, using the default hyperparameters. Report the following metrics:

1. Train and validation accuracy.
2. Confusion matrix on the validation data.
3. Precision, recall, F1-score on the validation data.

Are you satisfied with the results? Use your `DummyClassifier` results as a reference point. Briefly discuss (1 paragraph max). 

#### 3(e)
rubric={points:5}

Set the `class_weight` parameter of your logistic regression model to `'balanced'`. Report the following metrics:

1. Train and validation accuracy.
2. Confusion matrix on the validation data.
3. Precision, recall, F1-score on the validation data.

Discuss your results in 1-3 sentences.
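To help interpret what `'balanced'` does, the scikit-learn documentation describes the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that formula on a hypothetical label vector `y_toy`; it only illustrates the weighting heuristic and does not train a model.

```python
import numpy as np

# Hypothetical labels: 8 negatives and 2 positives.
y_toy = np.array([0] * 8 + [1] * 2)

n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))
class_counts = np.bincount(y_toy)   # number of examples in each class

# Heuristic documented for class_weight='balanced'.
balanced_weights = n_samples / (n_classes * class_counts)

print(balanced_weights)  # weight 0.625 for class 0 and 2.5 for class 1
```

The rarer class receives the larger weight, so errors on it cost more during training.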

#### 3(f)
rubric={points:5}

On the same axes, plot the ROC curves for the three methods we tried. Make sure you have a legend labeling which curve is which. Also, report the AUC in each case. Briefly comment on your results (1 sentence).
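As a rough sketch of the relevant scikit-learn calls, the example below computes an ROC curve and its AUC for hypothetical scores. The lists `y_true` and `y_score` are made up; with a fitted model you would typically use something like `model.predict_proba(X)[:, 1]` as the score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)

print(auc)  # 0.75
# To draw the curve on the current axes:
# plt.plot(fpr, tpr, label=f"toy model (AUC = {auc:.2f})")
```

Calling `plt.plot` once per model before a single `plt.legend()` puts all of the curves on the same axes.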


#### 3(g)
rubric={points:10}

The function below plots histograms of the predicted probability, split by the true class, for each of the two logistic regression models. These are similar to the animated plots from lecture. 

Call this function using your (transformed) **validation** data and your two logistic regression models. Then, discuss your results. How did the regular and balanced logistic regression models compare in terms of accuracy and recall? How did the two models compare in terms of ROC curves? Do these new plots help explain what is going on here? Max 1 paragraph.

In [None]:
def make_hists(X, y, lr_original, lr_balanced):
    """Plot histograms of the predicted P(class 1), split by true class, for each model."""

    # Use the function arguments rather than the global X_valid / y_valid
    negative_examples = X[y == 0]
    positive_examples = X[y == 1]

    for name, model in {"log reg": lr_original, "log reg balanced": lr_balanced}.items():

        plt.hist(model.predict_proba(negative_examples)[:, 1], alpha=0.5, bins=30, label="0", density=True)
        plt.hist(model.predict_proba(positive_examples)[:, 1], alpha=0.5, bins=30, label="1", density=True)
        plt.legend(loc='upper right')

        plt.xlabel("predicted probability")
        plt.ylabel("normalized counts")
        plt.title(name)
        plt.show()
    

## Exercise 4: Hyperparameter optimization

#### 4(a)
rubric={points:5}

Try applying a random forest to this problem, again using `class_weight='balanced'`. Report the following metrics:

1. Train and validation accuracy.
2. Confusion matrix on the validation data.
3. Precision, recall, F1-score on the validation data. 

Briefly comment on the results (max 2 sentences).

#### 4(b)
rubric={points:5}

Next we will optimize the `n_estimators` hyperparameter of your random forest using `RandomizedSearchCV`, keeping `class_weight='balanced'`. Because cross-validation separates the folds for us, I will combine the training and validation sets so that we have more data to work with. 

In [None]:
X_train_valid = np.concatenate((X_train, X_valid), axis=0)
y_train_valid = np.concatenate((y_train, y_valid), axis=0)

In [None]:
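import scipy.stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV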

param_dist = {
              "n_estimators"     : scipy.stats.randint(low=10, high=300),
             }

In [None]:
random_search = RandomizedSearchCV(RandomForestClassifier(class_weight='balanced', random_state=321), 
                                   param_distributions = param_dist, 
                                   n_iter = 20, 
                                   cv=3,
                                   verbose=1, 
                                   random_state=123)

In [None]:
random_search.fit(X_train_valid, y_train_valid);

In [None]:
random_search.best_params_

In [None]:
random_search.score(X_valid, y_valid)

This model is incredible - it gets 100% validation accuracy! This means it makes no errors, so it must also be getting perfect precision, recall, and F1-score.
What is wrong with my analysis? Answer in 1-2 sentences.

(Note: when you run the above code you might get slightly different results, depending on your feature preprocessing in Exercise 3. That is fine. Hopefully the score isn't too different.)

#### 4(c)
rubric={points:5}

Repeat the hyperparameter optimization from the previous part, this time doing it correctly. For your optimized model, report the F1-score on the validation set. Briefly comment on the results (max 2 sentences).

#### 4(d)
rubric={points:5}

This time optimize both `n_estimators` and `max_depth`. What F1-score do you get on the validation set?
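To see what a randomized search over two hyperparameters samples, the sketch below draws a few candidate settings with scikit-learn's `ParameterSampler`, without fitting any model. The dictionary name `toy_param_dist` and the ranges in it are assumptions for illustration, not recommended values.

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

# Illustrative ranges only; scipy's randint samples integers in [low, high).
toy_param_dist = {
    "n_estimators": scipy.stats.randint(low=10, high=300),
    "max_depth": scipy.stats.randint(low=1, high=20),
}

# Draw 5 candidate settings, as a randomized search would before fitting.
candidates = list(ParameterSampler(toy_param_dist, n_iter=5, random_state=123))

for params in candidates:
    print(params)
```

Each candidate is a dictionary with one value per hyperparameter, sampled independently from its distribution.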

#### 4(e) 
rubric={points:5}

When you carry out hyperparameter optimization, the search maximizes accuracy by default. In imbalanced datasets such as churn datasets, accuracy is not a meaningful objective. You can choose a different [scoring metric](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) for `GridSearchCV` or `RandomizedSearchCV` via the `scoring` parameter. Try optimizing your model to pick the hyperparameters with the best F1-score by setting `scoring='f1'` when creating the `RandomizedSearchCV`. What F1-score do you achieve on the validation set?

Optional note / FYI: in the case of random search with a fixed random seed, you will end up exploring the same hyperparameter values as in the previous part. The only difference will be which one you consider the best. However, if you were using a fancier method like Bayesian optimization, then the choice of scoring function would actually affect which hyperparameter values were explored, because the suggested next set of hyperparameters depends on the scores of the previous ones.
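The point about random search can be checked with a small experiment. The sketch below runs two searches that differ only in `scoring`, on synthetic imbalanced data from `make_classification`. The helper `run_search`, the data, and the parameter range are all hypothetical and unrelated to the churn dataset.

```python
import scipy.stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data (a stand-in for real features and labels).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

toy_param_dist = {"n_estimators": scipy.stats.randint(low=5, high=30)}

def run_search(scoring):
    """Run a small randomized search that differs only in the scoring metric."""
    search = RandomizedSearchCV(
        RandomForestClassifier(class_weight="balanced", random_state=321),
        param_distributions=toy_param_dist,
        n_iter=4,
        cv=3,
        scoring=scoring,
        random_state=123,
    )
    search.fit(X_toy, y_toy)
    return search

search_acc = run_search("accuracy")
search_f1 = run_search("f1")

# Same seed, so the same candidates are explored under both metrics.
same_candidates = search_acc.cv_results_["params"] == search_f1.cv_results_["params"]
print(same_candidates)
print(search_acc.best_params_, search_f1.best_params_)
```

The two `best_params_` may or may not coincide; only the set of explored candidates is guaranteed to match.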

#### 4(f)
rubric={points:5}

Evaluate your final model on the test data. Briefly discuss your results (1-2 sentences).