# Hands-On Section (HW2)

### Read Carefully Before Proceeding

If you are having issues with running this code because of missing libraries, check the material that we've done in class for installation instructions. This code uses what we have already seen, so if you've been able to execute the code of the Notebooks we've seen in class, you will be fine here as well.


You need to answer all questions. Make sure that you answer both **technical** (code-related) and **non-technical** (conceptual) parts of this homework. A lot of code is already available for you, and you can build on that. You are free to use code from our notebooks in class.  All visualizations must be generated by your code, programmatically.


Once you're done, download the notebook via `File` -> `Download as` -> `Notebook`, which will fetch a file with an ".ipynb" extension. Include this file in your submission as a separate document -- **not** inside the Word / PDF submission itself. If you use additional code stored in another directory, make sure to submit that as well.

***


## Customer Churn

In this section, you will apply your technical and other Data Science skills to the Customer Churn problem. MTC has given you access to a small subset of their data, with information they have collected according to your specifications.

As such, they have provided two different datasets: a **training** ('churn_train.csv') and a **testing** one ('churn_test.csv'). You do not know how they picked which instances to place in the two files. Both datasets contain the same _features_ and they both have a target variable: whether the customer left or not. The target variable is the last "column" of the csv file(s), named `LEAVE`.


These two datasets are located in the `data/` folder. They are csv files, so you can open them and have a look if you are interested. Nevertheless, the goal is to use your technical skills to address the problem.


Following good practices, import everything we'll need first. If you need to import more libraries, you can do so here.

In [None]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Let's read the two files. Pandas has a very convenient way to read CSV files.
train_orig = pd.read_csv('data/churn_train.csv')  # Start with the training dataset. 
test_orig = pd.read_csv('data/churn_test.csv')  # Proceed with the testing dataset.

In [None]:
# Let's print the first 5 rows (the default for head()) of the training data to see what we got
train_orig.head()

### Preprocessing #1.1: Converting Features

Sklearn - our Python library for Machine Learning - implements the CART algorithm for Decision Trees.
Sklearn's implementation does not accept categorical or nominal features that are stored as strings. If you pass such columns to the algorithm as is, you will get errors.

We can address this problem by converting each of these categorical features to numerical ones, and this practice is quite common. But first, we need to find out which features we must convert.

In [None]:
train_orig.dtypes

From the output above, the columns listed with "object" as their data type are the categorical ones.
For each of those categorical features, let's see their unique values.

In [None]:
# Go over the column names listed as "object" above (i.e., the categorical features),
# and for each one of them show the unique values that it takes
for col_name in [ 'COLLEGE', 'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL', 'CONSIDERING_CHANGE_OF_PLAN', 'LEAVE' ]:
    print( col_name, ": ", train_orig[col_name].unique() )
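
The column names in the loop above were typed by hand from the `dtypes` output. As a minimal sketch, the same list can be derived programmatically by keeping every column that is not numeric. The toy frame and its values below are hypothetical stand-ins for the churn data (only the column names `COLLEGE` and `LEAVE` come from the dataset; `INCOME` and all the values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data
toy = pd.DataFrame({
    "COLLEGE": ["one", "zero", "one"],
    "INCOME": [31953, 36147, 27273],
    "LEAVE": ["STAY", "LEAVE", "STAY"],
})

# Keep every column whose values are not numeric, i.e. the categorical ones
categorical_cols = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
]

# Show the unique values of each categorical column
for col_name in categorical_cols:
    print(col_name, ":", toy[col_name].unique())
```

Checking for "not numeric" rather than for the exact dtype name is a design choice made here so that the sketch does not depend on how a particular pandas version labels string columns.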

### Converting the feature values

Using the information that we see above, we will perform the following transformation / conversion on the features.

1. For the `COLLEGE` feature, we will map the values as follows: **'one'** = 1, **'zero'** = 0
2. For the `REPORTED_SATISFACTION`, we will map the values as follows: **'very_unsat'** = -2, **'unsat'** = -1, **'avg'** = 0, **'sat'** = 1, **'very_sat'** = 2
3. For the `REPORTED_USAGE_LEVEL`, we will map the values as follows: **'very_little'** = -2, **'little'** = -1, **'avg'** = 0, **'high'** = 1, **'very_high'** = 2
4. For the `CONSIDERING_CHANGE_OF_PLAN`, we will map the values as follows: **'no'** = -2, **'never_thought'** = -1, **'perhaps'** = 0, **'considering'** = 1, **'actively_looking_into_it'** = 2
5. For the `LEAVE`, we will map the values as follows: **'LEAVE'** = 1, **'STAY'** = 0

DataFrames provide a very convenient way to make such replacements. Assuming a variable `my_dframe` that points to a data frame, one may simply call the method `replace()` on that data frame, like so: `my_dframe.replace(...)`.

That method takes as its input parameter the "rules" according to which the replacement will happen. These rules are given in the form of a dictionary. Each entry in that dictionary is a key-value pair with the following semantics:
* key = column name
* value = { current_value1 : new_value1, current_value2 : new_value2, ... }

Take the `COLLEGE` column as an example. In that case, the key will be `COLLEGE` because that is the name of the column with values we want to modify. The _value_ corresponding to the `COLLEGE` key, will be _another_ dictionary. In that 2nd dictionary, the key is what we already have, e.g. `one`, or `zero`, and the value (of the dictionary) is what we want to change it to, i.e., `1` and `0`.
Putting it all together, for the `COLLEGE` example, we will have the following dictionary: `{'one' : 1, 'zero': 0}`.

The rule for just this column looks like this: `{"COLLEGE" : {"one" : 1, "zero": 0}}`.
The following piece of code shows all of the rules we must apply.


In [None]:
# create a dictionary mapping each string to a value
to_replace = {'COLLEGE':{'one':1,'zero':0},
           'REPORTED_SATISFACTION':{'very_unsat':-2,'unsat':-1,'avg':0,'sat':1,'very_sat':2},
           'REPORTED_USAGE_LEVEL':{'very_little':-2,'little':-1,'avg':0,'high':1,'very_high':2},
           'CONSIDERING_CHANGE_OF_PLAN':{'no':-2,'never_thought':-1,'perhaps':0,'considering':1,'actively_looking_into_it':2},
           'LEAVE':{'LEAVE':1,'STAY':0}
          }

The `replace()` method returns a new data frame where the modifications have already taken place.
We want to apply these transformations to **both** the train set **and** the test set.
The code below performs those changes.

In [None]:
# Apply the transformations to both the training and the testing dataset
train_data = train_orig.replace(to_replace)
test_data = test_orig.replace(to_replace)
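
Note that `replace()` leaves any value that is not listed in the rules untouched, without raising an error. The sketch below illustrates this on a hypothetical toy frame, where the value `'very sat'` is a deliberate typo that the rules do not cover. It also shows one possible way to list the columns that still contain strings after the conversion. The names `toy`, `rules`, and `leftover` are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy frame; 'very sat' is a deliberate typo not covered by the rules
toy = pd.DataFrame({
    "COLLEGE": ["one", "zero", "one"],
    "REPORTED_SATISFACTION": ["avg", "sat", "very sat"],
})
rules = {
    "COLLEGE": {"one": 1, "zero": 0},
    "REPORTED_SATISFACTION": {"avg": 0, "sat": 1, "very_sat": 2},
}

# replace() returns a new frame; the original one is left unchanged
converted = toy.replace(rules)

# List the columns that still hold at least one string value
leftover = [
    col for col in converted.columns
    if converted[col].map(lambda v: isinstance(v, str)).any()
]
print("Columns with unconverted values:", leftover)
```

An empty list would mean that every string value was covered by the rules.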

In [None]:
# Check that the changes have been correctly applied
train_data.head()

### Preprocessing #1.2: From DataFrame to _Features_ and _Labels_

The dataframe has everything together: our features of interest and the value for the target variable -- what we often call _"labels"_. As we've already seen, when training a model, our classifier expects the _features_ separately from the _labels_.

In the [fit()](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) method, the first argument corresponds to the features only, while the labels (or target values) are the second argument.

For that reason, it is very convenient to separate the features from the labels for a given dataframe.
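
To make the calling convention concrete, here is a minimal sketch of `fit()` on hypothetical toy arrays (not the churn data). `X` holds one row of features per instance, and `y` holds one label per row of `X`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 4 instances with 2 numeric features each
X = np.array([[0, 10], [1, 20], [0, 30], [1, 40]])
y = np.array([0, 0, 1, 1])  # one label per row of X

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)  # features first, labels second

# Predict on the same rows that the tree was trained on
predictions = clf.predict(X)
print(predictions)
```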

Implement a simple method below that is given a dataframe as input and returns 2 items: the first one will be the _features_ of the input dataframe, while the second one will be the _labels_ (column) of that dataframe.


In [None]:

# Implement the method, following the description above
# Given a dataframe, return the features and the labels (target variable values)

def separate_features_and_labels(dframe):
    features =    # Your code here to get the features
    labels =      # Your code here to get the labels
    return features, labels  # This method returns the features and the labels in that order

In [None]:
# In case you want to test your method


## Deliverable #2: Accuracy and Cross-validation

For the following questions, train and (subsequently) use a Decision Tree classifier.
The classifier must use the `entropy` criterion to split.
For now, **do not** specify the `max_depth` parameter.


Following the above instructions, compute and report the following:
1. Classifier Accuracy on the training data <br/> <br/>

2. Classifier Accuracy on the test data <br/> <br/>

3. Classifier Average Accuracy of a **10-fold** cross-validation. Check the `cv` parameter of the `model_selection.cross_val_score()` method. <br/> <br/>

4. Train your classifier on 66% of the training data and report the accuracy on the remaining 34%. Use the **numerical part** of your _NetID_ to randomize the instance selection process. Check the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method to do this step easily, and pay attention to the `random_state` parameter of that method.
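
As a rough illustration of the two sklearn calls mentioned above, the sketch below runs them on synthetic data from `make_classification`, not on the churn dataset. The sample size, the number of features, and the `random_state` value `12345` are assumptions chosen for the example; in your own work the seed would be the numerical part of your NetID.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

# cv=10 requests 10 folds; the result holds one accuracy score per fold
fold_scores = cross_val_score(clf, X, y, cv=10)
mean_cv_accuracy = fold_scores.mean()

# test_size=0.34 holds out 34% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.34, random_state=12345
)
clf.fit(X_tr, y_tr)
holdout_accuracy = clf.score(X_te, y_te)

print("mean 10-fold CV accuracy: %.4f" % mean_cv_accuracy)
print("hold-out accuracy       : %.4f" % holdout_accuracy)
```

For a classifier, `cross_val_score` uses the estimator's own `score()` method by default, which reports accuracy.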

In [None]:
# Implement the following

# accuracy on the training data:
acc_train = 

# accuracy on the test data:
acc_test = 


# average (mean) accuracy on 10-fold CV:
acc_10cv = 

# accuracy on 66% split:
acc_66pct = 

print("accuracy on the training data : %.4f" % acc_train)
print("accuracy on the test data     : %.4f" % acc_test)
print("accuracy on the 10-fold CV    : %.4f" % acc_10cv)
print("accuracy on the 66%% split     : %.4f" % acc_66pct)


## Deliverable #2.1: Complexity Control

We've said in class that Decision Trees can overfit if we allow them to increase in depth a lot, and we should plot their performance (accuracy) as a function of their depth to figure out the "sweet spot".

We also mentioned that this classifier takes into account several other conditions, not just the depth, to check if they should split a region further. One such condition is the minimum number of instances we allow to appear in a leaf node. In sklearn's `DecisionTreeClassifier` this parameter is called `min_samples_leaf` and is specified when _creating_ the classifier, in a way similar to specifying the `max_depth` parameter. For more details check [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
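
The sketch below shows where the two parameters go and checks their effect on a fitted tree. It uses synthetic data from `make_classification` rather than the churn data, and the values `max_depth=5` and `min_samples_leaf=20` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Both complexity parameters are given when the classifier is created
clf = DecisionTreeClassifier(
    criterion="entropy", max_depth=5, min_samples_leaf=20, random_state=0
)
clf.fit(X, y)

# In the fitted tree structure, leaf nodes have no children (marked with -1)
is_leaf = clf.tree_.children_left == -1
smallest_leaf = clf.tree_.n_node_samples[is_leaf].min()

print("tree depth            :", clf.get_depth())
print("smallest leaf contains:", smallest_leaf, "training instances")
```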

For this deliverable, you will pick several different values (at least 5) for the depth of the tree and several different values (at least 4) for the minimum number of leaf instances. For each combination of (`max_depth`, `min_samples_leaf`) you will compute the average accuracy of a 3-fold cross validation.

You will then visualize, in the same plot, the different 3-fold average accuracy scores (y-axis) that you get as a function of the tree depth (x-axis). Each value of `min_samples_leaf` will be a separate line in the plot that you generate.

<br/>

**Requirements Checklist**
* Select several values (at least 5) for the maximum depth of the tree.
* Select several values (at least 4) for the minimum number of leaf instances
* For each combination of (max_depth, min_samples_leaf):
    * Do 3-fold cross validation (CV)
    * Compute the average accuracy of the 3-fold CV process. That's the accuracy score for this combination
* Visualize the computed information as follows:
    * The x axis must be the maximum tree depth values (in ascending order)
    * The y axis must be the accuracy score
    * Each value of minimum instances at a leaf needs to be a separate line in the plot

***


An easy way to generate the necessary values is through a **nested for-loop**.
We can then store these results in a DataFrame and plot them.
A simplified example is shown below:

In [None]:
# Sample code that generates y-axis values for a given combination of 2 variables.
# This code is provided for convenience

xvals = range(1, 10)
line_vals = range(1, 4)

# We will use a dataframe because it has a simple-to-use plotting function.
df = pd.DataFrame(index=xvals, columns=line_vals)

for v in line_vals:  # This loop controls the different lines

    l = []
    for x in xvals:  # This loop controls the different values on the x-axis
        l.append(x ** v)  # Simply compute x^v. Append your y-axis value to the list

    df[v] = l  # Store the result in the dataframe for this particular line

# We can plot the contents of a dataframe with the plot() method
ax = df.plot()
ax.set_xlabel("This should be tree depth")
ax.set_ylabel("This should be accuracy")

The assignment asks that, for each combination of depth and minimum instances at a leaf, you perform a 3-fold cross validation and compute the average score.

Things will be a lot easier if you create a method that:
* Takes the necessary input arguments
* Trains a DecisionTreeClassifier on the proper input
* Runs a 3-fold cross validation
* Computes the average accuracy of these 3 folds
* Returns the average accuracy

_Note:_ You do not have to structure your code this way if you do not want to, as long as you meet the objective.

In [None]:
# Your code here

# Think about the parameters that you will need for your code. Add them as arguments to your method
def cv_eval(   ):
    # Check above for what the method should do
    return 0.0  # Return the proper value


Recall that you must select a number of depth levels and a number of minimum instances for the leaf nodes. You can then structure your code according to the sample code provided earlier.

You can then use the method that you created above to compute the average 3-fold Cross Validation accuracy score as your y-value for a given combination of parameters.

In [None]:
# Your code here

depths_list =    # Put your depth values here
min_leaf_size_list =    # Put your min leaf size here


# Generate the required accuracy values for each combination
# Check the sample code above with the nested for-loop


## Deliverable #2.2: Evaluation

Based on the graph that you generated above:
- What value would you use for max depth?
- What value would you use for min leaf size?
- If you could only specify one, which complexity parameter would you choose (max depth or min leaf size)? Why?


### Answer

Write your answer here

***


## Deliverable #2.3: Test set Evaluation

Train a DecisionTreeClassifier using the parameter combination that produced the best result according to your complexity control evaluation. Report the accuracy of that classifier on the _test_ dataset, i.e. the 'churn_test.csv' dataset.

In [None]:
# Show work here

acc_test = 

print("Best Model Accuracy on the test data     : %.4f" % acc_test)


## Deliverable #3: Learning Curves

Learning curves help you determine how much data you realistically need to train your model. You will pick a portion of your original training data to train your model, and then test it on other datasets. In particular:

* Use the parameters that you picked as your best performing ones in your complexity control.
* Select multiple (at least 5, non-zero) percentage values. Each such percentage corresponds to how much of your original data you'll use for training.
* Use `train_test_split()` to select a percentage of the training data to use to fit the model
* Compute the accuracy on:
    * The train set that you got from the `train_test_split()` method.
    * 3-fold cross validation on the train set that you got from the split. You can / should reuse your method from before.
    * The test set that you got from the `train_test_split()` method.
    * The original testing dataset (the one that was given to us and is stored in 'data/churn_test.csv')
* Plot the accuracy (y-axis) as computed above VS train size (x-axis), for each of the 4 cases


Start by computing the above once for each percentage value. However, for a given percentage value, you should repeat these computations multiple times and report their average, to get more robust results.
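
The idea of repeating a computation and averaging can be sketched as follows. This covers only one of the four requested curves (accuracy on the held-out part of the split), uses synthetic data instead of the churn data, and leaves the tree parameters at arbitrary defaults. The helper name `mean_holdout_accuracy`, the number of repeats, and the fractions are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)


def mean_holdout_accuracy(X, y, train_fraction, n_repeats=5):
    """Average hold-out accuracy over several random splits (hypothetical helper)."""
    scores = []
    for seed in range(n_repeats):
        # A different seed gives a different random split of the same size
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_fraction, random_state=seed
        )
        clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return float(np.mean(scores))


# One averaged score per training fraction
averaged = {frac: mean_holdout_accuracy(X, y, frac) for frac in (0.2, 0.5, 0.8)}
for frac, score in averaged.items():
    print("train fraction %.1f -> mean accuracy %.4f" % (frac, score))
```

Averaging over several seeds reduces the influence of any single lucky or unlucky split on the reported score.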

In [None]:

# Your code here


# Select the different training sizes


# Write your code to generate the requested accuracy scores.
# Make sure that you keep track of them so that you can visualize them.




## Deliverable #3.1: Evaluation
Would you recommend your firm spend money to collect data on more customers?


### Answer

Your answer here

***