In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

# Machine Learning Models
---

In this lab, we'll cover a few machine learning models to make classification predictions. We'll start with the k-nearest neighbors classifier, which you should have seen in Data 8, then move to a couple more models. Note that the models that we work with are solely for classification. Numerical predictions will require different [models](https://scikit-learn.org/stable/supervised_learning.html).

1. [K-Nearest Neighbors](#knn)
2. [Decision Trees](#decisiontree)
3. [Random Forest](#randforest)
4. [Support Vector Classification](#svc)

## Data Cleaning & Preparation

This notebook will use credit card default data from Taiwan. You can find the original data and a description of the data [here](https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients).

In [None]:
data = pd.read_excel("credit_card_defaults.xls",  header=1, dtype=np.int64)
data.head(4)


### Cleaning the Dataset

You might have noted that the first column and the first row of our original data look a little funky. In the next few cells, we remedy this by dropping the column "Unnamed: 0" and replacing the current column names with the values in the first row.

First, we drop the first column, which was the original index.

In [None]:
clean = data.drop(data.columns[0], axis=1)
clean.head(2)

Then, we can collect all of the variable names.


In [None]:
variables = clean.iloc[0].to_dict()
print("Some Keys: ", list(variables.keys())[:5])
print("Some Values: ", list(variables.values())[:5])
variables

You can also assign the explanatory columns directly.

In [None]:
# `model_data` is only created in the sampling step later on, so select from `clean` here
print(clean.columns)

X = clean[['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE']].astype(int)

Finally, we relabel our column names with the variables that were originally in the first row of our data.

### Working with a Subset of Our Data

Our dataset looks a little cleaner now. Next, take a look at how many entries are in our dataset. There are a large number of entries, which is great! However, DataHub can only do so much computation, so we'll take a random sample of 10,000 entries for this lab.

In [None]:
# look at the number of entries in our dataset
data.shape[0]

In [None]:
# take a sample of our data
model_data = clean.sample(n=10000, random_state=42)
model_data.head(4)

### Preparing the Data for an ML Model

The last thing we do is split our data into a training set and a test set. The training set is used to fit our model, and the test set is used to evaluate how well the model works on data it has not seen.

First, we'll split our data into `X` and `y` variables. `X` will contain every variable except "payment next month", the last column in our table, and `y` will contain only the "payment next month" variable.

**Note:** The following lines of code have `astype(int)` at the end. Our data was not originally entered as integers, so we need to make sure that it is!

In [None]:

X = model_data[list(variables.keys())[0:-1]].astype(int) 
y = model_data[list(variables.keys())[-1]].astype(int)

Now that we have our `X` and `y` variables, we can split them into training and test sets. The next cell does a random 80/20 split of our data into these two sets. Notice that there are four outputs from `train_test_split`, which we assign to `X_train`, `X_test`, `y_train`, and `y_test`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# double check that we split the train set
print("X train shape: ", X_train.shape)
print("y train shape: ", y_train.shape)
print("X test shape: ", X_test.shape)
print("y test shape: ", y_test.shape)
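To see what an 80/20 split does on a dataset small enough to check by hand, here is a minimal sketch. The `toy_X` and `toy_y` arrays are made up for illustration and are not part of the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 rows, 2 features, binary labels.
toy_X = np.arange(100).reshape(50, 2)
toy_y = np.array([0] * 40 + [1] * 10)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)  # (40, 2) (10, 2)
```

Running the split again with the same `random_state` should give the same rows in each set, which is why the lab fixes it to 42.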

## 1. K-Nearest Neighbors Classifier <a id='knn'></a>
---




### Specifying Hyperparameters

Let's start by creating a k-nearest neighbors classifier where $k=3$. We'll use the `KNeighborsClassifier` model from the scikit-learn library, which has a multitude of models you could use.

Fill in the first blank to specify your hyperparameter (in this case, the value of `k`). Then, use this [page](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to create the classifier. Please reference this page for the remainder of this section. You may find the "Examples" section helpful, but please also read through the "Parameters", "Attributes", and "Methods" sections.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
k = 3 

knn = KNeighborsClassifier(n_neighbors=k) 
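As a refresher on what the hyperparameter `k` controls, the sketch below classifies a single point by hand: it finds the `k` closest training rows and takes a majority vote of their labels. The function `knn_predict_one` and the toy arrays are hypothetical names for illustration only, and this is not how scikit-learn implements the classifier internally.

```python
import numpy as np
from collections import Counter

def knn_predict_one(train_X, train_y, point, k):
    """Predict one label by majority vote among the k nearest training rows."""
    # Euclidean distance from the point to every training row.
    distances = np.sqrt(((train_X - point) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
toy_train_X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
toy_train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(toy_train_X, toy_train_y, np.array([0.5, 0.5]), k=3))  # 0
print(knn_predict_one(toy_train_X, toy_train_y, np.array([5.5, 5.5]), k=3))  # 1
```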

### Training the Model on Your Data

Note that our `knn` model has not yet "trained" on any data, as we have not given it any data to train on.

Fit our model to the data we would like to train on. Your code should make use of your `X_train` dataset and your `y_train` dataset.

In [None]:

knn.fit(X_train, y_train)

### Making Predictions

After running your cell, you should see a blue box with "KNeighborsClassifier" on it. This indicates that our computer has prepared a model based on our specifications (i.e., hyperparameters) and trained it. In other words, our model is now ready to make predictions!

In this section, let's make predictions for our test set. Refer back to the documentation provided earlier.

Your cell may display a light red box with a warning message, but feel free to ignore it.

In [None]:
y_pred = knn.predict(X_test)
y_pred[:5]

### Calculating Accuracy

Now that we've made predictions for our test set, we should calculate the accuracy of our model. Take a look at the documentation for `accuracy_score` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html). Use this function to calculate the accuracy score, and recall that accuracy is computed as follows.

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{number of predictions}}$$

In [None]:
from sklearn.metrics import accuracy_score

In [None]:

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
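To connect the formula to the function, here is a small sketch that computes accuracy by hand and compares it with `accuracy_score`. The `toy_true` and `toy_pred` arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 5 predictions match the true labels.
toy_true = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1])

# (number of correct predictions) / (number of predictions)
manual_accuracy = np.mean(toy_true == toy_pred)

print(manual_accuracy)                     # 0.8
print(accuracy_score(toy_true, toy_pred))  # 0.8
```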

### Creating a Confusion Matrix

Accuracy isn't the only metric to measure how well our model works. We can also evaluate our model by looking at a confusion matrix. Again, take a look at the relevant scikit-learn documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and create a confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# print confusion matrix
confusion_matrix_knn = confusion_matrix(y_test, y_pred)
print(confusion_matrix_knn)
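For a binary problem, scikit-learn lays out the confusion matrix with true labels as rows and predicted labels as columns, so flattening it gives the counts in the order true negatives, false positives, false negatives, true positives. The sketch below unpacks those counts on made-up labels and derives precision and recall from them. The `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration.
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)  # of the predicted 1s, how many were really 1
recall = tp / (tp + fn)     # of the real 1s, how many we predicted as 1

print(tn, fp, fn, tp)     # 3 1 1 3
print(precision, recall)  # 0.75 0.75
```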

## 2. Decision Tree Classifier <a id='decisiontree'></a>
---



### Specifying Hyperparameters

Now let's make a decision tree classifier! We imported another scikit-learn model (`DecisionTreeClassifier`) for you below. See this [page](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for the official documentation and reference it for the remainder of this section.

Create a `DecisionTreeClassifier` with a maximum depth of 5, and ensure that it does _not_ split any node that contains 3 or fewer data points.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
# specify the hyperparameter(s) & create the model
max_depth = 5 
min_split = 4 

tree = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_split) 
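To get a feel for what `max_depth` does, the sketch below fits one shallow tree and one unrestricted tree on synthetic data and compares their sizes. The data comes from `make_classification`, not the credit card dataset, and the `toy_`, `shallow_tree`, and `deep_tree` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
toy_X, toy_y = make_classification(n_samples=500, n_features=8, random_state=0)

# Same min_samples_split, different depth limits.
shallow_tree = DecisionTreeClassifier(
    max_depth=2, min_samples_split=4, random_state=0
).fit(toy_X, toy_y)
deep_tree = DecisionTreeClassifier(
    max_depth=None, min_samples_split=4, random_state=0
).fit(toy_X, toy_y)

print("shallow:", shallow_tree.get_depth(), "levels,", shallow_tree.get_n_leaves(), "leaves")
print("deep:   ", deep_tree.get_depth(), "levels,", deep_tree.get_n_leaves(), "leaves")
```

A tree limited to depth 2 can have at most 4 leaves, while the unrestricted tree keeps splitting until its nodes are pure or too small to split.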

### Training the Model & Making Predictions

Then, fit the model to your data. This should be similar to what you wrote for your `KNeighborsClassifier`.

In [None]:
# fit the model to our data
tree.fit(X_train, y_train) 

Once again, you should see a blue box with "DecisionTreeClassifier" written on it. Make some predictions using this model, and assign those predictions to the variable `y_pred`.

In [None]:
# make predictions
y_pred = tree.predict(X_test) 

### Measuring Goodness

Finally, let's measure the goodness of our model. Calculate and print the accuracy of our model, then create and print a confusion matrix.

In [None]:
# print accuracy score
accuracy = accuracy_score(y_test, y_pred) 
print(f"Accuracy: {accuracy:.2f}")

In [None]:
# print confusion matrix
confusion_matrix_tree = confusion_matrix(y_test, y_pred) 
print(confusion_matrix_tree)

### Plotting the Tree

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plot_tree(tree, filled=True, feature_names=list(X_train.columns))  # label nodes with the columns the model was fit on
plt.show()
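If the plot is hard to read, scikit-learn's `export_text` prints the same splitting rules as plain text. Here is a minimal sketch on synthetic data. The feature names `f0` through `f3` and the `toy_` variables are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for illustration).
toy_X, toy_y = make_classification(n_samples=200, n_features=4, random_state=0)
toy_names = ["f0", "f1", "f2", "f3"]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# One line per node: the split condition, or the predicted class at a leaf.
rules = export_text(toy_tree, feature_names=toy_names)
print(rules)
```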


## 3. Random Forest <a id='randforest'></a>
---




Next, let's work with a random forest. Recall that a random forest is composed of _multiple_ decision trees that split nodes on different variables. When making predictions, it has each decision tree make its own prediction and then outputs the majority class as its final prediction.
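The voting idea can be sketched by asking each tree in a fitted forest for its own predictions and counting votes. Note one caveat: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes, so the hand-counted result below may occasionally differ from `predict`. The synthetic data and the `toy_` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration).
toy_X, toy_y = make_classification(n_samples=300, n_features=6, random_state=0)
toy_forest = RandomForestClassifier(
    n_estimators=10, max_depth=5, random_state=0
).fit(toy_X, toy_y)

# One row of predictions per tree: shape (number of trees, number of samples).
tree_votes = np.array([t.predict(toy_X) for t in toy_forest.estimators_])

# Fraction of trees voting for class 1; above one half means class 1 wins.
# (An exact 5-5 tie falls to class 0 in this sketch.)
vote_share = tree_votes.mean(axis=0)
majority_vote = (vote_share > 0.5).astype(int)

# How often the hand-counted vote matches the forest's own prediction.
agreement = np.mean(majority_vote == toy_forest.predict(toy_X))
print(f"Agreement with forest.predict: {agreement:.2f}")
```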

See this [page](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) for the official documentation of scikit-learn's `RandomForestClassifier`. We'll repeat the process above by ...

1. Specifying the Hyperparameters

2. Training the Model & Making Predictions

3. Measuring Goodness

Specify a maximum depth of 5, only split nodes that contain 4 or more data points, use 10 decision trees, and consider only 1 feature at each split.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# specify the hyperparameter(s) & create the model
max_depth = 5
min_split = 4
num_estimators = 10
max_fts = 1

# pass min_samples_split so that nodes with fewer than 4 data points are not split
forest = RandomForestClassifier(max_depth=max_depth, min_samples_split=min_split, n_estimators=num_estimators, max_features=max_fts)

In [None]:
# fit the model to our data
forest.fit(X_train, y_train) 

In [None]:
# make predictions
y_pred = forest.predict(X_test)

In [None]:
# print accuracy score
accuracy = accuracy_score(y_test, y_pred) 
print(f"Accuracy: {accuracy:.2f}")

In [None]:
# print confusion matrix
confusion_matrix_forest = confusion_matrix(y_test, y_pred) 
print(confusion_matrix_forest)

In [None]:
# plot the first tree in the forest
plt.figure(figsize=(20, 10))
plot_tree(forest.estimators_[0], filled=True, feature_names=list(X_train.columns))
plt.show()

## 4. Support Vector Classification <a id='svc'></a>
---
Lastly, we'll work with support vector classification. Reference this [page](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for the official documentation and repeat the steps above.

Set your regularization hyperparameter `C` to 1, then try out different values.


In [None]:
from sklearn.svm import SVC

In [None]:
# specify the hyperparameter(s) & create the model
reg_param = 1  # start with 1, then try out other values

svc = SVC(C=reg_param) 

In [None]:
# fit the model to our data
svc.fit(X_train, y_train) 

In [None]:
# make predictions
y_pred = svc.predict(X_test) 

In [None]:
# print accuracy score
accuracy = accuracy_score(y_test, y_pred) 
print(f"Accuracy: {accuracy:.2f}")

In [None]:
# print confusion matrix
confusion_matrix_svc = confusion_matrix(y_test, y_pred)
print(confusion_matrix_svc)
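One way to try out different values of `C` is a short loop. In scikit-learn's `SVC`, regularization strength is inversely proportional to `C`, so smaller values regularize more strongly. The sketch below uses synthetic data from `make_classification` so that it runs quickly. The `toy_` names and the particular `C` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration).
toy_X, toy_y = make_classification(n_samples=400, n_features=6, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Fit one model per C value and record its test accuracy.
toy_scores = {}
for c_value in [0.01, 1, 100]:
    toy_svc = SVC(C=c_value).fit(toy_X_train, toy_y_train)
    toy_scores[c_value] = accuracy_score(toy_y_test, toy_svc.predict(toy_X_test))

for c_value, score in toy_scores.items():
    print(f"C={c_value}: accuracy={score:.2f}")
```

To run the same loop on the lab data, swap in `X_train`, `X_test`, `y_train`, and `y_test`. Expect it to take noticeably longer than the other models.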