# FutureLab Machine Learning workshop - homework assignment

This notebook is the final homework assignment for the FutureLab Machine Learning workshop.

In [None]:
# Import statements
# Don't touch this code!
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import tensorflow as tf
import matplotlib.pyplot as plt

## Assignment 1: random forest

### Dataset

For the first assignment we will look at a dataset on credit card payment defaults in Taiwan. The response variable is binary: default payment (Yes = 1, No = 0). The following 23 variables are the explanatory variables:
* X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit. 
* X2: Gender (1 = male; 2 = female). 
* X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). 
* X4: Marital status (1 = married; 2 = single; 3 = others). 
* X5: Age (year). 
* X6-X11: History of past payment. The monthly payment records from April to September 2005 are coded as follows: X6 = the repayment status in September 2005; X7 = the repayment status in August 2005; ...; X11 = the repayment status in April 2005. The measurement scale for the repayment status is: -1 = paid duly; 1 = payment delayed by one month; 2 = payment delayed by two months; ...; 8 = payment delayed by eight months; 9 = payment delayed by nine months or more.
* X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September 2005; X13 = amount of bill statement in August 2005; ...; X17 = amount of bill statement in April 2005.
* X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September 2005; X19 = amount paid in August 2005; ...; X23 = amount paid in April 2005.

### Model

We use a random forest to predict whether a person will default on their payment.
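
As a rough illustration of how a random forest is configured in scikit-learn, the sketch below trains a few forests on synthetic data and compares their test accuracy. The synthetic dataset, the `*_demo` variable names, and the parameter values are assumptions made for this example only; they are not the credit data and not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the credit dataset).
x_demo, y_demo = make_classification(
    n_samples=600, n_features=23, n_informative=8, random_state=0)
x_demo_train, x_demo_test, y_demo_train, y_demo_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0)

# Try a few illustrative settings; the values are arbitrary, not recommendations.
demo_scores = {}
for n_estimators, max_depth in [(1, None), (50, None), (50, 4)]:
    forest = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0)
    forest.fit(x_demo_train, y_demo_train)
    # score() returns the fraction of correctly classified test samples.
    demo_scores[(n_estimators, max_depth)] = forest.score(x_demo_test, y_demo_test)

for setting, score in demo_scores.items():
    print("n_estimators, max_depth =", setting, "-> accuracy", round(score, 3))
```

The same loop pattern can be pointed at the real training and test arrays once they are loaded.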

### Questions

1. How does a random forest work? What are the advantages of using a random forest over a decision tree?
2. What do the parameters 'n_estimators' and 'max_depth' mean? Can you find values that push the test set accuracy over 83%?
3. Which parameter can you use for regularization and why?
4. If you work for a credit card company, which fields in the confusion matrix do you want to minimize?
5. Optional: write code to use gradient boosted decision trees instead of a random forest. Does this method perform better? What are the advantages of gradient boosting over a random forest?
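
For the optional question, one possible starting point is scikit-learn's `GradientBoostingClassifier`, which follows the same `fit`/`score` interface as the random forest. The sketch below runs on synthetic data; the `*_gb` variable names and all parameter values are assumptions chosen for illustration, so the printed accuracy says nothing about the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the credit dataset).
x_gb, y_gb = make_classification(
    n_samples=400, n_features=23, n_informative=8, random_state=1)
x_gb_train, x_gb_test, y_gb_train, y_gb_test = train_test_split(
    x_gb, y_gb, test_size=0.25, random_state=1)

# Trees are fit one after another; each new tree tries to correct the
# errors of the ensemble built so far. Parameter values are illustrative.
booster = GradientBoostingClassifier(
    n_estimators=50, learning_rate=0.1, max_depth=3, random_state=1)
booster.fit(x_gb_train, y_gb_train)

gb_acc = booster.score(x_gb_test, y_gb_test)
print("Gradient boosting accuracy on the synthetic test split:", round(gb_acc, 3))
```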

In [None]:
# Get dataset
# Don't touch this code!
df_credit = pd.read_csv('credit.csv', sep=';').drop('ID', axis=1)
x = np.array(df_credit.drop('Y', axis=1))
y = np.array(df_credit['Y']).flatten()
x_train = x[:25000]
x_test = x[25000:]
y_train = y[:25000]
y_test = y[25000:]
print("First 5 rows of data:")
print(df_credit.head())

In [None]:
# Make and train the random forest model (with n_estimators=1 it holds just one tree).
tree = RandomForestClassifier(random_state=0, n_estimators=1, max_depth=100)
tree.fit(x_train, y_train)

# Test the model.
# Don't touch this code!
acc = tree.score(x_test, y_test)
conf_matrix = confusion_matrix(y_test, tree.predict(x_test), labels=[0, 1])
df_conf = pd.DataFrame(data=conf_matrix.T)
print("Accuracy: ", acc)
print("---------")
print("Confusion matrix (horizontal axis: actual class, vertical axis: predicted class):")
print(df_conf)
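
To make the four cells of the confusion matrix easier to read, the sketch below builds one from a tiny hand-made example and unpacks it into true negatives, false positives, false negatives, and true positives. The label arrays are hypothetical and not taken from the credit data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (hypothetical labels, not taken from the credit data).
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# scikit-learn puts actual classes on the rows and predicted classes on the columns.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm_demo.ravel()
print("True negatives: ", tn)   # actual 0, predicted 0
print("False positives:", fp)   # actual 0, predicted 1
print("False negatives:", fn)   # actual 1, predicted 0
print("True positives: ", tp)   # actual 1, predicted 1

# Transposing gives the layout printed by the notebook:
# rows are predicted classes, columns are actual classes.
print(cm_demo.T)
```

In the transposed layout, false negatives sit in the top-right cell and false positives in the bottom-left cell.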

## Assignment 2: neural networks

### Dataset

For this assignment we use the Fashion MNIST dataset, which contains 70,000 grayscale images in 10 categories. The images show individual articles of clothing at low resolution (28 by 28 pixels).

### Model

We use a neural network to predict the clothing category from the pixel data.
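
The output layer of the network in this notebook uses a softmax activation with 10 units, one per clothing category, and the predicted category is the unit with the highest probability. The NumPy sketch below shows that last step in isolation. The `softmax` helper and the raw scores are written for this example and are assumptions, not part of the notebook's model.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 images and 10 classes.
demo_logits = np.zeros((2, 10))
demo_logits[0, 9] = 3.0   # first image scores highest for class 9
demo_logits[1, 1] = 2.0   # second image scores highest for class 1

demo_probs = softmax(demo_logits)
demo_pred = demo_probs.argmax(axis=1)   # index of the most probable class per image

print("Row sums:", demo_probs.sum(axis=1))
print("Predicted class indices:", demo_pred)
```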

### Questions

1. Turn the neural network into a convolutional neural network. What code did you change/add?
2. Explain what each layer does, and what its parameters mean.
3. What is the danger of using too many epochs to train your model?
4. How could we use data augmentation to improve our model? (You don't need to program this, just explain.)
5. Optional: can you get up to 90% test set accuracy within 3 epochs?
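
As background for the data augmentation question, the sketch below shows two simple image transformations in plain NumPy: a horizontal flip and a shift to the right. The helper names `flip_horizontal` and `shift_right`, the random stand-in images, and the shift size are assumptions for illustration. Whether a given transformation suits a dataset depends on the data.

```python
import numpy as np

def flip_horizontal(images):
    """Mirror each image left to right; expects shape (n, height, width)."""
    return images[:, :, ::-1]

def shift_right(images, pixels):
    """Shift each image right by `pixels` (>= 0) columns, padding the left edge with zeros."""
    shifted = np.zeros_like(images)
    shifted[:, :, pixels:] = images[:, :, :images.shape[2] - pixels]
    return shifted

# Random stand-in images and labels (hypothetical; not Fashion MNIST).
rng = np.random.default_rng(0)
demo_images = rng.random((4, 28, 28))
demo_labels = np.array([0, 3, 7, 9])

# Each transformed copy keeps the label of the image it came from.
augmented_images = np.concatenate(
    [demo_images, flip_horizontal(demo_images), shift_right(demo_images, 2)])
augmented_labels = np.tile(demo_labels, 3)

print("Original set: ", demo_images.shape)
print("Augmented set:", augmented_images.shape)
```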

In [None]:
# Get dataset
# Don't touch this code!
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = np.array(['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                        'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'])

# Show some example figures and labels.
# Don't touch this code!
print("9 training images:")
plt.figure(figsize=(18, 2))
for i, x in enumerate(train_images[:9]):
    subplot_id = 191 + i
    plt.subplot(subplot_id)
    plt.imshow(x, cmap=plt.get_cmap('Greys'))
plt.show()
print("Labels:")
print(class_names[train_labels[:9]])

In [None]:
# Define a neural network.
# Here, you can add some extra layers to improve the model!
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(2, activation=tf.nn.tanh),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam', 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])

In [None]:
# Fit model to training data.
train_images_reshaped = train_images.reshape(60000, 28, 28, 1)
model.fit(train_images_reshaped, train_labels, epochs=1)

In [None]:
# Evaluate the neural network.
# Don't touch this code!
test_images_reshaped = test_images.reshape(10000, 28, 28, 1)
pred_labels = model.predict(test_images_reshaped).argmax(axis=1)
model.evaluate(test_images_reshaped, test_labels)

# Show the first 9 images, true labels and predicted labels.
# Don't touch this code!
print("-----------------")
print("9 testing images:")
plt.figure(figsize=(18, 2))
for i, x in enumerate(test_images[10:19]):
    subplot_id = 191 + i
    plt.subplot(subplot_id)
    plt.imshow(x, cmap=plt.get_cmap('Greys'))
plt.show()
print("True labels:")
print(class_names[test_labels[10:19]])
print("Predicted labels:")
print(class_names[pred_labels[10:19]])