# Machine Learning 2020 Course Projects

## Project Schedule

In this project, you will solve a real-life problem using a provided dataset. The project is split into two phases:

27th May - 9th June: We will give you a training set with target values and a test set without targets. You predict the targets of the test set by trying different machine learning models and submit your best result. We will evaluate your results for the first time at the end of phase 1.

10th June - 24th June: Students ranked high on the leaderboard will briefly explain their submissions in a proseminar. We will also release some general advice for improving the results. You then try to improve your predictions and submit your final results at the end of the phase. We will again ask randomly selected groups to present and show their implementation.

The project shall be completed by a team of two people. Please find a teammate and register [here](https://docs.google.com/forms/d/e/1FAIpQLSf4uAQwBkTbN12E0akQdxfXLgUQLObAVDRjqJHcNAUFwvRTsg/alreadyresponded).

Submission and evaluation are handled by [Kaggle](https://www.kaggle.com/t/426d97d4138b49b2802c2ee0461a18ac). In order to submit, you need to create an account; please use your team name in the `team tag` on the [Kaggle page](https://www.kaggle.com/t/426d97d4138b49b2802c2ee0461a18ac). Two people can submit as a team on Kaggle.

You can submit and test your results on the test set twice a day. You upload your predicted values as a CSV file, and your result is shown on a leaderboard. We collect data for grading at 22:00 on the **last day of each phase**. Please secure your best results before this time.



## Project Description

Car insurance companies are always trying to come up with fair insurance plans for customers. They would like to offer a lower price to careful, safe drivers, while careless drivers who have filed claims in the past pay more. In addition, more safe drivers mean lower operating costs for the company. However, for new customers, it is difficult for the company to know who the safe drivers are. As a result, if a company offers a low price, it bears a high risk of cost. If it does not, the company loses competitiveness and encourages new customers to choose its competitors.


Your task is to create a machine learning model that mitigates this problem by identifying the safe drivers among new customers based on their profiles. The company can then offer them a low price to boost the acquisition of safe customers and reduce the risk of costs. We provide you with a dataset (`train_set.csv`) containing customer profiles (columns starting with `ps_`). You will be asked to predict whether each customer in `test_set.csv` will file a claim (`target`) in the next year.

~~You can find the dataset in the `data/final-project-data` folders in the jupyter hub.~~ We have also uploaded the dataset to Kaggle, where we will test your results and provide a leaderboard. Please find the files under the Data tab on the following page:

https://www.kaggle.com/t/426d97d4138b49b2802c2ee0461a18ac

## Phase 1: 26th May - 9th June

### Data Description

To take a look at the data, you can use the `describe()` method. As you can see in the result, each row has a unique `id`. `target` $\in \{0, 1\}$ indicates whether a customer will file a claim during the insurance period. The remaining 57 columns are features describing the customers' profiles. You might also notice that some of the features have a minimum value of `-1`. This indicates that the actual value is missing or inaccessible.


In [194]:
# Quick load dataset and check
import pandas as pd
import os, sys
running_local = os.getenv('JUPYTERHUB_USER') is None
if not running_local:
    path = "/data/final-project-dataset/"
else:
    path = "./data/"
    !{sys.executable} -m pip install -r requirements.txt



In [195]:
filename = os.path.join(path, "train_set.csv")
data_train = pd.read_csv(filename)
filename = os.path.join(path, "test_set.csv")
data_test = pd.read_csv(filename)

The prefixes, e.g. `ind` and `calc`, indicate that features belong to similar groupings. The postfix `bin` indicates binary features and `cat` indicates categorical features. Features without a postfix are ordinal or continuous. Similarly, you can check the statistics for the test data.

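The naming convention described above can be turned into a small helper for grouping columns. This is only a sketch: `describe_feature` is a hypothetical helper (not part of the course material), and it assumes every feature name follows the `ps_<group>_<number>[_bin|_cat]` pattern.

```python
def describe_feature(name):
    """Split a column name such as 'ps_car_01_cat' into (group, kind)."""
    parts = name.split("_")
    group = parts[1]  # e.g. 'ind', 'reg', 'car', 'calc'
    if parts[-1] == "bin":
        kind = "binary"
    elif parts[-1] == "cat":
        kind = "categorical"
    else:
        kind = "ordinal or continuous"
    return group, kind


# A few column names taken from the dataset
columns = ["ps_ind_02_cat", "ps_calc_15_bin", "ps_reg_03", "ps_car_11"]
summary = {col: describe_feature(col) for col in columns}
for col, (group, kind) in summary.items():
    print(f"{col}: group={group}, kind={kind}")
```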
In [196]:
from tqdm.notebook import tqdm, trange
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from IPython.display import set_matplotlib_formats
from contracts import contract
import sklearn
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA


### Handling missing values

In [197]:
# view the missing columns
missing_col = {}
for col in data_train.columns:
    counter = len(data_train[data_train[col] == -1])
    if counter > 0:
        missing_col[col] = counter / len(data_train) * 100
        print('{}\t{:.2f}'.format(col, missing_col[col]))

ps_ind_02_cat	0.03
ps_ind_04_cat	0.01
ps_ind_05_cat	0.98
ps_reg_03	18.13
ps_car_01_cat	0.02
ps_car_02_cat	0.00
ps_car_03_cat	69.07
ps_car_05_cat	44.77
ps_car_07_cat	1.91
ps_car_09_cat	0.09
ps_car_11	0.00
ps_car_12	0.00
ps_car_14	7.15

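The same percentages can also be computed without an explicit loop. The sketch below uses a small made-up frame (`toy`, not the real data) and assumes, as in the dataset, that `-1` marks a missing value.

```python
import pandas as pd

# Toy frame standing in for data_train; -1 marks a missing value
toy = pd.DataFrame({
    "ps_ind_02_cat": [1, -1, 2, 1],
    "ps_reg_03": [0.5, -1.0, -1.0, 0.7],
    "ps_car_11": [3, 2, 1, 0],
})

# Share of -1 entries per column, in percent
missing_pct = (toy == -1).mean() * 100
# Keep only the columns that actually contain missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct.round(2))
```

Comparing the whole frame against `-1` yields a boolean frame, and the column-wise mean of booleans is the fraction of missing entries.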

In [198]:
print(data_train.shape[1])
data_train.drop(columns=[col for col, val in missing_col.items() if val >= 10], inplace=True)
data_test.drop(columns=[col for col, val in missing_col.items() if val >= 10], inplace=True)
print(data_train.shape[1])

59
56


In [181]:
## not used in 0.52
data_train.drop(columns=[col for col, val in missing_col.items() if val >= 10 and 'cat' not in col], inplace=True)
data_test.drop(columns=[col for col, val in missing_col.items() if val >= 10 and 'cat' not in col], inplace=True)

# Convert heavily missing categorical columns to binary "value present" flags
# and fill the remaining missing values with the column median
for df in [data_train, data_test]:
    for col, val in missing_col.items():
        if val >= 10:
            if 'cat' not in col:
                continue  # non-categorical column, already dropped above
            # 1 = value present, 0 = value missing
            df.loc[df[col] != -1, col] = 1
            df[col] = df[col].replace(-1, 0)
            df.rename(columns={col: col.replace('cat', 'bin')}, inplace=True)
            print('{}\t-> {}'.format(col, col.replace('cat', 'bin')))
            continue
        median = df[df[col] != -1][col].median()
        df[col] = df[col].replace(-1, median)

ps_car_03_cat	-> ps_car_03_bin
ps_car_05_cat	-> ps_car_05_bin
ps_car_03_cat	-> ps_car_03_bin
ps_car_05_cat	-> ps_car_05_bin

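The cell above computes the median separately for the training and the test frame. A common variation, sketched here on made-up data, is to compute the median on the training set only and reuse it for the test set. `impute_with_train_median` is a hypothetical helper, and the numbers are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"ps_car_14": [0.2, -1.0, 0.4, 0.6]})
test = pd.DataFrame({"ps_car_14": [-1.0, 0.5]})


def impute_with_train_median(train_df, test_df, columns, missing=-1):
    """Replace the missing marker in both frames with the training median."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        # Median over the observed (non-missing) training values only
        median = train_out.loc[train_out[col] != missing, col].median()
        train_out[col] = train_out[col].replace(missing, median)
        test_out[col] = test_out[col].replace(missing, median)
    return train_out, test_out


train_imp, test_imp = impute_with_train_median(train, test, ["ps_car_14"])
print(train_imp["ps_car_14"].tolist(), test_imp["ps_car_14"].tolist())
```

Reusing the training statistic keeps the preprocessing of both sets identical, which matters once a model has been fitted to the imputed training data.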

In [182]:
## not used in 0.52

# drop features with '_calc_' in feature names
feature_calc = list(data_train.columns[data_train.columns.str.contains('_calc_')])
print(feature_calc)

data_train = data_train.drop(feature_calc, axis = 1)
data_test = data_test.drop(feature_calc, axis = 1)

['ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11', 'ps_calc_12', 'ps_calc_13', 'ps_calc_14', 'ps_calc_15_bin', 'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']


### PCA

In [199]:
## Select target and features
fea_col = data_train.columns[2:]
data_Y = data_train['target']
data_X = data_train[fea_col]

percent_data = 0.999

## Find the smallest number of components that explains over 99.9% of the variance
for i in range(1, data_X.shape[1]):
    pca = PCA(i, svd_solver='full')
    pca.fit(data_X)
    print(pca.explained_variance_ratio_.sum())
    if pca.explained_variance_ratio_.sum() > percent_data:
        break

## Transform data
print(data_X.shape[1])
data_X = pca.transform(data_X)
print(data_X.shape[1])

0.9091674540493707
0.9346317670937678
0.9456814518181943
0.9527356451329673
0.9590318073817083
0.9651454433117876
0.9704684099774006
0.9750092156046967
0.9786123544960751
0.9815372140898345
0.9839306292500312
0.9857103493575371
0.9873797389018276
0.9888929768515148
0.9903795713294008
0.9916775939394392
0.9928864841620509
0.9939627521893217
0.9950044174600585
0.9957494158101426
0.9963452126348805
0.9967859612629872
0.9971785534306211
0.9974400778524953
0.9976603968153036
0.9978666045810682
0.9980616565224008
0.9982512867590098
0.998427857827909
0.9985989137900184
0.9987571553819683
0.9988907972569371
0.999018243536765
54
33

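scikit-learn can also pick the number of components directly: passing a float between 0 and 1 as `n_components` together with `svd_solver='full'` keeps just enough components to exceed that share of explained variance. The sketch below uses synthetic data (not the project dataset) and also shows the fitted projection being reused for a second, held-out matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 10))
X_train = rng.normal(size=(500, 3)) @ mixing + 0.01 * rng.normal(size=(500, 10))
X_test = rng.normal(size=(100, 3)) @ mixing + 0.01 * rng.normal(size=(100, 10))

# A float n_components selects the number of components by explained variance
pca = PCA(n_components=0.999, svd_solver="full")
X_train_reduced = pca.fit_transform(X_train)
# Reuse the projection fitted on the training data for the held-out data
X_test_reduced = pca.transform(X_test)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

Whatever projection is fitted on the training features has to be applied to the test features as well before they are passed to a model trained on the reduced data.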

### Normalizing data

In [200]:
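from sklearn import preprocessing
# TODO: standardize the features instead, e.g.
# scaler = StandardScaler()
# scaler.fit_transform(data_train.drop(['target'], axis=1))
# Note: normalize() rescales each row to unit L2 norm and here also includes
# the `id` and `target` columns; the result is not used by the cells below.
data_train = preprocessing.normalize(data_train)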

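One way the TODO above could be completed is sketched below with `StandardScaler` on synthetic feature matrices. The names `X_train` and `X_val` are placeholders, and this is an assumption about the intended preprocessing, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices standing in for the training and validation features
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
X_val = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

scaler = StandardScaler()
# Learn mean and standard deviation per column on the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same shift and scale to the validation data
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```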
### One Hot Encoding

In [201]:
# TODO

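A possible way to fill in this step is sketched below with `pandas.get_dummies` on made-up frames. It assumes that categorical columns are identified by the `_cat` postfix and that the test frame should end up with exactly the columns seen during training.

```python
import pandas as pd

# Toy frames; the test frame lacks category 2 of ps_car_01_cat
train = pd.DataFrame({"ps_car_01_cat": [0, 1, 2], "ps_car_11": [3, 1, 2]})
test = pd.DataFrame({"ps_car_01_cat": [1, 1, 0], "ps_car_11": [0, 2, 3]})

cat_cols = [c for c in train.columns if c.endswith("_cat")]

# One indicator column per category value
train_enc = pd.get_dummies(train, columns=cat_cols, dtype=int)
test_enc = pd.get_dummies(test, columns=cat_cols, dtype=int)

# Align the test columns with the training columns; unseen categories become 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
```

The `reindex` call matters because a category that is absent from one frame would otherwise produce a different number of input features.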

### Oversampling with SMOTE

In [202]:
x_train, x_val, y_train, y_val = train_test_split(data_X, data_Y, test_size = 0.3, shuffle = True)

# try using class weight instead
#_, counts = np.unique(y_train, return_counts=True)
#weights =  counts[0] / counts
          
#print(weights)

smote = SMOTE(sampling_strategy='minority')
x_train, y_train = smote.fit_resample(x_train, y_train)


# from imblearn.over_sampling import RandomOverSampler
# # Random Oversampling
# over = RandomOverSampler(sampling_strategy=1)
# # fit and apply the transform
# x_train, y_train = over.fit_resample(x_train, y_train)


x_train, y_train = np.array(x_train), np.array(y_train)

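The commented-out class-weight idea in the cell above can be sketched with scikit-learn's `compute_class_weight`. The labels here are made up (90 negatives, 10 positives). Using the resulting ratio as the `pos_weight` argument of `nn.BCEWithLogitsLoss` is one possible alternative to oversampling, not something the notebook actually does.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# 'balanced' weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# Relative weight of the positive class (equals the negative/positive count ratio)
pos_weight = weights[1] / weights[0]
print(class_weight, pos_weight)
```

With class weights the training set keeps its original size, whereas SMOTE adds synthetic minority samples. In both cases the validation split should keep the original class distribution, as it does in the cell above.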

### Neural Network

In [203]:
def train_neural_network_pytorch(net, inputs, labels, optimizer, criterion, iterations=1000):
    """
    :param net: the neural network object
    :param inputs: numpy array of training data values
    :param labels: numpy array of training data labels 
    :param optimizer: PyTorch optimizer instance
    :param criterion: PyTorch loss function
    :param iterations: number of training steps
    """
    net.train()  # Before training, set the network to training mode

    # Convert to tensors if the data is given as numpy arrays
    if not torch.is_tensor(inputs):
        inputs = torch.from_numpy(inputs.astype(np.float32))

    if not torch.is_tensor(labels):
        labels = torch.from_numpy(labels.astype(np.float32))

    for _ in trange(iterations):  # each iteration is one full-batch gradient step

        # 1. Reset gradients
        optimizer.zero_grad()  
        # 2. Forward
        outputs = net(inputs)
        # 3. Compute the loss
        loss = criterion(outputs.reshape(-1), labels)
        # 4. Backward
        loss.backward()
        # 5. Update parameters
        optimizer.step()
        
    print('Finished Training')

In [204]:
def predict_pytorch(net, X, threshold=0.5):
    """
    Function for producing network predictions (boolean tensor of shape [m, 1])
    """

    net.eval()  # disable dropout for inference

    # Computes probabilities using forward propagation and classifies to 0/1 using `threshold`.
    X = torch.from_numpy(X.astype(np.float32))
    with torch.no_grad():
        logits = net(X)
    predictions = torch.sigmoid(logits) > threshold

    return predictions

In [205]:
from sklearn.metrics import f1_score

@contract(Y_pred='array[Mx1],M>0',
          Y='array[Mx1],M>0',
          returns='float,>=0.0,<=1.0')
def calc_accuracy(Y_pred, Y):
    """
    Calculates the macro-averaged F1 score of the predictions against the true labels.
    (Despite the function name, this is not plain accuracy; the accuracy formula is
    kept below as a comment for reference.)
    
    param: Y_pred: Predictions of our model (numpy array of shape [m,1] containing 0s and 1s)
    param: Y: Target labels (numpy array of shape [m,1])
    
    returns: macro F1 score (float between 0.0 and 1.0)
    """
    
    #accuracy = float(np.dot(Y.T,Y_pred) + np.dot((1-Y).T,1-Y_pred))/float(Y.size)
    
    return f1_score(Y, Y_pred, average = 'macro')

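The function above returns the macro F1 score instead of plain accuracy. The toy example below (made-up labels) illustrates why that can matter on imbalanced data: a classifier that always predicts `0` reaches high accuracy but a macro F1 score below 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# A trivial classifier that never predicts a claim
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Macro F1 averages the per-class F1 scores, so the missed class counts fully
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")
```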
In [206]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
from torchsummary import summary
torch.manual_seed(1234)

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout):
        super(Net, self).__init__()
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.output_size = output_size
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)

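    def forward(self, x):
        # Flatten the input x keeping the batch dimension the same
        x = x.reshape(-1, self.input_size)
        x = self.dropout(x)  # dropout is applied to the input features
        x = F.relu(self.fc1(x))
        # Note: fc2 is defined in __init__ but is not used in this forward pass
        x = self.fc3(x)

        return x  # Return x (logits)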

In [207]:
# Define hyperparameters
LEARNING_RATE = 0.001
MOMENTUM = 0.9
MAX_ITERATIONS = 100
INPUT_SIZE = x_train.shape[1]
HIDDEN_SIZE = 12 # empirical rule ~ mean of the neurons in the input and output layers
OUTPUT_SIZE = 1
DROPOUT = 0.5

In [208]:
net = Net(INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE, DROPOUT)

# Define the loss criterion and the training algorithm
criterion = nn.BCEWithLogitsLoss()  # Be careful, use binary cross entropy for binary, CrossEntropy for Multi-class
# optimizer = optim.SGD(net.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM)
optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

In [210]:
from numpy import argmax

# Test different hidden sizes and dropout rates
net_list = []
thresholds = np.arange(0, 1, 0.01)

for i in [16,17,18,19]:
    HIDDEN_SIZE = i
    for j in [0.5,0.6]:  
        DROPOUT = j
        net = Net(INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE, DROPOUT)
        #criterion = nn.BCEWithLogitsLoss()  # Be careful, use binary cross entropy for binary, CrossEntropy for Multi-class
        #optimizer = optim.SGD(net.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM)
        optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)
        train_neural_network_pytorch(net, x_train, y_train, optimizer, criterion, MAX_ITERATIONS)
        
        # Tune the decision threshold on the validation split (reported as "Test" below)
        scores = [f1_score(np.array(y_val).reshape(-1,1), predict_pytorch(net, np.array(x_val), threshold=t).data.numpy(), average = 'macro') for t in thresholds]
        ix = argmax(scores)
        
        train_macro_f = f1_score(np.array(y_train).reshape(-1,1), predict_pytorch(net, np.array(x_train), threshold=thresholds[ix]).data.numpy(), average = 'macro')
        test_macro_f = scores[ix]
        
        net_list.append((net,test_macro_f,thresholds[ix]))
        print(f"Train F1 score: {train_macro_f:.5f}, Test F1 score: {test_macro_f:.5f}, threshold: {thresholds[ix]:.4f}")
        print("---------------------------------------")

Finished Training
Train F1 score: 0.40398, Test F1 score: 0.52120, threshold: 0.5800
---------------------------------------

Finished Training
Train F1 score: 0.41243, Test F1 score: 0.51776, threshold: 0.5300
---------------------------------------

Finished Training
Train F1 score: 0.42709, Test F1 score: 0.51903, threshold: 0.5600
---------------------------------------

Finished Training
Train F1 score: 0.40077, Test F1 score: 0.52112, threshold: 0.5800
---------------------------------------

Finished Training
Train F1 score: 0.40970, Test F1 score: 0.51770, threshold: 0.5900
---------------------------------------

Finished Training
Train F1 score: 0.40599, Test F1 score: 0.52064, threshold: 0.5600
---------------------------------------

Finished Training
Train F1 score: 0.41439, Test F1 score: 0.52065, threshold: 0.5600
---------------------------------------

Finished Training
Train F1 score: 0.40686, Test F1 score: 0.51708, threshold: 0.5900
---------------------------------------

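The threshold search used in the loop above can be isolated into a small function. The sketch below runs it on synthetic probabilities (not model output). `best_threshold` is a hypothetical helper, and the grid matches the `np.arange(0, 1, 0.01)` grid used above.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_threshold(y_true, probs, grid=None):
    """Return the threshold on the grid that maximizes macro F1, and that score."""
    if grid is None:
        grid = np.arange(0.0, 1.0, 0.01)
    scores = [
        f1_score(y_true, (probs > t).astype(int), average="macro", zero_division=0)
        for t in grid
    ]
    ix = int(np.argmax(scores))
    return float(grid[ix]), float(scores[ix])


# Synthetic validation labels (about 10% positives) and noisy scores
rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.1).astype(int)
probs = np.clip(0.3 + 0.25 * y_val + rng.normal(0.0, 0.1, size=2000), 0.0, 1.0)

threshold, score = best_threshold(y_val, probs)
print(f"threshold={threshold:.2f}, macro F1={score:.5f}")
```

Because the threshold is selected on the validation split, the validation score at that threshold is an optimistic estimate. A separate held-out set or the leaderboard gives a less biased number.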

In [212]:
import operator
# Take the network with the best validation macro-F1 score, together with its tuned threshold
best = max(net_list, key=operator.itemgetter(1))
net, threshold = best[0], best[2]
print(threshold)

train_acc = calc_accuracy(predict_pytorch(net, np.array(x_train), threshold=threshold).data.numpy(), np.array(y_train).reshape(-1,1))
test_acc = calc_accuracy(predict_pytorch(net, np.array(x_val), threshold=threshold).data.numpy(), np.array(y_val).reshape(-1,1))
print(f"Train accuracy: {train_acc:.5f}, Test accuracy: {test_acc:.5f}")

0.58
Train accuracy: 0.40398, Test accuracy: 0.52120


### Submission

Please submit only the CSV file containing the `id` and predicted `target` columns [here](https://www.kaggle.com/t/b3dc81e90d32436d93d2b509c98d0d71). The `target` column should contain only `0` and `1`.

In [108]:
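# Apply the same feature selection and PCA projection that were used for training
data_test_X = pca.transform(data_test[fea_col])
# Flatten the [m, 1] boolean predictions into a 1-D array of 0/1 integers
y_target = predict_pytorch(net, data_test_X, threshold=threshold).numpy().ravel().astype(int)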

In [109]:
data_out = pd.DataFrame(data_test['id'].copy())
data_out.insert(1, "target", y_target, True) 
data_out.to_csv('./data/submission.csv',index=False)

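The submission format described above can be checked before uploading. The sketch below builds a tiny submission from made-up predictions. `make_submission` is a hypothetical helper, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd


def make_submission(ids, predictions):
    """Build an id/target frame and verify that target only holds 0 and 1."""
    target = np.asarray(predictions).astype(int).ravel()
    if len(target) != len(ids):
        raise ValueError("ids and predictions must have the same length")
    if not np.isin(target, [0, 1]).all():
        raise ValueError("target must contain only 0 and 1")
    return pd.DataFrame({"id": np.asarray(ids), "target": target})


# Made-up boolean predictions of shape [m, 1], as returned by a thresholded model
submission = make_submission([100000, 100001, 100002], [[False], [True], [False]])

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra index column
csv_text = buffer.getvalue()
print(csv_text)
```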
In [110]:
data_out

| | id | target |
|---|---|---|
| 0 | 100000 | 0 |
| 1 | 100001 | 0 |
| 2 | 100002 | 0 |
| 3 | 100003 | 0 |
| 4 | 100004 | 0 |
| ... | ... | ... |
| 148795 | 248795 | 0 |
| 148796 | 248796 | 0 |
| 148797 | 248797 | 0 |
| 148798 | 248798 | 0 |


In [111]:
sum(data_out['target']==1)

6894