## Machine Learning Insurance Challenge ##

Welcome to the Insurance Challenge! The data for this challenge can be found here, on Kaggle: https://www.kaggle.com/c/prudential-life-insurance-assessment. See this link for data download and a description of each variable.

In this challenge, you will use a large number of variables in a tabular data set to categorize the risk level of a person buying insurance. For newcomers, we will walk through the process of building a machine learning model. For more experienced members, we will discuss grid search hyperparameter optimization and gradient boosting models. Let's start by importing the Python packages we will need and reading in the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("train.csv")
df.head()

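| | Id | Product_Info_1 | Product_Info_2 | Product_Info_3 | Product_Info_4 | Product_Info_5 | Product_Info_6 | Product_Info_7 | Ins_Age | Ht | ... | Medical_Keyword_40 | Medical_Keyword_41 | Medical_Keyword_42 | Medical_Keyword_43 | Medical_Keyword_44 | Medical_Keyword_45 | Medical_Keyword_46 | Medical_Keyword_47 | Medical_Keyword_48 | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | D3 | 10 | 0.076923 | 2 | 1 | 1 | 0.641791 | 0.581818 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| 1 | 5 | 1 | A1 | 26 | 0.076923 | 2 | 3 | 1 | 0.059701 | 0.6 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| 2 | 6 | 1 | E1 | 26 | 0.076923 | 2 | 3 | 1 | 0.029851 | 0.745455 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| 3 | 7 | 1 | D4 | 10 | 0.487179 | 2 | 3 | 1 | 0.164179 | 0.672727 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |
| 4 | 8 | 1 | D2 | 26 | 0.230769 | 2 | 3 | 1 | 0.41791 | 0.654545 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 |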


In [3]:
#Note that this data needs some cleaning since some variables have many NA values or cannot be used.

#Start by finding all columns with NA
test = df.isna().sum()
drop = pd.Series(test[test != 0].index)

#Pick out the variables that you want to drop - 
# these include the columns that contain NA values, as well as columns that cannot be used as-is, such as "Id" (an identifier), "Product_Info_2" (a text code), or "Response" (the variable you are predicting)

X = df.drop(pd.concat([drop, pd.Series(["Product_Info_2","Id","Response"])]), axis = 1)

#Set the response that you are predicting as its own variable.
y = df.Response

#Feel free to run a cell with just "X" or "y" in it to see what these look like
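
To see what the cleaning step does, here is a minimal sketch of the same idea on a tiny made-up table. The frame `toy` and the column `Some_Feature` are hypothetical stand-ins, not part of the Kaggle data; only the pattern (count NA values per column, then drop those columns along with the identifier and the target) mirrors the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the real training data
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Ht": [0.58, 0.60, 0.75],
    "Some_Feature": [0.03, np.nan, 0.05],  # contains one missing value
    "Response": [8, 4, 8],
})

# Count missing values per column and keep the names of columns that have any
na_counts = toy.isna().sum()
na_columns = na_counts[na_counts != 0].index.tolist()

# Drop the NA columns plus the identifier and the target
X_small = toy.drop(columns=na_columns + ["Id", "Response"])
y_small = toy["Response"]

print(na_columns)             # columns that contain NA
print(list(X_small.columns))  # columns that remain as features
```

Dropping every column with a missing value is the simplest option and matches the approach used here; it does discard whatever information those columns carried.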

### Train-Test Split ###

To evaluate a machine learning model, you must split the data into a training set and a testing set. By doing this, you ensure that the model is evaluated on data that it has *not* yet seen. This is important because, in the real world, the model will be making predictions for cases where you don't know the actual label.

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8)
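
The split above is drawn at random on every run, so scores will vary slightly between runs. As a sketch, assuming you want a reproducible split that also keeps the class proportions similar in both parts, `train_test_split` accepts `random_state` and `stratify`. The toy arrays below are made up for illustration and do not reflect the real class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 1000 rows, 5 features, imbalanced labels 1..8
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = rng.choice(
    np.arange(1, 9), size=1000,
    p=[0.05, 0.05, 0.05, 0.05, 0.10, 0.20, 0.15, 0.35],
)

# random_state fixes the shuffle; stratify keeps label proportions alike
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

# Share of each class (1..8) in the two parts
train_share = np.bincount(y_tr, minlength=9)[1:] / len(y_tr)
test_share = np.bincount(y_te, minlength=9)[1:] / len(y_te)
print(np.round(train_share, 3))
print(np.round(test_share, 3))
```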

## Challenge #1: Basic Machine Learning ##

Our goal is to classify each observation into a risk level, which is defined by the "Response" variable. This is a supervised ML problem, because we are teaching the algorithm using pre-labeled training data. Of the two types of supervised ML problems, classification and regression, this is considered "classification" because we are predicting one of eight discrete outcomes (a regression problem would predict a continuous outcome).

Below are some commonly-used classification algorithms. Google them or follow the links to read the documentation and see what they are all about, and try picking two to compare. Don't forget to check out the examples!

* [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
* [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* [SVC (Support Vector Classifier)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

As an example, I will demonstrate a different classifier: [K-nearest-neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). In this algorithm, a point is predicted by looking at which class is the most frequent class in the "neighborhood" of the observation, or which class is most frequent among the "k-nearest" neighbors of the observation being predicted.
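
To make the "k-nearest" vote concrete, here is a minimal from-scratch sketch on made-up toy points. It is not how scikit-learn implements `KNeighborsClassifier` internally; the helper `knn_predict` and the toy arrays are hypothetical, and Euclidean distance is an assumption chosen for simplicity.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters labeled 1 and 2
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

pred_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_a, pred_b)
```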

In [7]:
from sklearn.neighbors import KNeighborsClassifier

#First, we define the model and the parameters we want to use
model = KNeighborsClassifier(n_neighbors = 10)

#Then, we fit to the training data using the "fit" function with the training data, 
# and obtain the accuracy via the "score" function with the testing data. 
model.fit(X_train, y_train)
model.score(X_test, y_test)
#While the algorithm is only about ~35% accurate, this is based on a multi-class model of 8 classes - 
# so, if we guessed randomly, we would only expect to see about 12.5% accuracy. 

0.3516881367348657
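
The 12.5% figure assumes guessing uniformly at random among the eight classes. When some classes are more common than others, always predicting the most frequent class can score higher, so it may be worth checking both baselines before judging a model. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the class proportions are invented and are not those of the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels 1..8; the features are irrelevant to a dummy model
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))
y_toy = rng.choice(
    np.arange(1, 9), size=2000,
    p=[0.05, 0.05, 0.05, 0.05, 0.10, 0.20, 0.15, 0.35],
)

# Baseline 1: pick one of the eight classes uniformly at random
uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X_toy, y_toy)
# Baseline 2: always predict the most common class
majority = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

uniform_acc = uniform.score(X_toy, y_toy)
majority_acc = majority.score(X_toy, y_toy)
print(uniform_acc, majority_acc)
```

On the real data, fitting the same two baselines on `X_train, y_train` and scoring them on `X_test, y_test` gives reference points for the accuracy printed above.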

Many models have "hyperparameters," which are settings of the model that you choose before training rather than values learned from the training data. For instance, KNeighborsClassifier has "n_neighbors", which tells the algorithm how many neighbors to consider. When training a machine learning model, it is a good idea to try different hyperparameters to see which ones work best.

In [8]:
#Let's try increasing the number of neighbors
model = KNeighborsClassifier(n_neighbors = 20)
model.fit(X_train, y_train)
model.score(X_test, y_test)
#Looks like a small improvement!

0.36120232381914624

Now it's your turn! Similar to the code above, try using different algorithms to train your model and get the highest possible accuracy. Also, try using "cross-validation" to make sure your model works under different training data, and try evaluating the model using different metrics.
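
As a starting point for the cross-validation and metrics suggestions, here is a minimal sketch using scikit-learn's `cross_validate` with two scorers. The data comes from `make_classification` and is purely synthetic; the choice of 5 folds and of macro-averaged F1 as the second metric are assumptions, not requirements of the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for X and y
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

model = KNeighborsClassifier(n_neighbors=10)

# Train and score the model on 5 different train/validation splits
scores = cross_validate(
    model, X_toy, y_toy, cv=5, scoring=["accuracy", "f1_macro"]
)

mean_accuracy = scores["test_accuracy"].mean()
mean_f1 = scores["test_f1_macro"].mean()
print(mean_accuracy, mean_f1)
```

A large spread between the per-fold scores suggests that the result depends heavily on which rows ended up in the training data.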

## Challenge #2: Hyperparameter Optimization ##

Let's try a more systematic approach to this problem. You may be asking, "with so many possible hyperparameters to pick, how can I find the best one?" The scikit-learn user guide on tuning hyperparameters covers this question in detail: https://scikit-learn.org/stable/modules/grid_search.html.

One of the most common methods for choosing hyperparameters is a grid search. In a grid search, every value (or combination of values) in a predefined grid is evaluated with cross-validation, and the best-scoring one is kept. An example using K-nearest-neighbors is shown below.

In [12]:
from sklearn.model_selection import GridSearchCV

params = {"n_neighbors": [1, 10, 100, 1000]}

model = KNeighborsClassifier()

grid = GridSearchCV(model, params)

#Only need X and y, since the model does cross-validation for us and splits into train and test automatically
grid.fit(X, y)


GridSearchCV(estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 10, 100, 1000]})

In [16]:
#Here we see the results for each point on the grid. Around 100 yielded the highest score
grid.cv_results_['mean_test_score']

#We can also do another grid search with more refined numbers, say with [80,90, 100, 110, 120] 
# to test more precisely once we have a good ballpark estimate

array([0.29344405, 0.35235176, 0.3559556 , 0.33569659])

Now that we've tested GridSearchCV, can you try another searching algorithm for refining your model's hyperparameters?
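
One possible answer is a randomized search, which samples a fixed number of candidates from a distribution instead of testing every grid point. The sketch below uses scikit-learn's `RandomizedSearchCV` on synthetic data; the range 1 to 100 for `n_neighbors`, the 10 iterations, and the 3 folds are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for X and y
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3, random_state=0
)

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    # Sample integer values of n_neighbors from 1 to 100 inclusive
    param_distributions={"n_neighbors": randint(1, 101)},
    n_iter=10,        # number of sampled candidates
    cv=3,             # 3-fold cross-validation per candidate
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X_toy, y_toy)

best_k = search.best_params_["n_neighbors"]
best_score = search.best_score_
print(best_k, best_score)
```

With a single hyperparameter the benefit is small, but the number of grid points grows quickly as more hyperparameters are added, while `n_iter` keeps the cost of a randomized search fixed.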

## Challenge #3: Gradient Boosting Models ##

Gradient-boosting algorithms are among the strongest performers on tabular data in data science competitions. Gradient boosting uses a loss function, which specifies the penalty for errors, to add weak predictors (linear models or, more commonly, tree-based models) to the final model one at a time, with each new predictor fit to reduce the errors of the ensemble built so far. This resembles a Random Forest in that both combine many decision trees, but a Random Forest trains its trees independently and averages them, whereas boosting builds them sequentially. See [here](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/) for more information.
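
To illustrate the sequential idea, here is a minimal from-scratch sketch of gradient boosting for a regression problem with squared-error loss, where the negative gradient is simply the residual. It is a simplification for intuition only: it is not XGBoost's implementation, it uses regression rather than the multi-class setup of this challenge, and the toy data, learning rate, tree depth, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression data: a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimizes squared error)
prediction = np.full_like(y_toy, y_toy.mean())
initial_mse = np.mean((y_toy - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For squared-error loss, the negative gradient equals the residual
    residual = y_toy - prediction
    # Fit a small (weak) tree to the current errors
    tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, residual)
    # Add a damped version of the tree's output to the ensemble prediction
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)

final_mse = np.mean((y_toy - prediction) ** 2)
print(initial_mse, final_mse)
```

Each tree on its own is a poor model, but because every round corrects part of what the previous rounds got wrong, the training error shrinks as trees accumulate.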

The final challenge is to train a gradient-boosting model for predicting insurance risk. XGBoost is one of the most widely used gradient-boosting libraries for tabular data. It is a separate package from sklearn, so please see the documentation for installation instructions and how to get started: https://xgboost.readthedocs.io/en/latest/get_started.html. You can also view the specific Python documentation [here](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface). *Note that XGBoost may not come with your Anaconda installation; if `import xgboost` fails, install the package, for example using pip*. I would highly recommend reading through the many parameters to get a sense for what it requires, and running some grid-search cross-validation to optimize the hyperparameters. Fair warning, though: gradient boosters can take noticeably longer to train than the simpler models above, especially inside a grid search!

For the more scientifically-inclined, feel free to check out https://arxiv.org/abs/1603.02754. This is the published paper describing XGBoost.

In [47]:
import xgboost as xgb

#Read in data for format in XGBoost
#Note we need to subtract 1 since classes in XGBoost start at 0
dtrain = xgb.DMatrix(X_train, label = y_train-1)
dtest = xgb.DMatrix(X_test, label = y_test-1)

#set up parameters by reading the documentation to figure out which ones you want to include
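#We use the softmax function as the objective, since we are doing multiclass classification
#Note that I enable a GPU here since I have one; if you don't, remove the "gpu_id" and "tree_method" parameters.
#(Newer XGBoost releases, 2.0 and later, deprecate these two in favor of 'device':'cuda' with 'tree_method':'hist'.)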
param = {'max_depth':5, 
         'eta':1, 
         'num_class':8, 
         'gpu_id':0,
         'tree_method':'gpu_hist',
         'objective':'multi:softmax' }
num_round = 2

In [48]:
# Train Model and make prediction.
model = xgb.train(param, dtrain, num_round)
preds = model.predict(dtest)



In [49]:
#Calculate the confusion matrix and the accuracy
from sklearn.metrics import confusion_matrix

conf = confusion_matrix(y_test, preds+1)
np.diag(conf).sum() / conf.sum()

0.5228593079060369

Now try experimenting with different parameters (listed [here](https://xgboost.readthedocs.io/en/latest/parameter.html)) to make the model as accurate as possible!