# Risk assessment for an insurance company for vehicle loans.

Problem Statement: An insurance company wants to assess the risk of its existing customers.

Data Collection: We obtained the data from a trusted source; it had already been cleaned and prepared before this analysis.

### Loading the data.

In [None]:
#importing packages
import numpy as np  #NumPy provides n-dimensional arrays and a large set of numeric datatypes.
import pandas as pd #pandas provides the DataFrame, a labelled table for data manipulation built on NumPy arrays.

In [None]:
risk_data = pd.read_excel(r'C:\Users\ghreddy\Desktop\dxc\CustomerRiskProfile_Dataset_IIDT.xlsx') #load dataset

In [None]:
risk_data.shape #checking the size of the dataset

In [None]:
risk_data.head() #to check the attributes in the dataset.

### Exploratory Data Analysis.

Exploratory Data Analysis is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

In [None]:
risk_data.dtypes #checking the datatypes of each attribute.

In [None]:
risk_data.isnull().sum() #checking for null values in the data. If we find any missing values, we have to impute them.
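
If the null check does report gaps, one simple way to impute them is sketched here. This is only an illustration on a toy frame: the values are made up, and median/mode filling is an assumed strategy, not something this notebook applies.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for risk_data; the values are illustrative only.
toy = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 31.0],
    "Gender": ["M", "F", None, "F"],
})

# Numeric columns: fill gaps with the column median.
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Categorical columns: fill gaps with the most frequent value (mode).
cat_cols = toy.select_dtypes(exclude="number").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

# Total number of missing values left after imputation.
missing_after = int(toy.isnull().sum().sum())
print(missing_after)
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for skewed numeric columns.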

In [None]:
#Checking the value counts of each categorical attribute.
for col in risk_data.select_dtypes(include='object').columns:
    print(risk_data[col].value_counts())


In [None]:
import matplotlib.pyplot as plt #Matplotlib is a plotting library for numerical mathematics & provides visualizations.

In [None]:
risk_data['Gender'].value_counts() #to find the unique value records in the attribute 'Gender'.

#### Bivariate analysis

Bivariate analysis examines the empirical relationship between two variables, here each independent variable against the dependent variable (Risk Profile).

In [None]:
pd.crosstab(risk_data['Risk Profile'],risk_data['Gender']).plot.bar() #gives us the bar plot for Gender contribution to risk profile.
pd.crosstab(risk_data['Risk Profile'],risk_data['Gender'],margins=True) #Gender contribution to risk profile.

Almost 63% of our customers fall in the medium risk category.
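
A share like this can be read directly from normalized counts rather than computed by hand from the margins. The sketch below uses a small made-up frame (its proportions are not those of the real dataset) to show both the overall share per risk class and the share within each gender.

```python
import pandas as pd

# Toy data; the proportions here are illustrative, not those of the real dataset.
toy = pd.DataFrame({
    "Risk Profile": ["Medium", "Medium", "High", "Low", "Medium", "High", "Medium", "Low"],
    "Gender": ["M", "F", "M", "F", "F", "M", "M", "F"],
})

# Share of customers in each risk class (values sum to 1).
risk_share = toy["Risk Profile"].value_counts(normalize=True)

# Share of each risk class within each gender (each column sums to 1).
share_by_gender = pd.crosstab(toy["Risk Profile"], toy["Gender"], normalize="columns")

print(risk_share)
print(share_by_gender)
```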

In [None]:
#Driving frequency contribution to Risk profile.
pd.crosstab(risk_data['Risk Profile'],risk_data['Driving Frequency'],margins=True)

The largest group consists of medium-risk customers who drive occasionally.

In [None]:
#Driving purpose contribution to Risk profile.
pd.crosstab(risk_data['Risk Profile'],risk_data['Driving Purpose'],margins=True)

Interestingly, customers who took loans for racing are the least represented in the low-risk group, and among high-risk customers racing is the most common purpose.

In [None]:
#Driving experience contribution to Risk profile.
pd.crosstab(risk_data['Risk Profile'],risk_data['Driving Experience'],margins=True)

Customers with less driving experience tend to be classified as high risk.

In [None]:
#Age of vehicle contribution to Risk profile.
pd.crosstab(risk_data['Risk Profile'],risk_data['Age of Vehicle'],margins=True)

### Feature Engineering.

Here we turn the raw columns into valid model inputs: categorical variables are encoded as integers and numeric columns are standardized.


Encoding the categorical variables for fitting the model.
Categorical variables can be encoded with tools such as OneHotEncoder or LabelEncoder, but here we mostly map them manually.
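
For comparison, here is a minimal sketch of two encoding routes on a toy frame. The first is the explicit mapping used in this notebook, written with `.map`. The second is one-hot encoding via `pd.get_dummies`, shown as an alternative that this notebook does not use. The toy rows are made up; the category names are taken from the mappings in this section.

```python
import pandas as pd

# Toy rows; category names follow the mappings used in this notebook.
toy = pd.DataFrame({
    "Driving Time": ["Day Time", "Night Time", "Any Time", "Day Time"],
    "Road Type": ["Urban", "Rural", "Highway", "Urban"],
})

# Explicit mapping. Unlike .replace, .map turns unmapped values into NaN,
# so a misspelled category shows up in a later null check.
time_codes = {"Day Time": 0, "Any Time": 1, "Night Time": 2}
toy["Driving Time"] = toy["Driving Time"].map(time_codes)

# One-hot encoding for a nominal column, so no artificial order is implied.
encoded = pd.get_dummies(toy, columns=["Road Type"], dtype=int)
print(encoded)
```

Integer codes imply an order (0 < 1 < 2). Tree-based models tolerate that reasonably well, while one-hot columns are usually safer for models that treat the codes as magnitudes.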

In [None]:
import seaborn as sns #Seaborn is a Python data visualization library based on matplotlib.
#It provides a high-level interface for drawing attractive and informative statistical graphics.
#distplot is deprecated in recent seaborn versions; histplot with kde=True draws the equivalent plot.
sns.histplot(risk_data['Age'], bins=20, kde=True) #Checking the distribution of age.

In [None]:
#The StandardScaler subtracts the column mean and divides by the column standard deviation,
#so the scaled values have mean 0 and standard deviation 1. It does not change the shape of the distribution.
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler()
risk_data['Age']=sc.fit_transform(risk_data[['Age']]) #fitting the standard scaler on Age column.
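
The cell above fits the scaler on the full dataset, before any train/test split. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows, so that test statistics do not influence preprocessing. The sketch below shows this on randomly generated toy data; the toy `Age` values and the split parameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated ages, standing in for the real column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Age": rng.integers(18, 70, size=100).astype(float)})

train, test = train_test_split(toy, test_size=0.3, random_state=1234)
train, test = train.copy(), test.copy()

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
train["Age"] = scaler.fit_transform(train[["Age"]]).ravel()
# ...and apply the same transformation to the test rows.
test["Age"] = scaler.transform(test[["Age"]]).ravel()

print(train["Age"].mean(), train["Age"].std(ddof=0))
```

StandardScaler uses the population standard deviation, which is why the check above passes `ddof=0`.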

In [None]:
risk_data['Age'].head() #Checking for the scaled values of Age column

In [None]:
sns.histplot(risk_data['Age'], bins=20, kde=True) #Checking the distribution of age after scaling.

In [None]:
risk_data['Gender']=risk_data['Gender'].replace({'M':0,'F':1}) #Manually replacing the categories in the gender column.

In [None]:
#Manually replacing the categories in the education column.
risk_data['Education'] = risk_data['Education'].replace({'Primary School':0,'College':1,'Secondary School':2,'None':3,'Bachelors':4})

In [None]:
#Manually replacing the categories in the vehicle condition column.
risk_data['Vehicle Condition']=risk_data['Vehicle Condition'].replace({'Ex-demonstrator':0,'Classic/Vintage':1,'New':2,'Used':3})

In [None]:
sns.histplot(risk_data['Market Value'], bins=20, kde=True) #Checking the distribution of market value.

In [None]:
#Transforming the market value column as it has large numeric values.
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler()
#fitting the standard scaler on Market value column to standardize its values.
risk_data['Market Value']=sc.fit_transform(risk_data[['Market Value']])

In [None]:
risk_data['Market Value'].head() #Checking for the scaled values of market value column

In [None]:
#Manually replacing the categories in the luxury category column.
risk_data['Luxury Category']=risk_data['Luxury Category'].replace({'Semi Luxury':0,'Luxury':1,'Compact':2,'Economy':3,'Intermediate':4,'Super Luxury':5,'Mini':6,'Executive':7})

In [None]:
#Manually replacing the categories in the driving time column.
risk_data['Driving Time']=risk_data['Driving Time'].replace({'Day Time':0,'Any Time':1,'Night Time':2})

In [None]:
#Manually replacing the categories in the driving frequency column.
risk_data['Driving Frequency']=risk_data['Driving Frequency'].replace({'Occasionaly':0,'Daily':1,'Weekly':2})

In [None]:
#Manually replacing the categories in the driving purpose column.
risk_data['Driving Purpose'] = risk_data['Driving Purpose'].replace({'Business':0,'Personal':1,'Racing':2})

In [None]:
#Manually replacing the categories in the risk profile column.
risk_data['Risk Profile']=risk_data['Risk Profile'].replace({"Medium":0,'High':1,'Low':2})

In [None]:
#Location Name has too many unique values to map manually, so we use LabelEncoder.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
risk_data['Location Name'] = le.fit_transform(risk_data['Location Name'])


In [None]:
risk_data['Location Name'].head() #checking the values in the location name column after encoding

In [None]:
#Manually replacing the categories in the road type column.
risk_data['Road Type'] = risk_data['Road Type'].replace({'Suburban':0,'Urban':1,'Highway':2,'Extra-urban High Density':3,'Extra-urban Low Density':4,'Rural':5})

In [None]:
#Manually replacing the categories in the topography column.
risk_data['Topography']=risk_data['Topography'].replace({'Plains':0,'Hills':1,'Rocky area':2,'Mountains':3,'Valleys':4})

In [None]:
#Computing the pairwise correlation between every pair of columns.
#The resulting matrix is plotted as a heatmap with seaborn.
corr = risk_data.corr()


In [None]:
#defining the size, shape and other parameters to plot the graph.
plt.figure(figsize=(10,10))
cmap = sns.diverging_palette(220,10,as_cmap=True) #220 and 10 are the anchor hues for the negative and positive ends of the palette
sns.heatmap(corr,xticklabels=corr.columns.values,yticklabels=corr.columns.values,cmap=cmap,vmax=.3,center=0,
            square=True,linewidths=0.5,cbar_kws={'shrink':.82})


### Model building and testing for accuracy.

#### Decision Tree Classifier

In [None]:
#importing packages.
#A decision tree classifier predicts the class by learning simple decision rules from the features.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

In [None]:
#dividing the data into two parts: the independent variables (features) and the dependent variable (target).

In [None]:
#independent variables (features)
x_train = risk_data.drop(['Risk Profile'],axis=1)

In [None]:
#dependent variable (target)
y_train = risk_data['Risk Profile']

In [None]:
x_train.shape, y_train.shape #check the shape of training set

In [None]:
#checking the header of the data after feature engineering.
risk_data.head()

In [None]:
#splitting the training dataset as train and test.
X_train,X_test,Y_train,Y_test = train_test_split(x_train,y_train,test_size=0.3,random_state=1234)

In [None]:
#checking for train data and test data.
X_train.shape,Y_train.shape,X_test.shape,Y_test.shape

In [None]:
#initiating the decision tree classifier.
tree_model = DecisionTreeClassifier()

In [None]:
#fitting the decision tree model
tree_model.fit(X_train,Y_train)
#class_weight :Weights associated with classes in the form {class_label: weight}.
#criterion :The function to measure the quality of a split.
#max_depth :The maximum depth of the tree.
#min_samples_split :The minimum number of samples required to split an internal node.
#min_samples_leaf :The minimum number of samples required to be at a leaf node.
#splitter :The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and 
#           “random” to choose the best random split.

In [None]:
#predicting on the test data with the fitted model.
predict_tree = tree_model.predict(X_test)

In [None]:
#checking the accuracy for decision tree model.
accuracy_score(Y_test,predict_tree)

In [None]:
#finding the Cohen's kappa score between the actual test labels and the predicted labels.
#Cohen's kappa coefficient (κ) measures the agreement between two raters (here the actual and predicted labels), corrected for the agreement expected by chance.
cohen_kappa_score(Y_test,predict_tree)
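
To make the statistic concrete, kappa can be computed by hand as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The sketch below does this on made-up labels and compares the result with `cohen_kappa_score`; the label vectors are illustrative only.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Made-up actual and predicted labels for three classes.
y_true = [0, 0, 1, 1, 2, 2, 0, 1]
y_pred = [0, 0, 1, 2, 2, 2, 0, 0]

cm = confusion_matrix(y_true, y_pred)
n = cm.sum()

# Observed agreement: fraction of samples on the diagonal (same as accuracy).
p_o = np.trace(cm) / n
# Chance agreement: sum over classes of (actual frequency * predicted frequency).
p_e = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2

kappa_manual = (p_o - p_e) / (1 - p_e)
kappa_sklearn = cohen_kappa_score(y_true, y_pred)
print(kappa_manual, kappa_sklearn)
```

Because p_e grows when one class dominates, kappa penalizes a model that scores well merely by predicting the majority class, which is relevant when one risk class holds most of the customers.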

#### Feature selection: we use feat_map to find the feature importances and eliminate the columns with the lowest shares.

In [None]:
#Checking the contribution of each variable in the decision tree model.
#Feature selection is usually used as a pre-processing step before doing the actual learning. 
feat_map = pd.Series(tree_model.feature_importances_,index=X_train.columns).sort_values(ascending=False)

In [None]:
feat_map

From here we conclude that driving frequency, vehicle condition, driving purpose, luxury category and gender contribute the least and can be dropped.
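
The exact cut-off used for this decision is not stated. As a sketch, the selection can be made explicit with a threshold on the importances. The example below uses synthetic data from `make_classification`, and the value 0.05 is a hypothetical cut-off chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the encoded risk_data features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(8)])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)

threshold = 0.05  # hypothetical cut-off, not taken from this notebook
low_importance = importances[importances < threshold].index.tolist()

# Keep only the columns whose importance reaches the threshold.
X_reduced = X.drop(columns=low_importance)
print(importances)
print("dropped:", low_importance)
```

Importances from a single unseeded tree can vary between runs, so fixing `random_state` makes the selected columns reproducible.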

### Checking for accuracy after dropping the low performance attributes.

In [None]:
#dropping the low performance attributes and assigning the dataframe to another variable for future use. 
risk_data_new = risk_data.drop(['Driving Frequency','Vehicle Condition','Driving Purpose','Luxury Category','Gender'],axis=1)

In [None]:
risk_data_new.head() #Checking for the data after dropping the low performance attributes.

In [None]:
#separating the independent variables (features) and the dependent variable (target).
x_train_new = risk_data_new.drop(['Risk Profile'],axis=1)
y_train_new = risk_data_new['Risk Profile']

In [None]:
X_train_new,X_test_new,Y_train_new,Y_test_new = train_test_split(x_train_new,y_train_new,test_size=0.3,random_state=1234)

In [None]:
tree_model_new = DecisionTreeClassifier()

In [None]:
tree_model_new.fit(X_train_new,Y_train_new)

In [None]:
predict_tree_new = tree_model_new.predict(X_test_new)

In [None]:
accuracy_score(Y_test_new,predict_tree_new)

In [None]:
#finding the Cohen's kappa score between the actual test labels and the predicted labels.
cohen_kappa_score(Y_test_new,predict_tree_new)

### Let's try to find the accuracy after dropping the location column.

In [None]:
x_train_new1 = risk_data.drop(['Risk Profile','Location Name'],axis=1) #defining new train

In [None]:
y_train_new1 = risk_data['Risk Profile']

In [None]:
X_train1,X_test1,Y_train1,Y_test1 = train_test_split(x_train_new1,y_train_new1,test_size=0.3,random_state=1234)

In [None]:
tree_model1 = DecisionTreeClassifier()

In [None]:
tree_model1.fit(X_train1,Y_train1)

In [None]:
predict_1 = tree_model1.predict(X_test1)

In [None]:
accuracy_score(Y_test1,predict_1)

In [None]:
#finding the Cohen's kappa score between the actual test labels and the predicted labels.
cohen_kappa_score(Y_test1,predict_1)

Comparing the Cohen's kappa score before and after dropping 'Location Name' shows that location is an important attribute.
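
A comparison based on a single train/test split with an unseeded tree can shift from run to run. One way to make it more stable, sketched here on synthetic data, is to compare cross-validated kappa scores with and without a column. The column name `f0` and the data are placeholders, not the real 'Location Name' feature.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the encoded risk_data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

# Wrap Cohen's kappa so it can be used as a cross-validation metric.
kappa = make_scorer(cohen_kappa_score)

def cv_kappa(features):
    """Return the kappa score of a decision tree on each of 5 folds."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, features, y, cv=5, scoring=kappa)

scores_all = cv_kappa(X)
scores_without = cv_kappa(X.drop(columns=["f0"]))  # "f0" plays the role of the dropped column
print(scores_all.mean(), scores_without.mean())
```

If the mean score without the column is clearly lower across folds, that supports treating the column as important.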

#### Naive Bayes Classifier

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. 
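
Written out, the decision rule described above is as follows. The symbols are notation introduced here for illustration: $y$ is the class, $x_i$ the $i$-th feature, and $\mu_{iy}$, $\sigma_{iy}^{2}$ the mean and variance of feature $i$ within class $y$. The Gaussian likelihood is the form assumed by Gaussian Naive Bayes.

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}} \exp\left(-\frac{(x_i - \mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right)
$$

Since most features here are integer-coded categories rather than continuous measurements, the Gaussian assumption is only a rough fit for them.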

In [None]:
#Importing the required packages.
#GaussianNB implements the Gaussian Naive Bayes algorithm for classification. 
from sklearn.naive_bayes import GaussianNB

In [None]:
nb = GaussianNB() #Initialization.

In [None]:
# Fitting the model.
nb.fit(X_train,Y_train)

In [None]:
predict_nb= nb.predict(X_test) #Predicted values for X_test.

In [None]:
accuracy_score(Y_test,predict_nb)

In [None]:
#finding the Cohen's kappa score between the actual test labels and the predicted labels.
cohen_kappa_score(Y_test,predict_nb)

#### Random Forest Classifier

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

In [None]:
#Importing the packages.
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier()

In [None]:
# Fitting the model.
rf.fit(X_train,Y_train)

In [None]:
#predicting the values for X_test.
predict_rf = rf.predict(X_test)

In [None]:
#finding the accuracy of the predicted values against Y_test.
accuracy_score(Y_test,predict_rf)

In [None]:
#finding the Cohen's kappa score between the actual test labels and the predicted labels.
cohen_kappa_score(Y_test,predict_rf)

In [None]:
#Fitting the model on remaining attributes after removing the low performance attributes.
rf.fit(X_train_new,Y_train_new)

In [None]:
#predicting the values for X_test_new.
predict_rf_new = rf.predict(X_test_new)

In [None]:
#finding the accuracy of the predicted values against Y_test_new.
accuracy_score(Y_test_new,predict_rf_new)

In [None]:
#finding the Cohen's kappa score between the actual test labels and the predicted labels.
cohen_kappa_score(Y_test_new,predict_rf_new)

### Comparing the classifiers
The Cohen's kappa score for the DecisionTreeClassifier on the complete data was 0.73 (73%), whereas the RandomForestClassifier gave 0.777 (77.7%).
After feature selection based on the feature importances from the decision tree, we observed an increase in the scores of both classifiers.
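
To put all three classifiers side by side, the scores can be collected in one table. The sketch below does this on synthetic data, so the numbers it prints are not the results reported above. The fixed `random_state` values and the stratified split are assumptions added for reproducibility.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the encoded risk_data.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1234, stratify=y)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# Fit each model on the same split and record both metrics.
rows = []
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, pred),
        "kappa": cohen_kappa_score(y_te, pred),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```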
