## Objective
***

Recently, there has been an increase in the number of building collapses in Lagos and other major cities in Nigeria. Olusola Insurance Company offers a building insurance policy that protects buildings against damage caused by fire, vandalism, flood, or storm.
You have been appointed as the Lead Data Analyst to build a predictive model that determines whether a building will have an insurance claim during a certain period.
You will have to predict the probability of having at least one claim over the insured period of the building.

The evaluation metric for this competition is the Area Under the ROC curve (AUC).
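
As a rough illustration of why the metric favours probabilities over hard 0/1 labels, the sketch below computes AUC with scikit-learn's `roc_auc_score` on a handful of made-up values. The labels and probabilities are invented for the example and are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted claim probabilities (made-up values for illustration).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.45])

# AUC on probabilities: the share of (claim, no-claim) pairs that are ranked correctly.
auc_prob = roc_auc_score(y_true, y_prob)

# AUC on hard 0/1 labels obtained by thresholding the same probabilities at 0.5.
y_label = (y_prob >= 0.5).astype(int)
auc_label = roc_auc_score(y_true, y_label)

print(f"AUC from probabilities: {auc_prob:.3f}")  # 0.889
print(f"AUC from hard labels:   {auc_label:.3f}")  # 0.667
```

Thresholding throws away the ranking among buildings on the same side of the cut-off, which is why the second score is lower in this toy case.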

In [53]:
import pandas as pd
import numpy as np
%matplotlib inline
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder  
from sklearn.preprocessing import StandardScaler  
from scipy import sparse
from catboost import CatBoostClassifier, Pool
import lightgbm as lgb
import xgboost as xgb
from math import sqrt
from sklearn.metrics import mean_squared_error
from scipy.stats import uniform, randint
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, GridSearchCV, RandomizedSearchCV
import random

In [98]:
#load data
train = pd.read_csv("train_data.csv")
test = pd.read_csv("test_data.csv")
sample_submission = pd.read_csv("sample_submission.csv")
description_data = pd.read_csv("VariableDescription.csv")

In [3]:
train.head()

Unnamed: 0,Customer Id,YearOfObservation,Insured_Period,Residential,Building_Painted,Building_Fenced,Garden,Settlement,Building Dimension,Building_Type,Date_of_Occupancy,NumberOfWindows,Geo_Code,Claim
0,H14663,2013,1.0,0,N,V,V,U,290.0,1,1960.0,.,1053,0
1,H2037,2015,1.0,0,V,N,O,R,490.0,1,1850.0,4,1053,0
2,H3802,2014,1.0,0,N,V,V,U,595.0,1,1960.0,.,1053,0
3,H3834,2013,1.0,0,V,V,V,U,2840.0,1,1960.0,.,1053,0
4,H5053,2014,1.0,0,V,N,O,R,680.0,1,1800.0,3,1053,0


In [4]:
test.head()

Unnamed: 0,Customer Id,YearOfObservation,Insured_Period,Residential,Building_Painted,Building_Fenced,Garden,Settlement,Building Dimension,Building_Type,Date_of_Occupancy,NumberOfWindows,Geo_Code
0,H11920,2013,1.0,0,V,N,O,R,300.0,1,1960.0,3,3310
1,H11921,2016,0.997268,0,V,N,O,R,300.0,1,1960.0,3,3310
2,H9805,2013,0.369863,0,V,V,V,U,790.0,1,1960.0,.,3310
3,H7493,2014,1.0,0,V,N,O,R,1405.0,1,2004.0,3,3321
4,H7494,2016,1.0,0,V,N,O,R,1405.0,1,2004.0,3,3321


In [5]:
sample_submission.head() 
#YearOfObservation, Residential, Building_Painted, Building_Fenced, Garden, Settlement,
#Building_Type, Date_of_Occupancy, NumberOfWindows

#'Insured_Period', 'Building Dimension'

Unnamed: 0,Customer Id,Claim
0,H0,1
1,H10000,1
2,H10001,1
3,H10002,1
4,H10003,1


In [7]:
description_data.head()

Unnamed: 0,Variable,Description
0,Customer Id,Identification number for the Policy holder
1,YearOfObservation,year of observation for the insured policy
2,Insured_Period,duration of insurance policy in Olusola Insura...
3,Residential,is the building a residential building or not
4,Building_Painted,"is the building painted or not (N-Painted, V-N..."


In [8]:
#check for missing data
train.isnull().sum()

Customer Id             0
YearOfObservation       0
Insured_Period          0
Residential             0
Building_Painted        0
Building_Fenced         0
Garden                  7
Settlement              0
Building Dimension    106
Building_Type           0
Date_of_Occupancy     508
NumberOfWindows         0
Geo_Code              102
Claim                   0
dtype: int64

In [9]:
train.shape

(7160, 14)

In [10]:
test.shape

(3069, 13)

In [11]:
test.isnull().sum()

Customer Id             0
YearOfObservation       0
Insured_Period          0
Residential             0
Building_Painted        0
Building_Fenced         0
Garden                  4
Settlement              0
Building Dimension     13
Building_Type           0
Date_of_Occupancy     728
NumberOfWindows         0
Geo_Code               13
dtype: int64
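
One thing `isnull()` does not catch: in the `head()` output above, `NumberOfWindows` contains `.` for several rows, which looks like a placeholder for a missing value even though the column reports 0 nulls. The sketch below uses a small stand-in column (not the real data) and assumes that `.` does mean "missing"; that reading is a guess and should be checked against `VariableDescription.csv`.

```python
import pandas as pd

# Small stand-in for the NumberOfWindows column, using values seen in train.head().
demo = pd.DataFrame({"NumberOfWindows": [".", "4", ".", ".", "3"]})

# isnull() reports no missing values because "." is an ordinary string.
n_null_before = demo["NumberOfWindows"].isnull().sum()

# Mask the "." entries so they become NaN and show up in the missing-data check.
cleaned = demo["NumberOfWindows"].mask(demo["NumberOfWindows"] == ".")
n_null_after = cleaned.isnull().sum()

print(n_null_before, n_null_after)  # 0 3
```

If the assumption holds, passing `na_values="."` to `pd.read_csv` would be another way to get the same effect at load time.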

# Build preliminary model

In [99]:
#keep all columns for now; the columns with missing data are dropped in the next cells
#columns with missing data: Garden, Building Dimension, Date_of_Occupancy, Geo_Code
train = train[['Customer Id','YearOfObservation','Insured_Period','Residential','Building_Painted',
               'Building_Fenced','Garden','Settlement','Building Dimension','Building_Type',
               'Date_of_Occupancy','NumberOfWindows','Geo_Code','Claim']]
test = test[['Customer Id','YearOfObservation','Insured_Period','Residential','Building_Painted',
               'Building_Fenced','Garden','Settlement','Building Dimension','Building_Type',
               'Date_of_Occupancy','NumberOfWindows','Geo_Code']]

In [100]:
#remove Customer Id and the columns with missing data (Garden, Building Dimension, Date_of_Occupancy, Geo_Code) from train
train = train[['YearOfObservation','Insured_Period','Residential','Building_Painted',
               'Building_Fenced','Settlement','Building_Type','NumberOfWindows','Claim']]

In [101]:
#reset the index of the whole frame so it lines up with the frames concatenated later
train = train.reset_index(drop=True)

In [102]:
test_cust_id = test['Customer Id']
test_cust_id.reset_index(drop=True, inplace=True)

In [103]:
#remove Customer Id and the columns with missing data (Garden, Building Dimension, Date_of_Occupancy, Geo_Code) from test
test = test[['YearOfObservation','Insured_Period','Residential','Building_Painted',
               'Building_Fenced','Settlement','Building_Type','NumberOfWindows']]

In [104]:
#split into categorical

In [105]:
#one-hot encode the categorical columns
train_cat = train[['YearOfObservation', 'Residential', 'Building_Painted', 'Building_Fenced',
                'Settlement','Building_Type','NumberOfWindows']] #'NumberOfWindows'
train_cat2 = pd.get_dummies(train_cat, columns=['YearOfObservation', 'Residential', 'Building_Painted', 
                                                'Building_Fenced',
                'Settlement','Building_Type','NumberOfWindows'], drop_first=False) #'NumberOfWindows'
train_cat2.reset_index(drop=True, inplace=True)

In [106]:
test_cat = test[['YearOfObservation', 'Residential', 'Building_Painted', 'Building_Fenced',
                'Settlement','Building_Type','NumberOfWindows']] #'NumberOfWindows'
test_cat2 = pd.get_dummies(test_cat, columns=['YearOfObservation', 'Residential', 'Building_Painted', 'Building_Fenced',
                'Settlement','Building_Type','NumberOfWindows'], drop_first=False) #'NumberOfWindows'
test_cat2.reset_index(drop=True, inplace=True)
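
Because `pd.get_dummies` is called separately on train and test, the two frames can end up with different columns whenever a category appears in only one of them. A minimal sketch of one way to guard against that, using toy frames rather than the competition data, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Toy frames: test lacks a category present in train (2016) and has an unseen one (2012).
train_demo = pd.DataFrame({"YearOfObservation": [2013, 2015, 2016]})
test_demo = pd.DataFrame({"YearOfObservation": [2012, 2013, 2015]})

train_dummies = pd.get_dummies(train_demo, columns=["YearOfObservation"])
test_dummies = pd.get_dummies(test_demo, columns=["YearOfObservation"])

# Reindex to the training columns: missing columns are filled with 0,
# and columns that only exist in test are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_aligned.columns))
```

After this step both frames have the same columns in the same order, so positional slicing such as `iloc[:, 0:29]` selects the same features in each.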

In [107]:
#split into continuous data
#train: fit the scaler on the training data
scaler = preprocessing.MinMaxScaler()
to_scale_df = train[['Insured_Period']] #'Insured_Period'
train_scaled_df = scaler.fit_transform(to_scale_df)
train_scaled_df = pd.DataFrame(train_scaled_df, columns=['Insured_Period'])
train_scaled_df.reset_index(drop=True, inplace=True)

In [108]:
#test: reuse the scaler fitted on the training data so both sets share the same scale
to_scale_df = test[['Insured_Period']]
test_scaled_df = scaler.transform(to_scale_df)
test_scaled_df = pd.DataFrame(test_scaled_df, columns=['Insured_Period'])
test_scaled_df.reset_index(drop=True, inplace=True)

In [109]:
#combine
train_df = pd.concat([train_scaled_df, train_cat2,train['Claim']], axis=1)

In [110]:
#combine
test_df = pd.concat([test_scaled_df, test_cat2], axis=1)

In [115]:
train_df.shape

(7160, 30)

In [111]:
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc, recall_score, precision_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier

In [116]:
#build model
features=train_df.drop(columns=['Claim']) #all columns except the target
target = train_df['Claim']
Name=[]
Accuracy=[]
model1=LogisticRegression(random_state=22,C=0.000000001,solver='liblinear',max_iter=200)
model2=GaussianNB()
model3=RandomForestClassifier(n_estimators=200,random_state=22)
model4=GradientBoostingClassifier(n_estimators=200)
model5=KNeighborsClassifier()
model6=DecisionTreeClassifier()
model7=LinearDiscriminantAnalysis()
model8=BaggingClassifier()
Ensembled_model=VotingClassifier(estimators=[('lr', model1), ('gn', model2), ('rf', model3),('gb',model4),('kn',model5),('dt',model6),('lda',model7), ('bc',model8)], voting='hard')
for model, label in zip([model1, model2, model3, model4,model5,model6,model7,model8,Ensembled_model], ['Logistic Regression','Naive Bayes','Random Forest', 'Gradient Boosting','KNN','Decision Tree','LDA', 'Bagging Classifier', 'Ensemble']):
    scores = cross_val_score(model, features, target, cv=5, scoring='accuracy')
    Accuracy.append(scores.mean())
    Name.append(model.__class__.__name__)
    print("Accuracy: %f of model %s" % (scores.mean(),label))

Accuracy: 0.771788 of model Logistic Regression
Accuracy: 0.727525 of model Naive Bayes
Accuracy: 0.755589 of model Random Forest
Accuracy: 0.768855 of model Gradient Boosting
Accuracy: 0.740505 of model KNN
Accuracy: 0.747209 of model Decision Tree
Accuracy: 0.773044 of model LDA
Accuracy: 0.751817 of model Bagging Classifier
Accuracy: 0.770949 of model Ensemble
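
The loop above scores each model on accuracy, while the competition is evaluated on AUC. The sketch below shows how the same 5-fold setup could be scored with `scoring='roc_auc'` instead. It runs on synthetic data from `make_classification`, so the numbers it prints say nothing about the real models; the sample size and class weights are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the features and Claim target (not the competition data).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.77, 0.23], random_state=22)

model = LogisticRegression(solver="liblinear", max_iter=200, random_state=22)

# Same 5-fold cross-validation, scored with AUC and, for comparison, accuracy.
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
acc_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Mean AUC: %f" % auc_scores.mean())
print("Mean accuracy: %f" % acc_scores.mean())
```

With `scoring='roc_auc'`, scikit-learn uses the model's probability or decision scores rather than its hard predictions. On imbalanced data the two metrics can rank models differently, so model selection is probably better done on AUC here.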


In [119]:
#apply on test
#note: pred is overwritten on each iteration, so only the last model's
#(Ensembled_model) predictions are kept for the submission
classifiers=[model1,model7,Ensembled_model]
for each in classifiers:
    fit=each.fit(features,target)
    pred=fit.predict(test_df.iloc[:,0:29])
    #the test set has no 'Claim' column, so out-of-sample accuracy cannot be computed here
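
The objective asks for the probability of a claim, whereas `predict` returns 0/1 labels, and a `VotingClassifier` with `voting='hard'` does not provide `predict_proba`. A minimal sketch of the probability-based alternative follows. It uses synthetic data, a two-model ensemble chosen only to keep the example short, and hypothetical customer ids.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared train/test features (not the competition data).
X, y = make_classification(n_samples=300, n_features=8, random_state=22)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

# voting="soft" averages the class probabilities of the base models,
# which makes predict_proba available on the ensemble.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(solver="liblinear", random_state=22)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=22)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)

# Column 1 holds the probability of the positive class (Claim == 1).
claim_prob = ensemble.predict_proba(X_test)[:, 1]

# Hypothetical customer ids, used only to show the submission layout.
submission_demo = pd.DataFrame({
    "Customer Id": [f"H{i}" for i in range(len(claim_prob))],
    "Claim": claim_prob,
})
print(submission_demo.head())
```

Soft voting requires every base model to implement `predict_proba`; an `SVC`, for example, would need `probability=True`.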

In [122]:
#submission
pred_df = pd.DataFrame(pred)

In [124]:
#combine data
submission = pd.concat([test_cust_id, pred_df], axis=1)
submission.head()

Unnamed: 0,Customer Id,0
0,H11920,0
1,H11921,0
2,H9805,0
3,H7493,0
4,H7494,0


In [125]:
submission.columns = ['Customer Id','Claim']

In [126]:
#generate output
submission.to_csv("submission.csv",index=False)

In [None]:
#next steps
#clean up current code
 # have section for loading libraries for data overview
 # have section for descriptive statistics
 # have section for data cleaning
 # have section for data manipulation
 # make sure code is highly modularized and reduce hard coding
 # look at roc curve - https://stackabuse.com/understanding-roc-curves-with-python/
 # https://towardsdatascience.com/machine-learning-classifier-evaluation-using-roc-and-cap-curves-7db60fe6b716
 # https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
 # use normal ml algorithms + generate outputs
 # use boosting algorithms + generate outputs
 # combine algorithms + generate outputs