### Problem Description
Insurance companies take on risk on behalf of their customers, so risk management is central to the insurance industry. Insurers consider every quantifiable factor to build profiles of high and low insurance risks, collecting and analyzing vast amounts of information about policyholders.

As a data scientist at an insurance company, you need to analyze the available data and predict whether or not to sanction the insurance.

### Dataset Description
A zipped file containing the train, test, and sample submission files is given. The training dataset covers 52310 customers and the test dataset covers 22421 customers. The features of the dataset are as follows:

Target: Claim Status (Claim)

Name of agency (Agency)

Type of travel insurance agencies (Agency.Type)

Distribution channel of travel insurance agencies (Distribution.Channel)

Name of the travel insurance products (Product.Name)

Duration of travel (Duration)

Destination of travel (Destination)

Amount of sales of travel insurance policies (Net.Sales)

Commission received by the travel insurance agency (Commission)

Age of insured (Age)

The identification record of every observation (ID)

### Evaluation Metric
The evaluation metric for this task is precision_score.
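
For reference, precision is the share of predicted positives that are truly positive, TP / (TP + FP). The following is a minimal sketch in plain Python with made-up labels; the helper name `precision` is hypothetical, and the notebook itself relies on `sklearn.metrics.precision_score`.

```python
def precision(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    # Convention assumed here: return 0.0 when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Toy labels, not taken from the dataset
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0]
score = precision(y_true, y_pred)
print(score)
```

Because only predicted positives enter the formula, a model that rarely predicts a claim can score high precision while missing many actual claims.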

### Submission Format
The user has to submit a CSV file with the ID and Claim label. A sample submission file is provided for reference.

In [193]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import (roc_auc_score, mean_squared_error, accuracy_score,
                             classification_report, roc_curve, confusion_matrix,
                             precision_score)
from sklearn.feature_selection import RFE
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [194]:
path = './data/train.csv'

# Load the dataframe
data = pd.read_csv(path,delimiter=',')

# Remove the Id column from the dataset
# data.drop('Id',axis=1,inplace=True)

print('Shape of the data is: ',data.shape)

data.head()

Shape of the data is:  (52310, 11)


Unnamed: 0,ID,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age,Claim
0,2010,EPX,Travel Agency,Online,Cancellation Plan,61,PHILIPPINES,12.0,0.0,41,0
1,4245,EPX,Travel Agency,Online,Cancellation Plan,4,MALAYSIA,17.0,0.0,35,0
2,9251,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,26,THAILAND,19.8,11.88,47,0
3,4754,EPX,Travel Agency,Online,2 way Comprehensive Plan,15,HONG KONG,27.0,0.0,48,0
4,8840,EPX,Travel Agency,Online,2 way Comprehensive Plan,15,MALAYSIA,37.0,0.0,36,0


In [195]:
# Removing the ID column
data.drop(columns=['ID'], inplace=True)
data.head()

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age,Claim
0,EPX,Travel Agency,Online,Cancellation Plan,61,PHILIPPINES,12.0,0.0,41,0
1,EPX,Travel Agency,Online,Cancellation Plan,4,MALAYSIA,17.0,0.0,35,0
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,26,THAILAND,19.8,11.88,47,0
3,EPX,Travel Agency,Online,2 way Comprehensive Plan,15,HONG KONG,27.0,0.0,48,0
4,EPX,Travel Agency,Online,2 way Comprehensive Plan,15,MALAYSIA,37.0,0.0,36,0


In [196]:
a = data['Duration'] < 0
a.sum()

4

In [197]:
# Replace the negative durations with the column mean
data.loc[data['Duration'] < 0, 'Duration'] = data['Duration'].mean()

In [198]:
a = data['Duration'] < 0
a.sum()

0

In [199]:
b= data['Net Sales']<data['Commision (in value)']
b.sum()

1454

In [200]:
# Where there are no sales, set the commission to zero
data.loc[data['Net Sales'] == 0.0, 'Commision (in value)'] = 0

In [201]:
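# Where Net Sales is negative, overwrite the commission with the mean of Net Sales
# (note: this changes 'Commision (in value)', not 'Net Sales' itself)
data.loc[data['Net Sales'] < 0.0, 'Commision (in value)'] = data['Net Sales'].mean()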

In [202]:
# Predictors
X = data.iloc[:,:-1]

# Target
y = data.iloc[:,-1]

In [203]:
# Function to detect outliers in every numeric feature using the 1.5*IQR rule
def detect_outliers(dataframe):
    columns = ['Feature', 'Number of Outliers', 'Percentage', 'Fence Low', 'Fence High']
    rows = []

    for column in dataframe.select_dtypes(include=np.number).columns:
        # first quartile (Q1) and third quartile (Q3)
        q1 = dataframe[column].quantile(0.25)
        q3 = dataframe[column].quantile(0.75)

        # IQR
        iqr = q3 - q1

        fence_low = q1 - (1.5*iqr)
        fence_high = q3 + (1.5*iqr)

        # Rows falling outside either fence
        n_outliers = ((dataframe[column] < fence_low) | (dataframe[column] > fence_high)).sum()
        rows.append({'Feature': column,
                     'Number of Outliers': n_outliers,
                     'Percentage': (n_outliers/len(dataframe))*100,
                     'Fence Low': fence_low, 'Fence High': fence_high})

    # Build the frame in one step (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=columns)
temp = detect_outliers(X)
temp


Unnamed: 0,Feature,Number of Outliers,Percentage,Fence Low,Fence High
0,Duration,5484,10.483655,-60.5,127.5
1,Net Sales,5335,10.198815,-33.0,107.8
2,Commision (in value),6580,12.578857,-18.9,31.5
3,Age,3675,7.025425,18.0,58.0


In [204]:
from scipy.stats.mstats import winsorize

# Function to treat outliers: cap the lowest 5% and the highest 10% of each numeric feature
def treat_outliers(dataframe):
    for col in dataframe.select_dtypes(include=np.number).columns:
        dataframe[col] = winsorize(dataframe[col], limits=[0.05, 0.1], inclusive=(True, True))

    return dataframe


# X is modified in place; the returned frame is the same object
df = treat_outliers(X)

# Checking for outliers after applying winsorization
detect_outliers(X)

Unnamed: 0,Feature,Number of Outliers,Percentage,Fence Low,Fence High
0,Duration,5484,10.483655,-60.5,127.5
1,Net Sales,0,0.0,-33.0,107.8
2,Commision (in value),6580,12.578857,-18.9,31.5
3,Age,0,0.0,18.0,58.0
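
One plausible reading of the table above: the upper limit of 0.1 caps only the top 10% of values, while Duration and Commision each have more than 10% of rows above the upper fence, so the capped value still lies above the fence and the outlier count does not change. The sketch below illustrates this on synthetic data with NumPy. It uses quantile clipping as a rough stand-in for `scipy.stats.mstats.winsorize`, and the helper names are hypothetical.

```python
import numpy as np

def count_iqr_outliers(x):
    """Count values outside the 1.5*IQR fences computed from x itself."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return int(((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)).sum())

def clip_at_quantiles(x, lower, upper):
    """Quantile clipping, a rough stand-in for winsorize(limits=[lower, upper])."""
    lo, hi = np.quantile(x, [lower, 1 - upper])
    return np.clip(x, lo, hi)

# Synthetic data: 88 ordinary values plus 12 extreme ones (12% upper-tail outliers)
x = np.array(list(range(1, 89)) + [1000 + i for i in range(12)], dtype=float)

before = count_iqr_outliers(x)
after_10 = count_iqr_outliers(clip_at_quantiles(x, 0.05, 0.10))  # cap is still above the fence
after_15 = count_iqr_outliers(clip_at_quantiles(x, 0.05, 0.15))  # cap falls below the fence
print(before, after_10, after_15)
```

Under this reading, a wider upper limit or a different treatment would be needed for features whose outlier share exceeds the limit.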


In [205]:
#Treat Skewness
import scipy.stats as scs

features = []
skewness = []
for i in X.select_dtypes(include=np.number).columns:
    features.append(i)
    skewness.append(scs.skew(X[i]))
skewed = pd.DataFrame({'Features':features,'Skewness':skewness})

# If skewness is greater than 1 the feature is highly positively skewed
positively_skewed_variables = skewed[(skewed['Skewness']>1)]

# If the skewness is less than -1 the feature is highly negatively skewed.
negatively_skewed_variables = skewed[(skewed['Skewness']<-1)]

print('Positively Skewed Features \n',positively_skewed_variables)
print('*'*50)
print('Negatively Skewed Features \n',negatively_skewed_variables) 


Positively Skewed Features 
                Features  Skewness
0              Duration  1.232691
1             Net Sales  1.014536
2  Commision (in value)  1.432992
**************************************************
Negatively Skewed Features 
 Empty DataFrame
Columns: [Features, Skewness]
Index: []


In [206]:
# Let's remove the skewness in the positively skewed variables by using a log transform
for i in positively_skewed_variables['Features']:
     X[i] = np.log1p(X[i])
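
As a brief illustration of the transform used in this cell, `np.log1p(x)` computes log(1 + x), so zeros map to 0 instead of negative infinity, and `np.expm1` undoes it. The values below are toy numbers, not rows from the dataset.

```python
import numpy as np

# Toy right-skewed values, including a zero as in the commission column
x = np.array([0.0, 9.0, 99.0, 999.0])

logged = np.log1p(x)          # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # inverse transform, exp(y) - 1

print(logged)
print(restored)
```

Note that `log1p` is only defined for values greater than -1, so any remaining negative values need attention before the transform is applied.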

In [207]:
# Checking for outliers after applying the log transform
detect_outliers(X)

Unnamed: 0,Feature,Number of Outliers,Percentage,Fence Low,Fence High
0,Duration,0,0.0,-0.095926,6.554265
1,Net Sales,0,0.0,1.549355,5.51095
2,Commision (in value),0,0.0,-3.915105,6.525174
3,Age,0,0.0,18.0,58.0


In [208]:
#Check for Class Imbalance
def class_imbalance(target):
    class_values = (target.value_counts()/target.value_counts().sum())*100
    return class_values

class_imbalance(data['Claim'])

0    83.330147
1    16.669853
Name: Claim, dtype: float64

In [209]:
# Function that label-encodes every dataframe column of type category or object.
# Note: despite its name, this produces integer labels, not dummy (one-hot) variables.
def dummyEncode(dataset):
    columnsToEncode = list(dataset.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            # The encoder is refit for each column, so no mapping is kept for later reuse
            dataset[feature] = le.fit_transform(dataset[feature])
        except Exception:
            print('Error encoding ' + feature)
    return dataset
data = dummyEncode(data)


In [210]:
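# Predictors
# Note: X is rebuilt from `data`, so the winsorization and log transform applied
# to the earlier X are not carried over (X.head() below shows the raw values)
X = data.iloc[:,:-1]

# Target
y = data.iloc[:,-1]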

In [211]:
X.head()

Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age
0,7,1,1,10,61.0,68,12.0,0.0,41
1,7,1,1,10,4.0,53,17.0,0.0,35
2,6,1,1,16,26.0,84,19.8,11.88,47
3,7,1,1,1,15.0,33,27.0,0.0,48
4,7,1,1,1,15.0,53,37.0,0.0,36


In [212]:
def random_forrest(dataframe,target):
    
    x_train,x_val,y_train,y_val = train_test_split(dataframe,target, test_size=0.3, random_state=42)
    
    # Applying SMOTE on train data for dealing with class imbalance
    # (recent imbalanced-learn releases drop the `kind` argument and rename fit_sample to fit_resample)
    smote = SMOTE()
    X_sm, y_sm = smote.fit_resample(x_train, y_train)
    
    global rfc
    rfc = RandomForestClassifier()
    # Note: the model is fit on the original split, not on X_sm/y_sm,
    # so the score reported below does not reflect the SMOTE resampling
    rfc.fit(x_train, y_train)
    y_pred = rfc.predict(x_val)
    precision = precision_score(y_val, y_pred)
    return precision

# training
precision = random_forrest(X,y)
print('score is:',precision)


score is: 0.8457167832167832


In [213]:

# testing function
def prediction(test):
    y_pred = rfc.predict(test)
    
    return y_pred

test=pd.read_csv('./data/test.csv')

# Storing the ID column
Id = test[['ID']]

# Preprocessed Test File
test.drop(columns=['ID'], inplace=True)
test.head()
# label encoder (refit on the test file, so codes may not match those learned from the training data)
test = dummyEncode(test)
test.head()


Unnamed: 0,Agency,Agency Type,Distribution Channel,Product Name,Duration,Destination,Net Sales,Commision (in value),Age
0,7,1,1,10,192,33,18.0,0.0,36
1,7,1,1,0,2,75,20.0,0.0,36
2,2,0,1,9,13,75,13.5,3.38,24
3,7,1,1,1,133,82,41.0,0.0,36
4,2,0,1,17,2,75,30.0,7.5,32
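
Because `dummyEncode` fits fresh encoders on the test file, the same category can receive a different integer than it did in training. One way to keep the codes aligned is to record the categories seen in training and reuse them for the test frame. The sketch below uses pandas on toy frames; the helper names and the choice of -1 for unseen values are assumptions, not part of the original notebook.

```python
import pandas as pd

def fit_category_maps(train, columns):
    """Record the sorted unique values of each column in the training frame."""
    return {col: sorted(train[col].unique()) for col in columns}

def apply_category_maps(frame, category_maps):
    """Encode each column with the training categories; unseen values become -1."""
    encoded = frame.copy()
    for col, categories in category_maps.items():
        encoded[col] = pd.Categorical(frame[col], categories=categories).codes
    return encoded

# Toy frames, not the competition data
train = pd.DataFrame({'Agency': ['EPX', 'CWT', 'EPX', 'JZI']})
test = pd.DataFrame({'Agency': ['JZI', 'EPX', 'NEW']})

maps = fit_category_maps(train, ['Agency'])
train_enc = apply_category_maps(train, maps)
test_enc = apply_category_maps(test, maps)
print(train_enc['Agency'].tolist())
print(test_enc['Agency'].tolist())
```

Sorting the categories mirrors the ordering `LabelEncoder` uses for its classes, so codes for training values stay the same under this approach.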


In [214]:

#predicting on test file
y_pred = pd.DataFrame(prediction(test),columns=['Claim']) 
print(y_pred['Claim'].value_counts())

0    19168
1     3253
Name: Claim, dtype: int64
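
The submission format calls for a CSV file with the ID and Claim columns, a step the cells above do not show. A minimal sketch of that step follows, using toy IDs and predictions in place of the stored `Id` frame and `y_pred`; it writes to an in-memory buffer so it runs without touching the disk.

```python
import io
import pandas as pd

# Toy stand-ins for the stored ID column and the model predictions
ids = pd.DataFrame({'ID': [101, 102, 103]})
preds = [0, 1, 0]

submission = pd.DataFrame({'ID': ids['ID'], 'Claim': preds})

# Write to an in-memory buffer here; pass a file path such as
# 'submission.csv' (a hypothetical name) to save to disk instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In the notebook, the equivalent would be to combine `Id` and `y_pred` column-wise, for example with `pd.concat([Id, y_pred], axis=1)`, before calling `to_csv` with `index=False`.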
