## Task 1
### Question
Briefly discuss why it is more difficult to find a good classifier on such a dataset than on one where, for example, 5,000 claims are fraudulent, and 5,000 are not. In particular, consider what happens when undetected fraudulent claims are very costly to the insurance company.

### Answer
When the dataset in highly unbalanced, seen with the car-insurance data, the machine learning algorithm will accurately predict the majority class but poorly predict in minority class. Since most machine learning algorithms always attempt to minimize the error rate and subsequently will output a low number. In our scenario, the algorithm tends to predict all the claims as non-fraudulent. However, the wrong prediction will increase the False-negative rate, which will increase the cost of fraudulent claims.
Another issue related to the scarce minority data is that we might miss some key combination of variables that have high probability to be fraudulent.

Check how many NaN values are in each column when the claim is fraudulent.
We find when the claim is fraudulent most of the numerical variables are not missing, which means we could directly drop them.

## Task 2
### Question
Load the dataset "Insurance_claims.csv" and clean it as appropriate for use with machine learning algorithms. A description of the features can be found at the end of this document.

### Principle
1. Since the dataset is highly unbalanced, and the fraudulent dataset is very scarce, we should not drop the data labeled 'fraudulent'.
2. When the variables are dummy variables, we tend to keep the NaN value as a classification value rather than drop it.
3. When the variables are numerical, we will check how many NaN values is related to the fraudulent case. If there are few of them, we will drop the variable. Otherwise, we will find a way to fill the missing values.

In [1]:
import numpy as np
import pandas as pd
import datetime 
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns',None)

In [2]:
# read data and get a brief idea of the data
df = pd.read_csv('./materials/Insurance_claims.csv')

print(f'Data Columns:\n' + str(df.columns))
print('--------------------------------------------------------------')
print(f'Data sample:')
df.head(5) #TODO use sentiment analysis 

Data Columns:
Index(['PolicyholderNumber', 'FirstPartyVehicleNumber',
       'ThirdPartyVehicleNumber', 'InsurerNotes', 'PolicyholderOccupation',
       'LossDate', 'FirstPolicySubscriptionDate', 'ClaimCause',
       'ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType',
       'ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet',
       'NumberOfPoliciesOfPolicyholder', 'FpVehicleAgeMonths',
       'EasinessToStage', 'ClaimWihoutIdentifiedThirdParty', 'ClaimAmount',
       'LossHour', 'PolicyHolderAge', 'NumberOfBodilyInjuries',
       'FirstPartyLiability', 'Fraud', 'LossAndHolderPostCodeSame'],
      dtype='object')
--------------------------------------------------------------
Data sample:


Unnamed: 0,PolicyholderNumber,FirstPartyVehicleNumber,ThirdPartyVehicleNumber,InsurerNotes,PolicyholderOccupation,LossDate,FirstPolicySubscriptionDate,ClaimCause,ClaimInvolvedCovers,DamageImportance,FirstPartyVehicleType,ConnectionBetweenParties,PolicyWasSubscribedOnInternet,NumberOfPoliciesOfPolicyholder,FpVehicleAgeMonths,EasinessToStage,ClaimWihoutIdentifiedThirdParty,ClaimAmount,LossHour,PolicyHolderAge,NumberOfBodilyInjuries,FirstPartyLiability,Fraud,LossAndHolderPostCodeSame
0,531112,715507.0,,avoids a cat and hits a garage pole With deduc...,CivilServant,02.01.19,18.06.18,CollisionWithAnimal,MaterialDamages ActLiability,,Car,,1,1,104.0,0.25,1,4624.73,8.0,45.0,0,1.0,0,1
1,87170,71164.0,,accident only expert contacts us to inform us ...,Worker,02.01.19,29.06.17,LossOfControl,MaterialDamages ActLiability,,Car,,0,3,230.0,0.5,1,1606.81,11.0,20.0,0,1.0,0,0
2,98706,442609.0,,ae Miss/ for garage change A/ setting up EAD/ ...,Worker,02.01.19,05.02.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability,,Car,,0,9,93.0,0.25,0,998.2,18.0,32.0,0,0.5,0,1
3,38240,24604.0,,"awaiting report to determine rc, no box checke...",CivilServant,02.01.19,21.01.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability ReplacementVehicle,,Car,,0,2,56.0,0.25,0,2506.92,11.0,46.0,0,0.5,0,1
4,11339,2933.0,229134.0,Insured in THIRD-PARTY formula Insured in a su...,Farmer,02.01.19,13.01.18,AccidentWithIdentifiedThirdParty,ActLiability,,Car,,0,4,110.0,0.25,0,12.0,12.0,28.0,0,0.0,0,0


Check how many NaN values are in each column.
We can find that except for 'FirstPartyVehicleNumber', 'ThirdPartyVehicleNumber', and 'InsurerNotes', which we might not use in our models, most the NaN values are concentrated in the 'PolicyholderOccupation' and 'ClaimCause' which are mainly categorical variables. In this case, we could turn these NaN values into a category value in order to account for the influence of the missing values (??), regardless of why they are missing.
In terms of the numeric variables, we will check how many of them are missing when the claim is fraudulent. 

In [3]:
# Check how much NaN values in each column.
print(f'Number of NaN values in each column:') #TODO draw the distribution of of each variable
print(df.isnull().sum())

Number of NaN values in each column:
PolicyholderNumber                     0
FirstPartyVehicleNumber              495
ThirdPartyVehicleNumber            11151
InsurerNotes                        2357
PolicyholderOccupation               343
LossDate                               0
FirstPolicySubscriptionDate            0
ClaimCause                           197
ClaimInvolvedCovers                  195
DamageImportance                   10792
FirstPartyVehicleType                 12
ConnectionBetweenParties           11432
PolicyWasSubscribedOnInternet          0
NumberOfPoliciesOfPolicyholder         0
FpVehicleAgeMonths                    12
EasinessToStage                        0
ClaimWihoutIdentifiedThirdParty        0
ClaimAmount                            0
LossHour                              94
PolicyHolderAge                       36
NumberOfBodilyInjuries                 0
FirstPartyLiability                    0
Fraud                                  0
LossAndHolderPostCod

In [4]:
# Check the number of missing data when Fraud is True
df_fraud = df[df["Fraud"]==1]
print(f'Number of NaN values in each column when Frand is True:')
print(df_fraud.isnull().sum())

Number of NaN values in each column when Frand is True:
PolicyholderNumber                   0
FirstPartyVehicleNumber              9
ThirdPartyVehicleNumber            106
InsurerNotes                         1
PolicyholderOccupation               4
LossDate                             0
FirstPolicySubscriptionDate          0
ClaimCause                           0
ClaimInvolvedCovers                  0
DamageImportance                    96
FirstPartyVehicleType                2
ConnectionBetweenParties           102
PolicyWasSubscribedOnInternet        0
NumberOfPoliciesOfPolicyholder       0
FpVehicleAgeMonths                   2
EasinessToStage                      0
ClaimWihoutIdentifiedThirdParty      0
ClaimAmount                          0
LossHour                             1
PolicyHolderAge                      0
NumberOfBodilyInjuries               0
FirstPartyLiability                  0
Fraud                                0
LossAndHolderPostCodeSame            0
dtype: i

In conclusion, we can set NaN as a category of categorical data and generate dummy variables. And we can drop the rows that contains NaN values in numerical columns.

In [5]:
# get useful features that are needed in the machine learning model
needed_columns = [ 'PolicyholderOccupation',
       'LossDate', 'FirstPolicySubscriptionDate', 'ClaimCause',
       'ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType',
       'ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet',
       'NumberOfPoliciesOfPolicyholder', 'FpVehicleAgeMonths',
       'EasinessToStage', 'ClaimWihoutIdentifiedThirdParty', 'ClaimAmount',
       'LossHour', 'PolicyHolderAge', 'NumberOfBodilyInjuries',
       'FirstPartyLiability', 'LossAndHolderPostCodeSame','Fraud']
new_df = df[needed_columns]
new_df


Unnamed: 0,PolicyholderOccupation,LossDate,FirstPolicySubscriptionDate,ClaimCause,ClaimInvolvedCovers,DamageImportance,FirstPartyVehicleType,ConnectionBetweenParties,PolicyWasSubscribedOnInternet,NumberOfPoliciesOfPolicyholder,FpVehicleAgeMonths,EasinessToStage,ClaimWihoutIdentifiedThirdParty,ClaimAmount,LossHour,PolicyHolderAge,NumberOfBodilyInjuries,FirstPartyLiability,LossAndHolderPostCodeSame,Fraud
0,CivilServant,02.01.19,18.06.18,CollisionWithAnimal,MaterialDamages ActLiability,,Car,,1,1,104.0,0.25,1,4624.73,8.0,45.0,0,1.0,1,0
1,Worker,02.01.19,29.06.17,LossOfControl,MaterialDamages ActLiability,,Car,,0,3,230.0,0.50,1,1606.81,11.0,20.0,0,1.0,0,0
2,Worker,02.01.19,05.02.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability,,Car,,0,9,93.0,0.25,0,998.20,18.0,32.0,0,0.5,1,0
3,CivilServant,02.01.19,21.01.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability ReplacementVehicle,,Car,,0,2,56.0,0.25,0,2506.92,11.0,46.0,0,0.5,1,0
4,Farmer,02.01.19,13.01.18,AccidentWithIdentifiedThirdParty,ActLiability,,Car,,0,4,110.0,0.25,0,12.00,12.0,28.0,0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11525,Employee,17.02.21,15.03.19,WindscreenDamage,Windscreen,,Car,,0,1,85.0,0.50,1,1010.23,0.0,56.0,0,0.0,0,0
11526,Employee,07.03.21,20.07.17,WindscreenDamage,Windscreen,,Car,,0,3,119.0,0.50,1,154.35,0.0,54.0,0,0.0,0,0
11527,Employee,15.03.21,30.09.20,WindscreenDamage,Windscreen,,Car,,0,4,139.0,0.50,1,420.25,0.0,34.0,0,0.0,0,0
11528,CivilServant,06.03.21,28.12.18,WindscreenDamage,Windscreen,,Car,,0,6,105.0,0.50,1,96.40,0.0,58.0,0,0.0,0,0


In [6]:
# clean features
# for the dummy variables with NaN, we want to keep it in the dataframe since NaN might be an important feature
dummy_columns = ['PolicyholderOccupation', 'ClaimCause','ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType','ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet']
new_df[dummy_columns] = new_df[dummy_columns].fillna('NaN')
new_df = pd.get_dummies(new_df,columns=dummy_columns,drop_first=True)
# turn the date into timestamp (get a numeric data)
new_df['LossDate'] = new_df['LossDate'].apply(lambda x:datetime.datetime.strptime(x,'%d.%M.%y').timestamp())
new_df['FirstPolicySubscriptionDate'] = new_df['FirstPolicySubscriptionDate'].apply(lambda x:datetime.datetime.strptime(x,'%d.%M.%y').timestamp())
# normalize the data
scale_list = ["ClaimAmount","LossHour","PolicyHolderAge"] 
new_df[scale_list] = scale(new_df[scale_list])
# for the numeric missing data, we need to drop it. 
# Since we have turned the NaN in categorical columns into 'str', we can directly drop the rows with NaN in the whole dataframe
new_df.dropna(inplace=True,axis=0)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df[dummy_columns] = new_df[dummy_columns].fillna('NaN')


In [7]:
new_df[new_df["Fraud"]==1]

Unnamed: 0,LossDate,FirstPolicySubscriptionDate,NumberOfPoliciesOfPolicyholder,FpVehicleAgeMonths,EasinessToStage,ClaimWihoutIdentifiedThirdParty,ClaimAmount,LossHour,PolicyHolderAge,NumberOfBodilyInjuries,FirstPartyLiability,LossAndHolderPostCodeSame,Fraud,PolicyholderOccupation_Employee,PolicyholderOccupation_Executive,PolicyholderOccupation_Farmer,PolicyholderOccupation_HeadOfCompany,PolicyholderOccupation_Merchant,PolicyholderOccupation_NaN,PolicyholderOccupation_Retired,PolicyholderOccupation_SelfEmployed,PolicyholderOccupation_Student,PolicyholderOccupation_Unemployed,PolicyholderOccupation_Worker,ClaimCause_AccidentWithIdentifiedThirdParty,ClaimCause_AccidentWithUnidentifiedThirdParty,ClaimCause_CollisionWithAnimal,ClaimCause_CollisionWithPedestrian,ClaimCause_Fire,ClaimCause_Flood,ClaimCause_ForcesOfNature,ClaimCause_Hail,ClaimCause_LegalProtection,ClaimCause_LossOfControl,ClaimCause_MultiVehicleCrash,ClaimCause_NaN,ClaimCause_Storm,ClaimCause_TheftAttempt,ClaimCause_TheftOfExteriorElements,ClaimCause_TotalTheft,ClaimCause_Vandalism,ClaimCause_WindscreenDamage,ClaimInvolvedCovers_Accessories ActLiability Theft,ClaimInvolvedCovers_Accessories MaterialDamages ActLiability,ClaimInvolvedCovers_Accessories MaterialDamages ActLiability MedicalCare,ClaimInvolvedCovers_Accessories MaterialDamages ActLiability ThirdParty,ClaimInvolvedCovers_Accessories RiderClothes Windscreen ActLiability Theft,ClaimInvolvedCovers_Accessories Theft,ClaimInvolvedCovers_Accessories Windscreen,ClaimInvolvedCovers_Accessories Windscreen ActLiability,ClaimInvolvedCovers_Accessories Windscreen ActLiability Burglary,ClaimInvolvedCovers_Accessories Windscreen ActLiability Theft,ClaimInvolvedCovers_Accessories Windscreen Theft,ClaimInvolvedCovers_ActLiability,ClaimInvolvedCovers_ActLiability Burglary,ClaimInvolvedCovers_ActLiability Burglary ReplacementVehicle,ClaimInvolvedCovers_ActLiability Burglary Theft,ClaimInvolvedCovers_ActLiability Burglary Theft ReplacementVehicle,ClaimInvolvedCovers_ActLiability Fire,ClaimInvolvedCovers_ActLiability Fire Burglary,ClaimInvolvedCovers_ActLiability Fire ReplacementVehicle,ClaimInvolvedCovers_ActLiability Fire ThirdParty,ClaimInvolvedCovers_ActLiability MaterialDamages,ClaimInvolvedCovers_ActLiability MedicalCare,ClaimInvolvedCovers_ActLiability MedicalCare ThirdParty,ClaimInvolvedCovers_ActLiability NaturalCatastrophes,ClaimInvolvedCovers_ActLiability NaturalCatastrophes Burglary,ClaimInvolvedCovers_ActLiability NaturalCatastrophes ReplacementVehicle,ClaimInvolvedCovers_ActLiability ReplacementVehicle,ClaimInvolvedCovers_ActLiability Theft,ClaimInvolvedCovers_ActLiability Theft ReplacementVehicle,ClaimInvolvedCovers_ActLiability ThirdParty,ClaimInvolvedCovers_ActLiability ThirdParty ReplacementVehicle,ClaimInvolvedCovers_ActLiability ThirdParty Theft,ClaimInvolvedCovers_Burglary,ClaimInvolvedCovers_Burglary Theft,ClaimInvolvedCovers_MaterialDamages,ClaimInvolvedCovers_MaterialDamages ActLiability,ClaimInvolvedCovers_MaterialDamages ActLiability Burglary,ClaimInvolvedCovers_MaterialDamages ActLiability Burglary ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages ActLiability Fire,ClaimInvolvedCovers_MaterialDamages ActLiability MedicalCare,ClaimInvolvedCovers_MaterialDamages ActLiability MedicalCare ThirdParty,ClaimInvolvedCovers_MaterialDamages ActLiability NaturalCatastrophes,ClaimInvolvedCovers_MaterialDamages ActLiability NaturalCatastrophes ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages ActLiability ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages ActLiability Theft,ClaimInvolvedCovers_MaterialDamages ActLiability ThirdParty,ClaimInvolvedCovers_MaterialDamages ActLiability ThirdParty ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages Burglary,ClaimInvolvedCovers_MaterialDamages ThirdParty,ClaimInvolvedCovers_MedicalCare,ClaimInvolvedCovers_MedicalCare ThirdParty,ClaimInvolvedCovers_NaN,ClaimInvolvedCovers_NaturalCatastrophes,ClaimInvolvedCovers_NaturalCatastrophes ActLiability ReplacementVehicle,ClaimInvolvedCovers_Theft,ClaimInvolvedCovers_ThirdParty,ClaimInvolvedCovers_ThirdPartyMaterialDamages ActLiability,ClaimInvolvedCovers_Windscreen,ClaimInvolvedCovers_Windscreen ActLiability,ClaimInvolvedCovers_Windscreen ActLiability Burglary,ClaimInvolvedCovers_Windscreen ActLiability Burglary Theft,ClaimInvolvedCovers_Windscreen ActLiability NaturalCatastrophes,ClaimInvolvedCovers_Windscreen ActLiability Theft,ClaimInvolvedCovers_Windscreen ActLiability Theft ReplacementVehicle,ClaimInvolvedCovers_Windscreen MaterialDamages,ClaimInvolvedCovers_Windscreen MaterialDamages ActLiability,ClaimInvolvedCovers_Windscreen NaturalCatastrophes,ClaimInvolvedCovers_Windscreen Theft,DamageImportance_NaN,DamageImportance_TotalLoss,FirstPartyVehicleType_Caravan,FirstPartyVehicleType_Motorcycle,FirstPartyVehicleType_NaN,FirstPartyVehicleType_PrivateCar,ConnectionBetweenParties_SameAddress,ConnectionBetweenParties_SameBankAccount,ConnectionBetweenParties_SameEmail,ConnectionBetweenParties_SamePhone,ConnectionBetweenParties_SamePolice,PolicyWasSubscribedOnInternet_1
523,1.548115e+09,1.548115e+09,4,138.0,0.25,0,0.088401,0.639231,0.356493,0,0.0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
653,1.548634e+09,1.548634e+09,1,164.0,0.25,0,0.027145,-0.035794,-1.154930,0,1.0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
692,1.546733e+09,1.547165e+09,5,134.0,0.25,0,-0.527448,0.234216,-0.892074,0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
697,1.546387e+09,1.548634e+09,1,122.0,0.25,1,-0.527448,0.234216,-0.957788,0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
772,1.546647e+09,1.548115e+09,1,148.0,0.50,1,0.515126,0.234216,1.276490,0,1.0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9768,1.578010e+09,1.577837e+09,1,133.0,0.25,0,-0.304357,0.369221,-0.563503,0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
10147,1.578010e+09,1.577837e+09,1,31.0,0.25,0,0.871442,1.179251,-0.300647,0,0.0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
10166,1.578097e+09,1.577924e+09,1,101.0,0.50,1,-0.284205,1.179251,-0.957788,0,0.0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0
10181,1.578183e+09,1.578183e+09,1,232.0,0.25,0,-0.527448,1.044246,-0.760646,0,0.0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1


In [8]:
X = new_df.to_numpy().astype(np.float64)
y = new_df["Fraud"].to_numpy().reshape([-1])
print(f"X shape is:{X.shape}")
print(f"y shape is:{y.shape}")


X shape is:(11388, 122)
y shape is:(11388,)


Split the data into training and testing sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=112)


## Task 3
### Question
Start by creating a (deep) neural network in TensorFlow and train it on the data. Using training and validation sets, find a model with high accuracy, then evaluate it on the test set. In particular, record both the accuracy and AUC. Briefly discuss what issues you observe based on the metrics.


In [10]:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score,roc_auc_score


In [11]:
%load_ext tensorboard

In [12]:
rm -rf ./logs/

In [13]:
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(0.001,0.1))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.3))
# the number of units in the hidden layer, 1 time, 2 times or 3 times of the unit number of input layer
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete(range(X.shape[1], X.shape[1]+1,X.shape[1]))) 
HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'sigmoid']))
HP_HIDDEN_LAYER_NUMBER = hp.HParam('hidden_layer_number', hp.Discrete(range(1,6)))
METRIC_CROSSENTROPY = 'binary_crossentropy'
EPOCHS = 10

Once we have set up our parameters and metrics, we write those into our folder with the logs:

In [14]:
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(hparams=[HP_LEARNING_RATE, HP_OPTIMIZER, HP_DROPOUT, HP_NUM_UNITS,HP_ACTIVATION,HP_HIDDEN_LAYER_NUMBER],
                      metrics = [hp.Metric(METRIC_CROSSENTROPY, display_name='CROSSENTROPY')])

2022-03-07 16:31:15.189468: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [15]:
def train_model(hparams,X_train=X_train,y_train=y_train,X_test=X_test,y_test=y_test):
    # early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True) # set patience to 10 to accelerate the training
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
        tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=hparams[HP_ACTIVATION])]*hparams[HP_HIDDEN_LAYER_NUMBER]+[
        tf.keras.layers.Dense(1,activation='sigmoid')])

    if hparams[HP_OPTIMIZER] == 'sgd':
        # Note that exploding gradients can be a big problem when running regressions, especially under SGD
        # Hence, we use "gradient clipping" with parameter alpha, which means that the gradients are manually kept between -1 and 1
        # This is of course another hyperparameter that we might tune!
        optimizer = tf.keras.optimizers.SGD(
            learning_rate=hparams[HP_LEARNING_RATE], clipvalue=1)
    elif hparams[HP_OPTIMIZER] == 'adam':
        optimizer = tf.keras.optimizers.Adam(
            learning_rate=hparams[HP_LEARNING_RATE])

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy')

    model.fit(X_train, y_train, epochs=EPOCHS)
    loss = model.evaluate(X_test, y_test)
    x_test_predict = model.predict(X_test)
    # calculate the roc
    roc_score = roc_auc_score(y_test, x_test_predict)
    # calculate the accuracy suppose the threshold is 0.5
    x_test_predict_binary = np.where(x_test_predict>0.5,1,0)
    accuracy = accuracy_score(y_test, x_test_predict_binary)
    # calculate the sensitivity
    sensitivity = recall_score(y_test, x_test_predict_binary)
    return loss, accuracy,roc_score,sensitivity


In [16]:
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)
        loss, accuracy,roc_score,sensitivity = train_model(hparams)
        tf.summary.scalar('ACCUARY', accuracy, step=1)
        tf.summary.scalar('LOSS', loss, step=1)
        tf.summary.scalar('ROC', roc_score, step=1)
        tf.summary.scalar('SENSITIVITY', sensitivity, step=1)

In [17]:
total_sessions = 3 #FIXME: change this to the number of sessions you want to run, and fix the issue in the metrics

for session in range(total_sessions):
    
    # Create hyperparameters randomly
    dropout_rate = HP_DROPOUT.domain.sample_uniform()
    num_units = HP_NUM_UNITS.domain.sample_uniform()
    optimizer = HP_OPTIMIZER.domain.sample_uniform()
    activation = HP_ACTIVATION.domain.sample_uniform()
    hidden_layer_number = HP_HIDDEN_LAYER_NUMBER.domain.sample_uniform()
    
    r = -3*np.random.rand()
    learning_rate = 10.0**r
    
    # Create a dictionary of hyperparameters
    hparams = { HP_LEARNING_RATE: learning_rate,
                HP_OPTIMIZER: optimizer,
                HP_DROPOUT: dropout_rate,
                HP_NUM_UNITS: num_units,
                HP_ACTIVATION: activation,
                HP_HIDDEN_LAYER_NUMBER: hidden_layer_number}
    
    # train the model with the chosen parameters
    run_name = "run-%d" % session
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})
    run('logs/hparam_tuning/' + run_name, hparams)

--- Starting trial: run-0
{'learning_rate': 0.1775897545532337, 'optimizer': 'sgd', 'dropout': 0.27691885232847835, 'num_units': 122, 'activation': 'relu', 'hidden_layer_number': 2}
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--- Starting trial: run-1
{'learning_rate': 0.22333230880680327, 'optimizer': 'sgd', 'dropout': 0.1299281145004579, 'num_units': 122, 'activation': 'relu', 'hidden_layer_number': 4}
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
--- Starting trial: run-2
{'learning_rate': 0.5786125138435202, 'optimizer': 'sgd', 'dropout': 0.13974905391020548, 'num_units': 122, 'activation': 'relu', 'hidden_layer_number': 3}
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [18]:
%tensorboard --logdir logs

ERROR: Failed to launch TensorBoard (exited with -4).

## Task 4
### Question
Start by creating a (deep) neural network in TensorFlow and train it on the data. Using training and validation sets, find a model with high accuracy, then evaluate it on the test set. In particular, record both the accuracy and AUC. Briefly discuss what issues you observe based on the metrics.

### Principle
In this part, we are going to try both oversampling and undersampling.


In [22]:
# TODO draw the plot after resampling the data
import imblearn
from collections import Counter
import matplotlib.pyplot as plt

### Oversampling
We will successively try to oversample the minority class to 10%, 30%, 50% of the size of majority class.

In [23]:
# k_neighbors set to 20 to make sure that the result is more general 
over = imblearn.over_sampling.SMOTE(sampling_strategy=0.1, random_state = 483, k_neighbors=20)  
X_over_synth_10, y_over_synth_10 = over.fit_resample(X_train, y_train)
over = imblearn.over_sampling.SMOTE(sampling_strategy=0.5, random_state = 483, k_neighbors=20)
X_over_synth_30, y_over_synth_30 = over.fit_resample(X_train, y_train)
over = imblearn.over_sampling.SMOTE(sampling_strategy=1, random_state = 483, k_neighbors=20)
X_over_synth_50, y_over_synth_50 = over.fit_resample(X_train, y_train)

In [24]:
print("Percentage of 1 in y_over_synth10:", Counter(y_over_synth_10)[1]/len(y_over_synth_10))
print("Percentage of 1 in y_over_synth30:", Counter(y_over_synth_30)[1]/len(y_over_synth_30))
print("Percentage of 1 in y_over_synth50:", Counter(y_over_synth_50)[1]/len(y_over_synth_50))

Percentage of 1 in y_over_synth10: 0.09090909090909091
Percentage of 1 in y_over_synth30: 0.3333333333333333
Percentage of 1 in y_over_synth50: 0.5


### Undersampling
We will successively try to undersample the minority class to 10%, 30%, 50% of the size of majority class.

In [25]:
under = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=0.1, random_state = 483)  
X_under_synth_10, y_under_synth_10 = under.fit_resample(X_train, y_train)
under = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=0.5, random_state = 483)
X_under_synth_30, y_under_synth_30 = under.fit_resample(X_train, y_train)
under = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=1, random_state = 483)
X_under_synth_50, y_under_synth_50 = under.fit_resample(X_train, y_train)

In [26]:
print("Percentage of 1 in y_under_synth_10:", Counter(y_under_synth_10)[1]/len(y_under_synth_10))
print("Percentage of 1 in y_under_synth_30:", Counter(y_under_synth_30)[1]/len(y_under_synth_30))
print("Percentage of 1 in y_under_synth_50:", Counter(y_under_synth_50)[1]/len(y_under_synth_50))

Percentage of 1 in y_under_synth_10: 0.09090909090909091
Percentage of 1 in y_under_synth_30: 0.3333333333333333
Percentage of 1 in y_under_synth_50: 0.5


## Task 5
### Question
 Create a new (deep) neural network and train it on your enhanced dataset. Use training and validation sets derived from the enhanced dataset to find a model with high accuracy. Evaluate your final model on a test set consisting only of original data. Again, record the accuracy and AUC. Briefly discuss the changes you would expect in the metrics and the actual changes you observe. Would you say that you are now doing better at identifying fraudulent claims?

### Principle
To simplify the problem, and save computational time, we will apply all the synthetic data to a very simple neural network, and then compare the performance of this distinct synthetic data.
The neural network structure is as follows:
1. Input layer
2. 2 Hidden layers, in which the number of neurons in each layer is equal to input layer and 'relu' activation function is used.
3. No dropout layer
4. Output layer, with 'sigmoid' activation function.
5. Optimizer: AdamOptimizer

In [27]:
class TrainModel:
    def __init__(self, X_train, y_train, X_test= X_test, y_test=y_test,epochs=100,early_stopping_cb:bool=False,patience:int=10):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.epochs = epochs
        self.early_stopping_cb = early_stopping_cb
        self.patience = patience
        self.simple_model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(X_train.shape[1], activation=activation, input_shape=(X_train.shape[1],)),
            tf.keras.layers.Dense(X_train.shape[1], activation='relu'),
            tf.keras.layers.Dense(X_train.shape[1], activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ])

    def compile(self):
        self.simple_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    def fit(self):
        if self.early_stopping_cb:
            early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=self.patience, restore_best_weights=True) # set patience to 10 to accelerate the training
            self.simple_model.fit(self.X_train, self.y_train,validation_data= (X_test, y_test),callbacks=[early_stopping_cb])
        else:
            self.simple_model.fit(self.X_train, self.y_train,validation_data= (X_test, y_test),epochs=EPOCHS)

    def evaluate(self):
        loss = self.simple_model.evaluate(X_test, y_test)
        x_test_predict = self.simple_model.predict(X_test)
        # calculate the roc
        roc_score = roc_auc_score(y_test, x_test_predict)
        # calculate the accuracy suppose the threshold is 0.5
        x_test_predict_binary = np.where(x_test_predict>0.5,1,0)
        accuracy = accuracy_score(y_test, x_test_predict_binary)
        # calculate the sensitivity
        sensitivity = recall_score(y_test, x_test_predict_binary)
        return {'loss': loss, 'accuracy': accuracy, 'sensitivity': sensitivity, 'roc': roc_score}

    def run(self):
        self.compile()
        self.fit()
        return self.evaluate()
    


In [28]:
res_list = [] # FIXME
for X_train,y_train in zip([X_over_synth_10,X_over_synth_30,X_over_synth_50,X_under_synth_10,X_under_synth_30,X_under_synth_50], [y_over_synth_10,y_over_synth_30,y_over_synth_50,y_under_synth_10,y_under_synth_30,y_under_synth_50]):
    tm = TrainModel(X_train,y_train,early_stopping_cb=True)
    res = tm.run()
    res_list.append(res)



In [29]:
res_list

[{'loss': [36559.88671875, 0.9909276962280273],
  'accuracy': 0.9909277143693298,
  'sensitivity': 0.0,
  'roc': 0.5},
 {'loss': [775952.25, 0.408838152885437],
  'accuracy': 0.4088381621305239,
  'sensitivity': 0.8064516129032258,
  'roc': 0.6058247432501953},
 {'loss': [326437.15625, 0.7772900462150574],
  'accuracy': 0.7772900204858063,
  'sensitivity': 0.7096774193548387,
  'roc': 0.7437932282834442},
 {'loss': [868522.5, 0.9909276962280273],
  'accuracy': 0.9909277143693298,
  'sensitivity': 0.0,
  'roc': 0.5},
 {'loss': [160990.546875, 0.9909276962280273],
  'accuracy': 0.9909277143693298,
  'sensitivity': 0.0,
  'roc': 0.5},
 {'loss': [411068.125, 0.9909276962280273],
  'accuracy': 0.9909277143693298,
  'sensitivity': 0.0,
  'roc': 0.5}]