## Task 1
### Question
Briefly discuss why it is more difficult to find a good classifier on such a dataset than on one where, for example, 5,000 claims are fraudulent, and 5,000 are not. In particular, consider what happens when undetected fraudulent claims are very costly to the insurance company.

### Answer
When the dataset in highly unbalanced, like what in the car-insurance case, the machine learning algorithm can predict well in the majority class and predict poorly in minority class. Since most of machine learning algorithm always pursue to decrease error rate, thus they tend to predict the all the label to majority class one. In our scenario, the algorithm tends to predict all the claims non-fraudulent. However, the wrong prediction will increase the False-negative rate, which will increase the cost of fraudulent claims.
Another issue related to the scarce minority data is that we might miss some key combination of variables that has high probability to be fraudulent.

## Task 2
### Question
Load the dataset "Insurance_claims.csv" and clean it as appropriate for use with machine learning algorithms. A description of the features can be found at the end of this document.

### Principle
1. Since the dataset is highly unbalanced, and the fraudulent dataset is very scarce, we don't want to drop the data label with 'fradulent'.
2. When the variables is dummies varibles, we tend to keep the Nah value as a classificaiton value rather than drop it.
3. When the variables is numerical variables, we will check how many NaN values is related to the fraudulent case. If few of them, we will drop the variable. Otherwise, we will find a way to fill in the NaN values.

In [1]:
import numpy as np
import pandas as pd
import datetime 
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns',None)

In [2]:
# read data and get a brief idea of the data
df = pd.read_csv('./materials/Insurance_claims.csv')

print(f'Data Columns:\n' + str(df.columns))
print('--------------------------------------------------------------')
print(f'Data sample:')
df.head(5)

Data Columns:
Index(['PolicyholderNumber', 'FirstPartyVehicleNumber',
       'ThirdPartyVehicleNumber', 'InsurerNotes', 'PolicyholderOccupation',
       'LossDate', 'FirstPolicySubscriptionDate', 'ClaimCause',
       'ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType',
       'ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet',
       'NumberOfPoliciesOfPolicyholder', 'FpVehicleAgeMonths',
       'EasinessToStage', 'ClaimWihoutIdentifiedThirdParty', 'ClaimAmount',
       'LossHour', 'PolicyHolderAge', 'NumberOfBodilyInjuries',
       'FirstPartyLiability', 'Fraud', 'LossAndHolderPostCodeSame'],
      dtype='object')
--------------------------------------------------------------
Data sample:


Unnamed: 0,PolicyholderNumber,FirstPartyVehicleNumber,ThirdPartyVehicleNumber,InsurerNotes,PolicyholderOccupation,LossDate,FirstPolicySubscriptionDate,ClaimCause,ClaimInvolvedCovers,DamageImportance,FirstPartyVehicleType,ConnectionBetweenParties,PolicyWasSubscribedOnInternet,NumberOfPoliciesOfPolicyholder,FpVehicleAgeMonths,EasinessToStage,ClaimWihoutIdentifiedThirdParty,ClaimAmount,LossHour,PolicyHolderAge,NumberOfBodilyInjuries,FirstPartyLiability,Fraud,LossAndHolderPostCodeSame
0,531112,715507.0,,avoids a cat and hits a garage pole With deduc...,CivilServant,02.01.19,18.06.18,CollisionWithAnimal,MaterialDamages ActLiability,,Car,,1,1,104.0,0.25,1,4624.73,8.0,45.0,0,1.0,0,1
1,87170,71164.0,,accident only expert contacts us to inform us ...,Worker,02.01.19,29.06.17,LossOfControl,MaterialDamages ActLiability,,Car,,0,3,230.0,0.5,1,1606.81,11.0,20.0,0,1.0,0,0
2,98706,442609.0,,ae Miss/ for garage change A/ setting up EAD/ ...,Worker,02.01.19,05.02.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability,,Car,,0,9,93.0,0.25,0,998.2,18.0,32.0,0,0.5,0,1
3,38240,24604.0,,"awaiting report to determine rc, no box checke...",CivilServant,02.01.19,21.01.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability ReplacementVehicle,,Car,,0,2,56.0,0.25,0,2506.92,11.0,46.0,0,0.5,0,1
4,11339,2933.0,229134.0,Insured in THIRD-PARTY formula Insured in a su...,Farmer,02.01.19,13.01.18,AccidentWithIdentifiedThirdParty,ActLiability,,Car,,0,4,110.0,0.25,0,12.0,12.0,28.0,0,0.0,0,0


Check how much NaN values in each column.
We can find that except for 'FirstPartyVehicleNumber', 'ThirdPartyVehicleNumber', 'InsurerNotes', which we might not use in our models, most the NaN values are concentrated in the 'PolicyholderOccupation', 'ClaimCause',etc which are mainly categorical variables. In this case, we could turn these NaN values into a category value in order to take account the influence of missing variables, no mather what reason they are missing.
In terms of the numeric variables, we will furtherly check how many of them are missing when the claim is fraudulent. 

In [3]:
# Check how much NaN values in each column.
print(f'Number of NaN values in each column:')
print(df.isnull().sum())

Number of NaN values in each column:
PolicyholderNumber                     0
FirstPartyVehicleNumber              495
ThirdPartyVehicleNumber            11151
InsurerNotes                        2357
PolicyholderOccupation               343
LossDate                               0
FirstPolicySubscriptionDate            0
ClaimCause                           197
ClaimInvolvedCovers                  195
DamageImportance                   10792
FirstPartyVehicleType                 12
ConnectionBetweenParties           11432
PolicyWasSubscribedOnInternet          0
NumberOfPoliciesOfPolicyholder         0
FpVehicleAgeMonths                    12
EasinessToStage                        0
ClaimWihoutIdentifiedThirdParty        0
ClaimAmount                            0
LossHour                              94
PolicyHolderAge                       36
NumberOfBodilyInjuries                 0
FirstPartyLiability                    0
Fraud                                  0
LossAndHolderPostCod

Check how much NaN values in each column when the claim is frandulent.
We can find when the claim is frandulent, most of the numerical variables are not missing, which means we could directly drop them.

In [4]:
# Check the number of missing data when Frand is True
df_fraud = df[df["Fraud"]==1]
print(f'Number of NaN values in each column when Frand is True:')
print(df_fraud.isnull().sum())

Number of NaN values in each column when Frand is True:
PolicyholderNumber                   0
FirstPartyVehicleNumber              9
ThirdPartyVehicleNumber            106
InsurerNotes                         1
PolicyholderOccupation               4
LossDate                             0
FirstPolicySubscriptionDate          0
ClaimCause                           0
ClaimInvolvedCovers                  0
DamageImportance                    96
FirstPartyVehicleType                2
ConnectionBetweenParties           102
PolicyWasSubscribedOnInternet        0
NumberOfPoliciesOfPolicyholder       0
FpVehicleAgeMonths                   2
EasinessToStage                      0
ClaimWihoutIdentifiedThirdParty      0
ClaimAmount                          0
LossHour                             1
PolicyHolderAge                      0
NumberOfBodilyInjuries               0
FirstPartyLiability                  0
Fraud                                0
LossAndHolderPostCodeSame            0
dtype: i

In conclusion, we can set NaN as a category of categorical data and generate dummy variables. And we can drop the rows that contains NaN values in numerical columns.

In [5]:
# get useful features that needed in the machine learning model
needed_columns = [ 'PolicyholderOccupation',
       'LossDate', 'FirstPolicySubscriptionDate', 'ClaimCause',
       'ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType',
       'ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet',
       'NumberOfPoliciesOfPolicyholder', 'FpVehicleAgeMonths',
       'EasinessToStage', 'ClaimWihoutIdentifiedThirdParty', 'ClaimAmount',
       'LossHour', 'PolicyHolderAge', 'NumberOfBodilyInjuries',
       'FirstPartyLiability', 'LossAndHolderPostCodeSame','Fraud']
new_df = df[needed_columns]
new_df


Unnamed: 0,PolicyholderOccupation,LossDate,FirstPolicySubscriptionDate,ClaimCause,ClaimInvolvedCovers,DamageImportance,FirstPartyVehicleType,ConnectionBetweenParties,PolicyWasSubscribedOnInternet,NumberOfPoliciesOfPolicyholder,FpVehicleAgeMonths,EasinessToStage,ClaimWihoutIdentifiedThirdParty,ClaimAmount,LossHour,PolicyHolderAge,NumberOfBodilyInjuries,FirstPartyLiability,LossAndHolderPostCodeSame,Fraud
0,CivilServant,02.01.19,18.06.18,CollisionWithAnimal,MaterialDamages ActLiability,,Car,,1,1,104.0,0.25,1,4624.73,8.0,45.0,0,1.0,1,0
1,Worker,02.01.19,29.06.17,LossOfControl,MaterialDamages ActLiability,,Car,,0,3,230.0,0.50,1,1606.81,11.0,20.0,0,1.0,0,0
2,Worker,02.01.19,05.02.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability,,Car,,0,9,93.0,0.25,0,998.20,18.0,32.0,0,0.5,1,0
3,CivilServant,02.01.19,21.01.17,AccidentWithIdentifiedThirdParty,MaterialDamages ActLiability ReplacementVehicle,,Car,,0,2,56.0,0.25,0,2506.92,11.0,46.0,0,0.5,1,0
4,Farmer,02.01.19,13.01.18,AccidentWithIdentifiedThirdParty,ActLiability,,Car,,0,4,110.0,0.25,0,12.00,12.0,28.0,0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11525,Employee,17.02.21,15.03.19,WindscreenDamage,Windscreen,,Car,,0,1,85.0,0.50,1,1010.23,0.0,56.0,0,0.0,0,0
11526,Employee,07.03.21,20.07.17,WindscreenDamage,Windscreen,,Car,,0,3,119.0,0.50,1,154.35,0.0,54.0,0,0.0,0,0
11527,Employee,15.03.21,30.09.20,WindscreenDamage,Windscreen,,Car,,0,4,139.0,0.50,1,420.25,0.0,34.0,0,0.0,0,0
11528,CivilServant,06.03.21,28.12.18,WindscreenDamage,Windscreen,,Car,,0,6,105.0,0.50,1,96.40,0.0,58.0,0,0.0,0,0


In [6]:
# clean features
# for the dummy variables with Nan, we want to keep it in the dataframe since NaN might be an important feature
dummy_columns = ['PolicyholderOccupation', 'ClaimCause','ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType','ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet']
new_df[dummy_columns] = new_df[dummy_columns].fillna('NaN')
new_df = pd.get_dummies(new_df,columns=dummy_columns,drop_first=True)
# turn the date into timestamp(get a numeric data)
new_df['LossDate'] = new_df['LossDate'].apply(lambda x:datetime.datetime.strptime(x,'%d.%M.%y').timestamp())
new_df['FirstPolicySubscriptionDate'] = new_df['FirstPolicySubscriptionDate'].apply(lambda x:datetime.datetime.strptime(x,'%d.%M.%y').timestamp())
# normalize the data
scale_list = ["ClaimAmount","LossHour","PolicyHolderAge"] 
new_df[scale_list] = scale(new_df[scale_list])
# for the numeric missing data, we need to drop it. 
# Since we have turned the NaN in categorical columns into 'str', we can directly drop the rows with NaN in the whole dataframe
new_df.dropna(inplace=True,axis=0)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df[dummy_columns] = new_df[dummy_columns].fillna('NaN')


In [7]:
new_df[new_df["Fraud"]==1]

Unnamed: 0,LossDate,FirstPolicySubscriptionDate,NumberOfPoliciesOfPolicyholder,FpVehicleAgeMonths,EasinessToStage,ClaimWihoutIdentifiedThirdParty,ClaimAmount,LossHour,PolicyHolderAge,NumberOfBodilyInjuries,FirstPartyLiability,LossAndHolderPostCodeSame,Fraud,PolicyholderOccupation_Employee,PolicyholderOccupation_Executive,PolicyholderOccupation_Farmer,PolicyholderOccupation_HeadOfCompany,PolicyholderOccupation_Merchant,PolicyholderOccupation_NaN,PolicyholderOccupation_Retired,PolicyholderOccupation_SelfEmployed,PolicyholderOccupation_Student,PolicyholderOccupation_Unemployed,PolicyholderOccupation_Worker,ClaimCause_AccidentWithIdentifiedThirdParty,ClaimCause_AccidentWithUnidentifiedThirdParty,ClaimCause_CollisionWithAnimal,ClaimCause_CollisionWithPedestrian,ClaimCause_Fire,ClaimCause_Flood,ClaimCause_ForcesOfNature,ClaimCause_Hail,ClaimCause_LegalProtection,ClaimCause_LossOfControl,ClaimCause_MultiVehicleCrash,ClaimCause_NaN,ClaimCause_Storm,ClaimCause_TheftAttempt,ClaimCause_TheftOfExteriorElements,ClaimCause_TotalTheft,ClaimCause_Vandalism,ClaimCause_WindscreenDamage,ClaimInvolvedCovers_Accessories ActLiability Theft,ClaimInvolvedCovers_Accessories MaterialDamages ActLiability,ClaimInvolvedCovers_Accessories MaterialDamages ActLiability MedicalCare,ClaimInvolvedCovers_Accessories MaterialDamages ActLiability ThirdParty,ClaimInvolvedCovers_Accessories RiderClothes Windscreen ActLiability Theft,ClaimInvolvedCovers_Accessories Theft,ClaimInvolvedCovers_Accessories Windscreen,ClaimInvolvedCovers_Accessories Windscreen ActLiability,ClaimInvolvedCovers_Accessories Windscreen ActLiability Burglary,ClaimInvolvedCovers_Accessories Windscreen ActLiability Theft,ClaimInvolvedCovers_Accessories Windscreen Theft,ClaimInvolvedCovers_ActLiability,ClaimInvolvedCovers_ActLiability Burglary,ClaimInvolvedCovers_ActLiability Burglary ReplacementVehicle,ClaimInvolvedCovers_ActLiability Burglary Theft,ClaimInvolvedCovers_ActLiability Burglary Theft ReplacementVehicle,ClaimInvolvedCovers_ActLiability Fire,ClaimInvolvedCovers_ActLiability Fire Burglary,ClaimInvolvedCovers_ActLiability Fire ReplacementVehicle,ClaimInvolvedCovers_ActLiability Fire ThirdParty,ClaimInvolvedCovers_ActLiability MaterialDamages,ClaimInvolvedCovers_ActLiability MedicalCare,ClaimInvolvedCovers_ActLiability MedicalCare ThirdParty,ClaimInvolvedCovers_ActLiability NaturalCatastrophes,ClaimInvolvedCovers_ActLiability NaturalCatastrophes Burglary,ClaimInvolvedCovers_ActLiability NaturalCatastrophes ReplacementVehicle,ClaimInvolvedCovers_ActLiability ReplacementVehicle,ClaimInvolvedCovers_ActLiability Theft,ClaimInvolvedCovers_ActLiability Theft ReplacementVehicle,ClaimInvolvedCovers_ActLiability ThirdParty,ClaimInvolvedCovers_ActLiability ThirdParty ReplacementVehicle,ClaimInvolvedCovers_ActLiability ThirdParty Theft,ClaimInvolvedCovers_Burglary,ClaimInvolvedCovers_Burglary Theft,ClaimInvolvedCovers_MaterialDamages,ClaimInvolvedCovers_MaterialDamages ActLiability,ClaimInvolvedCovers_MaterialDamages ActLiability Burglary,ClaimInvolvedCovers_MaterialDamages ActLiability Burglary ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages ActLiability Fire,ClaimInvolvedCovers_MaterialDamages ActLiability MedicalCare,ClaimInvolvedCovers_MaterialDamages ActLiability MedicalCare ThirdParty,ClaimInvolvedCovers_MaterialDamages ActLiability NaturalCatastrophes,ClaimInvolvedCovers_MaterialDamages ActLiability NaturalCatastrophes ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages ActLiability ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages ActLiability Theft,ClaimInvolvedCovers_MaterialDamages ActLiability ThirdParty,ClaimInvolvedCovers_MaterialDamages ActLiability ThirdParty ReplacementVehicle,ClaimInvolvedCovers_MaterialDamages Burglary,ClaimInvolvedCovers_MaterialDamages ThirdParty,ClaimInvolvedCovers_MedicalCare,ClaimInvolvedCovers_MedicalCare ThirdParty,ClaimInvolvedCovers_NaN,ClaimInvolvedCovers_NaturalCatastrophes,ClaimInvolvedCovers_NaturalCatastrophes ActLiability ReplacementVehicle,ClaimInvolvedCovers_Theft,ClaimInvolvedCovers_ThirdParty,ClaimInvolvedCovers_ThirdPartyMaterialDamages ActLiability,ClaimInvolvedCovers_Windscreen,ClaimInvolvedCovers_Windscreen ActLiability,ClaimInvolvedCovers_Windscreen ActLiability Burglary,ClaimInvolvedCovers_Windscreen ActLiability Burglary Theft,ClaimInvolvedCovers_Windscreen ActLiability NaturalCatastrophes,ClaimInvolvedCovers_Windscreen ActLiability Theft,ClaimInvolvedCovers_Windscreen ActLiability Theft ReplacementVehicle,ClaimInvolvedCovers_Windscreen MaterialDamages,ClaimInvolvedCovers_Windscreen MaterialDamages ActLiability,ClaimInvolvedCovers_Windscreen NaturalCatastrophes,ClaimInvolvedCovers_Windscreen Theft,DamageImportance_NaN,DamageImportance_TotalLoss,FirstPartyVehicleType_Caravan,FirstPartyVehicleType_Motorcycle,FirstPartyVehicleType_NaN,FirstPartyVehicleType_PrivateCar,ConnectionBetweenParties_SameAddress,ConnectionBetweenParties_SameBankAccount,ConnectionBetweenParties_SameEmail,ConnectionBetweenParties_SamePhone,ConnectionBetweenParties_SamePolice,PolicyWasSubscribedOnInternet_1
523,1.548115e+09,1.548115e+09,4,138.0,0.25,0,0.088401,0.639231,0.356493,0,0.0,1,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
653,1.548634e+09,1.548634e+09,1,164.0,0.25,0,0.027145,-0.035794,-1.154930,0,1.0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
692,1.546733e+09,1.547165e+09,5,134.0,0.25,0,-0.527448,0.234216,-0.892074,0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
697,1.546387e+09,1.548634e+09,1,122.0,0.25,1,-0.527448,0.234216,-0.957788,0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
772,1.546647e+09,1.548115e+09,1,148.0,0.50,1,0.515126,0.234216,1.276490,0,1.0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9768,1.578010e+09,1.577837e+09,1,133.0,0.25,0,-0.304357,0.369221,-0.563503,0,0.0,1,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
10147,1.578010e+09,1.577837e+09,1,31.0,0.25,0,0.871442,1.179251,-0.300647,0,0.0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
10166,1.578097e+09,1.577924e+09,1,101.0,0.50,1,-0.284205,1.179251,-0.957788,0,0.0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0
10181,1.578183e+09,1.578183e+09,1,232.0,0.25,0,-0.527448,1.044246,-0.760646,0,0.0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1


In [8]:
X = new_df.to_numpy().astype(np.float64)
y = new_df["Fraud"].to_numpy().reshape([-1])
print(f"X shape is:{X.shape}")
print(f"y shape is:{y.shape}")


X shape is:(11388, 122)
y shape is:(11388,)


Split the data into training and testing sets

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=112)


## Task 3
### Question
Start by creating a (deep) neural network in TensorFlow and train it on the data. Using training and validation sets, find a model with high accuracy, then evaluate it on the test set. In particular, record both the accuracy and AUC. Briefly discuss what issues you observe based on the metrics.


In [10]:
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp


In [11]:
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(0.001,0.1))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.3))
# the number of units in the hidden layer, 1 time, 2 times or 3 times of the unit number of input layer
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete(range(X.shape[1], X.shape[1]+1,X.shape[1]))) 
HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'sigmoid']))
HP_HIDDEN_LAYER_NUMBER = hp.HParam('hidden_layer_number', hp.Discrete(range(1,6)))
METRIC_CROSSENTROPY = 'binary_crossentropy'

Once we have set up our parameters and metrics, we write those into our folder with the logs:

In [12]:
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(hparams=[HP_LEARNING_RATE, HP_OPTIMIZER, HP_DROPOUT, HP_NUM_UNITS,HP_ACTIVATION,HP_HIDDEN_LAYER_NUMBER],
                      metrics = [hp.Metric(METRIC_CROSSENTROPY, display_name='CROSSENTROPY')])

2022-03-01 14:21:03.647417: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-03-01 14:21:03.648515: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-01 14:21:03.654238: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.


In [13]:
def train_model(hparams):
    early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True) # set patience to 10 to accelerate the training
    model = tf.keras.models.Sequential([
        tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
        tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=hparams[HP_ACTIVATION])]*hparams[HP_HIDDEN_LAYER_NUMBER]+[
        tf.keras.layers.Dense(1)])

    if hparams[HP_OPTIMIZER] == 'sgd':
        # Note that exploding gradients can be a big problem when running regressions, especially under SGD
        # Hence, we use "gradient clipping" with parameter alpha, which means that the gradients are manually kept between -1 and 1
        # This is of course another hyperparameter that we might tune!
        optimizer = tf.keras.optimizers.SGD(
            learning_rate=hparams[HP_LEARNING_RATE], clipvalue=1)
    elif hparams[HP_OPTIMIZER] == 'adam':
        optimizer = tf.keras.optimizers.Adam(
            learning_rate=hparams[HP_LEARNING_RATE])

    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy')

    model.fit(X_train, y_train, epochs=100, callbacks=[early_stopping_cb])
    loss,accuracy = model.evaluate(X_test, y_test)
    return loss, accuracy


In [14]:
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)
        loss,accuracy = train_model(hparams)
        tf.summary.scalar(METRIC_CROSSENTROPY, accuracy, step=1)

In [15]:
total_sessions = 20

for session in range(total_sessions):
    
    # Create hyperparameters randomly
    dropout_rate = HP_DROPOUT.domain.sample_uniform()
    num_units = HP_NUM_UNITS.domain.sample_uniform()
    optimizer = HP_OPTIMIZER.domain.sample_uniform()
    activation = HP_ACTIVATION.domain.sample_uniform()
    hidden_layer_number = HP_HIDDEN_LAYER_NUMBER.domain.sample_uniform()
    
    r = -3*np.random.rand()
    learning_rate = 10.0**r
    
    # Create a dictionary of hyperparameters
    hparams = { HP_LEARNING_RATE: learning_rate,
                HP_OPTIMIZER: optimizer,
                HP_DROPOUT: dropout_rate,
                HP_NUM_UNITS: num_units,
                HP_ACTIVATION: activation,
                HP_HIDDEN_LAYER_NUMBER: hidden_layer_number}
    
    # train the model with the chosen parameters
    run_name = "run-%d" % session
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})
    run('logs/hparam_tuning/' + run_name, hparams)

--- Starting trial: run-0
{'learning_rate': 0.022697839724559767, 'optimizer': 'adam', 'dropout': 0.23092904135025175, 'num_units': 122, 'activation': 'relu', 'hidden_layer_number': 3}
Epoch 1/100


2022-03-01 14:21:03.905225: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-03-01 14:21:03.907974: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3293700000 Hz


Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
  1/250 [..............................] - ETA: 2s - loss: 0.4820

In [None]:
dropout_rate = HP_DROPOUT.domain.sample_uniform()
num_units = HP_NUM_UNITS.domain.sample_uniform()
optimizer = HP_OPTIMIZER.domain.sample_uniform()
activation = HP_ACTIVATION.domain.sample_uniform()
hidden_layer_number = HP_HIDDEN_LAYER_NUMBER.domain.sample_uniform()

r = -3*np.random.rand()
learning_rate = 10.0**r

# Create a dictionary of hyperparameters
hparams = { HP_LEARNING_RATE: learning_rate,
            HP_OPTIMIZER: optimizer,
            HP_DROPOUT: dropout_rate,
            HP_NUM_UNITS: num_units,
            HP_ACTIVATION: activation,
            HP_HIDDEN_LAYER_NUMBER: hidden_layer_number}
print(hparams)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True) # set patience to 10 to accelerate the training
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
    tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=hparams[HP_ACTIVATION])]*hparams[HP_HIDDEN_LAYER_NUMBER]+[
    tf.keras.layers.Dense(1)])

if hparams[HP_OPTIMIZER] == 'sgd':
    # Note that exploding gradients can be a big problem when running regressions, especially under SGD
    # Hence, we use "gradient clipping" with parameter alpha, which means that the gradients are manually kept between -1 and 1
    # This is of course another hyperparameter that we might tune!
    optimizer = tf.keras.optimizers.SGD(
        learning_rate=hparams[HP_LEARNING_RATE], clipvalue=1)
elif hparams[HP_OPTIMIZER] == 'adam':
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=hparams[HP_LEARNING_RATE])

model.compile(optimizer=optimizer,
                loss='binary_crossentropy')

model.fit(X_train, y_train, epochs=100)
y_test_pred = model.predict(X_test)
acc, acc_op=tf.metrics.Accuracy(labels=y_test, predictions=y_test_pred)
pre, pre_op=tf.metrics.Precision(labels=y_test, predictions=y_test_pred)
sen, sen_op=tf.metrics.Recall(labels=y_test, predictions=y_test_pred)
spe, spe_op=tf.metrics.SpecificityAtSensitivity(labels=y_test, predictions=y_test_pred, sensitivity=0.5)



{HParam(name='learning_rate', domain=RealInterval(0.001, 0.1), display_name=None, description=None): 0.15890530604783323, HParam(name='optimizer', domain=Discrete(['adam', 'sgd']), display_name=None, description=None): 'sgd', HParam(name='dropout', domain=RealInterval(0.1, 0.3), display_name=None, description=None): 0.20329072483391986, HParam(name='num_units', domain=Discrete([122]), display_name=None, description=None): 122, HParam(name='activation', domain=Discrete(['relu', 'sigmoid']), display_name=None, description=None): 'sigmoid', HParam(name='hidden_layer_number', domain=Discrete([1, 2, 3, 4, 5]), display_name=None, description=None): 5}
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoc

AttributeError: module 'tensorflow.keras.metrics' has no attribute 'accuracy'

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(100, activation='relu')]*2+[
    tf.keras.layers.Dense(1)])
model.compile(optimizer=optimizer,
                loss='binary_crossentropy',metrics=["accuracy"])

model.fit(X_train, y_train, epochs=100)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<tensorflow.python.keras.callbacks.History at 0x7fe11d08bbe0>

In [None]:

model.evaluate(X_test,y_test)





[0.1399395763874054, 0.9909276962280273]

In [None]:
x_test_predict = model.predict(X_test)
(x_test_predict == x_test_predict).sum()/x_test_predict.sum()

-2.9182575639301143e-09

In [None]:
tuner = kt.Hyperband(train_model,
                     objective='val_loss',
                     max_epochs=10,
                     factor=3,
                     directory='logs2',
                     project_name='kt_tutorial_2')
tuner.search(X_train, y_test_predin, validation_data=(X_valid,y_valid))

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07),
              loss='binary_crossentropy',
              metrics=['accuracy',tf.keras.metrics.TruePositives(),tf.keras.metrics.AUC()])
log = model.fit(X_train, y_test_predin, epochs=100, validation_data=(X_test, y_test))

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100

KeyboardInterrupt: 

In [None]:
X.

SyntaxError: unexpected EOF while parsing (3477860531.py, line 1)