#### Import Packages

In [None]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import datetime 
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import imblearn
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score,roc_auc_score,roc_curve
import statsmodels.api as sm
import keras_tuner as kt
import seaborn as sns

# Question 1
### Question
Briefly discuss why it is more difficult to find a good classifier on such a dataset than on one where, for example, 5,000 claims are fraudulent, and 5,000 are not. In particular, consider what happens when undetected fraudulent claims are very costly to the insurance company.

### Answer
When the dataset is highly unbalanced, seen with the car-insurance data, the machine learning algorithm will accurately predict the majority class but poorly predict in minority class. Due to the feature of cross-entropy the algorithm tends to label the minority into majority if we do not adjust the threshold, which is 0.5 by default. In our scenario, the algorithm tends to predict all the claims as non-fraudulent which has significant financial implications if the fraud case is mislabeled.However, the wrong prediction will increase the false-negative rate, which will subsequently increase the cost of fraudulent claims.

Another issue with scarce minority data is that individual, or a combination of variables, that have a high probability of being fraudulent may be overlooked.

# Question 2

### Question
Load the dataset "Insurance_claims.csv" and clean it as appropriate for use with machine learning algorithms. A description of the features can be found at the end of this document.

### Principle
1. Since the dataset is highly unbalanced, and the fraudulent dataset is very scarce, we should not drop the data labelled 'fraudulent'.
2. When the variables are dummy variables, we tend to keep the NaN value as a classification value rather than drop it.
3. When the variables are numeric variables, we will check how many NaN values are related to the rows that are fraudulent. If there are few of them, we will drop the variable. Otherwise, we will find a way to fill the missing values.

### Proprocess methodology
1. load data and select useful columns
2. drop duplicates
3. drop unreasonable rows
4. fix NaN and missing values data
    - fix NaN data for categorical variables
    - fix NaN data for numeric variables
5. reformat date data
6. check the distribution of dataset with statistical analysis
    - check whether there is a significant distribution difference in every column in fraud data and non-fraud data
    - plot categorical data
7. clean features
    - dummy variables
      - PolicyholderOccupation
      - ClaimCause
      - ClaimInvolvedCovers
      - DamageImportance
      - FirstPartyVehicleType 
      - ConnectionBetweenParties
      - PolicyWasSubscribedOnInternet
    - extract features in 'ClaimInvolvedCovers'
8. split data and scale
9. show the data structure after preprocess

## 2.1 Load data and select useful columns
We are going to use the following relevant features to predict the fraud cases as they are relevant and easy to quantify:
1. PolicyholderOccupation
2. LossDate
3. FirstPolicySubscriptionDate
4. ClainType
5. ClaimInvolvedCovers
6. DamageImportance
7. FirstPartyVehicleType
8. ConnectionBetweenParties
9. PolicyWasSubscribedOnInternet
10. NumberOfPoliciesOfPolicyholder
11. FpVehicleAgeMonths
12. EasinessToStage
13. ClaimWihoutIdentifiedThirdParty
14. ClaimAmount
15. LossHour
16. PolicyHolderAge
17. NumberOfBodilyInjuries
18. FirstPartyLiability
19. LossAndHolderPostCodeSame

And we also need label:
1. Fraud

In [None]:
# read data and get a brief idea of the data
df = pd.read_csv('./materials/Insurance_claims.csv')
# get useful features that needed in the machine learning model
# TODO using nlp to insurer notes data
needed_columns = [ 'PolicyholderOccupation',
       'LossDate', 'FirstPolicySubscriptionDate', 'ClaimCause',
       'ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType',
       'ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet',
       'NumberOfPoliciesOfPolicyholder', 'FpVehicleAgeMonths',
       'EasinessToStage', 'ClaimWihoutIdentifiedThirdParty', 'ClaimAmount',
       'LossHour', 'PolicyHolderAge', 'NumberOfBodilyInjuries',
       'FirstPartyLiability', 'LossAndHolderPostCodeSame','Fraud']
df = df[needed_columns]

# show the first 5 rows, get some idea of the data structure
print(f'Data sample:')
print(df.head(5)) #TODO use sentiment analysis 
print('-----------------------------------------------------')

# get the columns name
print('Data Columns:')
print(str(df.columns))
print('-----------------------------------------------------')

# get some basic information about the data, and we found that the min number 
# of FpVehicleAgeMonths is less than 0, which does't make sense. We are going 
# to detect whether these rows are fraud cases or not. If they are all non-fraud,
# we can drop the rows with negative FpVehicleAgeMonths value. Otherwise, we will 
# create a new feature that record these abnormal rows since these unreasonable 
# values might be evidence of fraud cases.
print('Data description:')
print(df.describe())
print('-----------------------------------------------------')

# check whether there are any duplicated rows and we found there are 8 duplicated rows,
# which we are going to drop.
print('Data duplicated rows:')
print(df.duplicated().sum())
print('-----------------------------------------------------')

## 2.2 Drop the duplicated rows

In [None]:
print('The shape of the data before dropping duplicated rows:')
df_shape_before_drop = df.shape
print(df.shape)
print('-----------------------------------------------------')

# drop the duplicated rows
df.drop_duplicates(inplace=True)

print('The shape of the data after dropping duplicated rows:')
df_shape_after_drop = df.shape
print(df.shape)
print('-----------------------------------------------------')

print(f'The number of rows that are dropped: {df_shape_before_drop[0]-df_shape_after_drop[0]}')

## 2.3 Drop unreasonable values


In [None]:
# check whether unreasonable rows contain fraud cases
df_unreasonable_rows = df[df['FpVehicleAgeMonths'] < 0]
df_shape_before_drop = df_unreasonable_rows.shape
print(df_unreasonable_rows)
print('-----------------------------------------------------')
# we can find that these three rows are not fraud cases. 
# Since we have enough non-fraud data, we can drop these rows.
df_shape_before_drop = df.shape
df.drop(df_unreasonable_rows.index, inplace=True)
df_shape_after_drop = df.shape

print(f'The number of rows that are dropped: {df_shape_before_drop[0]-df_shape_after_drop[0]}')

## 2.4 Deal with NaN data
Check how many NaN values are in each column.

We can find that except for 'FirstPartyVehicleNumber', 'ThirdPartyVehicleNumber', and 'InsurerNotes', which might not be used in the models, most the NaN values are concentrated in 'PolicyholderOccupation', and 'ClaimCause', which are categorical variables. In this case, these NaN values will be converted into a category value in order to account for the influence of the missing variables, regardless of why they are missing.

Concerning the numeric variables, we will check how many of them are missing when the claim is fraudulent or not.

In [None]:
# Check how much NaN values in each column.
print(f'Number of NaN values in each column:') 
print(df.isnull().sum())

We find that in the Fraud case, there is a lot of data missing in categorical variables, but few in numeric variables.

As a result, we can set NaN as a category of categorical data and generate dummy variables. Finally, we will drop the rows that contain NaN values in numerical columns.

In [None]:
# Check the number of missing data when Frand is True
df_fraud = df[df["Fraud"]==1]
print(f'Number of NaN values in each column when Frand is True:') 
print(df_fraud.isnull().sum())

In [None]:
df_non_frand = df[df["Fraud"]==0]
print(f'Number of NaN values in each column when Frand is False:')
print(df_non_frand.isnull().sum())

Fill the missing data in categorical columns with string NaN and make it a category

In [None]:
dummy_columns = ['PolicyholderOccupation', 'ClaimCause','ClaimInvolvedCovers', 'DamageImportance', 'FirstPartyVehicleType','ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet']
df[dummy_columns] = df[dummy_columns].fillna('NaN')

Drop the missing data in numerical columns

In [None]:
df_shape_before_drop = df.shape
df_fraud_shape_before_drop = df[df["Fraud"]==1].shape
df.dropna(subset=["LossHour","PolicyHolderAge","FpVehicleAgeMonths"],inplace=True)
df_shape_after_drop = df.shape
df_fraud_shape_after_drop = df[df["Fraud"]==1].shape
print(f'The number of rows that are dropped: {df_shape_before_drop[0]-df_shape_after_drop[0]}')
print(f'The number of rows that are dropped when Frand is True: {df_fraud_shape_before_drop[0]-df_fraud_shape_after_drop[0]}')

After these steps, we don't have any missing data.

In [None]:
print('The number of NaN values in each column:')
df.isna().sum()
print('-----------------------------------------------------')

print('The shape of final datasets:')
print(df.shape)
print('-----------------------------------------------------')

print("Data sample:")
print(df.head(5))
print('-----------------------------------------------------')

## 2.5 Reformat date data
We tend to consider the date data as an important feature considering people might have different tendencies to commit fraud, depending on the time period. This argument is supported by Pascal Blanque (2002), who asserted that the economic crisis is a significant reason to commit fraud.

However, the data is presented in datetime format, which cannot be used in machine learning models, so it will be converted into a timestamp.

In [None]:
# turn the date into timestamp
df['LossDate'] = df['LossDate'].apply(lambda x:datetime.datetime.strptime(x,'%d.%M.%y').timestamp())
df['FirstPolicySubscriptionDate'] = df['FirstPolicySubscriptionDate'].apply(lambda x:datetime.datetime.strptime(x,'%d.%M.%y').timestamp())

## 2.6 Check the distribution of dataset with statistical analysis
Based on the distribution of the numerical data, we can see that there are some differences between the fraud and non-fraud cases, especially in the distribution of 'FpVehicleAgeMonths' and 'claimAmount'. The fraud cases tend to have a more dispersed distribution in 'FpVehicleAgeMonths' and higher value in 'claimAmount'. 

In the categorical data some interesting patterns emerged. For example, in the fraud cases, the percentage of 'TotalLoss' is much higher than the non-fraud cases, which is understandable. When comparing fraud and non-fraud, fraud claims tended to have a high percentage of the same address details, whether it be email addresses, bank accounts, or phone numbers.

In [None]:
# get the datasets that are fraud cases and non-fraud cases
df_fraud = df[df['Fraud'] == 1]
df_non_fraud = df[df['Fraud'] == 0]

# get the mean of both datasets
print('The mean of fraud datasets:')
print('-----------------------------------------------------')
df_fraud_mean = df_fraud.mean()
print(df_fraud_mean)
print('-----------------------------------------------------')

print('The mean of non-fraud datasets:')
print('-----------------------------------------------------')
df_non_fraud_mean = df_non_fraud.mean()
print(df_non_fraud_mean)
print('-----------------------------------------------------')

In [None]:
# plot the distribution of fraud datasets
print('The distribution numerical data of fraud datasets:')
df_fraud.iloc[:,:-1].hist(bins=50, figsize=(20,15),density=True, xlabelsize=10, ylabelsize=10) #TODO add title
plt.show()
print('-----------------------------------------------------')

# plot the distribution of non-fraud datasets
print('The distribution numerical data of non-fraud datasets:')
df_non_fraud.iloc[:,:-1].hist(bins=50, figsize=(20,15),density=True, xlabelsize=10, ylabelsize=10)
plt.show()

In [None]:
dummy_columns = ['PolicyholderOccupation', 'ClaimCause', 'DamageImportance', 'FirstPartyVehicleType','ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet']

names = locals()
for i, col in enumerate(dummy_columns):
    names[f"ax_{i}"] = df.groupby(['Fraud'])[col].value_counts(normalize=True).unstack().plot(kind='bar', stacked=True, figsize=(20,10),title=f"Fig {i+1}")

## 2.7 Clean features
### 2.7.1 Deal with outliers 
The presence of outliers in the data is a major problem for machine learning algorithms (Chakravarty, et al., 2020).

Based on the histogram plots, the most prominent outliers were seen from ClaimAmount and FpVehicleAgeMonths for non-fraud cases. We decided to retain these data points because we would want to train the model to detect these rare cases correctly since we assume that fraudsters are very unlikely to claim huge amounts for a very old car to blend in with non-fraud cases.
     
### 2.7.2 Dummy variables

In [None]:
# Get dummy variables for categorical data
dummy_columns = ['PolicyholderOccupation', 'ClaimCause', 'DamageImportance', 'FirstPartyVehicleType','ConnectionBetweenParties', 'PolicyWasSubscribedOnInternet']
# Dummy variables for categorical data
df = pd.get_dummies(df,columns=dummy_columns,drop_first=True)
df.head()

In [None]:
# Extract ClaimInvolvedCovers data
# Get all covers
all_unique =  df["ClaimInvolvedCovers"].unique().tolist()
all_covers = str.join(' ', all_unique) # join the string to list
all_covers_set = set(all_covers.split()) # use set to drop duplicate covers
print(all_covers_set)

for cover in all_covers_set:
    df[f"ClaimInvolvedCovers_{cover}"] = df["ClaimInvolvedCovers"].apply(lambda x: 1 if cover in x else 0)
df = df.drop(columns=['ClaimInvolvedCovers'])

## 2.8 Split the data and scale
We are going to use MinMaxScaler in this case, since we can find the dataset does not follow a Gaussian distribution. In addition, we will fit the MinMaxScaler model with training and validation dataset and apply the model to the test dataset to ensure we are blind to the data information before we test the data.

### 2.8.1 Process of splitting and scaling data:
1. Turn the dataframe into X and y array
1. Split the train and test
2. Fit the MinMaxScaler to train data and then apply the model to test data
3. Split the train data into train and validation data

In [None]:
# Turn the dataframe into X and y array
X = df.drop(['Fraud'],axis=1,inplace=False).to_numpy()
y = df[['Fraud']].to_numpy().flatten()
print('The shape of X:')
print(X.shape)
print('-----------------------------------------------------')
print('The shape of y:')
print(y.shape)
print('-----------------------------------------------------')

In [None]:
# Split the train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=192,test_size=0.2)

# Fit the MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Split the train datasets into train and validation datasets
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,random_state=192,test_size=0.25)

## Show the data structure of the split datasets
print('-----------------------------------------------------')
print('Shape of X_train:')
print(X_train.shape)
print('-----------------------------------------------------')
print('Shape of y_train:')
print(y_train.shape)
print('-----------------------------------------------------')
print('Shape of X_val:')
print(X_val.shape)
print('-----------------------------------------------------')
print('Shape of y_val:')
print(y_val.shape)
print('-----------------------------------------------------')
print('Shape of X_test:')
print(X_test.shape)
print('-----------------------------------------------------')
print('Shape of y_test:')
print(y_test.shape)
print('-----------------------------------------------------')

# Question 3

### Question
Start by creating a (deep) neural network in TensorFlow and train it on the data. Using training and validation sets, find a model with high accuracy, then evaluate it on the test set. In particular, record both the accuracy and AUC. Briefly discuss what issues you observe based on the metrics.

In [None]:
%load_ext tensorboard

In [None]:
# rm -rf ./logs100/

## 3.1 Set the range of hyperparameters
We are going to explore the performance of distinct neural network when training with different hyperparameters using the following methodology:
1. Learning rate:
    Usually, a lower learning rate will result in a better fit model and higher learning rate will accelerate the training process but result in a underfitting model. In this case, we set the range of learning rate from 10**(0.001) to 10**(0.1)
2. Optimizer:
    We are going to try both AdamOptimizer and sgd optimizer. In most cases, SGD sacrifices efficiency for better convergence quality.
3. Dropout:
    We are going to use  HP_DROPOUT to set the dropout rate and HP_WHETHER_DROPOUT to decide whether to drop. The dropout rate ranges from 0.1 to 0.3 to see how the performance of the model changes.
4. Number of neurons in the hidden layer:
    In each hidden layer, we are going to use the same number of neurons. In terms of the number of neurons, we are going to use a rule of thumb, in which we can calculate the number of neurons as:
    
    $N_h = \frac{N_i}{\alpha * (N_i+N_o))})$
    
    Ni = number of input neurons.
    
    No = number of output neurons.
    
    Ns = number of samples in training data set.
    
    α = an arbitrary scaling factor usually 2-10.

    In our case, we are going to set $\alpha$ randomly to 2,3,4
5. Number of hidden layers:
    Normally, if the model is very simple, one hidden layer is enough according to Reed and Marks argument (Reed & Marks, 1999).
    ```
    Since a single sufficiently large hidden layer is adequate for approximation of most functions, why would anyone ever use more? One reason hangs on the words “sufficiently large”. Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer-solution is very inefficient compared to solutions with more layers.
    ``` 

    However, in terms of a complex model, Goodfellow et al. (2016) argued that: 
    ```
    Specifically, the universal approximation theorem states that a feedforward network with a linear output layer and at least one hidden layer with any “squashing” activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units.
    ——Deep learning, 2016
    ```
    Since our model might not be able to be explained by a linear function, we are going to set the range of hidden layers from 1 to 3.
6. Activation
    We are going to compare the performance of the following activation functions:
    1. sigmoid
    2. relu

In [None]:
HP_LEARNING_RATE = hp.HParam('learning_rate', hp.RealInterval(0.001,0.1))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
HP_WHETHER_DROPOUT = hp.HParam('whether_dropout', hp.Discrete([True, False]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.3))
# the number of units in the hidden layer, 1 time, 2 times or 3 times of the unit number of input layer
BASE_NUM_UNITS = X_train.shape[0]/(X_train.shape[1] + 1)
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([int(BASE_NUM_UNITS/2), int(BASE_NUM_UNITS/3), int(BASE_NUM_UNITS/4),int(BASE_NUM_UNITS/5)])) 
HP_ACTIVATION = hp.HParam('activation', hp.Discrete(['relu', 'sigmoid']))
HP_HIDDEN_LAYER_NUMBER = hp.HParam('hidden_layer_number', hp.Discrete(range(1,4)))
METRIC_CROSSENTROPY = 'binary_crossentropy'
EPOCHS = 100

Once we have set up our parameters and metrics, we write those into our folder with the logs:

In [None]:
with tf.summary.create_file_writer('logs100/hparam_tuning').as_default():
    hp.hparams_config(hparams=[HP_LEARNING_RATE, HP_OPTIMIZER, HP_DROPOUT, HP_NUM_UNITS,HP_ACTIVATION,HP_HIDDEN_LAYER_NUMBER],
                      metrics = [hp.Metric(METRIC_CROSSENTROPY, display_name='CROSSENTROPY')])

In [None]:
def train_model(hparams,X_train=X_train,y_train=y_train,X_test=X_test,y_test=y_test):
    tf.keras.backend.clear_session()
    tf.random.set_seed(192)
    early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True) # set patience to 10 to accelerate the training
    if hparams[HP_WHETHER_DROPOUT] == True:
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
            tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=hparams[HP_ACTIVATION])]*hparams[HP_HIDDEN_LAYER_NUMBER]+[
            tf.keras.layers.Dense(1,activation='sigmoid')])
    else:
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(hparams[HP_NUM_UNITS], activation=hparams[HP_ACTIVATION])]*hparams[HP_HIDDEN_LAYER_NUMBER]+[
            tf.keras.layers.Dense(1,activation='sigmoid')])
    if hparams[HP_OPTIMIZER] == 'sgd':
        # Note that exploding gradients can be a big problem when running regressions, especially under SGD
        # Hence, we use "gradient clipping" with parameter alpha, which means that the gradients are manually kept between -1 and 1
        # This is of course another hyperparameter that we might tune!
        optimizer = tf.keras.optimizers.SGD(
            learning_rate=hparams[HP_LEARNING_RATE], clipvalue=1)
    elif hparams[HP_OPTIMIZER] == 'adam':
        optimizer = tf.keras.optimizers.Adam(
            learning_rate=hparams[HP_LEARNING_RATE])

    # random_seed = 192
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy')

    model.fit(X_train, y_train, epochs=EPOCHS, validation_data=(X_val,y_val) ,callbacks=[early_stopping_cb],)
    loss = model.evaluate(X_test, y_test)
    x_test_predict = model.predict(X_test)
    # calculate the roc
    roc_score = roc_auc_score(y_test, x_test_predict)
    # calculate the accuracy suppose the threshold is 0.5
    x_test_predict_binary = np.where(x_test_predict>0.5,1,0)
    accuracy = accuracy_score(y_test, x_test_predict_binary)
    # calculate the sensitivity
    sensitivity = recall_score(y_test, x_test_predict_binary)
    return loss, accuracy,roc_score,sensitivity

In [None]:
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)
        
        loss, accuracy,roc_score,sensitivity = train_model(hparams) #TODO whether I did it right
        tf.summary.scalar('ACCUARY', accuracy, step=1)
        tf.summary.scalar('LOSS', loss, step=1)
        tf.summary.scalar('ROC', roc_score, step=1)
        tf.summary.scalar('SENSITIVITY', sensitivity, step=1)

## 3.2 Train the model and view on TensorBoard

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(192)

# 100 total sessions
total_sessions = 100 #FIXME: change this to the number of sessions you want to run, and fix the issue in the metrics

for session in range(total_sessions):
    
    # Create hyperparameters randomly
    whether_dropout = HP_WHETHER_DROPOUT.domain.sample_uniform()
    dropout_rate = HP_DROPOUT.domain.sample_uniform()
    num_units = HP_NUM_UNITS.domain.sample_uniform()
    optimizer = HP_OPTIMIZER.domain.sample_uniform()
    activation = HP_ACTIVATION.domain.sample_uniform()
    hidden_layer_number = HP_HIDDEN_LAYER_NUMBER.domain.sample_uniform()
    
    
    r = -3*np.random.rand()
    learning_rate = 10.0**r
    
    # Create a dictionary of hyperparameters
    hparams = { HP_LEARNING_RATE: learning_rate,
                HP_OPTIMIZER: optimizer,
                HP_WHETHER_DROPOUT: whether_dropout,
                HP_DROPOUT: dropout_rate,
                HP_NUM_UNITS: num_units,
                HP_ACTIVATION: activation,
                HP_HIDDEN_LAYER_NUMBER: hidden_layer_number}
    
    # train the model with the chosen parameters
    run_name = "run-%d" % session
    print('--- Starting trial: %s' % run_name)
    print({h.name: hparams[h] for h in hparams})
    run('logs100/hparam_tuning/' + run_name, hparams)

In [None]:
%tensorboard --logdir logs100

## 3.3 Evalaute the model

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(192)

dropout = 0.12292
learning_rate = 0.01
number_units = 34

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(number_units, activation='relu'),
    tf.keras.layers.Dense(number_units, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer=optimizer, loss='binary_crossentropy')

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True) # set patience to 10 to accelerate the training
# log = model.fit(X_train, y_train, epochs=100, validation_data =(X_val, y_val), callbacks=[early_stopping_cb])
log = model.fit(X_train, y_train, epochs=25, validation_data =(X_val, y_val))

model.save('./models/question3.h5')

In [None]:
def create_plot(log):
    # plt.plot(log.history['accuracy'],label = "training accuracy",color='green')
    plt.plot(log.history['loss'],label = "training loss",color='darkgreen')
    # plt.plot(log.history['val_accuracy'], label = "validation accuracy",color='grey')
    plt.plot(log.history['val_loss'], label = "validation loss",color='darkblue')
    plt.legend()
    plt.show()
    
create_plot(log)

In [None]:
test_loss = model.evaluate(X_test, y_test)
y_test_predict = model.predict(X_test).flatten()

In [None]:
# calculate the roc
roc_score = roc_auc_score(y_test, y_test_predict)
print(f"test_loss: {test_loss}")
print(f"roc: {roc_score}")

In [None]:
def calculate_cost(fn,fp):
    return 10*fn+fp

cost_lost = {}
for i in np.linspace(0,0.1,501):
    pred_y = np.where(y_test_predict.flatten()> i, 1, 0)
    cm = confusion_matrix(y_test,pred_y)
    fn,fp = cm[1][0],cm[0][1]
    # cost_lost["Threshold: "+str(i)] = calculate_cost(fn,fp)
    cost_lost[i] = calculate_cost(fn,fp)

optimal_threshold = min(cost_lost, key=cost_lost.get)

In [None]:
pred_y = np.where(y_test_predict > optimal_threshold, 1, 0)
cm = confusion_matrix(y_test, pred_y)
print(f"Confusion matrix is :" )
print(cm)
accuracy_rate = (cm[0,0] + cm[1,1])/np.sum(cm)
print(f"Accuracy rate is {accuracy_rate}")
# calculate the sensitivity 
sensitivity = cm[1,1]/(cm[1,1] + cm[1,0])
print(f"Sensitivity is {sensitivity}")

fig, ax = plt.subplots(figsize=(10,5))
sns.heatmap(cm, annot=True, fmt=".0f")
plt.show()

## 3.4 Briefly discuss what issues you observe based on the metrics

The dataset is too imbalanced for the model to be able to learn the relationship between the features and the target variable because the `loss` does not decrease.  

# Question 4
### Question
The file "SMOTE.ipynb" explains the process in detail and shows how to change the dataset with an example. You can copy and adjust the code to make it work within your analysis. You can adjust the "sampling_strategy" parameters as you see fit, particularly if
you want to fine-tune your model in part 5.

### Principle
In this part, we are going to try both oversampling and undersampling.

Here is the procedure to process the data
1. Split the raw data into train and test
2. Fit the MinMaxScaler to train data and then apply the model to test data
3. Oversample and undersample the train data
4. Split the train data into train and validation data
5. Show the data distribution before and after oversampling and undersampling

The reason to do so is that we want to make sure that the training and validation data are similar. When we test our model, we tend to use real test data instead of the simulated test data.

## 4.1 Split the raw data into train and test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=192,test_size=0.2)

## 4.2 Fit the MinMaxScaler to train data and apply the scaler to test data

In [None]:
# Fit the MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## 4.3 Oversample and undersample the train data
We will successively try to oversample the minority class to 10%, 30%, 50% of the size of all datasets.

In [None]:
# Oversampling
# k_neighbors set to 20 to make sure that the result is more general
over = imblearn.over_sampling.SMOTE(sampling_strategy=0.1, random_state = 483, k_neighbors=20)  
X_over_synth_10, y_over_synth_10 = over.fit_resample(X_train, y_train)
over = imblearn.over_sampling.SMOTE(sampling_strategy=0.5, random_state = 483, k_neighbors=20)
X_over_synth_30, y_over_synth_30 = over.fit_resample(X_train, y_train)
over = imblearn.over_sampling.SMOTE(sampling_strategy=1, random_state = 483, k_neighbors=20)
X_over_synth_50, y_over_synth_50 = over.fit_resample(X_train, y_train)

print("Percentage of 1 in y_over_synth10:", Counter(y_over_synth_10)[1]/len(y_over_synth_10))
print("Percentage of 1 in y_over_synth30:", Counter(y_over_synth_30)[1]/len(y_over_synth_30))
print("Percentage of 1 in y_over_synth50:", Counter(y_over_synth_50)[1]/len(y_over_synth_50))

We will successively try to undersample the minority class to 10%, 30%,50% of the size of majority class.

In [None]:
under = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=0.1, random_state = 483)  
X_under_synth_10, y_under_synth_10 = under.fit_resample(X_train, y_train)
under = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=0.5, random_state = 483)
X_under_synth_30, y_under_synth_30 = under.fit_resample(X_train, y_train)
under = imblearn.under_sampling.RandomUnderSampler(sampling_strategy=1, random_state = 483)
X_under_synth_50, y_under_synth_50 = under.fit_resample(X_train, y_train)

print("Percentage of 1 in y_under_synth_10:", Counter(y_under_synth_10)[1]/len(y_under_synth_10))
print("Percentage of 1 in y_under_synth_30:", Counter(y_under_synth_30)[1]/len(y_under_synth_30))
print("Percentage of 1 in y_under_synth_50:", Counter(y_under_synth_50)[1]/len(y_under_synth_50))

Plot to illustrate before (left plot) and after (right plot) undersampling of majority class. Green dots represent the minority class, and Grey dots represent the majority class.

In [None]:
# set plot xticks
x = np.arange(0, len(X_under_synth_10), 1)
plt.xticks(x, x)
plt.subplot(1,2,1)
plt.scatter(X_train[:, 2], X_train[:, 3], c=y_train, s=10, cmap="Accent_r")
plt.subplot(1,2,2)
plt.scatter(X_under_synth_10[:, 2], X_under_synth_10[:, 3], c=y_under_synth_10, s=10, cmap="Accent_r")

## 4.4 Split the train data into train and validation data

In [None]:
X_over_synth_10_train, X_over_synth_10_val, y_over_synth_10_train, y_over_synth_10_val = train_test_split(
    X_over_synth_10, y_over_synth_10, random_state=192, test_size=0.25)
X_over_synth_30_train, X_over_synth_30_val, y_over_synth_30_train, y_over_synth_30_val = train_test_split(
    X_over_synth_30, y_over_synth_30, random_state=192, test_size=0.25)
X_over_synth_50_train, X_over_synth_50_val, y_over_synth_50_train, y_over_synth_50_val = train_test_split(
    X_over_synth_50, y_over_synth_50, random_state=192, test_size=0.25)
X_under_synth_10_train, X_under_synth_10_val, y_under_synth_10_train, y_under_synth_10_val = train_test_split(
    X_under_synth_10, y_under_synth_10, random_state=192, test_size=0.25)
X_under_synth_30_train, X_under_synth_30_val, y_under_synth_30_train, y_under_synth_30_val = train_test_split(
    X_under_synth_30, y_under_synth_30, random_state=192, test_size=0.25)
X_under_synth_50_train, X_under_synth_50_val, y_under_synth_50_train, y_under_synth_50_val = train_test_split(
    X_under_synth_50, y_under_synth_50, random_state=192, test_size=0.25)

print("The length of X_over_synth_10_train is:", len(X_over_synth_10_train))
print("The length of X_over_synth_10_val is:", len(X_over_synth_10_val))
print("The length of X_over_synth_30_train is:", len(X_over_synth_30_train))
print("The length of X_over_synth_30_val is:", len(X_over_synth_30_val))
print("The length of X_over_synth_50_train is:", len(X_over_synth_50_train))
print("The length of X_over_synth_50_val is:", len(X_over_synth_50_val))
print("The length of X_under_synth_10_train is:", len(X_under_synth_10_train))
print("The length of X_under_synth_10_val is:", len(X_under_synth_10_val))
print("The length of X_under_synth_30_train is:", len(X_under_synth_30_train))
print("The length of X_under_synth_30_val is:", len(X_under_synth_30_val))
print("The length of X_under_synth_50_train is:", len(X_under_synth_50_train))
print("The length of X_under_synth_50_val is:", len(X_under_synth_50_val))

# Question 5
### Question
 Create a new (deep) neural network and train it on your enhanced dataset. Use training and validation sets derived from the enhanced dataset to find a model with high accuracy. Evaluate your final model on a test set consisting only of original data. Again, record the accuracy and AUC. Briefly discuss the changes you would expect in the metrics and the actual changes you observe. Would you say that you are now doing better at identifying fraudulent claims?

### Methodology:
1. Test all synthetic datasets on basic model to select specific SMOTE technique
2. Selected 10% oversampling
3. Use tuner to train model and pick best model
4. Plot training and validation loss
5. Evaluate model using test set
6. Pick optimal threshold using cost = 10*FN + FP
7. Plot ROC curve and confusion matrix heatmap

### Principle
To simplify the problem, and save computational time, we will train the synthetic data to a very simple neural network, and then compare the performance of this distinct synthetic data.
After doing this, we are going to select the best performing dataset and then we can use a tuner to train the model for it.

The neural network structure is as follows:
1. Input layer
2. 2 hidden layers, in which the number of neurons in each layer is equal to 60 and 'relu' function is used
3. No dropout layer
4. Output layer, with 'sigmoid' activation function.
5. Optimizer: Adam

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(192)

## 5.1 Use a controlled model to compare the performance of the different sampling strategies

In [None]:
class TrainModel:
    def __init__(self, X_train, y_train, X_val=None, y_val=None,X_test=None,y_test=None, epochs=100,early_stopping:bool=False,patience:int=10):
        tf.keras.backend.clear_session()
        tf.random.set_seed(192)
        self.X_train = X_train
        self.y_train = y_train
        self.X_val = X_val
        self.y_val = y_val
        self.X_test = X_test
        self.y_test = y_test
        self.epochs = epochs
        self.early_stopping= early_stopping
        self.patience = patience
        self.simple_model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(60, activation='relu'),
            tf.keras.layers.Dense(60, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ])

    def compile(self):
        self.simple_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    def fit(self): # We fit the model with train and validation data becasue validation data can tell us when to stop training
        if self.early_stopping:
            early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=self.patience, restore_best_weights=True) # set patience to 10 to accelerate the training
            self.log = self.simple_model.fit(self.X_train, self.y_train,validation_data= (self.X_val, self.y_val),callbacks=[early_stopping_cb], epochs=self.epochs)
        else:
            self.log = self.simple_model.fit(self.X_train, self.y_train,validation_data= (self.X_val, self.y_val),epochs=self.epochs)

    def evaluate(self): # evalute it on the test dataset, since we are going to predict the raw data like test one
        loss = self.simple_model.evaluate(self.X_test, self.y_test)
        x_test_predict = self.simple_model.predict(self.X_test)
        # calculate the roc
        roc_score = roc_auc_score(self.y_test, x_test_predict)
        # calculate the accuracy suppose the threshold is 0.5
        x_test_predict_binary = np.where(x_test_predict>0.5,1,0)
        accuracy = accuracy_score(self.y_test, x_test_predict_binary)
        # calculate the sensitivity
        sensitivity = recall_score(self.y_test, x_test_predict_binary)
        return {'loss': loss, 'accuracy': accuracy, 'sensitivity': sensitivity, 'roc': roc_score,'modle':self.simple_model}
    
    def draw_the_loss_curve(self):
        # plt.plot(log.history['accuracy'],label = "training accuracy",color='green')
        plt.plot(self.log.history['loss'],label = "training loss",color='darkgreen')
        # plt.plot(log.history['val_accuracy'], label = "validation accuracy",color='grey')
        plt.plot(self.log.history['val_loss'], label = "validation loss",color='darkblue')
        plt.legend()
        plt.show()

    def run(self):
        self.compile()
        self.fit()
        # self.draw_the_loss_curve()
        return self.evaluate()

In this case, if we apply the model to the test data (real data), and select ROC as the metrics to select the sampling strategy, we can find that in the oversampling strategy, the ROC rate is higher than undersampling strategy. Since there is a slight difference between about 10%, 30% and 50% oversampling, we will use the 10% oversampling strategy.

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(192)
res_dict = {}
for X_train,y_train,X_val,y_val,name in zip([X_over_synth_10_train,X_over_synth_30_train,X_over_synth_50_train,X_under_synth_10_train,X_under_synth_30_train,X_under_synth_50_train], 
                           [y_over_synth_10_train,y_over_synth_30_train,y_over_synth_50_train,y_under_synth_10_train,y_under_synth_30_train,y_under_synth_50_train],
                           [X_over_synth_10_val,X_over_synth_30_val,X_over_synth_50_val,X_under_synth_10_val,X_under_synth_30_val,X_under_synth_50_val],
                           [y_over_synth_10_val,y_over_synth_30_val,y_over_synth_50_val,y_under_synth_10_val,y_under_synth_30_val,y_under_synth_50_val],
                           ["over_synth_10","over_synth_30","over_synth_50","under_synth_10","under_synth_30","under_synth_50"]):
    tm = TrainModel(X_train,y_train,X_val, y_val, X_test = X_test, y_test =y_test,early_stopping=True)
    res = tm.run()
    res_dict[name] = (res)

for key,value in res_dict.items():
    print(f"{key}:{value['roc']}")

## 5.2 Train model using Tuner

In [None]:
def train_model(hp):
    tf.keras.backend.clear_session()
    tf.random.set_seed(192)
    number_units = hp.Int('number_units', min_value=20, max_value=80, step=20)
    dropout_rate = hp.Float('dropout_rate', min_value = 0.1, max_value=0.3) 
    # optim_algo = hp.Choice('optimizer', values=['sgd','adam']) 
    optim_algo = 'adam'
    learning_rate = hp.Float('learning_rate', min_value = 0.001, max_value=1, sampling='log') 
    number_layers = hp.Int('number_layers', min_value=1, max_value=3) # hidden layers
    activation = hp.Choice('activation', values=['relu','sigmoid'])
    whether_dropout = hp.Choice('whether_dropout', values=[True,False])

    if whether_dropout == True:
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dropout(dropout_rate),
            tf.keras.layers.Dense(number_units, activation=activation)]*number_layers+[ # number_layers is the number of hidden layers
            tf.keras.layers.Dense(1,activation='sigmoid')])
    else:
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(number_units, activation=activation)]*number_layers+[ # number_layers is the number of hidden layers
            tf.keras.layers.Dense(1,activation='sigmoid')])

    if optim_algo== 'sgd':
        # Note that exploding gradients can be a big problem when running regressions, especially under SGD
        # Hence, we use "gradient clipping" with parameter alpha, which means that the gradients are manually kept between -1 and 1
        # This is of course another hyperparameter that we might tune!
        optimizer = tf.keras.optimizers.SGD(
            learning_rate=learning_rate, clipvalue=1)
    elif optim_algo == 'adam':
        optimizer = tf.keras.optimizers.Adam(
            learning_rate=learning_rate)

    # random_seed = 192
    model.compile(optimizer=optimizer, loss='binary_crossentropy')
    return model


In [None]:
# rm log file
# ! rm -rf ./logs_over_synth_10/

In [None]:
# Run the best model for X_over_synth_10_train
tuner = kt.Hyperband(train_model,
                     objective='val_loss',
                     max_epochs=10,
                     factor=3,
                     directory='logs_over_synth_10',
                     project_name='kt_tutorial_over_synth_10')
tf.keras.backend.clear_session()
tf.random.set_seed(192)
tuner.search(X_over_synth_10_train, y_over_synth_10_train, epochs=10, validation_data =(X_over_synth_10_val,y_over_synth_10_val))

In [None]:
best_hps = tuner.get_best_hyperparameters()[0]
best_model = tuner.hypermodel.build(best_hps)

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(192)
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True) # set patience to 10 to accelerate the training
log = best_model.fit(X_over_synth_10_train, y_over_synth_10_train, epochs=100, validation_data =(X_over_synth_10_val,y_over_synth_10_val),callbacks=[early_stopping_cb])

## 5.3 Evaluate model

In [None]:
def create_plot(log):
    # plt.plot(log.history['accuracy'],label = "training accuracy",color='green')
    plt.plot(log.history['loss'],label = "training loss",color='darkgreen')
    # plt.plot(log.history['val_accuracy'], label = "validation accuracy",color='grey')
    plt.plot(log.history['val_loss'], label = "validation loss",color='darkblue')
    plt.legend()
    plt.show()
create_plot(log)

In [None]:
# save the best model in question 5
# best_model.save('./best_model_question_5.h5')

# load the best model in question 5
best_model = tf.keras.models.load_model('./models/best_model_question_5.h5')

Evaluate the model on the test dataset, and we can find that our ROC score is about 0.82 which is good. In the next part, we are going to select the threshold to predict the fraud.

In [None]:
loss = best_model.evaluate(X_test, y_test)
y_test_predict = best_model.predict(X_test).flatten()
# calculate the roc
roc_score = roc_auc_score(y_test, y_test_predict)
print(f"test_loss: {loss}")
print(f"roc: {roc_score}")

Since it is more costly to miss the fraud cases, and less costly to make a false alarm, we are going to suppose that the cost missed fraud is 10 times more than the cost of false alarm (This ratio can be adjusted to the real case).
Thus, our cost function is:

```Cost = 10*FN + FP```

In [52]:
def calculate_cost(fn,fp):
    return 10*fn+fp

cost_lost = {}
for i in np.linspace(0,0.1,501):
    pred_y = np.where(y_test_predict.flatten()> i, 1, 0)
    cm = confusion_matrix(y_test,pred_y)
    fn,fp = cm[1][0],cm[0][1]
    # cost_lost["Threshold: "+str(i)] = calculate_cost(fn,fp)
    cost_lost[i] = calculate_cost(fn,fp)

optimal_threshold = min(cost_lost, key=cost_lost.get)
print(f"optimal_threshold is {optimal_threshold}")

optimal_threshold is 0.0936


In [None]:
pred_y = np.where(y_test_predict > optimal_threshold, 1, 0)
cm = confusion_matrix(y_test, pred_y)
print(f"Confusion matrix is :" )
print(cm)
accuracy_rate = (cm[0,0] + cm[1,1])/np.sum(cm)
print(f"Accuracy rate is {accuracy_rate}")
# calculate the sensitivity 
sensitivity = cm[1,1]/(cm[1,1] + cm[1,0])
print(f"Sensitivity is {sensitivity}")

fig, ax = plt.subplots(figsize=(10,5))
sns.heatmap(cm, annot=True, fmt=".0f")
plt.show()

## 5.4 Briefly discuss the changes you would expect in the metrics and the actual changes you observe. Would you say that you are now doing better at identifying fraudulent claims?

Utilizing the same cost function, we can find that the sensitivity score is higher than the imbalanced dataset. We are performing better at identifying fraudulent claims.

# Question 6
### Question
Our second approach will be to use an autoencoder to learn what "normal" (non-fraudulent) data "looks like."

1. Prepare dataset for autoencoder

   - Using the original data, create a training set that contains only non fraudulent claims
   - As well as validation and test sets that contain non fraudulent and fraudulent claims. 
   - Make sure to spread fraudulent claims evenly across validation and test sets.

2. Create an autoencoder using TensorFlow

   - Ensure that the middle hidden layer has fewer neurons than your input features. 
   - Use training and validation sets to find a model that represents its input data well. In particular, you will want to predict your validation set observations. 
   - For each observation, you can measure the difference between the original observations and the predicted one, using, for example, the mean squared error of all features of the observation. 
   - Plot the errors for all your validation set observations in a histogram - in a good model, this error should be much higher for fraudulent claims than non-fraudulent ones.

3. Assess predictions of autoencoder created

   - Use your trained autoencoder to predict the test set and define the corresponding losses(?). 
   - Create a histogram of your test set claims, clearly marking fraudulent and non- fraudulent claims. 
   - Discuss how you could use this to decide whether a transaction is fraudulent or not. 
   - Can you also derive an AUC in this approach - if yes, how does it perform compared to the previous approaches?

### Answer
Before creating an autoencoder, the pre-processed data from **Question 2** was split into training, validation and test sets. The training set only contained non-fraud claims, whereas validation and test sets contained a mixture of both non-fraud and fraud claims such that the total fraud claims were equally distributed among the validation and test sets.

The following is a step-by-step methodology for splitting the original (pre-processed) dataset:
1. Split pre-processed dataset into non-fraud and fraud dataframes
2. Calculate the non-fraud and fraud sample sizes for training, validation and test sets respectively according to `train_split` percentage (this percentage does not include the proportion for validation set). The percentage splits for validation and test sets are equally divided after subtracting `train_split` percentage.
3. Sample non-fraud data for training set first, then sample non-fraud and fraud data for validation and test sets respectively.

After splitting, `scaler.fit_transform()` was applied to both training and validation sets and `scaler.transform()` was applied to the test set to ensure that our test set is truly unseen (i.e. the model does not learn about the test data).

In [None]:
# get a copy of raw df

df_autoencoder = df.copy()
# split dataset into non-fraud (normal) and fraud
normal_df = df_autoencoder[df_autoencoder.Fraud == 0]
fraud_df = df_autoencoder[df_autoencoder.Fraud == 1]
print(f'The length of normal_df: {len(normal_df)}')
print(f'The length of fraud_df: {len(fraud_df)}')

# variables for splitting data into train, val and test sets 
train_split = 0.8
test_split = 1 - train_split
normal_train_size = round(len(df_autoencoder) * train_split)
fraud_val_test_size = int(len(fraud_df) / 2)
normal_val_test_size = int(round(len(df_autoencoder) * test_split / 2) - fraud_val_test_size)
print(f'normal_train_size: {normal_train_size}')
print(f'fraud_val_test_size: {fraud_val_test_size}')
print(f'normal_val_test_size: {normal_val_test_size}\n')

# sample non-fraud data for train set
train_df = normal_df.sample(normal_train_size, random_state=0)
normal_df = normal_df[~normal_df.isin(train_df)].dropna()
print(f'len(train_df): {len(train_df)}')
print(f'len(normal_df) excluding data in train_df: {len(normal_df)}\n')

# sample non-fraud data for val and test sets
val_df = normal_df.sample(normal_val_test_size, random_state=0)
test_df = normal_df[~normal_df.isin(val_df)].dropna()
print(f'len(val_df): {len(val_df)}')
print(f'len(test_df): {len(test_df)}\n')

# check if all normal data is in the train, val and test sets
normal_df = df_autoencoder[df_autoencoder.Fraud == 0]
test = pd.concat([train_df, val_df, test_df]).isin(normal_df)
print(f'len(test[test.LossDate == False]) = {len(test[test.LossDate == False])}\n')

# sample fraud data for val and test sets
val_df = val_df.append(fraud_df.sample(fraud_val_test_size, random_state=0))
test_df = test_df.append(fraud_df[~fraud_df.isin(val_df)].dropna())
print(f'len(val_df): {len(val_df)}')
print(f'len(test_df): {len(test_df)}\n')

print(f'Check if len(train_df) + len(val_df) + len(test_df) == len(df_autoencoder): {len(train_df) + len(val_df) + len(test_df) == len(df_autoencoder)}')

In [None]:
# Turn the dataframe into numpy arrays
y_train = train_df[['Fraud']].to_numpy()
y_val = val_df[['Fraud']].to_numpy()
y_test = test_df[['Fraud']].to_numpy()

X_train = train_df.drop(['Fraud'], axis=1)
X_val = val_df.drop(['Fraud'], axis=1)
X_test = test_df.drop(['Fraud'], axis=1)

train_col_names = list(X_train.columns)+['df_key']

In [None]:
# Fit the MinMaxScaler
scaler = MinMaxScaler()

X_train.loc[:,'df_key'] = 1
X_val.loc[:,'df_key'] = 0
X_test.loc[:,'df_key'] = 0
X_train_val = pd.concat([X_train, X_val])
# X_df_key = X_train_val[['df_key']]
# X_train_val = X_train_val.drop(['df_key'], axis=1)

X_train_val = pd.DataFrame(scaler.fit_transform(X_train_val), columns=train_col_names)
X_test = pd.DataFrame(scaler.transform(X_test), columns=train_col_names)
X_test = X_test.drop(['df_key'], axis=1)
X_train = X_train_val[X_train_val.df_key == 1].drop(['df_key'], axis=1)
X_val = X_train_val[X_train_val.df_key == 0].drop(['df_key'], axis=1)

print(f'check if len(X_train) + len(X_val) + len(X_test) == len(df_autoencoder): {len(X_train) + len(X_val) + len(X_test) == len(df_autoencoder)}')
print(f"X_train.shape: {X_train.shape}")
print(f"X_val.shape: {X_val.shape}")
print(f"X_test.shape: {X_test.shape}")
X_train = X_train.to_numpy()
X_val = X_val.to_numpy()
X_test = X_test.to_numpy()

# Question 7

### Question
Using TensorFlow, create an autoencoder, ensuring that the middle hidden layer has fewer neurons than your input has features. Use training and validation sets to find a model that represents its input data well. In particular, you will want to predict your validation set observations. For each observation, you can measure the difference between the original observations and the predicted one, using, for example, the mean squared error of all features of the observation. Plot the errors for all your validation set observations in a histogram - in a good model, this error should be much higher for fraudulent claims than non-fraudulent ones.

## 7.1 Create an autoencoder
During the optimization of a basic autoencoder, we observed that it was difficult to assess whether the created autoencoder was overfitting since the train set is exclusively non-fraud, whereas the validation set is a mixture of non-fraud and fraud. We cannot conclude whether the “val_loss” is associated with an overfitting issue or the model failed to identify the pattern of fraud features, or both. To overcome this, we further split the main training set containing only non-fraud data into `X_train_train` set (70%) and `X_train_val` set (30%) to assess overfitting using the difference between final “val_loss” and final “loss” (from the last epoch). If the difference is very small, it means the model is not overfitting and can reconstruct non-fraud features well. We represented this difference as `overfit_metric` and aim to find the minimum.

After ensuring that the model is not overfitting on the `X_train_val` set, we continued to evaluate the trained model on the `X_val` set (the original validation set containing a mixture of non-fraud and fraud). If the model is good at reconstructing non-fraud features only, the MSE for non-fraud claims would be very close to zero, and the majority error would be associated with fraud claims. Therefore, we take the average of all MSE from the reconstruction predictions as a metric to assess whether the model reconstructs fraud features badly. We named this metric as `val2_avg_recon_error` and aim to find the maximum.

With these 2 metrics, we developed a `model_score` by taking the difference between `val2_avg_recon_error` and `overfit_metric` to select the best model from a list of models where the following hyperparameters were tuned:
   - dropout rate
   - L2 regularisation parameter
   - number of neurons
   - number of layers

In [None]:
X_train_train,X_train_val, y_train_train, y_train_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

In [None]:
class AutoEncodingModel:
    def __init__(self, reg_param, number_units_layers, number_units_bottleneck, dropout_rate, number_layers, whether_dropout, whether_regularizer, X_train_train=X_train_train, X_train_val=X_train_val, X_val=X_val,epochs=100) -> None:
        tf.keras.backend.clear_session()
        tf.random.set_seed(192)
        self.X_train_train = X_train_train
        self.X_train_val = X_train_val
        self.X_val = X_val
        self.reg_param = reg_param
        self.number_units_layers = number_units_layers
        self.number_units_bottleneck = number_units_bottleneck
        self.dropout_rate = dropout_rate
        self.number_layers = number_layers
        self.whether_dropout =  whether_dropout
        self.whether_regularizer = whether_regularizer
        self.input_dim = self.X_train_train.shape[1]
        self.epochs  = epochs
        self.build_model()
        self.compile_model()

    def build_model(self):
        """
        Build the model according to the hyperparameters input
        """
        regularizer = tf.keras.regularizers.l2(
            self.reg_param*self.whether_regularizer)
        if self.whether_dropout == True:
            encoder = tf.keras.models.Sequential([
                tf.keras.layers.Dense(self.number_units_layers, activation="relu")]*self.number_layers+[
                tf.keras.layers.Dropout(self.dropout_rate), # dropout before the bottleneck layer
                tf.keras.layers.Dense(self.number_units_bottleneck,activation='sigmoid', kernel_regularizer=regularizer)])
        else:
            encoder = tf.keras.models.Sequential([
                tf.keras.layers.Dense(self.number_units_layers, activation="relu")]*self.number_layers+[
                tf.keras.layers.Dense(self.input_dim,activation='sigmoid')])
        decoder = tf.keras.models.Sequential([
                tf.keras.layers.Dense(self.number_units_layers, activation="relu")]*self.number_layers+[
                tf.keras.layers.Dense(self.input_dim, activation="sigmoid")])
        self.autoencoder = tf.keras.models.Sequential([encoder, decoder])

        # random_seed = 192
    def get_hp(self):
        # get all hyperparameters
        return {
            'reg_param': self.reg_param,
            'number_units_layers': self.number_units_layers,
            'number_units_bottleneck': self.number_units_bottleneck,
            'dropout_rate': self.dropout_rate,
            'number_layers': self.number_layers,
            'whether_dropout': self.whether_dropout,
            'whether_regularizer': self.whether_regularizer
        }

    def get_model(self):
        """
        get the model from the class
        """
        return self.autoencoder

    def compile_model(self):
        """
        compile the model
        """
        optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
        self.autoencoder.compile(
            optimizer=optimizer, loss='mean_squared_error')

    def __train_model(self):
        """
        Train the data only contains non-fraud claims
        """
        early_stopping_cb = tf.keras.callbacks.EarlyStopping(
            monitor='val_loss', patience=10, restore_best_weights=True)
        self.log_train_non_fraud = self.autoencoder.fit(x=self.X_train_train, y=self.X_train_train,
                                                        epochs=self.epochs,
                                                        validation_data=(X_train_val, X_train_val), callbacks=[early_stopping_cb])

    def get_train_non_fraud_loss_diff(self):
        """
        calculate the overfit_metric
        """
        self.__train_model()
        return self.log_train_non_fraud.history['val_loss'][-1]-self.log_train_non_fraud.history['loss'][-1] 
        
    def get_train_val_loss(self):
        """
        get the val1_loss
        """
        return self.log_train_non_fraud.history['val_loss'][-1]

    def __apply_model_fraud(self):
        """
        apply the model on the val2, and get the overall average mse loss
        """
        reconstructions = self.autoencoder.predict(self.X_val)
        self.val_loss = np.mean(tf.keras.losses.mse(reconstructions, self.X_val))

    def get_val_loss(self):
        self.__apply_model_fraud()
        return self.val_loss

    def run(self):
        # the difference of loss in the train_train and train_val, the metric to access the overfitting
        overfit_metric = self.get_train_non_fraud_loss_diff()  #TODO change it to overfit_metric
        val2_avg_recon_error = self.get_val_loss() #TODO change it to val2_avg_recon_error
        val1_loss = self.get_train_val_loss()  #TODO change it to val1_loss
        return {"overfit_metric": overfit_metric, "val2_avg_recon_error": val2_avg_recon_error, "val1_loss":val1_loss, "model": self.autoencoder, "log":self.log_train_non_fraud}


Generate a list of hyperparameters and train the model. After trainnig the model, we will record the hyperparameters and the performance of the model.

In [None]:
tf.keras.backend.clear_session()
tf.random.set_seed(192)
round = 30
model_list = [] # list of to save model
class_list = []
log_list = []
para_dfm = pd.DataFrame(columns=['reg_param', 'number_units_layers', 'number_units_bottleneck', 'dropout_rate', 'number_layers', 'whether_dropout', 'whether_regularizer','overfit_metric', 'val1_loss','val2_avg_recon_error'])
for i in range(round):
    print(f"This round is {i}")
    # generate a list of hyperparameters
    reg_param = np.random.uniform(low=0.1, high=0.3)
    numbers_units_layers = np.random.choice(np.arange(30, 61, 15))
    numbers_units_bottleneck = np.random.choice(np.arange(5, 16, 5))
    dropout_rate = np.random.uniform(low=0.01, high=0.05)
    numbers_layers = np.random.choice(np.arange(1, 4, 1))
    whether_dropout = np.random.choice([True, False])
    whether_regularizer = np.random.choice([True, False])
    autoencoder = AutoEncodingModel(reg_param, numbers_units_layers, numbers_units_bottleneck, dropout_rate, numbers_layers, whether_dropout, whether_regularizer)
    
    # get the result of the model
    res = autoencoder.run()
    overfit_metric  = res['overfit_metric']
    val2_avg_recon_error = res['val2_avg_recon_error']
    model = res['model']
    val1_loss = res['val1_loss']

    # fill the value into the dataframe
    para_dfm = para_dfm.append({"reg_param": reg_param, "number_units_layers": numbers_units_layers, "number_units_bottleneck": numbers_units_bottleneck, "dropout_rate": dropout_rate, "number_layers": numbers_layers, "whether_dropout": whether_dropout, "whether_regularizer": whether_regularizer, "overfit_metric": overfit_metric, "val2_avg_recon_error": val2_avg_recon_error,'val1_loss':val1_loss}, ignore_index=True)
    
    # add model into model_list
    model_list.append(model)

    # add initalized class into class_list
    class_list.append(autoencoder)

    # add log into log_list
    log_list.append(res['log'])

In [None]:
# para_dfm.to_csv("para_dfm.csv")
para_dfm = pd.read_csv("para_dfm.csv", index_col=0)

In [None]:
# calculate the model score according to the val2_avg_recon_error and overfit_metric
para_dfm["model_score"] = para_dfm["val2_avg_recon_error"] - para_dfm["overfit_metric"] 
para_dfm = para_dfm.sort_values(by=["model_score"], ascending=False)
para_dfm.head()

Go through all the models we have trained and look into the relationship of the model_score and the performance of model.
In the table, we are going to record the model_score, the sensitivity(select the mean of reconstruction_error as threshold), accuracy_rate and AUC.

In [None]:
# df_performance_diff_sens.to_csv("df_performance_diff_sens.csv")
df_performance_diff_sens = pd.read_csv("df_performance_diff_sens.csv", index_col=0)

In [None]:
# NOTE: read "df_performance_diff_sens" from local file

# df_performance_diff_sens = pd.DataFrame(columns=['model_score', 'sensitivity','accuracy_rate','roc'])
# for ind, row in para_dfm.iterrows():
#     print(f"The model_score is: {row['model_score']}")
#     model_pred = model_list[ind]
#     X_pred = model_pred.predict(X_val)
#     mse = np.mean(np.power(X_val.flatten() - X_pred.flatten(), 2))
#     reconstructions = model_pred.predict(X_val)
#     val_loss = tf.keras.losses.mae(reconstructions, X_val)
#     # sns.histplot(x=val_loss,y=y_val.flatten(),hue=y_val.flatten())
#     # plt.show()
#     df_tmp = pd.DataFrame({"val_loss": val_loss, "y_val": y_val.flatten()})
#     df_tmp_fraud = df_tmp[df_tmp["y_val"] == 1]
#     df_tmp_non_fraud = df_tmp[df_tmp["y_val"] == 0]
#     mse = np.mean(np.power(X_val.flatten() - X_pred.flatten(), 2))
#     error_df = pd.DataFrame({'Reconstruction_error': mse, 'True_class': y_val.flatten()})
#     df_temp = pd.DataFrame({'Reconstruction_error': val_loss, 'True_class': y_val.flatten()})
#     roc = roc_auc_score(y_val.flatten(), val_loss)
#     threshold = np.mean(df_temp["Reconstruction_error"])
#     pred_y = np.where(val_loss > threshold, 1, 0)
#     cm = confusion_matrix(y_val.flatten(), pred_y)
#     accuracy_rate = (cm[0,0] + cm[1,1])/np.sum(cm)
#     # calculate the sensitivity 
#     sensitivity = cm[1,1]/(cm[1,1] + cm[1,0])
#     print(f"sensitivity is: {sensitivity}")
#     df_performance_diff_sens = df_performance_diff_sens.append({"model_score": row['model_score'], "sensitivity": sensitivity,'accuracy_rate':accuracy_rate,'roc':roc}, ignore_index=True)

# df_performance_diff_sens

## 7.2 Assess autoencoder on validation set

We are going to explore the distubution of the fraud cases in the test set of each model. Usually, the higher the model score, the fraud cases are more likely to be concentrated at a large number.

In [None]:
for ind, row in para_dfm.iterrows():
    print(f"The model_score is: {row['model_score']}")
    model_pred = model_list[ind]
    X_pred = model_pred.predict(X_val)
    mse = np.mean(np.power(X_val.flatten() - X_pred.flatten(), 2))
    reconstructions = model_pred.predict(X_val)
    val_loss = tf.keras.losses.mse(reconstructions, X_val)
    df_tmp = pd.DataFrame({"val_loss": val_loss, "y_val": y_val.flatten()})
    df_tmp_fraud = df_tmp[df_tmp["y_val"] == 1]
    df_tmp_non_fraud = df_tmp[df_tmp["y_val"] == 0]
    plt.hist(df_tmp_fraud["val_loss"], bins=50, alpha=0.5, label="fraud", color="red",density=True)
    plt.hist(df_tmp_non_fraud["val_loss"], bins=50, alpha=0.5, label="non_fraud", color="blue",density=True)
    plt.show()

Show the relationship between model_score and other columns, we can find a very significant linear relationship between these columns.

In [None]:
df_performance_diff_sens.plot(x='model_score', y='sensitivity', kind='scatter',title='Sensitivity_model_score')
plt.show()
df_performance_diff_sens.plot(x='model_score', y='roc', kind='scatter',title='ROC_model_score')
plt.show()
df_performance_diff_sens.plot(x='model_score', y='accuracy_rate', kind='scatter',title='Accuracy_model_score')
plt.show()

We are going to use statsmodel to test whether it is significant. We can find the p-value of the model is very low, which means the model is significant.

In [None]:
# linear regression #TODO more linear regression test

X  = np.array(df_performance_diff_sens['model_score'], dtype=float)
y = np.array(df_performance_diff_sens['sensitivity'], dtype=float)
X = sm.add_constant(X)
print("Statsmodels linear regression betwen model_score and sensitivity")
model = sm.OLS(y, X).fit()
print(model.summary())

X = np.array(df_performance_diff_sens['model_score'], dtype=float)
y = np.array(df_performance_diff_sens['accuracy_rate'], dtype=float)
X = sm.add_constant(X)
print("Statsmodels linear regression betwen model_score and accuracy_rate")
model = sm.OLS(y, X).fit()
print(model.summary())
print('-------------------------------------------------------')

X = np.array(df_performance_diff_sens['model_score'], dtype=float)
y = np.array(df_performance_diff_sens['roc'], dtype=float)
X = sm.add_constant(X)
print("Statsmodels linear regression betwen model_score and roc")
model = sm.OLS(y, X).fit()
print(model.summary())
print('-------------------------------------------------------')


# Question 8

### Question
Use your trained autoencoder to predict the test set and define the corresponding losses. Create a histogram of your test set claims, clearly marking fraudulent and nonfraudulent claims. Discuss how you could use this to decide whether a transaction is fraudulent or not. Can you also derive an AUC in this approach - if yes, how does it perform compared to the previous approaches?

## 8.1 Assess autoencoder on test set

Based on the results of the previous question, we can use the model_score to select the best model.

In [None]:
# get the index of largest model_score row
para_dfm_max_index = para_dfm[para_dfm["model_score"]==max(para_dfm["model_score"])].index[0]
# get the loss function of the best model
def create_plot(log,limit=None):
    if limit:
    # plt.plot(log.history['accuracy'],label = "training accuracy",color='green')
        plt.plot(log.history['loss'][-limit:],label = "training loss",color='darkgreen')
        # plt.plot(log.history['val_accuracy'], label = "validation accuracy",color='grey')
        plt.plot(log.history['val_loss'][-limit:], label = "validation loss",color='darkblue')
        plt.legend()
        plt.show()
    else:
        plt.plot(log.history['loss'],label = "training loss",color='darkgreen')
        plt.plot(log.history['val_loss'], label = "validation loss",color='darkblue')
        plt.legend()
        plt.show()

create_plot(log_list[para_dfm_max_index])

In [None]:
# get the index of largest model_score row
autoencoder_best = model_list[int(para_dfm_max_index)]
# autoencoder_best.save('./models/autoencoder_best.h5') # save the best model

We are going to reset the weight and train the model again.

In [None]:
def shuffle_weights(model, weights=None):
    """Randomly permute the weights in `model`, or the given `weights`.

    This is a fast approximation of re-initializing the weights of a model.

    Assumes weights are distributed independently of the dimensions of the weight tensors
      (i.e., the weights have the same distribution along each dimension).

    :param Model model: Modify the weights of the given model.
    :param list(ndarray) weights: The model's weights will be replaced by a random permutation of these weights.
      If `None`, permute the model's current weights.
    """
    if weights is None:
        weights = model.get_weights()
    weights = [np.random.permutation(w.flat).reshape(w.shape) for w in weights]
    # Faster, but less random: only permutes along the first dimension
    # weights = [np.random.permutation(w) for w in weights]
    model.set_weights(weights)
    return model


In [None]:
tf.keras.backend.clear_session()
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# reset the all the weight of autoencoder model
autoencoder_best = shuffle_weights(autoencoder_best)
log = autoencoder_best.fit(x=X_train, y=X_train, epochs=200, validation_data=(X_val, X_val), callbacks=[early_stopping_cb])

create_plot(log)

In [None]:
# autoencoder_best.save('models/autoencoder_best.h5')
autoencoder_best = tf.keras.models.load_model('./models/autoencoder_best.h5')
X_pred = autoencoder_best.predict(X_val)
mse = np.mean(np.power(X_val.flatten() - X_pred.flatten(), 2))
reconstructions = autoencoder_best.predict(X_val)
val_loss = tf.keras.losses.mae(reconstructions, X_val)
# sns.histplot(x=val_loss,y=y_val.flatten(),hue=y_val.flatten())
# plt.show()
df_tmp = pd.DataFrame({"val_loss": val_loss, "y_val": y_val.flatten()})
df_tmp_fraud = df_tmp[df_tmp["y_val"] == 1]
df_tmp_non_fraud = df_tmp[df_tmp["y_val"] == 0]
mse = np.mean(np.power(X_val.flatten() - X_pred.flatten(), 2))
error_df = pd.DataFrame({'Reconstruction_error': mse, 'True_class': y_val.flatten()})
df_temp = pd.DataFrame({'Reconstruction_error': val_loss, 'True_class': y_val.flatten()})
roc = roc_auc_score(y_val.flatten(), val_loss)
print("ROC: ", roc)
#plot the roc curve
fpr, tpr, thresholds = roc_curve(y_val.flatten(), val_loss)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc)

### Set the threshold for fraud
We are going to use three method to set the threshold for fraud:
1. Use the mean of the reconstruction error plus one standard deviation as the threshold
2. Set the threshold to the minimum of the reconstruction error of fraud cases
3. Set the threshold to the minimize the cost function we mentioned above, which is 10*FN + FP

If it is real case, we would recommend to use the third method. Because it can have company to lower the cost.


### Method 1
Use the mean of the reconstruction error plus one standard deviation as the threshold

In [None]:
# select the threshold with test dataset
threshold = np.mean(val_loss) + np.std(val_loss)
print("Threshold: ", threshold)

In [None]:
# pred y based on the threshold
test_loss = autoencoder_best.predict(X_test)
reconstructions = autoencoder_best.predict(X_val)
test_loss = tf.keras.losses.mae(reconstructions, X_val)
pred_y = np.where(test_loss > threshold, 1, 0)
cm = confusion_matrix(y_test.flatten().astype(int), pred_y)
print(f"Confusion matrix is :" )
fig, ax = plt.subplots(figsize=(10,5))
sns.heatmap(cm, annot=True, fmt=".0f")
plt.show()
accuracy_rate = (cm[0,0] + cm[1,1])/np.sum(cm)
print(f"Accuracy rate is {accuracy_rate}")
# calculate the sensitivity 
sensitivity = cm[1,1]/(cm[1,1] + cm[1,0])
print(f"Sensitivity is {sensitivity}")

### Method 2
Set the threshold to the minimum of the reconstruction error of fraud cases

In [None]:
# show the distribution of the model_score

sns.histplot(x=val_loss,y=y_val.flatten(),hue=y_val.flatten())

In [None]:
# select the threshold with test dataset
df_temp = pd.DataFrame({'Reconstruction_error': val_loss, 'True_class': y_val.flatten()})
df_temp_fraud = df_temp[df_temp['True_class']==1]
threshold = df_temp_fraud["Reconstruction_error"].min()
print("Threshold: ", threshold)

In [None]:
# pred y based on the threshold
pred_y = np.where(test_loss > threshold, 1, 0)
cm = confusion_matrix(y_test.flatten().astype(int), pred_y)
print(f"Confusion matrix is :" )
sns.heatmap(cm, annot=True)
accuracy_rate = (cm[0,0] + cm[1,1])/np.sum(cm)
print(f"Accuracy rate is {accuracy_rate}")
# calculate the sensitivity 
sensitivity = cm[1,1]/(cm[1,1] + cm[1,0])
print(f"Sensitivity is {sensitivity}")

### Method 3
Set the threshold to the minimize the cost function we mentioned above, which is 10*FN + FP

In [None]:
# select the threshold with test dataset
#  the cost function
def calculate_cost(fn,fp):
    return 100*fn+fp

cost_lost = {}
for i in np.linspace(0,0.1,100):
    pred_y = np.where(val_loss > i, 1, 0)
    cm = confusion_matrix(y_val.flatten(), pred_y)
    fn,fp = cm[1][0],cm[0][1]
    # cost_lost["Threshold: "+str(i)] = calculate_cost(fn,fp)
    cost_lost[i] = calculate_cost(fn,fp)

threshold = min(cost_lost, key=cost_lost.get)
print("Threshold: ", threshold)

In [None]:
# pred y based on the threshold
pred_y = np.where(test_loss > threshold, 1, 0)
cm = confusion_matrix(y_test.flatten().astype(int), pred_y)
print(f"Confusion matrix is :" )
sns.heatmap(cm, annot=True)
accuracy_rate = (cm[0,0] + cm[1,1])/np.sum(cm)
print(f"Accuracy rate is {accuracy_rate}")
# calculate the sensitivity 
sensitivity = cm[1,1]/(cm[1,1] + cm[1,0])
print(f"Sensitivity is {sensitivity}")

## 8.2 Conclusion

We can obtain a relatively higher ROC score using the autoencoder for anomaly detection. Which means we can balance the sensitity and specificity well.

# Question 9

### Question
As you know, it is difficult to understand precisely why a neural network makes a specific prediction. Discuss why this might be problematic when the neural network prediction leads to a fraud investigation by the insurance company. What alternatives can you envision that make use of the techniques we have applied and allow for more interpretability and transparency?

### Answer
The issue with neural network predictions in a fraud investigation is its “black box” behaviour: its internal logic in achieving the classification results is often hard to be interpreted or explained to both users and observers. If a customer is wrongly investigated for fraud by the company, they would generally demand an understanding of why they were investigated. Due to its complex intricacies, neural networks fail to explain why a claim is fraudulent despite its potentially high accuracy.

Solutions to neural network anomaly detection can be recommended in two ways. Firstly the model can be improved and refined to continuously perform better after each iterative improvement. Secondly, alternative machine learning classifications can replace the neural networks to provide greater context to the decision to investigate a customer for potential fraud. Regarding the first point, several alternative neural network classification models can be suggested particularly for anomaly detection. Al’Dahoul et al., 2021 created a novel neural network strategy that “combines binary normal/attack DNN to detect the availability of any attack and multi-attacks DNN to categorize the attacks” (Al’Dahoul et al., 2021:16). This anomaly detection and classification model outperformed the baseline solution and reduced the ‘false alarm rate’ for the highly imbalanced dataset. Additionally, Long Short-Term Memory (LSTM) neural networks can be implemented to find complex relationships in multivariate time series data. This system looks at the previous timeframe and predicts the behavior for the next. If the actual value a minute later is within one standard deviation then there is no problem. Other novel classification solutions include DAICS, which is more robust to the additive noise (Abdelaty et al., 2020), GT, which eliminates the need for autoencoders and improves results (Golan & Yaniv, 2018), and E3 outlier, which uses inlier priority for unsupervised outlier detection  (Wang et al., 2019). These solutions are more accurate, and improve results such as ROC values of the models. 


There are alternative machine learning classification models that can both make use of the techniques we have applied, and also allow a much better interpretability and transparency. Decision trees, for example, can explain the features that are used to segment a claim to fraud and non-fraud, while random forests can be used to show the importance of each feature in classifying the claims. This is a better improvement for transparency, as the features can highlight the information that can potentially be linked to fraudulent behaviours. Furthermore, both random forests and decision trees can be tuned for the right hyperparameters in order to optimise the model. Decision trees can be pruned, while random forest can have hyperparameters, such as maximum number of depths, number of trees and the minimum number of samples per leaf, be fully optimised. Additionally, random forests can provide a different interpretation of a decision tree but with better performance. Considering we need to understand which combination of variables flagged a customer for investigation, a decision tree can help understand how each variable is contributing to the prediction model, despite providing a reduction in the model performance. 

# Question 10

### Question

Use your synthetically extended dataset and train a simple model, such as logistic
regression or a decision tree that allows you to interpret why fraud is suspected. Keep track of the accuracy and AUC on a test set made from original data only. How does your model perform compared to the previous models you have developed? Does your model allow you to answer a customer who asks, "why am I being investigated"?


We are building 3 simple models for our enhanced dataset

1. Logistic Regression
2. Decision Tree
3. Random Forest

In [None]:
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn import metrics
from sklearn import tree


In [None]:
# helper function
def plot_confusionmatrix(y_train_pred,y_train,dom):
    print(f'{dom} Confusion matrix')
    cf = confusion_matrix(y_train_pred,y_train)
    sns.heatmap(cf,annot=True,yticklabels=classes
               ,xticklabels=classes,cmap='Blues', fmt='g')
    plt.tight_layout()
    plt.show()

In [None]:
# Function for Predicting the score

def ModelPerformance(model, X_test, y_test):
    predlr = model.predict(X_test)
    print("Accuracy Score: {}".format(accuracy_score(y_test, predlr)*100))
    print("f1_Score: {}".format(f1_score(y_test, predlr)*100))
    print(confusion_matrix(y_test, predlr))

def Model_ROC_AUC(model, x, y):
    Y_probs=model.predict_proba(x)[:,1]
    fpr, tpr, thresholds = metrics.roc_curve(y,Y_probs)
    plt.plot(fpr, tpr, linewidth=4)
    plt.show()
    print(f'AUC score: {roc_auc_score(y,Y_probs)}')


In [None]:
x_train = X_over_synth_10_train
y_train = y_over_synth_10_train
x_val = X_over_synth_10_val
y_val = y_over_synth_10_val

### Model 1 - Logistic Regression

In [None]:
# Function for LogisticRegression (basic model)

def LogisticRegressionModel(x_train, y_train):
    lr = LogisticRegression(max_iter=300)
    lr.fit(x_train, y_train)
    return lr
    

In [None]:
lr = LogisticRegressionModel(x_train, y_train)

In [None]:
print("---------------------------")
print("Accuracy of Logistic Regression on Test Dataset")
print("---------------------------")
ModelPerformance(lr, X_test, y_test)
Model_ROC_AUC(lr, X_test, y_test)

### Model 2 - Decision Tree

Fitting a simple decision tree model

In [None]:
dtc = tree.DecisionTreeClassifier(random_state=66)
dtc.fit(x_train,y_train)
y_train_pred = dtc.predict(x_train)
y_val_pred = dtc.predict(x_val)
y_test_pred = dtc.predict(X_test)

Hyper-tuning the parameters

In [None]:
path = dtc.cost_complexity_pruning_path(x_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [None]:
# For each alpha we will append our model to a list
clfs = []
for ccp_alpha in ccp_alphas:
    clf = tree.DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(x_train, y_train)
    clfs.append(clf)

In [None]:
train_acc = []
test_acc = []
for c in clfs:
    y_train_pred = c.predict(x_train)
    y_test_pred = c.predict(x_val)
    train_acc.append(accuracy_score(y_train_pred,y_train))
    test_acc.append(accuracy_score(y_test_pred,y_val))

plt.scatter(ccp_alphas,train_acc)
plt.scatter(ccp_alphas,test_acc)
plt.plot(ccp_alphas,train_acc,label='train_accuracy',drawstyle="steps-post")
plt.plot(ccp_alphas,test_acc,label='val_accuracy',drawstyle="steps-post")
plt.legend()
plt.title('Accuracy vs alpha')
plt.show()

We choose the alpha value to be 0.003 as our hyperparameter

In [None]:
clf_ = tree.DecisionTreeClassifier(random_state=66,ccp_alpha=0.003)
clf_.fit(x_train,y_train)
y_train_pred = clf_.predict(x_train)
y_val_pred = clf_.predict(x_val)
y_test_pred = clf_.predict(X_test)


In [None]:
plt.figure(figsize=(50,50))
features = df.columns
classes = ['Not Fraud','Fraud']
tree.plot_tree(clf_,feature_names=features,class_names=classes,filled=True)
plt.show()

In [None]:
print("---------------------------")
print("Accuracy of Decision Tree on Test Dataset")
print("---------------------------")
ModelPerformance(clf_, X_test, y_test)
Model_ROC_AUC(clf_, X_test, y_test)

### Model 3 - Random Forest

Fitting a simple Random Forest model

In [None]:
rfc = RandomForestClassifier(random_state=66)
rfc.fit(x_train, y_train)

In [None]:
y_train_pred = rfc.predict(x_train)
y_val_pred = rfc.predict(x_val)
y_test_pred = rfc.predict(X_test)

#### Hyperparameter tuning using Randomized Search Cross Validation

To potentially avoid the issue of overfitting, we try to optimise our random forest model through randomized search cross validation in order to tune our hyperparameter

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 200, num = 20)] # number of trees in the random forest
max_features = ['auto', 'sqrt'] # number of features in consideration at every split
max_depth = [int(x) for x in np.linspace(10, 120, num = 12)] # maximum number of levels allowed in each decision tree
min_samples_split = [2, 5, 10] # minimum sample number to split a node
min_samples_leaf = [1, 2, 4] # minimum sample number that can be stored in a leaf node
bootstrap = [True, False] # method used to sample data points

random_grid = {'n_estimators': n_estimators,

'max_features': max_features,

'max_depth': max_depth,

'min_samples_split': min_samples_split,

'min_samples_leaf': min_samples_leaf,

'bootstrap': bootstrap}

In [None]:
from sklearn.model_selection import RandomizedSearchCV
rf_random = RandomizedSearchCV(estimator = rfc,param_distributions = random_grid,
               n_iter = 100, cv = 5, verbose=2, random_state=35, n_jobs = -1)

In [None]:
rf_random.fit(x_train, y_train)

In [None]:
print ('Random grid: ', random_grid, '\n')
# print the best parameters
print ('Best Parameters: ', rf_random.best_params_, ' \n')

Fitting the best parameters into the Random Forest model

In [None]:
rfc_tuned = RandomForestClassifier(n_estimators = 190, min_samples_split = 2, 
                                    min_samples_leaf= 1, max_features = 'auto', max_depth= 110, 
                                    bootstrap=False, random_state=66) 


In [None]:
rfc_tuned.fit(x_train, y_train)

In [None]:
print("---------------------------")
print("Accuracy of Random Forest on Test Dataset")
print("---------------------------")
ModelPerformance(rfc_tuned, X_test, y_test)
Model_ROC_AUC(rfc_tuned, X_test, y_test)

After training logistic regression, decision tree and random forest on our enhanced model, we found the following AUC and accuracy score for our models:
<br>   
<br> 
__Logistic Regression__

Accuracy = 97.0%

AUC Score = 85.7%

__Decision Tree__

Accuracy = 97.5%

AUC Score = 73.2%

__Random Forest__

Accuracy = 98.8%

AUC Score = 85.4%<br>   
<br> 

The machine learning models has high accuracy and AUC scores, and perform better than the previous models. However, it is predicting non-fraud cases well but not fraud cases. The tree-based classification models can help explain why claims are being marked as fraudulent. For example, a feature “ClaimWithoutIdentifiedThirdParty” are considered as one of the significant features that distinguish a claim into fraudulent, acting as one of the main features to split a sample to ‘Fraud’ and ‘Not Fraud’. This feature can be rationalised, where fraud claims are more likely to not have a third party. Furthermore, Random Forest models can also calculate the importance of each variables when determining the classification. This can highlight which features are significant in decidingn the binary class.


# References
References:

Abdelaty, M. F., Doriguzzi Corin, R., & Siracusa, D. (2021;2020;). DAICS: A deep learning solution for anomaly detection in industrial control systems. IEEE Transactions on Emerging Topics in Computing, , 1-1. https://doi.org/10.1109/TETC.2021.3073017

AlDahoul, N., Abdul Karim, H., & Ba Wazir, A. S. (2021). Model fusion of deep neural networks for anomaly detection. Journal of Big Data, 8(1), 1-18. https://doi.org/10.1186/s40537-021-00496-w

Blanque, P. (2003). Crisis and fraud. Journal of Financial Regulation and Compliance, 11(1), 60-70. https://doi.org/10.1108/13581980310810417

Chakravarty, S., Demirhan, H., & Baser, F. (2020). Fuzzy regression functions with a noise cluster and the impact of outliers on mainstream machine learning methods in the regression setting. Applied Soft Computing, 96, 106535. https://doi.org/10.1016/j.asoc.2020.106535

Golan, I., & El-Yaniv, R. (2018). Deep anomaly detection using geometric transformations.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Reed, R., & Marks, R. J. (1999). Neural smithing: Supervised learning in feedforward artificial neural networks. MIT Press. https://doi.org/10.7551/mitpress/4937.001.0001
