### Why this new ipynb file? 
In the fraud_detection_without_oversampling.ipynb file, there is one problem.
Let's disccus this problem by analyzing dataset we have. 

In [2]:
# importing the library 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libs
from sklearn.ensemble import RandomForestClassifier
import tensorflow as tf

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score


In [6]:
dataframe = pd.read_csv("Fraud.csv")
df = dataframe.copy()
print(df.columns)

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud'],
      dtype='object')


In [8]:

#Number of negative examples
y = dataframe.iloc[:, 9].values
print("negative examples: ",np.count_nonzero(y))
print("positive example: ", y.size -  np.count_nonzero(y))


negative examples:  8213
positive example:  6354407


As per the observation, we have only 8213 negative example and 6354407 positve examples in our dataset. That means our dataset is imbalanced. Therefor, we were getting pretty low f1 score, except Random Forest, in all ML models. 

So, we will address this problem with either oversampling or undersampling. 

In [9]:
# To begin with, we have step col in dataset, which elicits the time at which transaction is done. 
# Therefore, it is curtial to convert it into convinient format.
# here, i am going to convert each steps into timeDelta format, and soon i will transform it into other format.

df["step"] = pd.to_timedelta(df["step"], unit='h')
df.head()


Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,0 days 01:00:00,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,0 days 01:00:00,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,0 days 01:00:00,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,0 days 01:00:00,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,0 days 01:00:00,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [10]:


# if there exists any null values, i am removing those values as we have large dataset which is more then enogh to train ML model.
df.dropna()

# converting categorical variable into numertical with the help of one hot encoding. 
ct = ColumnTransformer(
    transformers=[
        ('encoder', OneHotEncoder(), [1])
    ],
    remainder='passthrough'
)

df_encoded = ct.fit_transform(df)
df_encoded = pd.DataFrame(df_encoded, columns=ct.get_feature_names_out())

In [11]:
df_encoded.head()

Unnamed: 0,encoder__type_CASH_IN,encoder__type_CASH_OUT,encoder__type_DEBIT,encoder__type_PAYMENT,encoder__type_TRANSFER,remainder__step,remainder__amount,remainder__nameOrig,remainder__oldbalanceOrg,remainder__newbalanceOrig,remainder__nameDest,remainder__oldbalanceDest,remainder__newbalanceDest,remainder__isFraud,remainder__isFlaggedFraud
0,0.0,0.0,0.0,1.0,0.0,0 days 01:00:00,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,0.0,0.0,0.0,1.0,0.0,0 days 01:00:00,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,0.0,0.0,0.0,0.0,1.0,0 days 01:00:00,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,0.0,1.0,0.0,0.0,0.0,0 days 01:00:00,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,0.0,0.0,0.0,1.0,0.0,0 days 01:00:00,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [12]:

df_encoded['customer_start_freq'] = df_encoded.groupby('remainder__nameOrig')['remainder__nameOrig'].transform('count')
df_encoded['customer_recipient_freq'] = df_encoded.groupby('remainder__nameDest')['remainder__nameDest'].transform('count')
df_encoded = df_encoded.drop('remainder__nameDest', axis=1)
df_encoded = df_encoded.drop('remainder__nameOrig', axis=1)

# we had a step column in formant of time delta, 
# but hear i have converted it into the three different columns, 
# since these coulumn helps the machine learnig model to get better idea of fraudant transcation. 

df_encoded['days'] = df_encoded['remainder__step'].dt.days
df_encoded['hours'] = df_encoded['remainder__step'].dt.components.hours
df_encoded['minutes'] = df_encoded['remainder__step'].dt.components.minutes
df_encoded  = df_encoded.drop("remainder__step", axis=1)
col_to_move = df_encoded.pop('remainder__isFraud')
df_encoded.insert(len(df_encoded.columns), 'remainder__isFraud', col_to_move)


In [14]:
df_encoded.head()

Unnamed: 0,encoder__type_CASH_IN,encoder__type_CASH_OUT,encoder__type_DEBIT,encoder__type_PAYMENT,encoder__type_TRANSFER,remainder__amount,remainder__oldbalanceOrg,remainder__newbalanceOrig,remainder__oldbalanceDest,remainder__newbalanceDest,remainder__isFlaggedFraud,customer_start_freq,customer_recipient_freq,days,hours,minutes,remainder__isFraud
0,0.0,0.0,0.0,1.0,0.0,9839.64,170136.0,160296.36,0.0,0.0,0,1,1,0,1,0,0
1,0.0,0.0,0.0,1.0,0.0,1864.28,21249.0,19384.72,0.0,0.0,0,1,1,0,1,0,0
2,0.0,0.0,0.0,0.0,1.0,181.0,181.0,0.0,0.0,0.0,0,1,44,0,1,0,1
3,0.0,1.0,0.0,0.0,0.0,181.0,181.0,0.0,21182.0,0.0,0,1,41,0,1,0,1
4,0.0,0.0,0.0,1.0,0.0,11668.14,41554.0,29885.86,0.0,0.0,0,1,1,0,1,0,0


In [16]:
#separating labels and non labled data
X = df_encoded.iloc[:, :-1].values
Y= df_encoded.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [18]:
scaler = StandardScaler()
X_train[:, 5:10] = scaler.fit_transform(X_train[:, 5:10])
X_test[:, 5:10] = scaler.transform(X_test[:, 5:10])


In [19]:
# Furthermore, ANN and other machine learning model accepts the either float32 ot float64 datatype
# and data type of our data is "float"
X_train = X_train.astype(np.float64)
X_test = X_test.astype(np.float64)
y_train = y_train.astype(np.float64)
y_test = y_test.astype(np.float64)

### Applying Undersampling

In [20]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X_train,y_train)

In [28]:
from collections import Counter
print(f'Original class distribution: {Counter(y_train)}')
print(f'Resampled class distribution: {Counter(y_rus)}')

Original class distribution: Counter({0.0: 5083524, 1.0: 6572})
Resampled class distribution: Counter({0.0: 6572, 1.0: 6572})


In [30]:

model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=32, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

# compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# train model
history = model.fit(X_rus, y_rus, epochs=10, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Performance evaluation

In [33]:
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score


y_pred = model.predict(X_test)
y_pred = (y_pred >0.5)

# AUC-ROC
auc_roc = roc_auc_score(y_test, y_pred)
print(f'AUC-ROC: {auc_roc:.2f}')
print()

# accuracy and confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print()

# f-1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1:.2f}')
print()

# k fold cross validation
# accuracies =  cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
# print("Accuracy {:.2f} %".format(accuracies.mean()*100))
# print("Standard Deviation {:.2f} %".format(accuracies.std()*100))
# print()



AUC-ROC: 0.92

[[1222106   48777]
 [    185    1456]]
0.9615237119299911

F1 score: 0.06



#### in deep learnig model, we can not detrmine relevance of feature.

### Applying Oversampling

In [34]:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from collections import Counter

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# Print the original class distribution
print(f'Original class distribution: {Counter(y_train)}')

# Perform oversampling using SMOTE
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# Print the resampled class distribution
print(f'Resampled class distribution: {Counter(y_resampled)}')

Original class distribution: Counter({0.0: 5083524, 1.0: 6572})
Resampled class distribution: Counter({0.0: 5083524, 1.0: 5083524})


In [35]:

newmodel = tf.keras.Sequential([
    tf.keras.layers.Dense(units=32, activation='relu', input_shape=(X_resampled.shape[1],)),
    tf.keras.layers.Dense(units=64, activation='relu'),
    tf.keras.layers.Dropout(rate=0.5),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

# compile model
newmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# train model
history = model.fit(X_resampled, y_resampled, epochs=10, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [36]:
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score


y_pred = newmodel.predict(X_test)
y_pred = (y_pred >0.5)

# AUC-ROC
auc_roc = roc_auc_score(y_test, y_pred)
print(f'AUC-ROC: {auc_roc:.2f}')
print()

# accuracy and confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))
print()

# f-1 score
f1 = f1_score(y_test, y_pred)
print(f'F1 score: {f1:.2f}')
print()

# k fold cross validation
# accuracies =  cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
# print("Accuracy {:.2f} %".format(accuracies.mean()*100))
# print("Standard Deviation {:.2f} %".format(accuracies.std()*100))
# print()

AUC-ROC: 0.59

[[1205222   65661]
 [   1264     377]]
0.9474076716824201

F1 score: 0.01



To conclude, it is clear that model has performed better on undersampled dataset, and performance is poor for oversampled dataset. Hence, for given dataset, in particular, undersampling has done better job as AUC ROC of that model is higher(0.92) as compared to other model's AUC ROC (0.59). However, F1 socre it still not much that great. 

Question 7: What kind of prevention should be adopted while company update its infrastructure?

While updating the infrastructure of the company, in general, following preventions should be considered:

Understand your work and necessary requirements before updating the infrastructure.
The dependability of your candidate infrastructure must be accessed.
Think about legal and ethical issues.
Consider financial issues.


Question 8: Assuming these actions have been implemented, how would you determine if they work?

To determine if the actions have been implemented and are working, you can use the following methods:

Conduct a survey to gather feedback from employees.
Conduct a post-update review to evaluate whether the update was completed successfully and within the anticipated timeframe.
Check the backup and recovery systems to ensure that they are functioning correctly and that data can be recovered if necessary.
Perform stress tests on the updated infrastructure to identify and address any potential issues that may arise.
Monitor system performance and availability after the update to ensure that there are no significant disruptions or downtime.
Evaluate user feedback and satisfaction to determine if the updated infrastructure meets their needs and expectations.

### Other answers are given in the code
