<a href="https://colab.research.google.com/github/aksharat/FinanceProjects/blob/main/Template_Synthetic_Identity_Fraud.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What Is Synthetic Identity Fraud?**

---



* Synthetic identity fraud is a type of identity theft in which criminals create fictitious identities by combining real and fake information. Unlike traditional identity theft, where the criminal assumes the identity of a real person, synthetic identity fraud fabricates a new identity that does not belong to any single real person.

* Synthetic identity fraud is a complex and evolving form of identity theft that poses significant challenges for law enforcement agencies, financial institutions, and individuals. Detecting and preventing this type of fraud requires a multi-layered approach that involves robust identity verification processes, improved data security measures, and increased awareness among individuals to safeguard their personal information.

This notebook presents a Python-based solution to combat synthetic identity fraud.

# **TABLE OF CONTENTS**

---



>[What Is Synthetic Identity Fraud?](#scrollTo=jGCca5uryCH7)

>[Setup: Library Imports](#scrollTo=yDtNLyGAMarO)

>[Load Dataset](#scrollTo=Cu4mZjOcMarS)

>[Data Exploration](#scrollTo=2Is3sWpNWPpF)

>>[Data Visualization](#scrollTo=is0BrIHJ-kT2)

>[Data Cleaning](#scrollTo=59hF-jqZXebA)

>[Data Preparation](#scrollTo=7Q_iDr_rMarT)

>[Test/Train Split Overview](#scrollTo=zVKTBgO6Nh9z)

>>[Why Create A Split Based On Months?](#scrollTo=zVKTBgO6Nh9z)

>[Model Evaluation](#scrollTo=AqCEwHsNMarT)

>>[But WHY are we using these metrics?](#scrollTo=0T2Aaa0Un9Fa)

>[Model Building](#scrollTo=Qx8Mx6YYMarV)

>>[Algorithm Selection](#scrollTo=Pdbh5ZwEnvbN)

>>>[Simplest Solution](#scrollTo=8bw3LbQTn30Y)

>>>[Medium-level Algorithm Selection](#scrollTo=GWvgYnB_9l_E)

>>>[High Complexity Solution](#scrollTo=AqSmTpVN-Nr_)

>>>[Deep Learning Approach](#scrollTo=PjE5xqv79Kwr)

>[Analyzing Model Performance](#scrollTo=KOlWnsQQlS5i)




## **Setup: Library Imports**

---



NumPy (Numerical Python) is a powerful library in Python that provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently. It is a fundamental package for scientific computing in Python.

* [Numpy Reference Material](https://www.w3schools.com/python/numpy/numpy_intro.asp)

* [Numpy Video Tutorial](https://www.youtube.com/watch?v=QUT1VHiLmmI&ab_channel=freeCodeCamp.org)


Pandas is a powerful open-source library in Python that provides high-performance data manipulation and analysis tools. It is built on top of NumPy and provides an easy-to-use data structure called DataFrame, which allows for efficient handling and manipulation of structured data.

* [Pandas reference material](https://www.w3schools.com/python/pandas/default.asp)

* [Pandas video tutorial](https://www.youtube.com/watch?v=vmEHCJofslg&ab_channel=KeithGalli)

---
These imports may seem overwhelming at first glance, but each library is explained in detail before it is used so that its purpose and functionality are clear.

In [None]:
# Import Libraries


# **Load Dataset**

---



# **Data Exploration**

---



**Tasks:**
1. Data Types
2. Unique Values
3. Distribution of the values of features
4. Feature statistics
5. Identify outliers
6. Correlation between features
7. Identify missing data
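
As a rough illustration of how the tasks above map onto Pandas calls, here is a minimal sketch on a made-up toy frame. The column names other than `fraud_bool`, all of the values, and the choice of the 1.5 × IQR outlier rule are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; columns other than `fraud_bool` are hypothetical.
df_toy = pd.DataFrame({
    "income": [0.3, 0.9, 0.5, np.nan, 0.7, 0.1],
    "customer_age": [20, 40, 30, 50, 40, 95],
    "payment_type": ["AA", "AB", "AA", "AC", "AB", "AA"],
    "fraud_bool": [0, 0, 1, 0, 0, 1],
})

dtypes = df_toy.dtypes                          # 1. data types
n_unique = df_toy.nunique()                     # 2. unique values per column
summary = df_toy.describe()                     # 3-4. distribution and statistics (numeric columns)
corr = df_toy.select_dtypes("number").corr()    # 6. correlation between numeric features
missing = df_toy.isna().sum()                   # 7. missing values per column

# 5. Flag outliers with the 1.5 * IQR rule (one common rule among several).
q1, q3 = df_toy["customer_age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df_toy["customer_age"] < q1 - 1.5 * iqr) | (df_toy["customer_age"] > q3 + 1.5 * iqr)
outliers = df_toy[is_outlier]
```

On the real data, the same calls would be applied to `df` once it has been loaded.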

The `ydata-profiling` library (formerly `pandas-profiling`), which builds on Pandas, offers an exploratory data analysis tool through its Profiling Report.

Profiling Report includes:

* **Overview**: Basic information about the dataset, such as the number of variables, unique values, observations, and missing values.

* **Variable types**: Identification of the data types of each variable, such as numerical, categorical, or date/time.

* **Correlations**: Analysis of correlations between pairs of variables, helping to identify any strong relationships or dependencies.

* **Missing values**: Identification of variables with missing values and the percentage of missing values in each variable.

* **Interaction**: Identification of potential interactions between variables.

Profiling Report also provides **visual representations of the data**:
* **Bar Graph**: A visual representation using rectangular bars to show the frequency or count of categorical data. It is commonly used for comparing and displaying categorical data, such as sales performance of different products or customer preferences across categories.

  [Text and Video Resource exploring Bar Graphs in Python ](https://dataindependent.com/pandas/pandas-bar-plot-dataframe-plot-bar/)

* **Heat Map**: A graphical representation using colors to display data values in a matrix or table format. It is useful for visualizing relationships or patterns in large datasets, particularly for showing correlations or associations between variables.

    [Short video explanation on how to interpret Heat Maps](https://www.youtube.com/watch?v=VmlO-GIvEek)

* **Matrix**: A tabular representation of data with rows and columns displaying variables or categories. It is used to display summary statistics, similarity measures, or correlation coefficients between variables.

    [Article explaining correlation matrices](https://www.displayr.com/what-is-a-correlation-matrix/)

    [Detailed text resource for creating correlation matrices on Pandas](https://datatofish.com/correlation-matrix-pandas/)

* **Histogram**: A graphical representation showing the distribution of numerical data through intervals or bins. It helps in understanding the distribution and shape of a numerical variable, such as identifying patterns or skewness in exam scores.

  [Detailed text resource exploring histograms on Pandas](https://sparkbyexamples.com/pandas/pandas-plot-a-histogram/)

  [Short video tutorial for creating histograms on Pandas](https://www.youtube.com/watch?v=ra2pw0qKWvg)


**Additional Resources On Profiling Report**:

* [YouTube video outlining Profiling Report](https://www.youtube.com/watch?v=Ef169VELt5o)
* [Text Resource on Profiling Report](https://towardsdatascience.com/pandas-profiling-easy-exploratory-data-analysis-in-python-65d6d0e23650)


In [None]:
# Data Exploration



# Documentation on Visualization Library
---


Matplotlib is a widely-used data visualization library in Python. It provides a variety of functions and classes for creating high-quality static, animated, and interactive plots. Matplotlib allows you to create a wide range of visualizations, including line plots, scatter plots, bar plots, histograms, heatmaps, contour plots, and more.

Additional Resources:
* [Brief Matplotlib Tutorial](https://www.youtube.com/watch?v=a9UrKTVEeZA)
* [Extensive Matplotlib Tutorial (Video Playlist)](https://www.youtube.com/watch?v=UO98lJQ3QGI&list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_)





Seaborn is a popular data visualization library in Python that is built on top of Matplotlib. It provides a high-level interface for creating aesthetically pleasing and informative statistical graphics. Seaborn simplifies the process of creating visually appealing plots and offers a wide range of built-in features and customization options.





The plot generated by the following code block uses a bar graph to visualize the distribution of fraudulent and legitimate transactions, helping us understand the frequency or occurrence of each type of transaction.

In [None]:
# Any additional visualizations



We can also visualize the missing values for each variable and observe how the distribution of fraudulent and legitimate transactions is related to these missing values.

This plot can provide insights into any potential patterns or associations between missing data and the transaction types.
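
One possible way to compute the numbers behind such a plot is sketched below. It assumes missing values are stored as `NaN` and uses a toy frame with hypothetical feature names; only `fraud_bool` comes from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data; feature names and values are hypothetical.
df_toy = pd.DataFrame({
    "income": [0.3, np.nan, 0.5, np.nan, 0.7, np.nan],
    "credit_score": [100.0, 200.0, np.nan, 150.0, 120.0, 180.0],
    "fraud_bool": [0, 0, 0, 0, 1, 1],
})

features = df_toy.drop(columns="fraud_bool")

# Share of missing entries per feature, computed separately for each class.
missing_by_class = features.isna().groupby(df_toy["fraud_bool"]).mean()
```

Each row of `missing_by_class` is a class (0 = legitimate, 1 = fraudulent) and each column a feature, so a grouped bar chart of this table shows whether missingness differs between the two classes.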

# **Data Cleaning**

---





In [None]:
df = df.dropna(axis=0)
df.shape

(1000000, 32)

In [None]:
# Split data into features and target
X = df.drop(['fraud_bool'], axis=1)
y = df['fraud_bool']

##  **Data Preparation**

---


**Task:** Categorical Features: OneHotEncoding on all the categorical features

**One-hot encoding** is a technique used to represent categorical variables as binary vectors in machine learning and data analysis. It is commonly used when working with categorical data that cannot be directly used as input for certain models or algorithms.

**Material:**

[Text article exploring One-hot encoding in Python](https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/)

[Youtube video](https://www.youtube.com/watch?v=i2JSH5tn2qc)

The **process of one-hot encoding** involves converting each categorical value into a binary vector where only one bit is "on" (1) while the rest are "off" (0). This binary vector representation allows the categorical variable to be used as input in numerical computations.
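
A minimal sketch of this process using `pandas.get_dummies` is shown below. The column names and category values are made up for illustration; scikit-learn's `OneHotEncoder` would be an equally reasonable choice.

```python
import pandas as pd

# Toy data; column names and categories are hypothetical.
df_toy = pd.DataFrame({
    "payment_type": ["AA", "AB", "AC", "AA"],
    "credit_risk_score": [120, 85, 200, 150],
})

categorical_cols = ["payment_type"]

# Each category becomes its own 0/1 column; numeric columns are left untouched.
encoded = pd.get_dummies(df_toy, columns=categorical_cols, dtype=int)
```

Every row of `encoded` has exactly one of the `payment_type_*` columns set to 1, which is the "only one bit is on" property described above.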

In [None]:
# Data Preparation

## **Test/Train Split Overview**

---


We split our data into a feature data frame and target variable column.

We perform a **Test/Train Split** on the data, dividing it into a training set and a testing set. The training set is used to train the model, while the testing set evaluates its performance on new data. This split allows the model to learn patterns and make predictions, while the testing set assesses its ability to generalize and make accurate predictions. It simulates real-world scenarios, measuring the model's performance and its capability to handle unseen data.

## Why Create A Split Based On Months?

Real-world datasets are susceptible to biases and variations over time, which can impact the distribution of features. In dynamic environments like fraud detection, fraudsters constantly adapt their behavior to avoid being caught. This means that the features that were effective in detecting fraud before may become ineffective as fraudsters find new ways to evade detection.

To account for potential variations over time, we split the data into training and test sets based on months. The training set includes data from January to May, while the test set comprises data from June to July. This division helps us assess how well the model performs on new data and generalize beyond the training period.
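
The split described above could be sketched as follows. This assumes a hypothetical integer `month` column in which 1 means January, and it assumes that `month` is used only for splitting and is then dropped from the features; neither is confirmed by this notebook.

```python
import pandas as pd

# Toy data; `month` (1 = January) and `velocity_6h` are hypothetical columns.
df_toy = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6, 7, 6],
    "velocity_6h": [10.0, 12.5, 9.1, 14.2, 11.0, 13.3, 8.7, 10.9],
    "fraud_bool": [0, 0, 1, 0, 0, 0, 1, 0],
})

LAST_TRAIN_MONTH = 5  # January to May for training, June and July for testing

train_df = df_toy[df_toy["month"] <= LAST_TRAIN_MONTH]
test_df = df_toy[df_toy["month"] > LAST_TRAIN_MONTH]

# Assumption: the month is only a splitting key, not a model feature.
X_train = train_df.drop(columns=["fraud_bool", "month"])
y_train = train_df["fraud_bool"]
X_test = test_df.drop(columns=["fraud_bool", "month"])
y_test = test_df["fraud_bool"]
```

Unlike a random split, every test row here comes from a later period than every training row, which is what lets the test set stand in for future data.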

Additional Resources on Test/Train Split:
*   [Article with in-depth description of test-train split](https://builtin.com/data-science/train-test-split)
*   [Video explanation of test-train split ](https://www.youtube.com/watch?v=BAiMKBrFntc)

**Numerical Feature Standardization**

We will standardize the numerical features of our train and test set. We will be using scikit-learn, a popular open-source machine learning library for Python, specifically the StandardScaler class, which transforms numerical data by subtracting the mean and dividing by the standard deviation, resulting in features with zero mean and unit variance.

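This ensures that the transformed features have comparable scales, making them suitable for machine learning models that are sensitive to the scale of the input data. The scaler should be fit on the training set only and then applied to the test set, so that no information from the test period leaks into training.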
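
A small sketch of standardization with `StandardScaler` follows. The arrays are toy values chosen for illustration; the pattern of fitting on the training data and reusing the fitted scaler on the test data is the point.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric features on very different scales.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learn mean and std from the training set
X_test_scaled = scaler.transform(X_test_toy)        # reuse the training statistics on the test set
```

After scaling, each training column has zero mean and unit variance. The test row is expressed in units of the training standard deviation, so its second feature ends up well above zero.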

Here are some resources so that you may develop a broader understanding of Scikit-learn:

*  [Scikit-learn Guide](https://scikit-learn.org/stable/)

*  [In-depth video explanation of Scikit-learn](https://www.youtube.com/watch?v=pqNCD_5r0IU)




## **Model Evaluation**

---



**Plotting ROC Curve Method**
* The **ROC** curve illustrates the relationship between TPR (true positive rate) and FPR (false positive rate) at various classification thresholds, the cutoffs on the predicted score used to assign a class label. An ideal ROC curve hugs the upper-left corner of the plot, signifying a high TPR at a low FPR.
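
To make the ROC curve concrete, the sketch below computes its points with scikit-learn on a handful of made-up labels and scores. Plotting `fpr` against `tpr` (for example with Matplotlib) would draw the curve itself.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels (1 = fraud) and model scores; values are made up.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: the probability that a random fraud case outscores a random legitimate one.
auc = roc_auc_score(y_true, y_score)
```

Here 13 of the 15 possible (fraud, legitimate) pairs are ranked correctly, so `auc` is 13/15, roughly 0.87.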

**Fairness Metrics Method**

* Evaluates the fairness of a classification model's predictions across different groups or demographics.
* Using the aequitas library, a lightweight library for model fairness analytics, the method builds a **confusion matrix**, a table that summarizes the performance of a classification model by displaying the counts of true positives, true negatives, false positives, and false negatives.
* The method also calculates the Predictive_equality, which measures the disparity in false positive rate between groups. Ideally, this disparity should be low or zero.
* **Additional Resources:**
  * [Basic Introduction To Fairness in Machine Learning](https://www.youtube.com/watch?v=euwc0va-7Vo)
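
The idea behind these fairness metrics can be sketched without aequitas, using plain Pandas. The group labels and predictions below are invented, and summarizing the disparity as the gap between the highest and lowest group FPR is an assumption made for illustration, not necessarily how the notebook's method or aequitas computes it.

```python
import pandas as pd

# Toy predictions for two hypothetical groups.
results = pd.DataFrame({
    "group":  ["<50", "<50", "<50", "<50", ">=50", ">=50", ">=50", ">=50"],
    "y_true": [0, 0, 0, 1, 0, 0, 1, 0],
    "y_pred": [0, 1, 0, 1, 0, 0, 1, 0],
})

# Confusion matrix: rows are actual labels, columns are predicted labels.
confusion = pd.crosstab(results["y_true"], results["y_pred"])

def false_positive_rate(frame):
    """Share of legitimate (y_true == 0) rows that were flagged as fraud."""
    negatives = frame[frame["y_true"] == 0]
    return float((negatives["y_pred"] == 1).mean())

fpr_by_group = {name: false_positive_rate(g) for name, g in results.groupby("group")}
fpr_gap = max(fpr_by_group.values()) - min(fpr_by_group.values())
```

In this toy example one of three legitimate customers in the `<50` group is flagged, while none in the `>=50` group are, so the model is harsher on one group even though its overall error count is small.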



**Evaluation Function Concepts**
* The Evaluation function ties the other two methods together. It calculates predictive equality and plots the ROC, as well as the **AUC** or "area under the curve," which summarizes the ROC curve, representing the model's overall performance. Higher AUC indicates better discrimination ability, while 0.5 suggests random performance and 1 represents a perfect model.
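
One way such an evaluation could choose an operating point is to cap the FPR and report the best TPR reachable under that cap. The helper below is a hypothetical sketch of that idea; its name, the 5% default, and the toy scores are assumptions rather than the notebook's actual evaluation function.

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(y_true, y_score, max_fpr=0.05):
    """Return (TPR, FPR, threshold) for the best operating point with FPR <= max_fpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    within_budget = fpr <= max_fpr          # the first ROC point has FPR 0, so this is never empty
    best = np.argmax(tpr[within_budget])    # highest TPR among the allowed points
    return tpr[within_budget][best], fpr[within_budget][best], thresholds[within_budget][best]

# Toy data: 20 legitimate cases (one with a high score) and 4 fraud cases.
neg_scores = np.concatenate([np.arange(1, 20) * 0.02, [0.8]])
pos_scores = np.array([0.9, 0.85, 0.5, 0.11])
y_true = np.concatenate([np.zeros(20, dtype=int), np.ones(4, dtype=int)])
y_score = np.concatenate([neg_scores, pos_scores])

tpr_sel, fpr_sel, thr_sel = tpr_at_fpr(y_true, y_score, max_fpr=0.05)
```

With 20 legitimate cases, a 5% budget allows a single false positive. The best threshold under that budget is 0.5, which catches three of the four fraud cases.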

# But WHY are we using these metrics?

## Storytime
**Imagine this:** You've saved up for an exciting vacation in a foreign country. Having taken your family out for a great dinner at a seaside restaurant, you couldn't imagine anything going wrong. But when you hand your card to the waiter, disaster strikes. The bank notices an unexpected transaction in a different country and flags your delicious dinner as fraud! As the waiter loudly explains that your card has been declined, your face turns bright red with embarrassment as you feel a growing frustration towards your bank.

\
*Well what does this have to do with the problem we are trying to solve?*

\
Customers strongly dislike being falsely flagged for fraud. Incorrectly labeling a customer as a fraudster is a surefire method to lose them as a client.

\
## The Need For Real World Metrics
**Ultimately**, it's best to use metrics that reflect real-world costs. For example, a common model metric such as "accuracy" would fall short in this problem context. While accuracy incorporates both true positives and true negatives, it treats false positives and false negatives equally. In the context of fraud detection, false positives (incorrectly flagging legitimate transactions) tend to have more immediate and tangible consequences compared to false negatives (missed fraudulent transactions). Therefore, optimizing FPR becomes crucial to minimize the negative impact on customers and the business.

\
FPR and TPR offer a more targeted evaluation, allowing companies to optimize their fraud detection systems to strike a balance between customer satisfaction, revenue protection, and risk mitigation.
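
A tiny worked example shows why accuracy alone can mislead on imbalanced data. The counts are invented: 10 fraudulent cases among 1,000, scored by a "model" that never flags anything.

```python
# Toy labels: 10 fraud cases (1) among 1000 transactions.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a useless model that labels everything as legitimate

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)  # 0.99
tpr = tp / (tp + fn)                # 0.0: no fraud is caught
fpr = fp / (fp + tn)                # 0.0: no customer is wrongly flagged
```

The model scores 99% accuracy while catching no fraud at all. Reporting TPR and FPR side by side exposes this immediately.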

# **Model Building**

---




# Algorithm Selection




**Classification**:
Machine learning classification is a supervised learning technique that focuses on categorizing or assigning discrete class labels to input data based on their features and patterns. Classification algorithms aim to generalize the patterns observed in the training data to make accurate predictions on unseen data, enabling automated decision-making and pattern recognition in various domains.

**Binary Classification:** This type of classification involves dividing the data into two distinct classes. Examples include classifying emails as spam or not spam, predicting whether a customer will churn or not, or determining if a credit card transaction is fraudulent or legitimate.

**Multiclass Classification:** In multiclass classification, the data is divided into more than two classes. Each instance is assigned to one and only one class. Examples include classifying images into different object categories, recognizing handwritten digits, or classifying news articles into different topics.

**Imbalanced Classification:** Imbalanced classification refers to situations where the classes are not represented equally in the dataset. Typically, one class (majority class) has a significantly larger number of instances compared to the other class(es) (minority class(es)). Imbalanced classification algorithms are specifically designed to handle such scenarios, where the focus is on accurately identifying the minority class.

**Multi-label Classification:** In multi-label classification, each instance can be assigned multiple labels or categories simultaneously. This means that an instance can belong to more than one class. Examples include assigning tags to documents, predicting the presence of multiple diseases in medical diagnosis, or labeling images with multiple objects present.

**Material**:

[In-depth text explanation of Classification](https://www.simplilearn.com/tutorials/machine-learning-tutorial/classification-in-machine-learning)

[Youtube video on Classification ](https://www.youtube.com/watch?v=xG-E--Ak5jg)


We have provided you with four distinct solutions to our synthetic identity fraud problem, which you can explore through the rest of this module.




## **Simplest Solution**

For the simplest solution, we have decided to approach this problem through a Classification algorithm known as Logistic Regression.


**Logistic Regression** is a classification algorithm that predicts the probability of an instance belonging to a specific class using the logistic function. It is widely used for binary classification tasks, offering simplicity and interpretability in modeling the relationship between input features and class probabilities.

[Text example of Logistic Regression](https://www.w3schools.com/python/python_ml_logistic_regression.asp)

[Youtube tutorial of Logistic Regression](https://www.youtube.com/watch?v=XnOAdxOWXWg&t=12s)
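
A minimal sketch of fitting a logistic regression with scikit-learn is shown below. It uses synthetic, imbalanced data from `make_classification` as a stand-in for the real features, and the hyperparameters (`class_weight="balanced"`, `max_iter=1000`) are illustrative assumptions rather than the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data with roughly 5% positives.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te = X_toy[:1500], X_toy[1500:]
y_tr, y_te = y_toy[:1500], y_toy[1500:]

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Probability of the positive (fraud) class for each held-out row.
proba = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba)
```

The fitted coefficients in `clf.coef_` indicate how each feature pushes the predicted probability up or down, which is the interpretability advantage mentioned above.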


## **Medium-level Algorithm Selection**

Down below, you will use the XGBoost library, more specifically the **XGBoost Classifier**.

**XGBoost** is an open-source gradient boosting library that provides a powerful and efficient framework for training and applying gradient boosting models, capable of handling diverse machine learning tasks with high accuracy and speed.

[Video resource for XGboost Library](https://www.youtube.com/watch?v=GrJP9FLV3FE&ab_channel=StatQuestwithJoshStarmer)

[Xgboost document text article](https://xgboost.readthedocs.io/en/stable/get_started.html)


**XGBoost Classifier** is a specific implementation of the XGBoost library designed for classification tasks, utilizing the gradient boosting framework to build highly accurate and efficient classification models. It offers enhanced performance through various optimization techniques, handling both binary and multiclass classification problems with the flexibility to handle imbalanced datasets and missing values.

[Xgboost classifier video resource](https://www.youtube.com/watch?v=2Ou7gcqTqBE)
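
The sketch below illustrates gradient-boosted trees on imbalanced data. To keep it dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in for `xgboost.XGBClassifier`; the two are different implementations of the same family of models. The data is synthetic and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with roughly 10% positives.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)

# Up-weight the minority class by the negative-to-positive ratio.
n_neg, n_pos = np.bincount(y_toy)
imbalance_ratio = n_neg / n_pos
sample_weight = np.where(y_toy == 1, imbalance_ratio, 1.0)

booster = GradientBoostingClassifier(
    n_estimators=50, max_depth=3, learning_rate=0.1, random_state=0
)
booster.fit(X_toy, y_toy, sample_weight=sample_weight)
fraud_scores = booster.predict_proba(X_toy)[:, 1]
```

`XGBClassifier` accepts `n_estimators`, `max_depth`, and `learning_rate` as well, and the same ratio can be passed to it through its `scale_pos_weight` parameter instead of per-sample weights.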





##  **High Complexity Solution**

Down below, the **RandomForest Classifier** is used:

Random Forest is an ensemble learning algorithm that combines multiple decision trees to make predictions. It uses random subsets of the training data and features to reduce overfitting and provides accurate and robust predictions for both classification and regression tasks.



[Text resource for RandomForest](https://www.ibm.com/topics/random-forest)

[Youtube explanation of RandomForest](https://www.youtube.com/watch?v=eM4uJ6XGnSM&t=22s)

RandomForestClassifier is a specific implementation of the Random Forest algorithm for classification tasks. It combines the predictions of multiple decision trees, each trained on a random subset of the training data, to produce a reliable and robust classifier. RandomForestClassifier is known for its ability to handle high-dimensional data, feature interactions, and noisy or unbalanced datasets, making it a popular choice for classification problems in machine learning.

[Text resource for RandomForest Classifier](https://www.datacamp.com/tutorial/random-forests-classifier-python)

[Short video resource for RandomForest Classifier](https://www.youtube.com/watch?v=x9pIM2GkbF4)
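
A minimal `RandomForestClassifier` sketch on synthetic data follows. The number of trees and the use of `class_weight="balanced"` are illustrative assumptions, not the notebook's configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with roughly 10% positives.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, n_redundant=2,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)

# Each of the 50 trees is trained on a bootstrap sample and considers random feature subsets.
forest = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
forest.fit(X_toy, y_toy)

fraud_scores = forest.predict_proba(X_toy)[:, 1]  # average vote of the trees
importances = forest.feature_importances_          # relative contribution of each feature
```

The feature importances sum to 1 and give a quick, if rough, view of which inputs the ensemble relies on most.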

## **Deep Learning Approach**
The next approach utilizes **Keras**, a high-level deep learning framework used for building and training neural network models. In classification problems, the approach involves data preparation, defining the model architecture, compiling the model with loss function and optimizer, training the model on the prepared data, and evaluating its performance using metrics like accuracy or F1 score. The model can then be fine-tuned and deployed for making predictions on new data.

## Utility Functions For Keras Models

The **F1 score** is a metric commonly used to evaluate the performance of a binary classification model. It is the harmonic mean of precision and recall, so it is high only when both are high.

* **Precision** measures the proportion of true positive predictions among all positive predictions. It represents the model's ability to correctly identify positive instances. Precision is calculated as the ratio of true positive predictions to the sum of true positive and false positive predictions.

* **Recall**, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive instances. It represents the model's ability to capture all positive instances. Recall is calculated as the ratio of true positive predictions to the sum of true positive and false negative predictions.
* **Additional Resources:**
  * [Precision, Recall, & F1 Score Intuitively Explained](https://www.youtube.com/watch?v=8d3JbbSj-I8)
  * [What are Precision and Recall in Machine Learning?](https://www.youtube.com/watch?v=FWBoW04gyew)
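
The definitions above translate directly into a few lines of Python. The helper name and the confusion counts below are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Toy counts: 8 frauds caught, 2 false alarms, 8 frauds missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 and recall is 0.5, giving an F1 of about 0.62. The harmonic mean sits closer to the weaker of the two values than a simple average (0.65) would.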

\
**Focal loss** is a specialized loss function used in classification tasks to address class imbalance. It down-weights examples the model already classifies confidently, so that hard, misclassified examples, particularly from the minority class (in our case, fraudulent transactions), contribute more to the loss. This emphasis encourages the model to pay closer attention to the minority class and improve its ability to correctly classify those instances.
* **Additional Resources:**
  * [Focal Loss — What, Why, and How?](https://medium.com/swlh/focal-loss-what-why-and-how-df6735f26616)
  * [Focal Loss: A better alternative for Cross-Entropy](https://towardsdatascience.com/focal-loss-a-better-alternative-for-cross-entropy-1d073d92d075)
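
For intuition, the standard binary focal loss can be written in a few lines of NumPy. This is a sketch of the general formula, not the notebook's Keras implementation, and the defaults `gamma=2.0` and `alpha=0.25` are commonly used values assumed here for illustration.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-example focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)                 # probability assigned to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)     # class-dependent weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Two fraud cases: one predicted confidently (easy), one missed badly (hard).
y = np.array([1, 1])
p = np.array([0.9, 0.1])

focal = binary_focal_loss(y, p)
cross_entropy = -np.log(np.where(y == 1, p, 1 - p))
```

The factor `(1 - p_t)**gamma` shrinks the easy example's loss by a factor of 100 but the hard example's by only about 1.2, so training is dominated by the cases the model gets wrong. With `gamma=0` the expression reduces to a class-weighted cross-entropy.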

* **Compilation:** The compilation step in the code sets up the model for training. It defines the loss function, the optimizer, and the metrics that will be used to evaluate the model's performance during training.
  * The **loss function**, in this case "binary_crossentropy," measures the difference between the predicted and actual outputs and guides the model to minimize this difference.
  * The **optimization function**, in this case "Adam" with a learning rate of 1e-2, determines how the model's weights are updated based on the loss function's value.

* **Training Loop:** The training loop is responsible for actually training the model on the provided dataset. It includes techniques like *EarlyStopping*, which monitors the model's performance during training and stops training if the performance does not improve for a certain number of epochs, thus preventing overfitting. *Class weights* are calculated to address class imbalance, giving more importance to the minority class during training. The model is trained for a specified number of *epochs*, which represent the number of times the entire dataset is passed through the model during training.

* **Evaluation:** The evaluation step measures the model's performance on the test dataset. It generates predictions using the trained model on the test data and then evaluates these predictions using the performance metric functions (AUC, etc) discussed earlier.

* **Additional Resources:**
  * [What are Optimizers in deep learning?](https://www.youtube.com/watch?v=JhQqquVeCE0)
  * [What are Loss functions in machine learning?](https://www.youtube.com/watch?v=JhQqquVeCE0)
  * [How to choose an optimizer for a Tensorflow Keras model?](https://www.youtube.com/watch?v=pd3QLhx0Nm0)
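
Two ideas from the training loop, class weights and early stopping, can be illustrated in plain Python. Both helpers below are simplified, hypothetical sketches: the weighting uses the common "balanced" heuristic, and the stopping rule only mimics the patience logic rather than reproducing Keras's `EarlyStopping` callback.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the patience counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1        # ran through all epochs

weights = balanced_class_weights([0] * 990 + [1] * 10)
stop_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.61], patience=2)
```

With 10 fraud cases in 1,000, each fraud example is weighted about 100 times more than a legitimate one. In the toy loss history, the validation loss last improves at epoch 2, so training stops at epoch 4 after two epochs without improvement.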


## Machine Learning Layers
**Layers** in machine learning are basic units that process input data and contribute to the learning and decision-making process in neural networks.

* **Dense**: a dense layer is a fundamental building block of a neural network. It is also known as a fully connected layer or a linear layer. The purpose of a dense layer is to transform the input data by performing a linear operation followed by an activation function.

* **Dropout**: a regularization technique that randomly sets a fraction of input units to 0 during training. This means that for each training sample, some neurons in the network are "dropped out" or temporarily ignored. **Dropout helps prevent overfitting** by reducing the co-adaptation of neurons. By randomly dropping out neurons, the network becomes more robust and less likely to rely on specific neurons for making predictions, thus improving generalization. During testing or inference, dropout is turned off, and the full network is used for predictions.

* **Batch normalization**: a technique used to normalize the inputs of each layer in a neural network. It normalizes the input by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. This helps stabilize and speed up the training process by reducing the internal covariate shift. Additionally, it provides regularization effects by adding some noise to the network and reducing the impact of individual mini-batch examples. Batch Normalization can improve model convergence, allow for higher learning rates, and make the model more robust to changes in input distributions.

**Additional Resources**:


*   [Introduction to Convolutional Neural Network (CNN) using Tensorflow](https://towardsdatascience.com/introduction-to-convolutional-neural-network-cnn-de73f69c5b83#:~:text=Dense%20Layer%20is%20simple%20layer,multiple%20number%20of%20such%20neurons.)
*   [Batch normalization in 3 levels of understanding](https://towardsdatascience.com/batch-normalization-in-3-levels-of-understanding-14c2da90a338)
*   [Dropout in Neural Networks](https://towardsdatascience.com/dropout-in-neural-networks-47a162d621d9)
*   [Dense layers explained in a simple way](https://medium.com/datathings/dense-layers-explained-in-a-simple-way-62fe1db0ed75)
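
The three layer types can be sketched as forward passes in NumPy. This is a simplified illustration, not Keras's implementation: the dense layer uses a ReLU activation as an assumed choice, and the batch normalization omits the learnable scale and shift parameters and the running statistics used at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """Linear operation followed by a ReLU activation."""
    return np.maximum(0.0, x @ W + b)

def dropout(x, rate, rng, training):
    """Randomly zero a fraction `rate` of units during training; identity at inference."""
    if not training:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)   # rescale so the expected activation is unchanged

def batch_norm(x, eps=1e-5):
    """Normalize each feature using the mini-batch mean and variance."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

x = rng.normal(size=(8, 4))               # mini-batch of 8 samples, 4 features
W, b = rng.normal(size=(4, 3)), np.zeros(3)

hidden = dense(x, W, b)
normalized = batch_norm(hidden)
dropped = dropout(hidden, rate=0.5, rng=rng, training=True)
```

With a rate of 0.5, every entry of `dropped` is either zero or twice the corresponding entry of `hidden`. Calling `dropout` with `training=False` returns its input unchanged, matching the inference behavior described above.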




# **Analyzing Model Performance**

---




## Logistic Regression
* AUC: 0.8779388959897662
* TPR:  49.69%
* FPR:  5.0%
* Threshold:  0.76

The logistic regression model has a relatively high AUC, indicating good performance in distinguishing between positive and negative samples. The TPR of 49.69% suggests that it correctly identifies approximately half of the positive cases, while the FPR of 5.0% implies that it misclassifies 5.0% of the negative cases as positive. The chosen threshold of 0.76 determines the classification boundary.

## XGBoost
* AUC: 0.7721620028119188
* TPR:  17.65%
* FPR:  2.85%
* Threshold:  0.0

The XGBoost model has a lower AUC compared to logistic regression, indicating a somewhat weaker performance. The TPR of 17.65% suggests that it captures a smaller proportion of the positive cases, while the FPR of 2.85% indicates a lower rate of false positives. The reported threshold of 0.0 is most likely a small positive value rounded to two decimals, which would mean the model's predicted fraud probabilities are concentrated near zero.

## Random Forest
* AUC: 0.6313642342056329
* TPR:  0.0%
* FPR:  1.07%
* Threshold:  0.02

The random forest model has the lowest AUC among the analyzed models, indicating the weakest discriminatory power. The TPR of 0.0% suggests that it fails to identify any positive cases, which is a significant limitation. However, the FPR of 1.07% indicates a relatively low rate of false positives.

## Keras
* AUC: 0.6824300421787837
* TPR:  23.53%
* FPR:  2.65%
* Threshold:  0.76

The Keras model's AUC falls between that of XGBoost and random forest, indicating moderate performance. The TPR of 23.53% suggests it captures a larger proportion of positive cases compared to XGBoost. The FPR of 2.65% indicates a relatively low rate of false positives. Similar to logistic regression, the chosen threshold of 0.76 determines the classification boundary for this model.

## Best Overall Performance
Overall, based on the provided metrics, the logistic regression model appears to have the best performance among the analyzed models, with the highest AUC and relatively balanced TPR and FPR. However, it's important to note that the interpretation and comparison of model performance should take into consideration the specific context of deployment.




# TODO
- Hyperparameter tuning for the models