# Machine Learning Basics for Chemical Engineering Research



## Problem statement
Fraud is a problem for any bank. It takes many forms: a single stolen credit card, large batches of stolen card numbers used on the web, or a mass compromise of card numbers stolen from a merchant with tools such as card-skimming devices. This notebook provides analysis and insights into fraudulent activity in credit card transactions, using credit card transaction data for the challenge.
<br>

The sections listed in the table of contents give a general description of the collected data, a detailed illustration and analysis of the dataset, and a predictive model that determines whether a given transaction is fraudulent.

### Table of Contents

I.    [Programmatically load dataset](#Load)<br>
II.   [Plot and general analysis of dataset](#Plot)<br>
III.  [Data Wrangling - Duplicate Transactions](#Clean)<br>
IV.   [Fraud detection model](#Model)<br>
V.   [Future Prospective](#Future)<br>
VI.   [Reference](#Ref)

### Bullet points and logic path
**Question 1: Load**
<br>`a.` The dataset is downloaded through git and parsed into a **pandas DataFrame**
<br>`b.` The **Shape** of the dataset and the **Type** and **Count** of individual features are provided; features are categorized into four groups according to data type
<br>`c.` **Num of Null**, **Min**, **Max**, **UniqueValue** and **numerical statistics** of individual features are summarized
<br>`d.` Empty columns **echoBuffer**, **merchantCity**, **merchantState**, **merchantZip**, **posOnPremises**, **recurringAuthInd** are removed
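
The summary and column-removal steps above can be sketched roughly as follows. This is a minimal illustration on a made-up four-row DataFrame, not the notebook's actual loading code; the toy values, the `transactionAmount` column name, and the assumption that empty columns hold blank strings or NaN are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the transaction data; all values are made up.
df = pd.DataFrame({
    "transactionAmount": [12.5, 250.0, 3.2, 40.0],
    "merchantName": ["A", "B", "A", "C"],
    "echoBuffer": ["", "", "", ""],
    "merchantCity": [np.nan, np.nan, np.nan, np.nan],
})

# Assumption: blank strings should be treated as missing values.
cleaned = df.mask(df.eq(""))

# Per-feature summary: data type, number of nulls, number of unique values.
summary = pd.DataFrame({
    "dtype": cleaned.dtypes.astype(str),
    "n_null": cleaned.isna().sum(),
    "n_unique": cleaned.nunique(),
})

# Columns with no usable values at all are dropped.
empty_cols = [c for c in cleaned.columns if cleaned[c].isna().all()]
cleaned = cleaned.drop(columns=empty_cols)
```

On the toy data, `empty_cols` contains `echoBuffer` and `merchantCity`, and `summary` has one row per original column.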

**Question 2: Plot**
<br>`a.` A **histogram** and a **box plot** of the processed amounts are provided
<br>`b.` **Histograms** of fraudulent vs. normal transactions are compared
<br>`c.` The majority of numeric features follow **asymmetric distribution patterns**, usually **right-skewed**
<br>`d.` A **correlation heat map** of the features is provided
<br>`e.` **Hypothesis 1:** Fraudulent transactions are more likely to involve larger amounts.
<br>`f.` **Hypothesis 2:** If cardCVV is not equal to the entered CVV, the corresponding transaction is highly likely to be fraudulent.
<br>`g.` **Hypothesis 3:** Features are not independent of each other (currentbalance, creditlimit, availablemoney)
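
One way to probe Hypotheses 2 and 3 is sketched below on a made-up six-row table. The column names `enteredCVV`, `isFraud`, `currentBalance` and `availableMoney`, and the relation `availableMoney = creditLimit - currentBalance`, are assumptions for illustration; the real dataset may differ.

```python
import pandas as pd

# Toy data; every value here is made up for illustration.
toy = pd.DataFrame({
    "creditLimit":    [1000.0, 1000.0, 5000.0, 5000.0, 10000.0, 10000.0],
    "currentBalance": [200.0, 900.0, 1000.0, 4500.0, 2500.0, 8000.0],
    "cardCVV":        ["123", "456", "789", "321", "654", "987"],
    "enteredCVV":     ["123", "450", "789", "321", "654", "980"],
    "isFraud":        [False, True, False, False, False, False],
})

# Assumption: available money is the limit minus the current balance.
toy["availableMoney"] = toy["creditLimit"] - toy["currentBalance"]

# Hypothesis 3: pairwise Pearson correlation of the balance-related features.
corr = toy[["creditLimit", "currentBalance", "availableMoney"]].corr()

# Hypothesis 2: fraud rate for matching vs. mismatching CVV entries.
toy["cvvMismatch"] = toy["cardCVV"] != toy["enteredCVV"]
fraud_rate = toy.groupby("cvvMismatch")["isFraud"].mean()
```

A large off-diagonal entry in `corr` suggests dependence between features, and a clearly higher fraud rate in the mismatch group would support Hypothesis 2. The toy numbers only show the mechanics and say nothing about the real data.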

**Question 3: Data Wrangling**
<br>`a.` The dataset contains duplicate records beyond reversal transactions and multi-swipe transactions
<br>`b.` There are **20303** records of reversal transactions, with a total amount of **2821792** dollars
<br>`c.` There are **2477** records of multi-swipe transactions, with a total amount of **389751** dollars (**a repeated transaction within less than 2 minutes of the previous one is treated as a multi-swipe**)
<br>`d.` Fraudulent transactions are more common among **REVERSE** transactions
<br>`e.` The **Time** and **merchantName** features are reshaped before modeling
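
The 2-minute multi-swipe rule above could be implemented along these lines. This is a sketch on made-up rows, not the notebook's actual code; the column names and the `"REVERSAL"` label are assumptions, as is the choice to match repeats on account, merchant and amount.

```python
import pandas as pd

# Toy transactions; all values are made up for illustration.
tx = pd.DataFrame({
    "accountNumber": [1, 1, 1, 2, 2],
    "merchantName": ["A", "A", "A", "B", "B"],
    "transactionAmount": [20.0, 20.0, 20.0, 75.0, 75.0],
    "transactionDateTime": pd.to_datetime([
        "2016-01-01 10:00:00",
        "2016-01-01 10:01:00",
        "2016-01-01 10:30:00",
        "2016-01-02 09:00:00",
        "2016-01-05 09:00:00",
    ]),
    "transactionType": ["PURCHASE", "PURCHASE", "PURCHASE", "PURCHASE", "REVERSAL"],
})

# Assumption: a repeat is the same account, merchant and amount.
keys = ["accountNumber", "merchantName", "transactionAmount"]
tx = tx.sort_values(keys + ["transactionDateTime"]).reset_index(drop=True)

# Time since the previous transaction within the same group (NaT for the first).
gap = tx.groupby(keys)["transactionDateTime"].diff()

tx["isReversal"] = tx["transactionType"].eq("REVERSAL")
# Multi-swipe: a non-reversal repeat less than 2 minutes after the previous one.
tx["isMultiSwipe"] = gap.lt(pd.Timedelta(minutes=2)) & ~tx["isReversal"]

multi_swipe_total = tx.loc[tx["isMultiSwipe"], "transactionAmount"].sum()
reversal_total = tx.loc[tx["isReversal"], "transactionAmount"].sum()
```

Only the repeated charge is flagged, not the first swipe, so the totals count the extra amount rather than the original purchase.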

**Question 4: Model**
<br>`a.` Models are designed to emphasize **Sensitivity**
<br>`b.` **Clustering** was tried, but **failed** due to the **curse of dimensionality**
<br>`c.` The **positively skewed distribution** pattern of specific features is emphasized
<br>`d.` A simple Logistic Regression model is built, with AUC 0.510732
<br>`e.` The **Logistic Regression Model** is optimized with the **SMOTE** technique to deal with the **imbalanced class distribution**
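
To illustrate the idea behind SMOTE, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. It is a simplified NumPy version for illustration only, not the implementation used in this notebook or in any library; the function name `smote_sketch` and its parameters are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority points.

    Simplified sketch: requires more than k minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point, then one of its k nearest neighbors at random.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the synthetic point somewhere on the segment between the two.
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Made-up minority (fraud) feature vectors.
X_fraud = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_sketch(X_fraud, n_new=10)
```

Oversampling of this kind is normally applied to the training split only, so that the test set keeps the original class imbalance.
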
<br>
<br> **Performance of optimized model:**
- Accuracy: 0.98
- Precision: 0.13
- Recall: 0.071
- Average precision-recall score: 0.06
- **AUC** : 0.75
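
With heavily imbalanced classes, accuracy can be high while precision and recall stay low, because correctly classified normal transactions dominate the total. The sketch below computes the three metrics from confusion-matrix counts; the counts are made up to show the pattern and are not the confusion matrix of the model above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts: 10,000 transactions, 150 of them fraudulent.
metrics = classification_metrics(tp=10, fp=60, fn=140, tn=9790)
```

Here the 9,790 true negatives alone push accuracy to 0.98, even though only 10 of the 150 fraudulent transactions are caught.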

**Future Prospective**
<br>`a.` The multi-normal distribution pattern of the **creditLimit** feature should be reshaped and modified for further improvement
<br>`b.` Explore other techniques for dealing with **asymmetrically distributed features**, such as the **square root** transformation
<br>`c.` Reshape the dataset and use **MySQL** to check for multi-swipes more efficiently
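
As a rough illustration of the square root transformation mentioned above, the sketch below compares the sample skewness of a right-skewed feature before and after the transform. The data is synthetic (drawn from an exponential distribution as a stand-in for a transaction amount), so the numbers do not describe the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature standing in for a transaction amount.
rng = np.random.default_rng(0)
amount = pd.Series(rng.exponential(scale=100.0, size=5000))

skew_raw = amount.skew()            # strongly positive for right-skewed data
skew_sqrt = np.sqrt(amount).skew()  # closer to zero after the transform
```

The square root is defined only for non-negative values, so a feature that can be negative would need to be shifted first or handled with a different transformation.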

In [None]:
# Standard library
import math
import os
from zipfile import ZipFile

# Third-party libraries
import matplotlib.pyplot as plt
import matplotlib.style as style
import numpy as np
import pandas as pd
import seaborn as sns