## *Fraud Analysis* Machine Learning Models Project
## <font size=5 color='gray'>Daniel Behar</font>

The goal of this project is to select and build the best algorithm for predicting whether a given transaction is fraudulent. This is the first project of the Machine Learning Models course.
#### Structure of the notebook:
* `Libraries`: Includes a brief description of where each library is used in the process
* `Data`: Loading the data, splitting it into train and test sets, and exploring it
* `Cleaning Pipelines`: Building the three pipelines that I used
* `Model`: Building the pipeline that contains my model

## Importing Libraries

In [None]:
#General use libraries
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import numpy as np

#To split into train and test
from sklearn.model_selection import train_test_split

#Full Pipeline
from sklearn.compose import ColumnTransformer

#Dates Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

#Numeric Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

#Categoric Pipeline
from sklearn.preprocessing import OneHotEncoder

#Model
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
import joblib

%matplotlib inline

## Data
* Getting the data
* Extracting isFraud from the Train set (Test doesn't have it)
* Extracting id from the Test set (it will be needed later)
* Creating train and test sets from the Train dataset

In [None]:
#data is used to build and train the model; datat is the data for which I want to predict the isFraud variable
data = pd.read_csv("Train.csv")
datat = pd.read_csv("Test.csv")

In [None]:
data.head(5)

In [None]:
y = data.loc[:,"isFraud"]
id = datat.loc[:,"id"]
data.drop(["id", "currentExpDate", "merchantName", "isFraud", "transactionDateTime", "accountOpenDate", "dateOfLastAddressChange"],axis=1,inplace=True)
datat.drop(["id", "merchantName", "currentExpDate", "transactionDateTime", "accountOpenDate", "dateOfLastAddressChange"],axis=1,inplace=True)
print(data.shape)
print(y.shape)
print(datat.shape)

After reviewing the model, these variables were not really useful, so I removed them (`isFraud` is the exception: it is the target and is already stored in `y`):
- id
- currentExpDate
- merchantName
- isFraud
- transactionDateTime
- accountOpenDate
- dateOfLastAddressChange

In [None]:
data.head(5)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=20)
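Fraud labels tend to be heavily imbalanced, so a plain random split can leave the two sets with slightly different fraud rates. The following is only a sketch on made-up data (`toy_X` and `toy_y` are hypothetical, not the project's data) of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10% positives (not the project's data)
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([1] * 10 + [0] * 90)

# stratify=toy_y asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=20, stratify=toy_y
)

train_rate = y_tr.mean()  # share of positives in the training split
test_rate = y_te.mean()   # share of positives in the test split
print(train_rate, test_rate)
```

Whether stratifying changes the results on the real data is something that would have to be checked; the split used in this notebook is not stratified.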

### Exploring data
* Searching for NA values and abnormal observations
* Searching for relevant information about the data

In [None]:
np.unique(y)

In [None]:
# % of NAs for each column
nas = pd.DataFrame(data.isna().sum()/data.shape[0], columns = ['%NAs']).reset_index()
nas
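The share of NAs per column can also be obtained with `isna().mean()`, because the mean of a boolean mask is the fraction of `True` values. A small sketch on a made-up frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with some missing values
toy = pd.DataFrame({
    "posEntryMode": [2.0, np.nan, 5.0, 9.0],
    "acqCountry": ["US", "US", None, None],
})

# Fraction of missing values in each column
na_share = toy.isna().mean()
print(na_share)
```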

In [None]:
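def unique(column):
    # Frequency of each distinct value in the column
    return data[column].value_counts()

# Print both results; a notebook cell only displays its last expression
print(unique("posEntryMode"))
print(unique("posConditionCode"))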

In [None]:
# Count, for each string column, how many values contain a '/'
cols = []
val = []
for col in data.select_dtypes(include='object').columns:
    cols.append(col)
    val.append(data[col].str.contains(r'/').sum())
pd.DataFrame({
    'cols':cols,
    'val':val
})

Data behavior

In [None]:
data.describe()

In [None]:
# data.hist creates its own figure, so no separate plt.figure() call is needed
data.hist(figsize=(8,8))
plt.show()

In [None]:
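# Pearson correlation is only defined for numeric columns, so select them explicitly
corr_df = data.select_dtypes(include='number').corr(method='pearson')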
plt.figure(figsize=(8, 6))
sns.heatmap(corr_df, annot=True)
plt.show()

Distribution of Trues and Falses in the data

In [None]:
def count(lst):
    return sum(bool(x) for x in lst)
 
# Number of fraudulent transactions, then their share of all transactions
print(count(y))
print(count(y)/len(y))

In [None]:
# Percentage of fraudulent transactions (count() is defined in the previous cell)
print((count(y)/len(y))*100)

fig = plt.figure(figsize=(10,10))
y.value_counts().plot(kind='bar', title='True/False relation')
plt.show()
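Assuming `isFraud` is boolean (or 0/1), the same proportions can be read directly from pandas without a helper function. A minimal sketch on a made-up series (`toy_y` is hypothetical):

```python
import pandas as pd

# Made-up labels: 2 fraudulent transactions out of 10
toy_y = pd.Series([False, False, False, True, False,
                   False, False, False, True, False])

# The mean of a boolean series is the share of True values
fraud_share = toy_y.mean()

# Relative frequency of each class
shares = toy_y.value_counts(normalize=True)
print(fraud_share)
print(shares)
```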

In [None]:
print(data.dtypes)

## Pipelines to clean the data

### Numeric Pipeline

In [None]:
numeric_pipeline = Pipeline([
                            ('Imputador', SimpleImputer(strategy="most_frequent")),
                            ('std_scaler', StandardScaler()),
                        ])

`posEntryMode` and `posConditionCode` are the only numeric columns that have NAs. Because both are discrete codes, the only valid imputation strategy is the mode, i.e. `strategy="most_frequent"`.
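To illustrate the point, here is a small sketch on made-up codes (the values are invented, not taken from the dataset). It compares the mode with the mean, which can produce a value that is not one of the existing codes:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical discrete codes with one missing value
codes = np.array([[2.0], [5.0], [5.0], [np.nan], [9.0]])

# The mode fills the gap with a code that already exists in the column
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(codes)

# The mean fills the gap with an average that is not a real code
mean_imputed = SimpleImputer(strategy="mean").fit_transform(codes)

print(mode_imputed[3, 0], mean_imputed[3, 0])
```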

In [None]:
numerical = numeric_pipeline.fit_transform(data.select_dtypes(include='number'))
numerical[0,:]

### Categoric Pipeline

In [None]:
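categoric_pipeline = Pipeline([
                        ('Imputador', SimpleImputer(strategy="most_frequent")),
                        # Ignore categories not seen during fit instead of raising an error on new data
                        ('ohe', OneHotEncoder(handle_unknown='ignore'))
                        ])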

Testing the pipeline: `categoricas` contains the raw columns and `categorical` contains the processed data.

In [None]:
categoricas = data[["acqCountry", "merchantCountryCode", "merchantCategoryCode", "transactionType", "cardPresent", "expirationDateKeyInMatch"]]
categorical = categoric_pipeline.fit_transform(categoricas.values)
categorical.toarray()[0,:]
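The encoded row is easier to interpret on a tiny example. The sketch below runs the same two steps (mode imputation, then one-hot encoding) on a single made-up column; `toy_pipeline` and `toy_col` are hypothetical names:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder()),
])

# One made-up string column with a missing value in the last row
toy_col = np.array([["US"], ["MEX"], ["US"], [np.nan]], dtype=object)

# The missing value is filled with the mode, then each category gets its own column
encoded = toy_pipeline.fit_transform(toy_col).toarray()
categories = toy_pipeline.named_steps["ohe"].categories_[0]

print(categories)
print(encoded)
```

Each output column corresponds to one entry of `categories_`, so the last row ends up encoded like the most frequent category.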

### Full pipeline

- The date columns are strings, but I can't send them through the categorical pipeline because that isn't useful to me
- I need to pick specific string columns in order to clean the data correctly
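The two points above can be sketched on a made-up frame. The column `amount` and all the values are hypothetical; the sketch only shows that selecting columns by dtype also picks up date strings, which is why they have to be dropped (or handled separately) before the `ColumnTransformer` is built:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up frame: one numeric, one categorical and one date-like string column
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],
    "transactionType": ["PURCHASE", "REVERSAL", "PURCHASE"],
    "accountOpenDate": ["2014-01-05", "2015-03-10", "2016-07-21"],
})

# Selecting by dtype alone also returns the date strings
cat_all = list(toy.select_dtypes(exclude="number").columns)

# After dropping the date column, only the real categorical column remains
toy_clean = toy.drop(columns=["accountOpenDate"])
num_cols = list(toy_clean.select_dtypes(include="number").columns)
cat_cols = list(toy_clean.select_dtypes(exclude="number").columns)

# sparse_threshold=0 forces a dense array so the result is easy to inspect
toy_transformer = ColumnTransformer([
    ("numerics", StandardScaler(), num_cols),
    ("categorics", OneHotEncoder(), cat_cols),
], sparse_threshold=0)

out = toy_transformer.fit_transform(toy_clean)
print(cat_all, cat_cols)
print(out)
```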

In [None]:
numerical_attributes = data.select_dtypes(include='number').columns 
categorical_attributes = data.select_dtypes(exclude='number').columns 

full_cleaning_pipeline = ColumnTransformer([
        ("numerics", numeric_pipeline, numerical_attributes),
        ("categorics", categoric_pipeline, categorical_attributes)
    ])

In [None]:
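# Fit the cleaning steps on the training split only, so no information from X_test leaks into them
full_cleaning_pipeline.fit(X_train)
ready_Xtrain = full_cleaning_pipeline.transform(X_train)
ready_Xtest = full_cleaning_pipeline.transform(X_test)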

print(ready_Xtrain.shape)
print(ready_Xtest.shape)

## Model

In [None]:
predictor_pipeline = Pipeline([
        ("data_preparation", full_cleaning_pipeline),
        ("DTC", DecisionTreeClassifier(random_state=10000))
    ])

predictor_pipeline.fit(X_train, y_train)
predicted_vals = predictor_pipeline.predict(X_test)

print('DecisionTreeClassifier:\n')
print('F1: {0}'.format(f1_score(y_test,predicted_vals,average='weighted')))
print('Precision Score: {0}'.format(precision_score(y_test,predicted_vals,average='weighted')))
print('Recall Score: {0}'.format(recall_score(y_test,predicted_vals, average='weighted')))
cm = confusion_matrix(y_test,predicted_vals)

In [None]:
plt.figure(figsize = (10,7))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
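With `average='weighted'`, each class contributes in proportion to its size, so on imbalanced data the scores are dominated by the majority class. The sketch below uses made-up labels and a hypothetical classifier that never flags fraud; it is not a result from this project, only an illustration of why the confusion matrix and per-class scores are worth reading alongside the weighted ones:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

# Made-up labels: 95 legitimate (0) and 5 fraudulent (1) transactions
y_true = np.array([0] * 95 + [1] * 5)

# A hypothetical classifier that predicts "not fraud" for everything
y_pred = np.zeros(100, dtype=int)

# Weighted F1 still looks high because class 0 dominates the average
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Recall on the fraud class shows that no fraud was caught
fraud_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(weighted_f1, fraud_recall)
```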

## Saving model pipeline
- Using `datat` (the Test dataset) to predict the isFraud variable
- Because I uploaded the answers to Kaggle, I need to merge the transaction IDs with the predicted results
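The two steps above can be sketched end to end on a toy model. Everything here is made up for illustration (the data, the ids, the file name `toy_model.pkl`), and the `id`/`isFraud` column layout is an assumption about the submission format, not something confirmed by the competition:

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data and a tiny model (not the project's pipeline)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([False, False, True, True])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save the fitted model and load it back from a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# Hypothetical ids of the rows to score, paired with their predictions
ids = pd.Series([101, 102], name="id")
preds = reloaded.predict(np.array([[0.5], [2.5]]))
submission = pd.DataFrame({"id": ids, "isFraud": preds})
print(submission)
```

Building the frame from named columns keeps the ids and predictions aligned row by row and gives the prediction column an explicit name.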

In [None]:
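# Save the fitted pipeline, then reload it to check that the saved file works
DT_model = predictor_pipeline
joblib.dump(DT_model, "FraudModel.pkl")
FraudModel = joblib.load("FraudModel.pkl")

# Predict on the Kaggle test data and attach the transaction ids
isFraud = FraudModel.predict(datat)
solutions = pd.concat([id, pd.Series(isFraud, name="isFraud")], axis=1)
solutions.to_csv('solutions.csv')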