# Data processing

Our objective is to model the data previously studied in *00_EDA.ipynb* and obtain the probability of fraud for each transaction.

### Libraries

In [1]:
from imblearn.over_sampling import SMOTE, KMeansSMOTE, BorderlineSMOTE
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline 

from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score, classification_report, ConfusionMatrixDisplay, confusion_matrix, accuracy_score, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from collections import Counter

import warnings
warnings.filterwarnings('ignore')

### Pre-processed data


In [2]:
file = '../data/raw/Original_dataset_payments_fraud.csv'
df = pd.read_csv(file, sep=';')

#### Filter sensitive information: drop the gender and race columns

After reviewing European Union data-protection legislation, we decided to remove the race and gender columns.

Source: https://ec.europa.eu/info/law/law-topic/data-protection/reform/rules-business-and-organisations/legal-grounds-processing-data/sensitive-data/what-personal-data-considered-sensitive_en

In [3]:
df = df.drop(columns=['gender', 'race'])

As seen in the EDA, several variables contain NaN values, and there are different ways to handle them. Here we replace the NaN values in *device* and *zone* with 'Unknown', so that missing values are treated as a new category within each of these variables.

In [4]:
df[['device', 'zone']] = df[['device','zone']].fillna('Unknown')
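As a minimal sketch of the same replacement on a toy frame: the sample values below are made up for illustration, and only the `device` and `zone` column names come from the dataset. Counting the missing values before and after is a quick way to confirm the fill worked.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); only the column names match the dataset
sample = pd.DataFrame({
    "device": ["mac", np.nan, "pc"],
    "zone": [np.nan, "capital", "country"],
})

# Count missing values per column before filling
missing_before = sample[["device", "zone"]].isna().sum()

# Replace missing values with an explicit 'Unknown' category
sample[["device", "zone"]] = sample[["device", "zone"]].fillna("Unknown")

# Count again to confirm nothing is left missing
missing_after = sample[["device", "zone"]].isna().sum()
```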

Let's look again at the column types.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 17 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   step              1048575 non-null  int64  
 1   type              1048575 non-null  object 
 2   amount            1048575 non-null  float64
 3   device            1048575 non-null  object 
 4   connection_time   1048575 non-null  object 
 5   nameOrig          1048575 non-null  object 
 6   oldbalanceOrg     1048575 non-null  float64
 7   age               1048575 non-null  int64  
 8   newbalanceOrig    1048575 non-null  float64
 9   zone              1048575 non-null  object 
 10  user_number       1048575 non-null  int64  
 11  nameDest          1048575 non-null  object 
 12  user_connections  1048575 non-null  int64  
 13  security_alert    1048575 non-null  int64  
 14  oldbalanceDest    1048575 non-null  float64
 15  newbalanceDest    1048575 non-null  float64
 16  

In the EDA we saw that nameOrig and nameDest identify the accounts involved in each transaction, and that the letter in each name gives the account type: C for "customer" and M for "merchant". We will drop these two columns and replace them with a new feature that captures the account types on both sides of the transaction: CC (customer to customer), CM (customer to merchant), and so on.

In [6]:
# Initialize the feature column
df["Transaction_type"] = np.nan

# Label each transaction by the account types of origin and destination
df.loc[df.nameOrig.str.contains('C') & df.nameDest.str.contains('C'), "Transaction_type"] = "CC"
df.loc[df.nameOrig.str.contains('C') & df.nameDest.str.contains('M'), "Transaction_type"] = "CM"
df.loc[df.nameOrig.str.contains('M') & df.nameDest.str.contains('C'), "Transaction_type"] = "MC"
df.loc[df.nameOrig.str.contains('M') & df.nameDest.str.contains('M'), "Transaction_type"] = "MM"
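A more compact way to build the same label could be sketched as follows. It assumes the account-type letter is always the first character of each identifier, which the notebook does not state explicitly, and the identifiers below are hypothetical.

```python
import pandas as pd

# Hypothetical account identifiers; the leading letter is assumed
# to encode the account type (C = customer, M = merchant)
sample = pd.DataFrame({
    "nameOrig": ["C0001", "C0002", "C0003"],
    "nameDest": ["M0101", "C0102", "M0103"],
})

# Concatenate the first character of origin and destination
sample["Transaction_type"] = sample["nameOrig"].str[0] + sample["nameDest"].str[0]
```

This avoids four separate boolean masks, but it would produce unexpected labels if an identifier started with any other character.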

Let's check which transaction types (CC, CM, MC, MM) contain fraudulent transactions.

In [7]:
print("Successful transactions: \n",df[df["isFraud"] == 0].Transaction_type.value_counts())
print("\n Fraudulent transactions: \n",df[df["isFraud"] == 1].Transaction_type.value_counts())

Successful transactions: 
 CC    693560
CM    353873
Name: Transaction_type, dtype: int64

 Fraudulent transactions: 
 CC    1142
Name: Transaction_type, dtype: int64


The output shows that fraud occurs only in customer-to-customer (CC) transactions, and that no MC or MM transactions appear in the data at all. This is important for the development of the models later on.
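The same comparison can be read off a single table. The sketch below uses `pd.crosstab` on a toy frame with made-up rows; only the column names `Transaction_type` and `isFraud` come from the notebook.

```python
import pandas as pd

# Toy data (hypothetical rows) with the notebook's column names
sample = pd.DataFrame({
    "Transaction_type": ["CC", "CM", "CC", "CC", "CM"],
    "isFraud": [0, 0, 1, 0, 0],
})

# Rows: transaction type, columns: fraud flag, cells: transaction counts
counts = pd.crosstab(sample["Transaction_type"], sample["isFraud"])

# Share of fraudulent transactions within each transaction type
fraud_rate = sample.groupby("Transaction_type")["isFraud"].mean()
```

Because fraud is so rare in the real data, the per-type rate is usually easier to interpret than the raw counts.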

Delete the columns *nameOrig* and *nameDest*.

In [8]:
df = df.drop(columns=['nameOrig', 'nameDest'])

In [9]:
df.head()

Unnamed: 0,step,type,amount,device,connection_time,oldbalanceOrg,age,newbalanceOrig,zone,user_number,user_connections,security_alert,oldbalanceDest,newbalanceDest,isFraud,Transaction_type
0,1,PAYMENT,9839.64,mac,140039412,170136.0,85,160296.36,capital,138,5,1,0.0,0.0,0,CM
1,1,PAYMENT,1864.28,mac,496889534,21249.0,57,19384.72,country,909,1,0,0.0,0.0,0,CM
2,1,TRANSFER,181.0,pc,781150327,181.0,66,0.0,capital,2569,10,0,0.0,0.0,1,CC
3,1,CASH_OUT,181.0,mac,565068378,181.0,31,0.0,country,1787,3,0,21182.0,0.0,1,CC
4,1,PAYMENT,11668.14,mac,517114493,41554.0,90,29885.86,country,3997,8,0,0.0,0.0,0,CM


As noted in the EDA, it may be helpful to create two new variables representing the hour of the day and the day (within a 7-day cycle) on which each transaction occurs.

In [10]:
# Derive time features from step
df["HourOfDay"] = df.step % 24      # hour within the day (0-23)
df["Day"] = (df.step // 24) % 7     # day index within a 7-day cycle (0-6)
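To make the arithmetic concrete, here is a small worked example on hand-picked step values. It assumes, as the code above does, that one step corresponds to one hour; the step values themselves are chosen only for illustration.

```python
import pandas as pd

# Hand-picked steps (illustrative), assuming one step equals one hour
sample = pd.DataFrame({"step": [1, 23, 24, 25, 168, 200]})

# Hour within the day: remainder after removing whole days
sample["HourOfDay"] = sample["step"] % 24

# Day index: number of whole days elapsed, wrapped to a 7-day cycle
sample["Day"] = (sample["step"] // 24) % 7
```

Note that `Day` is an index within a repeating 7-day cycle starting at step 0, not a calendar weekday: step 168 (exactly seven days in) wraps back to day 0.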

### Handling categorical variables

In [11]:
X = df.drop('isFraud',axis = 1)  # Data
Y = df.isFraud # target variable

For the categorical variables, we use the OneHotEncoder from category_encoders to transform them into numerical variables.

In [12]:
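categorical_columns = ['type', 'device', 'zone', 'Transaction_type']

# One-hot encode the categorical columns (new columns are named type_1, type_2, ...)
ohe = ce.OneHotEncoder()
df1 = ohe.fit_transform(df[categorical_columns], Y)

# Append the encoded columns to the original frame
df = df.join(df1)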

In [13]:
#Delete categorical variables
df = df.drop(columns=categorical_columns)
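For comparison, a similar encode-and-drop step can be sketched with `pandas.get_dummies` instead of category_encoders. This is a swapped-in alternative, not the method used in this notebook, and the toy rows are made up. The main visible difference is the column naming: `get_dummies` uses the category value (for example `type_PAYMENT`) rather than an index (`type_1`).

```python
import pandas as pd

# Toy data (hypothetical rows) with a subset of the notebook's columns
sample = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "device": ["mac", "pc", "mac"],
    "amount": [9839.64, 181.0, 181.0],
})

categorical_columns = ["type", "device"]

# Encode the listed columns and drop the originals in one call;
# dtype=int yields 0/1 columns instead of booleans
encoded = pd.get_dummies(sample, columns=categorical_columns, dtype=int)
```

Readable column names make later feature-importance plots easier to interpret, at the cost of depending on the exact category strings present in the data.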

#### This is what our new data set looks like after preprocessing

In [14]:
df.head()

Unnamed: 0,step,amount,connection_time,oldbalanceOrg,age,newbalanceOrig,user_number,user_connections,security_alert,oldbalanceDest,...,device_1,device_2,device_3,device_4,zone_1,zone_2,zone_3,zone_4,Transaction_type_1,Transaction_type_2
0,1,9839.64,140039412,170136.0,85,160296.36,138,5,1,0.0,...,1,0,0,0,1,0,0,0,1,0
1,1,1864.28,496889534,21249.0,57,19384.72,909,1,0,0.0,...,1,0,0,0,0,1,0,0,1,0
2,1,181.0,781150327,181.0,66,0.0,2569,10,0,0.0,...,0,1,0,0,1,0,0,0,0,1
3,1,181.0,565068378,181.0,31,0.0,1787,3,0,21182.0,...,1,0,0,0,0,1,0,0,0,1
4,1,11668.14,517114493,41554.0,90,29885.86,3997,8,0,0.0,...,1,0,0,0,0,1,0,0,1,0


In [15]:
path = '../data/processed/'
new_file = 'new_dataset_payments_fraud.parquet'

df.to_parquet(path+new_file)