# Exploratory Data Analysis (EDA) and data cleaning notebook #

**Importing libraries and modules**

In [None]:
# importing external libraries
from pathlib import Path
import os
import pandas as pd
import pickle
import json

# Making sure any changes to the local modules are picked up without restarting the kernel
%load_ext autoreload
%autoreload 2

# Importing project functions to load and pre-process the data
from Modules.load_data import load_data
from Modules.preprocessing import missing_summary, merge_dfs, dollar_to_int



Please uncomment and run the cell below if you have not yet downloaded the dataset using the Kaggle API.

In [2]:
#load_data()

– Describe your data (e.g. dtypes, descriptive statistics)
– What is the distribution of the target variable?
– Do we face missing values / outliers?
– How do specific features correlate with the target variable?
– What features can we use for the specific prediction task?

In [2]:
# Obtaining absolute path to data folder
data_folder = str(Path(os.getcwd()) / "data")

# Obtaining absolute paths to relevant datasets

cards_data = data_folder + "/cards_data.csv"
transaction_data = data_folder + "/transactions_data.csv"

In [3]:
# Reading datasets into pandas
cards_data_df = pd.read_csv(cards_data)
transaction_data_df = pd.read_csv(transaction_data)


**Pre-processing steps for transaction data**

 * `date` column to be decomposed into separate month, day and time columns, in case time of day, day of week etc. correlate with fraudulent transactions.
 * `merchant_id` represents the business where the transaction was made. Likely too many merchants to use as a categorical variable. Possibly use mean encoding.
 * `card_id`. It may be possible to represent whether the cardholder has been flagged for a fraudulent transaction before.
 * `client_id`. Similar type of encoding as for `merchant_id`.
 * `merchant_city` and `merchant_state`. Likely one-hot encoding.
 * `zip`. Any predictive power is perhaps covered by the other location variables. Check for correlation and then drop.
 * `mcc` represents the type of merchant. Possibly one-hot encoding or mean encoding.
 * `errors`. Over 98% missing. Look for correlations, then maybe drop.
 * Any columns with $ values are object type. Remove the $ sign and change to int.
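
Two of the steps listed above, decomposing `date` and mean-encoding `merchant_id`, could look roughly like the sketch below. It runs on a toy frame, and the `is_fraud` label column is a hypothetical name used only for illustration; it is not a column shown in this notebook.

```python
import pandas as pd

# Toy frame standing in for the transactions data.
# `is_fraud` is a hypothetical label column (an assumption for this sketch).
toy = pd.DataFrame(
    {
        "date": ["2010-01-01 00:01:00", "2010-01-02 13:30:00", "2010-01-04 23:59:00"],
        "merchant_id": [1, 1, 2],
        "is_fraud": [0, 1, 0],
    }
)

# Decompose the timestamp into calendar features.
ts = pd.to_datetime(toy["date"])
toy["month"] = ts.dt.month
toy["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
toy["hour"] = ts.dt.hour

# Mean (target) encoding: replace each merchant_id by the mean label of that merchant.
# In practice the means should be computed on the training split only,
# otherwise the encoding leaks the target into the features.
merchant_means = toy.groupby("merchant_id")["is_fraud"].mean()
toy["merchant_id_enc"] = toy["merchant_id"].map(merchant_means)
```

Mean encoding keeps a high-cardinality column as a single numeric feature, whereas one-hot encoding would add one column per merchant.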



In [None]:
# Merging cards and transactions df and saving to pickle
# Uncomment if this is the first time running the code

#merge_dfs(transaction_data_df=transaction_data_df, cards_data_df=cards_data_df, data_folder=data_folder)
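
The body of `merge_dfs` lives in `Modules.preprocessing` and is not shown in this notebook. As a rough idea of what a merge-and-pickle step can look like, here is a minimal sketch on toy data. The function name `merge_and_pickle_sketch` is hypothetical, and the join key (transactions `card_id` matching cards `id`) is an assumption.

```python
from pathlib import Path
import tempfile

import pandas as pd


def merge_and_pickle_sketch(transactions, cards, data_folder):
    """Hypothetical stand-in for merge_dfs: left-join cards onto transactions, then pickle."""
    # Assumption: transactions["card_id"] refers to cards["id"].
    cards_renamed = cards.rename(columns={"id": "card_id"})
    # validate="many_to_one" raises if a card_id appears more than once in the cards frame.
    merged = transactions.merge(
        cards_renamed, on="card_id", how="left", validate="many_to_one"
    )
    merged.to_pickle(Path(data_folder) / "merged_data.pkl")
    return merged


toy_tx = pd.DataFrame({"card_id": [10, 11, 10], "amount": ["$5.00", "$7.50", "$1.25"]})
toy_cards = pd.DataFrame({"id": [10, 11], "card_brand": ["Visa", "Mastercard"]})

# Write to a temporary folder so the sketch does not touch the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    merged = merge_and_pickle_sketch(toy_tx, toy_cards, tmp)
    reloaded = pd.read_pickle(Path(tmp) / "merged_data.pkl")
```

A left join keeps every transaction even if a card is missing from the cards table, which is usually the safer default when the transactions are the unit of prediction.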

In [None]:
# Loading the data from pickle

merged_df = pd.read_pickle(data_folder + "/merged_data.pkl")

In [None]:
# Running info to see column types
merged_df.info()

# As we can see, many columns that should be numerical are objects because their values contain dollar signs

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8914963 entries, 0 to 8914962
Data columns (total 24 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   date                   object 
 1   client_id              int64  
 2   card_id                int64  
 3   amount                 object 
 4   use_chip               object 
 5   merchant_id            int64  
 6   merchant_city          object 
 7   merchant_state         object 
 8   zip                    float64
 9   mcc                    int64  
 10  errors                 object 
 11  card_brand             object 
 12  card_type              object 
 13  card_number            int64  
 14  expires                object 
 15  cvv                    int64  
 16  has_chip               object 
 17  num_cards_issued       int64  
 18  credit_limit           object 
 19  acct_open_date         object 
 20  year_pin_last_changed  int64  
 21  card_on_dark_web       object 
 22  id                

In [15]:
# Running the dollar_to_int function
dollar_to_int(merged_df)
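
`dollar_to_int` is defined in `Modules.preprocessing` and its body is not shown here. Since the call above does not reassign `merged_df`, it presumably modifies the frame in place. A minimal sketch of such a conversion might look like this; the function name `dollar_to_int_sketch`, the default column list and the truncation of cents are all assumptions.

```python
import pandas as pd


def dollar_to_int_sketch(df, columns=("amount", "credit_limit")):
    """Hypothetical stand-in for dollar_to_int: strip '$' and cast to int64 in place."""
    for col in columns:
        cleaned = (
            df[col]
            .str.replace("$", "", regex=False)  # literal dollar sign, not a regex anchor
            .str.replace(",", "", regex=False)  # drop thousands separators, if any
        )
        # Going through float first handles values such as "-77.00";
        # the cast to int64 then truncates the cents toward zero.
        df[col] = cleaned.astype(float).astype("int64")
    return df


toy = pd.DataFrame(
    {"amount": ["$-77.00", "$14.57"], "credit_limit": ["$24295", "$21968"]}
)
dollar_to_int_sketch(toy)
```

Casting to int discards the cents. If sub-dollar precision turns out to matter for the model, keeping `amount` as a float would be an alternative.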

In [None]:
# Running .info() again
merged_df.info()

# `amount` and `credit_limit` are now int types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8914963 entries, 0 to 8914962
Data columns (total 24 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   date                   object 
 1   client_id              int64  
 2   card_id                int64  
 3   amount                 int64  
 4   use_chip               object 
 5   merchant_id            int64  
 6   merchant_city          object 
 7   merchant_state         object 
 8   zip                    float64
 9   mcc                    int64  
 10  errors                 object 
 11  card_brand             object 
 12  card_type              object 
 13  card_number            int64  
 14  expires                object 
 15  cvv                    int64  
 16  has_chip               object 
 17  num_cards_issued       int64  
 18  credit_limit           int64  
 19  acct_open_date         object 
 20  year_pin_last_changed  int64  
 21  card_on_dark_web       object 
 22  id                

In [17]:
# Running describe
merged_df.describe()

| statistic | client_id | card_id | amount | merchant_id | zip | mcc | card_number | cvv | num_cards_issued | credit_limit | year_pin_last_changed | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8914963.0 | 8914963.0 | 8914963.0 | 8914963.0 | 7807586.0 | 8914963.0 | 8914963.0 | 8914963.0 | 8914963.0 | 8914963.0 | 8914963.0 | 8914963.0 |
| mean | 1026.637 | 3474.887 | 42.52761 | 47725.66 | 51328.55 | 5565.097 | 4817349000000000.0 | 495.3292 | 1.522064 | 15549.59 | 2011.34 | 15584730.0 |
| std | 581.6755 | 1674.427 | 81.51282 | 25816.23 | 29405.18 | 875.5078 | 1311465000000000.0 | 288.5735 | 0.5151711 | 12181.99 | 2.894518 | 4703991.0 |
| min | 0.0 | 0.0 | -500.0 | 1.0 | 1001.0 | 1711.0 | 300105500000000.0 | 0.0 | 1.0 | 0.0 | 2002.0 | 7475327.0 |
| 25% | 519.0 | 2413.0 | 8.0 | 25887.0 | 28601.0 | 5300.0 | 4489873000000000.0 | 247.0 | 1.0 | 8100.0 | 2010.0 | 11507860.0 |
| 50% | 1070.0 | 3584.0 | 28.0 | 45926.0 | 47710.0 | 5499.0 | 5112842000000000.0 | 499.0 | 2.0 | 13455.0 | 2011.0 | 15571400.0 |
| 75% | 1530.0 | 4899.0 | 63.0 | 67570.0 | 77901.0 | 5812.0 | 5566696000000000.0 | 740.0 | 2.0 | 20839.0 | 2013.0 | 19653870.0 |
| max | 1998.0 | 6138.0 | 6613.0 | 100342.0 | 99928.0 | 9402.0 | 6994218000000000.0 | 999.0 | 3.0 | 141391.0 | 2020.0 | 23761870.0 |
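
The `amount` column above ranges from -500 to 6613 while its 75th percentile is only 63, which suggests a long right tail. One common way to flag such values is the 1.5 × IQR rule. The sketch below applies it to a handful of values taken from the summary; it illustrates the rule and is not an analysis the notebook has already run.

```python
import pandas as pd

# A few values taken from the describe() summary of `amount`.
amounts = pd.Series([8, 28, 63, 6613, -500])

# Interquartile range and the usual 1.5 * IQR fences.
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask marking values outside the fences.
is_outlier = (amounts < lower) | (amounts > upper)
outliers = amounts[is_outlier]
```

For fraud detection, extreme amounts may themselves be informative, so flagging them is often preferable to dropping them.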


In [None]:
# Running missing summary
# Note: the `errors` column has a lot of missing values
missing_summary(merged_df)

| Column | Missing Values | Percentage missing (%) |
| --- | --- | --- |
| date | 0 | 0.0 |
| client_id | 0 | 0.0 |
| card_id | 0 | 0.0 |
| amount | 0 | 0.0 |
| use_chip | 0 | 0.0 |
| merchant_id | 0 | 0.0 |
| merchant_city | 0 | 0.0 |
| merchant_state | 1047865 | 11.754003 |
| zip | 1107377 | 12.421555 |
| mcc | 0 | 0.0 |
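
`missing_summary` is imported from `Modules.preprocessing`, so its implementation is not visible here. A summary with the same two columns as the output above can be built along these lines. The name `missing_summary_sketch` is hypothetical and the real helper may differ.

```python
import numpy as np
import pandas as pd


def missing_summary_sketch(df):
    """Hypothetical stand-in for missing_summary: count and share of NaNs per column."""
    missing = df.isna().sum()
    return pd.DataFrame(
        {
            "Missing Values": missing,
            "Percentage missing (%)": 100 * missing / len(df),
        }
    )


toy = pd.DataFrame(
    {"zip": [1001.0, np.nan, np.nan, 99928.0], "mcc": [5300, 5499, 5812, 5411]}
)
summary = missing_summary_sketch(toy)
```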


In [None]:
# A missing value likely means that there was no transaction error
merged_df.loc[merged_df["errors"].notna(), "errors"]

161              Bad Expiration
180             Bad Card Number
262        Insufficient Balance
319        Insufficient Balance
320        Insufficient Balance
                   ...         
8914572    Insufficient Balance
8914610    Insufficient Balance
8914635    Insufficient Balance
8914782                 Bad PIN
8914898    Insufficient Balance
Name: errors, Length: 141767, dtype: object
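
If a missing `errors` value does mean "no error", one option is to make that explicit before modelling instead of dropping the column. The sketch below shows this on toy data. The `has_error` column name and the `"No Error"` fill label are illustrative choices, not part of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"errors": [np.nan, "Bad PIN", "Insufficient Balance", np.nan]}
)

# Binary flag: 1 if the transaction reported any error, 0 otherwise.
toy["has_error"] = toy["errors"].notna().astype(int)

# Make the "no error" case an explicit category (the label is an assumption).
toy["errors"] = toy["errors"].fillna("No Error")

# Frequency of each category after filling.
counts = toy["errors"].value_counts()
```

The flag keeps the information in a single numeric column, while the filled categories can still be one-hot encoded later if the specific error type turns out to correlate with the target.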