# **Machine Learning for Digital Fraud Detection**
### Comparing the Performance of Stacking Classifier & Deep Learning Models in Identifying Instances of Digital Transaction Fraud



Brian Morrison

DATA606 - Capstone Project in Data Science

Professor Jay Wang

The University of Maryland, Baltimore County

## **Part Zero - Source Data Splitting**

### *Introduction & Notebook Overview*

This notebook reads in the project's initial source data file, which is over 600MB, and splits it into 7 component files small enough to upload to the project's GitHub repository, which has a file size limit of 100MB. The succeeding notebook then reads in and concatenates these files for project use.
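The reassembly step mentioned above could look roughly like the sketch below. It assumes the component files are named `df1.csv` through `df7.csv` (the names used in the export step of this notebook) and that they sit in a local directory. The helper name `load_components` and its `data_dir` parameter are hypothetical, and the succeeding notebook may well read the files differently.

```python
from pathlib import Path

import pandas as pd


def load_components(data_dir, n_parts=7):
    """Read df1.csv ... df{n_parts}.csv from data_dir and stack them row-wise."""
    parts = []
    for i in range(1, n_parts + 1):
        path = Path(data_dir) / f"df{i}.csv"
        # index_col=0 treats the first column as the row index, since the
        # export step writes the index alongside the data; utf-8-sig matches
        # the encoding used when the files were written
        parts.append(pd.read_csv(path, index_col=0, encoding="utf-8-sig"))
    # concatenating in file order restores the original row order
    return pd.concat(parts)
```

Reading the parts in numeric order matters here: the chunks are consecutive row slices, so any other order would shuffle the rows relative to the source file.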

In [None]:
#we'll start by reading in the full dataset, stored in my google drive

import pandas as pd
import numpy as np
from google.colab import drive

drive.mount('/content/gdrive') # mounting

df = pd.read_csv('/content/gdrive/My Drive/Course Folders/DATA606 - Data Science Capstone/Final Project Files/train_transaction.csv') #reading data from path into pandas dataframe, df

Mounted at /content/gdrive


In [None]:
df.head(5) #examining the head of our dataframe

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
length = len(df) #examining the length of the dataframe to decide on sizing for components
print(length)

590540


In [None]:
import math

component_size = math.ceil(length / 7) #rounding up so that 7 chunks always cover every row

full_size = component_size * 7 #total capacity of 7 full chunks, which must be at least the dataframe length

print(component_size)
print(full_size)

84363
590541
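To make the arithmetic behind these numbers explicit, the sketch below recomputes the chunk sizes from the row count alone. The variable names are illustrative and not part of the original notebook, and it uses ceiling division to pick the chunk size. The point is only that seven chunks of at most 84,363 rows cover all 590,540 rows, with the last chunk one row short.

```python
length = 590540   # row count reported above
n_parts = 7

# ceiling division: smallest chunk size for which n_parts chunks cover every row
chunk_size = -(-length // n_parts)

# every chunk is full except possibly the last, which takes the remainder
sizes = [min(chunk_size, length - i * chunk_size) for i in range(n_parts)]

print(chunk_size)   # 84363
print(sizes[-1])    # 84362
print(sum(sizes))   # 590540
```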


In [None]:
#we can now define a function to split our dataframe into 7 component dataframes, slicing row wise

def split_dataframe(df, chunk_size):
    chunks = list()
    #one chunk per full block of rows, plus one more if any rows are left over
    num_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size else 0)
    for i in range(num_chunks):
        chunks.append(df.iloc[i * chunk_size: (i + 1) * chunk_size]) #positional row slice
    return chunks

#reference link for splitting function: https://stackoverflow.com/questions/17315737/split-a-large-pandas-dataframe

In [None]:
df1, df2, df3, df4, df5, df6, df7 = split_dataframe(df, component_size) #defining dataframes

In [None]:
print(len(df1) + len(df2) + len(df3) + len(df4) + len(df5) + len(df6) + len(df7))

590540
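Matching lengths confirm that no rows were lost, but not that the rows are in the right order or free of duplicates. A slightly stronger check, sketched here on a small stand-in frame rather than the real data, is to concatenate the chunks and compare the result with the original. The `split_dataframe` copy below mirrors the function defined earlier so that the example runs on its own; the toy data is hypothetical.

```python
import pandas as pd


def split_dataframe(df, chunk_size):
    """Row-wise split into consecutive chunks of at most chunk_size rows."""
    num_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size else 0)
    return [df.iloc[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]


# toy stand-in for the transaction data (10 rows, chunk size 3)
toy = pd.DataFrame({"TransactionID": range(100, 110), "isFraud": [0, 1] * 5})
chunks = split_dataframe(toy, 3)

print([len(c) for c in chunks])   # [3, 3, 3, 1]

# concatenating the chunks in order should reproduce the original frame
reassembled = pd.concat(chunks)
print(reassembled.equals(toy))
```

On the full dataset the same comparison would be `pd.concat([df1, ..., df7]).equals(df)`, at the cost of briefly holding a second copy of the data in memory.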


In [None]:
#finally, this step downloads the dataframes as csv files to my local machine to be uploaded to the project's github repository

from google.colab import files

df_list = [df1, df2, df3, df4, df5, df6, df7]

for num, component in enumerate(df_list, start = 1): #a separate loop variable avoids overwriting the full dataframe, df
  df_name = 'df' + str(num) + '.csv'
  component.to_csv(df_name, encoding = 'utf-8-sig') #the row index is written as the first column of each file
  files.download(df_name)

#note that this export process can be performed manually using the colab sidebar


We can now use Git commands to upload each dataframe as a file to the project's GitHub repository, which makes the files easy to access from subsequent project notebooks. For this portion of the project, I used Visual Studio Code to work with the terminal and project files side by side. A brief overview of the process follows.

Basic Git commands leveraged:

Checking Git version

`$ git --version`

Cloning the project repository to make local edits

`$ git clone https://github.com/briancmorrison/brian_data606.git`

Navigating to cloned directory

`$ cd brian_data606`

Adding files to repository

`$ git add ['filenames']`

Committing changes locally

`$ git commit -m "Uploaded Source Data"`

Pushing to actual GitHub repository

`$ git push origin master`
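Because the whole point of the split is to stay under GitHub's per-file size limit, it can be worth checking the exported files before running `git add`. The following is a minimal sketch and not part of the original workflow: the helper name `oversized_files` is hypothetical, and the limit is taken from the 100MB figure quoted in the introduction, interpreted here as 100 * 1024**2 bytes (an assumption).

```python
import os

LIMIT_BYTES = 100 * 1024 ** 2  # assumed reading of the 100MB limit


def oversized_files(paths, limit=LIMIT_BYTES):
    """Return the (path, size_in_bytes) pairs whose size exceeds limit."""
    sizes = [(p, os.path.getsize(p)) for p in paths]
    return [(p, s) for p, s in sizes if s > limit]


if __name__ == "__main__":
    # file names as written by the export step of this notebook
    names = [f"df{i}.csv" for i in range(1, 8)]
    existing = [n for n in names if os.path.exists(n)]
    for path, size in oversized_files(existing):
        print(f"{path} is {size / 1024 ** 2:.1f} MiB, over the limit")
```

If any file is reported, increasing the number of chunks and re-exporting would be the natural fix.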