# Credit Card Data Processing

This is the data processing notebook that accompanies the Credit Card Analytics notebook. We download raw files from the following [S3 location](https://s3.console.aws.amazon.com/s3/buckets/data.atoti.io?region=eu-west-3&bucketType=general&prefix=notebooks/retail-banking/input/&showversions=false), process them, and produce the files that are uploaded back to it.

**[Input](https://s3.console.aws.amazon.com/s3/buckets/data.atoti.io?region=eu-west-3&bucketType=general&prefix=notebooks/retail-banking/input/&showversions=false):** 

* `cards.csv`
* `credit_card_transactions.csv.gz`
* `loan_contracts.csv`
* `retailers.csv`
* `users.csv`

**[Data](https://s3.console.aws.amazon.com/s3/buckets/data.atoti.io?region=eu-west-3&bucketType=general&prefix=notebooks/retail-banking/data/&showversions=false):**

* `cards_processed.csv` (created by this notebook)
* `users_processed.csv` (created by this notebook)
* `credit_card_transactions_processed_5MM.csv.gz` (created by this notebook)
* `fico.csv`
* `loans.csv`
* `retailers.csv`


## Import Libraries

In [1]:
import atoti
import pandas as pd
import time
import os
import numpy as np

# Create the output folder for the processed files.
# Files in this directory will be added to .gitignore.
outdir = "data"
os.makedirs(outdir, exist_ok=True)

## Process Credit Card Info Data from S3

1. [Load credit card info data from S3.](#Load-Credit-Card-Info-Data-from-S3)
2. [Analyze user credit card data.](#Analyze-User-Credit-Card-Data)
3. [Load retailer data from S3.](#Load-Retailer-Data-from-S3)
4. [Assign Entity Relationships for Cards-to-Retailers.](#Assign-Entity-Relationships-for-Cards-to-Retailers)
5. [Format Credit Card Data.](#Format-Credit-Card-Data)
6. [Output Credit Card Data as CSV.](#Output-Credit-Card-Data-as-CSV)

### Load Credit Card Info Data from S3

In [2]:
# Use `read_csv` pandas function to read from S3 URI
cc_df = pd.read_csv("s3://data.atoti.io/notebooks/retail-banking/input/cards.csv")
cc_df

Unnamed: 0,User,CARD INDEX,Card Brand,Card Type,Card Number,Expires,CVV,Has Chip,Cards Issued,Credit Limit,Acct Open Date,Year PIN last Changed,Card on Dark Web
0,0,0,Visa,Debit,4344676511950444,12/2022,623,YES,2,$24295,09/2002,2008,No
1,0,1,Visa,Debit,4956965974959986,12/2020,393,YES,2,$21968,04/2014,2014,No
2,0,2,Visa,Debit,4582313478255491,02/2024,719,YES,2,$46414,07/2003,2004,No
3,0,3,Visa,Credit,4879494103069057,08/2024,693,NO,1,$12400,01/2003,2012,No
4,0,4,Mastercard,Debit (Prepaid),5722874738736011,03/2009,75,YES,1,$28,09/2008,2009,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,1997,1,Amex,Credit,300609782832003,01/2024,663,YES,1,$6900,11/2000,2013,No
6142,1997,2,Visa,Credit,4718517475996018,01/2021,492,YES,2,$5700,04/2012,2012,No
6143,1998,0,Mastercard,Credit,5929512204765914,08/2020,237,NO,2,$9200,02/2012,2012,No
6144,1999,0,Mastercard,Debit,5589768928167462,01/2020,630,YES,1,$28074,01/2020,2020,No


### Analyze User Credit Card Data

In [3]:
# Find the distinct credit card brands
unique_values = cc_df["Card Brand"].unique()
print(sorted(unique_values))

['Amex', 'Discover', 'Mastercard', 'Visa']


In [4]:
# Find the distinct credit card combinations, and their frequency counts overall
cc_combinations_df = cc_df.groupby(["Card Brand", "Card Type"], as_index=False).size()
cc_combinations_df

Unnamed: 0,Card Brand,Card Type,size
0,Amex,Credit,402
1,Discover,Credit,209
2,Mastercard,Credit,635
3,Mastercard,Debit,2191
4,Mastercard,Debit (Prepaid),383
5,Visa,Credit,811
6,Visa,Debit,1320
7,Visa,Debit (Prepaid),195


In [5]:
# Find the distinct credit card combinations for each user, and their frequency counts
cc_combinations_df = cc_df.groupby(
    ["User", "Card Brand", "Card Type"], as_index=False
).size()
cc_combinations_df

Unnamed: 0,User,Card Brand,Card Type,size
0,0,Mastercard,Debit (Prepaid),1
1,0,Visa,Credit,1
2,0,Visa,Debit,3
3,1,Mastercard,Debit,1
4,1,Mastercard,Debit (Prepaid),2
...,...,...,...,...
4646,1997,Mastercard,Debit,1
4647,1997,Visa,Credit,1
4648,1998,Mastercard,Credit,1
4649,1999,Mastercard,Debit,1


In [6]:
# Find the max frequency count for all distinct credit card combinations at the user level
max_cc_combinations = cc_combinations_df.groupby(
    ["Card Brand", "Card Type"], as_index=False
)["size"].max()
max_cc_combinations

Unnamed: 0,Card Brand,Card Type,size
0,Amex,Credit,4
1,Discover,Credit,2
2,Mastercard,Credit,4
3,Mastercard,Debit,6
4,Mastercard,Debit (Prepaid),3
5,Visa,Credit,4
6,Visa,Debit,5
7,Visa,Debit (Prepaid),2


### Load Retailer Data from S3

In [7]:
# Load retailer data
cc_info = pd.read_csv("s3://data.atoti.io/notebooks/retail-banking/input/retailers.csv")
cc_info

Unnamed: 0,Retailer ID,Retailer Name,Card Brand,Card Type,Level 1,Level 2,Level 3,Level 4,Level 5,Industry
0,1,Cathay Pacific Elite,Amex,Credit,Bank Corp,Consumer Banking,Cards Business,Travel,Airline,Airline
1,2,Hilton Honors,Amex,Credit,Bank Corp,Consumer Banking,Cards Business,Travel,Hotel,Hotel
2,3,Delta SkyMiles Reserve,Amex,Credit,Bank Corp,Consumer Banking,Cards Business,Travel,Airline,Airline
3,4,Marriot Bonvoy Brilliant,Amex,Credit,Bank Corp,Consumer Banking,Cards Business,Travel,Hotel,Hotel
4,5,Discover it Miles,Discover,Credit,Bank Corp,Consumer Banking,Cards Business,Financials,Cards & Banking,Cards & Banking
5,6,Discover it Secured,Discover,Credit,Bank Corp,Consumer Banking,Cards Business,Financials,Cards & Banking,Cards & Banking
6,7,Upromise,Mastercard,Credit,Bank Corp,Consumer Banking,Cards Business,Nonprofit,Education,Education
7,8,AARP,Mastercard,Credit,Bank Corp,Consumer Banking,Cards Business,Nonprofit,Health Care,Health Care
8,9,Banana Republic,Mastercard,Credit,Bank Corp,Consumer Banking,Cards Business,Retail,Consumer Discretionary,Fashion
9,10,Barnes & Noble,Mastercard,Credit,Bank Corp,Consumer Banking,Cards Business,Retail,Consumer Discretionary,Books


### Assign Entity Relationships for Cards-to-Retailers

In [8]:
# Assign Retailer IDs to Card Brand, Card Type combinations
cc_dict = {
    "Amex Credit": [1, 2, 3, 4],
    "Discover Credit": [5, 6],
    "Mastercard Credit": [7, 8, 9, 10],
    "Mastercard Debit": [11, 12, 13, 14, 15, 16],
    "Mastercard Debit (Prepaid)": [17, 18, 19],
    "Visa Credit": [20, 21, 22, 23],
    "Visa Debit": [24, 25, 26, 27, 28],
    "Visa Debit (Prepaid)": [29, 30],
}
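
The list lengths above mirror the per-user maxima computed earlier (for example, a user holds at most 4 Amex Credit cards, so that combination has 4 Retailer IDs). A minimal sketch of a sanity check for this, using small hypothetical stand-ins (`max_counts`, `retailer_ids`) rather than the notebook's own `max_cc_combinations` and `cc_dict`:

```python
import pandas as pd

# Toy stand-ins for `max_cc_combinations` and `cc_dict` (three combinations only)
max_counts = pd.DataFrame(
    {
        "Card Brand": ["Amex", "Discover", "Visa"],
        "Card Type": ["Credit", "Credit", "Debit"],
        "size": [4, 2, 5],
    }
)
retailer_ids = {
    "Amex Credit": [1, 2, 3, 4],
    "Discover Credit": [5, 6],
    "Visa Debit": [24, 25, 26, 27, 28],
}

# Build the "Brand Type" key and count the Retailer IDs available for it
keys = max_counts["Card Brand"] + " " + max_counts["Card Type"]
available = keys.map(lambda k: len(retailer_ids.get(k, [])))

# Combinations whose ID list is too short for the user holding the most cards
too_short = max_counts[available < max_counts["size"]]
print(too_short)  # empty when every combination is covered
```

If `too_short` were not empty, the assignment loop below would raise an `IndexError` for the affected users.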

In [9]:
# Add the new `Retailer ID` column with empty values
cc_df.insert(2, "Retailer ID", "")
cc_df

Unnamed: 0,User,CARD INDEX,Retailer ID,Card Brand,Card Type,Card Number,Expires,CVV,Has Chip,Cards Issued,Credit Limit,Acct Open Date,Year PIN last Changed,Card on Dark Web
0,0,0,,Visa,Debit,4344676511950444,12/2022,623,YES,2,$24295,09/2002,2008,No
1,0,1,,Visa,Debit,4956965974959986,12/2020,393,YES,2,$21968,04/2014,2014,No
2,0,2,,Visa,Debit,4582313478255491,02/2024,719,YES,2,$46414,07/2003,2004,No
3,0,3,,Visa,Credit,4879494103069057,08/2024,693,NO,1,$12400,01/2003,2012,No
4,0,4,,Mastercard,Debit (Prepaid),5722874738736011,03/2009,75,YES,1,$28,09/2008,2009,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,1997,1,,Amex,Credit,300609782832003,01/2024,663,YES,1,$6900,11/2000,2013,No
6142,1997,2,,Visa,Credit,4718517475996018,01/2021,492,YES,2,$5700,04/2012,2012,No
6143,1998,0,,Mastercard,Credit,5929512204765914,08/2020,237,NO,2,$9200,02/2012,2012,No
6144,1999,0,,Mastercard,Debit,5589768928167462,01/2020,630,YES,1,$28074,01/2020,2020,No


In [10]:
# For each user, sort the cards by Card Brand and Card Type, then assign a Retailer ID
# to each card. When a user holds several cards of the same Card Brand and Card Type,
# each of those cards receives the next Retailer ID in that combination's list.
# ----
# Uncomment the print statements to trace each assignment
for user in cc_df["User"].unique():
    df = cc_df.loc[cc_df["User"] == user].sort_values(by=["Card Brand", "Card Type"])
    prev_row = None

    for index, row in df.iterrows():
        cc_input = row["Card Brand"] + " " + row["Card Type"]
        distinct_count = cc_combinations_df.loc[
            (cc_combinations_df["User"] == user)
            & (cc_combinations_df["Card Brand"] == row["Card Brand"])
            & (cc_combinations_df["Card Type"] == row["Card Type"])
        ]["size"].values[0]

        if prev_row is None:
            # print("FIRST ROW AND NEW CARD FOR USER")
            num_counter = 0
            # print(f"  User {user} has a {cc_input}, and has {distinct_count} distinct cards")
            assignment = cc_dict[cc_input][num_counter]
            # print(f"    Assigning to Retailer ID... {assignment}")
            cc_df.loc[index, "Retailer ID"] = assignment
            prev_row = row

        else:
            if str(prev_row["Card Brand"]) == str(row["Card Brand"]) and str(
                prev_row["Card Type"]
            ) == str(row["Card Type"]):
                # print("SAME CARD AS PREVIOUS ROW")
                num_counter += 1
                # print(f"  User {user} has a {cc_input}, which is same as above, and has {distinct_count} distinct cards")
                assignment = cc_dict[cc_input][num_counter]
                # print(f"    Assigning to Retailer ID... {assignment}")
                cc_df.loc[index, "Retailer ID"] = assignment
                prev_row = row

            else:
                # print("NEW CARD FOR SAME USER")
                num_counter = 0
                # print(f"  User {user} has a {cc_input}, and has {distinct_count} distinct cards")
                assignment = cc_dict[cc_input][num_counter]
                # print(f"    Assigning to Retailer ID... {assignment}")
                cc_df.loc[index, "Retailer ID"] = assignment
                prev_row = row
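
As a possible alternative to the row-by-row loop, the same kind of assignment can be sketched with `groupby(...).cumcount()`. This is only an illustration on made-up rows; the `cards` frame and `retailer_ids` mapping are hypothetical stand-ins for `cc_df` and `cc_dict`, and cards within a combination are numbered in their original row order.

```python
import pandas as pd

# Toy cards table; the real notebook uses `cc_df` and `cc_dict`
cards = pd.DataFrame(
    {
        "User": [0, 0, 0, 0, 1],
        "Card Brand": ["Visa", "Visa", "Visa", "Amex", "Visa"],
        "Card Type": ["Debit", "Debit", "Credit", "Credit", "Debit"],
    }
)
retailer_ids = {
    "Amex Credit": [1, 2, 3, 4],
    "Visa Credit": [20, 21, 22, 23],
    "Visa Debit": [24, 25, 26, 27, 28],
}

# Position of each card among the user's cards of the same brand and type (0, 1, 2, ...)
position = cards.groupby(["User", "Card Brand", "Card Type"]).cumcount()
key = cards["Card Brand"] + " " + cards["Card Type"]

# Pick the nth Retailer ID for the nth card of a given combination
cards["Retailer ID"] = [retailer_ids[k][n] for k, n in zip(key, position)]
print(cards["Retailer ID"].tolist())  # [24, 25, 20, 1, 24]
```

Because `cumcount` restarts at 0 for every user and combination, no explicit `prev_row` bookkeeping is needed.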

In [11]:
# Check Retailer ID values have been assigned
cc_df

Unnamed: 0,User,CARD INDEX,Retailer ID,Card Brand,Card Type,Card Number,Expires,CVV,Has Chip,Cards Issued,Credit Limit,Acct Open Date,Year PIN last Changed,Card on Dark Web
0,0,0,24,Visa,Debit,4344676511950444,12/2022,623,YES,2,$24295,09/2002,2008,No
1,0,1,25,Visa,Debit,4956965974959986,12/2020,393,YES,2,$21968,04/2014,2014,No
2,0,2,26,Visa,Debit,4582313478255491,02/2024,719,YES,2,$46414,07/2003,2004,No
3,0,3,20,Visa,Credit,4879494103069057,08/2024,693,NO,1,$12400,01/2003,2012,No
4,0,4,17,Mastercard,Debit (Prepaid),5722874738736011,03/2009,75,YES,1,$28,09/2008,2009,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6141,1997,1,1,Amex,Credit,300609782832003,01/2024,663,YES,1,$6900,11/2000,2013,No
6142,1997,2,20,Visa,Credit,4718517475996018,01/2021,492,YES,2,$5700,04/2012,2012,No
6143,1998,0,7,Mastercard,Credit,5929512204765914,08/2020,237,NO,2,$9200,02/2012,2012,No
6144,1999,0,11,Mastercard,Debit,5589768928167462,01/2020,630,YES,1,$28074,01/2020,2020,No


### Format Credit Card Data

In [12]:
# Remove the `Card Brand` and `Card Type` columns
cc_df.drop(columns=["Card Brand", "Card Type"], inplace=True)

# Rename `CARD INDEX` column to `Card` to match joined `Card` column from cc_sales_gzip_df
cc_df.rename(columns={"CARD INDEX": "Card"}, inplace=True)
cc_df.head()

Unnamed: 0,User,Card,Retailer ID,Card Number,Expires,CVV,Has Chip,Cards Issued,Credit Limit,Acct Open Date,Year PIN last Changed,Card on Dark Web
0,0,0,24,4344676511950444,12/2022,623,YES,2,$24295,09/2002,2008,No
1,0,1,25,4956965974959986,12/2020,393,YES,2,$21968,04/2014,2014,No
2,0,2,26,4582313478255491,02/2024,719,YES,2,$46414,07/2003,2004,No
3,0,3,20,4879494103069057,08/2024,693,NO,1,$12400,01/2003,2012,No
4,0,4,17,5722874738736011,03/2009,75,YES,1,$28,09/2008,2009,No


In [13]:
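# Cast intended measures as numerical data types
# (`regex=False` treats "$" as a literal character rather than a regex anchor)
cc_df["Credit Limit"] = cc_df["Credit Limit"].str.replace("$", "", regex=False)
cc_df["Credit Limit"] = cc_df["Credit Limit"].astype(int)
# Raise every credit limit by a flat 50,000
cc_df["Credit Limit"] = cc_df["Credit Limit"] + 50000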
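
Because `$` is also the end-of-string anchor in regular expressions, and the default of the `regex` argument differs between pandas major versions, it can help to request a literal match explicitly. A minimal sketch on made-up values (the `limits` Series is illustrative only):

```python
import pandas as pd

limits = pd.Series(["$24295", "$28"])

# regex=False removes the literal dollar sign regardless of the pandas version
cleaned = limits.str.replace("$", "", regex=False).astype(int)
print(cleaned.tolist())  # [24295, 28]
```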

### Output Credit Card Data as CSV

In [14]:
# Output DataFrame to CSV file
cc_df.to_csv(f"{outdir}/cards_processed.csv", index=False)

## Process Users Data from S3

1. [Load Users Data from S3.](#Load-Users-Data-from-S3)
2. [Load Contracts Data from S3.](#Load-Contracts-Data-from-S3)
3. [Join Credit Loss Attributes to Users Data.](#Join-Credit-Loss-Attributes-to-Users-Data)
4. [Format Users Data.](#Format-Users-Data)
5. [Output Users Data as CSV.](#Output-Users-Data-as-CSV)

### Load Users Data from S3

In [15]:
# Load users data
users_df = pd.read_csv("s3://data.atoti.io/notebooks/retail-banking/input/users.csv")
users_df = users_df.rename_axis("User").reset_index()
users_df

Unnamed: 0,User,Person,Current Age,Retirement Age,Birth Year,Birth Month,Gender,Address,Apartment,City,State,Zipcode,Latitude,Longitude,Per Capita Income - Zipcode,Yearly Income - Person,Total Debt,FICO Score,Num Credit Cards
0,0,Hazel Robinson,53,66,1966,11,Female,462 Rose Lane,,La Verne,CA,91750,34.15,-117.76,$29278,$59696,$127613,787,5
1,1,Sasha Sadr,53,68,1966,12,Female,3606 Federal Boulevard,,Little Neck,NY,11363,40.76,-73.74,$37891,$77254,$191349,701,5
2,2,Saanvi Lee,81,67,1938,11,Female,766 Third Drive,,West Covina,CA,91792,34.02,-117.89,$22681,$33483,$196,698,5
3,3,Everlee Clark,63,63,1957,1,Female,3 Madison Street,,New York,NY,10069,40.71,-73.99,$163145,$249925,$202328,722,4
4,4,Kyle Peterson,43,70,1976,9,Male,9620 Valley Stream Drive,,San Francisco,CA,94117,37.76,-122.44,$53797,$109687,$183855,675,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1995,Jose Faraday,32,70,1987,7,Male,6577 Lexington Lane,9.0,Freeport,NY,11520,40.65,-73.58,$23550,$48010,$87837,703,3
1996,1996,Ximena Richardson,62,65,1957,11,Female,2 Elm Drive,955.0,Independence,KY,41051,38.95,-84.54,$24218,$49378,$104480,740,4
1997,1997,Annika Russell,47,67,1973,1,Female,276 Fifth Boulevard,,Elizabeth,NJ,7201,40.66,-74.19,$15175,$30942,$71066,779,3
1998,1998,Juelz Roman,66,60,1954,2,Male,259 Valley Boulevard,,Camp Hill,PA,17011,40.24,-76.92,$25336,$54654,$27241,618,1


### Load Contracts Data from S3

In [16]:
# Use `read_csv` pandas function to load contracts data from S3 URI
contracts_df = pd.read_csv(
    "s3://data.atoti.io/notebooks/retail-banking/input/loan_contracts.csv",
    low_memory=False,
)
contracts_df

Unnamed: 0,Reporting Date,EAD,PD12,PDLT,LGD,Maturity Date,Residual Maturity,Bucketed Arrears,Reporting Index,Is New Contract,Client ID,FICO,FICO Segment,LTV Segment,Macro Economic Scenario,Entity
0,12/30/22,69795.693590,0.057123,0.024117,0.607348,11/5/32,3597,0,2,False,9XFALYXL,300,300-410,81%-90%,Base,Paris
1,12/30/22,24045.796250,0.056460,0.021793,0.613929,12/5/32,3598,0,0,True,KHAAK99A,300,300-410,>30%,Base,Paris
2,12/30/22,2019.251356,0.017708,0.017708,0.607348,8/10/23,460,0,113,False,AZHHTHYX,300,300-410,>30%,Base,Paris
3,12/30/22,20563.067570,0.034836,0.154493,0.613929,8/15/31,3152,0,17,False,YZAZKX99,300,300-410,>30%,Base,Paris
4,12/30/22,2666.804003,0.017043,0.017043,0.613929,5/15/24,504,0,20,False,AZLKF9YH,300,300-410,30%-50%,Base,Paris
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336393,12/30/22,460.992072,0.041017,0.041017,0.322573,10/15/23,290,91-180,50,False,BN9JJQ,850,741-850,71%-80%,Base,NewYork
336394,12/30/22,116.777909,0.041017,0.041017,0.322573,6/28/23,181,0,47,False,NSJJQ,850,741-850,>30%,Base,NewYork
336395,12/30/22,2461.296310,0.013934,0.063931,0.497079,4/9/26,1168,0,44,False,BJGBJ9,850,741-850,81%-90%,Base,NewYork
336396,12/30/22,41522.000000,0.019270,0.038064,0.607348,1/15/25,746,0,11,False,122O9E2E,850,741-850,71%-80%,Base,NewYork


### Join Credit Loss Attributes to Users Data

In [17]:
# Group by FICO and average EAD along with the other numeric columns
groupby_contracts_df = contracts_df.groupby(["FICO"]).mean(numeric_only=True)
groupby_contracts_df.reset_index()

Unnamed: 0,FICO,EAD,PD12,PDLT,LGD,Residual Maturity,Reporting Index,Is New Contract
0,300,8980.553229,0.104477,0.155930,0.604633,1787.578864,26.446372,0.015773
1,301,9349.136091,0.112370,0.163001,0.605998,1786.359551,26.280899,0.028892
2,302,8508.016229,0.100375,0.148074,0.602229,1739.838710,27.416129,0.030645
3,303,7804.911004,0.091073,0.137547,0.598811,1727.121685,26.806552,0.031201
4,304,8688.726128,0.103132,0.154085,0.604463,1741.822476,28.037459,0.027687
...,...,...,...,...,...,...,...,...
546,846,9125.236908,0.094646,0.143602,0.600449,1820.182566,27.258224,0.023026
547,847,8643.170678,0.103138,0.149518,0.605348,1697.993681,27.949447,0.018957
548,848,7772.732491,0.097820,0.148221,0.606113,1737.099130,27.521739,0.020870
549,849,7968.876954,0.104263,0.151679,0.607293,1769.934084,26.741158,0.030547


In [18]:
# Join both DataFrames on FICO score to consolidate credit loss attributes
merge_df = pd.merge(
    users_df, groupby_contracts_df, left_on="FICO Score", right_on="FICO"
)
merge_df

Unnamed: 0,User,Person,Current Age,Retirement Age,Birth Year,Birth Month,Gender,Address,Apartment,City,...,Total Debt,FICO Score,Num Credit Cards,EAD,PD12,PDLT,LGD,Residual Maturity,Reporting Index,Is New Contract
0,0,Hazel Robinson,53,66,1966,11,Female,462 Rose Lane,,La Verne,...,$127613,787,5,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316
1,955,Nickolas Lopez,21,67,1999,2,Male,92196 Tenth Drive,,Leesburg,...,$85204,787,2,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316
2,1134,Kallie Rodriguez,39,71,1980,7,Female,135 Littlewood Avenue,6.0,Oceanside,...,$91549,787,1,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316
3,1479,Rylan Rodriguez,33,69,1986,10,Female,928 Bayview Street,,Portage,...,$0,787,3,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316
4,1,Sasha Sadr,53,68,1966,12,Female,3606 Federal Boulevard,,Little Neck,...,$191349,701,5,8943.997200,0.105292,0.150311,0.602701,1747.076299,26.618506,0.038961
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1796,Alessandro Davis,37,66,1982,12,Male,550 Forest Street,,Helena,...,$71180,580,1,8195.670961,0.103159,0.150773,0.607600,1747.212670,26.556561,0.024133
1996,1817,Darren Turner,31,63,1988,5,Male,6692 Lake Street,,Taylorsville,...,$53853,514,1,8681.589080,0.123131,0.171764,0.611464,1825.388976,27.366929,0.020472
1997,1874,August Braun,42,72,1977,8,Male,331 Oak Lane,,Antioch,...,$99235,563,2,9434.481972,0.097110,0.143415,0.605630,1780.612245,26.211931,0.032967
1998,1888,Kyng El-Mafouk,51,68,1968,10,Male,207 Ocean View Street,,Berkeley Heights,...,$242379,505,1,8936.491473,0.101380,0.149277,0.602884,1747.658249,26.885522,0.040404
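
`pd.merge` performs an inner join by default, so a user whose FICO score has no matching row in the averaged contracts would silently drop out. A minimal sketch, on made-up data, of how a left join with `indicator=True` could surface such users (the `users` and `contracts` frames are hypothetical stand-ins for `users_df` and `contracts_df`):

```python
import pandas as pd

# Toy stand-ins for `users_df` and `contracts_df`
users = pd.DataFrame({"User": [0, 1, 2], "FICO Score": [700, 701, 999]})
contracts = pd.DataFrame(
    {"FICO": [700, 700, 701], "EAD": [100.0, 300.0, 50.0], "Entity": ["Paris"] * 3}
)

# Average only the numeric columns for each FICO value
avg_by_fico = contracts.groupby("FICO").mean(numeric_only=True).reset_index()

# A left join keeps every user; `_merge` flags users with no matching FICO value
merged = users.merge(
    avg_by_fico, how="left", left_on="FICO Score", right_on="FICO", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "User"].tolist()
print(unmatched)  # [2]
```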


### Format Users Data

In [19]:
# View columns of combined DataFrame
merge_df.columns

Index(['User', 'Person', 'Current Age', 'Retirement Age', 'Birth Year',
       'Birth Month', 'Gender', 'Address', 'Apartment', 'City', 'State',
       'Zipcode', 'Latitude', 'Longitude', 'Per Capita Income - Zipcode',
       'Yearly Income - Person', 'Total Debt', 'FICO Score',
       'Num Credit Cards', 'EAD', 'PD12', 'PDLT', 'LGD', 'Residual Maturity',
       'Reporting Index', 'Is New Contract'],
      dtype='object')

In [20]:
# Remove currency symbols and cast the intended measures to integers
# (`regex=False` treats "$" as a literal character rather than a regex anchor)
for col in ["Per Capita Income - Zipcode", "Yearly Income - Person", "Total Debt"]:
    merge_df[col] = merge_df[col].str.replace("$", "", regex=False).astype(int)

In [21]:
# Create age range bins
merge_df["Age Range"] = pd.cut(
    merge_df["Current Age"],
    [0, 10, 20, 30, 40, 50, 60, np.inf],
    labels=["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"],
    right=False,
)

# Create income range bins
merge_df["Income Range"] = pd.cut(
    merge_df["Yearly Income - Person"],
    [0, 20000, 50000, 80000, 100000, 150000, 200000, np.inf],
    labels=[
        "0K - 20K",
        "20K - 50K",
        "50K - 80K",
        "80K - 100K",
        "100K - 150K",
        "150K - 200K",
        "200K+",
    ],
    right=False,
)
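
With `right=False`, each bin includes its left edge and excludes its right edge, which is why the labels end in 9 (for example `"50-59"`). A small sketch with made-up ages at the bin boundaries:

```python
import numpy as np
import pandas as pd

ages = pd.Series([29, 30, 59, 60])

# With right=False each bin is closed on the left: [20, 30), [30, 40), ...
age_range = pd.cut(
    ages,
    [0, 10, 20, 30, 40, 50, 60, np.inf],
    labels=["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60+"],
    right=False,
)
print(age_range.astype(str).tolist())  # ['20-29', '30-39', '50-59', '60+']
```

The same rule applies to the income bins, so an income of exactly 50000 falls in `"50K - 80K"` rather than `"20K - 50K"`.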

In [22]:
merge_df

Unnamed: 0,User,Person,Current Age,Retirement Age,Birth Year,Birth Month,Gender,Address,Apartment,City,...,Num Credit Cards,EAD,PD12,PDLT,LGD,Residual Maturity,Reporting Index,Is New Contract,Age Range,Income Range
0,0,Hazel Robinson,53,66,1966,11,Female,462 Rose Lane,,La Verne,...,5,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316,50-59,50K - 80K
1,955,Nickolas Lopez,21,67,1999,2,Male,92196 Tenth Drive,,Leesburg,...,2,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316,20-29,80K - 100K
2,1134,Kallie Rodriguez,39,71,1980,7,Female,135 Littlewood Avenue,6.0,Oceanside,...,1,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316,30-39,20K - 50K
3,1479,Rylan Rodriguez,33,69,1986,10,Female,928 Bayview Street,,Portage,...,3,7543.452009,0.102502,0.148166,0.608904,1664.486322,28.908815,0.024316,30-39,20K - 50K
4,1,Sasha Sadr,53,68,1966,12,Female,3606 Federal Boulevard,,Little Neck,...,5,8943.997200,0.105292,0.150311,0.602701,1747.076299,26.618506,0.038961,50-59,50K - 80K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1796,Alessandro Davis,37,66,1982,12,Male,550 Forest Street,,Helena,...,1,8195.670961,0.103159,0.150773,0.607600,1747.212670,26.556561,0.024133,30-39,20K - 50K
1996,1817,Darren Turner,31,63,1988,5,Male,6692 Lake Street,,Taylorsville,...,1,8681.589080,0.123131,0.171764,0.611464,1825.388976,27.366929,0.020472,30-39,20K - 50K
1997,1874,August Braun,42,72,1977,8,Male,331 Oak Lane,,Antioch,...,2,9434.481972,0.097110,0.143415,0.605630,1780.612245,26.211931,0.032967,40-49,50K - 80K
1998,1888,Kyng El-Mafouk,51,68,1968,10,Male,207 Ocean View Street,,Berkeley Heights,...,1,8936.491473,0.101380,0.149277,0.602884,1747.658249,26.885522,0.040404,50-59,100K - 150K


In [23]:
# Drop extraneous columns
merge_df.drop(
    columns=["Residual Maturity", "Reporting Index", "Is New Contract"], inplace=True
)

### Output Users Data as CSV

In [24]:
merge_df.to_csv(f"{outdir}/users_processed.csv", index=False)

## Process Credit Card Transactions Data from S3

* [Load Credit Card Transactions Data from S3](#Load-Credit-Card-Transactions-Data-from-S3)
* [Modify Merchant Names to be MerchantN](#Modify-Merchant-Names-to-be-MerchantN-e.g.-(Merchant1,-Merchant2,-etc.))
* [Join New Merchant Names to Credit Card Transactions Data](#Join-New-Merchant-Names-to-Credit-Card-Transactions-Data)
* [Drop Extraneous Columns and Format Data](#Drop-Extraneous-Columns-and-Format-Data)
* [Format Credit Card Transactions Data](#Format-Credit-Card-Transactions-Data)
* [Downsize Original Data Volume to 5 Million](#Downsize-Original-Data-Volume-to-5-Million)
* [Simulate Payments](#Simulate-Payments)
* [Output Credit Card Transactions Data to CSV](#Output-Credit-Card-Transactions-Data-to-CSV)
* [Compress CSV to GZIP](#Compress-CSV-to-GZIP)

### Load Credit Card Transactions Data from S3

In [25]:
# Use `read_csv` pandas function to load credit card transactions data from S3 URI
cc_sales_gzip_df = pd.read_csv(
    "s3://data.atoti.io/notebooks/retail-banking/input/credit_card_transactions.csv.gz",
    compression="gzip",
)
cc_sales_gzip_df

Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
0,0,0,2002,9,1,06:21,$134.09,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No
1,0,0,2002,9,1,06:42,$38.48,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
2,0,0,2002,9,2,06:22,$120.34,Swipe Transaction,-727612092139916043,Monterey Park,CA,91754.0,5411,,No
3,0,0,2002,9,2,17:45,$128.95,Swipe Transaction,3414527459579106770,Monterey Park,CA,91754.0,5651,,No
4,0,0,2002,9,3,06:23,$104.71,Swipe Transaction,5817218446178736267,La Verne,CA,91750.0,5912,,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24386895,1999,1,2020,2,27,22:23,$-54.00,Chip Transaction,-5162038175624867091,Merrimack,NH,3054.0,5541,,No
24386896,1999,1,2020,2,27,22:24,$54.00,Chip Transaction,-5162038175624867091,Merrimack,NH,3054.0,5541,,No
24386897,1999,1,2020,2,28,07:43,$59.15,Chip Transaction,2500998799892805156,Merrimack,NH,3054.0,4121,,No
24386898,1999,1,2020,2,28,20:10,$43.12,Chip Transaction,2500998799892805156,Merrimack,NH,3054.0,4121,,No


### Modify Merchant Names to be MerchantN e.g. (Merchant1, Merchant2, etc.)

In [26]:
merchant_name_df = pd.DataFrame(
    cc_sales_gzip_df.groupby("Merchant Name").count().index.tolist(),
    columns=["Merchant Name"],
)
merchant_name_df

Unnamed: 0,Merchant Name
0,-9222899435637403521
1,-9222692221935167526
2,-9222439367252190791
3,-9222264855000293132
4,-9222232253446715869
...,...
100338,9222821118491815331
100339,9222874644865944349
100340,9222877122873253163
100341,9222957302638210593


In [27]:
merchant_name_df.insert(1, "Merchant Name (Revised)", "")
merchant_name_df

Unnamed: 0,Merchant Name,Merchant Name (Revised)
0,-9222899435637403521,
1,-9222692221935167526,
2,-9222439367252190791,
3,-9222264855000293132,
4,-9222232253446715869,
...,...,...
100338,9222821118491815331,
100339,9222874644865944349,
100340,9222877122873253163,
100341,9222957302638210593,


In [28]:
# Label the merchants sequentially: Merchant 1, Merchant 2, ...
merchant_name_df["Merchant Name (Revised)"] = [
    f"Merchant {n}" for n in range(1, len(merchant_name_df) + 1)
]

In [29]:
merchant_name_df

Unnamed: 0,Merchant Name,Merchant Name (Revised)
0,-9222899435637403521,Merchant 1
1,-9222692221935167526,Merchant 2
2,-9222439367252190791,Merchant 3
3,-9222264855000293132,Merchant 4
4,-9222232253446715869,Merchant 5
...,...,...
100338,9222821118491815331,Merchant 100339
100339,9222874644865944349,Merchant 100340
100340,9222877122873253163,Merchant 100341
100341,9222957302638210593,Merchant 100342
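
The relabeling above can also be expressed as a lookup that is mapped directly onto the transactions, which avoids building and joining a separate DataFrame. This is a sketch on made-up IDs, not the notebook's method; `sales` and `lookup` are hypothetical names.

```python
import pandas as pd

# Toy transactions; the IDs are shortened stand-ins for the hashed merchant names
sales = pd.DataFrame({"Merchant Name": [35, -72, -72, 34]})

# Number the distinct merchants in sorted order, starting at 1
unique_ids = sorted(sales["Merchant Name"].unique())
lookup = {merchant: f"Merchant {n}" for n, merchant in enumerate(unique_ids, start=1)}

# Map each transaction to its anonymized label without a separate join
sales["Merchant Name (Revised)"] = sales["Merchant Name"].map(lookup)
print(sales["Merchant Name (Revised)"].tolist())
# ['Merchant 3', 'Merchant 1', 'Merchant 1', 'Merchant 2']
```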


### Join New Merchant Names to Credit Card Transactions Data

In [30]:
merchant_name_merge_df = pd.merge(
    cc_sales_gzip_df,
    merchant_name_df,
    left_on="Merchant Name",
    right_on="Merchant Name",
)
merchant_name_merge_df

Unnamed: 0,User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?,Merchant Name (Revised)
0,0,0,2002,9,1,06:21,$134.09,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No,Merchant 69375
1,0,0,2002,9,10,06:22,$102.18,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No,Merchant 69375
2,0,0,2002,9,16,06:00,$115.34,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No,Merchant 69375
3,0,0,2002,9,18,06:19,$128.85,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No,Merchant 69375
4,0,0,2002,9,23,06:01,$134.89,Swipe Transaction,3527213246127876953,La Verne,CA,91750.0,5300,,No,Merchant 69375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24386895,1999,1,2019,12,21,07:59,$42.80,Chip Transaction,-3533580464561517260,Russellville,AL,35653.0,4121,,No,Merchant 31143
24386896,1999,1,2019,12,22,08:15,$46.72,Chip Transaction,-3533580464561517260,Russellville,AL,35653.0,4121,,No,Merchant 31143
24386897,1999,1,2019,12,22,20:25,$46.30,Chip Transaction,-3533580464561517260,Russellville,AL,35653.0,4121,,No,Merchant 31143
24386898,1999,1,2019,12,23,19:48,$49.00,Chip Transaction,-3533580464561517260,Russellville,AL,35653.0,4121,,No,Merchant 31143


### Drop Extraneous Columns and Format Data

In [31]:
merchant_name_merge_df.pop("Merchant Name")
merchant_name_merge_df.rename(
    columns={"Merchant Name (Revised)": "Merchant Name"}, inplace=True
)

merchant_name_merge_df["Year"] = merchant_name_merge_df["Year"].astype(str)
merchant_name_merge_df["Month"] = merchant_name_merge_df["Month"].astype(str)
merchant_name_merge_df["Day"] = merchant_name_merge_df["Day"].astype(str)

# Combine the Year, Month, Day, and Time columns
# into a single `Datetime` column with a proper datetime data type
merchant_name_merge_df.insert(2, "Datetime", "")
merchant_name_merge_df["Datetime"] = pd.to_datetime(
    merchant_name_merge_df["Year"]
    + " "
    + merchant_name_merge_df["Month"]
    + " "
    + merchant_name_merge_df["Day"]
    + " "
    + merchant_name_merge_df["Time"]
)
merchant_name_merge_df.drop(columns=["Year", "Month", "Day", "Time"], inplace=True)
merchant_name_merge_df

Unnamed: 0,User,Card,Datetime,Amount,Use Chip,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?,Merchant Name
0,0,0,2002-09-01 06:21:00,$134.09,Swipe Transaction,La Verne,CA,91750.0,5300,,No,Merchant 69375
1,0,0,2002-09-10 06:22:00,$102.18,Swipe Transaction,La Verne,CA,91750.0,5300,,No,Merchant 69375
2,0,0,2002-09-16 06:00:00,$115.34,Swipe Transaction,La Verne,CA,91750.0,5300,,No,Merchant 69375
3,0,0,2002-09-18 06:19:00,$128.85,Swipe Transaction,La Verne,CA,91750.0,5300,,No,Merchant 69375
4,0,0,2002-09-23 06:01:00,$134.89,Swipe Transaction,La Verne,CA,91750.0,5300,,No,Merchant 69375
...,...,...,...,...,...,...,...,...,...,...,...,...
24386895,1999,1,2019-12-21 07:59:00,$42.80,Chip Transaction,Russellville,AL,35653.0,4121,,No,Merchant 31143
24386896,1999,1,2019-12-22 08:15:00,$46.72,Chip Transaction,Russellville,AL,35653.0,4121,,No,Merchant 31143
24386897,1999,1,2019-12-22 20:25:00,$46.30,Chip Transaction,Russellville,AL,35653.0,4121,,No,Merchant 31143
24386898,1999,1,2019-12-23 19:48:00,$49.00,Chip Transaction,Russellville,AL,35653.0,4121,,No,Merchant 31143
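
An alternative to concatenating strings is to assemble the date from its numeric components and add the time of day as an offset. A minimal sketch on two made-up rows (the `parts` frame is illustrative; the notebook itself parses the concatenated string):

```python
import pandas as pd

parts = pd.DataFrame(
    {"Year": [2002, 2019], "Month": [9, 12], "Day": [1, 23], "Time": ["06:21", "19:48"]}
)

# Assemble the date from its components (columns renamed to year, month, day)
dates = pd.to_datetime(parts[["Year", "Month", "Day"]].rename(columns=str.lower))

# Add the time of day as an offset; "HH:MM" is padded to "HH:MM:SS" first
datetimes = dates + pd.to_timedelta(parts["Time"] + ":00")
print(datetimes.tolist())
```

This keeps `Year`, `Month`, and `Day` as integers, so the `astype(str)` casts would not be needed with this approach.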


In [32]:
merchant_name_merge_df_revised = merchant_name_merge_df[
    [
        "User",
        "Card",
        "Datetime",
        "Amount",
        "Use Chip",
        "Merchant Name",
        "Merchant City",
        "Merchant State",
        "Zip",
        "MCC",
        "Errors?",
        "Is Fraud?",
    ]
].copy()

In [33]:
merchant_name_merge_df_revised

Unnamed: 0,User,Card,Datetime,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
0,0,0,2002-09-01 06:21:00,$134.09,Swipe Transaction,Merchant 69375,La Verne,CA,91750.0,5300,,No
1,0,0,2002-09-10 06:22:00,$102.18,Swipe Transaction,Merchant 69375,La Verne,CA,91750.0,5300,,No
2,0,0,2002-09-16 06:00:00,$115.34,Swipe Transaction,Merchant 69375,La Verne,CA,91750.0,5300,,No
3,0,0,2002-09-18 06:19:00,$128.85,Swipe Transaction,Merchant 69375,La Verne,CA,91750.0,5300,,No
4,0,0,2002-09-23 06:01:00,$134.89,Swipe Transaction,Merchant 69375,La Verne,CA,91750.0,5300,,No
...,...,...,...,...,...,...,...,...,...,...,...,...
24386895,1999,1,2019-12-21 07:59:00,$42.80,Chip Transaction,Merchant 31143,Russellville,AL,35653.0,4121,,No
24386896,1999,1,2019-12-22 08:15:00,$46.72,Chip Transaction,Merchant 31143,Russellville,AL,35653.0,4121,,No
24386897,1999,1,2019-12-22 20:25:00,$46.30,Chip Transaction,Merchant 31143,Russellville,AL,35653.0,4121,,No
24386898,1999,1,2019-12-23 19:48:00,$49.00,Chip Transaction,Merchant 31143,Russellville,AL,35653.0,4121,,No


### Format Credit Card Transactions Data

In [34]:
# Cast intended measures as numerical data types
# (`regex=False` treats "$" as a literal character rather than a regex anchor)
merchant_name_merge_df_revised["Amount"] = merchant_name_merge_df_revised[
    "Amount"
].str.replace("$", "", regex=False)
merchant_name_merge_df_revised["Amount"] = merchant_name_merge_df_revised[
    "Amount"
].astype(float)
# Scale every amount down to 20% of its original value
merchant_name_merge_df_revised["Amount"] = (
    merchant_name_merge_df_revised["Amount"] * 0.20
)

### Downsize Original Data Volume to 5 Million

In [35]:
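# Keep the rows from position 19,386,900 onward (about 5 million rows);
# `.copy()` makes the result independent of the full DataFrame before it is modified
merchant_name_merge_df_5MM = merchant_name_merge_df_revised[19386900:].copy()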

### Simulate Payments

In [36]:
# Flip the sign of 1 million randomly sampled transactions to simulate payments
dfupdate = merchant_name_merge_df_5MM.sample(1000000)
dfupdate["Amount"] *= -1
merchant_name_merge_df_5MM.update(dfupdate)
merchant_name_merge_df_5MM.head(20)

Unnamed: 0,User,Card,Datetime,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
19386900,50,0,2019-07-04 23:37:00,4.432,Chip Transaction,Merchant 47076,Beaverton,OR,97007.0,5921,,No
19386901,50,1,2019-06-28 23:33:00,3.436,Chip Transaction,Merchant 47076,Beaverton,OR,97007.0,5921,,No
19386902,792,1,2015-11-28 16:08:00,1.672,Chip Transaction,Merchant 47076,Yonkers,NY,10703.0,5921,,No
19386903,1210,1,2007-03-07 22:52:00,4.076,Swipe Transaction,Merchant 47076,Beaverton,OR,97007.0,5921,,No
19386904,1575,0,2016-05-23 07:07:00,-2.462,Swipe Transaction,Merchant 47076,Shreveport,LA,71107.0,5921,,No
19386905,50,0,2019-12-17 23:56:00,3.818,Chip Transaction,Merchant 48650,Whittaker,MI,48190.0,5921,,No
19386906,50,3,2017-01-07 23:46:00,1.75,Chip Transaction,Merchant 48650,Whittaker,MI,48190.0,5921,,No
19386907,598,1,2019-09-18 15:09:00,-3.66,Chip Transaction,Merchant 48650,Whittaker,MI,48190.0,5921,,No
19386908,598,1,2019-09-18 15:51:00,4.084,Chip Transaction,Merchant 48650,Whittaker,MI,48190.0,5921,,No
19386909,598,1,2019-09-20 15:10:00,3.144,Chip Transaction,Merchant 48650,Whittaker,MI,48190.0,5921,,No
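
Since `sample` draws different rows on every run, the processed file changes each time the notebook is executed. A minimal sketch of a reproducible variant on made-up amounts, assuming a fixed `random_state` is acceptable (the seed value and the `transactions` frame are illustrative):

```python
import pandas as pd

# Toy transactions standing in for the downsized DataFrame
transactions = pd.DataFrame({"Amount": [4.43, 3.44, 1.67, 4.08, 2.46]})

# Sample row labels reproducibly, then flip the sign of those amounts in place
payment_index = transactions.sample(n=2, random_state=0).index
transactions.loc[payment_index, "Amount"] *= -1

print((transactions["Amount"] < 0).sum())  # 2
```

Flipping the sign through `.loc` on the sampled index also avoids building and merging back a second DataFrame.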


### Output Credit Card Transactions Data to CSV

In [37]:
merchant_name_merge_df_5MM.to_csv(
    f"{outdir}/credit_card_transactions_processed_5MM.csv", index=False
)

### Compress CSV to GZIP

In [38]:
!gzip -f data/credit_card_transactions_processed_5MM.csv
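
The export and compression steps could also be combined, since pandas infers gzip compression from a `.gz` suffix. A small sketch that writes to a temporary folder so it does not touch the notebook's `data` directory (the file name and `frame` contents are made up):

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({"User": [50, 792], "Amount": [4.432, 1.672]})

# A temporary folder keeps the sketch from touching the notebook's `data` directory
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "transactions.csv.gz")
    # pandas infers gzip compression from the ".gz" suffix on write and on read
    frame.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(round_trip.equals(frame))
```

This removes the dependency on a `gzip` executable being available in the shell.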

## Notes for Improvement

* We can refactor the entity relationship assignment for Cards-to-Retailers.
* We should adjust `Amount` values depending on the category of the purchase (e.g. Airline purchases should probably be over $100 at minimum).
* We should make sure that no single user is over `100%` credit card utilization.
* We should make sure that no single user is under `0%` credit card utilization.
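
The utilization checks in the last two notes could be sketched as follows. This assumes, purely for illustration, that utilization means a user's net transaction amount divided by the sum of that user's credit limits; the notebook does not define it, and the `cards` and `transactions` frames are made up.

```python
import pandas as pd

# Toy data; "utilization" is assumed here to be net spend / total credit limit per user
cards = pd.DataFrame({"User": [0, 0, 1], "Credit Limit": [1000, 500, 2000]})
transactions = pd.DataFrame(
    {"User": [0, 0, 1, 1], "Amount": [900.0, 800.0, 100.0, -300.0]}
)

total_limit = cards.groupby("User")["Credit Limit"].sum()
net_spend = transactions.groupby("User")["Amount"].sum()
utilization = net_spend / total_limit

# Users outside the 0%-100% range would need their amounts adjusted
out_of_range = utilization[(utilization < 0) | (utilization > 1)]
print(out_of_range)
```

In this toy data both users are flagged: user 0 spends 1700 against a 1500 limit, and user 1 has a negative net spend.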
