## Preprocessing on Credit Card Spending

### Imports

In [1]:
# necessary imports
import pandas as pd

# suppress pandas' chained-assignment warnings
pd.options.mode.chained_assignment = None  # default='warn'

### Preprocessing

To quickly reiterate the preprocessing plan laid out in `eda.ipynb`:

Preprocessing for individual DataFrames:

* `boa_df`:
    * Drop null records as they are of `Type` "Payment" and I'm only interested in "Sale"
    * Convert `Amount` to positive values
* `chase_df`:
    * Drop null records, same reasoning as above
    * Convert `Amount` to positive values
* `apple_df`:
    * Change record with `Merchant` "Best Buy" with `Category` "Other" to "Shopping"
    * Perform mapping for `Category` column:
        * | BoA/Chase | Apple |
            | --- | --- |
            | Shopping | Shopping |
            | Food & Drink | Restaurants |
            | Groceries | Grocery |
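
The two `apple_df` steps above can be sketched on a toy frame. This is a minimal sketch rather than the notebook's implementation: the sample rows are made up, and it assumes the mapping runs from the Apple names to the BoA/Chase names.

```python
import pandas as pd

# Toy stand-in for apple_df; these rows are made up for illustration.
apple_toy = pd.DataFrame({
    "Merchant": ["Best Buy", "Trader Joe's", "Chipotle", "Target"],
    "Category": ["Other", "Grocery", "Restaurants", "Shopping"],
})

# Recategorize the Best Buy record from "Other" to "Shopping".
is_best_buy_other = (apple_toy["Merchant"] == "Best Buy") & (apple_toy["Category"] == "Other")
apple_toy.loc[is_best_buy_other, "Category"] = "Shopping"

# Map Apple category names onto the BoA/Chase names (assumed direction).
# "Shopping" is identical on both sides, so it needs no entry.
apple_to_common = {"Restaurants": "Food & Drink", "Grocery": "Groceries"}
apple_toy["Category"] = apple_toy["Category"].replace(apple_to_common)

print(apple_toy["Category"].tolist())
# ['Shopping', 'Groceries', 'Food & Drink', 'Shopping']
```

Using `.loc` with a boolean mask updates only the matching record, and `Series.replace` leaves any category that is not in the dictionary unchanged.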

General preprocessing:
* Rename all column names to the corresponding name desired in the table schema:
    * | column_name | boa_df | chase_df | apple_df |
        | --- | --- | --- | --- |
        | transaction_date | Posted Date | Transaction Date | Transaction Date |
        | merchant | Payee | Description | Merchant |
        | category | Category | Category | Category |
        | amount | Amount | Amount | Amount (USD) |
        | card | "BoA" | "Chase" | "Apple" |
* Drop all other columns that are not enumerated in the schema
* For consistency, convert all values in the `merchant` column to lowercase. Additionally, remove the special characters `*` and `.` by replacing them with the empty string `''`
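
The renaming and column-dropping steps can be sketched on a toy frame. This is a minimal sketch, not the notebook's implementation: `to_schema` is a hypothetical helper, and the sample row and the extra `Post Date` and `Type` columns are assumptions for illustration. The mappings themselves come from the table above.

```python
import pandas as pd

schema_columns = ["transaction_date", "merchant", "category", "amount", "card"]

# Per-card rename mappings, taken from the schema table.
column_mappings = {
    "BoA": {"Posted Date": "transaction_date", "Payee": "merchant",
            "Category": "category", "Amount": "amount"},
    "Chase": {"Transaction Date": "transaction_date", "Description": "merchant",
              "Category": "category", "Amount": "amount"},
    "Apple": {"Transaction Date": "transaction_date", "Merchant": "merchant",
              "Category": "category", "Amount (USD)": "amount"},
}

def to_schema(df, card_name):
    """Hypothetical helper: rename columns, add `card`, keep only schema columns."""
    out = df.rename(columns=column_mappings[card_name])  # rename returns a new DataFrame
    out["card"] = card_name                              # scalar is broadcast to every row
    return out[schema_columns]                           # drops columns outside the schema

# Toy Chase-like frame; the row and the extra columns are assumptions.
chase_toy = pd.DataFrame({
    "Transaction Date": ["04/24/2023"],
    "Post Date": ["04/25/2023"],
    "Description": ["SQ *BLUE BOTTLE COFFEE"],
    "Category": ["Food & Drink"],
    "Type": ["Sale"],
    "Amount": [6.5],
})

result = to_schema(chase_toy, "Chase")
print(list(result.columns))
# ['transaction_date', 'merchant', 'category', 'amount', 'card']
```

Selecting `out[schema_columns]` both drops the unused columns and fixes the column order, so all three DataFrames end up with the same layout.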

Reading in the datasets:

In [2]:
# Read in the datasets
boa_csv = "data/bofa_spending_mod.csv"
boa_df = pd.read_csv(boa_csv)

chase_csv = "data/chase_spending.csv"
chase_df = pd.read_csv(chase_csv)

apple_csv = "data/apple_spending.csv"
apple_df = pd.read_csv(apple_csv)

#### Preprocessing Helper Functions

Below, I define some general helper functions for preprocessing, mostly simple filters and transformations. These modifications aren't difficult and could each be done in one line, but I decided to abstract them into helper functions.

I will perform the `apple_df`-specific preprocessing directly in `preprocess_apple()`, as it targets just one record.

In [3]:
# Takes in a DataFrame `df` and keeps only records with `Type` "Sale" (dropping "Payment" records)
# Returns the filtered DataFrame as a copy, so later column assignments don't modify the original
def remove_payment(df):
    return df[df['Type']=='Sale'].copy()

# Takes in a DataFrame `df` and converts `amount` values to positive
# (expects the column to already be renamed to lowercase `amount`)
# Returns the transformed DataFrame
def convert_amount(df):
    df['amount'] = df['amount'].abs()
    return df

# Takes in a DataFrame `df` and transforms the values in `column` (containing Strings) into lowercase
# Returns the new transformed DataFrame
def to_lower(df, column):
    df[column] = df[column].str.lower()
    return df

# Takes in a DataFrame `df` and uses `mapping` to map (rename) column names to
# those desired in the table schema
# Returns the DataFrame with the renamed columns
def transform_columns(df, mapping):
    return df.rename(columns=mapping)

# Takes in a DataFrame `df` and adds a column `card` containing values `card_name`
# for every record in the DataFrame
# Returns the new DataFrame with the added column
def add_card_column(df, card_name):
    df['card'] = card_name # a scalar is broadcast to every row
    return df

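# Takes in a DataFrame `df` and replaces every match of the regular expression `pat`
# in `column` (containing Strings) with `repl`
# Returns the transformed DataFrame
def remove_special(df, column, pat, repl):
    df[column] = df[column].str.replace(pat, repl, regex=True)
    return df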
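
Because `remove_special` passes `pat` to `Series.str.replace` with `regex=True`, the pattern is interpreted as a regular expression, where `*` and `.` normally have special meaning. The sketch below (with made-up merchant strings) shows why wrapping them in a character class such as `[*.]` is one safe way to match them literally.

```python
import pandas as pd

# Made-up merchant strings for illustration.
merchants = pd.Series(["SQ *COFFEE CO.", "AMZN Mktp US*1A2B3", "Dino's Restaurant"])

# Inside [...], both * and . are treated as literal characters.
cleaned = merchants.str.replace("[*.]", "", regex=True).str.lower()
print(cleaned.tolist())
# ['sq coffee co', 'amzn mktp us1a2b3', "dino's restaurant"]

# Without the brackets, "." matches any character and wipes out every string.
wiped = merchants.str.replace(".", "", regex=True)
print(wiped.tolist())
# ['', '', '']
```

Note that other punctuation, such as the apostrophe, is left untouched, since only `*` and `.` are in the character class.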

#### Bank of America Preprocessing

Below, I perform preprocessing on `boa_df`:

In [4]:
schema_columns = ['transaction_date', 'merchant', 'category', 'amount', 'card']

# Function that performs all the preprocessing functions for the Bank of America DataFrame
# Returns the cleaned DataFrame
def preprocess_boa():
    # perform boa-specific preprocessing
    # 1. Filter for `Type` "Sale"
    clean_boa_df = remove_payment(boa_df)

    # 2. Create boa-specific column mapping to table schema to rename columns
    boa_mapping = {'Posted Date':'transaction_date', 'Payee':'merchant', 'Category':'category', 'Amount':'amount'}
    clean_boa_df = transform_columns(clean_boa_df, boa_mapping)

    # 3. Convert `amount` to positive values
    #    (must follow the rename, since `convert_amount` expects the lowercase column name)
    clean_boa_df = convert_amount(clean_boa_df)

    # 4. Add `card` column containing 'BoA' for each record
    clean_boa_df = add_card_column(clean_boa_df, 'BoA')

    # 5. Remove special characters specified
    clean_boa_df = remove_special(clean_boa_df, column='merchant', pat='[*.]', repl='')

    # 6. Lowercase values in the `merchant` column
    clean_boa_df = to_lower(clean_boa_df, 'merchant')

    # 7. Keep desired columns as specified in table schema
    clean_boa_df = clean_boa_df[schema_columns]

    # Return the cleaned DataFrame
    return clean_boa_df

Call `preprocess_boa()` and check if it was properly cleaned:

In [6]:
clean_boa_df = preprocess_boa()
clean_boa_df.head(10)

|  | transaction_date | merchant | category | amount | card |
| --- | --- | --- | --- | --- | --- |
| 0 | 05/09/2023 | galpao gaucho cupertino | Food & Drink | 175.44 | BoA |
| 2 | 04/28/2023 | tst teaspoon - saratoga | Food & Drink | 5.5 | BoA |
| 3 | 04/25/2023 | chipotle 1031 | Food & Drink | 24.53 | BoA |
| 4 | 04/24/2023 | chipotle 1031 | Food & Drink | 13.84 | BoA |
| 5 | 04/24/2023 | dino's restaurant | Food & Drink | 38.0 | BoA |
| 6 | 04/24/2023 | ckeikes place palo alto | Food & Drink | 36.33 | BoA |
| 7 | 04/24/2023 | ikea east palo alto | Shopping | 7.81 | BoA |
| 8 | 04/24/2023 | ikea east palo alto | Shopping | 313.13 | BoA |
| 9 | 04/24/2023 | dinos grill | Food & Drink | 33.03 | BoA |
| 10 | 04/24/2023 | safeway #1224 | Groceries | 11.98 | BoA |


It looks like the cleaning was successful. Values in `merchant` are now all lowercase, and the special characters `*` and `.` have been removed. All values in `amount` are positive, and a `card` column has been added to the DataFrame. All desired columns have been renamed in accordance with the names specified in the table schema.
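
The same properties can also be checked programmatically instead of by eye. The following is a minimal sketch: `check_clean` is a hypothetical helper that is not part of the notebook, and `sample` reuses three rows from the output above.

```python
import pandas as pd

schema_columns = ["transaction_date", "merchant", "category", "amount", "card"]

def check_clean(df):
    """Hypothetical helper: assert the properties expected of a cleaned DataFrame."""
    # Columns match the table schema, in order.
    assert list(df.columns) == schema_columns
    # No negative amounts remain.
    assert (df["amount"] >= 0).all()
    # Merchant names are lowercase and free of `*` and `.`.
    assert (df["merchant"] == df["merchant"].str.lower()).all()
    assert not df["merchant"].str.contains("[*.]", regex=True).any()
    # Every record carries the same card label.
    assert df["card"].nunique() == 1
    return True

# Three rows copied from the output above.
sample = pd.DataFrame(
    {
        "transaction_date": ["05/09/2023", "04/28/2023", "04/24/2023"],
        "merchant": ["galpao gaucho cupertino", "tst teaspoon - saratoga", "safeway #1224"],
        "category": ["Food & Drink", "Food & Drink", "Groceries"],
        "amount": [175.44, 5.5, 11.98],
        "card": ["BoA", "BoA", "BoA"],
    },
    index=[0, 2, 10],
)

print(check_clean(sample))
# True
```

In the notebook, the same helper could be called on `clean_boa_df` directly, and reused for the other two cards.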

Now, to repeat this process two more times, for `chase_df` and `apple_df`.

#### Chase Preprocessing

#### Apple Preprocessing