In [1]:
# Import relevant python packages
from datetime import datetime
import pandas as pd
import numpy as np

## Data Sources & Context
> This project uses a subset of anonymized donor data provided by <b>Student Mobilization, Inc.</b> The dataset includes key features such as <b>transaction amounts</b>, <b>payment methods</b>, and <b>donor contact information</b>. Due to the sensitive nature of donor information, the raw CSV files <b>are not publicly shared</b> or uploaded to this notebook. All analyses, visualizations, and summaries are based on internal datasets and are presented sequentially throughout the project to protect individual privacy.
>
> <i><b>Note</b>: The data cleaning steps in the following sections refer to several donors by name. These are not the actual names of any of our donors; they are aliases used in place of real names to protect donor privacy.</i>

## Data Loading & Initial Exploration
> The <b>donor</b> and <b>transaction</b> data are loaded from CSV files into <b>pandas DataFrames</b>. We begin by inspecting the structure of each dataset and reviewing descriptive statistics to understand key features, data types, and overall data quality. This step provides the foundation for all subsequent analysis.

In [2]:
# Load the data from CSV files directly to pandas
donors_df = pd.read_csv('donors_updated.csv', header = 0)
transactions_df = pd.read_csv('transactions_updated.csv', header = 0)

In [3]:
# Display a concise summary of the DataFrame, including column names, non-null counts, and data types
donors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Donor Type      155 non-null    object
 1   Family Name     155 non-null    object
 2   Email           153 non-null    object
 3   Address Line 1  152 non-null    object
 4   Address Line 2  1 non-null      object
 5   City            152 non-null    object
 6   State           152 non-null    object
 7   Postal Code     152 non-null    object
 8   Country         152 non-null    object
dtypes: object(9)
memory usage: 11.0+ KB


In [4]:
transactions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2392 entries, 0 to 2391
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           2392 non-null   object 
 1   Donor Name     2392 non-null   object 
 2   Recurring      2392 non-null   object 
 3   Description    2392 non-null   object 
 4   Amount         2392 non-null   float64
 5   Currency Type  2392 non-null   object 
dtypes: float64(1), object(5)
memory usage: 112.3+ KB


In [5]:
# Get descriptive statistics for all DataFrame columns, including categorical and numeric data
# The 'top' row (the most frequent value in each column) is dropped from the output
donors_df.describe(include = 'all').drop(index = 'top')

Unnamed: 0,Donor Type,Family Name,Email,Address Line 1,Address Line 2,City,State,Postal Code,Country
count,155,155,153,152,1,152,152,152,152
unique,2,146,139,132,1,69,21,131,1
freq,149,8,10,10,1,40,76,10,152


In [6]:
transactions_df.describe(include = 'all').drop(index = 'top')

Unnamed: 0,Date,Donor Name,Recurring,Description,Amount,Currency Type
count,2392.0,2392.0,2392.0,2392.0,2392.0,2392.0
unique,2305.0,146.0,2.0,252.0,,4.0
freq,3.0,47.0,2301.0,44.0,,1478.0
mean,,,,,130.58398,
std,,,,,333.995303,
min,,,,,-100.0,
25%,,,,,51.5,
50%,,,,,100.0,
75%,,,,,103.0,
max,,,,,10000.0,


In [7]:
# Inspect the donors_df for null values
donors_df.isnull().sum()

Donor Type          0
Family Name         0
Email               2
Address Line 1      3
Address Line 2    154
City                3
State               3
Postal Code         3
Country             3
dtype: int64

In [8]:
# Inspect the transactions_df for null values
transactions_df.isnull().sum()

Date             0
Donor Name       0
Recurring        0
Description      0
Amount           0
Currency Type    0
dtype: int64

## Data Cleaning & Preprocessing 
>- Negative donation amounts (e.g., refunds or entry errors) are removed from `transactions_df`; the filter keeps only amounts greater than zero, so any zero-amount rows are removed as well.  
>- We also rename the `Family Name` column in `donors_df` to `Donor Name` to match the transaction records and enable accurate merging later on.
>- We standardize column names to ensure safe referencing without unpredictable spacing or capitalization issues.
>- Duplicated rows are dropped from `donors_df`.
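>
> A minimal sketch of the column normalization and amount filter on toy data. The frame `toy_transactions` and its values are made up for illustration and do not come from the donor files.

```python
import pandas as pd

# Toy data standing in for the private transaction export (values are invented)
toy_transactions = pd.DataFrame({
    'Donor Name': ['A', 'A', 'B', 'C'],
    'Amount': [50.0, -50.0, 0.0, 25.0],
    'Currency Type': ['ACH', 'ACH', 'Check', 'Credit Card'],
})

# Normalize column names: lowercase, with whitespace runs replaced by underscores
toy_transactions.columns = [
    '_'.join(col.lower().split()) for col in toy_transactions.columns
]

# Keep strictly positive amounts; .copy() yields an independent DataFrame
toy_clean = toy_transactions[toy_transactions['amount'] > 0].copy()

print(list(toy_clean.columns))
print(len(toy_clean))
```

> Note that the `> 0` filter removes the zero-amount row as well as the negative one.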

In [9]:
# Rename the Family Name column in donors_df to match the corresponding column in transactions_df 
donors_df.rename(columns = {'Family Name': 'Donor Name'}, inplace = True)

# Normalize column names in DataFrames
donors_df.columns = ['_'.join(col.lower().split()) for col in donors_df.columns]
transactions_df.columns = ['_'.join(col.lower().split()) for col in transactions_df.columns]

In [10]:
donors_df.columns

Index(['donor_type', 'donor_name', 'email', 'address_line_1', 'address_line_2',
       'city', 'state', 'postal_code', 'country'],
      dtype='object')

In [11]:
transactions_df.columns

Index(['date', 'donor_name', 'recurring', 'description', 'amount',
       'currency_type'],
      dtype='object')

In [12]:
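# Keep only rows with a positive amount. This drops negative amounts (e.g., -100.0),
# which are likely refunds or entry errors, along with any zero amounts.
# .copy() makes the result an independent DataFrame so later in-place updates are safe.
transactions_df = transactions_df[transactions_df['amount'] > 0].copy()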

In [13]:
# Drop duplicate rows in donors_df
donors_df.drop_duplicates(inplace = True)

### Disambiguating Duplicate Donor Names

In [None]:
# Identify cases where multiple donor rows share the same name
donors_df[donors_df['donor_name'].str.lower().duplicated(keep = False)]

> During exploration, we discovered multiple entries for the donor name `Quinn Maddox`, which appear to correspond to different individuals based on their email addresses (as seen above) and payment methods (as seen in `transactions_df`). 
> 
> To ensure accurate analysis and merging, we disambiguate these records by:
> - **Identifying distinct email patterns** (`quinn_cat` and `abquinn`)
> - **Matching them to corresponding payment types** (`Credit Card` and `ACH`)
> - **Renaming both donor and transaction records** with unique identifiers
> 
> This step helps to prevent attributing transactions to the wrong individual when aggregating and merging data. 

In [16]:
# Disambiguate the two donors who share the name "Quinn Maddox"
# The two individuals differ by email address (donors_df) and payment method (transactions_df).
# We give each a distinct, clearly labeled donor name so records match correctly later.

# Map partial email usernames → standardized donor label
email_to_unified_name = {
    'quinn_cat': 'Quinn Maddox (quinn_cat)', 
    'abquinn': 'Quinn Maddox (abquinn)'
}

# Map payment method → standardized donor label (used in transactions_df before name updates)
payment_to_unified_name = {
    'Credit Card': 'Quinn Maddox (quinn_cat)', 
    'ACH': 'Quinn Maddox (abquinn)'
}


# Update donor names in donors_df based on email pattern matches
for email_pattern, unified_name in email_to_unified_name.items():
    # Only modify rows where the donor_name is currently "Quinn Maddox"
    # AND email partially matches the specific identifier
    donor_match_mask = (
        donors_df['donor_name'].eq('Quinn Maddox') &
        donors_df['email'].str.contains(email_pattern, na = False) 
    )
    # Overwrite original donor_name with a fully-distinguished label
    donors_df.loc[donor_match_mask, 'donor_name'] = unified_name

# Update donor names in transactions_df based on payment method
for payment_type, unified_name in payment_to_unified_name.items():
    # Match rows where the donor is labeled "Quinn Maddox"
    # AND the payment method (stored in currency_type) matches
    trans_match_mask = (
        transactions_df['donor_name'].eq('Quinn Maddox') &
        transactions_df['currency_type'].eq(payment_type)
    )
    # Overwrite original donor_name with a fully-distinguished label
    transactions_df.loc[trans_match_mask, 'donor_name'] = unified_name

### Normalizing Donor Identities Across Data Sources
> Some donors appear multiple times under different identifiers such as email variations, maiden vs. married names, spouse-combined names, or duplicate entries caused by shared addresses.
>
> To ensure accurate donor-level aggregation later, we:
>- **Detect duplicate donor records** using email or physical address
>- **Group records belonging to the same individual** based on unique identifiers (partial emails or exact address match)
>- **Replace name variants with a single standardized donor name**
>- **Propagate corrected names into both DataFrames**
>- **Remove remaining duplicates** after successful normalization
>
> This creates a **clean one-to-one mapping** between each donor and their giving history, preventing under/over-counting in RFM segmentation and donation metrics.
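>
> The detection and replacement steps can be sketched on toy data as follows. The names and emails are invented, and choosing the first variant as the standardized name is an assumption made for this illustration; in the notebook the standardized names are chosen by hand.

```python
import pandas as pd

# Toy donor records: the first two rows share an email apart from letter case
toy_donors = pd.DataFrame({
    'donor_name': ['Pat Example', 'Pat & Sam Example', 'Lee Sample'],
    'email': ['pat@example.org', 'PAT@example.org', 'lee@example.org'],
})

# Collect the name variants recorded under each (lowercased) email
email_key = toy_donors['email'].str.lower()
variants_by_email = toy_donors.groupby(email_key)['donor_name'].agg(list)
shared = variants_by_email[variants_by_email.map(len) > 1]

# Map every variant to one standardized name (assumption: the first variant)
name_map = {variant: names[0] for names in shared for variant in names}

toy_donors['donor_name'] = toy_donors['donor_name'].map(lambda n: name_map.get(n, n))
toy_donors = toy_donors.drop_duplicates(subset='donor_name')
print(toy_donors['donor_name'].tolist())
```

> The same `name_map` could be applied to the transaction records so that both tables use the standardized name.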

In [17]:
# Identify cases where multiple donor rows share the same email (excluding nulls)
duplicate_email_mask = (
    donors_df['email'].str.lower().duplicated(keep = False) &
    donors_df['email'].notnull()
)
donors_df[duplicate_email_mask]

# Mapping of partial email identifiers → standardized donor names
email_name_mapping = [
    ('emily.carter', 'Emily Carter'),
    ('nathan.brooks', 'Nathan Brooks'),
    ('sofia.delgado', 'Sofia Delgado'),
    ('marcus.tanaka', 'Marcus Tanaka'),
    ('chloe.ramirez', 'Chloe Ramirez'),
    ('julian.porter', 'Julian Porter')
]

# Update donors with unified names using email patterns
for email_pattern, unified_name in email_name_mapping:
    # Donor records whose email contains the pattern
    # (regex = False gives a literal match, so '.' is not treated as a wildcard)
    match_mask = donors_df['email'].str.contains(email_pattern, regex = False, na = False)

    # Extract the specific name variants currently used
    name_variants = donors_df.loc[match_mask, 'donor_name'].to_list()
    
    # Update name in both DataFrames wherever those variants occur
    donors_df.loc[donors_df['donor_name'].isin(name_variants), 'donor_name'] = unified_name
    transactions_df.loc[transactions_df['donor_name'].isin(name_variants), 'donor_name'] = unified_name

# Remove duplicates caused by name normalization
donors_df.drop_duplicates(subset = 'donor_name', keep = 'first', inplace = True)

# Detect donors with duplicate physical address lines
duplicate_address_mask = (
    donors_df['address_line_1'].duplicated(keep = False) &
    donors_df['address_line_1'].notnull()
)
donors_df[duplicate_address_mask]

# Manually unify one household with multiple donor rows
household_variants = donors_df.loc[
    donors_df['address_line_1'].eq('4821 Willow Crest Ln'), 'donor_name'
    ].to_list()

# Update name in both DataFrames wherever those variants occur
donors_df.loc[donors_df['donor_name'].isin(household_variants), 'donor_name'] = 'Olivia Bennett'
transactions_df.loc[transactions_df['donor_name'].isin(household_variants), 'donor_name'] = 'Olivia Bennett'

# Remove duplicates again after household consolidation
donors_df.drop_duplicates(subset = 'donor_name', inplace = True)

### Manual Location Input for Missing Donor Data
>Some donors were **missing location information** in the original dataset. We manually added `city`, `state`, and `country` values for these donors based on **external knowledge**. These columns will be helpful in any location-based analyses or segmentation (e.g., state-by-state analysis of total donation amounts).
>
>- `Ellis` → Salt Lake City, UT, US  
>- `Monroe` → Austin, TX, US  
>- `Sinclair` → Denver, CO, US

In [18]:
# Fill in missing donor location data by matching on last name.
# These donors have no location stored; note that the assignment below applies to
# every donor whose name contains the keyword, whether or not a location is already present.

# Mapping of donor last-name keyword → [city, state, country]
last_name_to_location = {
    'Ellis': ['Salt Lake City', 'UT', 'US'],
    'Monroe': ['Austin', 'TX', 'US'],
    'Sinclair': ['Denver', 'CO', 'US']
        }

for last_name_keyword, location_values in last_name_to_location.items():
    # Create a boolean mask for donors whose names contain this keyword
    matching_donors = donors_df['donor_name'].str.contains(last_name_keyword)

    # Assign consistent city/state/country values to matched rows
    donors_df.loc[matching_donors, ['city', 'state', 'country']] = location_values
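
> A more conservative variant, sketched on toy data, fills the location only where all three location fields are missing, so a donor whose name happens to contain the keyword keeps any existing values. The donor names below are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy donors: 'Kim Ellison' contains the keyword 'Ellis' but already has a location
toy_donors = pd.DataFrame({
    'donor_name': ['Jo Ellis', 'Kim Ellison', 'Ray Monroe'],
    'city': [np.nan, 'Boise', np.nan],
    'state': [np.nan, 'ID', np.nan],
    'country': [np.nan, 'US', np.nan],
})

last_name_to_location = {
    'Ellis': ['Salt Lake City', 'UT', 'US'],
    'Monroe': ['Austin', 'TX', 'US'],
}

location_cols = ['city', 'state', 'country']
for keyword, values in last_name_to_location.items():
    # Literal substring match on the donor name
    name_match = toy_donors['donor_name'].str.contains(keyword, regex=False, na=False)
    # Only rows where city, state, and country are all missing
    missing_location = toy_donors[location_cols].isna().all(axis=1)
    toy_donors.loc[name_match & missing_location, location_cols] = values

print(toy_donors)
```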

### Column Pruning
>Before continuing with analysis, we **drop unnecessary columns** that are irrelevant to our goals.
>
>- From `donors_df`: `donor_type`, `email`, `address_line_1`, `address_line_2`, `postal_code`
>- From `transactions_df`: `recurring`, `description`

In [19]:
# Drop columns that will not be necessary for our analysis
donors_df.drop(
    columns = ['donor_type', 'email', 'address_line_1', 'address_line_2', 'postal_code'], 
    inplace = True
)
transactions_df.drop(
    columns = ['recurring', 'description'], 
    inplace = True
)

## Feature Engineering: Donor-Level Metrics
>In this section, **we engineer key features that summarize each donor's giving behavior** based on the raw transactions dataset. For the most part, transactions are grouped by `donor_name`, aggregated, and then merged into the `donors_df`. **These features are critical for segmenting donors according to their giving behavior**.

### Timestamp Conversion
> We start by ensuring all donation dates are in a **consistent pandas datetime format** (with timezones removed to simplify calculations).
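>
> As a small illustration of what removing the timezone does, the sketch below uses two made-up timestamps that share one UTC offset. `tz_localize(None)` keeps the local wall-clock time, whereas `tz_convert(None)` would first convert to UTC.

```python
import pandas as pd

# Made-up timestamps with a fixed -06:00 offset
raw_dates = pd.Series(['2024-01-15 10:30:00-06:00', '2024-07-04 18:00:00-06:00'])

aware = pd.to_datetime(raw_dates)       # timezone-aware datetimes
naive = aware.dt.tz_localize(None)      # drop the timezone, keep wall-clock time
as_utc = aware.dt.tz_convert(None)      # convert to UTC, then drop the timezone

print(naive.iloc[0])
print(as_utc.iloc[0])
```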

In [20]:
# Convert all timestamps to pandas datetime objects
# Removing the timezone localization reduces complications in pandas datetime functions
transactions_df['date'] = pd.to_datetime(transactions_df['date']).dt.tz_localize(None) 

### Aggregating Donor-Level Features
>We define a function to help simplify aggregating feature metrics and merging into `donors_df`.

In [21]:
# Compute and merge aggregated features per donor 
def add_agg_feature(donors, trans, col, agg, new_feature):
    # .agg expects named keyword arguments — ** expands a dict:
    # {new_feature: (col, agg)} → new column name mapped to (column_to_aggregate, operation)
    df_feature = trans.groupby('donor_name', as_index = False).agg(**{new_feature: (col, agg)})
    # Inner merge: donors with no matching transactions are dropped from the result
    return pd.merge(donors, df_feature[['donor_name', new_feature]], how = 'inner', on = 'donor_name')
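
> A quick check of the helper on toy data (the frames below are invented). The function is repeated so the example runs on its own. Donor `C` has no transactions and therefore does not appear in the result, because the merge is an inner join.

```python
import pandas as pd

def add_agg_feature(donors, trans, col, agg, new_feature):
    # Named aggregation: new_feature=(column_to_aggregate, operation)
    df_feature = trans.groupby('donor_name', as_index=False).agg(**{new_feature: (col, agg)})
    return pd.merge(donors, df_feature[['donor_name', new_feature]], how='inner', on='donor_name')

toy_donors = pd.DataFrame({'donor_name': ['A', 'B', 'C']})
toy_tx = pd.DataFrame({
    'donor_name': ['A', 'A', 'B'],
    'amount': [10.0, 20.0, 5.0],
})

result = add_agg_feature(toy_donors, toy_tx, 'amount', 'sum', 'monetary')
print(result)
```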

>We then compute and merge the following features (per donor):
> - **Frequency**: total number of donations 
> - **Monetary**: total amount donated 
> - **Donation Statistics**: mean, median, maximum, and minimum donation amounts 
> - **Donation Timeline**: first and last donation dates 
> - **Recency**: number of days since last donation
> - **Recurring Flag**: `True` if donor gave more than once, `False` otherwise
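>
> As an alternative sketch, several of these features can be computed in a single `groupby` pass with named aggregation and merged once. This is not the approach used in the notebook; the toy transactions below are invented.

```python
import pandas as pd

# Toy transactions for two donors
toy_tx = pd.DataFrame({
    'donor_name': ['A', 'A', 'B'],
    'amount': [10.0, 30.0, 5.0],
    'date': pd.to_datetime(['2024-01-01', '2024-03-01', '2024-02-01']),
})

# One pass over the groups, one output column per keyword argument
donor_features = toy_tx.groupby('donor_name', as_index=False).agg(
    monetary=('amount', 'sum'),
    frequency=('amount', 'count'),
    donation_start_date=('date', 'min'),
    last_donation_date=('date', 'max'),
    mean_amount=('amount', 'mean'),
)

# Recurring flag: more than one donation
donor_features['is_recurring'] = donor_features['frequency'] > 1
print(donor_features)
```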

In [22]:
# Define list of aggregation instructions:
# (new column name, source column, aggregation function)
agg_instructions = [
    ('monetary', 'amount', 'sum'), # Sum of all donations per donor
    ('frequency', 'amount', 'count'), # Total number of donations per donor
    ('donation_start_date', 'date', 'min'), # Earliest donation date
    ('last_donation_date', 'date', 'max'), # Most recent donation date
    ('mean_amount', 'amount', 'mean'), # Average donation size
    ('med_amount', 'amount', 'median'), # Median donation size
    ('max_amount', 'amount', 'max'), # Largest single donation
    ('min_amount', 'amount', 'min'), # Smallest single donation
]

# Loop through aggregation instructions and create donor-level features
for new_col, source_col, agg_func in agg_instructions:
    donors_df = add_agg_feature(donors_df, transactions_df, source_col, agg_func, new_col)

# Calculate recency: number of days since last donation
donors_df['recency'] = (pd.to_datetime(datetime.now()) - donors_df['last_donation_date']).dt.days

# Classify donors as recurring (more than one donation) using boolean column
donors_df['is_recurring'] = (donors_df['frequency'] > 1)
#donors_df['is_recurring'] = donors_df['frequency'].apply(lambda x: 1 if x > 1 else 0)

# Quick statistical summary of donor-level features
donors_df.describe()

Unnamed: 0,monetary,frequency,donation_start_date,last_donation_date,mean_amount,med_amount,max_amount,min_amount,recency
count,139.0,139.0,139,139,139.0,139.0,139.0,139.0,139.0
mean,2252.567482,17.151079,2023-07-05 17:46:13.323740928,2024-12-13 01:55:25.776978432,345.380164,339.123165,496.427194,328.766906,362.892086
min,12.0,1.0,2022-02-22 04:01:00,2022-02-22 04:01:00,10.2,10.3,10.3,10.0,-1.0
25%,604.5,3.0,2022-09-06 21:16:39,2024-03-27 19:25:18.500000,52.144737,51.5,56.65,51.5,9.0
50%,1400.0,14.0,2022-10-15 20:18:07,2025-09-23 19:16:25,101.0,100.0,103.0,100.0,78.0
75%,2896.5,26.5,2024-10-07 14:40:59,2025-12-01 19:22:51,197.23635,154.5,253.75,150.0,623.0
max,14234.25,48.0,2025-09-21 19:26:04,2025-12-12 09:00:00,10000.0,10000.0,10000.0,10000.0,1388.0
std,2485.703189,14.019093,,,1106.750158,1107.530659,1271.212448,1104.975932,447.168446
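
> The summary shows a minimum `recency` of -1. A negative value arises whenever a last donation timestamp is later than the reference moment, since `.dt.days` rounds toward negative infinity. The sketch below reproduces this on made-up dates and shows one possible remedy: comparing calendar dates against a fixed reference date. The `reference_date` value is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'last_donation_date': pd.to_datetime(['2025-12-01 19:00', '2025-12-12 09:00']),
})

# Hypothetical fixed reference date (midnight), used instead of datetime.now()
reference_date = pd.Timestamp('2025-12-12')

# Raw difference: the second donation is 9 hours after the reference, giving -1 days
recency_raw = (reference_date - toy['last_donation_date']).dt.days

# Calendar-day difference: strip the time of day from both sides first
recency_days = (
    reference_date.normalize() - toy['last_donation_date'].dt.normalize()
).dt.days

print(recency_raw.tolist())
print(recency_days.tolist())
```

> A fixed reference date also makes the recency values reproducible across notebook runs.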


In [None]:
# Save the cleaned DataFrames to CSV files for downstream analysis
donors_df.to_csv('donors_base_metrics.csv', index = False)
transactions_df.to_csv('transactions_cleaned.csv', index = False)