# Feature Engineering — UPI Transactions 2024

This notebook derives **4 new columns** (three kinds of feature, with the age label computed for both sender and receiver) from the existing dataset:

| New Column | Source Column | Mapping |
|---|---|---|
| `day_part` | `hour_of_day` | Morning (6-11), Afternoon (12-17), Evening (18-21), Night (22-5) |
| `amount_tier` | `amount (INR)` | Small (<500), Medium (500-5000), Large (>5000, up to 50000), Very Large (>50000) |
| `sender_age_label` / `receiver_age_label` | `sender_age_group` / `receiver_age_group` | Young (18-25), Adult (26-55), Old (56+) |

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('upi_transactions_2024.csv')
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (250000, 17)


| | transaction id | timestamp | transaction type | merchant_category | amount (INR) | transaction_status | sender_age_group | receiver_age_group | sender_state | sender_bank | receiver_bank | device_type | network_type | fraud_flag | hour_of_day | day_of_week | is_weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TXN0000000001 | 2024-10-08 15:17:28 | P2P | Entertainment | 868 | SUCCESS | 26-35 | 18-25 | Delhi | Axis | SBI | Android | 4G | 0 | 15 | Tuesday | 0 |
| 1 | TXN0000000002 | 2024-04-11 06:56:00 | P2M | Grocery | 1011 | SUCCESS | 26-35 | 26-35 | Uttar Pradesh | ICICI | Axis | iOS | 4G | 0 | 6 | Thursday | 0 |
| 2 | TXN0000000003 | 2024-04-02 13:27:18 | P2P | Grocery | 477 | SUCCESS | 26-35 | 36-45 | Karnataka | Yes Bank | PNB | Android | 4G | 0 | 13 | Tuesday | 0 |
| 3 | TXN0000000004 | 2024-01-07 10:09:17 | P2P | Fuel | 2784 | SUCCESS | 26-35 | 26-35 | Delhi | ICICI | PNB | Android | 5G | 0 | 10 | Sunday | 1 |
| 4 | TXN0000000005 | 2024-01-23 19:04:23 | P2P | Shopping | 990 | SUCCESS | 26-35 | 18-25 | Delhi | Axis | Yes Bank | iOS | WiFi | 0 | 19 | Tuesday | 0 |
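
Before deriving anything, it may help to confirm that the four source columns are actually present in the loaded frame. The sketch below is one way to do that; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration, and the sample frame stands in for the real CSV.

```python
import pandas as pd

# Source columns that the derived features in this notebook depend on.
REQUIRED_COLUMNS = [
    'hour_of_day',
    'amount (INR)',
    'sender_age_group',
    'receiver_age_group',
]

def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]

# Illustrative frame that lacks one of the source columns.
sample = pd.DataFrame(columns=['hour_of_day', 'amount (INR)', 'sender_age_group'])
print(missing_columns(sample))
```

On the real dataset, an empty list from `missing_columns(df)` would indicate that all source columns are available.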


## 1. Day Part — from `hour_of_day`

| Hour Range | Label |
|---|---|
| 6 – 11 | Morning |
| 12 – 17 | Afternoon |
| 18 – 21 | Evening |
| 22 – 5 | Night |

In [3]:
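def get_day_part(hour):
    """Map an hour of day (0-23) to a day-part label."""
    if 6 <= hour <= 11:
        return 'Morning'
    elif 12 <= hour <= 17:
        return 'Afternoon'
    elif 18 <= hour <= 21:
        return 'Evening'
    else:  # 22-23 and 0-5; any other value (e.g. NaN) also lands here
        return 'Night'

# Apply the mapping row by row to build the new column
df['day_part'] = df['hour_of_day'].apply(get_day_part)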

print("Day Part distribution:")
print(df['day_part'].value_counts())

Day Part distribution:
day_part
Afternoon    88982
Evening      76055
Morning      58162
Night        26801
Name: count, dtype: int64
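
The row-by-row `apply` above works, but the same hour ranges can also be expressed in vectorized form. The following is a minimal sketch using `numpy.select`, not the method used in this notebook; `day_part_vectorized` is a hypothetical helper name, and the sample hours are made up to exercise each boundary.

```python
import numpy as np
import pandas as pd

def day_part_vectorized(hours: pd.Series) -> pd.Series:
    """Label each hour with a day part using the same ranges as the table above."""
    conditions = [
        hours.between(6, 11),    # Morning (bounds inclusive)
        hours.between(12, 17),   # Afternoon
        hours.between(18, 21),   # Evening
    ]
    labels = ['Morning', 'Afternoon', 'Evening']
    # Everything not matched above (22-23 and 0-5) falls back to 'Night'.
    values = np.select(conditions, labels, default='Night')
    return pd.Series(values, index=hours.index, name='day_part')

# Boundary hours for each range.
sample_hours = pd.Series([0, 5, 6, 11, 12, 17, 18, 21, 22, 23])
print(day_part_vectorized(sample_hours).tolist())
```

As with the original function, values outside 0-23 would also be labelled `Night`, so this sketch keeps that behaviour rather than correcting it.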


## 2. Amount Tier — from `amount (INR)`

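| Amount Range (INR) | Label |
|---|---|
| < 500 | Small |
| 500 – 5,000 | Medium |
| > 5,000 – 50,000 | Large |
| > 50,000 | Very Large |

Upper bounds are inclusive: an amount of exactly 5,000 is Medium and exactly 50,000 is Large.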

In [4]:
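def get_amount_tier(amount):
    """Map a transaction amount in INR to a size tier (upper bounds inclusive)."""
    if amount < 500:
        return 'Small'
    elif amount <= 5000:    # 500 to 5,000
        return 'Medium'
    elif amount <= 50000:   # above 5,000, up to 50,000
        return 'Large'
    else:  # above 50,000; a missing amount (NaN) would also land here
        return 'Very Large'

# Apply the mapping row by row to build the new column
df['amount_tier'] = df['amount (INR)'].apply(get_amount_tier)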

print("Amount Tier distribution:")
print(df['amount_tier'].value_counts())

Amount Tier distribution:
amount_tier
Medium    132582
Small     106462
Large      10956
Name: count, dtype: int64
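
The output above lists only three tiers, and their counts sum to 250,000, so no transaction in this dataset exceeds 50,000 INR. Plain string columns simply omit empty tiers from `value_counts()`. One option, sketched below under the assumption that an explicit tier order is wanted, is to store the column as an ordered categorical so that empty tiers appear with a count of zero; `TIER_ORDER` and `to_ordered_tier` are hypothetical names, and the sample values are illustrative.

```python
import pandas as pd

# Tier labels from smallest to largest, as defined in the table above.
TIER_ORDER = ['Small', 'Medium', 'Large', 'Very Large']

def to_ordered_tier(tiers: pd.Series) -> pd.Series:
    """Convert tier labels to an ordered categorical with all four tiers."""
    dtype = pd.CategoricalDtype(categories=TIER_ORDER, ordered=True)
    return tiers.astype(dtype)

# Illustrative labels with no 'Very Large' entries.
sample = pd.Series(['Medium', 'Small', 'Large', 'Medium'])
counts = to_ordered_tier(sample).value_counts().reindex(TIER_ORDER)
print(counts)
```

With this dtype, the unused `Very Large` tier is reported with a zero count and the tiers print in size order rather than by frequency.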


## 3. Age Group Labels — Sender & Receiver

The original age groups (`18-25`, `26-35`, `36-45`, `46-55`, `56+`) are mapped to:

| Original | New Label |
|---|---|
| 18-25 | Young |
| 26-35, 36-45, 46-55 | Adult |
| 56+ | Old |

In [5]:
age_label_map = {
    '18-25': 'Young',
    '26-35': 'Adult',
    '36-45': 'Adult',
    '46-55': 'Adult',
    '56+':   'Old'
}

df['sender_age_label']   = df['sender_age_group'].map(age_label_map)
df['receiver_age_label'] = df['receiver_age_group'].map(age_label_map)

print("Sender Age Label distribution:")
print(df['sender_age_label'].value_counts())
print("\nReceiver Age Label distribution:")
print(df['receiver_age_label'].value_counts())

Sender Age Label distribution:
sender_age_label
Adult    175146
Young     62345
Old       12509
Name: count, dtype: int64

Receiver Age Label distribution:
receiver_age_label
Adult    174838
Young     62611
Old       12551
Name: count, dtype: int64
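
`Series.map` with a dictionary returns `NaN` for any value that has no key, and `value_counts()` drops `NaN` by default, so an unmapped age group would go unnoticed in the counts above unless the totals are checked. Here both totals sum to 250,000, so every row was mapped. For other inputs, a small check along these lines could flag gaps; `unmapped_values` is a hypothetical helper and `'65+'` is an invented value used only to trigger it.

```python
import pandas as pd

age_label_map = {
    '18-25': 'Young',
    '26-35': 'Adult',
    '36-45': 'Adult',
    '46-55': 'Adult',
    '56+':   'Old',
}

def unmapped_values(groups: pd.Series, mapping: dict) -> list:
    """Return the distinct non-null values in `groups` that have no key in `mapping`."""
    present = groups.dropna().unique()
    return sorted(value for value in present if value not in mapping)

# '65+' is a made-up age group that the mapping does not cover.
sample = pd.Series(['18-25', '56+', '65+', '26-35'])
print(unmapped_values(sample, age_label_map))
```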


## Quick Sanity Check

In [6]:
# Show a sample of all new columns alongside the source columns
cols = [
    'hour_of_day', 'day_part',
    'amount (INR)', 'amount_tier',
    'sender_age_group', 'sender_age_label',
    'receiver_age_group', 'receiver_age_label'
]
df[cols].sample(10, random_state=42)

| | hour_of_day | day_part | amount (INR) | amount_tier | sender_age_group | sender_age_label | receiver_age_group | receiver_age_label |
|---|---|---|---|---|---|---|---|---|
| 38683 | 15 | Afternoon | 988 | Medium | 36-45 | Adult | 26-35 | Adult |
| 64939 | 11 | Morning | 57 | Small | 18-25 | Young | 26-35 | Adult |
| 3954 | 5 | Night | 5872 | Large | 18-25 | Young | 26-35 | Adult |
| 120374 | 16 | Afternoon | 134 | Small | 26-35 | Adult | 36-45 | Adult |
| 172861 | 12 | Afternoon | 325 | Small | 18-25 | Young | 26-35 | Adult |
| 149303 | 15 | Afternoon | 2613 | Medium | 26-35 | Adult | 26-35 | Adult |
| 111626 | 16 | Afternoon | 2128 | Medium | 18-25 | Young | 26-35 | Adult |
| 164553 | 21 | Evening | 162 | Small | 36-45 | Adult | 26-35 | Adult |
| 55779 | 9 | Morning | 830 | Medium | 26-35 | Adult | 36-45 | Adult |
| 141990 | 13 | Afternoon | 974 | Medium | 26-35 | Adult | 26-35 | Adult |
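
A ten-row sample is a visual spot check only. A complementary check, sketched here as one possible approach, is to confirm that every distinct source value maps to exactly one derived label across the whole frame. `labels_per_source_value` is a hypothetical helper, and the small frame below is illustrative data rather than rows from the dataset.

```python
import pandas as pd

def labels_per_source_value(frame: pd.DataFrame, source: str, derived: str) -> pd.Series:
    """Count how many distinct derived labels each source value maps to."""
    return frame.groupby(source)[derived].nunique()

# Illustrative rows: each hour carries a single, consistent day part.
sample = pd.DataFrame({
    'hour_of_day': [5, 5, 11, 15, 21],
    'day_part': ['Night', 'Night', 'Morning', 'Afternoon', 'Evening'],
})
per_hour = labels_per_source_value(sample, 'hour_of_day', 'day_part')

# A deterministic mapping should give exactly one label per source value.
assert (per_hour == 1).all()
print(per_hour)
```

The same helper could be pointed at `sender_age_group` / `sender_age_label` or the receiver pair. It is less informative for `amount (INR)`, where most values are distinct.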


In [7]:
# Save the enriched dataset
df.to_csv('upi_transactions_2024_enriched.csv', index=False)
print(f"Saved enriched dataset with {df.shape[1]} columns and {df.shape[0]} rows.")

Saved enriched dataset with 21 columns and 250000 rows.
