# Day 2: Data Sources & Initial Assessment

## Objectives
- Understand NovaPay’s transaction dataset and schema
- Assess fraud label distribution and class imbalance
- Identify missing data, inconsistencies, and duplicates
- Highlight potential bias and fairness risks relevant to fraud modelling

In [8]:
# Core data manipulation libraries
import pandas as pd
import numpy as np

# Path to the raw dataset
# Keeping raw data untouched is best practice
DATA_PATH = "../data/raw/nova_pay_combined.csv"

# Load the dataset into a pandas DataFrame
df = pd.read_csv(DATA_PATH)

# Display dataset dimensions (rows, columns)
df.shape

(11400, 26)

In [9]:
# Preview the first few rows to understand the structure
df.head()

,transaction_id,customer_id,timestamp,home_country,source_currency,dest_currency,channel,amount_src,amount_usd,fee,...,ip_risk_score,kyc_tier,account_age_days,device_trust_score,chargeback_history_count,risk_score_internal,txn_velocity_1h,txn_velocity_24h,corridor_risk,is_fraud
0,fee8542d-8ee6-4b0d-9671-c294dd08ed26,402cccc9-28de-45b3-9af7-cc5302aa1f93,2022-10-03 18:40:59.468549+00:00,US,USD,CAD,ATM,278.19,278.19,4.25,...,0.123,standard,263,0.522,0,0.223,0,0,0.0,0
1,bfdb9fc1-27fe-4a85-b043-4d813d679259,67c2c6b3-ef0a-4777-a3f1-c84a851bb6ad,2022-10-03 20:39:38.468549+00:00,CA,CAD,MXN,web,208.51,154.29,4.24,...,0.569,standard,947,0.475,0,0.268,0,1,0.0,0
2,fc855034-3ea5-4993-9afa-b511d93fe5e8,6d0d9b27-fa26-45f8-93b1-2df29d182d9c,2022-10-03 23:02:43.468549+00:00,US,USD,CNY,mobile,160.33,160.33,2.7,...,0.437,enhanced,367,0.939,0,0.176,0,0,0.0,0
3,2cf8c08e-42ec-444d-a755-34b9a2a0a4ca,7bd5200c-5d19-44f0-9afe-8b339a05366b,2022-10-04 01:08:53.468549+00:00,US,USD,EUR,mobile,59.41,59.41,2.22,...,0.594,standard,147,0.551,0,0.391,0,0,0.0,0
4,d907a74d-b426-438d-97eb-dbe911aca91c,70a93d26-8e3a-4179-900c-a4a7a74d08e5,2022-10-04 09:35:03.468549+00:00,US,USD,INR,mobile,200.96,200.96,3.61,...,0.121,enhanced,257,0.894,0,0.257,0,0,0.0,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11400 entries, 0 to 11399
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   transaction_id             11400 non-null  object 
 1   customer_id                11400 non-null  object 
 2   timestamp                  11371 non-null  object 
 3   home_country               11400 non-null  object 
 4   source_currency            11400 non-null  object 
 5   dest_currency              11400 non-null  object 
 6   channel                    11400 non-null  object 
 7   amount_src                 11400 non-null  object 
 8   amount_usd                 11095 non-null  float64
 9   fee                        11105 non-null  float64
 10  exchange_rate_src_to_dest  11400 non-null  float64
 11  device_id                  11400 non-null  object 
 12  new_device                 11400 non-null  bool   
 13  ip_address                 11095 non-null  obj

In [10]:
# Summary statistics for both numerical and categorical features
# Transposed so that each row describes one column
df.describe(include="all").T

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
transaction_id,11400.0,11200.0,9cfbbce9-979c-4d34-bf5a-98531362bd9a,2.0,,,,,,,
customer_id,11400.0,1315.0,402cccc9-28de-45b3-9af7-cc5302aa1f93,1510.0,,,,,,,
timestamp,11371.0,11141.0,0000-00-00T00:00:00Z,21.0,,,,,,,
home_country,11400.0,7.0,US,7940.0,,,,,,,
source_currency,11400.0,3.0,USD,8031.0,,,,,,,
dest_currency,11400.0,9.0,NGN,1474.0,,,,,,,
channel,11400.0,12.0,mobile,6366.0,,,,,,,
amount_src,11400.0,9856.0,100.0,15.0,,,,,,,
amount_usd,11095.0,,,,452.022083,1403.973062,7.23,92.465,163.48,302.19,12498.57
fee,11105.0,,,,100.309441,958.128504,-1.0,2.38,3.5,5.55,9999.99
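
A few values in this summary look like placeholders rather than real measurements: the most frequent `timestamp` is `0000-00-00T00:00:00Z` (21 rows), and `fee` ranges from -1.0 to 9999.99 with a mean (100.31) far above its 75th percentile (5.55). `channel` also has 12 distinct labels, which may point to spelling or casing variants (not verified here). The sketch below counts such values on a small made-up frame. Treating these specific values as sentinels is an assumption to confirm against the data dictionary, and the `sample` frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df; the rows are made up for illustration
sample = pd.DataFrame({
    "timestamp": ["2022-10-03 18:40:59+00:00", "0000-00-00T00:00:00Z", None],
    "fee": [4.25, -1.0, 9999.99],
    "channel": ["mobile", "Mobile ", "WEB"],
})

# Assumption: the all-zero timestamp string is a placeholder, not a real date
placeholder_ts = (sample["timestamp"] == "0000-00-00T00:00:00Z").sum()

# Assumption: -1.0 and 9999.99 are sentinel fees rather than charged amounts
sentinel_fee = sample["fee"].isin([-1.0, 9999.99]).sum()

# Normalise whitespace and casing before counting distinct channel labels
channel_norm = sample["channel"].str.strip().str.lower()

print("placeholder timestamps:", placeholder_ts)
print("sentinel fees:", sentinel_fee)
print("distinct channels after normalisation:", channel_norm.nunique())
```

The same expressions could be run against `df` once the candidate sentinel values have been confirmed.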


In [11]:
# Count of fraudulent vs non-fraudulent transactions
df["is_fraud"].value_counts()

is_fraud
0    10403
1      997
Name: count, dtype: int64

In [12]:
# Percentage distribution of fraud labels
# Useful for understanding class imbalance severity
df["is_fraud"].value_counts(normalize=True) * 100

is_fraud
0    91.254386
1     8.745614
Name: proportion, dtype: float64

The fraud label is imbalanced: 997 of the 11,400 rows (8.75%) are labelled fraudulent and 10,403
(91.25%) are not. Accuracy alone would therefore be misleading, so evaluation should lean on
imbalance-aware metrics (for example, precision and recall on the fraud class), and imbalance-aware
modelling techniques will be required.
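
To make the point about accuracy concrete, the sketch below scores a hypothetical baseline that labels every transaction as legitimate. The label counts are copied from the `value_counts()` output above. The baseline itself is an illustration, not part of the NovaPay pipeline.

```python
import numpy as np

# Label counts copied from the value_counts() output
n_legit, n_fraud = 10403, 997

# Ground-truth labels: 0 = legitimate, 1 = fraud
y_true = np.concatenate([np.zeros(n_legit, dtype=int), np.ones(n_fraud, dtype=int)])

# Hypothetical baseline: predict "not fraud" for every transaction
y_pred = np.zeros_like(y_true)

# Share of predictions that match the label
accuracy = (y_pred == y_true).mean()

# Recall on the fraud class: share of actual fraud cases that were flagged
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy     = {accuracy:.4f}")
print(f"fraud recall = {recall:.4f}")
```

The baseline reaches roughly 91% accuracy while flagging no fraud at all, which is why recall and precision on the fraud class are more informative here.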

In [13]:
# Count missing values per column
missing_counts = df.isna().sum().sort_values(ascending=False)

# Display only columns with missing data
missing_counts[missing_counts > 0]

ip_address            305
amount_usd            305
ip_country            301
kyc_tier              300
fee                   295
device_trust_score    295
timestamp              29
dtype: int64

In [14]:
# Convert missing counts to percentages for better interpretation
(missing_counts / len(df) * 100).round(2)

ip_address                   2.68
amount_usd                   2.68
ip_country                   2.64
kyc_tier                     2.63
fee                          2.59
device_trust_score           2.59
timestamp                    0.25
location_mismatch            0.00
corridor_risk                0.00
txn_velocity_24h             0.00
txn_velocity_1h              0.00
risk_score_internal          0.00
chargeback_history_count     0.00
account_age_days             0.00
ip_risk_score                0.00
transaction_id               0.00
customer_id                  0.00
new_device                   0.00
device_id                    0.00
exchange_rate_src_to_dest    0.00
amount_src                   0.00
channel                      0.00
dest_currency                0.00
source_currency              0.00
home_country                 0.00
is_fraud                     0.00
dtype: float64
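
Every affected column is missing in under 3% of rows, but before imputing or dropping it is worth checking whether missingness is related to the fraud label, since that would bear on the bias objective above. A minimal sketch on a made-up frame follows. The `sample` data is hypothetical and `kyc_tier` is used only as an example column.

```python
import pandas as pd

# Hypothetical stand-in for df; values are made up for illustration
sample = pd.DataFrame({
    "kyc_tier": ["standard", None, "enhanced", None, "standard", "standard"],
    "is_fraud": [0, 1, 0, 1, 0, 1],
})

# Boolean mask marking rows where the column is missing
missing = sample["kyc_tier"].isna()

# Fraud rate among rows with and without a value
rate_missing = sample.loc[missing, "is_fraud"].mean()
rate_present = sample.loc[~missing, "is_fraud"].mean()

print(f"fraud rate when kyc_tier is missing: {rate_missing:.2f}")
print(f"fraud rate when kyc_tier is present: {rate_present:.2f}")
```

A large gap between the two rates on the real data would suggest that missingness carries signal and should be kept as its own indicator feature rather than silently imputed.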

In [15]:
# Check for fully duplicated rows
df.duplicated().sum()

200

In [16]:
# Check for duplicate transaction IDs
# transaction_id should be unique per transaction
df.duplicated(subset=["transaction_id"]).sum()

200
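
Both checks return 200. Because any fully duplicated row also repeats its `transaction_id`, the matching counts suggest that every repeated ID is an exact copy of an earlier row, which is consistent with the 11,200 unique `transaction_id` values reported by `describe()`. Under that reading, dropping exact duplicates should lose no information. The sketch below shows the idea on a hypothetical toy frame (`sample` and its values are made up).

```python
import pandas as pd

# Hypothetical stand-in for df: transaction "b" appears twice as an exact copy
sample = pd.DataFrame({
    "transaction_id": ["a", "b", "b", "c"],
    "amount_usd": [10.0, 20.0, 20.0, 30.0],
})

# Same two checks as in the cells above
full_dupes = sample.duplicated().sum()
id_dupes = sample.duplicated(subset=["transaction_id"]).sum()

# Keep the first occurrence of each exact duplicate and renumber the index
deduped = sample.drop_duplicates(keep="first").reset_index(drop=True)

# After removing exact copies, transaction_id should be unique
ids_unique = deduped["transaction_id"].is_unique

print(full_dupes, id_dupes, len(deduped), ids_unique)
```

Applied to `df`, the same call would be expected to leave 11,200 rows. If `transaction_id` were still not unique afterwards, the remaining repeats would be conflicting records that need manual review instead of automatic removal.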

In [18]:
# Check data types of amount columns
df[["amount_src", "amount_usd"]].dtypes

amount_src     object
amount_usd    float64
dtype: object
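
`amount_src` is stored as `object` even though it holds monetary amounts, which usually means at least some entries are not plain numbers. One way to find them is to coerce the column with `pd.to_numeric(errors="coerce")` and inspect the entries that turn into `NaN`. The malformed values in this sketch are hypothetical, since the actual offending strings are not shown in this notebook.

```python
import pandas as pd

# Hypothetical examples of what an object-typed amount column might contain
sample = pd.Series(["278.19", "1,250.00", "n/a", "59.41"], name="amount_src")

# Coerce to float; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(sample, errors="coerce")

# Original strings that failed to parse (excluding values that were already missing)
bad_values = sample[numeric.isna() & sample.notna()]

print(bad_values.tolist())
```

Reviewing `bad_values` on the real column first helps decide whether the entries can be cleaned (for example by stripping separators or currency symbols) or should be treated as missing.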