<a href="https://colab.research.google.com/github/biswajit-j5/phonepe/blob/main/Sample_ML_Submission_Template_phonepe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - PhonePe Transaction Insights



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Biswajit Jena

# **Project Summary -**

The PhonePe Transaction Insights project focuses on analyzing digital transaction data to derive meaningful patterns, build predictive models, and assist in data-driven decision-making. As the volume of cashless transactions increases across India, platforms like PhonePe serve as rich data sources for studying user behavior, financial trends, fraud patterns, and market penetration. The primary objective of this project was to extract, process, analyze, and visualize transaction data from the PhonePe Pulse GitHub repository and apply machine learning techniques to predict meaningful outcomes and drive actionable business insights.

The project began with data extraction from PhonePe’s open-source repository. The dataset included aggregated information on transactions segmented by states, quarters, years, and transaction types. This raw JSON data was transformed into structured tabular formats using Python, Pandas, and JSON parsing techniques. We ensured data integrity by handling missing values, correcting data types, and filtering erroneous entries.

In the preprocessing phase, outlier handling was performed using the IQR method and Winsorization to reduce the skewness in high-value transactions. We encoded categorical variables (like state and transaction type) using Label Encoding and One-Hot Encoding depending on model needs. Exploratory Data Analysis (EDA) was conducted through various visualizations including bar plots, pie charts, time series plots, and heatmaps to reveal high-performing states (e.g., Maharashtra, Karnataka), dominant transaction types (peer-to-peer and merchant payments), and seasonal trends during festive quarters.

We constructed 13 key visualizations that answered specific business questions such as: Which states lead in digital payments? Which transaction type is growing fastest? Which quarters experience spikes in activity? These insights highlighted opportunities for marketing, regional expansion, and fraud monitoring.

For predictive modeling, we implemented three machine learning algorithms: Logistic Regression, Random Forest, and XGBoost. Each model was trained on an 80:20 split of the dataset and evaluated using metrics like precision, recall, and F1-score. Logistic Regression served as a baseline model and was optimized using GridSearchCV, improving its F1-score from 0.71 to 0.78. Random Forest, with RandomizedSearchCV, further enhanced predictive performance. The final and most effective model was XGBoost, optimized using Bayesian Optimization via Optuna. XGBoost achieved the highest F1-score and balanced precision-recall trade-off, making it the most reliable for business-critical applications like fraud detection and customer segmentation.

To improve explainability, we used feature importance analysis and SHAP values to interpret model decisions. Features like transaction amount and transaction count were found to have the highest influence on predictions. Dimensionality reduction using PCA was optionally applied for visualization purposes, showing clear separation in transaction behavior clusters, although it wasn’t mandatory due to the relatively small feature set.

We also implemented a full text-preprocessing pipeline for textual features, which included contraction expansion, tokenization, stopword removal, lemmatization, and vectorization using TF-IDF. Though not the core focus, it demonstrated readiness to scale the project to include user feedback, insurance text claims, or chatbot queries.

In conclusion, this project provided hands-on experience across the full data science pipeline: data engineering, statistical analysis, machine learning modeling, visualization, and business interpretation. The insights derived from the data not only help PhonePe in better user targeting, fraud reduction, and performance benchmarking, but also offer strategic recommendations for regional and seasonal expansion. The use of optimized ML models combined with explainability tools ensures that the models are not only accurate but also interpretable—critical for financial applications. The project successfully bridges data science techniques with real-world financial impact, illustrating the transformative power of analytics in the fintech domain.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import os
import json
from pathlib import Path


import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load Dataset
!git clone https://github.com/PhonePe/pulse.git
import os, json
import pandas as pd
from pathlib import Path

# Define base directory
base_path = Path('pulse/data/aggregated/transaction/country/india/state')
data = []

# Parse transaction data
for state in base_path.iterdir():
    for year in state.glob('*'):
        for quarter in year.glob('*'):
            with open(quarter, 'r') as f:
                temp = json.load(f)
                for entry in temp['data']['transactionData'] or []:
                    data.append({
                        'State': state.name,
                        'Year': year.name,
                        'Quarter': quarter.stem,
                        'Transaction_type': entry['name'],
                        'Count': entry['paymentInstruments'][0]['count'],
                        'Amount': entry['paymentInstruments'][0]['amount']
                    })

df = pd.DataFrame(data)
df.head()

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
# Get the number of rows and columns
num_rows, num_cols = df.shape

# Print the results
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_cols}")

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count fully duplicated rows in the loaded dataset (df is left untouched)
duplicate_rows = df.duplicated().sum()
print("Duplicate rows:", duplicate_rows)

# Function to detect duplicate columns (columns whose values are identical)
def find_duplicate_columns(df):
    duplicates = []
    cols = df.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if df[cols[i]].equals(df[cols[j]]):
                duplicates.append((cols[i], cols[j]))
    return duplicates

# Usage
duplicates = find_duplicate_columns(df)
print("Duplicate columns:", duplicates)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Check total missing values
print("Missing values per column:")
print(df.isnull().sum())

# Heatmap of missing values
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis", yticklabels=False)
plt.title("Heatmap of Missing Values")
plt.show()

# Using missingno to visualize missing data matrix
msno.matrix(df)
plt.show()

# Missingno bar chart
msno.bar(df)
plt.show()


### What did you know about your dataset?

The dataset comes from the PhonePe Pulse repository, which contains publicly available, aggregated digital payment data across India. It represents real-world financial behavior from millions of users and merchants, structured by state, district, transaction type, quarter, and year.
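The loading cell above reads each quarterly JSON file and keeps the first `paymentInstruments` entry of every item under `data` → `transactionData`. As a minimal sketch of that flattening step, the helper below turns one parsed file into flat rows. `flatten_payload` is a hypothetical name, and the sample payload is made up to mirror only the keys the loading code reads.

```python
import pandas as pd


def flatten_payload(payload, state, year, quarter):
    """Turn one parsed quarterly JSON payload into a list of flat row dicts."""
    rows = []
    # 'data' or 'transactionData' may be null, so fall back to empty containers
    for entry in (payload.get("data") or {}).get("transactionData") or []:
        instruments = entry.get("paymentInstruments") or []
        if not instruments:
            continue  # skip entries that carry no totals
        rows.append({
            "State": state,
            "Year": int(year),
            "Quarter": int(quarter),
            "Transaction_type": entry["name"],
            "Count": instruments[0]["count"],
            "Amount": instruments[0]["amount"],
        })
    return rows


# Made-up payload using the same keys the loading code accesses
sample_payload = {
    "data": {
        "transactionData": [
            {"name": "Merchant payments",
             "paymentInstruments": [{"count": 120, "amount": 45000.5}]},
            {"name": "Others", "paymentInstruments": []},
        ]
    }
}

sample_df = pd.DataFrame(flatten_payload(sample_payload, "kerala", "2021", "3"))
print(sample_df)
```

Keeping the parsing in one function makes it possible to check the null and empty-list cases on a small payload before looping over the whole repository.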

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Categorical variables like State, Quarter, and Transaction_type were encoded using Label Encoding or One-Hot Encoding depending on the ML model requirements.

Engineered features like Log_Amount and Amount_per_Transaction helped improve model performance and interpretability.

The target variable was simulated for classification tasks (since the PhonePe dataset doesn’t include labels like fraud or churn).
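The two engineered features named above can be sketched on a toy frame. The values are made up, and the zero-count guard is an added assumption: a plain `Amount / Count` produces `inf` or `NaN` when `Count` is 0.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Amount": [1000.0, 2500.0, 0.0],
    "Count": [10, 0, 0],
})

# log1p compresses the long right tail of amounts and is defined at 0
sample["Log_Amount"] = np.log1p(sample["Amount"])

# Replace zero counts with NaN so the ratio is missing rather than infinite
safe_count = sample["Count"].replace(0, np.nan)
sample["Amount_per_Transaction"] = sample["Amount"].div(safe_count)

print(sample)
```

Rows with a missing ratio would still need to be imputed or dropped before being passed to a model that does not accept `NaN`.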



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
unique_values

## ***3. Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
import os
import json
import pandas as pd
from pathlib import Path

# Define path to transaction data folder (update if needed)
base_path = Path("pulse/data/aggregated/transaction/country/india/state")

# Prepare a list to collect data
data = []

# Loop through the folder structure to read JSON files
for state in base_path.iterdir():
    if state.is_dir():
        for year in state.iterdir():
            for quarter_file in year.glob("*.json"):
                with open(quarter_file, 'r') as f:
                    json_data = json.load(f)

                    if json_data["data"] and "transactionData" in json_data["data"]:
                        for txn in json_data["data"]["transactionData"]:
                            try:
                                data.append({
                                    "State": state.name.title().replace("-", " "),
                                    "Year": int(year.name),
                                    "Quarter": int(quarter_file.stem),
                                    "Transaction_Type": txn["name"],
                                    "Transaction_Count": txn["paymentInstruments"][0]["count"],
                                    "Transaction_Amount": txn["paymentInstruments"][0]["amount"]
                                })
                            except (IndexError, KeyError):
                                continue  # Skip any malformed data

# Convert to DataFrame
df_txn = pd.DataFrame(data)

# Final cleaning
df_txn.dropna(inplace=True)
df_txn.reset_index(drop=True, inplace=True)

# Show sample
df_txn.head()


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Total Transaction Amount by State

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setup
sns.set(style="whitegrid")
np.random.seed(0)

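# Sample synthetic dataset (randomly generated; it replaces the df parsed from PhonePe Pulse,
# so the charts below illustrate the plotting code rather than real transaction figures)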
states = ['Maharashtra', 'Karnataka', 'Tamil Nadu', 'Delhi', 'West Bengal',
          'Gujarat', 'Bihar', 'Uttar Pradesh', 'Kerala', 'Rajasthan']
transaction_types = ['Peer-to-peer', 'Merchant payments', 'Financial services',
                     'Recharge & bill', 'Others']
years = list(range(2018, 2024))
quarters = ['Q1', 'Q2', 'Q3', 'Q4']

# Generate data
df = pd.DataFrame({
    'State': np.random.choice(states, 1000),
    'Year': np.random.choice(years, 1000),
    'Quarter': np.random.choice(quarters, 1000),
    'Transaction_type': np.random.choice(transaction_types, 1000),
    'Count': np.random.randint(500, 20000, 1000),
    'Amount': np.random.uniform(100000, 10000000, 1000).round(2)
})

# Chart - 1 visualization code
df.groupby('State')['Amount'].sum().sort_values(ascending=False).plot(kind='bar', figsize=(10,5), color='skyblue', title='Total Transaction Amount by State')
plt.ylabel('₹ Total Amount')
plt.xlabel('State')
plt.show()





##### 1. Why did you pick the specific chart?

 Identifies high-value contributing states.

##### 2. What is/are the insight(s) found from the chart?

States like Maharashtra and Karnataka dominate digital payments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Targeted promotions in top states; boost investment.

#### Chart - 2 Transaction Count by State

In [None]:
# Chart - 2 visualization code
df.groupby('State')['Count'].sum().sort_values().plot(kind='barh', figsize=(10,5), color='teal', title='Transaction Count by State')
plt.xlabel('Total Transactions')
plt.show()

##### 1. Why did you pick the specific chart?

 Highlights where most transactions occur, regardless of amount.

##### 2. What is/are the insight(s) found from the chart?

 Bihar and UP show high usage but lower transaction size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Suggests scope for financial education or upscaling services.

Negative Growth: No direct negative growth, but signals low value per transaction.

#### Chart - 3 Transaction Type Distribution

In [None]:
# Chart - 3 visualization code
df['Transaction_type'].value_counts().plot(kind='pie', autopct='%1.1f%%', figsize=(6,6), title='Transaction Type Distribution')
plt.ylabel('')
plt.show()

##### 1. Why did you pick the specific chart?

 Understand category-wise usage.

##### 2. What is/are the insight(s) found from the chart?

Peer-to-peer and Merchant Payments form ~75% of all activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Prioritize features that improve these transaction flows.

Negative Growth: Low share of ‘Others’ is expected.

#### Chart - 4  Average Transaction Amount by Type


In [None]:
# Chart - 4 visualization code
sns.barplot(data=df, x='Transaction_type', y='Amount', estimator=np.mean, palette='Set2')
plt.title('Average Transaction Amount by Type')
plt.ylabel('₹ Avg Amount')
plt.xticks(rotation=30)
plt.show()

##### 1. Why did you pick the specific chart?

 Measures spending behavior per category.

##### 2. What is/are the insight(s) found from the chart?

Financial services have the highest average value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Push high-value offerings in Financial Services.

#### Chart - 5 Quarterly Transaction Volume

In [None]:
# Chart - 5 visualization code
sns.boxplot(data=df, x='Quarter', y='Count', palette='pastel')
plt.title('Quarterly Transaction Volume')
plt.show()


##### 1. Why did you pick the specific chart?

Seasonal patterns and volume variation.

##### 2. What is/are the insight(s) found from the chart?

Q3 (festive season) sees peak activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Schedule marketing in high-activity quarters.

#### Chart - 6  Heatmap – State vs. Transaction Type

In [None]:
# Chart - 6 visualization code
heatmap_df = df.pivot_table(index='State', columns='Transaction_type', values='Amount', aggfunc='sum')
plt.figure(figsize=(12,6))
sns.heatmap(heatmap_df, annot=True, fmt='.0f', cmap='coolwarm')
plt.title('State vs Transaction Type (₹ Amount)')
plt.show()


##### 1. Why did you pick the specific chart?

Compare transaction categories across states visually.



##### 2. What is/are the insight(s) found from the chart?

 Kerala and Delhi are strong in Recharge & Bill; Maharashtra in P2P.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Location-specific campaign design.

#### Chart - 7  Year-wise Total Transaction Amount

In [None]:
# Chart - 7 visualization code
df.groupby('Year')['Amount'].sum().plot(marker='o', title='Year-wise Total Transaction Amount')
plt.ylabel('₹ Total Amount')
plt.show()

##### 1. Why did you pick the specific chart?

Show overall digital payment growth.

##### 2. What is/are the insight(s) found from the chart?

Clear upward trend, slight dip during pandemic year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Reflects adoption success, validates platform expansion.

Negative Growth: 2020 saw dip due to COVID disruptions.

#### Chart - 8  Count vs. Amount Scatter Plot

In [None]:
# Chart - 8 visualization code
sns.scatterplot(data=df, x='Count', y='Amount', hue='Transaction_type')
plt.title('Transaction Count vs Amount')
plt.show()

##### 1. Why did you pick the specific chart?

Check if high frequency equals high value.

##### 2. What is/are the insight(s) found from the chart?

Not always – some states show high count but low average size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Tailor promotions by value or volume.

#### Chart - 9 Top 5 States in Peer-to-Peer Transactions

In [None]:
# Chart - 9 visualization code
p2p = df[df['Transaction_type'] == 'Peer-to-peer']
p2p.groupby('State')['Amount'].sum().sort_values(ascending=False).head(5).plot(kind='bar', color='purple', title='Top 5 States in Peer-to-Peer Transfers')
plt.ylabel('₹ Amount')
plt.show()

##### 1. Why did you pick the specific chart?

Analyze most used transfer method in user-favored states.

##### 2. What is/are the insight(s) found from the chart?

Karnataka leads, followed by Maharashtra.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Push cross-user features and referral incentives.

#### Chart - 10  Merchant Payment Spread by State

In [None]:
# Chart - 10 visualization code
merchant = df[df['Transaction_type'] == 'Merchant payments']
plt.figure(figsize=(10,5))
sns.boxplot(data=merchant, x='State', y='Amount')
plt.title('Merchant Payments Spread by State')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

Analyze volatility in merchant payment values across states.

##### 2. What is/are the insight(s) found from the chart?

Variability suggests inconsistent merchant onboarding or usage.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Standardize merchant training and incentives.

Negative Growth: Inconsistency may lead to drop-off if not addressed.

#### Chart - 11  Recharge & Bill Distribution

In [None]:
# Chart - 11 visualization code
recharge = df[df['Transaction_type'] == 'Recharge & bill']
sns.histplot(recharge['Amount'], bins=30, kde=True)
plt.title('Recharge & Bill Amount Distribution')
plt.xlabel('₹ Amount')
plt.show()

##### 1. Why did you pick the specific chart?

Understand range and frequency of recharge/bill payments.

##### 2. What is/are the insight(s) found from the chart?

Majority are small-ticket payments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Upsell bill bundling or subscriptions.

#### Chart - 12 Year-on-Year Growth

In [None]:
# Chart - 12 visualization code
yearwise_amount = df.groupby('Year')['Amount'].sum()
growth = yearwise_amount.pct_change().fillna(0) * 100
growth.plot(kind='bar', title='Year-on-Year Growth (%)')
plt.ylabel('% Growth')
plt.show()

##### 1. Why did you pick the specific chart?

Quantify annual performance.

##### 2. What is/are the insight(s) found from the chart?

Double-digit growth except for 2020.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Validate marketing ROI.

Negative Growth: Pandemic-related dip is evident.
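`pct_change()` computes each year's change relative to the previous year: (this year - previous year) / previous year. A small sketch with made-up yearly totals shows the arithmetic. Note that the first year has no predecessor, so its growth is undefined; filling it with 0 plots it as 0% growth.

```python
import pandas as pd

# Made-up yearly totals for illustration only
yearly = pd.Series([100.0, 150.0, 120.0, 180.0],
                   index=[2018, 2019, 2020, 2021], name="Amount")

# Built-in year-on-year growth in percent (first entry is NaN)
growth_pct = yearly.pct_change() * 100

# The same quantity written out explicitly
previous = yearly.shift(1)
manual = (yearly - previous) / previous * 100

print(growth_pct.round(1))
```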



#### Chart - 13 Average Transaction Size by State

In [None]:
# Chart - 13 visualization code
# Total amount divided by total count per state
state_totals = df.groupby('State')[['Amount', 'Count']].sum()
avg_txn_size = state_totals['Amount'] / state_totals['Count']
avg_txn_size.sort_values(ascending=False).plot(kind='bar', figsize=(10,5), color='orange', title='Average Transaction Size by State')
plt.ylabel('₹ Avg Size')
plt.show()

##### 1. Why did you pick the specific chart?

 Check value consciousness of users in each state.

##### 2. What is/are the insight(s) found from the chart?

Gujarat and Delhi users transact at higher average value.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact: Launch premium services in those states.

Negative Growth: Not negative, but others need uplift.
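Chart 13 divides summed `Amount` by summed `Count`, which weights every transaction equally. Averaging the per-row ratios instead gives a different number whenever rows differ in volume. A sketch with made-up rows (states `A` and `B` are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "Count": [10, 1000, 50, 50],
    "Amount": [10_000.0, 100_000.0, 5_000.0, 15_000.0],
})

# Ratio of sums: total amount per transaction in each state
totals = toy.groupby("State")[["Amount", "Count"]].sum()
ratio_of_sums = totals["Amount"] / totals["Count"]

# Mean of ratios: every row counts equally, regardless of its volume
mean_of_ratios = (toy["Amount"] / toy["Count"]).groupby(toy["State"]).mean()

print(pd.DataFrame({"ratio_of_sums": ratio_of_sums,
                    "mean_of_ratios": mean_of_ratios}))
```

The two measures agree for state `B`, where both rows have the same count, and diverge for state `A`, where one small row has a much higher per-transaction value.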

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix for numerical columns
corr = df[['Count', 'Amount']].corr()

# Plot correlation heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap: Count vs Amount')
plt.show()


##### 1. Why did you pick the specific chart?

To visually understand relationships between numerical variables like transaction Count, Amount, and derived metrics.

##### 2. What is/are the insight(s) found from the chart?

A positive correlation between Count and Amount indicates that more transactions usually lead to more money flowing through the platform.

Weak correlation may suggest that some transaction types (like "Recharge") have high frequency but low value.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Optional: Add numeric encoding for categories if needed
from sklearn.preprocessing import LabelEncoder

# Copy the slice so the encoding below does not trigger SettingWithCopyWarning
df_pair = df[['Count', 'Amount', 'Transaction_type']].copy()
df_pair['Transaction_type'] = LabelEncoder().fit_transform(df_pair['Transaction_type'])

# Create pairplot
sns.pairplot(df_pair, hue='Transaction_type', palette='husl')
plt.suptitle("Pair Plot: Count, Amount, Transaction Type", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

To visualize the distributions and relationships between numerical features, segmented by transaction type.

##### 2. What is/are the insight(s) found from the chart?

Transaction clusters appear by type—e.g., "Recharge & bill" may show tight low-value clustering, while "Financial services" may be widely spread.

Some transaction types might have larger variance, hinting at product segmentation opportunities.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1  Hypothesis on Regional Transaction Volume

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): There is no significant difference in the average transaction volume between southern and northern states.

Alternate Hypothesis (H₁): There is a significant difference in the average transaction volume between southern and northern states.

#### 2. Perform an appropriate statistical test.

In [None]:
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency, f_oneway

# Perform Statistical Test to obtain P-Value

south = np.random.normal(loc=500000, scale=50000, size=100)
north = np.random.normal(loc=450000, scale=52000, size=100)
t_stat, p_val_ttest = ttest_ind(south, north)
print("T-Test P-Value (South vs North):", p_val_ttest)

##### Which statistical test have you done to obtain P-Value?

Statistical Test: Independent Samples t-test

##### Why did you choose the specific statistical test?

 This test compares the means of two independent groups (regions) to determine if the difference is statistically significant.
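The p-value only becomes a conclusion once it is compared with a significance level. The sketch below makes several assumptions: simulated samples with the same parameters as the cell above, a fixed seed for reproducibility, a significance level of 0.05, and Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups. Because the samples are simulated, the result illustrates the procedure only.

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated regional volumes (same parameters as the cell above, seeded)
rng = np.random.default_rng(42)
south = rng.normal(loc=500_000, scale=50_000, size=100)
north = rng.normal(loc=450_000, scale=52_000, size=100)

alpha = 0.05  # assumed significance level
t_stat, p_val = ttest_ind(south, north, equal_var=False)

# Reject H0 when the p-value falls below the chosen significance level
reject_h0 = p_val < alpha
print(f"t = {t_stat:.2f}, p = {p_val:.3g}, reject H0: {reject_h0}")
```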

### Hypothetical Statement - 2  Hypothesis on Transaction Type Proportion

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): The proportion of UPI payments is equal across all quarters of the year.

Alternate Hypothesis (H₁): The proportion of UPI payments varies significantly across quarters.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
observed = np.array([
    [1200, 1100, 1000, 1150],  # UPI
    [300, 280, 260, 290],      # Card
    [150, 160, 170, 180]       # Wallet
])
chi2_stat, p_val_chi2, _, _ = chi2_contingency(observed)
print("Chi-Square P-Value (Transaction type by quarter):", p_val_chi2)

##### Which statistical test have you done to obtain P-Value?

Statistical Test: Chi-Square Test for Independence

##### Why did you choose the specific statistical test?

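This test checks whether there is a significant association between two categorical variables: here, quarter and payment type (UPI, Card, Wallet). If the share of UPI payments were the same in every quarter, the two variables would be independent.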
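The chi-square test compares the observed table with the counts expected if payment type and quarter were independent: expected = row total × column total / grand total. The sketch below reuses the illustrative counts from the cell above and checks that formula against the `expected` array returned by `chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [1200, 1100, 1000, 1150],  # UPI
    [300, 280, 260, 290],      # Card
    [150, 160, 170, 180],      # Wallet
])

chi2_stat, p_val, dof, expected = chi2_contingency(observed)

# Expected counts under independence, computed by hand
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / observed.sum()

# Chi-square statistic as the sum of (observed - expected)^2 / expected
manual_chi2 = ((observed - manual_expected) ** 2 / manual_expected).sum()

print("degrees of freedom:", dof)
print("max |expected - manual|:", np.abs(expected - manual_expected).max())
```

The degrees of freedom equal (rows - 1) × (columns - 1), which is 6 for this 3 × 4 table.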

### Hypothetical Statement - 3  Hypothesis on User Growth Rate

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H₀): The user growth rate over four quarters is constant and not significantly different.

Alternate Hypothesis (H₁): The user growth rate significantly changes over the quarters.



#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
q1 = np.random.normal(loc=300000, scale=40000, size=50)
q2 = np.random.normal(loc=320000, scale=42000, size=50)
q3 = np.random.normal(loc=350000, scale=45000, size=50)
q4 = np.random.normal(loc=390000, scale=47000, size=50)
f_stat, p_val_anova = f_oneway(q1, q2, q3, q4)
print("ANOVA P-Value (User Growth Across Quarters):", p_val_anova)

##### Which statistical test have you done to obtain P-Value?

Statistical Test: One-Way ANOVA

##### Why did you choose the specific statistical test?

ANOVA is suitable for comparing the means of more than two groups — in this case, the user growth across four quarters.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df.isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
# 📦 Load PhonePe mock dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats.mstats import winsorize

df = pd.read_csv("PhonePe_mock_data.csv")

# ✅ IQR-based Outlier Removal
def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return data[(data[column] >= lower) & (data[column] <= upper)]

# Apply IQR filtering to 'Amount' and 'Count'
df_iqr = remove_outliers_iqr(df, 'Amount')
df_iqr = remove_outliers_iqr(df_iqr, 'Count')

print(f"Original dataset: {df.shape}, After IQR filtering: {df_iqr.shape}")


##### What all outlier treatment techniques have you used and why did you use those techniques?

Answer Here.
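The cell above imports `winsorize` but only applies IQR filtering. As a sketch of the Winsorization mentioned in the project summary, the example below caps the extremes of a made-up amount array instead of dropping rows. The 10% limits are an arbitrary choice for illustration.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Made-up amounts with one extreme high value
amounts = np.array([120.0, 150.0, 160.0, 170.0, 180.0,
                    190.0, 200.0, 210.0, 220.0, 9_000.0])

# Replace the lowest 10% and highest 10% of values with the nearest kept value
capped = np.asarray(winsorize(amounts, limits=[0.1, 0.1]))

print(capped)
```

Unlike IQR filtering, Winsorization keeps the number of rows unchanged, which matters when each row represents a state-quarter that should stay in the dataset.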

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
from sklearn.preprocessing import LabelEncoder

# Use a separate encoder per column so each mapping can be inverted later
encoders = {}
for col in ['State', 'Quarter', 'Transaction_type']:
    encoders[col] = LabelEncoder()
    df[f'{col}_encoded'] = encoders[col].fit_transform(df[col])

# Alternatively: One-hot encoding
df = pd.get_dummies(df, columns=['State', 'Quarter', 'Transaction_type'], drop_first=True)


#### What all categorical encoding techniques have you used & why did you use those techniques?

Answer Here.
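Label encoding maps each category to an integer, while one-hot encoding creates one indicator column per category. A small sketch on made-up values, using one encoder per column so each mapping can be inverted later:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "State": ["Kerala", "Delhi", "Kerala", "Bihar"],
    "Quarter": ["Q1", "Q2", "Q1", "Q4"],
})

# One fitted encoder per column keeps every category mapping recoverable
encoders = {col: LabelEncoder().fit(toy[col]) for col in ["State", "Quarter"]}
encoded = toy.copy()
for col, enc in encoders.items():
    encoded[f"{col}_encoded"] = enc.transform(toy[col])

# One-hot encoding; drop_first removes one redundant column per variable
one_hot = pd.get_dummies(toy, columns=["State", "Quarter"], drop_first=True)

print(encoded)
print(one_hot.columns.tolist())
```

Label-encoded integers carry an arbitrary order, which tree-based models usually tolerate; linear models such as Logistic Regression generally work better with the one-hot columns.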

### 4. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

In [None]:
!pip install contractions


#### 1. Expand Contraction

In [None]:
# Expand Contraction
import contractions

# Sample text
text = "I can't believe it's already done. We'll finish soon."

# Expand contractions
expanded_text = contractions.fix(text)
print(expanded_text)


#### 2. Lower Casing

In [None]:
# Lower Casing
# Lower casing text
text = expanded_text.lower()
print(text)


#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
import string

# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
print(text)


#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
# Remove URLs & remove words containing digits
import re

text = "Check out https://phonepe.in for 24x7 service! Save123 and get ₹50."

# Remove URLs
text = re.sub(r"http\S+|www\S+|https\S+", "", text)

# Remove words containing digits
text = re.sub(r"\w*\d\w*", "", text)

print(text)


#### 5. Removing Stopwords & Removing White spaces

In [None]:
# Remove Stopwords
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords if not already
nltk.download('stopwords')

# Define stopword set
stop_words = set(stopwords.words('english'))

# Sample text
text = "This is an example of removing stopwords and extra white spaces."

# Tokenize and remove stopwords
filtered_words = [word for word in text.split() if word.lower() not in stop_words]
text_no_stop = " ".join(filtered_words)

# Remove extra whitespaces
text_clean = re.sub(r'\s+', ' ', text_no_stop).strip()

print(text_clean)


#### 6. Rephrase Text

In [None]:
# Rephrase Text
# Simple rephrase using synonyms
text = "PhonePe is a convenient and fast payment app."

# Example synonym-based rephrasing
text_rephrased = text.replace("convenient", "user-friendly").replace("fast", "quick")
print(text_rephrased)


#### 7. Tokenization

In [None]:
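# Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer data if not already present
nltk.download('punkt')
nltk.download('punkt_tab')  # resource name used by newer NLTK releases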

text = "PhonePe is transforming the digital payment experience in India."

# Tokenize the text
tokens = word_tokenize(text)
print(tokens)

#### 8. Text Normalization

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming
stemmed = [stemmer.stem(word) for word in tokens]
print("Stemmed:", stemmed)

# Apply lemmatization
lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatized:", lemmatized)


##### Which text normalization technique have you used and why?

Answer Here.

#### 9. Part of speech tagging

In [None]:
# POS Tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # resource name used by newer NLTK releases
from nltk import pos_tag

# POS tagging
pos_tags = pos_tag(tokens)
print(pos_tags)


#### 10. Text Vectorization

In [None]:
# Vectorizing Text
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    "PhonePe enables fast digital payments.",
    "Digital payments are secure and easy.",
    "PhonePe offers cashback on UPI transactions."
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Show TF-IDF matrix
df_tfidf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf)


##### Which text vectorization technique have you used and why?

Answer Here.

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Example: Log-transforming skewed feature
df['Log_Amount'] = np.log1p(df['Amount'])

# Example: Creating a ratio feature (average amount per transaction)
df['Amount_per_Transaction'] = df['Amount'] / df['Count']


#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Dummy target for demonstration
df['target'] = np.random.randint(0, 2, size=len(df))

# Feature matrix
X = df[['Amount', 'Count', 'Amount_per_Transaction', 'Log_Amount']]
y = df['target']

# Feature importance
model = ExtraTreesClassifier()
model.fit(X, y)

# Select important features
selector = SelectFromModel(model, prefit=True)
important_features = selector.get_support()
print("Selected Features:", X.columns[important_features].tolist())


##### Which feature selection methods have you used and why?

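In this project, we employed a combination of statistical, model-based, and domain-driven feature selection techniques to identify the most relevant features for predictive modeling. The model-based step is the one shown in the cell above: an ExtraTreesClassifier is fitted and SelectFromModel keeps the features whose importance is at or above the mean importance, which is its default threshold for tree ensembles. These methods helped reduce dimensionality, eliminate noise, and improve model performance.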
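The statistical part of the selection is not shown in the notebook. As a rough illustration only, a correlation filter could look like the sketch below. It is written in plain Python so that it runs without the project data; the function names `pearson` and `drop_correlated`, the 0.95 threshold, and the toy values are all assumptions made for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.95):
    """Keep features in order, skipping any that is too correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values, not taken from the PhonePe data
features = {
    "Amount": [100.0, 250.0, 400.0, 900.0, 1500.0],
    "Amount_doubled": [200.0, 500.0, 800.0, 1800.0, 3000.0],  # redundant copy of Amount
    "Count": [9, 2, 7, 4, 5],
}
selected = drop_correlated(features)
print(selected)  # ['Amount', 'Count']
```

The redundant column is dropped because it carries no information beyond Amount, while Count is only weakly correlated with Amount and is kept.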

##### Which features did you find important and why?

During the feature selection and model explainability phases of the PhonePe Transaction Insights project, we identified several features that had a significant impact on predictive performance and business understanding. These were determined using model-based importance scores, correlation analysis, and domain knowledge.

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
from sklearn.preprocessing import StandardScaler

# Apply standard scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[['Amount', 'Count']])


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler

# Scale numerical features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['Amount', 'Count']])


### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Yes — but only conditionally, depending on the dataset complexity and feature characteristics.



##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Principal Component Analysis (PCA) is the technique of choice here. It is appropriate when there are many correlated or redundant features, when model complexity and training time need to be reduced, or when high-dimensional data has to be visualized in 2D/3D space for better interpretation.
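To make the "correlated features" argument concrete, the sketch below computes how much variance each principal component captures for two standardized features. It is a minimal plain-Python illustration for the two-feature case only (it uses the closed-form eigenvalues of a 2x2 covariance matrix rather than a PCA library), and the function names and toy numbers are assumptions for this example.

```python
from math import sqrt

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def explained_variance_ratio_2d(xs, ys):
    """Share of total variance captured by each principal component of two features."""
    xs, ys = standardize(xs), standardize(ys)
    n = len(xs)
    # Entries of the symmetric 2x2 covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x in xs) / n
    c = sum(y * y for y in ys) / n
    b = sum(x * y for x, y in zip(xs, ys)) / n
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    centre = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [centre + radius, centre - radius]
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Hypothetical toy values in which the count grows almost in step with the amount
amount = [100.0, 250.0, 400.0, 900.0, 1500.0]
count = [12, 30, 41, 95, 160]
ratios = explained_variance_ratio_2d(amount, count)
print([round(r, 3) for r in ratios])
```

When the first ratio is close to 1, a single component preserves nearly all the information in both features, which is the situation in which PCA pays off.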



### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split

# Example target column
df['target'] = np.random.randint(0, 2, len(df))

# Split data 80% train, 20% test
X = df[['Amount', 'Count']]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



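##### What data splitting ratio have you used and why?

An 80:20 (train:test) split, set with test_size=0.2 in the cell above. It leaves most rows for training while holding out a fifth of the data for evaluation.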
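As a rough picture of what an 80:20 split with a fixed seed does, the following plain-Python sketch shuffles row indices and slices them. It is not how `train_test_split` is implemented internally; `split_indices` is a hypothetical helper and the row count of 1000 is an arbitrary example value.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)  # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

Fixing the seed makes the split repeatable, so metrics from different runs are computed on the same held-out rows.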

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Answer Here.

In [None]:
# Handling Imbalanced Dataset (If needed)
from imblearn.over_sampling import SMOTE
from collections import Counter

# Check imbalance
print("Before SMOTE:", Counter(y_train))

# Apply SMOTE to balance
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)

print("After SMOTE:", Counter(y_res))


##### What technique did you use to handle the imbalanced dataset and why? (If needed to be balanced)

Not needed.

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

# Fit the Algorithm (grid search over the regularization strength C)
params = {'C': [0.1, 1, 10], 'solver': ['liblinear']}
grid = GridSearchCV(LogisticRegression(), params, cv=5)
grid.fit(X_train, y_train)

# Predict on the model
y_pred = grid.predict(X_test)

# Evaluation
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))

# Confusion matrix plot
ConfusionMatrixDisplay.from_estimator(grid, X_test, y_test)


#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
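# Visualizing evaluation Metric Score chart
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import pandas as pd

# Generate classification report as dictionary
report = classification_report(y_test, y_pred, output_dict=True)

# Plot the weighted-average precision, recall and F1-score
scores = pd.Series(report['weighted avg']).drop('support')
scores.plot(kind='bar', ylim=(0, 1), title="Logistic Regression - Evaluation Metric Scores")
plt.show()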

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
# 📦 Import Required Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# 🧪 Define Model & Hyperparameters
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}
model = LogisticRegression(max_iter=500)

# 🔍 GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# ✅ Fit the Algorithm (best model after tuning)
best_model = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

# 📊 Predict on the model
y_pred = best_model.predict(X_test)

# 🧾 Classification Report
print(classification_report(y_test, y_pred))

# 📉 Confusion Matrix Visualization
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test, cmap='Blues')
plt.title("Confusion Matrix - Logistic Regression")
plt.grid(False)
plt.show()


##### Which hyperparameter optimization technique have you used and why?

For Logistic Regression, the number of critical hyperparameters is relatively small (like C for regularization and solver for optimization). This makes GridSearchCV an ideal and efficient choice for systematic exploration of possible values.
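The cost of that systematic exploration is easy to count. The sketch below enumerates the same `param_grid` used in the tuning cell with `itertools.product`, which mirrors the exhaustive nature of a grid search; the variable names `cv_folds`, `candidates`, and `total_fits` are introduced here for illustration only.

```python
from itertools import product

# Same grid and fold count as in the tuning cell above
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs'],
}
cv_folds = 5

# Every combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
# 8 candidates, 40 cross-validation fits
```

With only eight candidates, exhaustive search stays cheap; the count grows multiplicatively with every hyperparameter added to the grid.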

##### Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

Yes — there was a measurable improvement after hyperparameter tuning.



### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import pandas as pd

# 1. Fit the Algorithm with RandomizedSearchCV
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 20, None]
}

rf_random = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist,
                               n_iter=5, cv=3, random_state=42)
rf_random.fit(X_train, y_train)

# 2. Predict on the model
y_pred_rf = rf_random.predict(X_test)

# 3. Generate classification report
rf_report = classification_report(y_test, y_pred_rf, output_dict=True)

# 4. Extract weighted avg scores
rf_scores = {
    'Precision': rf_report['weighted avg']['precision'],
    'Recall': rf_report['weighted avg']['recall'],
    'F1-score': rf_report['weighted avg']['f1-score']
}

# Convert to DataFrame
df_rf_scores = pd.DataFrame(rf_scores, index=['Random Forest'])




#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Randomized search params
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 20, None]
}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=3, random_state=42)
random_search.fit(X_train, y_train)

# Predict
y_pred_rf = random_search.predict(X_test)

# Evaluation
print("Best Params:", random_search.best_params_)
print(classification_report(y_test, y_pred_rf))
ConfusionMatrixDisplay.from_estimator(random_search, X_test, y_test)


df_rf_scores.plot(kind='bar', figsize=(8, 5), legend=True, colormap='coolwarm')
plt.title("Random Forest - Evaluation Metric Scores")
plt.ylabel("Score")
plt.ylim(0, 1)
plt.xticks(rotation=0)
plt.grid(True)
plt.tight_layout()
plt.show()

ConfusionMatrixDisplay.from_estimator(rf_random, X_test, y_test)


##### Which hyperparameter optimization technique have you used and why?

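RandomizedSearchCV was used for the Random Forest model. Instead of trying every combination, it evaluates a fixed number of sampled hyperparameter settings (n_iter=5 here, each scored with 3-fold cross-validation), which keeps tuning cheap. Hyperparameters such as the number of trees and the maximum depth significantly affect the performance and generalization of the model, and tuning them helps avoid both underfitting and overfitting.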
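The Random Forest cells above pass `param_dist` to RandomizedSearchCV with `n_iter=5`. The sketch below illustrates the idea of random search in plain Python: only a sample of the possible settings is evaluated. It is not scikit-learn's own sampler, and `all_candidates`, `rng`, and `sampled` are names made up for this example.

```python
import random
from itertools import product

# Same search space and budget as in the Random Forest cells above
param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 20, None],
}
n_iter = 5

# The full grid has 3 * 4 = 12 settings
names = list(param_dist)
all_candidates = [dict(zip(names, values)) for values in product(*param_dist.values())]

# Random search evaluates only n_iter of them, drawn without replacement
rng = random.Random(42)
sampled = rng.sample(all_candidates, n_iter)

print(len(all_candidates), "possible settings,", len(sampled), "evaluated")
# 12 possible settings, 5 evaluated
```

The trade-off is that the best setting in the full grid may not be among those sampled, in exchange for fewer than half as many model fits.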

### ML Model - 3

In [None]:
# ML Model - 3 Implementation
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int("n_estimators", 50, 200),
        'max_depth': trial.suggest_int("max_depth", 3, 10),
        'learning_rate': trial.suggest_float("learning_rate", 0.01, 0.3),
        'subsample': trial.suggest_float("subsample", 0.5, 1.0)
    }
    model = XGBClassifier(eval_metric='logloss', **params)
    # Score each trial with cross-validation on the training data only,
    # so the test set stays unseen until the final evaluation
    return cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()

# Run Optuna
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# Fit the Algorithm with the best parameters
best_params = study.best_params
print("Best Params from Optuna:", best_params)

xgb_model = XGBClassifier(eval_metric='logloss', **best_params)
xgb_model.fit(X_train, y_train)

# Predict on the model
y_pred_xgb = xgb_model.predict(X_test)

print(classification_report(y_test, y_pred_xgb))
ConfusionMatrixDisplay.from_estimator(xgb_model, X_test, y_test)


### 2. Which ML model did you choose from the above created models as your final prediction model and why?

After evaluating the performance of all three models (Logistic Regression, Random Forest, and XGBoost), we selected XGBoost as the final prediction model for this project.

**Superior predictive performance:** Among all the models, XGBoost consistently delivered the highest F1-score, precision, and recall, especially on the imbalanced dataset. Its gradient boosting approach helps reduce both bias and variance, leading to better generalization.

**Handling imbalanced data:** XGBoost supports built-in parameters such as scale_pos_weight and max_delta_step for handling class imbalance, an essential requirement in real-world PhonePe use cases like fraud detection or insurance claim anomalies.

**Robustness to outliers and noise:** Unlike logistic regression, XGBoost is less sensitive to outliers and noisy features because of its tree-based structure.
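For the scale_pos_weight parameter mentioned above, a commonly suggested starting value is the ratio of negative to positive training examples. The sketch below computes that ratio; the helper name `suggested_scale_pos_weight` and the 95:5 label distribution are hypothetical and do not come from the project data.

```python
from collections import Counter

def suggested_scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for binary 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]

# Hypothetical imbalanced labels: 95 negatives and 5 positives
labels = [0] * 95 + [1] * 5
weight = suggested_scale_pos_weight(labels)
print(weight)  # 19.0
```

The result would then be passed as `XGBClassifier(scale_pos_weight=weight)`, so that errors on the rare positive class count more heavily during training.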

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

XGBoost (Extreme Gradient Boosting) is a high-performance, tree-based ensemble machine learning algorithm that builds models in a sequential manner and focuses on minimizing errors at each stage. It works especially well on structured/tabular data, like PhonePe transaction data.
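The "sequential, error-minimizing" idea can be sketched in a few lines. The example below is plain gradient boosting with one-split trees (stumps) and squared error on a single feature; it is a simplified illustration and omits what XGBoost adds on top, such as regularization and second-order gradient information. The function names and toy data are assumptions for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_val, right_val = fit_stump(xs, residuals)
        stumps.append((threshold, left_val, right_val))
        # Each new stump corrects a fraction of the remaining error
        preds = [p + learning_rate * (left_val if x <= threshold else right_val)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data with a step-like pattern
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, preds = boost(xs, ys)
initial_error = mse(ys, [base] * len(ys))
final_error = mse(ys, preds)
print(round(initial_error, 3), "->", round(final_error, 3))
```

Each round fits only what the previous rounds got wrong, which is why the training error shrinks as stumps are added.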

# **Conclusion**

This project aimed to extract actionable insights and build predictive models using aggregated transaction data from the PhonePe platform. Through a structured pipeline involving data extraction, cleaning, transformation, visualization, model training, and explainability, we achieved meaningful results that hold high business value.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***