# 📘 1. Introduction
This notebook explores user engagement metrics to support the design of a badge system
for a community platform. The analysis was carried out for a technical assessment and
is structured to:
- Understand the distribution and types of user activity
- Compare engagement behavior across multiple timeframes (4, 6, 8, 12 weeks)
- Identify stable and differentiating metrics
- Simulate threshold strategies for badge assignment

# 📥 2. Data Loading & Preparation


In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
import warnings 

warnings.filterwarnings('ignore')


In [None]:
# Load dataset
data = pd.read_csv('../data/data_analyst.csv')

# Print the first few rows of the dataset
data.head()

In [None]:
# Convert column names to lowercase
data.columns = data.columns.str.lower()

# 80,000 rows in total and activity columns are of type float
data.info()

In [None]:
# Print basic summary statistics. Outliers appear in every column except 'last_x_weeks' and 'user_id'
data.describe()

## 🧰 2.1 Utility Functions

In [None]:
def last_x_weeks(data, week):
    return data[data['last_x_weeks'] == week]

def count_non_zero(column):
    return column[column > 0].count()

def q25(x):
    return x.quantile(0.25)

def q75(x):
    return x.quantile(0.75)

def get_activity_summary(data, column):
    return data.groupby('last_x_weeks').agg({
        column : ['mean', 'std', 'median', 'max']
    }).reset_index()



# 📊 3. Exploratory Data Analysis (EDA)
<!-- Explore the overall distribution of key engagement metrics such as:
- Posts created
- Replies received
- Events and participation
- Items gifted
- Places recommended
- Thank-yous received

Plot histograms, CDFs, and boxplots to understand skewness and sparsity. -->

## 3.1 Missing Value Analysis

In [None]:
# Percentage of missing values in each column
missing_values = round(data.isna().sum() / len(data) * 100,2)
missing_values

In [None]:
# Percentage of missing values by week and activity
# Use each group's own size instead of a hard-coded 20,000 rows per timeframe
missing_by_week = data.drop(columns='user_id').set_index('last_x_weeks').isna().groupby(level='last_x_weeks').mean() * 100
missing_by_week

In [None]:
# Fill missing values with 0. It is assumed that the missing values indicate that the user did not engage in that activity 
data = data.fillna(0)

# Convert data types 
activity_columns = ['posts_created', 'replies_received', 'thankyous_received', 'events_created', 'event_participants', 'items_gifted', 'places_recommended']
data[activity_columns] = data[activity_columns].astype(int)

## 3.2 Distribution Analysis

In [None]:
# Split data into different weeks
last_4_weeks = last_x_weeks(data, 4)
last_6_weeks = last_x_weeks(data, 6)
last_8_weeks = last_x_weeks(data, 8)
last_12_weeks = last_x_weeks(data, 12)
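
Before plotting, skewness and sparsity can also be summarized numerically. The following is a minimal sketch on an invented toy frame, not the notebook's actual results; the helper name `sparsity_and_skew` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for one timeframe slice (values invented for illustration)
toy = pd.DataFrame({
    "posts_created": [0, 0, 0, 1, 1, 2, 3, 25],
    "items_gifted": [0, 0, 0, 0, 0, 0, 1, 2],
})

def sparsity_and_skew(frame, columns):
    """Hypothetical helper: share of zeros, skewness and key quantiles per metric."""
    subset = frame[columns]
    return pd.DataFrame({
        "zero_share": (subset == 0).mean(),  # fraction of users with no activity
        "skewness": subset.skew(),           # > 0 indicates a long right tail
        "p50": subset.quantile(0.5),
        "p90": subset.quantile(0.9),
        "max": subset.max(),
    })

summary = sparsity_and_skew(toy, ["posts_created", "items_gifted"])
print(summary)
```

In the notebook, the same helper could be applied to `last_4_weeks`, `last_8_weeks`, and so on with `activity_columns` to see which metrics are dominated by zeros.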

## 3.3 Outlier Detection

## 3.4 Metric Comparison Across Timeframes

## 3.5 Correlations 

## 3.6 Engagement Segment Checks

In [None]:
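# # Example: Distribution of posts per week
# # Assumption: "_pw" means the activity count divided by the window length in weeks
# sns.histplot(data['posts_created'] / data['last_x_weeks'], bins=50)
# plt.title("Posts Created per Week Distribution")
# plt.show()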

# 📈 4. Timeframe Comparison
Compare how engagement metrics behave across the 4, 6, 8, and 12-week windows.
Assess:
- Metric stability: standard deviation, drift, range
- Engagement level trends: increasing or plateauing
- Active user proportions by metric and timeframe
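
The checks listed above could be sketched as follows. This is an illustration on invented toy data; it assumes that a per-week metric is the raw count divided by `last_x_weeks` and that a user counts as "active" when the raw count is above zero. The helper name `timeframe_profile` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (values invented for illustration)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4] * 2,
    "last_x_weeks": [4] * 4 + [8] * 4,
    "posts_created": [0, 0, 4, 8, 0, 8, 8, 16],
})

def timeframe_profile(frame, column):
    """Hypothetical helper: per-week level, spread and active share per timeframe."""
    # Assumption: normalize by window length so that timeframes are comparable
    per_week = frame[column] / frame["last_x_weeks"]
    grouped = per_week.groupby(frame["last_x_weeks"])
    active = (frame[column] > 0).groupby(frame["last_x_weeks"]).mean()
    return pd.DataFrame({
        "mean_pw": grouped.mean(),
        "std_pw": grouped.std(),
        "max_pw": grouped.max(),
        "active_share": active,
    })

profile = timeframe_profile(toy, "posts_created")
print(profile)
```

A metric whose `mean_pw` and `std_pw` change little between windows would be a candidate for a stable badge criterion, while a rising `active_share` shows how much longer windows reduce sparsity.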


In [None]:
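# Grouped summary by timeframe
# Assumption: "posts_created_pw" means posts_created divided by the window length in weeks
posts_pw = data.assign(posts_created_pw=data['posts_created'] / data['last_x_weeks'])
posts_pw.groupby('last_x_weeks')['posts_created_pw'].describe()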


# 📐 5. Metric Selection & Differentiation
Identify which metrics:
- Have clear separation between active and inactive users
- Are not dominated by noise or sparsity
- Maintain user ranking across timeframes (Spearman correlation)

These are the metrics best suited for badge assignment.
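
One possible way to check whether a metric maintains user ranking across timeframes is to pivot it to one column per window and compute the Spearman correlation between the columns. The sketch below uses invented toy data and assumes that each `user_id` appears exactly once per timeframe; `rank_stability` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (values invented for illustration)
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4] * 2,
    "last_x_weeks": [4] * 4 + [8] * 4,
    "posts_created": [0, 1, 3, 7, 1, 2, 5, 20],
})

def rank_stability(frame, column):
    """Hypothetical helper: Spearman correlation of one metric between timeframes."""
    # One row per user, one column per timeframe
    wide = frame.pivot(index="user_id", columns="last_x_weeks", values=column)
    return wide.corr(method="spearman")

stability = rank_stability(toy, "posts_created")
print(stability)
```

Values close to 1 mean that users keep roughly the same rank regardless of the window, which supports using the metric for badges. Values that drop for short windows suggest that the metric is too noisy there.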

# 📏 6. Threshold Simulation
Use percentile-based thresholds (e.g., 75th, 90th) to:
- Simulate how many users would qualify per badge
- Define thresholds that are balanced (not too easy or too exclusive)
- Build badge categories based on different engagement types

Example badges:
- "Top Contributor": posts_created_pw ≥ 75th percentile
- "Helpful Neighbor": thankyous_received_pw ≥ 75th percentile
- "Conversation Starter": post_replies_ratio ≥ 1.0
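
A percentile-based simulation along these lines could look like the sketch below. It runs on invented toy data and assumes that per-week values are the raw count divided by the window length. It also excludes users with zero activity from qualifying. That is a design choice for this sketch, not something the analysis prescribes. `simulate_coverage` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the 8-week slice of `data` (values invented for illustration)
toy = pd.DataFrame({
    "user_id": range(1, 11),
    "last_x_weeks": [8] * 10,
    "posts_created": [0, 0, 0, 0, 1, 1, 2, 3, 5, 16],
})

def simulate_coverage(frame, column, weeks, percentiles=(0.5, 0.75, 0.9)):
    """Hypothetical helper: threshold and number of qualifying users per percentile."""
    window = frame[frame["last_x_weeks"] == weeks]
    per_week = window[column] / weeks  # assumption: "_pw" = count / weeks
    rows = []
    for p in percentiles:
        threshold = per_week.quantile(p)
        # Design choice: a user with zero activity never qualifies, even if
        # the percentile itself is 0 because the metric is sparse
        qualifies = (per_week >= threshold) & (per_week > 0)
        rows.append({
            "percentile": p,
            "threshold_pw": threshold,
            "qualifying_users": int(qualifies.sum()),
            "coverage": float(qualifies.mean()),
        })
    return pd.DataFrame(rows)

coverage = simulate_coverage(toy, "posts_created", weeks=8)
print(coverage)
```

Comparing the `coverage` column across percentiles shows whether a threshold is too easy or too exclusive. For very sparse metrics the lower percentiles can collapse to 0, which is why the zero-activity guard is included.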

In [None]:
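# Example: calculate per-week thresholds for the 8-week window
# Assumption: "posts_created_pw" means posts_created divided by the window length in weeks
posts_created_pw = last_8_weeks['posts_created'] / 8
quantiles = posts_created_pw.quantile([0.5, 0.75, 0.9])
print(quantiles)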

# 🎯 7. Badge System Recommendations

Summarize badge logic:
- Which metrics and thresholds are used
- Which timeframe is selected (e.g., 8 weeks for balance and stability)
- How thresholds ensure fairness and meaning

Provide a final table of badge names, metrics, thresholds, and estimated coverage.
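
Such a table could be assembled as in the sketch below. It uses invented toy data, and the numbers it prints are not results of this analysis. The badge definitions reuse two of the example badges from Section 6. The 8-week window, the 75th percentile, and the helper name `badge_table` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the 8-week slice of `data` (values invented for illustration)
toy = pd.DataFrame({
    "user_id": range(1, 9),
    "last_x_weeks": [8] * 8,
    "posts_created": [0, 0, 1, 2, 4, 8, 8, 16],
    "thankyous_received": [0, 0, 0, 0, 8, 8, 16, 24],
})

# Badge name -> (raw metric column, percentile used as threshold)
badges = {
    "Top Contributor": ("posts_created", 0.75),
    "Helpful Neighbor": ("thankyous_received", 0.75),
}

def badge_table(frame, badges, weeks):
    """Hypothetical helper: one row per badge with threshold and estimated coverage."""
    window = frame[frame["last_x_weeks"] == weeks]
    rows = []
    for name, (column, percentile) in badges.items():
        per_week = window[column] / weeks  # assumption: "_pw" = count / weeks
        threshold = per_week.quantile(percentile)
        # Design choice: users with zero activity never qualify
        qualifies = (per_week >= threshold) & (per_week > 0)
        rows.append({
            "badge": name,
            "metric": f"{column}_pw",
            "percentile": percentile,
            "threshold": threshold,
            "coverage": float(qualifies.mean()),
        })
    return pd.DataFrame(rows)

table = badge_table(toy, badges, weeks=8)
print(table)
```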

# ✅ 8. Conclusion
This analysis defines a defensible, data-backed badge system using engagement data.
The selected metrics and timeframe support stable and fair badge assignment across a diverse user base.
