# 📊 Exploratory Data Analysis (EDA)

This notebook performs basic exploratory data analysis on the cleaned dataset, `outputs/cleaned_data.csv` (loaded below through the relative path `../outputs/cleaned_data.csv`).

- Dataset source: cleaned by `clean_data.py`
- Goal: understand distributions, missing values, outliers, and overall structure

In [None]:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# set_theme is the current name for the older sns.set alias
sns.set_theme(style="whitegrid")

# Load cleaned data
df = pd.read_csv("../outputs/cleaned_data.csv")
df.head()

## 🔍 Basic Information

In [None]:
df.info()

In [None]:
df.describe(include="all")

## 📉 Missing Values Visualization

In [None]:
msno.matrix(df, figsize=(10, 4))
plt.title("Missing Data Overview")
plt.show()
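The matrix plot shows where values are missing but not how many. A per-column count can complement it. The following is a minimal sketch: `missing_summary` is a hypothetical helper, and the small `example` frame is made-up data for illustration only (in the notebook, `df` would be passed instead).

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most-missing first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up rows for illustration; replace `example` with `df` in the notebook.
example = pd.DataFrame({
    "Email": ["a@example.com", None, "c@example.com", None],
    "SignUpDate": ["2023-01-01", "2023-02-01", None, "2023-04-01"],
    "Age": [25, 31, 29, 40],
})

result = missing_summary(example)
print(result)
```

Sorting by the count puts the columns that need the most attention at the top of the table.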

## 📌 Categorical Distributions

In [None]:
sns.countplot(data=df, x="Status")
plt.title("User Status Distribution")
plt.ylabel("Count")
plt.show()
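A count plot is easier to interpret next to the exact counts and shares. The sketch below uses `value_counts` on a made-up `Status` column. The values are illustrative only, and `status_table` is a hypothetical name; in the notebook, `df["Status"]` would take the place of `example["Status"]`.

```python
import pandas as pd

# Made-up values for illustration; use df["Status"] in the notebook.
example = pd.DataFrame(
    {"Status": ["Active", "Active", "Inactive", "Unknown", "Active"]}
)

# dropna=False keeps missing values visible as their own category.
counts = example["Status"].value_counts(dropna=False)
shares = example["Status"].value_counts(normalize=True, dropna=False).round(2)

status_table = pd.DataFrame({"count": counts, "share": shares})
print(status_table)
```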

In [None]:
sns.countplot(data=df, x="Country")
plt.title("Country Distribution")
plt.xticks(rotation=45)
plt.ylabel("Count")
plt.show()

## 📊 Numerical Distributions

In [None]:
sns.histplot(df["Age"], bins=10, kde=True)
plt.title("Age Distribution")
plt.show()
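A histogram with only 10 bins gives a rough impression of shape. Comparing the mean with the median and checking skewness and excess kurtosis is a quick numerical cross-check, though not a formal normality test. The sketch below assumes a hypothetical helper `shape_stats` and a made-up `example_age` series; in the notebook, `df["Age"]` would be passed instead.

```python
import pandas as pd


def shape_stats(values: pd.Series) -> dict:
    """Summarize center and shape of a numeric column, ignoring missing values."""
    clean = values.dropna()
    return {
        "mean": clean.mean(),
        "median": clean.median(),
        "skew": clean.skew(),             # near 0 for a symmetric distribution
        "excess_kurtosis": clean.kurt(),  # near 0 for a normal distribution
    }


# Made-up ages for illustration; use df["Age"] in the notebook.
example_age = pd.Series([22, 25, 27, 29, 29, 31, 33, 36], name="Age")

stats = shape_stats(example_age)
print(stats)
```

A mean close to the median and a skewness near zero are consistent with a symmetric distribution, but they do not prove normality on their own.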

In [None]:
sns.histplot(df["Income"], bins=10, kde=True)
plt.title("Income Distribution")
plt.show()
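Since one goal of this notebook is to understand outliers, extreme values can be flagged numerically as well as visually. One common convention is the 1.5 × IQR rule, which is also the default whisker rule in seaborn boxplots. The sketch below assumes a hypothetical helper `iqr_outliers` and a made-up `example_income` series; the notebook itself does not prescribe an outlier rule.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]


# Made-up incomes for illustration; use df["Income"] in the notebook.
example_income = pd.Series(
    [30000, 32000, 35000, 36000, 38000, 40000, 150000], name="Income"
)

outliers = iqr_outliers(example_income)
print(outliers)
```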

## 📎 Income by Country

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x="Country", y="Income")
plt.title("Income Distribution by Country")
plt.xticks(rotation=45)
plt.show()
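The boxplot can be backed by a small summary table with the group size, median, and mean per country. The sketch below runs on made-up rows, and the name `income_by_country` is hypothetical; in the notebook, `df` would replace `example`.

```python
import pandas as pd

# Made-up rows for illustration; use `df` in the notebook.
example = pd.DataFrame({
    "Country": ["US", "US", "DE", "DE", "IN"],
    "Income": [50000, 70000, 40000, 44000, 30000],
})

income_by_country = (
    example.groupby("Country")["Income"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(income_by_country)
```

Including the count helps avoid over-interpreting medians that come from only a handful of rows.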

## ✅ Summary
- The dataset is mostly clean, with a small number of missing values (e.g., in `Email` and `SignUpDate`)
- Age appears roughly normally distributed, centered near 29 years
- Income varies across users and between countries
- `Status` is mostly 'Active', followed by 'Inactive' and 'Unknown'

This notebook gives a first overview of the dataset and can be extended with feature engineering or analysis aimed at business insights.