# Exploratory Data Analysis (EDA)

## Purpose

Exploratory Data Analysis (EDA) is the first step in a statistical analysis or
hypothesis testing workflow, and every later choice of test depends on it.

The goals of this notebook are to:
- Understand the structure and quality of the dataset
- Explore distributions and relationships
- Identify potential assumption violations
- Motivate appropriate hypothesis tests

No formal hypothesis testing is performed here.
We only observe, visualize, and reason.


## 🟦 Imports & Setup

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from src.data_generation import generate_student_dataset

# set_theme is the preferred interface; sns.set is only an alias for it
sns.set_theme(style="whitegrid")


## 🟦 Load or Generate Data

In [None]:
# Option 1: generate data
df = generate_student_dataset(n=4000, random_state=42)

# Option 2 (later): load from CSV
# df = pd.read_csv("../data/student_performance.csv")

df.head()


## 🟦 Dataset Overview

We begin by examining:
- number of observations
- feature types
- presence of missing values
- basic summary statistics


### 🟦 Shape, Info, Missing Values

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum()

- The dataset contains 4,000 observations (`n=4000` in the generation call).
- No missing values are present.
- Variables include both numerical and categorical features.
- With both feature types available, the dataset supports group comparisons as well as correlation and regression analyses.
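
The three checks above can be collected into a single per-column table. The sketch below is one possible way to do it; the `overview` helper and the small `demo` frame are hypothetical stand-ins, and in the notebook you would pass `df` instead.

```python
import numpy as np
import pandas as pd


def overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(2),
        "n_unique": frame.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical stand-in data; in the notebook, call overview(df) instead.
demo = pd.DataFrame({
    "score": [72.5, 88.0, np.nan, 64.0],
    "gender": ["F", "M", "F", "M"],
    "teaching_method": ["A", "B", "B", "C"],
})

summary = overview(demo)
print(summary)
print("Duplicate rows:", int(demo.duplicated().sum()))
```

Checking for duplicated rows alongside missing values is cheap and catches a data-quality problem that `df.info()` does not report.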


## 🟦 Summary Statistics

In [None]:
df.describe()

## 🟦 Initial Observations

From summary statistics we can already note:
- Reasonable ranges for all variables
- No obvious data corruption
- Exam scores span the full 0–100 range
- Study behavior and GPA show realistic variation

These observations support further statistical analysis.
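
"Reasonable ranges" can be checked explicitly rather than read off `df.describe()`. The following is a minimal sketch: only the 0–100 score range comes from the text above, while the bounds for `attendance_rate` and `previous_gpa` are assumptions that should be adjusted to how the generator actually encodes them.

```python
import pandas as pd

# Plausibility bounds. Only the score range is stated in this notebook;
# the other two are assumptions (a proportion and a 4-point GPA scale).
EXPECTED_RANGES = {
    "score": (0, 100),
    "attendance_rate": (0, 1),
    "previous_gpa": (0, 4),
}


def out_of_range_counts(frame: pd.DataFrame, ranges: dict) -> dict:
    """Count non-missing values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            values = frame[col]
            bad = values.notna() & ~values.between(low, high)
            counts[col] = int(bad.sum())
    return counts


# Hypothetical stand-in data; in the notebook, pass df instead.
demo = pd.DataFrame({
    "score": [55, 101, 78],
    "attendance_rate": [0.9, 0.5, 1.2],
    "previous_gpa": [3.1, 2.8, 3.9],
})

result = out_of_range_counts(demo, EXPECTED_RANGES)
print(result)
```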


## 🟦 Distribution of Exam Scores

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(df["score"], bins=30, kde=True)
plt.xlabel("Exam Score")
plt.title("Distribution of Exam Scores")
plt.show()

### 🟦 Distribution of Exam Scores

- The distribution is approximately bell-shaped
- Mild skewness is present, as is common for scores bounded between 0 and 100
- No extreme outliers dominate the distribution

This suggests that **parametric tests (t-test, ANOVA)** may be appropriate,
though assumptions must still be checked later.
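
The visual impression of "approximately bell-shaped with mild skewness" can be backed by two descriptive numbers: sample skewness and excess kurtosis, both near zero for a normal distribution. This is a sketch on simulated stand-in scores (the mean of 70 and standard deviation of 10 are arbitrary assumptions); in the notebook, use `df["score"]` instead. It is descriptive only and is not a normality test.

```python
import numpy as np
import pandas as pd

# Simulated stand-in scores; the parameters are arbitrary assumptions.
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(70, 10, 2000).clip(0, 100), name="score")

skewness = scores.skew()         # 0 for a symmetric distribution
excess_kurtosis = scores.kurt()  # 0 for a normal distribution (Fisher definition)

print(f"skewness:        {skewness:.3f}")
print(f"excess kurtosis: {excess_kurtosis:.3f}")
```

Values close to zero are consistent with the histogram; clearly non-zero values would be a reason to look harder at the assumptions before using a t-test or ANOVA.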


## 🟦 Boxplot for Outliers

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(y=df["score"])
plt.title("Boxplot of Exam Scores")
plt.show()


## 🟦 Outlier Discussion

- Some extreme values exist, but they are not implausible
- No immediate need for outlier removal
- Robust and non-parametric tests will still be considered later
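
The points drawn beyond the whiskers of a boxplot follow the 1.5 × IQR rule by default, so they can be listed directly instead of being read off the plot. The helper below is a sketch with a hypothetical name (`iqr_outliers`) and made-up example scores; in the notebook, apply it to `df["score"]`.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up example scores with one clearly low value.
demo_scores = pd.Series([62, 65, 68, 70, 71, 73, 75, 78, 80, 12], name="score")

flagged = iqr_outliers(demo_scores)
print(flagged)
print(f"{len(flagged)} of {len(demo_scores)} values flagged")
```

Flagged values are candidates for inspection, not for automatic removal, which matches the decision above to keep them.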

## 🟦 Scores by Gender

In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(x="gender", y="score", data=df)
plt.title("Exam Scores by Gender")
plt.show()


## 🟦 Visual Hypothesis (Gender)

### Scores by Gender

- Median scores appear slightly higher for one group (the boxplot shows medians, not means, so group means should be checked separately)
- Distributions overlap substantially

This motivates a **two-sample hypothesis test**, but no conclusion is drawn yet.
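
Since a default boxplot shows medians only, a numeric summary is a useful complement. The sketch below tabulates count, mean, median and standard deviation per group and adds Cohen's d as a descriptive effect size. The `"F"`/`"M"` labels and the simulated scores are assumptions for illustration; in the notebook, use `df` and its actual `gender` categories. No p-value is computed, in keeping with the scope of this notebook.

```python
import numpy as np
import pandas as pd


def cohens_d(a: pd.Series, b: pd.Series) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


# Simulated stand-in data; group labels and parameters are assumptions.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "gender": np.repeat(["F", "M"], 500),
    "score": np.concatenate([rng.normal(72, 10, 500), rng.normal(70, 10, 500)]),
})

by_gender = demo.groupby("gender")["score"].agg(["count", "mean", "median", "std"])
print(by_gender.round(2))

d = cohens_d(
    demo.loc[demo["gender"] == "F", "score"],
    demo.loc[demo["gender"] == "M", "score"],
)
print(f"Cohen's d (F minus M): {d:.2f}")
```

A small effect size combined with heavily overlapping boxes is exactly the situation where a formal two-sample test is needed to say anything further.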

## 🟦 Scores by Teaching Method

In [None]:
plt.figure(figsize=(7, 4))
sns.boxplot(x="teaching_method", y="score", data=df)
plt.title("Exam Scores by Teaching Method")
plt.show()


## 🟦 Visual Hypothesis (ANOVA)

### Scores by Teaching Method

- Clear separation between groups is visible
- The spread within each group looks small relative to the gaps between group medians

This strongly motivates **one-way ANOVA**, followed by post-hoc testing if the overall test indicates a difference.
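
The comparison of within-group and between-group variability can be made concrete by splitting the total sum of squares into its two parts, which is the same decomposition one-way ANOVA builds on. The sketch below uses a hypothetical helper name and a tiny made-up dataset with group labels `"A"`, `"B"`, `"C"`; in the notebook, call it with `df`, `"teaching_method"` and `"score"`. It reports the share of variance between groups (eta squared) as a descriptive quantity and performs no test.

```python
import pandas as pd


def variance_decomposition(frame: pd.DataFrame, group_col: str, value_col: str) -> dict:
    """Split the total sum of squares into between-group and within-group parts."""
    grand_mean = frame[value_col].mean()
    groups = frame.groupby(group_col)[value_col]
    # Between: group size times squared distance of group mean from grand mean
    ss_between = float((groups.count() * (groups.mean() - grand_mean) ** 2).sum())
    # Within: squared distance of each observation from its own group mean
    ss_within = float(((frame[value_col] - groups.transform("mean")) ** 2).sum())
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "eta_squared": ss_between / (ss_between + ss_within),
    }


# Made-up example: three well-separated groups with little internal spread.
demo = pd.DataFrame({
    "teaching_method": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "score": [60, 62, 64, 70, 72, 74, 80, 82, 84],
})

parts = variance_decomposition(demo, "teaching_method", "score")
print(parts)
```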


## 🟦 Relationship with Continuous Variables

In [None]:
sns.pairplot(
    df[["score", "study_hours", "attendance_rate", "previous_gpa"]],
    diag_kind="kde"
)
plt.show()


## 🟦 Relationship Interpretation

### Relationships Between Variables

- Score tends to increase with study hours and previous GPA
- Attendance shows a positive but weaker relationship
- Linear trends are visible

This suggests regression-based hypothesis testing will be meaningful.
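
The strength of these relationships can be summarized with a correlation matrix. The sketch below computes Pearson correlations and, as a check that does not assume linearity, Spearman correlations obtained by correlating ranks. The simulated data and its coefficients are arbitrary assumptions and do not reflect how `generate_student_dataset` works; in the notebook, select the same four columns from `df`.

```python
import numpy as np
import pandas as pd

# Simulated stand-in data; every coefficient below is an arbitrary assumption.
rng = np.random.default_rng(2)
n = 500
study_hours = rng.uniform(0, 20, n)
previous_gpa = rng.uniform(2.0, 4.0, n)
attendance_rate = rng.uniform(0.5, 1.0, n)
score = (
    40
    + 1.5 * study_hours
    + 5 * previous_gpa
    + 5 * attendance_rate
    + rng.normal(0, 5, n)
)

demo = pd.DataFrame({
    "score": score,
    "study_hours": study_hours,
    "attendance_rate": attendance_rate,
    "previous_gpa": previous_gpa,
})

cols = ["score", "study_hours", "attendance_rate", "previous_gpa"]
pearson = demo[cols].corr()          # linear association
spearman = demo[cols].rank().corr()  # Pearson on ranks, i.e. Spearman

print(pearson["score"].round(2))
print(spearman["score"].round(2))
```

When the Pearson and Spearman values are close, the relationship is roughly linear; a large gap between them would suggest a monotone but curved trend or influential outliers.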


## 🟦 Group-wise Aggregation

In [None]:
df.groupby("teaching_method")["score"].agg(
    ["count", "mean", "std"]
)


## EDA Summary

The exploratory analysis suggests:

- The dataset is clean: no missing values and plausible ranges
- Parametric tests are likely applicable, pending formal assumption checks
- Group differences are visually evident
- Predictors show positive, roughly linear relationships with the outcome

EDA has guided our choice of statistical methods.
Formal hypothesis testing begins in the next notebook.
