

# Exploratory Data Analysis (EDA) on Student Grades

**Goal:** practice end‑to‑end EDA on a small, intentionally "messy" dataset (35 students).  
You will:
- Generate **synthetic data** (First Name, Last Name, Grade).  
- **Inject data issues**: missing values, negative grades, out-of-range values (e.g., 540).  
- Perform **EDA**: preview, schema, summary stats, missingness, range checks, outlier flags.  
- **Fix errors** with documented, reproducible rules.  
- **Visualize** the distribution **before & after** cleaning.  

> Replace or extend any section with your own dataset later. Keep the *structure* and *explanations*.



## 1) Environment Setup

We use three widely used third-party libraries:
- `pandas` for data wrangling
- `numpy` for random generation and numeric ops
- `matplotlib` for basic plots

> These packages are not part of the Python standard library. If an import fails with `ModuleNotFoundError`, install them in your environment (e.g., `pip install pandas numpy matplotlib`).
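
Before running the import cell, it can help to confirm that the three packages are importable. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name introduced here, not part of any of these packages.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


required = ["pandas", "numpy", "matplotlib"]
missing = missing_modules(required)

if missing:
    print("Missing packages:", ", ".join(missing))
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it and restart the kernel before continuing.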


In [4]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# For reproducibility
rng = np.random.default_rng(4064)
pd.set_option('display.max_rows', 50)
print("Environment ready.")



## 2) Generate Synthetic Data (35 students)

**Design:**
- Take first and last names from two small 35-entry lists, paired in list order (no random sampling, so each name appears exactly once).
- Generate grades around a typical distribution (mean≈75, std≈12).
- Create a DataFrame with columns: `FirstName`, `LastName`, `Grade`.

We'll **intentionally introduce errors** in the next step.


In [None]:

# Small name banks (edit/expand as you like)
first_names = [
    "Alex","Taylor","Jordan","Riley","Casey","Avery","Morgan","Quinn","Jamie","Skyler",
    "Sam","Cameron","Drew","Jesse","Parker","Rowan","Hayden","Reese","Emerson","Logan",
    "Milan","Noa","Eden","Remy","Ari","Kendall","Harley","Corey","Shay","Sage",
    "Blake","Shawn","Robin","Kris","Cody"
]

last_names = [
    "Smith","Lee","Patel","Brown","Martin","Garcia","Nguyen","Johnson","Williams","Davis",
    "Miller","Wilson","Anderson","Thomas","Lopez","Harris","Clark","Lewis","Walker","Young",
    "King","Wright","Hill","Scott","Green","Baker","Adams","Nelson","Carter","Mitchell",
    "Perez","Roberts","Turner","Phillips","Campbell"
]

# Generate base grades ~ N(75, 12), clipped to [0, 100] initially (we'll add errors later)
base_grades = np.clip(rng.normal(loc=75, scale=12, size=35), 0, 100).round(1)

df = pd.DataFrame({
    "FirstName": first_names,
    "LastName": last_names,
    "Grade": base_grades
})

df.head(10)


## 3) Inject Data Issues (Missing, Negative, Out-of-Range)

We simulate common data quality problems:
- Missing values (`NaN`)
- Negative grades (e.g., `-10`)
- Out-of-range high values (e.g., `540` where `54` was intended)

> **Why simulate problems?** EDA isn't just stats; it's about *diagnosing and repairing* real‑world messiness.


In [None]:

df_dirty = df.copy()

# Pick the affected rows with the seeded rng so reruns from a fresh kernel hit the same rows
issue_indices = rng.choice(df_dirty.index, size=6, replace=False)

# 2 missing grades
df_dirty.loc[issue_indices[0], "Grade"] = np.nan
df_dirty.loc[issue_indices[1], "Grade"] = np.nan

# 2 negative grades
df_dirty.loc[issue_indices[2], "Grade"] = -10
df_dirty.loc[issue_indices[3], "Grade"] = -3

# 1 extreme high (likely 10x typo)
df_dirty.loc[issue_indices[4], "Grade"] = 540

# 1 slightly >100 (e.g., 104) to test upper-bound fix
df_dirty.loc[issue_indices[5], "Grade"] = 104

print("Indices with injected issues:", issue_indices.tolist())
df_dirty.head(15)


## 4) EDA: Quick Preview & Schema

Look at a **sample**, the **schema**, and **summary statistics** to get a sense of the data.


In [None]:

print("Head:")
display(df_dirty.head())

print("\nInfo:")
df_dirty.info()  # info() prints its report directly and returns None, so do not wrap it in display()

print("\nDescribe (numeric):")
display(df_dirty.describe())


## 5) Missingness & Validity Checks

**Checks:**
- Missing values per column
- Count invalid grades: `<0` or `>100`
- Identify likely 10× typos (values above 100 that are whole multiples of 10, e.g., `540`)

> These *rule‑based* checks should be **clear** and **defensible**. Document your assumptions.
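
To make the 10× typo rule concrete, here is a minimal plain-Python sketch of the same predicate applied to single values. The function name `looks_like_ten_x_typo` is hypothetical, and treating `None` and `NaN` as "not a typo" is an assumption of this sketch.

```python
import math


def looks_like_ten_x_typo(grade):
    """Heuristic: a grade above 100 that is a whole multiple of 10."""
    if grade is None or math.isnan(grade):
        return False
    return grade > 100 and math.isclose(grade / 10, round(grade / 10), abs_tol=1e-9)


# Try the rule on a few representative values.
for value in [540, 104, 110, 95, float("nan")]:
    print(value, looks_like_ten_x_typo(value))
```

Note that `110` is flagged as well: dividing it by 10 gives `11`, which may not be what was intended. Whether that is acceptable is exactly the kind of assumption worth documenting.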


In [None]:

missing_counts = df_dirty.isna().sum()

invalid_negative = (df_dirty["Grade"] < 0).sum(skipna=True)
invalid_over_100 = (df_dirty["Grade"] > 100).sum(skipna=True)

# flag potential 10x typos: >100 and approximate multiple of 10 after rounding
potential_ten_x = df_dirty["Grade"].apply(lambda x: isinstance(x, (int, float)) and x>100 and abs(x/10 - round(x/10)) < 1e-9)

summary_checks = pd.DataFrame({
    "missing": missing_counts,
})
print("Missing values per column:")
display(summary_checks)

print(f"Invalid negatives: {invalid_negative}")
print(f"Invalid >100: {invalid_over_100}")
print("Potential 10x-typos (value > 100 and ending in 0):")
display(df_dirty.loc[potential_ten_x.fillna(False), ["FirstName","LastName","Grade"]])
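
The goals at the top mention outlier flags, which the rule-based checks above do not cover. One common option is Tukey's IQR fences; the sketch below is a plain-Python illustration under stated assumptions (missing values are represented as `None`, the multiplier `k=1.5` is the conventional default, and `iqr_outliers` is a hypothetical helper), not the method prescribed by this notebook.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring None."""
    present = [v for v in values if v is not None]
    # method="inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


# Hypothetical sample: eight plausible grades, two injected errors, one missing.
sample = [62.0, 68.5, 71.0, 74.5, 77.0, 80.0, 83.5, 88.0, -10, 540, None]
print(iqr_outliers(sample))
```

A statistical outlier flag and a validity rule answer different questions: a value can sit inside the IQR fences and still be invalid (for example, slightly above 100), so treat the two checks as complementary.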


## 6) Visualize Distribution (Before Cleaning)

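Plot the grade distribution to see how the invalid values distort it. The histogram drops missing values, so `NaN` grades do not appear in the plot.
> Keep plots simple and readable: one chart per cell and default fill colors (the black bar edges are only there to separate bins).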


In [None]:

plt.figure()
df_dirty["Grade"].plot(kind="hist", bins=15, edgecolor="black")
plt.title("Grade Distribution (Before Cleaning)")
plt.xlabel("Grade")
plt.ylabel("Frequency")
plt.show()


## 7) Cleaning Strategy (Documented Rules)

We will apply **clear, reproducible** rules:
1. **Fix obvious 10× typos**: if `Grade > 100` *and* is a whole multiple of 10, divide by 10 (e.g., `540 → 54`).  
2. **Clip out-of-range values**: after step 1, clip remaining grades to `[0, 100]`.  
3. **Impute missing values**: use the **median** (robust to outliers) of the non-missing grades as they stand after steps 1 and 2.  
4. Keep a **cleaning log** of what changed.

> Replace these with your own rules when you switch to your dataset.
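
Because the order of the rules matters, a small worked example on a plain list may help. This is a minimal sketch under stated assumptions: missing grades are represented as `None`, and `clean_grades` is a hypothetical helper that mirrors the rules above rather than the notebook's DataFrame code.

```python
import math
from statistics import median


def clean_grades(grades):
    """Apply the three rules in order; return (cleaned list, log lines)."""
    log = []

    # Rule 1: divide suspected 10x typos (>100 and a multiple of 10) by 10.
    step1 = []
    for g in grades:
        if g is not None and g > 100 and math.isclose(g / 10, round(g / 10), abs_tol=1e-9):
            log.append(f"10x typo: {g} -> {g / 10}")
            g = g / 10
        step1.append(g)

    # Rule 2: clip everything that is still outside [0, 100].
    step2 = []
    for g in step1:
        if g is not None and not 0 <= g <= 100:
            clipped = min(max(g, 0), 100)
            log.append(f"clipped: {g} -> {clipped}")
            g = clipped
        step2.append(g)

    # Rule 3: fill missing values with the median of the non-missing grades
    # as they stand after rules 1 and 2.
    fill = median(g for g in step2 if g is not None)
    n_missing = sum(g is None for g in step2)
    log.append(f"imputed {n_missing} missing with median {fill}")
    cleaned = [fill if g is None else g for g in step2]
    return cleaned, log


cleaned, log = clean_grades([72.5, None, -10, 540, 104])
print(cleaned)
print("\n".join(log))
```

In this example `540` becomes `54.0` before clipping, `-10` and `104` are clipped to `0` and `100`, and the missing value is filled with the median of `[72.5, 0, 54.0, 100]`, which is `63.25`. Running the clip before the typo fix would instead turn `540` into `100`.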


In [None]:

df_clean = df_dirty.copy()

cleaning_log = []

# 1) Fix 10x typos
ten_x_mask = df_clean["Grade"].apply(lambda x: isinstance(x, (int, float)) and x>100 and abs(x/10 - round(x/10)) < 1e-9)
df_clean.loc[ten_x_mask, "Grade"] = df_clean.loc[ten_x_mask, "Grade"] / 10.0
cleaning_log.append(f"Divided {ten_x_mask.sum()} suspected 10x-typo grade(s) by 10.")

# 2) Clip to [0, 100]
before_clip_out_of_range = ((df_clean["Grade"] < 0) | (df_clean["Grade"] > 100)).sum(skipna=True)
df_clean["Grade"] = df_clean["Grade"].clip(lower=0, upper=100)
after_clip_out_of_range = ((df_clean["Grade"] < 0) | (df_clean["Grade"] > 100)).sum(skipna=True)
cleaning_log.append(f"Clipped out-of-range values: before={before_clip_out_of_range}, after={after_clip_out_of_range}.")

# 3) Impute missing with median of valid grades
median_grade = df_clean["Grade"].median(skipna=True)
n_missing = df_clean["Grade"].isna().sum()
df_clean["Grade"] = df_clean["Grade"].fillna(median_grade)
cleaning_log.append(f"Imputed {n_missing} missing grade(s) with median={median_grade:.1f}.")

print("\n".join(cleaning_log))
df_clean.head(10)


## 8) Validate After Cleaning

Double‑check:
- No missing grades
- All grades within `[0, 100]`
- Summary stats look reasonable
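
The first two checks can also be phrased as a reusable function that reports every violation instead of a single yes/no. The sketch below is plain Python with a hypothetical helper name (`validation_problems`); it assumes missing values appear as `None` or `NaN`.

```python
import math


def validation_problems(grades, low=0, high=100):
    """Return human-readable problems; an empty list means the data passed."""
    problems = []
    for i, g in enumerate(grades):
        if g is None or (isinstance(g, float) and math.isnan(g)):
            problems.append(f"row {i}: missing grade")
        elif not low <= g <= high:
            problems.append(f"row {i}: grade {g} outside [{low}, {high}]")
    return problems


print(validation_problems([72.5, 63.25, 0, 54.0, 100]))  # cleaned-style data
print(validation_problems([72.5, None, -10, 104]))       # dirty-style data
```

Returning a list of messages rather than raising on the first failure makes it easier to see all remaining issues in one pass.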


In [None]:

print("Any missing now?", df_clean["Grade"].isna().any())
print("Any out-of-range now?", ((df_clean["Grade"] < 0) | (df_clean["Grade"] > 100)).any())
display(df_clean.describe())


## 9) Visualize Distribution (After Cleaning)

Compare with the earlier plot. Is the distribution more plausible?
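
A numeric comparison can complement the visual one. The sketch below summarizes a small hypothetical sample before and after applying the three cleaning rules by hand; `summarize` is an illustrative helper, and missing values are assumed to be `None`.

```python
from statistics import mean, median


def summarize(grades):
    """Basic summary of the non-missing grades."""
    present = [g for g in grades if g is not None]
    return {
        "n": len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


before = [72.5, None, -10, 540, 104]   # hypothetical dirty sample
after = [72.5, 63.25, 0, 54.0, 100]    # same sample after the three rules

print("before:", summarize(before))
print("after: ", summarize(after))
```

The single `540` pulls the mean of the dirty sample far above any valid grade, while the median moves much less. That gap between mean and median is a quick signal that extreme values are present.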


In [None]:

plt.figure()
df_clean["Grade"].plot(kind="hist", bins=15, edgecolor="black")
plt.title("Grade Distribution (After Cleaning)")
plt.xlabel("Grade")
plt.ylabel("Frequency")
plt.show()


## 10) Save Outputs

Save both the **raw-with-issues** and **cleaned** datasets to `outputs/` for your portfolio repo.


In [None]:

from pathlib import Path
out_dir = Path("outputs")
out_dir.mkdir(exist_ok=True, parents=True)

raw_path = out_dir / "student_grades_raw.csv"
clean_path = out_dir / "student_grades_clean.csv"

df_dirty.to_csv(raw_path, index=False)
df_clean.to_csv(clean_path, index=False)

print(f"Saved raw → {raw_path}")
print(f"Saved clean → {clean_path}")


## 11) Reflection (Short Answer)

Add a few bullets here summarizing what you learned:
- Which checks caught the most issues?
- Which assumptions did you make? Are they defensible?
- If this were a production pipeline, how would you log anomalies and fixes?
- What would you change if you had categorical grades (A/B/C) instead of numeric?
