# 1. Data Loading and Initial Cleaning

This notebook focuses on loading the dataset, performing an initial inspection, and handling any basic cleaning tasks like data type corrections or checking for missing values.

## Load Data

In [None]:
import pandas as pd

df = pd.read_csv('Student Habits vs Academic Performance.csv')
df.head()

## Initial Inspection

In [None]:
df.info()

**Observations from `.info()`:**
*   All columns show 1000 non-null entries, so there appear to be no missing values.
*   Data types are mostly appropriate (object for categorical columns, float/int for numeric columns).
*   `student_id` is an identifier and can be set as index or dropped for most analyses.
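The non-null counts from `.info()` can be cross-checked with an explicit count of missing values. The sketch below uses a small made-up CSV in place of the real file. It also assumes we want the literal label `None` kept as a string: recent pandas versions list `None` among the default missing-value markers of `read_csv`, so passing `keep_default_na=False` with `na_values=[""]` is one way to keep it.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real file has more columns and rows.
csv_text = (
    "student_id,age,parental_education_level\n"
    "S1000,23,Master\n"
    "S1001,20,None\n"
    "S1002,21,Bachelor\n"
)

# Disable the default missing-value markers and treat only empty fields
# as missing, so the label "None" is read as an ordinary string.
toy = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

# Count missing values per column; all zeros means no missing data.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# An identifier is only safe to use as an index if it is unique.
ids_are_unique = toy["student_id"].is_unique
indexed = toy.set_index("student_id") if ids_are_unique else toy
```

Whether `None` should count as a real category or as missing data is a judgment call for this dataset; the point is only to make that choice explicit.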

In [None]:
df.describe(include='all')

**Observations from `.describe()`:**
*   `age` ranges from 17 to 24.
*   `study_hours_per_day` ranges from 0 to 8.3.
*   `exam_score` ranges from 18.4 to 100. The mean is around 72.3, close to the 72.5 shown in the HTML (the small gap is likely due to the HTML using simulated data for the JS part, or to different rounding).
*   Categorical columns like `gender`, `diet_quality`, `parental_education_level` show their unique values and frequencies.

## Data Cleaning / Preprocessing (Minimal for this dataset)

Given the dataset appears clean, we'll mainly focus on ensuring categorical variables are explicitly typed as `category` for pandas, which can be beneficial for some plotting libraries and memory efficiency.
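As a rough illustration of the memory claim, the sketch below compares the footprint of a low-cardinality string column before and after conversion. The column contents are made up; the exact byte counts depend on the pandas version and string backend.

```python
import pandas as pd

# Made-up column with a handful of repeated labels, 1000 rows in total.
toy = pd.DataFrame({"gender": ["Female", "Male", "Other", "Female"] * 250})

# deep=True includes the memory held by the string objects themselves.
object_bytes = toy["gender"].memory_usage(deep=True)
category_bytes = toy["gender"].astype("category").memory_usage(deep=True)

print(f"as strings:  {object_bytes} bytes")
print(f"as category: {category_bytes} bytes")
```

The saving comes from storing each label once and keeping only small integer codes per row, so it grows with the number of rows and shrinks as the number of distinct labels rises.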

In [None]:
# Identify categorical columns (those with dtype 'object')
categorical_cols = df.select_dtypes(include='object').columns.tolist()
# We can remove student_id if it's in there, as it's an identifier
if 'student_id' in categorical_cols:
    categorical_cols.remove('student_id')

for col in categorical_cols:
    df[col] = df[col].astype('category')

# Verify changes
df.info()

### Define Ordinal Categories (Important for correct plotting and modeling later)

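Some categorical variables have an inherent order: `diet_quality`, `parental_education_level`, and `internet_quality`. Declaring them as ordered categoricals makes pandas sort and compare them by that order rather than alphabetically.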
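To see what an ordered categorical buys us, here is a minimal sketch on a made-up series that uses the same diet ordering as this notebook. Sorting, comparisons, and `min`/`max` all follow the declared order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same ordering as used for diet_quality in this notebook.
diet_dtype = CategoricalDtype(categories=["Poor", "Fair", "Good"], ordered=True)

# Made-up values for illustration.
diet = pd.Series(["Good", "Poor", "Fair", "Good"]).astype(diet_dtype)

# Sorting follows Poor < Fair < Good, not alphabetical order.
sorted_levels = diet.sort_values().tolist()

# Comparisons against a level are allowed because the dtype is ordered.
at_least_fair = (diet >= "Fair").tolist()

lowest, highest = diet.min(), diet.max()
print(sorted_levels, at_least_fair, lowest, highest)
```

With an unordered categorical, the `>=` comparison and `min`/`max` would raise a `TypeError`, and plots would typically fall back to the order in which the categories happen to be stored.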

In [None]:
from pandas.api.types import CategoricalDtype

# Diet Quality
diet_order = ['Poor', 'Fair', 'Good']
diet_dtype = CategoricalDtype(categories=diet_order, ordered=True)
df['diet_quality'] = df['diet_quality'].astype(diet_dtype)

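# Parental Education Level: expected values are 'None', 'High School', 'Bachelor', 'Master'.
# Print the actual unique values first and confirm the order below covers them:
# astype() silently turns any label missing from the category list into NaN.
# Note: recent pandas versions parse the literal string 'None' as NaN in read_csv by default.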
print(f"Unique parental education levels: {df['parental_education_level'].unique()}")

parental_edu_order = ['None', 'High School', 'Bachelor', 'Master'] # Based on typical progression
parental_edu_dtype = CategoricalDtype(categories=parental_edu_order, ordered=True)
df['parental_education_level'] = df['parental_education_level'].astype(parental_edu_dtype)

# Internet Quality
print(f"Unique internet quality levels: {df['internet_quality'].unique()}")
internet_order = ['Poor', 'Average', 'Good'] # Assuming this order
internet_dtype = CategoricalDtype(categories=internet_order, ordered=True)
df['internet_quality'] = df['internet_quality'].astype(internet_dtype)

# Verify changes
df.info()
print("\nDiet Quality Categories:", df['diet_quality'].cat.categories)
print("Parental Education Categories:", df['parental_education_level'].cat.categories)
print("Internet Quality Categories:", df['internet_quality'].cat.categories)

The data seems ready for exploratory analysis.
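One caveat before moving on: casting to a `CategoricalDtype` silently converts any label that is not in the category list to `NaN`. A small guard can catch a typo or an unexpected level early. The helper below, `to_ordered`, is a hypothetical name introduced for this sketch and is not part of pandas.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def to_ordered(series: pd.Series, order: list[str]) -> pd.Series:
    """Cast to an ordered categorical, refusing labels missing from `order`."""
    # Existing missing values are ignored; only real labels are checked.
    unexpected = set(series.dropna().unique()) - set(order)
    if unexpected:
        raise ValueError(
            f"Unexpected labels in {series.name!r}: {sorted(unexpected)}"
        )
    return series.astype(CategoricalDtype(categories=order, ordered=True))


# Made-up values for illustration.
internet = pd.Series(["Good", "Average", "Poor"], name="internet_quality")
converted = to_ordered(internet, ["Poor", "Average", "Good"])
print(converted.cat.categories.tolist())
```

Comparing `df[col].isna().sum()` before and after each cast gives the same assurance without a helper.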

We can save the processed DataFrame if we want to load this state directly in other notebooks, though for this small dataset, re-running preprocessing is fast.

In [None]:
# df.to_parquet('cleaned_student_data.parquet')  # Optional save; requires pyarrow or fastparquet
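The choice of file format matters here because a plain CSV round trip drops the categorical dtypes set above. The sketch below shows that on a made-up frame, using an in-memory buffer instead of a file, and then one way to restore the dtype by passing it to `read_csv`. Formats that store dtypes, such as Parquet, generally avoid this extra step.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

diet_dtype = CategoricalDtype(categories=["Poor", "Fair", "Good"], ordered=True)

# Made-up frame for illustration.
toy = pd.DataFrame(
    {"diet_quality": pd.Series(["Good", "Poor", "Fair"]).astype(diet_dtype)}
)

# Write to an in-memory CSV buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without a dtype hint yields plain strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing the dtype restores the ordered categorical.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"diet_quality": diet_dtype})

print(plain["diet_quality"].dtype, restored["diet_quality"].dtype)
```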