In [None]:
import os
os.chdir("../")

# EDA for Students Habits Performance
In this notebook, we will explore the data and, later, apply the resulting changes to our `main.py` file.

In [None]:
# Import the data
from src.data_loader import DataLoader

student_data = DataLoader().load_raw_data('student-habits-performance.csv')

## Preprocessor
The columns present in the `pd.DataFrame` are:

In [None]:
from src.data_cleaner import Preprocessor

prep = Preprocessor(student_data)
print(prep.data_frame.columns)

Let's start by removing the `student_id` column, since this will not be relevant for modeling later on.

In [None]:
prep.remove_columns(['student_id'])

## Overall picture of the data
Next, let's look at the `head()`, `describe()` and `info()` of the data, now without the `student_id` column.

In [None]:
print(prep.get_head())

In [None]:
print(prep.data_frame.describe())

Here we see the data type and non-null count of each column, which shows which variables are categorical and which ones have missing values.

In [None]:
print(prep.data_frame.info())

## Handling missing data
In this section, we are going to handle missing values in the `parental_education_level` variable.

The fraction of missing values per variable is the following:

In [None]:
prep.data_frame.isnull().mean().sort_values(ascending=False)
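
As a minimal sketch of what the cell above computes, the example below applies the same expression to a small toy frame. The values and the `study_hours` column are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    "parental_education_level": ["Bachelor", np.nan, "Master", "High School", np.nan],
    "study_hours": [3.0, 2.5, 4.0, 1.5, 3.5],  # hypothetical numeric column
})

# isnull() yields a boolean frame; the column mean is the share of True values,
# i.e. a fraction between 0 and 1 rather than a percentage.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Multiply by 100 to read the result as a percentage.
missing_percent = (missing_fraction * 100).round(1)
print(missing_percent)
```

Sorting in descending order puts the columns that need the most attention at the top of the output.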

Since `parental_education_level` is categorical and its share of NAs is below 10%, replacing the NAs with "Unknown" seems to be the best option.

**Replacing them with the mode could bias the data, since they still make up roughly 10% of the total sample, and removing those rows would discard a sizable part of the sample.**

In [12]:
prep.replace_nas_with_value('parental_education_level')
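
The implementation of `replace_nas_with_value` is not shown in this notebook, so the following is only a plain pandas sketch of the trade-off described above, using made-up values. It compares filling NAs with an "Unknown" category against filling them with the mode.

```python
import numpy as np
import pandas as pd

# Toy series with two missing entries (values are made up).
levels = pd.Series(
    ["High School", "Bachelor", np.nan, "High School", "Master", np.nan],
    name="parental_education_level",
)

# Option 1: keep missingness visible as its own category.
with_unknown = levels.fillna("Unknown")

# Option 2: mode imputation, which inflates the most frequent category.
with_mode = levels.fillna(levels.mode().iloc[0])

print(with_unknown.value_counts())
print(with_mode.value_counts())
```

In this toy case, mode imputation doubles the count of the most frequent level, while the "Unknown" category leaves the observed distribution untouched.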

## Transform categorical str into categorical int
Let's now convert the categorical variables into numeric codes so that models can consume them.

In [None]:
print(prep.see_uniques(['gender', 'diet_quality', 'part_time_job', 'parental_education_level', 'internet_quality', 'extracurricular_participation']))
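
How `see_uniques` works internally is not shown here. A rough pandas equivalent, assuming it simply lists the distinct values of each requested column, could look like the sketch below. The toy values are made up.

```python
import pandas as pd

# Toy frame with two of the categorical columns (values are made up).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Other", "Female"],
    "part_time_job": ["No", "Yes", "No", "No"],
})

# unique() returns the distinct values in order of first appearance.
uniques = {col: toy[col].unique().tolist() for col in toy.columns}

for col, values in uniques.items():
    print(f"{col}: {values}")
```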

Earlier, we saw that there were 91 missing values in the `parental_education_level` variable, roughly 10% of our sample. Those were replaced in the previous section, so the categorical variables can now be encoded.

In [None]:
prep.encode_categoricals()
print(prep.data_frame.describe())
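
The notebook does not show what `encode_categoricals` does internally. One common way to turn string categories into integers is pandas' categorical codes, sketched below under that assumption. The toy values and the `exam_score` column are made up for illustration.

```python
import pandas as pd

# Toy frame (all values are made up; exam_score is a hypothetical numeric column).
toy = pd.DataFrame({
    "diet_quality": ["Poor", "Good", "Fair", "Good"],
    "internet_quality": ["Average", "Good", "Poor", "Good"],
    "exam_score": [55.0, 80.5, 67.0, 91.0],
})

categorical_cols = ["diet_quality", "internet_quality"]

encoded = toy.copy()
mappings = {}
for col in categorical_cols:
    as_category = encoded[col].astype("category")
    # Remember which integer stands for which label, so the codes stay interpretable.
    mappings[col] = dict(enumerate(as_category.cat.categories))
    encoded[col] = as_category.cat.codes

print(encoded)
print(mappings)
```

Note that these codes follow the alphabetical order of the labels, not any natural ranking. For ordered variables such as a quality scale, an explicit category order may be preferable.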