# AIAP Foundation Self Practice

The objective is to use the provided dataset to predict students' O-level mathematics examination scores, so that the school can identify weaker students before the examination. In your submission, you should evaluate at least 3 suitable models for estimating the students' scores.

## Data Dictionary

| Column               | Description                        |
| -------------------- | ---------------------------------- |
| student_id           | Unique ID for each student         |
| number_of_siblings   | Number of siblings                 |
| direct_admission     | Mode of entering the school        |
| CCA                  | Enrolled CCA                       |
| learning_style       | Primary learning style             |
| tuition              | Indication of whether the student receives tuition |
| final_test           | Student's O-level mathematics examination score   |
| n_male               | Number of male classmates          |
| n_female             | Number of female classmates        |
| gender               | Gender type                        |
| age                  | Age of the student                 |
| hours_per_week       | Number of hours student studies per week          |
| attendance_rate      | Attendance rate of the student (%) |
| sleep_time           | Daily sleeping time (hour:minutes) |
| wake_time            | Daily waking up time (hour:minutes)               |
| mode_of_transport    | Mode of transport to school        |
| bag_color            | Colour of the student's bag        |

<br>
<hr>

## Exploratory Data Analysis

### Load libraries

In [None]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sys
import pprint
from pathlib import Path

### Setting notebook settings

In [2]:
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 999)

In [3]:
# Setting global variables for dataset path and path to save plots and graphs
FILE_PATH = "./data/regression_bonus_practice_data.csv"
PLOT_PATH = "./images/"

### Load data

In [7]:
# Load the dataset: first check the file exists and stop execution with a clear message if not.
file_path_obj = Path(FILE_PATH)
if not file_path_obj.is_file():
    print(f"File not found at: {FILE_PATH}")
    # print("Please check the FILE_PATH variable or place the CSV at the expected location.")
    print("Exiting program.")
    sys.exit(1)
# If the file exists, read it into a DataFrame and show its shape.
df = pd.read_csv(FILE_PATH)
print(f"Dataset loaded: {df.shape[0]:,} rows by {df.shape[1]:,} columns.")

Dataset loaded: 15,900 rows by 18 columns.


### Checking dataset structure

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15900 entries, 0 to 15899
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   index               15900 non-null  int64  
 1   number_of_siblings  15900 non-null  int64  
 2   direct_admission    15900 non-null  object 
 3   CCA                 12071 non-null  object 
 4   learning_style      15900 non-null  object 
 5   student_id          15900 non-null  object 
 6   gender              15900 non-null  object 
 7   tuition             15900 non-null  object 
 8   final_test          15405 non-null  float64
 9   n_male              15900 non-null  float64
 10  n_female            15900 non-null  float64
 11  age                 15900 non-null  float64
 12  hours_per_week      15900 non-null  float64
 13  attendance_rate     15122 non-null  float64
 14  sleep_time          15900 non-null  object 
 15  wake_time           15900 non-null  object 
 16  mode

### Checking for missing values

In [12]:
# Identify columns with missing data
missing_counts = df.isnull().sum()
missing_percent = df.isnull().mean() * 100

# Filter columns with missing values
missing_columns = missing_counts[missing_counts > 0].index.tolist()

# Display columns with missing values, their count, and percentage
print("Columns with missing data:\n")
for col in missing_columns:
    print(
        f"{col:<25} : {missing_counts[col]:>6,} missing ({missing_percent[col]:>5.2f}%)")

Columns with missing data:

CCA                       :  3,829 missing (24.08%)
final_test                :    495 missing ( 3.11%)
attendance_rate           :    778 missing ( 4.89%)


### Checking numerical columns statistics

In [13]:
df.describe().T

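| Column             | count   | mean      | std         | min  | 25%     | 50%    | 75%      | max     |
| ------------------ | ------- | --------- | ----------- | ---- | ------- | ------ | -------- | ------- |
| index              | 15900.0 | 7949.5    | 4590.078975 | 0.0  | 3974.75 | 7949.5 | 11924.25 | 15899.0 |
| number_of_siblings | 15900.0 | 0.886541  | 0.751346    | 0.0  | 0.0     | 1.0    | 1.0      | 2.0     |
| final_test         | 15405.0 | 67.165401 | 13.977879   | 32.0 | 56.0    | 68.0   | 78.0     | 100.0   |
| n_male             | 15900.0 | 13.88     | 6.552584    | 0.0  | 10.0    | 14.0   | 18.0     | 31.0    |
| n_female           | 15900.0 | 8.906038  | 6.663852    | 0.0  | 4.0     | 8.0    | 13.0     | 31.0    |
| age                | 15900.0 | 15.213459 | 1.758941    | -5.0 | 15.0    | 15.0   | 16.0     | 16.0    |
| hours_per_week     | 15900.0 | 10.312579 | 4.461861    | 0.0  | 7.0     | 9.0    | 14.0     | 20.0    |
| attendance_rate    | 15122.0 | 93.270268 | 7.98423     | 40.0 | 92.0    | 95.0   | 97.0     | 100.0   |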


### Checking for duplicates

In [15]:
# Checking for duplicates
if df.duplicated().sum() == 0:
    print('There are no duplicates in the dataset.')
else:
    print('There are duplicates found in the dataset.')
    print('Recommend dropping the duplicates.')

There are no duplicates in the dataset.


### Checking non-numerical columns statistics

In [20]:
# Checking object type columns
# df.select_dtypes(include = 'object').value_counts()
value_counts_dict = {col: df[col].value_counts() for col in df.select_dtypes(include='object').columns}

print('Breakdown of the frequency of values in the object data type.')
print('-' * 60)
pprint.pprint(value_counts_dict)


Breakdown of the frequency of values in the object data type.
------------------------------------------------------------
{'CCA': CCA
Clubs     3912
Sports    3865
Arts      3785
CLUBS      143
NONE       130
ARTS       128
SPORTS     108
Name: count, dtype: int64,
 'bag_color': bag_color
yellow    2731
green     2653
black     2650
blue      2634
red       2620
white     2612
Name: count, dtype: int64,
 'direct_admission': direct_admission
No     11195
Yes     4705
Name: count, dtype: int64,
 'gender': gender
Male      7984
Female    7916
Name: count, dtype: int64,
 'learning_style': learning_style
Auditory    9132
Visual      6768
Name: count, dtype: int64,
 'mode_of_transport': mode_of_transport
public transport     6371
private transport    6323
walk                 3206
Name: count, dtype: int64,
 'sleep_time': sleep_time
23:00    3131
22:00    3067
22:30    3034
21:00    2953
21:30    2875
0:00      240
23:30     183
1:00      122
0:30       93
2:00       81
1:30       73
2:30  

## Data Cleaning

### Inspecting `CCA` column

In [None]:
df['CCA'].value_counts()

### Creating sleep duration column

The 2 columns, **sleep_time** and **wake_time**, appear to store the times at which students go to sleep and wake up for school, in HH:MM format. These 2 columns will be converted into datetime format. The sleep duration of each student is then calculated by subtracting the sleep time from the wake time, and the result is converted into minutes.
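
As a cross-check on the overnight case, the calculation described above can also be sketched with modular arithmetic. This is a minimal sketch on toy data: the helper name `sleep_minutes_mod` and the sample times are illustrative assumptions and are not part of the dataset or the notebook's pipeline.

```python
import pandas as pd


def sleep_minutes_mod(sleep_time: pd.Series, wake_time: pd.Series) -> pd.Series:
    """Sleep duration in minutes for HH:MM strings (hypothetical helper)."""
    sleep = pd.to_datetime(sleep_time, format="%H:%M")
    wake = pd.to_datetime(wake_time, format="%H:%M")
    # Signed difference in minutes; negative when the wake time is earlier
    # on the clock than the sleep time (the overnight case)
    diff_minutes = (wake - sleep).dt.total_seconds() // 60
    # Wrapping modulo one day (1,440 minutes) maps negative values
    # back onto the correct overnight duration
    return (diff_minutes % (24 * 60)).astype(int)


# Toy data (assumed values, for illustration only)
toy_sleep = pd.Series(["23:00", "21:30", "00:30"])
toy_wake = pd.Series(["06:00", "05:30", "07:00"])
toy_minutes = sleep_minutes_mod(toy_sleep, toy_wake)
print(toy_minutes.tolist())
```

On these toy values the durations are 420, 480 and 390 minutes, which matches adding a day to the wake time whenever it falls before the sleep time.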

In [None]:
# Convert the string columns to datetime objects.
# Only the time of day matters, but full datetimes make the arithmetic easy.
df['sleep_time_dt'] = pd.to_datetime(df['sleep_time'], format='%H:%M')
df['wake_time_dt'] = pd.to_datetime(df['wake_time'], format='%H:%M')


# Calculate the duration, handling the overnight case
# np.where(condition, value_if_true, value_if_false)
duration = np.where(
    df['wake_time_dt'] < df['sleep_time_dt'],
    # If wake time is "before" sleep time, add a day to wake time
    df['wake_time_dt'] + pd.Timedelta(days=1) - df['sleep_time_dt'],
    # Otherwise, it's a simple subtraction
    df['wake_time_dt'] - df['sleep_time_dt']
)
df['sleep_duration'] = duration


# Convert the duration (Timedelta) to total minutes
df['sleep_minutes'] = (df['sleep_duration'].dt.total_seconds() / 60).astype(int)

df['sleep_minutes'].describe()

In [None]:
df['sleep_minutes'].value_counts()

On the whole, more than 90% of students get at least 8 hours of sleep. About 2.4% of the students do not get sufficient sleep, although they still sleep for at least 5 hours. We will need to investigate whether sleep duration affects a student's final test result.
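
Percentages like these can be derived directly from a minutes column rather than read off the value counts. The sketch below assumes a hypothetical helper `share_at_least` and uses made-up toy values, so its output does not reproduce the figures quoted above.

```python
import pandas as pd


def share_at_least(minutes: pd.Series, threshold_minutes: int) -> float:
    """Percentage of observations at or above a threshold (hypothetical helper)."""
    # The mean of a boolean mask is the fraction of True values
    return float((minutes >= threshold_minutes).mean() * 100)


# Toy sleep durations in minutes (assumed values, for illustration only)
toy_sleep_minutes = pd.Series([480, 510, 540, 300, 480, 450, 600, 480, 360, 480])

pct_8h = share_at_least(toy_sleep_minutes, 8 * 60)
pct_5h = share_at_least(toy_sleep_minutes, 5 * 60)
print(f"{pct_8h:.1f}% sleep at least 8 hours")
print(f"{pct_5h:.1f}% sleep at least 5 hours")
```

In the notebook, the same helper could be applied to `df['sleep_minutes']` once that column has been created.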

In [None]:
df.head()

### Cleaning `tuition` column

In [None]:
df['tuition'].value_counts()

There are inconsistencies in the values encoded in this column: 4 different values are used to indicate whether a student receives tuition. As such, we will standardize the encoding to only 2 values, 'Y' and 'N'.

We will need to check whether a model can pick up a relationship between receiving tuition and achieving higher marks.
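
One way to make the standardization stricter is to map every known label explicitly and fail loudly on anything unexpected. This is only a sketch: the helper `standardise_tuition` is hypothetical, and the four raw labels are assumed to be 'Yes', 'No', 'Y' and 'N', which is inferred from the replacement mapping rather than shown in the output.

```python
import pandas as pd

# Assumed raw labels: 'Yes', 'No', 'Y' and 'N' (an assumption, see the lead-in)
TUITION_MAP = {"Yes": "Y", "No": "N", "Y": "Y", "N": "N"}


def standardise_tuition(values: pd.Series) -> pd.Series:
    """Map tuition labels to 'Y'/'N' and reject unknown labels (hypothetical helper)."""
    mapped = values.map(TUITION_MAP)
    # Labels that were present in the input but are missing from the mapping
    unexpected = values[mapped.isna() & values.notna()].unique().tolist()
    if unexpected:
        raise ValueError(f"Unexpected tuition labels: {unexpected}")
    return mapped


# Toy data (assumed values, for illustration only)
toy_tuition = pd.Series(["Yes", "N", "No", "Y", "Yes"])
toy_tuition_clean = standardise_tuition(toy_tuition)
print(toy_tuition_clean.value_counts())
```

Compared with `replace`, this approach surfaces new spellings (for example a lowercase variant) as an error instead of silently passing them through.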

In [None]:
tuition_replacement_code = {'Yes': 'Y', 
                            'No': 'N'
                            }
df['tuition'] = df['tuition'].replace(tuition_replacement_code)

df['tuition'].value_counts()

### Filling up missing values in `attendance_rate`

In [None]:
df['attendance_rate'].describe()

This column has 778 missing observations, or 4.89% of the dataset. We will fill them with the median value of the column, which is 95%.

In [None]:
attendance_rate_median = df['attendance_rate'].median()

df['attendance_rate'] = df['attendance_rate'].fillna(attendance_rate_median)

# df['attendance_rate'].describe()

print('Missing values in attendance rate have been replaced with the median.')

### Cleaning `CCA` column

In [None]:
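# Standardize the labels to uppercase so that e.g. 'Clubs' and 'CLUBS' are merged
df['CCA'] = df['CCA'].str.upper()

# Assign the result back instead of using inplace=True on a column selection
df['CCA'] = df['CCA'].fillna('NONE')

print(df['CCA'].value_counts())

print('Missing values in the CCA column have been replaced with NONE and all values converted to uppercase.')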

### Cleaning `final_test` column

In [None]:
final_test_median = df['final_test'].median()
print(f"The median value of the final_test column is {final_test_median}")

# Assign the result back instead of using inplace=True on a column selection
df['final_test'] = df['final_test'].fillna(final_test_median)

The missing values in the `final_test` column have been replaced with the median value of the column, which is 68.0.

In [None]:
# print(df.isnull().sum().sum())

if df.isnull().sum().sum() == 0:
    print('All missing values have been treated.')
else:
    print('There are still missing values in the dataset')
    sys.exit(1)

In [None]:
df.head(10)

### Saving a cleaned copy and removing observations with no `final_test` values

In [None]:
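# Keep a record of rows with no `final_test` value. Since the missing scores
# were filled with the median above, this selection is expected to be empty.
df_null_score = df[df['final_test'].isnull()]

# Drop any rows that still contain missing values
df.dropna(inplace = True)

# Keep a full copy of the cleaned data before removing helper columns
df_cleaned = df.copy(deep = True)

# Remove identifiers and intermediate columns that are not used for modelling
df.drop(columns = ['index', 'student_id', 'sleep_time', 'wake_time', 'sleep_time_dt', 'wake_time_dt', 'sleep_duration'],
        inplace = True)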

### Checking cleaned data structure

In [None]:
df.info()

In [None]:
df.duplicated().sum()

## Univariate analysis

### Pie Charts

In [None]:
# List of columns to plot
columns_to_plot = [
    'number_of_siblings', 'direct_admission', 'learning_style', 
    'gender', 'mode_of_transport', 'bag_color'
]

# Create a figure and a set of subplots
# There are 6 columns, so a 2x3 grid fits them exactly.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(20, 12))
fig.suptitle('Distribution of Categorical Features', fontsize=20)

# Flatten the axes array to make it easy to loop over
axes = axes.flatten()

# Loop through columns and plot on each subplot
for i, column in enumerate(columns_to_plot):
    value_counts = df[column].value_counts()
    ax = axes[i] # Select the subplot
    
    ax.pie(
        value_counts, 
        labels=value_counts.index, 
        autopct='%1.1f%%', 
        startangle=90,
        # Add some styling for a cleaner look
        wedgeprops={'edgecolor': 'white'},
        textprops={'fontsize': 18} 
    )
    ax.set_title(f'{column.replace("_", " ").title()}', fontsize = 20)

# If you have an odd number of plots, you might want to hide the last empty one
# for i in range(len(columns_to_plot), len(axes)):
#     fig.delaxes(axes[i])

plt.tight_layout(rect=[0, 0, 1, 0.96]) # Adjust layout to make room for the suptitle
plt.savefig(PLOT_PATH + "pie_charts_6.png")
plt.show()


#### Analysis notes

- **Number of siblings**: 42.2% of the students in the cleaned dataset have 1 sibling, 34.6% have no siblings, and 23.3% have 2 siblings. It is possible that having siblings is associated with higher test scores; this needs to be checked.
- **Direct admission**: About 70.5% of the students, or roughly 7 in 10, were not admitted directly.
- **Learning style**: 57.4% of students are auditory learners, meaning they acquire and retain knowledge better by listening to lectures, podcasts, or videos. We would need to check whether test scores differ between auditory and visual learners.
- **Gender**: Students are almost evenly split between male and female.
- **Mode of transport**: 
- **Bag color**: The students' bag colors appear to be fairly evenly distributed across all 6 colors. But does the choice of bag color affect a student's mathematics score?
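
Questions such as whether bag color or number of siblings relates to the mathematics score can be given a first look with a grouped summary. The sketch below uses a hypothetical helper `score_summary_by_group` and made-up toy data; it describes group averages only and is not a statistical test.

```python
import pandas as pd


def score_summary_by_group(
    data: pd.DataFrame, group_col: str, score_col: str = "final_test"
) -> pd.DataFrame:
    """Count, mean and median score per category (hypothetical helper)."""
    summary = data.groupby(group_col)[score_col].agg(["count", "mean", "median"])
    # Highest average score first, to make differences easy to spot
    return summary.sort_values("mean", ascending=False)


# Toy data (assumed values, for illustration only)
toy_scores = pd.DataFrame(
    {
        "bag_color": ["red", "red", "blue", "blue", "green"],
        "final_test": [60.0, 70.0, 80.0, 90.0, 50.0],
    }
)
toy_summary = score_summary_by_group(toy_scores, "bag_color")
print(toy_summary)
```

In the notebook, the same call could be made on `df` with any of the categorical columns plotted above. Similar group means would suggest the feature carries little signal on its own.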
  

### Histogram of `CCA`

In [None]:
# Create the histogram using Matplotlib
plt.figure(figsize=(10, 6)) # Set the figure size for better readability

plt.hist(
    df['CCA'], 
    bins=10,          # You can adjust the number of bins to see more or less detail
    color='skyblue',  # Set the color of the bars
    edgecolor='black' # Add black edges to bars for better separation
)

# Add labels and a title for clarity
plt.title('Distribution of CCA Participation', fontsize=16)
plt.xlabel('CCA Participation', fontsize=12)
plt.ylabel('Frequency (Number of Students)', fontsize=12)
plt.grid(axis='y', alpha=0.75) # Add a grid for the y-axis

# Save the plot
plt.savefig(PLOT_PATH + "bar_chart_CCA.png")

# Display the plot
plt.show()


#### Analysis notes

In the initial dataset, about 24% of observations (3,829 rows) had a missing `CCA` value.
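
Because `CCA` is a categorical column, a bar chart of the category counts may be easier to read than a binned histogram. The following is a minimal sketch of that alternative, with a hypothetical helper `plot_category_counts` and made-up toy labels; it is not the plot produced by the cell above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(values: pd.Series, title: str):
    """Bar chart of category frequencies (hypothetical helper)."""
    # One bar per category, ordered from most to least frequent
    counts = values.value_counts()
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(counts.index.astype(str), counts.to_numpy(),
           color="skyblue", edgecolor="black")
    ax.set_title(title, fontsize=16)
    ax.set_xlabel(values.name or "category", fontsize=12)
    ax.set_ylabel("Frequency (Number of Students)", fontsize=12)
    ax.grid(axis="y", alpha=0.75)
    return fig, ax, counts


# Toy data (assumed values, for illustration only)
toy_cca = pd.Series(
    ["CLUBS", "SPORTS", "NONE", "CLUBS", "ARTS", "NONE", "NONE"], name="CCA"
)
fig, ax, toy_counts = plot_category_counts(toy_cca, "Distribution of CCA Participation")
```

In the notebook, calling the helper on `df['CCA']` followed by `plt.show()` would display one bar per CCA category, including the `NONE` group created during cleaning.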