## Student Performance Factors  
#### Insights into Academic Success and Contributing Elements

##### **About the Dataset:**  
This dataset covers key factors that influence student performance in exams. It includes data on study habits, attendance, parental involvement, and other variables that contribute to academic outcomes, offering insight into the drivers of educational success.

##### **Model Objectives:**  
The goal of this model is to predict students' final exam scores based on the dataset's features. We aim to explore various correlations, including:

- **Exam Score vs Parental Involvement**
- **Exam Score vs Extracurricular Activities**
- **Exam Score vs Motivation Level**
- **Exam Score vs Family Income**
- **Exam Score vs Parental Education**

Using these features, the model will strive to accurately predict each student's final grade, leveraging patterns and relationships identified in the data.


---

#### Step 1 - Set Up Infrastructure
**Goal:** Load the libraries and the dataset.

In [1]:
import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
path = '../dataset/StudentPerformanceFactors.csv'
df_raw = pd.read_csv(path)

#### Step 2 - EDA: Exploratory Data Analysis  
**Goal:** Analyze the dataset and identify the data treatments needed to make the information consistent and reliable.

##### 2.1 - Explore DataFrame  
**Goal:** In this step, we will examine the structure of the DataFrame, ensuring that its format is correct. Our objective is to identify any null or inconsistent data that may require cleaning or further treatment.


In [3]:
# Output the first 5 rows of the dataframe
df_raw.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


In [4]:
# Output summary statistics for the numerical columns of the dataframe
df_raw.describe()

Unnamed: 0,Hours_Studied,Attendance,Sleep_Hours,Previous_Scores,Tutoring_Sessions,Physical_Activity,Exam_Score
count,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0
mean,19.975329,79.977448,7.02906,75.070531,1.493719,2.96761,67.235659
std,5.990594,11.547475,1.46812,14.399784,1.23057,1.031231,3.890456
min,1.0,60.0,4.0,50.0,0.0,0.0,55.0
25%,16.0,70.0,6.0,63.0,1.0,2.0,65.0
50%,20.0,80.0,7.0,75.0,1.0,3.0,67.0
75%,24.0,90.0,8.0,88.0,2.0,4.0,69.0
max,44.0,100.0,10.0,100.0,8.0,6.0,101.0
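
The summary above reports a maximum `Exam_Score` of 101, which would exceed a 100-point scale. A minimal sketch of a range check follows, assuming scores are meant to lie between 0 and 100 (the dataset description here does not state the scale, so that bound is an assumption). The toy frame and the names `SCORE_MIN` and `SCORE_MAX` are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df_raw; values are illustrative, not taken from the file
scores = pd.DataFrame({"Exam_Score": [67, 61, 74, 101, 70]})

# Assumed valid range for an exam score (an assumption, not confirmed by the source)
SCORE_MIN, SCORE_MAX = 0, 100

# Boolean mask of rows whose score falls outside the assumed range
out_of_range = ~scores["Exam_Score"].between(SCORE_MIN, SCORE_MAX)

# Show the offending rows so they can be inspected before deciding how to treat them
print(scores[out_of_range])
n_out_of_range = int(out_of_range.sum())
```

Whether such rows should be clipped, dropped, or kept depends on how the scores were produced, so the sketch only flags them.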


In [28]:
# Select the numerical columns and add the suffix _num to their names
df_numerical = df_raw.select_dtypes(include=[np.number]).add_suffix('_num')
# Select the categorical (object) columns and add the suffix _cat to their names
df_categorical = df_raw.select_dtypes(include=['object']).add_suffix('_cat')
# Merge the two dataframes side by side
df = pd.concat([df_numerical, df_categorical], axis=1)

# Convert all column names to lowercase
df.columns = df.columns.str.lower()

df.head()



Unnamed: 0,hours_studied_num,attendance_num,sleep_hours_num,previous_scores_num,tutoring_sessions_num,physical_activity_num,exam_score_num,parental_involvement_cat,access_to_resources_cat,extracurricular_activities_cat,motivation_level_cat,internet_access_cat,family_income_cat,teacher_quality_cat,school_type_cat,peer_influence_cat,learning_disabilities_cat,parental_education_level_cat,distance_from_home_cat,gender_cat
0,23,84,7,73,0,3,67,Low,High,No,Low,Yes,Low,Medium,Public,Positive,No,High School,Near,Male
1,19,64,8,59,2,4,61,Low,Medium,No,Low,Yes,Medium,Medium,Public,Negative,No,College,Moderate,Female
2,24,98,7,91,2,4,74,Medium,Medium,Yes,Medium,Yes,Medium,Medium,Public,Neutral,No,Postgraduate,Near,Male
3,29,89,8,98,1,4,71,Low,Medium,Yes,Medium,Yes,Medium,Medium,Public,Negative,No,High School,Moderate,Male
4,19,92,6,65,3,4,70,Medium,Medium,Yes,Medium,Yes,Medium,High,Public,Neutral,No,College,Near,Female


In [None]:
# Count the missing values in each column of the dataframe
df.isnull().sum()

In [4]:
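# Drop rows that contain at least one missing value
# Note: this changes df_raw only; df was built earlier as a separate copy and is not affected
df_raw.dropna(inplace=True)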
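
Before dropping rows, it can help to see which columns have gaps and how large they are. The sketch below is one possible way to do that on a toy frame. The column names follow the notebook, but the values and the `report` layout are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; values are illustrative only
toy = pd.DataFrame({
    "teacher_quality_cat": ["Medium", np.nan, "High", "Medium"],
    "distance_from_home_cat": ["Near", "Moderate", np.nan, np.nan],
    "exam_score_num": [67, 61, 74, 71],
})

# Count missing values per column and keep only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]

# Report the count together with the share of rows affected
report = pd.DataFrame({"missing": missing, "share": missing / len(toy)})
print(report)

# dropna() returns a new frame without the rows that contain any missing value
toy_clean = toy.dropna()
print(len(toy), "->", len(toy_clean))
```

Because `dropna()` removes a row if any single column is missing, the number of rows lost can be larger than the largest per-column count.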

##### 2.2 - Converting Dataframe Values  
**Goal:** All categorical variables used for prediction must be converted to numerical values to ensure compatibility with the model.


In [34]:
# Convert each categorical column to numbers, but only if it still has the object dtype
if df['parental_involvement_cat'].dtype == 'object':
    df['parental_involvement_num'] = df['parental_involvement_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})
    
if df['access_to_resources_cat'].dtype == 'object':
    df['access_to_resources_num'] = df['access_to_resources_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['extracurricular_activities_cat'].dtype == 'object':
    df['extracurricular_activities_num'] = df['extracurricular_activities_cat'].map({'No': 0, 'Yes': 1})

if df['motivation_level_cat'].dtype == 'object':
    df['motivation_level_num'] = df['motivation_level_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['internet_access_cat'].dtype == 'object':
    df['internet_access_num'] = df['internet_access_cat'].map({'No': 0, 'Yes': 1})

if df['family_income_cat'].dtype == 'object':
    df['family_income_num'] = df['family_income_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['teacher_quality_cat'].dtype == 'object':
    df['teacher_quality_num'] = df['teacher_quality_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['school_type_cat'].dtype == 'object':
    df['school_type_num'] = df['school_type_cat'].map({'Public': 0, 'Private': 1})

if df['peer_influence_cat'].dtype == 'object':
    df['peer_influence_num'] = df['peer_influence_cat'].map({'Negative': 0, 'Neutral': 5, 'Positive': 10})

if df['learning_disabilities_cat'].dtype == 'object':
    df['learning_disabilities_num'] = df['learning_disabilities_cat'].map({'No': 0, 'Yes': 1})

if df['parental_education_level_cat'].dtype == 'object':
    df['parental_education_level_num'] = df['parental_education_level_cat'].map({'High School': 0, 'College': 5, 'Postgraduate': 10})

if df['distance_from_home_cat'].dtype == 'object':
    df['distance_from_home_num'] = df['distance_from_home_cat'].map({'Near': 0, 'Moderate': 5, 'Far': 10})

if df['gender_cat'].dtype == 'object':
    df['gender_num'] = df['gender_cat'].map({'Female': 0, 'Male': 1})
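
The same conversion can be written more compactly by declaring each scale once and looping over the columns. This is a sketch on a toy frame, not a replacement for the cell above. The names `LEVELS`, `YES_NO`, and `encodings` are hypothetical, and only two columns are shown. It also reports labels missing from a mapping, since `Series.map` silently turns those into `NaN`.

```python
import pandas as pd

# Toy frame; column names follow the notebook, values are illustrative
toy = pd.DataFrame({
    "motivation_level_cat": ["Low", "Medium", "High", None],
    "internet_access_cat": ["Yes", "No", "Yes", "Yes"],
})

# Same scales as in the cell above, declared once and reused
LEVELS = {"Low": 0, "Medium": 5, "High": 10}
YES_NO = {"No": 0, "Yes": 1}
encodings = {
    "motivation_level": LEVELS,
    "internet_access": YES_NO,
}

for name, mapping in encodings.items():
    source = toy[f"{name}_cat"]
    toy[f"{name}_num"] = source.map(mapping)
    # Labels absent from the mapping become NaN; report them,
    # ignoring values that were already missing in the source column
    unmapped = source[source.notna() & toy[f"{name}_num"].isna()].unique()
    if len(unmapped) > 0:
        print(f"{name}: unmapped labels {list(unmapped)}")

print(toy)
```

Note that a column containing missing values ends up with a float dtype after mapping, because `NaN` cannot be stored in an integer column.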

In [35]:
df.head()

Unnamed: 0,hours_studied_num,attendance_num,sleep_hours_num,previous_scores_num,tutoring_sessions_num,physical_activity_num,exam_score_num,parental_involvement_cat,access_to_resources_cat,extracurricular_activities_cat,...,motivation_level_num,internet_access_num,family_income_num,teacher_quality_num,school_type_num,peer_influence_num,learning_disabilities_num,parental_education_level_num,distance_from_home_num,gender_num
0,23,84,7,73,0,3,67,Low,High,No,...,0,1,0,5.0,0,10,0,0.0,0.0,1
1,19,64,8,59,2,4,61,Low,Medium,No,...,0,1,5,5.0,0,0,0,5.0,5.0,0
2,24,98,7,91,2,4,74,Medium,Medium,Yes,...,5,1,5,5.0,0,5,0,10.0,0.0,1
3,29,89,8,98,1,4,71,Low,Medium,Yes,...,5,1,5,5.0,0,0,0,0.0,5.0,1
4,19,92,6,65,3,4,70,Medium,Medium,Yes,...,5,1,5,10.0,0,5,0,5.0,0.0,0
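
With the categorical columns encoded, the correlations listed in the model objectives can be computed directly. The sketch below uses only the five rows shown above, which is far too few for a meaningful estimate, so it illustrates the mechanics rather than any result. It computes Pearson correlation, which treats the 0/5/10 encoding as evenly spaced, and that is itself an assumption.

```python
import pandas as pd

# The first five rows displayed above, restricted to a few numerical columns
toy = pd.DataFrame({
    "exam_score_num": [67, 61, 74, 71, 70],
    "hours_studied_num": [23, 19, 24, 29, 19],
    "motivation_level_num": [0, 0, 5, 5, 5],
    "family_income_num": [0, 5, 5, 5, 5],
})

# Keep only the *_num columns so text columns do not interfere
numeric = toy.filter(like="_num")

# Pearson correlation of each feature with the target, strongest first
corr = (
    numeric.corr()["exam_score_num"]
    .drop("exam_score_num")
    .sort_values(key=abs, ascending=False)
)
print(corr)
```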


In [50]:
# Plot the distribution of the exam score by family income
fig = px.box(df, 
             x='family_income_cat',
             y='exam_score_num', 
             title='Exam Score Distribution by Family Income', 
             labels={'exam_score_num': 'Exam Score', 'family_income_cat': 'Family Income'},
             color='family_income_cat')

# Order the x-axis categories by descending total and apply one color scheme to all boxes
# (update_traces overrides the per-category colors set by color= above)
fig.update_xaxes(categoryorder='total descending')
fig.update_traces(marker_color='rgb(158,202,225)', 
                  marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.8)

fig.show()
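
A box plot shows medians and spread rather than averages, so a small numerical summary per group can complement it. The sketch below uses the five rows displayed earlier as toy input. With so few rows the numbers only demonstrate the call and say nothing about the full dataset.

```python
import pandas as pd

# Family income and exam score from the five rows displayed earlier
toy = pd.DataFrame({
    "family_income_cat": ["Low", "Medium", "Medium", "Medium", "Medium"],
    "exam_score_num": [67, 61, 74, 71, 70],
})

# Count, median, and mean exam score for each family income level
summary = (
    toy.groupby("family_income_cat")["exam_score_num"]
    .agg(["count", "median", "mean"])
)
print(summary)
```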

#### Step 3 - Model Organization

In this step, we organize the data for model training by splitting the dataset into **training** and **testing** subsets. The **training data** is used to fit the model, while the **testing data** is used to evaluate its performance and generalization.
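
A minimal sketch of such a split with scikit-learn's `train_test_split` is shown here on a toy frame. The 80/20 ratio, the `random_state` value, and the choice of feature columns are assumptions made for illustration, and the values are not taken from the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; column names follow the notebook, values are illustrative
toy = pd.DataFrame({
    "hours_studied_num": [23, 19, 24, 29, 19, 20, 25, 17, 21, 18],
    "attendance_num": [84, 64, 98, 89, 92, 88, 84, 78, 94, 98],
    "exam_score_num": [67, 61, 74, 71, 70, 71, 67, 66, 69, 72],
})

# Features are every column except the target
X = toy.drop(columns="exam_score_num")
y = toy["exam_score_num"]

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the same rows in each subset across runs, which makes later comparisons between models easier to interpret.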

The goal of this separation is to ensure that the model is trained on one subset and evaluated on another to prevent overfitting and to provide an unbiased assessment of its
