# Student Performance Factors  
#### Insights into Academic Success and Contributing Elements

##### **About the Dataset:**  
This dataset covers key factors that may influence student performance in exams. It includes data on study habits, attendance, parental involvement, and other variables related to academic outcomes, which makes it useful for examining what drives educational success.

##### **Model Objectives:**  
The goal of this model is to predict students' final exam scores based on the dataset's features. We aim to explore various correlations, including:

- **Exam Score vs Parental Involvement**
- **Exam Score vs Extracurricular Activities**
- **Exam Score vs Motivation Level**
- **Exam Score vs Family Income**
- **Exam Score vs Parental Education**

Using these features, the model will strive to accurately predict each student's final grade, leveraging patterns and relationships identified in the data.


---

### Step 1 - Load Required Libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

### Step 2 - Load Dataset


In [2]:
path = '../dataset/StudentPerformanceFactors.csv'
df = pd.read_csv(path)

### Step 3 - EDA - Exploratory Data Analysis

#### 3.1 - Explore DataFrame
##### **Goal:** In this step we look at the structure of `df` to verify it and to search for null or inconsistent data

In [3]:
# Output the first 5 rows of the dataframe
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
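
Because `head()` shows only values, it can help to list which columns hold text and which hold numbers before any cleaning. The sketch below is one way to do that. It rebuilds a few columns of the rows printed above as a stand-in frame called `sample` (an assumption for illustration; in the notebook the same calls would run on `df`).

```python
import pandas as pd

# Stand-in for df: a few columns from the first rows printed above
sample = pd.DataFrame({
    'Hours_Studied': [23, 19, 24],
    'Parental_Involvement': ['Low', 'Low', 'Medium'],
    'School_Type': ['Public', 'Public', 'Public'],
    'Exam_Score': [67, 61, 74],
})

# Split the columns by kind: numeric columns vs. everything else (text categories)
numeric_cols = sample.select_dtypes(include='number').columns.tolist()
categorical_cols = sample.select_dtypes(exclude='number').columns.tolist()

print('Numeric:', numeric_cols)
print('Categorical:', categorical_cols)
```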


In [4]:
df.describe()

Unnamed: 0,Hours_Studied,Attendance,Sleep_Hours,Previous_Scores,Tutoring_Sessions,Physical_Activity,Exam_Score
count,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0
mean,19.975329,79.977448,7.02906,75.070531,1.493719,2.96761,67.235659
std,5.990594,11.547475,1.46812,14.399784,1.23057,1.031231,3.890456
min,1.0,60.0,4.0,50.0,0.0,0.0,55.0
25%,16.0,70.0,6.0,63.0,1.0,2.0,65.0
50%,20.0,80.0,7.0,75.0,1.0,3.0,67.0
75%,24.0,90.0,8.0,88.0,2.0,4.0,69.0
max,44.0,100.0,10.0,100.0,8.0,6.0,101.0
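
The summary above reports a maximum `Exam_Score` of 101. If scores are meant to lie on a 0 to 100 scale (an assumption; the dataset description here does not state the scale), such rows are worth inspecting as possibly inconsistent data. A minimal sketch, using made-up values in a stand-in frame called `scores`:

```python
import pandas as pd

# Illustrative values only; in the notebook this check would run on df
scores = pd.DataFrame({'Exam_Score': [67, 61, 101, 74]})

# Assumed valid range for a score (not confirmed by the dataset description)
low, high = 0, 100

# Rows whose score falls outside the assumed range
out_of_range = scores[~scores['Exam_Score'].between(low, high)]
print(out_of_range)
```

Whether to drop, cap, or keep such rows depends on how the scores were produced, so this check only flags them.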


In [5]:

# Lowercase the column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Output the first 5 rows of the dataframe after renaming the columns
df.head()


Unnamed: 0,hours_studied,attendance,parental_involvement,access_to_resources,extracurricular_activities,sleep_hours,previous_scores,motivation_level,internet_access,tutoring_sessions,family_income,teacher_quality,school_type,peer_influence,physical_activity,learning_disabilities,parental_education_level,distance_from_home,gender,exam_score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


In [6]:
# Count the missing values in each column
df.isnull().sum()

hours_studied                  0
attendance                     0
parental_involvement           0
access_to_resources            0
extracurricular_activities     0
sleep_hours                    0
previous_scores                0
motivation_level               0
internet_access                0
tutoring_sessions              0
family_income                  0
teacher_quality               78
school_type                    0
peer_influence                 0
physical_activity              0
learning_disabilities          0
parental_education_level      90
distance_from_home            67
gender                         0
exam_score                     0
dtype: int64

In [7]:
# The columns with missing values are categorical, not numerical,
# so the gaps cannot be filled with the median.
# Remove the rows with missing values
df.dropna(inplace=True)
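
Dropping rows is the approach used here. As an alternative that keeps every row, missing categorical values could be filled with each column's most frequent value (the mode). The sketch below shows that idea on a small made-up frame; `sample` and its values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative frame with missing categorical values (not real dataset rows)
sample = pd.DataFrame({
    'teacher_quality': ['Medium', None, 'High', 'Medium'],
    'distance_from_home': ['Near', 'Near', None, 'Far'],
})

# Fill each column's gaps with its most frequent value
filled = sample.copy()
for column in filled.columns:
    most_frequent = filled[column].mode().iloc[0]
    filled[column] = filled[column].fillna(most_frequent)

# Total number of missing values left after filling
print(filled.isnull().sum().sum())
```

Mode imputation keeps the sample size but can overstate the most common category. With 235 missing cells in 6,607 rows, dropping removes at most about 3.6% of the data, so either choice is defensible.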

In [8]:
# Compare exam_score with hours_studied
fig = px.scatter(
    df,
    x='hours_studied',
    y='exam_score',
    title='Exam Score vs Studied Hours',
    labels={'hours_studied': 'Studied Hours', 'exam_score': 'Exam Score'},
)

fig.show()
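
A correlation coefficient can put a number on the trend seen in the scatter plot. The sketch below computes Pearson's r with pandas on the five rows printed earlier, held in a stand-in frame called `sample`. Five rows are far too few for a meaningful estimate, so in the notebook the same call would run on the full `df`.

```python
import pandas as pd

# Stand-in for df: the five rows shown by df.head()
sample = pd.DataFrame({
    'hours_studied': [23, 19, 24, 29, 19],
    'exam_score': [67, 61, 74, 71, 70],
})

# Pearson correlation between study hours and exam score (ranges from -1 to 1)
r = sample['hours_studied'].corr(sample['exam_score'])
print(round(r, 3))
```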

#### 3.2 - Convert Dataframe Values

##### All categorical values need to be converted to numerical values


In [11]:
# Mappings shared by several columns
level_map = {'Low': 0, 'Medium': 5, 'High': 10}
yes_no_map = {'No': 0, 'Yes': 1}

# Mapping to apply to each categorical column
column_maps = {
    'parental_involvement': level_map,
    'access_to_resources': level_map,
    'extracurricular_activities': yes_no_map,
    'motivation_level': level_map,
    'internet_access': yes_no_map,
    'family_income': level_map,
    'teacher_quality': level_map,
    'school_type': {'Public': 0, 'Private': 1},
    'peer_influence': {'Negative': 0, 'Neutral': 5, 'Positive': 10},
    'learning_disabilities': yes_no_map,
    'parental_education_level': {'High School': 0, 'College': 5, 'Postgraduate': 10},
    'distance_from_home': {'Near': 0, 'Moderate': 5, 'Far': 10},
    'gender': {'Female': 0, 'Male': 1},
}

for column, mapping in column_maps.items():
    # Check the column type before converting, so re-running the cell is safe
    if df[column].dtype == 'object':
        df[column] = df[column].map(mapping)

df.head()


Unnamed: 0,hours_studied,attendance,parental_involvement,access_to_resources,extracurricular_activities,sleep_hours,previous_scores,motivation_level,internet_access,tutoring_sessions,family_income,teacher_quality,school_type,peer_influence,physical_activity,learning_disabilities,parental_education_level,distance_from_home,gender,exam_score
0,23,84,0,10,0,7,73,0,1,0,0,5,0,10,3,0,0,0,1,67
1,19,64,0,5,0,8,59,0,1,2,5,5,0,0,4,0,5,5,0,61
2,24,98,5,5,1,7,91,5,1,2,5,5,0,5,4,0,10,0,1,74
3,29,89,0,5,1,8,98,5,1,1,5,5,0,0,4,0,0,5,1,71
4,19,92,5,5,1,6,65,5,1,3,5,10,0,5,4,0,5,0,0,70
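
One thing to watch with `Series.map` and a dictionary: any label missing from the dictionary becomes `NaN` without raising an error. A quick check after conversion can catch a misspelled or unexpected category. The sketch below uses an illustrative column with a deliberately unexpected label (`'Very High'` is hypothetical, not a value from the dataset).

```python
import pandas as pd

# Illustrative column; 'Very High' is a hypothetical label the mapping does not cover
levels = pd.Series(['Low', 'Medium', 'Very High', 'High'], name='motivation_level')
mapping = {'Low': 0, 'Medium': 5, 'High': 10}

converted = levels.map(mapping)

# Labels that were present before mapping but came out as NaN
unmapped = levels[converted.isna() & levels.notna()].unique().tolist()
print(unmapped)
```

Running a check like this on each converted column before modeling would confirm that the conversion introduced no new missing values.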
