## Student Performance Factors  
##### Insights into Academic Success and Contributing Elements

##### **About the Dataset:**  
This dataset provides a comprehensive analysis of key factors that influence student performance in exams. It includes data on study habits, attendance, parental involvement, and other variables that contribute to academic outcomes, offering valuable insights into the drivers of educational success.

##### **Model Objectives:**  
The goal of this model is to predict students' final exam scores based on the dataset's features. We aim to explore various correlations, including:

- **Exam Score vs Parental Involvement**

- **Exam Score vs Extracurricular Activities**

- **Exam Score vs Motivation Level**

- **Exam Score vs Family Income**

- **Exam Score vs Parental Education**

Using these features, the model will strive to accurately predict each student's final grade, leveraging patterns and relationships identified in the data.


---

#### Step 1 - Set Up Infrastructure
**Goal:** Load the libraries and the dataset.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
path = '../dataset/StudentPerformanceFactors.csv'
df = pd.read_csv(path)

#### Step 2 - EDA: Exploratory Data Analysis  
**Goal:** Analyze the dataset and identify the data treatments needed to ensure the information is consistent and reliable.

##### 2.1 - Explore DataFrame  
**Goal:** In this step, we will examine the structure of the DataFrame, ensuring that its format is correct. Our objective is to identify any null or inconsistent data that may require cleaning or further treatment.


In [3]:
# Output the first 5 rows of the dataframe
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


In [4]:
# Output summary statistics for the numerical columns of the dataframe
df.describe()

Unnamed: 0,Hours_Studied,Attendance,Sleep_Hours,Previous_Scores,Tutoring_Sessions,Physical_Activity,Exam_Score
count,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0,6607.0
mean,19.975329,79.977448,7.02906,75.070531,1.493719,2.96761,67.235659
std,5.990594,11.547475,1.46812,14.399784,1.23057,1.031231,3.890456
min,1.0,60.0,4.0,50.0,0.0,0.0,55.0
25%,16.0,70.0,6.0,63.0,1.0,2.0,65.0
50%,20.0,80.0,7.0,75.0,1.0,3.0,67.0
75%,24.0,90.0,8.0,88.0,2.0,4.0,69.0
max,44.0,100.0,10.0,100.0,8.0,6.0,101.0
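
The summary above reports a maximum `Exam_Score` of 101, so a range check may be worth adding to this step. The following is a minimal sketch, assuming that scores and attendance should fall between 0 and 100 (an assumption, not something the dataset states). The `sample` frame and the `out_of_range_counts` helper are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset
sample = pd.DataFrame({
    "Attendance": [84, 64, 98],
    "Exam_Score": [67, 101, 74],
})

# Assumed valid ranges (not stated by the dataset itself)
valid_ranges = {"Attendance": (0, 100), "Exam_Score": (0, 100)}

def out_of_range_counts(frame, ranges):
    """Count values outside the inclusive [low, high] range of each column."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }

counts = out_of_range_counts(sample, valid_ranges)
print(counts)
```

Any column with a non-zero count would then be a candidate for clipping, correction, or removal, depending on how the values were recorded.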


In [5]:
# Select the numerical columns and add the suffix _num to their names
df_numerical = df.select_dtypes(include=[np.number])
df_numerical.columns = [str(col) + '_num' for col in df_numerical.columns]
# Select the categorical (object) columns and add the suffix _cat to their names
df_categorical = df.select_dtypes(include=[np.object_])
df_categorical.columns = [str(col) + '_cat' for col in df_categorical.columns]
# Merge the two dataframes
df = pd.concat([df_numerical, df_categorical], axis=1)

# Convert all column names to lowercase
df.columns = df.columns.str.lower()

# Create an identifier column for each row
df['register_code'] = range(1, 1+len(df))



df.head()



Unnamed: 0,hours_studied_num,attendance_num,sleep_hours_num,previous_scores_num,tutoring_sessions_num,physical_activity_num,exam_score_num,parental_involvement_cat,access_to_resources_cat,extracurricular_activities_cat,...,internet_access_cat,family_income_cat,teacher_quality_cat,school_type_cat,peer_influence_cat,learning_disabilities_cat,parental_education_level_cat,distance_from_home_cat,gender_cat,register_code
0,23,84,7,73,0,3,67,Low,High,No,...,Yes,Low,Medium,Public,Positive,No,High School,Near,Male,1
1,19,64,8,59,2,4,61,Low,Medium,No,...,Yes,Medium,Medium,Public,Negative,No,College,Moderate,Female,2
2,24,98,7,91,2,4,74,Medium,Medium,Yes,...,Yes,Medium,Medium,Public,Neutral,No,Postgraduate,Near,Male,3
3,29,89,8,98,1,4,71,Low,Medium,Yes,...,Yes,Medium,Medium,Public,Negative,No,High School,Moderate,Male,4
4,19,92,6,65,3,4,70,Medium,Medium,Yes,...,Yes,Medium,High,Public,Neutral,No,College,Near,Female,5


In [6]:
# Count if there are any missing values in the dataframe 
df.isnull().sum()

hours_studied_num                  0
attendance_num                     0
sleep_hours_num                    0
previous_scores_num                0
tutoring_sessions_num              0
physical_activity_num              0
exam_score_num                     0
parental_involvement_cat           0
access_to_resources_cat            0
extracurricular_activities_cat     0
motivation_level_cat               0
internet_access_cat                0
family_income_cat                  0
teacher_quality_cat               78
school_type_cat                    0
peer_influence_cat                 0
learning_disabilities_cat          0
parental_education_level_cat      90
distance_from_home_cat            67
gender_cat                         0
register_code                      0
dtype: int64

In [7]:
# Drop every row that contains at least one missing value
df.dropna(inplace=True)
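
Dropping rows is the approach taken here. As a hedged illustration, the sketch below counts how many rows `dropna` removes and shows one common alternative, filling a categorical column with its most frequent value. The `sample` frame is hypothetical, and the imputation is an assumption, not the notebook's method.

```python
import pandas as pd

# Hypothetical sample with a missing categorical value
sample = pd.DataFrame({
    "teacher_quality_cat": ["Medium", None, "High", "Medium"],
    "exam_score_num": [67, 61, 74, 71],
})

# Approach used in this notebook: drop incomplete rows
dropped = sample.dropna()
rows_lost = len(sample) - len(dropped)

# Alternative (assumption): fill missing labels with the most frequent category
mode_value = sample["teacher_quality_cat"].mode()[0]
filled = sample.fillna({"teacher_quality_cat": mode_value})

print(f"Rows lost by dropna: {rows_lost}")
print(filled)
```

Dropping keeps only observed values, while imputation keeps every row at the cost of introducing guessed labels.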



##### 2.2 - Converting Dataframe Values  
**Goal:** All categorical variables used for prediction must be converted to numerical values to ensure compatibility with the model.


In [8]:
# Check each column's dtype before converting it to numerical values
if df['parental_involvement_cat'].dtype == 'object':
    df['parental_involvement_num'] = df['parental_involvement_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})
    
if df['access_to_resources_cat'].dtype == 'object':
    df['access_to_resources_num'] = df['access_to_resources_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['extracurricular_activities_cat'].dtype == 'object':
    df['extracurricular_activities_num'] = df['extracurricular_activities_cat'].map({'No': 0, 'Yes': 1})

if df['motivation_level_cat'].dtype == 'object':
    df['motivation_level_num'] = df['motivation_level_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['internet_access_cat'].dtype == 'object':
    df['internet_access_num'] = df['internet_access_cat'].map({'No': 0, 'Yes': 1})

if df['family_income_cat'].dtype == 'object':
    df['family_income_num'] = df['family_income_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['teacher_quality_cat'].dtype == 'object':
    df['teacher_quality_num'] = df['teacher_quality_cat'].map({'Low': 0, 'Medium': 5, 'High': 10})

if df['school_type_cat'].dtype == 'object':
    df['school_type_num'] = df['school_type_cat'].map({'Public': 0, 'Private': 1})

if df['peer_influence_cat'].dtype == 'object':
    df['peer_influence_num'] = df['peer_influence_cat'].map({'Negative': 0, 'Neutral': 5, 'Positive': 10})

if df['learning_disabilities_cat'].dtype == 'object':
    df['learning_disabilities_num'] = df['learning_disabilities_cat'].map({'No': 0, 'Yes': 1})

if df['parental_education_level_cat'].dtype == 'object':
    df['parental_education_level_num'] = df['parental_education_level_cat'].map({'High School': 0, 'College': 5, 'Postgraduate': 10})

if df['distance_from_home_cat'].dtype == 'object':
    df['distance_from_home_num'] = df['distance_from_home_cat'].map({'Near': 0, 'Moderate': 5, 'Far': 10})

if df['gender_cat'].dtype == 'object':
    df['gender_num'] = df['gender_cat'].map({'Female': 0, 'Male': 1})

In [9]:
df.head()

Unnamed: 0,hours_studied_num,attendance_num,sleep_hours_num,previous_scores_num,tutoring_sessions_num,physical_activity_num,exam_score_num,parental_involvement_cat,access_to_resources_cat,extracurricular_activities_cat,...,motivation_level_num,internet_access_num,family_income_num,teacher_quality_num,school_type_num,peer_influence_num,learning_disabilities_num,parental_education_level_num,distance_from_home_num,gender_num
0,23,84,7,73,0,3,67,Low,High,No,...,0,1,0,5,0,10,0,0,0,1
1,19,64,8,59,2,4,61,Low,Medium,No,...,0,1,5,5,0,0,0,5,5,0
2,24,98,7,91,2,4,74,Medium,Medium,Yes,...,5,1,5,5,0,5,0,10,0,1
3,29,89,8,98,1,4,71,Low,Medium,Yes,...,5,1,5,5,0,0,0,0,5,1
4,19,92,6,65,3,4,70,Medium,Medium,Yes,...,5,1,5,10,0,5,0,5,0,0
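
The repeated `if` blocks above can also be written as a loop over a dictionary of mappings. The sketch below is one possible compact form, shown on a hypothetical `sample` frame with two of the columns. It also checks for labels that the mapping does not cover, because `Series.map` returns `NaN` for them without raising an error.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned DataFrame
sample = pd.DataFrame({
    "motivation_level_cat": ["Low", "Medium", "High"],
    "internet_access_cat": ["Yes", "No", "Yes"],
})

# Shared encodings, reusing the values chosen in the cell above
ordinal = {"Low": 0, "Medium": 5, "High": 10}
binary = {"No": 0, "Yes": 1}

mappings = {
    "motivation_level_cat": ordinal,
    "internet_access_cat": binary,
}

for col, mapping in mappings.items():
    # Replace the trailing "_cat" with "_num" to build the new column name
    new_col = col[: -len("_cat")] + "_num"
    sample[new_col] = sample[col].map(mapping)
    # map() yields NaN for labels missing from the mapping, so fail loudly
    assert sample[new_col].notna().all(), f"Unmapped labels in {col}"

print(sample)
```

Note that the 0/5/10 encoding assumes the categories are evenly spaced, which is a modeling choice and not a property of the data.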


##### 2.3 - Visualizing the Relationship Between Family Income and Exam Scores

In this section, we generate a **box plot** to visualize how **exam scores** are distributed within each **family income category**. A box plot shows the median, the quartiles, and the spread of the data, and it makes potential outliers easy to spot.


In [10]:
# Plot the distribution of the exam score by family income
fig = px.box(df, 
             x='family_income_cat',
             y='exam_score_num', 
             title='Exam Score Distribution by Family Income', 
             labels={'exam_score_num': 'Exam Score', 'family_income_cat': 'Family Income'},
             color='family_income_cat',)

# Order the x-axis categories in descending order and apply a uniform color to the boxes
fig.update_xaxes(categoryorder='total descending')
fig.update_traces(marker_color='rgb(158,202,225)', 
                  marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.8)

fig.show()

In [11]:
# Plot the previous score and the exam score of each student
fig = go.Figure()

# Previous score line
fig.add_trace(go.Scatter(x=df['register_code'], y=df['previous_scores_num'], mode='lines+markers',
                         name='Previous Score', line=dict(color='blue', width=4), marker=dict(size=8)
))

# Exam score line
fig.add_trace(go.Scatter(x=df['register_code'], y=df['exam_score_num'], mode='lines+markers',
                         name='Exam Score', line=dict(color='red', width=4), marker=dict(size=8)
))

fig.update_layout(
    title="Previous Score vs Actual Score Values",
    xaxis_title="Index",
    yaxis_title="Score",
    legend_title="Legend",
    template="plotly_white"
)

# Show the chart
fig.show() 




In [12]:
# Plot the average exam score by parental involvement
# px.bar sums the y values of each category, so the mean is computed first
df_avg = df.groupby('parental_involvement_cat', as_index=False)['exam_score_num'].mean()
fig = px.bar(df_avg, 
             x='parental_involvement_cat',
             y='exam_score_num', 
             title='Average Exam Score by Parental Involvement', 
             labels={'exam_score_num': 'Average Exam Score', 'parental_involvement_cat': 'Parental Involvement'})

# Order the bars by ascending average score and apply a uniform color to the bars
fig.update_xaxes(categoryorder='total ascending')
fig.update_traces(marker_color='rgb(158,202,225)', 
                  marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.8)

fig.show()


##### 2.4 - Visualizing the Correlation Matrix
In this section, we calculate and plot the correlation matrix for all the numerical features in our dataset. Correlation is a statistical measure that indicates the extent to which two variables are related. It ranges from -1 to 1, where:

-  1 means a perfect positive correlation (when one feature increases, the other increases as well);
- -1 means a perfect negative correlation (when one feature increases, the other decreases);
-  0 indicates no linear correlation between the variables.

In [13]:
# Select numerical columns and calculate the correlation matrix
df_corr = df.select_dtypes(include=[np.number]).corr()

# Plot the correlation matrix
fig = px.imshow(df_corr, 
                color_continuous_scale='Viridis',
                labels=dict(x='Features', y='Features', color='Correlation'), width=800, height=800)
fig.show()
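
A heatmap of every feature pair can be hard to read when the main interest is the target. As a complementary sketch, the snippet below ranks features by the absolute value of their correlation with the target column. It uses synthetic data with hypothetical coefficients, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one informative feature and one unrelated feature
rng = np.random.default_rng(42)
n = 200
hours = rng.normal(20, 6, n)
sample = pd.DataFrame({
    "hours_studied_num": hours,
    "noise_num": rng.normal(0, 1, n),
    "exam_score_num": 60 + 0.3 * hours + rng.normal(0, 1, n),
})

# Pearson correlation of each feature with the target, excluding the target itself
corr_with_target = sample.corr()["exam_score_num"].drop("exam_score_num")

# Sort by absolute strength while keeping the sign of each correlation
order = corr_with_target.abs().sort_values(ascending=False).index
target_corr = corr_with_target.reindex(order)
print(target_corr)
```

Applied to the notebook's DataFrame, the same pattern would use `df.select_dtypes(include=[np.number]).corr()["exam_score_num"]`. The `register_code` column is an arbitrary row identifier, so its correlation carries no meaning.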

#### Step 3 - Training the Model

At this step, we will organize and structure the data for the model training process. This involves the crucial task of splitting the dataset into **training** and **testing** subsets. The **training data** will be used to fit the model, while the **testing data** will be used to evaluate its performance and generalization. 

The goal of this separation is to ensure that the model is trained on one subset and evaluated on another, so that overfitting can be detected and the assessment of its performance on unseen data is unbiased.


##### 3.1 - Split Data into Numerical and Categorical

In [14]:
df_numerical = df.select_dtypes(include=[np.number])
df_categorical  = df.select_dtypes(include=[np.object_])

In [15]:
# I decided to create two new dataframes, one for the target variable and another for the features.
# The target variable is the exam_score_num column
df_target = df_numerical['exam_score_num']

# The features are all the columns except the exam_score_num column
df_train = df_numerical.drop(['exam_score_num'], axis=1)


In [40]:
# Separate the data into training and test sets
from sklearn.model_selection import train_test_split

# Hold out 33% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(df_train, df_target, test_size=.33, random_state=42)


In [54]:
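from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# The classifier treats each distinct exam score as a separate class label
scaler = StandardScaler()
ml_model = LogisticRegression(max_iter=100000, solver='saga')

# Fit the scaler on the training features only
x_train_scaled = scaler.fit_transform(x_train)

ml_model.fit(x_train_scaled, y_train)

# Note: this is the accuracy on the training set, not on the held-out test set
print(f'Model Accuracy: {ml_model.score(x_train_scaled, y_train) * 100:.2f}%')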


Model Accuracy: 65.62%
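
The accuracy printed above is computed on the training data. To measure generalization, the held-out split should be scaled with the statistics learned from the training split and then scored separately. The sketch below shows that pattern on a synthetic classification problem generated with `make_classification`. The data and the resulting numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's features and target
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean and std from the training split
x_test_scaled = scaler.transform(x_test)        # reuse them; never refit on the test split

model = LogisticRegression(max_iter=1000)
model.fit(x_train_scaled, y_train)

train_acc = model.score(x_train_scaled, y_train)
test_acc = model.score(x_test_scaled, y_test)
print(f"Training accuracy: {train_acc * 100:.2f}%")
print(f"Test accuracy:     {test_acc * 100:.2f}%")
```

A large gap between the two figures would point to overfitting. In the notebook itself, the equivalent call would be `ml_model.score(scaler.transform(x_test), y_test)`.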


##### 3.2 - Model Performance Summary

The trained model was evaluated on the training set (the report below is computed from `x_train_scaled` and `y_train`), and the classification metrics (precision, recall, F1-score) are presented for each class in the table below. The training set consisted of 4273 samples with 23 distinct classes (57–79). The overall model performance is as follows:

- **Accuracy**: The overall accuracy of the model is **66%**, meaning that 66% of the predictions match the actual labels.

- **Macro Average**: The macro average for precision, recall, and F1-score is **0.40**, **0.33**, and **0.35**, respectively. This metric is calculated by averaging the performance of each class equally, without taking into account the class imbalance.

- **Weighted Average**: The weighted average takes the class distribution into account and results in **precision** of **0.65**, **recall** of **0.66**, and an **F1-score** of **0.64**.

##### Insights:
- The model performs well for the most frequent classes, such as **66** (precision = 0.72, recall = 0.80, F1-score = 0.76), **67** (F1-score = 0.80), and **68** (F1-score = 0.79), which likely contribute to the high overall accuracy.

- For classes with fewer samples (e.g., **57**, **76**, **77**, **78**, **79**), the model struggles, as indicated by precision, recall, and F1-scores of **0.00** across these categories. This could be due to the small number of support cases, leading to poor generalization in these classes.

- The performance on intermediate classes (e.g., **61**, **62**, **63**) shows moderate precision and recall, indicating room for improvement in predicting less frequent classes.

##### Recommendations:
- Consider techniques such as **resampling** to handle class imbalance.

- Investigate further tuning of the model hyperparameters or test alternative algorithms to improve the prediction for minority classes.
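
One alternative algorithm worth testing is a regression model, since the exam score is a number and the classifier ignores the ordering of its classes. The sketch below fits `LinearRegression` (already imported in Step 1) on synthetic data whose coefficients and noise level are made up for illustration. It reports mean absolute error and R² on the held-out split. It is a possible direction, not a result from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features loosely shaped like hours studied and attendance
rng = np.random.default_rng(42)
n = 500
hours = rng.uniform(1, 44, n)
attendance = rng.uniform(60, 100, n)
X = np.column_stack([hours, attendance])

# Hypothetical linear relationship plus noise (illustrative coefficients)
y = 40 + 0.3 * hours + 0.2 * attendance + rng.normal(0, 2, n)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

reg = LinearRegression().fit(x_train, y_train)
pred = reg.predict(x_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MAE: {mae:.2f} points, R^2: {r2:.3f}")
```

Unlike accuracy, the mean absolute error rewards predictions that are close to the true score even when they are not exact.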


In [58]:
from sklearn.metrics import classification_report

print(classification_report(y_train, ml_model.predict(x_train_scaled)))

              precision    recall  f1-score   support

          57       0.00      0.00      0.00         4
          58       0.75      0.23      0.35        13
          59       0.73      0.30      0.42        27
          60       0.48      0.23      0.31        53
          61       0.45      0.41      0.43       113
          62       0.49      0.54      0.52       179
          63       0.58      0.50      0.54       230
          64       0.63      0.60      0.61       317
          65       0.66      0.74      0.70       425
          66       0.72      0.80      0.76       486
          67       0.80      0.80      0.80       490
          68       0.71      0.89      0.79       493
          69       0.78      0.55      0.64       377
          70       0.62      0.79      0.69       363
          71       0.55      0.50      0.52       267
          72       0.51      0.55      0.53       206
          73       0.50      0.29      0.36        87
          74       0.42    


Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
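
The warning above appears when a class is never predicted, which leaves its precision undefined. The `zero_division` parameter of `classification_report` sets the value reported in that case and silences the warning. Here is a minimal sketch with hypothetical labels:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: class 57 exists in y_true but is never predicted
y_true = [57, 66, 66, 67]
y_pred = [66, 66, 66, 67]

# zero_division=0 reports 0.0 for undefined metrics without emitting a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["57"])
```

In the notebook, the same argument could be added to the existing call as `classification_report(y_train, ml_model.predict(x_train_scaled), zero_division=0)`.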



