# Student Performance Analysis

---

This project analyzes student performance, focusing on two questions:

1. Identifying Performance Differences Between Schools:

- We will compare the final grades of students from two schools to understand if there are statistically significant differences in their academic outcomes.
- This analysis will involve exploring the distributions and statistics of final grades for both schools, potentially investigating specific subjects or student subgroups.

2. Predicting Final Grades through Regression:

- We will use regression analysis to model the relationship between various factors and final grades.
- Candidate predictors include student demographics, earlier period grades (G1, G2), absences, study habits, and socio-economic factors.

Our aim is to gain valuable insights into:

- Potential gaps in performance between the two schools.
- The key factors influencing final grades and their relative importance.
- Utilizing these insights to inform strategies for improving student success in both schools.


## 1. Import data


In [37]:
import pandas as pd
from statsmodels.stats.proportion import (
    proportions_ztest,
    confint_proportions_2indep,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [3]:
df = pd.read_csv("student-mat.csv", sep=";")

df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


## 2. Statistical Test: Comparing School Performance

Goal:

To determine whether Mousinho da Silveira (MS) School performs significantly better, as measured by the final grade G3, than Gabriel Pereira (GP) School.

Hypotheses:

- Null hypothesis (H₀): MS's performance is less than or equal to GP's (MS ≤ GP).
- Alternative hypothesis (H₁): MS's performance is better than GP's (MS > GP).
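
Written in terms of the population mean final grade (G3) at each school, the one-sided hypotheses above can be stated as follows. The symbols $\mu_{\mathrm{MS}}$ and $\mu_{\mathrm{GP}}$ are notation introduced here for illustration; they do not appear in the original analysis.

```latex
H_0:\ \mu_{\mathrm{MS}} \le \mu_{\mathrm{GP}}
\qquad \text{versus} \qquad
H_1:\ \mu_{\mathrm{MS}} > \mu_{\mathrm{GP}}
```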


In [44]:
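# Final grades (G3) for each school
df_GP = df[df["school"] == "GP"]["G3"]
df_MS = df[df["school"] == "MS"]["G3"]

# One-sided z-test, H1: MS > GP.
# Caution: proportions_ztest expects success counts and numbers of
# observations; mean grades are passed here in place of counts.
stat, pval = proportions_ztest(
    [df_MS.mean(), df_GP.mean()],
    [df_MS.count(), df_GP.count()],
    alternative="larger",
)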

stat, pval

(5.308812035723443, 5.5171026168843736e-08)

In [43]:
if pval < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

Reject null hypothesis
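
Because G3 is a numeric grade rather than a success/failure count, a two-sample test on means may fit the stated hypotheses more directly than a proportions test. The block below is a minimal sketch of Welch's one-sided test using only the standard library. The function name `welch_one_sided` and the sample grades are hypothetical, and the p-value uses a normal approximation that is reasonable only for large samples.

```python
import math
from statistics import NormalDist, mean, variance


def welch_one_sided(sample_a, sample_b):
    """Test H1: mean(sample_a) > mean(sample_b) without assuming equal variances.

    Returns the Welch t statistic, the Welch-Satterthwaite degrees of freedom,
    and a one-sided p-value from the normal approximation to the t distribution.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    se2_a = variance(sample_a) / n_a  # squared standard error of each mean
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    p_value = 1.0 - NormalDist().cdf(t_stat)  # large-sample approximation
    return t_stat, dof, p_value


# Hypothetical grades for illustration only, NOT values from student-mat.csv.
grades_a = [12, 14, 11, 15, 13, 16, 12, 14]
grades_b = [10, 9, 12, 11, 8, 10, 11, 9]

t_stat, dof, p_value = welch_one_sided(grades_a, grades_b)
print(f"t = {t_stat:.3f}, dof = {dof:.1f}, one-sided p ~ {p_value:.2e}")
```

On the notebook's data the call would be `welch_one_sided(df_MS.tolist(), df_GP.tolist())`. For a p-value from the exact t distribution, `scipy.stats.ttest_ind(df_MS, df_GP, equal_var=False, alternative="greater")` is an option in SciPy 1.6 or later.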


In [26]:
ci_low, ci_upp = confint_proportions_2indep(
    df_MS.mean(), df_MS.count(), df_GP.mean(), df_GP.count(), alpha=0.05
)

ci_low, ci_upp

(0.08714247255827207, 0.32270203071763404)
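
For comparison, a confidence interval for the difference in mean grades (rather than a difference in proportions) could be sketched as below. This is an illustration under stated assumptions: `mean_diff_ci` is a hypothetical helper, the grades are made up, and the interval relies on a normal approximation.

```python
import math
from statistics import NormalDist, mean, variance


def mean_diff_ci(sample_a, sample_b, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for mean(a) - mean(b)."""
    diff = mean(sample_a) - mean(sample_b)
    # Standard error of the difference, allowing unequal variances
    se = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return diff - z * se, diff + z * se


# Hypothetical grades for illustration only, NOT values from student-mat.csv.
grades_a = [12, 14, 11, 15, 13, 16, 12, 14]
grades_b = [10, 9, 12, 11, 8, 10, 11, 9]

ci_low, ci_upp = mean_diff_ci(grades_a, grades_b)
print(f"95% CI for the mean difference: ({ci_low:.2f}, {ci_upp:.2f})")
```

The resulting interval is expressed in grade points, which makes it directly interpretable on the G3 scale.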

### Conclusion

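- The p-value (about 5.5e-08) is below the 0.05 significance level, so we reject the null hypothesis (H₀).
- Taken at face value, the data therefore support the claim that Mousinho da Silveira (MS) performs better than Gabriel Pereira (GP).
- The 95% confidence interval for the difference is (0.0871, 0.3227), i.e. MS is ahead by roughly 8.71% to 32.27%.
- Caveat: proportions_ztest and confint_proportions_2indep are designed for success counts out of a number of trials. Here they were given mean grades in place of counts, so these figures should not be read as a difference in mean G3.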


## 3. Regression

---

For simplicity, the regression uses only the numeric features.


In [32]:
# Select numeric columns using select_dtypes()
numeric_df = df.select_dtypes(include=["number"])

numeric_df.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,18,4,4,2,2,0,4,3,4,1,1,3,6,5,6,6
1,17,1,1,1,2,0,5,3,3,1,1,3,4,5,5,6
2,15,1,1,1,2,3,4,3,2,2,3,3,10,7,8,10
3,15,4,2,1,3,0,3,2,2,1,1,5,2,15,14,15
4,16,3,3,1,2,0,4,3,2,1,2,5,4,6,10,10


Then drop the 'age' column.


In [34]:
numeric_df = numeric_df.drop(columns=["age"])

numeric_df.head()

Unnamed: 0,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,4,4,2,2,0,4,3,4,1,1,3,6,5,6,6
1,1,1,1,2,0,5,3,3,1,1,3,4,5,5,6
2,1,1,1,2,3,4,3,2,2,3,3,10,7,8,10
3,4,2,1,3,0,3,2,2,1,1,5,2,15,14,15
4,3,3,1,2,0,4,3,2,1,2,5,4,6,10,10


In [45]:
# Separate features and target variable
X = numeric_df.drop("G3", axis=1)
y = numeric_df["G3"]

# Split data into train and test sets: 80% for training, 20% for testing.
# No random_state is set, so the split and the metrics below vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create a linear regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_predicted = model.predict(X_test)

mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)
print("Mean Squared Error on Test Data:", mse)
print("R-squared on Test Data:", r2)

Mean Squared Error on Test Data: 2.9617080456890372
R-squared on Test Data: 0.855109115533618
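
To make the two reported metrics concrete, the sketch below computes them by hand using the same definitions as mean_squared_error and r2_score. The helper name `mse_and_r2` and the predicted values are hypothetical; the true values are the five G3 grades shown in the head() output.

```python
def mse_and_r2(y_true, y_pred):
    """Return (mean squared error, R-squared) for paired observations."""
    n = len(y_true)
    # Residual sum of squares: squared gaps between truth and prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared gaps between truth and its own mean
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return ss_res / n, 1.0 - ss_res / ss_tot


y_true = [6, 6, 10, 15, 10]  # G3 of the first five students shown above
y_pred = [5, 7, 9, 14, 11]   # hypothetical predictions, for illustration only

mse, r2 = mse_and_r2(y_true, y_pred)
print("MSE:", mse)
print("R-squared:", round(r2, 4))
```

Taking the square root of the reported test MSE (about 2.96) gives an RMSE of about 1.72 grade points, which is often easier to interpret than the squared error.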


In [41]:
X.columns

Index(['Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2'],
      dtype='object')

In [38]:
model.coef_

array([ 0.08640215, -0.10531842,  0.07037297, -0.25659545, -0.22857072,
        0.30662504,  0.04983653,  0.11151495, -0.0452695 ,  0.00984139,
        0.10461645,  0.03999493,  0.17160922,  0.97737145])

In [40]:
model.intercept_

-3.699675447503461

### Conclusions

Each coefficient is read with the other features held constant:

- A one-unit increase in famrel (quality of family relationships) is associated with an estimated increase of 0.31 units in G3.
- A one-unit increase in G2 (second period grade) is associated with an estimated increase of 0.98 units in G3.
- A one-unit increase in studytime (weekly study time) is associated with an estimated decrease of 0.26 units in G3.
- A one-unit increase in failures (number of past class failures) is associated with an estimated decrease of 0.23 units in G3.
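
The interpretation above can be checked with a small worked example that rebuilds the linear prediction from the printed coefficients and intercept. This is a sketch under assumptions: `predict_g3` is a hypothetical helper, and because the cells were executed out of order, the printed coefficients may come from a different random split than the reported metrics.

```python
# Column order from X.columns, paired with the printed model.coef_ values.
FEATURES = [
    "Medu", "Fedu", "traveltime", "studytime", "failures", "famrel",
    "freetime", "goout", "Dalc", "Walc", "health", "absences", "G1", "G2",
]
COEFS = [
    0.08640215, -0.10531842, 0.07037297, -0.25659545, -0.22857072,
    0.30662504, 0.04983653, 0.11151495, -0.0452695, 0.00984139,
    0.10461645, 0.03999493, 0.17160922, 0.97737145,
]
INTERCEPT = -3.699675447503461  # printed model.intercept_


def predict_g3(row):
    """Linear prediction: intercept plus the sum of coefficient * feature."""
    return INTERCEPT + sum(c * row[name] for name, c in zip(FEATURES, COEFS))


# Feature values of the first student in numeric_df.head() (actual G3 = 6).
first_student = {
    "Medu": 4, "Fedu": 4, "traveltime": 2, "studytime": 2, "failures": 0,
    "famrel": 4, "freetime": 3, "goout": 4, "Dalc": 1, "Walc": 1,
    "health": 3, "absences": 6, "G1": 5, "G2": 6,
}

baseline = predict_g3(first_student)
# Raise G2 by one unit while holding every other feature fixed.
bumped = predict_g3({**first_student, "G2": first_student["G2"] + 1})

print(f"Predicted G3: {baseline:.2f}")
print(f"Change from one extra unit of G2: {bumped - baseline:.2f}")
```

The difference between the two predictions equals the G2 coefficient, which is exactly what "a one-unit increase with other features held constant" means in a linear model.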


## 4. Recommendations

School Choice:

- In the statistical test in Section 2, students at Mousinho da Silveira (MS) School came out ahead of those at Gabriel Pereira (GP) School on final grades. Further research into individual school strengths may be beneficial.

Student Support:

- Higher family-relationship quality (famrel) is associated with higher final grades in the regression model (coefficient of about 0.31). Providing resources and programs to strengthen family support could be explored.

Academic Habits:

- Higher second-period grades (G2) are strongly associated with higher final grades (coefficient of about 0.98). Encouraging consistent academic performance throughout the semester may be impactful.
- Targeting X hours of weekly study time based on individual needs might optimize learning outcomes. Note that studytime has a negative coefficient in this model, so the regression alone does not indicate a specific target.
