<a href="https://colab.research.google.com/github/Y-Tee23/StudentPerformance/blob/main/student_perfromance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# **2. Import and Read Data**

In [None]:
data = pd.read_csv("study_performance.csv")

In [None]:
data.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [None]:
data.tail()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77
999,female,group D,some college,free/reduced,none,77,86,86


# **3. Data Description**
*   gender - The gender of the student
*   race_ethnicity - The racial/ethnic group of the student (A, B, C, D, or E)
*   parental_level_of_education - The highest education level of the student's parents
*   lunch - The type of lunch the student receives (standard or free/reduced)
*   test_preparation_course - Whether the student completed the preparation course before the test (none or completed)
*   math_score - The math score of the student
*   reading_score - The reading score of the student
*   writing_score - The writing score of the student




# **4. Exploratory Data Analysis**

---
**4.1 Data types, missing data and summary statistics**


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race_ethnicity               1000 non-null   object
 2   parental_level_of_education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test_preparation_course      1000 non-null   object
 5   math_score                   1000 non-null   int64 
 6   reading_score                1000 non-null   int64 
 7   writing_score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


`data.info()` reports 1000 non-null entries for every column, so the data has no missing values.
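
A more direct check is to count nulls per column. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset) so that the counts are easy to verify by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two deliberately missing entries
toy = pd.DataFrame({
    "gender": ["female", "male", None],
    "math_score": [72, np.nan, 47],
    "reading_score": [72, 90, 57],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# True if any column contains at least one missing value
has_missing = bool(null_counts.any())
print(has_missing)
```

Running `data.isnull().sum()` on the real data should give zero for every column if the `info()` output above is accurate.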

In [None]:
data.describe()

Unnamed: 0,math_score,reading_score,writing_score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


# **4.2 Feature Analysis**

---
**4.2.1 Categorical Features**


In [None]:
data["gender"].value_counts(dropna=False)

female    518
male      482
Name: gender, dtype: int64

In [None]:
data[["gender", "math_score","reading_score", "writing_score"]].groupby("gender", as_index=False).mean().sort_values(by="math_score", ascending =False)

Unnamed: 0,gender,math_score,reading_score,writing_score
1,male,68.728216,65.473029,63.311203
0,female,63.633205,72.608108,72.467181


On average, male students scored about 5 points higher in math, while female students scored about 7 points higher in reading and about 9 points higher in writing.
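
The comparison comes down to subtracting one group's mean from the other's. A minimal sketch on made-up numbers (the `toy` frame and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical scores for two students per group
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math_score": [60, 70, 70, 80],
    "reading_score": [80, 70, 60, 70],
})

# Mean score per gender, indexed by gender
means = toy.groupby("gender")[["math_score", "reading_score"]].mean()

# Positive values mean the male group scored higher on average
gap = means.loc["male"] - means.loc["female"]
print(gap)
```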

In [None]:
data[["race_ethnicity", "math_score","reading_score", "writing_score"]].groupby("race_ethnicity", as_index=False).mean().sort_values(by="math_score", ascending =False)

Unnamed: 0,race_ethnicity,math_score,reading_score,writing_score
4,group E,73.821429,73.028571,71.407143
3,group D,67.362595,70.030534,70.145038
2,group C,64.46395,69.103448,67.827586
1,group B,63.452632,67.352632,65.6
0,group A,61.629213,64.674157,62.674157


Group E has the highest mean score in all three subjects, and group A has the lowest.

In [None]:
data[["parental_level_of_education", "math_score","reading_score", "writing_score"]].groupby("parental_level_of_education", as_index=False).mean().sort_values(by="math_score", ascending=False)

Unnamed: 0,parental_level_of_education,math_score,reading_score,writing_score
3,master's degree,69.745763,75.372881,75.677966
1,bachelor's degree,69.389831,73.0,73.381356
0,associate's degree,67.882883,70.927928,69.896396
4,some college,67.128319,69.460177,68.840708
5,some high school,63.497207,66.938547,64.888268
2,high school,62.137755,64.704082,62.44898


As expected, students whose parents hold a master's degree have the highest mean scores in math, reading, and writing.

In [None]:
data[["test_preparation_course", "math_score","reading_score", "writing_score"]].groupby("test_preparation_course", as_index=False).mean().sort_values(by="math_score", ascending=False)

Unnamed: 0,test_preparation_course,math_score,reading_score,writing_score
0,completed,69.695531,73.893855,74.418994
1,none,64.077882,66.534268,64.504673


Students who completed the preparation course scored higher on average in all three subjects than students who did not.

In [None]:
data[["lunch", "math_score","reading_score", "writing_score"]].groupby("lunch", as_index=False).mean().sort_values(by="math_score", ascending=False)

Unnamed: 0,lunch,math_score,reading_score,writing_score
1,standard,70.034109,71.654264,70.823256
0,free/reduced,58.921127,64.653521,63.022535
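
The five group-by cells above share one pattern, so they could be folded into a small helper. This is only a sketch: `mean_scores_by` and `SCORE_COLS` are hypothetical names, and the `toy` frame is made up for illustration.

```python
import pandas as pd

SCORE_COLS = ["math_score", "reading_score", "writing_score"]

def mean_scores_by(df, column, score_cols=SCORE_COLS):
    """Mean of each score column per category, sorted by the first score column."""
    return (
        df.groupby(column, as_index=False)[score_cols]
        .mean()
        .sort_values(by=score_cols[0], ascending=False)
    )

# Hypothetical data: two students per lunch type
toy = pd.DataFrame({
    "lunch": ["standard", "standard", "free/reduced", "free/reduced"],
    "math_score": [80, 70, 60, 50],
    "reading_score": [75, 65, 70, 60],
    "writing_score": [70, 70, 60, 60],
})

summary = mean_scores_by(toy, "lunch")
print(summary)
```

On the real data, the same helper could be called once per categorical column, for example `mean_scores_by(data, "lunch")`.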


In [None]:
data["average_score"] = ((data["math_score"]+data["writing_score"]+data["reading_score"])/3).round()
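
As a quick check of the arithmetic, the sketch below recomputes the average for two rows that appear in the `data.head()` output (rows 0 and 3), using a row-wise mean as an equivalent formulation. The `toy` frame is an assumption introduced only for this check.

```python
import pandas as pd

# Scores copied from rows 0 and 3 of data.head()
toy = pd.DataFrame({
    "math_score": [72, 47],
    "reading_score": [72, 57],
    "writing_score": [74, 44],
})

# Row-wise mean of the three scores, rounded to the nearest whole number
score_cols = ["math_score", "reading_score", "writing_score"]
toy["average_score"] = toy[score_cols].mean(axis=1).round()
print(toy)
```

Row 0 gives (72 + 72 + 74) / 3 ≈ 72.67, which rounds to 73, and row 3 gives (47 + 57 + 44) / 3 ≈ 49.33, which rounds to 49.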

In [None]:
data.head()

Unnamed: 0,gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score,average_score
0,female,group B,bachelor's degree,standard,none,72,72,74,73.0
1,female,group C,some college,standard,completed,69,90,88,82.0
2,female,group B,master's degree,standard,none,90,95,93,93.0
3,male,group A,associate's degree,free/reduced,none,47,57,44,49.0
4,male,group C,some college,standard,none,76,78,75,76.0


In [None]:
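# Note: average_score stays in X_df. It is computed from the three target
# scores, so the models below see a feature derived from their own targets.
X_df = data.drop(columns=["math_score","reading_score","writing_score"])
Y_df = data[["math_score","reading_score","writing_score"]]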

In [None]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

In [None]:
def fit_model(model, X_train, y_train):
  models = MultiOutputRegressor(model)
  models.fit(X_train, y_train)
  return models
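
`MultiOutputRegressor` fits one independent copy of the wrapped estimator per target column. The sketch below illustrates this on synthetic data (the arrays `X_toy` and `Y_toy` are assumptions, unrelated to the student dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

# Synthetic features and three targets that are exact linear functions of them
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 2))
Y_toy = np.column_stack([
    2 * X_toy[:, 0],
    X_toy[:, 1] - 1,
    X_toy.sum(axis=1),
])

wrapped = MultiOutputRegressor(LinearRegression())
wrapped.fit(X_toy, Y_toy)

# One fitted estimator per target column
n_fitted = len(wrapped.estimators_)
pred = wrapped.predict(X_toy)
print(n_fitted, pred.shape)
```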


In [None]:
algo = LinearRegression()
algo2 = RandomForestRegressor(random_state=42)

In [None]:
num_features = X_df.select_dtypes(exclude="object").columns
cat_features = X_df.select_dtypes(include="object").columns

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)
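
To see what this kind of preprocessor produces, the sketch below applies the same two transformers to a small made-up frame (`toy`, `pre`, and the values are assumptions for illustration). Each category becomes its own 0/1 column, and the numeric column is rescaled to zero mean and unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame: one categorical and one numeric feature
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "average_score": [73.0, 49.0, 82.0, 76.0],
})

pre = ColumnTransformer(
    [
        ("OneHotEncoder", OneHotEncoder(), ["lunch"]),
        ("StandardScaler", StandardScaler(), ["average_score"]),
    ]
)

encoded = pre.fit_transform(toy)
# The result may be a sparse matrix; convert to a dense array for inspection
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# Two one-hot columns (free/reduced, standard) followed by the scaled score
print(encoded.shape)
print(encoded)
```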

In [None]:
X = preprocessor.fit_transform(X_df)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y_df, test_size=0.2, random_state=42)

In [None]:
model = fit_model(algo, X_train, y_train)
model2 = fit_model(algo2,X_train,y_train)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
def evaluate_model(true, predicted):
    # With several target columns, each metric is averaged uniformly across them
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
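
With several target columns, scikit-learn's metrics default to a uniform average over the outputs, so the RMSE here is the square root of the averaged MSE. The sketch below works through a tiny hand-checkable case. The arrays are made up, and `evaluate_model` is redefined only to keep the block self-contained.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square

# Two samples, two targets; the first target is predicted perfectly
true = np.array([[1.0, 2.0], [3.0, 4.0]])
predicted = np.array([[1.0, 3.0], [3.0, 2.0]])

mae, rmse, r2 = evaluate_model(true, predicted)
# Per-target MAE is 0 and 1.5, so the average is 0.75
# Per-target MSE is 0 and 2.5, so the average is 1.25 and RMSE is sqrt(1.25)
# Per-target R2 is 1.0 and -1.5, so the average is -0.25
print(mae, rmse, r2)
```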

In [None]:
def print_MSE(model):
  # Predict on the held-out test split and report the metrics
  y_pred = model.predict(X_test)
  mae, rmse, r2 = evaluate_model(y_test, y_pred)

  print('Model performance for Test set')
  print("- Root Mean Squared Error: {:.4f}".format(rmse))
  print("- Mean Absolute Error: {:.4f}".format(mae))
  print("- R2 Score: {:.4f}".format(r2))

In [None]:
print("Linear Regression")
print_MSE(model)

Linear Regression
Model performance for Test set
- Root Mean Squared Error: 3.1301
- Mean Absolute Error: 2.4759
- R2 Score: 0.9587


In [None]:
print("RandomForestRegressor")
print_MSE(model2)

RandomForestRegressor
Model performance for Test set
- Root Mean Squared Error: 3.7969
- Mean Absolute Error: 2.9375
- R2 Score: 0.9394
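
These scores should be read with care: `X_df` still contains `average_score`, which is computed from the three target scores, and the preprocessor is fitted on all rows before the train/test split. One possible alternative is sketched below on synthetic data, under the assumption that only the categorical features are used as inputs. The `toy` frame, the column subset, and the `Pipeline` layout are illustrative choices, not the notebook's method, and no scores are claimed for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the dataset (random values, for illustration only)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "gender": rng.choice(["female", "male"], size=n),
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "math_score": rng.integers(0, 101, size=n),
    "reading_score": rng.integers(0, 101, size=n),
    "writing_score": rng.integers(0, 101, size=n),
})

targets = ["math_score", "reading_score", "writing_score"]
X_toy = toy.drop(columns=targets)  # no feature derived from the targets
y_toy = toy[targets]

# Split first, so the encoder is fitted on the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender", "lunch"]),
    ])),
    ("model", LinearRegression()),  # handles several targets natively
])
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)
print(pred.shape)
```

Wrapping the preprocessor and the model in one `Pipeline` means that calling `fit` on the training split fits both, and the test split is only ever transformed.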
