# Student Performance Data Set

### This dataset covers student achievement in secondary education at two Portuguese schools.

https://www.kaggle.com/datasets/ishandutta/student-performance-data-set

![image.png](../Images/Student.png)

#### [ Data Description ]

- Data on 1,044 students, with features such as school, sex, age, address, family size, father's education level, and mother's education level

|Feature|Definition|Value|
|:------|:---------|:------------|
|school| Student's school |'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira |
|sex| Sex | 'F' - female or 'M' - male |
|age| Age | from 15 to 22 |
|address| Home address type |'U' - urban or 'R' - rural |
|famsize| Family size | 'LE3' - less or equal to 3 or 'GT3' - greater than 3 |
|Pstatus| Parents' cohabitation status | 'T' - living together or 'A' - apart |
|Medu| Mother's education level | 0 - none, 1 - primary education, 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education |
|Fedu| Father's education level | 0 - none, 1 - primary education, 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education |
|Mjob| Mother's job | 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' |
|Fjob| Father's job | 'teacher', 'health' care related, civil 'services', 'at_home' or 'other' |
|reason| Reason for choosing the school | close to 'home', school 'reputation', 'course' preference or 'other' |
|guardian| Student's guardian | 'mother', 'father' or 'other' |
|traveltime| Home-to-school travel time | 1 - < 15 min, 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour |
|studytime| Weekly study time | 1 - < 2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours |
|failures| Number of past class failures | n if 1<=n<3, else 4 |
|schoolsup| Extra educational support | yes or no |
|famsup| Family educational support | yes or no |
|paid| Extra paid classes | yes or no |
|activities| Extracurricular activities | yes or no |
|nursery| Attended nursery school | yes or no |
|higher| Wants to pursue higher education | yes or no |
|internet| Internet access at home | yes or no |
|romantic| In a romantic relationship | yes or no |
|famrel| Quality of family relationships | from 1 - very bad to 5 - excellent |
|freetime| Free time after school | from 1 - very low to 5 - very high |
|goout| Going out with friends | from 1 - very low to 5 - very high |
|Dalc| Workday alcohol consumption | from 1 - very low to 5 - very high |
|Walc| Weekend alcohol consumption | from 1 - very low to 5 - very high |
|health| Current health status | from 1 - very bad to 5 - very good |
|absences| Number of school absences | from 0 to 93 |
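
The coded values in the table can be checked mechanically before any modelling. The following is a minimal sketch, not part of the original notebook: `ALLOWED` and `find_invalid` are hypothetical names, only a handful of features are covered, and the frame is hand-made rather than read from the real CSV files.

```python
import pandas as pd

# Allowed values for a few features, copied from the table above.
# ALLOWED and find_invalid are illustrative names, not part of the notebook.
ALLOWED = {
    "school": {"GP", "MS"},
    "sex": {"F", "M"},
    "address": {"U", "R"},
    "famsize": {"LE3", "GT3"},
    "Pstatus": {"T", "A"},
    "Medu": set(range(0, 5)),
    "studytime": set(range(1, 5)),
}


def find_invalid(df, allowed=ALLOWED):
    """Return {column: sorted unexpected values} for the columns present in df."""
    report = {}
    for col, values in allowed.items():
        if col in df.columns:
            unexpected = set(df[col].unique()) - values
            if unexpected:
                report[col] = sorted(unexpected)
    return report


# Hand-made rows for illustration only; the last one carries a bad famsize code.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "sex": ["F", "M", "F"],
    "famsize": ["GT3", "LE3", "LT3"],
    "Medu": [4, 1, 3],
})

print(find_invalid(toy))
```

An empty dictionary would mean every checked column contains only the documented codes.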

### Library & Data Import

In [86]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

In [87]:
X_train = pd.read_csv('../Datasets/Student_X_train.csv')
X_test = pd.read_csv('../Datasets/Student_X_test.csv')
y_train = pd.read_csv('../Datasets/Student_y_train.csv')

In [88]:
X_train

Unnamed: 0,StudentID,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,...,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2
0,1714,GP,F,18,U,GT3,T,4,3,other,...,no,4,3,3,1,1,3,0,14,13
1,1254,GP,F,17,U,GT3,T,4,3,health,...,yes,4,4,3,1,3,4,0,13,15
2,1639,GP,F,16,R,GT3,T,4,4,health,...,no,2,4,4,2,3,4,6,10,11
3,1118,GP,M,16,U,GT3,T,4,4,services,...,no,5,3,3,1,3,5,0,15,13
4,1499,GP,M,19,U,GT3,T,3,2,services,...,yes,4,5,4,1,1,4,0,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
673,1074,GP,M,15,U,GT3,T,4,4,services,...,no,5,3,3,1,1,5,4,10,13
674,1044,GP,M,15,R,GT3,T,4,4,other,...,yes,1,3,5,3,5,1,8,12,10
675,1078,GP,M,17,U,LE3,T,4,4,other,...,no,4,1,1,2,2,5,0,12,13
676,1055,MS,M,17,R,GT3,T,1,1,other,...,yes,4,5,5,1,3,2,0,10,9


In [89]:
X_test

Unnamed: 0,StudentID,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,...,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2
0,1000,GP,F,16,U,GT3,T,4,2,services,...,no,4,2,3,1,1,5,2,15,16
1,1008,GP,M,19,U,GT3,T,1,2,other,...,no,4,5,2,2,2,4,3,13,11
2,1013,GP,F,16,U,GT3,T,4,4,services,...,no,3,2,3,1,2,2,6,13,14
3,1014,GP,F,16,U,GT3,T,3,1,services,...,no,4,3,3,1,2,5,4,7,7
4,1017,GP,F,15,U,LE3,A,3,4,other,...,yes,5,3,2,1,1,1,0,10,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,2032,GP,F,15,R,LE3,T,3,1,other,...,no,4,4,2,2,3,3,6,15,15
362,2034,GP,M,17,U,GT3,T,3,3,services,...,yes,4,3,4,2,3,4,12,12,12
363,2035,GP,F,16,U,GT3,T,2,3,services,...,no,2,3,1,1,1,3,2,16,16
364,2036,MS,F,16,U,GT3,T,3,1,other,...,no,3,1,3,1,3,1,0,8,6


In [90]:
y_train

Unnamed: 0,StudentID,G3
0,1714,14
1,1254,15
2,1639,11
3,1118,13
4,1499,0
...,...,...
673,1074,14
674,1044,11
675,1078,13
676,1055,10


In [91]:
X_train.isna().sum()

StudentID     0
school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
dtype: int64

In [92]:
X_test.isna().sum()

StudentID     0
school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
dtype: int64

In [93]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 678 entries, 0 to 677
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   StudentID   678 non-null    int64 
 1   school      678 non-null    object
 2   sex         678 non-null    object
 3   age         678 non-null    int64 
 4   address     678 non-null    object
 5   famsize     678 non-null    object
 6   Pstatus     678 non-null    object
 7   Medu        678 non-null    int64 
 8   Fedu        678 non-null    int64 
 9   Mjob        678 non-null    object
 10  Fjob        678 non-null    object
 11  reason      678 non-null    object
 12  guardian    678 non-null    object
 13  traveltime  678 non-null    int64 
 14  studytime   678 non-null    int64 
 15  failures    678 non-null    int64 
 16  schoolsup   678 non-null    object
 17  famsup      678 non-null    object
 18  paid        678 non-null    object
 19  activities  678 non-null    object
 20  nursery   

In [94]:
StudentID = X_test['StudentID'].copy()

X_train = X_train.drop(columns=['StudentID'])
X_test = X_test.drop(columns=['StudentID'])
y_train = y_train.drop(columns=['StudentID'])
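
Dropping `StudentID` from both frames means features and targets are paired purely by row position from here on. A quick way to confirm that assumption before dropping the key might look like the sketch below. The frames are hand-made stand-ins (the IDs echo the first rows displayed above, and the target rows are shuffled on purpose), and `X_demo`, `y_demo` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for X_train and y_train, built by hand for illustration.
X_demo = pd.DataFrame({"StudentID": [1714, 1254, 1639], "G2": [13, 15, 11]})
y_demo = pd.DataFrame({"StudentID": [1254, 1639, 1714], "G3": [15, 11, 14]})

# True only if both frames list the students in the same order.
same_order = X_demo["StudentID"].equals(y_demo["StudentID"])
print(same_order)

# If the order differs, a one-to-one merge on the key restores the pairing
# and raises if an ID is duplicated on either side.
aligned = X_demo.merge(y_demo, on="StudentID", how="inner", validate="one_to_one")
print(aligned)
```

In the notebook's own output the two files do appear to share the same order, so this is a safeguard rather than a fix.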

In [95]:
X_train_cat = X_train.select_dtypes('object').copy()
X_test_cat = X_test.select_dtypes('object').copy()

# Dense output; scikit-learn versions before 1.2 spell this argument sparse=False.
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(X_train_cat)

X_train_ohe = ohe.transform(X_train_cat)
X_test_ohe = ohe.transform(X_test_cat)

In [96]:
X_train_num = X_train.select_dtypes(exclude='object').copy()
X_test_num = X_test.select_dtypes(exclude='object').copy()

scaler = MinMaxScaler()
scaler.fit(X_train_num)

X_train_sca = scaler.transform(X_train_num)
X_test_sca = scaler.transform(X_test_num)
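
With its default range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min), using the minimum and maximum seen during `fit`. The NumPy sketch below reproduces that arithmetic for a single made-up column to show one consequence: test values outside the training range land outside [0, 1].

```python
import numpy as np

# Toy 'absences' values; the test array goes beyond the training range on purpose.
train_abs = np.array([0.0, 4.0, 10.0, 20.0])
test_abs = np.array([5.0, 30.0])

# Fit: remember the training minimum and maximum.
lo, hi = train_abs.min(), train_abs.max()

# Transform: (x - min) / (max - min), always with the training statistics.
train_scaled = (train_abs - lo) / (hi - lo)
test_scaled = (test_abs - lo) / (hi - lo)

print(train_scaled)  # all values lie between 0 and 1
print(test_scaled)   # 30 absences maps to 1.5, outside [0, 1]
```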

In [97]:
X_TRAIN = np.concatenate([X_train_ohe, X_train_sca], axis=1)
X_TEST = np.concatenate([X_test_ohe, X_test_sca], axis=1)

y_TRAIN = y_train['G3']

# Report the encoded matrices rather than the raw frames.
print(X_TRAIN.shape, X_TEST.shape, y_TRAIN.shape)

(678, 58) (366, 58) (678,)
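
The preceding cells fit the encoder and the scaler separately and join the results with `np.concatenate`. scikit-learn's `ColumnTransformer` can express the same routing as one object, which keeps the train and test columns in step. This is a sketch on a hand-made frame, offered as an alternative layout rather than the notebook's method. `demo_train`, `demo_test`, and `preprocess` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with one categorical and two numeric columns (hand-made values).
demo_train = pd.DataFrame({"school": ["GP", "MS", "GP", "GP"],
                           "age": [15, 17, 16, 19],
                           "G2": [10, 12, 0, 20]})
demo_test = pd.DataFrame({"school": ["MS"], "age": [17], "G2": [15]})

# Route columns by dtype: non-numeric columns are one-hot encoded,
# numeric columns are min-max scaled. sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
        ("num", MinMaxScaler(),
         make_column_selector(dtype_include="number")),
    ],
    sparse_threshold=0,
)

demo_train_arr = preprocess.fit_transform(demo_train)
demo_test_arr = preprocess.transform(demo_test)

print(demo_train_arr.shape, demo_test_arr.shape)
print(demo_test_arr)
```

The output columns follow the order of the `transformers` list, so the one-hot columns come first and the scaled numeric columns last, matching the concatenation order used above.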


In [98]:
xtrain, xtest, ytrain, ytest = train_test_split(X_TRAIN, y_TRAIN, test_size = 0.25, random_state=1234)

print(xtrain.shape, xtest.shape, ytrain.shape, ytest.shape)

(508, 58) (170, 58) (508,) (170,)


In [99]:
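def make_models(xtrain, xtest, ytrain, ytest):
    """Fit the candidate regressors and print train R^2 and validation MSE for each."""
    # Baseline: ordinary least squares.
    model1 = LinearRegression().fit(xtrain, ytrain)
    print('model1', get_score(model1, xtrain, xtest, ytrain, ytest))

    # Random forest with default hyperparameters.
    model2 = RandomForestRegressor(random_state=2022).fit(xtrain, ytrain)
    print('model2', get_score(model2, xtrain, xtest, ytrain, ytest))

    # Random forests with 500 trees, sweeping max_depth from 3 to 7.
    for d in range(3, 8):
        model2 = RandomForestRegressor(n_estimators=500, max_depth=d, random_state=2022).fit(xtrain, ytrain)
        print('model2', d, get_score(model2, xtrain, xtest, ytrain, ytest))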

In [100]:
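def get_score(model, xtrain, xtest, ytrain, ytest):
    """Return 'train R^2, validation MSE' as a formatted string."""
    r2_train = model.score(xtrain, ytrain)        # R^2 on the data the model was fit on
    ypred = model.predict(xtest)
    mse_valid = mean_squared_error(ytest, ypred)  # MSE on the held-out split

    return f'{r2_train:.4} {mse_valid:.4}'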
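
The two numbers this helper returns are different kinds of quantity: the first is R² on the training split and the second is mean squared error on the held-out split. The NumPy sketch below computes both by hand on made-up grades, plus RMSE, which is in the same units as the grade itself. The arrays are illustrative only.

```python
import numpy as np

# Toy true and predicted grades, made up for illustration.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 18.0])

# Mean squared error: the average of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)

# Root mean squared error, in grade points.
rmse = np.sqrt(mse)

# R^2 = 1 - SS_res / SS_tot, the quantity a regressor's .score() returns.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

A high training R² alongside a flat or rising held-out MSE is the usual sign of overfitting, which is why the two are worth reading together.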

In [101]:
make_models(xtrain, xtest, ytrain, ytest)

model1 0.8788 3.656
model2 0.9822 3.09
model2 3 0.895 3.08
model2 4 0.9175 3.035
model2 5 0.945 3.036
model2 6 0.9627 3.029
model2 7 0.9724 3.07
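
These validation MSE values come from a single 75/25 split, and the random forest variants differ by only a few hundredths, so the ranking may not be stable. K-fold cross-validation averages over several splits. The sketch below uses synthetic data as a stand-in for `X_TRAIN` and `y_TRAIN`, so its numbers say nothing about the student data. `X_syn`, `y_syn`, and `candidates` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_TRAIN / y_TRAIN.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 5))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=1234)
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, max_depth=5, random_state=2022),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn maximises scores, so MSE is exposed as its negative.
    scores = cross_val_score(model, X_syn, y_syn, cv=cv,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.3f}")
```

Swapping in the real `X_TRAIN` and `y_TRAIN` would give one averaged MSE per candidate instead of a single-split figure.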


In [102]:
final_model = LinearRegression().fit(xtrain, ytrain)
print('final_model', get_score(final_model, xtrain, xtest, ytrain, ytest))

final_model 0.8788 3.656


In [103]:
y_pred = final_model.predict(X_TEST)

# Pair each test StudentID with its predicted final grade and save the submission.
result = pd.DataFrame({'StudentID': StudentID, 'G3': y_pred})
result.to_csv('./result.csv', index=False)
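
A linear model can predict grades below 0 or above 20, although G3 in this dataset is recorded on a 0 to 20 scale. Whether clipping helps depends on how the submission is scored, so treat the following as an optional sketch under that assumption. The predictions are made up, and the CSV is written to an in-memory buffer instead of `./result.csv`.

```python
import io

import numpy as np
import pandas as pd

# Made-up raw predictions; a linear model can overshoot either end of the scale.
raw_pred = np.array([-0.7, 9.4, 13.6, 20.8])
ids = pd.Series([1000, 1008, 1013, 1014], name="StudentID")

# Clip to the 0-20 grading scale (an assumption about what the scorer expects).
clipped = np.clip(raw_pred, 0, 20)

submission = pd.DataFrame({"StudentID": ids, "G3": clipped})

# Round-trip through CSV text to confirm the layout: two columns, one row per ID.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
check = pd.read_csv(io.StringIO(buffer.getvalue()))

print(check)
```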