#### Important comments from the EDA process
* There are no missing values
* There are no duplicate rows
* Two new features were added: average_score and total_score
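
The checks behind these comments can be reproduced roughly as follows. This is a minimal sketch on a small hypothetical frame: the `toy` data and the `quality_report` helper are assumptions for illustration and are not taken from the EDA notebook. It also shows one way `total_score` and `average_score` could have been derived from the three score columns.

```python
import pandas as pd

# Hypothetical three-row sample using the same score columns as the dataset.
toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

def quality_report(df):
    """Count missing cells and fully duplicated rows (illustrative helper)."""
    return {
        "missing": int(df.isna().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
    }

# Derive the two engineered features from the three subject scores.
score_cols = ["math score", "reading score", "writing score"]
toy["total_score"] = toy[score_cols].sum(axis=1)
toy["average_score"] = toy[score_cols].mean(axis=1)

report = quality_report(toy)
print(report)
print(toy[["total_score", "average_score"]])
```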

#### General Steps to Follow
* Importing packages and reading the data
* Handling Categorical Features
* Dropping some features
* Splitting Data into train and test sets

## 1) Importing packages and reading the data

In [24]:
import pandas as pd
import numpy as np

In [25]:
data = pd.read_csv("../../data/StudentsPerformanceModified.csv")

In [26]:
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score,average_score
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667
1,female,group C,some college,standard,completed,69,90,88,247,82.333333
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333
4,male,group C,some college,standard,none,76,78,75,229,76.333333


### ----------------------------------------------------------------------------------------------------------------------------------------------------------

#### In "parental level of education", two values are very similar: "high school" and "some high school". I will merge them into one value: "high school"

In [27]:
data["parental level of education"].value_counts()

parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

In [30]:
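# Merge "some high school" into "high school". Assigning the result back
# avoids relying on in-place mutation of the array returned by .values.
data["parental level of education"] = data["parental level of education"].replace(
    "some high school", "high school"
)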

In [31]:
data["parental level of education"].value_counts()

parental level of education
high school           375
some college          226
associate's degree    222
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

### ----------------------------------------------------------------------------------------------------------------------------------------------------------

## 2) Handling Categorical Features

* I used one-hot encoding for the categorical features
* If a categorical feature has 3 classes, it is split into 3 new features, each taking the value 0 or 1
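
As a small illustration of the idea, the sketch below encodes a hypothetical three-class column. It uses `pd.get_dummies` for brevity rather than the `OneHotEncoder` used in the cells that follow, and the `size` column and its values are made up.

```python
import pandas as pd

# Hypothetical column with three classes.
toy = pd.DataFrame({"size": ["small", "medium", "large", "small"]})

# Each class becomes its own 0/1 column; exactly one column is 1 per row.
encoded = pd.get_dummies(toy["size"], prefix="size", dtype=int)
print(encoded)
```

Every row of `encoded` sums to 1, because each original value belongs to exactly one class.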

### 2.1 Get the categorical features

In [32]:
# Columns with dtype object ('O') hold the categorical (string) features
categorical_features = [x for x in data if data[x].dtype == 'O']
print(categorical_features)

['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']


### --------------------------------------------------------------------------------

### 2.2 One-Hot-Encoding

In [33]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

In [34]:
data_encoded = data.copy()
for x in categorical_features:
    # Fit the encoder on one column at a time (it expects 2-D input)
    one_hot_encoded = encoder.fit_transform(data[[x]]).toarray()
    # Name the new columns "<feature>_<class>" and append them to the frame
    new_columns = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out([x]), index=data.index)
    data_encoded = pd.concat([data_encoded, new_columns], axis=1)

In [35]:
data_encoded.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score,average_score,...,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,female,group C,some college,standard,completed,69,90,88,247,82.333333,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,male,group C,some college,standard,none,76,78,75,229,76.333333,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


### 2.3 Deleting Categorical Features

In [36]:
# Drop the original categorical columns now that they are encoded
data_encoded.drop(columns=categorical_features, inplace=True)

In [37]:
data_encoded.head()

Unnamed: 0,math score,reading score,writing score,total_score,average_score,gender_female,gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,...,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,72,72,74,218,72.666667,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,69,90,88,247,82.333333,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,90,95,93,278,92.666667,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,47,57,44,148,49.333333,0.0,1.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,76,78,75,229,76.333333,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


### ----------------------------------------------------------------------------------------------------------------------------------------------------------

## 3) Dropping some features

* I will drop the math, reading, writing, and total scores and keep only the average score.
* The goal of the model is to predict a student's average score from the other features. Since the average is computed from the dropped scores, keeping them would leak the target.
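
A quick way to confirm that `average_score` is fully determined by the three subject scores is to recompute it on the five rows that `data.head()` printed earlier. The `sample` frame below is a hypothetical name; its values are copied by hand from that output.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the data.head() output.
sample = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
    "total_score": [218, 247, 278, 148, 229],
    "average_score": [72.666667, 82.333333, 92.666667, 49.333333, 76.333333],
})

score_cols = ["math score", "reading score", "writing score"]

# Recompute the target from the subject scores alone.
recomputed = sample[score_cols].sum(axis=1) / 3

# True when the stored target matches the recomputed one up to rounding.
target_is_derived = np.allclose(recomputed, sample["average_score"], atol=1e-5)
print(target_is_derived)
```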

In [41]:
data_encoded.drop(columns=["math score", "writing score", "reading score", "total_score"], inplace=True)

In [42]:
data_encoded.head()

Unnamed: 0,average_score,gender_female,gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,72.666667,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,82.333333,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,92.666667,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,49.333333,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,76.333333,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


### ----------------------------------------------------------------------------------------------------------------------------------------------------------

## 4) Splitting Data into train and test sets

In [43]:
from sklearn.model_selection import train_test_split

In [44]:
train_set, test_set = train_test_split(data_encoded, test_size = 0.2, random_state = 42)
train_set.to_csv("../../data/train.csv", index = False)
test_set.to_csv("../../data/test.csv", index = False)
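
As a sanity check on the split sizes, the sketch below runs the same `train_test_split` call on a hypothetical 1000-row frame. The row count matches the `value_counts()` totals shown earlier, but the column values are randomly generated. It then separates the target from the features, which is presumably the next step once the saved files are loaded for modelling; the names `toy_train`, `toy_test`, `X_train`, and `y_train` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_encoded: 1000 rows, one target, one feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "average_score": rng.uniform(0, 100, size=1000),
    "gender_female": rng.integers(0, 2, size=1000).astype(float),
})

# Same parameters as the notebook: 20% held out, fixed seed for repeatability.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
print(toy_train.shape, toy_test.shape)

# Separate the target column from the feature columns.
X_train = toy_train.drop(columns=["average_score"])
y_train = toy_train["average_score"]
```

With 1000 rows and `test_size = 0.2`, this gives 800 training rows and 200 test rows.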