# Task 2

## Data Pre-processing

**author:** "Gökberk Abdullah" 

**school number:** "090170341"

**date:** "March 20, 2023"

**Task 2:** Find a data set that is suitable for regression analysis and consists of a mix of numerical, nominal, and ordinal variables. Look for cases where at least one of the variables has missing values. If there are none, you can randomly delete a very small portion of one of the variables.
Design a machine learning pipeline where you scale the numerical features and encode the nominal and ordinal features along with imputing the missing values.

#### Instructions to follow:

- Include all your code here. Be sure that your code is CLEAN, READABLE, and REPRODUCIBLE.
- Put your data set into a **datasets** folder.
- Put your images (if available) into an **images** folder.
- Please return a NICE and CLEAR homework. Otherwise, it will not be graded.
- Please write YOUR OWN code. **DO NOT copy** my code or anyone else's.

## Data Description

The "Student Performance Data Set," also known as the "Student Achievement Data Set," is an education data set that can be used to predict students' exam performance. It records the performance of 1,000 students in math, reading, and writing exams.

The data set contains a total of 8 variables:

* "gender": the student's gender (male/female)
* "race/ethnicity": the student's ethnic background (Group A, Group B, Group C, Group D, Group E)
* "parental level of education": the education level of the student's parents
* "lunch": whether the student participates in the school lunch program or not
* "test preparation course": whether the student participates in a test preparation course or not
* "math score": the student's score on the math exam (ranging from 0-100)
* "reading score": the student's score on the reading exam (ranging from 0-100)
* "writing score": the student's score on the writing exam (ranging from 0-100)

In this data set, the three exam scores are numerical variables and the other five variables are categorical. Parental level of education is ordinal, since its categories have a natural order, while gender, ethnic background, lunch program participation, and test preparation course participation are nominal.

Researchers and teachers working in the education sector can use this data set for various purposes, including predicting exam performance, analyzing student performance, and developing exam policies. In this homework, I will predict the math score.

Let's import the necessary libraries:

In [42]:
import pandas as pd
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor

Let's read our data from the datasets folder with the ```pandas``` library.

In [43]:
file_path = os.path.join("datasets", "StudentsPerformance.csv")

df = pd.read_csv(file_path)
df

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 |


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


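Since there are no missing values in my data, I set 10 randomly chosen entries of the ```race/ethnicity``` column and 11 randomly chosen entries of the ```parental level of education``` column to NaN.

I did the same for 10 entries of the ```writing score``` column and 5 entries of the ```gender``` column. No random seed is set in these cells, so the affected rows differ from run to run.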

In [44]:
random_indices = np.random.choice(df['race/ethnicity'].index, size=10, replace=False)
df.loc[random_indices, 'race/ethnicity'] = np.nan

In [45]:
random_indices = np.random.choice(df['parental level of education'].index, size=11, replace=False)
df.loc[random_indices, 'parental level of education'] = np.nan

In [46]:
random_indices = np.random.choice(df['writing score'].index, size=10, replace=False)
df.loc[random_indices, 'writing score'] = np.nan

In [47]:
random_indices = np.random.choice(df['gender'].index, size=5, replace=False)
df.loc[random_indices, 'gender'] = np.nan

In [48]:
df.isnull().sum()

gender                          5
race/ethnicity                 10
parental level of education    11
lunch                           0
test preparation course         0
math score                      0
reading score                   0
writing score                  10
dtype: int64
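
The four deletion cells above could also be written as one seeded helper, so that the same rows are blanked on every run. This is only a sketch: the function name `inject_missing`, the seed value, and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def inject_missing(frame, counts, seed=0):
    """Return a copy of `frame` with counts[col] random entries per column set to NaN."""
    rng = np.random.default_rng(seed)  # fixed seed -> same rows on every run
    out = frame.copy()
    for col, n in counts.items():
        # Integer columns are cast to float first so they can hold NaN
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = out[col].astype(float)
        idx = rng.choice(out.index, size=n, replace=False)
        out.loc[idx, col] = np.nan
    return out


# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({
    "gender": ["female", "male"] * 10,
    "writing score": np.arange(60, 80),
})
toy_missing = inject_missing(toy, {"gender": 3, "writing score": 4}, seed=42)
print(toy_missing.isnull().sum())
```

Calling the helper again with the same seed reproduces exactly the same missing cells, which is what the REPRODUCIBLE requirement in the instructions asks for.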

Let's check the unique values of the categorical columns.

In [57]:
df['parental level of education'].unique()

array(["bachelor's degree", 'some college', "master's degree",
       "associate's degree", 'high school', 'some high school', nan],
      dtype=object)

In [56]:
df['race/ethnicity'].unique()

array(['group B', 'group C', 'group A', 'group D', 'group E', nan],
      dtype=object)

In [49]:
df['test preparation course'].unique()

array(['none', 'completed'], dtype=object)

## Pipeline Design

Let's split our data into training and test sets. Since the data set is small, I keep the test portion at 10 percent so that more rows remain for training.

In [49]:
X = df.drop('math score', axis=1)
y = df['math score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=55)
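
As a quick check of what this split produces, the sketch below applies the same `test_size` and `random_state` to a placeholder array with 1,000 rows. The arrays `X_demo` and `y_demo` are stand-ins and not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the real data set
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=55
)
print(X_tr.shape, X_te.shape)  # (900, 1) (100, 1)
```

Because `random_state` is fixed, repeating the call returns the same 900 training rows and 100 test rows.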

The first step is to fill in the missing numeric values with the column mean using the ```SimpleImputer``` object.

The second step is to scale the numeric values using the ```StandardScaler``` object, which standardizes each column to zero mean and unit variance so that the numeric columns in the dataset are comparable to each other. I combined these two steps into a pipeline.

In [50]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
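
To see what these two steps do, here is a minimal sketch on a single toy column with one missing value. The numbers and the name `demo_numeric` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# One toy column; the NaN is replaced by the mean of 60, 70 and 80, i.e. 70
scores = np.array([[60.0], [70.0], [np.nan], [80.0]])
scaled = demo_numeric.fit_transform(scores)
print(scaled.ravel())
```

The imputed entry ends up at exactly 0 after scaling, because it equals the column mean.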

In this pipeline, I fill the missing values with the most frequent category using ```SimpleImputer```.

Then I encode the two columns that I treat as ordinal with ```OrdinalEncoder```, specifying the category order myself. Note that ```race/ethnicity``` is strictly a nominal variable, so the order used for it here is an arbitrary choice.

In [59]:
# Category order passed to OrdinalEncoder: list position = encoded value
race_cat = ["group E", "group D", "group C", "group B", "group A"]  # 0, 1, 2, 3, 4
parent_cat = ["some high school", "high school", "some college",
              "associate's degree", "bachelor's degree", "master's degree"]  # 0, 1, 2, 3, 4, 5
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[race_cat, parent_cat]))
])
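
The effect of passing an explicit category list can be checked on a few sample values. This sketch uses the same education levels as `parent_cat`. The names `levels`, `sample`, and `codes` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = ["some high school", "high school", "some college",
          "associate's degree", "bachelor's degree", "master's degree"]

# Each value is mapped to its position in `levels`
encoder = OrdinalEncoder(categories=[levels])
sample = np.array([["master's degree"], ["some high school"], ["some college"]],
                  dtype=object)
codes = encoder.fit_transform(sample)
print(codes.ravel())  # [5. 0. 2.]
```

Without the `categories` argument, the encoder would sort the values alphabetically, which would not match the intended order of the education levels.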

I did the same for the columns that have only two categories, this time using ```OneHotEncoder```. I set ```drop='if_binary'``` to avoid the dummy variable trap: for a two-category column, one of the two indicator columns is redundant and is dropped.

In [52]:
onehot_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='if_binary'))
])
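
The following sketch shows how `drop='if_binary'` behaves on a two-category column compared with a three-category column. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

binary = np.array([["none"], ["completed"], ["none"]], dtype=object)
three_way = np.array([["group A"], ["group B"], ["group C"]], dtype=object)

# Two categories: one indicator column is dropped, a single column remains
binary_encoded = OneHotEncoder(drop="if_binary").fit_transform(binary).toarray()

# Three categories: nothing is dropped, one column per category
three_encoded = OneHotEncoder(drop="if_binary").fit_transform(three_way).toarray()

print(binary_encoded.shape, three_encoded.shape)
```

So `if_binary` only removes the redundant column for binary features and leaves features with more categories fully expanded.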

I combined these 3 pipelines with ```ColumnTransformer```, which applies each pipeline to its own list of columns.

In this way, I can apply exactly the transformation I want to each column.

In [60]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['reading score', 'writing score']),
        ('cat', categorical_transformer, ['race/ethnicity', 'parental level of education']),
        ('onehot', onehot_transformer, ['gender', 'lunch', 'test preparation course'])
    ])
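
Since ```race/ethnicity``` has no natural order, one possible variant is to one-hot encode it together with the binary columns and keep ```OrdinalEncoder``` only for the education level. The sketch below is an alternative under that assumption, not the preprocessor used in this homework. The toy frame and the names `alt_preprocessor`, `nominal_cols`, and `toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

parent_levels = ["some high school", "high school", "some college",
                 "associate's degree", "bachelor's degree", "master's degree"]
nominal_cols = ["gender", "race/ethnicity", "lunch", "test preparation course"]

alt_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]),
         ["reading score", "writing score"]),
        ("ord", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("encoder", OrdinalEncoder(categories=[parent_levels]))]),
         ["parental level of education"]),
        ("nom", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("encoder", OneHotEncoder(drop="if_binary"))]),
         nominal_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Tiny toy frame with a few missing cells (made-up values)
toy = pd.DataFrame({
    "gender": ["female", "male", np.nan, "female"],
    "race/ethnicity": ["group A", "group B", "group C", np.nan],
    "parental level of education": ["high school", np.nan,
                                    "master's degree", "some college"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "test preparation course": ["none", "completed", "none", "none"],
    "reading score": [72, 90, 95, 57],
    "writing score": [74.0, np.nan, 93.0, 44.0],
})
features = alt_preprocessor.fit_transform(toy)
print(features.shape)
```

On this toy frame the output has 2 scaled numeric columns, 1 ordinal column, 3 indicator columns for the three groups present, and 1 indicator column for each of the three binary features.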

Finally, I connected the ```ColumnTransformer``` that I created, called ```preprocessor```, to ```RandomForestRegressor()``` in a pipeline to build our model.

In [61]:
# The final step is a regressor, so it is named accordingly
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor())])

I fit this pipeline on ```X_train``` and ```y_train```. You can see the details of the structure I built in the image.

In [62]:
model.fit(X_train, y_train)
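
A random forest is itself randomized, so two fits can give slightly different models unless its `random_state` is fixed. The sketch below illustrates this on synthetic data. The helper `make_model`, the seed, the synthetic arrays, and `n_estimators=20` are assumptions for illustration, and the notebook's own model was fitted without a fixed seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (not the student data)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 2))
y_syn = 3 * X_syn[:, 0] - 2 * X_syn[:, 1] + rng.normal(scale=0.1, size=200)


def make_model(seed):
    """Build a small pipeline whose forest uses a fixed random seed."""
    return Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("regressor", RandomForestRegressor(n_estimators=20, random_state=seed)),
    ])


pred_a = make_model(0).fit(X_syn, y_syn).predict(X_syn[:5])
pred_b = make_model(0).fit(X_syn, y_syn).predict(X_syn[:5])
print(np.allclose(pred_a, pred_b))  # True
```

With the same seed, both fits return identical predictions, so reported scores can be reproduced exactly.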

## Pipeline Results

Thanks to this pipeline, my code is much more organized than it would be if I processed each column one by one.

The R2 value represents the proportion of the variability in the dependent variable that is explained by the independent variables; the closer it is to 1, the better the model performs. An R2 value of 0.87 is fairly high and indicates that the model explains about 87% of the variance of the math score on the test set.

The MSE value is the average squared difference between the predictions and the true values; the closer it is to 0, the better the model performs. An MSE of 31.5 corresponds to a root mean squared error of about 5.6 points on the 0-100 score scale, showing that the model's predictions are close to the true values.

In this context, these values indicate that the model performs quite well.

In [63]:
y_pred = model.predict(X_test)
print('R2 Score:', r2_score(y_test, y_pred))
print('MSE score:', mean_squared_error(y_test, y_pred))

R2 Score: 0.8736363563922251
MSE score: 31.528373534722217
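
For reference, both metrics can be computed by hand and compared with the scikit-learn functions. The sketch below uses four made-up scores rather than the real test set, and it also shows the RMSE, which is on the same scale as the exam scores.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted scores, for illustration only
y_true = np.array([70.0, 55.0, 88.0, 62.0])
y_hat = np.array([68.0, 60.0, 85.0, 65.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

mse = ss_res / len(y_true)   # mean squared error
rmse = np.sqrt(mse)          # same unit as the scores
r2 = 1 - ss_res / ss_tot     # share of variance explained

print(np.isclose(mse, mean_squared_error(y_true, y_hat)))  # True
print(np.isclose(r2, r2_score(y_true, y_hat)))             # True
```

Applying the same square root to the reported MSE of 31.53 gives an RMSE of roughly 5.6 points.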
