#### Important comments from the EDA process
* There are no missing values
* There are no duplicated rows
* Two new features were added: average_score and total_score
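
The checks listed above can be reproduced with a few pandas calls. This is a minimal sketch, not the original EDA code: it uses only the three score columns from the first three rows shown by `data.head()` in this notebook, and it assumes `total_score` is the sum of the three scores and `average_score` is their mean, which agrees with those rows.

```python
import pandas as pd

# First three rows of the score columns, as shown by data.head() in this notebook.
df = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})
score_cols = ["math score", "reading score", "writing score"]

# Count missing cells and fully duplicated rows.
n_missing = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())

# Assumed definitions of the two added features: row-wise sum and mean.
df["total_score"] = df[score_cols].sum(axis=1)
df["average_score"] = df[score_cols].mean(axis=1)

print(n_missing, n_duplicates)
print(df)
```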

#### General Steps to Follow
* Importing packages and reading the data
* Handling Categorical Features
* Dropping some features
* Splitting Data into train and test sets

## 1) Importing packages and reading the data

In [24]:
import pandas as pd
import numpy as np

In [25]:
data = pd.read_csv("../../data/StudentsPerformanceModified.csv")

In [26]:
data.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score,average_score
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667
1,female,group C,some college,standard,completed,69,90,88,247,82.333333
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333
4,male,group C,some college,standard,none,76,78,75,229,76.333333


### ----------------------------------------------------------------------------------------------------------------------------------------------------------

#### In parental level of education, "some high school" and "high school" are similar values. I will combine them into one value: "high school"

In [27]:
data["parental level of education"].value_counts()

parental level of education
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

In [30]:
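# Merge "some high school" into "high school" and assign the result back to the column,
# rather than mutating the underlying array in place
data["parental level of education"] = data["parental level of education"].replace("some high school", "high school")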

In [31]:
data["parental level of education"].value_counts()

parental level of education
high school           375
some college          226
associate's degree    222
bachelor's degree     118
master's degree        59
Name: count, dtype: int64

### ----------------------------------------------------------------------------------------------------------------------------------------------------------

## 2) Handling Categorical Features

* I used one-hot encoding for the categorical features
* If a categorical feature has 3 classes, it is split into 3 new features, each taking the value 0 or 1
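
To make the 3-classes-to-3-features rule concrete, here is a small sketch on a made-up column. The column name `size` and its values are hypothetical, and the sketch uses `pd.get_dummies` for brevity rather than the `OneHotEncoder` used in this notebook.

```python
import pandas as pd

# Hypothetical feature with three classes (not part of the students dataset).
toy = pd.DataFrame({"size": ["small", "medium", "large", "small"]})

# One new column per class; dtype=float gives 0.0/1.0 values like the notebook's output.
encoded = pd.get_dummies(toy["size"], prefix="size", dtype=float)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1.0 across the three new columns, marking the class that row belonged to.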

### 2.1 Get the categorical features

In [32]:
categorical_features = []
for x in data:
    if data[x].dtype == 'O':
        categorical_features.append(x)
print(categorical_features)

['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
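
As a possible alternative to the explicit loop, pandas can select columns by dtype in one call. The sketch below runs on a made-up two-row frame; the text columns are created with an explicit `object` dtype so that the result matches the `dtype == 'O'` check used above.

```python
import pandas as pd

# Made-up frame: two text columns stored as object dtype, one numeric column.
df = pd.DataFrame({
    "gender": pd.Series(["female", "male"], dtype=object),
    "lunch": pd.Series(["standard", "free/reduced"], dtype=object),
    "math score": [72, 47],
})

# Keep only the object-dtype columns and return their names as a list.
categorical_features = df.select_dtypes(include="object").columns.tolist()
print(categorical_features)
```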


### --------------------------------------------------------------------------------

### 2.2 One-Hot-Encoding

In [33]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()

In [34]:
data_encoded = data.copy()
for x in categorical_features:
    values = np.array(list(data[x]))
    one_hot_encoded = encoder.fit_transform(values.reshape(-1,1)).toarray()
    data_encoded = pd.concat([data_encoded, pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out([x]))], axis=1)

In [35]:
data_encoded.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,total_score,average_score,...,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,female,group B,bachelor's degree,standard,none,72,72,74,218,72.666667,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,female,group C,some college,standard,completed,69,90,88,247,82.333333,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,female,group B,master's degree,standard,none,90,95,93,278,92.666667,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,male,group A,associate's degree,free/reduced,none,47,57,44,148,49.333333,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,male,group C,some college,standard,none,76,78,75,229,76.333333,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


### 2.3 Deleting Categorical Features

In [36]:
# Drop all original categorical columns in one call
data_encoded = data_encoded.drop(columns=categorical_features)

In [37]:
data_encoded.head()

Unnamed: 0,math score,reading score,writing score,total_score,average_score,gender_female,gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,...,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,72,72,74,218,72.666667,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,69,90,88,247,82.333333,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,90,95,93,278,92.666667,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,47,57,44,148,49.333333,0.0,1.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,76,78,75,229,76.333333,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


### ----------------------------------------------------------------------------------------------------------------------------------------------------------

## 3) Dropping some features

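* I will drop the math, reading, writing, and total scores and keep only the average score. The average is computed directly from these scores, so keeping them would leak the target into the features.
* The goal of the model is to predict the student's average score from the remaining features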
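
A short sketch of why the individual scores cannot stay in the feature set when the target is the average score. It uses randomly generated scores (not the real data) and a plain least-squares fit with NumPy; the fit recovers the average exactly, with weights close to 1/3 each, so a model given these columns would have nothing to learn from the other features.

```python
import numpy as np

# Made-up math/reading/writing scores for 50 students.
rng = np.random.default_rng(0)
scores = rng.integers(0, 101, size=(50, 3)).astype(float)
average = scores.mean(axis=1)

# Least-squares fit of the average on the three scores.
coef, *_ = np.linalg.lstsq(scores, average, rcond=None)
max_error = np.abs(scores @ coef - average).max()

print(np.round(coef, 3))
print(max_error)
```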

In [41]:
# Drop the individual and total scores in one call; average_score stays as the target
data_encoded = data_encoded.drop(columns=["math score", "reading score", "writing score", "total_score"])

In [42]:
data_encoded.head()

Unnamed: 0,average_score,gender_female,gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,72.666667,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,82.333333,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
2,92.666667,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
3,49.333333,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,76.333333,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0


### ----------------------------------------------------------------------------------------------------------------------------------------------------------

## 4) Splitting Data into train and test sets

In [43]:
from sklearn.model_selection import train_test_split

In [44]:
train_set, test_set = train_test_split(data_encoded, test_size = 0.2, random_state = 42)
train_set.to_csv("../../data/train.csv", index = False)
test_set.to_csv("../../data/test.csv", index = False)
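
The value counts earlier in this notebook sum to 1000 rows, so `test_size = 0.2` should give 800 training rows and 200 test rows. The sketch below checks this on a made-up frame of the same length and also shows one way the target could be separated from the features afterwards; the names `X_train` and `y_train` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 1000 rows: one target column plus two 0/1 features.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "average_score": rng.uniform(0, 100, size=1000),
    "gender_female": rng.integers(0, 2, size=1000).astype(float),
})
df["gender_male"] = 1.0 - df["gender_female"]

# Same call as in the notebook: 80% train, 20% test, fixed seed.
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

# Separate the target from the features (hypothetical names).
X_train = train_set.drop(columns=["average_score"])
y_train = train_set["average_score"]

print(train_set.shape, test_set.shape)
```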

In [53]:
def welcome_statement():
    print("Welcome sir! This is an AI model that predicts the average score of students taking three tests: Math, Reading, and Writing.")
    print("-----------------------------------------------------------")
    print("The prediction is based on the following features:")
    print("1- Gender")
    print("2- Race/ethnicity")
    print("3- Parental level of education")
    print("4- Student had/didn't have lunch before exam")
    print("5- Student completed the test preparation course")
    print("-----------------------------------------------------------")
    print("Please answer the following questions to get the prediction score of the student.\n")

def get_gender():
    print("Gender:")
    gender = ""
    while(True):
        try:
            inp = int(input("For male write 1\nFor female write 2\n"))
            if(inp == 1):
                gender = "male"
                break
            elif(inp == 2):
                gender = "female"
                break
            else:
                print("ERROR: Invalid input.\n")
        except ValueError:
            print("ERROR: Invalid input.\n")
    return gender


def get_race():
    print("\nRace/ethnicity:")
    race = ""
    while(True):
        try:
            inp = int(input("For group A write 1\nFor group B write 2\nFor group C write 3\nFor group D write 4\nFor group E write 5\n"))
            if(inp == 1):
                race = "group A"
                break
            elif(inp == 2):
                race = "group B"
                break
            elif(inp == 3):
                race = "group C"
                break   
            elif(inp == 4):
                race = "group D"
                break
            elif(inp == 5):
                race = "group E"
                break
            else:
                print("ERROR: Invalid input.\n")
        except ValueError:
            print("ERROR: Invalid input.\n")
    return race


def get_parent_edu():
    print("\nParental level of education:")
    level = ""
    while(True):
        try:
            inp = int(input("For bachelor's degree write 1\nFor some college write 2\nFor master's degree write 3\nFor associate's degree write 4\nFor high school write 5\n"))
            if(inp == 1):
                level = "bachelor's degree"
                break
            elif(inp == 2):
                level = "some college"
                break
            elif(inp == 3):
                level = "master's degree"
                break   
            elif(inp == 4):
                level = "associate's degree"
                break
            elif(inp == 5):
                level = "high school"
                break
            else:
                print("ERROR: Invalid input.\n")
        except ValueError:
            print("ERROR: Invalid input.\n")
    return level


def get_lunch():
    print("\nStudent had/didn't have lunch before exam:")
    lunch = ""
    while(True):
        try:
            inp = int(input("For YES write 1\nFor NO write 2\n"))
            if(inp == 1):
                lunch = "standard"
                break
            elif(inp == 2):
                lunch = "free/reduced"
                break
            else:
                print("ERROR: Invalid input.\n")
        except ValueError:
            print("ERROR: Invalid input.\n")
    return lunch

def get_course():
    print("\nStudent completed/didn't complete the test preparation course:")
    course = ""
    while(True):
        try:
            inp = int(input("For YES write 1\nFor NO write 2\n"))
            if(inp == 1):
                course = "completed"
                break
            elif(inp == 2):
                course = "none"
                break
            else:
                print("ERROR: Invalid input.\n")
        except ValueError:
            print("ERROR: Invalid input.\n")
    return course

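def prediction(x_pred):
    import joblib
    # Load the trained model from model.pkl
    model = joblib.load("model.pkl")
    y_pred = model.predict(x_pred)
    # predict returns one value per input row; report the first (only) row
    print("The predicted average score for the student is:", y_pred[0])
    return y_pred[0]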

def main():
    welcome_statement()
    gender = get_gender()
    race = get_race()
    level = get_parent_edu()
    lunch = get_lunch()
    course = get_course()

    def data_transformation():
        import pandas as pd

        data = {
        "gender_female"                                   : [float(gender == "female")]  ,
        "gender_male"                                     : [float(gender == "male")]    ,
        "race/ethnicity_group A"                          : [float(race == "group A")]   ,
        "race/ethnicity_group B"                          : [float(race == "group B")]   ,
        "race/ethnicity_group C"                          : [float(race == "group C")]   ,
        "race/ethnicity_group D"                          : [float(race == "group D")]   ,
        "race/ethnicity_group E"                          : [float(race == "group E")]   ,
        "parental level of education_associate's degree"  : [float(level == "associate's degree")]  ,
        "parental level of education_bachelor's degree"   : [float(level == "bachelor's degree")]  ,
        "parental level of education_high school"         : [float(level == "high school")]  ,
        "parental level of education_master's degree"     : [float(level == "master's degree")]  ,
        "parental level of education_some college"        : [float(level == "some college")]  ,
        "lunch_free/reduced"                              : [float(lunch == "free/reduced")]  ,
        "lunch_standard"                                  : [float(lunch == "standard")]  ,
        "test preparation course_completed"               : [float(course == "completed")],
        "test preparation course_none"                    : [float(course == "none")]
        }
        data = pd.DataFrame(data)
        return data

    d = data_transformation()
    return d

d = main()
d


Welcome sir! This is an AI model that predicts the average score of students taking three tests: Math, Reading, and Writing.
-----------------------------------------------------------
The prediction is based on the following features:
1- Gender
2- Race/ethnicity
3- Parental level of education
4- Student had/didn't have lunch before exam
5- Student completed the test preparation course
-----------------------------------------------------------
Please answer the following questions to get the prediction score of the student.

Gender:
For male write 1
For female write 2
1

Race/ethnicity:
For group A write 1
For group B write 2
For group C write 3
For group D write 4
For group E write 5
1

Parental level of education:
For bachelor's degree write 1
For some college write 2
For master's degree write 3
For associate's degree write 4
For high school write 5
1

Student had/didn't have lunch before exam:
For YES write 1
For NO write 2
1

Student completed/didn't complete the test preparation c

Unnamed: 0,gender_female,gender_male,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_associate's degree,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none
0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
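
The five `get_*` functions above share the same ask-validate-repeat loop, so they could be folded into one helper. The following is only a sketch of that idea: `ask_choice` and its `read` parameter are hypothetical names, not part of the notebook. `read` defaults to `input` and is replaced by scripted answers here so the example runs without a keyboard.

```python
def ask_choice(title, options, read=input):
    """Ask until the user enters a valid 1-based option number; return that option."""
    prompt = "".join(f"For {opt} write {i}\n" for i, opt in enumerate(options, start=1))
    print(title)
    while True:
        try:
            choice = int(read(prompt))
        except ValueError:
            # Non-numeric input: report it and ask again.
            print("ERROR: Invalid input.\n")
            continue
        if 1 <= choice <= len(options):
            return options[choice - 1]
        # Numeric but out of range: report it and ask again.
        print("ERROR: Invalid input.\n")


# Scripted answers: one non-numeric, one out of range, then a valid choice.
answers = iter(["x", "9", "2"])
gender = ask_choice("Gender:", ["male", "female"], read=lambda _prompt: next(answers))
print(gender)
```

In this sketch each option doubles as the prompt label and the returned value. The lunch and course questions show YES/NO labels but return other strings, so they would need a separate label-to-value mapping.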
