# Project Description
## Xiang Ding
In this project, we explore factors that may affect students' exam performance and how these factors relate to their grades. We begin by visualizing the exam scores and grades to identify patterns or trends. In the second part of the project, we take a closer look at three factors that may affect students' overall exam performance: standard lunch, parents' education level, and completion of exam preparation. To do this, we use scikit-learn to fit a multiple linear regression model and examine the relationship between these variables and students' grades. By the end of the project, we hope to better understand how these factors contribute to students' academic success and to identify potential areas for improvement.
##### Run the cells from top to bottom

In [None]:
#!pip install -U scikit-learn
#!pip install pandas
#!pip install numpy
#!pip install -U matplotlib

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from data_module import df_vis
from data_module import data_proc
from data_module import basic_model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

### Read the CSV file

In [None]:
db = pd.read_csv('data/exams.csv')
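# Preview the first rows and list the parental education categories.
# A notebook only displays the last expression in a cell, so the preview is printed explicitly.
print(db.head())
db.get("parental level of education").unique()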

### Part I: Data Visualization 
Factors: parents' education level, free/reduced lunch, test preparation course

In [None]:
data_proc.grades_count(db, 'parental level of education')
data_proc.grades_count(db, 'lunch')
data_proc.grades_count(db, 'test preparation course')

In [None]:
df_vis.parents_education_score(db)
df_vis.overall_lunch(db)
df_vis.test_prep(db)

The three visualizations above suggest that higher academic performance is associated with certain factors. Specifically, students with standard lunch, students whose parents have a higher level of education, and students who completed exam preparation tend to have higher average exam scores.
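The comparison behind these plots can be sketched as a group-wise average. This is a minimal illustration on made-up rows, not the code inside `df_vis`; the helper `mean_score_by` and the `overall_score` column on this toy table are assumptions for the example.

```python
import pandas as pd

# Toy rows with made-up scores; the column names follow the ones used in this notebook.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard", "free/reduced"],
    "test preparation course": ["completed", "none", "none", "completed"],
    "overall_score": [80.0, 60.0, 70.0, 66.0],
})


def mean_score_by(df, factor, score_col="overall_score"):
    """Hypothetical helper: average score per category of one factor, highest first."""
    return df.groupby(factor)[score_col].mean().sort_values(ascending=False)


lunch_means = mean_score_by(toy, "lunch")
prep_means = mean_score_by(toy, "test preparation course")
print(lunch_means)
print(prep_means)
```

A gap between group means only shows an association; it does not say how much of the score variation the factor explains, which is what the regression in Part II looks at.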

### Part II: Multiple Linear Regression

In the first step, I quantify the data in these categories. To quantify the parents' education background, I use the plot above as a guide: I assign a value of 1 for a master's degree, 0.8 for a bachelor's degree, 0.6 for some college or an associate's degree, and 0.4 for no college education. For the second factor, school lunch, I use a binary variable where 1 indicates that the student receives a free school lunch and 0 indicates that the student does not. I follow the same procedure for exam preparation.
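The encoding described above could look roughly like the sketch below. It is not the actual `data_proc.quantify_df`; the category labels (for example `"some high school"` and `"free/reduced"`) and the choice of 1 for a completed preparation course are assumptions, so check them against the output of `unique()` before reusing this.

```python
import pandas as pd

# Assumed category labels mapped to the weights described in the text.
EDUCATION_WEIGHTS = {
    "master's degree": 1.0,
    "bachelor's degree": 0.8,
    "associate's degree": 0.6,
    "some college": 0.6,
    "high school": 0.4,
    "some high school": 0.4,
}


def quantify_sketch(df):
    """Hypothetical stand-in for the quantification step: categories to numbers."""
    out = pd.DataFrame(index=df.index)
    out["parental level of education"] = df["parental level of education"].map(EDUCATION_WEIGHTS)
    # 1 = receives free/reduced lunch, 0 = does not (as described in the text)
    out["lunch"] = (df["lunch"] == "free/reduced").astype(int)
    # Assumption: 1 = completed the preparation course, 0 = did not
    out["test preparation course"] = (df["test preparation course"] == "completed").astype(int)
    return out


toy = pd.DataFrame({
    "parental level of education": ["master's degree", "some college", "high school"],
    "lunch": ["standard", "free/reduced", "standard"],
    "test preparation course": ["completed", "none", "none"],
})
encoded = quantify_sketch(toy)
print(encoded)
```

Note that the 0.4 to 1.0 weights treat the education levels as evenly spaced, which is a modeling choice rather than something the data guarantees.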
##### Create a new table with the quantified data

In [None]:
new_db = data_proc.quantify_df(db).drop(columns=['race/ethnicity', 'math score', 'reading score',
                                                 'writing score'])
# The log transform is computed here but not stored, so new_db keeps the raw scores.
new_db.get('overall_score').apply(np.log)
# Uncomment to see summary statistics of the overall score
#new_db['overall_score'].describe()
plt.hist(new_db['overall_score'])
new_db

In the second step, I divide the data into two parts: x_variables and y_variables. x_variables are the input variables used in the regression model, while y_variables are the target variable.
##### Note: Since I am running an ordinary multiple linear regression, I do not need to standardize my numbers. Rescaling the inputs changes the coefficients but not the predictions, so the model can take in raw data
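The note above can be checked on synthetic data. The sketch below fits the same ordinary least squares model on raw and on standardized inputs and compares the predictions; the data and coefficients are made up for illustration and are not taken from the exam dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: three inputs and a noisy linear target (made-up coefficients).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 40 + X @ np.array([20.0, -8.0, 6.0]) + rng.normal(0, 5, size=200)

# Fit once on the raw inputs and once on standardized inputs.
raw_model = LinearRegression().fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled_model = LinearRegression().fit(X_scaled, y)

# The coefficients differ, but the fitted values and R^2 agree.
same_predictions = np.allclose(raw_model.predict(X), scaled_model.predict(X_scaled))
same_r2 = np.isclose(raw_model.score(X, y), scaled_model.score(X_scaled, y))
print(same_predictions, same_r2)
```

This holds for plain least squares with an intercept. Regularized models such as ridge or lasso are sensitive to feature scale, so the note would not carry over to them.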

In [None]:
x_var = new_db.drop(columns=['overall_score'])
y_var = new_db['overall_score']
# uncomment to see the dataframe
#x_var
#y_var

In the third step, I split the data into training and testing subsets using an 80/20 ratio: 80% of the dataset is used to train the linear regression model, and the remaining 20% is used to test it. To do this, I use the train_test_split function from the sklearn.model_selection module.
##### Feel free to play around with the split percentage to see how the R^2 value changes with different splits
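One way to try different splits is sketched below with scikit-learn's `train_test_split` on synthetic data. It is an assumption that `basic_model.datasplit` wraps this function in a similar way; the data, the fixed `random_state`, and the `results` dictionary are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x_var / y_var (made-up coefficients and noise).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 3))
y = 40 + X @ np.array([20.0, -8.0, 6.0]) + rng.normal(0, 5, size=100)

# For each test fraction: split, fit on the training part, score on the test part.
results = {}
for test_size in (0.1, 0.2, 0.3):
    x_tr, x_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    fitted = LinearRegression().fit(x_tr, y_tr)
    results[test_size] = (len(x_tr), len(x_te), r2_score(y_te, fitted.predict(x_te)))

for test_size, (n_train, n_test, r2) in results.items():
    print(f"test_size={test_size}: train={n_train}, test={n_test}, R^2={r2:.3f}")
```

Fixing `random_state` makes the split reproducible, so differences in R^2 come from the split size rather than from a different random shuffle on each run.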

In [None]:
x_train, x_test, y_train, y_test = basic_model.datasplit(x_var, y_var, train=0.8, test=0.2)

Use the training data to fit the scikit-learn linear regression model.

In [None]:
model = basic_model.my_model()
model.fit(x_train, y_train)

Apply the trained model to make predictions on the training and test sets and output the model statistics.
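For reference, the two statistics discussed in this notebook, MSE and R^2, can be computed by hand as in the sketch below. The helper names `mse` and `r2` and the four sample values are made up for illustration; `sklearn.metrics.mean_squared_error` and `r2_score` compute the same quantities.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the share of variance explained by the model."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Made-up actual and predicted values.
actual = [3, 5, 7, 9]
predicted = [2.5, 5.5, 7, 8]
example_mse = mse(actual, predicted)
example_r2 = r2(actual, predicted)
print(example_mse, example_r2)
```

An R^2 near 0 means the model does little better than always predicting the mean score, and a negative value on the test set means it does worse.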

In [None]:
score_predict_train = basic_model.score_predict_train(model, x_train, y_train)
score_predict_test = basic_model.score_predict_test(model, x_test, y_test)

### Output the equation of the linear model and the model's prediction graph
Pe = parents' education level, l = school lunch, ep = exam prep
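As a rough sketch of where such an equation comes from, the fitted model exposes `intercept_` and `coef_`, which can be formatted into a string. The helper `format_equation` is hypothetical and is not the notebook's `basic_model.linear_eq`; the data below is synthetic with made-up coefficients, not results from the exam dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data so the fit recovers the made-up coefficients.
rng = np.random.default_rng(2)
pe = rng.choice([0.4, 0.6, 0.8, 1.0], size=200)   # parents' education weight
lunch = rng.integers(0, 2, size=200)              # binary lunch indicator
ep = rng.integers(0, 2, size=200)                 # binary exam prep indicator
X = np.column_stack([pe, lunch, ep])
y = 50 + 20 * pe - 8 * lunch + 6 * ep

fitted = LinearRegression().fit(X, y)


def format_equation(model, names):
    """Hypothetical helper: render intercept and coefficients as one line."""
    terms = " ".join(f"{coef:+.2f}*{name}" for coef, name in zip(model.coef_, names))
    return f"score = {model.intercept_:.2f} {terms}"


equation = format_equation(fitted, ["Pe", "l", "ep"])
print(equation)
```

Each coefficient is the change in predicted score for a one-unit change in that input with the other two held fixed, so for the binary inputs it is the predicted gap between the two groups.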

In [None]:
basic_model.linear_eq(model)
df_vis.predict_model(y_train, score_predict_train, y_test)
df_vis.actual(score_predict_test, y_test)

### Analysis
Based on my analysis, the linear model I built did not accurately predict students' exam scores. The model had high MSE values and low R^2 values, which means that the three variables I examined (parents' education, school lunch, and exam preparation) explain only a small share of the variation in overall exam performance, at least in a linear model. These results suggest that other factors may play a more important role in determining students' academic success. Further research is needed, and a different model may give better results.

In [None]:
!pytest

## Extra Credits
I believe I have really challenged myself by exploring and using various libraries, including Pandas and Matplotlib, to manipulate data and generate insightful visualizations. The most exhilarating part has been diving into machine learning with scikit-learn. By working through the mathematics behind linear regression and reading various online articles, I have gained a deeper understanding of the mechanics behind the model and how to apply it effectively to my datasets. This project has also been uploaded to GitHub: https://github.com/fanhh/examdata_prediction