<h1>Scenario: UMBC Entrance Exams and Student Success</h1>

<h1>Abstract</h1>

This project uses generated data about students and their scores on three tests. Because the target (average test score) is a continuous variable, the analysis was treated as a regression problem. To make the project more interesting, I framed the data as a fictional experiment at UMBC:
<br>
<br>
UMBC has begun administering entrance exams in Math, Reading, and Writing to ensure students are prepared for the rigors of academic study. After the first 5,000 tests were completed, the administration wanted to see which student features, if any, could be used to predict a student's average score across the tests. The features extracted were Sex, Ethnic Group, Parental Level of Education, Lunch Status (Standard or Reduced / Free), and Test Preparation (Completed or Not).
<br>   
A data science student was asked to build a model that could predict a student's average test score from the features listed above. Two models were ultimately used: an OLS regression and a Decision Tree.
<br>
<br>
Unfortunately, neither model predicted the average score well from the provided features. Given the low R^2 values of both models (0.29 for the OLS regression and 0.24 for the Decision Tree), the features appear to be weak predictors, at least with the tools used here. More information on the students should be collected, and more models applied to the problem, when time allows.

<h1>Introduction</h1>

Many factors can affect how a student performs on an exam. Some of these are within the student's control, while others are not. If we can identify which features affect a student's success, we can change UMBC's policies to better advise and accommodate students from a variety of backgrounds.

<h1>Motivation</h1>

With information that helps predict student success, UMBC can give new students tools and advice early on so that they succeed throughout their college careers. For example, the university could encourage students to complete the test preparation course, or make financial waivers available to those who might qualify (as indicated by their lunch status).

<h1>Related Work</h1>

I am not aware of any work related to this specific (and fictional) topic.

<h1>Proposed Method</h1>

Because the target variable is continuous, this is a regression problem, so I chose to fit an OLS regression and a decision tree regression. Comparing these two models might help determine which features are the most important and whether additional information on the students is necessary.
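As a rough illustration of this comparison, the sketch below one-hot encodes two categorical features and fits both model types. It is not the project's notebook code: the data is synthetic, the column names (`lunch`, `test_prep`, `avg_score`) are hypothetical, and scikit-learn's `LinearRegression` is used here as a stand-in for the least-squares fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; column names and effect sizes are assumptions.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "lunch": rng.choice(["standard", "free/reduced"], size=n),
    "test_prep": rng.choice(["none", "completed"], size=n),
})
df["avg_score"] = (
    65
    + 6 * (df["lunch"] == "standard")
    + 5 * (df["test_prep"] == "completed")
    + rng.normal(0, 12, size=n)  # large noise relative to the feature effects
)

# One-hot encode the categorical features; drop_first avoids redundant columns.
X = pd.get_dummies(df[["lunch", "test_prep"]], drop_first=True)
y = df["avg_score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit both regressors on the same training split.
ols = LinearRegression().fit(X_train, y_train)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

# .score() returns R^2 on the held-out split for both estimators.
ols_r2 = ols.score(X_test, y_test)
tree_r2 = tree.score(X_test, y_test)
print(f"OLS R^2: {ols_r2:.2f}, tree R^2: {tree_r2:.2f}")
```

Fitting both models on the same split keeps the comparison fair, since any difference in R^2 then comes from the model rather than from the data it saw.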

<h1>Experiments</h1>

Basic data preparation (concatenating the CSV files, etc.) was carried out in this notebook:
https://github.com/dbbabcock/data602_project_1_student_success/blob/main/data_prep/stu_perf_file_prep.ipynb
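For readers who do not open the notebook, the concatenation step might look roughly like the following. This is a minimal sketch, not the notebook's code: the in-memory CSV text and the column names are hypothetical stand-ins for the real files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the raw CSV files; the real file names and
# columns in the project notebook may differ.
csv_part_1 = io.StringIO(
    "sex,lunch,math_score,reading_score,writing_score\n"
    "female,standard,72,80,78\n"
    "male,free/reduced,55,60,58\n"
)
csv_part_2 = io.StringIO(
    "sex,lunch,math_score,reading_score,writing_score\n"
    "male,standard,90,85,88\n"
    "female,free/reduced,61,70,69\n"
)

# Read each file, then stack the rows into one frame with a fresh index.
frames = [pd.read_csv(part) for part in (csv_part_1, csv_part_2)]
students = pd.concat(frames, ignore_index=True)

print(students.shape)
```

Passing `ignore_index=True` renumbers the rows, so the combined frame does not carry duplicate index labels from the individual files.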

Cleaning, transformation, and EDA were carried out in this notebook:
https://github.com/dbbabcock/data602_project_1_student_success/blob/main/clean_and_eda/stu_perf_cleaning_and_eda.ipynb

The average scores of the students were found to be approximately normally distributed about a mean of 68.38. 

![Average Score Histogram](images/avg_ts_hist.png)

![Scatterplot Average Scores](images/avg_ts_per_stu.png)

Students generally performed worse on the math test than on the reading and writing tests, as indicated by the lower mean math score (67.03 vs. 69.77 and 68.59 for reading and writing, respectively).
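The per-test means and the per-student average score can be computed along these lines. The four rows and the column names below are made up for illustration, so the numbers will not match the figures reported above.

```python
import pandas as pd

# Toy scores; column names are assumptions, not the dataset's actual headers.
scores = pd.DataFrame({
    "math_score": [72, 55, 90, 61],
    "reading_score": [80, 60, 85, 70],
    "writing_score": [78, 58, 88, 69],
})
test_cols = ["math_score", "reading_score", "writing_score"]

# Average score per student: the mean across the three tests (row-wise).
scores["avg_score"] = scores[test_cols].mean(axis=1)

# Mean score per test: the mean down each column.
per_test_means = scores[test_cols].mean()

print(per_test_means)
print("Lowest-scoring test:", per_test_means.idxmin())
```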

![All 3 Scores Scatterplots](images/all_ts_per_stu.png)

I also found that several of the features, especially parental education and test preparation, had a noticeable impact on the average score.

![Bar Chart Parent Ed](images/avg_ts_vs_p_ed.png)

![Bar Chart Lunch Status](images/avg_ts_vs_lnch.png)

![Average Score vs. Parent Ed and Lunch](images/avg_ts_vs_p_ed_lnch.png)

There was also evidence that underlying socioeconomic factors may have been at play, given that lunch status and ethnic group both seemed to have some degree of impact on the average score.

![Average Score vs. Ethnic Group and Lunch Status](images/avg_ts_vs_eth_lunch.png)

Overall, however, parental education and whether the student had completed a test preparation course seemed to have the greatest impact on the average score.

![Average Score vs. Test Prep and Parent Ed](images/avg_ts_vs_p_ed_prep.png)
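Group comparisons like the ones charted above usually come down to a grouped mean. The sketch below shows the idea on a tiny made-up frame; the column names and values are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Toy data; category labels and scores are illustrative only.
df = pd.DataFrame({
    "parent_ed": ["high school", "high school", "bachelor's", "bachelor's"],
    "test_prep": ["none", "completed", "none", "completed"],
    "avg_score": [60.0, 66.0, 70.0, 78.0],
})

# Mean average score for each level of a single feature.
by_parent_ed = df.groupby("parent_ed")["avg_score"].mean()

# Mean average score for each combination of two features,
# pivoted so test_prep levels become columns.
by_both = df.groupby(["parent_ed", "test_prep"])["avg_score"].mean().unstack()

print(by_parent_ed)
print(by_both)
```

A table like `by_both` is the kind of input a grouped bar chart needs: one row per parental education level and one bar per test preparation status.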

Once the data was ready for modeling, an OLS model and a decision tree regressor were fit in the following notebook, with mixed results:
https://github.com/dbbabcock/data602_project_1_student_success/blob/main/data_prep/stu_perf_file_prep.ipynb

Unfortunately, the R^2 and RMSE of both models showed that they could not reliably predict a student's average score from the features provided. There was too much variance in the outcomes for decent predictions to be made.

<h1>Results and Discussion</h1>

Overall, neither model fit the data well using the features provided. While the features clearly had some sway over the final outcome, they explained only a small share of the variation in scores.

This was demonstrated by the relatively low R^2 value associated with each model: 0.24 for the Decision Tree model, and 0.29 for the OLS regression model.

The RMSE was also relatively high for each model: 12.64 for the Decision Tree model, and 14.48 for the OLS regression model. In other words, a typical prediction missed the actual average score by roughly 12 to 15 points.
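For reference, both metrics can be computed directly from the actual and predicted values. The helper names `r_squared` and `rmse` below are hypothetical, and the four data points are a toy example rather than project results.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by y_pred: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(round(r_squared(y_true, y_pred), 4))
print(round(rmse(y_true, y_pred), 4))
```

Because RMSE is expressed in score points, it is often the easier of the two to interpret: it approximates how far off a typical prediction is.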

![OLS Res vs. Pred Test Scores](images/ols_res_vs_pre_ts.png)

![Dec. Tree Res vs. Pred. Test Scores](images/dtree_res_vs_pre_ts.png)

While the residuals didn't show any problematic patterns, it was clear that the models could be well off in their predictions (by 40 points or more in some cases) and never predicted a score below 50.

This demonstrates the models' inability to reliably predict a student's average score given the features used here.
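One plausible reason the predictions stay in a narrow band (an assumption on my part, not something the notebooks confirm) is that every feature is categorical. A regression tree predicts the mean training score of each leaf, so students with the same feature values always receive the same prediction. The sketch below mimics that behavior with a grouped mean on made-up data; the column names and scores are hypothetical.

```python
import pandas as pd

# Toy data: one categorical feature, widely spread scores within each group.
df = pd.DataFrame({
    "lunch": ["standard"] * 4 + ["free/reduced"] * 4,
    "avg_score": [40.0, 70.0, 85.0, 95.0, 30.0, 55.0, 65.0, 80.0],
})

# Predict each student's score as the mean of their group, which is what a
# fully grown regression tree would do on its own training data here.
df["predicted"] = df.groupby("lunch")["avg_score"].transform("mean")
df["residual"] = df["avg_score"] - df["predicted"]

print(sorted(df["predicted"].unique()))   # only two distinct predictions
print(df["avg_score"].min(), df["predicted"].min())
print(df["residual"].abs().max())
```

With only a handful of possible predictions, all of them group averages, low-scoring students are necessarily over-predicted and the largest residuals can be several tens of points.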

<h1>Conclusion and Summary</h1>

In conclusion, the models were not very successful in this application. Either the features are simply weak predictors, or the generated nature of the data does not allow a real connection between the features and the scores to be demonstrated (or both).

There was too much variation in the scores that the features could not explain, which resulted in very large residual values regardless of the model chosen.

<h1>Limitations and Later Work</h1>

The dataset for this project was far more restrictive than I had initially realized. I first found it on Kaggle and thought it was real data. It was not until I did a little more digging that I realized it had been generated by another website.

Given more time, I would have preferred to select a different, real dataset from which genuine insights could be extracted. This was a good practice run for implementing decision trees, however.

For later work, I would use these models on real data with more features.