# Applied Machine Learning Student Regression 
### Summary
The purpose of this project was to predict a student's final yearly grade based on data about their previous academic performance and personal circumstances. This was treated as a supervised regression problem. We recommend the K Neighbors model with 2 neighbors and 5 features (G1, G2, Gavg, social, and absences). Features with information about the student's previous grades had the highest predictive power.

### Problem Description
The dataset consisted of 316 samples of 33 variables describing each student's academic and personal circumstances. Academic variables include the student's grades for 3 trimesters (G1, G2, G3), the number of past failed classes (failures) and absences (absences), weekly study time (studytime), and access to extracurriculars and school support. Personal variables describe the student's basic demographics (e.g., sex, age, residential status), home life (e.g., parental occupation, marital status, and quality of family relationships), and social life (e.g., how much free time they have, how much alcohol they consume).

The target variable was the student's final grade (G3), an integer from 0 to 20, with 20 being the highest grade.

The remaining 32 variables were used as features for the regression implementation. All features were discrete. There were 13 binary features, 15 numeric features, and 4 nominal features. 

(See student.txt for detailed information about features.)

### Exploratory Data Analysis (EDA):
The distributions of the features were examined.

<img src="figures/cat_features_hist.png" alt="Grades distribution" width = 800><br>
<img src="figures/num_5_hist.png" alt="Feature distribution" width =500>
<img src="figures/num_other_hist.png" alt="Feature distribution" width = 500>
<img src="figures/age_abs_hist.png" alt="Feature distribution" width = 800>


For the grade features (G1, G2, and G3), certain values were not represented in the sample. No scores of 1-4 were recorded in any of the 3 trimesters. G1 had no scores lower than 5 at all, while G2 included scores of 0 but none from 1-4. Neither G1 nor G2 had a score of 20, and only 1 sample had a G3 score of 20. This imbalance in score distribution, particularly in the target variable (G3), should be considered when creating the training and testing sets.
All 3 trimesters had comparable medians, but G3 had a much wider distribution of values. 

<img src="figures/grades_hist.png" alt="Grades distribution" width = 800>

Mutual information (MI) between G3 and each of the remaining variables was also analyzed. Several variables had an MI score of 0. This may reflect the highly discrete nature of the variables, in which case mutual information may not be a suitable metric for capturing their predictive relationships. However, the grade features G1 and G2 had high MI scores, indicating strong predictive power.
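As a rough illustration of this kind of analysis, the sketch below scores two synthetic discrete features against a target with scikit-learn's `mutual_info_regression`. The data is made up and the `discrete_features=True` setting is an assumption; the source does not state how MI was computed.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data (NOT the student dataset).
informative = rng.integers(0, 21, size=n)       # plays the role of a grade such as G2
noise = rng.integers(0, 2, size=n)              # plays the role of an unrelated binary flag
y = informative + rng.integers(-1, 2, size=n)   # target closely tracks the informative feature

X = np.column_stack([informative, noise])

# Assumption: treat every column as discrete, since all features in the
# dataset are described as discrete.
mi = mutual_info_regression(X, y, discrete_features=True, random_state=0)

for name, score in zip(["informative", "noise"], mi):
    print(f"{name}: {score:.3f}")
```

An unrelated feature is expected to score at or near 0 here, which is similar in spirit to the zero scores reported above.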

<img src="figures/mutual_info.png" alt="Mutual information" width = 800>



### Feature engineering and selection
The binary and nominal features required encoding:
* All binary features were converted to (0,1) values. For features with (yes, no) values, yes was assigned to 1, and no to 0. 
* Nominal features were tested with both One-Hot encoding and Ordinal encoding. Negligible differences in performance were seen, so Ordinal encoding was used to minimize dimensionality.

<img src="figures/comparing_encoding.png" alt="One Hot vs Ordinal">

* Binary encoding was performed before splitting the dataset, and ordinal encoding was done after splitting.
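The encoding order described above can be sketched as follows. This is a minimal example on a made-up frame: the column names follow the dataset, but the values, the split size, and the choice to map unseen categories to -1 are assumptions rather than details taken from feature_eng.py.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Tiny illustrative frame; the values are invented.
df = pd.DataFrame({
    "schoolsup": ["yes", "no", "no", "yes", "no", "yes"],
    "Mjob": ["teacher", "health", "other", "services", "at_home", "other"],
    "G3": [10, 12, 8, 15, 0, 11],
})

# Binary yes/no feature -> 1/0. This is a fixed mapping, so applying it
# before the split does not leak information from the test set.
df["schoolsup"] = df["schoolsup"].map({"yes": 1, "no": 0})

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="G3"), df["G3"], test_size=2, random_state=0
)
X_train, X_test = X_train.copy(), X_test.copy()

# Ordinal encoding is fitted on the training split only. Assumption:
# categories unseen during fitting are mapped to -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train["Mjob"] = encoder.fit_transform(X_train[["Mjob"]]).ravel()
X_test["Mjob"] = encoder.transform(X_test[["Mjob"]]).ravel()
```

Fitting the encoder after the split keeps the category-to-integer mapping dependent on training data only.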

Some new features were engineered to reduce dimensionality:
* Parent education (Pedu):  average mother and father education ((Medu + Fedu)/2)
* Total alcohol consumption (Talc): average weekend and daily alcohol consumption ((Dalc + Walc)/2)
* Average grade (Gavg): average of G1 and G2 grades ((G1 + G2)/2)
* Social score (social): sum of the 'goout', 'romantic', and 'famrel' metrics with a max value of 11
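A minimal pandas sketch of these four engineered features is shown below. The function name is hypothetical, and it assumes `romantic` has already been encoded to 0/1. Whether the source columns are kept or dropped afterwards is not specified here, so the sketch keeps them.

```python
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Pedu, Talc, Gavg, and social added."""
    out = df.copy()
    out["Pedu"] = (out["Medu"] + out["Fedu"]) / 2   # average parent education
    out["Talc"] = (out["Dalc"] + out["Walc"]) / 2   # average alcohol consumption
    out["Gavg"] = (out["G1"] + out["G2"]) / 2       # average of first two grades
    # Assumes romantic is already 0/1, so the sum tops out at 5 + 1 + 5 = 11.
    out["social"] = out["goout"] + out["romantic"] + out["famrel"]
    return out

# Made-up example row
row = pd.DataFrame([{
    "Medu": 4, "Fedu": 2, "Dalc": 1, "Walc": 3,
    "G1": 12, "G2": 14, "goout": 5, "romantic": 1, "famrel": 5,
}])
engineered = add_engineered_features(row)
```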

Standardizing the features with StandardScaler was also tested, but it resulted in worse performance across all models, so it was not applied. This may be because scaling reduces the relative weight of differences in grades, which were on a 0-20 scale.

Features were selected using the k-best features algorithm with mutual information as the metric. For the decision tree estimators, recursive feature elimination was also implemented; negligible changes in performance were seen, so k-best was used for all model types for simplicity's sake. For each model type, the ideal number of features was found based on relative error (explained in more detail in the "Model Selection" section). 
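The k-best selection step might look like the sketch below, which uses scikit-learn's `SelectKBest` with `mutual_info_regression` as the score function. The data is synthetic, and `k=2` and the fixed `random_state` are illustrative assumptions.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data: only the first two of six columns drive the target.
X = rng.integers(0, 5, size=(n, 6)).astype(float)
y = 5 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=n)

# Fix the random_state so the MI estimates are reproducible.
score_func = partial(mutual_info_regression, random_state=0)

selector = SelectKBest(score_func=score_func, k=2)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)   # column indices that were kept
```

In practice the selector would be fitted on the training split only and then applied to the testing split with `selector.transform`.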

<img src="figures/kfeatures.png" alt="Error vs. K features" width = 500>

(See feature_eng.py for details about feature engineering and encoding)


### Model Selection
Regression models were selected based on the relatively small sample size of data and the discrete nature of all features. A total of 5 regression models were tested first with default hyperparameters: 
- Random Forest
- Gradient Boosting Tree
- K Neighbors 
- SVR
- Ridge Regression

For each model, the ideal number of features was determined by selecting the k-best features starting at k = 3 and incrementing up to the maximum number of features in the set. Four performance metrics were measured for each feature set: R², mean squared error (MSE), mean absolute error (MAE), and relative error (RE). Relative error was calculated as the ratio of mean absolute error to the mean value of the target set (MAE/mean(y)). This was used in lieu of mean percentage error because the target range includes 0, which causes division by zero. The best feature set per model was determined using relative error as the metric. The 3 best-performing models were Random Forest with 20 features, K Neighbors with 5 features, and Gradient Boosting Tree with 14 features, in ascending order of RE.
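The relative error metric defined above is simple to compute directly. The function name below is hypothetical and the example values are made up.

```python
import numpy as np

def relative_error(y_true, y_pred):
    """Mean absolute error divided by the mean of the true targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return mae / np.mean(y_true)

# A true value of 0 is fine here, unlike in a per-sample percentage error.
y_true = [0, 10, 15, 11]
y_pred = [2, 10, 14, 12]
re = relative_error(y_true, y_pred)   # MAE = 1.0, mean(y_true) = 9.0
```

Because the division is by the mean of the targets rather than by each target, a single grade of 0 does not make the metric undefined.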

<img src="figures/relative_e_feature_selection.png" alt="Relative e" width = 500>
<img src="figures/r2_feature_selection.png" alt="R2" width = 500>


Hyperparameters for these 3 models were then tuned to optimize performance.

(See model_functions.py for details about evaluation process)

### Hyperparameter Tuning
For Random Forest, K Neighbors, and Gradient Boosting, hyperparameters were tuned using an exhaustive grid search with cross validation (GridSearchCV). The following hyperparameters were chosen for each model type:
* Random Forest:
    * Number of decision trees (n_estimators): 10 through 150 at intervals of 10
    * Maximum features to consider in trees (max_features): 'sqrt', 'log2', None (all features)
* K Neighbors: 
    * Number of neighbors (n_neighbors): 2 through 20 at intervals of 1
    * Distance weighting (weights): 'uniform' or 'distance' (inverse-distance weighting)
* Gradient Boosting Tree: 
    * Loss function (loss): 'squared_error', 'absolute_error', 'huber', 'quantile'
    * Number of trees (n_estimators): 10 through 150 at intervals of 10

After tuning, K Neighbors achieved the best performance across all cross-validation folds with respect to relative error with 2 neighbors and uniform weighting.
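A grid search over the K Neighbors settings listed above could be sketched as follows. The data is synthetic, and the 5-fold setting and the use of relative error as the scoring function are assumptions; the source does not state which scorer or fold count was passed to `GridSearchCV`.

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

def relative_error(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred) / np.mean(y_true)

# Lower relative error is better, so the scorer negates it internally.
re_scorer = make_scorer(relative_error, greater_is_better=False)

# Synthetic stand-in data (NOT the student dataset).
rng = np.random.default_rng(0)
X = rng.integers(0, 21, size=(120, 5)).astype(float)
y = X[:, :2].mean(axis=1) + rng.normal(scale=1.0, size=120)

param_grid = {
    "n_neighbors": list(range(2, 21)),      # 2 through 20 at intervals of 1
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(KNeighborsRegressor(), param_grid, scoring=re_scorer, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_re = -search.best_score_               # undo the sign flip from the scorer
```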

<img src="figures/relative_e_tuned.png" alt="Relative e" width = 500>
<img src="figures/r2_tuned.png" alt="R2" width = 500>

### Results and Recommendations 
The best overall performance with the fewest features was achieved with the K Neighbors model with 5 features (G1, G2, Gavg, social, absences). Random Forest and Gradient Boosting also performed well, but with a greater number of features.

With the K-Neighbors model, the following errors in prediction were achieved for each possible G3 value:

<img src="figures/error_vs_true.png" alt="Errors vs True Value" width = 700>

The model is weakest at predicting G3 scores of 0, as seen by the wide spread in error. A substantive interpretation of this uncertainty is that weak academic performance can arise through a variety of mechanisms, and is therefore difficult to predict confidently.

For the sake of minimizing dimensionality and computational demands, the K Neighbors model with 2 neighbors and the feature set [G1, G2, Gavg, social, and absences] is recommended. Additionally, to make the model more robust against imbalanced representation of target values in the testing and training set, an ensemble of K Neighbors regressions could be implemented. 


### Limitations and Next Steps 
#### Testing/training set representation
The performance of this regression may also be affected by the uneven representation of target values across the training and testing split. Only 1 sample has a G3 value of 20, so it may be represented in the testing set and not the training set, or vice versa. In the first case the model cannot learn to predict this value; in the second, the model's performance on this value cannot be evaluated. This issue may also arise for other values: scores of 5 and 6 were represented in the training set but not the testing set, so the model's performance on predicting these scores could not be evaluated.
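One way to spot this problem for a given split is to compare the sets of target values on each side. The helper below is a hypothetical sketch with made-up target arrays.

```python
import numpy as np

def missing_target_values(y_train, y_test):
    """List target values that appear on only one side of the split."""
    train_vals = set(np.unique(y_train).tolist())
    test_vals = set(np.unique(y_test).tolist())
    return {
        "only_in_train": sorted(train_vals - test_vals),  # cannot be evaluated
        "only_in_test": sorted(test_vals - train_vals),   # never seen in training
    }

# Made-up targets for illustration
y_train = np.array([0, 5, 6, 10, 11, 12])
y_test = np.array([0, 10, 20])
report = missing_target_values(y_train, y_test)
```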

To resolve this, an ensemble could be used where multiple regressions are fit to different iterations of the testing and training set, and the final predicted value would be an aggregation of the predictions of each different regression (similar to a random forest). 
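A minimal sketch of such an ensemble is shown below. It assumes repeated random splits with different seeds and a plain average as the aggregation; the function names, the number of models, and the split size are illustrative choices, not part of the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def fit_knn_ensemble(X, y, n_models=10, test_size=0.25, n_neighbors=2):
    """Fit one K Neighbors regressor per random train/test split."""
    models = []
    for seed in range(n_models):
        # Each seed yields a different training subset of the data.
        X_tr, _, y_tr, _ = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        models.append(KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_tr, y_tr))
    return models

def predict_ensemble(models, X):
    """Aggregate by averaging the predictions of all fitted models."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Synthetic stand-in data (NOT the student dataset)
rng = np.random.default_rng(0)
X = rng.integers(0, 21, size=(80, 2)).astype(float)
y = X.mean(axis=1)

models = fit_knn_ensemble(X, y)
preds = predict_ensemble(models, X[:5])
```

Averaging is only one possible aggregation; a median would be less sensitive to a single model that never saw a rare score.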

#### Predicting G3 without other grade data
Predicting G3 without previous grade data (G1, G2, and Gavg) was also attempted. The same three models (Random Forest, K Neighbors, and Gradient Boosting) performed the best during feature selection. However, after tuning and cross validation, Gradient Boosting was the best-performing model and K Neighbors was the worst. Overall performance also declined in comparison to the regression with grade features, with the best-performing tuned model achieving a relative error of 0.292, an increase of over 300%. This supports the predictive power of the grade features and indicates that K Neighbors' performance depends on the presence of these features.

<img src="figures/r2_tuned_no_grades.png" alt="R2" width = 800>
<img src="figures/relative_e_tuned_no_grades.png" alt="Relative e" width = 800><br>

<img src="figures/error_vs_true_no_grades.png" alt="Errors vs True Value" width = 800>

Errors may be reduced with more robust feature engineering of non-grade features. It is also worth exploring grade prediction as a classification problem, since the target range consists of integers 0-20. 
