## Student Grades Regression Analysis
### Process Outline

**ETL**

* Handled raw data loading and database-related requests before any data interpretation or manipulation
    * Rationale: if the client wants to be able to update the dataset, they would presumably want the predictive model to run on the most up-to-date version of the data
* Created the database with SQLAlchemy
* Implemented the other database functions with psycopg2
* Ideally a single library would handle all database functions, but this was left as-is for the sake of time


**EDA**

* Examined the schema in detail to identify nominal, ordinal, and numeric features
* Studied the categorical features and encoded them at this stage to facilitate analysis and visualization
* Chose encoding schemes that handle the large number of categorical features while minimizing the increase in dimensionality
* Looked at:
    * Overall feature distribution
    * Pairwise correlations across all features
    * Relationships among the most highly correlated features
    * Mutual information scores
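
The EDA steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names (`school`, `studytime`, `absences`, `prior_grade`, `final_grade`) are hypothetical stand-ins, and ordinal encoding is assumed here as one encoding that adds no extra columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the student data; all column names are hypothetical.
df = pd.DataFrame({
    "school": rng.choice(["A", "B"], size=n),      # nominal
    "studytime": rng.integers(1, 5, size=n),       # ordinal, already numeric
    "absences": rng.poisson(4, size=n),            # numeric
    "prior_grade": rng.integers(0, 21, size=n),    # numeric
})
df["final_grade"] = np.clip(df["prior_grade"] + rng.normal(0, 2, size=n), 0, 20)

# Encode the nominal column as integer codes (assumption: ordinal encoding,
# which keeps the number of columns unchanged).
df["school"] = OrdinalEncoder().fit_transform(df[["school"]]).ravel()

X = df.drop(columns="final_grade")
y = df["final_grade"]

# Overall feature distribution.
distribution_summary = X.describe().T

# Pairwise correlations, then features ranked by |correlation| with the target.
correlation_matrix = df.corr()
target_correlation = (
    correlation_matrix["final_grade"]
    .drop("final_grade")
    .abs()
    .sort_values(ascending=False)
)

# Mutual information between each feature and the target.
mi_scores = pd.Series(
    mutual_info_regression(X, y, random_state=0), index=X.columns
).sort_values(ascending=False)

print(distribution_summary)
print(target_correlation)
print(mi_scores)
```

Mutual information is worth reading alongside correlation because it can also pick up non-linear dependence that a linear correlation coefficient would miss.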

**Feature Engineering**

* Tested different feature sets concurrently with different estimators so that good feature-set / estimator combinations would not be missed
* Used insights from examining the dataset to come up with candidate feature sets
* Tried:
    * Isolating features with most mutual information
    * Dropping features with no mutual information
    * Dropping features with least variance
    * Including engineered 'mean grades' feature
    * Recursive feature elimination
    * Univariate selection with SelectKBest
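
One way the candidate feature sets listed above could be assembled with scikit-learn is sketched below. The data is synthetic, and the column names, thresholds, and values of `k` are assumptions chosen for illustration rather than the values used in the project.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    VarianceThreshold,
    mutual_info_regression,
)

rng = np.random.default_rng(0)
n = 200
ability = rng.integers(5, 16, size=n)

# Synthetic features; names are hypothetical.
X = pd.DataFrame({
    "grade_1": ability + rng.integers(-2, 3, size=n),
    "grade_2": ability + rng.integers(-2, 3, size=n),
    "absences": rng.poisson(4, size=n),
    "noise": rng.normal(size=n),
    "constant": np.ones(n),
})
y = ability + rng.normal(0, 1, size=n)

# Engineered 'mean grades' feature.
X["mean_grades"] = X[["grade_1", "grade_2"]].mean(axis=1)

feature_sets = {"all": list(X.columns)}

# Drop the lowest-variance features (here: zero variance only).
variance_filter = VarianceThreshold(threshold=0.0).fit(X)
kept = list(X.columns[variance_filter.get_support()])
feature_sets["drop_zero_variance"] = kept

# Mutual-information-based sets.
mi = pd.Series(mutual_info_regression(X[kept], y, random_state=0), index=kept)
feature_sets["drop_zero_mi"] = list(mi[mi > 0].index)
feature_sets["top_mi"] = list(mi.nlargest(2).index)

# SelectKBest scored by mutual information (seed fixed for repeatability).
score_func = partial(mutual_info_regression, random_state=0)
k_best = SelectKBest(score_func=score_func, k=3).fit(X[kept], y)
feature_sets["k_best"] = list(X[kept].columns[k_best.get_support()])

# Recursive feature elimination around a random forest.
rfe = RFE(
    RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=3,
).fit(X[kept], y)
feature_sets["rfe"] = list(X[kept].columns[rfe.get_support()])

for name, columns in feature_sets.items():
    print(f"{name}: {columns}")
```

Keeping every candidate as a named list of columns makes it straightforward to loop over the sets later when comparing estimators.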

**Model Selection**

* Wrote a validation function to test a set of candidate feature sets against all candidate estimators
    * Performed two rounds of candidate estimator validation
    * Chose suite of estimators based on scikit-learn's algorithm selection flowchart
    * Round 1: SVR (linear), SVR (rbf), Lasso, ElasticNet, Ridge, Random Forest (winner)
    * Round 2: Random Forest, Linear Regression, Gradient Boosting, Voting Regressor
    * Random Forest performed best across all metrics
* Hyperparameter tuning: chose randomized search, which is well suited to models with many hyperparameters when it is not known in advance which ones matter most
    * Tuning had minimal effect on validation performance
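
The validation loop and the randomized search described above might look something like the sketch below. The function name `validate`, the synthetic data, the estimator settings, and the search ranges are all assumptions for illustration; the project's own validation function may differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import RandomizedSearchCV, cross_validate

SCORING = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}


def validate(feature_sets, estimators, X, y, cv=5):
    """Cross-validate every (feature set, estimator) pair; return a sorted table."""
    rows = []
    for set_name, columns in feature_sets.items():
        for est_name, estimator in estimators.items():
            scores = cross_validate(estimator, X[columns], y, cv=cv, scoring=SCORING)
            rows.append({
                "features": set_name,
                "estimator": est_name,
                "mae": -scores["test_mae"].mean(),    # flip sign back to an error
                "rmse": -scores["test_rmse"].mean(),
                "r2": scores["test_r2"].mean(),
            })
    return pd.DataFrame(rows).sort_values("mae").reset_index(drop=True)


# Synthetic data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 150
X = pd.DataFrame({
    "prior_grade": rng.integers(0, 21, size=n).astype(float),
    "absences": rng.poisson(4, size=n).astype(float),
    "noise": rng.normal(size=n),
})
y = X["prior_grade"] - 0.2 * X["absences"] + rng.normal(0, 1.5, size=n)

feature_sets = {
    "all": ["prior_grade", "absences", "noise"],
    "no_noise": ["prior_grade", "absences"],
}
estimators = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = validate(feature_sets, estimators, X, y)
print(results)

# Randomized search over a few random forest hyperparameters, using the
# feature set on which the random forest scored best above.
rf_rows = results[results["estimator"] == "random_forest"]
best_columns = feature_sets[rf_rows.iloc[0]["features"]]
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 100),
        "max_depth": randint(2, 10),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X[best_columns], y)
print(search.best_params_, -search.best_score_)
```

Randomized search samples a fixed number of configurations (`n_iter`) from the given distributions, so its cost stays bounded no matter how many hyperparameters are included.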


**Metrics**

* Metric selection: mean absolute error (MAE), root mean squared error (RMSE), R^2
* Chose metrics that are commonly used in regression problems and easy to interpret
* Paired MAE with RMSE to get two readings, both in the units of the target, on how much outliers matter (outliers appear in e.g. 'absences', a feature included in many of the candidate sets)
* Included R^2 because most independent variables were pared down in feature selection and the most important features had a roughly linear relationship with the target
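
To illustrate why MAE and RMSE are informative as a pair, the small sketch below computes all three metrics by hand with NumPy on made-up numbers (the helper name `regression_metrics` and the values are purely illustrative). Two sets of predictions have the same MAE, but the one whose error is concentrated in a single large miss has a noticeably higher RMSE and a lower R^2.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


y_true = [10, 12, 14, 16, 18]

# Every prediction is off by exactly 1.
even = regression_metrics(y_true, [11, 11, 15, 15, 19])

# Four perfect predictions and one that is off by 5.
outlier = regression_metrics(y_true, [10, 12, 14, 16, 13])

print(even)     # mae 1.0, rmse 1.0, r2 0.875
print(outlier)  # mae 1.0, rmse ~2.236, r2 0.375
```

A large gap between RMSE and MAE is therefore a quick signal that a few large errors, rather than many small ones, dominate the total error.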