## DELIVERABLES 
- Perform EDA with visualizations to assist in feature selection and engineering
- Convert the CSV to a Postgres database and create a Python function to easily add new
students to the Postgres database. Ensure the database schema has appropriate
data types.
- Select and/or engineer the feature set you will use for your model
- Select and train a model to predict the final grade of a student.
- Tune the model to get the best results possible
- Generate valid metrics to evaluate your model

## BONUS 
- Create another model that predicts final grades without using any of the previous
period grades. This would be immensely helpful to the company to aid in predicting
student performance before they fall behind
- Write a full Data Analysis report on the statistics gleaned from the dataset
- Include documentation used to plan out this project and its timeline.

## REQUIREMENTS
- All code shall be written in Python or SQL
- All code shall be managed via git
- The repository shall be named ‘<first_name>-<last_name>-aml-student-regression’ (all
lowercase, e.g. Tom Cruise’s repository name is ‘tom-cruise-aml-capstone’)
- Only libraries inherent to Python or listed below can be used
- The Postgres database should be managed via docker
- Only data from the given dataset will be used

## OUTLINE 
1. CSV to Postgres
    - DONE CSV to Postgres DB
    - Function to add new students to DB 
2. EDA 
    - DONE Preprocessing: 
        - Handle NAs (imputing or removing)
        - Remove duplicates 
    - Visuals:
        - Distributions (histogram or boxplot)
        - Correlation/MI (heat map)
3. Feature engineering
    - DONE Ratios for highly collinear features (n/a: too many categorical features)
    - DONE Log transform to normalize distributions (n/a: all features are discrete)
    - Scaling (do after test/train split to prevent data leakage)
    - Feature selection (probably k best?) with different number of features
4. Model implementation and evaluation
    - DONE Make test/train set
    - Implement pipelines
    - Normalize
    - Test different models
        - Hyperparameter tuning
        - Feature combos
    - Evaluate
        - Accuracy, precision, recall, F1
        - Cross validation? 
5. Repeat for model without grade data! 
6. Data Analytics report
    - EDA 
    - Data preprocessing
    - Model selection
    - Model tuning 
    - Model without grades
    - Analysis/conclusion
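
Steps 3 and 4 of the outline above can be sketched as one scikit-learn pipeline. This is a minimal sketch, not the project's actual code: the data is synthetic (`make_regression`), and the choice of `Ridge`, the `k` values, and the alpha grid are placeholder assumptions. Because the scaler and the selector sit inside the pipeline, they are fit only on the training folds, which is what keeps test data from leaking into preprocessing.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the student data (assumption: not the real dataset).
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Split first so nothing fitted later ever sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scaling and feature selection are pipeline steps, so cross-validation
# refits them on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),
])

# Placeholder grid: number of features to keep and ridge strength.
param_grid = {
    "select__k": [3, 5, 10],
    "model__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)

# Evaluate the best pipeline once on the held-out test set.
test_mae = mean_absolute_error(y_test, search.predict(X_test))
print(search.best_params_, round(test_mae, 2))
```

Swapping `Ridge()` for another regressor only changes the last step and the `model__` entries in the grid, which makes it easier to compare models on the same preprocessing.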
    

# DAILY TASK LOG / NOTES
## Friday 1/12
* Loaded data into python and started EDA 
* Lots of categorical features -- need to figure out best way to handle

## Monday 1/15
* Got SQL db up and running
    * To start docker: 
        1. Open Docker app
        2. Be in directory of docker-compose.yaml
        3. Command line: docker-compose up
    * To get into pgadmin: 
        1. Open PGAdmin
        2. Add new server
        3. Put in user and password, server name = localhost 
    * Started function to add new students but didn't finish
* Did more EDA 
* Feature engineering
    * Made total alcohol consumption + parental education metrics
    * Encoding nominal data and binary data
* Testing models 
    * Feature selection algorithm
    * Looking at logistic, SVC, KNeighbors ? 
### Issues
* Need to figure out if I encode before or after selecting K best features -- if before, then how do I encode in a way that preserves the feature info
    * Also, is it ok to just select parts of the encoded feature for the regression or do I need the whole thing lols
* Implementing pipeline? 
* Need to clarify if this is classification
* Do we need to normalize values? 
## Tuesday 1/16
* Made visuals for EDA
* Finished postgres function 
    * Could clean up so you don't need to feed it 33 parameters lol; feed it a dictionary instead?
* Hyperparameter tuning
    * Alpha for ridge: higher alpha means stronger regularization, which reduces model complexity (less overfitting)
* Got 5 best models
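
The dictionary idea above might look something like this. It is a minimal sketch, assuming a DB-API cursor that uses `%s` placeholders (as psycopg2 does); the function names, the `students` table name, and the three columns in the usage are hypothetical. Column names are checked against an allow-list because identifiers cannot be passed as query parameters.

```python
def build_insert(table, row, allowed_columns):
    """Return (sql, params) for a parameterized INSERT built from a dict."""
    unknown = set(row) - set(allowed_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    columns = list(row)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    # Values travel separately from the SQL text, so the driver escapes them.
    return sql, tuple(row[c] for c in columns)


def add_student(cursor, row, allowed_columns, table="students"):
    """Insert one student through any DB-API cursor with %s placeholders."""
    sql, params = build_insert(table, row, allowed_columns)
    cursor.execute(sql, params)


# Hypothetical column names, for illustration only.
ALLOWED = ("school", "sex", "age")
sql, params = build_insert(
    "students", {"school": "GP", "sex": "F", "age": 17}, ALLOWED)
print(sql)
print(params)
```

With a real connection, `add_student(conn.cursor(), row, ALLOWED)` followed by `conn.commit()` would replace the 33 positional arguments with a single dict.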
### Issues
* Should I customize feature selection for each model type? I think yes but not sure...
* How to deal with percent error calculation
    * Currently handling as MAE / mean
    * How to deal with 0 true value
* Understand how docker works
* When tuning hyperparameters, I get worse performance than just the default settings? 
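
The "MAE / mean" percent error noted above, written out. This is a minimal sketch with made-up function names and made-up grades. Dividing by the mean of the true values, rather than by each individual true value, avoids the division by zero that a per-sample percent error hits when a true grade is 0, as long as the mean itself is not zero.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_over_mean(y_true, y_pred):
    """MAE expressed as a fraction of the mean true value."""
    mean_true = sum(y_true) / len(y_true)
    if mean_true == 0:
        raise ValueError("mean of true values is zero; ratio is undefined")
    return mean_absolute_error(y_true, y_pred) / mean_true


# Made-up grades, including a true value of 0.
y_true = [0, 10, 12, 8]
y_pred = [2, 9, 12, 11]
print(mae_over_mean(y_true, y_pred))
```

Here the absolute errors sum to 6 over 4 students (MAE 1.5) and the mean true grade is 7.5, so the error is one fifth of the mean grade.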
## Wednesday 1/17
* Goals:
    * Have all models done
    * Make visuals 
    * Start report
* Started no grades model -- very bad performance so far
* SGD breaks with hyperparameter tuning for some reason
## Thursday 1/18
* Goals:
    * First draft of report done 
    * Visuals done

## Friday 1/19
* Goals:
    * Clean up visuals 
    * Clean up report
    * Final test run
    * All done!!!