# CSCE 623 Homework Assignment 5

## Feature Transformation & Dimensionality's Curse

## Instructions

In this assignment, you'll

- conduct transformations on feature inputs and target outputs
- experiment with and select hyperparameters to find the model that best fits a dataset of baseball players for predicting a player's salary

Several demos have been provided for your use:
- [Scaling Demo](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_scaling.ipynb)
- [RFECV Demo](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_RFECV.ipynb)
- [Ridge, Lasso, ElasticNet Demo](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_ridge_lasso_elasticnet.ipynb)

Using the data transformation pipeline in the scikit-learn library, you'll produce a table which lists some of the models you attempted, any applicable parameters you used, and the features used in the model. See an example table below.


Specific expectations:


* Perform your data analysis on the [ISLR_Hitters.csv](https://github.com/afit-csce623-master/datasets/blob/main/hw4_ISLR_Hitters.csv) dataset. 
* Separate the data into training and test sets
* Choose appropriate transformations for the input features. You may choose from the following transformations:
  * [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
  * [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
  * [PowerTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html)
  * [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
  * Custom encoder of your own creation (you are welcome to incorporate other sources)
* Incorporate your chosen transformations into a scikit learn [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)
  
  Note: Though you can explore the effect of different transformations on the input features, this is not expected. Instead you should choose your input transformations well and put your primary effort into evaluating a variety of models

* You will conduct the following steps using a variety of models. Ideally, you'll explore at least one range of model parameters using loops or nested loops to find optimal hyperparameters. You may choose from the following:
  * [Linear](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model
  * [Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html) or [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) models with a range of $\alpha$ values
  * [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) or [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) models with a range of $\alpha$ values
  * [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html) or [ElasticNetCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html) models with a range of $\alpha$ or $l$1_ratio values
  * Any of these models with subsets of the 22 feature columns (the CSV has 21 columns, which leaves 19 features after the names and salary are removed; this grows to 22 features after the three two-valued categorical features are changed to one-hot categories)
* Incorporate the column transformation with each selected model into a scikit learn [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)
* If necessary, select or create a target transformation and incorporate both your target transformation and your Pipeline into a [TransformedTargetRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.compose.TransformedTargetRegressor.html)
* Capture the best version of each model you create in a table such as the one below.
  
  Note: Provide the cross-validation RMSE followed by the model's RMSE and the $R^2$ score on the test set. RMSE units should be in the same units as the original dataset. Every entry in the table should be supported by code in your notebook

* Answer the questions in the cell with the "Student Answers" header
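
The expectations above chain together as ColumnTransformer -> Pipeline -> TransformedTargetRegressor. The following is only a minimal sketch of that chain, not a reference solution: it runs on a small synthetic frame that borrows three Hitters column names, and the choice of scalers, the `log1p`/`expm1` target transform, the 80/20 split, and the 5-fold setting are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Hitters data (made-up values, borrowed column names).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Hits": rng.integers(1, 240, n),
    "Years": rng.integers(1, 25, n),
    "League": rng.choice(["A", "N"], n),
})
y = pd.Series(
    np.exp(4.0 + 0.005 * X["Hits"] + 0.05 * X["Years"] + rng.normal(0, 0.3, n)),
    name="Salary",
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# One transformer per group of columns.
column_tf = ColumnTransformer([
    ("num", StandardScaler(), ["Hits", "Years"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["League"]),
])
pipe = Pipeline([("prep", column_tf), ("model", LinearRegression())])

# The target is log-transformed for fitting and inverted automatically on predict,
# so every score below is in the original salary units.
model = TransformedTargetRegressor(regressor=pipe, func=np.log1p, inverse_func=np.expm1)

cv_rmse = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
).mean()
model.fit(X_train, y_train)
pred = model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, pred))
test_r2 = r2_score(y_test, pred)
print(f"CV RMSE {cv_rmse:.2f} | Test RMSE {test_rmse:.2f} | Test R^2 {test_r2:.4f}")
```

Because the whole chain is a single estimator, swapping `LinearRegression()` for another model leaves the rest of the code unchanged.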




Example Completed Table:

| Algorithm                                           | # Features |  CV RMSE | Test RMSE | Test $R^2$ | AtBat    | Hits     | HmRun    |  Runs    | RBI      | Walks    | Years    | CAtBat   | CHits    | CHmRun   | CRuns    | CRBI     | CWalks   | League   | Division | PutOuts  | Assists  | Errors    | NewLeague |
|-----------------------------------------------------|-----------:|-----------:|:---------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|
| Linear                                              |         19 |     305.23 |    356.44 |   0.1805 |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x     |     x     |
| Best Subset <br/>Linear                             |          7 |     300.41 |    338.01 |   0.2631 |          |          |     x    |     x    |          |          |          |     x    |     x    |          |          |     x    |          |          |     x    |     x    |          |           |           |
| RFECV                                               |         15 |     305.45 |    363.32 |   0.1485 |     x    |     x    |     x    |     x    |          |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |          |           |           |
| RidgeCV <br/>$\alpha$=10                            |         19 |     316.47 |    342.79 |   0.2421 |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x     |     x     |
| LassoCV <br/>$\alpha$=0.01                          |         10 |     314.22 |    350.08 |   0.2095 |          |     x    |     x    |     x    |     x    |          |     x    |          |          |     x    |          |     x    |          |     x    |     x    |     x    |          |           |           |
| ElasticNetCV <br/>$\alpha$=0.01 <br/>$l$1ratio=0.70 |         11 |     313.86 |    350.26 |   0.2086 |          |     x    |     x    |     x    |     x    |          |     x    |          |          |     x    |          |     x    |          |     x    |     x    |     x    |          |     x     |           |
| Best Set <br/>ElasticNet <br/>$\alpha$=0.01 <br/>$l$1ratio=0.70 | 6 | 316.39 | 348.27       | 0.2281 |     x    |          |     x    |     x    |          |          |          |          |          |          |          |     x    |          |          |     x    |     x    |          |           |           |

## Hints

Here is a possible workflow, if you're looking for one.

- Load data (this step has already been accomplished for you)
- Review the [Scaling Demo](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_scaling.ipynb)
- Sequester Test Data
- Analyze Training Data and decide on an appropriate transformation for each feature
- Build ColumnTransformer
- Create a list of column names meaningful to your transform (this will make it easier to refer to your columns later)
- Decide on an appropriate transformation for your target data (but don't build or apply the transformation, yet)
- First Model
  - Build a ColumnTransformer -> Pipeline -> TransformedTargetRegressor (use a Linear Regression model)
  - Fit your model, collect RMSE for the train and test set, $R^2$ for the test set
- Subsequent models
  - Review the [Ridge, Lasso, ElasticNet Demo](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_ridge_lasso_elasticnet.ipynb)
  - Try RidgeCV, LassoCV, or ElasticNetCV. Note that with these models, you don't need loops to iterate over hyperparameters. Instead, pass the range of $\alpha$ and $l$1_ratio values, if applicable, to the model as an array (you'll need a new ColumnTransformer -> Pipeline -> TransformedTargetRegressor workflow, but only the model in the Pipeline should need to change)
  - Try using [subsets of features](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_scaling.ipynb#scrollTo=2Ju0idShanht) with a Linear model
  - Combine subsets of features while also searching over a range of $\alpha$ and $l$1_ratio values with ElasticNet
  - Try RFECV with the [RFECV demo](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_RFECV.ipynb)
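
As a rough illustration of the "Subsequent models" hint, the sketch below passes arrays of candidate values straight to RidgeCV, LassoCV, and ElasticNetCV and changes only the model step of the pipeline. The data comes from `make_regression`, and the grids (`alphas`, `l1_ratios`), the shifted target, and the single `StandardScaler` step are assumptions chosen to keep the example short. They are not recommended settings for the Hitters data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem; the target is shifted to be non-negative so log1p is defined.
X, y = make_regression(n_samples=150, n_features=8, n_informative=4, noise=10.0, random_state=0)
y = y - y.min()

alphas = np.logspace(-3, 2, 20)          # hypothetical alpha grid
l1_ratios = [0.1, 0.5, 0.7, 0.9, 1.0]    # hypothetical l1_ratio grid

candidates = {
    "RidgeCV": RidgeCV(alphas=alphas),
    "LassoCV": LassoCV(alphas=alphas, cv=5, max_iter=10_000),
    "ElasticNetCV": ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5, max_iter=10_000),
}

chosen = {}
for name, estimator in candidates.items():
    # Only the "model" step differs between candidates.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    ttr = TransformedTargetRegressor(regressor=pipe, func=np.log1p, inverse_func=np.expm1)
    ttr.fit(X, y)

    # The fitted pipeline lives in regressor_; the CV model stores its selected values.
    fitted = ttr.regressor_.named_steps["model"]
    best_l1 = fitted.l1_ratio_ if name == "ElasticNetCV" else None
    chosen[name] = (fitted.alpha_, best_l1)
    print(name, chosen[name])
```

The selected values can then seed a narrower manual search, for example when combining ElasticNet with feature subsets.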

## FAQ

1. Do I have to use the scikit-learn data transformation pipeline? I think I can implement the transformations on my own and save myself the trouble.

  With basic coding skills, you can implement the transforms yourself, but I advise against it for three reasons.
 
  - Every keystroke of code that you type is a potential source of error. Some errors are readily caught by the interpreter. Others are syntactically correct but semantically wrong. When you implement your own transformations, you have to verify that they are correct. Custom implementation increases the complexity of your code and the amount of time you must spend debugging, testing, and maintaining it.
  - When you implement your own transformations, you must remember to apply them every time you are using the features to train or evaluate a model. When using transformations on the target data, you must remember to apply them when you are training your model, but you must invert the transformation to evaluate and score your model. The scikit-learn transformation workflow makes it easy to link all of these behaviors so that they are applied whenever you fit, predict, or score a model. Implementing these transformations yourself increases the likelihood that one or more of them is dropped inadvertently when refactoring or updating your code. A dropped transformation can be a challenging bug to identify and fix.
  - The simplicity of the scikit-learn workflow, after you learn it, increases the probability you'll actually apply a transformation when you think you need it. If you had to implement your own transformation or go find some old code where you had previously implemented something similar, you're less likely to implement a transformation in marginal cases. You're even less likely to try several transformations to compare performance. On the other hand, if you learn the scikit-learn workflow, you're more likely to apply a transformation and possibly compare the results of several transformations.
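
To make the second reason concrete, here is a small sketch comparing the two routes on made-up data. The manual route has to remember both the forward transform before `fit` and the inverse after `predict`. The wrapped route attaches both to the estimator. The log/exp pair and the synthetic target are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
# Hypothetical right-skewed, strictly positive target.
y = np.exp(1.0 + 0.3 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 0.1, 100))

# Manual route: transform before fitting, then remember to invert after predicting.
manual = LinearRegression().fit(X, np.log(y))
manual_pred = np.exp(manual.predict(X))

# Wrapped route: the same two steps travel with the estimator.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
wrapped.fit(X, y)
wrapped_pred = wrapped.predict(X)

# Both routes should agree; only the bookkeeping differs.
print(np.allclose(manual_pred, wrapped_pred))
```

Forgetting the `np.exp` call in the manual route would silently produce predictions on the log scale, which is exactly the kind of dropped transformation described above.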

2. Help! I just realized it will take _6 years_ to run my hyperparameter search.

  Welcome to ***Dimensionality's Curse!*** You can potentially speed things up by parallelizing the problem, assigning parts of the analysis to different processors, but the run time of your problem will grow much faster with increases in dimensionality than you can reduce it by adding processors. By way of example, consider the following set of options:

  - 2 model options: Linear and ElasticNet
  - 10-fold cross-validation
  - 8 options for ElasticNet alpha
  - 10 options for ElasticNet l1_ratio
  - 22 features with best subset selection

  The above options require 3,397,386,240 model fits in total. Even if a single fit takes just 0.008 seconds, evaluating every combination takes about 27 million seconds, which is roughly 7,500 hours (about 10 months) of continuous computation.
  - Linear: 10 * 4,194,304 = 41,943,040
  - ElasticNet: 10 * 8 * 10 * 4,194,304 = 3,355,443,200

  You'll have to smartly down-select from the above options in order to achieve timely results.
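
The counts in this example can be reproduced with a few lines of arithmetic. The sketch below takes the option list at face value (including the empty feature set in the $2^{22}$ subsets, as the per-model lines do) and assumes a single processor at 0.008 seconds per fit. It converts the total to hours and days so the units are explicit.

```python
n_features = 22
n_subsets = 2 ** n_features      # 4,194,304 feature subsets (empty set included)
folds = 10
n_alphas, n_l1_ratios = 8, 10
seconds_per_fit = 0.008          # assumed cost of one fit on one processor

linear_fits = folds * n_subsets
elastic_fits = folds * n_alphas * n_l1_ratios * n_subsets
total_fits = linear_fits + elastic_fits
total_seconds = total_fits * seconds_per_fit

print(f"Linear fits:     {linear_fits:,}")
print(f"ElasticNet fits: {elastic_fits:,}")
print(f"Total fits:      {total_fits:,}")
print(f"Run time: {total_seconds:,.0f} s = {total_seconds / 3600:,.0f} h "
      f"= {total_seconds / 86400:,.0f} days")
```

Each additional one-hot column doubles `n_subsets`, while doubling the number of processors only halves the run time, which is why down-selection matters more than parallelism here.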

3. What are some ways that I can reduce the number of models I have to fit?

  - If looking at best subsets of features, rather than considering the complete powerset of features, consider those that are correlated with salary. Consider using a [correlation heat map](https://colab.research.google.com/github/afit-csce623-master/demos/blob/main/demo_RFECV.ipynb#scrollTo=b9jMDC2P3s_l&line=1&uniqifier=1) to identify candidate features. Remember that both positive and negative values are potentially meaningful. The key attribute in correlation is distance to zero / closeness to either -1.0 or 1.0.
  - Reduce the space of features and hyperparameter values over which you're searching on the basis of other searches. For example, consider limiting the features you examine to those kept by RFECV. Narrow your exhaustive search over $\alpha$ and $l$1_ratio values to those near where ElasticNetCV was optimized.
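
One possible way to act on the correlation suggestion is to rank features by the absolute value of their correlation with the target and search subsets of a shortlist only. The sketch below uses a synthetic frame with four borrowed column names. The 0.2 cutoff and the data-generating coefficients are arbitrary assumptions, not values derived from the Hitters data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: two features drive the target, two are pure noise.
rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "CHits": rng.normal(700, 300, n),
    "Walks": rng.normal(40, 20, n),
    "Errors": rng.normal(8, 6, n),
    "Assists": rng.normal(120, 100, n),
})
df["Salary"] = 0.6 * df["CHits"] - 5.0 * df["Walks"] + rng.normal(0, 100, n)

# Rank by |correlation|: strong negative values matter as much as strong positive ones.
corr = df.corr()["Salary"].drop("Salary")
ranked = corr.abs().sort_values(ascending=False)

# Keep features whose |correlation| clears a hypothetical cutoff.
cutoff = 0.2
shortlist = ranked[ranked >= cutoff].index.tolist()

n_all = df.shape[1] - 1
print(ranked.round(3))
print("Shortlist:", shortlist)
print("Non-empty subsets to search:", 2 ** len(shortlist) - 1, "instead of", 2 ** n_all - 1)
```

Highly correlated features can still be redundant with one another, so a shortlist like this is a starting point for the subset search rather than a final answer.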


# Student Answers

1. Insert your summary model performance table in the text cell below. Hint: with the limited width available in the Colab IDE, it can be helpful to manipulate the table in a separate text editor.

2. If required to use a model to predict baseball player salaries, which model would you use? Why would you choose it? Is it a good model? What are its strengths or weaknesses?

   <font color="green">Student Answer</font>

3. Discuss any relationships between features and the models you created and recorded. Are there some features that seem more important than others? What features are likely to be important in a successful model? What features could be interchanged with others (because they are highly correlated and contain similar information)? What do you think would be needed to improve your model?

   <font color="green">Student Answer</font>

4. After applying a one-hot encoding to the feature set, there are 22 features. There are 4,194,303 combinations of features that can be evaluated. Using the Google Colab servers, it would take approximately 48 hours just to evaluate a Linear regression model on each of these combinations using 10-fold cross-validation. Discuss how you dealt with the curse of dimensionality in order to optimize your search for a good model.

   <font color="green">Student Answer</font>



<< Student Model Performance Table >>

| Algorithm                                           | # Features | Train RMSE | Test RMSE |   $R^2$  | AtBat    | Hits     | HmRun    |  Runs    | RBI      | Walks    | Years    | CAtBat   | CHits    | CHmRun   | CRuns    | CRBI     | CWalks   | League   | Division | PutOuts  | Assists  | Errors    | NewLeague |
|-----------------------------------------------------|-----------:|-----------:|:---------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:---------:|:---------:|
| Template                                            |         19 |            |    180.00 |   0.9999 |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x    |     x     |     x     |


# Implementation

## Configuration & Imports

Add all packages you'll use here

In [None]:
from IPython.display import Markdown as md
from IPython.display import display, Math, Latex
from itertools import chain, combinations
from contextlib import suppress
import time
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import train_test_split, RepeatedKFold, cross_val_score
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, PowerTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV, ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

## Helper Functions

In [None]:
# converts a duration in seconds to the format HH:MM:SS (hours keep counting past 24)
def convert_to_time(seconds):
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'

In [None]:
# prints a progress line: a new best score is kept on its own line,
# any other score is overwritten by the next call
def print_progress_str(start_time, trial_time, total_trials, current_trial_idx, feature_set_name_str, best_score, current_score, params=' '):

    now = time.perf_counter()
    time_elapsed_str = convert_to_time(now - start_time)
    time_trial_sec = now - trial_time

    if params != ' ':
        params = ' ' + params + ' '

    status_str = f'{time_elapsed_str} {time_trial_sec:.2f}s {current_trial_idx} of {total_trials}:{params}{feature_set_name_str} {current_score}'
    # estimate the remaining time from the average time per completed trial
    # (max(..., 1) avoids a division by zero if the trial index starts at 0)
    remaining_time = (now - start_time) / max(current_trial_idx, 1) * (total_trials - current_trial_idx)

    if current_score < best_score:
        print('\r', status_str)
    else:
        print('\r', status_str, f'Time Remaining: {convert_to_time(remaining_time)}', end='')

In [None]:
# return the powerset downselected by sets having between j and k, inclusive, elements
# adapted from https://docs.python.org/3/library/itertools.html#itertools-recipes

def powerset(iterable, j=1, k=999):
    "powerset([1,2,3]) --> (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)

    # generate only the requested sizes instead of building the full powerset and filtering;
    # sizes start at 1 at the lowest, as we don't need the empty set
    sizes = range(max(j, 1), min(k, len(s)) + 1)
    return list(chain.from_iterable(combinations(s, r) for r in sizes))
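
Before calling `powerset` on the full feature list, it can help to know how many subsets a given `j` and `k` will produce. The sketch below is a back-of-the-envelope check using `math.comb`. The helper name `subset_count` and the small feature list are hypothetical and only serve the illustration.

```python
from itertools import combinations
from math import comb

n_features = 22

# The number of subsets with between j and k features is the sum of C(n, r) for r = j..k.
def subset_count(n, j, k):
    return sum(comb(n, r) for r in range(max(j, 1), min(k, n) + 1))

all_subsets = subset_count(n_features, 1, n_features)   # every non-empty subset
small_subsets = subset_count(n_features, 1, 3)          # at most three features
print(f"{all_subsets:,} non-empty subsets in total")
print(f"{small_subsets:,} subsets with one to three features")

# Cross-check the formula against explicit enumeration on a small feature list.
small = ["Hits", "Runs", "Years", "CRBI"]
enumerated = [c for r in range(2, 4) for c in combinations(small, r)]
print(len(enumerated) == subset_count(len(small), 2, 3))
```

Capping `k` at a small value keeps the search tractable, at the cost of never testing larger feature sets.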

## Global Constants

In [None]:
# Global Constants
KFOLD = 5


## Load Data

Load the _Hitters_ dataset provided by the _ISLR_ text.

- Be sure to clean up the data by removing rows with missing values
- Conduct minimal analysis of the dataset to become familiar with the data types and ranges. Leave more thorough analysis for the training set.

***This step is already completed for you***.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/afit-csce623-master/datasets/main/hw4_ISLR_Hitters.csv').dropna()
df.info()
display(df.head())
display(df.describe())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 263 entries, 1 to 321
Data columns (total 21 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  263 non-null    object 
 1   AtBat       263 non-null    int64  
 2   Hits        263 non-null    int64  
 3   HmRun       263 non-null    int64  
 4   Runs        263 non-null    int64  
 5   RBI         263 non-null    int64  
 6   Walks       263 non-null    int64  
 7   Years       263 non-null    int64  
 8   CAtBat      263 non-null    int64  
 9   CHits       263 non-null    int64  
 10  CHmRun      263 non-null    int64  
 11  CRuns       263 non-null    int64  
 12  CRBI        263 non-null    int64  
 13  CWalks      263 non-null    int64  
 14  League      263 non-null    object 
 15  Division    263 non-null    object 
 16  PutOuts     263 non-null    int64  
 17  Assists     263 non-null    int64  
 18  Errors      263 non-null    int64  
 19  Salary      263 non-null    f

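|   | Unnamed: 0 | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|--:|:-----------|------:|-----:|------:|-----:|----:|------:|------:|-------:|------:|-------:|------:|-----:|-------:|:------:|:--------:|--------:|--------:|-------:|-------:|:---------:|
| 1 | Alan Ashby | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 2 | Alvin Davis | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 3 | Andre Dawson | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 4 | Andres Galarraga | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| 5 | Alfredo Griffin | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | 750.0 | A |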


|       | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | Salary |
|:------|------:|-----:|------:|-----:|----:|------:|------:|-------:|------:|-------:|------:|-----:|-------:|--------:|--------:|-------:|-------:|
| count | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 | 263.0 |
| mean | 403.642586 | 107.828897 | 11.619772 | 54.745247 | 51.486692 | 41.114068 | 7.311787 | 2657.543726 | 722.186312 | 69.239544 | 361.220532 | 330.418251 | 260.26616 | 290.711027 | 118.760456 | 8.593156 | 535.925882 |
| std | 147.307209 | 45.125326 | 8.757108 | 25.539816 | 25.882714 | 21.718056 | 4.793616 | 2286.582929 | 648.199644 | 82.197581 | 331.198571 | 323.367668 | 264.055868 | 279.934575 | 145.080577 | 6.606574 | 451.118681 |
| min | 19.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 19.0 | 4.0 | 0.0 | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 67.5 |
| 25% | 282.5 | 71.5 | 5.0 | 33.5 | 30.0 | 23.0 | 4.0 | 842.5 | 212.0 | 15.0 | 105.5 | 95.0 | 71.0 | 113.5 | 8.0 | 3.0 | 190.0 |
| 50% | 413.0 | 103.0 | 9.0 | 52.0 | 47.0 | 37.0 | 6.0 | 1931.0 | 516.0 | 40.0 | 250.0 | 230.0 | 174.0 | 224.0 | 45.0 | 7.0 | 425.0 |
| 75% | 526.0 | 141.5 | 18.0 | 73.0 | 71.0 | 57.0 | 10.0 | 3890.5 | 1054.0 | 92.5 | 497.5 | 424.5 | 328.5 | 322.5 | 192.0 | 13.0 | 750.0 |
| max | 687.0 | 238.0 | 40.0 | 130.0 | 121.0 | 105.0 | 24.0 | 14053.0 | 4256.0 | 548.0 | 2165.0 | 1659.0 | 1566.0 | 1377.0 | 492.0 | 32.0 | 2460.0 |


## Sequester Test Data

## Setup Transforms & Pipeline

## Linear Model

## Next Model