### Notes and Resources

#### Equivalent R workshop
https://carpentries-incubator.github.io/high-dimensional-stats-r/03-regression-regularisation/index.html

#### Dataset Documentation
* Dataset documentation: https://www.openml.org/d/42165

#### On log-transforming the target variable
* To shrink or not to shrink: https://florianwilhelm.info/2020/05/honey_i_shrunk_the_target_variable/
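
As a rough illustration of the log-transform idea, the sketch below uses synthetic right-skewed values as a stand-in for sale prices (the data, the `skewness` helper, and all parameter values are assumptions for illustration, not part of the lesson materials). It applies `np.log1p`, checks that skew drops, and shows that `np.expm1` maps values back to the original scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "sale prices" (hypothetical stand-in for the real target)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


# Transform the target before fitting a model ...
log_prices = np.log1p(prices)
# ... and invert the transform to report predictions in the original units
recovered = np.expm1(log_prices)

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

Any error metric computed on the log scale is not directly comparable to one computed in the original units, so predictions are usually back-transformed before reporting.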

#### Multicollinearity in predictors: is it an issue?
* https://stats.stackexchange.com/questions/568406/is-the-non-multicollinearity-assumption-for-ols-multiple-regression-just-an-assu
* housing data examined: https://www.datasklr.com/ols-least-squares-regression/multicollinearity
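
One common way to quantify multicollinearity is the variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing each predictor on all the others. The sketch below is a minimal NumPy version on synthetic data; the `vif` helper and the three made-up columns are assumptions for illustration only.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                   # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500)        # nearly a copy of x1
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

Here the first and third columns should show large VIFs because they are nearly redundant, while the independent second column should stay close to 1.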

#### Should predictor variables be corrected for skew?
* No (hot take): https://www.statsimprove.com/en/linear-regression-should-dependent-and-independent-variables-be-distributed-normally/#:~:text=The%20answer%20is%20no%3A%20the,useless%20trying%20to%20normalize%20everything.
* How correcting for skew helps: https://anshikaaxena.medium.com/how-skewed-data-can-skrew-your-linear-regression-model-accuracy-and-transfromation-can-help-62c6d3fe4c53

#### Third-party analyses of Ames housing data
* linear modeling with Ames housing data: https://chriskhanhtran.github.io/minimal-portfolio/projects/ames-house-price.html
* https://towardsdatascience.com/predicting-housing-prices-using-advanced-regression-techniques-8dba539f9abe
    * not much depth to this one
* https://towardsdatascience.com/wrangling-through-dataland-modeling-house-prices-in-ames-iowa-75b9b4086c96
    * It is first and foremost useful to understand that the Ames dataset fits into the long-established hedonic pricing method to analyzing housing prices. Some domain knowledge will go a long way.

#### Alternatives to coefficient size for feature importance
- Change in R^2: https://blog.minitab.com/en/adventures-in-statistics-2/how-to-identify-the-most-important-predictor-variables-in-regression-models
- T/F tests: https://randomeffect.net/post/2020/11/01/variable-importance-linear-regression/
    - "The t scores approach is in direct opposition to the standardized regression coefficients method to variable importance. Using regression coefficients, the idea is that the most important variables have the largest effect sizes. Using t scores (or, equivalently, p-values) the idea is that the most important variables are the ones that most certainly have non-zero effects. This is what Fisher was thinking about when he thought up p-values: he was looking for a continuous measure of evidence against a singular point-null hypothesis. But this exercise is somewhat pointless, because the t scores are the same (after all, the scaling is just a linear transform to the data matrix). And it’s really the t scores that should be used to determine variable importance, because these take into account the uncertainty in the regression coefficients. At least that’s what I think for linear regression. If we have groups of variables that are correlated, or a categorical variable, replace t score with F score."
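
The claim in the quote that t scores do not change when predictors are rescaled can be checked numerically. The sketch below fits ordinary least squares with NumPy on synthetic data, once on raw predictors and once on z-scored predictors; the `ols_t_scores` helper, the column scales, and the coefficients are all assumptions chosen for illustration.

```python
import numpy as np


def ols_t_scores(X, y):
    """Return (coefficients, t scores) for OLS with an intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(A.T @ A)
    beta = XtX_inv @ A.T @ y
    resid = y - A @ beta
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof             # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))  # standard error per coefficient
    return beta, beta / se


rng = np.random.default_rng(0)
# Three predictors on very different scales
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 0.1])
y = 2.0 + X @ np.array([1.5, 0.02, -8.0]) + rng.normal(size=200)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

beta_raw, t_raw = ols_t_scores(X, y)
beta_std, t_std = ols_t_scores(X_std, y)

# Index 0 is the intercept, so compare slopes only
print("raw slopes:     ", np.round(beta_raw[1:], 3))
print("z-scored slopes:", np.round(beta_std[1:], 3))
print("raw t scores:     ", np.round(t_raw[1:], 2))
print("z-scored t scores:", np.round(t_std[1:], 2))
```

The slope coefficients change with the scaling, but the slope t scores should match. The intercept's t score is excluded because centering the predictors changes what the intercept means.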

### Outline
1. Introduce regression and the problems it solves; introduce predicting vs. explaining; cover prediction using univariate regression
2. Explaining using univariate regression: regression assumptions
3. Multivariate regression: prediction problems
4. Multivariate regression: explanation problems


### Lesson Outline

1. **Overview of research questions, problem setup**
    - Which variable(s) are most predictive of housing price? Is there some combination of variables (multivariate regression) that predicts housing prices better than individual variables (univariate regression)?
    - What kind of regression model gives us the most accurate housing price prediction (we will test a few)?
    - Which variables significantly relate to housing price?
    - Introduce the idea of predictive vs. interpretive models
    - Introduce the assumptions of univariate models
    
2. **Explore data (review from day 1)**
    * Distribution of target variable and mitigating skew 
    * Correlation across predictor variables

3. **Data formatting/cleaning**
    * One-hot encoding of categorical predictors
    * Remove low variance predictors and predictors containing NaNs
    * Train/test split
    * Z-score all predictors using training-set statistics
    * Outlier detection (save for later)
    
4. **Explore univariate models**
    * Which variables are most predictive of housing price by themselves?
    * Interpret results
    * Introduce model assumptions
5. **Explore model with all predictors**
    * Multivariate linear regression assumptions
    * Goals of multivariate regression - predictive vs interpretive
    * Demonstrate overfitting
    * Introduce the curse of dimensionality
6. **Ridge model**
    * Compare to previous models
    * Interpret model results
7. **Lasso model**
    * Compare to previous models
    * Interpret model results
8. **Elastic net model**
    * Compare to previous models
    * Interpret model results
9. **Revisiting multicollinearity and interpretability**
    * Removing correlated features
    * PCA plus elastic net model
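
The modeling steps in this outline (train/test split, z-scoring with training statistics, then comparing OLS, ridge, lasso, elastic net, and PCA plus elastic net) could be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression` rather than the Ames dataset, and the sample sizes, alpha grid, and `l1_ratio` values are illustrative choices, not the lesson's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import ElasticNetCV, LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded predictors: 100 columns,
# only 10 of which carry signal (all values here are illustrative).
X, y = make_regression(
    n_samples=120, n_features=100, n_informative=10, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def enet():
    """Fresh cross-validated elastic net (hypothetical settings)."""
    return ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10_000)


# Putting StandardScaler inside each pipeline means the z-scoring
# statistics are learned from the training rows only.
models = {
    "ols": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))),
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10_000)),
    "elastic_net": make_pipeline(StandardScaler(), enet()),
    "pca_elastic_net": make_pipeline(
        StandardScaler(), PCA(n_components=0.95), enet()
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": model.score(X_train, y_train),  # R^2 on training rows
        "test": model.score(X_test, y_test),     # R^2 on held-out rows
    }
    print(f"{name:16s} train R^2 = {scores[name]['train']:.3f}  "
          f"test R^2 = {scores[name]['test']:.3f}")
```

Because this synthetic setup has more predictors than training rows, plain OLS should fit the training split almost perfectly while scoring lower on the test split, which is the overfitting pattern the regularized models are meant to address.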
    
**What's missing?**
- More feature engineering to allow for more predictable modeling results
- Model interpretability
    - Biplots to help interpret models
    