### MATH 3375 Project 3 - Dimension Reduction

For this project, we will use a data set with several attributes of white wine to predict the quality of the wine. The data set was obtained from the Machine Learning Repository at UC Irvine. 

Below is documentation related to the data set. (_Note that we are only using the **white** wine data for this assignment._)

    1. Title: Wine Quality 

    2. Sources
       Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, 
       Telmo Matos and Jose Reis (CVRVV) @ 2009
   
    3. Past Usage:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
    Modeling wine preferences by data mining from physicochemical properties.
    In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    In the above reference, two datasets were created, using red and white wine samples. 
    The inputs include objective tests (e.g. pH values) and the output is based on sensory data
    (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
    between 0 (very bad) and 10 (very excellent). 
 
    4. Relevant Information:

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. 
    For more details, consult: http://www.vinhoverde.pt/en/ or reference [Cortez et al., 2009]. 
    Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output)
    variables are available (e.g. there is no data about grape types, wine brand, wine selling 
    price, etc.).

    The classes are ordered and not balanced (e.g. there are much more normal wines than 
    excellent or poor ones). Outlier detection algorithms could be used to detect the few 
    excellent or poor wines. Also, we are not sure if all input variables are relevant. 
    So it could be interesting to test feature selection methods. 

    5. Number of Instances: red wine - 1599; white wine - 4898. 

    6. Number of Attributes: 11 + output attribute
  
    Note: several of the attributes may be correlated, thus it makes sense to apply some 
    sort of feature selection.

    7. Attribute information:

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests):
        1 - fixed acidity
        2 - volatile acidity
        3 - citric acid
        4 - residual sugar
        5 - chlorides
        6 - free sulfur dioxide
        7 - total sulfur dioxide
        8 - density
        9 - pH
        10 - sulphates
        11 - alcohol
    
    Output variable (based on sensory data from wine experts): 
        12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None


In [None]:
wine_data <- read.csv("winequality_white.csv")
head(wine_data)

## Partitioning Data

This step is done for you below.  Note that you should train ALL models **_using ONLY_ wine_train**.  We are saving **wine_test** to evaluate performance of the models.

In [None]:
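# Fix the random seed so everyone gets the same partition
set.seed(3375)

# Use 70% of the rows for training, sampled without replacement
train_size <- round(nrow(wine_data) * 0.7, 0)
train_rows <- sample(1:nrow(wine_data), train_size)

# The remaining 30% of rows are held out for testing
wine_train <- wine_data[train_rows,]
wine_test <- wine_data[-train_rows,]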


## Tasks

### Exploratory Data Analysis

##### 1. Create a boxplot, histogram, and data summary for the response variable (wine quality).

Use the **_original data set_** (wine_data) for this step.
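
As a rough illustration of the three outputs requested here, the sketch below uses base R on a small simulated vector of integer scores. The name `quality_demo` and its simulated values are assumptions for demonstration only and are not the wine data.

```r
# Simulated integer scores standing in for a response column (hypothetical data)
set.seed(1)
quality_demo <- sample(3:9, size = 200, replace = TRUE,
                       prob = c(0.01, 0.04, 0.30, 0.44, 0.17, 0.035, 0.005))

# Numerical summary: min, quartiles, median, mean, max
print(summary(quality_demo))

# Counts per score, often helpful when the response is discrete
print(table(quality_demo))

# Boxplot of the response
boxplot(quality_demo, horizontal = TRUE, main = "Quality (simulated)")

# Histogram with breaks centred on the integers so each score gets its own bar
hist(quality_demo, breaks = seq(2.5, 9.5, by = 1),
     main = "Quality (simulated)", xlab = "Score")
```

Centring the histogram breaks on the integers is a design choice for discrete scores; the default breaks may merge adjacent scores into one bar.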

In [None]:
#Put solution to Exercise 1 in this cell. You may add additional cells if you like.


##### 1a. What observations do you have about the distribution of the response variable?

(Write your answer in this cell.)

### Multicollinearity

##### 2. Create a correlation matrix, with plots, showing the pairwise relationships among all variables in the data set.

Use the **_original data set_** (wine_data) for this step.
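
One possible way to view pairwise relationships, sketched with base R only: `cor()` gives the numeric correlation matrix and `pairs()` draws a scatterplot matrix. The data frame `demo` and its columns are simulated and hypothetical; other packages offer richer displays, but this sketch does not assume any of them.

```r
# Simulated data frame with one strongly correlated pair (hypothetical data)
set.seed(2)
n <- 150
demo <- data.frame(x1 = rnorm(n))
demo$x2 <- 0.9 * demo$x1 + rnorm(n, sd = 0.3)   # built to correlate with x1
demo$x3 <- rnorm(n)                             # independent of x1 and x2
demo$y  <- 1 + 2 * demo$x1 - demo$x3 + rnorm(n)

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(demo), 2)
print(cor_mat)

# Scatterplot matrix of every pair of columns
pairs(demo, main = "Pairwise scatterplots (simulated)")
```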

In [None]:
#Put solution to Exercise 2 in this cell. You may add additional cells if you like.


##### 3. Examine Variance Inflation Factors for a Full Model

* Create a model predicting quality using all other features in the data set as predictors. **Use only the training data set** (wine_train) to create this model.
* Print a summary of your model.
* Compute the VIF for all the predictors in the data set.
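
For reference, the VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The sketch below computes this by hand in base R on simulated data; `demo`, `full_fit`, and `vif_manual` are hypothetical names. The `vif()` function in the `car` package is a common alternative, assuming that package is installed.

```r
# Simulated data in which x2 is nearly a copy of x1 (hypothetical data)
set.seed(3)
n <- 200
demo <- data.frame(x1 = rnorm(n), x3 = rnorm(n))
demo$x2 <- demo$x1 + rnorm(n, sd = 0.2)
demo$y  <- 1 + demo$x1 + demo$x3 + rnorm(n)

# Full model: response regressed on every other column
full_fit <- lm(y ~ ., data = demo)
print(summary(full_fit))

# VIF by definition: regress each predictor on the remaining predictors
predictors <- setdiff(names(demo), "y")
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = demo))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Commonly cited rules of thumb flag VIF values above 5 or above 10; which cutoff is appropriate is a judgment call.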

In [None]:
#Put solution to Exercise 3 in this cell. You may add additional cells if you like.


##### 4. Based on your results in items 2 and 3, do you believe multicollinearity is an issue for this data set?

Explain your reasoning clearly. Be sure to compare VIF values to an appropriate threshold. 

(Type your answer in this cell.)

### Models

Remember to use the **_training data_** set (wine_train) to create each of the models below.

##### 5. Create a forward stepwise regression model.

Start from an intercept-only model (no predictors) and have R print each step it takes on the way to the final model.  Be sure to show a summary of the final model.
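
A minimal sketch of forward selection with base R's `step()`, run on simulated data rather than the wine data. The names `demo`, `null_fit`, `full_fit`, and `fwd_fit` are hypothetical. Note that `step()` selects by AIC by default, which is one criterion among several.

```r
# Simulated data: y depends on x1 and x2, while x3 is pure noise (hypothetical data)
set.seed(4)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - 1.5 * demo$x2 + rnorm(n)

# Lower bound of the search: intercept only. Upper bound: all predictors.
null_fit <- lm(y ~ 1, data = demo)
full_fit <- lm(y ~ ., data = demo)

# trace = 1 prints the candidate additions considered at each step
fwd_fit <- step(null_fit,
                scope = list(lower = formula(null_fit),
                             upper = formula(full_fit)),
                direction = "forward",
                trace = 1)

print(summary(fwd_fit))
```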

In [None]:
#Put solution to Exercise 5 in this cell. You may add additional cells if you like.


##### 6. Use LASSO to select a subset of the variables. 

* First create a LASSO model using all features as predictors
* Show a summary of the LASSO model coefficients
* Using only the predictors not eliminated by LASSO, create an ordinary least squares (OLS) model
* Show the OLS model summary
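
The steps above could be sketched as follows, assuming the `glmnet` package is installed. The data are simulated, and `demo`, `cv_lasso`, `kept`, and `ols_fit` are hypothetical names. Choosing the penalty by cross-validation and reading coefficients at `lambda.1se` are assumptions of this sketch; `lambda.min` is the other built-in choice and may keep more predictors.

```r
library(glmnet)

# Simulated data: only x1 and x2 truly affect y (hypothetical data)
set.seed(5)
n <- 200
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - 1.5 * demo$x2 + rnorm(n)

# glmnet needs a numeric predictor matrix; drop the intercept column
x <- model.matrix(y ~ ., data = demo)[, -1]
y <- demo$y

# alpha = 1 is the LASSO penalty; cv.glmnet picks lambda by cross-validation
cv_lasso <- cv.glmnet(x, y, alpha = 1)
lasso_coef <- coef(cv_lasso, s = "lambda.1se")
print(lasso_coef)

# Keep the predictors whose coefficients were not shrunk to exactly zero
coef_vec <- as.matrix(lasso_coef)[, 1]
kept <- setdiff(names(coef_vec)[coef_vec != 0], "(Intercept)")

# Refit ordinary least squares on the surviving predictors only
ols_fit <- lm(reformulate(kept, response = "y"), data = demo)
print(summary(ols_fit))
```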

In [None]:
#Put solution to Exercise 6 in this cell. You may add additional cells if you like.


##### 7. Create an Elastic Net Model

* Create the Elastic Net using all features as predictors, with an alpha value of 0.5
* Show a summary of the Elastic Net model coefficients
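
In `glmnet`, the `alpha` argument mixes the two penalties: `alpha = 1` is LASSO, `alpha = 0` is ridge, and `alpha = 0.5` weights them equally. A minimal sketch on simulated data follows, assuming the `glmnet` package is installed. The names `demo` and `cv_enet` are hypothetical, and using cross-validation with `lambda.min` to pick the penalty strength is an assumption of this sketch.

```r
library(glmnet)

# Simulated data with two correlated predictors (hypothetical data)
set.seed(6)
n <- 200
demo <- data.frame(x1 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
demo$x2 <- demo$x1 + rnorm(n, sd = 0.3)
demo$y  <- 1 + 2 * demo$x1 - demo$x3 + rnorm(n)

# Numeric predictor matrix without the intercept column
x <- model.matrix(y ~ ., data = demo)[, -1]
y <- demo$y

# alpha = 0.5 gives an equal mix of the L1 and L2 penalties
cv_enet <- cv.glmnet(x, y, alpha = 0.5)

# Coefficients at the lambda with the lowest cross-validated error
enet_coef <- coef(cv_enet, s = "lambda.min")
print(enet_coef)
```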

In [None]:
#Put solution to Exercise 7 in this cell. You may add additional cells if you like.


##### 8. Compute Mean Square Error (MSE)

Using the **_test data set_** (wine_test), compute the MSE for each of the 3 models (Stepwise, OLS with LASSO-selected features, and Elastic Net).
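
Test MSE is the mean of the squared differences between observed and predicted responses on held-out data. The sketch below shows the calculation for a single `lm` model on simulated data; `mse()`, `demo_train`, and `demo_test` are hypothetical names introduced for illustration.

```r
# Helper: mean squared error between observed and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Simulated data with a 70/30 train/test split (hypothetical data)
set.seed(7)
n <- 300
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 + rnorm(n)

train_rows <- sample(seq_len(n), round(0.7 * n))
demo_train <- demo[train_rows, ]
demo_test  <- demo[-train_rows, ]

# Fit on the training rows only, then predict the held-out rows
fit <- lm(y ~ ., data = demo_train)
test_mse <- mse(demo_test$y, predict(fit, newdata = demo_test))
print(test_mse)
```

For a `glmnet` fit, `predict()` takes a numeric matrix through `newx` rather than a data frame through `newdata`, so the test predictors would need the same `model.matrix()` treatment as the training predictors.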

In [None]:
#Put solution to Exercise 8 in this cell. You may add additional cells if you like.


##### 8a. Compare the Models

Type your answers in this cell.

1. Which predictors were included in each model? What was different and what was the same?
2. Which models had the highest and lowest MSE?
3. Which model would you choose to make predictions, and why?
