## MATH 3375 Final Exam - Applied Tasks

For this exam, we will use a data set with several attributes of wine. The data set was obtained from the Machine Learning Repository at UC Irvine. 

Below is documentation related to the data set. 

    1. Title: Wine Quality 

    2. Sources
       Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, 
       Telmo Matos and Jose Reis (CVRVV) @ 2009
   
    3. Past Usage:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
    Modeling wine preferences by data mining from physicochemical properties.
    In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    In the above reference, two datasets were created, using red and white wine samples.
    [Instructor note: the data sets have been combined for this exercise, but the color
    of the wine is not represented in the data.]
    
    The inputs include objective tests (e.g. pH values) and the output is based on sensory data
    (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
    between 0 (very bad) and 10 (very excellent). 
 
    4. Relevant Information:

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. 
    For more details, consult: http://www.vinhoverde.pt/en/ or reference [Cortez et al., 2009]. 
    Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output)
    variables are available (e.g. there is no data about grape types, wine brand, wine selling 
    price, etc.).

    The classes are ordered and not balanced (e.g. there are many more normal wines than 
    excellent or poor ones). Outlier detection algorithms could be used to detect the few 
    excellent or poor wines. Also, we are not sure if all input variables are relevant. 
    So it could be interesting to test feature selection methods. 

    5. Number of Instances: red wine - 1599; white wine - 4898. 

    6. Number of Attributes: 11 + output attribute
  
    Note: several of the attributes may be correlated, thus it makes sense to apply some 
    sort of feature selection.

    7. Attribute information:

    For more information, read [Cortez et al., 2009].

    Input variables (based on physicochemical tests):
        1 - fixed acidity
        2 - volatile acidity
        3 - citric acid
        4 - residual sugar
        5 - chlorides
        6 - free sulfur dioxide
        7 - total sulfur dioxide
        8 - density
        9 - pH
        10 - sulphates
        11 - alcohol
    
    Output variable (based on sensory data from wine experts): 
        12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None




In [None]:
wine <- read.csv("winequality_all_blind.csv")
head(wine)

### Task 1 - Clustering

Although the summary describes **quality** as the output variable, we will first seek to divide the wine into 2 different groups using k-means clustering. We will use all attributes _**except**_ quality to create 2 clusters of data points that appear to have the most similar attributes. 

#### Task 1a
First, use the cell below to scale each variable so that all attributes _**except**_ quality have values between 0 and 1 (min-max scaling).  **DO NOT SCALE THE _quality_ VARIABLE.**

_Note that you should store the scaled version of the variables directly in the **wine** data set, so that all tasks completed after this step will be completed with scaled data._
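
As a reminder of the technique, min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below uses a small made-up data frame (`toy`, with hypothetical values), not the exam data, and assumes no column is constant.

```r
# Toy data frame standing in for real data (values are made up)
toy <- data.frame(
  alcohol = c(9.4, 10.2, 12.8, 11.0),
  pH      = c(3.51, 3.20, 3.26, 3.16),
  quality = c(5, 6, 7, 5)
)

# Min-max scaling: (x - min) / (max - min); assumes max > min
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Scale every column except quality, writing back into the same data frame
predictors <- setdiff(names(toy), "quality")
toy[predictors] <- lapply(toy[predictors], min_max)

# Each scaled column should now range from 0 to 1
sapply(toy[predictors], range)
```

Assigning back into `toy[predictors]` keeps the result in the same data frame, so later steps see the scaled columns.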

In [None]:
#Scale all attributes except quality


#View results to verify scaling
head(wine)

#### Task 1b

With the scaled attributes, perform k-means clustering using all attributes except quality (i.e., use the first 11 columns). Use the code cell below, and be sure to keep the seed value set to 3375 immediately before performing the clustering.

Store your clustering model in a variable and display the model for your reference (you will need the output to answer questions in the D2L Quiz).
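
For reference, here is a minimal sketch of k-means with base R's `kmeans()` on synthetic data. The data frame `toy` and its columns `x` and `y` are made up for illustration; only the function call pattern carries over.

```r
# Synthetic two-group data (made up for illustration)
set.seed(1)
toy <- data.frame(
  x = c(rnorm(20, mean = 0.2, sd = 0.05), rnorm(20, mean = 0.8, sd = 0.05)),
  y = c(rnorm(20, mean = 0.3, sd = 0.05), rnorm(20, mean = 0.7, sd = 0.05))
)

# Set the seed immediately before clustering so the random starts are reproducible
set.seed(3375)
km <- kmeans(toy, centers = 2)

km           # prints cluster sizes, centers, and within-cluster sums of squares
km$size      # number of points in each cluster
km$centers   # coordinates of the two cluster centers
```

Because `kmeans()` picks its starting centers at random, the cluster labels (1 versus 2) can swap between runs unless the seed is set right before the call.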

In [None]:
#Carry out k-means clustering to create 2 clusters 

set.seed(3375)  #Do NOT move or change this - place your code below.



#### Task 1c
Visualize the clustering in 2 dimensions. Using the **alcohol** and **residual.sugar** attributes as your 2 dimensions, create a scatter plot and color code the points to show the cluster to which they are assigned. (You will have an opportunity to upload this plot to the D2L quiz.)
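
One way to color a scatter plot by cluster in base R is to pass the cluster vector to `col`. This sketch uses random stand-in values in a hypothetical data frame `toy`; the column names mirror the task but the numbers are made up.

```r
# Random stand-in data (not the exam data)
set.seed(3375)
toy <- data.frame(
  alcohol        = runif(50),
  residual.sugar = runif(50)
)
km <- kmeans(toy, centers = 2)

# Color each point by its assigned cluster (1 or 2)
plot(toy$alcohol, toy$residual.sugar,
     col = km$cluster, pch = 19,
     xlab = "alcohol (scaled)", ylab = "residual.sugar (scaled)",
     main = "k-means clusters (k = 2)")
legend("topright", legend = paste("Cluster", 1:2), col = 1:2, pch = 19)
```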

In [None]:
# Create 2-dimensional plot to visualize clusters


### Task 2 - Principal Components Analysis

Perform a Principal Components Analysis (PCA) on the data set, using all attributes **except** quality. Display the following items so that you can refer to them in the D2L quiz:
* The principal component values for the first few rows of the data set
* The rotation matrix of variable loadings
* The proportion of variance explained by each Principal Component
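
The three items listed above can all be read from a `prcomp()` result. The sketch below uses the built-in `iris` data as a stand-in; whether to also pass `scale. = TRUE` is a judgment call that depends on how the columns were prepared.

```r
# Built-in iris data as a stand-in; its first four columns are numeric
pca <- prcomp(iris[, 1:4])

head(pca$x)      # principal component values (scores) for the first few rows
pca$rotation     # rotation matrix of variable loadings
summary(pca)     # includes the "Proportion of Variance" row for each component
```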

In [None]:
#Principal Component Analysis


### Tasks 3-5: Classification Models

We will create models to classify the wine as low, moderate, or high quality. To prepare for this task, run the cell below to add a _category_ column. This category will be a factor variable, and we will use it as our response variable. 

#### Important notes:
* Before running the cell, ensure that **wine** contains the _**scaled**_ version of the first 11 columns.
* Just run the cell; do **NOT** change anything in the cell.

In [None]:
wine$category <- "moderate"
wine$category[wine$quality < 5] <- "low"
wine$category[wine$quality > 6] <- "high"

wine$category <- as.factor(wine$category)
head(wine)

Now run the cell below to create a training and test set.  Again, do not make any changes; just run the cell.

In [None]:
set.seed(3375)

train_size <- round(nrow(wine) * 0.8, 0)
train_rows <- sample(1:nrow(wine), train_size)

wine_train <- wine[train_rows,]
wine_test <- wine[-train_rows,]


### Task 3 - k-Nearest Neighbors

#### Task 3a

Use the training data set to create a k-Nearest Neighbors model with k=5, using only the first 11 columns of data as attributes, with _category_ as the response variable. The model should predict the _category_ of wine in the test set. 

Display the predicted values next to the actual values for later reference (at least the first 10 rows of each).
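
A minimal kNN sketch, assuming the `class` package (a recommended package normally bundled with R) is available. It uses `iris` as a stand-in, with hypothetical names `toy_train` and `toy_test`.

```r
library(class)  # provides knn()

# iris as stand-in: 4 numeric attributes, Species as the factor response
set.seed(3375)
train_rows <- sample(seq_len(nrow(iris)), round(nrow(iris) * 0.8))
toy_train <- iris[train_rows, ]
toy_test  <- iris[-train_rows, ]

# knn() takes the training attributes, the test attributes, and the training labels
knn_pred <- knn(train = toy_train[, 1:4],
                test  = toy_test[, 1:4],
                cl    = toy_train$Species,
                k     = 5)

# Predicted values next to actual values
head(data.frame(predicted = knn_pred, actual = toy_test$Species), 10)
```

Note that `knn()` fits and predicts in one call; there is no separate model object to pass to `predict()`.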

In [None]:
#k-Nearest Neighbors Model


#### Task 3b
Compute the overall accuracy of the kNN predictions. What proportion of predictions match the actual category?
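
Overall accuracy is the share of predictions that equal the actual label. A small sketch with made-up label vectors (`predicted` and `actual` are hypothetical):

```r
# Made-up predicted and actual labels for illustration
predicted <- factor(c("low", "moderate", "high", "moderate", "moderate"))
actual    <- factor(c("low", "moderate", "moderate", "moderate", "high"))

# Confusion matrix: rows are predictions, columns are actual categories
table(predicted, actual)

# Overall accuracy: proportion of predictions that match the actual label.
# Comparing as character avoids errors when factor level sets differ.
accuracy <- mean(as.character(predicted) == as.character(actual))
accuracy
```

The same value can be obtained from the confusion matrix as the sum of the diagonal divided by the total count, provided rows and columns share the same level order.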

In [None]:
#Overall accuracy of kNN predictions


### Task 4 - Support Vector Machine

#### Task 4a

Use the training data set to create a support vector machine model with a **radial** kernel, _**using only volatile.acidity and alcohol as predictors**_, with _category_ as the response variable. Display a model summary. 
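
A minimal sketch of a two-predictor radial SVM, assuming the `e1071` package is installed (one common choice in R; the exam does not name a package here). It uses `iris` as a stand-in, and `toy_train` is a hypothetical name.

```r
library(e1071)  # assumed to be installed; provides svm()

set.seed(3375)
train_rows <- sample(seq_len(nrow(iris)), round(nrow(iris) * 0.8))
toy_train <- iris[train_rows, ]

# Radial-kernel SVM with two predictors; a factor response means classification
svm_fit <- svm(Species ~ Petal.Length + Petal.Width,
               data = toy_train, kernel = "radial")

summary(svm_fit)   # reports the SVM type, kernel, cost, and support-vector counts
```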

In [None]:
#Support Vector Machine model



#### Task 4b

Create a classification plot of the SVM model.
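
Assuming the model comes from `e1071::svm()`, its `plot()` method draws the predicted class regions over two chosen dimensions. A sketch on `iris` as a stand-in:

```r
library(e1071)  # assumed to be installed

# Stand-in model: two-predictor radial SVM on the built-in iris data
svm_fit <- svm(Species ~ Petal.Length + Petal.Width,
               data = iris, kernel = "radial")

# plot() for svm objects shades the predicted class regions;
# the formula names the two plotted dimensions (y-axis ~ x-axis)
plot(svm_fit, iris, Petal.Width ~ Petal.Length)
```

The formula argument matters when the data frame holds more columns than the two predictors used in the model, as it does here.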

In [None]:
#SVM Classification Plot


#### Task 4c

Use the SVM model to predict the _category_ of wine in the test set. 

Display the predicted values next to the actual values for later reference (at least the first 10 rows of each).

In [None]:
#SVM Prediction on Test Data 



#### Task 4d
Compute the overall accuracy of the SVM predictions on the test set. What proportion of predictions match the actual category?

In [None]:
#Accuracy of SVM Model


### Task 5 - Random Forest 

#### Task 5a

Use the training data set to create a random forest model, using only the first 11 columns of data as attributes, with _category_ as the response variable. (_Be sure to create the model after setting the seed; do not change the seed._)

Display the variable importance of the model.
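
A minimal sketch, assuming the `randomForest` package is installed (the exam does not name a package here). It uses `iris` as a stand-in for the training data.

```r
library(randomForest)  # assumed to be installed

# Random forests resample rows and variables at random, so set the seed first
set.seed(3375)
rf_fit <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf_fit)   # importance measures for each variable
varImpPlot(rf_fit)   # optional dot chart of the same measures
```

With `importance = TRUE`, the table includes accuracy-based measures in addition to the Gini-based one.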

In [None]:
#Random Forest Model and Variable Importance 

set.seed(3375)


#### Task 5b

Use the Random Forest model to predict the _category_ of wine in the test set. 

Display the predicted values next to the actual values for later reference (at least the first 10 rows of each).
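
Prediction on held-out rows follows the usual `predict(model, newdata = ...)` pattern. The sketch below assumes the `randomForest` package is installed and uses `iris` as a stand-in; `toy_train` and `toy_test` are hypothetical names.

```r
library(randomForest)  # assumed to be installed

set.seed(3375)
train_rows <- sample(seq_len(nrow(iris)), round(nrow(iris) * 0.8))
toy_train <- iris[train_rows, ]
toy_test  <- iris[-train_rows, ]

# Fit on the training rows, then predict the class of each test row
rf_fit  <- randomForest(Species ~ ., data = toy_train)
rf_pred <- predict(rf_fit, newdata = toy_test)

# Predicted values next to actual values
head(data.frame(predicted = rf_pred, actual = toy_test$Species), 10)
```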

In [None]:
#Random Forest Prediction on Test Data 



#### Task 5c
Compute the overall accuracy of the random forest predictions on the test set. What proportion of predictions match the actual category?

In [None]:
#Accuracy of Random Forest Predictions



### Task 6 - Multiple Regression

#### Task 6a

Use the training data set to create a multiple regression model predicting **density** from the other 10 attributes (do NOT include the **quality** or **category** variables). 

Display the model summary.
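
A minimal multiple regression sketch with base R's `lm()`, using the built-in `mtcars` data as a stand-in. The second model shows one way to regress on every column except a few, by dropping the unwanted columns first; the excluded names here are arbitrary.

```r
# Built-in mtcars as stand-in: predict mpg from three named predictors
lm_fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(lm_fit)   # coefficients, standard errors, R-squared, F-statistic

# "All columns except some": drop the unwanted columns, then use the dot shorthand
keep <- setdiff(names(mtcars), c("am", "gear"))
lm_fit2 <- lm(mpg ~ ., data = mtcars[, keep])
summary(lm_fit2)
```

An alternative to subsetting is to subtract terms in the formula, for example `mpg ~ . - am - gear`.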

In [None]:
#Multiple Regression Model to Predict Density
