# PCA on wine quality data
This dataset is convenient for PCA because every column is numeric.
The first few rows look like this:

```
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
```
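
The file is semicolon-delimited with a double-quoted header row, so any reader has to be told about the separator. As a minimal sketch, the snippet below parses an inline copy of the rows shown above using only Python's standard `csv` module; the variable names `sample` and `rows` are illustrative and not part of the lab.

```python
import csv
import io

# Inline copy of the header and the first two rows shown above.
sample = '''"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
'''

# The delimiter is ';' rather than ','; header names are double-quoted.
reader = csv.DictReader(io.StringIO(sample), delimiter=';', quotechar='"')

# Every column is numeric, so each value can be converted to float.
rows = [{name: float(value) for name, value in row.items()} for row in reader]

print(len(rows[0]), "columns")
print(rows[0]["alcohol"], rows[0]["quality"])
```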

The data files are in *datasets/wine-quality*:
- winequality-red.csv
- winequality-white.csv

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd


print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1 : Load Data

In [None]:
## read data
## TODO : try red and white wine data
data_file= '/data/wine-quality/winequality-red.csv'
#data_file= '/data/wine-quality/winequality-white.csv'
column_to_remove = 'quality'


t1 = time.perf_counter()

data = spark.read.\
          option('header', 'true').\
          option('inferSchema', 'true').\
          option('delimiter', ';').\
          csv(data_file)
record_count = data.count()  # count() runs a full pass over the file, so include it in the timing
t2 = time.perf_counter()
print("read {:,} records in {:,.2f} ms".format(record_count, (t2-t1)*1000))

data_clean = data.na.drop()
print("raw data count {},  cleaned data count {}".format(data.count(), data_clean.count()))

data_clean.show()

## Step 2 : Basic data analysis

In [None]:
## remove columns we don't need
columns = data_clean.columns
columns.remove(column_to_remove)  # this is the target , so remove
data2 = data_clean.select(columns)

print("original data columns  ", len(data2.columns))
data2.show()

In [None]:
## TODO : use 'describe' to do basic data analytics
data2.???().show()

## Step 3 : Create feature vector

In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=columns, outputCol="features")
feature_vector = assembler.transform(data2)
feature_vector.select('features').show(10, False)

## Step 4 : Correlation Matrix of original data
Do you see any correlation?

In [None]:
from pyspark.ml.stat import Correlation


## TODO : Identify components that are highly correlated

corr1 = Correlation.corr(feature_vector, "features").head()
pearson_corr = corr1[0]
#print("Pearson correlation matrix:\n" + str(pearson_corr))

## convert to numpy for pretty print
print("Pearson correlation matrix")
np.set_printoptions(precision=2,  linewidth=200)
pearson_corr_nparr = pearson_corr.toArray()
print(pearson_corr_nparr)

## convert to pandas for even prettier print :-)
names = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides",\
         "free sulfur dioxide","total sulfur dioxide","density","pH", "sulphates","alcohol"]

df = pd.DataFrame(pearson_corr_nparr, index=names, columns=names)
df
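
To make the TODO above concrete, here is a small sketch of how highly correlated pairs could be picked out of a Pearson correlation matrix. It uses NumPy on made-up toy data rather than the wine data, and the `threshold` of 0.7 is an arbitrary assumption, not a value from the lab.

```python
import numpy as np

# Toy data (made up for illustration): 6 observations of 3 features.
# Column 1 rises with column 0; column 2 falls as column 0 rises.
X = np.array([
    [1.0, 2.1, 9.0],
    [2.0, 3.9, 8.2],
    [3.0, 6.2, 6.9],
    [4.0, 8.1, 6.1],
    [5.0, 9.8, 5.2],
    [6.0, 12.1, 3.8],
])

# rowvar=False treats each column as a variable, like the entries of the "features" vector.
corr = np.corrcoef(X, rowvar=False)

# Collect feature pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.7  # arbitrary cut-off for "highly correlated"
pairs = [(i, j, corr[i, j])
         for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) > threshold]

for i, j, r in pairs:
    print(f"features {i} and {j}: r = {r:+.2f}")
```

The same loop could be run over `pearson_corr_nparr` with `names` used to label the pairs.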

## Step 5 : Scale Data
PCA is sensitive to the scale of each feature, so we scale the data before running it.

In [None]:
from pyspark.ml.feature import StandardScaler


## TODO : create a scaler
##   Hint : inputCol = 'features'
##   Hint : outputCol  = 'scaledFeatures'

scaler = StandardScaler(inputCol="???", outputCol="???",
                        withStd=True, withMean=False)

# Compute summary statistics by fitting the StandardScaler
scaler_model = scaler.fit(feature_vector)

# Normalize each feature to have unit standard deviation.
fv_scaled = scaler_model.transform(feature_vector)
fv_scaled.select('features', 'scaledFeatures').show()
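
As a rough illustration of what `withStd=True, withMean=False` does, the sketch below divides each column by its standard deviation without subtracting the mean. It uses NumPy on made-up values, and `ddof=1` (the corrected sample standard deviation) is chosen to match how Spark documents its scaler; treat it as an approximation of the idea, not Spark's implementation.

```python
import numpy as np

# Made-up values on very different scales (a density-like and a sulfur-like column).
X = np.array([
    [0.9978, 34.0],
    [0.9968, 67.0],
    [0.9970, 54.0],
    [0.9980, 60.0],
])

# ddof=1 gives the corrected sample standard deviation of each column.
std = X.std(axis=0, ddof=1)

# withStd=True, withMean=False: divide by the std only; the mean is not subtracted.
X_scaled = X / std

print("std before:", std)
print("std after: ", X_scaled.std(axis=0, ddof=1))
print("mean after:", X_scaled.mean(axis=0))
```

After scaling, both columns have unit standard deviation, so neither dominates the principal components simply because of its units.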

## Step 6 : Do PCA

In [None]:
from pyspark.ml.feature import PCA

## TODO : create 5 Principal Components
num_pc = ???

pca = PCA(k=num_pc, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(fv_scaled)
pca_features = model.transform(fv_scaled).select("pcaFeatures")
pca_features.show(10, False)
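
For intuition about what the PCA step computes, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix. The data is synthetic, the names `components`, `explained` and `scores` are illustrative, and this is a sketch of the underlying math under those assumptions, not a description of Spark's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 4 features, with feature 1 built mostly from feature 0.
X = rng.normal(size=(200, 4))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]

# Scale each column to unit standard deviation.
X = X / X.std(axis=0, ddof=1)

# Covariance matrix of the features (columns are variables).
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
components = eigvecs[:, :k]               # principal directions, shape (4, k)
explained = eigvals[:k] / eigvals.sum()   # fraction of total variance per component
scores = X @ components                   # projected data, shape (200, k)

print("explained variance ratio:", explained)
```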

## Step 7 : Correlation Matrix for Principal Components
The off-diagonal values should be very small (close to zero!); the diagonal is always 1.

In [None]:
from pyspark.ml.stat import Correlation

## correlation matrix for PC
## off-diagonal entries should be very close to zero
corr_pc = Correlation.corr(pca_features, "pcaFeatures").head()[0]
corr_pc_nparr = corr_pc.toArray()

print ("Correlation Matrix for Principal Components (scientific notation)")
np.set_printoptions(precision=2, suppress=False)
print(corr_pc_nparr)
print()

print ("Correlation Matrix for Principal Components (small values suppressed)")
np.set_printoptions(precision=2, suppress=True)
print(corr_pc_nparr)

## TODO : Inspect the correlations for PC
##      Are they highly correlated or not?  Can you explain why?
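
One way to see why the off-diagonal correlations vanish: the principal directions are eigenvectors of the covariance matrix, and projecting onto them diagonalizes that matrix. The sketch below checks this numerically with NumPy on synthetic data; the mixing matrix `mix` is made up purely to create correlated features.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated features: mix 3 independent columns together.
Z = rng.normal(size=(500, 3))
mix = np.array([[1.0, 0.8, 0.3],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
X = Z @ mix

# Principal directions = eigenvectors of the covariance matrix.
_, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
scores = X @ eigvecs

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(scores, rowvar=False)

# Boolean mask selecting everything except the diagonal.
off_diag = ~np.eye(3, dtype=bool)
print("largest |off-diagonal| before:", np.abs(corr_before[off_diag]).max())
print("largest |off-diagonal| after: ", np.abs(corr_after[off_diag]).max())
```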

## Step 8 : Calculate PC Variance

We started with 5 PCs.  
How much of the variance (coverage) do they capture?

Adjust **num_pc** in Step 6 until you reach 90% coverage.


In [None]:
## variance
variance = model.explainedVariance.toArray()
print(variance)
print ("Original data had {} features,  principal components {}".format(len(data2.columns), num_pc))
print("Cumulative Explained Variance: " + str(np.cumsum(variance)[-1]))
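
Instead of trial and error, the number of components needed for a target coverage can be read off the cumulative sum. The sketch below assumes an array of per-component explained-variance ratios in decreasing order; the numbers are hypothetical and are not results from the wine data.

```python
import numpy as np

# Hypothetical explained-variance ratios, one per component, in decreasing order.
# In the notebook these could come from a model fitted with k = number of features.
variance = np.array([0.28, 0.18, 0.14, 0.11, 0.09, 0.06,
                     0.05, 0.04, 0.03, 0.015, 0.005])

cumulative = np.cumsum(variance)
target = 0.90

# searchsorted gives the index of the first cumulative value >= target;
# add 1 to turn the zero-based index into a component count.
k = int(np.searchsorted(cumulative, target) + 1)

print("cumulative:", np.round(cumulative, 3))
print("components needed for {:.0%} coverage: {}".format(target, k))
```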

## Step 9 : Screeplot
The plot shows cumulative explained variance, so the y-axis runs from 0.0 to 1.0.

In [None]:
variance = model.explainedVariance.toArray()
fig = plt.figure(figsize=(8,5))
sing_vals = np.arange(num_pc) + 1   # component numbers 1..num_pc
plt.plot(sing_vals, np.cumsum(variance), 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Explained Variance')


leg = plt.legend(['Cumulative Explained Variance'], loc='best', borderpad=0.3, 
                 shadow=False, prop=matplotlib.font_manager.FontProperties(size='small'),
                 markerscale=0.4)

## TODO : Explain the screeplot