# PCA in Spark ML

Let's try our hand at dimensionality reduction in Spark ML. We are going to look at the Walmart dataset. Remember how many dimensions we had in that one? (70, to be exact.) Perhaps we can find a lower-dimensional representation of it.


In [None]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import PCA

import numpy as np
import matplotlib
import matplotlib.pyplot as plt



In [None]:
dataset = spark.read.csv("/data/walmart-triptype/train-transformed.csv.gz", header=True, inferSchema=True)


In [None]:
dataset.show()

## Step 1: Creating Vectors

The data is already loaded, so now let's assemble the feature columns into vectors.

In [None]:
columns = dataset.columns
columns.remove('VisitNumber') #We don't care about visit number as a feature.
columns.remove('TripType') #Triptype is what we're predicting!

print(columns)

In [None]:
# Build the vector.
assembler = VectorAssembler(inputCols=columns, outputCol="features")
featureVector = assembler.transform(dataset)


In [None]:
# Print some sample rows.
for row in featureVector.select('features').take(10):
    print("Vector: %s\n" % (str(row)))

Note the output. These are sparse (not dense) vectors. That's because our data IS sparse: only a few of the variables are nonzero in any given row.
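To get a feel for what a sparse vector stores, here is a minimal NumPy sketch. It is not Spark's own implementation; `to_sparse` and `to_dense` are hypothetical helpers written for illustration, and the sample row is made up. The idea is simply to keep the vector size plus the indices and values of the nonzero entries.

```python
import numpy as np

def to_sparse(dense):
    """Return (size, indices, values) for the nonzero entries of a 1-D array."""
    dense = np.asarray(dense, dtype=float)
    indices = np.flatnonzero(dense)          # positions of the nonzero entries
    return dense.size, indices.tolist(), dense[indices].tolist()

def to_dense(size, indices, values):
    """Rebuild the full array from its sparse description."""
    out = np.zeros(size)
    out[indices] = values
    return out

# A made-up row with 8 features, only 2 of which are nonzero.
row = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0]
sparse_row = to_sparse(row)
print(sparse_row)  # (8, [2, 6], [3.0, 1.0])
```

The printed tuple has the same three parts you see in the notebook output above: the size, the nonzero indices, and the nonzero values.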

## Step 2: Build a Correlation Matrix

We're going to build a correlation matrix.  It has every feature as both a row and a column, and each entry shows the correlation between that pair of features.  Naturally, every feature is perfectly correlated with itself, so we expect to see a diagonal of 1's.  The matrix is symmetric, so the upper-right and lower-left triangles should be mirror images of each other.

Perfectly uncorrelated features would have the identity matrix as their correlation matrix.  Part of our goal in PCA is to create uncorrelated (orthogonal) features.

In [None]:
## Checking the correlation matrix of the data.

r1 = Correlation.corr(featureVector, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))


Note that some dimensions in the original dataset are correlated.  We can see this in the nonzero off-diagonal values of the correlation matrix.  The diagonal is all 1's because each dimension is perfectly correlated with itself.
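The properties described in this step are easy to check on a toy example. The sketch below uses NumPy rather than Spark, and the three columns `a`, `b`, and `c` are synthetic data invented for illustration: `b` is built from `a`, while `c` is unrelated to both.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.5, size=500)  # strongly correlated with a
c = rng.normal(size=500)                       # unrelated to a and b

X = np.column_stack([a, b, c])

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

You should see 1's on the diagonal, a symmetric matrix, a large value for the (a, b) pair, and values near zero for the pairs involving `c`.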

## Step 3: Scale (Normalize) the data 

We need to scale our features so that no single dimension dominates. Why does this matter? PCA looks for the directions of greatest variance, so a dimension measured on a larger scale than the others would be unfairly weighted in our analysis simply because its numbers are bigger. We want to avoid this.



In [None]:
# withMean=False skips centering, which keeps the sparse vectors sparse.
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(featureVector)

# Normalize each feature to have unit standard deviation.
ScaledFeatures = scalerModel.transform(featureVector)
ScaledFeatures.select('features', 'scaledFeatures').show()
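To see what unit-standard-deviation scaling without centering does, here is a small NumPy sketch on made-up numbers. It assumes the sample standard deviation (`ddof=1`); treat that as an assumption about how the Spark scaler computes it rather than a statement of its internals.

```python
import numpy as np

# Two made-up features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 6000.0]])

# Divide each column by its sample standard deviation; do not subtract the mean.
std = X.std(axis=0, ddof=1)
scaled = X / std

print("std before:", std)
print("std after: ", scaled.std(axis=0, ddof=1))
print("mean after:", scaled.mean(axis=0))  # not zero, because we did not center
```

Because we only divide, a zero entry stays zero after scaling, which is why skipping the centering step preserves sparsity.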

## Step 4: Running PCA

Now we will run PCA to reduce the number of dimensions and decorrelate them.

**Try with five dimensions to start with.**

In [None]:
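num_vars = ???  # TODO: number of principal components to keep (start with 5)

# Use num_vars for k so the PCA model and the plots below stay in sync.
pca = PCA(k=num_vars, inputCol="scaledFeatures", outputCol="pcaFeatures")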
model = pca.fit(ScaledFeatures)
pcaFeatures = model.transform(ScaledFeatures).select("pcaFeatures")

Let's take a look at the fitted model and see how much of the variance each principal component explains.

In [None]:
S = model.explainedVariance.toArray()
print(S)
print("Cumulative Explained Variance: " + str(np.cumsum(S)[-1]))
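Where do these explained-variance numbers come from? Conceptually, each one is an eigenvalue of the covariance matrix divided by the sum of all eigenvalues. The NumPy sketch below illustrates that idea on synthetic data (four features generated from two hidden factors plus a little noise); the data and variable names are invented for illustration, and this is not Spark's code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four observed features driven by only two hidden factors, plus small noise.
latent = rng.normal(size=(1000, 2))
mixing = np.array([[3.0, 0.0, 1.0, 0.5],
                   [0.0, 2.0, 1.0, 0.5]])
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 4))

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

# Share of the total variance carried by each principal component.
ratios = eigvals / eigvals.sum()

k = 2
print("Explained variance per component:", ratios)
print("Cumulative for k =", k, ":", ratios[:k].sum())
```

Since only two factors drive this toy data, the first two components should carry almost all of the variance.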

### Do a Scree plot

This will show us how the cumulative explained variance grows as we add principal components.


In [None]:

S = model.explainedVariance.toArray()
fig = plt.figure(figsize=(8,5))
sing_vals = np.arange(num_vars) + 1  # component numbers 1..num_vars
plt.plot(sing_vals, np.cumsum(S), 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Explained Variance')


leg = plt.legend(['Cumulative Explained Variance'], loc='best', borderpad=0.3, 
                 shadow=False, prop=matplotlib.font_manager.FontProperties(size='small'),
                 markerscale=0.4)

What do you think?  Is the cumulative explained variance enough to represent our data in fewer dimensions?


## Step 5: Re-Running the Correlation Matrix

In [None]:
## Checking the correlation matrix of the PCA-transformed data.

r1 = Correlation.corr(pcaFeatures, "pcaFeatures").head()
print("Pearson correlation matrix:\n" + str(r1[0]))


Note the very small, close-to-zero off-diagonal correlations in the matrix.  The 5 principal components are, for all practical purposes, uncorrelated (orthogonal).
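Why does this happen? Projecting the data onto the eigenvectors of its covariance matrix produces new features whose covariance matrix is diagonal. Here is a minimal NumPy sketch on synthetic data (the columns are invented for illustration) comparing the correlation before and after the projection.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=2000)

# The first two columns are strongly correlated; the third is unrelated.
X = np.column_stack([a,
                     a + 0.3 * rng.normal(size=2000),
                     rng.normal(size=2000)])

# Center the data, then take the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Keep the two directions with the largest eigenvalues and project onto them.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
projected = Xc @ components

corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(projected, rowvar=False)
print("Before:\n", np.round(corr_before, 3))
print("After:\n", np.round(corr_after, 3))
```

The off-diagonal entry of the "after" matrix should be zero up to floating-point error, just like the small values in the Spark output above.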

## Step 6: Running PCA with More Dimensions to Raise the Explained Variance

Try to reach at least 80% cumulative explained variance.

Can we use the elbow method here?  Why (or why not)?

What does this say about the relative correlation of the dimensions?

In [None]:
num_vars = ??? # Enter the number of dimensions needed to reach the target explained variance here.
pca = PCA(k=num_vars, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(ScaledFeatures)
pcaFeatures = model.transform(ScaledFeatures).select("pcaFeatures")


S = model.explainedVariance.toArray()
fig = plt.figure(figsize=(8,5))
sing_vals = np.arange(num_vars) + 1  # component numbers 1..num_vars
plt.plot(sing_vals, np.cumsum(S), 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Cumulative Explained Variance')


leg = plt.legend(['Cumulative Explained Variance'], loc='best', borderpad=0.3, 
                 shadow=False, prop=matplotlib.font_manager.FontProperties(size='small'),
                 markerscale=0.4)

print("Cumulative Explained Variance = " + str(np.cumsum(S)[-1]))
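Instead of guessing at the number of dimensions, you can read it off the cumulative sum. The sketch below uses a made-up array of explained-variance ratios (not results from the Walmart data) and a hypothetical helper, `components_for`, to find the smallest number of components that reaches a target. In the notebook you could try the same idea on `model.explainedVariance.toArray()` after fitting with a large `k`.

```python
import numpy as np

# Hypothetical explained-variance ratios, for illustration only.
explained = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05])

def components_for(explained, threshold):
    """Smallest number of leading components whose cumulative share >= threshold."""
    cumulative = np.cumsum(explained)
    if threshold > cumulative[-1]:
        raise ValueError("threshold is not reached by the available components")
    # searchsorted returns the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

k_needed = components_for(explained, 0.80)
print("Components needed for 80%:", k_needed)
```

If the number you get is close to the original number of features, that tells you the features are not very correlated with each other and PCA cannot compress them much.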

### Conclusions

What are your conclusions?  Were we able to reduce dimensions in this dataset without losing much of the "signal" of the data?

Why or why not?