# Spark MLlib Example: PCA

Let's look at a PCA (principal component analysis) example in Spark MLlib.

Here, we are going to load the Walmart trip type dataset. We will load the CSV file as a Spark dataframe and view it.

In [None]:
%matplotlib inline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import PCA



In [None]:
dataset = spark.read.csv("/data/walmart-triptype/train-transformed.csv.gz", header=True, inferSchema=True)


In [None]:
dataset.show()

## Step 1: Creating Vectors

The data is already loaded, so let's choose the feature columns and create vectors out of them.

In [None]:
columns = dataset.columns
columns.remove('VisitNumber')  # We don't care about the visit number as a feature.
columns.remove('TripType')  # TripType is the label, not a feature.

print(columns)

**=> Build the feature vector with VectorAssembler. The output column is "features".**


In [None]:
# Build the vector.
assembler = VectorAssembler(inputCols=columns, outputCol="???")
featureVector = assembler.transform(???)


In [None]:
# Print some sample rows.
for row in featureVector.select('features').take(10):
    print("Vector: %s\n" % (str(row)))

Note the output. These are sparse (not dense) vectors. That's because our data IS sparse: in any given row, relatively few of the variables have a nonzero value.
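To make the sparse/dense distinction concrete, here is a minimal plain-Python sketch (not Spark code). It assumes a sparse vector is described by its size plus the indices and values of its nonzero entries. The helper `to_dense` and the sample row are hypothetical and only for illustration.

```python
# Plain-Python illustration (not Spark code): expand a sparse vector,
# given as (size, indices, values), into the equivalent dense list.
def to_dense(size, indices, values):
    dense = [0.0] * size          # every position starts at zero
    for i, v in zip(indices, values):
        dense[i] = v              # fill in only the stored nonzero entries
    return dense

# Hypothetical row: 8 features, only positions 1 and 5 are nonzero.
size, indices, values = 8, [1, 5], [3.0, 1.0]
dense = to_dense(size, indices, values)
print(dense)  # [0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
```

The sparse form stores 2 index/value pairs instead of 8 numbers, which is why it pays off when most features are zero.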

## Step 2: Build a Correlation Matrix

We're going to build a correlation matrix. It has every feature as both a row and a column, and each entry is the correlation between the corresponding pair of features. Naturally, every feature is perfectly correlated with itself, so we expect to see a diagonal of 1's. The matrix is also symmetric, so the upper-right and lower-left triangles should be mirror images of each other.

Perfectly uncorrelated (orthogonal) features would have the identity matrix as their correlation matrix. Part of our goal in PCA is to create orthogonal features.

In [None]:
## Checking the correlation matrix of the data.

r1 = Correlation.corr(featureVector, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))


Note that there are many correlated dimensions in the original dataset. We can identify them by the nonzero off-diagonal values in the correlation matrix. The diagonal is always 1, because every dimension is perfectly correlated with itself.
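The properties described here (a diagonal of 1's, symmetry, and nonzero off-diagonal entries for related features) can be checked on a tiny example. The following is a plain-Python sketch, not the Spark `Correlation` API. The `pearson` helper and the toy columns are assumptions made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: three features (columns), five rows.
cols = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.8],   # nearly a multiple of the first feature
    [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to the other two
]

# Entry [i][j] is the correlation between feature i and feature j.
corr = [[pearson(a, b) for b in cols] for a in cols]
for row in corr:
    print(["%.3f" % v for v in row])
```

The first two features produce an off-diagonal entry close to 1, while the third feature produces much smaller off-diagonal entries.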

## Step 3: Scale (Normalize) the data 

We need to scale our features so that no single dimension dominates. Why does this matter? PCA looks for the directions of greatest variance, so a dimension measured on a larger scale than the others would be unfairly weighted in our analysis. We want to avoid this.
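As a rough illustration of what this scaling does, the plain-Python sketch below (not Spark code) divides each feature by its sample standard deviation without centering it. This is intended to mirror the `withStd=True, withMean=False` setting used in this step. The feature values and the helper `scale_to_unit_std` are hypothetical.

```python
import statistics

# Hypothetical toy features on very different scales.
feature_a = [30000.0, 52000.0, 47000.0, 81000.0]  # values in the tens of thousands
feature_b = [1.0, 3.0, 2.0, 5.0]                  # values in the single digits

def scale_to_unit_std(xs):
    # Divide by the sample standard deviation; no mean is subtracted.
    sd = statistics.stdev(xs)
    return [x / sd for x in xs]

scaled_a = scale_to_unit_std(feature_a)
scaled_b = scale_to_unit_std(feature_b)

# Before scaling the spreads differ by orders of magnitude.
print(statistics.stdev(feature_a), statistics.stdev(feature_b))
# After scaling both features have a standard deviation of (about) 1.
print(statistics.stdev(scaled_a), statistics.stdev(scaled_b))
```

Because the mean is not subtracted, zeros stay zero, which keeps sparse data sparse.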

**=> Build the scaler with inputCol as "features" and outputCol as "scaledFeatures".**

In [None]:
scaler = StandardScaler(inputCol="???", outputCol="???",
                        withStd=True, withMean=False)

# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(featureVector)

# Normalize each feature to have unit standard deviation.
ScaledFeatures = scalerModel.transform(featureVector)
ScaledFeatures.select('features', 'scaledFeatures').show()

## Step 4: Running PCA

Now we will run PCA to reduce the number of dimensions and decorrelate them.

**=> Run PCA with 5 dimensions.**


In [None]:
pca = PCA(k=???, inputCol="scaledFeatures", outputCol="pcaFeatures")
model = pca.fit(ScaledFeatures)
pcaFeatures = model.transform(???).select("pcaFeatures")

Let's take a look at the distribution of our transformed dataset.

## Step 5: Re-Running the Correlation Matrix

**=> Rerun the correlation matrix on the PCA output to see how the correlations changed.**

**=> Examine the output.**

In [None]:
## Checking the correlation matrix of the data.

r1 = ??? # Rerun correlation matrix
print("Pearson correlation matrix:\n" + str(r1[0]))


Note the very small, close-to-zero off-diagonal correlations in the matrix. The 5 dimensions are, for all practical purposes, uncorrelated (orthogonal).
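The decorrelating effect of PCA can be seen on a two-feature toy example. The sketch below is plain Python and is not how Spark implements `PCA`. It relies on the fact that, for two features, the principal axes are a rotation of the original axes by an angle that has a closed form. The data and all helper names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: two strongly correlated features.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

# Center both features.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
xc = [v - mx for v in x]
yc = [v - my for v in y]

# Entries of the 2x2 sample covariance matrix.
var_x = sum(a * a for a in xc) / (n - 1)
var_y = sum(b * b for b in yc) / (n - 1)
cov_xy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)

# Rotation angle that diagonalizes a 2x2 covariance matrix.
theta = 0.5 * math.atan2(2.0 * cov_xy, var_x - var_y)
c, s = math.cos(theta), math.sin(theta)

# Project the centered data onto the two principal axes.
pc1 = [a * c + b * s for a, b in zip(xc, yc)]
pc2 = [-a * s + b * c for a, b in zip(xc, yc)]

print("correlation before PCA: %.4f" % pearson(x, y))
print("correlation after PCA:  %.4f" % pearson(pc1, pc2))
```

The original features are almost perfectly correlated, while the two principal components have a correlation of zero up to floating-point rounding, which is the same pattern the rerun correlation matrix shows for the 5 PCA dimensions.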