In [1]:
# SparkContext is already defined as sc
HDFS = 'hdfs://scut0:9000/'

# Extracting features from the LFW dataset

## Exploring the face data

- Spark provides ways to read text files and custom Hadoop input data sources, but it has no built-in functionality for decoding images.
- Spark does provide a method called `wholeTextFiles`, which operates on entire files at once. For binary content such as images, PySpark offers the analogous `binaryFiles`, which we use here.

In [2]:
# binaryFiles returns an RDD of key-value pairs, where the key is the
# file path and the value is the binary content of that file
# (HDFS already ends with '/', so no leading slash is needed here)
pairRDD = sc.binaryFiles(HDFS + 'lfw/*')
print(pairRDD.count())
# print(pairRDD.first())

1054


## Extracting facial images as vectors

For many image-processing and machine learning tasks involving images, it is common to operate on grayscale images.

- We first convert the color images to grayscale.
- We then resize each image to 100 x 100 pixels (the originals are 250 x 250) and flatten it into a vector.

In [3]:
import numpy as np
import cv2
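def img2Vector(imgStr):
    """Decode raw image bytes into a flattened 100 x 100 grayscale vector."""
    resizedSize = (100, 100)
    # np.fromstring is deprecated for binary data; np.frombuffer replaces it
    npArr = np.frombuffer(imgStr, np.uint8)
    npImg = cv2.imdecode(npArr, cv2.IMREAD_COLOR)
    grayImg = cv2.cvtColor(npImg, cv2.COLOR_BGR2GRAY)
    return cv2.resize(grayImg, resizedSize).flatten()

from pyspark.mllib.linalg import Vectors
# Tuple-unpacking lambdas such as `lambda (path, imgStr): ...` only work in
# Python 2, so index into the (path, content) pair instead
imgVectors = pairRDD.map(lambda kv: Vectors.dense(img2Vector(kv[1])))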
print(imgVectors.count())

1054


In [4]:
imgVectors.cache()  # cache is a method; without the parentheses nothing is cached
print(type(imgVectors.first()))

<class 'pyspark.mllib.linalg.DenseVector'>


## Normalization

It is common practice to standardize input data before running dimensionality reduction models, PCA in particular.
We will do this using the built-in `StandardScaler` provided by MLlib's feature package. Here we only subtract the mean (`withMean=True`) and leave the variance unscaled (`withStd=False`).

In [5]:
from pyspark.mllib.feature import StandardScaler
scaler = StandardScaler(withMean=True, withStd=False).fit(imgVectors)
scaledImgVectors = scaler.transform(imgVectors)
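
To make the effect of `withMean=True, withStd=False` concrete, here is a minimal local sketch using NumPy instead of Spark. The toy matrix `X` is a hypothetical stand-in for the image vectors, not LFW data; the point is only that each column (pixel position) ends up with zero mean while its spread is left unchanged.

```python
import numpy as np

# Hypothetical toy data: 4 "images" with 3 pixels each (not LFW data)
X = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [2.0, 4.0, 6.0],
              [6.0, 4.0, 2.0]])

# Subtract the per-column mean, which is what mean-centering without
# variance scaling amounts to
col_mean = X.mean(axis=0)
X_centered = X - col_mean

print(X_centered.mean(axis=0))  # every column mean is now (numerically) zero
```

Centering matters for the later sections: the equivalence between PCA and SVD relies on the rows having zero mean.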

# Training a dimensionality reduction model

Dimensionality reduction models in MLlib require vectors as inputs. However,
unlike clustering, which operates on an RDD[Vector], PCA and SVD computations are
provided as methods on a distributed `RowMatrix`.

## Running PCA on the LFW dataset

In [6]:
from pyspark.mllib.linalg.distributed import RowMatrix
matrix = RowMatrix(scaledImgVectors)
# RowMatrix.computePrincipalComponents is not available in PySpark before Spark 2.2.0
# pc = matrix.computePrincipalComponents(5)

## Visualizing the Eigenfaces

Since PCA is not available in PySpark here, we simply show the result that the book provides.
Below are the top 10 eigenfaces extracted with PCA:

![](http://static.zybuluo.com/WuLiangchao/ull2ex1a454wcndk07tj5pgg/eigenfaces.png)

The [eigenfaces](https://en.wikipedia.org/wiki/Eigenface) above are the eigenvectors obtained from PCA, visualized as images.

Looking at the preceding images, we can see that the PCA model has effectively extracted recurring patterns of variation, which represent various features of the facial images. 
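
As a rough illustration of how an eigenvector becomes an image, the sketch below reshapes a vector of length 100 x 100 back into a 2-D array and rescales it to the 0-255 range. The `component` array is a random, hypothetical stand-in for a real principal component (none is computed in this notebook), so the resulting picture would be noise rather than a face.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for one principal component of length 100 * 100
component = rng.normal(size=100 * 100)

# Undo the row-major flatten applied when the images were vectorized
face = component.reshape(100, 100)

# Eigenvector entries can be negative, so rescale to 0..255 before display
lo, hi = face.min(), face.max()
face_u8 = ((face - lo) / (hi - lo) * 255).astype(np.uint8)

print(face_u8.shape, face_u8.dtype)
```

An array like `face_u8` can then be passed to any image viewer, for example matplotlib's `imshow` with a gray colormap.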

# Using a dimensionality reduction model

The overall purpose of using dimensionality reduction is to create a more compact representation of the data that still captures the important features and variability
in the raw dataset.

To do this, we need to use a trained model to transform our raw data by projecting it into the new, lower-dimensional space represented by the principal components.

The transformation is a matrix multiplication of the image matrix with the matrix of principal components. As mentioned above, PCA is not available here, so the code below is commented out.

In [None]:
# projected = matrix.multiply(pc)
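
Because the distributed call is commented out, the projection can be illustrated locally with NumPy. This is a minimal sketch, not the MLlib implementation: the random matrix `X`, its size, and `k = 2` are all assumptions chosen for illustration. The principal components are taken as the top eigenvectors of the sample covariance matrix, and the projection is a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the image matrix: 20 samples, 6 features
X = rng.normal(size=(20, 6))
Xc = X - X.mean(axis=0)                  # mean-center, as done by the scaler

# Principal components = eigenvectors of the sample covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance

k = 2
pc = eigvecs[:, order[:k]]               # shape (6, k): one component per column

# Project the data into the k-dimensional space
projected = Xc @ pc                      # shape (20, k)
print(projected.shape)
```

Each column of `projected` has a sample variance equal to the corresponding eigenvalue, which is why keeping the top components preserves as much variability as possible.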

## The relationship between PCA and SVD

There is a close relationship between PCA and SVD. In fact, we can recover the same principal components and also apply the same
projection into the space of principal components using SVD.

In our example, **the right singular vectors (`svd.V` below) derived from computing the SVD will be equivalent to the principal components** we have calculated.

In [None]:
# RowMatrix.computeSVD is not available in this PySpark version either
"""
svd = matrix.computeSVD(10, computeU=True)
U = svd.U       # The U factor is a RowMatrix.
s = svd.s       # The singular values are stored in a local dense vector.
V = svd.V       # The V factor is a local dense matrix
"""

**The other relationship that holds is that the product of the matrix U and the vector
s (strictly speaking, the diagonal matrix with s on its diagonal) is equivalent to the PCA projection of
our original image data into the space of the top 10 principal components.**
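
Both relationships can be checked numerically on a small example. The sketch below uses NumPy on a random, mean-centered matrix; the data, its shape, and `k = 3` are assumptions for illustration and have nothing to do with the LFW images. Note that singular vectors and eigenvectors are only defined up to sign, so the comparison of directions uses absolute values.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for the image matrix: 30 samples, 8 features
X = rng.normal(size=(30, 8))
Xc = X - X.mean(axis=0)          # the equivalence requires mean-centered rows
n, k = Xc.shape[0], 3

# SVD route: Xc = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:k].T                   # top-k right singular vectors as columns

# PCA route: eigenvectors of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order
pc = eigvecs[:, :k]

# 1) Right singular vectors match the principal components (up to sign)
same_directions = np.allclose(np.abs(V_k.T @ pc), np.eye(k), atol=1e-8)

# 2) U * diag(s) equals the projection of the data onto those components
proj_pca = Xc @ V_k
proj_svd = U[:, :k] * s[:k]      # broadcasting scales column j by s[j]
same_projection = np.allclose(proj_pca, proj_svd)

print(same_directions, same_projection)
```

The singular values are also tied to the PCA eigenvalues: `s**2 / (n - 1)` reproduces the variances along each component.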

More details about SVD and PCA can be obtained here: https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca