# PCA Basics

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. In PySpark, the `PCA` estimator in `pyspark.ml.feature` fits a model that projects vectors into a lower-dimensional space. The example below projects 5-dimensional feature vectors onto 3 principal components.
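To make the orthogonal transformation concrete, here is a minimal NumPy sketch of textbook PCA on the same three vectors used in this notebook. It is an illustration under stated assumptions, not Spark's implementation: the helper `top_k_components` is a hypothetical name, the sparse vector is written out densely, and the rows are centered before projection, so the numbers need not match the Spark output.

```python
import numpy as np

# Same three observations as the PySpark example, with the sparse vector written densely.
X = np.array([
    [0.0, 1.0, 0.0, 7.0, 0.0],
    [2.0, 0.0, 3.0, 4.0, 5.0],
    [4.0, 0.0, 0.0, 6.0, 7.0],
])

def top_k_components(X, k):
    """Return the k leading principal directions (as columns) and their variances."""
    centered = X - X.mean(axis=0)
    # Sample covariance matrix of the features (5 x 5).
    cov = centered.T @ centered / (X.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order], eigvals[order]

components, variances = top_k_components(X, k=3)

# Project the centered rows onto the three leading directions.
projected = (X - X.mean(axis=0)) @ components
print(projected.shape)
```

The columns of `components` are orthonormal, which is what makes the projected coordinates linearly uncorrelated.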

In [19]:
# Locate the Spark installation and make pyspark importable
import findspark
findspark.init()
import pyspark

In [20]:
# Initialize and create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PCA_Basics').getOrCreate()

In [21]:
# Create sample data: three 5-dimensional vectors (one sparse, two dense)
from pyspark.ml.linalg import Vectors

data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]

In [22]:
data

[(SparseVector(5, {1: 1.0, 3: 7.0}),),
 (DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]),),
 (DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]),)]

In [23]:
type(data)

list

In [24]:
# Create a DataFrame with a single 'features' column
df = spark.createDataFrame(data, ["features"])

In [25]:
df.show(truncate=False)

+---------------------+
|features             |
+---------------------+
|(5,[1,3],[1.0,7.0])  |
|[2.0,0.0,3.0,4.0,5.0]|
|[4.0,0.0,0.0,6.0,7.0]|
+---------------------+



In [26]:
# Set up the PCA estimator
from pyspark.ml.feature import PCA

pca = PCA(inputCol='features', outputCol='pca_features')

In [27]:
# Set the number of principal components (k) to 3 and fit the model
pca_model = pca.setK(3).fit(df)

In [28]:
pcaDF = pca_model.transform(df)

In [29]:
pcaDF.show(truncate=False)

+---------------------+-----------------------------------------------------------+
|features             |pca_features                                               |
+---------------------+-----------------------------------------------------------+
|(5,[1,3],[1.0,7.0])  |[1.6485728230883807,-4.013282700516296,-5.524543751369388] |
|[2.0,0.0,3.0,4.0,5.0]|[-4.645104331781534,-1.1167972663619026,-5.524543751369387]|
|[4.0,0.0,0.0,6.0,7.0]|[-6.428880535676489,-5.337951427775355,-5.524543751369389] |
+---------------------+-----------------------------------------------------------+
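
The third column of `pca_features` is the same for every row. A likely explanation is that three observations span at most two directions once the mean is removed, so a third component carries no variance. The NumPy sketch below checks this on the same data. It is an independent illustration rather than Spark's implementation, and the names `ratio` and `third` are introduced here. The specific constant depends on which zero-variance direction the solver picks, so only its constancy is checked.

```python
import numpy as np

# Same three observations as the PySpark example, with the sparse vector written densely.
X = np.array([
    [0.0, 1.0, 0.0, 7.0, 0.0],
    [2.0, 0.0, 3.0, 4.0, 5.0],
    [4.0, 0.0, 0.0, 6.0, 7.0],
])

centered = X - X.mean(axis=0)

# Three rows that sum to zero after centering can have rank at most 2.
rank = np.linalg.matrix_rank(centered)

# Singular values of the centered data give the component variances: s**2 / (n - 1).
_, s, vt = np.linalg.svd(centered, full_matrices=False)
variances = s**2 / (X.shape[0] - 1)
ratio = variances / variances.sum()

# Projecting the raw (uncentered) rows onto the zero-variance direction
# yields the same value for every row.
third = X @ vt[2]

print(rank)
print(np.round(ratio, 6))
print(np.ptp(third) < 1e-9)
```

PySpark's fitted `PCAModel` also exposes an `explainedVariance` attribute, which can be inspected for the same purpose.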



In [None]:
# Stop the Spark session
spark.stop()