In [1]:
try:
    sc.stop()
except Exception:
    # no SparkContext to stop (e.g. first run of the notebook)
    pass
from pyspark import SparkConf,SparkContext
from pyspark.sql import SparkSession
sc=SparkContext()
spark=SparkSession(sc)

## Example Data

In [4]:
import warnings
warnings.simplefilter('ignore')

import pandas as pd

In [7]:
pdf=pd.DataFrame({
    'x1':['a','b','c','a','b','b'],
    'x2':['apple','banana','orange','peach','pine','grape'],
    'x3':[2,3,1,5,11,10]
})

df=spark.createDataFrame(pdf)
df.show()



+---+------+---+
| x1|    x2| x3|
+---+------+---+
|  a| apple|  2|
|  b|banana|  3|
|  c|orange|  1|
|  a| peach|  5|
|  b|  pine| 11|
|  b| grape| 10|
+---+------+---+



## StringIndexer

In [8]:
from pyspark.ml.feature import StringIndexer


***StringIndexer*** maps a string column to a categorical index column. **Indices start at 0 and are assigned in order of descending label frequency**, so the most frequent label gets index 0.0. If the input column is numeric, it is first cast to string and the string values are then indexed.


Using StringIndexer takes 3 steps:
      <br> **Build the StringIndexer** : specify the input column and the output column.
      <br> **Fit the StringIndexer** : call *fit* on the data to learn the label-to-index mapping.
      <br> **Execute the indexing** : call *transform* on the fitted model to add the index column.

In [9]:
#build Indexer
stringindexer=StringIndexer(inputCol='x1',outputCol='Index_x1')

#learn Indexer
stringindexer_model=stringindexer.fit(df)

#execute the transform
df_stringindexer=stringindexer_model.transform(df)

df_stringindexer.show()

+---+------+---+--------+
| x1|    x2| x3|Index_x1|
+---+------+---+--------+
|  a| apple|  2|     1.0|
|  b|banana|  3|     0.0|
|  c|orange|  1|     2.0|
|  a| peach|  5|     1.0|
|  b|  pine| 11|     0.0|
|  b| grape| 10|     0.0|
+---+------+---+--------+
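
The indices in the output above follow label frequency: `b` appears three times, `a` twice and `c` once. As a rough illustration only (plain Python, not Spark's implementation), the fit and transform steps can be sketched as below. The helper names `fit_string_index` and `transform_string_index` are hypothetical, and the alphabetical tie-break is an assumption made to keep the sketch deterministic.

```python
from collections import Counter

def fit_string_index(values):
    """Learn a {label: index} mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically
    # (an assumption of this sketch, chosen for determinism).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_string_index(values, mapping):
    """Replace every label with its learned index."""
    return [mapping[v] for v in values]

x1 = ['a', 'b', 'c', 'a', 'b', 'b']
mapping = fit_string_index(x1)
indexed = transform_string_index(x1, mapping)
print(mapping)
print(indexed)
```

The resulting list corresponds to the `Index_x1` column produced by the Spark cell.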



## OneHotEncoder

In [10]:
from pyspark.ml.feature import OneHotEncoder

***OneHotEncoder*** converts each category index of a *StringIndexed* column to a sparse vector. Each sparse vector has at most **one single active element**, which indicates the category index.

In [11]:
df_x1=df.select('x1')
df_x1.show()

+---+
| x1|
+---+
|  a|
|  b|
|  c|
|  a|
|  b|
|  b|
+---+



#### StringIndex x1

In [13]:
df_Stringindexed=StringIndexer(inputCol='x1',outputCol='index_x1').fit(df_x1).transform(df_x1)
df_Stringindexed.show()

+---+--------+
| x1|index_x1|
+---+--------+
|  a|     1.0|
|  b|     0.0|
|  c|     2.0|
|  a|     1.0|
|  b|     0.0|
|  b|     0.0|
+---+--------+



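Encoding format: 'string index': ['string indices vector size', 'index of string index in string indices vector', **1.0** ]. In other words, each sparse vector stores its size, the position of its single active element, and the value at that position, which is always 1.0.

Here the string indices vector is [0.0, 1.0, 2.0]. Therefore, the mapping between string indices and sparse vectors is: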


<ol>
    <li>0.0: [3, [0], [1.0]]</li>
    <li>1.0: [3, [1], [1.0]]</li>
    <li>2.0: [3, [2], [1.0]]</li>
</ol>
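
To make the triple notation concrete, here is a minimal sketch in plain Python that expands each (size, indices, values) triple from the list above into a dense list. The helper `sparse_to_dense` is a hypothetical name used for illustration; it is not part of PySpark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense

# The three encodings listed above, keyed by string index.
encodings = {
    0.0: (3, [0], [1.0]),
    1.0: (3, [1], [1.0]),
    2.0: (3, [2], [1.0]),
}

dense_rows = {key: sparse_to_dense(*triple) for key, triple in encodings.items()}
for key, row in dense_rows.items():
    print(key, row)
```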

In [16]:
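# Spark 2.x API: OneHotEncoder is a Transformer here.
# In Spark 3.0+ it is an Estimator, so call .fit(df_Stringindexed) before .transform(...).
df_onehot=OneHotEncoder(dropLast=False,inputCol='index_x1',outputCol='sparse_x1').transform(df_Stringindexed)
df_onehot.show()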

+---+--------+-------------+
| x1|index_x1|    sparse_x1|
+---+--------+-------------+
|  a|     1.0|(3,[1],[1.0])|
|  b|     0.0|(3,[0],[1.0])|
|  c|     2.0|(3,[2],[1.0])|
|  a|     1.0|(3,[1],[1.0])|
|  b|     0.0|(3,[0],[1.0])|
|  b|     0.0|(3,[0],[1.0])|
+---+--------+-------------+



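By default (*dropLast=True*), **OneHotEncoder** drops the last category, because keeping it would make the vector entries always sum to one and therefore be linearly dependent. The string indices vector becomes [0.0, 1.0], the last index 2.0 is encoded as an all-zero vector, and the mapping between string indices and sparse vectors is: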

<ol>
    <li>0.0: [2, [0], [1.0]]</li>
    <li>1.0: [2, [1], [1.0]]</li>
    <li>2.0: [2, [ ], [ ]]</li>
</ol>

In [17]:
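# Spark 2.x API: OneHotEncoder is a Transformer here.
# In Spark 3.0+ it is an Estimator, so call .fit(df_Stringindexed) before .transform(...).
df_onehot=OneHotEncoder(inputCol='index_x1',outputCol='sparse_x1').transform(df_Stringindexed)
df_onehot.show()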

+---+--------+-------------+
| x1|index_x1|    sparse_x1|
+---+--------+-------------+
|  a|     1.0|(2,[1],[1.0])|
|  b|     0.0|(2,[0],[1.0])|
|  c|     2.0|    (2,[],[])|
|  a|     1.0|(2,[1],[1.0])|
|  b|     0.0|(2,[0],[1.0])|
|  b|     0.0|(2,[0],[1.0])|
+---+--------+-------------+
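
The effect of dropping the last category can be reproduced with a small plain-Python sketch. The function `one_hot` and its parameters are hypothetical names chosen for illustration; the sketch only mimics the (size, indices, values) output seen in the two tables above and is not Spark's implementation.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a (size, indices, values) triple for a category index."""
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError("index out of range for the given number of categories")
    # Dropping the last category shortens the vector by one slot.
    size = num_categories - 1 if drop_last else num_categories
    if i < size:
        return (size, [i], [1.0])
    # Only reachable when drop_last is True and i is the last category:
    # it is represented by the all-zero vector.
    return (size, [], [])

for idx in (0.0, 1.0, 2.0):
    print(idx, one_hot(idx, 3), one_hot(idx, 3, drop_last=False))
```

With three categories, index 2.0 becomes the all-zero vector when the last category is dropped, and keeps its own slot otherwise.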



We can convert ***sparse vectors*** to dense NumPy arrays with *toArray()*. Stacking the arrays gives a dense NumPy matrix, the form most statistics code expects.

In [18]:
from pyspark.ml.linalg import SparseVector

# one sparse vector per category index, each expanded to a dense NumPy array
x = [
    SparseVector(3, {0: 1.0}).toArray(),
    SparseVector(3, {1: 1.0}).toArray(),
    SparseVector(3, {2: 1.0}).toArray(),
]
x

[array([1., 0., 0.]), array([0., 1., 0.]), array([0., 0., 1.])]

In [19]:
import numpy as np
np.array(x)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])