# Spark Learning Note - Preprocessing and Feature Engineering

Jia Geng | gjia0214@gmail.com

In [1]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('MLExample').getOrCreate()
spark

In [4]:
# load the example data
sale_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/retail-data/by-day/*.csv' 
int_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/simple-ml-integers'
simple_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/simple-ml'
scale_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/simple-ml-scaling'
sales = spark.read.format('csv').option('header', True)\
                                .option('inferSchema', True)\
                                .load(sale_path).coalesce(5).where('Description is not null')
fakeIntDF = spark.read.parquet(int_path)
simpleDF = spark.read.json(simple_path)
scaleDF = spark.read.parquet(scale_path)

In [6]:
sales.show(1)
sales.cache()
fakeIntDF.show(1)
simpleDF.show(1)
scaleDF.show(1)

+---------+---------+------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|       Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+
|   580538|    23084|RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+
only showing top 1 row

+----+----+----+
|int1|int2|int3|
+----+----+----+
|   4|   5|   6|
+----+----+----+
only showing top 1 row

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
+-----+----+------+------------------+
only showing top 1 row

+---+--------------+
| id|      features|
+---+--------------+
|  0|[1.0,0.1,-1.0]|
+---+--------------+
only showing top 1 row

## 1. Formatting Model Inputs for Different Types of Tasks

Classification and Regression
- a Double Type column for label
- a Vector[Double] column for features

Recommendation
- a column for user
- a column for items (movies or books)
- a column for rating

Unsupervised Learning
- a Vector (dense/sparse) for feature

Graph Analysis
- a DataFrame of vertices 
- a DataFrame of edges

Two types of feature transformation
- **Transformer**: converts data in a way that does not depend on the input data, e.g. `Tokenizer`
    - all transformers and some preprocessing estimators have `setInputCol()` and `setOutputCol()` methods
    - a transformer has a `transform()` method that performs the transformation
- **Estimator** for preprocessing: converts data in a way that depends on the input data, e.g. scaling and normalization. `StandardScaler`, for example, depends on the values and distribution of the input
    - an estimator needs to be `fit()` on the data first; the fitted object then performs the transformation
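
The difference between the two contracts can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration only; they are not part of `pyspark.ml` and only mimic the `transform()` / `fit()` pattern described above.

```python
import statistics


class LowercaseTransformer:
    """Stateless: the output depends only on the value being transformed."""

    def transform(self, rows):
        return [row.lower() for row in rows]


class StdScalerModel:
    """Fitted state returned by the estimator; it does the transforming."""

    def __init__(self, std):
        self.std = std

    def transform(self, values):
        return [v / self.std for v in values]


class StdScalerEstimator:
    """Stateful: fit() learns a statistic from the data and returns a model."""

    def fit(self, values):
        return StdScalerModel(statistics.stdev(values))


# A transformer can be applied directly.
tokens = LowercaseTransformer().transform(["RABBIT", "Night"])

# An estimator must be fit first; the fitted model is then applied.
model = StdScalerEstimator().fit([1.0, 2.0, 3.0])
scaled = model.transform([1.0, 2.0, 3.0])

print(tokens)
print(scaled)
```

The fitted model can be reused on other data, which is why fitting and transforming are separate steps.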

## 2. pyspark.ml.feature

### 2.1 Common Transformer for Feature Transformation
- **High Level Transformers**
    - RFormula: good for conventionally formatted data. No need to extract values from strings or manipulate them in any way. RFormula automatically handles categorical input by one-hot encoding it. Numeric columns and labels are converted to double.
        - needs a `.fit()` step
    - check the MLlib overview for an example
- **SQLTransformer**: uses SQL statements to transform the data
- **VectorAssembler**: concatenates the features into one big vector. A tool that is used (directly or indirectly) in nearly every pipeline.


### 2.2 Continuous Data

Common transformations for continuous data include **scaling** and **categorizing**.

Categorizer:
- **Bucketizer**: splits a continuous feature into buckets.
    - the buckets are specified as a list of split points (including the two outer boundaries)
    - use float('inf') or float('-inf') as boundaries if needed
- **QuantileDiscretizer**: uses evenly spaced quantiles to discretize the data
    - needs to be fit on the data to find the quantiles
 
Scaler:
- **StandardScaler**: standardizes the data.
    - operates on a vector column
    - a vector in Spark holds the different features of one row
    - standardization is done per feature across the rows (column-wise), not within each vector!!!
- **MinMaxScaler**: rescales each feature to 0 ~ 1
- **MaxAbsScaler**: rescales each feature to -1 ~ 1 by dividing by its maximum absolute value
- **ElementwiseProduct**: scales each value within a vector by an arbitrary value
    - use `Vectors.dense()` from `pyspark.ml.linalg` to declare the scaling vector
    - the size of the scaling vector must match the feature vectors; values are stored as doubles
    - since the scales are arbitrary, there is no `.fit()` step
- **Normalizer**:
    - normalizes each vector by its own L-p norm (not by a per-feature norm!)

In [8]:
from pyspark.ml.feature import Tokenizer

# tokenizer, can be applied directly on the data!!
tkn = Tokenizer().setInputCol('Description').setOutputCol('Tokenized')
tkn.transform(sales).show(1)

+---------+---------+------------------+--------+-------------------+---------+----------+--------------+--------------------+
|InvoiceNo|StockCode|       Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|           Tokenized|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+--------------------+
|   580538|    23084|RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|[rabbit, night, l...|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+--------------------+
only showing top 1 row
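
As a rough plain-Python picture of what the tokenizer does to one description: lowercase the string, then split it on whitespace. This is only a sketch of the behavior visible in the output above, not Spark's implementation, and edge cases such as repeated spaces may differ.

```python
def simple_tokenize(text):
    # Lowercase first, then split on whitespace.
    return text.lower().split()


tokens = simple_tokenize("RABBIT NIGHT LIGHT")
print(tokens)  # ['rabbit', 'night', 'light']
```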



In [12]:
from pyspark.ml.feature import VectorAssembler

vca = VectorAssembler().setInputCols(['int1', 'int2', 'int3']).setOutputCol('Assembled')
vca.transform(fakeIntDF).show(1)

+----+----+----+-------------+
|int1|int2|int3|    Assembled|
+----+----+----+-------------+
|   4|   5|   6|[4.0,5.0,6.0]|
+----+----+----+-------------+
only showing top 1 row



In [23]:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import expr

# convert the integer to double
contDF = spark.range(20)
contDF.show(1)
contDF.printSchema()
contDF = contDF.withColumn('id', expr('cast(id as double)'))
contDF.printSchema()

# bucketize
splits = [float('-inf'), 1, 3, 10, 15, 20, float('inf')]
bucketer = Bucketizer().setInputCol('id').setOutputCol('bucket').setSplits(splits)
bucketer.transform(contDF).show(5)

+---+
| id|
+---+
|  0|
+---+
only showing top 1 row

root
 |-- id: long (nullable = false)

root
 |-- id: double (nullable = false)

+---+------+
| id|bucket|
+---+------+
|0.0|   0.0|
|1.0|   1.0|
|2.0|   1.0|
|3.0|   2.0|
|4.0|   2.0|
+---+------+
only showing top 5 rows
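
The bucket assignment above can be reproduced with a small plain-Python sketch. It assumes each bucket is the half-open interval `[splits[i], splits[i+1])`, with the last bucket also including its upper bound; `bucketize` is a hypothetical helper, not a Spark function.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i + 1]) holding value."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the split range")
    idx = bisect_right(splits, value) - 1
    # The last bucket also includes its upper bound.
    return min(idx, len(splits) - 2)


splits = [float('-inf'), 1, 3, 10, 15, 20, float('inf')]
buckets = [bucketize(float(x), splits) for x in range(5)]
print(buckets)  # [0, 1, 1, 2, 2]
```

A value equal to a split point goes into the bucket that starts at that point, which is why `1.0` lands in bucket 1 and `3.0` in bucket 2.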



In [29]:
from pyspark.ml.feature import QuantileDiscretizer

# QuantileDiscretizer needs to be fit on the data to find the quantiles
qdt = QuantileDiscretizer().setInputCol('id').setOutputCol('bucket')\
                            .setNumBuckets(100).setRelativeError(0)
qdt.fit(contDF).transform(contDF).show(2)  # more buckets requested than distinct values: empty buckets are ignored

+---+------+
| id|bucket|
+---+------+
|0.0|   1.0|
|1.0|   2.0|
+---+------+
only showing top 2 rows
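
To see why asking for 100 buckets on 20 values still works, here is a simplified plain-Python sketch of quantile-based bucketing. It uses exact rank-based split points and collapses duplicates; Spark relies on an approximate quantile algorithm, so treat `quantile_splits` and `discretize` as hypothetical helpers that only illustrate the idea.

```python
from bisect import bisect_right


def quantile_splits(values, num_buckets):
    """Interior split points at evenly spaced ranks, with duplicates collapsed."""
    ordered = sorted(values)
    n = len(ordered)
    raw = [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]
    return sorted(set(raw))


def discretize(value, splits):
    """Bucket index = number of split points less than or equal to value."""
    return bisect_right(splits, value)


values = [float(i) for i in range(20)]

# 4 buckets: split points 5, 10, 15 and five values per bucket.
splits4 = quantile_splits(values, 4)
counts4 = [sum(1 for v in values if discretize(v, splits4) == b) for b in range(4)]
print(splits4, counts4)

# 100 requested buckets collapse to the 20 distinct values.
splits100 = quantile_splits(values, 100)
print(len(splits100), discretize(0.0, splits100), discretize(1.0, splits100))
```

In this sketch the value 0.0 ends up in bucket 1 and 1.0 in bucket 2, the same assignment as in the output above.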



In [34]:
from pyspark.ml.feature import StandardScaler

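# the scaler has to learn each feature's std (and mean, when withMean=True) from the data,
# so there is a fit step!!!!
ss = StandardScaler().setInputCol('features').setOutputCol('standardized')
stds = ss.fit(scaleDF)

# standardization is per feature across rows, NOT within each vector!!!
# with the default withMean=False, values are divided by the std but not centered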
stds.transform(scaleDF).show(20, False)

+---+--------------+------------------------------------------------------------+
|id |features      |standardized                                                |
+---+--------------+------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|1  |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902]  |
+---+--------------+------------------------------------------------------------+
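
The numbers in this table can be checked by hand. The sketch below assumes the defaults (divide by the sample standard deviation of each feature, no centering) and recomputes the output in plain Python from the five rows shown above.

```python
import statistics

rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]

# One std per feature, computed down the rows (column-wise).
columns = list(zip(*rows))
stds = [statistics.stdev(col) for col in columns]

# Each value is divided by the std of its own feature; no mean is subtracted.
standardized = [[v / s for v, s in zip(row, stds)] for row in rows]

for row in standardized:
    print(row)
```

The first row comes out as roughly `[1.195, 0.0234, -0.598]`, matching the `standardized` column.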



In [45]:
from pyspark.ml.feature import MinMaxScaler, MaxAbsScaler, ElementwiseProduct
from pyspark.ml.linalg import Vectors

# min max scale -> 0 ~ 1
mms = MinMaxScaler().setInputCol('features').setOutputCol('scaled')
mms.fit(scaleDF).transform(scaleDF).show(20, False)

# scale -> -1 ~ 1
ma = MaxAbsScaler().setInputCol('features').setOutputCol('scaled')
ma.fit(scaleDF).transform(scaleDF).show(20, False)

# arbitrary scale: no fit step
# the scaling vector must be declared explicitly
# its size must match the feature vectors
scales = Vectors.dense(1.0, 2.0, 3.0)
eps = ElementwiseProduct().setInputCol('features').setOutputCol('scaled').setScalingVec(scales)
eps.transform(scaleDF).show(20, False)

+---+--------------+-------------+
|id |features      |scaled       |
+---+--------------+-------------+
|0  |[1.0,0.1,-1.0]|[0.0,0.0,0.0]|
|1  |[2.0,1.1,1.0] |[0.5,0.1,0.5]|
|0  |[1.0,0.1,-1.0]|[0.0,0.0,0.0]|
|1  |[2.0,1.1,1.0] |[0.5,0.1,0.5]|
|1  |[3.0,10.1,3.0]|[1.0,1.0,1.0]|
+---+--------------+-------------+

+---+--------------+-------------------------------------------------------------+
|id |features      |scaled                                                       |
+---+--------------+-------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009901,-0.3333333333333333]|
|1  |[2.0,1.1,1.0] |[0.6666666666666666,0.10891089108910892,0.3333333333333333]  |
|0  |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009901,-0.3333333333333333]|
|1  |[2.0,1.1,1.0] |[0.6666666666666666,0.10891089108910892,0.3333333333333333]  |
|1  |[3.0,10.1,3.0]|[1.0,1.0,1.0]                                                |
+---+--------------+-------------------------------------------------------------+
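
The three rescalings can be written out in plain Python to make the arithmetic explicit. This is a sketch under simple assumptions (default 0 ~ 1 range for min-max, every feature has max > min), not the Spark implementation.

```python
rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]
columns = list(zip(*rows))

# Min-max: (x - min) / (max - min), per feature.
mins = [min(col) for col in columns]
maxs = [max(col) for col in columns]
min_max = [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]
           for row in rows]

# Max-abs: x / max(|x|), per feature.
abs_maxs = [max(abs(v) for v in col) for col in columns]
max_abs = [[v / m for v, m in zip(row, abs_maxs)] for row in rows]

# Elementwise product: multiply each position by a fixed scale, no statistics needed.
scales = [1.0, 2.0, 3.0]
elementwise = [[v * s for v, s in zip(row, scales)] for row in rows]

print(min_max[1])
print(max_abs[0])
print(elementwise[0])
```

The first two results agree with the two tables above; the third line shows what the elementwise product does to the first row, namely `1.0 * 1.0`, `0.1 * 2.0` and `-1.0 * 3.0`.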

In [54]:
from pyspark.ml.feature import Normalizer

# normalize each vector by its L-p norm
# e.g. L1 norm of the first vector: |1.0| + |0.1| + |-1.0| = 2.1  ==> 1.0 / 2.1 = 0.476
nm = Normalizer().setInputCol('features').setP(1)
nm.transform(scaleDF).show(30, False)

+---+--------------+---------------------------------------------------------------+
|id |features      |Normalizer_5ccbb7c91686__output                                |
+---+--------------+---------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.47619047619047616,0.047619047619047616,-0.47619047619047616]|
|1  |[2.0,1.1,1.0] |[0.48780487804878053,0.26829268292682934,0.24390243902439027]  |
|0  |[1.0,0.1,-1.0]|[0.47619047619047616,0.047619047619047616,-0.47619047619047616]|
|1  |[2.0,1.1,1.0] |[0.48780487804878053,0.26829268292682934,0.24390243902439027]  |
|1  |[3.0,10.1,3.0]|[0.18633540372670807,0.6273291925465838,0.18633540372670807]   |
+---+--------------+---------------------------------------------------------------+
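
In contrast to the scalers, the normalizer works within each row. A minimal plain-Python sketch of L-p normalization (the helper name `lp_normalize` is made up for this note, and zero vectors are not handled):

```python
def lp_normalize(vector, p=2.0):
    """Divide every component by the L-p norm of the same vector."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / norm for v in vector]


first = lp_normalize([1.0, 0.1, -1.0], p=1)
last = lp_normalize([3.0, 10.1, 3.0], p=1)
print(first)
print(last)
```

With `p=1` the absolute values of each output vector sum to 1, which matches the rows in the table above.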



### 2.3 Transformer for Categorical Data

The most common task for categorical data is indexing: converting a categorical variable in a column to a numerical one that can be plugged into a machine learning algorithm. Sometimes the input needs to be tokenized before this transformation.

**Categorical Encoding**

- **StringIndexer**: maps strings to numerical ids
    - needs to be `.fit()` on the data
    - also works on non-string columns
    - **StringIndexer raises an error on unseen labels by default.** Set `handleInvalid` to `'skip'` to drop rows with unseen labels, or to `'keep'` to put them in an extra index.
    - indexes are ordered by label frequency by default: the most frequent label gets index 0

- **IndexToString**: converts the ids back to strings.
    - no need to specify the labels or the mapping; Spark reads them from the column metadata
    - if you do need to specify the labels, do this via `.setLabels()`

- **VectorIndexer**: automatically detects categorical features inside a vector column and converts them to 0-based category indexes.
    - needs to be fit on the data
    - need to `setMaxCategories()`
        - E.g. if set to 2, every feature with 2 or fewer unique values is treated as categorical and converted to 0-based indexes
    - be careful with features that are not categorical but have few unique values
    
**One-Hot Encoding**
- **OneHotEncoder**: one-hot encoding does not introduce an artificial numerical ordering between categories
    - **the one-hot encoder only works on numerical indexes, so string columns usually need to be converted with `StringIndexer` first**
    - in Spark 2.x the encoder is a plain transformer with no fit step, as used here; since Spark 3.0 `OneHotEncoder` is an estimator and needs `fit()` first
    - Spark stores the one-hot output as sparse vectors
    - by default the last category is dropped

In [67]:
from pyspark.ml.feature import StringIndexer

# automatically encode the string into numerical ids
stdIdxer = StringIndexer().setInputCol('lab').setOutputCol('labelID')
encoded = stdIdxer.fit(simpleDF).transform(simpleDF)
encoded.show(2)

# also works with a numerical column
stdIdxer = StringIndexer().setInputCol('value1').setOutputCol('labelID')
stdIdxer.fit(simpleDF).transform(simpleDF).show(5)


# to skip rows with unseen labels instead of raising an error
stdIdxer = stdIdxer.setHandleInvalid('skip')

+-----+----+------+------------------+-------+
|color| lab|value1|            value2|labelID|
+-----+----+------+------------------+-------+
|green|good|     1|14.386294994851129|    1.0|
| blue| bad|     8|14.386294994851129|    0.0|
+-----+----+------+------------------+-------+
only showing top 2 rows

+-----+----+------+------------------+-------+
|color| lab|value1|            value2|labelID|
+-----+----+------+------------------+-------+
|green|good|     1|14.386294994851129|    2.0|
| blue| bad|     8|14.386294994851129|    4.0|
| blue| bad|    12|14.386294994851129|    0.0|
|green|good|    15| 38.97187133755819|    5.0|
|green|good|    12|14.386294994851129|    0.0|
+-----+----+------+------------------+-------+
only showing top 5 rows
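
The indexing rule and the handling of unseen labels can be sketched in plain Python. The helpers below are hypothetical and assume the default frequency-descending order (ties broken alphabetically here, which is an assumption) and the three `handleInvalid` modes `error`, `skip` and `keep`.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Most frequent label gets index 0.0, the next one 1.0, and so on."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}


def index_labels(labels, mapping, handle_invalid="error"):
    indexed = []
    for lab in labels:
        if lab in mapping:
            indexed.append(mapping[lab])
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        elif handle_invalid == "keep":
            indexed.append(float(len(mapping)))  # extra index for unseen labels
        else:
            raise ValueError(f"Unseen label: {lab}")
    return indexed


mapping = fit_string_indexer(["good", "bad", "bad", "good", "bad"])
print(mapping)  # {'bad': 0.0, 'good': 1.0}

new_labels = ["good", "unknown", "bad"]
print(index_labels(new_labels, mapping, handle_invalid="skip"))  # [1.0, 0.0]
print(index_labels(new_labels, mapping, handle_invalid="keep"))  # [1.0, 2.0, 0.0]
```

This is also why `bad` maps to 0.0 and `good` to 1.0 in the first table above, assuming `bad` is the more frequent label in the full data.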



In [71]:
from pyspark.ml.feature import IndexToString

# convert back; Spark recovers the original mapping from the column metadata
idx2str = IndexToString().setInputCol('labelID')
idx2str.transform(encoded).show(2)

# convert back with explicitly supplied labels (index 0 -> 'Terrible', index 1 -> 'Great')
idx2str = IndexToString().setInputCol('labelID').setLabels(['Terrible', 'Great'])
idx2str.transform(encoded).show(2)

+-----+----+------+------------------+-------+----------------------------------+
|color| lab|value1|            value2|labelID|IndexToString_8c624bf87972__output|
+-----+----+------+------------------+-------+----------------------------------+
|green|good|     1|14.386294994851129|    1.0|                              good|
| blue| bad|     8|14.386294994851129|    0.0|                               bad|
+-----+----+------+------------------+-------+----------------------------------+
only showing top 2 rows

+-----+----+------+------------------+-------+----------------------------------+
|color| lab|value1|            value2|labelID|IndexToString_46daa60f5f9e__output|
+-----+----+------+------------------+-------+----------------------------------+
|green|good|     1|14.386294994851129|    1.0|                             Great|
| blue| bad|     8|14.386294994851129|    0.0|                          Terrible|
+-----+----+------+------------------+-------+----------------------------------+

In [75]:
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

sample = spark.createDataFrame([(Vectors.dense([1, 2, 3]), 1),
                                (Vectors.dense([2, 5, 6]), 2),
                                (Vectors.dense([1, 8, 9]), 3)], ['features', 'label'])
sample.show()

# set up the vector indexer and max cats
vidxerA = VectorIndexer().setInputCol('features').setMaxCategories(2)  # 2 cats 
vidxerB = VectorIndexer().setInputCol('features').setMaxCategories(3)  # 3 cats

+-------------+-----+
|     features|label|
+-------------+-----+
|[1.0,2.0,3.0]|    1|
|[2.0,5.0,6.0]|    2|
|[1.0,8.0,9.0]|    3|
+-------------+-----+



In [77]:
# A: max categories = 2, so only the first feature (2 unique values) is converted
vidxerA.fit(sample).transform(sample).show()

# B: max categories = 3, so all features are converted to 0-based indexes
vidxerB.fit(sample).transform(sample).show()

+-------------+-----+----------------------------------+
|     features|label|VectorIndexer_4f7942a7eb42__output|
+-------------+-----+----------------------------------+
|[1.0,2.0,3.0]|    1|                     [0.0,2.0,3.0]|
|[2.0,5.0,6.0]|    2|                     [1.0,5.0,6.0]|
|[1.0,8.0,9.0]|    3|                     [0.0,8.0,9.0]|
+-------------+-----+----------------------------------+

+-------------+-----+----------------------------------+
|     features|label|VectorIndexer_6daa1678ddbc__output|
+-------------+-----+----------------------------------+
|[1.0,2.0,3.0]|    1|                     [0.0,0.0,0.0]|
|[2.0,5.0,6.0]|    2|                     [1.0,1.0,1.0]|
|[1.0,8.0,9.0]|    3|                     [0.0,2.0,2.0]|
+-------------+-----+----------------------------------+
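
Both outputs can be reproduced with a plain-Python sketch of the rule: a feature is treated as categorical when it has at most `max_categories` distinct values, and those values are then mapped to 0-based indexes in ascending order. The helpers are hypothetical and ignore any special cases Spark may apply (for example around zero values).

```python
def fit_vector_indexer(rows, max_categories):
    """Return {feature position: {original value: category index}}."""
    category_maps = {}
    for j, col in enumerate(zip(*rows)):
        distinct = sorted(set(col))
        if len(distinct) <= max_categories:
            category_maps[j] = {v: float(i) for i, v in enumerate(distinct)}
    return category_maps


def apply_vector_indexer(rows, category_maps):
    # Features without a category map are passed through unchanged.
    return [[category_maps[j][v] if j in category_maps else v
             for j, v in enumerate(row)]
            for row in rows]


rows = [[1.0, 2.0, 3.0], [2.0, 5.0, 6.0], [1.0, 8.0, 9.0]]

indexed_a = apply_vector_indexer(rows, fit_vector_indexer(rows, 2))
indexed_b = apply_vector_indexer(rows, fit_vector_indexer(rows, 3))
print(indexed_a)
print(indexed_b)
```

With `max_categories=2` only the first feature qualifies; with `max_categories=3` all three features do.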



In [85]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# label encoding first
strIdxer = StringIndexer().setInputCol('color').setOutputCol('colorId')
encoded = strIdxer.fit(simpleDF).transform(simpleDF)

# one-hot encoding on the label-encoded data
# one-hot only works on numerical indexes, so index the strings first
# note: since Spark 3.0 OneHotEncoder is an estimator, so there the call becomes
# onehot.fit(encoded).transform(encoded)
onehot = OneHotEncoder().setInputCol('colorId').setOutputCol('onehot').setDropLast(False)
onehot.transform(encoded).show(5)

+-----+----+------+------------------+-------+-------------+
|color| lab|value1|            value2|colorId|       onehot|
+-----+----+------+------------------+-------+-------------+
|green|good|     1|14.386294994851129|    1.0|(3,[1],[1.0])|
| blue| bad|     8|14.386294994851129|    2.0|(3,[2],[1.0])|
| blue| bad|    12|14.386294994851129|    2.0|(3,[2],[1.0])|
|green|good|    15| 38.97187133755819|    1.0|(3,[1],[1.0])|
|green|good|    12|14.386294994851129|    1.0|(3,[1],[1.0])|
+-----+----+------+------------------+-------+-------------+
only showing top 5 rows
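
The `onehot` column is printed in sparse form: `(3,[1],[1.0])` means a vector of size 3 with the value 1.0 at position 1. A plain-Python sketch of the dense equivalent, including the effect of dropping the last category (`one_hot` is a hypothetical helper, not a Spark function):

```python
def one_hot(index, num_categories, drop_last=True):
    """Dense one-hot vector for a 0-based category index."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0  # the dropped last category stays all zeros
    return vector


print(one_hot(1, 3, drop_last=False))  # [0.0, 1.0, 0.0]
print(one_hot(2, 3, drop_last=False))  # [0.0, 0.0, 1.0]
print(one_hot(2, 3, drop_last=True))   # [0.0, 0.0]
```

Dropping the last category loses no information, because an all-zero vector identifies it uniquely.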



### 2.4 Transform Text Data

**Tokenizer**
- takes a string of words separated by spaces and converts it into an array of words

**RegexTokenizer**
- 