# Spark Learning Note - Preprocessing and Feature Engineering

Jia Geng | gjia0214@gmail.com

<a id='directory'></a>

## Directory

- [Data Source](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/)
- [1. Formatting Models for Different Types of Tasks](#sec1)
- [2. `pyspark.ml.feature`](#sec2)
    - [2.1 Common Transformer for Feature Transformation](#sec2-1)
    - [2.2 Transform Continuous Data](#sec2-2)
    - [2.3 Transform Categorical Data](#sec2-3)
    - [2.4 Transform Text Data](#sec2-4)
    - [2.5 Feature Manipulation and Selection](#sec2-5)
- [3. Advanced Topics](#sec3)
    - [3.1 Persisting Transformer](#sec3-1)
    - [3.2 Writing a Custom Transformer](#sec3-2)

In [1]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('MLExample').getOrCreate()
spark

In [2]:
# load the example data
sale_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/retail-data/by-day/*.csv' 
int_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/simple-ml-integers'
simple_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/simple-ml'
scale_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/simple-ml-scaling'
sales = spark.read.format('csv').option('header', True)\
                                .option('inferSchema', True)\
                                .load(sale_path).coalesce(5).where('Description is not null')
fakeIntDF = spark.read.parquet(int_path)
simpleDF = spark.read.json(simple_path)
scaleDF = spark.read.parquet(scale_path)

In [3]:
sales.show(1)
sales.cache()
fakeIntDF.show(1)
simpleDF.show(1)
scaleDF.show(1)

+---------+---------+------------------+--------+-------------------+---------+----------+--------------+
|InvoiceNo|StockCode|       Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+
|   580538|    23084|RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+
only showing top 1 row

+----+----+----+
|int1|int2|int3|
+----+----+----+
|   4|   5|   6|
+----+----+----+
only showing top 1 row

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
+-----+----+------+------------------+
only showing top 1 row

+---+--------------+
| id|      features|
+---+--------------+
|  0|[1.0,0.1,-1.0]|
+---+--------------+
only showing 

## 1. Formatting Models for Different Types of Tasks <a id='sec1'></a>

Classification and Regression
- a Double Type column for label
- a Vector[Double] column for features

Recommendation
- a column for user
- a column for items (movies or books)
- a column for rating

Unsupervised Learning
- a Vector (dense/sparse) for feature

Graph Analysis
- a DataFrame of vertices 
- a DataFrame of edges

Two types of feature transformation
- **Transformer**: converts data in a way that does not depend on the input data, e.g. `Tokenizer`
    - all transformers and some preprocessing estimators have `setInputCol()` and `setOutputCol()` methods
    - a transformer has a `transform()` method that performs the transformation
- **Estimator** for preprocessing: converts data in a way that depends on the input data, e.g. scaling and normalization. `StandardScaler`, for example, depends on the values and distribution of the input
    - an estimator needs to be `fit()` on data first; the fitted object then performs the transformation
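
The difference between the two kinds of objects can be sketched in plain Python, without Spark. The class names `ToyTokenizer`, `ToyScaler` and `ToyScalerModel` are made up for this illustration and are not part of `pyspark`; the scaler only divides by the sample standard deviation.

```python
import statistics


class ToyTokenizer:
    """Transformer-style object: the output depends only on the row itself."""

    def transform(self, rows):
        return [row.lower().split(" ") for row in rows]


class ToyScalerModel:
    """Fitted object returned by ToyScaler.fit(); it holds the learned statistic."""

    def __init__(self, std):
        self.std = std

    def transform(self, values):
        return [v / self.std for v in values]


class ToyScaler:
    """Estimator-style object: fit() must see the data before anything is transformed."""

    def fit(self, values):
        return ToyScalerModel(statistics.stdev(values))


# no fit step needed: tokenizing one row never looks at other rows
tokens = ToyTokenizer().transform(["RABBIT NIGHT LIGHT"])

# fit first, then transform with the fitted model
model = ToyScaler().fit([1.0, 2.0, 1.0, 2.0, 3.0])
scaled = model.transform([1.0, 2.0])
```

The same fitted model can be reused on new data, which is why the fit and transform steps are kept separate.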
    
[back to top](#directory)

## 2. pyspark.ml.feature <a id='sec2'></a>

[back to top](#directory)

### 2.1 Common Transformer for Feature Transformation <a id='sec2-1'></a>
- **High Level Transformers**
    - RFormula: good for conventionally formatted data. No need to extract values from strings or manipulate them in any way. RFormula automatically handles categorical inputs by one-hot encoding them. Numeric columns and labels are converted to double.
        - needs a `.fit()` step
    - check the MLlib overview for an example
- **VectorAssembler**: concatenates the features into one big vector. A tool that is used (directly or indirectly) in nearly every pipeline.
    - transform each column first, then combine them using the assembler
- **SQLTransformer**: uses SQL to transform the data
    - the most flexible built-in transformer
    
[back to top](#directory)

### 2.2 Transform Continuous Data  <a id='sec2-2'></a>

Common transformations for continuous data include **scaling** and **categorizing**.

Categorizer:
- **Bucketizer**: splits a continuous feature into buckets.
    - the buckets are specified by a list of split points (including the two outer boundaries)
    - use `float('inf')` or `float('-inf')` if needed
- **QuantileDiscretizer**: uses approximate quantiles to discretize the data into buckets of roughly equal size
    - needs to be fit on data to find the quantiles
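
As a rough illustration of what bucketing by split points means, the sketch below reproduces the bucket ids for the values 0 to 4 with the standard library. The helper `bucketize` is a made-up name, and the sketch only covers finite values; it is not Spark's implementation.

```python
from bisect import bisect_right

# same split points as in the Bucketizer cell of this note
splits = [float("-inf"), 1, 3, 10, 15, 20, float("inf")]


def bucketize(value, splits):
    """Return the index i of the bucket [splits[i], splits[i + 1]) that contains value."""
    return float(bisect_right(splits, value) - 1)


buckets = [bucketize(float(x), splits) for x in range(5)]
print(buckets)
```

Each lower split point belongs to the bucket it opens, so 1.0 falls into bucket 1 and 3.0 into bucket 2.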
 
Scaler:
- **StandardScaler**: standardizes the data.
    - works on a vector column
    - a vector in Spark holds the different features of one row
    - standardization is computed per feature across all rows, not within each vector!!!
    - by default it only scales to unit standard deviation (`withStd=True`) and does not center (`withMean=False`)
- **MinMaxScaler**: rescales each feature to 0 ~ 1 (by default)
- **MaxAbsScaler**: divides each feature by its maximum absolute value, giving -1 ~ 1
- **ElementwiseProduct**: scales each value within a vector by an arbitrary value
    - use `Vectors.dense()` from `pyspark.ml.linalg` to declare the scaling vector
    - `.dense()` should be given floats, and its size must match the feature vector
    - since the scales are arbitrary, there is no `.fit()` step
- **Normalizer**:
    - normalizes each vector by its own L-p norm (not by the norm of a feature across rows!)
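
The per-feature scalers can be checked by hand. The sketch below is plain Python rather than Spark code: it uses the five rows of `scaleDF` printed in this note and computes the statistics column by column. The helper names are made up, and the standard scaling assumes no centering, which is what the printed `StandardScaler` output suggests.

```python
import statistics

# the five feature vectors of scaleDF as printed in this note
rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]

# statistics are learned per feature (column), across all rows
columns = list(zip(*rows))
stds = [statistics.stdev(c) for c in columns]          # sample standard deviation
mins = [min(c) for c in columns]
maxs = [max(c) for c in columns]
max_abs = [max(abs(v) for v in c) for c in columns]


def standard_scale(row):
    # divide by the column standard deviation, no centering
    return [v / s for v, s in zip(row, stds)]


def min_max_scale(row):
    # map the column range [min, max] onto [0, 1]
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(row, mins, maxs)]


def max_abs_scale(row):
    # divide by the largest absolute value of the column
    return [v / m for v, m in zip(row, max_abs)]


print(standard_scale(rows[0]))
print(min_max_scale(rows[1]))
print(max_abs_scale(rows[0]))
```

The values agree with the tables printed by the scaler cells up to floating point rounding.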

[back to top](#directory)

In [8]:
from pyspark.ml.feature import Tokenizer

# tokenizer, can be applied directly on the data!!
tkn = Tokenizer().setInputCol('Description').setOutputCol('Tokenized')
tkn.transform(sales).show(1)

+---------+---------+------------------+--------+-------------------+---------+----------+--------------+--------------------+
|InvoiceNo|StockCode|       Description|Quantity|        InvoiceDate|UnitPrice|CustomerID|       Country|           Tokenized|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+--------------------+
|   580538|    23084|RABBIT NIGHT LIGHT|      48|2011-12-05 08:38:00|     1.79|   14075.0|United Kingdom|[rabbit, night, l...|
+---------+---------+------------------+--------+-------------------+---------+----------+--------------+--------------------+
only showing top 1 row



In [10]:
from pyspark.ml.feature import VectorAssembler

vca = VectorAssembler().setInputCols(['int1', 'int2', 'int3']).setOutputCol('Assembled')
vca.transform(fakeIntDF).show(1)

+----+----+----+-------------+
|int1|int2|int3|    Assembled|
+----+----+----+-------------+
|   4|   5|   6|[4.0,5.0,6.0]|
+----+----+----+-------------+
only showing top 1 row



In [11]:
from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import expr

# convert the integer to double
contDF = spark.range(20)
contDF.show(1)
contDF.printSchema()
contDF = contDF.withColumn('id', expr('cast(id as double)'))
contDF.printSchema()

# bucketize
splits = [float('-inf'), 1, 3, 10, 15, 20, float('inf')]
bucketer = Bucketizer().setInputCol('id').setOutputCol('bucket').setSplits(splits)
bucketer.transform(contDF).show(5)

+---+
| id|
+---+
|  0|
+---+
only showing top 1 row

root
 |-- id: long (nullable = false)

root
 |-- id: double (nullable = false)

+---+------+
| id|bucket|
+---+------+
|0.0|   0.0|
|1.0|   1.0|
|2.0|   1.0|
|3.0|   2.0|
|4.0|   2.0|
+---+------+
only showing top 5 rows



In [12]:
from pyspark.ml.feature import QuantileDiscretizer

# Quantile Discretizer Need to be fit on data to find the quantiles
qdt = QuantileDiscretizer().setInputCol('id').setOutputCol('bucket')\
                            .setNumBuckets(100).setRelativeError(0)
qdt.fit(contDF).transform(contDF).show(2)  # if there are void buckets, will ignore them

+---+------+
| id|bucket|
+---+------+
|0.0|   1.0|
|1.0|   2.0|
+---+------+
only showing top 2 rows



In [13]:
from pyspark.ml.feature import StandardScaler

# scaler need to learn the mean and std from the data
# so there is a fit step!!!!
ss = StandardScaler().setInputCol('features').setOutputCol('standardized')
stds = ss.fit(scaleDF)

# standardization is across rows NOT within each vector!!!
stds.transform(scaleDF).show(20, False)

+---+--------------+------------------------------------------------------------+
|id |features      |standardized                                                |
+---+--------------+------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|0  |[1.0,0.1,-1.0]|[1.1952286093343936,0.02337622911060922,-0.5976143046671968]|
|1  |[2.0,1.1,1.0] |[2.390457218668787,0.2571385202167014,0.5976143046671968]   |
|1  |[3.0,10.1,3.0]|[3.5856858280031805,2.3609991401715313,1.7928429140015902]  |
+---+--------------+------------------------------------------------------------+



In [14]:
from pyspark.ml.feature import MinMaxScaler, MaxAbsScaler, ElementwiseProduct
from pyspark.ml.linalg import Vectors

# min max scale -> 0 ~ 1
mms = MinMaxScaler().setInputCol('features').setOutputCol('scaled')
mms.fit(scaleDF).transform(scaleDF).show(20, False)

# scale -> -1 ~ 1
ma = MaxAbsScaler().setInputCol('features').setOutputCol('scaled')
ma.fit(scaleDF).transform(scaleDF).show(20, False)

# arbitrary scale
# must declare a scaling vector
# must use floats, and the size must match the feature vector
scales = Vectors.dense(1.0, 2.0, 3.0)
eps = ElementwiseProduct().setInputCol('features').setOutputCol('scaled').setScalingVec(scales)
eps.transform(scaleDF).show(20, False)

+---+--------------+-------------+
|id |features      |scaled       |
+---+--------------+-------------+
|0  |[1.0,0.1,-1.0]|[0.0,0.0,0.0]|
|1  |[2.0,1.1,1.0] |[0.5,0.1,0.5]|
|0  |[1.0,0.1,-1.0]|[0.0,0.0,0.0]|
|1  |[2.0,1.1,1.0] |[0.5,0.1,0.5]|
|1  |[3.0,10.1,3.0]|[1.0,1.0,1.0]|
+---+--------------+-------------+

+---+--------------+-------------------------------------------------------------+
|id |features      |scaled                                                       |
+---+--------------+-------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009901,-0.3333333333333333]|
|1  |[2.0,1.1,1.0] |[0.6666666666666666,0.10891089108910892,0.3333333333333333]  |
|0  |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009901,-0.3333333333333333]|
|1  |[2.0,1.1,1.0] |[0.6666666666666666,0.10891089108910892,0.3333333333333333]  |
|1  |[3.0,10.1,3.0]|[1.0,1.0,1.0]                                                |
+---+--------------+

In [16]:
from pyspark.ml.feature import Normalizer

# normalize each vector by the L-p norm
# E.g. l1 norm for first vector 1 + 1 + 0.1 = 2.1  ==> 1/2.1 = 0.476
nm = Normalizer().setInputCol('features').setP(1)
nm.transform(scaleDF).show(30, False)

+---+--------------+---------------------------------------------------------------+
|id |features      |Normalizer_1679abac28bc__output                                |
+---+--------------+---------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.47619047619047616,0.047619047619047616,-0.47619047619047616]|
|1  |[2.0,1.1,1.0] |[0.48780487804878053,0.26829268292682934,0.24390243902439027]  |
|0  |[1.0,0.1,-1.0]|[0.47619047619047616,0.047619047619047616,-0.47619047619047616]|
|1  |[2.0,1.1,1.0] |[0.48780487804878053,0.26829268292682934,0.24390243902439027]  |
|1  |[3.0,10.1,3.0]|[0.18633540372670807,0.6273291925465838,0.18633540372670807]   |
+---+--------------+---------------------------------------------------------------+
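
In contrast to the scalers, the normalizer works within each row. A minimal plain-Python sketch of the L-p normalization (the function name `lp_normalize` is made up) reproduces the first and last rows of the table above:

```python
def lp_normalize(vector, p=1.0):
    """Divide every entry by the L-p norm of the vector itself."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / norm for v in vector]


# L1 norm of the first vector: |1.0| + |0.1| + |-1.0| = 2.1
first = lp_normalize([1.0, 0.1, -1.0], p=1.0)
last = lp_normalize([3.0, 10.1, 3.0], p=1.0)
print(first)
print(last)
```

After L1 normalization the absolute values of each row sum to 1, regardless of what the other rows contain.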



### 2.3 Transform Categorical Data <a id='sec2-3'></a>

The most common task for categorical data is indexing: converting a categorical variable in a column to a numerical one that can be plugged into a machine learning algorithm. Sometimes the input needs to be tokenized before such a transformation.

**Categorical Encoding**

- **StringIndexer**: maps strings to numerical ids
    - works on the whole string, not on each word
        - for each word, use `Tokenizer` first
    - needs to be `.fit()` on data
    - also works on non-string columns
    - **StringIndexer cannot handle unseen values by default (it raises an error).** The other option is to skip rows with unseen values via `setHandleInvalid('skip')`.
    - ids follow a fixed order: by default the most frequent label gets index 0

- **IndexToString**: converts the ids back to strings.
    - no need to specify the string column or the mapping; Spark reads them from the column metadata
    - if you do need to specify the labels, do this via `.setLabels()`

- **VectorIndexer**: automatically detects categorical features stored as numbers inside a vector and converts them to 0-based category indexes.
    - needs to be fit on data
    - need to `setMaxCategories()`
        - E.g. if set to 2, it will detect all features that have 2 or fewer unique values and convert them to 0-based indexes
    - be careful if you have features that are not categorical but have few unique values
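
The indexing idea can be sketched in a few lines of plain Python. This is only an illustration: the function names and the label list are made up, ids are assigned by descending frequency, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Learn a label -> id mapping, most frequent label first."""
    counts = Counter(values)
    # assumption of this sketch: ties are broken alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_string_indexer(values, mapping, handle_invalid="error"):
    """Apply the mapping; unseen labels either raise or are skipped."""
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        else:
            raise ValueError(f"Unseen label: {v}")
    return out


labels = ["bad", "good", "bad", "bad", "good", "ok"]  # made-up data
mapping = fit_string_indexer(labels)
print(mapping)
print(transform_string_indexer(["good", "new"], mapping, handle_invalid="skip"))
```

Because the mapping is learned from the data, a label that only shows up at transform time has no id, which is why a policy for unseen values is needed.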
    
**One-Hot Encoding**
- **OneHotEncoder**: one-hot encoding does not introduce an artificial numerical ordering between categories
    - **OneHotEncoder only works on numerical indexes, so the data usually needs to be converted with StringIndexer first, then one-hot encoded**
    - in the Spark 2.x API used here, OneHotEncoder is a plain transformer and needs no fit step; since Spark 3.0 it is an estimator and must be `.fit()` first
    - Spark uses sparse vectors for the one-hot output
    - by default it drops the last category
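
To make the sparse output and the "drop last" behaviour concrete, here is a small plain-Python sketch. The function `one_hot` is a made-up helper that returns a `(size, indices, values)` triple in the same layout as the sparse vectors printed by Spark.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a sparse (size, indices, values) triple for one category index."""
    size = num_categories - 1 if drop_last else num_categories
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    # the dropped last category is represented by the all-zero vector
    return (size, [], [])


print(one_hot(1.0, 3, drop_last=False))
print(one_hot(1.0, 3))
print(one_hot(2.0, 3))
```

Dropping the last category keeps the encoded columns from always summing to one, at the cost of one category being encoded implicitly as all zeros.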
    
[back to top](#directory)

In [67]:
from pyspark.ml.feature import StringIndexer

# automatically encode the string into numerical ids
stdIdxer = StringIndexer().setInputCol('lab').setOutputCol('labelID')
encoded = stdIdxer.fit(simpleDF).transform(simpleDF)
encoded.show(2)

# also works on a numerical column
stdIdxer = StringIndexer().setInputCol('value1').setOutputCol('labelID')
stdIdxer.fit(simpleDF).transform(simpleDF).show(5)


# to skip rows with unseen values instead of raising an error
stdIdxer = stdIdxer.setHandleInvalid('skip')

+-----+----+------+------------------+-------+
|color| lab|value1|            value2|labelID|
+-----+----+------+------------------+-------+
|green|good|     1|14.386294994851129|    1.0|
| blue| bad|     8|14.386294994851129|    0.0|
+-----+----+------+------------------+-------+
only showing top 2 rows

+-----+----+------+------------------+-------+
|color| lab|value1|            value2|labelID|
+-----+----+------+------------------+-------+
|green|good|     1|14.386294994851129|    2.0|
| blue| bad|     8|14.386294994851129|    4.0|
| blue| bad|    12|14.386294994851129|    0.0|
|green|good|    15| 38.97187133755819|    5.0|
|green|good|    12|14.386294994851129|    0.0|
+-----+----+------+------------------+-------+
only showing top 5 rows



In [71]:
from pyspark.ml.feature import IndexToString

# convert back; the original mapping is handled by Spark
idx2str = IndexToString().setInputCol('labelID')
idx2str.transform(encoded).show(2)

# convert back using custom labels supplied via setLabels()
idx2str = IndexToString().setInputCol('labelID').setLabels(['Terrible', 'Great'])
idx2str.transform(encoded).show(2)

+-----+----+------+------------------+-------+----------------------------------+
|color| lab|value1|            value2|labelID|IndexToString_8c624bf87972__output|
+-----+----+------+------------------+-------+----------------------------------+
|green|good|     1|14.386294994851129|    1.0|                              good|
| blue| bad|     8|14.386294994851129|    0.0|                               bad|
+-----+----+------+------------------+-------+----------------------------------+
only showing top 2 rows

+-----+----+------+------------------+-------+----------------------------------+
|color| lab|value1|            value2|labelID|IndexToString_46daa60f5f9e__output|
+-----+----+------+------------------+-------+----------------------------------+
|green|good|     1|14.386294994851129|    1.0|                             Great|
| blue| bad|     8|14.386294994851129|    0.0|                          Terrible|
+-----+----+------+------------------+-------+---------------------------

In [75]:
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

sample = spark.createDataFrame([([Vectors.dense([1, 2, 3]), 1]),
                               ([Vectors.dense([2, 5, 6]), 2]),
                               ([Vectors.dense([1, 8, 9]), 3])], ['features', 'label'])
sample.show()

# set up the vector indexer and max cats
vidxerA = VectorIndexer().setInputCol('features').setMaxCategories(2)  # 2 cats 
vidxerB = VectorIndexer().setInputCol('features').setMaxCategories(3)  # 3 cats

+-------------+-----+
|     features|label|
+-------------+-----+
|[1.0,2.0,3.0]|    1|
|[2.0,5.0,6.0]|    2|
|[1.0,8.0,9.0]|    3|
+-------------+-----+



In [77]:
# A: max categories = 2, so only the first feature is converted
vidxerA.fit(sample).transform(sample).show()

# B: max categories = 3, so all features are converted to 0-based indexes
vidxerB.fit(sample).transform(sample).show()

+-------------+-----+----------------------------------+
|     features|label|VectorIndexer_4f7942a7eb42__output|
+-------------+-----+----------------------------------+
|[1.0,2.0,3.0]|    1|                     [0.0,2.0,3.0]|
|[2.0,5.0,6.0]|    2|                     [1.0,5.0,6.0]|
|[1.0,8.0,9.0]|    3|                     [0.0,8.0,9.0]|
+-------------+-----+----------------------------------+

+-------------+-----+----------------------------------+
|     features|label|VectorIndexer_6daa1678ddbc__output|
+-------------+-----+----------------------------------+
|[1.0,2.0,3.0]|    1|                     [0.0,0.0,0.0]|
|[2.0,5.0,6.0]|    2|                     [1.0,1.0,1.0]|
|[1.0,8.0,9.0]|    3|                     [0.0,2.0,2.0]|
+-------------+-----+----------------------------------+
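
The two tables above can be reproduced with a short plain-Python sketch. It assumes that a feature counts as categorical when it has at most `max_categories` distinct values and that category values are indexed in ascending order; the function names are made up and this is not Spark's implementation.

```python
def fit_vector_indexer(rows, max_categories):
    """Find the features with few distinct values and learn value -> index maps."""
    maps = {}
    for j, column in enumerate(zip(*rows)):
        distinct = sorted(set(column))
        if len(distinct) <= max_categories:
            maps[j] = {value: float(i) for i, value in enumerate(distinct)}
    return maps


def transform_vector_indexer(rows, maps):
    """Replace values of categorical features by their index, keep the rest."""
    return [[maps[j][v] if j in maps else v for j, v in enumerate(row)] for row in rows]


sample_rows = [[1.0, 2.0, 3.0], [2.0, 5.0, 6.0], [1.0, 8.0, 9.0]]
indexed_a = transform_vector_indexer(sample_rows, fit_vector_indexer(sample_rows, 2))
indexed_b = transform_vector_indexer(sample_rows, fit_vector_indexer(sample_rows, 3))
print(indexed_a)
print(indexed_b)
```

With a limit of 3 every feature of this tiny sample is treated as categorical, which shows why the limit has to be chosen with the real data in mind.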



In [85]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# label encoding first
strIdxer = StringIndexer().setInputCol('color').setOutputCol('colorId')
encoded = strIdxer.fit(simpleDF).transform(simpleDF)

# one-hot encoding on the label encoded data
# one-hot only works on numerical data
# must transform to numerical indexes first
onehot = OneHotEncoder().setInputCol('colorId').setOutputCol('onehot').setDropLast(False)
onehot.transform(encoded).show(5)

+-----+----+------+------------------+-------+-------------+
|color| lab|value1|            value2|colorId|       onehot|
+-----+----+------+------------------+-------+-------------+
|green|good|     1|14.386294994851129|    1.0|(3,[1],[1.0])|
| blue| bad|     8|14.386294994851129|    2.0|(3,[2],[1.0])|
| blue| bad|    12|14.386294994851129|    2.0|(3,[2],[1.0])|
|green|good|    15| 38.97187133755819|    1.0|(3,[1],[1.0])|
|green|good|    12|14.386294994851129|    1.0|(3,[1],[1.0])|
+-----+----+------+------------------+-------+-------------+
only showing top 5 rows



### 2.4 Transform Text Data <a id='sec2-4'></a>

**Tokenizer**

- takes a string of words separated by whitespace and converts it into an **array** of lowercase words

**RegexTokenizer**
- uses a regular expression; by default the pattern marks the gaps between tokens, and with `setGaps(False)` only the substrings that match the pattern are put into the array

**StopWordsRemover**
- removes common words
- supports different languages via `loadDefaultStopWords()`

**N Gram**
- for each document (row), combines every N consecutive words into one term
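
The tokenizing steps can be imitated with the standard library to see what each one produces. This is a simplified sketch with made-up function names: `tokenize` uses `str.split()`, which collapses repeated whitespace, so edge cases may differ from Spark.

```python
import re


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """gaps=True: the pattern is the delimiter; gaps=False: the pattern is the token."""
    lowered = text.lower()
    tokens = re.split(pattern, lowered) if gaps else re.findall(pattern, lowered)
    return [t for t in tokens if t]  # drop empty tokens


def ngrams(tokens, n):
    """Join every n consecutive tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = tokenize("RABBIT NIGHT LIGHT")
print(words)
print(regex_tokenize("RABBIT NIGHT LIGHT", "i", gaps=False))
print(ngrams(words, 2))
```

A row with fewer than `n` tokens yields an empty list of n-grams in this sketch.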

**CountVectorizer**
- treats each row as a document, every word as a term, and the total collection of terms as the vocabulary.
- `fit()` scans the words in all documents and counts them to build the vocabulary
- `transform()` outputs a sparse vector with the counts of the terms that occur in that row
    - sparse vector `(vocab_size, [term1_idx, term2_idx, ...], [term1_count, ...])`
- many options
     - `minTF`: minimum number of times a term must appear in a document to be counted for that document
     - `minDF`: minimum number of documents a term must appear in to be included in the vocabulary
     - `vocabSize`: maximum vocabulary size
     - `setBinary(True)` to record only whether a word exists in a document; `False` to output the frequency of the term
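
A toy version of the fit and transform steps, written in plain Python with made-up function names and made-up documents, may help in reading the sparse output. The vocabulary is ordered by corpus frequency; the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter


def fit_vocabulary(docs, min_df=1, vocab_size=None):
    """Scan all documents and build the vocabulary, most frequent term first."""
    term_counts = Counter()
    doc_counts = Counter()
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))  # each document counts once per term
    kept = [t for t in term_counts if doc_counts[t] >= min_df]
    kept.sort(key=lambda t: (-term_counts[t], t))  # tie-break is an assumption
    return kept[:vocab_size]


def count_vectorize(doc, vocabulary, binary=False):
    """Return a sparse (vocab_size, indices, counts) triple for one document."""
    index = {t: i for i, t in enumerate(vocabulary)}
    counts = Counter(t for t in doc if t in index)
    indices = sorted(index[t] for t in counts)
    values = [1.0 if binary else float(counts[vocabulary[i]]) for i in indices]
    return (len(vocabulary), indices, values)


docs = [["red", "red", "heart"], ["red", "clock"], ["heart", "red"]]  # made-up data
vocabulary = fit_vocabulary(docs, min_df=2)
print(vocabulary)
print(count_vectorize(docs[0], vocabulary))
print(count_vectorize(docs[0], vocabulary, binary=True))
```

Terms that fall outside the vocabulary, such as `clock` here, simply do not appear in the output vector.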
     
**HashingTF & IDF - TF-IDF**
- `HashingTF` is similar to `CountVectorizer` but is irreversible. It does NOT have `fit()`: instead of building a vocabulary, it hashes each term to an index, so the index cannot be mapped back to a term. It provides the term frequency info.
- `IDF` provides the inverse document frequency info.
    - `IDF` requires a `fit()` step because it has to count, over all documents, how many documents contain each term
- TF-IDF is also represented by a sparse vector.
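
The sketch below walks through the same two steps in plain Python. It is an illustration under stated assumptions: `zlib.crc32` stands in for the hash function (Spark uses a different one), the number of buckets is tiny on purpose, the documents are made up, and the weighting is `log((m + 1) / (df + 1))` with `m` the number of documents.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; HashingTF defaults to 262144 (2**18) buckets


def bucket(term, num_features=NUM_FEATURES):
    """Hash a term to a bucket index (crc32 is a stand-in for Spark's hash)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Term frequencies keyed by bucket index; no vocabulary, so no fit step."""
    counts = Counter(bucket(t, num_features) for t in tokens)
    return {i: float(c) for i, c in counts.items()}


def fit_idf(tf_docs):
    """Count in how many documents each index occurs, then weight it."""
    m = len(tf_docs)
    df = Counter()
    for tf in tf_docs:
        df.update(tf.keys())
    return {i: math.log((m + 1) / (d + 1)) for i, d in df.items()}


def tf_idf(tf, idf):
    return {i: c * idf[i] for i, c in tf.items()}


docs = [["red", "heart"], ["red", "clock"]]  # made-up data
tfs = [hashing_tf(doc) for doc in docs]
idf = fit_idf(tfs)
weights = [tf_idf(tf, idf) for tf in tfs]
print(weights)
```

A term that appears in every document, like `red` here, gets weight zero, while rarer terms keep a positive weight.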

**Word2Vec**
- vectorizes words based on the contexts they appear in, so that semantically related words get similar vectors.
- works best when the input is free-form text in the form of tokens
- the vector size is specified when constructing `Word2Vec`
- needs a fit step; many other configurations are supported
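
When a fitted Word2Vec model transforms a row, the output is a single vector per document, obtained by averaging the vectors of its words. The plain-Python sketch below shows only that averaging step; the embeddings are made-up numbers, and skipping unknown tokens while still dividing by the token count is a choice of this sketch.

```python
def average_vectors(tokens, embeddings, size):
    """Average the word vectors of one document into a single vector."""
    total = [0.0] * size
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # unknown tokens contribute nothing in this sketch
        total = [a + b for a, b in zip(total, vector)]
    return [v / len(tokens) for v in total]


# hypothetical 2-dimensional embeddings, not learned from any data
toy_embeddings = {"spark": [1.0, 0.0], "java": [0.0, 1.0]}
doc_vector = average_vectors(["spark", "java"], toy_embeddings, size=2)
print(doc_vector)
```

This is why the output column has one vector of length `vectorSize` per row, no matter how many tokens the row contains.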

[back to top](#directory)

In [21]:
descDF = sales.select('InvoiceNo', 'Description')
descDF.show(2)

+---------+-------------------+
|InvoiceNo|        Description|
+---------+-------------------+
|   580538| RABBIT NIGHT LIGHT|
|   580538|DOUGHNUT LIP GLOSS |
+---------+-------------------+
only showing top 2 rows



In [27]:
from pyspark.ml.feature import Tokenizer

# it will convert everything to lower case!
tokenizer = Tokenizer().setInputCol('Description').setOutputCol('Tokens')
tokenizer.transform(descDF).show(2)

+---------+-------------------+--------------------+
|InvoiceNo|        Description|              Tokens|
+---------+-------------------+--------------------+
|   580538| RABBIT NIGHT LIGHT|[rabbit, night, l...|
|   580538|DOUGHNUT LIP GLOSS |[doughnut, lip, g...|
+---------+-------------------+--------------------+
only showing top 2 rows



In [50]:
from pyspark.ml.feature import RegexTokenizer

# with default settings this behaves like Tokenizer
tokenizer = RegexTokenizer().setInputCol('Description').setOutputCol('Tokens')
tokenizer.transform(descDF).show(2)

# use setPattern to set up the delimiter (the default pattern is '\s+')
# an empty pattern splits the text into single characters
tokenizer = RegexTokenizer().setInputCol('Description').setOutputCol('Tokens').setPattern('')
tokenizer.transform(descDF).show(2)

# with setGaps(False)
# the pattern is used to extract the substrings that match it
tokenizer = RegexTokenizer().setInputCol('Description').setOutputCol('Token')\
                            .setGaps(False).setPattern('i')
tokenizer.transform(descDF).show(2)

+---------+-------------------+--------------------+
|InvoiceNo|        Description|              Tokens|
+---------+-------------------+--------------------+
|   580538| RABBIT NIGHT LIGHT|[rabbit, night, l...|
|   580538|DOUGHNUT LIP GLOSS |[doughnut, lip, g...|
+---------+-------------------+--------------------+
only showing top 2 rows

+---------+-------------------+--------------------+
|InvoiceNo|        Description|              Tokens|
+---------+-------------------+--------------------+
|   580538| RABBIT NIGHT LIGHT|[r, a, b, b, i, t...|
|   580538|DOUGHNUT LIP GLOSS |[d, o, u, g, h, n...|
+---------+-------------------+--------------------+
only showing top 2 rows

+---------+-------------------+---------+
|InvoiceNo|        Description|    Token|
+---------+-------------------+---------+
|   580538| RABBIT NIGHT LIGHT|[i, i, i]|
|   580538|DOUGHNUT LIP GLOSS |      [i]|
+---------+-------------------+---------+
only showing top 2 rows



In [61]:
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Tokenizer

# prepare the remover: put stop words in list -> set it up
stopwords = StopWordsRemover.loadDefaultStopWords('english')
print('stopwords:', stopwords[:5])  # append anything else to the list 
swr = StopWordsRemover().setInputCol('Tokens').setStopWords(stopwords)

# stop word remover can only work on tokenized column (array of words)
tokenized = Tokenizer().setInputCol('Description').setOutputCol('Tokens').transform(descDF)
swr.transform(tokenized).show(3)

stopwords: ['i', 'me', 'my', 'myself', 'we']
+---------+--------------------+--------------------+-------------------------------------+
|InvoiceNo|         Description|              Tokens|StopWordsRemover_1f4efa701d1b__output|
+---------+--------------------+--------------------+-------------------------------------+
|   580538|  RABBIT NIGHT LIGHT|[rabbit, night, l...|                 [rabbit, night, l...|
|   580538| DOUGHNUT LIP GLOSS |[doughnut, lip, g...|                 [doughnut, lip, g...|
|   580538|12 MESSAGE CARDS ...|[12, message, car...|                 [12, message, car...|
+---------+--------------------+--------------------+-------------------------------------+
only showing top 3 rows



In [68]:
from pyspark.ml.feature import NGram
from pyspark.ml.feature import Tokenizer

# stop word remover can only work on tokenized column (array of words)
tokenized = Tokenizer().setInputCol('Description').setOutputCol('Tokens').transform(descDF)
swr.transform(tokenized).show(3)

# ngram only work with tokenized column
ngramer = NGram().setInputCol('Tokens').setOutputCol('ngram').setN(2)
ngramer.transform(tokenized).select('Tokens', 'ngram').show(3, False)

+---------+--------------------+--------------------+-------------------------------------+
|InvoiceNo|         Description|              Tokens|StopWordsRemover_1f4efa701d1b__output|
+---------+--------------------+--------------------+-------------------------------------+
|   580538|  RABBIT NIGHT LIGHT|[rabbit, night, l...|                 [rabbit, night, l...|
|   580538| DOUGHNUT LIP GLOSS |[doughnut, lip, g...|                 [doughnut, lip, g...|
|   580538|12 MESSAGE CARDS ...|[12, message, car...|                 [12, message, car...|
+---------+--------------------+--------------------+-------------------------------------+
only showing top 3 rows

+-------------------------------------+-------------------------------------------------------+
|Tokens                               |ngram                                                  |
+-------------------------------------+-------------------------------------------------------+
|[rabbit, night, light]               |[rab

In [75]:
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import Tokenizer

# count vectorizer only work on tokenized column (array of words)
tokenized = Tokenizer().setInputCol('Description').setOutputCol('Tokens').transform(descDF)
swr.transform(tokenized).show(3)

# count the token column, need a fit step
# support many settings such as min/maxDF, binary (only return whether term showed up), etc
# in sparse vector, the second items is the term idx
# count vectorizer will scan all data and build a vocabulary so each term will have a unique ID
counter = CountVectorizer().setInputCol('Tokens').setOutputCol('Counts').setMinDF(5)
counter.fit(tokenized).transform(tokenized).show(3, False)  # result in sparse vector


+---------+--------------------+--------------------+-------------------------------------+
|InvoiceNo|         Description|              Tokens|StopWordsRemover_1f4efa701d1b__output|
+---------+--------------------+--------------------+-------------------------------------+
|   580538|  RABBIT NIGHT LIGHT|[rabbit, night, l...|                 [rabbit, night, l...|
|   580538| DOUGHNUT LIP GLOSS |[doughnut, lip, g...|                 [doughnut, lip, g...|
|   580538|12 MESSAGE CARDS ...|[12, message, car...|                 [12, message, car...|
+---------+--------------------+--------------------+-------------------------------------+
only showing top 3 rows

+---------+-------------------------------+-------------------------------------+------------------------------------------------+
|InvoiceNo|Description                    |Tokens                               |Counts                                          |
+---------+-------------------------------+--------------------------

In [109]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import array_contains, col, round

# only work on tokenized column (array of words)
tokenized = Tokenizer().setInputCol('Description').setOutputCol('Tokens').transform(descDF)

# only keep the rows that contains 'red'
tokenized = tokenized.where(array_contains(col('Tokens'), 'red'))
swr.transform(tokenized).show(3)

# construct a HashingTF: similar to CountVectorizer, but it does not build a vocabulary map
# each term is hashed to an index, so the same term always gets the same index but different terms can collide
# e.g. the size 262144 in the hashed sparse vector is the default number of hash buckets (2^18), not the number of distinct words
hasher = HashingTF().setInputCol('Tokens').setOutputCol('Hashed')
hashed = hasher.transform(tokenized)
hashed.show(3)

# the idf column is a sparse vector: number of hash buckets, hashed indexes, TF-IDF weights
idfer = IDF().setInputCol('Hashed').setOutputCol('idf')
idfDF = idfer.fit(hashed).transform(hashed)
idfDF.show(1, False)

+---------+--------------------+--------------------+-------------------------------------+
|InvoiceNo|         Description|              Tokens|StopWordsRemover_1f4efa701d1b__output|
+---------+--------------------+--------------------+-------------------------------------+
|   580539|GINGHAM HEART  DO...|[gingham, heart, ...|                 [gingham, heart, ...|
|   580541|RED FLORAL FELTCR...|[red, floral, fel...|                 [red, floral, fel...|
|   580543|ALARM CLOCK BAKEL...|[alarm, clock, ba...|                 [alarm, clock, ba...|
+---------+--------------------+--------------------+-------------------------------------+
only showing top 3 rows

+---------+--------------------+--------------------+--------------------+
|InvoiceNo|         Description|              Tokens|              Hashed|
+---------+--------------------+--------------------+--------------------+
|   580539|GINGHAM HEART  DO...|[gingham, heart, ...|(262144,[47362,10...|
|   580541|RED FLORAL FELTCR...

In [118]:
from pyspark.ml.feature import Word2Vec

docDF = spark.createDataFrame([
    ('Hi, I heard about spark'.split(' '),),
    ('I wish java could use case classes'.split(' '),),
    ('Logistic regression models are neat'.split(' '),)
], ['text'])  # must include the , for each row!!!

docDF.show(5, False)
docDF.printSchema()

+------------------------------------------+
|text                                      |
+------------------------------------------+
|[Hi,, I, heard, about, spark]             |
|[I, wish, java, could, use, case, classes]|
|[Logistic, regression, models, are, neat] |
+------------------------------------------+

root
 |-- text: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [126]:
w2v = Word2Vec(vectorSize=3, minCount=0).setInputCol('text').setOutputCol('vector')
w2v.fit(docDF).transform(docDF).show(3, False)

+------------------------------------------+-----------------------------------------------------------------+
|text                                      |vector                                                           |
+------------------------------------------+-----------------------------------------------------------------+
|[Hi,, I, heard, about, spark]             |[-0.07255996316671372,0.029591199429705742,-0.0400250225327909]  |
|[I, wish, java, could, use, case, classes]|[0.041278247854539325,-0.05718414538672992,0.01688022825068661]  |
|[Logistic, regression, models, are, neat] |[-0.001246955245733261,0.09623391311615706,-0.009747498854994775]|
+------------------------------------------+-----------------------------------------------------------------+



### 2.5 Feature Manipulation and Selection <a id='sec2-5'></a>

**PCA**
- simply specifying the features column and K components
- fit and transform

**Interaction**
- Using `RFormula` is the easiest way and is supported by PySpark
- `Interaction` also allows the user to create interactions manually (**in Spark 2.2 only available for Scala**)

**PolynomialExpansion**
- feature expansion
- set up the features column and the degree
- no fit stage
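
For degree 2 the expansion can be written out by hand. The plain-Python sketch below (the function name is made up) emits, for each feature, the feature itself followed by its products with every feature up to and including itself; this ordering was chosen to match the `PolynomialExpansion` output printed in this note and is not taken from Spark's source.

```python
def expand_degree_2(vector):
    """Degree-2 polynomial expansion of one feature vector."""
    expanded = []
    for i, x_i in enumerate(vector):
        expanded.append(x_i)            # linear term
        for x_j in vector[: i + 1]:
            expanded.append(x_j * x_i)  # cross terms, then the square
    return expanded


print(expand_degree_2([1.0, 0.1, -1.0]))
print(expand_degree_2([2.0, 1.1, 1.0]))
```

Three input features become nine output features, so the expansion grows quickly with the number of features and the degree.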

**ChiSqSelector**
- https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
- uses a statistical test (the chi-square test) to find features that are not independent of the label, and drops the features that appear independent of it
- selection methods
    - `numTopFeatures`: select the top n features, ranked by p-value
    - `percentile`: select a proportion of the input features
    - `fpr`: select the features whose p-value falls below a chosen cut-off
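
To make the test statistic concrete, here is a plain-Python sketch of the Pearson chi-square statistic for one feature against the label. It is not Spark's implementation: the function name `chi_square_stat` and the toy counts are made up for illustration, and it assumes every row and column of the table has a nonzero total. Rows are feature values, columns are label classes.

```python
def chi_square_stat(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # count expected in this cell if feature and label were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# hypothetical counts: the feature value shifts the label distribution
dependent = chi_square_stat([[10, 20], [20, 10]])
# hypothetical counts: the label distribution is the same for both feature values
independent = chi_square_stat([[15, 15], [15, 15]])
```

A larger statistic corresponds to a smaller p-value, i.e. stronger evidence that the feature and the label are not independent. The sketch stops at the statistic and does not compute the p-value used for ranking.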

[back to top](#directory)

In [139]:
from pyspark.ml.feature import PCA

# PCA usually works on the features column (the column that combines all features together)
pcaer = PCA().setInputCol('features').setOutputCol('PCs').setK(2)  # k should be smaller than the input dimension (3 here)
reduced = pcaer.fit(scaleDF).transform(scaleDF)
reduced.printSchema()
reduced.show(5, False)

root
 |-- id: integer (nullable = true)
 |-- features: vector (nullable = true)
 |-- PCs: vector (nullable = true)

+---+--------------+------------------------------------------+
|id |features      |PCs                                       |
+---+--------------+------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.07137194992484153,-0.45266548881478463]|
|1  |[2.0,1.1,1.0] |[-1.6804946984073725,1.2593401322219144]  |
|0  |[1.0,0.1,-1.0]|[0.07137194992484153,-0.45266548881478463]|
|1  |[2.0,1.1,1.0] |[-1.6804946984073725,1.2593401322219144]  |
|1  |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060149758]|
+---+--------------+------------------------------------------+



In [142]:
from pyspark.ml.feature import PolynomialExpansion

expansioner = PolynomialExpansion().setInputCol('features').setOutputCol('expanded').setDegree(2)
expansioner.transform(scaleDF).select('expanded').show(10, False)

+-----------------------------------------------------------------------------------+
|expanded                                                                           |
+-----------------------------------------------------------------------------------+
|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0]                          |
|[2.0,4.0,1.1,2.2,1.2100000000000002,1.0,2.0,1.1,1.0]                               |
|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0]                          |
|[2.0,4.0,1.1,2.2,1.2100000000000002,1.0,2.0,1.1,1.0]                               |
|[3.0,9.0,10.1,30.299999999999997,102.00999999999999,3.0,9.0,30.299999999999997,9.0]|
+-----------------------------------------------------------------------------------+
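
The order of the expanded terms in the output above can be reproduced with a small plain-Python sketch. This is only an illustration for degree 2, not Spark's implementation, and `expand_degree2` is a hypothetical helper name.

```python
def expand_degree2(v):
    """Return the degree-2 polynomial terms of v, in the order seen in the output above."""
    out = []
    for i, vi in enumerate(v):
        out.append(vi)                # linear term x_i
        for j in range(i + 1):
            out.append(v[j] * vi)     # products x_j * x_i for every j <= i
    return out

# second row of scaleDF's features column
expanded = expand_degree2([2.0, 1.1, 1.0])
```

For three input features this yields 3 linear terms plus 6 second-order terms, which matches the 9 entries per row in the `expanded` column.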



In [7]:
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.feature import Tokenizer, CountVectorizer

# tokenize input and fit a counter to vectorize the input
tokenizer = Tokenizer().setInputCol('Description').setOutputCol('tokens')
tokenized = tokenizer.transform(sales.limit(100)).select('tokens', 'CustomerID').where('CustomerID is not null')
tokenized.show(1)
cv = CountVectorizer().setInputCol('tokens').setOutputCol('vec')
cvDF = cv.fit(tokenized).transform(tokenized)
cvDF.show(1, False)

# set up the chi square selector and fit, transform it
# this takes a long time
selector = ChiSqSelector().setFeaturesCol('vec').setLabelCol('CustomerID').setNumTopFeatures(50)
selector.fit(cvDF).transform(cvDF).show(10, False)


+--------------------+----------+
|              tokens|CustomerID|
+--------------------+----------+
|[rabbit, night, l...|   14075.0|
+--------------------+----------+
only showing top 1 row

+----------------------+----------+-------------------------------+
|tokens                |CustomerID|vec                            |
+----------------------+----------+-------------------------------+
|[rabbit, night, light]|14075.0   |(233,[70,83,155],[1.0,1.0,1.0])|
+----------------------+----------+-------------------------------+
only showing top 1 row

+-------------------------------------+----------+--------------------------------------------+----------------------------------+
|tokens                               |CustomerID|vec                                         |ChiSqSelector_c4efd7311477__output|
+-------------------------------------+----------+--------------------------------------------+----------------------------------+
|[rabbit, night, light]               |14075.0   

## 3. Advanced Topics <a id='sec3'></a>

### 3.1 Persisting Transformer <a id='sec3-1'></a>

Some estimators take a long time to fit, e.g. PCA. The fitted model can be written to disk and loaded when needed, which avoids spending the fitting time again.

### 3.2 Writing a Custom Transformer <a id='sec3-2'></a>
 
Most of the time, the `SQLTransformer` provides enough flexibility to transform the data. Sometimes it is necessary to wrap some built-in methods and implement the transformer from scratch, but a transformer written in Python adds a lot of overhead.

[back to top](#directory)