# Transformers
At a high level, every class derived from the `Transformer` abstract class must implement a `.transform(...)` method. Most transformers share two parameters:

- `inputCol`: the input feature column, `"features"` by default
- `outputCol`: the column that receives the transformed output

In [1]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('feature test').setMaster('local[4]')
spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .getOrCreate()

In [2]:
import pyspark.ml.feature as ft

## Binarizer

Given a threshold, the method takes a continuous variable and 
transforms it into a binary one.

In [3]:
df = spark.createDataFrame([(0.511,), (0.6232,), (0.4323,), (0.9434,), (0.3213,)],
                           ["values"])
df.show()

+------+
|values|
+------+
| 0.511|
|0.6232|
|0.4323|
|0.9434|
|0.3213|
+------+



In [4]:
binarizer = ft.Binarizer(threshold=0.5, inputCol="values", outputCol='features')
binarizer.transform(df).show()

+------+--------+
|values|features|
+------+--------+
| 0.511|     1.0|
|0.6232|     1.0|
|0.4323|     0.0|
|0.9434|     1.0|
|0.3213|     0.0|
+------+--------+
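
The thresholding rule can be sketched in plain Python, without Spark. The helper name `binarize` is hypothetical; the sketch assumes that values strictly greater than the threshold map to 1.0 and all others to 0.0, which matches the output above.

```python
def binarize(values, threshold=0.5):
    """Map each value to 1.0 if it is strictly greater than the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


# Same values as the DataFrame above.
binarized = binarize([0.511, 0.6232, 0.4323, 0.9434, 0.3213], threshold=0.5)
print(binarized)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```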



## Bucketizer
Similar to the Binarizer, this method takes a list of thresholds 
(the splits parameter) and transforms a continuous variable into a 
multinomial one.

In [5]:
values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
df = spark.createDataFrame(values, ['values'])
df.show()

+------+
|values|
+------+
|   0.1|
|   0.4|
|   1.2|
|   1.5|
|   NaN|
|   NaN|
+------+



In [7]:
bucketizer = ft.Bucketizer(
    splits=[-float("inf"), 0.5, 1.4, float("inf")],
    inputCol='values', outputCol='features'
)
# Keep invalid values (NaN); they are placed in an extra bucket

bucketed = bucketizer.setHandleInvalid("keep").transform(df)
bucketed.show()

+------+--------+
|values|features|
+------+--------+
|   0.1|     0.0|
|   0.4|     0.0|
|   1.2|     1.0|
|   1.5|     2.0|
|   NaN|     3.0|
|   NaN|     3.0|
+------+--------+



In [8]:
bucketizer.setParams(outputCol="b").transform(df).show()

+------+---+
|values|  b|
+------+---+
|   0.1|0.0|
|   0.4|0.0|
|   1.2|1.0|
|   1.5|2.0|
|   NaN|3.0|
|   NaN|3.0|
+------+---+
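
As a rough illustration of how the splits define the buckets, here is a plain-Python sketch. The helper `bucketize` is hypothetical. It assumes each bucket is the half-open interval `[splits[i], splits[i+1])`, that the last bucket also includes its upper bound, and that with `handleInvalid="keep"` NaN goes into one extra bucket.

```python
import math
from bisect import bisect_right


def bucketize(value, splits, keep_invalid=True):
    """Return the bucket index of `value` for sorted `splits` (hypothetical helper)."""
    n_buckets = len(splits) - 1
    if math.isnan(value):
        if keep_invalid:
            return float(n_buckets)  # extra bucket reserved for NaN
        raise ValueError("NaN encountered and keep_invalid is False")
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value lies outside the range covered by splits")
    # bisect_right counts the splits that are <= value.
    index = bisect_right(splits, value) - 1
    # The last bucket also includes its upper bound.
    return float(min(index, n_buckets - 1))


splits = [-float("inf"), 0.5, 1.4, float("inf")]
values = [0.1, 0.4, 1.2, 1.5, float("nan"), float("nan")]
bucketed_values = [bucketize(v, splits) for v in values]
print(bucketed_values)  # [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
```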



## ChiSqSelector
Performs feature selection using the chi-squared ($\chi^2$) test.

Reference: https://blog.csdn.net/sinat_33761963/article/details/54910955

In [9]:
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.),],
    ["features", 'label'])
df.show()

+------------------+-----+
|          features|label|
+------------------+-----+
|[0.0,0.0,18.0,1.0]|  1.0|
|[0.0,1.0,12.0,0.0]|  0.0|
|[1.0,0.0,15.0,0.1]|  0.0|
+------------------+-----+



In [10]:
# Select the single best feature
selector = ft.ChiSqSelector(numTopFeatures=1, outputCol='selectedFeature')
model = selector.fit(df)
model.transform(df).show()

+------------------+-----+---------------+
|          features|label|selectedFeature|
+------------------+-----+---------------+
|[0.0,0.0,18.0,1.0]|  1.0|         [18.0]|
|[0.0,1.0,12.0,0.0]|  0.0|         [12.0]|
|[1.0,0.0,15.0,0.1]|  0.0|         [15.0]|
+------------------+-----+---------------+



In [11]:
model.selectedFeatures

[2]
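
To show what the test measures, the sketch below computes the Pearson $\chi^2$ statistic of each feature against the label in plain Python. It is only an illustration under stated assumptions: every distinct feature value is treated as a category, the helper `chi2_statistic` is hypothetical, and the ranking by p-value that the selector performs is not reproduced.

```python
from collections import Counter


def chi2_statistic(feature_values, labels):
    """Pearson chi-squared statistic of the feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    value_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_total in value_totals.items():
        for label, label_total in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = value_total * label_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


rows = [[0.0, 0.0, 18.0, 1.0], [0.0, 1.0, 12.0, 0.0], [1.0, 0.0, 15.0, 0.1]]
labels = [1.0, 0.0, 0.0]
stats = [chi2_statistic([row[j] for row in rows], labels) for j in range(4)]
print(stats)
```

On this tiny dataset features 2 and 3 get the same, largest statistic, so the statistic alone does not separate them; the notebook output shows that feature 2 was selected.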

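## CountVectorizer
Processes tokenized text: learns a vocabulary from the token lists and converts each list into a vector of token counts.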

In [12]:
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
    ["label", 'raw']
)
df.show()

+-----+---------------+
|label|            raw|
+-----+---------------+
|    0|      [a, b, c]|
|    1|[a, b, b, c, a]|
+-----+---------------+



In [13]:
cv = ft.CountVectorizer(minTF=1., minDF=1., 
                        inputCol='raw', outputCol='vectors')
model = cv.fit(df)
model.transform(df).show(truncate=False)

+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+



In [14]:
model.vocabulary  

['a', 'b', 'c']

In [15]:
fromVocabModel = ft.CountVectorizerModel.from_vocabulary(
    ['a', 'b', 'c'],
    inputCol='raw',
    outputCol='vectors')
fromVocabModel.transform(df).show(truncate=False)

+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
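
A minimal plain-Python sketch of the same idea: build a vocabulary ordered by corpus frequency, then count tokens per document. The helpers `fit_vocabulary` and `count_vector` are hypothetical, the `minTF`/`minDF` filtering is omitted, and breaking frequency ties alphabetically is an assumption made only to keep the result deterministic.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Order terms by total corpus frequency (ties broken alphabetically: an assumption)."""
    totals = Counter(token for doc in docs for token in doc)
    return sorted(totals, key=lambda term: (-totals[term], term))


def count_vector(doc, vocabulary):
    """Return a sparse {index: count} mapping for one document."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[token] for token in doc if token in index)
    return {i: float(c) for i, c in sorted(counts.items())}


docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocabulary = fit_vocabulary(docs)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['a', 'b', 'c']
print(vectors)     # [{0: 1.0, 1: 1.0, 2: 1.0}, {0: 2.0, 1: 2.0, 2: 1.0}]
```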



## DCT
A feature transformer that takes the 1D discrete cosine transform of a real vector.

## ElementwiseProduct

Element-wise (Hadamard) product of each input vector with a fixed scaling vector.

In [16]:
df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),)], 
                           ["values"])
df.show()

+-------------+
|       values|
+-------------+
|[2.0,1.0,3.0]|
+-------------+



In [17]:
ep = ft.ElementwiseProduct(
    scalingVec=Vectors.dense([1.0, 2.0, 3.0]),
    inputCol='values',
    outputCol='eprod'
)
ep.transform(df).show()

+-------------+-------------+
|       values|        eprod|
+-------------+-------------+
|[2.0,1.0,3.0]|[2.0,2.0,9.0]|
+-------------+-------------+



## FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.

In [18]:
data = [(2.0, True, "1", "foo"), (3.0, False, "2", "bar")]
cols = ["real", "bool", "stringNum", "string"]
df = spark.createDataFrame(data, cols)
df.show()

+----+-----+---------+------+
|real| bool|stringNum|string|
+----+-----+---------+------+
| 2.0| true|        1|   foo|
| 3.0|false|        2|   bar|
+----+-----+---------+------+



In [19]:
hasher = ft.FeatureHasher(inputCols=cols, outputCol='features')
hasher.transform(df).head().features

SparseVector(262144, {174475: 2.0, 247670: 1.0, 257907: 1.0, 262126: 1.0})

In [20]:
hasher.setCategoricalCols(["real"]).transform(df).head().features

SparseVector(262144, {171257: 1.0, 247670: 1.0, 257907: 1.0, 262126: 1.0})
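
The hashing trick itself can be sketched in plain Python. This is an illustration, not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, so the indices will not match the output above, and the helpers `hash_index` and `hash_features` are hypothetical. The sketch assumes that a numeric column hashes its column name and stores the value, while a categorical column hashes the string `column=value` and stores 1.0.

```python
import zlib


def hash_index(key, num_features):
    """Map a string key to a vector index (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_features(row, num_features=262144, categorical_cols=()):
    """Hash one row (a dict of column -> value) into a sparse {index: value} vector."""
    vector = {}
    for column, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and column not in categorical_cols:
            # Numeric feature: the column name picks the index, the value is kept.
            index, weight = hash_index(column, num_features), float(value)
        else:
            # Categorical feature: "column=value" picks the index, the value is 1.0.
            index, weight = hash_index(f"{column}={value}", num_features), 1.0
        vector[index] = vector.get(index, 0.0) + weight
    return vector


row = {"real": 2.0, "bool": True, "stringNum": "1", "string": "foo"}
default_vector = hash_features(row)
categorical_vector = hash_features(row, categorical_cols=("real",))
print(default_vector)
print(categorical_vector)
```

Treating `real` as categorical replaces its stored value 2.0 with an indicator 1.0 at a different index, which is the same change visible between the two Spark outputs above.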

## HashingTF

Takes a list of tokens as input and returns a vector of term counts with a predetermined length.

Since a simple modulo is used to map the hash value to a column index, it is advisable to use a power of two for the `numFeatures` parameter; otherwise the features will not be mapped evenly to the columns.

In [37]:
df = spark.createDataFrame([(["a", "b", "c"], ), (["a", "c", "d"],), (['a', 'd', 'f'],)], ["words"])
df.show()

+---------+
|    words|
+---------+
|[a, b, c]|
|[a, c, d]|
|[a, d, f]|
+---------+



In [40]:
hashingTF = ft.HashingTF(numFeatures=10, inputCol='words', outputCol='tf_features')
hashed_data = hashingTF.transform(df)
hashed_data.show(truncate=False)

+---------+--------------------------+
|words    |tf_features               |
+---------+--------------------------+
|[a, b, c]|(10,[0,1,2],[1.0,1.0,1.0])|
|[a, c, d]|(10,[0,2,4],[1.0,1.0,1.0])|
|[a, d, f]|(10,[0,4,8],[1.0,1.0,1.0])|
+---------+--------------------------+



In [39]:
params = {hashingTF.numFeatures: 16, hashingTF.outputCol: "vector"}
hashingTF.transform(df, params).head().vector

SparseVector(16, {1: 1.0, 2: 1.0, 10: 1.0})
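
A hedged sketch of the term-frequency hashing in plain Python. `zlib.crc32` is a stand-in for Spark's hash function, so the indices differ from the output above, and `hashing_tf` is a hypothetical helper name.

```python
import zlib


def hashing_tf(tokens, num_features=16):
    """Count tokens into a fixed-length sparse vector via hash modulo num_features."""
    vector = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


docs = [["a", "b", "c"], ["a", "c", "d"], ["a", "d", "f"]]
tf_vectors = [hashing_tf(words, num_features=16) for words in docs]
for words, vector in zip(docs, tf_vectors):
    print(words, vector)
```

Two different tokens can land on the same index; their counts are then added together, which is the price of the fixed vector length.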

## IDF
Inverse document frequency. The documents must already be represented as vectors, for example by HashingTF or CountVectorizer.

In [41]:
idf = ft.IDF(inputCol='tf_features', outputCol='idf_features')
model = idf.fit(hashed_data)
model.transform(hashed_data).show(truncate=False)

+---------+--------------------------+----------------------------------------------------------+
|words    |tf_features               |idf_features                                              |
+---------+--------------------------+----------------------------------------------------------+
|[a, b, c]|(10,[0,1,2],[1.0,1.0,1.0])|(10,[0,1,2],[0.0,0.6931471805599453,0.28768207245178085]) |
|[a, c, d]|(10,[0,2,4],[1.0,1.0,1.0])|(10,[0,2,4],[0.0,0.28768207245178085,0.28768207245178085])|
|[a, d, f]|(10,[0,4,8],[1.0,1.0,1.0])|(10,[0,4,8],[0.0,0.28768207245178085,0.6931471805599453]) |
+---------+--------------------------+----------------------------------------------------------+
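
The weights in the table are consistent with the smoothed formula $\mathrm{idf}(t) = \log\frac{m + 1}{\mathrm{df}(t) + 1}$, where $m$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing term $t$. The sketch below checks this in plain Python, working on the tokens directly instead of hashed indices; `fit_idf` is a hypothetical helper.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Return {term: log((m + 1) / (df + 1))} for a list of token lists."""
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


docs = [["a", "b", "c"], ["a", "c", "d"], ["a", "d", "f"]]
idf = fit_idf(docs)
for term in sorted(idf):
    print(term, idf[term])
```

The term `a` appears in every document and gets weight 0, while `b` and `f` appear once and get $\log 2 \approx 0.693$, matching the `idf_features` column.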



## IndexToString

Maps the indices produced by StringIndexer back to the original string values.

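## StringIndexer
Encodes a column of categorical features as numeric indices, starting from 0.

More frequent labels are encoded first, so the most frequent label gets index 0.

If the input column is numeric, it is first cast to string and then indexed.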


In [42]:
df = spark.createDataFrame([(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],["id", "category"])
df.show()

+---+--------+
| id|category|
+---+--------+
|  0|       a|
|  1|       b|
|  2|       c|
|  3|       a|
|  4|       a|
|  5|       c|
+---+--------+



In [43]:
indexer = ft.StringIndexer(inputCol='category', outputCol='category_index')
model = indexer.fit(df)
df_indexed = model.transform(df)
df_indexed.show()

+---+--------+--------------+
| id|category|category_index|
+---+--------+--------------+
|  0|       a|           0.0|
|  1|       b|           2.0|
|  2|       c|           1.0|
|  3|       a|           0.0|
|  4|       a|           0.0|
|  5|       c|           1.0|
+---+--------+--------------+



In [44]:
# Convert the indices back to the original strings
toString = ft.IndexToString(inputCol='category_index', outputCol='origin_category')

df_string = toString.transform(df_indexed)
df_string.show()

+---+--------+--------------+---------------+
| id|category|category_index|origin_category|
+---+--------+--------------+---------------+
|  0|       a|           0.0|              a|
|  1|       b|           2.0|              b|
|  2|       c|           1.0|              c|
|  3|       a|           0.0|              a|
|  4|       a|           0.0|              a|
|  5|       c|           1.0|              c|
+---+--------+--------------+---------------+
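
The frequency-based encoding and its inverse can be sketched in plain Python. The helper `fit_string_indexer` is hypothetical, and breaking frequency ties alphabetically is an assumption that does not matter for this data.

```python
from collections import Counter


def fit_string_indexer(values):
    """Return labels ordered by descending frequency; list position is the index."""
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


categories = ["a", "b", "c", "a", "a", "c"]
labels = fit_string_indexer(categories)
indexed = [float(labels.index(value)) for value in categories]
restored = [labels[int(i)] for i in indexed]  # the IndexToString direction

print(labels)    # ['a', 'c', 'b']
print(indexed)   # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
print(restored)  # ['a', 'b', 'c', 'a', 'a', 'c']
```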



## MaxAbsScaler
Rescales each feature to the range \[-1.0, 1.0\] by dividing by its maximum absolute value; it does not shift or center the data.

## MinMaxScaler
Rescales each feature to a given range, \[0.0, 1.0\] by default.

In [46]:
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame(
    [(Vectors.dense(1.0, 2.0, 3.0),),
     (Vectors.dense(-4.0, 1.0, -5.0),)],
    ['datas'])
df.show()

+---------------+
|          datas|
+---------------+
|  [1.0,2.0,3.0]|
|[-4.0,1.0,-5.0]|
+---------------+



In [47]:
maScalar = ft.MaxAbsScaler(inputCol='datas', outputCol='ma_datas')
model = maScalar.fit(df)
model.transform(df).show()  # each column is divided by its maximum absolute value

+---------------+---------------+
|          datas|       ma_datas|
+---------------+---------------+
|  [1.0,2.0,3.0]| [0.25,1.0,0.6]|
|[-4.0,1.0,-5.0]|[-1.0,0.5,-1.0]|
+---------------+---------------+



In [48]:
model.maxAbs

DenseVector([4.0, 2.0, 5.0])

In [49]:
mmScalar = ft.MinMaxScaler(min=0.0, max=1.0, inputCol='datas', outputCol='mm_datas')
model = mmScalar.fit(df)
model.transform(df).show()  # (x - column min) / (column max - column min)

+---------------+-------------+
|          datas|     mm_datas|
+---------------+-------------+
|  [1.0,2.0,3.0]|[1.0,1.0,1.0]|
|[-4.0,1.0,-5.0]|[0.0,0.0,0.0]|
+---------------+-------------+



In [52]:
model.originalMax

DenseVector([1.0, 2.0, 3.0])

In [53]:
model.originalMin

DenseVector([-4.0, 1.0, -5.0])
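
Both rescalings are simple per-column formulas; the plain-Python sketch below reproduces the two tables above. The helpers `max_abs_scale` and `min_max_scale` are hypothetical, and mapping a constant column to the midpoint of the target range is an assumption about the degenerate case.

```python
def max_abs_scale(rows):
    """Divide every column by its maximum absolute value."""
    max_abs = [max(abs(x) for x in column) for column in zip(*rows)]
    return [[x / m if m != 0 else 0.0 for x, m in zip(row, max_abs)] for row in rows]


def min_max_scale(rows, new_min=0.0, new_max=1.0):
    """Linearly map every column from [col_min, col_max] to [new_min, new_max]."""
    lows = [min(column) for column in zip(*rows)]
    highs = [max(column) for column in zip(*rows)]
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) * (new_max - new_min) + new_min
            if hi != lo
            else 0.5 * (new_max + new_min)  # constant column: assumption
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


rows = [[1.0, 2.0, 3.0], [-4.0, 1.0, -5.0]]
ma_scaled = max_abs_scale(rows)
mm_scaled = min_max_scale(rows)
print(ma_scaled)  # [[0.25, 1.0, 0.6], [-1.0, 0.5, -1.0]]
print(mm_scaled)  # [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
```

With only two rows, every column's minimum and maximum are the two observed values, which is why the min-max output contains only 0.0 and 1.0.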

## NGram

Converts a sequence of tokens into n-grams: space-joined runs of `n` consecutive tokens.

In [55]:
from pyspark.sql import Row

df = spark.createDataFrame([Row(inputtoken=['a', 'b', 'c', 'd', 'e'])])
df.show()

+---------------+
|     inputtoken|
+---------------+
|[a, b, c, d, e]|
+---------------+



In [57]:
ngram = ft.NGram(n=2, inputCol='inputtoken', outputCol='n-Gram')
ngram.transform(df).head()

Row(inputtoken=['a', 'b', 'c', 'd', 'e'], n-Gram=['a b', 'b c', 'c d', 'd e'])

In [59]:
ngram.setParams(n=3).transform(df).select('n-Gram').show(truncate=False)

+---------------------+
|n-Gram               |
+---------------------+
|[a b c, b c d, c d e]|
+---------------------+
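
The sliding window behind n-grams, as a plain-Python sketch (`ngrams` is a hypothetical helper). An input with fewer than `n` tokens yields an empty list.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["a", "b", "c", "d", "e"]
bigrams = ngrams(tokens, n=2)
trigrams = ngrams(tokens, n=3)
print(bigrams)   # ['a b', 'b c', 'c d', 'd e']
print(trigrams)  # ['a b c', 'b c d', 'c d e']
```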



## Normalizer
Scales each vector to unit norm under the p-norm.

In [64]:
svec = Vectors.sparse(4, {1:4.0, 3:3.0})
df = spark.createDataFrame([(Vectors.dense([-3.0, 4.0]), svec)], 
                           ["dense", 'sparse'])
df.show()                          

+----------+-------------------+
|     dense|             sparse|
+----------+-------------------+
|[-3.0,4.0]|(4,[1,3],[4.0,3.0])|
+----------+-------------------+



In [65]:
normalizer = ft.Normalizer(p=2.0, inputCol="dense", outputCol='features')
normalizer.transform(df).head()

Row(dense=DenseVector([-3.0, 4.0]), sparse=SparseVector(4, {1: 4.0, 3: 3.0}), features=DenseVector([-0.6, 0.8]))

In [66]:
normalizer.setParams(inputCol="sparse", outputCol='freqs').transform(df).head().freqs

SparseVector(4, {1: 0.8, 3: 0.6})
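
A short plain-Python sketch of p-norm normalization, assuming a finite `p >= 1`; `normalize` is a hypothetical helper and sparse vectors are written out densely.

```python
def normalize(vector, p=2.0):
    """Divide a vector by its p-norm, (sum |x|^p)^(1/p)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]


dense_unit = normalize([-3.0, 4.0], p=2.0)             # 2-norm is 5
sparse_unit = normalize([0.0, 4.0, 0.0, 3.0], p=2.0)   # dense form of the sparse vector
l1_unit = normalize([-3.0, 4.0], p=1.0)                # 1-norm is 7
print(dense_unit)
print(sparse_unit)
print(l1_unit)
```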

## StandardScaler

Standardizes each feature to unit variance and, when `withMean=True`, to zero mean.

In [68]:
df = spark.createDataFrame([(Vectors.dense([0.0, 1.0]),), (Vectors.dense([2.0, 3.0]),)], ["a"])
# withMean=False: scale to unit standard deviation without centering
standardScaler = ft.StandardScaler(
    withMean=False, withStd=True,
    inputCol='a', outputCol='scaled'
)
model = standardScaler.fit(df)
model.mean

DenseVector([1.0, 2.0])

In [69]:
model.std

DenseVector([1.4142, 1.4142])

In [78]:
model.transform(df).collect()

[Row(a=DenseVector([0.0, 1.0]), scaled=DenseVector([0.0, 0.7071])),
 Row(a=DenseVector([2.0, 3.0]), scaled=DenseVector([1.4142, 2.1213]))]
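
The numbers above can be reproduced with the standard library. The sketch assumes the sample standard deviation (divisor `n - 1`), which is what yields 1.4142 here; `standard_scale` is a hypothetical helper, and mapping a zero-variance column to 0.0 is an assumption.

```python
import statistics


def standard_scale(rows, with_mean=False, with_std=True):
    """Return (means, stds, scaled_rows) using per-column sample statistics."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    stds = [statistics.stdev(column) for column in columns]  # divisor n - 1
    scaled = []
    for row in rows:
        out = []
        for x, mu, sd in zip(row, means, stds):
            if with_mean:
                x = x - mu
            if with_std:
                x = x / sd if sd != 0 else 0.0  # zero variance: assumption
            out.append(x)
        scaled.append(out)
    return means, stds, scaled


rows = [[0.0, 1.0], [2.0, 3.0]]
means, stds, scaled = standard_scale(rows, with_mean=False, with_std=True)
print(means)   # [1.0, 2.0]
print(stds)
print(scaled)
```

Because `with_mean` is off, the values are only divided by the standard deviation, so the scaled columns are not centered at zero.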

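## OneHotEncoder
Encodes a column of category indices as a column of binary (one-hot) vectors. The cells below call `.transform(...)` on the encoder directly; in Spark 3.0 and later `OneHotEncoder` is an estimator, so call `.fit(...)` first and transform with the resulting model.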

In [80]:
df_indexed.show()  # if the categorical data is a feature rather than the label y, use one-hot encoding

+---+--------+--------------+
| id|category|category_index|
+---+--------+--------------+
|  0|       a|           0.0|
|  1|       b|           2.0|
|  2|       c|           1.0|
|  3|       a|           0.0|
|  4|       a|           0.0|
|  5|       c|           1.0|
+---+--------+--------------+



In [83]:
encoder = ft.OneHotEncoder(dropLast=False, inputCol='category_index', outputCol='category_onehot')
encoder.transform(df_indexed).show()

+---+--------+--------------+---------------+
| id|category|category_index|category_onehot|
+---+--------+--------------+---------------+
|  0|       a|           0.0|  (3,[0],[1.0])|
|  1|       b|           2.0|  (3,[2],[1.0])|
|  2|       c|           1.0|  (3,[1],[1.0])|
|  3|       a|           0.0|  (3,[0],[1.0])|
|  4|       a|           0.0|  (3,[0],[1.0])|
|  5|       c|           1.0|  (3,[1],[1.0])|
+---+--------+--------------+---------------+



In [84]:
encoder.setParams(dropLast=True).transform(df_indexed).show()  # index 2 becomes all zeros; every other category has exactly one 1

+---+--------+--------------+---------------+
| id|category|category_index|category_onehot|
+---+--------+--------------+---------------+
|  0|       a|           0.0|  (2,[0],[1.0])|
|  1|       b|           2.0|      (2,[],[])|
|  2|       c|           1.0|  (2,[1],[1.0])|
|  3|       a|           0.0|  (2,[0],[1.0])|
|  4|       a|           0.0|  (2,[0],[1.0])|
|  5|       c|           1.0|  (2,[1],[1.0])|
+---+--------+--------------+---------------+
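
The effect of `dropLast` is easy to see on dense lists. In this plain-Python sketch `one_hot` is a hypothetical helper and the number of categories is passed in explicitly.

```python
def one_hot(index, num_categories, drop_last=True):
    """One-hot encode a category index; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if int(index) < size:
        vector[int(index)] = 1.0
    return vector


indices = [0.0, 2.0, 1.0]
full = [one_hot(i, 3, drop_last=False) for i in indices]
dropped = [one_hot(i, 3, drop_last=True) for i in indices]
print(full)     # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
print(dropped)  # [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
```

Dropping the last category loses no information, since an all-zero vector identifies it, and it keeps the encoded columns from always summing to one.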



## PCA
Dimensionality reduction: projects each vector onto the top `k` principal components.

In [86]:
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
    (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show(truncate=False)

+---------------------+
|features             |
+---------------------+
|(5,[1,3],[1.0,7.0])  |
|[2.0,0.0,3.0,4.0,5.0]|
|[4.0,0.0,0.0,6.0,7.0]|
+---------------------+



In [88]:
pca = ft.PCA(k=2, inputCol='features', outputCol='pca_features')
model = pca.fit(df)
model.transform(df).collect()

[Row(features=SparseVector(5, {1: 1.0, 3: 7.0}), pca_features=DenseVector([1.6486, -4.0133])),
 Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
 Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]

In [89]:
model.pc

DenseMatrix(5, 2, [-0.4486, 0.133, -0.1252, 0.2165, -0.8477, -0.2842, -0.0562, 0.7636, -0.5653, -0.1156], 0)

## Tokenizer
A tokenizer that converts the input text to lowercase and splits it on whitespace.

In [92]:
df = spark.createDataFrame([("a b c",)], ["text"])
tokenizer = ft.Tokenizer(inputCol="text", outputCol="words")
tokenizer.transform(df).head()

Row(text='a b c', words=['a', 'b', 'c'])

In [93]:
# Temporarily modify a parameter.
tokenizer.transform(df, {tokenizer.outputCol: "words"}).head()

Row(text='a b c', words=['a', 'b', 'c'])

## VectorAssembler
Merges multiple numeric and vector columns into a single vector column.

In [94]:
df = spark.createDataFrame([(1, 0, 3), (2, 1, 4)], ["a", "b", "c"])
df.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  0|  3|
|  2|  1|  4|
+---+---+---+



In [95]:
vecAssembler = ft.VectorAssembler(inputCols=['a', 'b', 'c'], outputCol='features')
vecAssembler.transform(df).show()

+---+---+---+-------------+
|  a|  b|  c|     features|
+---+---+---+-------------+
|  1|  0|  3|[1.0,0.0,3.0]|
|  2|  1|  4|[2.0,1.0,4.0]|
+---+---+---+-------------+



## VectorIndexer

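Indexes the categorical features inside a vector column.

StringIndexer converts a single categorical feature. When all features have already been assembled into one vector and only some of its components need converting, VectorIndexer handles the categorical features within the vector.

Given the `maxCategories` hyperparameter, it automatically identifies which features are categorical and converts their original values to category indices. The decision is based on the number of distinct values per feature: a feature with at most `maxCategories` distinct values is treated as categorical.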


In [96]:
df = spark.createDataFrame(
     [(Vectors.dense(-1.0, 1.0, 1.0),),
      (Vectors.dense(-1.0, 3.0, 1.0),),
      (Vectors.dense(0.0, 5.0, 1.0), )],
     ["features"])
df.show()

+--------------+
|      features|
+--------------+
|[-1.0,1.0,1.0]|
|[-1.0,3.0,1.0]|
| [0.0,5.0,1.0]|
+--------------+



In [99]:
indexer = ft.VectorIndexer(maxCategories=2, inputCol='features', outputCol='indexed')
model = indexer.fit(df)
model.transform(df).show()

+--------------+-------------+
|      features|      indexed|
+--------------+-------------+
|[-1.0,1.0,1.0]|[1.0,1.0,0.0]|
|[-1.0,3.0,1.0]|[1.0,3.0,0.0]|
| [0.0,5.0,1.0]|[0.0,5.0,0.0]|
+--------------+-------------+



In [98]:
model.categoryMaps  # two features were converted: feature 0 and feature 2

{0: {0.0: 0, -1.0: 1}, 2: {1.0: 0}}
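
The selection rule can be sketched in plain Python. The helpers `fit_vector_indexer` and `transform_vector` are hypothetical. Assigning index 0 to the value 0.0 when it is present, and the remaining indices in ascending order, is an assumption inferred from the `categoryMaps` output above.

```python
def fit_vector_indexer(rows, max_categories=2):
    """Build {feature index: {original value: category index}} for categorical features."""
    category_maps = {}
    for j, column in enumerate(zip(*rows)):
        distinct = sorted(set(column))
        if len(distinct) <= max_categories:
            if 0.0 in distinct:
                # Assumption: zero keeps index 0, as in the output above.
                distinct = [0.0] + [v for v in distinct if v != 0.0]
            category_maps[j] = {value: i for i, value in enumerate(distinct)}
    return category_maps


def transform_vector(row, category_maps):
    """Replace categorical components by their index; leave the others unchanged."""
    return [
        float(category_maps[j][x]) if j in category_maps else x
        for j, x in enumerate(row)
    ]


rows = [[-1.0, 1.0, 1.0], [-1.0, 3.0, 1.0], [0.0, 5.0, 1.0]]
category_maps = fit_vector_indexer(rows, max_categories=2)
indexed_rows = [transform_vector(row, category_maps) for row in rows]
print(category_maps)  # {0: {0.0: 0, -1.0: 1}, 2: {1.0: 0}}
print(indexed_rows)   # [[1.0, 1.0, 0.0], [1.0, 3.0, 0.0], [0.0, 5.0, 0.0]]
```

Feature 1 has three distinct values, more than `max_categories`, so it is treated as continuous and passed through unchanged.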