# Transformers
At a high level, every class derived from the `Transformer` abstract class 
must implement a `.transform(...)` method. Two parameters recur throughout:

- `inputCol`: the input feature column; defaults to `"features"`
- `outputCol`: the column that receives the transformed output

In [1]:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('feature test').setMaster('local[4]')
spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .getOrCreate()

In [2]:
import pyspark.ml.feature as ft

## Binarizer

Given a threshold, the method takes a continuous variable and 
transforms it into a binary one.

In [4]:
df = spark.createDataFrame([(0.511,), (0.6232,), (0.4323,), (0.9434,), (0.3213,)],
                           ["values"])
df.show()

+------+
|values|
+------+
| 0.511|
|0.6232|
|0.4323|
|0.9434|
|0.3213|
+------+



In [6]:
binarizer = ft.Binarizer(threshold=0.5, inputCol="values", outputCol='features')
binarizer.transform(df).show()

+------+--------+
|values|features|
+------+--------+
| 0.511|     1.0|
|0.6232|     1.0|
|0.4323|     0.0|
|0.9434|     1.0|
|0.3213|     0.0|
+------+--------+
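
The thresholding rule can be sketched in plain Python, without a Spark session. This is only an illustration of the comparison, assuming values strictly greater than the threshold map to 1.0 and all others to 0.0; the `binarize` helper is hypothetical and is not part of PySpark.

```python
def binarize(values, threshold):
    """Map each value to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


# Same numbers as the DataFrame above
values = [0.511, 0.6232, 0.4323, 0.9434, 0.3213]
binarized = binarize(values, threshold=0.5)
print(binarized)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```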



## Bucketizer
Similar to the Binarizer, this method takes a list of thresholds 
(the splits parameter) and transforms a continuous variable into a 
multinomial one.

In [7]:
values = [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),), (float("nan"),)]
df = spark.createDataFrame(values, ['values'])
df.show()

+------+
|values|
+------+
|   0.1|
|   0.4|
|   1.2|
|   1.5|
|   NaN|
|   NaN|
+------+



In [13]:
bucketizer = ft.Bucketizer(
    splits=[-float("inf"), 0.5, 1.4, float("inf")],
    inputCol='values', outputCol='features'
)
# Keep invalid values (NaN) by assigning them to an extra bucket
bucketed = bucketizer.setHandleInvalid("keep").transform(df)
bucketed.show()

+------+--------+
|values|features|
+------+--------+
|   0.1|     0.0|
|   0.4|     0.0|
|   1.2|     1.0|
|   1.5|     2.0|
|   NaN|     3.0|
|   NaN|     3.0|
+------+--------+
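
The bucket assignment above can be reproduced with a small pure-Python sketch. It assumes each bucket is the half-open interval `[splits[i], splits[i+1])`, that the last bucket also includes its upper bound, and that "keep" sends NaN to one extra bucket after the regular ones. The `bucketize` helper is hypothetical, not a PySpark function.

```python
import math
from bisect import bisect_right


def bucketize(x, splits, keep_invalid=True):
    """Return the bucket index of x as a float, given sorted split points."""
    n_buckets = len(splits) - 1
    if math.isnan(x):
        if keep_invalid:
            # NaN goes to an additional bucket after the regular ones
            return float(n_buckets)
        raise ValueError("NaN encountered and keep_invalid is False")
    if x < splits[0] or x > splits[-1]:
        raise ValueError("value lies outside the split range")
    if x == splits[-1]:
        # The last bucket is closed on the right
        return float(n_buckets - 1)
    return float(bisect_right(splits, x) - 1)


splits = [-float("inf"), 0.5, 1.4, float("inf")]
values = [0.1, 0.4, 1.2, 1.5, float("nan"), float("nan")]
buckets = [bucketize(v, splits) for v in values]
print(buckets)  # [0.0, 0.0, 1.0, 2.0, 3.0, 3.0]
```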



In [14]:
bucketizer.setParams(outputCol="b").transform(df).show()

+------+---+
|values|  b|
+------+---+
|   0.1|0.0|
|   0.4|0.0|
|   1.2|1.0|
|   1.5|2.0|
|   NaN|3.0|
|   NaN|3.0|
+------+---+



## ChiSqSelector
Performs feature selection using the chi-square ($\chi^2$) test.

Reference: https://blog.csdn.net/sinat_33761963/article/details/54910955

In [19]:
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.),
     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.),],
    ["features", 'label'])
df.show()

+------------------+-----+
|          features|label|
+------------------+-----+
|[0.0,0.0,18.0,1.0]|  1.0|
|[0.0,1.0,12.0,0.0]|  0.0|
|[1.0,0.0,15.0,0.1]|  0.0|
+------------------+-----+



In [20]:
# Select the single best feature
selector = ft.ChiSqSelector(numTopFeatures=1, outputCol='selectedFeature')
model = selector.fit(df)
model.transform(df).show()

+------------------+-----+---------------+
|          features|label|selectedFeature|
+------------------+-----+---------------+
|[0.0,0.0,18.0,1.0]|  1.0|         [18.0]|
|[0.0,1.0,12.0,0.0]|  0.0|         [12.0]|
|[1.0,0.0,15.0,0.1]|  0.0|         [15.0]|
+------------------+-----+---------------+



In [21]:
model.selectedFeatures

[2]
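
To make the test concrete, the following sketch computes the Pearson $\chi^2$ statistic between each feature and the label for the three rows above, treating every distinct feature value as a category. It computes only the statistic; it does not attempt to reproduce how Spark ranks features or breaks ties. The `chi_square_statistic` helper is hypothetical.

```python
from collections import Counter


def chi_square_statistic(feature_values, labels):
    """Pearson chi-square statistic of the feature-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Expected count under independence of feature and label
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    return statistic


rows = [
    ([0.0, 0.0, 18.0, 1.0], 1.0),
    ([0.0, 1.0, 12.0, 0.0], 0.0),
    ([1.0, 0.0, 15.0, 0.1], 0.0),
]
labels = [label for _, label in rows]
statistics = [
    chi_square_statistic([features[i] for features, _ in rows], labels)
    for i in range(4)
]
print(statistics)
```

On this tiny dataset the features at indices 2 and 3 each take three distinct values and therefore get the same statistic, so the statistic alone does not explain why index 2 is the one selected.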

## CountVectorizer
Converts tokenized text into vectors of token counts.

In [22]:
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
    ["label", 'raw']
)
df.show()

+-----+---------------+
|label|            raw|
+-----+---------------+
|    0|      [a, b, c]|
|    1|[a, b, b, c, a]|
+-----+---------------+



In [24]:
cv = ft.CountVectorizer(minTF=1., minDF=1., 
                        inputCol='raw', outputCol='vectors')
model = cv.fit(df)
model.transform(df).show(truncate=False)

+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+



In [27]:
model.vocabulary  

['a', 'b', 'c']

In [28]:
fromVocabModel = ft.CountVectorizerModel.from_vocabulary(
    ['a', 'b', 'c'],
    inputCol='raw',
    outputCol='vectors')
fromVocabModel.transform(df).show(truncate=False)

+-----+---------------+-------------------------+
|label|raw            |vectors                  |
+-----+---------------+-------------------------+
|0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
|1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+-----+---------------+-------------------------+
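
The sparse output `(size, indices, values)` can be read as "vocabulary size, positions of the terms that occur, and their counts". A minimal sketch of that counting step for a fixed vocabulary follows; the `count_vectorize` helper is hypothetical, and it simply ignores tokens that are not in the vocabulary.

```python
from collections import Counter


def count_vectorize(tokens, vocabulary):
    """Return (size, indices, values) describing token counts over a fixed vocabulary."""
    index_of = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index_of[t] for t in tokens if t in index_of)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]


vocabulary = ["a", "b", "c"]
first = count_vectorize(["a", "b", "c"], vocabulary)
second = count_vectorize(["a", "b", "b", "c", "a"], vocabulary)
print(first)   # (3, [0, 1, 2], [1.0, 1.0, 1.0])
print(second)  # (3, [0, 1, 2], [2.0, 2.0, 1.0])
```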



## DCT
A feature transformer that takes the 1D discrete cosine transform of a real vector.
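
This section has no worked example, so here is a pure-Python sketch of the transform itself. It assumes the DCT-II variant with orthonormal scaling (so the transform preserves the vector's squared norm); the `dct_ii` helper and the sample vector are illustrative choices, not taken from the notebook.

```python
import math


def dct_ii(x):
    """Orthonormally scaled 1D DCT-II of a list of real numbers."""
    n = len(x)
    out = []
    for k in range(n):
        total = sum(x[i] * math.cos(math.pi / n * (i + 0.5) * k) for i in range(n))
        # The k = 0 term uses a different scale so that the transform is orthonormal
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


vector = [5.0, 8.0, 6.0]
coefficients = dct_ii(vector)
print(coefficients)
```

With this scaling, the sum of squares of `coefficients` equals the sum of squares of `vector`.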

## ElementwiseProduct

Computes the element-wise product of each input vector with a given scaling vector.

In [29]:
df = spark.createDataFrame([(Vectors.dense([2.0, 1.0, 3.0]),)], 
                           ["values"])
df.show()

+-------------+
|       values|
+-------------+
|[2.0,1.0,3.0]|
+-------------+



In [30]:
ep = ft.ElementwiseProduct(
    scalingVec=Vectors.dense([1.0, 2.0, 3.0]),
    inputCol='values',
    outputCol='eprod'
)
ep.transform(df).show()

+-------------+-------------+
|       values|        eprod|
+-------------+-------------+
|[2.0,1.0,3.0]|[2.0,2.0,9.0]|
+-------------+-------------+
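
The same arithmetic in plain Python, as a quick check of the output above. The `elementwise_product` helper is hypothetical.

```python
def elementwise_product(vector, scaling_vec):
    """Multiply two equal-length vectors component by component."""
    if len(vector) != len(scaling_vec):
        raise ValueError("vectors must have the same length")
    return [v * s for v, s in zip(vector, scaling_vec)]


eprod = elementwise_product([2.0, 1.0, 3.0], [1.0, 2.0, 3.0])
print(eprod)  # [2.0, 2.0, 9.0]
```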



## FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick (https://en.wikipedia.org/wiki/Feature_hashing) to map features to indices in the feature vector.

In [31]:
data = [(2.0, True, "1", "foo"), (3.0, False, "2", "bar")]
cols = ["real", "bool", "stringNum", "string"]
df = spark.createDataFrame(data, cols)
df.show()

+----+-----+---------+------+
|real| bool|stringNum|string|
+----+-----+---------+------+
| 2.0| true|        1|   foo|
| 3.0|false|        2|   bar|
+----+-----+---------+------+



In [34]:
hasher = ft.FeatureHasher(inputCols=cols, outputCol='features')
hasher.transform(df).head().features

SparseVector(262144, {174475: 2.0, 247670: 1.0, 257907: 1.0, 262126: 1.0})

In [35]:
hasher.setCategoricalCols(["real"]).transform(df).head().features

SparseVector(262144, {171257: 1.0, 247670: 1.0, 257907: 1.0, 262126: 1.0})
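
The two outputs differ only in how the `real` column is encoded: as a numeric feature it contributes its value (2.0), and as a categorical feature it contributes an indicator (1.0). The sketch below illustrates that rule with the hashing trick. It assumes numeric columns are hashed by column name and all other columns by a `"column=value"` string, and it uses MD5 purely as a stand-in hash, so the indices will not match Spark's. Both helpers are hypothetical.

```python
import hashlib


def stable_hash(text):
    """Stand-in hash for illustration only; Spark uses a different hash function."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def hash_features(row, num_features=262144, categorical_cols=()):
    """Hash a dict of column -> value into a sparse {index: value} vector."""
    vector = {}
    for column, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and column not in categorical_cols:
            # Numeric feature: hash the column name, keep the value
            key, weight = column, float(value)
        else:
            # Categorical feature: hash "column=value", use an indicator of 1.0
            key, weight = f"{column}={value}", 1.0
        index = stable_hash(key) % num_features
        vector[index] = vector.get(index, 0.0) + weight
    return vector


row = {"real": 2.0, "bool": True, "stringNum": "1", "string": "foo"}
hashed = hash_features(row)
hashed_categorical = hash_features(row, categorical_cols=("real",))
print(sorted(hashed.values()))
print(sorted(hashed_categorical.values()))
```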

## HashingTF

Takes a list of tokens as input and returns a fixed-length vector of term counts.

Since a simple modulo is used to map the hash value to a column index, it is advisable to use a power of two for the `numFeatures` parameter; otherwise the features will not be mapped evenly to the columns.
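
One way to see the unevenness: if hash values are spread uniformly over a range of size $2^k$, taking them modulo a number that does not divide $2^k$ leaves some columns with one more hash value than others. The sketch below counts this exactly for a toy 8-bit hash range; the range size and the `bucket_sizes` helper are assumptions made for illustration.

```python
from collections import Counter


def bucket_sizes(hash_bits, num_features):
    """Count how many of the 2**hash_bits hash values land in each column."""
    counts = Counter(h % num_features for h in range(2 ** hash_bits))
    return [counts[i] for i in range(num_features)]


sizes_10 = bucket_sizes(8, 10)
sizes_16 = bucket_sizes(8, 16)
print(sizes_10)  # [26, 26, 26, 26, 26, 26, 25, 25, 25, 25]
print(sizes_16)  # sixteen columns with 16 hash values each
```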

In [39]:
df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
hashingTF = ft.HashingTF(numFeatures=10, inputCol='words', outputCol='features')
hashingTF.transform(df).head().features

SparseVector(10, {0: 1.0, 1: 1.0, 2: 1.0})

In [40]:
params = {hashingTF.numFeatures: 16, hashingTF.outputCol: "vector"}
hashingTF.transform(df, params).head().vector

SparseVector(16, {1: 1.0, 2: 1.0, 10: 1.0})