<h4>Differences between transformers and estimators</h4>

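- A transformer needs no parameters learned from the input data; it may simply cast a column to another type or build an interaction variable from two existing variables.
- An estimator must first be fit to the data before it can transform it, for example when normalizing a column to zero mean and unit variance.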
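
The distinction can be illustrated without Spark. The following is a minimal pure-Python sketch, not the MLlib API: the class names CastToFloat, MeanCenterer and MeanCentererModel are hypothetical and exist only for this illustration. The transformer works immediately, while the estimator has to see the data in fit() and returns a fitted model that does the transforming.

```python
# Pure-Python illustration only; these classes are hypothetical, not Spark classes.

class CastToFloat:
    """Transformer-like: needs nothing from the data to do its job."""

    def transform(self, values):
        return [float(v) for v in values]


class MeanCentererModel:
    """The fitted result of an estimator; it is itself a transformer."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class MeanCenterer:
    """Estimator-like: fit() must inspect the data before any transform."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return MeanCentererModel(mean)


data = [1, 2, 3]
as_floats = CastToFloat().transform(data)   # no fit step required
model = MeanCenterer().fit(as_floats)       # learns the mean from the data
centered = model.transform(as_floats)       # applies the learned parameter
```

The same two-step shape, fit followed by transform, shows up in every estimator used later in these notes.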

- First, read the data.

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("test").getOrCreate()

In [4]:
sales = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("./data/retail-data/by-day/2010-12-01.csv")

In [6]:
fakeIntDF = spark.read.parquet("./data/simple-ml-integers/part-00000-ce2a44c8-feb4-4369-a2c3-4bf2f0e63b07-c000.gz.parquet")
simpleDF = spark.read.json("./data/simple-ml/part-r-00000-f5c243b9-a015-4a3b-a4a8-eca00f80f04c.json")
scaleDF = spark.read.parquet("./data/simple-ml-scaling/part-00000-cd03406a-cc9b-42b0-9299-1e259fdd9382-c000.gz.parquet")

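<h4>Transformers</h4>

- The following sections introduce the individual transformers.
- Nearly all of them take an inputCol and an outputCol, which name the input and output columns. When outputCol is omitted, Spark generates a default name from the transformer's uid, which is why columns such as VectorAssembler_c3f9c29e9d67__output appear in the outputs below.
- We start with the high-level transformers.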

<h4>RFormula</h4>

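- RFormula handles numeric and categorical variables automatically: numeric columns are cast to Double.
- String columns are treated as categorical and one-hot encoded.
- In the formula used below, lab is the label, . stands for every other column, and color:value1 adds an interaction term between color and value1.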

In [8]:
from pyspark.ml.feature import RFormula

In [10]:
supervised = RFormula(formula="lab ~ . + color:value1 + color:value2")
supervised.fit(simpleDF).transform(simpleDF).show(2, False)

+-----+----+------+------------------+--------------------------------------------------------------------+-----+
|color|lab |value1|value2            |features                                                            |label|
+-----+----+------+------------------+--------------------------------------------------------------------+-----+
|green|good|1     |14.386294994851129|(10,[1,2,3,5,8],[1.0,1.0,14.386294994851129,1.0,14.386294994851129])|1.0  |
|blue |bad |8     |14.386294994851129|(10,[2,3,6,9],[8.0,14.386294994851129,8.0,14.386294994851129])      |0.0  |
+-----+----+------+------------------+--------------------------------------------------------------------+-----+
only showing top 2 rows



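<h4>SQLTransformer</h4>

- SQLTransformer lets you express an MLlib transformation as a SQL statement, with the full Spark SQL library available.
- The only change is that instead of a table name you use the placeholder __THIS__, which stands for the input DataFrame.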

In [12]:
from pyspark.ml.feature import SQLTransformer

In [13]:
basicTransformation = SQLTransformer()\
    .setStatement("""
        SELECT sum(Quantity), count(*), CustomerID
        FROM __THIS__
        GROUP BY CustomerID
    """)

In [14]:
basicTransformation.transform(sales).show(2)

+-------------+--------+----------+
|sum(Quantity)|count(1)|CustomerID|
+-------------+--------+----------+
|          197|      36|   15311.0|
|          301|      21|   16539.0|
+-------------+--------+----------+
only showing top 2 rows



<h4>VectorAssembler</h4>

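- A tool used in nearly every pipeline: it concatenates all features into one large vector that is then passed to an estimator.
- It takes multiple Boolean, Double, or Vector columns as input and outputs a single vector.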
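
As a rough picture of what assembling means, here is a pure-Python sketch under the assumption that a row is a dict and a vector is a plain list. The helper name assemble is hypothetical; it only mimics the concatenation, not Spark's handling of sparse vectors or metadata.

```python
def assemble(row, input_cols):
    """Concatenate the scalar and list-valued columns of one row into a flat list of floats."""
    out = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            out.extend(float(v) for v in value)   # vector column: append every element
        else:
            out.append(float(value))              # Boolean or numeric column: one element
    return out


assembled = assemble({"int1": 1, "int2": 2, "int3": 3}, ["int1", "int2", "int3"])
mixed = assemble({"flag": True, "x": 0.5, "vec": [7.0, 8.0]}, ["flag", "x", "vec"])
```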

In [20]:
fakeIntDF.show()

+----+----+----+
|int1|int2|int3|
+----+----+----+
|   1|   2|   3|
+----+----+----+



In [18]:
from pyspark.ml.feature import VectorAssembler

In [19]:
va = VectorAssembler().setInputCols(["int1", "int2", "int3"])
va.transform(fakeIntDF).show()

+----+----+----+------------------------------------+
|int1|int2|int3|VectorAssembler_c3f9c29e9d67__output|
+----+----+----+------------------------------------+
|   1|   2|   3|                       [1.0,2.0,3.0]|
+----+----+----+------------------------------------+



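<h4>Working with continuous features</h4>

- Two kinds of transformers are commonly used for continuous features:
    - bucketing, which converts a continuous feature into a categorical one;
    - scaling and normalization, which rescale features to meet different requirements.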

In [25]:
contDF = spark.range(20).selectExpr("cast(id as Double)")

In [26]:
contDF.show(2)

+---+
| id|
+---+
|0.0|
|1.0|
+---+
only showing top 2 rows



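<h4>Bucketing</h4>

- The most direct way to bucket is Bucketizer. For example, working with weight is often simpler after splitting it into "overweight", "average", and "underweight" buckets.
- The split values must satisfy three conditions:
    - the minimum value in the splits array must not be greater than the minimum value in the DataFrame;
    - the maximum value in the splits array must not be less than the maximum value in the DataFrame;
    - at least three values must be given, which creates two buckets.
- To cover every possible value, use float("-inf") and float("inf") as the outermost splits.
- To handle null or NaN values, set the handleInvalid parameter: keep them (keep), raise an error (error, the default), or skip those rows (skip).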
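
The split rules above can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: the function name bucketize is hypothetical, buckets are half-open intervals [x, y), and the last bucket also includes its upper bound.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) that holds value."""
    if len(splits) < 3:
        raise ValueError("need at least three split points (two buckets)")
    if value == splits[-1]:
        return float(len(splits) - 2)          # last bucket includes its upper bound
    if not (splits[0] <= value < splits[-1]):
        raise ValueError(f"{value} lies outside the splits")
    return float(bisect_right(splits, value) - 1)


bucket_borders = [-1.0, 5.0, 10.0, 250.0, 600.0]
buckets = [bucketize(float(v), bucket_borders) for v in range(6)]

# Infinite outer splits accept any finite value.
open_ended = [float("-inf"), 0.0, 10.0, float("inf")]
low = bucketize(-1e9, open_ended)
high = bucketize(1e9, open_ended)
```

A value such as 700.0 falls outside bucket_borders and raises an error here, which is why the outermost splits have to enclose the data.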

In [23]:
from pyspark.ml.feature import Bucketizer

In [31]:
bucketBorders = [-1.0, 5.0, 10.0, 250.0, 600.0]
bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("id")
bucketer.transform(contDF).show(6)

+---+-------------------------------+
| id|Bucketizer_a4b26a079c78__output|
+---+-------------------------------+
|0.0|                            0.0|
|1.0|                            0.0|
|2.0|                            0.0|
|3.0|                            0.0|
|4.0|                            0.0|
|5.0|                            1.0|
+---+-------------------------------+
only showing top 6 rows



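- Besides hard-coding the split values, you can also split on percentiles of the data.
- QuantileDiscretizer does this.
- For example, the 90th percentile is the value below which 90% of the data falls; setRelativeError sets the relative error of the approximate quantile computation.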
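
The idea of quantile-based splitting can be sketched with the standard library (Python 3.8+ for statistics.quantiles). This computes exact quantiles, whereas QuantileDiscretizer uses an approximate algorithm, so the bucket boundaries here are not expected to match Spark's output exactly.

```python
from bisect import bisect_right
from statistics import quantiles

values = [float(i) for i in range(20)]     # same values as contDF above
cut_points = quantiles(values, n=5)        # 4 interior cut points give 5 buckets
bucket_ids = [bisect_right(cut_points, v) for v in values]
counts = [bucket_ids.count(b) for b in range(5)]   # bucket sizes
```

With exact quantiles each of the five buckets receives the same number of values, which is the property that distinguishes this approach from hand-picked splits.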

In [34]:
from pyspark.ml.feature import QuantileDiscretizer

In [38]:
bucketer = QuantileDiscretizer().setNumBuckets(5).setInputCol("id")
fittedBucketer = bucketer.fit(contDF)

In [39]:
fittedBucketer.transform(contDF).show()

+----+----------------------------------------+
|  id|QuantileDiscretizer_b9cfcfdc0b25__output|
+----+----------------------------------------+
| 0.0|                                     0.0|
| 1.0|                                     0.0|
| 2.0|                                     0.0|
| 3.0|                                     1.0|
| 4.0|                                     1.0|
| 5.0|                                     1.0|
| 6.0|                                     1.0|
| 7.0|                                     2.0|
| 8.0|                                     2.0|
| 9.0|                                     2.0|
|10.0|                                     2.0|
|11.0|                                     3.0|
|12.0|                                     3.0|
|13.0|                                     3.0|
|14.0|                                     3.0|
|15.0|                                     4.0|
|16.0|                                     4.0|
|17.0|                                  

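<h4>Scaling and normalization</h4>

- The previous section showed how bucketing groups the values of a continuous variable.
- Another common task is scaling and normalizing continuous data.
- The scalers below work column-wise on a Vector column: if features holds one vector per row, each scaler rescales the 1st, 2nd, ..., nth element across all rows independently.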

<h4>StandardScaler</h4>

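- Standardizes a set of features to values with zero mean and unit standard deviation.
- The withStd flag scales the data to unit standard deviation; it defaults to True.
- The withMean flag centers the data before scaling, that is, subtracts the mean from each value; it defaults to False.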
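
To make the arithmetic concrete, here is a pure-Python sketch of column-wise standardization. It assumes the sample standard deviation (dividing by n - 1), which is what reproduces the values in the Spark output in this section; the function name standard_scale is hypothetical and the five feature vectors are copied from the scaleDF rows shown in these notes.

```python
from math import sqrt


def standard_scale(rows, with_mean=True, with_std=True):
    """Standardize each column of rows (a list of equal-length lists) independently."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # Sample standard deviation: divide by n - 1.
    stds = [sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
            for col, m in zip(columns, means)]
    scaled = []
    for row in rows:
        out = []
        for x, m, s in zip(row, means, stds):
            if with_mean:
                x = x - m                  # center first
            if with_std and s != 0.0:
                x = x / s                  # then scale to unit standard deviation
            out.append(x)
        scaled.append(out)
    return scaled


features = [[1.0, 0.1, -1.0], [2.0, 1.1, 1.0], [1.0, 0.1, -1.0],
            [2.0, 1.1, 1.0], [3.0, 10.1, 3.0]]
scaled = standard_scale(features)
```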

In [40]:
from pyspark.ml.feature import StandardScaler

In [53]:
sScaler = StandardScaler(withMean=True).setInputCol("features")
sScaler.fit(scaleDF).transform(scaleDF).show(5, False)

+---+--------------+--------------------------------------------------------------+
|id |features      |StandardScaler_827ba24da26c__output                           |
+---+--------------+--------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[-0.9561828874675149,-0.5610294986546213,-0.9561828874675149] |
|1  |[2.0,1.1,1.0] |[0.23904572186687867,-0.32726720754852906,0.23904572186687872]|
|0  |[1.0,0.1,-1.0]|[-0.9561828874675149,-0.5610294986546213,-0.9561828874675149] |
|1  |[2.0,1.1,1.0] |[0.23904572186687867,-0.32726720754852906,0.23904572186687872]|
|1  |[3.0,10.1,3.0]|[1.4342743312012722,1.7765934124063008,1.4342743312012722]    |
+---+--------------+--------------------------------------------------------------+



<h4>MinMaxScaler</h4>

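- Rescales each feature linearly into a target range, set with setMin and setMax (0 to 1 by default), based on that feature's observed minimum and maximum.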
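
A pure-Python sketch of the same rescaling, column by column. The function name min_max_scale is hypothetical, and mapping a constant column to the midpoint of the target range is an assumption made here to avoid a division by zero.

```python
def min_max_scale(rows, new_min=0.0, new_max=1.0):
    """Linearly map each column of rows onto [new_min, new_max]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        out = []
        for x, lo, hi in zip(row, lows, highs):
            if hi == lo:
                out.append(0.5 * (new_min + new_max))   # constant column (assumption)
            else:
                out.append((x - lo) / (hi - lo) * (new_max - new_min) + new_min)
        scaled.append(out)
    return scaled


features = [[1.0, 0.1, -1.0], [2.0, 1.1, 1.0], [1.0, 0.1, -1.0],
            [2.0, 1.1, 1.0], [3.0, 10.1, 3.0]]
min_max = min_max_scale(features)
```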

In [59]:
from pyspark.ml.feature import MinMaxScaler

In [67]:
minMax = MinMaxScaler().setMin(0).setMax(1).setInputCol("features")
fittedminMax = minMax.fit(scaleDF)
fittedminMax.transform(scaleDF).show()

+---+--------------+---------------------------------+
| id|      features|MinMaxScaler_7742165c9d70__output|
+---+--------------+---------------------------------+
|  0|[1.0,0.1,-1.0]|                        (3,[],[])|
|  1| [2.0,1.1,1.0]|                    [0.5,0.1,0.5]|
|  0|[1.0,0.1,-1.0]|                        (3,[],[])|
|  1| [2.0,1.1,1.0]|                    [0.5,0.1,0.5]|
|  1|[3.0,10.1,3.0]|                    [1.0,1.0,1.0]|
+---+--------------+---------------------------------+



<h4>MaxAbsScaler</h4>

- Max-absolute scaling divides each value by the largest absolute value of that feature.
- All values therefore end up between -1 and 1.
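
The same computation in plain Python, as a sketch only. The function name max_abs_scale is hypothetical; an all-zero column is left at zero here, which is an assumption.

```python
def max_abs_scale(rows):
    """Divide each column of rows by that column's largest absolute value."""
    columns = list(zip(*rows))
    max_abs = [max(abs(x) for x in col) for col in columns]
    return [[x / m if m != 0.0 else 0.0 for x, m in zip(row, max_abs)]
            for row in rows]


features = [[1.0, 0.1, -1.0], [2.0, 1.1, 1.0], [1.0, 0.1, -1.0],
            [2.0, 1.1, 1.0], [3.0, 10.1, 3.0]]
max_abs_scaled = max_abs_scale(features)
```

Unlike MinMaxScaler this does not shift the data, so zeros stay zero and the sign of each value is preserved.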

In [68]:
from pyspark.ml.feature import MaxAbsScaler

In [73]:
maScaler = MaxAbsScaler().setInputCol("features")
fittedmaScaler = maScaler.fit(scaleDF)
fittedmaScaler.transform(scaleDF).show(5, False)

+---+--------------+-------------------------------------------------------------+
|id |features      |MaxAbsScaler_ea07f3e42d7e__output                            |
+---+--------------+-------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009903,-0.3333333333333333]|
|1  |[2.0,1.1,1.0] |[0.6666666666666666,0.10891089108910892,0.3333333333333333]  |
|0  |[1.0,0.1,-1.0]|[0.3333333333333333,0.009900990099009903,-0.3333333333333333]|
|1  |[2.0,1.1,1.0] |[0.6666666666666666,0.10891089108910892,0.3333333333333333]  |
|1  |[3.0,10.1,3.0]|[1.0,1.0,1.0]                                                |
+---+--------------+-------------------------------------------------------------+



<h4>ElementwiseProduct</h4>

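- ElementwiseProduct multiplies each feature vector element by element with a scaling vector, so every feature dimension gets its own scale factor.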
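
In plain Python the operation is a Hadamard product. The sketch below uses a hypothetical helper elementwise_product together with the scaling vector and feature vectors that appear in this section.

```python
def elementwise_product(vector, scaling_vec):
    """Multiply vector by scaling_vec position by position."""
    if len(vector) != len(scaling_vec):
        raise ValueError("vector and scaling vector must have the same length")
    return [v * s for v, s in zip(vector, scaling_vec)]


scale_up_vec = [10.0, 15.0, 20.0]
features = [[1.0, 0.1, -1.0], [2.0, 1.1, 1.0], [3.0, 10.1, 3.0]]
scaled_rows = [elementwise_product(row, scale_up_vec) for row in features]
```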

In [74]:
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors

In [75]:
scaleUpVec = Vectors.dense(10.0,15.0,20.0)
scalingUp = ElementwiseProduct()\
    .setScalingVec(scaleUpVec)\
    .setInputCol("features")

In [76]:
scalingUp.transform(scaleDF).show()

+---+--------------+---------------------------------------+
| id|      features|ElementwiseProduct_aaf031090d55__output|
+---+--------------+---------------------------------------+
|  0|[1.0,0.1,-1.0]|                       [10.0,1.5,-20.0]|
|  1| [2.0,1.1,1.0]|                       [20.0,16.5,20.0]|
|  0|[1.0,0.1,-1.0]|                       [10.0,1.5,-20.0]|
|  1| [2.0,1.1,1.0]|                       [20.0,16.5,20.0]|
|  1|[3.0,10.1,3.0]|                      [30.0,151.5,60.0]|
+---+--------------+---------------------------------------+



<h4>Normalizer</h4>

- Operates on each row: it rescales every row vector to unit p-norm.
- The 1-norm is the Manhattan norm, the sum of the absolute values of all elements.
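
Row-wise normalization is easy to check by hand. The following pure-Python sketch (the function name normalize is hypothetical) divides one vector by its p-norm; with p=1 the absolute values of the result sum to 1.

```python
def normalize(vector, p=2.0):
    """Scale vector so that its p-norm equals 1; a zero vector is returned unchanged."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


l1_row = normalize([1.0, 0.1, -1.0], p=1.0)   # the norm is 1 + 0.1 + 1 = 2.1
l2_row = normalize([3.0, 4.0], p=2.0)         # the norm is 5
```

This is the opposite orientation from the scalers above: each row is treated on its own and the other rows have no influence.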

In [77]:
from pyspark.ml.feature import Normalizer

In [78]:
manhattanDistance = Normalizer().setP(1).setInputCol("features")
manhattanDistance.transform(scaleDF).show()

+---+--------------+-------------------------------+
| id|      features|Normalizer_7592648caac4__output|
+---+--------------+-------------------------------+
|  0|[1.0,0.1,-1.0]|           [0.47619047619047...|
|  1| [2.0,1.1,1.0]|           [0.48780487804878...|
|  0|[1.0,0.1,-1.0]|           [0.47619047619047...|
|  1| [2.0,1.1,1.0]|           [0.48780487804878...|
|  1|[3.0,10.1,3.0]|           [0.18633540372670...|
+---+--------------+-------------------------------+



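<h3>Working with categorical features</h3>

- Indexing converts a categorical variable in a column into a numeric one that machine learning algorithms can consume.

<h4>StringIndexer</h4>

- Maps strings to distinct numeric ids.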

In [79]:
from pyspark.ml.feature import StringIndexer

In [80]:
lblIndexr = StringIndexer().setInputCol("lab").setOutputCol("labelInd")

In [81]:
idxRes = lblIndexr.fit(simpleDF).transform(simpleDF)
idxRes.show()

+-----+----+------+------------------+--------+
|color| lab|value1|            value2|labelInd|
+-----+----+------+------------------+--------+
|green|good|     1|14.386294994851129|     1.0|
| blue| bad|     8|14.386294994851129|     0.0|
| blue| bad|    12|14.386294994851129|     0.0|
|green|good|    15| 38.97187133755819|     1.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     0.0|
|  red|good|    35|14.386294994851129|     1.0|
|  red| bad|     1| 38.97187133755819|     0.0|
|  red| bad|     2|14.386294994851129|     0.0|
|  red| bad|    16|14.386294994851129|     0.0|
|  red|good|    45| 38.97187133755819|     1.0|
|green|good|     1|14.386294994851129|     1.0|
| blue| bad|     8|14.386294994851129|     0.0|
| blue| bad|    12|14.386294994851129|     0.0|
|green|good|    15| 38.97187133755819|     1.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     0.0|
|  red|good|    35|14.386294994851129|  

- StringIndexer can also be applied to non-string columns; such columns are converted to strings before being indexed.

In [82]:
valIndexer = StringIndexer().setInputCol("value1").setOutputCol("valueInd")
valIndexer.fit(simpleDF).transform(simpleDF).show()

+-----+----+------+------------------+--------+
|color| lab|value1|            value2|valueInd|
+-----+----+------+------------------+--------+
|green|good|     1|14.386294994851129|     0.0|
| blue| bad|     8|14.386294994851129|     7.0|
| blue| bad|    12|14.386294994851129|     1.0|
|green|good|    15| 38.97187133755819|     3.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     2.0|
|  red|good|    35|14.386294994851129|     5.0|
|  red| bad|     1| 38.97187133755819|     0.0|
|  red| bad|     2|14.386294994851129|     4.0|
|  red| bad|    16|14.386294994851129|     2.0|
|  red|good|    45| 38.97187133755819|     6.0|
|green|good|     1|14.386294994851129|     0.0|
| blue| bad|     8|14.386294994851129|     7.0|
| blue| bad|    12|14.386294994851129|     1.0|
|green|good|    15| 38.97187133755819|     3.0|
|green|good|    12|14.386294994851129|     1.0|
|green| bad|    16|14.386294994851129|     2.0|
|  red|good|    35|14.386294994851129|  

In [83]:
valIndexer.setHandleInvalid("skip")

StringIndexer_9d3946b539ae

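- StringIndexer is an estimator that must be fit on the input data.
- That is, if a StringIndexer is trained on the inputs "a", "b", and "c" and later sees "d", it raises an error by default. Setting handleInvalid to "skip", as in the cell above, drops such rows instead.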
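
The fit-then-transform behaviour and the handling of unseen labels can be sketched in plain Python. The class SimpleStringIndexer is hypothetical; it assigns index 0 to the most frequent label, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


class SimpleStringIndexer:
    """Hypothetical stand-in that mimics fitting a label-to-index mapping."""

    def __init__(self, handle_invalid="error"):
        self.handle_invalid = handle_invalid
        self.index_of = {}

    def fit(self, labels):
        counts = Counter(labels)
        # Most frequent label first; ties broken alphabetically (assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.index_of = {label: float(i) for i, label in enumerate(ordered)}
        return self

    def transform(self, labels):
        out = []
        for label in labels:
            if label in self.index_of:
                out.append(self.index_of[label])
            elif self.handle_invalid == "skip":
                continue                         # drop rows with unseen labels
            else:
                raise ValueError(f"Unseen label: {label}")
        return out


strict = SimpleStringIndexer().fit(["a", "b", "c", "a"])
lenient = SimpleStringIndexer(handle_invalid="skip").fit(["a", "b", "c", "a"])
skipped = lenient.transform(["a", "d", "c"])     # "d" was never seen during fit
```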

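<h4>Converting indexed values back to text</h4>

- IndexToString maps the indexed numeric values back to the original categories.
- This is useful for turning model predictions (index values) back into the categories they stand for.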

In [84]:
from pyspark.ml.feature import IndexToString

In [85]:
labelReverse = IndexToString().setInputCol("labelInd")
labelReverse.transform(idxRes).show()

+-----+----+------+------------------+--------+----------------------------------+
|color| lab|value1|            value2|labelInd|IndexToString_7c3092fed8a1__output|
+-----+----+------+------------------+--------+----------------------------------+
|green|good|     1|14.386294994851129|     1.0|                              good|
| blue| bad|     8|14.386294994851129|     0.0|                               bad|
| blue| bad|    12|14.386294994851129|     0.0|                               bad|
|green|good|    15| 38.97187133755819|     1.0|                              good|
|green|good|    12|14.386294994851129|     1.0|                              good|
|green| bad|    16|14.386294994851129|     0.0|                               bad|
|  red|good|    35|14.386294994851129|     1.0|                              good|
|  red| bad|     1| 38.97187133755819|     0.0|                               bad|
|  red| bad|     2|14.386294994851129|     0.0|                               bad|
|  r

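<h4>Indexing within vectors</h4>

- When the vectors in a dataset already contain categorical variables, use VectorIndexer.
- It finds the categorical features inside the input vectors and converts them to category indices starting from 0. With setMaxCategories(2), as in the cell below, any feature with at most two distinct values is treated as categorical.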
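
The detection rule can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: the helper names are hypothetical, and numbering the categories in ascending order of their original values is an assumption of this sketch.

```python
def fit_vector_indexer(rows, max_categories):
    """Return {column index: {original value: category index}} for categorical columns."""
    category_maps = {}
    for col_idx, column in enumerate(zip(*rows)):
        distinct = sorted(set(column))
        if len(distinct) <= max_categories:      # few distinct values: categorical
            category_maps[col_idx] = {v: float(i) for i, v in enumerate(distinct)}
    return category_maps


def transform_vector(row, category_maps):
    """Replace values in categorical columns by their index; leave the rest untouched."""
    return [category_maps[i][v] if i in category_maps else v
            for i, v in enumerate(row)]


rows = [[1.0, 2.0, 3.0], [2.0, 5.0, 6.0], [1.0, 8.0, 9.0]]
maps = fit_vector_indexer(rows, max_categories=2)
indexed = [transform_vector(r, maps) for r in rows]
```

Only the first column has two distinct values, so it is the only one re-indexed; the other two columns have three distinct values each and are kept as continuous features.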

In [86]:
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.linalg import Vectors

In [89]:
idxIn = spark.createDataFrame([
    (Vectors.dense(1,2,3),1),
    (Vectors.dense(2,5,6),2),
    (Vectors.dense(1,8,9),3),
]).toDF("features", "label")

In [91]:
indxr = VectorIndexer()\
    .setInputCol("features")\
    .setOutputCol("idxed")\
    .setMaxCategories(2)

In [92]:
indxr.fit(idxIn).transform(idxIn).show()

Py4JJavaError: An error occurred while calling o1009.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 100.0 failed 1 times, most recent failure: Lost task 0.0 in stage 100.0 (TID 166) (26.26.26.1 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
	...
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
	...

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	...
	at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:143)
	...
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
	...
Caused by: java.net.SocketTimeoutException: Accept timed out
	...


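<h4>One-hot encoding</h4>

- So far categorical variables were converted directly to numbers, for example blue to 1 and green to 2.
- Such an encoding is problematic because it implies that green > blue.
- To avoid this, OneHotEncoder converts each distinct category value into a Boolean element of a vector.
- The categories then carry no ordering that a model could pick up.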
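
A pure-Python sketch of the encoding, assuming three categories with indices 0, 1, and 2. The helper one_hot is hypothetical. With drop_last=True, which mirrors the encoder's dropLast default, the last category is represented by the all-zero vector, so three categories need only two positions.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a dense 0/1 vector."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    i = int(index)
    if i < size:
        vec[i] = 1.0          # the dropped last category stays all zeros
    return vec


first = one_hot(0.0, 3)
second = one_hot(1.0, 3)
last = one_hot(2.0, 3)
full = one_hot(2.0, 3, drop_last=False)
```

Dropping the last category avoids a redundant column: if every other position is 0, the row must belong to the remaining category.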

In [93]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

In [94]:
lblIndxr = StringIndexer().setInputCol("color").setOutputCol("colorInd")
colorLab = lblIndxr.fit(simpleDF).transform(simpleDF.select("color"))

In [96]:
colorLab.show(2)

+-----+--------+
|color|colorInd|
+-----+--------+
|green|     1.0|
| blue|     2.0|
+-----+--------+
only showing top 2 rows



In [101]:
ohe = OneHotEncoder().setInputCol("colorInd")
ohe.fit(colorLab).transform(colorLab).show(2)

+-----+--------+----------------------------------+
|color|colorInd|OneHotEncoder_4dfceb2d51f1__output|
+-----+--------+----------------------------------+
|green|     1.0|                     (2,[1],[1.0])|
| blue|     2.0|                         (2,[],[])|
+-----+--------+----------------------------------+
only showing top 2 rows



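<h3>Text data transformers</h3>

- Next we turn to free-form text.

<h4>Tokenizing text</h4>

- Tokenization converts free-form text into a list of tokens, that is, a list of words.
- Tokenizer turns text into a token list.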

In [102]:
from pyspark.ml.feature import Tokenizer

In [209]:
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized = tkn.transform(sales.select("Description")).where("CustomerId IS NOT NULL")
tokenized.show(5, False)

+-----------------------------------+------------------------------------------+
|Description                        |DescOut                                   |
+-----------------------------------+------------------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER |[white, hanging, heart, t-light, holder]  |
|WHITE METAL LANTERN                |[white, metal, lantern]                   |
|CREAM CUPID HEARTS COAT HANGER     |[cream, cupid, hearts, coat, hanger]      |
|KNITTED UNION FLAG HOT WATER BOTTLE|[knitted, union, flag, hot, water, bottle]|
|RED WOOLLY HOTTIE WHITE HEART.     |[red, woolly, hottie, white, heart.]      |
+-----------------------------------+------------------------------------------+
only showing top 5 rows



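- Tokenizer splits on whitespace and lowercases the tokens; RegexTokenizer lets you specify the splitting regular expression yourself.
- The example below uses setPattern to split on a single space.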

In [106]:
from pyspark.ml.feature import RegexTokenizer

In [107]:
rt = RegexTokenizer()\
    .setInputCol("Description")\
    .setOutputCol("DescOut")\
    .setPattern(" ")\
    .setToLowercase(True)

In [108]:
rt.transform(sales.select("Description")).show(5, False)

+-----------------------------------+------------------------------------------+
|Description                        |DescOut                                   |
+-----------------------------------+------------------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER |[white, hanging, heart, t-light, holder]  |
|WHITE METAL LANTERN                |[white, metal, lantern]                   |
|CREAM CUPID HEARTS COAT HANGER     |[cream, cupid, hearts, coat, hanger]      |
|KNITTED UNION FLAG HOT WATER BOTTLE|[knitted, union, flag, hot, water, bottle]|
|RED WOOLLY HOTTIE WHITE HEART.     |[red, woolly, hottie, white, heart.]      |
+-----------------------------------+------------------------------------------+
only showing top 5 rows



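- With setGaps(False), the pattern matches the tokens themselves rather than the gaps between them.
- The example below uses a single space as the pattern, so the only tokens it captures are the spaces.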
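
The two modes correspond to splitting versus matching in Python's re module. The sketch below is a rough analogue, not the RegexTokenizer implementation; the helper regex_tokenize is hypothetical, and the pattern \w+ in the last call is an assumed example of a pattern that does describe words.

```python
import re


def regex_tokenize(text, pattern, gaps=True, to_lowercase=True):
    """Tokenize text by splitting on pattern (gaps=True) or by matching it (gaps=False)."""
    if to_lowercase:
        text = text.lower()
    if gaps:
        tokens = re.split(pattern, text)      # the pattern describes the separators
    else:
        tokens = re.findall(pattern, text)    # the pattern describes the tokens
    return [t for t in tokens if t]           # drop empty strings


desc = "WHITE METAL LANTERN"
split_on_space = regex_tokenize(desc, " ")
match_space = regex_tokenize(desc, " ", gaps=False)
match_words = regex_tokenize(desc, r"\w+", gaps=False)
```

With gaps=False a pattern such as \w+ recovers the words, whereas the single-space pattern only ever returns spaces.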

In [109]:
from pyspark.ml.feature import RegexTokenizer

In [110]:
rt = RegexTokenizer()\
    .setInputCol("Description")\
    .setOutputCol("DescOut")\
    .setPattern(" ")\
    .setGaps(False)\
    .setToLowercase(True)

In [112]:
rt.transform(sales.select("Description")).show(5, False)

+-----------------------------------+---------------+
|Description                        |DescOut        |
+-----------------------------------+---------------+
|WHITE HANGING HEART T-LIGHT HOLDER |[ ,  ,  ,  ]   |
|WHITE METAL LANTERN                |[ ,  ]         |
|CREAM CUPID HEARTS COAT HANGER     |[ ,  ,  ,  ]   |
|KNITTED UNION FLAG HOT WATER BOTTLE|[ ,  ,  ,  ,  ]|
|RED WOOLLY HOTTIE WHITE HEART.     |[ ,  ,  ,  ]   |
+-----------------------------------+---------------+
only showing top 5 rows



<h4>Removing stop words</h4>

- Another common task is removing stop words, common words such as "the" or "and" that carry little meaning.

In [113]:
from pyspark.ml.feature import StopWordsRemover

In [114]:
englishStopWords = StopWordsRemover.loadDefaultStopWords("english")

In [115]:
stops = StopWordsRemover()\
    .setStopWords(englishStopWords)\
    .setInputCol("DescOut")

In [119]:
stops.transform(tokenized).show(5, True)

+--------------------+--------------------+-------------------------------------+
|         Description|             DescOut|StopWordsRemover_bba842b7bc87__output|
+--------------------+--------------------+-------------------------------------+
|WHITE HANGING HEA...|[white, hanging, ...|                 [white, hanging, ...|
| WHITE METAL LANTERN|[white, metal, la...|                 [white, metal, la...|
|CREAM CUPID HEART...|[cream, cupid, he...|                 [cream, cupid, he...|
|KNITTED UNION FLA...|[knitted, union, ...|                 [knitted, union, ...|
|RED WOOLLY HOTTIE...|[red, woolly, hot...|                 [red, woolly, hot...|
+--------------------+--------------------+-------------------------------------+
only showing top 5 rows



<h4>Creating word combinations</h4>

- NGram produces n-grams: sequences of n consecutive tokens.
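
A minimal pure-Python sketch of n-gram generation over a token list. The helper ngrams is hypothetical; it joins each window of n consecutive tokens with a space, which is the form the bigram output in this section shows.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["white", "metal", "lantern"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
too_long = ngrams(tokens, 4)    # fewer tokens than n: no n-grams at all
```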

In [120]:
from pyspark.ml.feature import NGram

In [121]:
unigram = NGram().setInputCol("DescOut").setN(1)
bigram = NGram().setInputCol("DescOut").setN(2)

In [130]:
unigram.transform(tokenized).show(5)

+--------------------+--------------------+--------------------------+
|         Description|             DescOut|NGram_5864baa5de3e__output|
+--------------------+--------------------+--------------------------+
|WHITE HANGING HEA...|[white, hanging, ...|      [white, hanging, ...|
| WHITE METAL LANTERN|[white, metal, la...|      [white, metal, la...|
|CREAM CUPID HEART...|[cream, cupid, he...|      [cream, cupid, he...|
|KNITTED UNION FLA...|[knitted, union, ...|      [knitted, union, ...|
|RED WOOLLY HOTTIE...|[red, woolly, hot...|      [red, woolly, hot...|
+--------------------+--------------------+--------------------------+
only showing top 5 rows



In [152]:
bigram.transform(tokenized).show(5)

+--------------------+--------------------+--------------------------+
|         Description|             DescOut|NGram_f7eb838f0da8__output|
+--------------------+--------------------+--------------------------+
|WHITE HANGING HEA...|[white, hanging, ...|      [white hanging, h...|
| WHITE METAL LANTERN|[white, metal, la...|      [white metal, met...|
|CREAM CUPID HEART...|[cream, cupid, he...|      [cream cupid, cup...|
|KNITTED UNION FLA...|[knitted, union, ...|      [knitted union, u...|
|RED WOOLLY HOTTIE...|[red, woolly, hot...|      [red woolly, wool...|
+--------------------+--------------------+--------------------------+
only showing top 5 rows



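<h4>Converting words into numeric representations</h4>

- Next we count words and word combinations; CountVectorizer does the counting.
- CountVectorizer operates on tokenized data:
    1. During fit, it scans all documents to build the vocabulary and counts how often each of those words occurs across the documents.
    2. During transform, it counts the occurrences of each vocabulary word in every row of the DataFrame column and outputs a vector of those counts for the row.

- Conceptually, the transformer treats each row as a document, each word as a term, and the set of all terms as the vocabulary.
- minTF sets the minimum number of times a term must occur within a document to be counted for that document; minDF sets the minimum number of documents a term must occur in to enter the vocabulary, which removes rare words.
- vocabSize sets the maximum vocabulary size.
- To record only whether a word is present in a document, use setBinary(True).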
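
The two phases can be sketched in plain Python. This is an illustration, not the Spark implementation: the helper names are hypothetical, min_tf is treated as an absolute count only, the toy documents are made up for the example, and the alphabetical tie-break in the vocabulary ordering is an assumption.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df=1):
    """Keep up to vocab_size terms found in at least min_df documents, most frequent first."""
    term_counts = Counter(t for doc in docs for t in doc)
    doc_counts = Counter(t for doc in docs for t in set(doc))
    eligible = [t for t in term_counts if doc_counts[t] >= min_df]
    eligible.sort(key=lambda t: (-term_counts[t], t))   # tie-break is an assumption
    return eligible[:vocab_size]


def count_vectorize(doc, vocabulary, min_tf=1, binary=False):
    """Return a sparse triple (size, indices, values) for one tokenized document."""
    counts = Counter(doc)
    indices, values = [], []
    for idx, term in enumerate(vocabulary):
        c = counts.get(term, 0)
        if c > 0 and c >= min_tf:
            indices.append(idx)
            values.append(1.0 if binary else float(c))
    return (len(vocabulary), indices, values)


docs = [["white", "metal", "lantern"],
        ["white", "hanging", "heart"],
        ["red", "heart", "heart"]]
vocab = fit_vocabulary(docs, vocab_size=500, min_df=2)
counted = count_vectorize(docs[2], vocab)
present = count_vectorize(docs[2], vocab, binary=True)
```

Only "heart" and "white" occur in two or more documents, so min_df=2 leaves a two-word vocabulary and every other word is ignored at transform time.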

In [133]:
from pyspark.ml.feature import CountVectorizer

In [210]:
cv = CountVectorizer()\
    .setInputCol("DescOut")\
    .setOutputCol("countVec")\
    .setVocabSize(500)\
    .setMinTF(1)\
    .setMinDF(2)

In [211]:
tokenized.first()

Row(Description='WHITE HANGING HEART T-LIGHT HOLDER', DescOut=['white', 'hanging', 'heart', 't-light', 'holder'])

In [219]:
fittedCV = cv.fit(tokenized)
fittedCV.transform(tokenized).show(2,False)

+----------------------------------+----------------------------------------+------------------------------------------+
|Description                       |DescOut                                 |countVec                                  |
+----------------------------------+----------------------------------------+------------------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|[white, hanging, heart, t-light, holder]|(500,[4,8,16,23,24],[1.0,1.0,1.0,1.0,1.0])|
|WHITE METAL LANTERN               |[white, metal, lantern]                 |(500,[8,25,155],[1.0,1.0,1.0])            |
+----------------------------------+----------------------------------------+------------------------------------------+
only showing top 2 rows



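- The countVec output is a sparse vector printed as three parts: (vocabulary size, indices of the words present in the row, count of each of those words in the row)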
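
To make the three-part notation concrete, the sketch below expands one of the sparse vectors shown above into a dense list. The helper name `to_dense` is hypothetical and is used here only for illustration.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Second row of the output above: WHITE METAL LANTERN
dense = to_dense(500, [8, 25, 155], [1.0, 1.0, 1.0])
```

All 500 positions are zero except the three listed indices, which is why the sparse form is much more compact.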

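<h4>Term Frequency-Inverse Document Frequency</h4>

- Another way to convert text into numbers is term frequency-inverse document frequency (TF-IDF)
- TF-IDF measures how often a word occurs in each document and weights it according to the number of documents the word appears in
- Words that appear in fewer documents receive a higher weight
- In other words, a word that occurs frequently in a document but appears in only a few documents gets a high TF-IDF value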

In [172]:
tfIdfIn = tokenized\
    .where("array_contains(DescOut, 'red')")\
    .select("DescOut")\
    .limit(10)

In [173]:
tfIdfIn.show(10, False)

+------------------------------------+
|DescOut                             |
+------------------------------------+
|[red, woolly, hottie, white, heart.]|
|[hand, warmer, red, polka, dot]     |
|[red, coat, rack, paris, fashion]   |
|[alarm, clock, bakelike, red]       |
|[set/2, red, retrospot, tea, towels]|
|[red, toadstool, led, night, light] |
|[hand, warmer, red, polka, dot]     |
|[edwardian, parasol, red]           |
|[red, woolly, hottie, white, heart.]|
|[edwardian, parasol, red]           |
+------------------------------------+



- First, hash each word to convert it into a numeric representation
- Then weight each word in the vocabulary by its inverse document frequency
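
The hashing step can be sketched in plain Python as follows. Note the assumptions: `zlib.crc32` stands in for the hash function (Spark's HashingTF uses MurmurHash3, so the indices here will not match Spark's), and the names `term_index` and `hashing_tf` are hypothetical.

```python
import zlib
from collections import Counter

def term_index(term, num_features):
    """Map a term to a bucket; crc32 is a stand-in hash, not Spark's."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashing_tf(tokens, num_features=10000):
    """Return term frequencies as a sparse (size, indices, values) triple."""
    counts = Counter(term_index(t, num_features) for t in tokens)
    indices = sorted(counts)
    return (num_features, indices, [float(counts[i]) for i in indices])

tokens = ["red", "woolly", "hottie", "white", "heart."]
size, indices, values = hashing_tf(tokens)
```

Because the index comes from a hash rather than a fitted vocabulary, no fit step is needed, at the cost that two different words can occasionally land in the same bucket.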

In [175]:
from pyspark.ml.feature import HashingTF, IDF

In [176]:
tf = HashingTF()\
    .setInputCol("DescOut")\
    .setOutputCol("TFOut")\
    .setNumFeatures(10000)

In [177]:
idf = IDF()\
    .setInputCol("TFOut")\
    .setOutputCol("IDFOut")\
    .setMinDocFreq(2)

In [183]:
idf.fit(tf.transform(tfIdfIn)).transform(tf.transform(tfIdfIn)).show(5,False)

+------------------------------------+------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
|DescOut                             |TFOut                                                 |IDFOut                                                                                                            |
+------------------------------------+------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
|[red, woolly, hottie, white, heart.]|(10000,[52,388,691,5843,9470],[1.0,1.0,1.0,1.0,1.0])  |(10000,[52,388,691,5843,9470],[0.0,1.2992829841302609,1.2992829841302609,1.2992829841302609,1.2992829841302609])  |
|[hand, warmer, red, polka, dot]     |(10000,[52,3197,3423,8151,8944],[1.0,1.0,1.0,1.0,1.0])|(10000,[52,3197,3423,8151,8944],[0.0,1.2992829841302609,1.2992829841302

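- Each vector is represented by three values: the total vocabulary size, the hash of every word that appears in the document, and the weight of each of those words
- After applying IDF, the weight of red drops to 0.0, because red appears in every one of the selected documents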
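
The weights in the output above can be checked by hand. The sketch below assumes the formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term, and it assumes that terms below minDocFreq receive a weight of 0. The function name `idf_weight` is hypothetical.

```python
import math

def idf_weight(num_docs, doc_freq, min_doc_freq=0):
    """Inverse document frequency, log((m + 1) / (df + 1))."""
    if doc_freq < min_doc_freq:
        return 0.0
    return math.log((num_docs + 1) / (doc_freq + 1))

# The ten tokenized descriptions shown in tfIdfIn above.
docs = [
    ["red", "woolly", "hottie", "white", "heart."],
    ["hand", "warmer", "red", "polka", "dot"],
    ["red", "coat", "rack", "paris", "fashion"],
    ["alarm", "clock", "bakelike", "red"],
    ["set/2", "red", "retrospot", "tea", "towels"],
    ["red", "toadstool", "led", "night", "light"],
    ["hand", "warmer", "red", "polka", "dot"],
    ["edwardian", "parasol", "red"],
    ["red", "woolly", "hottie", "white", "heart."],
    ["edwardian", "parasol", "red"],
]

def doc_freq(term):
    return sum(1 for tokens in docs if term in tokens)

red_weight = idf_weight(len(docs), doc_freq("red"), min_doc_freq=2)
woolly_weight = idf_weight(len(docs), doc_freq("woolly"), min_doc_freq=2)
coat_weight = idf_weight(len(docs), doc_freq("coat"), min_doc_freq=2)
```

Here red appears in all 10 documents, so its weight is log(11 / 11) = 0, while woolly appears in 2 documents and gets log(11 / 3), roughly 1.2993, which agrees with the IDFOut column.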

<h4>Word2Vec</h4>

- Word2Vec computes vector representations for a set of words

In [184]:
from pyspark.ml.feature import Word2Vec

In [185]:
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "),),
    ("I wish Java could use case classes".split(" "),),
    ("Logistic regression models are neat".split(" "),),
], ["text"])

In [186]:
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="Result")

In [187]:
model = word2Vec.fit(documentDF)
result = model.transform(documentDF)

Py4JJavaError: An error occurred while calling o2187.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 173.0 failed 1 times, most recent failure: Lost task 8.0 in stage 173.0 (TID 239) (26.26.26.1 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:107)
	...
Caused by: java.net.SocketTimeoutException: Accept timed out
	at java.base/sun.nio.ch.NioSocketImpl.timedAccept(NioSocketImpl.java:708)
	...
	at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:174)
	... 48 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	...
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.mllib.feature.Word2Vec.learnVocab(Word2Vec.scala:191)
	at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:312)
	at org.apache.spark.ml.feature.Word2Vec.fit(Word2Vec.scala:183)
	...
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
	...
Caused by: java.net.SocketTimeoutException: Accept timed out
	...
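
The fit call above fails with "Python worker failed to connect back", which means the JVM could not start or reach a Python worker process. One commonly reported cause is that Spark launches a different Python interpreter than the one running the notebook. The sketch below is a possible workaround under that assumption, not a confirmed fix for this environment: it points the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables at the current interpreter. It would need to run before the SparkSession is created.

```python
import os
import sys

# Assumption: the failure is caused by Spark picking up the wrong interpreter.
# Point both the workers and the driver at the Python running this notebook.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
```

If the error persists after restarting the session, the cause is probably elsewhere, for example in the local network configuration.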


In [None]:
for row in result.collect():
    text, vector = row
    print("Text: [%s] => \nVector: %s\n" % (",".join(text), str(vector)))
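
Since the cell above produced no output, here is a rough plain-Python sketch of what the transform step does conceptually. According to the Spark ML documentation, a Word2VecModel turns each document into a single vector by averaging the vectors of its words. The word vectors below are made-up toy values, the function name `document_vector` is hypothetical, and the sketch assumes every token has a learned vector (as is the case with minCount=0).

```python
def document_vector(tokens, word_vectors):
    """Average the per-word vectors of one document."""
    vectors = [word_vectors[t] for t in tokens]
    size = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(size)]

# Toy 3-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "logistic": [1.0, 0.0, 0.0],
    "regression": [0.0, 1.0, 0.0],
}
doc_vec = document_vector(["logistic", "regression"], toy_vectors)
```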

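<h3>Feature Manipulation</h3>

- Almost every ML transformer manipulates the feature space in some way

<h4>PCA</h4>

- Principal component analysis (PCA) is a mathematical technique for finding the most important aspects of the data (the principal components)
- PCA uses the original feature set to create a smaller, more meaningful set of new features
- The following example reduces the data to two dimensions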

In [188]:
from pyspark.ml.feature import PCA

In [189]:
pca = PCA().setInputCol("features").setK(2)
pca.fit(scaleDF).transform(scaleDF).show(20, False)

+---+--------------+------------------------------------------+
|id |features      |PCA_5b1e2212ae69__output                  |
+---+--------------+------------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|0  |[1.0,0.1,-1.0]|[0.0713719499248417,-0.4526654888147805]  |
|1  |[2.0,1.1,1.0] |[-1.6804946984073723,1.2593401322219198]  |
|1  |[3.0,10.1,3.0]|[-10.872398139848944,0.030962697060155975]|
+---+--------------+------------------------------------------+
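
To show where these numbers come from, the sketch below recomputes the two principal components in plain Python. It is a minimal illustration, not Spark's algorithm: it uses power iteration with deflation on the sample covariance matrix, and the function names are hypothetical. The rows are projected without centering, which appears to reproduce the magnitudes in the output above; signs may differ because an eigenvector is only defined up to sign.

```python
import math

rows = [
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [1.0, 0.1, -1.0],
    [2.0, 1.1, 1.0],
    [3.0, 10.1, 3.0],
]

def covariance(data):
    """Sample covariance matrix of the columns."""
    n, d = len(data), len(data[0])
    mean = [sum(r[j] for r in data) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in data]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and unit eigenvector via power iteration."""
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(v[i] * mv[i] for i in range(d)), v

def principal_components(data, k):
    """Top-k components, removing each found component by deflation."""
    cov = covariance(data)
    d = len(cov)
    components = []
    for _ in range(k):
        eigenvalue, v = top_eigenpair(cov)
        components.append(v)
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return components

components = principal_components(rows, 2)
projected = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
             for row in rows]
```

Each entry of `projected` is the dot product of a row with one component, so the last row, which is far from the others along the second feature, gets by far the largest coordinate on the first component.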



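<h3>Interaction</h3>

- An interaction multiplies two feature variables together
- RFormula can create them, as in the color:value1 term of the earlier RFormula example

<h4>Polynomial Expansion</h4>

- The following shows a second-degree polynomial example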

In [190]:
from pyspark.ml.feature import PolynomialExpansion

In [191]:
pe = PolynomialExpansion().setInputCol("features").setOutputCol("res").setDegree(2)

In [194]:
pe.transform(scaleDF).show(5, False)

+---+--------------+-----------------------------------------------------------------------------------+
|id |features      |res                                                                                |
+---+--------------+-----------------------------------------------------------------------------------+
|0  |[1.0,0.1,-1.0]|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0]                          |
|1  |[2.0,1.1,1.0] |[2.0,4.0,1.1,2.2,1.2100000000000002,1.0,2.0,1.1,1.0]                               |
|0  |[1.0,0.1,-1.0]|[1.0,1.0,0.1,0.1,0.010000000000000002,-1.0,-1.0,-0.1,1.0]                          |
|1  |[2.0,1.1,1.0] |[2.0,4.0,1.1,2.2,1.2100000000000002,1.0,2.0,1.1,1.0]                               |
|1  |[3.0,10.1,3.0]|[3.0,9.0,10.1,30.299999999999997,102.00999999999999,3.0,9.0,30.299999999999997,9.0]|
+---+--------------+-----------------------------------------------------------------------------------+
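
The order of the terms in the res column can be reproduced with a few lines of plain Python. This sketch covers only degree 2, and the function name is hypothetical; the ordering it produces, (x, x², y, xy, y², z, xz, yz, z²), is inferred from the output above.

```python
def polynomial_expansion_degree2(features):
    """Expand a feature vector into all terms of degree 1 and 2."""
    expanded = []
    for i, value in enumerate(features):
        expanded.append(value)  # the linear term
        for j in range(i + 1):
            # products with every earlier feature, then the square
            expanded.append(features[j] * value)
    return expanded

expanded = polynomial_expansion_degree2([2.0, 1.1, 1.0])
```

For three input features this yields nine output features, so the expansion grows quickly as the number of inputs or the degree increases.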



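<h3>Feature Selection</h3>

- There is often a large pool of candidate features, and we want to pick a smaller subset for training
- Many features may be closely correlated, in which case keeping only one of them is enough
- The process of filtering features this way is feature selection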

<h4>ChiSqSelector</h4>
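
ChiSqSelector ranks categorical features by a chi-squared test of independence against the label. The plain-Python sketch below illustrates the idea on a toy dataset. It is a simplification under stated assumptions: it ranks by the raw chi-squared statistic rather than by p-value, the data is invented, and the function names are hypothetical.

```python
from collections import Counter

def chi_square(feature_values, labels):
    """Chi-squared statistic of independence between a feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected
    return statistic

def select_top_features(columns, labels, k):
    """Indices of the k feature columns with the highest statistic."""
    scores = [(chi_square(col, labels), idx) for idx, col in enumerate(columns)]
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return sorted(idx for _, idx in ranked[:k])

labels = ["a", "a", "b", "b"]
columns = [
    [0, 1, 0, 1],  # independent of the label
    [0, 0, 1, 1],  # perfectly aligned with the label
    [1, 1, 1, 0],  # weakly related to the label
]
selected = select_top_features(columns, labels, k=2)
```

The first column carries no information about the label, so it is the one that gets dropped.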

In [198]:
from pyspark.ml.feature import ChiSqSelector, Tokenizer

In [223]:
tkn = Tokenizer().setInputCol("Description").setOutputCol("DescOut")
tokenized2 = tkn.transform(sales.select("Description","CustomerId"))\
    .where("CustomerId IS NOT NULL")

In [230]:
prechi = fittedCV.transform(tokenized2)\
    .where("CustomerId IS NOT NULL")
prechi.show(2,False)

+----------------------------------+----------+----------------------------------------+------------------------------------------+
|Description                       |CustomerId|DescOut                                 |countVec                                  |
+----------------------------------+----------+----------------------------------------+------------------------------------------+
|WHITE HANGING HEART T-LIGHT HOLDER|17850.0   |[white, hanging, heart, t-light, holder]|(500,[4,8,16,23,24],[1.0,1.0,1.0,1.0,1.0])|
|WHITE METAL LANTERN               |17850.0   |[white, metal, lantern]                 |(500,[8,25,155],[1.0,1.0,1.0])            |
+----------------------------------+----------+----------------------------------------+------------------------------------------+
only showing top 2 rows



In [231]:
chisq = ChiSqSelector()\
    .setFeaturesCol("countVec")\
    .setLabelCol("CustomerId")\
    .setNumTopFeatures(2)

In [232]:
chisq.fit(prechi).transform(prechi).drop("CustomerId","Description","DescOut").show()

+--------------------+----------------------------------+
|            countVec|ChiSqSelector_238960b47f7c__output|
+--------------------+----------------------------------+
|(500,[4,8,16,23,2...|                         (2,[],[])|
|(500,[8,25,155],[...|                         (2,[],[])|
|(500,[49,97,109,1...|                         (2,[],[])|
|(500,[9,11,12,52,...|                         (2,[],[])|
|(500,[0,8,174,177...|                         (2,[],[])|
|(500,[1,37,59,220...|                         (2,[],[])|
|(500,[16,24,31,45...|                         (2,[],[])|
|(500,[17,18,52,86...|                         (2,[],[])|
|(500,[0,17,18,302...|                         (2,[],[])|
|(500,[53,62,89,24...|                         (2,[],[])|
|(500,[387,449],[1...|                         (2,[],[])|
|(500,[237,387,449...|                         (2,[],[])|
|(500,[33,118,156,...|                         (2,[],[])|
|(500,[42,84,99,19...|                         (2,[],[])|
|(500,[2,6,30,