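<h4>Binary Classification</h4>

- Example: classifying email as spam or not spam

<h4>Multiclass Classification</h4>

- Choose exactly one label from more than two candidate labels

<h4>Multilabel Classification</h4>

- A single input can be assigned several labels at once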

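<h4>Classification Models in MLlib</h4>

- Logistic regression
- Decision tree
- Random forests
- Gradient-boosted trees

- Spark does not natively support multilabel classification; to get multilabel predictions, train one model per label and combine the results yourself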

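<h4>Model Scalability</h4>

|Model|Number of features|Training examples|Output classes
|:----|:----|:----|:----
|Logistic regression|1 million to 10 million|No limit|features × classes < 10 million
|Decision tree|1,000|No limit|features × classes < 10,000
|Random forest|10,000|No limit|features × classes < 100,000
|Gradient-boosted trees|1,000|No limit|features × classes < 10,000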

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()

In [12]:
bInput = spark.read.format("parquet").load("./data/binary-classification/part-r-00007-e02e56d5-d522-4b93-a7f2-f2dc1b2fdba9.gz.parquet")

In [15]:
%%html
<img src="img/load_bInput.png" width="600">

In [13]:
bInput = bInput.selectExpr("features","cast(label as double) as label")

In [17]:
%%html
<img src="img/cast_bInput.png" width="400">

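<h4>Logistic Regression</h4>

- A linear model that assigns a weight to each input feature and combines the weighted features into a probability
- The hyperparameters of logistic regression are explained below
- family
    - Can be set to "multinomial" (multiclass) or "binomial" (binary); the default "auto" chooses based on the number of label classes
- elasticNetParam
    - A floating-point value from 0 to 1 that mixes L1 and L2 regularization following the elastic net method: 0 is pure L2, 1 is pure L1
    - L1 regularization produces sparsity in the model: weights that have little influence on the output are driven to exactly 0
    - L2 regularization does not produce sparsity: weights shrink toward zero but do not reach it
- fitIntercept
    - true or false; determines whether to fit an intercept term. You will typically want to fit the intercept if the training data has not been normalized
- regParam
    - A value >= 0 that sets the weight of the regularization term in the objective function. A good choice depends on the noise level and dimensionality of the dataset, so try several values (for example 0, 0.01, 0.1, 1)
- standardization
    - true or false; determines whether the input features are standardized before being passed to the model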
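
To make the interaction between `regParam` and `elasticNetParam` concrete, here is a minimal plain-Python sketch of the elastic net penalty, assuming the form `regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)` with `alpha = elasticNetParam`. The function name and the sample weights are made up for illustration; this is not Spark's internal code.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: regParam * (a * L1 + (1 - a) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix


weights = [0.5, -2.0, 0.0]  # hypothetical coefficients, for illustration only
pure_l2 = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
pure_l1 = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)
print(pure_l2, pure_l1, mixed)
```

Setting `reg_param` to 0 removes the penalty entirely, which is why 0 is worth including when trying several `regParam` values.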

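<h4>Training Parameters for Logistic Regression</h4>

- maxIter
    - The maximum number of iterations; the default is 100. Changing it usually has little effect on the result, so it should not be the first parameter to tune
- tol
    - The convergence threshold for stopping early: once the change between iterations falls below this tolerance, the model is considered sufficiently optimized and training stops before reaching maxIter
- weightCol
    - The name of a weight column that assigns each row its own training weight, so that some rows count more than others
    - Useful when you know which examples have more reliable labels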
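
The effect of a per-row weight can be sketched as a weighted log loss, where a row with weight 3 counts as much as three identical rows with weight 1. This is a plain-Python illustration under that assumption; Spark's actual objective may normalize differently, and the function name and data are hypothetical.

```python
import math


def weighted_log_loss(labels, probabilities, weights):
    """Weighted average of the per-row negative log-likelihood."""
    total = 0.0
    for y, p, w in zip(labels, probabilities, weights):
        # y is 0 or 1; p is the predicted probability of class 1
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


labels = [1, 0]
probabilities = [0.9, 0.2]  # hypothetical model outputs
uniform = weighted_log_loss(labels, probabilities, [1.0, 1.0])
trusted_first = weighted_log_loss(labels, probabilities, [3.0, 1.0])
print(uniform, trusted_first)
```

Here the first row is predicted well, so trusting it more lowers the overall loss; during training the same mechanism makes the optimizer pay more attention to the rows with larger weights.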

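<h4>Prediction Parameters</h4>

- threshold
    - A Double between 0 and 1 used as the probability cutoff at prediction time. Adjust it to trade off false positives against false negatives; if false positives are costly, you may want a very high threshold
- thresholds
    - An array with one threshold per class, used in multiclass classification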
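
The trade-off controlled by `threshold` can be shown with a small plain-Python sketch. It assumes the positive class is predicted when its probability exceeds the threshold; the helper name and the scores are hypothetical.

```python
def predict_binary(probability_of_positive, threshold=0.5):
    """Return 1.0 when the positive-class probability exceeds the threshold."""
    return 1.0 if probability_of_positive > threshold else 0.0


scores = [0.30, 0.55, 0.80, 0.95]  # hypothetical positive-class probabilities
default_preds = [predict_binary(p) for p in scores]
strict_preds = [predict_binary(p, threshold=0.9) for p in scores]
print(default_preds)
print(strict_preds)
```

Raising the threshold turns borderline positives into negatives, which reduces false positives at the cost of more false negatives.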

<h4>Example</h4>

In [7]:
from pyspark.ml.classification import LogisticRegression

In [8]:
lr = LogisticRegression()

- The parameters of LogisticRegression are listed below

In [9]:
print(lr.explainParams())

aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bounds vector size must beequal wi

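- Once the model has been trained, you can inspect its coefficients and intercept
- This example has three features and one label, so there should be three coefficients and one intercept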

In [18]:
%%html
<img src="img/lrmodel.png" width="600">
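
As a rough illustration of how three coefficients and one intercept turn a feature vector into a prediction, the sketch below applies the logistic (sigmoid) function to the weighted sum. The coefficient and intercept values are hypothetical placeholders, not the fitted values from this notebook.

```python
import math


def predict_probability(features, coefficients, intercept):
    """Logistic regression: sigmoid of the weighted feature sum plus intercept."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))


coefficients = [0.5, -1.0, 2.0]  # hypothetical, one per feature
intercept = -0.5                 # hypothetical
probability = predict_probability([1.0, 2.0, 0.5], coefficients, intercept)
print(probability)
```

With all-zero coefficients and a zero intercept the function returns 0.5, which is why an intercept matters when the features are not centered.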

<h4>Model Summary</h4>

- Reports information about the final trained model

In [21]:
%%html
<img src="img/lrmodelsum.png" width="200">
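
A small helper like the following could collect a few fields from the summary of a fitted binary logistic regression model. It is a sketch that assumes the model exposes `summary.areaUnderROC`, `summary.objectiveHistory`, and `summary.totalIterations`, as the binary training summary in recent PySpark versions does; the helper name is made up for this note.

```python
def describe_training_summary(lr_model):
    """Collect a few fields from a fitted binary LogisticRegressionModel."""
    summary = lr_model.summary  # training summary attached by fit()
    return {
        "areaUnderROC": summary.areaUnderROC,
        # loss value after each iteration, useful for checking convergence
        "objectiveHistory": list(summary.objectiveHistory),
        "totalIterations": summary.totalIterations,
    }
```

With the estimator and data from this notebook, a call such as `describe_training_summary(lr.fit(bInput))` would return these values as a plain dictionary.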

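<h4>Decision Tree</h4>

- Builds a tree structure from the input features; at prediction time it follows the branch decisions down to a leaf, which gives the prediction
- Its weakness is that it overfits very quickly: an unconstrained tree can create a separate decision path for every training example
- In effect it encodes all the information in the training set into the model
- The model hyperparameters are explained below
- maxDepth
    - The maximum depth of the tree, which limits overfitting; the default is 5
- maxBins
    - Continuous features are discretized into categorical ones; this sets the maximum number of bins created from a continuous feature
- impurity
    - The impurity measure, either "entropy" or "gini"
- minInfoGain
    - The minimum information gain a split must achieve to be accepted
- minInstancesPerNode
    - The minimum number of training instances each child node must contain after a split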
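
The two impurity measures and the information gain of a split can be sketched in plain Python as follows. This uses the textbook definitions (Gini as one minus the sum of squared class proportions, entropy in bits) and is only an illustration; the function names and the tiny label lists are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = len(left) / n * impurity(left) + len(right) / n * impurity(right)
    return impurity(parent) - children


parent = [0, 0, 1, 1]
gain_gini = information_gain(parent, [0, 0], [1, 1])
gain_entropy = information_gain(parent, [0, 0], [1, 1], impurity=entropy)
print(gain_gini, gain_entropy)
```

A candidate split whose gain falls below `minInfoGain`, or that leaves a child with fewer rows than `minInstancesPerNode`, would be rejected.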

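<h4>Training Parameters</h4>

- checkpointInterval
    - Checkpointing saves intermediate state during training
    - A value of 10 means a checkpoint is written every 10 iterations; per the parameter description below, the setting is ignored unless a checkpoint directory is set on the SparkContext

<h4>Prediction Parameters</h4>

- thresholds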

In [26]:
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier()
print(dt.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featuresCol: features column name. (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini (default: gini)
labelCol: label column name. (default: label)
leafCol: Leaf indices column name. Predicted leaf index of each instance in each tree by preorder. (default: )
maxBins: Max number of bins for discretizing continuous features.  Must be 

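<h4>Random Forests and Gradient-Boosted Trees</h4>

- Both random forests and gradient-boosted trees (GBT) extend the decision tree by training multiple trees on different subsets of the data
- A random forest trains a large number of trees and averages their results to make a prediction
- Gradient-boosted trees instead combine the trees as a weighted sum of their predictions
- Model hyperparameters
- Random forest only
    - numTrees: the total number of trees to train
    - featureSubsetStrategy: how many features to consider at each split; it can be "auto", "all", "onethird", "sqrt", "log2", or a number n (the GBTClassifier output below lists this parameter as well)
- Gradient-boosted trees (GBT) only
    - lossType: the loss function to minimize
    - maxIter: the number of iterations; the default for GBTClassifier is 20
    - stepSize: the step size, i.e. the learning rate of the algorithm; the default is 0.1 and it can be any value in (0, 1]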
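
The difference between the two ways of combining trees can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: it assumes a random forest averages per-tree class probabilities, and that a binary GBT sums weighted tree outputs and predicts class 1 when the sum is positive. All names and numbers are hypothetical.

```python
def random_forest_predict(tree_probabilities):
    """Average per-tree class probabilities and pick the most likely class."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [
        sum(probs[c] for probs in tree_probabilities) / n_trees
        for c in range(n_classes)
    ]
    best_class = max(range(n_classes), key=lambda c: averaged[c])
    return best_class, averaged


def gbt_predict(tree_outputs, tree_weights):
    """Sum the weighted tree outputs; a positive margin means class 1."""
    margin = sum(w * out for w, out in zip(tree_weights, tree_outputs))
    return (1 if margin > 0 else 0), margin


rf_class, rf_probs = random_forest_predict([[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]])
gbt_class, gbt_margin = gbt_predict([1.0, -0.5, 0.8], [1.0, 0.1, 0.1])
print(rf_class, rf_probs)
print(gbt_class, gbt_margin)
```

In the GBT call the later trees carry a weight of 0.1, mirroring how a small `stepSize` shrinks the contribution of each additional tree.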

<h4>Training Parameters</h4>

- There is only one training parameter, checkpointInterval

In [27]:
from pyspark.ml.classification import RandomForestClassifier
rfClassifier = RandomForestClassifier()
print(rfClassifier.explainParams())

bootstrap: Whether bootstrap samples are used when building trees. (default: True)
cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the featur

In [29]:
from pyspark.ml.classification import GBTClassifier
gbtClassifier = GBTClassifier()
print(gbtClassifier.explainParams())

cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 

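<h4>Naive Bayes</h4>

- The model's core assumption is that all features of the data are independent of one another given the class; it is commonly used for text and document classification
- There are two model types
    - Multivariate Bernoulli model: indicator variables represent whether a word is present in the document
    - Multinomial model: uses the counts of every word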
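
The two model types expect different feature encodings of the same document. The sketch below builds both in plain Python for a toy vocabulary; the vocabulary, the tokens, and the function names are made up for illustration.

```python
from collections import Counter


def multinomial_features(tokens, vocabulary):
    """Word counts: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def bernoulli_features(tokens, vocabulary):
    """Indicators: 1 if the vocabulary word occurs at all, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


vocabulary = ["free", "meeting", "offer"]
tokens = "free offer free free".split()
count_vector = multinomial_features(tokens, vocabulary)
indicator_vector = bernoulli_features(tokens, vocabulary)
print(count_vector)
print(indicator_vector)
```

The count vector keeps the information that "free" appears three times, while the indicator vector only records that it appears.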

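<h4>Model Hyperparameters</h4>

- modelType: "bernoulli" or "multinomial" (the explainParams output below also lists "gaussian")
- weightCol: allows assigning different weights to different data points

<h4>Training Parameters</h4>

- smoothing: the amount of additive smoothing to apply; must be >= 0, default 1.0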
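
Additive smoothing can be illustrated with the standard formula `(count + smoothing) / (total + smoothing * vocabulary_size)`. The plain-Python sketch below applies it to hypothetical word counts for one class; it shows the idea rather than Spark's exact computation.

```python
def smoothed_likelihoods(word_counts, smoothing=1.0):
    """Estimate P(word | class) with additive smoothing."""
    total = sum(word_counts) + smoothing * len(word_counts)
    return [(count + smoothing) / total for count in word_counts]


word_counts = [3, 0, 1]  # hypothetical counts of three words in one class
with_smoothing = smoothed_likelihoods(word_counts, smoothing=1.0)
without_smoothing = smoothed_likelihoods(word_counts, smoothing=0.0)
print(with_smoothing)
print(without_smoothing)
```

Without smoothing the unseen word gets probability 0, which would zero out the whole product for any document containing it; smoothing avoids that.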

<h4>Prediction Parameters</h4>

- thresholds
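
According to the parameter description printed by `explainParams()`, the class with the largest value `p/t` is predicted, where `p` is the class probability and `t` its threshold. A minimal plain-Python sketch of that rule, assuming every threshold is greater than 0 (the function name and numbers are hypothetical):

```python
def predict_with_thresholds(probabilities, thresholds):
    """Return the index of the class with the largest probability/threshold ratio."""
    ratios = [p / t for p, t in zip(probabilities, thresholds)]
    return max(range(len(ratios)), key=lambda i: ratios[i])


probabilities = [0.5, 0.3, 0.2]  # hypothetical class probabilities
plain_choice = predict_with_thresholds(probabilities, [1.0, 1.0, 1.0])
adjusted_choice = predict_with_thresholds(probabilities, [0.9, 0.3, 0.5])
print(plain_choice, adjusted_choice)
```

Lowering the threshold of a class makes it easier to predict, so the second call picks class 1 even though class 0 has the highest raw probability.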

In [30]:
from pyspark.ml.classification import NaiveBayes

In [31]:
nb = NaiveBayes()
print(nb.explainParams())

featuresCol: features column name. (default: features)
labelCol: label column name. (default: label)
modelType: The model type which is a string (case-sensitive). Supported options: multinomial (default), bernoulli and gaussian. (default: multinomial)
predictionCol: prediction column name. (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities. (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name. (default: rawPrediction)
smoothing: The smoothing parameter, should be >= 0, default is 1.0 (default: 1.0)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, w