[Data Science and AI Tech Notes, Part 12: Logistic Regression](https://github.com/apachecn/ds-ai-tech-notes/blob/master/12.md)
# Fast hyperparameter tuning with Cs
---

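Sometimes the characteristics of a learning algorithm let us search for the best **hyperparameters** (parameters whose values are set before the learning process begins) much faster than brute-force or randomized model search.
>Some examples of hyperparameters:
 - Number of trees or depth of trees
 - Number of latent factors in a matrix factorization
 - Learning rate (in many models)
 - Number of hidden layers in a deep neural network
 - Number of clusters in k-means clustering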

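scikit-learn's `LogisticRegressionCV` class takes a parameter `Cs`. If a list is supplied, `Cs` holds the candidate values of `C`, the inverse of the regularization strength, to choose from. If an integer is supplied, that many candidate values are drawn on a logarithmic scale between 0.0001 and 10000 (a reasonable range of values for `C`). As in support vector machines, smaller values specify stronger regularization.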
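As a rough illustration of the integer case, the sketch below builds such a grid by hand with `np.logspace`. It assumes the candidates are evenly spaced in log10 between 1e-4 and 1e4, as the description above suggests; the names `n_candidates` and `candidate_Cs` are only illustrative.

```python
import numpy as np

# Hypothetical stand-in for passing Cs=5: five candidates, log-spaced
# between 1e-4 and 1e4 (an assumption based on the description above).
n_candidates = 5
candidate_Cs = np.logspace(-4, 4, n_candidates)
print(candidate_Cs)

# On a log scale, neighbouring candidates differ by a constant factor.
ratios = candidate_Cs[1:] / candidate_Cs[:-1]
print(ratios)
```

With five candidates the grid is 1e-4, 1e-2, 1, 1e2, 1e4, so each step multiplies `C` by 100.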

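`LogisticRegressionCV` implements logistic regression with built-in cross-validation support, finding the best `C` and `l1_ratio` parameters according to the `scoring` attribute.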

In [13]:
from sklearn import datasets, linear_model
import numpy as np
np.set_printoptions(threshold=15)

In [29]:
iris = datasets.load_iris()
X = iris.data[:100]
y = iris.target[:100]
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       ...,
       [6.2, 2.9, 4.3, 1.3],
       [5.1, 2.5, 3. , 1.1],
       [5.7, 2.8, 4.1, 1.3]])

'''
Cs [list of floats or int, optional (default=10)] Each of the values in Cs describes the inverse of
regularization strength. If Cs is as an int, then a grid of Cs values are chosen in a logarithmic scale between 1e-4 and 1e4. Like in support vector machines, smaller values specify
stronger regularization
'''

In [26]:
# Create a cross-validated logistic regression
clf = linear_model.LogisticRegressionCV(Cs=100, multi_class='auto', cv=5)
clf
# multi_class='auto' selects 'ovr' if the data is binary or if
# solver='liblinear', and otherwise selects 'multinomial'.

LogisticRegressionCV(Cs=100, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [27]:
clf.fit(X, y)

LogisticRegressionCV(Cs=100, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [30]:
new_observation = [[5.0, 3.5, 1.3, 0.2], [6.2, 2.9, 4.3, 1.3]]
clf.predict(new_observation)

array([0, 1])
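After fitting, it can be useful to check which value cross-validation settled on. The following is a minimal sketch, assuming the fitted estimator exposes the candidate grid as `Cs_` and the selected value as `C_`. It uses `Cs=10` and `max_iter=1000` only to keep the run short and reduce convergence warnings; no particular output is implied.

```python
import numpy as np
from sklearn import datasets, linear_model

# Same two-class subset of iris as in the cells above.
iris = datasets.load_iris()
X, y = iris.data[:100], iris.target[:100]

# Cs=10 and max_iter=1000 are illustrative choices, not the notebook's settings.
clf = linear_model.LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
clf.fit(X, y)

# Cs_ is the grid that was searched; C_ is the value picked by cross-validation.
candidate_Cs = np.asarray(clf.Cs_)
best_C = float(np.ravel(clf.C_)[0])
print("candidates:", candidate_Cs)
print("selected C:", best_C)
```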

# Handling imbalanced classes in logistic regression
---
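Like many other learning algorithms in scikit-learn, `LogisticRegression` comes with a built-in way of handling imbalanced classes. If the classes are highly imbalanced and we have not addressed this during preprocessing, we can use the `class_weight` parameter to weight the classes so that each one contributes a balanced share. Specifically, the `balanced` option automatically weights classes inversely proportional to their frequency:
$$w_j = \frac {n}{kn_j}$$
where $w_j$ is the weight of class $j$, $n$ is the total number of observations, $n_j$ is the number of observations in class $j$, and $k$ is the total number of classes.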
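To make the formula concrete, here is a small worked sketch in NumPy. It assumes a synthetic label vector with 10 observations of class 0 and 100 of class 1, mirroring the class counts in the example that follows; the variable names are only illustrative.

```python
import numpy as np

# Synthetic labels: 10 observations of class 0, 100 of class 1.
y = np.array([0] * 10 + [1] * 100)

n = y.size                      # total number of observations
class_counts = np.bincount(y)   # n_j for each class j
k = class_counts.size           # number of classes

# w_j = n / (k * n_j)
weights = n / (k * class_counts)
print(weights)
```

The rare class gets weight 110 / (2 × 10) = 5.5 and the common class gets 110 / (2 × 100) = 0.55, so each class carries the same total weight (55) during fitting.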

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [18]:
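# Make the classes highly imbalanced by removing the first 40 observations
X = iris.data[40:, :]
y = iris.target[40:]
# Create a binary target vector indicating whether the class is 0
y = np.where(y==0, 0, 1)  # 10 observations keep label 0; the other 100 are set to 1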

In [23]:
scaler = StandardScaler()
# The standard score of a sample x is calculated as:z = (x - u) /s
# where u is the mean of the training samples or zero if with_mean=False, 
# and s is the standard deviation of the training sample
X_std = scaler.fit_transform(X)
X_std

array([[-1.51810416,  1.60348073, -2.61034098, -2.20112727],
       [-2.18607   , -1.67806123, -2.61034098, -2.20112727],
       [-2.31966316,  0.78309524, -2.61034098, -2.37721745],
       ...,
       [ 0.48579333,  0.23617158,  0.48056788,  0.79240582],
       [ 0.08501383,  1.3300189 ,  0.63907603,  1.32067636],
       [-0.31576567,  0.23617158,  0.40131381,  0.44022545]])
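The standard-score formula quoted in the comment above can be checked by hand. This is a minimal sketch on a made-up array `X_demo`, assuming the population standard deviation (`ddof=0`, NumPy's default) is the one used for scaling.

```python
import numpy as np

# Made-up data: three observations, two features on different scales.
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

u = X_demo.mean(axis=0)   # per-feature mean
s = X_demo.std(axis=0)    # per-feature population standard deviation (ddof=0)
z = (X_demo - u) / s      # z = (x - u) / s

print(z)
print(z.mean(axis=0), z.std(axis=0))
```

After the transform every column has mean 0 and standard deviation 1, which puts all features on a comparable scale before the regularized model is fitted.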

In [24]:
# Create a logistic regression classifier with balanced class weights
clf = LogisticRegression(random_state=0, class_weight='balanced')
model = clf.fit(X_std, y)



In [31]:
new_observation = [[5.0, 3.5, 1.3, 0.2], [6.2, 2.9, 4.3, 1.3]]
new_observation_std = scaler.transform(new_observation)
new_observation_std

array([[-1.51810416,  1.60348073, -2.61034098, -2.37721745],
       [ 0.08501383, -0.03729025, -0.23271878, -0.44022545]])

In [32]:
model.predict_proba(new_observation_std)

array([[0.99411884, 0.00588116],
       [0.11739546, 0.88260454]])
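Each row of the `predict_proba` output holds one probability per class, in the order given by `model.classes_`. The sketch below refits the same model on the same data and shows how class labels can be recovered from those probabilities by taking the most probable column; `labels_from_proba` is an illustrative name, and no specific output values are implied.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Rebuild the imbalanced binary problem from the cells above.
iris = datasets.load_iris()
X = iris.data[40:, :]
y = np.where(iris.target[40:] == 0, 0, 1)

scaler = StandardScaler()
X_std = scaler.fit_transform(X)
model = LogisticRegression(random_state=0, class_weight='balanced').fit(X_std, y)

# Columns of proba follow model.classes_; each row sums to 1.
proba = model.predict_proba(X_std)
labels_from_proba = model.classes_[np.argmax(proba, axis=1)]

print(model.classes_)
print(np.array_equal(labels_from_proba, model.predict(X_std)))
```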