- link:
  - https://jishuin.proginn.com/p/763bfbd391c6
  - https://cuijiahua.com/blog/2017/11/ml_7_logistic_2.html
### Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression algorithm.
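#### Use cases
- Ad click-through prediction (will the ad be clicked or not)
- Spam detection
- Disease diagnosis (sick or not)
- Financial fraud detection
- Fake account detection

All of these are binary classification problems: every sample is either a positive example or a negative example.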
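#### How it works
Logistic regression = linear regression + sigmoid function.  
The output of the linear regression is fed into the sigmoid function to obtain a probability, and that probability is used to classify the sample.
- Input: the output of the linear regression is the input to the logistic regression
$$h(w)=w_1x_1+w_2x_2+w_3x_3+\dots+b$$
- Activation function: the sigmoid function
  - The regression result is passed into the sigmoid function
  - Substituting it for the function's argument yields a probability
  - Output: a probability in the interval (0, 1); the default decision threshold is 0.5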
$$
h_\theta(x)=g(\theta^Tx) \\
z=[\theta_0 \quad \theta_1 \quad \dots \quad \theta_n]\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^Tx \\
g(z)=\frac{1}{1+e^{-z}}
$$
Putting these together (with $x_0=1$, so that $\theta_0$ plays the role of the bias $b$):
$$
h_\theta(x)=g(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}}
$$
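
The formula above can be sketched in a few lines of NumPy. This is only an illustration: the parameter vector `theta` and the sample matrix `X` are made-up values, and `X` is assumed to carry a leading column of ones so that `theta[0]` acts as the bias.

```python
import numpy as np

def sigmoid(z):
    """Map any real value (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """h_theta(x) = g(theta^T x) for every row of X (first column is x_0 = 1)."""
    return sigmoid(X @ theta)

def predict(X, theta, threshold=0.5):
    """Label a sample 1 when its probability reaches the threshold, else 0."""
    return (predict_proba(X, theta) >= threshold).astype(int)

# Hypothetical parameters and samples, chosen only for illustration.
theta = np.array([-1.0, 2.0])   # [theta_0 (bias), theta_1]
X = np.array([[1.0, 0.0],       # z = -1
              [1.0, 0.5],       # z =  0
              [1.0, 2.0]])      # z =  3

probs = predict_proba(X, theta)
labels = predict(X, theta)
print(probs, labels)
```

A sample whose linear score is exactly 0 gets probability 0.5, which is why 0.5 is the natural default threshold.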
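- Loss function: the loss used by logistic regression is called the log-likelihood loss (log loss). For a single sample:
$$
cost(h_\theta(x),y)=
\begin{cases}
-\log(h_\theta(x)), & \text{if } y=1 \\
-\log(1-h_\theta(x)), & \text{if } y=0
\end{cases}
$$
Summing over all $m$ samples combines the two cases into a single expression for the total loss:
$$
J(\theta)=\sum^m_{i=1}\left[-y_i\log(h_\theta(x_i))-(1-y_i)\log(1-h_\theta(x_i))\right]
$$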
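
As a rough numerical check of the summed loss, the sketch below evaluates it for two sets of made-up predictions. The function name `log_loss_sum` and the clipping constant `eps` are assumptions added here to avoid `log(0)`; they are not part of the original notes.

```python
import numpy as np

def log_loss_sum(y_true, p_pred, eps=1e-15):
    """Sum of -y*log(p) - (1-y)*log(1-p) over all samples."""
    p = np.clip(p_pred, eps, 1 - eps)  # keep log() away from 0
    return float(np.sum(-y_true * np.log(p) - (1 - y_true) * np.log(1 - p)))

# Hypothetical labels and predicted probabilities.
y_true = np.array([1, 0, 1])
p_good = np.array([0.9, 0.1, 0.8])   # confident and correct
p_bad = np.array([0.1, 0.9, 0.2])    # confident and wrong

loss_good = log_loss_sum(y_true, p_good)
loss_bad = log_loss_sum(y_true, p_bad)
print(loss_good, loss_bad)
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is the behavior the piecewise definition is designed to produce.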
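- Optimizing the loss
  - Gradient descent lowers the value of the loss function by iteratively updating the weights, which makes the classification more accurate
- Workflow  
![Logistic regression workflow](src/logistic_flow.png)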
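
The gradient-descent step can be sketched as follows. For the log loss, the gradient with respect to the weights works out to `X^T (h - y)`; the version below divides by the number of samples so the learning rate does not depend on the dataset size. The learning rate, iteration count, and toy data are illustrative assumptions, not values from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.5, n_iters=1000):
    """Batch gradient descent on the mean log loss.

    X must already contain a leading column of ones for the bias term.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)  # mean gradient
        theta -= lr * grad                               # step downhill
    return theta

# Hypothetical 1-D data: negative feature values are class 0, positive are class 1.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

theta = fit_logistic_gd(X, y)
pred = (sigmoid(X @ theta) >= 0.5).astype(int)
print(theta, pred)
```

In practice scikit-learn's solvers handle this optimization; the loop is only meant to show what "updating the weights to reduce the loss" looks like.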

#### API
~~~python
sklearn.linear_model.LogisticRegression
~~~
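- penalty: the penalty (regularization) term, a str; options include l1 and l2, default l2. Specifies the norm used in the penalty.
- C: the inverse of the regularization strength λ (default 1.0); smaller values mean stronger regularization.
- solver: the optimization algorithm; options include newton-cg, lbfgs, liblinear, sag, and saga. The default is lbfgs since scikit-learn 0.22 (earlier versions defaulted to liblinear).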
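
To see the effect of `C`, the sketch below fits two models on synthetic data from `make_classification` (an assumption made for illustration; it is not the dataset used in these notes). The penalty is left at its default, and `max_iter=1000` is set only to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# C is the inverse of the regularization strength: large C = weak penalty.
weak_reg = LogisticRegression(C=10.0, solver="lbfgs", max_iter=1000).fit(X, y)
strong_reg = LogisticRegression(C=0.01, solver="lbfgs", max_iter=1000).fit(X, y)

weak_norm = np.linalg.norm(weak_reg.coef_)
strong_norm = np.linalg.norm(strong_reg.coef_)
print(weak_norm, strong_norm)
```

The strongly regularized model ends up with a smaller coefficient norm, since the penalty pulls the weights toward zero.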

~~~python
# dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/

# load dataset
import pandas as pd
import numpy as np

# column names passed to read_csv via names=
col_names = [
    "Sample code number",
    "Clump Thickness",
    "Uniformity of Cell Size",
    "Uniformity of Cell Shape",
    "Marginal Adhesion",
    "Single Epithelial Cell Size",
    "Bare Nuclei",
    "Bland Chromatin",
    "Normal Nucleoli",
    "Mitoses",
    "Class",
]
cancer = pd.read_csv(
    "../datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
    names=col_names,
)
cancer
~~~


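| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |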


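~~~python
# handle missing values: replace the "?" placeholder with NaN
cancer.replace("?", np.nan, inplace=True)
cancer.loc[40]  # inspect row 40 after the replacement

# drop the rows that contain missing values
cancer.dropna(inplace=True)
cancer.shape
~~~
~~~
(683, 11)
~~~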
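
The cleaning step can be tried on a tiny hand-made frame. The data below is invented for illustration. One detail worth noting: a column that once held `"?"` is read in as strings, so the sketch also converts it with `pd.to_numeric`, an extra step that is not in the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: "?" marks a missing value.
df = pd.DataFrame({
    "Bare Nuclei": ["1", "10", "?", "4"],
    "Class": [2, 4, 2, 4],
})

# Replace the placeholder with NaN, then drop incomplete rows.
cleaned = df.replace("?", np.nan).dropna().copy()

# The column still holds strings; make it numeric explicitly.
cleaned["Bare Nuclei"] = pd.to_numeric(cleaned["Bare Nuclei"])

print(cleaned.shape)
print(cleaned.dtypes)
```

scikit-learn will usually coerce such string digits to floats on its own, but converting explicitly makes the dtype visible and avoids surprises in later pandas operations.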

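~~~python
# select the feature columns and the target column
x = cancer.iloc[:, 1:-1]  # skip the ID column (first) and the Class column (last)
y = cancer["Class"]
~~~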

~~~python
# split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
~~~


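~~~python
# standardize the features
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
x_train = std.fit_transform(x_train)  # learn mean and std from the training set only
x_test = std.transform(x_test)        # reuse the training-set statistics
~~~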
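
The reason the scaler is fitted on the training split only can be seen on a toy example. The arrays below are made up; the point is that the test data is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature splits.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean = 2, variance = 2/3
test_scaled = scaler.transform(test)        # applies the same mean and variance

print(scaler.mean_, scaler.var_)
print(test_scaled.ravel())
```

The test value 2.0 maps to 0 because it equals the training mean, while 5.0 lands well outside the training range.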


~~~python
# logistic regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train, y_train)

# parameters of the fitted model: one weight per feature, plus the bias
print("coef_", clf.coef_)
print("intercept_", clf.intercept_)
~~~
~~~
coef_ [[1.19777169 0.10877967 0.73209957 0.60323232 0.12122898 1.48162508
  0.75112762 0.79980762 0.82133788]]
intercept_ [-1.0749105]
~~~
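
To connect `coef_` and `intercept_` back to the sigmoid formula, the sketch below recomputes the positive-class probability by hand and compares it with `predict_proba`. It uses synthetic data from `make_classification` as a stand-in, so the numbers differ from the ones printed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Linear score z = w . x + b, then the sigmoid.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the second class (clf.classes_[1]) as reported by scikit-learn.
sklearn_proba = clf.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

For a binary problem, the weights describe how each standardized feature shifts the log-odds of the class listed second in `clf.classes_`.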


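~~~python
# evaluation
from sklearn.metrics import mean_squared_error

y_pred = clf.predict(x_test)
# NOTE: MSE is a regression metric. With class labels 2 and 4, every
# misclassification contributes (4 - 2)^2 = 4, so err is 4 x the error rate.
err = mean_squared_error(y_test, y_pred)
print("err", err)
# score() returns the mean accuracy on the test set
score = clf.score(x_test, y_test)
print("score", score)
~~~
~~~
err 0.0935672514619883
score 0.9766081871345029
~~~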
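
For a classifier, metrics such as accuracy, the confusion matrix, and recall are usually more informative than MSE. The sketch below uses invented label arrays in the dataset's encoding (per the UCI description, 2 = benign and 4 = malignant) and also shows why the MSE above is simply four times the error rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true and predicted labels (2 = benign, 4 = malignant).
y_true = np.array([2, 2, 2, 4, 4, 4, 2, 4])
y_pred = np.array([2, 2, 4, 4, 4, 2, 2, 4])

acc = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, ordered [2, 4].
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
# Share of malignant samples that were actually caught.
recall_malignant = recall_score(y_true, y_pred, pos_label=4)
# MSE on the raw labels, for comparison with accuracy.
mse = np.mean((y_true - y_pred) ** 2)

print(acc, recall_malignant, mse)
print(cm)
```

In a medical setting the recall on the malignant class is often the number to watch, because a missed malignant case is typically costlier than a false alarm.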
