When we used logistic regression for classification, we evaluated the model with accuracy.
However, accuracy has a flaw as a metric.
![p1](img/p1.png)
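This happens because the class distribution is very uneven; such data is called extremely skewed data (Skewed Data).
With data like this, even a model that simply predicts "no cancer" for everyone achieves a very high accuracy.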
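
To make the problem concrete, the sketch below builds a hypothetical screening set (1000 patients, 10 of whom have cancer; these numbers are assumptions for illustration, not taken from the figure) and scores a "model" that predicts no cancer for everyone.

```python
import numpy as np

# Hypothetical labels: 1 = cancer, 0 = no cancer (990 healthy, 10 sick)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that ignores its input and always predicts "no cancer"
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_true == y_pred)
found = np.sum((y_true == 1) & (y_pred == 1))

print(accuracy)  # 0.99
print(found)     # 0 cancer cases detected
```

The accuracy looks excellent even though not a single cancer case is found, which is why accuracy alone is misleading on skewed data.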

![mat](img/mat.png)

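## Precision and Recall
Both are computed from the confusion matrix.

Precision = TP / (TP+FP)

Recall = TP / (TP+FN)

For example, let 0 mean no cancer and 1 mean cancer.
Precision then answers: of every 100 people we predict to have cancer, what fraction actually have it?

Recall answers: of every 100 people who actually have cancer, what fraction did we correctly predict as having it?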
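
The two definitions can be checked on a small hypothetical example (the counts below are made up for illustration): suppose the model flags 100 people as having cancer, 80 of them correctly, while 120 people in total really have cancer.

```python
# Hypothetical counts, chosen only to illustrate the formulas
tp = 80   # predicted cancer, really has cancer
fp = 20   # predicted cancer, actually healthy
fn = 40   # predicted healthy, actually has cancer

precision = tp / (tp + fp)  # share of positive predictions that are right
recall = tp / (tp + fn)     # share of real positives that were found

print(precision)  # 0.8
print(recall)     # about 0.667
```

Precision only looks at the people the model flagged, while recall only looks at the people who are actually sick, so the two can differ considerably.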



# Implementing Precision and Recall

In [1]:
import numpy as np
from sklearn import datasets

In [5]:
digits = datasets.load_digits()
X = digits.data
y = digits.target.copy() # copy so the original labels are not modified

In [3]:
# Manually make the data skewed

In [6]:
y[digits.target==9]=1
y[digits.target!=9]=0
# The 10-class problem is now a binary problem (is the digit a 9 or not),
# so the labels should be heavily skewed
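
The skew can be confirmed by counting the two classes. The sketch below is a self-contained check that rebuilds the same labels in a single expression (an equivalent shortcut, not the notebook's original cell) and counts them with `np.bincount`.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
y = (digits.target == 9).astype(int)  # 1 for the digit 9, 0 for everything else

counts = np.bincount(y)          # counts[0] = negatives, counts[1] = positives
positive_fraction = counts[1] / counts.sum()

print(counts)
print(positive_fraction)         # roughly one sample in ten is positive
```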

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [10]:
log_reg.score(X_test, y_test)

0.9755555555555555

In [14]:
def TN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    # True negatives: actual 0, predicted 0 (summing booleans counts True as 1, False as 0)
    return np.sum((y_true == 0) & (y_predict == 0))

In [12]:
y_log_predict = log_reg.predict(X_test)

In [15]:
TN(y_test, y_log_predict)

403

In [16]:
def FP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    # False positives: actual 0, predicted 1
    return np.sum((y_true == 0) & (y_predict == 1))

In [17]:
FP(y_test, y_log_predict)

2

In [18]:
def FN(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    # False negatives: actual 1, predicted 0
    return np.sum((y_true == 1) & (y_predict == 0))

In [19]:
def TP(y_true, y_predict):
    assert len(y_true) == len(y_predict)
    # True positives: actual 1, predicted 1
    return np.sum((y_true == 1) & (y_predict == 1))

In [20]:
FN(y_test, y_log_predict)

9

In [21]:
TP(y_test, y_log_predict)

36

In [22]:
def confusion_matrix(y_true, y_predict):
    return np.array([
        [TN(y_true, y_predict), FP(y_true, y_predict)],
        [FN(y_true, y_predict), TP(y_true, y_predict)]
    ])

In [23]:
confusion_matrix(y_test, y_log_predict)

array([[403,   2],
       [  9,  36]])
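
The matrix also explains why the accuracy above looks so good. The sketch below recomputes the accuracy from the four counts and compares it with the accuracy of a trivial baseline that always predicts 0 (the baseline is an illustration added here, not a cell from the original notebook).

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, columns = predicted class
cm = np.array([[403, 2],
               [9, 36]])

total = cm.sum()
accuracy = np.trace(cm) / total   # (TN + TP) / all samples
baseline = cm[0].sum() / total    # accuracy of always predicting 0

print(accuracy)  # 0.9755555555555555
print(baseline)  # 0.9
```

A model that never predicts 9 would already reach 90% accuracy on this test set, so the 97.6% says less than it seems to.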

In [24]:
# Now compute precision and recall

In [25]:
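def precision_score(y_true, y_predict):
    tp = TP(y_true, y_predict)
    fp = FP(y_true, y_predict)
    # NumPy integers do not raise on division by zero by default,
    # so check the denominator explicitly instead of using try/except
    if tp + fp == 0:
        return 0.0
    return tp / (tp + fp)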

In [26]:
precision_score(y_test, y_log_predict)

0.9473684210526315

In [27]:
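def recall_score(y_true, y_predict):
    tp = TP(y_true, y_predict)
    fn = FN(y_true, y_predict)
    # NumPy integers do not raise on division by zero by default,
    # so check the denominator explicitly instead of using try/except
    if tp + fn == 0:
        return 0.0
    return tp / (tp + fn)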

In [28]:
recall_score(y_test, y_log_predict)

0.8
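
As a check, both values can be read directly off the confusion matrix computed earlier (TP = 36, FP = 2, FN = 9):

$$
precision = \frac{36}{36+2} = \frac{36}{38} \approx 0.947, \qquad recall = \frac{36}{36+9} = \frac{36}{45} = 0.8
$$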

# Precision and Recall in sklearn

In [29]:
from sklearn.metrics import confusion_matrix

In [30]:
confusion_matrix(y_test, y_log_predict)

array([[403,   2],
       [  9,  36]], dtype=int64)
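
sklearn's `confusion_matrix` uses the same layout as the hand-written version (rows are true classes, columns are predicted classes), so for a binary problem the four counts can be unpacked with `ravel()`. A minimal sketch on made-up labels (the arrays are hypothetical, chosen so the counts are easy to verify by hand):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not the digits data
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# Flattening the 2x2 matrix row by row yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(tn, fp, fn, tp)  # 4 1 1 2
```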

In [31]:
from sklearn.metrics import precision_score

In [32]:
precision_score(y_test, y_log_predict)

0.9473684210526315

In [33]:
from sklearn.metrics import recall_score

In [34]:
recall_score(y_test, y_log_predict)

0.8

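## Balancing Precision and Recall: the F1 Score
The F1 score is the **harmonic mean** of precision and recall:
$$
\frac{1}{F1} = \frac{1}{2}\left(\frac{1}{precision} + \frac{1}{recall}\right)
$$
A harmonic mean is high only when both values are fairly high. Solving for F1 gives:
$$
F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}
$$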
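
The step between the two forms is a common denominator followed by a reciprocal. Combining the two fractions inside the parentheses gives

$$
\frac{1}{F1} = \frac{1}{2} \cdot \frac{precision + recall}{precision \cdot recall} = \frac{precision + recall}{2 \cdot precision \cdot recall}
$$

and inverting both sides yields the closed form for F1.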


In [35]:
def f1_score(precision, recall):
    # Return 0.0 when both inputs are 0 instead of dividing by zero
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

In [36]:
precision = recall = 0.5
f1_score(precision, recall)

0.5

In [37]:
precision = 0.1
recall = 0.9
f1_score(precision, recall)

0.18000000000000002
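
To see why the harmonic mean is preferred over a plain average here, the sketch below compares the two on a few precision/recall pairs. The helper names `arithmetic_mean` and `harmonic_mean` are introduced only for this illustration.

```python
def arithmetic_mean(a, b):
    return (a + b) / 2


def harmonic_mean(a, b):
    # Same formula as the F1 score; defined as 0.0 when both inputs are 0
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


# Print each pair followed by its arithmetic and harmonic mean
for precision, recall in [(0.5, 0.5), (0.1, 0.9), (0.0, 1.0)]:
    print(precision, recall,
          arithmetic_mean(precision, recall),
          harmonic_mean(precision, recall))
```

The arithmetic mean is about 0.5 for all three pairs, while the harmonic mean drops to about 0.18 and then to 0 as the two values drift apart, so a model cannot score well by maximizing only one of them.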

In [38]:
# f1_score on the real data

In [39]:
log_reg.score(X_test, y_test)

0.9755555555555555

In [40]:
precision = precision_score(y_test, y_log_predict)  # separate name so the function is not overwritten

In [41]:
recall = recall_score(y_test, y_log_predict)  # separate name so the function is not overwritten

In [42]:
from sklearn.metrics import f1_score

In [43]:
f1_score?

In [44]:
# Signature in sklearn: f1_score(y_true, y_pred, labels=None...)

In [46]:
f1_score(y_test, y_log_predict)

0.8674698795180723

In [47]:
# The score is not especially high: well below the accuracy of about 0.976
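
The sklearn result can be reproduced by hand from the confusion-matrix counts obtained earlier (TP = 36, FP = 2, FN = 9). The sketch below also uses the equivalent count-based form F1 = 2TP / (2TP + FP + FN), which follows from substituting the definitions of precision and recall.

```python
tp, fp, fn = 36, 2, 9  # counts from the confusion matrix computed earlier

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f1)              # about 0.8675
print(f1_from_counts)  # same value up to floating-point rounding
```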