Continue with the data from the previous section.

In [1]:
import numpy as np
from sklearn import datasets
np.set_printoptions(threshold=np.inf)
mydigits = datasets.load_digits()
X = mydigits.data
# Turn the 10-class digits task into an imbalanced binary one: 9 vs. not-9.
y = mydigits.target.copy()
y[mydigits.target == 9] = 1
y[mydigits.target != 9] = 0

Train a logistic regression model on the data.

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
logReg = LogisticRegression()
logReg.fit(X_train, y_train)
y_predict = logReg.predict(X_test)

Import confusion_matrix, precision_score, recall_score, and f1_score from sklearn.metrics.

In [4]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [5]:
confusion_matrix(y_test, y_predict)

array([[403,   2],
       [  9,  36]])

In [6]:
precision_score(y_test, y_predict)

0.9473684210526315

In [7]:
recall_score(y_test, y_predict)

0.8

In [8]:
f1_score(y_test, y_predict)

0.8674698795180723
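As a sanity check, the three scores can be recomputed by hand from the confusion matrix. The sketch below assumes sklearn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`); the variable names are illustrative only.

```python
# Counts read off the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 403, 2
fn, tp = 9, 36

precision = tp / (tp + fp)  # of all samples predicted as 1, the share that really are 1
recall = tp / (tp + fn)     # of all samples that really are 1, the share that was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Up to floating-point rounding, the printed values should agree with the outputs of precision_score, recall_score, and f1_score above.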

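In theory, precision and recall trade off against each other. In logistic regression, moving the decision threshold pushes them in opposite directions: raising the threshold typically increases precision but lowers recall, and lowering it does the reverse.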

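Note that sklearn.linear_model.LogisticRegression provides a method:   
decision_function(X)	   Predict confidence scores for samples. Thresholding these scores lets us change the decision threshold.    
Default: in sklearn.linear_model.LogisticRegression, a sample with score > 0 is predicted as 1 and a sample with score < 0 is predicted as 0.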
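To see why 0 is the natural default, it may help to map scores to probabilities. The sketch below assumes the standard binary logistic model, where the probability of class 1 is the sigmoid of the decision score; the `sigmoid` helper and the thresholds -5, 0, and 5 are illustrative choices, not part of sklearn's API.

```python
import math

def sigmoid(score):
    """Map a decision score to a class-1 probability under the logistic model."""
    return 1.0 / (1.0 + math.exp(-score))

# A threshold on the score is equivalent to a threshold on the probability.
for threshold in (-5, 0, 5):
    print(threshold, round(sigmoid(threshold), 4))
```

Under this assumption a score of 0 corresponds to a probability of exactly 0.5, so thresholding the score at a positive value demands more confidence before predicting 1, and a negative value demands less.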

In [10]:
scores = logReg.decision_function(X_test)

In [11]:
scores[:10]

array([-22.05700211, -33.02940211, -16.21332666, -80.37910549,
       -48.25124099, -24.54003786, -44.39166499, -25.04289548,
        -0.97829288, -19.71740555])

In [12]:
y_predict[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

As shown above, the first 10 predictions are all 0, and their scores are all < 0.      
This is consistent with the default rule described earlier.
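The same check can be done by hand on the ten scores printed above. This is a minimal sketch: the array is copied from that output, and `manual_predict` and `lowered_predict` are names introduced here for illustration.

```python
import numpy as np

# First ten decision scores, copied from the output above.
scores_head = np.array([-22.05700211, -33.02940211, -16.21332666, -80.37910549,
                        -48.25124099, -24.54003786, -44.39166499, -25.04289548,
                        -0.97829288, -19.71740555])

# Default rule applied by hand: positive score -> class 1, otherwise class 0.
manual_predict = (scores_head > 0).astype(int)

# Same scores with a lower cut-off: any score of at least -5 becomes class 1.
lowered_predict = (scores_head >= -5).astype(int)

print(manual_predict)
print(lowered_predict)
```

Only the ninth score (about -0.98) lies close to the boundary, so it is the only one of the ten whose label changes when the cut-off drops to -5.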

# Raising the LogisticRegression prediction threshold

In [17]:
y_predict2 = np.array(scores >= 5, dtype='int')

Compute confusion_matrix, precision_score, recall_score, and f1_score for the new predictions:

In [18]:
confusion_matrix(y_test, y_predict2)

array([[404,   1],
       [ 21,  24]])

In [19]:
precision_score(y_test, y_predict2)

0.96

In [20]:
recall_score(y_test, y_predict2)

0.5333333333333333

In [21]:
f1_score(y_test, y_predict2)

0.6857142857142858

Compared with the default threshold of 0, raising the threshold to 5 gives a higher precision_score (0.96 vs. 0.947) but a lower recall_score (0.533 vs. 0.8) and a lower f1_score (0.686 vs. 0.867).

# Lowering the LogisticRegression prediction threshold

In [22]:
y_predict3 = np.array(scores >= -5, dtype='int')

In [23]:
confusion_matrix(y_test, y_predict3)

array([[390,  15],
       [  5,  40]])

In [24]:
precision_score(y_test, y_predict3)

0.7272727272727273

In [25]:
recall_score(y_test, y_predict3)

0.8888888888888888

In [26]:
f1_score(y_test, y_predict3)

0.7999999999999999

Compared with the default threshold of 0, lowering the threshold to -5 gives a lower precision_score (0.727 vs. 0.947) and a higher recall_score (0.889 vs. 0.8), while f1_score is again lower (0.8 vs. 0.867).
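The two experiments can be generalized into a small threshold sweep. The sketch below is one possible way to do it, not the notebook's own code: `precision_recall_at` is a hypothetical helper, and `toy_y` / `toy_scores` are made-up data chosen so that the effect is easy to follow by hand.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when predicting 1 for scores >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical toy data, NOT taken from the digits experiment.
toy_y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
toy_scores = np.array([-9.0, -7.0, -6.0, -3.0, -2.0, 1.0, 2.0, 4.0, 6.0, 8.0])

for t in (-5, 0, 5):
    p, r = precision_recall_at(toy_y, toy_scores, t)
    print(f"threshold={t:>2}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data precision rises and recall falls as the threshold goes from -5 to 5. Recall can never increase when the threshold is raised, but precision is not guaranteed to rise at every step on real data. With the notebook's variables, the same helper could be called as `precision_recall_at(y_test, scores, 5)`; sklearn also offers `sklearn.metrics.precision_recall_curve` for evaluating every candidate threshold at once.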