### Metrics we use in Classification:

In [None]:
# 1. Accuracy: reliable only when the classes are balanced. On imbalanced data it is misleading,
#    because always predicting the majority class already scores high.
#    - For imbalanced datasets, use recall, precision, or F-score instead
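To make the point about imbalanced data concrete, here is a minimal sketch on made-up labels (the 95/5 split and the always-predict-0 "model" are assumptions chosen for illustration, not part of the original notebook). It shows accuracy looking strong while recall is zero.

```python
# Hypothetical toy data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5

# A trivial "model" that ignores its input and always predicts the majority class 0.
y_pred = [0] * 100

# Count the confusion-matrix cells needed for accuracy and recall.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The model never finds a single positive, yet its accuracy is 95%, which is why accuracy alone is not trusted on imbalanced data.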

In [None]:
# 2. When to use recall and when to use precision on an imbalanced dataset?
#    - Spam mail, [1: 'spam', 0: 'not spam']:
#      the goal is to reduce false positives (a legitimate mail that the model flags as spam),
#      so that important mails are not lost. Here precision = tp / (tp + fp) must be improved (FP must be reduced).

#    - Cancer detection, [1: 'cancer positive', 0: 'cancer negative']:
#      the goal is to reduce false negatives (we don't want to declare a person cancer-free who actually has cancer).
#      Here recall = tp / (tp + fn) must be improved (FN must be reduced).

#    - If both recall and precision matter, use the F-beta score:
#      F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
#        > if FP and FN are equally important, use beta = 1 (the F1 score)
#        > if FP is more costly than FN, use beta < 1 (weights precision more)
#        > if FN is more costly than FP, use beta > 1 (weights recall more)
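The effect of beta can be checked with a small sketch of the F-beta formula above. The helper names `precision_recall` and `f_beta` and the confusion counts are hypothetical, chosen so that precision is higher than recall. The code assumes the denominators are nonzero.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts (assumes nonzero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: few false positives, many false negatives.
tp, fp, fn = 40, 10, 40
precision, recall = precision_recall(tp, fp, fn)  # precision 0.8, recall 0.5

for beta in (0.5, 1.0, 2.0):
    print(f"beta = {beta}: F = {f_beta(precision, recall, beta):.3f}")
```

Because precision (0.8) exceeds recall (0.5) here, the score is highest for beta = 0.5, which favors precision, and lowest for beta = 2, which favors recall.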

In [1]:
# 3. ROC (Receiver Operating Characteristic) curve and AUC:
#    - The ROC curve summarizes the confusion matrices produced at every classification threshold.

#    - true positive rate (TPR) = tp / (tp + fn) --> same as recall
#    - false positive rate (FPR) = fp / (fp + tn)

#    - Plot TPR (y-axis) against FPR (x-axis); the area under this curve is the AUC (area under the curve).
#      The larger the area, the better the model ranks positives above negatives.
#        > a useful model should have AUC > 0.5, the area of the right triangle under the diagonal
#          from (0, 0) to (1, 1), which is what random guessing achieves
#        > ** the threshold is chosen from business requirements:
#          e.g., if no FP can be tolerated, pick the threshold with the highest TPR among those where FPR == 0
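The ROC construction and the threshold choice described above can be sketched by hand in plain Python. The labels, scores, and the helper names `roc_points` and `auc_trapezoid` are made up for illustration; in practice a library routine such as scikit-learn's `roc_curve` would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) for each distinct score used as a threshold.

    A sample is predicted positive when its score >= threshold.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((thr, fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule, starting from (0, 0)."""
    xs = [0.0] + [fpr for _, fpr, _ in points]
    ys = [0.0] + [tpr for _, _, tpr in points]
    return sum((xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2 for i in range(1, len(xs)))


# Hypothetical labels and predicted scores for the positive class.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)

# Business rule from the notes: tolerate no false positives, then maximize TPR.
best = max((p for p in points if p[1] == 0.0), key=lambda p: p[2])

print(f"AUC = {auc}")                                  # AUC = 0.8125
print(f"threshold = {best[0]}, TPR = {best[2]}")       # threshold = 0.8, TPR = 0.5
```

Each threshold yields one confusion matrix and therefore one (FPR, TPR) point; here the strictest usable threshold with FPR == 0 still recovers half of the positives.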
        

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


In [5]:
# Loading data
X, y = load_breast_cancer(return_X_y=True)
len(X[0]) # 30 features

30

In [11]:
import numpy as np
np.unique(y)

array([0, 1])

In [7]:
# fitting model
clf = LogisticRegression(solver="liblinear", random_state=0).fit(X, y)


In [8]:
clf.predict_proba(X) # probability for each class
# column 0: probability of class 0 (malignant, i.e. cancerous)
# column 1: probability of class 1 (benign, i.e. no cancer)

array([[1.00000000e+00, 2.62841744e-15],
       [9.99999981e-01, 1.88458428e-08],
       [9.99999962e-01, 3.83511402e-08],
       ...,
       [9.98210533e-01, 1.78946713e-03],
       [1.00000000e+00, 6.73439803e-11],
       [3.64914824e-02, 9.63508518e-01]])

In [9]:
# AUC with class 1 as the positive class (in this dataset 1 = benign, 0 = malignant), computed on the training data
roc_auc_score(y, clf.predict_proba(X)[:, 1])

0.994767718408118

In [10]:
# Same labels (1 = positive) but scored with P(class 0): the ranking is reversed
roc_auc_score(y, clf.predict_proba(X)[:, 0])

0.005232281591882046
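The two results above add up to 1. A minimal sketch of why, using the pairwise definition of AUC (the fraction of positive/negative pairs in which the positive gets the higher score) on made-up labels and scores; `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and class-1 probabilities; class-0 probabilities are the complement.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
proba_1 = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
proba_0 = [1 - p for p in proba_1]

auc_1 = pairwise_auc(y_true, proba_1)
auc_0 = pairwise_auc(y_true, proba_0)

print(auc_1, auc_0, auc_1 + auc_0)  # 0.8125 0.1875 1.0
```

Scoring with the complementary column reverses every pairwise comparison, so each correctly ranked pair becomes an incorrectly ranked one and the AUC becomes 1 minus the original value.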