# Lecture 12: Using logistic regression
## Agenda
1. Model training of Logistic Regression
2. sklearn.linear_model.LogisticRegression
3. A realistic example: breast cancer detection
4. Search parameters using GridSearchCV


### 1. Model training of logistic regression
1. Training set. Data samples: X_train = {$x_1, x_2, ..., x_i, ..., x_m$}. Target (class labels): Y_train = {$y_1, y_2, ..., y_i, ..., y_m$}, where m is the number of training samples.
    
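2. Model in logistic regression: $\hat{p_i} = \sigma(w_0 + w_1 \times x_{i,1} + ... + w_n \times x_{i,n})$, where $\sigma$ is the sigmoid function, $\hat{p_i}$ is the predicted probability that $x_i$ belongs to the positive class, and n is the feature dimensionality of a data sample.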

3. Cost/loss function of the logistic regression: the criterion to judge how good the current model is.
\begin{equation*}
    J(w) = C \sum_{i=1}^m \log(e^{- y_i z_i} + 1) + P
\end{equation*}
    - In the above equation, $z_i = w_0 + w_1 \times x_{i,1} + ... + w_n \times x_{i,n}$ is the linear score of data sample $x_i$ (the argument of $\sigma$ in the model), and $y_i \in \{-1, +1\}$ is its true class label. The sum runs over the m training samples.
    - P is the penalty/regularizer, and C weights the data-fit term relative to P: a smaller C means stronger regularization. There are three typical options for P: L1 norm ($\|w\|_1$), L2 norm ($\frac{1}{2}w^T w$) and elastic-net ($\frac{1 - \rho}{2}w^T w + \rho \|w\|_1$).
    
4. Model training uses an optimization algorithm to find the $w$ that minimizes the loss function $J(w)$. Refer to the gradient descent algorithm in the Other Learning materials for more information.
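
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn trains the model: it assumes labels coded as -1/+1, the L2 penalty, an unpenalized intercept $w_0$, plain fixed-step gradient descent, and a synthetic toy dataset. The function names, learning rate, and iteration count are all hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y, C=1.0):
    """C * sum_i log(exp(-y_i * z_i) + 1) + 0.5 * w^T w, with y_i in {-1, +1}."""
    z = X @ w[1:] + w[0]                      # linear score; w[0] is the intercept w_0
    # np.logaddexp(0, a) evaluates log(1 + exp(a)) without overflow
    return C * np.sum(np.logaddexp(0.0, -y * z)) + 0.5 * w[1:] @ w[1:]

def gradient(w, X, y, C=1.0):
    z = X @ w[1:] + w[0]
    g = -y * sigmoid(-y * z)                  # derivative of log(1 + exp(-y z)) w.r.t. z
    grad = np.empty_like(w)
    grad[0] = C * np.sum(g)                   # intercept: no penalty term (assumption)
    grad[1:] = C * (X.T @ g) + w[1:]          # data term plus gradient of the L2 penalty
    return grad

def fit_gd(X, y, C=1.0, lr=0.01, n_iter=500):
    w = np.zeros(X.shape[1] + 1)
    history = []                              # loss value before each update
    for _ in range(n_iter):
        history.append(loss(w, X, y, C))
        w -= lr * gradient(w, X, y, C)
    return w, history

# toy data: two Gaussian blobs, 50 samples each (illustration only)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y_toy = np.array([-1] * 50 + [1] * 50)

w_hat, history = fit_gd(X_toy, y_toy)
p_hat = sigmoid(X_toy @ w_hat[1:] + w_hat[0])  # predicted probability of class +1
print('loss before / after training:', history[0], history[-1])
```

The step size has to be small enough for the loss to decrease; the library solvers used later in this lecture choose their steps automatically.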

### 2. The sklearn.linear_model.LogisticRegression class

1. Class introduction: key parameters (e.g., C and penalty) and methods (fit, predict, predict_proba, and score)
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#
    
2. source code
    - https://github.com/scikit-learn/scikit-learn/blob/b194674c4/sklearn/linear_model/_logistic.py#L1191
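
To make the parameters and methods listed above concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and all variable names are assumptions for illustration only. The last lines check that, for a two-class problem, `predict_proba` agrees with the sigmoid model from Section 1 applied to the learned weights.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic two-class data (illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# C and penalty are set when the object is constructed
model = LogisticRegression(C=1.0, penalty='l2')
model.fit(X_demo, y_demo)

labels = model.predict(X_demo[:3])        # predicted class labels
probas = model.predict_proba(X_demo[:3])  # one column per class, ordered as model.classes_
accuracy = model.score(X_demo, y_demo)    # mean accuracy on the given data

# learned weights (coef_) and intercept (intercept_) reproduce the probabilities
scores = X_demo[:3] @ model.coef_.ravel() + model.intercept_[0]
manual_p = expit(scores)
print(np.allclose(manual_p, probas[:, 1]))
```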

### 3. Breast cancer detection using logistic regression

1. Dataset

    :Number of data samples: 569
    
    :Number of features: 30 numeric. The first 10 features are the means of the measurements taken over all nuclei in an image
    
    :Class labels
        : Malignant
        : Benign
        
    https://scikit-learn.org/stable/datasets/index.html#breast-cancer-dataset
    
    

In [5]:
# load the data set
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = ds.load_breast_cancer(return_X_y=True) # a simplified data loading approach
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [1]:
# create and train a logistic regression model
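
One possible way to fill in this cell is sketched below. The data loading repeats the cell above so the block runs on its own. Choosing `solver='liblinear'` is an assumption, made to match the solver shown in the GridSearchCV output later in this lecture; the variable name `logitR` is borrowed from that section.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = ds.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# create the model (default C and penalty) and train it on the training set
logitR = LogisticRegression(solver='liblinear', random_state=0)
logitR.fit(X_train, y_train)

print('weights shape:', logitR.coef_.shape)  # one weight per feature
print('intercept:', logitR.intercept_)
```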


In [2]:
# use the trained model to predict class labels (after thresholding) for the first 5 training samples
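
A possible completion of this cell, again written to run on its own. The model setup (`solver='liblinear'`, `random_state=0`) is an assumption rather than the lecture's prescribed configuration.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = ds.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
logitR = LogisticRegression(solver='liblinear', random_state=0).fit(X_train, y_train)

# predict() returns hard class labels (0 or 1)
pred_labels = logitR.predict(X_train[:5])
print('predicted:', pred_labels)
print('true:     ', y_train[:5])
```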


In [3]:
# use the trained model to output the probability estimates (before thresholding) of the first 5 training samples
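
A possible completion of this cell. It also shows how the probabilities relate to the labels from `predict`: thresholding the class-1 probability at 0.5 gives the same labels. The model setup is the same assumed configuration as in the sketches above.

```python
import numpy as np
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = ds.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
logitR = LogisticRegression(solver='liblinear', random_state=0).fit(X_train, y_train)

# predict_proba() returns one row per sample and one column per class (order: logitR.classes_)
proba = logitR.predict_proba(X_train[:5])
print(proba)

# thresholding the class-1 probability at 0.5 reproduces predict()
thresholded = (proba[:, 1] > 0.5).astype(int)
print(thresholded, logitR.predict(X_train[:5]))
```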


In [4]:
# model evaluation


# use the score() function
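
A possible completion of the evaluation cell. Besides `score()`, it uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics`; including those two is an addition for illustration, and the model setup is the same assumed configuration as above.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = ds.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
logitR = LogisticRegression(solver='liblinear', random_state=0).fit(X_train, y_train)

# score() returns the mean accuracy on the given data
test_acc = logitR.score(X_test, y_test)

# the same number computed explicitly from the predicted labels
y_pred = logitR.predict(X_test)
manual_acc = accuracy_score(y_test, y_pred)

# confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print('test accuracy:', test_acc)
print(cm)
```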


### 4. Parameter searching using GridSearchCV
1. Tune the hyperparameters, penalty and C, to achieve the best performance by combining k-fold cross-validation with parameter searching: GridSearchCV
2. Retrain the model using the best parameters


In [57]:
from sklearn.model_selection import GridSearchCV

X, y = ds.load_breast_cancer(return_X_y=True)
y = 1 - y  # flip the labels: malignant becomes 1 (positive class), benign becomes 0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

# liblinear supports both the 'l1' and 'l2' penalties searched below
logitR = LogisticRegression(solver='liblinear', random_state=0)

param_grid = {
    'penalty': ['l2', 'l1'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# 3-fold cross-validated search over all 14 (C, penalty) combinations
clf = GridSearchCV(estimator=logitR, cv=3, param_grid=param_grid, scoring='accuracy', verbose=1)
clf.fit(X_train, y_train)

Fitting 3 folds for each of 14 candidates, totalling 42 fits


[Parallel(n_jobs=1)]: Done  42 out of  42 | elapsed:   13.7s finished


GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l2', 'l1'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=1)

In [59]:
# Attributes of the fitted GridSearchCV object
results = clf.cv_results_  # dict with per-fold and averaged scores for every candidate
print('mean test scores: ', results['mean_test_score']) # average validation score over all folds
print('best score: ', clf.best_score_)
print('best  parameters: ', clf.best_params_)
print('scorer: ', clf.scorer_)

mean test scores:  [ 0.92132505  0.91718427  0.92546584  0.91925466  0.93167702  0.92339545
  0.93995859  0.94824017  0.94409938  0.95031056  0.94616977  0.96273292
  0.94824017  0.95445135]
best score:  0.962732919255
best  parameters:  {'C': 100, 'penalty': 'l1'}
scorer:  make_scorer(accuracy_score)


In [60]:
# methods of class GridSearchCV
# fit(X, y)   # train models for all parameter combinations using k-fold cross-validation, then (refit=True) retrain the best one on all of X
# predict(X)  # predict with the model refit on the best found parameters
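
The second item of this section, retraining with the best parameters, can be sketched as follows. With the default `refit=True`, GridSearchCV already retrains the best candidate on the whole training set and exposes it as `best_estimator_`. The reduced grid below is an assumption, chosen only to keep this sketch quick to run, so its best parameters need not match the output shown above.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = ds.load_breast_cancer(return_X_y=True)
y = 1 - y  # malignant = 1, benign = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

# reduced grid (assumption: smaller than the lecture's grid, for speed)
param_grid = {'penalty': ['l2', 'l1'], 'C': [0.1, 1, 10]}
clf = GridSearchCV(LogisticRegression(solver='liblinear', random_state=0),
                   param_grid=param_grid, cv=3, scoring='accuracy')
clf.fit(X_train, y_train)  # refit=True: the best candidate is retrained on all of X_train

best_model = clf.best_estimator_  # the retrained model
print('best parameters:', clf.best_params_)
print('test accuracy:  ', clf.score(X_test, y_test))  # evaluated with the retrained model

# retraining by hand with the best parameters follows the same recipe
retrained = LogisticRegression(solver='liblinear', random_state=0, **clf.best_params_)
retrained.fit(X_train, y_train)
```

The test set is touched only once, after the search has finished, so the reported test accuracy is not influenced by the parameter selection.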


In [61]:
# Questions:
# 1. How can we keep recall = 1 and increase the precision?
# 2. Why do we do k-fold CV on the training set?
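
As a starting point for question 1, one common approach (not necessarily the answer intended in the lecture) is to move the decision threshold applied to `predict_proba` instead of using the default 0.5. The sketch below assumes malignant is the positive class, uses cross-validated probabilities on the training set so the test set stays untouched, and tries three hypothetical thresholds.

```python
import sklearn.datasets as ds
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = ds.load_breast_cancer(return_X_y=True)
y = 1 - y  # malignant = 1 is the positive class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

model = LogisticRegression(solver='liblinear', random_state=0)

# out-of-fold probability of the positive class for every training sample
p_malignant = cross_val_predict(model, X_train, y_train, cv=3, method='predict_proba')[:, 1]

recalls = {}
for threshold in (0.5, 0.3, 0.1):  # hypothetical thresholds, for illustration
    y_pred = (p_malignant >= threshold).astype(int)
    recalls[threshold] = recall_score(y_train, y_pred)
    print(threshold, 'recall:', recalls[threshold],
          'precision:', precision_score(y_train, y_pred))
```

Lowering the threshold can only keep or raise recall, usually at the cost of precision, so the task becomes finding the highest threshold that still gives recall = 1.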