<a href="https://colab.research.google.com/github/anyuanay/INFO213/blob/main/INFO213_Week7_pipeline_crossV_hyperparameter_tuning_lecture.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 213: Data Science Programming 2
___

### Week 7: Pipeline, Cross Validation, and Hyperparameter Tuning


**Agenda:**
- Pipelines for wrapping up preprocessing and model-fitting steps.
- The need for cross-validation to assess model performance reliably.
- K-fold cross-validation.
- Understanding hyperparameters vs. parameters.
- Grid search: exhaustive search over specified hyperparameter values.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

# Why do we need model evaluation and selection?
- Machine learning involves choosing from a variety of algorithms and models.
- We need to compare different models and algorithms to determine which one performs best.
- Model selection and evaluation help ensure that a chosen model is neither overfitting nor underfitting and that it generalizes to unseen data.

## Underfitting and Overfitting

A common danger in machine learning is overfitting—producing a model that performs well on the data you train it on but that generalizes poorly to any new data.

The other side of this is underfitting, producing a model that doesn’t perform well
even on the training data, although typically when this happens you decide your
model isn’t good enough and keep looking for a better one.

![](https://i.imgur.com/oDDNc07.png)


The most fundamental approach to dealing with underfitting and overfitting is to use different data to train the
model and to test the model.

## The Bias and Variance Trade-Off
- An underfitting model makes many mistakes on pretty much any training set drawn from the same population, which means that it has high bias. Any two randomly chosen training sets should give pretty similar models (since any two randomly chosen training sets should have pretty similar average values), so we say that it has low variance.
- An overfitting model has very low bias but very high variance (since any two training sets would likely give rise to very different models).

Thinking about model problems this way can help you figure out what to do when your
model doesn’t work so well.

- If your model has high bias (which means it performs poorly even on your training data), then one thing to try is adding more features.
    - Adopt a more powerful model
    - Improve feature selection
    - Reduce regularization

- If your model has high variance, then you can instead remove features. Another solution is to obtain more data (if you can).
    - Decrease model complexity
    - Increase the training sample size
    - Add regularization (via parameter tuning; this may increase bias)

Holding model complexity constant, the more data you have, the harder it is to overfit.
On the other hand, more data won’t help with bias. If your model doesn’t use enough
features to capture regularities in the data, throwing more data at it won’t help.
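The trade-off can be made concrete with a small NumPy experiment. The following is only a sketch under stated assumptions: the target function `sin(3x)`, the noise level, and the polynomial degrees are hypothetical choices for illustration, not part of the lecture's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Draw noisy samples from a hypothetical target function sin(3x)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_tr, y_tr = make_data(30)    # small training set
x_te, y_te = make_data(200)   # separate data from the same population

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more complex model.
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    train_mse[degree] = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_te) - y_te) ** 2)
    print(f'degree {degree}: train MSE = {train_mse[degree]:.3f}, '
          f'test MSE = {test_mse[degree]:.3f}')
```

The straight line (degree 1) has a large error even on its own training data, which is the high-bias case. Training error can only go down as the degree grows, so a gap between training and test error at the highest degree is the signal to look for when diagnosing high variance.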

<img src="https://github.com/anyuanay/INFO213/blob/main/underfitting-overfitting.png?raw=true" width="600px" />

# Streamlining Workflows with Pipeline

## How to create and use a pipeline?

- Prior to training a model, we may apply a variety of preprocessing techniques on the data, such as standardization for feature scaling.
- We have to reuse the parameters that were obtained during the fitting of the training data to scale and compress any new data.
- Scikit-learn provides an extremely handy tool, the `Pipeline` class, that allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions about new data.

## We will use the Breast Cancer Wisconsin dataset for illustration:
### Load the data from the file:
- We will be working with the Breast Cancer Wisconsin dataset, which contains 569 examples of malignant and benign tumor cells.
- The first two columns in the dataset store the unique ID
numbers of the examples and the corresponding diagnoses (M = malignant, B = benign), respectively.
- Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.

```python
from google.colab import files
files.upload()
```

```python
df = pd.read_csv("wdbc.data", header=None)

df.head()
```

```python
df.shape
```

We will assign the 30 features to a NumPy array, X. Using a
LabelEncoder object, we will transform the class labels from their original string representation ('M' and 'B') into integers:

```python
from sklearn.preprocessing import LabelEncoder

X = df.loc[:, 2:].values
y = df.loc[:, 1].values
```

```python
X.shape, y.shape
```

```python
le = LabelEncoder()
y = le.fit_transform(y)
le.classes_
```

- After encoding the class labels (diagnosis) in an array, y, the malignant tumors are now represented as class 1, and the benign tumors are represented as class 0.
- We can double-check this mapping by calling the transform method of the fitted LabelEncoder on two dummy class labels:

```python
le.transform(['M', 'B'])
```

Divide the dataset into a separate training dataset (80 percent of the data) and a separate test dataset (20 percent of the data):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y,
                     test_size=0.20,
                     stratify=y,
                     random_state=1)
```

## How to combine transformers and estimators in a pipeline?
- We will standardize the columns in the Breast Cancer
Wisconsin dataset before we feed them to a linear classifier, such as logistic regression.
- We can also compress our data from the initial 30 dimensions onto a lower two-dimensional subspace via principal component analysis (PCA), a feature extraction technique for dimensionality reduction.
- We can chain the StandardScaler and LogisticRegression
objects in a pipeline (with or without PCA):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))
```

```python
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
```

The pipelines of the scikit-learn library are immensely useful wrapper tools. To make sure that you've got a good grasp of how the Pipeline object works, please take a close look at the following illustration:
<img src="https://i.imgur.com/1vxItHg.png" width = 800>

## Explain the pipeline:
- The `make_pipeline` function takes an arbitrary number of scikit-learn transformers (objects that support the `fit` and `transform` methods) as input, followed by a scikit-learn estimator that implements the `fit` and `predict` methods.
    - In our example, we provided two scikit-learn transformers, `StandardScaler` and `PCA`, and a `LogisticRegression` estimator as inputs to the `make_pipeline` function.
- We can think of a scikit-learn Pipeline as a meta-estimator or wrapper around those individual transformers and estimators.
- If we call the fit method of Pipeline, the data will be passed down a series of transformers via fit and transform calls on these intermediate steps until it arrives at the
estimator object.
- The estimator will then be fitted to the transformed
training data.
- We should note that there is no limit to the number of
intermediate steps in a pipeline; however, if we want to use the pipeline for prediction tasks, the last
pipeline element has to be an estimator.
- Similar to calling fit on a pipeline, pipelines also implement a predict method if the last step in the
pipeline is an estimator.
    - If we feed a dataset to the predict call of a Pipeline object instance, the data will pass through the intermediate steps via transform calls. In the final step, the estimator object will then return a prediction on the transformed data.
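The description above can be checked by chaining the fitted steps by hand. This is a minimal sketch, assuming scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for `wdbc.data`; note that the bundled copy encodes malignant as 0 and benign as 1, the opposite of the `LabelEncoder` mapping used in this notebook. All `*_demo` and `*_step` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=1)

demo_pipe = make_pipeline(StandardScaler(),
                          PCA(n_components=2),
                          LogisticRegression())
demo_pipe.fit(Xd_train, yd_train)

# make_pipeline names each step after its lowercased class name.
scaler_step = demo_pipe.named_steps['standardscaler']
pca_step = demo_pipe.named_steps['pca']
clf_step = demo_pipe.named_steps['logisticregression']

# predict() only calls transform() on the intermediate steps (no refitting),
# then predict() on the final estimator.
manual_pred = clf_step.predict(
    pca_step.transform(scaler_step.transform(Xd_test)))

predictions_match = np.array_equal(manual_pred, demo_pipe.predict(Xd_test))
print('Manual chain matches pipeline:', predictions_match)
```

Because the scaler and the PCA were fitted on the training split only, the test data is scaled and projected with the training statistics, which is exactly the reuse of parameters the pipeline is meant to guarantee.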


# Retrieval Practice on Previous Weeks

### Visualize the data points


```python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)
for label in [0, 1]:
    plt.scatter(
        X_pca[y_train == label, 0],
        X_pca[y_train == label, 1],
        label=f'Class {label}',
        alpha=0.7
    )
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Training Data')

plt.legend()
plt.show()
```

```python
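# Reuse the PCA fitted on the training data; refitting on the test set
# would project the test points onto different axes than the training plot.
X_test_pca = pca.transform(X_test)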
for label in [0, 1]:
    plt.scatter(
        X_test_pca[y_pred == label, 0],
        X_test_pca[y_pred == label, 1],
        label=f'Class {label}',
        alpha=0.7
    )
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Classified Test Data')

plt.legend()
plt.show()
```

### Pipeline without PCA

```python
pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(random_state=1, solver='lbfgs'))
```

```python
pipe_lr.fit(X_train, y_train)
```

```python
y_pred = pipe_lr.predict(X_test)
```

```python
from sklearn.metrics import confusion_matrix
```

```python
confusion_matrix(y_test, y_pred)
```

```python
from sklearn.metrics import precision_recall_fscore_support
```

```python
precision_recall_fscore_support(y_test, y_pred)
```

# K-Fold Cross Validation for Assessing Model Performance

## What is the held-out method and what is its purpose?
- In typical machine learning applications, we are interested in tuning and comparing different parameter settings to further improve the performance for making predictions on unseen data.
- This process is called **model selection**: for a given classification problem, we want to select the optimal values of the tuning parameters (also called hyperparameters).
- In machine learning, a "held-out" set is a portion of the dataset that is intentionally set aside and not used during the training of a model.
- The held-out set is reserved for evaluation purposes.
- The main idea is to have a separate dataset that the model has not seen during training, allowing us to assess how well the model generalizes to new, unseen data.
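One common way to realize this is a three-way split, where a validation set is used for model selection and the test set is touched only once at the end. The sketch below assumes that setup and uses a hypothetical toy array with hypothetical split ratios; it simply calls `train_test_split` twice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 examples, 2 features, balanced binary labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# First split: hold out 20% as the final test set.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=1)

# Second split: carve a validation set out of the remaining 80%
# (0.25 of 80 examples = 20 examples).
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

print(len(X_fit), len(X_val), len(X_hold))
```

Hyperparameters would be compared on `X_val`, and only the finally selected model would be scored on `X_hold`.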

<img src="https://i.imgur.com/t6Nyb0t.png" width=800>

## Why do we need multiple held-out sets?
- The performance estimate can be sensitive to how the single held-out set happens to be chosen.
- We can obtain a more reliable estimate of a model's performance by using multiple subsets of the dataset for training and testing.  
- Cross validation allows us to split the training data into folds for more robust evaluation.

## What is K-fold cross-validation?
- K-fold cross-validation avoids overlapping test sets
    - First step: split data into k subsets of equal size
    - Second step: use each subset in turn for testing, the remainder for training
- This means the learning algorithm is applied to k different training sets
- The error estimates are averaged to yield an overall error estimate; also, standard deviation is often computed
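The two steps above can be sketched in plain NumPy. The helper `kfold_indices` is a hypothetical name written for illustration (it is not scikit-learn's `KFold`); it shuffles the indices, cuts them into k nearly equal folds, and uses each fold once for testing.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k non-overlapping test folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)   # fold sizes differ by at most one
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

demo_splits = list(kfold_indices(n_samples=23, k=5))
for fold, (tr, te) in enumerate(demo_splits, start=1):
    print(f'Fold {fold}: {len(tr)} training examples, {len(te)} test examples')

# Every example appears in exactly one test fold.
all_test = np.sort(np.concatenate([te for _, te in demo_splits]))
```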

<img src="https://i.imgur.com/DfOfyn8.png" width=800>

## What is Stratified K-Fold cross validation?
- A slight improvement over the standard k-fold cross-validation approach is stratified k-fold cross-validation.
- In stratified cross-validation, the class label proportions
are preserved in each fold to ensure that each fold is representative of the class proportions in the training dataset.

```python
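# Class proportions in the training set
# (the top-level pd.value_counts is deprecated in recent pandas versions)
pd.Series(y_train).value_counts(normalize=True)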
```

```python
X_train.shape, y_train.shape
```

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


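# Materialize the splits as a list: split() returns a one-shot generator,
# and the folds are iterated more than once in the cells below.
kfold = list(StratifiedKFold(n_splits=10).split(X_train, y_train))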
```

```python
for k, (train, test) in enumerate(kfold):
    train_counts = np.bincount(y_train[train])
    test_counts = np.bincount(y_train[test])

    train_percent = 100 * train_counts / train_counts.sum()
    test_percent = 100 * test_counts / test_counts.sum()

    print('Fold: %2d, Train class dist.: %s, Test class dist: %s' % (
        k + 1,
        np.array2string(train_percent, precision=1, separator=', '),
        np.array2string(test_percent, precision=1, separator=', ')
    ))
```

```python
scores = []
for k, (train, test) in enumerate(kfold):
    pipe_lr.fit(X_train[train], y_train[train])
    score = pipe_lr.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold: %2d, Class dist.: %s, Acc: %.3f' % (k+1,
          np.bincount(y_train[train]), score))
```

```python
print('\nCV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```

## How to choose the number of folds in cross-validation?
- Standard method for evaluation: stratified ten-fold cross-validation
- Why ten?
    - Extensive experiments have shown that ten is about the right number of folds to get an accurate estimate
    - There is also some theoretical evidence for this
- Stratification reduces the estimate's variance
- Even better: repeated stratified cross-validation
    - E.g., ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance)
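Repeated stratified cross-validation is available in scikit-learn as `RepeatedStratifiedKFold`. The sketch below is an illustration under assumptions: it uses the bundled `load_breast_cancer` copy of the data as a stand-in for `wdbc.data`, and only three repetitions instead of ten to keep the run short. The `*_demo` and `rep_scores` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# 10 stratified folds, reshuffled and repeated 3 times -> 30 scores.
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

demo_pipe = make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=1000))
rep_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=rskf)

print('Number of scores:', len(rep_scores))
print('CV accuracy: %.3f +/- %.3f' % (rep_scores.mean(), rep_scores.std()))
```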

### Scikit-Learn K-Fold Cross-Validation Scorer
- For convenience, Scikit-Learn also implements a k-fold cross-validation scorer, which allows us to evaluate our model using stratified k-fold cross-validation less verbosely.
- An extremely useful feature of the `cross_val_score` approach is that we can distribute the evaluation of the different folds across multiple central processing units (CPUs) on our machine.
- If we set `n_jobs=2`, we could distribute the 10 rounds of cross-validation to two CPUs (if available on our machine), and by setting `n_jobs=-1`, we can use all available CPUs on our machine to do the computation in parallel.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=pipe_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print('CV accuracy scores: %s' % scores)
```

```python
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
```

# Debugging Algorithms with Learning and Validation Curves

- Learning curves and validation curves are two very simple yet powerful diagnostic tools that can help us improve the performance of a learning algorithm.

## Diagnosing bias and variance problems with learning curves

- If a model is too complex for a given training dataset,
it can help to collect more training examples to reduce the degree of overfitting.
- However, in practice, it can often be very expensive or simply not feasible to collect more data.
- By plotting the model training and validation accuracies as functions of the training dataset size, we can
easily detect whether the model suffers from high variance or high bias, and whether the collection
of more data could help to address this problem.

## How to use the learning_curve in scikit-learn?

<img src="https://i.imgur.com/zWLPRXO.png" width=800>

- The graph in the upper left shows a model with a high bias. This model has both low training and cross-validation accuracy.
- The graph in the upper-right shows a model that suffers from high variance, which is indicated by the
large gap between the training and cross-validation accuracy.


```python
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(penalty='l2', random_state=1,
                                           solver='lbfgs', max_iter=10000))

```

```python
train_sizes, train_scores, test_scores =\
                learning_curve(estimator=pipe_lr,
                               X=X_train,
                               y=y_train,
                               train_sizes=np.linspace(0.1, 1.0, 10),
                               cv=10,
                               n_jobs=1)
```

```python
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
```

```python
plt.plot(train_sizes, train_mean,
         color='blue', marker='o',
         markersize=5, label='Training accuracy')

plt.fill_between(train_sizes,
                 train_mean + train_std,
                 train_mean - train_std,
                 alpha=0.15, color='blue')

plt.plot(train_sizes, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='Validation accuracy')

plt.fill_between(train_sizes,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

plt.grid()
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim([0.8, 1.03])
plt.tight_layout()
# plt.savefig('images/06_05.png', dpi=300)
plt.show()
```

- As we can see in the preceding learning curve plot, our model performs quite well on both the training and validation datasets if it has seen more than 250 examples during training.
- We can also see that the
training accuracy increases for training datasets with fewer than 250 examples, and the gap between
validation and training accuracy widens—an indicator of an increasing degree of overfitting.

## How to address over- and underfitting with validation curves in scikit-learn?

- Validation curves are related to learning curves.
- But instead of plotting the training and test accuracies as functions of the sample size, we vary the values of the model parameters, for example, the inverse regularization parameter, C, in logistic regression.

```python
from sklearn.model_selection import validation_curve


param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(
                estimator=pipe_lr,
                X=X_train,
                y=y_train,
                param_name='logisticregression__C',
                param_range=param_range,
                cv=10)

```

```python
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
```

```python
plt.plot(param_range, train_mean,
         color='blue', marker='o',
         markersize=5, label='Training accuracy')

plt.fill_between(param_range, train_mean + train_std,
                 train_mean - train_std, alpha=0.15,
                 color='blue')

plt.plot(param_range, test_mean,
         color='green', linestyle='--',
         marker='s', markersize=5,
         label='Validation accuracy')

plt.fill_between(param_range,
                 test_mean + test_std,
                 test_mean - test_std,
                 alpha=0.15, color='green')

plt.grid()
plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.ylim([0.8, 1.0])
plt.tight_layout()
# plt.savefig('images/06_06.png', dpi=300)
plt.show()
```

- Similar to the `learning_curve` function, the `validation_curve` function uses stratified k-fold cross-validation by default to estimate the performance of the classifier.
- Although the differences in the accuracy for varying values of C are subtle, we can see that the model slightly underfits the data when we increase the regularization strength (small values of C).
- Large values of C, in contrast, lower the strength of regularization, so the model tends to slightly overfit the data. In this case, the sweet spot appears to be between C = 0.01 and C = 0.1.

# Model Selection
- Cross-validation gives a robust performance estimate for a single model.
- How can we systematically evaluate multiple models and select the right one?
- Grid search is a hyperparameter tuning technique used in machine learning to systematically search for the optimal combination of hyperparameter values for a given model.

# How to fine-tune machine learning models via grid search?
- There are two types of parameters: those that are learned from the training data, for example, the weights in logistic regression, and the parameters of a learning algorithm that are optimized separately.
- The latter are the tuning parameters (or hyperparameters) of a model, for example, the regularization parameter in logistic regression or the depth parameter of a decision tree.
- Grid search is a popular hyperparameter optimization
technique.

## What is grid-search and how to use it in scikit-learn to tune hyperparameters?
- It's a brute-force exhaustive search paradigm.
- We specify a list of values for different
hyperparameters.
- The computer evaluates the model performance for each combination to obtain the optimal combination of values from this set.
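To see what "each combination" means in practice, the grid can be enumerated with scikit-learn's `ParameterGrid`. This is a small sketch; the grid values mirror the ones used for logistic regression in this notebook, and `demo_grid` and `combos` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid

demo_grid = {
    'logisticregression__C': [0.01, 0.1, 1.0, 10.0, 100.0],
    'logisticregression__solver': ['lbfgs', 'liblinear'],
}

# ParameterGrid expands the dictionary into the Cartesian product of all values.
combos = list(ParameterGrid(demo_grid))
print('Number of combinations:', len(combos))
for params in combos[:3]:
    print(params)
```

With 5 values of `C` and 2 solvers there are 10 combinations, so a 10-fold grid search fits 100 models before the final refit. The cost grows multiplicatively with every hyperparameter added to the grid.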

<img src="https://github.com/anyuanay/INFO213/blob/main/grid-search-2d.png?raw=true" width="600px" />

```python
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1))

```

```python
param_range = [0.01, 0.1, 1.0, 10.0, 100.0]

param_grid = {'logisticregression__C': param_range, 'logisticregression__solver': ('lbfgs', 'liblinear')}

gs = GridSearchCV(estimator=pipe_lr,
                  param_grid=param_grid,
                  scoring='accuracy',
                  refit=True,
                  cv=10,
                  n_jobs=-1)
```

```python
gs = gs.fit(X_train, y_train)
```

```python
print(gs.best_score_)
print(gs.best_params_)
```

```python
clf = gs.best_estimator_

# clf.fit(X_train, y_train)
# note that we do not need to refit the classifier
# because this is done automatically via refit=True.

print('Test accuracy: %.3f' % clf.score(X_test, y_test))
```

## Exploring hyperparameter configurations more widely with randomized search

- Specifying large hyperparameter
grids makes grid search very expensive in practice.
- An alternative approach for sampling different
parameter combinations is randomized search.
- We draw hyperparameter configurations randomly from distributions (or discrete sets).
- Randomized search does not do an exhaustive search over the hyperparameter space.
- Still, it allows us to explore a wider range of hyperparameter value settings in a more cost- and time-effective manner.

```python
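import scipy.stats  # import the submodule explicitly; a bare `import scipy` may not load it on older SciPy versions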
param_range = scipy.stats.loguniform(0.0001, 1000.0)
```

- Using a loguniform distribution instead of a regular uniform distribution will ensure that in a sufficiently large number of trials, about the same number of samples will be drawn from the [0.0001, 0.001] range as, for example, the [10.0, 100.0] range.
- To check its behavior, we can draw 10 random samples from this distribution via the `rvs(10)` method, as shown here:

```python
np.random.seed(1)
param_range.rvs(10)
```
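Ten samples are too few to see the equal-share-per-decade property, so the following sketch draws a much larger sample and compares two decades directly. The sample size and the names `demo_dist`, `low_frac`, and `high_frac` are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import loguniform

demo_dist = loguniform(0.0001, 1000.0)   # spans 7 decades: 1e-4 ... 1e3
samples = demo_dist.rvs(size=70000, random_state=1)

# Fraction of samples that fall into two different decades.
low_frac = np.mean((samples >= 0.0001) & (samples < 0.001))
high_frac = np.mean((samples >= 10.0) & (samples < 100.0))

print('Share in [0.0001, 0.001): %.3f' % low_frac)
print('Share in [10, 100):       %.3f' % high_frac)
print('Expected share per decade: %.3f' % (1 / 7))
```

A regular uniform distribution over the same interval would put almost none of its samples into the lowest decade, which is why the loguniform distribution is the usual choice for scale-type hyperparameters such as `C`.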

Let us now see the RandomizedSearchCV in action and tune a LogisticRegression as we did with GridSearchCV:

```python
from sklearn.model_selection import RandomizedSearchCV
```

```python
param_grid = {'logisticregression__C': param_range, 'logisticregression__solver': ('lbfgs', 'liblinear')}

rs = RandomizedSearchCV(estimator=pipe_lr,
                  param_distributions=param_grid,
                  scoring='accuracy',
                  refit=True,
                  cv=10,
                  n_jobs=-1)
```

```python
rs.fit(X_train, y_train)
```

```python
print(rs.best_score_)
print(rs.best_params_)
```

```python
clf = rs.best_estimator_

print('Test accuracy: %.3f' % clf.score(X_test, y_test))
```

## What is nested cross-validation? How to use it?

- Using k-fold cross-validation in combination with grid search or randomized search is a useful approach
for fine-tuning the performance of a machine learning model by varying its hyperparameter values.
- If we want to select among different machine
learning algorithms, though, another recommended approach is nested cross-validation.
- In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training and test folds.
- An inner loop is used to select the model using k-fold cross-validation on the training fold.
- After model selection, the test fold is then used to evaluate the model performance.
<img src="https://i.imgur.com/6ny8VBz.png" width=800>

```python
from sklearn.model_selection import cross_val_score

param_range = [0.01, 0.1, 1.0, 10.0, 100.0]

param_grid = {'logisticregression__C': param_range, 'logisticregression__solver': ('lbfgs', 'liblinear')}


gs = GridSearchCV(estimator=pipe_lr,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=2)

scores = cross_val_score(gs, X_train, y_train,
                         scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
                                      np.std(scores)))
```

We can use the nested cross-validation approach to compare a logistic regression model to a simple decision tree classifier:

```python
from sklearn.tree import DecisionTreeClassifier

gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                  scoring='accuracy',
                  cv=2)

scores = cross_val_score(gs, X_train, y_train,
                         scoring='accuracy', cv=5)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores),
                                      np.std(scores)))
```

As we can see, the nested cross-validation performance of the logistic regression model (94.1 percent) is better than the performance of the decision tree (93.4 percent), and thus, we'd expect that it might be the better choice to classify new data that comes from the same population as this particular dataset.

# Looking at Different Performance Evaluation Metrics
## Precision, Recall, and F-Score

Given a set of labeled data and a predictive model, every data point lies in one of four categories:
* True positive (tp): “This message is spam, and we correctly predicted spam.”
* False positive (fp) (Type 1 Error): “This message is not spam, but we predicted spam.”
* False negative (fn) (Type 2 Error): “This message is spam, but we predicted not
spam.”
* True negative (tn): “This message is not spam, and we correctly predicted not spam.”

We often represent these as counts in a confusion matrix:

| | Actual positive | Actual negative |
|------|------|------|
| Predicted positive | tp | fp |
| Predicted negative | fn | tn |

$accuracy = \frac{tp + tn}{tp + fp+tn+fn}$

$precision = \frac{tp}{tp+fp}$

$recall = \frac{tp}{tp+fn}$

$F1\_Score = \frac{2}{\frac{1}{recall} + \frac{1}{precision}}$
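The four formulas translate directly into code. The sketch below uses a hypothetical helper, `classification_metrics`, and made-up counts for a spam filter evaluated on 1,000 messages; the numbers are for illustration only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 (otherwise precision or recall is undefined).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 / (1 / recall + 1 / precision)   # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Hypothetical counts: 80 spam messages, 920 non-spam messages.
acc, prec, rec, f1 = classification_metrics(tp=70, fp=30, fn=10, tn=890)
print(f'accuracy  = {acc:.3f}')    # 0.960
print(f'precision = {prec:.3f}')   # 0.700
print(f'recall    = {rec:.3f}')    # 0.875
print(f'F1 score  = {f1:.3f}')     # 0.778
```

Accuracy looks high here mostly because non-spam dominates; precision and recall show that 30 percent of the flagged messages are false alarms and 12.5 percent of the spam slips through.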

```python
pipe_lr = make_pipeline(StandardScaler(),
                        LogisticRegression(random_state=1, solver='lbfgs'))
```

```python
pipe_lr.fit(X_train, y_train)
```

```python
y_pred = pipe_lr.predict(X_test)
```

```python
from sklearn.metrics import confusion_matrix
```

```python
confmat = confusion_matrix(y_test, y_pred)
confmat
```

```python
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j],
                va='center', ha='center')
ax.xaxis.set_ticks_position('bottom')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

```
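When reading the matrix returned by scikit-learn, it helps to know that `confusion_matrix` puts the true labels in rows and the predicted labels in columns, so for binary labels 0/1 the flattened order is tn, fp, fn, tp. The sketch below shows this on a hypothetical six-example toy vector and cross-checks the result against `precision_score` and `recall_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels (1 = positive class).
y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 0, 1, 0])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = cm_toy.ravel()
print('tn, fp, fn, tp =', tn, fp, fn, tp)

prec_toy = precision_score(y_true_toy, y_pred_toy)
rec_toy = recall_score(y_true_toy, y_pred_toy)
print('precision = %.3f, recall = %.3f' % (prec_toy, rec_toy))
```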

# Plotting a receiver operating characteristic
- Receiver operating characteristic (ROC) graphs are useful tools to select models for classification
based on their performance with respect to the false positive rate (FPR) and true positive rate (TPR).
- The diagonal of a ROC graph can be interpreted as random guessing, and classification models that fall below the diagonal are considered as worse than random guessing.
- A perfect classifier would fall into the top-left corner of the graph with a TPR of 1 and an FPR of 0.
- Based on the ROC curve, we can then compute the so-called ROC area under the curve (ROC AUC) to characterize the performance of a classification model.
- Similar to ROC curves, we can compute precision-recall curves for different probability thresholds of a classifier.
- A function for plotting those precision-recall curves is also implemented in scikit-learn.


```python
from sklearn.metrics import roc_curve, auc
from numpy import interp

pipe_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(penalty='l2', random_state=1,
        solver='lbfgs', C=100.0)
    )
```

```python
# we are only using two features this time. This is to make the classification task more challenging for the
# classifier, by withholding useful information contained in the other features, so that the resulting
# ROC curve becomes visually more interesting.
X_train2 = X_train[:, [4, 14]]
```

```python
# For similar reasons, we are also reducing the number
# of folds in the StratifiedKFold validator to three.

from sklearn.model_selection import StratifiedKFold

cv = list(StratifiedKFold(n_splits=3).split(X_train, y_train))
```

```python
fig = plt.figure(figsize=(7, 5))

mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
all_tpr = []

for i, (train, test) in enumerate(cv):
    probas = pipe_lr.fit(X_train2[train],
                         y_train[train]).predict_proba(X_train2[test])

    fpr, tpr, thresholds = roc_curve(y_train[test],
                                     probas[:, 1],
                                     pos_label=1)
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr,
             tpr,
             label=f'ROC fold {i+1} (area = {roc_auc:.2f})')

plt.plot([0, 1],
         [0, 1],
         linestyle='--',
         color=(0.6, 0.6, 0.6),
         label='Random guessing (area = 0.5)')

mean_tpr /= len(cv)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
         label=f'Mean ROC (area = {mean_auc:.2f})', lw=2)
plt.plot([0, 0, 1],
         [0, 1, 1],
         linestyle=':',
         color='black',
         label='Perfect performance (area = 1.0)')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')

plt.tight_layout()
# plt.savefig('figures/06_10.png', dpi=300)
plt.show()
```

- We interpolated the average ROC curve from the three folds via the `interp` function that we imported from NumPy and calculated the area under the curve via the `auc` function.
- The resulting ROC curve indicates that there is a certain degree of variance between the different folds, and the average
ROC AUC (0.76) falls between a perfect score (1.0) and random guessing (0.5).
- Note that if we are just interested in the ROC AUC score, we could also directly import the roc_auc_score function from the sklearn.metrics.
- Reporting the performance of a classifier as the ROC AUC can yield further insights into a classifier's performance with respect to imbalanced samples.
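As a quick check that `roc_auc_score` gives the same number as integrating the curve by hand, the sketch below compares both routes on a hypothetical four-example toy problem. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class.
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: build the ROC curve, then integrate it.
fpr_toy, tpr_toy, _ = roc_curve(y_true_toy, scores_toy)
auc_from_curve = auc(fpr_toy, tpr_toy)

# Route 2: compute the score directly.
auc_direct = roc_auc_score(y_true_toy, scores_toy)

print('AUC via roc_curve + auc: %.2f' % auc_from_curve)
print('AUC via roc_auc_score:   %.2f' % auc_direct)
```

The value can be read as the share of (positive, negative) pairs that the classifier ranks correctly: here three of the four pairs are ordered correctly, which gives 0.75.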

# Dealing with class imbalance
- Class imbalance is a quite common problem when working with real-world data.
- Imagine that the Breast Cancer Wisconsin dataset that we've been working with in this lecture consisted of 90 percent healthy patients.
- In that case, a model could reach 90 percent accuracy simply by predicting the majority class (benign tumor) for every example. A model that achieves approximately 90 percent test accuracy on such a dataset would therefore not have learned anything useful from the features provided.
- Let us create an imbalanced dataset from our dataset, which originally consisted of 357 benign tumors (class 0) and 212 malignant tumors (class 1):


```python
X_imb = np.vstack((X[y == 0], X[y == 1][:40]))
y_imb = np.hstack((y[y == 0], y[y == 1][:40]))
```

```python
X_imb.shape, y_imb.shape
```

```python
np.mean(y_imb)
```
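To see why plain accuracy is misleading on such data, consider a "model" that always predicts the majority class. The sketch below rebuilds only the label vector from the class counts stated above (357 benign, 40 malignant) so that it runs on its own; `y_toy_imb` and `y_pred_majority` are hypothetical names.

```python
import numpy as np

# Labels only: 357 benign (class 0) and 40 malignant (class 1) examples.
y_toy_imb = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))

# A trivial baseline that predicts "benign" for every example.
y_pred_majority = np.zeros_like(y_toy_imb)

baseline_acc = np.mean(y_pred_majority == y_toy_imb)
print('Majority-class baseline accuracy: %.2f%%' % (100 * baseline_acc))   # 89.92%
```

Any classifier trained on this data has to beat roughly 90 percent accuracy before it can be said to have learned something, and even then recall on the malignant class is the more informative number.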

- One way to deal with imbalanced class proportions during model fitting is to assign a larger penalty to wrong predictions on the minority class.
- In scikit-learn, adjusting such a penalty is as convenient as setting `class_weight='balanced'`, which is implemented for most classifiers.
- Other popular strategies for dealing with class imbalance include upsampling the minority class, downsampling the majority class, and the generation of synthetic training examples.
- There's no universally best solution or technique that works best across different problem domains.
- It is recommended to try out different strategies on a given problem, evaluate the results, and choose the technique that seems most appropriate.
- The scikit-learn library implements a simple resample function that can help with the upsampling of the minority class by drawing new samples from the dataset with replacement.

```python
from sklearn.utils import resample

print('Number of class 1 examples before:', X_imb[y_imb == 1].shape[0])
```

```python
X_upsampled, y_upsampled = resample(X_imb[y_imb == 1],
                                    y_imb[y_imb == 1],
                                    replace=True,
                                    n_samples=X_imb[y_imb == 0].shape[0],
                                    random_state=123)

print('Number of class 1 examples after:', X_upsampled.shape[0])
```

```python
X_bal = np.vstack((X[y == 0], X_upsampled))
y_bal = np.hstack((y[y == 0], y_upsampled))
```

```python
np.mean(y_bal)
```
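Besides resampling, the `class_weight='balanced'` option mentioned earlier can be inspected directly. The sketch below uses `compute_class_weight` to show which weights the "balanced" setting corresponds to for 357 benign and 40 malignant examples; the label vector is rebuilt from those counts, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Labels only: 357 benign (class 0) and 40 malignant (class 1) examples.
y_toy_imb = np.hstack((np.zeros(357, dtype=int), np.ones(40, dtype=int)))

# 'balanced' weights are n_samples / (n_classes * count_of_each_class).
balanced_weights = compute_class_weight(class_weight='balanced',
                                        classes=np.array([0, 1]),
                                        y=y_toy_imb)
print(dict(zip([0, 1], balanced_weights.round(4))))

# Passing class_weight='balanced' lets the classifier apply these weights itself.
weighted_lr = LogisticRegression(class_weight='balanced', max_iter=1000)
```

Errors on the minority class are weighted roughly nine times more heavily than errors on the majority class, which is the penalty adjustment described above, achieved without duplicating any rows.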
