In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab5.ipynb")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from resources.hashutils import *
rng_seed=454

---

<h1><center>SDSE Lab 5 <br><br> Scikit-learn, logistic regression, feature selection, and regularization</center></h1>

---

In this lab we will build a model for diagnosing breast cancer from various measurements of a tumor. To do this we will use [scikit-learn](https://scikit-learn.org/stable/), which is a package for performing a host of machine learning tasks. We will learn about scikit-learn's train-test data splitter, its standard scaler, pipelines, cross-validation, and LASSO regularization. 

The lab has 11 parts.

**Preliminaries**

1. Load the data
2. Extract test data
3. Normalize the inputs

**Simple logistic regression**

4. Most correlated feature
5. Train simple logistic regression
6. Create a scikit-learn pipeline
7. Evaluate the three models with cross-validation

**Regularization**

8. Lasso regularized logistic regression
9. Choose the best model
10. Significant features
11. Evaluate the final model with test data


---

<h1><center><font color='purple'>Preliminaries</font><br></center></h1>

# 1. Load the data

The [dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)) originates from the University of Wisconsin and is included in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), as well as in scikit-learn's collection of [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html). It can be loaded with scikit-learn's [load_breast_cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) method. 

```python
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True).frame
```

Passing `as_frame=True` prompts the loader to return a pandas DataFrame. The raw dataset encodes a benign tumor as a 1 and a malignant tumor as a 0. We flip these tags so that the encoding agrees with the convention of a malignant tumor producing a "positive" outcome (1) and a benign tumor producing a "negative" outcome (0).

```python
data['target'] = 1-data['target']
```


In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer(as_frame=True).frame
data['target'] = 1-data['target']

Use `data.info()` to display a summary of the dataset. 

In [None]:
data.info()

# 2. Extract test data

The first step is to set aside a portion of the data for final testing. Use scikit-learn's [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to create the testing and training datasets. 

Note: `train_test_split` takes these arguments:
1. The input samples: Use `data.iloc` to select all rows and all but the last column. 
2. The target (output) samples: The last column of `data` (named "target")
3. `test_size` is the portion of the dataset reserved for testing. You should set this to 20% (0.2).
4. Pass `random_state=rng_seed` to fix the random seed and ensure reproducibility of the results. 
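
As a rough sketch of how these arguments fit together, the example below splits a small made-up table. The DataFrame `toy`, the variable names, and the seed `0` are illustrative assumptions and are not part of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table: two feature columns followed by a 'target' column
toy = pd.DataFrame({
    "f1": np.arange(10, dtype=float),
    "f2": np.arange(10, dtype=float) ** 2,
    "target": [0, 1] * 5,
})

X_toy = toy.iloc[:, :-1]   # all rows, all but the last column
y_toy = toy["target"]      # the last column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```

The inputs and targets are shuffled together, so each row of `X_tr` keeps the same index label as its entry in `y_tr`.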

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(...,
                                                ...,
                                                test_size=...,
                                                random_state=rng_seed )

In [None]:
grader.check("q2")

# 3. Normalize the inputs

Some models benefit from "normalizing" the input data. This operation brings all of the inputs into a similar range of values, which can be helpful to numerical methods. To normalize a numerical column is to subtract its mean and divide by its standard deviation.

\begin{equation*}
\widetilde{X} = \frac{X-\text{mean}(X)}{\text{std}(X)}
\end{equation*}

Where $X$ is an input column of the DataFrame. 
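
To make the formula concrete, here is a minimal sketch that normalizes one made-up column by hand and compares the result with `StandardScaler`. The array `x` is a hypothetical example, not lab data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical column of measurements
x = np.array([2.0, 4.0, 6.0, 8.0])

# Manual normalization; np.std defaults to the population standard deviation (ddof=0)
x_manual = (x - x.mean()) / x.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
x_scaled = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(x_manual)
print(np.allclose(x_manual, x_scaled))
```

Note that `StandardScaler` divides by the population standard deviation, whereas the pandas `std` method defaults to the sample standard deviation (`ddof=1`), so the two can differ slightly on small datasets.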

Models whose training procedure is impervious to poorly scaled inputs are called "scale invariant". Scale invariant models do not benefit from normalizing the inputs. Logistic regression is *not* scale invariant. We will use scikit-learn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to perform the normalization of the training input data (`Xtrain`) for logistic regression. The normalized table should be stored in a separate pandas DataFrame called `Xtrain_norm`. 

**Hints**: 
+ Obtain the index of a DataFrame df with [df.index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html)
+ Obtain the column headers of a DataFrame with [df.columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html)

In [None]:
from sklearn.preprocessing import StandardScaler

# Create a scaler object
scaler = StandardScaler()

# Use the fit_transform method to perform the normalization of columns
X = scaler.fit_transform(...)

# Format the normalized input as a DataFrame
Xtrain_norm = pd.DataFrame(X, index=..., columns=...) 

In [None]:
grader.check("q3")


---

<h1><center><font color='purple'> Simple logistic regression</font><br></center></h1>


# 4. Most correlated feature

Our first model will be a simple logistic regression model based on the single feature that best correlates with the output. Find this feature and save its name (i.e. its header value) to `best_single_feature`. 
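
One possible way to rank features by correlation is sketched below on synthetic data. The column names (`noise`, `weak`, `strong`) and the data are invented for illustration. Ranking by absolute value is an assumption of this sketch: it treats a strong negative correlation as just as informative as a strong positive one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
y_toy = pd.Series(rng.integers(0, 2, size=n), name="target")

# Hypothetical features: 'strong' tracks the target closely, 'weak' barely, 'noise' not at all
X_toy = pd.DataFrame({
    "noise": rng.normal(size=n),
    "weak": 0.2 * y_toy + rng.normal(size=n),
    "strong": -3.0 * y_toy + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of each column with the target
corr = X_toy.corrwith(y_toy)

# Rank by magnitude so that strong negative correlations are not overlooked
best = corr.abs().idxmax()
print(corr)
print(best)
```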

In [None]:
...
best_single_feature = ...

In [None]:
grader.check("q4")

# 5. Train simple logistic regression

Next we train the simple logistic regression model for the feature that was selected in the previous part. For this we will use scikit-learn's implementation of [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). 

1. Pass `random_state=rng_seed` into the LogisticRegression constructor to ensure repeatability of the results. 
2. Call the [`fit`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) function of the model object, passing in the training data. The model input corresponds to the single best feature already identified.
3. Extract the trained model coefficients. The intercept term $\hat\theta_0$ is stored in the `intercept_[0]` attribute of the model. The remaining coefficients $\hat\theta_1$ through $\hat\theta_P$ (in this case just $\hat\theta_1$) are in `coef_[0,:]`.

This has been done for you with the original (un-normalized) input data. Repeat the exercise with the normalized data. 
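
To see what the extracted coefficients mean, the sketch below fits a one-feature model on synthetic data and reproduces its predicted probabilities from $\hat\theta_0$ and $\hat\theta_1$ with the logistic function. The data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical single-feature dataset: larger x makes class 1 more likely
x = rng.normal(size=(100, 1))
y = (x[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression(random_state=0).fit(x, y)
theta0 = clf.intercept_[0]
theta1 = clf.coef_[0, 0]

# P(y=1 | x) = 1 / (1 + exp(-(theta0 + theta1 * x)))
p_manual = 1.0 / (1.0 + np.exp(-(theta0 + theta1 * x[:, 0])))
p_model = clf.predict_proba(x)[:, 1]
print(theta0, theta1)
print(np.allclose(p_manual, p_model))
```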

In [None]:
from sklearn.linear_model import LogisticRegression

model_nonorm = LogisticRegression(random_state=rng_seed)
model_nonorm.fit(Xtrain[[best_single_feature]],ytrain) 
print(model_nonorm.intercept_[0], model_nonorm.coef_[0,:])

model_norm = ...

In [None]:
grader.check("q5")

# 6. Create a scikit-learn pipeline

Scikit-learn provides a *pipeline* class that collects all of the preprocessing, feature transformation, and modeling components into a single object with `fit` and `predict` methods. You can  read the documentation on [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to learn more. 

Each component in the pipeline is identified with a string name. The following code creates a pipeline with a `StandardScaler` labeled as `scaler`, followed by a logistic regression model labeled as `logreg`.

``` python
pipeline = Pipeline([('scaler', StandardScaler()), 
                     ('logreg', LogisticRegression(random_state=rng_seed)) ])
```

Create this pipeline and train it on the `best_single_feature` of the un-normalized dataset (`Xtrain`,`ytrain`) using the `fit` method. 
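
The sketch below, on synthetic data, suggests what the pipeline does internally when `fit` is called on the whole dataset: it scales the inputs and then fits the model on the scaled values. The data, the seed, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical poorly scaled feature (values in the hundreds)
X = 500.0 + 100.0 * rng.normal(size=(120, 1))
y = (X[:, 0] + 50.0 * rng.normal(size=120) > 500.0).astype(int)

# Route 1: scale by hand, then fit
X_norm = StandardScaler().fit_transform(X)
manual = LogisticRegression(random_state=0).fit(X_norm, y)

# Route 2: let the pipeline do both steps on the raw data
pipe = Pipeline([("scaler", StandardScaler()),
                 ("logreg", LogisticRegression(random_state=0))])
pipe.fit(X, y)

# Individual steps are reachable by name
print(manual.coef_, pipe.named_steps["logreg"].coef_)
```

A practical advantage of the pipeline is that `predict` applies the stored scaling to new inputs automatically, so raw data can be passed in directly.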

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(...)
pipeline.fit(...) 

In [None]:
grader.check("q6")

# 7. Evaluate the three models with cross-validation

In diagnosing disease, it is important to carefully consider the negative impacts of both false positives and false negatives. A positive diagnosis, whether true or false, will cause significant anxiety and stress to patients and their families. It may also lead to further testing, which can be expensive, time consuming, and possibly dangerous, all of which would be unnecessary if the diagnosis were false. A false negative, on the other hand, means that a sick patient goes undiagnosed, which can have even more severe consequences. 

In cancer diagnosis, false negatives are generally considered worse than false positives. The performance metric used to evaluate a cancer diagnosis tool should therefore tilt toward recall, rather than precision. In this lab exercise we will use the $F_\beta$ score with $\beta$ set to 3.0. 
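
For reference, with precision $P$ and recall $R$, the score is $F_\beta = (1+\beta^2)\,\frac{P R}{\beta^2 P + R}$, so values of $\beta$ greater than 1 weight recall more heavily. The following sketch checks this formula against scikit-learn on a handful of made-up labels. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: 1 = malignant, 0 = benign
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)

beta = 3.0
f_manual = (1 + beta**2) * p * r / (beta**2 * p + r)
f_sklearn = fbeta_score(y_true, y_pred, beta=beta)
print(p, r, f_manual, f_sklearn)
```

In this example the score lands much closer to the recall than to the precision, which is the intended effect of choosing $\beta = 3$.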


K-fold cross-validation is a model evaluation technique that provides an unbiased estimate of model performance without sacrificing any of the training data. It does this by splitting the training set into K equal parts (or "folds"), and then training K separate models, each with one of the K parts used as validation data and the remaining K-1 parts as training data. 
We will use 4-fold cross-validation to evaluate the $F_\beta$ score of our three models: `model_nonorm`, `model_norm`, and `pipeline`.
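
The fold structure can be illustrated with scikit-learn's `KFold` splitter on eight hypothetical samples. This is only a sketch of the idea. When `cross_val_score` is given a classifier and an integer number of folds, it uses a stratified variant that also tries to preserve the class proportions in each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Eight hypothetical samples, identified by their row number
indices = np.arange(8)

kf = KFold(n_splits=4)
validation_counts = np.zeros(len(indices), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kf.split(indices)):
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
    validation_counts[val_idx] += 1

# Every sample serves as validation data exactly once
print(validation_counts)
```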

Note the following:
+ The $F_\beta$ score is implemented in scikit-learn's [`fbeta_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html) method.
+ Cross-validation is implemented in scikit-learn's [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) method.  
+ To use `fbeta_score` with `cross_val_score`, one must first create a "callable scoring object" by passing `fbeta_score` into [`make_scorer`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer).
+ The first three arguments for the `cross_val_score` are the model, the training input data, and the training output data. These latter two entries are the same as were passed to the `fit` function in part 6. 
+ Pass the scoring object returned by `make_scorer` to `cross_val_score` via its `scoring` argument.
+ `cross_val_score` should return 4 values of $F_\beta$; one for each of the folds. Store the *mean* of these as `fbeta_nonorm`, `fbeta_norm`, and `fbeta_pipe` for the un-normalized, normalized, and pipeline models respectively. 
+ Note the improvement due to normalization. What do you think might account for the difference?
+ Compare the performance of the normalized model and the pipeline. Does this make sense?
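
The pattern described in these notes might look as follows on synthetic stand-in data. The dataset from `make_classification`, the names `X_demo`, `y_demo`, and `demo_scorer`, and the value `beta=2.0` are all arbitrary choices for illustration and are not the lab's values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; not the lab's dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# beta=2.0 is an arbitrary illustrative value
demo_scorer = make_scorer(fbeta_score, beta=2.0)

scores = cross_val_score(LogisticRegression(random_state=0),
                         X_demo, y_demo, scoring=demo_scorer, cv=4)
print(scores)         # one score per fold
print(scores.mean())  # the summary value
```

Note that `cross_val_score` fits fresh copies of the model on each fold and leaves the model object that was passed in unfitted.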

In [None]:
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

fbeta_scorer = make_scorer(fbeta_score, beta=...)
fbeta_nonorm = ...
fbeta_norm = ...
fbeta_pipe = ...

In [None]:
grader.check("q7")


---

<h1><center><font color='purple'>Regularization</font><br></center></h1>


# 8. Lasso regularized logistic regression

Regularization is a method for avoiding overfitting by penalizing the complexity of the model in the training process. Lasso regularization in particular penalizes the sum of the absolute values of the slope parameters. It has the property that it will tend to "zero out" coefficients as the penalty $\lambda$ increases. This gives it an additional role as a feature selection technique. 

In this part we will train a lasso regularized logistic regression model. Instead of $\lambda$, scikit-learn uses the `C` parameter of `LogisticRegression`, which is proportional to the inverse of $\lambda$. Hence, increasing `C` results in *less* regularization, and generally larger parameter values. 
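
The effect of `C` on sparsity can be seen in a small experiment on synthetic data. The sketch below is illustrative only: the dataset, the three values of `C`, and the dictionary `n_zero` are assumptions of the example. It counts how many coefficients are exactly zero at each setting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 features, only 3 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

n_zero = {}
for C in [0.01, 1.0, 100.0]:
    model = Pipeline([("scaler", StandardScaler()),
                      ("logreg", LogisticRegression(C=C, penalty="l1",
                                                    solver="liblinear",
                                                    random_state=0))])
    model.fit(X_demo, y_demo)
    coefs = model.named_steps["logreg"].coef_[0, :]
    n_zero[C] = int(np.sum(coefs == 0.0))
    print(f"C={C:<6} zero coefficients: {n_zero[C]} of {coefs.size}")
```

Small values of `C` (strong regularization) should leave few or no non-zero coefficients, while large values let most features back into the model.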

Write code that loops through a logarithmically spaced array of 20 regularization parameters `C` ranging from $10^{-2}$ to $10^{2}$. Create this array using [np.logspace](https://numpy.org/devdocs/reference/generated/numpy.logspace.html) and store it as `Cs`.
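
As a quick reminder of how `np.logspace` behaves, its first two arguments are exponents (base 10 by default) and the third is the number of points. The range used below is an arbitrary example.

```python
import numpy as np

# Five points from 10**-3 to 10**1; the exponents, not the values, are evenly spaced
grid = np.logspace(-3, 1, 5)
print(grid)
print(np.log10(grid))
```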

For each value in the array, the code should 
1. train and evaluate a logistic regression pipeline. 
2. Store the model in a list of models called `models`. 
3. Store the $F_\beta$ score (with $\beta=3$) in a NumPy array called `fbetas`.

Your pipeline should have two components: a `StandardScaler` for normalizing the data, followed by a `LogisticRegression` model. When building the pipeline, you should pass these parameters to the `LogisticRegression` constructor: 

```python 
LogisticRegression(C=Cs[c],
                   penalty='l1',
                   solver='liblinear',
                   random_state=rng_seed)
```

+ `penalty='l1'` specifies the lasso penalty, rather than the ridge penalty. 
+ `solver='liblinear'` tells scikit-learn to use the liblinear library for solving logistic regression. This is a good choice for lasso regularization. 
+ `random_state=rng_seed` ensures that the result is reproducible.

Note:
+ Use [np.logspace](https://numpy.org/devdocs/reference/generated/numpy.logspace.html) to generate the array of `C` values. 
+ Use the same performance metric and cross validation approach as in part 7.

In [None]:
fbeta_scorer = make_scorer(fbeta_score, beta=...)
Cs = ...
fbetas = ...
models = ...

for c, C in enumerate(Cs):   

    # Create a pipeline model 
    model = Pipeline([ ... ])
    
    # Fit the model using the training data
    model.fit(...,...)     

    # Store the model in the models list
    ...

    # Compute the fbeta score using 4-fold cross-validation. 
    cvscores = ...

    # Save the average of the folds in the fbetas array
    fbetas[c] = ...

In [None]:
grader.check("q8")

# 9. Choose the best model

Next we select the model with the best $F_\beta$ score. Follow the steps in the code. 

In [None]:
# 1. Set `cstar` to the index of the best performing regularization value
cstar = ...

# 2. Set `fbeta_star` to the corresponding F-beta score
fbeta_star = ...

# The next bit of code extracts the coefficients of the logistic regression for each of the 20 values of `C`. 
# This is stored in `theta` , which is a (20,30) array. (30 is the number of features)
theta = np.vstack([model.named_steps['logreg'].coef_[0,:] for model in models])

# 3. Plot the F-beta score as a function of `C`. (done already)
fig, ax = plt.subplots(figsize=(8,8),nrows=2,sharex=True)
ax[0].semilogx(Cs,fbetas,'o-',color='b',linewidth=2)
ax[0].semilogx(Cs[cstar],fbeta_star,'*',color='m',markersize=14)
ax[0].grid(linestyle=':')
ax[0].set_ylabel('performance',fontsize=12)

# 4. In a single plot, plot the 30 coefficients as a function of `C`. (done already)
ax[1].semilogx(Cs,theta)
ax[1].grid(linestyle=':')
ax[1].set_xlabel('C',fontsize=16)

In [None]:
grader.check("q9")

# 10. Significant features

The plot below shows the coefficients for the best-case regularized logistic regression found in the previous part. Notice that many of these coefficients have been set to zero. 

In [None]:
theta_star = theta[cstar,:]

plt.figure(figsize=(10,3))
plt.stem(np.abs(theta_star));

Follow the instructions in the cell that follows. 

In [None]:
features = Xtrain.columns

# 1. Set `best_features` to the set of feature names corresponding to non-zero coefficients in the plot above. 
best_features = ...

# 2. Set `max_theta_feature` to the feature name corresponding to the coefficient with maximum absolute value. 
max_theta_feature = ...

# 3. Save the selected lasso model to the variable `lasso_model`.
lasso_model = ...

In [None]:
grader.check("q10")

# 11. Evaluate the final model with test data

Use the test dataset to evaluate the performance of the selected lasso model. You can directly use scikit-learn's [`fbeta_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html) method (with $\beta$ as in previous parts) for this. 

In [None]:
# Use the model's `predict` method to predict outputs for the test inputs
yhat = ...

# Use `fbeta_score` to evaluate those predictions against the true test output. 
lasso_test = ...

In [None]:
grader.check("q11")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Make sure you submit the .zip file to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)