<a href="https://colab.research.google.com/github/cassiomo/trainlab/blob/main/TRAIN_YLC_Week_13_Lab_Notebook_%5BSTUDENT%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 13: Machine Learning Review**
---

### **Description**

In this week's lab, we will review the algorithms we have seen so far. We will implement and evaluate linear regression and KNN models, then build logistic regression models.

<br>

### **Structure**

**Part 1**: [[OPTIONAL] Linear Regression Review](#p1)

**Part 2**: [[OPTIONAL] KNN Review](#p2)

**Part 3**: [[OPTIONAL] Implementing Logistic Regression](#p3)


<br>

### **Learning Objectives**

By the end of this lab, you will:
* Understand how to implement and evaluate Linear Regression models in sklearn.
* Understand how to implement and evaluate KNN models in sklearn.
* Recognize how to implement and evaluate Logistic Regression models in sklearn.

<br>

### **Resources**

* [Linear Regression with sklearn](https://docs.google.com/document/d/1DPUqouqGKeAYBfNBoHNsKRoQGXfD7mjUAvjjK0VsLbc/edit?usp=drive_link)

* [K-Nearest Neighbors with sklearn](https://docs.google.com/document/d/16r4lrQNH-IUFbzh5RXL9w__ZQ6sOXWxHYaCd77nMK3Q/edit?usp=sharing)

* [Logistic Regression with sklearn](https://docs.google.com/document/d/1a2MbrwRDCP3cpnLs2n2qG-4qbR9TEP6xnfdI7cl_vzQ/edit?usp=sharing)



<br>

**Run the code below before continuing.**

In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import *
from sklearn.datasets import load_breast_cancer

<a name="p1"></a>

---
## **Part 1: Linear Regression Review**
---

In this part, we will model the relationship between specific features (`danceability`, `instrumentalness`, and `loudness`) and the `popularity` variable as the label using linear regression.

#### **Step #1: Load the data**



In [None]:
url = "https://raw.githubusercontent.com/the-codingschool/TRAIN/main/music_genres/music_genres_cleaned.csv"
df = pd.read_csv(url)

df.head()

#### **Step #2: Decide independent and dependent variables**

Examining the DataFrame, choose only `danceability`, `instrumentalness`, and `loudness` for the features and `popularity` for the label.
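
As a rough illustration of column selection in pandas, the sketch below uses a tiny made-up DataFrame (`toy_df` and its values are assumptions, not the lab's data). It shows how double brackets return a 2-D DataFrame while single brackets return a 1-D Series.

```python
import pandas as pd

# Hypothetical stand-in for the music dataset; the values are made up.
toy_df = pd.DataFrame({
    "danceability": [0.65, 0.48, 0.79],
    "instrumentalness": [0.0120, 0.0021, 0.00036],
    "loudness": [-5.1, -14.0, -7.2],
    "popularity": [52, 31, 67],
})

# Double brackets select several columns and return a DataFrame (2-D).
toy_features = toy_df[["danceability", "instrumentalness", "loudness"]]

# Single brackets select one column and return a Series (1-D).
toy_label = toy_df["popularity"]

print(toy_features.shape)  # (3, 3)
print(toy_label.shape)     # (3,)
```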


In [None]:
features = # COMPLETE THIS CODE
label = # COMPLETE THIS CODE

#### **Step #3: Split data into training and testing data**

Split the data using an 80 / 20 split: 80% for training and 20% for testing.
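
A minimal sketch of such a split on made-up arrays (`X_toy` and `y_toy` are illustrative names). Fixing `random_state` is an optional choice made here so the shuffle is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))  # 8 2
```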

In [None]:
X_train, X_test, y_train, y_test = train_test_split(# COMPLETE THIS CODE

#### **Step #4: Import your model**

In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize your model and set hyperparameters**


`LinearRegression` has no hyperparameters that we need to tune in this lab, so just initialize the model with its default settings.
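
For reference, here is a small sketch of the import, initialize, fit, and predict sequence on made-up data that follows y = 2x + 1 exactly. The names `X_toy`, `y_toy`, and `toy_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

toy_model = LinearRegression()  # default settings, nothing to tune
toy_model.fit(X_toy, y_toy)     # learn the slope and intercept

# Predict for an unseen input, x = 4.
toy_prediction = toy_model.predict(np.array([[4.0]]))
print(toy_model.coef_, toy_model.intercept_, toy_prediction)
```

Because the toy data is perfectly linear, the learned slope and intercept should come out very close to 2 and 1.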

#### **Step #6: Fit your model, test on the testing data, and create a visualization if applicable**

##### **Create a visualization**

Use `y_test` and your `prediction` from the model to create a scatter plot. Then use the following line to visualize where a correct prediction would be. The code has already been given to you.
```
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--k', label="Correct prediction")
```
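
The sketch below shows the same kind of plot on made-up numbers (`y_true_toy` and `y_pred_toy` are assumptions, not model output). Points that fall on the dashed diagonal are predictions that match the true value exactly.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up true values and predictions.
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([12.0, 18.0, 33.0, 37.0])

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(y_true_toy, y_pred_toy, label="Predictions")

# Points on this diagonal have predicted value == true value.
lims = [y_true_toy.min(), y_true_toy.max()]
ax.plot(lims, lims, "--k", label="Correct prediction")

ax.set_xlabel("True value")
ax.set_ylabel("Predicted value")
ax.set_title("True vs. predicted (toy data)")
ax.legend()
```

In a notebook, the figure is rendered when the cell finishes running.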

In [None]:
plt.figure(figsize=(8, 8))

plt.scatter(# COMPLETE THIS CODE
plt.plot(# COMPLETE THIS CODE, '--k', label="Correct prediction")

plt.xlabel(# COMPLETE THIS CODE
plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE


plt.legend()

#### **Step #7: Evaluate your model**

Use mean squared error (MSE) and the R² score as the evaluation metrics.
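
To make the two metrics concrete, here is a small sketch on made-up values that can be checked by hand. The variable names are illustrative.

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true_toy = [3, 5, 7, 9]
y_pred_toy = [2, 5, 8, 9]

# MSE: average of the squared errors = (1 + 0 + 1 + 0) / 4 = 0.5
toy_mse = mean_squared_error(y_true_toy, y_pred_toy)

# R^2: 1 - SS_res / SS_tot = 1 - 2 / 20, which is about 0.9
toy_r2 = r2_score(y_true_toy, y_pred_toy)

print(toy_mse, toy_r2)
```

A lower MSE is better, while an R² closer to 1 means the model explains more of the variation in the label.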


#### **Step #8: Use the model**

Using the model we created, predict the popularity of two new songs.

* Song 1: danceability of 0.48, instrumentalness of 0.0021, and loudness of 14.

* Song 2: danceability of 0.79, instrumentalness of 0.00036, and loudness of -7.2.

**NOTE**: you must create a DataFrame containing the information for the new songs, where the `new_song_data` passed in is a list of rows (one per song):

```python
new_song_data = pd.DataFrame(new_song_data, columns =["danceability", "instrumentalness", "loudness"])
```
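
As a sketch of the overall pattern, the example below trains a model on a few invented rows (`toy_X` and `toy_y` are assumptions, not the lab's data) and then predicts for new rows wrapped in a DataFrame with the same column names.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ["danceability", "instrumentalness", "loudness"]

# Hypothetical training rows; the numbers are invented for illustration.
toy_X = pd.DataFrame(
    [[0.60, 0.010, -6.0], [0.40, 0.200, -12.0],
     [0.80, 0.001, -4.5], [0.55, 0.050, -9.0]],
    columns=cols,
)
toy_y = [60, 35, 72, 48]
toy_model = LinearRegression().fit(toy_X, toy_y)

# One inner list per new song, in the same column order used for training.
new_rows = [[0.48, 0.0021, 14], [0.79, 0.00036, -7.2]]
new_songs = pd.DataFrame(new_rows, columns=cols)

new_predictions = toy_model.predict(new_songs)
print(new_predictions.shape)  # (2,)
```

Matching the training column names avoids the feature-name warning that sklearn raises when a model fitted on a DataFrame is given a plain list.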

<a name="p2"></a>

---
## **Part 2: KNN Review**
---

In this section, you will create a 5NN model to predict `music_genre` for the same dataset we used in Part 1. We will use the same features used in Part 1.

#### **Step #1: Load in Data**

**This step was completed in Part 1.**

#### **Step #2: Choose your Variables**



In [None]:
features = # COMPLETE THIS CODE
label = # COMPLETE THIS CODE

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

#### **Step #4: Import an ML Algorithm**




In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize the Model**

Use K = 5 here.
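
To illustrate what K = 5 means, the sketch below fits a 5NN classifier on made-up one-dimensional data (all names and values are assumptions). Each prediction is the majority class among the five closest training points.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: class "a" clusters near 0, class "b" near 10.
X_toy = [[0.0], [0.5], [1.0], [1.5], [9.0], [9.5], [10.0], [10.5]]
y_toy = ["a", "a", "a", "a", "b", "b", "b", "b"]

toy_knn = KNeighborsClassifier(n_neighbors=5)  # K = 5
toy_knn.fit(X_toy, y_toy)

# For 1.2 the five closest points are 1.0, 1.5, 0.5, 0.0 (all "a") and 9.0 ("b"),
# so the majority vote is "a". By the same reasoning, 9.8 is voted "b".
toy_predictions = toy_knn.predict([[1.2], [9.8]])
print(toy_predictions)
```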

In [None]:
model = # COMPLETE THIS CODE

#### **Step #6: Fit and Test**


In [None]:
model.fit(X_train, # COMPLETE THIS CODE

In [None]:
predictions = # COMPLETE THIS CODE

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, use the accuracy score to get a simple overall picture of your model's performance, and the confusion matrix to get a more nuanced view of where the model performs best and worst.
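
Here is a small sketch of both metrics on made-up genre labels (the label lists are assumptions). In the confusion matrix, rows are the true classes and columns are the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_toy = ["rock", "rock", "jazz", "jazz", "rock"]
y_pred_toy = ["rock", "jazz", "jazz", "jazz", "rock"]

# 4 of the 5 predictions match the true label.
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(toy_accuracy)  # 0.8

labels = ["jazz", "rock"]
toy_cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(toy_cm)
# [[2 0]
#  [1 2]]
```

The off-diagonal 1 shows the single mistake: one true "rock" song was predicted as "jazz".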


In [None]:
print(accuracy_score(# COMPLETE THIS CODE

In [None]:
cm = confusion_matrix(# COMPLETE THIS CODE
disp = ConfusionMatrixDisplay(# COMPLETE THIS CODE
disp.plot()

plt.xticks(rotation=90)
plt.show()

Now, let's take a more robust approach: evaluating the model using K-Fold Cross-Validation, which averages performance over several train/validation splits instead of relying on a single one. Complete the code below to evaluate a 5NN model using 10-Fold Cross-Validation.
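
As an illustration of the mechanics, the sketch below runs 10-fold cross-validation on sklearn's bundled iris dataset, which is an assumption standing in for the lab's data. `cross_val_score` returns one score per fold.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris dataset stands in for the lab's data.
X_iris, y_iris = load_iris(return_X_y=True)

toy_knn = KNeighborsClassifier(n_neighbors=5)

# cv=10 splits the data into 10 folds; each fold is used once for validation.
toy_scores = cross_val_score(toy_knn, X_iris, y_iris, cv=10)

print(len(toy_scores))                        # one score per fold
print(toy_scores.mean(), toy_scores.std())    # average and spread across folds
```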

In [None]:
knn_5 = KNeighborsClassifier(n_neighbors = 5)

scores_5 = cross_val_score(knn_5, X_train, y_train, cv = # COMPLETE THIS CODE
print("10-Folds CV Scores: " + str(scores_5.mean()) + " +/- " + str(scores_5.std()))

#### **Step #8: Use the model**

Using the model we created, predict the music genre of two new songs.

* Song 1: danceability of 0.48, instrumentalness of 0.0021, and loudness of 14.

* Song 2: danceability of 0.79, instrumentalness of 0.00036, and loudness of -7.2.

**NOTE**: you must create a DataFrame containing the information for the new songs, where the `new_song_data` passed in is a list of rows (one per song):

```python
new_song_data = pd.DataFrame(new_song_data, columns =["danceability", "instrumentalness", "loudness"])
```

In [None]:
# COMPLETE THIS CODE

In [None]:
# COMPLETE THIS CODE

<a name="p3"></a>

---
## **Part 3: Logistic Regression**
---
#### **About the Dataset**
We've already used the Breast Cancer dataset to create a KNN model for classification; now it's time to create a logistic regression model using it. This dataset contains measurements of cell nuclei, computed from digitized images of fine needle aspirates (FNA) of breast masses, along with whether each tumor was malignant or benign.

The features are as follows:
* `radius`
* `texture`: standard deviation of gray-scale values
* `perimeter`
* `area`
* `smoothness`: local variations in radius lengths
* `compactness`: perimeter^2 / area - 1
* `concavity`: severity of concave portions of the contour
* `concave points`: number of concave portions of the contour
* `symmetry`
* `fractal dimension`: "coastline approximation" - 1

Note: There is data recorded for the mean, standard error, and worst (or largest) for each feature, resulting in 30 total features.
<br>

#### **Your Task**
Using the Breast Cancer dataset, we will do the following:
* Create a logistic regression model in order to classify breast cancer tumors as malignant (0) or benign (1).
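
One quick way to confirm the class encoding and the feature count is to inspect the dataset object directly. The variable name `bc` is illustrative.

```python
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# 569 samples, 30 features (mean, standard error, and worst of 10 measurements).
print(bc.data.shape)

# The position in target_names is the numeric label: 0 = malignant, 1 = benign.
print(list(bc.target_names))
```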

### **Step #1: Load the data**

Use the following code to load the breast cancer dataset.

In [None]:
data = load_breast_cancer()

df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df

### **Step #2: Decide independent and dependent variables**

We will be using all features except `target` as our independent variables for predicting `target`.

In [None]:
X = data.data
print(len(X[0]))
y = data.target

### **Step #3: Split the data into train and test sets**

### **Step #4: Import the Logistic Regression algorithm**

### **Step #5: Initialize the model**


### **Step #6: Fit your model and make predictions for the test data**


In [None]:
#fit

y_pred = # COMPLETE THIS LINE

y_pred_proba = # COMPLETE THIS LINE

y_pred_binary = # COMPLETE THIS LINE
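
For reference, here is one possible sketch of the fit, predict, and threshold sequence. The variable names, the `random_state`, and the large `max_iter` are assumptions chosen for this example, not requirements of the lab.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42
)

# Assumption: a large max_iter so the solver converges on unscaled features.
sketch_model = LogisticRegression(max_iter=10000)
sketch_model.fit(X_tr, y_tr)

hard_labels = sketch_model.predict(X_te)   # class labels (0 or 1)
proba = sketch_model.predict_proba(X_te)   # shape (n_samples, 2)

# Column 1 is P(class 1); label a sample 1 when that probability exceeds 0.5.
thresholded = (proba[:, 1] > 0.5).astype(int)

print(proba[:3])
print(thresholded[:10])
```

Each row of `proba` sums to 1, and thresholding column 1 at 0.5 should reproduce the labels that `predict` returns.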

### **Step #7: Evaluate the model**

Print the classification report. Then, run the code cell below to plot the ROC curve.
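
To show how to read the report, the sketch below uses made-up labels (`y_true_toy` and `y_pred_toy` are assumptions) and also computes the precision for class 1 on its own.

```python
from sklearn.metrics import classification_report, precision_score

# Made-up labels: 0 = malignant, 1 = benign.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

toy_report = classification_report(
    y_true_toy, y_pred_toy, target_names=["malignant", "benign"]
)
print(toy_report)

# Precision for class 1: of the 5 samples predicted benign, 4 really are benign.
benign_precision = precision_score(y_true_toy, y_pred_toy, pos_label=1)
print(benign_precision)  # 0.8
```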


In [None]:
report = # WRITE YOUR CODE HERE
print(report)

In [None]:
# Plot Sensitivity (TPR) vs 1-Specificity (FPR)
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, threshold = roc_curve(y_test, y_pred_proba[:, 1])
roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])

plt.figure()
plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.4f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
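
To see what the curve and its area summarize, here is a tiny sketch on made-up scores (the values are assumptions). The AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per candidate threshold.
fpr_toy, tpr_toy, thresholds_toy = roc_curve(y_true_toy, scores_toy)

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75.
toy_auc = roc_auc_score(y_true_toy, scores_toy)
print(toy_auc)  # 0.75
```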

### **Reflection questions**
Answer the following questions:

1. What do the predicted probabilities represent in this context?
2. How is the threshold of 0.5 used to convert predicted probabilities into binary predictions?
3. What does precision mean for the Benign class in this model?
4. Would you trust this model?

---
# End of Notebook

© 2023 The Coding School, All rights reserved