<a href="https://colab.research.google.com/github/emilsar/bit-of-data-science-and-scikit-learn/blob/master/TRAIN_AWS_Part_II_Day_2_Lab_Notebook_%5BEmil%20Sargsyan%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 2: Review of Classification and Model Selection**
---
### **Description**
This lab provides a comprehensive overview of implementing and evaluating KNN, Logistic Regression, and Hyperparameter Tuning.

<br>

### **Lab Structure**
**Part 1**: [Day 1 Review](#p1)

> **Part 1.1**: [Data Exploration](#p1.1)
>
> **Part 1.2**: [Linear Regression](#p1.2)
>

**Part 2**: [Classification with sklearn](#p2)

> **Part 2.1**: [K-Nearest Neighbors](#p2.1)
>
> **Part 2.2**: [Logistic Regression](#p2.2)
>

**Part 3**: [K-Folds Cross Validation](#p3)



<br>

### **Learning Objectives**
 By the end of this lab, we will:
* Understand how to implement and evaluate KNN and Logistic Regression models in sklearn.
* Understand how to use K-Folds CV in sklearn.


<br>


### **Resources**
* [EDA with pandas Cheat Sheet](https://docs.google.com/document/d/1FFoqw45P-kuoq912ARP4qfdGeLTqoq73_qjZThPp2_8/edit?usp=drive_link)

* [Data Visualization with matplotlib Cheat Sheet](https://docs.google.com/document/d/1YlUp6ll81qOyDpU1OWzE-SPxQ3hnF5C9ukLRL_6PYKE/edit?usp=drive_link)

* [Linear Regression with sklearn Cheat Sheet](https://docs.google.com/document/d/1iVieBynTpoKq1LA0kR-4pqDo6evoW5wvbNyE0wOGhYY/edit?usp=drive_link)

* [KNN with sklearn Cheat Sheet](https://docs.google.com/document/d/1U-AWXkJEDXZFqhBwFlDjyp9bLsVOeeXGYaxa6SZ7KpY/edit#heading=h.y8q92z25l6we)

* [Logistic Regression](https://docs.google.com/document/d/1Xi4fXFROik5Rs6C0d3oIM-OmK3pvw7MwkvM3TJw7vn4/edit)


<br>

**Before starting, run the code below to import all necessary functions and libraries.**


In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection
from sklearn import datasets
from sklearn.metrics import *

<a name="p1"></a>

---
## **Part 1: Day 1 Review**
---

In this part, we will explore and model the relationship between the numerical features and the `Runtime (min)` variable as the label using linear regression.


**Run the cell below to load in the data.**

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vS9jPkeKJ8QUuAl-fFdg3nJPDP6vx1byvIBl4yW8UZZJ9QEscyALJp1eywKeAg7aAffwdKP63D9osF1/pub?gid=169291584&single=true&output=csv"
movie_df = pd.read_csv(url)

movie_df.drop_duplicates(inplace=True)

mean_runtime = movie_df['Runtime'].mean()
movie_df['Runtime'] = movie_df['Runtime'].fillna(mean_runtime)

movie_df = movie_df.rename(columns = {"Runtime": "Runtime (min)"})
movie_df = movie_df.astype({"Runtime (min)": "int64"})

<a name="p1.1"></a>

---
### **Part 1.1: Data Exploration**
---

#### **Problem #1.2.1**

Use `.head()` to take an initial look at the DataFrame.

#### **Problem #1.2.2**

Use `.info()` to determine which variables are numerical. These will be the ones you can use to model `Runtime (min)` since we cannot train linear regression on text.

#### **Problem #1.2.3**

Use `.describe()` to determine the max, min, and average of all numerical variables, including `Runtime (min)`.
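
As a rough illustration of the three exploration calls above, here is a minimal sketch on a made-up DataFrame. The name `toy_df` and its values are assumptions for this example only; the lab's `movie_df` has different columns and data.

```python
import pandas as pd

# Hypothetical toy data; the lab's movie_df has different columns and values.
toy_df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Runtime (min)": [90, 120, 150, 100],
    "Gross": [1.0e6, 5.0e6, 2.5e6, 4.0e6],
})

print(toy_df.head(2))   # first two rows
toy_df.info()           # column names, non-null counts, and dtypes

# Numerical columns are the ones with int/float dtypes.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
summary = toy_df.describe()
print(summary.loc[["min", "mean", "max"], "Runtime (min)"])
```

Text columns such as `Title` show up in `.info()` with dtype `object` and are left out of `.describe()` by default.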

#### **Problem #1.2.4**

Create a scatterplot using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Gross Money vs. Runtime"
* `X-axis`: "Runtime (min)"
* `Y-axis`: "Gross (USD)"

#### **Problem #1.2.5**

Create a scatterplot using `Released_Year` as the x-axis value and `Runtime (min)` as the y-axis value.

Make sure to include a meaningful:
* `Title`: "Runtime vs. Released_Year"
* `X-axis`: "Year"
* `Y-axis`: "Runtime (min)"
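
The scatter plots in the problems above all follow the same pattern. The sketch below uses invented numbers (the lists `years` and `runtime_min` are assumptions, not values from `movie_df`) just to show which argument lands on which axis.

```python
import matplotlib.pyplot as plt

# Hypothetical values, only to demonstrate the plotting calls.
years = [1995, 2000, 2005, 2010]
runtime_min = [110, 95, 130, 120]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(years, runtime_min)        # first argument is x, second is y
ax.set_title("Runtime vs. Released_Year")
ax.set_xlabel("Year")
ax.set_ylabel("Runtime (min)")
```

In a notebook the figure renders at the end of the cell. The `plt.scatter`, `plt.title`, `plt.xlabel`, and `plt.ylabel` functions used elsewhere in this lab do the same thing on the current figure.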

#### **Problem #1.2.6**

Create a *lineplot* using `Runtime (min)` as the x-axis value and `Gross` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Gross Money vs. Runtime'`.
* X-axis label including units `'min'`.
* Y-axis label including units `'USD'`.

<br>

**NOTE**: This is not going to be a particularly helpful graph (the scatter plot is a better choice), but we often will not know this ahead of time. A lot of EDA and visualization involves trying a number of things and seeing what is useful.

#### **Problem #1.2.7**

Create a *lineplot* using `Released_Year` as the x-axis value and `Average Gross in Year` as the y-axis value.

Make sure to include a meaningful:
* Title, ex: `'Average Gross Money vs. Released Year'`.
* X-axis label.
* Y-axis label including units `'USD'`.

In [None]:
mean_gross = movie_df.groupby(# COMPLETE THIS LINE
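
For reference, a per-group average can be computed with `groupby` followed by `.mean()`. This is a minimal sketch on made-up data; `toy_df` and its numbers are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with two years.
toy_df = pd.DataFrame({
    "Released_Year": [2000, 2000, 2001, 2001, 2001],
    "Gross": [10.0, 30.0, 5.0, 10.0, 15.0],
})

# One average per year; the result is a Series indexed by year.
avg_by_year = toy_df.groupby("Released_Year")["Gross"].mean()
print(avg_by_year)
```

The resulting index (years) and values (averages) can be passed to `plt.plot` as the x and y values.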


#### **Problem #1.2.8**

Create a bar plot of the number of movies released per year.

Use the Series provided, `movies_per_year`, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

In [None]:
movies_per_year = movie_df['Released_Year'].value_counts()

plt.bar(movies_per_year.index, # COMPLETE THIS CODE

#### **Problem #1.2.9**

Create a bar plot of the number of Dramas released per year.

Build a Series of counts in the same way as `movies_per_year` in the previous problem, and make sure to include a meaningful:
* Title.
* X-axis label.
* Y-axis label.

<br>

**Hint**: Recall that you can use `.loc[CRITERIA, :]` to find all data matching given criteria, and see the example in Problem #1.2.8 for finding the number of movies released per year.

In [None]:
# COMPLETE THIS CODE
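
The filter-then-count idea from the hint can be sketched as follows. The column name `Genre` and the toy rows are assumptions for this example; check `movie_df.columns` for the actual column names in the lab data.

```python
import pandas as pd

# Hypothetical toy data; "Genre" is an assumed column name.
toy_df = pd.DataFrame({
    "Released_Year": [1999, 1999, 2000, 2000, 2000],
    "Genre": ["Drama", "Comedy", "Drama", "Drama", "Action"],
})

# Keep only the rows that match the criterion, then count rows per year.
dramas = toy_df.loc[toy_df["Genre"] == "Drama", :]
dramas_per_year = dramas["Released_Year"].value_counts().sort_index()
print(dramas_per_year)
```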

<a name="p1.2"></a>

---
### **Part 1.2: Linear Regression**
---

In this part, we will model the relationship between the numerical features and the `Runtime (min)` variable as the label using linear regression.

#### **Step #1: Load the data**

This has already been completed above.

#### **Step #2: Decide independent and dependent variables**

Examining the DataFrame, choose only the numerical variables (other than `Runtime (min)`) for the features and `Runtime (min)` for the label.


In [None]:
features = # COMPLETE THIS CODE
label = # COMPLETE THIS CODE

#### **Step #3: Split data into training and testing data**

Split the data using an 80 / 20 split.

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(# COMPLETE THIS CODE
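
To see what an 80 / 20 split does, here is a small sketch on a synthetic array. The arrays `X` and `y` and the `random_state` value are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)
```

With ten rows, eight end up in the training set and two in the test set, and every row appears in exactly one of them.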

#### **Step #4: Import your model**

In [None]:
# COMPLETE THIS CODE

#### **Step #5: Initialize your model and set hyperparameters**


Linear regression takes no hyperparameters, so just initialize the model.

#### **Step #6: Fit your model, test on the testing data, and create a visualization if applicable**

##### **Create a visualization**

Use `y_test` and your `prediction` from the model to create a scatter plot. Then use the following line to visualize where a correct prediction would be. The code has already been given to you.
```
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--k', label="Correct prediction")
```

In [None]:
plt.figure(figsize=(8, 8))

plt.scatter(# COMPLETE THIS CODE
plt.plot(# COMPLETE THIS CODE, '--k', label="Correct prediction")

plt.xlabel(# COMPLETE THIS CODE
plt.ylabel(# COMPLETE THIS CODE
plt.title(# COMPLETE THIS CODE


plt.legend()

#### **Step #7: Evaluate your model**

Use mean squared error and the R2 score as the evaluation metrics.
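
As a quick check of what the two metrics measure, the sketch below computes them on four hand-picked values (the lists are assumptions for illustration, not model output).

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

# MSE: average of squared errors = (0.25 + 0 + 0.25 + 0) / 4
mse = mean_squared_error(y_true, y_pred)

# R2: 1 - (sum of squared errors) / (total variation around the mean) = 1 - 0.5 / 20
r2 = r2_score(y_true, y_pred)

print(mse, r2)
```

Lower MSE is better, while an R2 closer to 1 means the predictions explain more of the variation in the label.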


#### **Step #8: Use the model**

Using the model we created, predict the runtime of movies based on the following `Released_Year`, `IMDB_Rating`, `No_of_Votes`, and `Gross`:

* `1999`, `7.9`, `100000`, `8000000`
* `2007`, `8.5`, `1000000`, `10000000`
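
When predicting for new rows, it helps to pass a DataFrame whose columns match the training features in name and order. The sketch below shows the pattern on a synthetic problem; the column names `a` and `b` and the exact linear rule are assumptions made so the result is easy to verify.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data following label = 2*a + 3*b + 1 exactly.
train = pd.DataFrame({"a": [0, 1, 2, 3, 4], "b": [1, 0, 2, 1, 3]})
target = 2 * train["a"] + 3 * train["b"] + 1

reg = LinearRegression().fit(train, target)

# New rows: one inner list per sample, same column order as the training features.
new_rows = pd.DataFrame([[5, 2], [10, 0]], columns=train.columns)
preds = reg.predict(new_rows)
print(preds)
```

Because the toy label is an exact linear function, the model recovers it and predicts about 17 and 21 for the two new rows.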

In [None]:
movie_df.describe()

<a name="p2"></a>

---
## **Part 2: Classification with sklearn**
---

<a name="p2.1"></a>

---
### **Part 2.1: K-Nearest Neighbors**
---
In this part, we will implement a K-Nearest Neighbors (KNN) model aimed at predicting the diagnosis of breast cancer samples. The goal is to classify new samples as either malignant or benign based on their feature characteristics.

<br>

This dataset contains information related to breast cancer, including features such as mean radius, mean texture, and mean smoothness. The target variable (label) indicates the diagnosis: in sklearn's copy of this dataset, `0` means malignant and `1` means benign.

#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
selected_features = ["mean radius", "mean texture", "mean perimeter", "mean area", "mean smoothness", "mean compactness", "mean concavity", "mean concave points", "mean symmetry", "mean fractal dimension"]
df = pd.DataFrame(data.data, columns=data.feature_names)
df = df[selected_features]
df['Target'] = data.target

#### **Step #2: Choose your Variables**



In [None]:
inputs = # COMPLETE THIS CODE
output = # COMPLETE THIS CODE

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

#### **Step #4: Import an ML Algorithm**




#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS LINE

#### **Step #6: Fit and Test**


In [None]:
model.fit(X_train, # COMPLETE THIS LINE

In [None]:
predictions = # COMPLETE THIS LINE

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, use the accuracy score to get a simple overall picture of your model's performance, and the confusion matrix to get a more nuanced view of where the model is performing the best and worst.


In [None]:
print(accuracy_score(# COMPLETE THIS CODE

In [None]:
cm = confusion_matrix(# COMPLETE THIS CODE
disp = ConfusionMatrixDisplay(# COMPLETE THIS CODE
disp.plot()
plt.show()
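
To make the two metrics concrete, here is a small sketch on hand-written labels (the lists are assumptions for illustration, not model output).

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for six samples.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions: 4 of 6
cm = confusion_matrix(y_true, y_pred)  # rows are true labels, columns are predictions

print(acc)
print(cm)
```

Here `cm[0, 1]` counts samples of class 0 that were predicted as class 1, and `cm[1, 0]` counts the opposite mistake, which accuracy alone cannot distinguish.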

#### **Step #8: Apply your Model**

You are provided with data from two new breast cancer samples, and you want to assess the predicted class labels (Malignant or Benign) for each of them. The goal is to determine whether each sample is likely to be malignant or benign based on the model's predictions.

Here is the data for the two samples:

**Sample 1:**

* Mean Radius = 12.5
* Mean Texture = 18.2
* Mean Perimeter = 80.3
* Mean Area = 490.2
* Mean Smoothness = 0.09
* Mean Compactness = 0.08
* Mean Concavity = 0.05
* Mean Concave Points = 0.03
* Mean Symmetry = 0.18
* Mean Fractal Dimension = 0.06

**Sample 2:**

* Mean Radius = 14.3
* Mean Texture = 20.8
* Mean Perimeter = 92.6
* Mean Area = 650.9
* Mean Smoothness = 0.1
* Mean Compactness = 0.12
* Mean Concavity = 0.09
* Mean Concave Points = 0.05
* Mean Symmetry = 0.2
* Mean Fractal Dimension = 0.07

You will use your KNN (k-nearest neighbors) model to predict the class labels for these samples and assess their relative likelihood of being malignant or benign based on the predictions.
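
Before interpreting a prediction, it is worth checking which integer stands for which diagnosis. The sketch below is a simplified illustration, not the lab solution: it trains on only two of the features, and the names `bc`, `cols`, `knn`, and `new_sample` are assumptions for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

bc = load_breast_cancer()
print(bc.target_names)  # position 0 is 'malignant', position 1 is 'benign'

# Simplified model on two features only (an assumption to keep the sketch short).
cols = ["mean radius", "mean texture"]
X = pd.DataFrame(bc.data, columns=bc.feature_names)[cols]
knn = KNeighborsClassifier(n_neighbors=5).fit(X, bc.target)

# A new sample must be a 2D structure with the same columns as the training data.
new_sample = pd.DataFrame([[12.5, 18.2]], columns=cols)
code = knn.predict(new_sample)[0]
label = bc.target_names[code]  # translate the integer back to a name
print(code, label)
```

Looking the name up in `target_names` avoids hard-coding which integer means malignant.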

##### **1. Predict the diagnosis of Sample 1**


In [None]:
sample_1_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_sample_1 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Sample 1:", "Benign" if prediction_sample_1[0] == 1 else "Malignant")

##### **2. Predict the diagnosis of Sample 2**

In [None]:
sample_2_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_sample_2 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Sample 2:", "Benign" if prediction_sample_2[0] == 1 else "Malignant")

<a name="p2.2"></a>

---
### **Part 2.2: Logistic Regression**
---

In this part, we will develop a Logistic Regression model to predict diabetes diagnoses from medical attributes. The primary objective is to classify new individuals as either having diabetes or not, based on their attribute values.

<br>

The Pima Indians Diabetes dataset is an essential collection of medical records related to diabetes diagnoses among women of Pima Indian heritage. It comprises various attributes, including the number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, and several others. The target variable (label) indicates whether an individual has diabetes.

#### **Step #1: Load in Data**

**Run the code below to load the data.**

In [None]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

column_names = [
    "Number of times pregnant",
    "Plasma glucose concentration",
    "Diastolic blood pressure",
    "Triceps skinfold thickness",
    "2-Hour serum insulin",
    "BMI",
    "Diabetes pedigree function",
    "Age",
    "Class"
]

df = pd.read_csv(url, names=column_names)

#### **Step #2: Choose your Variables**



In [None]:
inputs = # COMPLETE THIS CODE
output = # COMPLETE THIS CODE

#### **Step #3: Split your Data**


In [None]:
X_train, X_test, y_train, y_test = # COMPLETE THIS CODE

#### **Step #4: Import an ML Algorithm**




#### **Step #5: Initialize the Model**


In [None]:
model = # COMPLETE THIS LINE

#### **Step #6: Fit and Test**


In [None]:
model.fit(X_train, # COMPLETE THIS LINE

In [None]:
predictions = # COMPLETE THIS LINE

#### **Step #7: Evaluate**

Let's evaluate this model and put it to the test! Specifically, use the classification report to see precision, recall, and F1-score for each class, and the confusion matrix to get a more nuanced view of where the model is performing the best and worst.


In [None]:
report = classification_report(# COMPLETE THIS LINE
print('Classification report ' + str(report))

In [None]:
cm = confusion_matrix(# COMPLETE THIS CODE
disp = ConfusionMatrixDisplay(# COMPLETE THIS CODE
disp.plot()
plt.show()

#### **Step #8: Apply your Model**

You are provided with data from two new Pima Indian individuals, and you want to assess the predicted class labels (Diabetes or No Diabetes) for each of them. The goal is to determine whether each individual is likely to have diabetes based on the model's predictions.

Here is the data for the two individuals:

**Individual 1:**

* Number of times pregnant: 2
* Plasma glucose concentration: 85
* Diastolic blood pressure: 66
* Triceps skinfold thickness: 29
* 2-Hour serum insulin: 0
* BMI: 26.6
* Diabetes pedigree function: 0.351
* Age: 31

**Individual 2:**

* Number of times pregnant: 8
* Plasma glucose concentration: 183
* Diastolic blood pressure: 64
* Triceps skinfold thickness: 0
* 2-Hour serum insulin: 0
* BMI: 23.3
* Diabetes pedigree function: 0.672
* Age: 32

You will use your logistic regression model to predict the class labels for these individuals and assess their relative likelihood of having diabetes based on the predictions.
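
A class label alone does not say how confident the model is. One way to compare relative likelihood is `predict_proba`, sketched here on synthetic data; the dataset from `make_classification` and the name `clf` are assumptions for illustration, not the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (not the Pima dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row per sample; columns follow the order of clf.classes_.
proba = clf.predict_proba(X[:2])
labels = clf.predict(X[:2])

print(clf.classes_)
print(proba)
print(labels)
```

Each row of `proba` sums to 1, and the predicted label is the class with the larger probability, so two individuals with the same predicted label can still differ in how likely that label is.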

##### **1. Predict the diagnosis of Individual 1**


In [None]:
individual_1_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_individual_1 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Individual 1:", "Diabetes" if prediction_individual_1[0] == 1 else "No Diabetes")

##### **2. Predict the diagnosis of Individual 2**

In [None]:
individual_2_features = pd.DataFrame([[# COMPLETE THIS CODE

prediction_individual_2 = model.predict(# COMPLETE THIS CODE

print("Predicted label for Individual 2:", "Diabetes" if prediction_individual_2[0] == 1 else "No Diabetes")

<a name="p3"></a>

---
## **Part 3: K-Folds Cross Validation**
---

In this section, we will explore how to use K-Folds to evaluate and compare models before deciding on the final model we will use. Only once we have selected our final model should we evaluate it on the test set.

<br>

In particular, we will use K-Folds Cross Validation to determine the best model for several datasets.

#### **Problem #3.1**

To start, let's train and evaluate a 5NN model on the Iris dataset as usual. This is *bad practice*, but will help motivate why we should use cross validation.

In [None]:
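from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()

features = iris.data
label = iris.target

# `model_selection` was imported at the top of the notebook.
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, label, test_size=0.2, random_state=42)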

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
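
The cell above fits the scaler on the training data and only applies it to the test data. The sketch below shows the effect on a tiny hand-made array (the values are assumptions for illustration).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the test value is expressed relative to the training statistics. This matters for KNN because distances are otherwise dominated by features with large numeric ranges.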

In [None]:
knn_5 = KNeighborsClassifier(n_neighbors = # COMPLETE THIS LINE

knn_5.fit(# COMPLETE THIS LINE

pred = knn_5.predict(# COMPLETE THIS LINE

print(classification_report(# COMPLETE THIS LINE

#### **Problem #3.2**

Now, let's take the proper and more insightful approach: evaluating the model using K-Folds Cross Validation. Complete the code below to evaluate a 5NN model using 10-Folds Cross Validation.

In [None]:
from sklearn.model_selection import cross_val_score

knn_5 = KNeighborsClassifier(n_neighbors = 5)

scores_5 = cross_val_score(knn_5, X_train, y_train, cv = # COMPLETE THIS LINE
print("10-Folds CV Scores: " + str(scores_5.mean()) + " +/- " + str(scores_5.std()))
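
To show what `cross_val_score` does internally, the sketch below runs the folds by hand and compares the result. It uses sklearn's wine dataset, 5 folds, and a 3NN model, all of which are assumptions chosen so that it does not duplicate the problem above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on 4 folds, score on the held-out fold, repeat 5 times.
manual_scores = []
for train_idx, val_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[val_idx], y[val_idx]))

# Same procedure in one call, using the same fold definition.
auto_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kf)

print(np.mean(manual_scores), np.mean(auto_scores))
```

Passing an integer such as `cv = 10` instead of a `KFold` object makes sklearn choose the folds itself; for classifiers it uses stratified folds, which keep class proportions similar across folds.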

#### **Visualize the scores by running the cell below.**

In [None]:
plt.plot(scores_5, label = '5NN')
plt.plot([scores_5.mean() for i in range(10)], label = 'average')

plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

#### **Problem #3.3**

Now, use 10-Folds Cross Validation to evaluate and compare the following models:
1. 1NN
2. 11NN (**NOTE**: $\sqrt{\text{length of training data}} \approx 11$)
3. 99NN
4. Logistic Regression

<br>

**NOTE**: There is code at the end that will visualize all of these results together.


##### **1. 1NN**

In [None]:
knn_1 = KNeighborsClassifier(# COMPLETE THIS LINE

scores_1 = cross_val_score(# COMPLETE THIS LINE
print("10-Folds CV Scores: " + str(scores_1.mean()) + " +/- " + str(scores_1.std()))

##### **2. 11NN** (**NOTE**: $\sqrt{\text{length of training data}} \approx 11$)

In [None]:
knn_11 = # COMPLETE THIS LINE

# COMPLETE THIS CODE

##### **3. 99NN**

In [None]:
knn_99 = # COMPLETE THIS LINE

# COMPLETE THIS CODE

##### **4. Logistic Regression**

In [None]:
log = # COMPLETE THIS LINE

# COMPLETE THIS CODE

#### **Visualize the scores by running the cell below.**

In [None]:
plt.plot(scores_1, label = '1NN')
plt.plot(scores_5, label = '5NN')
plt.plot(scores_11, label = '11NN')
plt.plot(scores_99, label = '99NN')
plt.plot(scores_log, label = 'Logistic Regression')

plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

#### **Problem #3.4**

Assuming we do not plan to try out any other models, we can safely train our final model and evaluate it on the test set. Consider the average, standard deviation, and individual scores we visualized to pick one of the models from above and:
* Train it on the entire training set.
* Evaluate it on the test set with a classification report.
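
The overall workflow (compare candidates with cross validation, then touch the test set once) might look like the sketch below. It uses the wine dataset and three KNN candidates as assumptions for illustration; the names `candidates`, `cv_means`, and `best_k` are hypothetical, and picking by mean score alone is a simplification of the judgment asked for above.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Compare candidates using only the training set.
candidates = {k: KNeighborsClassifier(n_neighbors=k) for k in (1, 5, 15)}
cv_means = {
    k: cross_val_score(m, X_tr, y_tr, cv=5).mean() for k, m in candidates.items()
}
best_k = max(cv_means, key=cv_means.get)

# Refit the chosen model on the whole training set, then evaluate once on the test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print(classification_report(y_te, final_model.predict(X_te)))
```

Keeping the test set out of the comparison step is what makes the final report an honest estimate of performance on unseen data.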


### **Reflection questions**
Answer the following questions:

1. Which of the five models had the highest performance during cross validation?

2. Which of the five models had the lowest performance during cross validation?

3. How do the top performing model's cross validation metrics compare to the test metrics? In other words, how does this model perform in Problem #3.3 versus #3.4?

---

# End of Notebook

© 2023 The Coding School, All rights reserved