<div style="display: flex; align-items: center; justify-content: center; text-align: center;">
  <img src="https://coursereport-s3-production.global.ssl.fastly.net/uploads/school/logo/219/original/CT_LOGO_NEW.jpg" width="100" style="margin-right: 10px;">
  <div>
    <h1><b>Applied Lesson - Iris & Breast Cancer Datasets</b></h1>
  </div>
</div>

<br>

## Load libraries
---

We'll need the following libraries for today's lecture:
1. `pandas`
2. `numpy`
3. `matplotlib` and `seaborn`
4. `KNeighborsClassifier` from `sklearn`'s `neighbors` module
5. `train_test_split` from `sklearn`'s `model_selection` module
6. `StandardScaler` from `sklearn`'s `preprocessing` module
7. `ConfusionMatrixDisplay` from `sklearn`'s `metrics` module

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay

# **PART 1:** The Iris Dataset
---

> The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. - [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)

![](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png)

## Data Cleaning
---

Let's see if our `DataFrame` requires any cleaning. In the cells below:
1. Check the `dtypes` to make sure every column is numerical
2. Check for null values
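
The two checks above can be sketched as follows. This is a minimal example that assumes the data comes from scikit-learn's bundled copy of the Iris dataset (`load_iris`); the names `iris`, `df`, and `null_counts` are illustrative, and your notebook may load the data differently.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: load scikit-learn's bundled Iris data as a DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame  # four feature columns plus an integer 'target' column

# 1. Check that every column is numeric.
print(df.dtypes)

# 2. Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)
```

If any column showed up as `object` or had a non-zero null count, that would be the cue to clean it before modeling.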

## EDA: Visualizing KNN
---

Using `seaborn`, create a scatter plot using two features from your `DataFrame`: `'petal length (cm)'` and `'petal width (cm)'`. Each dot should be colored according to its species.
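
One possible version of this plot is sketched below. It assumes the data is loaded from scikit-learn's `load_iris`; the `species` column is a helper added here so the legend shows names rather than integer codes.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data.
iris = load_iris(as_frame=True)
df = iris.frame.copy()

# Map the integer target (0, 1, 2) to readable species names for the legend.
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(
    data=df,
    x="petal length (cm)",
    y="petal width (cm)",
    hue="species",
    ax=ax,
)
ax.set_title("Iris petal measurements by species")
```

A new flower plotted on these axes would be classified by KNN according to the species of the dots closest to it.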

## EDA: Pairplot
---

Let's expand on the scatter plot created in the previous step. We can use `seaborn`'s `pairplot()` function to create a scatter plot for every pair of features.

## EDA: Quick Aside - Plotly

Seaborn isn't the only data visualization library! Plotly is a versatile data visualization library that empowers users to create interactive and dynamic plots and charts. It is becoming more widely used, offering a user-friendly interface for generating visually appealing and interactive graphics for data analysis and presentation. Want to learn more about Plotly? Visit the documentation [here](https://plotly.com/python/)!

## Baseline Model

> We want a model that will have a better accuracy score than 33.33%!
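
The 33.33% figure is the accuracy of always guessing the most common class. A short sketch of that calculation, assuming the target comes from scikit-learn's `load_iris`:

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data.
y = load_iris(as_frame=True).target

# Proportion of rows in each class; the largest one is the baseline accuracy.
baseline = y.value_counts(normalize=True)
print(baseline)
```

Because the three species have 50 samples each, every class proportion is one third.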

## Train/Test split
---

Use the `train_test_split` function to split your data into a training set and a holdout set.

In [None]:
X = 
y = 
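
One way the split could look is sketched here. The data source (`load_iris`), `random_state=42`, and `stratify=y` are assumptions made for reproducibility and balanced classes, not requirements of the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data.
iris = load_iris(as_frame=True)
X = iris.data    # the four measurement columns
y = iris.target  # integer species labels

# The default test_size is 0.25; stratify keeps class proportions equal
# in both sets, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```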

## `StandardScaler`
---
StandardScaler is a preprocessing technique in machine learning that standardizes the features of a dataset by transforming them to have a mean of 0 and a standard deviation of 1. StandardScaler is particularly useful when working with algorithms (like KNN!) that are sensitive to the scale of the input features, ensuring that all features contribute equally to the model's learning process and preventing any single feature from dominating due to its larger magnitude.

Because KNN calculates the distance between neighbors, it's highly sensitive to the magnitude of your features. For example, if we were using KNN on a housing dataset, a feature like square footage (with values in the **thousands** of square feet) would dominate the distance calculation over a feature measured on a smaller scale, such as the number of bedrooms.

Thus, in order for KNN to work properly, it's important to scale our data. In the cells below, create an instance of `StandardScaler` and use it to transform `X_train` and `X_test`.
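
A sketch of the scaling step, under the same assumptions as before (Iris loaded from `load_iris`, a stratified split with `random_state=42`). The scaler is fit on the training data only, so no information from the holdout set leaks into the transformation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data and an illustrative split.
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42
)

sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)  # learn mean/std from train, then scale
X_test_sc = sc.transform(X_test)        # reuse the training mean/std

# Training columns now have mean ~0 and standard deviation ~1.
print(np.round(X_train_sc.mean(axis=0), 2))
print(np.round(X_train_sc.std(axis=0), 2))
```

The test columns will be close to, but not exactly, mean 0 and standard deviation 1, since they are scaled with the training statistics.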

## Instantiate KNN
---

For the `KNeighborsClassifier`, there are a few important parameters to keep in mind:

1. `n_neighbors`: this is the "K" in KNN. The best K will change from problem to problem, but the default is 5.
2. `weights`: The neighbors can all have an equal vote (`uniform`), or the closer points can have a higher weighted vote (`distance`).
3. `p`: The power parameter for the Minkowski distance metric. The default, `p=2`, gives Euclidean distance; setting `p=1` gives Manhattan distance.

In the cell below, instantiate a `knn` model using the default parameters.
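
A short sketch of instantiation. The second model, `knn_custom`, is purely illustrative and only shows how the three parameters above would be overridden; its values are not recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Default model: n_neighbors=5, weights='uniform', p=2 (Euclidean).
knn = KNeighborsClassifier()
print(knn.get_params())

# Illustrative alternative: 7 neighbors, distance-weighted votes, Manhattan distance.
knn_custom = KNeighborsClassifier(n_neighbors=7, weights="distance", p=1)
```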

## Model Fitting and Evaluation
---

Now that we know what we can expect from our KNN model, let's 
1. fit the model to `X_train_sc`, `y_train`
2. score it on `X_test_sc`, `y_test`
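
Putting the previous steps together, fitting and scoring could look like the sketch below. It assumes the Iris data from `load_iris` and an illustrative stratified split with `random_state=42`; exact scores depend on the split, so none are quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data and an illustrative split.
iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42
)

# Scale using statistics learned from the training set only.
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# 1. Fit the model on the scaled training data.
knn = KNeighborsClassifier()
knn.fit(X_train_sc, y_train)

# 2. Score it; .score() returns accuracy for classifiers.
train_acc = knn.score(X_train_sc, y_train)
test_acc = knn.score(X_test_sc, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

Compare both numbers against the 33.33% baseline, and against each other: a large gap between training and testing accuracy suggests overfitting.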

# **PART 2:** Breast Cancer Dataset
We have another data set on breast cancer. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image. A few of the images can be found [here](https://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/cancer_images/). You can find a partial data dictionary on Kaggle [here](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data/data).

In [None]:
# Drop Unnamed: 32
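
Dropping a column is shown here on a tiny stand-in frame. The `toy` DataFrame and its values are hypothetical and only mimic the shape of the problem: an all-null `Unnamed: 32` column alongside real columns.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for the real CSV.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "diagnosis": ["M", "B"],
        "radius_mean": [17.99, 13.54],
        "Unnamed: 32": [np.nan, np.nan],
    }
)

# drop() returns a new DataFrame, so reassign the result.
toy = toy.drop(columns=["Unnamed: 32"])
print(toy.columns.tolist())
```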


## EDA: Pairplot
---

As in Part 1, we can use `seaborn`'s `pairplot()` function to create scatter plots, this time using just a few chosen features.

## Baseline Model

> We want to get a model that has a higher accuracy than 62.7%!
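
The 62.7% baseline is the share of the majority class. The sketch below reproduces it using scikit-learn's bundled copy of the dataset (`load_breast_cancer`), which is an assumption: the lesson's CSV encodes the diagnosis differently, and in scikit-learn's copy `0` means malignant and `1` means benign.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy instead of the Kaggle CSV.
data = load_breast_cancer(as_frame=True)
y = data.target  # 0 = malignant, 1 = benign in this copy

# Proportion of each class; the larger one is the baseline accuracy.
baseline = y.value_counts(normalize=True)
print(baseline.round(3))
```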

## Train/Test split
---

Use the `train_test_split` function to split your data into a training set and a holdout set.

In [None]:
X = 
y = 

## `StandardScaler`
---

## Instantiate KNN
---
Use the defaults for now! We'll play around with different values of `k` in a bit!

## Model fitting and evaluation
---

Now that we know what we can expect from our KNN model, let's 
1. fit the model to `X_train_sc`, `y_train`
2. score it on `X_test_sc`, `y_test`

In [None]:
# Fit


In [None]:
# Training accuracy score


In [None]:
# Testing accuracy score


In [None]:
# Confusion Matrix

## The Model... or _A_ Model?
We used the default of $k = 5$ earlier. Is that the best value? How do we know?

In [None]:
# Visualize this:
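
One way to visualize the effect of $k$ is to refit the model for a range of values and plot accuracy against $k$. The sketch below assumes scikit-learn's `load_breast_cancer` data and an illustrative split; the range of odd values from 1 to 31 is an arbitrary choice. Picking $k$ from holdout scores, as done here for simplicity, gives an optimistic estimate; cross-validation on the training set would be a cleaner approach.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the dataset.
data = load_breast_cancer(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=42
)
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# Odd values of k avoid tied votes in a two-class problem.
ks = list(range(1, 32, 2))
train_scores, test_scores = [], []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_sc, y_train)
    train_scores.append(model.score(X_train_sc, y_train))
    test_scores.append(model.score(X_test_sc, y_test))

# Plot accuracy against k for both sets.
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(ks, train_scores, marker="o", label="train")
ax.plot(ks, test_scores, marker="o", label="test")
ax.set_xlabel("k (n_neighbors)")
ax.set_ylabel("accuracy")
ax.legend()

# The k with the highest holdout accuracy (first one, if tied).
best_k = ks[int(np.argmax(test_scores))]
print("best k on the holdout set:", best_k)
```

Very small $k$ tends to overfit (high training accuracy, lower testing accuracy), while very large $k$ smooths the decision boundary and can underfit.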


In [None]:
# Instantiate (again) & Refit (again)


In [None]:
# New training score


In [None]:
# New testing score


In [None]:
# New Confusion Matrix


In [None]:
# A bit better!

# Conclusions and Takeaways
* k-nearest neighbors is a model that can be used for both regression and classification, but most commonly for classification
* It's a simpler model that doesn't always perform too well
* The "$k$" in kNN is what we'll come to know as a **tuning parameter** that can be adjusted to get a better model
* kNN suffers from "the curse of dimensionality" - it gets more difficult to use and understand the more columns we have. kNN is best when our data aren't too "wide"

**When to use kNN**
* When you don't have too many rows
* When you don't have too many columns
* When you don't have any categorical features

### Other Common Classification Models:

1. [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

2. [Decision Trees](https://scikit-learn.org/stable/modules/tree.html#decision-trees)

3. [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#random-forests)

4. [Support Vector Machines (SVM)](https://scikit-learn.org/stable/modules/svm.html#svm)

5. [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes)

6. [Neural Networks](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#neural-networks-supervised)


Each model has its own strengths and weaknesses, and the choice depends on the characteristics of the data and the goals of the classification task. It's often beneficial to experiment with multiple models and assess their performance to determine the most suitable approach for a particular problem.


### Let's Try Logistic Regression!

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logr = LogisticRegression()
logr.fit(X_train_sc, y_train)

In [None]:
logr.score(X_train_sc, y_train)

In [None]:
logr.score(X_test_sc, y_test)

In [None]:
ConfusionMatrixDisplay.from_estimator(logr, X_test_sc, y_test, cmap = 'Reds');

### Let's Try Random Forest!

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train_sc, y_train)

In [None]:
rf.score(X_train_sc, y_train)

In [None]:
rf.score(X_test_sc, y_test)

In [None]:
ConfusionMatrixDisplay.from_estimator(rf, X_test_sc, y_test, cmap = 'Purples');