# Module 6 KNN Assignment




## Part 1: Review the provided demo code
## Part 2: Follow the instructions below to code your own KNN model

### Import required Python Libraries for Lab

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, model_selection

## KNN Demo Code:

### Load the Cancer Dataset

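The following dataset is from the [UCI ML Breast Cancer Wisconsin (Diagnostic) dataset](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)). The dataset includes features computed from digitized images of fine needle aspirates of breast masses, along with the diagnosis for each sample. In the scikit-learn copy, a TARGET of 0 means malignant and 1 means benign.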

In [None]:
cancer_dataset = datasets.load_breast_cancer()
df = pd.DataFrame(data=cancer_dataset.data, columns=cancer_dataset.feature_names)
df['TARGET'] = cancer_dataset.target
df

### Selecting Features 

To streamline our KNN model fitting process, we'll use the "mean radius" and "mean texture" features to train our classifier for predicting the presence of breast cancer in patients.

In [None]:
# Let's slice our dataset to focus only on the features we'll be utilizing.
df[["mean radius","mean texture", "TARGET"]]

### Visualization of Features

As we're developing a classifier, it may prove helpful to examine some of the features/variables. In the following analysis, we'll investigate the relationship between "mean radius" and "mean texture" and their potential correlation with cancer detection.

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(df["mean radius"], df["mean texture"], c=df["TARGET"])
plt.title("Mean texture and radius colored by detection of Cancer")
plt.xlabel("mean radius")
plt.ylabel("mean texture")
plt.show()

### Split the data into two DataFrames: one for training the model and one for testing it


The train_test_split function in Scikit-learn's model_selection module splits a given dataset into training and testing sets. This is an essential step in the machine learning workflow, as it allows us to evaluate the performance of our model on new, unseen data.

The method takes as input the feature matrix X and the target vector y that we want to split. It also takes the test_size parameter, which specifies the proportion of the dataset to include in the testing set. For example, a test_size of 0.3 means that 30% of the data will be used for testing, while the remaining 70% will be used for training.

Additionally, the train_test_split method can also take other parameters such as random_state, which is used to set the random seed for reproducibility, and stratify, which ensures that the proportion of classes in the training and testing sets is the same as the proportion in the original dataset.

The method returns four arrays: X_train, X_test, y_train, and y_test. The X_train and y_train arrays are used for training the model, while the X_test and y_test arrays are used for evaluating the model's performance on new, unseen data.

By splitting the dataset into training and testing sets, we can train our model on one set of data and then evaluate its performance on a separate set of data, which helps us to detect overfitting and ensures that our model is able to generalize well to new, unseen data.
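
As a rough illustration of the parameters described above, the sketch below splits the breast cancer data with `stratify` turned on and prints the resulting shapes and class shares. The variable names (`X_tr`, `X_te`, and so on) are made up for this example so they do not overwrite the variables used in the demo cells.

```python
from sklearn import datasets, model_selection

# Load the features and labels as NumPy arrays.
X, y = datasets.load_breast_cancer(return_X_y=True)

# Hold out 20% for testing; stratify=y keeps the class balance similar in both sets.
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train:", X_tr.shape, "test:", X_te.shape)
# The mean of the 0/1 labels is the share of class 1 in each set.
print("class-1 share (full, train, test):",
      round(y.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The three printed shares should be close to one another, which is what `stratify` is meant to guarantee.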

In [None]:
# Pass the target as a Series (single brackets) so y is 1-D, as scikit-learn expects
X_train, X_test, y_train, y_test = model_selection.train_test_split(df[["mean radius","mean texture"]],
                                                                    df["TARGET"], test_size=0.2, random_state=42)

### Import an ML algorithm sklearn KNeighborsClassifier

In [None]:
# import the KNN algorithm
from sklearn.neighbors import KNeighborsClassifier

### Initialize the Model

We are setting the hyperparameter `n_neighbors` to 4. 
This number can be changed. Different `n_neighbors` values can change the accuracy of the model. Give it a try! Change `n_neighbors`!

The line model = KNeighborsClassifier(n_neighbors = 4) creates an instance of the KNeighborsClassifier class from the Scikit-learn machine learning library and assigns it to the variable model.

The KNeighborsClassifier is a type of supervised learning algorithm used for classification tasks. It works by finding the k nearest neighbors to a given data point in the feature space and assigning the class label of the majority of those neighbors to the data point being classified. The value of k is specified by the n_neighbors parameter, which is set to 4 in this example.

By creating an instance of the KNeighborsClassifier class with n_neighbors = 4, we're initializing a model that will classify data points by considering the class labels of their four nearest neighbors in the feature space. We can then use this model to make predictions on new, unseen data.
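
To make the majority-vote idea concrete, here is a minimal from-scratch sketch of classifying a single point. It is not how Scikit-learn implements the classifier; the function `knn_predict_one` and the toy data are hypothetical, and ties are broken in a simplified way.

```python
import numpy as np
from collections import Counter

def knn_predict_one(train_X, train_y, query, k=4):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors (ties go to the label seen first).
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

# Tiny made-up dataset: two features, two classes.
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(toy_X, toy_y, np.array([1.1, 1.0]), k=4))  # 3 of 4 neighbors are class 0
print(knn_predict_one(toy_X, toy_y, np.array([5.1, 5.0]), k=4))  # 3 of 4 neighbors are class 1
```

With an even k such as 4, a 2-2 tie between classes is possible, which is one reason odd values of k are often tried for two-class problems.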

In [None]:
# Initialize the Model
model = KNeighborsClassifier(n_neighbors = 4)

### Fit the new model

The line model.fit(X_train, y_train) trains the KNeighborsClassifier model created earlier using the training set X_train and corresponding target variable y_train.

The fit() method is a built-in method of Scikit-learn estimator objects, including the KNeighborsClassifier class. It takes the feature matrix X_train and target vector y_train as input.

Unlike models such as linear or logistic regression, KNN has no weights or coefficients to optimize. Fitting simply stores the training data and, depending on the `algorithm` setting, may build an index (such as a KD-tree or ball tree) to speed up neighbor searches. For this reason KNN is often called a "lazy" learner: the distance calculations happen at prediction time rather than during fit().

After fitting, the model object holds the training data, which the predict() method uses to classify new, unseen data.

In [None]:
# Fit the model
model.fit(X_train, y_train)

### Test the model by making a prediction

The line pred = model.predict(X_test) uses the trained KNeighborsClassifier model to make predictions on the test set X_test and assigns the predicted values to the pred variable.

The predict() method is a built-in method of Scikit-learn estimator objects, including the KNeighborsClassifier class. It takes the feature matrix X_test as input and returns a predicted target value for each row.

In other words, for each data point in the test set, predict() finds the k nearest training points in the feature space and returns the majority class label among them.

After the predictions have been made for the test set, the predicted values are assigned to the pred variable for later analysis and evaluation of the model's performance on the test set.
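
To see where a single prediction comes from, the sketch below looks up the neighbors behind it using the classifier's `kneighbors()` and `predict_proba()` methods. It rebuilds the data and model under separate, made-up names (`frame`, `knn`, `X_tr`, ...) so it can run on its own.

```python
import pandas as pd
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
frame = pd.DataFrame(data.data, columns=data.feature_names)
frame["TARGET"] = data.target

features = ["mean radius", "mean texture"]
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    frame[features], frame["TARGET"], test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=4).fit(X_tr, y_tr)

# Take the first test row and look up its 4 nearest training points.
sample = X_te.iloc[[0]]
distances, indices = knn.kneighbors(sample)
# kneighbors() returns positions in the training set, so use iloc to fetch labels.
neighbor_labels = y_tr.iloc[indices[0]].to_numpy()

print("neighbor distances:", distances[0].round(3))
print("neighbor labels:   ", neighbor_labels)
# With the default uniform weights, predict_proba is the share of neighbors per class.
print("class shares:      ", knn.predict_proba(sample)[0])
print("predicted class:   ", knn.predict(sample)[0])
```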

In [None]:
# Predict
pred = model.predict(X_test)

### Visualize Data and boundary

A colormesh plot to show the decision boundary of the KNN model.


In [None]:
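# Plot the model's decision regions with the data points on top

fig, ax = plt.subplots(figsize=(10,6))

# Build a grid covering the feature ranges and classify every grid point
xx, yy = np.meshgrid(np.arange(6, 30, 0.1),
                     np.arange(6, 42, 0.1))
# Use a DataFrame with the training column names to avoid a feature-name warning
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=["mean radius", "mean texture"])
z = model.predict(grid)
z = z.reshape(xx.shape)

ax.pcolormesh(xx, yy, z, alpha=0.1)

# In this dataset, TARGET 0 is malignant and TARGET 1 is benign
for label, data in df.groupby('TARGET'):
    ax.scatter(data["mean radius"], data["mean texture"], label=["Malignant","Benign"][label])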

ax.set_title("Decision Boundary of the KNN Classifier")
ax.set_xlabel("mean radius")
ax.set_ylabel("mean texture")
ax.legend()
plt.show()

### The code below tests the accuracy of the model

This code block is used to evaluate the performance of the trained KNeighborsClassifier model on the test set, specifically by computing the mean squared error, accuracy score, and classification report.

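mean_squared_error(y_test, pred) computes the mean squared error between the true target values in the test set y_test and the predicted values pred. This metric measures the average squared difference between the predicted and true values, and a lower value indicates better performance. It is primarily a regression metric; for class labels encoded as 0 and 1, each squared difference is 1 for a wrong prediction and 0 for a correct one, so the result is the fraction of misclassified samples.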

accuracy_score(y_test, pred) computes the accuracy of the model by comparing the predicted target values pred with the true target values in the test set y_test. The accuracy is the proportion of correctly classified samples, and a higher value indicates better performance.
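
As a quick check of how the two metrics above relate when the labels are 0 and 1, the sketch below uses a small set of made-up labels and predictions (`y_true` and `y_hat` are hypothetical, not taken from the model). Two of the ten predictions are wrong.

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up true labels and predictions; positions 2 and 6 disagree.
y_true = [0, 1, 1, 0, 1, 1, 0, 1, 1, 1]
y_hat  = [0, 1, 0, 0, 1, 1, 1, 1, 1, 1]

mse = mean_squared_error(y_true, y_hat)   # share of wrong predictions: 2 / 10
acc = accuracy_score(y_true, y_hat)       # share of correct predictions: 8 / 10

print("MSE:", mse, "accuracy:", acc, "sum:", mse + acc)
```

On 0/1 labels the two values sum to 1, so they carry the same information. This does not hold for targets with three or more classes, where the numeric gap between class labels has no meaning.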

classification_report(y_test, pred) generates a text report that summarizes the precision, recall, and F1-score for each class in the target variable. This report provides a more detailed evaluation of the model's performance on each class.

By printing these metrics, we can evaluate the performance of the model on the test set and determine how well it is able to generalize to new, unseen data. These metrics can be used to compare different models and select the best one for the task at hand.
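
One simple way to compare models is to refit the classifier with several `n_neighbors` values and record the test accuracy of each. The sketch below does this under made-up names (`frame`, `scores`, `candidate`); the list of k values is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn import datasets, model_selection
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

data = datasets.load_breast_cancer()
frame = pd.DataFrame(data.data, columns=data.feature_names)
frame["TARGET"] = data.target

X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    frame[["mean radius", "mean texture"]], frame["TARGET"],
    test_size=0.2, random_state=42
)

# Fit one model per candidate k and record its accuracy on the test set.
scores = {}
for k in [1, 2, 4, 8, 16]:
    candidate = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, candidate.predict(X_te))

for k, score in scores.items():
    print(f"n_neighbors={k:>2}: accuracy={score:.3f}")
```

Picking k from test-set scores makes the final accuracy estimate optimistic, so in practice a separate validation set or cross-validation is usually used for this choice.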

In [None]:
# Evaluation of accuracy
from sklearn.metrics import mean_squared_error, classification_report, accuracy_score
print('Mean squared error: ', mean_squared_error(y_test, pred))
print("Accuracy Score: ", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

### Use the Model with new data


1. Given a patient with a mean radius of 14.2 and a mean texture of 30.3, is it probable that the patient has breast cancer?
2. Would a patient with a mean radius of 11.2 and a mean texture of 33.6 be likely to have breast cancer?


In [None]:
patient1 = pd.DataFrame([[14.2, 30.3]], columns=["mean radius", "mean texture"])
prediction = model.predict(patient1)[0]
print(f"Patient 1 {['is likely', 'is not likely'][prediction]} to have cancer")

patient2 = pd.DataFrame([[11.2, 33.6]], columns=["mean radius", "mean texture"])
prediction = model.predict(patient2)[0]
print(f"Patient 2 {['is likely', 'is not likely'][prediction]} to have cancer")

## Module 6 Assignment: Your Turn KNN


### Import the below required Python Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, model_selection

### Load the UCI Wine Dataset

The given dataset is sourced from the UCI ML Wine Recognition dataset and presents the outcomes of a chemical analysis of wines that were grown in the same area in Italy using three distinct grape cultivars. Your task is to create a new column TARGET that corresponds to the target column of the load_wine() data and to construct a model that can forecast the cultivar (grape type) that the wine was produced from.

In [None]:
wine_dataset = datasets.load_wine()

df = pd.DataFrame(data=wine_dataset.data, columns=wine_dataset.feature_names)
df['TARGET'] = wine_dataset.target
df

### Feature Selection

To simplify the model, train the KNN classifier by utilizing the features of "malic_acid" and "color_intensity". This will aid in predicting the grape cultivar that the wine has originated from.

In [None]:
# Use pandas to extract the features we care about
# ADD CODE HERE

### Visualize the features selected for model 

Create a scatter() graph of the features `"malic_acid"` and `"color_intensity"`, and see if there is a correlation with the cultivar number.

Use the `c` argument to color the dots by target class.

In [None]:
# ADD CODE HERE


### Split dataset into training and testing data



In [None]:
# Sample Code provided - Add the features and target 
# X_train, X_test, y_train, y_test = model_selection.train_test_split(df[[ADD CODE HERE]],df[[ADD CODE HERE]], test_size=0.2, random_state=42)

### Import Sklearn Algorithms KNeighborsClassifier
 

In [None]:
# import the KNN algorithm
# ADD CODE HERE

### Initialize the Model

Set hyperparameter `n_neighbors = 4`.

In [None]:
# ADD CODE HERE

### Fit the Model Hint: fit()

In [None]:
# ADD CODE HERE

In [None]:
# Use predict() on model
# ADD CODE HERE

### Use the provided code to test the accuracy

Display the:
`mean_squared_error`, `classification_report`, and `accuracy_score`.

### ADD COMMENT HERE: Explain the below code.

In [None]:
# Evaluation of accuracy
from sklearn.metrics import mean_squared_error, classification_report, accuracy_score

 
print('Mean squared error: ', mean_squared_error(y_test, pred))
print("Accuracy Score: ", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

### Use your new model


Which cultivar most likely produced a wine with `malic_acid = 4.1` and `color_intensity = 1.1`?

In [None]:
# Sample Code
wine1 = pd.DataFrame([[4.1, 1.1]], columns=["malic_acid", "color_intensity"])
prediction = model.predict(wine1)[0]
print(f"Wine 1 is most likely from {['cultivar 1', 'cultivar 2', 'cultivar 3'][prediction]}")

Which cultivar most likely produced a wine with `malic_acid = 5.3` and `color_intensity = 8.1`?

In [None]:
# ADD CODE HERE