Outline:

* [Introduction](#1)
* [Packages & Libraries](#2)
* [Basic EDA & Visualization](#3)
* [Data Preprocessing & Outlier Analysis, LOF](#4)
* [Model](#5)
    * [K-Nearest Neighbors (KNN) Algorithm](#6)
    * [KNN Best Parameters](#7)
* [Dimensionality Reduction](#8)
    * [Principal Component Analysis (PCA)](#9)
    * [Neighborhood Components Analysis (NCA)](#10)
* [Results & Evaluation](#12)

<a id = "1"></a>
### Introduction
<img src="https://www.jbcp.jo/sites/default/files/2018-05/Breast-Cancer-Staging.jpg">

**Background:** Breast cancer is a type of cancer that starts in the breast. It can start in one or both breasts. Breast cancer can spread when the cancer cells get into the blood or lymph system and then are carried to other parts of the body. The lymph (or lymphatic) system is a part of your body's immune system. It is a network of lymph nodes (small, bean-sized glands), ducts or vessels, and organs that work together to collect and carry clear lymph fluid through the body tissues to the blood. The clear lymph fluid inside the lymph vessels contains tissue by-products and waste material, as well as immune system cells.

Source: [https://www.cancer.org/cancer/breast-cancer/about/what-is-breast-cancer.html]

**Motivation:** Finding breast cancer early and getting state-of-the-art cancer treatment are two of the most important strategies for preventing deaths from breast cancer. Breast cancer that’s found early, when it’s small and has not spread, is easier to treat successfully. In this work we classify breast tumors as malignant or benign with the KNN algorithm. Before training KNN, we analyze the outliers in the dataset. To try to increase model accuracy, we then apply the dimensionality reduction techniques PCA and NCA.

<a id = "2"></a>
### Packages & Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis, LocalOutlierFactor
from sklearn.decomposition import PCA

import warnings
warnings.filterwarnings('ignore')

**First look at the data**

In [None]:
data = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")
data.head(3)

In [None]:
data = data.drop(['id','Unnamed: 32'], axis = 1)
data.head(3)

In [None]:
data.rename(columns = {'diagnosis':'target'}, inplace = True)
data.head(3)

In [None]:
# pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x = 'target', data = data)

In [None]:
print(data.target.value_counts())

In [None]:
data['target'] = [1 if i.strip() == 'M' else 0 for i in data['target']]

In [None]:
data.head(3)

In [None]:
data.shape

In [None]:
data.info()

There are no missing values in the dataset.

In [None]:
data.describe()

Note: The features have very different ranges, so we have to scale all of them to a common scale before using a distance-based model.
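As a minimal sketch of what scaling does, the cell below standardizes two made-up features whose ranges differ by a factor of 1000. The array `toy` and its values are hypothetical and only serve as an illustration; the real features are standardized later in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales (illustrative values only)
toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 3000.0]])

toy_scaled = StandardScaler().fit_transform(toy)

# After standardization each column has mean 0 and unit standard deviation,
# so neither feature dominates a Euclidean distance
print(toy_scaled.mean(axis = 0))
print(toy_scaled.std(axis = 0))
```

Because KNN relies on distances, an unscaled feature with a large range would otherwise outweigh all the others.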

<a id = "3"></a>
### Basic EDA & Visualization

In [None]:
# Correlations
corr_matrix = data.corr()
corr_matrix

In [None]:
# Clustermap with Pearson Correlation Coefficients
sns.clustermap(corr_matrix, annot = True, fmt = ".2f", figsize = (18, 15))
plt.title("Correlation Between Features")
plt.show()

In [None]:
# threshold to explore highly correlated features
threshold = 0.75
filter_ = np.abs(corr_matrix['target']) > threshold
corr_features = corr_matrix.columns[filter_].tolist()
corr_features

In [None]:
# visualize which features are highly correlated
sns.heatmap(data[corr_features].corr(), annot = True, fmt = ".2f")
plt.title("Correlation Between Features")
plt.show()

Several features are strongly correlated with the target, and also with one another.

In [None]:
# melting data for box plot
data_melted = pd.melt(data, id_vars = 'target', var_name = 'feature', value_name = 'value')
data_melted

In [None]:
# Box plot to detect outliers
plt.figure(figsize = (10, 8))
sns.boxplot(x = 'feature', y = 'value', hue = 'target', data = data_melted)
plt.xticks(rotation = 90)
plt.show()

As seen, the box plot is hard to read because the features are not scaled. We have to standardize the features to solve this issue.

In [None]:
# pair plot of the features most correlated with the target, to see if the distributions have skewness
threshold = 0.75
filter_ = np.abs(corr_matrix['target']) > threshold
corr_features = corr_matrix.columns[filter_].tolist()
sns.pairplot(data[corr_features], diag_kind = 'kde', markers = "+", hue = 'target')
plt.show()

<img src="https://studiousguy.com/wp-content/uploads/2021/08/Skewed-Distribution.jpg" width="500" height="600">

Skewness measures the asymmetry of a distribution, that is, how far it departs from a symmetrical bell curve. A distribution is skewed when the data points are not distributed symmetrically to the left and right of the median.

* A skewness value of 0 denotes a symmetrical distribution.
* A negative skewness value indicates an asymmetric distribution whose tail is longer on the left-hand side.
* A positive skewness value indicates an asymmetric distribution whose tail is longer on the right-hand side.
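The three cases above can be reproduced on synthetic data. This is only a sketch: the column names and the random samples are made up for illustration and are not part of the breast cancer dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# hypothetical samples: one symmetric, one with a long right tail
samples = pd.DataFrame({
    "symmetric": rng.normal(size = 10_000),
    "right_skewed": rng.exponential(size = 10_000),
})
# mirroring the right-skewed sample gives a long left tail
samples["left_skewed"] = -samples["right_skewed"]

toy_skewness = samples.skew()
print(toy_skewness)
```

The symmetric sample should give a value near 0, while the two tailed samples give clearly positive and clearly negative values.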

So which features have skewness?

In [None]:
skewness = pd.DataFrame(data.skew(), columns = ['skewness'])
skewness

In [None]:
# |skewness| < 1 is labeled as approximately symmetric (this alone does not prove normality)
skewness['skewness'] = ["Positively skewed" if i >= 1 else "Negatively skewed" if i <= -1 else "Approximately symmetric" for i in skewness['skewness']]
skewness

Many features in the data are skewed. We must address this issue too.

<a id = "4"></a>
### Data Preprocessing & Outlier Analysis

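**Density-based Outlier Detection: Local Outlier Factor (LOF)** : Compares the local density of a point to the local densities of its k nearest neighbors
* LOF clearly greater than 1 ==> the point is in a sparser region than its neighbors: outlier / anomaly
* LOF close to (or below) 1 ==> the point has a density similar to (or higher than) its neighbors: inlier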

<img src="https://miro.medium.com/max/1400/1*5tBfkFHgFcqyNuDa6eNkNQ.jpeg" width="500" height="600" alt = "Source:https://medium.com/mlpoint/local-outlier-factor-a-way-to-detect-outliers-dde335d77e1a">
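To make the rule of thumb above concrete, here is a small sketch on synthetic data: a tight cluster plus one isolated point. The points, the seed and `n_neighbors = 20` are assumptions chosen for illustration. Note that scikit-learn stores the negated score in `negative_outlier_factor_`, so the sign is flipped to recover LOF.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 50 points in a tight cluster plus one hypothetical isolated point
cluster = rng.normal(loc = 0.0, scale = 0.5, size = (50, 2))
toy_points = np.vstack([cluster, [[8.0, 8.0]]])

lof = LocalOutlierFactor(n_neighbors = 20)
labels = lof.fit_predict(toy_points)          # -1 for outliers, +1 for inliers
lof_values = -lof.negative_outlier_factor_    # flip the sign to get LOF itself

print("LOF of the isolated point:", lof_values[-1])
print("Median LOF inside the cluster:", np.median(lof_values[:-1]))
```

The isolated point should get a LOF far above 1, while points inside the cluster stay close to 1.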

In [None]:
y = data.target
x = data.drop(["target"], axis = 1)
columns = x.columns.tolist()

In [None]:
clf = LocalOutlierFactor()
y_pred = clf.fit_predict(x) # Returns -1 for anomalies/outliers and +1 for inliers.
X_score = clf.negative_outlier_factor_
X_score

In [None]:
plt.figure()
plt.scatter(x.iloc[:,0], x.iloc[:,1], color = 'k', s = 3, label = 'Data Point') # radius_mean and texture_mean as an example plot
plt.show()

In [None]:
outlier_score = pd.DataFrame()
outlier_score["score"] = X_score

# threshold for negative lof values
threshold = -2
filter_ = outlier_score["score"] < threshold
outlier_index = outlier_score[filter_].index.tolist()

# Let's plot the outliers based on threshold we set
plt.figure(figsize = (16, 9))
plt.scatter(x.iloc[outlier_index,0], x.iloc[outlier_index,1], color = 'blue', s = 40, label = 'Outliers')
plt.scatter(x.iloc[:,0], x.iloc[:,1], color = 'k', s = 3, label = 'Data Points')
radius = (X_score.max() - X_score) / (X_score.max() - X_score.min())  # Normalization
plt.scatter(x.iloc[:,0], x.iloc[:,1], s=1000*radius, edgecolors = "r", facecolors = "none", label = "Outlier Scores")
plt.legend()
plt.show()

In [None]:
# Drop outliers
x = x.drop(outlier_index)
y = y.drop(outlier_index)

In [None]:
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)

In [None]:
# Standardization
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
y_train = y_train.reset_index(drop = True)
x_train = pd.DataFrame(x_train, columns = columns)
train_df = pd.concat([x_train, y_train], axis = 1)
train_df

In [None]:
# standardized train df
train_df.describe()

In [None]:
train_df.info()

In [None]:
data_melted = pd.melt(train_df, id_vars = "target", var_name = "feature", value_name = "value")
data_melted.head()

In [None]:
# boxplot
plt.figure(figsize = (18, 8))
sns.boxplot(x = "feature", y = "value", hue = "target", data = data_melted)
plt.xticks(rotation = 90)
plt.show()

In [None]:
# let's visualize the pair plot again on the standardized training data, with a slightly lower threshold
threshold = 0.7
filter_ = np.abs(corr_matrix['target']) > threshold
corr_features = corr_matrix.columns[filter_].tolist()
sns.pairplot(train_df[corr_features], diag_kind = 'kde', markers = "+", hue = 'target')
plt.show()

<a id = "5"></a>
### Model

<a id = "6"></a>
#### K-Nearest Neighbors (KNN) Algorithm

K-nearest neighbors (KNN) is a supervised learning algorithm used for both regression and classification. To predict the class of a test point, KNN calculates the distance between that point and all the training points, then selects the K training points that are closest to it. The share of each class among those K neighbors is treated as the probability that the test point belongs to that class, and the class with the highest probability is selected. In the case of regression, the predicted value is the mean of the values of the K selected training points.

Let's look at the example below for a better understanding.

<img src="https://miro.medium.com/max/828/0*34SajbTO2C5Lvigs.png" width="500" height="600">

Source: [https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4]
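The classification procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the scikit-learn implementation: `knn_predict` and the toy arrays are hypothetical names and values introduced only for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x_query, k = 3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(x_train - x_query, axis = 1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# hypothetical toy data: class 0 near the origin, class 1 near (5, 5)
toy_x = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [5.0, 5.0], [5.2, 4.8], [4.7, 5.1]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_x, toy_y, np.array([0.3, 0.3])))
print(knn_predict(toy_x, toy_y, np.array([4.9, 5.0])))
```

In the rest of the notebook we use `KNeighborsClassifier`, which adds efficient neighbor search and optional distance weighting on top of this idea.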

In [None]:
# KNN Model with k = 2
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
score = knn.score(x_test, y_test)
print("Score: ", score)
print("Confusion matrix: ")
print(cm)
print("Basic KNN accuracy: ", acc)

Model score is fairly good but we can do better with hyperparameter optimization.

<a id = "7"></a>
#### KNN Best Parameters

In [None]:
# Choosing best KNN parameters
def KNN_Best_Params(x_train, x_test, y_train, y_test):
    
    k_range = list(range(1,31))
    weight_options = ["uniform", "distance"]
    print()
    param_grid = dict(n_neighbors = k_range, weights = weight_options)
    
    knn = KNeighborsClassifier()
    grid = GridSearchCV(knn, param_grid, cv = 10, scoring = "accuracy")
    grid.fit(x_train, y_train)
    
    print("Best training score {} with parameters: {}".format(grid.best_score_, grid.best_params_))
    print()
    
    knn = KNeighborsClassifier(**grid.best_params_)
    knn.fit(x_train, y_train)
    
    y_pred_test = knn.predict(x_test)
    y_pred_train = knn.predict(x_train)
    
    cm_test = confusion_matrix(y_test, y_pred_test)
    cm_train = confusion_matrix(y_train, y_pred_train)
    
    acc_test = accuracy_score(y_test, y_pred_test)
    acc_train = accuracy_score(y_train, y_pred_train)
    print("Test score: {}, train score: {}".format(acc_test, acc_train))
    print()
    print("CM Test")
    print(cm_test)
    print("CM Train")
    print(cm_train)
    
    return grid

grid = KNN_Best_Params(x_train, x_test, y_train, y_test)

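Train accuracy is **100%** while test accuracy is **94%**, a gap that suggests overfitting. Keep in mind that a perfect train score is expected whenever the grid search selects `weights = "distance"` or `n_neighbors = 1`, because every training point is then its own closest neighbor. We will try to reduce the overfitting below.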

**Overfitting vs. Underfitting**

Overfitting is a concept in data science, which occurs when a statistical model fits exactly against its training data. When this happens, the algorithm unfortunately cannot perform accurately against unseen data, defeating its purpose. Generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data.
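A quick way to see this effect with KNN is to compare train and test accuracy for a few values of k. The sketch below is an illustration under assumptions: it uses the copy of the Wisconsin Diagnostic data that ships with scikit-learn (`load_breast_cancer`) as a stand-in for `data.csv`, skips the outlier removal step, and the values of k are picked arbitrarily.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# stand-in data set bundled with scikit-learn (assumption: not the notebook's data.csv)
X_demo, y_demo = load_breast_cancer(return_X_y = True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.3, random_state = 42)

# fit the scaler on the training part only
demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr)
X_te = demo_scaler.transform(X_te)

gaps = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors = k).fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[k] = (train_acc, test_acc)
    print("k = {}: train accuracy {:.3f}, test accuracy {:.3f}".format(k, train_acc, test_acc))
```

With k = 1 the model memorizes the training set, so the train score says little about how well it generalizes; the test score is the number to watch.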

<img src="https://1.cms.s81c.com/sites/default/files/2021-03-03/model-over-fitting.png" width="600" height="600">

Source: [https://www.ibm.com/cloud/learn/overfitting]

<img src="https://1.cms.s81c.com/sites/default/files/2021-03-03/classic%20overfitting_0.jpg" width="400" height="400">

One of the techniques to reduce overfitting is to reduce model complexity. Here we do that with dimensionality reduction, aiming to increase test accuracy.

<a id = "8"></a>
### Dimensionality Reduction

<a id = "9"></a>
#### Principal Component Analysis (PCA)

Principal component analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them faster without extraneous variables to process.

So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.

<img alt="Principal Component Analysis second principal" src="https://builtin.com/sites/www.builtin.com/files/inline-images/national/Principal%2520Component%2520Analysis%2520second%2520principal.gif" width="700" height="400">

Source: [https://builtin.com/data-science/step-step-explanation-principal-component-analysis]
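"Preserving as much information as possible" can be quantified with the explained variance ratio. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic data (`load_breast_cancer`) as a stand-in for `data.csv` and keeps all samples.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# stand-in data set bundled with scikit-learn (assumption: not the notebook's data.csv)
X_demo, _ = load_breast_cancer(return_X_y = True)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components = 2).fit(X_demo_scaled)

# share of the total variance carried by each principal component
ratios = pca_demo.explained_variance_ratio_
print("Variance explained by each component:", ratios)
print("Total variance kept by 2 components:", ratios.sum())
```

The components are sorted by decreasing variance, so the first one always carries the largest share.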

**Eigenvalues and eigenvectors**

In [None]:
# Assume that we have two arrays for which we want to find the eigenvalues and eigenvectors
x2 = np.array([2.4, 0.6, 2.1, 2, 3, 2.5, 1.9, 1.1, 1.5, 1.2])
y2 = np.array([2.5, 0.7, 2.9, 2.2, 3, 2.3, 2, 1.1, 1.6, 0.8])
plt.scatter(x2,y2)

In [None]:
# Center the data at (0, 0)
x_mean = np.mean(x2)
y_mean = np.mean(y2)
x2 = x2 - x_mean
y2 = y2 - y_mean
plt.scatter(x2,y2)

In [None]:
# cov matrix
cov = np.cov(x2, y2)
cov

Finding the eigenvalues and eigenvectors

In [None]:
from numpy import linalg as LA
w, v = LA.eig(cov)
w

In [None]:
v

So, what do all these mean?

* `w`: `(…, M)` array. The eigenvalues, each repeated according to its multiplicity. The eigenvalues are not necessarily ordered. The resulting array will be of complex type, unless the imaginary part is zero, in which case it will be cast to a real type. When the input matrix is real, the resulting eigenvalues will be real (0 imaginary part) or occur in conjugate pairs.
* `v`: `(…, M, M)` array. The normalized (unit “length”) eigenvectors, such that the column `v[:,i]` is the eigenvector corresponding to the eigenvalue `w[i]`.
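These properties can be checked numerically. The sketch below rebuilds the covariance matrix from the same two arrays under new names (`a`, `b`, `cov_demo`, `w_demo`, `v_demo` are introduced only for this check) and verifies the defining relation C v = λ v for each eigenpair.

```python
import numpy as np

a = np.array([2.4, 0.6, 2.1, 2, 3, 2.5, 1.9, 1.1, 1.5, 1.2])
b = np.array([2.5, 0.7, 2.9, 2.2, 3, 2.3, 2, 1.1, 1.6, 0.8])
cov_demo = np.cov(a, b)

w_demo, v_demo = np.linalg.eig(cov_demo)

for i in range(len(w_demo)):
    # defining property of an eigenpair: C v = lambda v
    assert np.allclose(cov_demo @ v_demo[:, i], w_demo[i] * v_demo[:, i])
    # eigenvectors are returned with unit length
    assert np.isclose(np.linalg.norm(v_demo[:, i]), 1.0)

# the eigenvalues sum to the total variance (trace of the covariance matrix)
print(w_demo.sum(), np.trace(cov_demo))
```

The eigenvector with the largest eigenvalue points along the direction of greatest variance, which is the first principal component.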

In [None]:
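# eig does not sort its output, so order the eigenvectors by descending eigenvalue
order = np.argsort(w)[::-1]
p1 = v[:, order[0]]  # main component (largest eigenvalue)
p2 = v[:, order[1]]  # small component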

In [None]:
p1

In [None]:
# visualizing the eigenvectors on top of the centered data
plt.scatter(x2,y2)
# main component
plt.plot([-2 * p1[0], 2 * p1[0]] , [-2 * p1[1], 2 * p1[1]])
# small component
plt.plot([-1 * p2[0], 1 * p2[0]] , [-1 * p2[1], 1 * p2[1]])

In [None]:
## PCA
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

pca = PCA(n_components = 2) # reduction to 2 features
pca.fit(x_scaled)
x_reduced_pca = pca.transform(x_scaled)
pca_data = pd.DataFrame(x_reduced_pca, columns = ["p1","p2"])
pca_data["target"] = y.reset_index(drop = True)

# visualize PCA
plt.figure(figsize = (10, 6))
sns.scatterplot(x = "p1", y = "p2", hue = "target", data = pca_data)
plt.title("PCA: p1 vs p2")
plt.show()

In [None]:
# Train-test split
x_train_pca, x_test_pca, y_train_pca, y_test_pca = train_test_split(x_reduced_pca, y, test_size = 0.3, random_state = 42)

In [None]:
grid_pca = KNN_Best_Params(x_train_pca, x_test_pca, y_train_pca, y_test_pca)

In [None]:
# visualize PCA
cmap_light = ListedColormap(["orange", "cornflowerblue"])
cmap_bold = ListedColormap(["darkorange", "darkblue"])

h = .05
X = x_reduced_pca
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1  # y-axis limits come from the second component
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = grid_pca.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap = cmap_light)

# also plot the data points
plt.scatter(X[:,0], X[:,1], c = y, cmap = cmap_bold, edgecolor = "k", s = 20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("%i-Class classification (k = %i, weights = '%s')" % (len(np.unique(y)), grid_pca.best_estimator_.n_neighbors, grid_pca.best_estimator_.weights))

Which test points did we misclassify using PCA?

In [None]:
knn = KNeighborsClassifier(**grid_pca.best_params_)
knn.fit(x_train_pca, y_train_pca)
y_pred_pca = knn.predict(x_test_pca)
acc_test_pca = accuracy_score(y_test_pca, y_pred_pca)  # scikit-learn convention: y_true first
knn.score(x_test_pca, y_test_pca)

test_data = pd.DataFrame()
test_data["X_test_pca_p1"] = x_test_pca[:,0]
test_data["X_test_pca_p2"] = x_test_pca[:,1]
test_data["Y_pred_pca"] = y_pred_pca
test_data["Y_test_pca"] = y_test_pca.reset_index(drop = True)

plt.figure(figsize = (14, 8))
sns.scatterplot(x = "X_test_pca_p1", y = "X_test_pca_p2", hue="Y_test_pca", data = test_data)

diff = np.where(y_pred_pca != y_test_pca)[0]
plt.scatter(test_data.iloc[diff, 0], test_data.iloc[diff, 1], label = "Misclassified", alpha = 0.2, color = "red",s = 1000)
plt.legend()  # redraw the legend so the "Misclassified" marker is listed

<a id = "10"></a>
#### Neighborhood Components Analysis (NCA)

Neighborhood components analysis (NCA) is a supervised learning method for classifying multivariate data into distinct classes according to a given distance metric over the data. Functionally, it serves the same purposes as the K-nearest neighbors algorithm, and makes direct use of a related concept termed stochastic nearest neighbors.

Rather than having the user specify some arbitrary distance metric, NCA learns it by choosing a parameterized family of quadratic distance metrics, constructing a loss function of the parameters, and optimizing it with gradient descent. Furthermore, the learned distance metric can explicitly be made low-dimensional, solving test-time storage and search issues. How does NCA do this?

The goal of the learning algorithm, then, is to optimize the performance of kNN on future test data. Since we don’t know the test data a priori, we can choose instead to optimize the closest thing in our toolbox: the leave-one-out (LOO) performance on the training data.

<img alt="Neighborhood Components Analysis illustration" src="https://d3i71xaburhd42.cloudfront.net/24c287d97982216c8f35c8d326dc2ec2d2475f3e/6-Figure2-1.png" width="500" height="300">
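The leave-one-out objective mentioned above can be written down directly. In NCA each point i picks another point j as its neighbor with a softmax probability over squared distances in the transformed space, and the objective is the expected number of points that pick a neighbor of their own class. The sketch below only evaluates that objective for a fixed linear transform; it does not run the gradient-based optimization that scikit-learn performs, and `nca_objective` and the toy data are hypothetical names and values used for illustration.

```python
import numpy as np

def nca_objective(A, X, y):
    """Expected number of points correctly classified under stochastic
    nearest neighbors, for the linear transform A (shape: d_out x d_in)."""
    Z = X @ A.T                                            # project the data
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis = 2)
    np.fill_diagonal(sq_dist, np.inf)                      # leave-one-out: p_ii = 0
    p = np.exp(-sq_dist)
    p /= p.sum(axis = 1, keepdims = True)                  # softmax over neighbors j != i
    same_class = y[:, None] == y[None, :]
    return (p * same_class).sum()                          # sum over i of p_i

# hypothetical toy data: the class depends only on the first feature
toy_x = np.array([[0.0, 0.0], [0.2, 3.0], [0.1, -3.0],
                  [2.0, 0.1], [2.2, 3.1], [2.1, -2.9]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

keep_informative = np.array([[1.0, 0.0]])   # project onto feature 0
keep_noise = np.array([[0.0, 1.0]])         # project onto feature 1

print(nca_objective(keep_informative, toy_x, toy_y))
print(nca_objective(keep_noise, toy_x, toy_y))
```

A projection that keeps the informative feature scores much higher than one that keeps the uninformative feature, which is the kind of transform the optimization is looking for.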

In [None]:
# NCA
nca = NeighborhoodComponentsAnalysis(n_components = 2, random_state = 42)
nca.fit(x_scaled, y)
x_reduced_nca = nca.transform(x_scaled)
nca_data = pd.DataFrame(x_reduced_nca, columns = ["p1","p2"])
nca_data["target"] = y.reset_index(drop = True)

plt.figure(figsize = (10, 6))
sns.scatterplot(x = "p1", y = "p2", hue = "target", data = nca_data)
plt.title("NCA: p1 vs p2")
plt.show()

In [None]:
# Train-test split
x_train_nca, x_test_nca, y_train_nca, y_test_nca = train_test_split(x_reduced_nca, y, test_size = 0.3, random_state = 42)

In [None]:
grid_nca = KNN_Best_Params(x_train_nca, x_test_nca, y_train_nca, y_test_nca)

In [None]:
# visualize NCA
cmap_light = ListedColormap(["orange", "cornflowerblue"])
cmap_bold = ListedColormap(["darkorange", "darkblue"])

h = .2
X = x_reduced_nca
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1  # y-axis limits come from the second component
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

Z = grid_nca.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap = cmap_light)

# also plot the data points
plt.scatter(X[:,0], X[:,1], c = y, cmap = cmap_bold, edgecolor = "k", s = 20)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("%i-Class classification (k = %i, weights = '%s')" % (len(np.unique(y)), grid_nca.best_estimator_.n_neighbors, grid_nca.best_estimator_.weights))

So, which test points did we misclassify using NCA?

In [None]:
knn = KNeighborsClassifier(**grid_nca.best_params_)
knn.fit(x_train_nca, y_train_nca)
y_pred_nca = knn.predict(x_test_nca)
acc_test_nca = accuracy_score(y_test_nca, y_pred_nca)  # scikit-learn convention: y_true first
knn.score(x_test_nca, y_test_nca)

test_data = pd.DataFrame()
test_data["X_test_nca_p1"] = x_test_nca[:,0]
test_data["X_test_nca_p2"] = x_test_nca[:,1]
test_data["Y_pred_nca"] = y_pred_nca
test_data["Y_test_nca"] = y_test_nca.reset_index(drop = True)

plt.figure(figsize = (14, 8))
sns.scatterplot(x = "X_test_nca_p1", y = "X_test_nca_p2", hue="Y_test_nca", data = test_data)

diff = np.where(y_pred_nca != y_test_nca)[0]
plt.scatter(test_data.iloc[diff, 0], test_data.iloc[diff, 1], label = "Misclassified", alpha = 0.2, color = "red",s = 1000)
plt.legend()  # redraw the legend so the "Misclassified" marker is listed
