# CM3015 Mid-term Coursework Report

## Abstract
This report aims to identify the best machine learning model for the breast cancer dataset obtained from the scikit-learn library. The study implements a k-Nearest Neighbors (KNN) algorithm from scratch and compares it with scikit-learn's KNN and Decision Tree classifiers. The report summarizes the findings and evaluations of these machine learning models.

## 1. Introduction
Machine learning plays a crucial role in healthcare, and identifying the best model for breast cancer classification is of paramount importance. This project explores the application of various machine learning algorithms to the breast cancer dataset, aiming to determine the most effective model and implementation.



## 2. Background
In this study, two prominent machine learning algorithms, k-Nearest Neighbors (KNN) and Decision Trees Classification (DTC), are explored for their efficacy in breast cancer classification. KNN, a proximity-based algorithm, assigns a data point the majority class of its k-nearest neighbors. Additionally, a scratch implementation of KNN is developed for comparison. On the other hand, DTC constructs a tree-like model, recursively partitioning the feature space to make decisions. Both algorithms offer unique strengths and interpretability, and their performance will be rigorously compared to determine the most effective approach for breast cancer classification. The scratch implementation of KNN introduces an additional layer of analysis, providing insights into algorithmic intricacies.
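To make the recursive partitioning idea concrete, here is a minimal sketch of a single split decision: choosing the threshold on one feature that minimizes the weighted Gini impurity of the two resulting groups. The helper names `gini` and `best_threshold` and the toy data are hypothetical, and the sketch only illustrates the principle, not scikit-learn's internal implementation.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical one-feature toy data with two clearly separated classes
feature = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
target = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(feature, target)
```

A decision tree repeats this kind of search over every feature, then applies it again inside each resulting group until a stopping rule such as a maximum depth is reached.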

## 3. Methodology
We started by loading the breast cancer dataset from scikit-learn and preprocessing the data as necessary. The methodology includes the implementation of a KNN algorithm from scratch using standard Python code, alongside the application of scikit-learn's KNN and Decision Trees Classification. The exploration involves techniques such as cross-validation to ensure robust evaluation.

### 3.1 KNN from scratch

The K-Nearest Neighbors (KNN) algorithm is implemented from scratch in Python, incorporating essential functions for distance calculation and neighbor selection. The methodology involves the following steps:

1. **Euclidean Distance Calculation:** The euclidean_distance() function computes the Euclidean distance between two data points.
2. **Neighbor Selection:** The get_neighbors() function identifies the k-nearest neighbors of a given test data point within the training dataset based on Euclidean distance.
3. **Classification Prediction:** The predict_classification() function predicts the class label of the test data point using a majority voting mechanism among its neighbors.

The choice of the number of neighbors (k) is crucial, and a thorough evaluation is conducted to determine the optimal value. The methodology ensures a comprehensive understanding of the implemented KNN algorithm and its performance characteristics on the breast cancer dataset.
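One way such an evaluation of k could look is sketched below: a compact, standard-library KNN is scored with leave-one-out accuracy for several odd values of k. The function names, the candidate values, and the toy data are all assumptions made for illustration. This is not the procedure used in this report.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(X, y, k):
    """Fraction of points classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)


# Hypothetical toy data: two well-separated clusters of four points each
toy_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
         (1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1)]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

# Odd values of k avoid tied votes in a two-class problem
accuracy_by_k = {k: leave_one_out_accuracy(toy_X, toy_y, k) for k in (1, 3, 5, 7)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
```

On this toy data the small values of k classify every held-out point correctly, while k = 7 lets the opposite cluster outvote the three remaining same-class points. This illustrates why k has to be chosen relative to the amount of data available per class.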

In [140]:
import numpy as np

In [145]:
# Get distance between rows
def euclidean_distance(row1, row2):
    return np.sqrt(np.sum((row1 - row2)**2))

In [146]:
# Locate the most similar neighbors
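def get_neighbors(train, test_row, num_neighbors):
    distances = list()

    # Distance from the test row to every training row
    for i, train_row in enumerate(train):
        dist = euclidean_distance(test_row, train_row)
        distances.append((i, dist))

    # Sort once, after all distances have been collected
    distances.sort(key=lambda tup: tup[1])

    # Indices of the num_neighbors closest training rows
    neighbors = [index for index, _ in distances[:num_neighbors]]

    return neighbors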

In [147]:
# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors, y_train):
    
    negativeCounter = 0
    positiveCounter = 0
    
    neighbors = get_neighbors(train, test_row, num_neighbors)
    
    # Count the votes for each class among the neighbors
    for n in neighbors:
        if (y_train[n] == 0): negativeCounter += 1
        if (y_train[n] == 1): positiveCounter += 1

    if (negativeCounter > positiveCounter): return 0
    if (negativeCounter < positiveCounter): return 1
    # A tie is only possible for an even K; no label is returned in that case
    print("choose another value for K")
    return None

### 3.2 Scaling

The breast cancer dataset (X) is loaded, consisting of various features. To standardize the feature values, the StandardScaler from scikit-learn is employed.



In [149]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset
data = load_breast_cancer()
# Get features
X = data.data
# Instance of scaler
scaler = StandardScaler()
# Scaled features
X_scaled = scaler.fit_transform(X)

The fit_transform method ensures that the scaler learns the mean and standard deviation from the data and scales it accordingly.
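As an illustration of what this standardization computes, the following sketch applies the same per-column formula, z = (x - mean) / std, using only the standard library. The function name `standardize_columns` and the toy rows are hypothetical. The sketch uses the population standard deviation, which is what StandardScaler uses, and it assumes no column is constant.

```python
import statistics


def standardize_columns(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Hypothetical toy data: three samples, two features on different scales
toy_rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 60.0)]
scaled = standardize_columns(toy_rows)
first_column = [row[0] for row in scaled]
second_column = [row[1] for row in scaled]
```

After scaling, both columns have mean 0 and standard deviation 1, so neither feature dominates a Euclidean distance simply because of its units.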

### Splitting train/test data

The dataset is split into training and testing sets using the train_test_split function from scikit-learn, a common practice for assessing a model on unseen data. With test_size=0.2, 20% of the samples are held out for testing, and random_state=0 fixes the shuffle so that the split is reproducible.


In [270]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, data.target, test_size=0.2, random_state=0)
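The size of this split can be checked with a small sketch. The breast cancer dataset has 569 samples, so a 20% test fraction rounded up gives 114 test samples and 455 training samples, which matches the 114 entries in each confusion matrix reported later. The helper `split_indices` below is hypothetical. It shows the idea of a seeded shuffle followed by a cut, and it does not reproduce the exact rows that train_test_split selects.

```python
import math
import random


def split_indices(n_samples, test_fraction, seed):
    """Shuffle indices 0..n_samples-1 and split them into (train, test)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round the test size up, as assumed here for a fractional test size
    n_test = math.ceil(test_fraction * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=569, test_fraction=0.2, seed=0)
```

Because the shuffle is seeded, calling the function again with the same arguments returns the same partition, which is the role random_state plays in the cell above.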

### Predictions

A loop iterates over each data point in the testing set (X_test). For each test point, the predict_classification function from Section 3.1 is called with the training features (X_train), the test point (X_test[i]), the number of neighbors (11 in this case), and the training labels (y_train).

Each predicted label is appended to the list y_pred, which holds the predictions for the entire testing set once the loop completes.

In [275]:
# Initialize an empty list to store predictions
y_pred = list()

# Making Predictions
for i in range(len(X_test)):
    y_pred.append(predict_classification(X_train, X_test[i], 11, y_train))


### Decision Tree Classification

A Decision Tree classifier from scikit-learn is trained on the same training split, with max_depth=10 and random_state=42:

In [334]:
from sklearn.tree import DecisionTreeClassifier
dtc_model = DecisionTreeClassifier(random_state=42, max_depth=10)
dtc_model.fit(X_train,y_train)

In [335]:
y_pred = dtc_model.predict(X_test)

In [336]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Evaluate Model
cm = confusion_matrix(y_test,y_pred)
asc = accuracy_score(y_test,y_pred)
fs = f1_score(y_test,y_pred)

print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[44  3]
 [ 7 60]]

Accuracy Score
 0.9122807017543859

F1 Score
 0.923076923076923


### Cross-validation

In [338]:
from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 cross-validation folds
scores = cross_val_score(dtc_model, X_scaled, data.target, cv=5)
scores.mean()

0.9173420276354604
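The fold structure behind a 5-fold score can be sketched as follows. This is a simplified, unshuffled k-fold partition written with a hypothetical helper `k_fold_indices`. For a classifier, cross_val_score with an integer cv uses stratified folds, which additionally keep the class proportions similar in every fold, so the sketch shows the principle rather than the exact folds used above.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=569, n_folds=5))
fold_sizes = [len(test) for _, test in folds]
```

Every sample appears in exactly one test fold, so the mean of the five fold scores uses the whole dataset for evaluation, whereas a single train/test split evaluates on only 20% of it.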

## 4. Results
The confusion matrix, accuracy, and F1 score of each model on the held-out test set are reported below, in the same order as the experiments in the methodology section.

### KNN implementation

In [277]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Evaluate Model
cm = confusion_matrix(y_test,y_pred)
asc = accuracy_score(y_test,y_pred) 
fs = f1_score(y_test,y_pred)
print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[43  4]
 [ 1 66]]

Accuracy Score
 0.956140350877193

F1 Score
 0.9635036496350364
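The reported scores can be recomputed by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads [[TN, FP], [FN, TP]]) and treats label 1 as the positive class. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(confusion):
    """Accuracy, precision, recall and F1 from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrix reported for the scratch KNN implementation
knn_metrics = binary_metrics([[43, 4], [1, 66]])
```

Here accuracy is 109/114 and F1 is 132/137, which agree with the printed values of 0.9561 and 0.9635. The same function also yields precision (66/70) and recall (66/67).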


### KNN comparison with sklearn

In [278]:
from sklearn.neighbors import KNeighborsClassifier

# Create and train the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=11)
knn_classifier.fit(X_train, y_train)

# Make predictions
predictions = knn_classifier.predict(X_test)

In [279]:
# Evaluate Model
cm = confusion_matrix(y_test,predictions)
asc = accuracy_score(y_test,predictions) 
fs = f1_score(y_test,predictions)

print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[43  4]
 [ 1 66]]

Accuracy Score
 0.956140350877193

F1 Score
 0.9635036496350364


### Decision Tree Classifier

In [329]:
# Evaluate Model
cm = confusion_matrix(y_test,y_pred)
asc = accuracy_score(y_test,y_pred) 
fs = f1_score(y_test,y_pred)

print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[44  3]
 [ 2 65]]

Accuracy Score
 0.956140350877193

F1 Score
 0.962962962962963


## 5. Evaluation
This section critically evaluates the strengths and weaknesses of each model. It demonstrates a nuanced understanding of the breast cancer dataset and the performance of the implemented algorithms. Emphasis is placed on how well the models achieve the stated aim of identifying the best machine learning model for breast cancer classification.

**Evaluation and Parameter Tuning:**
The KNN algorithm is applied to the breast cancer dataset, and the implementation's performance is assessed using evaluation metrics such as precision, accuracy, and recall.

**Comparison with scikit-learn's KNN:** The scratch implementation is compared with scikit-learn's KNN implementation to validate correctness and assess performance differences. On the test set, both produce the same confusion matrix, accuracy (0.956), and F1 score (0.964).

## 6. Conclusions
The findings of this study are succinctly summarized, relating them to the initial aim. The report provides insights into the effectiveness of each machine learning model on the breast cancer dataset.
