# CM3015 Mid-term Coursework Report

## 1. Introduction
Machine learning plays a crucial role in healthcare, and the identification of the best model for breast cancer classification is of paramount importance. This report aims to identify the best machine learning model for the breast cancer dataset obtained from the scikit-learn library. The study employs a comprehensive approach by implementing a k-Nearest Neighbors (KNN) algorithm from scratch, utilizing scikit-learn's KNN, and applying Decision Trees Classification from the same library. The report summarizes the findings and evaluations of these machine learning models.

## 2. Background
In this study, two prominent machine learning algorithms, k-Nearest Neighbors (KNN) and Decision Trees Classification (DTC), are explored for their efficacy in breast cancer classification. KNN, a proximity-based algorithm, assigns a data point the majority class of its k nearest neighbors[3]. DTC, in contrast, constructs a tree-like model by recursively partitioning the feature space to make decisions[2]. Both algorithms are comparatively easy to interpret, and their performance is compared here to determine the more effective approach for breast cancer classification. The scratch implementation of KNN adds a further layer of analysis by exposing how the algorithm works internally.

## 3. Methodology
The investigation began by importing the breast cancer dataset from scikit-learn, scaling the features, and dividing the data into training and testing sets. A KNN algorithm was then implemented from scratch in Python with NumPy, and scikit-learn's Decision Trees Classification was applied to the same training data. This section presents the implementation of each of these steps, setting the stage for the model comparison in the Results section.

### 3.1 Scaling 

To enhance the robustness and effectiveness of the models, a scaling process was employed. Using the StandardScaler from scikit-learn, each feature of the breast cancer dataset was standardized to zero mean and unit variance. Note that this does not constrain the values to a fixed range such as 0 to 1; standardized values can be negative or greater than 1. Standardization is crucial for preventing features with large numeric ranges from dominating the model, ensuring that the distances between data points are not excessively influenced by specific columns. Putting all features on a comparable scale reduces the risk of misclassifications caused by skewed distances, thus contributing to the overall improvement of the model results.[6]



In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

# Load breast cancer dataset
data = load_breast_cancer()
# Get features
X = data.data
# Instance of scaler
scaler = StandardScaler()
# Scaled features
X_scaled = scaler.fit_transform(X)
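
As a quick illustration of what standardization does, the sketch below applies the same formula by hand, z = (x - mean) / std, to a small made-up matrix. The array `X_toy` and its values are hypothetical and are not taken from the breast cancer dataset; the population standard deviation (NumPy's default) is assumed, which matches StandardScaler.

```python
import numpy as np

# Hypothetical toy feature matrix: two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 1500.0],
                  [3.0, 4000.0],
                  [10.0, 2500.0]])

# Column-wise statistics; np.std defaults to the population std (ddof=0)
mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)

# Standardize each column: subtract its mean, divide by its std
X_std = (X_toy - mean) / std

print("Column means after scaling:", X_std.mean(axis=0))
print("Column stds after scaling: ", X_std.std(axis=0))
print("Smallest and largest value:", X_std.min(), X_std.max())
```

After scaling, both columns have a mean of approximately 0 and a standard deviation of 1, while individual values fall below 0 and above 1.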

### 3.2 Splitting train/test data

The dataset is divided into training and testing sets with the train_test_split function from scikit-learn. This widely adopted practice in machine learning serves to evaluate a model's efficacy on unseen data[1]. The function is configured to allocate 80% of the data for training and to reserve the remaining 20% for testing, with a fixed random_state so that the split is reproducible. The data being partitioned is the standardized feature matrix X_scaled.[5]


In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, data.target, test_size=0.2, random_state=42)
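
The report scales the full dataset before splitting it. A commonly used alternative, sketched below, splits first and fits the scaler on the training portion only, so that no statistics from the test rows enter the preprocessing. This is an illustrative variant and not the procedure used for the results in this report; the names `X_train_raw`, `X_test_raw`, `X_train_alt` and `X_test_alt` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Split the raw (unscaled) features first
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Learn the mean and std from the training rows only
scaler = StandardScaler()
X_train_alt = scaler.fit_transform(X_train_raw)

# Reuse the training statistics to transform the test rows
X_test_alt = scaler.transform(X_test_raw)

print(X_train_alt.shape, X_test_alt.shape)
```

With this ordering the test set stays unseen by every fitted component, which is the property the train/test split is meant to protect.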

### 3.3 Homegrown KNN

The implementation of the K-Nearest Neighbors (KNN) algorithm in Python involves crafting essential functions for distance computation and neighbor selection[3]. This homegrown KNN algorithm follows a systematic process comprising the following key steps:

1. **Euclidean Distance Calculation:** The euclidean_dist() function computes the Euclidean distance between two data points. Specifically, it computes the distance between two rows of data, employing the fundamental concept of Euclidean geometry.

2. **Neighbor Selection:** The get_neighbors_indexes() function determines the k-nearest neighbors of a specified test data point within the training dataset by utilizing Euclidean distance. This function retrieves an array containing the respective indexes of the identified neighbors.

3. **Classification Prediction:** The predict_classification() function serves the critical role of predicting the class label of the test data point. Employing a majority voting mechanism among its k-nearest neighbors, this function ensures a robust classification prediction based on the collective input from the neighborhood.

Note: The choice of the number of neighbors (k) is crucial. Because k is a discrete hyperparameter, it is usually selected with a validation procedure, such as cross-validation over a set of candidate values, rather than with gradient descent, which requires a differentiable objective. Tuning k for the Homegrown model is outside the scope of this report; k = 11 is used throughout.
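
For illustration, the following is a minimal sketch of how k could be chosen by cross-validation on the training split. It assumes scikit-learn's KNeighborsClassifier as a stand-in for the homegrown model, and the list `candidate_ks` is an arbitrary, hypothetical choice of odd values. It was not used to pick k = 11 in this report.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical candidate values; odd numbers avoid ties with two classes
candidate_ks = [1, 3, 5, 7, 9, 11, 13, 15]

mean_scores = {}
for k in candidate_ks:
    # Scaling sits inside the pipeline so each fold is scaled on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("Best k:", best_k, "mean CV accuracy:", mean_scores[best_k])
```

Only the training split is used for the search, so the test set remains available for a final, unbiased evaluation.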

In [3]:
import numpy as np

In [4]:
# Get distance between rows
def euclidean_dist(row1, row2):
    return np.sqrt(np.sum((row1 - row2)**2))

In [5]:
# Locate the neighbors and get their indexes
def get_neighbors_indexes(sample, row, k):

    # Instance for all distances of row
    distances = list()

    # For each row in sample set
    for i, sample_row in enumerate(sample):
        
        # Calculate distance between rows
        dist = euclidean_dist(row, sample_row)
        # Store the index of sample_row together with its distance
        distances.append((i, dist))

    # Sort once, by distance, after all distances have been collected
    distances.sort(key=lambda tup: tup[1])

    # Get the indexes of the k closest neighbors
    neighbors_indexes = [index for index, _ in distances[:k]]
    
    return neighbors_indexes

In [6]:
# Make a classification prediction with neighbors
def predict_classification(sample, row, k, y_train):

    # Instantiate votes
    negative_counter = 0
    positive_counter = 0

    # Get neighbors
    neighbor_indexes = get_neighbors_indexes(sample, row, k)

    # Count votes
    for n in neighbor_indexes:
        if (y_train[n] == 0): negative_counter += 1
        if (y_train[n] == 1): positive_counter += 1

    # Classify
    if (negative_counter > positive_counter): return 0
    if (negative_counter < positive_counter): return 1
    if (negative_counter == positive_counter): return None
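
The three steps can also be written more compactly with NumPy array operations. The sketch below is an alternative formulation, not the implementation evaluated in this report; the function name `knn_predict` and the toy arrays are hypothetical, and binary labels 0 and 1 are assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, row, k):
    # Euclidean distance from `row` to every training row at once
    distances = np.linalg.norm(X_train - row, axis=1)
    # Indexes of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count votes for class 0 and class 1 among the neighbors
    votes = np.bincount(y_train[nearest], minlength=2)
    # Mirror the report's behaviour: a tie yields None
    if votes[0] == votes[1]:
        return None
    return int(np.argmax(votes))

# Hypothetical toy data: three points near the origin (class 0), two near (1, 1) (class 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 0, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0]), k=3))   # 1
```

Computing all distances in one array operation avoids the Python-level loop over training rows, which matters as the training set grows.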

### 3.4 Predictions of the Homegrown KNN

The crafted KNN algorithm is now applied to gather predictions, which are stored in the variable homegrown_KNN_predictions. The code iterates through the test samples and compares each test row with the training samples by calling predict_classification() with k = 11. Each call returns 0, 1, or None (a tied vote). Because k is odd and there are only two classes, a tie cannot occur here, so every test data point receives a class label.

In [7]:
# Initialize an empty list to store predictions
homegrown_KNN_predictions = list()

# Making Predictions
for i in range(len(X_test)):
    homegrown_KNN_predictions.append(predict_classification(X_train, X_test[i], 11, y_train))


### 3.5 Decision Tree Classification

In this section, a decision tree classification model is built with scikit-learn's DecisionTreeClassifier, using a maximum depth of 10 and a fixed random_state for reproducibility. The resultant model is stored in the variable dtc_model. It is fitted on the same training split used for the KNN models, which keeps the comparison with the other algorithms consistent. The predictions derived from this decision tree model are stored in the variable decision_tree_predictions.

In [8]:
from sklearn.tree import DecisionTreeClassifier

# Create a Decision Tree
dtc_model = DecisionTreeClassifier(random_state=42, max_depth=10)
# Fit data into model
dtc_model.fit(X_train,y_train)

In [9]:
# Get predictions
decision_tree_predictions = dtc_model.predict(X_test)
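
Since interpretability is one of the reasons for considering a decision tree, the sketch below shows one way the fitted tree could be inspected using scikit-learn's export_text, get_depth and get_n_leaves. It rebuilds the model with the same settings as above so that it runs on its own; the variable name `tree` is hypothetical, and this inspection was not part of the original analysis.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, data.target, test_size=0.2, random_state=42
)

# Same settings as dtc_model in this report
tree = DecisionTreeClassifier(random_state=42, max_depth=10).fit(X_train, y_train)

# max_depth is only an upper bound; the tree may stop growing earlier
print("Depth reached:", tree.get_depth())
print("Number of leaves:", tree.get_n_leaves())

# Print the first levels of the learned decision rules as text
print(export_text(tree, feature_names=list(data.feature_names), max_depth=2))
```

The printed rules show which features the tree splits on first, which is the kind of insight a distance-based model such as KNN does not provide directly.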

## 4. Results
In this analysis, we unveil the performance of the Homegrown K-Nearest Neighbors (KNN) algorithm, meticulously crafted within this report. The algorithm's precision is highlighted through a comprehensive examination of confusion matrices, accuracy scores, and F1 scores. Additionally, a comparative analysis with scikit-learn's KNN model accentuates the homegrown solution's reliability. Furthermore, insights into the classification capabilities of a Decision Tree Classifier and cross-validation results contribute to a nuanced understanding of the models' effectiveness.

In [10]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

### 4.1 Homegrown KNN Algorithm

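The outcomes of the Homegrown K-Nearest Neighbors (KNN) algorithm on the 114 test samples are detailed below. Following the label encoding of this dataset in scikit-learn, class 0 (malignant) is reported as negative and class 1 (benign) as positive: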

**Confusion Matrix**
```
               | Predicted Negative | Predicted Positive |
---------------|--------------------|--------------------|
Actual Negative|        40          |         3          |
---------------|--------------------|--------------------|
Actual Positive|         2          |        69          |
```

**Accuracy Score**
Approximately 95.61%

**F1 Score**
Approximately 96.50%

These results indicate strong performance of the Homegrown KNN algorithm on the test set. The confusion matrix shows that 109 of the 114 test samples were classified correctly, with 3 false positives and 2 false negatives, and the F1 score confirms that precision and recall are both high.

In [11]:
# Evaluate Homegrown KNN Model
cm = confusion_matrix(y_test,homegrown_KNN_predictions)
asc = accuracy_score(y_test,homegrown_KNN_predictions) 
fs = f1_score(y_test,homegrown_KNN_predictions)

print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[40  3]
 [ 2 69]]

Accuracy Score
 0.956140350877193

F1 Score
 0.965034965034965
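
To make the link between the confusion matrix and the two scores explicit, the following worked calculation reproduces them from the four cell counts. It assumes scikit-learn's layout for a binary confusion matrix, [[TN, FP], [FN, TP]]; the variable names are illustrative.

```python
# Cell counts taken from the confusion matrix above
tn, fp, fn, tp = 40, 3, 2, 69

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted positives that are truly positive
precision = tp / (tp + fp)

# Recall: share of actual positives that were found
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: ", round(accuracy, 4))   # 0.9561
print("Precision:", round(precision, 4))  # 0.9583
print("Recall:   ", round(recall, 4))     # 0.9718
print("F1 score: ", round(f1, 4))         # 0.965
```

The values agree with the accuracy_score and f1_score outputs, namely 109/114 and 138/143 respectively.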


### 4.2 Comparative Analysis: Homegrown KNN vs. SKlearn KNN

To assess the correctness of the Homegrown K-Nearest Neighbors (KNN) algorithm, a comparative analysis was undertaken against scikit-learn's own KNN implementation, KNeighborsClassifier, configured with the same number of neighbors (k = 11) and trained on the same data.

In [12]:
from sklearn.neighbors import KNeighborsClassifier

# Create and train the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=11)
knn_classifier.fit(X_train, y_train)

# Make predictions
sklearn_KNN_predictions = knn_classifier.predict(X_test)

The evaluation of both the Homegrown KNN algorithm and the SKlearn KNN model yielded identical results:

**Confusion Matrix:**
```
               | Predicted Negative | Predicted Positive |
---------------|--------------------|--------------------|
Actual Negative|        40          |         3          |
---------------|--------------------|--------------------|
Actual Positive|         2          |        69          |
```

**Accuracy Score:** Approximately 95.61%

**F1 Score:** Approximately 96.50%

The striking similarity in results between the Homegrown KNN and SKlearn KNN models suggests a high degree of consistency and correctness in their classification predictions. Both models achieved impressive accuracy and a balanced F1 score, indicating their effectiveness in handling the dataset.

This congruence in performance underscores the reliability of the Homegrown KNN algorithm, showcasing its capacity to match the results produced by a widely-used library implementation. This comparison provides confidence in the accuracy and functionality of the Homegrown KNN algorithm, validating its efficacy in classification tasks.

In [13]:
# Evaluate Model
cm = confusion_matrix(y_test,sklearn_KNN_predictions)
asc = accuracy_score(y_test,sklearn_KNN_predictions) 
fs = f1_score(y_test,sklearn_KNN_predictions)

print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[40  3]
 [ 2 69]]

Accuracy Score
 0.956140350877193

F1 Score
 0.965034965034965


### 4.3 Decision Tree Classifier

The outcomes of the Decision Tree Classifier are presented below, providing insights into its classification performance.

**Confusion Matrix:**
```
               | Predicted Negative | Predicted Positive |
---------------|--------------------|--------------------|
Actual Negative|        40          |         3          |
---------------|--------------------|--------------------|
Actual Positive|         3          |        68          |
```

**Accuracy Score:** Approximately 94.74%

**F1 Score:** Approximately 95.77%

The Decision Tree Classifier exhibits robust classification performance, characterized by a strong accuracy score and a balanced F1 score. The model's accuracy reflects its effectiveness in making accurate predictions, while the balanced F1 score underscores its ability to maintain precision and recall equilibrium.

These results affirm the Decision Tree Classifier's competence in handling classification tasks and contribute to a comprehensive understanding of its performance on the given dataset.

In [14]:
# Evaluate Model
cm = confusion_matrix(y_test,decision_tree_predictions)
asc = accuracy_score(y_test,decision_tree_predictions) 
fs = f1_score(y_test,decision_tree_predictions)

print("Confusion Matrix\n",cm)
print("\nAccuracy Score\n",asc)
print("\nF1 Score\n",fs)

Confusion Matrix
 [[40  3]
 [ 3 68]]

Accuracy Score
 0.9473684210526315

F1 Score
 0.9577464788732394


### 4.4 Cross-Validation Results
To comprehensively evaluate the generalization capabilities of the machine learning models, cross-validation was employed using the scikit-learn library. The decision tree classifier (dtc_model) and the K-Nearest Neighbors algorithm (knn_classifier) were subjected to this rigorous assessment.

In [15]:
from sklearn.model_selection import cross_val_score

**Decision Tree Classifier:**
The decision tree classifier yielded an average cross-validation score of approximately 91.73% over five folds. This score reflects the model's performance across different folds, providing insights into its capacity to generalize to data that has not been encountered before[4]. It is somewhat lower than the 94.74% accuracy obtained on the single train/test split, which suggests that the decision tree's performance depends on how the data is partitioned.

In [16]:
dtc_model_scores = cross_val_score(dtc_model, X_scaled, data.target, cv=5)
dtc_model_scores.mean()

0.9173420276354604

**K-Nearest Neighbors Algorithm:**
The Homegrown K-Nearest Neighbors (KNN) algorithm could not be passed to scikit-learn's cross_val_score function, because it is implemented as standalone functions rather than as an estimator object exposing the fit and predict methods that the function expects. To maintain a comparative analysis, the scikit-learn KNN classifier (knn_classifier) underwent cross-validation instead.

The scikit-learn KNN model exhibited an impressive average cross-validation score of approximately 96.31%. This underscores the model's strong generalization capabilities, indicating its proficiency in handling diverse subsets of the dataset.

In [17]:
knn_model_scores = cross_val_score(knn_classifier, X_scaled, data.target, cv=5)
knn_model_scores.mean()

0.9631113181183046

While the Homegrown KNN algorithm couldn't be directly integrated into the scikit-learn cross-validation process, the results from the scikit-learn KNN model suggest a high level of generalization, contributing to a comprehensive assessment of the models' performance.
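
One possible way to make a from-scratch KNN usable with cross_val_score is to wrap it in a small class that follows scikit-learn's estimator conventions. The sketch below is an assumption about how this could be done, not part of the evaluated work: the class name `HomegrownKNN` is hypothetical, the distance computation is vectorized with NumPy, and a tied vote resolves to the lowest class label instead of returning None.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


class HomegrownKNN(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator wrapper around a from-scratch KNN."""

    def __init__(self, k=11):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        self.classes_ = np.unique(self.y_)
        return self

    def predict(self, X):
        predictions = []
        for row in np.asarray(X, dtype=float):
            # Euclidean distance from this row to every stored training row
            distances = np.sqrt(np.sum((self.X_ - row) ** 2, axis=1))
            nearest = np.argsort(distances, kind="stable")[: self.k]
            # Majority vote; argmax picks the lowest label if counts are tied
            labels, counts = np.unique(self.y_[nearest], return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)


data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

homegrown_scores = cross_val_score(HomegrownKNN(k=11), X_scaled, data.target, cv=5)
print(homegrown_scores.mean())
```

Inheriting from ClassifierMixin supplies the accuracy-based score method that cross_val_score uses by default, so only fit and predict need to be written by hand.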

## 5. Conclusion

The evaluation of the machine learning models in this report reveals noteworthy insights into their performance. The Homegrown K-Nearest Neighbors (KNN) algorithm exhibits robustness, reflected in a high accuracy of approximately 95.61% and a balanced F1 score of 96.50%. The congruence with scikit-learn's KNN model suggests reliability, affirming the homegrown solution's competency.

The Decision Tree Classifier, with an accuracy of about 94.74% and an F1 score of 95.77%, demonstrates commendable classification capabilities. These results emphasize the model's precision and recall equilibrium.

Cross-validation further underscores the models' generalization capabilities. The decision tree classifier achieves an average score of around 91.73%, while scikit-learn's KNN model attains an impressive 96.31%. Although the Homegrown KNN algorithm doesn't integrate seamlessly into cross-validation, scikit-learn's KNN results suggest its potential for robust generalization.

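Collectively, these evaluations indicate that both models are effective on the breast cancer dataset, with KNN scoring slightly higher than the decision tree on the held-out test set and in cross-validation. Their suitability for other datasets or for real-world applications would need to be confirmed separately.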

Selecting a winning algorithm entails a thoughtful consideration of various factors beyond raw performance metrics. While both the Homegrown K-Nearest Neighbors (KNN) algorithm and the Decision Tree Classifier showcase strong results on the breast cancer dataset in terms of accuracy and F1 score, other critical aspects need examination. Considerations include the interpretability of the model, computational efficiency, and ease of implementation. Moreover, assessing factors like model robustness, scalability, and potential for integration into cross-validation procedures can provide a more comprehensive understanding of each algorithm's suitability. Hence, a holistic approach considering a spectrum of criteria is essential to determine the most fitting machine learning model for the breast cancer dataset, ensuring alignment with the goals and constraints of the intended application.

## 6. References
[1] Müller, Andreas C., and Sarah Guido. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, 2016.

[2] Scikit-Learn. “1.10. Decision Trees.” scikit. Accessed December 21, 2023. https://scikit-learn.org/stable/modules/tree.html. 

[3] Scikit-Learn. “1.6. Nearest Neighbors.” scikit. Accessed December 21, 2023. https://scikit-learn.org/stable/modules/neighbors.html. 

[4] Scikit-Learn. “Sklearn.Model_selection.Cross_val_score.” scikit. Accessed December 21, 2023. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score.

[5] Scikit-Learn. “Sklearn.Model_selection.Train_test_split.” scikit. Accessed December 21, 2023. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split.

[6] Scikit-Learn. “Sklearn.Preprocessing.StandardScaler.” scikit. Accessed December 21, 2023. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.