Objective:

The objective of this assessment is to evaluate your understanding and ability to apply supervised learning techniques to a real-world dataset.

Dataset:
Use the breast cancer dataset available in the sklearn library.

# 1. Loading and Preprocessing 

Load the breast cancer dataset from sklearn.

Preprocess the data to handle any missing values and perform necessary feature scaling.

Explain the preprocessing steps you performed and justify why they are necessary for this dataset.

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np

# Load the dataset
cancer_data = load_breast_cancer()
X = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
y = pd.Series(cancer_data.target)

# Count missing values per feature
print(X.isnull().sum())

# If there are any missing values, handle them
# For this dataset, there are no missing values, but if there were, we could use:
# X.fillna(X.mean(), inplace=True)  # Example of imputation

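# Feature Scaling
# Scaling matters for algorithms that rely on distances or margins (e.g., KNN, SVM)
# and helps gradient-based solvers such as Logistic Regression converge.
# Note: the scalers here are fitted on the full dataset before the split, so the
# test set influences the scaling statistics; Section 3 fits the scaler on the
# training split only.
scaler = StandardScaler()
X_scaled_standard = scaler.fit_transform(X)

# Min-max scaling is computed for comparison only; the models below use the
# standardized features.
minmax_scaler = MinMaxScaler()
X_scaled_minmax = minmax_scaler.fit_transform(X)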

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled_standard, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}, Test set shape: {X_test.shape}")


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64
Training set shape: (455, 30), Test set shape: (114, 30)
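
To make the two scaling steps concrete, the sketch below reimplements standardization and min-max scaling in plain Python for a single feature. The values in `train_area` and `test_area` are made up for illustration and the helper names are hypothetical; the point is that the scaling statistics are learned from the training values and then reused for the test values.

```python
import statistics

def fit_standardizer(column):
    """Learn mean and population standard deviation from a training column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (ddof=0), as StandardScaler uses
    return mean, std

def standardize(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(value - mean) / std for value in column]

def fit_minmax(column):
    """Learn the minimum and maximum from a training column."""
    return min(column), max(column)

def minmax_scale(column, low, high):
    """Apply x' = (x - low) / (high - low) with previously learned bounds."""
    return [(value - low) / (high - low) for value in column]

# Made-up values for one feature (not taken from the dataset)
train_area = [400.0, 600.0, 800.0]
test_area = [1000.0]

mean, std = fit_standardizer(train_area)
train_z = standardize(train_area, mean, std)
test_z = standardize(test_area, mean, std)   # reuses the training statistics

low, high = fit_minmax(train_area)
train_mm = minmax_scale(train_area, low, high)
test_mm = minmax_scale(test_area, low, high)  # reuses the training bounds

print(train_z, test_z)
print(train_mm, test_mm)
```

A test value outside the training range maps above 1 under min-max scaling (here 1.5), which is expected when the bounds come from the training split only.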


# 2. Classification Algorithm Implementation 

Implement the following five classification algorithms:
    
1. Logistic Regression
2. Decision Tree Classifier
3. Random Forest Classifier
4. Support Vector Machine (SVM)
5. k-Nearest Neighbors (k-NN)

For each algorithm, provide a brief description of how it works and why it might be suitable for this dataset.

1. Logistic Regression

Description: Logistic Regression is a linear model for binary classification. It passes a weighted sum of the features through the logistic (sigmoid) function to estimate the probability of the positive class (label 1, which is benign in this dataset; label 0 is malignant) and predicts class 1 when that probability exceeds 0.5.

Suitability: Logistic Regression works well when the classes are close to linearly separable. It is interpretable, fast to train, and outputs probabilities that are useful for assessing risk, which makes it a good baseline for the breast cancer dataset.
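
As a rough illustration of the description above, the sketch below computes a class probability from a weighted sum and the logistic function. The weights, bias, and sample are hypothetical numbers chosen for the example, not values fitted by `LogisticRegression`.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_benign(features, weights, bias):
    """Estimated probability of class 1 (benign) for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical parameters for two standardized features (not fitted values)
weights = [-2.0, -1.0]
bias = 0.5
sample = [1.0, 0.5]

p_benign = predict_proba_benign(sample, weights, bias)
label = 1 if p_benign >= 0.5 else 0  # threshold at 0.5
print(f"P(benign) = {p_benign:.3f}, predicted label = {label}")
```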

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train the model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log_reg = log_reg.predict(X_test)

# Evaluation
print("Logistic Regression:")
print(classification_report(y_test, y_pred_log_reg))


Logistic Regression:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



2. Decision Tree Classifier

Description: A Decision Tree Classifier recursively splits the data on feature thresholds, creating a tree-like structure. At each node it picks the split that most reduces impurity, measured by Gini impurity (the scikit-learn default, used here) or by entropy (information gain).

Suitability: Decision Trees are easy to interpret, handle the dataset's numerical features without scaling, and capture non-linear relationships between tumor characteristics. A single unpruned tree can overfit the training data, which is worth keeping in mind when reading its test scores.
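
The split criterion can be sketched in a few lines. The example below computes Gini impurity and the impurity decrease of one candidate split; the label lists are made up for illustration and the function names are hypothetical, not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Made-up labels on each side of a candidate threshold
left = [0, 0, 0, 1]
right = [1, 1, 1, 1]
parent = left + right

print("parent Gini:", gini(parent))
print("impurity decrease:", impurity_decrease(parent, left, right))
```

A tree builder would evaluate this quantity for many candidate thresholds and keep the split with the largest decrease.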

In [3]:
from sklearn.tree import DecisionTreeClassifier

# Train the model
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Predictions
y_pred_dt = dt_classifier.predict(X_test)

# Evaluation
print("Decision Tree Classifier:")
print(classification_report(y_test, y_pred_dt))


Decision Tree Classifier:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



3. Random Forest Classifier

Description: Random Forest is an ensemble method that combines many decision trees to improve accuracy. Each tree is trained on a bootstrap sample of the training data and considers a random subset of features at each split; the forest then predicts by majority vote (scikit-learn averages the trees' predicted class probabilities).

Suitability: Averaging many decorrelated trees reduces the overfitting of a single tree, and the method copes well with many features, such as the 30 in the breast cancer dataset. It usually, though not always, generalizes better than a single Decision Tree.
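
The two ingredients named above, bootstrap sampling and voting, can be sketched as follows. This is a simplified illustration with hypothetical helper names and made-up tree predictions; it does not train any trees.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the most common predicted label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
n_samples = 10

# Each tree would be trained on its own bootstrap sample of the rows
samples = [bootstrap_indices(n_samples, rng) for _ in range(3)]
for i, indices in enumerate(samples):
    print(f"tree {i}: rows {sorted(indices)}")

# Hypothetical predictions of three trees for one test sample
tree_predictions = [1, 0, 1]
forest_prediction = majority_vote(tree_predictions)
print("forest prediction:", forest_prediction)
```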

In [4]:
from sklearn.ensemble import RandomForestClassifier

# Train the model
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_classifier.predict(X_test)

# Evaluation
print("Random Forest Classifier:")
print(classification_report(y_test, y_pred_rf))


Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



4. Support Vector Machine (SVM)

Description: A Support Vector Machine finds the hyperplane that separates the classes with the largest margin. The margin is determined only by the training points closest to the hyperplane (the support vectors); kernels allow the separating surface to be non-linear in the original feature space.

Suitability: SVM is effective in high-dimensional spaces and well suited to binary classification. The model below uses a linear kernel, which assumes the scaled features are close to linearly separable; a non-linear kernel such as RBF is an alternative if they are not.
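
For a linear kernel, the decision rule and the margin have simple closed forms. The sketch below uses a hypothetical weight vector `w` and bias `b` (not values learned by `SVC`) to show how a sample is classified and how wide the margin is.

```python
import math

def decision_value(x, w, b):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters and sample
w = [3.0, 4.0]
b = -1.0
x = [1.0, 1.0]

score = decision_value(x, w, b)
label = 1 if score >= 0 else 0
print("score:", score, "label:", label)
print("margin width:", margin_width(w))
```

Training an SVM amounts to choosing `w` and `b` so that this margin is as wide as possible while keeping misclassification penalties small.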

In [5]:
from sklearn.svm import SVC

# Train the model
svm_classifier = SVC(kernel='linear', random_state=42)
svm_classifier.fit(X_train, y_train)

# Predictions
y_pred_svm = svm_classifier.predict(X_test)

# Evaluation
print("Support Vector Machine Classifier:")
print(classification_report(y_test, y_pred_svm))


Support Vector Machine Classifier:
              precision    recall  f1-score   support

           0       0.93      0.95      0.94        43
           1       0.97      0.96      0.96        71

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114



5. k-Nearest Neighbors (k-NN)

Description: k-NN classifies a data point by the majority class among its k nearest neighbors in the feature space. Distance can be measured with different metrics; scikit-learn's default is Euclidean distance (Minkowski with p=2).

Suitability: k-NN is simple, makes no assumption about the data distribution, and is inexpensive on a small dataset such as this one (569 samples). Because it relies on distances, it needs the scaled features prepared in Section 1.
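
A minimal from-scratch version of the rule described above is sketched here. The training points are made up for illustration, and `knn_predict` is a hypothetical helper, not the scikit-learn implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_points, train_labels, query, k):
    """Majority label among the k training points closest to the query."""
    distances = sorted(
        (euclidean(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already scaled 2-D points with labels 0 and 1
train_points = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]]
train_labels = [0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, [0.2, 0.1], k=3))
print(knn_predict(train_points, train_labels, [1.0, 0.9], k=3))
```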

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Load the dataset
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)

# Step 2: Preprocess the data (if necessary)
# Scaling the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Step 4: Train the KNN model
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)

# Step 5: Make predictions
y_pred_knn = knn_classifier.predict(X_test)

# Step 6: Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("\nClassification Report:\n", classification_report(y_test, y_pred_knn))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))


Accuracy: 0.9473684210526315

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114


Confusion Matrix:
 [[40  3]
 [ 3 68]]
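
The figures in the classification report follow directly from the confusion matrix above (rows are true labels, columns are predicted labels). The sketch below recomputes them in plain Python; the helper names are hypothetical.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[40, 3],
      [3, 68]]

def accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def precision(cm, cls):
    """Of the samples predicted as cls, the fraction that truly are cls."""
    return cm[cls][cls] / sum(row[cls] for row in cm)

def recall(cm, cls):
    """Of the samples that truly are cls, the fraction predicted as cls."""
    return cm[cls][cls] / sum(cm[cls])

print(f"accuracy: {accuracy(cm):.4f}")
for cls in (0, 1):
    print(f"class {cls}: precision={precision(cm, cls):.2f}, recall={recall(cm, cls):.2f}")
```

Class 0 is malignant in this dataset, so its recall (40 of 43) is the share of malignant tumors the model caught.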


# 3. Model Comparison 


Compare the performance of the five classification algorithms.
Which algorithm performed the best and which one performed the worst?

Import Necessary Libraries: Make sure the required libraries are installed.
    

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


Load the Dataset: Load the breast cancer dataset and split it into training and testing sets.

In [3]:
# Load the dataset
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = pd.Series(cancer.target)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


Train and Evaluate Models: Loop over the models, train each one on the scaled training data, and record its test accuracy for comparison.

In [4]:
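# Dictionary of classifiers to compare
# Note: SVC() uses the default RBF kernel here (Section 2 used kernel='linear'), and
# DecisionTreeClassifier() / RandomForestClassifier() have no random_state, so their
# accuracies can vary slightly between runs.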
classifiers = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(),
    'SVM': SVC(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier()
}

# Store results
results = {}

for name, clf in classifiers.items():
    clf.fit(X_train_scaled, y_train)
    y_pred = clf.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy

# Convert results to DataFrame for better visualization
results_df = pd.DataFrame(list(results.items()), columns=['Algorithm', 'Accuracy'])


Analyze the Results: Sort and display the results to determine which model performed best and which performed worst.

In [5]:
# Sort results
results_df = results_df.sort_values(by='Accuracy', ascending=False)
print(results_df)

# Best and Worst Model
best_model = results_df.iloc[0]
worst_model = results_df.iloc[-1]

print(f"Best Model: {best_model['Algorithm']} with an accuracy of {best_model['Accuracy']:.2f}")
print(f"Worst Model: {worst_model['Algorithm']} with an accuracy of {worst_model['Accuracy']:.2f}")


             Algorithm  Accuracy
2                  SVM  0.982456
3  Logistic Regression  0.973684
4        Random Forest  0.964912
0                  KNN  0.947368
1        Decision Tree  0.938596
Best Model: SVM with an accuracy of 0.98
Worst Model: Decision Tree with an accuracy of 0.94
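
The accuracies above come from a single 114-sample test split, so it helps to translate them back into counts of correctly classified samples. The sketch below does that arithmetic using the printed values; the variable names are chosen for this example.

```python
# Accuracies copied from the results table above
accuracies = {
    'SVM': 0.982456,
    'Logistic Regression': 0.973684,
    'Random Forest': 0.964912,
    'KNN': 0.947368,
    'Decision Tree': 0.938596,
}
n_test = 114  # size of the test split

# Number of correctly classified test samples per model
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
for name, count in correct.items():
    print(f"{name}: {count}/{n_test} correct")

gap = correct['SVM'] - correct['Decision Tree']
print(f"Best and worst model differ by {gap} test samples")
```

The best and worst models differ by only a handful of test samples, so the ranking could plausibly change with a different split. Cross-validation would be one way to get a more stable comparison; that is a suggestion here, not something this notebook does.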
