<a href="https://colab.research.google.com/github/bintezahra14/Comp_Vision_Learning_Journey/blob/main/Supervised_Learning_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Module 2: Lab Submission: Supervised Learning Exploration**

**Step 1: Environment Setup**
Install the required libraries.

In [8]:
pip install scikit-learn numpy matplotlib pandas




**Step 2: Select the Dataset**
The Breast Cancer Wisconsin dataset suits a binary classification task. It is available through the UCI Machine Learning Repository and is commonly used to train supervised learning algorithms that predict whether a tumor is benign or malignant from 30 numeric features.

**Step 3: Code Implementation**
Load the Dataset: Load the Breast Cancer Wisconsin dataset directly from the sklearn library.

In [9]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the dataset
cancer_data = load_breast_cancer()
X = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
y = pd.Series(cancer_data.target)
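The target values are integer codes, so it helps to know which class each code stands for before reading any later report. The following is a minimal sketch of that check; the names `label_names` and `class_counts` are introduced here for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer_data = load_breast_cancer()
y = pd.Series(cancer_data.target, name="target")

# Map each integer code to its class name (illustrative helper, not in the notebook)
label_names = {code: str(name) for code, name in enumerate(cancer_data.target_names)}

# Count the samples per named class
class_counts = y.map(label_names).value_counts()

print(label_names)
print(class_counts)
```

In this dataset, code 0 is malignant and code 1 is benign, so "class 1" in later reports refers to benign tumors.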


**Step 3 (continued): Explore the Dataset**

In [10]:
print(X.head())          # First five rows of the feature table
print(y.value_counts())  # Check the distribution of target classes
X.info()                 # info() prints its own summary and returns None


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0           

**Step 4: Data Preprocessing**
Check for missing values:

In [11]:
print(X.isnull().sum())


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
dtype: int64


The dataset has no missing values, so no imputation is needed. The next step is to scale the features.

In [12]:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


**Step 5: Split the Data**

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
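Because the scaler is fitted on the full dataset before the split, the test samples contribute to the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training portion only. The sketch below illustrates that pattern under stated assumptions: the variable names (`X_raw`, `pipe`, `pipe_accuracy`, and so on) are hypothetical, and its accuracy may differ slightly from the notebook's results.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the raw (unscaled) features and split before any preprocessing
X_raw, y_raw = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then applies the same transformation to the test data
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

pipe_accuracy = pipe.score(X_te, y_te)
print("Pipeline accuracy:", pipe_accuracy)

# The fitted scaler's means come from the training split alone
scaler_means = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_means, X_tr.mean(axis=0)))
```

On a small, clean dataset like this one the effect of the leakage is usually minor, but the pipeline form also keeps preprocessing and model together for later cross-validation.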


**Step 6: Implement Two Supervised Learning Algorithms**
Chosen algorithms: Logistic Regression and Support Vector Machine (SVM).

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Logistic Regression Model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

# SVM Model
svm_model = SVC()
svm_model.fit(X_train, y_train)


**Step 7: Model Evaluation**

In [15]:
from sklearn.metrics import classification_report, accuracy_score

# Logistic Regression Evaluation
lr_predictions = lr_model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_predictions))
print(classification_report(y_test, lr_predictions))

# SVM Evaluation
svm_predictions = svm_model.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, svm_predictions))
print(classification_report(y_test, svm_predictions))


Logistic Regression Accuracy: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

SVM Accuracy: 0.9736842105263158
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
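A single 80/20 split gives only one accuracy estimate per model, so small differences (or ties) may not be meaningful. One way to get a more stable comparison is stratified k-fold cross-validation. The following is a sketch, not part of the original lab; the choice of 5 folds and the names `candidates` and `cv_scores` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# Stratified folds keep the benign/malignant ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each candidate scales inside the pipeline, so scaling is refit per fold
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the mean and standard deviation across folds shows whether any gap between the models is larger than the fold-to-fold variation.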



**Comparative Analysis and Report on Breast Cancer Prediction Models**

In this analysis, we evaluated the performance of two supervised learning algorithms: Logistic Regression and Support Vector Machine (SVM), on the Breast Cancer Wisconsin dataset. The following metrics were used for comparison:

**Accuracy:** The proportion of correctly classified instances among the total instances.

**Precision:** The ratio of true positive predictions to the total predicted positives (true positives + false positives).

**Recall:** The ratio of true positive predictions to the total actual positives (true positives + false negatives).

**F1 Score:** The harmonic mean of precision and recall, providing a balance between the two metrics.
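The four definitions above can be worked through by hand from confusion-matrix counts. The sketch below uses counts that are inferred from the printed classification report (41 of 43 class-0 samples and 70 of 71 class-1 samples correct); the notebook does not print these counts itself, so treat them as an assumption. The function `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Counts inferred from the report above, treating class 1 as the positive class:
# 70 true positives, 1 false negative, 2 false positives, 41 true negatives
metrics = binary_metrics(tp=70, fp=2, fn=1, tn=41)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the accuracy is 111/114, and the rounded precision, recall, and F1 match the class-1 row of the report (0.97, 0.99, 0.98).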



---





```
Metric                      Logistic Regression    Support Vector Machine
---
Accuracy                    0.974                  0.974
Precision (weighted avg)    0.97                   0.97
Recall (weighted avg)       0.97                   0.97
F1 Score (weighted avg)     0.97                   0.97
```



**Discussion**
Both models performed very well on the held-out test set of 114 samples, each reaching an accuracy of 0.974 (111 of 114 correct). Their classification reports are identical: precision, recall, and F1 are 0.98, 0.95, and 0.96 for class 0 and 0.97, 0.99, and 0.98 for class 1. On this split, neither model distinguishes malignant from benign tumors better than the other.

**Model Trade-offs:**

**Model Complexity:** SVM is generally more complex than Logistic Regression, particularly with non-linear kernels such as the RBF kernel that scikit-learn's `SVC` uses by default. This flexibility lets SVM capture non-linear patterns that a linear model cannot, which can improve performance on some datasets. However, this comes at the cost of interpretability.
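To see how much the kernel choice matters on this dataset, one can train the same SVM with a linear and a non-linear kernel. This is an illustrative sketch rather than the notebook's method; the names `default_kernel` and `kernel_accuracy` are hypothetical, and scaling is done inside a pipeline, so results may differ slightly from the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_k, y_k = load_breast_cancer(return_X_y=True)
X_k_train, X_k_test, y_k_train, y_k_test = train_test_split(
    X_k, y_k, test_size=0.2, random_state=42
)

# SVC() with no arguments uses the RBF kernel
default_kernel = SVC().kernel
print("Default kernel:", default_kernel)

# Train one SVM per kernel and record its test accuracy
kernel_accuracy = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_k_train, y_k_train)
    kernel_accuracy[kernel] = model.score(X_k_test, y_k_test)
    print(f"{kernel}: {kernel_accuracy[kernel]:.4f}")
```

If the linear kernel scores about as well as the RBF kernel, the classes are close to linearly separable in the scaled feature space, which would also explain why Logistic Regression keeps pace with SVM.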

**Interpretability:** Logistic Regression provides clear and interpretable results, as the coefficients directly indicate the influence of each feature on the predicted outcome. This transparency is advantageous in medical settings where understanding the decision-making process is crucial.
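As a sketch of what this interpretability looks like in practice, the fitted coefficients can be paired with feature names and ranked by magnitude. The names `coefs` and `top_features` are introduced here for illustration, and the model is refitted on the full scaled dataset purely to keep the example self-contained.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()

# Standardize so that coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(data.data)
clf = LogisticRegression().fit(X_std, data.target)

# One coefficient per feature; positive values push toward class 1, negative toward class 0
coefs = pd.Series(clf.coef_[0], index=data.feature_names)

# Rank features by absolute coefficient and keep the five largest
order = coefs.abs().sort_values(ascending=False).index
top_features = coefs.reindex(order).head(5)
print(top_features)
```

Because the features are standardized, a larger absolute coefficient indicates a stronger influence on the predicted log-odds per standard deviation of that feature. Correlated features can share or trade off weight, so the ranking is a guide rather than a definitive measure of importance.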

Ultimately, since both models performed equally well on this test split, the choice between them should rest on the specific context and requirements, such as the need for interpretability versus the flexibility to model non-linear patterns.