
In [66]:
import pandas as pd

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

In [69]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [70]:
breast_cancer_dataset = pd.read_csv('/content/drive/MyDrive/ML/breast_cancer_dataset_preprocessed.csv')

In [71]:
breast_cancer_dataset.head()

|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | y |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.147954 | -4.443183 | -0.068966 | 4.035033 | 0.817574 | -0.476277 | 0.553593 | 1.268819 | M |
| 1 | -4.595154 | -2.684882 | 1.08411 | -0.403925 | 0.410287 | 0.687051 | 0.284184 | 0.260968 | B |
| 2 | -0.755349 | -2.318373 | -1.938275 | 0.279953 | 0.241712 | 3.409801 | 0.092694 | 1.040391 | M |
| 3 | -0.453863 | 0.197572 | -1.03706 | 0.344384 | 0.070598 | -0.822546 | -0.993352 | -0.946259 | B |
| 4 | -3.27868 | -0.792025 | -0.736833 | -1.621295 | -0.085459 | -0.824324 | -0.107042 | -0.291755 | B |


In [72]:
# Extracting features and target variable
X_bc = breast_cancer_dataset.drop('y', axis=1)
y_bc = breast_cancer_dataset['y']

In [73]:
# Encoding the target variable
label_encoder = LabelEncoder()
y_bc_encoded = label_encoder.fit_transform(y_bc)
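
`LabelEncoder` assigns integer codes in sorted order of the class labels, so with the labels `B` and `M` shown in the preview, `B` should map to 0 and `M` to 1. This matters later because `precision_score`, `recall_score`, and `f1_score` treat label 1 as the positive class by default. A minimal check, using a small hand-written label list (an assumption standing in for `y_bc`):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the notebook's y_bc column (illustrative only).
toy_labels = ["M", "B", "M", "B", "B"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ is sorted, and the position of each class is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Running the same dictionary comprehension on the notebook's own `label_encoder.classes_` would confirm the mapping on the real data.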

In [74]:
# Splitting the dataset into training and testing sets
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(X_bc, y_bc_encoded, test_size=0.3, random_state=842)

In [75]:
X_bc_train.shape

(266, 8)

In [76]:
y_bc_train.shape

(266,)

In [77]:
X_bc_test.shape

(115, 8)

In [78]:
y_bc_test.shape

(115,)
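
The four shapes above imply 381 rows in total. With `test_size=0.3`, scikit-learn rounds the test count up, so 381 rows give 115 test rows and 266 training rows. The split in this notebook does not pass `stratify`, so the B/M ratio in each subset is left to chance. The sketch below reproduces the arithmetic on synthetic data (the array contents and the roughly 38% positive rate are assumptions, not the notebook's data) and shows the optional `stratify` argument:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in with the same dimensions as the notebook's data:
# 381 rows (266 + 115) and 8 feature columns.
X = rng.normal(size=(381, 8))
y = (rng.random(381) < 0.38).astype(int)  # assumed class balance, illustrative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=842, stratify=y
)

print(X_train.shape, X_test.shape)
# With stratify=y the positive rate is nearly identical in both splits.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Stratifying is optional here, but it keeps the class proportions comparable between the training and test sets when the dataset is this small.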

In [79]:
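# Train a model and evaluate it on the held-out test set
def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return (accuracy, precision, recall, f1) on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # precision, recall, and F1 default to pos_label=1, which is 'M' after
    # LabelEncoder sorts the classes (B -> 0, M -> 1)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    return accuracy, precision, recall, f1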

In [80]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine Linear": SVC(kernel='linear'),
    "Support Vector Machine RBF": SVC(kernel='rbf'),
    "Support Vector Machine Poly": SVC(kernel='poly')
}

In [81]:
model_performance = {}
for name, model in models.items():
    performance = evaluate_model(model, X_bc_train, X_bc_test, y_bc_train, y_bc_test)
    model_performance[name] = performance

In [82]:
model_performance

{'Logistic Regression': (0.9565217391304348,
  0.9459459459459459,
  0.9210526315789473,
  0.9333333333333332),
 'Decision Tree': (0.9478260869565217,
  0.9705882352941176,
  0.868421052631579,
  0.9166666666666667),
 'Random Forest': (0.991304347826087,
  1.0,
  0.9736842105263158,
  0.9866666666666666),
 'Support Vector Machine Linear': (0.9565217391304348,
  0.9459459459459459,
  0.9210526315789473,
  0.9333333333333332),
 'Support Vector Machine RBF': (0.991304347826087,
  1.0,
  0.9736842105263158,
  0.9866666666666666),
 'Support Vector Machine Poly': (0.9130434782608695,
  1.0,
  0.7368421052631579,
  0.8484848484848484)}
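
The nested tuples above are hard to compare by eye. One way to read them, sketched here with the scores copied from that output and rounded to four decimals, is to load the dictionary into a DataFrame and sort by F1. The column names are chosen for illustration and match the order returned by `evaluate_model`:

```python
import pandas as pd

# Scores copied from the output above, rounded to four decimals.
model_performance = {
    "Logistic Regression": (0.9565, 0.9459, 0.9211, 0.9333),
    "Decision Tree": (0.9478, 0.9706, 0.8684, 0.9167),
    "Random Forest": (0.9913, 1.0, 0.9737, 0.9867),
    "Support Vector Machine Linear": (0.9565, 0.9459, 0.9211, 0.9333),
    "Support Vector Machine RBF": (0.9913, 1.0, 0.9737, 0.9867),
    "Support Vector Machine Poly": (0.9130, 1.0, 0.7368, 0.8485),
}

results = pd.DataFrame.from_dict(
    model_performance,
    orient="index",
    columns=["accuracy", "precision", "recall", "f1"],
)

# Sort by F1 so ties and gaps between models are easy to spot.
results = results.sort_values("f1", ascending=False)
print(results)
```

In the sorted table, Random Forest and the RBF SVM share the top row values, and Logistic Regression and the linear SVM are also identical to each other.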

**Best Performing Model**

*   The ***SVM with RBF kernel*** was selected as the best performing model.
*   It achieved the highest score on every evaluated metric, *accuracy (99.13%), precision (100%), recall (97.37%), and F1-score (98.67%)*, tied with Random Forest, which produced identical scores on this test split.



**Reason for Model Selection**

- The SVM with RBF kernel reached the highest accuracy (tied with Random Forest) while keeping precision and recall close together, as its high F1-score shows.
- The RBF kernel is effective when the relationship between the features and the target is non-linear, which may explain why it outperformed the linear and polynomial kernels on this dataset.
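
A single 115-row test split cannot separate two models with identical scores, and the tree-based models in this notebook are created without a `random_state`, so their scores can change from run to run. One possible follow-up, sketched here on synthetic data from `make_classification` (a stand-in, not the notebook's CSV), is to compare the tied candidates with stratified k-fold cross-validation. The fold count and seeds are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the notebook's CSV): 381 rows, 8 features.
X, y = make_classification(
    n_samples=381, n_features=8, n_informative=5, random_state=0
)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine RBF": SVC(kernel="rbf"),
}

# Shuffled, stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    cv_f1[name] = (scores.mean(), scores.std())
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean and spread across folds give a steadier basis for choosing between close models than one split does.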




**Performance Metrics Used**

- **Accuracy**: The proportion of all test cases that the model classified correctly.
- **Precision**: The proportion of predicted positives that are actually positive; it drops as false positives increase.
- **Recall**: The proportion of actual positives that the model identified; it drops as false negatives increase.
- **F1-Score**: The harmonic mean of precision and recall, useful when both false positives and false negatives matter.
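
To make the four definitions concrete, the sketch below computes them from confusion counts. The notebook does not print a confusion matrix, so the counts are reconstructed from the reported SVM (RBF) scores and the 115-row test set: precision of 100% implies no false positives, recall of 37/38 implies one false negative, and the remaining 77 rows are true negatives. Treat these counts as an inference.

```python
# Confusion counts reconstructed from the reported SVM (RBF) scores on the
# 115-row test set; the notebook does not print them, so this is an inference.
tp, fp, fn, tn = 37, 0, 1, 77

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9913 precision=1.0000 recall=0.9737 f1=0.9867
```

These values match the SVM (RBF) row in the results, which suggests the model misclassified a single positive case on this split.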


#### Description of the Methodology

**Data Preprocessing**
- **Dataset Loading**: The preprocessed Breast Cancer dataset (381 rows, eight numeric features `x1` to `x8`) was loaded from CSV.
- **Feature-Target Split**: The features and the target `y` (diagnosis, `B` or `M`) were separated.
- **Target Encoding**: The categorical target was encoded numerically with `LabelEncoder`.
- **Data Splitting**: The dataset was divided into training (70%, 266 rows) and testing (30%, 115 rows) sets using `random_state=842`.

**Model Training and Evaluation**
- **Model Selection**: Six models were chosen: Logistic Regression, Decision Tree, Random Forest, and SVMs with Linear, Polynomial, and RBF kernels.
- **Training**: Each model was trained on the training data.
- **Evaluation**: Models were assessed based on accuracy, precision, recall, and F1-score to gauge their classification effectiveness and balance between false positives and negatives.