# Topic 2: Advanced Classification Exercise

I'm following along with the **Decision Trees** implementation from this [YouTube video](https://youtu.be/My4JgIeFdWk).

This notebook is available in the GitHub repository [here](https://github.com/fauzanghazi/advanced-ai/blob/main/notebooks/exercise/04-advanced-classification.ipynb).

#### About the Dataset

The dataset used in this exercise is the [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data) from Kaggle.

#### Import necessary libraries & dataset

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#### Load & display the dataset

In [None]:
data = pd.read_csv("../../data/breast-cancer.csv")
data.head()

#### Understanding the dataset

In [None]:
data.info() # to get the information on the columns and datatype

In [None]:
data.describe() # summary statistics for the numerical columns

#### Cleaning the data
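- Check for null values
    - visualize them with a seaborn heatmap
- Drop the columns that are not useful as predictors
    - Unnamed: 32 (contains only null values)
    - id (an identifier, not a feature)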

In [None]:
sns.heatmap(data.isnull())
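
The heatmap gives a visual impression of where the nulls are; a per-column count makes the same check explicit. The sketch below runs on a small made-up frame (`toy` and its values are hypothetical stand-ins, not the real dataset) so that it works on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one column is entirely empty,
# similar to "Unnamed: 32" in the Kaggle CSV.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Columns in which every single value is missing
empty_cols = null_counts[null_counts == len(toy)].index.tolist()
print(empty_cols)
```

On the real data, the same two calls would be applied to `data` instead of `toy`.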

In [None]:
# drop the column

data.drop(['Unnamed: 32', 'id'], inplace=True, axis=1)

In [None]:
data.head()

#### Change diagnosis class into numerical value (0 and 1)

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
data['diagnosis'] = encoder.fit_transform(data['diagnosis'])
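
To see which integer each class received, the fitted encoder's `classes_` attribute can be inspected. `LabelEncoder` sorts the classes, so with the labels "B" and "M" the benign class should map to 0 and the malignant class to 1. A minimal sketch on a made-up list of labels (the `labels` list is hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same format as the diagnosis column
labels = ["M", "B", "B", "M", "B"]

enc = LabelEncoder()
encoded = enc.fit_transform(labels)

# classes_ is sorted, and the position of each class is its integer code
mapping = {str(cls): int(code) for code, cls in enumerate(enc.classes_)}
print(mapping)
print(encoded)
```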

In [None]:
data.head()

In [None]:
# simple eda for quick view

data["diagnosis"].value_counts().plot(kind='bar')
plt.title("Diagnosis Count")
plt.xlabel("Diagnosis (0 = Benign, 1 = Malignant)")
plt.ylabel("Count")
plt.show()

#### Dividing the predictors and the target variable
We also standardize the predictors (zero mean, unit variance per feature) so that all features are on a similar scale and contribute comparably to the model.

In [None]:
y = data["diagnosis"] # target variable
X = data.drop(["diagnosis"], axis=1) # predictors; .iloc would also work here, as Dr Syahid instructed

In [None]:
from sklearn.preprocessing import StandardScaler

# create a scaler object

scaler = StandardScaler()

# fit the scaler to the data and transform the data

X_scaled = scaler.fit_transform(X)

In [None]:
X_scaled
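
As a quick sanity check of what `StandardScaler` does, the sketch below compares it with the manual formula (x - mean) / std on a tiny made-up array (`X_toy` is hypothetical). Each scaled column should end up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual standardization; np.std uses the population std (ddof=0),
# which is also what StandardScaler uses
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```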

#### Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
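
Since the two diagnosis classes are not equally frequent, one option worth considering is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. This is not part of the original exercise; the sketch below uses a made-up imbalanced target (`X_toy`, `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 samples of class 0, 30 of class 1
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratification, the share of class 1 is the same in both splits
print(y_tr.mean(), y_te.mean())
```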

#### Train the model

In [None]:
from sklearn.linear_model import LogisticRegression

# create the lr model

lr = LogisticRegression()

# train the model on the training data

lr.fit(X_train, y_train)

# predict the target variable based on the test data

y_pred = lr.predict(X_test)
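
One thing to be aware of: the scaler above was fitted on the full dataset before the split, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below is an assumption-laden variant, not the method from the exercise: it uses scikit-learn's bundled copy of the Wisconsin data (`load_breast_cancer`) instead of the Kaggle CSV so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled dataset as a stand-in for the Kaggle CSV.
# Note that its encoding is the opposite of ours: 0 = malignant, 1 = benign.
X_raw, y_all = load_breast_cancer(return_X_y=True)

# Split first, on the unscaled data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_all, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation when predicting on test rows
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(round(test_accuracy, 3))
```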

#### Evaluating the model performance

In [None]:
# accuracy score

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

In [None]:
# precision score

from sklearn.metrics import precision_score
precision_score(y_test, y_pred)

In [None]:
# recall

from sklearn.metrics import recall_score
recall_score(y_test, y_pred)

In [None]:
# classification report (this is an extensive report from the library)

from sklearn.metrics import classification_report
# target_names follows the sorted label order: 0 = benign, 1 = malignant
print(classification_report(y_test, y_pred, target_names=['benign', 'malignant']))

In [None]:
# confusion matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred, labels=[0,1])
cm
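
The accuracy, precision and recall scores can all be read off the confusion matrix. With `labels=[0,1]`, rows are the actual classes and columns are the predicted ones, so the four cells are TN, FP, FN, TP in row-major order. The numbers in `cm_toy` below are made up for illustration and are not the results of this notebook.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted)
cm_toy = np.array([[50, 2],
                   [3, 45]])

# Row-major order for labels [0, 1]: TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()   # correct predictions / all predictions
precision = tp / (tp + fp)            # of predicted positives, how many are right
recall = tp / (tp + fn)               # of actual positives, how many are found

print(accuracy, precision, recall)
```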

In [None]:
# Let's visualize a heatmap for our confusion matrix using seaborn & matplotlib
# We also normalize each row with numpy so the values lie between 0 and 1 (fraction of each actual class)

import numpy as np

cm_normalized = np.round(cm/np.sum(cm, axis=1).reshape(-1,1),2)

sns.heatmap(cm_normalized, cmap="RdBu", annot=True,
            cbar_kws={"orientation":"vertical","label":"color bar"},
            xticklabels=[0,1], yticklabels=[0,1]
            )
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
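
The row normalization can also be left to scikit-learn: `confusion_matrix` accepts `normalize="true"`, which divides each row by the number of actual samples in that class. The sketch below checks that this matches the manual numpy division, using made-up labels (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm_counts = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Manual row normalization: divide each row by its total
manual = cm_counts / cm_counts.sum(axis=1, keepdims=True)

# Built-in equivalent
builtin = confusion_matrix(y_true, y_hat, labels=[0, 1], normalize="true")

print(cm_counts)
print(builtin)
```

After this normalization every row sums to 1, and the diagonal holds the recall of each class.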