# Assignment 2: Building Classification Models

Instructions
You are provided with a breast cancer dataset (Breast_Cancer_Data.csv), originally taken from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"

columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']

This dataset contains 683 patient records (after rows with missing values are removed), with 10 attribute columns (a sample ID plus 9 cell features) and 1 class label. Each row describes one patient, and the class column indicates whether the patient's tumor is benign (label = 2) or malignant (label = 4). For this dataset, build all the classification models listed below using Python and scikit-learn (no visualization is needed) and tabulate the accuracy and confusion matrix obtained for each. Split the dataset so that the test set is 25% of the total.

Code each classification model in a separate Python file. Then tabulate the accuracy and confusion matrix of each model in a table in a Word document. Finally, submit all the Python files and the Word document.

            a. Logistic Regression

            b. KNN (k = 5)

            c. Linear SVM (kernel = linear)

            d. Kernel SVM (kernel = rbf)

            e. Naïve Bayes

            f. Decision Tree

            g. Random Forest (estimators = 10)

            h. XGBoost


### Step 1: Import Required Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

### Step 2: Load the Dataset from the URL

# Load the dataset from the URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)

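# Preprocess the data: missing 'Bare Nuclei' values appear as '?', so coerce
# the column to numeric (turning '?' into NaN) and drop the incomplete rows
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

# Features: drop the sample ID (an identifier, not a predictor) and the label
X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']  # Target variable: 2 = benign, 4 = malignant

# Hold out 25% of the rows for testing; the fixed seed makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)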
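
The raw file marks missing `Bare Nuclei` values with `?`, which is why the column is coerced to numeric and the resulting NaN rows are dropped (699 raw rows, 16 with a missing value, 683 kept). The sketch below imitates that coerce-then-drop step with the standard library only. The three sample rows and the `to_number` helper are made up for illustration and are not part of the assignment code.

```python
import csv
import io

# Three made-up rows in the file's layout: ID, nine features, class label.
sample = """\
1,5,1,1,1,2,1,3,1,1,2
2,8,4,5,1,2,?,7,3,1,4
3,3,1,1,1,2,2,3,1,1,2
"""


def to_number(text):
    """Return int(text), or None when the field is not numeric (e.g. '?')."""
    try:
        return int(text)
    except ValueError:
        return None


# "Coerce": every field becomes an int or None.
rows = [[to_number(field) for field in record]
        for record in csv.reader(io.StringIO(sample))]

# "Drop": keep only rows without a missing field.
complete = [row for row in rows if None not in row]

print(len(rows), len(complete))  # 3 2
```

In the notebook itself, `pd.to_numeric(..., errors='coerce')` followed by `dropna()` plays the same role on the full dataset.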


### Step 3: Create Each Classification Model

#### a. Logistic Regression

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Logistic Regression Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Logistic Regression Accuracy: 0.9532163742690059
Confusion Matrix:
 [[102   1]
 [  7  61]]
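
As a quick sanity check, the accuracy can be recomputed by hand from the confusion matrix. scikit-learn puts true classes on the rows and predicted classes on the columns, with labels in sorted order (here 2 = benign first, then 4 = malignant), so the diagonal holds the correct predictions. The snippet below is only an illustrative check on the numbers printed above; the variable names are chosen for this example.

```python
import math

# Confusion matrix reported above (rows: true class, columns: predicted class).
conf_matrix = [[102, 1],
               [7, 61]]

correct = conf_matrix[0][0] + conf_matrix[1][1]  # benign and malignant predicted correctly
total = sum(sum(row) for row in conf_matrix)     # every test sample is counted exactly once
accuracy = correct / total

# train_test_split rounds a fractional test size up, so 25% of 683 rows gives 171.
expected_test_rows = math.ceil(0.25 * 683)

print(correct, total, round(accuracy, 4))  # 163 171 0.9532
print(expected_test_rows)                  # 171
```

The 7 in the bottom-left cell counts malignant tumors that the model predicted as benign.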



#### b. KNN (k = 5)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("KNN Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

KNN Accuracy: 0.9473684210526315
Confusion Matrix:
 [[102   1]
 [  8  60]]



#### c. Linear SVM (kernel = linear)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Linear SVM Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Linear SVM Accuracy: 0.9532163742690059
Confusion Matrix:
 [[102   1]
 [  7  61]]



#### d. Kernel SVM (kernel = rbf)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model
model = SVC(kernel='rbf')
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Kernel SVM Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Kernel SVM Accuracy: 0.9473684210526315
Confusion Matrix:
 [[101   2]
 [  7  61]]



#### e. Naïve Bayes

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model
model = GaussianNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Naïve Bayes Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Naïve Bayes Accuracy: 0.9649122807017544
Confusion Matrix:
 [[100   3]
 [  3  65]]
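
Accuracy alone hides which kind of error a model makes. For a task like this one, malignant tumors predicted as benign are usually the costlier mistake, so it can help to also look at recall on the malignant class. The sketch below computes it from two of the confusion matrices printed in this notebook. Treating class 4 as the positive class is an assumption of this example, not something the assignment requires.

```python
def malignant_recall(conf_matrix):
    """Share of truly malignant samples (second row) that were predicted malignant."""
    missed, caught = conf_matrix[1]
    return caught / (caught + missed)


# Confusion matrices as printed in this notebook (rows: true, columns: predicted).
reported = {
    "Logistic Regression": [[102, 1], [7, 61]],
    "Naive Bayes": [[100, 3], [3, 65]],
}

recalls = {name: malignant_recall(cm) for name, cm in reported.items()}
for name, value in recalls.items():
    print(f"{name}: {value:.3f}")
# Logistic Regression: 0.897
# Naive Bayes: 0.956
```

On this split, Naive Bayes misses 3 of the 68 malignant cases, compared with 7 for Logistic Regression.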



#### f. Decision Tree

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model (no random_state is set, so results can vary slightly between runs)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Decision Tree Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Decision Tree Accuracy: 0.9415204678362573
Confusion Matrix:
 [[101   2]
 [  8  60]]



#### g. Random Forest (estimators = 10)


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model (no random_state is set, so results can vary between runs)
model = RandomForestClassifier(n_estimators=10)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Random Forest Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

Random Forest Accuracy: 0.9239766081871345
Confusion Matrix:
 [[102   1]
 [ 12  56]]



#### h. XGBoost
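
Recent XGBoost releases expect class labels to be consecutive integers starting at 0, which is why this model remaps 2/4 to 0/1 before training while the scikit-learn models use the labels unchanged. The sketch below shows the mapping and its inverse on a handful of made-up labels; `to_xgb` and `to_original` are hypothetical names used only here.

```python
# Forward mapping used before training: benign 2 -> 0, malignant 4 -> 1.
to_xgb = {2: 0, 4: 1}
# Inverse mapping, handy for reporting predictions with the original labels.
to_original = {encoded: label for label, encoded in to_xgb.items()}

labels = [2, 4, 4, 2, 2]  # made-up example labels
encoded = [to_xgb[label] for label in labels]
decoded = [to_original[value] for value in encoded]

print(encoded)  # [0, 1, 1, 0, 0]
print(decoded)  # [2, 4, 4, 2, 2]
```

Because benign keeps the smaller label, the confusion matrix has the same row and column order as for the other models.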

import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
columns = ['Sample_code_number', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 
           'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 
           'Normal Nucleoli', 'Mitoses', 'Class']

data = pd.read_csv(url, header=None, names=columns)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
data = data.dropna()

# Change class labels from 2 and 4 to 0 and 1
data['Class'] = data['Class'].replace({2: 0, 4: 1})

X = data.drop(['Sample_code_number', 'Class'], axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and train the model (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
model = XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("XGBoost Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)

XGBoost Accuracy: 0.9532163742690059
Confusion Matrix:
 [[102   1]
 [  7  61]]
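
The assignment asks for the accuracy and confusion matrix of each model in one table. One possible way to assemble it, using only the confusion matrices printed above (no retraining), is sketched below. The `results` dictionary and the table layout are assumptions of this sketch. Note also that the Decision Tree and Random Forest cells set no `random_state`, so their entries come from one particular run.

```python
# Confusion matrices as printed in this notebook (rows: true, columns: predicted).
results = {
    "Logistic Regression": [[102, 1], [7, 61]],
    "KNN (k = 5)": [[102, 1], [8, 60]],
    "Linear SVM": [[102, 1], [7, 61]],
    "Kernel SVM (rbf)": [[101, 2], [7, 61]],
    "Naive Bayes": [[100, 3], [3, 65]],
    "Decision Tree": [[101, 2], [8, 60]],
    "Random Forest (10)": [[102, 1], [12, 56]],
    "XGBoost": [[102, 1], [7, 61]],
}


def accuracy(cm):
    """Correct predictions (the diagonal) divided by all test samples."""
    return (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)


# Sort from most to least accurate; ties keep their original order.
table = sorted(
    ((name, accuracy(cm), cm) for name, cm in results.items()),
    key=lambda row: row[1],
    reverse=True,
)

print(f"{'Model':<22}{'Accuracy':>9}  Confusion matrix")
for name, acc, cm in table:
    print(f"{name:<22}{acc:>9.4f}  {cm}")
```

On this split, Naive Bayes comes out on top (0.9649) and the 10-tree Random Forest at the bottom (0.9240); the recomputed accuracies match the values printed by each cell.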
