<i>Modified from the file written by Ahsan Khan on behalf of Alberta Machine Intelligence Institute for the AI Pathways Partnership supported by Prairies Economic Development Canada</i>

---

**Important Note:**

Please do not alter any part of this notebook outside the designated text cells that are clearly marked with "*Start student input* ↓" and "*End student input ↑*". Changes made outside these specified areas could lead to incorrect evaluations of your work, potentially affecting your lab scores.

Ensure you complete all activities within these sections, which are indicated by labels like **[A1]**, **[A2]**, **[A3]**, ... Each activity is crucial for the successful completion of this lab. Additionally, please name your variables exactly as specified in the instructions (if specified) to ensure that your answers are correctly assessed.

Make sure to fill in your name in the box below. Also modify the file name to reflect your name inside the parentheses. For example, if this file is named `Lab x - Topic (Student).ipynb`, rename it to `Lab x - Topic (First Last).ipynb`, where `First` is your first name and `Last` is your last name.

---


**Student name**: First Last

# Lab 2: Decision Trees

Following on from the k-NN classifier, you will now be introduced to the Decision Tree (DT) classifier. DTs, like k-NN, naturally handle non-linear, multi-class data. Unlike k-NN, the algorithm is not distance-based; instead, it makes a sequence of binary decisions at the internal nodes of the tree to arrive at a prediction at a leaf. For this lab, you will use the Breast Cancer Wisconsin (Diagnostic) dataset again.
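To make the idea of "a sequence of binary decisions" concrete, here is a minimal sketch of how a prediction travels from the root of a tree to a leaf. The tree below is hand-built for illustration only: the feature names `x0` and `x1`, the thresholds, and the class labels `"A"` and `"B"` are all made up, and this is not how scikit-learn stores its trees internally.

```python
# A hand-built toy tree (hypothetical features, thresholds, and labels).
# Internal nodes hold a feature and a threshold; leaves hold a class label.
toy_tree = {
    "feature": "x0", "threshold": 15.0,
    "left": {"leaf": "A"},
    "right": {
        "feature": "x1", "threshold": 20.0,
        "left": {"leaf": "A"},
        "right": {"leaf": "B"},
    },
}

def predict_one(node, sample):
    """Walk from the root to a leaf, making one binary decision per internal node."""
    while "leaf" not in node:
        # Go left if the feature value is at or below the threshold, otherwise go right
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

pred_small = predict_one(toy_tree, {"x0": 12.0, "x1": 25.0})  # one decision, then a leaf
pred_large = predict_one(toy_tree, {"x0": 18.0, "x1": 25.0})  # two decisions, then a leaf
print(pred_small, pred_large)
```

Notice that each decision only compares one feature against one threshold; no distances between samples are ever computed.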

In [None]:
# Crucial data processing and analysis libraries
import numpy as np
import pandas as pd

# Loading the modules required to build and evaluate a DT classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Loading the Breast Cancer Wisconsin (Diagnostic) dataset from sklearn
from sklearn.datasets import load_breast_cancer

##### Load the data into a dataframe the same way you did in lab 1.

In [None]:
# Loading the data
breast_cancer = load_breast_cancer()

# There are three key parts to the dataset we care about
# The feature data, X
X = breast_cancer.data
display(X)

In [None]:
# The target classes, y
y = breast_cancer.target
display(y)

In [None]:
# The feature names
display(breast_cancer.feature_names)

##### This is what the data looks like in a pandas dataframe

In [None]:
df = pd.DataFrame(X, columns=breast_cancer['feature_names'])
df['class'] = y


print(f"Number of rows in the data: {df.shape[0]}")
df.head()

##### It is always a good idea to look at some summary statistics of the dataset to get an understanding of it.

In [None]:
df.describe()

# Lab activity: the decision tree classifier

##### **[A1]**  
Split your data into training and validation sets. Name the resulting variables ``X_train``, ``y_train``, ``X_val``, and ``y_val``.
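As a reminder of how `train_test_split` behaves, here is a small sketch on made-up arrays (`X_toy`, `y_toy`, and the output names are hypothetical and are not the lab data). Pay attention to the order of the returned values: both feature splits come first, followed by both target splits. The `test_size` and `random_state` values are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, and alternating labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Returned order: features (train, held-out), then targets (train, held-out)
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

print(X_a.shape, X_b.shape, y_a.shape, y_b.shape)
```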

*Start student input* ↓

In [None]:
# Put your code here.

*End student input ↑*

##### **[A2]**
Instantiate a decision tree classifier called `dt`. For now, you do not need to set any hyperparameters or other class constructor arguments.

*Start student input* ↓

In [None]:
# Put your code here.

*End student input ↑*

##### **[A3]**
Fit the decision tree classifier to your training data.

*Start student input* ↓

In [None]:
# Put your code here.

*End student input ↑*

##### **[A4]**
Predict on your validation data.

*Start student input* ↓

In [None]:
# Put your code here.

*End student input ↑*

##### **[A5]**
Evaluate model performance using the `accuracy_score()` and `classification_report()` functions (you can find the documentation for `classification_report` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)).
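If you are unsure how these two functions are called, the sketch below applies them to made-up label lists (`y_true_toy` and `y_pred_toy` are hypothetical and unrelated to the lab data). Both functions take the true labels first and the predicted labels second.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels: 4 of the 6 predictions match the true labels
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

# Fraction of predictions that match the true labels
toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(f"Accuracy: {toy_accuracy:.3f}")

# Per-class precision, recall, and F1-score, returned as formatted text
toy_report = classification_report(y_true_toy, y_pred_toy)
print(toy_report)
```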

*Start student input* ↓

In [None]:
# Put your code here.

*End student input ↑*

If you go back to your lab 1 notebook, you may notice that the accuracy of your first k-NN model was lower than the accuracy of the first DT model here. Recall that you had to scale your data afterwards to achieve a decent accuracy score with the k-NN classifier.

##### **[A6]**
Scale your dataset using `StandardScaler()`: fit the scaler on `X_train` only, then use those fitted parameters to transform both `X_train` and `X_val`. Next, fit a new model `dt2` to the scaled data. Finally, evaluate its accuracy score.
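The sketch below shows what "parameters set according to the training data" means, using made-up arrays (`train_toy` and `val_toy` are hypothetical). The scaler learns each column's mean and standard deviation from the training rows only, and the same learned values are then reused on the validation rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
val_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_toy)  # learn mean/std from training rows, then scale them
val_scaled = scaler.transform(val_toy)          # reuse the training mean/std; no refitting

print(scaler.mean_)                # per-column means learned from train_toy
print(train_scaled.mean(axis=0))   # approximately zero for each column
print(val_scaled)
```

Calling `fit_transform` on the validation data instead would leak information about the validation set into the preprocessing step.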

*Start student input* ↓

In [None]:
# Put your code here.

*End student input ↑*

##### **[A7]**
Based on your results for the two accuracy evaluations above, do you need to scale your data for a DT classifier? Explain.

*Start student input* ↓

*Put your explanation here.*

*End student input ↑*