<a href="https://colab.research.google.com/github/chefs-kiss/ML_J2026/blob/main/PA5_Trees_with_BreastCancer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name:


Who you worked with:

## Objectives
The goals of this project are to:
- Implement multiple tree-based models
- Tune models to find the best combination of hyperparameters

## Overview
For this programming assignment, you will be working with the Wisconsin Breast Cancer dataset. This dataset contains various measurements of the size and shape of tumors, along with the diagnosis of each tumor as benign or malignant. Further information on this dataset can be found here: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic). Our goal is to train a tree-based model that can use the measurements of a tumor to diagnose it as benign or malignant.

## Schedule
Here is the suggested schedule for working on this project:
- Weekend: Read through project instructions, run code for Task 0.
- Tuesday: Complete Tasks 1-2.
- Wednesday: Complete Task 3.
- Thursday: Complete Task 4.

This project is due on Thursday, 3/20, by 11:59pm.


# Task 0: Data

We start by loading the dataset into a pandas dataframe.

In [None]:
import pandas as pd

cancer_df = pd.read_csv("https://github.com/lynn0032/MLCamp2021/raw/main/breast_cancer.csv")

Next, we look at a summary of the data. Note that not every column is shown in this summary.

In [None]:
cancer_df.describe(include = "all")

We can use shape to check the size of the data. Here, we see that we have 569 samples (for 569 tumors), and 33 attributes.

In [None]:
cancer_df.shape

Normally, we'd do exploratory data analysis here to understand the distribution of the dataset. For this programming assignment, however, we're going to focus on evaluating a classifier, so we will skip that step.

There are two columns that we won't use for prediction, so we drop those columns.

In [None]:
cancer_df.drop(["id", "Unnamed: 32"], axis=1, inplace = True)

cancer_df.describe(include="all")

Next, we separate the dataset into features and target. Remember that the diagnosis is the target, and it takes the values "M" (for malignant) and "B" (for benign).

In [None]:
features = list(cancer_df.columns)
features.remove("diagnosis")

X = cancer_df[features].values
y = cancer_df["diagnosis"].values

# Task 1: Evaluation for multiple random testing-training splits

First, we import the functions that we'll use to train and evaluate models.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

To see how a decision tree (with default parameter values) performs on this data, we will generate random training-testing splits in repeated trials and average the performance. You will want to reference the workbook from Friday to figure out where to put the sklearn methods.
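As a reminder of how random splitting behaves, here is a minimal sketch on made-up data (the names `X_toy`, `y_toy`, and the 50/50 split are assumptions chosen for illustration, not the values this assignment asks for). It shows that `test_size` controls the proportion held out, and that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples with 2 features each, alternating labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["B", "M"] * 10)

# test_size=0.5 holds out half of the samples for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)
print("Training shape:", X_tr.shape)
print("Testing shape:", X_te.shape)

# Using the same random_state again gives exactly the same split.
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0
)
print("Same split reproduced:", (X_te == X_te_again).all())
```

Changing `random_state` from trial to trial is what makes each trial see a different split.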

## 💻 Q1: Complete the code

*   We will run 10 trials, generating 10 different training-testing splits.
*   For each trial, we generate a new training-testing split, train a model on the training set, and evaluate it on the testing set.
*   Each training-testing split should use 75% of the data for training (then how much is used for testing?), and for a random state, use the value of the variable `iter` (giving the trial number).
*   Create a decision tree classifier with random state 189, and train it on the training data.
*   Make predictions for the testing set
*   Compute the accuracy, precision, and recall for your model on the testing data, and append these to the lists storing these results. Use "M" as the positive class.
*   For precision and recall, you will need to use an optional parameter, `pos_label`, to specify the label for the positive class ("M").
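To illustrate why `pos_label` matters, here is a small sketch on hand-made labels (the lists `y_true_toy` and `y_pred_toy` are invented for this example). The default positive label in `precision_score` and `recall_score` is `1`, which does not match string labels such as "M" and "B", so we name the positive class explicitly.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 4 malignant and 4 benign tumors.
y_true_toy = ["M", "M", "M", "M", "B", "B", "B", "B"]
y_pred_toy = ["M", "M", "M", "B", "M", "M", "B", "B"]

# With "M" as the positive class:
#   true positives = 3, false negatives = 1, false positives = 2
prec_m = precision_score(y_true_toy, y_pred_toy, pos_label="M")  # 3 / (3 + 2)
rec_m = recall_score(y_true_toy, y_pred_toy, pos_label="M")      # 3 / (3 + 1)

# With "B" as the positive class, the same predictions score differently:
#   true positives = 2, false negatives = 2, false positives = 1
prec_b = precision_score(y_true_toy, y_pred_toy, pos_label="B")  # 2 / (2 + 1)
rec_b = recall_score(y_true_toy, y_pred_toy, pos_label="B")      # 2 / (2 + 2)

print("Positive = M:", prec_m, rec_m)
print("Positive = B:", prec_b, rec_b)
```

The same predictions give different precision and recall depending on which class is treated as positive, so the choice should reflect the outcome we care most about detecting.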

In [None]:
num_trials =   # your code here

accuracy = []
precision = []
recall = []

for iter in range(num_trials):
  # perform a random split
  # your code here


  # train classifier on the training data
  # your code here


  # make predictions on the testing data
  # your code here


  # compute and store the performance metrics
  # your code here


# report the performance
print("Average Accuracy:", sum(accuracy)/len(accuracy))
print("Average Precision:", sum(precision)/len(precision))
print("Average Recall:", sum(recall)/len(recall))
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

In [None]:
# TEST CELL - DO NOT CHANGE - RUN THIS CELL TO CHECK YOUR WORK
# Note: passing this cell doesn't guarantee your code is correct or that you will get full credit,
# but should be used to help you check your work
assert(accuracy[3:6] == [0.9300699300699301, 0.8951048951048951, 0.958041958041958])
assert(precision[7:10] == [0.9285714285714286, 0.8928571428571429, 0.9411764705882353])
assert(recall[-3:] == [0.8666666666666667, 0.8620689655172413, 0.9056603773584906])
print("Tests passed, but be sure to test your own code as well!")

## ✏ Q2: Interpreting Your Results

In the cell below, for precision and recall, write one sentence each describing what these mean in the context of the models and dataset.

As an example, here is such a sentence for accuracy: Of all the tumors in the testing sets, our models classified an average of 91.8% correctly as malignant or benign.

your answers here

# Task 2: Evaluation

For this task, you will complete the definition of an evaluation function that can be used to evaluate the performance of a decision tree with specified values for the parameters `max_depth` and `min_samples_leaf`. We will use stratified $k$-fold cross-validation, to help ensure that our results are reliable.


## 💻 Q3: Complete the code

*   Use stratified $k$-fold cross-validation, with $k=5$. Remember to shuffle the data, and use a random state of 7.
*   For each split, you will train a decision tree classifier with the values of `max_depth` and `min_samples_leaf` given by the function parameters. Use random state 7.
*   For each model trained, you will compute and store the accuracy on the testing set and the accuracy on the training set. The function returns the average testing accuracy and the average training accuracy.
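To see what "stratified" means in practice, here is a minimal sketch on made-up data (the names `X_toy`, `y_toy`, and `skf_toy`, and the choice of 2 folds, are assumptions for illustration only). Each testing fold keeps the same class proportions as the full dataset, and the index arrays returned by `split` can be used to select rows from NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 10 samples, 60% benign and 40% malignant.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array(["B"] * 6 + ["M"] * 4)

skf_toy = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    # Index arrays select the rows that belong to each side of the split.
    y_test_fold = y_toy[test_idx]
    n_benign = int(np.sum(y_test_fold == "B"))
    n_malignant = int(np.sum(y_test_fold == "M"))
    fold_counts.append((n_benign, n_malignant))
    print("Testing fold: benign =", n_benign, ", malignant =", n_malignant)
```

Each testing fold here contains 3 benign and 2 malignant samples, matching the 60/40 balance of the whole toy dataset.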

In [None]:
from sklearn.model_selection import StratifiedKFold

def evaluate(max_depth_value, min_samples_leaf_value):
  # Use stratified 5-fold evaluation to generate splits
  # your code here


  test_accuracy = []
  train_accuracy = []

  # for each split
  for train_index, test_index in skf.split(X, y):
    # Get training and testing sets
    X_train, X_test = # your code here
    y_train, y_test = # your code here

    # Train a decision tree classifier on the training set
    # your code here


    # Make predictions on the testing set
    y_test_pred = # your code here

    # Make predictions on the training set
    y_train_pred = # your code here

    # Evaluate the performance
    test_accuracy.append(accuracy_score(y_test, y_test_pred))
    train_accuracy.append(accuracy_score(y_train, y_train_pred))

  return sum(test_accuracy)/len(test_accuracy), sum(train_accuracy)/len(train_accuracy)

In [None]:
# TEST CELL - DO NOT CHANGE - RUN THIS CELL TO CHECK YOUR WORK
# Note: passing this cell doesn't guarantee your code is correct or that you will get full credit,
# but should be used to help you check your work
assert(evaluate(10,2) == (0.9227138643067848, 0.9890138808559861))
assert(evaluate(4,10) == (0.9139108834031984, 0.9617765567765568))
assert(evaluate(4,1) == (0.9262381617761217, 0.9859417775207249))
assert(evaluate(None,20) == (0.9104642136314236, 0.945080971659919))
print("Tests passed, but be sure to test your own code as well!")

# Task 3: Tuning

## Tuning `max_depth`

Now, we will use the evaluate function to tune the parameter `max_depth`, with the default value for `min_samples_leaf` (which is 1).

The code is provided for you, but you will need to interpret the results.



In [None]:
test_accuracies = []
train_accuracies = []

for max_depth in range(1,20):
  test_acc, train_acc = evaluate(max_depth, 1)
  test_accuracies.append(test_acc)
  train_accuracies.append(train_acc)

  print("Max depth", max_depth)
  print("\tTesting Accuracy:", test_acc)
  print("\tTraining Accuracy:", train_acc)

In [None]:
import matplotlib.pyplot as plt

test = plt.plot(range(1,20), test_accuracies, label = "Testing")
train = plt.plot(range(1,20), train_accuracies, label = "Training")
plt.ylim(0, 1)      # to "zoom in", you can delete this line
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
plt.show()

## ✏ Q4: Describe `max_depth` results


Be sure to discuss overfitting vs. underfitting, as well as which value for `max_depth` you think is best and why.

your answer here

## Tuning `min_samples_leaf`

Next, we will use the evaluate function to tune the parameter `min_samples_leaf`, with the default value for `max_depth` (which is `None`). The code is provided for you, but you will need to interpret the results.

In [None]:
test_accuracies = []
train_accuracies = []

for min_samples_leaf in range(1,100):
  test_acc, train_acc = evaluate(None, min_samples_leaf)
  test_accuracies.append(test_acc)
  train_accuracies.append(train_acc)

  print("Min samples leaf", min_samples_leaf)
  print("\tTesting Accuracy:", test_acc)
  print("\tTraining Accuracy:", train_acc)

In [None]:
import matplotlib.pyplot as plt

test = plt.plot(range(1,100), test_accuracies, label = "Testing")
train = plt.plot(range(1,100), train_accuracies, label = "Training")
plt.ylim(0, 1)      # to "zoom in", you can delete this line
plt.xlabel('min_samples_leaf')
plt.ylabel('accuracy')
plt.legend()
plt.show()

## ✏ Q5: Describe `min_samples_leaf` results

Be sure to discuss overfitting vs. underfitting, as well as which value for `min_samples_leaf` you think is best and why.

your answer here

## 💻 Q6: Best Combination

Tuning parameters well typically takes more effort than tuning each parameter individually. What we really want is the best combination of parameters, so we need to explore them together.

One way to do this is with a **grid search**, where we explore a grid of possible parameter values. In the cell below, you will conduct a grid search to find the best combination of `max_depth` and `min_samples_leaf`.

*   Using nested `for` loops, use the function `evaluate` to test all combinations of `max_depth` and `min_samples_leaf`, where `max_depth` ranges from 1 through 10, and `min_samples_leaf` ranges from 1 through 50 (can you see how I chose these ranges from the results for the individual parameters above?).
*   Remember that the function `evaluate` returns both the testing accuracy and training accuracy. For simplicity, we will just look for the parameters with the **best testing accuracy**.
*   For the parameters with the best testing accuracy, store their values in the tuple `parameters` (with `max_depth` first, followed by `min_samples_leaf`). Store their testing accuracy and training accuracy (as found by `evaluate`) in the tuple `best_metrics`.

The grid search should take a couple of minutes to run. Because it can be time-consuming, it is not always feasible to run a grid search for big datasets with complex models.
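The general pattern of a grid search can be sketched on a toy problem. Here, `toy_score` is a hypothetical stand-in for a model evaluation, and the parameter names `a` and `b` and their ranges are invented for illustration. The sketch tries every combination and keeps track of the best one seen so far.

```python
def toy_score(a, b):
    # Hypothetical scoring function standing in for a model evaluation.
    # By construction, it is highest at a = 3, b = 4.
    return 1.0 - 0.01 * ((a - 3) ** 2 + (b - 4) ** 2)

best_score = float("-inf")
best_params = None

# Try every combination of the two parameters.
for a in range(1, 6):
    for b in range(1, 8):
        score = toy_score(a, b)
        # A strict ">" keeps the first combination found when scores tie.
        if score > best_score:
            best_score = score
            best_params = (a, b)

print("Best parameters:", best_params)
print("Best score:", best_score)
```

The cost grows with the product of the range sizes. For the ranges in this task, the grid has 10 × 50 = 500 combinations, and each call to `evaluate` trains 5 trees (one per fold), so about 2,500 trees are fit in total. Note also that when several combinations tie for the best score, the comparison you use determines which one is kept.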

In [None]:
best_metrics = (0,0)
parameters = (0,0)

# your code here


print("Best max_depth:", parameters[0])
print("Best min_samples_leaf:", parameters[1])
print("Testing Accuracy:", best_metrics[0])
print("Training Accuracy:", best_metrics[1])

In [None]:
# TEST CELL - DO NOT CHANGE - RUN THIS CELL TO CHECK YOUR WORK
# Note: passing this cell doesn't guarantee your code is correct or that you will get full credit,
# but should be used to help you check your work
assert(best_metrics == (0.9315013196708584, 0.9578214767688452))
print("Tests passed, but be sure to test your own code as well!")

# Task 4: Final Model + Reflection


## Training a final model

Now that we have identified the optimal parameter values, we will train a final model on all of the data, using those parameter values. The idea is that this final model would be used to diagnose tumors.

Although we were able to find the optimal parameter values for decision trees, there are other machine learning models that can have even better performance on this dataset. Unfortunately, we don't have time to cover more in this course.

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=parameters[0], min_samples_leaf=parameters[1], random_state = 0)
clf.fit(X, y)

Next, we visualize the decision tree that we produced, so that we can see how it makes decisions.

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names = features, class_names = ["Benign", "Malignant"])
plt.show()
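When reading the plotted tree, the left branch of each node is taken when the node's condition is true, and the right branch when it is false. As a complement to the plot, here is a small sketch on invented data (the feature names `size` and `texture`, the arrays, and `toy_clf` are all assumptions for illustration) showing that a fitted tree stores its class labels in sorted order in `classes_`, and that `export_text` prints the same decision rules as text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): small "size" values are benign, large are malignant.
# The "texture" column carries no information about the label.
X_toy = np.array([[1, 5], [2, 9], [3, 5], [10, 9], [11, 5], [12, 9]])
y_toy = np.array(["B", "B", "B", "M", "M", "M"])

toy_clf = DecisionTreeClassifier(random_state=0)
toy_clf.fit(X_toy, y_toy)

# Class labels are stored in sorted order, which is the order
# that class_names in plot_tree must follow.
print("Classes:", list(toy_clf.classes_))

# Text version of the tree: each indented line is one step along a path.
rules = export_text(toy_clf, feature_names=["size", "texture"])
print(rules)

# Classify a new (made-up) record by following the rules.
new_pred = toy_clf.predict(np.array([[2.5, 7]]))
print("Prediction:", new_pred[0])
```

Tracing a record by hand means starting at the root, checking each condition against the record's values, and following the matching branch until a leaf is reached.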

## ✏ Q7: New record 1

Suppose we encounter a new tumor, with values:
*   `area_se = 35.021`
*   `concave_points_worst = 0.100`
*   `concavity_worst = 0.238`
*   `radius_worst = 17.106`
*   `texture_mean = 15.987`
*   `texture_worst = 26.832`

How would the decision tree model classify this tumor? Explain your answer by describing the path this datapoint takes through the tree.

your answer here

## ✏ Q8: New record 2

Suppose we encounter a new tumor, with values:
*   `area_se = 39.542`
*   `concave_points_worst = 0.154`
*   `concavity_worst = 0.221`
*   `radius_worst = 16.234`
*   `texture_mean = 16.785`
*   `texture_worst = 25.979`

How would the decision tree model classify this tumor? Explain your answer by describing the path this datapoint takes through the tree.

your answer here

## ✏ Q9: Reflection

What did you like about this assignment? What could be improved? Your answers will not affect your overall grade. This feedback will be used to improve future programming assignments.

# Grading

The points for the following accomplishments total 20. The fraction of points earned out of 20 will be multiplied by 5 to get your final score (e.g., 17 points earned will be 17/20 * 5 → 4.25).
* (1pt) Task1 Q1: Uses correct training-testing splits
* (1pt) Task1 Q1: Trains a decision tree on the training data
* (1pt) Task1 Q1: Predicts on correct set
* (1pt) Task1 Q1: Finds all three evaluation metrics and appends to list
* (2pt) Task1 Q2: Correctly interprets precision and recall results
* (1pt) Task2 Q3: Uses stratified fold correctly
* (1pt) Task2 Q3: Finds test and training sets
* (1pt) Task2 Q3: Trains and fits a decision tree
* (1pt) Task2 Q3: Makes predictions
* (1pt) Task3 Q4: Describes max_depth results including over/under fitting and best value
* (1pt) Task3 Q5: Describes min_samples_leaf results including over/under fitting and best value
* (2pt) Task3 Q6: Correctly uses `for` loops to loop through hyperparameters
* (1pt) Task3 Q6: Uses the `evaluate` function from Task 2
* (2pt) Task3 Q6: Updates best model appropriately
* (1pt) Task4 Q7: Correctly classifies new tumor using decision tree path
* (1pt) Task4 Q8: Correctly classifies new tumor using decision tree path
* (1pt) Task4 Q9: Thoughtfully reflected on the assignment