# Basic Supervised Model Implementation for Patient Risk Scoring

Time estimate: **45** minutes

## Objectives

After completing this lab, you will be able to:

 - Prepare and split a clinical dataset for patient risk prediction
 - Train and evaluate a Random Forest model using key performance metrics
 - Interpret and visualize model results to assess clinical relevance

## What you will do in this lab

In this lab, you will work with a real clinical dataset to build and test a simple patient risk prediction model.

You will:

- Explore the dataset and identify key features for prediction
- Split the data into training and test sets to prepare for modeling
- Train a Random Forest model to predict patient risk
- Evaluate how well the model performs using key metrics and visualizations

## Overview

In healthcare, predicting patient risk is not just a technical exercise. It supports real clinical decisions. A good model helps identify who might need early intervention or additional monitoring, improving patient outcomes.

You will use a Random Forest classifier for this task because it is well suited to healthcare data, which often involves many interdependent features and complex relationships. Instead of relying on a single decision tree that might overfit, a Random Forest builds many trees and aggregates their predictions. This ensemble approach reduces variance, improves generalization, and provides insights into which features influence predictions the most.

Evaluating the model goes beyond accuracy. Metrics like sensitivity and specificity tell you whether the model correctly identifies high-risk patients and avoids false alarms. The confusion matrix helps visualize these trade-offs, showing whether the model’s predictions align with clinical priorities.

## About the dataset

In this lab, you will use a dataset based on the Wisconsin Breast Cancer Dataset.

### Dataset overview
The Wisconsin Breast Cancer Dataset is a comprehensive collection of diagnostic measurements derived from digitized images of fine needle aspirate (FNA) of breast masses. This dataset is widely used in machine learning and medical research for binary classification tasks to predict whether a breast mass is benign or malignant.

### Column descriptions

1. **ID Number** - Unique identification number assigned to each patient sample for tracking and reference purposes

2. **Diagnosis** - Binary classification of the tumor: M (Malignant) or B (Benign), representing the target variable for prediction

3. **Radius Mean** - Mean of distances from center to points on the perimeter of the cell nucleus, measured in micrometers

4. **Texture Mean** - Mean of standard deviation of gray-scale values in the nucleus image, representing surface texture variation

5. **Perimeter Mean** - Mean perimeter measurement of the cell nucleus boundary, calculated in micrometers

6. **Area Mean** - Mean area of the cell nucleus, measured in square micrometers

7. **Smoothness Mean** - Mean of local variation in radius lengths, indicating the smoothness of the nucleus contour

8. **Compactness Mean** - Mean compactness calculated as (perimeter² / area - 1.0), describing the shape regularity of the nucleus

9. **Concavity Mean** - Mean severity of concave portions of the nucleus contour, representing indentations in the cell boundary

10. **Concave Points Mean** - Mean number of concave portions of the nucleus contour, counting the frequency of indentations

11. **Symmetry Mean** - Mean symmetry measurement of the cell nucleus, assessing bilateral similarity

12. **Fractal Dimension Mean** - Mean fractal dimension calculated using "coastline approximation" method, representing complexity of the nucleus boundary

13. **Radius SE** - Standard error of distances from center to points on the perimeter, indicating measurement variability

14. **Texture SE** - Standard error of gray-scale value standard deviations, representing texture measurement precision

15. **Perimeter SE** - Standard error of perimeter measurements, indicating boundary measurement variability

16. **Area SE** - Standard error of area measurements, representing variability in nucleus size calculations

17. **Smoothness SE** - Standard error of local radius length variations, indicating smoothness measurement precision

18. **Compactness SE** - Standard error of compactness values, representing variability in shape regularity measurements

19. **Concavity SE** - Standard error of concavity measurements, indicating precision of contour indentation assessments

20. **Concave Points SE** - Standard error of concave point counts, representing variability in indentation frequency measurements

21. **Symmetry SE** - Standard error of symmetry measurements, indicating precision of bilateral similarity assessments

22. **Fractal Dimension SE** - Standard error of fractal dimension calculations, representing variability in boundary complexity measurements

23. **Radius Worst** - Largest (worst) mean value for radius among all cells in the sample, representing maximum nucleus size

24. **Texture Worst** - Largest (worst) mean value for texture among all cells in the sample, representing maximum surface variation

25. **Perimeter Worst** - Largest (worst) mean value for perimeter among all cells in the sample, representing maximum boundary length

26. **Area Worst** - Largest (worst) mean value for area among all cells in the sample, representing maximum nucleus coverage

27. **Smoothness Worst** - Largest (worst) mean value for smoothness among all cells in the sample, representing maximum contour irregularity

28. **Compactness Worst** - Largest (worst) mean value for compactness among all cells in the sample, representing maximum shape irregularity

29. **Concavity Worst** - Largest (worst) mean value for concavity among all cells in the sample, representing maximum contour indentation severity

30. **Concave Points Worst** - Largest (worst) mean value for concave points among all cells in the sample, representing maximum indentation frequency

31. **Symmetry Worst** - Largest (worst) mean value for symmetry among all cells in the sample, representing maximum bilateral asymmetry

32. **Fractal Dimension Worst** - Largest (worst) mean value for fractal dimension among all cells in the sample, representing maximum boundary complexity


## Setup


### Installing required libraries

The following libraries are required to run this lab. 

In [None]:
# Install the Libraries required for this lab.
!pip install pandas
!pip install scikit-learn
!pip install matplotlib
!pip install seaborn


In [None]:
# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

### Importing required libraries


In [None]:
# Import pandas for data manipulation (like Excel for Python)
import pandas as pd
# Import numpy for numerical operations
import numpy as np
# Import machine learning tools from scikit-learn
from sklearn.model_selection import train_test_split
# Import our machine learning algorithm
from sklearn.ensemble import RandomForestClassifier
# Import evaluation metrics for measuring model performance
from sklearn.metrics import (
    accuracy_score, confusion_matrix, classification_report, roc_curve, auc
)
# Import matplotlib for visualization
import matplotlib.pyplot as plt
# Import seaborn for statistical data visualizations
import seaborn as sns

# Import recall_score for calculating sensitivity
from sklearn.metrics import recall_score


print("All libraries imported successfully!")
print("Ready to begin breast cancer classification analysis.")

## Step 1: Load the data from a CSV file into a DataFrame




In [None]:
df = pd.read_csv("https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab1/breast_cancer.csv")

Display top 5 rows from the dataset.


In [None]:
df.head()

Let's find out the number of rows and columns in the dataset:


In [None]:
# Display the number of rows and columns in the dataset
print('Number of rows:', df.shape[0])
print('Number of columns:', df.shape[1])

### Basic information of all features
Let's look at summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column. This is similar to reviewing the typical range of each measurement across all patients in a medical study.


In [None]:
# Display basic statistical information about each feature
print("=== BASIC STATISTICS FOR ALL FEATURES ===")
df.describe()

In [None]:
df.columns

## Step 2: Identify input features (X) and target variable (y)
In medical prediction, you separate the data into two parts:

- Input features (X): the medical measurements you use to make predictions (such as symptoms and test results)

- Target variable (y): what you want to predict (here, the diagnosis)

First, decide the target (y).

The `diagnosis` column is the natural target. It is binary: B = benign (no cancer), M = malignant (cancerous tumor).

Next, define the input features (X).

All the tumor measurements will be the input features, for example:

'radius_mean', 'texture_mean', 'perimeter_mean','area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean','concave points_mean', 'symmetry_mean' etc.

X = all measurement columns (30 features), that is, every column except `diagnosis` and `id`.

You remove `id` because identifier columns (such as a patient's name, ID, email address, or phone number) carry no information about the tumor. Including them can lead the model to pick up spurious patterns.

y = the binary target `diagnosis`.

Features are the values the machine learning model learns from. The next cell defines both X and y.


In [None]:
#Define features and target
X = df.drop(columns=["id","diagnosis"])
y = df["diagnosis"]

## Step 3: Split the data into Train and Test sets

Training set → used to train the model

Testing set → used to evaluate the model's performance on unseen data

Example : If X has 100 patient records and y has 100 labels:

X_train → 80 rows

X_test → 20 rows

y_train → 80 rows

y_test → 20 rows

- random_state:

Controls randomness in operations like splitting data.

Why it matters: Many operations in machine learning involve randomness (e.g., shuffling data before splitting). Using a fixed random_state ensures the same split every time you run the code.

- stratify:

Think of it as keeping the proportions balanced when splitting the data.

- Purpose: Ensures the target variable distribution is preserved in both the train and test sets.
- Why it matters: If your dataset is imbalanced (e.g., 80% benign, 20% malignant), random splitting might accidentally put most malignant cases in either the train set or the test set, which can bias your model or its evaluation.
- Example: `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)`. Here, if 20% of the patients in the full dataset have a malignant tumor, then about 20% of the test set and about 20% of the training set will too. The class distribution stays similar across splits.
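
As a quick check of what `stratify` does, the sketch below splits a toy set of 100 labels and counts the classes in each split. The data (`X_toy`, `y_toy`) is made up for illustration and is not the lab dataset; the variable names are chosen so they do not overwrite the lab's own `X_train` and `X_test`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 80 benign ("B") and 20 malignant ("M")
y_toy = ["B"] * 80 + ["M"] * 20
X_toy = [[i] for i in range(100)]  # placeholder feature values

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Count how many samples of each class landed in each split
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("Train:", dict(train_counts))
print("Test: ", dict(test_counts))
```

Both splits keep the original 80/20 class ratio: 64 B and 16 M in the training split, 16 B and 4 M in the test split.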

The following cell splits your dataset into training and testing sets, which is a standard step before training a machine learning model.

In [None]:
# 4. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% of the records for testing
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y         # keep the B/M proportions the same in both sets
)

# Print number of samples
print("Number of training samples:", X_train.shape[0])
print("Number of test samples:", X_test.shape[0])

# Optional: Check number of features
print("Number of features:", X_train.shape[1])

# Check class distribution in train/test sets
print("\nTraining set class distribution:\n", y_train.value_counts())
print("\nTest set class distribution:\n", y_test.value_counts())


## Step 4: Create and train a Random Forest model
Now you will train the machine learning model using the training data. This is like teaching a medical diagnostic system using historical patient records and their known outcomes.

The model is created with the following parameters:

- n_estimators=100

The number of decision trees in the forest.

Each tree makes a prediction, and the forest combines them by voting. In scikit-learn, the forest averages the trees' predicted class probabilities and picks the most probable class.

More trees → better stability, but training takes longer.

- max_depth=None

This controls how deep each tree can grow.

None means trees keep growing until all leaves are pure (or until other stopping conditions are met).

If set smaller (e.g., max_depth=10), trees are shallower → less overfitting.

- random_state=42

A fixed random seed so results are reproducible.

If you don't set this, results may vary each time you run the code.

- n_jobs=-1

Tells scikit-learn to use all CPU cores available for parallel training.

This speeds up model training.
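
To make the "combine the trees" idea concrete, here is a toy sketch in plain Python. The per-tree probabilities are invented for illustration; a real forest computes them from the training data. It shows both a hard majority vote and the probability averaging that scikit-learn's `RandomForestClassifier` uses.

```python
from collections import Counter

# Hypothetical class probabilities from five trees for ONE patient,
# given as (P(benign), P(malignant)). Values are invented for illustration.
tree_probs = [
    (0.7, 0.3),
    (0.4, 0.6),
    (0.3, 0.7),
    (0.6, 0.4),
    (0.2, 0.8),
]
classes = ("B", "M")

# Hard majority vote: each tree casts one vote for its most likely class
votes = [classes[p.index(max(p))] for p in tree_probs]
hard_vote = Counter(votes).most_common(1)[0][0]

# Soft vote: average the probabilities across trees, then pick the largest
avg_probs = [sum(p[i] for p in tree_probs) / len(tree_probs) for i in range(2)]
soft_vote = classes[avg_probs.index(max(avg_probs))]

print("Votes:", votes)
print("Hard vote:", hard_vote)
print("Averaged probabilities:", [round(p, 2) for p in avg_probs])
print("Soft vote:", soft_vote)
```

Here both approaches predict M. When trees are grown until their leaves are pure (as with `max_depth=None`), each tree's probabilities are often 0 or 1, so averaging behaves much like a majority vote.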

In [None]:
# 5. Train Random Forest model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=None,        # let trees expand fully unless limited
    random_state=42,       # for reproducibility
    n_jobs=-1              # use all processors
)

model.fit(X_train, y_train)


## Step 5: Generate predictions and evaluate model performance
Your model is now trained. Time to evaluate the model.


In [None]:
# 6. Predictions using Random Forest
predicted_values = model.predict(X_test)               # Predicted class labels
original_values = y_test


### accuracy_score(y_true, y_pred)
Compares the true labels (here `original_values`, which is `y_test`) with the labels predicted by the Random Forest model (here `predicted_values`).

Calculates the accuracy: the proportion of correctly predicted patients.

Accuracy = Number of correct predictions / Total number of predictions
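
The formula can be checked by hand on a small example. The labels below are made up for illustration and use the same B/M coding as the dataset.

```python
# Toy labels (hypothetical), using the same B/M coding as the dataset
actual    = ["M", "B", "B", "M", "B", "B", "M", "B", "B", "B"]
predicted = ["M", "B", "B", "B", "B", "M", "M", "B", "B", "B"]

# Count the positions where the prediction matches the true label
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{correct} correct out of {len(actual)} -> accuracy = {accuracy:.2f}")
```

Eight of the ten predictions match, so the accuracy is 0.80. Note that the two errors are different kinds of mistake (one missed malignant case, one false alarm), which accuracy alone does not distinguish.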

### What classification_report does

classification_report is a function from scikit-learn.

It evaluates a classification model in more detail than just accuracy.

It shows key metrics for each class (here: B = benign, treated as low risk, and M = malignant, treated as high risk).

#### Precision, Recall, F1-Score, Support
- Precision: Of all patients predicted as high risk, how many were actually high risk? High precision → fewer false positives.
- Recall (Sensitivity): Of all actual high-risk patients, how many did the model correctly identify? High recall → fewer false negatives (important in healthcare).
- F1-Score: Harmonic mean of precision and recall, which balances both. Useful when data is imbalanced.
- Support: Number of actual instances of each class in y_test.

#### Why classification_report matters in patient risk scoring

- Accuracy alone can be misleading, especially if classes are imbalanced (e.g., fewer high-risk patients).

- Recall for high-risk patients is often more important clinically → you don't want to miss someone at risk.

- F1-score helps balance precision and recall to get a better overall measure.
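
As a worked example, the sketch below computes these metrics for the malignant class from a set of counts. The counts are hypothetical and chosen only to illustrate the arithmetic.

```python
# Hypothetical counts for the malignant ("M") class on a test set
tp = 40  # malignant, predicted malignant
fp = 2   # benign, predicted malignant (false alarm)
fn = 3   # malignant, predicted benign (missed case)

precision = tp / (tp + fp)   # share of predicted-malignant that are malignant
recall = tp / (tp + fn)      # share of actual-malignant that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = tp + fn            # number of actual malignant samples

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
print(f"Support:   {support}")
```

`classification_report` prints one such row per class, so the benign row would be computed the same way with benign treated as the positive class.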



In [None]:
# 7. Evaluation metrics
print("\n Model Accuracy:", accuracy_score(original_values, predicted_values))

print("\n Classification Report:\n", classification_report(original_values, predicted_values))

### Feature importances

Feature importance is a technique used to identify which input variables have the greatest influence on a predictive model's decisions. In the context of the Wisconsin Breast Cancer Dataset, feature importance scores reveal which cellular characteristics—such as radius, texture, or concavity—are most critical in distinguishing between benign and malignant tumors. By ranking features based on their contribution to the model's predictive accuracy, you can gain insights into the biological factors most indicative of cancer, improve model interpretability, and potentially reduce dimensionality by focusing on the most relevant variables. Higher importance scores indicate features that play a more significant role in the classification process.

In [None]:
# Create feature importances dataframe
feature_importances = pd.DataFrame({
    'feature': X_test.columns,
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)
feature_importances.head()

In [None]:
# Create horizontal bar chart
plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importances, y='feature', x='importance', palette='viridis')
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importances', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Step 6: Generate the confusion matrix
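`confusion_matrix` is from scikit-learn.
It compares the actual labels (`original_values`, which is `y_test`) with the predicted labels (`predicted_values`).

For binary classification it returns a 2×2 matrix. Rows are actual classes and columns are predicted classes, with labels in sorted order (B = benign, shown as Low Risk, then M = malignant, shown as High Risk), so the cells are [[TN, FP], [FN, TP]].

`sns.heatmap()` is from Seaborn and draws the matrix as a color-coded grid.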

Parameters:

cm → the confusion matrix data

annot=True → shows the numbers in each cell

fmt="d" → format numbers as integers

xticklabels / yticklabels → label rows and columns for clarity
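
Before plotting, it can help to see how the four cells map to TN, FP, FN, and TP. The sketch below uses a handful of made-up labels (not the lab's test set) and passes `labels=["B", "M"]` to fix the row and column order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical), coded like the dataset: B = benign, M = malignant
actual    = ["B", "B", "B", "B", "M", "M", "M", "B"]
predicted = ["B", "B", "M", "B", "M", "B", "M", "B"]

# Fixing the label order makes the layout explicit:
# rows = actual, columns = predicted, order = [B, M]
cm_toy = confusion_matrix(actual, predicted, labels=["B", "M"])

# Flatten row by row: [TN, FP] from the first row, [FN, TP] from the second
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With M as the positive class, the top-left cell counts benign tumors correctly predicted as benign, and the bottom-left cell counts malignant tumors the model missed.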

In [None]:
# Confusion matrix
cm = confusion_matrix(original_values, predicted_values)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["Low Risk", "High Risk"],
            yticklabels=["Low Risk", "High Risk"])
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

## Step 7: Calculate sensitivity and specificity
These two are super important in healthcare ML because they describe how well your model detects disease vs. avoids false alarms.

##### **Sensitivity** (a.k.a. Recall, True Positive Rate)

- Definition: Of all the actual positive (High Risk) patients, how many did the model correctly identify as positive?

- Intuition: How good is the test at catching sick patients?

- Healthcare example:

   - If 100 people actually have cancer, and the model correctly detects 95, then Sensitivity = 95%.

   - High Sensitivity = few missed cases (low false negatives).

##### **Specificity** (a.k.a. True Negative Rate)

- Definition: Of all the actual negative (Low Risk) patients, how many did the model correctly identify as negative?

- Intuition: How good is the test at ruling out healthy patients?

- Healthcare example:

   - If 100 people are healthy, and the model correctly says 90 are healthy, then Specificity = 90%.

   - High Specificity = few false alarms (low false positives).

-------------------------------

 **Putting them together**:

Sensitivity = “Don't miss sick patients.”

Specificity = “Don't wrongly label healthy patients as sick.”

### Example: Healthcare trade-off

Consider cancer screening with 1,000 patients:

- 100 patients have cancer (the actual positives)

- 900 patients are healthy (the actual negatives)

#### Case A: High Sensitivity (95%), Lower Specificity (80%)

The test correctly detects 95 of the 100 cancer patients. However, with 80% specificity it wrongly flags 20% of healthy patients.

That means 180 people are incorrectly told they may have cancer.

- Outcome: Almost no cancer patient is missed, but many healthy patients face unnecessary stress and extra tests.

#### Case B: High Specificity (98%), Lower Sensitivity (70%)

Test detects only 70/100 cancer patients.

But only 2% of healthy patients (18 people) are wrongly flagged.

- Outcome: Very few healthy people are worried unnecessarily, but 30 cancer patients are missed, which can be dangerous.

-------------------------------

**Takeaway**

High Sensitivity → Good for screening (you want to catch everyone at risk, even if it means false alarms).

High Specificity → Good for confirmation (you only want to say “yes” when you are very sure).

That's why in healthcare, screening tests are usually designed to be highly sensitive, and then follow-up diagnostic tests are designed to be highly specific.
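
The numbers in Case A and Case B can be reproduced with a few lines of arithmetic. The helper `screening_outcomes` below is a hypothetical function written for this illustration, not part of any library.

```python
def screening_outcomes(n_sick, n_healthy, sensitivity_pct, specificity_pct):
    """Return TP, FN, TN, FP counts for a test with the given rates (in %).

    Uses integer arithmetic; the rates here are chosen so that all
    counts come out as whole numbers.
    """
    tp = n_sick * sensitivity_pct // 100      # sick patients detected
    fn = n_sick - tp                          # sick patients missed
    tn = n_healthy * specificity_pct // 100   # healthy patients cleared
    fp = n_healthy - tn                       # healthy patients wrongly flagged
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

case_a = screening_outcomes(100, 900, sensitivity_pct=95, specificity_pct=80)
case_b = screening_outcomes(100, 900, sensitivity_pct=70, specificity_pct=98)

for name, case in (("Case A", case_a), ("Case B", case_b)):
    flagged = case["TP"] + case["FP"]
    print(f"{name}: {case} -> {case['TP']} of {flagged} flagged patients have cancer")
```

In Case A only 95 of the 275 flagged patients actually have cancer, while in Case B it is 70 of 88. This is the same trade-off seen from the precision side: the more sensitive test catches more cases but produces many more false alarms.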

### Print the sensitivity score



In [None]:
# Sensitivity (True Positive Rate) = TP / (TP + FN)
# It measures how well the model identifies positive cases (malignant tumors)
# The pos_label parameter specifies which class is considered positive
sensitivity = recall_score(original_values, predicted_values, pos_label='M')
# Print the sensitivity score
print(f"Sensitivity Score: {sensitivity:.4f}")
print(f"Sensitivity Percentage: {sensitivity * 100:.2f}%")
print("Sensitivity measures the model's ability to correctly identify malignant cases")

### Print the specificity score

In [None]:

# Specificity (True Negative Rate) = TN / (TN + FP)
# It measures how well the model identifies negative cases (benign tumors)
# Specificity is the recall of the negative class, so you use pos_label='B'
specificity = recall_score(original_values, predicted_values, pos_label='B')
# Print the specificity score
print(f"Specificity Score: {specificity:.4f}")
print(f"Specificity Percentage: {specificity * 100:.2f}%")
print("Specificity measures the model's ability to correctly identify benign cases")

# Exercises


Use the dataset **Parkinson_disease.csv** (https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab1/Parkinson_disease.csv) for this exercise.

## Exercise 1: Load a dataset (Parkinson) and display first 5 rows


In [None]:
# df = # your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Use the read_csv function, shape, and head()

</details>


<details>
    <summary>Click here for solution</summary>

```python
df = pd.read_csv("https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab1/Parkinson_disease.csv")
print("Rows and columns of the dataset :", df.shape)
print("First 5 rows:\n", df.head())

```

</details>


## Exercise 2: Identify the target column and the data columns

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Refer to Step 2
</details>


<details>
    <summary>Click here for solution</summary>

```python
# Target variable = 'status'
print("All columns information:")
df.info()
target = "status"
X = df.drop(columns=[target,'name'], errors="ignore")
y = df[target]

```

</details>


## Exercise 3: Split the data into training and test sets (Stratified 80:20)
- Split the data
- Display number of samples in both training and test cases



In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Refer to Step 3
</details>


<details>
    <summary>Click here for solution</summary>

```python
#Split the dataset in the ratio of 80:20
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, stratify=y, random_state=42 )
#print the samples
print("Number of training samples:", X_train.shape[0])
print("Number of test samples:", X_test.shape[0])

```

</details>


## Exercise 4: Train the model on training data using Random Forest classifier. (Use n_estimators=200)



In [None]:
# your code goes here


<details>
    <summary>Click here for a hint</summary>
    
  Refer to Step 4
</details>


<details>
    <summary>Click here for solution</summary>

```python
# Train the model on training data
rf_clf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=None,        # let trees expand fully unless limited
    random_state=42,       # for reproducibility
    n_jobs=-1              # use all processors
)

rf_clf.fit(X_train, y_train)

```

</details>


## Exercise 5: Predict using the test dataset


In [None]:
 # your code goes here


<details>
    <summary>Click here for a hint</summary>
    
use the predict() method
</details>


<details>
    <summary>Click here for solution</summary>

```python
#  Predictions using Random Forest
y_pred = rf_clf.predict(X_test)    


```

</details>


## Exercise 6: Evaluate the model using accuracy and classification report

In [None]:
# your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Refer to Step 5
</details>

<details>
    <summary>Click here for solution</summary>

```python
#  Evaluation metrics
print("\n Model Accuracy:", accuracy_score(y_test, y_pred))

print("\n Classification Report:\n", classification_report(y_test, y_pred))


```

</details>

## Exercise 7: Find the sensitivity and display it in percentage

In [None]:
#your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Refer to Step 7
</details>

<details>
    <summary>Click here for solution</summary>

```python

from sklearn.metrics import recall_score

# Sensitivity = recall of the positive class
# (assumes status is coded 1 = Parkinson's disease, 0 = healthy)
sensitivity = recall_score(y_test, y_pred, pos_label=1)
print(f"Sensitivity Percentage: {sensitivity * 100:.2f}%")


```

</details>

# Congratulations!

You have successfully completed this lab on **Basic Supervised Model Implementation for Patient Risk Scoring** using the Wisconsin Breast Cancer dataset and a Random Forest classifier. You learned how to prepare and split a clinical dataset, train a Random Forest model, evaluate its performance using key metrics, and interpret results for clinical decision-making. These skills form the foundation for building reliable and interpretable machine learning models in healthcare.

## Authors


Ramesh Sannareddy


Copyright © 2025 SkillUp. All rights reserved.
