## Questions 1 and 2


**Give the number of patients in each category for the variables which measure sex, tumour type, and resection margins.**


Of the 727 patients, 409 were male and 318 were female. By tumour type, 541 patients were coded "1", 132 were coded "2", 31 were coded "4", 19 were coded "5", and 4 were coded "6". For resection margins, 523 patients were coded "1" and 204 were coded "2".


**Find the mean and standard deviation for the first measurement of the patient’s tumour size.**


The mean tumour size at the first measurement was 30.16 mm (standard deviation 21.74 mm), computed over the patients with a recorded first measurement.


**Find the number of patients who had surgery after 2001.**

Of the 727 patients in the dataset, 642 had surgery after 2001 (surgery year 2002 or later).


**Find the 2 most common countries, recode patients from all other countries as a new variable ‘other’ and give it an appropriate number.**


The two most common countries were code "12" (277 patients) and code "5" (109 patients). The remaining 341 patients, from all other countries, were recoded into a single "Other" group.
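
The recoding step can be sketched on a handful of made-up country codes. The sample codes and the numeric value `OTHER_CODE = 0` are assumptions for illustration only; in practice the pooled code should be one that the dataset's codebook does not already use.

```python
from collections import Counter

# Hypothetical country codes for eight patients (not taken from the dataset).
countries = [12, 5, 12, 8, 12, 5, 3, 7]

# Assumed numeric code for the pooled group; must not clash with a real code.
OTHER_CODE = 0

# The two most frequent codes.
top_two = {code for code, _ in Counter(countries).most_common(2)}

# Keep the top two codes and recode everything else to OTHER_CODE.
recoded = [code if code in top_two else OTHER_CODE for code in countries]

group_counts = Counter(recoded)
print(group_counts)
```

Recoding into a new list (or a new column) leaves the original country variable intact, so the grouping can be revisited later.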


**Give the number of patients in each category for the variables which measure sex, tumour type, and resection margins grouped by whether they had surgery in/before 2001, or after 2001.**


Of the 85 patients who had surgery in or before 2001, 56 were male and 29 were female. By tumour type, 52 were coded "1", 26 were coded "2", 4 were coded "4", and 3 were coded "5"; none were coded "6". For resection margins, 65 patients were coded "1" and 20 were coded "2".

Of the 642 patients who had resection surgery after 2001, 353 were male and 289 were female. By tumour type, 489 were coded "1", 106 were coded "2", 27 were coded "4", 16 were coded "5", and 4 were coded "6". For resection margins, 458 patients were coded "1" and 184 were coded "2".


**Find the mean and standard deviation for the first measurement of the patient’s tumour size for patients grouped by the three country groups you identified in part d.**


Patients from country "12" had a mean tumour size of 30.28 mm (SD 26.11 mm), patients from country "5" had a mean of 35.11 mm (SD 16.93 mm), and patients from all other countries had a mean of 28.72 mm (SD 18.88 mm).

### (Q2) Without pandas

Preprocessing uses small typecast helper functions built on `try`/`except` to turn invalid or missing values into `None`. The data is stored as a list of dictionaries; `collections.Counter` counts categories and groups, and the `statistics` module computes the mean and standard deviation. Results are printed with the `tabulate` library as readable tables. The code follows the Google Python Style Guide for naming and docstrings.
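
As a minimal sketch of this pattern, the snippet below runs the same idea on five made-up rows. The rows and the helper name `to_float` are illustrative stand-ins, not the dataset or the exact helper used in the cell that follows.

```python
import statistics
from collections import Counter


def to_float(value):
    """Return value as a float, or None when it cannot be parsed."""
    try:
        return float(value)
    except (ValueError, TypeError):
        return None


# Hypothetical rows shaped like csv.DictReader output (all values are strings).
rows = [
    {'PatSex': '1', 'PatTumourSize1': '20.0'},
    {'PatSex': '2', 'PatTumourSize1': ''},      # missing measurement
    {'PatSex': '1', 'PatTumourSize1': 'n/a'},   # invalid entry
    {'PatSex': '2', 'PatTumourSize1': '30.0'},
    {'PatSex': '1', 'PatTumourSize1': '40.0'},
]

# Category counts come straight from a Counter over the column.
sex_counts = Counter(row['PatSex'] for row in rows)

# Unparseable sizes become None and are dropped before computing statistics.
sizes = [to_float(row['PatTumourSize1']) for row in rows]
valid_sizes = [size for size in sizes if size is not None]

mean_size = statistics.mean(valid_sizes)
sd_size = statistics.stdev(valid_sizes)  # sample SD (n - 1 denominator)
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` would give the population version instead.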

In [9]:
import csv  # For reading CSV files
from datetime import datetime  # For parsing dates
from collections import Counter, defaultdict  # For counting and grouping data
import statistics  # For mean and standard deviation
from tabulate import tabulate  # For displaying results in table format


# 
# 1. DATA LOADING & PREPROCESSING
# 

def typecast_to_float(value):
    """Safely converts a value to a float.

    Args:
        value (str): The value to convert.

    Returns:
        float or None: The converted float value, or None if conversion fails.
    """
    try:
        return float(value)
    except (ValueError, TypeError):
        return None


def typecast_to_date(value):
    """Safely parses a date string into a datetime object.

    Args:
        value (str): The date string to parse.

    Returns:
        datetime or None: Parsed datetime object, or None if parsing fails.
    """
    try:
        return datetime.strptime(value, '%d/%m/%Y')
    except (ValueError, TypeError):
        return None


def load_and_clean_data(file_path):
    """Loads and preprocesses data from a CSV file.

    Args:
        file_path (str): Path to the CSV file.

    Returns:
        list[dict]: List of dictionaries representing preprocessed rows of data.
    """
    with open(file_path, 'r', newline='') as file:  # newline='' as recommended by the csv module docs
        reader = csv.DictReader(file)
        processed_data = []
        for row in reader:
            row['PatTumourSize1'] = typecast_to_float(row['PatTumourSize1'])
            row['PatTumourSize2'] = typecast_to_float(row['PatTumourSize2'])
            row['PatStage'] = typecast_to_float(row['PatStage'])
            row['PatDateOfSurgery'] = typecast_to_date(row['PatDateOfSurgery'])
            processed_data.append(row)
    return processed_data


# 
# 2. TASK (a): COUNT CATEGORIES
# 

def count_categories(data):
    """Counts categories for sex, tumour type, and resection margins.

    Args:
        data (list[dict]): Dataset.

    Returns:
        dict: Counts for each category.
    """
    return {
        'PatSex': Counter(row['PatSex'] for row in data),
        'PatTumourType': Counter(row['PatTumourType'] for row in data),
        'PatResectionMargins': Counter(row['PatResectionMargins'] for row in data)
    }


# 
# 3. TASK (b): TUMOUR SIZE STATISTICS
# 

def calculate_tumour_size_stats(data):
    """Calculates mean and standard deviation for tumour size.

    Args:
        data (list[dict]): Dataset.

    Returns:
        dict: Mean and standard deviation of tumour sizes.
    """
    tumour_sizes = [row['PatTumourSize1'] for row in data if row['PatTumourSize1'] is not None]
    return {
        'Mean': statistics.mean(tumour_sizes) if tumour_sizes else 'N/A',
        'Standard Deviation': statistics.stdev(tumour_sizes) if len(tumour_sizes) > 1 else 'N/A'
    }


# 
# 4. TASK (c): SURGERIES AFTER 2001
# 

def count_surgeries_after_2001(data):
    """Counts surgeries performed after 2001.

    Args:
        data (list[dict]): Dataset.

    Returns:
        int: Number of surgeries after 2001.
    """
    return sum(1 for row in data if row['PatDateOfSurgery'] and row['PatDateOfSurgery'].year > 2001)


# 
# 5. TASK (d): GROUP COUNTRIES
# 

def group_countries(data):
    """Groups patients into the top two countries and 'Other'.

    Args:
        data (list[dict]): Dataset.

    Returns:
        dict: Grouped country counts.
    """
    country_counts = Counter(row['PatCountry'] for row in data)
    top_countries = [country for country, _ in country_counts.most_common(2)]
    for row in data:
        row['PatCountryGrouped'] = row['PatCountry'] if row['PatCountry'] in top_countries else 'Other'
    return Counter(row['PatCountryGrouped'] for row in data)


# 
# 6. TASK (e): GROUP BY SURGERY PERIOD
# 

def group_by_surgery_period(data):
    """Groups data by surgery period.

    Args:
        data (list[dict]): Dataset.

    Returns:
        dict: Grouped data by surgery period.
    """
    surgery_groups = {'Before/On 2001': defaultdict(Counter), 'After 2001': defaultdict(Counter)}
    for row in data:
        date = row['PatDateOfSurgery']
        if date:
            period = 'Before/On 2001' if date.year <= 2001 else 'After 2001'
            surgery_groups[period]['PatSex'][row['PatSex']] += 1
            surgery_groups[period]['PatTumourType'][row['PatTumourType']] += 1
            surgery_groups[period]['PatResectionMargins'][row['PatResectionMargins']] += 1
    return surgery_groups


# 
# 7. TASK (f): TUMOUR SIZE STATS BY COUNTRY GROUP
# 

def calculate_tumour_size_stats_by_country(data):
    """Calculates tumour size stats grouped by country groups.

    Args:
        data (list[dict]): Dataset.

    Returns:
        dict: Mean and standard deviation by country group.
    """
    country_group_sizes = defaultdict(list)
    for row in data:
        if row['PatTumourSize1'] is not None:
            country_group_sizes[row['PatCountryGrouped']].append(row['PatTumourSize1'])
    return {
        group: {
            'Mean': statistics.mean(sizes),
            'Standard Deviation': statistics.stdev(sizes) if len(sizes) > 1 else 'N/A'
        }
        for group, sizes in country_group_sizes.items()
    }


# 
# 8. DISPLAY RESULTS IN TABLE FORMAT
# 

def display_table(title, data):
    """Display data in a formatted table.

    Args:
        title (str): Title of the table.
        data (dict or list): Data to display.
    """
    print(f"\n{title}")
    print(tabulate(data, headers='keys', tablefmt='grid'))


# 
# 9. MAIN FUNCTION
# 

def main():
    """Main function to execute all tasks."""
    file_path = 'CancerPatients.csv'
    data = load_and_clean_data(file_path)

    # Task (a): Count categories
    category_counts = count_categories(data)
    display_table("Category Counts", [{'Category': key, 'Counts': dict(value)} for key, value in category_counts.items()])

    # Task (b): Tumour size statistics
    tumour_stats = calculate_tumour_size_stats(data)
    display_table("Tumour Size Statistics", [tumour_stats])

    # Task (c): Surgeries after 2001
    surgeries_after_2001 = count_surgeries_after_2001(data)
    display_table("Surgeries After 2001", [{'Surgeries After 2001': surgeries_after_2001}])

    # Task (d): Grouped countries
    grouped_countries = group_countries(data)
    display_table("Grouped Countries", [{'Country Group': k, 'Count': v} for k, v in grouped_countries.items()])

    # Task (e): Grouped by surgery period
    grouped_surgery = group_by_surgery_period(data)
    display_table("Grouped by Surgery Period", [{'Period': k, 'Details': dict(v)} for k, v in grouped_surgery.items()])

    # Task (f): Tumour size stats by country group
    tumour_stats_by_country = calculate_tumour_size_stats_by_country(data)
    display_table("Tumour Size Stats by Country Group", [{'Group': k, **v} for k, v in tumour_stats_by_country.items()])


# Entry point
if __name__ == '__main__':
    main()



Category Counts
+---------------------+------------------------------------------------+
| Category            | Counts                                         |
+=====================+================================================+
| PatSex              | {'1': 409, '2': 318}                           |
+---------------------+------------------------------------------------+
| PatTumourType       | {'1': 541, '4': 31, '2': 132, '6': 4, '5': 19} |
+---------------------+------------------------------------------------+
| PatResectionMargins | {'1': 523, '2': 204}                           |
+---------------------+------------------------------------------------+

Tumour Size Statistics
+---------+----------------------+
|    Mean |   Standard Deviation |
+=========+======================+
| 30.1648 |              21.7383 |
+---------+----------------------+

Surgeries After 2001
+------------------------+
|   Surgeries After 2001 |
+========================+
|                    642 |
+------------------------+

Grouped Countries
+-----------------+---------+
| Country Group   |   Count |
+=================+=========+
| Other           |     34

### (Q1) With pandas

Preprocessing uses `pandas` functions to convert input to numeric and date types, with invalid entries coerced to `NaN` (or `NaT` for dates). The data is stored as a pandas DataFrame. The analysis uses `value_counts` for category counts, `mean` and `std` for summary statistics, and `groupby` for grouped aggregations. Conditional filtering and `apply` handle tasks such as grouping countries and categorising surgery periods.
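
A small sketch of the coercion and grouped-statistics steps, using a made-up six-row table rather than the real dataset. It also shows a detail worth knowing when reading grouped output: `std` is the sample standard deviation, so a group with a single valid value gets `NaN`.

```python
import pandas as pd

# Hypothetical data: one size entry is invalid and will be coerced to NaN.
df = pd.DataFrame({
    'PatCountryGrouped': ['12', '12', '5', '5', 'Other', '12'],
    'PatTumourSize1': ['20.0', '30.0', 'n/a', '10.0', '50.0', '40.0'],
})

# Invalid strings become NaN instead of raising an error.
df['PatTumourSize1'] = pd.to_numeric(df['PatTumourSize1'], errors='coerce')

# count, mean and std all skip NaN values within each group.
stats = df.groupby('PatCountryGrouped')['PatTumourSize1'].agg(['count', 'mean', 'std'])
print(stats)
```

Including `count` alongside the mean and standard deviation makes it clear how many valid measurements each group statistic is based on.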

In [13]:
import pandas as pd  # Data manipulation library
from tabulate import tabulate  # For displaying results in table format


# 
# 1. DATA LOADING & PREPROCESSING
# 

def load_and_clean_data(file_path):
    """Load and preprocess the cancer patients dataset.

    Args:
        file_path (str): Path to the dataset CSV file.

    Returns:
        pd.DataFrame: Preprocessed dataset.
    """
    df = pd.read_csv(file_path)
    df['PatDateOfSurgery'] = pd.to_datetime(df['PatDateOfSurgery'], format='%d/%m/%Y', errors='coerce')
    df['PatTumourSize1'] = pd.to_numeric(df['PatTumourSize1'], errors='coerce')
    df['PatTumourSize2'] = pd.to_numeric(df['PatTumourSize2'], errors='coerce')
    df['PatStage'] = pd.to_numeric(df['PatStage'], errors='coerce')
    return df


# 
# 2. TASK (a): COUNT CATEGORIES
# 

def count_categories(df):
    """Count categories for sex, tumour type, and resection margins.

    Args:
        df (pd.DataFrame): The dataset.

    Returns:
        dict: Counts for sex, tumour type, and resection margins.
    """
    return {
        'PatSex': df['PatSex'].value_counts().to_dict(),
        'PatTumourType': df['PatTumourType'].value_counts().to_dict(),
        'PatResectionMargins': df['PatResectionMargins'].value_counts().to_dict()
    }


# 
# 3. TASK (b): TUMOUR SIZE STATISTICS
# 

def calculate_tumour_size_stats(df):
    """Calculate tumour size statistics.

    Args:
        df (pd.DataFrame): The dataset.

    Returns:
        dict: Mean and standard deviation for tumour size.
    """
    return {
        'Mean': df['PatTumourSize1'].mean(),
        'Standard Deviation': df['PatTumourSize1'].std()
    }


# 
# 4. TASK (c): SURGERIES AFTER 2001
# 

def count_surgeries_after_2001(df):
    """Count surgeries performed after the year 2001.

    Args:
        df (pd.DataFrame): The dataset.

    Returns:
        int: Number of surgeries after 2001.
    """
    return df[df['PatDateOfSurgery'].dt.year > 2001].shape[0]


# 
# 5. TASK (d): GROUP COUNTRIES
# 

def map_country_group(country, top_countries):
    """Map a country to itself if in the top two, otherwise map to 'Other'.

    Args:
        country (int or str): The country identifier.
        top_countries (list): List of top two countries.

    Returns:
        int or str: The original country code, or 'Other' if not in the top two.
    """
    if country in top_countries:
        return country
    return 'Other'


def group_countries(df):
    """Group patients into the top two countries and 'Other'.

    Args:
        df (pd.DataFrame): The dataset.

    Returns:
        dict: Counts of patients in each country group.
    """
    top_countries = df['PatCountry'].value_counts().nlargest(2).index.tolist()
    df['PatCountryGrouped'] = df['PatCountry'].apply(map_country_group, args=(top_countries,))
    return df['PatCountryGrouped'].value_counts().to_dict()


# 
# 6. TASK (e): GROUP BY SURGERY PERIOD
# 

def map_surgery_period(date):
    """Map surgery date to 'Before/On 2001' or 'After 2001'.

    Args:
        date (pd.Timestamp): Surgery date.

    Returns:
        str: Group ('Before/On 2001' or 'After 2001').
    """
    if pd.isnull(date):
        return None
    return 'Before/On 2001' if date.year <= 2001 else 'After 2001'


def group_by_surgery_period(df):
    """Group data by surgery period.

    Args:
        df (pd.DataFrame): The dataset.

    Returns:
        dict: Grouped counts by surgery period.
    """
    df['SurgeryPeriod'] = df['PatDateOfSurgery'].apply(map_surgery_period)
    
    def count_values(group):
        """Count values in a group."""
        return {
            'PatSex': group['PatSex'].value_counts().to_dict(),
            'PatTumourType': group['PatTumourType'].value_counts().to_dict(),
            'PatResectionMargins': group['PatResectionMargins'].value_counts().to_dict()
        }
    
    grouped = {}
    for period, group in df.groupby('SurgeryPeriod'):
        grouped[period] = count_values(group)
    
    return grouped


# 
# 7. TASK (f): TUMOUR SIZE STATS BY COUNTRY GROUP
# 

def calculate_tumour_size_stats_by_country(df):
    """Calculate tumour size stats grouped by country.

    Args:
        df (pd.DataFrame): The dataset.

    Returns:
        dict: Mean and standard deviation of tumour sizes by country group.
    """
    stats = {}
    for group, group_df in df.groupby('PatCountryGrouped'):
        stats[group] = {
            'Mean': group_df['PatTumourSize1'].mean(),
            'Standard Deviation': group_df['PatTumourSize1'].std()
        }
    return stats


# 
# 8. DISPLAY RESULTS
# 
def display_table(title, data):
    """Display data in a formatted table.

    Args:
        title (str): Title of the table.
        data (dict or list): Data to display.
    """
    print(f"\n{title}")
    print(tabulate(data, headers='keys', tablefmt='grid'))


# 
# 9. MAIN FUNCTION
# 

def main():
    """Main function to execute all tasks."""
    file_path = 'CancerPatients.csv'
    df = load_and_clean_data(file_path)

    # Task (a): Count categories
    category_counts = count_categories(df)
    display_table("Category Counts", [{'Category': key, 'Counts': dict(value)} for key, value in category_counts.items()])

    # Task (b): Tumour size statistics
    tumour_stats = calculate_tumour_size_stats(df)
    display_table("Tumour Size Statistics", [tumour_stats])

    # Task (c): Surgeries after 2001
    surgeries_after_2001 = count_surgeries_after_2001(df)
    display_table("Surgeries After 2001", [{'Surgeries After 2001': surgeries_after_2001}])

    # Task (d): Grouped countries
    grouped_countries = group_countries(df)
    display_table("Grouped Countries", [{'Country Group': k, 'Count': v} for k, v in grouped_countries.items()])

    # Task (e): Grouped by surgery period
    grouped_surgery = group_by_surgery_period(df)
    display_table("Grouped by Surgery Period", [{'Period': k, 'Details': dict(v)} for k, v in grouped_surgery.items()])

    # Task (f): Tumour size stats by country group
    tumour_stats_by_country = calculate_tumour_size_stats_by_country(df)
    display_table("Tumour Size Stats by Country Group", [{'Group': k, **v} for k, v in tumour_stats_by_country.items()])


# Entry point
if __name__ == '__main__':
    main()



Category Counts
+---------------------+--------------------------------------+
| Category            | Counts                               |
+=====================+======================================+
| PatSex              | {1: 409, 2: 318}                     |
+---------------------+--------------------------------------+
| PatTumourType       | {1: 541, 2: 132, 4: 31, 5: 19, 6: 4} |
+---------------------+--------------------------------------+
| PatResectionMargins | {1: 523, 2: 204}                     |
+---------------------+--------------------------------------+

Tumour Size Statistics
+---------+----------------------+
|    Mean |   Standard Deviation |
+=========+======================+
| 30.1648 |              21.7383 |
+---------+----------------------+

Surgeries After 2001
+------------------------+
|   Surgeries After 2001 |
+========================+
|                    642 |
+------------------------+

Grouped Countries
+-----------------+---------+
| Country Group   |   Count |
+=================+=========+
| Other           |     341 |
+-----------------+---------+
| 12              |     277 |
+---------------

## Question 3

**The variables which measure tumour size contain some missing values, in this question you will explore different ways to impute these based on the K-nearest neighbour algorithm.**
* **Identify patients missing the first measurement of tumour size and remove these from the dataset, giving the number of samples removed. Use this dataset for the rest of question 3.**

79 patients were missing the first tumour size measurement and were removed, leaving 648 patients.

* **How many patients are missing second measurements of tumour size?**

Of the remaining patients, 24 are missing the second tumour size measurement.

* **Impute the missing values of second measurements of tumour size by identifying the 10 other patients whose first measurement of tumour size is most similar to the patient with the missing value, and using the mean of these patients' second tumour measurement as the imputed value. For each imputed value, give the patient ID and imputed value for second tumour measurement in a table.**
  
See Table 1

* **Repeat c, but use the 5 most similar patients.**
  
See Table 2

* **Repeat c, but use the 20 most similar patients.**
  
See Table 3

* **Repeat c, but use the 10 most similar patients who also have the same tumour type as the patient with a missing second tumour measurement.**
  
See Table 4


The K-Nearest Neighbours (KNN) approach imputes missing `PatTumourSize2` values from the patients most similar in `PatTumourSize1`. Only patients with both measurements are candidate neighbours. Similarity is the Euclidean distance on `PatTumourSize1`, which for a single feature reduces to the absolute difference. The k nearest candidates are selected and the mean of their `PatTumourSize2` values is used as the imputed value. A `Neighbours Used` column records how many neighbours were actually available for each imputation, which makes it visible when fewer than k exist, for example when candidates are restricted to the same tumour type.
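
The procedure can be sketched in plain Python on five made-up patients. The values and the function name `impute_size2` are hypothetical and used only for illustration.

```python
import statistics

# Hypothetical (first size, second size) pairs for patients with both measurements.
complete = [(10.0, 9.0), (12.0, 11.0), (15.0, 13.0), (30.0, 27.0), (31.0, 29.0)]


def impute_size2(size1, candidates, k):
    """Return (imputed second size, number of neighbours actually used)."""
    # With one feature, Euclidean distance is the absolute difference.
    ranked = sorted(candidates, key=lambda pair: abs(pair[0] - size1))
    nearest = ranked[:k]  # may hold fewer than k pairs
    imputed = statistics.mean([pair[1] for pair in nearest])
    return imputed, len(nearest)


# A patient with first size 11.0 and no second measurement.
value_k2, used_k2 = impute_size2(11.0, complete, k=2)
value_k10, used_k10 = impute_size2(11.0, complete, k=10)
```

With `k=2` the two closest patients (first sizes 10.0 and 12.0) are averaged; with `k=10` only five candidates exist, so all five are used and the second return value reports that. Ties in distance are resolved by input order here because Python's `sorted` is stable.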

In [1]:
import pandas as pd  # Library for data manipulation
import numpy as np  # Library for numerical computations
from tabulate import tabulate  # Library for displaying data in table format


# 
# 1. DATA LOADING & PREPROCESSING
# 

def load_and_clean_data(file_path):
    """Load and preprocess the cancer patients dataset.

    Args:
        file_path (str): Path to the dataset CSV file.

    Returns:
        pd.DataFrame: Preprocessed dataset.
    """
    # Read dataset from CSV file.
    df = pd.read_csv(file_path)
    
    # Convert surgery date to datetime, handle invalid entries as NaT.
    df['PatDateOfSurgery'] = pd.to_datetime(df['PatDateOfSurgery'], format='%d/%m/%Y', errors='coerce')
    
    # Convert tumour size and stage columns to numeric, handle invalid entries as NaN.
    df['PatTumourSize1'] = pd.to_numeric(df['PatTumourSize1'], errors='coerce')
    df['PatTumourSize2'] = pd.to_numeric(df['PatTumourSize2'], errors='coerce')
    df['PatStage'] = pd.to_numeric(df['PatStage'], errors='coerce')
    
    return df


# 
# 2. TASK (a): REMOVE MISSING FIRST TUMOUR SIZE MEASUREMENT
# 

def remove_missing_first_tumour_size(df):
    """Remove patients missing the first tumour size measurement.

    Args:
        df (pd.DataFrame): Dataset.

    Returns:
        pd.DataFrame: Filtered dataset without missing first tumour size.
        int: Number of removed samples.
    """
    initial_count = df.shape[0]  # Record initial row count.
    df = df.dropna(subset=['PatTumourSize1']).reset_index(drop=True)  # Remove rows with missing PatTumourSize1.
    removed_count = initial_count - df.shape[0]  # Calculate removed rows.
    
    return df, removed_count


# 
# 3. TASK (b): COUNT MISSING SECOND TUMOUR SIZE MEASUREMENTS
# 

def count_missing_second_tumour_size(df):
    """Count patients missing the second tumour size measurement.

    Args:
        df (pd.DataFrame): Dataset.

    Returns:
        int: Number of rows with missing second tumour size.
    """
    return df['PatTumourSize2'].isna().sum()  # Count NaN entries in PatTumourSize2.


# 
# 4. TASK (c, d, e, f): IMPUTE MISSING SECOND TUMOUR SIZE USING KNN
# 

def euclidean_distance(value1, value2):
    """Calculate Euclidean distance between two numeric values.

    Args:
        value1 (float): First numeric value.
        value2 (float): Second numeric value.

    Returns:
        float: Absolute difference as Euclidean distance.
    """
    return abs(value1 - value2)


def impute_missing_tumour_size(df, k, filter_by_type=False):
    """Impute missing second tumour size using manual KNN.

    Args:
        df (pd.DataFrame): Dataset.
        k (int): Number of nearest neighbours.
        filter_by_type (bool): Whether to restrict neighbours to the same tumour type.

    Returns:
        pd.DataFrame: One row per imputed patient, with columns 'Patient ID',
                      'Imputed Tumour Size 2', and 'Neighbours Used'.
    """
    results = []  # Store imputation results.
    
    # Create a subset with valid tumour size measurements.
    df_valid = df.dropna(subset=['PatTumourSize1', 'PatTumourSize2'])
    df_missing = df[df['PatTumourSize2'].isna()]  # Subset with missing second tumour size.
    
    for idx, missing_row in df_missing.iterrows():
        # Filter by tumour type if required.
        if filter_by_type:
            df_filtered = df_valid[df_valid['PatTumourType'] == missing_row['PatTumourType']]
        else:
            df_filtered = df_valid
        
        if df_filtered.empty:
            continue  # Skip if no valid neighbours exist.

        # Work on an explicit copy to avoid pandas' SettingWithCopyWarning.
        df_filtered = df_filtered.copy()

        # Distance in one dimension is the absolute difference in first tumour size.
        df_filtered['Distance'] = (df_filtered['PatTumourSize1'] - missing_row['PatTumourSize1']).abs()

        # Select the k nearest neighbours by distance.
        df_neighbours = df_filtered.sort_values(by='Distance').head(k)

        # Calculate the mean of the neighbours' second tumour size.
        imputed_value = df_neighbours['PatTumourSize2'].mean()

        # Store the result along with the actual number of neighbours used.
        results.append({
            'Patient ID': missing_row['PatID'],
            'Imputed Tumour Size 2': round(imputed_value, 2),
            'Neighbours Used': len(df_neighbours)
        })
    
    return pd.DataFrame(results)


# 
# 5. DISPLAY RESULTS
# 

def display_table(title, df):
    """Display results in a formatted table.

    Args:
        title (str): Title of the table.
        df (pd.DataFrame): DataFrame containing data to display.
    """
    print(f"\n{title}")
    print(tabulate(df, headers='keys', tablefmt='grid', showindex=False))


# 
# 6. MAIN FUNCTION
# 

def main():
    """Main function to execute all tasks."""
    file_path = 'CancerPatients.csv'
    df = load_and_clean_data(file_path)

    # Task (a): Remove rows with missing first tumour size.
    df, removed_count = remove_missing_first_tumour_size(df)
    print(f"\nNumber of samples removed (missing PatTumourSize1): {removed_count}")

    # Task (b): Count missing second tumour size.
    missing_second_count = count_missing_second_tumour_size(df)
    print(f"Number of patients missing PatTumourSize2: {missing_second_count}")

    # Task (c): Impute using 10 nearest neighbours.
    knn_10_results = impute_missing_tumour_size(df, k=10)
    display_table("TABLE 1: Imputed Tumour Size 2 Using 10 Neighbours", knn_10_results)

    # Task (d): Impute using 5 nearest neighbours.
    knn_5_results = impute_missing_tumour_size(df, k=5)
    display_table("TABLE 2: Imputed Tumour Size 2 Using 5 Neighbours", knn_5_results)

    # Task (e): Impute using 20 nearest neighbours.
    knn_20_results = impute_missing_tumour_size(df, k=20)
    display_table("TABLE 3: Imputed Tumour Size 2 Using 20 Neighbours", knn_20_results)

    # Task (f): Impute using 10 nearest neighbours with the same tumour type.
    knn_10_type_results = impute_missing_tumour_size(df, k=10, filter_by_type=True)
    display_table("TABLE 4: Imputed Tumour Size 2 Using 10 Neighbours (Same Tumour Type)", knn_10_type_results)


# Entry point for the program.
if __name__ == '__main__':
    main()



Number of samples removed (missing PatTumourSize1): 79
Number of patients missing PatTumourSize2: 24

TABLE 1: Imputed Tumour Size 2 Using 10 Neighbours
+--------------+-------------------------+-------------------+
|   Patient ID |   Imputed Tumour Size 2 |   Neighbours Used |
+==============+=========================+===================+
|          283 |                   23.37 |                10 |
+--------------+-------------------------+-------------------+
|          299 |                   28.86 |                10 |
+--------------+-------------------------+-------------------+
|          310 |                   31.21 |                10 |
+--------------+-------------------------+-------------------+
|          419 |                   15.92 |                10 |
+--------------+-------------------------+-------------------+
|          428 |                    8.98 |                10 |
+--------------+-------------------------+-------------------+
|          433 |                   15.92 |                10 |
+--------------+-----------

## Question 4

**Patients now need to be randomly assigned to one of 3 different treatment arms of a randomized control trial. Generate these random assignments ensuring the proportion of patients with each tumour type is the same in each arm. The information about which treatment arm patients are assigned to should be stored in your code but not be visible to clinicians looking at this dataset. Give the patient IDs and the treatment arms in a table.**

The dataset is stratified by tumour type using pandas `groupby`. Within each tumour-type group, the arm labels A, B, C are repeated to cover the group as evenly as possible, any leftover slots are filled with arms drawn at random without replacement, and the labels are shuffled before being assigned to patients. The treatment arm column is then dropped from the main dataframe `df` so that it is not visible to clinicians; the assignments are kept only in a separate dataframe, `assignment_df`.
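
The stratified assignment can be sketched on a toy table. The patient data, the seed, and the variable names below are made up for illustration. Each stratum's arm labels are written back through that stratum's own row index, so labels stay paired with the right patients even when tumour types are interleaved in the file.

```python
import numpy as np
import pandas as pd

# Hypothetical patients; tumour types are deliberately interleaved.
df = pd.DataFrame({
    'PatID': range(1, 13),
    'PatTumourType': [1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2],
})

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
arms = np.array(['A', 'B', 'C'])

assignment = pd.Series(index=df.index, dtype=object)
for _, group in df.groupby('PatTumourType'):
    full_cycles, leftover = divmod(len(group), len(arms))
    # Full A/B/C cycles, plus randomly drawn arms for any leftover patients.
    labels = np.concatenate([
        np.tile(arms, full_cycles),
        rng.choice(arms, size=leftover, replace=False),
    ])
    # Permute the labels and write them to this stratum's own rows.
    assignment.loc[group.index] = rng.permutation(labels)

# Kept apart from df, so the clinician-facing table never holds the arms.
assignment_df = pd.DataFrame({'PatID': df['PatID'], 'TreatmentArm': assignment})

# Balance check: rows are tumour types, columns are arms.
balance = pd.crosstab(df['PatTumourType'], assignment_df['TreatmentArm'])
print(balance)
```

A cross-tabulation like `balance` is a quick way to confirm that each tumour type is split as evenly as its size allows across the three arms.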

In [8]:
import pandas as pd
import numpy as np
from tabulate import tabulate


# 
# 1. DATA LOADING & PREPROCESSING
# 

def load_and_clean_data(file_path):
    """Load and preprocess the cancer patients dataset.

    Args:
        file_path (str): Path to the dataset CSV file.

    Returns:
        pd.DataFrame: Preprocessed dataset.
    """
    df = pd.read_csv(file_path)
    df['PatDateOfSurgery'] = pd.to_datetime(df['PatDateOfSurgery'], format='%d/%m/%Y', errors='coerce')
    df['PatTumourSize1'] = pd.to_numeric(df['PatTumourSize1'], errors='coerce')
    df['PatTumourSize2'] = pd.to_numeric(df['PatTumourSize2'], errors='coerce')
    df['PatStage'] = pd.to_numeric(df['PatStage'], errors='coerce')
    return df


# 
# 2. STRATIFIED RANDOMIZATION INTO TREATMENT ARMS
# 

def assign_treatment_arms(df):
    """Randomly assign patients to treatment arms while balancing tumour types.

    Args:
        df (pd.DataFrame): Dataset.

    Returns:
        pd.DataFrame: Dataset with an added 'TreatmentArm' column.
    """
    np.random.seed(8738)  # For reproducibility
    
    treatment_arms = ['A', 'B', 'C']  # Define treatment arms
    
    # Hold assignments in a Series aligned to the dataset's row index
    treatment_assignments = pd.Series(index=df.index, dtype=object)

    # Stratify by tumour type and assign treatments
    for tumour_type, group in df.groupby('PatTumourType'):
        arm_count = len(group) // len(treatment_arms)  # Full A/B/C cycles in this group

        # Repeat the arms for the full cycles, then draw the leftover arms at random
        arm_list = treatment_arms * arm_count + np.random.choice(
            treatment_arms, size=(len(group) % len(treatment_arms)), replace=False
        ).tolist()

        np.random.shuffle(arm_list)  # Shuffle treatment arm assignments

        # Write the arms to this group's own rows. Extending one flat list and
        # assigning it to the column would pair arms with rows by file position,
        # which only matches the strata if the file is sorted by tumour type.
        treatment_assignments.loc[group.index] = arm_list

    # Assign treatments back to the dataset (aligned on the row index)
    df['TreatmentArm'] = treatment_assignments
    
    return df

# 
# 3. DISPLAY RESULTS
# 

def display_table(title, df):
    """Display results in a formatted table.

    Args:
        title (str): Title of the table.
        df (pd.DataFrame): DataFrame containing data to display.
    """
    print(f"\n{title}")
    print(tabulate(df, headers='keys', tablefmt='grid', showindex=False))


# 
# 4. MAIN FUNCTION
# 

def main():
    """Main function to execute all tasks."""
    file_path = 'CancerPatients.csv'
    df = load_and_clean_data(file_path)

    # Stratified random assignment to treatment arms
    df = assign_treatment_arms(df)

    # Secure and extract assignment data
    assignment_df = df[['PatID', 'TreatmentArm']].copy() 
    df = df.drop(columns=['TreatmentArm'])
    
    # Display assignment table (first few rows only, to avoid printing the whole table)
    display_table("TABLE 1: Patient IDs and Treatment Arms", assignment_df.head())
    display_table("TABLE 2: Patient IDs for clinicians", df.head())  # Clinician view: treatment arm column removed

# Entry point for the program
if __name__ == '__main__':
    main()



TABLE 1: Patient IDs and Treatment Arms
+---------+----------------+
|   PatID | TreatmentArm   |
+=========+================+
|     234 | A              |
+---------+----------------+
|     235 | C              |
+---------+----------------+
|     236 | A              |
+---------+----------------+
|     240 | B              |
+---------+----------------+
|     242 | C              |
+---------+----------------+

TABLE 2: Patient IDs for clinicians
+-----+---------+----------+---------------------+-----------------+------------------+------------------+------------+--------------+-----------------------+
|   e |   PatID |   PatSex | PatDateOfSurgery    |   PatTumourType |   PatTumourSize1 |   PatTumourSize2 |   PatStage |   PatCountry |   PatResectionMargins |
+=====+=========+==========+=====================+=================+==================+==================+============+==============+=======================+
|   0 |     234 |        1 | 2000-06-23 00:00:00 |               1 |            26.19 |            21.01 |          3 |            8 |                     1 |
+-----+---------+----------+---------------------+-----------------+------------------+----------