# Introduction

## [Overview](https://www.kaggle.com/competitions/widsdatathon2024-challenge1/overview)
- Objective: Develop a model to predict if patients received a metastatic cancer diagnosis within 90 days of screening (i.e., `DiagPeriodL90D`) using a unique oncology dataset.
- Motivation: 
    - Metastatic triple-negative breast cancer (TNBC) is considered the most aggressive form of TNBC and requires the most urgent and timely treatment. Unnecessary delays in diagnosis and subsequent treatment can have devastating effects in these difficult cancers. Differences in the wait time to get treatment are a good proxy for disparities in healthcare access.
    - The primary goal of building these models is to detect relationships between demographics of the patient with the likelihood of getting timely treatment. The secondary goal is to see if environmental hazards impact proper diagnosis and treatment.  #TODO: check this
- Dataset
    - Source: Provided by Gilead Sciences, originating from Health Verity and enriched with third-party geo-demographic data and zip code level toxicology data from NASA/Columbia University.
    - Content: Information about demographics, diagnosis, treatment options, and insurance for patients diagnosed with breast cancer from 2015-2018.
    - Highlighted Features:
        - Demographics (e.g., age, gender, race, ...)
        - Diagnosis and treatment details (e.g., breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, ...)
        - Insurance information
        - Geo (zip-code level) demographic data (e.g., income, education, rent, race, poverty, ...)
        - Toxic air quality (zip-code level) data (e.g., Ozone, PM25 and NO2, ...)
- Evaluation
    - Metric: Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC).
    - Leaderboard:
        - During the competition: 51% of the test data.
        - Final standings: 49% of the test data.
- Submission Format
    - File Format: CSV
    - Columns:
        - `patient_id` (integer)
        - `DiagPeriodL90D` (predicted probability between 0 and 1)
    - Example:
        ```
        patient_id,DiagPeriodL90D
        372069,.5
        981264,.5
        ```
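
As a rough illustration of the metric and the submission format, the sketch below computes AUC-ROC from its pairwise definition (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half) and writes a two-row submission. The helper names `pairwise_auc` and `to_submission_csv` and the toy scores are assumptions made for illustration; they are not part of the competition kit.

```python
import csv
import io
from typing import Dict, List


def pairwise_auc(y_true: List[int], y_score: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def to_submission_csv(predictions: Dict[int, float]) -> str:
    """Serialize {patient_id: probability} into the two-column submission layout."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["patient_id", "DiagPeriodL90D"])
    for patient_id, probability in predictions.items():
        writer.writerow([patient_id, probability])
    return buffer.getvalue()


# toy labels and scores, made up for this example
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
submission = to_submission_csv({372069: 0.5, 981264: 0.5})
print(auc)
print(submission)
```

Because AUC depends only on how the scores rank patients, a constant prediction such as the `.5` in the example file scores 0.5 regardless of the class balance.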

## [Dataset](https://www.kaggle.com/competitions/widsdatathon2024-challenge1/data)
Roughly 18k records in total, each corresponding to a single patient and her diagnosis period.
- `training.csv`
    - 12906 records
    - 83 columns, the last column is `DiagPeriodL90D` (int64)
- `test.csv`
    - 5792 records
    - 82 columns

# Exploratory Data Analysis

Thoughts:
- Don't drop records just because certain columns are empty. The missingness pattern itself may reflect patient characteristics that correlate with the target.
- Apply the same pre- and post-processing to the training and test data.
- Use logistic regression as a preliminary step to discover relevant columns.
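
One way to keep the information carried by missing values, rather than dropping rows, is to add explicit indicator columns. The sketch below is only an illustration under assumptions: the toy values are made up, and `add_missing_indicators` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# toy frame; the column names mirror the dataset, the values are invented
toy_df = pd.DataFrame({
    "bmi": [23.1, np.nan, 31.4, np.nan],
    "patient_race": ["White", np.nan, np.nan, "Asian"],
    "DiagPeriodL90D": [1, 0, 1, 0],
})


def add_missing_indicators(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with one 0/1 `<col>_missing` column per requested column."""
    out = df.copy()
    for col in columns:
        out[f"{col}_missing"] = out[col].isna().astype(int)
    return out


flagged_df = add_missing_indicators(toy_df, ["bmi", "patient_race"])

# share of timely diagnoses among rows with and without a BMI value
print(flagged_df.groupby("bmi_missing")["DiagPeriodL90D"].mean())
```

The same helper would have to be applied to the test data so that both frames end up with identical columns.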

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional, Union

In [None]:
# seaborn style (default)
sns.set_theme(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1, color_codes=True, rc=None)

# font sizes
title_fontsize = 14
label_fontsize = 12
tick_fontsize = 10
text_fontsize = 10

# random state
random_state = 42

# copy-on-write
pd.set_option("mode.copy_on_write", True)

# maximum column width, number of rows and columns to display
pd.set_option("display.max_colwidth", None)  # no limit
pd.set_option("display.max_rows", None)      # no limit
pd.set_option("display.max_columns", None)   # no limit

In [None]:
df_training = pd.read_csv('data/training.csv')
df_training.info()

<span style="color: #ffed29;">From the `dtypes` (NumPy data types: `np.float64`, `np.int64`; Python's built-in type: `object`), we know that the missing values in all columns can only be `numpy.nan`, i.e., `NaN`.</span>
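
To make the point about `NaN` concrete, here is a small sketch with an in-memory CSV (the three rows are invented). Empty fields are read as `NaN` in both the text column and the numeric column, so `isna()` is enough to measure missingness per column.

```python
import io

import pandas as pd

# in-memory stand-in for a CSV file; rows 2 and 3 contain empty fields
csv_text = "patient_id,patient_race,bmi\n1,White,23.5\n2,,\n3,Asian,\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.dtypes)

# percentage of missing values per column
missing_share = toy_df.isna().mean().mul(100).round(1)
print(missing_share)
```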

## Drop Duplicates

<span style="color: #ffed29;">There are some records where all attributes are identical except for `patient_id`. After removing these duplicates (keeping the first occurrence), 12,870 records remain.</span>

In [None]:
def find_duplicates(df: pd.DataFrame, id_col: str) -> Tuple[pd.DataFrame, List[str]]:
    """
    Identifies and filters duplicates based on all columns except the specified ID column.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    id_col (str): The name of the ID column.
    
    Returns:
    pd.DataFrame: A DataFrame containing only the duplicated rows.
    list: List of columns to check for duplicates.
    """
    columns_to_check = df.columns.difference([id_col]).to_list()
    duplicates_mask = df.duplicated(subset=columns_to_check, keep=False)
    duplicates_df = df[duplicates_mask].copy()
    return duplicates_df, columns_to_check

In [None]:
# find duplicates except the `patient_id` column
duplicates_df, columns_to_check = find_duplicates(df_training, 'patient_id')
first_occurrence_df = duplicates_df.drop_duplicates(subset=columns_to_check)
print('All Duplicated Rows (including the first occurrence):')
print(duplicates_df.to_string(index=False))
print(f'\nTotal Number of Duplicated Rows (excluding the first occurrence): {len(duplicates_df) - len(first_occurrence_df)}')

In [None]:
# remove duplicates
df_training_without_duplicates = df_training.drop_duplicates(subset=columns_to_check, ignore_index=True)
print(f'Removed {len(df_training) - len(df_training_without_duplicates)} duplicated rows.')
print(f'Current #Patients: {len(df_training_without_duplicates)}')

In [None]:
df_training_without_duplicates.info()

## Inspect Columns

In [None]:
def get_value_counts(df: pd.DataFrame, col: str, normalize: bool = True, dropna: bool = False) -> pd.DataFrame:
    """
    Computes the value counts for a specified column in a DataFrame, optionally normalizing and dropping missing values.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    col (str): The name of the column to compute value counts for.
    normalize (bool, optional): If True, the value counts will be normalized to percentages. Defaults to True.
    dropna (bool, optional): If True, missing values will be dropped from the value counts. Defaults to False.

    Returns:
    pd.DataFrame: A DataFrame containing the value counts (*100 if normalized).
    """
    # compute value counts
    value_counts = df[col].value_counts(normalize=normalize, dropna=dropna)
    
    # multiply by 100 if normalized
    if normalize:
        value_counts *= 100
    
    return value_counts.to_frame()

In [None]:
def get_combined_value_counts(df: pd.DataFrame, col: str, filter_col: str, filter_values: list, normalize: bool = True, dropna: bool = False) -> pd.DataFrame:
    """
    Computes and combines the value counts for a specified column in the original DataFrame and filtered DataFrames.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    col (str): The name of the column to compute value counts for.
    filter_col (str): The name of the column to filter on.
    filter_values (list): A list of values to filter by in the filter_col.
    normalize (bool, optional): If True, the value counts will be normalized to percentages. Defaults to True.
    dropna (bool, optional): If True, missing values will be dropped from the value counts. Defaults to False.

    Returns:
    pd.DataFrame: A DataFrame containing the combined value counts.
    """
    # compute value counts for the original DataFrame
    original_counts = get_value_counts(df, col, normalize=normalize, dropna=dropna)
    original_counts.columns = ['All (%)']

    # start from the overall counts; the per-value counts are joined onto this frame
    combined_df = original_counts

    # compute value counts for each filter value and merge into the combined DataFrame
    for value in filter_values:
        filtered_df = df[df[filter_col] == value]
        filtered_counts = get_value_counts(filtered_df, col, normalize=normalize, dropna=dropna)
        filtered_counts.columns = [f'{filter_col}={value} (%)']
        combined_df = combined_df.join(filtered_counts, how='outer')

    # reindex the combined DataFrame to match the original index order
    combined_df = combined_df.reindex(original_counts.index)

    return combined_df

In [None]:
def highlight_nan(df: pd.DataFrame):
    """
    Highlight NaN values in the DataFrame and rows where the index is NaN.

    Parameters:
    df (pd.DataFrame): The DataFrame to apply the styling to.

    Returns:
    pd.io.formats.style.Styler: A Styler that renders the DataFrame with NaN cells (and NaN-indexed rows) shaded gray.
    """
    def highlight_row(row):
        if pd.isna(row.name):
            return ['background-color: gray'] * len(row)
        return ['background-color: gray' if pd.isna(v) else '' for v in row]
    
    return df.style.apply(highlight_row, axis=1)

In [None]:
def plot_category_distribution(df: pd.DataFrame, cat_col: str, index_labels: Optional[List[Tuple[int, str]]] = None) -> None:
    """
    Plots the distribution of a categorical column in a DataFrame, normalizing the value counts and optionally setting a specific order.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    cat_col (str): The name of the categorical column to plot.
    index_labels (Optional[List[Tuple[int, str]]]): A list of tuples where each inner tuple contains an index and its corresponding label.
    """
    # compute value counts
    series = df[cat_col].copy()
    series.name = None
    value_counts_normalized = series.value_counts(normalize=True, dropna=False)

    if index_labels:
        # define the order of the categories
        sorted_index_labels = sorted(index_labels)
        order = [index for index, _ in sorted_index_labels]
        # reindex the result to set the order
        value_counts_normalized = value_counts_normalized.reindex(order)

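    # plot (values are proportions in [0, 1], not percentages)
    value_counts_normalized.plot(kind='barh', figsize=(10, min(len(value_counts_normalized.index), 8)))
    plt.title(f'{cat_col} Distribution (proportion)', fontsize=title_fontsize)
    plt.xlim([0, 1])
    if index_labels:
        # barh draws the bars at positions 0..n-1, so the ticks must be positional rather than the index values
        plt.yticks(ticks=range(len(order)), labels=[f'{index} ({label})' for index, label in sorted_index_labels], fontsize=tick_fontsize)
    plt.show()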

In [None]:
def plot_histogram_with_percentages(df: pd.DataFrame, num_col: str, bins: Union[int, List[int]]) -> None:
    """
    Plot a histogram for a specified numerical column of a DataFrame, showing the percentage each bin represents.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    num_col (str): The name of the numerical column to plot.
    bins (int or List[int]): The number of bins or the boundaries of the bins.
    """
    # check if the column exists and is numerical and has no missing values
    if num_col not in df.columns:
        raise ValueError(f"Column '{num_col}' not found in the DataFrame.")
    if not pd.api.types.is_numeric_dtype(df[num_col]):
        raise ValueError(f"Column '{num_col}' is not numerical.")
    if df[num_col].isna().any():
        raise ValueError(f"Column '{num_col}' contains missing values. Please handle or remove them before plotting.")
 
    # calculate the weights to represent percentages
    total_count = len(df[num_col])
    weights = [100 / total_count] * total_count

    # plot
    plt.figure(figsize=(10, 6))
    hist = sns.histplot(x=df[num_col], bins=bins, weights=weights)
    for p in hist.patches:
        height = p.get_height()
        hist.text(p.get_x() + p.get_width() / 2., height + 0.5, f'{height:.1f}', ha="center", fontsize=text_fontsize)
    plt.title(f'{num_col} Distribution', fontsize=title_fontsize)
    plt.xlabel(num_col, fontsize=label_fontsize)
    plt.ylabel('Percentage', fontsize=label_fontsize)
    if isinstance(bins, int):
        # use the same equal-width edges the histogram is drawn with (pd.cut would pad the range slightly)
        bins = np.histogram_bin_edges(df[num_col], bins=bins)
    plt.xticks(ticks=bins, fontsize=tick_fontsize)
    plt.yticks(fontsize=tick_fontsize)
    plt.show()

In [None]:
def plot_side_by_side_histogram_with_percentages(df: pd.DataFrame, num_col: str, filter_col: str, filter_values: List, bins: Union[int, List[int]]) -> None:
    """
    Plot a side-by-side histogram for a specified numerical column of a DataFrame, showing the percentage each bin represents for both the original and filtered distributions.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    num_col (str): The name of the numerical column to plot.
    filter_col (str): The name of the column to filter on.
    filter_values (list): A list of values to filter by in the filter_col.
    bins (int or List[int]): The number of bins or the boundaries of the bins.
    """
    # check if the columns exist and are numerical and have no missing values
    if num_col not in df.columns:
        raise ValueError(f"Column '{num_col}' not found in the DataFrame.")
    if filter_col not in df.columns:
        raise ValueError(f"Column '{filter_col}' not found in the DataFrame.")
    if not pd.api.types.is_numeric_dtype(df[num_col]):
        raise ValueError(f"Column '{num_col}' is not numerical.")
    if df[num_col].isna().any():
        raise ValueError(f"Column '{num_col}' contains missing values. Please handle or remove them before plotting.")
    
    # calculate bin edges
    if isinstance(bins, int):
        bins = pd.cut(df[num_col], bins=bins, retbins=True)[1]
    bins = np.array(bins)

    # calculate bin centers
    bin_centers = (bins[:-1] + bins[1:]) / 2
    bin_widths = bins[1:] - bins[:-1]

    # create a DataFrame to store the counts
    counts_df = pd.DataFrame()

    # calculate counts for the original data
    original_counts, _ = np.histogram(df[num_col], bins=bins)
    counts_df['All'] = original_counts

    # calculate counts for the filtered data
    for value in filter_values:
        filtered_df = df[df[filter_col] == value]
        filtered_counts, _ = np.histogram(filtered_df[num_col], bins=bins)
        counts_df[f'{filter_col}={value}'] = filtered_counts

    # normalize counts to percentages
    counts_df = counts_df.div(counts_df.sum(axis=0)) * 100

    # plot
    fig, ax = plt.subplots(figsize=(10, 6))
    cmap = plt.get_cmap('viridis')
    colors = [cmap(i / len(counts_df.columns)) for i in range(len(counts_df.columns))]

    n_series = len(counts_df.columns)
    for i, (col, color) in enumerate(zip(counts_df.columns, colors)):
        # shift each series so that the group of bars is centered on the bin center (works for odd and even n_series)
        positions = bin_centers + (i - (n_series - 1) / 2) * (bin_widths / n_series)
        ax.bar(positions, counts_df[col], width=bin_widths / n_series, alpha=0.7, color=color, label=col)
        for position, count in zip(positions, counts_df[col]):
            ax.text(position, count + 0.5, f'{count:.1f}', ha="center", fontsize=tick_fontsize)

    plt.title(f'{num_col} Distribution', fontsize=title_fontsize)
    plt.xlabel(num_col, fontsize=label_fontsize)
    plt.ylabel('Percentage', fontsize=label_fontsize)
    plt.xticks(ticks=bins, fontsize=tick_fontsize)
    plt.yticks(fontsize=tick_fontsize)
    plt.legend()
    plt.show()

### `DiagPeriodL90D`

<span style="color: #ffed29;">Approximately 38% of patients did not receive a metastatic cancer diagnosis within 90 days of screening.</span>

In [None]:
target_column = 'DiagPeriodL90D'

In [None]:
get_value_counts(df_training_without_duplicates, target_column)

In [None]:
plot_category_distribution(df_training_without_duplicates, target_column, [(0, '>= 90 Days'), (1, '< 90 Days')])

### `patient_race`

<span style="color: #ffed29;">Race information is unknown for approximately 50% of patients. Among the known cases, the only race that comprises more than 10% is White, at about 28%. Patients with unknown race are slightly more likely to be undiagnosed within 90 days of screening, while patients identified as White are slightly more likely to receive a timely diagnosis.</span>

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'patient_race', target_column, [0, 1]))

### `payer_type`

<span style="color: #ffed29;">Around 47% of patients have commercial insurance. The proportions of patients with Medicaid or Medicare are similar, each close to 20%. Patients with unknown insurance types are slightly more likely to receive a timely diagnosis, while those with commercial insurance are slightly more likely to be undiagnosed within 90 days of screening.</span>  #TODO: check this

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'payer_type', target_column, [0, 1]))

### `patient_state`

<span style="color: #ffed29;">Six states each account for more than 5% of patients: CA (18.9%), TX (8.9%), NY (8.1%), MI (6.6%), IL (6.1%), and OH (5.9%). The tail of the distribution includes RI, NH, and MA, each at 0.00777%. Patients from some states, such as CA and MI, tend to receive more timely diagnoses than those from other states.</span>
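
A quick arithmetic check, assuming the 12,870 de-duplicated records reported in the duplicate-removal step: the smallest share in the table, 0.00777%, corresponds to exactly one patient.

```python
n_patients = 12870  # records left after dropping duplicates, as reported earlier in this notebook

# percentage of the dataset represented by a single patient
single_patient_share = 100 / n_patients
print(f"{single_patient_share:.5f}%")
```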

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'patient_state', target_column, [0, 1]))

### `patient_zip3`

<span style="color: #ffed29;">There are 739 zip3 codes with patient distribution ranging from 0.00777% to 1.8%.</span>

In [None]:
patient_zip3_counts_df = get_combined_value_counts(df_training_without_duplicates, 'patient_zip3', target_column, [0, 1])
print(f'#Zip3-codes: {len(patient_zip3_counts_df)}')
highlight_nan(patient_zip3_counts_df.head(10))
# highlight_nan(patient_zip3_counts_df.tail(10))

### `patient_age`

<span style="color: #ffed29;">This dataset includes patients ranging from 18 to 91 years old, with more than 80% of the patients aged between 40 and 80. Patients older than 60 are slightly more likely to receive a timely diagnosis, while younger patients are slightly less likely to.</span>

In [None]:
# double quotes outside, single quotes inside: nesting the same quote type in an f-string fails before Python 3.12
print(f"Min & max age: {df_training_without_duplicates['patient_age'].min()}, {df_training_without_duplicates['patient_age'].max()}")
age_boundaries = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # bins: [10-20), [20-30), [30-40), [40-50), [50-60), [60-70), [70-80), [80-90), [90-100]
plot_side_by_side_histogram_with_percentages(df_training_without_duplicates, 'patient_age', target_column, [0, 1], age_boundaries)

### `patient_gender`

<span style="color: #ffed29;">All patients are female.</span>

In [None]:
highlight_nan(get_value_counts(df_training_without_duplicates, 'patient_gender'))

### `bmi`

<span style="color: #ffed29;">Almost 70% of patients' BMI information is not available. For the available records, BMI ranges from 14.0 to 85.0. Most patients fall into the categories of 'Normal weight', 'Pre-obesity', and 'Obesity class I'. Patients with normal weights are slightly more likely to receive timely diagnoses, whereas those with higher BMIs are less likely to do so.</span>

WHO Classification of BMI

| BMI Interval (kg/m²) | Category                |
|----------------------|-------------------------|
| < 18.5               | Underweight             |
| 18.5 - 24.9          | Normal weight           |
| 25 - 29.9            | Pre-obesity             |
| 30 - 34.9            | Obesity class I         |
| 35 - 39.9            | Obesity class II        |
| ≥ 40                 | Obesity class III       |
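
The table can be turned into a categorical feature with `pd.cut`. The sketch below is a minimal illustration on made-up BMI values; the names `who_edges`, `who_labels`, and `bmi_category` are assumptions, and `right=False` makes every interval closed on the left so that, for example, 25.0 falls into 'Pre-obesity'.

```python
import numpy as np
import pandas as pd

# left-closed WHO intervals: [0, 18.5), [18.5, 25), [25, 30), [30, 35), [35, 40), [40, inf)
who_edges = [0, 18.5, 25, 30, 35, 40, np.inf]
who_labels = [
    "Underweight",
    "Normal weight",
    "Pre-obesity",
    "Obesity class I",
    "Obesity class II",
    "Obesity class III",
]

# toy BMI values, including boundary cases and one missing value
toy_bmi = pd.Series([14.0, 18.5, 24.9, 25.0, 32.0, 40.0, np.nan])
bmi_category = pd.cut(toy_bmi, bins=who_edges, labels=who_labels, right=False)
print(bmi_category.tolist())
```

Missing BMI values stay missing after binning, so they can still be handled as a separate category later.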

In [None]:
# double quotes outside, single quotes inside: nesting the same quote type in an f-string fails before Python 3.12
print(f"The proportion of missing BMI: {df_training_without_duplicates['bmi'].isna().sum() / len(df_training_without_duplicates) * 100:.1f}%")
bmi_boundaries = [0, 18.5, 25, 30, 35, 40, 85]  # bins: [0, 18.5), [18.5, 25), [25, 30), [30, 35), [35, 40), [40, 85]
id_bmi_df = df_training_without_duplicates[['patient_id', 'bmi']].dropna()
id_target_df = df_training_without_duplicates[['patient_id', target_column]]
id_bmi_target_df = pd.merge(id_bmi_df, id_target_df, how='left', on='patient_id')
plot_side_by_side_histogram_with_percentages(id_bmi_target_df, 'bmi', target_column, [0, 1], bmi_boundaries)

### `breast_cancer_diagnosis_code`

<span style="color: #ffed29;">
    <ol>
        <li>The number of ICD-10 patients is 3.3 times that of ICD-9 patients. The most common codes are 174.9 (15.3%), C50.911 (13.9%), C50.912 (13.3%), and C50.919 (11.4%). Many other codes have fewer than 10 patients.
        <li>Patients with a primary diagnosis code of 174.9 are more than 16 times as likely to remain undiagnosed within 90 days after screening. Generally, patients with ICD-9 codes are less likely to receive a timely diagnosis compared to those with ICD-10 codes.
        <li>ICD-10 codes offer more detailed diagnoses for breast cancer compared to ICD-9 codes. While it is possible to map ICD-9 codes to ICD-10 codes, this process can alter the distribution of diagnosis codes. Therefore, direct conversion is not performed. However, some available mappings include: C50.9|174.9, C50.41|174.4, C50.81|174.8, C50.21|174.2, C50.11|174.1, C50.51|174.5, C50.31|174.3, C50.01|174.6. #TODO: experiment on this
        <li>Some codes in the dataset (C50.929, C50.021, 175.9, C50.421) indicate breast cancer in males, which contradicts the information in the `patient_gender` column. These discrepancies suggest that the dataset may contain human errors, possibly introduced during data collection or data preparation. These records are not removed but require additional attention. #TODO: remove/move to validation set
        <li>The code 198.81 represents a secondary malignant neoplasm of the breast, indicating that the cancer has metastasized to the breast from another part of the body. This differs from the other codes in the dataset, which represent primary breast cancer. Only one patient out of nine has `DiagPeriodL90D=1`. This definitely requires attention. #TODO: check this
    </ol>
</span>
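
If the ICD-9 to ICD-10 mapping is explored later, one option is to derive a coarse shared prefix instead of converting codes one-to-one. The sketch below is hypothetical: it uses only three of the pairs listed above, written without dots as in the dataset, and the names `ICD9_TO_ICD10_PREFIX`, `icd_version`, and `coarse_icd10_prefix` are made up for illustration.

```python
from typing import Optional

# ICD-9 code -> ICD-10 prefix, a subset of the pairs listed above (dots removed)
ICD9_TO_ICD10_PREFIX = {
    "1749": "C509",
    "1744": "C5041",
    "1748": "C5081",
}


def icd_version(code: str) -> str:
    """ICD-10 breast cancer codes start with 'C'; everything else is treated as ICD-9."""
    return "ICD-10" if code.startswith("C") else "ICD-9"


def coarse_icd10_prefix(code: str) -> Optional[str]:
    """Map a code to a shared ICD-10 prefix, or None when no mapping is known."""
    if icd_version(code) == "ICD-10":
        # try longer prefixes first so that the most specific match wins
        for prefix in sorted(ICD9_TO_ICD10_PREFIX.values(), key=len, reverse=True):
            if code.startswith(prefix):
                return prefix
        return None
    return ICD9_TO_ICD10_PREFIX.get(code)


for code in ["1749", "C50911", "C50412", "1759"]:
    print(code, icd_version(code), coarse_icd10_prefix(code))
```

Adding such a prefix as an extra column, rather than overwriting the original code, leaves the distribution of the original diagnosis codes unchanged.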

In [None]:
is_icd10 = df_training_without_duplicates['breast_cancer_diagnosis_code'].str.startswith('C')
icd10_df = df_training_without_duplicates[is_icd10]
icd9_df = df_training_without_duplicates[~is_icd10]
print(f'The ratio of ICD-10 codes to ICD-9 codes: {len(icd10_df)/len(icd9_df):.1f}')

In [None]:
code_counts_df = get_combined_value_counts(df_training_without_duplicates, 'breast_cancer_diagnosis_code', target_column, [0, 1])
code_desc_df = df_training_without_duplicates[['breast_cancer_diagnosis_code', 'breast_cancer_diagnosis_desc']].drop_duplicates()
code_counts_desc_df = pd.merge(code_counts_df, code_desc_df, left_index=True, right_on='breast_cancer_diagnosis_code')
highlight_nan(code_counts_desc_df)

#### Edge Cases


In [None]:
male_codes = ['C50929', 'C50021', '1759', 'C50421']
secondary_code = '19881'

In [None]:
male_code_df = df_training_without_duplicates[df_training_without_duplicates['breast_cancer_diagnosis_code'].isin(male_codes)]
male_code_df[['patient_id', target_column]]

In [None]:
secondary_code_df = df_training_without_duplicates[df_training_without_duplicates['breast_cancer_diagnosis_code']==secondary_code]
secondary_code_df[['patient_id', target_column]]

### `breast_cancer_diagnosis_desc`

<span style="color: #ffed29;">
    <ol>
        <li>Many words in the descriptions, such as 'Malignant'/'Malig', 'neoplasm'/'neoplm', 'female', and 'breast', are not informative since they appear in almost every field. Removing these words can make the descriptions shorter and more concise. For example, the description 'Malignant neoplasm of breast (female), unspecified' can be reduced to 'unspecified' eventually.
        <li>Some words have short forms, and using a uniform representation can reduce unnecessary variations. For this purpose, 'unsp' will be replaced by 'unspecified', and 'ovrlp' will be replaced by 'overlapping'.
        <li>Each original description consists of 4 to 10 words. These descriptions specify the position of the neoplasm, such as the quadrant and side. After cleaning, the descriptions have been reduced to 1 to 7 words, with the single one-word description ('of') corresponding to C50.
        <li>Through further preprocessing, the text will be converted to lowercase, and common English stop words (e.g., 'of', 'and', etc.) will be removed. Punctuation and special characters will also be eliminated implicitly.
    </ol>
</span>
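
The first two cleaning steps can be sketched on single strings as follows. This is a condensed illustration, not the cells used below: `clean_description` is a hypothetical helper that merges token removal and replacement, and the second input string is an invented abbreviated description.

```python
from typing import Dict, List

TOKENS_TO_REMOVE = ["Malignant", "malignant", "Malig", "neoplasm", "neoplm",
                    "female", "(female),", "breast", "breast,"]
TOKEN_REPLACEMENTS = {"unsp": "unspecified", "ovrlp": "overlapping"}


def clean_description(text: str, tokens_to_remove: List[str], replacements: Dict[str, str]) -> str:
    """Drop uninformative tokens, then expand abbreviations (whitespace tokenization)."""
    kept = [word for word in text.split() if word not in tokens_to_remove]
    return " ".join(replacements.get(word, word) for word in kept)


long_form = clean_description("Malignant neoplasm of breast (female), unspecified",
                              TOKENS_TO_REMOVE, TOKEN_REPLACEMENTS)
short_form = clean_description("Malig neoplm of unsp site of unsp female breast",
                               TOKENS_TO_REMOVE, TOKEN_REPLACEMENTS)
print(long_form)
print(short_form)
```

The leftover 'of' is a common English stop word and would only disappear in the later vectorization step, once stop-word removal is applied.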

In [None]:
def remove_tokens(text: str, tokens: list) -> str:
    """
    Removes specified tokens from the input text.

    Parameters:
    text (str): The input text string.
    tokens (list): A list of tokens to remove from the input text.

    Returns:
    str: The filtered text with specified tokens removed.
    """
    words = text.split()
    filtered_words = [word for word in words if word not in tokens]
    
    return ' '.join(filtered_words)

In [None]:
def replace_tokens(text: str, replacements: dict) -> str:
    """
    Replaces tokens in the input text based on a dictionary of token pairs.

    Parameters:
    text (str): The input text string.
    replacements (dict): A dictionary where keys are tokens to be replaced and values are the replacement tokens.

    Returns:
    str: The text with tokens replaced according to the dictionary.
    """
    words = text.split()
    replaced_words = [replacements.get(word, word) for word in words]
    return ' '.join(replaced_words)

In [None]:
def count_words(text: str) -> int:
    """
    Counts the number of words in the input text, splitting by spaces.

    Parameters:
    text (str): The input text string.

    Returns:
    int: The number of words in the text.
    """
    return len(text.split())

In [None]:
# remove tokens
tokens_to_remove = ['Malignant', 'malignant', 'Malig',
                    'neoplasm', 'neoplm', 
                    'female', '(female),',
                    'breast', 'breast,']
code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'] = code_counts_desc_df['breast_cancer_diagnosis_desc'].apply(lambda x: remove_tokens(x, tokens_to_remove))


In [None]:
# replace tokens
token_replacements = {
    'unsp': 'unspecified',
    'ovrlp': 'overlapping'
}
code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'] = code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'].apply(lambda x: replace_tokens(x, token_replacements))

In [None]:
desc_count = code_counts_desc_df['breast_cancer_diagnosis_desc'].apply(count_words)
desc_cleaned_count = code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'].apply(count_words)
print(f'Original #Words: min: {desc_count.min()}, max: {desc_count.max()}')
print(f'Current #Words: min: {desc_cleaned_count.min()}, max: {desc_cleaned_count.max()}')

### `metastatic_cancer_diagnosis_code`

<span style="color: #ffed29;">
    <ol>
        <li>All of the diagnosis codes are ICD-10 codes. The top 3 codes are C77.3 (54.6%; Secondary and unspecified malignant neoplasm of axilla and upper limb lymph nodes), C79.51 (14.2%; Secondary malignant neoplasm of bone), and C77.9 (5.9%; Secondary and unspecified malignant neoplasm of lymph node, unspecified).
        <li>Patients with code C77.3 are slightly more likely to receive a timely diagnosis.
        <li>Only 9 out of 43 diagnosis codes have metastatic first novel treatments, and these treatments occur in 24 out of all patients in the dataset (i.e., 0.186%). Among these patients, 8 were not diagnosed within 90 days after screening.
    </ol>
</span>

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'metastatic_cancer_diagnosis_code', target_column, [0, 1]))

In [None]:
metastatic_code_counts_df = get_value_counts(df_training_without_duplicates, 'metastatic_cancer_diagnosis_code')
metastatic_code_treatment_df = df_training_without_duplicates[['metastatic_cancer_diagnosis_code', 'metastatic_first_novel_treatment', 'metastatic_first_novel_treatment_type']].drop_duplicates().dropna()
metastatic_code_counts_treatment_df = pd.merge(metastatic_code_counts_df, metastatic_code_treatment_df, left_index=True, right_on='metastatic_cancer_diagnosis_code')
metastatic_code_counts_treatment_df

In [None]:
patients_with_first_novel_treatments_df = df_training_without_duplicates[df_training_without_duplicates['metastatic_first_novel_treatment'].notna()]
patients_with_first_novel_treatments_df

### `metastatic_first_novel_treatment` & `metastatic_first_novel_treatment_type`

<span style="color: #ffed29;">
There are only two metastatic first novel treatments: PEMBROLIZUMAB and OLAPARIB. Both of these treatments are of the type Antineoplastics.
</span>

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'metastatic_first_novel_treatment', target_column, [0, 1]))

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'metastatic_first_novel_treatment_type', target_column, [0, 1]))

### `Region`

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'Region', target_column, [0, 1]))

### `Division`

In [None]:
highlight_nan(get_combined_value_counts(df_training_without_duplicates, 'Division', target_column, [0, 1]))

### `population`, `density`, ..., `veteran`

<span style="color: #ffed29;">
    <ol>
        <li>All patients have complete or partial geographic (zip-code level) demographic data, with the exception of patient ID 224030, whose zip3 code (332) is the only occurrence of this value in the dataset. #TODO: check this
        <li>In addition to that single patient, three patients with the zip3 code 772 also have missing values in columns starting with 'family', 'income_household', 'home', 'rent', 'self_employed', 'farmer', 'poverty', and 'limited_english'. These three patients are the only ones with this specific zip3 code. #TODO: check this
    </ol>
</span>
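
To check whether missing geo-demographic values are tied to whole zip3 codes rather than to individual patients, missingness can be summarized per zip3. The sketch below runs on a made-up frame; `zip3_missingness` is a hypothetical helper, and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# toy data: zip3 332 lacks everything, zip3 772 lacks only family_size
toy_df = pd.DataFrame({
    "patient_zip3": [332, 772, 772, 772, 100, 100],
    "population": [np.nan, 5000.0, 5000.0, 5000.0, 12000.0, 12000.0],
    "family_size": [np.nan, np.nan, np.nan, np.nan, 3.1, 3.1],
})


def zip3_missingness(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Per zip3: number of patients, number with `col` missing, and whether all are missing."""
    is_missing = df[col].isna()
    by_zip3 = is_missing.groupby(df["patient_zip3"])
    summary = pd.DataFrame({
        "n_patients": by_zip3.size(),
        "n_missing": by_zip3.sum(),
    })
    summary["all_missing"] = summary["n_missing"] == summary["n_patients"]
    # keep only the zip3 codes that have at least one missing value
    return summary[summary["n_missing"] > 0]


population_summary = zip3_missingness(toy_df, "population")
family_size_summary = zip3_missingness(toy_df, "family_size")
print(population_summary)
print(family_size_summary)
```

If `all_missing` is true for every listed zip3, the gap comes from the zip-level enrichment source, so imputing from other patients in the same zip3 would not be possible.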

In [None]:
na_population_df = df_training_without_duplicates[df_training_without_duplicates['population'].isna()]
na_population_zip3 = na_population_df['patient_zip3'].unique()
print(na_population_zip3)

In [None]:
df_training_without_duplicates[df_training_without_duplicates['patient_zip3'] == na_population_zip3[0]]

In [None]:
na_family_size_df = df_training_without_duplicates[df_training_without_duplicates['family_size'].isna()]
na_family_size_zip3 = na_family_size_df['patient_zip3'].unique()
print(na_family_size_zip3)

In [None]:
df_training_without_duplicates[df_training_without_duplicates['patient_zip3'] == na_family_size_zip3[0]]

### `Ozone`, `PM25`, `N02`

<span style="color: #ffed29;">
29 patients with the zip3 codes 967, 968, 998, 995, 996, and 999 lack information about environmental hazards. These patients are the only ones with these specific zip3 codes. #TODO: check this
</span>

In [None]:
na_ozone_df = df_training_without_duplicates[df_training_without_duplicates['Ozone'].isna()]
na_ozone_zip3 = na_ozone_df['patient_zip3'].unique()
print(na_ozone_zip3)

In [None]:
df_training_without_duplicates[df_training_without_duplicates['patient_zip3'].isin(na_ozone_zip3)]

## Check (Linear) Correlation

<span style="color: #ffed29;">
Among all originally numerical columns:
    <ol>
        <li>Slight positive correlation: e.g., patient_age, education_bachelors, patient_zip3, income_individual_median, home_value
        <li>Slight negative correlation: e.g., education_less_highschool, widowed, income_household_25_to_35, health_uninsured, commute_time
    </ol>
</span>
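
For reference, the Pearson correlation between a numerical column and a binary target is the point-biserial correlation, so it only captures linear, monotone shifts between the two target groups. A minimal sketch on invented values (the column names are borrowed from the dataset, the numbers are not):

```python
import pandas as pd

# toy data: older patients and shorter commutes coincide with the positive class
toy_df = pd.DataFrame({
    "patient_age": [35, 42, 58, 63, 71, 49],
    "commute_time": [31.0, 29.5, 24.0, 26.0, 22.5, 30.0],
    "DiagPeriodL90D": [0, 0, 1, 1, 1, 0],
})

# correlation of every numerical column with the target, strongest positive first
target_corr = (
    toy_df.corr(numeric_only=True)["DiagPeriodL90D"]
    .drop("DiagPeriodL90D")
    .sort_values(ascending=False)
)
print(target_corr)
```

A column such as `patient_zip3` is a code rather than a quantity, so its correlation coefficient should be read with caution.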

In [None]:
corr_df = df_training_without_duplicates.corr(numeric_only=True)
corr_df[target_column].sort_values(ascending=False)

## Collect Edge Cases

In [None]:
edge_case_dfs = [male_code_df, secondary_code_df, na_family_size_df, na_ozone_df]  # na_population_df is a part of na_family_size_df
edge_case_notes = ['male code', 'secondary code', 'na family size', 'na ozone']

results = []
for df, note in zip(edge_case_dfs, edge_case_notes):
    extracted_df = df[[target_column]]
    extracted_df['note'] = note
    results.append(extracted_df)
df_edge_cases = pd.concat(results, ignore_index=False)

# Data Preparation

### TfidfVectorizer

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
columns_to_remove = ['patient_id', 'patient_gender']
categorical_columns = ['patient_race', 'payer_type', 'patient_state', 
                       'patient_zip3',  
                       'breast_cancer_diagnosis_code', 'metastatic_cancer_diagnosis_code', 
                       'metastatic_first_novel_treatment', 'metastatic_first_novel_treatment_type', 
                       'Region', 'Division']
textual_column = 'breast_cancer_diagnosis_desc'
textual_column_cleaned = textual_column + '_cleaned'
numerical_columns = df_training_without_duplicates.columns.difference([target_column] + 
                                                                      columns_to_remove + 
                                                                      categorical_columns + 
                                                                      [textual_column]).to_list()

In [None]:
# 1. remove patient_id, patient_gender
df_training_without_ids_genders = df_training_without_duplicates.drop(columns=columns_to_remove)

# 2. categorical -> numerical
# fill NaN values with 'unknown'
df_training_without_ids_genders[categorical_columns] = df_training_without_ids_genders[categorical_columns].fillna('unknown')
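# encode each column with its own encoder and keep it for reuse on the test set
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df_training_without_ids_genders[col] = le.fit_transform(df_training_without_ids_genders[col])
    label_encoders[col] = le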

# 3. textual -> numerical
# clean
df_training_without_ids_genders[textual_column_cleaned] = df_training_without_ids_genders[textual_column].apply(lambda x: remove_tokens(x, tokens_to_remove))
df_training_without_ids_genders[textual_column_cleaned] = df_training_without_ids_genders[textual_column_cleaned].apply(lambda x: replace_tokens(x, token_replacements))
# tfidf  #TODO: other options?
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')  #TODO: ngram
tfidf_matrix = vectorizer.fit_transform(df_training_without_ids_genders[textual_column_cleaned])
feature_names = vectorizer.get_feature_names_out()
prefix = 'tfidf_'
new_feature_names = [prefix + name for name in feature_names]
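# reuse the training index so that the axis=1 concat aligns rows even if the index is not 0..n-1
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=new_feature_names, index=df_training_without_ids_genders.index)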
df_training_with_text_encoded = pd.concat([df_training_without_ids_genders, tfidf_df], axis=1)
df_training_prepared = df_training_with_text_encoded.drop(columns=[textual_column, textual_column_cleaned])

# 4. numerical
# fill NaN values with -1.0  #TODO: best practices?
df_training_prepared[numerical_columns] = df_training_prepared[numerical_columns].fillna(-1.0)

df_training_prepared.info()
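
Regarding the TODO on filling missing numerical values: a sentinel such as -1.0 can distort linear models after scaling. One common alternative is median imputation plus a missing-value indicator, fitted on the training split only. The sketch below uses scikit-learn's `SimpleImputer` on hypothetical toy arrays; it is not what this notebook does.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing values in both columns.
X_train_toy = np.array([[1.0, 10.0],
                        [2.0, np.nan],
                        [3.0, 30.0],
                        [np.nan, 50.0]])
X_valid_toy = np.array([[np.nan, 20.0]])

# Medians are learned from the training split only; add_indicator appends
# one 0/1 column per feature that had missing values during fit.
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train_toy)
X_valid_imputed = imputer.transform(X_valid_toy)

print(X_train_imputed)
print(X_valid_imputed)
```

The indicator columns let a model learn whether missingness itself is informative, for example for the zip3 codes without environmental data.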

### Data Partition

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df_training_prepared.drop(target_column, axis=1)
y = df_training_prepared[target_column]

X_train, X_validation, y_train, y_validation, train_indices, validation_indices = train_test_split(
    X, y, np.arange(len(X)), test_size=0.2, random_state=random_state
)
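
If the target is imbalanced, as the later oversampling step suggests, a purely random split can shift the class ratio between the two partitions. A minimal sketch of a stratified split on hypothetical toy data, using the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 positives, 20 negatives.
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio (approximately) equal in both partitions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(f'Positive rate in train: {y_tr.mean():.2f}, in validation: {y_va.mean():.2f}')
```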

In [None]:
def determine_partition(index: int, train_indices: np.ndarray, validation_indices: np.ndarray) -> str:
    """
    Determine the partition ('train' or 'validation') for a given index.

    Parameters:
    - index (int): The index to check.
    - train_indices (numpy.ndarray): Array of indices for the training set.
    - validation_indices (numpy.ndarray): Array of indices for the validation set.

    Returns:
    - str: 'train' if the index is in train_indices, 'validation' if the index is in validation_indices.

    Raises:
    - ValueError: If the index is not found in either train_indices or validation_indices.
    """
    if index in train_indices:
        return 'train'
    elif index in validation_indices:
        return 'validation'
    else:
        raise ValueError(f"Index {index} not found in either train_indices or validation_indices")

In [None]:
df_edge_cases['partition'] = df_edge_cases.index.map(lambda idx: determine_partition(idx, train_indices, validation_indices))
df_edge_cases['prediction'] = np.nan
edge_case_validation_indices = df_edge_cases[df_edge_cases['partition'] == 'validation'].index.to_list()
edge_case_prediction_indices = [np.where(validation_indices == idx)[0][0] for idx in edge_case_validation_indices]
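
Calling `determine_partition` once per row scans the index arrays each time. A vectorized variant with `np.isin` gives the same labels in one pass. The sketch below uses hypothetical toy indices and, like the cell above, assumes the DataFrame index labels coincide with the positional indices passed to `train_test_split`.

```python
import numpy as np
import pandas as pd

# Hypothetical positional indices produced by a split.
train_idx = np.array([0, 2, 3, 5])
valid_idx = np.array([1, 4])

# Hypothetical edge-case frame whose index refers to rows of the full data.
edge_df = pd.DataFrame({'note': ['a', 'b', 'c']}, index=[4, 0, 5])

# Label every row at once; rows found in neither array get 'unknown'.
labels = np.select(
    [np.isin(edge_df.index, train_idx), np.isin(edge_df.index, valid_idx)],
    ['train', 'validation'],
    default='unknown',
)
if (labels == 'unknown').any():
    raise ValueError('Some indices were not found in either partition')

edge_df['partition'] = labels
print(edge_df)
```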

### Feature Scaling

In [None]:
# Standardization (mean=0, variance=1) suits models that are sensitive to feature scale, such as regularized linear/logistic regression and SVMs with an RBF kernel; it does not bound values to a fixed range.
# Min-Max Scaling (min=0, max=1) suits models that expect bounded inputs (e.g., neural networks) or rely on distances (e.g., k-nearest neighbors), but it is sensitive to outliers.
# Tree-based models, like decision trees and random forests, split on thresholds, so they are not affected by the scale of the features and do not require normalization.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_validation_scaled = scaler.transform(X_validation)
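
To make the difference between the two scalers concrete, the sketch below fits both on a hypothetical training column and applies them to a validation value outside the training range. Because the statistics come from the training data only, the min-max output can exceed 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data.
train_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
valid_toy = np.array([[6.0]])  # lies outside the training range

# Fit on the training data only, then reuse the fitted statistics.
standard = StandardScaler().fit(train_toy)
min_max = MinMaxScaler().fit(train_toy)

train_standardized = standard.transform(train_toy)
valid_min_max = min_max.transform(valid_toy)

print('standardized train:', train_standardized.ravel())
print('min-max train:', min_max.transform(train_toy).ravel())
print('validation (standardized, min-max):', standard.transform(valid_toy).ravel(), valid_min_max.ravel())
```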

### RandomOverSampler

In [None]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

In [None]:
ros = RandomOverSampler(random_state=random_state)
X_train_scaled_resampled, y_train_resampled = ros.fit_resample(X_train_scaled, y_train)
print(f'Target Distribution (after Data Splitting): {Counter(y_train)}')
print(f'Target Distribution (after Resampling): {Counter(y_train_resampled)}')

In [None]:
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
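
Conceptually, random oversampling duplicates randomly chosen minority-class rows until the classes are balanced. The NumPy-only sketch below illustrates the idea on hypothetical toy data; `random_oversample` is an illustrative helper, not the imbalanced-learn implementation. As in the cells above, resampling should touch the training split only, so the validation set keeps the original class ratio.

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate random rows of each smaller class until all classes match the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target_count = counts.max()
    indices = [np.arange(len(y))]  # keep every original row
    for cls, count in zip(classes, counts):
        if count < target_count:
            cls_idx = np.flatnonzero(y == cls)
            # Sample with replacement to fill the gap to the majority count.
            indices.append(rng.choice(cls_idx, size=target_count - count, replace=True))
    all_idx = np.concatenate(indices)
    return X[all_idx], y[all_idx]


X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([1] * 7 + [0] * 3)
X_res, y_res = random_oversample(X_toy, y_toy)
print(f'Before: {Counter(y_toy.tolist())}, after: {Counter(y_res.tolist())}')
```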

# Modeling

In [None]:
def highlight_diff(df: pd.DataFrame, col1: str, col2: str):
    """
    Highlight rows where the values in the specified columns are different for non-missing rows.

    Parameters:
    df (pd.DataFrame): The DataFrame to apply the styling to.
    col1 (str): The name of the first column to compare.
    col2 (str): The name of the second column to compare.

    Returns:
    pandas.io.formats.style.Styler: A Styler that renders the DataFrame with the differing rows highlighted.
    """
    def highlight_row(row):
        if (row[col1] != row[col2]) and (pd.notna(row[col1])) and (pd.notna(row[col2])):
            return ['background-color: yellow'] * len(row)
        return [''] * len(row)
    
    return df.style.apply(highlight_row, axis=1)

## Metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, balanced_accuracy_score, roc_auc_score, classification_report

In [None]:
def evaluate_model(y_gt: List[int], y_pred: List[int], y_prob: List[float]) -> pd.DataFrame:
    """
    Evaluate a binary classification model using various metrics.

    Parameters:
    -----------
    y_gt : List[int]
        Ground truth labels (0 or 1).
    y_pred : List[int]
        Predicted labels (0 or 1).
    y_prob : List[float]
        Probability estimates for the positive class (1).

    Returns:
    --------
    pd.DataFrame
        DataFrame containing the evaluation metrics: accuracy, precision, recall, f1_score, balanced_accuracy, and auc.

    Metrics:
    --------
    - accuracy: (TP + TN) / (TP + TN + FP + FN)
    - precision: TP / (TP + FP)
      The ability of the classifier not to label as positive a sample that is negative.
    - recall/TPR: TP / (TP + FN)
      The ability of the classifier to find all the positive samples.
    - f1_score: 2 * precision * recall / (precision + recall)
    - balanced_accuracy
      The average of recall obtained on each class.
    - FPR: FP / (FP + TN)
      The proportion of actual negative cases that are incorrectly identified as positive by the model.
    - auc
      Calculated using TPR and FPR at various threshold settings.
      https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
    """
    accuracy = accuracy_score(y_gt, y_pred)
    precision = precision_score(y_gt, y_pred, pos_label=1, average='binary')
    recall = recall_score(y_gt, y_pred, pos_label=1, average='binary')
    f1 = f1_score(y_gt, y_pred, pos_label=1, average='binary')
    balanced_accuracy = balanced_accuracy_score(y_gt, y_pred)
    auc = roc_auc_score(y_gt, y_prob)
    return pd.DataFrame([[accuracy, precision, recall, f1, balanced_accuracy, auc]], 
                        columns=['accuracy', 'precision', 'recall', 'f1_score', 'balanced_accuracy', 'auc'])
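
AUC also has a ranking interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by comparing all positive/negative pairs on a hypothetical four-sample example; `pairwise_auc` is an illustrative helper, and `roc_auc_score` is what `evaluate_model` actually uses.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))


auc_toy = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_toy)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Because only the ordering of the scores matters, AUC does not depend on the 0.5 threshold that `predict` applies, unlike accuracy, precision, and recall.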

## Logistic Regression

<span style="color: #ffed29;">
    <ol>
        <li>With the L2 penalty, the top coefficients are similar, except that 'population' becomes a top positive coefficient; no coefficient is exactly zero anymore.</li>
        <li>With resampled data, the recall for class 0 increases by 0.01, while all other metrics degrade slightly.</li>
     </ol>
</span>

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(penalty="l1", C=1.0, random_state=random_state, solver="liblinear", max_iter=100, verbose=0)
model.fit(X_train_scaled, y_train)

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
})
feature_importance.sort_values(by='Coefficient', ascending=False, inplace=True)

# use double quotes inside the f-strings: reusing the outer quote type is a SyntaxError before Python 3.12
print(f'Top 10 Features with Positive Coefficients: \n{feature_importance.head(10)["Feature"].to_list()}\n')
print(f'Top 10 Features with Negative Coefficients: \n{feature_importance.tail(10)["Feature"].to_list()}\n')
print(f'All Features with Zero Coefficients: \n{feature_importance[feature_importance["Coefficient"]==0.]["Feature"].to_list()}')
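
The zero coefficients come from the L1 penalty, which can shrink weights exactly to zero, whereas L2 only shrinks them toward zero. The sketch below counts zero coefficients under both penalties on synthetic data from `make_classification`. The data and the value `C=0.05` are assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 4 informative features and 16 pure-noise features.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)
X_toy = StandardScaler().fit_transform(X_toy)

zero_counts = {}
for penalty in ('l1', 'l2'):
    clf = LogisticRegression(penalty=penalty, C=0.05, solver='liblinear', random_state=0)
    clf.fit(X_toy, y_toy)
    # Count coefficients that are exactly zero.
    zero_counts[penalty] = int(np.sum(clf.coef_[0] == 0.0))

print(f'Zero coefficients per penalty: {zero_counts}')
```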

In [None]:
y_pred = model.predict(X_validation_scaled)
y_prob = model.predict_proba(X_validation_scaled)[:, 1]
evaluate_model(y_validation, y_pred, y_prob)

In [None]:
print(classification_report(y_validation, y_pred))

In [None]:
df_edge_cases.loc[df_edge_cases['partition'] == 'validation', 'prediction'] = y_pred[edge_case_prediction_indices]
highlight_diff(df_edge_cases, target_column, 'prediction')

## Decision Tree

<span style="color: #ffed29;">
    <ol>
        <li>Using resampled data slightly improves the AUC.
        <li>The most important features differ from those in logistic regression. For example, 'education_highschool', 'income_household_six_figure', and 'education_college_or_above' are considered important here but irrelevant in logistic regression.
     </ol>
</span>

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree

In [None]:
model = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=random_state)
model.fit(X_train_resampled, y_train_resampled)

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
})
feature_importance.sort_values(by='Importance', ascending=False, inplace=True)

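# use double quotes inside the f-strings: reusing the outer quote type is a SyntaxError before Python 3.12
print(f'10 Most Important Features: \n{feature_importance.head(10)["Feature"].to_list()}\n')
print(f'10 Least Important Features: \n{feature_importance.tail(10)["Feature"].to_list()}\n')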

print(f'Depth of the tree: {model.get_depth()}')
print(f'Number of leaves: {model.get_n_leaves()}')

In [None]:
plt.figure(figsize=(100, 50))
plot_tree(model, feature_names=X.columns.to_list(), filled=True)  # some scikit-learn versions reject a pandas Index here
plt.show()
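
A plotted tree of depth 5 can be hard to read even at a large figure size. As a possible alternative, scikit-learn's `export_text` prints the learned rules as indented text. The sketch below uses synthetic data and hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature names.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names_toy = [f'feature_{i}' for i in range(X_toy.shape[1])]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# One line per split or leaf, indented by depth.
rules = export_text(tree, feature_names=feature_names_toy)
print(rules)
```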

In [None]:
y_pred = model.predict(X_validation)
y_prob = model.predict_proba(X_validation)[:, 1]
evaluate_model(y_validation, y_pred, y_prob)

In [None]:
print(classification_report(y_validation, y_pred))

In [None]:
df_edge_cases.loc[df_edge_cases['partition'] == 'validation', 'prediction'] = y_pred[edge_case_prediction_indices]
highlight_diff(df_edge_cases, target_column, 'prediction')

## Random Forest

<span style="color: #ffed29;">
    <ol>
        <li>Using resampled data slightly degrades the AUC.
        <li>The most important features differ from those in logistic regression. For example, 'tfidf_quadrant', 'tfidf_left', 'tfidf_upper' and 'tfidf_specified' are considered important here but irrelevant in logistic regression.
     </ol>
</span>

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_estimators=115, criterion='gini', max_depth=5, 
                               oob_score=True, random_state=random_state, verbose=0)
model.fit(X_train, y_train)

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
})
feature_importance.sort_values(by='Importance', ascending=False, inplace=True)

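# use double quotes inside the f-strings: reusing the outer quote type is a SyntaxError before Python 3.12
print(f'10 Most Important Features: \n{feature_importance.head(10)["Feature"].to_list()}\n')
print(f'10 Least Important Features: \n{feature_importance.tail(10)["Feature"].to_list()}\n')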

print(f'Out-of-bag score: {model.oob_score_}')
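
Impurity-based importances are computed on the training data and tend to favor features with many distinct values, which may partly explain why the rankings differ between models. Permutation importance on the validation split is a model-agnostic cross-check. The sketch below applies `sklearn.inspection.permutation_importance` to synthetic data. The data, the forest settings, and `n_repeats=5` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 3 informative features out of 8.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in AUC.
result = permutation_importance(forest, X_va, y_va, scoring='roc_auc', n_repeats=5, random_state=0)

ranking = result.importances_mean.argsort()[::-1]
for i in ranking[:3]:
    print(f'feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}')
```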

In [None]:
y_pred = model.predict(X_validation)
y_prob = model.predict_proba(X_validation)[:, 1]
evaluate_model(y_validation, y_pred, y_prob)

In [None]:
print(classification_report(y_validation, y_pred))

In [None]:
df_edge_cases.loc[df_edge_cases['partition'] == 'validation', 'prediction'] = y_pred[edge_case_prediction_indices]
highlight_diff(df_edge_cases, target_column, 'prediction')

# Testing

## Retrain

## Test