# Introduction

## [Overview](https://www.kaggle.com/competitions/widsdatathon2024-challenge1/overview)
- Objective: Develop a model to predict if patients received a metastatic cancer diagnosis within 90 days of screening (i.e., `DiagPeriodL90D`) using a unique oncology dataset.
- Motivation: 
    - Metastatic triple-negative breast cancer (TNBC) is considered the most aggressive form of TNBC and requires urgent, timely treatment. Unnecessary delays in diagnosis and subsequent treatment can have devastating effects in these difficult cancers. Differences in the wait time to get treatment are a good proxy for disparities in healthcare access.
    - The primary goal of building these models is to detect relationships between patient demographics and the likelihood of getting timely treatment. The secondary goal is to see if environmental hazards impact proper diagnosis and treatment.  #TODO: check this
- Dataset
    - Source: Provided by Gilead Sciences, originating from Health Verity and enriched with third-party geo-demographic data and zip code level toxicology data from NASA/Columbia University.
    - Content: Information about demographics, diagnosis, treatment options, and insurance for patients diagnosed with breast cancer from 2015-2018.
    - Highlighted Features:
        - Demographics (e.g., age, gender, race, ...)
        - Diagnosis and treatment details (e.g., breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, ...)
        - Insurance information
        - Geo (zip-code level) demographic data (e.g., income, education, rent, race, poverty, ...)
        - Toxic air quality (zip-code level) data (e.g., Ozone, PM25 and NO2, ...)
- Evaluation
    - Metric: Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC).
    - Leaderboard:
        - During the competition: 51% of the test data.
        - Final standings: 49% of the test data.
- Submission Format
    - File Format: CSV
    - Columns:
        - `patient_id` (integer)
        - `DiagPeriodL90D` (predicted probability between 0 and 1)
    - Example:
        ```
        patient_id,DiagPeriodL90D
        372069,.5
        981264,.5
        ```
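
The metric and the file format above can be sketched with the standard library alone. This is a minimal illustration, not the competition's scoring code: `roc_auc` and `write_submission` are hypothetical helper names, the labels and scores are made up, and the AUC is computed by its pairwise definition (the share of positive/negative pairs ranked correctly, with ties counted as half).

```python
import csv
import io
from typing import Sequence, TextIO


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


def write_submission(patient_ids: Sequence[int], probabilities: Sequence[float], handle: TextIO) -> None:
    """Write the two-column submission file: patient_id, DiagPeriodL90D."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["patient_id", "DiagPeriodL90D"])
    for patient_id, probability in zip(patient_ids, probabilities):
        writer.writerow([patient_id, probability])


# made-up labels and scores: 3 of the 4 positive/negative pairs are ranked correctly
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
print(toy_auc)

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
write_submission([372069, 981264], [0.5, 0.5], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

The quadratic pairwise loop is only meant to make the definition visible; for the full dataset a library implementation would be the practical choice.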

## [Dataset](https://www.kaggle.com/competitions/widsdatathon2024-challenge1/data)
Roughly 18.7k records in total; each corresponds to a single patient and her diagnosis period.
- `training.csv`
    - 12906 records
    - 83 columns, the last column is `DiagPeriodL90D` (int64)
- `test.csv`
    - 5792 records
    - 82 columns

### Columns

| Column Name | Meaning | Notes Before EDA |
|-------------|---------|------------------|
| `patient_id` | Unique identification number of patient | only useful in data pre- and post-processing, but irrelevant to target prediction |
| `patient_race` | <span style="color: #57b9ff;">Asian, African American, Hispanic or Latino, White, Other Race</span> | races + nan, nan -> "unknown", encode |
| `payer_type` | <span style="color: #57b9ff;">Payer type at Medicaid, Commercial, Medicare on the metastatic date</span> | payers + nan, nan -> "unknown", encode |
| `patient_state` | <span style="color: #57b9ff;">Patient state (e.g., AL, AK, AZ, AR, CA, CO, etc.) on the metastatic date</span> | states + nan, nan -> "unknown", encode |
| `patient_zip3` | Patient Zip3 (e.g., 190) on the metastatic date | int64, encode? |
| `patient_age` | Derived from Patient Year of Birth (index year minus year of birth) | int64 |
| `patient_gender` | <span style="color: #57b9ff;">F, M on the metastatic date</span> | genders, encode |
| `bmi` | If available, will show available BMI (e.g., 24) information (Earliest BMI recording post metastatic date) | float64 + nan, nan -> -1? |
| `breast_cancer_diagnosis_code` | <span style="color: #57b9ff;">ICD10 (e.g., C50412) or ICD9 (e.g., 1748) diagnoses code</span> | codes, encode |
| `breast_cancer_diagnosis_desc` | <span style="color: #ffa500;">ICD10 or ICD9 code description. This column is raw text and may require NLP/processing and cleaning</span> | length?, encode, which tokenizer?, token diversity?, fine-tune tokenizer? |
| `metastatic_cancer_diagnosis_code` | <span style="color: #57b9ff;">ICD10 (e.g., C773) diagnoses code</span> | codes, encode |
| `metastatic_first_novel_treatment` | <span style="color: #57b9ff;">Generic drug name of the first novel treatment (e.g., "Cisplatin") after metastatic diagnosis</span> | very few, very timely diagnosis? nan -> "unknown" |
| `metastatic_first_novel_treatment_type` | <span style="color: #57b9ff;">Description of the treatment (e.g., Antineoplastic) of the first novel treatment after metastatic diagnosis</span> | very few, very timely diagnosis? nan -> "unknown" |
| `Region` | <span style="color: #57b9ff;">Region of patient location (e.g., Midwest)</span> | regions + nan, nan -> "unknown", encode |
| `Division` | <span style="color: #57b9ff;">Division of patient location (e.g., East North Central)</span> | divisions + nan, nan -> "unknown", encode |
| `population` | An estimate of the zip code's population | one record missing!, nan -> -1?, remove record? |
| `density` | The estimated population per square kilometer | one record missing!, nan -> -1?, remove record? |
| `age_median` | The median age of residents in the zip code | which are irrelevant? aggregate? nan -> -1? |
| `male` | The percentage of residents who report being male (e.g., 55.1) | |
| `female` | The percentage of residents who report being female (e.g., 44.9) | |
| `married` | The percentage of residents who report being married (e.g., 44.9) | |
| `family_size` | The average size of resident families (e.g., 3.22) | |
| `income_household_median` | Median household income in USD | |
| `income_household_six_figure` | Percentage of households that earn at least $100,000 (e.g., 25.3) | |
| `home_ownership` | Percentage of households that own (rather than rent) their residence | |
| `housing_units` | The number of housing units (or households) in the zip code | |
| `home_value` | The median value of homes that are owned by residents | |
| `rent_median` | The median rent paid by renters | |
| `education_college_or_above` | The percentage of residents with at least a 4-year degree | |
| `labor_force_participation` | The percentage of residents 16 and older in the labor force | |
| `unemployment_rate` | The percentage of residents unemployed | |
| `race_white` | The percentage of residents who report their race as White | |
| `race_black` | The percentage of residents who report their race as Black or African American | |
| `race_asian` | The percentage of residents who report their race as Asian | |
| `race_native` | The percentage of residents who report their race as American Indian and Alaska Native | |
| `race_pacific` | The percentage of residents who report their race as Native Hawaiian and Other Pacific Islander | |
| `race_other` | The percentage of residents who report their race as Some other race | |
| `race_multiple` | The percentage of residents who report their race as Two or more races | |
| `hispanic` | The percentage of residents who report being Hispanic. Note: Hispanic is considered to be an ethnicity and not a race | |
| `age_under_10` | The percentage of residents aged 0-9 | |
| `age_10_to_19` | The percentage of residents aged 10-19 | |
| `age_20s` | The percentage of residents aged 20-29 | |
| `age_30s` | The percentage of residents aged 30-39 | |
| `age_40s` | The percentage of residents aged 40-49 | |
| `age_50s` | The percentage of residents aged 50-59 | |
| `age_60s` | The percentage of residents aged 60-69 | |
| `age_70s` | The percentage of residents aged 70-79 | |
| `age_over_80` | The percentage of residents aged over 80 | |
| `divorced` | The percentage of residents divorced | |
| `never_married` | The percentage of residents never married | |
| `widowed` | The percentage of residents widowed | |
| `family_dual_income` | The percentage of families with dual income earners | |
| `income_household_under_5` | The percentage of households with income under $5,000 | |
| `income_household_5_to_10` | The percentage of households with income from $5,000-$10,000 | |
| `income_household_10_to_15` | The percentage of households with income from $10,000-$15,000 | |
| `income_household_15_to_20` | The percentage of households with income from $15,000-$20,000 | |
| `income_household_20_to_25` | The percentage of households with income from $20,000-$25,000 | |
| `income_household_25_to_35` | The percentage of households with income from $25,000-$35,000 | |
| `income_household_35_to_50` | The percentage of households with income from $35,000-$50,000 | |
| `income_household_50_to_75` | The percentage of households with income from $50,000-$75,000 | |
| `income_household_75_to_100` | The percentage of households with income from $75,000-$100,000 | |
| `income_household_100_to_150` | The percentage of households with income from $100,000-$150,000 | |
| `income_household_150_over` | The percentage of households with income over $150,000 | |
| `income_individual_median` | The median income of individuals in the zip code | |
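| `poverty` | The median value of owner-occupied homes | description duplicates `home_value`; more likely the percentage of residents in poverty, verify during EDA |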
| `rent_burden` | The median rent as a percentage of the median renter's household income | |
| `education_less_highschool` | The percentage of residents with less than a high school education | |
| `education_highschool` | The percentage of residents with a high school diploma but no more | |
| `education_some_college` | The percentage of residents with some college but no more | |
| `education_bachelors` | The percentage of residents with a bachelor's degree (or equivalent) but no more | |
| `education_graduate` | The percentage of residents with a graduate degree | |
| `education_stem_degree` | The percentage of college graduates with a Bachelor's degree or higher in a Science and Engineering (or related) field | |
| `self_employed` | The percentage of households reporting self-employment income on their 2016 IRS tax return | |
| `farmer` | The percentage of households reporting farm income on their 2016 IRS tax return | |
| `disabled` | The percentage of residents who report a disability | |
| `limited_english` | The percentage of residents who only speak limited English | |
| `commute_time` | The median commute time of resident workers in minutes | |
| `health_uninsured` | The percentage of residents who report not having health insurance | |
| `veteran` | The percentage of residents who are veterans | |
| `Ozone` | Annual Ozone (O3) concentration data at Zip3 level. This data shows how air quality data may impact health | many records missing, nan -> -1? |
| `PM25` | Annual Fine Particulate Matter (PM2.5) concentration data at Zip3 level. This data shows how air quality data may impact health | many records missing, nan -> -1? |
| `N02` | Annual Nitrogen Dioxide (NO2) concentration data at Zip3 level. This data shows how air quality data may impact health | many records missing, nan -> -1? |
| `DiagPeriodL90D` | Diagnosis period being less than 90 days (target: 1 if yes, 0 otherwise) | |

# Exploratory Data Analysis

Thoughts:
- EDA: compare each column's distribution between target classes 0 and 1.
- Don't hastily remove records with empty columns: the missingness pattern itself may reflect patient characteristics that correlate with the target.
- Apply the same pre- and post-processing to the training and test data.
- Use logistic regression as a preliminary step to discover relevant columns.
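
One way to keep the missingness signal, rather than dropping rows, is to add a 0/1 indicator per column and reuse the same column list for the training and test frames. The sketch below is only an illustration on made-up toy data; `add_missing_indicators` and the `_missing` suffix are hypothetical names, not part of the competition code.

```python
from typing import List

import numpy as np
import pandas as pd

# toy stand-ins for the training and test frames (values are made up)
toy_train = pd.DataFrame({
    "bmi": [24.0, np.nan, 31.5],
    "patient_age": [45, 62, 58],
    "DiagPeriodL90D": [1, 0, 1],
})
toy_test = pd.DataFrame({
    "bmi": [np.nan, 27.1],
    "patient_age": [71, 39],
})


def add_missing_indicators(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame:
    """Return a copy of df with a 0/1 `<col>_missing` flag for each listed column."""
    out = df.copy()
    for col in columns:
        out[f"{col}_missing"] = df[col].isna().astype("int64")
    return out


# decide which columns to flag from the training data only,
# then apply the identical list to both frames
flag_columns = [col for col in toy_train.columns if toy_train[col].isna().any()]
train_flagged = add_missing_indicators(toy_train, flag_columns)
test_flagged = add_missing_indicators(toy_test, flag_columns)
print(train_flagged)
print(test_flagged)
```

Deriving `flag_columns` from the training frame alone keeps the feature set identical across both files, even if a column happens to be complete in one of them.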

Questions:
- Which columns need normalization?
- Unify numericals to float64?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from typing import List, Tuple, Optional, Union

In [None]:
# seaborn style (default)
sns.set_theme(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1, color_codes=True, rc=None)

# font sizes
title_fontsize = 14
label_fontsize = 12
tick_fontsize = 10
text_fontsize = 10

# random state
random_state = 42

# copy-on-write
pd.set_option("mode.copy_on_write", True)

# maximum column width, number of rows and columns to display
pd.set_option('display.max_colwidth', None)  # no limit
pd.set_option('display.max_rows', None)      # no limit
pd.set_option('display.max_columns', None)   # no limit

In [None]:
df_training = pd.read_csv('data/training.csv')
df_training.info()

<span style="color: #ffed29;">From the `dtypes` (NumPy data types: `np.float64`, `np.int64`; Python's built-in type: `object`), we know that the missing values in all columns can only be `numpy.nan`, i.e., `NaN`.</span>
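
As a quick illustration of that point, the share of `NaN` per column can be computed in one line. The frame below is toy data with made-up values, not the competition file.

```python
import numpy as np
import pandas as pd

# toy frame with one column per dtype seen in the dataset
toy_df = pd.DataFrame({
    "patient_race": ["White", np.nan, "Asian", np.nan],  # object column
    "bmi": [24.0, np.nan, np.nan, np.nan],                # float64 column
    "patient_age": [45, 62, 58, 71],                      # int64 column, holds no NaN
})

# isna() marks NaN cells; the column mean of that boolean mask is the missing share
missing_pct = toy_df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.to_string())
```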

## Drop Duplicates
Remove records where all attributes are identical except for `patient_id`.


In [None]:
def find_duplicates(df, id_col):
    """
    Identifies and filters duplicates based on all columns except the specified ID column.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    id_col (str): The name of the ID column.
    
    Returns:
    pd.DataFrame: A DataFrame containing only the duplicated rows.
    list: List of columns to check for duplicates.
    """
    columns_to_check = df.columns.difference([id_col]).to_list()
    duplicates_mask = df.duplicated(subset=columns_to_check, keep=False)
    duplicates_df = df[duplicates_mask].copy()
    return duplicates_df, columns_to_check

In [None]:
# find duplicates except the `patient_id` column
duplicates_df, columns_to_check = find_duplicates(df_training, 'patient_id')
first_occurrence_df = duplicates_df.drop_duplicates(subset=columns_to_check)
print('All Duplicated Rows (including the first occurrence):')
print(duplicates_df.to_string(index=False))
print(f'\nTotal Number of Duplicated Rows (excluding the first occurrence): {len(duplicates_df) - len(first_occurrence_df)}')

In [None]:
# remove duplicates
df_training_without_duplicates = df_training.drop_duplicates(subset=columns_to_check)
print(f'Removed {len(df_training) - len(df_training_without_duplicates)} duplicated rows.')
print(f'Current #Patients: {len(df_training_without_duplicates)}')
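
The difference between `keep=False` (used to list every copy) and the default `keep='first'` (used when dropping) is easiest to see on a tiny example. The frame below is made-up toy data.

```python
import pandas as pd

# patients 1, 2 and 4 are identical apart from patient_id
toy_df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "patient_age": [50, 50, 61, 50],
    "payer_type": ["Commercial", "Commercial", "Medicaid", "Commercial"],
})
cols = toy_df.columns.difference(["patient_id"]).to_list()

# keep=False flags every member of a duplicate group, including the first
all_copies = toy_df[toy_df.duplicated(subset=cols, keep=False)]

# drop_duplicates keeps the first occurrence of each group by default
deduplicated = toy_df.drop_duplicates(subset=cols)

print(all_copies["patient_id"].tolist())
print(deduplicated["patient_id"].tolist())
```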

## Inspect Columns

In [None]:
def get_value_counts(df: pd.DataFrame, col: str, normalize: bool = True, dropna: bool = False) -> pd.DataFrame:
    """
    Computes the value counts for a specified column in a DataFrame, optionally normalizing and dropping missing values.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    col (str): The name of the column to compute value counts for.
    normalize (bool, optional): If True, the value counts will be normalized to percentages. Defaults to True.
    dropna (bool, optional): If True, missing values will be dropped from the value counts. Defaults to False.

    Returns:
    pd.DataFrame: A DataFrame containing the value counts (*100 if normalized).
    """
    # compute value counts
    value_counts = df[col].value_counts(normalize=normalize, dropna=dropna)
    
    # multiply by 100 if normalized
    if normalize:
        value_counts *= 100
    
    return value_counts.to_frame()

In [None]:
def plot_category_distribution(df: pd.DataFrame, cat_col: str, index_labels: Optional[List[Tuple[int, str]]] = None) -> None:
    """
    Plots the distribution of a categorical column in a DataFrame, normalizing the value counts and optionally setting a specific order.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    cat_col (str): The name of the categorical column to plot.
    index_labels (Optional[List[Tuple[int, str]]]): A list of tuples where each inner tuple contains an index and its corresponding label.
    """
    # compute value counts
    series = df[cat_col].copy()
    series.name = None
    value_counts_normalized = series.value_counts(normalize=True, dropna=False)

    if index_labels:
        # define the order of the categories
        sorted_index_labels = sorted(index_labels)
        order = [index for index, _ in sorted_index_labels]
        # reindex the result to set the order
        value_counts_normalized = value_counts_normalized.reindex(order)

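    # plot (proportions are converted to percentages to match the title)
    (value_counts_normalized * 100).plot(kind='barh', figsize=(10, min(len(value_counts_normalized.index), 8)))
    plt.title(f'{cat_col} Distribution (%)', fontsize=title_fontsize)
    plt.xlim([0, 100])
    if index_labels:
        # bars sit at positions 0..n-1, so tick positions are positional, not the index values
        plt.yticks(ticks=range(len(order)), labels=[f'{index} ({label})' for index, label in sorted_index_labels], fontsize=tick_fontsize)
    plt.show()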

In [None]:
def plot_histogram_with_percentages(df: pd.DataFrame, num_col: str, bins: Union[int, List[float]]) -> None:
    """
    Plot a histogram for a specified numerical column of a DataFrame, showing the percentage each bin represents.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the data.
    num_col (str): The name of the numerical column to plot.
    bins (int or List[float]): The number of bins or the boundaries of the bins (boundaries may be non-integer, e.g., 18.5).
    """
    # check if the column exists and is numerical and has no missing values
    if num_col not in df.columns:
        raise ValueError(f"Column '{num_col}' not found in the DataFrame.")
    if not pd.api.types.is_numeric_dtype(df[num_col]):
        raise ValueError(f"Column '{num_col}' is not numerical.")
    if df[num_col].isna().any():
        raise ValueError(f"Column '{num_col}' contains missing values. Please handle or remove them before plotting.")
 
    # calculate the weights to represent percentages
    total_count = len(df[num_col])
    weights = [100 / total_count] * total_count

    # plot
    plt.figure(figsize=(10, 6))
    hist = sns.histplot(x=df[num_col], bins=bins, weights=weights)
    for p in hist.patches:
        height = p.get_height()
        hist.text(p.get_x() + p.get_width() / 2., height + 0.5, f'{height:.1f}%', ha="center", fontsize=text_fontsize)
    plt.title(f'{num_col} Distribution', fontsize=title_fontsize)
    plt.xlabel(num_col, fontsize=label_fontsize)
    plt.ylabel('Percentage', fontsize=label_fontsize)
    if isinstance(bins, int):
        bins = pd.cut(df[num_col], bins=bins, retbins=True)[1]
    plt.xticks(ticks=bins, fontsize=tick_fontsize)
    plt.yticks(fontsize=tick_fontsize)
    plt.show()

### `DiagPeriodL90D`

<span style="color: #ffed29;">Approximately 38% of patients did not receive a metastatic cancer diagnosis within 90 days of screening.</span>

In [None]:
get_value_counts(df_training_without_duplicates, 'DiagPeriodL90D')

In [None]:
plot_category_distribution(df_training_without_duplicates, 'DiagPeriodL90D', [(0, '>= 90 Days'), (1, '< 90 Days')])

### `patient_race`

<span style="color: #ffed29;">Race information is unknown for approximately 50% of patients. The only race that comprises more than 10% of the known cases is White, at about 28%.</span>

In [None]:
get_value_counts(df_training_without_duplicates, 'patient_race')

In [None]:
plot_category_distribution(df_training_without_duplicates, 'patient_race')

### `payer_type`

<span style="color: #ffed29;">Around 47% of patients have commercial insurance. The proportions of patients who have Medicaid or Medicare are similar, each close to 20%.</span>

In [None]:
get_value_counts(df_training_without_duplicates, 'payer_type')

In [None]:
plot_category_distribution(df_training_without_duplicates, 'payer_type')

### `patient_state`

<span style="color: #ffed29;">Six states each account for more than 5% of patients: CA (18.9%), TX (8.9%), NY (8.1%), MI (6.6%), IL (6.1%), and OH (5.9%). The tail of the distribution includes RI, NH, and MA, each at 0.00777% (a single patient).</span>

In [None]:
get_value_counts(df_training_without_duplicates, 'patient_state')

In [None]:
plot_category_distribution(df_training_without_duplicates, 'patient_state')

### `patient_zip3`

<span style="color: #ffed29;">There are 739 distinct zip3 codes, each accounting for between 0.00777% and 1.8% of patients.</span>
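
Figures like these can be obtained from the number of distinct codes and the smallest and largest share. A minimal sketch on a made-up toy series (the zip3 values are arbitrary):

```python
import pandas as pd

# toy zip3 column: 5 distinct codes over 8 patients
toy_zip3 = pd.Series([190, 190, 190, 945, 945, 100, 331, 770], name="patient_zip3")

share_pct = toy_zip3.value_counts(normalize=True).mul(100)
summary = {
    "n_codes": toy_zip3.nunique(),
    "min_pct": share_pct.min(),
    "max_pct": share_pct.max(),
}
print(summary)
```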

### `patient_age`

<span style="color: #ffed29;">This dataset includes patients ranging from 18 to 91 years old, with more than 80% of the patients aged between 40 and 80.</span>

In [None]:
age_boundaries = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # bins: [10-20), [20-30), [30-40), [40-50), [50-60), [60-70), [70-80), [80-90), [90-100]
plot_histogram_with_percentages(df_training_without_duplicates, 'patient_age', age_boundaries)

### `patient_gender`

<span style="color: #ffed29;">All patients are female.</span>
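
A column with a single value carries no information for the model, so it can be detected and set aside programmatically. The check below is a small sketch on made-up toy data; `constant_cols` is a hypothetical name.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "patient_gender": ["F", "F", "F"],
    "patient_age": [45, 62, 58],
})

# dropna=False counts NaN as a value, so a column mixing "F" and NaN is not reported as constant
constant_cols = [col for col in toy_df.columns if toy_df[col].nunique(dropna=False) <= 1]
print(constant_cols)
```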

### `bmi`

<span style="color: #ffed29;">Almost 70% of patients' BMI information is not available. For the available records, BMI ranges from 14.0 to 85.0. Most patients fall into the categories of 'Normal weight', 'Pre-obesity', and 'Obesity class I'.</span>

WHO Classification of BMI

| BMI Interval (kg/m²) | Category                |
|----------------------|-------------------------|
| < 18.5               | Underweight             |
| 18.5 - 24.9          | Normal weight           |
| 25 - 29.9            | Pre-obesity             |
| 30 - 34.9            | Obesity class I         |
| 35 - 39.9            | Obesity class II        |
| ≥ 40                 | Obesity class III       |

In [None]:
# double quotes outside, single quotes inside: nested same-type quotes in f-strings fail before Python 3.12
print(f"The proportion of missing BMI: {df_training_without_duplicates['bmi'].isna().mean() * 100:.1f}%")
bmi_df = df_training_without_duplicates['bmi'].dropna().to_frame()
bmi_boundaries = [0, 18.5, 25, 30, 35, 40, 85]  # bins: [0, 18.5), [18.5, 25), [25, 30), [30, 35), [35, 40), [40, 85]
plot_histogram_with_percentages(bmi_df, 'bmi', bmi_boundaries)
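
The WHO table can also be turned into a categorical feature. The sketch below uses `pd.cut` with left-closed bins on a made-up toy series; treating missing BMI as its own `"unknown"` category is an assumption carried over from the column notes, not a settled choice.

```python
import numpy as np
import pandas as pd

# left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, 35), [35, 40), [40, inf)
bmi_edges = [0, 18.5, 25, 30, 35, 40, np.inf]
bmi_labels = [
    "Underweight",
    "Normal weight",
    "Pre-obesity",
    "Obesity class I",
    "Obesity class II",
    "Obesity class III",
]

toy_bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 33.2, 40.0, np.nan], name="bmi")
bmi_category = pd.cut(toy_bmi, bins=bmi_edges, labels=bmi_labels, right=False)

# the new category must exist before NaN can be filled with it
bmi_category = bmi_category.cat.add_categories("unknown").fillna("unknown")
print(bmi_category.tolist())
```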

### `breast_cancer_diagnosis_code`

<span style="color: #ffed29;">
    <ol>
        <li>The most common codes are 174.9 (15.3%), C50.911 (13.9%), C50.912 (13.3%), and C50.919 (11.4%). Many other codes have fewer than 10 patients.
        <li>ICD-10 codes offer more detailed diagnoses for breast cancer compared to ICD-9 codes. While it is possible to map ICD-9 codes to ICD-10 codes, this process can alter the distribution of diagnosis codes. Therefore, direct conversion is not performed. However, some available mappings include: C50.9|174.9, C50.41|174.4, C50.81|174.8, C50.21|174.2, C50.11|174.1, C50.51|174.5, C50.31|174.3, C50.01|174.6. #TODO: experiment on this
        <li>Some codes in the dataset (C50.929, C50.021, 175.9, C50.421) indicate breast cancer in males, which contradicts the information in the `patient_gender` column. These discrepancies suggest human errors in the dataset, possibly introduced during data collection or data preparation. These records are not removed but require additional attention. #TODO: remove/move to validation set
        <li>The code 198.81 represents a secondary malignant neoplasm of the breast, indicating that the cancer has metastasized to the breast from another part of the body. This differs from the other codes in the dataset, which represent primary breast cancer. Only one patient out of nine has `DiagPeriodL90D=1`. This definitely requires attention. #TODO: check this
    </ol>
</span>
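
If the mapping experiment is attempted, a prefix lookup on the dot-free codes is one possible shape for it. The sketch below is hypothetical: it uses only a subset of the pairs listed above, and it narrows the C50.9 entry to C50.91x (an assumption) so that the male code C50.929 is not folded into 174.9. `to_icd9_group` is an illustrative name.

```python
from collections import Counter
from typing import Dict, Optional

# ICD-10 prefix -> ICD-9 code, both written without dots as in the dataset
ICD10_PREFIX_TO_ICD9: Dict[str, str] = {
    "C5091": "1749",  # assumption: C50.91x only, see lead-in
    "C5041": "1744",
    "C5081": "1748",
    "C5021": "1742",
    "C5011": "1741",
}


def to_icd9_group(code: str) -> Optional[str]:
    """Map a code to its coarser ICD-9 group; return None when no mapping is listed."""
    if not code.startswith("C"):
        return code  # already an ICD-9 code
    # try longer prefixes first so the most specific entry wins
    for prefix in sorted(ICD10_PREFIX_TO_ICD9, key=len, reverse=True):
        if code.startswith(prefix):
            return ICD10_PREFIX_TO_ICD9[prefix]
    return None


# made-up sample of codes to show how the distribution collapses
toy_codes = ["1749", "C50911", "C50912", "C50412", "C50929"]
before = Counter(toy_codes)
after = Counter(to_icd9_group(code) for code in toy_codes)
print(before)
print(after)
```

Comparing `before` and `after` shows the distribution shift mentioned above: three distinct codes collapse into `1749`, and unmapped codes surface as `None` for manual review.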

In [None]:
code_counts_df = get_value_counts(df_training_without_duplicates, 'breast_cancer_diagnosis_code')
code_desc_df = df_training_without_duplicates[['breast_cancer_diagnosis_code', 'breast_cancer_diagnosis_desc']].drop_duplicates()
code_counts_desc_df = pd.merge(code_counts_df, code_desc_df, left_index=True, right_on='breast_cancer_diagnosis_code')
code_counts_desc_df

#### Edge Cases


In [None]:
male_codes = ['C50929', 'C50021', '1759', 'C50421']
secondary_code = '19881'

In [None]:
male_code_df = df_training_without_duplicates[df_training_without_duplicates['breast_cancer_diagnosis_code'].isin(male_codes)]
male_code_df.loc[:, ['patient_id', 'DiagPeriodL90D']]

In [None]:
secondary_code_df = df_training_without_duplicates[df_training_without_duplicates['breast_cancer_diagnosis_code']==secondary_code]
secondary_code_df.loc[:, ['patient_id', 'DiagPeriodL90D']]

### `breast_cancer_diagnosis_desc`

<span style="color: #ffed29;">
    <ol>
        <li>Many words in the descriptions, such as 'Malignant'/'Malig', 'neoplasm'/'neopl', 'female', and 'breast', are not informative since they appear in almost every field. Removing these words can make the descriptions shorter and more concise. For example, the description 'Malignant neoplasm of breast (female), unspecified' can be reduced to 'unspecified'.
        <li>Some words have short forms, and using a uniform representation can reduce unnecessary variations. For this purpose, 'unsp' will be replaced by 'unspecified', and 'ovrlp' will be replaced by 'overlapping'.
        <li>Each original description consists of 4 to 10 words (averaging 8.44 words). These descriptions specify the position of the neoplasm, such as the quadrant and side. After cleaning, the descriptions have been reduced to 0 to 6 words (averaging 3.34 words), with the single empty description corresponding to C50.
        <li>After data cleaning, encode each description into a feature vector. #TODO: experiment on this
    </ol>
</span>

In [None]:
def remove_tokens_and_handle_of(text: str, tokens: list) -> str:
    """
    Removes specified tokens from the input text, strips trailing commas from the remaining tokens,
    and drops a leading or trailing 'of' left dangling by the removal.

    Parameters:
    text (str): The input text string.
    tokens (list): A list of tokens to remove from the input text.

    Returns:
    str: The filtered text with specified tokens removed, trailing commas stripped from the tokens,
         and a leading or trailing 'of' removed.
    """
    words = text.split()
    # tokens are matched before commas are stripped, so variants such as 'breast,' must be listed explicitly
    filtered_words = [word.rstrip(',') for word in words if word not in tokens]

    # drop a dangling 'of' at the start (exact match, so other words beginning with 'of' are kept)
    if filtered_words and filtered_words[0] == 'of':
        filtered_words = filtered_words[1:]

    # drop a dangling 'of' at the end
    if filtered_words and filtered_words[-1] == 'of':
        filtered_words = filtered_words[:-1]
    
    return ' '.join(filtered_words)

In [None]:
def replace_tokens(text: str, replacements: dict) -> str:
    """
    Replaces tokens in the input text based on a dictionary of token pairs.

    Parameters:
    text (str): The input text string.
    replacements (dict): A dictionary where keys are tokens to be replaced and values are the replacement tokens.

    Returns:
    str: The text with tokens replaced according to the dictionary.
    """
    words = text.split()
    replaced_words = [replacements.get(word, word) for word in words]
    return ' '.join(replaced_words)

In [None]:
# remove tokens
tokens_to_remove = ['Malignant', 'malignant', 'Malig',
                    'neoplasm', 'neoplm', 
                    'female', '(female),',
                    'breast', 'breast,']
code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'] = code_counts_desc_df['breast_cancer_diagnosis_desc'].apply(lambda x: remove_tokens_and_handle_of(x, tokens_to_remove))


In [None]:
# replace tokens
token_replacements = {
    'unsp': 'unspecified',
    'ovrlp': 'overlapping'
}
code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'] = code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'].apply(lambda x: replace_tokens(x, token_replacements))

In [None]:
def count_words(text: str) -> int:
    """
    Counts the number of words in the input text, splitting by spaces.

    Parameters:
    text (str): The input text string.

    Returns:
    int: The number of words in the text.
    """
    return len(text.split())

In [None]:
code_counts_desc_df['breast_cancer_diagnosis_desc_count'] = code_counts_desc_df['breast_cancer_diagnosis_desc'].apply(count_words)
code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned_count'] = code_counts_desc_df['breast_cancer_diagnosis_desc_cleaned'].apply(count_words)

### `metastatic_cancer_diagnosis_code`

<span style="color: #ffed29;">
    <ol>
        <li>The top 3 diagnosis codes are C77.3 (54.6%; Secondary and unspecified malignant neoplasm of axilla and upper limb lymph nodes), C79.51 (14.2%; Secondary malignant neoplasm of bone), and C77.9 (5.9%; Secondary and unspecified malignant neoplasm of lymph node, unspecified).
        <li>Only 9 out of 43 diagnosis codes have metastatic first novel treatments, and these treatments occur in 24 out of 12,870 patients (0.186%) in the dataset.
    </ol>
</span>

In [None]:
metastatic_code_counts_df = get_value_counts(df_training_without_duplicates, 'metastatic_cancer_diagnosis_code')
metastatic_code_counts_df

In [None]:
metastatic_code_treatment_df = df_training_without_duplicates[['metastatic_cancer_diagnosis_code', 'metastatic_first_novel_treatment', 'metastatic_first_novel_treatment_type']].drop_duplicates().dropna()
metastatic_code_counts_treatment_df = pd.merge(metastatic_code_counts_df, metastatic_code_treatment_df, left_index=True, right_on='metastatic_cancer_diagnosis_code')
metastatic_code_counts_treatment_df

In [None]:
patients_with_first_novel_treatments_df = df_training_without_duplicates[df_training_without_duplicates['metastatic_first_novel_treatment'].notna()]
patients_with_first_novel_treatments_df

### `metastatic_first_novel_treatment` & `metastatic_first_novel_treatment_type`

<span style="color: #ffed29;">
There are only two metastatic first novel treatments: PEMBROLIZUMAB and OLAPARIB. Both of these treatments are of the type Antineoplastics.
</span>

In [None]:
get_value_counts(df_training_without_duplicates, 'metastatic_first_novel_treatment')

In [None]:
get_value_counts(df_training_without_duplicates, 'metastatic_first_novel_treatment_type')
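
Since so few patients have a recorded novel treatment, a presence flag may be more useful than the drug name itself. The sketch below shows how such a flag could be cross-tabulated against the target; the rows are made-up toy data and say nothing about the real relationship, and `has_novel_treatment` is a hypothetical feature name.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "metastatic_first_novel_treatment": ["PEMBROLIZUMAB", np.nan, np.nan, "OLAPARIB", np.nan, np.nan],
    "DiagPeriodL90D": [1, 0, 1, 1, 0, 1],
})

# True where a treatment is recorded, False where the field is NaN
has_treatment = toy_df["metastatic_first_novel_treatment"].notna().rename("has_novel_treatment")

# counts per (flag, target) combination and the target rate per flag value
table = pd.crosstab(has_treatment, toy_df["DiagPeriodL90D"])
rate = toy_df.groupby(has_treatment)["DiagPeriodL90D"].mean()
print(table)
print(rate)
```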