# Introduction

## [Overview](https://www.kaggle.com/competitions/widsdatathon2024-challenge1/overview)
- Objective: Develop a model to predict if patients received a metastatic cancer diagnosis within 90 days of screening (i.e., `DiagPeriodL90D`) using a unique oncology dataset.
- Motivation: 
    - Metastatic triple-negative breast cancer (TNBC) is considered the most aggressive form of TNBC and requires the most urgent and timely treatment. Unnecessary delays in diagnosis and subsequent treatment can have devastating effects in these difficult cancers. Differences in the wait time to get treatment are a good proxy for disparities in healthcare access.
    - The primary goal of building these models is to detect relationships between patient demographics and the likelihood of getting timely treatment. The secondary goal is to see whether environmental hazards affect proper diagnosis and treatment.  # TODO: check this
- Dataset
    - Source: Provided by Gilead Sciences, originating from Health Verity and enriched with third-party geo-demographic data and zip code level toxicology data from NASA/Columbia University.
    - Content: Information about demographics, diagnosis, treatment options, and insurance for patients diagnosed with breast cancer from 2015-2018.
    - Highlighted Features:
        - Demographics (e.g., age, gender, race, ...)
        - Diagnosis and treatment details (e.g., breast cancer diagnosis code, metastatic cancer diagnosis code, metastatic cancer treatments, ...)
        - Insurance information
        - Geo (zip-code level) demographic data (e.g., income, education, rent, race, poverty, ...)
        - Toxic air quality (zip-code level) data (e.g., Ozone, PM25 and NO2, ...)
- Evaluation
    - Metric: Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC).
    - Leaderboard:
        - During the competition: 51% of the test data.
        - Final standings: 49% of the test data.
- Submission Format
    - File Format: CSV
    - Columns:
        - `patient_id` (integer)
        - `DiagPeriodL90D` (predicted probability, e.g., `0.5`)
    - Example:
        ```
        patient_id,DiagPeriodL90D
        372069,.5
        981264,.5
        ```
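
To make the metric concrete, here is a minimal pure-Python sketch of AUC-ROC as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The function name `auc_roc` and the toy labels and scores are made up for illustration; in practice `sklearn.metrics.roc_auc_score` is the usual choice. Note that a constant prediction such as the `.5` in the example above yields an AUC of 0.5.

```python
def auc_roc(y_true, y_score):
    """AUC-ROC via pairwise comparison (illustrative, O(n_pos * n_neg))."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = diagnosed within 90 days, 0 = not.
labels = [1, 0, 1, 1, 0]
constant_scores = [0.5, 0.5, 0.5, 0.5, 0.5]   # like the example submission
perfect_scores = [0.9, 0.2, 0.8, 0.4, 0.3]    # every positive above every negative
partial_scores = [0.9, 0.5, 0.8, 0.4, 0.3]    # one positive ranked below a negative

auc_constant = auc_roc(labels, constant_scores)
auc_perfect = auc_roc(labels, perfect_scores)
auc_partial = auc_roc(labels, partial_scores)
print(auc_constant, auc_perfect, auc_partial)
```

Because only the ranking of the scores matters, any monotonic rescaling of the predictions leaves the AUC unchanged.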

## [Dataset](https://www.kaggle.com/competitions/widsdatathon2024-challenge1/data)
Roughly 18.7k records in total, each corresponding to a single patient and her diagnosis period.
- `training.csv`
    - 12906 records
    - 83 columns, the last column is `DiagPeriodL90D` (int64)
- `test.csv`
    - 5792 records
    - 82 columns
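
The column counts above imply that `test.csv` carries every training column except the target. Below is a minimal sketch of checking that before modelling. It uses two tiny hand-made frames instead of the real files, and the helper name `check_split_schema` is hypothetical.

```python
import pandas as pd

TARGET = "DiagPeriodL90D"


def check_split_schema(train, test, target=TARGET):
    """Return the shared feature columns, raising if the two schemas disagree."""
    if target not in train.columns:
        raise KeyError(f"training data has no {target!r} column")
    if target in test.columns:
        raise ValueError(f"test data unexpectedly contains {target!r}")

    feature_cols = [c for c in train.columns if c != target]
    missing = set(feature_cols) - set(test.columns)
    extra = set(test.columns) - set(feature_cols)
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return feature_cols


# Toy frames standing in for training.csv and test.csv (values are made up).
toy_train = pd.DataFrame(
    {"patient_id": [1, 2], "patient_age": [55, 61], TARGET: [1, 0]}
)
toy_test = pd.DataFrame({"patient_id": [3], "patient_age": [47]})

features = check_split_schema(toy_train, toy_test)
print(features)
```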

### Columns

| Column Name | Meaning | Notes Before EDA |
|-------------|---------|------------------|
| `patient_id` | Unique identification number of patient | only useful in data pre- and post-processing, but irrelevant to target prediction |
| `patient_race` | <span style="color: #57b9ff;">Asian, African American, Hispanic or Latino, White, Other Race</span> | races + nan, nan -> "unknown", encode |
| `payer_type` | <span style="color: #57b9ff;">Payer type at Medicaid, Commercial, Medicare on the metastatic date</span> | payers + nan, nan -> "unknown", encode |
| `patient_state` | <span style="color: #57b9ff;">Patient state (e.g., AL, AK, AZ, AR, CA, CO, etc.) on the metastatic date</span> | states + nan, nan -> "unknown", encode |
| `patient_zip3` | Patient Zip3 (e.g., 190) on the metastatic date | int64, encode? |
| `patient_age` | Derived from Patient Year of Birth (index year minus year of birth) | int64 |
| `patient_gender` | <span style="color: #57b9ff;">F, M on the metastatic date</span> | genders, encode |
| `bmi` | If available, will show available BMI (e.g., 24) information (Earliest BMI recording post metastatic date) | float64 + nan, nan -> -1? |
| `breast_cancer_diagnosis_code` | <span style="color: #57b9ff;">ICD10 (e.g., C50412) or ICD9 (e.g., 1748) diagnoses code</span> | codes, encode |
| `breast_cancer_diagnosis_desc` | <span style="color: #ffa500;">ICD10 or ICD9 code description. This column is raw text and may require NLP/processing and cleaning</span> | length?, encode, which tokenizer?, token diversity?, fine-tune tokenizer? |
| `metastatic_cancer_diagnosis_code` | <span style="color: #57b9ff;">ICD10 (e.g., C773) diagnoses code</span> | codes, encode |
| `metastatic_first_novel_treatment` | <span style="color: #57b9ff;">Generic drug name of the first novel treatment (e.g., "Cisplatin") after metastatic diagnosis</span> | very few, very timely diagnosis? nan -> "unknown" |
| `metastatic_first_novel_treatment_type` | <span style="color: #57b9ff;">Description of the treatment (e.g., Antineoplastic) of the first novel treatment after metastatic diagnosis</span> | very few, very timely diagnosis? nan -> "unknown" |
| `Region` | <span style="color: #57b9ff;">Region of patient location (e.g., Midwest)</span> | regions + nan, nan -> "unknown", encode |
| `Division` | <span style="color: #57b9ff;">Division of patient location (e.g., East North Central)</span> | divisions + nan, nan -> "unknown", encode |
| `population` | An estimate of the zip code's population | one record missing!, nan -> -1?, remove record? |
| `density` | The estimated population per square kilometer | one record missing!, nan -> -1?, remove record? |
| `age_median` | The median age of residents in the zip code | which are irrelevant? aggregate? nan -> -1? |
| `male` | The percentage of residents who report being male (e.g., 55.1) | |
| `female` | The percentage of residents who report being female (e.g., 44.9) | |
| `married` | The percentage of residents who report being married (e.g., 44.9) | |
| `family_size` | The average size of resident families (e.g., 3.22) | |
| `income_household_median` | Median household income in USD | |
| `income_household_six_figure` | Percentage of households that earn at least $100,000 (e.g., 25.3) | |
| `home_ownership` | Percentage of households that own (rather than rent) their residence | |
| `housing_units` | The number of housing units (or households) in the zip code | |
| `home_value` | The median value of homes that are owned by residents | |
| `rent_median` | The median rent paid by renters | |
| `education_college_or_above` | The percentage of residents with at least a 4-year degree | |
| `labor_force_participation` | The percentage of residents 16 and older in the labor force | |
| `unemployment_rate` | The percentage of residents unemployed | |
| `race_white` | The percentage of residents who report their race as White | |
| `race_black` | The percentage of residents who report their race as Black or African American | |
| `race_asian` | The percentage of residents who report their race as Asian | |
| `race_native` | The percentage of residents who report their race as American Indian and Alaska Native | |
| `race_pacific` | The percentage of residents who report their race as Native Hawaiian and Other Pacific Islander | |
| `race_other` | The percentage of residents who report their race as Some other race | |
| `race_multiple` | The percentage of residents who report their race as Two or more races | |
| `hispanic` | The percentage of residents who report being Hispanic. Note: Hispanic is considered to be an ethnicity and not a race | |
| `age_under_10` | The percentage of residents aged 0-9 | |
| `age_10_to_19` | The percentage of residents aged 10-19 | |
| `age_20s` | The percentage of residents aged 20-29 | |
| `age_30s` | The percentage of residents aged 30-39 | |
| `age_40s` | The percentage of residents aged 40-49 | |
| `age_50s` | The percentage of residents aged 50-59 | |
| `age_60s` | The percentage of residents aged 60-69 | |
| `age_70s` | The percentage of residents aged 70-79 | |
| `age_over_80` | The percentage of residents aged over 80 | |
| `divorced` | The percentage of residents divorced | |
| `never_married` | The percentage of residents never married | |
| `widowed` | The percentage of residents widowed | |
| `family_dual_income` | The percentage of families with dual income earners | |
| `income_household_under_5` | The percentage of households with income under $5,000 | |
| `income_household_5_to_10` | The percentage of households with income from $5,000-$10,000 | |
| `income_household_10_to_15` | The percentage of households with income from $10,000-$15,000 | |
| `income_household_15_to_20` | The percentage of households with income from $15,000-$20,000 | |
| `income_household_20_to_25` | The percentage of households with income from $20,000-$25,000 | |
| `income_household_25_to_35` | The percentage of households with income from $25,000-$35,000 | |
| `income_household_35_to_50` | The percentage of households with income from $35,000-$50,000 | |
| `income_household_50_to_75` | The percentage of households with income from $50,000-$75,000 | |
| `income_household_75_to_100` | The percentage of households with income from $75,000-$100,000 | |
| `income_household_100_to_150` | The percentage of households with income from $100,000-$150,000 | |
| `income_household_150_over` | The percentage of households with income over $150,000 | |
| `income_individual_median` | The median income of individuals in the zip code | |
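| `poverty` | The median value of owner-occupied homes | description duplicates the meaning of `home_value`; probably a poverty rate instead, verify before use |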
| `rent_burden` | The median rent as a percentage of the median renter's household income | |
| `education_less_highschool` | The percentage of residents with less than a high school education | |
| `education_highschool` | The percentage of residents with a high school diploma but no more | |
| `education_some_college` | The percentage of residents with some college but no more | |
| `education_bachelors` | The percentage of residents with a bachelor's degree (or equivalent) but no more | |
| `education_graduate` | The percentage of residents with a graduate degree | |
| `education_stem_degree` | The percentage of college graduates with a Bachelor's degree or higher in a Science and Engineering (or related) field | |
| `self_employed` | The percentage of households reporting self-employment income on their 2016 IRS tax return | |
| `farmer` | The percentage of households reporting farm income on their 2016 IRS tax return | |
| `disabled` | The percentage of residents who report a disability | |
| `limited_english` | The percentage of residents who only speak limited English | |
| `commute_time` | The median commute time of resident workers in minutes | |
| `health_uninsured` | The percentage of residents who report not having health insurance | |
| `veteran` | The percentage of residents who are veterans | |
| `Ozone` | Annual Ozone (O3) concentration data at Zip3 level. This data shows how air quality data may impact health | many records missing, nan -> -1? |
| `PM25` | Annual Fine Particulate Matter (PM2.5) concentration data at Zip3 level. This data shows how air quality data may impact health | many records missing, nan -> -1? |
| `N02` | Annual Nitrogen Dioxide (NO2) concentration data at Zip3 level. This data shows how air quality data may impact health | many records missing, nan -> -1? |
| `DiagPeriodL90D` | Diagnosis period being less than 90 days | |
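
The notes column repeatedly proposes `nan -> "unknown"` for categorical columns and `nan -> -1?` for numeric ones. A minimal sketch of both ideas follows, run on a made-up two-row frame. The helper name `fill_missing` and the extra `<col>_missing` indicator columns are assumptions added here, so that the information "this value was absent" is not lost when the sentinel is written.

```python
import pandas as pd


def fill_missing(df, categorical, numeric, sentinel=-1.0):
    """Fill categorical NaNs with "unknown" and numeric NaNs with a sentinel."""
    out = df.copy()
    for col in categorical:
        out[col] = out[col].fillna("unknown")
    for col in numeric:
        # Record missingness before overwriting it with the sentinel.
        out[f"{col}_missing"] = out[col].isna().astype("int8")
        out[col] = out[col].fillna(sentinel)
    return out


# Toy rows (values are made up, not taken from the dataset).
toy = pd.DataFrame(
    {"patient_race": ["White", None], "bmi": [24.0, float("nan")]}
)
filled = fill_missing(toy, categorical=["patient_race"], numeric=["bmi"])
print(filled)
```

Whether `-1` is a sensible sentinel depends on the model: tree-based models can split it off easily, while linear models will treat it as a real magnitude unless the indicator column is also used.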

# Exploratory Data Analysis

Thoughts:
- Don't be quick to remove records in which certain columns are empty. The missingness pattern itself may reflect patient characteristics that correlate with the target.
- Apply the same pre- and post-processing to the training and test data.
- Use Logistic Regression as a preliminary step to discover relevant columns.
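
One way to keep pre-processing identical across the two splits is to learn the category levels from the training data only and then reuse them for the test data. The sketch below assumes this approach; the helper names `fit_categories` and `encode`, the toy payer values, and the choice to map unseen or missing values to `"unknown"` are all illustrative assumptions.

```python
import pandas as pd


def fit_categories(train, columns):
    """Learn the category levels of each column from the training data only."""
    levels = {}
    for col in columns:
        seen = sorted(train[col].dropna().unique())
        levels[col] = [v for v in seen if v != "unknown"] + ["unknown"]
    return levels


def encode(df, levels):
    """One-hot encode with fixed levels so every split gets identical columns."""
    out = df.copy()
    for col, cats in levels.items():
        # Missing values and levels never seen in training both become "unknown".
        known = out[col].where(out[col].isin(cats), "unknown")
        out[col] = pd.Categorical(known, categories=cats)
    return pd.get_dummies(out, columns=list(levels), dtype=float)


# Toy frames (values are made up).
toy_train = pd.DataFrame(
    {"patient_age": [55, 61, 70], "payer_type": ["COMMERCIAL", "MEDICAID", None]}
)
toy_test = pd.DataFrame(
    {"patient_age": [47, 52], "payer_type": ["MEDICARE ADVANTAGE", "COMMERCIAL"]}
)

levels = fit_categories(toy_train, ["payer_type"])
train_enc = encode(toy_train, levels)
test_enc = encode(toy_test, levels)
print(list(train_enc.columns))
```

Fitting on the training split alone avoids leaking test information and guarantees that a model trained on `train_enc` can consume `test_enc` without column mismatches.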

Questions:
- Which columns need normalization?
- Unify numerical columns to float64?
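
As a starting point for both questions, here is a minimal sketch that casts selected numeric columns to `float64` and standardizes them with statistics learned from the training split only. The helper names `fit_scaler` and `apply_scaler` and the toy values are hypothetical. Scaling mostly matters for models such as Logistic Regression; tree-based models are largely insensitive to it.

```python
import pandas as pd


def fit_scaler(train, columns):
    """Compute per-column mean and (population) standard deviation on train."""
    numeric = train[columns].astype("float64")
    return numeric.mean(), numeric.std(ddof=0)


def apply_scaler(df, mean, std):
    """Cast to float64 and standardize using previously fitted statistics."""
    out = df.copy()
    safe_std = std.replace(0.0, 1.0)  # avoid dividing by zero on constant columns
    for col in mean.index:
        out[col] = (out[col].astype("float64") - mean[col]) / safe_std[col]
    return out


# Toy frames (values are made up).
toy_train = pd.DataFrame({"patient_age": [40, 60], "population": [1000.0, 3000.0]})
toy_test = pd.DataFrame({"patient_age": [50], "population": [2000.0]})

mean, std = fit_scaler(toy_train, ["patient_age", "population"])
train_scaled = apply_scaler(toy_train, mean, std)
test_scaled = apply_scaler(toy_test, mean, std)
print(train_scaled)
```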

In [16]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [17]:
# seaborn style (default)
sns.set_theme(context="notebook", style="darkgrid", palette="deep", font="sans-serif", font_scale=1, color_codes=True, rc=None)

# font sizes
title_fontsize = 14
label_fontsize = 12
tick_fontsize = 10
text_fontsize = 10

# random state
random_state = 42

In [None]:
df_training = pd.read_csv('data/training.csv')
df_training.info()
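
`info()` reports non-null counts per column. A compact view of only the columns with gaps can be easier to scan. The following is a minimal sketch on a made-up frame; the helper name `missing_summary` is hypothetical, and the same call could be applied to `df_training`.

```python
import pandas as pd


def missing_summary(df):
    """Return count and share of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {"n_missing": counts, "share_missing": counts / len(df)}
    )
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy frame (values are made up).
toy = pd.DataFrame(
    {
        "patient_age": [40, 50, 60],
        "bmi": [24.0, None, None],
        "Ozone": [None, 39.1, 40.0],
    }
)
summary = missing_summary(toy)
print(summary)
```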