## Predicting Metastatic Cancer Diagnosis

## Project Overview

Metastatic triple-negative breast cancer (TNBC) is considered the most aggressive form of TNBC and requires urgent, timely treatment. TNBC is characterized by high invasiveness, a high rate of metastasis, and poor prognosis. According to the National Institutes of Health (2022), more than one third of patients with TNBC experience recurrent or distant metastasis. Early diagnosis and treatment are therefore very important for such difficult cancers. Differences in the wait time to get treatment are a key factor contributing to disparities in healthcare.

The primary goal of this model is to detect the relationship between demographic characteristics and the likelihood of getting timely treatment. Additionally, the model aims to determine whether environmental hazards impact proper diagnosis and treatment. The developed model will predict the likelihood that a patient will be diagnosed with metastatic TNBC within 90 days of screening based on various demographic and environmental factors.

## Libraries and Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from catboost import CatBoostClassifier
from sklearn.metrics import classification_report, accuracy_score

import missingno as msn
import re


#To prevent truncation of output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)




## Importing Datasets

In [None]:
train_df = pd.read_csv('training.csv')

test = pd.read_csv('test.csv')

sample_submission = pd.read_csv('sample_submission.csv')

In [None]:
test.info()

In [None]:
sample_submission.head()

In [None]:
train_df.head()

In [None]:
train_df.info()

## Data Pre-Processing

### Missing Values

In [None]:

msn.bar(train_df)

The bar chart above shows that several columns have a large share of missing values, and some are almost entirely empty. Computing the percentage of missing values per column helps determine the best way to handle each of them.

In [None]:
missing_percentage = (train_df.isnull().mean() * 100).round(2)

# DataFrame to display the results
missing_info = pd.DataFrame({
    'Column': missing_percentage.index,
    'Missing Percentage': missing_percentage.values
})

missing_info[missing_info['Missing Percentage'] > 1] #listing columns with more than 1% of data missing

Between 49.5% and 99.8% of the values are missing in the following columns. These percentages are too high and risk reducing the overall predictive power of the model; hence, we will drop these columns:

- patient_race
- bmi
- metastatic_first_novel_treatment
- metastatic_first_novel_treatment_type

In [None]:
train_df.drop(columns = ['patient_race', 
                      'bmi', 
                      'metastatic_first_novel_treatment', 
                      'metastatic_first_novel_treatment_type'], inplace=True)
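
The same selection can be made programmatically rather than by listing column names. The following is a minimal sketch on made-up toy data; the 45% cut-off (`MISSING_THRESHOLD`) is a hypothetical value chosen for illustration, not one stated in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "bmi": [np.nan, np.nan, 27.5, np.nan],                          # 75% missing
    "payer_type": ["COMMERCIAL", None, "MEDICAID", "COMMERCIAL"],   # 25% missing
    "patient_age": [54, 61, 47, 70],                                # complete
})

MISSING_THRESHOLD = 45.0  # hypothetical cut-off, in percent

# Percentage of missing values per column
missing_pct = toy_df.isna().mean() * 100

# Names of the columns whose missing share exceeds the cut-off
cols_to_drop = missing_pct[missing_pct > MISSING_THRESHOLD].index.tolist()

# Drop them without modifying the original frame
reduced_df = toy_df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced_df.columns.tolist())
```

Keeping `cols_to_drop` as a list makes it easy to apply the identical drop to the test set later.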

As for the 'payer_type' column, approximately 13% of the data is missing. The following cell uses the missingno module to visualize the degree of randomness, i.e. whether there is a notable pattern in how the data is missing.

In [None]:
payer_type_subset = pd.DataFrame(train_df['payer_type'])
msn.matrix(payer_type_subset)

The missing values appear to be spread randomly within the dataset, which suggests that this data is likely missing completely at random (MCAR). When data is MCAR, the fact that a value is missing is independent of both the observed and the unobserved variables. In other words, no systematic differences exist between the patients with a missing payer type and those with complete data.
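
A matrix plot of a single column gives only a visual impression, so a rough numerical check can complement it. The sketch below, on made-up toy data, compares the averages of other variables between rows with and without a payer type; large gaps would argue against MCAR. The choice of `patient_age` as a comparison column is an assumption for illustration.

```python
import pandas as pd

# Toy data standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "payer_type": ["COMMERCIAL", None, "MEDICAID", None, "COMMERCIAL", "MEDICAID"],
    "patient_age": [54, 61, 47, 70, 58, 66],
    "DiagPeriodL90D": [1, 0, 1, 1, 0, 1],
})

# Flag the rows where payer_type is missing
flagged = toy_df.assign(payer_missing=toy_df["payer_type"].isna())

# Compare the mean age and the mean target between the two groups
comparison = flagged.groupby("payer_missing")[["patient_age", "DiagPeriodL90D"]].mean()

print(comparison)
```

Similar group means are consistent with MCAR but do not prove it; a formal test would be needed for a stronger claim.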

### Imputing missing data
The remaining columns are imputed with the mean if the data type is numerical and with the mode if the data type is categorical. This is done after confirming that the target variable, DiagPeriodL90D, has no null values.

In [None]:
train_df['DiagPeriodL90D'].isna().sum()

In [None]:
for column in train_df:
    
    if train_df[column].isna().sum() > 0:
        if train_df[column].dtype == "float64" or train_df[column].dtype == "int64":
            train_df[column] = train_df[column].fillna(train_df[column].mean())
        
        else:  
            train_df[column] = train_df[column].fillna(train_df[column].mode().iloc[0])

In [None]:
#Confirming that there is no more missing data :)

train_df.isna().sum()

### Breast Cancer Diagnosis Description Cleaning

We noted that the 'breast_cancer_diagnosis_desc' column includes important information that could strengthen our model. In the following cells, we will clean the text in this column using the following steps:


- Convert to Lowercase: The text is converted to lowercase to ensure uniformity.

- Normalize Incomplete Words: Certain abbreviated words like 'malig', 'neoplm', and 'unsp' are replaced with their full forms ('malignant', 'neoplasm', and 'unspecified', respectively).

In [None]:
def clean_breast_cancer_diagnosis_desc(text):
    
    text = text.lower() #Make every character lowercase
    
    # Normalize Incomplete Words
    text = re.sub(r'\bmalig\b', 'malignant', text)
    text = re.sub(r'\bneoplm\b', 'neoplasm', text)
    text = re.sub(r'\bunsp\b', 'unspecified', text)


    return text


train_df['breast_cancer_diagnosis_desc'] = train_df['breast_cancer_diagnosis_desc'].apply(clean_breast_cancer_diagnosis_desc)
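
As a quick check of the substitutions described above, the sketch below runs a dictionary-driven variant of the cleaning function on a made-up description. The names `ABBREVIATIONS` and `clean_desc` are hypothetical, and the guard for non-string input is an addition for rows where the description might be missing.

```python
import re

# Abbreviation -> full form, mirroring the substitutions described above
ABBREVIATIONS = {"malig": "malignant", "neoplm": "neoplasm", "unsp": "unspecified"}

def clean_desc(text):
    """Lowercase the text and expand known abbreviations; non-strings pass through."""
    if not isinstance(text, str):
        return text  # e.g. NaN or None for a missing description
    text = text.lower()
    for short, full in ABBREVIATIONS.items():
        # \b restricts the match to whole words, so 'malignant' is left alone
        text = re.sub(rf"\b{short}\b", full, text)
    return text

sample = "Malig neoplm of unsp site of unsp female breast"  # made-up example
cleaned = clean_desc(sample)
print(cleaned)
```

The word-boundary anchors matter here: without them, "malig" inside an already complete "malignant" would be expanded a second time.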

### Dropping extra columns

All patients in the training dataset are female; hence, it is best to drop the patient_gender column. A column with a constant value does not provide any useful information for the model. Since the value is the same for all rows, the column does not contribute to distinguishing between different instances or making predictions.

A constant column cannot by itself cause overfitting, since there is nothing in it to split on. More generally, though, removing uninformative columns limits the opportunities for a model to learn patterns from noise in the data and can improve its ability to generalize to unseen data.

This decision also reduces model complexity. Removing redundant columns simplifies the model, which can lead to faster training times, improved interpretability, and potentially better generalization performance on new data.

Similarly, the patient_id column is a unique identifier for each patient; hence, it does not provide any information for the model. This column will also be dropped.

In [None]:
train_df['patient_gender'].value_counts()

In [None]:
train_df.drop('patient_gender', axis=1, inplace=True)

In [None]:
train_df.drop('patient_id', axis=1, inplace=True)
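
Constant and identifier-like columns can also be found by counting distinct values per column. This is a minimal sketch on made-up toy data; note that a column whose values are all distinct is only a candidate identifier, since a continuous measurement could look the same, so the result still needs a manual review.

```python
import pandas as pd

# Toy data standing in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "patient_gender": ["F", "F", "F", "F"],
    "patient_state": ["CA", "TX", "CA", "NY"],
})

# Number of distinct values per column (NaN counted as its own value)
n_unique = toy_df.nunique(dropna=False)

# Columns with a single value carry no information
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Columns with one distinct value per row behave like identifiers
id_like_cols = n_unique[n_unique == len(toy_df)].index.tolist()

print(constant_cols)
print(id_like_cols)
```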

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
#Test data preprocessing

#dropping columns with high levels of missing data, and patient_gender because it is constant
test.drop(columns = ['patient_race', 
                      'bmi', 
                      'metastatic_first_novel_treatment', 
                      'metastatic_first_novel_treatment_type', 'patient_gender'], inplace=True)

In [None]:
test.drop('patient_id', axis=1, inplace=True)

In [None]:
test.info()
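
The test set may still contain missing values at this point, and CatBoost generally expects categorical features to be strings or integers rather than NaN. One option is to reuse the fill values learned from the training data. The sketch below illustrates this on made-up toy data; `fit_fill_values` and `apply_fill_values` are hypothetical helpers, not part of pandas or of this notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train):
    """Return {column: fill value}: mean for numeric columns, mode otherwise."""
    fill_values = {}
    for column in train.columns:
        if pd.api.types.is_numeric_dtype(train[column]):
            fill_values[column] = train[column].mean()
        else:
            # Assumes the column has at least one non-missing value
            fill_values[column] = train[column].mode().iloc[0]
    return fill_values

def apply_fill_values(df, fill_values):
    """Fill missing values using training statistics, for columns present in df."""
    shared = {c: v for c, v in fill_values.items() if c in df.columns}
    return df.fillna(value=shared)

# Toy frames standing in for train_df and test; the values are made up.
toy_train = pd.DataFrame({
    "payer_type": ["COMMERCIAL", "MEDICAID", "COMMERCIAL"],
    "patient_age": [50.0, 60.0, 70.0],
})
toy_test = pd.DataFrame({
    "payer_type": [None, "MEDICAID"],
    "patient_age": [np.nan, 45.0],
})

fill_values = fit_fill_values(toy_train)
toy_test_filled = apply_fill_values(toy_test, fill_values)
print(toy_test_filled)
```

Taking the statistics from the training data keeps the two sets consistent and avoids using information from the test set. The same reasoning suggests applying `clean_breast_cancer_diagnosis_desc` to the test descriptions as well.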

## Model Training

In [None]:
x_train = train_df.drop(columns = ['DiagPeriodL90D'])
y_train = train_df['DiagPeriodL90D']


In [None]:
categorical_features = [
    'payer_type',
    'patient_state',
    'breast_cancer_diagnosis_code',
    'metastatic_cancer_diagnosis_code',
    'Region',
    'Division']
model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',
    eval_metric='AUC',
    cat_features = categorical_features,
    text_features = ['breast_cancer_diagnosis_desc'],
    random_seed=42,
    verbose=200)
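
Before fitting and predicting, it can help to confirm that the test frame has the same feature columns as the training frame, in the same order, and that the categorical columns are strings. The following is a minimal sketch on made-up toy frames; `toy_x_train`, `toy_test`, and the shortened `categorical` list are placeholders for illustration.

```python
import pandas as pd

# Toy frames standing in for x_train and test; the values are made up.
toy_x_train = pd.DataFrame({
    "payer_type": ["COMMERCIAL"], "patient_age": [54], "Region": ["West"],
})
toy_test = pd.DataFrame({
    "Region": ["South"], "patient_age": [61], "payer_type": ["MEDICAID"],
})
categorical = ["payer_type", "Region"]

# Columns present in one frame but not the other
missing_in_test = [c for c in toy_x_train.columns if c not in toy_test.columns]
extra_in_test = [c for c in toy_test.columns if c not in toy_x_train.columns]
if missing_in_test or extra_in_test:
    raise ValueError(f"Column mismatch: missing={missing_in_test}, extra={extra_in_test}")

# Reorder the test columns to match the training frame
aligned_test = toy_test.reindex(columns=toy_x_train.columns)

# Cast categorical columns to strings (note: any NaN would become the text 'nan')
aligned_test[categorical] = aligned_test[categorical].astype(str)

print(aligned_test.columns.tolist())
```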

In [None]:
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(test)

In [None]:
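# Evaluate the model
# The test set has no labels, so hold out 20% of the training data for validation,
# refit a fresh model with the same parameters on the rest, and score the held-out rows.
from sklearn.model_selection import train_test_split

X_fit, X_val, y_fit, y_val = train_test_split(
    x_train, y_train, test_size=0.2, stratify=y_train, random_state=42)
eval_model = CatBoostClassifier(**model.get_params())
eval_model.fit(X_fit, y_fit)
y_val_pred = eval_model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred))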