## Overview

Metastatic triple-negative breast cancer (TNBC) is considered the most aggressive form of TNBC and requires urgent, timely treatment. Unnecessary delays in diagnosis and subsequent treatment can have devastating effects in these difficult cancers. Differences in the wait time to get treatment are a good proxy for disparities in healthcare access.

## Our Objective

We will predict whether a patient received a metastatic cancer diagnosis within 90 days of screening (the binary target `DiagPeriodL90D`).

The primary goal of building these models is to detect relationships between patient demographics and the likelihood of receiving timely treatment. The secondary goal is to see whether environmental hazards affect timely diagnosis and treatment.

In [1]:
# import necessary modules
import pandas as pd
import re

In [2]:
# Read in the data
training_df = pd.read_csv("/home/paulet/Documents/cancer_prediction/training.csv")
training_df.head(5)

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,...,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D
0,475714,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,...,12.871429,22.542857,10.1,27.814286,11.2,3.5,52.23721,8.650555,18.606528,1
1,349367,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,...,8.957576,10.109091,8.057576,30.606061,7.018182,4.10303,42.301121,8.487175,20.113179,1
2,138632,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,...,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1
3,617843,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,...,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0
4,817482,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",...,15.276,11.224,1.946,26.170213,12.088,13.106,41.356058,4.110749,11.722197,0
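
One simple way to start looking for the demographic relationships named in the objective is to compare the share of timely diagnoses across groups. The sketch below is only an illustration: the toy frame copies the `payer_type` and `DiagPeriodL90D` values of the five rows printed above, and `timely_rate_by_group` is a hypothetical helper, not part of the original notebook. Five rows say nothing about the real data; the same call would be run on `training_df`.

```python
import pandas as pd

# Toy frame: payer_type and DiagPeriodL90D copied from the five rows above.
sample = pd.DataFrame({
    "payer_type": ["MEDICAID", "COMMERCIAL", "COMMERCIAL", "COMMERCIAL", "COMMERCIAL"],
    "DiagPeriodL90D": [1, 1, 1, 0, 0],
})

def timely_rate_by_group(df, group_col, target_col="DiagPeriodL90D"):
    """Hypothetical helper: row count and share of timely diagnoses per group."""
    summary = df.groupby(group_col, dropna=False)[target_col].agg(
        n="count", timely_rate="mean"
    )
    # Highest rate first so gaps between groups are easy to spot
    return summary.sort_values("timely_rate", ascending=False)

rates = timely_rate_by_group(sample, "payer_type")
print(rates)
```

Passing `dropna=False` keeps missing group labels as their own row, which matters here because columns such as `patient_race` contain many nulls.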


In [3]:
# Extract the dependent variable
y = training_df['DiagPeriodL90D']

In [4]:
# Extract the independent variables
x = training_df.drop(['DiagPeriodL90D', 'patient_id'], axis=1)

In [5]:
x.head()

Unnamed: 0,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,metastatic_cancer_diagnosis_code,...,hispanic,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02
0,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,C7989,...,66.685714,12.871429,22.542857,10.1,27.814286,11.2,3.5,52.23721,8.650555,18.606528
1,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,C773,...,37.948485,8.957576,10.109091,8.057576,30.606061,7.018182,4.10303,42.301121,8.487175,20.113179
2,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,C773,...,19.37,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351
3,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,C773,...,16.716667,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123
4,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",C773,...,13.334,15.276,11.224,1.946,26.170213,12.088,13.106,41.356058,4.110749,11.722197


In [6]:
# Handling missing values
# Checking for null values in the independent variables (X)
null_values = x.isnull().sum()

# Displaying columns with null values
columns_with_null = null_values[null_values > 0]
print("Columns with null values:")
print(columns_with_null)


Columns with null values:
patient_race                         6385
payer_type                           1803
patient_state                          51
bmi                                  8965
metastatic_first_novel_treatment    12882
                                    ...  
health_uninsured                        1
veteran                                 1
Ozone                                  29
PM25                                   29
N02                                    29
Length: 75, dtype: int64
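
The counts above only locate the gaps; they do not fill them. As a minimal sketch of one possible next step, the helper below drops columns that are almost entirely empty (`metastatic_first_novel_treatment` is null in 12882 rows) and imputes the rest. The 0.9 threshold, the median fill for numeric columns, the `"Unknown"` label for categorical columns, and the name `drop_sparse_and_impute` are all assumptions made for illustration, not choices taken from the notebook.

```python
import pandas as pd

def drop_sparse_and_impute(df, drop_threshold=0.9):
    """Hypothetical helper: drop very sparse columns, then fill remaining gaps."""
    missing_share = df.isnull().mean()
    sparse_cols = missing_share[missing_share > drop_threshold].index
    out = df.drop(columns=sparse_cols).copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill with the column median (assumed strategy)
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical column: use an explicit placeholder label
            out[col] = out[col].fillna("Unknown")
    return out, missing_share

# Toy data that mimics the pattern of nulls reported above
toy = pd.DataFrame({
    "patient_race": ["White", None, "White", None],
    "bmi": [28.49, None, 38.09, None],
    "metastatic_first_novel_treatment": [None, None, None, None],
})

filled, missing_share = drop_sparse_and_impute(toy)
print(missing_share)
print(filled)
```

Keeping `"Unknown"` as its own category, rather than dropping those rows, preserves the possibility that missingness itself is related to the outcome.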


In [7]:
x['breast_cancer_diagnosis_desc'].head(20)

0     Malignant neoplasm of unsp site of unspecified...
1     Malig neoplm of upper-outer quadrant of right ...
2     Malignant neoplasm of central portion of left ...
3     Malig neoplasm of upper-inner quadrant of left...
4     Malignant neoplasm of breast (female), unspeci...
5     Malignant neoplasm of breast (female), unspeci...
6     Malignant neoplasm of unspecified site of left...
7     Malig neoplasm of lower-outer quadrant of left...
8     Malignant neoplasm of upper-outer quadrant of ...
9     Malignant neoplasm of unspecified site of left...
10    Malig neoplasm of upper-outer quadrant of left...
11    Malignant neoplasm of breast (female), unspeci...
12    Malignant neoplasm of ovrlp sites of left fema...
13    Malignant neoplasm of breast (female), unspeci...
14    Malignant neoplasm of unsp site of right femal...
15    Malig neoplm of upper-outer quadrant of right ...
16    Malignant neoplasm of breast (female), unspeci...
17    Malignant neoplasm of unsp site of right f

### HANDLING THE BREAST CANCER DESC WITH NLP

In [8]:
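# Text cleaning for the diagnosis description
def clean_breast_cancer_diagnosis_desc(text):
    # Strip characters matched by [^a-zA-Z\S]. Because \S is upper-case, this
    # class matches only whitespace, so every space is removed and the words
    # are fused together.
    text = re.sub(r'[^a-zA-Z\S]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Expand abbreviated words. These patterns rely on word boundaries, so they
    # only match when 'malig' or 'neoplm' is not fused to a neighbouring word.
    text = re.sub(r'\bmalig\b', 'malignant', text)
    text = re.sub(r'\bneoplm\b', 'neoplasm', text)

    # Remove double quotes
    text = text.replace('"', '')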

    return text

# Apply Function
training_df['clean_breast_cancer_diagnosis_desc'] = training_df['breast_cancer_diagnosis_desc'].apply(clean_breast_cancer_diagnosis_desc)

# Display cleaned data
print(training_df[['breast_cancer_diagnosis_desc', 'clean_breast_cancer_diagnosis_desc']])


                            breast_cancer_diagnosis_desc  \
0      Malignant neoplasm of unsp site of unspecified...   
1      Malig neoplm of upper-outer quadrant of right ...   
2      Malignant neoplasm of central portion of left ...   
3      Malig neoplasm of upper-inner quadrant of left...   
4      Malignant neoplasm of breast (female), unspeci...   
...                                                  ...   
12901  Malig neoplm of upper-outer quadrant of right ...   
12902  Malignant neoplasm of unspecified site of left...   
12903  Malignant neoplasm of unspecified site of left...   
12904  Malignant neoplasm of breast (female), unspeci...   
12905  Malig neoplasm of upper-outer quadrant of left...   

                      clean_breast_cancer_diagnosis_desc  
0      malignantneoplasmofunspsiteofunspecifiedfemale...  
1      maligneoplmofupper-outerquadrantofrightfemaleb...  
2      malignantneoplasmofcentralportionofleftfemaleb...  
3      maligneoplasmofupper-innerquadrantof
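
The cleaned strings above contain no spaces, and `malig` and `neoplm` are left unexpanded, because the character class `[^a-zA-Z\S]` matches whitespace only. A minimal sketch of a cleaner that keeps word boundaries is shown below. The name `clean_desc_keep_spaces` is hypothetical, and replacing punctuation with a space (so that `upper-outer` becomes two words) is an assumption about the desired behaviour. The second input is only the visible prefix of a truncated description.

```python
import re

def clean_desc_keep_spaces(text):
    """Hypothetical variant of the cleaner that preserves word boundaries."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Expand abbreviations now that they stand alone as words
    text = re.sub(r"\bmalig\b", "malignant", text)
    text = re.sub(r"\bneoplm\b", "neoplasm", text)
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()

full_example = clean_desc_keep_spaces("Malignant neoplasm of breast (female), unspecified")
prefix_example = clean_desc_keep_spaces("Malig neoplm of upper-outer quadrant of right")
print(full_example)
print(prefix_example)
```

With spaces preserved, the two abbreviation patterns match as intended and later tokenization can work on individual words.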

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

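# Download NLTK stop words. word_tokenize and WordNetLemmatizer also rely on
# the 'punkt' ('punkt_tab' in newer NLTK releases) and 'wordnet' resources,
# which must already be installed for this cell to run.
nltk.download('stopwords')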

# Load NLTK Lemmatizer
lemmatizer = WordNetLemmatizer()

# Get NLTK stop words
stop_words = set(stopwords.words('english'))

# Tokenization and Lemmatization function using NLTK
def tokenize_and_lemmatize(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token.lower() not in stop_words]
    return lemmatized_tokens

training_df['tokens_lemmas'] = training_df['clean_breast_cancer_diagnosis_desc'].apply(tokenize_and_lemmatize)

# Convert set to list for stop words
stop_words_list = list(stop_words)

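# TF-IDF vectorization. The documents are already token lists, so the
# tokenizer and preprocessor are identity functions; token_pattern=None
# tells scikit-learn that the default pattern is intentionally unused.
tfidf_vectorizer = TfidfVectorizer(
    stop_words=stop_words_list,
    tokenizer=lambda tokens: tokens,
    preprocessor=lambda tokens: tokens,
    token_pattern=None,
)
tfidf_matrix = tfidf_vectorizer.fit_transform(training_df['tokens_lemmas'])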

# Print the features (words) obtained from TF-IDF
print("TF-IDF Features (Words):", tfidf_vectorizer.get_feature_names_out())


[nltk_data] Downloading package stopwords to /home/paulet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


TF-IDF Features (Words): ['female' 'leftfemalebreast'
 'malignantneoplasmofaxillarytailoffemalebreast'
 'malignantneoplasmofaxillarytailofleftfemalebreast'
 'malignantneoplasmofaxillarytailofrightfemalebreast'
 'malignantneoplasmofaxillarytailofunspfemalebreast'
 'malignantneoplasmofbreast' 'malignantneoplasmofbreastofunspecifiedsite'
 'malignantneoplasmofcentralportionofbreast'
 'malignantneoplasmofcentralportionoffemalebreast'
 'malignantneoplasmofcentralportionofleftfemalebreast'
 'malignantneoplasmofcentralportionofrightfemalebreast'
 'malignantneoplasmofcentralportionofunspfemalebreast'
 'malignantneoplasmofnippleandareola'
 'malignantneoplasmofotherandunspecifiedsitesofmalebreast'
 'malignantneoplasmofotherspecifiedsitesoffemalebreast'
 'malignantneoplasmofoverlappingsitesofbreast'
 'malignantneoplasmofovrlpsitesofleftfemalebreast'
 'malignantneoplasmofovrlpsitesofrightfemalebreast'
 'malignantneoplasmofovrlpsitesofunspfemalebreast'
 'malignantneoplasmofunspecifiedsiteofleftfemal



In [32]:
training_df.head(30)

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,...,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D,clean_breast_cancer_diagnosis_desc,tokens_lemmas
0,475714,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,...,10.1,27.814286,11.2,3.5,52.23721,8.650555,18.606528,1,malignantneoplasmofunspsiteofunspecifiedfemale...,[malignantneoplasmofunspsiteofunspecifiedfemal...
1,349367,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,...,8.057576,30.606061,7.018182,4.10303,42.301121,8.487175,20.113179,1,maligneoplmofupper-outerquadrantofrightfemaleb...,[]
2,138632,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,...,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1,malignantneoplasmofcentralportionofleftfemaleb...,[malignantneoplasmofcentralportionofleftfemale...
3,617843,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,...,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0,maligneoplasmofupper-innerquadrantofleftfemale...,[]
4,817482,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",...,1.946,26.170213,12.088,13.106,41.356058,4.110749,11.722197,0,"malignantneoplasmofbreast(female),unspecified","[malignantneoplasmofbreast, female, unspecified]"
5,111545,White,MEDICARE ADVANTAGE,NY,141,66,F,,1749,"Malignant neoplasm of breast (female), unspeci...",...,0.638235,25.0,4.797143,7.745714,40.107248,6.181812,13.562528,0,"malignantneoplasmofbreast(female),unspecified","[malignantneoplasmofbreast, female, unspecified]"
6,914071,,COMMERCIAL,CA,900,51,F,29.05,C50912,Malignant neoplasm of unspecified site of left...,...,14.7375,30.709375,10.341538,3.030769,41.186992,11.166898,21.644261,1,malignantneoplasmofunspecifiedsiteofleftfemale...,[malignantneoplasmofunspecifiedsiteofleftfemal...
7,479368,White,COMMERCIAL,IL,619,60,F,,C50512,Malig neoplasm of lower-outer quadrant of left...,...,0.503333,24.275862,8.753333,7.506667,37.64677,7.295977,12.914805,1,maligneoplasmoflower-outerquadrantofleftfemale...,[]
8,994014,White,MEDICARE ADVANTAGE,,973,82,F,,1744,Malignant neoplasm of upper-outer quadrant of ...,...,1.620968,26.015254,6.645313,10.955385,36.323573,4.744352,10.439314,0,malignantneoplasmofupper-outerquadrantoffemale...,[]
9,155485,,COMMERCIAL,IL,617,64,F,,C50912,Malignant neoplasm of unspecified site of left...,...,0.190566,23.843396,4.684906,9.016981,37.77383,7.299998,14.942968,1,malignantneoplasmofunspecifiedsiteofleftfemale...,[malignantneoplasmofunspecifiedsiteofleftfemal...

