In [2]:
# Importing necessary modules
import pandas as pd
import re

In [3]:
# Reading in the data
training_df = pd.read_csv("training.csv")
test_csv = pd.read_csv("test.csv")
training_df.head(5)

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,...,disabled,poverty,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D
0,475714,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,...,12.871429,22.542857,10.1,27.814286,11.2,3.5,52.23721,8.650555,18.606528,1
1,349367,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,...,8.957576,10.109091,8.057576,30.606061,7.018182,4.10303,42.301121,8.487175,20.113179,1
2,138632,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,...,11.253333,9.663333,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1
3,617843,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,...,8.845238,8.688095,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0
4,817482,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",...,15.276,11.224,1.946,26.170213,12.088,13.106,41.356058,4.110749,11.722197,0


## NLP

In [4]:
training_df['breast_cancer_diagnosis_desc'].head(20)

0     Malignant neoplasm of unsp site of unspecified...
1     Malig neoplm of upper-outer quadrant of right ...
2     Malignant neoplasm of central portion of left ...
3     Malig neoplasm of upper-inner quadrant of left...
4     Malignant neoplasm of breast (female), unspeci...
5     Malignant neoplasm of breast (female), unspeci...
6     Malignant neoplasm of unspecified site of left...
7     Malig neoplasm of lower-outer quadrant of left...
8     Malignant neoplasm of upper-outer quadrant of ...
9     Malignant neoplasm of unspecified site of left...
10    Malig neoplasm of upper-outer quadrant of left...
11    Malignant neoplasm of breast (female), unspeci...
12    Malignant neoplasm of ovrlp sites of left fema...
13    Malignant neoplasm of breast (female), unspeci...
14    Malignant neoplasm of unsp site of right femal...
15    Malig neoplm of upper-outer quadrant of right ...
16    Malignant neoplasm of breast (female), unspeci...
17    Malignant neoplasm of unsp site of right f

## Breast Cancer Diagnosis Description Cleaning

The `clean_breast_cancer_diagnosis_desc` function cleans the text in the `breast_cancer_diagnosis_desc` column in four steps:

1. **Remove Unnecessary Characters:** The regular expression `[^a-zA-Z\s]` matches every character that is neither a letter nor whitespace, and each match is removed.

2. **Convert to Lowercase:** The text is lowercased so that the same word always appears in the same form.

3. **Normalize Incomplete Words:** The abbreviated words 'malig', 'neoplm', and 'unsp' are replaced with their full forms ('malignant', 'neoplasm', and 'unspecified', respectively). The `\b` word boundaries restrict each replacement to whole words.

4. **Remove Double Quotes and Brackets:** Any remaining double quotes and round brackets ('(' and ')') are removed from the text.

In [5]:
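# Clean the diagnosis description text
def clean_breast_cancer_diagnosis_desc(text):
    # Removing unnecessary characters: everything that is not a letter or whitespace.
    # The class needs a lowercase \s; with an uppercase \S it matches only whitespace,
    # which fuses the words together as in the recorded output below.
    text = re.sub(r'[^a-zA-Z\s]', '', text)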
    
    # Conversion to lowercase
    text = text.lower()

    # Normalize Incomplete Words
    text = re.sub(r'\bmalig\b', 'malignant', text)
    text = re.sub(r'\bneoplm\b', 'neoplasm', text)
    text = re.sub(r'\bunsp\b', 'unspecified', text)

    # Remove double quotes and brackets
    text = text.replace('"', '').replace('(', '').replace(')', '')

    return text

# Apply Function
training_df['clean_breast_cancer_diagnosis_desc'] = training_df['breast_cancer_diagnosis_desc'].apply(clean_breast_cancer_diagnosis_desc)

# Display cleaned data
print(training_df[['breast_cancer_diagnosis_desc', 'clean_breast_cancer_diagnosis_desc']])


                            breast_cancer_diagnosis_desc  \
0      Malignant neoplasm of unsp site of unspecified...   
1      Malig neoplm of upper-outer quadrant of right ...   
2      Malignant neoplasm of central portion of left ...   
3      Malig neoplasm of upper-inner quadrant of left...   
4      Malignant neoplasm of breast (female), unspeci...   
...                                                  ...   
12901  Malig neoplm of upper-outer quadrant of right ...   
12902  Malignant neoplasm of unspecified site of left...   
12903  Malignant neoplasm of unspecified site of left...   
12904  Malignant neoplasm of breast (female), unspeci...   
12905  Malig neoplasm of upper-outer quadrant of left...   

                      clean_breast_cancer_diagnosis_desc  
0      malignantneoplasmofunspsiteofunspecifiedfemale...  
1      maligneoplmofupper-outerquadrantofrightfemaleb...  
2      malignantneoplasmofcentralportionofleftfemaleb...  
3      maligneoplasmofupper-innerquadrantof
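
The fused words in the recorded output above are what a negated character class written with an uppercase `\S` produces: `[^a-zA-Z\S]` matches only whitespace, so every space is deleted, whereas `[^a-zA-Z\s]` deletes digits and punctuation and keeps the spaces. A minimal sketch comparing the two on one description (the sample string is the visible part of row 1 before the display truncates it; `strip_with` is a hypothetical helper used only for this illustration):

```python
import re

# Visible part of row 1 in the output above (the display truncates the rest).
sample = "Malig neoplm of upper-outer quadrant of right"

def strip_with(pattern, text):
    """Delete every character matched by `pattern` and lowercase the rest."""
    return re.sub(pattern, "", text).lower()

# Uppercase \S: the class matches whitespace only, so the words are fused.
fused = strip_with(r"[^a-zA-Z\S]", sample)

# Lowercase \s: the class matches digits and punctuation, so spaces survive.
spaced = strip_with(r"[^a-zA-Z\s]", sample)

print(fused)   # maligneoplmofupper-outerquadrantofright
print(spaced)  # malig neoplm of upperouter quadrant of right
```

Note that deleting the hyphen joins "upper-outer" into "upperouter". Substituting a space for each removed character instead of an empty string would keep the two words apart; which behaviour is preferable depends on how the tokens are used afterwards.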

## TF-IDF Vectorization of Cleaned Breast Cancer Diagnosis Descriptions

The `clean_breast_cancer_diagnosis_desc` column is converted to TF-IDF features in the following steps:

1. **Download NLTK Stop Words:** The stop-word corpus is downloaded from NLTK.

2. **Load NLTK Lemmatizer:** The WordNet lemmatizer from NLTK is loaded.

3. **Tokenization and Lemmatization:** The function `tokenize_and_lemmatize` splits the text with NLTK's word tokenizer, drops stop words and tokens that are not purely alphabetic, and lemmatizes the remaining tokens. It is applied to create a new column `tokens_lemmas`.

4. **Convert Stop Words to List:** The set of stop words is converted to a list so that it can be passed to the vectorizer's `stop_words` parameter.

5. **TF-IDF Vectorization:** scikit-learn's `TfidfVectorizer` is fitted on the `tokens_lemmas` column and produces a TF-IDF matrix. Because the column already holds lists of tokens, the vectorizer's tokenizer and preprocessor are replaced with identity functions.

6. **Print TF-IDF Features (Words):** The features (the vocabulary) learned by the vectorizer are printed.
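
To make step 5 concrete, the following sketch computes TF-IDF weights by hand for a tiny pre-tokenized corpus, using the smoothed IDF and L2 row normalization that `TfidfVectorizer` applies with its default settings. The three token lists and the helper name `tfidf_rows` are made up for illustration; they are not taken from the dataset.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents (already cleaned and lemmatized).
docs = [
    ["malignant", "neoplasm", "left", "female", "breast"],
    ["malignant", "neoplasm", "right", "female", "breast"],
    ["malignant", "neoplasm", "nipple", "areola"],
]

def tfidf_rows(token_lists):
    """Return (vocabulary, rows) where each row holds L2-normalized TF-IDF weights."""
    n_docs = len(token_lists)
    vocab = sorted({tok for doc in token_lists for tok in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in token_lists for tok in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in token_lists:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in every document, such as "malignant" here, receive the lowest IDF, while a term that occurs in a single document, such as "left", is weighted more heavily within its row.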

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

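# Download NLTK stop words
# (word_tokenize and WordNetLemmatizer also rely on NLTK's tokenizer and
# 'wordnet' resources, which must already be installed for this cell to run)
nltk.download('stopwords')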

# Load NLTK Lemmatizer
lemmatizer = WordNetLemmatizer()

# Get NLTK stop words
stop_words = set(stopwords.words('english'))

# Tokenization and Lemmatization function using NLTK
def tokenize_and_lemmatize(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha() and token.lower() not in stop_words]
    return lemmatized_tokens

# Tokenize and lemmatize the cleaned descriptions
training_df['tokens_lemmas'] = training_df['clean_breast_cancer_diagnosis_desc'].apply(tokenize_and_lemmatize)

# Convert set to list for stop words
stop_words_list = list(stop_words)

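# TF-IDF Vectorization
# Each row is already a list of tokens, so the tokenizer and preprocessor are
# identity functions and the default token_pattern is disabled.
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words_list,
                                   tokenizer=lambda x: x,
                                   preprocessor=lambda x: x,
                                   token_pattern=None)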
tfidf_matrix = tfidf_vectorizer.fit_transform(training_df['tokens_lemmas'])

# Print the features (words) obtained from TF-IDF
print("TF-IDF Features (Words):", tfidf_vectorizer.get_feature_names_out())


[nltk_data] Downloading package stopwords to /home/paulet/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


TF-IDF Features (Words): ['female' 'leftfemalebreast'
 'malignantneoplasmofaxillarytailoffemalebreast'
 'malignantneoplasmofaxillarytailofleftfemalebreast'
 'malignantneoplasmofaxillarytailofrightfemalebreast'
 'malignantneoplasmofaxillarytailofunspfemalebreast'
 'malignantneoplasmofbreast' 'malignantneoplasmofbreastfemale'
 'malignantneoplasmofbreastofunspecifiedsite'
 'malignantneoplasmofcentralportionofbreast'
 'malignantneoplasmofcentralportionoffemalebreast'
 'malignantneoplasmofcentralportionofleftfemalebreast'
 'malignantneoplasmofcentralportionofrightfemalebreast'
 'malignantneoplasmofcentralportionofunspfemalebreast'
 'malignantneoplasmofnippleandareola'
 'malignantneoplasmofotherandunspecifiedsitesofmalebreast'
 'malignantneoplasmofotherspecifiedsitesoffemalebreast'
 'malignantneoplasmofoverlappingsitesofbreast'
 'malignantneoplasmofovrlpsitesofleftfemalebreast'
 'malignantneoplasmofovrlpsitesofrightfemalebreast'
 'malignantneoplasmofovrlpsitesofunspfemalebreast'
 'malignantn



### Breast Cancer Diagnosis Code 

In [7]:
distinct_diagnosis_codes = training_df['breast_cancer_diagnosis_code'].unique()
print(distinct_diagnosis_codes)


['C50919' 'C50411' 'C50112' 'C50212' '1749' 'C50912' 'C50512' '1744'
 'C50412' 'C50812' 'C50911' 'C50312' 'C50311' 'C50111' '1741' 'C5091'
 'C50811' '1748' 'C50511' '1743' 'C50211' 'C50011' 'C5051' 'C50012'
 'C50419' '1742' 'C50611' 'C50612' 'C50119' 'C50819' '1746' 'C5041'
 'C50619' '19881' 'C5081' '1745' 'C50219' 'C50319' 'C50019' 'C50519'
 'C50929' 'C50021' 'C5021' 'C5011' 'C5031' 'C509' 'C50' '1759' 'C5001'
 'C50421']
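
The list mixes two visibly different code shapes: values that start with `C50` and purely numeric values such as `1749`. A minimal sketch that separates them by shape, on the assumption (not stated in the source) that the letter-prefixed codes follow ICD-10 and the numeric ones ICD-9. The helper `code_family`, its labels, and the short sample list are illustrative only:

```python
# A few codes copied from the output above.
codes = ["C50919", "C50411", "1749", "1744", "19881", "C509", "C50", "1759"]

def code_family(code):
    """Label a code by shape: leading letter vs. all digits (assumed ICD-10 / ICD-9)."""
    if code[:1].isalpha():
        return "icd10_like"
    if code.isdigit():
        return "icd9_like"
    return "unknown"

families = {code: code_family(code) for code in codes}
print(families)
```

Applied to the full column, the same helper could be used as `training_df['breast_cancer_diagnosis_code'].map(code_family)` to derive a categorical feature.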


In [8]:
training_df.head()

Unnamed: 0,patient_id,patient_race,payer_type,patient_state,patient_zip3,patient_age,patient_gender,bmi,breast_cancer_diagnosis_code,breast_cancer_diagnosis_desc,...,limited_english,commute_time,health_uninsured,veteran,Ozone,PM25,N02,DiagPeriodL90D,clean_breast_cancer_diagnosis_desc,tokens_lemmas
0,475714,,MEDICAID,CA,924,84,F,,C50919,Malignant neoplasm of unsp site of unspecified...,...,10.1,27.814286,11.2,3.5,52.23721,8.650555,18.606528,1,malignantneoplasmofunspsiteofunspecifiedfemale...,[malignantneoplasmofunspsiteofunspecifiedfemal...
1,349367,White,COMMERCIAL,CA,928,62,F,28.49,C50411,Malig neoplm of upper-outer quadrant of right ...,...,8.057576,30.606061,7.018182,4.10303,42.301121,8.487175,20.113179,1,maligneoplmofupper-outerquadrantofrightfemaleb...,[]
2,138632,White,COMMERCIAL,TX,760,43,F,38.09,C50112,Malignant neoplasm of central portion of left ...,...,3.356667,31.394915,15.066667,7.446667,40.108207,7.642753,14.839351,1,malignantneoplasmofcentralportionofleftfemaleb...,[malignantneoplasmofcentralportionofleftfemale...
3,617843,White,COMMERCIAL,CA,926,45,F,,C50212,Malig neoplasm of upper-inner quadrant of left...,...,5.280952,27.561905,4.404762,4.809524,42.070075,7.229393,15.894123,0,maligneoplasmofupper-innerquadrantofleftfemale...,[]
4,817482,,COMMERCIAL,ID,836,55,F,,1749,"Malignant neoplasm of breast (female), unspeci...",...,1.946,26.170213,12.088,13.106,41.356058,4.110749,11.722197,0,"malignantneoplasmofbreastfemale,unspecified","[malignantneoplasmofbreastfemale, unspecified]"
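
Rows 1 and 3 above have empty `tokens_lemmas` lists. That is consistent with the `token.isalpha()` filter in `tokenize_and_lemmatize`: when the cleaned text is a single run of characters that still contains a hyphen, the one resulting token is not purely alphabetic and is dropped. A small sketch of the filter alone, with token lists written by hand to mimic rows 1 and 4 (what `word_tokenize` returns for them is an assumption here, and `keep_alpha` is a hypothetical helper):

```python
def keep_alpha(tokens):
    """Keep only tokens made entirely of letters, as the isalpha() filter does."""
    return [tok for tok in tokens if tok.isalpha()]

# Hand-written stand-ins for tokenizer output (assumed, not produced by NLTK here).
row_1 = ["maligneoplmofupper-outerquadrantofright"]
row_4 = ["malignantneoplasmofbreastfemale", ",", "unspecified"]

print(keep_alpha(row_1))  # the hyphenated token is removed
print(keep_alpha(row_4))  # the comma is removed, both words stay
```

Descriptions that lose all of their tokens this way contribute an all-zero row to the TF-IDF matrix, so it is worth counting the empty lists before relying on these features.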
