# Pathology Report Segmentation 
--------


### WARNING BEFORE COMMITTING
<span style="color:red">Do not commit this demo while PHI is displayed in the cell outputs!</span>


### Introduction

The primary goal of this notebook is to break down pathology reports into sections and sub-sections. These consist of:

- Report header
- Specimens submitted
- Clinical diagnosis
- Pathological diagnosis
    - Synoptic information
    - Free-text
- Addendums
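
As a rough illustration of this kind of segmentation (not the repository's implementation), the sketch below splits a toy report at section headings. The heading strings other than `Specimens Submitted:` and the function name `split_sections` are assumptions made for the example.

```python
import re

# Hypothetical section headings; real reports may use different wording.
SECTION_HEADINGS = [
    "Clinical Diagnosis & History:",
    "Specimens Submitted:",
    "DIAGNOSIS:",
]


def split_sections(report_text):
    """Split a report into {heading: body}; text before the first heading is the header."""
    pattern = "|".join(re.escape(h) for h in SECTION_HEADINGS)
    sections = {}
    current = "HEADER"
    last_end = 0
    for match in re.finditer(pattern, report_text):
        # Close the section that was open and start the next one
        sections[current] = report_text[last_end:match.start()].strip()
        current = match.group(0)
        last_end = match.end()
    sections[current] = report_text[last_end:].strip()
    return sections


# Toy report text, invented for illustration only
toy_report = (
    "PathDoc Version 1.1\r\n"
    "Clinical Diagnosis & History:\r\nHistory of breast ca.\r\n"
    "Specimens Submitted:\r\n1: Left breast\r\n"
    "DIAGNOSIS:\r\n1. Left breast; biopsy: - Benign."
)
sections = split_sections(toy_report)
```

Keeping the headings in one list makes it easy to add variants as new report templates are encountered.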

#### Annotations
Once reports are broken down into their general subsections, specific attributes of the report are captured. These include:
- Date of procedure (different from date of report)
- Source accession number (e.g., the specimen in the current report was originally examined in a previous report)
- Specimen name

### Accessing this Jupyter notebook on an MSK-MIND machine from a local machine

1) On the MIND server, start a remote JupyterLab session:
      
       >> jupyter-lab --no-browser --port=8888
    
2) On the local machine, forward a local port to the remote port:
   
       >> ssh -N -f -L localhost:8080:localhost:8888 <USERNAME>@<MIND-SERVER-NAME>
    
3) Open a local browser; the notebook will be at localhost:8080

Note: If you are running this notebook entirely on your local machine, you can disregard these steps.
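
To confirm that the forwarded port is reachable before opening the browser, a small check like the one below can help. This is a minimal sketch using only the standard library; the helper name `port_is_open` is hypothetical, and the default port matches the local port (8080) used in step 2.

```python
import socket


def port_is_open(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False
```

Calling `port_is_open()` after step 2 should return `True` if the tunnel is up and JupyterLab is listening on the server.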

### Import libraries

In [1]:
import sys  
sys.path.insert(0, '../segmentation')
sys.path.insert(0, '../annotations')
sys.path.insert(0, '../synoptic_classification')
sys.path.insert(0, '../')
import os
import pandas as pd
from darwin_pathology import DarwinDiscoveryPathology
from pathology_parse_surgical import ParseSurgicalPathology
from pathology_parse_molecular import ParseMolecularPathology
from pathology_parse_specimen_submitted import PathologyParseSpecSubmitted
from pathology_parsing_surgical_specimens import ParseSurgicalPathologySpecimens
from pathology_extract_accession import PathologyExtractAccession
from pathology_extract_dop import PathologyExtractDOP
from pathology_extract_dop_impact_wrapper import CombineAccessionDOPImpact
from pathology_impact_summary_dop_annotator import PathologyImpactDOPAnno
import constants_darwin_pathology as c_dar
from utils_pathology import set_debug_console

In [2]:
# Extend debug window
set_debug_console()

-----------------------
## Create MSK-IMPACT pathology annotation summary table

#### Parameters

In [3]:
run_path_clean = True
run_parse_surg = True
run_parse_dmp = True
run_parse_spec_sub = True
run_parse_path_dx = True
run_spec_sub_path_dx_combiner = False

# TODO: Move the rest to annotation section
run_parse_accession = True
run_parse_dop = True
run_CombineAccessionDOPImpact = True
run_dop_extra_anno = True

#### Filenames
Filenames used for this demo are defined here: `clinical_data_mining/pathology_report_segmentation/constants_darwin_pathology.py`

#### Column names

In [4]:
## Constants
col_label_access_num = 'ACCESSION_NUMBER'
col_label_spec_num = 'SPECIMEN_NUMBER'
col_spec_sub = 'SPECIMEN_SUBMITTED'


-------
## Run Code for header segmentation

### DarwinDiscoveryPathology


**Run initial cleaning of the pathology table that comes directly from the DDP scraper**
- Source file: `darwin_pathology.py`

- Input: Pathology report table from DDP (Typically `Darwin_Digital_Pathology_Reports.csv`)

- Output: Dataframe returned with basic cleaning done (Typically: `table_pathology_clean.csv`)
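
The cleaning steps themselves live in `darwin_pathology.py` and are not shown here. As a hedged illustration only, the sketch below derives two columns that appear in the cleaned output (`RPT_CHAR_LEN` and `PATH_REPORT_TYPE_GENERAL`). The function name `add_derived_columns`, the type mapping, and the assumption that `RPT_CHAR_LEN` is the character length of `PATH_REPORT_NOTE` are guesses based on the printed table.

```python
import pandas as pd


def add_derived_columns(df):
    """Add report length and a coarse report type (assumed definitions)."""
    df = df.copy()
    # Assumption: RPT_CHAR_LEN is the number of characters in the report note
    df["RPT_CHAR_LEN"] = df["PATH_REPORT_NOTE"].fillna("").str.len()
    # Assumption: the general type is a simple lookup on the full report type
    type_map = {
        "Surgical Pathology": "Surgical",
        "Diagnostic Molecular Pathology": "Molecular",
    }
    df["PATH_REPORT_TYPE_GENERAL"] = df["PATH_REPORT_TYPE"].map(type_map)
    return df


# Toy rows, invented for illustration
df_raw = pd.DataFrame({
    "PATH_REPORT_TYPE": ["Surgical Pathology", "Diagnostic Molecular Pathology"],
    "PATH_REPORT_NOTE": ["abc", None],
})
df_demo = add_derived_columns(df_raw)
```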

In [5]:
# Recreate cleaned pathology data
if run_path_clean:
    print('Running DarwinDiscoveryPathology...')
    obj_path = DarwinDiscoveryPathology(pathname=c_dar.pathname,
                                         fname=c_dar.fname_path_ddp,  # Change name
                                         fname_out=c_dar.fname_darwin_path_clean)

Running DarwinDiscoveryPathology...
Loading pathology/DDP_Pathology_Reports_20210209.tsv
Saving pathology/table_pathology_clean.csv
Saved.


In [6]:
obj_path.return_df().head()

Unnamed: 0,P_ID,PATH_REPORT_TYPE,DTE_PATH_PROCEDURE,PATH_RPT_ID,ACCESSION_NUMBER,ASSOCIATED_PATH_REPORT_ID,DMP_ID,SAMPLE_ID,PATH_REPORT_NOTE,SPECIMEN_SUBMISSION_LIST,PATH_REPORT_TYPE_GENERAL,RPT_CHAR_LEN
575370,1130006,Surgical Pathology,2007-08-28,10,S07-35014,,,,PathDoc Version 1.1\r\nMRN: 35172806\r\nAccoun...,Specimens Submitted:\r\n1: SP: Right para-aort...,Surgical,9214
579680,986772,Surgical Pathology,2006-04-20,46,S06-15377,1503502.0,,,PathDoc Version 1.1\r\nMRN: 00317869\r\nAccoun...,Specimens Submitted:\r\n1: BI: Left breast (tl...,Surgical,2823
579226,1082701,Surgical Pathology,2006-06-16,75,S06-23341,,,,PathDoc Version 1.1\r\nMRN: 35129311\r\nAccoun...,Specimens Submitted:\r\n1: SP: Sigmoid colon (...,Surgical,5246
579173,356717,Surgical Pathology,2006-06-22,110,S06-23994,,,,PathDoc Version 1.1\r\nMRN: 00011541\r\nAccoun...,Specimens Submitted:\r\n1: SP: Hepatic artery ...,Surgical,2566
45004,1081931,Surgical Pathology,2006-06-23,133,S06-24315,,,,PathDoc Version 1.1\r\nMRN: 35129246\r\nAccoun...,"Specimens Submitted:\r\n1: SP: Lymph nodes, Ri...",Surgical,6788


In [7]:
# df_path_clean = obj_path.return_original_df()
# df_path_clean[df_path_clean['Accession Number'] == 'M19-44183']
df_path_clean = obj_path.return_df()
df_path_clean[df_path_clean['ACCESSION_NUMBER'] == 'M15-816']


Unnamed: 0,P_ID,PATH_REPORT_TYPE,DTE_PATH_PROCEDURE,PATH_RPT_ID,ACCESSION_NUMBER,ASSOCIATED_PATH_REPORT_ID,DMP_ID,SAMPLE_ID,PATH_REPORT_NOTE,SPECIMEN_SUBMISSION_LIST,PATH_REPORT_TYPE_GENERAL,RPT_CHAR_LEN
41101,1637773,Diagnostic Molecular Pathology,2015-01-14,2408283,M15-816,2318448,P-0002099,P-0002099-T02-IM3,PathDoc Version 1.1\r\nMRN: 35434061\r\nAccoun...,Specimens Submitted:\r\n1: Left testis and sp...,Molecular,4670
41102,1637773,Diagnostic Molecular Pathology,2015-01-14,2408283,M15-816,2318448,P-0002099,P-0002099-T02-IM5,PathDoc Version 1.1\r\nMRN: 35434061\r\nAccoun...,Specimens Submitted:\r\n1: Left testis and sp...,Molecular,4670


### ParseSurgicalPathology
**Parse main sections of surgical pathology reports**
- Source file: `pathology_parse_surgical.py`
- Input: Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)

- Output: Dataframe of surgical pathology reports (S-numbers) with main sections separated by column  (Typically: `table_pathology_surgical_notes_parsed.csv`)

In [5]:
# Using the cleaned pathology table, parse the main sections of the surgical pathology note
# Call pathology_parse_surgical.py
if run_parse_surg:
    print('Running ParseSurgicalPathology...')
    obj_path_parse = ParseSurgicalPathology(pathname=c_dar.pathname,
                                            fname_path_clean=c_dar.fname_darwin_path_clean,
                                            fname_save=c_dar.fname_darwin_path_surgical)

Running ParseSurgicalPathology...


Extracting Section: HEADER
Extracting Section: IND_CLINICAL_DX_1
Extracting Section: IND_SPEC_SUB_1
Extracting Section: IND_PATH_DX_1
Parsing Remaining Text
Saving pathology/table_pathology_surgical_notes_parsed.csv
Saved.


In [6]:
df_obj_path_parse = obj_path_parse.return_df_summary()
df_obj_path_parse.head()
# df_obj_path_parse['DMP_ID'].nunique()


Unnamed: 0,P_ID,ACCESSION_NUMBER,PATH_NOTE_HEADER,PATH_NOTE_CLINICAL_DX,PATH_NOTE_SPEC_SUB,PATH_NOTE_PATH_DX,PATH_NOTE_OTHER
116961,63,S18-31248,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,Left lumpectomy. Bilateral calcifications. Ple...,"1: Breast, left; stereotactic biopsy bilat\r\...","1. Breast, left; stereotactic biopsy bilat: ...",I ATTEST THAT THE ABOVE DIAGNOSIS IS BASED UPO...
118435,63,S18-36445,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,"Breast, malignant neoplasm. 9/19/13 statu post...","1: Breast, left; needle localization excision ...","1. Breast, left; needle localization excision...",I ATTEST THAT THE ABOVE DIAGNOSIS IS BASED UPO...
119210,63,S18-38353,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,56 year old female history of left DCISand IDC...,1: Sentinel node #1 level 1 left axilla (fsjr)...,Summary of findings in this case: Left Breast\...,I ATTEST THAT THE ABOVE DIAGNOSIS IS BASED UPO...
147622,63,S19-49361,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,History of breast ca. Left exchange.,1: Left tissue expander removed from chest wall,1. Left tissue expander removed from chest wa...,I ATTEST THAT THE ABOVE DIAGNOSIS IS BASED UPO...
11087,162,S03-16882,PathDoc Version 1.1\r\nMRN: 35001251\r\nAccoun...,Left parotid gland.\r\nInfiltrating squamous c...,"1: A. ZYGOMATIC FASCIA, B. FACIAL NERVE MARGIN...","1A. SOFT TISSUE, ZYGOMATIC FASCIA; BIOPSY ...",I ATTEST THAT THE ABOVE DIAGNOSIS IS BASED UPO...


### ParseMolecularPathology
**Parse main sections of molecular pathology (MSK-IMPACT) reports**
- Source file: `pathology_parse_molecular.py`
- Input: Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)

- Output: Dataframe of molecular pathology reports (M-numbers) with main sections separated by column  (Typically: `table_pathology_molecular_notes_parsed.csv`)

In [6]:
# Using the cleaned pathology table, parse the main sections of the molecular pathology note
if run_parse_dmp:
    print('Running ParseMolecularPathology...')
    obj_parse_dmp = ParseMolecularPathology(pathname=c_dar.pathname,
                                           fname_path_clean=c_dar.fname_darwin_path_clean,
                                           fname_save=c_dar.fname_darwin_path_molecular)

Running ParseMolecularPathology...
Loading clean pathology table
Number of samples in list: 224893
Extracting Section: IND_SPEC_SUB_1
Extracting Section: IND_PATH_DX_1
Extracting Section: IND_HEADER_0
Saving pathology/table_pathology_molecular_notes_parsed.csv
Saved.


In [7]:
df_obj_parse_dmp = obj_parse_dmp.return_df_summary()
df_obj_parse_dmp.head()

Unnamed: 0,P_ID,ACCESSION_NUMBER,DMP_NOTE_RPT_HEADER,DMP_NOTE_SPEC_SUB,DMP_NOTE_DX_AND_OTHER
147195,63,M18-15941,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,"1: Breast, left; stereotactic biopsy bilat (M...",HER2/NEU GENE STATUS - AMPLIFICATION IS DETECT...
152666,63,M18-19170,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,"1: EDTA Blood, Matched Normal for MSK-IMPACT",BLOOD RECEIVED FOR USE AS MATCHED NORMAL FOR M...
155518,63,M18-19882,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,"1: Breast, left, mastectomy (MSK:S18-38353/3-1...",TEST FAILURE (See note)\r\n\r\nNote: Please se...
154752,63,M18-19994,PathDoc Version 1.1\r\nMRN: 00184795\r\nAccoun...,"1: Breast, left, mastectomy (MSK:S18-38353/3-1...",NEGATIVE FOR SOMATIC ALTERATIONS IN THE CLINIC...
196364,162,DMG19-3288,PathDoc Version 1.1\r\nMRN: 35001251\r\nAccoun...,"1: DNA from Peripheral Blood (M19-10514, DOP: ...",TEST PERFORMED:\r\nSecondary Germline MSK-IMPA...


### PathologyParseSpecSubmitted
**Parses specimen submitted column into individual parts**
- Source file: `pathology_parse_specimen_submitted.py`
- Input: 
    - Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)
    - Column name for specimens submitted
    - List of column names used as the index for the output (Typically only patient ID and accession number are needed)

- Output: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)
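
The idea of turning one specimen list into one row per part can be sketched as follows. This is not the code in `pathology_parse_specimen_submitted.py`; the regular expression and the helper names `parse_specimen_list` and `explode_specimens` are assumptions, based on the `N: description` layout visible in `SPECIMEN_SUBMISSION_LIST`.

```python
import re
import pandas as pd

# Assumed layout: each part is on its own line as "<number>: <description>"
PART_PATTERN = re.compile(r"^\s*(\d+):\s*(.+?)\s*$", flags=re.MULTILINE)


def parse_specimen_list(text):
    """Return [(part_number, description), ...] for one specimen list."""
    return [(int(num), desc) for num, desc in PART_PATTERN.findall(text)]


def explode_specimens(df, col_spec_sub, list_cols_id):
    """Build a long table with one row per specimen part."""
    rows = []
    for _, row in df.iterrows():
        text = row[col_spec_sub]
        if not isinstance(text, str):
            continue  # skip missing specimen lists
        for num, desc in parse_specimen_list(text):
            record = {col: row[col] for col in list_cols_id}
            record["SPECIMEN_NUMBER"] = num
            record["SPECIMEN_SUBMITTED"] = desc
            rows.append(record)
    return pd.DataFrame(rows)


# Toy input with invented identifiers
df_toy = pd.DataFrame({
    "P_ID": [1],
    "ACCESSION_NUMBER": ["S00-00001"],
    "SPECIMEN_SUBMISSION_LIST": [
        "Specimens Submitted:\r\n1: SP: Right para-aortic lymph nodes (am)\r\n"
        "2: SP: Right common iliac lymph nodes (am)"
    ],
})
df_parts = explode_specimens(df_toy, "SPECIMEN_SUBMISSION_LIST",
                             ["P_ID", "ACCESSION_NUMBER"])
```

A line-anchored pattern like this will miss parts that run together on one line, so real data likely needs extra handling.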

In [8]:
if run_parse_spec_sub:
    # Parse the individual part descriptions from the specimen submitted column of parsed reports
    # Segment specimen submissions for molecular path
    print('Running PathologyParseSpecSubmitted...')
    obj_spec_sub = PathologyParseSpecSubmitted(pathname=c_dar.pathname,
                                              fname_path_parsed=c_dar.fname_darwin_path_clean,
                                              col_spec_sub='SPECIMEN_SUBMISSION_LIST',
                                              list_cols_id=['P_ID', 'ACCESSION_NUMBER'],
                                              fname_save=c_dar.fname_darwin_path_col_spec_sub)

Running PathologyParseSpecSubmitted...
Loading pathology table containing specimen submitted info
Parsing Specimen List
Saving pathology/table_pathology_col_spec_sub.csv
Saved.


In [12]:
df_spec_sub = obj_spec_sub.return_df()
df_spec_sub.head(10)

Unnamed: 0,P_ID,ACCESSION_NUMBER,SPECIMEN_NUMBER,SPECIMEN_SUBMITTED
0,1130006,S07-35014,1,SP: Right para-aortic lymph nodes (am)
1,1130006,S07-35014,2,SP: Right common iliac lymph nodes (am)
2,1130006,S07-35014,3,SP: Left para-aortic lymph nodes (am)
3,1130006,S07-35014,4,SP: Left common iliac lymph nodes (am)
4,1130006,S07-35014,5,SP: Left external iliac lymph nodes (am)
5,1130006,S07-35014,6,SP: Left obturator lymph node (am)
6,1130006,S07-35014,7,SP: Right external iliac lymph node (am)
7,1130006,S07-35014,8,SP: Right obturator lymph node (am0 9: SP: Ute...
8,986772,S06-15377,1,BI: Left breast (tl)
9,986772,S06-15377,2,BI: Left breast (5cm) (tl)


### ParseSurgicalPathologySpecimens
**Parses specimen submitted SECTION in surgical path reports into individual parts**
- Source file: `pathology_parsing_surgical_specimens.py`
- Input: Parsed surgical pathology report table (Main headers) from `ParseSurgicalPathology` (Typically `table_pathology_surgical_notes_parsed.csv`)


- Output: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_surgical_samples_parsed_specimen.csv`)
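
A minimal sketch of splitting a pathological diagnosis section into numbered parts is shown below. It assumes each part starts with a line of the form `N. TITLE:` followed by the description, which matches the examples in this notebook but may not cover all reports. The pattern and the function name `parse_path_dx` are hypothetical and are not taken from `pathology_parsing_surgical_specimens.py`.

```python
import re

# Assumed part header: "<number>. <title>:" on its own line
PART_HEADER = re.compile(r"^\s*(\d+)\.\s+(.+?):\s*$", flags=re.MULTILINE)


def parse_path_dx(text):
    """Return a list of dicts with part number, title, and description."""
    matches = list(PART_HEADER.finditer(text))
    parts = []
    for i, m in enumerate(matches):
        # The description runs until the next part header or the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts.append({
            "PATH_DX_SPEC_NUM": int(m.group(1)),
            "PATH_DX_SPEC_TITLE": m.group(2).strip(),
            "PATH_DX_SPEC_DESC": text[m.end():end].strip(),
        })
    return parts


# Toy diagnosis section, invented for illustration
toy_dx = (
    "1. BLADDER; BIOPSY:\r\n- PAPILLARY UROTHELIAL NEOPLASM.\r\n\r\n"
    "2. LYMPH NODE, RIGHT PELVIC; EXCISION:\r\n"
    "- NO TUMOR SEEN IN FIVE LYMPH NODES (0/5)."
)
dx_parts = parse_path_dx(toy_dx)
```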

In [5]:
if run_parse_path_dx:
    print('Running ParseSurgicalPathologySpecimens...')
    pathname = c_dar.pathname
    fname_out_pathology_specimens_parsed = c_dar.fname_darwin_path_clean_parsed_specimen
    fname_darwin_pathology_parsed = c_dar.fname_darwin_path_surgical
    obj_parse = ParseSurgicalPathologySpecimens(pathname=pathname,
                                                fname_darwin_pathology_parsed=fname_darwin_pathology_parsed,
                                                fname_out_pathology_specimens_parsed=fname_out_pathology_specimens_parsed)

Running ParseSurgicalPathologySpecimens...
Parsing Pathology Diagnosis Section
Elongating Parsed Pathology Data
Saving pathology/table_pathology_surgical_samples_parsed_specimen.csv
Saved.


In [6]:
df_surg_path_parsed_spec = obj_parse.return_df_parsed_spec()
df_surg_path_parsed_spec.head()



Unnamed: 0,ACCESSION_NUMBER,PATH_DX_SPEC_NUM,PATH_DX_SPEC_TITLE,PATH_DX_SPEC_DESC
0,H16-6456,1.0,"(220) Right Lower Abdomen, Biopsy- Punch, Phot...",-Myeloid sarcoma.\r\n\r\n\r\nMORPHOLOGY\r\nThe...
1,M16-2183,0.0,,
2,S00-10044,1.0,BLADDER; BIOPSY,- PAPILLARY UROTHELIAL NEOPLASM OF LOW MALIGNA...
3,S00-10088,1.0,PROSTATE AND SEMINAL VESICLES; RADICAL PROSTAT...,- ADENOCARCINOMA OF PROSTATE. THE GLEASON'S G...
4,S00-10088,2.0,"LYMPH NODE, RIGHT PELVIC; EXCISION",- NO TUMOR SEEN IN FIVE LYMPH NODES (0/5).


In [20]:
print(df_surg_path_parsed_spec.shape)
print(df_surg_path_parsed_spec['PATH_DX_SPEC_TITLE'].isnull().sum())

df_surg_path_parsed_spec[df_surg_path_parsed_spec['PATH_DX_SPEC_NUM'] == 0]
df_surg_path_parsed_spec.loc[df_surg_path_parsed_spec['PATH_DX_SPEC_NUM'] == 0, 'ACCESSION_NUMBER'].str[1:3].value_counts();

(440280, 4)
6775


In [19]:
df_surg_path_parsed = obj_parse.return_df_parsed()
df_surg_path_parsed[df_surg_path_parsed['ACCESSION_NUMBER'] == 'M16-2183']

Unnamed: 0,P_ID,ACCESSION_NUMBER,PATH_NOTE_HEADER,PATH_NOTE_CLINICAL_DX,PATH_NOTE_SPEC_SUB,PATH_NOTE_PATH_DX,PATH_NOTE_OTHER
68601,1578551,M16-2183,PathDoc Version 1.1\r\nMRN: 35400711\r\nAccoun...,,1: Normal BLOOD\r\n\r\n\r\n\r\nDIAGNOSTIC INT...,,


### run_spec_sub_path_dx_combiner
##### TODO!!! 
This step will combine the specimen submitted values from the column version with those from the section parsed out of the reports. It is not implemented yet.

In [None]:
if run_spec_sub_path_dx_combiner:
    print('Running Combiner...')
    run = None

print('Complete!')

----------------
## Run code for abstracting elements for annotations
Once pathology reports are segmented, abstract data elements for annotations.

### PathologyExtractAccession
**Automatically abstracts source accession number for specimen submitted parts**
- Source file: `pathology_extract_accession.py`
- Input: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

- Output: Dataframe of accession and part number embedded in free-text (Typically: `path_accessions.csv`)
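
For intuition, pulling a source accession and part number out of a specimen description can be sketched with a regular expression. The pattern below is an assumption based on accession numbers seen in this notebook (for example `S18-38353`, `M15-816`, `DMG19-3288`) and the `accession/part` form. It is not the logic in `pathology_extract_accession.py`, and `extract_source_accession` is a hypothetical name.

```python
import re

# Assumed form: 1-3 capital letters, 2-digit year, dash, number, optional "/part"
ACCESSION_PATTERN = re.compile(r"\b([A-Z]{1,3}\d{2}-\d+)(?:/(\d+))?")


def extract_source_accession(spec_text):
    """Return (accession, part_number) for the first match, else (None, None)."""
    m = ACCESSION_PATTERN.search(spec_text)
    if m is None:
        return None, None
    part = int(m.group(2)) if m.group(2) else None
    return m.group(1), part


# Toy specimen descriptions with invented accession numbers
with_source = extract_source_accession("Breast, left, mastectomy (MSK:S00-12345/2)")
without_source = extract_source_accession("EDTA Blood, Matched Normal")
```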

In [21]:
if run_parse_accession:
    print('Running PathologyExtractAccession...')
    obj_ext_accession = PathologyExtractAccession(pathname=c_dar.pathname,
                                                fname=c_dar.fname_darwin_path_col_spec_sub,
                                                col_label_access_num=col_label_access_num,
                                                col_label_spec_num=col_label_spec_num,
                                                col_spec_sub=col_spec_sub,
                                                fname_out=c_dar.fname_accessions)

Running PathologyExtractAccession...
Extracting Matching Accession Numbers


KeyError: "['DMP_ID'] not in index"
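
The `KeyError` above indicates that a `DMP_ID` column was requested from a table that does not have one; the specimen table printed earlier contains only `P_ID`, `ACCESSION_NUMBER`, `SPECIMEN_NUMBER`, and `SPECIMEN_SUBMITTED`. A quick way to surface this before running a step is to check for required columns. The helper `missing_columns` below is a hypothetical sketch, not part of the repository.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Empty frame with the columns of the specimen table shown earlier
df_example = pd.DataFrame(columns=["P_ID", "ACCESSION_NUMBER",
                                   "SPECIMEN_NUMBER", "SPECIMEN_SUBMITTED"])
missing = missing_columns(df_example, ["ACCESSION_NUMBER", "DMP_ID"])
```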

In [None]:
df_accessions = obj_ext_accession.return_df()
df_accessions.head()

### PathologyExtractDOP
**Extracts the date of procedure (DOP) of each specimen part from the specimen submitted sections**
- Source file: `pathology_extract_dop.py`
- Input: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

- Output: Dataframe of date of procedure for each part number embedded in free-text (Typically: `pathology_spec_part_dop.csv`)
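
A hedged sketch of DOP extraction from free text follows. It assumes the date appears as `DOP: M/D/YYYY` (or with a two-digit year); the exact formats handled by `pathology_extract_dop.py` are not shown in this notebook, and `extract_dop` is a hypothetical name.

```python
import re
from datetime import datetime

# Assumed form: "DOP:" followed by a slash-separated date
DOP_PATTERN = re.compile(r"DOP:\s*(\d{1,2}/\d{1,2}/\d{2,4})")


def extract_dop(spec_text):
    """Return the date of procedure as a date object, or None if not found."""
    m = DOP_PATTERN.search(spec_text)
    if m is None:
        return None
    raw = m.group(1)
    # Try a four-digit year first, then a two-digit year
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None


# Toy specimen description with an invented accession number and date
dop = extract_dop("DNA from Peripheral Blood (M00-00001, DOP: 3/14/2019)")
```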

In [None]:
# Create annotation for date of procedure (DOP) for all pathology reports/specimen part, if indicated
if run_parse_dop:
    print('Running PathologyExtractDOP...')
    obj_dop = PathologyExtractDOP(pathname=c_dar.pathname,
                                    fname=c_dar.fname_darwin_path_col_spec_sub,
                                    col_label_access_num=col_label_access_num,
                                    col_label_spec_num=col_label_spec_num,
                                    col_spec_sub=col_spec_sub,
                                    list_accession=None,
                                    fname_out=c_dar.fname_spec_part_dop)

In [None]:
df_obj_dop = obj_dop.return_df()
df_obj_dop.head()

### CombineAccessionDOPImpact
**Generates a summary table of source accession and DOP for a given pathology report.**
- Source file: `pathology_extract_dop_impact_wrapper.py`
- Input: 
    - fname_accession=`path_accessions.csv`
    - fname_dop=`pathology_spec_part_dop.csv`
    - fname_path=`table_pathology_clean.csv`
- Output: `pathology_dop_impact_summary.csv`
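
Conceptually, this step joins the three inputs on their shared keys. The sketch below shows one way to do that with `pandas.DataFrame.merge`. The columns `SOURCE_ACCESSION_NUMBER` and `DATE_OF_PROCEDURE` and all values are invented for illustration, and the actual join logic in `pathology_extract_dop_impact_wrapper.py` may differ.

```python
import pandas as pd

keys = ["ACCESSION_NUMBER", "SPECIMEN_NUMBER"]

# Toy stand-ins for the three input tables
df_acc = pd.DataFrame({
    "ACCESSION_NUMBER": ["M00-1", "M00-2"],
    "SPECIMEN_NUMBER": [1, 1],
    "SOURCE_ACCESSION_NUMBER": ["S00-10", None],
})
df_dop = pd.DataFrame({
    "ACCESSION_NUMBER": ["M00-1"],
    "SPECIMEN_NUMBER": [1],
    "DATE_OF_PROCEDURE": ["2019-03-14"],
})
df_path = pd.DataFrame({
    "ACCESSION_NUMBER": ["M00-1", "M00-2"],
    "DTE_PATH_PROCEDURE": ["2019-04-01", "2019-05-01"],
})

# Left joins keep every report, even when no source accession or DOP was found
summary = (
    df_path
    .merge(df_acc, on="ACCESSION_NUMBER", how="left")
    .merge(df_dop, on=keys, how="left")
)
```

Using left joins from the report table means reports without an extracted DOP remain in the summary with a missing value, which is what the next step would then try to fill.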

In [None]:
# Create table of M accessions of IMPACT samples, source accession number, dates of reports and procedures
# Call pathology_extract_dop_impact_wrapper.py
if run_CombineAccessionDOPImpact:
    print('Running CombineAccessionDOPImpact...')
    obj_p = CombineAccessionDOPImpact(pathname=c_dar.pathname,
                                      fname_accession=c_dar.fname_accessions,
                                      fname_dop=c_dar.fname_spec_part_dop,
                                      fname_path=c_dar.fname_darwin_path_clean,
                                      fname_out=c_dar.fname_combine_dop_accession)

In [None]:
df_dop_accession = obj_p.return_df()
df_dop_accession.head()

### PathologyImpactDOPAnno
**Fills in missing DOPs by comparing dates of source surgical pathology reports with surgical and IR dates and labelling matches as the DOP.**
- Source file: `pathology_impact_summary_dop_annotator.py`
- Input:  
    - fname_path_summary=`pathology_dop_impact_summary.csv`
    - fname_surgery=`table_surgery.tsv`
    - fname_ir=`table_investigational_radiology.tsv`
- Output: `table_pathology_impact_sample_summary_dop_anno.csv`
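
The date-matching idea can be sketched as below: where the DOP is missing and the source report date coincides with a surgery (or IR) date for the same patient, that date is used as the DOP. The column names `SOURCE_REPORT_DATE`, `DATE_OF_PROCEDURE`, and `SURGERY_DATE`, and the function `fill_dop_from_events`, are assumptions for illustration rather than the annotator's actual schema.

```python
import pandas as pd


def fill_dop_from_events(df_summary, df_events, col_event_date):
    """Fill missing DOPs when the source report date equals an event date."""
    events = (
        df_events[["P_ID", col_event_date]]
        .drop_duplicates()
        .rename(columns={col_event_date: "SOURCE_REPORT_DATE"})
    )
    events["_EVENT_MATCH"] = True
    merged = df_summary.merge(events, on=["P_ID", "SOURCE_REPORT_DATE"], how="left")
    # Only fill rows that lack a DOP and have a same-day event
    matched = merged["DATE_OF_PROCEDURE"].isna() & merged["_EVENT_MATCH"].notna()
    merged.loc[matched, "DATE_OF_PROCEDURE"] = merged.loc[matched, "SOURCE_REPORT_DATE"]
    return merged.drop(columns="_EVENT_MATCH")


# Toy tables with invented patients and dates
df_summary_toy = pd.DataFrame({
    "P_ID": [1, 1, 2],
    "SOURCE_REPORT_DATE": pd.to_datetime(["2019-03-14", "2019-06-01", "2020-01-10"]),
    "DATE_OF_PROCEDURE": pd.to_datetime(["2019-03-14", None, None]),
})
df_surgery_toy = pd.DataFrame({
    "P_ID": [1],
    "SURGERY_DATE": pd.to_datetime(["2019-06-01"]),
})
df_filled = fill_dop_from_events(df_summary_toy, df_surgery_toy, "SURGERY_DATE")
```

The same function could be applied a second time with the IR table to cover procedures that were not surgeries.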

In [None]:
# Add annotations for surgical reports that are on the same day as the surgery/IR
# Call pathology_impact_summary_dop_annotator.py
if run_dop_extra_anno:
    print('Running PathologyImpactDOPAnno...')
    obj_dop_anno = PathologyImpactDOPAnno(pathname=c_dar.pathname,
                                          fname_path_summary=c_dar.fname_combine_dop_accession,
                                          fname_surgery=c_dar.fname_darwin_surgery,
                                          fname_ir=c_dar.fname_darwin_ir,
                                          fname_save=c_dar.fname_dop_anno)

In [None]:
df_obj_dop_anno = obj_dop_anno.return_summary()
df_obj_dop_anno.head()

In [None]:
df_obj_dop_anno[df_obj_dop_anno['DMP_ID'] == 'P-0030151']