# Pathology Report Segmentation 
--------


### WARNING BEFORE COMMITTING
<span style="color:red">Do not commit this demo where PHI is displayed!!</span>


### Introduction

The primary goal of this notebook is to break down pathology reports into sections and sub-sections. These consist of:

- Report header
- Specimens submitted
- Clinical diagnosis
- Pathological diagnosis
    - Synoptic information
    - Free-text
- Addendums
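
The segmentation idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in `pathology_parse_surgical.py`: the header strings in `SECTION_HEADERS`, the function name `split_report_sections`, and the sample report are hypothetical, and the sample text is synthetic (no PHI).

```python
import re

# Hypothetical section headers; real reports may use different wording.
SECTION_HEADERS = [
    "Specimens Submitted:",
    "Clinical Diagnosis:",
    "Pathological Diagnosis:",
    "Addendum:",
]


def split_report_sections(report_text):
    """Split a report into {section name: text}.

    Text before the first recognized header is stored as the report header.
    """
    pattern = "|".join(re.escape(h) for h in SECTION_HEADERS)
    matches = list(re.finditer(pattern, report_text, flags=re.IGNORECASE))
    first_start = matches[0].start() if matches else len(report_text)
    sections = {"Report header": report_text[:first_start].strip()}
    for i, m in enumerate(matches):
        # A section runs from the end of its header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        name = m.group(0).rstrip(":").capitalize()
        sections[name] = report_text[m.end():end].strip()
    return sections


# Synthetic example report
sample = (
    "Example Hospital, Department of Pathology\n"
    "Specimens Submitted:\n1: Left breast, biopsy\n"
    "Clinical Diagnosis:\nBreast mass\n"
    "Pathological Diagnosis:\n1: Invasive ductal carcinoma\n"
)
sections = split_report_sections(sample)
for name, text in sections.items():
    print(f"{name!r}: {text!r}")
```

A dictionary keyed by section name maps naturally onto one dataframe column per section, which is how the parsed tables later in this notebook are described.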

#### Annotations
Once reports are broken down into their general subsections, specific attributes of the report are captured. These include:
- Date of procedure (DOP), which is different from the date of report
- Source accession number (e.g., the specimen in the current report was originally examined in a previous report)
- Specimen name

### Accessing this Jupyter notebook on an MSK-MIND machine from a local machine

1) On the MIND server, start a remote Jupyter notebook server:
      
       >> jupyter-lab --no-browser --port=8888
    
2) On the local machine, forward a local port to the remote port:
   
       >> ssh -N -f -L localhost:8080:localhost:8888 <USERNAME>@<MIND-SERVER-NAME>
    
3) Open a local browser; the notebook will be at localhost:8080

Note: If you are running this notebook entirely on your local machine, you can disregard these steps.

### Import libraries

In [1]:
import sys  
sys.path.insert(0, '../segmentation')
sys.path.insert(0, '../annotations')
sys.path.insert(0, '../synoptic_classification')
sys.path.insert(0, '../')
import os
import pandas as pd
from darwin_pathology import DarwinDiscoveryPathology
from pathology_parse_surgical import ParseSurgicalPathology
from pathology_parse_molecular import ParseMolecularPathology
from pathology_parse_specimen_submitted import PathologyParseSpecSubmitted
from pathology_parsing_surgical_specimens import ParseSurgicalPathologySpecimens
from pathology_extract_accession import PathologyExtractAccession
from pathology_extract_dop import PathologyExtractDOP
from pathology_extract_dop_impact_wrapper import CombineAccessionDOPImpact
from pathology_impact_summary_dop_annotator import PathologyImpactDOPAnno
import constants_darwin_pathology as c_dar
from utils_pathology import set_debug_console

In [2]:
# Extend debug window
set_debug_console()

-----------------------
## Create MSK-IMPACT pathology annotation summary table

#### Parameters

In [3]:
run_path_clean = True
run_parse_surg = True
run_parse_dmp = True
run_parse_spec_sub = True
run_parse_path_dx = True
run_spec_sub_path_dx_combiner = False

# TODO: Move the rest to annotation section
run_parse_accession = True
run_parse_dop = True
run_CombineAccessionDOPImpact = True
run_dop_extra_anno = True

#### Filenames
Filenames used for this demo are defined here: `clinical_data_mining/pathology_report_segmentation/constants_darwin_pathology.py`

#### Column names

In [4]:
## Constants
col_label_access_num = 'ACCESSION_NUMBER'
col_label_spec_num = 'SPECIMEN_NUMBER'
col_spec_sub = 'SPECIMEN_SUBMITTED'


-------
## Run Code for header segmentation

### DarwinDiscoveryPathology


**Run initial cleaning of the pathology table that comes directly from the DDP scraper**
- Source file: `darwin_pathology.py`

- Input: Pathology report table from DDP (Typically `Darwin_Digital_Pathology_Reports.csv`)

- Output: Dataframe returned with basic cleaning done (Typically: `table_pathology_clean.csv`)

In [5]:
# Recreate cleaned pathology data
if run_path_clean:
    print('Running DarwinDiscoveryPathology...')
    obj_path = DarwinDiscoveryPathology(pathname=c_dar.pathname,
                                         fname=c_dar.fname_path_ddp,  # Change name
                                         fname_out=c_dar.fname_darwin_path_clean)

Running DarwinDiscoveryPathology...
Loading pathology/DDP_Pathology_Reports_20210428.tsv
Saving pathology/table_pathology_clean.csv
Saved.


In [6]:
df_obj_path = obj_path.return_df()
df_obj_path.head();

### ParseSurgicalPathology
**Parse main sections of surgical pathology reports**
- Source file: `pathology_parse_surgical.py`
- Input: Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)

- Output: Dataframe of surgical pathology reports (S-numbers) with main sections separated by column  (Typically: `table_pathology_surgical_notes_parsed.csv`)

In [7]:
# Using the cleaned pathology table, parse the main sections of the surgical pathology note
# Source: pathology_parse_surgical.py
if run_parse_surg:
    print('Running ParseSurgicalPathology...')
    obj_path_parse = ParseSurgicalPathology(pathname=c_dar.pathname,
                                            fname_path_clean=c_dar.fname_darwin_path_clean,
                                            fname_save=c_dar.fname_darwin_path_surgical)

Running ParseSurgicalPathology...
Extracting Section: HEADER
Extracting Section: IND_CLINICAL_DX_1
Extracting Section: IND_SPEC_SUB_1
Extracting Section: IND_PATH_DX_1
Parsing Remaining Text
Saving pathology/table_pathology_surgical_notes_parsed.csv
Saved.


In [9]:
df_obj_path_parse = obj_path_parse.return_df_summary()
df_obj_path_parse.head()
# df_obj_path_parse['DMP_ID'].nunique()


NameError: name 'obj_path_parse' is not defined

### ParseMolecularPathology
**Parse main sections of molecular pathology (MSK-IMPACT) reports**
- Source file: `pathology_parse_molecular.py`
- Input: Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)

- Output: Dataframe of molecular pathology reports (M-numbers) with main sections separated by column  (Typically: `table_pathology_molecular_notes_parsed.csv`)

In [9]:
# Using the cleaned pathology table, parse the main sections of the molecular pathology note
if run_parse_dmp:
    print('Running ParseMolecularPathology...')
    obj_parse_dmp = ParseMolecularPathology(pathname=c_dar.pathname,
                                           fname_path_clean=c_dar.fname_darwin_path_clean,
                                           fname_save=c_dar.fname_darwin_path_molecular)

Running ParseMolecularPathology...
Loading clean pathology table
Number of samples in list: 224915
Extracting Section: IND_SPEC_SUB_1
Extracting Section: IND_PATH_DX_1
Extracting Section: IND_HEADER_0
Saving pathology/table_pathology_molecular_notes_parsed.csv
Saved.


In [10]:
df_obj_parse_dmp = obj_parse_dmp.return_df_summary()
df_obj_parse_dmp.head();


### PathologyParseSpecSubmitted
**Parses specimen submitted column into individual parts**
- Source file: `pathology_parse_specimen_submitted.py`
- Input: 
    - Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)
    - Column name for specimens submitted
    - List of column names used as the index for the output (Typically only Patient ID and accession number are needed)

- Output: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)
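
As a rough illustration of "one row per part", the sketch below splits a specimen list into numbered parts with pandas. It is not the implementation in `pathology_parse_specimen_submitted.py`: the `N: description` layout of the list, the helper name `explode_specimen_parts`, and the synthetic MRNs and accession numbers are all assumptions.

```python
import re

import pandas as pd

# Synthetic rows; the "N: description" layout of the specimen list is an assumption.
df = pd.DataFrame({
    "MRN": ["0001", "0002"],
    "ACCESSION_NUMBER": ["S00-00001", "S00-00002"],
    "SPECIMEN_SUBMISSION_LIST": [
        "1: Left breast, biopsy 2: Left axillary lymph node, biopsy",
        "1: Colon, resection",
    ],
})

# A part is "<number>: <description>", ending at the next "<number>:" or at end of text.
PART_PATTERN = re.compile(r"(\d+):\s*(.*?)(?=\s*\d+:|$)", flags=re.DOTALL)


def explode_specimen_parts(df, col_spec_sub, list_cols_id):
    """Return one row per specimen part, keeping the ID columns."""
    rows = []
    for _, row in df.iterrows():
        for num, desc in PART_PATTERN.findall(row[col_spec_sub]):
            record = {col: row[col] for col in list_cols_id}
            record["SPECIMEN_NUMBER"] = int(num)
            record["SPECIMEN_SUBMITTED"] = desc.strip()
            rows.append(record)
    return pd.DataFrame(rows)


df_parts = explode_specimen_parts(
    df, "SPECIMEN_SUBMISSION_LIST", ["MRN", "ACCESSION_NUMBER"]
)
print(df_parts)
```

The pattern would mis-split a description that itself contains digits followed by a colon (for example a time such as 12:30), so a production parser needs stricter rules than this sketch.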

In [11]:
if run_parse_spec_sub:
    # Parse the individual part descriptions from the specimen submitted column of parsed reports
    # Segment specimen submissions for molecular path
    print('Running PathologyParseSpecSubmitted...')
    obj_spec_sub = PathologyParseSpecSubmitted(pathname=c_dar.pathname,
                                              fname_path_parsed=c_dar.fname_darwin_path_clean,
                                              col_spec_sub='SPECIMEN_SUBMISSION_LIST',
                                              list_cols_id=['MRN', 'ACCESSION_NUMBER'],
                                              fname_save=c_dar.fname_darwin_path_col_spec_sub)

Running PathologyParseSpecSubmitted...
Loading pathology table containing specimen submitted info
Parsing Specimen List
Saving pathology/table_pathology_col_spec_sub.csv
Saved.


In [16]:
df_spec_sub = obj_spec_sub.return_df()
df_spec_sub.head(10);

NameError: name 'obj_spec_sub' is not defined

### ParseSurgicalPathologySpecimens
**Parses specimen submitted SECTION in surgical path reports into individual parts**
- Source file: `pathology_parsing_surgical_specimens.py`
- Input: Parsed surgical pathology report table (Main headers) from `ParseSurgicalPathology` (Typically `table_pathology_surgical_notes_parsed.csv`)


- Output: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_surgical_samples_parsed_specimen.csv`)

In [13]:
if run_parse_path_dx:
    print('Running ParseSurgicalPathologySpecimens...')
    pathname = c_dar.pathname
    fname_out_pathology_specimens_parsed = c_dar.fname_darwin_path_clean_parsed_specimen
    fname_darwin_pathology_parsed = c_dar.fname_darwin_path_surgical
    obj_parse = ParseSurgicalPathologySpecimens(pathname=pathname,
                                                fname_darwin_pathology_parsed=fname_darwin_pathology_parsed,
                                                fname_out_pathology_specimens_parsed=fname_out_pathology_specimens_parsed)

Running ParseSurgicalPathologySpecimens...
Parsing Pathology Diagnosis Section
Elongating Parsed Pathology Data
Saving pathology/table_pathology_surgical_samples_parsed_specimen.csv
Saved.


In [15]:
df_surg_path_parsed_spec = obj_parse.return_df_parsed_spec()
print(df_surg_path_parsed_spec.shape)
df_surg_path_parsed_spec.head();



NameError: name 'obj_parse' is not defined

In [15]:
df_surg_path_parsed = obj_parse.return_df_parsed()
df_surg_path_parsed.head();

### run_spec_sub_path_dx_combiner
##### TODO!!! 
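This step will combine the specimen submitted values from the column version (`PathologyParseSpecSubmitted`) with those parsed from the specimen submitted section of the reports (`ParseSurgicalPathologySpecimens`).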

In [16]:
if run_spec_sub_path_dx_combiner:
    print('Running Combiner...')
    run = None

print('Complete!')

Complete!


----------------
## Run code for abstracting elements for annotations
Once the pathology reports are segmented, abstract data elements for annotation.

### PathologyExtractAccession
**Automatically abstracts source accession number for specimen submitted parts**
- Source file: `pathology_extract_accession.py`
- Input: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

- Output: Dataframe of accession and part number embedded in free-text (Typically: `path_accessions.csv`)
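
One way to pull an accession and part number out of free text is a regular expression, sketched below. The accession layout (one or two letters, a two-digit year, a hyphen, a serial number, and an optional `/<part>` suffix) is an assumption made for illustration, as is the function name `extract_source_accession`; the actual rules live in `pathology_extract_accession.py` and may differ.

```python
import re

# Assumed accession layout: 1-2 letters, two-digit year, hyphen, serial number,
# optionally followed by "/<part number>". Real accession formats may differ.
ACCESSION_PATTERN = re.compile(r"\b([A-Z]{1,2}\d{2}-\d+)(?:/(\d+))?")


def extract_source_accession(spec_text):
    """Return (accession, part_number) for the first match, or (None, None)."""
    match = ACCESSION_PATTERN.search(spec_text)
    if match is None:
        return None, None
    part = int(match.group(2)) if match.group(2) else None
    return match.group(1), part


# Synthetic specimen descriptions
print(extract_source_accession("Lung, right upper lobe (S00-12345/2)"))
print(extract_source_accession("Colon, resection"))
```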

In [17]:
if run_parse_accession:
    print('Running PathologyExtractAccession...')
    obj_ext_accession = PathologyExtractAccession(pathname=c_dar.pathname,
                                                fname=c_dar.fname_darwin_path_col_spec_sub,
                                                col_label_access_num=col_label_access_num,
                                                col_label_spec_num=col_label_spec_num,
                                                col_spec_sub=col_spec_sub,
                                                fname_out=c_dar.fname_accessions)

Running PathologyExtractAccession...
Extracting Matching Accession Numbers
Saving pathology/path_accessions.csv
Saved.


In [14]:
df_accessions = obj_ext_accession.return_df()
df_accessions[df_accessions['SOURCE_ACCESSION_NUMBER_0'].notnull()].head();

NameError: name 'obj_ext_accession' is not defined

### PathologyExtractDOP
**Extracts the date of procedure (DOP) of each specimen part from the specimen submitted sections**
- Source file: `pathology_extract_dop.py`
- Input: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

- Output: Dataframe of date of procedure for each part number embedded in free-text (Typically: `pathology_spec_part_dop.csv`)
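
A date embedded in a specimen description can be captured with a pattern and then validated by parsing it. The sketch below assumes the date is written as `DOP: M/D/YYYY`; that convention and the name `extract_dop` are hypothetical and are not taken from `pathology_extract_dop.py`.

```python
import re
from datetime import datetime

# Assumed convention: the date follows "DOP" (colon optional) as M/D/YYYY.
DOP_PATTERN = re.compile(r"DOP:?\s*(\d{1,2}/\d{1,2}/\d{4})", flags=re.IGNORECASE)


def extract_dop(spec_text):
    """Return the date of procedure as a date object, or None if absent or invalid."""
    match = DOP_PATTERN.search(spec_text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%m/%d/%Y").date()
    except ValueError:
        # The text looked like a date but is not a valid calendar date.
        return None


# Synthetic specimen descriptions
print(extract_dop("Liver, biopsy (DOP: 3/5/2020)"))
print(extract_dop("Liver, biopsy"))
```

Parsing the captured string with `strptime` rejects impossible dates, so a malformed value ends up as a missing DOP rather than a wrong one.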

In [19]:
# Create annotation for date of procedure (DOP) for all pathology reports/specimen part, if indicated
if run_parse_dop:
    print('Running PathologyExtractDOP...')
    obj_dop = PathologyExtractDOP(pathname=c_dar.pathname,
                                    fname=c_dar.fname_darwin_path_col_spec_sub,
                                    col_label_access_num=col_label_access_num,
                                    col_label_spec_num=col_label_spec_num,
                                    col_spec_sub=col_spec_sub,
                                    list_accession=None,
                                    fname_out=c_dar.fname_spec_part_dop)

Running PathologyExtractDOP...
Extracting Date of Procedure in specimen submitted column
Saving pathology/pathology_spec_part_dop.csv
Saved.


In [20]:
df_obj_dop = obj_dop.return_df()
df_obj_dop[df_obj_dop['DATE_OF_PROCEDURE_SURGICAL'].notnull()].shape

(249574, 4)

### CombineAccessionDOPImpact
**Generates a summary table of source accession and DOP for a given pathology report.**
- Source file: `pathology_extract_dop_impact_wrapper.py`
- Input: 
    - fname_accession=`path_accessions.csv`
    - fname_dop=`pathology_spec_part_dop.csv`
    - fname_path=`table_pathology_clean.csv`
- Output: `pathology_dop_impact_summary.csv`
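
Conceptually, this step joins the accession table and the DOP table on their shared keys. The sketch below shows such a join on synthetic data. It assumes both tables are keyed by `ACCESSION_NUMBER` and `SPECIMEN_NUMBER`, and it leaves out the clean pathology table that the real wrapper also reads, so treat it as an illustration of the join only.

```python
import pandas as pd

# Synthetic stand-ins for path_accessions.csv and pathology_spec_part_dop.csv
df_accession = pd.DataFrame({
    "ACCESSION_NUMBER": ["M00-000001", "M00-000002"],
    "SPECIMEN_NUMBER": [1, 1],
    "SOURCE_ACCESSION_NUMBER_0": ["S00-12345", None],
})
df_dop = pd.DataFrame({
    "ACCESSION_NUMBER": ["M00-000001", "M00-000002"],
    "SPECIMEN_NUMBER": [1, 1],
    "DATE_OF_PROCEDURE_SURGICAL": pd.to_datetime(["2020-01-08", None]),
})

# Assumed join keys; an outer join keeps parts that appear in only one table.
keys = ["ACCESSION_NUMBER", "SPECIMEN_NUMBER"]
df_combined = df_accession.merge(df_dop, how="outer", on=keys)
print(df_combined)
```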

In [21]:
# Create table of M accessions of IMPACT samples, source accession number, dates of reports and procedures
# Call pathology_extract_dop_impact_wrapper.py
if run_CombineAccessionDOPImpact:
    print('Running CombineAccessionDOPImpact...')
    obj_p = CombineAccessionDOPImpact(pathname=c_dar.pathname,
                                      fname_accession=c_dar.fname_accessions,
                                      fname_dop=c_dar.fname_spec_part_dop,
                                      fname_path=c_dar.fname_darwin_path_clean,
                                      fname_out=c_dar.fname_combine_dop_accession)

Running CombineAccessionDOPImpact...
Saving pathology/pathology_dop_impact_summary.csv
Saved.


In [13]:
df_dop_accession = obj_p.return_df()
df_dop_accession.head();


NameError: name 'obj_p' is not defined

### PathologyImpactDOPAnno
**Fills in missing DOPs by comparing the dates of source surgical pathology reports with surgery and IR dates, labelling matches as the DOP.**
- Source file: `pathology_impact_summary_dop_annotator.py`
- Input:  
    - fname_path_summary=`pathology_dop_impact_summary.csv`
    - fname_surgery=`table_surgery.tsv`
    - fname_ir=`table_investigational_radiology.tsv`
- Output: `table_pathology_impact_sample_summary_dop_anno.csv`
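
The date-matching idea can be sketched with a left merge: when a report date coincides with a surgery or IR date for the same patient, that date is used as the DOP if none was extracted. All column names (`REPORT_DATE`, `DATE_OF_PROCEDURE`, `PROCEDURE_DATE`), the helper `fill_missing_dop`, and the data are hypothetical; the real logic in `pathology_impact_summary_dop_annotator.py` may apply different rules.

```python
import pandas as pd

# Synthetic data; all column names here are hypothetical.
df_summary = pd.DataFrame({
    "MRN": ["0001", "0002", "0003"],
    "REPORT_DATE": pd.to_datetime(["2020-01-10", "2020-02-15", "2020-03-20"]),
    "DATE_OF_PROCEDURE": pd.to_datetime(["2020-01-08", None, None]),
})
df_surgery = pd.DataFrame({
    "MRN": ["0002"],
    "PROCEDURE_DATE": pd.to_datetime(["2020-02-15"]),
})
df_ir = pd.DataFrame({
    "MRN": ["0003"],
    "PROCEDURE_DATE": pd.to_datetime(["2020-03-01"]),
})


def fill_missing_dop(df_summary, df_surgery, df_ir):
    """Fill a missing DOP when the report date equals a surgery or IR date."""
    procedures = pd.concat([df_surgery, df_ir], ignore_index=True).drop_duplicates()
    # Copy the procedure date into the join key so only same-day events match.
    procedures = procedures.assign(REPORT_DATE=procedures["PROCEDURE_DATE"])
    merged = df_summary.merge(procedures, how="left", on=["MRN", "REPORT_DATE"])
    merged["DATE_OF_PROCEDURE"] = merged["DATE_OF_PROCEDURE"].fillna(
        merged["PROCEDURE_DATE"]
    )
    return merged.drop(columns="PROCEDURE_DATE")


df_filled = fill_missing_dop(df_summary, df_surgery, df_ir)
print(df_filled)
```

In this example the second patient gains a DOP from the same-day surgery, while the third stays missing because the IR date does not match the report date.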

In [10]:
# Add annotations for surgical reports that fall on the same day as the surgery/IR
# Call pathology_impact_summary_dop_annotator.py
if run_dop_extra_anno:
    print('Running PathologyImpactDOPAnno...')
    obj_dop_anno = PathologyImpactDOPAnno(pathname=c_dar.pathname,
                                          fname_path_summary=c_dar.fname_combine_dop_accession,
                                          fname_surgery=c_dar.fname_darwin_surgery,
                                          fname_ir=c_dar.fname_darwin_ir,
                                          fname_save=c_dar.fname_dop_anno)

Running PathologyImpactDOPAnno...
Saving pathology/table_pathology_impact_sample_summary_dop_anno.csv
Saved.


In [12]:
df_f = obj_dop_anno.return_summary()
df_f.head();