# Pathology Report Segmentation 
--------


### Introduction

The primary goal of this notebook is to break down pathology reports into sections and sub-sections. These consist of:

- Report header
- Specimens submitted
- Clinical diagnosis
- Pathological diagnosis
    - Synoptic information
    - Free-text
- Addendums



### Accessing this Jupyter notebook on an MSK-MIND machine from a local machine

1) On the MIND server, start a remote Jupyter server:
      
       >> jupyter-lab --no-browser --port=8888
    
2) On the local machine, forward a local port to the remote Jupyter port:
   
       >> ssh -N -f -L localhost:8080:localhost:8888 <USERNAME>@<MIND-SERVER-NAME>
    
3) Open a local browser; the notebook will be at `localhost:8080`.

Note: If you are running this notebook entirely on your local machine, you can skip these steps.
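
Before opening the browser, it can help to confirm that the forwarded port is actually accepting connections. The snippet below is a minimal sketch using only the standard library; the helper name `port_is_open` is hypothetical and not part of this repository, and port 8080 is simply the local end of the tunnel from step 2.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8080 is the local port chosen in the ssh -L command above
    print(port_is_open("localhost", 8080))
```

If this prints `False`, the tunnel from step 2 is probably not running, or the remote Jupyter server is not listening on port 8888.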

### Import libraries

In [1]:
import sys  
sys.path.insert(0, '../segmentation')
sys.path.insert(0, '../annotations')
sys.path.insert(0, '../synoptic_classification')
sys.path.insert(0, '../')
import os
import pandas as pd
from darwin_pathology import DarwinDiscoveryPathology
from pathology_parse_surgical import ParseSurgicalPathology
from pathology_parse_molecular import ParseMolecularPathology
from pathology_parse_specimen_submitted import PathologyParseSpecSubmitted
from pathology_parsing_surgical_specimens import ParseSurgicalPathologySpecimens
from pathology_extract_accession import PathologyExtractAccession
from pathology_extract_dop import PathologyExtractDOP
from pathology_extract_dop_impact_wrapper import CombineAccessionDOPImpact
from pathology_impact_summary_dop_annotator import PathologyImpactDOPAnno
import constants_darwin_pathology as c_dar
from utils_pathology import set_debug_console

In [2]:
# Extend debug window
set_debug_console()

-----------------------
## Create MSK-IMPACT pathology annotation summary table

### Set Parameters

#### Parameters

In [2]:
run_path_clean = True
run_parse_surg = False
run_parse_dmp = False
run_parse_spec_sub = False
run_parse_path_dx = False
run_spec_sub_path_dx_combiner = False

# TODO: Move the rest to annotation section
run_parse_accession = False
run_parse_dop = False
annotation_steps = False
run_CombineAccessionDOPImpact = False

#### Filenames
Filenames used for this demo are defined here: `clinical_data_mining/pathology_report_segmentation/constants_darwin_pathology.py`

-------
### Run Code for header segmentation

#### DarwinDiscoveryPathology


**Run initial cleaning of the pathology table that comes directly from the DDP scraper**
- Source file: `darwin_pathology.py`

- Input: Pathology report table from DDP (Typically `Darwin_Digital_Pathology_Reports.csv`)

- Output: Dataframe returned with basic cleaning done (Typically: `table_pathology_clean.csv`)

In [20]:
# Recreate cleaned pathology data
if run_path_clean:
    print('Running DarwinDiscoveryPathology...')
    obj_path = DarwinDiscoveryPathology(pathname=c_dar.pathname,
                                         fname='Darwin_Digital_Pathology_Reports.csv',  # Change name
                                         fname_out=c_dar.fname_darwin_path_clean)

#### ParseSurgicalPathology
**Parse main sections of surgical pathology reports**
- Source file: `pathology_parse_surgical.py`
- Input: Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)

- Output: Dataframe of surgical pathology reports (S-numbers) with main sections separated by column  (Typically: `table_pathology_surgical_notes_parsed.csv`)

In [22]:
# Using the cleaned pathology table, parse the main sections of the surgical pathology note
# Call pathology_parse_surgical.py
if run_parse_surg:
    print('Running ParseSurgicalPathology...')
    obj_path_parse = ParseSurgicalPathology(pathname=c_dar.pathname,
                                            fname_path_clean=c_dar.fname_darwin_path_clean,
                                            fname_save=c_dar.fname_darwin_path_surgical)
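
To illustrate what "main sections separated by column" means, here is a minimal, self-contained sketch of header-based segmentation. It is not the logic in `pathology_parse_surgical.py`; the header strings in `SECTION_HEADERS`, the `HEADER` key, and the function name `split_sections` are all assumptions made for illustration, and real reports may use different section titles.

```python
import re

# Hypothetical section headers; real reports may word these differently.
SECTION_HEADERS = [
    "Clinical Diagnosis & History",
    "Specimens Submitted",
    "DIAGNOSIS",
    "Addendum",
]


def split_sections(report: str, headers=SECTION_HEADERS) -> dict:
    """Split report text into {header: body} using 'Header:' markers at line starts.

    Text before the first recognized header is stored under the key 'HEADER'.
    """
    alternatives = "|".join(re.escape(h) for h in headers)
    pattern = re.compile(rf"^({alternatives}):", flags=re.MULTILINE)
    matches = list(pattern.finditer(report))

    first = matches[0].start() if matches else len(report)
    sections = {"HEADER": report[:first].strip()}
    for i, m in enumerate(matches):
        # Each section runs until the next header (or the end of the report)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections[m.group(1)] = report[m.end():end].strip()
    return sections


example = (
    "Surgical Pathology Report\n"
    "Clinical Diagnosis & History:\nBreast mass.\n"
    "Specimens Submitted:\n1: Breast, left, biopsy\n"
    "DIAGNOSIS:\n1. Invasive ductal carcinoma.\n"
)
sections = split_sections(example)
print(sections["DIAGNOSIS"])  # 1. Invasive ductal carcinoma.
```

Each key of the returned dictionary would correspond to one column of the parsed table.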

#### ParseMolecularPathology
**Parse main sections of molecular pathology (MSK-IMPACT) reports**
- Source file: `pathology_parse_molecular.py`
- Input: Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)

- Output: Dataframe of molecular pathology reports (M-numbers) with main sections separated by column  (Typically: `table_pathology_molecular_notes_parsed.csv`)

In [None]:
# Using the cleaned pathology table, parse the main sections of the molecular pathology note
if run_parse_dmp:
    print('Running ParseMolecularPathology...')
    obj_dmp = ParseMolecularPathology(pathname=c_dar.pathname,
                                       fname_path_clean=c_dar.fname_darwin_path_clean,
                                       fname_save=c_dar.fname_darwin_path_molecular)

#### PathologyParseSpecSubmitted
**Parses specimen submitted column into individual parts**
- Source file: `pathology_parse_specimen_submitted.py`
- Input: 
    - Clean pathology report table from `DarwinDiscoveryPathology` (Typically `table_pathology_clean.csv`)
    - Column name for specimens submitted
    - List of column names used as the index for the output (Typically only patient ID and accession number are needed)

- Output: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

In [24]:
if run_parse_spec_sub:
    # Parse the individual part descriptions from the specimen submitted column of parsed reports
    # Segment specimen submissions for molecular path
    print('Running PathologyParseSpecSubmitted...')
    obj_mol = PathologyParseSpecSubmitted(pathname=c_dar.pathname,
                                          fname_path_parsed=c_dar.fname_darwin_path_clean,
                                          col_spec_sub='SPECIMEN_SUBMISSION_LIST',
                                          list_cols_id=['DMP_ID', 'ACCESSION_NUMBER'],
                                          fname_save=c_dar.fname_darwin_path_col_spec_sub)

    df_m = obj_mol.return_df()
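
The idea of "one row per part number" can be sketched as follows. This is an illustrative example only: it assumes the specimen submitted text enumerates parts as `1: ... 2: ...`, which may not match the real data, and `PART_PATTERN` and `parse_specimen_parts` are hypothetical names rather than functions from `pathology_parse_specimen_submitted.py`.

```python
import re

# Assumed format: "<part number>: <description>", repeated within one string.
# A description ends where the next "<number>:" begins or at the end of the text.
PART_PATTERN = re.compile(r"(\d+):\s*(.*?)(?=\s+\d+:|\s*\Z)", flags=re.DOTALL)


def parse_specimen_parts(text: str) -> list:
    """Return a list of (part_number, description) tuples."""
    return [(int(num), desc.strip()) for num, desc in PART_PATTERN.findall(text)]


parts = parse_specimen_parts(
    "1: Breast, left; biopsy 2: Lymph node, sentinel; excision"
)
print(parts)
# [(1, 'Breast, left; biopsy'), (2, 'Lymph node, sentinel; excision')]
```

A pattern this simple would mis-split descriptions that themselves contain a number followed by a colon (for example a time such as `10:30`), so treat it as a starting point only.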

#### ParseSurgicalPathologySpecimens
**Parses specimen submitted SECTION in surgical path reports into individual parts**
- Source file: `pathology_parsing_surgical_specimens.py`
- Input: Parsed surgical pathology report table (Main headers) from `ParseSurgicalPathology` (Typically `table_pathology_surgical_notes_parsed.csv`)


- Output: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_surgical_samples_parsed_specimen.csv`)

In [None]:
if run_parse_path_dx:
    print('Running ParseSurgicalPathologySpecimens...')
    pathname = c_dar.pathname
    fname_out_pathology_specimens_parsed = c_dar.fname_darwin_path_clean_parsed_specimen
    fname_darwin_pathology_parsed = c_dar.fname_darwin_path_surgical
    obj_parse = ParseSurgicalPathologySpecimens(pathname=pathname,
                                                fname_darwin_pathology_parsed=fname_darwin_pathology_parsed,
                                                fname_out_pathology_specimens_parsed=fname_out_pathology_specimens_parsed)

    df_surg_path_parsed_spec = obj_parse.return_df_parsed_spec()

#### run_spec_sub_path_dx_combiner
##### TODO!!! 
This step will combine the specimen submitted values from the column version with those from the section parsed out of the reports.

In [3]:
if run_spec_sub_path_dx_combiner:
    print('Running Combiner...')
    run = None

print('Complete!')

Complete!


----------------
### Run code for abstracting elements for annotations
Once the pathology reports are segmented, abstract data elements for annotation.

#### Filenames

In [26]:
# TODO Segment pathologic diagnosis section of surgical pathology reports at the part level
# Call pathology_parsing_surgical_specimens.py

# Create annotation of source accession number for all pathology reports/specimen part, if indicated
## Constants
col_label_access_num = 'ACCESSION_NUMBER'
col_label_spec_num = 'SPECIMEN_NUMBER'
col_spec_sub = 'SPECIMEN_SUBMITTED'

fname_spec_part_dop = 'pathology_spec_part_dop.csv'
fname_combine_dop_accession = 'pathology_dop_impact_summary.csv'
fname_dop_anno = 'table_pathology_impact_sample_summary_dop_anno.csv'

#### PathologyExtractAccession
**Automatically abstracts source accession number for specimen submitted parts**
- Source file: `pathology_extract_accession.py`
- Input: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

- Output: Dataframe of accession and part number embedded in free-text (Typically: `path_accessions.csv`)

In [None]:
print('Running PathologyExtractAccession...')
obj_p = PathologyExtractAccession(pathname=c_dar.pathname,
                            fname=c_dar.fname_darwin_path_col_spec_sub,
                            col_label_access_num=col_label_access_num,
                            col_label_spec_num=col_label_spec_num,
                            col_spec_sub=col_spec_sub,
                            fname_out=c_dar.fname_accessions)
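
As a rough illustration of pulling an accession and part number out of free text, the sketch below uses a regular expression. The accession format it assumes (one or two capital letters, a two-digit year, a hyphen, a number, and an optional `/part` suffix, for example `S15-12345/2`) is a guess for demonstration purposes, and `ACCESSION_PATTERN` and `extract_source_accession` are hypothetical names, not the implementation in this repository.

```python
import re

# Assumed accession format, e.g. "S15-12345" with an optional "/<part>" suffix.
ACCESSION_PATTERN = re.compile(r"\b([A-Z]{1,2}\d{2}-\d+)(?:/(\d+))?")


def extract_source_accession(text: str):
    """Return (accession, part) for the first match, or (None, None) if absent."""
    m = ACCESSION_PATTERN.search(text)
    if m is None:
        return None, None
    part = int(m.group(2)) if m.group(2) else None
    return m.group(1), part


print(extract_source_accession("Left breast, biopsy (S15-12345/2)"))
# ('S15-12345', 2)
```

Only the first match is returned here; free text that cites several accessions would need `finditer` and a rule for choosing among them.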

#### PathologyExtractDOP
**Extracts the date of procedure (DOP) for each specimen part from the specimen submitted sections**
- Source file: `pathology_extract_dop.py`
- Input: Dataframe of specimens submitted. Each row is a part number of a report (Typically: `table_pathology_col_spec_sub.csv`)

- Output: Dataframe of date of procedure for each part number embedded in free-text (Typically: `pathology_spec_part_dop.csv`)

In [None]:
# Create annotation for date of procedure (DOP) for all pathology reports/specimen part, if indicated
print('Running PathologyExtractDOP...')
obj_p = PathologyExtractDOP(pathname=c_dar.pathname,
                            fname=c_dar.fname_darwin_path_col_spec_sub,
                            col_label_access_num=col_label_access_num,
                            col_label_spec_num=col_label_spec_num,
                            col_spec_sub=col_spec_sub,
                            list_accession=None,
                            fname_out=c_dar.fname_spec_part_dop)
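
A minimal sketch of date-of-procedure extraction from free text is shown below. It assumes the date follows a literal `DOP` label in `M/D/YYYY` form, which may not cover the variations in real reports; `DOP_PATTERN` and `extract_dop` are hypothetical names and do not reflect the logic in `pathology_extract_dop.py`.

```python
import re
from datetime import date

# Assumed format: "DOP" followed by an optional colon/space and M/D/YYYY.
DOP_PATTERN = re.compile(
    r"\bDOP\b[:\s]*(\d{1,2})/(\d{1,2})/(\d{4})", flags=re.IGNORECASE
)


def extract_dop(text: str):
    """Return the date of procedure as a datetime.date, or None if not found or invalid."""
    m = DOP_PATTERN.search(text)
    if m is None:
        return None
    month, day, year = (int(g) for g in m.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # e.g. month 13 or day 45: treat as "no usable date"
        return None


print(extract_dop("Liver, biopsy (DOP: 3/14/2016)"))  # 2016-03-14
```

Returning `None` for impossible dates keeps malformed text from silently producing a wrong annotation.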

#### CombineAccessionDOPImpact
**Generates a summary table of source accession and DOP for a given pathology report.**
- Source file: `pathology_extract_dop_impact_wrapper.py`
- Input: 
    - fname_accession=`path_accessions.csv`
    - fname_dop=`pathology_spec_part_dop.csv`
    - fname_path=`table_pathology_clean.csv`
- Output: `pathology_dop_impact_summary.csv`

In [None]:
# Create table of M accessions of IMPACT samples, source accession number, dates of reports and procedures
# Call pathology_extract_dop_impact_wrapper.py
obj_p = CombineAccessionDOPImpact(pathname=c_dar.pathname,
                                  fname_accession=c_dar.fname_accessions,
                                  fname_dop=c_dar.fname_spec_part_dop,
                                  fname_path=c_dar.fname_darwin_path_clean,
                                  fname_out=c_dar.fname_combine_dop_accession)
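
Conceptually, this step joins the accession table and the DOP table on the report accession and part number. The sketch below shows such a join with plain dictionaries. It is a simplification under stated assumptions: the output column names `SOURCE_ACCESSION_NUMBER` and `DATE_OF_PROCEDURE` are hypothetical, an outer join is assumed, the cleaned report table passed as `fname_path` is not modeled, and `combine_by_part` is not a function from this repository.

```python
KEY_COLS = ("ACCESSION_NUMBER", "SPECIMEN_NUMBER")


def combine_by_part(accessions, dops):
    """Outer-join two lists of row dicts on (ACCESSION_NUMBER, SPECIMEN_NUMBER)."""
    merged = {}
    for row in list(accessions) + list(dops):
        key = tuple(row[c] for c in KEY_COLS)
        # Rows sharing a key are folded into a single output row
        merged.setdefault(key, {}).update(row)
    return [merged[k] for k in sorted(merged)]


accessions = [
    {"ACCESSION_NUMBER": "M16-1", "SPECIMEN_NUMBER": 1,
     "SOURCE_ACCESSION_NUMBER": "S15-12345"},
]
dops = [
    {"ACCESSION_NUMBER": "M16-1", "SPECIMEN_NUMBER": 1,
     "DATE_OF_PROCEDURE": "2015-03-14"},
    {"ACCESSION_NUMBER": "M16-1", "SPECIMEN_NUMBER": 2,
     "DATE_OF_PROCEDURE": "2015-04-01"},
]
for row in combine_by_part(accessions, dops):
    print(row)
```

With pandas, the same idea would typically be expressed as a `merge` on the two key columns.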

#### PathologyImpactDOPAnno
**Fills in missing DOPs by comparing the dates of source surgical pathology reports with surgery and IR dates, and labelling matching dates as the DOP.**
- Source file: `pathology_impact_summary_dop_annotator.py`
- Input:  
    - fname_path_summary=`pathology_dop_impact_summary.csv`
    - fname_surgery=`table_surgery.tsv`
    - fname_ir=`table_investigational_radiology.tsv`
- Output: `table_pathology_impact_sample_summary_dop_anno.csv`

In [None]:
# Add annotations for surgical reports that fall on the same day as the surgery/IR
# Call pathology_impact_summary_dop_annotator.py
objd = PathologyImpactDOPAnno(pathname=c_dar.pathname,
                              fname_path_summary=c_dar.fname_combine_dop_accession,
                              fname_surgery='table_surgery.tsv',
                              fname_ir='table_investigational_radiology.tsv',
                              fname_save=c_dar.fname_dop_anno)
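
The same-day matching rule described above can be sketched as follows. This is an illustration under assumptions, not the code in `pathology_impact_summary_dop_annotator.py`: the column names `SOURCE_REPORT_DATE` and `DATE_OF_PROCEDURE`, the per-patient date lookups, and the function name `fill_missing_dop` are all hypothetical.

```python
from datetime import date


def fill_missing_dop(rows, surgery_dates, ir_dates):
    """Fill DATE_OF_PROCEDURE when the source report date matches a surgery or IR date.

    rows: list of dicts with DMP_ID, SOURCE_REPORT_DATE, DATE_OF_PROCEDURE
    surgery_dates, ir_dates: dicts mapping DMP_ID -> set of datetime.date
    """
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if row.get("DATE_OF_PROCEDURE") is None:
            patient = row["DMP_ID"]
            procedure_dates = surgery_dates.get(patient, set()) | ir_dates.get(patient, set())
            if row["SOURCE_REPORT_DATE"] in procedure_dates:
                row["DATE_OF_PROCEDURE"] = row["SOURCE_REPORT_DATE"]
        filled.append(row)
    return filled


rows = [
    {"DMP_ID": "P-1", "SOURCE_REPORT_DATE": date(2015, 3, 14), "DATE_OF_PROCEDURE": None},
    {"DMP_ID": "P-2", "SOURCE_REPORT_DATE": date(2015, 5, 1), "DATE_OF_PROCEDURE": None},
]
surgery = {"P-1": {date(2015, 3, 14)}}
ir = {"P-2": {date(2015, 6, 9)}}
result = fill_missing_dop(rows, surgery, ir)
print(result[0]["DATE_OF_PROCEDURE"])  # 2015-03-14
print(result[1]["DATE_OF_PROCEDURE"])  # None
```

Rows that already carry a DOP are passed through untouched, so values extracted from the report text take precedence over inferred ones.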