# Run-through of TRE Tools

This notebook demonstrates the main features of the `tretools` package. It is split into the following sections:

- Codelists
- Datasets
- Phenotype Reports
- Summary Report

Codelists and Datasets are the building blocks of the system. Phenotype Reports bring them together to answer questions such as: which patients in a given dataset have an event that matches a code in a given codelist? A phenotype in this context is defined as a person who has a code in an identified codelist. A person might have several qualifying codes in one dataset, one qualifying code in each of several codelists, or multiple codes across multiple codelists.
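
As a minimal illustration of that definition (plain Python, not tretools code; the `events` records and the `patients_with_phenotype` helper are hypothetical), a phenotype match is a membership test of each event's code against the codelist:

```python
from collections import defaultdict

# Hypothetical events: (patient id, code) pairs from one dataset.
events = [
    ("patient_1", "100000001"),
    ("patient_1", "100000002"),
    ("patient_2", "100000001"),
    ("patient_3", "200000001"),
]

# Hypothetical codelist, represented as a set of codes.
disease_a_codes = {"100000001", "100000002"}


def patients_with_phenotype(events, codes):
    """Return {patient: sorted matching codes} for patients with at least one match."""
    matches = defaultdict(set)
    for patient, code in events:
        if code in codes:
            matches[patient].add(code)
    return {patient: sorted(found) for patient, found in matches.items()}


phenotype = patients_with_phenotype(events, disease_a_codes)
print(phenotype)
```

Here `patient_1` has two qualifying codes, `patient_2` has one, and `patient_3` has none and so is not part of the phenotype.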

## Codelists
First we import our Codelist from a file. Let's start with a SNOMED codelist. 

In [52]:
from tretools.codelists.codelist import Codelist

In [53]:
snomed_codelist = Codelist("codelists/disease_a_snomed.csv", codelist_type="SNOMED")

This gives us a codelist that we can view. 

In [54]:
snomed_codelist.data

[{'code': '100000001', 'term': 'Disease A - 1'},
 {'code': '100000002', 'term': 'Disease A - 2'}]

In [55]:
snomed_codelist.codes

{'100000001', '100000002'}
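
The relationship between `.data` (one record per row) and `.codes` (the unique codes) can be pictured with the standard library alone. This is only a sketch: the CSV text below is a hypothetical stand-in for the file, and it does not reproduce how tretools actually reads codelists.

```python
import csv
import io

# Hypothetical file contents; the real layout of the codelist file may differ.
csv_text = "code,term\n100000001,Disease A - 1\n100000002,Disease A - 2\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))  # one dict per row
codes = {row["code"] for row in rows}               # unique codes only

print(rows)
print(codes)
```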

Now let's do the same with an ICD codelist. 

In [56]:
icd_codelist = Codelist("Codelists/disease_a_icd.csv", "ICD10")

In [57]:
icd_codelist.codes

{'A01', 'A021'}

Sometimes we want to add an X to the end of an ICD10 code so that we can use it with NHS Digital data. Let's pass in `add_x_codes=True` to see the difference. You will see that extra codes are generated with the same code term. 

In [58]:
icd_codelist_with_x = Codelist("Codelists/disease_a_icd.csv", "ICD10", add_x_codes=True)
icd_codelist_with_x.codes

{'A01', 'A01X', 'A021'}
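
The output above is consistent with an X being appended only to three-character codes. Assuming that rule (it is inferred from this one example, not taken from the tretools source), the expansion could be sketched as follows; `add_x_suffix` is a hypothetical helper:

```python
def add_x_suffix(codes):
    """Return the codes plus an 'X'-padded variant of every 3-character code."""
    # Assumption: only 3-character codes receive the extra 'X' variant.
    padded = {code + "X" for code in codes if len(code) == 3}
    return set(codes) | padded


icd_codes = {"A01", "A021"}
with_x = add_x_suffix(icd_codes)
print(sorted(with_x))
```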

Additionally, we may want to truncate ICD10 codes to 3 digits only. We do this by passing in `icd10_3_digit_only=True`. 

In [59]:
truncated_icd_codelist = Codelist("Codelists/disease_a_icd.csv", "ICD10", icd10_3_digit_only=True)
truncated_icd_codelist.codes

{'A01', 'A021'}

## Datasets

There are 3 types of Datasets:
- RawDataset - this takes in data, which might be messy. It can be processed to produce a ProcessedDataset
- ProcessedDataset - this is a tidy dataset that can be merged with other ProcessedDatasets
- DemographicDataset - this is a special type of Dataset that can be created from specific data files

In [60]:
from tretools.datasets.raw_dataset import RawDataset
from tretools.datasets.demographic_dataset import DemographicDataset

We can load a datafile into a RawDataset. This will put the data at `.data`. 

In [61]:
raw_data = RawDataset(path="datasets/primary_care_data.csv", coding_system="SNOMED", dataset_type="primary_care")

In [62]:
raw_data.data

| pseudo_nhs_number | clinical_effective_date | original_code | original_term | extra_col |
| --- | --- | --- | --- | --- |
| str | str | i64 | str | i64 |
| "84950DE0614A5C…" | "2018-10-05 12:…" | 100000001 | "Disease A - 1" | 1 |
| "84950DE0614A5C…" | "05/11/2018" | 100000001 | "Disease A - 1" | 1 |
| "84950DE0614A5C…" | "12-02-2019" | 100000002 | "Disease A - 2" | 1 |
| "84950DE0614A5C…" | "2020-05-22T08:…" | 200000001 | "Disease B - 1" | 1 |
| "73951AB0712D6E…" | "" | 100000001 | "Disease A - 1" | 1 |
| "73951AB0712D6E…" | "03-06-2013 15:…" | 100000001 | "Disease A - 1" | 1 |
| "53952EF0503F7F…" | "July 19, 2016" | 200000001 | "Disease B - 1" | 1 |
| "53952EF0503F7F…" | "2016-08-20 07:…" | 200000001 | "Disease B - 1" | 1 |
| "44966CC0716B4C…" |  | 100000002 | "Disease A - 2" | 1 |
| "84950DE0614A5C…" | "2018-10-05 12:…" | 100000001 | "Disease A - 1" | 1 |


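The dates are in inconsistent formats, some are missing, there is an extra column, and the last row appears to duplicate the first. Let's tidy this up. We pass in the columns to deduplicate on, plus a map from the original column names to the standard ones. This creates a ProcessedDataset.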

In [63]:
gp_processed_data = raw_data.process_dataset(
    deduplication_options=["nhs_number", "code", "date"], 
    column_maps={"original_code": "code", "original_term": "term", "clinical_effective_date": "date", "pseudo_nhs_number": "nhs_number"}
)

In [64]:
gp_processed_data.data

| nhs_number | code | date |
| --- | --- | --- |
| str | i64 | date |
| "53952EF0503F7F…" | 200000001 | 2016-07-19 |
| "84950DE0614A5C…" | 100000001 | 2018-11-05 |
| "73951AB0712D6E…" | 100000001 | 2013-06-03 |
| "84950DE0614A5C…" | 100000001 | 2018-10-05 |
| "84950DE0614A5C…" | 200000001 | 2020-05-22 |
| "84950DE0614A5C…" | 100000002 | 2019-02-12 |
| "53952EF0503F7F…" | 200000001 | 2016-08-20 |


Now you will see that there is a clean dataset. 
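
To make the cleaning steps concrete, here is a rough standard-library sketch of the same idea: rename columns, parse mixed date formats, drop rows without a usable date, and deduplicate. It is not the tretools implementation. The format list, the day-first reading of ambiguous dates, the helper names and the sample rows are all assumptions made for illustration.

```python
from datetime import datetime

# Candidate formats, tried in order. Day-first is assumed for ambiguous
# numeric dates such as "05/11/2018".
DATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%d-%m-%Y %H:%M:%S",
    "%d/%m/%Y",
    "%d-%m-%Y",
    "%B %d, %Y",
]


def parse_date(text):
    """Return a date for the first matching format, or None if nothing parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except (TypeError, ValueError):  # None or unparseable text
            continue
    return None


def clean(rows, column_map, dedup_keys):
    """Rename columns, normalise dates, drop undated rows and duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        # Keep only the mapped columns, under their new names.
        record = {new: row.get(old) for old, new in column_map.items()}
        record["date"] = parse_date(record.get("date"))
        if record["date"] is None:
            continue
        key = tuple(record[k] for k in dedup_keys)
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


# Hypothetical raw rows with made-up identifiers and column names.
raw_rows = [
    {"id": "P1", "when": "2018-10-05 12:00:00", "original_code": 100000001, "extra_col": 1},
    {"id": "P1", "when": "05/11/2018", "original_code": 100000001, "extra_col": 1},
    {"id": "P2", "when": "", "original_code": 100000001, "extra_col": 1},
    {"id": "P3", "when": "July 19, 2016", "original_code": 200000001, "extra_col": 1},
    {"id": "P4", "when": None, "original_code": 100000002, "extra_col": 1},
    {"id": "P1", "when": "2018-10-05 12:00:00", "original_code": 100000001, "extra_col": 1},
]

cleaned = clean(
    raw_rows,
    column_map={"id": "nhs_number", "original_code": "code", "when": "date"},
    dedup_keys=["nhs_number", "code", "date"],
)
for record in cleaned:
    print(record)
```

Trying formats in a fixed order keeps the behaviour predictable, but it does mean that a month-first date like `11/05/2018` would silently be read as day-first.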

Let's move onto Demographics. 

In [65]:
demographics = DemographicDataset(path_to_mapping_file="demographics/mapping.txt", path_to_demographic_file="demographics/gender_dummy.txt")

Similar to RawDataset, we can process this data to standardise it. We need to pass in a map of what columns mean in each of the 2 input files.  

In [66]:
mapping_config = {
    "mapping": {
        "OrageneID": "study_id",
        "PseudoNHS_2023-11-08": "nhs_number"
    },
    "demographics": {
        "S1QST_Oragene_ID": "study_id",
        "S1QST_MM-YYYY_ofBirth": "dob",
        "S1QST_Gender": "gender"
    }
}

In [67]:
demographics.process_dataset(mapping_config)

In [68]:
demographics.data

| nhs_number | gender | dob |
| --- | --- | --- |
| str | i64 | date |
| "84950DE0614A5C…" | 2 | 1983-10-15 |
| "73951AB0712D6E…" | 1 | 1979-01-15 |
| "53952EF0503F7F…" | 1 | 1948-06-15 |


We now have a clean dataset of date of birth and gender, keyed by NHS number. 
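
Conceptually, this step links the two files on the study ID and standardises the date of birth. The sketch below shows that idea in plain Python and is not the tretools implementation. It assumes the birth date arrives as `MM-YYYY` (suggested by the source column name) and is pinned to the 15th of the month (suggested by the output above); the rows and helper name are hypothetical.

```python
from datetime import date

# Hypothetical rows, already renamed to the standard column names.
mapping_rows = [
    {"study_id": "S1", "nhs_number": "AAA"},
    {"study_id": "S2", "nhs_number": "BBB"},
]
demographic_rows = [
    {"study_id": "S1", "dob": "10-1983", "gender": 2},
    {"study_id": "S2", "dob": "01-1979", "gender": 1},
    {"study_id": "S9", "dob": "06-1948", "gender": 1},  # no NHS number on file
]


def month_year_to_date(text, day=15):
    """Turn 'MM-YYYY' into a date, using a fixed mid-month day (an assumption)."""
    month, year = text.split("-")
    return date(int(year), int(month), day)


# Look up NHS numbers by study ID, then keep only the rows that can be linked.
nhs_by_study_id = {row["study_id"]: row["nhs_number"] for row in mapping_rows}

linked = [
    {
        "nhs_number": nhs_by_study_id[row["study_id"]],
        "gender": row["gender"],
        "dob": month_year_to_date(row["dob"]),
    }
    for row in demographic_rows
    if row["study_id"] in nhs_by_study_id
]

for record in linked:
    print(record)
```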

## Phenotype Reports

In [69]:
from tretools.phenotype_report.report import PhenotypeReport

We create an empty report. 

In [70]:
report = PhenotypeReport("Disease A")

We now add counts. A count includes the following fields:

- Dataset (compulsory)
- Codelist (compulsory)
- Demographics (optional)

In [71]:
report.add_count("primary_care", codelist=snomed_codelist, dataset=gp_processed_data, demographics=demographics)

This gives us a count summary. 

In [72]:
report.counts

{'primary_care': {'code': [100000001, 100000002],
  'patient_count': 2,
  'event_count': 4,
  'nhs_numbers': shape: (2, 5)
  ┌───────────────────────────────────┬───────────┬────────────┬──────────────┬────────┐
  │ nhs_number                        ┆ code      ┆ date       ┆ age_at_event ┆ gender │
  │ ---                               ┆ ---       ┆ ---        ┆ ---          ┆ ---    │
  │ str                               ┆ i64       ┆ date       ┆ i64          ┆ str    │
  ╞═══════════════════════════════════╪═══════════╪════════════╪══════════════╪════════╡
  │ 84950DE0614A5C241F7223FBCCD27BE8… ┆ 100000001 ┆ 2018-10-05 ┆ 34           ┆ F      │
  │ 73951AB0712D6E241E8222EDCCF28AE8… ┆ 100000001 ┆ 2013-06-03 ┆ 34           ┆ M      │
  └───────────────────────────────────┴───────────┴────────────┴──────────────┴────────┘,
  'codelist_path': 'codelists/disease_a_snomed.csv',
  'codelist_type': 'SNOMED',
  'dataset_type': 'primary_care',
  'log': ['2023-12-13 23:51:30.605885: There are
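
The patient count, event count and age at event in this summary can be reproduced by hand. The sketch below is plain Python with hypothetical patient identifiers, not tretools code. Keeping the earliest matching event per patient is an assumption based on the rows shown, and `age_on` is a hypothetical helper.

```python
from datetime import date

# Hypothetical matching events (patient, code, event date) and dates of birth.
events = [
    ("P1", 100000001, date(2018, 11, 5)),
    ("P1", 100000001, date(2018, 10, 5)),
    ("P1", 100000002, date(2019, 2, 12)),
    ("P2", 100000001, date(2013, 6, 3)),
]
dob = {"P1": date(1983, 10, 15), "P2": date(1979, 1, 15)}


def age_on(born, when):
    """Whole years elapsed between a date of birth and an event date."""
    had_birthday = (when.month, when.day) >= (born.month, born.day)
    return when.year - born.year - (0 if had_birthday else 1)


# Earliest matching event per patient (assumed selection rule).
first_event = {}
for patient, code, when in events:
    if patient not in first_event or when < first_event[patient][1]:
        first_event[patient] = (code, when)

summary = {
    "patient_count": len(first_event),
    "event_count": len(events),
    "age_at_first_event": {
        patient: age_on(dob[patient], when)
        for patient, (_, when) in first_event.items()
    },
}
print(summary)
```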

We can add a further count. Let's use a different dataset, this time one from Barts. 

In [73]:
barts_data = RawDataset("datasets/barts_diagnosis.tab", dataset_type="barts_health", coding_system="ICD10")
processed_data_hospital = barts_data.process_dataset(
    deduplication_options=["nhs_number", "code", "date"],
    column_maps={
        "ICD_Diagnosis_Cd": "code",
        "ICD_Diag_Desc": "term",
        "Activity_date": "date",
        "PseudoNHS_2023_04_24": "nhs_number",
    },
)

In [74]:
report.add_count("secondary_care", codelist=icd_codelist, dataset=processed_data_hospital, demographics=demographics)


Let's examine our counts now:

In [75]:
report.counts

{'primary_care': {'code': [100000001, 100000002],
  'patient_count': 2,
  'event_count': 4,
  'nhs_numbers': shape: (2, 5)
  ┌───────────────────────────────────┬───────────┬────────────┬──────────────┬────────┐
  │ nhs_number                        ┆ code      ┆ date       ┆ age_at_event ┆ gender │
  │ ---                               ┆ ---       ┆ ---        ┆ ---          ┆ ---    │
  │ str                               ┆ i64       ┆ date       ┆ i64          ┆ str    │
  ╞═══════════════════════════════════╪═══════════╪════════════╪══════════════╪════════╡
  │ 84950DE0614A5C241F7223FBCCD27BE8… ┆ 100000001 ┆ 2018-10-05 ┆ 34           ┆ F      │
  │ 73951AB0712D6E241E8222EDCCF28AE8… ┆ 100000001 ┆ 2013-06-03 ┆ 34           ┆ M      │
  └───────────────────────────────────┴───────────┴────────────┴──────────────┴────────┘,
  'codelist_path': 'codelists/disease_a_snomed.csv',
  'codelist_type': 'SNOMED',
  'dataset_type': 'primary_care',
  'log': ['2023-12-13 23:51:30.605885: There are

This produces a report in its raw form. For now it can be used by SummaryReportTransformer; more report transformer types will be added in the future. The advantage of this design is that a phenotype report can be run once and then used for many outputs. 

## Summary Report

In [76]:
from tretools.report_transformers.summary_report import SummaryReportTransformer

We put our reports into a list. 

In [80]:
reports = [report]

We create our SummaryReportTransformer from the list of reports and then transform the output. We must pass in the path where we want the summary to be created. 

In [86]:
summary_report = SummaryReportTransformer.load_from_objects(reports)

In [87]:
summary_report.transform(path="summary_report")

This should output a number of folders and files containing your report at that path. 