### [POC 3]  NGLY1 deficiency patient extraction

This analysis attempts to extract as many details as possible about NGLY1 deficiency patients from two full-text papers on the subject. Outline:

1. Define a minimal prompt to extract per-study patient identifiers and free-text descriptions of associated information
2. Summarize the results for each patient, across studies, in a standard schema
3. Analyze the patients

In [2]:
%load_ext autoreload
%autoreload 2
import io
import sys
import pandas as pd
import matplotlib.pyplot as plt
from ngly1_gpt import utils, llm, doc
import logging
logging.basicConfig(level=logging.INFO, stream=sys.stdout)
pd.set_option("display.max_colwidth", None, "display.max_rows", 400, "display.max_columns", None)

In [3]:
prompt = (utils.get_paths().prompts / "patient_extraction_1.txt").read_text()
print(prompt)

Text will be provided that contains information from a published, biomedical research article about {disease}.  Extract details about the patients discussed in this text. 

Requirements:

- Exclude any patients where context dictates that they do NOT have {disease}, e.g. when {disease} patients are compared to similar patients with other diseases.
- Extract as much information as possible about each patient including associated genotypes, phenotypes, physical or behavioral traits, demographics, lab measurements, treatments, family histories or anything else of clinical and/or biological relevance.
- Extract this information in CSV format with the following headers:
  - `patient_id`: Identifying information for the patient within the context of the article; typically an integer or anonymized id like "Patient 1". If some information applies to ALL patients in a study and the context does not make it possible to enumerate the patient ids, report only the value "ALL"
  - `external_study`: 

#### Execution

The prompt was run for all paper chunks via a command like:

```bash
PYTHONPATH="$(pwd)" python ngly1_gpt/cli.py extract_patients --output-filename=patients_1.tsv 2>&1 | tee data/logs/extract_patients_1.log.txt
```

The logs for these extractions, showing all prompts and results, are in [data/logs](data/logs).

#### Examples

#### Analysis

The original data had a minor error: a few chunks of text produced comma-delimited rather than pipe-delimited CSV content, so 7 of the 180 rows collapsed into a single column:

In [43]:
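(
    pd.read_csv(utils.get_paths().output_data / "patients_1.tsv", sep="\t")
    .pipe(utils.apply, lambda df: df.info())
    # Show only rows whose whole record collapsed into the single column
    # literally named "patient_id,external_study,details"
    .dropna(subset=["patient_id,external_study,details"])
)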

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 6 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   patient_id                         173 non-null    object
 1   external_study                     14 non-null     object
 2   details                            173 non-null    object
 3   doc_id                             180 non-null    object
 4   doc_filename                       180 non-null    object
 5   patient_id,external_study,details  7 non-null      object
dtypes: object(6)
memory usage: 8.6+ KB


Unnamed: 0,patient_id,external_study,details,doc_id,doc_filename,"patient_id,external_study,details"
62,,,,PMC4243708,PMC4243708.txt,"ALL,24,Patients with autosomal recessive mutations in ERLIN2 present profound intellectual disability, developmental regression and multiple contractures. Despite the severity of the intellectual disability and neuromuscular findings, the results of brain imaging, electromyography and muscle biopsy appeared normal in the initial erlin2-deficient patients."
63,,,,PMC4243708,PMC4243708.txt,"ALL,26,Another family was found to have a homozygous null mutation in ERLIN2, with affected individuals presenting with a hereditary spastic paraplegia phenotype."
83,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Clinical studies were designed to detail the phenotypic features of NGLY1-CDDG. Blood, urine, cerebral spinal fluid (CSF), lymphoblasts, and primary dermal fibroblasts were collected, analyzed, and stored. Studies included brain magnetic resonance imaging and spectroscopy (MRI and MRS, supplementary methods), routine and overnight electroencephalograms (EEGs) with a limited montage performed during a sleep study, electromyogram (EMG, supplementary methods) and nerve conduction studies (NCS, supplementary methods), indirect calorimetry, awake and sedated eye examination with Schirmer II testing, optical coherence tomography scans and electroretinography, behavioral determination of pure tone thresholds, tympanometry, distortion product otoacoustic emissions, auditory brainstem evoked potentials (ABR), quantitative sweat analysis autonomic testing (QSWEAT, supplementary methods), gastric aspiration, swallow study, skeletal survey, bone age, dual X-ray absorptiometry (DEXA), abdominal ultrasound, vibration controlled transient elastography (Fibroscan)12, echocardiogram, and electrocardiogram. Consultations included clinical neurology, audiology, nutrition, ophthalmology, hepatology, growth, puberty and hormonal studies, allergy and immunology, genetic counseling, physiatry, and speech, occupational, and physical therapy."
84,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Eleven individuals underwent developmental psychological evaluations, consisting of at least the Vineland Adaptive Behavior Scales, 2nd edition. Cognitive function was assessed with testing specific for age and developmental level that provided either an intelligence quotient (IQ) or developmental quotient (DQ) score."
85,,,,PMC7477955,PMC7477955.txt,"ALL,NA,The Nijmegen pediatric CDG rating scale, a measure of clinical disease progression developed for CDG, was applied to all affected individuals younger than 18 years."
131,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Compared to the reference population, N-acetylaspartylglutamate + N-acetylaspartate (NAA) was lower than normal in the left centrum semiovale (LCSO) (p=0.004), the midline parietal grey matter (PGM) (p=0.02), and superior cerebellar vermis (SVERM) (p<0.0001). There was a deficit of glutamine + glutamate + gamma-aminobutyric acid (Glx) in the PGM (p=0.03), LCSO (p=0.01), and pons (p=0.0002). Choline was higher than expected for age only in the LCSO (p=0.0097), and myo-inositol was higher than expected for age in the pons (p=0.002). Multiple correlations between these MRS-measured metabolites and age, functional assessments, brain volume, and neurotransmitters in the CSF were found. The general trend showed that the differences noted above became more pronounced with increasing age, worsening function, and lower brain volume. MRS metabolite measurements did not correlate with total CSF protein, CSF albumin, or CSF/serum albumin ratio. There was a weak correlation (p=0.09) between atrophy and total CSF protein, but not CSF albumin or CSF/serum albumin ratio."
169,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Strong correlation between brain atrophy on MRI and functional assessments suggests that loss of neurons contributes to the functional impairment. The atrophy also correlated with CSF metabolites (BH4, 5-HIAA, HVA), which are known to be lower when there is damage to neurotransmitter producing neurons. This suggests that these biochemical abnormalities may be secondary to brain atrophy."
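
Rows like these could also be repaired after the fact instead of re-running the extraction. The following is a minimal sketch, not part of `ngly1_gpt`: the helper name `split_collapsed_record` is hypothetical, and it assumes the first two fields never contain commas themselves, so splitting at most twice leaves the commas inside `details` untouched.

```python
def split_collapsed_record(record: str) -> dict[str, str]:
    """Split 'patient_id,external_study,details' on the first two commas only."""
    parts = record.split(",", 2)  # maxsplit=2 keeps commas inside the details text
    if len(parts) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got {len(parts)}")
    patient_id, external_study, details = (part.strip() for part in parts)
    return {
        "patient_id": patient_id,
        "external_study": external_study,
        "details": details,
    }


# One collapsed row taken from the output above
row = split_collapsed_record(
    "ALL,26,Another family was found to have a homozygous null mutation in ERLIN2, "
    "with affected individuals presenting with a hereditary spastic paraplegia phenotype."
)
print(row["patient_id"], row["external_study"])
```

If a patient identifier or study reference could itself contain a comma, this split would be wrong, which is one reason a less common delimiter such as the pipe is the safer choice for free-text output.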


A second run, after strengthening the prompt's wording about which delimiter to use, did not have that problem:

In [45]:
patients = pd.read_csv(utils.get_paths().output_data / "patients_2.tsv", sep="\t")
patients.info()
patients.sample(n=15, random_state=0)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   patient_id      173 non-null    object
 1   external_study  7 non-null      object
 2   details         173 non-null    object
 3   doc_id          185 non-null    object
 4   doc_filename    185 non-null    object
 5   `patient_id     12 non-null     object
 6   details`        12 non-null     object
dtypes: object(7)
memory usage: 10.2+ KB


Unnamed: 0,patient_id,external_study,details,doc_id,doc_filename,`patient_id,details`
33,Patient 8,,"Age: 16 years, Gender: Female, Ethnicity: Caucasian, Mutations: c1201A>T(p. R401X)/c.1201A>T(p.R401X), IUGR: Yes, Brain imaging abnormalities: Yes, Global developmental delay: Yes, Microcephaly: Yes, Hypotonia: Yes, Movement disorder: Yes, EEG abnormalities: Yes, Decreased DTRs: Yes, Seizures: Yes, Ocular apraxia: Yes, Alacrima/hypolacrima: Yes, Corneal ulcerations/scarring: Yes, Strabismus: Yes, Lactic acidosis: Yes, Constipation: Yes, Dysmorphic features: Yes, Scoliosis: Yes, Small hands/feet: Yes",PMC4243708,PMC4243708.txt,,
125,#2,,Had slight cerebellar atrophy.,PMC7477955,PMC7477955.txt,,
173,ALL,,"Patients have low Motor Skills scores and Daily Living Skills scores, indicating significant motor involvement required for daily tasks. They have a hyperkinetic movement disorder due to NGLY1-CDDG. This disorder may be due to myoclonic seizures, neurotransmitter deficiency, and/or peripheral neuropathy manifesting as sensory ataxia.",PMC7477955,PMC7477955.txt,,
112,SOME,Achouitar et al14,"Three individuals scored in the mild range, two in the moderate range, and six in the severe range.",PMC7477955,PMC7477955.txt,,
61,ALL,2425,"Patients with autosomal recessive mutations in ERLIN2 present profound intellectual disability, developmental regression and multiple contractures. Abnormal erlin2 causes impaired ERAD of activated inositol 1,4,5-triphosphate receptors (IP3) and other substrates by compromising the structure of the erlin1/2 complex. Despite the severity of the intellectual disability and neuromuscular findings, the results of brain imaging, electromyography and muscle biopsy appeared normal.",PMC4243708,PMC4243708.txt,,
18,ALL,,"All patients had global developmental delay, a movement disorder, and hypotonia. The most common mutation was associated with more severe outcomes.",PMC4243708,PMC4243708.txt,,
137,ALL,,"Awake and sedated ophthalmic examinations were performed. Observed conditions include Lagophthalmous, ptosis, exotropia and/or esotropia, corneal neovascularization, pannus formation or scarring, optic nerve pallor or atrophy, retinal.",PMC7477955,PMC7477955.txt,,
7,ALL,,The most common deleterious allele was the nonsense mutation c.1201A>T (p.R401X).,PMC4243708,PMC4243708.txt,,
5,SOME,,5 out of 6 patients had hepatocyte cytoplasmic storage material or vacuolization.,PMC4243708,PMC4243708.txt,,
162,ALL,,"Femoral bone density was low in all nine individuals who underwent DEXA scanning (mean, SEM z-scores for 8 patients < 21 years adjacent to the growth plate = −3, 0.4; metaphysis-diaphysis = −2.2, 0.6, and diaphysis = −1.8, 0.5).",PMC7477955,PMC7477955.txt,,
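
The `info()` output above also lists two columns whose names carry a stray backtick, with 12 non-null values each. This suggests (an assumption; the logs would confirm it) that some responses wrapped the header row in backticks. Below is a minimal sketch of folding such columns back into their clean counterparts. It uses plain dictionaries with `None` for missing values rather than a DataFrame, and `merge_backtick_columns` is a hypothetical helper, not part of `ngly1_gpt`.

```python
def merge_backtick_columns(row: dict) -> dict:
    """Strip backticks from column names, keeping the first non-missing value."""
    merged = {}
    for key, value in row.items():
        clean = key.strip("`")
        # Only fill the clean column if it has no value yet
        if merged.get(clean) is None:
            merged[clean] = value
    return merged


# Illustrative row, not taken from the data
raw = {
    "patient_id": None,
    "external_study": None,
    "details": None,
    "`patient_id": "Patient 1",
    "details`": "Example detail text.",
}
fixed = merge_backtick_columns(raw)
print(fixed)
```

Stripping the backticks when the response is first parsed would avoid the stray columns altogether, but a post-hoc merge like this keeps the already-extracted rows usable.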


In [46]:
(
    patients[['doc_id', 'patient_id']]
    .value_counts()
    .reset_index()
)

Unnamed: 0,doc_id,patient_id,count
0,PMC7477955,ALL,63
1,PMC7477955,SOME,18
2,PMC4243708,ALL,15
3,PMC4243708,SOME,13
4,PMC4243708,Patient 3,6
5,PMC4243708,Patient 4,5
6,PMC4243708,Patient 7,5
7,PMC4243708,Patient 6,5
8,PMC4243708,Patient 5,5
9,PMC4243708,Patient 2,5


In [53]:
patient_details = (
    patients
    .pipe(lambda df: df[~df['patient_id'].isin(['ALL', 'SOME'])])
    ['details']
    .dropna()
    .drop_duplicates()
)
patient_details.head().values

array(['3-year-old boy with compound heterozygous inactivating mutations in NGLY1. Clinical phenotype suggestive of a congenital disorder of glycosylation, although repeated transferrin isoelectric focusing and N-glycan analyses were normal. Liver biopsy showed accumulation of an amorphous unidentified substance throughout the cytoplasm, a finding likely consistent with NGLY1 dysfunction, which would be expected to result in abnormal accumulation of misfolded glycoproteins because of impaired cytosolic degradation.',
       'Exome sequencing was performed at Duke University using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit.',
       'Exome sequencing was performed at University of British Columbia using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit.',
       'Exome sequencing was performed at University of British Columbia using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb K

In [70]:
patient_details_list = "\n".join("- " + patient_details.sample(frac=.2, random_state=0))
len(doc.tokens(patient_details_list, "gpt-4"))

809

In [77]:
response = llm.chat_completion("patient_extraction_2.txt", model="gpt-4", temperature=0.0, details=patient_details_list)
print(response)

INFO:ngly1_gpt.llm:Prompt (temperature=0.0, model=gpt-4):
The following list of details contains specific characteristics of rare disease patients:

--- BEGIN DETAILS LIST ---
- A 3 base pair in-frame deletion TCC> beginning at position 3:25775416 (hg19) was identified in both the mother and daughter. An additional G>T mutation resulting in a heterozygous SMP at position 3:25777564 was identified in the daughter, mother and father. This mutation was not previously observed in 1000 genomes and is a coding region; however, it is present in heterozygous form in all three individuals.
- At Stanford University, the patient and parents were sequenced using both Illumina HiSeq2000 and Complete Genomics platforms. Variants in Illumina-sequenced reads were called using both the Hugeseq and Real Time Genomics pipelines and Complete Genomics variants were identified by their own variant callers. At Baylor College of Medicine, DNA was capture- sequenced using a commercially developed capture reage

INFO:ngly1_gpt.llm:Response:
```
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/patient.schema.json",
  "title": "Patient",
  "description": "A rare disease patient",
  "type": "object",
  "properties": {
    "study_id": {
      "type": "string",
      "category": "identifiers"
    },
    "patient_id": {
      "type": "string",
      "category": "identifiers"
    },
    "age": {
      "type": "integer",
      "category": "demographics"
    },
    "gender": {
      "type": "string",
      "category": "demographics"
    },
    "ethnicity": {
      "type": "string",
      "category": "demographics"
    },
    "mutations": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "category": "genotypes"
    },
    "brain_imaging_abnormalities": {
      "type": "boolean",
      "category": "phenotypes"
    },
    "global_developmental_delay": {
      "type": "boolean",
      "category": "phenotypes"
    },
    "microcepha

In [4]:
text = """
Neurologic Phenotype
Seven of twelve subjects had clinical seizures, and one had subclinical seizures recognized on previous EEG. Details regarding age of onset, seizure type and frequency, medications, and EEG findings are noted in Supplementary Table S. On overnight EEG, only one individual (#6) had active seizures recorded, but seven had multifocal epileptiform activity. There were no age or genotype differences between individuals having seizures and those without. In fact, in each sibling pair, one had seizures and the other did not.
All twelve individuals exhibited hyperkinetic movement disorders that included choreiform, athetoid, dystonic, myoclonic, action tremor, and dysmetric movements and were more severe in the younger individuals (Supplementary Movie S1).

Brain MRI and MRS
Eleven individuals underwent MRI and MRS of the brain. Clinical assessment of the images was not striking (Figure 3). Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination. Six of nine individuals had qualitatively-evident cerebral atrophy that ranged from slight to moderate. Four individuals (#1, #2, #6, #10) also had slight cerebellar atrophy. The atrophy tended to be greater in the older individuals (p=0.17, Supplementary Figure S3), and in one teenager (#11) follow-up imaging showed atrophy measurably worse after a 20-month interval (net loss of 34 cm3 relative to expected). Increased atrophy correlated with worsening of all functional measurements (Supplementary Figure S3), including IQ or DQ (p<0.03), Vineland assessments (p<0.03), and Nijmegen scores (p=0.01). Brain volume also directly correlated with CSF levels of 5-HIAA (p=0.03), tetrahydrobiopterin (p=0.02), and 5-HVA (p=0.06) (Supplementary Figure S3).
"""

In [10]:
text = """
Table 1
Clinical and molecular findings in NGLY1 deficiency
Patient 1	Patient 2	Patient 3	Patient 4	Patient 5	Patient 6	Patient 7	Patient 8	Totals
Age	5 y	20 y	4 y	2 y	d.5 y	d.9 m	3 y	16 y	
Gender	M	F	F	M	M	F	F	F	
Ethnicity	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	
Consanguinity	−	+	−	−	−	−	−	−	1/8
Mutations (maternal/paternal allele)	c.C1891del (p.Q631S)/c.1201A>T(p.R401X)	c.1370dupG(p.R458fs)/c.1370dupG(p.R458fs)	c.1205_1207del(p.402_403del)/c.1570C>(p.R524X)	c.1201A>T(p.R401X)c.1201A>T(pR401X)	c.1201A>T(p.R401X)/c.1201A>T(p.R401X)	c.1201A>T(p.R401X)/c.1201A>T(p.R401X)	c.1201A>Y(p.R401X)/c.1201A>T(p.R401X)	c1201A>T(p.R401X)/c.1201A>T(p.R401X)	
IUGR	−	+	−	+	+	+	−	+	5/8
Brain imaging abnormalities	+a	−b	+c	+d	+e	+f	−	+g	6/8
Global developmental delay	+	+	+	+	+	+	+	+	8/8
Microcephalyh	−	+	+	−	+	+	+	+	6/8
Hypotonia	+	+	+	+	+	+	+	+	8/8
Movement disorder	+	+	+	+	+	+	+	+	8/8
EEG abnormalities	+	+	+	+	+	+	−	+	7/8
↓DTRs	+	+	−	+	+	−	+	+	6/8
Seizures	+	−	−	+	+	−	−	+	4/8
Ocular apraxia	−	+	+	−	−	−	+	+	4/8
Alacrima/hypolacrima	+	+	+	+	+	−	+	+	7/8
Corneal ulcerations/scarring	+	+	−	+	−	−	−	+	4/8
Chalazions	+	−	+	+	−	−	+	−	4/8
Strabismus	−	−	+	+	−	−	+	+	5/8
ABR abnormalities	−	−	+	+	−	ND	ND	ND	2/5
Lactic acidosis	−	+	+	+	−i	ND	+	ND	4/6
Neonatal jaundice	+	−	+	+	−	−	+	−	4/8
Elevated liver transaminases	+	+	+	+	+	ND	+	−	6/7
Elevated AFP	+	−	−j	+	+	ND	ND	ND	3/5
Liver fibrosis	+	−	−	+	−	−	ND	ND	2/6
Liver storage or vacuolization	+	+	+	−	+k	+l	ND	ND	5/6
Constipation	+	+	+	+	+	−	+	+	7/8
Dysmorphic features	−	−	−	−	+m	+n	+o	+p	4/8
Scoliosis	−	+	−	+	+	−	−	+	4/8
Small hands/feet	+	−	+	+	−	−	−	+	4/8
Peripheral neuropathyq	+	+	ND	ND	+	ND	ND
"""

In [12]:
text = """
Patient 2, a now 20-year-old female, was born at 39 weeks of gestation via Cesarean section because of intrauterine growth retardation and an abnormal appearing placenta. At four months of age, hypotonia, developmental delay and elevated liver transaminases were noted. At approximately 4 years of age, a slight intention tremor and frequent involuntary movements of her neck, hands and arm were observed. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, as well as alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total care. She has very little expressive speech and communicates through an electronic speech-generating device. She continues to ambulate with a walker.
"""

In [76]:
response = llm.chat_completion("patient_extraction_1.txt", model="gpt-4", temperature=0.0, disease=utils.NGLY1_DEFICIENCY, text=text)
print(response)

INFO:ngly1_gpt.llm:Prompt (temperature=0.0, model=gpt-4):
Text will be provided that contains information from a published, biomedical research article about NGLY1 deficiency.  Extract details about the patients discussed in this text: 

--- BEGIN TEXT ---

Patient 2, a now 20-year-old female, was born at 39 weeks of gestation via Cesarean section because of intrauterine growth retardation and an abnormal appearing placenta. At four months of age, hypotonia, developmental delay and elevated liver transaminases were noted. At approximately 4 years of age, a slight intention tremor and frequent involuntary movements of her neck, hands and arm were observed. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, as well as alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total c

In [9]:
print(response)

patient_id|external_study|details
ALL|NA|Seven of twelve subjects had clinical seizures, and one had subclinical seizures recognized on previous EEG. On overnight EEG, only one individual had active seizures recorded, but seven had multifocal epileptiform activity. There were no age or genotype differences between individuals having seizures and those without. In each sibling pair, one had seizures and the other did not.
ALL|NA|All twelve individuals exhibited hyperkinetic movement disorders that included choreiform, athetoid, dystonic, myoclonic, action tremor, and dysmetric movements and were more severe in the younger individuals.
ALL|NA|Eleven individuals underwent MRI and MRS of the brain. Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination. Six of nine individuals had qualitatively-evident cerebral atrophy that ranged from slight to moderate.
#1|NA|Had slight cerebellar atrophy.
#2|NA|Had slight cerebel