### [POC 3]  NGLY1 deficiency patient extraction

This analysis attempts to extract as many details as possible about NGLY1 deficiency patients from two full-text papers on the subject. Outline:

1. Define a minimal prompt to extract per-study patient identifiers and free-text descriptions of associated information
2. Summarize the results for each patient, across studies, in a standard schema
3. Analyze the patients
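
Step 2 amounts to collecting every free-text detail extracted for a patient, across chunks, before mapping it onto a schema. The following is a minimal sketch of that grouping in plain Python, assuming rows shaped like the extraction output used later in this notebook (`doc_id`, `patient_accession`, `details`). The function name `group_details` is hypothetical and is not part of `ngly1_gpt`; the example rows are abbreviated from outputs shown further down.

```python
from collections import defaultdict

def group_details(rows):
    """Collect details per (doc_id, patient_accession), keeping order and dropping repeats."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["doc_id"], row["patient_accession"])
        if row["details"] not in grouped[key]:
            grouped[key].append(row["details"])
    return dict(grouped)

# Illustrative rows only; the real table has one row per extracted detail.
rows = [
    {"doc_id": "PMC4243708", "patient_accession": "2",
     "details": "WES was performed on a clinical basis."},
    {"doc_id": "PMC4243708", "patient_accession": "2",
     "details": "Both parents were confirmed to be heterozygous carriers by Sanger sequencing."},
    {"doc_id": "PMC4243708", "patient_accession": "2",
     "details": "WES was performed on a clinical basis."},  # repeat, dropped
    {"doc_id": "PMC7477955", "patient_accession": "ALL",
     "details": "The total foot length was < 3rd percentile in all 12 individuals."},
]
grouped = group_details(rows)
```

Keying on both `doc_id` and `patient_accession` matters because patient numbering restarts in each paper, so "Patient 2" in one study is unrelated to "2" in the other.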

In [2]:
%load_ext autoreload
%autoreload 2
import io
import sys
import pandas as pd
import matplotlib.pyplot as plt
from ngly1_gpt import utils, llm, doc
import logging
logging.basicConfig(level=logging.INFO, stream=sys.stdout)
pd.set_option("display.max_colwidth", None, "display.max_rows", 400, "display.max_columns", None)

In [3]:
prompt = (utils.get_paths().prompts / "patient_extraction_1.txt").read_text()
print(prompt)

Text will be provided that contains information from a published, biomedical research article about {disease}.  Extract details about the patients discussed in this text. 

Requirements:

- Exclude any patients where context dictates that they do NOT have {disease}, e.g. when {disease} patients are compared to similar patients with other diseases.
- Extract as much information as possible about each patient including associated genotypes, phenotypes, physical or behavioral traits, demographics, lab measurements, treatments, family histories or anything else of clinical and/or biological relevance.
- Extract this information in CSV format with the following headers:
  - `patient_id`: Identifying information for the patient within the context of the article; typically an integer or anonymized id like "Patient 1". If some information applies to ALL patients in a study and the context does not make it possible to enumerate the patient ids, report only the value "ALL"
  - `external_study`: 

##### Execution

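The prompt was run for all paper chunks via commands like the following, which extract patient rows, infer a schema from a sample of them, and export structured records: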

```bash
PYTHONPATH="$(pwd)" python ngly1_gpt/cli.py extract_patients --output-filename=patients.tsv 2>&1 | tee data/logs/extract_patients.log.txt
PYTHONPATH="$(pwd)" python ngly1_gpt/cli.py infer_patients_schema --sampling-rate=.75 --input-filename=patients.tsv --output-filename=patients.schema.json 2>&1 | tee data/logs/infer_patients_schema.log.txt
PYTHONPATH="$(pwd)" python ngly1_gpt/cli.py export_patients --input-data-filename=patients.tsv --input-schema-filename=patients.schema.json --output-filename=patients.json 2>&1 | tee data/logs/export_patients.log.txt
```

The logs for these runs, which show all prompts and results, are in [data/logs](data/logs).

#### Examples

#### Analysis

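The original data had a minor error where a few chunks of text resulted in comma-delimited rather than pipe-delimited CSV content, so each affected row was parsed into a single `patient_id,external_study,details` column: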

In [43]:
(
    pd.read_csv(utils.get_paths().output_data / "patients_1.tsv", sep="\t")
    .pipe(utils.apply, lambda df: df.info())
    .dropna(subset='patient_id,external_study,details')
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 6 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   patient_id                         173 non-null    object
 1   external_study                     14 non-null     object
 2   details                            173 non-null    object
 3   doc_id                             180 non-null    object
 4   doc_filename                       180 non-null    object
 5   patient_id,external_study,details  7 non-null      object
dtypes: object(6)
memory usage: 8.6+ KB


Unnamed: 0,patient_id,external_study,details,doc_id,doc_filename,"patient_id,external_study,details"
62,,,,PMC4243708,PMC4243708.txt,"ALL,24,Patients with autosomal recessive mutations in ERLIN2 present profound intellectual disability, developmental regression and multiple contractures. Despite the severity of the intellectual disability and neuromuscular findings, the results of brain imaging, electromyography and muscle biopsy appeared normal in the initial erlin2-deficient patients."
63,,,,PMC4243708,PMC4243708.txt,"ALL,26,Another family was found to have a homozygous null mutation in ERLIN2, with affected individuals presenting with a hereditary spastic paraplegia phenotype."
83,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Clinical studies were designed to detail the phenotypic features of NGLY1-CDDG. Blood, urine, cerebral spinal fluid (CSF), lymphoblasts, and primary dermal fibroblasts were collected, analyzed, and stored. Studies included brain magnetic resonance imaging and spectroscopy (MRI and MRS, supplementary methods), routine and overnight electroencephalograms (EEGs) with a limited montage performed during a sleep study, electromyogram (EMG, supplementary methods) and nerve conduction studies (NCS, supplementary methods), indirect calorimetry, awake and sedated eye examination with Schirmer II testing, optical coherence tomography scans and electroretinography, behavioral determination of pure tone thresholds, tympanometry, distortion product otoacoustic emissions, auditory brainstem evoked potentials (ABR), quantitative sweat analysis autonomic testing (QSWEAT, supplementary methods), gastric aspiration, swallow study, skeletal survey, bone age, dual X-ray absorptiometry (DEXA), abdominal ultrasound, vibration controlled transient elastography (Fibroscan)12, echocardiogram, and electrocardiogram. Consultations included clinical neurology, audiology, nutrition, ophthalmology, hepatology, growth, puberty and hormonal studies, allergy and immunology, genetic counseling, physiatry, and speech, occupational, and physical therapy."
84,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Eleven individuals underwent developmental psychological evaluations, consisting of at least the Vineland Adaptive Behavior Scales, 2nd edition. Cognitive function was assessed with testing specific for age and developmental level that provided either an intelligence quotient (IQ) or developmental quotient (DQ) score."
85,,,,PMC7477955,PMC7477955.txt,"ALL,NA,The Nijmegen pediatric CDG rating scale, a measure of clinical disease progression developed for CDG, was applied to all affected individuals younger than 18 years."
131,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Compared to the reference population, N-acetylaspartylglutamate + N-acetylaspartate (NAA) was lower than normal in the left centrum semiovale (LCSO) (p=0.004), the midline parietal grey matter (PGM) (p=0.02), and superior cerebellar vermis (SVERM) (p<0.0001). There was a deficit of glutamine + glutamate + gamma-aminobutyric acid (Glx) in the PGM (p=0.03), LCSO (p=0.01), and pons (p=0.0002). Choline was higher than expected for age only in the LCSO (p=0.0097), and myo-inositol was higher than expected for age in the pons (p=0.002). Multiple correlations between these MRS-measured metabolites and age, functional assessments, brain volume, and neurotransmitters in the CSF were found. The general trend showed that the differences noted above became more pronounced with increasing age, worsening function, and lower brain volume. MRS metabolite measurements did not correlate with total CSF protein, CSF albumin, or CSF/serum albumin ratio. There was a weak correlation (p=0.09) between atrophy and total CSF protein, but not CSF albumin or CSF/serum albumin ratio."
169,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Strong correlation between brain atrophy on MRI and functional assessments suggests that loss of neurons contributes to the functional impairment. The atrophy also correlated with CSF metabolites (BH4, 5-HIAA, HVA), which are known to be lower when there is damage to neurotransmitter producing neurons. This suggests that these biochemical abnormalities may be secondary to brain atrophy."
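
The malformed rows above could also be repaired after the fact rather than by re-running the extraction. This is a minimal sketch, not what the notebook does (it reruns with a revised prompt). It assumes the first two fields never contain commas, which holds for the rows shown, and the helper name `split_comma_row` is hypothetical.

```python
def split_comma_row(raw, n_fields=3):
    """Split a comma-delimited row into n_fields, keeping commas inside the last field."""
    # maxsplit = n_fields - 1 so that commas within the free-text details survive.
    parts = raw.split(",", n_fields - 1)
    if len(parts) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(parts)}: {raw!r}")
    return [part.strip() for part in parts]

raw = (
    "ALL,26,Another family was found to have a homozygous null mutation in ERLIN2, "
    "with affected individuals presenting with a hereditary spastic paraplegia phenotype."
)
patient_id, external_study, details = split_comma_row(raw)
```

Limiting the number of splits is the key design choice: a naive `raw.split(",")` would cut the `details` text at every comma in the sentence.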


A later run, after tweaking the prompt with stronger language about which delimiter to use, did not have that problem:

In [207]:
#patients = pd.read_csv(utils.get_paths().output_data / "patients_2.1.tsv", sep="\t")
#patients = pd.read_csv(utils.get_paths().output_data / "patients_3.tsv", sep="\t")
patients = pd.read_csv(utils.get_paths().output_data / "patients_4.tsv", sep="\t")
patients.info()
patients.sample(n=15, random_state=0)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   patient_id         203 non-null    object
 1   patient_accession  202 non-null    object
 2   external_study     15 non-null     object
 3   details            203 non-null    object
 4   doc_id             203 non-null    object
 5   doc_filename       203 non-null    object
dtypes: object(6)
memory usage: 9.6+ KB


Unnamed: 0,patient_id,patient_accession,external_study,details,doc_id,doc_filename
18,Patient 6,6,,"Patient 6 is a 0.9 month old Caucasian female. She does not have consanguinity. Her mutations are c.1201A>T(p. R401X)/c.1201A>T(p.R401X). She has IUGR, brain imaging abnormalities, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, alacrima/hypolacrima, elevated liver transaminases, liver storage or vacuolization, and constipation.",PMC4243708,PMC4243708.txt
45,Patients 5 and 6,5,,Significant brain disease was noted on autopsy in Patient 5 who was found to have pathological changes consistent with hypoxic-ischemic encephalopathy (HIE).,PMC4243708,PMC4243708.txt
33,Patient 2,2,,"Patient 2 underwent Whole Exome Sequencing (WES) at Baylor College of Medicine Whole Genome Laboratory, which revealed a homozygous mutation in exon 9 of the NGLY1 gene, denoted as c.1370dupG or p.R458fs. Both parents were confirmed to be heterozygous carriers by Sanger sequencing. The mutation causes a frame shift in codon 458, causing insertion of 13 incorrect residues before a stop codon is introduced towards the end of exon 9. The mutation was not seen in any of 3321 other subjects sequenced at Duke, nor was it seen in 6503 subjects on the Exome Variant Server (NHLBI GO Exome Sequencing Project (ESP), Seattle, WA).",PMC4243708,PMC4243708.txt
37,Patient 4,4,,"Sanger sequencing (Duke University) detected a homozygous nonsense mutation, p.R401X, at position 3:25775422 (hg19) in transcript ENST00000280700. At the cDNA level this is c.1201A>T in exon 8 of NGLY1. This finding was confirmed in a CLIA- certified laboratory (GeneDx).",PMC4243708,PMC4243708.txt
109,ALL,ALL,,"Increased atrophy correlated with worsening of all functional measurements, including IQ or DQ, Vineland assessments, and Nijmegen scores. Brain volume also directly correlated with CSF levels of 5-HIAA, tetrahydrobiopterin, and 5-HVA.",PMC7477955,PMC7477955.txt
90,ALL,ALL,,The total foot length was < 3rd percentile in all 12 individuals.,PMC7477955,PMC7477955.txt
5,Patient 6,6,University of British Columbia,Exome sequencing was performed using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit.,PMC4243708,PMC4243708.txt
124,SOME,SOME,,Ten of twelve individuals had some degree of constipation.,PMC7477955,PMC7477955.txt
12,ALL,ALL,,"All patients had global developmental delay, a movement disorder, and hypotonia. Other common findings included hypo- or alacrima (7/8), abnormal brain imaging (7/8), EEG abnormalities (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), seizures (4/8), and abnormal nerve conduction (3/3). Two of the patients died prematurely at 9 months and 5 years of age.",PMC4243708,PMC4243708.txt
153,1,1,,Patient 1 had onset of seizures at 0.7 years with initial type being Infantile Spasms. No seizures per day were reported. The patient was on Levetiracetam medication. No epileptiform discharge localization was observed. Background slowing was present with a PDR of 7 Hertz. Anterior/Posterior gradient was present.,PMC7477955,PMC7477955.txt


In [208]:
(
    patients[['doc_id', 'patient_id', 'patient_accession']]
    .value_counts()
    .reset_index()
)

Unnamed: 0,doc_id,patient_id,patient_accession,count
0,PMC7477955,ALL,ALL,50
1,PMC4243708,ALL,ALL,11
2,PMC7477955,SOME,SOME,11
3,PMC4243708,Patient 3,3,8
4,PMC7477955,6,6,7
5,PMC4243708,Patient 2,2,6
6,PMC7477955,11,11,6
7,PMC7477955,7,7,6
8,PMC7477955,2,2,6
9,PMC7477955,8,8,6


In [209]:
patient_details = (
    patients
    .pipe(lambda df: df[pd.to_numeric(df['patient_accession'], errors='coerce').notnull()])
    ['details']
    .dropna()
    .drop_duplicates()
)
patient_details.head().values

array(['Exome sequencing was performed using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit.',
       'The patient and parents were sequenced using both Illumina HiSeq2000 and Complete Genomics platforms. Variants in Illumina-sequenced reads were called using both the Hugeseq and Real Time Genomics pipelines and Complete Genomics variants were identified by their own variant callers.',
       'DNA was capture- sequenced using a commercially developed capture reagent (VCRome2). Sequence data were generated on an Illumina HiSeq2000 producing an average coverage of 80× with >90% of targeted bases at 20× coverage or higher.',
       'WES was performed on a clinical basis.',
       'Sanger sequencing of NGLY1 was performed and results were confirmed by a clinical laboratory (GeneDx, Gaithersburg, Maryland).'],
      dtype=object)
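
The filter in the cell above keeps only rows whose `patient_accession` parses as a number, which drops the cohort-level `ALL` and `SOME` rows, and then removes repeated details. The sketch below restates that logic in plain Python so the behaviour is easy to check without pandas. `is_numeric` is a hypothetical helper that only approximates `pd.to_numeric(..., errors='coerce').notnull()`, and the rows are copied from outputs shown earlier.

```python
import math

def is_numeric(value):
    """Approximate pd.to_numeric(value, errors='coerce') followed by notnull()."""
    try:
        return not math.isnan(float(value))
    except (TypeError, ValueError):
        return False

exome = ("Exome sequencing was performed using the Illumina HiSeq2000 platform "
         "and the Agilent SureSelect Human All Exon 50 Mb Kit.")
rows = [
    ("6", exome),
    ("ALL", "The total foot length was < 3rd percentile in all 12 individuals."),
    ("SOME", "Ten of twelve individuals had some degree of constipation."),
    ("5", exome),  # same sentence attached to another patient
    ("2", "WES was performed on a clinical basis."),
]

# Keep per-patient rows only, then de-duplicate while preserving first-seen order.
numeric_details = [details for accession, details in rows if is_numeric(accession)]
unique_details = list(dict.fromkeys(numeric_details))
```

De-duplication matters here because the same methods sentence can be attached to several patients, and repeating it adds tokens to the schema prompt without adding information.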

In [212]:
patient_details_list = "\n".join("- " + patient_details.sample(frac=.75, random_state=0))
#patient_details_list = "\n".join("- " + patient_details)
print(f'Num tokens: {len(doc.tokens(patient_details_list, "gpt-4"))}')

Num tokens: 6720
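
One plausible reason to sample 75% of the details is to keep the schema-inference prompt within a token budget, although the notebook does not state this. The sketch below picks the largest sampling fraction that fits a budget. Everything in it is an assumption for illustration: `sample_to_budget` is a hypothetical helper, `count_tokens` is a crude whitespace count rather than a real model tokenizer such as the one behind `doc.tokens`, and the details are synthetic placeholders.

```python
import random

def count_tokens(text):
    # Crude stand-in: whitespace-separated words, NOT a real model tokenizer.
    return len(text.split())

def sample_to_budget(details, budget, fractions=(1.0, 0.75, 0.5, 0.25), seed=0):
    """Return (fraction, bulleted_text) for the largest fraction that fits the budget."""
    for frac in fractions:
        k = max(1, round(len(details) * frac))
        # A fresh seeded generator per fraction keeps each sample reproducible.
        sampled = random.Random(seed).sample(details, k)
        text = "\n".join("- " + detail for detail in sampled)
        if count_tokens(text) <= budget:
            return frac, text
    raise ValueError("no sampling fraction fits the budget")

# Synthetic placeholder details, four whitespace tokens per bulleted line.
details = [f"detail number {i}" for i in range(8)]
frac, text = sample_to_budget(details, budget=25)
```

With a real tokenizer swapped in for `count_tokens`, the same loop would report which sampling rate keeps the prompt under a chosen limit.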


In [214]:
patient_schema = llm.create_patient_schema(details=patient_details_list, temperature=0)
print(patient_schema)

INFO:ngly1_gpt.llm:Prompt (temperature=0, model=gpt-4):
The following list of details contains specific characteristics of rare disease patients:

--- BEGIN DETAILS LIST ---
- Patient 8 has Protein level of 40 mg/dL, Albumin level of 24 mg/dL, CSF/serum Albumin quotient of 7.1, 5HIAA level of 195 nM, HVA level of 376 nM, Neopterin level of 12 nM, BH4 level of 18 nM, Lactate level of 1.1 mM, and normal Amino acids.
- Patient 7 has Protein level of 13 mg/dL, Albumin level of 9 mg/dL, CSF/serum Albumin quotient of 2.5, 5HIAA level of 169 nM, HVA level of 327 nM, Neopterin level of 17 nM, BH4 level of 13 nM, Lactate level of 1.3 mM, and Glutamine in Amino acids.
- Patient 6 has peripheral neuropathy (PN) in sensory and motor nerves, demyelinative conduction velocity (CV), normal findings in one arm muscle but noted to have chronic neurogenic changes one year later, and absent QSWEAT findings.
- Patient 4 is a 2 year old Caucasian male. He does not have consanguinity. His mutations are c.12

In [239]:
patient_records = pd.read_json(utils.get_paths().output_data / "patients.json", lines=True)
patient_records.info()
patient_records.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   doc_id                 22 non-null     object 
 1   patient_accession      22 non-null     object 
 2   extra_info             22 non-null     object 
 3   age                    21 non-null     float64
 4   gender                 21 non-null     object 
 5   ethnicity              9 non-null      object 
 6   consanguinity          9 non-null      float64
 7   mutations              22 non-null     object 
 8   phenotypes             16 non-null     object 
 9   sequencing_data        10 non-null     object 
 10  lab_measurements       17 non-null     object 
 11  neurological_findings  17 non-null     object 
 12  seizure_history        16 non-null     object 
 13  family_history         7 non-null      object 
 14  scores                 19 non-null     object 
dtypes: float

Unnamed: 0,doc_id,patient_accession,extra_info,age,gender,ethnicity,consanguinity,mutations,phenotypes,sequencing_data,lab_measurements,neurological_findings,seizure_history,family_history,scores
0,PMC4243708,1,"[Exome sequencing was performed using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit., Whole exome sequencing (WES) performed as part of a research protocol detected putative knock out mutations forming a compound heterozygote genotype in the NGLY1 gene., A 5-year-old male who presented in the neonatal period with involuntary movements, including athetosis involving the trunk and extremities and constant lip smacking and pursing while awake.]",5.0,male,Caucasian,0.0,"[c.C1891del (p.Q631S), c.1201A>T(p.R401X)]","[brain imaging abnormalities, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, seizures, alacrima/hypolacrima, corneal ulcerations/scarring, chalazions, neonatal jaundice, elevated liver transaminases, elevated AFP, liver fibrosis, liver storage or vacuolization, constipation, small hands/feet, peripheral neuropathy]","{'sequencing_center': 'Research Protocol', 'sequencing_platform': 'Illumina HiSeq2000', 'coverage': '50 Mb'}",,,,,
1,PMC4243708,2,"[WES was performed on a clinical basis., Patient 2 underwent Whole Exome Sequencing (WES) at Baylor College of Medicine Whole Genome Laboratory, which revealed a homozygous mutation in exon 9 of the NGLY1 gene, denoted as c.1370dupG or p.R458fs., Both parents were confirmed to be heterozygous carriers by Sanger sequencing., The mutation causes a frame shift in codon 458, causing insertion of 13 incorrect residues before a stop codon is introduced towards the end of exon 9., The mutation was not seen in any of 3321 other subjects sequenced at Duke, nor was it seen in 6503 subjects on the Exome Variant Server (NHLBI GO Exome Sequencing Project (ESP), Seattle, WA).]",20.0,female,Caucasian,1.0,"[c.1370dupG(p.R458fs), c.1370dupG(p.R458fs)]","[IUGR, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, ocular apraxia, alacrima/hypolacrima, corneal ulcerations/scarring, elevated liver transaminases, liver storage or vacuolization, constipation, scoliosis, peripheral neuropathy]","{'sequencing_center': 'Baylor College of Medicine Whole Genome Laboratory', 'sequencing_platform': 'Clinical WES', 'coverage': 'Exon 9'}",,,,,
2,PMC4243708,3,"[The patient and parents were sequenced using both Illumina HiSeq2000 and Complete Genomics platforms., Variants in Illumina-sequenced reads were called using both the Hugeseq and Real Time Genomics pipelines and Complete Genomics variants were identified by their own variant callers., DNA was capture- sequenced using a commercially developed capture reagent (VCRome2)., Sequence data were generated on an Illumina HiSeq2000 producing an average coverage of 80× with >90% of targeted bases at 20× coverage or higher., WES and whole-genome sequencing were performed using research protocols at Baylor College of Medicine and Stanford University., Mutations in NGLY1 that followed a compound heterozygous inheritance pattern were identified., A stop gain mutation caused by a G>A mutation at position 3:25761670 (hg19) resulting in p.R542X was identified in both the father and daughter., A 3 base pair in-frame deletion TCC> beginning at position 3:25775416 (hg19) was identified in both the mother and daughter., An additional G>T mutation resulting in a heterozygous SMP at position 3:25777564 was identified in the daughter, mother and father., This mutation was not previously observed in 1000 genomes and is a coding region; however, it is present in heterozygous form in all three individuals., A moderate reduction in mitochondrial DNA content was identified in a liver sample from Patient 3.]",4.0,female,Caucasian,0.0,"[c.1205_1207del(p.402_403del), c.1570C>(p. R524X)]","[brain imaging abnormalities, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, ocular apraxia, alacrima/hypolacrima, chalazions, neonatal jaundice, elevated liver transaminases, liver storage or vacuolization, constipation, small hands/feet]","{'sequencing_center': 'Baylor College of Medicine and Stanford University', 'sequencing_platform': 'Illumina HiSeq2000 and Complete Genomics', 'coverage': '80×'}",,,,,
3,PMC4243708,4,"[Sanger sequencing of NGLY1 was performed and results were confirmed by a clinical laboratory (GeneDx, Gaithersburg, Maryland)., A 2-year-old boy, delivered by Cesarean section at 38 weeks of gestation after fetal distress was noted on cardiotocography., Sanger sequencing (Duke University) detected a homozygous nonsense mutation, p.R401X, at position 3:25775422 (hg19) in transcript ENST00000280700., This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years.]",2.0,male,Caucasian,0.0,[c.1201A>T(p. R401X)c.1201A>T(pR401X)],"[IUGR, brain imaging abnormalities, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, seizures, alacrima/hypolacrima, corneal ulcerations/scarring, chalazions, neonatal jaundice, elevated liver transaminases, elevated AFP, liver fibrosis, constipation, dysmorphic features, scoliosis, small hands/feet]",{},{},{},{},[],{}
4,PMC4243708,5,"[Exome sequencing was performed using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit., Patient 5, a boy, died at the age of 5 years., Significant brain disease was noted on autopsy in Patient 5 who was found to have pathological changes consistent with hypoxic-ischemic encephalopathy (HIE)., This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years.]",0.5,male,Caucasian,0.0,[c.1201A>T(p. R401X)/c.1201A>T(p.R401X)],"[IUGR, brain imaging abnormalities, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, seizures, alacrima/hypolacrima, neonatal jaundice, elevated liver transaminases, elevated AFP, liver storage or vacuolization, constipation, dysmorphic features, scoliosis]",{},{},{},{},[],{}
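
Once the records are in a standard schema, step 3 of the outline can start with simple tallies. The sketch below counts how often each phenotype appears across individual patients, using only the standard library. It assumes JSON Lines input with `patient_accession` stored as a string and `phenotypes` as a list or null. The first two records are truncated to the first few phenotypes from the rows shown above, and the third is a made-up placeholder that only demonstrates null handling.

```python
import io
import json
from collections import Counter

# Stand-in for patients.json (JSON Lines); see the caveats in the text above.
jsonl = io.StringIO(
    '{"doc_id": "PMC4243708", "patient_accession": "1", "phenotypes": '
    '["brain imaging abnormalities", "global developmental delay", "hypotonia", "movement disorder"]}\n'
    '{"doc_id": "PMC4243708", "patient_accession": "2", "phenotypes": '
    '["IUGR", "microcephaly", "global developmental delay", "hypotonia", "movement disorder"]}\n'
    '{"doc_id": "PMC4243708", "patient_accession": "ALL", "phenotypes": null}\n'
)

records = [json.loads(line) for line in jsonl if line.strip()]

# Count phenotypes for individual patients only, skipping cohort-level rows.
counts = Counter(
    phenotype
    for record in records
    if record["patient_accession"] not in ("ALL", "SOME")
    for phenotype in (record.get("phenotypes") or [])
)
```

Calling `counts.most_common()` then lists phenotypes from most to least frequent. Excluding `ALL` and `SOME` keeps cohort-level summaries from being counted as if they were additional patients.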


In [241]:
patient_records['doc_id'].unique()

array(['PMC4243708', 'PMC7477955'], dtype=object)

In [243]:
patient_records.dropna(subset='sequencing_data').sample(5, random_state=0).T

Unnamed: 0,2,8,4,21,1
doc_id,PMC4243708,PMC4243708,PMC4243708,PMC7477955,PMC4243708
patient_accession,3,ALL,5,ALL,2
extra_info,"[The patient and parents were sequenced using both Illumina HiSeq2000 and Complete Genomics platforms., Variants in Illumina-sequenced reads were called using both the Hugeseq and Real Time Genomics pipelines and Complete Genomics variants were identified by their own variant callers., DNA was capture- sequenced using a commercially developed capture reagent (VCRome2)., Sequence data were generated on an Illumina HiSeq2000 producing an average coverage of 80× with >90% of targeted bases at 20× coverage or higher., WES and whole-genome sequencing were performed using research protocols at Baylor College of Medicine and Stanford University., Mutations in NGLY1 that followed a compound heterozygous inheritance pattern were identified., A stop gain mutation caused by a G>A mutation at position 3:25761670 (hg19) resulting in p.R542X was identified in both the father and daughter., A 3 base pair in-frame deletion TCC> beginning at position 3:25775416 (hg19) was identified in both the mother and daughter., An additional G>T mutation resulting in a heterozygous SMP at position 3:25777564 was identified in the daughter, mother and father., This mutation was not previously observed in 1000 genomes and is a coding region; however, it is present in heterozygous form in all three individuals., A moderate reduction in mitochondrial DNA content was identified in a liver sample from Patient 3.]","[All patients had global developmental delay, a movement disorder, and hypotonia., Other common findings included hypo- or alacrima (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), and seizures (4/8)., The nonsense mutation c.1201A>T (p.R401X) was the most common deleterious allele., NGLY1 deficiency is a novel autosomal recessive disorder of the ERAD pathway associated with neurological dysfunction, abnormal tear production, and liver disease., The majority of patients 
detected to date carry a specific nonsense mutation that appears to be associated with severe disease., Seven newly diagnosed patients with mutations in NGLY1., These observations confirm NGLY1 deficiency as an inherited disorder associated with the ERAD process and document its clinical presentation., All patients had global developmental delay, a movement disorder, and hypotonia., Other common findings included hypo- or alacrima (7/8), abnormal brain imaging (7/8), EEG abnormalities (7/8), elevated liver transaminases (6/7), microcephaly (6/8), diminished reflexes (6/8), hepatocyte cytoplasmic storage material or vacuolization (5/6), seizures (4/8), and abnormal nerve conduction (3/3)., Two of the patients died prematurely at 9 months and 5 years of age., The nonsense mutation c.1201A>T (p.R401X) was the most common deleterious allele identified, present in homozygous state in 5 of 8 cases and in compound heterozygous state in one case., Patients have a striking clinical triad consisting of abnormal tear production, choreoathetosis and liver disease., In addition, global developmental delay, acquired microcephaly, hypotonia, EEG abnormalities with or without overt seizures, brain imaging abnormalities, a peripheral neuropathy, constipation and a history of IUGR were common findings., Some patients were noted to have dysmorphic features, but overall these were not considered to be particularly prominent., Low uE3 was present in three patients, including in two children who died and were found to have significant adrenal cortex vacuolization., Although adrenal function was not specifically evaluated in our patients, some degree of dysfunction remains possible., Patients in this cohort were initially suspected to have congenital disorders of glycosylation (CDGs) due to multisystem disease involving the central nervous system, heart, liver and gastrointestinal tract, and endocrine system., However, unlike CDGs, NGLY1 deficiency in these patients was not associated 
with cerebellar atrophy, lipodystrophy, or significant heart manifestations., Common findings in these patients included abnormalities detected on brain imaging, but these were typically mild and non-specific., The combination of hypo- or alacrima and a movement disorder consisting of tremulousness and varying degrees of choreoathetosis appear to be pathognomonic for NGLY1 deficiency in these patients., Mitochondrial disorders were also considered as diagnostic possibilities., Lactic acidemia was variably present, but tended to be mild; chronic elevations were not noted in any patient., Patients with autosomal recessive mutations in ERLIN2 present profound intellectual disability, developmental regression and multiple contractures., Despite the severity of the intellectual disability and neuromuscular findings, the results of brain imaging, electromyography and muscle biopsy appeared normal in these patients., A family was found to have a homozygous null mutation in ERLIN2, with affected individuals presenting with a hereditary spastic paraplegia phenotype., Three patients showed accumulation of an amorphous substance in the liver, which is supportive of the role of NGLY1 and may help explain the liver disease noted in these patients., The undefined stored substance likely represents accumulation of misfolded glycoproteins in the cytoplasm that have been retrotranslocated from the ER but cannot undergo further processing., Two other patients showed vacuolization consistent with storage., Transferrin isoelectric focusing or mass spectrometry studies in NGLY1-deficient patients have been normal or only subtly abnormal., Patients have axonal loss and gliosis in the brains suggestive of HIE, indicating that NGLY1 plays a role in maintaining central nervous system integrity., They also seem to have peripheral neuropathy that is relatively common in NGLY1 deficiency., Further studies are needed to determine the underlying pathogenesis of both central and peripheral 
nervous system abnormalities found in these patients., NGLY1 deficiency is characterized by a constellation of unique features including elevation of AFP and liver enzymes in infancy with relative normalization in early childhood, accumulation of a substance with staining properties similar to glycogen in hepatocyte cytoplasm, absent tears resulting in blepharitis and corneal ulceration, a movement disorder and peripheral neuropathy., The transient nature of the AFP and liver transaminase elevation may make older or more mildly affected individuals difficult to detect.]","[Exome sequencing was performed using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit., Patient 5, a boy, died at the age of 5 years., Significant brain disease was noted on autopsy in Patient 5 who was found to have pathological changes consistent with hypoxic-ischemic encephalopathy (HIE)., This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years.]","[NGLY1 deficiency was first reported in 2012 by Need et al., Seven additional affected individuals identified through a social media campaign., Clinical studies were designed to detail the phenotypic features of NGLY1-CDDG., Comprehensive, prospective, clinical, molecular, radiologic, and laboratory investigations were performed., Eleven individuals underwent developmental psychological evaluations., Most affected individuals had hypotonic facies, and the features of older individuals reflected their low weight., Individuals grew poorly after mid-childhood, with weight affected more than height., All five individuals carrying the common mutation c.1201A>T were either moderately or severely impaired., There was no significant difference in disease severity of males compared to females., All twelve subjects had at least some developmental delay or intellectual disability, with a broad range of severity., Seven of twelve subjects had clinical seizures, 
and one had subclinical seizures recognized on previous EEG., All twelve individuals exhibited hyperkinetic movement disorders., Increased atrophy correlated with worsening of all functional measurements., Compared to the reference population, N-acetylaspartylglutamate + N-acetylaspartate (NAA) was lower than normal., Nine subjects underwent lumbar puncture., Nerve conduction studies were performed in 11 individuals., QSWEATs, performed in 11 individuals, were abnormal in the same 8 individuals who had axonal sensorimotor neuropathies., Awake and sedated ophthalmic examinations were performed on 11 study participants., Audiologic assessments were conducted on 11 subjects., Clinical feeding and modified barium swallow assessments were performed on 11 subjects., Echocardiograms were unremarkable for all subjects (n=12)., Gastric pH was assessed after H2 blockers and osmotic pump inhibitors had been discontinued for 5 days., Laboratory values reflecting gastrointestinal and hepatic function were essentially normal at the time of the NIH evaluation., One individual had undergone liver transplantation for presumed hepatocellular carcinoma., These transaminase levels normalized around age 4., Circulating proteins were normal or borderline low in all individuals., Total cholesterol, LDL, and triglycerides were also low., Hematologic profiles showed unremarkable complete blood counts., Seven of 11 individuals tested exhibited out of range elevations in antibody titers towards rubella and/or rubeola following MMR vaccination., Bone age was delayed in eight of the 11 subjects tested., Femoral bone density was low in all nine individuals who underwent DEXA scanning., All subjects had joint hypermobility except the older ones, who had contractures in both small and large joints., Complete skeletal surveys were performed on 11 individuals., Our prospective investigations into NGLY1-CDDG revealed several new discoveries associated with this disorder., Patients have global 
developmental delay, movement disorder, frequent seizure disorder, hypotonia, hypolacrima or alacrima, corneal disease, ptosis, lagophthalmous, strabismus, peripheral neuropathy and occasional diminished reflexes, hypohidrosis, auditory brainstem response abnormalities, abnormal brain imaging, scoliosis, acquired microcephaly, small hands and feet, dysmorphic features, constipation, elevated liver enzymes present only early in childhood, osteopenia, and hypocholesterolemia associated with NGLY1-CDDG., Patients considered for NGLY1-CDDG diagnosis present with developmental delay/intellectual disability, hyperkinetic movement disorder, hypolacrima, and a history of elevated transaminases during early childhood., Patients have auditory neural pathway dysfunction without peripheral hearing loss, resembling auditory neuropathy., Patients have low Motor Skills scores and Daily Living Skills scores, indicating significant motor involvement required for daily tasks., As functional impairment worsened and age increased in patients, NAA decreased, while choline, myo-inositol and creatine increased., Individuals with NGLY1-CDDG share phenotypic features of low cholesterol, hepatopathy, peripheral neuropathy, retinal and optic nerve abnormalities, seizures, developmental delay with socialization as a relative strength, and delayed bone age., NGLY1-CDDG is a progressive disorder., Individuals with NGLY1-CDDG have facial features that include upturned nasal tip, hypotonic facies, ptosis, brachycephaly, thinned facies, hollowed cheeks, and visible zygomatic arches., Right eye close-up showing conjunctival injection, limbal neovascularization, and corneal scarring in an NGLY1 patient.]","[WES was performed on a clinical basis., Patient 2 underwent Whole Exome Sequencing (WES) at Baylor College of Medicine Whole Genome Laboratory, which revealed a homozygous mutation in exon 9 of the NGLY1 gene, denoted as c.1370dupG or p.R458fs., Both parents were confirmed to be heterozygous 
carriers by Sanger sequencing., The mutation causes a frame shift in codon 458, causing insertion of 13 incorrect residues before a stop codon is introduced towards the end of exon 9., The mutation was not seen in any of 3321 other subjects sequenced at Duke, nor was it seen in 6503 subjects on the Exome Variant Server (NHLBI GO Exome Sequencing Project (ESP), Seattle, WA).]"
age,4.0,,0.5,21.0,20.0
gender,female,,male,mixed,female
ethnicity,Caucasian,,Caucasian,white,Caucasian
consanguinity,0.0,,0.0,0.0,1.0
mutations,"[c.1205_1207del(p.402_403del), c.1570C>(p. R524X)]",[c.1201A>T (p.R401X)],[c.1201A>T(p. R401X)/c.1201A>T(p.R401X)],[c.1201A>T (p.R401*)],"[c.1370dupG(p.R458fs), c.1370dupG(p.R458fs)]"
phenotypes,"[brain imaging abnormalities, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, ocular apraxia, alacrima/hypolacrima, chalazions, neonatal jaundice, elevated liver transaminases, liver storage or vacuolization, constipation, small hands/feet]","[global developmental delay, movement disorder, hypotonia, hypo- or alacrima, elevated liver transaminases, microcephaly, diminished reflexes, hepatocyte cytoplasmic storage material or vacuolization, seizures]","[IUGR, brain imaging abnormalities, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, seizures, alacrima/hypolacrima, neonatal jaundice, elevated liver transaminases, elevated AFP, liver storage or vacuolization, constipation, dysmorphic features, scoliosis]","[optic atrophy, retinal pigmentary changes/cone dystrophy, delayed bone age, joint hypermobility, lower than predicted resting energy expenditure]","[IUGR, microcephaly, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, ocular apraxia, alacrima/hypolacrima, corneal ulcerations/scarring, elevated liver transaminases, liver storage or vacuolization, constipation, scoliosis, peripheral neuropathy]"
sequencing_data,"{'sequencing_center': 'Baylor College of Medicine and Stanford University', 'sequencing_platform': 'Illumina HiSeq2000 and Complete Genomics', 'coverage': '80×'}",{},{},"{'sequencing_center': None, 'sequencing_platform': None, 'coverage': None}","{'sequencing_center': 'Baylor College of Medicine Whole Genome Laboratory', 'sequencing_platform': 'Clinical WES', 'coverage': 'Exon 9'}"


In [242]:
patient_records.pipe(lambda df: df[df['doc_id'] == 'PMC7477955']).sample(5, random_state=0).T

Unnamed: 0,15,20,13,19,11
doc_id,PMC7477955,PMC7477955,PMC7477955,PMC7477955,PMC7477955
patient_accession,4,9,2,8,11
extra_info,"[Patient 4 never had seizures. No information on medication, epileptiform discharge localization, background slowing, PDR, and Anterior/Posterior gradient was determined., Patient 4's nerve conduction study (NCS), electromyogram (EMG), and quantitative sweat analysis (QSWEAT) results were not determined (ND).]","[Has confirmed biallelic mutations in NGLY1, included in previous clinical publications., Distal upper and lower extremities have developed a flexion contracture at the hands and feet. Irregular action-induced jerking movements are elicited while reaching. Gaze is conjugate. Visual pursuit is normal and there is generalized arreflexia. Muscle wasting notable throughout, especially in the lower extremities likely secondary to peripheral neuropathy.]","[Individual #2 had slight cerebellar atrophy., Mild background slowing was present with a PDR of 5-6 Hertz., Anterior/Posterior gradient was poorly formed., Random multifocal irregular adventitious movements of all four extremities are induced by voluntary movements and or posture., Head and trunk titubation during the crawling position or during supported gait may reflect axial cerebellar dysfunction and/or associated negative motor phenomena (negative myoclonus) leading to sudden brief loss of postural muscle tone.]","[Sibling of patient 7, Individual 8 with a private cryptic splice site mutation (c.930C>T) and a private nonsense mutation (c.622C>T) exhibited relatively mild impairment in all domains.]","[Sibling of patient 3, has confirmed biallelic mutations in NGLY1, included in previous clinical publications., In one teenager (#11) follow-up imaging showed atrophy measurably worse after a 20-month interval (net loss of 34 cm3 relative to expected)., Schirmera: 0 ; 0, Ptosis / Lagophthalmous: + / +, Nystagmus / Strabismus: -, Cornea: Scarring; NV, Retina: No view (corneal scar), Optic Atrophy: +, Refraction: ND]"
age,5.0,16.0,4.0,10.0,18.0
gender,female,male,male,male,female
ethnicity,,,,,
consanguinity,,,,,
mutations,"[c.931G>A (p.E311K), c.730T>C (p.W244R)]","[c.347C>G (p.S116*), c.881+5G>T (IVS5+5G>T)]","[c.1201A>T (p.R401*), c.1201A>T (p.R401*)]","[c.622C>T (p.Q208*), c.930C>T (p.G310G (splice site))]","[c.1201A>T (p.R401*), c.1201A>T (p.R401*)]"
phenotypes,,[NGLY1 deficiency],,[NGLY1 deficiency],[NGLY1 deficiency]
sequencing_data,,,,,


In [203]:
patient_records.pipe(lambda df: df[df['doc_id'] == 'PMC7477955']).sample(5, random_state=0).T

Unnamed: 0,0,2,1,1.1,2.1
doc_id,PMC7477955,PMC7477955,PMC7477955,PMC7477955,PMC7477955
patient_accession,4,9,2,8,11
age,5.0,16.0,4.0,10.0,18.0
sex,female,male,male,male,female
ethnicity,,,,,
consanguinity,,,,,
mutations,"[c.931G>A (p.E311K), c.730T>C (p.W244R)]","[c.347C>G (p.S116*), c.881+5G>T (IVS5+5G>T)]","[c.1201A>T (p.R401*), c.1201A>T (p.R401*)]","[c.622C>T (p.Q208*), c.930C>T (p.G310G - splice site)]","[c.1201A>T (p.R401*), c.1201A>T (p.R401*)]"
phenotype,,,,,
brain_imaging_abnormalities,,,,,
global_developmental_delay,,,,,
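The sampled records above spell the same variant in several ways (`p.R401X`, `p.R401*`, with or without spaces, alleles separated by `, ` or `/`), which makes direct string comparison across studies unreliable. A minimal sketch of one way to normalize these, assuming the `mutations` field is available as a bracketed string as displayed, is to extract only the cDNA-level change. The `cdna_changes` helper and the regular expression are hypothetical and not part of `ngly1_gpt`:

```python
import re
from collections import Counter

# Hypothetical pattern: a cDNA-level change such as "c.1201A>T" or "c.881+5G>T",
# terminated by whitespace, "(", "/", "," or "]".
CDNA_PATTERN = re.compile(r"\bc\.[^\s(/,\]]+")


def cdna_changes(mutations):
    """Return the cDNA-level changes found in one `mutations` value."""
    return CDNA_PATTERN.findall(mutations or "")


# Values copied from the sampled records shown above.
examples = [
    "[c.1201A>T (p.R401*), c.1201A>T (p.R401*)]",
    "[c.1201A>T(p. R401X)/c.1201A>T(p.R401X)]",
    "[c.622C>T (p.Q208*), c.930C>T (p.G310G (splice site))]",
]

# Count alleles per cDNA change across the example records.
allele_counts = Counter(
    change for value in examples for change in cdna_changes(value)
)
print(allele_counts.most_common())
```

This ignores the protein-level annotation entirely, so it would not catch records where only the protein change is reported, and it keeps truncated values such as `c.1570C>` exactly as extracted.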


In [4]:
text = """
Neurologic Phenotype
Seven of twelve subjects had clinical seizures, and one had subclinical seizures recognized on previous EEG. Details regarding age of onset, seizure type and frequency, medications, and EEG findings are noted in Supplementary Table S. On overnight EEG, only one individual (#6) had active seizures recorded, but seven had multifocal epileptiform activity. There were no age or genotype differences between individuals having seizures and those without. In fact, in each sibling pair, one had seizures and the other did not.
All twelve individuals exhibited hyperkinetic movement disorders that included choreiform, athetoid, dystonic, myoclonic, action tremor, and dysmetric movements and were more severe in the younger individuals (Supplementary Movie S1).

Brain MRI and MRS
Eleven individuals underwent MRI and MRS of the brain. Clinical assessment of the images was not striking (Figure 3). Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination. Six of nine individuals had qualitatively-evident cerebral atrophy that ranged from slight to moderate. Four individuals (#1, #2, #6, #10) also had slight cerebellar atrophy. The atrophy tended to be greater in the older individuals (p=0.17, Supplementary Figure S3), and in one teenager (#11) follow-up imaging showed atrophy measurably worse after a 20-month interval (net loss of 34 cm3 relative to expected). Increased atrophy correlated with worsening of all functional measurements (Supplementary Figure S3), including IQ or DQ (p<0.03), Vineland assessments (p<0.03), and Nijmegen scores (p=0.01). Brain volume also directly correlated with CSF levels of 5-HIAA (p=0.03), tetrahydrobiopterin (p=0.02), and 5-HVA (p=0.06) (Supplementary Figure S3).
"""

In [10]:
text = """
Table 1
Clinical and molecular findings in NGLY1 deficiency
Patient 1	Patient 2	Patient 3	Patient 4	Patient 5	Patient 6	Patient 7	Patient 8	Totals
Age	5 y	20 y	4 y	2 y	d.5 y	d.9 m	3 y	16 y	
Gender	M	F	F	M	M	F	F	F	
Ethnicity	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	
Consanguinity	−	+	−	−	−	−	−	−	1/8
Mutations (maternal/paternal allele)	c.C1891del (p.Q631S)/c.1201A>T(p.R401X)	c.1370dupG(p.R458fs)/c.1370dupG(p.R458fs)	c.1205_1207del(p.402_403del)/c.1570C>(p.R524X)	c.1201A>T(p.R401X)c.1201A>T(pR401X)	c.1201A>T(p.R401X)/c.1201A>T(p.R401X)	c.1201A>T(p.R401X)/c.1201A>T(p.R401X)	c.1201A>Y(p.R401X)/c.1201A>T(p.R401X)	c1201A>T(p.R401X)/c.1201A>T(p.R401X)	
IUGR	−	+	−	+	+	+	−	+	5/8
Brain imaging abnormalities	+a	−b	+c	+d	+e	+f	−	+g	6/8
Global developmental delay	+	+	+	+	+	+	+	+	8/8
Microcephalyh	−	+	+	−	+	+	+	+	6/8
Hypotonia	+	+	+	+	+	+	+	+	8/8
Movement disorder	+	+	+	+	+	+	+	+	8/8
EEG abnormalities	+	+	+	+	+	+	−	+	7/8
↓DTRs	+	+	−	+	+	−	+	+	6/8
Seizures	+	−	−	+	+	−	−	+	4/8
Ocular apraxia	−	+	+	−	−	−	+	+	4/8
Alacrima/hypolacrima	+	+	+	+	+	−	+	+	7/8
Corneal ulcerations/scarring	+	+	−	+	−	−	−	+	4/8
Chalazions	+	−	+	+	−	−	+	−	4/8
Strabismus	−	−	+	+	−	−	+	+	5/8
ABR abnormalities	−	−	+	+	−	ND	ND	ND	2/5
Lactic acidosis	−	+	+	+	−i	ND	+	ND	4/6
Neonatal jaundice	+	−	+	+	−	−	+	−	4/8
Elevated liver transaminases	+	+	+	+	+	ND	+	−	6/7
Elevated AFP	+	−	−j	+	+	ND	ND	ND	3/5
Liver fibrosis	+	−	−	+	−	−	ND	ND	2/6
Liver storage or vacuolization	+	+	+	−	+k	+l	ND	ND	5/6
Constipation	+	+	+	+	+	−	+	+	7/8
Dysmorphic features	−	−	−	−	+m	+n	+o	+p	4/8
Scoliosis	−	+	−	+	+	−	−	+	4/8
Small hands/feet	+	−	+	+	−	−	−	+	4/8
Peripheral neuropathyq	+	+	ND	ND	+	ND	ND
"""

In [12]:
text = """
Patient 2, a now 20-year-old female, was born at 39 weeks of gestation via Cesarean section because of intrauterine growth retardation and an abnormal appearing placenta. At four months of age, hypotonia, developmental delay and elevated liver transaminases were noted. At approximately 4 years of age, a slight intention tremor and frequent involuntary movements of her neck, hands and arm were observed. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, as well as alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total care. She has very little expressive speech and communicates through an electronic speech-generating device. She continues to ambulate with a walker.
"""

In [76]:
response = llm.chat_completion("patient_extraction_1.txt", model="gpt-4", temperature=0.0, disease=utils.NGLY1_DEFICIENCY, text=text)
print(response)

INFO:ngly1_gpt.llm:Prompt (temperature=0.0, model=gpt-4):
Text will be provided that contains information from a published, biomedical research article about NGLY1 deficiency.  Extract details about the patients discussed in this text: 

--- BEGIN TEXT ---

Patient 2, a now 20-year-old female, was born at 39 weeks of gestation via Cesarean section because of intrauterine growth retardation and an abnormal appearing placenta. At four months of age, hypotonia, developmental delay and elevated liver transaminases were noted. At approximately 4 years of age, a slight intention tremor and frequent involuntary movements of her neck, hands and arm were observed. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, as well as alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total c

In [9]:
print(response)

patient_id|external_study|details
ALL|NA|Seven of twelve subjects had clinical seizures, and one had subclinical seizures recognized on previous EEG. On overnight EEG, only one individual had active seizures recorded, but seven had multifocal epileptiform activity. There were no age or genotype differences between individuals having seizures and those without. In each sibling pair, one had seizures and the other did not.
ALL|NA|All twelve individuals exhibited hyperkinetic movement disorders that included choreiform, athetoid, dystonic, myoclonic, action tremor, and dysmetric movements and were more severe in the younger individuals.
ALL|NA|Eleven individuals underwent MRI and MRS of the brain. Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination. Six of nine individuals had qualitatively-evident cerebral atrophy that ranged from slight to moderate.
#1|NA|Had slight cerebellar atrophy.
#2|NA|Had slight cerebel