### [POC 3]  NGLY1 deficiency patient extraction

This analysis attempts to extract as many details as possible about NGLY1 deficiency patients from two full-text papers on the subject. Outline:

1. Define a minimal prompt to extract per-study patient identifiers and free-text descriptions of associated information
2. Summarize the results for each patient, across studies, in a standard schema
3. Analyze the patients

In [2]:
%load_ext autoreload
%autoreload 2
import io
import sys
import pandas as pd
import matplotlib.pyplot as plt
from ngly1_gpt import utils, llm, doc
import logging
logging.basicConfig(level=logging.INFO, stream=sys.stdout)
pd.set_option("display.max_colwidth", None, "display.max_rows", 400, "display.max_columns", None)

In [3]:
prompt = (utils.get_paths().prompts / "patient_extraction_1.txt").read_text()
print(prompt)

Text will be provided that contains information from a published, biomedical research article about {disease}.  Extract details about the patients discussed in this text. 

Requirements:

- Exclude any patients where context dictates that they do NOT have {disease}, e.g. when {disease} patients are compared to similar patients with other diseases.
- Extract as much information as possible about each patient including associated genotypes, phenotypes, physical or behavioral traits, demographics, lab measurements, treatments, family histories or anything else of clinical and/or biological relevance.
- Extract this information in CSV format with the following headers:
  - `patient_id`: Identifying information for the patient within the context of the article; typically an integer or anonymized id like "Patient 1". If some information applies to ALL patients in a study and the context does not make it possible to enumerate the patient ids, report only the value "ALL"
  - `external_study`: 

##### Execution

The prompt was run for all paper chunks via a command like:

```bash
PYTHONPATH="$(pwd)" python ngly1_gpt/cli.py extract_patients --output-filename=patients_1.tsv 2>&1 | tee data/logs/extract_patients_1.log.txt
```

The logs for these extractions, showing all prompts and results, are in [data/logs](data/logs).

#### Examples

#### Analysis

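The original data had a minor error: for a few chunks of text (7 rows across both papers), the model returned comma-delimited rather than pipe-delimited CSV content, so the whole record landed in a single fused column: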

In [43]:
(
    pd.read_csv(utils.get_paths().output_data / "patients_1.tsv", sep="\t")
    .pipe(utils.apply, lambda df: df.info())
    # Keep only the malformed rows, i.e. those where the fused comma-delimited column is populated
    .dropna(subset=['patient_id,external_study,details'])
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 6 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   patient_id                         173 non-null    object
 1   external_study                     14 non-null     object
 2   details                            173 non-null    object
 3   doc_id                             180 non-null    object
 4   doc_filename                       180 non-null    object
 5   patient_id,external_study,details  7 non-null      object
dtypes: object(6)
memory usage: 8.6+ KB


Unnamed: 0,patient_id,external_study,details,doc_id,doc_filename,"patient_id,external_study,details"
62,,,,PMC4243708,PMC4243708.txt,"ALL,24,Patients with autosomal recessive mutations in ERLIN2 present profound intellectual disability, developmental regression and multiple contractures. Despite the severity of the intellectual disability and neuromuscular findings, the results of brain imaging, electromyography and muscle biopsy appeared normal in the initial erlin2-deficient patients."
63,,,,PMC4243708,PMC4243708.txt,"ALL,26,Another family was found to have a homozygous null mutation in ERLIN2, with affected individuals presenting with a hereditary spastic paraplegia phenotype."
83,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Clinical studies were designed to detail the phenotypic features of NGLY1-CDDG. Blood, urine, cerebral spinal fluid (CSF), lymphoblasts, and primary dermal fibroblasts were collected, analyzed, and stored. Studies included brain magnetic resonance imaging and spectroscopy (MRI and MRS, supplementary methods), routine and overnight electroencephalograms (EEGs) with a limited montage performed during a sleep study, electromyogram (EMG, supplementary methods) and nerve conduction studies (NCS, supplementary methods), indirect calorimetry, awake and sedated eye examination with Schirmer II testing, optical coherence tomography scans and electroretinography, behavioral determination of pure tone thresholds, tympanometry, distortion product otoacoustic emissions, auditory brainstem evoked potentials (ABR), quantitative sweat analysis autonomic testing (QSWEAT, supplementary methods), gastric aspiration, swallow study, skeletal survey, bone age, dual X-ray absorptiometry (DEXA), abdominal ultrasound, vibration controlled transient elastography (Fibroscan)12, echocardiogram, and electrocardiogram. Consultations included clinical neurology, audiology, nutrition, ophthalmology, hepatology, growth, puberty and hormonal studies, allergy and immunology, genetic counseling, physiatry, and speech, occupational, and physical therapy."
84,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Eleven individuals underwent developmental psychological evaluations, consisting of at least the Vineland Adaptive Behavior Scales, 2nd edition. Cognitive function was assessed with testing specific for age and developmental level that provided either an intelligence quotient (IQ) or developmental quotient (DQ) score."
85,,,,PMC7477955,PMC7477955.txt,"ALL,NA,The Nijmegen pediatric CDG rating scale, a measure of clinical disease progression developed for CDG, was applied to all affected individuals younger than 18 years."
131,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Compared to the reference population, N-acetylaspartylglutamate + N-acetylaspartate (NAA) was lower than normal in the left centrum semiovale (LCSO) (p=0.004), the midline parietal grey matter (PGM) (p=0.02), and superior cerebellar vermis (SVERM) (p<0.0001). There was a deficit of glutamine + glutamate + gamma-aminobutyric acid (Glx) in the PGM (p=0.03), LCSO (p=0.01), and pons (p=0.0002). Choline was higher than expected for age only in the LCSO (p=0.0097), and myo-inositol was higher than expected for age in the pons (p=0.002). Multiple correlations between these MRS-measured metabolites and age, functional assessments, brain volume, and neurotransmitters in the CSF were found. The general trend showed that the differences noted above became more pronounced with increasing age, worsening function, and lower brain volume. MRS metabolite measurements did not correlate with total CSF protein, CSF albumin, or CSF/serum albumin ratio. There was a weak correlation (p=0.09) between atrophy and total CSF protein, but not CSF albumin or CSF/serum albumin ratio."
169,,,,PMC7477955,PMC7477955.txt,"ALL,NA,Strong correlation between brain atrophy on MRI and functional assessments suggests that loss of neurons contributes to the functional impairment. The atrophy also correlated with CSF metabolites (BH4, 5-HIAA, HVA), which are known to be lower when there is damage to neurotransmitter producing neurons. This suggests that these biochemical abnormalities may be secondary to brain atrophy."
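
The fused rows above could in principle be repaired rather than set aside. The following is only a minimal sketch, assuming that the fused value always has the form `patient_id,external_study,details` and that only the first two commas act as delimiters (the free-text details contain commas of their own). The helper `split_fused_row` is hypothetical and is not part of `ngly1_gpt`.

```python
from typing import Optional


def split_fused_row(fused: str) -> dict:
    """Split a comma-delimited 'patient_id,external_study,details' string.

    Assumption: neither patient_id nor external_study contains a comma, so
    splitting on the first two commas recovers the three fields.
    """
    patient_id, external_study, details = fused.split(",", 2)
    study: Optional[str] = external_study.strip()
    # The fused rows shown above use "NA" to mark a missing external study
    if study == "NA":
        study = None
    return {
        "patient_id": patient_id.strip(),
        "external_study": study,
        "details": details.strip(),
    }


row = split_fused_row(
    "ALL,NA,Eleven individuals underwent developmental psychological evaluations, "
    "consisting of at least the Vineland Adaptive Behavior Scales, 2nd edition."
)
print(row["patient_id"], row["external_study"])
print(row["details"])
```

Applied to the fused column with `Series.apply`, the resulting dictionaries could be used to fill in the three empty columns for those rows.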


A second run, after tweaking the prompt with stronger language about which delimiter to use, did not have that problem:

In [127]:
#patients = pd.read_csv(utils.get_paths().output_data / "patients_2.1.tsv", sep="\t")
patients = pd.read_csv(utils.get_paths().output_data / "patients_3.tsv", sep="\t")
patients.info()
patients.sample(n=15, random_state=0)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179 entries, 0 to 178
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   patient_id         179 non-null    object
 1   patient_accession  179 non-null    object
 2   external_study     5 non-null      object
 3   details            179 non-null    object
 4   doc_id             179 non-null    object
 5   doc_filename       179 non-null    object
dtypes: object(6)
memory usage: 8.5+ KB


Unnamed: 0,patient_id,patient_accession,external_study,details,doc_id,doc_filename
137,ALL,ALL,,"Gastric pH was assessed after H2 blockers and osmotic pump inhibitors had been discontinued for 5 days. Gastric pH was appropriately acidic in all individuals tested except one, whose pH was 7.25.",PMC7477955,PMC7477955.txt
7,ALL,ALL,,The most common deleterious allele was the nonsense mutation c.1201A>T (p.R401X).,PMC4243708,PMC4243708.txt
124,ALL,ALL,,"Awake and sedated ophthalmic examinations were performed on 11 study participants. Observed conditions include Lagophthalmous, ptosis, exotropia and/or esotropia, corneal neovascularization, pannus formation or scarring, optic nerve pallor or atrophy, retinal.",PMC7477955,PMC7477955.txt
71,ALL,ALL,,"Twelve individuals from ten families with confirmed biallelic mutations in NGLY1 were admitted to the NIH Clinical Center. All individuals (six female; six male) were white and ranged from 2.5– 21.3 years of age. We identified 13 different mutations: five missense, five nonsense, two splice site, and one frameshift mutation. The most common mutation was c.1201A>T (p.R401*), occurring in seven alleles. The mutations were widely dispersed along the gene with no obvious hotspot. Only four of the mutations lay within the catalytic domain.",PMC7477955,PMC7477955.txt
135,SOME,SOME,,Five individuals had frequent periodic limb movements,PMC7477955,PMC7477955.txt
169,ALL,ALL,,"Patients have auditory neural pathway dysfunction without peripheral hearing loss, resembling auditory neuropathy. They experience difficulty hearing in the presence of background noise and benefit from quiet listening environments. If hypohidrosis is detected, preventative measures (hydration, ventilation, etc.) against situations that cause dangerous core body hyperthermia can be taken. It is important to aggressively manage hypo-lacrima with artificial tears and bland ointment to prevent secondary complications that can impact vision. They have disordered mastication for solids but a functional ability to swallow, suggesting the need for oral motor and swallowing therapies to facilitate better control and chewing maturation.",PMC7477955,PMC7477955.txt
56,4,4,,"Patient 4 has a homozygous state of the c.1201A>T (p.R401X) mutation. This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years.",PMC4243708,PMC4243708.txt
143,ALL,ALL,,Laboratory values reflecting gastrointestinal and hepatic function were essentially normal at the time of the NIH evaluation.,PMC7477955,PMC7477955.txt
161,ALL,ALL,,Bone age was delayed in eight of the 11 subjects tested without any consistent abnormalities of the endocrine system: the somatotropic axis and thyroid function were normal in all studied individuals.,PMC7477955,PMC7477955.txt
107,ALL,ALL,,"Eleven individuals underwent MRI and MRS of the brain. Clinical assessment of the images was not striking. Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination.",PMC7477955,PMC7477955.txt
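
The delimiter problem seen in the first run can also be detected programmatically before results are written out. This is a rough sketch, assuming each raw model response is pipe-delimited text with the four columns `patient_id`, `patient_accession`, `external_study` and `details`, and that the details text never contains a pipe character. The function name and header constant are hypothetical and are not taken from `ngly1_gpt`.

```python
# Assumed column layout of a single model response (hypothetical)
EXPECTED_HEADER = ["patient_id", "patient_accession", "external_study", "details"]


def find_malformed_lines(response: str, delimiter: str = "|") -> list:
    """Return indexes of non-empty lines whose field count differs from the header's."""
    lines = [line for line in response.splitlines() if line.strip()]
    expected = len(EXPECTED_HEADER)
    return [
        i for i, line in enumerate(lines)
        if len(line.split(delimiter)) != expected
    ]


good = "patient_id|patient_accession|external_study|details\nALL|ALL||Bone age was delayed."
bad = "patient_id,patient_accession,external_study,details\nALL,ALL,,Bone age was delayed."

print(find_malformed_lines(good))  # no malformed lines expected
print(find_malformed_lines(bad))   # every line is flagged
```

A non-empty result could be used to retry the chunk or to log it for manual review.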


In [128]:
(
    patients[['doc_id', 'patient_id', 'patient_accession']]
    .value_counts()
    .reset_index()
)

Unnamed: 0,doc_id,patient_id,patient_accession,count
0,PMC7477955,ALL,ALL,57
1,PMC7477955,SOME,SOME,27
2,PMC4243708,ALL,ALL,13
3,PMC4243708,3,3,7
4,PMC4243708,5,5,6
5,PMC4243708,6,6,6
6,PMC4243708,SOME,SOME,6
7,PMC4243708,1,1,5
8,PMC4243708,7,7,5
9,PMC4243708,2,2,5


In [131]:
patient_details = (
    patients
    ['details']
    .dropna()
    .drop_duplicates()
)
patient_details.head().values

array(['All patients had global developmental delay, a movement disorder, and hypotonia.',
       '7 out of 8 patients had hypo- or alacrima.',
       '6 out of 7 patients had elevated liver transaminases.',
       '6 out of 8 patients had microcephaly.',
       '6 out of 8 patients had diminished reflexes.'], dtype=object)

In [132]:
patient_details_list = "\n".join("- " + patient_details.sample(frac=.5, random_state=0))
print(f'Num tokens: {len(doc.tokens(patient_details_list, "gpt-4"))}')

Num tokens: 6034


In [133]:
patient_schema = llm.create_patient_schema(details=patient_details_list, temperature=0.8)
print(patient_schema)

INFO:ngly1_gpt.llm:Prompt (temperature=0.8, model=gpt-4):
The following list of details contains specific characteristics of rare disease patients:

--- BEGIN DETAILS LIST ---
- Femoral bone density was low in all nine individuals who underwent DEXA scanning (mean, SEM z-scores for 8 patients < 21 years adjacent to the growth plate = −3, 0.4; metaphysis-diaphysis = −2.2, 0.6, and diaphysis = −1.8, 0.5).
- Patient 8 has a homozygous state of the c.1201A>T (p.R401X) mutation. This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years.
- Two individuals had combined obstructive and central sleep apnea
- 12 individuals ages 2 to 21 years with confirmed, biallelic, pathogenic NGLY1 mutations. Clinical features include optic atrophy and retinal pigmentary changes/cone dystrophy, delayed bone age, joint hypermobility, and lower than predicted resting energy expenditure. Laboratory findings include low CSF total protein and albumin

INFO:ngly1_gpt.llm:Response:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/patient.schema.json",
  "title": "Patient",
  "description": "A rare disease patient",
  "type": "object",
  "properties": {
    "doc_id": {
      "type": "string",
      "category": "identifiers"
    },
    "patient_accession": {
      "type": "string",
      "category": "identifiers"
    },
    "mutations": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "category": "genetics"
    },
    "genotype": {
      "type": "string",
      "category": "genetics"
    },
    "phenotype": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "category": "physical_traits"
    },
    "age": {
      "type": "integer",
      "category": "demographics"
    },
    "sex": {
      "type": "string",
      "category": "demographics"
    },
    "ethnicity": {
      "type": "string",
      "category": "demographics"
    },
   

In [158]:
patient_details_records = (
    patients
    # Drop rows that describe an unspecified subset of patients
    .pipe(lambda df: df[~df['patient_id'].isin(['SOME'])])
    [['doc_id', 'patient_accession', 'details']]
    .pipe(utils.apply, lambda df: display(df[['doc_id', 'patient_accession']].value_counts().reset_index().style.set_caption('Identifiers before filtering')))
    # Normalize accessions such as "#6" to "6"
    .assign(patient_accession=lambda df: df['patient_accession'].str.replace('#', ''))
    # Keep only numeric accessions and the study-wide "ALL" records
    .pipe(lambda df: df[
        pd.to_numeric(df['patient_accession'], errors='coerce').notnull() | 
        (df['patient_accession'] == 'ALL')
    ])
    .sort_values(['doc_id', 'patient_accession'])
    .pipe(utils.apply, lambda df: display(df[['doc_id', 'patient_accession']].value_counts().reset_index().style.set_caption('Identifiers after filtering')))
    # Collapse the unique details for each patient into one numbered string: "1) ... 2) ..."
    .groupby(['doc_id', 'patient_accession'])['details'].unique()
    .reset_index()
    .assign(details=lambda df: df['details'].apply(lambda v: " ".join([f"{i+1}) {e}" for i, e in enumerate(v)])))
)
patient_details_records.info()
patient_details_records.head()

Unnamed: 0,doc_id,patient_accession,count
0,PMC7477955,ALL,57
1,PMC4243708,ALL,13
2,PMC4243708,3,7
3,PMC4243708,5,6
4,PMC4243708,6,6
5,PMC4243708,2,6
6,PMC4243708,1,5
7,PMC4243708,4,5
8,PMC4243708,7,5
9,PMC4243708,8,4


Unnamed: 0,doc_id,patient_accession,count
0,PMC7477955,ALL,57
1,PMC4243708,ALL,13
2,PMC4243708,3,7
3,PMC4243708,5,6
4,PMC4243708,6,6
5,PMC4243708,2,6
6,PMC7477955,6,5
7,PMC4243708,1,5
8,PMC4243708,7,5
9,PMC4243708,4,5


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   doc_id             22 non-null     object
 1   patient_accession  22 non-null     object
 2   details            22 non-null     object
dtypes: object(3)
memory usage: 656.0+ bytes


Unnamed: 0,doc_id,patient_accession,details
0,PMC4243708,1,"1) Exome sequencing was performed at Duke University using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit. 2) 5 year old, male, Caucasian, no consanguinity, mutations: c.C1891del (p.Q631S)/c.1201A>T(p.R401X), no IUGR, brain imaging abnormalities, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, seizures, alacrima/hypolacrima, corneal ulcerations/scarring, chalazions, elevated liver transaminases, elevated AFP, liver fibrosis, liver storage or vacuolization, constipation, small hands/feet, peripheral neuropathy 3) 5-year-old male, presented in the neonatal period with involuntary movements, including athetosis involving the trunk and extremities and constant lip smacking and pursing while awake. Pregnancy and birth history were unremarkable. He had mild neonatal jaundice requiring phototherapy, but otherwise appeared well. Global developmental delay, hypotonia, intractable multifocal epilepsy, consisting of myoclonic seizures, drop attacks, and staring or tonic episodes, and liver disease were present in infancy. He has cortical vision loss and congenital alacrima and corneal ulcerations with scarring were noted at age 4 years. Now, at age 5 years, the movement disorder has not abated and he has central hypotonia and global developmental delay. 4) Whole exome sequencing (WES) performed as part of a research protocol detected putative knock out mutations forming a compound heterozygote genotype in the NGLY1 gene (Maternal frameshift: Q631S. at cDNA level: C1891del in transcript ENST00000280700. EXON 12. Paternal nonsense: 3_25750426_A, which causes a nonsense mutation, R401X, in transcript ENST00000280700. At the cDNA level this is A1201T EXON 8) 5) Patient 1 has a compound heterozygous state of the c.1201A>T (p.R401X) mutation."
1,PMC4243708,2,"1) WES was performed on a clinical basis at Baylor College of Medicine Whole Genome Laboratory, Houston, Texas. 2) 20 year old, female, Caucasian, consanguinity, mutations: c.1370dupG(p.R458fs)/c.1370dupG(p.R458fs), IUGR, no brain imaging abnormalities, global developmental delay, microcephaly, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, ocular apraxia, alacrima/hypolacrima, corneal ulcerations/scarring, elevated liver transaminases, liver storage or vacuolization, constipation, scoliosis, peripheral neuropathy 3) 20-year-old female, born at 39 weeks of gestation via Cesarean section due to intrauterine growth retardation and an abnormal appearing placenta. Noted hypotonia, developmental delay and elevated liver transaminases at four months of age. At approximately 4 years of age, observed to have a slight intention tremor and frequent involuntary movements of her neck, hands and arm. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total care. She has very little expressive speech and communicates through an electronic speech-generating device. She continues to ambulate with a walker. 4) Did not carry the c.1201A>T (p.R401X) mutation and their clinical phenotype was relatively mild in comparison 5) WES (Baylor College of Medicine Whole Genome Laboratory) revealed a homozygous mutation in exon 9 of the NGLY1 gene denoted as c.1370dupG or p.R458fs. Both parents were confirmed to be heterozygous carriers by Sanger sequencing. The mutation causes a frame shift in codon 458, causing insertion of 13 incorrect residues before a stop codon is introduced towards the end of exon 9. 
The mutation was not seen in any of 3321 other subjects sequenced at Duke, nor was it seen in 6503 subjects on the Exome Variant Server (NHLBI GO Exome Sequencing Project (ESP), Seattle, WA. 6) Patient 2 does not carry the c.1201A>T (p.R401X) mutation and appears to have a relatively mild phenotype."
2,PMC4243708,3,"1) Sequenced at Stanford University using both Illumina HiSeq2000 and Complete Genomics platforms. Variants in Illumina-sequenced reads were called using both the Hugeseq and Real Time Genomics pipelines and Complete Genomics variants were identified by their own variant callers. DNA was also capture-sequenced at Baylor College of Medicine using a commercially developed capture reagent (VCRome2). Sequence data were generated on an Illumina HiSeq2000 producing an average coverage of 80× with >90% of targeted bases at 20× coverage or higher. 2) 4 year old, female, Caucasian, no consanguinity, mutations: c.1205_1207del(p.402_403del)/c.1570C>(p. R524X), no IUGR, brain imaging abnormalities, global developmental delay, microcephaly, hypotonia, movement disorder, EEG abnormalities, ocular apraxia, alacrima/hypolacrima, chalazions, strabismus, ABR abnormalities, lactic acidosis, neonatal jaundice, elevated liver transaminases, liver storage or vacuolization, constipation, small hands/feet 3) 4-year-old girl, born via Cesarean section at term for a non-reassuring fetal heart tracing, monitored in the NICU due to poor feeding and lethargy. Pregnancy was complicated by a positive second trimester screen noting increased risk for Smith-Lemli Opitz syndrome (SLOS) and trisomy 18, but karyotype on amniocentesis was normal. As a neonate, she had hyperbilirubinemia treated with phototherapy, elevated liver transaminases and transient thrombocytopenia. In infancy, she had global developmental delay, acquired microcephaly, bilateral exotropia, hypotonia, constipation, and intermittent mild lactic acidemia. At age 1 year, she did not make tears when crying, but had adequate tear production to keep her eyes moist. She had intermittent chalazions, but no corneal scarring. She developed staring spells, lasting up to 15 seconds, at approximately age 1 year; these episodes occur about once daily and can be interrupted by gentle contact. 
By age 17 months she had developed an extrapyramidal movement disorder consisting of asynchronous myoclonic jerks of the limbs and shoulders and subtle choreoathetotic movements of the hands and fingers. At 4 years she can ambulate unassisted, although her gait is unsteady, and communicates with vocalizations, gestures and use of a speech-generating device. 4) Did not carry the c.1201A>T (p.R401X) mutation and their clinical phenotype was relatively mild in comparison 5) WES and whole-genome sequencing were performed using research protocols at Baylor College of Medicine and Stanford University. Mutations in NGLY1 that followed a compound heterozygous inheritance pattern were identified. A stop gain mutation caused by a G>A mutation at position 3:25761670 (hg19) resulting in p.R542X was identified in both the father and daughter. A 3 base pair in-frame deletion TCC> beginning at position 3:25775416 (hg19) was identified in both the mother and daughter. An additional G>T mutation resulting in a heterozygous SMP at position 3:25777564 was identified in the daughter, mother and father. This mutation was not previously observed in 1000 genomes and is a coding region; however, it is present in heterozygous form in all three individuals. 6) A moderate reduction in mitochondrial DNA content was identified in a liver sample 7) Patient 3 does not carry the c.1201A>T (p.R401X) mutation and appears to have a relatively mild phenotype."
3,PMC4243708,4,"1) Sanger sequencing of NGLY1 was performed at Duke University and results were confirmed by a clinical laboratory (GeneDx, Gaithersburg, Maryland). 2) 2 year old, male, Caucasian, no consanguinity, mutations: c.1201A>T(p. R401X)c.1201A>T(pR401X), IUGR, brain imaging abnormalities, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, seizures, alacrima/hypolacrima, corneal ulcerations/scarring, chalazions, strabismus, ABR abnormalities, lactic acidosis, neonatal jaundice, elevated liver transaminases, elevated AFP, liver fibrosis, constipation, dysmorphic features, scoliosis, small hands/feet 3) 2-year-old boy, delivered by Cesarean section at 38 weeks of gestation after fetal distress was noted on cardiotocography. Pregnancy history was positive for intrauterine growth restriction (IUGR) and oligohydramnios. He had mild hyperbilirubinemia, but otherwise his neonatal course was unremarkable and he was discharged on day of life three. Intermittent head flexion was noted at 6 months, and an EEG at 8 months showed generalized poly-spike discharges. Soon thereafter, mild tonic seizures with head and body flexion started, and evolved to single, symmetric spasms with bilateral arm extension. Involuntary movements of the upper extremities were also noted at this time. In addition, global developmental delay, bilateral ptosis, abnormal tear production, elevated liver transaminases (3 to 4 times upper limit of normal), and constipation were noted in infancy. He has had recurrent episodes of keratoconjunctivitis and poor lid closure during sleep with resultant corneal scarring. 4) Sanger sequencing (Duke University) detected a homozygous nonsense mutation, p.R401X, at position 3:25775422 (hg19) in transcript ENST00000280700. At the cDNA level this is c.1201A>T in exon 8 of NGLY1. This finding was confirmed in a CLIA- certified laboratory (GeneDx). 5) Patient 4 has a homozygous state of the c.1201A>T (p.R401X) mutation. 
This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years."
4,PMC4243708,5,"1) Exome sequencing was performed at University of British Columbia using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit. 2) 0.5 year old, male, Caucasian, no consanguinity, mutations: c.1201A>T(p. R401X)/c.1201A>T(p.R401X), IUGR, brain imaging abnormalities, global developmental delay, microcephaly, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, seizures, alacrima/hypolacrima, elevated liver transaminases, elevated AFP, liver storage or vacuolization, constipation, dysmorphic features, scoliosis 3) Patient 5, a boy, died at the age of 5 years. He was born at term following a pregnancy that was complicated by a positive second trimester serum screening for trisomy 18 and Smith- Lemli-Opitz syndrome (SLOS) (AFP 1.97 MoM, uE3 0.24 MoM and hCG 0.48 MoM). Cytogenetic analysis of cultured amniocytes showed a normal male karyotype and measurement of 7-dehydrocholesterol in amniotic fluid excluded SLOS. He was delivered by Cesarean section at 36 weeks of gestation due to concerns for IUGR and a non-reassuring stress test. He had mild flexion contractures of both knees, but had an uneventful neonatal period. He had global developmental delay and constant movements of his arms and legs since early infancy and developed head bobbing at 7 months. At 8 months, liver transaminase elevations (approximately 1.5 times the upper limit of normal) were noted, and the elevations persisted until age 3 1⁄2 years. His reflexes appeared normal in infancy, but were diminished by age 2 years and at 38 months could no longer be elicited. During the second year of life, he was noted to have dry eyes that were treated with lubricant drops at bedtime, and microcephaly was present by 16 months. At 2 1⁄2 years, he developed myoclonic seizures that became intractable despite numerous therapeutic trials. Between the ages of 10 months and five years, he showed slow developmental progress, but regressed during the last year. 
He died at age 5 years following a viral illness and a prolonged seizure. 4) The variant in NGLY1, single nucleotide variant T> A at position 3:25775422 (hg19), which was called to be homozygous, was present in the mother’s exome as a heterozygous call. This base pair substitution causes a nonsense mutation, R401X. The NGLY1 variant was independently validated by Sanger sequencing in the patient and both parents. 5) Significant brain disease was noted on autopsy, found to have pathological changes consistent with hypoxic-ischemic encephalopathy (HIE) 6) Patient 5 has a homozygous state of the c.1201A>T (p.R401X) mutation. This mutation is associated with a severe phenotype, with outcomes ranging from early demise to living into teenage years."


In [163]:
patient_details_records_csv = (
    patient_details_records
    # .pipe(lambda df: df[df['doc_id'] == 'PMC4243708'])
    .pipe(lambda df: df[df['doc_id'] == 'PMC7477955'])
    .to_csv(sep='|', index=False)
)
print(len(doc.tokens(patient_details_records_csv, "gpt-4")))
print(patient_details_records_csv[:250])

6026
doc_id|patient_accession|details
PMC7477955|1|1) 3 year old male with NGLY1 deficiency. Allele 1 mutation: c.953T>C (p.L318P). Allele 2 mutation: c.1169G>C (p.R390P). Nijmegen score: 14. Vineland score: 62. IQ or DQ not determined. 2) Had slight cere


In [164]:
patient_records_json = llm.extract_patient_json(details=patient_details_records_csv, schema=patient_schema, temperature=0.8)
print(patient_records_json)

INFO:ngly1_gpt.llm:Prompt (temperature=0.8, model=gpt-4):
The following table in pipe-delimited CSV format contains details about individual rare disease patients:

--- BEGIN PATIENT DETAILS ---
doc_id|patient_accession|details
PMC7477955|1|1) 3 year old male with NGLY1 deficiency. Allele 1 mutation: c.953T>C (p.L318P). Allele 2 mutation: c.1169G>C (p.R390P). Nijmegen score: 14. Vineland score: 62. IQ or DQ not determined. 2) Had slight cerebellar atrophy.
PMC7477955|10|1) 17 year old female with NGLY1 deficiency. Allele 1 mutation: c.1201A>T (p.R401*). Allele 2 mutation: c.1201A>T (p.R401*). Nijmegen score: 25. IQ or DQ: 16. Vineland score: 42. 2) Had slight cerebellar atrophy.
PMC7477955|11|1) Included in previous clinical publications. Has a sibling also with NGLY1 deficiency (#3). 2) 18 year old female with NGLY1 deficiency. Allele 1 mutation: c.1201A>T (p.R401*). Allele 2 mutation: c.1201A>T (p.R401*). Nijmegen score: 52. IQ or DQ: 2. Vineland score: 24. 3) One teenager had follow

In [162]:
patient_records_json = llm.extract_patient_json(details=patient_details_records_csv, schema=patient_schema, temperature=0.8)
print(patient_records_json)

INFO:ngly1_gpt.llm:Prompt (temperature=0.8, model=gpt-4):
The following table in pipe-delimited CSV format contains details about individual rare disease patients:

--- BEGIN PATIENT DETAILS ---
doc_id|patient_accession|details
PMC4243708|1|1) Exome sequencing was performed at Duke University using the Illumina HiSeq2000 platform and the Agilent SureSelect Human All Exon 50 Mb Kit. 2) 5 year old, male, Caucasian, no consanguinity, mutations: c.C1891del (p.Q631S)/c.1201A>T(p.R401X), no IUGR, brain imaging abnormalities, global developmental delay, hypotonia, movement disorder, EEG abnormalities, decreased DTRs, seizures, alacrima/hypolacrima, corneal ulcerations/scarring, chalazions, elevated liver transaminases, elevated AFP, liver fibrosis, liver storage or vacuolization, constipation, small hands/feet, peripheral neuropathy 3) 5-year-old male, presented in the neonatal period with involuntary movements, including athetosis involving the trunk and extremities and constant lip smacking

INFO:ngly1_gpt.llm:Response:
{"doc_id": "PMC4243708", "patient_accession": "1", "mutations": ["c.C1891del (p.Q631S)", "c.1201A>T(p.R401X)"], "genotype": "Compound heterozygote in the NGLY1 gene", "phenotype": ["Brain imaging abnormalities", "Global developmental delay", "Hypotonia", "Movement disorder", "EEG abnormalities", "Decreased DTRs", "Seizures", "Alacrima/hypolacrima", "Corneal ulcerations/scarring", "Chalazions", "Elevated liver transaminases", "Elevated AFP", "Liver fibrosis", "Liver storage or vacuolization", "Constipation", "Small hands/feet", "Peripheral neuropathy"], "age": 5, "sex": "male", "ethnicity": "Caucasian", "consanguinity": false, "birth_history": "Normal birth history", "developmental_delay": true, "intellectual_disability": false, "seizures": true, "iq_scores": [], "vineland_scores": [], "nijmegen_scores": [], "lab_measurements": {"csf_protein": null, "csf_albumin": null, "csf_lactate": null, "transaminase_levels": null, "cholesterol_levels": null}, "treatment

In [4]:
text = """
Neurologic Phenotype
Seven of twelve subjects had clinical seizures, and one had subclinical seizures recognized on previous EEG. Details regarding age of onset, seizure type and frequency, medications, and EEG findings are noted in Supplementary Table S. On overnight EEG, only one individual (#6) had active seizures recorded, but seven had multifocal epileptiform activity. There were no age or genotype differences between individuals having seizures and those without. In fact, in each sibling pair, one had seizures and the other did not.
All twelve individuals exhibited hyperkinetic movement disorders that included choreiform, athetoid, dystonic, myoclonic, action tremor, and dysmetric movements and were more severe in the younger individuals (Supplementary Movie S1).

Brain MRI and MRS
Eleven individuals underwent MRI and MRS of the brain. Clinical assessment of the images was not striking (Figure 3). Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination. Six of nine individuals had qualitatively-evident cerebral atrophy that ranged from slight to moderate. Four individuals (#1, #2, #6, #10) also had slight cerebellar atrophy. The atrophy tended to be greater in the older individuals (p=0.17, Supplementary Figure S3), and in one teenager (#11) follow-up imaging showed atrophy measurably worse after a 20-month interval (net loss of 34 cm3 relative to expected). Increased atrophy correlated with worsening of all functional measurements (Supplementary Figure S3), including IQ or DQ (p<0.03), Vineland assessments (p<0.03), and Nijmegen scores (p=0.01). Brain volume also directly correlated with CSF levels of 5-HIAA (p=0.03), tetrahydrobiopterin (p=0.02), and 5-HVA (p=0.06) (Supplementary Figure S3).
"""

In [10]:
text = """
Table 1
Clinical and molecular findings in NGLY1 deficiency
Patient 1	Patient 2	Patient 3	Patient 4	Patient 5	Patient 6	Patient 7	Patient 8	Totals
Age	5 y	20 y	4 y	2 y	d.5 y	d.9 m	3 y	16 y	
Gender	M	F	F	M	M	F	F	F	
Ethnicity	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	Caucasian	
Consanguinity	−	+	−	−	−	−	−	−	1/8
Mutations (maternal/paternal allele)	c.C1891del (p.Q631S)/c.1201A>T(p.R401X)	c.1370dupG(p.R458fs)/c.1370dupG(p.R458fs)	c.1205_1207del(p.402_403del)/c.1570C>(p.R524X)	c.1201A>T(p.R401X)c.1201A>T(pR401X)	c.1201A>T(p.R401X)/c.1201A>T(p.R401X)	c.1201A>T(p.R401X)/c.1201A>T(p.R401X)	c.1201A>Y(p.R401X)/c.1201A>T(p.R401X)	c1201A>T(p.R401X)/c.1201A>T(p.R401X)	
IUGR	−	+	−	+	+	+	−	+	5/8
Brain imaging abnormalities	+a	−b	+c	+d	+e	+f	−	+g	6/8
Global developmental delay	+	+	+	+	+	+	+	+	8/8
Microcephalyh	−	+	+	−	+	+	+	+	6/8
Hypotonia	+	+	+	+	+	+	+	+	8/8
Movement disorder	+	+	+	+	+	+	+	+	8/8
EEG abnormalities	+	+	+	+	+	+	−	+	7/8
↓DTRs	+	+	−	+	+	−	+	+	6/8
Seizures	+	−	−	+	+	−	−	+	4/8
Ocular apraxia	−	+	+	−	−	−	+	+	4/8
Alacrima/hypolacrima	+	+	+	+	+	−	+	+	7/8
Corneal ulcerations/scarring	+	+	−	+	−	−	−	+	4/8
Chalazions	+	−	+	+	−	−	+	−	4/8
Strabismus	−	−	+	+	−	−	+	+	5/8
ABR abnormalities	−	−	+	+	−	ND	ND	ND	2/5
Lactic acidosis	−	+	+	+	−i	ND	+	ND	4/6
Neonatal jaundice	+	−	+	+	−	−	+	−	4/8
Elevated liver transaminases	+	+	+	+	+	ND	+	−	6/7
Elevated AFP	+	−	−j	+	+	ND	ND	ND	3/5
Liver fibrosis	+	−	−	+	−	−	ND	ND	2/6
Liver storage or vacuolization	+	+	+	−	+k	+l	ND	ND	5/6
Constipation	+	+	+	+	+	−	+	+	7/8
Dysmorphic features	−	−	−	−	+m	+n	+o	+p	4/8
Scoliosis	−	+	−	+	+	−	−	+	4/8
Small hands/feet	+	−	+	+	−	−	−	+	4/8
Peripheral neuropathyq	+	+	ND	ND	+	ND	ND
"""

In [12]:
text = """
Patient 2, a now 20-year-old female, was born at 39 weeks of gestation via Cesarean section because of intrauterine growth retardation and an abnormal appearing placenta. At four months of age, hypotonia, developmental delay and elevated liver transaminases were noted. At approximately 4 years of age, a slight intention tremor and frequent involuntary movements of her neck, hands and arm were observed. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, as well as alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total care. She has very little expressive speech and communicates through an electronic speech-generating device. She continues to ambulate with a walker.
"""

In [76]:
# Run the extraction prompt on `text`; temperature=0.0 keeps the output as repeatable as possible
response = llm.chat_completion("patient_extraction_1.txt", model="gpt-4", temperature=0.0, disease=utils.NGLY1_DEFICIENCY, text=text)
print(response)

INFO:ngly1_gpt.llm:Prompt (temperature=0.0, model=gpt-4):
Text will be provided that contains information from a published, biomedical research article about NGLY1 deficiency.  Extract details about the patients discussed in this text: 

--- BEGIN TEXT ---

Patient 2, a now 20-year-old female, was born at 39 weeks of gestation via Cesarean section because of intrauterine growth retardation and an abnormal appearing placenta. At four months of age, hypotonia, developmental delay and elevated liver transaminases were noted. At approximately 4 years of age, a slight intention tremor and frequent involuntary movements of her neck, hands and arm were observed. At 5 years of age, she was noted to have ocular apraxia, distal tapering of hands and feet, and diminished deep tendon reflexes. She has cortical vision impairment, as well as alacrima and dry eyes that require lubrication, but has not developed corneal scarring. Presently, she has marked intellectual disabilities and requires total c
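
The logged prompt shows the `{disease}` placeholder replaced and the input text placed after a `--- BEGIN TEXT ---` marker. The sketch below shows one way such a template could be rendered with `str.format`. The template wording is abbreviated, and `render_prompt` and the `--- END TEXT ---` marker are assumptions; the actual logic in `llm.chat_completion` may differ.

```python
# Abbreviated, hypothetical template; the real one lives in patient_extraction_1.txt
TEMPLATE = (
    "Text will be provided that contains information from a published, "
    "biomedical research article about {disease}.\n\n"
    "--- BEGIN TEXT ---\n"
    "{text}\n"
    "--- END TEXT ---"  # assumed closing marker, not shown in the log
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill named placeholders; raises KeyError if a placeholder has no value."""
    return template.format(**variables)


rendered = render_prompt(
    TEMPLATE,
    disease="NGLY1 deficiency",
    text="Patient 2, a now 20-year-old female, was born at 39 weeks of gestation.",
)
print(rendered)
```

Note that `str.format` treats every brace pair as a placeholder, so a template containing literal braces (for example a JSON example) would need them doubled as `{{` and `}}`.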

In [9]:
print(response)

patient_id|external_study|details
ALL|NA|Seven of twelve subjects had clinical seizures, and one had subclinical seizures recognized on previous EEG. On overnight EEG, only one individual had active seizures recorded, but seven had multifocal epileptiform activity. There were no age or genotype differences between individuals having seizures and those without. In each sibling pair, one had seizures and the other did not.
ALL|NA|All twelve individuals exhibited hyperkinetic movement disorders that included choreiform, athetoid, dystonic, myoclonic, action tremor, and dysmetric movements and were more severe in the younger individuals.
ALL|NA|Eleven individuals underwent MRI and MRS of the brain. Delayed myelination was present in three of the four youngest individuals, but all the older individuals had complete myelination. Six of nine individuals had qualitatively-evident cerebral atrophy that ranged from slight to moderate.
#1|NA|Had slight cerebellar atrophy.
#2|NA|Had slight cerebel
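
The extraction response above is pipe-delimited with a `patient_id|external_study|details` header. The sketch below parses such a response into records using only the standard library. The helper `parse_patient_rows` is hypothetical, and the last two sample rows are invented to show edge cases. It assumes free text appears only in the last column, so any extra `|` characters are kept as part of `details`.

```python
from typing import Dict, List


def parse_patient_rows(response: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited response into a list of row dictionaries."""
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    header = lines[0].split("|")
    rows = []
    for ln in lines[1:]:
        # Split at most len(header) - 1 times so pipes inside the last field survive
        parts = ln.split("|", len(header) - 1)
        if len(parts) != len(header):
            continue  # skip rows with missing fields
        rows.append(dict(zip(header, parts)))
    return rows


sample_response = """patient_id|external_study|details
ALL|NA|Eleven individuals underwent MRI and MRS of the brain.
#1|NA|Had slight cerebellar atrophy.
#2|NA
#3|NA|Hypothetical detail with a | character inside.
"""

rows = parse_patient_rows(sample_response)
cohort_level = [r for r in rows if r["patient_id"] == "ALL"]
per_patient = [r for r in rows if r["patient_id"] != "ALL"]
print(len(rows), len(cohort_level), len(per_patient))
```

Separating the `ALL` rows from the per-patient rows makes it easier to decide later whether cohort-level statements should be attached to every patient or kept apart.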