### [POC 2]  NGLY1 deficiency relation extraction

This analysis builds on POC 1 by processing much more data (2 full-text papers totaling ~65k characters, versus ~3.5k characters in POC 1) with much finer control over which relations are identified. Outline:

1. Write a detailed [prompt](https://vscode.dev/github/eric-czech/ngly1-gpt/blob/main/ngly1_gpt/resources/prompts/relation_extraction_1.txt) that defines the entities to identify and the relations between them
2. Run this prompt on chunks of each paper
    - The two papers yield about 60 chunks in total
    - This takes ~40 minutes with GPT-4
    - The two papers used are in [data/extract](data/extract)
3. Show examples of text inputs and relation extraction outputs
4. Analyze frequencies of relations and entities
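
Step 2 depends on splitting each paper into pieces small enough for a single prompt. The sketch below shows one way this could be done; it is not the project's actual chunking code. The `chunk_text` helper and the `max_chars` budget are hypothetical, and the default budget is only a guess that roughly matches ~65k characters spread over ~60 chunks.

```python
def chunk_text(text: str, max_chars: int = 1100) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: a paragraph longer than max_chars is kept whole
    as its own chunk rather than being split mid-sentence.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the blank line used when paragraphs are rejoined
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole paragraphs keeps tables and figure captions intact within a chunk, which matters when the model is asked to extract relations from them.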

In [50]:
%load_ext autoreload
%autoreload 2
import sys
import pandas as pd
from ngly1_gpt import utils, llm, doc
import logging
logging.basicConfig(level=logging.INFO, stream=sys.stdout)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Extraction

For reference, here is the original prompt that defines the entities and relations for this analysis:

In [143]:
!find ./ | grep relation_extraction_1.txt | xargs cat

Text will be provided that contains information from a published, biomedical research article about {disease}. Extract subject-predicate-object relations from this text.

Types of subjects and objects to extract:
- assay
- biological process
- cell type
- cellular component
- chemical substance
- clinical trial
- disease
- drug
- gene
- gene family 
- genetic variant
- genotype
- macromolecular complex
- metabolite
- molecular activity
- organism
- organization
- pathway
- phenotype
- protein
- protein variant
- symptom
- tissue
- transcript variant
- <other>

Predicates to extract:
- affects risk for
- associated with
- capable of
- caused by
- causes
- colocalizes with
- contributes to
- correlated with
- decreases abundance of
- decreases activity of
- derives from
- disrupts
- enables
- exact match
- expressed in
- expresses
- genetically interacts with
- has affected feature
- has attribute
- has gene product
- has genotype
- has metabolite
- has output
- has participant
- has phe

The prompt was then used to extract relations from the papers with:

```bash
PYTHONPATH="$(pwd)" python ngly1_gpt/cli.py extract_relations --output-filename=relations.tsv 2>&1 | tee data/logs/extract_relations.log.txt
```

The logs for these extractions, which show all prompts and results, are in [data/logs](data/logs).
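
The logged responses appear to be pipe-delimited tables (the examples below render them with `column -s "|"`). Assuming that format, a response could be turned into a DataFrame roughly as follows. `parse_relations` is a hypothetical helper, not the parser used by `ngly1_gpt`; the column names are taken from the response header shown in the logs.

```python
import pandas as pd

# Column names as they appear in the logged response header
COLUMNS = ["subject", "subject_entity", "predicate", "object", "object_entity"]


def parse_relations(response: str) -> pd.DataFrame:
    """Parse a pipe-delimited response into one row per relation.

    Lines that do not have exactly five fields are skipped, which drops
    any commentary the model adds around the table.
    """
    rows = []
    for line in response.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != len(COLUMNS):
            continue
        if cells == COLUMNS:  # skip the header row if the model repeats it
            continue
        rows.append(cells)
    return pd.DataFrame(rows, columns=COLUMNS)
```

Skipping malformed lines silently is a design choice for exploration; for a production run it may be preferable to log them so that dropped relations can be reviewed.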

#### Examples

Here are a few select examples (from the logs) showing how various input chunks are transformed into structured outputs:

In [120]:
log = "data/logs/extract_relations_1.log.txt"

In [136]:
!echo "Prompt text:"
!cat $log | sed -n "5495,5518p"
!echo "Prompt response:"
!cat $log | sed -n "5541,5567p" | column -s "|" -t

Prompt text:
--- BEGIN TEXT ---
Table 1
Patient #	Age (y) / Sex	Allele #1a	Allele #2a	Functional Scores
Mutation (NM_018297.3)	Protein	Mutation (NM_018297.3)	Protein	Nijmegenb	IQ or DQ	Vinelandc
1	3 / M	c.953T>C	p.L318P	c.1169G>C	p.R390P	 14	ND	62
2	4 / M	c.1201A>T	p.R401*	c.1201A>T	p.R401*	33	25	52
3	4 / F	c.1201A>T	p. R401*	c.1201A>T	p.R401*	34	5	40
4	5 / F	c.931G>A	 p.E311K	c.730T>C	p.W244R	33	ND	ND
5	6 / M	c.1604G>A	p.W535*	 c.1910delT	p.L637*	25	8	43
6	7 / M	c.1891delC	p.Q631S	 c.1201A>T	p.R401*	36	5	37
7	8 / F	c.622C>T	p.Q208*	c.930C>T	p.G310G (splice site)	 10	74	98
8	10 / M	c.622C>T	p. Q208*	c.930C>T	p.G310G (splice site)	 9	81	94
9	16 / M	c.347C>G	p.S116*	c.881+5G>T	IVS5+5G>T	32	2	28
10	17 / F	c.1201A>T	p.R401*	c.1201A>T	p.R401*	25	16	42
11	18 / F	c.1201A>T	p.R401*	c.1201A>T	p.R401*	52	2	24
12	21 / F	c.1370dupG	p.R458Kfs*14	c.1370dupG	p.R458Kfs*14	ND	ND	37
Mean	9	-	-	-	-	28	23	51
SEM	2	-	-	-	-	4	10	7
Abbreviations: IQ, intellectual quotient; DQ, developmental quotient; ND, not

In [135]:
!echo "Prompt text:"
!cat $log | sed -n "6331,6335p"
!echo "Prompt response:"
!cat $log | sed -n "6358,6371p" | column -s "|" -t

Prompt text:
--- BEGIN TEXT ---
CSF Laboratory Results
Nine subjects underwent lumbar puncture. CSF total protein and albumin concentrations, as well as the CSF/serum albumin ratios, were low in nearly all individuals (Supplementary Table S3). There was no correlation between age and CSF protein or albumin levels. In the two oldest subjects, CSF 5-hydroxyindolacetic acid (5-HIAA) and homovanillic acid (HVA) were decreased, suggesting neuronal loss. Neopterin levels were normal but decreased with age, and CSF tetrahydrobiopterin (BH4) levels were below the lower limit of normal in all but one subject tested; however, there was no correlation with age. CSF 5-HIAA, HVA, and BH4 levels strongly and directly correlated with brain atrophy. CSF lactate and amino acid levels were essentially normal (Supplementary Table S3). CSF leucocyte counts (0–4; normal 0–5), glucose concentrations (54–72 mg/dL; normal 40–70), 3-O-methyldopa concentrations (12–28 nM; normal <150) and 5-methyltetrahydrofola

In [134]:
!echo "Prompt text:"
!cat $log | sed -n "8776,8780p"
!echo "Prompt response:"
!cat $log | sed -n "8803,8813p" | column -s "|" -t

Prompt text:
--- BEGIN TEXT ---
Figure 3: Brain MRI findings.
 White matter lesions were seen in two of eleven individuals. A shows multiple lesions in the periventricular white matter, some of which were confluent. B shows a single lesion in the periventricular white matter. Both A and B illustrate cerebral atrophy and were performed using T2-weighting. Sulci are slightly prominent in both A and B, and prominent ventricles are visible in A. C illustrates high position of the cerebellar tonsils, and large foramen of Magendie and cisterna magna.
--- END TEXT ---

Prompt response:
subject               subject_entity      predicate        object                        object_entity
Brain MRI findings    phenotype           has participant  White matter lesions          phenotype
White matter lesions  phenotype           located in       two of eleven individuals     organism
White matter lesions  phenotype           located in       periventricular white matter  cellular component
Cerebr

#### Analysis

In [154]:
relations = pd.read_csv(utils.get_paths().output_data / "relations_2.tsv", sep="\t")
relations

Unnamed: 0,subject,subject_entity,predicate,object,object_entity,doc_id,doc_filename
0,Mutations,genetic variant,cause,Inherited Disorder of the Endoplasmic Reticulum-Associated Degradation,disease,PMC4243708,PMC4243708.txt
1,NGLY1,gene,has gene product,Mutations,genetic variant,PMC4243708,PMC4243708.txt
2,ERAD pathway,pathway,responsible for,translocation of misfolded proteins,biological process,PMC4243708,PMC4243708.txt
3,ERAD pathway,pathway,has participant,proteasome,molecular activity,PMC4243708,PMC4243708.txt
4,N-glycanase 1,protein,involved in,ERAD pathway,pathway,PMC4243708,PMC4243708.txt
...,...,...,...,...,...,...,...
916,NGLY1 deficiency,disease,causes,conjunctival injection,symptom,PMC7477955,PMC7477955.txt
917,NGLY1 deficiency,disease,causes,limbal neovascularization,symptom,PMC7477955,PMC7477955.txt
918,NGLY1 deficiency,disease,causes,corneal scarring,symptom,PMC7477955,PMC7477955.txt
919,NGLY1 deficiency,disease,causes,severe dry eyes,symptom,PMC7477955,PMC7477955.txt


In [183]:
pd.set_option("display.max_colwidth", None, "display.max_rows", 400, "display.max_columns", None)

In [195]:
(
    relations
    .assign(subject_entity=lambda df: df["subject_entity"].str.replace("[<>]", "", regex=True))
    .assign(object_entity=lambda df: df["object_entity"].str.replace("[<>]", "", regex=True))
    [["subject_entity", "object_entity"]]
    .value_counts().unstack().fillna(0).astype(int)
    .pipe(lambda df: df.loc[
        df.sum(axis=1).sort_values(ascending=False).index,
        df.sum(axis=0).sort_values(ascending=False).index
    ])
    .style
    .background_gradient()
)

subject_entity (rows) / object_entity (columns),phenotype,disease,genetic variant,assay,biological process,other,organism,cellular component,gene,chemical substance,protein,molecular activity,cell type,symptom,macromolecular complex,protein variant,metabolite,pathway,organization,transcript variant,clinical trial,tissue,genotype
disease,336,23,17,38,6,0,6,1,0,11,0,0,4,6,0,0,0,1,0,0,0,0,0
organism,82,3,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
genetic variant,15,12,21,1,0,9,3,1,7,0,1,2,0,0,0,3,0,0,0,2,0,0,0
phenotype,10,22,3,1,3,3,4,4,3,3,1,3,1,4,0,0,4,0,0,0,0,0,0
gene,4,8,11,0,4,0,1,0,1,0,4,1,0,0,1,2,0,0,0,0,0,0,1
protein,0,0,0,0,11,0,0,4,2,0,4,2,1,0,3,0,0,3,0,0,0,0,0
assay,9,4,2,0,0,2,6,1,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0
organization,0,0,12,9,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0
biological process,7,8,1,0,0,0,1,0,1,0,0,1,1,0,0,0,0,1,0,0,0,2,0
cellular component,12,0,1,1,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0


In [177]:
relations.predicate.value_counts().head(100)

predicate
has phenotype                   329
associated with                 158
has participant                  61
causes                           55
part of                          33
has genotype                     32
caused by                        30
located in                       22
produces                         17
correlated with                  15
has attribute                    14
increases abundance of           12
decreases abundance of           11
has output                        9
location of                       8
involved in                       8
occurs in                         7
performed                         6
used in                           6
produced by                       5
not associated with               5
has gene product                  5
contributes to                    5
demonstrates                      4
derives from                      4
participates in                   3
treats                            3
not correlated wit
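
Some extracted predicates (for example `cause` in the first row of the relations table) do not match the wording in the prompt. One way to quantify this is to compare predicate counts against the list in the prompt file. The sketch below assumes the prompt keeps its current layout, a header line followed by `- item` bullets; `parse_prompt_list` and `off_prompt_counts` are hypothetical helpers.

```python
import pandas as pd


def parse_prompt_list(prompt: str, header: str) -> set[str]:
    """Collect the '- item' bullets that follow a header line in the prompt."""
    items: set[str] = set()
    in_section = False
    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped == header:
            in_section = True
        elif in_section and stripped.startswith("- "):
            items.add(stripped[2:].strip())
        elif in_section and stripped:
            break  # first non-blank, non-bullet line ends the section
    return items


def off_prompt_counts(predicates: pd.Series, allowed: set[str]) -> pd.Series:
    """Return counts of predicates that are not in the allowed set."""
    counts = predicates.value_counts()
    return counts[~counts.index.isin(list(allowed))]
```

With the real data this would be called as `off_prompt_counts(relations["predicate"], parse_prompt_list(prompt_text, "Predicates to extract:"))`, where `prompt_text` is the contents of `relation_extraction_1.txt`.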

In [176]:
def apply(df, fn):
    fn(df)
    return df

(
    relations
    .pipe(lambda df: df[
        (df['subject'] == utils.NGLY1_DEFICIENCY) | 
        (df['object'] == utils.NGLY1_DEFICIENCY)
    ])
    .pipe(lambda df: pd.DataFrame([
        (r['object'], r['object_entity'], r['predicate'])
        if r['subject'] == utils.NGLY1_DEFICIENCY else 
        (r['subject'], r['subject_entity'], r['predicate'])
        for _, r in df.iterrows()
    ], columns=['entity', 'entity_type', 'relation']))
    .pipe(apply, lambda df: display(df['relation'].value_counts()))
    .pipe(lambda df: df[
        df['relation'].str.contains('phenotype|disease|symptom', regex=True) 
    ])
    # .pipe(lambda df: df[df['predicate'].str.contains('cause')])
    .sort_values(['entity', 'relation'])
)

relation
has phenotype          225
associated with         71
caused by               18
causes                  16
has genotype            13
not associated with      4
has participant          3
is allele of             1
produces                 1
part of                  1
involved in              1
instance of              1
is marker for            1
manifestation of         1
Name: count, dtype: int64

Unnamed: 0,entity,entity_type,relation
49,Alacrima/hypolacrima,phenotype,has phenotype
105,Anal stenosis,phenotype,has phenotype
51,Chalazions,phenotype,has phenotype
245,Communication scores,phenotype,has phenotype
59,Constipation,phenotype,has phenotype
50,Corneal ulcerations/scarring,phenotype,has phenotype
246,Daily Living Skills scores,phenotype,has phenotype
291,Developmental delays in mastication,phenotype,has phenotype
60,Dysmorphic features,phenotype,has phenotype
293,Dystonic movements of the tongue,phenotype,has phenotype


In [157]:
(
    relations
    .pipe(lambda df: df[
        df['subject_entity'].str.contains('variant') | 
        df['object_entity'].str.contains('variant')
    ])
)

Unnamed: 0,subject,subject_entity,predicate,object,object_entity,doc_id,doc_filename
0,Mutations,genetic variant,cause,Inherited Disorder of the Endoplasmic Reticulum-Associated Degradation,disease,PMC4243708,PMC4243708.txt
1,NGLY1,gene,has gene product,Mutations,genetic variant,PMC4243708,PMC4243708.txt
14,c.1201A>T (p.R401X),genetic variant,is allele of,NGLY1 deficiency,disease,PMC4243708,PMC4243708.txt
18,c.1201A>T (p.R401X),genetic variant,associated with,severe disease,disease,PMC4243708,PMC4243708.txt
33,NGLY1 deficiency,disease,caused by,mutations in NGLY1,genetic variant,PMC4243708,PMC4243708.txt
43,mutations in NGLY1,genetic variant,causes,NGLY1 deficiency,disease,PMC4243708,PMC4243708.txt
44,mutations in NGLY1,genetic variant,causes,impaired cytosolic degradation,molecular activity,PMC4243708,PMC4243708.txt
45,mutations in NGLY1,genetic variant,causes,abnormal accumulation of misfolded glycoproteins,phenotype,PMC4243708,PMC4243708.txt
46,mutations in NGLY1,genetic variant,causes,accumulation of an amorphous unidentified substance,phenotype,PMC4243708,PMC4243708.txt
47,mutations in NGLY1,genetic variant,causes,congenital disorder of glycosylation,phenotype,PMC4243708,PMC4243708.txt
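
Rows 33 and 43 above state the same fact in opposite directions (`caused by` vs. `causes`). A possible clean-up step, sketched here under the assumption that such predicate pairs are true inverses, rewrites one direction into the other and then drops duplicates. The `INVERSES` mapping and `canonicalize` function are hypothetical and cover only two pairs seen in the predicate counts.

```python
import pandas as pd

# Hypothetical mapping from an inverse predicate to its canonical form
INVERSES = {"caused by": "causes", "produced by": "produces"}

KEY = ["subject", "subject_entity", "predicate", "object", "object_entity"]


def canonicalize(relations: pd.DataFrame) -> pd.DataFrame:
    """Flip inverse predicates to their canonical direction and deduplicate."""
    df = relations.copy()
    flip = df["predicate"].isin(list(INVERSES))
    # Swap subject/object (and their entity types) for the flipped rows;
    # to_numpy() prevents pandas from realigning the values by column label
    df.loc[flip, ["subject", "object"]] = (
        df.loc[flip, ["object", "subject"]].to_numpy()
    )
    df.loc[flip, ["subject_entity", "object_entity"]] = (
        df.loc[flip, ["object_entity", "subject_entity"]].to_numpy()
    )
    df.loc[flip, "predicate"] = df.loc[flip, "predicate"].map(INVERSES)
    return df.drop_duplicates(subset=KEY)
```

Deduplicating on the five relation columns ignores `doc_id`, so the same relation found in both papers collapses to one row; add `doc_id` to `KEY` if per-paper provenance should be kept.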
