add in NER stats from SemMedDB to the semmeddb2 API #606

Closed
andrewsu opened this issue Mar 30, 2023 · 7 comments

@andrewsu
Member

Now that we've created the new https://biothings.ncats.io/semmeddb2 API as part of #569 to investigate filtering strategies to improve signal/noise, let's also join in information about the Named Entity Recognition (NER) from the PREDICATION_AUX table (https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR/dbinfo.html):

image

I can really only imagine us using the SUBJECT_TEXT and SUBJECT_SCORE values (plus the corresponding OBJECT_ values), so let's focus on those. We can add these values to the predication object at the same level as the predication_id:

image
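
A minimal sketch of the join described above, not the actual semmeddb2 parser code. It assumes PREDICATION_AUX rows are available as dicts keyed by column name and that each row carries a PREDICATION_ID column linking it to a predication; the function name `attach_ner_stats` and the lower-case output field names are illustrative choices.

```python
from typing import Any, Dict, Iterable, List

# Assumed mapping from PREDICATION_AUX column names to output field names.
NER_COLUMNS = {
    "SUBJECT_TEXT": "subject_text",
    "SUBJECT_SCORE": "subject_score",
    "OBJECT_TEXT": "object_text",
    "OBJECT_SCORE": "object_score",
}


def attach_ner_stats(
    predications: Iterable[Dict[str, Any]],
    aux_rows: Iterable[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return copies of the predications with NER fields added next to predication_id."""
    # Index the auxiliary rows by predication ID for constant-time lookup.
    aux_by_id = {row["PREDICATION_ID"]: row for row in aux_rows}
    merged = []
    for pred in predications:
        enriched = dict(pred)  # copy, so the input is left untouched
        aux = aux_by_id.get(pred["predication_id"])
        if aux is not None:
            for column, field in NER_COLUMNS.items():
                enriched[field] = aux[column]
        # Predications without a matching auxiliary row are kept without NER fields.
        merged.append(enriched)
    return merged
```

Keeping predications that have no auxiliary row, rather than dropping them, is a design choice of this sketch.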

@colleenXu
Collaborator

FYI: I also saw some possible predicate info and sentence-predication confidence info:

I don't see any "score" for the relationship selection ("predicate"), but there's also a SCORE column in the ENTITY table (that seems to relate to individual sentences).

There are columns in the PREDICATION_AUX table to identify the position of the string that was used to pick the predicate (PREDICATE_START_INDEX and PREDICATE_END_INDEX).

originally posted here

@erikyao

erikyao commented May 24, 2023

Example#1: Plasmids STIMULATES Dihydrofolate Reductase

{
  "_id": "C0032136-STIMULATES-C0039667",
  "predicate": "STIMULATES",
  "predication": [
    {
      "predication_id": 84692642,
      "pmid": 360038,
      "sentence_id": 43528009,
      "sentence": "The R factor induced enzyme was partially purified from a strain carrying a multicopy recombinant plasmid into which the 1770 bp fragment was inserted and which induced high levels of dihydrofolate  reductase.",
      "subject_text": "plasmid",
      "subject_score": 773,
      "object_text": "dihydrofolate  reductase",
      "object_score": 1000
    },
    {
      "predication_id": 102407545,
      "pmid": 6407900,
      "sentence_id": 74212980,
      "sentence": "A plasmid mutation has been identified that increases expression of mouse DHFR more than ten-fold.",
      "subject_text": "plasmid",
      "subject_score": 888,
      "object_text": "DHFR",
      "object_score": 824
    }
  ],
  "pmid_count": 2,
  "predication_count": 2,
  "subject": {
    "umls": "C0032136",
    "name": "Plasmids",
    "semantic_type_abbreviation": "bacs",
    "semantic_type_name": "Biologically Active Substance",
    "novelty": 1
  },
  "object": {
    "umls": "C0039667",
    "name": "Dihydrofolate Reductase",
    "semantic_type_abbreviation": "gngm",
    "semantic_type_name": "Gene or Genome",
    "novelty": 1
  }
}

Example#2: CDK3 gene INTERACTS_WITH activating transcription factor 1

{
  "_id": "C1332734-INTERACTS_WITH-C0214635",
  "predicate": "INTERACTS_WITH",
  "predication": [
    {
      "predication_id": 125412171,
      "pmid": 18794154,
      "sentence_id": 120609124,
      "sentence": "Cyclin-dependent kinase 3-mediated activating transcription factor 1 phosphorylation enhances cell transformation.",
      "subject_text": "Cyclin-dependent kinase 3",
      "subject_score": 849,
      "object_text": "activating transcription factor 1",
      "object_score": 849
    },
    {
      "predication_id": 125412566,
      "pmid": 18794154,
      "sentence_id": 120609128,
      "sentence": "Furthermore, we found that cdk3 phosphorylates activating transcription factor 1 (ATF1) at serine 63 and enhances the transactivation and transcriptional activities of ATF1.",
      "subject_text": "cdk3",
      "subject_score": 1000,
      "object_text": "activating transcription factor 1",
      "object_score": 1000
    }
  ],
  "pmid_count": 1,
  "predication_count": 2,
  "subject": {
    "umls": "C1332734",
    "name": "CDK3 gene",
    "semantic_type_abbreviation": [
      "aapp",
      "gngm"
    ],
    "semantic_type_name": [
      "Amino Acid, Peptide, or Protein",
      "Gene or Genome"
    ],
    "novelty": 1
  },
  "object": {
    "umls": "C0214635",
    "name": "activating transcription factor 1",
    "semantic_type_abbreviation": "aapp",
    "semantic_type_name": "Amino Acid, Peptide, or Protein",
    "novelty": 1
  }
}

Example#3: C1333570-CAUSES-C0023882

Questionable NER data:

  1. The text "little" should not be connected to the concept Little's Disease
  2. The text "PSMs" (plant secondary metabolites) should not be connected to the concept FOLH1 gene
    • FOLH1 gene is also known as PSMA (Prostate-Specific Membrane Antigen), and the similarity between "PSMs" and "PSMA" could be the cause of the mistake.
{
  "_id": "C1333570-CAUSES-C0023882",
  "predicate": "CAUSES",
  "predication": [
    {
      "predication_id": 182378913,
      "pmid": 31580494,
      "sentence_id": 342370606,
      "sentence": "Ambient temperature has been shown to alter liver function in rodents and the toxicity of some PSMs, but little is known about the physiological and nutritional consequences of consuming PSMs at different ambient temperatures.",
      "subject_text": "PSMs",
      "subject_score": 827,
      "object_text": "little",
      "object_score": 1000
    }
  ],
  "pmid_count": 1,
  "predication_count": 1,
  "subject": {
    "umls": "C1333570",
    "name": "FOLH1 gene",
    "semantic_type_abbreviation": "gngm",
    "semantic_type_name": "Gene or Genome",
    "novelty": 1
  },
  "object": {
    "umls": "C0023882",
    "name": "Little's Disease",
    "semantic_type_abbreviation": "dsyn",
    "semantic_type_name": "Disease or Syndrome",
    "novelty": 1
  }
}

@erikyao

erikyao commented May 24, 2023

Statistics of all NER stats

STAT               subject_score  object_score
TOTAL              122611719      122611719
MIN                0              0
MAX                1000           1000
MEAN               927.23         922.86
MEDIAN             916            901
2.5TH PERCENTILE   766            759
25TH PERCENTILE    888            872
50TH PERCENTILE    916            901
75TH PERCENTILE    1000           1000
97.5TH PERCENTILE  1000           1000

So NER shows high confidence in the connection between the entity texts and concepts. A threshold around 800 seems weak.
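
To see what a score cutoff would do to the examples in this thread, here is a small sketch. The helper `filter_by_ner_score` is hypothetical, and requiring both the subject and the object score to reach the cutoff is an assumption rather than an agreed filtering rule.

```python
from typing import Any, Dict, List


def filter_by_ner_score(
    predications: List[Dict[str, Any]], min_score: int
) -> List[Dict[str, Any]]:
    """Keep predications whose subject and object scores both reach min_score."""
    return [
        p
        for p in predications
        # A missing score is treated as failing the cutoff (assumption).
        if p.get("subject_score", -1) >= min_score
        and p.get("object_score", -1) >= min_score
    ]


# Scores copied from Example#1 and Example#3 in this thread.
example_1 = [
    {"predication_id": 84692642, "subject_score": 773, "object_score": 1000},
    {"predication_id": 102407545, "subject_score": 888, "object_score": 824},
]
example_3 = [
    {"predication_id": 182378913, "subject_score": 827, "object_score": 1000},
]

kept_1 = [p["predication_id"] for p in filter_by_ner_score(example_1, 800)]
kept_3 = [p["predication_id"] for p in filter_by_ner_score(example_3, 800)]
```

With a cutoff of 800, the first Example#1 predication (subject_score 773) is dropped, while the questionable Example#3 predication passes, since its scores are 827 and 1000.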

@erikyao

erikyao commented May 24, 2023

Statistics of predication list lengths (i.e. predication_count values in existing documents)

STAT               predication_count
TOTAL              24481939
MIN                1
MAX                64451
MEAN               3.65
MEDIAN             1
2.5TH PERCENTILE   1
25TH PERCENTILE    1
50TH PERCENTILE    1
75TH PERCENTILE    2
97.5TH PERCENTILE  15
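
A summary in this shape can be produced with the Python standard library alone. The sketch below is illustrative: the thread does not say how the percentiles were computed, so the nearest-rank method used here is an assumption, and `percentile` and `summarize` are hypothetical helper names.

```python
import math
from statistics import mean, median
from typing import Dict, List, Sequence, Union

Number = Union[int, float]


def percentile(sorted_values: Sequence[Number], pct: float) -> Number:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(values: List[Number]) -> Dict[str, Number]:
    """Compute the same rows as the table above for a list of numbers."""
    ordered = sorted(values)
    summary: Dict[str, Number] = {
        "TOTAL": len(ordered),  # number of values, not their sum
        "MIN": ordered[0],
        "MAX": ordered[-1],
        "MEAN": round(mean(ordered), 2),
        "MEDIAN": median(ordered),
    }
    for pct in (2.5, 25, 50, 75, 97.5):
        summary[f"{pct:g}TH PERCENTILE"] = percentile(ordered, pct)
    return summary
```

For the full dataset, sorting 24 million values in memory is feasible but not free; a streaming or database-side aggregation may be preferable.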

The document with the max predication_count is C0023884-PART_OF-C0034693 (Liver PART_OF Rattus norvegicus), the same document that caused the BSONObjectTooLarge error in MongoDB.

If we apply a threshold of 1000 to the length of predication lists, 4,293 documents out of 24,481,939 (i.e. 0.0175%) will be affected.
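
The share quoted above can be checked with a few lines. In this sketch, `count_oversized` is a hypothetical helper, and treating "affected" as strictly more than 1000 predications is an assumption, since the thread does not say whether the boundary is inclusive.

```python
from typing import Iterable


def count_oversized(predication_counts: Iterable[int], max_length: int = 1000) -> int:
    """Count documents whose predication list is longer than max_length."""
    return sum(1 for count in predication_counts if count > max_length)


def affected_share_percent(affected: int, total: int) -> float:
    """Share of affected documents, as a percentage of all documents."""
    return affected / total * 100


# Figures reported in this thread.
share = affected_share_percent(4_293, 24_481_939)
print(f"{share:.4f}%")
```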

@erikyao

erikyao commented May 25, 2023

14 predication records, involving 10 semmeddb2 documents, have NO NER stats from the source data. They are:

predication_id  doc _id
201544459       C0871124-PROCESS_OF-C0008059
203800327       C0039185-STIMULATES-C2936529
201986825       C1709820-PART_OF-C0029045
192912334       N/A
196519528       N/A
196519532       N/A
198555342       C4321237-PROCESS_OF-C0008059
199170923       C0429845-USES-C1709820
192883183       C0020538-COEXISTS_WITH-C0871124
201986826       C0018207-LOCATION_OF-6098
203209176       C1422771-compared_with-C1418270
203209174       C0206131-LOCATION_OF-C1418270
205429882       N/A
201469895       C1817666-PROCESS_OF-C0024432
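
A rough sketch of how such gaps could be spotted in finished documents. It assumes documents shaped like the examples earlier in this thread, and `predications_missing_ner` is a hypothetical helper. Because it only scans documents that exist, it would not surface the rows listed as N/A above, which were found in the source data.

```python
from typing import Any, Dict, Iterable, List, Tuple

NER_FIELDS = ("subject_text", "subject_score", "object_text", "object_score")


def predications_missing_ner(
    docs: Iterable[Dict[str, Any]],
) -> List[Tuple[int, str]]:
    """Return (predication_id, doc _id) pairs where any NER field is absent or None."""
    missing = []
    for doc in docs:
        for pred in doc.get("predication", []):
            if any(pred.get(field) is None for field in NER_FIELDS):
                missing.append((pred["predication_id"], doc["_id"]))
    return missing
```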

@erikyao

erikyao commented May 26, 2023

@colleenXu @andrewsu semmeddb2 updated with NER stats in 4 new fields:

{
    "_id": "...",
    "predication": [
       {
            "object_score": <int>,
            "object_text": <str>,
            "subject_score": <int>,
            "subject_text": <str>
       },
       # omitted
    ]
}

@andrewsu
Member Author

Super, this looks great, thanks!
