# Finding Entities in Multiple Sclerosis Research

This notebook is less about building more features for Gregory and more about seeing what I can do with spaCy and Named Entity Recognition (NER). It is a Jupyter notebook because I want to put the format to proper use and because it makes it easier to achieve three goals.

1. Show others what my thought process was.
2. Make it easier to ask questions of people who know more than I do.
3. Discover which NER model works best for analysing Multiple Sclerosis (MS) articles.

## Data sources

https://api.gregory-ms.com/articles/all

## Initialize modules and get data

In [1]:
import os
from pathlib import Path
import scispacy
import spacy 
import pandas as pd
import requests
from spacy import displacy

In [2]:

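from io import StringIO

url = 'https://api.gregory-ms.com/articles/all'

# Fetch the full article list and fail early on an HTTP error
response = requests.get(url, timeout=30)
response.raise_for_status()

# read_json accepts a file-like object; newer pandas versions deprecate literal JSON input
df = pd.read_json(StringIO(response.text))

print(df)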

      article_id                                              title  \
0           1138  The Relationship Between Walking Speed and the...   
1           1139  Microglial changes associated with meningeal i...   
2           6317  Inadequate Vaccine Responses in Children With ...   
3           6324  Recent advances in clinical trials targeting t...   
4           6327        TDP-43 Pathology in Alzheimer&#39;s Disease   
...          ...                                                ...   
7981       12696  Does the Serum Expression Level of High-Mobili...   
7982       14071  The microbiota restrains neurodegenerative mic...   
7983       14074  Autologous treatment for ALS with implication ...   
7984       14538  Timed Up &amp; Go (TUG) With Cognitive and Man...   
7985       14804  Chromatin accessibility and transcriptome inte...   

                                                summary  \
0     &lt;div&gt;&lt;p style&#x3D;&quot;color: #4aa5...   
1     &lt;div&gt;&lt;p style&

In [3]:
summary = df.loc[4201, 'summary']
print(summary)

<h2>Abstract</h2><h3>Background</h3><p>Differentiation of the demyelinating disorders of the CNS seems challenging in practice. Conus medullaris, the cone-shaped end of the spinal cord, is more involved in anti-MOG patients based on preliminary studies, a possibly helpful detail in its differentiation. Nevertheless, the evidence is still limited and the underlying cause is unclear and undiscussed in previous studies.</p><h3>Objective</h3><p>To contribute to preliminary studies by comparing conus involvement among patients with MS, anti-AQP4, and anti-MOG diseases using larger sample size.</p><h3>Methods</h3><p>More than a thousand MS, anti-AQP4, and anti-MOG patients were followed up for a maximum of five years, scanned for conus medullaris involvement. Data regarding each cohort were then analyzed and compared using statistical methods.</p><h3>Results</h3><p>The rate of conus medullaris involvement was significantly higher in anti-MOG patietns (OR = 27.109, <i>P</i> < 0.001), followed

The summaries contain HTML (some of it entity-escaped), so we need to clean the text before running NER on it.

In [4]:
import html
from bs4 import BeautifulSoup

# Decode HTML entities such as &lt; and &amp;
summary = html.unescape(summary)

# Parse the markup and drop <script> and <style> elements
soup = BeautifulSoup(summary, features="html.parser")
for tag in soup(["script", "style"]):
    tag.extract()

# Keep only the text content
summary = soup.get_text()
print(summary)



AbstractBackgroundDifferentiation of the demyelinating disorders of the CNS seems challenging in practice. Conus medullaris, the cone-shaped end of the spinal cord, is more involved in anti-MOG patients based on preliminary studies, a possibly helpful detail in its differentiation. Nevertheless, the evidence is still limited and the underlying cause is unclear and undiscussed in previous studies.ObjectiveTo contribute to preliminary studies by comparing conus involvement among patients with MS, anti-AQP4, and anti-MOG diseases using larger sample size.MethodsMore than a thousand MS, anti-AQP4, and anti-MOG patients were followed up for a maximum of five years, scanned for conus medullaris involvement. Data regarding each cohort were then analyzed and compared using statistical methods.ResultsThe rate of conus medullaris involvement was significantly higher in anti-MOG patietns (OR = 27.109, P < 0.001), followed by anti-AQP4 (OR = 4.944, P = 0.004), and MS patients (OR = reference). Sur
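
In the cleaned output above, headings run straight into the following text ("AbstractBackgroundDifferentiation"), because the text nodes are joined with nothing between them. BeautifulSoup's `get_text` accepts a `separator` argument that should help with this. The same idea is sketched below with only the standard library, so it runs without extra packages. `TextExtractor` and `html_to_text` are hypothetical names for illustration, not part of the notebook, and the sample string is made up.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(raw, separator=" "):
    """Unescape entities, drop tags, and join text nodes with `separator`."""
    parser = TextExtractor()
    parser.feed(html.unescape(raw))
    parser.close()
    return separator.join(parser.chunks)


# Made-up sample in the same escaped style as the API summaries
sample = "&lt;h3&gt;Background&lt;/h3&gt;&lt;p&gt;Conus medullaris is involved.&lt;/p&gt;"
print(html_to_text(sample))  # prints: Background Conus medullaris is involved.
```

One caveat of joining on a separator: inline tags such as `<i>` also split the text, so a tag in the middle of a word would introduce a space there.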

Let's look at the output of `en_core_sci_md` as an NER model. It identifies entity spans, but it does not say what type of entity each one is.

In [5]:
nlp = spacy.load('en_core_sci_md')
doc = nlp(summary)
displacy_image = displacy.render(doc, style='ent')

The same test with `en_ner_jnlpba_md` as the NER model shows no entities at all.

[scispaCy](https://allenai.github.io/scispacy/) publishes four full pipelines and four dedicated NER models for scientific text, listed below. The cells that follow load and test seven of them.


| **Model**            | **Description**                                                                                                      | **Install URL** | **Tested** |
|----------------------|----------------------------------------------------------------------------------------------------------------------|-----------------|---|
| en_core_sci_sm       | A full spaCy pipeline for biomedical data.                                                                           | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz)        | |
| en_core_sci_md       | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.                             | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz)        | |
| en_core_sci_scibert  | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz)        | |
| en_core_sci_lg       | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.                            | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz)        | |
| en_ner_craft_md      | A spaCy NER model trained on the CRAFT corpus.                                                                       | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz)        | Yes |
| en_ner_jnlpba_md     | A spaCy NER model trained on the JNLPBA corpus.                                                                      | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_jnlpba_md-0.5.0.tar.gz)        | | 
| en_ner_bc5cdr_md     | A spaCy NER model trained on the BC5CDR corpus.                                                                      | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz)        | Yes |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus.                                                                  | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bionlp13cg_md-0.5.0.tar.gz)        | Yes |



In [6]:
nlp_sci = spacy.load('en_core_sci_md')
nlp_scibert = spacy.load('en_core_sci_scibert')
nlp_core_sci_lg = spacy.load('en_core_sci_lg')
nlp_cr = spacy.load('en_ner_craft_md')
nlp_bc = spacy.load('en_ner_bc5cdr_md')
nlp_bi = spacy.load('en_ner_bionlp13cg_md')
nlp_jn = spacy.load('en_ner_jnlpba_md')
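
Loading seven models one variable at a time stops at the first package that is not installed, since `spacy.load` raises `OSError` for a missing model. A minimal sketch of an alternative is to keep the pipelines in a dictionary and collect the names that fail. `load_pipelines` and `fake_loader` are hypothetical helpers, and the stand-in loader exists only so the sketch runs without the models; in the notebook you would pass `spacy.load` instead.

```python
MODEL_NAMES = [
    "en_core_sci_md",
    "en_core_sci_scibert",
    "en_core_sci_lg",
    "en_ner_craft_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
    "en_ner_jnlpba_md",
]


def load_pipelines(names, loader):
    """Return ({name: pipeline}, [names that could not be loaded])."""
    loaded, missing = {}, []
    for name in names:
        try:
            loaded[name] = loader(name)
        except OSError:
            # Raised by spacy.load when the model package is not installed
            missing.append(name)
    return loaded, missing


def fake_loader(name):
    """Stand-in for spacy.load that pretends one model is not installed."""
    if name == "en_ner_jnlpba_md":
        raise OSError(f"Can't find model '{name}'")
    return f"<pipeline {name}>"


pipelines, missing = load_pipelines(MODEL_NAMES, fake_loader)
print(len(pipelines), "loaded; missing:", missing)

# In the notebook (assumption: spaCy and the models are installed):
# pipelines, missing = load_pipelines(MODEL_NAMES, spacy.load)
```

With the pipelines in a dictionary, the per-model cells could become a single loop over `pipelines.items()`.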


In [7]:
doc = nlp_sci(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)

In [8]:
doc = nlp_scibert(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)

In [9]:
doc = nlp_core_sci_lg(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)

In [10]:
doc = nlp_cr(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)

In [11]:
doc = nlp_bc(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)

In [12]:
doc = nlp_bi(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)

In [13]:
doc = nlp_jn(summary)
displacy_image = displacy.render(doc, style='ent', jupyter=True)
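
The displaCy renderings are hard to compare side by side. A rough way to compare models is to tabulate which labels each one assigns, using the `ents` attribute of a spaCy `Doc` and the `text` and `label_` attributes of each entity span. The sketch below uses hypothetical helper names, and the stand-in document and its `LABEL_A`/`LABEL_B` labels are placeholders so the code runs without a model; they are not real model output.

```python
from collections import Counter
from types import SimpleNamespace


def label_counts(doc):
    """Count how many entities the pipeline assigned to each label."""
    return Counter(ent.label_ for ent in doc.ents)


def entities_by_label(doc):
    """Group entity texts by label, keeping document order."""
    grouped = {}
    for ent in doc.ents:
        grouped.setdefault(ent.label_, []).append(ent.text)
    return grouped


# Stand-in for a spaCy Doc; the labels are placeholders, not model output
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="conus medullaris", label_="LABEL_A"),
    SimpleNamespace(text="anti-MOG", label_="LABEL_B"),
    SimpleNamespace(text="spinal cord", label_="LABEL_A"),
])

print(label_counts(fake_doc))
print(entities_by_label(fake_doc))

# In the notebook (assumption: the models from the cells above are loaded):
# for name, nlp in [("craft", nlp_cr), ("bc5cdr", nlp_bc), ("jnlpba", nlp_jn)]:
#     print(name, label_counts(nlp(summary)))
```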

## Conclusion so far

We can apparently identify some proteins, but it is not yet clear whether they are relevant for MS research.