# Finding Entities in Multiple Sclerosis Research

This notebook is less about adding features to Gregory and more about seeing what I can do with spaCy and Named Entity Recognition (NER). It is a Jupyter notebook because I want to give the format a proper use and because it makes three goals easier to achieve.

1. Show others what my thought process was.
2. Make it easier to ask questions of people who know more than me.
3. Discover which NER model is best for analysing Multiple Sclerosis (MS) articles.

## Data sources

https://api.gregory-ms.com/articles/all

## Initialize modules and get data

In [23]:
import os
import scispacy
import spacy 
import pandas as pd
import requests
from spacy import displacy

In [24]:

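from io import StringIO

url = 'https://api.gregory-ms.com/articles/all'

# Fetch the full article list and fail early on HTTP errors
response = requests.get(url)
response.raise_for_status()

# Wrap the JSON text in StringIO; newer pandas versions deprecate passing raw JSON strings
df = pd.read_json(StringIO(response.text))

print(df)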

      article_id                                              title  \
0           1138  The Relationship Between Walking Speed and the...   
1           1139  Microglial changes associated with meningeal i...   
2           1201  Association of neurogranin gene expression wit...   
3            843  Depression in multiple sclerosis: Is one appro...   
4           1145  An engineered neurovascular unit for modeling ...   
...          ...                                                ...   
7960       12696  Does the Serum Expression Level of High-Mobili...   
7961       14071  The microbiota restrains neurodegenerative mic...   
7962       14074  Autologous treatment for ALS with implication ...   
7963       14538  Timed Up &amp; Go (TUG) With Cognitive and Man...   
7964       14804  Chromatin accessibility and transcriptome inte...   

                                                summary  \
0     &lt;div&gt;&lt;p style&#x3D;&quot;color: #4aa5...   
1     &lt;div&gt;&lt;p style&

In [25]:
summary = df.loc[0, 'summary']
print(summary)

&lt;div&gt;&lt;p style&#x3D;&quot;color: #4aa564;&quot;&gt;Neurorehabil Neural Repair. 2021 Apr 13:15459683211005028. doi: 10.1177&#x2F;15459683211005028. Online ahead of print.&lt;&#x2F;p&gt;&lt;p&gt;&lt;b&gt;ABSTRACT&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;&lt;p xmlns:xlink&#x3D;&quot;http:&#x2F;&#x2F;www.w3.org&#x2F;1999&#x2F;xlink&quot; xmlns:mml&#x3D;&quot;http:&#x2F;&#x2F;www.w3.org&#x2F;1998&#x2F;Math&#x2F;MathML&quot; xmlns:p1&#x3D;&quot;http:&#x2F;&#x2F;pubmed.gov&#x2F;pub-one&quot;&gt;BACKGROUND: Persons with multiple sclerosis (pwMS) experience walking impairments, characterized by decreased walking speeds. In healthy subjects, the self-selected walking speed is the energetically most optimal. In pwMS, the energetically most optimal walking speed remains underexposed. Therefore, this review aimed to determine the relationship between walking speed and energetic cost of walking (Cw) in pwMS, compared with healthy subjects, thereby assessing the walking speed with the lowest energetic co

The summary contains escaped HTML, so we need to clean it before running NER.

In [26]:
import html
summary = html.unescape(summary)

from bs4 import BeautifulSoup
soup = BeautifulSoup(summary, features="html.parser")
for script in soup(["script", "style"]):
    script.extract()    # rip it out
summary = soup.get_text()
print(summary)



Neurorehabil Neural Repair. 2021 Apr 13:15459683211005028. doi: 10.1177/15459683211005028. Online ahead of print.ABSTRACTBACKGROUND: Persons with multiple sclerosis (pwMS) experience walking impairments, characterized by decreased walking speeds. In healthy subjects, the self-selected walking speed is the energetically most optimal. In pwMS, the energetically most optimal walking speed remains underexposed. Therefore, this review aimed to determine the relationship between walking speed and energetic cost of walking (Cw) in pwMS, compared with healthy subjects, thereby assessing the walking speed with the lowest energetic cost. As it is unclear whether the Cw in pwMS differs between overground and treadmill walking, as reported in healthy subjects, a second review aim was to compare both conditions.METHOD: PubMed and Web of Science were systematically searched. Studies assessing pwMS, reporting walking speed (converted to meters per second), and reporting oxygen consumption were includ
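
The cleaned text above runs words together where tags were removed (for example `print.ABSTRACTBACKGROUND`), which may confuse sentence splitting and NER. The following is a minimal sketch of one way to avoid that, not the approach used in this notebook: it swaps BeautifulSoup for Python's standard-library `html.parser` and joins text nodes with a space. The names `_TextExtractor` and `clean_summary` are hypothetical helpers introduced here for illustration.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_summary(raw):
    """Unescape HTML entities, drop the tags, and join text nodes with spaces."""
    parser = _TextExtractor()
    parser.feed(html.unescape(raw))
    parser.close()
    return " ".join(parser.parts)
```

Assuming the `df` loaded earlier, the same function could be applied to every article with `df['summary'].map(clean_summary)`.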

Let's look at the output of `en_core_sci_md` as an NER model. We'll see that it identifies entities but does not show what they are: every span gets the same generic `ENTITY` label.

In [27]:
nlp = spacy.load('en_core_sci_md')
doc = nlp(summary)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')
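
To check the labels without relying on the displaCy colours, the entities can also be listed as plain rows. This is a small sketch, assuming a spaCy `Doc` such as the `doc` created above; `entity_rows` is a hypothetical helper name, while `doc.ents`, `ent.text`, `ent.label_`, `ent.start_char` and `ent.end_char` are standard spaCy attributes.

```python
def entity_rows(doc):
    """Return (text, label, start_char, end_char) for each entity in a spaCy Doc."""
    return [
        (ent.text, ent.label_, ent.start_char, ent.end_char)
        for ent in doc.ents
    ]


# Usage in the notebook (assumes `nlp` and `summary` from the cells above):
# for row in entity_rows(nlp(summary)):
#     print(row)
```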

There are four models available for NER on science articles that we are going to test. These all come from [SciSpacy](https://allenai.github.io/scispacy/).
1. en_ner_craft_md
2. en_ner_bc5cdr_md
3. en_ner_bionlp13cg_md
4. en_ner_jnlpba_md

First, the same thing as before, this time with `en_ner_jnlpba_md` as the NER model. We don't see any entities at all.

In [28]:
nlp_jn = spacy.load('en_ner_jnlpba_md')
doc = nlp_jn(summary)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')



In [29]:
nlp_cr = spacy.load('en_ner_craft_md')
nlp_bc = spacy.load('en_ner_bc5cdr_md')
nlp_bi = spacy.load('en_ner_bionlp13cg_md')
nlp_jn = spacy.load('en_ner_jnlpba_md')

In [30]:
doc = nlp_cr(summary)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')

In [31]:
doc = nlp_bc(summary)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')

In [32]:
doc = nlp_bi(summary)
displacy_image = displacy.render(doc, jupyter = True, style = 'ent')
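
Comparing four displaCy renderings by eye is hard, so it may help to count how many entities of each label every model finds in the same text. The sketch below assumes the pipelines loaded above (`nlp_cr`, `nlp_bc`, `nlp_bi`, `nlp_jn`); `label_counts` is a hypothetical helper, and it only relies on each pipeline being callable and returning an object with `.ents`.

```python
from collections import Counter


def label_counts(text, pipelines):
    """Map each pipeline name to a Counter of the entity labels found in `text`."""
    results = {}
    for name, nlp in pipelines.items():
        doc = nlp(text)
        results[name] = Counter(ent.label_ for ent in doc.ents)
    return results


# Usage in the notebook (assumes the models and `summary` from the cells above):
# pipelines = {
#     'en_ner_craft_md': nlp_cr,
#     'en_ner_bc5cdr_md': nlp_bc,
#     'en_ner_bionlp13cg_md': nlp_bi,
#     'en_ner_jnlpba_md': nlp_jn,
# }
# for name, counts in label_counts(summary, pipelines).items():
#     print(name, dict(counts))
```

A model that returns an empty counter found nothing in the text, which makes it easy to spot cases like the `en_ner_jnlpba_md` run.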

## Conclusion so far

I don't think any of these models will work for analysing MS articles.