# Finding Entities in Multiple Sclerosis Research

This is less about developing more for Gregory and more about seeing what I can do with spaCy and Named Entity Recognition (NER). It is a Jupyter notebook because I want to try giving the format a proper use and because it will make it easier to achieve three goals.

1. Show others what my thought process was.
2. Make it easier to ask questions of people who know more than me.
3. Discover which NER model is best for analysing Multiple Sclerosis (MS) articles.

## Data sources

https://api.gregory-ms.com/articles/all

There are eight models available for NER on science articles, and we are going to test seven of them. These all come from [scispaCy](https://allenai.github.io/scispacy/).


| **Model**            | **Description**                                                                                                      | **Install URL** | **Tested** |
|----------------------|----------------------------------------------------------------------------------------------------------------------|-----------------|---|
| en_core_sci_sm       | A full spaCy pipeline for biomedical data.                                                                           | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz)        | No, we tested en_core_sci_md instead because it has a larger vocabulary |
| en_core_sci_md       | A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.                             | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz)        | Yes |
| en_core_sci_scibert  | A full spaCy pipeline for biomedical data with a ~785k vocabulary and allenai/scibert-base as the transformer model. | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz)        | Yes |
| en_core_sci_lg       | A full spaCy pipeline for biomedical data with a larger vocabulary and 600k word vectors.                            | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz)        | Yes |
| en_ner_craft_md      | A spaCy NER model trained on the CRAFT corpus.                                                                       | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz)        | Yes |
| en_ner_jnlpba_md     | A spaCy NER model trained on the JNLPBA corpus.                                                                      | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_jnlpba_md-0.5.0.tar.gz)        | Yes | 
| en_ner_bc5cdr_md     | A spaCy NER model trained on the BC5CDR corpus.                                                                      | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz)        | Yes |
| en_ner_bionlp13cg_md | A spaCy NER model trained on the BIONLP13CG corpus.                                                                  | [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bionlp13cg_md-0.5.0.tar.gz)        | Yes |



## Initialize modules and get data

In [1]:
%%capture

import os
import sys

# Disable Hugging Face tokenizers parallelism (avoids fork warnings with the transformer model)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Install the libraries used in this notebook (beautifulsoup4 is needed later to clean the HTML)
!{sys.executable} -m pip install scispacy spacy pandas requests beautifulsoup4

import pandas as pd
import requests
import scispacy
import spacy
from spacy import displacy

# NLP Models
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_sm-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_md-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_scibert-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_core_sci_lg-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_craft_md-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_jnlpba_md-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bc5cdr_md-0.5.0.tar.gz
!{sys.executable} -m pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.0/en_ner_bionlp13cg_md-0.5.0.tar.gz



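import io

url = 'https://api.gregory-ms.com/articles/all'

response = requests.get(url)
response.raise_for_status()  # stop early if the API request failed

# Wrap the bytes in a file-like object; recent pandas versions deprecate passing raw JSON directly
df = pd.read_json(io.BytesIO(response.content))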

df

In [16]:
article_id = input()

In [17]:
summary = df.loc[int(article_id), 'summary']
summary

'&lt;div&gt;&lt;p style&#x3D;&quot;color: #4aa564;&quot;&gt;Mult Scler Relat Disord. 2021 May 8;52:103016. doi: 10.1016&#x2F;j.msard.2021.103016. Online ahead of print.&lt;&#x2F;p&gt;&lt;p&gt;&lt;b&gt;ABSTRACT&lt;&#x2F;b&gt;&lt;&#x2F;p&gt;&lt;p xmlns:xlink&#x3D;&quot;http:&#x2F;&#x2F;www.w3.org&#x2F;1999&#x2F;xlink&quot; xmlns:mml&#x3D;&quot;http:&#x2F;&#x2F;www.w3.org&#x2F;1998&#x2F;Math&#x2F;MathML&quot; xmlns:p1&#x3D;&quot;http:&#x2F;&#x2F;pubmed.gov&#x2F;pub-one&quot;&gt;BACKGROUND: Relapsing MS (RMS) is a lifelong disease without a cure, usually diagnosed between 20-40 years of age. Being newly diagnosed with RMS is a highly stressful event due to the unpredictable disease course after diagnosis. Thus, it is imperative that persons with MS have the skills and support to cope with the negative physical and emotional effects of the disease. The objective of this study was to assess whether a mindfulness-based intervention (MBI) would improve coping skills and thus lessen the negativ

The summary includes HTML, so we need to clean the data.

In [18]:
import html
summary = html.unescape(summary)

from bs4 import BeautifulSoup
soup = BeautifulSoup(summary, features="html.parser")
for script in soup(["script", "style"]):
    script.extract()    # rip it out
summary = soup.get_text()
summary 



'Mult Scler Relat Disord. 2021 May 8;52:103016. doi: 10.1016/j.msard.2021.103016. Online ahead of print.ABSTRACTBACKGROUND: Relapsing MS (RMS) is a lifelong disease without a cure, usually diagnosed between 20-40 years of age. Being newly diagnosed with RMS is a highly stressful event due to the unpredictable disease course after diagnosis. Thus, it is imperative that persons with MS have the skills and support to cope with the negative physical and emotional effects of the disease. The objective of this study was to assess whether a mindfulness-based intervention (MBI) would improve coping skills and thus lessen the negative consequences of stress due to being newly diagnosed with RMS.METHODS: This was a single-blind (assessor), randomized, prospective study of a 10-week MBI vs. usual standard of care in persons newly diagnosed (within 1 year) with RMS, recruited from one tertiary care MS clinic in London (ON), Canada. The MBI was administered in group format with a trained MBI facili
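
In the cleaned text above, block boundaries are lost: `print.ABSTRACTBACKGROUND:` and `RMS.METHODS:` run together because the tags are removed without leaving a separator, and that may confuse the tokenizer and the NER models. Below is a minimal sketch of an alternative cleaner that puts a space between text chunks. It uses only the standard library; `clean_summary` and `_TextExtractor` are hypothetical names that are not part of the original notebook. One trade-off, assumed acceptable here, is that inline tags such as `<sub>` also produce a space.

```python
import html
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        # Join chunks with a space so adjacent blocks do not fuse together
        return " ".join(self._chunks)


def clean_summary(raw):
    """Unescape HTML entities, then strip tags, separating chunks with spaces."""
    parser = _TextExtractor()
    parser.feed(html.unescape(raw))
    parser.close()
    return parser.text()
```

In the notebook this could be called as `summary = clean_summary(df.loc[int(article_id), 'summary'])`. With BeautifulSoup, passing `separator=" "` to `get_text()` should give a similar result.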

We are going to load all the NLP models that we listed above and marked as tested.

In [19]:
nlp_sci = spacy.load('en_core_sci_md')
nlp_scibert = spacy.load('en_core_sci_scibert')
nlp_core_sci_lg = spacy.load('en_core_sci_lg')
nlp_cr = spacy.load('en_ner_craft_md')
nlp_bc = spacy.load('en_ner_bc5cdr_md')
nlp_bi = spacy.load('en_ner_bionlp13cg_md')
nlp_jn = spacy.load('en_ner_jnlpba_md')
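
The seven `spacy.load` calls above could also be collected in a dictionary keyed by model name, which makes it easier to loop over the models later. This is only a sketch: `load_models` and `MODEL_NAMES` are hypothetical names, and the loader is passed in as a parameter so the function can be tried without the models installed. It relies on `spacy.load` raising `OSError` when a model package is missing.

```python
# Names of the seven models tested in this notebook
MODEL_NAMES = [
    "en_core_sci_md",
    "en_core_sci_scibert",
    "en_core_sci_lg",
    "en_ner_craft_md",
    "en_ner_jnlpba_md",
    "en_ner_bc5cdr_md",
    "en_ner_bionlp13cg_md",
]


def load_models(names, loader):
    """Load each named model with `loader`; return (loaded, missing)."""
    loaded, missing = {}, []
    for name in names:
        try:
            loaded[name] = loader(name)
        except OSError:
            # Raised by spacy.load when the model package is not installed
            missing.append(name)
    return loaded, missing
```

In the notebook, `models, missing = load_models(MODEL_NAMES, spacy.load)` would give a dictionary such as `models["en_core_sci_md"]` plus a list of any models that still need installing.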


## Testing en_core_sci_md

In [20]:
doc = nlp_sci(summary)
displacy.render(doc, style='ent', jupyter=True)

## Testing en_core_sci_scibert

In [21]:
doc = nlp_scibert(summary)
displacy.render(doc, style='ent', jupyter=True)



## Testing en_core_sci_lg

In [22]:
doc = nlp_core_sci_lg(summary)
displacy.render(doc, style='ent', jupyter=True)

## Testing en_ner_craft_md

In [23]:
doc = nlp_cr(summary)
displacy.render(doc, style='ent', jupyter=True)



## Testing en_ner_jnlpba_md

In [24]:
doc = nlp_jn(summary)
displacy.render(doc, style='ent', jupyter=True)

## Testing en_ner_bc5cdr_md

In [25]:
doc = nlp_bc(summary)
displacy.render(doc, style='ent', jupyter=True)

## Testing en_ner_bionlp13cg_md

In [26]:
doc = nlp_bi(summary)
displacy.render(doc, style='ent', jupyter=True)
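
The displaCy renderings are good for eyeballing, but comparing seven models side by side is easier with counts. The sketch below tallies the entities each pipeline finds in the same text. `entity_summary` and `label_counts` are hypothetical helpers, not part of the original notebook; they only assume that calling a pipeline on a string returns an object whose `.ents` items have `.text` and `.label_`, as spaCy `Doc` objects do.

```python
from collections import Counter


def entity_summary(pipelines, text):
    """Map each pipeline name to a Counter of (entity text, label) pairs."""
    summary = {}
    for name, nlp in pipelines.items():
        doc = nlp(text)
        summary[name] = Counter((ent.text, ent.label_) for ent in doc.ents)
    return summary


def label_counts(summary):
    """Collapse an entity summary to {pipeline name: Counter of labels}."""
    totals = {}
    for name, ents in summary.items():
        counts = Counter()
        for (_, label), n in ents.items():
            counts[label] += n
        totals[name] = counts
    return totals
```

With the models loaded earlier, something like `entity_summary({'craft': nlp_cr, 'jnlpba': nlp_jn, 'bc5cdr': nlp_bc}, summary)` would show which entity types each model picks up, and `label_counts` reduces that to one tally per label.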

# Conclusion so far

We are apparently able to identify some proteins, but I am not sure whether they are relevant for research.