# **NLP to BioMed - SciSpacy**
---
A single place to set up the libraries and pipelines available for applying Natural Language Processing (NLP) to biomedical data.

Here we use [SciSpacy](https://github.com/allenai/scispacy) to run NLP tasks on a sample of biomedical data.

**Installation**
---

The following packages are installed to carry out the tasks of Natural Language Processing.
- Scispacy
- Spacy
- Pre-trained models (each pre-trained model is described in detail below)

**Note**: Restart the runtime after running the cell below so that the newly installed pre-trained models can be loaded in the following cells.

In [1]:
!pip install scispacy==0.2.5

!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_bc5cdr_md-0.2.5.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
Building wheels for collected packages: en-core-sci-sm
  Building wheel for en-core-sci-sm (setup.py) ... done
  Created wheel for en-core-sci-sm: filename=en_core_sci_sm-0.2.5-cp36-none-any.whl size=33155838 sha256=d4b6d6f7031a9320f6ccc1b6c1691bb0103981244f3e81bb82eb8fdb773a2a29
  Stored in directory: /root/.cache/pip/wheels/24/38/1b/67d4ad18da43aa62deaca08a05fa0dc8595116a82154448f88
Successfully built en-core-sci-sm
Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_bc5cdr_md-0.2.5.tar.gz
  Downloading https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_bc5cdr_md-0.2.5.tar.gz (79.9MB)
     |████████████████████████████████| 79.9MB 51kB/s
Building wheels for collected packages: en-ner-bc5cdr-md
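
After restarting the runtime, a quick check can confirm that the packages and models are visible to Python. The following is a minimal sketch using only the standard library; `REQUIRED` and `missing_packages` are names chosen here for illustration, and the list assumes the two models installed above.

```python
from importlib.util import find_spec

# Import names of the packages and models installed by the cell above.
REQUIRED = ["spacy", "scispacy", "en_core_sci_sm", "en_ner_bc5cdr_md"]


def missing_packages(names):
    """Return the names from `names` that Python cannot locate for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not importable yet (install or restart the runtime):", ", ".join(missing))
else:
    print("All packages and models are importable.")
```

If a model still shows up as missing after a restart, re-running the corresponding `pip install` line is the first thing to try.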
 

**Configuration and Libraries:**

The libraries are imported once installed.



In [1]:
import spacy
import scispacy
from spacy import displacy

**Sample Data**

In [2]:
abstract = "Corticobasal syndrome (CBS) is a rare cognitive and movement disorder characterized by asymmetric rigidity, apraxia, alien-limb phenomenon, cortical sensory loss, myoclonus, focal dystonia, and dementia. It occurs along the clinical spectrum of frontotemporal lobar degeneration (FTLD), which has recently been shown to segregate with truncating mutations in progranulin (PGRN), a multifunctional growth factor thought to promote neuronal survival. This study identifies a novel splice donor site mutation in the PGRN gene (IVS7+1G→A) that segregates with CBS in a Canadian family of Chinese origin. We confirmed the absence of the mutant PGRN allele in the RT–PCR product which supports the model of haploinsufficiency for PGRN-linked disease. This report of mutation in the PGRN gene in CBS extends the evidence for genetic and phenotypic heterogeneity in FTLD spectrum disorders."

## **Abbreviation Detector**

The abbreviation detector pipeline component identifies abbreviations and their definitions in biomedical text.

E.g. : 
  - CBS: Corticobasal Syndrome
  - FTLD: Frontotemporal lobar degeneration

In [3]:
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load('en_ner_bc5cdr_md')

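# Add the abbreviation detector to the end of the pipeline.
abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

doc = nlp(abstract)

print("Abbreviation", "\t", "Definition")

# Each detected abbreviation is a span whose long form is stored in `_.long_form`.
for abrv in doc._.abbreviations:
    print(f"{abrv} :\t {abrv._.long_form}")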

Abbreviation 	 Definition
CBS :	 Corticobasal syndrome
CBS :	 Corticobasal syndrome
CBS :	 Corticobasal syndrome
FTLD :	 frontotemporal lobar degeneration
FTLD :	 frontotemporal lobar degeneration
PGRN :	 progranulin
PGRN :	 progranulin
PGRN :	 progranulin
PGRN :	 progranulin
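
The output above lists one row per occurrence, so each abbreviation appears several times. If a glossary with one row per abbreviation is wanted, the pairs can be de-duplicated. The sketch below is one way to do that; `unique_abbreviations` is a helper name introduced here, and the `detected` list simply re-types the rows printed above.

```python
def unique_abbreviations(pairs):
    """Collapse repeated (short form, long form) pairs, keeping first-seen order."""
    glossary = {}
    for short_form, long_form in pairs:
        # setdefault keeps the first long form seen for each short form.
        glossary.setdefault(short_form, long_form)
    return glossary


# The rows printed by the cell above, one per occurrence in the abstract.
detected = (
    [("CBS", "Corticobasal syndrome")] * 3
    + [("FTLD", "frontotemporal lobar degeneration")] * 2
    + [("PGRN", "progranulin")] * 4
)

for short_form, long_form in unique_abbreviations(detected).items():
    print(f"{short_form} :\t {long_form}")
```

With a processed document at hand, the pairs could be built as `[(str(abrv), str(abrv._.long_form)) for abrv in doc._.abbreviations]` instead of being typed by hand.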


**Entity Linker**
---
The entity linker pipeline component links each detected entity to its nearest matches in a knowledge base (UMLS here), from which information such as the definition and aliases can be retrieved.

E.g. : 
  - Name: CBS gene
  - Definition: This gene plays a role in transsulfuration
  - Aliases: CBS Gene, CYSTATHIONINE BETA-SYNTHASE, cystathionine-beta-synthase

Parameters that can be tuned for the entity linker:
- *resolve_abbreviations* : bool, optional (default = False). Whether to resolve abbreviations identified in the Doc before performing linking. This parameter has no effect if there is no AbbreviationDetector in the spaCy pipeline.
- *k* : int, optional (default = 30). The number of nearest neighbours to look up from the candidate generator per mention.
- *threshold* : float, optional (default = 0.7). The threshold that a mention candidate must reach to be added to the mention in the Doc as a mention candidate.
- *no_definition_threshold* : float, optional (default = 0.95). The threshold that an entity candidate must reach to be added to the mention in the Doc as a mention candidate if the entity candidate does not have a definition.
- *filter_for_definitions* : bool, optional (default = True). Whether to filter entities that can be returned to only include those with definitions in the knowledge base.
- *max_entities_per_mention* : int, optional (default = 5). The maximum number of entities which will be returned for a given mention, regardless of how many nearest neighbours are found.
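
To make the interaction between the thresholds concrete, the sketch below filters a list of candidates the way the descriptions above suggest. It is an illustration under assumptions, not SciSpacy's own code: the `Candidate` type, the `select_candidates` function, and the CUIs and scores in the example are all made up here. The *k* parameter is not modelled, since it only controls how many candidates are fetched before this filtering step.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """Hypothetical stand-in for one knowledge-base candidate of a mention."""
    cui: str
    score: float          # similarity between the mention and the candidate
    has_definition: bool  # whether the knowledge base holds a definition for it


def select_candidates(
    candidates: List[Candidate],
    threshold: float = 0.7,
    no_definition_threshold: float = 0.95,
    filter_for_definitions: bool = True,
    max_entities_per_mention: int = 5,
) -> List[Tuple[str, float]]:
    """Keep the candidates that pass the thresholds, best score first."""
    kept = []
    for cand in candidates:
        # Candidates without a definition must clear the stricter threshold.
        if (
            filter_for_definitions
            and not cand.has_definition
            and cand.score < no_definition_threshold
        ):
            continue
        if cand.score > threshold:
            kept.append((cand.cui, cand.score))
    # Highest score first, then truncate to the per-mention maximum.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_entities_per_mention]


# Made-up candidates and scores, for illustration only.
example = [
    Candidate("C0000001", 0.99, True),
    Candidate("C0000002", 0.80, False),  # dropped: no definition and below 0.95
    Candidate("C0000003", 0.75, True),
    Candidate("C0000004", 0.60, True),   # dropped: below the 0.7 threshold
]
print(select_candidates(example))
```

Lowering `threshold` admits more distant matches, while setting `filter_for_definitions=False` lets candidates without a definition through on the ordinary threshold alone.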

In [3]:
from scispacy.linking import EntityLinker

nlp = spacy.load('en_core_sci_sm')

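# `resolve_abbreviations` only takes effect when an AbbreviationDetector is in the pipeline.
entity_linker = EntityLinker(resolve_abbreviations=True, name="umls")
nlp.add_pipe(entity_linker)

doc = nlp(abstract)

# Inspect one entity from the abstract.
entity = doc.ents[3]

print("Name: ", entity)

# Each entry in `kb_ents` starts with the CUI of a matched knowledge-base entity.
for umls_ent in entity._.kb_ents:
    print(entity_linker.kb.cui_to_entity[umls_ent[0]])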

Name:  movement disorder
CUI: C0026650, Name: Movement Disorders
Definition: Syndromes which feature DYSKINESIAS as a cardinal manifestation of the disease process. Included in this category are degenerative, hereditary, post-infectious, medication-induced, post-inflammatory, and post-traumatic conditions.
TUI(s): T047
Aliases (abbreviated, total: 23): 
	 Movement Disorders, Movement Disorders, Movement Disorders, Movement disorders, Movement disorders, movement disorders, Movement Disorder, Movement Disorder, movement disorder, movement disorder
CUI: C0028850, Name: Ocular Motility Disorders
Definition: Disorders that feature impairment of eye movements as a primary manifestation of disease. These conditions may be divided into infranuclear, nuclear, and supranuclear disorders. Diseases of the eye muscles or oculomotor cranial nerves (III, IV, and VI) are considered infranuclear. Nuclear disorders are caused by disease of the oculomotor, trochlear, or abducens nuclei in the BRAIN STEM

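**Named Entity Recognition**
---

The `en_ner_bc5cdr_md` model, trained on the BC5CDR corpus, tags the entities in the abstract, and displaCy highlights them inline in the notebook.
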
In [3]:
nlp_ner = spacy.load('en_ner_bc5cdr_md')
doc_ner = nlp_ner(abstract)

# With jupyter=True the highlighted text is displayed directly, so there is no return value to keep.
displacy.render(doc_ner, style="ent", jupyter=True)
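
The displaCy rendering only shows up inside a notebook. For a plain-text view of the same result, the entities can be grouped by label and printed. This is a small sketch: `group_by_label` and `print_entities` are helper names introduced here, and they rely only on the standard spaCy attributes `doc.ents`, `ent.text` and `ent.label_`.

```python
from collections import defaultdict


def group_by_label(entities):
    """Map each entity label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def print_entities(doc):
    """Print the grouped entities of a processed spaCy Doc such as `doc_ner`."""
    pairs = [(ent.text, ent.label_) for ent in doc.ents]
    for label, texts in group_by_label(pairs).items():
        print(f"{label}: {', '.join(texts)}")
```

Calling `print_entities(doc_ner)` after the cell above prints one line per entity label found in the abstract.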

**Pre-trained Models for Bio-Medical Data**
---
**en_core_sci_sm** : A full spaCy pipeline for biomedical data with a ~100k vocabulary. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz)

**en_core_sci_md** : A full spaCy pipeline for biomedical data with a ~360k vocabulary and 50k word vectors. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_md-0.2.5.tar.gz)

**en_core_sci_lg** : A full spaCy pipeline for biomedical data with a ~785k vocabulary and 600k word vectors. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_lg-0.2.5.tar.gz)

***NER Models***:


**en_ner_craft_md** : A spaCy NER model trained on the CRAFT corpus. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_craft_md-0.2.5.tar.gz)

**en_ner_jnlpba_md** : A spaCy NER model trained on the JNLPBA corpus. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_jnlpba_md-0.2.5.tar.gz)

**en_ner_bc5cdr_md** : A spaCy NER model trained on the BC5CDR corpus. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_bc5cdr_md-0.2.5.tar.gz)

**en_ner_bionlp13cg_md** : A spaCy NER model trained on the BIONLP13CG corpus. [Download](https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_ner_bionlp13cg_md-0.2.5.tar.gz)
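
All of the download links above share one URL pattern, so the install command for any listed model can be generated rather than copied. The helper names `model_url` and `pip_command` below are introduced here for illustration, and the pattern is inferred only from the v0.2.5 links in this section, so it may not hold for other releases.

```python
BASE_URL = "https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases"


def model_url(name, version="0.2.5"):
    """Build a model download URL following the pattern of the links above."""
    return f"{BASE_URL}/v{version}/{name}-{version}.tar.gz"


def pip_command(name, version="0.2.5"):
    """Return the pip command that installs the given model."""
    return f"pip install {model_url(name, version)}"


print(pip_command("en_ner_craft_md"))
```

In a notebook the printed command can be run by prefixing it with `!`, as in the installation cell at the top.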

#### References:
- [ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing](https://www.semanticscholar.org/paper/ScispaCy%3A-Fast-and-Robust-Models-for-Biomedical-Neumann-King/de28ec1d7bd38c8fc4e8ac59b6133800818b4e29) 
- [AllenAI](https://github.com/allenai/scispacy)

