In [None]:
import spacy
from IPython.display import SVG
from spacy import displacy

# Intro to spaCy
We'll finish up our course by looking at `spaCy`, a great Python library for working with natural language processing. While most of the tools we've used so far have been rule-based, spaCy consists mostly of **statistical NLP** models. In statistical models, a large corpus of text is processed and mathematical methods are used to identify patterns in the corpus. This process is called **training**. Once a model has been trained, we can use it to analyze new text. 

spaCy comes with several pre-trained models, meaning that we can quickly load a model which has been trained on large amounts of data. This way, we can take advantage of work which has already been done by spaCy developers and focus on our own NLP tasks. In these notebooks, we'll see how combining spaCy's statistical models with rule-based systems offers a powerful way to process and analyze text.
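
To make the idea of "training" a little more concrete, here is a toy sketch in plain Python. It is not how spaCy's models work internally: the tiny corpus, the `train` and `predict` helpers, and the "most frequent tag" strategy are all assumptions made up for illustration. The point is only that the model's behavior comes from counting patterns in data rather than from hand-written rules.

```python
from collections import Counter, defaultdict

# A tiny, made-up "corpus" of (word, part-of-speech) pairs.
training_data = [
    ("the", "DET"), ("hotel", "NOUN"), ("opened", "VERB"),
    ("the", "DET"), ("rooms", "NOUN"), ("sold", "VERB"),
    ("a", "DET"), ("hotel", "NOUN"),
]

def train(pairs):
    """Learn the most frequent tag for each word seen in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def predict(model, word, default="NOUN"):
    """Look up the learned tag; fall back to a default for unseen words."""
    return model.get(word.lower(), default)

model = train(training_data)
print(predict(model, "hotel"))   # a word seen during training
print(predict(model, "resort"))  # an unseen word falls back to the default
```

Real statistical models generalize far better than this lookup table, but the workflow is the same: train once on a corpus, then apply the trained model to new text.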

## What we'll do today

We'll start by looking at the basic usage of spaCy. Next, we'll focus on a specific NLP task, **named entity recognition (NER)**, and see how this works in spaCy, as well as some of its limitations with clinical data. Some of these limitations can be addressed by writing our own rules for concept extraction, and we'll practice that with some clinical texts. We'll then go a little deeper into how spaCy's models are implemented and how we can modify them. Finally, we'll end the day by looking at spaCy models which were designed specifically for use in the biomedical domain.

# spaCy documentation

spaCy has great documentation. As we're going along today, try browsing through their documentation to find examples and instructions. Start by opening up these two pages and navigating through the documentation:

[Basic spaCy usage](https://spacy.io/usage/models)

[API documentation](https://spacy.io/api)

spaCy also has a really good, free online class. If you want to dig deeper into spaCy after this class, it's a great resource for using this library:
https://course.spacy.io/


# Basic usage of spaCy


In this notebook, we'll look at the fundamentals of spaCy:
- Main classes in spaCy
- Linguistic attributes available as part of default text processing
- Coding exercise

## How to use spaCy
At a high level, here are the steps for using spaCy:
- Start by loading a pre-trained NLP model
- Process a string of text with the model
- Use the attributes in our processed documents for downstream NLP tasks like NER or document classification

Here's a very short example of how this works. For the sake of demonstration, we'll use this snippet from a recent, exciting news article:

In [None]:
# First, load a pre-trained model
nlp = spacy.load("en_core_web_sm")

In [None]:
# Process a string of text with the model
text = """Taco Bell’s latest marketing venture, a pop-up hotel, opened at 10 a.m. Pacific Time Thursday. 
The rooms sold out within two minutes.
The resort has been called “The Bell: A Taco Bell Hotel and Resort.” It’s located in Palm Springs, California."""

doc = nlp(text)
doc

In [None]:
# Use the attributes in our processed documents for downstream NLP tasks
# Here, we'll visualize the entities in this text identified through NER
displacy.render(doc, style="ent")

Let's dive a little deeper into how spaCy is structured and what we have to work with.

## spaCy architecture
The following diagram from spaCy's API documentation shows a basic overview of spaCy's architecture:

In [None]:
SVG('./images/spacy_architecture.svg')

There's a lot of information in this diagram, so we'll start small by focusing on these 4 spaCy classes:
- `Language`: The NLP model used to process text
- `Doc`: A sequence of text which has been processed by a `Language` object
- `Token`: A single word or symbol in a Doc
- `Span`: A slice from a Doc

# `Language` class
The `nlp` object in spaCy is the linguistic model which will be used for processing text. We create a `Language` object by passing the name of the pre-trained model we wish to use to `spacy.load`. We typically name this object `nlp`, and it will be our primary entry point.


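Recall from the introduction that these models are statistical and pre-trained: they have already been trained on a large corpus of text, so loading one gives us a ready-to-use model without any training of our own.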

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp

In [None]:
help(nlp)

The `nlp` model we instantiated above is a **small** ("sm"), **English** ("en")-language model trained on **web** ("web") data. spaCy offers models in other sizes and languages as well: 16 different models for 9 different languages at the time this notebook was written. See the [spaCy documentation](https://spacy.io/usage/models) for more information on each of the models.
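
As a small sketch of this naming convention, the helper below splits a model name into its parts. The function `describe_model_name` is hypothetical (it is not part of spaCy), and it assumes the name has exactly four underscore-separated parts: language, type, genre, and size. The "core" part, which the text above does not cover, is the model type; in spaCy's naming scheme it marks a general-purpose model.

```python
def describe_model_name(name):
    """Split a model name like 'en_core_web_sm' into its four parts.

    Hypothetical helper for illustration; assumes the pattern
    language_type_genre_size.
    """
    language, model_type, genre, size = name.split("_")
    return {
        "language": language,
        "type": model_type,
        "genre": genre,
        "size": size,
    }

print(describe_model_name("en_core_web_sm"))
```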

# Documents, spans and tokens
The `nlp` object is what we'll be using to process text. The next few classes represent the output of our NLP model.

## `Doc` class
The `doc` object represents a single document of text. To create a `doc` object, we call `nlp` on a string of text. This runs that text through a spaCy pipeline, which we'll learn more about in a future notebook.

In [None]:
text = 'Taco Bell’s latest marketing venture, a pop-up hotel, opened at 10 a.m. Pacific Time Thursday.'

In [None]:
doc = nlp(text)

In [None]:
print(doc)

## Tokens and Spans
A `Token` is a single word, symbol, or whitespace in a `doc`. When we create a `doc` object, the text is broken up into individual tokens. This is called **"tokenization"**.

**Discussion**: Look at the tokens generated from this text snippet. What can you say about the tokenization method? Is it as simple as splitting up into words every time we reach a whitespace?
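
One way to explore this question is to compare spaCy's tokens with the result of Python's built-in `str.split()`. The sketch below uses `spacy.blank("en")`, which is an assumption made so that the example runs without downloading a model: a blank pipeline only tokenizes, but it uses the same English tokenization rules. The example sentence is adapted from the text above, with a straight apostrophe.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")

text = "It's located in Palm Springs, California."

# Naive approach: split only on whitespace.
whitespace_tokens = text.split()

# spaCy's tokenizer also applies rules for punctuation and contractions.
spacy_tokens = [token.text for token in nlp(text)]

print(whitespace_tokens)
print(spacy_tokens)
print(len(whitespace_tokens), "vs.", len(spacy_tokens), "tokens")
```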

In [None]:
token = doc[0]

In [None]:
token

In [None]:
type(token)

In [None]:
for token in doc:
    print(token)

A `Span` is a slice of a document, or a consecutive sequence of tokens.
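
As a minimal sketch of how slicing works, the example below creates a span and inspects where it starts and ends. It assumes a blank English pipeline (`spacy.blank("en")`) so that it runs without a downloaded model. Like Python list slices, `doc[1:4]` includes the token at index 1 and stops before index 4.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; enough to build Docs and Spans
doc = nlp("The rooms sold out within two minutes.")

span = doc[1:4]

print(span.text)                        # the text covered by the span
print(span.start, span.end)             # token indices: start inclusive, end exclusive
print(len(span))                        # number of tokens in the span
print([token.text for token in span])   # a Span can be iterated like a Doc
```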

In [None]:
span = doc[1:4]

In [None]:
span

In [None]:
type(span)

## Linguistic Attributes
Because spaCy comes with pre-trained linguistic models, when we call `nlp` on a text we have access to a number of linguistic attributes in the `doc` or `token` objects:

- Part-of-speech (POS) tagging
- Morphology
- Dependency Parsing
- Named entity recognition
- Sentence splitting
- Others

In [None]:
nlp.pipe_names

### POS Tagging
Parts of speech are categories of words. For example, "nouns", "verbs", and "adjectives" are all examples of parts of speech. Assigning parts of speech to words is useful for downstream NLP tasks such as word sense disambiguation and named entity recognition.

**Discussion**: What do the POS tags below mean?

In [None]:
print("Token -> POS\n")
for token in doc:
    print(f"{token.text} -> {token.pos_}")

In [None]:
spacy.explain("PROPN")

### Morphology
The morphology of a word refers to the form of a word. For example, "eat", "eats", and "ate" are all different inflections of the word "eat". We would say that "eat" is the **lemma** of all of these words.

In [None]:
print("Token -> Lemma\n")
for token in doc:
    print(f"{token.text} -> {token.lemma_}")

### Dependency Parsing
In dependency parsing, we analyze the structure of a sentence. We won't spend too much time on this, but here is a nice visualization of what a dependency parse looks like. Take a minute to look at the arrows between words and try to figure out what they mean.

In [None]:
doc = nlp("The cat sat on the green mat")

In [None]:
displacy.render(doc, style='dep')

### Other attributes
Look at spaCy's [Token class documentation](https://spacy.io/api/token) for a full list of additional attributes available for each token in a document. We'll print out a few more.

**Discussion**: How can these attributes be useful in downstream NLP tasks?

In [None]:
doc = nlp("Bitcoin rose 7.9% to $11,899 as of 11:53 a.m. in New York on Monday. "
    "https://www.bloomberg.com/news/articles/2019-07-08/bitcoin-breakout-may-be-ahead-as-technicals-show-rally-extension")

In [None]:
print(doc)

The "shape" of the token:

In [None]:
print(doc[0])
print(doc[0].shape_)

The vector of a token (we'll learn more about this in the upcoming [statistical NLP class](https://datascience4health.bmi.utah.edu/statistical-nlp/)).

In [None]:
token = doc[0]

In [None]:
token.vector

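Boolean flags describing the token, such as whether it:
- Resembles a number (`like_num`)
- Is punctuation (`is_punct`)
- Is a currency symbol (`is_currency`)
- Resembles a URL (`like_url`)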
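
As a sketch, we can print several of these flags for every token at once. This example assumes a blank English pipeline (`spacy.blank("en")`) and a shortened version of the sentence above; these flags depend only on the token's text, so no trained model is needed.

```python
import spacy

nlp = spacy.blank("en")  # these flags are lexical, so a blank pipeline is enough
doc = nlp("Bitcoin rose to $11,899 on Monday.")

# Collect the flags per token so they are easy to inspect afterwards.
flags = {
    token.text: (token.like_num, token.is_punct, token.is_currency)
    for token in doc
}

print(f"{'Token':<10}{'like_num':<10}{'is_punct':<10}{'is_currency':<12}")
for text, (like_num, is_punct, is_currency) in flags.items():
    print(f"{text:<10}{like_num!s:<10}{is_punct!s:<10}{is_currency!s:<12}")
```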

In [None]:
print(doc[2])
print(doc[2].like_num)

In [None]:
print(doc[5])
print(doc[5].is_currency)

In [None]:
print(doc[-1])
print(doc[-1].like_url)

# Coding Exercises and Discussions
Now that we've seen some examples of what we can do with spaCy, let's practice with some coding exercises!

**1. Ambiguity** 

Look at the word "duck" in the two sentences below, then discuss these questions with a group:
- Are the two tokens "duck" in these two sentences identical? What does that tell us about a "token" vs. a normal string?
- When two words are spelled the same but have different meanings, they are "ambiguous". Using spaCy, what are some ways we could "disambiguate" the word "duck" in these two sentences?

In [None]:
doc1 = nlp("The duck swam gently down the river.")
doc2 = nlp("He had to duck as he came through the door.")

In [None]:
duck1 = doc1[1]
duck2 = doc2[3]

In [None]:
duck1 == duck2

In [None]:
# Your code here

**2. Count POS tags** 

Write a function that takes a Doc and returns a count of the number of each POS tag in that doc.

*Bonus*: Write a second function that plots a bar graph of these counts.
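
If you haven't used `defaultdict` before, here is a small sketch of the counting pattern on a different problem: counting the letters in a word. The function name `count_letters` is made up for illustration. A `defaultdict(int)` returns 0 for any key it hasn't seen, so we can increment a count without first checking whether the key exists.

```python
from collections import defaultdict

def count_letters(word):
    """Count how many times each letter occurs in a word."""
    counts = defaultdict(int)  # missing keys start at 0
    for letter in word:
        counts[letter] += 1
    return counts

print(dict(count_letters("banana")))
```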

In [None]:
from collections import defaultdict
def count_pos_tags(doc):
    """Return a dict mapping each POS tag in `doc` to the number of tokens with that tag."""
    # Your code here
    # d = defaultdict(...)
    return d

In [None]:
doc1 = nlp("The highly vivacious green paint is giving me a terrible, mind-splitting headache.")
doc2 = nlp("My favorite activities are swimming, biking, reading, eating, and watching trashy shows on Netflix.")
doc3 = nlp("On Tuesday, the 28 EU leaders chose Ursula von der Leyen, an ally of German Chancellor Angela Merkel, "
            "to replace Jean-Claude Juncker at the helm of the Commission.")

In [None]:
docs = (doc1, doc2, doc3)

In [None]:
for doc in docs:
    print(count_pos_tags(doc))

**3. Sort by number of words with a certain POS tag**

Write a function that takes a list of Docs and a POS tag. Return a new list which is sorted in **descending order** by the number of tokens which have that POS tag. Use a default value of 'PROPN' for the argument `pos_tag`. Test this out using the list `docs` from the last exercise.

**Hint**: Use the built-in [sorted function](https://www.geeksforgeeks.org/sorted-function-python/) to sort the list, and use a lambda function to define the key which we should sort by.
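
As a sketch of this pattern on a different problem, the example below sorts a list of words in descending order by how many times the letter "a" appears in each. The word list is made up for illustration. The `key` argument tells `sorted` what value to compare, and `reverse=True` gives descending order.

```python
words = ["banana", "cherry", "papaya", "apple"]

# Sort by the number of "a" characters, largest count first.
# Words with equal counts keep their original relative order.
sorted_words = sorted(words, key=lambda word: word.count("a"), reverse=True)

print(sorted_words)
```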

In [None]:
def sort_by_pos(docs, pos_tag):
    """Return `docs` sorted in descending order by the number of tokens tagged `pos_tag`."""
    # Your code here
    # srtd_docs = ... 
    return srtd_docs
    
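def count_pos_tag(doc, pos_tag):
    """Return the number of tokens in `doc` whose POS tag is `pos_tag`."""
    n = 0
    for token in doc:
        # Your code here
        pass  # placeholder so the empty loop is valid Python; replace with your code
    return n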

In [None]:
# Sort by which documents have the most proper nouns
sort_by_pos(docs, 'PROPN')

In [None]:
# Sort by which documents have the most verbs
sort_by_pos(docs, 'VERB')

In [None]:
# Sort by which documents have the most adjectives
sort_by_pos(docs, 'ADJ')