# spaCy

spaCy is a newer library than NLTK, designed specifically to i) scale to larger problems and ii) hide irrelevant details from its users. Like NLTK, there's a lot to this library. We'll focus on the main features:

[Loading spaCy models](#loading)<br>

[Tokenization](#tokenization)<br>

[Lemmatization](#lemma)<br>

[Named entity recognition](#ner)<br>

[Visualizing NER](#visualize-ner)<br>

[Word vectors and similarity](#vectors)<br>

### Time
- Teaching: 30 minutes
- Exercises: 30 minutes

In [None]:
import os
import spacy
import numpy as np
import pandas as pd

## Loading spaCy models <a id='loading'></a>

spaCy requires us to download a model, which we did in the second notebook. spaCy has different models for different languages. The models are what actually do the processing in spaCy. To make use of a model, we first load it with `nlp = spacy.load('en')`, which stores the model in a variable called `nlp` for us. The `'en'` stands for English. (If you're on spaCy 3 or later, the `'en'` shortcut no longer works, and models are loaded by their full package name, for example `spacy.load('en_core_web_sm')`.) You can see what other languages spaCy supports [here](https://spacy.io/usage/models). If you wanted to process data in those languages, you'd first need to download the relevant models and then load them in a similar way.

In [None]:
nlp = spacy.load('en')

We can think of `nlp` as a function that we apply to the text data we want to analyze. First, let's read the Python Wikipedia page into a variable called `text`.

In [None]:
DATA_DIR = 'data'

def read(fname):
    fname = os.path.join(DATA_DIR, fname)
    with open(fname, encoding='utf-8') as f:
        return f.read()

text = read('python_wikipedia.txt')
text[:100]

Now we can use the `nlp` model to process it. We call the `nlp` object on `text`. When we do this, spaCy does a lot of work behind the scenes. In fact, most of the processing that we'll use later on is done at this stage. spaCy analyzes the text, and stores the result in a special `Doc` object. By convention, we call this `doc`. The `Doc` object holds all the information that we'll use later on, such as the sentence boundaries, the POS tags, the named entities, etc.

In [None]:
doc = nlp(text)

## Tokenization <a id='tokenization'></a>

Tokenization in spaCy is easy. In fact, it's already done! When we iterate over a `Doc` object, spaCy assumes we want to iterate over the tokens.

In [None]:
for token in doc[:10]:
    print(token)

Each `token` in `doc` is a `Token` object. This is an object that stores all the information about the token. To get the string representation of the token, we use the `.text` attribute. We'll see that all the information that we care about in spaCy is stored in attributes of objects like `Token`s.

In [None]:
first_token = doc[0]
type(first_token)

In [None]:
for token in doc[:10]:
    print(token.text)

In [None]:
type(first_token.text)

We can ask the `doc` object how many tokens it has:

In [None]:
len(doc)

### Challenge

Get the string representations of all the tokens in our text into a list called `tokens`. Check that it has 7547 strings in it.

In [None]:
# your answer goes here

## POS tagging

spaCy has already done the POS tagging for us. Guess where that information is stored? You got it: it's in an attribute of each token.

In [None]:
for token in doc[:10]:
    print(token.text, token.pos_)

Two observations. First, these labels may be opaque to you. What does `PROPN` mean? And `ADP`? spaCy has got you covered.

In [None]:
print(spacy.explain('PROPN'))
print(spacy.explain('ADP'))

Second, if you looked closely at the attribute we used for getting the part of speech, you would have seen that we used `.pos_`, with the underscore at the end. What's that about?

It all has to do with the fact that spaCy is designed to be hard-core, "industrial-strength" NLP software. It wants to be fast and efficient. To make it run fast, it stores a lot of the information as integers (often hashes of the underlying strings) that refer to the more human-readable data. Think of them as unique codes for data. Storing and comparing these integers is more efficient than using strings like `'PROPN'`. spaCy keeps the efficient representation in attributes without the underscore, and keeps the human-readable form in an identically named attribute with a trailing underscore. Most of the time, we as users of spaCy want the human-readable form, so we'll use the attributes with underscores.

In [None]:
for token in doc[:10]:
    print(token.text, token.pos)
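
To make the idea of "codes for strings" concrete, here is a toy sketch in plain Python. It is not how spaCy is implemented; the `ToyStringStore` class and its methods are hypothetical names invented for this illustration. It only shows why a library might store each distinct string once and pass around small integers instead.

```python
# Toy illustration only: spaCy's real string store is implemented differently.
class ToyStringStore:
    """Map strings to integer IDs and back."""

    def __init__(self):
        self._id_of = {}      # string -> integer ID
        self._strings = []    # integer ID -> string

    def add(self, string):
        # Reuse the existing ID if we have seen this string before.
        if string not in self._id_of:
            self._id_of[string] = len(self._strings)
            self._strings.append(string)
        return self._id_of[string]

    def lookup(self, string_id):
        return self._strings[string_id]


store = ToyStringStore()
tags = ['PROPN', 'ADP', 'PROPN', 'NOUN']
tag_ids = [store.add(tag) for tag in tags]

print(tag_ids)                              # compact form, in the spirit of `.pos`
print([store.lookup(i) for i in tag_ids])   # readable form, in the spirit of `.pos_`
```

Each distinct tag is stored as a string only once, and repeated tags share the same integer.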

### Challenge

Get a list of all POS tags in the document. BONUS: Get a list of tuples of (word, pos) for every token in the text.

In [None]:
# your answer goes here

## Lemmatization <a id='lemma'></a>

No prizes for guessing where the lemmas for each token are stored. You'll notice that nowhere do we have to say what algorithm we want to use to get the lemmata (that's the plural of _lemma_). That's the whole point of spaCy. The designers don't want you to have to worry about what the best algorithm is to use. They have done their research, and chosen what they believe is a general-purpose method. This is pretty different from NLTK.

In [None]:
for token in doc[:20]:
    print(token.text, token.lemma_)

## Named entity recognition <a id='ner'></a>

Named entity recognition (NER) is a big task in NLP. And rightfully so: it's really useful. A named entity is a "real-world object" that's assigned a name, for example a person, a country, a product or a book title. NER refers to extracting the named entities from a text. Imagine you're researching whether a particular newspaper is politically biased. One immediate thing you might ask is how often it talks about politicians of different persuasions. You could use NER to extract all the mentions of people, filter them down to politicians, and classify them by party. Then you'd look at the number of mentions of people in each party. If one is much higher than the others, that could be because the newspaper is biased. (It could also be a million other things.)
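
The newspaper workflow described above can be sketched in a few lines once the entities have been extracted. Everything in this sketch is made up for illustration: the `entities` list stands in for pairs like `(ent.text, ent.label_)`, and the `party_of` lookup table is a hypothetical resource you would have to build yourself.

```python
from collections import Counter

# Hypothetical (text, label) pairs, shaped like `(ent.text, ent.label_)`.
entities = [
    ('Alice Adams', 'PERSON'),
    ('Paris', 'GPE'),
    ('Bob Brown', 'PERSON'),
    ('Alice Adams', 'PERSON'),
    ('Carol Clark', 'PERSON'),
    ('Acme Corp', 'ORG'),
]

# Hypothetical lookup table from politician name to party.
party_of = {'Alice Adams': 'Party A', 'Bob Brown': 'Party B'}

# 1. Keep only the people. 2. Keep only known politicians. 3. Count by party.
people = [text for text, label in entities if label == 'PERSON']
politicians = [name for name in people if name in party_of]
mentions_by_party = Counter(party_of[name] for name in politicians)

print(mentions_by_party.most_common())
```

In a real study, the hard parts would be building the lookup table and handling different spellings of the same name, which this sketch ignores.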

In NER, the extracted named entities are grouped by type, for example "person", "organization", "location" or "country". spaCy can extract lots of [different types](https://spacy.io/api/annotation#named-entities) of named entities.

Named entities in spaCy are available as the `ents` property of a `Doc`. The `.label_` attribute of each entity tells us its type.

In [None]:
example_ner_sentence = '''On Wednesday, Apple announced that it is looking to buy a U.K. startup called Bamboozle.
It stated that it was willing to pay $1 billion for the rights to own its services in America, Vanuatu and Sweden.
Although none of its employees speak fluent French or Swahili, Bamboozle offered to expand its services to both
France and Tanzania. The rights cover the entirety of mainland U.S., except for Lake Michigan.'''.replace('\n', ' ')

example_ner_doc = nlp(example_ner_sentence)

In [None]:
for ent in example_ner_doc.ents:
    print(ent.text, ent.label_)

### Challenge

Extract all the named entities in the Python Wikipedia page, currently stored in `doc`. Bonus: find the most popular person (i.e. the person with the most mentions) and the most popular country (countries are labeled `'GPE'`, for geopolitical entity, a label that also covers cities and states).

In [None]:
# your answer goes here

In [None]:
# your answer goes here

In [None]:
# your answer goes here

## Visualizing NER <a id='visualize-ner'></a>

spaCy has some cool features for visualizing its analysis of text data. To use them, we import `displacy` from the spaCy library. We can ask `displacy` to `render` the NER information in `doc`, making sure to tell it that we're in a Jupyter notebook.

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style='ent', jupyter=True)

## Dependency parsing

Dependency parsing refers to identifying the grammatical relationships between individual words in a sentence. Just like NER, this is a huge topic in NLP. It's so big that we're not going to cover it in depth now. For our purposes, it's useful to know which words modify which: each token has a dependency label (`.dep_`) and a `.head`, the token it attaches to.

In [None]:
for token in doc[:18]:
    print(token.text, token.dep_, token.head)

This may not seem impressive. But let's visualize it:

In [None]:
dependency_doc = nlp(text[:90])
displacy.render(dependency_doc, style='dep', jupyter=True)
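
To see how head links encode "which words modify which", here is a small sketch in plain Python. The parse below is written by hand for the sentence "She ate the pizza" and is an assumption for illustration, not spaCy output; the tuple layout is also invented for this example.

```python
from collections import defaultdict

# Hand-written parse of "She ate the pizza" (an assumption, not spaCy output).
# Each entry is (word, dependency label, index of the head word).
# The root points at itself, mirroring how a root token's `.head` is the token itself.
parse = [
    ('She', 'nsubj', 1),
    ('ate', 'ROOT', 1),
    ('the', 'det', 3),
    ('pizza', 'dobj', 1),
]

# Group each word under the word it attaches to.
children = defaultdict(list)
for index, (word, dep, head) in enumerate(parse):
    if head != index:  # skip the root, which has no parent
        children[parse[head][0]].append((word, dep))

for head_word, modifiers in children.items():
    print(head_word, '<-', modifiers)
```

The arrows drawn by `displacy` carry the same information: one arc from each head to each of its dependents.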

### Challenge

For the text of the Python Wikipedia page, extract the following information for each token:
- string representation
- pos
- lemma
- whether it's a stop word (`.is_stop`)
- whether it's a punctuation symbol (`.is_punct`)
- whether it's a number (`.like_num`)
- the dependency relation (`.dep_`)

Store each piece of information in its own list (i.e. one list for POS tags, one list for lemmata, and so on). BONUS: turn this into a `pandas.DataFrame` and find the distribution of POS tags, and the distribution of POS tags given that the word is a stop word.

In [None]:
tokens = [token.text for token in doc]
# your answer goes here

In [None]:
# your answer goes here

In [None]:
# your answer goes here

In [None]:
# your answer goes here

## Word vectors and similarity <a id='vectors'></a>

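Word vectors are numerical representations of words: each word is mapped to a list of numbers, chosen so that words used in similar contexts end up with similar vectors. They allow us to measure how similar words are to one another and, by extension, how similar texts are to each other. Keep in mind that the scores are only as good as the vectors in the loaded model: spaCy's small English models do not ship with full word vectors, so for meaningful results you may need a larger model.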

In [None]:
tokens = nlp('dog cat horse banana peach strawberry')
data = []
for token1 in tokens:
    # One row per token: its similarity to every token, keyed by the token's text
    row = {}
    for token2 in tokens:
        row[token2.text] = token1.similarity(token2)
    data.append(row)

In [None]:
df = pd.DataFrame(data, index=[t.text for t in tokens])
df
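
By default, spaCy's `.similarity` is based on the cosine similarity between vectors. The sketch below computes that measure by hand in plain Python. The three-dimensional vectors are made up for illustration and are not real word vectors, and `cosine_similarity` is a helper written for this example rather than a spaCy function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors; real word vectors have hundreds of dimensions.
toy_vectors = {
    'dog': [0.9, 0.8, 0.1],
    'cat': [0.8, 0.9, 0.2],
    'banana': [0.1, 0.2, 0.9],
}

for word in ('cat', 'banana'):
    score = cosine_similarity(toy_vectors['dog'], toy_vectors[word])
    print('dog', word, round(score, 3))
```

Vectors pointing in nearly the same direction score close to 1, and vectors at right angles score 0, which is why related words should show higher values in the table than unrelated ones.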

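We can compare whole texts in the same way. A `Doc` also has a `.similarity` method, which by default compares the average of the word vectors in each document: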

In [None]:
berkeley = read('berkeley_wikipedia.txt')
stanford = read('stanford_wikipedia.txt')
mit = read('mit_wikipedia.txt')

In [None]:
berkeley_doc = nlp(berkeley)
stanford_doc = nlp(stanford)
mit_doc = nlp(mit)

In [None]:
berkeley_doc.similarity(stanford_doc)

In [None]:
berkeley_doc.similarity(mit_doc)

In [None]:
berkeley_doc.similarity(doc)