## Introduction to ```spaCy```

There are a number of different NLP frameworks that you're likely to encounter. The most popular and widely-used of these are:

- ```NLTK``` (Natural Language Toolkit, old-school)
- ```UDPipe``` (Neural network based, fast and light, but not super accurate)
- ```CoreNLP``` and ```stanza``` (Created by the team at Stanford; academically robust)
- ```spaCy``` (production-ready, well-documented, state-of-the-art)

We'll be working with ```spaCy``` in this module, primarily because it's easy and intuitive, and also scales well.

The first thing we need to do is install ```spaCy``` and the language model that we want to use.

From the command line, you should first make sure to run the setup script to install requirements:

```shell 
bash setup.sh
```
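
Once the setup script has finished, it can be worth confirming that everything is importable before moving on. The snippet below is a minimal sketch using only the standard library. It assumes that the model loaded later in this notebook, ```en_core_web_md```, is installed as a regular Python package (which is how spaCy distributes its trained pipelines). The helper name ```check_installed``` is made up for illustration.

```python
from importlib.util import find_spec


def check_installed(package_names):
    """Return a dict mapping each package name to True/False (importable or not)."""
    return {name: find_spec(name) is not None for name in package_names}


# "spacy" is the library; "en_core_web_md" is the English model loaded below
status = check_installed(["spacy", "en_core_web_md"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

If the model shows up as missing, spaCy's documented way of fetching it is to run ```python -m spacy download en_core_web_md``` from the command line.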

## Initializing ```spaCy```

The first thing we need to do is import ```spaCy``` __and__ the language model that we want to use.

Note that if you want to work with different languages, you need to use different language models.

In [1]:
# import spaCy and load the trained pipeline (an instance of spaCy's Language class)
import spacy
nlp = spacy.load("en_core_web_md")

> We start by importing spaCy and then loading the specific model we're working with.
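
Since each language needs its own model, one lightweight pattern is to keep the model names in a lookup table. This is only a sketch: the dictionary and the helper ```model_name_for``` are hypothetical, and it assumes the medium-sized English and Danish pipelines (```en_core_web_md``` and ```da_core_news_md```) are the ones you have installed.

```python
# Hypothetical lookup table from language code to spaCy model name.
# Both names are pipelines published by spaCy; whether they are installed
# on your machine is an assumption.
MODEL_NAMES = {
    "en": "en_core_web_md",
    "da": "da_core_news_md",
}


def model_name_for(lang_code):
    """Return the model name for a language code, or raise a readable error."""
    try:
        return MODEL_NAMES[lang_code]
    except KeyError:
        known = ", ".join(sorted(MODEL_NAMES))
        raise ValueError(
            f"No model configured for '{lang_code}' (known: {known})"
        ) from None


print(model_name_for("da"))
```

The returned name can then be passed to ```spacy.load``` in place of the hard-coded string, for example ```nlp = spacy.load(model_name_for("da"))```.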

With the model now loaded, we can begin to do some very simple NLP tasks.

Here, we create a spaCy object and assign it to the variable ```nlp```. This is the NLP pipeline that will do all our heavy lifting, using the trained model we've specified.

Below, you can see what the pipeline does with a bit of sample text. Passing text to the ```nlp``` object gives us access to a bunch of properties, including tokens (words), parts of speech, named entities, and so on. Here we'll look at two of them, tokens and entities. These objects, in turn, have certain attributes and methods attached to them. A full outline of what's available can be found in the spaCy docs.

In this case, for all token objects, let's return the token itself (```token.text```); its part-of-speech tag (```token.pos_```); and the grammatical dependency relations between the tokens (```token.dep_```).


In [2]:
# a single sentence example
input_string = "My name is Ross and I come from Scotland."

In [3]:
# create a new variable called doc; Doc objects are complex objects
doc = nlp(input_string)

In [4]:
type(doc)

spacy.tokens.doc.Doc

In [5]:
print(doc)

My name is Ross and I come from Scotland.


Printing the Doc shows the original sentence because spaCy keeps the string intact, but each of the linguistic tokens now carries more information in the background.

__Tokenize__

In [6]:
# tokenizing text
for token in doc:
    print(token.text)  # .text gives the token's string; each token also carries attributes such as part of speech and grammatical category

My
name
is
Ross
and
I
come
from
Scotland
.


We can see that it even separates punctuation. If we change the string to say "I come from New York", spaCy will tokenize "New" and "York" separately, as we discussed earlier, which we don't want because it's really one unit. Keep in mind that spaCy is opinionated: we are working within its assumptions and framework.

__Trying some more attributes__

In [12]:
# find parts-of-speech and grammatical relations
# token.i = index, token.text = text, token.pos_ = part of speech,
# token.dep_ = dependency relation, token.morph = morphological features
# (token.pos without the trailing "_" returns an integer ID instead of a readable label)
for token in doc:
    print(token.i, token.text, token.pos_, token.dep_, token.morph)


0 My PRON poss Number=Sing|Person=1|Poss=Yes|PronType=Prs
1 name NOUN nsubj Number=Sing
2 is AUX ROOT Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
3 Ross PROPN attr Number=Sing
4 and CCONJ cc ConjType=Cmp
5 I PRON nsubj Case=Nom|Number=Sing|Person=1|PronType=Prs
6 come VERB conj Tense=Pres|VerbForm=Fin
7 from ADP prep 
8 Scotland PROPN pobj Number=Sing
9 . PUNCT punct PunctType=Peri
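
The last column printed above packs several morphological features into one string, using ```Feature=Value``` pairs separated by ```|```. As a rough illustration of that format, the sketch below splits such a string into a dictionary with plain Python. The helper ```parse_morph``` is hypothetical; in practice, spaCy v3 exposes the same information directly via ```token.morph.to_dict()```.

```python
def parse_morph(morph_string):
    """Split a 'Feat=Val|Feat=Val' string into a {feature: value} dict."""
    if not morph_string:
        return {}  # e.g. "from" above has no morphological features
    features = {}
    for pair in morph_string.split("|"):
        name, _, value = pair.partition("=")
        features[name] = value
    return features


# morphology string copied from the output for the token "My"
print(parse_morph("Number=Sing|Person=1|Poss=Yes|PronType=Prs"))
```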


__NER__

Extracting named entities from a ```spaCy``` doc requires an extra step, but nothing too challenging:

In [16]:
# extracting named entities
for ent in doc.ents:  # doc.ents holds the entities found in the Doc
    print(ent.text, ent.label_)

Ross PERSON
Scotland GPE


Here, if the string mentions New York, it will be captured as one entity on its own.

__Questions:__ 

1. What range of linguistic features is available beyond what we're looking at here? 
2. Is the same range of features available for all languages? Compare e.g. English and Danish.

## Count distribution of linguistic features

__Create doc object__

In [19]:
# load a text file
import os
filename = os.path.join("..", "data", "example.txt")

In [20]:
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()

In [21]:
# create a doc object
doc = nlp(text)

In [23]:
# create empty list 
entities = []

# add each entity to list  
for ent in doc.ents:
    entities.append(ent.text)

In [24]:
print(set(entities))  # set() returns only the unique items in a collection

['KANSAS CITY', 'Mo.', 'AP', 'Hundreds of thousands', 'Kansas City', 'Wednesday', 'the Kansas City Chiefs’', 'second', 'Super Bowl', 'two years', 'Andy Reid', 'Super Bowl MVP', 'Patrick Mahomes', 'Chiefs', 'one', 'Kansas City', 'noon', 'Union Station', 'Chiefs', 'the Philadelphia Eagles', '38-35', 'Sunday', '8 seconds', 'more than 19', 'The City Council Transportation and Infrastructure Committee', '750,000', 'Quinton Lucas', 'more than $1.5 million', 'The the Kansas City Sports Commission', '$1 million', 'the Jackson County Legislature', '75,000']
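
Because ```set()``` throws away how often each entity occurs, it can be useful to count mentions instead. Here is a minimal sketch using ```collections.Counter```. The short list is a hand-picked subset of the entities printed above, standing in for ```entities``` so that the snippet runs without the text file.

```python
from collections import Counter

# stand-in for the `entities` list built above (subset of the printed output)
entities = ["KANSAS CITY", "Kansas City", "Chiefs", "Andy Reid",
            "Patrick Mahomes", "Chiefs", "Kansas City", "Union Station"]

entity_counts = Counter(entities)

# most_common(n) returns the n most frequent items as (item, count) pairs
for entity, count in entity_counts.most_common(3):
    print(entity, count)
```

Note that "KANSAS CITY" and "Kansas City" count as different strings here. Lower-casing with ```ent.text.lower()``` before counting would merge them, if that is what you want.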


In [28]:
# count number of adjectives
adjective_count = 0

for token in doc:
    if token.pos_ == "ADJ":
        adjective_count += 1

print(adjective_count)

11
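
The same loop can be generalised to count every part-of-speech tag at once rather than only adjectives. The sketch below uses ```collections.Counter```. To stay runnable without a model, it works on the tags from the ```My name is Ross...``` output earlier in this notebook rather than on ```doc```.

```python
from collections import Counter

# POS tags copied from the sample-sentence output earlier in the notebook;
# with a real Doc this would be: pos_tags = [token.pos_ for token in doc]
pos_tags = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ",
            "PRON", "VERB", "ADP", "PROPN", "PUNCT"]

pos_counts = Counter(pos_tags)

print(pos_counts["PROPN"])  # number of proper nouns
print(pos_counts["ADJ"])    # a Counter returns 0 for tags that never occur
```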


__Relative frequency__

In [29]:
# find the relative frequency per 10,000 words
relative_freq = (adjective_count/len(doc)) * 10000

In [31]:
int(relative_freq) # convert to an int to drop the long decimal part (this truncates rather than rounds)

438

In [32]:
round(relative_freq, 2)

438.25
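
The calculation above can be wrapped in a small function so that it is reusable across documents and features. This is a sketch: the name ```relative_frequency``` and its ```per``` parameter are made up here, and the token count of 251 is inferred from the numbers above rather than shown in the notebook.

```python
def relative_frequency(count, total, per=10_000):
    """Return how often a feature occurs per `per` tokens."""
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty document
    return (count / total) * per


# 11 adjectives; 251 tokens is an assumed value for len(doc),
# inferred from the output above
print(round(relative_frequency(11, 251), 2))
```

Keeping ```per``` as a parameter makes it easy to switch to, say, occurrences per 1,000 tokens for shorter texts.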

## Creating neater outputs using ```pandas```

At the moment, all of our output from ```spaCy``` is in the form of lists. If we want to save these, it probably makes sense to have them saved in a more transferable format, such as CSV or JSON files.

One very easy way to do this with Python is by using the dataframe library ```pandas```.

In [33]:
import pandas as pd

In [34]:
# create a new spaCy Doc object
doc = nlp(input_string)

In [37]:
annotations = []
for token in doc:
    annotations.append((token.text, token.pos_))  # the inner parentheses bundle the two attributes into a single tuple

In [38]:
annotations

[('My', 'PRON'),
 ('name', 'NOUN'),
 ('is', 'AUX'),
 ('Ross', 'PROPN'),
 ('and', 'CCONJ'),
 ('I', 'PRON'),
 ('come', 'VERB'),
 ('from', 'ADP'),
 ('Scotland', 'PROPN'),
 ('.', 'PUNCT')]

In [42]:
# spaCy doc to pandas dataframe
data = pd.DataFrame(annotations, 
                    columns=["Token", "POS"])

In [43]:
data

| | Token | POS |
|---|---|---|
| 0 | My | PRON |
| 1 | name | NOUN |
| 2 | is | AUX |
| 3 | Ross | PROPN |
| 4 | and | CCONJ |
| 5 | I | PRON |
| 6 | come | VERB |
| 7 | from | ADP |
| 8 | Scotland | PROPN |
| 9 | . | PUNCT |
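
For the JSON route mentioned at the start of this section, the standard library is enough. Here is a minimal sketch, using the first few ```(token, POS)``` pairs from ```annotations``` above and hypothetical keys ```token``` and ```pos```:

```python
import json

# first few (token, POS) pairs from the annotations list above
annotations = [("My", "PRON"), ("name", "NOUN"), ("is", "AUX"), ("Ross", "PROPN")]

# turn each tuple into a dict so the JSON carries field names
records = [{"token": text, "pos": pos} for text, pos in annotations]

json_string = json.dumps(records, indent=2)
print(json_string)
```

With the dataframe itself, ```data.to_json(orient="records")``` should give a comparable structure, using the column names as keys.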


In [49]:
# save dataframe (this raises an error because the output directory does not exist yet)
outpath = os.path.join("..", "out", "annotations.csv")
data.to_csv(outpath)

OSError: Cannot save file into a non-existent directory: '../out'
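
The ```OSError``` above comes up because ```to_csv``` does not create missing folders. One way around it is to create the parent directory first. The sketch below shows the idea with a hypothetical helper, ```ensure_parent_dir```, and demonstrates it inside a temporary directory so that it does not touch your project folders.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the directory that `path` lives in, if it does not exist yet."""
    parent = os.path.dirname(path)
    if parent:  # an empty string means the current directory, nothing to create
        os.makedirs(parent, exist_ok=True)
    return path


# demo in a throwaway directory; "out/annotations.csv" mirrors the path used above
with tempfile.TemporaryDirectory() as tmp:
    outpath = ensure_parent_dir(os.path.join(tmp, "out", "annotations.csv"))
    with open(outpath, "w", encoding="utf-8") as f:
        f.write("Token,POS\n")
    print(os.path.exists(outpath))
```

In the notebook, calling ```ensure_parent_dir(outpath)``` just before ```data.to_csv(outpath)``` should be enough to get past the error.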

## Assignment 1

Spend some time exploring and familiarizing yourself with ```spaCy``` and ```pandas```. We'll come back to them quite a lot through this semester, so it will help to have a solid handle on how they function.

When you are ready, head over to [Assignment 1](https://classroom.github.com/a/PdNi7nPv), which builds on some of the skills you learned last week and today. The task will be to count how many times certain linguistic features occur across different documents, and to save those results in a clear and easy-to-understand way.