## spaCy

spaCy is a newer library for natural language processing, built around pretrained machine learning models.

Currently, spaCy does not work with versions of Python above 3.12. To work around this, you can use conda/mamba to create a new environment called `spacy` with Python 3.12:

    mamba create -n spacy python=3.12
    mamba activate spacy
    mamba install spacy

The quick and easy way to run this notebook in that new environment is to install Jupyter there: `mamba install jupyter`.

A more correct solution is to run Jupyter from your usual environment but run this notebook with the Python kernel from the new environment.

For a discussion of how to switch between Python kernels from different conda environments, see [this webpage](https://towardsdatascience.com/get-your-conda-environment-to-show-in-jupyter-notebooks-the-easy-way-17010b76e874).  The basic idea is that in your base environment, you do `mamba install nb_conda_kernels`, and in the new environment whose kernel you want to access, you do `mamba install ipykernel`.  Then restart Jupyter from your base environment.
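
Whichever route you take, it is worth confirming from inside the notebook which interpreter the kernel is actually using. Here is a minimal sketch using only the standard library. The helper name `kernel_info` is made up for illustration, and the 3.12 ceiling is simply the limit mentioned above, so adjust it if that changes.

```python
import sys

def kernel_info(max_supported=(3, 12)):
    """Describe the running interpreter and whether it is within the assumed supported range."""
    version = sys.version_info[:2]
    return {
        "executable": sys.executable,   # the path shows which conda environment is active
        "version": version,
        "supported": version <= max_supported,
    }

info = kernel_info()
print(info["executable"])
print(info["version"], "OK" if info["supported"] else "newer than the assumed limit")
```

If the printed path does not contain the name of the `spacy` environment, the notebook is still running on a different kernel.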

In [None]:
!pip install spacy

In [75]:
import spacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
for token in nlp('boxes was having mice children swam dug'):
    print(token.lemma_)

In [None]:
with open('carroll-alice.txt') as f:
    alice = f.read()

In [None]:
doc = nlp(alice)

In [None]:
doc

In [None]:
for token in doc:
    print(token.text, token.lemma_)
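
Printing every token produces a lot of output. One way to summarise it is to count the lemmas instead. This is only a sketch: the function name `lemma_counts` is invented here, and it relies on nothing beyond the `is_alpha`, `is_stop` and `lemma_` attributes that spaCy tokens provide.

```python
from collections import Counter

def lemma_counts(doc, skip_stopwords=True):
    """Count lowercased lemmas of the alphabetic tokens in a spaCy Doc."""
    counts = Counter()
    for token in doc:
        if not token.is_alpha:
            continue    # skip punctuation, numbers and whitespace
        if skip_stopwords and token.is_stop:
            continue    # skip very common words such as "the" and "of"
        counts[token.lemma_.lower()] += 1
    return counts
```

With the `doc` built above, `lemma_counts(doc).most_common(20)` would list the twenty most frequent content-word lemmas.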

# Named Entity Recognition

Pulling references to concrete people, places and things in the real world out of texts is crucial to many forms of cultural analysis.

But it's pretty tricky to do in the general sense (easier if you know what you are looking for).

As our sample text, instead of *Alice in Wonderland*, let's use the text of Lewis Carroll's Wikipedia entry, which has a lot more references to the real world.

We use Beautiful Soup to parse the webpage.

In [None]:
from bs4 import BeautifulSoup

In [None]:
import requests

In [None]:
headers = {
    'User-Agent': 'Educational script',
}

In [None]:
page = requests.get("https://en.wikipedia.org/wiki/Lewis_Carroll",
                    headers=headers)
# Name the parser explicitly so the result does not depend on which parsers are installed
page_content = BeautifulSoup(page.text, "html.parser")

In [None]:
page_content

Very messy!  We just want the text of the article.  So we look for paragraphs `<p> ... </p>` and then we extract the text within each one without any tags.

In [None]:
text = ''
for para in page_content.find_all("p"):
    para = para.text
    text += para
text

Those newlines are annoying, so let's get rid of them.

In [None]:
text = ''
for para in page_content.find_all("p"):
    para = para.text
    para = para.replace("\n", " ")
    text += para
text

That's pretty clean, except for the footnote markers (e.g. `[1]`).  Let's get rid of those.

In [None]:
import re
text = ''
for para in page_content.find_all("p"):
    para = para.text
    para = para.replace("\n", " ")
    para = re.sub(r'\[\d+\]', '', para)
    text += para
text
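
To see what the pattern does and does not catch, here it is applied to a made-up sentence (the sample string is invented for illustration). `\[\d+\]` matches a literal `[`, one or more digits, and a literal `]`, so numeric markers disappear but lettered notes such as `[a]` survive. The broader second pattern is an assumption about what else you might want to remove.

```python
import re

sample = "Carroll[1] was born in Daresbury[23] and wrote as a hobby.[a]"

# The pattern used above: numeric footnote markers only
cleaned = re.sub(r'\[\d+\]', '', sample)
print(cleaned)

# A broader variant: also strip single lowercase-letter notes like [a]
cleaned_more = re.sub(r'\[(?:\d+|[a-z])\]', '', sample)
print(cleaned_more)
```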

Now we have a text, but how do we find the named entities?  A naive approach would be to look for capitalized words, but that does not work very well.

In [None]:
for word in text.split():
    if re.match(r'[A-Z]', word):
        print(word)
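
A small made-up example shows why the capital-letter test falls short. The sentence below is invented for illustration.

```python
import re

sample = "The author Lewis Carroll lived in Oxford. He taught mathematics."

# Keep every whitespace-separated word that starts with a capital letter
capitalized = [word for word in sample.split() if re.match(r'[A-Z]', word)]
print(capitalized)
```

The result contains `The` and `He`, which are only capitalized because they begin a sentence. It also splits `Lewis Carroll` into two separate items and leaves the full stop attached to `Oxford.`.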

In [None]:
with open('carroll.txt', mode='w', encoding='utf-8') as f:
    f.write(text)

## spaCy for NER

In [None]:
from spacy import displacy

In [None]:
tags = nlp(text)

In [None]:
displacy.render(tags, jupyter=True, style='ent')
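
The highlighted view is good for reading, but for counting it helps to go through the entities directly. Each entity in `doc.ents` has a `text` and a `label_`. The helper below is a sketch, and the name `entity_counts` is invented for this example.

```python
from collections import Counter

def entity_counts(doc, label):
    """Count the surface forms of entities carrying the given label, e.g. 'PERSON' or 'GPE'."""
    return Counter(ent.text for ent in doc.ents if ent.label_ == label)
```

For instance, `entity_counts(tags, 'PERSON').most_common(10)` would list the people mentioned most often. If a label is unfamiliar, `spacy.explain('GPE')` returns a short description of it.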

## Dependency parsing

See [this introduction](https://universaldependencies.org/introduction.html) to Universal Dependencies.

In [None]:
alice_tags = nlp('''Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?' ''')

In [None]:
displacy.render(alice_tags, jupyter=True, style='dep')

In [None]:
sentence_spans = list(tags.sents)
displacy.render(sentence_spans, jupyter=True, style="dep")

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text, token.head.pos_,
          list(token.children))

In [None]:
for token in doc:
    if token.dep_ == 'amod':
        print(token.text)
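
`amod` marks an adjectival modifier. Every token has exactly one `head`, so from an `amod` token you can step up to the word it modifies. The sketch below pairs each adjective with its head. The function name `adjective_noun_pairs` is made up for illustration.

```python
def adjective_noun_pairs(doc):
    """Return (adjective lemma, head lemma) for every token attached as 'amod'."""
    return [(token.lemma_, token.head.lemma_)
            for token in doc
            if token.dep_ == 'amod']
```

Passing the result to `collections.Counter` and calling `most_common(10)` gives the ten most frequent adjective and noun combinations.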

In [None]:
for token in doc:
    for child in token.children:
        if child.text == 'little':
            print(token)

In [None]:
for token in doc:
    for child in token.children:
        if child.text == 'little' and token.pos_ == 'NOUN':
            print(token)

In [None]:
from collections import Counter

adjectives = ['little', 'small']
#adjectives = ['little', 'big', 'large', 'small']
things = Counter()
for token in doc:
    for child in token.children:
        # count nouns that have one of the chosen adjectives as a direct dependent
        if child.lemma_ in adjectives and token.pos_ == 'NOUN':
            things[token.lemma_] += 1
things.most_common()

## Jane Austen

What are Jane Austen's favourite adjectives?

In [None]:
import requests
# austen_resp = requests.get("https://www.gutenberg.org/files/31100/31100.txt")
austen_resp = requests.get("https://www.gutenberg.org/cache/epub/1342/pg1342.txt", 
                           headers=headers)

In [None]:
austen_resp

In [None]:
austen = austen_resp.content.decode()

In [None]:
austen

In [None]:
# Find where the novel proper begins instead of hard-coding the offset
start = austen.index('It is a truth')
start

In [None]:
austen[start:start + 4000]

In [None]:
austen = austen[start:]

In [None]:
austen[0:1000]
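
Slicing from the opening words drops the front matter, but the licence text at the end of the file is still there. A sketch of trimming both ends follows. The function name `trim_gutenberg` is invented, and the `*** END OF` marker is an assumption about how Project Gutenberg files usually close the book, so check it against the file you downloaded.

```python
def trim_gutenberg(text, first_words, end_marker="*** END OF"):
    """Keep the text from `first_words` up to (not including) `end_marker`, if it is present."""
    start = text.index(first_words)      # raises ValueError if the phrase is missing
    end = text.find(end_marker, start)   # -1 when the marker is absent
    return text[start:] if end == -1 else text[start:end]
```

Applied to the freshly downloaded text, this would be `trim_gutenberg(austen, 'It is a truth')`.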

In [None]:
import re
austen = re.sub(r'[\r\n]+', ' ', austen)

In [None]:
austen[0:1000]

In [None]:
austen = re.sub(r'\[.*?\]+', ' ', austen, flags=re.S)

In [None]:
austen[0:1000]

In [None]:
austen_doc = nlp(austen)

In [None]:
for token in austen_doc:
    if token.dep_ == 'amod':
        print(token.lemma_)

In [None]:
adjectives = Counter()
for token in austen_doc:
    if token.dep_ == 'amod':
        adjectives[token.lemma_] += 1
adjectives.most_common(25)

In [None]:
women = Counter()
for token in austen_doc:
    if token.lemma_ in ['girl', 'woman', 'wife', 'lady']:
        for child in token.children:
            if child.pos_ == 'ADJ':
#                 print(child.pos_, child.lemma_)
                women[child.lemma_] += 1
women.most_common()

In [None]:
austen_resp = requests.get("https://www.gutenberg.org/files/31100/31100-0.txt", 
                           headers=headers)

In [None]:
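# Decode the bytes as UTF-8 (this codec also drops a leading byte-order mark if there is one)
austen = austen_resp.content.decode('utf-8-sig')

In [None]:
# Collapse line breaks into single spaces
austen = re.sub(r'[\r\n]+', ' ', austen)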

In [None]:
austen[0:2000]

In [None]:
nlp.max_length = len(austen) + 1000
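
spaCy refuses texts longer than `nlp.max_length` (1,000,000 characters by default) because the parser and the entity recogniser need a lot of memory on long inputs. Raising the limit works if you have enough memory. An alternative is to split the text into pieces and process them one at a time. The sketch below uses a made-up helper, `chunk_text`, with an arbitrary default size. Note that a cut can fall in the middle of a sentence, so a few dependency relations near the boundaries may come out wrong.

```python
def chunk_text(text, max_chars=100_000):
    """Yield pieces of `text` no longer than `max_chars`, breaking at a space where possible."""
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            space = text.rfind(' ', start, end)
            if space > start:
                end = space + 1   # keep the space with the earlier piece
        yield text[start:end]
        start = end
```

The pieces can then be fed to `nlp.pipe(chunk_text(austen))`, which yields one `Doc` per piece, and the counting loop can run over each `Doc` in turn.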

In [None]:
austen_doc = nlp(austen)

In [None]:
women = Counter()
for token in austen_doc:
    if token.lemma_ in ['girl', 'woman', 'wife', 'lady', 'she', 'her']:
        for child in token.children:
            if child.pos_ == 'ADJ':
#                 print(child.pos_, child.lemma_)
                women[child.lemma_] += 1
women.most_common()