# Advanced text processing

To find out more about a text, we often need to understand its linguistic structure. In the previous example, we used regular expressions to identify individual words and a stopword list to identify content words. But linguistic tasks quickly become more complex, e.g. when we try to identify syntactic structure or the persons a text is talking about.

These tasks define a whole area at the intersection of linguistics and computer science called natural language processing (NLP). For Python, several packages for this kind of analysis are available, including the Natural Language Toolkit ([NLTK](http://www.nltk.org/)) and [spaCy](https://spacy.io/).

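To use the spaCy module for natural language processing, we have to install it and download its model files for the English language. The commands below use the `en` shortcut, which spaCy 3 and later no longer accept; there, a full model name such as `en_core_web_sm` is needed, both in the download command and in `spacy.load`: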

    conda install spacy
    python -m spacy download en

We can now load the model:

In [1]:
import spacy
nlp = spacy.load('en')

Using the example text from the previous lesson, we can test some of spaCy’s capabilities:

In [2]:
with open('heimskringla_preface.txt') as textfile:
    text = textfile.read()

# spaCy gets confused by line breaks in the text, so we replace them with spaces.
text = text.replace('\n', ' ')
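
Replacing each line break with a space can leave runs of two or more spaces behind, which is presumably where `SPACE` tokens in the tagged output come from. A minimal sketch of a stricter clean-up, using only the standard library; the helper name `normalize_whitespace` and the position of the line breaks in the sample string are assumptions made for illustration:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() without arguments splits on any run of whitespace and drops
    # leading and trailing whitespace, so joining with ' ' leaves single spaces.
    return ' '.join(text.split())


# Sample string; where the line breaks fall is an assumption.
sample = 'PREFACE OF SNORRE STURLASON.\n\nIn this book I have had\nold stories written down'

naive = sample.replace('\n', ' ')       # keeps a double space after the heading
cleaned = normalize_whitespace(sample)  # single spaces only

print(repr(naive))
print(repr(cleaned))
```

Whether the extra spaces matter depends on the analysis; for counting words by part of speech they only add `SPACE` entries that can also be filtered out later.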

The core functionality of spaCy can be accessed by creating a processed version of the text that holds information about the text’s words, sentences, and more.

In [3]:
doc = nlp(text)

The resulting document behaves like a list of words, but each word carries additional information. For example, we can look up each word’s word class (its “part of speech” or “POS”).

In [4]:
for word in doc[0:20]:
    print(word.text, '-', word.pos_)  # Note: `.pos_` is the readable label; `.pos` is only an integer ID.

PREFACE - NOUN
OF - ADP
SNORRE - PROPN
STURLASON - PROPN
. - PUNCT
  - SPACE
In - ADP
this - DET
book - NOUN
I - PRON
have - VERB
had - VERB
old - ADJ
stories - NOUN
written - VERB
down - PART
, - PUNCT
as - ADP
I - PRON
have - VERB


For the meaning of the individual tags, see [SpaCy’s documentation](https://spacy.io/api/annotation#pos-tagging).

To access structural information other than words, we can use special properties of the document, e.g. `doc.sents` for sentences.

*Technical note:* The sentences are not a `list` object, but something called a “generator.” In contrast to a list, a generator produces its elements one at a time, on demand. As a consequence, it has no known length and does not support the usual indexing. To print only the first few sentences, we convert the generator to a proper list.

In [5]:
sentences = list(doc.sents)
for sentence in sentences[0:3]:
    print(sentence)

PREFACE OF SNORRE STURLASON.  
In this book I have had old stories written down, as I have heard them told by intelligent people, concerning chiefs who have have held dominion in the northern countries, and who spoke the Danish tongue; and also concerning some of their family branches, according to what has been told me.
Some of this is found in ancient family registers, in which the pedigrees of kings and other personages of high birth are reckoned up, and part is written down after old songs and ballads which our forefathers had for their amusement.
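
Converting the generator to a list processes all sentences, even if only the first few are needed. As an alternative, `itertools.islice` from the standard library takes the first items and then stops. The sketch below uses a stand-in generator (the hypothetical `numbered_sentences`) in place of `doc.sents`, so that it runs without spaCy:

```python
from itertools import islice


def numbered_sentences():
    """Stand-in generator: yields its items one at a time, like `doc.sents`."""
    for number in range(1, 6):
        yield f'Sentence {number}.'


# A generator supports neither len() nor indexing, but islice() can take
# the first n items without building the complete list first.
first_three = list(islice(numbered_sentences(), 3))
print(first_three)
```

For a short text the difference is negligible; the list conversion used above has the advantage that the sentences can be indexed and reused afterwards.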


We can use the structural information to gain additional insights into the text’s statistical properties. For example, we can find the most frequent nouns:

In [6]:
from collections import Counter

nouns = [w.text for w in doc if w.pos_ == 'NOUN']
noun_counts = Counter(nouns)
noun_counts.most_common(10)

[('who', 4),
 ('time', 4),
 ('people', 3),
 ('chiefs', 3),
 ('family', 3),
 ('what', 3),
 ('songs', 3),
 ('poem', 3),
 ('son', 3),
 ('death', 3)]
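
The list contains "who" and "what", which the tagger labelled as nouns here although they carry little content. One possible remedy is to combine the part-of-speech filter with a stopword list and to lowercase the words before counting. The following sketch works on plain `(text, pos)` tuples instead of spaCy tokens; both the sample pairs and the tiny stopword set are hypothetical:

```python
from collections import Counter

# Hypothetical stand-in for (w.text, w.pos_) pairs taken from a spaCy document.
tagged_words = [
    ('who', 'NOUN'), ('Time', 'NOUN'), ('time', 'NOUN'),
    ('chiefs', 'NOUN'), ('what', 'NOUN'), ('told', 'VERB'),
]

# Hypothetical, deliberately tiny stopword list for illustration only.
stopwords = {'who', 'what'}

# Keep nouns only, lowercase them, and drop anything on the stopword list.
nouns = [
    text.lower()
    for text, pos in tagged_words
    if pos == 'NOUN' and text.lower() not in stopwords
]
noun_counts = Counter(nouns)
print(noun_counts.most_common())
```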

As the noun counts show, different forms of the same word are counted as different words. Here, this mainly affects plural forms ("chiefs" vs. "chief"), but for highly inflected languages the problem becomes more severe. Often, we want to count the base forms, or lemmas, instead of the inflected word forms.

In [7]:
noun_lemmas = [w.lemma_ for w in doc if w.pos_ == 'NOUN']
noun_lemma_counts = Counter(noun_lemmas)
noun_lemma_counts.most_common(10)

[('poem', 5),
 ('time', 5),
 ('chief', 4),
 ('who', 4),
 ('son', 4),
 ('people', 3),
 ('family', 3),
 ('what', 3),
 ('song', 3),
 ('skald', 3)]

The count of "poem" increased from 3 to 5. This does indeed stem from the plural forms that are now included in the count:

In [8]:
[w.text for w in doc if w.lemma_ == 'poem']

['poem', 'poem', 'poem', 'poems', 'poems']
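
The same check can be made for all lemmas at once by grouping the word forms under their lemma. A minimal sketch with `collections.defaultdict`, run on hypothetical `(text, lemma)` pairs that stand in for `(w.text, w.lemma_)`:

```python
from collections import defaultdict

# Hypothetical stand-in for (w.text, w.lemma_) pairs from a spaCy document.
word_lemma_pairs = [
    ('poem', 'poem'), ('poems', 'poem'), ('chiefs', 'chief'),
    ('chief', 'chief'), ('poems', 'poem'),
]

# Map each lemma to the set of distinct word forms it was derived from.
forms_by_lemma = defaultdict(set)
for text, lemma in word_lemma_pairs:
    forms_by_lemma[lemma].add(text)

for lemma, forms in sorted(forms_by_lemma.items()):
    print(lemma, '-', sorted(forms))
```

Lemmas with more than one form are the ones whose counts change when switching from word forms to lemmas.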

Instead of filtering the words according to their word class, we can also build statistics about the distribution of the word classes:

In [9]:
pos_tags = [w.pos_ for w in doc]
pos_counter = Counter(pos_tags)
pos_counter.most_common()

[('NOUN', 110),
 ('VERB', 98),
 ('ADP', 87),
 ('PUNCT', 68),
 ('DET', 63),
 ('ADJ', 59),
 ('PROPN', 49),
 ('CCONJ', 33),
 ('PRON', 20),
 ('ADV', 20),
 ('PART', 14),
 ('SPACE', 5),
 ('NUM', 1)]
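
Absolute counts are hard to compare between texts of different length, so it can help to convert them to relative frequencies. The sketch below re-enters the counts printed above by hand, so that it runs on its own; in the notebook, `pos_counter` would already be available:

```python
from collections import Counter

# Counts copied from the output above.
pos_counter = Counter({
    'NOUN': 110, 'VERB': 98, 'ADP': 87, 'PUNCT': 68, 'DET': 63,
    'ADJ': 59, 'PROPN': 49, 'CCONJ': 33, 'PRON': 20, 'ADV': 20,
    'PART': 14, 'SPACE': 5, 'NUM': 1,
})

# Share of each tag among all tokens, shown for the five most frequent tags.
total = sum(pos_counter.values())
for tag, count in pos_counter.most_common(5):
    print(f'{tag:<6}{count / total:6.1%}')
```

Whether `PUNCT` and `SPACE` should be part of the total is a design choice; excluding them gives the shares among actual words.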

A more complex kind of analysis is the recognition of named entities. This term refers to words, or groups of words, that denote identifiable entities like persons, groups, or places. SpaCy identifies them for us:

In [10]:
for ent in doc.ents[0:10]:
    print(ent.text, '-', ent.label_)

Danish - NORP
Thjodolf of Hvin - PERSON
Harald Harfager - PERSON
Ynglingatal - WORK_OF_ART
Olaf Geirstadalf - ORG
King Halfdan - ORG
thirty - CARDINAL
Fjolner - ORG
Yngvefrey - GPE
Swedes - NORP


Similarly, we can count how many entities of each category are found throughout the text:

In [11]:
ne_tags = [e.label_ for e in doc.ents]
ne_counter = Counter(ne_tags)
ne_counter.most_common()

[('PERSON', 10),
 ('GPE', 9),
 ('ORG', 5),
 ('NORP', 4),
 ('WORK_OF_ART', 2),
 ('CARDINAL', 1),
 ('DATE', 1)]

To understand what these labels mean, see the [list of codes](https://spacy.io/api/annotation#named-entities) for named entities.

This can help us to find all the persons mentioned in the text:

In [12]:
persons = [e.text for e in doc.ents if e.label_ == 'PERSON']
persons

['Thjodolf of Hvin',
 'Harald Harfager',
 'Eyvind Skaldaspiller',
 'Earl Hakon',
 'Thjodolf',
 'Frey',
 'Dan Milkillate',
 'Harald Harfager',
 'the King of Norway',
 'Harald']
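
Since the list contains one entry per mention, some names appear more than once. Counting them gives a rough picture of who is mentioned most often. The sketch below re-enters the list printed above, so that it runs without spaCy:

```python
from collections import Counter

# The list of PERSON entities printed above.
persons = [
    'Thjodolf of Hvin', 'Harald Harfager', 'Eyvind Skaldaspiller',
    'Earl Hakon', 'Thjodolf', 'Frey', 'Dan Milkillate',
    'Harald Harfager', 'the King of Norway', 'Harald',
]

# Count identical strings; different spellings of a name stay separate.
person_counts = Counter(persons)
print(person_counts.most_common(3))
```

The counts compare strings only. Short forms such as "Harald" or "Thjodolf" may well refer to the same persons as the longer names, but deciding that requires knowledge of the text.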

This can be a starting point for a more content-oriented analysis, e.g. by identifying the words frequently associated with certain persons.
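
One simple notion of "associated" is that a word occurs in the same sentence as the person's name. The following sketch illustrates the idea on hypothetical toy sentences given as lists of lowercase tokens; it does not use spaCy, and the function name `cooccurring_words` is made up for this example:

```python
from collections import Counter

# Hypothetical toy sentences, already split into lowercase tokens.
sentences = [
    ['harald', 'ruled', 'norway'],
    ['the', 'skald', 'praised', 'harald'],
    ['the', 'skald', 'wrote', 'a', 'poem'],
]


def cooccurring_words(sentences, name):
    """Count the words that share a sentence with `name`."""
    counts = Counter()
    for tokens in sentences:
        if name in tokens:
            # Count every other token of the sentence once per occurrence.
            counts.update(token for token in tokens if token != name)
    return counts


harald_words = cooccurring_words(sentences, 'harald')
print(harald_words.most_common())
```

On real text, one would probably restrict the count to content words, for instance nouns and verbs, and work with lemmas rather than word forms.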