# Introduction to Text Analysis for Python Users

## What is text analysis and why do we use it?

It is easy for machines to read data that has a simple, preconceived structure. For example, below is a table of five famous statues around the world. Each row in the table represents a different statue, and each column consistently represents a piece of information (or variable) related to the statue.

We could easily use a piece of software like Microsoft Excel to answer questions about this data, such as "Which statue was built first?" and "Which country in this table has the most statues?"

Compare that table to the descriptive paragraph of text below.

"The largest statue in the world is India's Statue of Unity. It is 182 meters tall and was completed in 2018. The second largest statue comes in at 128 meters and is called the Spring Temple Buddha. It was built in China in 2008. Next is Laykun Sekkya, a statue in Myanmar that is 115.8 meters. It was also built in 2008. Built in 2020, the Statue of Belief stands in India at 106 meters tall. The fifth largest is Ushiku Daibutsu, a Japanese statue built in 1993. It is 100 meters tall."

Although the text maintains structures inherent in language (grammar, punctuation, parts of speech, etc.), those structures are complex and inconsistent, so much so that we often refer to text as unstructured. Through text analysis, we can extract machine-readable information from unstructured text.
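To make the contrast concrete, here is a minimal sketch of the same facts once they are structured. The records are transcribed by hand from the paragraph above; the variable name `statue_rows`, the field names, and the reading of "a Japanese statue" as the country Japan are assumptions made for illustration. Once the data has this shape, questions like the ones posed earlier take only a line or two of code.

```python
from collections import Counter

# Facts transcribed by hand from the descriptive paragraph (not extracted automatically).
# "Japan" is inferred from the phrase "a Japanese statue".
statue_rows = [
    {"name": "Statue of Unity", "country": "India", "height_m": 182.0, "year": 2018},
    {"name": "Spring Temple Buddha", "country": "China", "height_m": 128.0, "year": 2008},
    {"name": "Laykun Sekkya", "country": "Myanmar", "height_m": 115.8, "year": 2008},
    {"name": "Statue of Belief", "country": "India", "height_m": 106.0, "year": 2020},
    {"name": "Ushiku Daibutsu", "country": "Japan", "height_m": 100.0, "year": 1993},
]

# "Which statue was built first?"
oldest = min(statue_rows, key=lambda row: row["year"])
print(oldest["name"], oldest["year"])

# "Which country has the most statues?"
country_counts = Counter(row["country"] for row in statue_rows)
top_country, top_count = country_counts.most_common(1)[0]
print(top_country, top_count)
```

Getting from the paragraph to records like these automatically is the kind of problem text analysis addresses.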

To extract meaningful pieces of information from the paragraph in the example above, we would use a method within text analysis known as Named Entity Recognition. There are many other types of text analysis that solve various other problems. Below are the methods we will discuss in this workshop.

- Named Entity Recognition: classifies named entities found in unstructured text into pre-defined categories.
- Supervised Classification
- Unsupervised Classification
- Frequencies
- Collocation

## Setting up your environment



Before beginning this workshop, you should have experience using Python in an IDE you are comfortable with. We will be downloading and using files, so be sure you are familiar with your working directory and are downloading your files to the correct location. You should also have the following libraries installed:

- NLTK
- SpaCy

You should also run the following lines in your terminal:

python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger
python -m nltk.downloader maxent_ne_chunker
python -m nltk.downloader words
python -m spacy download en_core_web_sm
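Before moving on, it can help to confirm that Python can actually find the packages. The following is a minimal sketch using only the standard library; the names `required` and `status` are illustrative. It checks only whether the packages are importable and does not verify that the NLTK data files were downloaded.

```python
import importlib.util

# Packages this workshop imports; en_core_web_sm is installed as a regular Python package.
required = ["nltk", "spacy", "en_core_web_sm"]

# find_spec returns None when a top-level package cannot be found.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, found in status.items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```

If anything is reported as missing, install it and run the check again before continuing.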

## Named Entity Recognition

Let's start with the example at the beginning of this workshop. To extract meaningful information from the paragraph of text about statues, we'll use both the NLTK and SpaCy libraries.

In [5]:
import nltk
import spacy
import en_core_web_sm

statues = "The largest statue in the world is India's Statue of Unity. It is 182 meters tall and was completed in 2018. The second largest statue comes in at 128 meters and is called the Spring Temple Buddha. It was built in China in 2008. Next is Laykun Sekkya, a statue in Myanmar that is 115.8 meters. It was also built in 2008. Built in 2020, the Statue of Belief stands in India at 106 meters tall. The fifth largest is Ushiku Daibutsu, a Japanese statue built in 1993. It is 100 meters tall."

### Tokenizing

Tokenization is the first step in most text analysis methods. It involves breaking down our text into smaller pieces such as sentences, multi-word groups, or individual words. For this example, we'll first tokenize our text by sentences, then by individual words.

In [41]:
# tokenize by sentence
statues_sent = nltk.sent_tokenize(statues)

statues_sent

["The largest statue in the world is India's Statue of Unity.",
 'It is 182 meters tall and was completed in 2018.',
 'The second largest statue comes in at 128 meters and is called the Spring Temple Buddha.',
 'It was built in China in 2008.',
 'Next is Laykun Sekkya, a statue in Myanmar that is 115.8 meters.',
 'It was also built in 2008.',
 'Built in 2020, the Statue of Belief stands in India at 106 meters tall.',
 'The fifth largest is Ushiku Daibutsu, a Japanese statue built in 1993.',
 'It is 100 meters tall.']

As you can see above, NLTK's tokenizer takes a single string of text and creates a list of separate strings for each sentence. Now let's break each sentence down into separate tokens.

In [52]:
# create an empty list to hold our tokenized sentences
statues_tk = []

# tokenize each sentence and append it to the list
for sent in statues_sent:
    toks = nltk.word_tokenize(sent)
    statues_tk.append(toks)

statues_tk[0]

['The',
 'largest',
 'statue',
 'in',
 'the',
 'world',
 'is',
 'India',
 "'s",
 'Statue',
 'of',
 'Unity',
 '.']

Each sentence has been broken down into a list with one string per token. Note what happens with possessive nouns and punctuation: "India's" is split into "India" and "'s", and the final period becomes its own token. This is not a problem for our current task, but we will be exploring ways to handle such issues later on.
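To see why a dedicated tokenizer is worth using, compare it with plain whitespace splitting. The sketch below uses only the standard library; the regular expression is a rough illustration written for this one paragraph and is not the algorithm NLTK uses.

```python
import re

sentence = "The largest statue in the world is India's Statue of Unity."

# Naive approach: split on whitespace. Punctuation stays glued to the words.
naive_tokens = sentence.split()
print(naive_tokens)

# Rough regex approach: possessive 's, numbers (optionally with a decimal part),
# runs of word characters, or any single punctuation mark.
token_pattern = r"'s|\d+(?:\.\d+)?|\w+|[^\w\s]"
regex_tokens = re.findall(token_pattern, sentence)
print(regex_tokens)
```

With whitespace splitting, the last token is `'Unity.'` and `"India's"` stays in one piece, whereas the regex version separates the possessive and the final period as in the NLTK output above. A hand-written pattern like this breaks down quickly on abbreviations, contractions, and quotations, which is why library tokenizers are preferred in practice.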

### Parts of speech tagging

Parts of speech tagging allows us to label individual tokens with their lexical categories, such as nouns, adjectives, and verbs. The NLTK library includes a function called `pos_tag()` that will do this for us.

In [67]:
# create an empty list to hold our tagged sentences
statues_pos = []

# tag each sentence with parts of speech and append it to the list
for sent in statues_tk:
    tags = nltk.pos_tag(sent)
    statues_pos.append(tags)

statues_pos[0]

[('The', 'DT'),
 ('largest', 'JJS'),
 ('statue', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('is', 'VBZ'),
 ('India', 'NNP'),
 ("'s", 'POS'),
 ('Statue', 'NNP'),
 ('of', 'IN'),
 ('Unity', 'NNP'),
 ('.', '.')]

NLTK's POS tagging tool is based on the [Penn Treebank Project](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1246&context=cis_reports), so to understand the label each token has been assigned, we'll need to check the [Penn Treebank Project list of tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). As we can see, "The" is labeled as a determiner, "largest" as a superlative adjective, etc.
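Looking up tags one at a time gets tedious, so a small lookup table can help while reading output. The sketch below is a partial glossary covering only the tags that appear in the first sentence; the names `PENN_TAGS`, `tagged_sentence`, and `described` are illustrative, and the tagged tokens are copied from the output above rather than recomputed.

```python
# Partial glossary: only the Penn Treebank tags seen in the first sentence.
PENN_TAGS = {
    "DT": "determiner",
    "JJS": "adjective, superlative",
    "NN": "noun, singular or mass",
    "IN": "preposition or subordinating conjunction",
    "VBZ": "verb, 3rd person singular present",
    "NNP": "proper noun, singular",
    "POS": "possessive ending",
    ".": "sentence-final punctuation",
}

# Copied from the statues_pos[0] output shown above.
tagged_sentence = [
    ("The", "DT"), ("largest", "JJS"), ("statue", "NN"), ("in", "IN"),
    ("the", "DT"), ("world", "NN"), ("is", "VBZ"), ("India", "NNP"),
    ("'s", "POS"), ("Statue", "NNP"), ("of", "IN"), ("Unity", "NNP"), (".", "."),
]

# Pair each token with a readable description of its tag.
described = [
    (token, PENN_TAGS.get(tag, "not in this partial glossary"))
    for token, tag in tagged_sentence
]

for token, description in described:
    print(f"{token:<8} {description}")
```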

### Named Entity Recognition with NLTK

Tagging parts of speech begins to provide structure to our text, but it doesn't tell us which words and phrases represent important entities, such as countries, names, etc. In order to find those, we'll need to use another function provided by NLTK called `ne_chunk()`. Let's test that function on just one of our tagged sentences:

In [97]:
test = nltk.ne_chunk(statues_pos[2])

print(test)

(S
  The/DT
  second/JJ
  largest/JJS
  statue/NN
  comes/VBZ
  in/IN
  at/IN
  128/CD
  meters/NNS
  and/CC
  is/VBZ
  called/VBN
  the/DT
  (ORGANIZATION Spring/NN Temple/NNP Buddha/NNP)
  ./.)


We can see that NLTK has chunked the "Spring", "Temple" and "Buddha" tokens into a single entity. That chunk belongs to the object class `nltk.tree.Tree`.

In [106]:
type(test[13])

nltk.tree.Tree

To see all of our named entities in the text, let's print out just the chunks that belong to the `nltk.tree.Tree` class:

In [75]:
for sent in statues_pos:
    chunks = nltk.ne_chunk(sent)

    # named entities are the children that are themselves trees
    for chunk in chunks:
        if isinstance(chunk, nltk.tree.Tree):
            print(chunk)

(GPE India/NNP)
(GPE Unity/NNP)
(ORGANIZATION Spring/NN Temple/NNP Buddha/NNP)
(GPE China/NNP)
(PERSON Laykun/NNP Sekkya/NNP)
(GPE Myanmar/NNP)
(GPE Belief/NNP)
(GPE India/NNP)
(PERSON Ushiku/JJ Daibutsu/NNP)
(GPE Japanese/JJ)


This is a rather disappointing list. Although country names have been labelled as Geopolitical Entities (GPE), so have fragments of statue names such as "Unity" and "Belief". The full names of some of our statues have been lost, and neither the quantities nor the years in the text have been tagged at all.
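Printing the chunks is fine for inspection, but for further processing it is usually more convenient to collect each entity as a (label, text) pair. The sketch below assumes NLTK is installed and uses the `label()` and `leaves()` methods of `nltk.tree.Tree`. So that it runs without any downloaded models, the tree is rebuilt by hand from the chunked sentence printed earlier; the helper name `extract_entities` is hypothetical.

```python
from nltk.tree import Tree

# Hand-built copy of the chunked sentence printed earlier (the result of nltk.ne_chunk).
sentence_tree = Tree("S", [
    ("The", "DT"), ("second", "JJ"), ("largest", "JJS"), ("statue", "NN"),
    ("comes", "VBZ"), ("in", "IN"), ("at", "IN"), ("128", "CD"),
    ("meters", "NNS"), ("and", "CC"), ("is", "VBZ"), ("called", "VBN"),
    ("the", "DT"),
    Tree("ORGANIZATION", [("Spring", "NN"), ("Temple", "NNP"), ("Buddha", "NNP")]),
    (".", "."),
])


def extract_entities(tree):
    """Return (label, text) pairs for every named-entity subtree directly under `tree`."""
    entities = []
    for chunk in tree:
        if isinstance(chunk, Tree):
            # leaves() yields the (token, tag) tuples inside the chunk
            text = " ".join(token for token, _tag in chunk.leaves())
            entities.append((chunk.label(), text))
    return entities


entities = extract_entities(sentence_tree)
print(entities)
```

The same helper could be applied to the result of `nltk.ne_chunk(sent)` for each sentence in `statues_pos`.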

Let's try using a different library to find named entities. This time, we'll use SpaCy.

### Named Entity Recognition with SpaCy

SpaCy has a more streamlined approach to tagging. We can simply load a "pipeline" that does all of the tokenizing, parts of speech tagging, and named entity recognition for us in one go. We'll use the `en_core_web_sm` pipeline and apply it to our text. This will create a special class of object called `spacy.tokens.doc.Doc`.

In [89]:
nlp = en_core_web_sm.load()
statues_nlp = nlp(statues)

type(statues_nlp)

spacy.tokens.doc.Doc

Our `Doc` object has many useful attributes. Named entities are in an attribute called `ents`.

In [91]:
statues_nlp.ents

(India,
 Statue of Unity,
 182 meters,
 2018,
 second,
 128 meters,
 the Spring Temple Buddha,
 China,
 2008,
 Laykun Sekkya,
 Myanmar,
 115.8 meters,
 2008,
 2020,
 the Statue of Belief,
 India,
 106 meters,
 fifth,
 Ushiku Daibutsu,
 Japanese,
 1993,
 100 meters)

Each `ent` has `text` and `label_` attributes. Printing those attributes provides a nice list of all the named entities in our text.

In [95]:
for ent in statues_nlp.ents:
    print(ent.text, ent.label_)

India GPE
Statue of Unity FAC
182 meters QUANTITY
2018 DATE
second ORDINAL
128 meters QUANTITY
the Spring Temple Buddha FAC
China GPE
2008 DATE
Laykun Sekkya PERSON
Myanmar GPE
115.8 meters QUANTITY
2008 DATE
2020 DATE
the Statue of Belief FAC
India GPE
106 meters QUANTITY
fifth ORDINAL
Ushiku Daibutsu PERSON
Japanese NORP
1993 DATE
100 meters QUANTITY
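A flat list of entities becomes easier to work with once it is grouped by label. The sketch below uses only the standard library; the pairs in `entity_pairs` are copied from the output above rather than computed, so it runs without SpaCy. With a live `Doc`, the same pairs could be built from `ent.text` and `ent.label_`. The names `entity_pairs` and `by_label` are illustrative.

```python
from collections import defaultdict

# (text, label) pairs copied from the SpaCy output shown above.
entity_pairs = [
    ("India", "GPE"), ("Statue of Unity", "FAC"), ("182 meters", "QUANTITY"),
    ("2018", "DATE"), ("second", "ORDINAL"), ("128 meters", "QUANTITY"),
    ("the Spring Temple Buddha", "FAC"), ("China", "GPE"), ("2008", "DATE"),
    ("Laykun Sekkya", "PERSON"), ("Myanmar", "GPE"), ("115.8 meters", "QUANTITY"),
    ("2008", "DATE"), ("2020", "DATE"), ("the Statue of Belief", "FAC"),
    ("India", "GPE"), ("106 meters", "QUANTITY"), ("fifth", "ORDINAL"),
    ("Ushiku Daibutsu", "PERSON"), ("Japanese", "NORP"), ("1993", "DATE"),
    ("100 meters", "QUANTITY"),
]

# Group entity texts under their labels, preserving the order of appearance.
by_label = defaultdict(list)
for text, label in entity_pairs:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```

Grouping this way also makes the remaining mistakes easy to spot: two of the statue names, Laykun Sekkya and Ushiku Daibutsu, end up under `PERSON` rather than `FAC`.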


These results are much more accurate. So why does named entity recognition work so much better in SpaCy than it does in NLTK?

The answer lies in **classification**. NLTK and SpaCy use different techniques for classifying tokens into different categories. Classification of words, phrases and documents is one of the most common applications of text analysis used today, and there are many classification techniques that can be applied to a text analysis problem. In this workshop, we'll explore two types of classification techniques: supervised and unsupervised.

## Supervised Classification

#### Stop words

#### Lemmatization

### Bag of words techniques

#### Frequencies

#### N-grams/co-occurrence

### Classification

#### Supervised

##### Training a model

##### Evaluating results

##### Tweaking/comparing models (prob. Naïve Bayes and SVM)

#### Unsupervised

##### Explain and demonstrate various models (WLDA, clustering, etc.)