# Tutorial: Basics of Computational Text Analysis (with Python)

## Contents

- Introduction
- Loading Text, Encodings, Memory
    - Example
    - Resources
- Manipulating Text: Regular Expressions
    - Example
    - Resources
- Tokenization
    - Example
    - Resources
- Document Term Matrix
    - Example
    - Resources
- The Sparsity Problem
- Text Preprocessing
- Stemming and Lemmatization
- More Resources

## Introduction


- Scope: Lay out what can be done. Concentrate on bag-of-words methods. Explain the basics of how to get from raw text to a DTM.

- What can and can't be done with computers

- Common (analytic) tasks
    - Supervised problems
        - Text classification
        - Sentiment Analysis
        - Named Entity Recognition
    - Unsupervised Problems
        - Thematic Clustering (e.g. Topic Modelling)
        - Information Retrieval
        - Document similarity
       
- Text as Data
    - Bag of Words
    - Natural Language Tools
    
- From dirty text to a Document-Term-Matrix (DTM)
A DTM is a good basis for many analyses (from counting words to sophisticated statistical models). It is not always necessary, but it is good to have. Most issues in natural language processing come up when creating or working with DTMs.

- What we will do:
    1. Load text into python (Republican Debate)
    2. Clean the text
    3. Parse it into documents
    4. Tokenize
    5. Create a Document-Term-Matrix
    6. Reduce dimensionality
    
    
I want to show basic concepts, so I will program them from scratch. In practice you would use optimized packages. I don't go into detail about them here, but I point to them in the `Resources` subsection of each section.

Terminology:
- Corpus
- Document
- Token
- Term
- Type
- Document Term Matrix
- Stopwords

## Loading Text, Encodings, Memory

Write something here

### Example

In [None]:
import io

## Loading Text into memory
raw_text_connection = io.open('rep_debate.txt', mode='r', encoding='utf-8')

print raw_text_connection

In [None]:
raw_text = raw_text_connection.read()
raw_text[0:1000]

Some explanations: The `u` in front of the output means this is a unicode string. Unicode is a standard that assigns a code point to (almost) every character in use; UTF-8 is one encoding that stores these code points as bytes. A backslash introduces an escape sequence. For example, `\n` is a newline and `\u2019` is the Unicode escape for the right single quotation mark `’`, so `\u201911` is that character followed by `11`. To see the nicer formatting where special characters are interpreted we can `print` it:

In [None]:
print raw_text[0:1000]

In [None]:
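## Reading with the wrong encoding: this fails with a UnicodeDecodeError
## if the file contains non-ASCII characters such as `’`
raw_text_connection = io.open('rep_debate.txt', mode='r', encoding='ascii')
raw_text = raw_text_connection.read()
raw_text[0:1000]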
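
To make the encoding issue concrete without needing the transcript file, here is a minimal sketch on a short made-up string (the string and variable names are my own, not part of the debate data). It shows that one character can take several bytes in UTF-8 and that decoding those bytes as ASCII fails:

```python
# A short unicode string containing the right single quotation mark (\u2019)
text = u'I\u2019m Carl'

# Encoding turns characters into bytes; \u2019 needs three bytes in UTF-8
utf8_bytes = text.encode('utf-8')
print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes

# ASCII only covers code points 0-127, so decoding these bytes as ASCII fails
try:
    utf8_bytes.decode('ascii')
    decoded_ok = True
except UnicodeDecodeError as error:
    decoded_ok = False
    print(error)

# Decoding with the matching codec restores the original string
restored = utf8_bytes.decode('utf-8')
```

This is the same kind of failure you should expect from the cell above when the file is opened with `encoding='ascii'`.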

In [None]:
## Working on streams
raw_text_connection = io.open('rep_debate.txt', mode='r', encoding='utf-8')
lengths = []

for line_number, line in enumerate(raw_text_connection):
    print line
    lengths.append(len(line))
    if line_number == 5:
        break
        
print lengths
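
The point of working on streams is that only one line has to be in memory at a time. Here is a small sketch of that idea; it uses `io.StringIO` with made-up content as a stand-in for a large file, so the names and the text are assumptions for illustration only:

```python
import io

# Stand-in for a large file on disk (hypothetical content)
fake_file = io.StringIO(u'first line\nsecond, longer line\nthird\n')

n_lines = 0
n_chars = 0
longest = 0

# Iterating over the connection yields one line at a time,
# so the whole text never has to be loaded at once
for line in fake_file:
    n_lines += 1
    n_chars += len(line)
    longest = max(longest, len(line))

print(n_lines)
print(n_chars)
print(longest)
```

With a real file you would iterate over the object returned by `io.open` in exactly the same way.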

### Resources
- Intro to encodings with emphasis on unicode: https://www.w3.org/International/articles/definitions-characters/

## Manipulating Text: Regular Expressions

Regular expressions are templates that can be used to match character sequences. You have probably used something similar already: when you type a query into a search engine, you can often use `*` to match several terms with one query. For example, `read*` would match `reading`, `reader`, `readings`, `reads`, etc. Here `*` acts as a wildcard that says 'match any run of characters except a space'. Note that in actual regular expression syntax `*` means 'zero or more repetitions of the preceding expression', so the equivalent pattern would be `read\S*`.

Regular expressions differ slightly between programming languages. This [regex tester](https://regex101.com/#python) is a really cool resource for checking your regular expressions.
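
As a quick sketch, the search-engine style query `read*` from above could be written as a Python regular expression like this. The sample sentence is made up, and I use `\b` and `\w*` (word boundary, then any run of word characters) as one reasonable translation, not the only one:

```python
import re

# \b anchors the match at the start of a word, \w* matches the rest of the word
query = re.compile(r'\bread\w*')

sample = 'reading a reader about readings, she reads; bread is unrelated'
matches = query.findall(sample)
print(matches)
```

Because of the word boundary, `bread` is not matched even though it contains `read`.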



In [None]:
# Load the re (for regular expression) module
import re

# Some basics
s = 'Good evening, I’m Carl Quintanilla, with my colleagues Becky Quick and John Harwood.'

# Search stuff
results = re.findall(pattern=r'[A-Z][a-z]+', string=s) # Find all Upper Case Words
for result in results:
    print result

In [None]:
# Replace stuff

## Replace every `'`, `,` and `.` with `!!!!!`
## (the curly apostrophe in `I’m` is a different character and is not matched)
print re.sub(pattern=r'[\',\.]', repl='!!!!!', string=s)

### Some basic expressions

- `[]`: A set of characters. E.g. `[A-Z]` matches any single upper case letter
- `.`: Matches any single character except a newline
- `+`: Matches one or more repetitions of the preceding expression
- `?`: Matches zero or one repetition of the preceding expression
- `\n`: Newline
- `\t`: Tab
- `\s`: Any whitespace character, e.g. a space, `\n`, `\t`, `\r`

These are just some examples. See the documentation of Python's `re` module for an exhaustive list, and use Google and the [regex tester](https://regex101.com/#python) extensively.
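
To see the expressions from the list in action, here is a small sketch that applies each of them to a made-up string (all sample strings are my own):

```python
import re

# [A-Z]: any single upper case letter
upper = re.findall(r'[A-Z]', 'Becky Quick')

# . : any single character except a newline, so 'b\ng' is not matched
dot = re.findall(r'b.g', 'bag bug b\ng')

# + : one or more repetitions of the preceding expression
plus = re.findall(r'lo+l', 'll lol loool')

# ? : zero or one repetition of the preceding expression
optional = re.findall(r'colou?r', 'color colour')

# \s : any whitespace character (space, tab, newline, ...)
pieces = re.split(r'\s+', 'one\ttwo\nthree  four')

for result in [upper, dot, plus, optional, pieces]:
    print(result)
```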

### Our Example

Let's use it with our example. First we want to clean the transcript of annotations like `(APPLAUSE)` or `(LAUGHTER)`.

In [None]:
print raw_text[0:1000]

In [None]:
annotation = re.compile(r'\([A-Z]+\)\n?')
annotation.findall(raw_text)[1:20]

In [None]:
new_text = annotation.sub(repl='', string=raw_text)
print new_text[0:1000]

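Now we want to split the transcript into individual utterances so we can attribute each one to its speaker. We observe that each speaker change is indicated by the following pattern: `[NewLine] + NAME + : + [NewLine]`. We can use this regularity to split the text. The expression below is a simplified version of this pattern: it matches any run of upper case letters followed by a colon and does not check for the surrounding newlines.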

In [None]:
speaker = re.compile(r'([A-Z]+:)')

Let's test if it works:

In [None]:
speaker.findall(string=new_text)[1:10]

It works. Now we use this to split the text and then build one big document per speaker containing all of that speaker's utterances.

In [None]:
docs = speaker.split(string=new_text)

from pprint import pprint
pprint(docs[1:10])
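
It is worth looking at why the parentheses in the `speaker` expression matter. When the pattern contains a capturing group, `split` keeps the matched separators in the result; without the group they are dropped. Here is a sketch on a tiny made-up transcript (the sample text is an assumption, not taken from the debate):

```python
import re

sample = 'QUICK: Welcome.\nTRUMP: Thank you.\nQUICK: First question.'

# With a capturing group the speaker labels stay in the list
with_group = re.compile(r'([A-Z]+:)')
parts = with_group.split(sample)
print(parts)

# Without the group only the text between the labels is returned
without_group = re.compile(r'[A-Z]+:')
text_only = without_group.split(sample)
print(text_only)

# The labels sit at the odd positions of the first result
labels = parts[1::2]
```

Note the empty string at the start of both lists: the sample begins with a speaker label, so the text 'before' the first separator is empty.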

In [None]:
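# Empty dictionary to store the information. Format will be 'name': 'string of all their text'
utterances = {}

# No speaker has been seen yet; text before the first speaker label is skipped
current_speaker = None

# Loop through the documents we created in the previous step
for element in docs:
    if element == '':
        continue
    if speaker.match(element):
        # The element is a speaker label such as 'TRUMP:'; strip the colon
        current_speaker = element.replace(':', '')
        if current_speaker not in utterances:
            utterances[current_speaker] = ''
    elif current_speaker is not None:
        # The element is text; append it to the current speaker's document
        utterances[current_speaker] = utterances[current_speaker] + ' ' + element

# See for whom we got utterances
print utterances.keys()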

## Tokenization

Most statistical text analysis is based on the 'bag-of-words' approach: it is assumed (clearly wrongly) that the information in a document is contained purely in its word counts. Grammatical and syntactical structure is ignored. Let's first count all words in each document.
To do this we have to split the document into discrete words. These are also called 'tokens', and the process is called 'tokenization'. Here we simply split the string whenever there is a space (you might already think of things that can go wrong here). There are more sophisticated methods to do this, but for now this is sufficient.

### Example

Let's see how this works for Donald Trump's utterances:

In [None]:
# Extract Trump's utterances from our collection
text = utterances['TRUMP']

# Split the text on single spaces
text = text.split(' ')

# Empty dictionary to store 'word: wordcount' pairs
trump = {}

# Loop through each word in the text and count them
for token in text:
    if token not in trump:
        trump[token] = 1
    else:
        trump[token] += 1

# Print a sample of the word counts to check what we got
for i, word in enumerate(trump):
    print word + ': ' + str(trump[word])
    if i == 10:
        break

Ignore for now that this looks weird. We will address the punctuation and lower/upper case issues in a bit. Let's finish our first document-term-matrix before that.
To start, we count the words for each speaker, as we did with Trump.

In [None]:
# Empty list to store each speaker's dictionary
word_counts = []

# Count all the words!
for person in utterances:
    text = utterances[person]
    text = text.split(' ')
    counts = {}
    for token in text:
        if token not in counts:
            counts[token] = 1
        else:
            counts[token] += 1
    word_counts.append(counts)
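
I wrote the counting loop by hand to show what is going on. In practice the standard library does the same job with `collections.Counter`. Here is a sketch on two made-up speakers (the names `A` and `B` and their text are assumptions for illustration):

```python
from collections import Counter

# Hypothetical stand-in for the `utterances` dictionary
sample_utterances = {
    'A': 'we will win and we will build',
    'B': 'we will cut taxes',
}

# One Counter per speaker, keyed by speaker name
counts_by_speaker = {}
for person, text in sample_utterances.items():
    counts_by_speaker[person] = Counter(text.split(' '))

print(counts_by_speaker['A']['we'])
# A Counter returns 0 for tokens it has never seen
print(counts_by_speaker['B']['win'])
```

Keeping the speaker name as the key (instead of appending to a list) also makes it easier to see later which row belongs to whom.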

### Resources

- NLTK (e.g. `nltk.word_tokenize`)
- spaCy (fastest!)
- R `tm` package


## Document Term Matrix

Many (but not all) statistical analyses of text are based on Document Term Matrices (DTM). A DTM, as the name says, is a matrix that contains one row per document (in our case a document is all the text of one candidate) and one column per term (i.e. per unique token). That means that each document is represented as a vector in a space that has as many dimensions as there are unique terms in the collection of all documents.

### Example

In [None]:
# Collect all terms from all speakers
vocabulary = set()
for person in word_counts:
    vocabulary.update(person.keys())
print len(vocabulary)
pprint(vocabulary)

In [None]:
import pandas as pd
dtm = pd.DataFrame(columns = list(vocabulary))
i = 0
for person in word_counts:
    
    vec = []
    for word in dtm.columns:
        if word in person:
            vec.append(person[word])
        else:
            vec.append(0)
    
    dtm.loc[i] = vec
    i += 1
print dtm.shape
dtm
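
If the pandas version is hard to follow, the same construction can be sketched in plain Python on a toy corpus. The two documents below are made up, and the matrix is just a list of rows:

```python
from collections import Counter

# Toy corpus: two tiny documents (hypothetical)
documents = ['the dog eats', 'the cat eats the mouse']

# Step 1: count the tokens of each document
toy_counts = [Counter(doc.split(' ')) for doc in documents]

# Step 2: the vocabulary is the set of all terms, sorted for a stable column order
toy_vocabulary = sorted(set(term for counts in toy_counts for term in counts))

# Step 3: one row per document, one column per term, 0 where a term is absent
toy_dtm = [[counts[term] for term in toy_vocabulary] for counts in toy_counts]

print(toy_vocabulary)
for row in toy_dtm:
    print(row)
```

Each row is the vector representation of one document, and the number of columns equals the size of the vocabulary.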

### Resources

- Python: `sklearn.feature_extraction.text.CountVectorizer`
- R: `tm`, `quanteda` (Kenneth Benoit's package)

## The Sparsity Problem

In [None]:
from __future__ import division

# Share of cells in the DTM that are zero
n_non_zero = dtm.astype(bool).sum().sum()
tot = dtm.shape[0] * dtm.shape[1]
1 - n_non_zero / tot

The calculation above shows that almost 90% of the cells in our matrix are zeros. This causes problems in a variety of ways:

The matrix becomes very large for only minimal increase in information. Memory and computational requirements increase dramatically.

Words that mean the same thing are treated as completely independent dimensions. E.g.:

- Capitalization: 'And', 'and', 'AND'
- Punctuation: 'dog,', 'dog.', 'dog'
- Grammatical inflections: 'walking' vs. 'walked' vs. 'walk'
- Irregular grammatical forms: 'dive', 'dove'
- etc.

Therefore, documents that could be considered similar can seem very distant to the computer. E.g. consider three documents:

1. 'The dog eats a cat'
2. 'the dogs eat many bananas'
3. 'I like coffee very much'

The corresponding document-term matrix would look like this:

```
    The dog eats a cat the dogs eat many bananas I like coffee very much
1   1   1   1    1 1   0   0    0   0    0       0 0    0      0    0
2   0   0   0    0 0   1   1    1   1    1       0 0    0      0    0
3   0   0   0    0 0   0   0    0   0    0       1 1    1      1    1
```

Calculating the Euclidean distance (or a related measure such as cosine distance) between the vectors of the three documents shows that they are all equidistant from each other. However, one could argue that documents `1` and `2` are closer to each other than they are to document `3`.
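
The equidistance claim is easy to check. The sketch below computes the pairwise Euclidean distances for the three documents, once on the raw tokens and once after lowercasing. The helper names `euclidean` and `pairwise` are my own:

```python
from math import sqrt
from collections import Counter
from itertools import combinations

three_docs = ['The dog eats a cat',
              'the dogs eat many bananas',
              'I like coffee very much']

def euclidean(a, b):
    # a and b are Counters; missing terms count as 0
    terms = set(a) | set(b)
    return sqrt(sum((a[t] - b[t]) ** 2 for t in terms))

def pairwise(texts):
    # Distance for every pair of documents, keyed by (document number, document number)
    vectors = [Counter(t.split(' ')) for t in texts]
    return {(i + 1, j + 1): euclidean(vectors[i], vectors[j])
            for i, j in combinations(range(len(texts)), 2)}

raw = pairwise(three_docs)
lowered = pairwise([t.lower() for t in three_docs])

for pair in sorted(raw):
    print('{} raw: {:.3f} lowercase: {:.3f}'.format(pair, raw[pair], lowered[pair]))
```

On the raw tokens every pair differs in ten terms, so all distances are equal. Lowercasing alone makes `The` and `the` the same term, which already moves documents `1` and `2` slightly closer together.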

Now you probably say 'of course, we have to make everything lowercase, remove all the punctuation, stem the tokens and remove all the stopwords'. These steps are often taken as routine in text analysis applications. In most cases most of these standard preprocessing steps make sense, but I think they should not be done mindlessly. Depending on the analysis, some of the things we remove here might be of value later. For example, capitalization might be important to differentiate between proper names and other words. The use of punctuation might be very informative when trying to attribute text to a specific person (don't we all know someone who uses way too many exclamation marks!!!!).

Instead, I will discuss these steps as feature selection or dimensionality reduction techniques. Although they are normally separate steps in the analysis and there are statistical techniques for them, I think it makes sense conceptually to treat them as the same. Consider using some form of latent factor analysis to retrieve a certain number of factors from the DTM and represent the documents in this reduced space. Although technically different, this is conceptually similar to converting tokens to lower case: we would expect `Dog` and `dog` to be highly correlated in the data because they mean basically the same thing. Therefore, we can represent the two variables as one, which we label `dog` and load with the sum of the old variables.

There are many ways to deal with sparsity, but almost all of them involve reducing the size of the vocabulary while maintaining as much relevant information as possible. Here is a list of things we can do:

1. Convert all words to lowercase
2. Remove all punctuation, numbers, special characters, etc.
3. Remove very common words that contain little semantic information, like `and`, `or`, `the`, etc. (also called *stopwords*)
4. Remove very infrequent words. Words that appear only once in the whole corpus may be typos. Very infrequent words are also not very relevant for statistical analysis (but see above: depending on the analysis, an outlier can be very informative).
5. Stemming
6. Lemmatization
7. Use dimension reduction techniques such as PCA, Factor Analysis, Neural Networks, Topic Models, etc. to locate documents in a [semantic space](https://en.wikipedia.org/wiki/Word_embedding).
8. Reduce the vocabulary by inspecting the information content the words have for a supervised task, for example with a $\chi^2$ test
    
For this basic tutorial I will demonstrate just the basics (`1`-`6`) and completely ignore the statistical techniques (`7` and `8`). I hope to do a more advanced tutorial where I will cover these topics.
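
Step `4` from the list above, removing very infrequent words, can be sketched in a few lines. The toy documents and the threshold `min_count` are assumptions for illustration; what a sensible threshold is depends on your corpus and analysis:

```python
from collections import Counter

# Toy corpus, already tokenized; 'teh' is a deliberate typo
toy_docs = [['tax', 'cut', 'tax', 'plan'],
            ['tax', 'plan', 'teh'],
            ['budget', 'plan', 'cut']]

# Count every term over the whole corpus
corpus_counts = Counter(token for doc in toy_docs for token in doc)

# Keep only terms that appear at least `min_count` times (hypothetical threshold)
min_count = 2
kept = set(term for term, n in corpus_counts.items() if n >= min_count)

# Drop the rare terms from every document
filtered = [[token for token in doc if token in kept] for doc in toy_docs]

print(sorted(kept))
print(filtered)
```

Note that the filter cannot tell the typo `teh` from the legitimate but rare word `budget`: both are removed.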

Now let's do this. Steps `1`, `2` and `3` can easily be done with regular expressions and the standard string tools of any programming language. We do this on the utterances we split earlier, because we needed the colons, parentheses and upper case words for the text cleaning above.

We convert the text to lowercase, drop the stopwords, and then use a regular expression to replace everything that is not a normal letter or a space.

## Text Preprocessing

In [None]:
# First generate two regular expressions
non_alpha = re.compile(r'[^A-Za-z ]') # Everything that is not a letter or a space
excess_space = re.compile(r' +') # One or more spaces

# Load a stopword list
stopwords = set(io.open('stopwords.txt', 'r').read().split('\n'))

utterances_clean = {}

for person in utterances:
        
    text = utterances[person]
    # convert everything to lowercase
    text = text.lower()
      
    # Remove stopwords
    text = [w for w in text.split(' ') if w not in stopwords]
    text = ' '.join(text)
    
    # remove non-letters and excess space
    text = non_alpha.sub(repl=' ', string=text)
    text = excess_space.sub(repl=' ', string=text)
    
    utterances_clean[person] = text
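
To check what the cleaning steps do without the transcript and the stopword file, here is the same sequence wrapped in a function and applied to made-up sentences. The function name `clean`, the tiny stopword set and the final `strip()` are my additions:

```python
import re

non_alpha = re.compile(r'[^A-Za-z ]')  # everything that is not a letter or a space
excess_space = re.compile(r' +')       # one or more spaces

# Hypothetical mini stopword list, standing in for stopwords.txt
toy_stopwords = set(['the', 'and', 'a', 'to', 'we', 'i'])

def clean(text, stopwords):
    text = text.lower()
    # Remove stopwords (tokens are compared before punctuation is stripped)
    text = ' '.join(w for w in text.split(' ') if w not in stopwords)
    # Replace non-letters by spaces, then collapse runs of spaces
    text = non_alpha.sub(' ', text)
    text = excess_space.sub(' ', text)
    return text.strip()

cleaned = clean('We need to cut taxes, and I mean BIG cuts!', toy_stopwords)
print(cleaned)

# Caveat: 'and,' still carries its comma when stopwords are removed, so it survives
caveat = clean('Taxes and, frankly, the economy', toy_stopwords)
print(caveat)
```

The second call shows a side effect of the order of the steps: a stopword that is directly followed by punctuation is not recognized as a stopword.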


Let's make some word clouds to see how well it worked.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

# Draw one word cloud per candidate
for name in ['TRUMP', 'CRUZ', 'BUSH']:
    wordcloud = WordCloud().generate(utterances_clean[name])
    plt.figure()
    plt.imshow(wordcloud)
    plt.axis("off")
    print name.title()
    plt.show()

Now let's regenerate the DTM

In [None]:
word_counts = []

# Count all the words!
for person in utterances_clean:
    text = utterances_clean[person]
    text = text.split(' ')
    counts = {}
    for token in text:
        if token not in counts:
            counts[token] = 1
        else:
            counts[token] += 1
    word_counts.append(counts)
    
# Collect all terms from all speakers
vocabulary = set()
for person in word_counts:
    vocabulary.update(person.keys())
    
# Generate DTM
dtm = pd.DataFrame(columns = list(vocabulary))
i = 0
for person in word_counts:
    
    vec = []
    for word in dtm.columns:
        if word in person:
            vec.append(person[word])
        else:
            vec.append(0)
    
    dtm.loc[i] = vec
    i += 1
 
print dtm.shape

## Stemming and Lemmatization

Stemming and lemmatization are two techniques to 'normalize' tokens. This is done to avoid differentiating between different grammatical forms of the same word. Consider these four examples:

- (walk, walking)
- (dive, dove)
- (doves, dove)
- (is, are)

The first is simple, the second is an irregular verb, the third is an animal whose singular happens to look like the past tense in the second pair, and the fourth is a verb whose forms share no common stem.

### Stemming

Stemming algorithms are rule-based and operate on the tokens themselves. A stemmer returns the 'stem' of a word, i.e. the word without grammatical inflections etc. For the above examples it would probably return something like:

- (walk, walk)
- (div, dov)
- (dov, dov)
- (is, are)

It worked fine in the first case, but stemmers are not able to find the canonical form (or lemma) of a word. Therefore it failed on the last three cases: the two forms of 'dive' and of 'be' are still treated as different terms, while the animal and the verb form are merged.
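
To show what 'rule based' means, here is a deliberately crude suffix stripper. It is a toy of my own and not the Porter or Snowball algorithm, so the stems it produces differ from what a real stemmer (or the guesses above) would give, but it fails on the same kinds of cases:

```python
def toy_stem(token):
    # Strip the first matching suffix, but keep at least three characters
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

pairs = [('walk', 'walking'), ('dive', 'dove'), ('doves', 'dove'), ('is', 'are')]
stemmed = [(toy_stem(a), toy_stem(b)) for a, b in pairs]

for original, result in zip(pairs, stemmed):
    print('{} -> {}'.format(original, result))
```

The rules only look at the characters of the token, so there is no way for them to know that `dove` can be a form of `dive` or that `is` and `are` belong together.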

### Lemmatization

Lemmatization, as the name suggests, is a group of algorithms that find the lemma of a word. What the correct lemma is often depends on the context (compare `the dove flies` and `I dove into the data`). Therefore, lemmatization works better if the algorithm is applied to the text in its original form.
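
Here is a toy illustration of why context matters. Real lemmatizers work out the part of speech from the sentence; in this sketch I simply pass it in by hand, and the lookup table, the tag names and the function `toy_lemmatize` are all made up for the example:

```python
# Hypothetical lookup table keyed by (token, part of speech)
toy_lemmas = {
    ('walking', 'VERB'): 'walk',
    ('dove', 'VERB'): 'dive',
    ('dove', 'NOUN'): 'dove',
    ('doves', 'NOUN'): 'dove',
    ('is', 'VERB'): 'be',
    ('are', 'VERB'): 'be',
}

def toy_lemmatize(token, pos):
    # Fall back to the lowercased token if the pair is not in the table
    return toy_lemmas.get((token.lower(), pos), token.lower())

# 'the dove flies' vs. 'I dove into the data'
bird = toy_lemmatize('dove', 'NOUN')
jump = toy_lemmatize('dove', 'VERB')
print(bird)
print(jump)
```

The same token gets two different lemmas, and that distinction is lost if the text is stripped of its context before lemmatizing.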

### Example

In [None]:
from spacy.en import English
import Stemmer # PyStemmer module

stemmer = Stemmer.Stemmer('english')
parser = English()

In [None]:
text = docs[16]
text = re.sub('\n', ' ', text)
tokens = text.split(' ')

# Let's first try stemming
for i, token in enumerate(tokens):
    print '{} -> {}'.format(token, stemmer.stemWord(token))
    if i == 17:
        break

In [None]:
# Now lemmatizing
parsed_text = parser(text)

for i, token in enumerate(parsed_text):
    print '{} -> {}'.format(token.orth_, token.lemma_)
    if i == 20:
        break

Now let's make our final DTM. We redo all the steps above and add the lemmatization step.

In [None]:
# First generate two regular expressions
non_alpha = re.compile(r'[^A-Za-z ]') # Everything that is not a letter or a space
excess_space = re.compile(r' +') # One or more spaces

# Load a stopword list
stopwords = set(io.open('stopwords.txt', 'r').read().split('\n'))

utterances_clean = {}

for person in utterances:
        
    text = utterances[person]
    
    parsed_text = parser(text)
    lemmas = [token.lemma_ for token in parsed_text]
    text = ' '.join(lemmas)
    
    # convert everything to lowercase
    text = text.lower()
      
    # Remove stopwords
    text = [w for w in text.split(' ') if w not in stopwords]
    text = ' '.join(text)
    
    # remove non-letters and excess space
    text = non_alpha.sub(repl=' ', string=text)
    text = excess_space.sub(repl=' ', string=text)
    
    utterances_clean[person] = text

word_counts = []

# Count all the words!
for person in utterances_clean:
    text = utterances_clean[person]
    text = text.split(' ')
    counts = {}
    for token in text:
        if token not in counts:
            counts[token] = 1
        else:
            counts[token] += 1
    word_counts.append(counts)
    
# Collect all terms from all speakers
vocabulary = set()
for person in word_counts:
    vocabulary.update(person.keys())
    
# Generate DTM
dtm = pd.DataFrame(columns = list(vocabulary))
i = 0
for person in word_counts:
    
    vec = []
    for word in dtm.columns:
        if word in person:
            vec.append(person[word])
        else:
            vec.append(0)
    
    dtm.loc[i] = vec
    i += 1
 
print dtm.shape
dtm

This matrix can now be used for a variety of analyses, ranging from simple word counts to sophisticated statistical models.

## More Resources


### Books / Articles / Videos


### Software


#### Stanford Nltk


#### Python


#### R


#### Commandline