# Text Analysis

In this lecture, we cover exploratory data analysis and supervised learning for free text. In the next lecture, we will look at unsupervised learning and topic models.

Along the way, we will use the packages

- [`sklearn`](http://scikit-learn.org/stable/)
- [`wordcloud`](https://github.com/amueller/word_cloud)
- [`nltk`](https://www.nltk.org)
- [`gensim`](https://radimrehurek.com/gensim/)
- [`spaCy`](https://spacy.io)

Other packages useful for text analysis include

- [`fasttext`](https://fasttext.cc/)

and many, many others.

## Exploratory data analysis

### Corpus

A corpus is a collection of text documents. A corpus can be built from many sources, such as documents, scraped web pages, Twitter streams and speech transcription. The first step in any text analysis application is nearly always to create an application-specific corpus. This is important because language patterns often differ greatly between domains (e.g. contrast medical records with legal documents with Twitter streams).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context('notebook', font_scale=1.5)

In [None]:
import numpy as np
import pandas as pd

In [None]:
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.collocations import QuadgramCollocationFinder, TrigramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures, TrigramAssocMeasures
import string

#### Toy corpus

We first see how a small corpus with two documents is broken down into smaller pieces:

document $\to$ paragraph $\to$ sentences $\to$ tokens

Although this explicit decomposition may not be necessary in all applications, it is still useful to be aware of these units:

- A paragraph contains an *idea*
- A sentence is a unit of syntax
- A token (word or punctuation) is the smallest meaningful unit

In [None]:
docs = [
    '''Spicy jalapeno bacon ipsum dolor amet aute prosciutto velit corned beef consectetur. Aute kielbasa adipisicing, nostrud drumstick ipsum tail pig capicola burgdoggen corned beef. Dolor proident salami deserunt. Venison capicola pork belly bacon aliquip swine incididunt sint quis cupidatat pork chop et turducken nulla beef. Ground round kielbasa tri-tip consectetur, t-bone pariatur deserunt id ut adipisicing.

Strip steak meatball chuck aute, pork loin turkey pork commodo et officia. Rump enim spare ribs, prosciutto chuck deserunt tail. Aute pork lorem sausage. Nostrud dolore kevin proident pork chop do in. Exercitation shoulder dolore kevin ut, sausage ullamco frankfurter ham hock. Ground round fatback ribeye turkey tri-tip capicola.''',
    '''Burgdoggen id ham hock ut kielbasa. Eu pork chop anim picanha sed porchetta dolor consequat drumstick shankle proident pork andouille. Et cupim burgdoggen, officia lorem shank ut sed drumstick shankle salami ad ball tip dolore pig. Shankle turkey officia, reprehenderit bacon ipsum ullamco enim tail tongue. Brisket short ribs biltong jerky flank, venison filet mignon tenderloin culpa bacon meatball short loin commodo. Leberkas jowl prosciutto, et kielbasa pancetta chicken. Nisi minim sausage porchetta jowl.

Beef ribs pariatur pork chop dolore ex, consequat turducken frankfurter esse filet mignon lorem bacon. Elit dolore porchetta meatball ea, pork loin pork anim non sirloin. Aliquip tenderloin reprehenderit pariatur, leberkas alcatra short loin. Fugiat elit meatloaf, nulla cow in sausage. Doner consequat shankle salami est, boudin deserunt. Drumstick ham lorem reprehenderit.

Beef adipisicing nisi rump filet mignon cillum leberkas boudin tail picanha pork loin. Culpa picanha ground round in laborum spare ribs. Burgdoggen leberkas landjaeger adipisicing strip steak velit doner eu ground round meatloaf consectetur deserunt anim ball tip cow. Porchetta ad minim eiusmod labore eu nisi boudin laboris officia jowl deserunt strip steak. Shank aliquip beef ribs tri-tip ipsum flank. Turducken elit meatloaf aliqua corned beef sirloin irure. Tongue cupim ullamco in sint prosciutto.'''
]

##### Documents

In [None]:
docs

In [None]:
from itertools import chain

In [None]:
def flatten(listOfLists):
    return list(chain.from_iterable(listOfLists))

##### Paragraphs

In [None]:
paras = flatten([doc.split('\n\n') for doc in docs])

In [None]:
paras[:3]

##### Sentences

In [None]:
sentences = flatten([nltk.tokenize.sent_tokenize(para) for para in paras])

In [None]:
sentences[:10]

##### Tokens

In [None]:
tokens = flatten([nltk.tokenize.word_tokenize(sentence) for sentence in sentences])

In [None]:
tokens[:10]

### Exploratory analysis of the `newsgroup` corpus

In [None]:
from sklearn.datasets import fetch_20newsgroups

For convenience, we will use an existing corpus: the 20 newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics. The 20 topics are

```
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

In [None]:
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=('rec.sport.baseball', 
                'rec.sport.hockey',
                'sci.med',
                'sci.space'),
    
    remove=('headers', 'footers', 'quotes'))

In [None]:
newsgroups_train.keys()

In [None]:
print(newsgroups_train.DESCR)

In [None]:
newsgroups_train.filenames.shape

In [None]:
newsgroups_train.target.shape

In [None]:
newsgroups_train.target_names

In [None]:
newsgroups_train.data[0]

### Getting word counts

In [None]:
from sklearn.feature_extraction.text import (
    HashingVectorizer,
    TfidfVectorizer, 
    CountVectorizer, 
)

In [None]:
vectorizer = CountVectorizer()

In [None]:
idx = np.nonzero(
    newsgroups_train.target == 
    newsgroups_train.target_names.index('rec.sport.baseball')
)[0]
baseball_sample = [newsgroups_train.data[i] for i in idx]

In [None]:
X = vectorizer.fit_transform(baseball_sample)

In [None]:
X

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = vectorizer.get_feature_names_out()

In [None]:
rownames = [':'.join(filename.split('/')[-2:]) 
            for filename in newsgroups_train.filenames[idx]]
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vocab, index=rownames)

In [None]:
freqs = df.sum(axis=0).astype('int')

In [None]:
freqs.nlargest(10)

### Distribution of word counts

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(freqs)
pass

### Zipf's law

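Zipf's law states that the frequency of a word is approximately inversely proportional to its rank, $f(r) \propto r^{-s}$ with $s \approx 1$. Equivalently, the number of words that occur with frequency $f$ follows a power law distribution

$$
p(f) = \alpha f^{-1-1/s}
$$

A power law relationship appears as a straight line on a log-log plot, so we plot word frequency against rank on logarithmic axes.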

In [None]:
xs = freqs.sort_values(ascending=False).reset_index(drop=True)
plt.loglog(xs.index + 1, xs)
plt.xlabel('Rank (log scale)')
plt.ylabel('Frequency (log scale)')
plt.title("Zipf's law")
pass
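
One way to quantify how closely a corpus follows Zipf's law is to estimate the slope of the rank-frequency line on the log-log plot; with an exponent close to 1, the slope should be near -1. The sketch below is a minimal illustration using ordinary least squares on synthetic counts. The helper name `zipf_slope` and the synthetic data are assumptions made for illustration and are not part of the analysis above.

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic counts that follow f(r) = 1000 / r exactly (hypothetical data)
synthetic = [1000 / rank for rank in range(1, 201)]
slope = zipf_slope(synthetic)
print(round(slope, 3))
```

The same helper should accept the `freqs` series computed earlier, provided all counts are positive. Real corpora typically deviate from a straight line at the very highest and lowest ranks.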

### Stop words, lemmatization and stemming

We can try to reduce the number of tokens using the simple strategies of stop words, stemming and lemmatization.

#### Stop words

The most common words are not very informative, and we may wish to remove them. There are other ways to handle this (e.g. with TF-IDF vectorizers) but we will simply use stop words for this section.

In [None]:
vectorizer = CountVectorizer(stop_words='english')

In [None]:
idx = np.nonzero(
    newsgroups_train.target == 
    newsgroups_train.target_names.index('rec.sport.baseball')
)[0]
baseball_sample = [newsgroups_train.data[i] for i in idx]

In [None]:
X = vectorizer.fit_transform(baseball_sample)

In [None]:
vocab = vectorizer.get_feature_names_out()

In [None]:
rownames = [':'.join(filename.split('/')[-2:]) 
            for filename in newsgroups_train.filenames[idx]]
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vocab, index=rownames)

In [None]:
freqs = df.sum(axis=0).astype('int')

We will also drop numbers.

In [None]:
freqs = freqs[~freqs.index.str.isnumeric()]

Now the most common words are more informative.

In [None]:
freqs.nlargest(15)
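
To make the filtering step explicit, here is a minimal sketch of stop word and number removal written with the standard library only. The stop list `STOP_WORDS`, the helper `count_content_words` and the example sentence are hypothetical; the list used by `CountVectorizer(stop_words='english')` is much longer.

```python
from collections import Counter

# A tiny hypothetical stop list, for illustration only
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was"}

def count_content_words(text, stop_words=STOP_WORDS):
    """Count lowercase whitespace-separated tokens, skipping stop words and numbers."""
    tokens = (token.strip(".,!?;:") for token in text.lower().split())
    return Counter(
        token for token in tokens
        if token and token not in stop_words and not token.isnumeric()
    )

counts = count_content_words(
    "The pitcher threw the ball and the batter hit the ball in 1993."
)
print(counts.most_common(3))
```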

#### Stemming

Stemming is the attempt to identify the common roots of words using prefix and suffix rules.

In [None]:
def tokenize(text):
    stem = SnowballStemmer('english')
    text = text.lower()
    
    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield stem.stem(token)

In [None]:
text = '''circle circles circular circularity 
circumference circumscribe circumstantial
infer inference inferences inferential'''

In [None]:
list(tokenize(text))
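
To illustrate what "suffix rules" means, the following toy stemmer strips a handful of suffixes. It is a deliberately simplified sketch and not the Snowball algorithm; the rule list `SUFFIXES` and the function `toy_stem` are made up for this example.

```python
# Toy suffix rules, longest first; a real stemmer such as Snowball uses many more rules
SUFFIXES = ("ences", "ence", "ing", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["infer", "inference", "inferences", "inferring", "circles"]
print([toy_stem(w) for w in words])
```

Purely rule-based stripping can map related words to different stems (here `inferring` keeps its doubled `r`), which is one reason production stemmers need many carefully ordered rules.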

#### Lemmatization

Lemmatization also attempts to identify the common roots of words, but uses dictionary lookup to do so. Lemmatization often gives better results than stemming, but is slower.

In [None]:
def tokenize(text):
    lem = WordNetLemmatizer()
    text = text.lower()
    
    for token in nltk.word_tokenize(text):
        if token in string.punctuation:
            continue
        yield lem.lemmatize(token)

In [None]:
list(tokenize(text))

### Word cloud

In [None]:
from wordcloud import WordCloud

In [None]:
wordcloud = WordCloud().generate(' '.join(freqs.nlargest(200).index))
pass

In [None]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
pass

In [None]:
from imageio import imread

In [None]:
rabbit = imread('data/rabbit.png').astype('ubyte')

In [None]:
wc = WordCloud(mask=rabbit[:,:,0], 
               mode='RGBA',
               background_color=None)
wc.generate(' '.join(freqs.nlargest(200).index))
pass

In [None]:
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
pass

## Supervised Learning

A general framework for supervised learning on text is

construct corpus $\to$ vectorization of features $\to$ classification $\to$ evaluation (often by cross-validation)

For example, we may classify documents into topics, or by sentiment, or as spam/not spam.
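
The four steps of this framework can be sketched end to end with the standard library. The example below is a minimal illustration, assuming a tiny made-up spam/ham corpus, bag-of-words counts as features, a multinomial naive Bayes classifier with add-one smoothing, and leave-one-out cross-validation. All function names are hypothetical; in practice one would normally use the `sklearn` vectorizers and classifiers instead.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words features: lowercase token counts."""
    return Counter(text.lower().split())

def train_naive_bayes(features, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for counts, label in zip(features, labels):
        word_counts[label].update(counts)
    vocab = {word for counts in features for word in counts}
    return class_docs, word_counts, vocab

def predict(model, counts):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word, n in counts.items():
            if word in vocab:  # words unseen in training are ignored
                score += n * math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# 1. Construct a (hypothetical) labelled corpus
corpus = [
    ("free prize click now", "spam"),
    ("win free money now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
    ("project agenda and notes", "ham"),
]

# 2. Vectorize the documents
features = [vectorize(text) for text, _ in corpus]
labels = [label for _, label in corpus]

# 3-4. Classify and evaluate with leave-one-out cross-validation
correct = 0
for i in range(len(corpus)):
    train_x = features[:i] + features[i + 1:]
    train_y = labels[:i] + labels[i + 1:]
    model = train_naive_bayes(train_x, train_y)
    correct += predict(model, features[i]) == labels[i]
accuracy = correct / len(corpus)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```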

### Vectorization of features

There are three common ways to vectorize features when the text is treated as a bag of words: word counts, one hot encoding and TF-IDF. We also look at hashing, which bounds the number of features.

In [None]:
small_sample = """Do you like green eggs and ham?
I do not like them, Sam-I-am.
I do not like green eggs and ham!
Would you like them here or there?
I would not like them here or there.
I would not like them anywhere.
I do so like green eggs and ham!
Thank you! Thank you,
Sam-I-am!""".splitlines()

In [None]:
small_sample

#### Word counts

In [None]:
count_vectorizer = CountVectorizer()

In [None]:
X = count_vectorizer.fit_transform(small_sample)

In [None]:
vocab = count_vectorizer.get_feature_names_out()
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vocab)
df.fillna(0).iloc[:, :10]

#### Hashing

If the number of words is too large, we can hash words into a fixed number of buckets to keep the computations tractable. However, we lose the ability to map back to the original tokens.

In [None]:
hash_vectorizer = HashingVectorizer(n_features=5)

In [None]:
X = hash_vectorizer.fit_transform(small_sample)

In [None]:
X.toarray()
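
The idea behind hashing can be sketched in a few lines: each token is mapped to a bucket index by a hash function, and the bucket count is incremented. This is a simplified illustration that assumes `zlib.crc32` as the hash and whitespace tokenization; the function name `hash_vectorize` is hypothetical. `HashingVectorizer` differs in its details (a different hash function, signed updates and row normalization by default), so the numbers will not match.

```python
import zlib

def hash_vectorize(text, n_buckets=5):
    """Map each token to a bucket with a stable hash and count hits per bucket."""
    vector = [0] * n_buckets
    for token in text.lower().split():
        bucket = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[bucket] += 1
    return vector

vec = hash_vectorize("I do not like green eggs and ham")
print(vec)
```

With eight tokens and only five buckets, at least two tokens must collide in the same bucket, which is why the mapping back to the original tokens is lost.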

#### One hot encoding

One hot encoding simply sets words with non-zero counts to 1.

In [None]:
one_hot_vectorizer = CountVectorizer(binary=True)

In [None]:
X = one_hot_vectorizer.fit_transform(small_sample)

In [None]:
vocab = one_hot_vectorizer.get_feature_names_out()
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vocab)
df.fillna(0).iloc[:, :10]

#### TF-IDF

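TF-IDF weights each term's count in a document (term frequency) by how rare the term is across the corpus (inverse document frequency), so that words appearing in most documents are down-weighted. See [Wikipedia](https://en.wikipedia.org/wiki/Tf–idf) for the full definition.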

In [None]:
tf_idf_vectorizer = TfidfVectorizer()

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
X = tf_idf_vectorizer.fit_transform(small_sample)

In [None]:
vocab = tf_idf_vectorizer.get_feature_names_out()
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vocab)
df.fillna(0).iloc[:, :10]
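
To show where the TF-IDF weights come from, the sketch below computes them by hand. It assumes raw counts as term frequency, the smoothed inverse document frequency $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and L2 normalization of each row, which we believe match the defaults of `TfidfVectorizer`. The function name `tf_idf` and the three short documents are made up, and the whitespace tokenization here differs from the vectorizer's, so values may not agree exactly with the table above.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tf_idf(["green eggs and ham", "green eggs", "thank you sam"])
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the first document, `ham` occurs in only one document while `green` occurs in two, so `ham` receives the larger weight.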

## Maintaining context

For some supervised learning tasks such as sentiment analysis (is this review positive or negative?), the context of words is very important. For example, the following two reviews use very similar words but have very different meanings.

- `Only an idiot like Reviewer two could love that movie`
- `Could not love that movie more. Reviewer one is an idiot`

In this case, we need to take the context of individual words into account. Common ways to do so include the use of N-grams (closely related to collocations), part-of-speech (POS) tagging and grammars, and the `word2vec` family of algorithms.

### N-grams

In [None]:
count_vectorizer = CountVectorizer(ngram_range=(1,3))

In [None]:
X = count_vectorizer.fit_transform(small_sample)

In [None]:
vocab = count_vectorizer.get_feature_names_out()
df = pd.DataFrame.sparse.from_spmatrix(X, columns=vocab)
df.fillna(0).iloc[:, :10]
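
An n-gram is simply a window of `n` consecutive tokens. As a minimal sketch, the hypothetical helper `extract_ngrams` below builds them by zipping shifted copies of the token list; `ngram_range=(1,3)` above corresponds to collecting the results for `n` = 1, 2 and 3.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "i do not like green eggs and ham".split()
for n in (1, 2, 3):
    print(n, extract_ngrams(tokens, n)[:3])
```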

### Significant collocations

Most n-grams are not meaningful phrases. We can use statistical tests for the likelihood of co-occurrence of words, and keep only the significant collocations. In essence, we test against the null hypothesis that the words in the n-gram co-occur by chance, with each word drawn independently according to its empirical frequency.

In [None]:
abstract = '''Macrophages represent one of the most numerous and diverse 
leukocyte types in the body. Furthermore, they are important regulators 
and promoters of many cardiovascular disease programs. Their functions 
range from sensing pathogens to digesting cell debris, modulating inflammation, 
and producing key cytokines and other regulatory factors throughout the body. 
Macrophage research has undergone a renaissance in recent years, which 
has propelled a newfound interest in their heterogeneity as well as a 
new understanding of ontological differences in their development. 
In addition, recent technological advances such as single-cell 
mass-cytometry by time-of-flight have enabled phenotype and functional 
analyses of individual immune myeloid cells, including macrophages, 
at unprecedented resolution. In this Part 1 of a 4-part review series 
covering the macrophage in cardiovascular disease, we focus on the 
basic principles of macrophage development, heterogeneity, phenotype, 
tissue-specific differentiation, and functionality as a basis to understand 
their role in cardiovascular disease.'''

In [None]:
ngrams = TrigramCollocationFinder.from_words(nltk.tokenize.word_tokenize(abstract))

In [None]:
scores = ngrams.score_ngrams(TrigramAssocMeasures.likelihood_ratio)

In [None]:
scores[:5]

In [None]:
scores[-5:]
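
To make the independence test concrete, the sketch below scores bigrams (rather than the trigrams used above) with the log-likelihood ratio statistic $G^2 = 2 \sum O \ln(O/E)$, computed over the 2x2 table of observed counts $O$ and the counts $E$ expected under independence. It is a simplified illustration and is not claimed to reproduce the numbers from `TrigramAssocMeasures.likelihood_ratio`; the function name `bigram_g2` and the short token sequence are made up.

```python
import math
from collections import Counter

def bigram_g2(tokens):
    """Score each bigram with the log-likelihood ratio statistic G^2."""
    bigrams = list(zip(tokens, tokens[1:]))
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(first for first, _ in bigrams)
    second_counts = Counter(second for _, second in bigrams)
    scores = {}
    for (w1, w2), n11 in pair_counts.items():
        row1, col1 = first_counts[w1], second_counts[w2]
        # Each entry: (observed count, row marginal, column marginal)
        cells = [
            (n11, row1, col1),
            (row1 - n11, row1, total - col1),
            (col1 - n11, total - row1, col1),
            (total - row1 - col1 + n11, total - row1, total - col1),
        ]
        g2 = 0.0
        for observed, row, col in cells:
            if observed > 0:  # a zero cell contributes nothing
                expected = row * col / total
                g2 += observed * math.log(observed / expected)
        scores[(w1, w2)] = 2 * g2
    return scores

tokens = ("cardiovascular disease is common . macrophages matter in "
          "cardiovascular disease . the disease is studied").split()
scores = bigram_g2(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```

A larger score means the pair co-occurs more often than its word frequencies alone would predict, so `cardiovascular disease` scores higher than `the disease` in this toy sequence.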

## Part-of-speech tagging

The regular expression grammar used below is taken from this [blog post](http://bdewilde.github.io/blog/2014/09/23/intro-to-automatic-keyphrase-extraction/).

### Parts of speech in NLTK

In [None]:
%%capture
import nltk
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')

In [None]:
nltk.help.upenn_tagset()

Using a [paragraph](https://en.wikipedia.org/wiki/Alfred_Nobel) from Wikipedia.

In [None]:
nobel = "Born in Stockholm, Alfred Nobel was the third son of Immanuel Nobel (1801–1872), an inventor and engineer, and Carolina Andriette (Ahlsell) Nobel (1805–1889). The couple married in 1827 and had eight children. The family was impoverished, and only Alfred and his three brothers survived past childhood. Through his father, Alfred Nobel was a descendant of the Swedish scientist Olaus Rudbeck (1630–1702), and in his turn the boy was interested in engineering, particularly explosives, learning the basic principles from his father at a young age. Alfred Nobel's interest in technology was inherited from his father, an alumnus of Royal Institute of Technology in Stockholm."

In [None]:
nobel

In [None]:
text = nltk.word_tokenize(nobel)

In [None]:
pos = nltk.pos_tag(text)

In [None]:
pos[:32]

In [None]:
grammar = 'KP: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}'

In [None]:
chunker = nltk.RegexpParser(grammar)

In [None]:
tree = chunker.parse(pos[:32])

In [None]:
tree

In [None]:
tree.collapse_unary()  # call the method; it modifies the tree in place

In [None]:
import itertools

In [None]:
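kps = []
# Group consecutive tokens by whether they fall inside a KP chunk, so that
# the B-KP and I-KP tags of one phrase stay together
for in_chunk, group in itertools.groupby(nltk.tree2conlltags(tree), lambda x: x[-1] != 'O'):
    if in_chunk:
        # `tag` rather than `pos` avoids overwriting the tagged tokens above
        kps.append(' '.join(word for word, tag, cls in group))
kps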

### Finding named entities

We use a pre-trained model from `spacy`. See [here](https://spacy.io/usage/training#ner) if you want to train on your own corpus or extend the pre-trained model.

The default model is not perfect, but may be good enough for your needs.

In [None]:
%%capture
! python3 -m spacy download en_core_web_sm

In [None]:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
doc = nlp(nobel)

In [None]:
print([(token, token.ent_iob_, token.ent_type_) for token in doc])

In [None]:
displacy.render(doc, jupyter=True, style='ent')

In [None]:
for entity in doc.ents:
    if entity.label_ == 'PERSON':
        print(entity)