# Text Mining: Course introduction

This course uses [Jupyter notebooks](http://jupyter.org/) for the lab assignments. Notebooks let you write and execute Python code in a web browser, and they make it very easy to mix code and text.

The purpose of this particular notebook is to give you a glimpse of what is to come.

## Load some data

Load a data set of movie reviews.

In [None]:
import bz2
import pandas as pd

with bz2.open('sst-train.json.bz2', 't') as source:
    df = pd.read_json(source)

Print the number of reviews.

In [None]:
len(df)

Show the first few reviews.

In [None]:
df.head()

Define a helper function that splits a text into tokens at whitespace, drops any non-alphabetic tokens, and lowercases the rest.

In [None]:
def tokens(text):
    return [t.lower() for t in text.split() if t.isalpha()]
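
As a quick check of what this helper keeps and drops, here is a small sketch on a made-up sentence (the sample text is an assumption, not taken from the data set). Note that a word with punctuation attached, such as `movie,`, fails the `isalpha()` test and is dropped entirely.

```python
def tokens(text):
    # Same helper as above: split at whitespace, keep alphabetic tokens, lowercase them.
    return [t.lower() for t in text.split() if t.isalpha()]

# Hypothetical sample sentence, not part of the movie review data.
sample = "A great movie, with 2 great actors!"
result = tokens(sample)
print(result)
```

If the reviews are already tokenized so that punctuation is separated from words by whitespace, this is harmless; otherwise a more careful tokenizer may be needed.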

## Exploration 1: Basic statistics

Import the `Counter` class, a dictionary subclass for counting how often each item occurs.

In [None]:
from collections import Counter

Count how many occurrences of each token (word) the data contains.

In [None]:
counter = Counter()
for text in df['text']:
    counter.update(tokens(text))

Print the number of distinct tokens (the size of the vocabulary).

In [None]:
print(len(counter))
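
Note that `len(counter)` counts distinct tokens, not token occurrences. The difference can be sketched on a toy example (the two texts and the name `toy_counter` are made up for illustration):

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the review texts.
toy_counter = Counter()
for text in ["a good movie", "a bad movie"]:
    toy_counter.update(text.split())

num_types = len(toy_counter)            # distinct tokens (vocabulary size)
num_tokens = sum(toy_counter.values())  # total token occurrences
print(num_types, num_tokens)
```

On the real data, `sum(counter.values())` would give the total number of token occurrences in the same way.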

The token *movie* occurs quite often:

In [None]:
print(counter['movie'])

Print the 10 most common words.

In [None]:
counter.most_common(10)
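
Raw counts depend on the size of the data, so it can help to look at relative frequencies as well. A minimal sketch on a made-up sentence (all names here are illustrative):

```python
from collections import Counter

# Hypothetical text standing in for the tokenized reviews.
toy_counter = Counter("the movie is the best movie of the year".split())

total = sum(toy_counter.values())
# Pair each of the two most common words with its share of all occurrences.
relative = [(word, count / total) for word, count in toy_counter.most_common(2)]
print(relative)
```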

Plot the number of occurrences of the 100 most common words.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

labels, values = zip(*counter.most_common(100))
plt.bar(range(len(labels)), values)
plt.show()

## Exploration 2: Information extraction

Load spaCy.

In [None]:
import spacy

Load the English language model.

In [None]:
nlp = spacy.load('en_core_web_sm')

Define a short text.

In [None]:
text = 'Apple Corp. buys Alphabet Inc. for $1 billion'

Process the text using the default pipeline.

In [None]:
doc = nlp(text)

Print the tokens together with their lemmas, part-of-speech tags, and stopword flags.

In [None]:
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

Show the dependency parse.

In [None]:
from spacy import displacy

displacy.render(doc, style='dep', options={'distance': 110}, jupyter=True)

Show the named entities.

In [None]:
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

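Define a function that finds the root word of an entity: the token inside the entity whose syntactic head lies outside the entity (or which is the root of the sentence).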

In [None]:
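def root(ent):
    # Start at the first token of the entity and follow the head links
    # upwards for as long as the head is still inside the entity.
    token = ent[0]
    while token.head.i != token.i and ent.start <= token.head.i < ent.end:
        token = token.head
    return token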
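
The head-climbing idea can be sketched without spaCy on a toy parse. The `heads` list below is a hand-written, hypothetical dependency analysis (each entry is the index of that word's head; the sentence root points to itself), and `span_root` is an illustrative name, not a spaCy function.

```python
def span_root(heads, start, end):
    """Return the index of the root of the span [start, end)."""
    i = start
    # Climb while the head is a different token that is still inside the span.
    while heads[i] != i and start <= heads[i] < end:
        i = heads[i]
    return i

words = ["Apple", "Corp.", "buys", "Alphabet", "Inc."]
heads = [1, 2, 2, 4, 2]  # assumed parse: "buys" is the sentence root

subject_root = span_root(heads, 0, 2)  # span "Apple Corp."
object_root = span_root(heads, 3, 5)   # span "Alphabet Inc."
print(words[subject_root], words[object_root])
```

spaCy also exposes a `root` attribute on spans (`ent.root`), which serves the same purpose.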

Extract semantic relations: pairs of entities that are the subject and the direct object of the same verb.

In [None]:
for ent1 in doc.ents:
    root1 = root(ent1)
    for ent2 in doc.ents:
        root2 = root(ent2)
        same_verb = root1.head == root2.head and root1.head.pos_ == 'VERB'
        if same_verb and root1.dep_ == 'nsubj' and root2.dep_ == 'dobj':
            print('[{}]-[{}]-[{}]'.format(ent1, root1.head.lemma_, ent2))
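
The matching rule above can be illustrated on plain Python data. In this sketch the entity records and the `verbs` mapping are hand-written assumptions standing in for what the parser would provide; only the rule itself (same verbal head, one `nsubj`, one `dobj`) mirrors the loop above.

```python
from itertools import permutations

# Hypothetical parser output: token index of each verb -> its lemma.
verbs = {2: "buy"}

# Hypothetical entities, each described by the head and dependency label of its root.
entities = [
    {"text": "Apple Corp.", "head": 2, "dep": "nsubj"},
    {"text": "Alphabet Inc.", "head": 2, "dep": "dobj"},
    {"text": "$1 billion", "head": 5, "dep": "pobj"},
]

# Keep ordered pairs whose roots attach to the same verb as subject and object.
triples = [
    (e1["text"], verbs[e1["head"]], e2["text"])
    for e1, e2 in permutations(entities, 2)
    if e1["head"] == e2["head"]
    and e1["head"] in verbs
    and e1["dep"] == "nsubj"
    and e2["dep"] == "dobj"
]
print(triples)
```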

## Exploration 3: Topic modelling

Import gensim.

In [None]:
import gensim

Disable some warnings.

In [None]:
import warnings

warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

Build the vocabulary and show its size.

In [None]:
dictionary = gensim.corpora.Dictionary(tokens(text) for text in df['text'])
len(dictionary)

Filter out stop words, as well as extremely frequent/infrequent words.

In [None]:
bad_ids = [i for t, i in dictionary.token2id.items() if nlp.vocab[t].is_stop]
dictionary.filter_tokens(bad_ids=bad_ids)
dictionary.filter_extremes()
len(dictionary)
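
Called without arguments, `filter_extremes` uses gensim's documented defaults (`no_below=5`, `no_above=0.5`): it keeps tokens that occur in at least 5 documents and in at most half of all documents. The following is a simplified sketch of that document-frequency filter in plain Python; `filter_by_doc_freq` is an illustrative name, and the sketch leaves out the `keep_n` cap that gensim also applies.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=5, no_above=0.5):
    """Return the tokens whose document frequency lies within the given bounds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = int(no_above * len(docs))
    return {t for t, n in doc_freq.items() if no_below <= n <= max_docs}

# Hypothetical tokenized documents; thresholds lowered to suit the tiny example.
docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(kept)
```

Here `a` is dropped because it appears in every document, and `c` and `d` because they appear in only one.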

Create an iterable that streams the data as bag-of-words vectors, one document at a time, so that the whole corpus never has to be held in memory.

In [None]:
class MyCorpus(object):
    def __iter__(self):
        for text in df['text']:
            yield dictionary.doc2bow(tokens(text))
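
Two details are worth spelling out. First, a bag-of-words vector is a sorted list of `(token_id, count)` pairs, with tokens outside the vocabulary ignored. Second, the corpus is wrapped in a class with an `__iter__` method rather than a plain generator because the model makes several passes over the data, and a generator is exhausted after the first one. The sketch below mimics both points in plain Python; `toy_token2id`, `toy_doc2bow`, and `ToyCorpus` are made-up stand-ins, not gensim objects.

```python
from collections import Counter

toy_token2id = {"good": 0, "movie": 1, "plot": 2}  # hypothetical vocabulary
toy_texts = ["good movie good actor", "movie plot"]  # hypothetical documents

def toy_doc2bow(doc_tokens):
    # Count known tokens by id; unknown tokens such as "actor" are skipped.
    counts = Counter(toy_token2id[t] for t in doc_tokens if t in toy_token2id)
    return sorted(counts.items())

class ToyCorpus:
    def __iter__(self):
        # A fresh generator is created on every call, so the corpus can be re-read.
        for text in toy_texts:
            yield toy_doc2bow(text.split())

corpus = ToyCorpus()
first_pass = list(corpus)
second_pass = list(corpus)  # same content again

generator = (toy_doc2bow(text.split()) for text in toy_texts)
gen_first = list(generator)
gen_second = list(generator)  # empty: the generator is already exhausted

print(first_pass, second_pass == first_pass, gen_second)
```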

Build the LDA model (takes a while).

In [None]:
lda = gensim.models.ldamodel.LdaModel(
    corpus=MyCorpus(),
    num_topics=7,
    id2word=dictionary,
    chunksize=5,
    passes=10,
    update_every=1,
    alpha='auto',
)

Print all 7 topics, each with its highest-weighted words.

In [None]:
lda.print_topics(7)

Load the pyLDAvis library for data visualization.

In [None]:
import pyLDAvis

pyLDAvis.enable_notebook()

Visualize the LDA model.

In [None]:
import pyLDAvis.gensim_models

pyLDAvis.gensim_models.prepare(lda, list(MyCorpus()), dictionary, mds='tsne')

That's all, folks!