### Text mining with NLTK

NLTK (Natural Language Toolkit) and spaCy are two Python libraries often used for textual analysis. NLTK has been around for quite some time. It is a huge library, with many of its use cases well documented.

spaCy is the "new kid on the block". Its documentation is not as thorough as NLTK's, but its website offers code examples and some good tutorials can be found online. It seems to lean a little more towards easy integration of machine learning somewhere in the pipeline.

For most code examples we will use NLTK.

In [None]:
import nltk

In [None]:
nltk.corpus.gutenberg.fileids()

In [None]:
# The full path nltk.corpus.gutenberg is too long to type repeatedly, so we import the corpus under a short alias
from nltk.corpus import gutenberg as gb
gb.fileids()

Starting with NLTK and its notion of a corpus, we are now quite flexible as to how precisely we solve our problems.

First of all, we can decide whether we want to work with one text, a couple of texts or all of them.

In [None]:
# If we want to work with all of the texts shown above, we would wrap our code in a for-loop:
print("word sent vocab\tFILE\n")
for fileid in gb.fileids():
    num_chars = len(gb.raw(fileid))
    num_words = len(gb.words(fileid))
    num_sents = len(gb.sents(fileid))
    num_vocab = len(set(w.lower() for w in gb.words(fileid)))
    print(round(num_chars/num_words),  # Average word length (raw characters per token, so whitespace is counted too)
          round(num_words/num_sents),  # Average sentence length in tokens
          round(num_words/num_vocab),  # Average number of times each vocabulary item appears in the text
          fileid)

In [None]:
# But we can also choose to start with one text file:
emma = gb.raw('austen-emma.txt')
len(emma)

In [None]:
# Here we use the sents method directly on the Gutenberg corpus with the fileid of the text we are interested in.
tokenized_emma_sents = gb.sents('austen-emma.txt')

Let's see what we have here:

In [None]:
for sent in tokenized_emma_sents[0:5]:
    print('Sentence: ')
    print(sent)
    print()

In this notebook we will use the raw text files from the Gutenberg collection as our textual sources, but the good news is that the NLTK library also provides the methods to get started quickly with your own corpus.

In [None]:
# Example setup could be as follows:
from nltk.corpus import PlaintextCorpusReader

CORPUS_ROOT = "/Users/peter/Documents/repub/wip/Samare/data/testing"

contents = PlaintextCorpusReader(CORPUS_ROOT, r'.*\.txt')  # raw string, so that \. reaches the regex engine unchanged
for fileid in contents.fileids():
    # Do something useful
    print(fileid)

In [None]:
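# A spaCy example (we can use the NLTK groundwork)
import spacy
# The 'en' shortcut only works in spaCy 2.x; from spaCy 3 on, load the model by its full name
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

"""
A lot is happening after we declared the previous two statements:
- we told spaCy that we are going to use an English model
- and we will use that model's processing pipeline (which includes a tagger, a parser and NER)
"""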

blake = gb.raw('blake-poems.txt')
blake_process = blake[:100]
doc = nlp(blake_process)
for token in doc:
    print(token.text, token.lemma_, token.pos_)    

spaCy comes with a lot of batteries included (pipelines, word vectors, etc.). If you are into hard-core text and data mining, it might be worthwhile to dive in.

Let's move on to something more basic: getting rid of things we do not need when working with texts, like stopwords.

In [None]:
from nltk.tokenize import word_tokenize
text = ("Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to "
        "unite some of the best blessings of existence; and had lived nearly twenty-one years in the world "
        "with very little to distress or vex her.")
tokens = word_tokenize(text)
print(tokens)

In [None]:
english_stopwords = set(nltk.corpus.stopwords.words('english'))
content_tokens = [token for token in tokens if token.lower() not in english_stopwords]
print(content_tokens)

From here we can go in several directions. Which direction is best depends on the problem(s) we want to solve.

- We could analyse the narrative of a novel, plotting the appearance of main characters in chapters.
- We could construct so-called "synopsis documents" using certain characteristics of the words used in documents: frequently used terms, frequently used bigrams and trigrams, hapaxes (words that occur only once), etc.
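
The first idea above can be sketched with nothing but the standard library. This is only an illustration under assumptions: the toy text, the `CHAPTER` heading pattern and the `characters` list are all made up for the example, and a real novel would need a more careful split.

```python
import re
from collections import Counter

# Toy text standing in for a novel; chapter headings are assumed to look like "CHAPTER I".
text = """CHAPTER I
Emma met Harriet. Emma spoke to Mr. Knightley.
CHAPTER II
Harriet wrote to Emma. Harriet smiled.
CHAPTER III
Knightley visited Emma."""

characters = ["Emma", "Harriet", "Knightley"]  # hypothetical list of names to track

# Split on the chapter headings and drop the empty piece before the first heading.
chapters = [part for part in re.split(r"CHAPTER [IVXLC]+\n", text) if part.strip()]

# For every chapter, count how often each character name occurs.
appearances = []
for chapter in chapters:
    words = Counter(re.findall(r"[A-Za-z]+", chapter))
    appearances.append({name: words[name] for name in characters})

for number, counts in enumerate(appearances, start=1):
    print(number, counts)
```

The resulting list of per-chapter counts is exactly the kind of table one would hand to a plotting library.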

In [None]:
fdist = nltk.FreqDist()
emma_words = gb.words('austen-emma.txt')
print(len(emma_words))

In [None]:
# Let's do the usual pre-processing:
# keep alphabetic tokens only (this also drops numbers and punctuation) and lowercase them
e_words = [word.lower() for word in emma_words if word.isalpha()]
e_words = [word for word in e_words if len(word) > 2]
# filter stopwords after lowercasing, otherwise capitalised ones like "The" slip through
e_words = [word for word in e_words if word not in english_stopwords]
# we could add a filter for stemming words, but ...
len(e_words)

In [None]:
# Let's calculate the 25 most used words in Jane Austen's "Emma".
fdist = nltk.FreqDist(e_words)
for word, frequency in fdist.most_common(25):
    print('{};{}'.format(word, frequency))

In [None]:
fdist.hapaxes()[:10]
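
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the two ideas used above (most common words and hapaxes) can be illustrated on a toy word list with the standard library alone. The word list is invented for the example.

```python
from collections import Counter

words = ["emma", "could", "not", "emma", "would", "not", "harriet"]

counts = Counter(words)

# The two most frequent words, as (word, count) pairs.
most_common = counts.most_common(2)

# Hapaxes are the words that occur exactly once.
hapaxes = sorted(word for word, n in counts.items() if n == 1)

print(most_common)
print(hapaxes)
```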

In [None]:
# We can fish for all bigrams very easily with the bigrams function, but note that this simply pairs each word
# with its successor, so consecutive bigrams overlap by one word
list(nltk.bigrams(e_words))[:10]

In [None]:
# And the same goes for naive trigram fishing:
list(nltk.trigrams(e_words))[:10]
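
What "naive" means here is easy to see in a small sketch: an n-gram list is just every run of n consecutive tokens. The helper `ngrams` below is a hypothetical stand-in written for this illustration, not NLTK's own implementation.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["very", "little", "to", "distress", "or", "vex", "her"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])
print(trigrams[:3])
```

No pair is preferred over any other; deciding which ones are interesting requires an association measure.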

In [None]:
# We can do better:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(e_words)
bigram_finder.apply_freq_filter(10)
print(bigram_finder.nbest(bigram_measures.pmi, 10))

In [None]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
trigram_finder = nltk.collocations.TrigramCollocationFinder.from_words(e_words)
trigram_finder.apply_freq_filter(10)
print(trigram_finder.nbest(trigram_measures.pmi, 10))
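
To get a feeling for what the PMI (pointwise mutual information) score rewards, here is a minimal sketch of the textbook formula on an invented token list. It is an assumption that this matches NLTK's counting in every detail; the point is only the principle that a pair scores high when its words rarely occur apart.

```python
import math
from collections import Counter

tokens = ("miss bates said miss bates was happy and "
          "miss taylor was happy and frank said so").split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_words = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(first, second):
    """Textbook PMI: log2( P(first, second) / (P(first) * P(second)) )."""
    p_pair = bigram_counts[(first, second)] / n_bigrams
    p_first = word_counts[first] / n_words
    p_second = word_counts[second] / n_words
    return math.log2(p_pair / (p_first * p_second))

# Only score bigrams seen at least twice, mirroring the idea behind apply_freq_filter.
scores = {pair: pmi(*pair) for pair, count in bigram_counts.items() if count >= 2}
ranked = sorted(scores, key=scores.get, reverse=True)

for pair in ranked:
    print(pair, round(scores[pair], 3))
```

In this toy data "miss bates" ranks below "was happy" because "miss" also occurs with "taylor". Without the frequency filter, pairs seen only once would dominate the ranking, which is why the finders above filter first.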

Just as we used the PMI measure to pick out the most strongly associated bigrams and trigrams, we can use tf-idf to find the terms that characterize a document: terms that are frequent in that document, weighed against the inverse of their document frequency within the corpus.

For this we need to import another library: gensim.

For the following examples the scenario is that we use tf-idf to characterize the documents within a corpus, in other words to build a model of them. Whenever we come across a new document, we can check whether we already have "similar" documents.

Tf-idf works well as long as the corpus is not too large. Word2vec and Doc2vec come into play as corpora grow larger.
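
Before handing the work to a library, the weighting itself can be sketched by hand. The three tiny "documents" are invented, and the exact variant used here (raw term count times log2 of N over document frequency, then scaled to unit length) is an assumption about a common default, not a statement of what any particular library does.

```python
import math
from collections import Counter

docs = {
    "emma": "emma was happy and emma was clever".split(),
    "poems": "the lamb was happy and the tiger was bright".split(),
    "paradise": "the serpent was clever and the garden was bright".split(),
}

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

def tf_idf(tokens):
    """Raw term count times log2(N / document frequency), scaled to unit length."""
    counts = Counter(tokens)
    weights = {t: c * math.log2(len(docs) / doc_freq[t]) for t, c in counts.items()}
    # Terms that occur in every document get weight 0 and are dropped.
    weights = {t: w for t, w in weights.items() if w > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

emma_weights = tf_idf(docs["emma"])
print(sorted(emma_weights.items(), key=lambda item: -item[1]))
```

Words such as "was" and "and" appear in every document and vanish, while "emma", which is frequent in one document only, ends up with the highest weight.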

In [None]:
import gensim
raw_documents = [gb.raw('austen-emma.txt'), gb.raw('blake-poems.txt'), gb.raw('milton-paradise.txt')]
def get_tokens(text):
    tokens = word_tokenize(text)
    return tokens

In [None]:
gen_docs = [get_tokens(text) for text in raw_documents]
dictionary = gensim.corpora.Dictionary(gen_docs)
num_words = len(dictionary)
print("Number of words in dictionary: {}".format(num_words))
#for idx,word in dictionary.items():
    #print(idx,word)

In [None]:
# Let's have a quick look:
print(dictionary[18])
print(dictionary.id2token[18])
print(dictionary.token2id['comfortable'])

In [None]:
bow_doc = dictionary.doc2bow(['I', 'love', 'tacos'])
print(bow_doc)

In [None]:
print(dictionary.id2token[951])

In [None]:
# We create a corpus: a list of bags of words
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
# print(corpus)
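
A bag of words is nothing more than a list of (token id, count) pairs. The sketch below rebuilds the idea with the standard library; `token2id` and `to_bow` are hypothetical names for this illustration, and the ids are assigned in order of first appearance, which need not match gensim's numbering.

```python
from collections import Counter

docs = [["i", "love", "emma"], ["i", "love", "love", "poems"]]

# Assign each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    """Return sorted (token_id, count) pairs; tokens not in the mapping are skipped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(["i", "love", "love", "tacos"])
print(bow)
```

Note that "tacos" silently disappears because it is not in the mapping, and that word order is lost entirely.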

In [None]:
# Now we can use the corpus to create a tf-idf model (num_nnz is the number of non-zero entries in the bag-of-words matrix)
tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)

In [None]:
# Let's have a closer look at what we have:
# print(gen_docs[0])
# print(corpus[0])
# print(tf_idf[corpus][0])

In [None]:
# Ok, so far so good
# Let's take a new document and process that
bow = dictionary.doc2bow(get_tokens(gb.raw('austen-sense.txt')))
query_doc_tf_idf = tf_idf[bow]

In [None]:
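# The first argument is the prefix for the index files that gensim writes to disk; adjust the path to your own machine
sims = gensim.similarities.Similarity('/Users/peter/',tf_idf[corpus],
                                      num_features=len(dictionary))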
print(sims)

In [None]:
sims[query_doc_tf_idf]
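
The numbers returned by the similarity query are cosine similarities between the query vector and each document vector in the index. A minimal hand-rolled version, with invented weights and a hypothetical `cosine` helper, looks like this:

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented tf-idf style vectors for two indexed documents and one query document.
corpus_vectors = {
    "doc_a": {"emma": 0.9, "clever": 0.4},
    "doc_b": {"lamb": 0.7, "tiger": 0.7},
}
query = {"emma": 0.8, "clever": 0.6}

similarities = {name: cosine(query, vec) for name, vec in corpus_vectors.items()}
print(similarities)
```

A score close to 1 means the two documents weight the same terms in similar proportions; a score of 0 means they share no weighted terms at all.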

What comes next?

Next comes Doc2vec, which performs really well with larger corpora (word2vec and doc2vec are widely used in unsupervised machine learning). Tf-idf scores well when the amount of data is smaller.

And then there are of course strategies like automatic classification of documents, clustering, topic maps, and so on.

As always, one should use techniques and libraries that are sound solutions for the problems one wants to solve. Making things very complicated is the easiest thing to do, but elegant and simple solutions to the problems at hand are the goal here.