# Topic Modeling Text Preprocessing

Steps involved in preprocessing a corpus of unstructured text documents using *scikit-learn* for topic modelling. Adapted from a [tutorial by Derek Greene](https://github.com/derekgreene/topic-model-tutorial).

### Loading the Documents

Our sample corpus is a collection of news articles gathered in 2016. The articles are stored in a single file, formatted so that one article appears on each line. We will load them into a list and also create a short snippet of text for each document.

In [55]:
import os.path
raw_documents = []
snippets = []
with open( os.path.join("data", "articles.txt"), "r" ) as fin:
    # iterate over the file directly rather than reading all lines first
    for line in fin:
        text = line.strip()
        raw_documents.append( text )
        # keep a short snippet of up to 100 characters as a title for each article
        snippets.append( text[:100] )
print(f"Read {len(raw_documents)} raw text documents")

Read 4551 raw text documents


Print the first 200 characters of the first document:

In [54]:
print(raw_documents[0][:200])

Barclays' defiance of US fines has merit Barclays disgraced itself in many ways during the pre-financial crisis boom years. So it is tempting to think the bank, when asked by US Department of Justice 


### Creating a Document-Term Matrix

When preprocessing text, a common approach is to remove non-informative stopwords. The choice of stopwords can have a considerable impact on the topics found later. We will use a custom stopword list:

In [2]:
custom_stop_words = []
with open( "stopwords.txt", "r" ) as fin:
    for line in fin.readlines():
        custom_stop_words.append( line.strip() )
print("Stopword list has %d entries" % len(custom_stop_words) )

Stopword list has 350 entries


Print the first 10 stopwords:

In [56]:
print(custom_stop_words[:10])

['a', 'about', 'above', 'according', 'across', 'actually', 'adj', 'after', 'afterwards', 'again']
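
To make the effect of stopword removal concrete, here is a minimal sketch in plain Python. The tiny stopword set and the sentence are made up for illustration, `remove_stop_words` is a hypothetical helper, and splitting on whitespace is a simplification of the tokenization that *CountVectorizer* performs.

```python
# Hypothetical mini stopword list; the real one is read from stopwords.txt
toy_stop_words = {"a", "about", "the", "of", "in", "is"}

def remove_stop_words(text, stop_words):
    # lowercase and split on whitespace (a simplification of real tokenization)
    tokens = text.lower().split()
    # keep only tokens that are not in the stopword list
    return [tok for tok in tokens if tok not in stop_words]

kept = remove_stop_words("The cost of a fine is about the bank", toy_stop_words)
print(kept)
```

Only the content-bearing words survive the filter, which is why a list that is too aggressive or too lenient changes what any later model gets to see.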


In the *bag-of-words model*, each document is represented by a vector in an *m*-dimensional coordinate space, where *m* is the number of unique terms across all documents. This set of terms is called the corpus *vocabulary*.
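
As a rough illustration of this representation, the sketch below builds term-count vectors by hand for two made-up documents. The names `toy_docs` and `to_vector` are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Two made-up documents, already lowercased; tokens are split on whitespace
toy_docs = ["bank fines bank", "film review"]

# The vocabulary is the sorted set of unique terms across all documents
vocabulary = sorted({tok for doc in toy_docs for tok in doc.split()})

def to_vector(doc, vocabulary):
    # count how often each vocabulary term occurs in this document
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc, vocabulary) for doc in toy_docs]
print(vocabulary)
for vec in vectors:
    print(vec)
```

Here *m* is 4, so each document becomes a vector of four counts, and word order within a document is discarded.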

Since each document can be represented as a term vector, we can stack these vectors to create a full *document-term matrix*. We can easily create this matrix from a list of document strings using *CountVectorizer* from scikit-learn. The parameters passed to *CountVectorizer* control the preprocessing steps that it performs.

In [58]:
from sklearn.feature_extraction.text import CountVectorizer
# use a custom stopword list, and keep only terms that appear in at least 40 documents
vectorizer = CountVectorizer(stop_words = custom_stop_words, min_df = 40)
A = vectorizer.fit_transform(raw_documents)
print(f"Created {A.shape[0]} X {A.shape[1]} document-term matrix")

Created 4551 X 6263 document-term matrix
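
The `min_df` filter can be sketched in plain Python as follows. This is an illustration on made-up documents, not scikit-learn's implementation; `toy_docs` and `toy_min_df` are hypothetical names, and the threshold of 2 replaces the value of 40 used above. An integer `min_df` is an absolute number of documents, whereas a float would be read as a proportion of the corpus.

```python
from collections import Counter

toy_docs = [
    "trump campaign rally",
    "trump campaign poll",
    "film review",
]
toy_min_df = 2  # hypothetical threshold; the notebook uses min_df=40

# document frequency: the number of documents containing each term at least once
doc_freq = Counter(tok for doc in toy_docs for tok in set(doc.split()))

# keep only terms whose document frequency reaches the threshold
kept_terms = sorted(t for t, df in doc_freq.items() if df >= toy_min_df)
print(kept_terms)
```

Terms that occur in only one document are dropped, which is how a threshold like this keeps the vocabulary to a manageable size.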


In [37]:
print(A[0,:].shape)
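# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out()[3000:3020].tolist())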

(1, 6263)
['innovative', 'inquiry', 'inside', 'insiders', 'insight', 'insist', 'insisted', 'insistence', 'insisting', 'insists', 'inspiration', 'inspire', 'inspired', 'inspiring', 'instagram', 'installed', 'instance', 'instant', 'instantly', 'instinct']


This process also builds a vocabulary for the corpus:

In [63]:
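# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out().tolist()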
print(f"Vocabulary has {len(terms)} distinct terms")
print(f"Examples: 100:{terms[100]}; 2000:{terms[2000]}; 4000:{terms[4000]}; 6000:{terms[6000]}")

Vocabulary has 6263 distinct terms
Examples: 100:30pm; 2000:embrace; 4000:palace; 6000:verge


We can persist this document-term matrix, the terms, and the snippets for later use with *Joblib*.

In [43]:
import joblib
joblib.dump((A,terms,snippets), "articles-raw.pkl") 

['articles-raw.pkl']
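
Reading the data back later is a matter of calling `joblib.load` and unpacking the tuple. The round trip below is a small sketch that uses stand-in values and a hypothetical file name in a temporary directory, so it does not touch the real *articles-raw.pkl*.

```python
import os
import tempfile
import joblib

# Stand-in values; in the notebook these are the matrix A, the terms, and the snippets
toy_data = ([[1, 0], [0, 2]], ["bank", "film"], ["snippet one", "snippet two"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-raw.pkl")  # hypothetical file name
    joblib.dump(toy_data, path)
    # joblib.load returns the same tuple, so it can be unpacked directly
    (A_loaded, terms_loaded, snippets_loaded) = joblib.load(path)

print(terms_loaded)
```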

### Applying Term Weighting with TF-IDF

We can improve the usefulness of the document-term matrix by giving more weight to the more "important" terms. The most common normalisation is *term frequency–inverse document frequency* (TF-IDF). In scikit-learn, we can generate a TF-IDF-weighted document-term matrix by using *TfidfVectorizer* in place of *CountVectorizer*.

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
# we can pass in the same preprocessing parameters
vectorizer = TfidfVectorizer(stop_words=custom_stop_words, min_df = 40)
A = vectorizer.fit_transform(raw_documents)
print( "Created %d X %d TF-IDF-normalized document-term matrix" % (A.shape[0], A.shape[1]) )

Created 4551 X 6263 TF-IDF-normalized document-term matrix
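
To give a feel for what the weighting does, the sketch below computes TF-IDF by hand on three made-up tokenized documents. It assumes the documented *TfidfVectorizer* defaults (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), under which the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and each document vector is scaled to unit length. The names `toy_docs` and `tfidf_vector` are hypothetical, and this is an illustration rather than the library's code.

```python
import math
from collections import Counter

# Made-up tokenized documents
toy_docs = [["trump", "campaign", "trump"], ["eu", "campaign"], ["eu", "uk"]]
n_docs = len(toy_docs)

# document frequency of each term
doc_freq = Counter(tok for doc in toy_docs for tok in set(doc))

def tfidf_vector(doc):
    counts = Counter(doc)
    # term frequency times smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    # scale the vector to unit Euclidean (L2) length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

toy_tfidf = [tfidf_vector(doc) for doc in toy_docs]
for vec in toy_tfidf:
    print({term: round(w, 3) for term, w in vec.items()})
```

In the third document, "uk" receives more weight than "eu" even though both occur once, because "uk" appears in fewer documents overall.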


In [65]:
# extract the resulting vocabulary
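# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out().tolist()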
print(f"Vocabulary has {len(terms)} distinct terms")

Vocabulary has 6263 distinct terms


In [48]:
print(terms[:5])

['000', '10', '100', '100m', '10m']


A simple way to characterise the corpus is to find the terms with the highest total weight, summed over all documents in the document-term matrix. We can define such a function as follows:

In [49]:
import operator
def rank_terms( A, terms ):
    # get the sums over each column
    sums = A.sum(axis=0)
    # map weights to the terms
    weights = {}
    for col, term in enumerate(terms):
        weights[term] = sums[0,col]
    # rank the terms by their weight over all documents
    return sorted(weights.items(), key=operator.itemgetter(1), reverse=True)

We can now display a ranking of the top 20 terms, which gives us a very rough sense of the content of the document collection:

In [75]:
ranking = rank_terms( A, terms )
for i, pair in enumerate( ranking[0:20] ):
    print(f"{i+1:02d}: {pair[0]} ({pair[1]:.1f})")

01: trump (12150.0)
02: people (9524.0)
03: time (6714.0)
04: eu (5774.0)
05: uk (5293.0)
06: back (4771.0)
07: get (4410.0)
08: before (4171.0)
09: way (4056.0)
10: clinton (3974.0)
11: campaign (3916.0)
12: government (3686.0)
13: think (3673.0)
14: world (3651.0)
15: going (3559.0)
16: work (3487.0)
17: against (3423.0)
18: right (3329.0)
19: film (3321.0)
20: good (3263.0)


As before, we will save this document-term matrix, the terms, and the snippets with *Joblib* for topic modelling later.

In [51]:
joblib.dump((A,terms,snippets), "articles-tfidf.pkl") 

['articles-tfidf.pkl']