# Text Preprocessing

In this notebook we will look at the steps involved in preprocessing a corpus of unstructured text documents using *scikit-learn*. We will use the output later for topic modelling.

In [1]:
from pathlib import Path
import operator, joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Loading the Documents

As our sample corpus, we will use a set of news articles collected in 2016. The articles are stored in a single file, with one article per line. We will load them into a list and also create a short snippet of text for each document.

In [2]:
in_path = Path("data") / "articles.txt"
raw_documents = []
snippets = []
with open(in_path, "r") as fin:
    for line in fin:
        text = line.strip()
        raw_documents.append( text )
        # keep a short snippet of up to 100 characters as a title for each article
        snippets.append( text[:100] )
print("Read %d raw text documents" % len(raw_documents))

Read 4551 raw text documents


### Creating a Document-Term Matrix

When preprocessing text, a common approach is to remove non-informative stopwords. The choice of stopwords can have a considerable impact on later results, so we will use a custom stopword list:

In [3]:
custom_stop_words = []
with open( "stopwords.txt", "r" ) as fin:
    for line in fin:
        custom_stop_words.append( line.strip() )
# keep the stopwords as a list, which is what we pass to the vectorizer below
print("Stopword list has %d entries" % len(custom_stop_words) )

Stopword list has 350 entries


In the *bag-of-words model*, each document is represented by a vector in an *m*-dimensional coordinate space, where *m* is the number of unique terms across all documents. This set of terms is called the corpus *vocabulary*.
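As a minimal sketch of the idea, the following builds bag-of-words vectors by hand for a three-document toy corpus, using only the standard library. The toy documents and the names `toy_docs`, `vocabulary`, and `vectors` are illustrative assumptions and are not part of the news corpus. Tokenisation here is plain whitespace splitting, which is simpler than what *CountVectorizer* does.

```python
from collections import Counter

# hypothetical toy corpus, used only for illustration
toy_docs = [
    "bank raises interest rates",
    "bank cuts interest rates",
    "film wins award",
]

# naive whitespace tokenisation (an assumption; real tokenisers are more careful)
tokenised = [doc.split() for doc in toy_docs]

# the vocabulary is the sorted set of unique terms across all documents
vocabulary = sorted({term for doc in tokenised for term in doc})
m = len(vocabulary)

# each document becomes an m-dimensional vector of term counts
vectors = []
for doc in tokenised:
    counts = Counter(doc)
    vectors.append([counts[term] for term in vocabulary])

print("Vocabulary (m = %d): %s" % (m, vocabulary))
for doc, vector in zip(toy_docs, vectors):
    print(vector, doc)
```

Each position in a vector corresponds to one vocabulary term, so most entries are zero even in this tiny example.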

Since each document can be represented as a term vector, we can stack these vectors to create a full *document-term matrix*, with one row per document and one column per term. We can easily create this matrix from a list of document strings using *CountVectorizer* from scikit-learn. The parameters passed to *CountVectorizer* control the preprocessing steps that it performs.

In [4]:
# use the custom stopword list, and keep only terms that appear in at least 20 documents
vectorizer = CountVectorizer(stop_words = custom_stop_words, min_df = 20)
A = vectorizer.fit_transform(raw_documents)
print( "Created %d X %d document-term matrix" % (A.shape[0], A.shape[1]) )

Created 4551 X 10285 document-term matrix


This process also builds a vocabulary for the corpus:

In [5]:
terms = list(vectorizer.get_feature_names_out())
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 10285 distinct terms


We can use *Joblib* to save this document-term matrix, the terms, and the snippets for later use.

In [6]:
joblib.dump((A,terms,snippets), "articles-raw.pkl") 

['articles-raw.pkl']
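When the data is needed again, it can be read back with `joblib.load`, which returns the same tuple that was passed to `joblib.dump`. The sketch below shows the round trip on small stand-in objects written to a temporary directory. The `toy_*` names and the file name `toy-raw.pkl` are assumptions for illustration, and a plain nested list stands in for the sparse matrix.

```python
import tempfile
from pathlib import Path

import joblib

# hypothetical stand-ins for the matrix, terms, and snippets
toy_matrix = [[2, 0], [0, 1]]
toy_terms = ["bank", "film"]
toy_snippets = ["Bank raises rates", "Film wins award"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = str(Path(tmp_dir) / "toy-raw.pkl")
    # persist the three objects together as one tuple
    joblib.dump((toy_matrix, toy_terms, toy_snippets), out_path)
    # unpack them in the same order when loading
    loaded_matrix, loaded_terms, loaded_snippets = joblib.load(out_path)

print("Loaded %d terms and %d snippets" % (len(loaded_terms), len(loaded_snippets)))
```

Because the objects are stored as a tuple, the order used when unpacking must match the order used when saving.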

### Applying Term Weighting with TF-IDF

We can improve the usefulness of the document-term matrix by giving more weight to the more "important" terms, typically those that occur often within a document but in relatively few documents overall. The most common weighting scheme is *term frequency–inverse document frequency* (TF-IDF). In scikit-learn, we can generate a TF-IDF weighted document-term matrix by using *TfidfVectorizer* in place of *CountVectorizer*.

In [7]:
# we can pass in the same preprocessing parameters
vectorizer = TfidfVectorizer(stop_words=custom_stop_words, min_df = 20)
A = vectorizer.fit_transform(raw_documents)
print( "Created %d X %d TF-IDF-normalized document-term matrix" % (A.shape[0], A.shape[1]) )

Created 4551 X 10285 TF-IDF-normalized document-term matrix
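To make the weighting concrete, the sketch below computes TF-IDF weights by hand for a hypothetical toy corpus. It assumes the default settings of *TfidfVectorizer* as documented by scikit-learn: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document vector. The toy documents and the helper name `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# hypothetical pre-tokenised toy corpus, used only for illustration
toy_docs = [
    ["bank", "rates", "bank"],
    ["bank", "film"],
    ["film", "award"],
]
n_docs = len(toy_docs)
vocabulary = sorted({term for doc in toy_docs for term in doc})

# document frequency: the number of documents containing each term
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# smoothed inverse document frequency (assumed scikit-learn default)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_vector(doc):
    """Return the L2-normalised TF-IDF vector for one tokenised document."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

rows = [tfidf_vector(doc) for doc in toy_docs]
for doc, row in zip(toy_docs, rows):
    print(doc, ["%.2f" % w for w in row])
```

Terms that appear in fewer documents (here "rates" and "award") receive a larger IDF than terms that appear in more documents, and every row has unit length after normalisation.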


In [8]:
terms = list(vectorizer.get_feature_names_out())
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 10285 distinct terms


One simple way to characterise the corpus is to look at the terms with the highest total TF-IDF weight across all documents in the document-term matrix. We can define a function to do this as follows:

In [9]:
def rank_terms( A, terms ):
    # get the sums over each column
    sums = A.sum(axis=0)
    # map weights to the terms
    weights = {}
    for col, term in enumerate(terms):
        weights[term] = sums[0,col]
    # rank the terms by their weight over all documents
    return sorted(weights.items(), key=operator.itemgetter(1), reverse=True)

We can now display a ranking of the top 20 terms, which gives us a very rough sense of the content of the document collection:

In [10]:
ranking = rank_terms(A, terms)
for i, pair in enumerate(ranking[0:20]):
    print( "%02d. %s (%.2f)" % (i+1, pair[0], pair[1] ))

01. trump (190.87)
02. people (109.90)
03. eu (109.17)
04. film (91.35)
05. uk (89.10)
06. bank (78.69)
07. time (76.12)
08. brexit (67.98)
09. health (61.59)
10. government (60.24)
11. back (60.24)
12. clinton (59.95)
13. get (57.37)
14. world (56.85)
15. campaign (56.35)
16. women (55.82)
17. way (54.95)
18. before (54.60)
19. vote (54.33)
20. work (54.04)
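The same ranking can also be computed without building a Python dictionary. The following is one possible sketch using NumPy, checked here on a tiny hand-made sparse matrix. The function name `rank_terms_vectorised` and the toy values are hypothetical and are not part of the original notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

def rank_terms_vectorised(matrix, terms):
    """Rank terms by their summed weight over all documents (hypothetical helper)."""
    # the column sums come back as a 1 x m matrix, so flatten to a 1-D array
    col_sums = np.asarray(matrix.sum(axis=0)).ravel()
    # sort column indices by descending weight; a stable sort keeps ties in column order
    order = np.argsort(-col_sums, kind="stable")
    return [(terms[i], float(col_sums[i])) for i in order]

# toy weighted document-term matrix: 2 documents x 3 terms (illustrative values)
toy_matrix = csr_matrix([[0.5, 0.0, 0.25],
                         [0.5, 0.25, 0.5]])
toy_terms = ["bank", "film", "vote"]

toy_ranking = rank_terms_vectorised(toy_matrix, toy_terms)
for i, (term, weight) in enumerate(toy_ranking):
    print("%02d. %s (%.2f)" % (i + 1, term, weight))
```

This returns the same list of `(term, weight)` pairs as a dictionary-based ranking, so the display loop does not need to change.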


Again, we will use *Joblib* to save this document-term matrix, the terms, and the snippets for topic modelling later.

In [11]:
joblib.dump((A,terms,snippets), "articles-tfidf.pkl") 

['articles-tfidf.pkl']