# Text Preprocessing

In this notebook we look at the steps involved in preprocessing a corpus of unstructured text documents with *scikit-learn*. The preprocessed data will be used later for topic modelling.

### Loading the Documents

As our sample corpus, we will use a collection of Spanish-language tweets about International Women's Day 2020 (8M). The tweets are stored in a single file, formatted so that one tweet appears on each line. We will load them into a list, and also create a short snippet of text for each document.

In [1]:
raw_documents = []
snippets = []
# assumes the file is UTF-8 encoded; adjust the encoding if it is not
with open("cenario1_8M2020_tweets_es.txt", "r", encoding="utf-8") as fin:
    for line in fin:
        text = line.strip()
        raw_documents.append(text)
        # keep a short snippet of up to 100 characters as a title for each document
        snippets.append(text[:100])
print("Read %d raw text documents" % len(raw_documents))

Read 1897848 raw text documents


### Creating a Document-Term Matrix

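When preprocessing text, a common approach is to remove non-informative stopwords, that is, very frequent function words such as articles and prepositions. The choice of stopwords can have a considerable impact on the topics produced later. We will use a custom Spanish stopword list: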

In [2]:
custom_stop_words = []

# assumes the stopword file is UTF-8 encoded, with one stopword per line
with open("spanish.txt", "r", encoding="utf-8") as fin:
    for line in fin:
        custom_stop_words.append(line.strip())

print("Stopword list has %d entries" % len(custom_stop_words))

Stopword list has 313 entries
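
To make the effect of stopword removal concrete, here is a minimal sketch that uses only the standard library. The helper `tokenize_without_stopwords`, the five-word stopword list, and the sample sentence are hypothetical and chosen for illustration. The regular expression keeps tokens of two or more word characters, which is similar in spirit to the default token pattern of the scikit-learn vectorizers, but this is not their actual implementation.

```python
import re

# tokens of two or more word characters (an approximation for illustration)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize_without_stopwords(text, stop_words):
    """Lowercase the text, tokenize it, and drop tokens found in stop_words."""
    stop_set = set(stop_words)  # a set gives fast membership tests
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_set]


toy_stop_words = ["de", "la", "el", "y", "en"]  # hypothetical mini list
filtered = tokenize_without_stopwords("El dia de la mujer en Chile", toy_stop_words)
print(filtered)
```

Only the content-bearing tokens survive, which is why the contents of the stopword list directly shape the vocabulary built in the next step.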


In the *bag-of-words model*, each document is represented by a vector in an *m*-dimensional coordinate space, where *m* is the number of unique terms across all documents. This set of terms is called the corpus *vocabulary*.
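
As a rough illustration of this representation, the sketch below builds term vectors for a tiny made-up corpus using only the standard library. The documents, the whitespace tokenizer, and the helper name `bag_of_words` are assumptions for illustration and do not reflect how scikit-learn implements this step.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of document strings."""
    # naive tokenization: lowercase and split on whitespace (illustrative only)
    tokenized = [doc.lower().split() for doc in documents]
    # the vocabulary is the sorted set of unique terms across all documents
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # one count per vocabulary term, so every vector has length m
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


toy_docs = ["hoy marcha hoy", "feliz dia", "marcha feliz"]
vocabulary, vectors = bag_of_words(toy_docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Here *m* is 4, and each of the three documents becomes a vector of four counts. Word order is discarded, which is where the name "bag of words" comes from.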

Since each document can be represented as a term vector, we can stack these vectors to create a full *document-term matrix*, with one row per document and one column per term. We can easily create this matrix from a list of document strings using *CountVectorizer* from scikit-learn. The parameters passed to *CountVectorizer* control the preprocessing steps that it performs.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# use a custom stopwords list, set the minimum term-document frequency to 20
vectorizer = CountVectorizer(stop_words=custom_stop_words, min_df=20)
A = vectorizer.fit_transform(raw_documents)

print("Created %d X %d document-term matrix" % (A.shape[0], A.shape[1]))

Created 1897848 X 31668 document-term matrix
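
The `min_df` parameter prunes the vocabulary by document frequency, meaning the number of documents a term appears in. The sketch below shows the idea in plain Python on a made-up tokenized corpus. The helper `filter_by_document_frequency` is hypothetical, and it simplifies the scikit-learn parameters by treating `min_df` only as an absolute count and `max_df` only as a proportion of documents.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within the given bounds.

    min_df is an absolute document count; max_df is a proportion of documents.
    """
    n_docs = len(tokenized_docs)
    # count each term once per document, however often it occurs there
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


toy_tokenized = [
    ["8m", "marcha", "hoy"],
    ["8m", "marcha", "feliz"],
    ["8m", "lucha"],
    ["8m", "marcha", "marcha"],
]
kept = filter_by_document_frequency(toy_tokenized, min_df=2, max_df=0.9)
print(kept)
```

In this toy corpus the rare terms fall below `min_df`, while a term that occurs in every document exceeds `max_df`. On the real corpus, `min_df=20` removes the long tail of terms that appear in fewer than 20 tweets.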


This process also builds a vocabulary for the corpus:

In [4]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 31668 distinct terms


We can use *Joblib* to save this document-term matrix, along with the terms and snippets, for later use.

In [5]:
import joblib

joblib.dump((A, terms, snippets), "cenario1-tweets-raw.pkl")

['cenario1-tweets-raw.pkl']

### Applying Term Weighting with TF-IDF

We can improve the usefulness of the document-term matrix by giving more weight to the more "important" terms. The most common normalisation is *term frequency–inverse document frequency* (TF-IDF). In scikit-learn, we can generate a TF-IDF weighted document-term matrix by using *TfidfVectorizer* in place of *CountVectorizer*.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# same preprocessing parameters as before, plus max_df=0.9 to drop
# terms that appear in more than 90% of the documents
vectorizer = TfidfVectorizer(stop_words=custom_stop_words, min_df=20, max_df=0.9)
A = vectorizer.fit_transform(raw_documents)

print(
    "Created %d X %d TF-IDF-normalized document-term matrix" % (A.shape[0], A.shape[1])
)

Created 1897848 X 31668 TF-IDF-normalized document-term matrix
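
To show roughly what the weighting does, the sketch below computes TF-IDF by hand on a tiny made-up count matrix. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which is our understanding of the *TfidfVectorizer* defaults. The helper `tfidf_rows` and the counts are hypothetical.

```python
import math


def tfidf_rows(count_rows):
    """Turn rows of raw term counts into L2-normalised TF-IDF rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: number of rows in which each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # smoothed inverse document frequency: rarer terms get larger values
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * idf_j for count, idf_j in zip(row, idf)]
        norm = math.sqrt(sum(w * w for w in weighted))
        # scale each row to unit Euclidean length (all-zero rows stay as-is)
        weighted_rows.append([w / norm for w in weighted] if norm else weighted)
    return weighted_rows


# columns: "8m", "marcha", "lucha" (made-up counts)
toy_counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
toy_tfidf = tfidf_rows(toy_counts)
for row in toy_tfidf:
    print(["%.3f" % w for w in row])
```

Within each row, a term that occurs in fewer documents receives a larger weight than an equally frequent term that occurs everywhere. This is the sense in which TF-IDF favours the more "important" terms.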


In [7]:
# extract the resulting vocabulary
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
print("Vocabulary has %d distinct terms" % len(terms))

Vocabulary has 31668 distinct terms


A simple way to characterise the corpus is to look at the terms with the highest total TF-IDF weight, summed across all documents in the document-term matrix. We can define such a function as follows:

In [8]:
import operator


def rank_terms(A, terms):
    # get the sums over each column
    sums = A.sum(axis=0)
    
    # map weights to the terms
    weights = {}
    
    for col, term in enumerate(terms):
        weights[term] = sums[0, col]
        
    # rank the terms by their weight over all documents
    return sorted(weights.items(), key=operator.itemgetter(1), reverse=True)

We can now display a ranking of the top 20 terms, which gives us a very rough sense of the content of the document collection:

In [9]:
ranking = rank_terms(A, terms)

for i, pair in enumerate(ranking[0:20]):
    print("%02d. %s (%.2f)" % (i + 1, pair[0], pair[1]))

01. 8m (65292.09)
02. 8m2020 (61353.24)
03. diainternacionaldelamujer (59793.84)
04. mujeres (53571.96)
05. dia (45764.37)
06. hoy (40346.95)
07. marcha8m (40219.17)
08. mujer (36243.33)
09. diadelamujer (32427.31)
10. todas (30859.94)
11. feliz (27629.31)
12. marcha (26571.03)
13. niunamenos (19218.71)
14. lucha (19170.99)
15. ser (18722.22)
16. quiero (16592.31)
17. asi (16487.52)
18. violencia (15957.67)
19. chile (15404.65)
20. igualdad (15216.51)
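
The ranking above sums each term's weights over all documents. The same computation can be sketched on a small list-of-lists matrix with no sparse-matrix machinery. The helper `rank_terms_dense` and the toy weights are hypothetical, chosen only to make the column sums easy to check by hand.

```python
import operator


def rank_terms_dense(rows, terms):
    """Rank terms by their summed weight over all rows (documents)."""
    # zip(*rows) iterates over columns, so each sum is one term's total weight
    sums = [sum(column) for column in zip(*rows)]
    weights = dict(zip(terms, sums))
    return sorted(weights.items(), key=operator.itemgetter(1), reverse=True)


toy_terms = ["8m", "marcha", "lucha"]
toy_rows = [
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.5],
    [0.5, 0.25, 0.0],
]
toy_ranking = rank_terms_dense(toy_rows, toy_terms)
for i, (term, weight) in enumerate(toy_ranking):
    print("%02d. %s (%.2f)" % (i + 1, term, weight))
```

Because the score is a sum over documents, terms that appear in many documents can still rank highly even when their per-document weight is modest. This is consistent with hashtags such as `8m` leading the list above.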


Again, we will use *Joblib* to save this document-term matrix, along with the terms and snippets, for topic modelling later.

In [10]:
joblib.dump((A, terms, snippets), "cenario1-tweets-tfidf.pkl")

['cenario1-tweets-tfidf.pkl']