Text mining
Prerequisite: Technicalities
Wikidata API, other APIs
Scraping HTML to extract structured/semi-structured information





### (b) Text Pre-processing

- Rule-based Techniques (regex, etc)
- Part-of-Speech (PoS) tagging
- Chunking
- N-grams, co-occurrences and concordancer
- Word embeddings (word2vec, BERT, etc)


### i. Programming Basics

A **text** is nothing more than a sequence of **words** and **punctuation**. In the following example, the name `sentence` is followed by the `=` sign and a series of quoted words, separated by commas and enclosed in square brackets. This construct is known as a *list* in Python.

In [22]:
sentence = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]

Next, we can display the contents of `sentence`:

In [23]:
sentence

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history']

We can ask for its length to find how many words are in the sentence:

In [24]:
len(sentence)

9

For the purpose of this exercise, we define two sentences:

In [25]:
sentence1 = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]
sentence2 = ["but", "kept", "on", "repeating", "the", "same", "mistakes", "again", "and", "again", "anyway"]

print(sentence1)
print(sentence2)

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history']
['but', 'kept', 'on', 'repeating', 'the', 'same', 'mistakes', 'again', 'and', 'again', 'anyway']


Adding two lists creates a new list with everything from the first list, followed by everything from the second list:

In [26]:
sentence = sentence1 + sentence2

print(sentence)

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history', 'but', 'kept', 'on', 'repeating', 'the', 'same', 'mistakes', 'again', 'and', 'again', 'anyway']


We can add a new item to the list. When we `append()` to a list, the list itself is updated as a result of the operation.

In [27]:
sentence.append(".")

print(sentence)

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history', 'but', 'kept', 'on', 'repeating', 'the', 'same', 'mistakes', 'again', 'and', 'again', 'anyway', '.']


We can access the elements of a Python list by their position (index) in the list. Indexing starts at 0, so `sentence[0]` is the first word and `sentence[4]` is the fifth:

In [28]:
print(sentence[0])

History


In [30]:
print(sentence[4])

people


We can also extract a *slice* or *sublist* from `sentence`. The slice `sentence[1:5]` starts at index 1 and stops just before index 5:

In [34]:
print(sentence[1:5])

['is', 'full', 'of', 'people']
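
As a small supplement to the indexing and slicing shown above, the sketch below illustrates negative indices and slices with an omitted start or end. The variable names (`first_word`, `last_word`, `first_three`, `from_sixth`) are chosen only for this illustration.

```python
sentence = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]

first_word = sentence[0]     # indexing starts at 0
last_word = sentence[-1]     # negative indices count from the end of the list
first_three = sentence[:3]   # an omitted start means "from the beginning"
from_sixth = sentence[5:]    # an omitted end means "up to the end"

print(first_word, last_word)
print(first_three)
print(from_sixth)
```

Slicing always returns a new list, so the original `sentence` is left unchanged.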


Any individual word in `sentence` is a value of type *string* (`str` for short):

In [35]:
word = "history"

print(word)
print(type(word))

history
<class 'str'>


### ii. Natural Language Processing (NLP) Programming Basics

For a deeper introduction to NLP programming techniques, we continue with NLTK. [NLTK](https://www.nltk.org/) is a widely used Python library for Natural Language Processing (NLP) and Computational Linguistics (CL) that ships with prebuilt functions and utilities.

In [None]:
!pip install nltk

First, we can check the count and frequency distributions of words in a text. For example, if we take the previous `sentence`, we obtain:

In [41]:
from nltk.probability import FreqDist

fdist = FreqDist(sentence)

print(fdist.most_common(10))

[('again', 2), ('History', 1), ('is', 1), ('full', 1), ('of', 1), ('people', 1), ('who', 1), ('knew', 1), ('about', 1), ('history', 1)]


When we invoke [FreqDist](http://www.nltk.org/api/nltk.html?highlight=freqdist), we pass the list of tokens as an argument. The expression `most_common(10)` gives us a list of the 10 most frequently occurring types in the text, each paired with its count. Here `sentence` contains 21 tokens but only 20 distinct types, because `"again"` occurs twice, so the output shows just the first half of them. Note that `"History"` and `"history"` are counted separately, since the comparison is case-sensitive.
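
To count `"History"` and `"history"` as the same type, the tokens can be lowercased before counting. The sketch below uses `collections.Counter` from the standard library as a stand-in, on the assumption that NLTK is not necessarily installed; `FreqDist` is a subclass of `Counter`, so the same `most_common()` call should behave in the same way there.

```python
from collections import Counter

sentence = ["History", "is", "full", "of", "people", "who", "knew", "about", "history",
            "but", "kept", "on", "repeating", "the", "same", "mistakes",
            "again", "and", "again", "anyway", "."]

# Lowercase every token so that "History" and "history" count as one type.
lowered = [token.lower() for token in sentence]
counts = Counter(lowered)

print(counts.most_common(2))
print(len(sentence), "tokens,", len(counts), "distinct types")
```

Whether lowercasing is appropriate depends on the task: it merges sentence-initial capitals with their lowercase forms, but it also erases the distinction between proper nouns and common nouns.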

### iii. Collocations and N-grams

A *collocation* is a sequence of words that occur together unusually often. To understand collocations, we start by extracting from `sentence` a list of adjacent word pairs, also known as *bigrams*. A bigram is a sequence of two (*bi*) consecutive words, while an n-gram is a sequence of *n* consecutive words (e.g. a trigram for *n* = 3). Extracting bigrams is easily accomplished with the function `bigrams()`, which yields each pair as a tuple:

In [42]:
from nltk import bigrams

list_of_bigrams = list(bigrams(sentence))

print(list_of_bigrams)

[('History', 'is'), ('is', 'full'), ('full', 'of'), ('of', 'people'), ('people', 'who'), ('who', 'knew'), ('knew', 'about'), ('about', 'history'), ('history', 'but'), ('but', 'kept'), ('kept', 'on'), ('on', 'repeating'), ('repeating', 'the'), ('the', 'same'), ('same', 'mistakes'), ('mistakes', 'again'), ('again', 'and'), ('and', 'again'), ('again', 'anyway'), ('anyway', '.')]
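
The same idea generalizes from pairs to sequences of any length. The following is a minimal sketch in plain Python; the helper name `ngrams` is chosen for this illustration (NLTK offers a function of the same name in `nltk.util`, which is not used here so that the example runs without NLTK).

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples (illustrative helper)."""
    # Zip the token list with copies of itself shifted by 1, 2, ..., n-1 positions.
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = ["History", "is", "full", "of", "people"]

trigrams = ngrams(sentence, 3)
print(trigrams)
```

A list of *k* tokens yields *k* - *n* + 1 n-grams, so the five tokens above produce three trigrams.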


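*Stopwords* are very frequent words, such as `"the"`, `"of"` and `"is"`, that carry little meaning on their own and are often filtered out before further analysis. NLTK ships a stopword list for English (among other languages):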

In [44]:
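import nltk
from nltk.corpus import stopwords

# The stopword lists are a separate download (only needed once).
nltk.download("stopwords")

stop_words = set(stopwords.words('english'))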
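
Once a stopword set is available, filtering is a single list comprehension. The sketch below uses a tiny hand-picked set, `demo_stop_words`, which is an assumption made for this illustration so that the example runs without downloading anything; in practice the `stop_words` set built from NLTK would take its place.

```python
# A tiny hand-picked stopword set, for illustration only;
# NLTK's English list is considerably longer.
demo_stop_words = {"is", "of", "who", "about", "but", "on", "the", "and"}

sentence = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]

# Compare in lowercase, since stopword lists are typically all lowercase.
content_words = [token for token in sentence if token.lower() not in demo_stop_words]

print(content_words)
```

Storing the stopwords in a `set` rather than a list keeps each membership test fast, which matters on longer texts.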

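Tokenization splits raw text into smaller units: `sent_tokenize()` splits a text into sentences, and `word_tokenize()` splits a text into words and punctuation marks. Both rely on tokenizer models that must be fetched once with NLTK's downloader.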

In [66]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [55]:
sentences = sent_tokenize("History is full of people who knew about history. But they kept on repeating the same mistakes again and again anyway.")

In [56]:
print(len(sentences))

2


In [57]:
sentences

['History is full of people who knew about history.',
 'But they kept on repeating the same mistakes again and again anyway.']

In [58]:
words = word_tokenize("History is full of people who knew about history")

In [59]:
words

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history']
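
For comparison, a rough rule-based tokenizer can be written with a single regular expression. This is only a sketch of the rule-based approach, not how `word_tokenize()` works internally; the helper name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (illustrative helper)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("History is full of people who knew about history.")
print(tokens)
```

A pattern this simple handles plain sentences, but it splits contractions such as "don't" into three tokens, which is one reason to prefer a trained tokenizer for real text.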

### iv. Part-of-speech (PoS) tagging

A part-of-speech tagger, or PoS tagger, processes a sequence of words and attaches a part-of-speech tag to each word:

In [62]:
from nltk import pos_tag

pos_tag(sentence)

[('History', 'NN'),
 ('is', 'VBZ'),
 ('full', 'JJ'),
 ('of', 'IN'),
 ('people', 'NNS'),
 ('who', 'WP'),
 ('knew', 'VBD'),
 ('about', 'IN'),
 ('history', 'NN'),
 ('but', 'CC'),
 ('kept', 'VBD'),
 ('on', 'IN'),
 ('repeating', 'VBG'),
 ('the', 'DT'),
 ('same', 'JJ'),
 ('mistakes', 'NNS'),
 ('again', 'RB'),
 ('and', 'CC'),
 ('again', 'RB'),
 ('anyway', 'RB'),
 ('.', '.')]

Here we see that `"and"` and `"but"` are CC, coordinating conjunctions; `"again"` and `"anyway"` are RB, adverbs; `"of"`, `"about"` and `"on"` are IN, prepositions; `"History"` and `"history"` are NN, singular nouns; `"people"` and `"mistakes"` are NNS, plural nouns; `"full"` and `"same"` are JJ, adjectives; `"is"` is VBZ, a verb in the 3rd person singular present; `"knew"` and `"kept"` are VBD, verbs in the past tense; `"repeating"` is VBG, a gerund or present participle; `"who"` is WP, a wh-pronoun; and `"the"` is DT, a determiner.

To look up the meaning of a tag, we can download NLTK's tagset documentation and query it with `help.upenn_tagset()`:

In [None]:
!python -m nltk.downloader tagsets

In [69]:
from nltk import help

help.upenn_tagset("NN")

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
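
Once a sentence is tagged, the tags can be used to select words by category. The sketch below copies part of the tagger output shown earlier into a literal list, so that it runs without NLTK; the variable names are chosen for this illustration.

```python
# Tagged tokens copied from the pos_tag() output above.
tagged = [("History", "NN"), ("is", "VBZ"), ("full", "JJ"), ("of", "IN"),
          ("people", "NNS"), ("who", "WP"), ("knew", "VBD"), ("about", "IN"),
          ("history", "NN")]

# Noun tags all start with "NN" (NN, NNS, ...); verb tags all start with "VB".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)
print(verbs)
```

Matching on the tag prefix rather than the full tag keeps singular and plural nouns, or the different verb forms, in one group.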


### v. Chunking

Chunking segments a text into multi-token units, such as noun phrases, and depends on PoS tagging. Like tokenization, which omits whitespace, chunking usually selects only a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.

In [70]:
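import nltk

def ie_preprocess(document):
    # Split the document into sentences, then into words, then tag each word.
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]  # see [3]
    return sentences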
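
To make the idea of chunking concrete, here is a minimal sketch of a noun-phrase chunker in plain Python. It groups an optional determiner, any number of adjectives, and one or more nouns into a chunk. The helper `chunk_noun_phrases` and the pattern it uses are assumptions made for this illustration; NLTK provides `RegexpParser` for writing such tag patterns as grammars, which is not used here so that the example runs without NLTK.

```python
def chunk_noun_phrases(tagged):
    """Group tokens matching DT? JJ* NN+ into chunks (illustrative helper)."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                          # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # any number of adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:
            # At least one noun was found, so tokens i..k-1 form a chunk.
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            i += 1  # no chunk starts at this token; move on

    return chunks

# Tagged tokens copied from the pos_tag() output in the previous section.
tagged = [("but", "CC"), ("kept", "VBD"), ("on", "IN"), ("repeating", "VBG"),
          ("the", "DT"), ("same", "JJ"), ("mistakes", "NNS"),
          ("again", "RB"), ("and", "CC"), ("again", "RB"), ("anyway", "RB")]

print(chunk_noun_phrases(tagged))
```

As described above, the chunker selects only a subset of the tokens (here the three words of the noun phrase), and the chunks it returns never overlap.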