# Text Mining

Prerequisite: Technicalities

- Wikidata API, other APIs
- Scraping HTML to extract structured/semi-structured information

### (b) Text Pre-processing

- Rule-based Techniques (regex, etc)
- Part-of-Speech (PoS) tagging
- Chunking
- N-grams, co-occurrences and concordancer
- Word embeddings (word2vec, BERT, etc)


### i. Programming Basics

A **text** can be treated as nothing more than a sequence of **words** and **punctuation**. In the following example, the name `sentence` is followed by the `=` sign and then by some quoted words, separated by commas and enclosed in square brackets. This structure is known as a *list* in Python.

In [1]:
sentence = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]

Next, we can display the contents of `sentence`:

In [2]:
sentence

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history']

We can ask for its length to find how many words are in the sentence:

In [24]:
len(sentence)

9

For the purpose of this exercise, we define two sentences:

In [25]:
sentence1 = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]
sentence2 = ["but", "kept", "on", "repeating", "the", "same", "mistakes", "again", "and", "again", "anyway"]

print(sentence1)
print(sentence2)

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history']
['but', 'kept', 'on', 'repeating', 'the', 'same', 'mistakes', 'again', 'and', 'again', 'anyway']


Adding two lists creates a new list with everything from the first list, followed by everything from the second list:

In [26]:
sentence = sentence1 + sentence2

print(sentence)

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history', 'but', 'kept', 'on', 'repeating', 'the', 'same', 'mistakes', 'again', 'and', 'again', 'anyway']


We can also add a new item to the end of a list with `append()`. Unlike `+`, which builds a new list, `append()` updates the existing list in place.

In [27]:
sentence.append(".")

print(sentence)

['History', 'is', 'full', 'of', 'people', 'who', 'knew', 'about', 'history', 'but', 'kept', 'on', 'repeating', 'the', 'same', 'mistakes', 'again', 'and', 'again', 'anyway', '.']
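
The difference between `+` and `append()` can be sketched as follows. This is a minimal illustration; the names `words`, `combined` and `alias` are made up for the example and do not appear elsewhere in the notebook.

```python
# Illustrative names only: words, combined, alias.
words = ["History", "is", "full"]

# `+` builds a new list; `words` itself is left unchanged.
combined = words + ["of", "people"]
print(words)     # ['History', 'is', 'full']
print(combined)  # ['History', 'is', 'full', 'of', 'people']

# append() changes the list in place and returns None.
alias = words    # a second name for the same list object
words.append(".")
print(alias)     # ['History', 'is', 'full', '.']
```

Because `alias` and `words` refer to the same list, the appended item is visible through both names.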


We can identify the elements of a Python list by their position (index) in the list. Indexing starts at 0, so `sentence[0]` is the first item:

In [28]:
print(sentence[0])

History


In [30]:
print(sentence[4])

people


We can also extract a *slice* (a sublist) from `sentence`. The slice `[1:5]` starts at index 1 and stops just before index 5:

In [34]:
print(sentence[1:5])

['is', 'full', 'of', 'people']
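
A few more indexing and slicing forms, as a small sketch. The variable names (`first`, `last`, `middle`, `head`, `tail`) are chosen here for illustration, and the list is redefined so the snippet runs on its own.

```python
sentence = ["History", "is", "full", "of", "people", "who", "knew", "about", "history"]

first = sentence[0]      # indices start at 0
last = sentence[-1]      # negative indices count from the end
middle = sentence[1:5]   # positions 1, 2, 3 and 4 (index 5 is excluded)
head = sentence[:3]      # an omitted start means "from the beginning"
tail = sentence[-2:]     # an omitted end means "up to the last item"

print(first, last)
print(middle)
print(head)
print(tail)
```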


Any individual word in `sentence` is a value of type *string* (`str` for short):

In [35]:
word = "history"

print(word)
print(type(word))

history
<class 'str'>


### ii. RegEx Basics

A regular expression (RegEx) is a sequence of characters that defines a search pattern. It is mainly used to find or replace patterns embedded in text. Some of the most common building blocks are:



- `\b` matches a word boundary, i.e. the position at the beginning or at the end of a word.
- `\d` matches any digit character (0-9).
- `\D` matches any character that is not a digit. It is the opposite of `\d`.
- `\w` matches any word character: letters (a-z and A-Z), digits (0-9) and the underscore `_`.
- `\W` matches any character that is not a word character. It is the opposite of `\w`.
- `.` matches any character except a newline.
- `^` anchors the pattern to the start of the string.
- `$` anchors the pattern to the end of the string.
- `*` matches zero or more occurrences of the pattern to its left.
- `+` matches one or more occurrences of the pattern to its left.
- `?` matches zero or one occurrence of the pattern to its left.
- `|` matches either the pattern to its left or the pattern to its right.
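
The symbols above can be tried out on a short string. The sketch below uses a made-up sample sentence (the numbers in it are purely illustrative) and `re.findall()` / `re.search()` from Python's standard `re` module. Raw strings (`r"..."`) are used so that backslashes reach the regex engine unchanged.

```python
import re

# Made-up sample text for illustration only.
text = "History is full of people; 2 of 10 knew about history."

digits = re.findall(r"\d+", text)                 # runs of digits: ['2', '10']
whole_word = re.findall(r"\bof\b", text)          # 'of' as a whole word: ['of', 'of']
any_char = re.findall(r"f.ll", text)              # 'f', any one character, 'll': ['full']
one_or_more = re.findall(r"l+", text)             # runs of 'l': ['ll', 'l']
optional_s = re.findall(r"peoples?", text)        # 'people' with an optional 's': ['people']
either = re.findall(r"History|history", text)     # either spelling: ['History', 'history']
starts = re.search(r"^History", text) is not None   # True: string starts with 'History'
ends = re.search(r"history\.$", text) is not None   # True: string ends with 'history.'

print(digits, whole_word, any_char, one_or_more, optional_s, either, starts, ends)
```

Note that the final period is written as `\.` because an unescaped `.` would match any character.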

Python has a built-in module to work with regular expressions called `re`. Some common methods from this module are:

- `re.match()`
- `re.search()`
- `re.findall()`


1. The `re.match(pattern, string)` function tries to match the pattern at the beginning of the string only. It returns a match object on success and `None` on failure.

In [11]:
import re
result = re.match('History', 'History is full of people who knew about history.')

result.span()

(0, 7)
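
Since `re.match()` only looks at the start of the string, it is worth checking for `None` before calling methods such as `span()`. A minimal sketch, with the illustrative names `hit` and `miss`:

```python
import re

text = "History is full of people who knew about history."

hit = re.match("History", text)    # the pattern sits at position 0
miss = re.match("history", text)   # the lowercase form only occurs later

print(hit.span())   # (0, 7)
print(miss)         # None

# Guard before using the result: calling miss.span() would raise AttributeError.
if miss is None:
    print("no match at the start of the string")
```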

2. `re.search(pattern, string)` scans the entire string (not just the beginning) and returns a match object for the first occurrence of the pattern, or `None` if the pattern does not occur.

In [12]:
result = re.search('history', 'History is full of people who knew about history.')
print(result.group())

history
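
The match object returned by `re.search()` also records where the match was found. The sketch below additionally passes the standard `re.IGNORECASE` flag, which is not used elsewhere in this notebook, to show how the first match changes when case is ignored. The names `found` and `found_any_case` are illustrative.

```python
import re

text = "History is full of people who knew about history."

# Case-sensitive: the capitalised "History" at the start is skipped.
found = re.search("history", text)
print(found.group(), found.start(), found.end())

# Case-insensitive: the first occurrence is now the one at the start.
found_any_case = re.search("history", text, re.IGNORECASE)
print(found_any_case.group(), found_any_case.span())
```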


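3. `re.findall(pattern, string)` returns all non-overlapping occurrences of the pattern in the string as a list. It can cover many uses of `re.search()` and `re.match()`, but it returns plain strings rather than match objects, so use `re.search()` or `re.match()` when you need details such as the position of a match.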

In [13]:
result = re.findall('history', 'History is full of people who knew about history but they kept on repeating the history.')
print(result)

['history', 'history']
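
Two variations on the example above, as a sketch: passing the standard `re.IGNORECASE` flag so that the capitalised form is also found, and using `re.finditer()` (another function in the `re` module, not covered above) when the positions of the matches are needed. The names `all_forms` and `positions` are illustrative.

```python
import re

text = "History is full of people who knew about history but they kept on repeating the history."

# Ignore case so that "History" at the start is included as well.
all_forms = re.findall("history", text, flags=re.IGNORECASE)
print(all_forms)   # ['History', 'history', 'history']

# finditer() yields match objects, so start positions are available.
positions = [m.start() for m in re.finditer("history", text)]
print(positions)
```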
