# Hands-on: Text Processing
This hands-on will cover the necessary steps in a text processing pipeline for Human Language Technologies (HLT). Some examples of projects and tasks in which this pipeline will be useful are the following:
- **Language Translation** - translation of a sentence or body of text from one language to another
- **Word Sense Disambiguation** - determining the meaning and context of a polysemic word in a body of text
- **Sentiment Analysis** - determining the overall sentiment towards a certain topic or word, whether it's positive, negative, or neutral
- **Topic Modeling** - identifying the different topics discussed in a text and determining the most prevalent one

Other tasks include question answering, information extraction, and, more recently, detecting mis/disinformation.

## Pre-processing Pipeline

- **Tokenization** — split sentences into words and symbols
- **Convert to lowercase**
- **Removing unnecessary punctuation, tags, and emojis**
- **Removing stop words** — removing frequently occurring words such as articles and auxiliary verbs (e.g. "the", "is") that carry little meaning on their own
- **Stemming** — transforming a word to its root form by removing inflectional endings, usually by dropping suffixes.

```
The stemmed form of cries is: cri
The stemmed form of crying is: cry
```

- **Lemmatization** — properly removing inflectional endings by determining the part of speech and doing morphological analysis. It transforms words to their base or dictionary form.

```
The lemmatized form of cries is: cry
The lemmatized form of crying is: cry
```
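To make the contrast concrete, here is a toy sketch that reproduces the two example outputs above. The suffix rules and the lookup table are illustrative assumptions, not the Porter algorithm or WordNet. A real project would use a library stemmer and lemmatizer instead.

```python
# Toy illustration only: hand-written rules, not a real stemmer or lemmatizer.

def toy_stem(word):
    """Strip a suffix using fixed rules, without consulting a dictionary."""
    if word.endswith("ies"):
        return word[:-3] + "i"   # cries -> cri
    if word.endswith("ing"):
        return word[:-3]         # crying -> cry
    return word

# Hypothetical lookup table standing in for morphological analysis.
TOY_LEMMAS = {"cries": "cry", "crying": "cry", "cried": "cry"}

def toy_lemmatize(word):
    """Return the dictionary form when known, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["cries", "crying"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
```

The stemmer can produce non-words such as `cri` because it only manipulates strings, while the lemmatizer always returns a dictionary form.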

> **NOTE:** Not all HLT tasks/projects will follow the same pipeline. For example, topic modeling has been reported to work better with stop words kept, so the removal of stop words is typically skipped in that case.
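The first four steps of the pipeline can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the stop-word list is a tiny hypothetical one, the text is treated as ASCII English, and tokenization is a plain whitespace split done after cleaning.

```python
import re

# Tiny hypothetical stop-word list; real lists (e.g. gensim's STOPWORDS) are much longer.
STOP_WORDS = {"the", "is", "a", "an", "to", "and", "of"}

def basic_clean(text):
    """Lowercase, drop tags and punctuation, tokenize, and remove stop words."""
    text = text.lower()                       # convert to lowercase
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML-like tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove punctuation, emojis, symbols
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(basic_clean("The <b>new</b> update is GREAT!!! 🚗"))
# ['new', 'update', 'great']
```

Note that the punctuation rule also splits contractions (`what's` becomes `what` and `s`), which is one reason library tokenizers are usually preferred.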

In [9]:
import os
import pandas as pd
import numpy as np
import json #loading/writing json files
import re #regular expressions
import gensim
import nltk

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora, models

from nltk.stem import WordNetLemmatizer

### Load JSON dataset containing posts from r/waze
This dataset was collected in 2019 using the pushshift.io API. It contains the `submission ID` and the post's `body` of text.

In [2]:
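# DSDATA is an environment variable pointing to the data directory
data_dir = os.getenv('DSDATA')
with open(os.path.join(data_dir, 'raw_textonly_waze_2019.json')) as json_file:
    json_data = json.load(json_file)
documents = pd.DataFrame(json_data)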

In [3]:
documents.head()

|   | id | body |
|---|---|---|
| 0 | 7nraji | Seriously Waze whomever is in charge of UX nee... |
| 1 | ds4cv5q | I agree. And what's with the blue "Go" buttons... |
| 2 | ds4mc2j | Yes! When selecting a destination from seach, ... |
| 3 | ds4mtd9 | I totally agree with you and I will add: make ... |
| 4 | ds52syn | I agree and disagree with little bits and piec... |


### Remove URLs from text
You can use [Regex101](https://regex101.com/) to check your regular expressions.

In [4]:
def removeURLFromText(text):
    result = re.sub(r"http\S+", "", text)
    result = result.strip()
    return result
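
To see what the pattern does and does not catch, the sketch below applies the same regular expression to a few made-up strings. The sample sentences and the snake_case function name are illustrative assumptions.

```python
import re

def remove_url_from_text(text):
    # Same pattern as above: "http" followed by any run of non-whitespace characters.
    return re.sub(r"http\S+", "", text).strip()

samples = [
    "Check https://regex101.com/ for help",   # URL in the middle
    "http://example.com is down",             # URL at the start
    "Visit www.example.com today",            # no scheme, so no match
]
for s in samples:
    print(repr(remove_url_from_text(s)))
# 'Check  for help'
# 'is down'
# 'Visit www.example.com today'
```

Two things are worth noticing: removing a URL from the middle of a sentence leaves a double space (harmless here, since tokenization happens later), and links written without `http` are not removed.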

In [5]:
processed_docs = documents['body'].map(removeURLFromText)
processed_docs.head()

0    Seriously Waze whomever is in charge of UX nee...
1    I agree. And what's with the blue "Go" buttons...
2    Yes! When selecting a destination from seach, ...
3    I totally agree with you and I will add: make ...
4    I agree and disagree with little bits and piec...
Name: body, dtype: object

### Pre-process post's body
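`gensim`'s `simple_preprocess()` converts a text into a list of lowercase tokens.

This step also removes stop words and tokens with three or fewer characters (the code keeps only tokens where `len(token) > 3`). It calls `lemmatize()`, which is defined in the next section and must be run before `preprocess()` is applied.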

In [6]:
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize(token))
    return result

### Stem and lemmatize per text
Lemmatize each word first; a stemmer can then handle any words the lemmatizer misses. The cell below applies lemmatization only, and it treats every token as a verb (`pos='v'`).

In [10]:
nltk.download('wordnet')

def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='v')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/brianesamson/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [11]:
processed_docs = processed_docs.map(preprocess)
processed_docs.head()

0    [seriously, waze, whomever, charge, need, fire...
1    [agree, blue, button, stay, screen, second, wa...
2    [select, destination, seach, present, options,...
3    [totally, agree, screen, mind, optional, choos...
4    [agree, disagree, little, bits, piece, items, ...
Name: body, dtype: object

### Create a `gensim` Dictionary
This maps each unique token in your bag of words to an integer ID (word <-> ID).

In [12]:
dictionary = gensim.corpora.Dictionary(processed_docs)

In [13]:
for x in range(0, 20):
    print(x,":",dictionary[x])

0 : accidents
1 : accurate
2 : area
3 : aspect
4 : avoid
5 : base
6 : behalf
7 : better
8 : brainer
9 : briefly
10 : button
11 : center
12 : change
13 : charge
14 : come
15 : competent
16 : complicate
17 : concern
18 : consider
19 : consult
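
Conceptually, the dictionary is a pair of lookup tables. The sketch below builds one with plain Python on a made-up corpus. The rule of adding each document's new tokens in sorted order is an assumption chosen to match the alphabetical IDs seen above, not a statement of gensim's internals.

```python
def build_token2id(docs):
    """Assign an integer id to each unique token, document by document
    (new tokens within one document are added in sorted order)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# Hypothetical tokenized documents.
docs = [["waze", "route", "traffic", "route"], ["traffic", "alert", "waze"]]
token2id = build_token2id(docs)
id2token = {i: t for t, i in token2id.items()}

print(token2id)     # {'route': 0, 'traffic': 1, 'waze': 2, 'alert': 3}
print(id2token[0])  # route
```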


### Filter words
Filter the dictionary in three steps:

- remove tokens that appear in fewer than `no_below` documents (absolute number);
- remove tokens that appear in more than `no_above` documents (fraction of total corpus size, not absolute number);
- after the two steps above, keep only the `keep_n` most frequent tokens (or keep all if `None`).

In [14]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
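
The three rules can be sketched in plain Python as follows. This is an illustration of the logic as described above, with hypothetical document frequencies. It assumes "most frequent" is measured by document frequency and is not gensim's actual implementation.

```python
def filter_by_doc_freq(dfs, num_docs, no_below=5, no_above=0.5, keep_n=None):
    """Return the tokens that survive the three filtering rules.

    dfs maps token -> number of documents containing it.
    """
    max_docs = no_above * num_docs  # convert the fraction to an absolute count
    kept = [t for t, df in dfs.items() if no_below <= df <= max_docs]
    kept.sort(key=lambda t: dfs[t], reverse=True)  # most frequent first
    return kept if keep_n is None else kept[:keep_n]

# Hypothetical document frequencies for a 100-document corpus.
dfs = {"waze": 90, "route": 40, "traffic": 25, "alert": 16, "typo": 2}
print(filter_by_doc_freq(dfs, num_docs=100, no_below=15, no_above=0.5, keep_n=2))
# ['route', 'traffic']
```

Here `waze` is dropped for being too common (90 of 100 documents), `typo` for being too rare, and `alert` passes both thresholds but falls outside the top two.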

In [15]:
dictionary.cfs

{70: 46,
 88: 7088,
 10: 147,
 53: 723,
 9: 616,
 46: 43,
 55: 178,
 91: 300,
 14: 99,
 76: 75,
 11: 361,
 37: 507,
 58: 81,
 17: 24,
 27: 311,
 87: 850,
 79: 992,
 34: 973,
 78: 822,
 32: 522,
 44: 34,
 7: 383,
 15: 139,
 72: 341,
 35: 389,
 89: 66,
 66: 23,
 18: 101,
 43: 27,
 51: 212,
 40: 64,
 65: 35,
 21: 59,
 5: 205,
 84: 307,
 42: 56,
 29: 31,
 52: 165,
 62: 169,
 24: 150,
 38: 151,
 83: 103,
 67: 49,
 13: 52,
 85: 191,
 19: 694,
 61: 65,
 73: 615,
 48: 206,
 81: 1129,
 47: 520,
 39: 75,
 25: 38,
 77: 413,
 16: 34,
 3: 19,
 68: 589,
 57: 38,
 1: 182,
 45: 1319,
 49: 171,
 80: 37,
 20: 64,
 31: 19,
 26: 90,
 8: 35,
 71: 110,
 60: 37,
 64: 553,
 41: 160,
 23: 1554,
 50: 370,
 6: 342,
 63: 32,
 69: 215,
 56: 590,
 33: 30,
 12: 27,
 0: 49,
 74: 89,
 36: 214,
 54: 41,
 4: 274,
 28: 155,
 22: 266,
 30: 89,
 82: 25,
 2: 479,
 75: 426,
 90: 222,
 86: 61,
 59: 43,
 92: 140,
 93: 46,
 95: 121,
 94: 656,
 96: 329,
 115: 170,
 104: 302,
 111: 50,
 108: 178,
 106: 114,
 110: 86,
 113: 854,
 

In [16]:
dictionary.dfs

{70: 46,
 88: 4471,
 10: 114,
 53: 636,
 9: 500,
 46: 42,
 55: 169,
 91: 282,
 14: 92,
 76: 64,
 11: 340,
 37: 450,
 58: 79,
 17: 16,
 27: 284,
 87: 749,
 79: 909,
 34: 858,
 78: 790,
 32: 493,
 44: 22,
 7: 302,
 15: 113,
 72: 273,
 35: 369,
 89: 62,
 66: 23,
 18: 97,
 43: 23,
 51: 168,
 40: 63,
 65: 33,
 21: 53,
 5: 186,
 84: 269,
 42: 53,
 29: 27,
 52: 146,
 62: 154,
 24: 142,
 38: 130,
 83: 101,
 67: 44,
 13: 50,
 85: 151,
 19: 457,
 61: 62,
 73: 499,
 48: 167,
 81: 773,
 47: 411,
 39: 74,
 25: 34,
 77: 359,
 16: 30,
 3: 19,
 68: 439,
 57: 38,
 1: 160,
 45: 1154,
 49: 134,
 80: 37,
 20: 59,
 31: 18,
 26: 88,
 8: 33,
 71: 102,
 60: 35,
 64: 478,
 41: 142,
 23: 1200,
 50: 336,
 6: 311,
 63: 29,
 69: 200,
 56: 502,
 33: 28,
 12: 25,
 0: 44,
 74: 80,
 36: 204,
 54: 41,
 4: 233,
 28: 143,
 22: 229,
 30: 87,
 82: 24,
 2: 393,
 75: 366,
 90: 209,
 86: 55,
 59: 43,
 92: 139,
 93: 44,
 95: 111,
 94: 569,
 96: 282,
 115: 148,
 104: 249,
 111: 46,
 108: 158,
 106: 102,
 110: 82,
 113: 603,
 ...

In [17]:
dictionary.num_pos

173850

In [18]:
dictionary.num_docs

12459
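
The four attributes inspected above are simple corpus statistics: `cfs` holds collection frequencies (total occurrences of each token), `dfs` holds document frequencies (number of documents containing each token), `num_pos` is the total number of token positions processed, and `num_docs` is the number of documents. For example, the outputs show token `88` occurring 7088 times across 4471 documents. The sketch below computes the same quantities on a made-up corpus, keyed by token string rather than ID for readability.

```python
from collections import Counter

# Hypothetical tokenized documents.
docs = [["route", "route", "traffic"], ["traffic", "alert"], ["route"]]

cfs = Counter()   # collection frequency: total occurrences of each token
dfs = Counter()   # document frequency: number of documents containing each token
for doc in docs:
    cfs.update(doc)
    dfs.update(set(doc))   # set() so each document counts a token at most once

num_pos = sum(len(doc) for doc in docs)  # total number of token positions
num_docs = len(docs)

print(cfs["route"], dfs["route"])  # 3 2
print(num_pos, num_docs)           # 6 3
```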

### Map bag of words per document

So far, we have only counted the occurrence of each word across all documents. Next, we need to know how often each word appeared in each document, but now using the IDs generated in the previous step.

In [19]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
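
As a rough picture of what this conversion produces, the sketch below counts tokens in one document and returns `(token_id, count)` pairs. The mapping and the document are hypothetical, and the helper name `to_bow` is an illustrative stand-in rather than a gensim function.

```python
from collections import Counter

def to_bow(doc, token2id):
    """Count tokens in one document and return sorted (token_id, count) pairs.
    Tokens missing from the mapping (e.g. filtered out earlier) are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = {"alert": 0, "route": 1, "traffic": 2}   # hypothetical mapping
doc = ["route", "traffic", "route", "waze"]          # "waze" is not in the mapping
print(to_bow(doc, token2id))
# [(1, 2), (2, 1)]
```

Tokens removed by the filtering step no longer have an ID, so they silently drop out of every document's bag of words.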

In [20]:
bow_corpus

[[(0, 3),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 2),
  (6, 1),
  (7, 3),
  (8, 1),
  (9, 1),
  (10, 2),
  (11, 2),
  (12, 1),
  (13, 1),
  (14, 2),
  (15, 1),
  (16, 1),
  (17, 2),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 2),
  (25, 2),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 2),
  (34, 3),
  (35, 3),
  (36, 1),
  (37, 2),
  (38, 1),
  (39, 1),
  (40, 2),
  (41, 3),
  (42, 1),
  (43, 3),
  (44, 2),
  (45, 1),
  (46, 1),
  (47, 2),
  (48, 1),
  (49, 1),
  (50, 2),
  (51, 1),
  (52, 4),
  (53, 6),
  (54, 1),
  (55, 1),
  (56, 1),
  (57, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1),
  (67, 1),
  (68, 2),
  (69, 1),
  (70, 1),
  (71, 2),
  (72, 1),
  (73, 1),
  (74, 1),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 1),
  (79, 1),
  (80, 1),
  (81, 7),
  (82, 1),
  (83, 1),
  (84, 2),
  (85, 1),
  (86, 1),
  (87, 1),
  (88, 1),
  (89, 3),
  (90, 1),
  (91, 1)

### Compute the TF-IDF per word in a document

- **Term Frequency (TF)** is the number of times token `t` appears in a document divided by the total number of tokens in the document.
- **Inverse Document Frequency (IDF)** is `log(N/n)`, where `N` is the total number of documents and `n` is the number of documents in which token `t` appears. A rare word has a high IDF, whereas a word that appears in most documents has a low IDF. The example below uses base-10 logarithms.

The TF-IDF value of a term is the product **TF * IDF**.

Example:
```
Document 1: "I worked my whole life, just to get right, just to be like"
[work, 1]
[whole, 1]
[life, 1]
[just, 2]
[right, 1]
[like, 1]
```

```
Document 2: "I worked my whole life, just to get high, just to realize"
[work, 1]
[whole, 1]
[life, 1]
[just, 2]
[high, 1]
[realize, 1]
```

```
TF('just', Document1) = 2/7, IDF('just') = log(2/2) = 0
TF('right', Document1) = 1/7, IDF('right') = log(2/1) = 0.30

TF-IDF('just', Document1) = (2/7)*0 = 0
TF-IDF('right', Document1) = (1/7)*0.30 = 0.043
```
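
The arithmetic in this example can be checked with a few lines of Python. This sketch uses the textbook definitions given above with base-10 logarithms. The token lists are the two example documents after pre-processing, and `idf` assumes the term occurs in at least one document.

```python
import math

doc1 = ["work", "whole", "life", "just", "right", "just", "like"]
doc2 = ["work", "whole", "life", "just", "high", "just", "realize"]
corpus = [doc1, doc2]

def tf(term, doc):
    """Occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log10(N / n), where n is the number of documents containing the term."""
    n = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n)

for term in ["just", "right"]:
    score = tf(term, doc1) * idf(term, corpus)
    print(f"{term}: tf={tf(term, doc1):.3f} idf={idf(term, corpus):.3f} tf-idf={score:.3f}")
# just: tf=0.286 idf=0.000 tf-idf=0.000
# right: tf=0.143 idf=0.301 tf-idf=0.043
```

Because `just` appears in both documents, its IDF is zero and it contributes nothing, no matter how often it occurs.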

In [21]:
tfidf = models.TfidfModel(bow_corpus)
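
The weights `gensim` produces will not match the textbook formula above digit for digit. As far as the gensim documentation describes the defaults, `TfidfModel` uses the raw count as the term frequency, a base-2 logarithm for the IDF, and then scales each document vector to unit length. The sketch below imitates that scheme in plain Python on hypothetical numbers; it is an approximation for intuition, not gensim's code.

```python
import math

def tfidf_gensim_style(bow, dfs, num_docs):
    """Weight one bag-of-words document: raw count * log2(N / df), then L2-normalize."""
    weights = [(tid, count * math.log2(num_docs / dfs[tid])) for tid, count in bow]
    weights = [(tid, w) for tid, w in weights if w > 0]  # terms in every document get weight 0
    norm = math.sqrt(sum(w * w for _, w in weights))
    return [(tid, w / norm) for tid, w in weights] if norm else []

# Hypothetical toy numbers: 4 documents, document frequencies per token id.
dfs = {0: 4, 1: 2, 2: 1}
bow = [(0, 3), (1, 1), (2, 1)]
print(tfidf_gensim_style(bow, dfs, num_docs=4))
```

In this toy case token `0` appears in all four documents, so it is dropped, and the remaining two weights are scaled so that their squares sum to 1.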

In [22]:
corpus_tfidf = tfidf[bow_corpus]
for x in corpus_tfidf:
    print(x)

[(0, 0.23124633876333786), (1, 0.05945695677371818), (2, 0.04718833531067473), (3, 0.08854680523035144), (4, 0.05432546636759066), (5, 0.11480252290351058), (6, 0.050383247326507126), (7, 0.15235249524828903), (8, 0.08100969121459851), (9, 0.04390083497800084), (10, 0.12816963310950763), (11, 0.0983321807753921), (12, 0.08480005733979029), (13, 0.07533686835009577), (14, 0.1340240928568849), (15, 0.06420510363971764), (16, 0.08231091304659309), (17, 0.18178598029162624), (18, 0.06628952356346254), (19, 0.04512853167221937), (20, 0.07307718316956725), (21, 0.07454135232512273), (22, 0.05456187912910067), (23, 0.06389700339021827), (24, 0.12217267576838883), (25, 0.16120424751958198), (26, 0.0676189239314181), (27, 0.05162314889449991), (28, 0.06099053050276059), (29, 0.08374934704688179), (30, 0.06777495415345162), (31, 0.0892849577430904), (32, 0.0440933204842491), (33, 0.16650567587324386), (34, 0.10958562548148891), (35, 0.14414585877290195), (36, 0.05614013508419337), (37, 0.0906785