## Basic Text Preprocessing Operations

In this notebook we will implement some basic text preprocessing pipeline for NLP related tasks. 
At the end we will have a custome function that handles all the preprocessing for a string

List of preprocessing steps:
* Text Normalization: 
  * Removing urls
  * Remove Round Brackets
  * Remove slash
  * Remove punctuation
  * Remove whitespace
  * Lower case text
  * Numbers to text
  * Replacing contractions
  * Removing Stop Words (Optional)
  * Lemmatization (Optional)
  * Stemming (Optional)
* Tokenization:
  * Get unique tokens (Vocab) out of the corpus. 
  * Char Ngrams vs Word Ngrams
* Document Embedding:
  * Types:
    * Bag of Words or Ngrams (Bow): For each document it looks how many counts of each token it has
    * One Hot Encoding: Fixed sice document (With Padding to max doc lenght). Each entry is the id of the token in that position. Usually word based tokens and vocabulary. 
    * TfIdf:

Lemmatization and Stemming depends on the NLP task to make. They are most usefull for text clasiffication for example

In [128]:
# This is to use some benchmark data
# !pip install torchdata 
# !pip install torchtext

# Run this just once if the packages are not already install in the current jupyter kernel
# !pip install nltk
# !pip install contractions
# !pip install inflect
# !pip install beautifulsoup4
# !pip install gensim

# You might need to restart the kernel for the packages changes to take effect

In [1]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [106]:
from torchtext.datasets import AG_NEWS
train_iter = iter(AG_NEWS(split='train'))
next(train_iter)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In [113]:
len(list(AG_NEWS(split='train')))

120000

In [107]:
words = ''

for i, (tag, text) in enumerate(train_iter):
    words += ' ' + text
    # if i == 1000:
    #     break
words[:1000]

" Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market. Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums. Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday. Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential election

Code base of: https://analyticsindiamag.com/complete-tutorial-on-text-preprocessing-in-nlp/

In [129]:
import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
from bs4 import BeautifulSoup
import re, string, unicodedata
from gensim.parsing.preprocessing import remove_stopwords

Removing web associated noise

In [114]:
# to remove HTML tag
def html_remover(data):
  beauti = BeautifulSoup(data,'html.parser')
  return beauti.get_text()

# to remove URL
def url_remover(data):
  # text = re.sub(r'HREF="http.*"', '', data)
  return re.sub(r'https\S+','', data)

def web_associated(data):
  text = html_remover(data)
  text = url_remover(text)
  return text

Remove other noise

In [94]:
def remove_round_brackets(data):
  return re.sub('\(?\)?','',data)

def remove_slashes(data):
  return re.sub('\\\\',' ',data)

def remove_hyphens(data):
  return re.sub('-',' ',data)

def remove_punc(data):
  trans = str.maketrans('','', string.punctuation)
  return data.translate(trans)

def white_space(data):
  return ' '.join(data.split())

def complete_noise(data):
  new_data = web_associated(data)
  new_data = remove_round_brackets(new_data)
  new_data = remove_slashes(new_data)
  new_data = remove_hyphens(new_data)
  new_data = remove_punc(new_data)
  new_data = white_space(new_data)
  return new_data

Finish normalization

In [130]:
def text_lower(data):
  return data.lower()

def contraction_replace(data):
  return contractions.fix(data)

def number_to_text(data):
  temp_str = data.split()
  string = ''
  for i in temp_str:
    # if the word is digit, converted to 
    # word else the sequence continues
    if i.isdigit():
      temp = inflect.engine().number_to_words(i)
      string += ' ' + temp
    else:
      string += ' ' + i
  return string

def normalization(data, remove_stop_words=False):
  text = complete_noise(data)
  text = text_lower(text)
  text = number_to_text(text)
  text = contraction_replace(text)
  if remove_stop_words:
    text = remove_stop_words(text)
  return text

Example:

In [85]:
paragraph_raw = next(train_iter)[1]
print(paragraph_raw)
processed_paragraph = normalization(paragraph_raw)
print('')
print(processed_paragraph)

Sysco Profit Rises; Sales Volume Flat  CHICAGO (Reuters) - Sysco Corp. &lt;A HREF="http://www.investor.reuters.com/FullQuote.aspx?ticker=SYY.N target=/stocks/quickinfo/fullquote"&gt;SYY.N&lt;/A&gt;, the largest U.S.  distributor of food to restaurants and hospitals, on Monday  said quarterly profit rose as an extra week in the period and  cost control measures helped offset the higher food prices that  were slowing demand.

 sysco profit rises sales volume flat chicago reuters sysco corp a syyna the largest us distributor of food to restaurants and hospitals on monday said quarterly profit rose as an extra week in the period and cost control measures helped offset the higher food prices that were slowing demand


In [115]:
norm_words = normalization(words)
norm_words[:1000]

' carlyle looks toward commercial aerospace reuters reuters private investment firm carlyle group which has a reputation for making well timed and occasionally controversial plays in the defense industry has quietly placed its bets on another part of the market oil and economy cloud stocks outlook reuters reuters soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums iraq halts oil exports from main southern pipeline reuters reuters authorities have halted oil export flows from the main pipeline in southern iraq after intelligence showed a rebel militia could strike infrastructure an oil official said on saturday oil prices soar to all time record posing new menace to us economy afp afp tearaway world oil prices toppling records and straining wallets present a new economic menace barely three months before the us presidential elections stocks end up but near year lows 

In [116]:
print(len(words))
print(len(norm_words))

28497158
28503360


### Tokenization
Basic tokenization.
We could also use some more specific libraries such as SpaCy

In [117]:
from torchtext.data import get_tokenizer
from torchtext.data.utils import ngrams_iterator

tokenizer = get_tokenizer("basic_english")
tokens = tokenizer(norm_words)
tokens[0:10]


['carlyle',
 'looks',
 'toward',
 'commercial',
 'aerospace',
 'reuters',
 'reuters',
 'private',
 'investment',
 'firm']

In [118]:
len(tokens)

4681652

Optional Stemmer

In [119]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

stemmed_words = [stemmer.stem(w) for w in tokens]
stemmed_words[0:10]


['carlyl',
 'look',
 'toward',
 'commerci',
 'aerospac',
 'reuter',
 'reuter',
 'privat',
 'invest',
 'firm']

Build a Vocabulary from tokens. 
In this context a vocanulary is just a dict that assigns a unique index to each token and has the counts for each token.

In [124]:
from torchtext.vocab import build_vocab_from_iterator

# This expects list of list of tokens, if i just pass list of tokens it considers the letters as the tokens
vocab = build_vocab_from_iterator([tokens], max_tokens=2000) #Take into account just the 2000 more common words
vocab.set_default_index(2000)

In [126]:
vocab.lookup_indices(tokens[0:10])

[2000, 1066, 856, 1272, 2000, 25, 25, 898, 785, 335]

In [125]:
vocab.lookup_tokens([0,1,2,3,4])

['the', 'to', 'a', 'of', 'in']

### Preprocessing, tokenizing and Building Vocab Using Gensim

También podemos hacer lo anterior de forma un poco menos manual usando la librería Gensim

In [132]:
from gensim.parsing.preprocessing import preprocess_string

gprep_words = preprocess_string(words) #Básicamente preprocesa y te devuleve tokens de una
gprep_words[0:10]


['carlyl',
 'look',
 'commerci',
 'aerospac',
 'reuter',
 'reuter',
 'privat',
 'invest',
 'firm',
 'carlyl']

Note que el preprocesamiento por defecto ya lo hace bastante bien. Aunque de igual forma podemos agregar filtros personalizados

In [135]:
from gensim.parsing.preprocessing import strip_tags, remove_stopwords, strip_short, stem_text

CUSTOM_FILTERS = [normalization, strip_tags, remove_stopwords, strip_short, stem_text]
words_custom_filter = preprocess_string(words, filters=CUSTOM_FILTERS)
words_custom_filter[0:10]

['carlyl',
 'look',
 'commerci',
 'aerospac',
 'reuter',
 'reuter',
 'privat',
 'invest',
 'firm',
 'carlyl']

In [136]:
g_custom_vocab = build_vocab_from_iterator([words_custom_filter], max_tokens=2000) #Take into account just the 2000 more common words
g_custom_vocab.set_default_index(2000)
g_custom_vocab.lookup_indices(words_custom_filter[0:10])

[2000, 135, 1011, 2000, 3, 3, 636, 433, 177, 2000]

En conclusión. El pipeline por defecto de gensim es bastante bueno. Solo le agregaría filtros para lidiar con url y para convertir números a palabras.