# Text Preprocessing
In this notebook we look at the basic preprocessing steps involved in most NLP tasks.

In [5]:
import nltk
import numpy as np

In [6]:
text = "i love python and natural language processing"

# Tokenization
Tokenization is the process of splitting text into a list of tokens, typically words or phrases.

In [10]:
#Tokenization
text.split()

['i', 'love', 'python', 'and', 'natural', 'language', 'processing']

We could also use NLTK's tokenizer

In [21]:
from nltk import word_tokenize
word_tokenize(text)

['i', 'love', 'python', 'and', 'natural', 'language', 'processing']
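
Both calls above give the same result because the sample sentence contains no punctuation. As a rough sketch of why a dedicated tokenizer matters, the standard-library example below compares `str.split()` with a simple regular expression on a sentence that does contain punctuation. The sample sentence and the `simple_tokenize` helper are illustrative assumptions, not part of NLTK, and the regex is far cruder than `word_tokenize`.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (a crude sketch)."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "I love Python, and NLP!"

whitespace_tokens = sample.split()      # punctuation stays attached to words
regex_tokens = simple_tokenize(sample)  # punctuation becomes its own token

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "Python," and "Python" would count as two different tokens, which is usually not what a downstream model needs.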

# Lemmatization
Lemmatization converts a word to its root or base dictionary form, which is known as the lemma.

In [22]:
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 

In [29]:
# choose some words to be lemmatized 
words = ["corpora", "books", "going","better"] 
  
for w in words: 
    print(w, " : ", lemmatizer.lemmatize(w)) 

corpora  :  corpus
books  :  book
going  :  going
better  :  better


In [32]:
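# Can we make "better" better?
# lemmatize() treats every word as a noun unless told otherwise,
# so we pass the part of speech explicitly: pos="a" denotes an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))

# pos="v" denotes a verb
print("going :", lemmatizer.lemmatize("going", pos="v"))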

better : good
going : go
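
The `pos` argument has to come from somewhere. One common approach is to run a part-of-speech tagger and map its tags to the single-letter codes that WordNet uses. The helper below is a minimal sketch of that mapping, assuming Penn Treebank style tags such as `VBG` or `JJR`. The name `penn_to_wordnet_pos` is hypothetical, and the tags are hard-coded here rather than produced by a tagger.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):   # JJ, JJR, JJS: adjectives
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ...: verbs
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS: adverbs
        return "r"
    return "n"                # fall back to noun, the lemmatizer's default

# Assumption: hand-written (word, tag) pairs standing in for tagger output.
tagged_words = [("better", "JJR"), ("going", "VBG"), ("books", "NNS")]

wordnet_pos = {word: penn_to_wordnet_pos(tag) for word, tag in tagged_words}
print(wordnet_pos)
```

Each letter could then be passed along with its word as `lemmatizer.lemmatize(word, pos=...)`.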


# Stemming
Stemming is a process that reduces words to their stem by stripping affixes with heuristic rules, so the result is not always a dictionary word.

In [16]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# choose some words to be stemmed
words = ["program", "prgramming", "programmer"]
for w in words: 
    print(w, " : ", ps.stem(w)) 

program  :  program
prgramming  :  prgram
programmer  :  programm


But it's not always perfect, because it strips suffixes using fixed language rules without checking whether the result is a real word. Let's see one more example:

In [20]:
new_words = ["universe","university"]
for w in new_words: 
    print(w, " : ", ps.stem(w)) 

universe  :  univers
university  :  univers
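
To make the conflation above more concrete, here is a toy suffix stripper. It is only a sketch of the general idea of rule-based stemming and is not the Porter algorithm: the suffix list, the `min_stem_len` threshold, and the name `toy_stem` are all assumptions chosen for illustration.

```python
# Assumption: a tiny, hand-picked suffix list, checked in this order.
SUFFIXES = ("ity", "ing", "er", "e", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["universe", "university", "program", "programmer"]:
    print(w, ":", toy_stem(w))
```

Because the rules look only at the surface form, "universe" and "university" end up with the same stem even though their meanings differ.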


# Stop Words
Stop words are generally the most frequent words in a language, which carry little information on their own, e.g., "the", "is", "a", "an".
The definition of a stop word can be very specific to your application. For example, if you are trying to predict whether a word is singular or plural, you might not want to remove tokens like "is" and "are".

Here we will use NLTK's stop word list and see how it works.



In [35]:
from nltk.corpus import stopwords
stp = stopwords.words('english')

In [36]:
stp[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [37]:
text_split = text.split()

In [41]:
# keep only the tokens that are not in the stop word list
text_split_without_sp = [token for token in text_split if token not in stp]

In [42]:
text_split_without_sp

['love', 'python', 'natural', 'language', 'processing']

In [44]:
text_split

['i', 'love', 'python', 'and', 'natural', 'language', 'processing']
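
Since the right stop word list depends on the application, it can help to treat the list as a parameter. The sketch below uses a small hand-picked set instead of NLTK's list so that it runs without any downloads. `BASE_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the set is only a tiny subset of what a real list would contain.

```python
# Assumption: a tiny illustrative stop list; NLTK's English list is much longer.
BASE_STOP_WORDS = {"i", "me", "my", "and", "the", "a", "an", "is", "are"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "i love python and natural language processing".split()
filtered = remove_stop_words(tokens, BASE_STOP_WORDS)
print(filtered)

# Application-specific list: keep "is" and "are" because they signal number.
number_aware_stop_words = BASE_STOP_WORDS - {"is", "are"}
kept = remove_stop_words("The books are new".split(), number_aware_stop_words)
print(kept)
```

A set is used rather than a list because membership tests on a set do not slow down as the stop list grows.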

# Using fast.ai
Fast.ai contains very useful functions for NLP. Here we explore its text.transform module.

The tokenizer applies a set of rules that insert special tokens into the text. Here is the meaning of those tokens:

- UNK (xxunk) is for an unknown word (one that isn't present in the current vocabulary)
- PAD (xxpad) is the token used for padding, if we need to regroup several texts of different lengths in a batch
- BOS (xxbos) represents the beginning of a text in your dataset
- FLD (xxfld) is used if you set mark_fields=True in your TokenizeProcessor to separate the different fields of texts (if your texts are loaded from several columns in a dataframe)
- TK_MAJ (xxmaj) is used to indicate the next word begins with a capital in the original text
- TK_UP (xxup) is used to indicate the next word is written in all caps in the original text
- TK_REP (xxrep) is used to indicate the next character is repeated n times in the original text (usage xxrep n {char})
- TK_WREP (xxwrep) is used to indicate the next word is repeated n times in the original text (usage xxwrep n {word})

In [54]:
import fastai

In [59]:
from fastai.text import Tokenizer
from fastai.text import SpacyTokenizer

In [61]:
tokenizer = Tokenizer()
tok = SpacyTokenizer('en')
' '.join(tokenizer.process_text(text, tok))

'i love python and natural language processing'

In [65]:
text2 = "I WANT it Today !!!!"

In [66]:
tok = SpacyTokenizer('en')
' '.join(tokenizer.process_text(text2, tok))

'i xxup want it xxmaj today xxrep 4 !'
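
To see how tokens such as xxup, xxmaj and xxrep could arise, here is a simplified pure-Python sketch of three of the rules. It is not fastai's implementation: the function names, the whitespace-only tokenization, and the threshold of four repeated characters are assumptions made for illustration.

```python
import re

# One non-space character followed by 3+ copies of itself, so 4+ in total.
# Assumption: the threshold is chosen for this example only.
REPEATED_CHAR = re.compile(r"(\S)\1{3,}")

def replace_repeated_chars(text):
    """Rewrite runs like '!!!!' as ' xxrep 4 ! '."""
    return REPEATED_CHAR.sub(
        lambda m: f" xxrep {len(m.group(0))} {m.group(1)} ", text
    )

def mark_case(tokens):
    """Lowercase every token, prefixing xxup / xxmaj to record the original case."""
    marked = []
    for token in tokens:
        if token.isupper() and len(token) > 1:
            marked += ["xxup", token.lower()]       # all caps, e.g. WANT
        elif token[0].isupper() and token[1:].islower():
            marked += ["xxmaj", token.lower()]      # capitalized, e.g. Today
        else:
            marked.append(token.lower())
    return marked

def apply_rules(text):
    """Apply the repetition rule, split on whitespace, then mark case."""
    return mark_case(replace_repeated_chars(text).split())

print(" ".join(apply_rules("I WANT it Today !!!!")))
```

Encoding case and repetition as separate tokens lets the vocabulary stay lowercase while the information about emphasis is preserved for the model.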