# **Preprocessing**

Text preprocessing produces cleaner input data for machine learning and other statistical methods. You have already applied small bits of preprocessing, such as tokenization, to create a bag of words. You also saw that simple techniques, like lowercasing all of the tokens, can lead to slightly better results for a bag-of-words model.

Other common techniques include **lemmatization** and **stemming**, which reduce words to their root forms, and **removing stop words**, which are common words in a language that carry little meaning (such as "and" or "the"). Removing punctuation and other unwanted tokens is also common.
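
To make the difference between stemming and lemmatization concrete, here is a minimal pure-Python sketch. The suffix list, the lemma table, and the names toy_stem and toy_lemmatize are hypothetical illustrations, not NLTK's algorithms; real stemmers (such as Porter) and lemmatizers (such as WordNet) use far richer rules and vocabularies.

```python
# Toy illustration only: these rules are assumptions, not NLTK's algorithms.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# A lemmatizer maps a word to its dictionary form, usually via a vocabulary lookup.
TOY_LEMMAS = {"lions": "lion", "running": "run", "better": "good"}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


# Print each word next to its toy stem and toy lemma
for w in ["lions", "running", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

Note that a stem (for example, the result for "running") need not be a real word, whereas a lemma is always a dictionary form.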

In [3]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""
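# Lowercase, tokenize, and keep only alphabetic tokens
tokens = [w for w in word_tokenize(text.lower())
          if w.isalpha()]
# Build the stop word set once instead of on every iteration
stop_set = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in stop_set]
Counter(no_stops).most_common(2)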

[('cat', 3), ('box', 3)]

The string method isalpha() returns True if the string contains only alphabetic characters. We use isalpha() in the if clause of a list comprehension over the tokenized result to keep only alphabetic strings, which effectively strips tokens containing numbers or punctuation.

Preprocessing has already improved our bag of words and made it more useful by removing the stopwords and non-alphabetic words.

In [8]:
english_stops = stopwords.words('english')
english_stops[:5]

['i', 'me', 'my', 'myself', 'we']

In [14]:
text = "The lion is the King of the Jungle. Lions are carnivors. Lions live in the African Sabanna. Africa is the poorest continent in the World"
tokens = word_tokenize(text)

lower_tokens = [w for w in word_tokenize(text.lower())
                if w.isalpha()]

In [15]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer 

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('lion', 3), ('king', 1), ('jungle', 1), ('carnivors', 1), ('live', 1), ('african', 1), ('sabanna', 1), ('africa', 1), ('poorest', 1), ('continent', 1)]


# **Text cleaning**

Some of the most common text cleaning steps include removing extra whitespace, escape sequences, punctuation, special characters, numbers, and stopwords.

Every Python string has an isalpha() method that returns True if all the characters of the string are alphabetic. Therefore, "Dog".isalpha() returns True, but "3dogs".isalpha() returns False because it contains the non-alphabetic character 3. Similarly, numbers, punctuation, and emojis all return False. This makes isalpha() a convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation, or emojis.
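
As a quick check of this behavior, the following sketch applies isalpha() to a few sample strings. The sample values are chosen here for illustration only.

```python
# Sample strings chosen for illustration
samples = ["Dog", "3dogs", "42", "!!!", "😀"]

# Map each sample to the result of its isalpha() check
results = {s: s.isalpha() for s in samples}

for s, is_alphabetic in results.items():
    print(repr(s), is_alphabetic)
```

Only "Dog" passes the check; every string that is or contains a digit, a punctuation mark, or an emoji fails it.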

If isalpha() seems too good to be true as a silver bullet for cleaning text, that's because it is. isalpha() returns False on some words we would not want to remove. Examples include abbreviations written with periods, such as U.S.A. and U.K., and proper nouns with numbers in them, such as word2vec and xto10x. For such nuanced cases, isalpha() may not be sufficient, and it may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.
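
One possible custom function is sketched below. The pattern and the names KEEP_TOKEN and is_useful_token are assumptions made for this example, not part of any library, and the rules only cover ASCII letters; they would need tuning for real data.

```python
import re

# Hypothetical rule set (an assumption, tune for your data):
#   1. a letter followed by letters or digits, e.g. "song", "word2vec", "xto10x"
#   2. a dotted abbreviation, e.g. "U.S.A.", "U.K."
KEEP_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|(?:[A-Za-z]\.){2,}")


def is_useful_token(token):
    """Return True if the whole token matches one of the rules above."""
    return KEEP_TOKEN.fullmatch(token) is not None


tokens = ["U.S.A.", "U.K.", "word2vec", "xto10x", "song", "5", "!!!!", "3dogs"]
kept = [t for t in tokens if is_useful_token(t)]
print(kept)
```

Unlike a plain isalpha() filter, this keeps the dotted abbreviations and the alphanumeric names while still dropping bare numbers, punctuation, and tokens that start with a digit.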

## **Remove non-alphabetic characters**
The string below has a lot of punctuation, unnecessary extra whitespace, escape sequences, numbers, and emojis. We will generate the lemmatized tokens like before. Next, we loop through the tokens again and keep only those that are either -PRON- (the lemma that spaCy v2 assigns to pronouns) or contain only alphabetic characters.

In [2]:
string ="""
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""
import spacy
# Generate list of tokens
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

In [1]:
# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() or lemma == '-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))


## **Stopwords**

There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as a and the, be verbs such as is and am and pronouns such as he and she.

spaCy has a built-in list of stopwords, which we can access using spacy.lang.en.stop_words.STOP_WORDS.

In [None]:
# Get list of stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]
# Print string after text cleaning
print(' '.join(a_lemmas))
'omg like good thing wow amazing song hooked definitely'

The text preprocessing techniques you use always depend on the application. Many applications find punctuation, numbers, and emojis useful.
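
One way to keep preprocessing application-dependent is to expose each cleaning step as an option. The sketch below is a pure-Python illustration under stated assumptions: the function name clean, its flags, the regular expression, and the tiny TOY_STOPWORDS set are all hypothetical, and a real pipeline would use the NLTK or spaCy stop word lists instead.

```python
import re

# Hypothetical stop word set for illustration; a real one would come from NLTK or spaCy.
TOY_STOPWORDS = {"this", "is", "the", "a", "an"}


def clean(text, keep_numbers=False, keep_stopwords=False):
    """Lowercase the text and return the tokens selected by the flags."""
    # Tokens are runs of letters or runs of digits; punctuation and whitespace are dropped.
    tokens = re.findall(r"[a-z]+|[0-9]+", text.lower())
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    if not keep_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    return tokens


review = "This is the best song! Top 5 definitely."
print(clean(review))                     # numbers and stop words removed
print(clean(review, keep_numbers=True))  # numbers kept, e.g. for ranking-style reviews
```

An application that cares about ratings or rankings might call clean with keep_numbers=True, while a topic model would more likely use the defaults.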

In [None]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])

# **Gensim**

**Gensim** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks such as building document or word vectors and corpora, and performing topic identification and document comparisons.

A word embedding, or word vector, is a multi-dimensional representation of a word or document that is trained on a larger corpus. You can think of it as a multi-dimensional array of numbers; unlike the sparse counts of a bag of words (lots of zeros and a few nonzero entries), embedding vectors are typically dense. With these vectors, we can see relationships among words or documents based on how near or far apart they are, and also what similar comparisons we find. For example, the vector operation king minus queen is approximately equal to man minus woman, and Spain is to Madrid as Italy is to Rome. The deep learning algorithm used to create word vectors has been able to distill this meaning from how those words are used throughout the text.
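
The king/queen and man/woman comparison can be sketched with hand-made vectors. The two-dimensional values below are hypothetical numbers invented for illustration (real embeddings are learned from text and have many more dimensions), and the helper names subtract and cosine are likewise assumptions of this example.

```python
import math

# Hypothetical 2-dimensional embeddings (axes roughly: royalty, maleness).
vectors = {
    "king": (0.90, 0.90),
    "queen": (0.95, 0.10),
    "man": (0.10, 0.85),
    "woman": (0.05, 0.10),
}


def subtract(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal_gap = subtract(vectors["king"], vectors["queen"])
common_gap = subtract(vectors["man"], vectors["woman"])
similarity = cosine(royal_gap, common_gap)
print(round(similarity, 3))
```

A similarity close to 1 means the two difference vectors point in nearly the same direction, which is the sense in which king minus queen is "approximately equal" to man minus woman.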

A corpus (or if plural, corpora) is a set of texts used to help perform natural language processing tasks.

In [None]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]

tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
dictionary.token2id
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

We can see that the Gensim corpus is a list of lists, with each inner list representing one document. Each document is a series of tuples: the first item is the token id from the dictionary, and the second item is the token's frequency in the document.
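
To make the (token id, frequency) structure concrete, here is a pure-Python sketch that mimics it. This is not Gensim's implementation: the names token2id and doc2bow are reused only for readability, the documents are hypothetical, and the ids Gensim assigns to the same tokens may differ.

```python
from collections import Counter

# Two hypothetical, already tokenized documents
docs = [
    ["i", "liked", "the", "movie"],
    ["the", "movie", "was", "awful", "awful"],
]

# Assign each new token the next free integer id, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def doc2bow(doc):
    """Return a sorted list of (token_id, count) tuples for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


corpus = [doc2bow(doc) for doc in docs]
print(token2id)
print(corpus)
```

Each inner list only mentions the tokens that occur in that document, which keeps the representation compact even when the dictionary is large.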

Gensim models can be easily saved, updated, and reused. Our dictionary can also be updated. This more advanced and feature-rich bag of words can be used in future exercises.

from gensim.corpora.dictionary import Dictionary 

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a bag-of-words corpus (a plain list, not a serialized MmCorpus): corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

Another example:
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
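from collections import defaultdict
import itertools

# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
# Sum each token's count across all documents in the corpus
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += int(word_count)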
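
The loop above flattens every document's (id, count) pairs into a single stream and sums the counts per token id. A minimal standalone sketch of the same idea follows, using a small hypothetical corpus and an id-to-token mapping (toy_corpus and toy_id2token are made-up stand-ins for the Gensim objects) and then sorting the totals.

```python
from collections import defaultdict
import itertools

# Hypothetical stand-ins for the Gensim corpus and dictionary
toy_corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]
toy_id2token = {0: "space", 1: "movie", 2: "alien"}

# Sum the counts of each token id across all documents
totals = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(toy_corpus):
    totals[word_id] += word_count

# Sort by total count, highest first
sorted_totals = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for word_id, word_count in sorted_totals:
    print(toy_id2token[word_id], word_count)
```

Sorting the summed counts gives the most frequent tokens across the whole corpus rather than within a single document.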
