# **Preprocessing**

Text preprocessing helps produce better input data for machine learning or other statistical methods. You have already applied small bits of preprocessing (like tokenization) to create a bag of words. You also noticed that simple techniques, like lowercasing all of the tokens, can lead to slightly better results for a bag-of-words model.

Other common techniques include **lemmatization** or **stemming**, where you shorten words to their root forms, and **removing stop words**, which are common words in a language that don't carry a lot of meaning, such as "and" or "the". You can also remove punctuation or other unwanted tokens.

In [1]:
import pandas as pd 

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""
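# Lowercase, tokenize, and keep only alphabetic tokens
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Build the stop word set once instead of reloading the list for every token
stop_set = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in stop_set]
Counter(no_stops).most_common(2)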

[('cat', 3), ('box', 3)]

The string isalpha() method returns True if the string contains only alphabetic characters. We use isalpha() in an if clause while iterating over the tokenized result so that only alphabetic strings are kept (this effectively strips tokens that contain numbers or punctuation).

Preprocessing has already improved our bag of words and made it more useful by removing the stop words and non-alphabetic tokens.

In [2]:
english_stops = stopwords.words('english')
english_stops[:5]

['i', 'me', 'my', 'myself', 'we']

In [3]:
text = "The lion is the King of the Jungle. Lions are carnivors. Lions live in the African Sabanna. Africa is the poorest continent in the World"
tokens = word_tokenize(text)

lower_tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

In [4]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer 

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('lion', 3), ('king', 1), ('jungle', 1), ('carnivors', 1), ('live', 1), ('african', 1), ('sabanna', 1), ('africa', 1), ('poorest', 1), ('continent', 1)]


# **Text cleaning**

Some of the most common text cleaning steps include removing extra whitespace, escape sequences, punctuation, special characters, numbers and stopwords.

Every Python string has an isalpha() method that returns True if all the characters of the string are alphabetic. Therefore, "Dog".isalpha() returns True but "3dogs".isalpha() returns False, as it contains the non-alphabetic character 3. Similarly, numbers, punctuation and emojis all return False. This makes isalpha() a convenient way to remove all (lemmatized) tokens that are or contain numbers, punctuation and emojis.

If isalpha() as a silver bullet that cleans text meticulously seems too good to be true, that's because it is. isalpha() returns False on some words we would not want to remove. Examples include abbreviations such as U.S.A. and U.K., which have periods in them, and proper nouns with numbers in them such as word2vec and xto10x. For such nuanced cases, isalpha() may not be sufficient. It may be advisable to write your own custom functions, typically using regular expressions, to ensure you're not inadvertently removing useful words.
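
As one hedged illustration of such a custom function, the sketch below contrasts isalpha() with a regular-expression filter. The helper name `is_wordlike` and its two patterns are assumptions made up for this example, not a standard recipe; they only cover ASCII letters and should be adapted to the tokens your corpus actually contains.

```python
import re

# Hypothetical patterns, chosen only for illustration:
#   1. letters optionally mixed with digits, starting with a letter
#      (e.g. word2vec, xto10x)
#   2. dotted abbreviations such as U.S.A. or U.K.
WORDLIKE = re.compile(r"[A-Za-z][A-Za-z0-9]*|(?:[A-Za-z]\.){2,}")


def is_wordlike(token):
    """Return True if the whole token matches one of the patterns above."""
    return WORDLIKE.fullmatch(token) is not None


tokens = ["Dog", "3dogs", "U.S.A.", "word2vec", "xto10x", "!!!", "5"]

kept_by_isalpha = [t for t in tokens if t.isalpha()]
kept_by_regex = [t for t in tokens if is_wordlike(t)]

print(kept_by_isalpha)  # ['Dog']
print(kept_by_regex)    # ['Dog', 'U.S.A.', 'word2vec', 'xto10x']
```

Note that this particular pattern still rejects "3dogs" because it starts with a digit; whether that is desirable depends on the application.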

## **Remove non-alphabetic characters**
The string below has a lot of punctuation, unnecessary extra whitespace, escape sequences, numbers and emojis. We will generate the lemmatized tokens like before. Next, we loop through the tokens again and keep only those that are either -PRON- or contain only alphabetic characters. (-PRON- is the placeholder lemma that spaCy 2.x assigns to pronouns; later spaCy versions no longer use it.)

In [5]:
string = """
OMG!!!! This is like    the best thing ever \t\n.
Wow, such an amazing song! I'm hooked. Top 5 definitely. ?
"""

In [6]:
import spacy 

In [7]:
# Generate list of tokens
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)
lemmas = [token.lemma_ for token in doc]

# Remove tokens that are not alphabetic
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() or lemma == '-PRON-']

# Print string after text cleaning
print(' '.join(a_lemmas))

OMG this be like the good thing ever wow such an amazing song -PRON- be hooked Top definitely


## **Stopwords**

There are some words in the English language that occur so commonly that it is often a good idea to just ignore them. Examples include articles such as a and the, be verbs such as is and am and pronouns such as he and she.

spaCy has a built-in set of stopwords, which we can access as spacy.lang.en.stop_words.STOP_WORDS.

In [8]:
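# Get spaCy's built-in set of English stopwords
# (note: this rebinds the name `stopwords`, shadowing the NLTK import from earlier)
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))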

OMG like good thing wow amazing song hooked Top definitely


The text preprocessing techniques you use always depend on the application. Many applications may find punctuation, numbers and emojis useful.

In [9]:
text = """ If we hope to one day leave Earth and explore the universe, our bodies are going to have to get a lot better at surviving the harsh conditions of space. 
                    Using synthetic biology, Lisa Nip hopes to harness special powers from microbes on Earth -- such as the ability to withstand radiation -- to make humans more fit for exploring space. 
                    "We're approaching a time during which we'll have the capacity to decide our own genetic destiny," Nip says. 
                    "Augmenting the human body with new abilities is no longer a question of how, but of when." """

In [10]:
doc = nlp(text, disable=['ner', 'parser'])

# Generate lemmas
lemmas = [token.lemma_ for token in doc]
print(lemmas)

[' ', 'if', '-PRON-', 'hope', 'to', 'one', 'day', 'leave', 'Earth', 'and', 'explore', 'the', 'universe', ',', '-PRON-', 'body', 'be', 'go', 'to', 'have', 'to', 'get', 'a', 'lot', 'well', 'at', 'survive', 'the', 'harsh', 'condition', 'of', 'space', '.', '\n                    ', 'use', 'synthetic', 'biology', ',', 'Lisa', 'Nip', 'hope', 'to', 'harness', 'special', 'power', 'from', 'microbe', 'on', 'Earth', '--', 'such', 'as', 'the', 'ability', 'to', 'withstand', 'radiation', '--', 'to', 'make', 'human', 'more', 'fit', 'for', 'explore', 'space', '.', '\n                    ', '"', '-PRON-', 'be', 'approach', 'a', 'time', 'during', 'which', '-PRON-', 'will', 'have', 'the', 'capacity', 'to', 'decide', '-PRON-', 'own', 'genetic', 'destiny', ',', '"', 'Nip', 'say', '.', '\n                    ', '"', 'augment', 'the', 'human', 'body', 'with', 'new', 'ability', 'be', 'no', 'long', 'a', 'question', 'of', 'how', ',', 'but', 'of', 'when', '.', '"']


In [11]:
# Remove stopwords and non-alphabetic characters
a_lemmas = [lemma for lemma in lemmas 
    if lemma.isalpha() and lemma not in stopwords]

print(' '.join(a_lemmas))

hope day leave Earth explore universe body lot survive harsh condition space use synthetic biology Lisa Nip hope harness special power microbe Earth ability withstand radiation human fit explore space approach time capacity decide genetic destiny Nip augment human body new ability long question


# **Word embedding**

Consider the three sentences "I am happy", "I am joyous" and "I am sad". If we compute similarities from word counts, "I am happy" and "I am joyous" get the same score as "I am happy" and "I am sad", regardless of which of the vectorization techniques covered so far we use. This is because 'happy', 'joyous' and 'sad' are treated as completely different words. However, we know that happy and joyous are more similar to each other than either is to sad. The vectorization techniques we've covered so far simply cannot capture this.
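
To make the point concrete, here is a minimal sketch that computes cosine similarity between plain word-count vectors using only the standard library. The helper `bow_cosine` is a name invented for this example, and whitespace splitting stands in for a real tokenizer.

```python
import math
from collections import Counter


def bow_cosine(sent_a, sent_b):
    """Cosine similarity between the word-count vectors of two sentences."""
    a = Counter(sent_a.lower().split())
    b = Counter(sent_b.lower().split())
    # Words missing from b count as 0, so only shared words contribute
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)


happy_joyous = bow_cosine("I am happy", "I am joyous")
happy_sad = bow_cosine("I am happy", "I am sad")

# Both pairs share exactly the words "i" and "am", so the scores are identical
print(happy_joyous, happy_sad)
```

Each pair shares two of three words, so both scores come out at roughly 2/3; the word counts carry no information about which of the differing words are close in meaning.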

Word embedding is the process of mapping words into an n-dimensional vector space. These vectors are usually produced using deep learning models and huge amounts of data.

Because words with similar meanings tend to end up close together in this space, the vectors can be used to detect synonyms and antonyms. Word embeddings can also capture more complex relationships. For instance, they can be used to detect that the words king and queen relate to each other the same way as man and woman, or that France and Paris are related in the same way as Russia and Moscow.
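
The king/queen relationship is often illustrated with vector arithmetic. The sketch below uses hand-made two-dimensional vectors (one axis loosely standing for "royalty", the other for "maleness"); these numbers and the helper `analogy` are hypothetical and exist only to show the idea, since real embeddings are learned from data and have far more dimensions.

```python
import math

# Hypothetical 2-d vectors: axis 0 ~ "royalty", axis 1 ~ "maleness"
toy_vectors = {
    "king": (0.9, 0.9),
    "queen": (0.9, 0.1),
    "man": (0.1, 0.9),
    "woman": (0.1, 0.1),
    "prince": (0.8, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = tuple(
        x - y + z
        for x, y, z in zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])
    )
    # Exclude the three query words themselves from the candidates
    candidates = [w for w in toy_vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(toy_vectors[w], target))


# "king" is to "man" as ??? is to "woman"
result = analogy("king", "man", "woman")
print(result)  # queen
```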

We will use spaCy, whose en_core_web_lg model ships with word vectors.

In [15]:
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp('I am happy')
for token in doc:
    print(token.vector)
    break

[ 1.8733e-01  4.0595e-01 -5.1174e-01 -5.5482e-01  3.9716e-02  1.2887e-01
  4.5137e-01 -5.9149e-01  1.5591e-01  1.5137e+00 -8.7020e-01  5.0672e-02
  1.5211e-01 -1.9183e-01  1.1181e-01  1.2131e-01 -2.7212e-01  1.6203e+00
 -2.4884e-01  1.4060e-01  3.3099e-01 -1.8061e-02  1.5244e-01 -2.6943e-01
 -2.7833e-01 -5.2123e-02 -4.8149e-01 -5.1839e-01  8.6262e-02  3.0818e-02
 -2.1253e-01 -1.1378e-01 -2.2384e-01  1.8262e-01 -3.4541e-01  8.2611e-02
  1.0024e-01 -7.9550e-02 -8.1721e-01  6.5621e-03  8.0134e-02 -3.9976e-01
 -6.3131e-02  3.2260e-01 -3.1625e-02  4.3056e-01 -2.7270e-01 -7.6020e-02
  1.0293e-01 -8.8653e-02 -2.9087e-01 -4.7214e-02  4.6036e-02 -1.7788e-02
  6.4990e-02  8.8451e-02 -3.1574e-01 -5.8522e-01  2.2295e-01 -5.2785e-02
 -5.5981e-01 -3.9580e-01 -7.9849e-02 -1.0933e-02 -4.1722e-02 -5.5576e-01
  8.8707e-02  1.3710e-01 -2.9873e-03 -2.6256e-02  7.7330e-02  3.9199e-01
  3.4507e-01 -8.0130e-02  3.3451e-01  2.7063e-01 -2.4544e-02  7.2576e-02
 -1.8120e-01  2.3693e-01  3.9977e-01  4.5012e-01  2

We can compute how similar two words are to each other by using the similarity method of a spaCy token. Let's say we want to compute how similar happy, joyous and sad are to each other.

In [16]:
doc = nlp("happy joyous sad")
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

happy happy 1.0
happy joyous 0.533303
happy sad 0.64389884
joyous happy 0.533303
joyous joyous 1.0
joyous sad 0.43832767
sad happy 0.64389884
sad joyous 0.43832767
sad sad 1.0


spaCy also allows us to directly compute the similarity between two documents by using the average of the word vectors of all the words in a particular document.

In [18]:
# Generate doc objects
sent1 = nlp("I am happy")
sent2 = nlp("I am sad")
sent3 = nlp("I am joyous")

# Compute similarity between sent1 and sent2
print(sent1.similarity(sent2))

# Compute similarity between sent1 and sent3
print(sent1.similarity(sent3))

0.9492464724721577
0.9239675481730458
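
The averaging step can be sketched without spaCy. The three-dimensional vectors below are invented for illustration (real spaCy vectors have hundreds of dimensions), and `doc_vector` and `cosine` are hypothetical helper names. The sketch shows one plausible reason two sentences can score high even when their distinguishing words are dissimilar: the shared words "I" and "am" pull the averaged vectors towards each other.

```python
import math

# Hypothetical 3-d word vectors, invented for this example
toy_vectors = {
    "i": [0.1, 0.3, 0.5],
    "am": [0.2, 0.1, 0.4],
    "happy": [0.9, 0.8, 0.1],
    "sad": [-0.7, 0.6, 0.2],
}


def doc_vector(text):
    """Average the vectors of all whitespace-separated words in the text."""
    vectors = [toy_vectors[w] for w in text.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


doc_sim = cosine(doc_vector("I am happy"), doc_vector("I am sad"))
word_sim = cosine(toy_vectors["happy"], toy_vectors["sad"])

# The document-level score is higher than the word-level score
print(round(doc_sim, 3), round(word_sim, 3))
```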


# **TF-IDF**

**Tf-idf** stands for term frequency-inverse document frequency. It is a commonly used natural language processing model that helps you determine the most important words in each document in the corpus. The idea behind tf-idf is that each corpus might have more shared words than just stopwords.

If I am an astronomer, the word sky might be used often but is not important, so I want to downweight it. Tf-idf does precisely that: it takes texts that share common language and ensures that the most common words across the entire corpus don't show up as keywords. It keeps document-specific frequent words weighted high and words common across the entire corpus weighted low.

In some texts, some terms should be given a larger weight on account of their exclusivity. For example, in the astronomy field, the word 'jupiter' characterizes a document more than 'universe' does.



The equation to calculate the weights can be outlined like so: The weight of token i in document j is calculated by taking the term frequency (or how many times the token appears in the document) multiplied by the log of the total number of documents divided by the number of documents that contain the same term. 
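
Written as a formula, with symbol names chosen here for illustration (the original notation is not shown in the text), the description above reads:

```latex
w_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right)
```

Here $w_{i,j}$ is the weight of token $i$ in document $j$, $\mathrm{tf}_{i,j}$ is the number of times token $i$ appears in document $j$, $N$ is the total number of documents, and $\mathrm{df}_i$ is the number of documents that contain token $i$.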

Here we can see that if the total number of documents divided by the number of documents that contain the term is close to one, then the logarithm will be close to zero. So words that occur across many or all documents will have a very low tf-idf weight. Conversely, if the word only occurs in a few documents, the logarithm will return a higher number.

In general, the higher the tf-idf weight, the more important the word is in characterizing the document. A high tf-idf weight for a word in a document may imply that the word is relatively exclusive to that particular document, that the word occurs extremely often in the document, or both.
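
A minimal sketch of this weighting in plain Python, following the term frequency times log(total documents / documents containing the term) description above. The tiny tokenized corpus and the helper name `tfidf_weights` are invented for this example; library implementations typically add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus, invented for this example
docs = [
    ["sky", "jupiter", "jupiter", "moon"],
    ["sky", "star", "galaxy"],
    ["sky", "telescope", "star"],
]


def tfidf_weights(doc, corpus):
    """Weight each token in doc as tf * log(N / df)."""
    n_docs = len(corpus)
    weights = {}
    for token, count in Counter(doc).items():
        # Number of documents in the corpus that contain the token
        df = sum(1 for d in corpus if token in d)
        weights[token] = count * math.log(n_docs / df)
    return weights


weights = tfidf_weights(docs[0], docs)

# Print tokens from highest to lowest weight
for token, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(token, round(weight, 3))
```

In this toy corpus "sky" appears in every document, so its weight is zero, while "jupiter" is both frequent in the first document and exclusive to it, so it ranks highest.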

Weighting words this way has a huge number of applications. The weights can be used to automatically detect stopwords for the corpus instead of relying on a generic list. They're also used in search algorithms to determine the ranking of pages containing the search query, and in recommender systems.

With gensim:


In [21]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

# Hypothetical tokenized documents, added only so the cell runs on its own
tokenized_docs = [['sky', 'jupiter', 'jupiter', 'moon'],
                  ['sky', 'star', 'galaxy'],
                  ['sky', 'telescope', 'star']]
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(d) for d in tokenized_docs]
doc = corpus[1]

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

print(tfidf_weights[:5]) # Print the first five weights

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

The weighting mechanism we've described is known as term frequency-inverse document frequency or tf-idf for short. It is based on the idea that the weight of a term in a document should be proportional to its frequency and an inverse function of the number of documents in which it occurs. 

In [20]:
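from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus of raw strings, added only so the cell runs on its own
corpus = ['The sky is full of stars',
          'Jupiter is the largest planet',
          'The telescope points at the sky']

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())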

TfidfVectorizer is used almost identically to CountVectorizer. The only difference is that TfidfVectorizer assigns weights using the tf-idf formula and has extra parameters related to inverse document frequency. Notice that the weights are non-integer and reflect values calculated by the tf-idf formula. By default, scikit-learn applies a smoothed variant of the idf term and L2-normalizes each row, so the numbers will not exactly match the plain formula described earlier.

# **Named entity recognition**

Named Entity Recognition (NER) is a natural language processing task that identifies important named entities in a text, such as people, places and organizations. Depending on the library and notation you use, entities can also be dates, states, works of art and other categories. NER can be used alongside topic identification, or on its own, to determine important items in a text or to answer basic natural language understanding questions such as who, what, when and where.

NER has a host of useful applications. It is used to build efficient search algorithms and question answering systems.

NLTK allows you to interact with named entity recognition via its own model, but also via the Stanford CoreNLP library. The Stanford library integration requires you to perform a few steps before you can use it, including installing the required Java files and setting system environment variables. You can also use the Stanford library on its own without integrating it with NLTK, or operate it as an API server. The Stanford CoreNLP library has great support for named entity recognition as well as some related NLP tasks, such as coreference (linking pronouns and entities together) and dependency trees, which help with parsing meaning and relationships amongst words or phrases in a sentence.

In [22]:
import nltk

sentence = '''In New York, I like to ride the Metro to
             visit MOMA and some restaurants rated
             well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
print(tagged_sent[:3])
# [('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

# Chunk the tagged tokens into a tree of named entities
print(nltk.ne_chunk(tagged_sent))



This tree shows the named entities tagged as their own chunks, such as GPE (geopolitical entity) for New York, or MOMA and Metro as organizations. It also identifies Ruth Reichl as a person. It does so without consulting a knowledge base, like Wikipedia, but instead uses trained statistical and grammatical parsers.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# `article` is assumed to be a string holding the text to analyze

# Tokenize the article into sentences: sentences
sentences = sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

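Another example counts how often each entity category occurs and plots the result as a pie chart:

from collections import defaultdict
import matplotlib.pyplot as plt

# Re-create the chunks with binary=False so that entities keep their
# specific labels (e.g. PERSON, GPE, ORGANIZATION) instead of a generic NE
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=False)

# Count how often each entity label occurs
ner_categories = defaultdict(int)
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1

labels = list(ner_categories.keys())
values = [ner_categories.get(v) for v in labels]

plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)
plt.show()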

NER has also found applications with news providers, who use it to categorize their articles, and customer service centers, who use it to classify and record complaints efficiently.

A named entity is anything that can be denoted with a proper name or a proper noun. Named entity recognition or NER, therefore, is the process of identifying such named entities in a piece of text and classifying them into predefined categories such as person, organization, country, etc. 

For example, consider the text "John Doe is a software engineer working at Google. He lives in France." Performing NER on this text will tell us that there are three named entities: John Doe, who is a person, Google, which is an organization and France, which is a country (or geopolitical entity). 

import spacy

string = "John Doe is a software engineer working at Google. He lives in France."

# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(string)

# Generate named entities
ne = [(ent.text, ent.label_) for ent in doc.ents]
print(ne)

# Identify the persons
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
print(persons)


Currently, spaCy's models are capable of identifying more than 15 different types of named entities. The complete list of categories and their annotations can be found in spaCy's documentation.

If we are trying to extract named entities from texts in a heavily technical field, such as medicine, spaCy's pretrained models may not perform well. In such cases, it is better to train your own models on your specialized data.
