<h1>
Introduction to Natural Language Processing for Text
</h1>
<b>Author: </b> <a href="https://towardsdatascience.com/@ventsislav94">Ventsislav Yordanov</a><br/>
<b>Original article: </b> <a href="https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63">towardsdatascience.com</a><br/><br/>

<img src="https://miro.medium.com/max/2000/1*CuPIUoh1nvh_r1Ssqmy8SA.jpeg"/>

<p id="f027">
After reading this blog post, you’ll know some basic techniques to
<strong>
extract features from
</strong>
some
<strong>
text
</strong>
, so you can use these features as
<strong>
input
</strong>
for
<strong>
machine learning models
</strong>
.
</p>
<h1 id="107b">
What is NLP (Natural Language Processing)?
</h1>
<p id="2584">
<strong>
NLP
</strong>
is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages. It is used to apply
<strong>
machine learning
</strong>
algorithms to
<strong>
text
</strong>
and
<strong>
speech
</strong>
.
</p>
<p id="4b9f">
For example, we can use NLP to create systems like
<strong>
speech recognition
</strong>
,
<strong>
document summarization
</strong>
,
<strong>
machine translation
</strong>
,
<strong>
spam detection
</strong>
,
<strong>
named entity recognition
</strong>
,
<strong>
question answering, autocomplete, predictive typing
</strong>
and so on.
</p>
<p id="cdbc">
Nowadays, most of us have smartphones with speech recognition. These smartphones use NLP to understand what is said. Also, many people use laptops whose operating system has built-in speech recognition.
</p>
<h2 id="c54d">
Some Examples
</h2>
<p id="b58d">
<strong>
Cortana
</strong>
</p>
<img src="https://miro.medium.com/max/1400/1*TXj0kr4jVrtLtmvxZFu8Lw.png"/>
<br/>
<br/>
<p id="60dd">
Microsoft Windows has a virtual assistant called
<a href="https://support.microsoft.com/en-us/help/17214/windows-10-what-is">
<strong>
Cortana
</strong>
</a>
that can recognize a
<strong>
natural voice
</strong>
. You can use it to set up reminders, open apps, send emails, play games, track flights and packages, check the weather and so on.
</p>
<p id="0c62">
You can read more about Cortana commands
<a href="https://www.howtogeek.com/225458/15-things-you-can-do-with-cortana-on-windows-10/">
here
</a>
.
</p>
<p id="cd7f">
<strong>
Siri
</strong>
</p>
<img src="https://miro.medium.com/max/1400/1*-AuKCZbXIVOhI-AgX4J8PQ.jpeg"/>
<br/>
<p id="dcd6">
Siri is the virtual assistant built into Apple Inc.’s iOS, watchOS, macOS, HomePod, and tvOS operating systems. Again, you can do a lot of things with
<strong>
voice
</strong>
<strong>
commands
</strong>
: start a call, text someone, send an email, set a timer, take a picture, open an app, set an alarm, use navigation and so on.
</p>

<p id="95e8">
<a href="https://www.cnet.com/how-to/the-complete-list-of-siri-commands/">
Here
</a>
is a complete list of all Siri commands.
</p>
<p id="45f3">
<strong>
Gmail
</strong>
</p>
<img src="https://miro.medium.com/max/1400/1*fTPhu7PqgIbnngbWG5zFWA.gif"/>
<br/>
<p id="0391">
The famous email service
<strong>
Gmail
</strong>
developed by Google uses
<strong>
spam detection
</strong>
to filter out some spam emails.
</p>
<h1 id="da92">
Introduction to the NLTK library for Python
</h1>
<p id="5a49">
NLTK (
<strong>
Natural Language Toolkit
</strong>
) is a
<strong>
leading platform
</strong>
for building Python programs to work with
<strong>
human language data
</strong>
. It provides easy-to-use interfaces to
<strong>
many
</strong>
<a href="https://en.wikipedia.org/wiki/Text_corpus">
<strong>
corpora
</strong>
</a>
and
<strong>
lexical resources
</strong>
. Also, it contains a suite of
<strong>
text processing libraries
</strong>
for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Best of all, NLTK is a free, open source, community-driven project.
</p>
<p id="024d">
We’ll use this toolkit to show some basics of natural language processing. The examples below assume that NLTK is installed and imported, which we can do like this:
</p>


In [1]:
!pip install nltk

In [2]:
import nltk
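nltk.download('punkt')
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead;
# if sent_tokenize raises a LookupError, try nltk.download('punkt_tab').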

<h1 id="4164">
The Basics of NLP for Text
</h1>
<p id="4e0d">
In this article, we’ll cover the following topics:
</p>
<ol>
<li id="5bae">
Sentence Tokenization
</li>
<li id="8a7f">
Word Tokenization
</li>
<li id="6c69">
Text Lemmatization and Stemming
</li>
<li id="4146">
Stop Words
</li>
<li id="69c2">
Regex
</li>
<li id="9269">
Bag-of-Words
</li>
<li id="85ee">
TF-IDF
</li>
</ol>
<h2 id="9773">
1. Sentence Tokenization
</h2>
<p id="416a">
Sentence tokenization (also called
<strong>
sentence segmentation
</strong>
) is the problem of
<strong>
dividing a string
</strong>
of written language
<strong>
into
</strong>
its component
<strong>
sentences
</strong>
. The idea looks very simple: in English and some other languages, we can split the text into sentences whenever we see a sentence-ending punctuation mark such as a full stop, question mark, or exclamation mark.
</p>
<p id="99b8">
However, even in English, this problem is not trivial because the full stop character is also used in abbreviations. When processing plain text, tables of abbreviations that contain periods can help us avoid assigning incorrect
<strong>
sentence boundaries
</strong>
. In many cases, we use libraries to do that job for us, so don’t worry too much about the details for now.
</p>
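<p>
To make the abbreviation problem concrete, here is a minimal sketch in plain Python. The sample text, the tiny <code>ABBREVIATIONS</code> table and the <code>split_sentences</code> helper are all made up for illustration; real tokenizers rely on much larger tables or on rules learned from data.
</p>

```python
import re

text = "Dr. Smith moved to the U.S. in 2010. He teaches at a local school."

# Naive approach: break after every ".", "!" or "?" that is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

# Hypothetical, tiny abbreviation table (illustration only).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s."}

def split_sentences(text):
    """Close a sentence at ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(naive_sentences)        # the naive split breaks after "Dr." and "U.S."
print(split_sentences(text))  # the abbreviation table keeps both sentences whole
```

<p>
Even this version is easy to fool: if an abbreviation such as “U.S.” happens to end a sentence, the boundary is missed. That is one reason we usually leave the job to a library.
</p>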
<p id="55a5">
<strong>
Example
</strong>
:
</p>
<p id="9f51">
Let’s look at a piece of text about a famous board game called backgammon.
</p>
<blockquote>
<p id="4012">
Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice.
</p>
</blockquote>
<p id="5d1a">
To apply sentence tokenization with NLTK, we can use the
<code>
nltk.sent_tokenize
</code>
function.
</p>

In [3]:
text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has fifteen checkers which move between twenty-four points according to the roll of two dice."
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    print(sentence)
    print()

<p id="e25f">
As output, we get the three component sentences, each printed separately.
</p>
<h2 id="2067">
2. Word Tokenization
</h2>
<p id="159f">
Word tokenization (also called
<strong>
word segmentation
</strong>
) is the problem of
<strong>
dividing a string
</strong>
of written language
<strong>
into
</strong>
its component
<strong>
words
</strong>
. In English and many other languages that use some form of the Latin alphabet, the space character is a good approximation of a word divider.
</p>
<p id="44b0">
However, splitting only on spaces can still give the wrong result. Some English compound nouns are written in more than one way and sometimes contain a space. In most cases, we use a library to get the result we want, so again, don’t worry too much about the details.
</p>
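<p>
As a rough illustration of why a plain split on spaces is not enough, the sketch below compares it with a small regular-expression tokenizer. The sample sentence and the pattern are my own choices for this example, not what NLTK does internally, and NLTK makes different decisions for contractions such as “it's”.
</p>

```python
import re

sentence = "Backgammon is a two player game; it's played on a board."

# Splitting on whitespace leaves punctuation glued to the neighbouring word.
space_tokens = sentence.split()

# A simple regex tokenizer: a run of word characters (optionally followed by
# an apostrophe and more word characters), or any single character that is
# neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(space_tokens)
print(regex_tokens)
```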
<p id="5af4">
<strong>
Example
</strong>
:
</p>
<p id="8a66">
Let’s use the sentences from the previous step and see how we can apply word tokenization on them. We can use the
<code>
nltk.word_tokenize
</code>
function.
</p>

In [4]:
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print(words)
    print()


<h2 id="0c8d">
Text Lemmatization and Stemming
</h2>
<p id="cc28">
For grammatical reasons, documents can contain
<strong>
different forms of a word
</strong>
such as
<em>
drive
</em>
,
<em>
drives
</em>
,
<em>
driving
</em>
. Also, sometimes we have
<strong>
related words
</strong>
with a similar meaning, such as
<em>
nation
</em>
,
<em>
national
</em>
,
<em>
nationality
</em>
.
</p>
<blockquote>
<p id="6e80">
The goal of both
<strong>
stemming
</strong>
and
<strong>
lemmatization
</strong>
is to
<strong>
reduce
</strong>
<a href="https://en.wikipedia.org/wiki/Inflection">
<strong>
inflectional
</strong>
</a>
<strong>
forms
</strong>
and sometimes derivationally related forms
of a
<strong>
word to
</strong>
a
<strong>
common base form
</strong>
.
</p>
</blockquote>
<p id="c85e">
Source:
<a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
</a>
</p>
<p id="79aa">
<strong>
Examples
</strong>
:
</p>
<ul>
<li id="993f">
am, are, is
<code>
=&gt;
</code>
be
</li>
<li id="e377">
dog, dogs, dog’s, dogs’
<code>
=&gt;
</code>
dog
</li>
</ul>
<p id="9372">
The result of applying this mapping to a text will be something like this:
</p>
<ul>
<li id="377f">
the boy’s dogs are different sizes
<code>
=&gt;
</code>
the boy dog be differ size
</li>
</ul>
<p id="49c9">
Stemming and lemmatization are special cases of
<strong>
normalization
</strong>
. However, they are different from each other.
</p>

<blockquote>
<p id="67df">
<strong>
Stemming
</strong>
usually refers to a
<strong>
crude
</strong>
<a href="https://en.wikipedia.org/wiki/Heuristic">
<strong>
heuristic
</strong>
</a>
<strong>
process
</strong>
that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.
</p>
<p id="6723">
<strong>
Lemmatization
</strong>
usually refers to
<strong>
doing things properly
</strong>
with the use of a
<strong>
vocabulary
</strong>
and
<strong>
morphological analysis
</strong>
of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the
<strong>
lemma
</strong>
.
</p>
</blockquote>
<p id="741d">
Source:
<a href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
</a>
</p>
<p id="fa62">
The difference is that a
<strong>
stemmer
</strong>
operates
<strong>
without knowledge of the context
</strong>
, and therefore cannot distinguish between words that have different meanings depending on the part of speech. However, stemmers also have some advantages: they are
<strong>
easier to implement
</strong>
and usually
<strong>
run faster
</strong>
. Also, the reduced “accuracy” may not matter for some applications.
</p>
<p id="833e">
<strong>
Examples:
</strong>
</p>
<ol>
<li id="6ce8">
The word “better” has “good” as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
</li>
<li id="a8ff">
The word “play” is the base form for the word “playing”, and hence this is matched in both stemming and lemmatization.
</li>
<li id="a7fc">
The word “meeting” can be either the base form of a noun or a form of a verb (“to meet”) depending on the context; e.g., “in our last meeting” or “We are meeting again tomorrow”. Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
</li>
</ol>
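<p>
The three cases above can be mimicked with a deliberately tiny sketch. Both helpers, the suffix list and the lemma table are hypothetical toys written for this illustration; they are not the Porter algorithm or the WordNet lemmatizer.
</p>

```python
# Hypothetical suffix list and lemma table, for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {
    ("better", "adj"): "good",
    ("playing", "verb"): "play",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_stem(word):
    """Chop off the first matching suffix; no dictionary, no context."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look up the (word, part of speech) pair; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

examples = [("better", "adj"), ("playing", "verb"),
            ("meeting", "noun"), ("meeting", "verb")]
for word, pos in examples:
    print(f"{word} ({pos}): stem={toy_stem(word)}, lemma={toy_lemmatize(word, pos)}")
```

<p>
The toy stemmer gives the same answer for “meeting” regardless of how the word is used, while the lookup-based lemmatizer can return a different lemma once it is told the part of speech.
</p>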
<p id="c803">
Now that we know the difference, let’s see some examples using NLTK.
</p>

In [5]:
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

def compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word, pos):
    """
    Print the results of stemming and lemmatization using the passed stemmer, lemmatizer, word and pos (part of speech)
    """
    print("Stemmer:", stemmer.stem(word))
    print("Lemmatizer:", lemmatizer.lemmatize(word, pos))
    print()

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "seen", pos = wordnet.VERB)
compare_stemmer_and_lemmatizer(stemmer, lemmatizer, word = "drove", pos = wordnet.VERB)

<h2 id="0b95">
Stop words
</h2>
<img src="https://miro.medium.com/max/820/1*kMf7dZW4jTyaq1hxjA0pgg.png"/>
<p id="37bf">
Stop words are words which are
<strong>
filtered out
</strong>
before or after processing of text. When applying machine learning to text, these words can add a lot of
<strong>
noise
</strong>
. That’s why we want to remove these
<strong>
irrelevant words
</strong>
.
</p>
<p id="1fc5">
Stop words
<strong>
usually
</strong>
refer to the
<strong>
most common words
</strong>
such as “
<strong>
and
</strong>
”, “
<strong>
the
</strong>
”, “
<strong>
a
</strong>
” in a language, but there is
<strong>
no single universal list
</strong>
of stopwords. The list of the stop words can change depending on your application.
</p>
<p id="7da9">
NLTK has a predefined list of stop words containing the most common words. If you are using it for the first time, you need to download the stop words with this code:
</p>


In [6]:
nltk.download("stopwords")


<p>
Once we complete the downloading, we can load the
<code>
stopwords
</code>
package from the
<code>
nltk.corpus
</code>
and use it to load the stop words.
</p>

In [7]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

<p id="f34b">
Let’s see how we can remove the stop words from a sentence.
</p>

In [8]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)

<p id="66ae">
If you’re not familiar with
<a href="/python-basics-list-comprehensions-631278f22c40">
<strong>
list comprehensions
</strong>
in Python
</a>
, here is another way to achieve the same result.
</p>

In [9]:
stop_words = set(stopwords.words("english"))
sentence = "Backgammon is one of the oldest known board games."

words = nltk.word_tokenize(sentence)
without_stop_words = []
for word in words:
    if word not in stop_words:
        without_stop_words.append(word)

print(without_stop_words)


<p id="06a9">
However, keep in mind that
<strong>
list comprehensions
</strong>
are usually
<strong>
faster
</strong>
than the equivalent explicit loop, because the interpreter appends each item with a dedicated instruction instead of looking up and calling
<code>
append
</code>
on every iteration.
</p>
<p id="5463">
You might wonder why we convert our list into a
<a href="https://docs.python.org/3/tutorial/datastructures.html#sets">
<strong>
set
</strong>
</a>
. A set is a data type that stores unique values in no particular order. The
<strong>
search operation
</strong>
<strong>
in a set
</strong>
is
<strong>
much faster
</strong>
<strong>
than
</strong>
the search operation
<strong>
in a list
</strong>
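. Checking membership in a set takes roughly constant time on average, whereas a list has to be scanned element by element. For a small number of words there is no big difference, but for a large number of words the set type is highly recommended.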
</p>
<p id="74fb">
If you want to learn more about the time complexity of the different operations on the different data structures, you can look at this awesome
<a href="http://bigocheatsheet.com/">
cheat sheet
</a>
.
</p>
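<p>
If you want to check the difference yourself, a small timing experiment is enough. The sketch below uses a made-up collection of 100,000 tokens; the exact timings depend on your machine, so treat the printed numbers as indicative only.
</p>

```python
import timeit

# Hypothetical word collection: 100,000 made-up tokens.
words_list = [f"word{i}" for i in range(100_000)]
words_set = set(words_list)

target = "word99999"  # worst case for the list: the very last element

# Both containers give the same answer...
assert (target in words_list) == (target in words_set)

# ...but the list is scanned element by element, while the set uses hashing.
list_time = timeit.timeit(lambda: target in words_list, number=100)
set_time = timeit.timeit(lambda: target in words_set, number=100)
print(f"list: {list_time:.5f}s  set: {set_time:.5f}s")
```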
<h2 id="7371">
Regex
</h2>
<img src="https://miro.medium.com/max/1052/1*l_EB11yQfbZsKLFr8ZckuQ.jpeg"/>
<p id="c323">
A
<strong>
regular expression
</strong>
,
<strong>
regex
</strong>
, or
<strong>
regexp
</strong>
is a sequence of characters that defines a
<strong>
search pattern
</strong>
. Let’s see some basics.
</p>
<ul>
<li id="da70">
<code>
.
</code>
- match
<strong>
any character
</strong>
<strong>
except newline
</strong>
</li>
<li id="72dc">
<code>
\w
</code>
- match
<strong>
word character (letter, digit, or underscore)
</strong>
</li>
<li id="5207">
<code>
\d
</code>
- match
<strong>
digit
</strong>
</li>
<li id="6735">
<code>
\s
</code>
- match
<strong>
whitespace
</strong>
</li>
<li id="d972">
<code>
\W
</code>
- match
<strong>
non-word character
</strong>
</li>
<li id="543c">
<code>
\D
</code>
- match
<strong>
non-digit
</strong>
</li>
<li id="2122">
<code>
\S
</code>
- match
<strong>
non-whitespace
</strong>
</li>
<li id="e917">
<code>
[abc]
</code>
- match
<strong>
any
</strong>
of a, b, or c
</li>
<li id="14cf">
<code>
[
<strong>
^
</strong>
abc]
</code>
- match any character
<strong>
except
</strong>
a, b, or c
</li>
<li id="7cc6">
<code>
[a
<strong>
-
</strong>
g]
</code>
- match a character
<strong>
between
</strong>
a &amp; g
</li>
</ul>
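<p>
Here is a short sketch that tries several of these building blocks with <code>re.findall</code>. The sample sentence is made up for this example.
</p>

```python
import re

text = "Flight BA249 departs at 10:45 from gate C7."

digits = re.findall(r"\d", text)            # every single digit
numbers = re.findall(r"\d+", text)          # runs of one or more digits
words = re.findall(r"\w+", text)            # runs of word characters
abc = re.findall(r"[abc]", text)            # only the lowercase letters a, b or c
punctuation = re.findall(r"[^\w\s]", text)  # neither word character nor whitespace
a_to_g = re.findall(r"[a-g]+", text)        # runs of lowercase letters from a to g

for name, result in [("digits", digits), ("numbers", numbers), ("words", words),
                     ("abc", abc), ("punctuation", punctuation), ("a_to_g", a_to_g)]:
    print(f"{name}: {result}")
```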
<blockquote>
<p id="3909">
Regular expressions use the
<strong>
backslash character
</strong>
(
<code>
'\'
</code>
) to indicate special forms or to allow special characters to be used without invoking their special meaning. This
<strong>
collides with Python’s usage
</strong>
of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write
<code>
'\\\\'
</code>
as the pattern string, because the regular expression must be
<code>
\\
</code>
, and each backslash must be expressed as
<code>
\\
</code>
inside a regular Python string literal.
</p>
<p id="86c7">
The solution is to use Python’s
<strong>
raw string notation
</strong>
for regular expression patterns; backslashes are not handled in any special way in a string literal
<strong>
prefixed with
</strong>
<code>
<strong>
'r'
</strong>
</code>
. So
<code>
r"\n"
</code>
is a two-character string containing
<code>
'\'
</code>
and
<code>
'n'
</code>
, while
<code>
"\n"
</code>
is a one-character string containing a newline. Usually, patterns will be expressed in Python code using this raw string notation.
</p>
</blockquote>
<p id="ba7d">
Source:
<a href="https://docs.python.org/3/library/re.html?highlight=regex">
https://docs.python.org/3/library/re.html?highlight=regex
</a>
</p>
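<p>
A quick way to convince yourself of the difference is to compare string lengths and then match a literal backslash both ways. The Windows-style path below is just a made-up example string.
</p>

```python
import re

# "\n" is a single newline character; r"\n" is a backslash followed by "n".
print(len("\n"), len(r"\n"))

path = "C:\\temp\\notes.txt"  # a string containing two literal backslashes

# Without a raw string, the pattern for one literal backslash needs four of them.
assert re.findall("\\\\", path) == re.findall(r"\\", path)

# The raw-string form is easier to read and less error-prone.
parts = re.split(r"\\", path)
print(parts)
```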
<p id="adfa">
We can use regex to apply
<strong>
additional filtering
</strong>
to our text. For example, we can remove all the non-word characters. In many cases, we don’t need the punctuation marks, and it’s easy to remove them with regex.
</p>
<p id="1194">
In Python, the
<code>
<strong>
re
</strong>
</code>
module provides regular expression matching operations similar to those in Perl. We can use the
<code>
<strong>
re.sub
</strong>
</code>
function to replace the matches for a pattern with a replacement string. Let’s see an example where we replace every non-word character with a space.
</p>



In [10]:

import re
sentence = "The development of snowboarding was inspired by skateboarding, sledding, surfing and skiing."
pattern = r"[^\w]"
print(re.sub(pattern, " ", sentence))

<p id="e383">
A regular expression is a powerful tool, and we can create much more complex patterns. If you want to learn more about regex, I recommend trying these two web apps:
<a href="https://regexr.com/">
regexr
</a>
,
<a href="https://regex101.com/">
regex101
</a>
.
</p>
<h2 id="01b2">
Bag-of-words
</h2>

<img src="https://miro.medium.com/max/512/1*RPezKXGUUwla-JP52OnZxA.png"/>

<p id="134b">
Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called
<a href="https://en.wikipedia.org/wiki/Feature_extraction">
<strong>
feature extraction
</strong>
</a>
.
</p>
<p id="1f64">
The
<strong>
bag-of-words
</strong>
model is a
<strong>
popular
</strong>
and
<strong>
simple
</strong>
<strong>
feature extraction technique
</strong>
used when we work with text. It describes the occurrence of each word within a document.
</p>
<p id="fffd">
To use this model, we need to:
</p>
<ol>
<li id="d244">
Design a
<strong>
vocabulary
</strong>
of known words (also called
<strong>
tokens
</strong>
)
</li>
<li id="25aa">
Choose a
<strong>
measure of the presence
</strong>
of known words
</li>
</ol>
<p id="6447">
Any information about
<strong>
the order
</strong>
or
<strong>
structure
</strong>
of words
<strong>
is discarded
</strong>
. That’s why it’s called a
<strong>
bag
</strong>
of words. The model only records whether a known word occurs in a document, not where in the document it occurs.
</p>
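<p>
A tiny sketch makes the loss of word order visible. The two sentences are made up; they mean different things, yet their bags of words are identical.
</p>

```python
from collections import Counter

first = "the dog chased the cat"
second = "the cat chased the dog"

# A bag of words keeps each word with its count and forgets the positions.
bag_first = Counter(first.split())
bag_second = Counter(second.split())

print(bag_first)
print(bag_first == bag_second)
```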
<p id="83a6">
The intuition is that
<strong>
similar documents
</strong>
have
<strong>
similar contents
</strong>
. Also, from the content alone, we can learn something about the meaning of the document.
</p>
<h2 id="8ceb">
<strong>
Example
</strong>
</h2>
<p id="5920">
Let’s go through the steps to create a bag-of-words model. In this example, we’ll use only four sentences to see how the model works. In real-world problems, you’ll work with much bigger amounts of data.
</p>
<p id="59d9">
<strong>
1. Load the Data
</strong>
</p>
<img src="https://miro.medium.com/max/512/1*JTi6Bnodv2sui50F96v7-Q.png"/>
<p id="465e">
Let’s say that this is our data and we want to load it as an array.
</p>
<code>
I like this movie, it's funny.<br/>
I hate this movie.<br/>
This was awesome! I like it.<br/>
Nice one. I love it.
</code>
<p id="1cfc">
To achieve this we can simply read the file and split it by lines.
</p>


In [11]:
with open("reviews.txt", "r") as file:
    documents = file.read().splitlines()
    
print(documents)

<p id="b74c">
<strong>
2.
</strong>
<strong>
Design the Vocabulary
</strong>
</p>
<img src="https://miro.medium.com/max/512/1*AFUcM9S6FwX7RNTW4zqtLQ.png"/>
<p id="d008">
Let’s get all the unique words from the four loaded sentences ignoring the case, punctuation, and one-character tokens. These words will be our vocabulary (known words).
</p>
<p id="dc4a">
We can use the
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">
<strong>
CountVectorizer
</strong>
</a>
class from the sklearn library to design our vocabulary. We’ll see how to use it after the next step.
</p>
<p id="d817">
<strong>
3. Create the Document Vectors
</strong>
</p>
<img src="https://miro.medium.com/max/256/1*90Wv4B73KktRNU9NcYdpjg.png"/>
<p id="2a55">
Next, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of words with 1 for present and 0 for absent.
</p>
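<p>
Before reaching for a library, steps 2 and 3 can be sketched by hand in plain Python. This is only an approximation of what a vectorizer does: the <code>tokenize</code> helper is a hypothetical function written for this example, and the four sentences are typed in directly instead of being read from a file.
</p>

```python
import re

documents = [
    "I like this movie, it's funny.",
    "I hate this movie.",
    "This was awesome! I like it.",
    "Nice one. I love it.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 2: the vocabulary is the sorted set of all tokens in all documents.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# Step 3: one vector per document, 1 if the word is present and 0 otherwise.
vectors = []
for doc in documents:
    present = set(tokenize(doc))
    vectors.append([1 if word in present else 0 for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```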
<p id="02b3">
Now, let’s see how we can create a bag-of-words model using the CountVectorizer class mentioned above.
</p>


In [12]:

# Import the libraries we need
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Step 2. Design the Vocabulary
# The default token pattern removes tokens of a single character. That's why we don't have the "I" and "s" tokens in the output
count_vectorizer = CountVectorizer()

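# Step 3. Create the Bag-of-Words Model
# CountVectorizer stores word counts; no word repeats within any of our sentences, so every entry here is 0 or 1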
bag_of_words = count_vectorizer.fit_transform(documents)

# Show the Bag-of-Words Model as a pandas DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = count_vectorizer.get_feature_names_out()
pd.DataFrame(bag_of_words.toarray(), columns = feature_names)

<p id="513d">
Here are our sentences. Now we can see how the bag-of-words model works.
</p>
<img src="https://miro.medium.com/max/724/1*LtMJ1qSiIuEzZDqB-RQbjw.png"/>
<h2 id="efe2">
Additional Notes on the Bag of Words Model
</h2>
<img src="https://miro.medium.com/max/512/1*JvmcnIYVAzxHYdrxtmMa3Q.png"/>
<p id="d6dc">
The
<strong>
complexity
</strong>
of the bag-of-words model comes in deciding how to
<strong>
design the vocabulary
</strong>
of known words (tokens) and how to
<strong>
score the presence
</strong>
of known words.
</p>
<p id="92aa">
<strong>
Designing the Vocabulary
</strong>
<br/>
When the vocabulary
<strong>
size increases
</strong>
, the length of the vector representation of each document also increases. In the example above, the length of the document vector is equal to the number of known words.
</p>
<p id="c0e9">
In some cases, we can have a
<strong>
huge amount of data
</strong>
and in these cases, the length of the vector that represents a document might be
<strong>
thousands or millions
</strong>
of elements. Furthermore, each document may contain
<strong>
only a few of the known words
</strong>
in the vocabulary.
</p>
<p id="5365">
Therefore the vector representations will have a
<strong>
lot of zeros
</strong>
. These vectors which have a lot of zeros are called
<strong>
sparse vectors
</strong>
. When stored as ordinary dense arrays, they require more memory and computational resources, most of which are spent on zeros.
</p>
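<p>
To get a feel for the saving, here is a minimal sketch of a sparse representation that keeps only the non-zero entries. The vocabulary size and the three word positions are invented for the example. Libraries follow the same idea: CountVectorizer returns a sparse matrix, which is why the earlier example calls <code>toarray()</code> before building the DataFrame.
</p>

```python
# Hypothetical document vector over a 10,000-word vocabulary in which
# only three known words occur.
vocabulary_size = 10_000
dense = [0] * vocabulary_size
dense[12] = 1
dense[407] = 2
dense[9_321] = 1

# Sparse form: keep only {index: value} for the non-zero entries.
sparse = {index: value for index, value in enumerate(dense) if value != 0}

print(len(dense), "stored values in the dense vector")
print(len(sparse), "stored values in the sparse form:", sparse)

# The dense vector can be rebuilt whenever it is needed.
rebuilt = [sparse.get(index, 0) for index in range(vocabulary_size)]
```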
<p id="ece1">
We can
<strong>
decrease
</strong>
the
<strong>
number of the known words
</strong>
when using a bag-of-words model to decrease the required memory and computational resources. We can use the
<strong>
text cleaning techniques
</strong>
we’ve already seen in this article before we create our bag-of-words model:
</p>
<ul>
<li id="8007">
<strong>
Ignoring the case
</strong>
of the words
</li>
<li id="498b">
<strong>
Ignoring punctuation
</strong>
</li>
<li id="cda7">
<strong>
Removing
</strong>
the
<strong>
stop words
</strong>
from our documents
</li>
<li id="2db7">
Reducing the words to their base form (
<strong>
Text Lemmatization and Stemming
</strong>
)
</li>
<li id="6ae2">
<strong>
Fixing misspelled words
</strong>
</li>
</ul>
<p id="a398">
Another more complex way to create a vocabulary is to use
<strong>
grouped words
</strong>
. This changes the
<strong>
scope
</strong>
of the vocabulary and allows the bag-of-words model to get
<strong>
more details
</strong>
about the document. This approach is called
<strong>
n-grams
</strong>
.
</p>
<p id="864e">
An n-gram is a
<strong>
sequence of
</strong>
a number of
<strong>
items
</strong>
(words, letters, digits, etc.). In the context of
<a href="https://en.wikipedia.org/wiki/Text_corpus">
<strong>
text corpora
</strong>
</a>
, n-grams typically refer to a sequence of words. A
<strong>
unigram
</strong>
is one word, a
<strong>
bigram
</strong>
is a sequence of two words, a
<strong>
trigram
</strong>
is a sequence of three words etc. The “n” in the “n-gram” refers to the number of the grouped words. Only the n-grams that appear in the corpus are modeled, not all possible n-grams.
</p>
<p id="c32b">
<strong>
Example
</strong>
<br/>
Let’s look at all the bigrams for the following sentence:
<br/>
<code>
The office building is open today
</code>
</p>
<p id="42a9">
All the bigrams are:
</p>
<ul>
<li id="bac0">
the office
</li>
<li id="18eb">
office building
</li>
<li id="5afd">
building is
</li>
<li id="4e89">
is open
</li>
<li id="bc2b">
open today
</li>
</ul>
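<p>
The list above can be reproduced with a few lines of Python. The <code>ngrams</code> helper below is a hypothetical function written for this illustration. If you work with scikit-learn, CountVectorizer also accepts an <code>ngram_range</code> parameter, so you rarely need to build n-grams by hand.
</p>

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "The office building is open today".lower().split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print([" ".join(gram) for gram in bigrams])
print([" ".join(gram) for gram in trigrams])
```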
<p id="96f5">
The
<strong>
bag-of-bigrams
</strong>
can capture more context than the bag-of-words approach because it preserves some local word order, at the cost of a larger vocabulary.
</p>

<p id="ab9f">
<strong>
Scoring Words
<br/>
</strong>
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We saw one very simple approach: the binary approach (1 for presence, 0 for absence).
</p>
<p id="0843">
Some additional scoring methods are:
</p>
<ul>
<li id="cbb6">
<strong>
Counts
</strong>
. Count the number of times each word appears in a document.
</li>
<li id="c226">
<strong>
Frequencies
</strong>
. Calculate how often each word appears in a document relative to the total number of words in that document.
</li>
</ul>
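<p>
The three scoring methods (binary, counts and frequencies) differ only in the number stored for each word. A small sketch on a made-up sentence shows them side by side.
</p>

```python
from collections import Counter

tokens = "this movie is a good movie".split()

# Counts: how many times each word appears in the document.
counts = Counter(tokens)

# Frequencies: each count divided by the total number of words in the document.
total = len(tokens)
frequencies = {word: count / total for word, count in counts.items()}

# Binary: 1 for every word that is present at least once.
binary = {word: 1 for word in counts}

print(counts)
print(frequencies)
print(binary)
```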
<h2 id="cec9">
TF-IDF
</h2>
<p id="a8ae">
One problem with
<strong>
scoring word frequency
</strong>
is that the most frequent words in the document start to have the highest scores. These frequent words may not contain as much “
<strong>
informational gain
</strong>
” to the model compared with some rarer and domain-specific words. One approach to fix that problem is to
<strong>
penalize
</strong>
words that are
<strong>
frequent across all the documents
</strong>
. This approach is called TF-IDF.
</p>
<p id="75ba">
TF-IDF, short for
<strong>
term frequency-inverse document frequency
</strong>
, is a
<strong>
statistical measure
</strong>
used to evaluate the importance of a word to a document in a collection or
<a href="https://en.wikipedia.org/wiki/Text_corpus">
corpus
</a>
.
</p>
<p id="be4a">
The TF-IDF scoring value increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word.
</p>
<p id="f64d">
Let’s see the formula used to calculate a TF-IDF score for a given term
<strong>
x
</strong>
within a document
<strong>
y
</strong>
.
</p>
<img src="https://miro.medium.com/max/1400/1*V9ac4hLVyms79jl65Ym_Bw.png"/>
<p id="6ba7">
Now, let’s split this formula a little bit and see how the different parts of the formula work.
</p>
<ul>
<li id="9a39">
<strong>
Term Frequency (TF)
</strong>
: a scoring of the frequency of the word in the current document.
</li>
</ul>
<br/>
<img src="https://miro.medium.com/max/926/1*V3qfsHl0t-bV5kA0mlnsjQ.png"/>
<br/>
<ul>
<li id="9c81">
<strong>
Inverse Document Frequency (IDF)
</strong>
: a scoring of how rare the word is across documents.
</li>
</ul>
<br/>
<img src="https://miro.medium.com/max/890/1*wvPGL02y36QL7-tdG1BT1A.png"/>
<br/>
<ul>
<li id="2d0d">
Finally, we can use the previous formulas to calculate the
<strong>
TF-IDF score
</strong>
for a given term like this:
</li>
</ul>
<br/>
<img src="https://miro.medium.com/max/588/1*D2UA6xj9KqcH6amzVj5Y5g.png"/>
<br/>
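<p>
The pieces above can be combined in a short, hand-written sketch. I am assuming the common textbook form here: term frequency as the share of a document’s tokens, and inverse document frequency as the natural logarithm of the number of documents divided by the number of documents containing the term. The <code>tokenize</code>, <code>tf</code>, <code>idf</code> and <code>tf_idf</code> helpers are hypothetical names for this example. Library implementations often use variants of the formula; scikit-learn’s TfidfVectorizer, for instance, applies smoothing and normalization by default, so its numbers will not match this sketch exactly.
</p>

```python
import math
import re

documents = [
    "I like this movie, it's funny.",
    "I hate this movie.",
    "This was awesome! I like it.",
    "Nice one. I love it.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_documents = len(tokenized)

def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Inverse document frequency; `term` must occur in at least one document."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log(n_documents / df)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

first = tokenized[0]
for term in first:
    print(f"{term:>6}: {tf_idf(term, first):.3f}")
```

<p>
In the first sentence, “funny” appears in no other document and gets the highest score, while “this” and “it” appear in three of the four documents and are pushed towards zero.
</p>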
<p id="86f5">
<strong>
Example
<br/>
</strong>
In Python, we can use the
<strong>
TfidfVectorizer
</strong>
class from the sklearn library to calculate the TF-IDF scores for given documents. Let’s use the same sentences that we have used with the bag-of-words example.
</p>



In [13]:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf_vectorizer = TfidfVectorizer()
values = tfidf_vectorizer.fit_transform(documents)

# Show the Model as a pandas DataFrame
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf_vectorizer.get_feature_names_out()
pd.DataFrame(values.toarray(), columns = feature_names)


<p id="3896">
Again, I’ll add the sentences here for an easy comparison and better understanding of how this approach is working.
</p>
<img src="https://miro.medium.com/max/724/1*LtMJ1qSiIuEzZDqB-RQbjw.png"/>
<h1 id="8a9b">
Summary
</h1>
<p id="511f">
In this blog post, you learned the basics of NLP for text. More specifically, you learned the following concepts, along with some additional details:
</p>
<ul>
<li id="e20b">
<strong>
NLP
</strong>
is used to apply
<strong>
machine learning algorithms
</strong>
to
<strong>
text
</strong>
and
<strong>
speech
</strong>
.
</li>
<li id="71a4">
NLTK (
<strong>
Natural Language Toolkit
</strong>
) is a
<strong>
leading platform
</strong>
for building Python programs to work with
<strong>
human language data
</strong>
</li>
<li id="973b">
<strong>
Sentence tokenization
</strong>
is the problem of
<strong>
dividing a string
</strong>
of written language
<strong>
into
</strong>
its component
<strong>
sentences
</strong>
</li>
<li id="b29c">
<strong>
Word tokenization
</strong>
is the problem of
<strong>
dividing a string
</strong>
of written language
<strong>
into
</strong>
its component
<strong>
words
</strong>
</li>
<li id="fcd0">
The goal of both
<strong>
stemming
</strong>
and
<strong>
lemmatization
</strong>
is to
<strong>
reduce
</strong>
<a href="https://en.wikipedia.org/wiki/Inflection">
<strong>
inflectional
</strong>
</a>
<strong>
forms
</strong>
and sometimes derivationally related forms
of a
<strong>
word to
</strong>
a
<strong>
common base form
</strong>
.
</li>
<li id="ed1b">
<strong>
Stop words
</strong>
are words which are filtered out before or after processing of text. They
<strong>
usually
</strong>
refer to the
<strong>
most common words
</strong>
in a language.
</li>
<li id="8a86">
A
<strong>
regular expression is
</strong>
a sequence of characters that defines a
<strong>
search pattern
</strong>
.
</li>
<li id="c00c">
The
<strong>
bag-of-words
</strong>
model is a
<strong>
popular
</strong>
and
<strong>
simple
</strong>
<strong>
feature extraction technique
</strong>
used when we work with text. It describes the occurrence of each word within a document.
</li>
<li id="cb8d">
<strong>
TF-IDF
</strong>
is a
<strong>
statistical measure
</strong>
used to
<strong>
evaluate the importance
</strong>
<strong>
of
</strong>
a
<strong>
word
</strong>
to a document in a collection or
<a href="https://en.wikipedia.org/wiki/Text_corpus">
corpus
</a>
.
</li>
</ul>
<p id="a9f9">
Awesome! Now we know the basics of how to extract features from a text. Then, we can use these features as an input for machine learning algorithms.
</p>
<p id="d2cc">
Do you want to see
<strong>
all the concepts
</strong>
used in
<strong>
one more big example
</strong>
?
<br/>
-
<a href="https://github.com/Ventsislav-Yordanov/Blog-Examples/blob/master/Intro%20to%20NLP%20-%20Cleaning%20Review%20Texts%20Example/Cleaning%20Review%20Texts%20Example.ipynb">
Here you are
</a>
! If you’re reading from mobile, please scroll down to the end and click the “
<em>
Desktop version
</em>
” link.
</p>
<h1 id="df57">
Resources
</h1>
<a target="_blank" href="https://en.wikipedia.org/wiki/Natural_language_processing">https://en.wikipedia.org/wiki/Natural_language_processing</a>
<br/>
<a target="_blank" href="http://www.nltk.org/">http://www.nltk.org/</a>
<br/>
<a target="_blank" href="https://en.wikipedia.org/wiki/Text_segmentation">https://en.wikipedia.org/wiki/Text_segmentation</a>
<br/>
<a target="_blank" href="https://en.wikipedia.org/wiki/Lemmatisation">https://en.wikipedia.org/wiki/Lemmatisation</a>
<br/>
<a target="_blank" href="https://en.wikipedia.org/wiki/Stemming">https://en.wikipedia.org/wiki/Stemming</a>
<br/>
<a target="_blank" href="https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html">https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html</a>
<br/>
<a target="_blank" href="https://en.wikipedia.org/wiki/Stop_words">https://en.wikipedia.org/wiki/Stop_words</a>
<br/>
<a target="_blank" href="https://en.wikipedia.org/wiki/Regular_expression">https://en.wikipedia.org/wiki/Regular_expression</a>
<br/>
<a target="_blank" href="https://docs.python.org/3/library/re.html?highlight=regex">https://docs.python.org/3/library/re.html?highlight=regex</a>
<br/>
<a target="_blank" href="https://machinelearningmastery.com/gentle-introduction-bag-words-model/">https://machinelearningmastery.com/gentle-introduction-bag-words-model/</a>
<br/>
<a target="_blank" href="https://chrisalbon.com/machine_learning/preprocessing_text/bag_of_words/">https://chrisalbon.com/machine_learning/preprocessing_text/bag_of_words/</a>
<br/>
<a target="_blank" href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">https://en.wikipedia.org/wiki/Tf%E2%80%93idf</a>

