### - np of zeros

In [2]:
import numpy as np

In [3]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""

#### Simple Tokenize using split function 

In [4]:
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

In [5]:
str.split(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

###### Main problem here that it dependa only on the white space with out considering the puncituaton.

#### For now, let’s forge ahead with your imperfect tokenizer. You’ll deal with punctuation and other challenges later. With a bit more Python, you can create a numerical vector representation for each word. These vectors are called one-hot vectors.

In [6]:
token_sequence = str.split(sentence)

In [7]:
vocab = sorted(set(token_sequence))

In [8]:
vocab

['26.',
 'Jefferson',
 'Monticello',
 'Thomas',
 'age',
 'at',
 'began',
 'building',
 'of',
 'the']

In [9]:
', '.join(vocab)

'26., Jefferson, Monticello, Thomas, age, at, began, building, of, the'

In [10]:
num_tokens = len(token_sequence)

In [11]:
num_tokens

10

In [12]:
vocab_size = len(vocab)

In [13]:
vocab_size

10

###### Creating empty table as wide as the count of unique vocabulary terms and as high as the length of the document, 10 rows by 10 columns.  

In [14]:
onehot_vectors = np.zeros((num_tokens, vocab_size), int)

In [15]:
for i, word in enumerate(token_sequence): 
    onehot_vectors[i, vocab.index(word)] = 1

In [16]:
' '.join(vocab)

'26. Jefferson Monticello Thomas age at began building of the'

In [17]:
onehot_vectors

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

If you have trouble quickly reading all those ones and zeros, you’re not alone. Pandas DataFrames can help make this a little easier on the eyes and more informative. A DataFrame keeps track of labels for each column, allowing you to label each column
in our table with the token or word it represents. A DataFrame can also keep track of labels for each row in the DataFrame.

In [18]:
import pandas as pd

In [19]:
pd.DataFrame(onehot_vectors, columns=vocab)

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,0,0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0
6,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,0
9,1,0,0,0,0,0,0,0,0,0


One-hot vectors are super-sparse, containing only one nonzero value in each row vector. So we can make that table of one-hot row vectors even prettier by replacing  zeros with blanks. Don’t do this with any DataFrame you intend to use in your machine learning pipeline, because it’ll create a lot of non-numerical objects within your numpy array, mucking up the math. But if you just want to see how this one-hot vector sequence is like a mechanical music box cylinder, or a player piano drum, the following listing can be a handy view of your data.

In [21]:
df = pd.DataFrame(onehot_vectors, columns=vocab)

In [22]:
df[df == 0] = ''

In [23]:
df

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,,,,1.0,,,,,,
1,,1.0,,,,,,,,
2,,,,,,,1.0,,,
3,,,,,,,,1.0,,
4,,,1.0,,,,,,,
5,,,,,,1.0,,,,
6,,,,,,,,,,1.0
7,,,,,1.0,,,,,
8,,,,,,,,,1.0,
9,1.0,,,,,,,,,


 ##### Creating BOW 

In [24]:
sentence_bow = {}

In [25]:
for token in sentence.split():
     sentence_bow[token] = 1

In [26]:
sorted(sentence_bow.items())

[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]

you might also notice that using a dict (or any paired mapping of words to their 0/1 values) to store a binary vector 
shouldn’t waste much space. Using a dictionary to represent your vector ensures that it only has to store a 1 when any
one of the thousands, or even millions, of possible words in your dictionary appear in a particular document.

In [28]:
df2 = pd.DataFrame(pd.Series(dict([(token, 1) for token in
    sentence.split()])), columns=['sent']).T

In [29]:
df2

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.
sent,1,1,1,1,1,1,1,1,1,1


Let’s add a few more texts to your corpus to see how a DataFrame stacks up.

In [40]:
sentences = """Thomas Jefferson began building Monticello at the age of 26.\n"""

In [41]:
sentences +="""Construction was done mostly by local masons and carpenters.\n"""

In [42]:
sentences += "He moved into the South Pavilion in 1770.\n"

In [43]:
sentences += """Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."""

In [44]:
corpus = {}

In [45]:
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

In [46]:
df3 = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

In [50]:
df3[df3.columns[:20]] # This shows only the first 20 tokens(DataFrame columns), to avoid wrapping.

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.,Construction,was,done,mostly,by,local,masons,and,carpenters.,He
sent0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0
sent1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0
sent2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
sent3,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


Example dot product calculation
dot product is simply the sum of the multiplication of the x1*2,y1*2,z1*2>>>> A = (x1, y1, z1 ) , B = (x2, y2, z2)

In [51]:
v1 = pd.np.array([1, 2, 3])

  v1 = pd.np.array([1, 2, 3])


In [52]:
v2 = np.array([2, 3, 4])

In [53]:
v1.dot(v2)

20

In [54]:
(v1 * v2).sum()

20

In [55]:
sum([x1 * x2 for x1, x2 in zip(v1, v2)]) #You shouldn’t iterate through vectors this way unless you want to slow down your pipeline.

20

#### Measuring bag-of-words overlap

In [56]:
df4 = df3.T

In [57]:
df4

Unnamed: 0,sent0,sent1,sent2,sent3
Thomas,1,0,0,0
Jefferson,1,0,0,0
began,1,0,0,0
building,1,0,0,0
Monticello,1,0,0,1
at,1,0,0,0
the,1,0,1,0
age,1,0,0,0
of,1,0,0,0
26.,1,0,0,0


In [58]:
df4.sent0.dot(df4.sent1)

0

In [59]:
df4.sent0.dot(df4.sent2)

1

In [61]:
df4.sent0.dot(df4.sent3)

1

Here’s one way to find the word that is shared by sent0 and sent3, the word that gave you that last dot product of 1:

In [63]:
[(k, v) for (k, v) in (df4.sent0 & df4.sent3).items() if v]

[('Monticello', 1)]

This is our first vector space model (VSM) of natural language documents (sentences). Not only are dot products possible, but other vector operations are defined for these bag-of-word vectors: addition, subtraction, OR, AND, and so on.

#### Tokenize the Monticello sentence with a regular expression

In [64]:
import re

In [65]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""

In [66]:
tokens = re.split(r'[-\s.,;!?]+', sentence)

In [67]:
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '']

This splits the sentence on whitespace or punctuation that occurs at least once (note the '+' after the closing square bracket in the regular expression).

In [68]:
pattern = re.compile(r"([-\s.,;!?])+")


In [73]:
tokens_b = pattern.split(sentence)

In [70]:
tokens_b

['Thomas',
 ' ',
 'Jefferson',
 ' ',
 'began',
 ' ',
 'building',
 ' ',
 'Monticello',
 ' ',
 'at',
 ' ',
 'the',
 ' ',
 'age',
 ' ',
 'of',
 ' ',
 '26',
 '.',
 '']

In [71]:
[x for x in tokens_b if x and x not in '- \t\n.,;!?'] # to remove white space and puncituation

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

If you want practice withlambda and filter(), use
list(filter(lambda x: x if x and x not in '- \t\n.,;!?' else None, tokens)).

In [74]:
list(filter(lambda x: x if x and x not in '- \t\n.,;!?' else None, tokens_b))

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

we can use the NLTK function RegexpTokenizer to replicate your simple tokenizer example like this:

In [75]:
import nltk

In [76]:
from nltk.tokenize import RegexpTokenizer

In [77]:
tokenizer = RegexpTokenizer(r'\w+|$[0-9.]+|\S+')

In [78]:
tokenizer.tokenize(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

An even better tokenizer is the Treebank Word Tokenizer from the NLTK package. It incorporates a variety of common rules for English word tokenization. For example, it separates phrase-terminating punctuation (?!.;,) from adjacent tokens and retains
decimal numbers containing a period as a single token. In addition it contains rules for English contractions. For example “don’t” is tokenized as ["do", "n’t"]. This tokenization will help with subsequent steps in the NLP pipeline, such as stemming. You can find all the rules for the Treebank Tokenizer at http://www.nltk.org/api/nltk.tokenize.html#modulenltk.tokenize.treebank.

In [79]:
from nltk.tokenize import TreebankWordTokenizer

In [81]:
sentence_b = """Monticello wasn't designated as UNESCO World Heritage Site until 1987."""

In [82]:
tokenizer = TreebankWordTokenizer()

In [83]:
tokenizer.tokenize(sentence_b)

['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '.']

#### Tokenize informal text from social networks such as Twitter and Facebook

The NLTK library includes a tokenizer—casual_tokenize—that was built to deal with short, informal, emoticon-laced texts from social networks where grammar and spelling conventions vary widely.
The casual_tokenize function allows us to strip usernames and reduce the number of repeated characters within a token:

In [84]:
from nltk.tokenize.casual import casual_tokenize

In [86]:
message = """RT @TJMonticello Best day everrrrrrr at Monticello. Awesommmmmmeeeeeeee day :*)"""

In [87]:
casual_tokenize(message)

['RT',
 '@TJMonticello',
 'Best',
 'day',
 'everrrrrrr',
 'at',
 'Monticello',
 '.',
 'Awesommmmmmeeeeeeee',
 'day',
 ':*)']

In [88]:
casual_tokenize(message, reduce_len=True, strip_handles=True)

['RT',
 'Best',
 'day',
 'everrr',
 'at',
 'Monticello',
 '.',
 'Awesommmeee',
 'day',
 ':*)']

### Extending your vocabulary with n-grams

An n-gram is a sequence containing up to n elements that have been extracted from a sequence of those elements, usually a string. In general the “elements” of an n-gram can be characters, syllables, words, or even symbols like “A,” “T,” “G,” and “C” used to represent a DNA sequence. we’re only interested in n-grams of words, not characters.7 So in this book, when we say 2-gram, we mean a pair of words, like “ice cream.” When we say 3-gram, we mean a triplet of words like “beyond the pale” or “Johann Sebastian Bach”

n-grams don’t have to mean something special together, like compound words. They merely have to be frequent enough together to catch the attention of your token counters.

In the next chapter, we show you how to recognize which of these n-grams contain the most information relative to the others, which you can use to reduce the number of tokens (n-grams) your NLP pipeline has to keep track of. Otherwise it would have to store and maintain a list of every single word sequence it came across. This prioritization of n-grams will help it recognize “Thomas Jefferson” and “ice cream,” without paying particular attention to “Thomas Smith” or “ice shattered.”

##### Let’s use your original sentence about Thomas Jefferson to show what a 2-gram tokenizer should output, so you know what you’re trying to build:

In [89]:
tokenize_2grams("Thomas Jefferson began building Monticello at the age of 26.") # p49

NameError: name 'tokenize_2grams' is not defined

n-grams are one of the ways to maintain context information as data passes through your pipeline.

###### this is the n-gram tokenizer from nltk

In [90]:
from nltk.util import ngrams

In [91]:
list(ngrams(tokens, 2))

[('Thomas', 'Jefferson'),
 ('Jefferson', 'began'),
 ('began', 'building'),
 ('building', 'Monticello'),
 ('Monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', '26'),
 ('26', '')]

In [92]:
list(ngrams(tokens, 3))

[('Thomas', 'Jefferson', 'began'),
 ('Jefferson', 'began', 'building'),
 ('began', 'building', 'Monticello'),
 ('building', 'Monticello', 'at'),
 ('Monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', '26'),
 ('of', '26', '')]

The n-grams are provided in the previous listing as tuples, but they can easily be joined together if you’d like all the tokens in your pipeline to be strings.

In [93]:
two_grams = list(ngrams(tokens, 2))

In [94]:
[" ".join(x) for x in two_grams]

['Thomas Jefferson',
 'Jefferson began',
 'began building',
 'building Monticello',
 'Monticello at',
 'at the',
 'the age',
 'age of',
 'of 26',
 '26 ']

n-grams are usually filtered out if they occur too often. For example, if a token or n-gram occurs in more than 25% of all the documents in your corpus, you usually ignore it. This is equivalent to the “stop words” filter in the coin-sorting machine of chapter 1.

##### STOP WORDS

Stop words are common words in any language that occur with a high frequency but carry much less substantive information about the meaning of a phrase. 
A more comprehensive list of stop words for various languages can be found in NLTK’s corpora (https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip).

If we do decide to arbitrarily filter out a set of stop words during tokenization, a Python list comprehension is sufficient. Here we take a few stop words and ignore them when we iterate through your token list:

In [95]:
stop_words = ['a', 'an', 'the', 'on', 'of', 'off', 'this', 'is']

In [96]:
tokens = ['the', 'house', 'is', 'on', 'fire']

In [97]:
tokens_without_stopwords = [x for x in tokens if x not in stop_words]

In [98]:
print(tokens_without_stopwords)

['house', 'fire']


In [99]:
tokens_without_stopwords

['house', 'fire']

#### NLTK list of stop words

In [101]:
import nltk


In [102]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\EMZ\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [118]:
stop_words = nltk.corpus.stopwords.words('english')

In [104]:
len(stop_words)

179

In [105]:
stop_words[:7]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours']

In [107]:
[sw for sw in stop_words if len(sw) == 1]

['i', 'a', 's', 't', 'd', 'm', 'o', 'y']

These one-letter stop words are even more curious, but they make sense if you’ve used the NLTK tokenizer and Porter stemmer a lot. These single-letter tokens pop up a lot when contractions are split and stemmed using NLTK tokenizers and stemmers.

In [109]:
stop_words2 = nltk.corpus.stopwords.words('arabic')

In [110]:
len(stop_words2)

754

In [111]:
stop_words2[:7]

['إذ', 'إذا', 'إذما', 'إذن', 'أف', 'أقل', 'أكثر']

In [115]:
[sw for sw in stop_words2 if len(sw) == 5]

['الذين',
 'إليكم',
 'إليكن',
 'أنتما',
 'أولاء',
 'أولئك',
 'أينما',
 'بماذا',
 'تلكما',
 'حيثما',
 'ذلكما',
 'ذواتا',
 'ذواتي',
 'كأنما',
 'كيفما',
 'لستما',
 'لكنما',
 'لكيلا',
 'ليستا',
 'ليسوا',
 'هاتان',
 'هاتين',
 'هاهنا',
 'هنالك',
 'هؤلاء',
 'هيهات',
 'والذي',
 'يناير',
 'أبريل',
 'يونيو',
 'يوليو',
 'أغسطس',
 'جانفي',
 'فيفري',
 'أفريل',
 'كانون',
 'نيسان',
 'أيلول',
 'تشرين',
 'دولار',
 'دينار',
 'سنتيم',
 'اثنان',
 'ثلاثة',
 'أربعة',
 'ثماني',
 'اثنين',
 'إياها',
 'إياهم',
 'إياهن',
 'إياكم',
 'إياكن',
 'إيانا',
 'تانِك',
 'هَذِه',
 'هَذِي',
 'الألى',
 'كأيّن',
 'آمينَ',
 'أُفٍّ',
 'أُفٍّ',
 'أمامك',
 'أوّهْ',
 'إليكَ',
 'رويدك',
 'سرعان',
 'شتانَ',
 'واهاً',
 'خبَّر',
 'نبَّا',
 'طالما',
 'لكنَّ',
 'رُبَّ',
 'كلَّا',
 'لعلَّ',
 'لكنَّ',
 'لكنَّ',
 'تلقاء',
 'سبحان',
 'مئتان',
 'ستمئة',
 'عشرون',
 'خمسون',
 'سبعون',
 'تسعون',
 'عشرين',
 'خمسين',
 'سبعين',
 'تسعين',
 'خلافا',
 'صراحة',
 'عيانا',
 'غالبا',
 'فرادى',
 'قاطبة',
 'كثيرا',
 'أيّان',
 'كلّما',
 'ارتدّ',
 'انقلب',


Depending on how much natural language information we want to discard, we can take the union or the intersection of multiple stop word lists for your pipeline. Here’s a comparison of sklearn stop words (version 0.19.2) and nltk stop words (version 3.2.5).

In [116]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words

In [117]:
len(sklearn_stop_words)

318

In [119]:
len(stop_words)

179

In [120]:
len(stop_words.union(sklearn_stop_words))

AttributeError: 'list' object has no attribute 'union'

In [121]:
len(stop_words.intersection(sklearn_stop_words))

AttributeError: 'list' object has no attribute 'intersection'

In [122]:
# look for another way to that

### Normalizing the vocabulary

Case folding is when you consolidate multiple “spellings” of a word that differ only in their capitalization.
Normalizing word and character capitalization is one way to reduce your vocabulary size and generalize your NLP pipeline. It helps you consolidate words that are intended to mean the same thing (and be spelled the same way) under a single token.

Often capitalization is used to indicate that a word is a proper noun, the name of a person, place, or thing. You’ll want to be able to recognize proper nouns as distinct from other words, if named entity recognition is important to your pipeline. However, if tokens aren’t case normalized, your vocabulary will be approximately twice as large, consume twice as
much memory and processing time, and might increase the amount of training data you need to label for your machine learning pipeline to converge to an accurate, general solution.

Just as in any other machine learning pipeline, your labeled dataset used for training must be “representative” of the space of all possible feature vectors your model must deal with, including variations in capitalization. For 100,000-D bag-ofwords
vectors, you usually must have 100,000 labeled examples, and sometimes even more than that, to train a supervised machine learning pipeline without overfitting. In some situations, cutting your vocabulary size by half can be worth the loss of information content.

##### In Python, you can easily normalize the capitalization of your tokens with a list comprehension:

In [123]:
tokens = ['House', 'Visitor', 'Center']

In [124]:
normalized_tokens = [x.lower() for x in tokens]

In [125]:
print(normalized_tokens)

['house', 'visitor', 'center']


And if you’re certain that you want to normalize the case for an entire document, you can lower() the text string in one operation, before tokenization. But this will prevent advanced tokenizers that can split camel case words like “WordPerfect,” “FedEx,” or “stringVariableName.”10 Maybe you want WordPerfect to be it’s own unique thing (token), or maybe you want to reminisce about a more perfect word processing era. It’s up to you to decide when and how to apply case folding.

With case normalization, you are attempting to return these tokens to their “normal” state before grammar rules and their position in a sentence affected their capitalization. The simplest and most common way to normalize the case of a text string is to lowercase all the characters with a function like Python’s built-in str.lower().

but
Lowercasing on the first word in a sentence preserves the meaning of proper nouns in the middle of a sentence, like “Joe” and “Smith” in “Joe Smith.” And it properly groups words together that belong together, because they’re only capitalized when they are at the beginning of a sentence, since they aren’t proper nouns. This prevents “Joe” from being confused with “coffee” (“joe”)12 during tokenization.

this careful approach to case normalization, where you lowercase words only at the start of a sentence, you will still introduce capitalization errors for the rare proper nouns that start a sentence. “Joe Smith, the word smith, with a cup of joe.” will produce a different set of tokens than “Smith the word with a cup of joe, Joe Smith.” And you may not want that. In addition, case normalization is useless for languages that don’t have a concept of capitalization. To avoid this potential loss of information, many NLP pipelines don’t normalize for case at all.

The best way to find out what works is to try several different approaches, and see which approach gives you the best performance for the objectives of your NLP project. By generalizing your model to work with text that has odd capitalization, case normalization can reduce overfitting for your machine learning pipeline. Case normalization is particularly useful for a search engine.

##### STEMMING

Another common vocabulary normalization technique is to eliminate the small meaning differences of pluralization or possessive endings of words, or even various verb forms.

Stemming removes suffixes from words in an attempt to combine words with similar meanings together under their common stem. A stem isn’t required to be a properly spelled word, but merely a token, or label, representing several possible spellings of a word.

In machine learning this is referred to as dimension reduction. It helps generalize your language model, enabling the model to behave identically for all the words included in a stem. So, as long as the application doesn’t require the machine to distinguish between “house” and “houses,” this stem will reduce the programming or dataset size by half or even more, depending on the aggressiveness of the stemmer of choice.

Stemming is important for keyword search or information retrieval. It allows we to search for “developing houses in Portland” and get web pages or documents that use both the word “house” and “houses” and even the word “housing,” because these words
are all stemmed to the “hous” token. Likewise we might receive pages with the words “developer” and “development” rather than “developing,” because all these words typically reduce to the stem “develop.”

##### Here’s a simple stemmer implementation in pure Python that can handle trailing S’s:

In [126]:
def stem(phrase):
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'") for word in phrase.lower().split()])

In [127]:
stem('houses')

'house'

In [128]:
stem("Doctor House's calls")

'doctor house call'

Two of the most popular stemming algorithms are the Porter and Snowball stemmers.

In [129]:
from nltk.stem.porter import PorterStemmer

In [130]:
stemmer = PorterStemmer()

In [131]:
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])

'dish washer wash dish'

Notice that the Porter stemmer, like the regular expression stemmer, retains the trailing apostrophe (unless you explicitly strip it), which ensures that possessive words will be distinguishable from nonpossessive words.

 # side note 
    More on the Porter stemmer
Julia Menchavez has graciously shared her translation of Porter’s original stemmer algorithm into pure Python (https://github.com/jedijulia/porter-stemmer/blob/master/stemmer.py). If you are ever tempted to develop your own stemmer, consider these 300 lines of code and the lifetime of refinement that Porter put into them. There are eight steps to the Porter stemmer algorithm: 1a, 1b, 1c, 2, 3, 4, 5a, and 5b. Step 1a is a bit like your regular expression for dealing with trailing S’s:
a) This is a trivially abbreviated version of Julia Menchavez’s implementation of porter-stemmer on GitHub
(https://github.com/jedijulia/porter-stemmer/blob/master/stemmer.py).

#### LEMMATIZATION

This more extensive normalization down to the semantic root of a word—its lemma—is called lemmatization. Any NLP pipeline that
wants to “react” the same for multiple different spellings of the same basic root word can benefit from a lemmatizer. It reduces the number of words you have to respond to, the dimensionality of your language model. Using it can make your model more general, but it can also make your model less precise, because it will treat all spelling variations of a given root word the same.

while working through this section, think about words where lemmatization would drastically alter the meaning of a word, perhaps even inverting its meaning and producing the opposite of the intended response from your pipeline. This scenario is called spoofing—when someone intentionally tries to elicit the wrong response from a machine learning pipeline by cleverly constructing a difficult input.

Lemmatization is a potentially more accurate way to normalize a word than stemming or case normalization because it takes into account a word’s meaning. A lemmatizer uses a knowledge base of word synonyms and word endings to ensure that only words that mean similar things are consolidated into a single token.

Some lemmatizers use the word’s part of speech (POS) tag in addition to its spelling to help improve accuracy. The POS tag for a word indicates its role in the grammar of a phrase or sentence. For example, the noun POS is for words that refer to “people, places, or things” within a phrase. An adjective POS is for a word that modifies or describes a noun. A verb refers to an action. The POS of a word in isolation cannot be determined. The context of a word must be known for its POS to be identified. So some advanced lemmatizers can’t be run-on words in isolation.

Consider the word better. Stemmers would strip the “er” ending from “better” and return the stem “bett” or “bet.” However, this would lump the word “better” with words like “betting,” “bets,” and “Bet’s,” rather than more similar words like “betterment,” “best,” or even “good” and “goods.”
So lemmatizers are better than stemmers for most applications. Stemmers are only really used in large-scale information retrieval applications (keyword search). And if you really want the dimension reduction and recall improvement of a stemmer in your information retrieval pipeline, you should probably also use a lemmatizer right before the stemmer. Because the lemma of a word is a valid English word, stemmers work well on the output of a lemmatizer. This trick will reduce your dimensionality and increase your information retrieval recall even more than a stemmer alone.

### The NLTK package provides functions for this.

In [132]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\EMZ\AppData\Roaming\nltk_data...


True

In [138]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\EMZ\AppData\Roaming\nltk_data...


True

In [133]:
from nltk.stem import WordNetLemmatizer

In [139]:
lemmatizer = WordNetLemmatizer()

In [140]:
lemmatizer.lemmatize("better")  # The default part of speech is “n” for noun. the NLTK lemmatizer assumes it’s a noun.

'better'

In [141]:
lemmatizer.lemmatize("better", pos="a") # “a” indicates the adjective part of speech.

'good'

In [142]:
lemmatizer.lemmatize("good", pos="a")

'good'

In [143]:
lemmatizer.lemmatize("goods", pos="a")

'goods'

In [144]:
lemmatizer.lemmatize("goods", pos="n")

'good'

In [145]:
lemmatizer.lemmatize("goodness", pos="n")

'goodness'

In [146]:
lemmatizer.lemmatize("best", pos="a")

'best'

Unfortunately, the NLTK lemmatizer is restricted to the connections within the Princeton WordNet graph of word meanings. So the word “best” doesn’t lemmatize to the same root as “better.” This graph is also missing the connection between “goodness” and “good.”

###### Porter stemmer, on the other hand, would make this connection by blindly stripping off the “ness” ending of all words:

In [147]:
stemmer.stem('goodness')

'good'


#### USE CASES

Stemmers are generally faster to compute and require less-complex code and datasets. But stemmers will make more errors and stem a far greater number of words, reducing the information content or meaning of your text much more than a lemmatizer would. Both stemmers and lemmatizers will reduce your vocabulary size and increase the ambiguity of the text. But lemmatizers do a better job retaining as much of the information content as possible based on how the word was used within the text and its intended meaning. Therefore, some NLP packages, such as spaCy, don’t provide stemming functions and only offer lemmatization methods.

IMPORTANT Bottom line, try to avoid stemming and lemmatization unless you have a limited amount of text that contains usages and capitalizations of the words you are interested in. And with the explosion of NLP datasets, this is rarely the case for English documents, unless your documents use a lot of jargon or are from a very small subfield of science, technology, or literature.
Nonetheless, for languages other than English, you may still find uses for lemmatization. The Stanford information retrieval course dismisses stemming and lemmatization entirely, due to the negligible recall accuracy improvement and the significant reduction in precision.

#### Sentiment

This sentiment analysis—measuring the sentiment of phrases or chunks of text—is a common application of NLP. In many companies it’s the main thing an NLP engineer is asked to do.
An NLP pipeline can process a large quantity of user feedback quickly and objectively, with less chance for bias. And an NLP pipeline can output a numerical rating of the positivity or negativity or any other emotional quality of the text.

Say you just want to measure the positivity or favorability of a text—how much someone likes a product or service that they
are writing about. Say you want your NLP pipeline and sentiment analysis algorithm to output a single floating point number between -1 and +1. Your algorithm would output +1 for text with positive sentiment like, “Absolutely perfect! Love it! :-) :-) :-).” And your algorithm should output -1 for text with negative sentiment like, “Horrible! Completely useless. :(.” Your NLP pipeline could use values near 0, like say +0.1, for a statement like, “It was OK. Some good and some bad things.”
###### There are two approaches to sentiment analysis:
 A rule-based algorithm composed by a human

 A machine learning model learned from data by a machine

For the machine learning approach, you need a lot of data, text labeled with the “right” sentiment
score. Twitter feeds are often used for this approach because the hash tags, such as #awesome or #happy or #sarcasm, can often be used to create a “self-labeled” dataset.
Your company may have product reviews with five-star ratings that you could associate with reviewer comments. You can use the star ratings as a numerical score for the positivity of each text.

##### VADER—A rule-based sentiment analyzer

Hutto and Gilbert at GA Tech came up with one of the first successful rule-based sentiment analysis algorithms. Many NLP packages implement some form of this algorithm. The NLTK package has an implementation of the VADER algorithm in
nltk.sentiment.vader. Hutto himself maintains the Python package vaderSentiment.
###### To go straight to the source and use vaderSentiment here.
pip install vaderSentiment to run the following example.

In [1]:
# !pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [2]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [3]:
sa = SentimentIntensityAnalyzer()

In [4]:
sa.lexicon #SentimentIntensityAnalyzer.lexicon contains that dictionary of tokens and their scores.

{'$:': -1.5,
 '%)': -0.4,
 '%-)': -1.5,
 '&-:': -0.4,
 '&:': -0.7,
 "( '}{' )": 1.6,
 '(%': -0.9,
 "('-:": 2.2,
 "(':": 2.3,
 '((-:': 2.1,
 '(*': 1.1,
 '(-%': -0.7,
 '(-*': 1.3,
 '(-:': 1.6,
 '(-:0': 2.8,
 '(-:<': -0.4,
 '(-:o': 1.5,
 '(-:O': 1.5,
 '(-:{': -0.1,
 '(-:|>*': 1.9,
 '(-;': 1.3,
 '(-;|': 2.1,
 '(8': 2.6,
 '(:': 2.2,
 '(:0': 2.4,
 '(:<': -0.2,
 '(:o': 2.5,
 '(:O': 2.5,
 '(;': 1.1,
 '(;<': 0.3,
 '(=': 2.2,
 '(?:': 2.1,
 '(^:': 1.5,
 '(^;': 1.5,
 '(^;0': 2.0,
 '(^;o': 1.9,
 '(o:': 1.6,
 ")':": -2.0,
 ")-':": -2.1,
 ')-:': -2.1,
 ')-:<': -2.2,
 ')-:{': -2.1,
 '):': -1.8,
 '):<': -1.9,
 '):{': -2.3,
 ');<': -2.6,
 '*)': 0.6,
 '*-)': 0.3,
 '*-:': 2.1,
 '*-;': 2.4,
 '*:': 1.9,
 '*<|:-)': 1.6,
 '*\\0/*': 2.3,
 '*^:': 1.6,
 ',-:': 1.2,
 "---'-;-{@": 2.3,
 '--<--<@': 2.2,
 '.-:': -1.2,
 '..###-:': -1.7,
 '..###:': -1.9,
 '/-:': -1.3,
 '/:': -1.3,
 '/:<': -1.4,
 '/=': -0.9,
 '/^:': -1.0,
 '/o:': -1.4,
 '0-8': 0.1,
 '0-|': -1.2,
 '0:)': 1.9,
 '0:-)': 1.4,
 '0:-3': 1.5,
 '0:03': 1.9,
 '

In [5]:
[(tok, score) for tok, score in sa.lexicon.items() if " " in tok]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]

Out of 7500 tokens defined in VADER, only 3 contain spaces, and only 2 of those are actually n-grams; the other is an emoticon for “kiss.”

In [6]:
sa.polarity_scores(text="Python is very readable and it's great for NLP.")

{'neg': 0.0, 'neu': 0.661, 'pos': 0.339, 'compound': 0.6249}

In [7]:
sa.polarity_scores(text= "Python is not a bad choice for most applications.") #The VADER algorithm considers the intensity of sentiment polarity in three separate scores (positive, negative, and neutral) and then combines them together into a compound positivity sentiment.

{'neg': 0.0, 'neu': 0.737, 'pos': 0.263, 'compound': 0.431}

Let’s see how well this rule-based approach does for the example statements we mentioned earlier:

In [10]:
corpus = ["Absolutely perfect! Love it! :-) :-) :-)", 
          "Horrible! Completely useless. :(", 
          "It was OK. Some good and some bad things."]

In [11]:
for doc in corpus:
    scores = sa.polarity_scores(doc)
print('{:+}: {}'.format(scores['compound'], doc))

-0.1531: It was OK. Some good and some bad things.


Note books result
+0.9428: Absolutely perfect! Love it! :-) :-) :-)
-0.8768: Horrible! Completely useless. :(
+0.3254: It was OK. Some good and some bad things.

So the only drawback is that VADER doesn’t look at all the words in a document, only about 7,500. what if you don’t want to have to code your own understanding of the words in a dictionary of thousands of words or add a bunch of custom words to the dictionary in SentimentIntensityAnalyzer.lexicon? The rule-based approach might be impossible if you don’t understand the language, because you wouldn’t know what scores to put in the dictionary (lexicon)!
###### That’s what machine learning sentiment analyzers are for.


In [12]:
#!pip install nlpia

Collecting nlpia
  Downloading nlpia-0.5.2-py2.py3-none-any.whl (32.0 MB)
Collecting python-Levenshtein
  Downloading python-Levenshtein-0.12.2.tar.gz (50 kB)
Collecting spacy
  Downloading spacy-3.4.1-cp39-cp39-win_amd64.whl (11.8 MB)


ERROR: Exception:
Traceback (most recent call last):
  File "C:\Users\EMZ\anaconda3\lib\site-packages\pip\_vendor\urllib3\response.py", line 438, in _error_catcher
    yield
  File "C:\Users\EMZ\anaconda3\lib\site-packages\pip\_vendor\urllib3\response.py", line 519, in read
    data = self._fp.read(amt) if not fp_closed else b""
  File "C:\Users\EMZ\anaconda3\lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 62, in read
    data = self.__fp.read(amt)
  File "C:\Users\EMZ\anaconda3\lib\http\client.py", line 463, in read
    n = self.readinto(b)
  File "C:\Users\EMZ\anaconda3\lib\http\client.py", line 507, in readinto
    n = self.fp.readinto(b)
  File "C:\Users\EMZ\anaconda3\lib\socket.py", line 704, in readinto
    return self._sock.recv_into(b)
  File "C:\Users\EMZ\anaconda3\lib\ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "C:\Users\EMZ\anaconda3\lib\ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
socket.timeout: The

In [None]:
to be contiue