Here, we show you straightforward algorithms for separating a string into words.

In NLP, composing a numerical vector from text is a particularly "lossy" feature extraction process.

Consider stemming. Why is it important?

end, ending -> is removing 'ing' a sufficient rule?
run, running -> then what about 'ning'?
sing -> the 'ing' rule would leave only 's'
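
The questions above can be made concrete with a minimal sketch. The `naive_stem` helper is hypothetical, written only for illustration; it strips a trailing 'ing' and shows where that single rule breaks down.

```python
def naive_stem(word):
    """Hypothetical one-rule stemmer: drop a trailing 'ing' if present."""
    if word.endswith('ing'):
        return word[:-3]
    return word

# 'ending' works, 'running' keeps a doubled consonant, 'sing' is destroyed
for word in ['ending', 'running', 'sing']:
    print(word, '->', naive_stem(word))
```

A single suffix rule is not enough: each new exception calls for another rule, which is why practical stemmers carry many of them.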

The simplest way to tokenize a sentence is to use whitespace within a string as the "delimiter" of words.

In [3]:
sentence = """Thomas Jefferson began building Monticello at the age of 26."""

In [5]:
sentence.split()

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26.']

Note that the last token, '26.', keeps its trailing period, so it is not the same token as '26'.
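
A minimal sketch of why this matters: the two strings compare unequal, so a vocabulary would hold them as separate entries. Stripping punctuation with `str.strip` and `string.punctuation` is one possible workaround, shown here as an assumption only; the cells that follow keep using plain `split()`.

```python
import string

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# The trailing period makes these two distinct vocabulary entries
print('26.' == '26')

# Assumed workaround: strip leading/trailing punctuation from every token
cleaned = [tok.strip(string.punctuation) for tok in sentence.split()]
print(cleaned[-1])
```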

Let's go ahead with this imperfect tokenizer for now.

In [28]:
import numpy as np
token_sequence = sentence.split()
vocab = sorted(set(token_sequence))

In [29]:
', '.join(vocab)

'26., Jefferson, Monticello, Thomas, age, at, began, building, of, the'

In [30]:
num_tokens = len(token_sequence)
vocab_size = len(vocab)
onehot_vectors = np.zeros((num_tokens,vocab_size),int)

In [31]:
for i, word in enumerate(token_sequence):
    onehot_vectors[i,vocab.index(word)] = 1

In [33]:
' '.join(vocab)

'26. Jefferson Monticello Thomas age at began building of the'

In [34]:
onehot_vectors

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [44]:
import pandas as pd
pd.DataFrame(onehot_vectors, columns = vocab)

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,0,0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0
4,0,0,1,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0
6,0,0,0,0,0,0,0,0,0,1
7,0,0,0,0,1,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,0
9,1,0,0,0,0,0,0,0,0,0


In [45]:
df = pd.DataFrame(onehot_vectors, columns = vocab)
temp_df = df.copy()
temp_df[temp_df == 0] = ' '
temp_df

Unnamed: 0,26.,Jefferson,Monticello,Thomas,age,at,began,building,of,the
0,,,,1.0,,,,,,
1,,1.0,,,,,,,,
2,,,,,,,1.0,,,
3,,,,,,,,1.0,,
4,,,1.0,,,,,,,
5,,,,,,1.0,,,,
6,,,,,,,,,,1.0
7,,,,,1.0,,,,,
8,,,,,,,,,1.0,
9,1.0,,,,,,,,,


The table has 10 columns (the words in your vocabulary) and 10 rows (the words in the document).

One nice feature of this vector representation of words and tabular representation of documents is that no information is lost: the table preserves every token and its position, so the original sentence can be reconstructed from it.

For a long document this might not be practical, because the table grows with both the document length and the vocabulary size. You may need to do dimension reduction if you want to extract useful information from the data.

You'd like to compress your document down to a single vector rather than a big table. And you're willing to give up perfect "recall".
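
One way to picture this compression, as a sketch in plain Python rather than NumPy: summing the one-hot rows column by column yields a single vector of word counts. The names `onehot` and `bow_vector` are chosen here for illustration.

```python
sentence = "Thomas Jefferson began building Monticello at the age of 26."
tokens = sentence.split()
vocab = sorted(set(tokens))

# One row per token, one column per vocabulary word
onehot = [[1 if word == vocab_word else 0 for vocab_word in vocab]
          for word in tokens]

# Summing each column collapses the table into one vector of word counts
bow_vector = [sum(column) for column in zip(*onehot)]
print(dict(zip(vocab, bow_vector)))
```

The resulting vector records which words occur and how often, but no longer the order in which they appeared.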

In [47]:
sentence_bow = {}
for token in sentence.split():
    sentence_bow[token] = 1
sorted(sentence_bow.items())

[('26.', 1),
 ('Jefferson', 1),
 ('Monticello', 1),
 ('Thomas', 1),
 ('age', 1),
 ('at', 1),
 ('began', 1),
 ('building', 1),
 ('of', 1),
 ('the', 1)]

In [50]:
df = pd.DataFrame(
    pd.Series(dict([(token,1) for token in sentence.split()])), columns = ['sent']).T

In [51]:
df

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.
sent,1,1,1,1,1,1,1,1,1,1


In [52]:
sentences = """Thomas Jefferson began building Monticello at the\
  age of 26.\n"""

In [54]:
sentences += """Construction was done mostly by local masons and\
  carpenters.\n"""
sentences += "He moved into the South Pavilion in 1770.\n"
sentences += """Turning Monticello into a neoclassical masterpiece\
  was Jefferson's obsession."""

In [56]:
corpus = {}

In [57]:
for i, sent in enumerate(sentences.split('\n')):
    corpus['sent{}'.format(i)] = dict((tok, 1) for tok in sent.split())

In [64]:
df = pd.DataFrame.from_records(corpus).fillna(0).astype(int).T

In [66]:
df[df.columns[:10]]

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.
sent0,1,1,1,1,1,1,1,1,1,1
sent1,0,0,0,0,0,0,0,0,0,0
sent2,0,0,0,0,0,0,1,0,0,0
sent3,0,0,0,0,1,0,0,0,0,0


In [67]:
df

Unnamed: 0,Thomas,Jefferson,began,building,Monticello,at,the,age,of,26.,...,South,Pavilion,in,1770.,Turning,a,neoclassical,masterpiece,Jefferson's,obsession.
sent0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sent1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sent2,0,0,0,0,0,0,1,0,0,0,...,1,1,1,1,0,0,0,0,0,0
sent3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,1,1,1,1,1,1


One way to check for similarity between sentences is to count the number of overlapping tokens using a dot product.

If we can measure the bag-of-words overlap between two vectors, we get an estimate of how similar they are in the words they use, which in turn is a rough proxy for how similar they are in meaning.
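
A minimal sketch of that idea without pandas, assuming binary (present/absent) vectors over a shared vocabulary; `binary_vector` is a hypothetical helper. The dot product then equals the number of tokens the two sentences share.

```python
sent0 = "Thomas Jefferson began building Monticello at the age of 26."
sent3 = "Turning Monticello into a neoclassical masterpiece was Jefferson's obsession."

# Shared vocabulary so both vectors have the same columns
vocab = sorted(set(sent0.split()) | set(sent3.split()))

def binary_vector(text):
    """Hypothetical helper: 1 where the vocabulary word occurs in text, else 0."""
    present = set(text.split())
    return [1 if word in present else 0 for word in vocab]

v0, v3 = binary_vector(sent0), binary_vector(sent3)

# Dot product of binary vectors counts the positions where both are 1
overlap = sum(a * b for a, b in zip(v0, v3))
shared = set(sent0.split()) & set(sent3.split())
print(overlap, shared)
```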

In [83]:
df = df.T

In [89]:
print(df['sent0'] @ df['sent1'])
print(df['sent0'] @ df['sent2'])
print(df['sent0'] @ df['sent3'])

0
1
1


In [92]:
[(k, v) for (k, v) in (df['sent0'] & df['sent3']).items() if v]

[('Monticello', 1)]

A token improvement

In [93]:
import re

In [95]:
tokens = re.split(r'[-\s.,;!?]+', sentence)
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '']

Improved regular expression for separating words

In [97]:
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)

In [99]:
[x for x in tokens if x and x not in '- \t\n.,;!?']

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

As you can imagine, tokenizers can easily become complex.
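
A short sketch of how quickly the edge cases pile up. Matching words with `re.findall` (an alternative to the splitting approach above, not the method used in this section) avoids empty strings, yet it immediately mishandles contractions.

```python
import re

sentence = "Thomas Jefferson began building Monticello at the age of 26."

# Match runs of word characters instead of splitting on delimiters
tokens = re.findall(r"\w+", sentence)
print(tokens)

# The same pattern breaks a contraction into two fragments
contraction_tokens = re.findall(r"\w+", "Monticello wasn't")
print(contraction_tokens)
```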

Libraries such as spaCy, Stanford CoreNLP, and NLTK provide ready-made tokenizers.

In [100]:
from nltk.tokenize import RegexpTokenizer

In [101]:
tokenizer = RegexpTokenizer(r'\w+|\$[0-9.]+|\S+')

In [102]:
tokenizer.tokenize(sentence)

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26',
 '.']

In [103]:
from nltk.tokenize import TreebankWordTokenizer

In [104]:
sentence = """Monticello wasn't designated as UNESCO World Heritage\
  Site until 1987."""
sentence

"Monticello wasn't designated as UNESCO World Heritage  Site until 1987."

In [105]:
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(sentence)

['Monticello',
 'was',
 "n't",
 'designated',
 'as',
 'UNESCO',
 'World',
 'Heritage',
 'Site',
 'until',
 '1987',
 '.']

In [106]:
from nltk.tokenize.casual import casual_tokenize
message = """RT @TJMonticello Best day everrrrrrr at Monticello.\
...   Awesommmmmmeeeeeeee day :*)"""
casual_tokenize(message)

['RT',
 '@TJMonticello',
 'Best',
 'day',
 'everrrrrrr',
 'at',
 'Monticello',
 '...',
 'Awesommmmmmeeeeeeee',
 'day',
 ':*)']

In [107]:
casual_tokenize(message, reduce_len=True, strip_handles=True)

['RT',
 'Best',
 'day',
 'everrr',
 'at',
 'Monticello',
 '...',
 'Awesommmeee',
 'day',
 ':*)']

Extend your vocabulary with n-grams

How should a tokenizer treat a compound such as 'ice cream', whose meaning depends on both words together?

As you saw earlier, when a sequence of tokens is vectorized into a bag-of-words vector, it loses a lot of the meaning inherent in the order of those words.

In [110]:
# Here’s the original 1-gram tokenizer:
sentence = """Thomas Jefferson began building Monticello at the\
  age of 26."""
pattern = re.compile(r"([-\s.,;!?])+")
tokens = pattern.split(sentence)
tokens = [x for x in tokens if x and x not in '- \t\n.,;!?']
tokens

['Thomas',
 'Jefferson',
 'began',
 'building',
 'Monticello',
 'at',
 'the',
 'age',
 'of',
 '26']

In [114]:
# And this is the n-gram tokenizer from nltk in action:
from nltk.util import ngrams
list(ngrams(tokens,2))

[('Thomas', 'Jefferson'),
 ('Jefferson', 'began'),
 ('began', 'building'),
 ('building', 'Monticello'),
 ('Monticello', 'at'),
 ('at', 'the'),
 ('the', 'age'),
 ('age', 'of'),
 ('of', '26')]

In [115]:
list(ngrams(tokens,3))

[('Thomas', 'Jefferson', 'began'),
 ('Jefferson', 'began', 'building'),
 ('began', 'building', 'Monticello'),
 ('building', 'Monticello', 'at'),
 ('Monticello', 'at', 'the'),
 ('at', 'the', 'age'),
 ('the', 'age', 'of'),
 ('age', 'of', '26')]

In [116]:
two_grams = list(ngrams(tokens,2))
[" ".join(x) for x in two_grams]

['Thomas Jefferson',
 'Jefferson began',
 'began building',
 'building Monticello',
 'Monticello at',
 'at the',
 'the age',
 'age of',
 'of 26']
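
For intuition, here is a minimal sketch of how n-grams can be built without NLTK. The `simple_ngrams` helper is hypothetical and is not how `nltk.util.ngrams` is implemented; it only produces the same kind of tuples for a plain list of tokens.

```python
def simple_ngrams(tokens, n):
    """Hypothetical helper: return all runs of n consecutive tokens as tuples."""
    # zip over the token list shifted by 0, 1, ..., n-1 positions
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['Thomas', 'Jefferson', 'began', 'building', 'Monticello',
          'at', 'the', 'age', 'of', '26']

bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(bigrams[:3])
print(len(bigrams), len(trigrams))
```

A sequence of N tokens yields N - n + 1 n-grams, which matches the 9 bigrams and 8 trigrams listed above.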

Stop words

Historically, stop words have been excluded from NLP pipelines in order to reduce the computational effort to extract information from a text. Even though the words themselves carry little information, the stop words can provide important relational information as part of an n-gram.

Consider what is lost when stop words are removed from these two sentences:
- Mark reported to the CEO
- Suzanne reported as the CEO to the board
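
A small sketch of the effect on these two sentences. The stop-word set below is a deliberately tiny, hypothetical list for illustration, not NLTK's list.

```python
# Hypothetical, minimal stop-word list covering only the words in the examples
stop_words = {'to', 'the', 'as'}

def remove_stop_words(text):
    """Drop every token whose lowercase form is in the stop-word set."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]

mark = remove_stop_words("Mark reported to the CEO")
suzanne = remove_stop_words("Suzanne reported as the CEO to the board")
print(mark)
print(suzanne)
```

After filtering, both sentences reduce to a name followed by 'reported' and 'CEO', so the difference between reporting to the CEO and reporting as the CEO is no longer visible.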

In [118]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yeabinmoon/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [119]:
stop_words = nltk.corpus.stopwords.words('english')
len(stop_words)

179

In [122]:
def stem(phrase):
    # Strip a trailing 's' (keeping 'ss' endings) and surrounding apostrophes from each word
    return ' '.join([re.findall('^(.*ss|.*?)(s)?$', word)[0][0].strip("'")
                     for word in phrase.lower().split()])
print(stem('houses'))
print(stem("Doctor House's calls"))


house
doctor house call


In [123]:
print(stem('dishes'))

dishe


In [124]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [125]:
' '.join([stemmer.stem(w).strip("'") for w in "dish washer's washed dishes".split()])

'dish washer wash dish'

In [130]:
stemmer.stem('Dishes dished')

'dishes dish'

In [131]:
# Problem: the stemmer reduces the noun 'goodness' to the adjective 'good'
stemmer.stem('goodness')

'good'

In [132]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yeabinmoon/nltk_data...


True

In [136]:
from nltk.stem import WordNetLemmatizer

In [137]:
lemmatizer = WordNetLemmatizer()

In [141]:
lemmatizer.lemmatize("good", pos="a")

'good'

In [142]:
lemmatizer.lemmatize("goodness", pos="n")

'goodness'

In [143]:
stemmer.stem('goodness')

'good'

In [144]:
lemmatizer.lemmatize("goodness")

'goodness'

In [1]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sa = SentimentIntensityAnalyzer()

In [2]:
sa.lexicon

{'$:': -1.5,
 '%)': -0.4,
 '%-)': -1.5,
 '&-:': -0.4,
 '&:': -0.7,
 "( '}{' )": 1.6,
 '(%': -0.9,
 "('-:": 2.2,
 "(':": 2.3,
 '((-:': 2.1,
 '(*': 1.1,
 '(-%': -0.7,
 '(-*': 1.3,
 '(-:': 1.6,
 '(-:0': 2.8,
 '(-:<': -0.4,
 '(-:o': 1.5,
 '(-:O': 1.5,
 '(-:{': -0.1,
 '(-:|>*': 1.9,
 '(-;': 1.3,
 '(-;|': 2.1,
 '(8': 2.6,
 '(:': 2.2,
 '(:0': 2.4,
 '(:<': -0.2,
 '(:o': 2.5,
 '(:O': 2.5,
 '(;': 1.1,
 '(;<': 0.3,
 '(=': 2.2,
 '(?:': 2.1,
 '(^:': 1.5,
 '(^;': 1.5,
 '(^;0': 2.0,
 '(^;o': 1.9,
 '(o:': 1.6,
 ")':": -2.0,
 ")-':": -2.1,
 ')-:': -2.1,
 ')-:<': -2.2,
 ')-:{': -2.1,
 '):': -1.8,
 '):<': -1.9,
 '):{': -2.3,
 ');<': -2.6,
 '*)': 0.6,
 '*-)': 0.3,
 '*-:': 2.1,
 '*-;': 2.4,
 '*:': 1.9,
 '*<|:-)': 1.6,
 '*\\0/*': 2.3,
 '*^:': 1.6,
 ',-:': 1.2,
 "---'-;-{@": 2.3,
 '--<--<@': 2.2,
 '.-:': -1.2,
 '..###-:': -1.7,
 '..###:': -1.9,
 '/-:': -1.3,
 '/:': -1.3,
 '/:<': -1.4,
 '/=': -0.9,
 '/^:': -1.0,
 '/o:': -1.4,
 '0-8': 0.1,
 '0-|': -1.2,
 '0:)': 1.9,
 '0:-)': 1.4,
 '0:-3': 1.5,
 '0:03': 1.9,
 ...

In [3]:
[(tok, score) for tok, score in sa.lexicon.items() if " " in tok]

[("( '}{' )", 1.6),
 ("can't stand", -2.0),
 ('fed up', -1.8),
 ('screwed up', -1.5)]
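
To see how a lexicon like this can drive a score, here is a toy sketch. It is not VADER's algorithm, which applies further rules that are not reproduced here; `toy_score` is a hypothetical function that simply adds up the valence of each lexicon phrase found in the text. The three scores are copied from the entries listed above.

```python
# Scores copied from the multi-word lexicon entries shown above
toy_lexicon = {"can't stand": -2.0, 'fed up': -1.8, 'screwed up': -1.5}

def toy_score(text):
    """Hypothetical scorer: sum the valence of every lexicon phrase in the text."""
    lowered = text.lower()
    return sum(score for phrase, score in toy_lexicon.items() if phrase in lowered)

negative = toy_score("I am fed up and I can't stand it")
neutral = toy_score("Python is readable")
print(negative, neutral)
```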

In [4]:
example_text1 = "Python is very readable and it's great for NLP."

sa.polarity_scores(text = example_text1)

{'neg': 0.0, 'neu': 0.661, 'pos': 0.339, 'compound': 0.6249}

In [6]:
example_text2 = "Python is not a bad choice for most applications."

sa.polarity_scores(text = example_text2)

{'neg': 0.0, 'neu': 0.737, 'pos': 0.263, 'compound': 0.431}

In [8]:
example_text3 = "Python is not a bad, although its popularity decaying recently."

sa.polarity_scores(text = example_text3)

{'neg': 0.173, 'neu': 0.447, 'pos': 0.38, 'compound': 0.5023}

In [1]:
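# `sa` must exist in this session; this cell was run in a fresh kernel without it
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sa = SentimentIntensityAnalyzer()

corpus = ["Absolutely perfect! Love it! :-) :-) :-)",
          "Horrible! Completely useless. :(",
          "It was OK. Some good and some bad things."]
for doc in corpus:
    scores = sa.polarity_scores(doc)
    print('{:+}: {}'.format(scores['compound'], doc))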