<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP I: Language Data Pre-Processing and Sentiment Analysis

_Authors: Matt Brems, Noelle Brown_

---

### Learning Objectives

1. Define and implement tokenizing, lemmatizing, and stemming.
2. Preprocess text data.
3. Define and apply sentiment analysis.

#### Before we begin, try running this:

In [1]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")

'cat'

In [7]:
# if you get an error with the above code, run this & follow below directions:
# import nltk
# nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

If you ran into issues with the above:

1. Run `nltk.download()`. A new screen will pop up outside your Jupyter notebook. (It may be hidden behind other windows.)
2. Once this box opens up, click `all`, then `download`. Once this is done, restart your Jupyter notebook and try running the first three cells again.
3. Run:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

    - If this returns `cat`, then fantastic! You’re done. 
    - If not, head to http://www.nltk.org/install.html and follow instructions for your computer, then try running the first three cells again.

### Which of these was machine generated?

- A: "Kilimanjaro is a mountain of 19,710 feet covered with snow and is said to be the highest mountain in Africa. The summit of the west is called “Ngaje Ngai” in Masai, the house of God. Near the top of the west there is a dry and frozen dead body of leopard. No one has ever explained what leopard wanted at that altitude."

- B: "Kilimanjaro is a snow-covered mountain 19,710 feet high, and is said to be the highest mountain in Africa. Its western summit is called the Masai “Ngaje Ngai,” the House of God. Close to the western summit there is the dried and frozen carcass of a leopard. No one has explained what the leopard was seeking at that altitude."

<details><summary>Answer:</summary>

- Item B was written by [Ernest Hemingway](https://en.wikipedia.org/wiki/Ernest_Hemingway) in "The Snows of Kilimanjaro."

- Item A was produced when a Japanese author translated "The Snows of Kilimanjaro" into Japanese and that Japanese version was then passed through Google Translate to be "translated back" into English.
</details>

**Natural language processing** (NLP) is the field concerned with getting computers to understand language the way humans do. It has many applications, including:
- voice-to-text services for people who are hard of hearing.
- text-to-voice services for people who have difficulty reading.
- automated chatbots for organizations.
- translation services.

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

Today, we're diving into the practical side of NLP: taking text data and breaking it out into words that we can then leverage in machine learning.

In [2]:
# Imports
import pandas as pd       
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re

In [9]:
# Define spam text.
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

print(spam)

Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.


# Pre-Processing 

When dealing with text data, there are common pre-processing steps. We won't necessarily use all of them every time we deal with text data.

- Removing special characters
- Tokenizing
- Lemmatizing/Stemming
- Stop word removal

## Removing special characters & Tokenizing

We need to remove unnecessary characters when cleaning text data (punctuation, symbols, etc.). This can be done with RegEx (more on that in a bit).

When we "**tokenize**" data, we take it and split it up into distinct chunks based on some pattern.

If we use a RegEx tokenizer, we often can do these steps together.

In [11]:
# sentence tokenizer
sent_tokenize(spam.lower())  # Returns a list of sentences; note that "mr." does not end a sentence

['hello,\ni saw your contact information on linkedin.',
 'i have carefully read through your profile and you seem to have an outstanding personality.',
 'this is one major reason why i am in contact with you.',
 'my name is mr. valery grayfer chairman of the board of directors of pjsc "lukoil".',
 'i am 86 years old and i was diagnosed with cancer 2 years ago.',
 'i will be going in for an operation later this week.',
 'i decided to will/donate the sum of 8,750,000.00 euros(eight million seven hundred and fifty thousand euros only etc.',
 'etc.']

In [12]:
# word tokenizer
word_tokenize(spam.lower()) # Breaking down into individual words

['hello',
 ',',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 '.',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 '.',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 '.',
 'my',
 'name',
 'is',
 'mr.',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 '``',
 'lukoil',
 "''",
 '.',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 '.',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 '.',
 'i',
 'decided',
 'to',
 'will/donate',
 'the',
 'sum',
 'of',
 '8,750,000.00',
 'euros',
 '(',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc',
 '.',
 'etc',
 '.']

In [13]:
# Instantiate RegExp Tokenizer
tokenizer = RegexpTokenizer(r'\w+')  # We'll talk about this in a moment.

# The pattern \w+ matches runs of one or more word characters (letters, digits, underscores).

In [14]:
# "Run" Tokenizer
spam_tokens = tokenizer.tokenize(spam.lower())

In [16]:
# Show Results
print(spam_tokens)

['hello', 'i', 'saw', 'your', 'contact', 'information', 'on', 'linkedin', 'i', 'have', 'carefully', 'read', 'through', 'your', 'profile', 'and', 'you', 'seem', 'to', 'have', 'an', 'outstanding', 'personality', 'this', 'is', 'one', 'major', 'reason', 'why', 'i', 'am', 'in', 'contact', 'with', 'you', 'my', 'name', 'is', 'mr', 'valery', 'grayfer', 'chairman', 'of', 'the', 'board', 'of', 'directors', 'of', 'pjsc', 'lukoil', 'i', 'am', '86', 'years', 'old', 'and', 'i', 'was', 'diagnosed', 'with', 'cancer', '2', 'years', 'ago', 'i', 'will', 'be', 'going', 'in', 'for', 'an', 'operation', 'later', 'this', 'week', 'i', 'decided', 'to', 'will', 'donate', 'the', 'sum', 'of', '8', '750', '000', '00', 'euros', 'eight', 'million', 'seven', 'hundred', 'and', 'fifty', 'thousand', 'euros', 'only', 'etc', 'etc']


<details><summary>In comparing the original text to our tokenized version of the text, we converted one long string into a list of strings. What other changes occurred?</summary>

- All strings were converted to lower case.
- All punctuation was removed. (This was done using **regular expressions**.)
</details>
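
As a rough cross-check, the same two changes can be reproduced with only the standard library. This is a minimal sketch that assumes `RegexpTokenizer(r'\w+')` behaves like `re.findall` with the same pattern, which holds for simple inputs like this one; `sample` is just the opening of the spam text.

```python
import re

# The opening of the spam message used above.
sample = 'Hello,\nI saw your contact information on LinkedIn.'

# \w+ matches runs of letters, digits, and underscores, so punctuation
# and whitespace never make it into a token.
tokens = re.findall(r'\w+', sample.lower())
print(tokens)
```

Lowercasing happens in `sample.lower()`; the punctuation disappears because the pattern never matches it.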

### Briefly: Regular Expressions

Regular Expressions, or RegEx, is a helpful tool for detecting patterns in text. 
- This is a tool of which you should be aware!

In [17]:
[(re.findall(r'\d+', i), i) for i in spam_tokens]

[([], 'hello'),
 ([], 'i'),
 ([], 'saw'),
 ([], 'your'),
 ([], 'contact'),
 ([], 'information'),
 ([], 'on'),
 ([], 'linkedin'),
 ([], 'i'),
 ([], 'have'),
 ([], 'carefully'),
 ([], 'read'),
 ([], 'through'),
 ([], 'your'),
 ([], 'profile'),
 ([], 'and'),
 ([], 'you'),
 ([], 'seem'),
 ([], 'to'),
 ([], 'have'),
 ([], 'an'),
 ([], 'outstanding'),
 ([], 'personality'),
 ([], 'this'),
 ([], 'is'),
 ([], 'one'),
 ([], 'major'),
 ([], 'reason'),
 ([], 'why'),
 ([], 'i'),
 ([], 'am'),
 ([], 'in'),
 ([], 'contact'),
 ([], 'with'),
 ([], 'you'),
 ([], 'my'),
 ([], 'name'),
 ([], 'is'),
 ([], 'mr'),
 ([], 'valery'),
 ([], 'grayfer'),
 ([], 'chairman'),
 ([], 'of'),
 ([], 'the'),
 ([], 'board'),
 ([], 'of'),
 ([], 'directors'),
 ([], 'of'),
 ([], 'pjsc'),
 ([], 'lukoil'),
 ([], 'i'),
 ([], 'am'),
 (['86'], '86'),
 ([], 'years'),
 ([], 'old'),
 ([], 'and'),
 ([], 'i'),
 ([], 'was'),
 ([], 'diagnosed'),
 ([], 'with'),
 ([], 'cancer'),
 (['2'], '2'),
 ([], 'years'),
 ([], 'ago'),
 ([], 'i'),
 ([], 

In a regular expression, `\d` matches a single digit and `+` means "one or more," so `\d+` matches any run of consecutive digits. The code above therefore checked each element of `spam_tokens` for digits and returned the runs it found.
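
To see the pattern on its own, here is a small sketch using the standard `re` module on the amount from the spam text. The variable names are ours, chosen for illustration.

```python
import re

amount = '8,750,000.00'

# \d matches one digit; + extends the match to a run of one or more digits.
digit_runs = re.findall(r'\d+', amount)
print(digit_runs)  # ['8', '750', '000', '00']

# A string with no digits produces an empty list.
no_digits = re.findall(r'\d+', 'hello')
print(no_digits)  # []
```

The commas and the period are not digits, so each one ends a run. That is why the amount comes back in four pieces.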

A `RegexpTokenizer` splits a string into substrings using regular expressions.

The following example is pulled from [this site](http://www.nltk.org/_modules/nltk/tokenize/regexp.html).

In [18]:
# Define and print string.
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

print(s)

Good muffins cost $3.88
in New York.  Please buy me
two of them.

Thanks.


In [19]:
# Instantiate tokenizer.
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [20]:
# Run tokenizer.
tokenizer_1.tokenize(s)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

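`tokenizer_1` tries three alternatives in order at each position: a run of word characters (`\w+`), a dollar amount such as `$3.88` (`\$[\d\.]+`), or any other run of non-whitespace characters (`\S+`). As a result, words and prices stay whole, while periods attached to words are split off as their own tokens.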

In [21]:
# Instantiate tokenizer.
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)

# Run tokenizer.
tokenizer_2.tokenize(s)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'Thanks.']

The pattern in `tokenizer_2` matches runs of whitespace. Setting `gaps=True` tells the tokenizer that the pattern describes the separators, so it returns everything between them: we're splitting our tokens on whitespace.
- If you changed to `gaps=False`, you'd get back only the whitespace itself!
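
The difference between the two settings can be sketched with the standard `re` module. We are assuming here that `re.split` stands in for `gaps=True` and `re.findall` for `gaps=False`; that is a reasonable analogy for this string, though `re.split` would also return empty strings if the text began or ended with whitespace.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Like gaps=True: the pattern describes the separators, so keep what lies between them.
between_gaps = re.split(r'\s+', s)

# Like gaps=False: the pattern describes the tokens, so keep the matches themselves.
matches = re.findall(r'\s+', s)

print(between_gaps)
print(matches)
```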

In [22]:
# Instantiate tokenizer.
tokenizer_3 = RegexpTokenizer(r'[A-Z]\w+')

# Run tokenizer.
tokenizer_3.tokenize(s)

['Good', 'New', 'York', 'Please', 'Thanks']

`tokenizer_3` returns only words that begin with a capital letter and are at least two characters long (a lone capital such as "I" would not match).

As you can imagine, using RegEx _can_ be incredibly helpful if you want to find text matching a specific pattern.
- People used to use two spaces after a period to split sentences up; you could use RegEx to detect that pattern and tokenize on entire sentences.
- Chapters in a book could be titled "Chapter" followed by a number; you could use RegEx to detect that pattern and tokenize a book by its chapters.
- When Python libraries are upgraded, syntax changes! Perhaps you want to detect a certain pattern of syntax so you can update your code efficiently.
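
The chapter idea above can be sketched in a few lines. The `book` string is made up for illustration, and the heading format ("Chapter" followed by a number) is an assumption about how the book is laid out.

```python
import re

# Made-up text standing in for a real book.
book = (
    "Chapter 1\nThe storm arrived at noon.\n"
    "Chapter 2\nBy evening the road was gone.\n"
)

# Split wherever a heading of the form "Chapter <number>" appears.
pieces = re.split(r'Chapter \d+\s*', book)

# The text before the first heading is empty here, so drop blank pieces.
chapters = [piece.strip() for piece in pieces if piece.strip()]
print(chapters)
```

A real book would need a stricter pattern, since the word "Chapter" followed by a number can also appear inside ordinary sentences.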

![](./images/regex.png)

[_from xkcd_](https://xkcd.com/1171/)

## Lemmatizing & Stemming

- "He is *running* really fast!"
- "He *ran* the race."
- "He *runs* a five-minute mile."

If we wanted a computer to interpret these sentences, we might count how many times each word appears. The computer will treat "running," "ran," and "runs" as different words... but they mean very similar things (in this context)!

**Lemmatizing** and **stemming** are two forms of shortening words so we can combine similar forms of the same word.
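
Here is a small sketch of the counting problem using the three sentences above. The `base_form` dictionary is a hypothetical, hand-written mapping that stands in for what a lemmatizer or stemmer would do automatically.

```python
from collections import Counter

tokens = ['he', 'is', 'running', 'really', 'fast',
          'he', 'ran', 'the', 'race',
          'he', 'runs', 'a', 'five', 'minute', 'mile']

# Counted as-is, each form of the verb is its own entry.
raw_counts = Counter(tokens)
print(raw_counts['running'], raw_counts['ran'], raw_counts['runs'])  # 1 1 1

# Hypothetical hand-written mapping to a shared base form.
base_form = {'running': 'run', 'ran': 'run', 'runs': 'run'}
merged_counts = Counter(base_form.get(t, t) for t in tokens)
print(merged_counts['run'])  # 3
```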

When we "**lemmatize**" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

In [23]:
# Instantiate lemmatizer. 
lemmatizer = WordNetLemmatizer()

In [25]:
# Lemmatize tokens.
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [26]:
# Compare tokens to lemmatized version.
list(zip(spam_tokens, tokens_lem))  # zip pairs up elements from the two lists

# zip returns a lazy iterator, so we wrap it in list() to display the pairs.

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'information'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'carefully'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profile'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstanding'),
 ('personality', 'personality'),
 ('this', 'this'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'why'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valery'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i'

In [31]:
# Print only those lemmatized tokens that are different.
for i in range(len(spam_tokens)):
    if spam_tokens[i] != tokens_lem[i]:
        print((spam_tokens[i], tokens_lem[i]))

('directors', 'director')
('years', 'year')
('was', 'wa')
('years', 'year')
('euros', 'euro')
('euros', 'euro')


In [35]:
# Using a list comprehension
[(spam_tokens[i], tokens_lem[i]) for i in range(len(spam_tokens)) if spam_tokens[i] != tokens_lem[i]]

[('directors', 'director'),
 ('years', 'year'),
 ('was', 'wa'),
 ('years', 'year'),
 ('euros', 'euro'),
 ('euros', 'euro')]

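Lemmatizing is usually the more correct and precise way of handling things from a grammatical point of view, but it might not have much of an effect: only a handful of tokens changed above. Note that `lemmatize` treats every word as a noun unless you pass a part of speech through its `pos` argument, which is why "was" was handled like a plural noun and became "wa".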

We can also do this on individual words.

In [36]:
# Lemmatize the word "computer"
lemmatizer.lemmatize('computer')

'computer'

In [37]:
# Lemmatize the word "computers"
lemmatizer.lemmatize('computers')

'computer'

In [38]:
# Lemmatize the word "computation"
lemmatizer.lemmatize('computation')

'computation'

In [39]:
# Lemmatize the word "computationally"
lemmatizer.lemmatize('computationally')

'computationally'

When we "**stem**" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization. The algorithm used below is described in [a 1980 paper by Porter](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).
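
To make "cruder" concrete, here is a toy suffix stripper. It is not the Porter algorithm, which applies a much longer sequence of rules; the suffix list and the three-letter minimum are assumptions made up for this sketch.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ('ationally', 'ation', 'ers', 'ing', 'er', 's')  # longest first

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['computer', 'computers', 'computation', 'computationally', 'running']:
    print(word, '->', toy_stem(word))
```

Rules like these never consult a dictionary, so the result need not be a real word: "running" becomes "runn" here.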

In [40]:
# Instantiate PorterStemmer.
p_stemmer = PorterStemmer()

In [41]:
# Stem tokens.
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [42]:
# Compare tokens to stemmed version.
list(zip(spam_tokens, stem_spam))

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'inform'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'care'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profil'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'whi'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valeri'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i', 'i'),
 ('am', 'am'

In [44]:
# Print only those stemmed tokens that are different.
[(spam_tokens[i], stem_spam[i]) for i in range(len(spam_tokens)) if spam_tokens[i] != stem_spam[i]]

[('information', 'inform'),
 ('carefully', 'care'),
 ('profile', 'profil'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('why', 'whi'),
 ('valery', 'valeri'),
 ('directors', 'director'),
 ('years', 'year'),
 ('was', 'wa'),
 ('diagnosed', 'diagnos'),
 ('years', 'year'),
 ('going', 'go'),
 ('operation', 'oper'),
 ('this', 'thi'),
 ('decided', 'decid'),
 ('donate', 'donat'),
 ('euros', 'euro'),
 ('hundred', 'hundr'),
 ('fifty', 'fifti'),
 ('euros', 'euro'),
 ('only', 'onli')]

We can also do this on individual words.

In [45]:
# Stem the word "computer"
p_stemmer.stem('computer')

'comput'

In [46]:
# Stem the word "computers"
p_stemmer.stem('computers')

'comput'

In [47]:
# Stem the word "computation"
p_stemmer.stem('computation')

'comput'

In [48]:
# Stem the word "computationally"
p_stemmer.stem('computationally')

'comput'

## Stop Word Removal

The following sentence has had stop words (and punctuation) removed:

"Answer great question life universe everything said deep thought said deep thought paused forty two said deep thought infinite majesty calm."

<details><summary>What book is the above sentence from?</summary>

The Hitchhiker's Guide to the Galaxy!
    
![](./images/hgg.jpg)
    
The original quote reads:  
..."The Answer to the Great Question..."  
"Yes..!"  
"Of Life, the Universe and Everything..." said Deep Thought.  
"Yes...!"  
"Is..." said Deep Thought, and paused.  
"Yes...!"  
"Is..."  
"Yes...!!!...?"  
"Forty-two," said Deep Thought, with infinite majesty and calm.
</details>

<details><summary>If you were familiar with the book, how did you know what book the sentence was from?</summary>

Removing stop words did not remove key identifying words such as "life", "universe", "everything", and "forty-two".
</details>

<details><summary>Based on this, how would you define stop words?</summary>

Stop words are words that have little to no significance or meaning. They are common words that only add to the grammatical structure and flow of the sentence, so it is still relatively easy to identify the contents of sentences without stop words.
</details>
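
The idea can be sketched on part of the quote with a tiny stop list. The four words in `toy_stop_words` are hand-picked for this illustration; real stop word lists are much longer.

```python
import re

# A tiny hand-picked stop list for illustration; real lists are much longer.
toy_stop_words = {'the', 'to', 'of', 'and'}

quote = "The Answer to the Great Question of Life, the Universe and Everything"
tokens = re.findall(r'\w+', quote.lower())

# Keep only the tokens that are not in the stop list.
kept = [t for t in tokens if t not in toy_stop_words]
print(' '.join(kept))  # answer great question life universe everything
```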

In [49]:
# Print English stopwords.
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [50]:
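# Remove stopwords from "spam_tokens."
# Build the set once so each membership check is fast.
english_stop_words = set(stopwords.words('english'))
no_stop_words = [token for token in spam_tokens if token not in english_stop_words]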

In [51]:
# Check it
no_stop_words

['hello',
 'saw',
 'contact',
 'information',
 'linkedin',
 'carefully',
 'read',
 'profile',
 'seem',
 'outstanding',
 'personality',
 'one',
 'major',
 'reason',
 'contact',
 'name',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'board',
 'directors',
 'pjsc',
 'lukoil',
 '86',
 'years',
 'old',
 'diagnosed',
 'cancer',
 '2',
 'years',
 'ago',
 'going',
 'operation',
 'later',
 'week',
 'decided',
 'donate',
 'sum',
 '8',
 '750',
 '000',
 '00',
 'euros',
 'eight',
 'million',
 'seven',
 'hundred',
 'fifty',
 'thousand',
 'euros',
 'etc',
 'etc']

---

# Sentiment Analysis

![](./images/sent.jpeg)

[Sentiment analysis](https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html) is an area of natural language processing in which we seek to classify text as having positive or negative emotion.

Let's build a simple function that can classify text as either having positive or negative sentiment.

What words tell us whether certain text is positive?

In [52]:
# Let's come up with a list of positive and negative words we might observe.

positive_words = ['delight', 'good', 'great', 'awesome', 'tremendous', 'fabulous', 'amazing', 'stellar']
negative_words = ['garbage', 'sad', 'trash', 'ugly', 'bad', 'disgusting', 'terrible', 'gross']

In [56]:
def simple_sentiment(text):
    # Instantiate tokenizer.
    tokenizer = RegexpTokenizer(r'\w+')
    
    # Tokenize text.
    tokens = tokenizer.tokenize(text.lower())
    
    # Instantiate stemmer.
    p_stemmer = PorterStemmer()
    
    # Stem words.
    stemmed_words = [p_stemmer.stem(i) for i in tokens]
    
    # Stem our positive/negative words.
    positive_stems = [p_stemmer.stem(i) for i in positive_words]
    negative_stems = [p_stemmer.stem(i) for i in negative_words]

    # Count "positive" words.
    positive_count = sum([1 for i in stemmed_words if i in positive_stems])
    
    # Count "negative" words
    negative_count = sum([1 for i in stemmed_words if i in negative_stems])
    
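    # Calculate the net sentiment score:
    # (Positive Count - Negative Count) / (Total Count)
    # Text with no tokens gets a neutral score instead of a division error.
    if not tokens:
        return 0.0

    return round((positive_count - negative_count) / len(tokens), 2)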

In [58]:
# Recall the spam text.
print(spam)

Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.


In [59]:
# Run our sentiment analyzer on our spam email.
simple_sentiment(spam)

0.0

In [61]:
# Three not-so-random Chipotle reviews.

yelp_1 = "No Chipotle should ever have a 2 out of 5 star rating on Yelp. Especially not this one. As a regular (usually two or three visits a week), I have never been dissatisfied with a single meal here. It's Chipotle, so you know you'll pay $8 (after tax) for a chicken bowl and be full and satisfied afterwards. \n The employees are friendly and give generous portions. Seating is limited, but there is a place you can stand and eat near the window, which is where I always eat. I'm sitting down eight hours a day at the office anyway - standing and eating here is probably extending my lifespan. \nThe line gets line long during lunch, but it moves fast. Dinner time is amazing - rarely a line and the portions are extra generous during this time.\n This fairly new Chipotle is at a great location, near McPherson Square. It's right next to my office and gym so it's perfect for me. \nBottom line: if you're craving Chipotle and are worried about the other reviews and low ratings for this location, don't be. It's my favorite Chipotle location in the DC area, and that's not an exaggeration."

yelp_2 = "DISGUSTING LONG HAIR THREADED THROUGH CHICKEN IN BURRITO BOWL.\n There was a long blonde hair threaded through my chicken as I was eating a burrito bowl.  I did not notice until it was too late and the HAIR ENTERED MY MOUTH as I was eating and I grossly pulled the hair out.\n I calmly walked up to the register to inform them that there was hair in my food. The register person was busy, I understand that, but I was promptly ignored like my issue was not a big deal.  He proceeded to get his manager, 'Leslie' I believe.  She was not apologetic at all and offered no condolences. She did however offer a refund, but I didn't care about the money, I just wanted to eat food without eating someone's hair as a side dish.\n The second time I went back up, a different person, the general manager Peris, was more apologetic and handled the situation better. He ended up getting Leslie to file a report, but who knows if they submitted it or not.\n Suffice to say, if you dont want food in your hair, dont eat here."

yelp_3 = "First time going to this Chipotle.  The line was very quick and the food was fresh.  But as I started eating a notice that the food was very salty.  I started separating my bowl after two bites.  I ordered a bowl with white rice, black beans, chicken, sour cream, cheese and lettuce.  I tasted everything separately.  Once I tasted the Chicken by it self it was unbearable.  It taste like someone pouring the entire bottle of salt on tge chicken.  I tried to take most the chicken out the bowl but still I could not bear the taste of the salt.  So I ended up throwing the damn bowl away.  $8.00 down the drain.  SMH."

In [62]:
yelp_1

"No Chipotle should ever have a 2 out of 5 star rating on Yelp. Especially not this one. As a regular (usually two or three visits a week), I have never been dissatisfied with a single meal here. It's Chipotle, so you know you'll pay $8 (after tax) for a chicken bowl and be full and satisfied afterwards. \n The employees are friendly and give generous portions. Seating is limited, but there is a place you can stand and eat near the window, which is where I always eat. I'm sitting down eight hours a day at the office anyway - standing and eating here is probably extending my lifespan. \nThe line gets line long during lunch, but it moves fast. Dinner time is amazing - rarely a line and the portions are extra generous during this time.\n This fairly new Chipotle is at a great location, near McPherson Square. It's right next to my office and gym so it's perfect for me. \nBottom line: if you're craving Chipotle and are worried about the other reviews and low ratings for this location, don't

In [63]:
yelp_2

"DISGUSTING LONG HAIR THREADED THROUGH CHICKEN IN BURRITO BOWL.\n There was a long blonde hair threaded through my chicken as I was eating a burrito bowl.  I did not notice until it was too late and the HAIR ENTERED MY MOUTH as I was eating and I grossly pulled the hair out.\n I calmly walked up to the register to inform them that there was hair in my food. The register person was busy, I understand that, but I was promptly ignored like my issue was not a big deal.  He proceeded to get his manager, 'Leslie' I believe.  She was not apologetic at all and offered no condolences. She did however offer a refund, but I didn't care about the money, I just wanted to eat food without eating someone's hair as a side dish.\n The second time I went back up, a different person, the general manager Peris, was more apologetic and handled the situation better. He ended up getting Leslie to file a report, but who knows if they submitted it or not.\n Suffice to say, if you dont want food in your hair, d

In [64]:
yelp_3

'First time going to this Chipotle.  The line was very quick and the food was fresh.  But as I started eating a notice that the food was very salty.  I started separating my bowl after two bites.  I ordered a bowl with white rice, black beans, chicken, sour cream, cheese and lettuce.  I tasted everything separately.  Once I tasted the Chicken by it self it was unbearable.  It taste like someone pouring the entire bottle of salt on tge chicken.  I tried to take most the chicken out the bowl but still I could not bear the taste of the salt.  So I ended up throwing the damn bowl away.  $8.00 down the drain.  SMH.'

In [65]:
# Calculate sentiment of yelp_1.
simple_sentiment(yelp_1)

0.01

In [66]:
# Calculate sentiment of yelp_2.
simple_sentiment(yelp_2)

-0.01

In [67]:
# Calculate sentiment of yelp_3.
simple_sentiment(yelp_3)

0.0

<details><summary> What are some shortcomings of this method? </summary>

- Primarily, we're limited to the positive/negative words we came up with.
- If someone wrote "not good" or "not bad," our sentiment function would ignore the "not": it would count "not good" as positive and "not bad" as negative, which is the opposite of what was meant!
- The ordering of the words doesn't matter here, which is not how language generally works.
- We haven't corrected for misspellings.
</details>
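
The negation problem is easy to reproduce. This is a stripped-down sketch of a word-list scorer, not the `simple_sentiment` function itself: it skips stemming and normalization, and the names `positive`, `negative`, and `word_list_score` are ours.

```python
import re

positive = {'good', 'great'}
negative = {'bad', 'terrible'}

def word_list_score(text):
    """Count positive words minus negative words, ignoring word order."""
    tokens = re.findall(r'\w+', text.lower())
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

print(word_list_score('The food was not good'))  # 1, although the meaning is negative
print(word_list_score('The food was not bad'))   # -1, although the meaning is mild praise
```

Because each word is scored in isolation, "not" contributes nothing and the sign comes out backwards in both cases.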

There are a couple of ways to proceed with sentiment analysis:

1. If you have already-labeled data, you can build a supervised learning model.
2. If you don't have labeled data, you can use a lexicon that has already been built/trained for sentiment analysis.
    - There are a bunch of these and which to use depends on your purpose/data. Here are just a few that are available:
        - AFINN lexicon
        - MPQA subjectivity lexicon
        - SentiWordNet
        - VADER lexicon

We will use the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon to analyze the sentiments of our reviews!

In [70]:
# Instantiate Sentiment Intensity Analyzer
sent = SentimentIntensityAnalyzer()

In [72]:
# Calculate sentiment of yelp_1.
sent.polarity_scores(yelp_1)

{'neg': 0.046, 'neu': 0.799, 'pos': 0.155, 'compound': 0.9795}

In [73]:
# Calculate sentiment of yelp_2.
sent.polarity_scores(yelp_2)

{'neg': 0.078, 'neu': 0.875, 'pos': 0.047, 'compound': -0.5847}

In [74]:
# Calculate sentiment of yelp_3.
sent.polarity_scores(yelp_3)

{'neg': 0.067, 'neu': 0.89, 'pos': 0.043, 'compound': -0.5718}

The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate.

It is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. Typical threshold values (as given in the VADER documentation) are:

*positive sentiment: compound score >= 0.05*

*neutral sentiment: (compound score > -0.05) and (compound score < 0.05)*

*negative sentiment: compound score <= -0.05*

NOTE: The compound score is the one most commonly used for sentiment analysis by most researchers, including the authors of VADER.
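
The thresholds listed above translate directly into a small helper. The function name and the `threshold` parameter are our own choices for this sketch; VADER itself returns only the scores.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores reported above for the three Chipotle reviews.
for score in (0.9795, -0.5847, -0.5718):
    print(score, label_from_compound(score))
```

With these cutoffs, the first review is labeled positive and the other two negative.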

Let's try analyzing the sentiment of IMDb movie reviews. The data is from [Kaggle](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words).

In [75]:
# Read in training data.
reviews = pd.read_csv("./data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [76]:
# View the first five rows.
reviews.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In [82]:
# Examine a review.
reviews['review'][2]

'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against 

In [83]:
# Count how many reviews carry each sentiment label.
reviews['sentiment'].value_counts()

0    12500
1    12500
Name: sentiment, dtype: int64

In [84]:
# Does this match the sentiment given in the training data?
sent.polarity_scores(reviews['review'][2])

{'neg': 0.142, 'neu': 0.8, 'pos': 0.058, 'compound': -0.9883}

In [85]:
# The labeled sentiment for this review in the training data.
reviews['sentiment'][2]

0

In [87]:
# To measure VADER's accuracy, first turn its compound score into a 0/1 prediction.
def vader_sentiment_pred(review):
    vader_output = sent.polarity_scores(review)
    
    if vader_output['compound'] < 0:
        return 0
    else:
        return 1

In [88]:
reviews['sentiment_pred'] = reviews['review'].apply(vader_sentiment_pred)
reviews.head()

|   | id | sentiment | review | sentiment_pred |
|---|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` | 0 |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` | 1 |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` | 0 |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` | 0 |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` | 1 |


In [93]:
from sklearn.metrics import accuracy_score

In [95]:
vader_accuracy = accuracy_score(reviews['sentiment'], reviews['sentiment_pred'])
vader_accuracy

0.6924
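
To see what the accuracy figure measures, here is the same calculation done by hand on the five rows shown by `reviews.head()` above. This is only a sketch of the arithmetic, not a replacement for `accuracy_score`.

```python
# Labels and VADER predictions for the five rows shown by reviews.head().
actual = [1, 1, 0, 0, 1]
predicted = [0, 1, 0, 0, 1]

# Accuracy is the share of rows where the prediction equals the label.
matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)
print(accuracy)  # 0.8
```

Since the two classes are balanced (12,500 reviews each), always predicting the same class would score 0.5, so that is the baseline to compare against.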

## Interview Question

<details><summary>When processing text, what can you do about frequently occurring words, like "a," "and," "the," etc.?</summary>

- These words, called "stopwords," can either be kept or removed.
    - If we think these words do help explain our $Y$ variable, we might keep them. (For example, if we're classifying the era of a poem, the frequency of the word "the" may be helpful information!)
    - If we think these words don't help explain our $Y$ variable, we might remove them. (For example, in sentiment analysis, we might not think that people who use "the" more or less frequently are happier or angrier.)
</details>

A couple things to note:
1. NLP broadly describes: 
    - how we can get unstructured text data into a more structured form that can be interpreted by computers, and 
    - algorithms for interpreting text data.
2. The tools we used today do not work to the exclusion of other methods. You can and should include other variables in your model!
    - For example, maybe the length of a review tells us something about how much people liked/disliked the movie, or maybe additional information about the reviewer (e.g. geography, age, how many reviews they had submitted) has predictive value.
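
As a minimal sketch of one such extra variable, review length can be computed in a single line. The two reviews below are made up for illustration, and counting whitespace-separated words is just one of several reasonable ways to measure length.

```python
# Hypothetical reviews, made up for illustration.
example_reviews = [
    'Loved it.',
    'The plot dragged, the acting was flat, and I left before the end.',
]

# One extra feature per review: its length in whitespace-separated words.
word_counts = [len(review.split()) for review in example_reviews]
print(word_counts)  # [2, 13]
```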