# NLP Part II: Tokenizing, Normalizing, and Sentiment Analysis with NLTK
---

In [1]:
import nltk

In [2]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [3]:
lemmatizer.lemmatize("walk")

LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - '/Users/max/nltk_data'
    - '/Users/max/anaconda3/nltk_data'
    - '/Users/max/anaconda3/share/nltk_data'
    - '/Users/max/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [4]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/max/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

If you ran into issues with the above:

1. Open a Jupyter notebook and run `import nltk`.
    - If this runs without issue, fantastic! Move to step 4.
    - If `import nltk` fails, NLTK itself is not installed. Head to http://www.nltk.org/install.html, follow the instructions for your computer, then repeat this step.
2. If step 4 raises a `LookupError`, run `nltk.download()`. A new window will pop up outside your Jupyter notebook. (It may be hidden behind other windows.)
3. Once this window opens, click `all`, then `download`. Once this is done, restart your Jupyter notebook and return to step 1.
4. Run:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

    - If this returns `'cat'`, then fantastic! You’re done.
    - If not, go back to step 2.

## Agenda
1. Pre-Processing
2. Sentiment Analysis

In [5]:
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

In [6]:
print(spam)

Hello,
I saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.


### Pre-Processing 

- Tokenizing
- Normalization
  - Upper/lower case
  - Lemmatizing
  - Stemming
- Other clean-up
  - e.g. removing HTML or other markup

#### Tokenizing

When we "tokenize" text, we take it and split it up into distinct pieces based on some pattern.

nltk provides many different tokenizers:
- RegexpTokenizer
- word_tokenize
- sent_tokenize (sentences)
- WhitespaceTokenizer
- TweetTokenizer
- and more!
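
To make "split based on some pattern" concrete, here is a minimal sketch that uses only Python's built-in `re` module rather than NLTK. The sample text and variable names are made up for illustration.

```python
import re

text = "Hello,\nI saw your profile. It's great!"

# The pattern describes the tokens: runs of word characters
word_tokens = re.findall(r"\w+", text.lower())

# The pattern describes the separators: runs of whitespace
chunk_tokens = re.split(r"\s+", text)

print(word_tokens)
print(chunk_tokens)
```

The first approach drops punctuation but also splits `It's` in two; the second keeps punctuation glued to the words. The NLTK tokenizers below make the same kinds of trade-offs.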

In [7]:
spam.lower().replace('.', '').replace(',', '').replace('\n', '').split(' ')

['helloi',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 '"lukoil"',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will/donate',
 'the',
 'sum',
 'of',
 '875000000',
 'euros(eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc',
 'etc']

In [8]:
spam.split(' ')

['Hello,\nI',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'LinkedIn.',
 'I',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality.',
 'This',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'I',
 'am',
 'in',
 'contact',
 'with',
 'you.',
 'My',
 'name',
 'is',
 'Mr.',
 'Valery',
 'Grayfer',
 'Chairman',
 'of',
 'the',
 'Board',
 'of',
 'Directors',
 'of',
 'PJSC',
 '"LUKOIL".',
 'I',
 'am',
 '86',
 'years',
 'old',
 'and',
 'I',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago.',
 'I',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week.',
 'I',
 'decided',
 'to',
 'WILL/Donate',
 'the',
 'sum',
 'of',
 '8,750,000.00',
 'Euros(Eight',
 'Million',
 'Seven',
 'Hundred',
 'And',
 'Fifty',
 'Thousand',
 'Euros',
 'Only',
 'etc.',
 'etc.']

In [9]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') 

In [10]:
spam.lower()

'hello,\ni saw your contact information on linkedin. i have carefully read through your profile and you seem to have an outstanding personality. this is one major reason why i am in contact with you. my name is mr. valery grayfer chairman of the board of directors of pjsc "lukoil". i am 86 years old and i was diagnosed with cancer 2 years ago. i will be going in for an operation later this week. i decided to will/donate the sum of 8,750,000.00 euros(eight million seven hundred and fifty thousand euros only etc. etc.'

In [11]:
spam_tokens = tokenizer.tokenize(spam.lower())

In [12]:
spam_tokens

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'directors',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'years',
 'old',
 'and',
 'i',
 'was',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'years',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euros',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euros',
 'only',
 'etc',
 'etc']

A `RegexpTokenizer` splits a string into substrings using a regular expression.

The following example is pulled from [the nltk site](http://www.nltk.org/_modules/nltk/tokenize/regexp.html).

In [13]:
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

In [14]:
print(s)

Good muffins cost $3.88
in New York.  Please buy me
two of them.

Thanks.


In [15]:
tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [16]:
tokenizer_1.tokenize(s.lower())

['good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'new',
 'york',
 '.',
 'please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'thanks',
 '.']

In [17]:
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)

In [18]:
tokenizer_2.tokenize(s)

['Good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'New',
 'York.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'Thanks.']

In [19]:
capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')

In [20]:
capword_tokenizer.tokenize(s)

['Good', 'New', 'York', 'Please', 'Thanks']
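
The three tokenizers above differ in what the pattern describes. By default the pattern matches the tokens themselves; with `gaps=True` it matches the separators between tokens. A small sketch on a made-up sentence (the variable names are illustrative):

```python
from nltk.tokenize import RegexpTokenizer

sentence = "Good muffins cost $3.88 in New York."

# Default: the pattern matches the tokens to keep
match_tokens = RegexpTokenizer(r"\w+").tokenize(sentence)

# gaps=True: the pattern matches the separators to split on
gap_tokens = RegexpTokenizer(r"\s+", gaps=True).tokenize(sentence)

print(match_tokens)
print(gap_tokens)
```

For text like this, the `gaps=True` version behaves like `sentence.split()`, which is why the price survives as `$3.88` there but is broken into `3` and `88` by `\w+`.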

In [21]:
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize, WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(s.lower())

In [22]:
# corpus = [doc1, doc2]
# doc1 = 'a collection of words'
# tokens = ['a', 'collection', 'of', 'words']

In [23]:
tokens

['good',
 'muffins',
 'cost',
 '$3.88',
 'in',
 'new',
 'york.',
 'please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'thanks.']

## Normalizing Techniques: Lemmatizing and Stemming

walk, walks, walking, walked

are, is, am, was

organize, organizes, organization, organizations, organizational

universe, universes, universal, universally, university?

### Lemmatizing

When we "lemmatize" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

In [24]:
lemmatizer = WordNetLemmatizer()

In [25]:
[lemmatizer.lemmatize(token) for token in tokens]

['good',
 'muffin',
 'cost',
 '$3.88',
 'in',
 'new',
 'york.',
 'please',
 'buy',
 'me',
 'two',
 'of',
 'them.',
 'thanks.']

In [26]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/max/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/max/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [27]:
from nltk import pos_tag

tokenizer = RegexpTokenizer(r'\w+') 
tokens = tokenizer.tokenize(s) 

pos_tag(tokens)

[('Good', 'JJ'),
 ('muffins', 'NNS'),
 ('cost', 'VBP'),
 ('3', 'CD'),
 ('88', 'CD'),
 ('in', 'IN'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 ('Please', 'NNP'),
 ('buy', 'VB'),
 ('me', 'PRP'),
 ('two', 'CD'),
 ('of', 'IN'),
 ('them', 'PRP'),
 ('Thanks', 'NNS')]

In [28]:
spam_tokens = tokenizer.tokenize(spam.lower())

[lemmatizer.lemmatize(token) for token in spam_tokens]

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'director',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'year',
 'old',
 'and',
 'i',
 'wa',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'year',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euro',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euro',
 'only',
 'etc',
 'etc']
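
Notice that `'was'` came back as `'wa'` in the output above. `lemmatize` treats every word as a noun unless told otherwise, so it strips the final "s" as if it were a plural. One way around this, sketched below, is to pass a part of speech taken from `pos_tag`. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical; the first maps the Penn Treebank tags that `pos_tag` returns to the single-letter codes that `lemmatize` accepts through its `pos` argument.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also lemmatize's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, tag) pairs, e.g. the output of nltk.pos_tag."""
    return [
        lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
        for token, tag in tagged_tokens
    ]
```

With the WordNet data downloaded, `lemmatize_tagged(pos_tag(spam_tokens), lemmatizer)` should turn `'was'` into `'be'` rather than `'wa'`, provided the tagger labels it as a verb.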

In [29]:
tokens_lem = [lemmatizer.lemmatize(i) for i in spam_tokens]

In [30]:
tokens_lem

['hello',
 'i',
 'saw',
 'your',
 'contact',
 'information',
 'on',
 'linkedin',
 'i',
 'have',
 'carefully',
 'read',
 'through',
 'your',
 'profile',
 'and',
 'you',
 'seem',
 'to',
 'have',
 'an',
 'outstanding',
 'personality',
 'this',
 'is',
 'one',
 'major',
 'reason',
 'why',
 'i',
 'am',
 'in',
 'contact',
 'with',
 'you',
 'my',
 'name',
 'is',
 'mr',
 'valery',
 'grayfer',
 'chairman',
 'of',
 'the',
 'board',
 'of',
 'director',
 'of',
 'pjsc',
 'lukoil',
 'i',
 'am',
 '86',
 'year',
 'old',
 'and',
 'i',
 'wa',
 'diagnosed',
 'with',
 'cancer',
 '2',
 'year',
 'ago',
 'i',
 'will',
 'be',
 'going',
 'in',
 'for',
 'an',
 'operation',
 'later',
 'this',
 'week',
 'i',
 'decided',
 'to',
 'will',
 'donate',
 'the',
 'sum',
 'of',
 '8',
 '750',
 '000',
 '00',
 'euro',
 'eight',
 'million',
 'seven',
 'hundred',
 'and',
 'fifty',
 'thousand',
 'euro',
 'only',
 'etc',
 'etc']

In [31]:
paired = list(zip(spam_tokens, tokens_lem))

In [32]:
spam

'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

In [33]:
paired

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'information'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'carefully'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profile'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstanding'),
 ('personality', 'personality'),
 ('this', 'this'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'why'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valery'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i'

### Stemming

When we "stem" data, we take words and attempt to reduce them to a base form by stripping suffixes. Stemming tends to be cruder than lemmatization: the result does not have to be a dictionary word. The algorithm used below is described in [Porter's 1980 paper](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSmbjQIQsrH8aecDvbdYCBmdm0n8oZXyaeBmzKXsG_T5JmdTsQr)

In [34]:
from nltk.stem.porter import PorterStemmer

In [35]:
p_stemmer = PorterStemmer()

In [36]:
p_stemmer.stem('computers')

'comput'

In [37]:
p_stemmer.stem('computing')

'comput'

In [38]:
p_stemmer.stem("computation")

'comput'

In [39]:
p_stemmer.stem("fifty")

'fifti'
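
One practical use of stemming is collapsing word families, like the ones listed at the top of this section, into a single key. A minimal sketch (the function name `group_by_stem` is made up for illustration):

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer


def group_by_stem(words):
    """Group words that share the same Porter stem."""
    stemmer = PorterStemmer()
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


families = group_by_stem(["walk", "walks", "walking", "walked",
                          "computers", "computing", "computation"])
print(families)
```

The keys are stems rather than words, so a key such as `'comput'` is useful for counting or matching but not for showing to a reader.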

In [40]:
stem_spam = [p_stemmer.stem(i) for i in spam_tokens]

In [41]:
paired_stem = list(zip(spam_tokens, stem_spam))
paired_stem

[('hello', 'hello'),
 ('i', 'i'),
 ('saw', 'saw'),
 ('your', 'your'),
 ('contact', 'contact'),
 ('information', 'inform'),
 ('on', 'on'),
 ('linkedin', 'linkedin'),
 ('i', 'i'),
 ('have', 'have'),
 ('carefully', 'care'),
 ('read', 'read'),
 ('through', 'through'),
 ('your', 'your'),
 ('profile', 'profil'),
 ('and', 'and'),
 ('you', 'you'),
 ('seem', 'seem'),
 ('to', 'to'),
 ('have', 'have'),
 ('an', 'an'),
 ('outstanding', 'outstand'),
 ('personality', 'person'),
 ('this', 'thi'),
 ('is', 'is'),
 ('one', 'one'),
 ('major', 'major'),
 ('reason', 'reason'),
 ('why', 'whi'),
 ('i', 'i'),
 ('am', 'am'),
 ('in', 'in'),
 ('contact', 'contact'),
 ('with', 'with'),
 ('you', 'you'),
 ('my', 'my'),
 ('name', 'name'),
 ('is', 'is'),
 ('mr', 'mr'),
 ('valery', 'valeri'),
 ('grayfer', 'grayfer'),
 ('chairman', 'chairman'),
 ('of', 'of'),
 ('the', 'the'),
 ('board', 'board'),
 ('of', 'of'),
 ('directors', 'director'),
 ('of', 'of'),
 ('pjsc', 'pjsc'),
 ('lukoil', 'lukoil'),
 ('i', 'i'),
 ('am', 'am'

## Sentiment Analysis
#### Starting with something simple
Let's build a function that can classify a small amount of text, such as a tweet, as positive or negative.

What words tell us whether certain text is positive?

In [42]:
the_tweet = "We have some delightful new food in the cafeteria. Awesome!!!"

In [43]:
the_tweet

'We have some delightful new food in the cafeteria. Awesome!!!'

In [44]:
# Let's come up with a list of positive and negative words we might run into in one tweet

positive_words = ['delight', 'good', 'great', 'awesome', 'tremendous']
negative_words = ['garbage', 'sad', 'trash', 'ugly', 'bad']

In [45]:
tokenizer = RegexpTokenizer(r'\w+')

In [46]:
tweet_tokens = tokenizer.tokenize(the_tweet.lower())

In [47]:
tweet_tokens

['we',
 'have',
 'some',
 'delightful',
 'new',
 'food',
 'in',
 'the',
 'cafeteria',
 'awesome']

In [48]:
numPosWords = 0

for i in tweet_tokens:
    if i in positive_words:
        numPosWords += 1

In [49]:
numPosWords

1

In [50]:
numNegWords = 0

for i in tweet_tokens:
    if i in negative_words:
        numNegWords += 1

In [51]:
numNegWords

0

In [52]:
# return a percentage
num_words = len(tweet_tokens)
percent_positive = numPosWords / num_words
percent_negative = numNegWords / num_words

In [53]:
print("Positive: " + "{:.0%}".format(percent_positive) 
      + " Negative: " + "{:.0%}".format(percent_negative))

Positive: 10% Negative: 0%


In [54]:
percent_positive - percent_negative

0.1
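
The cells above can be folded into the function promised at the start of this section. This is a minimal sketch that uses Python's `re` module in place of `RegexpTokenizer(r'\w+')`; the function name `lexicon_score` and its return format are assumptions made for illustration.

```python
import re


def lexicon_score(text, positive_words, negative_words):
    """Return the fraction of tokens found in each word list."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0, 0.0
    positive = sum(token in positive_words for token in tokens)
    negative = sum(token in negative_words for token in tokens)
    return positive / len(tokens), negative / len(tokens)


positive_words = ['delight', 'good', 'great', 'awesome', 'tremendous']
negative_words = ['garbage', 'sad', 'trash', 'ugly', 'bad']

pos, neg = lexicon_score(
    "We have some delightful new food in the cafeteria. Awesome!!!",
    positive_words,
    negative_words,
)
print(pos, neg)  # 0.1 0.0
```

Only `awesome` is counted here: `delightful` does not match `delight` exactly. Stemming both the tokens and the word lists before comparing is one way to close that gap.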

In [57]:
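import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; the download is skipped if it is already present
nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()

# Score each token separately, then sum the scores over the whole message
pd.DataFrame(
    [sid.polarity_scores(token) for token in spam_tokens]
).sum()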

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/max/nltk_data...


neg          2.0000
neu         86.0000
pos          2.0000
compound    -0.3011
dtype: float64


# A more sophisticated approach to sentiment analysis 

The easiest way to do sentiment classification is to train a model on data we've already labeled.


## Import The Data

In [58]:
import pandas as pd

df = pd.read_csv("data/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [59]:
df.sample(5)

| | id | sentiment | review |
|---|---|---|---|
| 2150 | "2143_4" | 0 | "I grew up watching the original Disney Cinder... |
| 4613 | "11073_8" | 1 | "I have never known of a film to arouse such d... |
| 2112 | "4513_1" | 0 | "Even though the book wasn't strictly accurate... |
| 14984 | "3129_10" | 1 | "This is one of the best reunion specials ever... |
| 22255 | "10160_7" | 1 | "I went into this film thinking it would be a ... |


In [60]:
df.review[8402]

'"I first flicked onto the LoG accidentally one night while waching television: since then, I have never missed an episode.<br /><br />It\'s humour is very weird, like a cross between Brass Eye\'s social commentary, the Fast Show\'s excellent one-liners, and an amazing plot that seems to develop each week without ever going anywhere. The best example of this was Hillary Briss\'s special stuff - what was that all about?<br /><br />The humour will not appeal to all. Some will say it\'s just too sick, and it\'s easy to see where they\'re coming from. Nonetheless, give it a try. If you don\'t like it, don\'t watch it, but if you do like it you\'ll be very glad you took my advice."'

In [61]:
from sklearn.model_selection import train_test_split

X = df[['review']]
y = df['sentiment'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [62]:
X_train.head()

| | review |
|---|---|
| 12131 | "When I saw previews of this movie I thought t... |
| 12827 | "One of the best if not the best rock'n'roll m... |
| 2912 | "I have made it my personal mission to go afte... |
| 13762 | "Lock Up Your Daughters is one of the best hig... |
| 6369 | "This is one movie that will take time to get ... |


## Our Plan
There are a few pre-processing steps we'll take to clean up the data:

- Remove HTML markup
- Normalize letter case
- Tokenize
- Remove stopwords


## Step One: Remove HTML

Fortunately, we can use the `BeautifulSoup` package to remove the HTML artifacts from our corpus.

In [63]:
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single movie review     
example1 = BeautifulSoup(df['review'][0], 'html.parser')

# Print the raw review and then the output of get_text(), for comparison
print(df['review'][0])
print('\n')
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## Step Two: Convert to Lower Case

In [64]:
lower_case = example1.get_text().lower()

In [65]:
print(lower_case)

"with all this stuff going down at the moment with mj i've started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mj's feeling towards the press and also the obvious message of drugs are bad m'kay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 20 mi
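
In the output above, `m'kay.visually` and `him.the` run together because `get_text()` removes the `<br />` tags without leaving anything in their place. Our `\w+` tokenizer still separates those particular words at the period, but two words with no punctuation between them would fuse into one token. A hedged sketch of one workaround, using the `separator` argument of `get_text()` on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "drugs are bad<br /><br />Visually impressive"
soup = BeautifulSoup(html, "html.parser")

fused = soup.get_text()                # tags removed, nothing inserted
spaced = soup.get_text(separator=" ")  # a space is inserted between text fragments

print(fused)
print(spaced)
```

Extra spaces are harmless here because the tokenizer in the next step discards whitespace anyway.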

## Step Three: Tokenize

In [66]:
from nltk.tokenize import RegexpTokenizer
retokenizer = RegexpTokenizer(r'\w+')
words = retokenizer.tokenize(lower_case)
words

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 've',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again',
 'maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent',
 'moonwalker',
 'is',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released',
 'some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 'mj',
 's',
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 'obvious',
 'message',
 'of',
 'drugs',

## Step Four: Remove Stop Words

If you didn't complete the NLTK download you may run into some issues here.

In [67]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/max/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [68]:
from nltk.corpus import stopwords

In [69]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [70]:
stops = set(stopwords.words('english'))  # build the set once; membership tests on a set are fast
words = [w for w in words if w not in stops]

In [71]:
' '.join('this is a string'.split(' '))

'this is a string'

In [72]:
" ".join(words)

'stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts 20 minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate wor
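
One caveat for sentiment work: the stop word list printed above contains negations such as `'no'`, `'nor'` and `'not'`, so "not good" and "good" look the same after filtering. Below is a minimal sketch of keeping negations. It uses a small hand-written stand-in for the NLTK list so that it runs without any downloads, and the names `toy_stops` and `negations` are illustrative.

```python
# Stand-in for set(stopwords.words('english')); the real list is much longer
toy_stops = {"the", "was", "this", "is", "a", "not", "no", "nor"}

# Words we want to keep even though they are on the stop word list
negations = {"not", "no", "nor"}
stops_without_negations = toy_stops - negations

tokens = ["the", "movie", "was", "not", "good"]

filtered_all = [t for t in tokens if t not in toy_stops]
filtered_keep_negations = [t for t in tokens if t not in stops_without_negations]

print(filtered_all)             # ['movie', 'good']
print(filtered_keep_negations)  # ['movie', 'not', 'good']
```

Whether this helps depends on the model, so it is worth comparing test scores with and without the change.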

## Step Five: Combine our cleaning into one function


In [73]:
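def review_to_words(raw_review):
    # Strip HTML markup, keeping only the text
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()

    # Normalize case
    lower_case = review_text.lower()

    # Keep runs of letters only, which also drops digits and punctuation
    retokenizer = RegexpTokenizer(r'[a-z]+')
    words = retokenizer.tokenize(lower_case)

    # A set makes the membership test below fast
    stops = set(stopwords.words('english'))

    meaningful_words = [w for w in words if w not in stops]

    # Re-join into one space-separated string for the vectorizer
    return " ".join(meaningful_words)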

In [74]:
X_train = X_train.copy()  # work on an explicit copy so pandas does not raise SettingWithCopyWarning
X_train['review'] = X_train['review'].apply(review_to_words)


## Step Six (Finally!): Applying our Function

### ...and to the test data

In [75]:
X_test = X_test.copy()  # work on an explicit copy so pandas does not raise SettingWithCopyWarning
X_test['review'] = X_test['review'].apply(review_to_words)


In [76]:
new_review = '<p>This is a $100 negative review</p>. That movie was fucking bullshit'

review_to_words(new_review)

'negative review movie fucking bullshit'

## Our data is finally ready.....

In [77]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000) 

# Learn the vocabulary from the training reviews only
vectorizer.fit(X_train['review'])

train_data_features = vectorizer.transform(X_train['review'])

test_data_features = vectorizer.transform(X_test['review'])

In [78]:
print(train_data_features.shape)

(16750, 5000)
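
The shape above means 16,750 training reviews by 5,000 vocabulary words, with each cell holding how often a word occurs in a review. A toy example, with two made-up documents, shows the idea on a scale that fits on screen:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie bad acting"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index (columns are in alphabetical order)
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]))
# One row per document, one column per word
print(counts.toarray())
```

The second row has a 2 in the `bad` column because that word appears twice in the second document. Word order is lost, which is why this representation is called a bag of words.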


In [79]:
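# get_feature_names() was removed in scikit-learn 1.2; versions older than 1.0 need it in place of get_feature_names_out()
pd.DataFrame(train_data_features.todense(), columns=vectorizer.get_feature_names_out())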

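| | abandoned | abc | abilities | ability | able | abraham | abrupt | absence | absolute | absolutely | ... | york | young | younger | youngest | youth | zero | zizek | zombie | zombies | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16745 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16746 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16747 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16748 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |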


In [80]:
vocab = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0
print(vocab)



### Now we have an array that we can use for classification!

In [81]:
# from sklearn.neighbors import KNeighborsClassifier

In [82]:
# clf = KNeighborsClassifier(n_neighbors=5)
# clf.fit(train_data_features, y_train)
# this will take a while....

In [83]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(train_data_features, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [84]:
y_train.mean()

0.4988059701492537

In [85]:
log_reg.score(test_data_features, y_test)

0.8598787878787879

In [86]:
predictions = log_reg.predict(test_data_features)
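
Accuracy alone does not say which kind of mistake the model makes. With `y_train.mean()` close to 0.5 the classes are balanced, so always guessing one class would score about 50%, and 86% is a real improvement. A confusion matrix splits the remaining errors into false positives and false negatives. The sketch below uses small made-up arrays in place of `y_test` and `predictions`:

```python
from sklearn.metrics import confusion_matrix

toy_labels = [0, 0, 1, 1, 1]
toy_predictions = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(toy_labels, toy_predictions)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print("false positives:", fp, "false negatives:", fn)
```

On the real data, `confusion_matrix(y_test, predictions)` gives the same breakdown for the test reviews.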

In [87]:
import numpy as np

new_review = '<p>This is a $100 negative review</p>. I didnt like it. It was bad. The actors were terrible. That movie was fucking bullshit'

cleaned_review = review_to_words(new_review)

df_review = pd.DataFrame([cleaned_review])

review_transformed = vectorizer.transform(df_review[0])

log_reg.predict(review_transformed)[0]

0
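
Scoring a new review takes the same steps every time: clean, vectorize, predict. scikit-learn's `Pipeline` can bundle the last two so they cannot drift apart. This is a minimal sketch on a tiny made-up training set, far too small to be a useful model; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

toy_reviews = [
    "great wonderful film",
    "wonderful acting great story",
    "terrible awful film",
    "awful acting terrible story",
]
toy_sentiment = [1, 1, 0, 0]

sentiment_pipeline = Pipeline([
    ("vectorize", CountVectorizer()),
    ("classify", LogisticRegression()),
])
sentiment_pipeline.fit(toy_reviews, toy_sentiment)

# The pipeline vectorizes the raw string before predicting
toy_prediction = sentiment_pipeline.predict(["terrible awful film"])
print(toy_prediction)
```

In this notebook, `review_to_words` could be applied before calling `predict`, or handed to `CountVectorizer` through its `preprocessor` argument so that cleaning becomes part of the pipeline too.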

In [88]:
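# Zip with the 'review' column; iterating over the X_test DataFrame itself only yields its column names
df = pd.DataFrame(list(zip(predictions, y_test, X_test['review'])), columns=['prediction', 'label', 'review'])

In [89]:
# False positives (predicted 1, labeled 0); .shape[0] counts rows, while .size counts cells
df[(df.prediction != df.label) & (df.prediction == 1)].shape[0]

In [90]:
# False negatives (predicted 0, labeled 1)
df[(df.prediction != df.label) & (df.prediction == 0)].shape[0]

In [91]:
# Every misclassified review
df[df.prediction != df.label]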


## Resources

- [Choosing a stemmer](https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html).
- A hilarious data scientist has gone rogue and used NLP and eigenfaces (eigenvectors used for face recognition) [for Tinder](http://dataconomy.com/hacking-tinder-with-facial-recognition-nlp/).
- [News headline analysis](http://nbviewer.jupyter.org/github/AYLIEN/headline_analysis/blob/06f1223012d285412a650c201a19a1c95859dca1/main-chunks.ipynb?utm_content=buffer5d40c&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer).
- [Sentiment and robot classification in movies](http://nbviewer.jupyter.org/github/cojette/ClusteringRobotsinMovie/blob/master/Classification%20of%20Robots%20in%20Movies.ipynb).
- [Text summarization with Gensim](http://nbviewer.jupyter.org/github/piskvorky/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb).
- [Sentiment analysis introduction](http://nbviewer.jupyter.org/github/sgsinclair/alta/blob/master/ipynb/SentimentAnalysis.ipynb).
- [The Largest Vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).
- [Rap Genius: Rap Stats](http://genius.com/rapstats).
- [Rap lyric generator, Hieu Nguyen and Brian Sa](http://nlp.stanford.edu/courses/cs224n/2009/fp/5.pdf).
- Check out this [Yelp blog post](http://engineeringblog.yelp.com/2015/09/automatically-categorizing-yelp-businesses.html) on how it completed a classification task (with more than 1,000 response variables) using restaurant review text.