## D. Kinney DSC 550 Week 4: 4.2 Exercise: 

### Calculate Document Similarity 
*****************************

Review the Week 4 PPT and the Sample Code for Jaccard Distance, TfidfVectorizer and CountVectorizer. These are all excellent tools for text analysis.

* Create a scenario of when and why you might want to determine if comments are positive or negative (or male/female or pass/fail or any other “binary” categorization). Also tell me how the results could be used.
* You must read the data in from a file.
* You must use some kind of vectorization method/tool (my example uses sklearn's CountVectorizer, but you can use any vectorization tool or Jaccard Distance).
* Create some kind of a dictionary of sample words you will use to search/categorize your data.
* Display the results.
* For 10% extra credit…add something more to your program that relates to Ch 5-7!
* Submit your code and a screenshot of the results.
****************************************
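
Since the exercise names Jaccard Distance as an acceptable tool, here is a minimal sketch of how it could be computed for two comments. The helper names `tokenize`, `jaccard_similarity` and `jaccard_distance` are hypothetical and not taken from the course sample code; the regex tokenizer is an assumption made to keep the example dependency-free.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard_similarity(a, b):
    """Size of the shared token set divided by the size of the combined token set."""
    set_a, set_b = tokenize(a), tokenize(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty documents: define similarity as 0
    return len(set_a & set_b) / len(union)

def jaccard_distance(a, b):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)

doc1 = "What a fantastic commercial!"
doc2 = "The new commercial is offensive."

# The two comments share only the token "commercial" out of 8 distinct tokens.
print(jaccard_similarity(doc1, doc2))  # 0.125
print(jaccard_distance(doc1, doc2))    # 0.875
```

A distance near 1 means the comments share almost no vocabulary; a distance of 0 means their token sets are identical.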

**Create a scenario of when and why you might want to determine if comments are positive or negative (or male/female or pass/fail or any other “binary” categorization). Also tell me how the results could be used.**

There are a number of scenarios where classifying comments as positive or negative is relevant. Advertising agencies immediately come to mind. We have all seen commercials that trigger a strong negative backlash, either because they take a political stand or because they are outright offensive. Typically, when these hit the small screen, Twitter "blows up." Ad agencies would do well to run this kind of analysis on Twitter's real-time streaming API. Ads that draw a strongly negative response should be pulled, unless the company behind the ad cares more about political posturing or "ad as art" than about retaining brand loyalty (and, of course, profitability).

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
# Show the full comment text; None replaces the deprecated -1 in pandas 1.0+
pd.set_option('display.max_colwidth', None)

In [2]:
# Read in the comments data
df = pd.read_csv('data/DailyComments.csv')
df.sample(5)

| Unnamed: 0 | Day of Week | comments |
|---|---|---|
| 4 | Friday | Apex should be ashamed of themselves. The new commercial is offensive and really puts the company in a bad light. |
| 6 | Sunday | What a fantastic commercial! |
| 2 | Wednesday | It was really special to finally see a company show what it stand for. |
| 5 | Saturday | Take a breath people, the company can express its point of view. Personally I am ambivalent about it. |
| 3 | Thursday | I have mixed emotions about the new ad; so far neither good or bad, have to give it more thought. |


**You must use some kind of vectorization method/tool, such as sklearn's CountVectorizer or Jaccard Distance.**

In [3]:
corpus = df['comments']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print("Vectorized Words")
print("")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print("")
print("Identify Feature Words - Matrix View")
print("")
print( X.toarray())

df = pd.DataFrame({'text' : corpus})

# Count occurrences of each positive and negative word in every comment.
# Note: str.count matches case-sensitive substrings, not whole words.
df['positive1'] = df.text.str.count('good')
df['positive2'] = df.text.str.count('special')
df['positive3'] = df.text.str.count('fantastic')
df['negative1'] = df.text.str.count('bad')
df['negative2'] = df.text.str.count('ashamed')
# Net score per comment: positive hits minus negative hits
df['TotScore'] = (df.positive1 + df.positive2 + df.positive3) - (df.negative1 + df.negative2)

print("")
print(df)

Z = sum(df['TotScore'])
print("")
print("Overall Score:  ",Z)

Vectorized Words

['about', 'ad', 'am', 'ambivalent', 'and', 'apex', 'ashamed', 'bad', 'be', 'breath', 'can', 'commercial', 'company', 'did', 'dynamite', 'emotions', 'express', 'fantastic', 'far', 'finally', 'for', 'give', 'glad', 'good', 'have', 'in', 'is', 'it', 'its', 'light', 'mixed', 'more', 'neither', 'new', 'of', 'offensive', 'or', 'people', 'personally', 'point', 'puts', 'really', 'see', 'should', 'show', 'so', 'special', 'stand', 'take', 'taking', 'the', 'themselves', 'thought', 'to', 'view', 'was', 'what', 'you']

Identify Feature Words - Matrix View

[[0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0
  0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1]
 [0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0]
 [1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 2 0 0 1 0 0 1 1 1 1
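
The exercise asks for a dictionary of sample words, while the cell above hard-codes one column per word. As a hedged alternative, the sketch below keeps the words in a dictionary and lets `CountVectorizer` count them through its `vocabulary` parameter. The `sentiment_words` dictionary and the three inline comments are illustrative assumptions (the comments are copied from the sample rows shown earlier rather than read from `DailyComments.csv`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentiment dictionary using the same words as the columns above
sentiment_words = {
    "positive": ["good", "special", "fantastic"],
    "negative": ["bad", "ashamed"],
}

# Three comments taken from the sample rows displayed earlier
comments = [
    "What a fantastic commercial!",
    "Apex should be ashamed of themselves. The new commercial is offensive "
    "and really puts the company in a bad light.",
    "I have mixed emotions about the new ad; so far neither good or bad, "
    "have to give it more thought.",
]

# Restrict the vectorizer to the dictionary words; columns follow list order
vocab = sentiment_words["positive"] + sentiment_words["negative"]
vectorizer = CountVectorizer(vocabulary=vocab)
counts = vectorizer.fit_transform(comments).toarray()

# +1 for each positive word column, -1 for each negative word column
weights = np.array(
    [1] * len(sentiment_words["positive"]) + [-1] * len(sentiment_words["negative"])
)
scores = counts @ weights

print(scores.tolist())     # [1, -2, 0]
print(int(scores.sum()))   # -1
```

Unlike `str.count`, this approach lowercases the text and matches whole tokens, so the two methods can give different scores on other data. Adding a word to the dictionary requires no new column code.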

### For 10% extra credit…add something more to your program that relates to Ch 5-7!

#### 6.5: Remove "stop" words...

In [4]:
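import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stop word lists
stop_words = stopwords.words('english')

# Remove stop words from the vectorizer's vocabulary
# (get_feature_names() was removed in scikit-learn 1.2)
tokenized_words = vectorizer.get_feature_names_out().tolist()
words_cleaned = [word for word in tokenized_words if word not in stop_words]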

In [5]:
words_cleaned

['ad',
 'ambivalent',
 'apex',
 'ashamed',
 'bad',
 'breath',
 'commercial',
 'company',
 'dynamite',
 'emotions',
 'express',
 'fantastic',
 'far',
 'finally',
 'give',
 'glad',
 'good',
 'light',
 'mixed',
 'neither',
 'new',
 'offensive',
 'people',
 'personally',
 'point',
 'puts',
 'really',
 'see',
 'show',
 'special',
 'stand',
 'take',
 'taking',
 'thought',
 'view']

#### 6.7: Tagging Parts of Speech

In [6]:
# Load libraries
import nltk
from nltk import pos_tag
from nltk import word_tokenize
nltk.download('averaged_perceptron_tagger')

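# Use pre-trained part of speech tagger.
# tokenized_words is the alphabetized vocabulary, so each word is tagged
# without its original sentence context and some tags are unreliable.
string_data = ' '.join(tokenized_words)
text_tagged = pos_tag(word_tokenize(string_data))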

# Show parts of speech
text_tagged

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('about', 'IN'),
 ('ad', 'NN'),
 ('am', 'VBP'),
 ('ambivalent', 'JJ'),
 ('and', 'CC'),
 ('apex', 'NN'),
 ('ashamed', 'VBD'),
 ('bad', 'JJ'),
 ('be', 'VB'),
 ('breath', 'VBN'),
 ('can', 'MD'),
 ('commercial', 'JJ'),
 ('company', 'NN'),
 ('did', 'VBD'),
 ('dynamite', 'JJ'),
 ('emotions', 'NNS'),
 ('express', 'RBR'),
 ('fantastic', 'JJ'),
 ('far', 'RB'),
 ('finally', 'RB'),
 ('for', 'IN'),
 ('give', 'JJ'),
 ('glad', 'NN'),
 ('good', 'JJ'),
 ('have', 'VBP'),
 ('in', 'IN'),
 ('is', 'VBZ'),
 ('it', 'PRP'),
 ('its', 'PRP$'),
 ('light', 'JJ'),
 ('mixed', 'VBN'),
 ('more', 'RBR'),
 ('neither', 'JJ'),
 ('new', 'JJ'),
 ('of', 'IN'),
 ('offensive', 'JJ'),
 ('or', 'CC'),
 ('people', 'NNS'),
 ('personally', 'RB'),
 ('point', 'VBP'),
 ('puts', 'NNS'),
 ('really', 'RB'),
 ('see', 'VB'),
 ('should', 'MD'),
 ('show', 'VB'),
 ('so', 'RB'),
 ('special', 'JJ'),
 ('stand', 'NN'),
 ('take', 'VB'),
 ('taking', 'VBG'),
 ('the', 'DT'),
 ('themselves', 'PRP'),
 ('thought', 'VBD'),
 ('to', 'TO'),
 ('view', 'VB')
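
One possible use of the tagged output is to pull out the adjectives as candidate sentiment words. The sketch below is an assumption about how that could be done, not part of the original notebook; it hard-codes a small subset of the `(word, tag)` pairs shown above so that it runs without NLTK, and the variable name `tagged_subset` is hypothetical.

```python
# A subset of the (word, tag) pairs from the tagger output above
tagged_subset = [
    ("ambivalent", "JJ"),
    ("apex", "NN"),
    ("bad", "JJ"),
    ("fantastic", "JJ"),
    ("finally", "RB"),
    ("good", "JJ"),
    ("people", "NNS"),
    ("special", "JJ"),
    ("take", "VB"),
]

# Penn Treebank adjective tags all start with "JJ" (JJ, JJR, JJS)
adjectives = [word for word, tag in tagged_subset if tag.startswith("JJ")]

print(adjectives)  # ['ambivalent', 'bad', 'fantastic', 'good', 'special']
```

In the notebook itself, the same comprehension could be applied to `text_tagged`; the result would still need manual review, because words tagged out of context can be mislabeled.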

******************************************
**References**  

Albon, Chris. Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning. O'Reilly Media. Kindle Edition.