## Byron Dennis - Assignment #3

### Question 1 
Write a short description of the context of the dataset in your own words. Make sure your answer is no longer than three paragraphs, and should at minimum answer these questions:

•	Why did you choose the processing that you did? Give several specific examples. 

•	What is the effect of the replacement on your feature space?  Does this make sense? Is it helpful for answering your question?  Why or why not?

 Audience: technical – fellow data scientists or other technical staff.


### Answer 1
The dataset in this workbook is a collection of Hillary Clinton's emails from her private server that were released by the government during the 2015 controversy about Hillary's use of the private server.  This dataset was provided by Kaggle.  I am attempting to capture the sentiment of Hillary's emails and eventually aggregate sentiment by subject.

The processing used for this text included max_df of .6 and a customized list of stop words.  The max_df parameter was helpful because every email had standard text that included words like "confidential" and "house benghazi committee".  There were also tags and email addresses that appeared in most of the documents that the max_df parameter addressed.  I use ngram_range=(1,2) to add more context to words as I reviewed them, but I removed this setting because it tripled the size of the feature space from approximately 4,300 columns to just over 15,000 columns.  

It was necessary for me to customize a stopwords list because email addresses, dates, and names of Hillary's close associates were continually counted most frequently.  It was also a regular practice for Hillary to email "pls print" to her team.  This phrase was added to the stop list.  Stemming was not used because I did not want to risk changing the meaning of words that would be evaluated for sentiment.  Using stopwords and max DF did not reduce the feature space significantly, but it allowed more meaningful words to be identified more easily and removed neutral words that could possibly skew the sentiment analysis.



### Question 2
Write a short description of how the sentiment analysis was done and what the outcome is. Make sure your answer is no longer than three paragraphs, and should at minimum answer these questions:

•	How did your processing affect the sentiment assignment, if at all?

•	What measure did you use to determine the sentiment label?  Why?  Do any of the label assignments surprise you? 

•	Include a few specific examples of label assignment and how it was determined and why it does or does not make sense.


### Answer 2
The sentiment analysis was completed by using the afinn sentiment dictionary which assigns positive and negative values to each word.  The sum of the sentiment values in each email were then used to label the email as positive, negative, or neutral.  The sum of the sentiment values in each email had to be at least +-2 to be labeled positive or negative.  If the sentiment value was between 1 and -1 the email would be labeled neutral.

For preprocessing, I used a custom set of stop words to remove frequently used government words and names.  However, because a count vectorizer was not used for the sentiment analysis it was difficult to get the text as clean as I would have preferred.  Most of the words/phrases that I would have removed (email addresses) did not have sentiment values so it did not have a significant impact on the analysis.  Two key words that I removed were "please" because there were many emails with the phrase "please print" that were initially causing results to appear positive that should have been considered neutral.

I was surprised how many emails were labeled as neutral, but that was caused by many emails that had very little substance.  An example of one email that was labeled neutral that could be viewed as positive would be, "well, what doesn't kill you, makes you stronger gas i have rationalized for years), so just survive and you'll have triumphed!"  The sentiment values in this email are Kill -3, Stronger 2 and trimuphed! 0.  I listed "triumphed!" because "triumph" has a value of 4 which would cause the email to be considered positive, but the puncutation caused the sentiment value to be excluded.  This should be corrected. 

### Question 3

Consider a specific outcome you would like to achieve with your sentiment analysis.  That is, determine what sentiment you might want to have assigned to a specific piece of text.  It could be one entry in your corpus, several documents, or the entire corpus. Make changes to the feature space and/or dictionary to achieve that outcome. Show specific results. "

Write a short description of the exercise and the outcome.  Make sure your answer is no longer than three paragraphs, and should at minimum answer these questions:

•	What outcome did you choose?  Why?

•	How did you change the dictionary to achieve that outcome?

•	How would you explain (justify, rationalize) those changes if necessary?

### Answer 3
The phrase below was originally as "neutral". I added "cnn" to the dictionary with a value of -2.  I changed cnn to show as negative because I wanted to show how I could skew results around a subject intentionally.  The label was changed to negative after my update to the dictionary.  To rationalize the change I would say that is CNN one of the most negative news outlets (according to conservatives) and if CNN is involved there must be something negative.

"well, philippe looks right again. cnn is reporting this as being done against my wishes. any way to salvage?"




### T1.  Read in or create a data frame with at least one column of text to be analyzed.  This could be the text you used previously or new text. Based on the context of your dataset and the question you want to answer, identify at what processing you think is necessary (stop words, stemming, custom replacement, etc.) Compare the feature space before and after your processing. 

### Question 4
Data science is all about finding patterns in the data.  You have just been asked to decide on a pattern before finding it. Write a short description of how the easy or difficult it was to arrive at a predetermined conclusion.  How difficult was it to justify? What are the ethical issues involved, if any? What is your role as a data scientist? 

### Answer 4
Technically arriving at predetermined conclusions is an easy thing to be done.  However, internally it is difficult for me to report numbers that I myself do not trust and agree with.  Justifying predetermined conclusions could be difficult because I would not want to put my name on work that is inaccurate or not providing a complete picture of the situation at hand.  

In the previous exercise it was easy to skew the numbers because it was just an exercise, but a real situation where a person is asked to report outcomes that are predetermined is difficult in many ways.  Most obviously, the organization consuming the content provided in the report could be misled and lose extraordinary sums of money or possibly damage key business relationships.  Conversely, as an employee with important monthly financial obligations it is preferred to maintain steady employment and preferrably not have conflict with leaders within the company.  As data scientists it is our obligation to provide the most accurate information possibe and to ensure that it is interpreted correctly throughout our organizations.

In [2]:
import pandas as pd
import numpy as np
from __future__ import division

pathname = "C:/Users/byron/OneDrive/Documents/Text Mining/"

pd.set_option('display.max_colwidth', 15000)

In [6]:
emaildf = pd.read_csv(pathname + "Hillary_Emails.csv")

print(emaildf.shape)
print(list(emaildf))

(7945, 22)
['Id', 'DocNumber', 'MetadataSubject', 'MetadataTo', 'MetadataFrom', 'SenderPersonId', 'MetadataDateSent', 'MetadataDateReleased', 'MetadataPdfLink', 'MetadataCaseNumber', 'MetadataDocumentClass', 'ExtractedSubject', 'ExtractedTo', 'ExtractedFrom', 'ExtractedCc', 'ExtractedDateSent', 'ExtractedCaseNumber', 'ExtractedDocNumber', 'ExtractedDateReleased', 'ExtractedReleaseInPartOrFull', 'ExtractedBodyText', 'RawText']


In [7]:
emaildf=emaildf.drop(emaildf.columns[5:11], axis=1)
emaildf=emaildf.drop(emaildf.columns[15], axis=1)
emaildf=emaildf.loc[emaildf['MetadataFrom'] == 'H']  # Get only emails sent by Hillary
emaildf=emaildf[pd.notnull(emaildf['ExtractedBodyText'])]

print(list(emaildf))
print(emaildf.shape)

['Id', 'DocNumber', 'MetadataSubject', 'MetadataTo', 'MetadataFrom', 'ExtractedSubject', 'ExtractedTo', 'ExtractedFrom', 'ExtractedCc', 'ExtractedDateSent', 'ExtractedCaseNumber', 'ExtractedDocNumber', 'ExtractedDateReleased', 'ExtractedReleaseInPartOrFull', 'ExtractedBodyText']
(1811, 15)


In [8]:
#emaildf.head(80)

In [9]:
from nltk.corpus import stopwords

nltk_stopwords = stopwords.words("english")

my_stopwords = nltk_stopwords + ['said','2010','2009','30','message','06','com','huma','would','07','part','abedin','10',
                'clintonemail','to','the','in','and','of','that','these','for','is','on','with','it','cc','original','b6',
                                'us','also','may','dept','date','31','unclassified','2015','13','05','information','sensitive',
                                'government','gov','agreement','&','redactions','foia','waiver','cheryl','mills','house','comm',
                                'secretary','president','subject','state','case','clinton','04841','doc','release','full',
                                 'partial','produced','select','abedinh','one','time','fw','re','office','u.s.','united',
                                'states','11','12','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30',
                                 '31','1','2','3','4','5','6','7','8','9','10','september','14','2011','2012','2013','2014',
                                 '2015','sent','like','obama','hdr22','hrod17','people','millscd','know','minister','prime',
                                'today','well','monday','tuesday','wednesday','thursday','friday','saturday','sunday','august',
                                 'foreign','see','get','january','february','march','april','may','june','july','august',
                                'september','october','november','december','jacob','sullivan','work','meeting','want',
                                'benghazi','call','new','http','00','09','one','two','b5','week','aug','tomorrow',
                                'fyi','let']

my_stopwords_2 = nltk_stopwords + ['said','benghazi','call','new','http','00','09','one','two','b5','week','aug','tomorrow',
                                'fyi','let','pls','print','pis','2010','2009','30','message','06','com','huma','would','07','part','abedin','10',
                'clintonemail','to','the','in','and','of','that','these','for','is','on','with','it','cc','original','b6',
                                'us','also','may','dept','date','31','unclassified','2015','13','05','information','sensitive',
                                'government','gov','agreement','&','redactions','foia','waiver','cheryl','mills','house','comm',
                                'secretary','president','subject','state','case','clinton','04841','doc','release','full',
                                 'partial','produced','select','abedinh','one','time','fw','re','office','u.s.','united',
                                'states','11','12','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30',
                                 '31','1','2','3','4','5','6','7','8','9','10','september','14','2011','2012','2013','2014',
                                 '2015','sent','like','obama','hdr22','hrod17','people','millscd','know','minister','prime',
                                'today','well','monday','tuesday','wednesday','thursday','friday','saturday','sunday','august',
                                 'foreign','see','get','january','february','march','april','may','june','july','august',
                                'september','october','november','december','jacob','sullivan','work','meeting','want',
                                  '<hrod17@clintonemail.com>', 'friday,', '11,', '1:36' 'pm' 'fw:' 'h:', 'h','please']


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary=False,stop_words=my_stopwords_2, max_df=.6) 
cv_dm = cv.fit_transform(emaildf['ExtractedBodyText'])

#print(cv_dm.shape)

names = cv.get_feature_names()   #create list of feature names
count = np.sum(cv_dm.toarray(), axis = 0) # add up feature counts 
count2 = count.tolist()  # convert numpy array to list
count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list
#count_df.sort_values(['count'], ascending = False)[0:19]  #arrange by count instead


### T2. Create a sentiment dictionary from one of the sources in class or find/create your own (potential bonus points for appropriate creativity). Using your dictionary, create sentiment labels for the text entries in your corpus.

In [11]:
# replace contractions
# code borrowed from http://stackoverflow.com/questions/27845796/replacing-words-matching-regular-expressions-in-python
import re

replacement_patterns = [
(r'won\'t', 'will not'),
(r'can\'t', 'cannot'),
(r'i\'m', 'i am'),
(r'ain\'t', 'is not'),
(r'(\w+)\'ll', '\g<1> will'),
(r'(\w+)n\'t', '\g<1> not'),
(r'(\w+)\'ve', '\g<1> have'),
(r'(\w+)\'s', '\g<1> is'),
(r'(\w+)\'re', '\g<1> are'),
(r'(\w+)\'d', '\g<1> would'),
(r'thx', 'thanks'),
(r'pls', 'please'),
(r'pis', 'please'),
(r'\n', ' ')
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s

In [12]:
replacer = RegexpReplacer()

emaildf=emaildf.apply(lambda x: x.astype(str).str.lower())

emaildf['cleantext'] = emaildf.ExtractedBodyText.map(lambda x: replacer.replace(x))

emaildf[0:1]

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,ExtractedSubject,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,cleantext
4,5,c05739554,h: latest: how syria is aiding qaddafi and more... sid,"abedin, huma",h,,,,,,f-2015-04841,c05739554,05/13/2015,release in part,"h <hrod17@clintonemail.com>\nfriday, march 11, 2011 1:36 pm\nhuma abedin\nfw: h: latest: how syria is aiding qaddafi and more... sid\nhrc memo syria aiding libya 030311.docx\npis print.","h <hrod17@clintonemail.com> friday, march 11, 2011 1:36 pm huma abedin fw: h: latest: how syria is aiding qaddafi and more... sid hrc memo syria aiding libya 030311.docx please print."


In [13]:
# remove words that are not in the dictionary
#from nltk.corpus import words

emaildf['cleantext']=emaildf.cleantext.apply(lambda x: ' '.join([word for word in x.split() if word not in my_stopwords_2]))

emaildf[0:5]

Unnamed: 0,Id,DocNumber,MetadataSubject,MetadataTo,MetadataFrom,ExtractedSubject,ExtractedTo,ExtractedFrom,ExtractedCc,ExtractedDateSent,ExtractedCaseNumber,ExtractedDocNumber,ExtractedDateReleased,ExtractedReleaseInPartOrFull,ExtractedBodyText,cleantext
4,5,c05739554,h: latest: how syria is aiding qaddafi and more... sid,"abedin, huma",h,,,,,,f-2015-04841,c05739554,05/13/2015,release in part,"h <hrod17@clintonemail.com>\nfriday, march 11, 2011 1:36 pm\nhuma abedin\nfw: h: latest: how syria is aiding qaddafi and more... sid\nhrc memo syria aiding libya 030311.docx\npis print.",1:36 pm fw: h: latest: syria aiding qaddafi more... sid hrc memo syria aiding libya 030311.docx print.
5,6,c05739559,meet the right-wing extremist behind anti-muslim film that sparked deadly riots,russorv@state.gov,h,meet the right wing extremist behind anti-muslim film that sparked deadly riots,,,,"wednesday, september 12, 2012 01:00 pm",f-2015-04841,c05739559,05/13/2015,release in part,"pis print.\n-•-...-^\nh < hrod17@clintonernailcom>\nwednesday, september 12, 2012 2:11 pm\n°russorv@state.gov'\nfw: meet the right-wing extremist behind anti-fvluslim film that sparked deadly riots\nfrom [meat)\nsent: wednesday, september 12, 2012 01:00 pm\nto: 11\nsubject: meet the right wing extremist behind anti-muslim film that sparked deadly riots\nhtte/maxbiumenthal.com12012/09/meet-the-right-wing-extremist-behind-anti-musiim-tihn-that-sparked-\ndeadly-riots/\nsent from my verizon wireless 4g lte droid\nu.s. department of state\ncase no. f-2015-04841\ndoc no. c05739559\ndate: 05/13/2015\nstate dept. - produced to house select benghazi comm.\nsubject to agreement on sensitive information & redactions. no foia waiver. state-5cb0045251","print. -•-...-^ < hrod17@clintonernailcom> wednesday, 12, 2:11 pm °russorv@state.gov' fw: meet right-wing extremist behind anti-fvluslim film sparked deadly riots [meat) sent: wednesday, 12, 01:00 pm to: subject: meet right wing extremist behind anti-muslim film sparked deadly riots htte/maxbiumenthal.com12012/09/meet-the-right-wing-extremist-behind-anti-musiim-tihn-that-sparked- deadly-riots/ verizon wireless 4g lte droid department no. f-2015-04841 no. c05739559 date: 05/13/2015 dept. - comm. redactions. waiver. state-5cb0045251"
7,8,c05739561,h: latest: how syria is aiding qaddafi and more... sid,"abedin, huma",h,,,,,,f-2015-04841,c05739561,05/13/2015,release in part,"h <hrod17@clintonemail.corn>\nfriday, march 11, 2011 1:36 pm\nhuma abedin\nfw: h: latest: how syria is aiding qaddafi and more... sid\nhrc memo syria aiding libya 030311.docx\npis print.",<hrod17@clintonemail.corn> 1:36 pm fw: h: latest: syria aiding qaddafi more... sid hrc memo syria aiding libya 030311.docx print.
20,21,c05739578,more on libya,sullivanjj@state.gov,h,fwd: more on libya,,,,,f-2015-04841,c05739578,05/13/2015,release in part,"h <hrod17@clintonernaii.com›\nwednesday, september 12, 2012 11:26 pm\nesullivanjj@state.gov'\nfw: fwd: more on libya\nlibya 37 sept 12 12,docx\nwe should get this around asap.","<hrod17@clintonernaii.com› wednesday, 12, 11:26 pm esullivanjj@state.gov' fw: fwd: libya libya 37 sept 12,docx around asap."
21,22,c05739579,more on libya,russorv@state.gov,h,fwd: more on libya,,,,,f-2015-04841,c05739579,05/13/2015,release in part,"pis print.\nh < hrod17@clintoriernail.corn>\nwednesday, september 12, 2012 11:28 pm\n°russont@state.gov°\nfw: fwd: more on libya\nlibya 37 sept 12 12.dacx","print. < hrod17@clintoriernail.corn> wednesday, 12, 11:28 pm °russont@state.gov° fw: fwd: libya libya 37 sept 12.dacx"


In [14]:
afinn = {}
for line in open(pathname+"AFINN-111.txt"):
    tt = line.split('\t')
    afinn.update({tt[0]:int(tt[1])})

def afinn_sent(inputstring):
    
    sentcount =0
    for word in inputstring.split():  
        if word in afinn:
            sentcount = sentcount + afinn[word]
            
    
    if (sentcount < -1):
        sentiment = 'Negative'
    elif (sentcount > 1):
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    
    return sentiment
    #return sentcount

In [15]:
emaildf['afinn'] = emaildf.cleantext.apply(lambda x: afinn_sent(x))


In [16]:
emaildf.iloc[88:90][['ExtractedBodyText','cleantext','afinn']]

Unnamed: 0,ExtractedBodyText,cleantext,afinn
286,ok.,ok.,Neutral
296,let's do patsy. i'm ready.,patsy. ready.,Neutral


### T3. Consider a specific outcome you would like to achieve with your sentiment analysis.  That is, determine what sentiment you might want to have assigned to a specific piece of text.  It could be one entry in your corpus, several documents, or the entire corpus. Make changes to the feature space and/or dictionary to achieve that outcome. Show specific results. 

This phrase below is currently labeled as "neutral".  I am going to add "cnn" to the dictionary with a value of -2.

well, philippe looks right again. cnn is reporting this as being done against my wishes. any way to salvage?


In [17]:
emaildf.iloc[226:227][['ExtractedBodyText','cleantext','afinn']]

Unnamed: 0,ExtractedBodyText,cleantext,afinn
451,"well, philippe looks right again. cnn is reporting this as being done against my wishes. any way to salvage?","well, philippe looks right again. cnn reporting done wishes. way salvage?",Neutral


In [18]:
afinn_Edited = {}
for line in open(pathname+"AFINN-111_Edited.txt"):
    tt = line.split('\t')
    afinn_Edited.update({tt[0]:int(tt[1])})

def afinn_sent2(inputstring):
    
    sentcount =0
    for word in inputstring.split():  
        if word in afinn_Edited:
            sentcount = sentcount + afinn_Edited[word]
            
    
    if (sentcount < -1):
        sentiment = 'Negative'
    elif (sentcount > 1):
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    
    return sentiment
    #return sentcount

In [19]:
emaildf['afinn_Edited'] = emaildf.cleantext.apply(lambda x: afinn_sent2(x))

In [20]:
emaildf.iloc[226:227][['ExtractedBodyText','cleantext','afinn','afinn_Edited']]

Unnamed: 0,ExtractedBodyText,cleantext,afinn,afinn_Edited
451,"well, philippe looks right again. cnn is reporting this as being done against my wishes. any way to salvage?","well, philippe looks right again. cnn reporting done wishes. way salvage?",Neutral,Negative
