
# Determining Chiefs Popularity Through Sentiment Analysis

### Authors: Andrew Pierson and Kristopher Profit

## Task 1
Read in or create a data frame with at least one column of text to be analyzed. This could be the text you used previously or new text. Based on the context of your dataset and the question you want to answer, identify what processing you think is necessary (stop words, stemming, custom replacement, etc.). Compare the feature space before and after your processing.

### Project Background and Purpose
The purpose of this project is to gather insight into the popularity of the Kansas City Chiefs during the preseason and to use this information to predict attendance at future games. This project will provide value to the Chiefs organization by benchmarking the team's current popularity and giving it the opportunity to set targets for improving fan satisfaction at games. One example of improving fan satisfaction at Arrowhead Stadium is giving away memorabilia. In contrast, if the organization notices that fan satisfaction is trending in a positive direction and is predicted to continue doing so, it might choose to give out less free merchandise at games. Twitter will be the first source used to determine popularity; other sources, such as the YouTube API, Facebook API, and Google Search API, can be added later to predict from a better sample across media sites.

In [55]:
!pip install tweepy
#Import pandas for basic data frame manipulation
import pandas as pd
#Import requests to use API calls
import requests
#Import tweepy to interact with twitter API 
import tweepy
from tweepy import OAuthHandler
#Import csv to export results for visual analysis
import csv
from __future__ import division



### Tweepy OAuth Authentication
After registering for Twitter API development, a unique set of keys is assigned. Below we define the keys that Twitter assigned, create an OAuthHandler instance using the consumer key and secret, and then set the access token and secret, which allows us to fetch information. All of this is done using the tweepy package.

In [2]:
#Define the Twitter developer API keys used to authenticate requests
consumer_key =  'tKRiH8NBRiMuexosYHjKzzfJG'
consumer_secret = 'W45BJD5QqwPu780jn8XqpweJqk3Ugd4xw8yeEr4PCZctHXC3Q9'
access_token= '185897808-IPVxscpc37tcZ5osjL4OIboAKYAjgonjssWwmbXd'
access_secret= 'GROWamgFWBBmV37H3FxXkW75WHLFSAuVo8uPaXqyBaqRT'

#Define authorization type for request
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
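
Hardcoding keys in a shared notebook exposes them to anyone who reads it. Below is a minimal sketch of one alternative, assuming the four values are stored in environment variables. The variable names and the `load_twitter_credentials` helper are hypothetical and are not part of tweepy.

```python
import os

# Hypothetical environment variable names; any naming scheme works.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)

def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be passed to `OAuthHandler` and `set_access_token` in place of the string literals.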

### Search for Tweets by Hashtag and Geospatial
In order to forecast attendance at Kansas City Chiefs games, a geospatial method will be used to search for tweets containing the '#Chiefs' tag that are within a 25 kilometer radius of Arrowhead Stadium's latitude and longitude coordinates. The number of tweets requested in the API call is 500, and the tweet objects are returned in a list.

In [3]:
#Define an empty list to hold the tweets returned by the search
results = []

#Define the variables to be used in the api search
latitude = 39.0427
longitude = -94.4837
max_range = 25
hashtag = '#Chiefs'

#Get the first 500 items based on the search query
for tweet in tweepy.Cursor(api.search, q = hashtag, geocode = '%f,%f,%dkm' % (latitude, longitude, max_range) ).items(500):
    results.append(tweet)

#Verify the number of list items is 500
print('API Call Output Type: ', type(results))
print('Tweet Objects: ', len(results))

API Call Output Type:  <class 'list'>
Tweet Objects:  500
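
The `geocode` argument in the search above is a single string of the form `latitude,longitude,radius`. The sketch below shows how that string is assembled from the three variables; `build_geocode` is a hypothetical helper name used only for illustration.

```python
def build_geocode(latitude, longitude, radius_km):
    """Format a 'lat,long,radius' string with the radius in kilometers."""
    return '%f,%f,%dkm' % (latitude, longitude, radius_km)

# Arrowhead Stadium coordinates and the 25 km radius from the cell above
geocode = build_geocode(39.0427, -94.4837, 25)
print(geocode)
```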


### Choose Tweet Objects to mine
* Hashtags included in tweet objects will be important in this project when identifying fanbase loyalty.
* A total count of all of a user's tweets (including retweets and quote tweets) could be closely related to the number of a user's followers. In addition, users that have suspiciously small status counts might be omitted from the analysis due to misrepresentation of tweet population sample or because they are spam accounts.
* The verified tweet object is a boolean value that distinguishes people of interest, such as celebrities or politicians, from normal twitter users.
* Mentions on a tweet will help keep us informed of what people are sharing with others about The Kansas City Chiefs' Organization.
* The language of a majority of these tweets will likely be English (en); other languages have a higher chance of misrepresenting our sample data by including irrelevant tweet objects.
* I chose to include the text object from tweets that contain '#Chiefs' in order to find twitter users interested in the NFL team during the Kansas City Chiefs' preseason. This object was also chosen because it will likely provide insightful text that can be used to determine the underlying sentiment of a user towards the Chiefs. 
* The next two tweet objects that I chose to include in my data frame were the retweet and favorite counts for tweets including '#Chiefs' in their tags. Retweet and favorite counts can be useful when analyzing popularity of a tweet and will likely require different weights if a user is favoriting tweets more often than retweeting. 
* Including the created_at tweet object was done in order to allow the data frame to be sorted in chronological order upon analysis. This might show trends in popularity or sentiment depending upon the Chiefs' performance after a game.
* A user's name and account description might also allow us to predict the tweet's sentiment, which is why I chose to mine user.name and user.description tweet objects. A case in which this may occur is if an opposing team's fans are including "#Chiefs" in a tweet that has a negative connotation. 
* The user.followers_count object was also included and could provide interesting information; analyzing it with a forecasting method may indicate how many impressions a tweet makes on other users.
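
The language and status-count criteria in the list above could be applied as a simple filter before analysis. This is a sketch on plain dictionaries rather than tweepy objects. The `min_statuses` threshold of 50 is an arbitrary assumption, not a value chosen in this project, and `keep_tweet` is a hypothetical helper.

```python
def keep_tweet(tweet, min_statuses=50, language='en'):
    """Return True for tweets in the target language from accounts
    with at least min_statuses total statuses."""
    return tweet['lang'] == language and tweet['statuses_count'] >= min_statuses

# Made-up records that mirror two of the fields mined from each tweet
sample = [
    {'lang': 'en', 'statuses_count': 5804},
    {'lang': 'es', 'statuses_count': 1200},
    {'lang': 'en', 'statuses_count': 3},
]
kept = [t for t in sample if keep_tweet(t)]
print(len(kept), 'of', len(sample), 'tweets kept')
```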

In [4]:
#Create a function to convert a given list of tweets into a Pandas DataFrame
#The DataFrame will consist of only chosen values below

def toDataFrame(tweets):

    DataSet = pd.DataFrame()

    DataSet['TweetHashtags'] = [tweet.entities.get('hashtags') for tweet in tweets]
    DataSet['UserStatusCount'] = [tweet.user.statuses_count for tweet in tweets]
    DataSet['UserVerified'] = [tweet.user.verified for tweet in tweets]
    #DataSet['TweetMentions'] = [tweet.entities.get('user_mentions') for tweet in tweets]
    DataSet['TweetLanguage'] = [tweet.lang for tweet in tweets]
    DataSet['TweetText'] = [tweet.text for tweet in tweets]
    DataSet['TweetRetweetCount'] = [tweet.retweet_count for tweet in tweets]
    DataSet['TweetFavoriteCount'] = [tweet.favorite_count for tweet in tweets]
    DataSet['TweetCreated'] = [tweet.created_at for tweet in tweets]
    #DataSet['userName'] = [tweet.user.name for tweet in tweets]
    #DataSet['userDesc'] = [tweet.user.description for tweet in tweets]
    DataSet['UserFollowerCount'] = [tweet.user.followers_count for tweet in tweets]

    return DataSet

### Create and Populate DataFrame
Here we pass the list of tweet objects to the function defined above to populate a dataframe named 'tweet_frame'. The resulting dataframe's shape is 500 rows by 9 columns. Printing the text of the first ten tweets is very informative: it is already obvious that most of these tweets are retweets (RT). In addition, the user @ArrowheadPride (an account I am actually following) appears to be very popular.

In [7]:
#Pass the tweets list to create a DataFrame
tweet_frame = toDataFrame(results)
print('Dataframe Shape (Rows, Columns): ', tweet_frame.shape)

tweet_frame['TweetText'][0:10]

Dataframe Shape (Rows, Columns):  (500, 9)


0    RT @LWorthySports: Andy Reid keeping tabs on B...
1    Caption this photo of Patrick Mahomes and Trav...
2    RT @ArrowheadPride: REPORT: #Chiefs work out f...
3    RT @ArrowheadPride: With nearly 2,000 votes…th...
4    RT @ArrowheadPride: REPORT: #Chiefs work out f...
5    REPORT: #Chiefs work out former Oakland Raider...
6    RT @KC_Goddess29: I propose a new currency. Ju...
7    RT @thenoahdotson: More like “Hat-trick” Mahom...
8    @RealMNchiefsfan He will end up with the Pats ...
9                  @ChiefVolFan20 #Chiefs by a billion
Name: TweetText, dtype: object
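
To put a number on how many tweets are retweets, the `RT @` prefix can be counted. Below is a minimal sketch on a few of the texts printed above (truncated as displayed). On the full data, the same check could be applied to `tweet_frame['TweetText']`. The `is_retweet` helper is a hypothetical name.

```python
def is_retweet(text):
    """Treat a tweet as a retweet when its text starts with 'RT @'."""
    return text.startswith('RT @')

texts = [
    'RT @LWorthySports: Andy Reid keeping tabs on B...',
    'Caption this photo of Patrick Mahomes and Trav...',
    'RT @ArrowheadPride: REPORT: #Chiefs work out f...',
    '@ChiefVolFan20 #Chiefs by a billion',
]
retweet_share = sum(is_retweet(t) for t in texts) / len(texts)
print(retweet_share)
```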

Taking a look at the head of our tweet_frame is also very helpful; it gives a brief introduction to the various types of data that will potentially be used during analysis.

In [8]:
tweet_frame.head()

Unnamed: 0,TweetHashtags,UserStatusCount,UserVerified,TweetLanguage,TweetText,TweetRetweetCount,TweetFavoriteCount,TweetCreated,UserFollowerCount
0,"[{'text': 'Chiefs', 'indices': [49, 56]}]",5804,False,en,RT @LWorthySports: Andy Reid keeping tabs on B...,2,0,2018-09-17 20:03:03,77
1,"[{'text': 'Chiefs', 'indices': [56, 63]}]",62891,False,en,Caption this photo of Patrick Mahomes and Trav...,0,0,2018-09-17 19:55:36,2674
2,"[{'text': 'Chiefs', 'indices': [28, 35]}]",18963,False,en,RT @ArrowheadPride: REPORT: #Chiefs work out f...,2,0,2018-09-17 19:54:32,261
3,"[{'text': 'Chiefs', 'indices': [68, 75]}]",14973,False,en,"RT @ArrowheadPride: With nearly 2,000 votes…th...",3,0,2018-09-17 19:50:14,637
4,"[{'text': 'Chiefs', 'indices': [28, 35]}]",32311,False,en,RT @ArrowheadPride: REPORT: #Chiefs work out f...,2,0,2018-09-17 19:45:22,807


In [9]:
tweet_frame.tail()

Unnamed: 0,TweetHashtags,UserStatusCount,UserVerified,TweetLanguage,TweetText,TweetRetweetCount,TweetFavoriteCount,TweetCreated,UserFollowerCount
495,"[{'text': 'Chiefs', 'indices': [109, 116]}]",2184,False,en,RT @JeremySickel: Pat Mahomes had more passes ...,12,0,2018-09-16 20:42:15,124
496,"[{'text': 'Chiefs', 'indices': [124, 131]}]",73250,False,en,RT @ClayWendler: Patrick Mahomes II is now 3-0...,22,0,2018-09-16 20:42:09,1891
497,"[{'text': 'Chiefs', 'indices': [22, 29]}]",162174,False,en,RT @JeremySickel: The #Chiefs went on the road...,3,0,2018-09-16 20:42:04,5730
498,"[{'text': 'Chiefs', 'indices': [24, 31]}]",11256,False,en,RT @ArrowheadPride: The #Chiefs are 2-0 to ope...,37,0,2018-09-16 20:42:04,1127
499,"[{'text': 'Chiefs', 'indices': [117, 124]}]",162174,False,en,RT @JeremySickel: Pat Mahomes is on pace to th...,3,0,2018-09-16 20:42:02,5730


### Count Vectorizer

In [10]:
#Import numpy into the workspace
import numpy as np
#Import the CountVectorizer module
from sklearn.feature_extraction.text import CountVectorizer
#Import the math module
import math
#Specify the maximum number of characters displayed per column for text that will be analyzed
pd.set_option('display.max_colwidth', 800)

#### Count Vectorizer Summary
cv1
*   Parameters: stop_words='english'
*   Feature Space: 1114

cv2
*   Parameters: stop_words=stopwords, min_df=.02
*   Feature Space: 106

cv3
*   Parameters: stop_words=stopwords, min_df=.02, ngram_range=(2,3)
*   Feature Space: 129

cv4
*   Parameters: stop_words=stopwords, min_df=.05
*   Feature Space: 25

### CV 1

The initial count vectorizer is built to identify commonly occurring one-word terms. Filtering the 'english' stop words is the only parameter added.

In [11]:
#Define the count vectorizer with the built-in English stop word list
cv1 = CountVectorizer(stop_words='english')
#Apply the count vectorizer for the bag of words to the dataframe feature
cv1_text = cv1.fit_transform(tweet_frame['TweetText'])
#This is the feature space for TweetText
print('Shape:', cv1_text.shape)
#This is the type of matrix that is returned for TweetText
print('Type:', type(cv1_text))

Shape: (500, 1114)
Type: <class 'scipy.sparse.csr.csr_matrix'>
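
To make the before-and-after comparison of the feature space concrete, the following standard-library sketch imitates what a count vectorizer does: lowercase the text, keep tokens of two or more word characters, and drop stop words. It is an approximation for illustration, not scikit-learn's implementation. The three-word stop list is a made-up stand-in for the full 'english' list, and `vocabulary` is a hypothetical helper.

```python
import re

TOKEN = re.compile(r'\b\w\w+\b')  # tokens of two or more word characters

def vocabulary(docs, stop_words=frozenset()):
    """Return the sorted list of distinct tokens across all documents."""
    vocab = set()
    for doc in docs:
        for token in TOKEN.findall(doc.lower()):
            if token not in stop_words:
                vocab.add(token)
    return sorted(vocab)

docs = ['Big win over the Steelers today', 'The Chiefs are 2-0']
before = vocabulary(docs)
after = vocabulary(docs, stop_words={'the', 'are', 'over'})
print(len(before), '->', len(after))
```

The number of columns in the vectorizer output corresponds to the length of this vocabulary, so each added stop word can remove at most one column.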


In [12]:
#Print the sparse matrix as a data frame for an aesthetically pleasing view of the head of the feature space
pd.DataFrame(cv1_text.toarray(), columns = cv1.get_feature_names()).head(10)

Unnamed: 0,000,0fqfabcvm3,10,103,12,13,141,15,154,1986,...,yesterday,ym3gtcyxat,yoby3bpmir,youngest,yqeapwnhul,zaayse8uvb,zane,zbnomj0dij,zividnpr6n,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


What are the most common terms? How often do they appear in the text?

In [13]:
names = cv1.get_feature_names()   #create list of feature names

count = np.sum(cv1_text.toarray(), axis = 0) # convert list to array to add up feature counts 
count2 = count.tolist()  # convert numpy array to list

count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list

sorted_count = count_df.sort_values(['count'], ascending = False)
sorted_count.head(15)

Unnamed: 0,count
chiefs,459
rt,307
https,287
mahomes,166
win,73
patrick,71
today,68
steelers,64
kc_goddess29,64
td,59


Several terms were tokenized that do not appear to add any insights into the tweets. A custom list of terms will be added to the standard 'english' stopwords list. Another count vectorizer will be built using this list.

In [14]:
from sklearn.feature_extraction import text

mylist = ['rt', '&', 'amp','pff','http','md','way','fi','fbs','https','years','uh2bpyg3gu','just'] # Add more stopwords to standard english list
stopwords = text.ENGLISH_STOP_WORDS.union(mylist)

### CV 2
This count vectorizer again identifies commonly occurring one-word terms. The stop_words parameter is set to stopwords, the standard English list combined with the custom list above, and min_df is set to .02. Last, the top 15 terms in the resulting data set are printed.

The resulting terms are mostly related to upcoming games and players on the current roster. The count of "Hearing" occurrences might be related to tweets about the loudest crowd roar in a stadium, recorded at Arrowhead at 142.2 dBA. The large count of "Pregame" occurrences might be attributable to recent regulations for tailgating or pregaming in the parking lot.

In [15]:
#Define the count vectorizer with the combined stop word list and a minimum document frequency
cv2 = CountVectorizer(stop_words=stopwords, min_df = .02)
#Apply the count vectorizer for the bag of words to the dataframe feature
cv2_text = cv2.fit_transform(tweet_frame['TweetText'])
#This is the feature space for TweetText
print('Shape:', cv2_text.shape)
#This is the type of matrix that is returned for TweetText
print('Type:', type(cv2_text))

Shape: (500, 106)
Type: <class 'scipy.sparse.csr.csr_matrix'>
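
When `min_df` is a float, it is read as a proportion of documents, so `min_df = .02` on 500 tweets keeps terms that appear in at least 10 tweets. Below is a standard-library sketch of that filter on toy documents. The `filter_by_min_df` helper is a hypothetical name, and the code approximates the idea rather than reproducing scikit-learn's implementation.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms whose document frequency is at least min_df * n_docs."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(t for t, df in doc_freq.items() if df >= min_df * n_docs)

tokenized = [
    ['chiefs', 'mahomes', 'win'],
    ['chiefs', 'steelers'],
    ['chiefs', 'mahomes'],
    ['chiefs', 'tco123'],
]
kept = filter_by_min_df(tokenized, min_df=0.5)
print(kept)
```

Rare tokens such as one-off URL fragments fall below the threshold, which is why the feature space shrinks so sharply.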


In [16]:
names = cv2.get_feature_names()   #create list of feature names

count = np.sum(cv2_text.toarray(), axis = 0) # convert list to array to add up feature counts 
count2 = count.tolist()  # convert numpy array to list

count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list

sorted_count = count_df.sort_values(['count'], ascending = False)
sorted_count.head(15)

Unnamed: 0,count
chiefs,459
mahomes,166
win,73
patrick,71
today,68
kc_goddess29,64
steelers,64
td,59
new,50
big,48


### CV 3

After reducing the feature space by using more stop words and adding a rather large min_df parameter, I am also going to widen the ngram_range so that the vectorizer returns 2- and 3-word terms. This can aid in giving more context and insight into what is being tweeted about.

In [17]:
#Define the count vectorizer with the combined stop word list, a minimum document frequency, and 2- to 3-word n-grams
cv3 = CountVectorizer(stop_words=stopwords, min_df=.02,ngram_range = (2,3))
#Apply the count vectorizer for the bag of words to the dataframe feature
cv3_text = cv3.fit_transform(tweet_frame['TweetText'])
#This is the feature space for TweetText
print('Shape:', cv3_text.shape)
#This is the type of matrix that is returned for TweetText
print('Type:', type(cv3_text))

Shape: (500, 129)
Type: <class 'scipy.sparse.csr.csr_matrix'>


In [18]:
names = cv3.get_feature_names()   #create list of feature names

count = np.sum(cv3_text.toarray(), axis = 0) # convert list to array to add up feature counts 
count2 = count.tolist()  # convert numpy array to list

count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list

sorted_count = count_df.sort_values(['count'], ascending = False)
sorted_count.head(15)

Unnamed: 0,count
patrick mahomes,67
chiefs mahomes,50
new currency,46
currency dimes,46
propose new,46
propose new currency,46
currency dimes chiefs,46
dimes chiefs,46
new currency dimes,46
dimes chiefs mahomes,46
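
Terms such as "propose new currency" in the listing above are easier to read with the n-gram construction in mind: stop words are removed first, and the remaining tokens are then joined into 2- and 3-word sequences. Below is a standard-library sketch of that order of operations. It approximates the vectorizer's behavior, the stop list is a one-word illustrative subset, and `ngrams` is a hypothetical helper.

```python
import re

def ngrams(text, stop_words, n_min=2, n_max=3):
    """Build word n-grams after lowercasing and stop-word removal."""
    tokens = [t for t in re.findall(r'\b\w\w+\b', text.lower())
              if t not in stop_words]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tweet = 'I propose a new currency. Just dimes. #Chiefs #Mahomes'
grams = ngrams(tweet, stop_words={'just'})
print(grams)
```

Because one widely retweeted tweet produces all of these n-grams at once, they share the same count in the listing.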


### CV 4

This count vectorizer will be built to identify commonly occurring one-word tokens. The min_df is set to 0.05 to filter "random noise", such as the long URL strings that some tweets contain, and stop_words is set to stopwords, the English list combined with the custom list mylist. Last, the top 15 terms in the resulting data set are printed.

In [19]:
#Define the count vectorizer with the combined stop word list and a higher min_df
cv4 = CountVectorizer(stop_words = stopwords, min_df = .05)
#Apply the count vectorizer for the bag of words to the dataframe feature
cv4_text = cv4.fit_transform(tweet_frame['TweetText'])
#This is the feature space for TweetText
print('Shape:', cv4_text.shape)
#This is the type of matrix that is returned for TweetText
print('Type:', type(cv4_text))

Shape: (500, 25)
Type: <class 'scipy.sparse.csr.csr_matrix'>


In [20]:
names = cv4.get_feature_names()   #create list of feature names

count = np.sum(cv4_text.toarray(), axis = 0) # convert list to array to add up feature counts 
count2 = count.tolist()  # convert numpy array to list

count_df = pd.DataFrame(count2, index = names, columns = ['count']) # create a dataframe from the list

sorted_count = count_df.sort_values(['count'], ascending = False)
sorted_count.head(15)

Unnamed: 0,count
chiefs,459
mahomes,166
win,73
patrick,71
today,68
steelers,64
kc_goddess29,64
td,59
new,50
big,48


## Porter Stemmer
The purpose of stemming is to consolidate the feature space by reducing tokens to shorter root forms, which should support a more accurate general sentiment analysis. The Porter stemmer handles the majority of the "random noise" and will most likely reduce the number of characters in URL fragments as well. Consequently, the stemmed list has the same length as the input, but the number of distinct features should be smaller once tokens that share a stem are merged.

In [21]:
#Import stemmer algorithm code 
from nltk.stem.porter import PorterStemmer
#Define the Porter Stemmer method using the ps acronym
ps = PorterStemmer()
#The "bag of words" from the TweetText vector space displayed as a list
cv4list = cv4.get_feature_names()
#print(cv1list)
#Print the first ten stemmed words
[ps.stem(word) for word in cv4list][:10]
#count_df = pd.DataFrame([ps.stem(word) for word in cv1list], index = names, columns = ['count']) # create a dataframe from the list

#sorted_count = count_df.sort_values(['count'], ascending = False)
#sorted_count.head(15)

['big',
 'chief',
 'claywendl',
 'currenc',
 'dime',
 'game',
 'go',
 'kc_goddess29',
 'knuyjjrmbi',
 'mahom']

## Lemmatization

Use lemmatization on the feature names from the count vectorizer (CV 4).

In [22]:
# Import Lemmatization package
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print([wnl.lemmatize(word) for word in cv4list][:10])

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/kristopherprofit/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
['big', 'chief', 'claywendler', 'currency', 'dime', 'game', 'going', 'kc_goddess29', 'knuyjjrmby', 'mahomes']


## Task 2
Create a sentiment dictionary from one of the sources in class or find/create your own (potential bonus points for appropriate creativity). Using your dictionary, create sentiment labels for the text entries in your corpus.

## Sentiment Dictionary

###  afinn Dictionary

In [24]:
# Sentiment dictionary that assigns scores to words signifying their sentiment polarity or neutrality

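afinn = {}
#Each line of the file holds a word and its integer score separated by a tab
with open("/Users/kristopherprofit/Documents/BIA 6304/Week 4/Dictionaries/AFINN-111.txt") as afinn_file:
    for line in afinn_file:
        tt = line.split('\t')
        afinn.update({tt[0]:int(tt[1])})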


print(type(afinn), len(afinn))

for key, value in sorted(afinn.items())[0:10]:
    print(key + " => " + str(value))
print("~~~~~~~~~~~~")
for key, value in sorted(afinn.items())[2467:]:
    print(key + " => " + str(value))

<class 'dict'> 2477
abandon => -2
abandoned => -2
abandons => -2
abducted => -2
abduction => -2
abductions => -2
abhor => -3
abhorred => -3
abhorrent => -3
abhors => -3
~~~~~~~~~~~~
yeah => 1
yearning => 1
yeees => 2
yes => 1
youthful => 2
yucky => -2
yummy => 3
zealot => -2
zealots => -2
zealous => 2


In [25]:
#Put the defined dictionary into a data frame
afinn_from_dict=pd.DataFrame.from_dict(afinn, orient='index')
afinn_from_dict.head(10)

Unnamed: 0,0
abandon,-2
abandoned,-2
abandons,-2
abducted,-2
abduction,-2
abductions,-2
abhor,-3
abhorred,-3
abhorrent,-3
abhors,-3


In [43]:
# here we are going for strictly the sum:  add up the positives and "subtract" the negatives
# you can return a number or a label

def afinn_sent(inputstring):
    
    sentcount =0
    for word in inputstring.split():  
        if word.rstrip('?:!.,;') in afinn:
            sentcount = sentcount + afinn[word.rstrip('?:!.,;')]
            
    
    if (sentcount < 0):
        sentiment = 'Negative'
    elif (sentcount >0):
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    
    return sentiment
    #return sentcount
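
One limitation of `afinn_sent` is that the lookup is case-sensitive and only strips trailing punctuation, so tokens such as `WIN` or `#win` are not matched against lowercase dictionary entries. Below is a sketch of a more forgiving token normalization. It uses a tiny stand-in lexicon whose scores are illustrative and are not read from AFINN-111.txt. The `normalize` and `sentiment_score` helpers are hypothetical.

```python
# Illustrative stand-in for the afinn dictionary (scores are made up)
toy_lexicon = {'win': 4, 'bad': -3}

def normalize(token):
    """Lowercase a token and strip surrounding punctuation and '#'."""
    return token.lower().strip('#?:!.,;"\'()')

def sentiment_score(text, lexicon):
    """Sum the lexicon scores of all normalized tokens in the text."""
    return sum(lexicon.get(normalize(tok), 0) for tok in text.split())

print(sentiment_score('Big WIN today!', toy_lexicon))
print(sentiment_score('As bad as the defense is', toy_lexicon))
```

Normalizing tokens this way would likely move some tweets out of the Neutral label, so the two approaches should be compared on the same sample before one is adopted.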

In [149]:
tweet_frame['afinn'] = tweet_frame["TweetText"].apply(lambda x: afinn_sent(x))
print(type(tweet_frame['TweetText']))

<class 'pandas.core.series.Series'>


In [154]:
tweet_frame.iloc[0:10][['TweetText','afinn']]

Unnamed: 0,TweetText,afinn
0,RT @LWorthySports: Andy Reid keeping tabs on BYU #Chiefs https://t.co/tTBFaZdg7i,Neutral
1,Caption this photo of Patrick Mahomes and Travis Kelce. #Chiefs https://t.co/rB2xG4UdGV,Neutral
2,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral
3,"RT @ArrowheadPride: With nearly 2,000 votes…the biggest concern for #Chiefs fans is:\n\nVote: https://t.co/L1mxmgkuLf https://t.co/92Td8loa3w",Neutral
4,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral
5,REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral
6,RT @KC_Goddess29: I propose a new currency. Just dimes. #Chiefs #Mahomes https://t.co/KnuYJJRmbY,Neutral
7,"RT @thenoahdotson: More like “Hat-trick” Mahomes, amiright? #chiefs #patrickmahomes @RealMNchiefsfan @LockedOnChiefs @ArrowheadPride @Chie…",Positive
8,@RealMNchiefsfan He will end up with the Pats opposite of Josh Gordon because #chiefs,Neutral
9,@ChiefVolFan20 #Chiefs by a billion,Neutral


In [58]:
tweet_frame.iloc[490:501][['TweetText','afinn']]

Unnamed: 0,TweetText,afinn
490,RT @KCMO: #CHIEFS!!!! Big win over the Steelers today! @PatrickMahomes5 was 🔥! 6 TD passes to lead the way. https://t.co/Uczj2OondS,Positive
491,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral
492,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive
493,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive
494,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative
495,RT @JeremySickel: Pat Mahomes had more passes that resulted in a touchdown (6) than incompletions (5) today. #Chiefs,Neutral
496,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive
497,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral
498,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive
499,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative


Given that the conditions for a tweet to be labeled as neutral required a score of 0, it is surprising to see so many neutral labels, especially among the first ten observations. Upon inspecting the text in these tweets, most of them do not contain many words at all, and many are objective reports about Chiefs news. Because of the prevalence of these kinds of tweets, and the fact that tweets are very brief by nature, there will not be a "neutral buffer zone" for sentiment labeling. Any value less than or greater than zero will be given a negative or positive sentiment, respectively. 
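
The decision to omit a neutral buffer zone corresponds to a threshold of zero. Below is a sketch of the labeling rule with the buffer exposed as a parameter, so that a wider neutral band could be tried later. The `buffer` parameter and the `label_from_score` helper are hypothetical, and the default reproduces the rule used here.

```python
def label_from_score(score, buffer=0):
    """Map a numeric sentiment score to a label.

    Scores within [-buffer, buffer] are Neutral; with buffer=0 only a
    score of exactly zero is Neutral.
    """
    if score > buffer:
        return 'Positive'
    if score < -buffer:
        return 'Negative'
    return 'Neutral'

print(label_from_score(1))            # no buffer zone
print(label_from_score(1, buffer=2))  # wider neutral band
```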

### HL Dictionary

In [64]:
HLpos = [line.strip() for line in  open('/Users/kristopherprofit/Documents/BIA 6304/Week 4/Dictionaries/HLpos.txt','r')]
HLneg = [line.strip() for line in  open('/Users/kristopherprofit/Documents/BIA 6304/Week 4/Dictionaries/HLneg.txt','r',encoding = 'latin-1')]
print("HL pos  size: " + str(len(HLpos)))
print(HLpos[0:10])
print("HL neg  size: " + str(len(HLneg)))
print(HLneg[0:10])

# different dictionary
# different measure

def hl_sent(inputstring):

    poscount = 0
    negcount = 0
    
    for word in inputstring.split(): 
        if HLpos.count(word.rstrip('?:!.,;')):
            poscount +=1
        elif HLneg.count(word.rstrip('?:!.,;')):
            negcount +=1
     
    
    if poscount+negcount > 0:
        t = float((poscount - negcount)/(poscount+negcount))    
    else:
        t = 0
    
    
    if t > 0:
        tone = "Positive"
    elif t < 0:
        tone = "Negative"
    else:
        tone = "Neutral"
    
    return tone

HL pos  size: 2006
['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation']
HL neg  size: 4783
['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']
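
`hl_sent` scores a tweet as the difference between its positive and negative matches divided by their total, so the tone ranges from -1 to 1 regardless of tweet length. Below is a sketch of the same measure using sets for faster membership tests. The two word sets are tiny illustrative stand-ins for HLpos and HLneg, and `tone_ratio` is a hypothetical helper.

```python
toy_pos = {'win', 'best', 'work'}
toy_neg = {'bad', 'concern'}

def tone_ratio(text, pos_words, neg_words):
    """Return (pos - neg) / (pos + neg) over matched words, or 0 if none match."""
    pos = neg = 0
    for word in text.split():
        token = word.rstrip('?:!.,;')
        if token in pos_words:
            pos += 1
        elif token in neg_words:
            neg += 1
    total = pos + neg
    return (pos - neg) / total if total else 0

print(tone_ratio('best win but bad defense', toy_pos, toy_neg))
```

With the full lists, converting them once with `set(HLpos)` and `set(HLneg)` would avoid scanning several thousand entries for every word.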


In [148]:
tweet_frame['hlsent'] = tweet_frame["TweetText"].apply(lambda x: hl_sent(x))

In [69]:
tweet_frame.iloc[0:10][['TweetText','afinn', 'hlsent']]

Unnamed: 0,TweetText,afinn,hlsent
0,RT @LWorthySports: Andy Reid keeping tabs on BYU #Chiefs https://t.co/tTBFaZdg7i,Neutral,Neutral
1,Caption this photo of Patrick Mahomes and Travis Kelce. #Chiefs https://t.co/rB2xG4UdGV,Neutral,Neutral
2,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral,Positive
3,"RT @ArrowheadPride: With nearly 2,000 votes…the biggest concern for #Chiefs fans is:\n\nVote: https://t.co/L1mxmgkuLf https://t.co/92Td8loa3w",Neutral,Neutral
4,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral,Positive
5,REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral,Positive
6,RT @KC_Goddess29: I propose a new currency. Just dimes. #Chiefs #Mahomes https://t.co/KnuYJJRmbY,Neutral,Neutral
7,"RT @thenoahdotson: More like “Hat-trick” Mahomes, amiright? #chiefs #patrickmahomes @RealMNchiefsfan @LockedOnChiefs @ArrowheadPride @Chie…",Positive,Positive
8,@RealMNchiefsfan He will end up with the Pats opposite of Josh Gordon because #chiefs,Neutral,Neutral
9,@ChiefVolFan20 #Chiefs by a billion,Neutral,Neutral


There are some discrepancies between the afinn and HL dictionaries, with afinn giving more documents a neutral label. The General Inquirer dictionary will now be used.

### General Inquirer Dictionary

1.   Two large valence categories
    *   Positiv: words of positive outlook (~1915)
    *   Negativ: words of negative outlook (~2291)

2.   Harvard IV-4 categories
    *   An assortment of 25 semantics classifications

In [28]:
import pandas as pd   
#import io to specify format
import io
# import the file
GI_frame = pd.read_csv("/Users/kristopherprofit/Documents/BIA 6304/Week 4/Dictionaries/inquirerbasic.csv")
#Fill null values with blank formatting
GI_frame.fillna('', inplace=True)
print(GI_frame.shape)
#Strip the '#n' sense suffix so that each entry keeps only its base word
GI_frame['NewEntry'] = GI_frame['Entry'].str.extract(r'([A-Z]\w{0,})', expand = False)
GI_frame.head(10)



(11788, 186)


Unnamed: 0,Entry,Source,Positiv,Negativ,Pstv,Affil,Ngtv,Hostile,Strong,Power,...,NegAff,PosAff,SureLw,If,NotLw,TimeSpc,FormLw,Othtags,Defined,NewEntry
0,A,H4Lvd,,,,,,,,,...,,,,,,,,DET ART,| article: Indefinite singular article--some or any one,A
1,ABANDON,H4Lvd,,Negativ,,,Ngtv,,,,...,,,,,,,,SUPV,|,ABANDON
2,ABANDONMENT,H4,,Negativ,,,,,,,...,,,,,,,,Noun,|,ABANDONMENT
3,ABATE,H4Lvd,,Negativ,,,,,,,...,,,,,,,,SUPV,|,ABATE
4,ABATEMENT,Lvd,,,,,,,,,...,,,,,,,,Noun,,ABATEMENT
5,ABDICATE,H4,,Negativ,,,,,,,...,,,,,,,,SUPV,|,ABDICATE
6,ABHOR,H4,,Negativ,,,,Hostile,,,...,,,,,,,,SUPV,|,ABHOR
7,ABIDE,H4,Positiv,,,Affil,,,,,...,,,,,,,,SUPV,|,ABIDE
8,ABILITY,H4Lvd,Positiv,,,,,,Strong,,...,,,,,,,,Noun,,ABILITY
9,ABJECT,H4,,Negativ,,,,,,,...,,,,,,,,Modif,|,ABJECT


Taking a look at the dictionary reveals some basic summary statistics of the context of the words defined. An advantage of using this dictionary over others is that you are given many different definitions, allowing for more custom feature space trimming. 

There appear to be duplicates between the two large valence categories. Words with questionable classifications as both positiv and negativ include hit and arrest. Words with questionable classifications as both strong and weak include fear and scared.

In [29]:
#let's create lists: pos, neg, strong, weak

GIlist = GI_frame['NewEntry'].tolist()
GIlist = list(map(lambda x: str(x).lower(), GIlist))
GIset = set(GIlist)
print("GI dictionary size: " + str(len(GIlist)) + " words of which " +str(len(GIset))+ " are unique.")


GIpos = GI_frame['NewEntry'][GI_frame['Positiv'].str.contains('Positiv')].tolist()
GIneg = GI_frame['NewEntry'][GI_frame['Negativ'].str.contains('Negativ')].tolist()
GIstrong = GI_frame['NewEntry'][GI_frame['Strong'].str.contains('Strong')].tolist()
GIweak = GI_frame['NewEntry'][GI_frame['Weak'].str.contains('Weak')].tolist()
GIpos = list(map(lambda x: str(x).lower(), GIpos))
GIneg = list(map(lambda x: str(x).lower(), GIneg))
GIstrong = list(map(lambda x: str(x).lower(), GIstrong))
GIweak = list(map(lambda x: str(x).lower(), GIweak))

print("Positive words: " + str(len(GIpos)))
print("Negative words: " + str(len(GIneg)))
print("Strong words: "+ str(len(GIstrong)))
print("Weak words: " + str(len(GIweak)))

GI dictionary size: 11788 words of which 8559 are unique.
Positive words: 1915
Negative words: 2291
Strong words: 1902
Weak words: 755


In [30]:
#There are some overlaps that were not defined 
print('Positive and Negative Duplicates:', set(GIpos).intersection(set(GIneg)))
print("")
print ('Strong and Weak Duplicates:', set(GIstrong).intersection(set(GIweak)))

Positive and Negative Duplicates: {'hit', 'even', 'pass', 'particular', 'hand', 'make', 'laugh', 'fine', 'matter', 'mind', 'board', 'fun', 'order', 'help', 'arrest', 'deal'}

Strong and Weak Duplicates: {'control', 'ruin', 'surround', 'scared', 'stick', 'limit', 'blind', 'upset', 'lower', 'excuse', 'patient', 'broke', 'look', 'pass', 'convict', 'scare', 'can', 'beat', 'support', 'run', 'split', 'whip', 'long', 'founder', 'restrict', 'shock', 'hard', 'shift', 'divide', 'wound', 'fear', 'occasion', 'bound', 'press', 'few', 'break', 'order'}


In [31]:
#Let's take the most common use based on info in the "defined" column
# We could also use the POS here

#define function to pull out percent
def get_digits(text):
    temp_num = ''.join(list(filter(str.isdigit, text)))
    if temp_num == '':
        temp_num = 100
    return temp_num

#create new column for digits - not strictly necessary since it's not relevant for all entries but fastest way
GI_frame['Percent'] = GI_frame['Defined'].map(lambda x: get_digits(x))

print(GI_frame.shape)

(11788, 188)
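
`get_digits` joins every digit found in the `Defined` string, so an entry that contains more than one number would produce a concatenated value. A minimal sketch of a stricter alternative, assuming the usage share is written as a number followed by `%`; the helper name `first_percent` and the sample strings are hypothetical and are not taken from the dictionary file:

```python
import re

def first_percent(defined):
    """Return the first 'NN%' figure in a definition string, or 100 if there is none."""
    match = re.search(r"(\d{1,3})\s*%", str(defined))
    return int(match.group(1)) if match else 100

# Hypothetical 'Defined' strings, for illustration only
samples = [
    "| 74% noun: first sense",
    "| article: no percentage given",
    "| 60% verb: sense 2 of the word",
]
percents = [first_percent(s) for s in samples]
print(percents)
```

With this variant the third sample yields 60, whereas joining all digits would give 602.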


In [32]:
#pull out only the versions of terms used more than 65% of the time
GI_frame = GI_frame[GI_frame.Percent.astype(int) > 65]
print(GI_frame.shape)

(8866, 188)


In [33]:
GI_frame.head(10)

Unnamed: 0,Entry,Source,Positiv,Negativ,Pstv,Affil,Ngtv,Hostile,Strong,Power,...,PosAff,SureLw,If,NotLw,TimeSpc,FormLw,Othtags,Defined,NewEntry,Percent
0,A,H4Lvd,,,,,,,,,...,,,,,,,DET ART,| article: Indefinite singular article--some or any one,A,100
1,ABANDON,H4Lvd,,Negativ,,,Ngtv,,,,...,,,,,,,SUPV,|,ABANDON,100
2,ABANDONMENT,H4,,Negativ,,,,,,,...,,,,,,,Noun,|,ABANDONMENT,100
3,ABATE,H4Lvd,,Negativ,,,,,,,...,,,,,,,SUPV,|,ABATE,100
4,ABATEMENT,Lvd,,,,,,,,,...,,,,,,,Noun,,ABATEMENT,100
5,ABDICATE,H4,,Negativ,,,,,,,...,,,,,,,SUPV,|,ABDICATE,100
6,ABHOR,H4,,Negativ,,,,Hostile,,,...,,,,,,,SUPV,|,ABHOR,100
7,ABIDE,H4,Positiv,,,Affil,,,,,...,,,,,,,SUPV,|,ABIDE,100
8,ABILITY,H4Lvd,Positiv,,,,,,Strong,,...,,,,,,,Noun,,ABILITY,100
9,ABJECT,H4,,Negativ,,,,,,,...,,,,,,,Modif,|,ABJECT,100


In [34]:
GIlist = GI_frame['NewEntry'].tolist()
GIlist = list(map(lambda x: str(x).lower(), GIlist))
GIset = set(GIlist)
print("GI dictionary size: " + str(len(GIlist)) + " words of which " +str(len(GIset))+ " are unique.")

GI dictionary size: 8866 words of which 8254 are unique.


In [35]:
GIpos = GI_frame['NewEntry'][GI_frame['Positiv'].str.contains('Positiv')].tolist()
GIneg = GI_frame['NewEntry'][GI_frame['Negativ'].str.contains('Negativ')].tolist()
GIstrong = GI_frame['NewEntry'][GI_frame['Strong'].str.contains('Strong')].tolist()
GIweak = GI_frame['NewEntry'][GI_frame['Weak'].str.contains('Weak')].tolist()
GIpos = list(map(lambda x: str(x).lower(), GIpos))
GIneg = list(map(lambda x: str(x).lower(), GIneg))
GIstrong = list(map(lambda x: str(x).lower(), GIstrong))
GIweak = list(map(lambda x: str(x).lower(), GIweak))

In [36]:
print("Positive words: " + str(len(GIpos)))
print("Negative words: " + str(len(GIneg)))
print("Strong words: "+ str(len(GIstrong)))
print("Weak words: " + str(len(GIweak)))

Positive words: 1567
Negative words: 1947
Strong words: 1368
Weak words: 597


In [37]:
#The positive/negative duplicates have now been removed; a few strong/weak overlaps remain
print('Positive and Negative Duplicates:', set(GIpos).intersection(set(GIneg)))
print("")
print('Strong and Weak Duplicates:', set(GIstrong).intersection(set(GIweak)))

Positive and Negative Duplicates: set()

Strong and Weak Duplicates: {'can', 'support', 'split', 'excuse', 'founder', 'convict'}
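
Six words still carry both the Strong and Weak tags. Because `gi_sent` checks the strong list first, those words would always be doubled. One option, sketched here on toy lists rather than the real `GIstrong` and `GIweak`, is to drop the ambiguous words from both lists so that they keep the default scale of 1:

```python
# Toy stand-ins for GIstrong and GIweak (the real lists come from GI_frame)
strong = ["ability", "can", "support", "power"]
weak = ["afraid", "can", "support", "timid"]

# Words that carry both intensity tags
ambiguous = set(strong) & set(weak)

# Keep only words that carry a single intensity tag
strong_clean = [w for w in strong if w not in ambiguous]
weak_clean = [w for w in weak if w not in ambiguous]

print(sorted(ambiguous))
print(strong_clean, weak_clean)
```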


In [38]:
def gi_sent(inputstring, show = None):
    """Label a string Positive, Negative, or Neutral using the General Inquirer lists."""

    poscount = 0
    negcount = 0

    for word in inputstring.split():

        #create scalar for strong and weak words.  Strong words count double, weak words count half
        if GIstrong.count(word):
            scale = 2
            if show is not None:
                print("Strong: " + word)
        elif GIweak.count(word):
            scale = 0.5
            if show is not None:
                print("Weak: " + word)
        else:
            scale = 1

        if GIpos.count(word):
            if show is not None:
                print("Positive: " + word)
            poscount += 1*scale
        elif GIneg.count(word):
            if show is not None:
                print("Negative: " + word)
            negcount += 1*scale

    #net tone: -1 when every matched word is negative, 1 when every matched word is positive
    if poscount+negcount > 0:
        t = float((poscount - negcount)/(poscount+negcount))
    else:
        t = 0

    if t > 0:
        tone = "Positive"
    elif t < 0:
        tone = "Negative"
    else:
        tone = "Neutral"

    return tone
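
The tone score is the difference between the weighted positive and negative counts divided by their sum, so it always falls between -1 and 1. A small worked sketch with toy word sets (hypothetical, and much smaller than the General Inquirer lists) shows how the strong and weak scalars shift the result; `tone_score` is an illustrative name only:

```python
# Toy lexicons, for illustration only
pos = {"win", "best"}
neg = {"bad", "lose"}
strong = {"win"}
weak = {"bad"}

def tone_score(text):
    """Return (poscount, negcount, t) using the same weighting as gi_sent."""
    poscount = negcount = 0
    for word in text.split():
        # Strong words count double, weak words count half
        scale = 2 if word in strong else 0.5 if word in weak else 1
        if word in pos:
            poscount += scale
        elif word in neg:
            negcount += scale
    total = poscount + negcount
    t = (poscount - negcount) / total if total > 0 else 0
    return poscount, negcount, t

print(tone_score("a win despite a bad defense"))
```

Here "win" is strong and positive (weight 2) and "bad" is weak and negative (weight 0.5), so the score is (2 - 0.5) / (2 + 0.5) and the label would be Positive.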

In [147]:
tweet_frame['gi_sent'] = tweet_frame["TweetText"].apply(lambda x: gi_sent(x))

In [77]:
#First ten tweets with their General Inquirer labels
tweet_frame.iloc[0:10][['TweetText', 'gi_sent']]

Unnamed: 0,TweetText,gi_sent
0,RT @LWorthySports: Andy Reid keeping tabs on BYU #Chiefs https://t.co/tTBFaZdg7i,Neutral
1,Caption this photo of Patrick Mahomes and Travis Kelce. #Chiefs https://t.co/rB2xG4UdGV,Neutral
2,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral
3,"RT @ArrowheadPride: With nearly 2,000 votes…the biggest concern for #Chiefs fans is:\n\nVote: https://t.co/L1mxmgkuLf https://t.co/92Td8loa3w",Neutral
4,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral
5,REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral
6,RT @KC_Goddess29: I propose a new currency. Just dimes. #Chiefs #Mahomes https://t.co/KnuYJJRmbY,Neutral
7,"RT @thenoahdotson: More like “Hat-trick” Mahomes, amiright? #chiefs #patrickmahomes @RealMNchiefsfan @LockedOnChiefs @ArrowheadPride @Chie…",Neutral
8,@RealMNchiefsfan He will end up with the Pats opposite of Josh Gordon because #chiefs,Neutral
9,@ChiefVolFan20 #Chiefs by a billion,Neutral


In [78]:
tweet_frame.iloc[490:501][['TweetText', 'gi_sent']]

Unnamed: 0,TweetText,gi_sent
490,RT @KCMO: #CHIEFS!!!! Big win over the Steelers today! @PatrickMahomes5 was 🔥! 6 TD passes to lead the way. https://t.co/Uczj2OondS,Neutral
491,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Negative
492,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Neutral
493,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Neutral
494,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Positive
495,RT @JeremySickel: Pat Mahomes had more passes that resulted in a touchdown (6) than incompletions (5) today. #Chiefs,Neutral
496,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Neutral
497,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Negative
498,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Neutral
499,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Positive


### How do all three compare?

In [79]:
tweet_frame.iloc[0:10][['TweetText','afinn', 'hlsent','gi_sent']]

Unnamed: 0,TweetText,afinn,hlsent,gi_sent
0,RT @LWorthySports: Andy Reid keeping tabs on BYU #Chiefs https://t.co/tTBFaZdg7i,Neutral,Neutral,Neutral
1,Caption this photo of Patrick Mahomes and Travis Kelce. #Chiefs https://t.co/rB2xG4UdGV,Neutral,Neutral,Neutral
2,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral,Positive,Neutral
3,"RT @ArrowheadPride: With nearly 2,000 votes…the biggest concern for #Chiefs fans is:\n\nVote: https://t.co/L1mxmgkuLf https://t.co/92Td8loa3w",Neutral,Neutral,Neutral
4,RT @ArrowheadPride: REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral,Positive,Neutral
5,REPORT: #Chiefs work out former Oakland Raiders tight end https://t.co/PhfTgDoCDG,Neutral,Positive,Neutral
6,RT @KC_Goddess29: I propose a new currency. Just dimes. #Chiefs #Mahomes https://t.co/KnuYJJRmbY,Neutral,Neutral,Neutral
7,"RT @thenoahdotson: More like “Hat-trick” Mahomes, amiright? #chiefs #patrickmahomes @RealMNchiefsfan @LockedOnChiefs @ArrowheadPride @Chie…",Positive,Positive,Neutral
8,@RealMNchiefsfan He will end up with the Pats opposite of Josh Gordon because #chiefs,Neutral,Neutral,Neutral
9,@ChiefVolFan20 #Chiefs by a billion,Neutral,Neutral,Neutral


In [80]:
tweet_frame.iloc[490:501][['TweetText','afinn', 'hlsent','gi_sent']]

Unnamed: 0,TweetText,afinn,hlsent,gi_sent
490,RT @KCMO: #CHIEFS!!!! Big win over the Steelers today! @PatrickMahomes5 was 🔥! 6 TD passes to lead the way. https://t.co/Uczj2OondS,Positive,Positive,Neutral
491,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral,Negative,Negative
492,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive,Positive,Neutral
493,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive,Neutral,Neutral
494,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive
495,RT @JeremySickel: Pat Mahomes had more passes that resulted in a touchdown (6) than incompletions (5) today. #Chiefs,Neutral,Neutral,Neutral
496,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive,Neutral,Neutral
497,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral,Negative,Negative
498,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive,Positive,Neutral
499,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive


# Task 3 

Consider one of the entries in your corpus that had a surprising label.  How would you change your analysis to get the “right” label? Show specific results. 

There were some surprising results from the sentiment analysis. Upon inspection, I found that many words commonly used in football may or may not carry sentiment in context, yet could be treated as negative or neutral words in the dictionaries. Some of these concerns were quelled when I found that words such as "against" and "division", which appear often in the tweet text in a football context, are not included in the sentiment dictionaries. One tweet praised the number of touchdown passes by the Chiefs quarterback while claiming that the Chiefs defense was bad. While this tweet should be labeled as neutral or even positive, it was labeled as negative by two of the three dictionaries because it contains the word "bad", and "touchdown" has no sentiment attached to it. I am going to use a custom dictionary to replace "touchdown" and "TD" with "success" so that the dictionaries will register those words as positive.

We will also see how amplifying and negating certain words affects the sentiment labels.

#### Create dictionary for replacement

In [139]:
import re

football_dict = {'touchdown':'success', 'TD':'success'}


def multiple_replace(replacements, text): 

  """ Replace in 'text' all occurrences of any key in the given
  dictionary by its corresponding value.  Returns the new string.""" 
  text = str(text).lower()

  # Lowercase the keys as well; otherwise a key such as 'TD' can never match the lowercased text
  lookup = {key.lower(): value for key, value in replacements.items()}

  # Create a regular expression from the dictionary keys, matching whole words only
  regex = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, lookup.keys())))

  # For each match, look up the corresponding value in the dictionary
  return regex.sub(lambda mo: lookup[mo.group(0)], text)

In [140]:
tweet_frame['replaced_text'] = tweet_frame['TweetText'].apply(lambda x: multiple_replace(football_dict, x))
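
Plain substring replacement is case-sensitive and also matches inside longer tokens. A hedged sketch of a variant that matches whole words regardless of case and leaves the rest of the text untouched; `replace_terms` and the sample sentence are hypothetical and not part of the notebook:

```python
import re

def replace_terms(replacements, text):
    """Replace whole-word, case-insensitive matches of each key with its value."""
    lookup = {key.lower(): value for key, value in replacements.items()}
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, lookup)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda mo: lookup[mo.group(0).lower()], text)

terms = {"touchdown": "success", "TD": "success"}
print(replace_terms(terms, "6 TD passes and one touchdown, no outdated stats"))
```

The word boundaries keep the "td" inside "outdated" from being replaced, while both "TD" and "touchdown" are mapped to "success".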

---

#### Now apply sentiment dictionaries

In [150]:
tweet_frame['afinn_rep'] = tweet_frame["replaced_text"].apply(lambda x: afinn_sent(x))
tweet_frame['hlsent_rep'] = tweet_frame["replaced_text"].apply(lambda x: hl_sent(x))
tweet_frame['gi_sent_rep'] = tweet_frame["replaced_text"].apply(lambda x: gi_sent(x))

##### Before Replacement

In [151]:
tweet_frame.iloc[494:496][['TweetText','afinn', 'hlsent','gi_sent']]

Unnamed: 0,TweetText,afinn,hlsent,gi_sent
494,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive
495,RT @JeremySickel: Pat Mahomes had more passes that resulted in a touchdown (6) than incompletions (5) today. #Chiefs,Neutral,Neutral,Neutral


##### After Replacement

In [153]:
tweet_frame.iloc[494:496][['replaced_text','afinn_rep', 'hlsent_rep', 'gi_sent_rep']]

Unnamed: 0,replaced_text,afinn_rep,hlsent_rep,gi_sent_rep
494,"rt @jeremysickel: pat mahomes is on pace to throw 80 success passes. as bad as the defense is, i'll take the over. #chiefs",Negative,Neutral,Positive
495,rt @jeremysickel: pat mahomes had more passes that resulted in a success (6) than incompletions (5) today. #chiefs,Positive,Positive,Positive


#### Results

Replacing "touchdown" with "success" did have an impact on the sentiment labels for these two tweets. It changed one negative label to neutral (hlsent for tweet 494) and three neutral labels to positive (all three dictionaries for tweet 495). Although "success" is not a perfect replacement for "touchdown", the example demonstrates how a domain-specific replacement can shift the sentiment assigned to a tweet.
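
A cross-tabulation gives a compact view of how labels moved. The sketch below uses only the two tweets shown above, reshaped into a toy frame with one row per tweet and dictionary; on the full `tweet_frame` the same call could be applied to each pair of columns such as `afinn` and `afinn_rep`:

```python
import pandas as pd

# Labels copied from the before and after tables for tweets 494 and 495
changes = pd.DataFrame(
    {
        "tweet": [494, 494, 494, 495, 495, 495],
        "dictionary": ["afinn", "hlsent", "gi_sent"] * 2,
        "before": ["Negative", "Negative", "Positive", "Neutral", "Neutral", "Neutral"],
        "after": ["Negative", "Neutral", "Positive", "Positive", "Positive", "Positive"],
    }
)

# Rows are the original labels, columns are the labels after replacement
table = pd.crosstab(changes["before"], changes["after"])
print(table)

n_changed = int((changes["before"] != changes["after"]).sum())
print("Labels changed:", n_changed)
```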

### Amplification and Negation 

In [81]:
#more complicated sentiment calculations
#create all the dictionaries just once

#amplification and negation words from qdap
negate = ["aint", "arent","cant", "couldnt" , "didnt" , "doesnt" ,"dont" ,"hasnt" , "isnt" ,"mightnt" , "mustnt" ,"neither" ,"never", "no" ,"nobody" , "nor", "not" , "shant", "shouldnt", "wasnt" , "werent" ,"wont", "wouldnt"]
amplify = ["acute" ,"acutely", "certain", "certainly" ,"colossal", "colossally","deep" , "deeply" , "definite","definitely" ,"enormous","enormously" , "extreme", "extremely" ,"great","greatly" ,"heavily", "heavy", "high","highly" ,"huge","hugely" , "immense", "immensely" ,"incalculable" ,"incalculably","massive", "massively", "more","particular" ,"particularly","purpose", "purposely", "quite" ,"real" ,"really","serious", "seriously", "severe","severely" ,"significant" ,"significantly","sure","surely" , "true" ,"truly" ,"vast" , "vastly" , "very"]
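
The negation list is written without apostrophes (`dont`, `isnt`), while tweets usually keep them, and the scoring functions compare tokens by exact match. A minimal normalization sketch, assuming tokens should be lowercased and stripped of apostrophes and surrounding punctuation before lookup; `normalize_token` is a hypothetical helper and `negate_words` is a shortened copy of the list above:

```python
import string

negate_words = {"dont", "isnt", "not", "never"}  # shortened copy of the list above

def normalize_token(token):
    """Lowercase a token, drop apostrophes, and trim surrounding punctuation."""
    token = token.lower().replace("'", "").replace("\u2019", "")
    return token.strip(string.punctuation)

sample = "Don't worry, the defense isn\u2019t BAD!"
tokens = [normalize_token(t) for t in sample.split()]
matched = [t for t in tokens if t in negate_words]
print(tokens)
print(matched)
```

Both the straight and the curly apostrophe are removed, so "Don't" and "isn’t" become `dont` and `isnt` and can be found in the list.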

In [82]:
def afinn_sent2(inputstring):
    
    sentcount = 0
    words = inputstring.split()

    for i, word in enumerate(words):
        #previous word, or an empty string for the first word
        prev = words[i-1] if i > 0 else ""

        if word in afinn:
            if (prev == 'no'):
                #flip the score and undo the score already added for 'no' itself, if it has one
                sentcount = sentcount - afinn[word] - afinn.get(prev, 0)
            elif (prev == 'not'):
                sentcount = sentcount - afinn[word]
            else:
                sentcount = sentcount + afinn[word]
    
    if (sentcount < 0):
        sentiment = 'Negative'
    elif (sentcount >0):
        sentiment = 'Positive'
    else:
        sentiment = 'Neutral'
    
    
    return sentiment

def hl_sent2(inputstring):

    poscount = 0
    negcount = 0
    i = 0


    for word in inputstring.split():
        if i > 0:
            prev = inputstring.split().pop(i-1)
        else:
            prev =""

        if HLpos.count(word):
            if negate.count(prev):
                negcount += 1
            elif amplify.count(prev):
                poscount +=2
            else: 
                poscount +=1
        elif HLneg.count(word):
            if negate.count(prev):
                poscount += 1
            elif amplify.count(prev):
                negcount +=2
            else:
                negcount +=1
        i+=1
    
    if poscount+negcount > 0:
        t = float((poscount - negcount)/(poscount+negcount))
        
    else:
        t = 0
    
    
    if t > 0:
        tone = "Positive"
    elif t < 0:
        tone = "Negative"
    else:
        tone = "Neutral"
    
    return tone

#General Inquirer scoring with negation and amplification


def gi_sent2(inputstring):

    poscount = 0
    negcount = 0
    i = 0


    for word in inputstring.split():
        if i > 0:
            prev = inputstring.split().pop(i-1)
        else:
            prev =""

        #create scalar for strong and weak words.  Strong words double, weak words add half
        if GIstrong.count(word):
            scale = 2
        elif GIweak.count(word):
            scale = 0.5
        else:
            scale = 1
            
        if GIpos.count(word):
            if negate.count(prev):
                negcount += 1*scale
            elif amplify.count(prev):
                poscount +=2*scale
            else: 
                poscount +=1*scale
        elif GIneg.count(word):
            if negate.count(prev):
                poscount += 1*scale
            elif amplify.count(prev):
                negcount +=2*scale
            else:
                negcount +=1*scale
            
        i+=1
    
    if poscount+negcount > 0:
        t = float((poscount - negcount)/(poscount+negcount))
        
    else:
        t = 0
    
    
    if t > 0:
        tone = "Positive"
    elif t < 0:
        tone = "Negative"
    else:
        tone = "Neutral"
    

    return tone
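
A worked example makes the negation and amplification rules easier to check by hand. This sketch repeats the logic on toy word sets (all hypothetical and far smaller than the real lexicons), looking one token back for a negator or an amplifier; `score_with_context` is an illustrative name only:

```python
# Toy lexicons, for illustration only
pos_words = {"good", "win"}
neg_words = {"bad"}
negators = {"not", "never"}
amplifiers = {"very", "really"}

def score_with_context(text):
    """Return (poscount, negcount), flipping after a negator and doubling after an amplifier."""
    words = text.split()
    poscount = negcount = 0
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        if word in pos_words:
            if prev in negators:
                negcount += 1
            elif prev in amplifiers:
                poscount += 2
            else:
                poscount += 1
        elif word in neg_words:
            if prev in negators:
                poscount += 1
            elif prev in amplifiers:
                negcount += 2
            else:
                negcount += 1
    return poscount, negcount

print(score_with_context("not good"))
print(score_with_context("really bad defense but a win"))
```

In the first string the negator moves "good" to the negative count. In the second, "really" doubles "bad" while "win" counts once.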

### How did Amplification and Negation affect our labels?

In [84]:
tweet_frame['afinn_sent2'] = tweet_frame["TweetText"].apply(lambda x: afinn_sent2(x))
tweet_frame['hl_sent2'] = tweet_frame["TweetText"].apply(lambda x: hl_sent2(x))
tweet_frame['gi_sent2'] = tweet_frame["TweetText"].apply(lambda x: gi_sent2(x))

In [88]:
tweet_frame.iloc[490:501][['TweetText','afinn_sent2', 'hl_sent2','gi_sent2']]

Unnamed: 0,TweetText,afinn_sent2,hl_sent2,gi_sent2
490,RT @KCMO: #CHIEFS!!!! Big win over the Steelers today! @PatrickMahomes5 was 🔥! 6 TD passes to lead the way. https://t.co/Uczj2OondS,Positive,Positive,Neutral
491,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral,Negative,Negative
492,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive,Positive,Neutral
493,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive,Neutral,Neutral
494,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive
495,RT @JeremySickel: Pat Mahomes had more passes that resulted in a touchdown (6) than incompletions (5) today. #Chiefs,Neutral,Neutral,Neutral
496,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive,Neutral,Neutral
497,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral,Negative,Negative
498,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive,Positive,Neutral
499,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive


#### And the original...

In [89]:
tweet_frame.iloc[490:501][['TweetText','afinn', 'hlsent','gi_sent']]

Unnamed: 0,TweetText,afinn,hlsent,gi_sent
490,RT @KCMO: #CHIEFS!!!! Big win over the Steelers today! @PatrickMahomes5 was 🔥! 6 TD passes to lead the way. https://t.co/Uczj2OondS,Positive,Positive,Neutral
491,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral,Negative,Negative
492,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive,Positive,Neutral
493,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive,Neutral,Neutral
494,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive
495,RT @JeremySickel: Pat Mahomes had more passes that resulted in a touchdown (6) than incompletions (5) today. #Chiefs,Neutral,Neutral,Neutral
496,RT @ClayWendler: Patrick Mahomes II is now 3-0 (1.000) when the opponent scores 24+ points. The best record in NFL history. #Chiefs,Positive,Neutral,Neutral
497,RT @JeremySickel: The #Chiefs went on the road the first two weeks to open the season against a division rival and a team that has owned th…,Neutral,Negative,Negative
498,"RT @ArrowheadPride: The #Chiefs are 2-0 to open the season, winning in Pittsburgh for the first time since 1986 (via @Arrowheadphones) http…",Positive,Positive,Neutral
499,"RT @JeremySickel: Pat Mahomes is on pace to throw 80 touchdown passes. As bad as the defense is, I'll take the over. #Chiefs",Negative,Negative,Positive


Adding this particular list of amplifying and negating words did not change any of the labels shown above.