In [1]:
import json, io
import nltk
import pandas as pd

This notebook introduces dictionary-based methods for counting how often a category of words appears in text. The running example measures sentiment (using the [AFINN sentiment lexicon](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010)) in a time series of tweets.

Check out the following resources for available dictionaries. A few widely used ones are [VADER](https://github.com/cjhutto/vaderSentiment), the AFINN sentiment lexicon, [MPQA](https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/) (free for use with registration), [LIWC](http://liwc.wpengine.com) (commercial), and [EMPATH](https://github.com/Ejhfast/empath-client).

In [2]:
# Read a JSON file of tweets and return a list of (date, tokenized text) pairs
def read_tweets_from_json(filename):
    tweets=[]
    with io.open(filename, encoding="utf-8") as file:
        data=json.load(file)
        for tweet in data:
            # parse the timestamp so tweets can later be sliced by date
            created_at=tweet["created_at"]
            date = pd.to_datetime(created_at)
            # casual_tokenize is NLTK's tokenizer for tweet-style text
            text=tweet["text"]
            tokens=nltk.casual_tokenize(text)
            tweets.append((date, tokens))
    return tweets

In [3]:
# Take a list of (date, tokens) tweets and flag whether each tweet contains
# at least one (lowercased) term from the argument dictionary. Return a pandas
# DataFrame for easier slicing/plotting.
def dictionary_document_count(data, dictionary):
    counted=[]
    for date, tokens in data:
        # 1 if any token matches, else 0 (presence, not the number of matches)
        val=int(any(word.lower() in dictionary for word in tokens))
        counted.append((date, val))
    df=pd.DataFrame(counted, columns=['date','document frequency'])
    return df
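
The function above records only whether a tweet contains a dictionary term, not how many terms it contains. As a rough sketch of that difference, the hypothetical helper `presence_and_count` below (not part of the notebook) computes both values for one tokenized text. The tokens and the two-word dictionary are made up for illustration.

```python
def presence_and_count(tokens, dictionary):
    """Return (presence, count) of dictionary terms in one token list."""
    # Lowercase each token before the lookup, as dictionary_document_count does
    matches = sum(1 for word in tokens if word.lower() in dictionary)
    return int(matches > 0), matches


# Illustrative inputs only
toy_dictionary = {"wall", "border"}
toy_tokens = ["The", "Wall", "at", "the", "border", "!"]

presence, count = presence_and_count(toy_tokens, toy_dictionary)
print(presence, count)
```

Presence is less sensitive to tweet length, while the raw count rewards repeated mentions; which one is appropriate depends on the question being asked.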

In [4]:
tweets=read_tweets_from_json("trump_tweets.json")

Here we'll define our own "immigration" dictionary by selecting words that we hypothesize often appear in tweets about immigration.

In [5]:
immigration_dictionary=set(["wall", "border", "borders", "immigrants","immigration"])

In [6]:
counts=dictionary_document_count(tweets, immigration_dictionary)

In [8]:
print(counts)

                     date  document frequency
0     2019-01-19 14:09:37                   1
1     2019-01-19 13:50:09                   0
2     2019-01-19 12:51:30                   0
3     2019-01-19 12:11:38                   0
4     2019-01-19 11:29:21                   0
5     2019-01-19 03:24:49                   0
6     2019-01-19 03:22:44                   0
7     2019-01-19 03:22:29                   0
8     2019-01-19 03:02:50                   0
9     2019-01-19 02:23:46                   0
10    2019-01-19 02:18:18                   0
11    2019-01-19 02:18:14                   0
12    2019-01-19 02:14:44                   0
13    2019-01-19 02:14:11                   0
14    2019-01-19 01:52:47                   0
15    2019-01-19 01:50:03                   0
16    2019-01-19 01:46:08                   1
17    2019-01-19 01:44:59                   0
18    2019-01-19 01:42:51                   0
19    2019-01-19 01:16:18                   0
20    2019-01-19 01:14:30         
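
Since each row is a timestamped tweet, one way to turn these flags into a time series is to average them per calendar day. The sketch below assumes a DataFrame shaped like `counts`; `toy_counts` and `daily_share` are illustrative names, and the toy values are invented rather than taken from the tweet file.

```python
import pandas as pd

# Toy stand-in for the `counts` DataFrame (values are made up)
toy_counts = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-01-18 09:00:00",
        "2019-01-18 21:30:00",
        "2019-01-19 01:46:08",
        "2019-01-19 14:09:37",
    ]),
    "document frequency": [0, 1, 1, 1],
})


def daily_share(df):
    """Fraction of tweets per calendar day that matched the dictionary."""
    series = df.set_index("date")["document frequency"].sort_index()
    # Resample into daily bins and average the 0/1 flags within each bin
    return series.resample("D").mean()


daily = daily_share(toy_counts)
print(daily)
```

Days without any tweets appear as `NaN` in the resampled series, which keeps gaps visible when plotting.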

The AFINN dictionary is a sentiment lexicon in which words are rated on an integer affect scale from -5 (most negative) to +5 (most positive). Write a function `read_AFINN_dictionary` to read in this file and create two dictionaries like the one above: one for positive terms and one for negative terms. How did you decide the cutoff point for positive and negative?

In [9]:
def read_AFINN_dictionary(filename):
    positive=[]
    negative=[]
    
    # Your code here
    return set(positive), set(negative)

In [10]:
def read_AFINN_dictionary(filename):
    positive=[]
    negative=[]
    with open(filename) as file:
        for line in file:
            # each line has the form: term<TAB>integer score
            cols=line.rstrip().split("\t")
            word=cols[0]
            value=int(cols[1])
            # a cutoff of +/-2 leaves out weakly valenced terms (scores between -1 and 1)
            if value <= -2:
                negative.append(word)
            elif value >= 2:
                positive.append(word)
    
    return set(positive), set(negative)
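
The choice of cutoff determines how many weakly valenced words end up in each set. Below is a minimal sketch for trying several cutoffs, assuming the same tab-separated term/score layout. The function name `read_valence_lexicon`, its `cutoff` parameter, and the sample entries with their scores are all illustrative and should not be read as the contents of the AFINN file.

```python
import io

# Made-up sample in the term<TAB>score layout (scores are illustrative)
SAMPLE = "awful\t-3\nmessy\t-2\nmeh\t-1\nfine\t1\nsuperb\t3\n"


def read_valence_lexicon(lines, cutoff=2):
    """Split terms into (positive, negative) sets using a symmetric cutoff."""
    positive, negative = set(), set()
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        # Split on the last tab so multi-word terms stay intact
        word, value = line.rsplit("\t", 1)
        value = int(value)
        if value >= cutoff:
            positive.add(word)
        elif value <= -cutoff:
            negative.add(word)
    return positive, negative


for cutoff in (1, 2, 3):
    pos, neg = read_valence_lexicon(io.StringIO(SAMPLE), cutoff)
    print(cutoff, sorted(pos), sorted(neg))
```

A lower cutoff increases coverage at the cost of including mildly valenced words; a higher cutoff keeps only strong terms.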

In [11]:
positive, negative=read_AFINN_dictionary("AFINN-111.txt")

In [12]:
counts=dictionary_document_count(tweets, negative)

In [14]:
print(counts)

                     date  document frequency
0     2019-01-19 14:09:37                   0
1     2019-01-19 13:50:09                   0
2     2019-01-19 12:51:30                   1
3     2019-01-19 12:11:38                   1
4     2019-01-19 11:29:21                   1
5     2019-01-19 03:24:49                   1
6     2019-01-19 03:22:44                   0
7     2019-01-19 03:22:29                   0
8     2019-01-19 03:02:50                   1
9     2019-01-19 02:23:46                   1
10    2019-01-19 02:18:18                   1
11    2019-01-19 02:18:14                   0
12    2019-01-19 02:14:44                   0
13    2019-01-19 02:14:11                   0
14    2019-01-19 01:52:47                   0
15    2019-01-19 01:50:03                   1
16    2019-01-19 01:46:08                   0
17    2019-01-19 01:44:59                   1
18    2019-01-19 01:42:51                   0
19    2019-01-19 01:16:18                   1
20    2019-01-19 01:14:30         
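
Thresholding the lexicon into positive and negative sets discards the strength of each rating. An alternative, sketched here under the assumption that the lexicon is loaded as a word-to-score mapping, sums the scores of the matching tokens in each tweet. The helper `valence_score`, the toy lexicon, and the toy tweets are hypothetical.

```python
def valence_score(tokens, lexicon):
    """Sum the lexicon scores of all tokens; unknown tokens contribute 0."""
    return sum(lexicon.get(word.lower(), 0) for word in tokens)


# Illustrative word-to-score mapping and tokenized tweets (values are made up)
toy_lexicon = {"great": 3, "bad": -3, "disaster": -2}
toy_tweets = [
    ["A", "GREAT", "day", "!"],
    ["Bad", "deal", ",", "total", "disaster"],
    ["No", "comment"],
]

scores = [valence_score(tokens, toy_lexicon) for tokens in toy_tweets]
print(scores)
```

Unlike the 0/1 flag, this score lets positive and negative words within one tweet cancel each other out.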

###### EMPATH 

In [15]:
from empath import Empath #pip install empath

In [40]:
lexicon = Empath()
sentence = "she yelled at her son"

In [41]:
categ = lexicon.analyze(sentence, normalize=True)

In [42]:
print('Categories for the sentence: ' + sentence + ':')
for key, value in categ.items():
    if value != 0:
        print(key)

Categories for the sentence: she yelled at her son:
wedding
family
death
youth
children
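
`analyze` returns a dictionary that maps every category to a score, so the non-zero entries can also be ranked rather than just listed. The helper below is a sketch that works on any such dictionary. `top_categories` is a hypothetical name, and `toy_scores` contains invented values, not real Empath output.

```python
def top_categories(scores, n=3):
    """Return the n highest-scoring non-zero categories as (name, score) pairs."""
    nonzero = [(name, value) for name, value in scores.items() if value != 0]
    # Sort by score (descending), breaking ties alphabetically
    nonzero.sort(key=lambda item: (-item[1], item[0]))
    return nonzero[:n]


# Invented scores in the same {category: value} shape as `categ`
toy_scores = {"family": 0.4, "children": 0.2, "youth": 0.2, "money": 0.0}
print(top_categories(toy_scores, n=2))
```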


In [22]:
#available categories in empath
print(categ.keys())

dict_keys(['help', 'office', 'dance', 'money', 'wedding', 'domestic_work', 'sleep', 'medical_emergency', 'cold', 'hate', 'cheerfulness', 'aggression', 'occupation', 'envy', 'anticipation', 'family', 'vacation', 'crime', 'attractive', 'masculine', 'prison', 'health', 'pride', 'dispute', 'nervousness', 'government', 'weakness', 'horror', 'swearing_terms', 'leisure', 'suffering', 'royalty', 'wealthy', 'tourism', 'furniture', 'school', 'magic', 'beach', 'journalism', 'morning', 'banking', 'social_media', 'exercise', 'night', 'kill', 'blue_collar_job', 'art', 'ridicule', 'play', 'computer', 'college', 'optimism', 'stealing', 'real_estate', 'home', 'divine', 'sexual', 'fear', 'irritability', 'superhero', 'business', 'driving', 'pet', 'childish', 'cooking', 'exasperation', 'religion', 'hipster', 'internet', 'surprise', 'reading', 'worship', 'leader', 'independence', 'movement', 'body', 'noise', 'eating', 'medieval', 'zest', 'confusion', 'water', 'sports', 'death', 'healing', 'legend', 'heroic

In [23]:
# number of categories
len(categ.keys())

194

In [43]:
import vaderSentiment

In [44]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [70]:
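# `sentences` starts out empty, so this loop prints nothing until the list is filled below
sentences = []
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    vs = analyzer.polarity_scores(sentence)
    print("{:-<50} {}".format(sentence, str(vs)))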

In [71]:
import os
import tweepy

# Read the Twitter API credentials from environment variables rather than
# hard-coding them in the notebook (the variable names are placeholders).
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_SECRET = os.environ["TWITTER_ACCESS_SECRET"]
CONSUMER_KEY = os.environ["TWITTER_CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["TWITTER_CONSUMER_SECRET"]
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the API object to connect to Twitter with your credentials
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)

In [72]:
user = api.get_user("ProfFeynman")
tweets = api.user_timeline(screen_name = 'ProfFeynman', count = 5, include_rts = True)
for status in tweets:
    print(status.text)
    sentences.append(status.text)

FOUR Productivity FEYNMAN- strategies: 
i) Stop trying to know-it-all. 

ii) Don't worry about what others are thin… https://t.co/A1RdJGC9hG
Our responsibility is to do what we can, learn what we can, improve the solutions, and pass them on. https://t.co/Jn8SVW0ysN
Science is like sex: sometimes something useful comes out, but that is not the reason we are doing it.
If you cannot explain something in simple terms, you don't understand it. https://t.co/3LHPA2RuKA
"Science! It's just magic without the lies."


In [74]:
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<280} {}".format(sentence, str(vs)))

FOUR Productivity FEYNMAN- strategies: 
i) Stop trying to know-it-all. 

ii) Don't worry about what others are thin… https://t.co/A1RdJGC9hG-------------------------------------------------------------------------------------------------------------------------------------------- {'neg': 0.107, 'neu': 0.776, 'pos': 0.117, 'compound': 0.0531}
Our responsibility is to do what we can, learn what we can, improve the solutions, and pass them on. https://t.co/Jn8SVW0ysN------------------------------------------------------------------------------------------------------------------------------------------------------------ {'neg': 0.0, 'neu': 0.796, 'pos': 0.204, 'compound': 0.5574}
Science is like sex: sometimes something useful comes out, but that is not the reason we are doing it.---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- {'neg': 0.0, 'neu': 0.821, 'pos':
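
The `compound` value in each result is a single normalized score between -1 and 1. The VADER documentation suggests treating scores of 0.05 or above as positive and -0.05 or below as negative, with everything in between neutral. The sketch below applies that rule to the first two results printed above; the function name `label_from_compound` and its `threshold` parameter are hypothetical.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Score dicts copied from the first two lines of the output above
results = [
    {'neg': 0.107, 'neu': 0.776, 'pos': 0.117, 'compound': 0.0531},
    {'neg': 0.0, 'neu': 0.796, 'pos': 0.204, 'compound': 0.5574},
]

labels = [label_from_compound(r) for r in results]
print(labels)
```

The first tweet only barely clears the threshold, which is a reminder that labels near the cutoff are sensitive to the exact value chosen.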