# OAuth Exercise

In this exercise we will scrape Twitter data and run a tf-idf analysis on it (src-uwes twitter analysis). This requires OAuth authentication, and we will follow an approach similar to the one detailed in the Yelp analysis notebook.

In [12]:
import jsonpickle, operator,json
import numpy as np
import pandas as pd
import oauth2 as oauth
import urllib2 as urllib

We now need Twitter API access. The following steps, which are also available online, will help you set up your Twitter account and access the live 1% stream.

1. Create a Twitter account if you do not already have one.
2. Go to https://dev.twitter.com/apps and log in with your Twitter credentials.
3. Click "Create New App".
4. Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.
5. On the next page, click the "API Keys" tab along the top, then scroll all the way down until you see the section "Your Access Token".
6. Click the button "Create My Access Token". You can read more about OAuth authorization online.

Save api_key, api_secret, access_token_key, and access_token_secret in your vault directory and load them in the notebook as shown in the yelpSample notebook.
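The `twitterKeys` module itself is not shown in this notebook. A minimal sketch of what it could contain, assuming the four values are kept in a JSON file, is given below. The file name `twitterKeys.json`, the `path` argument, and the JSON layout are hypothetical; only the name `getkeys` and the order of the returned values come from the cell that follows.

```python
import json


def getkeys(path="twitterKeys.json"):
    """Return (api_key, api_secret, access_token_key, access_token_secret).

    Assumption: `path` points to a JSON file holding exactly these four keys.
    """
    with open(path) as handle:
        keys = json.load(handle)
    return (keys["api_key"], keys["api_secret"],
            keys["access_token_key"], keys["access_token_secret"])
```

Keeping the credentials in a file outside the notebook means the notebook can be shared without exposing them.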

In [13]:
import sys
sys.path.append('/Users/mgalarny/VaultDSE')
import twitterKeys
api_key,api_secret,access_token_key,access_token_secret=twitterKeys.getkeys()

_debug = 0

oauth_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"

http_handler  = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)
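For intuition about what an HMAC-SHA1 signature method computes, here is a rough standard-library sketch of the signing scheme described in the OAuth 1.0 specification. It is not the code of the `oauth2` library; the names `percent_encode` and `sign_hmac_sha1` are made up for illustration, and `params` is assumed to already contain the `oauth_*` fields (consumer key, nonce, timestamp, and so on) along with any query parameters.

```python
import base64
import hashlib
import hmac

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2


def percent_encode(value):
    # OAuth 1.0 leaves only letters, digits, '-', '.', '_' and '~' unescaped.
    return quote(str(value), safe="~")


def sign_hmac_sha1(method, url, params, consumer_secret, token_secret=""):
    # 1. Normalise the parameters: encode, sort, and join as key=value pairs.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join("{}={}".format(k, v) for k, v in encoded)
    # 2. Signature base string: METHOD & encoded URL & encoded parameter string.
    base_string = "&".join([method.upper(), percent_encode(url), percent_encode(param_string)])
    # 3. Signing key: encoded consumer secret and token secret joined by '&'.
    key = percent_encode(consumer_secret) + "&" + percent_encode(token_secret)
    digest = hmac.new(key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")
```

Because the secrets only enter through the signing key, the server can verify the request without the secrets ever being sent over the wire.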

The function below uses the credentials above to sign and open a Twitter stream request.

In [14]:
def getTwitterStream(url, method, parameters):
  # Build an OAuth request for the given HTTP method, URL, and parameters.
  req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                             token=oauth_token,
                                             http_method=method,
                                             http_url=url,
                                             parameters=parameters)

  # Sign the request with HMAC-SHA1 using the consumer and token secrets.
  req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)

  if method == "POST":
    # POST: the signed parameters travel in the request body.
    encoded_post_data = req.to_postdata()
  else:
    # GET: the signed parameters are appended to the URL.
    encoded_post_data = None
    url = req.to_url()

  opener = urllib.OpenerDirector()
  opener.add_handler(http_handler)
  opener.add_handler(https_handler)

  response = opener.open(url, encoded_post_data)

  return response

We can use the above function to request a response as follows:

In [15]:
# Test the function above on the sample stream provided by Twitter.
url = "https://stream.twitter.com/1/statuses/sample.json"
parameters = []
response = getTwitterStream(url, "GET", parameters)

Write a function that takes a URL and returns the top 10 lines returned by the Twitter stream.

**Note:** The response must be parsed carefully to extract the text of actual tweets. This can be done in a number of ways, and you are encouraged to try different approaches to parsing the response data.
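One possible parsing approach is sketched below, under the assumption that a streaming response arrives as one JSON object per line, mixed with blank keep-alive lines and non-tweet messages such as delete notices. The function name `extract_tweet_texts` and its `limit` parameter are illustrative and not part of the notebook.

```python
import json


def extract_tweet_texts(lines, limit=10):
    """Return the text of the first `limit` tweets found in an iterable of lines."""
    texts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # blank keep-alive line
        try:
            message = json.loads(raw)
        except ValueError:
            continue  # incomplete or malformed line
        # Only messages that carry a 'text' field are treated as tweets.
        if isinstance(message, dict) and "text" in message:
            texts.append(message["text"])
            if len(texts) == limit:
                break
    return texts
```

If the response object can be iterated line by line, it could be passed in directly, for example `extract_tweet_texts(response, limit=10)`. Stopping at `limit` matters for a live stream, which otherwise never ends.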

In [16]:
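def fetchData(url):
    '''Print the text of the first 10 tweets returned for a search URL.'''
    response = getTwitterStream(url, "GET", [])
    lines = response.read()
    allinfo = jsonpickle.loads(lines)
    statuses = allinfo['statuses']
    print 'Stream'
    # The search term is whatever follows "tweets.json?q=" (14 characters) in the URL.
    print url.split('/')[-1][14:]
    print '\n'
    for i, status in enumerate(statuses[:10]):
        try:
            print i+1
            print status['text'],'\n'
        except KeyError:
            # Skip entries that carry no tweet text.
            continue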

In [17]:
queries = ['UCSD', 'Donald Trump', 'Syria']

for query in queries:
    #We can also request twitter stream data for specific search parameters as follows
    url= "https://api.twitter.com/1.1/search/tweets.json?q=" + query
    fetchData(url)

Stream
UCSD


1
RT @Jazz88: RETWEET til 8am PT to ENTER Pair of Seats CONTEST!&gt; Mark Dresser Septet @TheLoftatUCSD 12/11&gt;https://t.co/DrQjjn7ubg https://t.… 

2
RT @Jazz88: RETWEET til 8am PT to ENTER Pair of Seats CONTEST!&gt; Mark Dresser Septet @TheLoftatUCSD 12/11&gt;https://t.co/DrQjjn7ubg https://t.… 

3
RT @Jazz88: RETWEET til 8am PT to ENTER Pair of Seats CONTEST!&gt; Mark Dresser Septet @TheLoftatUCSD 12/11&gt;https://t.co/DrQjjn7ubg https://t.… 

4
RT @NetworkFact: Spectral graph theory by Fan Chung https://t.co/6B6PfMAFfo 

5
RT @UCSDtritons: SWIM: @UCSDSwimDive🏊 dropped dual meets in Santa Barbara Sat. Up next is @a3performance Invite Nov. 19-21. https://t.co/8n… 

6
TAT helps woman with cancer have less stress, more joy and improved posture. https://t.co/xPrSppMOLT #acepblog #energypsych 

7
RT @Natalya_Gallo: "Who, if not us? When, if not now?" Hoping for an ambitious climate agreement here in Paris at #COP21 2 days left. #ucsd… 

8
RT @pasquale_rossi: @AdviseOnly s

Call the fetchData function to fetch the latest live stream data for the following search queries and output the first 5 lines:

1. "UCSD"
2. "Donald Trump"
3. "Syria"
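Queries that contain spaces or other reserved characters, such as "Donald Trump", are safest when URL-encoded before being appended to the search URL. A small sketch follows; the helper name `build_search_url` is hypothetical, and the endpoint is the one used in the cell above.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query):
    # safe="" makes quote() escape every reserved character, so a space becomes %20.
    return SEARCH_URL + "?q=" + quote(query, safe="")


urls = [build_search_url(q) for q in ["UCSD", "Donald Trump", "Syria"]]
```

With an encoded URL, the header printed by `fetchData` would show the encoded form of the query (for example `Donald%20Trump`), since it slices the query straight out of the URL.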

### TF-IDF

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is among the most commonly used statistical tools for word cloud analysis. You can read more about it online: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

We base our analysis on the following

1. The weight of a term that occurs in a document is simply proportional to the term frequency
2. The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs
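Written as formulas, and using the conventions of the code later in this notebook (raw counts over the whole file, a base-10 logarithm, and each line of the file treated as one document), the two points combine as:

```latex
\[
\mathrm{tf}(t) = \text{number of occurrences of } t \text{ in the file}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t) = \mathrm{tf}(t)\cdot\mathrm{idf}(t)
\]
\[
\text{Example for the term ``to'': } \quad
\mathrm{tf} = 8, \quad
\mathrm{idf} = \log_{10}\frac{28}{7} \approx 0.60206, \quad
\text{tf-idf} \approx 8 \times 0.60206 = 4.81648
\]
```

Here $N$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing $t$. The values $\mathrm{tf} = 8$, $\mathrm{idf} \approx 0.60206$, and the score $4.81648$ appear in the outputs printed further down; $N = 28$ and $\mathrm{df} = 7$ are not printed anywhere and are inferred from those IDF values.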

For this question we will perform a tf-idf analysis on the stream data we retrieve for a given search parameter. Perform the steps below:

1. Use the getTwitterStream function to search for the query "syria" and save the top 200 lines in the file twitterStream.txt.
2. Load the saved file and output the count of occurrences of each term. This will be your term frequency.
3. Calculate the inverse document frequency for each term in the output above.
4. Multiply the term frequency of each term by the corresponding inverse document frequency.
5. Sort the terms in descending order of their tf-idf scores.
6. Print the top 10 terms.
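Steps 2 to 6 can be sketched compactly on an in-memory list of documents before working with the saved file. This is a simplified illustration, not the notebook's solution: the function name `top_tfidf_terms` is made up, tokens are split on whitespace only, and no punctuation is stripped.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, k=10):
    """Return the k (term, score) pairs with the highest tf-idf score."""
    term_counts = Counter()  # tf: occurrences across all documents
    doc_counts = Counter()   # df: number of documents containing the term
    for doc in documents:
        words = doc.lower().split()
        term_counts.update(words)
        doc_counts.update(set(words))
    n_docs = len(documents)
    scores = {term: term_counts[term] * math.log10(float(n_docs) / doc_counts[term])
              for term in term_counts}
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

For example, `top_tfidf_terms(["syria war", "syria peace", "war war news"], k=2)` ranks "war" first, because it occurs three times while appearing in only two of the three documents.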

In [18]:
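#1. use the getTwitterStream function to search for the query "syria" and save the top 200 lines in the file twitterStream.txt
url = "https://api.twitter.com/1.1/search/tweets.json?q=" + "syria"
response = getTwitterStream(url, "GET", [])
statuses = json.loads(response.read())['statuses']

# Mode 'a' appends, so re-running this cell adds to any existing file.
writer = open('twitterStream.txt', 'a')
# Write up to 100 tweets, one per line, each followed by a blank line.
for status in statuses[:100]:
    try:
        writer.write(status['text'].replace('\n',' ')+'\n\n')
    except Exception:
        # Skip tweets that cannot be written (for example, text that fails to encode).
        continue
writer.close()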

print 'Twitter Stream file generated'

Twitter Stream file generated


In [19]:
#2. load the saved file and output the count of occurrences for each term. This will be your term frequency

def term_frequency(name):
    '''Term Frequency: count occurrences of each term in the file.'''
    char = '.,?"'
    text = open(name, 'r')
    line = text.read()
    text.close()
    count_dict = {}
    for word in line.lower().split():
        # Strip one trailing punctuation character, if present.
        if word[-1] in char:
            word = word[:-1]
        count_dict[word] = count_dict.get(word, 0) + 1
    return count_dict

name = 'twitterStream.txt'
# The function has its own name so that `tf` can hold the result and the cell can be re-run.
tf = term_frequency(name)

print 'Term Frequency:\n\n'
print tf

Term Frequency:


{'all': 1, 'signatures': 1, 'just': 1, 'https://t.co/7upxsezter': 1, 'https://t.co/peavqp2u8f': 1, 'over': 1, 'https://t.co/pwayryknaf': 1, 'vetted': 1, 'front': 1, 'its': 1, 'bombing': 1, 'death': 1, 'paris': 1, '@rt_com:': 1, 'torn': 1, 'https://t.co/m8eeen54td': 1, 'to': 8, 'only': 1, 'van': 1, 'policy': 1, 'has': 1, 'https://t.co/8hbyib7kbb': 1, 'get': 1, 'stop': 1, 'none': 1, '#iran': 2, '#iraq': 2, '#bataclan': 2, 'new': 1, 'not': 3, 'identified': 2, 'these': 1, '#gaza': 1, '@ramseyinho:': 1, 'ban': 1, 'river': 1, 'refugees': 1, 'side': 1, 'dis': 1, 'neighborhood': 1, '@teamtrump2016': 1, 'people': 2, 'idlib': 1, 'homs': 1, 'are': 1, 'escape': 1, 'girl': 1, 'guess': 1, 'https://t.co/u2pubd5lrf': 1, 'rt': 7, 'said': 1, '@manutd': 1, 'beaten': 1, '@syriasonline:': 1, 'weapons': 1, 'https://t.co/vjgd9nfwww': 1, 'got': 1, 'gov': 1, 'public': 1, 'be': 2, 'we': 2, 'iran': 1, 'bc': 1, '"poster': 1, '#russia': 1, 'https://t.co/mkoj4uisbz': 1, '#israel': 1, 'frenchman': 

In [20]:
#3. Calculate the inverse document frequency for each of the term in the output above.

def inverse_document_frequency(name):
    '''Inverse Document Frequency: log10(total lines / lines containing the term).'''
    char = '.,?"'
    docs = open(name, 'r')
    lines = docs.readlines()
    docs.close()

    # Every line of the file counts as one document, including the blank
    # separator lines written between tweets.
    tot_docs = len(lines)

    # Document count per term: each term counts at most once per line.
    count_dict = {}
    for line in lines:
        terms = set()
        for word in line.lower().split():
            # Strip one trailing punctuation character, as in the term-frequency step.
            if word[-1] in char:
                word = word[:-1]
            terms.add(word)
        for term in terms:
            count_dict[term] = count_dict.get(term, 0) + 1

    #IDF calculation
    for key in count_dict:
        count_dict[key] = np.log10(float(tot_docs) / float(count_dict[key]))

    return count_dict


name = 'twitterStream.txt'
# The function has its own name so that `idf` can hold the result and the cell can be re-run.
idf = inverse_document_frequency(name)
print 'Inverse Document Frequency:\n\n'
print idf

Inverse Document Frequency:


{'all': 1.4471580313422192, 'signatures': 1.4471580313422192, 'just': 1.4471580313422192, 'https://t.co/7upxsezter': 1.4471580313422192, 'https://t.co/peavqp2u8f': 1.4471580313422192, 'over': 1.4471580313422192, 'https://t.co/pwayryknaf': 1.4471580313422192, 'vetted': 1.4471580313422192, 'front': 1.4471580313422192, 'its': 1.4471580313422192, 'bombing': 1.4471580313422192, 'death': 1.4471580313422192, 'paris': 1.4471580313422192, '@rt_com:': 1.4471580313422192, 'torn': 1.4471580313422192, 'https://t.co/m8eeen54td': 1.4471580313422192, 'to': 0.6020599913279624, 'only': 1.4471580313422192, 'van': 1.4471580313422192, 'policy': 1.4471580313422192, 'has': 1.4471580313422192, 'https://t.co/8hbyib7kbb': 1.4471580313422192, 'get': 1.4471580313422192, 'stop': 1.4471580313422192, 'none': 1.4471580313422192, '#iran': 1.146128035678238, '#iraq': 1.146128035678238, '#bataclan': 1.146128035678238, 'new': 1.4471580313422192, 'not': 0.97003677662255683, 'identified': 1.14

In [21]:
#4. Multiply the term frequency for each of the term by corresponding inverse document frequency.

def compute_tfidf(tf_dict, idf_dict):
    '''Multiply each term's frequency by its inverse document frequency.'''
    tfidf_dict = {}
    for term in tf_dict.keys():
        tfidf_dict[term] = tf_dict[term] * idf_dict[term]
    return tfidf_dict

tfidf = compute_tfidf(tf, idf)
print 'Term Frequency - Inverse Document Frequency:\n\n'
tfidf

Term Frequency - Inverse Document Frequency:




{'"poster': 1.4471580313422192,
 '#auspol': 1.4471580313422192,
 '#bataclan': 2.2922560713564759,
 '#bds': 1.4471580313422192,
 '#eu': 1.4471580313422192,
 '#feedly': 1.4471580313422192,
 '#gaza': 1.4471580313422192,
 '#iran': 2.2922560713564759,
 '#iraq': 2.2922560713564759,
 '#israel': 1.4471580313422192,
 '#middleeast': 1.4471580313422192,
 '#palestine': 1.4471580313422192,
 '#russia': 1.4471580313422192,
 '#syria': 3.7409401350310016,
 '#uk': 1.4471580313422192,
 '#usa': 1.4471580313422192,
 '2(?)': 1.4471580313422192,
 '400k+': 1.4471580313422192,
 '@arnews1936:': 1.4471580313422192,
 '@iran_policy:': 1.4471580313422192,
 '@mackylucifera:': 1.4471580313422192,
 '@madblacktwink:': 1.4471580313422192,
 '@manutd': 1.4471580313422192,
 '@ramseyinho:': 1.4471580313422192,
 '@rt_com:': 1.4471580313422192,
 '@syriasonline:': 1.4471580313422192,
 '@teamtrump2016': 1.4471580313422192,
 'a': 3.8801471064902273,
 'about': 2.2922560713564759,
 'aerial': 1.4471580313422192,
 'against': 1.44715

In [25]:
#5. Sort the terms in descending order of their tf-idf scores

# sort_values replaces the older DataFrame.sort, which recent pandas versions no longer provide.
freqScore = pd.DataFrame(list(tfidf.items()), columns=['Term','TF-IDF']).sort_values(by='TF-IDF', ascending=False)

In [26]:
# top 10 terms
freqScore.head(10)

| | Term | TF-IDF |
|---|---|---|
| 159 | the | 5.820221 |
| 17 | to | 4.81648 |
| 75 | syria | 4.352544 |
| 49 | rt | 4.21442 |
| 127 | in | 4.014041 |
| 99 | war | 3.880147 |
| 150 | a | 3.880147 |
| 109 | #syria | 3.74094 |
| 103 | but | 3.380392 |
| 122 | is | 3.380392 |
