# Twitter data

<div class=note><b>Copyright and Licensing:</b>


You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.</div>

### Twitter API Access

Twitter implements OAuth 1.0A as its standard authentication mechanism, and in order to use it to make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website.

Under **Keys and Access Tokens**, there are four primary identifiers you'll need to note for an OAuth 1.0A workflow:
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (Click on Create Access Token to create those).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.
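
To give a feel for how the four identifiers are used, here is a minimal sketch of OAuth 1.0A request signing with HMAC-SHA1, written with the standard library only. The `twitter` package does this for you; the helper names `percent_encode` and `sign_request`, and every key, token, nonce, and timestamp below, are hypothetical placeholders chosen for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    # OAuth 1.0A uses RFC 3986 encoding: only unreserved characters stay as-is.
    return quote(str(value), safe='-._~')


def sign_request(method, url, params, consumer_secret, token_secret):
    """Return the base64 HMAC-SHA1 signature for one request (illustrative)."""
    # 1. Encode the parameters, sort them, and join them as key=value pairs.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = '&'.join('{}={}'.format(k, v) for k, v in encoded)
    # 2. Build the signature base string from method, URL, and parameters.
    base_string = '&'.join([method.upper(), percent_encode(url), percent_encode(param_string)])
    # 3. The signing key combines the consumer secret and the token secret.
    signing_key = '{}&{}'.format(percent_encode(consumer_secret), percent_encode(token_secret))
    digest = hmac.new(signing_key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Placeholder values only; none of these are real credentials.
example_params = {
    'oauth_consumer_key': 'hypothetical-consumer-key',
    'oauth_token': 'hypothetical-access-token',
    'oauth_nonce': 'abc123',
    'oauth_timestamp': '1500000000',
    'oauth_signature_method': 'HMAC-SHA1',
    'oauth_version': '1.0',
    'id': 1,
}
signature = sign_request('GET', 'https://api.twitter.com/1.1/trends/place.json',
                         example_params, 'hypothetical-consumer-secret',
                         'hypothetical-token-secret')
print(signature)
```

The consumer key and access token travel with the request, while the two secrets only enter the signing key. That is why the secrets must stay private.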

The first time you execute the notebook, add all credentials so that they are saved to the `pkl` file. Afterwards you can remove the secret keys from the notebook, because they will be loaded from the `pkl` file.

The `pkl` file contains sensitive information that can be used to take control of your Twitter account. **Do not share it**.

In [None]:
# %load ../_data/standard_import.txt

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import pickle
import os

# Note: in matplotlib 3.6 and later this style is named 'seaborn-v0_8-white'.
plt.style.use('seaborn-white')

In [None]:
file = '../_credentials/twitter_credentials.txt'
pickle_file = '../_credentials/twitter_credentials.pkl'

In [None]:
if not os.path.exists(pickle_file):
    Twitter={}
    Twitter['Consumer Key'] = 'p...6XH'
    Twitter['Consumer Secret'] = 'Y...4yR'
    Twitter['Access Token'] = '2...7eQ'
    Twitter['Access Token Secret'] = '1...w53'
    with open(pickle_file,'wb') as f:
        pickle.dump(Twitter, f)
else:
    # Use a context manager so the file handle is closed after loading.
    with open(pickle_file, 'rb') as f:
        Twitter = pickle.load(f)
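
The load-or-create logic above could also be wrapped in a small helper. The sketch below is one possible shape, not part of the original notebook: the function name `load_or_create_credentials` is hypothetical, the values are placeholders, and the demonstration writes to a temporary directory so that no real credentials file is touched.

```python
import os
import pickle
import tempfile


def load_or_create_credentials(path, defaults):
    """Load credentials from `path`, creating the file from `defaults` if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    with open(path, 'wb') as f:
        pickle.dump(defaults, f)
    # Restrict the file to the current user (has little effect on Windows).
    os.chmod(path, 0o600)
    return defaults


# Demonstration with placeholder values in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo_credentials.pkl')
placeholder = {'Consumer Key': 'placeholder', 'Consumer Secret': 'placeholder'}

first = load_or_create_credentials(demo_path, placeholder)   # writes the file
second = load_or_create_credentials(demo_path, {})           # reads it back
print(second == placeholder)
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.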

Install the `twitter` package to interface with the Twitter API.

In [None]:
# !pip install twitter

### Authorizing an application to access Twitter account data

In [None]:
import twitter

auth = twitter.oauth.OAuth(Twitter['Access Token'],
                           Twitter['Access Token Secret'],
                           Twitter['Consumer Key'],
                           Twitter['Consumer Secret'])

twitter_api = twitter.Twitter(auth=auth)

# Nothing to see by displaying twitter_api except that it's now a
# defined variable

print(twitter_api)

### Retrieving trends

Twitter identifies locations using the __Yahoo! Where On Earth ID__.

The Yahoo! Where On Earth ID for the entire world is 1.  
[Find your WOE ID](http://www.woeidlookup.com)

In [None]:
WORLD_WOE_ID = 1
NL_WOE_ID = 23424909
US_WOE_ID = 23424977
LOCAL_WOE_ID = 727232 # Amsterdam, NL

# Too local to get tweets
BB_WOE_ID = 727407    # Baambrugge - De Ronde Venen, NL
ABC_WOE_ID = 727050   # Abcoude - De Ronde Venen, NL

In [None]:
# Prefix ID with the underscore for query string parameterization.
# Without the underscore, the twitter package appends the ID value
# to the URL itself as a special case keyword argument.

world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
nl_trends = twitter_api.trends.place(_id=NL_WOE_ID)
local_trends = twitter_api.trends.place(_id=LOCAL_WOE_ID)

#### Traversing through the nested dictionary

In [None]:
local_trends
dict_local_trends = local_trends[0]

In [None]:
list(dict_local_trends.keys())
dict_local_trends['trends']
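
For readers without API access, the traversal can be tried on a hand-written sample. The structure below mirrors the general shape of a `trends/place` response (a list holding one dictionary), but every value in it is made up for illustration.

```python
# Hypothetical sample mirroring the shape of a trends/place response.
sample_trends = [
    {
        'trends': [
            {'name': '#ExampleTag', 'query': '%23ExampleTag', 'tweet_volume': 12345},
            {'name': 'Example Topic', 'query': '%22Example+Topic%22', 'tweet_volume': None},
        ],
        'as_of': '2017-01-01T00:00:00Z',
        'created_at': '2017-01-01T00:00:00Z',
        'locations': [{'name': 'Amsterdam', 'woeid': 727232}],
    }
]

result = sample_trends[0]                      # the list holds a single dictionary
location = result['locations'][0]['name']      # where the trends apply
names = [trend['name'] for trend in result['trends']]
print(location, names)
```

Note that `tweet_volume` can be missing (`None`) for some trends, which matters when sorting on that column.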

#### List trends

In [None]:
[x['name'] for x in local_trends[0]['trends']]

#### Display dictionary as dataframe

In [None]:
df_trends = pd.DataFrame(dict_local_trends['trends'])
df_trends.sort_values('tweet_volume', ascending=False).head(10)

#### Display as JSON

In [None]:
import json

print(json.dumps(local_trends[:2], indent=1))

### Computing the intersection of two sets of trends

In [None]:
# Collect the trend names per location as sets, so they can be compared.
trends_set = {}
trends_set['world'] = {trend['name'] for trend in world_trends[0]['trends']}
trends_set['nl'] = {trend['name'] for trend in nl_trends[0]['trends']}
trends_set['amsterdam'] = {trend['name'] for trend in local_trends[0]['trends']}

In [None]:
for loc, names in trends_set.items():
    print('\n------------ {} trends-----------\n'.format(loc))
    print(', '.join(names))

In [None]:
print('='*10 + '> World & NL\n')
print(trends_set['world'].intersection(trends_set['nl']))
print()
print('='*10 + '> NL & Amsterdam\n')
print(trends_set['nl'].intersection(trends_set['amsterdam']))
print()
print('='*10 + '> World & NL & Amsterdam\n')
print(trends_set['amsterdam'].intersection(trends_set['nl']).intersection(trends_set['world']))
print()
# A set difference (a - b) keeps the items of a that are not in b.
print('='*10 + '> World (NOT NL) \n')
print(trends_set['world'] - trends_set['nl'])
print()
print('='*10 + '> NL (NOT World) \n')
print(trends_set['nl'] - trends_set['world'])
print()
print('='*10 + '> Amsterdam (NOT NL) \n')
print(trends_set['amsterdam'] - trends_set['nl'])
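
The "NOT" comparisons can be written either as a symmetric difference (`^`) followed by an intersection, or directly as a set difference (`-`). The toy sets below, with made-up hashtags, illustrate that both forms give the same result.

```python
# Made-up trend names, standing in for real API results.
world = {'#a', '#b', '#c'}
nl = {'#b', '#c', '#d'}

common = world & nl        # trending in both
only_world = world - nl    # trending worldwide but not in NL
only_nl = nl - world       # trending in NL but not worldwide

# Symmetric difference followed by intersection is equivalent to a difference.
assert (nl ^ world) & world == only_world
assert (nl ^ world) & nl == only_nl
print(sorted(common), sorted(only_world), sorted(only_nl))
```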

### Collecting search results

Set the variable `q` to a trending topic, or anything else for that matter.
The code below picks one of the trends that is specific to Amsterdam;
the commented-out `#MTVAwards` was a trending topic when this content was being developed.
The search results are used throughout the remainder of this chapter.

[api docs](https://dev.twitter.com/docs/api/1.1/get/search/tweets)

In [None]:
only_local_trends = trends_set['amsterdam'] - trends_set['nl']

# Pick one Amsterdam-only trend; fall back to '#MTVAwards' if there is none.
q = next(iter(only_local_trends), '#MTVAwards')
number = 100

search_results = twitter_api.search.tweets(q=q, count=number)
statuses = search_results['statuses']
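
A single search call returns at most one page of results. Search responses in API v1.1 may include a `next_results` query string under `search_metadata`, which can be turned into keyword arguments for a follow-up call. The sketch below only shows the parsing step on a made-up `next_results` value; whether the key is present depends on the response, so real code should check for it first.

```python
from urllib.parse import parse_qsl

# Hypothetical value of search_results['search_metadata']['next_results'].
next_results = '?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1'

# Drop the leading '?' and decode the query string into keyword arguments.
kwargs = dict(parse_qsl(next_results.lstrip('?')))
print(kwargs)

# With a live connection, the next page could then be requested with
# twitter_api.search.tweets(**kwargs) and its statuses appended to the list.
```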

In [None]:
len(statuses)
statuses[0].keys()

In [None]:
[s['text'] for s in search_results['statuses']][:10]

#### Delete duplicate tweets
Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [None]:
# Track texts already seen in a set for fast membership tests.
seen_texts = set()
filtered_statuses = []
for s in statuses:
    if s["text"] not in seen_texts:
        filtered_statuses.append(s)
        seen_texts.add(s["text"])
statuses = filtered_statuses

In [None]:
len(statuses)
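
Comparing texts is one way to spot duplicates. An alternative, sketched below on made-up statuses, is to compare tweet IDs and to treat every retweet as a copy of the tweet it repeats. Whether retweets should be collapsed like this depends on the analysis, so take it as an illustration rather than a recommendation.

```python
# Hypothetical statuses; only the fields used here are included.
sample_statuses = [
    {'id': 1, 'text': 'Hello world'},
    {'id': 2, 'text': 'RT @alice: Hello world', 'retweeted_status': {'id': 1}},
    {'id': 3, 'text': 'RT @alice: Hello world', 'retweeted_status': {'id': 1}},
    {'id': 4, 'text': 'Something else'},
]

seen_ids = set()
unique_statuses = []
for s in sample_statuses:
    # A retweet is identified by the ID of the tweet it repeats.
    original_id = s.get('retweeted_status', {}).get('id', s['id'])
    if original_id not in seen_ids:
        seen_ids.add(original_id)
        unique_statuses.append(s)

print([s['id'] for s in unique_statuses])
```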

In [None]:
# Show one sample search result by indexing into the list...
print(json.dumps(statuses[0], indent=1))

#### Retweets

In [None]:
# Take the first status as a sample tweet t. A specific tweet could
# instead be selected by ID with a list comprehension, for example:
# t = [status for status in statuses
#      if status['id'] == 316948241264549888][0]
t = statuses[0]

# Explore the variable t to get familiarized with the data structure...
t['retweet_count']
t['retweeted']


### Extracting text, screen names, and hashtags from tweets

In [None]:
status_texts = [status['text'] for status in statuses]

screen_names = [user_mention['screen_name'] for status in statuses
                                            for user_mention in status['entities']['user_mentions']]

hashtags = [hashtag['text'].lower() for status in statuses
                            for hashtag in status['entities']['hashtags']]

# Compute a collection of all words from all tweets
words = [w.lower() for t in status_texts 
           for w in t.split()]

In [None]:
# Explore the first 5 items for each...
print('status text: ', json.dumps(status_texts[0:5], indent=1))
print('screen names: ', json.dumps(screen_names[0:5], indent=1)) 
print('hashtags: ', json.dumps(hashtags[0:5], indent=1))
print('words: ', json.dumps(words[0:5], indent=1))
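
The comprehensions above rely on the `entities` field of each status. The sketch below applies the same extraction to a single hand-written status, so that the nesting is easier to see. The status is hypothetical and trimmed to the fields used in this section.

```python
# Hypothetical status trimmed to the fields used in this section.
sample_status = {
    'text': 'Learning #Python with @example_user #DataScience',
    'entities': {
        'user_mentions': [{'screen_name': 'example_user'}],
        'hashtags': [{'text': 'Python'}, {'text': 'DataScience'}],
    },
}

mentions = [m['screen_name'] for m in sample_status['entities']['user_mentions']]
tags = [h['text'].lower() for h in sample_status['entities']['hashtags']]
tokens = [w.lower() for w in sample_status['text'].split()]
print(mentions, tags, tokens)
```

Entity texts come without the leading `#` or `@`, while tokens produced by `split()` keep them.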

### Basic frequency distribution from the words in tweets

In [None]:
from collections import Counter

for item in [words, screen_names, hashtags]:
    c = Counter(item)
    print('-'*80)
    print(c.most_common(10)) # top 10
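
Raw word counts tend to be dominated by tokens such as `rt` and by mentions. One possible refinement, sketched below on made-up tokens, is to filter those out before counting. The stop list is a deliberately tiny, hypothetical example; a real one would be much longer.

```python
from collections import Counter

# Made-up tokens, as produced by lowercasing and splitting tweet texts.
sample_words = ['rt', '@alice:', 'python', 'is', 'fun', 'rt',
                'python', '#python', 'is', 'python']
# Hypothetical, deliberately tiny stop list.
stop_words = {'rt', 'is'}

# Drop stop words, mentions, and hashtags before counting.
filtered = [w for w in sample_words
            if w not in stop_words and not w.startswith(('@', '#'))]
top_words = Counter(filtered).most_common(2)
print(top_words)
```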
    

In [None]:
pd.DataFrame(Counter(words).most_common(30), columns=['word', 'count']).set_index('word').head()

In [None]:
pd.DataFrame(Counter(screen_names).most_common(30), columns=['mentions', 'count']).set_index('mentions').head()

In [None]:
pd.DataFrame(Counter(hashtags).most_common(30), columns=['hashtags', 'count']).set_index('hashtags').head()

### Most popular retweets

In [None]:
retweets = [
            # Store out a tuple of these three values ...
            (status['retweet_count'], 
             status['retweeted_status']['user']['screen_name'],
             status['text'].replace("\n","\\")) 
            
            # ... for each status ...
            for status in statuses 
            
            # ... so long as the status meets this condition.
                if 'retweeted_status' in status
           ]
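
The same filter can be tried on a few hand-written statuses. In the sketch below all values are made up, newlines are replaced by a space for readability, and the tuples are sorted directly instead of going through a dataframe.

```python
# Hypothetical statuses; only retweets carry a 'retweeted_status' key.
sample_statuses = [
    {'text': 'RT @alice: Big news', 'retweet_count': 42,
     'retweeted_status': {'user': {'screen_name': 'alice'}}},
    {'text': 'An original tweet', 'retweet_count': 0},
    {'text': 'RT @bob: Bigger\nnews', 'retweet_count': 99,
     'retweeted_status': {'user': {'screen_name': 'bob'}}},
]

# Build (count, author, text) tuples for retweets only, most retweeted first.
sample_retweets = sorted(
    ((s['retweet_count'],
      s['retweeted_status']['user']['screen_name'],
      s['text'].replace('\n', ' '))
     for s in sample_statuses if 'retweeted_status' in s),
    reverse=True,
)
print(sample_retweets)
```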

In [None]:
df_retweets = pd.DataFrame(retweets, columns=['retweets', 'screen_name', 'text']).sort_values('retweets', ascending=False)

In [None]:
df_retweets.head()

In [None]:
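# Full text of the most retweeted status; .iloc selects by position,
# so this works regardless of which index labels survived the filtering.
df_retweets.text.iloc[0]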