# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

### Please Note:

This notebook likely looks a little different from the video content in the course. It has been modified to use Tweepy, which is generally an easier package to work with and makes the examples easier to understand. The old notebooks remain available on the course downloads page if desired, but they will not be regularly updated.

# Twitter API Access

To make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description, and use `http://google.com` for the website. Further instructions can be found in week 6 of the course.

Under **Keys and Access Tokens**, there are four primary identifiers you'll need to note:
* consumer key,
* consumer secret,
* access token, and
* access token secret (click **Create Access Token** to generate the access token and its secret).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.
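
Because these four identifiers are secrets, one option is to keep them out of the notebook and read them from environment variables. The following is a minimal sketch of that idea; the variable names and the `load_twitter_credentials` helper are hypothetical and not part of Tweepy or the course material.

```python
import os

# Hypothetical environment variable names; pick whatever names suit your setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "dummy-" + name.lower() for name in CREDENTIAL_VARS}
creds = load_twitter_credentials(demo_env)
print(sorted(creds))
```

In a real session you would call `load_twitter_credentials()` with no argument and pass the four values to the authorization step.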

Install the `tweepy` package to interface with the Twitter API:

In [45]:
# Install the package we will be using
!pip install tweepy



## Example 1. Authorizing an application to access Twitter account data

In [46]:
import tweepy

# Set up the keys and tokens (replace the placeholders with your own credentials)
c_k = "YOUR_CONSUMER_KEY"
c_s = "YOUR_CONSUMER_SECRET"

a_t = "YOUR_ACCESS_TOKEN"
a_s = "YOUR_ACCESS_TOKEN_SECRET"

auth = tweepy.OAuthHandler(c_k, c_s)
auth.set_access_token(a_t, a_s)
api = tweepy.API(auth)

# Nothing to see by displaying api except that it's now a
# defined variable

print(api)

<tweepy.api.API object at 0x0000024E5C1F35E0>


## Example 2. Retrieving trends

Twitter identifies locations using the Yahoo! Where On Earth ID (WOE ID).

The Yahoo! Where On Earth ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

You can also look at the BOSS PlaceFinder here: https://developer.yahoo.com/boss/placefinder/

To look up the WOE ID for an area, use:
https://www.findmecity.com/

In [47]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

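Look up the WOE ID for "San Diego" and assign it to `LOCAL_WOE_ID` below. Note that the value used in this run, 551801, returned German-language trends such as "Sieg" and "Genesung" in the recorded output, so it does not appear to be San Diego's ID; replace it with the ID you look up.

You can change this to any other location if you would like.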

In [55]:
LOCAL_WOE_ID=551801

# trends_place takes the WOE ID as its argument and returns a list holding
# one dict with the keys 'trends', 'as_of', 'created_at', and 'locations'.
# Note: Tweepy 4.0 renamed this method to get_place_trends.

world_trends = api.trends_place(WORLD_WOE_ID)
us_trends = api.trends_place(US_WOE_ID)
local_trends = api.trends_place(LOCAL_WOE_ID)

In [56]:
type(world_trends[0]['trends'])

list

In [57]:
world_trends[:2]

[{'trends': [{'name': '#TOTMCI',
    'url': 'http://twitter.com/search?q=%23TOTMCI',
    'promoted_content': None,
    'query': '%23TOTMCI',
    'tweet_volume': 30318},
   {'name': 'Mourinho',
    'url': 'http://twitter.com/search?q=Mourinho',
    'promoted_content': None,
    'query': 'Mourinho',
    'tweet_volume': 46992},
   {'name': 'Spurs',
    'url': 'http://twitter.com/search?q=Spurs',
    'promoted_content': None,
    'query': 'Spurs',
    'tweet_volume': 70128},
   {'name': '#COYS',
    'url': 'http://twitter.com/search?q=%23COYS',
    'promoted_content': None,
    'query': '%23COYS',
    'tweet_volume': 25114},
   {'name': 'Harry Kane',
    'url': 'http://twitter.com/search?q=%22Harry+Kane%22',
    'promoted_content': None,
    'query': '%22Harry+Kane%22',
    'tweet_volume': 13701},
   {'name': 'Tottenham',
    'url': 'http://twitter.com/search?q=Tottenham',
    'promoted_content': None,
    'query': 'Tottenham',
    'tweet_volume': 50332},
   {'name': '#Strictly',
    'url'

In [58]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])

<class 'list'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': '#RBSSTU', 'url': 'http://twitter.com/search?q=%23RBSSTU', 'promoted_content': None, 'query': '%23RBSSTU', 'tweet_volume': None}, {'name': '#ATPFinals2020', 'url': 'http://twitter.com/search?q=%23ATPFinals2020', 'promoted_content': None, 'query': '%23ATPFinals2020', 'tweet_volume': None}, {'name': '#zib1', 'url': 'http://twitter.com/search?q=%23zib1', 'promoted_content': None, 'query': '%23zib1', 'tweet_volume': None}, {'name': '#thiem', 'url': 'http://twitter.com/search?q=%23thiem', 'promoted_content': None, 'query': '%23thiem', 'tweet_volume': None}, {'name': 'Sieg', 'url': 'http://twitter.com/search?q=Sieg', 'promoted_content': None, 'query': 'Sieg', 'tweet_volume': None}, {'name': 'Genesung', 'url': 'http://twitter.com/search?q=Genesung', 'promoted_content': None, 'query': 'Genesung', 'tweet_volume': None}, {'name': 'verlauf', 'url': 'http://twitter.com/search?q=verlauf', 'promoted_content': None, 'query': 'ver
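
As the outputs above show, each response is a list holding one dict whose `'trends'` entry is a list of trend dicts, and `tweet_volume` can be `None`. The sketch below ranks trends by volume while tolerating missing volumes. The `rank_trends` helper is hypothetical, and `sample_response` is a hand-built stand-in that reuses a few values from the recorded outputs rather than a live API response.

```python
# Illustrative response with the same shape as the API output above;
# the entries are copied from the recorded outputs.
sample_response = [{
    "trends": [
        {"name": "#TOTMCI", "tweet_volume": 30318},
        {"name": "Mourinho", "tweet_volume": 46992},
        {"name": "#RBSSTU", "tweet_volume": None},
        {"name": "Spurs", "tweet_volume": 70128},
    ]
}]


def rank_trends(response):
    """Return (name, volume) pairs sorted by volume, unknown volumes last."""
    trends = response[0]["trends"]
    pairs = [(t["name"], t["tweet_volume"]) for t in trends]
    # Sort key: known volumes first (False < True), then largest volume first
    return sorted(pairs, key=lambda p: (p[1] is None, -(p[1] or 0)))


for name, volume in rank_trends(sample_response):
    print(name, volume)
```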

## Example 3. Displaying API responses as pretty-printed JSON

In [59]:
import json

print((json.dumps(us_trends[:2], indent=2)))

[
  {
    "trends": [
      {
        "name": "Justin Fields",
        "url": "http://twitter.com/search?q=%22Justin+Fields%22",
        "promoted_content": null,
        "query": "%22Justin+Fields%22",
        "tweet_volume": null
      },
      {
        "name": "Hayward",
        "url": "http://twitter.com/search?q=Hayward",
        "promoted_content": null,
        "query": "Hayward",
        "tweet_volume": 73016
      },
      {
        "name": "Indiana",
        "url": "http://twitter.com/search?q=Indiana",
        "promoted_content": null,
        "query": "Indiana",
        "tweet_volume": 39080
      },
      {
        "name": "Ohio State",
        "url": "http://twitter.com/search?q=%22Ohio+State%22",
        "promoted_content": null,
        "query": "%22Ohio+State%22",
        "tweet_volume": 18341
      },
      {
        "name": "Hornets",
        "url": "http://twitter.com/search?q=Hornets",
        "promoted_content": null,
        "query": "Hornets",
        "tweet_vo

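Some trend names, such as the Arabic ones in the world trends, contain non-ASCII characters. By default `json.dumps` writes these as `\uXXXX` escapes; passing `ensure_ascii=False` keeps them readable. A small sketch, using a hand-built `sample` dict with names copied from the recorded outputs:

```python
import json

# Trend names copied from the recorded world trends above
sample = {"trend_names": ["توتنهام", "مورينهو", "Tottenham"]}

# Default behaviour: non-ASCII characters become \uXXXX escape sequences
escaped = json.dumps(sample, indent=2)

# ensure_ascii=False keeps the original characters in the output
readable = json.dumps(sample, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data, so the choice only affects how the output looks.
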
## Example 4. Computing the intersection of two sets of trends

In [60]:
trends_set = {}
trends_set['world'] = {trend['name']
                       for trend in world_trends[0]['trends']}

trends_set['us'] = {trend['name']
                    for trend in us_trends[0]['trends']}

trends_set['san diego'] = {trend['name']
                           for trend in local_trends[0]['trends']}

In [61]:
for loc in ['world','us','san diego']:
    print(('-'*10,loc))
    print((','.join(trends_set[loc])))

('----------', 'world')
9ice,#bizevdeyiz,#TUBBO1MIL,Justin Fields,Man City,Buckeyes,السيتي,Hojbjerg,Danny Ainge,#Strictly,Nebraska,Maisie,baekhyun,توتنهام,Lennon,Guardiola,Stuart,Vandy,Avery Bradley,Hornets,Celta,Brian Williams,Manchester City,De Bruyne,Clemson,Dest,CHEGOU A PATROA,Mourinho,Mike Dean,#TOTMCI,Mahrez,Shaun Wade,Celtics,Ohio State,Aurier,Charlotte,#COYS,Michael Jordan,Indiana,Ederson,Tottenham,Spurs,Ndombele,Harry Kane,TRINTA DA RAI,Lo Celso,Batum,مورينهو,Tiago,Hayward
('----------', 'us')
#TUBBO1MIL,Rozier,Justin Fields,Man City,Buckeyes,#Huskers,Ruan,Danny Ainge,Heisman,Nebraska,Frost,Dom Perignon,Grantham,Olave,Trump Hotel,Trask,Vandy,Master Teague,Zeller,Avery Bradley,Hoosiers,Garrett Wilson,Hornets,Brian Williams,Penix,Clemson,Franks,Kemba,Mourinho,Crowder,Shaun Wade,#TOTMCI,Celtics,Arkansas,Mahrez,Ohio State,Charlotte,#COYS,Michael Jordan,Indiana,Tottenham,Spurs,Harry Kane,Lo Celso,#TwitterTurkeyDrive,Coastal Carolina,Batum,Gus Johnson,The Muppets,Hayward
('--------

In [62]:
print(( '='*10,'intersection of world and us'))
print((trends_set['world'].intersection(trends_set['us'])))

print(('='*10,'intersection of us and san-diego'))
print((trends_set['san diego'].intersection(trends_set['us'])))

{'#TUBBO1MIL', 'Justin Fields', 'Man City', 'Buckeyes', 'Danny Ainge', 'Nebraska', 'Vandy', 'Avery Bradley', 'Hornets', 'Brian Williams', 'Clemson', 'Mourinho', 'Shaun Wade', '#TOTMCI', 'Celtics', 'Mahrez', 'Ohio State', 'Charlotte', '#COYS', 'Michael Jordan', 'Indiana', 'Tottenham', 'Spurs', 'Harry Kane', 'Lo Celso', 'Batum', 'Hayward'}
set()
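
Intersection is only one of the set operations that are useful here. The sketch below uses small hand-picked subsets of the trend names recorded above (not the full sets) to show intersection, difference, and union with Python's set operators.

```python
# Small illustrative subsets of the trend names recorded above
world = {"Mourinho", "Spurs", "Hojbjerg", "Guardiola"}
us = {"Mourinho", "Spurs", "Heisman", "Hoosiers"}

common = world & us       # same as world.intersection(us)
only_world = world - us   # trending worldwide but not in the US
only_us = us - world      # trending in the US but not worldwide
either = world | us       # union of both sets

print(sorted(common))
print(sorted(only_world))
print(sorted(only_us))
```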


## Example 5. Collecting search results

Set the variable `q` to a trending topic, or anything else for that matter. The example query below was a trending topic when this content was being developed and is used throughout the remainder of this chapter.

In [63]:
# You can change this to whatever hashtag you want, but if the tag isn't
# popular enough you might not get back a lot of results
q = "Rudy"

number = 100

# Note: Tweepy 4.0 renamed api.search to api.search_tweets
search_results = tweepy.Cursor(api.search, q=q, lang="en").items(number)

# Cursor(...).items() returns a lazy iterator, so printing it shows only the object
print(search_results)

# We will collect the tweet text, the retweet count, and a
# "retweeted" flag derived from that count
tweets = []
retweeted = []
retweet_count = []

for tweet in search_results:
    tweets.append(tweet.text)
    retweet_count.append(tweet.retweet_count)
    # Mark the tweet as retweeted when its retweet count is above zero
    retweeted.append(tweet.retweet_count > 0)


#tweets

<tweepy.cursor.ItemIterator object at 0x0000024E5F6A57F0>


In [64]:
# Not necessary, but this does make the data look pretty
import pandas as pd

df = pd.DataFrame({'Tweet':tweets, 'Retweeted':retweeted, "Retweet Count":retweet_count})

df

Unnamed: 0,Tweet,Retweeted,Retweet Count
0,RT @WickedExhausted: Officials at trump’s Just...,True,1
1,@80sAging In the battle of the 4th Bee Gee v t...,False,0
2,RT @emptywheel: Reupping from earlier in the w...,True,41
3,"@joannUSA1 @Acyn Don't worry, trumps will say ...",False,0
4,Here's what Giuliani's former colleagues say a...,False,0
...,...,...,...
95,RT @Strandjunker: Bill Barr’s daughter works a...,True,2292
96,RT @PascrellforNJ: 🚨 I’ve just filed legal com...,True,19176
97,RT @politvidchannel: BREAKING: A bar Complaint...,True,1758
98,RT @Strandjunker: Bill Barr’s daughter works a...,True,2292


Twitter often returns duplicate results. We can filter them out by checking for duplicate text:

In [65]:
seen_text = set()
filtered_tweets = []
for t in tweets:
    # Keep only the first occurrence of each tweet text
    if t not in seen_text:
        filtered_tweets.append(t)
        seen_text.add(t)
#filtered_tweets
filtered_tweets[0]

'RT @WickedExhausted: Officials at trump’s Justice Department have no interest in pursuing rudy giuliani’s allegations of voter fraud. https…'

In [66]:
# This gives us the number of unique tweets in our search results
print(len(filtered_tweets))
if len(filtered_tweets) < len(tweets):
    print("There were duplicates in our search results!")

59
There were duplicates in our search results!
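
The same order-preserving de-duplication can be written more compactly with `dict.fromkeys`. This is a sketch of an alternative, not the approach used in the cells above; `unique_in_order` is a hypothetical helper and `sample_tweets` is made-up data.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    # dict keys are unique and, since Python 3.7, keep insertion order
    return list(dict.fromkeys(items))


# Made-up tweet texts with two repeats
sample_tweets = [
    "RT @a: hello",
    "original tweet",
    "RT @a: hello",
    "original tweet",
    "another",
]
unique = unique_in_order(sample_tweets)
print(len(sample_tweets), len(unique))
```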


## Example 6. Creating a basic frequency distribution from the words in tweets

In [67]:
from collections import Counter

words = []

for t in tweets:
    for word in t.split():
        words.append(word)
        
c = Counter(words)
c.most_common(10)

[('RT', 76),
 ('Rudy', 67),
 ('and', 52),
 ('the', 42),
 ('to', 39),
 ('at', 37),
 ('Giuliani', 34),
 ('Bill', 22),
 ('Trump’s', 20),
 ('Giuliani’s', 19)]
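
The counts above are dominated by `RT` and common words such as "and" and "the", and `Rudy` and `rudy` would be counted separately. One possible refinement, sketched below under the assumption that a tiny hand-written stopword list is good enough, is to lowercase each word, strip surrounding punctuation, and skip stopwords before counting. `STOPWORDS`, `normalized_words`, and the sample texts are all hypothetical.

```python
from collections import Counter
import string

# Hypothetical, deliberately tiny stopword list for illustration
STOPWORDS = {"rt", "and", "the", "to", "at", "a", "of"}


def normalized_words(texts):
    """Yield lowercase words with surrounding punctuation stripped."""
    # Also strip the typographic apostrophe and ellipsis seen in tweet text
    strip_chars = string.punctuation + "’…"
    for text in texts:
        for token in text.split():
            # Note: this also removes the leading "@" from mentions
            word = token.strip(strip_chars).lower()
            if word and word not in STOPWORDS:
                yield word


# Made-up tweet texts
sample = ["RT @user: Rudy and the press", "rudy, to the press!"]
counts = Counter(normalized_words(sample))
print(counts.most_common(3))
```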

## Example 7. Create a prettyprint function to display tuples in a nice tabular format

In [68]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [69]:
for label, data in (('Word', words), 
                    ('Retweet_count', retweet_count)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common()[:10])


        Word         | Count 
****************************************
RT                   |     76
Rudy                 |     67
and                  |     52
the                  |     42
to                   |     39
at                   |     37
Giuliani             |     34
Bill                 |     22
Trump’s              |     20
Giuliani’s           |     19

   Retweet_count     | Count 
****************************************
                   0 |     25
                1758 |      9
                2291 |      6
               19175 |      4
                2290 |      4
                2292 |      4
                   2 |      3
                2289 |      3
                  17 |      3
                 121 |      3


## Example 8. Finding the most popular retweets

In [70]:
# Build a boolean mask that is True for rows whose Retweeted flag is set
filter1 = df['Retweeted'] == True

# where() keeps the matching rows and replaces every other row with NaN
rt_df = df.where(filter1)

# Drop the NaN rows so that only the retweeted tweets remain
rt_df = rt_df.dropna()

# The indices look odd because the original row labels are kept; the NaN
# fill also turned the flag and count columns into floats (1.0, 41.0)
rt_df.head(10)

Unnamed: 0,Tweet,Retweeted,Retweet Count
0,RT @WickedExhausted: Officials at trump’s Just...,1.0,1.0
2,RT @emptywheel: Reupping from earlier in the w...,1.0,41.0
5,RT @politvidchannel: BREAKING: A bar Complaint...,1.0,1758.0
6,RT @ResitsTrump: @thehill Robert DiNero is an ...,1.0,2.0
7,RT @Strandjunker: Bill Barr’s daughter works a...,1.0,2289.0
8,RT @Strandjunker: Bill Barr’s daughter works a...,1.0,2289.0
9,RT @RichardAngwin: I nominate Rudy Giuliani to...,1.0,208.0
10,RT @mileskahn: I am no longer impressed that S...,1.0,14036.0
11,RT @KimMangone: Rudy Giuliani should be disbar...,1.0,1122.0
12,"RT @noradunn: No, Venezuelan communists did no...",1.0,17.0
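
In the output above, `Retweeted` and `Retweet Count` appear as floats because `where()` fills the non-matching rows with NaN before they are dropped. Selecting rows with a boolean mask avoids that intermediate step. The sketch below shows this alternative on a made-up `sample_df` with the same column names; it assumes pandas is installed.

```python
import pandas as pd

# Made-up rows with the same columns as df above
sample_df = pd.DataFrame({
    "Tweet": ["RT @a: one", "two", "RT @b: three"],
    "Retweeted": [True, False, True],
    "Retweet Count": [41, 0, 1758],
})

# Boolean indexing selects the matching rows directly, so no NaN rows are
# created and the bool/int dtypes are left unchanged
rt_sample = sample_df[sample_df["Retweeted"]]

# Optionally renumber the rows from 0
rt_sample = rt_sample.reset_index(drop=True)

print(rt_sample.dtypes)
```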


We can sort this DataFrame in descending order of the number of retweets using `sort_values()`:

In [71]:
rt_df_sorted = rt_df.sort_values(by="Retweet Count", ascending=False)

rt_df_sorted.head(5)

Unnamed: 0,Tweet,Retweeted,Retweet Count
31,RT @realDonaldTrump: Giuliani: The Case for El...,1.0,37769.0
93,RT @PascrellforNJ: 🚨 I’ve just filed legal com...,1.0,19176.0
96,RT @PascrellforNJ: 🚨 I’ve just filed legal com...,1.0,19176.0
23,RT @PascrellforNJ: 🚨 I’ve just filed legal com...,1.0,19175.0
22,RT @PascrellforNJ: 🚨 I’ve just filed legal com...,1.0,19175.0


We can build another `prettyprint` function to print entire tweets alongside a count.

Ideally, the text of each tweet would also be split into up to 3 lines, if needed. The version below does not do this yet, so long tweets overflow the first column.

In [72]:
### Remember our prettyprint_counts function from above
### This version keeps the same body under a new name
def prettyprint_counts_modified(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [73]:
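rt_tweets = rt_df_sorted["Tweet"]
rt_re_count = rt_df_sorted["Retweet Count"]

# Counter tallies how often each value occurs, so the "Count" column is the
# number of times a tweet text (or a retweet count) appears in the results
for label, data in (('Tweet', rt_tweets), 
                    ('Retweet_count', rt_re_count)):
    
    c2 = Counter(data)
    prettyprint_counts_modified(label, c2.most_common(5))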


       Tweet         | Count 
****************************************
RT @Strandjunker: Bill Barr’s daughter works at Treasury to protect Trump’s tax returns, and loans at Deutsche Bank.

Rudy Giuliani’s son g… |     17
RT @politvidchannel: BREAKING: A bar Complaint has been filed against Rudy Giuliani and several other attorneys representing the Trump camp… |      9
RT @PascrellforNJ: 🚨 I’ve just filed legal complaints with the AZ, MI, NV, NY, and PA bars against Rudy Giuliani and 22 other lawyers seeki… |      6
RT @KimMangone: Rudy Giuliani should be disbarred. |      4
RT @TeaPainUSA: When you're too corrupt for Bill Barr.

https://t.co/94IppJ04nT |      3

   Retweet_count     | Count 
****************************************
              1758.0 |      9
              2291.0 |      6
             19175.0 |      4
              2292.0 |      4
              2290.0 |      4
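
The tweet table above prints each tweet on a single line, so long tweets and embedded newlines break the layout. One way to split tweet text into up to 3 lines is the standard library's `textwrap.wrap` with its `max_lines` argument. The sketch below is a possible variant, not the notebook's own function: `wrap_tweet` and `prettyprint_tweets` are hypothetical helpers, and the rows pair each tweet with its retweet count (copied from the outputs above) rather than with the number of times the text occurs.

```python
import textwrap


def wrap_tweet(text, width=40, max_lines=3):
    """Split text into at most max_lines lines of at most width characters."""
    # Collapse embedded newlines and repeated spaces before wrapping;
    # text that does not fit is cut off and marked with " [...]"
    flat = " ".join(text.split())
    return textwrap.wrap(flat, width=width, max_lines=max_lines,
                         placeholder=" [...]") or [""]


def prettyprint_tweets(rows, width=40):
    """Print (text, retweet_count) pairs with wrapped tweet text."""
    print("{:^{w}} | {:^8}".format("Tweet", "Retweets", w=width))
    print("*" * (width + 11))
    for text, count in rows:
        lines = wrap_tweet(text, width)
        # The count goes on the first line; continuation lines omit it
        print("{:{w}} | {:>8}".format(lines[0], count, w=width))
        for line in lines[1:]:
            print(line)


# Text and retweet counts copied from the outputs recorded above
rows = [
    ("RT @Strandjunker: Bill Barr’s daughter works at Treasury to protect "
     "Trump’s tax returns, and loans at Deutsche Bank.\n\n"
     "Rudy Giuliani’s son g…", 2292),
    ("RT @KimMangone: Rudy Giuliani should be disbarred.", 1122),
]
prettyprint_tweets(rows)
```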
