<img src="https://datasciencedegree.wisconsin.edu/wp-content/themes/data-gulp/images/logo.svg" width="300">


# Lesson 13 Activity -- Using ```tweepy``` for Data Mining

This is an introduction to data collection from <a href="http://www.twitter.com/">Twitter</a> using the [`tweepy`](http://www.tweepy.org/) package.

---

## Getting set up -- things you do about once

### Install Tweepy

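You must install the `tweepy` package before using it, either from Anaconda (`conda install -c conda-forge tweepy`) or from the terminal on your computer (`pip install tweepy`).  The code in this activity uses the Tweepy 3.x interface (`api.search`, `tweepy.StreamListener`, `tweepy.TweepError`), which was renamed or removed in Tweepy 4.0, so install a 3.x release, for example with `pip install "tweepy<4"`.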

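### Make a Twitter account and an app

You'll need a Twitter account.  Once you have one, set up an app at <a href="https://apps.twitter.com/">apps.twitter.com</a>; the app is what provides the keys and tokens used below.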

### Save your credentials to an external file

Make a plain text file on your computer called `twitter_credentials.py`, and put it anywhere but this directory.  I put mine in my home directory for my user.  It will look something like this:

    con_key = 'your consumer key goes here'
    con_secret = 'your consumer secret goes here'
    acc_token = 'your access token goes here'
    acc_secret = 'your access secret goes here'
    
* Save your consumer key, consumer secret, access token, and access secret there.
* Don't share these secrets with others!  
* It's also possible to generate access tokens and secrets from within an app, but now's not the right time for this.

---

## Preliminaries to using tweepy -- things you do once per session

Repeat these steps every time you start a session.  If you close your notebook or restart the kernel, you have to do them again before you can use the Tweepy interface to the Twitter API.

#### 1. Gain access to the Tweepy library

Import it as you would any other Python library.

In [1]:
import tweepy

#### 2. Load your credentials from the external file

Run the plain-text Python source file that you saved somewhere else on your computer.

In [4]:
%run ~/twitter_credentials.py
# Adjust the path above to wherever you saved your credentials file.
# this cell will evaluate silently 🙊, and not print anything.  
# This is desired, because a person with your keys can act as you on Twitter in literally every way 😟

🔐 If you need to check whether one of the four variables, such as `con_key`, has the correct value, insert a cell and print the value, then delete the cell.  Keep your credentials secret and safe!
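If you are working outside a notebook, one way to load the `twitter_credentials.py` file described earlier is the standard library's `runpy` module.  This is a minimal sketch rather than part of the original activity: the helper name `load_credentials` is made up for illustration, and the demo writes a throwaway file with placeholder values so that the block runs as-is.

```python
import os
import runpy
import tempfile

# The four variable names used in twitter_credentials.py.
CREDENTIAL_NAMES = ("con_key", "con_secret", "acc_token", "acc_secret")

def load_credentials(path):
    """Run the credentials file and return its four values as a dict.

    Hypothetical helper for illustration; raises KeyError if a name is missing.
    """
    namespace = runpy.run_path(os.path.expanduser(path))
    return {name: namespace[name] for name in CREDENTIAL_NAMES}

# Demo with a temporary file holding placeholder values (not real keys).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "twitter_credentials.py")
    with open(demo_path, "w") as f:
        f.write("con_key = 'KEY'\ncon_secret = 'SECRET'\n"
                "acc_token = 'TOKEN'\nacc_secret = 'TOKEN_SECRET'\n")
    creds = load_credentials(demo_path)

# Print only the names, never the values.
print(sorted(creds))
```

With a real file you would call something like `load_credentials('~/twitter_credentials.py')` and pass the four values to the authentication step.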

#### 3. Make an `API` object

The `tweepy.API` object handles construction of the Twitter API calls for you.  It's a convenience layer, but it's really dang convenient!

In [15]:
#Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=con_key, consumer_secret=con_secret)
auth.set_access_token(acc_token, acc_secret)
# auth = tweepy.AppAuthHandler(con_key, con_secret)

#Connect to the Twitter API using the authentication
# api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
api = tweepy.API(auth)

## Using the API

Twitter offers two kinds of API:
* The [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) [API](https://en.wikipedia.org/wiki/Application_programming_interface) allows you to _pull_ information from Twitter, or _push_ information back to Twitter.  For example,  
  💡 if I wanted to have a Python script that ran as a CRON job to automatically tweet for me under certain conditions, I would use the REST API.
* The Streaming API allows us to monitor Twitter in real time, grabbing tweets as they are made.  For example,  
  💡 if I wanted to make a little device powered by a Raspberry Pi that showed interesting tweets in real time on a tiny screen by my desk, I would use the streaming API.

### Method 1. The REST API

The REST API allows you to _pull_ information from Twitter, or _push_ information back to Twitter.  We'll use the REST API to run a specific search.  You could also use the REST API to make automatic tweets on Twitter, or get information about specific users.

In [8]:
#Use the REST API for a static search
#Our example finds recent tweets using the hashtag #climate or #environment

# tweet_list = api.search(q='%23datascience') #%23 is used to specify '#'
tweet_list = api.search(q = "%23climate OR %23environment", count = 5)

See [Twitter's search documentation](https://dev.twitter.com/rest/public/search) for examples of query operators.  Pay attention to how to URL encode your query.  [This w3schools page](https://www.w3schools.com/tags/ref_urlencode.asp) explains what `%23` and other URL encodings mean.
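To see where `%23` comes from, here is a small illustration using Python's standard `urllib.parse` module.  It only shows what percent-encoding does to a query string; it is not a statement about how Tweepy or Twitter process the query internally.

```python
from urllib.parse import quote, unquote

raw_query = "#climate OR #environment"

# quote() replaces reserved characters with %XX escapes:
# '#' becomes %23 and a space becomes %20.
encoded = quote(raw_query)
print(encoded)           # %23climate%20OR%20%23environment

# unquote() reverses the encoding.
print(unquote(encoded))  # #climate OR #environment
```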

The search returns a list containing one `Status` object for each tweet, full of data such as the language, the identity of the poster, etc.

In [80]:
#We can use the dir command to view a list of the attributes of each tweet
dir(tweet_list[0])

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_api',
 '_json',
 'author',
 'contributors',
 'coordinates',
 'created_at',
 'destroy',
 'entities',
 'favorite',
 'favorite_count',
 'favorited',
 'geo',
 'id',
 'id_str',
 'in_reply_to_screen_name',
 'in_reply_to_status_id',
 'in_reply_to_status_id_str',
 'in_reply_to_user_id',
 'in_reply_to_user_id_str',
 'is_quote_status',
 'lang',
 'metadata',
 'parse',
 'parse_list',
 'place',
 'possibly_sensitive',
 'quoted_status',
 'quoted_status_id',
 'quoted_status_id_str',
 'retweet',
 'retweet_count',
 'retweeted',
 'retweets',
 'source',
 'source_url',
 'text',
 'truncated',
 'user']

In [12]:
#Display the text of the first tweet
tweet_list[0].text

"Balck and yellow..... Doesn't it look nice?\n\n#TheVisuality #GuruShots #GuruShotsChallenge\n#autohash #nature #flora… https://t.co/2BlGDo7P3D"

In [94]:
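#Collect the user id of the author of each tweet we found.
userids = [tweet.user.id for tweet in tweet_list]
userids

#Look up the relationship between two accounts (here the ids of @ClimateHome and @CocaCola)
test = api.show_friendship(source_id = 43092107, target_id = 26787673)
test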

(Friendship(_api=<tweepy.api.API object at 0x10b22e940>, id=43092107, id_str='43092107', screen_name='ClimateHome', following=True, followed_by=False, live_following=False, following_received=None, following_requested=None, notifications_enabled=None, can_dm=True, blocking=None, blocked_by=None, muting=None, want_retweets=None, all_replies=None, marked_spam=None),
 Friendship(_api=<tweepy.api.API object at 0x10b22e940>, id=26787673, id_str='26787673', screen_name='CocaCola', following=False, followed_by=True, following_received=None, following_requested=None))

By default, the REST API returns 15 tweets per request.  We can ask for up to 100 by using the argument `count`, although Twitter may return fewer than we requested.

In [75]:
tweet_list = api.search(q='%23datascience', count = 100)
len(tweet_list)

95

If we want more than 100 tweets, we can use a *while* loop.  The `max_id` argument limits the results to tweets whose id is no larger than the value we give.  By passing one less than the id of the oldest tweet we've seen so far, each request returns only older tweets.

The `try/except/else` structure lets us fail gracefully in case the API search returns an error (e.g., if we run up against Twitter's rate limits).

In [28]:
num_needed = 200
tweet_list = []
last_id = None # id of the oldest tweet seen so far (None until the first request)
while len(tweet_list) < num_needed:
    # only restrict by max_id once we have seen at least one tweet
    extra_args = {} if last_id is None else {'max_id': str(last_id - 1)}
    try:
        new_tweets = api.search(q = '%23datascience', count = 100, **extra_args)
    except tweepy.TweepError as e:
        print("Error", e)
        break
    else:
        if not new_tweets:
            print("Could not find any more tweets!")
            break
        tweet_list.extend(new_tweets)
        last_id = new_tweets[-1].id

len(tweet_list)

291

Note that the free REST API restricts the number of tweets you can retrieve, and the dates: you may not be able to retrieve tweets that are more than a week old.  Pay attention to this restriction as you approach your final project topic!
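The paging logic can be tried out without touching Twitter at all.  The sketch below is an illustration under stated assumptions: `collect_pages` and `fake_search` are made-up names, and `fake_search` stands in for a call like `api.search`, returning dictionaries with an integer `id`, newest first.  Unlike the loop above, it shrinks the final request so that it never collects more than `num_needed` items.

```python
def collect_pages(search_page, num_needed, page_size=100):
    """Collect up to num_needed items by walking backwards with max_id.

    search_page(count, max_id) is a stand-in for a search call; it must
    return a list of items with an integer 'id', newest first.
    """
    collected = []
    max_id = None  # no upper bound on the first request
    while len(collected) < num_needed:
        count = min(page_size, num_needed - len(collected))
        page = search_page(count=count, max_id=max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        max_id = page[-1]["id"] - 1  # next time, ask only for strictly older items
    return collected

# Fake data source standing in for Twitter: 250 "tweets" with ids 250 down to 1.
FAKE_TWEETS = [{"id": i} for i in range(250, 0, -1)]

def fake_search(count, max_id=None):
    matching = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return matching[:count]

tweets = collect_pages(fake_search, num_needed=200)
print(len(tweets))                         # 200
print(tweets[0]["id"], tweets[-1]["id"])   # 250 51
```

Asking for more than exists (for example `num_needed=1000`) simply returns all 250 fake tweets, which mirrors the "whichever comes first" behaviour you want when the search window runs out.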

### Method 2. The Streaming API

The Streaming API allows us to monitor Twitter in real time, grabbing tweets as they are made.

The ```tweepy``` package includes a class called ```StreamListener``` which monitors Twitter for us.  However, by default StreamListener does nothing with the tweets it collects.

In this demonstration, we'll modify ```StreamListener``` to make a class that prints each tweet we're interested in to the screen.  Later, you may wish to create your own class which saves information from tweets to a file.

In [67]:
#We create a subclass of tweepy.StreamListener to add a response to on_status
class PrintingStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

    #disconnect the stream if we receive an error message indicating we are overloading Twitter
    def on_error(self, status_code):
        if status_code == 420:
            #returning False in on_error disconnects the stream
            return False

#A second listener that appends each raw message (a JSON string) to a file instead of printing it
class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super().__init__(api)
        self.count = 0 #number of messages written so far

    def on_data(self, data):
        with open('tweet_stream_v2.json', 'a') as file:
            file.write(data)
        self.count += 1
        print(self.count)

    #disconnect the stream if we receive an error message indicating we are overloading Twitter
    def on_error(self, status_code):
        print(status_code)
        if status_code == 420:
            #returning False in on_error disconnects the stream
            return False
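When saving tweets to a file, you often want only a few fields rather than the whole raw message.  The sketch below shows one way to do that processing step on its own, without a live stream.  The helper names `extract_fields` and `append_json_line` are made up for illustration, the sample message is invented, and the chosen fields (`id_str`, `created_at`, `user.screen_name`, `text`) are just one possible selection.

```python
import json
import os
import tempfile

def extract_fields(raw_data):
    """Parse one raw JSON message and keep a few fields; return None for non-tweets."""
    message = json.loads(raw_data)
    if "text" not in message:
        return None  # e.g. a limit or delete notice rather than a tweet
    return {
        "id_str": message.get("id_str"),
        "created_at": message.get("created_at"),
        "screen_name": message.get("user", {}).get("screen_name"),
        "text": message["text"],
    }

def append_json_line(path, record):
    """Append one record to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Demo with a made-up message shaped like a tweet.
sample = json.dumps({"id_str": "1", "created_at": "Tue Nov 06 23:00:00 +0000 2018",
                     "user": {"screen_name": "example_user"}, "text": "Go vote!"})
out_path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
record = extract_fields(sample)
append_json_line(out_path, record)
print(record["screen_name"], record["text"])  # example_user Go vote!
```

Inside a listener, the same two calls could go in `on_data`, skipping any message for which `extract_fields` returns `None`.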

Once we have created our subclass, we can set up our own Twitter stream.

In [68]:
#We create an instance of our new PrintingStreamListener class and attach it to an authenticated stream

my_stream_listener = PrintingStreamListener()
my_stream = tweepy.Stream(auth = api.auth, listener=my_stream_listener)
# dir(my_stream_listener)

We'll use the ```track``` parameter to look for tweets containing specific keywords.  You can read more about constructing searches with ```track``` in the <a href="https://dev.twitter.com/streaming/overview/request-parameters#track">Twitter streaming API documentation</a>.

In [69]:
# Now, we're ready to start streaming!  We'll look for new tweets which use any of the words in our query list.
# You can pause the display of tweets by interrupting the Python kernel (use the menu bar at the top)
# my_stream.filter(track=['data'], is_async=True)
query = ['vote', 'midterms', 'RockTheVote', 'democrat', 'republican']
my_stream.filter(track=query, is_async=True)

RT @Inside_Showbiz: #KathNiel (@bernardokath @imdanielpadilla) of La Luna Sangre is nominated as Favorite TV Loveteam! Retweet to vote!

#I…
RT @StephenAmell: Hey friendos — I can’t vote. I am a US resident but until I spend over 50% of my year in the States for 5 years I can’t a…
This is the first time that I vote for someone I really believe in. #ivoted #betofortexas
RT @GovMikeHuckabee: I hope my candidates win, but if not, I'll still cherish my right to vote. I won't scream at ppl in restaurants, scrat…
RT @wearepoweruk: MUSIC FANS! 💗 VOTE for your favourite right now! 

The winner 🥇 will have 1 whole hour of music played, this Sunday at 7p…
#maddow #TheVote #msnbc #KY06
RT @NathanHRubin: Long lines aren’t a sign of a healthy democracy. They’re a sign it’s too hard to vote. 

1) Election Day should be a holi…
RT @MMFlint: There is still time to vote! As long as u are in line by the time the polls close, you can vote! Go now! And to anyone who is…
RT @ACLU: BREAKING: Arizona county 

RT @SandraTXAS: #MAGA did you vote???

If not, go now!! Vote RED vote for Trump agenda 
Economy booming! Jobs! 
Stronger military!
Out of P…
RT @DeplorableLola: #IVoted 🔻Republican 🔻 #Arizona don’t forget to vote.
RT @JJprojectworld1: MAMA VOTE OPPORTUNITY!

Post here about which JJ Project song impacted your life and why. Use the tags #GOT7 and #MAMA…
RT @emmaladyrose: To voters still waiting at the polls: remember to STAY IN LINE. They have to let you vote #ElectionDay
RT @RealSaavedra: WATCH: Texas Poll Worker Tells Undercover Reporter They've Allowed 'Tons' Of DACA Recipients To Vote https://t.co/3sJ3IK7…
Me: hi would you like a democratic sample ballot?
Man: no sorry I don’t vote for baby killers
RT @AkilahObviously: Vote for the kids who would be old enough to vote Tuesday but got shot in fucking school because Republicans are addic…
RT @floodedpatekkob: I swear to god a nigga joined my ps4 party and said “y’all niggas vote I heard its hoes at the polls”
BREAKING NEWS: I’ve just 

RT @mitchellvii: Florida Exit Poll: Black Vote for GOP Up 6% Points from 2016 -- DeSantis and Scott Winning 14% of Black Vote https://t.co/…
RT @CollinRusty: I pray Republican candidates win across the board tonight, but if that doesn’t happen, I'll still love my country. 

I won…
RT @iamdevinwagner: Don’t vote Democratic.
Don’t vote Republican.

Vote for the PERSON or issue that aligns with your morals and values. Do…
RT @RealMarkKennedy: Stacy Abrams running for Governor in Georgia creates militia to confiscate your guns and bow to radical left ideas.

V…
RT @JoyAnnReid: So... this happened... #GAGOV https://t.co/p0OeXy3dfR
RT @RealMAGASteve: BREAKING: A new Project Veritas Undercover Video Shows A Texas Election Official Admitting, “There’s tons of Non-Citizen…
RT @drdesrochers: McGrath has around 60 percent of the vote in Fayette County with 59 percent of the county reporting. That's a 14,000 vote…
RT @Brad_S_Brewer: Prediction for tonight? Bad news for Democrats. 

#Vote #Election

RT @WSJPolitics: AP Votecast Survey:
Women voted Democrat by a 56%-38% margin.
Men voted Republican by a 49%-46% margin
https://t.co/68G0xP…
RT @RealSaavedra: WATCH: Texas Poll Worker Tells Undercover Reporter They've Allowed 'Tons' Of DACA Recipients To Vote https://t.co/3sJ3IK7…
May God be with us &amp; you are elected. Communism ain’t Fla. style.
RT @AndrewGillum: Everything is at stake here. Stay in line, Florida. VOTE. https://t.co/4ejgDh5QIq
RT @HuffPost: John McCain's former chief of staff is calling on Americans to vote Democrat in the midterm elections.

"The bigger the rebuk…
RT @AMarch4OurLives: WALK OUT TO VOTE!! https://t.co/vymSXcNsMk
RT @Hbobrow1Hbobrow: I’m a Canadian. 
I can’t vote in your election. 
Can I ask everyone on the fence to think about this. 

We, Canadians…
@cnnbrk Despite having PTSD from 2016, I will attempt to watch returns 🍿

#Midterms2018⁠ ⁠⁠#BeAVoter⁠ ⁠⁠#VoteToday… https://t.co/Otr6b8F1px
RT @tinyhandspb: #Florida #IWillVote @laurenbaer #FL18 #WaveCas

In [71]:
# Even if you pause the display of tweets, your stream is still connected to Twitter!
# To disconnect (for example, if you want to change which words you are searching for), 
# use the disconnect() function.

my_stream.disconnect()

---

## Suggestions for skills to learn

* Collect 1000 tweets matching a search, or all available in the current time window, whichever comes first.  The figure of 1000 is arbitrary.
* Extract just the fields you are most interested in from a search, and create a Pandas data frame
* Follow the graph of followers from a specific Twitter user
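
As a starting point for the second suggestion, the sketch below flattens tweet dictionaries into rows with a few chosen columns.  It is an illustration only: `tweet_to_row` is a made-up helper, the sample data is invented, and the column selection is arbitrary.  With real search results you might build the input from the `_json` attribute of each status, for example `[t._json for t in tweet_list]`.

```python
def tweet_to_row(tweet_json):
    """Flatten one tweet dictionary into a row with a few chosen columns."""
    user = tweet_json.get("user", {})
    return {
        "id_str": tweet_json.get("id_str"),
        "created_at": tweet_json.get("created_at"),
        "screen_name": user.get("screen_name"),
        "retweet_count": tweet_json.get("retweet_count", 0),
        "text": tweet_json.get("text", ""),
    }

# Made-up sample data shaped like tweet dictionaries.
sample_tweets = [
    {"id_str": "1", "created_at": "Tue Nov 06 23:00:00 +0000 2018",
     "user": {"screen_name": "user_a"}, "retweet_count": 3,
     "text": "First #datascience tweet"},
    {"id_str": "2", "created_at": "Tue Nov 06 23:05:00 +0000 2018",
     "user": {"screen_name": "user_b"}, "text": "Second tweet"},
]

rows = [tweet_to_row(t) for t in sample_tweets]
print(rows[0]["screen_name"], rows[1]["retweet_count"])  # user_a 0
```

A list of dictionaries like `rows` can be passed straight to `pandas.DataFrame(rows)`, which creates one column per key.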

---

## Useful resources and links

* [the structure of the Status object of Tweepy](https://gist.github.com/dev-techmoe/ef676cdd03ac47ac503e856282077bf2)
* [Tweet Data Dictionary](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)
* [Standard Operators](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators) -- premium operators cost money.
* [Twitter operators by product](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/operators-by-product) -- by product they mean *paid access level*
* [How to use Twitter’s Search REST API most effectively](https://www.karambelkar.info/2015/01/how-to-use-twitters-search-rest-api-most-effectively./)
* [Collecting Tweets with Tweepy](http://www.dealingdata.net/2016/07/23/PoGo-Series-Tweepy/)
