# Twitter and Python

This notebook shows how to access Twitter from Python. We will use

1. the relatively low-level `requests` library, which works with any web API
2. the easier-to-use `tweepy` library, which only works with Twitter

We will need to install some packages. From inside the directory containing `requirements.txt`, run
```bash
conda install -c conda-forge --file requirements.txt
```
or 
```bash
pip install -r requirements.txt
```

[Click here for Twitter Search API documentation.](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html)

In [1]:
import pprint
import pandas as pd
import requests 

## The requests library

We will be doing a lot of work with Twitter, but I don't want you to leave with the impression that `requests` only works with Twitter. It works with any web API, and we have already used it with our Flask apps. A request is a "low-level" network call; it does the same job as the command line tool `curl`, for example.

Before diving into connecting to Twitter (where we will need to authenticate with keys and tokens), here is an example using the [Star Wars API](http://swapi.co), which doesn't require authentication.

Let's start with curl from the command line (the `json_pp` just makes the output JSON pretty)

#### Basic takeaway
- `requests.get(url)`: make a (GET) request to a URL. Can get a webpage or a JSON object back. Returns a `response` object
- `response.json()`: access the JSON object returned (if there was one)

To get this to work with Twitter, we will need to authenticate ourselves. This is the job of OAuth.

## Instructions to connect to Twitter using requests

We need to identify ourselves to Twitter using
- a public and private key (the consumer key and its secret), which don't expire
- a public and private token (the access token and its secret), which do expire

If you are re-running this notebook tomorrow, you will need to get a new token from the Twitter page (but your keys will remain the same).

Follow the instructions [here](setup_twitter_instructions.md) to get your keys and tokens, and place them in `twitter_credentials.py`.

**The cell below won't work until you follow the instructions!**

In [2]:
# This is needed to authenticate us to Twitter

try:
    from requests_oauthlib import OAuth1
except ModuleNotFoundError:
    import sys

    # I need this because requests_oauthlib gets installed in a weird place on my system
    sys.path.append('/usr/local/lib/python3.6/site-packages')
    from requests_oauthlib import OAuth1

In [3]:
from twitter_credentials import credentials

oauth = OAuth1(credentials["TWITTER_CONSUMER_KEY"],
               credentials["TWITTER_CONSUMER_KEY_SECRET"],
               credentials["TWITTER_ACCESS_TOKEN"],
               credentials["TWITTER_ACCESS_TOKEN_SECRET"])

## Twitter search API (free version only searches the last week of tweets)

A detailed description of the Twitter search API can be found [here](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html). Some of the key parameters:

| Parameter | Notes | Example |
|---|---|---|
| q | (required) query string to search for | `@metis` |
| geocode | (optional) Uses tweet geolocation, or user's profile location if tweet geolocation disabled. Should be of the format `latitude,longitude,radius[unit]` (no spaces) where unit is either "km" or "mi" | `41.8781,-87.6298,5mi` |
| lang | (optional) Only return tweets in language given. Languages are coded by the two character code used in [ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | `en` `es` |
| count | (optional) Number of results to return. Defaults to 15, max value is 100 | `20` |

The API returns a JSON object with two keys:
- `search_metadata`: information about how long the search took, what was searched for, etc.
- `statuses`: the tweets that matched your query

Let's see it in action:

In [55]:
parameters = {"q": "@MatthewBerryTMR",
              "count": 100,  # the API returns at most 100 tweets per request
              #"geocode": "41.8781,-87.6298,10mi",
              "lang": "en",
              "include_entities": "True",
              "tweet_mode": "extended"
             }

response = requests.get("https://api.twitter.com/1.1/search/tweets.json",
                        params = parameters,
                        auth=oauth)

In [69]:
# Just look at one tweet (the third):
pprint.pprint(response.json()['statuses'][2])

{'contributors': None,
 'coordinates': None,
 'created_at': 'Tue Nov 13 21:36:53 +0000 2018',
 'display_text_range': [17, 76],
 'entities': {'hashtags': [],
              'symbols': [],
              'urls': [],
              'user_mentions': [{'id': 20899023,
                                 'id_str': '20899023',
                                 'indices': [0, 16],
                                 'name': 'Matthew Berry',
                                 'screen_name': 'MatthewBerryTMR'}]},
 'favorite_count': 1,
 'favorited': False,
 'full_text': '@MatthewBerryTMR Unless they drafted Connor in the handcuff '
              'like this guy 👍👍',
 'geo': None,
 'id': 1062459307080515584,
 'id_str': '1062459307080515584',
 'in_reply_to_screen_name': 'MatthewBerryTMR',
 'in_reply_to_status_id': 1062448993383387146,
 'in_reply_to_status_id_str': '1062448993383387146',
 'in_reply_to_user_id': 20899023,
 'in_reply_to_user_id_str': '20899023',
 'is_quote_status': False,
 'lang': 'en',
 'metadata
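The cells below call a helper named `tweet_to_string` whose definition does not appear in this notebook. The following is only a guess at what it might look like, reconstructed from the printed output further down. The field names (`screen_name`, `description`, `friends_count`, `followers_count`, `location`) are standard keys of the user object in the search response, but the exact layout and the sample tweet's user are assumptions.

```python
def tweet_to_string(tweet):
    """Summarize one tweet dict (as returned by the search API) as text."""
    # Extended mode stores the body under 'full_text'; the default mode uses 'text'
    body = tweet.get("full_text", tweet.get("text", ""))
    hashtags = [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]
    user = tweet.get("user", {})
    return f"""
        Tweet body: {body}
        Hashtags: {hashtags}
        Username: {user.get('screen_name', '')}
        Bio: {user.get('description', '')}
        Social status: {user.get('friends_count', 0)} friends, {user.get('followers_count', 0)} followers
        Location: {user.get('location', '')}
    """

# A hand-made tweet dict (hypothetical user) so the sketch runs without the API
sample_tweet = {
    "full_text": "@MatthewBerryTMR Guilty as charged",
    "entities": {"hashtags": []},
    "user": {"screen_name": "example_user", "description": "",
             "friends_count": 10, "followers_count": 5, "location": ""},
}
print(tweet_to_string(sample_tweet))
```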

In [67]:
# Ok, can we extract some of the info from this text?
tweets = response.json()['statuses']
print(tweet_to_string(tweets[0]))


        Tweet body: @CoreyCorbin78 @kevinkerbrat @MatthewBerryTMR yes all of them lol if you believe connor is a better player than bell then there’s no point in arguing
        Hashtags: []
        Username: SeanOneill_NJ
        Bio: Just sling it
        Social status: 139 friends, 102 followers
        Location: 
    


In [68]:
for tweet in tweets[:3]:
    print(tweet_to_string(tweet))


        Tweet body: @CoreyCorbin78 @kevinkerbrat @MatthewBerryTMR yes all of them lol if you believe connor is a better player than bell then there’s no point in arguing
        Hashtags: []
        Username: SeanOneill_NJ
        Bio: Just sling it
        Social status: 139 friends, 102 followers
        Location: 
    

        Tweet body: RT @MatthewBerryTMR: LeVeon goes into the fantasy football record book as the worst draft pick in the history of the game. https://t.co/Nag…
        Hashtags: []
        Username: canes415
        Bio: Canes 🙌🏻Heat 🔥Dolphins🐬
        Social status: 643 friends, 205 followers
        Location: Ft. Lauderdale Florida
    

        Tweet body: @MatthewBerryTMR Unless they drafted Connor in the handcuff like this guy 👍👍
        Hashtags: []
        Username: GregnotSteve
        Bio: 
        Social status: 233 friends, 29 followers
        Location: 
    


Did we pull all the tweets we asked for? (`count` is capped at 100 per request.)

In [47]:
print("Number of tweets = ", len(tweets))

Number of tweets =  100


We can pull the next set of tweets if we want (i.e. the next page of results).

In [62]:
response.json()['search_metadata']['next_results']+'&tweet_mode=extended'

'?max_id=1062463013284929536&q=%40MatthewBerryTMR&lang=en&count=100&include_entities=1&tweet_mode=extended'
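The `next_results` value is an ordinary URL query string, so instead of gluing strings together it can be parsed into a dictionary and handed back to `requests.get` through `params`. A minimal sketch using only the standard library; the helper name `next_page_params` is made up for illustration.

```python
from urllib.parse import parse_qsl

def next_page_params(next_results, **extra):
    """Turn a `next_results` query string into a params dict."""
    # Drop the leading '?' and decode percent-escapes such as %40 -> @
    params = dict(parse_qsl(next_results.lstrip("?")))
    # Add anything Twitter did not echo back, e.g. tweet_mode
    params.update(extra)
    return params

next_results = ("?max_id=1062463013284929536&q=%40MatthewBerryTMR"
                "&lang=en&count=100&include_entities=1")
params = next_page_params(next_results, tweet_mode="extended")
print(params)
```

The resulting dictionary could then be passed as `params=params` in the same `requests.get` call used for the first page.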

In [64]:
next_page_url = ("https://api.twitter.com/1.1/search/tweets.json"
                 + response.json()['search_metadata']['next_results']
                 + '&tweet_mode=extended')

response = requests.get(next_page_url, auth=oauth)

more_tweets = response.json()['statuses']

for tweet in more_tweets[:5]:
    print(tweet['full_text'])
    print()

@CoreyCorbin78 @kevinkerbrat @MatthewBerryTMR yes all of them lol if you believe connor is a better player than bell then there’s no point in arguing

RT @MatthewBerryTMR: LeVeon goes into the fantasy football record book as the worst draft pick in the history of the game. https://t.co/Nag…

@MatthewBerryTMR Unless they drafted Connor in the handcuff like this guy 👍👍

@DezpicableD @MatthewBerryTMR It’s possible.  It’s very unlikely.

@MatthewBerryTMR Guilty as charged



In [49]:
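# For comparison: without tweet_mode=extended, the body is stored under 'text'
# (truncated to 140 characters) rather than 'full_text'
response = requests.get(next_page_url.replace('&tweet_mode=extended', ''), auth=oauth)

more_tweets = response.json()['statuses']

for tweet in more_tweets[:5]:
    print(tweet['text'])
    print()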

@Nicholasgee33 @LeVeonBell @MatthewBerryTMR I'm hoping bell comes back though cuz I think my team would be unstoppable

@tg_716 @MatthewBerryTMR I think he probably does feel bad for all the other people who “get the shaft.” That’s cal… https://t.co/Yr56XB2ejP

@Nicholasgee33 @LeVeonBell @MatthewBerryTMR What??

@Nicholasgee33 @LeVeonBell @MatthewBerryTMR Why would u do that lol I drafted both but also had Rodgers so I traded… https://t.co/nkpraNbf7r

RT @MatthewBerryTMR: In Week 4, Mitchell Trubisky threw 6 touchdown passes. Okay. That’s a fluke, right? TB is brutal. Forget that. Here’s…



## Streaming tweets

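The `tweepy` library wraps the same search API in a friendlier interface. Its `Cursor` object fetches further pages of results as needed, so we can loop over tweets as if they were one continuous stream.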

In [4]:
import tweepy

auth = tweepy.OAuthHandler(credentials["TWITTER_CONSUMER_KEY"],
                           credentials["TWITTER_CONSUMER_KEY_SECRET"])
auth.set_access_token(credentials["TWITTER_ACCESS_TOKEN"],
                      credentials["TWITTER_ACCESS_TOKEN_SECRET"])

api=tweepy.API(auth)

In [7]:
max_tweets = 3

test_q_params = {
    'q':"@MatthewBerryTMR",
    'tweet_mode':'extended',
    'lang':'en'
}

for index, tweet in enumerate(tweepy.Cursor(api.search, **test_q_params).items(max_tweets)):
    # You can see all the methods available on tweet using .<tab> or 
    # dir(tweet). You can access the raw JSON using tweet._json
    print(str(index) + '. ' + tweet.user.screen_name + " says: "+ tweet.full_text + '\n')

0. Daniel_Penrod11 says: @MatthewBerryTMR I have Alex Collins and John Brown. Only starting Collins for now, but really want to roll the dice on Smokey

1. mike__patton77 says: @Beav04 @WilAnthropist @MatthewBerryTMR You realize he’s only missed like 3 kicks all season right.

2. k_bender23 says: RT @MatthewBerryTMR: LeVeon goes into the fantasy football record book as the worst draft pick in the history of the game. https://t.co/Nag…



In [9]:
# We can also duplicate our original query

# ** unpacks a dictionary into keyword arguments (q=..., tweet_mode=..., lang=...)
for index, tweet in enumerate(tweepy.Cursor(api.search, **test_q_params).items(max_tweets)):
    print(str(index) + '. ' + tweet.user.screen_name + " says: "+ tweet.full_text + '\n')

0. Daniel_Penrod11 says: @MatthewBerryTMR I have Alex Collins and John Brown. Only starting Collins for now, but really want to roll the dice on Smokey

1. mike__patton77 says: @Beav04 @WilAnthropist @MatthewBerryTMR You realize he’s only missed like 3 kicks all season right.

2. k_bender23 says: RT @MatthewBerryTMR: LeVeon goes into the fantasy football record book as the worst draft pick in the history of the game. https://t.co/Nag…
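To see what the `**` in the call above does, here is a small self-contained illustration. `describe_search` is a hypothetical stand-in for any function that accepts keyword arguments; it is not part of tweepy.

```python
def describe_search(q, lang="en", tweet_mode="compat"):
    """Hypothetical stand-in for a function that takes keyword arguments."""
    return "q={} lang={} tweet_mode={}".format(q, lang, tweet_mode)

test_q_params = {'q': "@MatthewBerryTMR", 'tweet_mode': 'extended', 'lang': 'en'}

# These two calls are equivalent: ** turns each key/value pair into key=value
unpacked = describe_search(**test_q_params)
explicit = describe_search(q="@MatthewBerryTMR", tweet_mode='extended', lang='en')
print(unpacked == explicit)
```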



## Getting data into Mongo

We can also insert tweets into MongoDB. 

In [10]:
import json
from pymongo import MongoClient

# If we wanted to connect to Mongo on AWS, this is where we'd set that info

#uri = 'mongodb://sam:mongo_sam@18.216.212.82:27017'
#client = MongoClient(uri)

# Using local mongo client after we activate the mongod Daemon in Terminal
client = MongoClient()
db = client.berry_tweets #database structure is the overhead needed to hold collections of documents
twt_cl = db.tweets

In [11]:
client.list_database_names()

['admin', 'berry_tweets', 'config', 'legistlation', 'local', 'my_new_db']

In [12]:
def processTweet(tweet):
    tweet_dict = {
        'id': tweet.id_str,
        'datetime': tweet.created_at,
        'tweet': tweet.full_text,
        'entities': tweet.entities,
        # The following stores a user object
        'user': tweet.user._json
    }
    
    if tweet.coordinates:
        tweet_dict['coordinates'] = tweet.coordinates
    if tweet.geo:
        tweet_dict['geo'] = tweet.geo
    
    return tweet_dict

In [13]:
def query_twitter(api, query_params, n_queries):
    '''
    Query the Twitter search API and return the desired tweet components
    as a list of dictionaries.
    Stops early, keeping the tweets retrieved so far, if Twitter returns an
    error such as status code 429 (too many requests).
    ---
    Inputs:
        api: Tweepy API object instance. Should already be authenticated.
        query_params: dict of Twitter API query parameters
        n_queries: maximum number of tweets to retrieve
    Returns:
        tweets: list[dict()], each dict is a processed tweet.
    '''
    # Create cursor object to look through tweets
    cursor = tweepy.Cursor(api.search, **query_params).items(n_queries)
    
    tweets = []
    # Use a try clause to preserve previously processed tweets in the event
    # of an error, usually TweepError: Twitter error response: status code = 429
    try:
        for tweet in cursor:
            # Retrieve selected fields from tweet
            tweets.append(processTweet(tweet))
    except tweepy.TweepError:
        print('Exiting `try` loop because of error.')
    else:
        print('Completed query without errors.')
    finally:
        print('Retrieved {} tweets.'.format(len(tweets)))
    
    return tweets
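The function above simply stops when the rate limit is hit. One alternative is to wait and retry. The sketch below shows a generic retry loop with doubling delays; it does not call Twitter, and `RateLimitError`, `call_with_backoff`, and `fake_request` are all hypothetical names invented for the illustration.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the library's 'too many requests' error."""

def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with doubling delays while it raises RateLimitError."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # waits 1, 2, 4, ... times base_delay

# Demo with a fake request that is rate limited twice, then succeeds.
# Delays are recorded instead of slept so the demo finishes instantly.
outcomes = iter([RateLimitError(), RateLimitError(), "ok"])
delays = []

def fake_request():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

result = call_with_backoff(fake_request, sleep=delays.append)
print(result, delays)
```

Twitter's rate limits reset in 15-minute windows, so real delays would need to be much longer than in this demo. `tweepy.API` also accepts `wait_on_rate_limit=True`, which makes tweepy do the waiting for you.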

In [24]:
params = {
    'q':'@MatthewBerryTMR',
    'tweet_mode':'extended',
    'lang':'en',
    'include_entities':True
}
list_of_tweets = query_twitter(api,params,3)

Completed query without errors.
Retrieved 3 tweets.


In [41]:
too_many_tweets = query_twitter(api,params,2800)

Exiting `try` loop because of error.
Retrieved 2699 tweets.


In [None]:
# Index the tweet id so lookups by id are fast and duplicates are rejected
twt_cl.create_index('id', unique=True)

Let's use the PyMongo client to get some information back from the database!

In [42]:
# Take a list of processed tweets from a query and insert unique ones into the database

already_in = 0
collection_count_initial = twt_cl.count_documents({})
for tweet in too_many_tweets:
    # Skip tweets whose id is already stored; insert the rest
    if twt_cl.count_documents({'id': tweet['id']}, limit=1) > 0:
        already_in += 1
    else:
        twt_cl.insert_one(tweet)

collection_count_final = twt_cl.count_documents({})
print('Duplicates: {0} Duplicates in new query: {1}'.format(already_in, already_in/len(too_many_tweets)))
print('New entries: {}'.format(collection_count_final - collection_count_initial))
print('Database size: {}'.format(collection_count_final))

Duplicates: 2693 Duplicates in new query: 0.9977769544275658
New entries: 6
Database size: 5204


In [43]:
# How many documents do we have
twt_cl.count_documents({})

5204

In [17]:
# How many mention Matthew Berry?
twt_cl.count_documents({"tweet": {"$regex": "@MatthewBerryTMR"}})

2668

In [23]:
twt_cl.find_one({"tweet": {"$regex": "Bell"}})['user']['location']

'Ohio, USA'

In [None]:
#twt_cl.delete_many({}) # delete all documents from collection

## More complicated query: 20 most popular hashtags

What are the most popular hashtags in the dataset?

To start, let's find a document with at least one hashtag. Then we will build a pipeline using `aggregate`, which passes the documents through a series of processing stages.


In [None]:
twt_cl.find_one({"entities.hashtags.0": {"$exists": 1}})

Aggregation stages used below. Note that `$` signs are used for operators and for existing field names (to distinguish them from normal strings):
- `$match`: the standard query used in `find`, which we have already seen
- `$project`: pick the fields that will end up in the output (projected)
- `$unwind`: take a field that contains an array and create a new record for each element in that array (see example); the result resembles a SQL `JOIN` of the document with its array elements
- `$group`: like a SQL `GROUP BY`. Takes a mandatory `_id` (which is what it groups by) and creates new fields with aggregate functions
- `$sort`: `{sort_field: 1}` for ascending or `{sort_field: -1}` for descending
- `$limit`: the number of records to return. (With `find`, the same effect comes from calling `.limit()` on the cursor.)


#### Unwind example

If we have a document
```javascript
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': [5,6,7],
   'field3': 'abba'
}
```
after doing an `$unwind` on `field2` you would get three new documents:
```javascript
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': 5,
   'field3': 'abba'
},
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': 6,
   'field3': 'abba'
},
{
   '_id': 123456789,
   'field1': [1,2,3],
   'field2': 7,
   'field3': 'abba'
}
```
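The same transformation can be imitated in plain Python, which may make the behaviour easier to see. This is only an illustration of the idea, not how MongoDB implements it; the function name `unwind` is ours.

```python
def unwind(document, field):
    """Yield one copy of `document` per element of the array stored in `field`."""
    for element in document[field]:
        copy = dict(document)   # shallow copy: the other fields are left as they are
        copy[field] = element   # replace the array with a single element
        yield copy

doc = {'_id': 123456789, 'field1': [1, 2, 3], 'field2': [5, 6, 7], 'field3': 'abba'}
for unwound in unwind(doc, 'field2'):
    print(unwound)
```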


<img src="./images/mongo_aggregation.png">

In [None]:
# aggregate is a pipeline, order matters
cursor = twt_cl.aggregate([
    {'$match': {'entities.hashtags.0': {'$exists': 1}}},
    {'$project': {'_id': 0, 'hashtags': '$entities.hashtags'}},
    {'$unwind': '$hashtags'},
    {'$group': {'_id': '$hashtags.text', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 20}
])

list(cursor)
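For intuition, the pipeline above computes roughly what the following plain-Python function computes over a list of tweet documents. This is a sketch for comparison only: it runs in memory rather than in the database, and the sample hashtags are made up.

```python
from collections import Counter

def top_hashtags(documents, limit=20):
    """In-memory equivalent of the match/unwind/group/sort/limit pipeline."""
    counts = Counter()
    for doc in documents:
        # Documents without hashtags contribute nothing (the $match stage)
        hashtags = doc.get("entities", {}).get("hashtags", [])
        for tag in hashtags:            # one record per hashtag ($unwind)
            counts[tag["text"]] += 1    # count per hashtag text ($group with $sum)
    # Highest counts first ($sort), keep only the top few ($limit)
    return [{"_id": text, "count": n} for text, n in counts.most_common(limit)]

# Made-up documents shaped like the ones stored by processTweet
sample_docs = [
    {"entities": {"hashtags": [{"text": "NFL"}, {"text": "FantasyFootball"}]}},
    {"entities": {"hashtags": [{"text": "FantasyFootball"}]}},
    {"entities": {"hashtags": []}},
    {"entities": {"hashtags": [{"text": "FantasyFootball"}, {"text": "NFL"},
                               {"text": "Steelers"}]}},
]
print(top_hashtags(sample_docs, limit=2))
```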

## Real streaming: the tweepy API

Okay, but how about real streaming? That is, an object that sits there and "listens" for new tweets, and then processes them as they arrive?

This uses a slightly different API. There are a couple of things to pay attention to:
- The tweets that we get are JSON *strings*, not parsed objects. We also don't have the nice ways of accessing the attributes directly (e.g. `tweet.full_text` above). Instead we parse the string with `json.loads`, which gives us a dictionary, and go from there.
- A Twitter stream takes a `StreamListener` object. We should write methods `on_data` and `on_error`, which are called when a new tweet arrives or when we encounter an error, respectively.

In this example, we use a `deque` with a maximum length of 5, so that we retain the 5 most recent tweets. In the `on_data` call, we add the tweet to our collection, then print out the currently stored tweets.

If we wanted to store the data for all time, then the `on_data` method is where we would load them into Mongo.
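Before connecting to the live stream, the two ideas described above (parsing raw JSON strings and keeping only the most recent few) can be tried offline. In this sketch the raw messages are fabricated with `json.dumps`, and `handle_raw_message` is a hypothetical helper, not part of tweepy.

```python
import json
from collections import deque

# A deque with maxlen=5 silently discards the oldest item once it is full
recent_tweets = deque([], maxlen=5)

def handle_raw_message(raw):
    """Parse one raw stream message and keep its text if it has any."""
    message = json.loads(raw)      # JSON string -> dictionary
    text = message.get("text")     # some stream messages are not tweets
    if text is not None:
        recent_tweets.append(text)

# Eight fake tweets plus one non-tweet notice with no 'text' key
raw_messages = [json.dumps({"text": "tweet {}".format(n)}) for n in range(8)]
raw_messages.append(json.dumps({"limit": {"track": 3}}))

for raw in raw_messages:
    handle_raw_message(raw)

print(list(recent_tweets))
```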

In [194]:
from tweepy import Stream
from tweepy.streaming import StreamListener
from IPython import display
from collections import deque
import json

class MyListener(StreamListener):
    def __init__(self):
        super().__init__()
        # Keep only the 5 most recent tweets; older ones fall off the front
        self.list_of_tweets = deque([], maxlen=5)

    def on_data(self, data):
        # `data` is a JSON string; not every message is a tweet, so use .get
        tweet_text = json.loads(data).get('text')
        if tweet_text is not None:
            self.list_of_tweets.append(tweet_text)
            self.print_list_of_tweets()

    def on_error(self, status):
        print(status)

    def print_list_of_tweets(self):
        # Redraw the cell output with the currently stored tweets
        display.clear_output(wait=True)
        for index, tweet_text in enumerate(self.list_of_tweets):
            m = '{}. {}\n\n'.format(index, tweet_text)
            print(m)

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#MondayMotivation'])

0. RT @pollsofpolitics: What Approval percentage number would you give @POTUS @realDonaldTrump on the economy??

Please vote and retweet to sp…


1. RT @pollsofpolitics: Do you believe #VoterFraud is actually happening in #Florida??

Please vote and retweet to spread poll for a wider sam…


2. RT @JeferOficial: Buenos días mi amor ¿Cómo amaneces? 

*Ya me voy a bañar ¿Vienes?

#MondayMotivation 
#GoPackGo https://t.co/Y4e0M7ToRg


3. RT @SpectreKaiTM: When the media says “White nationalist” or “racist” what they mean is “a White person who hasn’t learned to hate themselv…


4. RT @vashtiroebuck1: Happy Monday with some Cos D’Estournel @RoebuckSteve1 @DrinkBordeaux @BordeauxWines @Bordeauxwinenew @cotes2bordeaux @V…




KeyboardInterrupt: 