# Accessing Twitter API and Storing Tweets in MongoDB

We can access the Twitter API using the `requests` library or through `tweepy`, and then store the tweets in a local MongoDB database. See the [Twitter Search API documentation](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html) for details.

Large swaths of this notebook are taken directly from notebooks from the Metis 2018 Chicago Winter cohort and adapted as needed for my project.

I've queried tweets relating to [Matthew Berry](https://en.wikipedia.org/wiki/Matthew_Berry), an ESPN Fantasy Sports analyst and column writer whom I've followed for several years while playing fantasy football. His ["Love/Hate" column](http://www.espn.com/fantasy/football/story/_/page/TMRlovehate181115/fantasy-football-picks-sleepers-busts-week-11) is published every Thursday. Once a year, he unblocks people on social media who make a donation to the [Jimmy V Foundation](https://www.jimmyv.org), which has always made me wonder how nasty people will get on Twitter over fantasy football.

In [14]:
import pprint
import pandas as pd
import requests

# This is needed to authenticate us to Twitter
try:
    from requests_oauthlib import OAuth1
except ModuleNotFoundError:
    import sys

    # I need this because requests_oauthlib gets installed in a weird place on my system
    sys.path.append('/usr/local/lib/python3.6/site-packages')
    from requests_oauthlib import OAuth1

# Loading Twitter API access tokens and keys
from twitter_credentials import credentials
import tweepy

# MongoDB
import json
from pymongo import MongoClient

In [8]:
def processTweet(tweet):
    tweet_dict = {
        'id': tweet.id_str,
        'datetime': tweet.created_at,
        'tweet': tweet.full_text,
        'entities': tweet.entities,
        # The following stores a user object
        'user': tweet.user._json
    }
    
    if tweet.coordinates:
        tweet_dict['coordinates'] = tweet.coordinates
    if tweet.geo:
        tweet_dict['geo'] = tweet.geo
    
    return tweet_dict

def query_twitter(api, query_params, n_queries):
    '''
    Query the Twitter search API and return the desired tweet components
    as a list of dictionaries.
    Handles exceptions such as Twitter's 429 error code for too many queries.
    ---
    Inputs:
        api: Tweepy API object instance. Should already be authenticated.
        query_params: dict of Twitter API query parameters.
        n_queries: maximum number of tweets to retrieve.
    Returns:
        tweets: list[dict()], each dict is a processed tweet.
    '''
    # Create cursor object to look through tweets
    cursor = tweepy.Cursor(api.search, **query_params).items(n_queries)

    tweets = []
    # Use try clause to preserve previously processed tweets in the event
    # of an error, usually TweepError: Twitter error response: status code = 429
    try:
        for tweet in cursor:
            # Retrieve selected fields from tweet
            tweets.append(processTweet(tweet))
    except Exception:
        # Catch Exception rather than everything, so Ctrl-C still interrupts
        print('Exiting `try` loop because of error.')
    else:
        print('Completed query without errors.')
    finally:
        print('Retrieved {} tweets.'.format(len(tweets)))

    return tweets
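
The idea behind `query_twitter` (keep whatever was collected before an error) can be tried offline. The following is a minimal sketch, not the notebook's code: `collect_until_error` and `flaky_source` are hypothetical names, and the generator merely simulates a cursor that fails partway through.

```python
def collect_until_error(items, process):
    """Process items one by one, keeping everything gathered before a failure."""
    collected = []
    error = None
    try:
        for item in items:
            collected.append(process(item))
    except Exception as exc:
        # Catching Exception (not a bare except) lets KeyboardInterrupt propagate
        error = exc
    return collected, error


def flaky_source():
    """Hypothetical stand-in for a rate-limited cursor: three items, then an error."""
    for i in range(3):
        yield i
    raise RuntimeError("simulated status code 429")


results, err = collect_until_error(flaky_source(), lambda x: x * 10)
print(len(results), "items kept; stopped by:", type(err).__name__)
```

Returning the error alongside the partial results lets the caller decide whether to wait and retry or to store what it already has.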

#### Basic takeaway
- `requests.get(url)`: makes a GET request to a URL. It can get a webpage or a JSON object back, and returns a `response` object.
- `response.json()`: accesses the JSON object returned (if there was one).

To get this to work with Twitter, we will need to authenticate ourselves. This is the job of OAuth.

In [3]:
oauth = OAuth1(credentials["TWITTER_CONSUMER_KEY"],
               credentials["TWITTER_CONSUMER_KEY_SECRET"],
               credentials["TWITTER_ACCESS_TOKEN"],
               credentials["TWITTER_ACCESS_TOKEN_SECRET"])
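
The `twitter_credentials` module itself is not shown in this notebook. As a sketch only, one way such a module could build the `credentials` dictionary is to read the four values from environment variables. The key names below match the lookups used here, but the function name `load_credentials` and the use of environment variables are assumptions.

```python
import os

# Key names taken from the lookups in this notebook
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_KEY_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four OAuth values as a dict, failing loudly if any is missing.

    Hypothetical helper: reading from environment variables is an assumption,
    not necessarily how the original twitter_credentials.py works.
    """
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise KeyError("Missing Twitter credentials: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}
```

A module containing `credentials = load_credentials()` would then expose the same name this notebook imports, while keeping the secrets out of version control.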

## Twitter Search API (the free version only searches the past week of tweets)

A detailed description of the Twitter Search API can be found [here](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html). Some of the key parameters are:

| Parameter | Notes | Example |
|---|---|---|
| q | (required) query string to search for | `@metis` |
| geocode | (optional) Uses tweet geolocation, or the user's profile location if tweet geolocation is disabled. Should be of the format `latitude,longitude,radius[unit]` (comma-separated, no spaces), where unit is either "km" or "mi" | `41.8781,-87.6298,5mi` |
| lang | (optional) Only return tweets in language given. Languages are coded by the two character code used in [ISO 639-1](http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | `en` `es` |
| count | (optional) Number of results to return. Defaults to 15, max value is 100 | `20` |
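
To make the parameter formats concrete, here is a small offline sketch that assembles a search URL from the parameters in the table using only the standard library. `build_search_url` is a hypothetical helper, not part of `requests` or the Twitter API; `requests` performs a comparable encoding itself when given a `params` dictionary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(query, lang=None, count=None, geocode=None):
    """Build a search URL; geocode is a (latitude, longitude, radius) tuple."""
    params = {"q": query}
    if lang is not None:
        params["lang"] = lang
    if count is not None:
        # The table above gives 100 as the maximum, so cap larger requests
        params["count"] = min(int(count), 100)
    if geocode is not None:
        latitude, longitude, radius = geocode
        # Comma-separated with no spaces, e.g. 41.8781,-87.6298,5mi
        params["geocode"] = "{},{},{}".format(latitude, longitude, radius)
    return SEARCH_URL + "?" + urlencode(params)


url = build_search_url("@metis", lang="en", count=250,
                       geocode=(41.8781, -87.6298, "5mi"))
print(url)
# Decode the query string again to check what the server would receive
print(parse_qs(urlsplit(url).query))
```

Note that `@` and the commas are percent-encoded in the URL; decoding the query string recovers the original values.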

The API returns a JSON object with two keys:
- `search_metadata`: information about how long the search took, what was searched for, etc.
- `statuses`: the tweets that matched the query

Let's see it in action:

In [5]:
# `count` is capped at 100 per request, so asking for more has no effect
parameters = {"q": "@MatthewBerryTMR",
              "count": 100,
              "lang": "en",
              "include_entities": "true",
              "tweet_mode": "extended"
              }

response = requests.get("https://api.twitter.com/1.1/search/tweets.json",
                        params=parameters,
                        auth=oauth)

In [15]:
# Just look at the text of one tweet (index 1, i.e. the second in the list)
pprint.pprint(response.json()['statuses'][1]['full_text'])

('RT @MatthewBerryTMR: Fantasy heartbreak as a long TD run by David Johnson is '
 'called back by a holding penalty on Ricky Seals-Jones')


In [16]:
# Check number of tweets in the response JSON
tweets = response.json()['statuses']
print("Number of tweets = ", len(tweets))

Number of tweets =  100


We can pull the next set of tweets using the `next_results` query string in the query's metadata. We append `tweet_mode=extended` again because we're looking for the extended text of a tweet rather than the truncated version.

In [18]:
next_page_url = ("https://api.twitter.com/1.1/search/tweets.json"
                 + response.json()['search_metadata']['next_results']
                 + '&tweet_mode=extended')

response = requests.get(next_page_url, auth=oauth)

more_tweets = response.json()['statuses']

for tweet in more_tweets[:5]:
    print(tweet['full_text'])
    print()

@MatthewBerryTMR Ingram is really being a vulture today 💔💔 #IHaveKamara

RT @MatthewBerryTMR: Crazy https://t.co/GRnpDLCsi5

RT @MatthewBerryTMR: Crazy https://t.co/GRnpDLCsi5

RT @MatthewBerryTMR: Crazy https://t.co/GRnpDLCsi5

RT @MatthewBerryTMR: Crazy https://t.co/GRnpDLCsi5
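
The paging step above can be repeated in a loop. The following is a sketch under stated assumptions: `collect_statuses` is a hypothetical helper, and `fetch_json` is any callable that takes a query string such as `?q=...` and returns the parsed JSON dictionary, so the loop can be exercised without network access.

```python
def collect_statuses(fetch_json, first_query, max_pages=5):
    """Follow search_metadata['next_results'] until it is absent or max_pages is reached."""
    statuses = []
    query = first_query
    for _ in range(max_pages):
        payload = fetch_json(query)
        statuses.extend(payload.get("statuses", []))
        next_query = payload.get("search_metadata", {}).get("next_results")
        if not next_query:
            # No further page was advertised, so stop
            break
        # Mirror the cell above: re-append tweet_mode to keep the full text
        query = next_query + "&tweet_mode=extended"
    return statuses
```

With the objects defined in this notebook, `fetch_json` could be something like `lambda q: requests.get("https://api.twitter.com/1.1/search/tweets.json" + q, auth=oauth).json()`. The `max_pages` limit is there to keep a runaway loop from exhausting the rate limit.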



## Pull tweets into MongoDB

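Using `tweepy`, we can load in credentials and use a `Cursor` object to scroll through the search results. Unlike the JSON-style dictionaries we get from `requests`, tweepy's objects let us use dot (`.`) access on the JSON keys to get information. `MongoClient` then gives us a connection to the local database where the tweets will be stored.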

In [24]:
# Load twitter credentials from local file
auth = tweepy.OAuthHandler(credentials["TWITTER_CONSUMER_KEY"],
                           credentials["TWITTER_CONSUMER_KEY_SECRET"])
auth.set_access_token(credentials["TWITTER_ACCESS_TOKEN"],
                      credentials["TWITTER_ACCESS_TOKEN_SECRET"])

api = tweepy.API(auth)

# Connect to the local MongoDB server (start the mongod daemon in a terminal first)
client = MongoClient()
# A database is the container that holds collections of documents
db = client.berry_tweets
twt_cl = db.tweets

In [25]:
# View active databases (list_database_names replaces the deprecated database_names)
client.list_database_names()

['admin', 'berry_tweets', 'config', 'legistlation', 'local', 'my_new_db']

In [26]:
# Query twitter search API until rate limits kick in. Grab as many tweets as possible.
params = {
    'q':'@MatthewBerryTMR',
    'tweet_mode':'extended',
    'lang':'en',
    'include_entities':True
}
n_tweets = 2800
query_results = query_twitter(api,params,n_tweets)

Exiting `try` loop because of error.
Retrieved 2620 tweets.


Let's use the PyMongo client to store the processed tweets in the database.
A possible later improvement is to use `twt_cl.create_index()` to put a unique index on the tweet `id`, so that the database itself rejects duplicate tweets.
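
As a sketch of that idea (not what this notebook actually runs), a unique index combined with an upsert lets MongoDB handle deduplication in one call per tweet. `insert_unique` is a hypothetical helper; it assumes `collection` is a PyMongo collection such as `twt_cl` and that every tweet dictionary has an `id` key.

```python
def insert_unique(collection, tweets):
    """Insert tweets whose 'id' is not stored yet; return (inserted, duplicates)."""
    # A unique index makes MongoDB refuse a second document with the same 'id'.
    # Creating it fails if the collection already contains duplicate ids.
    collection.create_index("id", unique=True)
    inserted = 0
    duplicates = 0
    for tweet in tweets:
        # $setOnInsert only writes the fields when the upsert creates a document
        result = collection.update_one(
            {"id": tweet["id"]},
            {"$setOnInsert": tweet},
            upsert=True,
        )
        if result.upserted_id is not None:
            inserted += 1
        else:
            duplicates += 1
    return inserted, duplicates
```

Compared with checking for each tweet and then inserting it, this uses a single round trip per tweet and avoids a race between the check and the insert.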

In [27]:
# Take a list of processed tweets from a query and insert unique ones into the database

already_in = 0
collection_count_initial = twt_cl.count_documents({})
for tweet in query_results:
    # Skip tweets whose id is already stored; insert the rest
    if twt_cl.count_documents({'id': tweet['id']}) > 0:
        already_in += 1
    else:
        twt_cl.insert_one(tweet)

print('Duplicates: {0} Duplicates in new query: {1}'.format(already_in, already_in/len(query_results)))
print('New entries: {}'.format(twt_cl.count_documents({}) - collection_count_initial))
print('Database size: {}'.format(twt_cl.count_documents({})))

Duplicates: 0 Duplicates in new query: 0.0
New entries: 2620
Database size: 9116


In [28]:
# How many documents do we have?
twt_cl.count_documents({})

9116

In [29]:
# How many mention Matthew Berry?
twt_cl.count_documents({"tweet": {"$regex": "@MatthewBerryTMR"}})

9051