# Mining Twitter Data

## Configuration

To use Twitter's API through Tweepy, we need to authenticate our Twitter application and obtain an access token for our user account (see the *twitter_test* Notebook for details). The code below requires that your credentials are stored in the file *credentials.py*, and that this file is located in the same directory as this Notebook.

If you do not have a Twitter Developer account, you still need the *credentials.py* file, with its default values, in the same directory as this Notebook. In addition, you will need the files *user.txt*, *results.txt*, *trends_available.txt*, and *trends_results.txt* available from our course's homepage.
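
For reference, a placeholder *credentials.py* might look like the sketch below. The four variable names and the default value of *CONSUMER_KEY* come from the code in this Notebook; the other three placeholder values are assumptions, so use the file from the course homepage rather than this sketch.

```python
# credentials.py (sketch with placeholder values)
# CONSUMER_KEY's default matches the check used in the backup cells;
# the remaining placeholder strings are assumed for illustration.
CONSUMER_KEY = 'CONSUMER_KEY'
CONSUMER_SECRET = 'CONSUMER_SECRET'
ACCESS_TOKEN = 'ACCESS_TOKEN'
ACCESS_TOKEN_SECRET = 'ACCESS_TOKEN_SECRET'

# True while the file still holds its default value (no developer account)
using_placeholders = (CONSUMER_KEY == 'CONSUMER_KEY')
print(using_placeholders)
```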


Python cells may contain one of two important comments:
- *DEVELOPER ACCOUNT REQUIRED*: A developer account and a valid *credentials.py* file are required to run this cell, otherwise errors will be produced.
- *BACKUP FOR STUDENTS WITHOUT A DEVELOPER ACCOUNT*: This code should be run by students without a developer account or students whose *credentials.py* file is not valid.


The code below sets up our API to wait for rate limits and to notify us if it is waiting. Twitter rate limits can be found here: https://developer.twitter.com/en/docs/basics/rate-limits.html

In [None]:
import tweepy
import credentials

auth = tweepy.OAuthHandler(credentials.CONSUMER_KEY, credentials.CONSUMER_SECRET)
auth.set_access_token(credentials.ACCESS_TOKEN, credentials.ACCESS_TOKEN_SECRET)

# create an API for accessing twitter, which will wait for rate limits to reset when
# reached and will notify the user if that is the case
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
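
Conceptually, waiting on a rate limit means sleeping until the current rate-limit window resets. Twitter reports the reset time as a Unix timestamp in the *x-rate-limit-reset* response header. The helper below is a hypothetical sketch of that calculation; *seconds_until_reset* is not part of Tweepy, and Tweepy's own implementation may differ.

```python
import math
import time

def seconds_until_reset(reset_epoch, now=None):
    """Return the number of whole seconds to wait until the window resets."""
    if now is None:
        now = time.time()
    # never return a negative wait if the reset time has already passed
    return max(0, math.ceil(reset_epoch - now))

# Example with made-up timestamps: the window resets 60 seconds from 'now'
print(seconds_until_reset(1000, now=940))
```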

## Get user information

We can get user information using *api.get_user()*, which returns a *tweepy.models.User* object.

In [None]:
# DEVELOPER ACCOUNT REQUIRED

# Get the User object for the user 'EasternCTStateU' 
user = api.get_user('EasternCTStateU')
type(user)

In [None]:
# BACKUP FOR STUDENTS WITHOUT A DEVELOPER ACCOUNT

import pickle

# used to load previously saved Twitter data
def loadData(file) :
    with open(file, 'rb') as infile :
        return pickle.load(infile)

if credentials.CONSUMER_KEY == 'CONSUMER_KEY' :
    user = loadData('user.txt')
type(user)
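
The backup files were presumably produced by pickling objects returned from the API. The sketch below shows that round trip on a plain dictionary; *save_data*, *load_data*, and the sample data are hypothetical, and the actual course files may have been created differently. Only unpickle files from a source you trust, since *pickle.load* can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_data(obj, file):
    """Write any picklable object to a binary file."""
    with open(file, 'wb') as outfile:
        pickle.dump(obj, outfile)

def load_data(file):
    """Read back an object previously written with save_data."""
    with open(file, 'rb') as infile:
        return pickle.load(infile)

# Made-up stand-in for data returned by the API
sample = {'screen_name': 'example_user', 'followers_count': 42}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    save_data(sample, path)
    restored = load_data(path)

print(restored == sample)
```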

### JSON format

Data transferred across the internet is often encoded in JSON (JavaScript Object Notation) format, which looks similar to a Python dictionary.

We can see this by running the code below and copying the output into the JSON viewer here:
http://jsonviewer.stack.hu/ 

We do this for demonstration purposes only. *Tweepy* will convert the JSON information into a *model* class as shown below.

In [None]:
import json
print(json.dumps(user._json))
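
To make the similarity to a Python dictionary concrete, here is a minimal sketch using a small, hand-written JSON string (hypothetical values, not real API output). *json.loads* parses the text into a dictionary, and *json.dumps* with *indent* prints it in a readable layout without needing an external viewer.

```python
import json

# Hand-written example record; the values are made up for illustration
raw = '{"screen_name": "example_user", "followers_count": 42, "location": ""}'

record = json.loads(raw)      # JSON text -> Python dictionary
print(type(record))
print(record['screen_name'], record['followers_count'])

# Python dictionary -> nicely indented JSON text
print(json.dumps(record, indent=2, sort_keys=True))
```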

### Tweepy model class
Most data returned by *tweepy* is stored in a Tweepy model class object. The properties and methods of these objects can be accessed using the dot (.) operator. For example, a *tweepy.models.User* object stored in the variable *user* provides the following:

- *user.screen_name*: the user's screen name
- *user.followers_count*: the user's followers count
- *user.friends()*: a method that returns a list of the user's friends (up to 20 per page)
- *user.location*: the location from the user's profile
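
The idea of turning dictionary keys into dot-accessible properties can be sketched with Python's standard *types.SimpleNamespace*. This is only an analogy under assumed sample data; it is not how Tweepy actually builds its model objects.

```python
from types import SimpleNamespace

# Made-up data in the shape of a parsed JSON user record
data = {'screen_name': 'example_user', 'followers_count': 42, 'location': ''}

# Each dictionary key becomes an attribute reachable with the dot operator
user_like = SimpleNamespace(**data)

print(user_like.screen_name)
print(user_like.followers_count)
```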


In [None]:
# display some information about the user
print("Twitter user:", user.screen_name)
print("Location:", user.location)
print("Number of followers:", user.followers_count)

In [None]:
# Display the first 10 friends (DEVELOPER ACCOUNT REQUIRED)
print("The user's first 10 friends are:")
for friend in user.friends()[:10]:
    location = friend.location
    if location == '' :
        location = '?'
    print('\t', friend.screen_name, 'from', location)

## Searching Twitter

We can search Twitter using the *search* function, with some parameters described below:

- _q_: the search query; to retrieve a sample of all tweets, set this value to '\*'
- *tweet_mode*: if 'extended', returns the full text of tweets that are longer than 140 characters
- *result_type*: either 'recent', 'popular', or 'mixed' (the default)
- *count*: the number of tweets per page, up to 100 (default is 15)
- *lang*: restrict tweets by language; use 'en' for English. The default is no restriction.

Searches can also be restricted by location; for more details see *API.search* at http://docs.tweepy.org/en/v3.8.0/api.html.
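
As a rough sketch of restricting by location: the Twitter search API accepts a *geocode* value of the form 'latitude,longitude,radius', where the radius ends in 'mi' or 'km'. The helper *make_geocode* below is hypothetical (it is not part of Tweepy) and only builds that string; the commented-out call assumes the *api* object created earlier.

```python
def make_geocode(lat, long, radius, unit='mi'):
    """Build a 'latitude,longitude,radius' string for a location-restricted search."""
    if unit not in ('mi', 'km'):
        raise ValueError("unit must be 'mi' or 'km'")
    return f"{lat},{long},{radius}{unit}"

# Tweets within 10 miles of the coordinates used for Eastern later in this Notebook
geocode = make_geocode(41.722, -72.22, 10)
print(geocode)

# DEVELOPER ACCOUNT REQUIRED to run the actual search:
# results = api.search(q='Connecticut', geocode=geocode)
```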


The *api.search* function will return a *list* of *tweepy.models.Status* objects.

In [None]:
query = 'Connecticut'

In [None]:
# DEVELOPER ACCOUNT REQUIRED
results = api.search(q = query,tweet_mode = 'extended', lang = 'en')

In [None]:
# BACKUP FOR STUDENTS WITHOUT A DEVELOPER ACCOUNT
if credentials.CONSUMER_KEY == 'CONSUMER_KEY' :
    results = loadData('results.txt')

In [None]:
# api.search() returns a list, so let's look at the first tweet in the list
tweet1 = results[0]
type(tweet1)

As seen before, Twitter uses the JSON format, which is parsed by *Tweepy*. For demonstration purposes, copy and paste the output below into http://jsonviewer.stack.hu/ to view the JSON data.

In [None]:
print(json.dumps(tweet1._json))

### Extracting information from tweets
Each tweet is stored as a *tweepy.models.Status* object, so its properties can be accessed using the dot (.) operator. A full list of properties can be seen in the *Tweet Data Dictionary* at the following link:
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.

We will focus on the following properties:
- *full_text*: the full text of the tweet (only available if *tweet_mode* is 'extended'); otherwise text will be stored in *text*
- *retweet_count*: the number of times the tweet has been retweeted
- *retweeted*: True if the authenticating user has retweeted the tweet; otherwise False
- *user*: a *tweepy.models.User* object for the user who tweeted
- *lang*: the (machine-detected) language of the tweet
- *id*: the unique identifier for the tweet
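
Because the text lives in *full_text* or *text* depending on *tweet_mode*, a small helper can hide the difference. The sketch below uses *SimpleNamespace* stand-ins with made-up text rather than real *Status* objects, and *tweet_text* is a hypothetical helper name.

```python
from types import SimpleNamespace

def tweet_text(status):
    """Return full_text when present (extended mode); otherwise fall back to text."""
    if hasattr(status, 'full_text'):
        return status.full_text
    return status.text

# Stand-ins for Status objects fetched with and without tweet_mode='extended'
extended = SimpleNamespace(full_text='A long example tweet', retweet_count=3)
compat = SimpleNamespace(text='A short example tweet', retweet_count=0)

print(tweet_text(extended))
print(tweet_text(compat))
```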

Try displaying the *full_text* and *retweet_count* of the tweet stored in *tweet1* in the cell below.

In [None]:
tweet1.full_text

Since the tweets returned from *api.search* are stored in a list, we can iterate through the list and output information for each tweet.

In [None]:
# output each tweet, including the user name, retweet information, and a link to view the tweet
# this code uses 'hasattr' to check whether the tweet has the 'retweeted_status' property, which
# includes the original tweet
for r in results :
    print(r.user.screen_name, ': ', r.full_text, sep = '')
    if (r.retweet_count > 0) :
        if hasattr(r, 'retweeted_status') :
            print('retweeted from: ', r.retweeted_status.user.screen_name)
        print('retweet count: ', r.retweet_count)
    print('link: https://twitter.com/', r.user.screen_name, '/status/', r.id, sep = '')
    print()
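
Since the results are an ordinary list, the usual list tools apply as well. As an illustration, the sketch below ranks tweets by *retweet_count* using *sorted*; the *fake_results* list holds made-up stand-ins, not real search results.

```python
from types import SimpleNamespace

# Made-up stand-ins for Status objects
fake_results = [
    SimpleNamespace(id=1, full_text='first example', retweet_count=2),
    SimpleNamespace(id=2, full_text='second example', retweet_count=15),
    SimpleNamespace(id=3, full_text='third example', retweet_count=0),
]

# Sort from most to least retweeted and keep the top two
most_retweeted = sorted(fake_results, key=lambda r: r.retweet_count, reverse=True)[:2]

for r in most_retweeted:
    print(r.retweet_count, r.full_text)
```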

## Trending topics

The Twitter API and *Tweepy* provide several methods for retrieving trending topics.

### Get the available trends

The API function *trends_available()* will return a list of locations that Twitter has trending topic information for. Information for each location is stored in a *dictionary*, so the result is a list of dictionaries. A location is identified by its WOEID (a Yahoo! Where On Earth ID), where a WOEID of 1 indicates 'worldwide'.

In [None]:
## DEVELOPER ACCOUNT REQUIRED
trends_available = api.trends_available()

In [None]:
# BACKUP FOR STUDENTS WITHOUT A DEVELOPER ACCOUNT
if credentials.CONSUMER_KEY == 'CONSUMER_KEY' :
    trends_available = loadData('trends_available.txt')

In [None]:
# look at first trend, which is a dictionary
trends_available[0]

In [None]:
print('Number of trend locations: ', len(trends_available))
print()
print('First 5 trend locations')
for t in trends_available[:5]: 
    print(t['name'], ', woeid = ', t['woeid'], sep ='')
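
A list of dictionaries like this is easy to turn into a lookup table from location name to WOEID. The sketch below uses a small hand-made list; only the 'Worldwide' entry with a WOEID of 1 reflects the text above, and the other entries are invented placeholders.

```python
# Hand-made sample in the same shape as the list returned by trends_available();
# the 'Exampleville' and 'Sampleton' entries are invented for illustration
sample_locations = [
    {'name': 'Worldwide', 'woeid': 1},
    {'name': 'Exampleville', 'woeid': 1001},
    {'name': 'Sampleton', 'woeid': 1002},
]

# Build a dictionary mapping each location name to its WOEID
woeid_by_name = {loc['name']: loc['woeid'] for loc in sample_locations}

print(woeid_by_name['Worldwide'])
# .get() returns None instead of raising an error when a name is missing
print(woeid_by_name.get('Nowhere'))
```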

### Get trends for a specific location

We can get the top 50 trends for a specific location using the *api.trends_place* method and specifying the WOEID of the location of interest.

In [None]:
## DEVELOPER ACCOUNT REQUIRED

# get worldwide trends (a WOEID of 1 corresponds to 'worldwide')
trends_results = api.trends_place(id = 1)

In [None]:
# BACKUP FOR STUDENTS WITHOUT A DEVELOPER ACCOUNT
if credentials.CONSUMER_KEY == 'CONSUMER_KEY' :
    trends_results = loadData('trends_results.txt')

The *trends_results* object is a list that contains a dictionary, where the *key* is *'trends'* and the *value* is a list of *trends* that are stored as dictionaries.

Each *trend* is a dictionary containing the following:
- *name*: the name of the trend
- *url*: the url to search the trend
- *tweet_volume*: the number of tweets, or None if not available
- *query*: a query that can be used in the *api.search* method

In [None]:
# extract the list of trends
trends = trends_results[0]['trends']

# look at the first trend, which is stored as a dictionary
trends[0]

The *print_trend* function below prints information about a single trend *t*.

In [None]:
# print Trend name and volume
def print_trend(t) :
    num = t['tweet_volume']
    if num is None :
        num = '?'
    print(t['name'], 'has', num, 'tweets')

In [None]:
# print information for each trend
for t in trends :
    print_trend(t)
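
Because *tweet_volume* may be None, ranking trends by volume needs a small guard. The sketch below treats a missing volume as 0 when sorting; the *sample_trends* list is made up and only mimics the *name* and *tweet_volume* keys described above.

```python
# Made-up trends in the same shape as the dictionaries in the 'trends' list
sample_trends = [
    {'name': '#ExampleA', 'tweet_volume': 12000},
    {'name': 'ExampleB', 'tweet_volume': None},
    {'name': '#ExampleC', 'tweet_volume': 45000},
]

# 't["tweet_volume"] or 0' replaces None with 0 so the values can be compared
by_volume = sorted(sample_trends, key=lambda t: t['tweet_volume'] or 0, reverse=True)

for t in by_volume:
    print(t['name'], t['tweet_volume'])
```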

### Get trends *closest* to a specific location

The *api.trends_closest* function allows us to find the trend locations closest to a specific point (specified by its latitude and longitude). Let's look at the trend locations closest to Eastern, which has a latitude of 41.722 and a longitude of -72.22: https://goo.gl/maps/nZWRwdCEuQGFiXQH6

This function returns a list of locations, similar to *trends_available()*.

In [None]:
## DEVELOPER ACCOUNT REQUIRED for the rest of the Notebook
trends_closest = api.trends_closest(lat = 41.722, long = -72.22)

In [None]:
trends_closest

### Display a list of trending topics closest to Eastern

Now that we have found the closest trend location, we can extract its WOEID and use the *trends_place* method to get a list of trending topics for that location.

In [None]:
print('Trending topics for', trends_closest[0]['name'], ':')
print()
trends_results = api.trends_place(id = trends_closest[0]['woeid']) 
trends = trends_results[0]['trends']
for t in trends:
    print_trend(t)