# Lab1.3: Twitter as a source of text

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we are going to query Twitter streams using the package **tweepy**: https://github.com/tweepy/tweepy
Some documentation can be found at: https://github.com/tweepy/tweepy/tree/master/docs

Tweepy allows you to access Twitter using credentials and returns a so-called Cursor object. From the Cursor object, you can access the Twitter data, e.g. in JSON format. Documentation on the Twitter data objects can be found here:

https://developer.twitter.com/en/docs


Instructions on how to install Tweepy, get credentials and use the API can be found here:

http://socialmedia-class.org/twittertutorial.html

The notebook below is partially based on this tutorial. Credits: Wei Xu

Make sure you have installed the package and obtained the Twitter credentials before you start using the API.
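
One way to keep credentials out of a shared notebook is to read them from environment variables. The sketch below is only an illustration: the variable names and the helper `load_credentials` are hypothetical and are not part of tweepy or the tutorial.

```python
import os

# Hypothetical environment variable names; choose any names you like.
CREDENTIAL_VARS = (
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if one is missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in CREDENTIAL_VARS}
```

The returned values can then be assigned to the credential variables used in the code cell below instead of typing the secrets into the notebook.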

In the next code cell, we import the json package and the tweepy package. We then set up tweepy with the authentication credentials so that we can make a connection. Consult the documentation on how to get your credentials. Using the credentials, we call the tweepy.API function to create an api object.

We show how you can get the results through the Cursor function of tweepy, for which you need to set a number of variables. We use the search API to pass keywords as a query and restrict the results by the number of tweets per page, the number of pages and a period, and we ask to exclude retweets.

In [26]:
# Import the necessary package to process data in JSON format
try:
    import json
except ImportError:
    import simplejson as json

# Import the tweepy library
import tweepy

# Variables that contain the user credentials to access the Twitter API, for example:
#ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
#ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
#CONSUMER_KEY = 'YOUR API KEY'
#CONSUMER_SECRET = 'YOUR API SECRET'

# Fill in your own credentials below
ACCESS_TOKEN = ''
ACCESS_SECRET = ''
CONSUMER_KEY = ''
CONSUMER_SECRET = ''

# Setup tweepy to authenticate with Twitter credentials:
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the api object to connect to Twitter with your credentials
# Note: this call and api.search below follow the Tweepy 3.x interface. Tweepy 4 removed
# wait_on_rate_limit_notify and renamed api.search to api.search_tweets.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
#---------------------------------------------------------------------------------------------------------------------
# wait_on_rate_limit=True will make the api automatically wait for rate limits to replenish
# wait_on_rate_limit_notify=True will make the api print a notification when Tweepy is waiting for rate limits to replenish
#---------------------------------------------------------------------------------------------------------------------

#---------------------------------------------------------------------------------------------------------------------
# we now predefine some variables to restrict the twitter stream

#set two date variables for date range
start_date = '2018-10-01'
end_date = '2018-10-31'

#we define the nr_tweets we want to return 
nr_tweets=10

#finally, we define a query. Note that we can mix hash tags with words and use boolean operators OR and AND
keywords='#autism AND vaccines OR medicine AND children'

#---------------------------------------------------------------------------------------------------------------------
# The Twitter API uses pagination for iterating through timelines, user lists, direct messages, etc.
# To make pagination easier, Tweepy provides the Cursor object.
#---------------------------------------------------------------------------------------------------------------------
# Note: the end-date parameter of the search endpoint is called 'until'
for page in tweepy.Cursor(api.search, q=keywords,
                     count=nr_tweets, include_rts=False, since=start_date, until=end_date).pages(5):
    for index, status in enumerate(page):
        
        ## we get a json object from the result
        json_result = status._json
        
        ## check whether the tweet is in English; otherwise skip to the next tweet
        if json_result['lang'] != 'en':
            continue

        text=json_result['text']
        name=json_result['user']['screen_name']
        print("\n"+str(index)+"\nUser:"+name, "\n", "Tweet:" + text)


0
User:freedomgirl2011 
 Tweet:Corporations will make money off your brain damaged child,one in 38 children in United States are diagnosed with… https://t.co/ZT0JOjqTF2

1
User:andrewmorrisuk 
 Tweet:RT @sallyKP: From 2011... listen.

Very sad for all of the families who have been advocating for 20 years now.

What would happen if the US…

2
User:anhisu7 
 Tweet:RT @sallyKP: From 2011... listen.

Very sad for all of the families who have been advocating for 20 years now.

What would happen if the US…

3
User:WarriorWifeMom 
 Tweet:RT @sallyKP: From 2011... listen.

Very sad for all of the families who have been advocating for 20 years now.

What would happen if the US…

6
User:andrewmorrisuk 
 Tweet:RT @BioscienceNB: #educateyourself
10 years and over 600,000 children.

As promised, here is another link to the most recent study conducte…

7
User:BioscienceNB 
 Tweet:#educateyourself
10 years and over 600,000 children.

As promised, here is another link to the most recent study co… htt

We requested up to 5 pages of 10 tweets each, and we print the user name and the text of every English tweet. As you can see, the tweets contain all kinds of non-textual elements as well, and retweets (starting with "RT @") still appear in the results.

Instead of printing the tweet text and screen name to the screen, we can also dump the JSON result directly to a file.
We show this next.
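
The printed index jumps from 3 to 6 in the output above because `enumerate` counts every tweet on a page, including the non-English ones that are skipped, and it restarts at 0 on each new page. The toy sketch below imitates this with plain Python lists. The `toy_pages` helper and the sample data are hypothetical stand-ins and do not show how Tweepy's Cursor is implemented.

```python
def toy_pages(items, page_size, max_pages):
    """Yield up to max_pages lists of at most page_size items each."""
    for page_number in range(max_pages):
        page = items[page_number * page_size:(page_number + 1) * page_size]
        if not page:
            break
        yield page


# Hypothetical (language, text) pairs standing in for tweets.
tweets = [("en", "a"), ("nl", "b"), ("en", "c"), ("en", "d"), ("fr", "e")]

printed = []
for page in toy_pages(tweets, page_size=3, max_pages=5):
    for index, (lang, text) in enumerate(page):
        if lang != "en":
            continue  # a skipped item still consumes an index
        printed.append((index, text))

print(printed)  # [(0, 'a'), (2, 'c'), (0, 'd')]
```

If you need a running number across all pages, keep a separate counter that you only increase after the language check.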

In [48]:
# We define a file path to store the results as JSON. Make sure the folder 'twitter_search_results' exists 
# or that you specify another path to an existing location. The 'twitter_results_vaccination.json' file will be created in that location.
jsonFilePath='twitter_search_results/twitter_results_vaccination.json'

## we open the result file once for appending the JSON; the 'with' block closes it when we are done
with open(jsonFilePath, 'a', encoding='utf-8') as jsonFile:
    for page in tweepy.Cursor(api.search, q=keywords,
                         count=nr_tweets, include_rts=False, since=start_date, until=end_date).pages(5):
        for status in page:
            ## we get a json object from the result
            json_result = status._json
            ## To save the data, we convert it to a JSON string and write one tweet per line
            jsonFile.write(json.dumps(json_result) + '\n')


You can now open the file *jsonFilePath* in a plain text editor and inspect the structure.
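
To read such a file back in Python, it helps if every tweet is written with `json.dumps` on its own line: calling `str()` on a dictionary produces Python notation (single quotes, `True`, `None`) that `json.loads` cannot parse. The sketch below shows the round trip with two hand-made dictionaries standing in for `status._json`. The sample data and the temporary file path are assumptions for illustration only.

```python
import json
import os
import tempfile

# Hand-made stand-ins for status._json; real tweets have many more fields.
sample_tweets = [
    {"id": 1, "lang": "en", "text": "first example", "user": {"screen_name": "user_a"}},
    {"id": 2, "lang": "en", "text": "second example", "user": {"screen_name": "user_b"}},
]

path = os.path.join(tempfile.mkdtemp(), "sample_tweets.json")

# Write one JSON document per line.
with open(path, "a", encoding="utf-8") as json_file:
    for tweet in sample_tweets:
        json_file.write(json.dumps(tweet) + "\n")

# Read the file back, parsing each non-empty line separately.
with open(path, encoding="utf-8") as json_file:
    loaded = [json.loads(line) for line in json_file if line.strip()]

print(len(loaded), loaded[0]["user"]["screen_name"])
```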

The JSON output contains all kinds of metadata in addition to the tweet itself. We are going to show how you can get these and save the result as a CSV output file. We assume the above imports and credentials and re-use the api we defined.

In order to obtain the relevant information, you need to know the JSON structure of the output. Please consult the Twitter documentation to understand the structure.

https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
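
As a rough illustration of that structure, the sketch below navigates a hand-made and heavily shortened dictionary that mimics a few of the fields used later in this notebook. The values are invented, and a real tweet object contains many more keys.

```python
# Hand-made, heavily shortened stand-in for one tweet's JSON.
sample_status = {
    "id": 1234567890,
    "lang": "en",
    "text": "Example text #autism @some_user",
    "user": {"screen_name": "example_author", "location": "Amsterdam"},
    "entities": {
        "hashtags": [{"text": "autism"}],
        "user_mentions": [{"screen_name": "some_user"}],
    },
    "place": None,  # 'place' can be empty
}

# Nested values are reached by chaining keys.
author = sample_status["user"]["screen_name"]
hashtags = [item["text"] for item in sample_status["entities"]["hashtags"]]
mentions = [item["screen_name"] for item in sample_status["entities"]["user_mentions"]]

# 'place' may be None, so guard before indexing into it.
place = sample_status.get("place")
bounding_box = place["bounding_box"]["coordinates"] if place else None

print(author, hashtags, mentions, bounding_box)
```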

We are going to define a number of columns for a CSV file to store the result:

In [49]:
COLS = ['id', 'created_at', 'source', 'tweet_text', 'lang',
'favorite_count', 'retweet_count', 'original_author', 'hashtags',
'user_mentions', 'place', 'place_coord_boundaries']

The next code shows how we obtain the above data from the JSON structure and store it in a single CSV file. For illustration, we are going to use the same settings and API as before.

To create the output as CSV data, we are going to use the Pandas package: https://pandas.pydata.org
Please follow the instructions to install pandas locally:

* conda install pandas
* python -m pip install --upgrade pandas

Consult the documentation to learn more about its functionality. Here we are going to use it to convert our list of features for a tweet to CSV format.

We need to import *os* for writing to a file and *pandas* (after the install) for dealing with the data structure. Take your time to study the next bit of code so that you understand the individual steps.

In [50]:
import os
import pandas as pd

#set two date variables for date range
start_date = '2018-10-01'
end_date = '2018-10-31'

#we define the nr_tweets we want to return 
nr_tweets=10

#finally, we define a query. Note that we can mix hash tags with words and use boolean operators OR and AND
keywords='#autism AND vaccines OR medicine AND children'

# We first define a data frame that we name 'all_tweets_dataframe' with pandas imported as 'pd' using the columns list that we defined before.
# Basically, we tell pandas what data will be stored.
all_tweets_dataframe = pd.DataFrame(columns=COLS)

for page in tweepy.Cursor(api.search, q=keywords,
                     count=nr_tweets, include_rts=False, since=start_date).pages(50):
    # now we have the tweets, we are going to obtain the features from the json and store them in the right order for our data frame
    for status in page:
        ## new_entry is going to contain the data 
        new_entry = []
        
        ## we get a json object from the result
        status = status._json
        
        ## check whether the tweet is in English; otherwise skip to the next tweet
        if status['lang'] != 'en':
            continue

        text=status['text']
        
        #new entry append in the order of the data frame
        new_entry += [status['id'], 
                      status['created_at'],
                      status['source'], 
                      status['text'],
                      status['lang'],
                      status['favorite_count'], 
                      status['retweet_count']]

        #to append original author of the tweet
        new_entry.append(status['user']['screen_name'])

        # hashtags and mentions are saved as comma-separated strings
        hashtags = ", ".join([hashtag_item['text'] for hashtag_item in status['entities']['hashtags']])
        new_entry.append(hashtags)
        mentions = ", ".join([mention['screen_name'] for mention in status['entities']['user_mentions']])
        new_entry.append(mentions)

        #get location of the tweet if possible
        try:
            location = status['user']['location']
        except TypeError:
            location = ''
        new_entry.append(location)

        try:
            coordinates = [coord for loc in status['place']['bounding_box']['coordinates'] for coord in loc]
        except TypeError:
            coordinates = None
        new_entry.append(coordinates)

        # We now completed appending all the possible values for this tweet.
        # We use the pandas framework imported as 'pd' to create a dataframe from the aggregated data in new_entry
        # We need to provide the columns COLS to tell pandas what value belongs to what.
        # Note that the data need to be aggregated in the same order as the names in COLS, otherwise values will get mixed up
        single_tweet_dataframe = pd.DataFrame([new_entry], columns=COLS)
        
        # single_tweet_dataframe now contains the data for a single tweet
        # next we add it to the data frame for all tweets 'all_tweets_dataframe'
        # check the pandas documentation if you want to know what ignore_index=True does to the data aggregation
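        # (DataFrame.append was removed in pandas 2.0, so we use pd.concat, which does the same here)
        all_tweets_dataframe = pd.concat([all_tweets_dataframe, single_tweet_dataframe], ignore_index=True)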
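
Growing a data frame one row at a time copies the data on every step. A common alternative, sketched below, is to collect the rows in a plain Python list and build the data frame once after the loop. The shortened column list `DEMO_COLS` and the sample rows are hypothetical and only serve as illustration.

```python
import pandas as pd

# Shortened, hypothetical column list and rows for illustration only.
DEMO_COLS = ["id", "tweet_text", "lang"]
sample_rows = [(1, "first", "en"), (2, "tweede", "nl"), (3, "third", "en")]

rows = []
for tweet_id, text, lang in sample_rows:
    if lang != "en":
        continue  # same language filter as in the notebook
    # Values must be appended in the same order as the column names.
    rows.append([tweet_id, text, lang])

# Build the data frame once, after the loop.
demo_dataframe = pd.DataFrame(rows, columns=DEMO_COLS)
print(demo_dataframe.shape)  # (2, 3)
```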

Our data frame is basically a table with columns and rows. We use the *shape* attribute to ask for the number of rows and columns.

In [51]:
print(all_tweets_dataframe.shape)

(145, 12)


Through the *pandas* framework, we can now save it to a CSV file.

In [52]:
# We define a file path to store the results as CSV. Make sure the folder 'twitter_search_results' exists 
# or that you specify another path to an existing location. The 'twitter_results_vaccination.csv' file will be created in that location.
csvFilePath='twitter_search_results/twitter_results_vaccination.csv'

# we now open the csvFile for writing our result; the 'with' block closes the file afterwards
with open(csvFilePath, "w", encoding="utf-8", newline="") as csvFile:
    all_tweets_dataframe.to_csv(csvFile, columns=COLS, index=False)

## End of this notebook