# Tutorial - Tweepy

### The package Tweepy

This tutorial shows how to capture Twitter data using the package **Tweepy**. This package is not included in the Anaconda distribution. It can be installed from the shell, with `conda install -c conda-forge tweepy`. I assume that this step has already been completed.

Tweepy uses the **Twitter Search API**, version v1.1. The data are extracted by querying the site `api.twitter.com/1.1/search/tweets.json`. The data format is JSON, which Tweepy converts to Python lists and dictionaries.
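To get a feeling for that conversion, the sketch below uses the standard-library `json` module on a made-up, heavily shortened payload. The payload is an illustrative stand-in, not actual Twitter output; real responses carry many more fields.

```python
import json

# Hypothetical, shortened payload (an assumption for illustration only).
payload = '{"statuses": [{"id": 1, "full_text": "hello"}, {"id": 2, "full_text": "world"}]}'

data = json.loads(payload)     # JSON object -> dict, JSON array -> list
statuses = data["statuses"]    # a list of dictionaries, one per status
texts = [s["full_text"] for s in statuses]
print(texts)
```

Tweepy performs an equivalent step for you, so in practice you rarely need to call `json.loads` yourself.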

Two relevant limitations of the Standard Search API are:

* It is limited to the last seven days, starting yesterday.

* Simple GET requests are limited to 100 tweets. Tweepy overcomes this by making repeated calls, which means that you have to wait a bit if you ask for a high number of tweets.
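The cost of the second limitation can be estimated with a bit of arithmetic. This is a minimal sketch, assuming that every call returns a full page of 100 tweets; the actual page size may be smaller, so treat the result as a lower bound. The function name `calls_needed` is mine, not part of Tweepy.

```python
import math

TWEETS_PER_CALL = 100  # assumption: every request returns a full page

def calls_needed(n_tweets):
    """Lower bound on the number of requests needed to collect n_tweets."""
    return math.ceil(n_tweets / TWEETS_PER_CALL)

print(calls_needed(250))
```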

### Capturing tweets with Tweepy

The first step is to enter the four **tokens** required for the **authentication**. The tokens used here come from an application which I created at the Twitter Applications website, specifically for teaching. Replace them with your own set, to avoid conflicts.

In [1]:
consumer_key = '9PQLhrb4ZUvte5Cb2VZRa7wce'
consumer_secret = 'Xxkae5otxok1j30dwPNLgScJKzvLqSKJcvU77D13azYCDmoO1D'
access_key = '2795394087-drbIfGPKwFZTfiNIKYXUqUrUvg1ypC2A7DfdXal'
access_secret = 'FfN5Up4WtArKoLlitGUC2ukXKElmyA7paMpYphVxGQhlP'

Next, we import Tweepy:

In [2]:
import tweepy

In Tweepy, the authentication is carried out as follows:

In [3]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)

The next step is to set the API, that is, the way Tweepy will send the requests.

In [4]:
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

What is the meaning of `wait_on_rate_limit=True`? The **rate limit** is the maximum number of calls allowed in a certain number of minutes. For the Twitter Standard Search API, it is set at 180 calls per 15 minutes. This is a defense mechanism of Twitter against greedy tweet collectors. The solution is to wait for those 15 minutes to lapse, and then call again. Adding the argument `wait_on_rate_limit_notify=True` allows you to learn whether this is happening. 
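To see when the rate limit starts to matter, here is a rough back-of-the-envelope sketch. It assumes the figures quoted above (180 calls per 15 minutes) and a full page of 100 tweets per call; the function name `forced_wait_minutes` is hypothetical, and the result ignores the time taken by the requests themselves.

```python
import math

CALLS_PER_WINDOW = 180   # rate limit quoted in the text
WINDOW_MINUTES = 15
TWEETS_PER_CALL = 100    # assumption: every call returns a full page

def forced_wait_minutes(n_tweets):
    """Rough lower bound on the minutes spent waiting for rate-limit resets."""
    calls = math.ceil(n_tweets / TWEETS_PER_CALL)
    # Every window except the last one must be waited out in full.
    windows_to_wait = max(math.ceil(calls / CALLS_PER_WINDOW) - 1, 0)
    return windows_to_wait * WINDOW_MINUTES

print(forced_wait_minutes(250), forced_wait_minutes(50000))
```

Under these assumptions, a small request such as the 250 tweets collected in this tutorial never hits the limit, while anything above 18,000 tweets forces at least one pause.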

The next step is to set a **Tweepy cursor**. Here you specify a collection of parameters which define the focus and scope of the search. Suppose that we are interested in the tweets that mention the company American Airlines, more specifically, in those that include any of the strings '@AmericanAir' and '#americanair'. 

We set a Tweepy cursor as:

In [5]:
twcursor = tweepy.Cursor(api.search, q='@AmericanAir OR #americanair',
  since='2021-10-18', until='2021-10-19', result_type='recent',
  lang='en', tweet_mode='extended').items(250)

Let me review these parameters:

* `q` is the **search string**. The Twitter rules for these strings are easy. If the search string consists of a single word, you get the tweets whose text contains that word. If it has more than one word, (some of) the rules are: (a) `'word1 word2'` means containing both words, (b) `'word1 OR word2'` means containing at least one of them, and (c) `'"word1 word2"'` means containing the exact phrase.

* `lang` is the language, as far as the Twitter algorithm can identify it.

* `result_type` sets the priority for the search. The possible values are `'mixed'`, `'recent'` and `'popular'`, and the default is `result_type='mixed'`. This argument is relevant only for a limited search.

* `tweet_mode='extended'` means that you wish to capture the complete text, beyond the 140-character limit of the old Twitter. If you omit this argument, the text will come truncated.

* `until` sets the date limit, so here the search is limited to tweets posted before 2021-10-19. Having set `result_type='recent'`, the first tweets captured will be those posted late in the evening of October 18th. The date has to be written as `yyyy-mm-dd`. Note that this service only covers a range of 7 days, ending the day before the current date. So, you cannot get these tweets (through this API) if you try too late.

* `since='2021-10-18'` sets the earliest date covered by the search, and `items(250)` caps the number of statuses collected. Together, they bound the number of repeated calls.
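The three rules for the search string can be sketched with a few small helpers. These functions (`all_of`, `any_of`, `exact`) are hypothetical names introduced only for illustration; they just build strings and are not part of Tweepy or of the Twitter API.

```python
def all_of(*terms):
    """Rule (a): tweets containing every term."""
    return " ".join(terms)

def any_of(*terms):
    """Rule (b): tweets containing at least one of the terms."""
    return " OR ".join(terms)

def exact(phrase):
    """Rule (c): tweets containing the exact phrase."""
    return f'"{phrase}"'

q = any_of("@AmericanAir", "#americanair")
print(q)
```

The resulting string is the same one passed as `q` in the cursor above.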

This cursor is a Python **iterator**, that is, an object that can be iterated upon. When used, the iterator extracts the statuses, one at a time.

In [6]:
type(twcursor)

tweepy.cursor.ItemIterator
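The lazy behavior of an iterator can be illustrated without touching the network. In this sketch, the generator `fake_statuses` is a made-up stand-in for the cursor, yielding invented statuses one at a time; it only mimics the iteration protocol, not what Tweepy does internally.

```python
def fake_statuses(n):
    """Stand-in for a cursor: yields one made-up status at a time."""
    for i in range(n):
        yield {"id": i, "full_text": f"tweet number {i}"}

it = fake_statuses(3)      # nothing has been produced yet
first = next(it)           # pulls a single item
rest = [s for s in it]     # consumes the remaining items
leftover = list(it)        # the iterator is now exhausted
print(first["id"], len(rest), leftover)
```

The last line is a reminder that an iterator can be traversed only once, which is why the statuses are stored in a list as soon as they are collected.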

So far, the iteration has not yet been performed. The next step is to run it and store the JSON documents in a list. This step can take a while, depending on your connection and on the number of statuses to be collected.

In [7]:
twlist = [t._json for t in twcursor]

Now I have the data in a list which is available as long as my current session lasts. Each element of the list is a dictionary containing information about one tweet. This list of dictionaries is what Python extracts from the JSON document returned by the `api.twitter.com` page.

I can save these data right now, or extract from them the information that I am interested in, which is what I will do below. Mind that Twitter gives you not only the text of the tweets, but also a lot of metadata.
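If you prefer to save the raw data first, one possible approach is to dump the list to a JSON file. This is a minimal sketch: the list `sample` is a made-up stand-in for `twlist`, and the file name `tweets_sample.json` is hypothetical.

```python
import json
import os
import tempfile

# Made-up stand-in for twlist (an assumption for illustration).
sample = [{"id": 1, "full_text": "first"}, {"id": 2, "full_text": "second"}]

# Hypothetical file name, placed in the temporary directory.
path = os.path.join(tempfile.gettempdir(), "tweets_sample.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

# Reading the file back recovers the same list of dictionaries.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)
```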

### Organizing the data

Let us see how many tweets have been captured:

In [8]:
len(twlist)

250

Since each element of the list is a dictionary, we can take a look at the keys:

In [9]:
twlist[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'metadata', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])

A tweet is called a **status** in the Twitter APIs. The elements of these dictionaries are not exactly the same for all statuses, which makes the job a bit harder. For instance, the first tweet is a retweet, so the corresponding dictionary comes with the element `retweeted_status`, which contains information about the original tweet.
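A quick way to find out which keys occur only in some statuses is to compare the key sets. The sketch below uses two invented dictionaries in place of `twlist`; with the real list, the same set operations would apply.

```python
# Made-up stand-in for twlist: a retweet and a plain tweet.
sample = [
    {"id": 1, "full_text": "RT ...", "retweeted_status": {"id": 0}},
    {"id": 2, "full_text": "plain tweet"},
]

all_keys = set().union(*(s.keys() for s in sample))         # keys seen anywhere
common_keys = set.intersection(*(set(s) for s in sample))   # keys seen everywhere
optional_keys = all_keys - common_keys                      # keys seen only sometimes
print(optional_keys)
```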

A list of dictionaries can be directly converted to a data frame by means of the Pandas function `DataFrame`. Even if you decide that the resulting data frame contains too many things, this is a good way to learn about the contents, since you can use the Pandas tools in the exploration.

In [10]:
import pandas as pd
twdf = pd.DataFrame(twlist)
twdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 31 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   created_at                 250 non-null    object 
 1   id                         250 non-null    int64  
 2   id_str                     250 non-null    object 
 3   full_text                  250 non-null    object 
 4   truncated                  250 non-null    bool   
 5   display_text_range         250 non-null    object 
 6   entities                   250 non-null    object 
 7   metadata                   250 non-null    object 
 8   source                     250 non-null    object 
 9   in_reply_to_status_id      125 non-null    float64
 10  in_reply_to_status_id_str  125 non-null    object 
 11  in_reply_to_user_id        154 non-null    float64
 12  in_reply_to_user_id_str    154 non-null    object 
 13  in_reply_to_screen_name    154 non-null    object 

Missing values are frequent in this context. There are keys which always occur in the status but rarely have a value, like the keys related to geographical data. And there are keys which only occur in some statuses, like `retweeted_status`. This element contains all the information of the original tweet (`created_at`, `id`, etc). Missingness indicates that the tweet is not a retweet, which is a piece of information you are probably interested in. Also, note that the data type has been adapted to the inclusion of `NaN` values in the columns with missing data. For instance, `possibly_sensitive` was originally Boolean, but it comes as type `object` in this data frame.

To keep it simple, the data collection will be limited to the status ID, the time it was posted, a dummy for being a retweet and the text of the tweet. The first three fields can be extracted from `twdf` in a trivial way, but the text takes a bit more work.

We can extract these fields from the data frame `twdf`, or set aside the data frame, which has been used here just for a quick exploration, and work instead with the list `twlist`. I use the list here, since the text is easier to handle that way.

The status ID and the creation time are easily extracted:

In [11]:
id = [t['id'] for t in twlist]
id[:10]

[1450250186811400193,
 1450250033673105410,
 1450249976496500740,
 1450249849870303235,
 1450249812889190403,
 1450249631414329346,
 1450249583410524160,
 1450249559112880129,
 1450249501260738561,
 1450249450748661762]

In [12]:
created_at = [t['created_at'] for t in twlist]
created_at[:10]

['Mon Oct 18 23:59:23 +0000 2021',
 'Mon Oct 18 23:58:47 +0000 2021',
 'Mon Oct 18 23:58:33 +0000 2021',
 'Mon Oct 18 23:58:03 +0000 2021',
 'Mon Oct 18 23:57:54 +0000 2021',
 'Mon Oct 18 23:57:11 +0000 2021',
 'Mon Oct 18 23:56:59 +0000 2021',
 'Mon Oct 18 23:56:53 +0000 2021',
 'Mon Oct 18 23:56:40 +0000 2021',
 'Mon Oct 18 23:56:28 +0000 2021']
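These timestamps come as plain strings. If you need proper datetime values, for sorting or filtering, one option is `datetime.strptime` with a format string matching the layout shown above. This is a sketch that assumes the default (English) locale for day and month names.

```python
from datetime import datetime, timezone

raw = 'Mon Oct 18 23:59:23 +0000 2021'

# %a = weekday name, %b = month name, %z = UTC offset such as +0000
parsed = datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')
iso = parsed.isoformat()
print(iso)
```

The parsed value is timezone-aware, so it can be compared safely with other aware datetimes.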

Note that the data collection is based on recency, starting right before 2021-10-19 00:00:00 and moving backward in time. Now, the dummy for being a retweet. You have to flag the statuses that have the key `'retweeted_status'`, so you can use:

In [13]:
is_retweet = [int('retweeted_status' in t.keys()) for t in twlist]

In [14]:
is_retweet[:10]

[1, 1, 0, 0, 0, 0, 1, 1, 0, 1]

With the same approach used to build `id` and `created_at`, you could get a list for the key `'full_text'`. Let me call this list `text`:

In [15]:
text = [t['full_text'] for t in twlist]
text[:5]

['RT @RealCandaceO: Flying @delta exclusively now because they don’t force-vax their employees. \nBarring one pre-planned flight, I am done wi…',
 'RT @RealCandaceO: Flying @delta exclusively now because they don’t force-vax their employees. \nBarring one pre-planned flight, I am done wi…',
 '@AmericanAir how the fuck are you guys even a company still? My plane sitting on the runway for 30 min then they tell us they have to pull back up to the gate to do some maintenance? I’m going to miss my connecting flight and not get to see my daughter because of you cucks.',
 '@AmericanAir Why did I receive a call back for being in line with AAdvantage CS but have still been on hold for over an hour? What’s the point of the call back feature if you’re not available as intended by the feature…??? https://t.co/lhlqo4oRrP',
 '@CPU84 @AmericanAir Sadly that information would have helped. My mistake.   The word of @AmericanAir awfulness will be spread far and wide from now on.']

As this exploration illustrates, though the Twitter API calls this `full_text`, we do not find the full text for all tweets. What is the rule? For the tweets that are not retweets, the full text is there. Among the retweets, the text is truncated when it has more than 140 characters. Let us see how to get out of the mess.

Since the complete text of a retweet comes in the element `retweeted_status`, the full text can be extracted from there. We can manage this by adding a for loop with an if statement to our code, so the text of the retweets is replaced by that of the original tweet. 

In [16]:
for i in range(len(twlist)):
  if is_retweet[i] == 1:
    text[i] = twlist[i]['retweeted_status']['full_text']

Now, the text that was truncated in the `'full_text'` element is complete. Note, in the example below, that the header 'RT @RealCandaceO:' has disappeared, since it was not in the original tweet. So the screen name of the original author is lost. If you wish to capture these names, you can extract them from the `'user'` element of the `'retweeted_status'` element, and store them as a different field.

In [17]:
text[0]

'Flying @delta exclusively now because they don’t force-vax their employees. \nBarring one pre-planned flight, I am done with @AmericanAir and @SouthwestAir until they stop violating the basic human rights of their pilots &amp; attendants. \n\nOnly slaves are without choice. #FreedomFlu https://t.co/hfsUQvlxW1'
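Capturing the screen name of the original author, as suggested above, could look like the following sketch. The list `sample` is an invented stand-in for `twlist`, and the name `original_author` is mine; the sketch assumes that the `'user'` element carries a `'screen_name'` key.

```python
# Made-up stand-in for twlist: one retweet and one plain tweet.
sample = [
    {"id": 1, "full_text": "RT @someone: hi",
     "user": {"screen_name": "retweeter"},
     "retweeted_status": {"full_text": "hi", "user": {"screen_name": "someone"}}},
    {"id": 2, "full_text": "plain", "user": {"screen_name": "author2"}},
]

# Screen name of the original author for retweets, None otherwise.
original_author = [
    t["retweeted_status"]["user"]["screen_name"] if "retweeted_status" in t else None
    for t in sample
]
print(original_author)
```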

I can pack these lists as the columns of a data frame, with `id` as the index (this is optional):

In [18]:
twdf = pd.DataFrame({'created_at': created_at, 'text': text,
  'is_retweet': is_retweet}, index=id)

In [19]:
twdf.head()

| | created_at | text | is_retweet |
|---|---|---|---|
| 1450250186811400193 | Mon Oct 18 23:59:23 +0000 2021 | Flying @delta exclusively now because they don... | 1 |
| 1450250033673105410 | Mon Oct 18 23:58:47 +0000 2021 | Flying @delta exclusively now because they don... | 1 |
| 1450249976496500740 | Mon Oct 18 23:58:33 +0000 2021 | @AmericanAir how the fuck are you guys even a ... | 0 |
| 1450249849870303235 | Mon Oct 18 23:58:03 +0000 2021 | @AmericanAir Why did I receive a call back for... | 0 |
| 1450249812889190403 | Mon Oct 18 23:57:54 +0000 2021 | @CPU84 @AmericanAir Sadly that information wou... | 0 |


Now that the data are in tabular form, they can be exported to an Excel file, or a database server.
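As an illustration of the database route, here is a minimal sketch with the standard-library `sqlite3` module and an in-memory database. The table name `tweets`, the column layout and the two rows are assumptions made for the example; a real database server would need its own driver and connection settings.

```python
import sqlite3

# Made-up rows with the same four fields as the data frame above.
rows = [
    (1450250186811400193, 'Mon Oct 18 23:59:23 +0000 2021', 'sample text A', 1),
    (1450249976496500740, 'Mon Oct 18 23:58:33 +0000 2021', 'sample text B', 0),
]

conn = sqlite3.connect(":memory:")   # in-memory database, nothing is written to disk
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, created_at TEXT, "
    "text TEXT, is_retweet INTEGER)"
)
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Simple check: count the retweets stored in the table.
n_retweets = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE is_retweet = 1"
).fetchone()[0]
print(n_retweets)
```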

### Homework

Write a script for capturing the text of all the tweets posted yesterday that mention the company Huawei, in such a way that your code can be used any day. To get this, import the package `datetime` and define:

`today = datetime.date.today()`

`yesterday = datetime.date.today() - datetime.timedelta(1)`

Use these dates to set the parameters `since` and `until` in the Tweepy search. Include a line in your script to save the data to a CSV file in such a way that the date on which the tweets were posted is part of the filename.