# Natural language processing part 4:
# Obtaining Twitter data

## Lecture objectives
* Learn how to retrieve Twitter data using the Twitter API

We reviewed topic modeling in the previous lecture. Here and in the next lecture, we'll focus on another common Natural Language Processing tool: sentiment analysis. In short, sentiment analysis tries to understand whether a snippet of text (e.g. a tweet, a review, or a sentence from an article) is positive, negative, or neutral.

We'll apply sentiment analysis to some Twitter data on public transportation.

If you want to access the Twitter data yourself, you'll need to [sign up for a Twitter developer account](https://developer.twitter.com/en/docs/developer-portal/overview), which will give you API credentials. By default, you can only search tweets from the past 7 days, but the academic research product gives you access to the full archive and also provides geographic filters (e.g. to search for tweets within a bounding box).

However, you can also just watch this lecture, and resume in interactive mode in the subsequent lecture.

For a thorough treatment of obtaining, analyzing, and interpreting Twitter data, check out [*Twitter as Data*](https://www.cambridge.org/core/elements/twitter-as-data/27B3DE20C22E12E162BFB173C5EB2592) by Prof. Zachary Steinert-Threlkeld here in the Luskin School of Public Affairs.

## Using the Twitter API
The `tweepy` library provides easy access to Twitter data, once you [register for an API key](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api). You can enter your credentials here, or just follow along for the time being.
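
One way to avoid pasting the token directly into a notebook is to read it from an environment variable. This is only a sketch: the helper `load_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are assumptions made for illustration, not part of `tweepy` or of Twitter's setup instructions.

```python
import os


def load_bearer_token(var_name='TWITTER_BEARER_TOKEN'):
    """Return the token stored in an environment variable.

    The default variable name is hypothetical; use whatever name you chose
    when you exported the token in your shell.
    """
    token = os.environ.get(var_name, '')
    if not token:
        # fail early with a readable message instead of a confusing API error
        raise RuntimeError(f'Set the {var_name} environment variable first')
    return token
```

The returned string could then be passed along as `tweepy.Client(bearer_token=load_bearer_token())`, which keeps the credential out of any notebook you share.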

The following code snippets are adapted from *Twitter as Data* or the `tweepy` documentation.

In [None]:
import tweepy

# log in via codes provided by Twitter
bearer_token=''
client = tweepy.Client(bearer_token=bearer_token)

Now we have our `client` object that has several methods.

For example, we can search for keywords and get back the most recent matching tweets.

In [None]:
miami = client.search_recent_tweets('transit miami', max_results=10).data
miami

Let's get up to 100 tweets about transit in each of three cities. We'll also add `-is:retweet` to the query to exclude retweets.

To get more than 100 tweets, we would need to use the `Paginator`, as explained in the [tweepy docs](https://docs.tweepy.org/en/stable/v2_pagination.html). 
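
To make the idea of pagination concrete, here is a minimal sketch of what a paginator does behind the scenes: request a page, read `next_token` from the response metadata, and ask again until enough tweets have been collected. The helper name `collect_tweets` and its `limit` and `page_size` arguments are hypothetical. In practice, the `Paginator` described in the tweepy docs linked above handles this loop for you.

```python
def collect_tweets(client, query, limit=300, page_size=100):
    """Page through recent-search results until `limit` tweets are collected.

    `client` is assumed to be a tweepy.Client (or any object with a
    compatible search_recent_tweets method).
    """
    tweets = []
    token = None  # None requests the first page
    while len(tweets) < limit:
        response = client.search_recent_tweets(
            query, max_results=page_size, next_token=token
        )
        tweets.extend(response.data or [])  # .data is None for an empty page
        token = (response.meta or {}).get('next_token')
        if token is None:  # no further pages are available
            break
    return tweets[:limit]
```

For example, `collect_tweets(client, 'transit miami -is:retweet', limit=250)` would issue up to three requests of 100 tweets each and keep the first 250 results.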

In [None]:
miami   = client.search_recent_tweets('transit miami -is:retweet', max_results=100).data
chicago = client.search_recent_tweets('transit chicago -is:retweet', max_results=100).data
toronto = client.search_recent_tweets('transit toronto -is:retweet', max_results=100).data

We get a list of tweets for each city. 
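
The three near-identical searches can also be written as a loop, which makes it easier to add more cities later. The sketch below is one possible way to do it: the function name `search_cities` and its arguments are hypothetical, and `client` is assumed to be the `tweepy.Client` created earlier. It also guards against a search with no matches, in which case `.data` is `None` rather than an empty list.

```python
def search_cities(client, cities, topic='transit', max_results=100):
    """Return a dict mapping each city to a list of matching tweets."""
    results = {}
    for city in cities:
        # same query pattern as above: keywords plus the retweet filter
        query = f'{topic} {city} -is:retweet'
        response = client.search_recent_tweets(query, max_results=max_results)
        results[city] = response.data or []  # fall back to an empty list
    return results
```

Calling `search_cities(client, ['miami', 'chicago', 'toronto'])` would return a dictionary with one list of tweets per city.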

In [None]:
type(miami)

Each element of the list is a tweepy Tweet object.

In [None]:
type(miami[0])

What can we do with a tweet? Typing `miami[0].` and pressing Tab in Jupyter lists the available attributes and methods. Among them is the `text` attribute, which seems useful if we want to get the text.

In [None]:
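# in Jupyter, typing `miami[0].` and pressing Tab lists these names interactively
[name for name in dir(miami[0]) if not name.startswith('_')]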

In [None]:
miami[0].text

So let's use a list comprehension to get the text of each tweet, from our list of tweets.

In [None]:
miami   = [tweet.text for tweet in miami]
chicago = [tweet.text for tweet in chicago]
toronto = [tweet.text for tweet in toronto]
miami[0]

Now let's save these tweets to a file. We'll use a pickle, which can save most Python objects in their original format. (We could also have looped over the list of tweets and saved them as text.)

In [None]:
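import os
import pickle

# make sure the output folder exists before writing to it
os.makedirs('data/tweets', exist_ok=True)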
with open('data/tweets/miami.pickle', 'wb') as f:
    pickle.dump(miami, f)
with open('data/tweets/chicago.pickle', 'wb') as f:
    pickle.dump(chicago, f)
with open('data/tweets/toronto.pickle', 'wb') as f:
    pickle.dump(toronto, f)
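
As a sketch of the text-based alternative mentioned above, the lists could also be written as JSON. Unlike one-tweet-per-line plain text, JSON keeps each tweet intact even when it contains line breaks, and the file can be read outside Python. The function names and the file path in the usage note are hypothetical.

```python
import json
import os


def save_tweets_json(tweets, path):
    """Write a list of tweet strings to a JSON file, creating folders as needed."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps emoji and accented characters readable
        json.dump(tweets, f, ensure_ascii=False, indent=2)


def load_tweets_json(path):
    """Read the list of tweet strings back from a JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

For instance, `save_tweets_json(miami, 'data/tweets/miami.json')` would store the Miami tweets, and `load_tweets_json` with the same path would return them as a list of strings.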

We'll pick up these data in the next lecture and see how to analyze the sentiment of the tweets.

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Twitter has a powerful API that is relatively easy to use. Twitter also recently released a <a href="https://zacharyst.com/2021/01/27/initial-thoughts-on-twitters-academic-accounts/">new product for academic research</a>.</li>
  <li>Twitter is not representative. Whether that matters depends on your particular project and use case.</li>
</ul>
</div>