## Collecting tweets using the Twitter API


In this section we are going to see how to connect to the Twitter API to collect tweets and save them.

"In computer programming, an **Application Programming Interface (API)** is a set of subroutine definitions, protocols, and tools for building application software." [wikipedia](https://en.wikipedia.org/wiki/Application_programming_interface)

The Twitter API is the interface we use to collect tweets from Twitter.

The [Twitter APIs](https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api) have different endpoints that allow one to perform different actions, such as:
- Accessing a roughly 1% random sample of publicly available Tweets in real-time (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/overview).

- Searching among historical tweets (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview).

To use the Twitter API from Python, we will use the library [tweepy](http://www.tweepy.org/), which facilitates access to the API.

To install it, run one of the following commands in your terminal or execute the cell below:

Installation with pip:
```
pip install tweepy
```

Installation with conda:
```
conda install -c conda-forge tweepy
```

In [None]:
# this will install tweepy on your machine
!pip install tweepy

To use the Twitter APIs, you need to apply for a developer account, then create a [project](https://developer.twitter.com/en/docs/projects/overview) and an [app](https://developer.twitter.com/en/docs/apps/overview).

Follow the instructions here: https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api

Once you have created an app, create a new file named `keys.json` in the lesson's folder and copy your *Consumer Keys* (*API Key* and *API Secret Key*) and *Authentication Tokens* (*Access Token* and *Access Token Secret*) into it, under the field names `api_key`, `api_secret_key`, `access_token` and `access_token_secret` that the code in this lesson reads.
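
A minimal sketch of the layout of `keys.json`, inferred from the field names that the loading and authentication cells of this lesson read. The values are placeholders, not real credentials, and the variable names `keys_template` and `template_text` are only illustrative:

```python
import json

# Field names match those read by the authentication code in this lesson.
# The values are placeholders: replace them with the credentials of your own app.
keys_template = {
    "api_key": "YOUR_API_KEY",
    "api_secret_key": "YOUR_API_SECRET_KEY",
    "access_token": "YOUR_ACCESS_TOKEN",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

# JSON text with the structure expected in keys.json
template_text = json.dumps(keys_template, indent=4)
print(template_text)
```

The printed text can be pasted into `keys.json` and the placeholder values replaced by hand.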

It is important to keep your keys private and secure. See https://developer.twitter.com/en/docs/authentication/guides/authentication-best-practices

In [None]:
import json
with open('keys.json', 'r') as fopen:
    keys = json.load(fopen)
# print(keys)
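
Before authenticating, it can help to check that no credential is missing from the loaded dictionary. The sketch below assumes the four field names used in this lesson; the helper `missing_fields` and the `example` dictionary are hypothetical and not part of tweepy:

```python
# Field names that the authentication code of this lesson expects
REQUIRED_FIELDS = {"api_key", "api_secret_key", "access_token", "access_token_secret"}


def missing_fields(keys):
    """Return the sorted list of required fields that are absent or empty."""
    return sorted(name for name in REQUIRED_FIELDS if not keys.get(name))


# Hypothetical incomplete credentials, used only to illustrate the check
example = {"api_key": "abc", "api_secret_key": "def", "access_token": ""}
print(missing_fields(example))  # → ['access_token', 'access_token_secret']
```

Calling `missing_fields(keys)` on the dictionary loaded from `keys.json` should return an empty list when the file is complete.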

### Authenticate with the Twitter API


In [None]:
import tweepy

auth = tweepy.OAuthHandler(keys['api_key'], keys['api_secret_key'])
auth.set_access_token(keys['access_token'], keys['access_token_secret'])

# create the api object that we will use to interact with Twitter
api = tweepy.API(auth)

In [None]:
# example of an action: post a tweet from the authenticated account
tweet = api.update_status('Hey @BovetAlexandre!')

In [None]:
# the returned object is a tweepy Status object
type(tweet)

The tweet object exposes all the attributes described in the [tweet data dictionary](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet).

In [None]:
print('Tweet text: ', tweet.text)
print('Tweet author: ', tweet.author.screen_name)
print('Tweet creation time: ', tweet.created_at)

In [None]:
# It also contains a JSON version of the tweet object
tweet._json
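
Since the JSON version is a plain Python dictionary, it can be processed without tweepy. The sketch below uses a hand-written stand-in for `tweet._json`, trimmed to three fields of the v1.1 tweet object; the values and the helper `summarize` are hypothetical:

```python
from datetime import datetime

# Hand-written stand-in for tweet._json with hypothetical values
sample = {
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "text": "An example tweet",
    "user": {"screen_name": "example_user"},
}


def summarize(tweet_json):
    """Extract the text, author and creation time from a tweet dictionary."""
    # created_at is a string such as 'Wed Oct 10 20:19:24 +0000 2018'
    created = datetime.strptime(tweet_json["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "text": tweet_json["text"],
        "author": tweet_json["user"]["screen_name"],
        "created_at": created,
    }


summary = summarize(sample)
print(summary["author"], summary["created_at"].isoformat())
```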

## Collecting tweets from the Streaming API
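Source: https://docs.tweepy.org/en/v3.10.0/streaming_how_to.html

The code in this section follows the Tweepy 3.x interface. `StreamListener` was removed in Tweepy 4.0, so with a newer version installed these cells need to be adapted, or Tweepy 3.10.0 can be installed instead (`pip install tweepy==3.10.0`).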

### Step 1: Creating a StreamListener

This simple stream listener prints the text of each status. The `on_data` method of Tweepy's `StreamListener` conveniently passes data from statuses to the `on_status` method.
We create a class `MyStreamListener` that inherits from `StreamListener` and overrides `on_status`:

In [None]:
#override tweepy.StreamListener to make it print tweet content when new data arrives
class MyStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        print(status.text)

### Step 2: Creating a Stream

Using the `api` object we created and the `StreamListener`, we can create a `Stream` object:

In [None]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)

### Step 3: Starting a Stream

A number of Twitter streams are available through Tweepy. Most cases will use filter, the user_stream, or the sitestream. For more information on the capabilities and limitations of the different streams, see the [Twitter Streaming API Documentation](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/filter-realtime/overview).

In this example we will use `filter` to stream all tweets matching a list of keywords. The `track` parameter is a list of search terms to stream.

In [None]:
# this will start tracking tweets matching any of the keywords below.
# to stop it, interrupt the kernel.
# try with different keywords.
# you have to run the cell below to disconnect the stream before rerunning this one
myStream.filter(track=['#cop26','#ClimateChange', 'climate'])

In [None]:
myStream.disconnect()
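
Twitter's documentation of the `track` parameter describes the entries of the list as combined with a logical OR, while the space-separated terms inside one entry are combined with a logical AND, ignoring case and order. The sketch below is a rough local approximation of that rule, useful for checking a keyword list against example texts. It uses plain substring matching, which is an assumption and simpler than Twitter's own tokenization, and the helper names are hypothetical:

```python
def phrase_matches(phrase, text):
    """True if every space-separated term of `phrase` occurs in `text`, ignoring case."""
    lowered = text.lower()
    return all(term in lowered for term in phrase.lower().split())


def track_matches(track, text):
    """True if at least one phrase of the `track` list matches `text`."""
    return any(phrase_matches(phrase, text) for phrase in track)


# any single keyword is enough for a match
print(track_matches(["#cop26", "#ClimateChange", "climate"], "Talks on CLIMATE policy"))  # → True
# all terms of a phrase must be present, in any order
print(track_matches(["corona virus"], "the virus called corona"))  # → True
print(track_matches(["corona virus"], "Corona beer"))  # → False
```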

In [None]:
myStream.filter(track=['moderna'], languages=['en'])

In [None]:
myStream.disconnect()

In [None]:
# streaming tweets from a given location
# we need to provide a list of longitude,latitude pairs specifying a set of bounding boxes,
# with the southwest corner of each box first
# for example for New York: [west longitude, south latitude, east longitude, north latitude]
myStream.filter(locations=[-74,40,-73,41])

In [None]:
myStream.disconnect()
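
Because the order of the four numbers is easy to mix up, a small helper can build the list from named arguments. This is a sketch: `bounding_box` is a hypothetical helper, not part of tweepy, and it does not handle boxes that cross the 180° meridian:

```python
def bounding_box(west, south, east, north):
    """Return [west, south, east, north], i.e. the southwest corner followed by the northeast corner."""
    if not (-180 <= west < east <= 180):
        raise ValueError("longitudes must satisfy -180 <= west < east <= 180")
    if not (-90 <= south < north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south < north <= 90")
    return [west, south, east, north]


new_york = bounding_box(west=-74, south=40, east=-73, north=41)
print(new_york)  # → [-74, 40, -73, 41]
```

The result can be passed as `locations=new_york`; several boxes can be combined by concatenating their lists.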

### Saving the stream to a file
Let's define a new `StreamListener` that will save the collected data to a file.

In [None]:
# override tweepy.StreamListener to make it save data to a file
# and limit the maximum number of tweets we want to collect
class StreamSaver(tweepy.StreamListener):
    def __init__(self, filename, max_num_tweets=2000, api=None):
        super().__init__(api=api)
        self.filename = filename
        self.num_tweets = 0
        self.max_num_tweets = max_num_tweets

    def on_data(self, data):
        # append the raw JSON string received from the stream to the file
        with open(self.filename, 'a') as tf:
            tf.write(data)

        self.num_tweets += 1

        # print the progress every 100 messages
        if self.num_tweets % 100 == 0:
            print(self.num_tweets)

        # returning False disconnects the stream once the limit is reached
        if self.num_tweets >= self.max_num_tweets:
            return False

    def on_error(self, status_code):
        print(status_code)
        # 420 means we are being rate limited: disconnect instead of retrying
        if status_code == 420:
            return False

In [None]:
# create the new StreamListener and stream object that will save collected tweets to a file
saveStream = StreamSaver(filename='tweets.txt', max_num_tweets=5000)
mySaveStream = tweepy.Stream(auth = api.auth, listener=saveStream)


In [None]:
mySaveStream.filter(track=['coronavirus','covid-19',
                           'covid19','covid_19','corona virus',
                           'covid','vaccines','vaccine'],
                    languages=['en'])


In [None]:
mySaveStream.disconnect()
saveStream.num_tweets
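
Once the collection is finished, the saved file can be read back without tweepy. The sketch below assumes that each message was written on its own line, skips blank or truncated lines, and keeps only messages that have a `text` field, since the stream can also deliver other messages such as limit notices. The helper `read_tweets` is hypothetical:

```python
import json


def read_tweets(filename):
    """Yield one dictionary per tweet stored in a line-delimited JSON file."""
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                message = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or corrupted lines
            if "text" in message:
                yield message
```

For example, `tweets = list(read_tweets('tweets.txt'))` would load the tweets collected above into a list of dictionaries.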