# _Streaming Twitter Data_

When people think of data analysis, one of the first steps tends to be reading in data, for example from a CSV file or by querying a database, and then exploring it. This approach works fine when you are analyzing historical data (e.g. which products a customer at a store has bought and is most likely to purchase next, or the effects of a particular advertising campaign on customers' purchasing patterns). 

However, what if we want to explore social media data? While historical data can provide some value in this realm, we're leaving a lot on the table if we don't consider the continuous stream of data being generated every second of every day on platforms like Twitter, which we'll be focusing on today. This matters especially for coronavirus/COVID-19, where information changes continually, so it is essential that we be able to gather streaming data. Doing so will help us stay up to date with current trends and ensure that our product does not grow stale. 

That being said, how do we access real-time Twitter data, and more specifically, Tweets related to the coronavirus? We use [Twitter's API](https://developer.twitter.com/en), which gives us the ability to [filter Tweets in realtime](https://developer.twitter.com/en/docs/tweets/filter-realtime/overview). 

We can access all the Python tools we need for streaming via the [tweepy](https://github.com/tweepy/tweepy) library, which can be installed via pip using the following command: 
- `pip install tweepy`

Once that's installed we're ready to start! The first step will be to check on the speed of the streaming data. In other words, how fast are we able to stream tweets in? To test this, we'll begin by setting up a very basic tweepy [`StreamListener`](http://docs.tweepy.org/en/latest/streaming_how_to.html) to see how quickly we can gather 1,000 tweets, which we'll also output as a CSV so we can check the content of the data.

In [21]:
import tweepy
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener
import time
import csv
import sys

Now that we have the tools, we're going to put together a custom `StreamListener` for a stream of tweets matching the terms `covid19` & `coronavirus` (the terms themselves are passed in later, when the stream is started). For each Tweet that is not a retweet, the listener will grab a few basic data points, such as the text, when it was created, and the username of the user who tweeted it.

In [22]:
# create a streamer object
class StdOutListener(StreamListener):
    
    def __init__(self, api = None):
        self.api = api
        self.num_tweets = 0
        self.filename = "data_" + time.strftime("%Y%m%d-%H%M%S") + ".csv"
        
        # write a single row with headers of the columns; the with-block
        # closes the file so the header is on disk before any tweet is appended
        with open(self.filename, "w", newline="", encoding="utf-8") as csvfile:
            csv.writer(csvfile).writerow([
                "created_at", "user_id", "user_screenname", "tweet_id", "text"
            ])
    
    # when a tweet appears
    def on_status(self, status):
        # skip retweets
        if "RT @" in status.text:
            return
        
        self.num_tweets += 1
        if self.num_tweets > 1000:
            # once we've gathered 1,000 tweets stop the stream
            return False
        
        try:
            # append the tweet; the with-block closes the file even if writing fails
            with open(self.filename, "a", newline="", encoding="utf-8") as csvfile:
                csv.writer(csvfile).writerow([
                    status.created_at, status.user.id, status.user.screen_name, status.id, status.text
                ])
            if self.num_tweets % 100 == 0:
                print("Number of Tweets gathered: ", self.num_tweets)
        # if some error occurs
        except Exception as e:
            print(e)  # print error and continue
        
        return
    
    # when an error occurs
    def on_error(self, status_code):
        print("Encountered error with status code:", status_code)
        
        # if error code for bad credentials, end the stream
        if status_code == 401:
            return False
        
    # when a deleted tweet appears
    def on_delete(self, status_id, user_id):
        print("Delete notice")
        return
    
    # when the rate limit is reached
    def on_limit(self, track):
        # continue mining tweets
        return True
    
    # when timed out
    def on_timeout(self):
        # write the message to stderr (Python 3 print syntax)
        print("Timeout...", file=sys.stderr)
        time.sleep(10)
        return
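
The open-write-close pattern the listener relies on can be tried out without a Twitter connection. The following is a minimal sketch using only the standard library; the helper names `start_csv`, `append_row`, and `demo` are made up for illustration and are not part of tweepy or of the listener above. It shows that a `with` block closes the file after every write and that the `csv` module quotes tweet text containing commas or line breaks.

```python
import csv
import os
import tempfile

# Same column names as the listener's header row.
HEADERS = ["created_at", "user_id", "user_screenname", "tweet_id", "text"]


def start_csv(filename, headers=HEADERS):
    """Create (or overwrite) the CSV and write the header row."""
    with open(filename, "w", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerow(headers)


def append_row(filename, row):
    """Append one record; the file is closed when the with-block exits."""
    with open(filename, "a", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerow(row)


def demo():
    """Write a header plus one made-up record to a temp file and read it back."""
    path = os.path.join(tempfile.mkdtemp(), "demo.csv")
    start_csv(path)
    append_row(path, ["2020-04-06 23:35:57", 1, "example_user", 2,
                      "line one\nline two, with a comma"])
    with open(path, newline="", encoding="utf-8") as csvfile:
        return list(csv.reader(csvfile))


rows = demo()
print(len(rows))  # header row plus one record
```

Passing `newline=""` to `open` is what the `csv` module documentation recommends, so that embedded line breaks in a field survive the round trip.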

OK, so we have our `StreamListener`, which will gather 1,000 Tweets in real time and append each record to a CSV file. Before we can use it, however, we need one more function, a wrapper, which will set up our connection to the API and then begin the streaming process. 

Now, Twitter doesn't just let anybody stream Tweets; you first have to create a developer account, which then gives you the ability to create an app. Once the app has been approved, you receive four keys and tokens that give you access to Twitter data. This part of the set-up process is outside the scope of this notebook: I have already created an app, so I have a set of keys/tokens available to use for our `StreamListener`. These keys/tokens ARE MEANT TO BE KEPT PRIVATE, as making them publicly available lets anybody read (and potentially write to, depending on permissions) your Twitter account!

To be able to iterate quickly, I will be keeping these keys & tokens in a separate Python script named `twitter_keys_tokens.py` that will never be uploaded to GitHub. Luckily, accessing the data in this script is straightforward, as we can simply import it, similar to how we would import `pandas` or `numpy`. After this, we can go ahead and create our wrapper function, which, given a list of strings, will return Tweets containing those strings. In our case, we'll pass in two terms: `covid19` and `coronavirus`. 

In [23]:
from twitter_keys_tokens import keys_tokens

def start_mining(queries):
    # variables that contain credentials to access Twitter API
    consumer_key = keys_tokens["API_KEY"]
    consumer_secret = keys_tokens["API_SECRET"]
    access_token = keys_tokens["ACCESS_TOKEN"]
    access_secret = keys_tokens["ACCESS_SECRET"]
    
    # create a listener based on class above
    listener = StdOutListener()
    
    # create authorization info
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    
    # create Stream object
    stream = Stream(auth, listener)
    
    # run the stream object, searching for tweets according to search terms and in English
    stream.filter(track=queries, languages=["en"])
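
The contents of `twitter_keys_tokens.py` are deliberately not shown in this notebook. As a rough sketch of one way such a module could be written without hard-coding secrets, the dictionary can be built from environment variables. The dictionary keys match the ones `start_mining` reads; the environment variable names and the `load_keys_tokens` helper are assumptions made for this example.

```python
import os

# Maps the dictionary keys used by start_mining to environment variable
# names. The environment variable names are hypothetical.
ENV_NAMES = {
    "API_KEY": "TWITTER_API_KEY",
    "API_SECRET": "TWITTER_API_SECRET",
    "ACCESS_TOKEN": "TWITTER_ACCESS_TOKEN",
    "ACCESS_SECRET": "TWITTER_ACCESS_SECRET",
}


def load_keys_tokens(environ=None):
    """Build the keys_tokens dict, failing loudly if any value is missing."""
    if environ is None:
        environ = os.environ
    missing = [name for name in ENV_NAMES.values() if name not in environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}


# Example with made-up values (never commit real credentials):
example_env = {name: "placeholder" for name in ENV_NAMES.values()}
example = load_keys_tokens(example_env)
print(sorted(example))
```

In the module itself, a final line such as `keys_tokens = load_keys_tokens()` would make the import used in this notebook work unchanged.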

## _Run the Stream Miner_

Now we have everything that we need! So what we'll do next is run `start_mining` with two search terms: `covid19`, and `coronavirus`. Additionally, we'll use the `%%time` magic command to see how long it takes.

In [28]:
%%time

# pass each search term as its own string in the list
start_mining(["covid19", "coronavirus"])

Number of Tweets gathered:  100
Number of Tweets gathered:  200
Number of Tweets gathered:  300
Number of Tweets gathered:  400
Number of Tweets gathered:  500
Number of Tweets gathered:  600
Number of Tweets gathered:  700
Number of Tweets gathered:  800
Number of Tweets gathered:  900
Number of Tweets gathered:  1000
CPU times: user 3.34 s, sys: 489 ms, total: 3.83 s
Wall time: 1min 46s
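
To put the wall time in perspective, it can be converted into a rate. The snippet below is simple arithmetic on the numbers printed above; `tweets_per_second` is a helper name chosen for this example.

```python
def tweets_per_second(num_tweets, minutes, seconds):
    """Average number of tweets saved per second of wall time."""
    elapsed_seconds = minutes * 60 + seconds
    return num_tweets / elapsed_seconds


# 1,000 tweets in 1min 46s, taken from the %%time output
rate = tweets_per_second(1000, minutes=1, seconds=46)
print(f"{rate:.1f} tweets per second")  # prints "9.4 tweets per second"
```

Keep in mind that this counts only the tweets we saved. Retweets are discarded by the listener, so the raw stream delivered tweets at a higher rate than this.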


Ok, so it took us `1min 46s` to gather 1,000 tweets with these particular methods. Let's check the CSV file that was generated to make sure the data that we got was indeed what we wanted.

In [29]:
import pandas as pd

df = pd.read_csv("data_20200406-233602.csv")

# get info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   created_at       1000 non-null   object
 1   user_id          1000 non-null   int64 
 2   user_screenname  1000 non-null   object
 3   tweet_id         1000 non-null   int64 
 4   text             1000 non-null   object
dtypes: int64(2), object(3)
memory usage: 39.2+ KB


In [30]:
# see the first five observations
df.head()

| | created_at | user_id | user_screenname | tweet_id | text |
|---|---|---|---|---|---|
| 0 | 2020-04-06 23:35:57 | 502114278 | Frazzle_Rocks | 1247307085899075592 | What the fuck |
| 1 | 2020-04-06 23:35:57 | 124022302 | Colorado_Right | 1247307086230212609 | There still appears to be ZERO actual inertia ... |
| 2 | 2020-04-06 23:35:57 | 1049590943840722944 | saturnohes | 1247307086305923079 | coronavirus i want that shit to be gONE |
| 3 | 2020-04-06 23:35:57 | 1145584128076800000 | lastboyalive | 1247307086389825546 | The world is going to be an entirely different... |
| 4 | 2020-04-06 23:35:57 | 813345856275574784 | JohnTitor33621 | 1247307086440083457 | The #vaccine may be legit, it may save lives, ... |


Nothing looks critically wrong, but there are a few observations that I would like to make. The first is that the `created_at` column is read back as a string (dtype `object`), so in future iterations we will need to convert it to a datetime object. It is yet to be determined whether it is better to do this up front during the streaming or to convert it after the file has been generated (I am leaning towards the former, but this will depend on how much it impacts the speed). 
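
As a small illustration of the after-the-fact option, the timestamp format seen in the output above (`2020-04-06 23:35:57`) can be parsed with the standard library. This is only a sketch: the format string is inferred from the five rows shown, and `parse_created_at` is a name chosen for this example.

```python
from datetime import datetime

# Format inferred from the created_at values shown in df.head()
CREATED_AT_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_created_at(value):
    """Convert a created_at string from the CSV into a datetime object."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


parsed = parse_created_at("2020-04-06 23:35:57")
print(parsed.isoformat())  # prints "2020-04-06T23:35:57"
```

With pandas, the equivalent conversion would be `pd.to_datetime(df["created_at"])`, or passing `parse_dates=["created_at"]` to `pd.read_csv`.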

Next, with regard to the `user_id` column, even though the ID is returned correctly as an integer type, Twitter also offers the ID in string format, which it recommends using over the integer version. Lastly, in the next iteration I hope to add a text-preprocessing step which will ensure the tweet text is in a more appropriate format for analysis. 
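
The exact preprocessing steps have not been decided yet. Purely as an illustrative sketch of what such a step might look like, the function below strips URLs and @-mentions, lowercases the text, and collapses whitespace. These particular choices and the name `preprocess_tweet` are assumptions for this example, not the final design.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # links such as https://t.co/...
MENTION_RE = re.compile(r"@\w+")       # @username mentions
WHITESPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, newlines


def preprocess_tweet(text):
    """Remove URLs and mentions, lowercase, and normalize whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = preprocess_tweet("Stay SAFE @who  https://t.co/abc #COVID19")
print(cleaned)  # prints "stay safe #covid19"
```

Hashtags are kept here because they may carry topical information, but whether to keep them would depend on the analysis.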