<img src="https://www.aiforpeople.org/wp-content/uploads/2020/01/cropped-AIforPeople-logo-full-2.png" width="200">

# Twitter Crawler Tutorial
In this notebook we walk through the steps needed to obtain data from Twitter and build a dataset. Before we can start crawling tweets, three prerequisites must be in place:

1.   Have a Twitter Account
2.   Sign up as a Twitter Developer
3.   Generate the access codes to connect to Twitter

All of these steps are free, but they take some time, and you have to complete them yourself.

## 1. Create a Twitter Account
You can go to https://twitter.com/i/flow/signup and sign up using your email address or phone number. You only need one Twitter account to acquire data, so if you already have one, you can skip this step.

<img src="https://i.imgur.com/H0jAem1.png" width="600">

## 2. Sign up as a Twitter Developer
In order to use the Twitter API, you'll need to apply for a developer account. This can be done at https://developer.twitter.com/en/apply-for-access. Approval can take a while, so rather than wait for it, we continue here with an example account.

<img src="https://i.imgur.com/T7MohXw.png" width="600">

## 3. Generate the access codes to connect to Twitter
Once you have been granted access to the developer pages, you can navigate to "Apps" and create your own App. You'll need this App to generate your consumer API key and secret, your access token, and your access token secret. Make sure you store the generated keys and tokens somewhere on your machine.

<img src="https://i.imgur.com/7SyOmsL.png" width="600">
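One possible way to keep the keys on your machine without pasting them into the notebook is to read them from environment variables. The following is only a sketch: the variable names in `CREDENTIAL_VARS` and the helper `load_credentials` are assumptions made for this example, not something Twitter or Tweepy prescribes.

```python
import os

# Hypothetical environment variable names; choose whatever naming you prefer.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}


# Demonstration with a plain dict instead of the real environment.
demo_env = {name: "placeholder" for name in CREDENTIAL_VARS}
print(sorted(load_credentials(demo_env)))
```

Calling `load_credentials()` without arguments reads from the real environment, so the secrets never have to appear in the notebook itself.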

Now we can use these access credentials to connect our code to the Twitter API. For this, we are going to use the [Tweepy package](https://www.tweepy.org/).





In [None]:
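# Note: this cell is written against Tweepy 3.x. In Tweepy 4 the module
# `tweepy.error` and the `wait_on_rate_limit_notify` argument no longer exist.
import tweepy as tw
import logging
from tweepy.error import TweepError, RateLimitError, is_rate_limit_error_message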

# Replace the placeholders below with your own credentials
consumer_key = "writeYourOwnConsumerKeyHere12345"
consumer_secret = "writeYourOwnConsumerSecretHere12345"
access_token = "writeYourOwnAccessTokenHere12345"
access_token_secret = "writeYourOwnAccessTokenSecretHere12345"

# Twitter authentication
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
logging.getLogger('tweepy.binder').addHandler(logging.NullHandler())

What is happening in the code above? We use Tweepy to create an authentication handler from the consumer key and consumer secret and store it in `auth`. In other words, we define through which door we want to access Twitter, and with `auth.set_access_token(...)` we provide the key to that door. The open door is then stored, together with some parameters, in `api`. One of those parameters is the door itself (`auth`); another is `wait_on_rate_limit=True`. According to the Tweepy API documentation, this parameter decides "*Whether or not to automatically wait for rate limits to replenish*". The related `wait_on_rate_limit_notify=True` asks Tweepy to print a notification while it is waiting.

<img src="https://media1.tenor.com/images/423c375c2e12c1a708ecc1694e472ff1/tenor.gif?itemid=13052487" width="600" class="center">


The free Twitter access comes with a rate limit, i.e. you can only download a certain number of tweets before you need to wait or before Twitter kicks you out. But: "*Rate limits are divided into 15 minute intervals.*" When we set `wait_on_rate_limit` to `True`, our program automatically waits (up to 15 minutes) whenever we exceed the rate limit, so that Twitter does not lock us out, and then continues to fetch new data!
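
To make the waiting behaviour concrete, here is a minimal sketch of the arithmetic involved. It is not Tweepy's implementation: the function name `seconds_until_reset` and the `padding` argument are assumptions for illustration, and the sketch assumes that the end of the current 15-minute window is known as a Unix timestamp.

```python
import math
import time


def seconds_until_reset(reset_epoch, now=None, padding=5):
    """Return how many seconds to sleep until a rate-limit window resets.

    reset_epoch: Unix timestamp at which the window ends (assumed known).
    padding: a few extra seconds of safety margin (an arbitrary choice).
    """
    now = time.time() if now is None else now
    remaining = reset_epoch - now
    if remaining <= 0:
        return 0  # the window has already reset, no need to wait
    return math.ceil(remaining) + padding


# Example: 10 minutes into a 15-minute window.
window_start = 1_596_000_000
reset_at = window_start + 15 * 60
print(seconds_until_reset(reset_at, now=window_start + 10 * 60))
```

A crawler could pass the returned value to `time.sleep` before sending its next request; with `wait_on_rate_limit=True`, Tweepy takes care of this kind of waiting for us.
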
Now we need to specify the parameters for our search:
*    `days`: The dates that delimit the crawl, in the format `YYYY-MM-DD`. They can be at most 7 days in the past. The last date in the list is exclusive: it only marks the end of the final interval.
*    `search_words`: A single string that combines your search words with `AND` or `OR`. We will look at an example of this.
*    `batch_size`: How many tweets to collect per day. Start with a small number such as 15 to test that everything works.
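
Because the search only reaches a few days into the past, hard-coded dates go stale quickly. As a sketch, the list of dates could also be generated with the standard library. The helper name `build_days` is an assumption made for this example.

```python
from datetime import date, timedelta


def build_days(last_day, n_intervals):
    """Return n_intervals + 1 consecutive dates as YYYY-MM-DD strings.

    The final entry is last_day, which only serves as the exclusive end
    of the last interval.
    """
    first_day = last_day - timedelta(days=n_intervals)
    return [
        (first_day + timedelta(days=offset)).isoformat()
        for offset in range(n_intervals + 1)
    ]


days_example = build_days(date(2020, 7, 31), 3)
print(days_example)
```

Passing `date.today()` as `last_day` would give the most recent complete days instead of fixed ones.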


In [None]:
days = ["2020-07-28", "2020-07-29", "2020-07-30", "2020-07-31"]
search_words = "#covid19 OR #coronavirus OR #ncov2019 OR #2019ncov OR #nCoV OR #nCoV2019 OR #2019nCoV OR #COVID19 -filter:retweets"
batch_size = 25
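
The query string above is written out by hand. If the list of hashtags changes often, it could instead be assembled from a Python list, as in this small sketch. The helper `build_query` is hypothetical; only the `OR` operator and `-filter:retweets` are taken from the query above.

```python
def build_query(hashtags, exclude_retweets=True):
    """Join hashtags with OR and optionally exclude retweets."""
    # Accept tags with or without a leading '#'.
    query = " OR ".join("#" + tag.lstrip("#") for tag in hashtags)
    if exclude_retweets:
        query += " -filter:retweets"
    return query


print(build_query(["covid19", "#coronavirus", "ncov2019"]))
```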

The parameters above will collect a test sample of 25 tweets for each interval defined by the `days` list, so just 25 tweets from each day. The search words act as our first filter; without them we would collect any sort of tweet from that day. They are common hashtags of the Covid-19 discourse: `#covid19`, `#coronavirus` etc. Furthermore, we want to look at original tweets and not retweets, so we exclude retweets with `-filter:retweets`. In the search call, we additionally set the language to `en` (English) and the tweet mode to `extended`, which makes sure the entire tweet text is stored. The remaining parameters are the ones we defined before. We store the tweets in a dictionary with one entry per day, looping over all days with the same search terms and language:

In [None]:
collection = dict()

# iterate over every day
for index in range(0, len(days)-1):
    date_since = days[index]
    date_until = days[index+1]

    print("Start collecting Day "+str(index))
    # Collect tweets
    tweets = tw.Cursor(api.search,
                       tweet_mode='extended', # make sure to collect entire tweet
                       q=search_words,
                       lang="en",
                       since=date_since,
                       until=date_until).items(batch_size)
    tweets = [tweet for tweet in tweets]

    collection[index] = tweets
    print(str(len(tweets))+" tweets collected for date: "+days[index])

Start collecting Day 0
25 tweets collected for date: 2020-07-28
Start collecting Day 1
25 tweets collected for date: 2020-07-29
Start collecting Day 2
25 tweets collected for date: 2020-07-30
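
The loop above uses each date as the start of one interval and the following date as its exclusive end, which is why four dates yield three days of tweets. The same pairing can be sketched without any Twitter access using `zip`. The helper `day_intervals` is a hypothetical name used only here.

```python
def day_intervals(days):
    """Pair every date with its successor: (since, until) tuples."""
    return list(zip(days, days[1:]))


example_days = ["2020-07-28", "2020-07-29", "2020-07-30", "2020-07-31"]
for index, (date_since, date_until) in enumerate(day_intervals(example_days)):
    print("Day " + str(index) + ": since " + date_since + ", until " + date_until)
```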


In [None]:
print(collection[0])
print(type(collection[0]), len(collection[0]))
print(type(collection[0][0]))

[Status(_api=<tweepy.api.API object at 0x7f8619c81d30>, _json={'created_at': 'Tue Jul 28 23:59:58 +0000 2020', 'id': 1288262957135278081, 'id_str': '1288262957135278081', 'full_text': 'AOC calls out stingy #COVID19 relief bill, just like ancient Roman senate. \n\nVisit @webtoon “FALSE EDICTS” for more on the plague &amp; quarantine\n\nREAD: https://t.co/H6UJq3hhXe \n \nW: @JaySandlin_\nA: @schiekapedia \nC: @ComicsShimmy \nL: @JustinBirch \nEdits: me https://t.co/XmMnpW9q8L https://t.co/SH44CX4UMa', 'truncated': False, 'display_text_range': [0, 283], 'entities': {'hashtags': [{'text': 'COVID19', 'indices': [21, 29]}], 'symbols': [], 'user_mentions': [{'screen_name': 'webtoon', 'name': 'WEBTOON', 'id': 2511697256, 'id_str': '2511697256', 'indices': [83, 91]}, {'screen_name': 'JaySandlin_', 'name': 'Jay Sandlin', 'id': 230377199, 'id_str': '230377199', 'indices': [184, 196]}, {'screen_name': 'schiekapedia', 'name': 'J. Schiek', 'id': 4719339432, 'id_str': '4719339432', 'indices': [20

As you can see, this is a ton of information: number of retweets, number of likes, coordinates, profile background image URL... everything about that single tweet! That is why we now keep only `user.id` and `full_text`. If you want, you can also access other information such as the location, but for now we are not interested in that. Have a look at the following code; its explanation follows below:

In [None]:
first_entry = None
last_entry = None
all_user_ids = []
raw_tweets = []
# Note: `tweets` still holds the batch of the last day from the loop above;
# loop over the lists in `collection` to process every day.
for tweet in tweets:
  if not first_entry:
    # remember the creation time of the first tweet we see
    first_entry = tweet.created_at.strftime("%Y-%m-%d %H:%M:%S")
    print("First tweet collected at: "+str(first_entry))
    print(" - - - - - - - - - - - - - - - - - - - - - -")
  # keep only one tweet per user
  if tweet.user.id not in all_user_ids:
    all_user_ids.append(tweet.user.id)
    # replace line breaks inside the tweet with spaces
    full_tweet = tweet.full_text.replace('\n', ' ')
    if full_tweet:  # skip empty tweets
      print("User #"+str(tweet.user.id)+" : ")
      print(full_tweet+"\n - - - - - - ")
      raw_tweets.append(full_tweet)
  # updated on every iteration, so it ends up as the last tweet's time
  last_entry = tweet.created_at.strftime("%Y-%m-%d %H:%M:%S")
print("Last tweet collected at: "+str(last_entry))

First tweet collected at: 2020-07-30 23:59:56
 - - - - - - - - - - - - - - - - - - - - - -
User #2235142436 : 
@Dixit_Munjani @srinivas_inlvd The way economic situation is situation of our state is a hindrance for our schemes and policies but we can utilize every material available to create activities.
 - - - - - - 
User #3303343184 : 
These people are ignorant
 - - - - - - 
User #20236638 : 
@BaiSuthu @dharipSureshg @acna_ghatkan10 @NamitaKVj No good(ish) will come out of this Corona going on
 - - - - - - 
User #11923423462386124801 : 
@kyledwinne Maybe this is why covid-19 is so hard to track down
 - - - - - - 
User #35723365 : 
 - - - - - - 
User #2212360389 : 
Shameful. https://t.co/GyD3w1wgyZ
 - - - - - - 
User #75642348 : 
When COVID-19 spreads quickly &amp; reports and: Fauci negative &amp; CREDIT card issue
 - - - - - - 
 ...
- - - - - - 
La

This code creates an empty list for all the user IDs and then iterates over all tweets. For the first tweet it sees (`first_entry` is still `None` at that point), it records the `created_at` timestamp as the first entry. It then checks whether `tweet.user.id` is not yet in `all_user_ids`, which means it only keeps tweets from users we have not seen before. Why do we do that?


---


*A scientific analysis of fake news spread during the 2016 US presidential election showed that about 1% of users accounted for 80% of fake news, and its authors report that other research suggests that 80% of all tweets can be linked to the top 10% of most active users. Therefore, to represent a diversity of opinion that is linked to many users rather than a few, we filter out multiple tweets from the same user.*


---


Then our code appends the user ID (as we have now seen the user) and stores the full tweet. The replace statement (`replace("\n", " ")`) just gets rid of line breaks in tweets. We check `if full_tweet` because a tweet could be empty (which sometimes happens due to a bug in the API). We print the full tweet (the `"\n - - - - - - "` is a line break and some dashes so it looks nicer when printed) and store each full tweet in a list called `raw_tweets`. Finally, we update `last_entry` from the `created_at` field on every iteration, so that after the loop it holds the creation time of the very last tweet.
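
The filtering logic described above can be tried out without Twitter access. The following sketch applies the same steps to plain `(user_id, text)` tuples. The function name `one_tweet_per_user` and the sample data are made up for illustration, and a `set` is used instead of a list for the seen user IDs, which is a small deviation from the notebook code.

```python
def one_tweet_per_user(tweets):
    """Keep the first tweet of every user, dropping line breaks and empty texts."""
    seen_user_ids = set()
    texts = []
    for user_id, text in tweets:
        if user_id in seen_user_ids:
            continue  # we already have a tweet from this user
        seen_user_ids.add(user_id)
        cleaned = text.replace("\n", " ")
        if cleaned:  # skip empty tweets
            texts.append(cleaned)
    return texts


# Made-up sample data: user 1 tweets twice, user 2 has an empty tweet.
sample = [
    (1, "first\ntweet"),
    (2, ""),
    (1, "second tweet by user 1"),
    (3, "hello"),
]
print(one_tweet_per_user(sample))
```

Membership tests on a set stay fast as the number of users grows, which matters once `batch_size` is increased beyond a test sample.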

<img src="https://cdn.vox-cdn.com/thumbor/PFOnUxhlmDobgoPMjHv3K1xTDQo=/0x0:1920x1080/1400x933/filters:focal(391x323:697x629):no_upscale()/cdn.vox-cdn.com/uploads/chorus_image/image/66208405/cute-success-kid-1920x1080.0.0.jpg" width="500">

We have now collected 25 tweets for each of three days, 75 tweets in total. Ideally, we now store those tweets in a file for later access. Here, we propose the format *YYYY-MM-DD* ::: *TWEET*.
This way, we store only the date and the text. In the next part of the tutorial, we can analyse this dataset, which will have the following structure:

```
1/03/2020	:::	@ryanshwar_ @xomuzza @RubinReport Yeah and I don’t have enough of this covid shit to have any wish to make it over! :(
1/03/2020	:::	@NowThis @GretaWillis If I was mad about a dead in an election (I've been here a few times before) I have to remember to to stress what's happening
1/03/2020	:::	Old man left as he wanted to go to-go market and confirmed 'ok'
1/03/2020	:::	Corona: With all this out &amp; cheapchll... https://t.co/lDlycCfCuM
1/03/2020	:::	Trump doesn't put his stamp of approval on how our media is being run. We will miss the best guy for president.
1/03/2020	:::	Right now
```
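
A file in this layout could be written and read back roughly as follows. This is a sketch under assumptions: the helper names `write_tweets` and `read_tweets` and the file name are hypothetical, the separator is a tab, three colons and another tab as in the listing above, and the tweets are assumed to contain no line breaks (they were removed earlier).

```python
import os
import tempfile

SEPARATOR = "\t:::\t"


def write_tweets(path, rows):
    """Write (date, text) pairs, one tweet per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for day, text in rows:
            handle.write(day + SEPARATOR + text + "\n")


def read_tweets(path):
    """Read the file back into a list of (date, text) tuples."""
    with open(path, encoding="utf-8") as handle:
        return [tuple(line.rstrip("\n").split(SEPARATOR, 1)) for line in handle]


# Round trip with made-up rows in a temporary directory.
example_rows = [("2020-07-28", "first tweet"), ("2020-07-29", "second tweet")]
example_path = os.path.join(tempfile.mkdtemp(), "tweets_example.txt")
write_tweets(example_path, example_rows)
print(read_tweets(example_path))
```

Splitting on the separator only once keeps tweets intact even if their text happens to contain `:::`.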

