### Pre-talk notes for Speaker!
* Delete Academic tier app
* get backup API keys

During talk:
* Open a terminal tab, and put in side pane
* move to this folder cd .\Documents\GitHub\working-with-twitter-data\HealthDemo
* Open the [Twitter Developer dashboard](https://developer.twitter.com/en/portal/dashboard)
* Zoom in
* Share public link -> https://github.com/UKDataServiceOpen/working-with-twitter-data/blob/main/TwarcDemo.ipynb

Talk time - 17:30 minutes

# Twitter Archive Search
In this notebook I detail my recommended route to collecting Twitter Archive data. Through the usual free or professional API tiers you can access at most the last 30 days of Twitter data. This is adequate if you want to start collecting and curating a Twitter dataset, but largely useless if you want to look at the history of Twitter.

Twitter's archive search was released as part of Twitter's V2 API in August 2020, so it's relatively new and largely unsupported by open source packages.
Twitter's documentation and community tutorials have also changed a lot in this time.

There are a few packages that can play with this V2 API, but my personal recommendation is [Twarc](https://github.com/DocNow/twarc). There are also a few plugins we can use to convert the output into a CSV file, and various other helpers.

This package is a little weird as it runs as a command-line tool. It can be used as a Python package but in this regard the documentation is less clear currently.
As this is a command-line tool, it's a little tricky to demo so most code cells here will exist to be copied into the terminal, and details for that installation are outlined.


# Installation
You will need Python 3 and pip3 available on your local machine. I recommend doing this by installing [Anaconda](https://docs.anaconda.com/anaconda/install/index.html).

We can check these exist by typing the following:
```
python
```
which should open a REPL and print out our python version.
And:
```
pip3
```
which should print the help text for pip3. If either command fails you will need to install Python from [here](https://www.python.org/downloads/), which may take some time.
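If you would rather check from inside the notebook, a sketch along these lines reports the Python version and whether each command is on your PATH. It only uses the standard library, and the list of command names is my assumption about what you want to check.

```python
import shutil
import sys

# Version of the interpreter running this notebook
print("Python", sys.version.split()[0])

# shutil.which returns the full path of a command, or None if it is not on the PATH
for command in ["python", "pip3"]:
    location = shutil.which(command)
    print(command, "->", location if location else "not found")
```

A "not found" here usually means the tool is either missing or installed somewhere your terminal does not look.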

## Installing Twarc
First check if you already have twarc installed:
```
twarc
```
You should see twarc is not recognized.

Let's install it with:
```
pip3 install twarc
```

OR by running the below code cell.

In [None]:
!pip3 install twarc

# Twarc time!
Now we should have twarc installed. Running `twarc` again will print all the options we have.

What we actually need is `twarc2` which supports the new V2 API, which has academic access! If you've got Python 3 you should be able to run this and see you already have twarc2. If you cannot run twarc2 follow the details for your OS [On the Twarc2 installation page](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/)

In [None]:
!pip3 install --upgrade twarc

# Configuration & API keys
Now comes the annoying bit. To talk to a web-based data source we need API keys so Twitter knows who is asking for millions of their tweets. I suggest not being too concerned with how this part works; we are here to collect our data and get out!

## Twitter Developer Portal
Before using twarc you will need to create an application and attach it to a project on your [Twitter Developer Portal](https://developer.twitter.com/en/portal/dashboard).

1. Create an App
2. Note down your API key, API key secret and Bearer token

In [None]:
# WARNING - copy the bearer in full
bearer = ''
api_key = ''
api_key_secret = ''
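Pasting keys straight into a notebook makes it easy to publish them by accident, especially when the notebook is shared publicly. One alternative, sketched here, is to read them from environment variables. The variable names `TWITTER_BEARER_TOKEN`, `TWITTER_API_KEY` and `TWITTER_API_KEY_SECRET` are my own choice for illustration, not something twarc requires.

```python
import os

# Hypothetical variable names; set them in your shell before starting Jupyter
bearer = os.environ.get("TWITTER_BEARER_TOKEN", "")
api_key = os.environ.get("TWITTER_API_KEY", "")
api_key_secret = os.environ.get("TWITTER_API_KEY_SECRET", "")

# Report which values are missing without ever printing the secrets themselves
credentials = [("bearer", bearer), ("api_key", api_key), ("api_key_secret", api_key_secret)]
for name, value in credentials:
    print(name, "is set" if value else "is missing")
```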

Next we run
```
twarc2 configure
```

Follow the instructions:
1. Enter your bearer token
2. Say y to optional user mode authentication
3. Enter your API key and API key secret
4. Generate access keys by visiting Twitter
5. Enter the PIN

# Running Twarc
As a test to see if this is working run the following:
```
twarc2 search --archive --limit 100 cough 100cough.jsonl
```
We will build up to this, and more shortly.

So let's break that request down:
* twarc2 - use twarc with the V2 API
* search - use the Twitter search endpoint
* --archive - make an archive search
* --limit 100 - only give me 100 tweets back
* cough - our search term
* 100cough.jsonl - our output file
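The same breakdown can be written as a small Python helper that assembles the command piece by piece. This is only a sketch: `build_search_command` is a hypothetical helper of mine, not part of twarc, and it only knows the `--archive` and `--limit` options used above. Note that `shlex.join` quotes for POSIX shells, so the printed line may need adjusting on Windows.

```python
import shlex

def build_search_command(query, outfile, limit=None, archive=False):
    """Assemble a twarc2 search command as a list of arguments."""
    command = ["twarc2", "search"]
    if archive:
        command.append("--archive")
    if limit is not None:
        command += ["--limit", str(limit)]
    # The query and the output file always come last
    command += [query, outfile]
    return command

command = build_search_command("cough", "100cough.jsonl", limit=100, archive=True)
print(shlex.join(command))
# prints: twarc2 search --archive --limit 100 cough 100cough.jsonl
```

Keeping the arguments in a list means a query with spaces stays in one piece when you later copy it into the terminal.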

# Dashboarding
Each application has a dashboard so we can check how close we are to our limit. For academic projects we can collect 10,000,000 tweets a month.

So two questions come to mind:
1. Did we hit the archive?
2. Have we actually got 100 tweets about coughs?

To answer either we need to look at the data. JSON isn't very readable so it's time to look at some Python.

In [1]:
# To read in the data we need to make use of some python packages
import pandas as pd #pandas, our data manipulation library
import numpy as np #numpy, to support matrix style tables.
import json

In [2]:
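# next we read our JSONL file into a dataframe.
# each line of the file is a separate JSON document holding one page of results
with open('100cough.jsonl') as f:
    pages = [json.loads(line) for line in f if line.strip()]
df = pd.DataFrame([tweet for page in pages for tweet in page["data"]])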

# check the last five rows
df.tail()

Unnamed: 0,text,lang,source,created_at,entities,public_metrics,conversation_id,author_id,reply_settings,possibly_sensitive,id,in_reply_to_user_id,referenced_tweets,context_annotations,attachments
95,RT @keylimedlacroix: It seems like our elected...,en,Twitter for iPhone,2022-02-16T19:34:43.000Z,"{'mentions': [{'start': 3, 'end': 19, 'usernam...","{'retweet_count': 15, 'reply_count': 0, 'like_...",1494032510971518977,339340704,everyone,False,1494032510971518977,,"[{'type': 'retweeted', 'id': '1493866575459495...","[{'domain': {'id': '10', 'name': 'Person', 'de...",
96,RT @missglh_: Why do kids cough like that? Ton...,en,Twitter for iPhone,2022-02-16T19:34:42.000Z,"{'mentions': [{'start': 3, 'end': 12, 'usernam...","{'retweet_count': 4631, 'reply_count': 0, 'lik...",1494032509633449991,841874865661583361,everyone,False,1494032509633449991,,"[{'type': 'retweeted', 'id': '1493888733640376...",,
97,RT @missglh_: Why do kids cough like that? Ton...,en,Twitter for iPhone,2022-02-16T19:34:38.000Z,"{'mentions': [{'start': 3, 'end': 12, 'usernam...","{'retweet_count': 4631, 'reply_count': 0, 'lik...",1494032492931731459,1483273704,everyone,False,1494032492931731459,,"[{'type': 'retweeted', 'id': '1493888733640376...",,
98,@rayvenmccoy Violently cough. Guaranteed way t...,en,Twitter for iPhone,2022-02-16T19:34:38.000Z,"{'mentions': [{'start': 0, 'end': 12, 'usernam...","{'retweet_count': 0, 'reply_count': 0, 'like_c...",1493694815673876480,1025506749963370496,everyone,False,1494032490708672513,624687421.0,"[{'type': 'replied_to', 'id': '149369481567387...",,
99,RT @missglh_: Why do kids cough like that? Ton...,en,Twitter for iPhone,2022-02-16T19:34:37.000Z,"{'mentions': [{'start': 3, 'end': 12, 'usernam...","{'retweet_count': 4631, 'reply_count': 0, 'lik...",1494032486208356353,1067171325972815875,everyone,False,1494032486208356353,,"[{'type': 'retweeted', 'id': '1493888733640376...",,
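Both questions can be checked with a few lines of plain Python. The sketch below uses three made-up tweets (the hypothetical `sample_tweets`, not real data) shaped like the `text` and `created_at` fields above. Swap in the records loaded from your own file.

```python
from datetime import datetime

# Made-up records for illustration only; real ones come from 100cough.jsonl
sample_tweets = [
    {"text": "I cannot shake this cough", "created_at": "2022-02-16T19:34:42.000Z"},
    {"text": "Cough syrup tastes awful", "created_at": "2022-02-16T19:34:38.000Z"},
    {"text": "Lovely weather today", "created_at": "2021-11-02T08:15:00.000Z"},
]

def parse_time(stamp):
    # Timestamps in the data look like 2022-02-16T19:34:42.000Z
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")

# Question 1: how far back does the oldest tweet go?
oldest = min(parse_time(t["created_at"]) for t in sample_tweets)

# Question 2: how many tweets mention the search term, ignoring case?
matches = sum("cough" in t["text"].lower() for t in sample_tweets)

print("tweets:", len(sample_tweets))
print("oldest:", oldest.date())
print("mention 'cough':", matches)
```

If the oldest tweet is more than a week old, the search reached beyond the recent-search window.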


## Twitter URL formats
A good next step is to actually inspect some of these tweets and see what data we get.

If we navigate to any individual tweet on Twitter, we get a URL such as `https://twitter.com/JosephAllen1234/status/1448222861701832704`.

The number at the end is the tweet ID. If we paste any ID from our dataset in its place, we can navigate to that tweet.
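As a small sketch, the pattern can be wrapped in a helper so links can be generated for any row. `tweet_url` is a hypothetical function of mine, and it assumes, as described above, that Twitter resolves the tweet from the ID at the end of the URL.

```python
def tweet_url(tweet_id, username="JosephAllen1234"):
    """Build a link to a tweet from its ID, following the URL pattern shown above."""
    return f"https://twitter.com/{username}/status/{tweet_id}"

# An ID taken from the id column of the dataset
print(tweet_url(1494032490708672513))
```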

You'll also notice some nested JSON objects in the columns; we haven't really flattened out this data fully.

## Twarc convert to CSV
We aren't the first people to want to flatten this JSON object, and there is a twarc plugin to convert it to CSV.

First make sure twarc is up to date with:
```
pip3 install --upgrade twarc
twarc2 configure
```

Then install the twarc-csv plugin:
```
pip3 install --upgrade twarc-csv
```
OR if running straight from anaconda run the below cell.

In [None]:
!pip3 install --upgrade twarc-csv

Now we can convert with the following pattern:
```
twarc2 csv tweets.jsonl tweets.csv
```
renaming files where needed.

We can also use the `--no-inline-referenced-tweets` flag so that referenced tweets (for example the original of a retweet, or the tweet being replied to) are not written out as extra rows.
```
twarc2 csv tweets.jsonl tweets.csv --no-inline-referenced-tweets
```
So let's convert our 100 tweets.

```
twarc2 csv 100cough.jsonl 100cough.csv --no-inline-referenced-tweets
```

Now we should be able to read this in directly with pandas

In [3]:
df = pd.read_csv('100cough.csv')
df.tail()

Unnamed: 0,id,conversation_id,referenced_tweets.replied_to.id,referenced_tweets.retweeted.id,referenced_tweets.quoted.id,author_id,in_reply_to_user_id,retweeted_user_id,quoted_user_id,created_at,...,geo.geo.bbox,geo.geo.type,geo.id,geo.name,geo.place_id,geo.place_type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 73
95,1494032510971518977,1494032510971518977,,1.493867e+18,,339340704,,1.423058e+18,,2022-02-16T19:34:43.000Z,...,,,,,,,2022-02-16T19:38:09+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.9.2,
96,1494032509633449991,1494032509633449991,,1.493889e+18,,841874865661583361,,451385700.0,,2022-02-16T19:34:42.000Z,...,,,,,,,2022-02-16T19:38:09+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.9.2,
97,1494032492931731459,1494032492931731459,,1.493889e+18,,1483273704,,451385700.0,,2022-02-16T19:34:38.000Z,...,,,,,,,2022-02-16T19:38:09+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.9.2,
98,1494032490708672513,1493694815673876480,1.493695e+18,,,1025506749963370496,624687421.0,,,2022-02-16T19:34:38.000Z,...,,,,,,,2022-02-16T19:38:09+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.9.2,
99,1494032486208356353,1494032486208356353,,1.493889e+18,,1067171325972815875,,451385700.0,,2022-02-16T19:34:37.000Z,...,,,,,,,2022-02-16T19:38:09+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.9.2,
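Notice that some ID columns above (for example `referenced_tweets.retweeted.id`) print as `1.493867e+18`. pandas falls back to floats for numeric columns with missing values, and a float cannot hold every 19-digit ID exactly. A quick check with one ID from the table, using plain Python only:

```python
tweet_id = 1494032510971518977  # an id from the output above

# Round-trip through a float, which is what happens to ID columns with gaps
as_float = float(tweet_id)
round_trip = int(as_float)

print(tweet_id)
print(round_trip)
print("unchanged:", round_trip == tweet_id)
```

If you need these IDs later, one option is to pass `dtype=str` to `pd.read_csv` so that every column is read as text.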


In [4]:
# List out all columns we have access to.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 74 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     100 non-null    int64  
 1   conversation_id                        100 non-null    int64  
 2   referenced_tweets.replied_to.id        9 non-null      float64
 3   referenced_tweets.retweeted.id         86 non-null     float64
 4   referenced_tweets.quoted.id            1 non-null      float64
 5   author_id                              100 non-null    int64  
 6   in_reply_to_user_id                    9 non-null      float64
 7   retweeted_user_id                      86 non-null     float64
 8   quoted_user_id                         1 non-null      float64
 9   created_at                             100 non-null    object 
 10  text                                   100 non-null    object 
 11  lang   

# Case Study - Omicron September 2021
It's not enough to simply search all tweets. With that approach we will very quickly hit our 10,000,000 tweet limit, adding a one month delay to our research.

## How useful is 10,000,000 tweets?
In a previous attempt, collecting all tweets containing the word vegan in January 2019 returned 1.2 million tweets. This took 32 hours and resulted in a 4.2GB file.

There are an estimated 500 million tweets uploaded every day.
As such it's important to narrow the scope of your project, and focus on very specific time periods.
Luckily, for covid symptoms there seems to be less of a marketing push than there is for vegan products. When looking at covid tweets tagged with a location there are only a few thousand tweets a month.
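To get a feel for the scale, here is some back-of-the-envelope arithmetic on the vegan figures above. It assumes collection time and file size grow in proportion to the number of tweets, which is my simplification rather than a measured result.

```python
# Figures quoted above from the vegan collection
tweets_collected = 1_200_000
hours_taken = 32
file_size_gb = 4.2
monthly_cap = 10_000_000

tweets_per_hour = tweets_collected / hours_taken
kb_per_tweet = file_size_gb * 1_000_000 / tweets_collected  # treating 1 GB as 1,000,000 KB

# Scale the same run up to the full monthly cap
scale = monthly_cap / tweets_collected
hours_for_cap = hours_taken * scale
gb_for_cap = file_size_gb * scale

print(f"{tweets_per_hour:,.0f} tweets per hour")
print(f"{kb_per_tweet:.1f} KB per tweet")
print(f"{hours_for_cap:.0f} hours and {gb_for_cap:.0f} GB to reach the cap")
```

That works out at about 37,500 tweets an hour at roughly 3.5 KB each, so using the whole monthly allowance would mean around 267 hours (about 11 days) of collecting and roughly 35 GB on disk.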

## The query
My main research question is "Do self-reported symptoms on Twitter predict a covid wave?".
To answer it we need the following data:
* tweet ID
* tweet text
* user names and IDs
* Anything that will help us check locations

Let's build this up piece by piece

### Custom search term
We simply write the term we are searching for, "cough", after the search keyword.

WARNING - DO NOT RUN THIS WITHOUT --limit
```
twarc2 search cough cough.jsonl
```

Of course "cough" isn't the only word linked with covid cases: coughing and coughed count too. Other symptoms such as fatigue, headaches and sneezing are reported as well. There are loads of operators we can use to join these, listed [here](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#order-of-operations). Specifically, we have to join the terms with OR inside parentheses, otherwise any space is treated as an AND operator.

WARNING - DO NOT RUN THIS WITHOUT --limit
```
twarc2 search "(cough OR coughing OR sneeze OR sneezing OR headache OR fatigue)" cough.jsonl
```
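Typing long OR lists by hand gets error-prone, so here is a sketch of building the query string from a Python list. `build_query` is a hypothetical helper of mine. It wraps any multi-word term in double quotes, which is how Twitter queries match an exact phrase.

```python
def build_query(terms):
    """Join search terms with OR and wrap the whole group in parentheses."""
    prepared = []
    for term in terms:
        # Multi-word terms need double quotes to be matched as a phrase
        prepared.append(f'"{term}"' if " " in term else term)
    return "(" + " OR ".join(prepared) + ")"

symptoms = ["cough", "coughing", "sneeze", "sneezing", "headache", "fatigue"]
print(build_query(symptoms))
# prints: (cough OR coughing OR sneeze OR sneezing OR headache OR fatigue)
```

The printed string can be pasted between quotes in the `twarc2 search` command.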

### Limit the search
Unless you want to capture all tweets that match this search, you should limit it. Adding `--limit x`, replacing x with the number of tweets you want, will help you here.

```
twarc2 search --limit 100 "(cough OR coughing OR sneeze OR sneezing OR headache OR fatigue)" 100cough.jsonl
```

### Archive search
The `--archive` flag lets Twitter know we want to search further back than the last 7 days. You will only be allowed to do this with the academic tier. On any other tier the command won't fail; it will just give you recent tweets instead.

```
twarc2 search --archive --limit 100 "(cough OR coughing OR sneeze OR sneezing OR headache OR fatigue)" 100.jsonl
```

### Start and End times
Adding a `--start-time` and `--end-time` flag lets you set start and end times to your archive search. These require dates in the format YYYY-MM-DD.

```
twarc2 search --archive --limit 100 --start-time "2021-09-01" --end-time "2022-01-01" "(cough OR coughing OR sneeze OR sneezing OR headache OR fatigue)" 100Covid2021.jsonl
```
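A mistyped date is easy to miss in a long command, so it can help to sanity-check the window first. The sketch below uses only the standard library, and `check_window` is a hypothetical helper of mine, not part of twarc.

```python
from datetime import date

def check_window(start, end):
    """Parse two YYYY-MM-DD strings and return the number of days between them."""
    start_date = date.fromisoformat(start)  # raises ValueError on a malformed date
    end_date = date.fromisoformat(end)
    if end_date <= start_date:
        raise ValueError("end time must be after start time")
    return (end_date - start_date).days

print(check_window("2021-09-01", "2022-01-01"), "days")
# prints: 122 days
```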

### Location search
We can collect a user's self-defined location, which varies from pronouns to memes to genuinely relevant data. I suggest using `place_country` here; the other options seem to be less reliable, and you can't really verify where a tweet came from.

```
twarc2 search --archive --limit 100 --start-time "2021-09-01" --end-time "2022-01-01" "(cough OR coughing OR sneeze OR sneezing OR headache OR fatigue) place_country:GB" 100Covid2021.jsonl
```

So that's everything we need. There are some additional rules you can use, depending on your API tier, listed [here](https://developer.twitter.com/en/docs/twitter-api/premium/rules-and-filtering/operators-by-product).

Let's run this one last time to collect our data. Remove the limit if you want the full dataset, though I won't do that here to save some time.
```
twarc2 search --archive --limit 1000 --start-time "2021-09-01" --end-time "2022-01-01" "(cough OR coughing OR fatigue OR sneeze OR sneezing OR headache) place_country:GB" 3monthCovid2021.jsonl
```
Then convert it to a CSV:
```
twarc2 csv 3monthCovid2021.jsonl 3monthCovid2021.csv --no-inline-referenced-tweets
```

## Useful Links
* If you want to play with some data check out my [Tidying tutorial](https://github.com/UKDataServiceOpen/working-with-twitter-data/HealthDemo/HealthTidyingDemo.ipynb)
* This [Twarc tutorial](https://github.com/alblaine/twarc-tutorial) breaks down installing Python, API keys and beyond!
* This [Twarc2 tutorial](https://github.com/jeffcsauer/twarc-v2-tutorials/blob/master/twarc_fas.md) specifically covers archive searching.
* The [Twarc report plugin](https://github.com/pbinkley/twarc-report) automates a lot of the tidying and visualization we would do in a next step.
* The Twarc community is incredibly helpful. If you have any issues or questions, open them on their [GitHub](https://github.com/DocNow/twarc/issues)