### Pre-talk notes for Speaker!
* Delete Academic tier app
* get backup API keys

During talk:
* Open a terminal tab, and put in side pane
* Move to this folder: `cd .\Documents\GitHub\working-with-twitter-data\`
* Open the [Twitter Developer dashboard](https://developer.twitter.com/en/portal/dashboard)
* Zoom in

Talk time - 25 minutes

# Twitter Archive Search
In this notebook we detail my recommended route to collecting Twitter archive data. Through the usual free or professional API tiers you can access at most the last 30 days of Twitter data. This is adequate if you want to start collecting and curating a Twitter dataset, but largely useless if you want to look at the history of Twitter.

Twitter's archive search was released as part of Twitter's V2 API in August 2020; as such it's relatively new and largely unsupported by open source packages.
Twitter's documentation and community tutorials have also changed a lot in this time.

There are a few packages that can play with this V2 API, but my personal recommendation is [Twarc](https://github.com/DocNow/twarc). There are also a few plugins we can use to convert the output into a CSV file, and various other helpers.

This package is a little unusual in that it runs as a command-line tool. It can be used as a Python package, but the documentation for that is currently less clear.
Because a command-line tool is tricky to demo in a notebook, most code cells here exist to be copied into the terminal, and the installation steps are outlined below.


# Installation
You will need Python 3 and pip3 available on your local machine. I recommend doing this by installing [Anaconda](https://docs.anaconda.com/anaconda/install/index.html).

We can check these exist by typing the following:
```
python
```
which should open a REPL and print out our Python version.
And:
```
pip3
```
which should print the help text for pip3. If either command fails, you will need to install Python from [here](https://www.python.org/downloads/), which may take some time.

## Installing Twarc
First check if you already have twarc installed:
```
twarc
```
If it is not installed yet, you should see a message that `twarc` is not recognized.

Let's install it with:
```
pip3 install twarc
```

OR by running the below code cell.

In [2]:
!pip3 install twarc

# Twarc time!
Now that we have twarc installed, running `twarc` again will list all the options we have.

What we actually need is `twarc2`, which supports the new V2 API and, with it, academic access! If you've got Python 3 you should be able to run `twarc2` and see that you already have it. If you cannot run twarc2, follow the details for your OS on the [twarc2 installation page](https://twarc-project.readthedocs.io/en/latest/twarc2_en_us/).
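
If you prefer to check from inside the notebook, you can ask Python which of these commands are on your PATH. This is a minimal sketch using only the standard library; the helper name `missing_tools` is my own and not part of twarc.

```python
import shutil

def missing_tools(names):
    """Return the commands from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

# An empty list means every command below is ready to use.
print(missing_tools(["python", "pip3", "twarc", "twarc2"]))
```

Anything printed in the list still needs installing before you carry on.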

# Configuration & API keys
Now comes the annoying bit. To talk to a web-based data source we need API keys, so Twitter knows who is asking for millions of their tweets. I suggest not being too concerned with how this part works; we are here to collect our data and get out!

## Twitter Developer Portal
Before using twarc you will need to create an application and attach it to a project on your [Twitter Developer Portal](https://developer.twitter.com/en/portal/dashboard).

1. Create an App
2. Note down your API key, API key secret and Bearer token

In [3]:
# WARNING - copy the bearer in full
bearer = ''
api_key = ''
api_key_secret = ''
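
If you would rather not paste secrets into a notebook you might share, one option is to read them from environment variables. This is a sketch only: the variable names `TWITTER_BEARER_TOKEN`, `TWITTER_API_KEY` and `TWITTER_API_KEY_SECRET`, and the helper `load_keys`, are assumptions of mine and not names that twarc looks for.

```python
import os

# Hypothetical variable names; set them in your shell before starting Jupyter.
KEY_NAMES = ("TWITTER_BEARER_TOKEN", "TWITTER_API_KEY", "TWITTER_API_KEY_SECRET")

def load_keys(env=os.environ):
    """Return the three keys as a dict, or raise if any is missing or empty."""
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in KEY_NAMES}

# Example with a stand-in environment so the cell runs without real keys.
demo = load_keys({name: "placeholder" for name in KEY_NAMES})
print(sorted(demo))
```

You still type or paste the values into the terminal yourself when configuring twarc; this only keeps them out of the saved notebook.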

Next we run
```
twarc2 configure
```

Follow the instructions:
1. Enter your bearer token
2. Say y to optional user mode authentication
3. Enter your API key and API key secret
4. Generate access keys by visiting Twitter
5. Enter the PIN

# Running Twarc
As a test to see if this is working run the following:
```
twarc2 search --archive --limit 100 vegan 100vegan.jsonl
```
We will build up to this, and more shortly.

So let's break that request down:
* twarc2 - use twarc with the V2 API
* search - use the Twitter search endpoint
* --archive - make an archive search
* --limit 100 - only give me 100 tweets back
* vegan - our search term
* 100vegan.jsonl - our output file
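
The same breakdown can be written as a list of arguments in Python, which is handy if you want to launch the search from a notebook cell. This is a sketch under my own naming (`build_search_command` is hypothetical); it only uses the `search`, `--archive` and `--limit` pieces described above.

```python
import shlex

def build_search_command(query, outfile, archive=False, limit=None):
    """Assemble a twarc2 search command as a list of arguments."""
    command = ["twarc2", "search"]
    if archive:
        command.append("--archive")
    if limit is not None:
        command += ["--limit", str(limit)]
    # The query is a single argument, even when it contains spaces.
    command += [query, outfile]
    return command

command = build_search_command("vegan", "100vegan.jsonl", archive=True, limit=100)
print(shlex.join(command))
# To run it from Python: import subprocess; subprocess.run(command, check=True)
```

Keeping the command as a list means Python handles the quoting for you when the query grows to several words.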

# Dashboarding
Each application has a dashboard so we can check how close we are to our limit. For academic projects we can collect 10,000,000 tweets a month.

So two questions come to mind:
1. Did we hit the archive?
2. Have we actually got 100 tweets about veganism?

To answer either we need to look at the data. JSON isn't very readable so it's time to look at some Python.

In [2]:
# To read in the data we need to make use of some python packages
import pandas as pd #pandas, our data manipulation library
import numpy as np #numpy, to support matrix style tables.
import json

In [4]:
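# Next we read our JSONL file into a dataframe.
# twarc2 writes one API response (a page of tweets) per line, so we parse line by line.
with open('100vegan.jsonl', encoding='utf-8') as f:
    pages = [json.loads(line) for line in f if line.strip()]
df = pd.DataFrame([tweet for page in pages for tweet in page["data"]])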

# check the last five rows
df.tail()

Unnamed: 0,text,possibly_sensitive,reply_settings,created_at,public_metrics,referenced_tweets,source,in_reply_to_user_id,author_id,entities,lang,id,conversation_id,context_annotations,attachments,geo
95,Herbal Basil | Holy Basil Powder | Tulsi Leaf ...,False,everyone,2021-10-20T07:54:07.000Z,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,Twitter Web App,,1383353054506872833,"{'urls': [{'start': 131, 'end': 154, 'url': 'h...",tl,1450732047098515457,1450732047098515457,"[{'domain': {'id': '65', 'name': 'Interests an...",,
96,Organic Shikakai Powder | Acacia Concinna | Ay...,False,everyone,2021-10-20T07:53:55.000Z,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,Twitter Web App,,1383353054506872833,"{'urls': [{'start': 141, 'end': 164, 'url': 'h...",in,1450731993637916676,1450731993637916676,"[{'domain': {'id': '65', 'name': 'Interests an...",,
97,"@Tipuncho Bien sur ! ainsi que de Pocramé, les...",False,everyone,2021-10-20T07:53:54.000Z,"{'retweet_count': 0, 'reply_count': 0, 'like_c...","[{'type': 'replied_to', 'id': '145054357424481...",Twitter Web App,485565663.0,1379433389283962886,"{'mentions': [{'start': 0, 'end': 9, 'username...",fr,1450731989846401028,1450543574244810757,"[{'domain': {'id': '65', 'name': 'Interests an...",,
98,@MellowManed BURGER KING WITH MY BURGER QUEEN ...,False,everyone,2021-10-20T07:53:52.000Z,"{'retweet_count': 0, 'reply_count': 1, 'like_c...","[{'type': 'replied_to', 'id': '145073088455506...",Twitter for Android,9.899838247790756e+17,1250095263445848064,"{'mentions': [{'start': 0, 'end': 12, 'usernam...",en,1450731982472761344,1450730884555067394,"[{'domain': {'id': '45', 'name': 'Brand Vertic...",,
99,La única diferencia es tu percepción.\n\nNo ha...,False,everyone,2021-10-20T07:53:50.000Z,"{'retweet_count': 0, 'reply_count': 0, 'like_c...",,Twitter for iPhone,,901054343457251328,"{'urls': [{'start': 275, 'end': 298, 'url': 'h...",es,1450731973136236546,1450731973136236546,,{'media_keys': ['3_1450731963468468229']},


## Twitter URL formats
A good next step is to actually inspect some of these tweets and see what data we get.

If we navigate to any individual tweet on Twitter, the address looks like `https://twitter.com/JosephAllen1234/status/1448222861701832704`.

The number at the end is the tweet ID. If we paste any ID from our dataset in its place, we can navigate to that tweet.
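
As a small illustration of that URL pattern, the two helpers below build a link from an ID and recover the ID from a link. The function names are my own, and the username and ID are the ones from the example address.

```python
def tweet_url(username, tweet_id):
    """Build a tweet link in the username/status/ID pattern."""
    return f"https://twitter.com/{username}/status/{tweet_id}"

def tweet_id_from_url(url):
    """Pull the numeric ID off the end of a tweet URL."""
    return int(url.rstrip("/").rsplit("/", 1)[-1])

url = tweet_url("JosephAllen1234", 1448222861701832704)
print(url)
print(tweet_id_from_url(url))
```

Keep IDs as integers or strings. As the `in_reply_to_user_id` column in the output above shows, pandas stores a column with missing values as floats, which loses the last digits of an ID.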

You'll also notice that some columns still hold nested JSON objects; we haven't fully flattened this data yet.
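
To show what flattening means here, this is a minimal sketch that turns nested dictionaries into dotted column names such as `public_metrics.retweet_count`. The `flatten` helper is my own illustration, not how twarc does it, and the sample tweet is cut down from one of the rows above.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value  # lists and plain values are kept as they are
    return flat

# A cut-down tweet with the same shape as the rows above.
tweet = {
    "id": "1450731982472761344",
    "lang": "en",
    "public_metrics": {"retweet_count": 0, "reply_count": 1},
}
print(flatten(tweet))
```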

## Twarc convert to CSV
We aren't the first people to want to flatten this JSON object, so there is a twarc plugin to convert it to CSV.

First make sure twarc is up to date with:
```
pip3 install --upgrade twarc
twarc2 configure
```

Then install the twarc-csv plugin:
```
pip3 install --upgrade twarc-csv
```
Or, if running straight from Anaconda, run the cell below.

In [None]:
!pip3 install --upgrade twarc-csv

Now we can convert with the following pattern:
```
twarc2 csv tweets.jsonl tweets.csv
```
renaming files where needed.

We can also use the `--no-inline-referenced-tweets` flag so that referenced tweets (for example the tweet being replied to or quoted) are not written out as extra rows.
```
twarc2 csv tweets.jsonl tweets.csv --no-inline-referenced-tweets
```
So let's convert our 100 tweets.

```
twarc2 csv 100vegan.jsonl 100vegan.csv --no-inline-referenced-tweets
```

Now we should be able to read this in directly with pandas

In [6]:
df = pd.read_csv('100vegan.csv')
df.tail()

Unnamed: 0,id,created_at,text,attachments.media,attachments.media_keys,attachments.poll.duration_minutes,attachments.poll.end_datetime,attachments.poll.id,attachments.poll.options,attachments.poll.voting_status,...,reply_settings,source,withheld.scope,withheld.copyright,withheld.country_codes,type,__twarc.retrieved_at,__twarc.url,__twarc.version,Unnamed: 93
95,1450732047098515457,2021-10-20T07:54:07.000Z,Herbal Basil | Holy Basil Powder | Tulsi Leaf ...,,,,,,,,...,everyone,Twitter Web App,,,,,2021-10-20T08:02:10+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.7.3,
96,1450731993637916676,2021-10-20T07:53:55.000Z,Organic Shikakai Powder | Acacia Concinna | Ay...,,,,,,,,...,everyone,Twitter Web App,,,,,2021-10-20T08:02:10+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.7.3,
97,1450731989846401028,2021-10-20T07:53:54.000Z,"@Tipuncho Bien sur ! ainsi que de Pocramé, les...",,,,,,,,...,everyone,Twitter Web App,,,,,2021-10-20T08:02:10+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.7.3,
98,1450731982472761344,2021-10-20T07:53:52.000Z,@MellowManed BURGER KING WITH MY BURGER QUEEN ...,,,,,,,,...,everyone,Twitter for Android,,,,,2021-10-20T08:02:10+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.7.3,
99,1450731973136236546,2021-10-20T07:53:50.000Z,La única diferencia es tu percepción.\n\nNo ha...,"[{""media_key"": ""3_1450731963468468229"", ""heigh...","[""3_1450731963468468229""]",,,,,,...,everyone,Twitter for iPhone,,,,,2021-10-20T08:02:10+00:00,https://api.twitter.com/2/tweets/search/all?ex...,2.7.3,


In [26]:
# List out all columns we have access to.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 94 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   id                                               134 non-null    int64  
 1   created_at                                       134 non-null    object 
 2   text                                             134 non-null    object 
 3   attachments.media                                37 non-null     object 
 4   attachments.media_keys                           37 non-null     object 
 5   attachments.poll.duration_minutes                0 non-null      float64
 6   attachments.poll.end_datetime                    0 non-null      float64
 7   attachments.poll.id                              0 non-null      float64
 8   attachments.poll.options                         0 non-null      float64
 9   attachments.poll.voting_status  

# Case Study - London Veganuary 2019
It's not enough to simply search all tweets. In fact, we would very quickly hit our 10,000,000 tweet limit with this approach, adding a one month delay to our research.

## How useful is 10,000,000 tweets?
In my previous attempts at this, collecting all tweets containing the word vegan in January 2019 returned 1.2 million tweets. This took 32 hours and resulted in a 4.2GB file.

There are an estimated 500 million tweets uploaded every day.
As such it's important to narrow the scope of your project, and focus on very specific time periods.
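
To put those figures in context, here is the arithmetic worked through. It assumes 1GB means 10**9 bytes and that collection speed and tweet size stay roughly constant, neither of which is guaranteed.

```python
# Figures from the January 2019 "vegan" collection described above.
tweets = 1_200_000
hours = 32
file_bytes = 4.2e9          # 4.2GB, taking 1GB as 10**9 bytes
monthly_cap = 10_000_000    # academic tier limit

tweets_per_hour = tweets / hours
bytes_per_tweet = file_bytes / tweets
share_of_cap = tweets / monthly_cap

# Scaling the same rates up to the full monthly cap.
hours_for_cap = monthly_cap / tweets_per_hour
gigabytes_for_cap = monthly_cap * bytes_per_tweet / 1e9

print(f"{tweets_per_hour:,.0f} tweets per hour")
print(f"{bytes_per_tweet:,.0f} bytes per tweet")
print(f"{share_of_cap:.0%} of the monthly cap")
print(f"{hours_for_cap:.0f} hours and {gigabytes_for_cap:.0f}GB to use the whole cap")
```

On these numbers a single keyword over a single month uses about 12% of the monthly allowance, and the full allowance would be roughly 35GB of JSON.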

## The query
We are asking "How did the perception of veganism change over Veganuary 2019?" This was the first widely adopted Veganuary, famous for the announcement of the Greggs vegan sausage roll. We need only tweets from this period, near London, and we are not interested in collecting any retweets or replies.

We need the following data:
* tweet ID
* tweet text
* user names and IDs
* anything that will help us check locations
* the number of retweets and favorites on each tweet

Let's build this up piece by piece

### Custom search term
We simply write the term we are searching for, "vegan", after the search keyword.

WARNING - DO NOT RUN THESE WITHOUT --limit
```
twarc2 search vegan vegan.jsonl
```

What about vegetarians? What about plant-based diets? For multiple terms we wrap the search in quotes. We can also use OR, AND and other operators to specify what we want. Twitter is very specific with logical operators, so consult [their documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#order-of-operations). Specifically, we have to add parentheses, otherwise our query will assume any spaces are actually AND operators.

```
twarc2 search "(vegan OR vegetarian OR plant-based)" vegan.jsonl
```
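
If you are assembling several terms, a tiny helper keeps the parentheses and ORs consistent. This is a sketch with a hypothetical function name, `build_query`; wrapping multi-word phrases in double quotes follows Twitter's exact-phrase syntax.

```python
def build_query(terms, extra=""):
    """Join search terms with OR inside parentheses, then append other operators."""
    # Phrases containing spaces need double quotes to be matched exactly.
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    query = "(" + " OR ".join(quoted) + ")"
    return f"{query} {extra}".strip()

print(build_query(["vegan", "vegetarian", "plant-based"]))
```

The optional `extra` argument is where any further operators would go, after the parenthesised group.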

### Limit the search
Unless you want to capture all tweets that match this search, you should limit your search. Adding `--limit x`, replacing x with the number of tweets you want, will help you here.

```
twarc2 search --limit 100 "(vegan OR vegetarian OR plant-based)" vegan.jsonl
```

### Archive search
The `--archive` flag lets Twitter know we want to access more than the last 7 days. You will only be allowed to do this with the academic tier; without it the search won't fail, it will just give you recent tweets instead.

```
twarc2 search --archive --limit 100 "(vegan OR vegetarian OR plant-based)" vegan.jsonl
```

### Start and End times
Adding a `--start-time` and `--end-time` flag lets you set start and end times to your archive search. These require dates in the format YYYY-MM-DD.

```
twarc2 search --archive --limit 100 --start-time "2019-01-01" --end-time "2019-01-30" "(vegan OR vegetarian OR plant-based)" vegan.jsonl
```
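
Before spending quota on a long search, it can help to sanity-check the two dates. The helper below is a sketch of my own (`time_flags` is not part of twarc); it checks that both values parse as YYYY-MM-DD dates and that the start comes first, then returns the two flags ready to add to a command.

```python
from datetime import datetime

def time_flags(start, end):
    """Return --start-time/--end-time arguments after checking both dates."""
    fmt = "%Y-%m-%d"
    start_date = datetime.strptime(start, fmt)  # raises ValueError for a bad date
    end_date = datetime.strptime(end, fmt)
    if start_date >= end_date:
        raise ValueError("start must be earlier than end")
    return ["--start-time", start, "--end-time", end]

print(time_flags("2019-01-01", "2019-01-30"))
```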

### Location search
We can collect a user's self-defined location, which varies from pronouns, to memes, to relevant data. In the query below, the `place:` operator narrows the search to tweets tagged with a place.

```
twarc2 search --archive --limit 100 --start-time "2019-01-01" --end-time "2019-01-30" "(vegan OR vegetarian OR plant-based) place:london" vegan.jsonl
```

Finally, we can narrow the search to a radius around a point. This is one of Twitter's built-in operators, so it must go in the search term rather than be a flag for twarc. We add `point_radius:[lon lat radius]`, as in the example below. You can use a tool like [this](https://www.latlong.net/) to find the lon/lat needed. This is only supported in the standard tier, and as such we cannot use it with archive data.
```
twarc2 search --limit 100 "(vegan OR vegetarian OR plant-based) point_radius:[-0.127758 51.507351 5mi]" vegan.jsonl
```
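
It is easy to swap longitude and latitude, so here is a small sketch that formats the operator and checks the ranges. The helper name `point_radius` is my own, and the coordinates are the central London ones from the command above.

```python
def point_radius(lon, lat, radius_miles):
    """Format Twitter's point_radius operator; longitude comes first."""
    if not -180 <= lon <= 180 or not -90 <= lat <= 90:
        raise ValueError("expected longitude in [-180, 180] and latitude in [-90, 90]")
    return f"point_radius:[{lon} {lat} {radius_miles}mi]"

# Central London, using the coordinates from the example above.
query = "(vegan OR vegetarian OR plant-based) " + point_radius(-0.127758, 51.507351, 5)
print(query)
```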

So that's everything we need. There are some additional rules you can use based on your API tier [here](https://developer.twitter.com/en/docs/twitter-api/premium/rules-and-filtering/operators-by-product)

Let's run this one last time to collect our data.
```
twarc2 search --archive --limit 1000 --start-time "2019-01-04" --end-time "2019-01-06" "(vegan OR vegetarian OR plant-based) place:london" veganLondon1000.jsonl
```
Then convert it to a CSV:
```
twarc2 csv veganLondon1000.jsonl veganLondon1000.csv --no-inline-referenced-tweets
```
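
With the CSV in hand, we can keep just the columns from our wish list earlier in the case study. This is a sketch using the standard `csv` module; the column names in `WANTED` are my assumption about what twarc-csv produces, so check them against `df.info()` on your own file. A tiny made-up file stands in for `veganLondon1000.csv` so the cell runs on its own.

```python
import csv
import io

# Assumed twarc-csv column names; check them against df.info() on your own file.
WANTED = ["id", "text", "author_id", "author.username", "author.location",
          "public_metrics.retweet_count", "public_metrics.like_count"]

def select_columns(csv_file, wanted=WANTED):
    """Yield one dict per row holding only the wanted columns that exist."""
    reader = csv.DictReader(csv_file)
    present = [name for name in wanted if name in (reader.fieldnames or [])]
    for row in reader:
        yield {name: row[name] for name in present}

# A tiny made-up file stands in for veganLondon1000.csv here.
sample = io.StringIO("id,text,lang,author.username\n1,hello,en,someone\n")
print(list(select_columns(sample)))
```

For the real file, open it with `open("veganLondon1000.csv", newline="", encoding="utf-8")` and pass the file object to `select_columns`.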

## Useful Links
* If you want to play with some data check out my [Tidying tutorial](https://github.com/UKDataServiceOpen/working-with-twitter-data/blob/main/TidyingDemo.ipynb)
* This [Twarc tutorial](https://github.com/alblaine/twarc-tutorial) breaks down installing Python, API keys and beyond!
* This [Twarc2 tutorial](https://github.com/jeffcsauer/twarc-v2-tutorials/blob/master/twarc_fas.md) specifically covers archive searching.
* The [Twarc report plugin](https://github.com/pbinkley/twarc-report) automates a lot of the tidying and visualization we would do in a next step.
* The Twarc community is incredibly helpful; if you have any issues or questions, open them on their [GitHub](https://github.com/DocNow/twarc/issues).