# Search Tweets

Twitter is a rich source of information for researchers interested in studying 'the public conversation' about any conceivable topic.

To find Tweets related to your topic of interest, you can search using two endpoints: **recent search** and **full-archive search**. Recent search returns Tweets from the past 7 days, while full-archive search goes back to the first public Tweet from March 2006.

To filter Tweets, no matter which endpoint you are using, you need to provide a **search query**.
According to the Twitter website, "These search queries are created with a set of operators that match on Tweet and user attributes, such as message keywords, hashtags, and URLs. Operators can be combined into queries with boolean logic and parentheses to help refine the queries matching behavior." More on these queries later.

This guide describes how you can use the [tweet_collector repository](https://github.com/UtrechtUniversity/tweet_collector) to extract Tweets for *your* research!

## The tweet_collector repository
The tweet_collector repository consists of three main folders: the `config`, `guidelines`, and `src` folders.

The `search_tweet.py` script can be found in the `src` folder. The script takes three arguments:
1. `--credential-file <CREDENTIAL_FILENAME>`
2. `--config-file <CONFIG_FILENAME>` 
3. `--env-overwrite <BOOLEAN>` 

The credential and configuration files are required to run the script and can be found in the `config` folder. The `--env-overwrite` argument overwrites YAML-parsed credentials with any environment variables that are set (default is TRUE).

Finally, guidelines on how to apply for Academic Research access are stored in the `guidelines` folder.

## Requirements


### 1. Credential file
The credential file holds your Twitter credentials. The simplest credential file should look like this:

```
search_tweets_v2:
  endpoint:  https://api.twitter.com/2/tweets/search/...
  consumer_key: ek...
  consumer_secret: hy...
  bearer_token: AA...
```

By default, this library expects this file at `~/.twitter_keys.yaml`, but you can pass the relevant location as needed with the `--credential-file <CREDENTIAL_FILENAME>` flag for the command-line app.

#### 1.A. Recent Search
To execute a recent search, the endpoint specification in the credential file needs to be set to 'recent'. The ‘recent’ search endpoint provides Tweets from the **past 7 days**.

```
search_tweets_v2:
  endpoint:  https://api.twitter.com/2/tweets/search/recent
  consumer_key: ek...
  consumer_secret: hy...
  bearer_token: AA...
```

#### 1.B. Full-archive Search
To execute a full-archive search, the endpoint specification in the credential file needs to be set to 'all'. The 'all' search endpoint, launched in January 2021 as part of the 'academic research' tier of Twitter API v2 access, provides access to all publicly available Tweets posted **since March 2006**.

```
search_tweets_v2:
  endpoint:  https://api.twitter.com/2/tweets/search/all
  consumer_key: ek...
  consumer_secret: hy...
  bearer_token: AA...
```
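As a rough illustration of the difference between the two endpoints, the sketch below picks the endpoint URL based on how far back a search needs to reach. The function name `choose_endpoint` and the 7-day cut-off logic are assumptions made for this example; the tweet_collector script itself simply reads the endpoint from the credential file.

```python
from datetime import datetime, timedelta, timezone

# Endpoint URLs as used in the credential files above.
RECENT_ENDPOINT = "https://api.twitter.com/2/tweets/search/recent"
ALL_ENDPOINT = "https://api.twitter.com/2/tweets/search/all"


def choose_endpoint(start_time, now=None):
    """Return the endpoint that can serve Tweets as old as start_time.

    Hypothetical helper: recent search covers the past 7 days,
    anything older needs the full-archive endpoint.
    """
    now = now or datetime.now(timezone.utc)
    if now - start_time <= timedelta(days=7):
        return RECENT_ENDPOINT
    return ALL_ENDPOINT


if __name__ == "__main__":
    three_days_ago = datetime.now(timezone.utc) - timedelta(days=3)
    print(choose_endpoint(three_days_ago))
    print(choose_endpoint(datetime(2021, 5, 1, tzinfo=timezone.utc)))
```

Keep in mind that the full-archive endpoint is only available with Academic Research access.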

### 2. Configuration file
The configuration file (i.e., `api_config.config`) contains all parameters. Placing all parameters in one file is far easier than passing them as command-line arguments. If a valid configuration file is found, all arguments are populated from it. N.B. Any remaining command-line arguments override the values found in the config file (if `--env-overwrite` is not set to FALSE).

An example of such a config-file:

```
[search_rules]
start_time = 2021-05-01
end_time = 2021-06-01
query = (happy OR happiness) lang:en -birthday -is:retweet has:hashtags
tweet_fields = id,created_at,text,public_metrics
user_fields = description
expansions=author_id

[search_params]
results_per_call = 10
max_tweets = 10
output_format = r

[output_params]
save_file = True
filename_prefix = output_new
results_per_file = 10
```
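To check a config file before running a search, you could read it with Python's standard `configparser` module. This is a minimal sketch under the assumption that the file follows the INI-style layout shown above; it is not necessarily how the tweet_collector script parses the file, and `load_config` is a name chosen for this example.

```python
import configparser

# A config in the same layout as the example above.
EXAMPLE_CONFIG = """
[search_rules]
start_time = 2021-05-01
end_time = 2021-06-01
query = (happy OR happiness) lang:en -birthday -is:retweet has:hashtags
tweet_fields = id,created_at,text,public_metrics
user_fields = description
expansions=author_id

[search_params]
results_per_call = 10
max_tweets = 10
output_format = r

[output_params]
save_file = True
filename_prefix = output_new
results_per_file = 10
"""


def load_config(text):
    # Use "=" as the only delimiter so that ":" inside the query is never
    # mistaken for a key/value separator; disable "%" interpolation as well.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(text)
    return parser


config = load_config(EXAMPLE_CONFIG)
query = config["search_rules"]["query"]
tweet_fields = config["search_rules"]["tweet_fields"].split(",")
max_tweets = config.getint("search_params", "max_tweets")
save_file = config.getboolean("output_params", "save_file")

print(query)
print(tweet_fields, max_tweets, save_file)
```

To read an actual file instead of a string, `parser.read("config/api_config.config")` can replace `read_string`.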

#### 2.A. Search rules
In the config-file you can enter all search rules necessary for your research. 

To filter Tweets based on time/id, use one or more of the following search rules:
```
start_time = <Start of datetime window, format ‘YYYY-mm-DDTHH:MM’> 
end_time = <End of datetime window, format ‘YYYY-mm-DDTHH:MM’>
since_id = <Tweet ID, will start search from Tweets after this one>
until_id = <Tweet ID, will end search from Tweets before this one>
```
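If you compute the time window in Python, the 'YYYY-mm-DDTHH:MM' format corresponds to the `strftime` pattern `%Y-%m-%dT%H:%M`. The helper `time_window` below is a hypothetical convenience function for this guide, not part of the repository.

```python
from datetime import datetime, timedelta, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M"  # 'YYYY-mm-DDTHH:MM'


def time_window(end, days):
    """Return (start_time, end_time) strings for a window of `days` days ending at `end`."""
    start = end - timedelta(days=days)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)


start_time, end_time = time_window(datetime(2021, 6, 1, tzinfo=timezone.utc), days=31)
print(f"start_time = {start_time}")  # start_time = 2021-05-01T00:00
print(f"end_time = {end_time}")      # end_time = 2021-06-01T00:00
```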

To include extra information in your results, you can use various 'fields' rules (see [this page](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user) for more info):
```
tweet_fields = <A comma-delimited list of Tweet JSON attributes to include in endpoint responses>
place_fields = <A comma-delimited list of Twitter Place JSON attributes to include in endpoint responses>
user_fields = <A comma-delimited list of User JSON attributes to include in endpoint responses>
media_fields = <A comma-delimited list of media JSON attributes to include in endpoint responses>
poll_fields = <A comma-delimited list of Twitter Poll JSON attributes to include in endpoint responses>
```
Example: if you want to know more about the geographical information of the filtered Tweets, you can include `place_fields = country,country_code`.

Fields that belong to an object other than the Tweet itself (user, place, media, or poll fields) are only returned if you also provide the matching 'expansions' rule:
```
expansions = <A comma-delimited list of expansions. Specified expansions results in full objects in the ‘includes’ response object>
```
Example: if you wish to include `place_fields = country,country_code`, you also have to include `expansions = geo.place_id`. The place ID itself is located in the Tweet object, while the ID together with all additional place fields appears in the 'includes' data object. (Please see [this page](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user) for field-expansion pairs.)
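A quick way to catch a forgotten expansion is to check the search rules before running the script. The sketch below is an illustration only: the mapping lists one expansion per field type as an assumption based on the Twitter API v2 field and expansion names (other expansions, such as `in_reply_to_user_id`, can also return user objects), and `missing_expansions` is a hypothetical helper.

```python
# Assumed mapping: one expansion that returns the object each rule refers to.
REQUIRED_EXPANSION = {
    "user_fields": "author_id",
    "place_fields": "geo.place_id",
    "media_fields": "attachments.media_keys",
    "poll_fields": "attachments.poll_ids",
}


def missing_expansions(search_rules):
    """Return expansions that the requested fields need but the rules do not list."""
    requested = {
        item.strip()
        for item in search_rules.get("expansions", "").split(",")
        if item.strip()
    }
    return sorted(
        expansion
        for field, expansion in REQUIRED_EXPANSION.items()
        if search_rules.get(field) and expansion not in requested
    )


rules = {
    "user_fields": "description",
    "place_fields": "country,country_code",
    "expansions": "author_id",
}
print(missing_expansions(rules))  # ['geo.place_id']
```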

Finally, the most important rule is your search query:
```
query = <Search query>
```

##### Search query
In the query specification, you enter how you wish to filter Tweets. Commonly used operators are:
* `<key_word1> OR <key_word2>` (look for Tweets including either key_word1 or key_word2)
* `lang:<lang>` (only receive Tweets in a specific language. Example: lang:en selects only English Tweets)
* `-is:<type>` ('-' is a negation operator; excludes certain types of Tweets. Example: -is:retweet excludes retweets, leaving only original Tweets)
* `-<key_word>` ('-' is a negation operator; excludes Tweets with key_word in it)
* `has:<prop>` (matches Tweets that have a specific property. Example: has:geo selects Tweets with Tweet-specific geolocation data provided by the Twitter user)

Hence, if you want to look for original Tweets in English that are related to happy or happiness and contain at least one hashtag, but are not related to birthday wishes, you write:
```
query = (happy OR happiness) lang:en -birthday -is:retweet has:hashtags
```
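When a query combines many keywords, it can help to assemble it programmatically. The function below is a minimal sketch that only covers the operators listed above; `build_query` and its parameter names are made up for this example.

```python
def build_query(any_of, language=None, exclude_words=(), exclude_types=(), has=()):
    """Assemble a search query string from the operators described above."""
    parts = []
    if len(any_of) > 1:
        # Group alternatives in parentheses; OR must be written in upper case.
        parts.append("(" + " OR ".join(any_of) + ")")
    else:
        parts.extend(any_of)
    if language:
        parts.append(f"lang:{language}")
    parts.extend(f"-{word}" for word in exclude_words)
    parts.extend(f"-is:{kind}" for kind in exclude_types)
    parts.extend(f"has:{prop}" for prop in has)
    return " ".join(parts)


query = build_query(
    any_of=["happy", "happiness"],
    language="en",
    exclude_words=["birthday"],
    exclude_types=["retweet"],
    has=["hashtags"],
)
print(f"query = {query}")
# query = (happy OR happiness) lang:en -birthday -is:retweet has:hashtags
```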

To get an extensive overview of how you can structure a query, have a look [here](https://developer.twitter.com/en/docs/twitter-api/tweets/counts/integrate/build-a-query). 

#### 2.B. Search parameters
In the config-file you can enter all search parameters necessary for your research. 

Here are some examples:
```
[search_params]
results_per_call = <Number of results to return per call (default 10; max 100)>
max_tweets = <Maximum number of Tweets to return for this session of requests>
max_pages = <Maximum number of pages/API calls to use for this session>
output_format = <Set output format*>
extra_headers = <JSON-formatted str representing a dict of additional HTTP request headers>
```
\* Output formats: ‘r’ = unmodified API [R]esponses (default); ‘a’ = [A]tomic Tweets, i.e. Tweet objects with expansions inline; ‘m’ = [M]essage stream, i.e. Tweets, expansions, and pagination metadata as a stream of messages.
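To get a feeling for how these parameters interact, you can estimate the number of API calls a session will make. This is a back-of-the-envelope sketch that assumes every call returns a full page and that the session stops at whichever limit is reached first; `estimated_calls` is a hypothetical helper.

```python
import math


def estimated_calls(max_tweets, results_per_call, max_pages=None):
    """Estimate the number of API calls needed, assuming every page is full."""
    calls = math.ceil(max_tweets / results_per_call)
    if max_pages is not None:
        # The session cannot use more calls than max_pages allows.
        calls = min(calls, max_pages)
    return calls


print(estimated_calls(max_tweets=10, results_per_call=10))                 # 1
print(estimated_calls(max_tweets=250, results_per_call=100))               # 3
print(estimated_calls(max_tweets=250, results_per_call=100, max_pages=2))  # 2
```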

#### 2.C. Output parameters
In the config-file you can enter all output parameters necessary for your research. 

Here are some examples:
```
[output_params]
save_file = True
filename_prefix = <prefix for the filename where tweet json data will be stored>
results_per_file = <Maximum tweets to save per file>
```


## Running the script
After you've filled in your credentials in `.twitter_keys.yaml` and entered all parameters needed for your Twitter search in the `api_config.config` file, you can run the script from the command line with the following commands:

```
cd tweet_collector
python3 src/search_tweet.py --credential-file "config/.twitter_keys.yaml" --config-file "config/api_config.config" 
```

### Output
Running this command with the `api_config.config` example described above results in the following output:
```
{"data": [{"text": "Happy 7th Anniversary, DARREN &amp; DARRENatics!\n\n7years of happiness even by simply looking at you or watching you.\n\nLove you then, love you now, love you always!\n\n#DarrenEspanto @Espanto2001", "created_at": "2021-05-31T23:23:33.000Z", "id": "1399506867999547392", "author_id": "2749743179", "public_metrics": {"retweet_count": 15, "reply_count": 2, "like_count": 30, "quote_count": 0}}, 
(...)
{"id": "1399409079446081536", "public_metrics": {"retweet_count": 0, "reply_count": 0, "like_count": 2, "quote_count": 0}, "created_at": "2021-05-31T16:54:58.000Z", "text": "To Whom It May Concern:  All the people who tell you who you should or should not be, will never be there for you as much as you wish. Be whoever the hell makes you happy. If people don\u2019t want you to achieve personal happiness/acceptance, why keep them around? #MutantFam"}}], 

"includes": {"users": [{"username": "ggr_mom", "description": "Certified DN since 2014 \u2022 Follow DARREN\u2019s Twitter @Espanto2001 \u2022IG: @darrenespanto \u2022YT: Darren Lyndon Espanto\u2022 HOME RUN CONCERT 6.19.2021 \ud83d\udc9a New single: TAMA NA", "id": "2749743179", "name": "DarrenaticMom"},
(...)
{"username": "UselessKnwldge", "description": "Comedic reviews of horror with @arthurreturns Please subscribe. #HorrorFamily #MutantFam #TheLastDriveIn #HorrorReviews #HorrorPodcast", "id": "196303269", "name": "Useless Knowledge Casey"}]},

"meta": {"newest_id": "1399506867999547392", "oldest_id": "1399409079446081536", "result_count": 10, "next_token": "b26v89c19zqg8o3foswucojbbx9lqiez94auwn10p55z1"}}
```
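Because the user objects live in the 'includes' part of the response, you typically join them to the Tweets via `author_id`. The sketch below illustrates this on a shortened response with the same layout as the output above (texts abbreviated, only one Tweet kept); it is an example of working with the response, not the repository's own processing code.

```python
import json

# Shortened response in the same layout as the output above.
response_text = """
{"data": [{"id": "1399506867999547392", "author_id": "2749743179",
           "created_at": "2021-05-31T23:23:33.000Z",
           "text": "Happy 7th Anniversary ...",
           "public_metrics": {"retweet_count": 15, "reply_count": 2,
                              "like_count": 30, "quote_count": 0}}],
 "includes": {"users": [{"id": "2749743179", "username": "ggr_mom",
                         "name": "DarrenaticMom", "description": "..."}]},
 "meta": {"result_count": 1}}
"""

response = json.loads(response_text)

# Index the expanded user objects by their ID for quick lookup.
users_by_id = {
    user["id"]: user
    for user in response.get("includes", {}).get("users", [])
}

rows = []
for tweet in response["data"]:
    author = users_by_id.get(tweet.get("author_id"), {})
    rows.append({
        "id": tweet["id"],
        "created_at": tweet["created_at"],
        "username": author.get("username"),
        "like_count": tweet["public_metrics"]["like_count"],
    })

print(rows)
```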

### Saved format
The output will be saved as:
1. a dictionary, in a `<filename_prefix>.json` file
2. a compact `<filename_prefix>.csv` file
3. an unpacked and cleaned table, `<filename_prefix_cleaned>.csv`
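
To give an idea of what "unpacking" a Tweet into a table can look like, the sketch below flattens nested objects such as `public_metrics` into separate columns and writes them as CSV. The dotted column names and the `flatten` helper are assumptions for illustration; the cleaned file produced by the script may use a different layout.

```python
import csv
import io


def flatten(record, parent=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# One Tweet taken from the example output above (text omitted).
tweets = [
    {"id": "1399506867999547392", "created_at": "2021-05-31T23:23:33.000Z",
     "public_metrics": {"retweet_count": 15, "reply_count": 2,
                        "like_count": 30, "quote_count": 0}},
]

rows = [flatten(tweet) for tweet in tweets]

# Write to an in-memory buffer here; use open(..., newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```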