# Harvest Tweet Corpus

## Overview

This notebook contains code to harvest Tweets that mention the following words and hashtags:

- `Bitcoin`
- `BTC`
- `#BTC`
- `#Bitcoin`

We'll use the [Twitter API for Academic Research][] for bulk access to Tweets, with the goal of building a corpus that spans eight years of Twitter history.

## Docs and other resources

- [Twitter API for Academic Research][] - overview of Academic-level access
- [Twitter Search API docs][] - describes the "premium"-level access that academic researchers get to the *Full Archive Search*, which lets you gather Tweets from as far back as 2006
- [Twitter Full Archive Search python example][] - the sample code that the demo below is based on



[Twitter API for Academic Research]: https://developer.twitter.com/en/products/twitter-api/academic-research

[Twitter Search API docs]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/overview

[Twitter Full Archive Search python example]: https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/Full-Archive-Search/full-archive-search.py


## Twitter Full Archive search

Below is sample code based on the [Twitter Full Archive Search python example][] (linked above).

### Preliminary steps

The code below only works if you start Jupyter Lab through the standard `pipenv` workflow:

```
cd cryptocurrency1/
pipenv run jupyter lab
```

The commands above start Jupyter Lab inside the project's virtual environment. `pipenv run` also loads the `.env` file, which makes `TWITTER_BEARER_TOKEN` available to your code as an environment variable.

If you followed that procedure, the below code should work.
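
A quick check before the first request can save a confusing authentication error later. The following is a minimal sketch, assuming the token is exposed under the name `TWITTER_BEARER_TOKEN` as described above; the helper name `require_env` is hypothetical and not part of the original sample.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Start the notebook with 'pipenv run jupyter lab' "
            "so that the .env file is loaded."
        )
    return value


# Usage in the notebook (uncomment once the .env file is in place):
# bearer_token = require_env("TWITTER_BEARER_TOKEN")
```

Failing early with a descriptive message is a design choice here; the original sample simply reads the variable and lets the API reject the request if it is missing.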

### Demo Twitter Full Archive search

The code below demonstrates how to use Twitter Full Archive Search to gather recent mentions of `bitcoin` or `#btc`.

You'll need to update this code to include all the keywords and hashtags that you're targeting. Along the way, check whether keyword and hashtag searches are case sensitive and, if so, account for that. They don't appear to be case sensitive, but verify this before relying on it.
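
One way to cover all four targets from the overview is to build the query string from a list of terms. This is only a sketch: the helper name `build_or_query` is hypothetical, and it assumes the same `OR` syntax that the demo query uses. It deliberately keeps the original casing of each term, since case sensitivity still needs to be verified.

```python
def build_or_query(terms):
    """Join search terms with OR, skipping blanks and exact duplicates."""
    cleaned = [term.strip() for term in terms if term.strip()]
    # dict.fromkeys removes exact duplicates while preserving order
    unique_terms = list(dict.fromkeys(cleaned))
    return " OR ".join(unique_terms)


query_params = {"query": build_or_query(["Bitcoin", "BTC", "#BTC", "#Bitcoin"])}
print(query_params["query"])  # Bitcoin OR BTC OR #BTC OR #Bitcoin
```

Keeping the terms in a list makes it easy to run the same search with different casings and compare result counts, which is one way to test the case-sensitivity question.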

Additionally, you'll need to extend the code to handle "paging" of results properly (by default, only 10 Tweets are returned per request).
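
A paging loop could look roughly like the sketch below. It assumes that each response carries a `meta.next_token` field when more results exist and that the token is passed back through the `next_token` parameter listed in the sample code's comments; confirm both against the docs. The helper name `iter_pages`, the `fetch_page` callable, and the one-second pause are illustrative assumptions, not part of the original sample.

```python
import time


def iter_pages(fetch_page, params, max_pages=None, pause_seconds=1.0):
    """Yield successive response pages until no next_token is returned.

    fetch_page is any callable that takes a params dict and returns the
    parsed JSON response, for example a wrapper around connect_to_endpoint.
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    pages_fetched = 0
    while True:
        page = fetch_page(params)
        yield page
        pages_fetched += 1
        next_token = page.get("meta", {}).get("next_token")
        if not next_token:
            break  # last page reached
        if max_pages is not None and pages_fetched >= max_pages:
            break  # caller-imposed safety limit
        params["next_token"] = next_token
        time.sleep(pause_seconds)  # assumed pause to stay under rate limits


# Usage with the demo function defined later in this notebook:
# tweets = []
# fetch = lambda p: connect_to_endpoint(search_url, p)
# for page in iter_pages(fetch, {"query": "bitcoin OR #btc"}, max_pages=5):
#     tweets.extend(page.get("data", []))
```

Passing the fetch function in as an argument keeps the loop independent of the HTTP code, so it can be exercised with canned responses before spending any API quota.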

Some docs you'll want to spend time with:

- [Data Endpoint][] details the various query parameters needed to target your keywords, hashtags, start/end dates, etc.
- [Pagination][] explains how to page through results and use start and end dates to limit the time window

[Data Endpoint]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search#DataEndpoint
               
[Pagination]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search#Pagination
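
Because the corpus is meant to span several years, it may help to split the harvest into smaller time windows and run one paged search per window. The sketch below generates month-long windows for the `start_time` and `end_time` parameters named in the sample code's comments. It assumes those parameters accept UTC timestamps of the form `YYYY-MM-DDTHH:MM:SSZ`; verify the expected format in the docs. The helper name `month_windows` is hypothetical.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed format, check the API docs


def month_windows(start_year, start_month, end_year, end_month):
    """Yield (start_time, end_time) strings, one pair per calendar month.

    The range runs from the first day of the start month up to, but not
    including, the first day of the end month.
    """
    year, month = start_year, start_month
    while (year, month) < (end_year, end_month):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        end = datetime(next_year, next_month, 1, tzinfo=timezone.utc)
        yield start.strftime(TIMESTAMP_FORMAT), end.strftime(TIMESTAMP_FORMAT)
        year, month = next_year, next_month


for start_time, end_time in month_windows(2021, 11, 2022, 2):
    print(start_time, end_time)
```

Each pair can be merged into the query parameters, for example `{"query": ..., "start_time": start_time, "end_time": end_time}`. Working month by month also makes it easier to resume a harvest that was interrupted partway through.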


In [6]:
import requests
import os
import json

# The below environment variable is in the .env file
# It becomes accessible when you run "pipenv run jupyter lab" to start
# this notebook from the command line.
bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")

search_url = "https://api.twitter.com/2/tweets/search/all"

# TODO: Customize these params based on your needs, for example start_time and end_time
# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
query_params = {'query': 'bitcoin OR #btc'}


def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2FullArchiveSearchPython"
    return r


def connect_to_endpoint(url, params):
    # Send the GET request to the URL passed in, not the global search_url.
    response = requests.request("GET", url, auth=bearer_oauth, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


json_response = connect_to_endpoint(search_url, query_params)
print(json.dumps(json_response, indent=4, sort_keys=True))


200
{
    "data": [
        {
            "id": "1510016495681937409",
            "text": "RT @silvinaescudero: \ud83d\udc51 #Upril is here fam \ud83e\udd0d\n\n\ud83d\udcb0I have 5 #ETH\u00a0 ready to invest!\n\n#SHILL me the best option\u263a\ufe0f\n\n#BTC\u00a0  #cryptocurrecy  #Metaver\u2026"
        },
        {
            "id": "1510016495509778436",
            "text": "RT @Crypto101SA: \ud83d\udcb8 1800TL \u00d6D\u00dcLL\u00dc \u00c7EK\u0130L\u0130\u015e\ud83c\udf81\n\n  Tek \u015fart bu tweete RT atmak  ve a\u015fa\u011f\u0131daki  HESAPLARI takip etmek \ud83c\udf89\n\n@MistoBistoCryp \ud83d\udfe1\n\n@jamie_\u2026"
        },
        {
            "id": "1510016493169389568",
            "text": "RT @TheBitcoinConf: Announcing Aarika Rhodes as a #Bitcoin2022 speaker!\n \n@AarikaRhodes is an elementary school teacher running to unseat C\u2026"
        },
        {
            "id": "1510016492607213569",
            "text": "@Mario_Gibney @achim @Truthcoin Nash says in the fut