# Harvest Tweet Corpus

## Overview

This notebook contains code to harvest Tweets that mention the following words and hashtags:

- `Bitcoin`
- `BTC`
- `#BTC`
- `#Bitcoin`

We'll use the [Twitter API for Academic Research][] for bulk access to Tweets, with the goal of building a corpus that spans eight years of Twitter history.

## Docs and other resources

- [Twitter API for Academic Research][] - overview of Academic-level access
- [Twitter Search API docs][] - covers the "premium"-level *Full Archive Search* available to academic researchers, which lets you gather Tweets as far back as 2006
- [Twitter Full Archive Search python example][] - the sample code that the demo in this notebook is based on



[Twitter API for Academic Research]: https://developer.twitter.com/en/products/twitter-api/academic-research

[Twitter Search API docs]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/overview

[Twitter Full Archive Search python example]: https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/Full-Archive-Search/full-archive-search.py


## Twitter Full Archive search

The sample code below is based on the [Twitter Full Archive Search python example][] linked above.

### Preliminary steps

The code below only works if you start the notebook with the standard `pipenv` workflow:

```
cd cryptocurrency1/
pipenv run jupyter lab
```

These commands start Jupyter Lab inside the project's virtual environment and load the `.env` file, which in turn makes `TWITTER_BEARER_TOKEN` available to your code as an environment variable.

If you followed that procedure, the code below should work.
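A quick way to confirm that the token really is visible to the notebook is to check for it before making any request. The sketch below is only an illustration: `require_env` is a hypothetical helper name, not part of the original example.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or fail with a clear message.

    `require_env` is a hypothetical helper introduced for this sketch.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; start the notebook with "
            "'pipenv run jupyter lab' so that the .env file is loaded."
        )
    return value


# Example use inside the notebook:
# bearer_token = require_env("TWITTER_BEARER_TOKEN")
```

Failing early like this gives a clearer message than the authorization error the API would otherwise return for a missing token.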

### Demo Twitter Full Archive search

The code below demonstrates how to use Twitter Full Archive Search to gather the most recent mentions of `bitcoin` within a date range.

You'll need to update this code to include all the keywords and hashtags that you're targeting. Along the way, check whether keyword and hashtag searches are case sensitive and, if so, account for that. They don't appear to be case sensitive, but verify this before relying on it.
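One way to keep the list of target terms manageable is to assemble the query string from Python lists. The sketch below assumes the usual query syntax, in which `OR` separates alternatives and a space between operators means AND. `build_query` and its parameter names are hypothetical, and the exclusions shown are just examples.

```python
def build_query(terms, exclusions=(), extra=()):
    """Join terms with OR, then AND the result with exclusions and other operators.

    `build_query` is a hypothetical helper introduced for this sketch.
    """
    def quote(term):
        # Multi-word phrases need double quotes to be matched as a phrase
        return f'"{term}"' if " " in term else term

    any_term = " OR ".join(quote(term) for term in terms)
    parts = [f"({any_term})"]
    parts += [f"-{operator}" for operator in exclusions]  # each exclusion is ANDed
    parts += list(extra)
    return " ".join(parts)


query = build_query(
    ["bitcoin", "btc", "#bitcoin", "#btc"],
    exclusions=["is:retweet", "is:nullcast"],
    extra=["lang:en"],
)
print(query)
# (bitcoin OR btc OR #bitcoin OR #btc) -is:retweet -is:nullcast lang:en
```

If matching turns out to be case sensitive after all, add the extra spellings (for example `Bitcoin` and `BTC`) to the `terms` list.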

Additionally, you'll need to flesh out the code to properly handle "paging" of results (by default, each request returns only 10 Tweets).
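A minimal sketch of such a paging loop follows. It assumes that each response carries the token for the next page in `meta.next_token` and that the last page omits it. `fetch_all_pages`, `fake_fetch`, `max_pages`, and `pause_seconds` are hypothetical names, and the pause is only a guess at staying within rate limits, so check the current limits for your access level.

```python
import time


def fetch_all_pages(fetch_page, params, max_pages=5, pause_seconds=1.0):
    """Collect Tweets across pages by following meta.next_token.

    fetch_page is any callable taking (params, next_token) and returning the
    parsed JSON of one page, for example:
        lambda p, t: connect_to_endpoint(search_url, p, t)
    """
    tweets = []
    next_token = None  # the first page is requested without a token
    for _ in range(max_pages):
        page = fetch_page(params, next_token)
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break  # no further pages
        time.sleep(pause_seconds)  # be gentle with the rate limit
    return tweets


def fake_fetch(params, next_token):
    """Stand-in for the real API so the loop can be tried offline."""
    pages = {
        None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "t1"}},
        "t1": {"data": [{"id": "3"}], "meta": {}},
    }
    return pages[next_token]


print(len(fetch_all_pages(fake_fetch, {}, pause_seconds=0)))  # prints 3
```

Passing the fetch function in as an argument keeps the loop testable without network access. The `max_pages` cap is a safety net against accidentally draining the monthly Tweet quota.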

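Some docs you'll want to spend time with. Both pages describe the premium Search API, so the parameter names there (such as `fromDate`, `toDate`, and `next`) differ from the v2 names used in the code below (`start_time`, `end_time`, and `next_token`):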

- [Data Endpoint][] details the various query parameters needed to target your keywords, hashtags, start/end dates, etc.
- [Pagination][] explains how to page through results and use start and end dates to limit the time window

[Data Endpoint]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search#DataEndpoint
               
[Pagination]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search#Pagination
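The start and end dates mentioned above can also be used to split a multi-year harvest into smaller windows, so that a failed run only has to repeat one window. The sketch below generates one window per calendar year in the timestamp format used by the demo code. `yearly_windows` is a hypothetical helper, and it assumes that `end_time` is exclusive; if it is not, deduplicate Tweets by `id` at the window boundaries.

```python
from datetime import datetime, timezone


def yearly_windows(start_year, end_year):
    """Yield (start_time, end_time) strings, one pair per calendar year.

    end_year is exclusive. `yearly_windows` is a hypothetical helper.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    for year in range(start_year, end_year):
        start = datetime(year, 1, 1, tzinfo=timezone.utc)
        end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        yield start.strftime(fmt), end.strftime(fmt)


windows = list(yearly_windows(2013, 2022))
print(windows[0])    # ('2013-01-01T00:00:00Z', '2014-01-01T00:00:00Z')
print(len(windows))  # 9
```

Each pair can be copied into the `start_time` and `end_time` entries of the query parameters before running a paged search for that window.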


In [4]:
import requests
import os
import json

# The below environment variable is in the .env file
# It becomes accessible when you run "pipenv run jupyter lab" to start
# this notebook from the command line.
bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")

search_url = "https://api.twitter.com/2/tweets/search/all"

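# TODO: Customize these params based on your needs, for example start_time and end_time
# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
#
# NOTE: operators separated by a space are ANDed. The negations below are joined
# with OR, so a Tweet only has to satisfy one of them to match; join them with
# spaces instead if every exclusion should apply.
# NOTE: user.fields only takes effect together with 'expansions': 'author_id'.
query_params = {
    'query': '("bitcoin" OR "Bitcoin")(-is:retweet OR -is:reply OR -is:quote OR -has:cashtags OR -has:media OR -has:links OR -has:mentions OR -has:videos OR -has:images OR -has:hashtags) -is:nullcast (-ethereum OR -cardano OR -solana OR -terra OR -avalanche OR -tether OR -https) place_country:US lang:en',
    'start_time': '2013-01-01T00:00:00Z',
    'end_time': '2022-01-01T00:00:00Z',
    'max_results': 10,
    'tweet.fields': 'text,created_at,public_metrics',
    'user.fields': 'verified,location',
}
# next_token is deliberately left out here: it must be a token string taken from
# the previous response's meta.next_token, so an empty dict is not a valid value.
# connect_to_endpoint() adds it when paging.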
 
def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2FullArchiveSearchPython"
    return r


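def connect_to_endpoint(url, params, next_token=None):
    # Copy the params so the caller's dict is not modified between calls
    params = dict(params)
    if next_token:
        params['next_token'] = next_token
    else:
        # Drop any placeholder value; the first page is requested without a token
        params.pop('next_token', None)
    # Use the url argument rather than the global search_url
    response = requests.get(url, auth=bearer_oauth, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()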


json_response = connect_to_endpoint(search_url, query_params)
print(json.dumps(json_response, indent=4, sort_keys=True))


200
{
    "data": [
        {
            "created_at": "2021-12-31T23:59:18.000Z",
            "id": "1477066868489814016",
            "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "@rahulmagan8 Happy new, sir\n\nI'm cesar delgado \n\nI got your contact info from your YouTube channel. Great  content.\n \nI'm a intermediary for a  chief compliance officer of financial platform for financial derivatives and bitcoin.  The mandate is direct. At present time there's"
        },
        {
            "created_at": "2021-12-31T23:42:27.000Z",
            "id": "1477062626094030848",
            "public_metrics": {
                "like_count": 2,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "@mikeinspace Blame @GaryGensler. A spot #Bitcoin ETF would amelior

In [1]:
# curl is a shell command, so it needs the leading "!" to run from a notebook cell;
# without it the cell is parsed as Python and fails with "SyntaxError: invalid syntax".
# The Authorization header must have the form "Bearer <token>". Read the token from
# the TWITTER_BEARER_TOKEN environment variable instead of pasting it into the notebook.
!curl "https://api.twitter.com/2/tweets/counts/all?query=(bitcoin%20OR%20%23bitcoin)%20lang%3Aen%20(-is%3Aretweet%20OR%20-is%3Areply%20OR%20-is%3Aquote)%20-is%3Anullcast%20(-has%3Amedia%20OR%20-has%3Acashtags%20OR%20-has%3Aimages%20OR%20-has%3Avideos)%20(-ethereum%20OR%20-cardano%20OR%20-solana%20OR%20-terra%20OR%20-avalanche%20OR%20-tether)&start_time=2017-01-01T00:00:00.000Z&end_time=2022-04-01T00:00:00.000Z&granularity=day" -H "Authorization: Bearer $TWITTER_BEARER_TOKEN"
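The same counts request can also be prepared from Python, which avoids hand-encoding the query string. The sketch below only builds the URL and headers and sums the buckets of a response; it sends nothing itself. `build_counts_request` and `total_count` are hypothetical helpers, and the response shape (a `data` list of buckets, each with a `tweet_count`) is an assumption about the counts endpoint that you should confirm against a real response.

```python
import os
from urllib.parse import urlencode

COUNTS_URL = "https://api.twitter.com/2/tweets/counts/all"


def build_counts_request(query, start_time, end_time, granularity="day", token=None):
    """Return (url, headers) for a counts request; nothing is sent here."""
    if token is None:
        token = os.environ.get("TWITTER_BEARER_TOKEN", "")
    params = {
        "query": query,
        "start_time": start_time,
        "end_time": end_time,
        "granularity": granularity,
    }
    url = f"{COUNTS_URL}?{urlencode(params)}"  # urlencode handles the escaping
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers


def total_count(response_json):
    """Sum tweet_count over all buckets (assumed response shape)."""
    return sum(bucket["tweet_count"] for bucket in response_json.get("data", []))


# Example use with the requests library already imported in this notebook:
# url, headers = build_counts_request("(bitcoin OR #bitcoin) lang:en",
#                                     "2017-01-01T00:00:00Z", "2022-04-01T00:00:00Z")
# counts = requests.get(url, headers=headers).json()
# print(total_count(counts))
```

Checking the counts first gives a rough idea of how many Tweets a query would return before spending quota on the full search.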

In [None]:
# next_token is left out on purpose: it must be a token string from a previous
# response's meta.next_token, so an empty dict is not a valid value.
query_params = {
    'query': '("bitcoin" OR "Bitcoin" OR #bitcoin)(-is:retweet OR -is:reply OR -is:quote OR -has:cashtags OR -has:media OR -has:links OR -has:mentions OR -has:videos OR -has:images OR -has:hashtags) -is:nullcast (-ethereum OR -cardano OR -solana OR -terra OR -avalanche OR -tether OR -https) lang:en',
    'start_time': '2013-01-01T00:00:00Z',
    'end_time': '2022-01-01T00:00:00Z',
    'max_results': 10,
    'tweet.fields': 'text,created_at,public_metrics',
    'user.fields': 'verified',
}
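Once results come back in the shape shown in the demo output above (`id`, `created_at`, `text`, and a nested `public_metrics` object), they need to be stored somewhere to form the corpus. The sketch below flattens each Tweet into one row and serializes the rows as JSON Lines. `flatten_tweet` and `to_jsonl` are hypothetical helpers, and JSON Lines is just one reasonable storage choice, not something the original example prescribes.

```python
import json


def flatten_tweet(tweet):
    """Turn one Tweet object from the response into a flat dict."""
    metrics = tweet.get("public_metrics", {})
    return {
        "id": tweet["id"],
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "like_count": metrics.get("like_count", 0),
        "quote_count": metrics.get("quote_count", 0),
        "reply_count": metrics.get("reply_count", 0),
        "retweet_count": metrics.get("retweet_count", 0),
    }


def to_jsonl(tweets):
    """Serialize Tweets as JSON Lines: one JSON object per line."""
    return "".join(
        json.dumps(flatten_tweet(tweet), ensure_ascii=False) + "\n"
        for tweet in tweets
    )


# Example use after a search call:
# with open("bitcoin_tweets.jsonl", "a", encoding="utf-8") as handle:
#     handle.write(to_jsonl(json_response.get("data", [])))
```

Appending one page at a time means a long harvest can be interrupted and resumed without losing the Tweets already collected.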