# Harvest Tweet Corpus

## Overview

This notebook contains code to harvest Tweets that mention the following words and hashtags:

- `Bitcoin`
- `BTC`
- `#BTC`
- `#Bitcoin`

We'll use the [Twitter API for Academic Research][] for bulk access to Tweets, with the goal of building a corpus that spans eight years of Twitter history.

## Docs and other resources

- [Twitter API for Academic Research][] - overview of Academic-level access
- [Twitter Search API docs][] - Provides "premium"-level access for academic researchers to the *Full Archive Search*, which allows you to gather Tweets as far back as 2006.
- [Twitter Full Archive Search python example][]



[Twitter API for Academic Research]: https://developer.twitter.com/en/products/twitter-api/academic-research

[Twitter Search API docs]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/overview

[Twitter Full Archive Search python example]: https://github.com/twitterdev/Twitter-API-v2-sample-code/blob/main/Full-Archive-Search/full-archive-search.py


## Twitter Full Archive search

Below is sample code based on the [Twitter Full Archive Search python example][] (linked above).

### Preliminary steps

The code below will only work if you start Jupyter Lab through the standard `pipenv` workflow:

```
cd cryptocurrency1/
pipenv run jupyter lab
```

The above starts Jupyter Lab in the context of a virtual environment, which in turn makes the `TWITTER_BEARER_TOKEN` available to your code as an environment variable.

If you followed that procedure, the below code should work.
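
A quick way to confirm the token is actually visible before making any API calls is to fail early with a readable message. The helper below is a minimal sketch; `require_env` is a hypothetical name, not part of the Twitter sample code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; start Jupyter with 'pipenv run jupyter lab' "
            "so the .env file is loaded."
        )
    return value
```

In the cells below, `bearer_token = require_env("TWITTER_BEARER_TOKEN")` could stand in for the plain `os.environ.get(...)` call, so a missing token surfaces immediately instead of as an authorization error from the API.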

### Demo Twitter Full Archive search

The code below demonstrates how to use Twitter Full Archive Search to gather recent mentions of `bitcoin` or `#btc`.

You'll need to update this code to include all the keywords and hashtags that you're targeting. Along the way, check whether keyword and hashtag searches are case sensitive and, if they are, account for that. They don't appear to be case sensitive, but this should be verified.
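
One way to cover all four target terms in a single request is to join them with `OR` inside a parenthesized group. The sketch below assumes that approach; `build_query` and its default filters are illustrative choices, not something the Twitter sample code provides.

```python
def build_query(terms, filters=("-is:retweet", "lang:en")):
    """Join search terms with OR and append filter operators."""
    grouped = "(" + " OR ".join(terms) + ")"
    return " ".join([grouped, *filters])


query = build_query(["Bitcoin", "BTC", "#BTC", "#Bitcoin"])
print(query)
# (Bitcoin OR BTC OR #BTC OR #Bitcoin) -is:retweet lang:en
```

If matching turns out to be case sensitive after all, extra spellings such as `bitcoin` can simply be added to the list of terms.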

Additionally, you'll need to flesh out the code to properly handle paging of results (by default, only 10 Tweets are returned per request).
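
As a rough sketch of what paging could look like: the v2 endpoint used in the code below returns a `meta.next_token` value while more results remain, and that value is sent back as the `next_token` parameter on the following request. Here `fetch_page` is a hypothetical callable (in this notebook it could wrap `connect_to_endpoint`), and `max_pages` is an assumed safety cap so a broad query cannot run indefinitely.

```python
def fetch_all_pages(fetch_page, params, max_pages=5):
    """Collect Tweets across pages by following meta.next_token."""
    tweets = []
    next_token = None
    for _ in range(max_pages):
        page_params = dict(params)  # leave the caller's dict untouched
        if next_token is not None:
            page_params["next_token"] = next_token
        page = fetch_page(page_params)
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            break  # no more pages
    return tweets


# Stand-in for the API: two fake pages keyed by the token that requests them.
fake_pages = {
    None: {"data": [{"id": "1"}, {"id": "2"}], "meta": {"next_token": "abc"}},
    "abc": {"data": [{"id": "3"}], "meta": {}},
}


def fake_fetch(page_params):
    return fake_pages[page_params.get("next_token")]


all_tweets = fetch_all_pages(fake_fetch, {"query": "bitcoin", "max_results": 10})
print([t["id"] for t in all_tweets])
# ['1', '2', '3']
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.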

Some docs you'll want to spend time with:

- [Data Endpoint][] details the various query parameters needed to target your keywords, hashtags, start/end dates, etc.
- [Pagination][] explains how to page through results and use start and end dates to limit the time window

[Data Endpoint]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search#DataEndpoint
               
[Pagination]: https://developer.twitter.com/en/docs/twitter-api/premium/search-api/api-reference/premium-search#Pagination
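
The start and end dates mentioned above can be generated rather than typed by hand. The sketch below builds one window per calendar month and assumes that `end_time` is exclusive, so each window ends at the first instant of the following month. `month_windows` is a hypothetical helper, and the `M-YYYY` label format simply mirrors the labels used later in this notebook.

```python
from datetime import datetime, timezone


def month_windows(year, first_month=1, last_month=12):
    """Return (label, start_time, end_time) tuples, one per calendar month."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    windows = []
    for month in range(first_month, last_month + 1):
        start = datetime(year, month, 1, tzinfo=timezone.utc)
        # The window ends at the first instant of the following month.
        if month == 12:
            end = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        else:
            end = datetime(year, month + 1, 1, tzinfo=timezone.utc)
        windows.append((f"{month}-{year}", start.strftime(fmt), end.strftime(fmt)))
    return windows


for label, start_time, end_time in month_windows(2013, 1, 3):
    print(label, start_time, end_time)
# 1-2013 2013-01-01T00:00:00Z 2013-02-01T00:00:00Z
# 2-2013 2013-02-01T00:00:00Z 2013-03-01T00:00:00Z
# 3-2013 2013-03-01T00:00:00Z 2013-04-01T00:00:00Z
```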


In [12]:
import requests
import os
import json

# The below environment variable is in the .env file
# It becomes accessible when you run "pipenv run jupyter lab" to start
# this notebook from the command line.
bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")

search_url = "https://api.twitter.com/2/tweets/search/all"

# TODO: Customize these params based on your needs, for example start_time and end_time
# Optional params: start_time,end_time,since_id,until_id,max_results,next_token,
# expansions,tweet.fields,media.fields,poll.fields,place.fields,user.fields
#
# NOTE: (-is:retweet OR -is:reply OR -is:quote) keeps any Tweet that fails at least
# one of the three tests; to exclude all three kinds, write
# -is:retweet -is:reply -is:quote instead.
query_params = {
    'query': '("bitcoin")(-is:retweet OR -is:reply OR -is:quote) (-has:mentions -has:cashtags -has:media -has:links -has:videos -has:images -has:hashtags) -is:nullcast -ethereum -cardano -dogecoin -solana -terra -avalanche -tether place_country:US lang:en',
    'start_time': '2021-01-01T00:00:00Z',
    'end_time': '2021-12-31T00:00:00Z',
    'max_results': 500,
    'tweet.fields': 'text,created_at,public_metrics',
    # user.fields only affects the response when 'expansions': 'author_id' is also requested.
    'user.fields': 'verified,location',
}
# next_token is left out of query_params on purpose: the first request has no token,
# and connect_to_endpoint() accepts it as a separate argument for later pages.
 
def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2FullArchiveSearchPython"
    return r


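def connect_to_endpoint(url, params, next_token=None):
    # Copy the params so the caller's dict is not modified between calls.
    params = dict(params)
    if next_token is not None:
        params['next_token'] = next_token
    else:
        # Drop any placeholder token so the first page is requested cleanly.
        params.pop('next_token', None)
    response = requests.request("GET", url, auth=bearer_oauth, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()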


json_response = connect_to_endpoint(search_url, query_params)
print(json.dumps(json_response, indent=4, sort_keys=True))


200
{
    "data": [
        {
            "created_at": "2021-12-31T23:06:30.000Z",
            "id": "1477053580377038855",
            "public_metrics": {
                "like_count": 1,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "Bitcoin been fluctuating up and down for a minute. Like every week it will be a 5k profit or loss. Almost like folks using it to give themselves paychecks."
        },
        {
            "created_at": "2021-12-31T21:59:09.000Z",
            "id": "1477036631362179075",
            "public_metrics": {
                "like_count": 5,
                "quote_count": 0,
                "reply_count": 1,
                "retweet_count": 0
            },
            "text": "Bitcoin is portable magic \u2728"
        },
        {
            "created_at": "2021-12-31T20:39:39.000Z",
            "id": "1477016622430752774",
            "public_metrics": {
            

In [21]:
import requests
import os
import json


# The below environment variable is in the .env file
# It becomes accessible when you run "pipenv run jupyter lab" to start
# this notebook from the command line.
bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")

search_url = "https://api.twitter.com/2/tweets/search/all"


		
# Request parameters (including max_results) are built inside the loop below.
 
def bearer_oauth(r):
    """
    Method required by bearer token authentication.
    """

    r.headers["Authorization"] = f"Bearer {bearer_token}"
    r.headers["User-Agent"] = "v2FullArchiveSearchPython"
    return r


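def connect_to_endpoint(url, params, next_token=None):
    # Copy the params so the caller's dict is not modified between calls.
    params = dict(params)
    if next_token is not None:
        params['next_token'] = next_token
    else:
        # Drop any placeholder token so the first page is requested cleanly.
        params.pop('next_token', None)
    response = requests.request("GET", url, auth=bearer_oauth, params=params)
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()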

import time  # used to pause between requests

# end_time is exclusive, so each window ends at the start of the following month;
# ending on the last day at 00:00:00 would skip that day's Tweets.
start_time_list = ['2013-01-01T00:00:00Z', '2013-02-01T00:00:00Z', '2013-03-01T00:00:00Z', '2013-04-01T00:00:00Z', '2013-05-01T00:00:00Z', '2013-06-01T00:00:00Z']
end_time_list = ['2013-02-01T00:00:00Z', '2013-03-01T00:00:00Z', '2013-04-01T00:00:00Z', '2013-05-01T00:00:00Z', '2013-06-01T00:00:00Z', '2013-07-01T00:00:00Z']
month13 = ['1-2013', '2-2013', '3-2013', '4-2013', '5-2013', '6-2013', '7-2013', '8-2013', '9-2013', '10-2013', '11-2013', '12-2013']
final_results_json = {}

# zip() stops at the shortest list, so only months that have a window are queried.
for month, start_time, end_time in zip(month13, start_time_list, end_time_list):
    query_params = {'query': '("bitcoin")(-is:retweet OR -is:reply OR -is:quote)(-has:mentions -has:cashtags -has:media -has:links -has:videos -has:images -has:hashtags) -is:nullcast -ethereum -dogecoin -cardano -solana -terra -avalanche -tether place_country:US lang:en',
                    'start_time': start_time,
                    'end_time': end_time,
                    'max_results': 10,
                    'tweet.fields': 'text,created_at,public_metrics',
                    'user.fields': 'verified,location'}

    # Make the API call to get the data for this month.
    json_response = connect_to_endpoint(search_url, query_params)
    final_results_json[month] = json_response
    # Pause between calls; the 429 shown below means requests were sent too quickly.
    # One second is an assumed delay; check the rate limits for your access level.
    time.sleep(1)
print(final_results_json)


200
429


Exception: (429, '{"title":"Too Many Requests","detail":"Too Many Requests","type":"about:blank","status":429}')
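
The 429 response above means the API rejected the request because of rate limiting. One possible way to cope, sketched here under assumptions, is to retry with an increasing delay. `call_with_retry` is a hypothetical helper; `send` stands for any zero-argument callable that returns a `(status_code, payload)` pair, and the delay values are illustrative rather than taken from Twitter's documentation.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 429, backing off between tries."""
    for attempt in range(max_attempts):
        status, payload = send()
        if status != 429:
            return status, payload
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")


# Simulated responses: two rate-limit rejections, then success.
responses = iter([(429, None), (429, None), (200, {"data": []})])
delays = []  # record the requested delays instead of actually sleeping
status, payload = call_with_retry(lambda: next(responses), sleep=delays.append)
print(status, delays)
# 200 [1.0, 2.0]
```

To use this with the notebook's code, `connect_to_endpoint` would need a variant that returns the status code instead of raising on every non-200 response.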

In [None]:
start_time_list = ['2013-01-01T00:00:00Z', '2013-02-01T00:00:00Z', '2013-03-01T00:00:00Z', '2013-04-01T00:00:00Z']
# end_time is exclusive, so each window ends at the start of the following month.
end_time_list = ['2013-02-01T00:00:00Z', '2013-03-01T00:00:00Z', '2013-04-01T00:00:00Z', '2013-05-01T00:00:00Z']
month13 = ['1-2013', '2-2013', '3-2013', '4-2013']

max_results = 500


                

cannot read file ./search_tweets_creds_example.yaml
Error parsing YAML file; searching for valid environment variables
Account type is not specified and cannot be inferred.
        Please check your credential file, arguments, or environment variables
        for issues. The account type must be 'premium' or 'enterprise'.
        


KeyError: 