# Hydrating Tweet IDs

In most social media projects, it is desirable to have open access to content.

Twitter has rules governing how datasets can be shared publicly. See the [Content redistribution](https://developer.twitter.com/en/developer-terms/agreement-and-policy) section of the developer policy, or the excerpt quoted below.

```
The best place to get Twitter Content is directly from Twitter. Consequently, we restrict the redistribution of Twitter Content to third parties.  If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute Tweet IDs, Direct Message IDs, and/or User IDs (except as described below). We also grant special permissions to academic researchers sharing Tweet IDs and User IDs for non-commercial research purposes.

In total, you may not distribute more than 1,500,000 Tweet IDs to any entity (inclusive of multiple individuals associated with a single entity) within any 30 day period unless you have received written permission from Twitter. In addition, all developers may provide up to 50,000 public Tweets Objects and/or User Objects to each person who uses your service on a daily basis if this is done via non-automated means (e.g., download of spreadsheets or PDFs).

Academic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs if they are doing so on behalf of an academic institution and for the sole purpose of non-commercial research. For example, you are permitted to share an unlimited number of Tweet IDs for the purpose of enabling peer review or validation of your research. If you have questions about whether your use case qualifies under this category please submit a request via the API Policy Support form.
```

It is common practice to share files containing a list of Tweet IDs. I'm sharing one such file, `data/tweet-id-collection.txt`, for use in this tutorial.
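
As a minimal sketch of how such a file could be read, assuming it holds one numeric Tweet ID per line, the helper below skips blank or malformed lines and drops repeated IDs. The name `read_tweet_ids` and the sample values are hypothetical and not part of this tutorial's code.

```python
def read_tweet_ids(lines):
    """Return unique, well-formed Tweet IDs in first-seen order.

    `lines` is any iterable of strings, for example an open text file.
    IDs are kept as strings so they can be compared with the `id_str`
    field of a Tweet object.
    """
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        # Skip blank lines, anything that is not purely digits, and repeats.
        if not tweet_id.isdigit() or tweet_id in seen:
            continue
        seen.add(tweet_id)
        ids.append(tweet_id)
    return ids


# Made-up sample lines standing in for the contents of an ID file.
sample = ["1271234567890123776\n", "\n", "1271234567890123776\n", "not-an-id\n", "42\n"]
print(read_tweet_ids(sample))
```

With a real file, the same helper could be called as `read_tweet_ids(open('../data/tweet-id-collection.txt'))`, ideally inside a `with` block so the file is closed afterwards.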

There are a couple of ways to obtain the content for these Tweets, a process often called hydration. Some applications, such as [Hydrator](https://github.com/DocNow/hydrator), provide an easy-to-use interface.

We can also download tweets using Twitter's API.

In [8]:
import json
import gzip

import tweepy

# Fill in config.py with your Twitter application credentials and DO NOT SHARE it with others.
# I added config.py to .gitignore so that my changes are not tracked by Git.
from config import TWITTER_KEYS


In [6]:
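auth = tweepy.OAuthHandler(TWITTER_KEYS["consumer_key"], TWITTER_KEYS["consumer_secret"])
auth.set_access_token(TWITTER_KEYS["access_token"], TWITTER_KEYS["access_token_secret"])
# wait_on_rate_limit=True makes Tweepy sleep until the rate-limit window resets instead of raising an error.
# Note: these arguments match Tweepy 3.x; wait_on_rate_limit_notify is not accepted by Tweepy 4 and later.
api = tweepy.API(auth, retry_count=5, retry_delay=15, compression=True,
                 wait_on_rate_limit=True, wait_on_rate_limit_notify=True)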

#api.verify_credentials()
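
The `config.py` module imported above is not shown in this tutorial. A minimal sketch of what it could contain follows, assuming only the four dictionary keys used in the authentication cell. The environment variable names and the placeholder strings are assumptions made for illustration; replace them with your own credentials.

```python
# config.py (sketch): keep this file out of version control.
import os

# Each value is taken from an environment variable when it is set.
# The variable names and placeholder defaults below are hypothetical.
TWITTER_KEYS = {
    "consumer_key": os.environ.get("TWITTER_CONSUMER_KEY", "YOUR-CONSUMER-KEY"),
    "consumer_secret": os.environ.get("TWITTER_CONSUMER_SECRET", "YOUR-CONSUMER-SECRET"),
    "access_token": os.environ.get("TWITTER_ACCESS_TOKEN", "YOUR-ACCESS-TOKEN"),
    "access_token_secret": os.environ.get("TWITTER_ACCESS_TOKEN_SECRET", "YOUR-ACCESS-TOKEN-SECRET"),
}
```

Reading the values from the environment is one way to avoid writing secrets into a file at all; a plain dictionary of strings works just as well for the code in this tutorial.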

## Collect Tweet objects from the Tweet ID file

In [14]:
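# Read the Tweet IDs to look up, one ID per line.
tweetIds = list()
with open('../data/tweet-id-collection.txt', 'r') as fl:
    for line in fl:
        tweetIds.append(line.strip())
print('{} tweet-ids in the file'.format(len(tweetIds)))

# Read the IDs of tweets stored by earlier runs so they are not requested again.
alreadyCollected = set()
try:
    with gzip.open('../data/tweets_dehydrated.jsons.gz', 'rb') as fl:
        for line in fl:
            tweet = json.loads(line)
            alreadyCollected.add(tweet['id_str'])
except FileNotFoundError:
    pass  # first run: no tweets collected yet
print('{} tweets already collected'.format(len(alreadyCollected)))

# What remains is either not collected yet or no longer available (e.g. deleted or made private).
tweetIds = list(set(tweetIds) - alreadyCollected)
print('{} tweet-ids to collect or not available'.format(len(tweetIds)))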

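'''
## UNCOMMENT IF YOU WANT TO CONTINUE DATA COLLECTION
requested, collected = 0, 0
# Open in append mode ('ab') so that previously collected tweets are kept;
# 'wb' would overwrite the file that alreadyCollected was read from.
with gzip.open('../data/tweets_dehydrated.jsons.gz', 'ab') as fl:
    # The lookup endpoint accepts at most 100 IDs per request
    # (statuses_lookup is named lookup_statuses in Tweepy 4 and later).
    for idx in range(0, len(tweetIds), 100):
        chunk = tweetIds[idx:idx+100]
        for tweet in api.statuses_lookup(chunk, include_entities=True):
            fl.write('{}\n'.format(json.dumps(tweet._json)).encode('utf-8'))
            collected += 1
        requested += len(chunk)

        print('{} requested, {} collected so far'.format(requested, collected))
'''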


280257 tweet-ids in the file
243533 tweets already collected
36724 tweet-ids to collect or not available
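
The collection loop above sends IDs in batches of 100 and writes one JSON object per line to a gzip file. The sketch below uses stand-in records instead of API calls to illustrate two details worth checking before resuming a run: how the batches are formed, and that a gzip file opened in append mode (`'ab'`) keeps earlier records readable, whereas `'wb'` starts the file over. The helper names (`chunked`, `append_records`, `collected_ids`) and the temporary file are assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile


def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def append_records(path, records):
    """Append one JSON document per line; existing content is preserved."""
    with gzip.open(path, "ab") as fl:
        for record in records:
            fl.write((json.dumps(record) + "\n").encode("utf-8"))


def collected_ids(path):
    """Return the set of `id_str` values already stored in the file."""
    if not os.path.exists(path):
        return set()
    with gzip.open(path, "rb") as fl:
        return {json.loads(line)["id_str"] for line in fl}


path = os.path.join(tempfile.mkdtemp(), "tweets.jsons.gz")
ids = [str(n) for n in range(250)]  # made-up IDs
batch_sizes = [len(chunk) for chunk in chunked(ids)]

# Two separate "runs": the second one appends to what the first one wrote.
append_records(path, [{"id_str": i} for i in ids[:100]])
append_records(path, [{"id_str": i} for i in ids[100:150]])

remaining = sorted(set(ids) - collected_ids(path), key=int)
print(batch_sizes, len(collected_ids(path)), len(remaining))
```

Each call to `append_records` adds a new gzip member to the end of the file, and Python's `gzip` module reads all members back as one continuous stream, which is why the resume logic sees the records from both runs.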


# Downloading data from a public server

I also uploaded this dataset to Zenodo.

You can use [this link](http://doi.org/10.5281/zenodo.3900655) to access the file.

`
Onur Varol. (2020). SICSS - Tutorial Dataset File (Version v1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3900655
`
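
Once the file is downloaded, it could be read back one Tweet at a time. The sketch below assumes the download uses the same gzip-compressed, one-JSON-object-per-line layout as `tweets_dehydrated.jsons.gz`; that is an assumption, so check the Zenodo record for the actual file name and format. The helper name `iter_tweets` and the small stand-in file are hypothetical.

```python
import gzip
import json
import os
import tempfile


def iter_tweets(path):
    """Yield one Tweet dictionary per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fl:
        for line in fl:
            line = line.strip()
            if line:
                yield json.loads(line)


# Small stand-in file so the sketch runs without the real download.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsons.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fl:
    fl.write(json.dumps({"id_str": "1", "text": "hello"}) + "\n")
    fl.write("\n")
    fl.write(json.dumps({"id_str": "2", "text": "world"}) + "\n")

texts = [tweet["text"] for tweet in iter_tweets(demo_path)]
print(texts)
```

Because `iter_tweets` is a generator, it never holds the whole dataset in memory, which matters for a file with hundreds of thousands of Tweets.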