# Step 2: Tweet Collection

Now that the accounts to examine have been assembled, it is time to collect their tweets for the two-year period from April 1, 2016 to April 1, 2018 (the date bounds set in the collection section of this notebook).
Since the expected total size in JSON format exceeds 2 GB, it is helpful to collect the tweets in chunks.
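
Chunking here simply means splitting the list of user IDs into consecutive slices and collecting each slice separately. A minimal sketch of that idea follows; `chunk_bounds` is a hypothetical helper and not part of the `tep` package, and the chunk size of 200 and the 1,061 user IDs are taken from the cells in this notebook.

```python
def chunk_bounds(n_items, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


# 1,061 user IDs split into slices of at most 200
bounds = list(chunk_bounds(1061, 200))
print(bounds)
# [(0, 200), (200, 400), (400, 600), (600, 800), (800, 1000), (1000, 1061)]
```

Each pair can be used as `user_ids[start:stop]`. The notebook itself uses five manual slices and folds the remainder into the last one (`user_ids[800:]`), which is equivalent for collection purposes.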

The following steps will be undertaken in this notebook:
1. Setup API connection
2. Load user IDs from file
3. Test tweet collection process
4. Collect and save tweets
5. Sync data files to S3

## Setup API connection

In [1]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

In [2]:
from tep.tweetCollector import TweetCollector
tc = TweetCollector()

{"created_at": "Thu May 01 12:37:22 +0000 2014", "description": "Student of Information Systems @TUDarmstadt , co-founder of a small web agency. Interested in Machine Learning", "favourites_count": 394, "followers_count": 58, "friends_count": 226, "id": 2472450259, "id_str": "2472450259", "lang": "en", "listed_count": 7, "location": "Darmstadt, Deutschland", "name": "Felix Peters", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_image_url": "http://pbs.twimg.com/profile_images/600953861629734913/7y_RkdW4_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/600953861629734913/7y_RkdW4_normal.jpg", "profile_link_color": "224F82", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "screen_name"

In [3]:
from tep.accountCollector import AccountCollector
ac = AccountCollector()

{"created_at": "Thu May 01 12:37:22 +0000 2014", "description": "Student of Information Systems @TUDarmstadt , co-founder of a small web agency. Interested in Machine Learning", "favourites_count": 394, "followers_count": 58, "friends_count": 226, "id": 2472450259, "id_str": "2472450259", "lang": "en", "listed_count": 7, "location": "Darmstadt, Deutschland", "name": "Felix Peters", "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_image_url": "http://pbs.twimg.com/profile_images/600953861629734913/7y_RkdW4_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/600953861629734913/7y_RkdW4_normal.jpg", "profile_link_color": "224F82", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "screen_name"

## Load user IDs from file

In [4]:
user_ids = ac.load_user_ids(fname="data/user_ids.txt")
user_ids[:10]

[12, 418, 586, 648, 989, 1081, 3980, 5699, 6204, 30973]

In [5]:
len(user_ids)

1061

## Test tweet collection process

In [6]:
# Set up the start and end date of the test period (April 2018)
from datetime import datetime
from tep.utils import UTC, save_as_json
start = datetime(2018, 4, 1, tzinfo=UTC())
end = datetime(2018, 5, 1, tzinfo=UTC())
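
`UTC` is provided by `tep.utils`, and its implementation is not shown in this notebook. As a rough sketch, assuming it behaves like a plain UTC `tzinfo`, the same test window can be built with the standard library's `timezone.utc`. The helper `in_window` is hypothetical: it only illustrates how a `created_at` string in the format seen in the output above could be checked against the window. Whether `get_tweets` treats `end` as inclusive or exclusive is not known from this notebook; the sketch assumes exclusive.

```python
from datetime import datetime, timezone

# Same test window as above, using the stdlib UTC tzinfo (assumed equivalent)
start = datetime(2018, 4, 1, tzinfo=timezone.utc)
end = datetime(2018, 5, 1, tzinfo=timezone.utc)


def in_window(created_at, start, end):
    """Return True if a Twitter-style timestamp falls in [start, end)."""
    ts = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return start <= ts < end


inside = in_window("Sun Apr 15 08:30:00 +0000 2018", start, end)
outside = in_window("Thu May 01 12:37:22 +0000 2014", start, end)
print(inside, outside)
```

Using timezone-aware datetimes on both sides matters: comparing an aware timestamp with a naive one raises a `TypeError` in Python.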

In [7]:
tweets = tc.get_tweets(user_ids=user_ids[:10], start=start, end=end, update_interval=5)

Starting data collection...
Total tweets after 0 steps: 0
Total tweets after 5 steps: 1,319
Terminating data collection with total of 3246 tweets


In [8]:
save_as_json(data=tweets, filename="data/tweet_test.json")

Everything seems to be working, so let's start the collection process...

## Collect and save tweets

In [9]:
# Full collection period: April 1, 2016 to April 1, 2018
start = datetime(2016, 4, 1, tzinfo=UTC())
end = datetime(2018, 4, 1, tzinfo=UTC())

### First chunk

In [10]:
tweets = tc.get_tweets(user_ids=user_ids[:200], start=start, end=end, update_interval=10)

Starting data collection...
Total tweets after 0 steps: 0
Total tweets after 10 steps: 21,867
Total tweets after 20 steps: 44,934
Total tweets after 30 steps: 65,732
Total tweets after 40 steps: 88,971
Total tweets after 50 steps: 102,383
Total tweets after 60 steps: 115,205
Total tweets after 70 steps: 135,308
Total tweets after 80 steps: 161,749
Total tweets after 90 steps: 178,241
Total tweets after 100 steps: 198,941
Total tweets after 110 steps: 214,739
Total tweets after 120 steps: 228,332
Total tweets after 130 steps: 241,661
Total tweets after 140 steps: 260,685
Total tweets after 150 steps: 282,792
Total tweets after 160 steps: 303,918
Total tweets after 170 steps: 320,964
Total tweets after 180 steps: 342,600
Total tweets after 190 steps: 356,153
Terminating data collection with total of 379051 tweets


In [11]:
# Keep only original tweets (drop retweets)
tweets = [t for t in tweets if t.retweeted_status is None]
total = len(tweets)
print(total)
save_as_json(data=tweets, filename="data/tweets_1.json")

284701
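
The filter relies on the Twitter REST API convention that a retweet carries the original tweet in its `retweeted_status` field, while an original tweet has none. The sketch below illustrates this with stand-in objects; the `SimpleNamespace` tweets are made up for the example and only mimic the attribute access used in the cell above, not the actual tweet class returned by `TweetCollector`.

```python
from types import SimpleNamespace

# Stand-in tweets: only the attributes needed for the filter are modelled
tweets = [
    SimpleNamespace(id=1, text="original", retweeted_status=None),
    SimpleNamespace(id=2, text="RT @x: hi", retweeted_status=SimpleNamespace(id=99)),
    SimpleNamespace(id=3, text="another original", retweeted_status=None),
]

# Same filter as in the notebook: keep tweets without a retweeted_status
originals = [t for t in tweets if t.retweeted_status is None]
kept_ids = [t.id for t in originals]
print(kept_ids)  # [1, 3]
```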


### Second chunk

In [12]:
tweets = tc.get_tweets(user_ids=user_ids[200:400], start=start, end=end, update_interval=10)

Starting data collection...
Total tweets after 0 steps: 0
Total tweets after 10 steps: 21,355
Total tweets after 20 steps: 40,100
Total tweets after 30 steps: 59,179
Total tweets after 40 steps: 77,571
Total tweets after 50 steps: 87,690
Total tweets after 60 steps: 102,661
Total tweets after 70 steps: 120,477
Total tweets after 80 steps: 137,921
Total tweets after 90 steps: 157,771
Total tweets after 100 steps: 180,997
Total tweets after 110 steps: 197,955
Total tweets after 120 steps: 217,922
Total tweets after 130 steps: 232,458
Total tweets after 140 steps: 251,065
Total tweets after 150 steps: 270,767
Total tweets after 160 steps: 282,777
Total tweets after 170 steps: 306,684
Total tweets after 180 steps: 318,658
Total tweets after 190 steps: 337,382
Terminating data collection with total of 357057 tweets


In [13]:
# Keep only original tweets (drop retweets)
tweets = [t for t in tweets if t.retweeted_status is None]
total += len(tweets)
print(total)
save_as_json(data=tweets, filename="data/tweets_2.json")

565486


### Third chunk

In [14]:
tweets = tc.get_tweets(user_ids=user_ids[400:600], start=start, end=end, update_interval=10)

Starting data collection...
Total tweets after 0 steps: 0
Total tweets after 10 steps: 13,427
Total tweets after 20 steps: 26,004
Total tweets after 30 steps: 41,426
Total tweets after 40 steps: 52,578
Total tweets after 50 steps: 69,819
Total tweets after 60 steps: 86,659
Total tweets after 70 steps: 104,218
Total tweets after 80 steps: 117,268
Total tweets after 90 steps: 125,569
Total tweets after 100 steps: 140,319
Total tweets after 110 steps: 152,975
Total tweets after 120 steps: 172,634
Total tweets after 130 steps: 185,480
Total tweets after 140 steps: 196,395
Total tweets after 150 steps: 215,756
Total tweets after 160 steps: 233,264
Total tweets after 170 steps: 250,523
Total tweets after 180 steps: 265,658
Total tweets after 190 steps: 279,886
Terminating data collection with total of 288532 tweets


In [15]:
# Keep only original tweets (drop retweets)
tweets = [t for t in tweets if t.retweeted_status is None]
total += len(tweets)
print(total)
save_as_json(data=tweets, filename="data/tweets_3.json")

781721


### Fourth chunk

In [16]:
tweets = tc.get_tweets(user_ids=user_ids[600:800], start=start, end=end, update_interval=10)

Starting data collection...
Total tweets after 0 steps: 0
Total tweets after 10 steps: 15,907
Total tweets after 20 steps: 28,505
Total tweets after 30 steps: 48,975
Total tweets after 40 steps: 70,384
Total tweets after 50 steps: 86,282
Total tweets after 60 steps: 101,434
Total tweets after 70 steps: 123,134
Total tweets after 80 steps: 142,955
Total tweets after 90 steps: 159,548
Total tweets after 100 steps: 168,219
Total tweets after 110 steps: 190,111
Total tweets after 120 steps: 210,322
Total tweets after 130 steps: 221,961
Total tweets after 140 steps: 234,095
Total tweets after 150 steps: 248,300
Total tweets after 160 steps: 263,502
Total tweets after 170 steps: 278,086
Total tweets after 180 steps: 296,363
Total tweets after 190 steps: 311,215
Terminating data collection with total of 325955 tweets


In [17]:
# Keep only original tweets (drop retweets)
tweets = [t for t in tweets if t.retweeted_status is None]
total += len(tweets)
print(total)
save_as_json(data=tweets, filename="data/tweets_4.json")

1031077


### Fifth chunk

In [18]:
tweets = tc.get_tweets(user_ids=user_ids[800:], start=start, end=end, update_interval=10)

Starting data collection...
Total tweets after 0 steps: 0
Total tweets after 10 steps: 13,203
Total tweets after 20 steps: 24,640
Total tweets after 30 steps: 39,746
Total tweets after 40 steps: 59,006
Total tweets after 50 steps: 70,112
Total tweets after 60 steps: 87,519
Total tweets after 70 steps: 97,802
Total tweets after 80 steps: 111,791
Total tweets after 90 steps: 124,724
Total tweets after 100 steps: 138,694
Total tweets after 110 steps: 156,747
Total tweets after 120 steps: 171,071
Total tweets after 130 steps: 184,987
Total tweets after 140 steps: 198,662
Total tweets after 150 steps: 208,381
Total tweets after 160 steps: 215,515
Total tweets after 170 steps: 232,338
Error occurred when collecting tweets for user 1056152593
Total tweets after 180 steps: 252,377
Total tweets after 190 steps: 263,766
Total tweets after 200 steps: 272,172
Total tweets after 210 steps: 284,723
Total tweets after 220 steps: 295,190
Total tweets after 230 steps: 302,807
Total tweets after 240 ste

In [19]:
# Keep only original tweets (drop retweets)
tweets = [t for t in tweets if t.retweeted_status is None]
total += len(tweets)
print(total)
save_as_json(data=tweets, filename="data/tweets_5.json")

1293005
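
Because `total` is a running sum, the number printed after each chunk is cumulative. The following sketch recovers the per-chunk counts of original tweets and, for the first four chunks, the share of collected tweets that survived the retweet filter. All numbers are copied from the outputs above; the raw count of the fifth chunk is left out because its log output is cut off.

```python
# Running totals printed after each chunk's retweet filter
running_totals = [284701, 565486, 781721, 1031077, 1293005]
# Raw tweet counts reported by the collector for chunks 1-4
raw_counts = [379051, 357057, 288532, 325955]

# Difference between consecutive running totals = tweets kept per chunk
per_chunk = [b - a for a, b in zip([0] + running_totals, running_totals)]
# Share of collected tweets kept after dropping retweets (chunks 1-4 only)
kept_share = [kept / raw for kept, raw in zip(per_chunk, raw_counts)]

print(per_chunk)  # [284701, 280785, 216235, 249356, 261928]
print([f"{share:.1%}" for share in kept_share])
```

In each of the first four chunks, roughly three quarters of the collected tweets are original tweets.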


## Sync data files to S3

In [20]:
!aws s3 ls

2018-05-30 09:05:02 tep-research-project


In [21]:
%pwd

'/Users/felix/code/ml/tweet-engagement-prediction/nbs'

In [25]:
!aws s3 sync data/ s3://tep-research-project

upload: data/tech_accounts.txt to s3://tep-research-project/tech_accounts.txt      
upload: data/journalist_accounts.txt to s3://tep-research-project/journalist_accounts.txt
upload: data/celebrity_accounts.txt to s3://tep-research-project/celebrity_accounts.txt
upload: data/user_ids.txt to s3://tep-research-project/user_ids.txt              
upload: data/tweets_2.json to s3://tep-research-project/tweets_2.json
upload: data/tweets_4.json to s3://tep-research-project/tweets_4.json
upload: data/tweets_3.json to s3://tep-research-project/tweets_3.json
upload: data/tweets_1.json to s3://tep-research-project/tweets_1.json
upload: data/tweets_5.json to s3://tep-research-project/tweets_5.json
