# Clean raw data
Author: Daheng Wang  
Last modified: 2017-05-24

# Roadmap
1. Import raw data into MongoDB database
2. Clean up the raw data
3. Build indexes
4. Separate native/retweet tweets
5. Make basic pickles of tweet/user ids lists
6. Check basic statistics
7. Update 'retweet_count' field of native tweets

# Steps

In [1]:
"""
Initialization
"""
import json
import os
import pickle

import pymongo

import mongodb  # local helper module exposing initialize() and initialize_db()

from config import *

NB_NAME = '20170414-clean_raw_data'

## Import raw data into MongoDB database
Import all raw data json files into MongoDB database.  
Ref: https://docs.mongodb.com/manual/reference/program/mongoimport/  
_**NOTE**_: check config for the database and collection names
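
The two import cells below differ only in the input file, so the command line could also be assembled from the config values. The following is a minimal sketch, not part of the original notebook: `build_mongoimport_cmd` is a hypothetical helper, and the literal database and collection names stand in for `DB_NAME` and `TW_RAW_COL` from config. It only uses the `mongoimport` flags that already appear in the cells below.

```python
import shlex


def build_mongoimport_cmd(db, collection, json_path, workers=11):
    """Return the mongoimport argument list (hypothetical helper)."""
    return [
        'mongoimport',
        '--db', db,
        '--collection', collection,
        '--type', 'json',
        '--mode', 'insert',
        '--numInsertionWorkers', str(workers),
        '--file', json_path,
    ]


# Literal names are used here for illustration; in the notebook they would
# come from config (DB_NAME, TW_RAW_COL).
cmd = build_mongoimport_cmd('tweets_ek-2', 'tw_raw',
                            '/home/dwang8/Documents/data/tweets2-first.json')

# Print a shell-safe version of the command for inspection.
print(' '.join(shlex.quote(arg) for arg in cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the import; printing it first makes it easy to confirm the target database and collection before loading tens of gigabytes.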

In [1]:
'''
(Part1): /home/dwang8/Documents/data/tweets2-first.json
'''
! mongoimport \
    --db 'tweets_ek-2' \
    --collection 'tw_raw' \
    --type json \
    --mode insert \
    --numInsertionWorkers 11 \
    --file '/home/dwang8/Documents/data/tweets2-first.json'

2017-05-24T11:31:49.911-0400	connected to: localhost
2017-05-24T11:31:52.910-0400	[........................] tweets_ek-2.tw_raw	39.5MB/32.7GB (0.1%)
2017-05-24T11:31:55.910-0400	[........................] tweets_ek-2.tw_raw	69.2MB/32.7GB (0.2%)
2017-05-24T11:31:58.909-0400	[........................] tweets_ek-2.tw_raw	106MB/32.7GB (0.3%)
2017-05-24T11:32:01.909-0400	[........................] tweets_ek-2.tw_raw	149MB/32.7GB (0.4%)
2017-05-24T11:32:04.909-0400	[........................] tweets_ek-2.tw_raw	188MB/32.7GB (0.6%)
2017-05-24T11:32:07.910-0400	[........................] tweets_ek-2.tw_raw	229MB/32.7GB (0.7%)
2017-05-24T11:32:10.910-0400	[........................] tweets_ek-2.tw_raw	270MB/32.7GB (0.8%)
2017-05-24T11:32:13.910-0400	[........................] tweets_ek-2.tw_raw	312MB/32.7GB (0.9%)
2017-05-24T11:32:16.909-0400	[........................] tweets_ek-2.tw_raw	352MB/32.7GB (1.1%)
2017-05-24T11:32:19.910-0400	[........................] tweets_ek-2.tw_raw	392MB/32.7GB (1

2017-05-24T11:36:10.909-0400	[##......................] tweets_ek-2.tw_raw	3.30GB/32.7GB (10.1%)
2017-05-24T11:36:13.909-0400	[##......................] tweets_ek-2.tw_raw	3.34GB/32.7GB (10.2%)
2017-05-24T11:36:16.910-0400	[##......................] tweets_ek-2.tw_raw	3.38GB/32.7GB (10.3%)
2017-05-24T11:36:19.909-0400	[##......................] tweets_ek-2.tw_raw	3.42GB/32.7GB (10.5%)
2017-05-24T11:36:22.909-0400	[##......................] tweets_ek-2.tw_raw	3.46GB/32.7GB (10.6%)
2017-05-24T11:36:25.909-0400	[##......................] tweets_ek-2.tw_raw	3.50GB/32.7GB (10.7%)
2017-05-24T11:36:28.909-0400	[##......................] tweets_ek-2.tw_raw	3.54GB/32.7GB (10.8%)
2017-05-24T11:36:31.909-0400	[##......................] tweets_ek-2.tw_raw	3.58GB/32.7GB (11.0%)
2017-05-24T11:36:34.909-0400	[##......................] tweets_ek-2.tw_raw	3.62GB/32.7GB (11.1%)
2017-05-24T11:36:37.909-0400	[##......................] tweets_ek-2.tw_raw	3.66GB/32.7GB (11.2%)

2017-05-24T11:40:25.909-0400	[####....................] tweets_ek-2.tw_raw	6.55GB/32.7GB (20.0%)
2017-05-24T11:40:28.910-0400	[####....................] tweets_ek-2.tw_raw	6.59GB/32.7GB (20.2%)
2017-05-24T11:40:31.909-0400	[####....................] tweets_ek-2.tw_raw	6.63GB/32.7GB (20.3%)
2017-05-24T11:40:34.909-0400	[####....................] tweets_ek-2.tw_raw	6.67GB/32.7GB (20.4%)
2017-05-24T11:40:37.909-0400	[####....................] tweets_ek-2.tw_raw	6.71GB/32.7GB (20.5%)
2017-05-24T11:40:40.909-0400	[####....................] tweets_ek-2.tw_raw	6.75GB/32.7GB (20.6%)
2017-05-24T11:40:43.909-0400	[####....................] tweets_ek-2.tw_raw	6.79GB/32.7GB (20.8%)
2017-05-24T11:40:46.909-0400	[#####...................] tweets_ek-2.tw_raw	6.83GB/32.7GB (20.9%)
2017-05-24T11:40:49.911-0400	[#####...................] tweets_ek-2.tw_raw	6.87GB/32.7GB (21.0%)
2017-05-24T11:40:52.909-0400	[#####...................] tweets_ek-2.tw_raw	6.90GB/32.7GB (21.1%)

2017-05-24T11:44:40.909-0400	[#######.................] tweets_ek-2.tw_raw	9.80GB/32.7GB (30.0%)
2017-05-24T11:44:43.909-0400	[#######.................] tweets_ek-2.tw_raw	9.84GB/32.7GB (30.1%)
2017-05-24T11:44:46.909-0400	[#######.................] tweets_ek-2.tw_raw	9.88GB/32.7GB (30.2%)
2017-05-24T11:44:49.909-0400	[#######.................] tweets_ek-2.tw_raw	9.92GB/32.7GB (30.3%)
2017-05-24T11:44:52.909-0400	[#######.................] tweets_ek-2.tw_raw	9.96GB/32.7GB (30.5%)
2017-05-24T11:44:55.909-0400	[#######.................] tweets_ek-2.tw_raw	9.99GB/32.7GB (30.6%)
2017-05-24T11:44:58.910-0400	[#######.................] tweets_ek-2.tw_raw	10.0GB/32.7GB (30.7%)
2017-05-24T11:45:01.909-0400	[#######.................] tweets_ek-2.tw_raw	10.1GB/32.7GB (30.8%)
2017-05-24T11:45:04.909-0400	[#######.................] tweets_ek-2.tw_raw	10.1GB/32.7GB (30.9%)
2017-05-24T11:45:07.909-0400	[#######.................] tweets_ek-2.tw_raw	10.2GB/32.7GB (31.0%)

2017-05-24T11:48:55.909-0400	[#########...............] tweets_ek-2.tw_raw	13.0GB/32.7GB (39.7%)
2017-05-24T11:48:58.909-0400	[#########...............] tweets_ek-2.tw_raw	13.0GB/32.7GB (39.8%)
2017-05-24T11:49:01.909-0400	[#########...............] tweets_ek-2.tw_raw	13.1GB/32.7GB (39.9%)
2017-05-24T11:49:04.909-0400	[#########...............] tweets_ek-2.tw_raw	13.1GB/32.7GB (40.0%)
2017-05-24T11:49:07.909-0400	[#########...............] tweets_ek-2.tw_raw	13.1GB/32.7GB (40.2%)
2017-05-24T11:49:10.921-0400	[#########...............] tweets_ek-2.tw_raw	13.2GB/32.7GB (40.3%)
2017-05-24T11:49:13.909-0400	[#########...............] tweets_ek-2.tw_raw	13.2GB/32.7GB (40.4%)
2017-05-24T11:49:16.909-0400	[#########...............] tweets_ek-2.tw_raw	13.2GB/32.7GB (40.5%)
2017-05-24T11:49:19.909-0400	[#########...............] tweets_ek-2.tw_raw	13.3GB/32.7GB (40.6%)
2017-05-24T11:49:22.909-0400	[#########...............] tweets_ek-2.tw_raw	13.3GB/32.7GB (40.7%)

2017-05-24T11:53:10.909-0400	[###########.............] tweets_ek-2.tw_raw	16.2GB/32.7GB (49.5%)
2017-05-24T11:53:13.910-0400	[###########.............] tweets_ek-2.tw_raw	16.2GB/32.7GB (49.6%)
2017-05-24T11:53:16.909-0400	[###########.............] tweets_ek-2.tw_raw	16.3GB/32.7GB (49.7%)
2017-05-24T11:53:19.909-0400	[###########.............] tweets_ek-2.tw_raw	16.3GB/32.7GB (49.9%)
2017-05-24T11:53:22.909-0400	[###########.............] tweets_ek-2.tw_raw	16.3GB/32.7GB (50.0%)
2017-05-24T11:53:25.909-0400	[############............] tweets_ek-2.tw_raw	16.4GB/32.7GB (50.1%)
2017-05-24T11:53:28.910-0400	[############............] tweets_ek-2.tw_raw	16.4GB/32.7GB (50.2%)
2017-05-24T11:53:31.909-0400	[############............] tweets_ek-2.tw_raw	16.5GB/32.7GB (50.3%)
2017-05-24T11:53:34.909-0400	[############............] tweets_ek-2.tw_raw	16.5GB/32.7GB (50.5%)
2017-05-24T11:53:37.909-0400	[############............] tweets_ek-2.tw_raw	16.5GB/32.7GB (50.6%)

2017-05-24T11:57:25.909-0400	[##############..........] tweets_ek-2.tw_raw	19.4GB/32.7GB (59.4%)
2017-05-24T11:57:28.910-0400	[##############..........] tweets_ek-2.tw_raw	19.5GB/32.7GB (59.5%)
2017-05-24T11:57:31.909-0400	[##############..........] tweets_ek-2.tw_raw	19.5GB/32.7GB (59.7%)
2017-05-24T11:57:34.909-0400	[##############..........] tweets_ek-2.tw_raw	19.5GB/32.7GB (59.8%)
2017-05-24T11:57:37.909-0400	[##############..........] tweets_ek-2.tw_raw	19.6GB/32.7GB (59.9%)
2017-05-24T11:57:40.910-0400	[##############..........] tweets_ek-2.tw_raw	19.6GB/32.7GB (60.0%)
2017-05-24T11:57:43.909-0400	[##############..........] tweets_ek-2.tw_raw	19.7GB/32.7GB (60.1%)
2017-05-24T11:57:46.910-0400	[##############..........] tweets_ek-2.tw_raw	19.7GB/32.7GB (60.3%)
2017-05-24T11:57:49.909-0400	[##############..........] tweets_ek-2.tw_raw	19.7GB/32.7GB (60.4%)
2017-05-24T11:57:52.909-0400	[##############..........] tweets_ek-2.tw_raw	19.8GB/32.7GB (60.5%)

2017-05-24T12:01:40.909-0400	[################........] tweets_ek-2.tw_raw	22.7GB/32.7GB (69.4%)
2017-05-24T12:01:43.909-0400	[################........] tweets_ek-2.tw_raw	22.7GB/32.7GB (69.5%)
2017-05-24T12:01:46.909-0400	[################........] tweets_ek-2.tw_raw	22.8GB/32.7GB (69.6%)
2017-05-24T12:01:49.920-0400	[################........] tweets_ek-2.tw_raw	22.8GB/32.7GB (69.7%)
2017-05-24T12:01:52.909-0400	[################........] tweets_ek-2.tw_raw	22.8GB/32.7GB (69.8%)
2017-05-24T12:01:55.909-0400	[################........] tweets_ek-2.tw_raw	22.9GB/32.7GB (70.0%)
2017-05-24T12:01:58.909-0400	[################........] tweets_ek-2.tw_raw	22.9GB/32.7GB (70.1%)
2017-05-24T12:02:01.909-0400	[################........] tweets_ek-2.tw_raw	22.9GB/32.7GB (70.2%)
2017-05-24T12:02:04.909-0400	[################........] tweets_ek-2.tw_raw	23.0GB/32.7GB (70.3%)
2017-05-24T12:02:07.909-0400	[################........] tweets_ek-2.tw_raw	23.0GB/32.7GB (70.5%)

2017-05-24T12:05:55.910-0400	[###################.....] tweets_ek-2.tw_raw	25.9GB/32.7GB (79.2%)
2017-05-24T12:05:58.909-0400	[###################.....] tweets_ek-2.tw_raw	25.9GB/32.7GB (79.3%)
2017-05-24T12:06:01.909-0400	[###################.....] tweets_ek-2.tw_raw	26.0GB/32.7GB (79.4%)
2017-05-24T12:06:04.909-0400	[###################.....] tweets_ek-2.tw_raw	26.0GB/32.7GB (79.5%)
2017-05-24T12:06:07.909-0400	[###################.....] tweets_ek-2.tw_raw	26.0GB/32.7GB (79.6%)
2017-05-24T12:06:10.910-0400	[###################.....] tweets_ek-2.tw_raw	26.1GB/32.7GB (79.7%)
2017-05-24T12:06:13.909-0400	[###################.....] tweets_ek-2.tw_raw	26.1GB/32.7GB (79.9%)
2017-05-24T12:06:16.909-0400	[###################.....] tweets_ek-2.tw_raw	26.1GB/32.7GB (80.0%)
2017-05-24T12:06:19.910-0400	[###################.....] tweets_ek-2.tw_raw	26.2GB/32.7GB (80.1%)
2017-05-24T12:06:22.909-0400	[###################.....] tweets_ek-2.tw_raw	26.2GB/32.7GB (80.2%)

2017-05-24T12:10:10.910-0400	[#####################...] tweets_ek-2.tw_raw	29.0GB/32.7GB (88.9%)
2017-05-24T12:10:13.910-0400	[#####################...] tweets_ek-2.tw_raw	29.1GB/32.7GB (89.0%)
2017-05-24T12:10:16.909-0400	[#####################...] tweets_ek-2.tw_raw	29.1GB/32.7GB (89.1%)
2017-05-24T12:10:19.910-0400	[#####################...] tweets_ek-2.tw_raw	29.2GB/32.7GB (89.2%)
2017-05-24T12:10:22.909-0400	[#####################...] tweets_ek-2.tw_raw	29.2GB/32.7GB (89.3%)
2017-05-24T12:10:25.909-0400	[#####################...] tweets_ek-2.tw_raw	29.2GB/32.7GB (89.5%)
2017-05-24T12:10:28.909-0400	[#####################...] tweets_ek-2.tw_raw	29.3GB/32.7GB (89.6%)
2017-05-24T12:10:31.909-0400	[#####################...] tweets_ek-2.tw_raw	29.3GB/32.7GB (89.7%)
2017-05-24T12:10:34.909-0400	[#####################...] tweets_ek-2.tw_raw	29.4GB/32.7GB (89.8%)
2017-05-24T12:10:37.909-0400	[#####################...] tweets_ek-2.tw_raw	29.4GB/32.7GB (89.9%)

2017-05-24T12:14:25.909-0400	[#######################.] tweets_ek-2.tw_raw	32.2GB/32.7GB (98.5%)
2017-05-24T12:14:28.909-0400	[#######################.] tweets_ek-2.tw_raw	32.2GB/32.7GB (98.6%)
2017-05-24T12:14:31.909-0400	[#######################.] tweets_ek-2.tw_raw	32.3GB/32.7GB (98.7%)
2017-05-24T12:14:34.909-0400	[#######################.] tweets_ek-2.tw_raw	32.3GB/32.7GB (98.9%)
2017-05-24T12:14:37.909-0400	[#######################.] tweets_ek-2.tw_raw	32.4GB/32.7GB (99.0%)
2017-05-24T12:14:40.909-0400	[#######################.] tweets_ek-2.tw_raw	32.4GB/32.7GB (99.1%)
2017-05-24T12:14:43.909-0400	[#######################.] tweets_ek-2.tw_raw	32.4GB/32.7GB (99.2%)
2017-05-24T12:14:46.909-0400	[#######################.] tweets_ek-2.tw_raw	32.5GB/32.7GB (99.3%)
2017-05-24T12:14:49.909-0400	[#######################.] tweets_ek-2.tw_raw	32.5GB/32.7GB (99.4%)
2017-05-24T12:14:52.909-0400	[#######################.] tweets_ek-2.tw_raw	32.5GB/32.7GB (99.6%)

In [2]:
'''
(Part2): /home/dwang8/Documents/data/tweets2-second.json
'''
! mongoimport \
    --db 'tweets_ek-2' \
    --collection 'tw_raw' \
    --type json \
    --mode insert \
    --numInsertionWorkers 11 \
    --file '/home/dwang8/Documents/data/tweets2-second.json'

2017-05-24T12:35:53.896-0400	connected to: localhost
2017-05-24T12:35:56.895-0400	[........................] tweets_ek-2.tw_raw	40.1MB/25.7GB (0.2%)
2017-05-24T12:35:59.895-0400	[........................] tweets_ek-2.tw_raw	77.4MB/25.7GB (0.3%)
2017-05-24T12:36:02.895-0400	[........................] tweets_ek-2.tw_raw	117MB/25.7GB (0.4%)
2017-05-24T12:36:05.895-0400	[........................] tweets_ek-2.tw_raw	161MB/25.7GB (0.6%)
2017-05-24T12:36:08.895-0400	[........................] tweets_ek-2.tw_raw	200MB/25.7GB (0.8%)
2017-05-24T12:36:11.895-0400	[........................] tweets_ek-2.tw_raw	232MB/25.7GB (0.9%)
2017-05-24T12:36:14.895-0400	[........................] tweets_ek-2.tw_raw	274MB/25.7GB (1.0%)
2017-05-24T12:36:17.895-0400	[........................] tweets_ek-2.tw_raw	313MB/25.7GB (1.2%)
2017-05-24T12:36:20.895-0400	[........................] tweets_ek-2.tw_raw	353MB/25.7GB (1.3%)
2017-05-24T12:36:23.895-0400	[........................] tweets_ek-2.tw_raw	396MB/25.7GB (1

2017-05-24T12:40:11.895-0400	[###.....................] tweets_ek-2.tw_raw	3.25GB/25.7GB (12.7%)
2017-05-24T12:40:14.895-0400	[###.....................] tweets_ek-2.tw_raw	3.29GB/25.7GB (12.8%)
2017-05-24T12:40:17.895-0400	[###.....................] tweets_ek-2.tw_raw	3.33GB/25.7GB (13.0%)
2017-05-24T12:40:20.895-0400	[###.....................] tweets_ek-2.tw_raw	3.37GB/25.7GB (13.1%)
2017-05-24T12:40:23.895-0400	[###.....................] tweets_ek-2.tw_raw	3.41GB/25.7GB (13.3%)
2017-05-24T12:40:26.895-0400	[###.....................] tweets_ek-2.tw_raw	3.43GB/25.7GB (13.4%)
2017-05-24T12:40:29.895-0400	[###.....................] tweets_ek-2.tw_raw	3.46GB/25.7GB (13.5%)
2017-05-24T12:40:32.895-0400	[###.....................] tweets_ek-2.tw_raw	3.51GB/25.7GB (13.7%)
2017-05-24T12:40:35.895-0400	[###.....................] tweets_ek-2.tw_raw	3.55GB/25.7GB (13.8%)
2017-05-24T12:40:38.895-0400	[###.....................] tweets_ek-2.tw_raw	3.58GB/25.7GB (14.0%)

2017-05-24T12:44:26.895-0400	[######..................] tweets_ek-2.tw_raw	6.45GB/25.7GB (25.1%)
2017-05-24T12:44:29.895-0400	[######..................] tweets_ek-2.tw_raw	6.49GB/25.7GB (25.3%)
2017-05-24T12:44:32.895-0400	[######..................] tweets_ek-2.tw_raw	6.53GB/25.7GB (25.4%)
2017-05-24T12:44:35.896-0400	[######..................] tweets_ek-2.tw_raw	6.55GB/25.7GB (25.5%)
2017-05-24T12:44:38.895-0400	[######..................] tweets_ek-2.tw_raw	6.59GB/25.7GB (25.7%)
2017-05-24T12:44:41.895-0400	[######..................] tweets_ek-2.tw_raw	6.62GB/25.7GB (25.8%)
2017-05-24T12:44:44.895-0400	[######..................] tweets_ek-2.tw_raw	6.66GB/25.7GB (25.9%)
2017-05-24T12:44:47.895-0400	[######..................] tweets_ek-2.tw_raw	6.70GB/25.7GB (26.1%)
2017-05-24T12:44:50.895-0400	[######..................] tweets_ek-2.tw_raw	6.74GB/25.7GB (26.2%)
2017-05-24T12:44:53.895-0400	[######..................] tweets_ek-2.tw_raw	6.77GB/25.7GB (26.4%)

2017-05-24T12:48:41.895-0400	[########................] tweets_ek-2.tw_raw	9.62GB/25.7GB (37.4%)
2017-05-24T12:48:44.895-0400	[#########...............] tweets_ek-2.tw_raw	9.65GB/25.7GB (37.6%)
2017-05-24T12:48:47.895-0400	[#########...............] tweets_ek-2.tw_raw	9.70GB/25.7GB (37.8%)
2017-05-24T12:48:50.895-0400	[#########...............] tweets_ek-2.tw_raw	9.73GB/25.7GB (37.9%)
2017-05-24T12:48:53.895-0400	[#########...............] tweets_ek-2.tw_raw	9.75GB/25.7GB (38.0%)
2017-05-24T12:48:56.895-0400	[#########...............] tweets_ek-2.tw_raw	9.78GB/25.7GB (38.1%)
2017-05-24T12:48:59.897-0400	[#########...............] tweets_ek-2.tw_raw	9.82GB/25.7GB (38.2%)
2017-05-24T12:49:02.895-0400	[#########...............] tweets_ek-2.tw_raw	9.86GB/25.7GB (38.4%)
2017-05-24T12:49:05.895-0400	[#########...............] tweets_ek-2.tw_raw	9.90GB/25.7GB (38.5%)
2017-05-24T12:49:08.895-0400	[#########...............] tweets_ek-2.tw_raw	9.93GB/25.7GB (38.7%)

2017-05-24T12:52:56.895-0400	[###########.............] tweets_ek-2.tw_raw	12.8GB/25.7GB (49.6%)
2017-05-24T12:52:59.895-0400	[###########.............] tweets_ek-2.tw_raw	12.8GB/25.7GB (49.8%)
2017-05-24T12:53:02.895-0400	[###########.............] tweets_ek-2.tw_raw	12.8GB/25.7GB (49.9%)
2017-05-24T12:53:05.895-0400	[############............] tweets_ek-2.tw_raw	12.9GB/25.7GB (50.1%)
2017-05-24T12:53:08.895-0400	[############............] tweets_ek-2.tw_raw	12.9GB/25.7GB (50.2%)
2017-05-24T12:53:11.895-0400	[############............] tweets_ek-2.tw_raw	12.9GB/25.7GB (50.3%)
2017-05-24T12:53:14.895-0400	[############............] tweets_ek-2.tw_raw	13.0GB/25.7GB (50.5%)
2017-05-24T12:53:17.895-0400	[############............] tweets_ek-2.tw_raw	13.0GB/25.7GB (50.6%)
2017-05-24T12:53:20.895-0400	[############............] tweets_ek-2.tw_raw	13.0GB/25.7GB (50.8%)
2017-05-24T12:53:23.895-0400	[############............] tweets_ek-2.tw_raw	13.1GB/25.7GB (50.9%)

2017-05-24T12:57:11.895-0400	[##############..........] tweets_ek-2.tw_raw	15.9GB/25.7GB (62.0%)
2017-05-24T12:57:14.895-0400	[##############..........] tweets_ek-2.tw_raw	16.0GB/25.7GB (62.2%)
2017-05-24T12:57:17.895-0400	[##############..........] tweets_ek-2.tw_raw	16.0GB/25.7GB (62.3%)
2017-05-24T12:57:20.895-0400	[##############..........] tweets_ek-2.tw_raw	16.0GB/25.7GB (62.4%)
2017-05-24T12:57:23.895-0400	[##############..........] tweets_ek-2.tw_raw	16.1GB/25.7GB (62.5%)
2017-05-24T12:57:26.895-0400	[###############.........] tweets_ek-2.tw_raw	16.1GB/25.7GB (62.6%)
2017-05-24T12:57:29.895-0400	[###############.........] tweets_ek-2.tw_raw	16.1GB/25.7GB (62.8%)
2017-05-24T12:57:32.895-0400	[###############.........] tweets_ek-2.tw_raw	16.2GB/25.7GB (63.0%)
2017-05-24T12:57:35.895-0400	[###############.........] tweets_ek-2.tw_raw	16.2GB/25.7GB (63.1%)
2017-05-24T12:57:38.895-0400	[###############.........] tweets_ek-2.tw_raw	16.2GB/25.7GB (63.3%)

2017-05-24T13:01:26.895-0400	[#################.......] tweets_ek-2.tw_raw	19.1GB/25.7GB (74.5%)
2017-05-24T13:01:29.895-0400	[#################.......] tweets_ek-2.tw_raw	19.2GB/25.7GB (74.6%)
2017-05-24T13:01:32.895-0400	[#################.......] tweets_ek-2.tw_raw	19.2GB/25.7GB (74.8%)
2017-05-24T13:01:35.895-0400	[#################.......] tweets_ek-2.tw_raw	19.2GB/25.7GB (74.8%)
2017-05-24T13:01:38.895-0400	[#################.......] tweets_ek-2.tw_raw	19.2GB/25.7GB (75.0%)
2017-05-24T13:01:41.895-0400	[##################......] tweets_ek-2.tw_raw	19.3GB/25.7GB (75.1%)
2017-05-24T13:01:44.903-0400	[##################......] tweets_ek-2.tw_raw	19.3GB/25.7GB (75.3%)
2017-05-24T13:01:47.895-0400	[##################......] tweets_ek-2.tw_raw	19.4GB/25.7GB (75.4%)
2017-05-24T13:01:50.895-0400	[##################......] tweets_ek-2.tw_raw	19.4GB/25.7GB (75.6%)
2017-05-24T13:01:53.895-0400	[##################......] tweets_ek-2.tw_raw	19.4GB/25.7GB (75.7%)

2017-05-24T13:05:41.896-0400	[####################....] tweets_ek-2.tw_raw	22.3GB/25.7GB (86.9%)
2017-05-24T13:05:44.895-0400	[####################....] tweets_ek-2.tw_raw	22.4GB/25.7GB (87.1%)
2017-05-24T13:05:47.913-0400	[####################....] tweets_ek-2.tw_raw	22.4GB/25.7GB (87.2%)
2017-05-24T13:05:50.895-0400	[####################....] tweets_ek-2.tw_raw	22.4GB/25.7GB (87.3%)
2017-05-24T13:05:53.895-0400	[####################....] tweets_ek-2.tw_raw	22.4GB/25.7GB (87.4%)
2017-05-24T13:05:56.895-0400	[#####################...] tweets_ek-2.tw_raw	22.5GB/25.7GB (87.6%)
2017-05-24T13:05:59.895-0400	[#####################...] tweets_ek-2.tw_raw	22.5GB/25.7GB (87.7%)
2017-05-24T13:06:02.895-0400	[#####################...] tweets_ek-2.tw_raw	22.6GB/25.7GB (87.9%)
2017-05-24T13:06:05.895-0400	[#####################...] tweets_ek-2.tw_raw	22.6GB/25.7GB (88.0%)
2017-05-24T13:06:08.895-0400	[#####################...] tweets_ek-2.tw_raw	22.6GB/25.7GB (88.2%)

2017-05-24T13:09:56.895-0400	[#######################.] tweets_ek-2.tw_raw	25.5GB/25.7GB (99.3%)
2017-05-24T13:09:59.895-0400	[#######################.] tweets_ek-2.tw_raw	25.5GB/25.7GB (99.4%)
2017-05-24T13:10:02.895-0400	[#######################.] tweets_ek-2.tw_raw	25.6GB/25.7GB (99.5%)
2017-05-24T13:10:05.895-0400	[#######################.] tweets_ek-2.tw_raw	25.6GB/25.7GB (99.6%)
2017-05-24T13:10:08.895-0400	[#######################.] tweets_ek-2.tw_raw	25.6GB/25.7GB (99.7%)
2017-05-24T13:10:11.895-0400	[#######################.] tweets_ek-2.tw_raw	25.6GB/25.7GB (99.9%)
2017-05-24T13:10:14.512-0400	[########################] tweets_ek-2.tw_raw	25.7GB/25.7GB (100.0%)
2017-05-24T13:10:14.512-0400	imported 5127592 documents


## Clean up the raw data

In [2]:
"""
Remove server-side messages (documents with no 'id' field)
"""
if 0 == 1:
    tw_raw_col = mongodb.initialize(DB_NAME, TW_RAW_COL)
    result = tw_raw_col.delete_many(filter={'id': {'$exists': False}})
    print('Successfully deleted {} tweets with no "id" field'.format(result.deleted_count))

MongoDB on localhost:27017/tweets_ek-2.tw_raw connected successfully!
Successfully deleted 0 tweets with no "id" field
CPU times: user 220 ms, sys: 88 ms, total: 308 ms
Wall time: 14min 41s


In [5]:
"""
Remove tweets with no 'user' field (probably due to server error)
"""
if 0 == 1:
    tw_raw_col = mongodb.initialize(DB_NAME, TW_RAW_COL)
    result = tw_raw_col.delete_many(filter={'user': {'$exists': False}})
    print('Successfully deleted {} tweets with no "user" field'.format(result.deleted_count))

MongoDB on localhost:27017/tweets_ek-2.tw_raw connected successfully!
Successfully deleted 0 tweets with no "user" field


## Build indexes

In [4]:
'''
Basic indexes
'''

from pymongo import IndexModel, ASCENDING, DESCENDING

# IndexModel instances for tweets
id_index = IndexModel([('id', ASCENDING)], background=True)
id_str_index = IndexModel([('id_str', ASCENDING)], background=True)

# IndexModel instances for users
user_id_index = IndexModel([('user.id', ASCENDING)], background=True)
user_id_str_index = IndexModel([('user.id_str', ASCENDING)], background=True)
user_screen_name_index = IndexModel([('user.screen_name', ASCENDING)], background=True)

indexes_list = [id_index, id_str_index, user_id_index, user_id_str_index, user_screen_name_index]

In [7]:
if 0 == 1:
    tw_raw_col = mongodb.initialize(DB_NAME, TW_RAW_COL)
    tw_raw_col.create_indexes(indexes=indexes_list)

MongoDB on localhost:27017/tweets_ek-2.tw_raw connected successfully!


## Separate native/retweet tweets

### Create two new collections

In [2]:
"""
Output native tweets into a new collection

Register TW_NT_COL = 'tw_nt' in config first
"""
if 0 == 1:
    tw_raw_col = mongodb.initialize(DB_NAME, TW_RAW_COL)
    
    match_dict = {'$match': {'retweeted_status': {'$exists': False}}}

    out_dict = {'$out': TW_NT_COL}

    ppl_lst = [match_dict, out_dict]

    tw_raw_col.aggregate(pipeline=ppl_lst)

MongoDB on localhost:27017/tweets_ek-2.tw_raw connected successfully!


In [3]:
"""
Output retweets into a new collection

Register TW_RT_COL = 'tw_rt' in config first
"""
if 0 == 1:
    tw_raw_col = mongodb.initialize(DB_NAME, TW_RAW_COL)
    
    match_dict = {'$match': {'retweeted_status': {'$exists': True}}}

    out_dict = {'$out': TW_RT_COL}

    ppl_lst = [match_dict, out_dict]

    tw_raw_col.aggregate(pipeline=ppl_lst)

MongoDB on localhost:27017/tweets_ek-2.tw_raw connected successfully!
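
The two `$match` + `$out` pipelines above partition `tw_raw` on the presence of `retweeted_status`. The sketch below reproduces that rule on a few hand-written documents; the helper name and the sample data are assumptions for illustration, not part of the notebook.

```python
def split_native_and_retweets(docs):
    """Partition documents on the presence of 'retweeted_status'."""
    native, retweets = [], []
    for doc in docs:
        if 'retweeted_status' in doc:
            retweets.append(doc)
        else:
            native.append(doc)
    return native, retweets


docs = [
    {'id': 1, 'text': 'original'},
    {'id': 2, 'text': 'a retweet of 1', 'retweeted_status': {'id': 1}},
    {'id': 3, 'text': 'another original'},
]

native, retweets = split_native_and_retweets(docs)

# Every document lands in exactly one of the two groups.
assert len(native) + len(retweets) == len(docs)
```

Because the two filters are exact complements, the document counts of the two output collections should add up to the count of the raw collection.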


### Build basic indexes on two new collections

In [6]:
"""
Basic indexes on native tweets collection
"""
if 0 == 1:
    tw_nt_col = mongodb.initialize(DB_NAME, TW_NT_COL)
    tw_nt_col.create_indexes(indexes=indexes_list)

MongoDB on localhost:27017/tweets_ek-2.tw_nt connected successfully!


In [7]:
"""
Basic indexes on retweets collection
"""
if 0 == 1:
    tw_rt_col = mongodb.initialize(DB_NAME, TW_RT_COL)
    tw_rt_col.create_indexes(indexes=indexes_list)

MongoDB on localhost:27017/tweets_ek-2.tw_rt connected successfully!
CPU times: user 484 ms, sys: 164 ms, total: 648 ms
Wall time: 18min 41s


### Build extra indexes on retweets collection

In [8]:
'''
Extra indexes on fields inside 'retweeted_status'
'''

from pymongo import IndexModel, ASCENDING, DESCENDING

# IndexModel instances for retweeted_status
rted_id_index = IndexModel([('retweeted_status.id', ASCENDING)], background=True)
rted_id_str_index = IndexModel([('retweeted_status.id_str', ASCENDING)], background=True)

# IndexModel instances for retweeted_status users
rted_user_id_index = IndexModel([('retweeted_status.user.id', ASCENDING)], background=True)
rted_user_id_str_index = IndexModel([('retweeted_status.user.id_str', ASCENDING)], background=True)
rted_user_screen_name_index = IndexModel([('retweeted_status.user.screen_name', ASCENDING)], background=True)

extra_indexes_lst = [rted_id_index, rted_id_str_index, rted_user_id_index, rted_user_id_str_index, rted_user_screen_name_index]

In [10]:
"""
Extra indexes on retweets collection
"""
if 0 == 1:
    tw_rt_col = mongodb.initialize(DB_NAME, TW_RT_COL)
    tw_rt_col.create_indexes(indexes=extra_indexes_lst)

MongoDB on localhost:27017/tweets_ek-2.tw_rt connected successfully!
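
The extra indexes cover the same five field paths as the basic ones, prefixed with `retweeted_status.`. A compact way to see that relationship is to generate both sets of key specifications from one list. This is a sketch under assumptions: `index_specs` is a hypothetical helper, and `ASCENDING` is redefined locally (it has the same value, 1, as `pymongo.ASCENDING`) so the snippet runs without a database driver.

```python
ASCENDING = 1  # same value as pymongo.ASCENDING

BASIC_FIELDS = ['id', 'id_str', 'user.id', 'user.id_str', 'user.screen_name']


def index_specs(fields, prefix=''):
    """Return one single-field [(path, direction)] key spec per field."""
    return [[(prefix + field, ASCENDING)] for field in fields]


basic_specs = index_specs(BASIC_FIELDS)
extra_specs = index_specs(BASIC_FIELDS, prefix='retweeted_status.')

for spec in extra_specs:
    print(spec)
```

Each key spec has the shape that `IndexModel` accepts as its first argument, so the lists could be turned into `IndexModel(spec, background=True)` instances before calling `create_indexes`.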


## Make basic pickles of tweet/user ids lists

In [2]:
db = mongodb.initialize_db(DB_NAME)

MongoDB on localhost:27017/tweets_ek-2 connected successfully!


In [3]:
"""
Tweet ids list pickle

Register in config:
    TW_RAW_IDS_LST_PKL = os.path.join(DATA_DIR, 'tw_raw_ids.lst.pkl')
    TW_NT_IDS_LST_PKL = os.path.join(DATA_DIR, 'tw_nt_ids.lst.pkl')
    TW_RT_IDS_LST_PKL = os.path.join(DATA_DIR, 'tw_rt_ids.lst.pkl')
"""
if 0 == 1:
    # Make TW_RAW_IDS_LST_PKL
    print('Making {}...'.format(TW_RAW_IDS_LST_PKL))
    tw_raw_ids_lst = []

    cursor = db[TW_RAW_COL].find(projection={'_id': 0, 'id': 1})

    for doc in cursor:
        tw_raw_id = int(doc['id'])
        tw_raw_ids_lst.append(tw_raw_id)

    with open(TW_RAW_IDS_LST_PKL, 'wb') as f:
        pickle.dump(tw_raw_ids_lst, f)
    print('List length: {}'.format(len(tw_raw_ids_lst)))

    # Make TW_NT_IDS_LST_PKL
    print('Making {}...'.format(TW_NT_IDS_LST_PKL))
    tw_nt_ids_lst = []

    cursor = db[TW_NT_COL].find(projection={'_id': 0, 'id': 1})

    for doc in cursor:
        tw_nt_id = int(doc['id'])
        tw_nt_ids_lst.append(tw_nt_id)

    with open(TW_NT_IDS_LST_PKL, 'wb') as f:
        pickle.dump(tw_nt_ids_lst, f)
    print('List length: {}'.format(len(tw_nt_ids_lst)))

    # Make TW_RT_IDS_LST_PKL
    print('Making {}...'.format(TW_RT_IDS_LST_PKL))
    tw_rt_ids_lst = []

    cursor = db[TW_RT_COL].find(projection={'_id': 0, 'id': 1})

    for doc in cursor:
        tw_rt_id = int(doc['id'])
        tw_rt_ids_lst.append(tw_rt_id)

    with open(TW_RT_IDS_LST_PKL, 'wb') as f:
        pickle.dump(tw_rt_ids_lst, f)
    print('List length: {}'.format(len(tw_rt_ids_lst)))

Making ./data/tw_raw_ids.lst.pkl...
List length: 11635450
Making ./data/tw_nt_ids.lst.pkl...
List length: 5812824
Making ./data/tw_rt_ids.lst.pkl...
List length: 5822626
CPU times: user 4min 33s, sys: 4.07 s, total: 4min 37s
Wall time: 32min 44s
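
The pickles above are plain Python lists of integer tweet ids, written in binary mode and read back with `pickle.load`. The following self-contained sketch shows that round trip on a temporary file; `dump_ids`, `load_ids`, the file name, and the sample ids are all hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def dump_ids(ids, path):
    """Write an iterable of ids to `path` as a pickled list."""
    with open(path, 'wb') as f:
        pickle.dump(list(ids), f)


def load_ids(path):
    """Read a pickled list of ids back from `path`."""
    with open(path, 'rb') as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'tw_demo_ids.lst.pkl')
    # Made-up ids; the repeated value shows that a list keeps duplicates.
    ids = [864000000000000001, 864000000000000002, 864000000000000001]
    dump_ids(ids, pkl_path)
    restored = load_ids(pkl_path)

print('Round trip preserved the list: {}'.format(restored == ids))
```

Keeping the tweet ids as a list rather than a set preserves duplicates, which is what later allows the raw count and the unique count to be compared.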


In [5]:
%%time
"""
User ids list pickle

Register in config:
    USER_RAW_IDS_LST_PKL = os.path.join(DATA_DIR, 'user_raw_ids.lst.pkl')
    USER_NT_IDS_LST_PKL = os.path.join(DATA_DIR, 'user_nt_ids.lst.pkl')
    USER_RT_IDS_LST_PKL = os.path.join(DATA_DIR, 'user_rt_ids.lst.pkl')
"""
if 0 == 1:
    # Make USER_RAW_IDS_LST_PKL
    print('Making {}...'.format(USER_RAW_IDS_LST_PKL))
    user_raw_ids_set = set()

    cursor = db[TW_RAW_COL].find(projection={'_id': 0, 'user.id': 1})

    for doc in cursor:
        user_raw_id = int(doc['user']['id'])
        user_raw_ids_set.add(user_raw_id)

    with open(USER_RAW_IDS_LST_PKL, 'wb') as f:
        pickle.dump(list(user_raw_ids_set), f)
    print('List length: {}'.format(len(user_raw_ids_set)))

    # Make USER_NT_IDS_LST_PKL
    print('Making {}...'.format(USER_NT_IDS_LST_PKL))
    user_nt_ids_set = set()

    cursor = db[TW_NT_COL].find(projection={'_id': 0, 'user.id': 1})

    for doc in cursor:
        user_nt_id = int(doc['user']['id'])
        user_nt_ids_set.add(user_nt_id)

    with open(USER_NT_IDS_LST_PKL, 'wb') as f:
        pickle.dump(list(user_nt_ids_set), f)
    print('List length: {}'.format(len(user_nt_ids_set)))

    # Make USER_RT_IDS_LST_PKL
    print('Making {}...'.format(USER_RT_IDS_LST_PKL))
    user_rt_ids_set = set()

    cursor = db[TW_RT_COL].find(projection={'_id': 0, 'user.id': 1})

    for doc in cursor:
        user_rt_id = int(doc['user']['id'])
        user_rt_ids_set.add(user_rt_id)

    with open(USER_RT_IDS_LST_PKL, 'wb') as f:
        pickle.dump(list(user_rt_ids_set), f)
    print('List length: {}'.format(len(user_rt_ids_set)))

Making ./data/user_raw_ids.lst.pkl...
List length: 1469738
Making ./data/user_nt_ids.lst.pkl...
List length: 609799
Making ./data/user_rt_ids.lst.pkl...
List length: 1036781
CPU times: user 7min 51s, sys: 3.6 s, total: 7min 54s
Wall time: 40min 42s
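
The tweet-id and user-id cells repeat the same pattern six times: iterate over projected documents, read one (possibly nested) field, and collect it into a list or a set. One possible way to factor that out is sketched below. `get_path` and `collect_ids` are hypothetical helpers, and the in-memory `docs` list stands in for a cursor.

```python
def get_path(doc, dotted_path):
    """Follow a dotted path such as 'user.id' through nested dicts."""
    value = doc
    for key in dotted_path.split('.'):
        value = value[key]
    return value


def collect_ids(docs, dotted_path, unique=False):
    """Collect integer ids found at `dotted_path`; deduplicate if `unique`."""
    ids = [int(get_path(doc, dotted_path)) for doc in docs]
    return list(set(ids)) if unique else ids


# Stand-in for the documents a projected find() would yield.
docs = [
    {'id': 1, 'user': {'id': 10}},
    {'id': 2, 'user': {'id': 10}},
    {'id': 3, 'user': {'id': 20}},
]

tweet_ids = collect_ids(docs, 'id')                   # keeps duplicates
user_ids = collect_ids(docs, 'user.id', unique=True)  # one entry per user
print(len(tweet_ids), len(user_ids))
```

With such a helper, each pickle would need only the collection, the field path, and whether duplicates should be kept.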


## Check basic statistics

### Tweets

In [15]:
"""
Number of total tweets

Query from database
"""
db = mongodb.initialize_db(DB_NAME)

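# NOTE: Collection.count() was removed in PyMongo 4.0; on newer versions use
# count_documents({}) or estimated_document_count() instead.
total_tw_num = db[TW_RAW_COL].count()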
print('Number of total tweets: {}'.format(total_tw_num))

Number of total tweets: 11635450


In [16]:
"""
Number of native/retweet tweets

Query from database
"""
tw_nt_num = db[TW_NT_COL].count()
print('Number of native tweets: {}'.format(tw_nt_num))

tw_rt_num = db[TW_RT_COL].count()
print('Number of retweets: {}'.format(tw_rt_num))

Number of native tweets: 5812824
Number of retweets: 5822626
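
Since `tw_nt` and `tw_rt` were produced by complementary filters on `tw_raw`, the two counts should sum to the total. A quick arithmetic check using the numbers printed above (the variable names simply mirror the cells; nothing is queried here):

```python
# Counts copied from the cell outputs above.
total_tw_num = 11635450
tw_nt_num = 5812824
tw_rt_num = 5822626

# The native/retweet split must account for every raw document.
assert tw_nt_num + tw_rt_num == total_tw_num

rt_share = tw_rt_num / total_tw_num
print('Retweet share: {:.1%}'.format(rt_share))
```

The check passes, and retweets make up roughly half of the raw collection.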


In [2]:
"""
Number of total tweets

Load tweet ids list pickle
"""

tw_raw_ids_lst = []
with open(TW_RAW_IDS_LST_PKL, 'rb') as f:
    tw_raw_ids_lst = pickle.load(f)
    
print('Number of raw tweets: {}'.format(len(tw_raw_ids_lst)))

tw_raw_unique_ids_set = set(tw_raw_ids_lst)
print('Number of unique raw tweets: {}'.format(len(tw_raw_unique_ids_set)))

Number of raw tweets: 11635450
Number of unique raw tweets: 11617643
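
The gap between the two figures, 11635450 - 11617643 = 17807, is the number of surplus rows, that is, occurrences beyond the first for ids that appear more than once. It is not necessarily the number of distinct duplicated ids. The sketch below shows the distinction on a toy list; `duplicated_ids` is a hypothetical helper and the sample ids are made up.

```python
from collections import Counter


def duplicated_ids(ids):
    """Map each id that occurs more than once to its number of occurrences."""
    return {tw_id: n for tw_id, n in Counter(ids).items() if n > 1}


sample_ids = [101, 102, 102, 103, 103, 103]
dups = duplicated_ids(sample_ids)

# Surplus rows: total rows minus distinct ids.
surplus = len(sample_ids) - len(set(sample_ids))
print('{} duplicated ids, {} surplus rows'.format(len(dups), surplus))

# The same subtraction applied to the figures reported above.
reported_surplus = 11635450 - 11617643
print('Surplus rows in tw_raw: {}'.format(reported_surplus))
```

Applying `duplicated_ids` to the loaded `tw_raw_ids_lst` would list which tweet ids were imported more than once.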


### Users

In [3]:
"""
Number of total unique users
"""
total_unique_user_ids_lst = []
with open(USER_RAW_IDS_LST_PKL, 'rb') as f:
    total_unique_user_ids_lst = pickle.load(f)
    
print('Number of total users: {}'.format(len(total_unique_user_ids_lst)))

total_unique_user_ids_set = set(total_unique_user_ids_lst)
print('Number of total unique users: {}'.format(len(total_unique_user_ids_set)))

Number of total users: 1469738
Number of total unique users: 1469738


In [4]:
"""
Number of unique users of native tweets
"""
nt_unique_user_ids_lst = []
with open(USER_NT_IDS_LST_PKL, 'rb') as f:
    nt_unique_user_ids_lst = pickle.load(f)

print('Number of total users of native tweets: {}'.format(len(nt_unique_user_ids_lst)))

nt_unique_user_ids_set = set(nt_unique_user_ids_lst)
print('Number of unique users of native tweets: {}'.format(len(nt_unique_user_ids_set)))

Number of total users of native tweets: 609799
Number of unique users of native tweets: 609799


In [5]:
"""
Number of unique users of retweets
"""
rt_unique_user_ids_lst = []
with open(USER_RT_IDS_LST_PKL, 'rb') as f:
    rt_unique_user_ids_lst = pickle.load(f)
    
print('Number of total users of retweets: {}'.format(len(rt_unique_user_ids_lst)))

rt_unique_user_ids_set = set(rt_unique_user_ids_lst)
print('Number of unique users of retweets: {}'.format(len(rt_unique_user_ids_set)))

Number of total users of retweets: 1036781
Number of unique users of retweets: 1036781


In [6]:
"""
Number of retweets only users
"""
rt_only_user_ids_set = set(rt_unique_user_ids_lst).difference(set(nt_unique_user_ids_lst))
print('Number of retweets only users: {}'.format(len(rt_only_user_ids_set)))

Number of retweets only users: 859939
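
The user counts can be cross-checked with inclusion-exclusion. The sketch assumes that the raw user set is exactly the union of the native-tweet and retweet user sets, which should hold because `tw_raw` was partitioned into `tw_nt` and `tw_rt`. All figures are copied from the outputs above; the derived names are illustrative.

```python
# Figures copied from the cell outputs above.
total_users = 1469738
nt_users = 609799
rt_users = 1036781
rt_only_users = 859939

# |NT and RT| = |NT| + |RT| - |NT or RT|
both_users = nt_users + rt_users - total_users

# Two independent ways to arrive at the retweet-only count.
assert rt_users - both_users == rt_only_users
assert total_users - nt_users == rt_only_users

nt_only_users = nt_users - both_users
print('Users with both native tweets and retweets: {}'.format(both_users))
print('Users with native tweets only: {}'.format(nt_only_users))
```

Under that assumption, the reported retweet-only count is consistent with the other three figures.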


## Update 'retweet_count' field of native tweets

### Update tweets_ek-2:tw_nt collection

In [6]:
%%time
if 1 == 1:
    db = mongodb.initialize_db(DB_NAME)
    
    '''
    Extract the id list of original tweets of retweets
    '''
    print('Extracting the id list of original tweets of retweets...')
    rt_origin_tw_ids_lst = []
    
    cursor = db[TW_RT_COL].find(projection={'_id': 0, 'retweeted_status.id': 1})
    
    for doc in cursor:
        rt_origin_tw_id = int(doc['retweeted_status']['id'])
        rt_origin_tw_ids_lst.append(rt_origin_tw_id)
    
    print('List length: {}'.format(len(rt_origin_tw_ids_lst)))
    
    '''
    Update the "retweet_count" field of native tweets
    '''
    print('Updating "retweet_count" field of native tweets...')
    
    for rt_origin_tw_id in rt_origin_tw_ids_lst:
        db[TW_NT_COL].update_one(filter={'id': rt_origin_tw_id},
                                 update={'$inc': {'retweet_count': 1}})
    
    print('Done')

MongoDB on localhost:27017/tweets_ek-2 connected successfully!
Extracting the id list of original tweets of retweets...
List length: 5822626
Updating "retweet_count" field of native tweets...
Done
CPU times: user 34min 46s, sys: 3min 16s, total: 38min 3s
Wall time: 1h 35min 56s
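
The cell above issues one `update_one` call per retweet, about 5.8 million round trips. Since `$inc` is additive, the same end state could be reached by first counting how often each original tweet id occurs and then incrementing once per distinct id. The following is a sketch of that idea, not the method used in the notebook: `build_inc_operations` is a hypothetical helper that returns plain `(filter, update)` pairs so it runs without a database.

```python
from collections import Counter


def build_inc_operations(rt_origin_tw_ids):
    """Return one (filter, update) pair per distinct original tweet id.

    Each update increments 'retweet_count' by the number of times the id
    appears in the input, which equals the sum of the per-retweet +1 steps.
    """
    counts = Counter(rt_origin_tw_ids)
    return [({'id': tw_id}, {'$inc': {'retweet_count': n}})
            for tw_id, n in counts.items()]


# Toy input: tweet 7 was retweeted three times, tweet 9 once.
ops = build_inc_operations([7, 7, 9, 7])
for filter_doc, update_doc in ops:
    print(filter_doc, update_doc)
```

With PyMongo, each pair could be wrapped as `UpdateOne(filter_doc, update_doc)` and sent in batches through `Collection.bulk_write(requests, ordered=False)`. Either way, `$inc` is not idempotent, so re-running the update adds the counts a second time.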


# Notes