# Clean raw data
**Objective**: Clean tweets extracted directly from the [Twitter Streaming APIs](https://dev.twitter.com/streaming/overview).

## Roadmap
1. Check basic statistics of the raw data
2. Perform necessary cleanup on the raw data
3. Output a list of the cleaned tweets' IDs (used as the input file for querying these tweets again)

### Check basic statistics of the raw data
#### How many raw tweets?

In [13]:
import pymongo
import mongodb
import codecs
import json
import os

In [2]:
DB_NAME = 'tweets_ek' # database for tweets collected on expanded keywords
COLLECTION_NAME = 'c1' # collection for raw data

raw_data = mongodb.initialize(db_name=DB_NAME, collection_name=COLLECTION_NAME)
tweets_num = raw_data.count()
print('Total tweets: {}'.format(tweets_num))

MongoDB on localhost:27017 connected successfully!
Total tweets: 5448346


#### How many unique users?
First, build indexes on the user.id, user.id\_str, and user.screen\_name fields.

In [3]:
%%time
if 0 == 1:  # disabled: the indexes already exist (see the listing in the next cell)
    from pymongo import IndexModel, ASCENDING, DESCENDING
    id_index = IndexModel([("user.id", ASCENDING)])
    id_str_index = IndexModel([("user.id_str", ASCENDING)])
    screen_name_index = IndexModel([("user.screen_name", ASCENDING)])
    raw_data.create_indexes([id_index, id_str_index, screen_name_index])

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 11 µs


In [4]:
raw_data.index_information() # list existing indexes

{'_id_': {'key': [('_id', 1)], 'ns': 'tweets_ek.c1', 'v': 2},
 'user.id_1': {'key': [('user.id', 1)], 'ns': 'tweets_ek.c1', 'v': 2},
 'user.id_str_1': {'key': [('user.id_str', 1)], 'ns': 'tweets_ek.c1', 'v': 2},
 'user.screen_name_1': {'key': [('user.screen_name', 1)],
  'ns': 'tweets_ek.c1',
  'v': 2}}

Then, look up the number of unique users.

In [5]:
%%time
# Time-consuming
# raw_data.distinct(key='user.id')

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 10.3 µs
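
The `distinct` call above is commented out because it is slow, and it also sends every distinct value back to the client in one response. One possible alternative, sketched here as an assumption and not something this notebook ran, is to let the server count the groups with an aggregation pipeline. The helper name `count_unique_users` and the output field `unique_users` are hypothetical; `$group`, `$count` (MongoDB 3.4 or later), and `allowDiskUse` are standard aggregation features.

```python
def count_unique_users(collection):
    """Count distinct user.id values with a server-side aggregation.

    `collection` is assumed to be a pymongo Collection (for example the
    `raw_data` object used in this notebook).
    """
    pipeline = [
        {'$group': {'_id': '$user.id'}},  # one group per distinct user id
        {'$count': 'unique_users'},       # count the groups on the server
    ]
    # allowDiskUse lets a large $group stage spill to disk instead of failing
    result = list(collection.aggregate(pipeline, allowDiskUse=True))
    # $count emits no document at all when there is nothing to count
    return result[0]['unique_users'] if result else 0
```

Calling `count_unique_users(raw_data)` would return a single integer, so only one small document travels over the wire.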


#### How many native/retweet tweets?

In [6]:
if 0 == 1:
    native_tweets_num = raw_data.count(filter={'retweeted_status': {'$exists': False}})
    print('native tweets: {}, retweets: {}'.format(native_tweets_num, (tweets_num - native_tweets_num)))

native tweets: 2749254, retweets: 2699092
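
The two counts sum to the total reported earlier (2749254 + 2699092 = 5448346), which follows from the rule used above: a tweet is a retweet exactly when it carries a `retweeted_status` field. As a small illustration of that rule on plain dictionaries (the function name and sample tweets are hypothetical, not part of the original notebook):

```python
def count_native_and_retweets(tweets):
    """Return (native, retweets) for an iterable of tweet dicts.

    Mirrors the MongoDB filter {'retweeted_status': {'$exists': False}}:
    a tweet without the field is native, a tweet with it is a retweet.
    """
    native = retweets = 0
    for tweet in tweets:
        if 'retweeted_status' in tweet:
            retweets += 1
        else:
            native += 1
    return native, retweets


# Made-up sample data, for illustration only
sample = [
    {'id': 1, 'text': 'original'},
    {'id': 2, 'text': 'RT ...', 'retweeted_status': {'id': 1}},
    {'id': 3, 'text': 'another original'},
]
native, retweets = count_native_and_retweets(sample)
```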


### Perform necessary cleanup on the raw data

#### Remove tweets with no user field (probably due to a server error)

In [7]:
%%time
if 0 == 1:
    result = raw_data.delete_many(filter={'user': {'$exists': False}})
    print('Successfully deleted {} tweets with no user field'.format(result.deleted_count))

Successfully deleted 0 tweets with no user field
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 6min 17s


### Output a list of the cleaned tweets' IDs

In [14]:
if 0 == 1:
    cursor = raw_data.find(projection={'_id': 0,
                                       'id': 1})

    with codecs.open(os.path.join('inter', 'tweets_ids.json'), 'w', 'utf-8') as f:
        for obj in cursor:
            f.write(json.dumps(obj) + '\n')
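
The loop above writes one JSON object per line, each of the form `{"id": ...}`. A minimal sketch of reading such a file back in fixed-size batches follows; the function name `read_id_batches` is hypothetical, and the default batch size of 100 is an assumption that should be checked against the limit of whichever lookup endpoint is used.

```python
import json


def read_id_batches(lines, batch_size=100):
    """Yield lists of tweet ids from an iterable of JSON lines.

    Each non-empty line is expected to look like {"id": 123}.
    batch_size=100 is an assumption, not a value taken from this notebook.
    """
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line)['id'])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch
```

Because the function accepts any iterable of lines, an open file handle on `inter/tweets_ids.json` can be passed directly, and the file is never loaded into memory all at once.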

## Next steps
Re-query the tweets listed in the tweets_ids.json file through the [Twitter REST APIs](https://dev.twitter.com/rest/public) to get the updated **retweet_count** field.
The implementation lives in a separate repository, collector3.