# Identify user's affiliation with IBM 
Author: Daheng Wang  
Last modified: 2017-05-04

# Roadmap
1. Identify affiliation based on the 'description' field of each user
2. Simple evaluations of description-field-based strategy
3. ~~Select out part of users to extract history tweets~~

# Steps

## Initialization

In [2]:
# Data analysis modules: pandas, matplotlib, numpy, etc.
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # render double resolution plot output for Retina screens 
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Standard modules, MongoDB modules
import os, sys, json, datetime, pickle
from pprint import pprint

import pymongo
from pymongo import IndexModel, ASCENDING, DESCENDING

# Custom tool modules
import mongodb  # module for setting up connection with (local) MongoDB database
import multiprocessing_workers  # module for splitting workloads between processes
import utilities  # module for various custom utility functions
from config import * # import all global configuration variables

## Identify affiliation based on the 'description' field on user

**Idea**: if the keyword 'ibm' appears in a user's 'description' field, we say the user is _directly affiliated_ with IBM.  
**Problems**:
 - not all users have filled in the 'description' field
 - the keyword alone cannot distinguish between different types of users (individuals, official accounts, media outlets...)
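
The matching helper `utilities.simple_test_keyword_in_text` is a custom function whose implementation is not shown in this notebook. As a minimal sketch, assuming it performs a case-insensitive substring test, it could look like the first function below. Both function names are hypothetical. The second function is a stricter variant that only accepts the keyword as a standalone token, which would avoid hits inside longer words.

```python
import re


def keyword_in_text(text, keyword):
    """Hypothetical stand-in for utilities.simple_test_keyword_in_text.

    Assumption: a case-insensitive substring test; empty or missing
    descriptions never match.
    """
    if not text:
        return False
    return keyword.lower() in text.lower()


def keyword_as_token(text, keyword):
    """Stricter variant: the keyword must not be embedded in a longer
    alphanumeric run (so '#IBM.' matches, 'gibmo' does not)."""
    if not text:
        return False
    pattern = r'(?<![A-Za-z0-9])' + re.escape(keyword) + r'(?![A-Za-z0-9])'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


description = 'Digital #Marketing & Public Sector Lead for #IBM.'
print(keyword_in_text(description, 'ibm'))
print(keyword_as_token(description, 'ibm'))
```

The substring version is the more permissive of the two; which one is closer to the helper actually used here is an assumption, not something the notebook confirms.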

In [2]:
"""
This section generates a new collection for affiliation identification based on the user's 'description' field.
Register USER_DESC_AFFL_COL = 'c2_user_desc_affl' in config if running for the first time.
"""
if 0 == 1:
    # the keyword stands for affiliation with IBM
    keyword = 'ibm'
    
    db = mongodb.initialize_db(db_name=DB_NAME)
    users_col = db[USERS_COL]
    cursor = users_col.find(projection={'_id': 0, 'id': 1, 'description': 1}, # minimize I/O bandwidth
                            sort=[('_id', pymongo.ASCENDING)])
    
    data_lst = []
    print("Processing 'description' field of users...")
    for doc in cursor:
        data_lst.append(doc)
        # take the user (doc) that was just appended and add a new boolean field
        # 'description_ibm' recording whether the keyword occurs in 'description'
        current_user = data_lst[-1]
        current_user['description_ibm'] = utilities.simple_test_keyword_in_text(text=current_user['description'],
                                                                                keyword=keyword)
    print('{} users processed'.format(len(data_lst)))
    
    print('Create new collection: "{}"'.format(USER_DESC_AFFL_COL))
    if USER_DESC_AFFL_COL in db.collection_names():
        print('\tAlready exist! Drop...')
        db[USER_DESC_AFFL_COL].drop()
    user_desc_affl_col = db[USER_DESC_AFFL_COL]
    cnt = user_desc_affl_col.count()
    print('New collection "{}" empty: {}'.format(USER_DESC_AFFL_COL, cnt))
    print('Inserting into "{}"...'.format(USER_DESC_AFFL_COL))
    user_desc_affl_col.insert_many(data_lst)
    print('Done')

MongoDB on localhost:27017/tweets_ek connected successfully!
Processing 'description' field of users...
844675 users processed
Create new collection: "c2_user_desc_affl"
	Already exist! Drop...
New collection "c2_user_desc_affl" empty: 0
Inserting into "c2_user_desc_affl"...
Done


## Simple evaluations of this identification strategy

In [2]:
db = mongodb.initialize_db(db_name=DB_NAME)
users_col = db[USERS_COL]
user_desc_affl_col = db[USER_DESC_AFFL_COL]

MongoDB on localhost:27017/tweets_ek connected successfully!


How many unique users do we have in total?

In [3]:
total_users = user_desc_affl_col.count()
print('{} users in total'.format(total_users))

844675 users in total


How many users have a non-empty 'description' field?

In [4]:
nonempty_desc_users = user_desc_affl_col.count(filter={'description': {'$ne': ''}})
print('{} ({:.2%}) users have non-empty description field'
      .format(nonempty_desc_users, nonempty_desc_users/total_users))

704460 (83.40%) users have non-empty description field


How many users have an empty 'description' field?

In [5]:
empty_desc_users = user_desc_affl_col.count(filter={'description': {'$eq': ''}})
print('{} ({:.2%}) users have empty description field'
      .format(empty_desc_users, empty_desc_users/total_users))

140215 (16.60%) users have empty description field


How many users are identified as 'ibm users'?

In [6]:
ibm_desc_users = user_desc_affl_col.count(filter={'description_ibm': {'$eq': True}})
print('{} ({:.2%} out of total, {:.2%} out of nonempty desc field users) users have keyword "ibm" in description field'
      .format(ibm_desc_users,
              ibm_desc_users/total_users,
              ibm_desc_users/nonempty_desc_users))

6736 (0.80% out of total, 0.96% out of nonempty desc field users) users have keyword "ibm" in description field
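
As a quick sanity check, the four counts printed above can be cross-checked against each other without touching the database. The sketch below only reuses the numbers from the cell outputs; the variable names mirror those in the cells.

```python
# Counts copied from the cell outputs above
total_users = 844675
nonempty_desc_users = 704460
empty_desc_users = 140215
ibm_desc_users = 6736

# Empty and non-empty descriptions should partition the user set
partition_ok = nonempty_desc_users + empty_desc_users == total_users

share_of_total = ibm_desc_users / total_users
share_of_nonempty = ibm_desc_users / nonempty_desc_users

print('partition consistent: {}'.format(partition_ok))
print('{:.2%} of all users, {:.2%} of users with a description'
      .format(share_of_total, share_of_nonempty))
```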


Pickle the lists of 'ibm users' and 'non-ibm users' into local files

In [7]:
"""
This section generates two new global pickle files: list of 'ibm users' and list of 'non-ibm users'.
Register M1_IBM_USER_IDS_PKL = os.path.join(DATA_DIR, 'm1_ibm_user_ids.lst.pkl')
and M1_NONIBM_USER_IDS_PKL = os.path.join(DATA_DIR, 'm1_nonibm_user_ids.lst.pkl')
in config if first time.
"""
if 0 == 1:
    # pickle list of 'ibm users' 'ids'
    user_desc_affl_col = db[USER_DESC_AFFL_COL]
    cursor = user_desc_affl_col.find(filter={'description_ibm': {'$eq': True}},
                                     projection={'_id': 0, 'id': 1})
    ibm_user_ids_lst = []
    for doc in cursor:
        ibm_user_id_int64 = int(doc['id'])
        ibm_user_ids_lst.append(ibm_user_id_int64)
    print('{} "ibm user" ids get'.format(len(ibm_user_ids_lst)))
    print('Dumping to local pickle "{}"'.format(M1_IBM_USER_IDS_PKL))
    with open(M1_IBM_USER_IDS_PKL, 'wb') as f:
        pickle.dump(ibm_user_ids_lst, f)
    print('Done')

if 0 == 1:
    # pickle list of 'non-ibm users' 'ids'
    user_desc_affl_col = db[USER_DESC_AFFL_COL]
    cursor = user_desc_affl_col.find(filter={'description_ibm': {'$eq': False}},
                                     projection={'_id': 0, 'id': 1})
    nonibm_user_ids_lst = []
    for doc in cursor:
        nonibm_user_id_int64 = int(doc['id'])
        nonibm_user_ids_lst.append(nonibm_user_id_int64)
    print('{} "non-ibm user" ids get'.format(len(nonibm_user_ids_lst)))
    print('Dumping to local pickle "{}"'.format(M1_NONIBM_USER_IDS_PKL))
    with open(M1_NONIBM_USER_IDS_PKL, 'wb') as f:
        pickle.dump(nonibm_user_ids_lst, f)
    print('Done')

6736 "ibm user" ids get
Dumping to local pickle "./data/m1_ibm_user_ids.pkl"
Done
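
The pickled ID lists are meant to be read back in later steps. A minimal sketch of that round trip follows, assuming a temporary directory in place of `DATA_DIR` and a tiny sample in place of the real lists. The two real IDs are the ones checked at the end of this notebook; `111` and `222` are placeholders.

```python
import os
import pickle
import tempfile

# Tiny sample standing in for the real lists (111 and 222 are placeholders)
ibm_user_ids_lst = [39413322, 111]
nonibm_user_ids_lst = [14072398, 222]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the real path comes from config
    ibm_pkl = os.path.join(tmp_dir, 'ibm_user_ids.lst.pkl')
    with open(ibm_pkl, 'wb') as f:
        pickle.dump(ibm_user_ids_lst, f)
    with open(ibm_pkl, 'rb') as f:
        loaded_ibm_ids = pickle.load(f)

# The list should survive the round trip unchanged
round_trip_ok = loaded_ibm_ids == ibm_user_ids_lst
# A user must not be in both lists
overlap = set(loaded_ibm_ids) & set(nonibm_user_ids_lst)
print(round_trip_ok, len(overlap))
```

Converting a loaded list to a `set` once is worthwhile whenever many membership tests follow, since a list lookup scans every element.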


## Other evaluations

_Step 1_ check the distribution of the number of tweets authored by different users

First, build a compound index 'user.id'-'id' on the updated data collection to speed up the aggregation process

In [9]:
if 0 == 1:
    updated_col = mongodb.initialize(db_name=DB_NAME, collection_name=UPDATED_COL)
    index_lst = [('user.id', pymongo.ASCENDING),
                 ('id', pymongo.ASCENDING)]
    print('Building compound index {}...'.format(index_lst))
    updated_col.create_index(keys=index_lst)
    print('Done')

MongoDB on localhost:27017/tweets_ek.c2 connected successfully!
Building compound index [('user.id', 1), ('id', 1)]...
Done


Check indexes on updated data collection

In [11]:
pprint(updated_col.index_information())

{'_id_': {'key': [('_id', 1)], 'ns': 'tweets_ek.c2', 'v': 2},
 'id_1': {'background': True, 'key': [('id', 1)], 'ns': 'tweets_ek.c2', 'v': 2},
 'id_str_1': {'background': True,
              'key': [('id_str', 1)],
              'ns': 'tweets_ek.c2',
              'v': 2},
 'user.id_1': {'background': True,
               'key': [('user.id', 1)],
               'ns': 'tweets_ek.c2',
               'v': 2},
 'user.id_1_id_1': {'key': [('user.id', 1), ('id', 1)],
                    'ns': 'tweets_ek.c2',
                    'v': 2},
 'user.id_str_1': {'background': True,
                   'key': [('user.id_str', 1)],
                   'ns': 'tweets_ek.c2',
                   'v': 2},
 'user.screen_name_1': {'background': True,
                        'key': [('user.screen_name', 1)],
                        'ns': 'tweets_ek.c2',
                        'v': 2}}


Plot the distribution of the number of tweets authored by different users

In [21]:
user_tweets_num_pkl = os.path.join(TMP_DIR, 'user_tweets_num.pkl')

In [25]:
if not os.path.exists(user_tweets_num_pkl):
    print("Building pickle from database...")
    # dictionary format {'user.id': tweet_num}
    user_tweets_num_dict = {}
    updated_col = mongodb.initialize(db_name=DB_NAME, collection_name=UPDATED_COL)
    
    # dictionaries for aggregate pipeline on MongoDB
    group_dict = {'$group': {'_id': '$user.id',
                             'tweet_num': {'$sum': 1}}}
    project_dict = {'$project': {'_id': 0,
                                 'id': '$_id',
                                 'tweet_num': 1}}
    ppl_lst = [group_dict, project_dict]
    print('Aggregating "tweet_num" on "user.id"...')
    cursor = updated_col.aggregate(pipeline=ppl_lst)
    for doc in cursor:
        user_id = int(doc['id'])
        user_tweet_num = (doc['tweet_num'])
        user_tweets_num_dict[user_id] = user_tweet_num
    print('{} users processed'.format(len(user_tweets_num_dict)))
    
    print('Dumping to local pickle "{}"...'.format(user_tweets_num_pkl))
    with open(user_tweets_num_pkl, 'wb') as f:
        pickle.dump(user_tweets_num_dict, f)
    print('Done')
else:
    print('Pickled data found')

Building pickle from database...
MongoDB on localhost:27017/tweets_ek.c2 connected successfully!
Aggregating "tweet_num" on "user.id"...
844675 users processed
Dumping to local pickle "./tmp/user_tweets_num.pkl"...
Done


In [22]:
if 1 == 1:
    data_dict = {} # dictionary format {'user.id': tweet_num}
    if not data_dict: # load data from pickle
        with open(user_tweets_num_pkl, 'rb') as f:
            data_dict = pickle.load(f)
    
    print('Loading pickle file "{}" into df...'.format(user_tweets_num_pkl))
    df = pd.DataFrame.from_dict(data=data_dict,
                                orient='index',
                                dtype=int)
    df.columns = ['tweet_num']

Loading pickle file "./tmp/user_tweets_num.pkl" into df...


In [23]:
pprint(df)

                    tweet_num
761864635                   1
2477995763                  1
2815400688                  1
78042494                    1
2437127578                  1
771071042097819648          1
412940231                   1
274676454                   1
2188839702                  1
492567298                   1
290753747                   1
1733423444                  1
711000942                   1
756462537642741761          1
617456798                   1
723202430                   1
1167287023                  1
4264195704                  1
851228970628640768          1
3304160418                  1
19406260                    1
10015502                    1
737768154                   1
2950539875                  1
8908802                     1
37228994                    1
68303031                    1
125723286                   1
615225788                   1
4489763244                  1
...                       ...
3081081687                  1
2436747117

In [27]:
df.loc[df['tweet_num'] > 100]

                    tweet_num
2875959767                180
805063072528273408        257
1105825938                142
2148804949                119
747332928686874624        102
848804341880242176        672
847574679141457921        106
848069595902627840       1524
848074284622692353        337
848072361588932609       2057
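
To make the shape of this distribution explicit, one option is to count how many users authored exactly `k` tweets. The sketch below does this with the standard library on a handful of rows copied from the outputs above, not on the full `{'user.id': tweet_num}` dictionary. The threshold of 100 matches the filter used in the previous cell.

```python
from collections import Counter

# A few {'user.id': tweet_num} entries copied from the outputs above
user_tweets_num = {
    761864635: 1,
    2477995763: 1,
    78042494: 1,
    2875959767: 180,
    1105825938: 142,
    848072361588932609: 2057,
}

# distribution[k] = number of users who authored exactly k tweets
distribution = Counter(user_tweets_num.values())

# Users above the same threshold as the DataFrame filter
heavy_users = {uid: n for uid, n in user_tweets_num.items() if n > 100}

for tweet_num, user_count in sorted(distribution.items()):
    print('{} tweet(s): {} user(s)'.format(tweet_num, user_count))
print('{} users with more than 100 tweets'.format(len(heavy_users)))
```

On the full data the same `(tweet_num, user_count)` pairs could be passed to a log-log plot, since most users appear to have a single tweet while a few have hundreds or thousands.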


_Step 2_ check the distribution of number of retweets received by different users

Plot the distribution of number of retweets received by different users

In [4]:
# NOTE that we differentiate 'retweets_count' from the 'retweet_count' field of a tweet
# 'retweet_count' field indicates the number of times a specific Tweet has been retweeted
# 'retweets_count' is the sum of 'retweet_count' over all Tweets authored by a specific user
user_retweets_count_pkl = os.path.join(TMP_DIR, 'user_retweets_count.pkl')

In [5]:
if not os.path.exists(user_retweets_count_pkl):
    print("Building pickle from database...")
    # dictionary format {'user.id': retweets_count}
    user_retweets_count_dict = {}
    updated_col = mongodb.initialize(db_name=DB_NAME, collection_name=UPDATED_COL)
    
    # dictionaries for aggregate pipeline on MongoDB
    group_dict = {'$group': {'_id': '$user.id',
                             'retweets_count': {'$sum': '$retweet_count'}}}
    project_dict = {'$project': {'_id': 0,
                                 'id': '$_id',
                                 'retweets_count': 1}}
    ppl_lst = [group_dict, project_dict]
    print('Aggregating "retweets_count" on "user.id"...')
    cursor = updated_col.aggregate(pipeline=ppl_lst)
    for doc in cursor:
        user_id = int(doc['id'])
        user_retweets_count = (doc['retweets_count'])
        user_retweets_count_dict[user_id] = user_retweets_count
    print('{} users processed'.format(len(user_retweets_count_dict)))
    
    print('Dumping to local pickle "{}"...'.format(user_retweets_count_pkl))
    with open(user_retweets_count_pkl, 'wb') as f:
        pickle.dump(user_retweets_count_dict, f)
    print('Done')
else:
    print('Pickled data found')

Building pickle from database...
MongoDB on localhost:27017/tweets_ek.c2 connected successfully!
Aggregating "retweets_count" on "user.id"...
844675 users processed
Dumping to local pickle "./tmp/user_retweets_count.pkl"...
Done
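
The aggregation pipelines in Step 1 and Step 2 differ only in the `$sum` operand: `{'$sum': 1}` counts tweets per user, while `{'$sum': '$retweet_count'}` adds up the retweets those tweets received. As an illustration of what the `$group` stage computes, here is an in-memory equivalent on a few made-up tweet documents. The documents and their values are hypothetical; only the field names follow the collection.

```python
from collections import Counter, defaultdict

# Made-up tweet documents with the fields used by the pipelines
tweets = [
    {'id': 1, 'user': {'id': 10}, 'retweet_count': 3},
    {'id': 2, 'user': {'id': 10}, 'retweet_count': 0},
    {'id': 3, 'user': {'id': 20}, 'retweet_count': 5},
]

# Equivalent of {'$group': {'_id': '$user.id', 'tweet_num': {'$sum': 1}}}
tweet_num = Counter(tweet['user']['id'] for tweet in tweets)

# Equivalent of {'$group': {'_id': '$user.id',
#                           'retweets_count': {'$sum': '$retweet_count'}}}
retweets_count = defaultdict(int)
for tweet in tweets:
    retweets_count[tweet['user']['id']] += tweet['retweet_count']

print(dict(tweet_num))
print(dict(retweets_count))
```

Running the grouping inside MongoDB, as the notebook does, avoids transferring every tweet document to the client; the in-memory version is only meant to show the semantics.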


In [9]:
if 1 == 1:
    data_dict = {} # dictionary format {'user.id': retweets_count}
    if not data_dict: # load data from pickle
        with open(user_retweets_count_pkl, 'rb') as f:
            data_dict = pickle.load(f)
    
    print('Loading pickle file "{}" into df...'.format(user_retweets_count_pkl))
    df = pd.DataFrame.from_dict(data=data_dict,
                                orient='index',
                                dtype=int)
    df.columns = ['retweets_count']

Loading pickle file "./tmp/user_retweets_count.pkl" into df...


In [19]:
df.loc[df['retweets_count'] > 1000]

                    retweets_count
2815400688                    4218
1708845985                    1038
3161916320                    2834
1716882602                    2669
2850883017                    1203
2359350947                    1038
816433804675928065            1460
730533306081660933            1038
1691704364                    1460
826844975317057537            7687


## Check whether accounts @Natasha_D_G and @jameskobielus are included

As a special task, we check the accounts of two individuals known to be affiliated with IBM.
1. @Natasha_D_G (Natasha Bishop): https://twitter.com/Natasha_D_G?lang=en
2. @jameskobielus (James Kobielus): https://twitter.com/jameskobielus?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor

_Step 1_ Check if these two individuals exist in our database.

In [7]:
if 0 == 1:
    user_col = mongodb.initialize(db_name=DB_NAME, collection_name=USERS_COL)
    
    
    screen_name_1 = 'Natasha_D_G'
    doc = user_col.find_one(filter={'screen_name': screen_name_1})
    if doc:
        print('User "{}" found in collection {}'.format(screen_name_1, USERS_COL))
        pprint(doc)
    else:
        print('User "{}" not found in collection {}'.format(screen_name_1, USERS_COL))
        
    screen_name_2 = 'jameskobielus'
    doc = user_col.find_one(filter={'screen_name': screen_name_2})
    if doc:
        print('User "{}" found in collection {}'.format(screen_name_2, USERS_COL))
        pprint(doc)
    else:
        print('User "{}" not found in collection {}'.format(screen_name_2, USERS_COL))

MongoDB on localhost:27017/tweets_ek.c2_users connected successfully!
User "Natasha_D_G" found in collection c2_users
{'_id': ObjectId('58fed797fe57a10b23955391'),
 'contributors_enabled': False,
 'created_at': 'Tue May 12 02:37:34 +0000 2009',
 'default_profile': False,
 'default_profile_image': False,
 'description': 'Digital #Marketing & Public Sector Lead for #IBM. Top 100 for '
                '#Bigdata & #IoT  https://t.co/sImYhvcoA1 Tweets = mine & any '
                'and all #sports get me excited!',
 'entities': {'description': {'urls': [{'display_url': 'ibm.co/1PgNsAk',
                                        'expanded_url': 'http://ibm.co/1PgNsAk',
                                        'indices': [79, 102],
                                        'url': 'https://t.co/sImYhvcoA1'}]},
              'url': {'urls': [{'display_url': 'ibm.co/cybersecurityp…',
                                'expanded_url': 'http://ibm.co/cybersecuritypredictions2017',
                       

We see that both individuals were captured and exist in our database.
1. @Natasha_D_G (Natasha Bishop): 'id': 39413322
2. @jameskobielus (James Kobielus): 'id': 14072398

We also see that @Natasha_D_G has the keyword 'ibm' in her description while @jameskobielus does not. We therefore expect @Natasha_D_G, but not @jameskobielus, to appear in the identified IBM-user set.

In [8]:
if 1 == 1:
    # define the screen names here so this cell does not depend on the disabled cell above
    screen_name_1, id_1 = 'Natasha_D_G', 39413322
    screen_name_2, id_2 = 'jameskobielus', 14072398
    
    with open(M1_IBM_USER_IDS_PKL, 'rb') as f:
        # build the set once for fast membership tests
        m1_ibm_user_ids_set = set(pickle.load(f))
        
    print('User {} exists in IBM-user set? {}'.format(screen_name_1, id_1 in m1_ibm_user_ids_set))
    print('User {} exists in IBM-user set? {}'.format(screen_name_2, id_2 in m1_ibm_user_ids_set))

User Natasha_D_G exists in IBM-user set? True
User jameskobielus exists in IBM-user set? False


# Notes