# User list generator

In this notebook, we find a set of 1000 users whose tweets we want to download. My strategy rests on the assumption that people who follow the Twitter accounts of well-established farmers markets are very likely to support the idea of farmers markets, to be interested in buying local, and to be socially conscious.

The procedure is as follows: I pick a farmers market Twitter account at random from listings of the top farmers markets in the US. I then get a list of all its followers and pick one follower at random (this randomizes features across cities and helps remove any location- or position-specific biases). I repeat this random choice a thousand times (or two thousand times, depending on how many total tweets I can retrieve; alternatively, I could rank users by number of tweets to find the most prolific ones) to build my corpus of tweets.
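
As a rough illustration of the two-stage draw described above, here is a minimal sketch on toy data. The function name `sample_followers`, the handles, and the choice to skip repeated picks (so that each user appears once) are assumptions made for this example, not part of the notebook's pipeline.

```python
import random


def sample_followers(followers_by_market, n_draws, seed=None):
    """Pick a market uniformly at random, then one of its followers.

    Repeats until `n_draws` distinct followers are collected, or until
    every follower has been picked if fewer than `n_draws` exist.
    Returns a dict mapping each picked follower to the market it came from.
    """
    rng = random.Random(seed)
    # Ignore markets with no followers so rng.choice never sees an empty list.
    markets = [m for m, followers in followers_by_market.items() if followers]
    all_unique = {f for m in markets for f in followers_by_market[m]}
    target = min(n_draws, len(all_unique))

    picked = {}
    while len(picked) < target:
        market = rng.choice(markets)
        follower = rng.choice(followers_by_market[market])
        picked.setdefault(follower, market)  # keep the first market seen
    return picked


# Hypothetical handles, for illustration only.
toy_followers = {
    "market_a": ["u1", "u2", "u3"],
    "market_b": ["u3", "u4"],
}
sample = sample_followers(toy_followers, n_draws=3, seed=0)
```

Note that picking the market first gives followers of small markets a higher chance of selection than followers of large ones; drawing markets in proportion to their follower counts would be one way to change that.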

## Basic Imports

In [1]:
import json
import glob
import pickle
import collections
import random
from tqdm import tqdm

import os
dirpath = os.getcwd()  # notebooks do not define __file__, so use the working directory

import tweepy
import config

import nltk

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
list_markets = pd.read_excel('./list_of_farmers_markets.xlsx', sheet_name='Selected')
list_markets = list_markets.sort_values(by=['Num_Followers'], ascending=False)
list_markets = list_markets.reset_index(drop=True)
list_markets.head(20)

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'config.p'

In [0]:
# Defining the number of people to pick from each city; for now I am choosing
# this number to be proportional to the total number of followers.
# np.int was removed in recent NumPy releases, so use the built-in int.
list_markets['Num_Followers_ToPick'] = (list_markets['Num_Followers']/
                                        np.sum(list_markets['Num_Followers'])*2000).astype(int)
list_markets.head(20)
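
Because the proportional shares are truncated to integers, the per-market picks can add up to slightly fewer than 2000. One possible way to close that gap is largest-remainder rounding. The sketch below uses plain Python and made-up follower counts; `allocate_proportionally` is a hypothetical helper, not something the notebook defines.

```python
def allocate_proportionally(counts, total):
    """Split `total` picks across keys in proportion to `counts`.

    Shares are truncated first (as astype(int) does), then the picks
    lost to truncation go to the keys with the largest fractional parts.
    """
    grand_total = sum(counts.values())
    exact = {key: value * total / grand_total for key, value in counts.items()}
    allocation = {key: int(share) for key, share in exact.items()}

    leftover = total - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - allocation[k], reverse=True)
    for key in by_remainder[:leftover]:
        allocation[key] += 1
    return allocation


# Made-up follower counts, for illustration only.
toy_counts = {"market_a": 5000, "market_b": 3000, "market_c": 1000}
picks = allocate_proportionally(toy_counts, total=2000)
# Plain truncation would give 1111 + 666 + 222 = 1999; here the
# remaining pick goes to market_b, which has the largest remainder.
```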

## Authenticating the Twitter API

In [0]:
consumer_key = config.consumer_key
consumer_secret = config.consumer_secret
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

## Downloading the user handles i.e. `screen_name` of users

First, we begin with some helper functions

We iterate through the list of market screen names and download the followers of each one.

In [5]:
# In future runs, if you don't have to download this data again,
# just load the original pickle file
# followers_dict = {}
# for market in tqdm(list_markets['Twitter_handle']):
#     try:
#         followers = tweepy.Cursor(api.followers,
#                                     screen_name=market,
#                                     lang='en',
#                                     include_entities=True,
#                                     count=2000).items(2000)
#         followers_list = list(followers)
#         followers_json = list(map(lambda f: f._json, followers_list))
#         followers_dict[market] = followers_dict.get(market, []) + followers_json
#     except:
#         with open('./followers_dict.data', 'wb') as filehandle:
#             pickle.dump(followers_dict, filehandle)

# with open('./followers_dict.data', 'wb') as filehandle:
#     pickle.dump(followers_dict, filehandle)

I now have a dictionary of the form

```
{
    market1: [{user1_json}, {user2_json}, ..., {user2000_json}],
    market2: [{user1_json}, {user2_json}, ..., {user2000_json}],
    .
    .
    market10: [{user1_json}, {user2_json}, ..., {user2000_json}],    
}
```

## Trimming `followers_dict`

Now that I have `followers_dict`, I have a lot of users along with all their metadata. To separate users who provide signal from users who provide noise, I filter on two parameters: how many followers a user has and how many times they have tweeted. I started from a cutoff of 500 for each (relax the tweet cutoff if it turns out to be too high; I don't have a good sense of where it should sit). The function below currently uses 300 followers and 1000 tweets. I requested up to 2000 followers for each of the 10 markets, so hopefully enough users will clear both cutoffs.

### Defining the function that selects followers

In [6]:
def selected_follower(follower):
    """
    input: a follower json (dict) as returned by the Twitter API.
    returns: True if the follower has at least 300 followers and has
             tweeted at least 1000 times, False otherwise.
    """
    has_enough_followers = follower['followers_count'] >= 300
    has_enough_tweets = follower['statuses_count'] >= 1000
    return has_enough_followers and has_enough_tweets
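
Since the right cutoffs are a judgement call, it can help to see how many users survive under a few different settings before committing to one. The sketch below is a hypothetical parameterised variant of the filter; `passes_thresholds`, `count_selected`, and the toy follower records are assumptions made for illustration. Only the `followers_count` and `statuses_count` keys come from the follower JSON used above.

```python
def passes_thresholds(follower, min_followers=300, min_tweets=1000):
    """Return True if the follower meets both minimum counts."""
    return (follower.get('followers_count', 0) >= min_followers
            and follower.get('statuses_count', 0) >= min_tweets)


def count_selected(followers, min_followers, min_tweets):
    """Count how many followers pass a given pair of cutoffs."""
    return sum(passes_thresholds(f, min_followers, min_tweets) for f in followers)


# Toy records, for illustration only.
toy_followers = [
    {'screen_name': 'a', 'followers_count': 250, 'statuses_count': 4000},
    {'screen_name': 'b', 'followers_count': 800, 'statuses_count': 600},
    {'screen_name': 'c', 'followers_count': 1200, 'statuses_count': 1500},
]

# Number of toy users kept under each (min_followers, min_tweets) pair.
survivors = {
    cutoffs: count_selected(toy_followers, *cutoffs)
    for cutoffs in [(300, 1000), (500, 500), (200, 500)]
}
```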

### Making `follower_dict_trimmed` with only the selected followers

Use only if downloading new data. Otherwise, go ahead and use the file that has been exported.

In [7]:
with open('./followers_dict.data', 'rb') as filehandle:
    followers_dict = pickle.load(filehandle)

counter = 0
follower_dict_trimmed = collections.defaultdict(list)
for market in tqdm(followers_dict):
    for follower in followers_dict[market]:
        if selected_follower(follower):
            counter += 1
            # append in place instead of rebuilding the list on every match
            follower_dict_trimmed[market].append(follower)
print(counter)

# With these criteria, I get 2028 unique followers. I next download 500 tweets from each of those 2028 followers. Perhaps this will give me enough diversity and a large enough corpus of words.

100%|██████████| 10/10 [00:00<00:00, 556.98it/s]
4323
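
The counter above adds up selected followers market by market, so an account that follows several markets is counted once per market. To check how many distinct accounts there are, one option is to collect the `id_str` field of each follower JSON into a set. The sketch below runs on toy data; the function name and the records are assumptions made for illustration.

```python
def count_total_and_unique(followers_by_market):
    """Return (total entries, number of distinct accounts) across all markets."""
    total = 0
    unique_ids = set()
    for followers in followers_by_market.values():
        for follower in followers:
            total += 1
            unique_ids.add(follower['id_str'])
    return total, len(unique_ids)


# Toy records, for illustration only: account '3' follows both markets.
toy_trimmed = {
    'market_a': [{'id_str': '1'}, {'id_str': '2'}, {'id_str': '3'}],
    'market_b': [{'id_str': '3'}, {'id_str': '4'}],
}
total, unique = count_total_and_unique(toy_trimmed)
```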


In [8]:
with open('./followers_dict_trimmed.data', 'wb') as filehandle:
    pickle.dump(dict(follower_dict_trimmed), filehandle)

## Downloading 500 tweets from each of the selected followers

I am going to retain the market split because I want documents grouped by market. That is, I am looking for a dictionary with the following structure:

```
{
    market1: {
                user1: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}],
                user2: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}],
                .
                .
                usern: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}]
            }
    .
    .
    market10: {
                user1: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}],
                user2: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}],
                .
                .
                usern: [{tweet1_json}, {tweet2_json}, ..., {tweetn_json}]
              }
}
```

In [9]:
with open('./followers_dict_trimmed.data', 'rb') as filehandle:
    follower_dict_trimmed = pickle.load(filehandle)

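all_tweets = {}
markets = list(follower_dict_trimmed.keys())
for market in tqdm(markets[:3]):  # only the first three markets in this run
    all_tweets[market] = {}
    for follower in follower_dict_trimmed[market]:
        screen_name = follower['screen_name']
        user_tweets = []
        try:
            # user_timeline returns at most 200 tweets per request;
            # the Cursor pages through results until 500 are collected
            tweets = tweepy.Cursor(api.user_timeline,
                                    screen_name=screen_name,
                                    tweet_mode='extended',
                                    count=200).items(500)
            for tweet in tweets:
                user_tweets.append(tweet._json)
        except Exception:
            # e.g. protected or suspended accounts; keep whatever was fetched
            pass
        if user_tweets:
            all_tweets[market][screen_name] = user_tweets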

# with open('./all_tweets_dict.data', 'wb') as filehandle:
#     pickle.dump(all_tweets, filehandle)

0%|          | 0/3 [00:00<?, ?it/s]Rate limit reached. Sleeping for: 64
100%|██████████| 3/3 [39:37<00:00, 845.21s/it]


In [10]:
with open('./all_tweets_dict.data', 'wb') as filehandle:
    pickle.dump(all_tweets, filehandle)

Note on the structure of the output dictionary: It has the form

```
{
    market1: {
                screen_name1: [{tweet1_json}, ..., {tweet500_json}],
                .
                .
                .
                screen_name2028: [{tweet1_json}, ..., {tweet500_json}]
            }
    .
    .
    .
    market2: {
                screen_name1: [{tweet1_json}, ..., {tweet500_json}],
                .
                .
                .
                screen_name2028: [{tweet1_json}, ..., {tweet500_json}]
             }
}
```

Now that I have 500 tweets from each of the 2028 users, I can use my previous LDA code to build the requisite dictionary of the form:

```
{
    user1: 
        {
            hashtags: [list of hashtags from each tweet], 
            fulltext: [list of all cleaned/depunkt words across all tweets]
        },
    .
    .
    .
    usern: 
        {
            hashtags: [list of hashtags from each tweet], 
            fulltext: [list of all cleaned/depunkt words across all tweets]
        }
}
```
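
A minimal sketch of how that per-user dictionary could be assembled from `all_tweets` is shown below. It is a stand-in, not the earlier LDA code: the function name `summarize_users` and the regex-based cleaning are assumptions made for illustration. The `full_text` and `entities['hashtags']` fields are the ones present in tweet JSON fetched with `tweet_mode='extended'`.

```python
import re

URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")
WORD = re.compile(r"[a-z']+")


def summarize_users(all_tweets):
    """Collapse {market: {user: [tweet_json, ...]}} into {user: {hashtags, fulltext}}."""
    summary = {}
    for users in all_tweets.values():
        for screen_name, tweets in users.items():
            entry = summary.setdefault(screen_name, {'hashtags': [], 'fulltext': []})
            for tweet in tweets:
                tags = tweet.get('entities', {}).get('hashtags', [])
                entry['hashtags'].extend(tag['text'].lower() for tag in tags)
                # Lowercase, drop URLs and @mentions, then keep alphabetic tokens.
                text = URL_OR_MENTION.sub(' ', tweet.get('full_text', '').lower())
                entry['fulltext'].extend(WORD.findall(text))
    return summary


# Toy tweet, for illustration only.
toy_tweets = {
    'market_a': {
        'user_one': [{
            'full_text': 'Fresh peaches at the market! #BuyLocal https://t.co/x',
            'entities': {'hashtags': [{'text': 'BuyLocal'}]},
        }],
    },
}
toy_summary = summarize_users(toy_tweets)
```

A user who follows more than one market ends up with a single entry that merges their tweets from every market; if the same tweets were downloaded once per market, they would appear more than once and would need deduplicating by tweet id first.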