## Predicting political preferences from Twitter metadata ##

Predicting political affiliation based on Twitter data is a challenging problem that has been tackled via two primary approaches:

- __Parsing text__ of individuals' tweets
- Analyzing __metadata__ (followers, followees, etc.)

Pennacchiotti and Popescu reported in their [2011 paper](http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2886/3262) that the latter approach is far more effective. Specifically, using only friends (i.e., the accounts an individual user follows on Twitter) can yield ~85% accuracy in predicting political affiliation, whereas text parsing coupled with a relatively sophisticated DLDA model does approximately 10 points worse.

Here, we use a small set of well-known politicians and celebrities to divide users along ideological lines. We then train a random forest model to classify users by affiliation. The final model achieves an accuracy of __~80%__ in predicting political affiliation (currently measured on the same 50 users it was trained on), which can likely be improved as we refine the comparison set of "followees" (the high-profile politicians and celebrities we use to decide on a user's affiliation).

In [20]:
# Standard imports
import tweepy
import csv
import re
import operator 
from collections import Counter
from tweepy import OAuthHandler
import sys
import time
import urllib2
import pandas as pd
import numpy as np

In [209]:
# Retrieve consumer secret
import json
with open('../config.json') as data_file:    
    secret_data = json.load(data_file)
    consumer_secret = str(secret_data[0]['consumer_secret'])

# Tweepy OAuth 
consumer_key = "BVfNGMa15W5bKm2iRZHqLmNTx"
access_token = "52306095-Rr9ONTH5ZKbAcKDFj8h8AvsBPuH0x2qiRPWg7oaI6"
access_token_secret = "vtZm8fNE2VxxCVhQCkrUMPeKxggdbMZFPSPy2LXctHIGm"
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
 
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

Next, we're going to find 1,000 Democrats and 1,000 Republicans by searching for users who tweeted __"#ImWithHer"__ and __"#MAGA"__, respectively.

Note that this clearly does not generate gold-standard test labels, as it is on the verge of being circular. We are trying to learn whether each user tweeted these campaign slogans, but our independent variables include whether each user follows Trump/Clinton on Twitter, a _very_ closely related outcome.

We would ideally like some sort of self-reporting tool to use as our test set (i.e., find users who self-identified as Democrats and Republicans), although, unfortunately, the popular tools for self-identification (e.g., www.wefollow.com) seem to have shut down. A potential next step here would be to use Amazon Mechanical Turk to hand-label ~1000 users to build a more robust test set.

In [72]:
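# Note: search_users queries Twitter's user search, so matches likely come from
# profile text (name, bio) rather than from tweets; result_type and lang are
# parameters of the tweet search endpoint.
democrats_iterator = tweepy.Cursor(api.search_users,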
                           q="#ImWithHer",
                           count=1000,
                           result_type="recent",
                           include_entities=True,
                           lang="en").items(1000)

republicans_iterator = tweepy.Cursor(api.search_users,
                           q="#MAGA",
                           count=1000,
                           result_type="recent",
                           include_entities=True,
                           lang="en").items(1000)

democrat_ids = []
republican_ids = []
democrat_names = []
republican_names = []

# Collect user IDs and screen names, printing progress every 100 users
for count, d in enumerate(democrats_iterator, 1):
    if count % 100 == 0:
        print("Democrat {0}".format(count))
    democrat_ids.append(d.id)
    democrat_names.append(d.screen_name)

for count, r in enumerate(republicans_iterator, 1):
    if count % 100 == 0:
        print("Republican {0}".format(count))
    republican_ids.append(r.id)
    republican_names.append(r.screen_name)

Democrat 100
Democrat 200
Democrat 300
Democrat 400
Democrat 500
Democrat 600
Democrat 700
Democrat 800
Democrat 900
Democrat 1000
Republican 100
Republican 200
Republican 300
Republican 400
Republican 500
Republican 600
Republican 700
Republican 800
Republican 900
Republican 1000
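
The collected lists may contain repeats (a TODO later in this notebook mentions a repeated Republican username). A minimal sketch of one way to clean them up is below, assuming we want to keep first-seen order and drop any ID that lands in both groups. The helper names and the toy IDs are hypothetical and are not part of the original pipeline.

```python
def dedupe_preserving_order(items):
    """Return items with later repeats removed, keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def drop_shared(group_a, group_b):
    """Remove IDs that appear in both groups, since their label is ambiguous."""
    shared = set(group_a) & set(group_b)
    return ([x for x in group_a if x not in shared],
            [x for x in group_b if x not in shared])


# Hypothetical toy IDs, not real Twitter accounts
toy_democrat_ids = [11, 12, 12, 13, 99]
toy_republican_ids = [21, 22, 21, 99, 23]

clean_dems, clean_reps = drop_shared(
    dedupe_preserving_order(toy_democrat_ids),
    dedupe_preserving_order(toy_republican_ids))
print(clean_dems)  # [11, 12, 13]
print(clean_reps)  # [21, 22, 23]
```

In the notebook itself, the same helpers could be applied to `democrat_ids` and `republican_ids` before building the dataframe, provided the matching screen-name lists are filtered in step.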


We then define a standard list of __'followees'__: politicians and celebrities we believe should strongly divide individuals along ideological lines. We can use the Twitter API to determine, for each (follower, followee) pair, whether the user follows the followee (a binary indicator of 1 or 0).
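
To make the feature layout concrete, here is a small sketch of the binary indicator table, one row per user and one column per followee. It uses hypothetical toy data (`toy_follows`) in place of real API responses, and the helper `indicator_rows` is illustrative only.

```python
# Hypothetical toy data: each user mapped to the set of accounts they follow
toy_follows = {
    "user_a": {"HillaryClinton", "SenSanders"},
    "user_b": {"realDonaldTrump", "seanhannity", "mike_pence"},
    "user_c": set(),
}
toy_followees = ["HillaryClinton", "realDonaldTrump", "SenSanders"]


def indicator_rows(follows, followees):
    """Build one row per user: 1 if the user follows the followee, else 0."""
    rows = {}
    for user in sorted(follows):
        rows[user] = [1 if f in follows[user] else 0 for f in followees]
    return rows


rows = indicator_rows(toy_follows, toy_followees)
for user in sorted(rows):
    print("{0}: {1}".format(user, rows[user]))
# user_a: [1, 0, 1]
# user_b: [0, 1, 0]
# user_c: [0, 0, 0]
```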

In [191]:
followees = ['HillaryClinton', 
             'realDonaldTrump', 
             'tedcruz', 
             'elizabethforma', 
             'SenSanders', 
             'mike_pence',
             'seanhannity']
id_dict = {}
for followee in followees:
    results = api.get_user(followee)
    id_dict[results.id] = results.screen_name

In [154]:
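# Use only the first 25 users from each group to keep the number of API calls manageable
n_per_group = 25
user_ids = pd.Series(democrat_ids[:n_per_group] + republican_ids[:n_per_group])
user_names = pd.Series(democrat_names[:n_per_group] + republican_names[:n_per_group])
# Labels: 1.0 for Democrats, 0.0 for Republicans
is_democrat = pd.Series(np.append(np.ones(n_per_group), np.zeros(n_per_group)))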

In [155]:
# Create pandas dataframe and prepopulate with dummy values

# TODO: Figure out why one Republican username is repeated

# user_ids = pd.Series(democrat_ids + republican_ids)
# user_names = pd.Series(democrat_names + republican_names)
# dems = np.ones(len(democrat_ids))
# is_democrat = pd.Series(np.append(dems, np.zeros(len(republican_ids))))

df = pd.DataFrame({ 'user_id' : user_ids,
                    'user_names' : user_names,
                    'is_democrat' : is_democrat,
                    followees[0] : None,
                    followees[1] : None,
                    followees[2] : None,
                    followees[3] : None,
                    followees[4] : None,
                    followees[5] : None,
                    followees[6] : None
                  })

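# Populate (follower, followee) pairs in the dataframe. Since we hit the rate limit
# after 180 calls, running this on the full set of 2,000 users x 7 followees
# (14,000 calls) takes ~19 hours. Potentially should find a workaround?

for followee_id in id_dict:
    followee = id_dict[followee_id]
    print("Retrieving {0} data".format(followee))
    count = 0
    for user_id in user_ids:
        # With wait_on_rate_limit=True tweepy normally sleeps on its own; the
        # handler below is a fallback that waits and then retries the same call.
        while True:
            try:
                friendship = api.show_friendship(source_id=user_id, target_id=followee_id)
                break
            except tweepy.RateLimitError:
                print("Sleeping for 15 minutes...")
                time.sleep(60*15)
        # friendship is a (source, target) pair; target.followed_by is True
        # when the source user follows the followee
        follows_politician = friendship[1].followed_by
        df.loc[df.user_id == user_id, followee] = follows_politician
        count += 1
        if (count % 20 == 0):
            print("Processing follower number {0}".format(count))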

Retrieving realDonaldTrump data
Processing follower number 20
Rate limit reached. Sleeping for: 853
Processing follower number 40
Retrieving SenSanders data
Processing follower number 20
Processing follower number 40
Retrieving mike_pence data
Processing follower number 20
Processing follower number 40
Retrieving HillaryClinton data
Processing follower number 20
Processing follower number 40
Retrieving elizabethforma data
Rate limit reached. Sleeping for: 863
Processing follower number 20
Processing follower number 40
Retrieving seanhannity data
Processing follower number 20
Processing follower number 40
Retrieving tedcruz data
Processing follower number 20
Processing follower number 40
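
The ~19 hour figure quoted in the code comment can be sanity-checked with a rough calculation. The sketch below assumes the limit of 180 calls per 15-minute window mentioned above and one `show_friendship` call per (user, followee) pair. The function name is hypothetical, and the estimate ignores network latency.

```python
CALLS_PER_WINDOW = 180   # rate limit quoted in the notebook
WINDOW_MINUTES = 15      # length of one rate-limit window


def estimated_hours(n_users, n_followees):
    """Rough run time if every rate-limit window is fully used."""
    calls = n_users * n_followees
    windows = calls / float(CALLS_PER_WINDOW)
    return windows * WINDOW_MINUTES / 60.0


print(round(estimated_hours(2000, 7), 1))  # 19.4 (full set of 2,000 users)
print(round(estimated_hours(50, 7), 1))    # 0.5  (the 50-user subset used here)
```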


In [157]:
# Save data file
df.to_csv("saved_twitter_user_data_1.csv")

In [208]:
import sklearn
import sklearn.ensemble as ensemble
from sklearn import cross_validation
from sklearn import metrics
import copy

forest = ensemble.RandomForestClassifier()
# feature_followees = copy.copy(followees)
followee_sets = [['mike_pence', 'seanhannity'], ['elizabethforma','SenSanders'], ['realDonaldTrump', 'HillaryClinton'], followees]

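for feature_followees in followee_sets:
    # Feature matrix: one boolean "follows" column per followee in this subset
    X = df.loc[:, feature_followees].values
    y = df.loc[:, 'is_democrat']

    # test_size=0 keeps all 50 users in the training split, so nothing is held out
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0, random_state=8)

    forest.fit(X_train, y_train)
    # NOTE: predictions are made on the training data itself, so the number printed
    # below is training accuracy, an optimistic estimate of held-out performance
    predictions = forest.predict(X_train)
    accuracy = metrics.accuracy_score(y_train, predictions)
    print("Accuracy is {0} on a test set size of {1}".format(accuracy, len(predictions)))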

Accuracy is 0.8 on a test set size of 50
Accuracy is 0.7 on a test set size of 50
Accuracy is 0.86 on a test set size of 50
Accuracy is 0.86 on a test set size of 50
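
Because the accuracies above are computed on the same users the forest was fit on, a held-out split would give a more honest estimate. The sketch below illustrates the idea using only the standard library. It is not the notebook's random forest: the classifier is a stand-in lookup table (majority label per feature pattern), and the data, helper names, and 30% split are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_test_split(rows, test_fraction, seed):
    """Shuffle rows with a fixed seed and split off a held-out test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_lookup(train_rows):
    """Stand-in classifier: majority label for each observed feature pattern."""
    votes = defaultdict(Counter)
    overall = Counter()
    for features, label in train_rows:
        votes[features][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen patterns
    table = dict((f, c.most_common(1)[0][0]) for f, c in votes.items())
    return lambda features: table.get(features, default)


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / float(len(rows))


# Hypothetical toy data: (follows_clinton, follows_trump) -> is_democrat
toy_rows = ([((1, 0), 1)] * 20 + [((0, 1), 0)] * 20 +
            [((0, 0), 1)] * 5 + [((0, 0), 0)] * 5)

train_rows, test_rows = train_test_split(toy_rows, test_fraction=0.3, seed=8)
predict = fit_lookup(train_rows)
print("Train accuracy: {0:.2f}".format(accuracy(predict, train_rows)))
print("Held-out accuracy: {0:.2f}".format(accuracy(predict, test_rows)))
```

With the real dataframe, the same pattern would mean passing a non-zero `test_size` to the split and scoring `forest.predict(X_test)` against `y_test`.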
