# #WILL_YOU_FOLLOW_OBAMA_ON_TWITTER?

## 1. Introduction

Many celebrities use Twitter, and some of them are followed by tens of millions of ordinary users. Whether a specific user will follow a particular celebrity is an interesting question to ask.

In this project, we look into predicting whether a given user is following Barack Obama (@BarackObama). We tried three different learning models: a bag-of-words model, Word2Vec, and collaborative filtering. In this report, we describe how we gathered data for each model and how well each model performs.


As a future use, the result could be used to automatically generate recommended celebrities for users. Furthermore, we are interested in how to extend the model to also predict the likelihood of a user following another ordinary user.

## 2. Twitter API and Tweepy

In this project we mainly used tweepy as a wrapper of the Twitter API. Most queries we made were for user followers, friends, and tweets.

In [None]:
import tweepy
from tweepy import OAuthHandler
import pandas as pd
import matplotlib.pyplot as plt
import time
import numpy as np
from twython import Twython


consumer_key = '<CONSUMER_KEY>'
consumer_secret = '<CONSUMER_SECRET>'
access_token = '<ACCESS_TOKEN>'
access_secret = '<ACCESS_SECRET>'
auth = OAuthHandler(consumer_key=consumer_key, consumer_secret=consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth) 

## 3. Bag-of-words model

We found that user tweets contain a lot of information and might be used to predict whether a user is following Obama. For example, when a user's tweets convey his or her interest in the field of politics, it would be reasonable to say that the user is more likely to have followed Obama. With this assumption in mind, we decided to use the bag-of-words model. Training data would be collective tweets from followers and non-followers of Obama respectively, and each testing instance would just be collective tweets from a specific user.

### 3.1 Collecting data

Due to the rate limits on Twitter API calls, we spent quite some time collecting the amount of data that we thought we needed for the project. 
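One way to cope with the rate limits is a small retry helper that waits out the limit window and calls the API again. The sketch below is a minimal illustration, not the code used in the project: the name `call_with_retry` and its parameters are hypothetical, and the 15-minute default wait matches the sleep used elsewhere in this notebook.

```python
import time


def call_with_retry(fn, *args, retries=3, wait_seconds=15 * 60,
                    retry_on=Exception, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on failure, wait and try again up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(wait_seconds)  # wait for the rate-limit window to pass
```

With tweepy this could wrap a call such as `call_with_retry(api.friends_ids, user, retry_on=tweepy.TweepError)`. tweepy's `API` constructor also accepts `wait_on_rate_limit=True`, which handles the waiting internally.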

For positive examples (tweets from users who follow Obama), we simply collected the most recent 2000 followers of Obama. However, we found that many of these users have very few tweets, which might imply that they are fairly new Twitter users (also considering the fact that they just started following Obama). To keep our selection of users random rather than biased towards new users, we only collected tweets from users who have more than 50 tweets. We collected 200 tweets from each "qualified" user, or however many tweets that user has if the count is less than 200.

In [None]:
def retrieve_n_recent_follower_ids(user, n):
    """ Retrieve the most recent n follower IDs of a given Twitter id or screen name. """
    
    followers = []
    # Each page is a list of up to 5000 IDs, most recent followers first.
    for follower_list in tweepy.Cursor(api.followers_ids, id=user).pages():
        followers.extend(follower_list)
        if len(followers) >= n:
            break
    return followers[:n]


In [None]:
# Get the IDs of the most recent 2000 followers of Obama
followers = retrieve_n_recent_follower_ids("BarackObama", 2000)

In [None]:
import io
import os.path
def get_tweets(screen_name, subdirectory):
    """ Retrieve and save up to 200 of the most recent tweets from the user with the given screen name, 
    if and only if more than 50 tweets are returned. Return True if tweets were saved, False otherwise. """
    
    # Create the output directory on first use.
    if not os.path.isdir(subdirectory):
        os.makedirs(subdirectory)
    try: 
        new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    except tweepy.TweepError:
        # Skip users whose timeline cannot be read (e.g. protected accounts).
        return False
    if len(new_tweets) <= 50:
        return False
    # Join all tweets into one document and store it as UTF-8 text.
    tweets_200 = u' '.join(tweet.text for tweet in new_tweets)
    with io.open(os.path.join(subdirectory, '%s_tweets.txt' % screen_name), 'a', encoding='utf-8') as f:
        f.write(tweets_200)
    return True

In [None]:
# a list of Boolean values, indicating whether each follower has more than 50 tweets and therefore had tweets saved
tweet_saved = [] 

for follower in followers:
    screen_name = api.get_user(follower).screen_name
    tweet_saved.append(get_tweets(screen_name, "follower_tweets"))

print(sum(1 for saved in tweet_saved if saved)) # print count of valid followers

In order to find users who do not follow Obama, we inspected the followers of the previous group of 2000 users who follow Obama. We think this is a good way to maintain the randomness in our training and test data. Specifically, we selected the most recent follower of each Obama follower and checked whether a) he or she is not following Obama and b) he or she has more than 50 tweets. Only when both a) and b) are met do we save the tweets from that user. To check whether a user follows Obama, we retrieved the user's friend list and checked whether Obama is in the list.

In [None]:
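# Get one follower from each of the previous 2000 followers of Obama
candidates = []
for follower in followers:
    # The helper returns a list; keep its first entry (if the user has any followers).
    candidates.extend(retrieve_n_recent_follower_ids(follower, 1)[:1])

print(len(candidates)) # at most 2000 (users without followers are skipped)
print(candidates[:20]) # print some IDs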

In [None]:
def isFollowing(users, target_user):
    """Given the twitter IDs of a list of users and the twitter ID of a target user, return a list of 0/1 values. 
    The value at a specific index of the returned list indicates whether the user at the same index 
    follows the target user. """
    
    results = []
    for user in users:
        try:
            # Note: a single call returns at most 5000 friend IDs.
            following = api.friends_ids(user)
        except tweepy.TweepError:
            # Most likely the rate limit: wait out the 15-minute window and retry once.
            print("sleep")
            time.sleep(60*15)
            print("wakeup")
            following = api.friends_ids(user)
        results.append(1 if target_user in following else 0)
    return results

In [None]:
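# Obama's numeric user ID, needed because friends_ids returns IDs rather than screen names
obama_id = api.get_user("BarackObama").id
# a list of 0/1 values, indicating whether each candidate is following Obama
is_obama_follower = isFollowing(candidates, obama_id)
print(is_obama_follower.count(0)) # number of non-followers (not necessarily with >50 tweets)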


In [None]:
# get tweets from the non-followers
tweet_saved = []
for i, candidate in enumerate(candidates):
    if is_obama_follower[i] == 0:
        # Only look up the screen name for users whose tweets we actually need.
        screen_name = api.get_user(candidate).screen_name
        tweet_saved.append(get_tweets(screen_name, "non_follower_tweets"))
        
print(sum(1 for saved in tweet_saved if saved)) # should print count of valid non-followers

### 3.2 Processing tweet data

We discovered that the raw tweets from both the follower and non-follower groups contain content that we need to take care of before the tweets can be used for the "bag-of-words" model, including non-English words, punctuation, emoticons, hashtags, etc. More specifically, our preprocessing procedure includes:

1. remove stop words, including some customized stop words
2. remove punctuation
3. remove non-ASCII characters
4. remove tweets that are not written in English
5. remove non-words, such as "ab123"
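
The steps listed above can be sketched roughly as follows. This is an illustrative sketch rather than the project's actual preprocessing code: the tiny `STOP_WORDS` set stands in for the NLTK list plus custom additions, and `looks_english` is a crude hypothetical heuristic used here in place of real language detection.

```python
import re
import string

# Hypothetical stop word list; the project used NLTK's English list plus custom words.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "rt"}


def looks_english(tweet, min_ascii_ratio=0.5):
    """Crude stand-in for language detection: most tokens must be plain ASCII words."""
    tokens = tweet.split()
    if not tokens:
        return False
    ascii_words = [t for t in tokens
                   if re.fullmatch(r"[A-Za-z']+", t.strip(string.punctuation))]
    return len(ascii_words) / len(tokens) >= min_ascii_ratio


def preprocess(tweets):
    """Return a list of cleaned word tokens from a list of raw tweet strings."""
    punctuation_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = []
    for tweet in tweets:
        if not looks_english(tweet):
            continue  # step 4: skip tweets that do not look English
        # Step 3: drop non-ASCII characters (emoji, accented letters, and so on).
        text = tweet.encode("ascii", "ignore").decode("ascii").lower()
        # Step 2: replace punctuation with spaces so "politics," becomes "politics".
        text = text.translate(punctuation_to_space)
        for token in text.split():
            # Steps 1 and 5: keep alphabetic tokens that are not stop words.
            if token.isalpha() and token not in STOP_WORDS:
                words.append(token)
    return words
```

For example, `preprocess(["The president spoke about healthcare, ab123 today!"])` keeps `president`, `spoke`, `about`, `healthcare` and `today`, and drops the stop word and the non-word.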


In [2]:
# code for data pre-processing

### 3.3 Learning and predicting

We merged the preprocessed tweets into two large corpora, the follower tweets and the non-follower tweets. We used a Naive Bayes classifier with maximum likelihood estimation: we calculated the frequency of each word in both corpora, with Laplace smoothing(?). During the testing phase, given the collective tweets of a user, we calculate the probability of following and of not following, and classify the user as whichever is more likely. Our results achieve an average precision of 67%.
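
A classifier of this kind can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not the project's implementation: it assumes equal class priors, add-one (Laplace) smoothing with `alpha=1.0`, and it ignores test words never seen in training. The function names `train`, `log_likelihood` and `classify` are hypothetical.

```python
import math
from collections import Counter


def train(follower_words, non_follower_words):
    """Count word frequencies per class and build the shared vocabulary."""
    counts = {
        "follower": Counter(follower_words),
        "non_follower": Counter(non_follower_words),
    }
    vocab = set(follower_words) | set(non_follower_words)
    return counts, vocab


def log_likelihood(words, class_counts, vocab_size, alpha=1.0):
    """Sum of log P(word | class) with Laplace smoothing."""
    total = sum(class_counts.values())
    return sum(
        math.log((class_counts[w] + alpha) / (total + alpha * vocab_size))
        for w in words
    )


def classify(words, counts, vocab):
    """Return the class label with the higher smoothed log-likelihood."""
    known = [w for w in words if w in vocab]  # assumption: skip unseen words
    scores = {
        label: log_likelihood(known, class_counts, len(vocab))
        for label, class_counts in counts.items()
    }
    return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when a user's collective tweets contain thousands of words.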

In [None]:
# code goes here

### 3.4 Observations and Analysis

We had some interesting observations after further analyzing both the result and the data. We also tried to modify our model in a couple of ways to see whether its performance would improve. 

a) We started with the default nltk stop word set, but when we later inspected the most frequent words from the two corpora, we found that the two lists of top words still overlapped a lot, and the shared words did not seem relevant to whether the user would actually follow Barack Obama. For example, those words include ... Since the idea behind "bag-of-words" text classification is that the corpus under each class label should have rather "characteristic" words, meaning words specific to that label, we decided to add those words to the stop word set.
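
The overlap check described in a) could look roughly like the sketch below. The cutoff `k` and the function names are assumptions made for illustration; the source does not say how many top words were compared.

```python
from collections import Counter


def shared_top_words(follower_words, non_follower_words, k=50):
    """Words that appear among the k most frequent words of both corpora."""
    top_follower = {w for w, _ in Counter(follower_words).most_common(k)}
    top_non_follower = {w for w, _ in Counter(non_follower_words).most_common(k)}
    return top_follower & top_non_follower


def drop_words(words, extra_stop_words):
    """Remove the shared, uninformative words from a token list."""
    return [w for w in words if w not in extra_stop_words]
```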


b) The prediction accuracy did not necessarily increase as we used more training data. In most machine learning settings, more training data gives better test results under the same model, so this behavior is rather uncommon. We suspect the cause is that a user's tweets sometimes do not accurately reflect the friendship status between the user and Obama. For example, if a user does not talk about politics on Twitter, that does not necessarily mean the user does not follow Obama, and vice versa. In those cases, the tweets only confuse the model by adding irrelevant words to the corpus, without actually helping the prediction.



c) The accuracy seems to depend heavily on the training set we chose. When we used different subsets of our training data as the training set while keeping the test set the same, the results varied (by at most 15%). Our explanation is that this again has to do with some of the tweets being unhelpful (or even harmful) for predicting the friendship. When the subset contains tweets with stronger relevance to the classification (say, when it contains a lot of keywords from the field of politics), the model tends to perform better on test data.
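
The experiment in c) amounts to training on several random subsets and scoring each against the same test set. The sketch below shows one way to set that up; it is not the code we ran. `fit` and `predict` are placeholders for whatever model is plugged in, and the `(words, label)` tuple layout is an assumption made for this example.

```python
import random


def accuracy_over_subsets(train_users, test_users, fit, predict,
                          subset_size, trials=5, seed=0):
    """Train on `trials` random subsets and return the accuracy of each run.

    train_users and test_users are lists of (words, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    scores = []
    for _ in range(trials):
        subset = rng.sample(train_users, subset_size)
        model = fit(subset)
        correct = sum(1 for words, label in test_users
                      if predict(model, words) == label)
        scores.append(correct / len(test_users))
    return scores
```

The spread between the smallest and largest value in the returned list gives a rough picture of how sensitive the model is to the choice of training subset.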


We realized that we could have chosen another Twitter celebrity for this project such that the model would generally perform better. If we had used Katy Perry, for instance, there is reason to believe that her followers tweet more about music than Obama's followers tweet about politics.

Some other features we think might be helpful, if we had more time to work on this model, would be:

1. How many of my followers have followed Obama, and
2. How many of my friends (people I'm following) are "similar" to Obama (maybe are also politicians in this case)
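
The first of these features could be computed from follower ID lists alone. The sketch below is only an illustration of the idea; the function name is hypothetical, and normalizing the count to a fraction is our assumption rather than something we evaluated.

```python
def fraction_following(user_follower_ids, target_follower_ids):
    """Fraction of a user's followers who also follow the target account."""
    user_follower_ids = set(user_follower_ids)
    if not user_follower_ids:
        return 0.0  # a user without followers contributes no signal
    overlap = user_follower_ids & set(target_follower_ids)
    return len(overlap) / len(user_follower_ids)
```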




## 5. Collaborative Filtering