# Introduction

In this project, the team predicts whether a given user is likely to follow a specific celebrity on Twitter whom the user does not yet follow. The predictions could be used to recommend celebrities to users automatically. We are also interested in how to extend the model to predict the likelihood of a user following another ordinary user.


## General Approach

The large amount of data available through the Twitter API leaves a lot of flexibility in how we select data, process it, and choose a learning model. So far, we are considering the following approaches:

### 1. Collaborative filtering and the alternating least squares method

The team found that a user following another user resembles, in some ways, a user rating a movie in the MovieLens dataset. The difference is that, instead of a numeric rating from 1 to 5, the unidirectional follow/not-follow relationship only takes a binary value. It will be interesting to observe how this property affects the performance of the alternating least squares method. As a possible way to close the gap between the two scenarios, a “fondness” score could be assigned to each follower-target pair, indicating how much of a fan the follower is of the target. For example, if a user has retweeted or liked many of the target's posts, the fondness score would be relatively high.
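As a rough sketch of the “fondness” idea, the snippet below turns interaction counts into a score that could stand in for the binary follow value. The function name `fondness_score`, the choice of retweet and like counts as inputs, and the logarithmic scaling capped at 5 are all assumptions made for illustration; the team has not implemented or evaluated this.

```python
import math

def fondness_score(follows, retweets, likes, max_score=5.0):
    """Hypothetical fondness of a follower for a target, in [0, max_score].

    0 means "not following"; a follower with no interactions scores 1;
    more retweets/likes raise the score logarithmically, capped at max_score.
    """
    if not follows:
        return 0.0
    interactions = retweets + likes
    return min(max_score, 1.0 + math.log1p(interactions))

print(fondness_score(False, 0, 0))    # not a follower
print(fondness_score(True, 0, 0))     # follower with no interactions
print(fondness_score(True, 30, 120))  # heavily engaged follower (capped)
```

The logarithm keeps a handful of extremely active fans from dominating the scale; any other monotone mapping would serve the same purpose.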

### 2. Neural network

Rather than relying on the implicit (latent) features learned by the alternating least squares method, the team also wants to manually select and/or extract features from Twitter data that seem likely to play a part in the prediction.

### 3. Linear Regression and SVM

The team also plans to use natural language processing to analyze each user's tweets and generate a feature vector for each user, using "monkeylearn" for the text analysis. With the feature vectors in hand, the team plans to use linear regression and SVM models for the prediction.



#### The team has so far worked on method 1 only, and this report gives a walk-through of that work.


## Finding top 100 most popular Twitter accounts

We start by creating a developer account on Twitter and authenticating with the API.

In [31]:
import tweepy
from tweepy import OAuthHandler
import pandas as pd
import matplotlib.pyplot as plt
import time
import numpy as np
from twython import Twython


consumer_key = '<CONSUMER_KEY>'
consumer_secret = '<CONSUMER_SECRET>'
access_token = '<ACCESS_TOKEN>'
access_secret = '<ACCESS_SECRET>'
auth = OAuthHandler(consumer_key=consumer_key, consumer_secret=consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth) 


The Twitter API mainly exposes four "objects": “Tweets”, “Users”, “Entities”, and “Places”. For this project, the team queried data from “Users” and “Tweets”. The biggest problem the team faced when querying the API was the limit on the query rate: rate limits are applied per 15-minute window, and the team could call the same API endpoint only 15 times per window. Hence, the team has been using a small data set for this project.

First, the team chose 100 popular Twitter accounts based on Twitter statistics and used BeautifulSoup to scrape a list of these 100 accounts’ screen_names.
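The wait-and-retry pattern used throughout this notebook to cope with the rate limit can be factored into a small helper. The sketch below is library-agnostic: `call_with_rate_limit_retry`, its parameters, and the `FakeRateLimit` exception are hypothetical names introduced for illustration, and the demo injects a fake `sleep` so that it finishes instantly.

```python
import time

def call_with_rate_limit_retry(fn, is_rate_limit_error, window_seconds=15 * 60,
                               max_retries=3, sleep=time.sleep):
    """Call fn(); on a rate-limit error, wait one window and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate-limit error,
            # and give up once the retries are exhausted.
            if not is_rate_limit_error(exc) or attempt == max_retries:
                raise
            sleep(window_seconds)

# Demo with a stand-in for an API call that is rate limited twice.
class FakeRateLimit(Exception):
    pass

calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()
    return "ok"

waits = []  # records the requested sleep durations instead of sleeping
result = call_with_rate_limit_retry(
    flaky, lambda e: isinstance(e, FakeRateLimit), sleep=waits.append)
print(result, waits)
```

In the notebook, `fn` would wrap a tweepy call and the predicate would check for tweepy's rate-limit exception. tweepy's `API` constructor also accepts a `wait_on_rate_limit` flag that may make a hand-written wait unnecessary.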

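In [5]:
# setup library imports
import urllib2
from bs4 import BeautifulSoup

# Twitter Counter page listing the 100 most-followed accounts
web = "http://twittercounter.com/pages/100?vt=1&utm_expid=102679131-111.l9w6V73qSUykZciySuTZuA.1&utm_referrer=https%3A%2F%2Fwww.google.com%2F"

page = urllib2.urlopen(web)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly
# each screen name sits in a <span itemprop="alternateName"> as "@name"
span_list = soup.find_all("span", {"itemprop":"alternateName"})
# drop the leading "@" from every screen name
name_list = [str(each.text)[1:] for each in span_list]
print name_list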


The above code prints out the following list of top 100 Twitter celebrities:

    ['katyperry', 'justinbieber', 'taylorswift13', 'BarackObama', 'rihanna', 'YouTube', 'ladygaga', 'TheEllenShow', 'twitter', 'jtimberlake', 'KimKardashian', 'britneyspears', 'Cristiano', 'selenagomez', 'cnnbrk', 'jimmyfallon', 'ArianaGrande', 'shakira', 'instagram', 'ddlovato', 'JLo', 'Oprah', 'Drake', 'KingJames', 'BillGates', 'nytimes', 'onedirection', 'KevinHart4real', 'MileyCyrus', 'SportsCenter', 'espn', 'CNN', 'Harry_Styles', 'Pink', 'LilTunechi', 'wizkhalifa', 'NiallOfficial', 'Adele', 'BrunoMars', 'BBCBreaking', 'kanyewest', 'neymarjr', 'KAKA', 'ActuallyNPH', 'danieltosh', 'narendramodi', 'aliciakeys', 'NBA', 'LiamPayne', 'Louis_Tomlinson', 'SrBachchan', 'EmmaWatson', 'pitbull', 'khloekardashian', 'iamsrk', 'ConanOBrien', 'kourtneykardash', 'realmadrid', 'Eminem', 'davidguetta', 'NICKIMINAJ', 'NFL', 'AvrilLavigne', 'KendallJenner', 'BeingSalmanKhan', 'zaynmalik', 'NASA', 'aamir_khan', 'FCBarcelona', 'KylieJenner', 'blakeshelton', 'chrisbrown', 'coldplay', 'aplusk', 'TheEconomist', 'vine', 'MariahCarey', 'BBCWorld', 'LeoDiCaprio', 'edsheeran', 'deepikapadukone', 'google', 'xtina', 'MohamadAlarefe', 'agnezmo', 'shugairi', 'ricky_martin', 'TwitterEspanol', 'priyankachopra', 'realDonaldTrump', 'Reuters', 'JimCarrey', 'iHrithik', 'KDTrey5', 'RyanSeacrest', 'ivetesangalo', 'akshaykumar', 'AlejandroSanz', 'SnoopDogg', 'TwitterSports']




### Get Followers From Each Celebrity User

The next step is to iterate over these 100 popular accounts and fetch n followers (“general users”) from each account's follower list (we use n=1 here for experimenting). Together, the selected followers form the group of users for whom we will predict the likelihood of following each of the 100 Twitter celebrities.

In [56]:
def retrieve_n_follower_ids(id, n):
    """ Retrieve n followers of a given Twitter id by taking the first follower on each of the first n result pages"""
    user = api.get_user(id)
    print user.followers_count
    followers = []
    for i, page in enumerate(tweepy.Cursor(api.followers_ids, id=id).pages()):
        if i == n:
            break
        if page:  # skip empty pages
            followers.append(page[0])
    return followers


def generate_user_id_list(id_list, n):
    """ Collect the followers returned by retrieve_n_follower_ids for each celebrity account.
    The result is a list and may contain duplicates when a user follows several celebrities."""
    
    users = []
    for id in id_list:
        try:
            new_users = retrieve_n_follower_ids(id, n)
        except tweepy.TweepError:
            # rate limit hit: wait out the 15-minute window, then retry once
            time.sleep(60*15)
            new_users = retrieve_n_follower_ids(id, n)
        users += new_users
        
    return users

In [None]:
users = generate_user_id_list (name_list, 1)
print users
print len(users)

We get the following outputs:

    [800177281964486656, 800176882859638784, 800177281964486656, 300477987, 800171035093921792, 800176806456238080, 785276975455805441, 356142384, 3079691526, 1067137320, 3628005984, 800177747666440192, 800173059671887872, 800177281964486656, 800177347777306624, 800180452086550529, 800174055315808256, 800180795377758208, 786644749067358208, 800175512131751937, 800122422988943360, 800180659255799810, 800177487686737921, 800178652478476288, 800172515234455552, 800180926483337218, 800179509139881985, 1404269034, 704775480017362944, 800181164510019584, 800184113520967680, 800183470802628608, 719360741606694912, 800184964876627968, 800185366942601216, 203195412, 800185355341152257, 791159574162276353, 800150996882051073, 800184775445073920, 203195412, 800185192577019904, 800184420887990272, 800173538212577280, 800185302459387904, 800187424185851904, 3763069162, 800188092900462593, 720635469743022081, 720635469743022081, 799587973129965568, 800188741872586756, 790739389429059584, 791905771113938944, 744753183809998848, 799622713040113664, 141358236, 745695894666907648, 799842965418110977, 1945758307, 800071767532261376, 800192899069554689, 800191361395736576, 800192632685068288, 800189724925145088, 800192267545759745, 800190776726601728, 792704170092556288, 800187664628449280, 799082231075586062, 800190776726601728, 800191877513232385, 516222248, 112533047, 800192551399485441, 795516083323228160, 330074489, 800195958839443456, 330074489, 800196114972450820, 800196601939566594, 794328224104845312, 3222517458, 800192769293619200, 800195142619541504, 800192769293619200, 800196624723021824, 2842125299, 800196061016911872, 800196538089631744, 800199656256860161, 3131642635, 800197015154008064, 800198019803025408, 800200273796829184, 798796801058816000, 800199539558739968, 800199578326683648, 800200401794404357, 800197419484868608]
    
100

## Generate user-celebrity matrix

Now that both the user and celebrity arrays are ready, the next step is to generate the m by n matrix (m is the total number of general users and n the number of celebrities collected). matrix[i][j] indicates whether user i is currently following celebrity j.
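To make the layout of the matrix concrete, here is a toy version that needs no API access. The helper name `build_follow_matrix` and the made-up IDs are hypothetical, and the friend lists are hard-coded rather than fetched from Twitter.

```python
def build_follow_matrix(user_friends, celebrities):
    """Return an m x n list of lists with 1 where user i follows celebrity j."""
    matrix = []
    for friends in user_friends:
        friend_set = set(friends)  # set membership tests are O(1)
        matrix.append([1 if c in friend_set else 0 for c in celebrities])
    return matrix

celebrities = [11, 22, 33]               # made-up celebrity IDs
user_friends = [[22, 99], [11, 33], []]  # accounts each user follows
matrix = build_follow_matrix(user_friends, celebrities)
for row in matrix:
    print(row)
```

The third user follows none of the celebrities and therefore produces an all-zero row, which is the kind of row that is filtered out later in this report.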

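In [None]:
import numpy as np

def get_friends(user_id):
    # NOTE: get_friends is not shown in the original notebook; this minimal version is an
    # assumption that returns the first page of IDs the user follows via tweepy's friends_ids
    print "user_id =", user_id
    return api.friends_ids(id=user_id)

def generate_user_celebrity_matrix(users, celebrities): # both are lists of Twitter IDs
    matrix = np.zeros((len(users), len(celebrities)))
    for i in range(len(users)):
        try:
            friends = set(get_friends(users[i]))
        except tweepy.TweepError:
            # rate limit hit: wait out the 15-minute window, then retry once
            print "sleep"
            time.sleep(60*15)
            print "wake up"
            friends = set(get_friends(users[i]))
        for j in range(len(celebrities)):
            matrix[i][j] = celebrities[j] in friends
    return matrix
    

In [None]:
# id_list: numeric Twitter IDs of the 100 celebrity accounts (its construction is not shown here)
celebrities = id_list
users = list(users)  # make sure the users can be indexed by position
matrix = generate_user_celebrity_matrix(users, celebrities)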


The team is able to get the following outputs:
```python
user_id = 800177281964486656
user_id = 800176882859638784
user_id = 800177281964486656
user_id = 300477987
user_id = 800171035093921792
user_id = 800176806456238080
user_id = 785276975455805441
user_id = 3079691526
user_id = 1067137320
user_id = 3628005984
user_id = 800177747666440192
user_id = 800173059671887872
user_id = 800177281964486656
user_id = 800177347777306624
user_id = 800180452086550529
user_id = 800174055315808256
sleep
wake up
user_id = 800174055315808256
user_id = 800180795377758208
user_id = 786644749067358208
user_id = 800175512131751937
user_id = 800122422988943360
user_id = 800180659255799810
user_id = 800177487686737921
user_id = 800178652478476288
user_id = 800172515234455552
user_id = 800180926483337218
user_id = 800179509139881985
user_id = 1404269034
user_id = 704775480017362944
user_id = 800181164510019584
user_id = 800184113520967680
user_id = 800183470802628608
sleep
wake up
user_id = 800183470802628608
user_id = 719360741606694912
user_id = 800184964876627968
user_id = 800185366942601216
user_id = 203195412
user_id = 800185355341152257
user_id = 791159574162276353
user_id = 800150996882051073
user_id = 800184775445073920
user_id = 203195412
user_id = 800185192577019904
user_id = 800184420887990272
user_id = 800173538212577280
user_id = 800185302459387904
user_id = 800187424185851904
user_id = 3763069162
sleep
wake up
user_id = 3763069162
user_id = 800188092900462593
user_id = 720635469743022081
user_id = 720635469743022081
user_id = 799587973129965568
user_id = 800188741872586756
user_id = 790739389429059584
user_id = 791905771113938944
user_id = 744753183809998848
user_id = 745695894666907648
user_id = 799842965418110977
user_id = 1945758307
user_id = 800071767532261376
user_id = 800192899069554689
user_id = 800191361395736576
user_id = 800192632685068288
sleep
wake up
user_id = 800192632685068288
user_id = 800189724925145088
user_id = 800192267545759745
user_id = 800190776726601728
user_id = 792704170092556288
user_id = 800187664628449280
user_id = 799082231075586062
user_id = 800190776726601728
user_id = 800191877513232385
user_id = 516222248
user_id = 112533047
user_id = 800192551399485441
user_id = 795516083323228160
user_id = 330074489
user_id = 800195958839443456
sleep
wake up
user_id = 800195958839443456
user_id = 330074489
user_id = 800196114972450820
user_id = 800196601939566594
user_id = 794328224104845312
user_id = 3222517458
user_id = 800192769293619200
user_id = 800195142619541504
user_id = 800192769293619200
user_id = 800196624723021824
user_id = 2842125299
user_id = 800196061016911872
user_id = 800196538089631744
user_id = 800199656256860161
user_id = 800197015154008064
user_id = 800198019803025408
sleep
wake up
user_id = 800198019803025408
user_id = 800200273796829184
user_id = 798796801058816000
user_id = 800199539558739968
user_id = 800199578326683648
user_id = 800200401794404357
user_id = 800197419484868608

```

In [None]:
# drop users (rows) that follow none of the celebrities
new_matrix = matrix[np.any(matrix != 0, axis=1)]
print new_matrix.shape

The team got the following outputs:
```python
(88, 100)
```

### Analysis using collaborative filtering

### - Preprocessing (train/test split)

In [1]:
import math
def process(following, P):
    """ Given a following matrix and a random permutation of its column (celebrity) indices,
        split the data into a training and a testing set, in matrix form. 
        
        Args: 
            following (2D numpy array) : user-by-celebrity following matrix
            P (numpy 1D array) : random permutation of the column indices
            
        Returns: 
            (X_tr, X_te)  : training and testing splits of the following matrix (both 
                                         numpy 2D arrays of the same shape as following) 
    """
    # P indexes columns, so the 90/10 split is taken over len(P)
    train_length = int(math.floor(len(P)*0.9))
    train_P = P[0:train_length]
    test_P = P[train_length:]
    
    # 0/1 masks marking the training and testing columns
    train_mask = np.zeros(following.shape[1])
    test_mask = np.zeros(following.shape[1])
    train_mask[train_P] = 1
    test_mask[test_P] = 1
    
    # broadcasting zeroes out the columns that belong to the other split
    return following*train_mask, following*test_mask

In [None]:
X_tr, X_te = process(new_matrix, np.random.permutation(new_matrix.shape[1]))
print X_tr.shape
print X_te.shape

The team got the following output:  
```python
(88, 100)
(88, 100)
```

### - Calculate U and V 
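For reference, the alternating updates that `train` below carries out can be written as follows. The notation is ours and chosen to mirror the code: $W$ is the 0/1 mask of the nonzero entries of $X$, $u_i$ and $v_j$ are row $i$ of $U$ and row $j$ of $V$, $x_{i\cdot}$ and $x_{\cdot j}$ are row $i$ and column $j$ of $X$ (likewise for $W$), $k$ is the number of features, and $\lambda$ is `lam`.

```latex
u_i = \left( V^\top \operatorname{diag}(w_{i\cdot})\, V + \lambda I_k \right)^{-1} V^\top x_{i\cdot}
\qquad
v_j = \left( U^\top \operatorname{diag}(w_{\cdot j})\, U + \lambda I_k \right)^{-1} U^\top x_{\cdot j}
```

Each update is the closed-form minimizer of $\sum_{i,j} w_{ij}\,(x_{ij} - u_i^\top v_j)^2 + \lambda \sum_i \lVert u_i \rVert^2 + \lambda \sum_j \lVert v_j \rVert^2$ with the other factor held fixed. Since $X$ is zero wherever $W$ is zero, no mask is needed on the right-hand side.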

In [None]:
def error(X, U, V):
    """ Compute the squared error between the nonzero entries of X and their estimated values,
        averaged over all entries of X. 
        Args: 
            X (numpy 2D array) : a following matrix as specified above
            U (numpy 2D array) : a matrix of features for each user
            V (numpy 2D array) : a matrix of features for each celebrity
        Returns: 
            (float) : the squared error on the nonzero entries of X divided by the total number of entries
        """
    dif = np.square(X - U.dot(V.T))
    # keep only the error on nonzero entries; the mean is still taken over every entry
    return np.mean((X != 0) * dif)

def train(X, X_te, k, U, V, niters=51, lam=10, verbose=False):
    """ Train a collaborative filtering model. 
        Args: 
            X (numpy 2D array) : the training following matrix as specified above
            X_te (numpy 2D array) : the testing following matrix as specified above
            k (int) : the number of features used in the CF model
            U (numpy 2D array) : an initial matrix of features for each user
            V (numpy 2D array) : an initial matrix of features for each celebrity
            niters (int) : number of iterations to run
            lam (float) : regularization parameter
            verbose (boolean) : verbosity flag for printing useful messages
            
        Returns:
            (U,V) : A pair of the resulting learned matrix factorization
    """
    # W is 1 where an entry of X is nonzero and 0 elsewhere (np.int is deprecated, use int)
    W = (X != 0).astype(int)
    for ite in range(niters):
        # fix V and solve a regularized least squares problem for each user
        for i, w in enumerate(W): 
            U[i] = np.linalg.solve((V.T * w).dot(V) + lam * np.eye(k), V.T.dot(X[i]))
        # fix U and solve a regularized least squares problem for each celebrity
        for j, wt in enumerate(W.T):
            V[j] = np.linalg.solve((U.T * wt).dot(U) + lam * np.eye(k), U.T.dot(X[:,j]))
        if verbose:
            if ite == 0:
                print "Iter |Train Err |Test Err"
            train_error = error(X, U, V)
            test_error = error(X_te, U, V)
            print ite, "|",train_error,"|",test_error
            
    return U,V




In [None]:
U = np.random.rand(X_tr.shape[0],10)
V = np.random.rand(X_tr.shape[1],10)
U,V = train(X_tr, X_te, 10, U, V, niters=10, lam=3, verbose=True)

The team got the following output: 
```python

Iter |Train Err |Test Err
0 | 0.0215557892728 | 0.0247727272727
1 | 0.0124878550122 | 0.0247727272727
2 | 0.011533655087 | 0.0247727272727
3 | 0.0113421192911 | 0.0247727272727
4 | 0.0112910212752 | 0.0247727272727
5 | 0.011275019604 | 0.0247727272727
6 | 0.0112696053164 | 0.0247727272727
7 | 0.0112677039911 | 0.0247727272727
8 | 0.0112670189888 | 0.0247727272727
9 | 0.0112667645063 | 0.0247727272727
```


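As the results show, the training error decreases as the algorithm iterates and is always smaller than the testing error. The testing error, however, does not change at all. This might imply that our training data is not enough, but a more likely cause is the way the data is split: `process` holds out entire celebrity columns, so those columns are all zeros in the training matrix and the corresponding rows of V are never fit to any observation. The model then makes the same prediction for every held-out entry in every iteration, and the testing error stays constant. Splitting individual entries rather than whole columns between training and testing should give a more informative testing error.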
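The toy example below illustrates one way a testing error can stay flat when whole columns are held out. It is a sketch, not a confirmed diagnosis of the run above: it uses rank-1 alternating least squares in plain Python, and the function name `als_rank1` and the 4 by 3 matrix are made up for illustration. As in `train`, only nonzero entries count as observed.

```python
import random

def als_rank1(X, iters=10, lam=3.0, seed=0):
    """Rank-1 ALS fitted to the nonzero entries of X (a list of lists)."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    u = [rng.random() for _ in range(m)]
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        for i in range(m):  # update each user factor with v fixed
            num = sum(v[j] * X[i][j] for j in range(n) if X[i][j] != 0)
            den = sum(v[j] ** 2 for j in range(n) if X[i][j] != 0) + lam
            u[i] = num / den
        for j in range(n):  # update each celebrity factor with u fixed
            num = sum(u[i] * X[i][j] for i in range(m) if X[i][j] != 0)
            den = sum(u[i] ** 2 for i in range(m) if X[i][j] != 0) + lam
            v[j] = num / den
    return u, v

# 4 users x 3 celebrities; the last celebrity column is held out for testing.
full = [[1, 1, 1],
        [1, 0, 1],
        [0, 1, 0],
        [1, 1, 1]]
X_train = [row[:2] + [0] for row in full]  # held-out column zeroed
X_test = [[0, 0, row[2]] for row in full]  # only the held-out column

u, v = als_rank1(X_train)
pred = [[u[i] * v[j] for j in range(3)] for i in range(4)]
test_err = sum((X_test[i][j] - pred[i][j]) ** 2
               for i in range(4) for j in range(3) if X_test[i][j] != 0) / 12.0
print(v[2], test_err)  # prints: 0.0 0.25
```

The held-out column has no observed training entries, so its factor is driven to exactly zero by the regularizer in every iteration. Every held-out prediction is then 0, and the testing error equals the number of held-out ones divided by the matrix size, regardless of the number of iterations.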