# Twitter User Recommeder

This notebook details the approaches and the conclusions of building a twitter recommender.

<b>This notebook covers:</b>
1. Data Acquisition
2. Data Cleaning & Formatting
3. Training and Optimizing LDA
4. Detailing Next Steps

<b>Our Approach:</b>
1. Our approach was to get twitter data via the Twitter Search API and derive topics mentioned by a user.
2. Used Latent Dirichlet allocation derive the topics and optimized it using coherenc.
3. I found topics for the main user and 2nd degree users and found the ones with the highest overlaps.
4. The users were then ranked according to how many topics they have 2 words in common.

<b> Uses:</b>
1. This can be used to recommend followers to others based on the topics that folks talk about.
2. Via an app one can learn about others that have similar interests.

<b> Next Steps:</b>
1. Try RMF unsupervised learning
2. Optimize it to cut down the time
3. Get access to larger amounts of data
4. Build An App

### Aside from that, I hope you enjoy this notebook!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [10]:
import tweepy 
access_token = "318256106-lLR709QO7OzkUSQ549ji1bKzuCggcaNBbdDJkq9U"
access_token_secret = "cm14qVXM45PCsEutAp8a2haKgTAzwClkS55XusLUXlodO"
consumer_key = "ObXTJ4Qwrt9rnDr1SlbYAF1nu"
consumer_secret = "oU1BOffC7zOoqXsCuiBXw5ul3vHAjyK2XgyeRF156HG1Z0rbTs"

In [11]:
donald_raw = pd.read_csv('Donald-Tweets!.csv')
donald_raw.columns = [i.lower() for i in donald_raw.columns]

In [12]:
donald = donald_raw[donald_raw.type == 'text']
user_tweets = donald.tweet_text

# Getting Data

In [17]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)
#user = api.get_user('kanyewest')
#
#
### Get Elon's Tweets
##stuff = api.user_timeline(screen_name = 'elonmusk', count = 5000, include_rts = False)
##user_tweets = [i.text for i in  stuff]
#    
## Get The People Elon Follows
users_friends = [i for i in api.friends_ids(screen_name = 'realDonaldTrump')]


##Getting 2nd degree followers
second_degree_friends = []
for ids in users_friends[:35]:
    second_degree_friends.extend([i for i in api.friends_ids(id = ids, count = 20)])

   
 final_second_degree_friends = [i for i in second_degree_friends if i not in set(users_friends)]    

In [19]:
len(final_second_degree_friends)

168

In [166]:
second_degree_friends_user_names = []

for i,friend_ids in enumerate(final_second_degree_friends[:50]):
    try:
        second_degree_friends_user_names.append(api.get_user(friend_ids).screen_name )
    except:
        pass

# 2. Clean Data

In [67]:
import spacy
import nltk
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
from gensim.corpora import Dictionary, MmCorpus
from gensim.models.ldamulticore import LdaMulticore
import pyLDAvis
import pyLDAvis.gensim
from collections import Counter
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

nlp = spacy.load('en')

#### Pt1. Remove Second Degree Connections That Are News

In [31]:
exclude = ('womenfortrump', 'fema', 'flotus', 'potus', 'huffpost', 'secondlady', 
           'realdonaldtrump', 'VPPressSec', 'axios', 'vppresssec', 'oann', 'noaa')

In [32]:
final_second_degree_screen_name = [i for i in second_degree_friends_user_names if i.lower() not in exclude]

In [34]:
#Getting 2nd degree connection tweets    
second_degree_tweets = {}
for screen_name in final_second_degree_screen_name[-10:]:
    try:
        stuff = api.user_timeline( screen_name= screen_name, count = 200, include_rts = False)
        second_degree_tweets[screen_name] = [i.text for i in  stuff]    
    except:
        pass

In [70]:
len(users_friends)

47

In [75]:
len(second_degree_friends)

263

In [81]:
len(second_degree_tweets.keys())

10

In [40]:
len(second_degree_tweets.values()[2])

119

#### Pt2. Stem, Lemmatize and Remove Stop Words

In [42]:
def clean_text(text):
    tokenized_tweets = []
    for tweet in user_tweets:
        tokenized_tweet = nlp(tweet.decode('unicode-escape'))

        tweet = "" # we want to keep each tweet seperate

        for token in tokenized_tweet:
            if token.is_space:
                continue
            elif token.is_punct:
                continue
            elif token.is_stop:
                continue
            elif token.is_digit:
                continue
            elif len(token) == 1:
                continue
            elif len(token) == 2:
                continue
            elif '@' in str(token):
                continue

            elif 'http' in str(token):
                continue
            elif 'amp' in str(token):
                continue
            else:
                try:
                    tweet += str(token.lemma_) + " " #creating lemmatized version of tweet
                except:
                    pass

        tokenized_tweets.append(tweet)
    tokenized_tweets = list(map(str.strip, tokenized_tweets)) # strip whitespace
    tokenized_tweets = [x for x in tokenized_tweets if x != ""] # remove empty entries
    
    gensim_tweets = []
    for tweet in tokenized_tweets:
        gensim_tweets.append(tweet.replace('-PRON- ', '').replace('the ', '').split(' '))
    
    return tokenized_tweets, gensim_tweets

## 3. Vectorize

In [43]:
def create_dict (tokenized_tweets):
    gensim_dict = Dictionary(tokenized_tweets)

    #removes really rare words and really frequent words
    gensim_dict.filter_extremes(no_below=5, no_above=0.7)

    #remove spaces between the  removed words
    gensim_dict.compactify() # remove gaps after words that were removed
    
    return gensim_dict

In [44]:
def create_corpus (dictionary, tokenized_text):

    corpus = [dictionary.doc2bow(i) for i in tokenized_text ]
    
    return corpus

## 4. Run Model (Pick # Of Topics)

In [45]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        
        model=LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

## 5. Run LDA

In [51]:
def run_lda(dictionary, corpus, tokenized_text):
    limit=40; start=1; step=3;

    model_list, coherence_values = compute_coherence_values(dictionary=dictionary, 
                                                        corpus=corpus,
                                                        texts=tokenized_text, start=start,
                                                        limit=limit, step=step)
    x = range(start, limit, step)
    
    
    
    for i,value in enumerate(coherence_values):
        if (value >= max(coherence_values) *.95) & (value <= max(coherence_values) ):
            optimal_num = x[i]
            break    

        
    num_topics=optimal_num

    lda = LdaModel(corpus, num_topics=num_topics, id2word=text_dictionary)
    
    return lda


# 6. Get Topics

In [52]:
def get_topics(lda):
    temp_dict = {}

    for o in range(lda.num_topics):
        temp_dict[o] =   [i[0] for i in lda.show_topic(o, topn = 10)]
        
    return temp_dict
    

# 7. Execute

#### a. Get Main User Topics

In [53]:
clean_tweets, tokenized_tweets = clean_text(user_tweets)
text_dictionary = create_dict(tokenized_tweets)
corpus = create_corpus(text_dictionary, tokenized_tweets)
lda = run_lda(text_dictionary, corpus, tokenized_tweets)
trump_topics = get_topics(lda)



In [54]:
print lda.num_topics

16


#### b. Get Secondary Followers

In [65]:
followers_topic_dict = {}

for o,i in enumerate(second_degree_tweets.keys()):

    temp_text = second_degree_tweets[i]
    
    
    temp_clean_tweets, temp_tokenized_tweets = clean_text(temp_text)
    temp_text_dictionary = create_dict(temp_tokenized_tweets)
    temp_corpus = create_corpus(temp_text_dictionary, temp_tokenized_tweets)
    temp_lda = run_lda(temp_text_dictionary, temp_corpus, temp_tokenized_tweets)
    temp_topics = get_topics(temp_lda)
    
    followers_topic_dict[i] = temp_topics
    print o

0
1
2
3
4
5
6
7
8
9


In [68]:
followers_topic_dict.keys()

[u'thebradfordfile',
 u'specialkharvey',
 u'chuckwoolery',
 u'emilychangtv',
 u'PARISDENNARD',
 u'CraigCaplan',
 u'DineshDSouza',
 u'reingowsky',
 u'RealCandaceO',
 u'paulsperry_']

# 8. Cross Check and Recommend

In [157]:
points_dict = {}
for follower_ in followers_topic_dict.keys():
    count = 0 
    points_dict[follower_] = 0
    for topic_ in followers_topic_dict[follower_].values():
        if count < 1:
        
            for trump_topic in trump_topics.values():


                    if len(set(topic_).intersection(trump_topic)) == 2 :
                            points_dict[follower_] +=1
                            count +=1



In [158]:
recommended = []
for i in sorted(points_dict.values(), reverse = True)[:3]:
    for item in points_dict.items():
        if i == item[1]:
            recommended.append([item[0], i])

In [159]:
recommended

[[u'reingowsky', 7], [u'CraigCaplan', 6], [u'RealCandaceO', 5]]

In [165]:

for follower_ in [i[0] for i in recommended]:
    count = 0 
    print '_____'
    print follower_

    for topic_ in followers_topic_dict[follower_].values():
        if count < 1:
        
            for trump_topic in trump_topics.values():


                    if len(set(topic_).intersection(trump_topic)) == 2 :
                            points_dict[follower_] +=1
                            count +=1
                            print trump_topic
                            


_____
reingowsky
[u'trump', u'good', u'need', u'know', u'love', u'donaldtrump', u'right', u'reagan', u'time', u'what']
[u'like', u'politician', u'talk', u'run', u'trump', u'great', u'silent', u'president', u'tell', u'people']
[u'money', u'illegal', u'give', u'win', u'help', u'like', u'need', u'all', u'know', u'trump']
[u'thank', u'trump2016', u'great', u'makeamericagreatagain', u'people', u'amazing', u'trump', u'new', u'go', u'lets']
[u'jeb', u'bush', u'presidential', u'walker', u'candidate', u'clown', u'trump', u'million', u'thank', u'news']
[u'america', u'make', u'great', u'again', u'trump', u'true', u'rubio', u'deal', u'matter', u'beck']
[u'border', u'fail', u'read', u'guy', u'sure', u'story', u'old', u'gun', u'change', u'trump']
_____
CraigCaplan
[u'trump', u'good', u'need', u'know', u'love', u'donaldtrump', u'right', u'reagan', u'time', u'what']
[u'trump', u'donald', u'makeamericagreatagain', u'love', u'thank', u'vote', u'trump2016', u'people', u'get', u'liberal']
[u'thank', u'tru

# Conclusions

1. This Recommender uses LDA 
2. LDA does not seem to be the 
3. Running Time has to be improved since this cannot be done real time

Next Steps:
1. Try RMF 
2. Build An App
3. Get Lager Amounts Of Data