## Construct User Interests from Twitter Activity

Problem Definition:
Twitter is a good source of information about a user's interests: their profile, the accounts they follow, their retweets, favorites, and so on. The dataset contains tweets from 2,000+ users, which can be used to tag each user with an interest.

Assign each user to one of four interest buckets: "Sports", "Business", "Music", and "Other".

DATA:

The tweets were collected at random from Twitter. There are 500,000+ tweets from 2,000+ users, and the number of tweets per user varies.
The data has been preprocessed to remove unwanted text and to keep the remaining words reasonably meaningful.

In [1]:
import pandas as pd
tdata = pd.read_csv('tweetscopy.csv',delimiter=";",header= None)

In [2]:
tdata.head()

Unnamed: 0,0,1,2,3
0,1,854630623629131776,PreetReema,RT @balewadihstreet: Wednesdays mean we’re h...
1,2,854630577156182016,PreetReema,RT @NawabAsia: Happy Hours!! Enjoy live screen...
2,3,852456000531570688,PreetReema,Wishing all of you a very Happy &amp; Prospero...
3,4,852455662051237888,PreetReema,RT @balewadihstreet: #SignatureDish: Order sig...
4,5,850620266694692866,PreetReema,RT @NawabAsia: The #CricketFever is back with ...


Total No.of Tweets: 526851
Total No.of Users: 2057

In [3]:
len(tdata)

526851

In [4]:
tdata.columns = ["SlNo", "ID", "User", "Text"]
tdata.head(2)

Unnamed: 0,SlNo,ID,User,Text
0,1,854630623629131776,PreetReema,RT @balewadihstreet: Wednesdays mean we’re h...
1,2,854630577156182016,PreetReema,RT @NawabAsia: Happy Hours!! Enjoy live screen...


In [5]:
len(tdata['User'].unique())

2057
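
Since the number of tweets per user varies, it can help to look at the spread before modeling. The following is a minimal sketch in plain Python 3; the `users` list is made-up illustrative data standing in for the `User` column, not values from the dataset. On the real DataFrame, `tdata['User'].value_counts()` should give the same per-user counts.

```python
from collections import Counter
from statistics import median

# Hypothetical stand-in for the "User" column (illustrative values only).
users = ["alice", "bob", "alice", "carol", "alice", "bob"]

# Number of tweets written by each user.
tweets_per_user = Counter(users)
counts = sorted(tweets_per_user.values())

# A compact summary of how unevenly tweets are spread across users.
summary = {
    "users": len(counts),
    "tweets": sum(counts),
    "min": counts[0],
    "median": median(counts),
    "max": counts[-1],
}
print(summary)
```

Users with very few tweets yield very short documents, which is worth keeping in mind when reading their assigned topic later.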

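Install nltk and download the required corpora with the commands below (skip any that are already installed):
1. !pip install nltk
2. import nltk
3. nltk.download('stopwords') and nltk.download('wordnet') (or nltk.download() to open the interactive downloader)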

# Data Preprocessing

Steps:
1. Keep only printable text
2. Remove extra whitespace, numbers, hashtag symbols (#), stopwords, punctuation, and tickers
3. Convert text to lowercase
4. Lemmatize each word to its root form
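
The steps above can be sketched as a single function applied to one tweet. This is an illustrative Python 3 sketch, not the notebook's exact pipeline: `clean_tweet` and the tiny `STOPWORDS` set are hypothetical names introduced here, lemmatization is left out so the example has no NLTK dependency, and URLs, mentions, and tickers are removed before punctuation so that their patterns can still match.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration; the notebook uses NLTK's English list.
STOPWORDS = {"the", "is", "a", "at", "of", "and", "to", "with"}

def clean_tweet(text):
    """Apply the cleaning steps to one tweet (lemmatization omitted)."""
    text = "".join(ch for ch in text if ch in string.printable)  # printable characters only
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs while ':' and '/' still exist
    text = re.sub(r"@\w+", " ", text)                   # drop @mentions
    text = re.sub(r"\$\w+", " ", text)                  # drop tickers such as $TSLA
    text = re.sub(r"#(\w+)", r"\1", text)               # keep the hashtag word, drop '#'
    text = re.sub(r"\d", "", text)                      # drop digits
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    # Splitting on whitespace also collapses repeated spaces.
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 2]
    return " ".join(tokens)

print(clean_tweet("RT @NawabAsia: The #CricketFever is back with $TSLA at https://t.co/abc123 !!"))
```

The order of the steps matters: once punctuation has been stripped, a pattern such as `https?://` or `@\w+` has nothing left to match.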

In [6]:
tweet_text = tdata['Text']
tweet_text = tweet_text.tolist()

## Text preprocessing
#--------------------
import string
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import itertools

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)

for i in range(len(tweet_text)):
    tweet_text[i] = ''.join(ch for ch in tweet_text[i] if ch in string.printable) #Keep only printable text (works on Python 2 and 3)
    tweet_text[i] = re.sub(r'\s+', ' ', tweet_text[i]) #Collapse runs of whitespace into one space
    tweet_text[i] = re.sub(r'\d', '', tweet_text[i]) #Removing numbers
    tweet_text[i] = re.sub(r'#([^\s]+)', r'\1', tweet_text[i])  #Replace #word with word
    #tweet_text[i] = " ".join(re.findall('[A-Z][^A-Z]*', tweet_text[i]))   #Split Join words
    tweet_text[i] = tweet_text[i].lower() #Convert to lower case
    tweet_text[i] = " ".join([j for j in tweet_text[i].split() if j not in stop]) #Remove stopwords
    tweet_text[i] = ''.join(ch for ch in tweet_text[i] if ch not in exclude) #Remove punctuation
    # NOTE: punctuation (including '.', ':', '/', '@', '$' and quotes) is already gone at this point,
    # so the URL, @username, quote-trim and ticker steps below match nothing. This is why tokens
    # such as 'httptco...' and bare usernames survive in the cleaned text. To take effect,
    # these steps would have to run before the punctuation removal.
    tweet_text[i] = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet_text[i]) #Convert www.* or https?://* to URL
    tweet_text[i] = re.sub(r'@[^\s]+','AT_USER',tweet_text[i])  #Convert @username to AT_USER
    tweet_text[i] = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet_text[i]))  #Cap repeated characters at two (e.g. 'sooo' -> 'soo')
    tweet_text[i] = ' '.join( [f for f in tweet_text[i].split() if len(f)>2] ) #Drop tokens of one or two characters
    tweet_text[i] = tweet_text[i].strip('\'"')  #trim
    tweet_text[i] = re.sub(r'\$\w*','',tweet_text[i]) # Remove tickers
    #tweet_text[i] = re.sub(r'https?:\/\/.*\/\w*','',tweet_text[i]) # Remove hyperlinks
    tweet_text[i] = " ".join(lemmatizer.lemmatize(word,pos='v') for word in tweet_text[i].split()) # Lemmatize each word as a verb

In [7]:
clean_tweet=pd.DataFrame(tweet_text,columns=['Cleaned_Tweets'])
new_data = pd.concat([tdata['User'], clean_tweet],axis=1)

In [8]:
new_data.head(4)

Unnamed: 0,User,Cleaned_Tweets
0,PreetReema,balewadihstreet wednesdays mean halfway deserv...
1,PreetReema,nawabasia happy hours enjoy live screen match ...
2,PreetReema,wish happy amp prosperous baisakhi nawabasia h...
3,PreetReema,balewadihstreet signaturedish order signature ...


Collect all the tweets of each user into a single document.

In [10]:
data = pd.DataFrame(new_data.groupby('User')['Cleaned_Tweets'].apply(lambda x: "%s" % ' '.join(x)))
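
For readers less familiar with `groupby`, the same aggregation can be sketched in plain Python 3. The `rows` list holds made-up (user, cleaned tweet) pairs, and the names `tweets_by_user` and `documents` are introduced here only for illustration.

```python
from collections import defaultdict

# Hypothetical (user, cleaned_tweet) pairs (illustrative values only).
rows = [
    ("bob", "match tonight"),
    ("alice", "new album"),
    ("bob", "great goal"),
]

# Collect each user's tweets in order of appearance.
tweets_by_user = defaultdict(list)
for user, tweet in rows:
    tweets_by_user[user].append(tweet)

# One space-joined document per user, with users in sorted order
# (pandas groupby also sorts group keys by default).
documents = {user: " ".join(tweets) for user, tweets in sorted(tweets_by_user.items())}
print(documents)
```

Each user ends up with exactly one document, which is the unit the topic model works on.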

In [11]:
data.index.name = 'User_ID'
data.reset_index(inplace=True)

In [12]:
data.head(2)

Unnamed: 0,User_ID,Cleaned_Tweets
0,001_mr,httptcokylcyblq httptcojbocgi httptcoqtdoob
1,014max,andrewastor try something good today dogoodfr...


In [28]:
# Check that the number of per-user documents matches the number of distinct users
len(data)

2057

In [12]:
tweet_feed = data['Cleaned_Tweets']
tweet_feed = tweet_feed.tolist()
tweet_clean = [i.split() for i in tweet_feed] 

# Topic Modeling with Latent Dirichlet Allocation (LDA)

Topic modeling is a family of techniques that can be used to describe and summarize the documents in a corpus according to a set of latent "topics". For this demo, we'll be using Latent Dirichlet Allocation or LDA, a popular approach to topic modeling.


In many conventional NLP applications, documents are represented as a mixture of the individual tokens (words and phrases) they contain. In other words, a document is represented as a vector of token counts. There are two layers in this model, documents and tokens, and the size or dimensionality of the document vectors is the number of tokens in the corpus vocabulary. This approach has a number of disadvantages:

1. Document vectors tend to be large (one dimension for each token $\Rightarrow$ lots of dimensions)
2. They also tend to be very sparse. Any given document only contains a small fraction of all tokens in the vocabulary, so most values in the document's token vector are 0.
3. The dimensions are fully independent of each other: there's no sense of connection between related tokens, such as knife and fork.
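
Points 2 and 3 can be seen on a toy corpus. The sketch below is illustrative only: the three documents are made up, and the dense count vectors are built by hand in plain Python 3 rather than with gensim.

```python
import math

# Hypothetical toy corpus; each document is a list of tokens.
docs = [
    ["knife", "cuts", "bread"],
    ["fork", "lifts", "pasta"],
    ["knife", "cuts", "cheese"],
]

# One dimension per distinct token in the corpus.
vocab = sorted({tok for doc in docs for tok in doc})

def count_vector(doc):
    """Dense token-count vector with one entry per vocabulary token."""
    return [doc.count(tok) for tok in vocab]

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vectors = [count_vector(d) for d in docs]
zero_share = sum(v.count(0) for v in vectors) / (len(vectors) * len(vocab))

print("dimensions:", len(vocab))
print("share of zeros:", round(zero_share, 2))
print("knife doc vs fork doc:", cosine(vectors[0], vectors[1]))
print("knife doc vs knife doc:", round(cosine(vectors[0], vectors[2]), 2))
```

Even with only seven tokens, more than half of the entries are zero, and the "knife" and "fork" documents have a similarity of exactly 0 because they share no token, although a reader would consider them related.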

However, because Twitter limits each tweet to 140 characters, each document here (all of a user's tweets combined) is small on average, so we can use LDA to extract the user's topics of interest.

LDA injects a third layer into this conceptual model. Documents are represented as a mixture of a pre-defined number of topics, and the topics are represented as a mixture of the individual tokens in the vocabulary. The number of topics is a model hyperparameter selected by the practitioner. LDA makes a prior assumption that the (document, topic) and (topic, token) mixtures follow Dirichlet probability distributions. This assumption encourages documents to consist mostly of a handful of topics, and topics to consist mostly of a modest set of the tokens.
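
The three-layer idea can be made concrete by simulating how LDA assumes a document is generated. This is a minimal sketch in plain Python 3 (3.6+ for `random.choices`), not how gensim trains a model: the vocabulary, the two topic-word distributions, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following LDA's generative assumptions."""
    theta = sample_dirichlet(alpha, rng)  # (document, topic) mixture
    words = []
    for _ in range(length):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # choose a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]      # choose a token from that topic
        words.append(w)
    return theta, words

rng = random.Random(0)
vocab = ["goal", "match", "album", "song"]
# Hypothetical (topic, token) mixtures: topic 0 resembles sports, topic 1 resembles music.
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
theta, words = generate_document(topic_word, vocab, alpha=[0.1, 0.1], length=8, rng=rng)
print([round(t, 3) for t in theta], words)
```

A small `alpha` such as 0.1 tends to put most of a document's weight on one topic, which is the "handful of topics" behaviour the Dirichlet prior encourages. Training reverses this process: given only the words, it infers the mixtures.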

<img src="probabilistic-topic-models-3-638.jpg">

LDA is fully unsupervised. The topics are "discovered" automatically from the data by trying to maximize the likelihood of observing the documents in your corpus, given the modeling assumptions. They are expected to capture some latent structure and organization within the documents, and often have a meaningful human interpretation for people familiar with the subject material.

In [13]:
# Importing Gensim
import gensim
from gensim import corpora


Importing pyLDAvis for visualisation

In [14]:
import pyLDAvis
import pyLDAvis.gensim
import warnings
import pickle  # cPickle exists only on Python 2; pickle works on both

The first step to creating an LDA model is to learn the full vocabulary of the corpus to be modeled. We'll use gensim's Dictionary class for this.

Like many NLP techniques, LDA uses a simplifying assumption known as the bag-of-words model. In the bag-of-words model, a document is represented by the counts of distinct terms that occur within it. Additional information, such as word order, is discarded.
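
To make the bag-of-words representation concrete, here is a minimal sketch in plain Python 3 of roughly what a term dictionary and a bag-of-words conversion produce. It is an assumption-laden imitation, not gensim's implementation: `token2id` and `to_bow` are names introduced here, and the ids gensim assigns may differ.

```python
from collections import Counter

# Hypothetical tokenized documents (illustrative values only).
docs = [
    ["happy", "hours", "enjoy", "live", "match"],
    ["match", "live", "live", "score"],
]

# Give each distinct token an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

print(token2id)
print(to_bow(docs[1]))
# Reordering the words gives the same representation: word order is discarded.
print(to_bow(["live", "match", "score", "live"]) == to_bow(docs[1]))
```

Only tokens that actually occur in a document appear in its list of pairs, so this form stays compact even when the vocabulary is large.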

In [15]:
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(tweet_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in tweet_clean]

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

In [16]:
import os
intermediate_directory = os.path.join('..')
lda_model_filepath = os.path.join(intermediate_directory, 'lda_model_all')

In [19]:
# this is a bit time consuming - make the if statement True
# if you want to train the LDA model yourself.
if 0 == 1:

    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # Running and Training LDA model on the document term matrix.
        ldamodel = Lda(doc_term_matrix, num_topics=50, id2word = dictionary, passes=50)
    
    ldamodel.save(lda_model_filepath)
    
# load the finished LDA model from disk
ldamodel = Lda.load(lda_model_filepath)

In [20]:
def explore_topic(topic_number, topn=25):
    """
    accept a user-supplied topic number and
    print out a formatted list of the top terms
    """

    print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')

    # show_topic returns (term, probability) pairs; pass topn through instead of hard-coding 25
    for term, frequency in ldamodel.show_topic(topic_number, topn=topn):
        print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))

Our topic model is now trained and ready to use! Since each topic is represented as a mixture of tokens, you can manually inspect which tokens have been grouped together into which topics to try to understand the patterns the model has discovered in the data.

In [21]:
explore_topic(topic_number=32)

term                 frequency

mastercard           0.006
mybigcoin            0.006
anammirza            0.006
dogoodfriday         0.004
coin                 0.003
cnbc                 0.003
deal                 0.003
big                  0.003
today                0.003
amp                  0.003
currency             0.003
finalize             0.003
crypto               0.003
httptcooqynhorwq     0.003
cryptocurr           0.003
pay                  0.003
realdonaldtrump      0.003
iampayalghosh        0.003
marcorubio           0.003
dosomething          0.002
beauthentic          0.002
buy                  0.002
almaty               0.002
payitforward         0.002
httptcoimjcylkqmb    0.002


In [22]:
topic_names = {0:'Religion',
               1:'Music',
               2:'Music',
               3:'Business',
               4:'Social Media',
               5:'Other',
               6:'Sports',
               7:'Other',
               8:'Other',
               9:'Politics',
              10:'Zombies',
              11:'Pawan Kalyan',
              12:'Feelings',
              13:'Technology',
              14:'eCommerce',
              15:'Cricket',
              16:'People',
              17:'Moblie Phones',
              18:'Online Games',
              19:'Facebook',
              20:'Competition',
              21:'Trump',
              22:'Other',
              23:'fashion',
              24:'Operating System',
              25:'Other',
              26:'Career',
              27:'Startup',
              28:'Brands',
              29:'Money',
              30:'Travel',
              31:'Other',
              32:'Cryptocurrency',
              33:'Publicity',
              34:'Youtube',
              35:'Children',
              36:'Mobile',
              37:'Other',
              38:'Politics',
              39:'Social Media',
              40:'Emotions',
              41:'Europe',
              42:'Poetry',
              43:'Fashion',
              44:'Music Event',
              45:'Casino',
              46:'sports',
              47:'Adventures',
              48:'Fashion',
              49:'News'}

In [23]:
topic_names_filepath = os.path.join(intermediate_directory, 'topic_names.pkl')

with open(topic_names_filepath, 'wb') as f:  # pickle needs binary mode on Python 3
    pickle.dump(topic_names, f)
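
As a quick check that the saved labels can be read back, a pickle round trip can be sketched as follows. This is illustrative Python 3: the `labels` dictionary is a made-up subset of the topic names, and a temporary directory is used so that nothing in the project folder is touched.

```python
import os
import pickle
import tempfile

# Hypothetical subset of the topic labels (illustrative values only).
labels = {1: "Music", 3: "Business", 6: "Sports"}

path = os.path.join(tempfile.mkdtemp(), "topic_names.pkl")

# Pickle is a binary format, so the file is opened in binary mode for writing...
with open(path, "wb") as f:
    pickle.dump(labels, f)

# ...and again in binary mode for reading.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == labels)
```

On Python 3, opening the file in text mode raises a `TypeError` when dumping, because pickle writes bytes.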

In [84]:
LDAvis_data_filepath = os.path.join(intermediate_directory, 'ldavis_prepared')

pyLDAvis includes a one-line function to take topic models created with gensim and prepare their data for visualization.
Check: https://github.com/bmabey/pyLDAvis 

Preparing the visualization data is slow for a model of this size, so the plots take a while to appear.

In [85]:
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    LDAvis_prepared = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix,dictionary)

    with open(LDAvis_data_filepath, 'wb') as f:  # binary mode for pickle
        pickle.dump(LDAvis_prepared, f)
        
# load the pre-prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)

# Save visualisations in a 'html'
pyLDAvis.save_html(LDAvis_prepared, 'lda.html')

# Visualising the topics
pyLDAvis.display(LDAvis_prepared)


In [59]:
def lda_description(tweet, min_topic_freq=0.05):
    
    # create a bag-of-words representation
    tweet_bow = dictionary.doc2bow(tweet)
    
    # create an LDA representation
    tweet_lda = ldamodel[tweet_bow]
    
    # sort with the most highly related topics first
    # (tuple-unpacking lambdas are Python 2 only, so index into the pair instead)
    tweet_lda = sorted(tweet_lda, key=lambda pair: -pair[1])
    
    #return tweet_lda[0]
    topic_list=[]
    for topic_number, freq in tweet_lda:
        if freq < min_topic_freq:
            break
            
        # print the most highly related topic names and frequencies
        topic_list.append(topic_names[topic_number]) 
        #print '{:25} {}'.format(topic_names[topic_number], round(freq, 2))
        
    if len(topic_list) >= 1:
        return topic_list[0]
    else:
        return 'Other'

In [64]:
lda_description(tweet_clean[75])

'Career'
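
The selection logic can be tried in isolation, without a trained model. The sketch below is a simplified Python 3 variant under stated assumptions: `dominant_topic` is a hypothetical helper, and the (topic id, probability) pairs are made up to mimic the shape of the model's output for one document.

```python
def dominant_topic(topic_probs, names, min_prob=0.05, default="Other"):
    """Return the label of the most probable topic, or `default` if none reaches `min_prob`."""
    if not topic_probs:
        return default
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    if prob < min_prob:
        return default
    return names.get(topic_id, default)

# Hypothetical labels and topic probabilities (illustrative values only).
names = {6: "Sports", 26: "Career", 32: "Cryptocurrency"}

print(dominant_topic([(6, 0.12), (26, 0.61), (32, 0.08)], names))  # highest probability wins
print(dominant_topic([(6, 0.03), (32, 0.02)], names))              # nothing reaches the threshold
print(dominant_topic([], names))                                   # no topics at all
```

Keeping only the single strongest topic is a simplification: a user whose tweets split evenly between two topics is still given just one label.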

## Finding topic of interest for each user

In [65]:
user_interest = [lda_description(tweets) for tweets in tweet_clean]
user_interest = pd.DataFrame(user_interest, columns=['Interest'])

In [67]:
users = new_data['User'].unique()

In [68]:
import numpy as np
users = np.sort(users, axis=None)
users
users=pd.DataFrame(users,columns=['User'])

array(['001_mr', '014max', '0192474822j', ..., 'zoezackbella',
       'zombieman5000', 'zsiloamcrutcher'], dtype=object)

Save the user-to-interest mapping.

In [89]:
user_df = pd.concat([users, user_interest],axis=1)

In [92]:
user_df.head()

Unnamed: 0,User,Interest
0,001_mr,Career
1,014max,Cryptocurrency
2,0192474822j,Feelings
3,05mer05,Zombies
4,09Dimon13,People


In [90]:
np.savetxt("users_interest.csv", user_df, delimiter=",", fmt='%s')

In [73]:
user_interest['Interest'][81]  # user_interest is a DataFrame at this point

'Competition'

In [74]:
users['User'][81]  # users is a DataFrame at this point

'AZTIECA'

The task is to assign each user to one of the interest buckets "Sports", "Business", "Music", and "Other", so each fine-grained topic label is now mapped to one of these four buckets.

In [93]:
# Categorizing based on specific topics (Sports, Music, Business and Others) only:

topics_others = ['Religion','Other', 'Politics', 'Zombies', 'Pawan Kalyan', 'Feelings', 'People', 
                 'Trump', 'fashion', 'Career', 'Travel', 'Children', 
                 'Emotions', 'Europe', 'Poetry', 'Fashion', 'News']

topics_Sports = ['Sports', 'Cricket', 'Online Games', 'Competition', 'Casino', 'Adventures']

topics_Music = ['Music', 'Music Event']

topics_Business = ['Business',  'Technology', 'eCommerce', 'Moblie Phones', 'Facebook', 'Operating System', 
                    'Startup', 'Brands', 'Money', 'Cryptocurrency', 'Publicity', 'Youtube',
                    'Mobile', 'Social Media']

In [120]:
# Compare labels exactly and case-insensitively, so that 'sports' and 'Sports'
# land in the same bucket and no label matches another one by substring.
sports = set(t.lower() for t in topics_Sports)
music = set(t.lower() for t in topics_Music)
business = set(t.lower() for t in topics_Business)

user_categories = []
for interest in user_interest['Interest']:
    label = interest.lower()
    if label in sports:
        user_categories.append('Sports')
    elif label in music:
        user_categories.append('Music')
    elif label in business:
        user_categories.append('Business')
    else:
        # everything in topics_others, plus any label that is not mapped
        user_categories.append('Other')

In [123]:
user_categories =pd.DataFrame(user_categories,columns=['Interest'])
user_df1 = pd.concat([users, user_categories],axis=1)
user_df1.columns = ["User", "Interest"]
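
Before saving, it is worth checking how users are distributed over the four buckets. A minimal sketch in plain Python 3 follows; the `buckets` list is made-up illustrative data, not the notebook's result. On the real DataFrame, `user_df1['Interest'].value_counts()` should give the same counts.

```python
from collections import Counter

# Hypothetical bucket assignments (illustrative values only).
buckets = ["Other", "Business", "Sports", "Other", "Music", "Business", "Other"]

counts = Counter(buckets)
total = len(buckets)

# Print each bucket's count and share, in a fixed order.
for bucket in ["Sports", "Business", "Music", "Other"]:
    share = counts[bucket] / total
    print("{:10} {:3d}  {:.1%}".format(bucket, counts[bucket], share))
```

A very large "Other" share would suggest that the topic labels or the topic-to-bucket mapping deserve a second look.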

Save final document. 

In [128]:
# to_csv writes a clean "User,Interest" header row; np.savetxt would prefix the header with "# "
user_df1.to_csv("users_categorized.csv", index=False)