## Trending topics

### Topic modeling with a method such as LDA looks like an easy way to identify topics, but in my experience it has three drawbacks for our task of analyzing OCBC on Twitter.

1. Once an unsupervised model is fitted, its classifications are static, so topics that only emerge later ("unseen" topics) cannot be captured.

2. Twitter messages are very short and usually cover a single topic, which conflicts with LDA's assumption that each document is a probability distribution over "multiple" topics.

3. Ultimately, LDA is a statistical model, and it does not capture semantic similarity between words well.

### Instead, I build a *"live"* trending-topic detection pipeline that combines several techniques, such as Doc2Vec and hierarchical clustering. This approach allows us to focus on recent documents, regardless of what the past topic distribution looked like.


### _Please note: I don't have API access to a news feed, so I am using a file sent by a friend of mine. However, the approach is similar._

In [2]:
import pandas as pd
import numpy as np
import re
import datetime
import xlrd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

import gensim

### 0. Load and Clean file

In [3]:
text_df = pd.read_csv('sample_file.csv') 
print(text_df.shape)
print(text_df.columns)

# remove rows whose headline is marked "Test, Please Ignore"
text_df = text_df[text_df.HEADLINE_ALERT_TEXT.str.contains("Test, Please Ignore")==False]
text_df.LANGUAGE.value_counts()

#### Only English
text_df = text_df[text_df['LANGUAGE']=='EN']

## Use Headline when no Take Text is available
text_df.loc[text_df['TAKE_TEXT'].isnull(),'TAKE_TEXT'] = text_df.loc[text_df['TAKE_TEXT'].isnull(),'HEADLINE_ALERT_TEXT']

## fill missing time
text_df.DATE = text_df.DATE.ffill()
text_df.TIME = text_df.TIME.ffill()

## format time

text_df.DATE = pd.to_datetime(text_df.DATE)
text_df['HOUR'] = text_df.TIME.apply(lambda x: x.split(':')[0])

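# drop rows with no text before cleaning, so the string operations never see NaN
text_df = text_df[text_df.TAKE_TEXT.notnull()]

# cleaning text: lowercase, replace non-alphanumeric characters with spaces, drop one-character tokens
text_df.TAKE_TEXT = (text_df.TAKE_TEXT.str.lower().str.strip()
                     .str.replace(r'[^\w\s]', ' ', regex=True)
                     .str.replace(r'[^a-zA-Z0-9 ]', ' ', regex=True)
                     .str.replace(r'\W*\b\w{1,1}\b', '', regex=True))

# remove any leftover markup characters
text_df.TAKE_TEXT = text_df.TAKE_TEXT.apply(lambda x: re.sub(r'[*+_\-<>()|]', '', x))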

# apply gensim processing
text_df['TAKE_TEXT'] = text_df['TAKE_TEXT'].apply(gensim.utils.simple_preprocess)
print(text_df.shape)

train_corpus = text_df.TAKE_TEXT.tolist()
train_corpus = [gensim.models.doc2vec.TaggedDocument(value, [key]) for key, value in enumerate(train_corpus)]
len(train_corpus)

(8213, 19)
Index(['DATE', 'TIME', 'UNIQUE_STORY_INDEX', 'EVENT_TYPE', 'PNAC',
       'STORY_DATE_TIME', 'TAKE_DATE_TIME', 'HEADLINE_ALERT_TEXT',
       'ACCUMULATED_STORY_TEXT', 'TAKE_TEXT', 'PRODUCTS', 'TOPICS',
       'RELATED_RICS', 'NAMED_ITEMS', 'HEADLINE_SUBTYPE', 'STORY_TYPE',
       'TABULAR_FLAG', 'ATTRIBUTION', 'LANGUAGE'],
      dtype='object')
(3137, 20)


3137
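
The chained cleaning calls in the cell above can also be written as one small function, which makes the order of the steps easier to test on a single string. This is only a sketch and not the notebook's exact code: `clean_text` is a hypothetical helper name, and it assumes the intent is to lowercase the text, replace every non-alphanumeric character with a space, and drop one-character tokens.

```python
import re

# Hypothetical helper that approximates the cleaning steps of the cell above.
_NON_ALNUM = re.compile(r'[^a-z0-9 ]')   # anything except lowercase letters, digits, spaces
_SINGLE_CHAR = re.compile(r'\b\w\b')     # tokens that are exactly one character long
_MULTI_SPACE = re.compile(r'\s+')        # runs of whitespace left behind by the steps above

def clean_text(text):
    text = text.lower().strip()
    text = _NON_ALNUM.sub(' ', text)
    text = _SINGLE_CHAR.sub(' ', text)
    return _MULTI_SPACE.sub(' ', text).strip()

print(clean_text("NYMEX-Palladium: warehouse statistics <PA> a 5"))
```

Applied to a column, this would be used as `text_df.TAKE_TEXT.apply(clean_text)`, assuming missing values have been removed first.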

### 1. Train a Doc2Vec Model quickly (can also use a pre-trained model from wiki for transfer-learning)

In [4]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)

# build vocabulary
model.build_vocab(train_corpus)

# train model
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Wall time: 31.1 s


In [None]:
ranks = []
second_ranks = []

for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    second_ranks.append(sims[1])
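
One way to summarise the sanity check above is to count how often a document's re-inferred vector ranks its own trained vector first. The sketch below uses a made-up `example_ranks` list purely for illustration; in the notebook, the input would be the `ranks` list filled by the loop above. `self_match_rate` is a hypothetical helper name.

```python
from collections import Counter

def self_match_rate(ranks):
    """Share of documents whose own training vector is the nearest neighbour (rank 0)."""
    if not ranks:
        return 0.0
    return Counter(ranks)[0] / len(ranks)

example_ranks = [0, 0, 1, 0, 3, 0, 0, 2]   # made-up values, for illustration only
print(Counter(example_ranks).most_common())  # how many documents landed at each rank
print(self_match_rate(example_ranks))
```

A rate close to 1 would suggest that the model places a re-inferred document near its trained vector, which is a basic consistency check rather than a measure of topic quality.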
    

#### A quick test of the model's ability

In [7]:
# Random check that similar documents are indeed identified (re-run the cell with Ctrl+Enter for a few examples)
doc_id = np.random.randint(0, len(train_corpus))  # the upper bound is exclusive, so every document can be drawn

# Compare and print the second-most-similar document
print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Train Document (732): «nymex palladium warehouse statistics pa stock»

Similar Document (738, 0.8323554992675781): «comex silver warehouse statistics si stock»
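
The score printed next to the similar document (0.83 in this run) should be a cosine similarity between document vectors, which is what gensim's `most_similar` reports. As a reminder of what that number measures, here is a minimal pure-Python sketch; `cosine_similarity` is a hypothetical helper, not part of the notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Because the measure depends only on direction, two headlines with a similar word mix can score highly even when their vectors differ in length.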



### 2. Cluster documents for topic identification using Hierarchical Agglomerative Clustering
#### Clustering is run separately within each time window (I chose 4 hours). The largest cluster in each window is treated as its top topic, which is my definition of "trending". This "onlineness" is a key feature of the approach.

In [18]:
def most_common(lst):
    return max(set(lst), key=lst.count)

def find_largest_topic(vector):
    
    cluster = AgglomerativeClustering(n_clusters=100, linkage='ward')  # Many clusters keep each one small; assumes the window holds at least 100 documents
    cluster.fit_predict(vector)
    max_id = most_common(list(cluster.labels_))
    result_id = (cluster.labels_ == max_id)
    
    return result_id 
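
The `most_common` helper above calls `lst.count` once per distinct label, which is fine for a hundred clusters but does repeated passes over the list. A possible alternative, sketched here with the standard library only, uses `collections.Counter`. `largest_cluster_mask` is a hypothetical name, and the label list is made up for illustration.

```python
from collections import Counter

def largest_cluster_mask(labels):
    """Return a list of booleans marking the members of the most populous cluster."""
    labels = list(labels)
    if not labels:
        return []
    top_label, _count = Counter(labels).most_common(1)[0]
    return [label == top_label for label in labels]

labels = [2, 0, 2, 1, 2, 0]   # made-up cluster labels
print(largest_cluster_mask(labels))
```

When two clusters tie for the largest size, `Counter` keeps the label it encountered first, whereas the `max(set(...))` version picks one in set order; either choice is arbitrary.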

In [19]:
text_df['vector'] = [model.infer_vector(x.words) for x in train_corpus]

#### Break data into 4-hour windows

In [21]:
result_df = text_df
result_df.HOUR = result_df.HOUR.apply(float)
n = int(result_df.HOUR.max() / 4)
result_df['Hour_4'] = pd.qcut(result_df.HOUR, n , labels = False)
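
Note that `pd.qcut` splits on quantiles, so the resulting bins hold roughly equal numbers of documents rather than spanning exactly four clock hours each. If fixed clock windows are wanted instead, integer division is one option. The sketch below is an assumption about what "4-hour window" could mean, not the notebook's method, and `window_of` is a hypothetical helper.

```python
def window_of(hour, width=4):
    """Map an hour of the day (0-23) to the index of its fixed-width window."""
    hour = int(hour)
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return hour // width

hours = [0, 3, 4, 11, 12, 23]   # made-up hours for illustration
print([window_of(h) for h in hours])
```

On the DataFrame, the same idea could be written as `(result_df.HOUR // 4).astype(int)`, which gives up to six windows per day; the loop over windows would then iterate over the window indices that actually occur.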

In [22]:
topic_list =[]

for x in range(n):
    vector = result_df.loc[result_df.Hour_4 == x, 'vector'].tolist()
    top_topic_id = find_largest_topic(vector)
    top_docs = result_df.loc[result_df.Hour_4 == x, 'TAKE_TEXT'][top_topic_id]
    topic_list.append(top_docs)

In [23]:
topic_list[0] 
# Not as clean as I would ideally like,
# but similar documents do get clustered together (e.g. market/economy and sports). We may fine-tune it further.

0       [mercator, minerals, announces, senior, manage...
5       [mercator, minerals, announces, senior, manage...
6       [korea, says, may, exports, pct, vs, yr, earli...
11      [korea, says, may, imports, pct, vs, yr, earli...
14               [star, buffet, inc, sells, real, estate]
25                       [top, news, investment, banking]
144     [tennis, vulnerable, nadal, needs, to, find, f...
151     [service, alert, msci, world, and, us, eod, de...
154     [service, alert, reuters, xtra, confirmed, see...
419     [update, rebalance, to, asia, pacific, gaining...
468     [peru, inflation, index, pecpi, eci, rises, pc...
488     [corrected, mitsubishi, to, buy, grain, compan...
561                        [india, call, money, pct, jun]
609     [exxon, mobil, xom, reports, unplanned, flarin...
610     [exxon, reports, unplanned, flaring, breakdown...
613                           [update, baseball, results]
1108                             [top, news, front, page]
1110    [rugby

### 3. Identify the "Names" in the largest topic using TF-IDF
#### Alternatively, we could retrieve the most frequent words that appear in the largest topic but not in the rest
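
The frequency-based alternative mentioned above could look roughly like the following sketch. It is not used in the rest of the notebook; `distinctive_words` is a hypothetical helper, and the token lists are made-up examples in the same shape as `TAKE_TEXT` after preprocessing.

```python
from collections import Counter

def distinctive_words(topic_docs, other_docs, top_n=10):
    """Most frequent tokens in topic_docs that never appear in other_docs."""
    other_vocab = {token for doc in other_docs for token in doc}
    counts = Counter(
        token for doc in topic_docs for token in doc if token not in other_vocab
    )
    return counts.most_common(top_n)

# Made-up documents: the "topic" is the largest cluster, the rest is everything else.
topic_docs = [["exxon", "reports", "unplanned", "flaring"],
              ["exxon", "mobil", "reports", "flaring"]]
other_docs = [["korea", "says", "may", "exports"],
              ["update", "baseball", "results", "reports"]]
print(distinctive_words(topic_docs, other_docs, top_n=3))
```

The strict "never appears elsewhere" rule is a simplifying assumption; a softer variant could compare relative frequencies instead.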

In [26]:
def get_top_words(docs):
    
    tfidf = TfidfVectorizer(tokenizer=word_tokenize, stop_words='english')   

    X_train = list(map(lambda x: ' '.join(word for word in x), docs))

    vect = tfidf.fit(X_train)

    f = np.array(vect.get_feature_names_out())  # on scikit-learn < 1.0, use get_feature_names() instead

    X_train_transformed = vect.transform(X_train)

    sorted_tfidf_index = X_train_transformed.max(0).toarray()[0].argsort()

    sorted_tfidf_value = X_train_transformed.max(0).toarray()[0,sorted_tfidf_index]

    df = pd.DataFrame(data =f[sorted_tfidf_index], columns=['vocab'])

    df['tfidf_value'] = np.array(sorted_tfidf_value)

    df = df.sort_values(['tfidf_value','vocab'], ascending =[False, True])
    
    df = df.iloc[0:10, ]

    #l = pd.DataFrame(data = np.array(df['tfidf_value']), index = np.array(df['vocab']))[0:10]
    
    return df

result = [get_top_words(x) for x in topic_list]
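
To make the ranking in `get_top_words` easier to follow, here is a pure-Python sketch of the quantity it sorts by: the highest TF-IDF weight each term reaches in any document. It assumes scikit-learn's default settings (raw term counts, smoothed IDF, L2 normalisation of each document) and skips tokenisation and stop-word removal, so it approximates the cell above rather than reproducing it. `max_tfidf_per_term` is a hypothetical helper and the documents are made up.

```python
import math
from collections import Counter

def max_tfidf_per_term(docs):
    """docs: list of token lists. Returns {term: highest tf-idf weight across docs}."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        counts = Counter(doc)
        # smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw term count
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for term, w in weights.items():
            best[term] = max(best.get(term, 0.0), w / norm)
    return best

docs = [["top", "news", "front", "page"],
        ["top", "news", "investment", "banking"],
        ["update", "baseball", "results"]]
scores = max_tfidf_per_term(docs)
top_terms = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top_terms)
```

Because each document is normalised to unit length, words from very short headlines tend to receive high weights, which may be one reason generic terms such as "news" and "page" rank near the top of the results.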

In [27]:
result

[            vocab  tfidf_value
 315       vietnam     0.807446
 314          news     0.731672
 313  motorcycling     0.707107
 312          page     0.681657
 311         asian     0.671638
 310         slips     0.669932
 309      baseball     0.643279
 307       banking     0.633383
 308    investment     0.633383
 306          june     0.631230,             vocab  tfidf_value
 149          news     0.754908
 148  motorcycling     0.680062
 147          page     0.655831
 144         asian     0.636477
 143     companies     0.636477
 146      emerging     0.636477
 145       markets     0.636477
 142         alert     0.601936
 141        greece     0.520439
 139         amers     0.512572,            vocab  tfidf_value
 81        sports     0.859619
 80   commodities     0.807292
 79  motorcycling     0.730529
 78          news     0.724205
 77      baseball     0.696383
 76          page     0.689585
 75  agricultural     0.662451
 74         alert     0.571874
 73       service

In [28]:
result = pd.DataFrame.from_dict({k : v['vocab'].tolist() for k, v in enumerate(result) if k !=0})

result

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | news | sports | offer | news | alert | test | commodities | ks | emerging | iss | news | baseball | test |
| 1 | motorcycling | commodities | bangkok | sports | companies | qtr | news | california | ifr | news | ignore | metals | boj |
| 2 | page | motorcycling | ignore | motorcycling | european | mth | baseball | alert | asian | page | test | news | rate |
| 3 | asian | news | ignroe | page | financial | alert | results | ignore | banking | alert | foreign | page | revised |
| 4 | companies | baseball | si | baseball | services | financial | page | test | companies | ifr | china | finnish | mahindra |
| 5 | emerging | page | pls | update | aud | services | agricultural | tokyo | fixed | markets | greater | brazilian | digest |
| 6 | markets | agricultural | testing | alert | news | exchange | alert | ontario | income | baltics | page | soy | press |
| 7 | alert | alert | pct | service | mideast | foreign | update | battle | investment | diary | alert | scorers | certificates |
| 8 | greece | service | alert | results | malaysia | money | nhc | federer | markets | bo | exchange | summaries | dealt |
| 9 | amers | foodgrain | factors | svcs | thailand | news | outlook | reaches | metal | summaries | summaries | kuwait | deposit |
