## Trending topics

#### Topic modeling with LDA looks like an easy way to identify topics, but in my past experience it has three drawbacks for short, high-frequency content.

1. Once an unsupervised model is fitted, its classifications are static, so "unseen" topics that appear in the future cannot be captured.

2. A short message usually covers a single, homogeneous topic, which conflicts with LDA's assumption that each document is a probability distribution over "multiple" topics.

3. Ultimately, LDA is a statistical model that cannot quite capture semantic similarity between words.

#### Instead, I build a *"live"* trending-topic detection pipeline here that combines several techniques, such as Doc2Vec and hierarchical clustering. This approach allows us to focus on recent documents, regardless of what the past topic distribution looked like.


#### _Please note: I don't have access to a news-feed API, so I use a file sent by a friend of mine instead. The approach with a live feed would be similar._

In [1]:
import pandas as pd
import numpy as np
import re
import datetime
import xlrd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

import gensim

### 0. Load and Clean file

In [2]:
text_df = pd.read_csv('sample_file.csv') 
print(text_df.shape)
print(text_df.columns)

# remove rows flagged as "Test, Please Ignore"
text_df = text_df[text_df.HEADLINE_ALERT_TEXT.str.contains("Test, Please Ignore")==False]
text_df.LANGUAGE.value_counts()

#### Only English
text_df = text_df[text_df['LANGUAGE']=='EN']

## Use Headline when no Take Text is available
text_df.loc[text_df['TAKE_TEXT'].isnull(),'TAKE_TEXT'] = text_df.loc[text_df['TAKE_TEXT'].isnull(),'HEADLINE_ALERT_TEXT']

## fill missing date and time from the previous row
text_df.DATE = text_df.DATE.ffill()
text_df.TIME = text_df.TIME.ffill()

## format time

text_df.DATE = pd.to_datetime(text_df.DATE)
text_df['HOUR'] = text_df.TIME.apply(lambda x: x.split(':')[0])

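# cleaning text: drop rows without text first, so the string operations below never see NaN
text_df = text_df[text_df.TAKE_TEXT.notnull()]

# lowercase, replace punctuation and non-alphanumerics with spaces, drop single-character tokens
text_df.TAKE_TEXT = (text_df.TAKE_TEXT.str.lower().str.strip()
                     .str.replace(r'[^\w\s]', ' ', regex=True)
                     .str.replace(r'[^a-zA-Z0-9 ]', ' ', regex=True)
                     .str.replace(r'\W*\b\w{1,1}\b', '', regex=True))

# strip any leftover markup characters
text_df.TAKE_TEXT = text_df.TAKE_TEXT.apply(lambda x: re.sub(r'[*|+_\-<>()]', '', x))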

# apply gensim processing
text_df['TAKE_TEXT'] = text_df['TAKE_TEXT'].apply(gensim.utils.simple_preprocess)
print(text_df.shape)

train_corpus = text_df.TAKE_TEXT.tolist()
# tag each tokenized document with its position in the corpus
train_corpus = [gensim.models.doc2vec.TaggedDocument(words, [doc_id]) for doc_id, words in enumerate(train_corpus)]
len(train_corpus)

(8213, 19)
Index(['DATE', 'TIME', 'UNIQUE_STORY_INDEX', 'EVENT_TYPE', 'PNAC',
       'STORY_DATE_TIME', 'TAKE_DATE_TIME', 'HEADLINE_ALERT_TEXT',
       'ACCUMULATED_STORY_TEXT', 'TAKE_TEXT', 'PRODUCTS', 'TOPICS',
       'RELATED_RICS', 'NAMED_ITEMS', 'HEADLINE_SUBTYPE', 'STORY_TYPE',
       'TABULAR_FLAG', 'ATTRIBUTION', 'LANGUAGE'],
      dtype='object')
(3137, 20)


3137
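
To make the cleaning steps above concrete, here is a minimal sketch on a toy string, using only the standard `re` module. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the original notebook. `gensim.utils.simple_preprocess` also applies its own token-length limits, which this sketch does not reproduce.

```python
import re

def clean_text(text):
    """Lowercase, replace punctuation with spaces, and drop one-character tokens."""
    text = text.lower().strip()
    # anything that is not a letter, digit or whitespace becomes a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # split on whitespace and discard single-character tokens
    return [tok for tok in text.split() if len(tok) > 1]

# made-up sample headline
tokens = clean_text("Altimo (VIP.N) won't launch a *new* offer -- Reuters")
print(tokens)
```

Single-character leftovers such as the `t` from `won't` are dropped, which mirrors the intent of the `\w{1,1}` pattern in the cell above.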

### 1. Train a Doc2Vec Model quickly (can also use a pre-trained model from wiki for transfer-learning)

In [3]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)

# build vocabulary
model.build_vocab(train_corpus)

# train model
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Wall time: 31.2 s


In [4]:
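# Sanity check: infer a vector for every training document and record where the
# document itself ranks among its nearest neighbours (ideally rank 0).
ranks = []
second_ranks = []

for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # note: newer gensim releases (4.x) name this attribute `model.dv`
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    # keep the second-most-similar document as an (id, similarity) tuple
    second_ranks.append(sims[1])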
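
One way to read `ranks` is to count how often each rank occurs: a well-fitted model should place most documents at rank 0, meaning each document is most similar to itself. The sketch below uses a made-up `toy_ranks` list as a stand-in for the real `ranks`.

```python
from collections import Counter

# hypothetical ranks for ten documents (0 = the document is its own nearest neighbour)
toy_ranks = [0, 0, 0, 1, 0, 0, 2, 0, 0, 1]

rank_counts = Counter(toy_ranks)
self_match_rate = rank_counts[0] / len(toy_ranks)

print(rank_counts.most_common())
print(f"share of documents ranked most similar to themselves: {self_match_rate:.0%}")
```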
    

#### Some test of the model ability

In [5]:
# Random check that similar documents are indeed identified (rerun with Ctrl+Enter for a few examples)
doc_id = np.random.randint(0, len(train_corpus))  # the upper bound is exclusive

# Compare and print the second-most-similar document
print('Selected Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Selected Document (1277): «moscow cairo june reuters russian firm altimo said on sunday it had no plans to launch new offer to buy out minority shareholders in egypt orascom telecom orte ca ot after the egyptian regulator rejected its attempt to reopen its earlier tender orascom telecom heavyweight on the egyptian stock exchange is already percent owned by russia vimpelcom vip which in turn is percent owned by altimo altimo launched an offer in april to buy out minority shareholders of orascom which has mobile businesses in canada algeria pakistan and other emerging markets such as bangladesh in deal worth an estimated billion but altimo failed last week to secure sufficient acceptances from the minority shareholders in orascom by monday deadline after shareholders with only percent of orascom telecom shares listed on the egyptian stock exchange offered to sell shares below the minimum percent required for the buy out to go ahead at the moment there are no plans to launch new offer alt
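
`most_similar` ranks documents by the cosine similarity between their vectors. As a rough illustration of that ranking, the sketch below does it by hand with NumPy on made-up 3-dimensional vectors (`doc_vectors` and the query are illustrative, not taken from the trained model).

```python
import numpy as np

def cosine_rank(query, doc_vectors):
    """Return document indices sorted from most to least similar, plus the similarities."""
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = doc_vectors @ query / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)  # negate so the largest similarity comes first
    return order, sims

doc_vectors = [
    [1.0, 0.0, 0.0],   # doc 0
    [0.9, 0.1, 0.0],   # doc 1: close to doc 0
    [0.0, 1.0, 0.0],   # doc 2: orthogonal to doc 0
]
order, sims = cosine_rank([1.0, 0.0, 0.0], doc_vectors)
print(order)
```

Querying with a document's own vector puts that document first, which is why the notebook prints the second entry as the "similar" document.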

### 2. Cluster documents for topic identification using Hierarchical Agglomerative Clustering
#### Clustering is run separately in each time window (I chose 4 hours), so we can identify the top topic in each window (my definition of "trending"). This "onlineness" is a key feature of the approach.

In [6]:
def most_common(ls):
    # return the most frequent element of a list
    return max(set(ls), key=ls.count)

def find_largest_topic(vector):
    # cluster the document vectors and return a boolean mask selecting the largest cluster
    cluster = AgglomerativeClustering(n_clusters=140, linkage='ward')  # Here we want many clusters to get smaller ones
    labels = cluster.fit_predict(vector)
    max_id = most_common(list(labels))
    result_id = (labels == max_id)

    return result_id

text_df['vector'] = [model.infer_vector(x.words) for x in train_corpus]
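
To show what the largest-cluster selection does, here is a minimal sketch on synthetic 2-D points instead of document vectors. The function name `largest_cluster_mask`, the toy blobs and the choice of `n_clusters=3` are assumptions for illustration. Note that a window with fewer documents than `n_clusters` cannot be split into that many clusters.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import AgglomerativeClustering

def largest_cluster_mask(vectors, n_clusters):
    """Cluster the vectors with Ward linkage and return a boolean mask for the biggest cluster."""
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward').fit_predict(vectors)
    biggest_label, _ = Counter(labels).most_common(1)[0]
    return labels == biggest_label

rng = np.random.default_rng(0)
# three well-separated toy "topics" with 10, 5 and 3 documents
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=-5.0, scale=0.1, size=(3, 2)),
])
mask = largest_cluster_mask(vectors, n_clusters=3)
print(mask.sum())
```

The returned mask can index a DataFrame slice directly, which is how the per-window loop uses it.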

#### Break data into 4-hour windows

In [7]:
result_df = text_df
result_df.HOUR = result_df.HOUR.apply(float)
n = int(result_df.HOUR.max() / 4)
result_df['Hour_4'] = pd.qcut(result_df.HOUR, n , labels = False)
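
`pd.qcut` splits on quantiles, so each bin holds roughly the same number of documents rather than spanning the same number of hours. If fixed 4-hour windows are wanted, integer division of the hour is one alternative. A minimal sketch on made-up hours, assuming `HOUR` runs from 0 to 23 (the column name `window_4h` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'HOUR': [0, 3, 4, 7, 12, 23]})

# fixed-width windows: hours 0-3 -> 0, 4-7 -> 1, ..., 20-23 -> 5
toy['window_4h'] = (toy['HOUR'] // 4).astype(int)

print(toy)
```

Quantile bins keep the cluster sizes comparable across windows, while fixed-width bins keep the time spans comparable; which one suits "trending" better depends on how evenly the feed is spread over the day.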

In [8]:
topic_list =[]

for x in range(n):
    vector = result_df.loc[result_df.Hour_4 == x, 'vector'].tolist()
    top_topic_id = find_largest_topic(vector)
    top_docs = result_df.loc[result_df.Hour_4 == x, 'TAKE_TEXT'][top_topic_id]
    topic_list.append(top_docs)
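
In a live setting, only the most recent window needs to be clustered each time the pipeline runs. The sketch below selects documents from the last four hours relative to the newest timestamp. The column name `timestamp`, the helper `latest_window` and the toy rows are assumptions for illustration, since the notebook itself works from separate `DATE` and `TIME` columns.

```python
import pandas as pd

def latest_window(df, time_col='timestamp', hours=4):
    """Return the rows whose timestamp falls within `hours` of the newest row."""
    cutoff = df[time_col].max() - pd.Timedelta(hours=hours)
    return df[df[time_col] > cutoff]

# made-up timestamps and texts
toy = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2013-06-02 01:00', '2013-06-02 05:30',
        '2013-06-02 08:15', '2013-06-02 09:00',
    ]),
    'text': ['old story', 'borderline story', 'recent story', 'newest story'],
})
recent = latest_window(toy)
print(recent['text'].tolist())
```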

In [9]:
# Not as clean as I would ideally like,
# but similar documents do get clustered together (e.g. market/economy and sports).
# We may fine-tune it further.
topic_list[0] 

25                       [top, news, investment, banking]
153     [china, may, official, pmi, at, reuters, poll,...
485     [corrected, brief, mitsubishi, to, buy, los, g...
488     [corrected, mitsubishi, to, buy, grain, compan...
1114    [tennis, french, open, men, singles, round, re...
1236    [service, alert, burgundy, exchange, informati...
2098    [leading, sunni, muslim, cleric, calls, for, j...
2670    [feature, bombs, boredom, threaten, pakistan, ...
2907                   [inside, middle, east, debt, june]
3107    [rpt, mideast, debt, issuers, warm, to, cross,...
3120    [soccer, world, cup, asia, qualifying, fixture...
3423    [rpt, cyber, hacking, to, overshadow, summit, ...
3431    [tennis, french, open, men, singles, round, re...
3578    [rallying, latvala, wins, acropolis, rally, fo...
4505    [bis, lays, out, simple, plan, for, how, to, h...
4948              [column, the, economics, of, austerity]
4962    [golf, kuchar, wins, memorial, title, by, two,...
5270    [briti

### 3. Identify the "Names" in the largest topic using TF-IDF
#### Alternatively, we can retrieve the words that are most frequent in the largest topic but do not appear in the rest
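
The frequency-based alternative mentioned above can be sketched with plain Python counters: count the words in the largest topic and keep those that never occur in the remaining documents. The function name `distinctive_words` and the toy token lists are illustrative assumptions.

```python
from collections import Counter

def distinctive_words(topic_docs, other_docs, top_n=3):
    """Most frequent words in topic_docs that do not appear in other_docs."""
    topic_counts = Counter(word for doc in topic_docs for word in doc)
    other_vocab = {word for doc in other_docs for word in doc}
    candidates = Counter({w: c for w, c in topic_counts.items() if w not in other_vocab})
    return [word for word, _ in candidates.most_common(top_n)]

# made-up tokenized documents
topic_docs = [['tennis', 'french', 'open', 'round'],
              ['tennis', 'open', 'results'],
              ['tennis', 'french', 'open']]
other_docs = [['market', 'open', 'shares'],
              ['bank', 'results', 'shares']]
words = distinctive_words(topic_docs, other_docs)
print(words)
```

This strict "not in the rest" filter is simpler than TF-IDF but harsher: a word like `open` is discarded as soon as it appears once elsewhere.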

In [15]:
# train a tfidf on entire corpus
tfidf = TfidfVectorizer(tokenizer=word_tokenize, stop_words='english', min_df= 18, max_df=0.035)   
X_train = [' '.join(x) for x in result_df.TAKE_TEXT]
vect = tfidf.fit(X_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature = vect.get_feature_names_out().astype(str)
print(len(feature))

feature

1902


array(['ability', 'able', 'abroad', ..., 'zb', 'zealand', 'zone'],
      dtype='<U15')
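
In the vectorizer above, `min_df=18` is an absolute document count, while `max_df=0.035` is a proportion of documents, so only terms that are neither too rare nor too common survive. A minimal sketch of the same two filters on a made-up four-document corpus (the thresholds here are chosen for the toy data, not taken from the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'market shares rise',
    'market shares fall',
    'market tennis open',
    'market tennis final',
]
# min_df=2 (an int) keeps terms found in at least 2 documents;
# max_df=0.75 (a float) drops terms found in more than 75% of documents
vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
vectorizer.fit(docs)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

Here `market` is dropped for appearing in every document, and the one-off words are dropped for being too rare.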

In [16]:
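# example: top 20 TF-IDF terms for the largest topic of the first window
test = [x for y in topic_list[0] for x in y]
test = ' '.join(test)
x = tfidf.transform([test])
y = x.toarray()[0].argsort()[-20:]  # indices of the 20 highest scores, in ascending order

feature[y]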

array(['austerity', 'grain', 'slowdown', 'boredom', 'nishikori',
       'economics', 'sunni', 'includes', 'pmis', 'buy', 'buzz', 'men',
       'singles', 'los', 'wins', 'summit', 'rpt', 'manufacturing',
       'brief', 'vs'], dtype='<U15')
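
Note that `argsort` sorts in ascending order, so in the array above the highest-scoring term comes last. A small sketch with made-up scores shows how to list terms from strongest to weakest instead (`scores` and `terms` are toy values, not output of the fitted vectorizer):

```python
import numpy as np

terms = np.array(['buy', 'grain', 'tennis', 'summit', 'vs'])
scores = np.array([0.10, 0.05, 0.30, 0.20, 0.40])

top_k = 3
ascending = scores.argsort()[-top_k:]   # top-k indices, weakest of the three first
descending = ascending[::-1]            # reversed: strongest first

print(terms[ascending])
print(terms[descending])
```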

In [18]:
# apply to each topic list to get the most distinctive words within the topic
def get_top_words(docs):
    
    # merge top topic into one document for tf-idf performance
    docs = [x for y in docs for x in y ] 
    docs =[' '.join(docs)]
    
    # get tf-idf and sort words by returned value
    tfidf = vect.transform(docs)
    sorted_tfidf_index = tfidf.toarray()[0].argsort()[-15:]
    df = pd.DataFrame(data = feature[sorted_tfidf_index], columns=['vocab'])
    
    return df

result = [get_top_words(x) for x in topic_list]

In [19]:
# save to dataframe
result = pd.DataFrame.from_dict({k : v['vocab'].tolist() for k, v in enumerate(result) })
result

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | economics | mail | prix | ric | forced | weather | unchanged | forex | downgrade | digest | rdf | forecasts | rdf | prix |
| 1 | sunni | rdf | dealing | svcs | force | baseball | nzd | form | subdebt | dealing | ignore | forex | unconfirmed | created |
| 2 | includes | dealing | rs | moody | foot | standings | buzz | forum | reviews | block | nhl | form | charges | rdf |
| 3 | pmis | svcs | resolved | motorcycling | hands | championship | man | friendly | usd | ric | message | forum | message | ignore |
| 4 | buy | rating | russia | ratings | founder | diary | great | forward | trading | techs | colombia | planned | swedish | qaeda |
| 5 | buzz | moody | rating | announces | russia | qatar | brazilian | zone | ax | mark | insider | maintenance | svcs | la |
| 6 | men | ca | moody | downgrade | italy | research | unconfirmed | fourth | transmission | trading | svcs | ignore | buzz | planned |
| 7 | singles | los | motorcycling | subdebt | standings | equities | iron | francisco | outage | buzz | falcao | dealing | aud | maintenance |
| 8 | los | angeles | rdf | reviews | svcs | sweden | escape | frankfurt | pjm | ax | result | resolved | jpy | liga |
| 9 | wins | county | argentine | testing | baseball | raises | transmission | free | halt | bo | scorers | scorers | momentum | motorcycling |
