## Trending topics

#### Topic models such as LDA look like an easy way to identify topics, but in my experience they have three drawbacks for short, high-frequency content.

1. Once an unsupervised model is fitted, its classifications are static, so topics that emerge later ("unseen" topics) cannot be captured.

2. A short message is more likely than not to cover a single, homogeneous topic, which conflicts with LDA's assumption that each document is a probability distribution over multiple topics.

3. Ultimately, LDA is a statistical model that isn't quite able to capture semantic similarity between words.

#### Instead, I build a *"live"* trending-topic detection pipeline that combines several techniques, such as Doc2Vec and hierarchical clustering. This approach allows us to focus on recent documents, regardless of what the past topic distribution looked like.


#### _Please note: I don't have access to a news feed API, so I use a file sent by a friend of mine instead. The approach would be similar with a live feed._

In [1]:
import pandas as pd
import numpy as np
import re
import datetime
import xlrd
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

import gensim

### 0. Load and Clean file

In [2]:
text_df = pd.read_csv('sample_file.csv') 
print(text_df.shape)
print(text_df.columns)

# remove test rows whose headline says "Test, Please Ignore"
text_df = text_df[text_df.HEADLINE_ALERT_TEXT.str.contains("Test, Please Ignore")==False]
text_df.LANGUAGE.value_counts()

#### Only English
text_df = text_df[text_df['LANGUAGE']=='EN']

## Use Headline when no Take Text is available
text_df.loc[text_df['TAKE_TEXT'].isnull(),'TAKE_TEXT'] = text_df.loc[text_df['TAKE_TEXT'].isnull(),'HEADLINE_ALERT_TEXT']

## forward-fill missing dates and times
text_df.DATE = text_df.DATE.ffill()
text_df.TIME = text_df.TIME.ffill()

## format time

text_df.DATE = pd.to_datetime(text_df.DATE)
text_df['HOUR'] = text_df.TIME.apply(lambda x: x.split(':')[0])

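# clean text: lowercase, strip, then replace non-alphanumeric characters with spaces
text_df.TAKE_TEXT = (
    text_df.TAKE_TEXT.str.lower().str.strip()
    .str.replace(r'[^\w\s]', ' ', regex=True)        # punctuation -> space
    .str.replace(r'[^a-zA-Z0-9 ]', ' ', regex=True)  # any remaining non-alphanumeric character -> space
    .str.replace(r'\W*\b\w{1,1}\b', '', regex=True)  # drop single-character tokens
)

# remove any leftover symbols
text_df.TAKE_TEXT = text_df.TAKE_TEXT.apply(lambda x: re.sub(r'[*+_\-<>()|]', '', x))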

text_df = text_df[text_df.TAKE_TEXT.notnull()]

# apply gensim processing
text_df['TAKE_TEXT'] = text_df['TAKE_TEXT'].apply(gensim.utils.simple_preprocess)
print(text_df.shape)

train_corpus = text_df.TAKE_TEXT.tolist()
train_corpus = [gensim.models.doc2vec.TaggedDocument(value, [key]) for key, value in enumerate(train_corpus)]
len(train_corpus)

(8213, 19)
Index(['DATE', 'TIME', 'UNIQUE_STORY_INDEX', 'EVENT_TYPE', 'PNAC',
       'STORY_DATE_TIME', 'TAKE_DATE_TIME', 'HEADLINE_ALERT_TEXT',
       'ACCUMULATED_STORY_TEXT', 'TAKE_TEXT', 'PRODUCTS', 'TOPICS',
       'RELATED_RICS', 'NAMED_ITEMS', 'HEADLINE_SUBTYPE', 'STORY_TYPE',
       'TABULAR_FLAG', 'ATTRIBUTION', 'LANGUAGE'],
      dtype='object')
(3137, 20)


3137

### 1. Quickly train a Doc2Vec model (a model pre-trained on a wiki corpus could also be used for transfer learning)

In [3]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=100)

# build vocabulary
model.build_vocab(train_corpus)

# train model
%time model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

Wall time: 31 s


In [4]:
ranks = []
second_ranks = []

for doc_id in range(len(train_corpus)):
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    # gensim >= 4.0 exposes document vectors as model.dv; older versions use model.docvecs
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
    rank = [docid for docid, sim in sims].index(doc_id)
    ranks.append(rank)
    second_ranks.append(sims[1])
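
The `ranks` list is built above but never summarized. As a rough sanity check, one could measure how often a document is ranked as most similar to itself (rank 0). The sketch below assumes `ranks` is a plain list of integers; `summarize_ranks` and the example values are hypothetical and only illustrate the idea.

```python
from collections import Counter

def summarize_ranks(ranks):
    """Return (share of documents that rank themselves first, counts per rank)."""
    counts = Counter(ranks)
    self_match_share = counts[0] / len(ranks) if ranks else 0.0
    return self_match_share, counts

# Hypothetical ranks for illustration only; in the notebook, pass the `ranks` list built above.
example_ranks = [0, 0, 1, 0, 3, 0, 0, 2]
share, counts = summarize_ranks(example_ranks)
print(f"self-match share: {share:.2f}")
print(counts.most_common(3))
```

A share close to 1 suggests the model maps a document back to itself consistently; a low share may mean the model needs more epochs or more data.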
    

#### Some test of the model ability

In [5]:
# Random check that similar documents are indeed identified (re-run with Ctrl+Enter for a few examples)
doc_id = np.random.randint(0, len(train_corpus))  # upper bound is exclusive

# Compare and print the second-most-similar document
print('Selected Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
sim_id = second_ranks[doc_id]
print('Similar Document {}: «{}»\n'.format(sim_id, ' '.join(train_corpus[sim_id[0]].words)))

Selected Document (2259): «boj balance market operations revised»

Similar Document (133, 0.7650014758110046): «table breakdown of china may official pmi»
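
The score printed next to the similar document (about 0.765 here) is a cosine similarity between the inferred vector and a stored document vector. As a reminder of what that number means, here is a minimal pure-Python sketch; `cosine_similarity` is a hypothetical helper and the vectors are made up, not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors: same direction, orthogonal, and opposite.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Values near 1 indicate vectors pointing in nearly the same direction, so a score of 0.765 indicates a fairly close match in the embedding space, even when the two headlines share no words.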



### 2. Cluster documents for topic identification using Hierarchical Agglomerative Clustering
#### Here clustering is run within each time window (4 hours, my choice). This lets us identify the largest topic in each window, which is my definition of "trending". This "onlineness" is a key feature of the approach.

In [14]:
def most_common(lst):
    return max(set(lst), key=lst.count)

def find_largest_topic(vector):
    
    cluster = AgglomerativeClustering(n_clusters= 120, linkage='ward')  # Here we want many clusters to get smaller ones
    cluster.fit_predict(vector)
    max_id = most_common(list(cluster.labels_))
    result_id = (cluster.labels_ == max_id)
    
    return result_id 
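
`most_common` above calls `lst.count` once per distinct label, which gets slow when there are many clusters. A possible alternative, sketched here with the standard library only, counts all labels in a single pass; `largest_cluster_mask` is a hypothetical name and the labels are made up. Note that ties are broken differently: `Counter.most_common` returns the label encountered first, whereas `max(set(lst), ...)` depends on set iteration order.

```python
from collections import Counter

def largest_cluster_mask(labels):
    """Return a boolean list marking the members of the most populous cluster."""
    labels = list(labels)
    top_label, _ = Counter(labels).most_common(1)[0]
    return [label == top_label for label in labels]

# Made-up cluster labels for six documents; cluster 2 is the largest.
example_labels = [2, 0, 2, 1, 2, 0]
mask = largest_cluster_mask(example_labels)
print(mask)
```

Inside `find_largest_topic`, the same idea would apply to `cluster.labels_`; wrapping the result in `np.array(...)` would keep it usable as a boolean index.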

In [15]:
text_df['vector'] = [model.infer_vector(x.words) for x in train_corpus]

#### Break data into 4-hour windows

In [17]:
result_df = text_df
result_df.HOUR = result_df.HOUR.apply(float)
n = int(result_df.HOUR.max() / 4)
result_df['Hour_4'] = pd.qcut(result_df.HOUR, n , labels = False)
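
One caveat: `pd.qcut` creates bins with roughly equal numbers of rows, so the resulting windows do not necessarily span exactly four hours. If fixed-width windows are wanted, integer division of the hour is one option. The sketch below assumes `TIME` strings look like `HH:MM:SS` (an assumption about the file, not something shown above); `four_hour_window` is a hypothetical helper.

```python
def four_hour_window(time_str, width=4):
    """Map an 'HH:MM:SS' string to a fixed-width window index (0 to 5 when width is 4)."""
    hour = int(time_str.split(':')[0])
    return hour // width

# Made-up timestamps showing where the window boundaries fall.
examples = ['00:15:00', '03:59:59', '04:00:00', '23:10:00']
windows = [four_hour_window(t) for t in examples]
print(windows)
```

With pandas, the equivalent would be along the lines of `result_df['Hour_4'] = result_df.HOUR.astype(int) // 4`, with the later loop iterating over the distinct window values instead of `range(n)`.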

In [18]:
topic_list =[]

for x in range(n):
    vector = result_df.loc[result_df.Hour_4 == x, 'vector'].tolist()
    top_topic_id = find_largest_topic(vector)
    top_docs = result_df.loc[result_df.Hour_4 == x, 'TAKE_TEXT'][top_topic_id]
    topic_list.append(top_docs)

In [19]:
topic_list[0] 
# Not as clean as I would ideally like,
# but similar documents do get clustered together (e.g. market/economy and sports). We may fine-tune it further.

468     [peru, inflation, index, pecpi, eci, rises, pc...
561                        [india, call, money, pct, jun]
609     [exxon, mobil, xom, reports, unplanned, flarin...
610     [exxon, reports, unplanned, flaring, breakdown...
613                           [update, baseball, results]
991     [service, alert, thomson, reuters, dealing, re...
1108                             [top, news, front, page]
1942                                    [emea, test, rcf]
1943    [by, ramkumar, koottala, hari, abcdefghijklmn,...
2224                             [top, news, front, page]
2227                             [top, news, front, page]
2467                             [top, news, front, page]
3331    [motorcycling, motorcycling, grand, prix, moto...
4395    [service, alert, datascope, equities, planned,...
5602                                     [diary, vietnam]
5637              [diary, top, economic, events, to, jun]
5640                 [seoul, shares, ks, open, down, pct]
5652    [diary

### 3. Identify the "Names" in the largest topic using TF-IDF
#### Alternatively, we could retrieve the words that are most frequent in the largest topic but absent from the rest
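
The frequency-based alternative mentioned above could look roughly like this: count words in the largest topic and keep only those that never occur in the remaining documents. This is a sketch, not what the notebook runs; `distinctive_words` is a hypothetical helper, and the token lists are made up in the style of the `TAKE_TEXT` column.

```python
from collections import Counter

def distinctive_words(topic_docs, other_docs, top_n=10):
    """Most frequent words in topic_docs that never appear in other_docs."""
    topic_counts = Counter(word for doc in topic_docs for word in doc)
    other_vocab = {word for doc in other_docs for word in doc}
    exclusive = {w: c for w, c in topic_counts.items() if w not in other_vocab}
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(exclusive.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Made-up tokenized documents for illustration.
topic_docs = [["exxon", "reports", "flaring"], ["exxon", "mobil", "flaring"]]
other_docs = [["baseball", "results"], ["reports", "diary"]]
top_words = distinctive_words(topic_docs, other_docs)
print(top_words)
```

Compared with TF-IDF, this strict exclusion is simpler but harsher: a word that appears even once outside the topic is dropped entirely.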

In [20]:
# train a tfidf on entire corpus

tfidf = TfidfVectorizer(tokenizer=word_tokenize, stop_words='english')   

X_train = list(map(lambda x: ' '.join(word for word in x), result_df.TAKE_TEXT))

vect = tfidf.fit(X_train)

feature = np.array(vect.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()

In [21]:
# apply to each topic list to get the most distinctive words within the topic

def get_top_words(docs):
    
    docs = list(map(lambda x: ' '.join(word for word in x), docs))
    
    X_train_transformed = vect.transform(docs)

    sorted_tfidf_index = X_train_transformed.max(0).toarray()[0].argsort()

    sorted_tfidf_value = X_train_transformed.max(0).toarray()[0,sorted_tfidf_index]

    df = pd.DataFrame(data =feature[sorted_tfidf_index], columns=['vocab'])

    df['tfidf_value'] = np.array(sorted_tfidf_value)

    df = df.sort_values(['tfidf_value','vocab'], ascending =[False, True])
    
    df = df.iloc[0:10, ]

    #l = pd.DataFrame(data = np.array(df['tfidf_value']), index = np.array(df['vocab']))[0:10]
    
    return df

result = [get_top_words(x) for x in topic_list]

In [22]:
# save to dataframe
result = pd.DataFrame.from_dict({k : v['vocab'].tolist() for k, v in enumerate(result) if k !=0})
result

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | motorcycling | sports | baseball | page | failover | sports | page | tops | generates | sports | ignore | metals | brief |
| 1 | rcf | page | malaysia | motorcycling | services | commodities | news | ontario | management | page | greater | baseball | test |
| 2 | alert | motorcycling | page | baseball | alberta | page | gc | california | winners | ifr | hungarian | page | ramkumar |
| 3 | amers | baseball | pls | news | companies | lynx | warehouse | outperforms | carbon | news | exchange | standings | mahindra |
| 4 | hdcp | news | news | alert | european | baseball | hg | nzs | slaughterhouse | upcoming | foreign | soy | finland |
| 5 | constructor | polish | diary | results | australia | news | benitec | aej | nzs | insider | bajaj | collated | hari |
| 6 | semifinal | nhc | ignroe | update | hungarian | mth | biopharma | alert | kills | scorers | offs | news | digest |
| 7 | argentine | tropical | motorcycling | service | financial | hungarian | blt | ringgit | gateway | markets | swedish | finnish | alert |
| 8 | datascope | argentine | offs | unavailable | alert | upcoming | alert | redbank | tourist | liga | test | stl | corporate |
| 9 | delhi | freight | scorers | washout | enoc | finnish | pjm | lin | coal | alert | china | hungarian | factors |
