# Topic Modelling for EBC data

EBC refers to the Executive Briefing Center data at Red Hat. In this notebook we show how to use TOM to infer the latent topics that pervade the corpus of EBC articles published between 2015 and 2019, using latent Dirichlet allocation (LDA). Based on the discovered topics, we use TOM to shed light on the topical structure of the EBC corpus.

## Loading and vectorizing the corpus

We prune words whose absolute frequency in the corpus is less than 2, as well as words whose relative frequency is higher than 80%, in order to keep only the most significant ones. We then build the vector space representation of these articles with $tf \cdot idf$ weighting. It is an $n \times m$ matrix denoted by $A$, where each row represents an article, with $n = 21$ (the number of articles) and $m = 383$ (the number of words).

In [1]:
from tom_lib.structure.corpus import Corpus
from tom_lib.visualization.visualization import Visualization

corpus = Corpus(source_file_path='input/x.csv',
                language='english',
                vectorization='tfidf',
                max_relative_frequency=0.8,
                min_absolute_frequency=2)
print('corpus size:', corpus.size)
print('vocabulary size:', len(corpus.vocabulary))

corpus size: 21
vocabulary size: 383
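
To make the pruning and weighting step concrete, here is a minimal pure-Python sketch. It is not TOM's implementation: the function name `tfidf_matrix`, the toy documents, and the plain $tf \cdot \log(n / df)$ formula are illustrative assumptions, and we assume both thresholds apply to document frequency (the number or share of documents that contain a word).

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_absolute_frequency=2, max_relative_frequency=0.8):
    """Return (vocabulary, matrix) for a list of tokenized documents.

    Assumption: both thresholds are applied to document frequency,
    i.e. the number (or share) of documents that contain a word.
    """
    n = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))
    vocabulary = sorted(
        w for w, c in df.items()
        if c >= min_absolute_frequency and c / n <= max_relative_frequency
    )
    index = {w: j for j, w in enumerate(vocabulary)}

    # One row per document, one column per retained word
    matrix = [[0.0] * len(vocabulary) for _ in range(n)]
    for i, doc in enumerate(docs):
        for word, tf in Counter(doc).items():
            if word in index:
                # Plain tf * idf with idf = log(n / df)
                matrix[i][index[word]] = tf * math.log(n / df[word])
    return vocabulary, matrix

# Hypothetical toy corpus, already tokenized
docs = [
    "openshift cluster roadmap".split(),
    "openshift ansible automation".split(),
    "ansible automation roadmap".split(),
    "openshift containers strategy".split(),
]
vocabulary, A = tfidf_matrix(docs)
print(vocabulary)         # words kept after pruning
print(len(A), len(A[0]))  # n rows (documents), m columns (words)
```

Words that occur in a single document are dropped by the lower threshold, while a word present in more than 80% of the documents would be dropped by the upper one.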


## Estimating the optimal number of topics ($k$)



In [2]:
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization, LatentDirichletAllocation
topic_model = LatentDirichletAllocation(corpus)

### Weighted Jaccard average stability

The figure below shows this metric for a number of topics varying between 10 and 50 (higher is better).

In [3]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure
output_notebook()

p = figure(plot_height=250)
p.line(range(10, 51), topic_model.greene_metric(min_num_topics=10, step=1, max_num_topics=50, top_n_words=10, tao=10), line_width=2)
show(p)
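
As an illustration of what a Jaccard-based stability score compares, the sketch below computes the average Jaccard similarity between two ranked lists of top words, averaging over the top-1, top-2, ..., top-$t$ prefixes. We assume this is the building block behind the metric plotted above; the helper names and the word lists are hypothetical, and the full stability measure (matching topics across several runs and averaging) is not reproduced here.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections (1.0 when both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def average_jaccard(ranking_a, ranking_b):
    """Mean Jaccard similarity over the top-1, top-2, ..., top-t prefixes."""
    t = min(len(ranking_a), len(ranking_b))
    if t == 0:
        return 0.0
    return sum(jaccard(ranking_a[:d], ranking_b[:d]) for d in range(1, t + 1)) / t

# Top words of the "same" topic in two hypothetical runs
run_1 = ["openshift", "ansible", "automation"]
run_2 = ["openshift", "automation", "ansible"]
print(average_jaccard(run_1, run_2))
```

Because the prefixes are compared depth by depth, two lists that contain the same words in a different order score below 1, which a plain set-based Jaccard similarity would not capture.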

### Symmetric Kullback-Leibler divergence

The figure below shows this metric for a number of topics varying between 10 and 50 (lower is better).

In [4]:
p = figure(plot_height=250)
p.line(range(10, 51), topic_model.arun_metric(min_num_topics=10, max_num_topics=50, iterations=10), line_width=2)
show(p)
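
For reference, the symmetric Kullback-Leibler divergence between two discrete distributions $p$ and $q$ is $KL(p \| q) + KL(q \| p)$. The sketch below shows only this quantity; how `arun_metric` derives the two distributions from the factorized matrices is not shown, and the function names and example values are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions of equal length.

    Terms with p_i = 0 contribute 0. Assumes q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Two hypothetical distributions over the same two outcomes
p = [0.5, 0.5]
q = [0.9, 0.1]
print(symmetric_kl(p, q))
print(symmetric_kl(p, p))  # identical distributions give 0.0
```

The value is 0 when the two distributions coincide and grows as they diverge, which is why lower values are preferred when reading the plot.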

Guided by the two metrics above, we manually evaluate the quality of the topics identified with $k$ varying between 12 and 20. We judge that the best results are achieved with LDA for $k=12$.

In [5]:
k = 12
topic_model.infer_topics(num_topics=k)

## Results

### Description of the discovered topics



In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', 500)

d = {'Most relevant words': [', '.join([word for word, weight in topic_model.top_words(i, 10)]) for i in range(k)]}
df = pd.DataFrame(data=d)
df.head(k)

| Topic | Most relevant words |
|---|---|
| 0 | general, catalog, vision, teams, paul, engagement, cross, aware, behind, connect |
| 1 | amadeus, openshift, travel, infrastructure, overview, roadmap, management, service, want, cluster |
| 2 | general, catalog, vision, teams, paul, engagement, cross, aware, behind, connect |
| 3 | none, general, catalog, vision, teams, paul, engagement, aware, behind, connect |
| 4 | general, catalog, vision, teams, paul, engagement, cross, aware, behind, connect |
| 5 | boston, senior, developer, ken, manager, director, 2018, account, services, level |
| 6 | health, infrastructure, offers, premise, today, production, relationship, openstack, automation, service |
| 7 | group, companies, value, need, teams, new, services, infrastructure, market, head |
| 8 | containers, goals, use, openshift, year, strategic, jboss, public, automation, ansible |
| 9 | paul, rh, system, notes, ansible, program, redhat, com, openshift, portfolio |


In the following, we leverage the discovered topics to highlight interesting particularities of the EBC corpus. To analyze the topics together with information about the related articles, we partition the articles into $k$ non-overlapping clusters, i.e. one cluster per topic. Each article $i \in [0; n-1]$ is assigned to the cluster $j$ that corresponds to the topic with the highest weight $w_{i,j}$:

\begin{equation}
\text{cluster}_i = \underset{j}{\mathrm{argmax}}(w_{i,j})
\label{eq:cluster}
\end{equation}
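
The assignment rule above amounts to a row-wise argmax over the document-topic weights. A minimal sketch, assuming the weights are available as a list of rows (the matrix `W` and the function name `assign_clusters` are hypothetical and not part of TOM):

```python
def assign_clusters(weights):
    """weights[i][j] is the weight of topic j in document i.

    Returns, for each document, the index of its highest-weighted topic.
    Ties are resolved in favour of the lowest topic index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in weights]

# Hypothetical weights for 4 documents and 3 topics
W = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
clusters = assign_clusters(W)
print(clusters)  # [1, 0, 2, 0]
```

Since every document receives exactly one topic index, the resulting clusters are non-overlapping by construction.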

### Global topic proportions

In [7]:
p = figure(x_range=[str(_) for _ in range(k)], plot_height=350, x_axis_label='topic', y_axis_label='proportion')
p.vbar(x=[str(_) for _ in range(k)], top=topic_model.topics_frequency(), width=0.7)
show(p)
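
One plausible reading of these proportions is the share of articles whose dominant topic is $j$. The sketch below computes that quantity from a list of dominant-topic indices. That `topics_frequency()` is defined exactly this way is an assumption, and the function name and input list are hypothetical.

```python
from collections import Counter

def topic_proportions(dominant_topics, num_topics):
    """Share of documents assigned to each topic, indexed by topic id."""
    counts = Counter(dominant_topics)
    n = len(dominant_topics)
    return [counts[j] / n for j in range(num_topics)]

# Hypothetical dominant topic of 4 documents, with 3 topics in total
proportions = topic_proportions([1, 0, 2, 0], 3)
print(proportions)  # [0.5, 0.25, 0.25]
```

Under this definition the proportions sum to 1, and a topic that is never dominant gets a proportion of 0.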

### Shifting attention, evolving interests

Here we focus on two of the discovered topics. The following figures describe each topic in terms of its top 10 words and top 3 documents.

In [8]:
def plot_top_words(topic_id):
    # Bar chart of the 10 highest-weighted words of a topic
    top_words = topic_model.top_words(topic_id, 10)
    words = [word for word, weight in top_words]
    weights = [weight for word, weight in top_words]

    p = figure(x_range=words, plot_height=300, plot_width=800, x_axis_label='word', y_axis_label='weight')
    p.vbar(x=words, top=weights, width=0.7)
    show(p)


def top_documents_df(topic_id):
    # Table of the 3 documents most strongly associated with a topic
    top_docs = topic_model.top_documents(topic_id, 3)
    d = {'Article title': [corpus.title(doc_id) for doc_id, weight in top_docs],
         'Year': [int(corpus.date(doc_id)) for doc_id, weight in top_docs]}
    return pd.DataFrame(data=d)

#### Topic #5

##### Top 10 words

In [9]:
plot_top_words(5)

##### Top 3 articles

In [10]:
top_documents_df(5).head()

| | Article title | Year |
|---|---|---|
| 0 | Federal Reserve Boston | 2018 |
| 1 | Accenture | 2017 |
| 2 | Accenture2 | 2018 |


#### Topic #3

##### Top 10 words

In [11]:
plot_top_words(5)

##### Top 3 articles

In [12]:
top_documents_df(5).head()

| | Article title | Year |
|---|---|---|
| 0 | Federal Reserve Boston | 2018 |
| 1 | Accenture | 2017 |
| 2 | Accenture2 | 2018 |


#### Evolution of the frequencies of topics 3 and 8



In [13]:
p = figure(plot_height=250, x_axis_label='year', y_axis_label='topic frequency')
p.line(range(2015, 2020), [topic_model.topic_frequency(3, date=i) for i in range(2015, 2020)], line_width=2, line_color='blue', legend='topic #3')
p.line(range(2015, 2020), [topic_model.topic_frequency(8, date=i) for i in range(2015, 2020)], line_width=2, line_color='red', legend='topic #8')
show(p)