# Discovering Topics from Report Views

In this notebook, we will build a **Latent Dirichlet Allocation (LDA)** model to uncover hidden topics in the textual descriptions of report views. This analysis helps us understand thematic structures in our data products and improve semantic search and categorization.

We will:
- Load and clean the descriptions from the `Views` sheet in the inventory.
- Preprocess the text (tokenization, lemmatization, filtering).
- Build a topic model using `scikit-learn`'s `LatentDirichletAllocation`.
- Explore the extracted topics using `pyLDAvis`.

This notebook follows the structure of the earlier topic modeling analysis on the Cordis dataset.


In [1]:
import pandas as pd
from pathlib import Path

# Load the inventory file
xlsx_path = Path("../raw/Reporting_Inventory.xlsx")
views_df = pd.read_excel(xlsx_path, sheet_name="Views")

# Show the available columns
views_df.columns


Index(['ID Data Product', 'Report Name', 'Product Owner', 'PBIX_File',
       'Report View', 'Description', 'Category', 'Status', 'Rename',
       'Dimensions', 'KPIs', 'Other Terms', 'Filters', 'Tags', 'Priority'],
      dtype='object')

We clean and preprocess the textual descriptions to prepare them for topic modeling. This includes:
- Lowercasing the text and removing punctuation and English stopwords
- Lemmatizing each token with spaCy
- Keeping only alphabetic tokens longer than two characters


In [4]:
import nltk
import spacy
import re
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))

def preprocess(text):
    doc = nlp(text.lower())
    tokens = [
        token.lemma_ for token in doc
        if token.is_alpha and token.lemma_ not in stop_words and len(token) > 2
    ]
    return " ".join(tokens)

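# Apply preprocessing to the description column.
# Note: astype(str) turns missing descriptions into the literal string "nan",
# which is the likely source of the "nan" term in Topic 1 below; calling
# .fillna("") before .astype(str) would avoid it.
views_df["clean_text"] = views_df["Description"].astype(str).apply(preprocess)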
views_df.head(5)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cbadenes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority,clean_text
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1,methodolody definition algorithim feed market
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,view focus understand performance hotel specif...
2,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,EXECUTIVE VIEW,Global view to understand Feeder Market Perfor...,Executive,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,global view understand feed market performance...
3,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER MARKET FLOWS,View focused on understanding the booking beha...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,view focus understand book behaviour feeder ma...
4,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER_MARKET_DETAIL,Detail view of Feeder Markets by Destination i...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,detail view feed market destination include in...


We vectorize the cleaned text and apply `LatentDirichletAllocation` to discover underlying topics.


In [5]:
import joblib
from sklearn.decomposition import LatentDirichletAllocation

# Vectorize: drop terms found in more than 95% of documents or in fewer than 2
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
dtm = vectorizer.fit_transform(views_df["clean_text"])

# Fit the LDA model with 5 topics
lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_model.fit(dtm)

# Print the top terms of each topic
def display_topics(model, feature_names, no_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        # argsort is ascending, so the reversed slice lists the heaviest terms first
        top_terms = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        print(f"Topic {topic_idx}: ", ", ".join(top_terms))

display_topics(lda_model, vectorizer.get_feature_names_out())

# Save the LDA model and the vectorizer
model_path = Path("../models/lda_model.pkl")
model_path.parent.mkdir(parents=True, exist_ok=True)  # make sure ../models exists
joblib.dump(lda_model, model_path)
vectorizer_path = Path("../models/vectorizer.pkl")
joblib.dump(vectorizer, vectorizer_path)

Topic 0:  business, portfolio, account, total, summary, report, regard, sale, activity, performance
Topic 1:  nan, forecast, detail, budget, year, total, information, pick, level, last
Topic 2:  view, page, hotel, kpis, information, table, block, analyze, kpi, detail
Topic 3:  view, channel, index, page, block, performance, month, button, market, interactive
Topic 4:  tab, cost, datum, report, consumption, ratio, detailed, quality, hotel, level


['../models/vectorizer.pkl']
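
The `min_df=2` and `max_df=0.95` settings passed to `CountVectorizer` above prune the vocabulary by document frequency. As a rough illustration of that rule, here is a minimal pure-Python sketch; the helper `filter_vocabulary` and the toy corpus are hypothetical and are not part of the inventory data or of scikit-learn.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms whose document frequency lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Count in how many documents each term appears (not how often overall).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

# Toy corpus (illustrative only, not taken from the inventory).
toy_docs = [
    "view hotel revenue",
    "view hotel forecast",
    "view budget forecast",
    "view channel",
]
print(filter_vocabulary(toy_docs))  # ['forecast', 'hotel']
```

In this toy corpus, `view` occurs in every document and is dropped as too common, while `revenue`, `budget`, and `channel` occur only once and are dropped as too rare.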

We use `pyLDAvis` to explore the topic-term distributions interactively.


In [10]:
import pyLDAvis

pyLDAvis.enable_notebook()

# Normalize the topic-word pseudo-counts so that each topic's row sums to 1
topic_term_dists = lda_model.components_ / lda_model.components_.sum(axis=1)[:, None]

# Use the public pyLDAvis.prepare entry point instead of the private _prepare module
panel = pyLDAvis.prepare(
    topic_term_dists=topic_term_dists,
    doc_topic_dists=lda_model.transform(dtm),
    # Count only the tokens kept in the vectorizer's vocabulary
    doc_lengths=dtm.sum(axis=1).A1,
    vocab=vectorizer.get_feature_names_out(),
    term_frequency=dtm.sum(axis=0).A1,
)
pyLDAvis.display(panel)
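
The `topic_term_dists` argument above is obtained by dividing each row of `lda_model.components_` by its sum, because scikit-learn stores unnormalized pseudo-counts per topic rather than probabilities. The following sketch shows the same normalization on a small, made-up matrix; the numbers are hypothetical and only serve to make the arithmetic visible.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over a 3-term vocabulary.
components = np.array([[2.0, 1.0, 1.0],
                       [1.0, 3.0, 4.0]])

# Divide each row by its own sum so every topic becomes a probability distribution.
topic_term_dists = components / components.sum(axis=1, keepdims=True)

print(topic_term_dists)
print(topic_term_dists.sum(axis=1))  # each row now sums to 1
```

Using `keepdims=True` is equivalent to the `[:, None]` indexing in the cell above; both keep the row sums as a column so that broadcasting divides row by row.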


## Inferring Topics from New Descriptions

We can use our trained LDA model to infer the most probable topics for any given text description. This allows us to:
- Explore how new or unseen reports align with existing topics.
- Suggest tags or categories dynamically.
- Validate the thematic coherence of generated summaries or keywords.

Below, you can input a sample text and view the topic distribution inferred by the model.


In [11]:
def infer_topics(text, n_top=3):
    # Preprocess the text exactly as during training
    clean = preprocess(text)

    # Vectorize with the fitted vocabulary
    vectorized = vectorizer.transform([clean])

    # Topic distribution for this single document
    topic_dist = lda_model.transform(vectorized)[0]

    # Show the top N topics with their weights and five leading terms
    feature_names = vectorizer.get_feature_names_out()
    top_topics = sorted(enumerate(topic_dist), key=lambda x: -x[1])[:n_top]
    for topic_idx, weight in top_topics:
        terms = [feature_names[i]
                 for i in lda_model.components_[topic_idx].argsort()[:-6:-1]]
        print(f"Topic {topic_idx} ({weight:.2f}):", ", ".join(terms))

# Example usage
sample_description = "Weekly dashboard for performance and revenue trends in hotel bookings"
infer_topics(sample_description)


Topic 2 (0.51): view, page, hotel, kpis, information
Topic 0 (0.41): business, portfolio, account, total, summary
Topic 4 (0.03): tab, cost, datum, report, consumption
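
One way the inferred distribution could feed dynamic tag suggestions is to keep only the topics above a weight cutoff and collect their leading terms. The sketch below assumes a hypothetical helper `suggest_tags` and an arbitrary cutoff of 0.2; the weights and terms are copied from the sample output above, limited to the three topics printed there.

```python
def suggest_tags(topic_weights, topic_terms, threshold=0.2, n_terms=3):
    """Return the leading terms of every topic whose weight meets the threshold."""
    tags = []
    # Visit topics from most to least probable.
    for topic_idx, weight in sorted(topic_weights.items(), key=lambda kv: -kv[1]):
        if weight < threshold:
            break
        for term in topic_terms[topic_idx][:n_terms]:
            if term not in tags:  # avoid duplicates across topics
                tags.append(term)
    return tags

# Weights and top terms taken from the sample output (threshold is an assumption).
weights = {2: 0.51, 0: 0.41, 4: 0.03}
terms = {
    2: ["view", "page", "hotel", "kpis", "information"],
    0: ["business", "portfolio", "account", "total", "summary"],
    4: ["tab", "cost", "datum", "report", "consumption"],
}
print(suggest_tags(weights, terms))
# ['view', 'page', 'hotel', 'business', 'portfolio', 'account']
```

Generic terms such as `view` and `page` dominate the result, which suggests that a domain-specific stopword list might be worth adding before relying on such tags.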
