# Discovering Topics from Report Views

In this notebook, we will build a **Latent Dirichlet Allocation (LDA)** model to uncover hidden topics in the textual descriptions of report views. This analysis helps us understand thematic structures in our data products and improve semantic search and categorization.

We will:
- Load and clean the descriptions from the `Views` sheet in the inventory.
- Preprocess the text (tokenization, lemmatization, filtering).
- Build a topic model using `scikit-learn`'s `LatentDirichletAllocation`.
- Explore the extracted topics using `pyLDAvis`.

This notebook follows the structure of the earlier topic modeling analysis on the Cordis dataset.


In [1]:
import pandas as pd
from pathlib import Path

# Load the inventory workbook
xlsx_path = Path("../raw/Reporting_Inventory.xlsx")
views_df = pd.read_excel(xlsx_path, sheet_name="Views")

# Show the available columns
views_df.columns


Index(['ID Data Product', 'Report Name', 'Product Owner', 'PBIX_File',
       'Report View', 'Description', 'Category', 'Status', 'Rename',
       'Dimensions', 'KPIs', 'Other Terms', 'Filters', 'Tags', 'Priority'],
      dtype='object')

We clean and preprocess the textual descriptions to prepare them for topic modeling. This includes:
- Lowercasing and lemmatizing with spaCy
- Removing punctuation, numbers, and NLTK English stopwords
- Dropping tokens shorter than three characters


In [6]:
import nltk
import spacy
import re
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("stopwords")
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words("english"))

def preprocess(text):
    doc = nlp(text.lower())
    tokens = [
        token.lemma_ for token in doc
        if token.is_alpha and token.lemma_ not in stop_words and len(token) > 2
    ]
    return " ".join(tokens)

# Apply preprocessing to the description column.
# Note: astype(str) turns missing descriptions into the literal string "nan".
views_df["clean_text"] = views_df["Description"].astype(str).apply(preprocess)
views_df.head(5)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cbadenes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority,clean_text
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1,methodolody definition algorithim feed market
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,view focus understand performance hotel specif...
2,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,EXECUTIVE VIEW,Global view to understand Feeder Market Perfor...,Executive,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,global view understand feed market performance...
3,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER MARKET FLOWS,View focused on understanding the booking beha...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,view focus understand book behaviour feeder ma...
4,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER_MARKET_DETAIL,Detail view of Feeder Markets by Destination i...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,detail view feed market destination include in...


We vectorize the cleaned text and apply `LatentDirichletAllocation` to discover underlying topics.


In [12]:
from sklearn.decomposition import LatentDirichletAllocation

# Texts to process
documents = views_df["clean_text"].tolist()

# Vectorization
tf_vectorizer = CountVectorizer(
    stop_words=None,      # No extra stopword removal here (already done in preprocess)
    min_df=1,             # Keep terms that appear in at least 1 document
    max_df=1.0,           # No upper limit on document frequency
    lowercase=True,       # Convert everything to lowercase
    max_features=50000,   # Maximum vocabulary size
    token_pattern=r"[a-zA-Z0-9]{3,}",  # Tokens of 3+ alphanumeric characters
    analyzer="word",
)

# Build the document-term matrix
bag_of_words = tf_vectorizer.fit_transform(documents)

# Get the vocabulary
dictionary = tf_vectorizer.get_feature_names_out()
vocabulary = tf_vectorizer.vocabulary_

print("Preprocessing statistics:")
print(f"- Vocabulary size: {len(dictionary)} unique words")
print(f"- Matrix dimensions: {bag_of_words.shape}")

# Show the most frequent words
word_freq = bag_of_words.sum(axis=0).A1
top_words_idx = word_freq.argsort()[-10:][::-1]
print("\nMost frequent words:")
for idx in top_words_idx:
    print(f"- {dictionary[idx]}: {word_freq[idx]} occurrences")

Preprocessing statistics:
- Vocabulary size: 965 unique words
- Matrix dimensions: (1486, 965)

Most frequent words:
- nan: 922 occurrences
- view: 278 occurrences
- page: 151 occurrences
- information: 136 occurrences
- performance: 129 occurrences
- hotel: 121 occurrences
- table: 119 occurrences
- block: 108 occurrences
- detail: 104 occurrences
- channel: 101 occurrences
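
The `nan` entry at the top of the frequency list is most likely not a real word: `astype(str)` converts a missing `Description` into the literal string "nan", which then survives preprocessing. A minimal sketch of one way to handle this is to filter out rows without a description before cleaning. It uses a toy frame (the rows are made up for illustration) and a hypothetical `described_df` name:

```python
import numpy as np
import pandas as pd

# Toy stand-in for views_df; the rows here are illustrative only.
toy_df = pd.DataFrame({
    "Report View": ["CRITERIA", "EXECUTIVE VIEW", "COVER", "NOTES"],
    "Description": ["Definition of the algorithm", "Global performance view", np.nan, "   "],
})

# Keep only rows whose description is present and not just whitespace.
desc = toy_df["Description"]
has_text = desc.notna() & desc.fillna("").str.strip().ne("")
described_df = toy_df.loc[has_text].reset_index(drop=True)

print(f"Kept {len(described_df)} of {len(toy_df)} rows")
```

Whether to drop these rows or keep them with an empty string depends on whether later steps need exactly one row per view.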


In [None]:
import joblib

# Model parameters
n_topics = 2    # Small number of topics to start with
alpha = 1.0     # Document-topic prior (1.0 is uniform over topic mixtures)
beta = 0.1      # Topic-word prior (below 1 favors topics concentrated on fewer words)

# Create and train the model
print("LDA model configuration:")
print(f"- Number of topics: {n_topics}")
print(f"- Alpha: {alpha}")
print(f"- Beta: {beta}")
print("\nStarting training...\n")

# LDA model
lda_model = LatentDirichletAllocation(
    n_components=n_topics,      # Number of topics
    doc_topic_prior=alpha,      # Document-topic prior
    topic_word_prior=beta,      # Topic-word prior
    max_iter=25,                # Maximum number of iterations
    learning_method="online",   # Online variational Bayes
    evaluate_every=1,           # Evaluate perplexity at every iteration
    n_jobs=-1,                  # Use all cores
    random_state=0,             # Seed for reproducibility
    verbose=1,                  # Show progress
)
lda_model.fit(bag_of_words)

# Print the top words of each topic
def display_topics(model, feature_names, no_top_words=10):
    for topic_idx, topic in enumerate(model.components_):
        top_terms = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        print(f"Topic {topic_idx}: ", ", ".join(top_terms))

display_topics(lda_model, tf_vectorizer.get_feature_names_out())

# Save the LDA model and the vectorizer
model_path = Path("../models/lda_model.pkl")
model_path.parent.mkdir(parents=True, exist_ok=True)  # Make sure ../models exists
joblib.dump(lda_model, model_path)
vectorizer_path = Path("../models/vectorizer.pkl")
joblib.dump(tf_vectorizer, vectorizer_path)

LDA model configuration:
- Number of topics: 2
- Alpha: 1.0
- Beta: 0.1

Starting training...

iteration: 1 of max_iter: 25, perplexity: 417.5657
iteration: 2 of max_iter: 25, perplexity: 375.9183
iteration: 3 of max_iter: 25, perplexity: 359.4184
iteration: 4 of max_iter: 25, perplexity: 351.2923
iteration: 5 of max_iter: 25, perplexity: 346.7075
iteration: 6 of max_iter: 25, perplexity: 343.8115
iteration: 7 of max_iter: 25, perplexity: 341.8163
iteration: 8 of max_iter: 25, perplexity: 340.3513
iteration: 9 of max_iter: 25, perplexity: 339.2304
iteration: 10 of max_iter: 25, perplexity: 338.3547
iteration: 11 of max_iter: 25, perplexity: 337.6635
iteration: 12 of max_iter: 25, perplexity: 337.1112
iteration: 13 of max_iter: 25, perplexity: 336.6627
iteration: 14 of max_iter: 25, perplexity: 336.2919
iteration: 15 of max_iter: 25, perplexity: 335.9803
iteration: 16 of max_iter: 25, perplexity: 335.7147
iteration: 17 of max_iter: 25, perplexity: 335.4854
iteration: 18 of m

['../models/vectorizer.pkl']
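
The run above fixes the number of topics at 2. One rough way to compare alternatives is to fit several models and compare their perplexity, which scikit-learn exposes through `LatentDirichletAllocation.perplexity`. The sketch below uses a tiny made-up corpus and hypothetical names (`toy_docs`, `candidate_k`). On real data it would usually be preferable to score held-out documents, since perplexity on the training set often keeps improving as topics are added.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus; in practice this would be the cleaned descriptions.
toy_docs = [
    "hotel revenue performance month",
    "hotel booking channel revenue",
    "page index navigation detail",
    "table detail page information",
    "channel performance booking view",
    "information page table index",
]

X = CountVectorizer().fit_transform(toy_docs)

candidate_k = [2, 3, 4]
perplexities = {}
for k in candidate_k:
    model = LatentDirichletAllocation(
        n_components=k,
        learning_method="batch",
        max_iter=20,
        random_state=0,
    )
    model.fit(X)
    # Lower perplexity means the model assigns higher likelihood to X.
    perplexities[k] = model.perplexity(X)

for k, score in perplexities.items():
    print(f"k={k}: perplexity={score:.2f}")
```

Perplexity is only one signal. Reading the top terms per topic, as `display_topics` does, remains a useful sanity check on any candidate.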

We use `pyLDAvis` to explore the topic-term distributions interactively.


In [15]:
import pyLDAvis

pyLDAvis.enable_notebook()

# scikit-learn stores unnormalized topic-word weights, so normalize each row
topic_term_dists = lda_model.components_ / lda_model.components_.sum(axis=1)[:, None]

# Use the public pyLDAvis.prepare entry point rather than the private _prepare module
panel = pyLDAvis.prepare(
    topic_term_dists=topic_term_dists,
    doc_topic_dists=lda_model.transform(bag_of_words),
    doc_lengths=[len(t.split()) for t in views_df["clean_text"]],
    vocab=tf_vectorizer.get_feature_names_out(),
    term_frequency=bag_of_words.sum(axis=0).A1,
)
pyLDAvis.display(panel)
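
The `topic_term_dists` value passed above divides each row of `components_` by its sum. This is needed because scikit-learn stores unnormalized topic-word weights in `components_`, while pyLDAvis expects each topic to be a probability distribution over the vocabulary. A small check on a made-up count matrix (all names below are illustrative) shows the effect:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Made-up document-term counts: 4 documents, 5 terms.
counts = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Raw weights act like pseudo-counts and do not generally sum to 1 per topic.
raw_row_sums = lda.components_.sum(axis=1)

# Dividing by the row sums yields one probability distribution per topic.
topic_term = lda.components_ / raw_row_sums[:, None]

print("raw row sums:", raw_row_sums)
print("normalized row sums:", topic_term.sum(axis=1))
```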


## Inferring Topics from New Descriptions

We can use our trained LDA model to infer the most probable topics for any given text description. This allows us to:
- Explore how new or unseen reports align with existing topics.
- Suggest tags or categories dynamically.
- Validate the thematic coherence of generated summaries or keywords.

Below, you can input a sample text and view the topic distribution inferred by the model.


In [16]:
def infer_topics(text, n_top=3):
    # Preprocess the text exactly as in training
    clean = preprocess(text)

    # Vectorize with the fitted vocabulary
    vectorized = tf_vectorizer.transform([clean])

    # Get the topic distribution for this single document
    topic_dist = lda_model.transform(vectorized)[0]

    # Show the top N topics with their weights and five strongest terms
    feature_names = tf_vectorizer.get_feature_names_out()
    top_topics = sorted(enumerate(topic_dist), key=lambda x: -x[1])[:n_top]
    for topic_idx, weight in top_topics:
        terms = [feature_names[i]
                 for i in lda_model.components_[topic_idx].argsort()[:-6:-1]]
        print(f"Topic {topic_idx} ({weight:.2f}):", ", ".join(terms))

# Example usage
sample_description = "Weekly dashboard for performance and revenue trends in hotel bookings"
infer_topics(sample_description)


Topic 0 (0.50): view, performance, block, channel, report
Topic 1 (0.50): nan, page, detail, index, kpi
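
When `infer_topics` is applied to arbitrary text, it can help to check how much of that text the vectorizer actually recognizes. Terms outside the fitted vocabulary are ignored by `transform`, and a document with no known terms carries no evidence, so with a symmetric prior the model returns an even split across topics. A minimal sketch with a toy corpus and a hypothetical `vocabulary_coverage` helper:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus standing in for the cleaned descriptions.
toy_docs = [
    "hotel revenue performance month",
    "hotel booking channel revenue",
    "page index navigation detail",
    "table detail page information",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

def vocabulary_coverage(text, vectorizer):
    """Fraction of the text's tokens found in the fitted vocabulary (hypothetical helper)."""
    tokens = vectorizer.build_analyzer()(text)
    if not tokens:
        return 0.0
    known = sum(token in vectorizer.vocabulary_ for token in tokens)
    return known / len(tokens)

known_text = "hotel revenue booking"
unknown_text = "zebra giraffe elephant"

for text in (known_text, unknown_text):
    dist = lda.transform(vectorizer.transform([text]))[0]
    coverage = vocabulary_coverage(text, vectorizer)
    print(f"{text!r}: coverage={coverage:.2f}, topics={np.round(dist, 2)}")
```

A low coverage value suggests the inferred distribution mostly reflects the prior and not the content of the text.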
