# Supervised Topic Modeling with Labeled LDA

In this notebook, we train a **Labeled Latent Dirichlet Allocation (Labeled LDA)** model to learn topics associated with predefined `Category` labels from report view descriptions. This allows us to:
- Learn label-specific topics from text
- Predict and interpret categories for new or unlabeled views
- Improve semantic tagging of report views


In [8]:
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import tomotopy as tp

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

# Load the "Views" sheet
df = pd.read_excel("../raw/Reporting_Inventory.xlsx", sheet_name="Views")
df = df[df["Description"].notna() & df["Category"].notna() & df["Report View"].notna()]
df.head(2)

[nltk_data] Downloading package punkt to /Users/cbadenes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cbadenes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim of Feeder Market,Informative,Productive,,,,,,,Priority 1
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by hotel for a specific feeder market o selection of feeder marktes.,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel Mix, Room Type","Total Revenue, Room Revenue, RN, Lead Time, Lenght of Stay, AOV, ADR, ADR Net, %Cost",,,,Priority 1


In [9]:
# Preprocessing: lowercase, delete non-letter characters, tokenize, drop stop words
def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-záéíóúñü\s]", "", text)
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in stop_words]

df["tokens"] = df["Description"].apply(preprocess)
df["label"] = df["Category"].str.lower().str.strip()
df.head(2)

Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority,tokens,label
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim of Feeder Market,Informative,Productive,,,,,,,Priority 1,"[methodolody, definition, algorithim, feeder, market]",informative
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by hotel for a specific feeder market o selection of feeder marktes.,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel Mix, Room Type","Total Revenue, Room Revenue, RN, Lead Time, Lenght of Stay, AOV, ADR, ADR Net, %Cost",,,,Priority 1,"[view, focused, understand, performance, hotel, specific, feeder, market, selection, feeder, marktes]",functional
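
One side effect of deleting every non-letter character is that words separated only by punctuation get fused: `Booking.com,given` ends up as the single token `bookingcomgiven` in the prediction table at the end of this notebook. The sketch below compares the notebook's regex with one possible alternative that replaces those characters with a space. `preprocess_spaced` is a hypothetical name, and plain `str.split` with a tiny hand-written stop-word set stands in for `word_tokenize` and the NLTK list so that the snippet runs on its own.

```python
import re

# Tiny stand-in for the NLTK English stop-word list (assumption, illustration only).
DEMO_STOP_WORDS = {"this", "is", "for", "that", "they", "have", "their"}

def preprocess_deleting(text):
    """Same regex as the notebook: non-letter characters are deleted outright."""
    text = re.sub(r"[^a-záéíóúñü\s]", "", text.lower())
    return [t for t in text.split() if t not in DEMO_STOP_WORDS]

def preprocess_spaced(text):
    """Hypothetical variant: non-letter characters become a space."""
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text.lower())
    return [t for t in text.split() if t not in DEMO_STOP_WORDS]

sample = "This view is exclusively for Booking.com,given that they have their offensive criteria."
fused = preprocess_deleting(sample)
split = preprocess_spaced(sample)
print(fused)  # contains the fused token 'bookingcomgiven'
print(split)  # contains 'booking', 'com', 'given' as separate tokens
```

The trade-off is that the spaced variant also splits hyphenated terms and produces short fragments such as `com`, which may call for a minimum token length.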


In [10]:
df["label"].value_counts()

label
functional      360
index            67
executive        57
informative      42
self-service     13
other            12
master data       7
Name: count, dtype: int64
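
The label distribution is heavily skewed, which matters when judging predictions: always answering `functional` is already correct for most labeled views. A small sketch that computes this majority-class baseline from the counts printed above (the dictionary is copied by hand from that output):

```python
# Counts copied from the value_counts() output above.
label_counts = {
    "functional": 360, "index": 67, "executive": 57, "informative": 42,
    "self-service": 13, "other": 12, "master data": 7,
}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(f"{total} labeled views; always predicting '{majority_label}' "
      f"is correct {baseline_accuracy:.1%} of the time")
```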

## Train a labeled LDA

* `tw=tp.TermWeight.ONE`
Sets the term weighting scheme to no weighting (each word has equal importance).
Alternative values include:
    * `tp.TermWeight.PMI` – Pointwise Mutual Information
    * `tp.TermWeight.IDF` – Inverse Document Frequency

* `min_cf=1`
Minimum collection frequency: a token must occur at least once across the whole corpus to be included in the vocabulary, so this value filters nothing out.
Raising it (for example to 3) drops extremely rare words to reduce noise; `min_df` filters by the number of documents instead.

* `rm_top=2`
Removes the 2 most frequent tokens from the vocabulary.
These are typically very common terms that add little semantic value for topic modeling.
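
To make the two vocabulary filters concrete, the sketch below re-implements their idea in plain Python on a toy corpus. It illustrates the concept only and is not tomotopy's internal code; `build_vocab` and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocab(docs, min_cf=0, rm_top=0):
    """Keep words whose collection frequency is >= min_cf, then drop the rm_top most frequent."""
    cf = Counter(token for doc in docs for token in doc)  # occurrences over the whole corpus
    kept = [(word, n) for word, n in cf.most_common() if n >= min_cf]
    return [word for word, _ in kept[rm_top:]]

toy_docs = [
    ["hotel", "revenue", "month"],
    ["hotel", "revenue", "budget"],
    ["hotel", "glossary"],
]

vocab_all = build_vocab(toy_docs, min_cf=1)               # every word occurs at least once
vocab_frequent = build_vocab(toy_docs, min_cf=2)          # rare words dropped
vocab_no_top = build_vocab(toy_docs, min_cf=1, rm_top=2)  # two most frequent words dropped
print(vocab_all)
print(vocab_frequent)
print(vocab_no_top)
```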

In [23]:
import tomotopy as tp

# Collect the unique labels
unique_labels = sorted(set(df["label"]))

# Create the model (tomotopy's PLDAModel, i.e. Partially Labeled LDA)
model = tp.PLDAModel(
    tw=tp.TermWeight.ONE,   # term weighting scheme
    min_cf=1,   # minimum collection frequency of a term
    rm_top=2   # remove the 2 most frequent terms
    )

# Add the documents together with their labels
for tokens, label in zip(df["tokens"], df["label"]):
    model.add_doc(tokens, labels=[label])

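# Train in steps of 100 iterations; train(n) runs n additional iterations,
# so each pass adds 100 and i is the running total
model.train(0)
for i in range(100, 1000, 100):
    model.train(100)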
    print(f"Log-likelihood after {i} iterations: {model.ll_per_word:.4f}")


Log-likelihood after 100 iterations: -6.7652
Log-likelihood after 200 iterations: -6.7651
Log-likelihood after 300 iterations: -6.7649
Log-likelihood after 400 iterations: -6.7649
Log-likelihood after 500 iterations: -6.7649
Log-likelihood after 600 iterations: -6.7649
Log-likelihood after 700 iterations: -6.7649
Log-likelihood after 800 iterations: -6.7649
Log-likelihood after 900 iterations: -6.7649
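
The log shows the per-word log-likelihood flat from a few hundred iterations on. The sketch below makes this explicit: it converts each logged value to per-word perplexity, `exp(-ll_per_word)`, and reports the first checkpoint where the improvement falls below a tolerance. The helper name `first_converged` and the tolerance `5e-5` are assumptions chosen for illustration.

```python
import math

# (iterations, ll_per_word) pairs copied from the training log above.
log = [(100, -6.7652), (200, -6.7651), (300, -6.7649), (400, -6.7649),
       (500, -6.7649), (600, -6.7649), (700, -6.7649), (800, -6.7649), (900, -6.7649)]

def first_converged(history, tol=5e-5):
    """Return the first checkpoint whose gain over the previous checkpoint is below tol."""
    for (_, prev_ll), (step, ll) in zip(history, history[1:]):
        if ll - prev_ll < tol:
            return step
    return None

for step, ll in log:
    print(f"{step:>4} iterations: perplexity = {math.exp(-ll):.1f}")

converged_at = first_converged(log)
print("No measurable gain from iteration:", converged_at)
```

With a log this flat, training could stop early, or the remaining budget could go into trying other hyperparameters.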


## Evaluate the Model

In [25]:
# Show the most relevant words for each label
for i, label in enumerate(model.topic_label_dict):
    print(f"Top words for label '{label}':")
    print(model.get_topic_words(i, top_n=10))
    print()


Top words for label 'informative':
[('report', 0.03986325114965439), ('glossary', 0.0365440808236599), ('quest', 0.03322490677237511), ('information', 0.026586564257740974), ('main', 0.026586564257740974), ('descriptions', 0.023267393931746483), ('tab', 0.023267393931746483), ('fields', 0.023267393931746483), ('definitions', 0.019948221743106842), ('data', 0.019948221743106842)]

Top words for label 'functional':
[('performance', 0.016415858641266823), ('information', 0.014657174237072468), ('block', 0.011726032942533493), ('detail', 0.010553576052188873), ('evolution', 0.010113905183970928), ('data', 0.010113905183970928), ('table', 0.009967347607016563), ('kpis', 0.009234561584889889), ('total', 0.009234561584889889), ('month', 0.008941447362303734)]

Top words for label 'executive':
[('business', 0.020764142274856567), ('executive', 0.017996512353420258), ('kpis', 0.015920789912343025), ('hotel', 0.015228882431983948), ('budget', 0.013845068402588367), ('report'

### Make Inferences

In [33]:
text = "The reports sent by STR every 3 months with forecast data from some markets of %OCC, ADR and RevPar, are consolidated on this tab."
tokens = preprocess(text)
doc = model.make_doc(tokens)
topic_dist, _ = model.infer(doc)
    
# Get most probable label
best_label = max(zip(model.topic_label_dict, topic_dist), key=lambda x: x[1])[0]
print("Topic:",best_label)

Topic: functional
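
Taking only the arg-max hides how confident the model is. The sketch below ranks all labels and abstains when the top probability is below a threshold. The label list and probabilities are made-up stand-ins for `model.topic_label_dict` and `topic_dist`; `rank_labels` and the `0.5` threshold are hypothetical choices, and the pairing assumes one topic per label, as the cell above does.

```python
def rank_labels(labels, topic_dist, min_prob=0.5):
    """Return (best_label_or_None, ranking), with ranking sorted by descending probability."""
    ranking = sorted(zip(labels, topic_dist), key=lambda pair: pair[1], reverse=True)
    best_label, best_prob = ranking[0]
    return (best_label if best_prob >= min_prob else None), ranking

# Made-up values for illustration; in the notebook they would come from
# list(model.topic_label_dict) and model.infer(doc)[0].
demo_labels = ["informative", "functional", "executive"]
confident = [0.10, 0.75, 0.15]
uncertain = [0.30, 0.38, 0.32]

print(rank_labels(demo_labels, confident))
print(rank_labels(demo_labels, uncertain))
```

Views for which the function returns `None` can be routed to manual review instead of being tagged automatically.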


## Use the model

In [34]:
# Apply the model to views without a category
unlabeled_df = pd.read_excel("../raw/Reporting_Inventory.xlsx", sheet_name="Views")
unlabeled_df = unlabeled_df[unlabeled_df["Category"].isna() & unlabeled_df["Description"].notna()]
unlabeled_df["tokens"] = unlabeled_df["Description"].apply(preprocess)

# Predict the most probable category
predictions = []
for tokens in unlabeled_df["tokens"]:
    doc = model.make_doc(tokens)
    topic_dist, _ = model.infer(doc)

    # Pick the label with the highest probability
    best_label = max(zip(model.topic_label_dict, topic_dist), key=lambda x: x[1])[0]
    predictions.append(best_label)

unlabeled_df["predicted_category"] = predictions
unlabeled_df.head(10)


Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority,tokens,predicted_category
182,RPPBI0034,Corporate Market Share - 2024,Raven Jordan,CharacterReport.pbix,STR Forecast Dashboard 2024,"The reports sent by STR every 3 months with forecast data from some markets of %OCC, ADR and RevPar, are consolidated on this tab.",,Productive,,Cities available,"Occupancy, ADR, RevPar",%Chg last 2 forecast,"Forecast Month, Flag STR is Yes, Hotel_Name is not Hotel Puebla Finsa or Hotel Curitiba The Five or Hotel Lisboa Campo Grande","STR Forecast, Corporate Market Share, 2024",Priority 1,"[reports, sent, str, every, months, forecast, data, markets, occ, adr, revpar, consolidated, tab]",functional
183,RPPBI0034,Corporate Market Share - 2024,Raven Jordan,CharacterReport.pbix,STR Forecast Dashboard 2025,"The reports sent by STR every 3 months with forecast data from some markets of %OCC, ADR and RevPar, are consolidated on this tab.",,Productive,,Cities available,"Occupancy, ADR, RevPar",%Chg last 2 forecast,"Forecast Month, Flag STR is Yes, Hotel_Name is not Hotel Puebla Finsa or Hotel Curitiba The Five or Hotel Lisboa Campo Grande","STR Forecast, Corporate Market Share, 2024",Priority 1,"[reports, sent, str, every, months, forecast, data, markets, occ, adr, revpar, consolidated, tab]",functional
259,RPPBI0150,Corporate Market Share - 2025,Matthew Callahan,SameReport.pbix,STR Forecast Dashboard 2025,"The reports sent by STR every 3 months with forecast data from some markets of %OCC, ADR and RevPar, are consolidated on this tab.",,Productive,,Cities available,"Occupancy, ADR, RevPar",%Chg last 2 forecast,"Forecast Month, Flag STR is Yes, Hotel_Name is not Hotel Puebla Finsa or Hotel Curitiba The Five or Hotel Lisboa Campo Grande","STR Forecast, Corporate Market Share",Priority 1,"[reports, sent, str, every, months, forecast, data, markets, occ, adr, revpar, consolidated, tab]",functional
320,RPPBI0173,Daily Revenue Report 2025,Tasha Hall,AboutReport.pbix,Pick Up Channel Detail,DELETED,,,,,,,,,Priority 1,[deleted],functional
358,RPPBI0062,Price Competitiveness,Nicole Carter,AboutReport.pbix,Booking Criteria,"This view is exclusively for Booking.com,given that they have their offensive criteria. They stablish that a searchis considered offensive when the price difference is greater then 3% and the ranking position is less than 4",,Productive,,"BU, Country, City, Hotel, Brand, META, OTA",,,,,Priority 1,"[view, exclusively, bookingcomgiven, offensive, criteria, stablish, searchis, considered, offensive, price, difference, greater, ranking, position, less]",executive
362,RPPBI0062,Price Competitiveness,Nicole Carter,AboutReport.pbix,Page 1,internal,,Internal,,,,,,,Priority 1,[internal],functional
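
Rows such as `DELETED` and `internal` carry no descriptive text, so their `functional` prediction likely reflects the dominant class rather than the content. One possible safeguard is sketched below, under the assumption that a description needs a minimum number of informative tokens to be worth classifying. `is_classifiable`, `min_tokens=3`, and the placeholder set are hypothetical.

```python
# Assumption: placeholder words taken from the rows shown above.
PLACEHOLDER_TOKENS = {"deleted", "internal"}

def is_classifiable(tokens, min_tokens=3):
    """True when enough non-placeholder tokens remain to attempt a prediction."""
    informative = [t for t in tokens if t not in PLACEHOLDER_TOKENS]
    return len(informative) >= min_tokens

# Token lists keyed by row index, abbreviated from the table above.
rows = {
    320: ["deleted"],
    362: ["internal"],
    182: ["reports", "sent", "str", "every", "months", "forecast", "data"],
}

to_predict = {idx: toks for idx, toks in rows.items() if is_classifiable(toks)}
skipped = sorted(set(rows) - set(to_predict))
print("Predict:", sorted(to_predict))
print("Skip:", skipped)
```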
