# Topic-Based Similarity between Report Views

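This notebook uses a previously trained LDA (Latent Dirichlet Allocation) topic model to infer a topic distribution for each report view description, then compares views pairwise with the **Jensen-Shannon distance**. Similarity is reported as one minus the base-2 distance, so scores fall between 0 and 1. This allows us to: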
- Represent each report view as a topic distribution vector.
- Find the most thematically similar views for any given one.
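
To make the distance concrete, here is a minimal pure-Python sketch of the Jensen-Shannon distance: the square root of the average Kullback-Leibler divergence of each distribution from their midpoint. The function name `js_distance` and the sample vectors are illustrative assumptions; the notebook itself relies on `scipy.spatial.distance.jensenshannon`.

```python
import math

def js_distance(p, q, base=2.0):
    """Jensen-Shannon distance between two discrete distributions."""
    # Midpoint distribution between p and q
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0
        return sum(a * math.log(a / b, base) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Clamp tiny negative rounding errors before taking the square root
    return math.sqrt(max(divergence, 0.0))

# Hypothetical two-topic vectors, for illustration only
same = js_distance([0.9, 0.1], [0.9, 0.1])
close = js_distance([0.9, 0.1], [0.8, 0.2])
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
```

With base 2 the distance is bounded by 1, so identical distributions give a similarity of 1 and distributions with no overlap give a similarity of 0.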


In [1]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon
import numpy as np

# Load data
df = pd.read_excel("../raw/Reporting_Inventory.xlsx", sheet_name="Views")

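# Clean text: lowercase, lemmatize, and drop stop words, non-alphabetic
# tokens, and tokens of two characters or fewer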
nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(str(text).lower())
    return " ".join([
        token.lemma_ for token in doc
        if token.is_alpha and not token.is_stop and len(token) > 2
    ])

df["clean_text"] = df["Description"].apply(preprocess)
df.head(2)

Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority,clean_text
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1,methodolody definition algorithim feed market
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,view focus understand performance hotel specif...


In [2]:
from pathlib import Path
import joblib

# Paths to the saved artifacts
model_path = Path("../models/lda_model.pkl")
vectorizer_path = Path("../models/vectorizer.pkl")

# Load the LDA model and the vectorizer
lda = joblib.load(model_path)
vectorizer = joblib.load(vectorizer_path)

# Convert text to bag-of-words
bow = vectorizer.transform(df["clean_text"])

# Infer Topics
doc_topics = lda.transform(bow)
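
The inferred topic vectors are easier to read once each topic has a few representative words attached. The sketch below assumes the loaded model exposes scikit-learn's `components_` attribute and the vectorizer provides `get_feature_names_out()`. The helper `top_words_per_topic`, the toy matrix, and the toy vocabulary are hypothetical stand-ins, not values from the trained model.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_words=3):
    """Return the n_words highest-weighted terms for each topic row."""
    components = np.asarray(components, dtype=float)
    names = np.asarray(feature_names)
    # argsort is ascending, so reverse each row and keep the first n_words
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_words]
    return [names[row].tolist() for row in order]

# Toy stand-ins for lda.components_ and vectorizer.get_feature_names_out()
toy_components = [[5.0, 0.2, 3.1, 0.1],
                  [0.3, 4.2, 0.1, 2.5]]
toy_vocab = ["revenue", "energy", "segment", "cost"]

topics = top_words_per_topic(toy_components, toy_vocab, n_words=2)
```

In the notebook, the equivalent call would presumably be `top_words_per_topic(lda.components_, vectorizer.get_feature_names_out())`.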



In [3]:
def get_similar_views(doc_idx, top_n=3):
    query_vec = doc_topics[doc_idx]
    similarities = []

    for i, vec in enumerate(doc_topics):
        if i != doc_idx:
            dist = jensenshannon(query_vec, vec, base=2)
            sim = 1 - dist
            similarities.append((i, sim))

    # Guard against NaN scores (which can arise when two vectors are exactly
    # equal) by replacing them with 0.0, and attach each view's topic vector
    similarities = [(i, 0.0 if np.isnan(s) else s, doc_topics[i]) for i, s in similarities]

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]
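
For larger inventories, the Python loop can be replaced by a single vectorized pass. The following is a sketch under stated assumptions, not the notebook's method: `js_similarities` is a hypothetical helper that reimplements the base-2 Jensen-Shannon distance with NumPy and clips small negative rounding errors so the square root never produces NaN. The sample vectors are illustrative.

```python
import numpy as np

def js_similarities(query, matrix):
    """Return 1 - JS distance (base 2) between `query` and each row of `matrix`."""
    p = np.asarray(query, dtype=float)
    q = np.asarray(matrix, dtype=float)
    m = 0.5 * (p + q)  # midpoint distributions, one per row

    def kl(a, b):
        # Terms where a == 0 contribute 0 by convention
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(a > 0, a * np.log2(a / b), 0.0)
        return terms.sum(axis=-1)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Clip rounding noise below zero before the square root
    return 1.0 - np.sqrt(np.clip(jsd, 0.0, None))

# Hypothetical two-topic vectors, for illustration only
toy_topics = np.array([[0.90, 0.10], [0.88, 0.12], [0.10, 0.90]])
sims = js_similarities(toy_topics[0], toy_topics)
ranking = np.argsort(-sims)  # most similar first; the query itself leads
```

Unlike `get_similar_views`, this version scores the query against itself as well, so the first entry of `ranking` would need to be skipped when listing neighbours.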

In [None]:
# Suppose you want to find views related to this one
view_name = "Channel Mix View"

# First, locate the reference view in the "Report View" column
example_idx = df[df["Report View"] == view_name].index[0]

print("=" * 80)
print("REFERENCE VIEW")
print("-" * 80)
print(f" Report View : {df.iloc[example_idx]['Report View']}")
print(f" Topics      : {doc_topics[example_idx]}")
print(f" Description : {df.iloc[example_idx]['Description']}")
print("=" * 80)

# Explore similar views
similar_views = get_similar_views(example_idx, top_n=5)

print("\n MOST SIMILAR VIEWS\n")
for idx, score, vector in similar_views:
    print("-" * 80)
    print(f" Report Name : {df.iloc[idx]['Report Name']}")
    print(f" Report View : {df.iloc[idx]['Report View']}")
    print(f" Similarity  : {score:.4f}")
    print(f" Topics      : {vector}")
    print(f" Description : {df.iloc[idx]['Description']}")
print("-" * 80)


REFERENCE VIEW
--------------------------------------------------------------------------------
 Report View : Channel Mix View
 Topics      : [0.87715154 0.12284846]
 Description : HIgher level of detail drilling down total revenue by segment budget to a channel mix level

 MOST SIMILAR VIEWS

--------------------------------------------------------------------------------
 Report Name : Sustainability Report
 Report View : COSTS
 Similarity  : 0.9991
 Topics      : [0.87787295 0.12212705]
 Description : Energy and utilites costs performance. It includes a BI Costs Forecast based on the Energy Forecast also calculated by BI Team.
--------------------------------------------------------------------------------
 Report Name : Energy Report
 Report View : COSTS
 Similarity  : 0.9991
 Topics      : [0.87787295 0.12212705]
 Description : Energy and utilites costs performance. It includes a BI Costs Forecast based on the Energy Forecast also calculated by BI Team.
-------------------

In [5]:
# Optional: Save topic vectors for future use
topic_df = pd.DataFrame(doc_topics, index=df["Report View"])
topic_df.to_csv("../models/report_views_topic_vectors.csv")
