# Topic-Based Similarity between Report Views

This notebook fits an unsupervised Latent Dirichlet Allocation (LDA) model to the report view descriptions, extracts a topic distribution for each view, and compares views pairwise using the **Jensen-Shannon distance** (similarity is reported as one minus that distance). This allows us to:
- Represent each report view as a topic distribution vector.
- Find the most thematically similar views for any given one.
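
To make the distance concrete, here is a minimal NumPy sketch of the Jensen-Shannon distance with base-2 logarithms, which is the quantity `scipy.spatial.distance.jensenshannon(p, q, base=2)` is documented to return. The helper name `js_distance` and the example vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the JS divergence), log base 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise so both inputs are proper probability distributions
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a == 0 contribute nothing
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Clamp tiny negative values caused by floating-point rounding
    return float(np.sqrt(max(jsd, 0.0)))

# Hypothetical topic vectors: two views dominated by the same topic, one by another
view_a = [0.90, 0.05, 0.05]
view_b = [0.85, 0.10, 0.05]
view_c = [0.05, 0.05, 0.90]

print("a vs b:", js_distance(view_a, view_b))
print("a vs c:", js_distance(view_a, view_c))
```

With base 2 the distance lies between 0 (identical distributions) and 1 (no shared support), so `1 - distance` can be read as a similarity score on the same scale.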


In [1]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon
import numpy as np

# Load data
df = pd.read_excel("../raw/Reporting_Inventory.xlsx", sheet_name="Views")

# Clean text
nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(str(text).lower())
    return " ".join([
        token.lemma_ for token in doc
        if token.is_alpha and not token.is_stop and len(token) > 2
    ])

df["clean_text"] = df["Description"].apply(preprocess)
df.head(2)

Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority,clean_text
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1,methodolody definition algorithim feed market
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1,view focus understand performance hotel specif...


In [6]:
# Convert text to bag-of-words
vectorizer = CountVectorizer(max_df=0.95, min_df=2)
X = vectorizer.fit_transform(df["clean_text"])

# Train LDA
lda = LatentDirichletAllocation(n_components=3, random_state=42)
doc_topics = lda.fit_transform(X)
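
Before comparing views, it can help to check what each topic represents. The sketch below lists the highest-weighted words per topic from a topic-word weight matrix. The helper `top_words_per_topic`, the toy matrix, and the toy vocabulary are assumptions for illustration; with the objects fitted above one would presumably pass `lda.components_` and `vectorizer.get_feature_names_out()` instead.

```python
import numpy as np

def top_words_per_topic(components, feature_names, n_words=5):
    """Return the n_words highest-weighted vocabulary terms for each topic row."""
    components = np.asarray(components, dtype=float)
    feature_names = np.asarray(feature_names)
    # Sort each row by descending weight and keep the first n_words column indices
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_words]
    return [feature_names[row].tolist() for row in order]

# Toy stand-ins (hypothetical values, not taken from the fitted model)
toy_components = np.array([
    [5.0, 1.0, 0.2, 3.0],
    [0.1, 4.0, 6.0, 0.5],
])
toy_vocab = ["revenue", "channel", "segment", "hotel"]

for topic_id, words in enumerate(top_words_per_topic(toy_components, toy_vocab, n_words=2)):
    print(f"Topic {topic_id}: {', '.join(words)}")
```

If the topics are hard to tell apart from their top words, that is a hint that `n_components=3` may be worth revisiting.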


In [28]:
def get_similar_views(doc_idx, top_n=3):
    query_vec = doc_topics[doc_idx]
    similarities = []

    for i, vec in enumerate(doc_topics):
        if i != doc_idx:
            dist = jensenshannon(query_vec, vec, base=2)
            # Guard against NaN, which floating-point rounding may produce
            # when two vectors are (almost) exactly equal; treat it as distance 0
            if np.isnan(dist):
                dist = 0.0
            sim = 1 - dist
            # Keep the topic vector alongside the score for display
            similarities.append((i, sim, vec))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]
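
For larger inventories, the same ranking can be computed without a Python loop. The following is a rough vectorized sketch using only NumPy; `most_similar` and `toy_topics` are hypothetical names, and the function returns `(index, similarity)` pairs rather than the three-element tuples used above.

```python
import numpy as np

def most_similar(doc_topics, doc_idx, top_n=3):
    """Rank all other rows by 1 - Jensen-Shannon distance (base 2) to row doc_idx."""
    P = np.asarray(doc_topics, dtype=float)
    Q = np.broadcast_to(P[doc_idx], P.shape)  # query vector repeated per row
    M = 0.5 * (P + Q)                         # row-wise mixture distributions

    def kl_rows(A, B):
        # Row-wise KL divergence; entries where A == 0 contribute zero
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(A > 0, A * np.log2(A / B), 0.0)
        return terms.sum(axis=1)

    jsd = 0.5 * kl_rows(P, M) + 0.5 * kl_rows(Q, M)
    dist = np.sqrt(np.clip(jsd, 0.0, None))   # clip rounding noise below zero
    sims = 1.0 - dist
    sims[doc_idx] = -np.inf                   # never return the query itself

    k = min(top_n, len(P) - 1)
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical topic vectors standing in for doc_topics
toy_topics = np.array([
    [0.90, 0.05, 0.05],
    [0.88, 0.07, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90],
])

for idx, score in most_similar(toy_topics, 0, top_n=2):
    print(f"row {idx}: similarity {score:.4f}")
```

Clipping the divergence at zero before the square root plays the same role as the NaN guard in the loop version.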

In [31]:
# Suppose you want to look up this view
view_name = "Channel Mix View"

# Find the first exact match in the "Report View" column
example_idx = df[df["Report View"] == view_name].index[0]
print(f"Report View: {df.iloc[example_idx]['Report View']} [{doc_topics[example_idx]}]")

similar_views = get_similar_views(example_idx)
print("\nMost similar views:\n")
for idx, score, vector in similar_views:
    print(f"- {df.iloc[idx]['Report View']} (Similarity: {score:.4f}) [{vector}]")


Report View: Channel Mix View [[0.94190054 0.02881268 0.02928678]]

Most similar views:

- Upselling (Similarity: 0.9978) [[0.94152291 0.02958134 0.02889575]]
- Upselling (Similarity: 0.9978) [[0.94152291 0.02958134 0.02889575]]
- COSTS (Similarity: 0.9950) [[0.94430447 0.02699633 0.0286992 ]]


In [17]:
# Optional: Save topic vectors for future use
topic_df = pd.DataFrame(doc_topics, index=df["Report View"])
topic_df.to_csv("../models/report_views_topic_vectors.csv")
