# Report Description Generator with GPT2

This notebook demonstrates how to use the pretrained GPT-2 language model (the `gpt2` checkpoint, trained primarily on English text) to automatically generate descriptions for report views from selected metadata fields.

<a href="https://colab.research.google.com/github/cbadenes/semantic-report-search/blob/main/data/analysis/32_text_generation.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import pandas as pd

In [2]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)



In [3]:
# Make sure the Excel file is accessible in the current environment
df = pd.read_excel("Reporting_Inventory.xlsx", sheet_name="Views")
df.head()

Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim o...,Informative,Productive,,,,,,,Priority 1
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by ...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
2,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,EXECUTIVE VIEW,Global view to understand Feeder Market Perfor...,Executive,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
3,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER MARKET FLOWS,View focused on understanding the booking beha...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
4,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,FEEDER_MARKET_DETAIL,Detail view of Feeder Markets by Destination i...,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel ...","Total Revenue, Room Revenue, RN, Lead Time, Le...",,,,Priority 1
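
The later cells rely on a handful of columns from the `Views` sheet (`Report Name`, `Report View`, `Category`, `KPIs`). A minimal sketch of a check that could run right after loading is shown below; the helper name `missing_columns` and the `REQUIRED_COLUMNS` list are hypothetical and not part of the original notebook.

```python
# Columns referenced by the prompt-building and display cells of this notebook.
# This list is an assumption; adjust it to the fields you actually use.
REQUIRED_COLUMNS = ["Report Name", "Report View", "Category", "KPIs"]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]


# Example with a plain list of names; in the notebook you would pass df.columns.
print(missing_columns(["Report Name", "Report View", "Category"]))
```

Calling `missing_columns(df.columns)` and stopping when the result is non-empty surfaces a renamed or missing column before any text is generated.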


In [25]:
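def build_input_text(row):
    # row.get() only falls back when the column is absent; empty cells arrive
    # as NaN, so substitute the default explicitly.
    def field(name, default):
        value = row.get(name)
        return default if pd.isna(value) else value

    return (
        f"Report Name: {row['Report Name']}. "
        f"Category: {field('Category', 'N/A')}. "
        f"KPIs: {field('KPIs', 'not specified')}. "
        "Suggested Description:"
    )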

df["input_text"] = df.apply(build_input_text, axis=1)
df[["Report Name", "Report View", "input_text"]].head()


Unnamed: 0,Report Name,Report View,input_text
0,Feeder Market - 2024,CRITERIA,Report Name: Feeder Market - 2024. Category: I...
1,Feeder Market - 2024,DESTINATION_OF_FEEDER_MARKETS,Report Name: Feeder Market - 2024. Category: F...
2,Feeder Market - 2024,EXECUTIVE VIEW,Report Name: Feeder Market - 2024. Category: E...
3,Feeder Market - 2024,FEEDER MARKET FLOWS,Report Name: Feeder Market - 2024. Category: F...
4,Feeder Market - 2024,FEEDER_MARKET_DETAIL,Report Name: Feeder Market - 2024. Category: F...


In [26]:
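def generate_description(input_text, max_length=80):
    # Tokenize the prompt; the tokenizer also returns the attention mask
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,  # total length in tokens, prompt included
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        do_sample=True,  # sample instead of greedy decoding
        top_k=50,
        top_p=0.95,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token
    )
    # Decode only the tokens generated after the prompt
    prompt_length = inputs["input_ids"].shape[1]
    generated = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return generated.strip()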
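
The `temperature`, `top_k`, and `top_p` arguments passed to `model.generate` control how the next token is sampled. The toy sketch below illustrates the idea on five made-up logits: temperature rescales the logits before the softmax, top-k keeps the k most likely tokens, and top-p (nucleus) keeps the smallest prefix of those whose renormalized probability mass reaches `top_p`. The helper names and the logits are hypothetical, and this is not the library's implementation, only an illustration of the same steps.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Return the token indices that survive top-k, then top-p filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass  # renormalize within the top-k set
        if cumulative >= top_p:
            break
    return kept


toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for five tokens
probs_default = softmax(toy_logits)
probs_sharp = softmax(toy_logits, temperature=0.7)
kept = filter_top_k_top_p(probs_sharp, top_k=3, top_p=0.95)

print(f"P(top token) at T=1.0: {probs_default[0]:.3f}")
print(f"P(top token) at T=0.7: {probs_sharp[0]:.3f}")
print(f"Indices kept for sampling: {kept}")
```

A temperature below 1 concentrates probability on the most likely tokens, so lowering it makes the output more conservative; raising `top_k` or `top_p` widens the set of candidates the sampler can choose from.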

In [27]:
# Example: generate a description for the second report view (index 1); the first row has no KPIs
sample_input = df.loc[1, "input_text"]
sample_description = generate_description(sample_input)
print(f"Input text:\n{sample_input}\n")
print(f"Generated description:\n{sample_description}")

# To generate descriptions for all views, uncomment the lines below:
# df["Generated_Description"] = df["input_text"].apply(generate_description)
# df[["Report Name", "Generated_Description"]].head()

Input text:
Report Name: Feeder Market - 2024. Category: Functional. KPIs: Total Revenue, Room Revenue, RN, Lead Time, Lenght of Stay, AOV, ADR, ADR Net, %Cost. Suggested Description:

Generated description:
The Feeders Market is the fastest growing service in the state, and is expected to grow to $1.3 billion by 2024, a year after
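
Because `max_length` caps the total number of tokens, the generated text can stop mid-sentence, as in the sample above. One possible post-processing step is to drop the trailing fragment after the last complete sentence. The sketch below assumes a sentence ends with `.`, `!`, or `?` followed by whitespace or the end of the text; the helper name `trim_to_last_sentence` is hypothetical.

```python
import re


def trim_to_last_sentence(text):
    """Cut `text` after its last sentence terminator, if it has one."""
    # A terminator counts only when followed by whitespace or end of text,
    # so decimal points such as the one in "$1.3" are ignored.
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text.strip()  # no complete sentence: leave the text as is
    return text[: matches[-1].end()].strip()


print(trim_to_last_sentence("Revenue grew in 2024. Bookings are expected to"))
```

Text with no complete sentence, such as the sample output above, is returned unchanged, so the helper never produces an empty description.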


In [None]:
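# Requires the "Generated_Description" column created by the commented-out apply() call above
df[["Report Name", "Generated_Description"]].to_csv("generated_descriptions.csv", index=False)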
