# Keyword Enrichment from Verbs using POS Tagging

This notebook uses **spaCy** part-of-speech (POS) tagging to extract **verbs** from the `Description` field of each report view, then adds them to the `keywords` column in `reports.csv`.

Students will:
- Load the `Views` sheet from `Reporting_Inventory.xlsx`
- Use `spaCy` to identify the verbs in each `Description`
- Merge them with the existing `keywords` field
- Save the updated file for the API to use

In [1]:
# Install spaCy and download the English model if needed
# !pip install spacy
# !python -m spacy download en_core_web_sm

import warnings
from pathlib import Path

import pandas as pd
import spacy

# Silence openpyxl UserWarnings raised while reading the Excel file
warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

# Load the small English language model
nlp = spacy.load("en_core_web_sm")

In [2]:
# Display full content in cells (not truncated)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 100)

# Load the original Views data
views_path = Path("../raw/Reporting_Inventory.xlsx")
views_df = pd.read_excel(views_path, sheet_name="Views")
views_df.fillna("", inplace=True)

# Keep only the rows that have a non-empty 'Description'
views_df = views_df[views_df["Description"].str.strip() != ""]
views_df.head(2)

Unnamed: 0,ID Data Product,Report Name,Product Owner,PBIX_File,Report View,Description,Category,Status,Rename,Dimensions,KPIs,Other Terms,Filters,Tags,Priority
0,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,CRITERIA,Methodolody and definition of the algorithim of Feeder Market,Informative,Productive,,,,,,,Priority 1
1,RPPBI0032,Feeder Market - 2024,Jonathan Shields,LifeReport.pbix,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by hotel for a specific feeder market o selection of feeder marktes.,Functional,Productive,,"Hotel, month, Feeder Market, Segment, Channel Mix, Room Type","Total Revenue, Room Revenue, RN, Lead Time, Lenght of Stay, AOV, ADR, ADR Net, %Cost",,,,Priority 1


## Extract verbs from Description using spaCy

In [3]:
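def extract_verbs(text):
    """Return the unique verb lemmas in `text` as a sorted, comma-separated string."""
    doc = nlp(str(text))
    # Keep lowercased lemmas of tokens tagged VERB; skip lemmas shorter than 3 characters
    verbs = {
        token.lemma_.lower()
        for token in doc
        if token.pos_ == "VERB" and len(token.lemma_) > 2
    }
    return ", ".join(sorted(verbs))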

# Apply extraction to each row
views_df["verb_keywords"] = views_df["Description"].apply(extract_verbs)

# Preview results
views_df[["Report Name", "Report View", "Description","verb_keywords"]].head(10)

Unnamed: 0,Report Name,Report View,Description,verb_keywords
0,Feeder Market - 2024,CRITERIA,Methodolody and definition of the algorithim of Feeder Market,
1,Feeder Market - 2024,DESTINATION_OF_FEEDER_MARKETS,View focused on understand the performance by hotel for a specific feeder market o selection of feeder marktes.,"focus, understand"
2,Feeder Market - 2024,EXECUTIVE VIEW,Global view to understand Feeder Market Performance compared to previous years diferentiating between domestic and international,"compare, understand"
3,Feeder Market - 2024,FEEDER MARKET FLOWS,"View focused on understanding the booking behaviour by Feeder Market. It allows to understand when, where and through which channels and segments are producing the different feeder markets for a selected booking period. Besides, it shows the flow (Feeder Market to Destination) by contribution of total revenue","allow, book, focus, produce, select, show, understand"
4,Feeder Market - 2024,FEEDER_MARKET_DETAIL,"Detail view of Feeder Markets by Destination including more indepth view by channel, and including Top_Agency and Top_Company information",include
5,Feeder Market - 2024,FEEDER_MARKETS_OF_DESTINATION,VIew focused on understanding the feeder markets producing at a specific Destination,"focus, produce, understand"
6,Feeder Market - 2024,MENU,Index page with interactive buttons to other views.,
7,Feeder Market - 2024,OE MARKET INSIGHTS,Benchmark by Destination. Outside information is provided by Oxford Economics providing a summary developed by AI,"develop, provide"
8,Feeder Market - 2024,TARGETS FOLLOW UP,"View that provides performance vs budget at a feeder Market level. It allows to drill down by destination, segment and channel","allow, drill, provide, view"
9,Feeder Market - 2025,CRITERIA,Methodolody and definition of the algorithim of Feeder Market,
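
In the preview above, row 8 lists `view` among its verbs even though "View" opens that description as a noun, so the tagger output may benefit from light cleanup before it is merged. A minimal sketch of one option, assuming a small hand-maintained blocklist (`NOUN_LIKE` and `filter_verbs` are hypothetical names, not part of the notebook, and the blocklist contents are an assumption to tune by inspecting the results):

```python
# Hypothetical blocklist: lemmas that the tagger mislabels as verbs in this
# dataset. The single entry comes from row 8 of the preview; extend as needed.
NOUN_LIKE = {"view"}


def filter_verbs(verb_keywords, blocklist=NOUN_LIKE):
    """Drop blocklisted lemmas from a comma-separated verb string."""
    verbs = [v.strip() for v in str(verb_keywords).split(",") if v.strip()]
    return ", ".join(v for v in verbs if v not in blocklist)


print(filter_verbs("allow, drill, provide, view"))  # allow, drill, provide
```

It could be applied with `views_df["verb_keywords"].apply(filter_verbs)` after the extraction step.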


## Merge verb keywords into the `keywords` column of `reports.csv`
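
The left merge in the next cell joins on `ID Data Product` and `Report View`. If the `Views` sheet ever holds two rows with the same key pair, each matching report row is duplicated in the result. Whether that happens in this dataset is not known, so the following is only a sketch on made-up toy frames. It uses the `validate` argument of `DataFrame.merge` to detect the problem and a `groupby` to collapse duplicates beforehand:

```python
import pandas as pd

# Toy frames standing in for reports_df and views_df (values are made up).
reports = pd.DataFrame(
    {"ID Data Product": ["R1"], "Report View": ["MENU"], "keywords": ["index"]}
)
views = pd.DataFrame(
    {
        "ID Data Product": ["R1", "R1"],
        "Report View": ["MENU", "MENU"],
        "verb_keywords": ["show", "list"],
    }
)
keys = ["ID Data Product", "Report View"]

# A plain left merge returns one output row per matching view row.
plain = reports.merge(views, on=keys, how="left")
print(len(plain))

# validate="many_to_one" raises MergeError when the right-hand keys repeat.
try:
    reports.merge(views, on=keys, how="left", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

# Collapse duplicate keys by joining their verb strings, then merge safely.
deduped = views.groupby(keys, as_index=False)["verb_keywords"].agg(", ".join)
merged = reports.merge(deduped, on=keys, how="left", validate="many_to_one")
print(merged)
```

Any verbs repeated by the join are removed later, because `merge_keywords` builds a set from each string.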

In [4]:
# Load existing reports.csv
reports_path = Path("../api/reports.csv")
reports_df = pd.read_csv(reports_path)
reports_df.fillna("", inplace=True)

# Merge verb keywords from views_df
merged_df = pd.merge(
    reports_df,
    views_df[["ID Data Product", "Report View", "verb_keywords"]],
    on=["ID Data Product", "Report View"],
    how="left",
    suffixes=("", "_new")
)

merged_df["verb_keywords"] = merged_df["verb_keywords"].fillna("")

# Combine existing keywords with verbs
def merge_keywords(original, new):
    """Union two comma-separated keyword strings into one sorted, de-duplicated string."""
    orig_set = {kw.strip() for kw in str(original).split(",") if kw.strip()}
    new_set = {kw.strip() for kw in str(new).split(",") if kw.strip()}
    return ", ".join(sorted(orig_set | new_set))

merged_df["keywords"] = merged_df.apply(
    lambda row: merge_keywords(row["keywords"], row["verb_keywords"]), axis=1
)

# Drop helper column and save
merged_df.drop(columns=["verb_keywords"], inplace=True)
merged_df.to_csv(reports_path, index=False)
print("Updated 'keywords' with extracted verbs and saved to reports.csv")

Updated 'keywords' with extracted verbs and saved to reports.csv
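
`merge_keywords` compares strings exactly, while `extract_verbs` always emits lowercase lemmas. If the existing `keywords` column contains capitalised entries (an assumption, since the contents of `reports.csv` are not shown here), a keyword such as `Compare` and the verb `compare` would both survive the merge. A possible case-insensitive variant, with `merge_keywords_ci` as a hypothetical name, might look like this:

```python
def merge_keywords_ci(original, new):
    """Union two comma-separated keyword strings, ignoring case.

    The first spelling seen for each keyword is the one that is kept.
    """
    seen = {}
    for kw in f"{original},{new}".split(","):
        kw = kw.strip()
        if kw:
            # setdefault keeps the earliest spelling for each lowercase key
            seen.setdefault(kw.lower(), kw)
    return ", ".join(sorted(seen.values(), key=str.lower))


print(merge_keywords_ci("Revenue, Compare", "compare, understand"))
# Compare, Revenue, understand
```

Sorting by `str.lower` keeps the output in alphabetical order regardless of capitalisation, whereas a plain `sorted` places uppercase entries before lowercase ones.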
