# Semantic Similarity Between Keywords Using Word Embeddings

This notebook demonstrates how to:
- Load keywords from `reports.csv`
- Use `flair` to embed each keyword with static word embeddings (e.g., fastText or GloVe)
- Embed a user-defined query term
- Compute cosine similarity and find the most semantically similar keywords

In [1]:
# !pip install flair

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
from flair.embeddings import WordEmbeddings
from flair.data import Sentence
from sklearn.metrics.pairwise import cosine_similarity

## Load and prepare keywords from reports.csv

In [3]:
# Load reports.csv and extract keywords
reports_path = Path("../api/reports.csv")
df = pd.read_csv(reports_path).fillna("")

keywords = set()
for kw_list in df["keywords"]:
    kws = [k.strip().lower() for k in kw_list.split(",") if k.strip()]
    keywords.update(kws)
keywords = sorted(keywords)
print(f"Loaded {len(keywords)} unique keywords.")

Loaded 1715 unique keywords.
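The normalization in the cell above (split on commas, strip whitespace, lower-case, de-duplicate, sort) can be checked on a tiny in-memory table. This is a minimal sketch: `toy_df` and `extract_keywords` are hypothetical names introduced for illustration, and the rows are invented stand-ins for `reports.csv`.

```python
import pandas as pd

# Hypothetical stand-in for reports.csv: one comma-separated string per report.
toy_df = pd.DataFrame(
    {"keywords": ["Revenue, earnings ,  ", "", "earnings, Business Travel"]}
)

def extract_keywords(frame, column="keywords"):
    """Return the sorted, de-duplicated, lower-cased keywords found in `column`."""
    unique = set()
    for cell in frame[column].fillna(""):
        # Skip fragments that are empty after stripping (e.g. trailing commas).
        unique.update(k.strip().lower() for k in cell.split(",") if k.strip())
    return sorted(unique)

toy_keywords = extract_keywords(toy_df)
print(toy_keywords)  # ['business travel', 'earnings', 'revenue']
```

Empty cells and whitespace-only fragments are dropped, and "earnings" appears once even though two rows mention it.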


In [4]:
# Convert to a DataFrame for tabular display
keywords_df = pd.DataFrame(keywords, columns=["keyword"])
keywords_df.to_csv("../api/keywords_alphabetical.csv", index=False)

keywords_df.head(20)  # show the first 20 rows



Unnamed: 0,keyword
0,"""tripadvisor"
1,% offense
2,%share/channel
3,& ly
4,(account
5,(co2
6,(corporate sales
7,(cover
8,(eft
9,(gid


## Load Word Embedding model

See the flair documentation on classic word embeddings: https://flairnlp.github.io/docs/tutorial-embeddings/classic-word-embeddings

In [5]:
embedding = WordEmbeddings('en')  # 'en' loads fastText vectors; alternative: 'en-glove' (GloVe)
print("Embedding model loaded.")

Embedding model loaded.


## Embed each keyword and build a dictionary

In [6]:
keyword_vectors = {}
for kw in keywords:
    sentence = Sentence(kw, use_tokenizer=True)
    embedding.embed(sentence)
    if len(sentence) > 0:
        # Average the token embeddings so a multi-word keyphrase gets a single vector
        vector = np.mean([token.embedding.cpu().numpy() for token in sentence], axis=0)
        keyword_vectors[kw] = vector

print(f"Embedded {len(keyword_vectors)} keywords.")

Embedded 1715 keywords.
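To make the averaging step concrete, here is a minimal sketch of mean pooling with plain NumPy. The 3-dimensional vectors and the `mean_pool` helper are invented for illustration; real fastText or GloVe vectors have hundreds of dimensions. The sketch skips tokens it has no vector for, which is an assumption of this example and may differ from how flair handles out-of-vocabulary tokens.

```python
import numpy as np

# Toy 3-dimensional token vectors (hypothetical values, for illustration only).
toy_token_vectors = {
    "business": np.array([1.0, 0.0, 2.0]),
    "travel": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, vectors):
    """Average the vectors of the tokens that are present in `vectors`."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None  # nothing to average
    return np.mean(found, axis=0)

phrase_vector = mean_pool("business travel".split(), toy_token_vectors)
print(phrase_vector)  # [2. 1. 1.]
```

Averaging discards word order, so "business travel" and "travel business" would receive the same vector.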


## Search for keywords similar to a given query

In [None]:
query = "sport"
query_sentence = Sentence(query, use_tokenizer=True)
embedding.embed(query_sentence)

if query_sentence:
    query_vector = np.mean([token.embedding.cpu().numpy() for token in query_sentence], axis=0).reshape(1, -1)
    scores = {}
    for kw, vec in keyword_vectors.items():
        sim = cosine_similarity(query_vector, vec.reshape(1, -1))[0][0]
        scores[kw] = sim

    top_k = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:10]    
    print(f"Top keywords similar to '{query}':\n")
    for kw, score in top_k:
        print(f"{kw}: {score:.4f}")
else:
    print(f"'{query}' could not be embedded.")

Top keywords similar to 'sport':

travel industry spain: 0.5798
commercial team: 0.5734
feeder market & industry: 0.5610
business area: 0.5590
platform and business area: 0.5551
leisure: 0.5543
team: 0.5527
activity category: 0.5506
country: 0.5467
business travel potential: 0.5440
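The search cell calls `cosine_similarity` once per keyword. A possible alternative, sketched here with NumPy only, stacks all keyword vectors into one matrix and scores them in a single pass. The helper name `top_k_similar` and the 2-dimensional toy vectors are hypothetical; the zero-vector guard is a design choice of this sketch (it reports similarity 0 when either vector is all zeros).

```python
import numpy as np

def top_k_similar(query_vec, vectors_by_keyword, k=10):
    """Rank keywords by cosine similarity to `query_vec`, highest first."""
    names = list(vectors_by_keyword)
    matrix = np.stack([vectors_by_keyword[n] for n in names])  # (n_keywords, dim)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    # Guard against all-zero vectors: report similarity 0 instead of dividing by zero.
    safe = np.where(norms > 0, norms, 1.0)
    sims = np.where(norms > 0, (matrix @ query_vec) / safe, 0.0)
    order = np.argsort(-sims, kind="stable")[:k]
    return [(names[i], float(sims[i])) for i in order]

# Toy 2-dimensional vectors (hypothetical values, for illustration only).
toy_vectors = {
    "team": np.array([1.0, 0.0]),
    "leisure": np.array([1.0, 1.0]),
    "invoice": np.array([0.0, 1.0]),
    "empty": np.array([0.0, 0.0]),
}
ranking = top_k_similar(np.array([1.0, 0.0]), toy_vectors, k=3)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

With the notebook's own data, the equivalent call would be `top_k_similar(query_vector.ravel(), keyword_vectors)`, assuming `query_vector` and `keyword_vectors` are defined as in the cells above.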
