# Train Custom Word Embeddings from PDF Corpus

In this notebook, we train a custom word embedding model on text extracted from a folder of PDF documents. The resulting embeddings can then be loaded into Flair for downstream tasks such as similarity search and clustering.

We will use `PyMuPDF` to extract text from PDFs and `Gensim` to train a Word2Vec model.

In [22]:
import os
import re
from pathlib import Path
import fitz  # PyMuPDF
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/cbadenes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Extract and clean text from PDF files

In [23]:
# Path to folder containing PDF files
pdf_folder = Path("../pdf")
documents = []

for pdf_file in pdf_folder.glob("*.pdf"):
    print("reading",pdf_file)
    text = ""
    with fitz.open(pdf_file) as doc:
        for page in doc:
            text += page.get_text()
    # Basic cleanup
    text = re.sub(r'\s+', ' ', text)
    documents.append(text)

print(f"Extracted text from {len(documents)} PDFs.")

reading ../pdf/2006-business-intelligence-20-moving-to-real-time-bi-report-nicholls.pdf
Extracted text from 1 PDFs.
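
PDF extraction often leaves words split across line breaks (for example `intelli-` at the end of one line and `gence` at the start of the next). Collapsing whitespace alone turns these into two fragments. The sketch below is one possible cleanup step, not part of the original pipeline: `clean_extracted_text` is a hypothetical helper, and its heuristic (a lowercase letter on both sides of the break) will also merge genuine compounds that happen to break at a line end.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # "intelli-\ngence" -> "intelligence"; "Real-time" on one line is untouched
    text = re.sub(r"(?<=[a-z])-\s*\n\s*(?=[a-z])", "", raw)
    # Same whitespace collapse as the cell above, plus trimming the ends
    return re.sub(r"\s+", " ", text).strip()


sample = "Real-time business intelli-\ngence   moves\nfast."
print(clean_extracted_text(sample))
```

To try it in the extraction loop, the call would replace the `re.sub(r'\s+', ' ', text)` line.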


## Tokenize documents into lists of words

In [24]:
# Tokenize each document into lowercase words and keep alphabetic tokens only
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]
tokenized_docs = [[word for word in tokens if word.isalpha()] for tokens in tokenized_docs]

print(f"Tokenized {len(tokenized_docs)} documents.")

Tokenized 1 documents.


In [25]:
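import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Resources required by the stop-word list and the lemmatizer
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))  # change the language if the corpus is Spanish
lemmatizer = WordNetLemmatizer()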

tokenized_docs = []
for doc in documents:
    tokens = word_tokenize(doc.lower())
    clean_tokens = [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok.isalpha() and tok not in stop_words
    ]
    tokenized_docs.append(clean_tokens)

print(f"Tokenized {len(tokenized_docs)} documents.")

Tokenized 1 documents.
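
The cells above pass each document to the model as one long token list. Gensim's `Word2Vec` documentation warns that texts longer than 10,000 words are truncated by the optimized training code, so a long PDF may only be partly used. Splitting documents into sentences first avoids this. The sketch below uses a naive regex splitter and a simplified regex tokenizer so that it runs without extra downloads. `split_sentences` and `to_training_sentences` are hypothetical names, and `nltk.tokenize.sent_tokenize` would likely be the more robust choice in this notebook.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def to_training_sentences(documents: list[str]) -> list[list[str]]:
    """Return one token list per sentence (lowercase alphabetic runs only)."""
    sentences = []
    for doc in documents:
        for sent in split_sentences(doc):
            tokens = re.findall(r"[a-z]+", sent.lower())
            if tokens:  # skip sentences with no alphabetic content
                sentences.append(tokens)
    return sentences


docs = ["Real-time BI is growing. Data streams feed dashboards!"]
print(to_training_sentences(docs))
```

The resulting list of lists has the shape that the `sentences` argument of `Word2Vec` expects.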


## Train an embedding model

| Parameter     | Description                                                                                                                |
| ------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `sentences`   | Input corpus: a list of tokenized documents or sentences.<br>Example: `[['data', 'analysis']]`.                            |
| `vector_size` | Dimensionality of the word vectors.<br>`vector_size=100` represents each word as a 100-dimensional vector.               |
| `window`      | Maximum distance between the current word and a context word.<br>`window=5` considers up to 5 words on each side.         |
| `min_count`   | Ignores all words with frequency lower than this value.<br>`min_count=1` means **all** words are included.                 |
| `workers`     | Number of threads (CPU cores) used during training.                                                                        |
| `sg`          | Training algorithm:<br>`sg=1` → **Skip-Gram** (good for rare words)<br>`sg=0` → **CBOW** (faster, good for frequent words) |
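
To make the `window` parameter concrete, the sketch below lists the (center, context) pairs a Skip-Gram model would see for a fixed symmetric window. It is a simplified illustration: `skipgram_pairs` is a hypothetical helper, and Gensim's actual training additionally randomizes the effective window size and uses negative sampling, both of which are omitted here.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["real", "time", "business", "intelligence"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

With `window=1` each word is paired only with its immediate neighbors. A larger window adds more distant context words to every center word.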


In [26]:
# Train model
w2v_model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4,
    sg=1
)

print("Model trained.")

Model trained.
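
With `min_count=1`, every word that appears even once receives a vector, although such words get very little training signal. The sketch below shows how a frequency threshold changes the vocabulary on a toy corpus. `vocabulary` is a hypothetical helper that only mimics the filtering rule described in the table. It does not call Gensim.

```python
from collections import Counter


def vocabulary(sentences, min_count):
    """Return the set of words whose corpus frequency is at least min_count."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}


toy = [["data", "stream", "data"], ["business", "data", "stream"]]
print(sorted(vocabulary(toy, min_count=1)))
print(sorted(vocabulary(toy, min_count=2)))
```

Running the same count on `tokenized_docs` before training is a cheap way to see how many words a given threshold would drop.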


In [15]:
print(w2v_model.wv.most_similar("intelligence"))

[('business', 0.9985243678092957), ('stream', 0.998451828956604), ('company', 0.9983727931976318), ('value', 0.9983698725700378), ('information', 0.9983309507369995), ('world', 0.9982976317405701), ('process', 0.9982869029045105), ('application', 0.9982665777206421), ('u', 0.9982622265815735), ('automatically', 0.9982519745826721)]


In [16]:
print(w2v_model.wv.similarity("data", "information"))


0.9982693
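
The score printed by `similarity` is the cosine similarity between the two word vectors. The sketch below computes it in plain Python on toy vectors, with `cosine` as a hypothetical helper name. Scores that cluster this tightly near 1.0 for every pair, as in the outputs above, likely mean the vectors have barely moved apart, which is plausible for a model trained on a single document.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel, different length
```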


## Save the model for use in Flair

In [27]:
output_path = Path("../models/word2vec")
output_path.mkdir(parents=True, exist_ok=True)

# Save in Word2Vec binary format (binary=True)
w2v_model.wv.save_word2vec_format(output_path / "custom_embeddings.bin", binary=True)
print("Embeddings saved to 'custom_embeddings.bin'")

Embeddings saved to 'custom_embeddings.bin'
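
The cell above writes the binary variant of the Word2Vec format. Passing `binary=False` would instead produce a plain-text file: a header line with the vocabulary size and vector dimensionality, then one line per word followed by its values. The sketch below writes and reads that text layout by hand on a toy dictionary to show what the file contains. `write_word2vec_text` and `read_word2vec_text` are hypothetical helpers, not Gensim functions.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write {word: [floats]} as '<vocab_size> <dim>' plus one line per word."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def read_word2vec_text(stream):
    """Parse the layout written above back into {word: [floats]}."""
    vocab_size, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip("\n").split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:dim + 1]]
    return vectors


toy = {"data": [0.1, 0.2], "stream": [0.3, 0.4]}
buf = io.StringIO()
write_word2vec_text(toy, buf)
print(buf.getvalue())

buf.seek(0)
print(read_word2vec_text(buf))
```

The text variant is larger on disk but easy to inspect, which can help when checking that a word made it into the vocabulary.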


## Load custom embeddings in Flair

In [28]:
from flair.embeddings import WordEmbeddings
embedding = WordEmbeddings(str(output_path / "custom_embeddings.bin"))
print("Custom embeddings loaded in Flair")

Custom embeddings loaded in Flair


In [29]:
# Load reports.csv and extract keywords
import pandas as pd
reports_path = Path("../api/reports.csv")
df = pd.read_csv(reports_path).fillna("")

keywords = set()
for kw_list in df["keywords"]:
    kws = [k.strip().lower() for k in kw_list.split(",") if k.strip()]
    keywords.update(kws)
keywords = sorted(keywords)
print(f"Loaded {len(keywords)} unique keywords.")

Loaded 1715 unique keywords.
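
The keyword extraction above can be checked without the CSV file. The sketch below applies the same split, strip, lowercase and de-duplicate logic to a few made-up cells. `parse_keywords` is a hypothetical helper and the sample rows are invented for illustration.

```python
def parse_keywords(cells):
    """Split comma-separated cells into a sorted list of unique lowercase keywords."""
    keywords = set()
    for cell in cells:
        keywords.update(k.strip().lower() for k in cell.split(",") if k.strip())
    return sorted(keywords)


rows = ["Business Intelligence, Data Governance", "", "data governance ,KPI"]
print(parse_keywords(rows))
```

Empty cells (filled with `""` by `fillna`) contribute nothing, and keywords that differ only in case or surrounding spaces collapse into one entry.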


In [30]:
# Embed each keyword and build a dictionary
from flair.data import Sentence
import numpy as np
keyword_vectors = {}
for kw in keywords:
    sentence = Sentence(kw, use_tokenizer=True)
    embedding.embed(sentence)
    if sentence:
        # Average the token embeddings so a multi-word keyphrase gets a single vector
        vector = np.mean([token.embedding.cpu().numpy() for token in sentence], axis=0)
        keyword_vectors[kw] = vector

print(f"Embedded {len(keyword_vectors)} keywords.")

Embedded 1715 keywords.
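
Averaging all token vectors treats unknown words like any other token. Assuming that out-of-vocabulary tokens come back as all-zero vectors (an assumption about the embedding class that is worth verifying), they pull the mean toward zero and a fully unknown keyphrase ends up as a zero vector. The sketch below shows one way to pool only the known tokens. `mean_pool` is a hypothetical helper that operates on plain arrays.

```python
import numpy as np


def mean_pool(token_vectors):
    """Average token vectors, ignoring all-zero rows; None if no row is left."""
    matrix = np.atleast_2d(np.asarray(token_vectors, dtype=float))
    known = matrix[np.any(matrix != 0, axis=1)]  # drop assumed-OOV rows
    if known.shape[0] == 0:
        return None
    return known.mean(axis=0)


print(mean_pool([[1.0, 3.0], [0.0, 0.0], [3.0, 5.0]]))
print(mean_pool([[0.0, 0.0]]))
```

In the loop above, keywords for which the helper returns `None` could be skipped instead of being stored.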


## Search for keywords similar to a given query

In [31]:
from sklearn.metrics.pairwise import cosine_similarity
query = "intelligence"
query_sentence = Sentence(query, use_tokenizer=True)
embedding.embed(query_sentence)

if query_sentence:
    query_vector = np.mean([token.embedding.cpu().numpy() for token in query_sentence], axis=0).reshape(1, -1)
    scores = {}
    for kw, vec in keyword_vectors.items():
        sim = cosine_similarity(query_vector, vec.reshape(1, -1))[0][0]
        scores[kw] = sim

    top_k = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:10]
    print(f"Top keywords similar to '{query}':\n")
    for kw, score in top_k:
        print(f"{kw}: {score:.4f}")
else:
    print(f"'{query}' could not be embedded.")

Top keywords similar to 'intelligence':

business intelligence training: 0.9995
data governance and business intelligence: 0.9993
related business area: 0.9987
top_company information: 0.9987
business group: 0.9986
new business: 0.9986
company detail: 0.9986
business area: 0.9986
company & hotel view: 0.9986
agency or company level: 0.9985
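
The search loop above calls `cosine_similarity` once per keyword. For larger keyword sets, the same ranking can be computed in one matrix operation. The sketch below does this with NumPy on toy data. `top_k_similar` is a hypothetical helper, and rows with zero norm are given a score of 0 rather than NaN.

```python
import numpy as np


def top_k_similar(query_vec, names, matrix, k=10):
    """Rank the rows of `matrix` by cosine similarity to `query_vec`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Divide only where the denominator is non-zero; other scores stay 0
    scores = np.divide(matrix @ query, denom, out=np.zeros(len(matrix)), where=denom > 0)
    order = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in order]


names = ["business intelligence", "hotel view", "data stream"]
matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for name, score in top_k_similar([1.0, 0.0], names, matrix, k=2):
    print(f"{name}: {score:.4f}")
```

With the notebook's data, `names` would be `list(keyword_vectors)` and `matrix` would be `np.stack(list(keyword_vectors.values()))`.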
