# **Mini-project \#5**

Names: Balingit, Andrei Luis & Burayag, Ethan Axl

More information on the assessment is found in our Canvas course.

# **Load Pre-trained Embeddings**

*While you don't have to separate your code into blocks, it might be easier if you separated loading / downloading your data from the main part of your solution. Consider placing all loading of data into the code block below.*

I searched GitHub for Filipino or Tagalog wordlists for this project. These are the ones I found, ordered by number of words:

1. [GitHub - AustinZuniga: Filipino Wordlist](https://github.com/AustinZuniga/Filipino-wordlist/tree/master) - Filipino Wordlist with 194327 words

2. [GitHub - luisligunas: Pinoy Dictionary Scraper](https://github.com/luisligunas/pinoy-dictionary-scraper.git) - Filipino, Tagalog, Cebuano, Hiligaynon, and Ilocano Wordlist with 98794 Filipino words

3. [GitHub - jmalonzo: Tagalog Wordlist](https://github.com/jmalonzo/tl-wordlist) - Tagalog Wordlist with 18319 words

4. [GitHub - fofajardo: Tagalog Spellcheck Dictionary](https://github.com/fofajardo/tagalog-spellcheck-dictionary.git) - Filipino/Tagalog Wordlist with 17887 words

I initially wanted to use the first wordlist, `AustinZuniga's Filipino Wordlist`, primarily because of its large size. But upon skimming the dataset, I found that it contains many non-words and specific proper names that would not work well in a Semantle-style game.

I then checked the second dataset, `luisligunas's Pinoy Dictionary Scraper`, and found that its Filipino words are mostly proper nouns and loanwords, which are also unsuitable for a Semantle-style game.

So I decided to go with the dataset with the third-highest number of words, `jmalonzo's Tagalog Wordlist`. Its data is mostly clean and semantically consistent. Its only two issues are that it contains **many versions of the same root** and **long compound words**. Both are inherent characteristics of Tagalog, a highly morphologically complex language, and both may prove difficult in a Semantle-style game.

Because the rubric does not restrict either of these issues, I will leave the dataset as is, without pre-processing or cleaning it beyond the character filter applied when the wordlist is loaded.

Because I am on macOS, I download the Tagalog wordlist from Jan Alonzo's repository using the following command:

>       `wget -O tl_wordlist.txt https://raw.githubusercontent.com/jmalonzo/tl-wordlist/master/tl.wl`

In [1]:
!wget -O tl_wordlist.txt https://raw.githubusercontent.com/jmalonzo/tl-wordlist/master/tl.wl

--2026-02-20 20:21:00--  https://raw.githubusercontent.com/jmalonzo/tl-wordlist/master/tl.wl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 178256 (174K) [text/plain]
Saving to: ‘tl_wordlist.txt’


2026-02-20 20:21:00 (8.52 MB/s) - ‘tl_wordlist.txt’ saved [178256/178256]



As for the word vectors, I'll use the fastText Tagalog Wikipedia vectors. I'll use `wget` again to download the file directly:

>       `wget -O wiki.tl.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.tl.vec`

In [2]:
!wget -O wiki.tl.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.tl.vec

--2026-02-20 20:21:00--  https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.tl.vec
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2600:9000:289a:d600:13:6e38:acc0:93a1, 2600:9000:289a:b200:13:6e38:acc0:93a1, 2600:9000:289a:b400:13:6e38:acc0:93a1, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:289a:d600:13:6e38:acc0:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 175695851 (168M) [binary/octet-stream]
Saving to: ‘wiki.tl.vec’


2026-02-20 20:21:21 (8.05 MB/s) - ‘wiki.tl.vec’ saved [175695851/175695851]
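
On a machine without `wget`, the same two downloads could probably be done from Python's standard library. The sketch below is only an assumption of how that might look: `download_if_missing` is a hypothetical helper that is not part of this notebook, and it skips the download when the file is already on disk.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: Path) -> bool:
    """Download `url` to `dest` unless a non-empty `dest` already exists.

    Hypothetical helper (not part of the original notebook).
    Returns True if a download happened, False if the file was reused.
    """
    if dest.exists() and dest.stat().st_size > 0:
        return False
    urlretrieve(url, str(dest))  # standard-library download, no wget needed
    return True


# Example calls (commented out so that running this cell does not hit the network):
# download_if_missing(
#     "https://raw.githubusercontent.com/jmalonzo/tl-wordlist/master/tl.wl",
#     Path("tl_wordlist.txt"),
# )
# download_if_missing(
#     "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.tl.vec",
#     Path("wiki.tl.vec"),
# )
```

Skipping files that already exist also avoids re-downloading the 168M vector file every time the notebook is re-run.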



My laptop, an M4 MacBook Air, is capable enough for this task despite the size of the word vector collection, so I don't think I'll need to run this Jupyter notebook on Google Colab. I'll still take precautionary measures when loading the data into memory.
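
As a rough sanity check on memory, the size of a dense `float32` matrix can be estimated before loading anything. This is a back-of-the-envelope sketch: it assumes 300-dimensional vectors and uses the 80,000-word cap that is configured later in this notebook, and `float32_matrix_bytes` is a hypothetical helper name.

```python
def float32_matrix_bytes(n_words: int, dim: int) -> int:
    """Bytes needed for an (n_words x dim) float32 matrix (4 bytes per value)."""
    return n_words * dim * 4


# Assumption: 300 dimensions per vector, at most 80,000 words kept.
cap_bytes = float32_matrix_bytes(80_000, 300)
print(f"Upper bound at the cap: {cap_bytes / 1024**2:.1f} MiB")  # 91.6 MiB
```

This counts only the final matrix. The temporary Python lists built while parsing the file add overhead on top of it.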

# **Your Implementation**

*Again, you don't have to have everything in one block. Use the notebook according to your preferences with the goal of fulfilling the assessment in mind.*

## Building a Tagalog Semantic Word Game

In this notebook, I am building a semantic guessing game using Tagalog fastText vectors.

---

### 1. Imports and Configuration

First, we set up our paths and configuration.

We set a `MAX_WORDS_TO_LOAD` cap to keep memory usage reasonable.

We also define a regex pattern so that we only accept "clean" Tagalog words: lowercase, with no numbers or punctuation.

In [3]:
import gzip
import random
import re
from pathlib import Path
from typing import Dict, List, Tuple
import numpy as np

WORDLIST_PATH = Path("tl_wordlist.txt")
VEC_PATH = Path("wiki.tl.vec")

# cap on how many vectors to load (arbitrary; bounds memory use)
MAX_WORDS_TO_LOAD = 80_000

# we only want lowercase latin letters and ñ
WORD_RE = re.compile(r"^[a-zñ]+$")
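
To see what this filter lets through, the intended pattern (lowercase Latin letters plus ñ) can be tried on a few hand-picked strings. The sample strings below are my own illustrations and are not taken from the wordlist.

```python
import re

# Intended pattern: one or more lowercase Latin letters or ñ, nothing else.
WORD_RE = re.compile(r"^[a-zñ]+$")

# Illustrative samples only (not drawn from the dataset).
samples = ["bahay", "niño", "Maynila", "pag-ibig", "covid19", ""]
results = {s: bool(WORD_RE.match(s)) for s in samples}

for s, ok in results.items():
    print(f"{s!r:12} -> {'kept' if ok else 'rejected'}")
```

Uppercase input such as `Maynila` is rejected by the pattern itself, but the loaders in this notebook lowercase each word before matching. Hyphenated forms such as `pag-ibig` are rejected either way.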

---

### 2. Defining the Vocabulary

We shall use `jmalonzo's Tagalog Wordlist`. 

If a word isn't in this list, it doesn't get into the game.

In [4]:
def load_wordlist(path: Path) -> set:

    allowed = set()
    
    # check if file exists to avoid crashing later
    if not path.exists():
        raise FileNotFoundError(f"Missing wordlist: {path}")

    with path.open("r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            w = line.strip().lower()
            # skip empty lines
            if not w:
                continue
            # only keep words that match our clean regex
            if WORD_RE.match(w):
                allowed.add(w)
                
    print(f"Loaded {len(allowed)} valid words from wordlist.")
    return allowed

# execute the load
allowed_vocab = load_wordlist(WORDLIST_PATH)

Loaded 17164 valid words from wordlist.


---

### 3. Loading the Semantic Vectors
Now we read the fastText `.vec` file. 

We iterate through the vector file and keep a row only if all of the following hold:
1. the word exists in our `allowed_vocab`;
2. the word matches our regex pattern;
3. we haven't hit our `MAX_WORDS_TO_LOAD` limit yet.

In [5]:
def load_fasttext_vec_text(path: Path, allowed: set, max_words: int) -> Tuple[List[str], np.ndarray]:
    if not path.exists():
        raise FileNotFoundError(f"Missing vectors file: {path}")

    words: List[str] = []
    vectors: List[np.ndarray] = []

    # Handle both plain .vec and compressed .vec.gz
    opener = gzip.open if path.suffix == ".gz" else open

    print(f"Reading vectors from {path}...")
    
    with opener(path, "rt", encoding="utf-8", errors="ignore") as f:
        # The first line of a .vec file is usually "<vocab_size> <dimensions>"
        header = f.readline().strip().split()
        if len(header) != 2:
            raise ValueError("Unexpected .vec header. Expected: '<vocab_size> <dim>'")
        
        _, dim_str = header
        dim = int(dim_str)

        for line in f:
            parts = line.rstrip().split(" ")
            
            if len(parts) != dim + 1:
                continue

            token = parts[0].lower()

            # --- FILTERING STEPS ---
            if token not in allowed:
                continue
            if not WORD_RE.match(token):
                continue

            try:
                # Convert the numbers from strings to a numpy array (float32 for memory efficiency)
                vec = np.array(parts[1:], dtype=np.float32)
            except ValueError:
                continue

            words.append(token)
            vectors.append(vec)

            # Stop if we reach our memory limit
            if len(words) >= max_words:
                break

    if not words:
        raise RuntimeError("No vectors loaded. Check your wordlist/path and MAX_WORDS_TO_LOAD.")

    # Stack list of arrays into a single 2D Matrix
    mat = np.vstack(vectors).astype(np.float32)
    print(f"Successfully loaded matrix shape: {mat.shape}")
    return words, mat

# Execute the load
words, mat = load_fasttext_vec_text(VEC_PATH, allowed=allowed_vocab, max_words=MAX_WORDS_TO_LOAD)

Reading vectors from wiki.tl.vec...
Successfully loaded matrix shape: (12714, 300)
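
To make the file layout concrete, the same header-then-rows parsing can be tried on a tiny in-memory sample. This is a minimal sketch: the three rows and their values are made up for illustration, and `toy_allowed` is a hypothetical stand-in for `allowed_vocab`.

```python
import io

import numpy as np

# Made-up sample in the .vec text layout: a "<vocab_size> <dim>" header,
# then one token per line followed by <dim> space-separated floats.
sample = io.StringIO(
    "3 4\n"
    "aso 0.1 0.2 0.3 0.4\n"
    "pusa 0.5 0.6 0.7 0.8\n"
    "2024 0.9 1.0 1.1 1.2\n"
)

toy_allowed = {"aso", "pusa"}  # hypothetical allow-list

n_rows, dim = (int(x) for x in sample.readline().split())

kept_words, kept_vecs = [], []
for line in sample:
    parts = line.rstrip().split(" ")
    # Skip malformed rows and tokens outside the allow-list.
    if len(parts) != dim + 1 or parts[0] not in toy_allowed:
        continue
    kept_words.append(parts[0])
    kept_vecs.append(np.array(parts[1:], dtype=np.float32))

toy_mat = np.vstack(kept_vecs)
print(kept_words, toy_mat.shape)  # ['aso', 'pusa'] (2, 4)
```

The row for `2024` is dropped because it is not in the allow-list, which is the same reason the real matrix has fewer rows than the wordlist has entries: only words present in both sources survive.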


### 4. Normalization for **Cosine Similarity**

To compute cosine similarity cheaply, we L2-normalize every vector so that its length is 1.0. With unit-length vectors, the denominator in the formula below equals 1.

This reduces the calculation to just the dot product ($A \cdot B$), which is much faster to compute repeatedly during the game.

$$\text{similarity} = \frac{A \cdot B}{\|A\| \times \|B\|}$$

In [6]:
def l2_normalize_rows(mat: np.ndarray) -> np.ndarray:

    # we need to normalize the rows of the matrix so that the L2 norm of each row is 1
    # doing this allows us to use Dot Product as a proxy for Cosine Similarity

    # calculate the length of each row
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0     # avoid divide by zero before returning
    return mat / norms

# normalize loaded matrix
mat = l2_normalize_rows(mat)

# create a lookup dictionary for O(1) access speed
word_to_idx: Dict[str, int] = {w: i for i, w in enumerate(words)}
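
The claim that the dot product of unit-length vectors equals cosine similarity can be checked numerically. The sketch below uses two random 300-dimensional vectors as stand-ins, not the loaded embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=300)  # random stand-in vectors, not real embeddings
b = rng.normal(size=300)

# Full cosine similarity formula.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize first, then take a plain dot product.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
dot_of_units = a_hat @ b_hat

print(np.isclose(cosine, dot_of_units))  # True
```

Normalizing once up front means each later similarity lookup costs a single dot product and needs no norm calculation.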

### 5. Starting a Game Round
Here we pick a random "Target Word." 

Because we have the matrix, we can instantly calculate the similarity of *every* word in our vocabulary against the target word. 

This essentially creates an "Answer Key" for the entire game before the player even makes a guess.

In [7]:
# 1. Pick a random target
target_idx = random.randrange(len(words))
target_word = words[target_idx]
target_vec = mat[target_idx]

print(f"Target Word (Hidden): {target_word}")

# 2. Calculate global similarities
sims = mat @ target_vec 

# 3. Create a ranking order (highest similarity first)
order = np.argsort(-sims)       # sort descending 

# remove the target word itself from the hints list
order = order[order != target_idx]

def show_rank(rank_1_based: int) -> None:
    k = rank_1_based - 1
    if k < 0 or k >= len(order):
        return
    idx = order[k]
    print(f"Rank {rank_1_based}: {words[idx]} (Score: {sims[idx]:.4f})")

# Preview the "neighbors" (for debugging/narrative purposes)
print("\n--- Game Debug Info ---")
print("Closest words to target:")
show_rank(1)
show_rank(10)
show_rank(100)

Target Word (Hidden): kabiyak

--- Game Debug Info ---
Closest words to target:
Rank 1: umiyak (Score: 0.5832)
Rank 10: kasintahang (Score: 0.5509)
Rank 100: tiyahin (Score: 0.4524)
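
The ranking step is easier to follow on a toy example. The four 2-D unit vectors below are invented for illustration, and all names are prefixed with `toy_` so they do not overwrite the notebook's real variables.

```python
import numpy as np

toy_words = ["aso", "pusa", "bahay", "tuta"]
# Hand-made unit-length rows (illustrative values, not real embeddings).
toy_mat = np.array([
    [1.00, 0.00],  # aso
    [0.80, 0.60],  # pusa
    [0.00, 1.00],  # bahay
    [0.96, 0.28],  # tuta
])

toy_target_idx = 0  # pretend "aso" is the hidden word
toy_sims = toy_mat @ toy_mat[toy_target_idx]  # similarity of every word to the target

toy_order = np.argsort(-toy_sims)                   # highest similarity first
toy_order = toy_order[toy_order != toy_target_idx]  # drop the target itself

ranking = [toy_words[i] for i in toy_order]
rank_of_pusa = int(np.where(toy_order == toy_words.index("pusa"))[0][0]) + 1

print(ranking)       # ['tuta', 'pusa', 'bahay']
print(rank_of_pusa)  # 2
```

A single matrix-vector product scores the whole vocabulary at once, which is why the full ranking can be built before the first guess.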


### 6. Interactive Game Loop
 
We can run this cell to play the game.

The loop asks for input, looks the guess up in the vocabulary, and reports its cosine similarity to the target (a value between -1.0 and 1.0, where 1.0 means an exact match) along with its rank.

In [8]:
print("Game Started! Try to guess the word related to Tagalog context.")
print("Type 'exit' to quit.")

while True:
    try:
        guess = input("\nYour guess: ").strip().lower()
    except (KeyboardInterrupt, EOFError):
        print("\nStopping game.")
        break
        
    if guess == "exit" or not guess:
        print(f"Gave up? The word was: {target_word}")
        break

    if guess == target_word:
        print(f"🎉 CORRECT! The word is {guess.upper()}! Score: 1.0")
        break

    # look up the guess's row index in the vocabulary
    idx = word_to_idx.get(guess)
    
    if idx is None:
        print(f"'{guess}' is not in the vocabulary.")
        continue

    # calculate score and rank in relation to chosen word
    score = float(mat[idx] @ target_vec)
    rank = np.where(order == idx)[0][0] + 1
    
    print(f"Word: {guess} | Similarity: {score:.4f} | Rank: {rank}")

Game Started! Try to guess the word related to Tagalog context.
Type 'exit' to quit.
Word: manila | Similarity: 0.1955 | Rank: 9389
Word: asawa | Similarity: 0.4157 | Rank: 274
Word: kahati | Similarity: 0.1962 | Rank: 9340
Word: oo | Similarity: 0.1352 | Rank: 11901
Word: hindi | Similarity: 0.2801 | Rank: 3983
Word: bobo | Similarity: 0.3769 | Rank: 710
Word: tanga | Similarity: 0.3099 | Rank: 2529
🎉 CORRECT! The word is KABIYAK! Score: 1.0
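
Because the loop above depends on `input()`, it is awkward to test. One option is to pull the scoring into a pure function, as in the sketch below. `score_guess` is a hypothetical helper that is not part of this notebook, and it is shown on a made-up toy vocabulary instead of the real matrix.

```python
from typing import Dict, Optional, Tuple

import numpy as np


def score_guess(
    guess: str,
    word_to_idx: Dict[str, int],
    mat: np.ndarray,
    target_idx: int,
) -> Optional[Tuple[float, int]]:
    """Return (similarity, rank) for a guess, or None if it is out of vocabulary.

    Hypothetical helper. Assumes the rows of `mat` are already L2-normalized.
    Rank 0 means the guess is the target itself.
    """
    idx = word_to_idx.get(guess.strip().lower())
    if idx is None:
        return None

    sims = mat @ mat[target_idx]
    if idx == target_idx:
        return float(sims[idx]), 0

    # Same ranking as in the notebook: sort descending, drop the target.
    order = np.argsort(-sims)
    order = order[order != target_idx]
    rank = int(np.where(order == idx)[0][0]) + 1
    return float(sims[idx]), rank


# Toy data (illustrative unit vectors, not real embeddings).
demo_words = ["aso", "pusa", "bahay", "tuta"]
demo_idx = {w: i for i, w in enumerate(demo_words)}
demo_mat = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.96, 0.28]])

print(score_guess("Pusa", demo_idx, demo_mat, target_idx=0))  # (0.8, 2)
print(score_guess("ibon", demo_idx, demo_mat, target_idx=0))  # None
```

This version recomputes the ranking on every call, which is fine for a sketch. The notebook's approach of computing `order` once per round is the better choice for the real vocabulary.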
