# GloVe (Gensim)

For looking at word vectors, we'll use **Gensim**. **Gensim** isn't really a deep learning package. It's a package for word and text similarity modeling, which started with (LDA-style) topic models and grew into SVD and neural word representations. But it's efficient, scalable, and quite widely used. We're going to use **GloVe** embeddings, downloaded from [the GloVe page](https://nlp.stanford.edu/projects/glove/). They're inside [this zip file](https://nlp.stanford.edu/data/glove.6B.zip). The cells below fetch the 100-dimensional version through Gensim's downloader (`glove-wiki-gigaword-100`).
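If you unpack the zip by hand, each line of a file such as `glove.6B.100d.txt` holds a word followed by its space-separated vector components, with no header line. The sketch below parses that layout in plain Python so the format is easy to see. `load_glove_text` and the toy file are hypothetical names used only for illustration; they are not part of Gensim. In Gensim 4.x the usual route should be `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`.

```python
import os
import tempfile


def load_glove_text(path, limit=None):
    """Parse a GloVe-style text file into {word: [float, ...]}.

    Hypothetical helper for illustration: one word per line,
    followed by its vector components separated by single spaces.
    """
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


# Toy file in the same layout as the real GloVe text files (made-up numbers).
toy_text = "the 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(toy_text)
    toy_path = tmp.name

toy_vectors = load_glove_text(toy_path)
os.remove(toy_path)
print(toy_vectors["cat"])
```

For the full 400k-word files a plain dictionary of lists is slow and memory-hungry, which is one reason to prefer Gensim's loader in practice.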

In [1]:
# !pip install gensim

# Load pre-trained GloVe embeddings trained on a very large corpus (Wikipedia + Gigaword).
# These embeddings are expected to outperform small custom-trained models
# because of the scale and diversity of their training data.


In [None]:
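import os
from gensim.models import KeyedVectors
import gensim.downloader as api

# Download (on first use) and load the 100-dimensional GloVe vectors once.
model = api.load("glove-wiki-gigaword-100")
glove = model  # alias for the same vectors; avoids loading them a second time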


In [None]:
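def predict_analogy(a, b, c, model):
    """Return the model's best guess for d in "a is to b as c is to d", or None."""
    # An out-of-vocabulary query word makes the analogy impossible to answer.
    if any(w not in model.key_to_index for w in (a, b, c)):
        return None

    # Gensim's built-in analogy: rank words by similarity to (b + c - a).
    for word, _ in model.most_similar(positive=[b, c], negative=[a], topn=10):
        # Guard against a query word coming back as the answer.
        if word not in {a, b, c}:
            return word
    return None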
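To make the arithmetic behind `most_similar(positive=[b, c], negative=[a])` concrete, here is a minimal sketch of the additive analogy method (often called 3CosAdd) on hand-made toy vectors. It assumes that Gensim length-normalizes the query vectors before combining them and excludes the query words from the results, which matches my understanding of its behavior. The vectors and the name `analogy_3cosadd` are made up for illustration.

```python
import math


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(u), unit(v)))


def analogy_3cosadd(a, b, c, vectors):
    """Answer "a is to b as c is to ?" with the word closest to b - a + c."""
    if any(w not in vectors for w in (a, b, c)):
        return None
    ua, ub, uc = unit(vectors[a]), unit(vectors[b]), unit(vectors[c])
    target = [xb - xa + xc for xa, xb, xc in zip(ua, ub, uc)]
    # Exclude the query words, then pick the most similar remaining word.
    scores = {
        w: cosine(vec, target)
        for w, vec in vectors.items()
        if w not in {a, b, c}
    }
    return max(scores, key=scores.get) if scores else None


# Toy 3-d vectors: (royalty, male, female). Purely illustrative values.
toy = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
    "table": [0.0, 0.2, 0.1],
}

predicted = analogy_3cosadd("man", "woman", "king", toy)
print(predicted)
```
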



# Evaluate the pre-trained GloVe model on word analogy tasks.
# The results serve as a strong reference benchmark
# for comparison with custom-trained embeddings.


In [None]:
def evaluate_analogies(file_path, model):
    """Return (accuracy, correct, total) over the in-vocabulary analogies in file_path."""
    total = 0
    correct = 0

    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip().lower()
            if not line or line.startswith(":"):
                continue

            words = line.split()
            if len(words) != 4:
                continue

            a, b, c, d = words

            # Skip if any word is OOV
            if any(w not in model.key_to_index for w in (a, b, c, d)):
                continue


            prediction = predict_analogy(a, b, c, model)

            total += 1
            if prediction == d:
                correct += 1

    accuracy = correct / total if total > 0 else 0.0
    return accuracy, correct, total


# Compute Spearman correlation between pre-trained
# GloVe similarities and human similarity judgments.
# Higher correlation reflects better semantic alignment.
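As a minimal sketch of that correlation step, the block below computes Spearman's rho in plain Python by ranking both score lists (ties get averaged ranks) and taking the Pearson correlation of the ranks. The human scores and model similarities are made-up toy numbers. In the notebook, the model side would come from something like `model.similarity(w1, w2)` for each rated word pair, and `scipy.stats.spearmanr` is the usual library routine for the correlation itself.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: four word pairs with human ratings and model cosine similarities.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_sims = [0.8, 0.9, 0.3, 0.1]

rho = spearman(human_scores, model_sims)
print(round(rho, 3))
```

Because only the rank order matters, the human ratings and the cosine similarities do not need to share a scale.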


In [None]:
syntactic_acc, syn_correct, syn_total = evaluate_analogies("past-tense.txt", model)  # evaluate on syntactic analogies

In [None]:
semantic_acc, sem_correct, sem_total = evaluate_analogies("country-capital.txt", model)  # evaluate on semantic analogies; separate names keep the syntactic counts intact

In [None]:
import pandas as pd

results = {  # one-row summary for the pre-trained GloVe model
    "Model": ["GloVe (Gensim)"],
    "Window Size": ["5"],
    "Training Loss": ["-"],  # not applicable: the vectors are pre-trained
    "Training Time": ["-"],
    "Syntactic Accuracy": [syntactic_acc],
    "Semantic Accuracy": [semantic_acc]
}

df_glove = pd.DataFrame(results)
df_glove


| | Model | Window Size | Training Loss | Training Time | Syntactic Accuracy | Semantic Accuracy |
|---|---|---|---|---|---|---|
| 0 | GloVe (Gensim) | 5 | - | - | 0.554487 | 0.894433 |
