# Global Vectors for Word Representation (GloVe) 

This notebook experiments with the GloVe model, using a pre-trained GloVe embedding loaded through gensim (torchtext is "on pause" and, in my experience, does not work with Python 3.11+). Rather than vectorizing a sentence by the plain "mean" or "sum" of its word vectors (the two are equivalent under cosine similarity), we implement a Tf-Idf sentence vectorizer; see the details at the bottom. 
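
The remark that "mean" and "sum" are the same under cosine similarity holds because the mean is the sum scaled by a positive constant, and cosine similarity ignores positive scaling. A minimal sketch with made-up toy vectors (the helper `cosine` and all numbers are illustrative and not part of `proj_mod.glove`):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "word vectors" for a two-word sentence (made-up numbers).
word_vectors = [[1.0, 2.0, 0.5], [0.0, 1.0, 3.0]]
query_vector = [0.5, 1.0, 1.0]

# Sum pooling and mean pooling of the word vectors.
sum_vec = [sum(col) for col in zip(*word_vectors)]
mean_vec = [s / len(word_vectors) for s in sum_vec]

# The two scores agree up to floating-point rounding.
sim_sum = cosine(sum_vec, query_vector)
sim_mean = cosine(mean_vec, query_vector)
print(sim_sum, sim_mean)
```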

In [1]:
import pandas as pd 
import numpy as np 
from gensim.utils import simple_preprocess
import gensim.downloader as api 

import importlib 
import sys 
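sys.path.append("../")  # make the local proj_mod package importable

from proj_mod import glove
importlib.reload(glove);  # pick up local edits to proj_mod.glove without restarting the kernel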

In [2]:
df_pt=pd.read_csv("../data/raw.csv")
df_pt

| | id | job_title | location | connection | fit |
|---|---|---|---|---|---|
| 0 | 1 | 2019 C.T. Bauer College of Business Graduate (... | Houston, Texas | 85 | |
| 1 | 2 | Native English Teacher at EPIK (English Progra... | Kanada | 500+ | |
| 2 | 3 | Aspiring Human Resources Professional | Raleigh-Durham, North Carolina Area | 44 | |
| 3 | 4 | People Development Coordinator at Ryan | Denton, Texas | 500+ | |
| 4 | 5 | Advisory Board Member at Celal Bayar University | İzmir, Türkiye | 500+ | |
| ... | ... | ... | ... | ... | ... |
| 99 | 100 | Aspiring Human Resources Manager \| Graduating ... | Cape Girardeau, Missouri | 103 | |
| 100 | 101 | Human Resources Generalist at Loparex | Raleigh-Durham, North Carolina Area | 500+ | |
| 101 | 102 | Business Intelligence and Analytics at Travelers | Greater New York City Area | 49 | |
| 102 | 103 | Always set them up for Success | Greater Los Angeles Area | 500+ | |


In [3]:
glove_ranker=glove.GloVeRanker()

In [4]:
glove_ranker.fit(df=df_pt)

In [5]:
query=[
    "Aspiring human resources",
    "seeking human resources"
]
glove_ranker.create_score(query=query)

<proj_mod.glove.GloVeRanker at 0x7f46f77cafd0>

Best 10 after first query 

In [9]:
glove_ranker.df_fitted_.sort_values(by="fit_score", ascending=False).head(10)

| | id | job_title | old_score | new_score | fit_score |
|---|---|---|---|---|---|
| 27 | 28 | Seeking Human Resources Opportunities | 0 | 0.905511 | 0.905511 |
| 29 | 30 | Seeking Human Resources Opportunities | 0 | 0.905511 | 0.905511 |
| 5 | 6 | Aspiring Human Resources Specialist | 0 | 0.868459 | 0.868459 |
| 23 | 24 | Aspiring Human Resources Specialist | 0 | 0.868459 | 0.868459 |
| 35 | 36 | Aspiring Human Resources Specialist | 0 | 0.868459 | 0.868459 |
| 48 | 49 | Aspiring Human Resources Specialist | 0 | 0.868459 | 0.868459 |
| 59 | 60 | Aspiring Human Resources Specialist | 0 | 0.868459 | 0.868459 |
| 75 | 76 | Aspiring Human Resources Professional \| Passio... | 0 | 0.867992 | 0.867992 |
| 73 | 74 | Human Resources Professional | 0 | 0.865691 | 0.865691 |
| 45 | 46 | Aspiring Human Resources Professional | 0 | 0.860696 | 0.860696 |


Best 10 after second query 

In [10]:
query2=[
    "Human Resources Professional",
    "Aspiring Human Resources Professional"
]
glove_ranker.create_score(query=query2)

<proj_mod.glove.GloVeRanker at 0x7f46f77cafd0>

In [11]:
glove_ranker.df_fitted_.sort_values(by="fit_score", ascending=False).head(10)

| | id | job_title | old_score | new_score | fit_score |
|---|---|---|---|---|---|
| 73 | 74 | Human Resources Professional | 0.865691 | 0.980076 | 0.94576 |
| 2 | 3 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 16 | 17 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 32 | 33 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 57 | 58 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 96 | 97 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 45 | 46 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 20 | 21 | Aspiring Human Resources Professional | 0.860696 | 0.980076 | 0.944262 |
| 23 | 24 | Aspiring Human Resources Specialist | 0.868459 | 0.904817 | 0.89391 |
| 5 | 6 | Aspiring Human Resources Specialist | 0.868459 | 0.904817 | 0.89391 |
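
As a sanity check on how the scores evolve between queries: in the output after the second query, the displayed `fit_score` values are numerically consistent with a weighted blend `0.3 * old_score + 0.7 * new_score`, whereas after the first query `fit_score` simply equals `new_score`. This is an observation read off the printed numbers, not a statement about how `proj_mod.glove.GloVeRanker` is implemented; the weights `0.3` and `0.7` below are inferred, hypothetical values. A minimal check on three rows copied from the output:

```python
# Rows copied from the second-query output: (old_score, new_score, fit_score).
rows = [
    (0.865691, 0.980076, 0.94576),
    (0.860696, 0.980076, 0.944262),
    (0.868459, 0.904817, 0.89391),
]

# Hypothetical blend weights inferred from the printed numbers (an assumption).
OLD_WEIGHT, NEW_WEIGHT = 0.3, 0.7

# Largest gap between the blended value and the displayed fit_score.
max_gap = max(abs(OLD_WEIGHT * old + NEW_WEIGHT * new - fit) for old, new, fit in rows)
print(max_gap)
```

A gap on the order of the display rounding would support the blend reading; the source of `proj_mod.glove` remains the authority on the actual update rule.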


## Theory review of GloVe 

Intuitively, the GloVe model adjusts each pair of word vectors so that their dot product mimics the (log) co-occurrence count of that word pair across the whole corpus. 

To achieve this, a *co-occurrence matrix* is built: 
* Tokenize the corpus and choose a context window size. 
* For each center word $i$ and each context word $j$ within the chosen context window around $i$, increment the co-occurrence "counter" $X_{ij}$. It is also common to apply distance weighting when incrementing $X_{ij}$: instead of adding $1$ for each co-occurrence, add $\frac{1}{d}$, where $d$ is the distance between $i$ and $j$. 

After this process, we end up with a matrix $(X_{ij})$. Intuitively, $X_{ij}$ is "how much $j$ appears around $i$". 
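
The counting procedure above can be sketched as follows. The function name `build_cooccurrence`, the symmetric window, the dict-based sparse storage, and the toy corpus are assumptions for illustration; a real GloVe pipeline works on a much larger corpus with a fixed vocabulary.

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=2, distance_weighting=True):
    """Return a dict mapping (center, context) -> X_ij for tokenized sentences."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            # Context positions within `window` tokens on either side of i.
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                d = abs(i - j)
                # Add 1/d with distance weighting, otherwise a plain count of 1.
                counts[(center, tokens[j])] += 1.0 / d if distance_weighting else 1.0
    return dict(counts)

# Toy corpus of two tokenized sentences (made up for this sketch).
corpus = [
    "aspiring human resources professional".split(),
    "seeking human resources opportunities".split(),
]
X = build_cooccurrence(corpus, window=2)
print(X[("human", "resources")])     # 2.0: adjacent in both sentences
print(X[("aspiring", "resources")])  # 0.5: distance 2 in one sentence
```

Only non-zero entries are stored, which matches the fact that the GloVe loss sums over pairs with $X_{ij}>0$.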

In the GloVe model, each word $i$ is first embedded into a pair of vectors $(v_i, \tilde{v_i})$. Intuitively, $v_i$ represents the word as a center word, and $\tilde{v_i}$ represents it as a context word. 
The GloVe model seeks to minimize the loss function 
$$
\sum\limits_{i,j:\; X_{ij}>0} f(X_{ij})\; \left( v_i^{\top}\tilde{v_j} + b_i + \tilde{b_j} - \log(X_{ij}) \right)^2\; 
$$
where: 
* $f(x)$ is the *weighting function*, which emphasizes meaningful counts and dampens "rare noise": 
$$
f(x):= \begin{cases} \left(\frac{x}{x_{max}}\right)^{\alpha} & x<x_{max} \\ 1 & \text{otherwise} \end{cases}\; , 
$$
where $\alpha$ is typically set to $0.75$, and $x_{max}$ is set within $[50,200]$, with $100$ being the most common choice. 

* $b_i$ and $\tilde{b_i}$ are the biases for word $i$ (as center word and context word, respectively). They help to "absorb" the effect of extremely common words. 
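
To make the loss concrete, here is a minimal pure-Python sketch of the weighting function $f$ and the weighted squared-error sum above. The function names `weight` and `glove_loss`, the dict-based data layout, and the toy numbers are assumptions for illustration; the sketch only evaluates the loss and is not a training routine.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): (x / x_max)^alpha below x_max, 1 otherwise."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, v, v_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Weighted squared error summed over all pairs with X_ij > 0.

    X maps (i, j) -> co-occurrence value; v and v_tilde map a word to its
    center and context vector; b and b_tilde map a word to its biases.
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        if x_ij <= 0:
            continue  # the sum only runs over non-zero co-occurrences
        dot = sum(a * c for a, c in zip(v[i], v_tilde[j]))
        residual = dot + b[i] + b_tilde[j] - math.log(x_ij)
        total += weight(x_ij, x_max, alpha) * residual ** 2
    return total

# Toy two-word vocabulary with made-up parameters.
X = {("human", "resources"): 2.0, ("resources", "human"): 2.0}
v = {"human": [1.0, 0.0], "resources": [0.0, 1.0]}
v_tilde = {"human": [0.0, 1.0], "resources": [1.0, 0.0]}
b = {"human": 0.0, "resources": 0.0}
b_tilde = {"human": 0.0, "resources": 0.0}

loss = glove_loss(X, v, v_tilde, b, b_tilde)
print(loss)
```

In this toy setup both pairs have dot product $1$ and zero biases, so each contributes $f(2)\,(1-\log 2)^2$ to the sum.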

It is then common to combine the pair of embeddings into $w_i:=v_i+\tilde{v_i}$ as the "finalized word vector". Of course, it is also acceptable to simply set $w_i:=v_i$. 

## The Tf-Idf sentence vectorizer 

See the details of the Tf-Idf method in the notebook "bag_of_words.ipynb". The sentence vector for a sentence $S$ (considered as a set of string words $w\in S$) in this method is 
$$
v_{S}:=\frac{\sum\limits_{w\in S}\left(\mathrm{TFIDF}(w,T_S)\; v_w\right)}{\sum\limits_{w\in S}\mathrm{TFIDF}(w,T_S)}\; ,  
$$
where $v_w$ is the vector of word $w$, and $T_S$ is the string version of sentence $S$. 
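
The weighted average above can be sketched in a few lines. This is not the code in `proj_mod.glove`: the helper names `tfidf_weights` and `sentence_vector`, the choice TF = count / sentence length, the precomputed `idf` dict, and the handling of out-of-vocabulary words are all assumptions made for illustration; the Tf-Idf definition actually used is the one in "bag_of_words.ipynb".

```python
from collections import Counter

def tfidf_weights(tokens, idf):
    """TF-IDF weight per distinct word, with TF = count / sentence length (assumed)."""
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * idf.get(w, 0.0) for w, c in counts.items()}

def sentence_vector(tokens, word_vectors, idf):
    """TF-IDF weighted average of the vectors of the distinct words in a sentence."""
    weights = tfidf_weights(tokens, idf)
    dim = len(next(iter(word_vectors.values())))
    numerator = [0.0] * dim
    denominator = 0.0
    for w, weight in weights.items():
        if w not in word_vectors:
            continue  # skip out-of-vocabulary words (an assumption of this sketch)
        numerator = [n + weight * x for n, x in zip(numerator, word_vectors[w])]
        denominator += weight
    if denominator == 0.0:
        return [0.0] * dim  # no usable word: fall back to the zero vector
    return [n / denominator for n in numerator]

# Toy 2-dimensional word vectors and IDF values (made-up numbers).
word_vectors = {"human": [1.0, 0.0], "resources": [0.0, 1.0]}
idf = {"human": 1.0, "resources": 3.0}

vec = sentence_vector(["human", "resources"], word_vectors, idf)
print(vec)  # [0.25, 0.75]: "resources" has three times the weight of "human"
```

Because the weights are normalized by their sum, rarer (higher-IDF) words pull the sentence vector toward their own word vectors, which is the intended difference from a plain mean.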