# Word2vec

This notebook documents experiments with word2vec, using the ready-to-use implementation provided by gensim.

## Implementation with gensim 

`gensim.models` provides a ready-to-use training algorithm for word2vec, which we use here. Because we want to compare strings (made of several words) rather than single words, we have to aggregate the word vectors of a string before calculating the cosine similarity. We opt for the simplest choice: the mean of all word vectors in the string. This treats the whole string as an unordered "bag of words", just like the BOW model.
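The mean-then-cosine idea can be sketched with plain NumPy. This is only an illustration: the vectors in `toy_vectors` are made up, and the helper names `mean_vector` and `cosine_similarity` are hypothetical, not part of `proj_mod.word2vec` or gensim. Returning a similarity of 0 when a string has no known word is also an assumption of this sketch.

```python
import numpy as np

# Hypothetical toy embeddings; in the notebook these would come from a trained model.
toy_vectors = {
    "aspiring": np.array([1.0, 0.0, 0.0]),
    "human": np.array([0.0, 1.0, 0.0]),
    "resources": np.array([0.0, 1.0, 1.0]),
    "teacher": np.array([0.0, 0.0, 1.0]),
}

def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

def cosine_similarity(a, b):
    """Cosine similarity, taken to be 0.0 when a vector is missing or zero."""
    if a is None or b is None:
        return 0.0
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = mean_vector(["aspiring", "human", "resources"], toy_vectors)
title_a = mean_vector(["human", "resources", "aspiring"], toy_vectors)  # same words, new order
title_b = mean_vector(["teacher"], toy_vectors)

sim_same_words = cosine_similarity(query, title_a)
sim_other = cosine_similarity(query, title_b)
sim_unknown = cosine_similarity(query, mean_vector(["unknown"], toy_vectors))
```

Because the mean ignores word order, `title_a` gets the same vector as `query`, which is exactly the bag-of-words behavior described above.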

In [79]:
import pandas as pd 
import numpy as np 
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

import importlib 
import sys 
sys.path.append("../")

from proj_mod import word2vec 
importlib.reload(word2vec);

In [2]:
df_pt=pd.read_csv("../data/raw.csv")
df_pt

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,
101,102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,
102,103,Always set them up for Success,Greater Los Angeles Area,500+,


### The fit below uses SGNS; pass `sg=0` to `fit` to use CBOW instead

In [None]:
w2v_model=word2vec.Word2VecRanker()
w2v_model.fit(
            df=df_pt, 
            #sg=0  # sg=1 (SGNS) is the default here; uncomment to use CBOW instead
            ) 

w2v model fitted. 


<proj_mod.word2vec.Word2VecRanker at 0x7f94c5beee50>

In [81]:
w2v_model.data_

[['bauer',
  'college',
  'of',
  'business',
  'graduate',
  'magna',
  'cum',
  'laude',
  'and',
  'aspiring',
  'human',
  'resources',
  'professional'],
 ['native',
  'english',
  'teacher',
  'at',
  'epik',
  'english',
  'program',
  'in',
  'korea'],
 ['aspiring', 'human', 'resources', 'professional'],
 ['people', 'development', 'coordinator', 'at', 'ryan'],
 ['advisory', 'board', 'member', 'at', 'celal', 'bayar', 'university'],
 ['aspiring', 'human', 'resources', 'specialist'],
 ['student',
  'at',
  'humber',
  'college',
  'and',
  'aspiring',
  'human',
  'resources',
  'generalist'],
 ['hr', 'senior', 'specialist'],
 ['student',
  'at',
  'humber',
  'college',
  'and',
  'aspiring',
  'human',
  'resources',
  'generalist'],
 ['seeking', 'human', 'resources', 'hris', 'and', 'generalist', 'positions'],
 ['student', 'at', 'chapman', 'university'],
 ['svp',
  'chro',
  'marketing',
  'communications',
  'csr',
  'officer',
  'engie',
  'houston',
  'the',
  'woodlands',
  

In [82]:
query=[
    "Aspiring human resources",
    "seeking human resources"
]
w2v_model.create_score(query=query)

<proj_mod.word2vec.Word2VecRanker at 0x7f94c5beee50>

Find the best 10 after the first round. 

In [None]:
w2v_model.df_fitted_

Unnamed: 0,id,job_title,old_score,new_score,fit_score
0,1,2019 C.T. Bauer College of Business Graduate (...,0,0.713039,0.713039
1,2,Native English Teacher at EPIK (English Progra...,0,0.614060,0.614060
2,3,Aspiring Human Resources Professional,0,0.848897,0.848897
3,4,People Development Coordinator at Ryan,0,0.527141,0.527141
4,5,Advisory Board Member at Celal Bayar University,0,0.413983,0.413983
...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,0,0.885755,0.885755
100,101,Human Resources Generalist at Loparex,0,0.793641,0.793641
101,102,Business Intelligence and Analytics at Travelers,0,0.482616,0.482616
102,103,Always set them up for Success,0,0.000000,0.000000


In [88]:
w2v_model.df_fitted_.sort_values(by="fit_score", ascending=False).head(10)

Unnamed: 0,id,job_title,old_score,new_score,fit_score
72,73,"Aspiring Human Resources Manager, seeking inte...",0,0.910334,0.910334
27,28,Seeking Human Resources Opportunities,0,0.899067,0.899067
74,75,"Nortia Staffing is seeking Human Resources, Pa...",0,0.899067,0.899067
29,30,Seeking Human Resources Opportunities,0,0.899067,0.899067
98,99,Seeking Human Resources Position,0,0.899067,0.899067
78,79,Liberal Arts Major. Aspiring Human Resources A...,0,0.899067,0.899067
99,100,Aspiring Human Resources Manager | Graduating ...,0,0.885755,0.885755
5,6,Aspiring Human Resources Specialist,0,0.849429,0.849429
35,36,Aspiring Human Resources Specialist,0,0.849429,0.849429
59,60,Aspiring Human Resources Specialist,0,0.849429,0.849429


In [89]:
query2=[
    "Seeking Human Resources Opportunities",
    "Aspiring Human Resources Specialist"
]
w2v_model.create_score(query=query2)

<proj_mod.word2vec.Word2VecRanker at 0x7f94c5beee50>

Find the best 10 after the second-round query.

In [90]:
w2v_model.df_fitted_.sort_values(by="fit_score", ascending=False).head(10)

Unnamed: 0,id,job_title,old_score,new_score,fit_score
72,73,"Aspiring Human Resources Manager, seeking inte...",0.910334,0.892439,0.897808
27,28,Seeking Human Resources Opportunities,0.899067,0.875501,0.882571
74,75,"Nortia Staffing is seeking Human Resources, Pa...",0.899067,0.875501,0.882571
29,30,Seeking Human Resources Opportunities,0.899067,0.875501,0.882571
98,99,Seeking Human Resources Position,0.899067,0.875501,0.882571
78,79,Liberal Arts Major. Aspiring Human Resources A...,0.899067,0.872995,0.880817
99,100,Aspiring Human Resources Manager | Graduating ...,0.885755,0.868757,0.873857
5,6,Aspiring Human Resources Specialist,0.849429,0.875501,0.867679
35,36,Aspiring Human Resources Specialist,0.849429,0.875501,0.867679
59,60,Aspiring Human Resources Specialist,0.849429,0.875501,0.867679
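
The second-round numbers are consistent with `fit_score` being a weighted blend of the previous and the new score, with weight 0.3 on `old_score` and 0.7 on `new_score`. This is inferred from the displayed values only, not read from the source of `proj_mod.word2vec`; the name `blend` and the constant `OLD_WEIGHT` are hypothetical. A quick arithmetic check on rows copied from the table:

```python
# (old_score, new_score, fit_score) copied from the second-round table above.
rows = [
    (0.910334, 0.892439, 0.897808),
    (0.899067, 0.875501, 0.882571),
    (0.899067, 0.872995, 0.880817),
    (0.885755, 0.868757, 0.873857),
    (0.849429, 0.875501, 0.867679),
]

OLD_WEIGHT = 0.3  # inferred from the numbers, not taken from proj_mod.word2vec

def blend(old_score, new_score, old_weight=OLD_WEIGHT):
    """Weighted average of the previous score and the new score."""
    return old_weight * old_score + (1.0 - old_weight) * new_score

# Largest gap between the blended value and the displayed fit_score.
max_abs_error = max(abs(blend(old, new) - fit) for old, new, fit in rows)
```

The gap stays within the rounding of the displayed values. In the first round, where `old_score` is 0, the table shows `fit_score` equal to `new_score`, so the blend apparently applies only once a previous score exists.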


## Theory review of Word2Vec 

Word2Vec assigns every word $w$ a pair of vectors: an input vector $v_w$ and an output vector $u_w$. Together they encode the word itself and its context. In practice, the context is just a representation of the set of words around the center word under consideration. 
There are two main kinds of word2vec models: 
* SG (Skip-Gram): Use all words within a given distance of the center word as context. The goal is to promote the correct context for a word. 
* CBOW (Continuous Bag of Words): Use the average of all words within a given distance of the center word as context. The goal is to promote the correct word for a context. 

The model works under the assumption that words that appear in similar contexts should be close (in cosine similarity). 
In the following:  
* $w_1, \cdots, w_T$ is a sequence of tokens (with order, like words in a sentence). 
* Given a position $t\in [1, T]$, $R_t$ denotes the window size at that position. Intuitively speaking, this is the "effective radius" of the context around the center word $w_t$. 
* $V$ is the vocabulary of the corpus under consideration.  

### Theory review of SG 

We define the set of *positive pairs* 
$$
D^+:=\{(w_t, w_{t+j}): t\in [1,T],\ j\in[-R_t, R_t]\setminus \{0\},\ 1\le t+j\le T\} \; ,
$$
that is, each center word paired with every word in its "context". 
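
As a minimal sketch, the positive pairs can be enumerated as follows. It assumes one fixed radius for every position (the definition above allows $R_t$ to vary with $t$), and the function name `positive_pairs` is hypothetical.

```python
def positive_pairs(tokens, radius):
    """Return (center, context) pairs for a fixed window radius."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - radius)                # clip the window at the start
        hi = min(len(tokens), t + radius + 1)  # clip the window at the end
        for s in range(lo, hi):
            if s != t:                         # j = 0 is excluded
                pairs.append((center, tokens[s]))
    return pairs

tokens = ["aspiring", "human", "resources", "professional"]
pairs = positive_pairs(tokens, radius=1)
```

With radius 1, inner words contribute two pairs each and the two boundary words one pair each.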

The goal of this model is to maximize 
$$
\sum\limits_{(c,o)\in D^+} \log P(o|c)\; \text{, where }\; P(o|c):=\frac{\exp(v_c^{\top}u_o)}{\sum\limits_{w\in V} \exp(v_c^{\top}u_w)}\; . 
$$
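
To make the softmax concrete, here is a small NumPy sketch of $P(o|c)$. The vocabulary, the dimension and the random matrices `V_in` and `U_out` are made-up placeholders for illustration, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["aspiring", "human", "resources", "professional", "teacher"]
dim = 4
V_in = rng.normal(size=(len(vocab), dim))   # row w is the input vector v_w
U_out = rng.normal(size=(len(vocab), dim))  # row w is the output vector u_w

def softmax_prob(center_idx, V_in, U_out):
    """Return P(. | c) over the whole vocabulary for the center word c."""
    scores = U_out @ V_in[center_idx]  # v_c^T u_w for every w in V
    scores = scores - scores.max()     # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = softmax_prob(vocab.index("human"), V_in, U_out)
log_p = np.log(probs[vocab.index("resources")])  # one term of the objective
```

Each evaluation touches every row of `U_out`, so the cost of a single term grows linearly with the vocabulary size.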

Notice that this is not practical for training: the denominator sums over the whole vocabulary, which can be very large. In practice, SG with negative sampling (SGNS) is often implemented instead: in short, rather than considering the whole vocabulary, we "punish the fake co-occurrences". To implement SGNS, we first build the *word-negativeness distribution* as 
$$
P_n(w):= \frac{f(w)^{\alpha}}{\sum\limits_{u\in V} f(u)^{\alpha}}\; , 
$$
where $w$ is a word, $f(w)$ is the *raw total count* of $w$ in the corpus, and $\alpha\in(0,1)$. Normally $\alpha$ is set to $0.75$: if $\alpha$ is too high, extremely frequent words dominate the negatives, while if $\alpha$ is too low, extremely rare words are over-sampled as negatives. 
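
The effect of $\alpha$ can be checked on a tiny made-up corpus. The function name `negative_distribution` is hypothetical, and $\alpha=1$ is included only as a reference point for the raw unigram frequencies.

```python
from collections import Counter

def negative_distribution(corpus, alpha=0.75):
    """Return P_n(w), proportional to f(w) ** alpha, as a dict."""
    counts = Counter(token for sentence in corpus for token in sentence)
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: weight / total for w, weight in weights.items()}

corpus = [
    ["aspiring", "human", "resources", "professional"],
    ["seeking", "human", "resources", "position"],
    ["human", "resources", "generalist"],
]
p_smooth = negative_distribution(corpus, alpha=0.75)
p_raw = negative_distribution(corpus, alpha=1.0)  # plain unigram frequencies
```

Compared with the raw frequencies, $\alpha=0.75$ moves probability mass from the frequent words ("human", "resources") to the rare ones.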
Then, we define the new SGNS loss (to be minimized) as 
$$
-\sum\limits_{(c,o)\in D^+} \left( \log\sigma(v_c^{\top}u_o) + \sum\limits_{i=1}^{k}\log\sigma(-v_c^{\top}u_{n_i})\right)\; ,
$$
where $\sigma(x):=\frac{1}{1+e^{-x}}$, and $n_1,\cdots, n_k$ are i.i.d. samples drawn from $P_n$ for each positive pair. 
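
A sketch of the contribution of a single positive pair, written with a leading minus sign so that smaller values are better. The vectors are made up, and the negatives are fixed by hand here instead of being sampled from $P_n$; the names `log_sigmoid` and `sgns_pair_loss` are hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigma(x)) = -log(1 + exp(-x)), computed in a numerically stable way
    return -np.logaddexp(0.0, -x)

def sgns_pair_loss(v_c, u_o, u_neg):
    """Negative-sampling loss for one (center, context) pair and k negatives."""
    positive = log_sigmoid(v_c @ u_o)             # reward the true pair
    negative = log_sigmoid(-(u_neg @ v_c)).sum()  # punish the k sampled pairs
    return float(-(positive + negative))

v_c = np.array([1.0, 0.0])
u_o_aligned = np.array([2.0, 0.0])           # points the same way as v_c
u_o_opposed = np.array([-2.0, 0.0])          # points the opposite way
u_neg = np.array([[-1.0, 0.0], [0.0, 1.0]])  # k = 2 hand-picked negatives

loss_aligned = sgns_pair_loss(v_c, u_o_aligned, u_neg)
loss_opposed = sgns_pair_loss(v_c, u_o_opposed, u_neg)
```

The loss is lower when the context vector aligns with the center vector, and each term only involves $k+1$ output vectors rather than the whole vocabulary.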

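In practice, gensim selects skip-gram with `sg=1` in `Word2Vec`. Negative sampling is then active by default (`negative=5`, `hs=0`), and `ns_exponent` (default `0.75`) plays the role of $\alpha$. 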

### Theory review of CBOW 

We define 
$$
C_t:=\{w_{t-j}\}_{j=1}^{R_t}\cup \{w_{t+j}\}_{j=1}^{R_t}
$$
for position $t$ as the "context" around center word $w_t$. 
The embedding of the context is then defined as 
$$
h_t:=\frac{\sum\limits_{w\in C_t} v_w}{|C_t|}\; . 
$$
Notice that this embedding discards the word order of the context, so it behaves very much like a BOW model. 
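
The order invariance can be seen in a short sketch. It assumes a fixed radius, integer token ids indexing the rows of a made-up matrix `V_in`, and it treats the context as a list of positions, so a word that occurs twice in the window is counted twice. The helper names are hypothetical.

```python
import numpy as np

def context_indices(t, length, radius):
    """Positions within `radius` of t, excluding t itself."""
    lo = max(0, t - radius)
    hi = min(length, t + radius + 1)
    return [s for s in range(lo, hi) if s != t]

def context_embedding(t, token_ids, V_in, radius):
    """Mean of the input vectors of the context words around position t."""
    ids = [token_ids[s] for s in context_indices(t, len(token_ids), radius)]
    return V_in[ids].mean(axis=0)

V_in = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

# Same center word (id 1); its two neighbours are swapped in the second sentence.
h_original = context_embedding(1, [0, 1, 2, 3], V_in, radius=1)
h_swapped = context_embedding(1, [2, 1, 0, 3], V_in, radius=1)
```

Both calls give the same $h_t$, since the mean does not depend on the order of the neighbours.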
The goal is to maximize 
$$
\sum\limits_t \log P(w_t|C_t)\; \text{, where }\; P(w_t|C_t):=\frac{\exp(h_t^{\top}u_{w_t})}{\sum\limits_{w\in V}\exp(h_t^{\top}u_w)}\; . 
$$
Again, in practice, CBOW with negative sampling (CBOW-NS) is often implemented. We define the new CBOW-NS loss (to be minimized) as 
$$
-\sum\limits_t\left( \log\sigma(h_t^{\top}u_{w_t}) + \sum\limits_{i=1}^k \log\sigma(-h_t^{\top}u_{n_i}) \right)\; , 
$$
where $\sigma$ and the $n_i$'s are the same as in SGNS. 
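
A sketch of the CBOW-NS term for a single position, again with a leading minus sign so that smaller is better. The matrices are made up so that the result is easy to verify by hand, the negative id is fixed rather than sampled from $P_n$, and the name `cbow_ns_loss` is hypothetical.

```python
import numpy as np

def log_sigmoid(x):
    # log(sigma(x)) = -log(1 + exp(-x)), computed in a numerically stable way
    return -np.logaddexp(0.0, -x)

def cbow_ns_loss(context_ids, center_id, negative_ids, V_in, U_out):
    """Negative-sampling loss for one position t."""
    h = V_in[context_ids].mean(axis=0)                        # h_t
    positive = log_sigmoid(h @ U_out[center_id])              # true center word
    negative = log_sigmoid(-(U_out[negative_ids] @ h)).sum()  # sampled words
    return float(-(positive + negative))

V_in = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
U_good = np.array([[0.0, 0.0], [3.0, 0.0], [-3.0, 0.0]])
U_bad = -U_good  # flips the sign of every score

loss_good = cbow_ns_loss([0, 1], 1, [2], V_in, U_good)
loss_bad = cbow_ns_loss([0, 1], 1, [2], V_in, U_bad)
```

With `U_good` the center word scores $3$ and the negative scores $-3$ against $h_t$, giving a small loss; flipping the signs raises the loss by exactly $6$.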

In practice, gensim selects CBOW with `sg=0` in `Word2Vec` (the gensim default). As with skip-gram, negative sampling is active by default (`negative=5`, `hs=0`). 