# Bag of Words (BOW) with Term Frequency-Inverse Document Frequency (TF-IDF)

This document experiments with BOW using TF-IDF weighting.

In [84]:
import pandas as pd 
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

import importlib 
import sys 
sys.path.append("../")

from proj_mod import bow 
importlib.reload(bow);


In [85]:
df_pt=pd.read_csv("../data/raw.csv")
df_pt

Unnamed: 0,id,job_title,location,connection,fit
0,1,2019 C.T. Bauer College of Business Graduate (...,"Houston, Texas",85,
1,2,Native English Teacher at EPIK (English Progra...,Kanada,500+,
2,3,Aspiring Human Resources Professional,"Raleigh-Durham, North Carolina Area",44,
3,4,People Development Coordinator at Ryan,"Denton, Texas",500+,
4,5,Advisory Board Member at Celal Bayar University,"İzmir, Türkiye",500+,
...,...,...,...,...,...
99,100,Aspiring Human Resources Manager | Graduating ...,"Cape Girardeau, Missouri",103,
100,101,Human Resources Generalist at Loparex,"Raleigh-Durham, North Carolina Area",500+,
101,102,Business Intelligence and Analytics at Travelers,Greater New York City Area,49,
102,103,Always set them up for Success,Greater Los Angeles Area,500+,


The "fit" column contains no target values, so this is an unsupervised similarity-ranking task. I opt to use "job_title" as the most important feature. The initial queries are “Aspiring human resources” and “seeking human resources”.
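As a rough illustration of this kind of ranking, the sketch below scores a few job titles against the two queries with `TfidfVectorizer` and `cosine_similarity`. It is not the implementation of `bow.BowRanker`, which is not shown here. The four-title corpus is a hand-picked subset, and averaging the per-query similarities is an assumption made only for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Small hand-picked subset of job titles (illustration only).
titles = [
    "Aspiring Human Resources Professional",
    "Native English Teacher at EPIK (English Program in Korea)",
    "Seeking Human Resources Opportunities",
    "People Development Coordinator at Ryan",
]
queries = ["Aspiring human resources", "seeking human resources"]

# Fit the vocabulary on the titles, then project the queries into the same space.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(titles)
query_matrix = vectorizer.transform(queries)

# sims[i, j] is the cosine similarity between query i and title j.
sims = cosine_similarity(query_matrix, doc_matrix)

# Assumption: combine the queries by averaging their similarities.
scores = sims.mean(axis=0)

# Indices of the titles, best match first.
order = np.argsort(-scores)
for idx in order:
    print(f"{scores[idx]:.3f}  {titles[idx]}")
```

Titles that share no term with either query receive a score of exactly zero, which matches the many zeros in the score array further down.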

In [86]:
corpus=df_pt["job_title"].to_list()
corpus

['2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
 'Native English Teacher at EPIK (English Program in Korea)',
 'Aspiring Human Resources Professional',
 'People Development Coordinator at Ryan',
 'Advisory Board Member at Celal Bayar University',
 'Aspiring Human Resources Specialist',
 'Student at Humber College and Aspiring Human Resources Generalist',
 'HR Senior Specialist',
 'Student at Humber College and Aspiring Human Resources Generalist',
 'Seeking Human Resources HRIS and Generalist Positions',
 'Student at Chapman University',
 'SVP, CHRO, Marketing & Communications, CSR Officer | ENGIE | Houston | The Woodlands | Energy | GPHR | SPHR',
 'Human Resources Coordinator at InterContinental Buckhead Atlanta',
 '2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professional',
 '2019 C.T. Bauer College of Business Graduate (Magna Cum Laude) and aspiring Human Resources professiona

In [87]:
tfidf_kwargs={
    "ngram_range": (1,2), # Use single words (unigrams) and two-word terms (bigrams)
    # "max_df": 0.9, # Ignore terms that appear in more than 90 percent of the documents
    # "min_df": 0.05 # Ignore terms that appear in fewer than 5 percent of the documents
}
bow_ranker=bow.BowRanker(tfidf_kwargs=tfidf_kwargs)

In [95]:
bow_ranker.fit(df=df_pt)

Fitted data removed. Proceeding...


<proj_mod.bow.BowRanker at 0x7fce118c93d0>

In [96]:
query=[
    "Aspiring human resources",
    "seeking human resources"
]
bow_ranker.create_score(query=query)

<proj_mod.bow.BowRanker at 0x7fce118c93d0>

In [97]:
bow_ranker.new_scores_

array([0.17598096, 0.        , 0.4886911 , 0.        , 0.        ,
       0.42537991, 0.26179783, 0.        , 0.26179783, 0.32002205,
       0.        , 0.        , 0.13617991, 0.17598096, 0.17598096,
       0.        , 0.4886911 , 0.        , 0.17598096, 0.        ,
       0.4886911 , 0.        , 0.        , 0.42537991, 0.26179783,
       0.        , 0.29044922, 0.44284399, 0.29044922, 0.44284399,
       0.17598096, 0.        , 0.4886911 , 0.        , 0.        ,
       0.42537991, 0.26179783, 0.        , 0.26179783, 0.32002205,
       0.        , 0.        , 0.13617991, 0.17598096, 0.        ,
       0.4886911 , 0.        , 0.        , 0.42537991, 0.26179783,
       0.        , 0.26179783, 0.32002205, 0.        , 0.        ,
       0.13617991, 0.17598096, 0.4886911 , 0.        , 0.42537991,
       0.        , 0.32002205, 0.        , 0.        , 0.13617991,
       0.21125667, 0.13572351, 0.1755387 , 0.10292453, 0.11687969,
       0.1801419 , 0.21592196, 0.3712681 , 0.34262474, 0.16962

In [98]:
cur_rank=bow_ranker.df_fitted_

In [99]:
# View the 10 highest-scoring rows
cur_rank.sort_values(by="fit_score", ascending=False).head(10)

Unnamed: 0,id,job_title,old_score,new_score,fit_score
2,3,Aspiring Human Resources Professional,0,0.488691,0.488691
32,33,Aspiring Human Resources Professional,0,0.488691,0.488691
16,17,Aspiring Human Resources Professional,0,0.488691,0.488691
20,21,Aspiring Human Resources Professional,0,0.488691,0.488691
57,58,Aspiring Human Resources Professional,0,0.488691,0.488691
96,97,Aspiring Human Resources Professional,0,0.488691,0.488691
45,46,Aspiring Human Resources Professional,0,0.488691,0.488691
27,28,Seeking Human Resources Opportunities,0,0.442844,0.442844
29,30,Seeking Human Resources Opportunities,0,0.442844,0.442844
98,99,Seeking Human Resources Position,0,0.427734,0.427734


In [100]:
query_again=["Seeking Human Resources Position"]
bow_ranker.create_score(query=query_again)

<proj_mod.bow.BowRanker at 0x7fce118c93d0>

In [101]:
new_rank=bow_ranker.df_fitted_
new_rank.sort_values(by="fit_score", ascending=False).head(10)

Unnamed: 0,id,job_title,old_score,new_score,fit_score
98,99,Seeking Human Resources Position,0.427734,1.0,0.82832
27,28,Seeking Human Resources Opportunities,0.442844,0.419362,0.426406
29,30,Seeking Human Resources Opportunities,0.442844,0.419362,0.426406
39,40,Seeking Human Resources HRIS and Generalist Po...,0.320022,0.303052,0.308143
52,53,Seeking Human Resources HRIS and Generalist Po...,0.320022,0.303052,0.308143
61,62,Seeking Human Resources HRIS and Generalist Po...,0.320022,0.303052,0.308143
9,10,Seeking Human Resources HRIS and Generalist Po...,0.320022,0.303052,0.308143
99,100,Aspiring Human Resources Manager | Graduating ...,0.213803,0.343766,0.304777
2,3,Aspiring Human Resources Professional,0.488691,0.159265,0.258093
45,46,Aspiring Human Resources Professional,0.488691,0.159265,0.258093
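The `old_score`, `new_score`, and `fit_score` columns above appear to be related by a fixed blend. The check below uses five rows copied from the table and is consistent with `fit_score = 0.7 * new_score + 0.3 * old_score`. The 0.7 weight is inferred from the printed numbers, not read from `proj_mod.bow`, so it is an assumption about how `BowRanker` combines scores.

```python
import numpy as np

# Five rows copied from the table above (ids 99, 28, 10, 100, 3).
old_score = np.array([0.427734, 0.442844, 0.320022, 0.213803, 0.488691])
new_score = np.array([1.0, 0.419362, 0.303052, 0.343766, 0.159265])
fit_score = np.array([0.82832, 0.426406, 0.308143, 0.304777, 0.258093])

# Assumption: weight inferred from the output, not taken from the module source.
weight_new = 0.7
blended = weight_new * new_score + (1 - weight_new) * old_score

# Largest gap between the blend and the reported fit_score;
# it should be on the order of the rounding in the printed table.
max_abs_error = np.max(np.abs(blended - fit_score))
print(max_abs_error)
```

Under this reading, an exact match to the latest query can move to the top, while rows that matched only the earlier queries keep part of their previous score.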


## Theory review of BOW

Consider a *document* $d$ and a vocabulary $V$ of terms. We define the *term weight vector* $w_{d}:=(w_{d,j})_{j\in V}$, where 
$$
w_{d,j}:= tf_{d,j}:= \text{ \# of times term } j \text{ appears in } d \; . 
$$
By construction, this method discards word order and context. The model assumes that documents sharing many terms end up geometrically close to each other (we measure closeness with cosine similarity). 
In practice, this corresponds to setting sublinear_tf=False (the default) and use_idf=False for TfidfVectorizer. With the default norm="l2", each vector is additionally scaled to unit length, which does not change cosine similarity; setting norm=None keeps the raw counts. 
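The following minimal sketch illustrates the two points above on a toy corpus (the three documents are made up for the example). It checks that `TfidfVectorizer` with `use_idf=False`, `sublinear_tf=False`, and `norm=None` yields the same weights as plain term counts from `CountVectorizer`, and that two documents containing the same words in a different order are indistinguishable under cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical).
docs = [
    "aspiring human resources",
    "resources human aspiring",  # same words as above, different order
    "native english teacher",
]

# Raw term counts: w_{d,j} = tf_{d,j}.
counts = CountVectorizer().fit_transform(docs)

# TfidfVectorizer reduced to plain counts: no IDF, no sublinear scaling,
# and norm=None to skip the default L2 normalization.
tf_only = TfidfVectorizer(
    use_idf=False, sublinear_tf=False, norm=None
).fit_transform(docs)

same_weights = np.allclose(counts.toarray(), tf_only.toarray())

# Pairwise cosine similarity between the count vectors.
sims = cosine_similarity(counts)
print(same_weights)
print(sims.round(3))
```

Documents 0 and 1 have identical count vectors, so word order plays no role; document 2 shares no term with them and is orthogonal to both.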

## Theory review of TF-IDF 

This method alters the term weight vector by reducing the weight allotted to extremely common terms (like "the" and "a"). 

There are some common ways to do this (we denote the altered term frequency by $tf_{d,j}^*$, with $tf_{d,j}^*=tf_{d,j}$ when no sublinear scaling is applied): 
* **Sublinear TF**: $tf_{d,j}^*:=1+\log(tf_{d,j})$ when $tf_{d,j}>0$, and $tf_{d,j}^*:=0$ otherwise. Intuitively, this dampens the *within-document term frequency*. This can be achieved in practice by setting sublinear_tf=True for TfidfVectorizer. 
* **Smoothed IDF**: $$idf_j := \log\left(\frac{1+ n}{ 1+ df_j}\right)+1\; \text{, where }\; n:=\text{ total \# of documents }\; \text{, and } df_j:=\text{ \# of documents containing term } j\; .$$ Intuitively, this down-weights terms that are common across the whole corpus. This can be achieved by setting smooth_idf=True and use_idf=True for TfidfVectorizer (both are the defaults). 
* **TF-IDF**: Define $$w_{d,j}:=tf^*_{d,j}\times idf_j\; $$ for the TF-IDF. 
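
The formulas above can be checked against scikit-learn on a toy corpus. The sketch below computes sublinear TF, smoothed IDF (natural logarithm), and their product by hand, then applies the L2 row normalization that `TfidfVectorizer` performs by default (`norm="l2"`), and compares the result with the library output. The three documents are made up for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical); the first document repeats two terms
# so that the sublinear scaling has a visible effect.
docs = [
    "human resources human resources specialist",
    "aspiring human resources professional",
    "student at chapman university",
]

# tf_{d,j}: raw counts, one row per document.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n = counts.shape[0]

# Sublinear TF: 1 + log(tf) where tf > 0, and 0 otherwise.
tf_star = np.zeros_like(counts)
nonzero = counts > 0
tf_star[nonzero] = 1.0 + np.log(counts[nonzero])

# Smoothed IDF: log((1 + n) / (1 + df_j)) + 1.
df = nonzero.sum(axis=0)
idf = np.log((1.0 + n) / (1.0 + df)) + 1.0

# TF-IDF weights, followed by L2 normalization of each document vector.
weights = tf_star * idf
weights /= np.linalg.norm(weights, axis=1, keepdims=True)

# Library result with the matching settings.
reference = TfidfVectorizer(
    sublinear_tf=True, smooth_idf=True, use_idf=True
).fit_transform(docs).toarray()

matches = np.allclose(weights, reference)
print(matches)
```

Both vectorizers sort the vocabulary alphabetically, so the columns line up without any reordering.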