# BOW and TFIDF

In this project, I write a function that computes a TFIDF matrix. It takes the following arguments:
* The tokens data frame to use.
* The OHCO level to use, i.e. which "bag" to use.
* The type of count to use (binary counts or regular counts).
* The type of TF to use.
* The type of IDF to use.

Then, I use the function to compute the TFIDF of a corpus twice: once with books as the bag and once with chapters as the bag. I discuss the different results at the bottom of the notebook.

## Import

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
sns.set()
%matplotlib inline

## Config

In [3]:
tf_norm_k = .5 #used for double norm

#for the look and feel of the tables 
#https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html 
gradient_cmap = 'YlGnBu' 


#The ordered hierarchy of content object for the corpus
OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num'] 

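SENTS = OHCO[:4] #first four levels: down to the sentence
PARAS = OHCO[:3] #first three levels: down to the paragraph
CHAPS = OHCO[:2] #first two levels: down to the chapter
BOOKS = OHCO[:1] #first level only: the book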
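
To illustrate how these OHCO prefixes act as bags, here is a minimal sketch on a made-up token table. The `toy` frame and its values are hypothetical and not part of the corpus. Grouping by a longer prefix yields more, smaller bags.

```python
import pandas as pd

OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
CHAPS = OHCO[:2]
BOOKS = OHCO[:1]

# Hypothetical toy token table: 2 books, 3 chapters in total
toy = pd.DataFrame({
    'book_id':   [1, 1, 1, 2, 2],
    'chap_num':  [1, 1, 2, 1, 1],
    'para_num':  [1, 1, 1, 1, 1],
    'sent_num':  [1, 1, 1, 1, 1],
    'token_num': [1, 2, 1, 1, 2],
    'term_str':  ['the', 'whale', 'the', 'she', 'the'],
}).set_index(OHCO)

# Number of bags at each level
n_book_bags = toy.groupby(BOOKS).ngroups
n_chap_bags = toy.groupby(CHAPS).ngroups

# Token counts per bag
tokens_per_book = toy.groupby(BOOKS).term_str.count()
tokens_per_chap = toy.groupby(CHAPS).term_str.count()
```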

In [4]:
#prepare data:
LIB = pd.read_csv("LIB.csv").set_index(BOOKS)
TOKEN = pd.read_csv('TOKEN.csv').set_index(OHCO)
VOCAB = pd.read_csv('VOCAB.csv').set_index('term_id')

In [5]:
VOCAB = VOCAB[~VOCAB.term_str.isna()]

In [6]:
TOKEN = TOKEN[~TOKEN.term_str.isna()]

In [7]:
#add term_id to the token table
TOKEN['term_id'] = TOKEN.term_str.map(VOCAB.reset_index().set_index('term_str').term_id)

In [8]:
#add pos to vocab table
VOCAB['pos_max'] = TOKEN.groupby(['term_id', 'pos']).pos.count().unstack().idxmax(1)

In [9]:
#add term rank to vocab table
if 'term_rank' not in VOCAB.columns:
    VOCAB = VOCAB.sort_values('n', ascending=False).reset_index()
    VOCAB.index.name = 'term_rank'
    VOCAB = VOCAB.reset_index()
    VOCAB = VOCAB.set_index('term_id')
    VOCAB['term_rank'] = VOCAB['term_rank'] + 1

In [10]:
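#alternate rank: terms with the same count share a rank
#(index and series are named explicitly so the result does not depend on the pandas version)
new_rank = VOCAB.n.value_counts()\
    .sort_index(ascending=False).rename_axis('n').rename('nn')\
    .reset_index().reset_index()\
    .rename(columns={'index':'term_rank2'})\
    .set_index('n')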

In [11]:
VOCAB['term_rank2'] = VOCAB.n.map(new_rank.term_rank2) + 1
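
As a cross-check, the same tie-sharing rank can be sketched with pandas' built-in `Series.rank(method='dense')`. The `toy_vocab` table below is made up for illustration, and I am assuming the intent of `term_rank2` is that terms with equal counts share a rank.

```python
import pandas as pd

# Hypothetical vocabulary with tied counts
toy_vocab = pd.DataFrame({
    'term_str': ['the', 'of', 'whale', 'she', 'um'],
    'n':        [10, 7, 7, 3, 3],
})

# Dense rank: highest count gets rank 1, ties share a rank, no gaps
toy_vocab['term_rank2'] = toy_vocab.n.rank(method='dense', ascending=False).astype(int)
```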

In [12]:
VOCAB['p'] = VOCAB.n / VOCAB.shape[0]

In [13]:
#compute zipf's k
VOCAB['zipf_k'] = VOCAB.n * VOCAB.term_rank
VOCAB['zipf_k2'] = VOCAB.n * VOCAB.term_rank2
VOCAB['zipf_k3'] = VOCAB.p * VOCAB.term_rank2

In [14]:
#compute P of vocab: the prior, or marginal, probability of a term
VOCAB['p2'] = VOCAB.n / VOCAB.n.sum()

In [15]:
#compute entropy
VOCAB['h'] = VOCAB.p2 * np.log2(1/VOCAB.p2) # Self entropy of each word 
H = VOCAB.h.sum()
N_v = VOCAB.shape[0]
H_max = np.log2(N_v)
R = round(1 - (H/H_max), 2) * 100
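
To make the entropy and redundancy calculation concrete, here is a small worked sketch on a made-up four-term vocabulary. The counts are hypothetical and chosen so that the probabilities are powers of two.

```python
import numpy as np
import pandas as pd

# Hypothetical term counts for a four-term vocabulary
toy = pd.DataFrame({'n': [4, 2, 1, 1]}, index=['the', 'of', 'whale', 'she'])

# Marginal probability and per-term contribution to entropy
toy['p2'] = toy.n / toy.n.sum()
toy['h'] = toy.p2 * np.log2(1 / toy.p2)

H = toy.h.sum()                # observed entropy in bits
H_max = np.log2(toy.shape[0])  # entropy if all terms were equally likely
R = (1 - H / H_max) * 100      # redundancy as a percentage (unrounded here)
```

With these counts the probabilities are 0.5, 0.25, 0.125 and 0.125, so `H` is 1.75 bits against an `H_max` of 2 bits, giving a redundancy of 12.5%.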

## Function

In [16]:
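def tfidf(token_df, bag, count_method, tf_method, idf_method):
    """Compute TFIDF for token_df and return the top terms as a styled table.

    bag: list of OHCO columns defining the bag (e.g. BOOKS or CHAPS)
    count_method: 'n' (regular counts) or 'c' (binary counts)
    tf_method: 'sum', 'max', 'log', 'raw', 'double_norm', or 'binary'
    idf_method: 'standard', 'max', or 'smooth'

    Side effect: writes df, idf, and tfidf_sum into the global VOCAB table.
    """

    #BOW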
    BOW = token_df.groupby(bag+['term_id']).term_id.count()\
        .to_frame().rename(columns={'term_id':'n'})

    BOW['c'] = BOW.n.astype('bool').astype('int')
    
    #Count matrix
    DTCM = BOW[count_method].unstack().fillna(0).astype('int')
    
    #compute TF
    if tf_method == 'sum':
        TF = DTCM.T / DTCM.T.sum()

    elif tf_method == 'max':
        TF = DTCM.T / DTCM.T.max()

    elif tf_method == 'log':
        TF = np.log10(1 + DTCM.T)

    elif tf_method == 'raw':
        TF = DTCM.T

    elif tf_method == 'double_norm':
        TF = DTCM.T / DTCM.T.max()
        TF = tf_norm_k + (1 - tf_norm_k) * TF[TF > 0] 

    elif tf_method == 'binary':
        TF = DTCM.T.astype('bool').astype('int')

    TF = TF.T

    #compute DF
    DF = DTCM[DTCM > 0].count()

    #compute IDF
    N = DTCM.shape[0]

    if idf_method == 'standard':
        IDF = np.log10(N / DF)

    elif idf_method == 'max':
        IDF = np.log10(DF.max() / DF) 

    elif idf_method == 'smooth':
        IDF = np.log10((1 + N) / (1 + DF)) + 1

    #compute TFIDF    
    TFIDF = TF * IDF

    #move things to their places
    VOCAB['df'] = DF
    VOCAB['idf'] = IDF

    BOW['tf'] = TF.stack()
    BOW['tfidf'] = TFIDF.stack()

    #apply TFIDF sum to vocab
    VOCAB['tfidf_sum'] = TFIDF.sum()

    #top 20 terms by summed TFIDF, styled with a color gradient
    result = VOCAB[['term_str','tfidf_sum']]\
    .sort_values('tfidf_sum', ascending=False).head(20)\
    .style.background_gradient(cmap=gradient_cmap, high=1)
    
    return result
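
To make the `'sum'` TF and `'standard'` IDF path concrete, here is a minimal self-contained sketch on a made-up count matrix. The bags `d1` to `d3` and the counts are hypothetical. It mirrors the steps of the function above but does not touch `TOKEN` or `VOCAB`.

```python
import numpy as np
import pandas as pd

# Hypothetical document-term count matrix: rows are bags, columns are terms
DTCM = pd.DataFrame(
    {'she': [4, 2, 6], 'whale': [0, 8, 0], 'pierre': [6, 0, 0]},
    index=['d1', 'd2', 'd3'])

# TF ('sum'): each count divided by the total count of its bag
TF = DTCM.div(DTCM.sum(axis=1), axis=0)

# DF: number of bags in which each term occurs
DF = (DTCM > 0).sum()

# IDF ('standard'): log10 of the number of bags over DF
N = DTCM.shape[0]
IDF = np.log10(N / DF)

# TFIDF: scale each term column by its IDF, then sum over bags
TFIDF = TF * IDF
tfidf_sum = TFIDF.sum().sort_values(ascending=False)
```

In this toy matrix `she` occurs in every bag, so its IDF and therefore its summed TFIDF are zero no matter how frequent it is.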

## TFIDF of the corpus with books as the bag

In [17]:
tfidf(TOKEN, BOOKS, 'n', 'sum', 'standard')

term_id,term_str,tfidf_sum
26302,pierre,0.009838
11648,elinor,0.006763
19306,israel,0.006504
38673,vernon,0.005857
2644,babbalanja,0.005394
22176,media,0.00494
5540,catherine,0.004324
21823,marianne,0.004316
29073,reginald,0.004167
11812,emma,0.004055


## TFIDF of the corpus with chapters as the bag

In [18]:
tfidf(TOKEN, CHAPS, 'n', 'sum', 'standard')

term_id,term_str,tfidf_sum
31648,she,1.349173
16730,her,1.331294
26302,pierre,1.15434
40387,you,0.756695
23260,mr,0.706684
17566,i,0.664376
39540,whale,0.594694
23261,mrs,0.593591
35574,thou,0.586061
36885,um,0.501435


As the two tables show, with books as the bag the top terms are almost all proper nouns (pierre, elinor, israel), which are specific to individual books. With chapters as the bag, the list is instead dominated by pronouns and titles (she, her, you, mr), with only a few content words such as pierre and whale.
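
One plausible reading of this difference, which I have not checked against the corpus itself, is that a term appearing in every book gets a standard IDF of zero, while the same term is missing from some chapters and so keeps a positive IDF. A minimal sketch with a made-up token table (the `toy` frame and the `idf_standard` helper are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical tokens: 'she' occurs in both books but only in 2 of 4 chapters
toy = pd.DataFrame({
    'book_id':  [1, 1, 1, 2, 2, 2],
    'chap_num': [1, 1, 2, 1, 2, 2],
    'term_str': ['she', 'pierre', 'pierre', 'she', 'whale', 'whale'],
})

def idf_standard(tokens, bag):
    """Standard IDF, log10(N / DF), with the given columns as the bag."""
    # One row per (bag, term) pair, then count bags per term
    pairs = tokens[bag + ['term_str']].drop_duplicates()
    DF = pairs.groupby('term_str').size()
    # Number of distinct bags
    N = len(tokens[bag].drop_duplicates())
    return np.log10(N / DF)

idf_books = idf_standard(toy, ['book_id'])
idf_chaps = idf_standard(toy, ['book_id', 'chap_num'])
```

Here `she` has an IDF of zero at the book level but a positive IDF at the chapter level, so a frequent pronoun can only contribute to `tfidf_sum` once the bags are small enough for it to be absent from some of them.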