# Term Frequency - Inverse Document Frequency (TF-IDF)

<h3>
<ol>
    <li><a href="#section1">Introduction</a></li>
    <li><a href="#section2">TF-IDF from Scratch</a></li>
    <ol>
        <li><a href="#section2_1">Variations</a></li>
        <li><a href="#section2_2">Normalize</a></li>
    </ol>
    <li><a href="#section3">Implementations</a></li>
    <ol>
        <li><a href="#section3_1">Gensim</a></li>
        <li><a href="#section3_2">Scikit</a></li>
        <li><a href="#section3_3">Transform Gensim $\leftrightarrow$ Scikit</a></li>
    </ol>
    <li><a href="#section4">Accuracy</a></li>
</ol>
</h3>


<a id="section1"></a>
# 1. Introduction

Let a document $d_j$, $j\in\mathbb{N}$, be a collection of terms in which $k\in\mathbb{N}$ distinct terms $t_i$ occur (with $t_i\neq t_h$ for $i\neq h$), each possibly several times: $(t_1, ..., t_1, t_2, ..., t_2, ..., t_i, ..., t_i, ..., t_k, ..., t_k)$. Let $\#d_j$ be the size of that document, i.e. its total number of terms. Then for every $i$ there is exactly one count $w_i=\#t_i$, so that we can represent $d_j$ by the tuple $(w_1, w_2, ..., w_i, ..., w_k)$, where $w_i$ is the number of occurrences of the term $t_i$ in $d_j$. Now let there be $m\in\mathbb{N}$ documents $d_j$ that together contain $n\in\mathbb{N}$ distinct terms, so that every $d_j$ is represented by the counts $w_{i,j}$ with $i=1,...,n$ (and $w_{i,j}=0$ if $t_i$ does not occur in $d_j$). Finally $\bigcup\limits_{j=1}^{m} d_j$ is the union of all documents, and $\#w_i$ is the number of documents in it that contain $t_i$, i.e. the number of $w_{i,j}>0$.

Then we call $\frac{w_{i,j}}{\#d_j}$ <b>term frequency</b> and $\log(\frac{m}{\#w_i})$ <b>inverse document frequency</b>. In words, <b>term frequency</b> is the number of occurrences of a term in a document divided by the total number of terms in that document. <b>Inverse document frequency</b> is the logarithm of the number of all documents divided by the number of documents that contain the term. An overview of different approaches is given by [Robertson in 2004](http://www.inf.ed.ac.uk/teaching/courses/ad/lectures16/Robertson_IDF.pdf).

|   t   |   $d_1$   |    $d_2$   |   $d_3$   | ... |   $d_j$   | ... |   $d_m$   |
|  ---  |    ---    |    ---     |    ---    | --- |    ---    | --- |    ---    |
| $t_1$ | $w_{1,1}$ |  $w_{1,2}$ | $w_{1,3}$ | ... | $w_{1,j}$ | ... | $w_{1,m}$ |
| $t_2$ | $w_{2,1}$ |  $w_{2,2}$ | $w_{2,3}$ | ... | $w_{2,j}$ | ... | $w_{2,m}$ |
| ...   |    ...    |    ...     |   ...     | ... |    ...    | ... |    ...    |
| $t_i$ | $w_{i,1}$ |  $w_{i,2}$ | $w_{i,3}$ | ... | $w_{i,j}$ | ... | $w_{i,m}$ |
| ...   |    ...    |    ...     |   ...     | ... |    ...    | ... |    ...    |
| $t_n$ | $w_{n,1}$ |  $w_{n,2}$ | $w_{n,3}$ | ... | $w_{n,j}$ | ... | $w_{n,m}$ |


$$\#w_i=  \sum_{j=1}^m \left\{\begin{array}{ll} 0, & w_{i,j}=0  \\ 1, & w_{i,j}>0 \end{array}\right .  \,\,\,\,\,\,\,\,\,\, \#d_j=\sum_{i=1}^n w_{i,j}$$
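As a minimal sketch of the definitions above, the following computes $w_{i,j}$, $\#d_j$, $\#w_i$, term frequency and inverse document frequency for a tiny made-up corpus. The corpus and the helper names (`term_frequency`, `document_frequency`, `inverse_document_frequency`) are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of terms.
docs = [["a", "b", "b"], ["b", "c"], ["c", "c", "c", "d"]]
m = len(docs)                            # number of documents
counts = [Counter(doc) for doc in docs]  # w_{i,j}: count of term i in document j

def term_frequency(term, j):
    """w_{i,j} / #d_j: share of document j that consists of `term`."""
    return counts[j][term] / len(docs[j])

def document_frequency(term):
    """#w_i: number of documents that contain `term` at least once."""
    return sum(1 for c in counts if c[term] > 0)

def inverse_document_frequency(term):
    """log(m / #w_i), using the natural logarithm."""
    return math.log(m / document_frequency(term))

print(term_frequency("b", 0))            # 2 of the 3 terms in the first document are "b"
print(document_frequency("b"))           # "b" occurs in 2 documents
print(inverse_document_frequency("b"))   # log(3 / 2)
```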
<br/>
<br/>

<h3>Term Frequency (TF)</h3>
Term frequency weighting goes back to Hans Peter Luhn, who introduced it in 1957 in [A Statistical Approach to Mechanized Encoding and Searching of Literary Information](http://www.phil-fak.uni-duesseldorf.de/fileadmin/Redaktion/Institute/Informationswissenschaft/downloadcenter/infocenter/Informationretrieval/Luhn_1957_statistical_approach.pdf). Since then different TF weights have been developed (incomplete list):<br/>
binary: TF=1 if $t_i$ appears in $d_j$, else TF=0<br/>
log normalization: TF=$1+\log(w_{i,j})$<br/>
raw count: TF=$w_{i,j}$<br/>
double normalization x: TF=$x+(1-x)\cdot \frac {w_{i,j}}{max(d_j)}$<br/>
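
The TF weights listed above can be sketched as small Python functions. The function names are hypothetical, and returning 0 for a term that does not occur in the log-normalized variant is an assumption made here because $\log(0)$ is undefined.

```python
import math

def tf_binary(w):
    # 1 if the term occurs at all, else 0
    return 1 if w > 0 else 0

def tf_raw(w):
    # raw count w_{i,j}
    return w

def tf_log(w):
    # 1 + log(w) for occurring terms; 0 otherwise (assumption, log(0) is undefined)
    return 1 + math.log(w) if w > 0 else 0

def tf_double(w, w_max, x=0.5):
    # x + (1 - x) * w / max(d_j), where w_max is the highest count in the document
    return x + (1 - x) * w / w_max

# Hypothetical counts of three terms in one document
counts = [1, 4, 0]
w_max = max(counts)
for w in counts:
    print(w, tf_binary(w), tf_raw(w), tf_log(w), tf_double(w, w_max))
```

Note that double normalization assigns the value x even to terms that do not occur in the document.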
<br/>
<br/>
<h3>Inverse document frequency (IDF)</h3>
Inverse document frequency was introduced in 1972 by Karen Spärck Jones in [A Statistical Interpretation of Term Specificity and Its Application in Retrieval](https://doi.org/10.1108%2Feb026526). Since then different IDF weights have been developed (incomplete list):<br/>
unary: IDF=1<br/>
smooth: IDF=$\log(1+\frac{m}{\#w_i})$<br/>
max: IDF=$\log(\frac{max(\#w)}{1+\#w_i})$<br/>
probabilistic: IDF=$\log(\frac{m-\#w_i}{\#w_i})$<br/>
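
A comparable sketch for the IDF weights, again with hypothetical function names. Here `df` stands for $\#w_i$, `m` for the number of documents and `max_df` for $max(\#w)$. The probabilistic variant is undefined when a term occurs in every document, because its argument becomes 0.

```python
import math

def idf_unary(df, m):
    return 1.0

def idf_plain(df, m):
    # log(m / #w_i), the form used in the introduction
    return math.log(m / df)

def idf_smooth(df, m):
    return math.log(1 + m / df)

def idf_max(df, max_df):
    return math.log(max_df / (1 + df))

def idf_probabilistic(df, m):
    # undefined for df == m, since log(0) does not exist
    return math.log((m - df) / df)

# Assumed example: a term that occurs in 3 of 6 documents
print(idf_plain(3, 6), idf_smooth(3, 6), idf_probabilistic(3, 6))
```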
<br/>
<br/>
<h3>Term Frequency - Inverse document frequency (TF-IDF)</h3>
Formula: TF-IDF=TF$\cdot$IDF<br/>
Mechanism: The more often a term appears in a document, the bigger TF gets. The more documents a term appears in, the closer IDF gets to 0 (for $\log(\frac{m}{\#w_i})$, IDF is exactly 0 when the term appears in every document). So terms that are spammed across all documents get low values, while terms that appear very often in only a few documents get high values.
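
To make the mechanism concrete, this short sketch prints the plain IDF $\log(\frac{m}{\#w_i})$ for every possible document frequency. The value `m = 6` is an assumed corpus size chosen only for illustration.

```python
import math

m = 6  # assumed number of documents

# plain IDF log(m / df) for every possible document frequency df = 1..m
idf_by_df = {df: math.log(m / df) for df in range(1, m + 1)}

for df, idf in idf_by_df.items():
    print(df, round(idf, 6))
```

The values fall steadily and reach 0 for a term that occurs in all documents, so such a term contributes nothing to TF-IDF regardless of its TF.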
<br/>
<br/>
<br/>

<a id="section2"></a>
# 2. TF-IDF from Scratch

<br/>
I am going to start with the simplest version of tf-idf and then explore more variations.

Formula: $tf\_idf=w_{i,j}\cdot\log(\frac{m}{\#w_i})$

<h3>Code with example:</h3>

In [96]:
#import all modules needed for 2. TF-IDF from Scratch
import pandas as pd
from collections import Counter
import itertools
import math
from IPython.display import display_html

#prepare displayer: render several DataFrames side by side in one HTML table
def disp(*args, **kwargs):
    
    #grid shape, both values default to 1
    col=kwargs.get('col',1)
    row=kwargs.get('row',1)
    
    objs=[]
    for element in args:
        try:
            title='<h3 align="center">'+element[0]+'</h3><br/>'
            obj=element[1].to_html()
        except:
            title=''
            obj=element.to_html()
        objs.append(title+obj)
    
    #add rows if there are more objects than cells
    if len(objs)>row*col:
        row=(len(objs)+col-1)//col
    
    #pad the last row with empty cells
    objs+=['' for x in range(row*col-len(objs))]
    
    k=0
    html_source='<table border="1" class="dataframe">'
    for i in range(row):
        html_source+='<tr>'
        for j in range(col):
            html_source+='<td bgcolor="white">'+objs[k]+'</td>'
            k+=1
        html_source+='</tr>'
    html_source+='</table>'
    html_source=html_source.replace('<table','<table style="display:inline"')
    display_html(html_source,raw=True)

In [97]:
#initialize a clean corpus with overall spamming terms, unique spamming terms and mixed appearances
documents=['A B C D E F',
           'B B B B E F',
           'C C C D E X',
           'X X D D E X',
           'X X X X E X',
           'G G G G E G'] 

#number of documents
m=len(documents)

#string in documents to token in documents
documents_tokenized=[document.split(' ') for document in documents]

#unique terms
tokens=set(list(itertools.chain.from_iterable(documents_tokenized)))

#number of terms-appearance in each document as Counter
document_token_counts=[Counter(document) for document in documents_tokenized]

#initiate range to create variable_names by iteration
it=range(m)

#create variable_names by iteration
columns=['t']+['d_'+str(i+1) for i in it]

#initiate pandas DataFrame
count_matrix=pd.DataFrame(columns=columns)

#save distribution of each term per document to pandas
for token in tokens:
    count_matrix.loc[count_matrix.shape[0]]=( [token]+[document_token_counts[i][token] for i in it] )
    
#sort the dataframe in alphabetical order
count_matrix=count_matrix.sort_values('t').reset_index(drop=True)

#count every term at most once per document and sum over all documents to get #w_i
#(built from the tokenized documents, so document_token_counts is left untouched)
d_t=sum((Counter(set(document)) for document in documents_tokenized), Counter())

#initiate a new column and save #w_i
count_matrix['#w_i']=0
for d in d_t:
    count_matrix.loc[count_matrix['t']==d,'#w_i']=d_t[d]

#calculate idf=log(m/#w_i) for each t_i
count_matrix['idf']=count_matrix['#w_i'].map(lambda x: math.log(m/float(x)))

count_matrix.index += 1 

In [98]:
#initiate dataframe for tf_idf
tfidf_matrix=count_matrix[['t']].copy()

#calculate tf-idf for every document d_j and term t_i: tf-idf=w_{i,j}*idf
for col in columns[1:]:
    tfidf_matrix['tf_idf_'+col.split('_')[-1]]=count_matrix[col]*count_matrix['idf']

In [99]:
disp(['Basic TF & IDF',count_matrix], ['TF-IDF',tfidf_matrix], row=1, col=2)

<b>Basic TF & IDF</b>

|   | t | d_1 | d_2 | d_3 | d_4 | d_5 | d_6 | #w_i | idf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1.791759 |
| 2 | B | 1 | 4 | 0 | 0 | 0 | 0 | 2 | 1.098612 |
| 3 | C | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 1.098612 |
| 4 | D | 1 | 0 | 1 | 2 | 0 | 0 | 3 | 0.693147 |
| 5 | E | 1 | 1 | 1 | 1 | 1 | 1 | 6 | 0.0 |
| 6 | F | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 1.098612 |
| 7 | G | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 1.791759 |
| 8 | X | 0 | 0 | 1 | 3 | 5 | 0 | 3 | 0.693147 |

<b>TF-IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.79176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.09861 | 4.39445 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.09861 | 0.0 | 3.29584 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.38629 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.09861 | 1.09861 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.9588 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 2.07944 | 3.46574 | 0.0 |


Example: X for document 5
    
$w_{8,5}=5$, look up row 8 $\rightarrow$ X and column d_5<br/>
$\#w_8=3$, because X is in d_3, d_4 and d_5<br/>
$m=6$, because there are 6 documents.<br/>
$tf\_idf(X,d_5)=w_{8,5}\cdot\log(\frac{m}{\#w_8})=5\cdot\log(\frac{6}{3})=5\cdot\log(2)=5\cdot 0.6931471805599453 = 3.46574$
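
The arithmetic of this example can be re-checked with a few lines of plain Python, independent of the pandas code. The variable names are illustrative only.

```python
import math
from collections import Counter

documents = ['A B C D E F', 'B B B B E F', 'C C C D E X',
             'X X D D E X', 'X X X X E X', 'G G G G E G']
tokenized = [d.split(' ') for d in documents]

m = len(tokenized)                                 # 6 documents
w_x_d5 = Counter(tokenized[4])['X']                # count of X in d_5
df_x = sum(1 for doc in tokenized if 'X' in doc)   # documents that contain X
tf_idf_x_d5 = w_x_d5 * math.log(m / df_x)

print(w_x_d5, df_x, round(tf_idf_x_d5, 5))
```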

<a id="section2_1"></a>
# Variations

Let's take a look at some chosen variations:

1. <b>base</b> model: $tf\_idf=w_{i,j}\cdot\log\frac{m}{\#w_i}$
2. <b>weight</b>ing TF by the number of terms in a document: $tf\_idf=\frac{w_{i,j}}{\#d_j} \cdot \log\frac{m}{\#w_i}$
3. avoid <b>zero</b>-division: $tf\_idf=w_{i,j} \cdot \log\frac{m}{1+\#w_i}$
4. <b>smooth</b>ing IDF: $tf\_idf=w_{i,j} \cdot \log\frac{1+m}{\#w_i}$
5. <b>double</b> normalization of TF: $tf\_idf=(x+(1-x)\cdot\frac{w_{i,j}}{max(d_j)}) \cdot \log\frac{m}{\#w_i}$
6. <b>lognorm</b>: $tf\_idf=(1+\log(w_{i,j})) \cdot \log\frac{m}{\#w_i}$
7. <b>scale</b>d IDF: $tf\_idf=w_{i,j} \cdot \frac{\log\frac{m}{\#w_i}}{\log(2)}$

Let's build a function for all these variations:

In [100]:
def tf_idf(documents_tokenized, weight=False, zero=0, smooth=0, double=False, scaleidf=1,lognorm=False, addidf=0):   
      
    #prepare smooth    
    if type(smooth)==type(True) and smooth==True:
        smooth=1
    if smooth==False:
        smooth=0
        
    #prepare zero
    if type(zero)==type(True) and zero==True:
        zero=1
    if zero==False:
        zero=0
        
    #prepare double    
    if type(double)==type(True) and double==True:
        double=0.5
    if double==False:
        double=0
    
    #prepare scaleidf    
    if scaleidf==False:
        scaleidf=1
    if type(scaleidf)==type(True) and scaleidf==True:
        scaleidf=math.log(2)

    
    #calculate variation of tf
    def tf(w_i,lognorm):
        if type(lognorm)==type(True) and lognorm==True:
            lognorm=1
        if lognorm!=False and w_i!=0:
            w_i=lognorm+math.log(w_i)        
        return w_i
    
    #########
    #########
    
    
    #unique terms
    tokens=set(list(itertools.chain.from_iterable(documents_tokenized)))

    #number of terms-appearance in each document as Counter
    document_token_counts=[Counter(document) for document in documents_tokenized]

    #number of documents, taken from the argument instead of the global m
    m=len(documents_tokenized)

    #initiate range to create variable_names by iteration
    it=range(m)

    #create variable_names by iteration
    columns=['t']+['d_'+str(i+1) for i in it]

    #initiate pandas DataFrame
    count_matrix=pd.DataFrame(columns=columns)

    #save distribution of each term per document to pandas
    for token in tokens:
        count_matrix.loc[count_matrix.shape[0]]=( [token]+[document_token_counts[i][token] for i in it] )

    #sort the dataframe in alphabetical order
    count_matrix=count_matrix.sort_values('t').reset_index(drop=True)

    #count every term at most once per document and sum over all documents to get #w_i
    d_t=sum((Counter(set(document)) for document in documents_tokenized), Counter())

    
    #######
    # idf #
    #######
    
    #initiate a new column and save #w_i
    count_matrix['#w_i']=0
    for d in d_t:
        count_matrix.loc[count_matrix['t']==d,'#w_i']=d_t[d]

    #calculate idf for each w_i: addidf + log((smooth+m)/(zero+#w_i))/scaleidf
    count_matrix['idf']=count_matrix['#w_i'].map(lambda w: addidf+(math.log((smooth+m)/(zero+float(w)))/scaleidf))

    
    ######
    # tf #
    ######
    
    # w_i divided by max(w) [divided by highest term frequency]
    if double!=False:
        for col in columns[1:]:
            count_matrix[col]=count_matrix[col]/max(count_matrix[col])
            
    #weight by k=#d_j, number of terms in d_j
    if type(1)==type(weight) or type(1.0)==type(weight):        
        weight=[weight for col in columns[1:]]
        count_matrix[columns[1:]]=count_matrix[columns[1:]].div(weight, axis=1)
    if weight==True and type(weight)==type(True):
        weight=[sum(count_matrix[col]) for col in columns[1:]]
        count_matrix[columns[1:]]=count_matrix[columns[1:]].div(weight, axis=1)

    for col  in columns[1:]:    
        count_matrix[col]=count_matrix[col].map(lambda w_i: double + (1-double) *tf(w_i,lognorm))


    ##########   
    # tf-idf #
    ##########
        
    #initiate dataframe for tf_idf
    tfidf_matrix=count_matrix[['t']].copy()

    #calculate tf-idf for every document d_j and term t_i: tf-idf=tf*idf
    for col in columns[1:]:
        tfidf_matrix['tf_idf_'+col.split('_')[-1]]=count_matrix[col]*count_matrix['idf']
    tfidf_matrix.index += 1 
    return tfidf_matrix

In [101]:
disp(['Simple TF & IDF',count_matrix],
     ['Simple TF-IDF',tfidf_matrix],
     ['TF-IDF with weighted TF',tf_idf(documents_tokenized, weight=True)],
     ['TF-IDF with smoothed IDF',tf_idf(documents_tokenized, smooth=True)],
     ['TF-IDF with double normalized TF',tf_idf(documents_tokenized, double=0.5)],
     ['TF-IDF with avoid zero division IDF',tf_idf(documents_tokenized, zero=True)],
     ['TF-IDF with log normalized TF',tf_idf(documents_tokenized, lognorm=True)],
     ['TF-IDF with scaled IDF',tf_idf(documents_tokenized, scaleidf=math.log(2))],
     row=4, col=2)

<b>Simple TF & IDF</b>

|   | t | d_1 | d_2 | d_3 | d_4 | d_5 | d_6 | #w_i | idf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1.791759 |
| 2 | B | 1 | 4 | 0 | 0 | 0 | 0 | 2 | 1.098612 |
| 3 | C | 1 | 0 | 3 | 0 | 0 | 0 | 2 | 1.098612 |
| 4 | D | 1 | 0 | 1 | 2 | 0 | 0 | 3 | 0.693147 |
| 5 | E | 1 | 1 | 1 | 1 | 1 | 1 | 6 | 0.0 |
| 6 | F | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 1.098612 |
| 7 | G | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 1.791759 |
| 8 | X | 0 | 0 | 1 | 3 | 5 | 0 | 3 | 0.693147 |

<b>Simple TF-IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.79176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.09861 | 4.39445 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.09861 | 0.0 | 3.29584 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.38629 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.09861 | 1.09861 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.9588 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 2.07944 | 3.46574 | 0.0 |

<b>TF-IDF with weighted TF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 0.298627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 0.183102 | 0.732408 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 0.183102 | 0.0 | 0.549306 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.115525 | 0.0 | 0.115525 | 0.231049 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 0.183102 | 0.183102 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.493133 |
| 8 | X | 0.0 | 0.0 | 0.115525 | 0.346574 | 0.577623 | 0.0 |

<b>TF-IDF with smoothed IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.94591 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.252763 | 5.011052 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.252763 | 0.0 | 3.758289 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.847298 | 0.0 | 0.847298 | 1.694596 | 0.0 | 0.0 |
| 5 | E | 0.154151 | 0.154151 | 0.154151 | 0.154151 | 0.154151 | 0.154151 |
| 6 | F | 1.252763 | 1.252763 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.729551 |
| 8 | X | 0.0 | 0.0 | 0.847298 | 2.541894 | 4.236489 | 0.0 |

<b>TF-IDF with double normalized TF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.89588 | 0.89588 | 0.89588 | 0.89588 | 0.89588 |
| 2 | B | 1.098612 | 1.098612 | 0.549306 | 0.549306 | 0.549306 | 0.549306 |
| 3 | C | 1.098612 | 0.549306 | 1.098612 | 0.549306 | 0.549306 | 0.549306 |
| 4 | D | 0.693147 | 0.346574 | 0.462098 | 0.577623 | 0.346574 | 0.346574 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 0.686633 | 0.549306 | 0.549306 | 0.549306 | 0.549306 |
| 7 | G | 0.89588 | 0.89588 | 0.89588 | 0.89588 | 0.89588 | 1.791759 |
| 8 | X | 0.346574 | 0.346574 | 0.462098 | 0.693147 | 0.693147 | 0.346574 |

<b>TF-IDF with avoid zero division IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 0.693147 | 2.772589 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 0.693147 | 0.0 | 2.079442 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.405465 | 0.0 | 0.405465 | 0.81093 | 0.0 | 0.0 |
| 5 | E | -0.154151 | -0.154151 | -0.154151 | -0.154151 | -0.154151 | -0.154151 |
| 6 | F | 0.693147 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.493061 |
| 8 | X | 0.0 | 0.0 | 0.405465 | 1.216395 | 2.027326 | 0.0 |

<b>TF-IDF with log normalized TF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.098612 | 2.621612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.098612 | 0.0 | 2.305561 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.1736 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.675485 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 1.454647 | 1.808725 | 0.0 |

<b>TF-IDF with scaled IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 2.584963 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.584963 | 6.33985 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.584963 | 0.0 | 4.754888 | 0.0 | 0.0 | 0.0 |
| 4 | D | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.584963 | 1.584963 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.924813 |
| 8 | X | 0.0 | 0.0 | 1.0 | 3.0 | 5.0 | 0.0 |


<b>Examples for X in $d_5$</b><br/>
    
$max(d_5)=max(0,0,0,0,1,0,0,5)=5$, the highest value $w_{i,5}$ in column $d_5$.<br/><br/>  

<b>TF-IDF with weighted TF</b><br/>
$tf\_idf(X,d_5)=\frac{w_{8,5}}{\#d_5}\cdot\log(\frac{m}{\#w_8})=\frac{5}{6}\cdot\log(\frac{6}{3})=\frac{5}{6}\cdot\log(2) \approx 0.8333333333333\cdot 0.6931471805599 \approx 0.577623$<br/>

<b>TF-IDF with smoothed IDF</b><br/>
$tf\_idf(X,d_5)=w_{8,5}\cdot\log(\frac{1+m}{\#w_8})=5\cdot\log(\frac{1+6}{3})=5\cdot\log(\frac{7}{3})=5\cdot\log(2.3333333) \approx 5\cdot 0.8472978603872 \approx 4.236489$<br/>

<b>TF-IDF with double normalized TF</b><br/>
$tf\_idf(X,d_5)=(x+(1-x)\cdot \frac{w_{8,5}}{max(d_5)})\cdot\log(\frac{m}{\#w_8})=(0.5+(1-0.5)\cdot\frac{5}{5})\cdot\log(\frac{6}{3}) \approx (1)\cdot0.6931471805599\approx0.693147$<br/>

<b>TF-IDF with avoid zero division IDF</b><br/>
$tf\_idf(X, d_5)=w_{8,5}\cdot\log(\frac{m}{1+\#w_8})=5\cdot\log(\frac{6}{1+3})=5\cdot\log(1.5) \approx 5\cdot 0.4054651081081644 \approx 2.027326$<br/>

<b>TF-IDF with log normalized TF</b><br/>
$tf\_idf(X, d_5)=(1+\log(w_{8,5}))\cdot\log(\frac{m}{\#w_8})=(1+\log(5))\cdot\log(\frac{6}{3})\approx(1+1.6094379)\cdot\log(2)\approx(2.6094379)\cdot0.693147 \approx 1.808725$<br/>

<b>TF-IDF with scaled IDF</b><br/>
With the scale $\log(2)$: $tf\_idf(X,d_5)=w_{8,5}\cdot  \frac{\log(\frac{m}{\#w_8})}{\log(2)}=5\cdot \frac{\log(\frac{6}{3})}{\log(2)}=5\cdot\frac{\log(2)}{\log(2)}=5\cdot 1 = 5$
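
The six worked values for X in $d_5$ can be reproduced without pandas. This is a small check under the numbers used in the examples ($w_{8,5}=5$, $\#d_5=6$, $max(d_5)=5$, $m=6$, $\#w_8=3$); the dictionary keys are just labels chosen here.

```python
import math

# Values for term X in document d_5, taken from the worked examples
w, doc_len, w_max, m, df = 5, 6, 5, 6, 3

variants = {
    'weighted': w / doc_len * math.log(m / df),
    'smooth':   w * math.log((1 + m) / df),
    'double':   (0.5 + (1 - 0.5) * w / w_max) * math.log(m / df),
    'zero':     w * math.log(m / (1 + df)),
    'lognorm':  (1 + math.log(w)) * math.log(m / df),
    'scaled':   w * math.log(m / df) / math.log(2),
}

for name, value in variants.items():
    print(name, round(value, 6))
```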

<a id="section2_2"></a>
# Normalization

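$l^1$-norm:
$$norm_{i,j}=\frac{tfidf_{i,j}}{\sum_{i=1}^{n} |tfidf_{i,j}|}$$
$l^2$-norm: 
$$norm_{i,j}=\frac{tfidf_{i,j}}{\sqrt{\sum_{i=1}^{n} tfidf_{i,j}^2}}$$<br/>

<br/>
For the base model the absolute value in the $l^1$-norm changes nothing, because every $tfidf_{i,j}$ is a count times a non-negative IDF. It only matters for variations that can produce negative values, such as the zero-division variant (row E in its table above).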
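
Both norms can be sketched for a single document column in plain Python. The helper names are hypothetical, and returning an all-zero column unchanged is an assumption made here to avoid a division by zero. The sample column holds the base-model TF-IDF values of $d_4$ (D: $2\cdot\log(2)$, X: $3\cdot\log(2)$).

```python
import math

def l1_normalize(values):
    # divide by the sum of absolute values
    total = sum(abs(v) for v in values)
    return [v / total for v in values] if total else list(values)

def l2_normalize(values):
    # divide by the Euclidean length
    length = math.sqrt(sum(v * v for v in values))
    return [v / length for v in values] if length else list(values)

# TF-IDF column of d_4 in the order A, B, C, D, E, F, G, X
column_d4 = [0.0, 0.0, 0.0, 2 * math.log(2), 0.0, 0.0, 0.0, 3 * math.log(2)]

print([round(v, 6) for v in l1_normalize(column_d4)])
print([round(v, 6) for v in l2_normalize(column_d4)])
```

After $l^1$-normalization the absolute values of a column sum to 1; after $l^2$-normalization the column has Euclidean length 1.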

In [102]:
def normalize(tf_idf_matrix, norm='l2'):
    #reject unknown norms before doing any work
    if norm not in ('l1','l2'):
        raise Exception('unknown norm')

    m=tf_idf_matrix.shape[1]-1
    columns=['tf_idf_'+str(i) for i in range(1,m+1)]

    #start from the term column of the matrix passed in (not the global tfidf_matrix)
    norm_tf_idf_matrix=tf_idf_matrix[['t']].copy()
    for col in columns:
        if norm=='l1':
            #sum of the absolute values of the column
            length=sum(tf_idf_matrix[col].map(lambda x: math.fabs(x)))
        else:
            #square root of the sum of squares of the column
            length=math.sqrt(sum(tf_idf_matrix[col]**2))
        #avoid division by zero for documents without any weight
        if length == 0:
            length=1e-12
        norm_tf_idf_matrix[norm+'-tf_idf_'+col.split('_')[-1]]=tf_idf_matrix[col]*(1/length)
    return norm_tf_idf_matrix
        
disp(['Simple TF-IDF',tf_idf(documents_tokenized)],
     ['L1-norm TF-IDF',normalize(tf_idf(documents_tokenized), norm='l1')],
     ['L2-norm TF-IDF',normalize(tf_idf(documents_tokenized), norm='l2')],
     row=2, col=2)

<b>Simple TF-IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.098612 | 4.394449 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.098612 | 0.0 | 3.295837 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.386294 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.958797 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 2.079442 | 3.465736 | 0.0 |

<b>L1-norm TF-IDF</b>

|   | t | l1-tf_idf_1 | l1-tf_idf_2 | l1-tf_idf_3 | l1-tf_idf_4 | l1-tf_idf_5 | l1-tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 0.309953 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 0.190047 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 0.190047 | 0.0 | 0.703918 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.119906 | 0.0 | 0.148041 | 0.4 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 0.190047 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | X | 0.0 | 0.0 | 0.148041 | 0.6 | 1.0 | 0.0 |

<b>L2-norm TF-IDF</b>

|   | t | l2-tf_idf_1 | l2-tf_idf_2 | l2-tf_idf_3 | l2-tf_idf_4 | l2-tf_idf_5 | l2-tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 0.662629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 0.406289 | 0.970143 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 0.406289 | 0.0 | 0.958503 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.25634 | 0.0 | 0.201583 | 0.5547 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 0.406289 | 0.242536 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | X | 0.0 | 0.0 | 0.201583 | 0.83205 | 1.0 | 0.0 |


norm-l2: $tf\_idf(X,d_5)\approx \frac{3.46574}{\sqrt{0^2+0^2+0^2+0^2+0^2+0^2+0^2+3.46574^2}}=\frac{3.46574}{3.46574}=1$

<a id="section3"></a>
# 3. Implementations

<a id="section3_1"></a>
# Gensim

In [103]:
#initialize a clean corpus
documents=['A B C D E F',
           'B B B B E F',
           'C C C D E X',
           'X X D D E X',
           'X X X X E X',
           'G G G G E G']

#number of documents
d_n=len(documents)

#string in documents to token in documents
documents_tokenized=[document.split(' ') for document in documents]

from gensim import corpora
from gensim import models
import gensim
import numpy as np

#show version
print('Gensim version is '+gensim.__version__)

#tokenize corpus and give terms an id
dictionary = corpora.Dictionary(documents_tokenized)
gensim_tokenized = [dictionary.doc2bow(document) for document in documents_tokenized]

Gensim version is 3.2.0


Gensim uses a function called <b>wglobal</b> to calculate IDF and <b>wlocal</b> for TF.<br/>
The default wglobal takes (docfreq, totaldocs, log_base=2.0, add=0.0) and returns idf=add + np.log(float(totaldocs) / docfreq) / np.log(log_base).<br/>
The default wlocal is simply the number of appearances.<br/>

Gensim stores the raw term counts of each document as a list of (ID, count) tuples. A term with 0 appearances in a document is omitted. One line is one document. To see which term belongs to which ID we can look it up in the dictionary.
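
The IDF formula quoted above can be tried out without gensim. The function below is a stand-alone re-implementation of that formula using `math` instead of `np`; its name `idf_as_described` is hypothetical and it is not gensim's own code.

```python
import math

def idf_as_described(docfreq, totaldocs, log_base=2.0, add=0.0):
    # add + log(totaldocs / docfreq) / log(log_base), as quoted in the text
    return add + math.log(float(totaldocs) / docfreq) / math.log(log_base)

# Document frequencies in the example corpus of 6 documents: A=1, X=3, E=6
for term, docfreq in [('A', 1), ('X', 3), ('E', 6)]:
    print(term, idf_as_described(docfreq, 6))
```

With `log_base=2.0` a term that occurs in half of the documents gets an IDF of exactly 1.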

In [104]:
#dictionary.items() # -> [(0, u'A'), (2, u'C'), (1, u'B'), (4, u'E'), (3, u'D'), (5, u'X')]
gensim_tokenized, [t for t in dictionary.iteritems()]

([[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)],
  [(1, 4), (4, 1), (5, 1)],
  [(2, 3), (3, 1), (4, 1), (6, 1)],
  [(3, 2), (4, 1), (6, 3)],
  [(4, 1), (6, 5)],
  [(4, 1), (7, 5)]],
 [(7, 'G'),
  (6, 'X'),
  (0, 'A'),
  (3, 'D'),
  (5, 'F'),
  (1, 'B'),
  (2, 'C'),
  (4, 'E')])

Now it's time to call the simple tf-idf output from gensim.

In [105]:
tfidf = models.TfidfModel(gensim_tokenized, normalize=False)
for doc in tfidf[gensim_tokenized]:
    print(doc)

[(0, 2.584962500721156), (1, 1.5849625007211563), (2, 1.5849625007211563), (3, 1.0), (5, 1.5849625007211563)]
[(1, 6.339850002884625), (5, 1.5849625007211563)]
[(2, 4.754887502163469), (3, 1.0), (6, 1.0)]
[(3, 2.0), (6, 3.0)]
[(6, 5.0)]
[(7, 12.92481250360578)]
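
Each printed line is a sparse document vector of (ID, weight) tuples. As a small illustration of how such a vector maps back to terms, the sketch below expands one of them into a dense term-to-weight mapping. The helper `to_dense` is hypothetical; the ID-to-term mapping is copied from the dictionary output shown earlier.

```python
# ID -> term mapping, copied from the dictionary output above
id2term = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'X', 7: 'G'}

def to_dense(sparse_doc, id2term):
    # start with weight 0.0 for every term, then fill in the weights that are present
    dense = {term: 0.0 for term in id2term.values()}
    for term_id, weight in sparse_doc:
        dense[id2term[term_id]] = weight
    return dense

doc4 = [(3, 2.0), (6, 3.0)]   # fourth line of the output above
print(to_dense(doc4, id2term))
```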


First, let's visualize this output in a DataFrame and compare it with the scaled TF-IDF computed before.

In [106]:
def vis_gensim(model, dic):
    #model: documents as lists of (ID, weight) tuples, dic: the terms of the dictionary
    w_i=[w for w in dic]
    docs=list(model)
    tfidf_matrix=pd.DataFrame(float(0), index=np.arange(len(w_i)), columns=['w']+['tf-idf_'+str(i+1) for i in range(len(docs))])
    tfidf_matrix['w']=w_i
    for i, doc in enumerate(docs):
        for ID, z in doc:
            #look up the term of this ID in the global gensim dictionary
            w=dictionary.get(ID)
            #a boolean mask with .loc selects the row of term w
            tfidf_matrix.loc[tfidf_matrix['w']==w, 'tf-idf_'+str(i+1)]=z
    tfidf_matrix=tfidf_matrix.sort_values('w').reset_index(drop=True)
    tfidf_matrix.index+=1
    return tfidf_matrix

disp(['Simple TF-IDF from Gensim',vis_gensim(tfidf[gensim_tokenized] ,dictionary.itervalues())],
    ['TF-IDF with scaled log(2) IDF',tf_idf(documents_tokenized, scaleidf=math.log(2))],
     row=1, col=2)

<b>Simple TF-IDF from Gensim</b>

|   | w | tf-idf_1 | tf-idf_2 | tf-idf_3 | tf-idf_4 | tf-idf_5 | tf-idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 2.584963 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.584963 | 6.33985 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.584963 | 0.0 | 4.754888 | 0.0 | 0.0 | 0.0 |
| 4 | D | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.584963 | 1.584963 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.924813 |
| 8 | X | 0.0 | 0.0 | 1.0 | 3.0 | 5.0 | 0.0 |

<b>TF-IDF with scaled log(2) IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 2.584963 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.584963 | 6.33985 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.584963 | 0.0 | 4.754888 | 0.0 | 0.0 | 0.0 |
| 4 | D | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.584963 | 1.584963 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.924813 |
| 8 | X | 0.0 | 0.0 | 1.0 | 3.0 | 5.0 | 0.0 |


The same! Scaling the IDF by log(2), i.e. taking the logarithm to base 2, is the default mode of gensim.
Let's try the normalize mode.

In [107]:
tfidf = models.TfidfModel(gensim_tokenized, normalize=True)

disp(['Simple normalized TF-IDF from Gensim',vis_gensim(tfidf[gensim_tokenized] ,dictionary.itervalues())],
    ['TF-IDF with scaled log(2) IDF and l2-norm',normalize(tf_idf(documents_tokenized, scaleidf=math.log(2)), norm='l2')],
     row=1, col=2)

<b>Simple normalized TF-IDF from Gensim</b>

|   | w | tf-idf_1 | tf-idf_2 | tf-idf_3 | tf-idf_4 | tf-idf_5 | tf-idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 0.662629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 0.406289 | 0.970143 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 0.406289 | 0.0 | 0.958503 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.25634 | 0.0 | 0.201583 | 0.5547 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 0.406289 | 0.242536 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | X | 0.0 | 0.0 | 0.201583 | 0.83205 | 1.0 | 0.0 |

<b>TF-IDF with scaled log(2) IDF and l2-norm</b>

|   | t | l2-tf_idf_1 | l2-tf_idf_2 | l2-tf_idf_3 | l2-tf_idf_4 | l2-tf_idf_5 | l2-tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 0.662629 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 0.406289 | 0.970143 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 0.406289 | 0.0 | 0.958503 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.25634 | 0.0 | 0.201583 | 0.5547 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 0.406289 | 0.242536 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | X | 0.0 | 0.0 | 0.201583 | 0.83205 | 1.0 | 0.0 |


So gensim uses l2-norm by default. <br/>

If we want to use different kinds of TF and IDF with gensim we have to manipulate wglobal and wlocal.

In [108]:
tfidf.idfs

{0: 2.584962500721156,
 1: 1.5849625007211563,
 2: 1.5849625007211563,
 3: 1.0,
 4: 0.0,
 5: 1.5849625007211563,
 6: 1.0,
 7: 2.584962500721156}

In [109]:
def new_wglobal(docfreq, totaldocs, add=0.0):
    return add + np.log(float(totaldocs) / docfreq)

tfidf = models.TfidfModel(gensim_tokenized, wglobal=new_wglobal, normalize=False)

vis_gensim(tfidf[gensim_tokenized] ,dictionary.itervalues())

disp(['Manipulated TF-IDF from Gensim',vis_gensim(tfidf[gensim_tokenized] ,dictionary.itervalues())],
    ['Simple TF-IDF',tf_idf(documents_tokenized)],
     row=1, col=2)

<b>Manipulated TF-IDF from Gensim</b>

|   | w | tf-idf_1 | tf-idf_2 | tf-idf_3 | tf-idf_4 | tf-idf_5 | tf-idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.098612 | 4.394449 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.098612 | 0.0 | 3.295837 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.386294 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.958797 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 2.079442 | 3.465736 | 0.0 |

<b>Simple TF-IDF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.098612 | 4.394449 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.098612 | 0.0 | 3.295837 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.386294 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.958797 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 2.079442 | 3.465736 | 0.0 |


In this way gensim lets the user define a custom IDF function and plug it in easily.<br/>

The same is possible for TF via wlocal.
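
Before the gensim call, here is a pure-Python sketch of how a local and a global weight combine for one document. It assumes that the weight of a term is the product of wlocal(freq) and wglobal(docfreq, totaldocs), and that zero weights are dropped, as the E entries are in the gensim output shown earlier. The helper `weigh` and its `eps` threshold are hypothetical and not gensim API.

```python
import math

def wlocal(freq):
    # log-normalized term frequency
    return 1 + math.log(freq)

def wglobal(docfreq, totaldocs):
    # plain IDF with the natural logarithm
    return math.log(float(totaldocs) / docfreq)

def weigh(bow, dfs, totaldocs, eps=1e-12):
    # bow: list of (term_id, count); dfs: term_id -> document frequency
    weighted = [(term_id, wlocal(freq) * wglobal(dfs[term_id], totaldocs))
                for term_id, freq in bow]
    # drop entries whose weight is (numerically) zero
    return [(term_id, weight) for term_id, weight in weighted if abs(weight) > eps]

dfs = {4: 6, 6: 3}            # E occurs in 6 documents, X in 3
doc5 = [(4, 1), (6, 5)]       # d_5 as (ID, count) tuples
result = weigh(doc5, dfs, totaldocs=6)
print(result)
```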

In [110]:
def new_wglobal(docfreq, totaldocs, add=0.0):
    return add + np.log(float(totaldocs) / (docfreq))

def new_local(freq):
    return 1+math.log(freq)
    
tfidf = models.TfidfModel(gensim_tokenized, wglobal=new_wglobal, wlocal=new_local, normalize=False)

disp(['Manipulated TF-IDF from Gensim',vis_gensim(tfidf[gensim_tokenized] ,dictionary.itervalues())],
     ['TF-IDF with log normalized TF',tf_idf(documents_tokenized, lognorm=1)],
     row=1, col=2)

<b>Manipulated TF-IDF from Gensim</b>

|   | w | tf-idf_1 | tf-idf_2 | tf-idf_3 | tf-idf_4 | tf-idf_5 | tf-idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.098612 | 2.621612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.098612 | 0.0 | 2.305561 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.1736 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.675485 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 1.454647 | 1.808725 | 0.0 |

<b>TF-IDF with log normalized TF</b>

|   | t | tf_idf_1 | tf_idf_2 | tf_idf_3 | tf_idf_4 | tf_idf_5 | tf_idf_6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A | 1.791759 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | B | 1.098612 | 2.621612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | C | 1.098612 | 0.0 | 2.305561 | 0.0 | 0.0 | 0.0 |
| 4 | D | 0.693147 | 0.0 | 0.693147 | 1.1736 | 0.0 | 0.0 |
| 5 | E | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | F | 1.098612 | 1.098612 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | G | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.675485 |
| 8 | X | 0.0 | 0.0 | 0.693147 | 1.454647 | 1.808725 | 0.0 |


Gensim gives us a lot of options to manipulate TF-IDF, but the custom functions only receive a few input variables. This becomes a problem if we want to calculate the double normalized TF: we need $max(d_j)$, the count of the most frequent term in the document, but wlocal only sees the frequency of one term at a time, so there is no easy way to do this with Gensim.<br/>
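
To make clear why the most frequent term matters, here is a minimal sketch of the double normalized TF in plain Python. The function name double_normalized_tf and the parameter k (set to 0.5 here) are assumptions for illustration and are not part of Gensim; only terms that occur in the document receive a weight.

```python
from collections import Counter

def double_normalized_tf(tokens, k=0.5):
    # count every term of the document
    counts = Counter(tokens)
    # the count of the most frequent term is needed for every single weight
    max_count = max(counts.values())
    return {term: k + (1 - k) * count / max_count
            for term, count in counts.items()}

# third document of the example corpus
tf = double_normalized_tf('C C C D E X'.split(' '))
print(tf)
```

Every weight depends on max_count, so a function that only receives one frequency at a time cannot compute it.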

There is, however, an updated version of the TfidfModel called [tfidfmodel](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py) (last accessed: 21.01.2018). It brings a lot of new features and more built-in options to manipulate TF and IDF.


In [112]:
#save the py-file and add the path to sys
import sys
sys.path.append('/usr/local/lib/python3.5/dist-packages/gensim/models/')

#I renamed the file to tfidfmodel2.py to avoid confusion with the installed version
import tfidfmodel2
tfidf2 = tfidfmodel2.TfidfModel(gensim_tokenized, normalize=False)

#original model from the installed gensim for comparison
tfidf1 = models.TfidfModel(gensim_tokenized, normalize=False)

disp(['TFIDF gensim original',vis_gensim(tfidf1[gensim_tokenized] ,dictionary.itervalues())],
     ['TFIDF gensim updated',vis_gensim(tfidf2[gensim_tokenized] ,dictionary.itervalues())],
     row=1, col=2)

Unnamed: 0,w,tf-idf_1,tf-idf_2,tf-idf_3,tf-idf_4,tf-idf_5,tf-idf_6
1,A,2.584963,0.0,0.0,0.0,0.0,0.0
2,B,1.584963,6.33985,0.0,0.0,0.0,0.0
3,C,1.584963,0.0,4.754888,0.0,0.0,0.0
4,D,1.0,0.0,1.0,2.0,0.0,0.0
5,E,0.0,0.0,0.0,0.0,0.0,0.0
6,F,1.584963,1.584963,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,12.924813
8,X,0.0,0.0,1.0,3.0,5.0,0.0

Unnamed: 0,w,tf-idf_1,tf-idf_2,tf-idf_3,tf-idf_4,tf-idf_5,tf-idf_6
1,A,2.584963,0.0,0.0,0.0,0.0,0.0
2,B,1.584963,6.33985,0.0,0.0,0.0,0.0
3,C,1.584963,0.0,4.754888,0.0,0.0,0.0
4,D,1.0,0.0,1.0,2.0,0.0,0.0
5,E,0.0,0.0,0.0,0.0,0.0,0.0
6,F,1.584963,1.584963,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,12.924813
8,X,0.0,0.0,1.0,3.0,5.0,0.0


Let's have some fun with the new model.

The new option smartirs gives a peek into what's new:

    w_tf : str
        Term frequency weighing:
            * `n` - natural,
            * `l` - logarithm,
            * `a` - augmented,
            * `b` - boolean,
            * `L` - log average.
    w_df : str
        Document frequency weighting:
            * `n` - none,
            * `t` - idf,
            * `p` - prob idf.
    w_n : str
        Document normalization:
            * `n` - none,
            * `c` - cosine.
            
The problem with smartirs is that it is bound to updated_wglobal and updated_wlocal, so we can't pass our own TF and IDF functions when we use smartirs. However, a small workaround lets us call the updated functions directly.

In [113]:
def new_wglobal(docfreq, totaldocs, add=0.0):
    return add + np.log(float(totaldocs) / (docfreq))

def new_wlocal(tf):
    return tfidfmodel2.updated_wlocal(tf, 'a')

tfidf2 = tfidfmodel2.TfidfModel(gensim_tokenized, normalize=False, wglobal=new_wglobal, wlocal=new_wlocal)

disp(['TF-IDF with double normalized TF',tf_idf(documents_tokenized, double=0.5)],
     ['TFIDF with gensim updated augmented TF',vis_gensim(tfidf2[gensim_tokenized] ,dictionary.itervalues())],
     row=1, col=2)

Unnamed: 0,t,tf_idf_1,tf_idf_2,tf_idf_3,tf_idf_4,tf_idf_5,tf_idf_6
1,A,1.791759,0.89588,0.89588,0.89588,0.89588,0.89588
2,B,1.098612,1.098612,0.549306,0.549306,0.549306,0.549306
3,C,1.098612,0.549306,1.098612,0.549306,0.549306,0.549306
4,D,0.693147,0.346574,0.462098,0.577623,0.346574,0.346574
5,E,0.0,0.0,0.0,0.0,0.0,0.0
6,F,1.098612,0.686633,0.549306,0.549306,0.549306,0.549306
7,G,0.89588,0.89588,0.89588,0.89588,0.89588,1.791759
8,X,0.346574,0.346574,0.462098,0.693147,0.693147,0.346574

Unnamed: 0,w,tf-idf_1,tf-idf_2,tf-idf_3,tf-idf_4,tf-idf_5,tf-idf_6
1,A,1.791759,0.0,0.0,0.0,0.0,0.0
2,B,1.098612,1.098612,0.0,0.0,0.0,0.0
3,C,1.098612,0.0,1.098612,0.0,0.0,0.0
4,D,0.693147,0.0,0.462098,0.577623,0.0,0.0
5,E,0.0,0.0,0.0,0.0,0.0,0.0
6,F,1.098612,0.686633,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,1.791759
8,X,0.0,0.0,0.462098,0.693147,0.693147,0.0


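Sneaky! The advantage of the updated Gensim model is that it operates on whole vectors instead of single integers, which makes it possible to work with max functions. Note that Gensim only weights terms that occur in a document, so absent terms stay at 0, while the from-scratch version assigns them the base value of 0.5 times the IDF.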

<a id="section3_2"></a>
# Scikit-learn

In [114]:
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
#initialize a clean document
documents=['A B C D E F',
           'B B B B E F',
           'C C C D E X',
           'X X D D E X',
           'X X X X E X',
           'G G G G E G'] 

#initialize Transformer
#min_df=0 -> minimum document frequency of a word
vect = TfidfVectorizer(analyzer='word',min_df=0, max_df=10, use_idf=False,smooth_idf=False, token_pattern=r'\w+', 
                       norm=None,lowercase=False)

#sklearn returns documents as rows, while the tables above use terms as rows, so I transpose the output
def transpose(vect, w=False, documents=documents):
    idfs=vect.fit_transform(documents)
    #first row holds the terms (newer sklearn versions use get_feature_names_out() instead)
    matrix=pd.DataFrame(vect.get_feature_names()).T

    #append one row per document
    for idflist in idfs.toarray():
        matrix.loc[matrix.shape[0]]=idflist
    if w:
        #document frequency #w_i: number of documents that contain the term
        binary=matrix.iloc[1:].where(matrix == 0, 1).T
        matrix.loc[matrix.shape[0]]=binary.sum(axis=1)  
    matrix=matrix.T
    matrix.index+=1
      
    if w:
        matrix.columns=['t']+['d_'+str(i) for i in range(1,matrix.shape[1]-1)]+['#w_i']
    else:
        matrix.columns=['t']+['d_'+str(i) for i in range(1,matrix.shape[1])]
    
    return matrix

#show version
display('Sklearn version is '+sklearn.__version__)    

disp(['Scikit Count Matrix',transpose(vect, w=True,documents=documents)],
     ['Count Matrix',count_matrix],  
     row=1, col=2) 



'Sklearn version is 0.19.1'

Unnamed: 0,t,d_1,d_2,d_3,d_4,d_5,d_6,#w_i
1,A,1,0,0,0,0,0,1
2,B,1,4,0,0,0,0,2
3,C,1,0,3,0,0,0,2
4,D,1,0,1,2,0,0,3
5,E,1,1,1,1,1,1,6
6,F,1,1,0,0,0,0,2
7,G,0,0,0,0,0,5,1
8,X,0,0,1,3,5,0,3

Unnamed: 0,t,d_1,d_2,d_3,d_4,d_5,d_6,#w_i,idf
1,A,1,0,0,0,0,0,1,1.791759
2,B,1,4,0,0,0,0,2,1.098612
3,C,1,0,3,0,0,0,2,1.098612
4,D,1,0,1,2,0,0,3,0.693147
5,E,1,1,1,1,1,1,6,0.0
6,F,1,1,0,0,0,0,2,1.098612
7,G,0,0,0,0,0,5,1,1.791759
8,X,0,0,1,3,5,0,3,0.693147


In [115]:
#set idf to true
vect = TfidfVectorizer(analyzer='word',min_df=0, max_df=10, use_idf=True, token_pattern=r'\w+', 
                       norm=None,lowercase=False,sublinear_tf =False)

disp(['Scikit TF-IDF',transpose(vect,documents=documents)],
     ['smoothed, avoid zero division TF-IDF with IDF+1',tf_idf(documents_tokenized, addidf=1, smooth=1, zero=1)],
    row=1, col=2) 

Unnamed: 0,t,d_1,d_2,d_3,d_4,d_5,d_6
1,A,2.25276,0.0,0.0,0.0,0.0,0.0
2,B,1.8473,7.38919,0.0,0.0,0.0,0.0
3,C,1.8473,0.0,5.54189,0.0,0.0,0.0
4,D,1.55962,0.0,1.55962,3.11923,0.0,0.0
5,E,1.0,1.0,1.0,1.0,1.0,1.0
6,F,1.8473,1.8473,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,11.2638
8,X,0.0,0.0,1.55962,4.67885,7.79808,0.0

Unnamed: 0,t,tf_idf_1,tf_idf_2,tf_idf_3,tf_idf_4,tf_idf_5,tf_idf_6
1,A,2.252763,0.0,0.0,0.0,0.0,0.0
2,B,1.847298,7.389191,0.0,0.0,0.0,0.0
3,C,1.847298,0.0,5.541894,0.0,0.0,0.0
4,D,1.559616,0.0,1.559616,3.119232,0.0,0.0
5,E,1.0,1.0,1.0,1.0,1.0,1.0
6,F,1.847298,1.847298,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,11.263815
8,X,0.0,0.0,1.559616,4.678847,7.798079,0.0


Sklearn's default TF-IDF is $w_{i,j}\cdot(1+\log(\frac{1+m}{1+\#w_i}))$, which it calls the smooth IDF version. If we set smooth_idf to False, the function becomes $w_{i,j}\cdot(1+\log(\frac{m}{\#w_i}))$.
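
Both IDF variants can be reproduced by hand. The sketch below uses only the math module; the helper name sklearn_idf is made up for this example and is not a scikit-learn function. It evaluates both variants for term A, which occurs in 1 of the 6 documents.

```python
import math

def sklearn_idf(docfreq, totaldocs, smooth=True):
    if smooth:
        # smooth_idf=True: 1 + ln((1 + m) / (1 + df))
        return 1 + math.log((1 + totaldocs) / (1 + docfreq))
    # smooth_idf=False: 1 + ln(m / df)
    return 1 + math.log(totaldocs / docfreq)

smooth_a = sklearn_idf(1, 6)
plain_a = sklearn_idf(1, 6, smooth=False)
print(round(smooth_a, 6), round(plain_a, 6))
```

Since A occurs once in the first document, these two values should match the entries for A in the first column of the smoothed and unsmoothed Scikit tables.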

In [116]:
#set smooth to false
vect = TfidfVectorizer(analyzer='word',min_df=0, max_df=10, use_idf=True, smooth_idf=False, token_pattern=r'\w+', 
                       norm=None,lowercase=False,sublinear_tf =False)

disp(['Scikit TF-IDF',transpose(vect,documents=documents)],
     ['TF-IDF with IDF+1',tf_idf(documents_tokenized, addidf=1)],
    row=1, col=2) 

Unnamed: 0,t,d_1,d_2,d_3,d_4,d_5,d_6
1,A,2.79176,0.0,0.0,0.0,0.0,0.0
2,B,2.09861,8.39445,0.0,0.0,0.0,0.0
3,C,2.09861,0.0,6.29584,0.0,0.0,0.0
4,D,1.69315,0.0,1.69315,3.38629,0.0,0.0
5,E,1.0,1.0,1.0,1.0,1.0,1.0
6,F,2.09861,2.09861,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,13.9588
8,X,0.0,0.0,1.69315,5.07944,8.46574,0.0

Unnamed: 0,t,tf_idf_1,tf_idf_2,tf_idf_3,tf_idf_4,tf_idf_5,tf_idf_6
1,A,2.791759,0.0,0.0,0.0,0.0,0.0
2,B,2.098612,8.394449,0.0,0.0,0.0,0.0
3,C,2.098612,0.0,6.295837,0.0,0.0,0.0
4,D,1.693147,0.0,1.693147,3.386294,0.0,0.0
5,E,1.0,1.0,1.0,1.0,1.0,1.0
6,F,2.098612,2.098612,0.0,0.0,0.0,0.0
7,G,0.0,0.0,0.0,0.0,0.0,13.958797
8,X,0.0,0.0,1.693147,5.079442,8.465736,0.0


While sklearn's TF-IDF function is rather sparse and doesn't offer many options, it brings one new feature: a threshold on the document frequency with min_df and max_df. With it we can set a minimum or maximum $\#w_i$, for example only allow terms with $2\leq\#w_i\leq3$.
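
The effect of such a threshold can be sketched without sklearn. The helper filter_by_document_frequency is a hypothetical name used only here; it treats min_df and max_df as absolute document counts, which is how sklearn interprets integer values.

```python
documents = ['A B C D E F',
             'B B B B E F',
             'C C C D E X',
             'X X D D E X',
             'X X X X E X',
             'G G G G E G']

def filter_by_document_frequency(docs, min_df, max_df):
    # document frequency: in how many documents does each term occur?
    docfreq = {}
    for doc in docs:
        for term in set(doc.split(' ')):
            docfreq[term] = docfreq.get(term, 0) + 1
    # keep only terms inside the inclusive range [min_df, max_df]
    return sorted(t for t, df in docfreq.items() if min_df <= df <= max_df)

kept = filter_by_document_frequency(documents, min_df=2, max_df=3)
print(kept)
```

The terms A and G (one document each) and E (all six documents) fall outside the range and are dropped.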

In [117]:
#set thresholds
vect = TfidfVectorizer(analyzer='word',min_df=0, max_df=10, use_idf=False, smooth_idf=False, token_pattern=r'\w+', 
                       norm=None,lowercase=False,sublinear_tf =False)

vect2 = TfidfVectorizer(analyzer='word',min_df=2, max_df=3, use_idf=False, smooth_idf=False, token_pattern=r'\w+', 
                       norm=None,lowercase=False,sublinear_tf =False)

disp(['Scikit TF',transpose(vect, w=True,documents=documents)],
    ['Scikit TF with threshold 2<=#w_i<=3',transpose(vect2, w=True,documents=documents)],
    row=1, col=2) 

Unnamed: 0,t,d_1,d_2,d_3,d_4,d_5,d_6,#w_i
1,A,1,0,0,0,0,0,1
2,B,1,4,0,0,0,0,2
3,C,1,0,3,0,0,0,2
4,D,1,0,1,2,0,0,3
5,E,1,1,1,1,1,1,6
6,F,1,1,0,0,0,0,2
7,G,0,0,0,0,0,5,1
8,X,0,0,1,3,5,0,3

Unnamed: 0,t,d_1,d_2,d_3,d_4,d_5,d_6,#w_i
1,B,1,4,0,0,0,0,2
2,C,1,0,3,0,0,0,2
3,D,1,0,1,2,0,0,3
4,F,1,1,0,0,0,0,2
5,X,0,0,1,3,5,0,3


<a id="section3_3"></a>
# Transform Gensim $\leftrightarrow$ Scikit

Gensim does not sort its tokens, while the tokens of Scikit are sorted alphabetically. So if we want to use the same dictionary for both, we need to sort the Gensim dictionary.
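
The target of the sorting step is an alphabetical token-to-id mapping. As a minimal sketch in plain Python (the variable names vocabulary and token2id are chosen here for illustration and do not refer to Gensim or Scikit attributes):

```python
documents = ['A B C D E F',
             'B B B B E F',
             'C C C D E X',
             'X X D D E X',
             'X X X X E X',
             'G G G G E G']
documents_tokenized = [d.split(' ') for d in documents]

# collect the vocabulary and assign the ids in alphabetical order
vocabulary = sorted({token for doc in documents_tokenized for token in doc})
token2id = {token: idx for idx, token in enumerate(vocabulary)}
print(token2id)
```

With this ordering G gets id 6 and X gets id 7, which is what the sorted Gensim dictionary below should contain as well.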

In [1]:
import numpy as np
documents=['A B C D E F',
           'B B B B E F',
           'C C C D E X',
           'X X D D E X',
           'X X X X E X',
           'G G G G E G'] 
documents_tokenized=[d.split(' ') for d in documents]

from gensim import corpora
dictionary = corpora.Dictionary(documents_tokenized)
[token for token in dictionary.iteritems()]
#Will sort tokens by appearance. 7->G while 6->X.

[(6, 'X'),
 (0, 'A'),
 (2, 'C'),
 (1, 'B'),
 (5, 'F'),
 (4, 'E'),
 (7, 'G'),
 (3, 'D')]

In [2]:
#Will sort alphabetically
sort=sorted([token for token in dictionary.itervalues()])
dictionary = corpora.Dictionary([sort,sort[:1]])
[token for token in dictionary.iteritems()]

[(7, 'X'),
 (0, 'A'),
 (2, 'C'),
 (1, 'B'),
 (5, 'F'),
 (4, 'E'),
 (6, 'G'),
 (3, 'D')]

In [3]:
#Sklearn to Gensim
from sklearn.feature_extraction.text import TfidfVectorizer
vect= TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)
idfs=vect.fit_transform(documents_tokenized)

from gensim import matutils
gens=matutils.Sparse2Corpus(idfs.transpose())
for d in gens:
    print(d)

[(0, 0.5203245000969775), (1, 0.4266735334246792), (2, 0.4266735334246792), (3, 0.36022711512470473), (4, 0.2309717033587892), (5, 0.4266735334246792)]
[(1, 0.9618875969223704), (4, 0.130174946004749), (5, 0.2404718992305926)]
[(2, 0.9163298657008804), (3, 0.25787621226827795), (4, 0.16534598730219802), (7, 0.25787621226827795)]
[(3, 0.5461318819721838), (4, 0.17508539160633224), (7, 0.8191978229582756)]
[(4, 0.12719513620744327), (7, 0.9918777128886251)]
[(4, 0.08843204774441836), (6, 0.9960822119342002)]


In [4]:
#Gensim to Sklearn
from gensim import models

#sklearn default idf
def new_wglobal(docfreq, totaldocs):
    return 1 + np.log(float(1+totaldocs) /(1+ docfreq))
 
gensim_tokenized = [dictionary.doc2bow(document) for document in documents_tokenized]
model_tfidf = models.TfidfModel(gensim_tokenized, wglobal=new_wglobal)
tfidf  = model_tfidf[gensim_tokenized]
print(matutils.corpus2csc(tfidf).toarray().transpose())

[[0.5203245  0.42667353 0.42667353 0.36022712 0.2309717  0.42667353
  0.         0.        ]
 [0.         0.9618876  0.         0.         0.13017495 0.2404719
  0.         0.        ]
 [0.         0.         0.91632987 0.25787621 0.16534599 0.
  0.         0.25787621]
 [0.         0.         0.         0.54613188 0.17508539 0.
  0.         0.81919782]
 [0.         0.         0.         0.         0.12719514 0.
  0.         0.99187771]
 [0.         0.         0.         0.         0.08843205 0.
  0.99608221 0.        ]]
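
Both outputs agree because Scikit (norm='l2') and Gensim (normalize=True) scale every document vector to unit length by default. The following sketch recomputes the first document by hand with the smoothed IDF; the variable names are made up for this example and the document frequencies are taken from the count matrix above.

```python
import math

m = 6
# document frequencies of the terms A-F in the example corpus
docfreq = {'A': 1, 'B': 2, 'C': 2, 'D': 3, 'E': 6, 'F': 2}
# every term occurs exactly once in the first document 'A B C D E F'
counts = {term: 1 for term in docfreq}

# smoothed idf as in sklearn's default: 1 + ln((1 + m) / (1 + df))
weights = {t: counts[t] * (1 + math.log((1 + m) / (1 + docfreq[t])))
           for t in counts}
# divide by the Euclidean length of the document vector
norm = math.sqrt(sum(w * w for w in weights.values()))
normalized = {t: w / norm for t, w in weights.items()}
print(normalized)
```

The values for A and E should correspond to the first and fifth entry of the first row in both outputs.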


<a id="section4"></a>
# 4. Accuracy

For testing the performance of different TF-IDF approaches I will use supervised learning classifiers. For this I am going to use Gensim to extract the TF-IDFs and Sklearn to test them in a pipeline.

<h2>To be continued....</h2>