## TF-IDF App

#### I've built my own TF-IDF app to measure the importance of a word within a document of a given corpus. Check it out below:

#### Mike Bourne, October 2020

<br>
<br>

---------------------------

#### Theory

The <b>inverse document frequency (IDF)</b> measures how much information a word provides, i.e., whether it is common or rare across all documents. It is the logarithm of the inverse fraction of documents that contain the word: divide the total number of documents by the number of documents containing the term, then take the logarithm of that quotient.

$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|}$

with

$N$: total number of documents in the corpus, $N=|D|$

$|\{d \in D: t \in d\}|$: number of documents in which the term $t$ appears
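As a quick sanity check on the formula, here is a minimal sketch that evaluates the IDF for a corpus of four documents, using the natural logarithm (which is what `math.log` computes in the code further down). The helper name `idf` and the document counts are illustrative assumptions, not part of the app.

```python
import math

def idf(n_docs, n_docs_with_term):
    """Inverse document frequency with a natural logarithm (hypothetical helper)."""
    return math.log(n_docs / n_docs_with_term)

# A corpus of N = 4 documents (illustrative numbers):
# print the IDF of a word found in 1, 2, 3 or all 4 of them
for n_containing in (1, 2, 3, 4):
    print(n_containing, round(idf(4, n_containing), 6))
```

A word found in every document gets an IDF of 0, while a word found in only one of the four gets log 4 ≈ 1.386, so rarer words carry more weight.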

<br>

The <b>term frequency (TF)</b> is the number of times a word $t$ appears in a given document $d$:

$\mathrm{tf}(t,d)$

<br>

Thus, the <b>TF-IDF</b> is the number of times a word appears in a given document, weighted by the inverse document frequency of that word across the corpus, so words that appear in many documents are weighted down. It is given by:

${\displaystyle \mathrm {tfidf} (t,d,D)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t,D)}$
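
To make the product concrete, the following sketch computes TF-IDF for a tiny in-memory corpus. The three example sentences and the function name `tf_idf` are made up for illustration, and the tokenisation is a plain lower-cased whitespace split, so treat it as a sketch rather than the app itself.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: tf-idf} dict per document (illustrative sketch)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)

    # document frequency: number of documents containing each word
    doc_freq = Counter()
    for words in tokenised:
        doc_freq.update(set(words))

    scores = []
    for words in tokenised:
        term_freq = Counter(words)  # raw count of each word in this document
        scores.append({w: tf * math.log(n_docs / doc_freq[w])
                       for w, tf in term_freq.items()})
    return scores

# made-up corpus of three short documents
docs = ["the cat sat", "the dog sat", "the cat saw the cat"]
scores = tf_idf(docs)
print(scores[2])
```

In the third document "the" occurs twice but scores 0 because it appears in every document, whereas "saw" occurs once and scores log 3 because no other document contains it.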

---------------------------



In [2]:
#a function to return either:
# (i) a unique set of words in a file (flag = 0) or 
# (ii) the frequency of each word in a file (flag = 1)
def unique_words(input_file, flag=0):

    #open file and read all of its contents (the with block closes it)
    with open(input_file, 'r') as file:
        text_all = file.read()

    #split into words, make lower case, strip leading/trailing commas and full stops
    words = [word.lower().strip(',').strip('.') for word in text_all.split()]

    #create dict of the frequency of each word
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1

    #return the set of unique words when flag == 0
    if flag == 0:
        return set(words)
    #otherwise return the frequency of each word in the file (as a dictionary)
    return word_freq
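
For comparison, here is a hedged sketch of the same word count written with `collections.Counter`. The name `count_words` and the throwaway temporary file are assumptions for the demo; the tokenisation (lower-case, then strip commas and full stops from both ends) mirrors the function above.

```python
import os
import tempfile
from collections import Counter

def count_words(path):
    """Return a Counter of the lower-cased words in the file at `path`."""
    with open(path, "r") as handle:
        words = [w.lower().strip(",").strip(".") for w in handle.read().split()]
    return Counter(words)

# demo on a throwaway file (contents are made up)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("The test text. The test, again.")
    demo_path = tmp.name

counts = count_words(demo_path)
os.remove(demo_path)

print(counts["the"], counts["test"])  # word frequencies
print(sorted(counts))                 # the unique words are just the keys
```

Because the keys of the counter are the unique words, a single return value covers both cases that the `flag` argument switches between.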

In [114]:
#set up a frequency count for words in a dictionary 
corpus_word_freq = {}

######
# First loop to find all the unique words and the IDF
######
#get list of all files in the current directory
import os
list_files = os.listdir()

no_files = 0
for file in list_files:
    
    #if file ends in '.txt' then must be a text file
    if file.endswith('.txt'):
        no_files += 1
        
        # use function to return the set of unique words within that text file
        file_words = unique_words(file,0)
        
        #go through each unique word and add 1 to its document count in corpus_word_freq
        for word in file_words:
            corpus_word_freq[word] = corpus_word_freq.get(word,0) + 1 

#calculate IDF (overwrites each document count in corpus_word_freq with its IDF value)
import math
for word in corpus_word_freq:
    corpus_word_freq[word] = math.log(no_files/(corpus_word_freq[word]))

#create numpy array for storing all the TF-IDFs (one row per file, one column per word)
import numpy as np
all_df_idfs = np.zeros((no_files,len(corpus_word_freq)))
               
######
# Second loop to calculate the TF-IDF for each word in each file
######
no_files = 0
file_name_list = []
for file in list_files:
    
    #if file ends in '.txt' then must be a text file
    if file.endswith('.txt'):
        file_name_list.append(file)
        
        # use function to return a dictionary containing freq of each word
        word_freq = unique_words(file,1)

        # calculate the TF-IDF of each word in this file (dict comprehension)
        df_idf = { word: word_freq[word]*corpus_word_freq[word] for word in word_freq}
                       
        # put data into the np array called all_df_idfs
        for count, word in enumerate(corpus_word_freq):
            all_df_idfs[no_files][count] = df_idf.get(word,0)
        
        no_files +=1
        
# display as a pandas dataframe
import pandas as pd
df = pd.DataFrame(all_df_idfs)
df.columns = list(corpus_word_freq)
df.index = file_name_list
df

| | tranining | course | bourne | used | test | is | for | be | text | of | ... | to | contains | exercise | in | the | i | with | 3.20 | other | comparing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text4.txt | 0.287682 | 0.287682 | 0.693147 | 0.287682 | 0.287682 | 0.575364 | 0.287682 | 0.287682 | 0.693147 | 0.0 | ... | 0.575364 | 0.693147 | 0.287682 | 0.287682 | 0.287682 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| text3.txt | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| text2.txt | 0.287682 | 0.287682 | 3.465736 | 0.287682 | 0.287682 | 0.287682 | 0.287682 | 0.287682 | 0.0 | 0.0 | ... | 0.575364 | 0.0 | 0.287682 | 0.287682 | 0.287682 | 1.386294 | 1.386294 | 1.386294 | 1.386294 | 1.386294 |
| text.txt | 0.287682 | 0.287682 | 0.0 | 0.287682 | 0.287682 | 0.575364 | 0.287682 | 0.287682 | 0.693147 | 0.0 | ... | 0.575364 | 0.693147 | 0.287682 | 0.287682 | 0.287682 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [115]:
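# import the pairwise-distance helpers from scipy
from scipy.spatial.distance import squareform, pdist
# Euclidean distance between every pair of rows (files), returned in condensed form
pdist(df)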

array([2.20657687, 4.49390025, 0.69314718, 4.92633097, 2.09488149,
       4.95168825])
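
`pdist` returns the distances in condensed form: one value per unordered pair of rows, ordered (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) for four rows. Below is a plain-Python sketch of the same calculation, using `math.dist` (Python 3.8+) on made-up 2-D vectors rather than the real TF-IDF table; the function name `condensed_euclidean` is hypothetical.

```python
import math
from itertools import combinations

def condensed_euclidean(rows):
    """Euclidean distance for every pair of rows, in pdist's pair order."""
    return [math.dist(a, b) for a, b in combinations(rows, 2)]

# made-up 2-D vectors standing in for the rows of the TF-IDF table
rows = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (3.0, 0.0)]
pairs = list(combinations(range(len(rows)), 2))
distances = condensed_euclidean(rows)

# print each row pair next to its distance
for pair, dist in zip(pairs, distances):
    print(pair, dist)
```

Four rows give 4 × 3 / 2 = 6 distances, which is why the array above has six entries.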

In [116]:
# pass the tf_idf table (df) into pdist, turn the result into a square-form matrix and make a DataFrame out of it
output = pd.DataFrame(squareform(pdist(df, 'euclidean')))
output.index = file_name_list
output.columns = file_name_list
output

| | text4.txt | text3.txt | text2.txt | text.txt |
|---|---|---|---|---|
| text4.txt | 0.0 | 2.206577 | 4.4939 | 0.693147 |
| text3.txt | 2.206577 | 0.0 | 4.926331 | 2.094881 |
| text2.txt | 4.4939 | 4.926331 | 0.0 | 4.951688 |
| text.txt | 0.693147 | 2.094881 | 4.951688 | 0.0 |
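
The conversion that `squareform` performs can be sketched in plain Python: the condensed values fill the upper triangle row by row and are mirrored into the lower triangle, leaving zeros on the diagonal. The function name `to_square` is hypothetical; the input values are the `pdist` output shown earlier.

```python
from itertools import combinations

def to_square(condensed, n):
    """Expand n*(n-1)/2 condensed distances into a symmetric n x n matrix."""
    square = [[0.0] * n for _ in range(n)]
    for (i, j), dist in zip(combinations(range(n), 2), condensed):
        square[i][j] = dist
        square[j][i] = dist  # distances are symmetric
    return square

# condensed distances printed by pdist(df) earlier in the notebook
condensed = [2.20657687, 4.49390025, 0.69314718,
             4.92633097, 2.09488149, 4.95168825]
square = to_square(condensed, 4)
for row in square:
    print(row)
```

Reading the matrix this way, the smallest off-diagonal entry (0.693147, between text4.txt and text.txt) marks the two most similar files.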
