## TF-IDF App

#### I've built my own TF-IDF app to measure the importance of a word within a document of a given corpus. Check it out below:

#### Mike Bourne, October 2020

<br>
<br>

---------------------------

Theory

The <b>inverse document frequency (IDF)</b> is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient).

$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D: t \in d\}|}$

with

$N$: total number of documents in the corpus, $N=|D|$

$|\{d \in D: t \in d\}|$: number of documents where the term $t$ appears
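
As a minimal sketch of the formula above, the snippet below computes the IDF of a few words in a toy corpus. The corpus contents and the helper name `idf` are made up for illustration, and the natural logarithm is assumed (the same `math.log` call the code further down uses).

```python
import math

# Hypothetical toy corpus: each document is represented by its set of unique words
corpus = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
]

def idf(term, documents):
    """Natural-log IDF: log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)

print(round(idf("the", corpus), 2))  # in all 3 documents -> 0.0
print(round(idf("cat", corpus), 2))  # in 2 of 3 documents -> 0.41
print(round(idf("dog", corpus), 2))  # in 1 of 3 documents -> 1.1
```

A word that occurs in every document gets an IDF of zero, so it contributes nothing to the final score however often it appears.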

<br>

The <b>term frequency (TF)</b> is the number of times a word ($t$) appears in a given document ($d$):  

$\mathrm{tf}(t,d)$
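
A raw count like this can be sketched with `collections.Counter` from the standard library. The example sentence is hypothetical, and the notebook's own code below builds the same kind of count with a plain dictionary instead.

```python
from collections import Counter

# Hypothetical document, already lower-cased and split into words
document = "the cat sat on the mat".split()

# Counter maps each word to its raw count, i.e. tf(t, d)
tf = Counter(document)

print(tf["the"])  # 2
print(tf["cat"])  # 1
print(tf["dog"])  # 0 (Counter returns 0 for words it has not seen)
```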

<br>

Thus, the <b>TF-IDF</b> is simply the number of times a word appears in a given document, weighted by how rare that word is across the documents of the corpus. It is given by:

${\displaystyle \mathrm {tfidf} (t,d,D)=\mathrm {tf} (t,d)\cdot \mathrm {idf} (t,D)}$
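
To make the product concrete, here is a small self-contained sketch that evaluates it on a toy corpus. The file names, document contents and the helper `tfidf` are all hypothetical, and the natural logarithm is assumed for the IDF term.

```python
import math
from collections import Counter

# Hypothetical toy corpus of three tokenized documents
docs = {
    "a.txt": "the cat sat on the mat".split(),
    "b.txt": "the dog sat".split(),
    "c.txt": "the cat and the cat".split(),
}
n_docs = len(docs)

# Document frequency: the number of documents that contain each word
doc_freq = Counter(word for words in docs.values() for word in set(words))

def tfidf(term, doc_name):
    """tf(t, d) * idf(t, D), with a raw count for tf and a natural-log idf."""
    tf = docs[doc_name].count(term)
    if tf == 0:
        return 0.0  # the term is absent from this document
    return tf * math.log(n_docs / doc_freq[term])

print(round(tfidf("cat", "c.txt"), 2))  # 2 * log(3/2) -> 0.81
print(round(tfidf("dog", "b.txt"), 2))  # 1 * log(3/1) -> 1.1
print(round(tfidf("the", "a.txt"), 2))  # 2 * log(3/3) -> 0.0
```

Note that "the" scores zero despite being the most frequent word in `a.txt`, because it appears in every document.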

---------------------------



In [348]:
#a function to return the unique words, or the word frequencies, of an input file
def unique_words(input_file, flag):

    #open file and read all contents (the with-block closes the file automatically)
    with open(input_file, 'r') as file:
        text_all = file.read()

    #split into words, make lower case, strip leading/trailing commas and periods
    words = [word.lower().strip(',').strip('.') for word in text_all.split()]

    #create dict of the frequency of each word
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1

    #return the set of unique words when flag == 0
    if flag == 0:
        return set(words)
    #otherwise return the frequency of each word in the file (as a dictionary)
    return word_freq

In [349]:
import math
import os

#set up a dictionary counting the number of files each word appears in
words_freq = {}

#get list of all files in the current directory
list_files = os.listdir()

no_files = 0
for file in list_files:

    #if file name ends in '.txt' then treat it as a text file
    if file.endswith('.txt'):
        no_files += 1

        #use function to return the set of unique words within that text file
        file_words = unique_words(file, 0)

        #go through each unique word and +1 to its occurrence in the words_freq dictionary
        for word in file_words:
            words_freq[word] = words_freq.get(word, 0) + 1

#calculate IDF (words_freq now holds the IDF of each word rather than a count)
for word in words_freq:
    words_freq[word] = math.log(no_files / words_freq[word])

#again loop through each file
for file in list_files:
    if file.endswith('.txt'):

        #use function to return a dictionary containing the frequency of each word
        word_freq = unique_words(file, 1)

        #calculate TF-IDF for each word (using a dict comprehension), then sort descending
        tf_idf = {word: word_freq[word] * words_freq[word] for word in word_freq}
        sorted_tf_idf = sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)

        #print output to screen (top 5 in each file)
        for word, score in sorted_tf_idf[:5]:
            print(file, ' ', round(score, 2), word)

text3.txt   0.0 it
text3.txt   0.0 has
text3.txt   0.0 a
text3.txt   0.0 number
text3.txt   0.0 of
text2.txt   5.49 bourne
text2.txt   1.22 files
text2.txt   1.1 3.20
text2.txt   1.1 i
text2.txt   1.1 comparing
text.txt   1.1 this
text.txt   1.1 text
text.txt   1.1 file
text.txt   1.1 contains
text.txt   1.1 3.16
