The aim of the program is to take three inputs, namely the document name, the word collection size K, and the frequency threshold F (the percentage of documents in which a set of words must appear to be deemed frequent), and to return the itemsets of size K that appear in at least F percent of the documents.

In [1]:
import pandas as pd

The pandas package is used to handle data frames.
The efficient-apriori package is used to perform market basket analysis.

In [2]:
from efficient_apriori import apriori
from efficient_apriori import itemsets_from_transactions

In [3]:
import time

In [4]:
data = pd.DataFrame(columns=["doc","K","F","time","no of entries","list"])

We created a data frame named data with document name, word collection size K, frequency F, time taken, the number of entries, and the list of K-itemsets.

So, in this problem, the items are words and transactions/baskets are the documents.
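
To make the threshold concrete, here is a minimal sketch of how the support of a set of words could be computed by hand. The toy corpus and the helper name `support` are made up for illustration and are not part of the notebook or of efficient-apriori.

```python
# Toy corpus (made up for illustration): each document is a basket of words.
documents = [
    ["bush", "kerry", "poll"],
    ["bush", "kerry", "war"],
    ["kerry", "poll"],
    ["bush", "war"],
]

def support(itemset, transactions):
    """Return the fraction of transactions that contain every item of itemset."""
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted.issubset(t))
    return hits / len(transactions)

pair_support = support(("bush", "kerry"), documents)
print(pair_support)  # 2 of the 4 documents contain both words

# With F = 50, the pair counts as frequent because its support is at least F/100.
is_frequent = pair_support >= 50 / 100
```

The apriori algorithm avoids evaluating every possible word set this way by only extending sets that are already frequent.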

In [5]:
def itemset(doc,K,F):
    
    global data
    start=time.process_time()
    
    df=pd.read_csv("docword."+doc+".txt",sep=" ",names=["docid","wordid","countw"]) 
    #df stores the docword file
    df=df[3:]
    #we don't need the three header lines of docword
    
    dfv=pd.read_csv("vocab."+doc+".txt",sep=" ",names=["word"],na_filter=False)
    #dfv stores the vocab file
    dfv.insert(0,'wordid',range(1,1+len(dfv)))
                   
    '''The vocab file lists every word in the collection, one per line,
    and the word id of a word is the number of the row in which it appears.
    The vocab file has no explicit wordid column, so we create one.'''
    
    '''While reading the vocab file, na_filter=False is used because some data sets
    contain the word "null", which pandas would otherwise parse as a missing value (NaN).
    With na_filter=False, such a word is kept as the string "null".'''
                   
    df=(df.merge(dfv,left_on='wordid',right_on='wordid').reindex(columns=["docid","word","countw"]))
    #The docword dataframe and the vocab dataframe are merged on the column wordid

    df=df.drop('countw', axis=1)
    #The word counts are not needed here, so we drop the column
                   
    df=df.groupby('docid').word.apply(list)
    '''We then group the words by document and save each group as a list.
    Each entry of the resulting Series is a docid together with the list of words contained in that document.'''
                   
    dfl=df.tolist() 
    #This is then converted to a list of lists, as itemsets_from_transactions takes a list of lists as input.
                   
    itemsets=itemsets_from_transactions(dfl, min_support=F/100,max_length=K)
    '''itemsets_from_transactions takes the list of lists, min_support and max_length,
    runs the apriori algorithm, and finds all itemsets of size up to K with support at least F/100.
    It returns a tuple whose first element, itemsets[0], is a dictionary of dictionaries.
    The keys of the outer dictionary are the itemset lengths.
    Each value is the dictionary for that length, whose keys are the itemsets (tuples of words)
    and whose values are the number of transactions each itemset appears in.'''
    
    if len(itemsets[0].keys())==K:
      x={"doc":doc,"K":K,"F":F,"time":time.process_time()-start,"no of entries":len(list(itemsets[0][K].keys())),"list":list(itemsets[0][K].keys())}
    else:
      x={"doc":doc,"K":K,"F":F,"time":time.process_time()-start,"no of entries":0,"list":"NA"}  
                   
    '''The outer dictionary has one key per itemset length that occurs, from 1 up to at most K.
    If it has exactly K keys, itemsets of length K exist and we record them;
    otherwise we record that no such itemsets exist.'''
                   
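    data=pd.concat([data,pd.DataFrame([x])],ignore_index=True)
    #the new row is added with pd.concat, as DataFrame.append is not available in pandas 2.0 and later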
                   
    with open('output.txt', 'w') as f: 
        for item in itemsets[0].get(K,{}): 
            f.write(str(item)) 
            f.write("\n")
    #saving the K-itemsets in a txt file (the file is left empty if there are none)
                   
    print("The time taken is ",time.process_time()-start) 
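
The preprocessing inside itemset (attach a word to every wordid, then collect the words of each document) can be sketched without pandas as follows. This is only an illustration under assumed inputs: the miniature `vocab` and `docword_rows` are invented, and `build_transactions` is a hypothetical helper, not a function used by the notebook.

```python
from collections import defaultdict

# Hypothetical miniature inputs: vocab lines in file order, and (docid, wordid) pairs
# as they would appear in the docword file after its three header lines.
vocab = ["bush", "kerry", "null", "poll"]
docword_rows = [(1, 1), (1, 2), (2, 2), (2, 3), (2, 4), (3, 1)]

def build_transactions(rows, words):
    """Group words by document; wordid n refers to the n-th vocab line (1-based)."""
    by_doc = defaultdict(list)
    for docid, wordid in rows:
        by_doc[docid].append(words[wordid - 1])
    # One list of words per document, ordered by docid.
    return [by_doc[d] for d in sorted(by_doc)]

transactions = build_transactions(docword_rows, vocab)
print(transactions)
```

The result is a list of lists with one inner list per document, which is the shape passed to itemsets_from_transactions above. Note that the word "null" stays an ordinary string here.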


In [6]:
itemset("kos",4,15) 

The time taken is  4.328125
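
To see how the result of such a call is organised, the sketch below uses a hand-written dictionary in the same shape as itemsets[0] (itemset length, then itemset, then count). The words and counts are made up, and `itemsets_of_size` is a hypothetical helper introduced only to show a lookup that does not fail when no itemsets of the requested size exist.

```python
# Hand-written stand-in for itemsets[0]; the counts are made up for illustration.
large_itemsets = {
    1: {("bush",): 3, ("kerry",): 3, ("poll",): 2},
    2: {("bush", "kerry"): 2, ("kerry", "poll"): 2},
}

def itemsets_of_size(result, k):
    """Return the itemsets of length k, or an empty list if there are none."""
    return sorted(result.get(k, {}))

print(itemsets_of_size(large_itemsets, 2))  # the two pairs
print(itemsets_of_size(large_itemsets, 3))  # nothing of length 3 was found
```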




In [7]:
with open('output.txt', 'r') as f: 
    contents=f.read()
    print(contents)

('bush', 'democratic', 'general', 'kerry')
('bush', 'general', 'kerry', 'poll')
('bush', 'general', 'kerry', 'war')
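
Since each line of output.txt is the printed form of a Python tuple, the file can be read back into tuples for further analysis. The following is a small sketch that assumes the three lines shown above; `parse_itemsets` is a hypothetical helper, and the text is embedded as a string so that the example does not depend on the file being present.

```python
import ast

# The three lines printed above, as they appear in output.txt.
contents = """('bush', 'democratic', 'general', 'kerry')
('bush', 'general', 'kerry', 'poll')
('bush', 'general', 'kerry', 'war')
"""

def parse_itemsets(text):
    """Turn each non-empty line back into a tuple of words."""
    return [ast.literal_eval(line) for line in text.splitlines() if line.strip()]

itemsets_k = parse_itemsets(contents)

# Words shared by all three 4-itemsets.
common = set(itemsets_k[0]).intersection(*itemsets_k[1:])
print(sorted(common))
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for reading such a file.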

