# Lab 3: Information Retrieval

Student: John Wu

__Summary__

Each document in the input file is converted to lower case and tokenized with NLTK, and an inverted file of the corpus is built from the resulting tokens. The program then calculates the IDF of each term as well as the vector length of each document in the corpus.

Queries are processed the same way. Each query is then scored against every document in which at least one of its terms appears. A cosine similarity score is calculated for each of these documents, and the 50 most similar documents are written out.

Generally, the inverted file is about 1/3 to 1/2 of the size of the original corpus. Building the inverted file takes much longer than querying it. Therefore, once the inverted file is built, queries can be answered very quickly.

In [1]:
import sys, re, nltk, time, math
from collections import Counter
from operator import itemgetter

## (a) Build in-memory inverted file

First, we must build a tokenizer that can split text into tokens to be processed by the program. 

In [2]:
def tokenize(txt): # return lower-cased txt as tokenized by NLTK
    ''' Tokenize a string of text and return a list of string tokens.
    - The tokenizer is an extension on Treebank tokenization
    - The tokenizer splits on all whitespaces as well as contractions 
      where “can’t” -> “ca”, “n’t” etc.
    - It tokenizes any run of consecutive punctuation characters, such as 
      “,”, “?”, “—”, or “…”
    - Punctuation mixed with letters or digits, such as “03/20/2018”, is 
      kept as one token, as are URLs and hyphenated words 
      like “open-faced”
    '''
    return nltk.word_tokenize(txt.casefold())

The input documents are processed one by one, and the results are arranged into an inverted file. The inverted file is implemented as a Python dict, a hash table whose keys are the terms themselves. This is done by the 2 utility functions below.

In [3]:
def processDoc(txt, docID, vocab):
    '''Process a string of text and add its postings to the inverted file
    passed in as `vocab`. Returns the updated inverted file and the term
    counts of this document.
    '''
    d = Counter( tokenize(txt) ) # count of each token
    for tk in d: # merge dict of this doc with the bigger vocab dict
        if tk not in vocab: # if not in vocab
            vocab[tk] = [(docID, d[tk])] # first posting for token: (docID, TF)
        else: # if already in vocab
            vocab[tk].append( (docID, d[tk]) ) # append to posting list
    return vocab, d

def processDocsFile(docFile):
    ''' Read in a text file, where each line is a document. The function 
    calls `processDoc()` function for each document, and returns an inverted
    file as a dict as well as the total number of document processed.
    '''
    nDocs = 0 # count number of total docs processed
    vcb = dict() # dict for inverted file

    with open(docFile, 'r') as f:
        for line in f: # NOTE: read line by line due to possibly large size
            docID,txt = line.split('\t', 1) # split only on the first tab
            docID = int(docID) # parse into int
            vcb, tmpDict = processDoc(txt, docID, vcb) # process single doc
            nDocs += 1

        for term in vcb:  # go through dict and sort the posting lists
            vcb[term].sort(key=itemgetter(0)) # sort by first elem, or docID
            
    return vcb, nDocs
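
As a rough illustration of the data structure described above, the sketch below builds a tiny inverted file from two made-up documents. It assumes plain whitespace splitting instead of the NLTK tokenizer, and `toy_docs` and `toy_index` are hypothetical names used only for this example.

```python
from collections import Counter

# Two made-up documents keyed by docID (illustrative data, not from the TIME corpus)
toy_docs = {7: "the cat sat on the mat", 3: "the dog sat"}

toy_index = {}  # term -> posting list of (docID, term frequency)
for doc_id, text in toy_docs.items():
    counts = Counter(text.split())  # assumption: whitespace tokens, not NLTK
    for term, tf in counts.items():
        toy_index.setdefault(term, []).append((doc_id, tf))

for postings in toy_index.values():
    postings.sort()  # tuples sort by docID first

print(toy_index["the"])  # [(3, 1), (7, 2)]
print(toy_index["sat"])  # [(3, 1), (7, 1)]
```

Each posting pairs a document ID with the term frequency in that document, so the document frequency of a term is simply the length of its posting list.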

Parse the TIME dataset and build the inverted file.

In [4]:
fName = './data/time-documents.txt'
t0 = time.perf_counter()
timeInv, timeNdocs = processDocsFile(fName)
tt = time.perf_counter() - t0

__Posting List Tuples for Terms__

Since the tokenization folds the case of all terms, the terms must be entered in lower case. Only the first 10 entries of each posting list are printed.

In [5]:
terms = ['computer', 'thailand', 'rockets']
for t in terms:
    posts = timeInv[t][:10]
    print('%s -> %s'%(t,posts))

computer -> [(308, 1)]
thailand -> [(203, 1), (243, 5), (280, 14), (396, 1), (449, 1), (498, 1), (516, 1), (534, 5), (543, 12), (544, 2)]
rockets -> [(27, 1), (117, 1), (186, 1), (313, 6), (404, 1), (464, 2), (495, 1), (509, 2), (545, 2)]


__Print DF and IDF__

For the three terms above, print the document frequency and the inverse document frequency. Note that the IDF here has one added to `N/DF` inside the log, so that a term which appears in every document does not get an IDF of 0.

In [6]:
for t in terms:
    df = len(timeInv[t])
    print('%s: DF=%d, IDF=%f'%(t,df,math.log2(1.0 + timeNdocs/df)))

computer: DF=1, IDF=8.727920
thailand: DF=11, IDF=5.302120
rockets: DF=9, IDF=5.584963
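
The smoothed IDF can be checked by hand. The sketch below assumes the collection holds 423 documents, a value inferred from the printed IDF of `computer` (DF=1) rather than stated in the notebook; `smoothed_idf` and `N_DOCS` are hypothetical names.

```python
import math

def smoothed_idf(n_docs, df):
    """IDF with 1 added to N/DF inside the log, as in the cell above."""
    return math.log2(1.0 + n_docs / df)

N_DOCS = 423  # assumption: inferred from the printed IDF values, not given explicitly

for term, df in [("computer", 1), ("thailand", 11), ("rockets", 9)]:
    print("%s: DF=%d, IDF=%f" % (term, df, smoothed_idf(N_DOCS, df)))

# A term that occurs in every document gets log2(1 + 1) = 1 rather than 0.
print(smoothed_idf(N_DOCS, N_DOCS))  # 1.0
```

Under that assumption the printed values should agree with the output above, and the last line shows why the smoothing keeps very common terms from dropping out of the score entirely.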


__Timing of Processing Documents__

The time is measured with `time.perf_counter()`, which reports elapsed wall-clock time rather than CPU process time.

In [7]:
print('Processed in %d minutes and %.3f seconds.'%(tt//60,tt%60))

Processed in 0 minutes and 1.821 seconds.
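
For reference, the difference between the two standard-library clocks can be seen in a minimal sketch like the one below. The function `busy_and_idle` is a hypothetical stand-in for real work; only `time.perf_counter()` is used in this notebook.

```python
import time

def busy_and_idle():
    sum(i * i for i in range(100_000))  # CPU-bound work
    time.sleep(0.05)                    # idle time: the process is not using the CPU

wall_start, cpu_start = time.perf_counter(), time.process_time()
busy_and_idle()
wall_elapsed = time.perf_counter() - wall_start  # includes the sleep
cpu_elapsed = time.process_time() - cpu_start    # excludes the sleep

print("wall-clock: %.3f s, CPU: %.3f s" % (wall_elapsed, cpu_elapsed))
```

Wall-clock time is the more relevant figure here, since it also captures time spent waiting on file I/O while reading the corpus.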


## (b) Document vector length

The function below implements the algorithm provided in the assignment. Note that 1.0 is added to `N/DF` inside the log so that a term appearing in all documents does not end up with an IDF of 0.

In [8]:
def calcDocLens(vcb, nDocs):
    '''Given an inverted file and total number of documents in corpus, 
    return a dict containing vector length of each document and another dict
    containing the IDF value of each term.
    '''
    docLens = Counter() # use dict since docID may not be contiguous
    idfs = dict() # dict storing IDF values for each term
    
    for term,posts in vcb.items(): # loop over all terms in collection
        idf = math.log2(1.0 + nDocs/len(posts)) # +1.0 for term in all docs
        idfs[term] = (len(posts), idf) # also store the DF as well
        for docID,tf in posts: # loop over docID and tf(term,docid)
            docLens[docID] += (tf*idf)**2 # accumulate doc vector length
            
    for docID,accum in docLens.items(): # loop calculate proper doc vec length
        docLens[docID] = math.sqrt(accum) # sqrt of sum of squared terms
    
    return docLens,idfs
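
To make the accumulation concrete, here is a small sketch of the same calculation on a hand-made inverted file with two documents. The data and the names `toy_index`, `toy_n_docs`, and `toy_lens` are illustrative assumptions, not part of the assignment.

```python
import math
from collections import Counter

# Hand-made inverted file: term -> [(docID, tf), ...] (illustrative data)
toy_index = {"a": [(1, 1), (2, 2)], "b": [(1, 3)]}
toy_n_docs = 2

sq_sums = Counter()  # docID -> running sum of squared tf*idf weights
for term, postings in toy_index.items():
    idf = math.log2(1.0 + toy_n_docs / len(postings))
    for doc_id, tf in postings:
        sq_sums[doc_id] += (tf * idf) ** 2

toy_lens = {doc_id: math.sqrt(s) for doc_id, s in sq_sums.items()}

# "a" is in both docs: idf = log2(1 + 2/2) = 1
# "b" is in one doc:   idf = log2(1 + 2/1) = log2(3)
print(toy_lens[2])  # 2.0, since doc 2 only contains "a" twice
print(toy_lens[1])  # sqrt(1**2 + (3 * log2(3))**2)
```

Because the loop runs over terms rather than documents, each posting list is visited once and the per-document sums are completed only after the whole inverted file has been scanned.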

__Document Vector Lengths__

Since we do not know a priori whether the documents are sorted by docID, we sort the document IDs and print the 10 documents with the lowest IDs.

In [9]:
timeDocLens,timeIDFs = calcDocLens(timeInv, timeNdocs)
tmp = sorted(timeDocLens.items(), key=itemgetter(0))[:10] # sorted by docID
for docID,docLen in tmp: # print 10 lowest by numerical docID
    print('DocID=%d, length=%f'%(docID,docLen))

DocID=17, length=187.113691
DocID=18, length=71.207738
DocID=19, length=155.399087
DocID=20, length=75.472602
DocID=21, length=185.960262
DocID=23, length=145.982647
DocID=24, length=243.984066
DocID=25, length=73.511986
DocID=26, length=135.834693
DocID=27, length=84.271118


## (c) Query representation

The query file is read in, and each query (one per line) is processed in the same way as the source corpus. Each query is represented as a dict of term frequencies.

In [10]:
def processQueryFile(queryFile):
    ''' Read a query file and process each query. Returns a list of tuples, 
    where each tuple contains a query ID and a dict with keys being the term 
    and the item being the term frequency in the query.
    '''
    with open(queryFile, 'r') as f: # read the query file
        txts = f.read().splitlines() # split by line, list of text strings
    qs = [None for x in range(len(txts))] # pre-allocate list of query dicts
    qIDs = [0 for x in range(len(txts))] # pre-allocate list of ints (for qID)
    for n,line in enumerate(txts): # loop over lines (or individual queries)
        qID,qTxt = line.split('\t') # split queryID from query text
        qIDs[n] = int(qID)
        qs[n] = Counter(tokenize(qTxt)) # tokenize and count terms in query

    return list(zip(qIDs, qs))

In [11]:
fName = './data/time-queries.txt'
timeQs = processQueryFile(fName)

__TF/IDF and query vector length for 1st query__

The code below goes through each term in the first query, looks up the TF and IDF of the term, and then calculates the vector length of the query.

In [12]:
qLen = 0
for term,tf in timeQs[0][1].items():
    if term in timeIDFs:
        df,idf = timeIDFs[term]
        print('%s: tf=%d, idf=%f'%(term,tf,idf))
        qLen += (tf*idf) ** 2
    else:
        print('%s: not found in Corpus'%term)
qLen = math.sqrt(qLen)
    
print('\nQuery Vector Length: %f'%qLen)

kennedy: tf=1, idf=3.321928
administration: tf=1, idf=4.539975
pressure: tf=1, idf=3.829723
on: tf=1, idf=1.071462
ngo: tf=1, idf=4.469235
dinh: tf=1, idf=4.469235
diem: tf=1, idf=4.277338
to: tf=1, idf=1.001708
stop: tf=1, idf=3.872352
suppressing: not found in Corpus
the: tf=1, idf=1.000000
buddhists: tf=1, idf=5.179909
.: tf=1, idf=1.000000

Query Vector Length: 12.269275
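
The reported length can be re-derived from the printed weights. The sketch below copies the tf and idf values from the output above into a hypothetical dict `printed_weights`; since those values are rounded to six decimals, the result should only be expected to match approximately.

```python
import math

# (tf, idf) pairs copied from the output above; "suppressing" is left out
# because it does not occur in the corpus and contributes nothing.
printed_weights = {
    "kennedy": (1, 3.321928), "administration": (1, 4.539975),
    "pressure": (1, 3.829723), "on": (1, 1.071462),
    "ngo": (1, 4.469235), "dinh": (1, 4.469235),
    "diem": (1, 4.277338), "to": (1, 1.001708),
    "stop": (1, 3.872352), "the": (1, 1.000000),
    "buddhists": (1, 5.179909), ".": (1, 1.000000),
}

# Vector length = square root of the sum of squared tf*idf weights
q_len = math.sqrt(sum((tf * idf) ** 2 for tf, idf in printed_weights.values()))
print("Query Vector Length: %f" % q_len)  # close to the reported 12.269275
```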


## (d) Score Documents

To score the queries against the corpus, we use the two utility functions below. The first function calculates the cosine similarity of a given query against the corpus (represented by an inverted file, the IDFs of all terms, and a dict of document lengths). The second function calls the first function once for every query in the file.

In [13]:
def cosineSim(qDict, invFile, idfs, docLens):
    ''' Given a query (represented as a dict) and a corpus (represented by an
    inverted file, IDFs, and vector lengths of all documents), calculate the
    similarity score of the query against every document in which at least
    one query term appears. Returns the scores as a dict, where the keys are
    document IDs and the values are cosine similarity scores.
    '''
    sims = Counter()  # counter for storing similarity scores
    qLen = 0 # vector length of query
    for tk,quTF in qDict.items(): # loop over terms in a query
        if tk not in invFile: # skip query term if not in corpus
            continue
        df,idf = idfs[tk] # document freq and IDF value for each token
        qLen += (quTF*idf) ** 2
        for docID,corpTF in invFile[tk]: # iterate through posting list
            sims[docID] += corpTF*idf * quTF*idf 
    
    qLen = math.sqrt(qLen) # sqrt of sum of squares gives query vector length
    for docID in sims: # normalize every accumulated dot product
        sims[docID] /= (docLens[docID] * qLen) 
    return sims # similarity scores (docs sharing no query term are absent)

def processQueries(qs, invFile, idfs, docLens):
    scores = [None for x in range(len(qs))] # score dict
    for n,(qID,qDict) in enumerate(qs): # iterate through all queries
        scores[n] = (qID,cosineSim(qDict, invFile, idfs, docLens))
    return scores
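
As a sanity check on the term-at-a-time scoring idea, the sketch below scores one query against a hand-made two-document index and compares the result with a brute-force cosine computed directly from the vectors. All data and helper names (`toy_index`, `weights`, `norm`, `brute_scores`) are assumptions made for this illustration.

```python
import math
from collections import Counter

# Hand-made inverted file: term -> [(docID, tf), ...] (illustrative data)
toy_index = {"a": [(1, 1), (2, 2)], "b": [(1, 3)], "c": [(2, 1)]}
toy_n_docs = 2
idf = {t: math.log2(1.0 + toy_n_docs / len(p)) for t, p in toy_index.items()}

def weights(tf_by_term):
    """tf*idf weight for every corpus term present in tf_by_term."""
    return {t: tf * idf[t] for t, tf in tf_by_term.items() if t in idf}

def norm(vec):
    """Euclidean length of a sparse weight vector."""
    return math.sqrt(sum(w * w for w in vec.values()))

# Per-document term frequencies, recovered from the posting lists
doc_tfs = {}
for term, postings in toy_index.items():
    for doc_id, tf in postings:
        doc_tfs.setdefault(doc_id, {})[term] = tf

query = Counter("a b zzz".split())  # "zzz" is not in the corpus and is ignored
q_vec = weights(query)

# Term-at-a-time accumulation of dot products over the posting lists
acc = Counter()
for term, q_w in q_vec.items():
    for doc_id, tf in toy_index[term]:
        acc[doc_id] += tf * idf[term] * q_w
scores = {d: s / (norm(weights(doc_tfs[d])) * norm(q_vec)) for d, s in acc.items()}

# Brute-force cosine computed directly from the vectors, for comparison
brute_scores = {}
for d, tfs in doc_tfs.items():
    d_vec = weights(tfs)
    dot = sum(w * q_vec.get(t, 0.0) for t, w in d_vec.items())
    brute_scores[d] = dot / (norm(d_vec) * norm(q_vec))

for d in sorted(scores):
    print(d, round(scores[d], 6), round(brute_scores[d], 6))
```

Both columns should agree, which shows that walking only the posting lists of the query terms gives the same cosine as comparing full vectors, while never touching documents that share no term with the query.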

__Processing Queries and Timing__

In [14]:
t0 = time.perf_counter()
timeQscores = processQueries(timeQs, timeInv, timeIDFs, timeDocLens)
tt = time.perf_counter() - t0
print('Processed in %d minutes and %.3f seconds.'%(tt//60,tt%60))

Processed in 0 minutes and 0.150 seconds.


__Sample of Cosine Similarity Scores__

This shows the cosine similarity scores of the first query for 20 arbitrary document IDs.

In [15]:
for n,(docID,s) in enumerate(timeQscores[0][1].items()):
    if n>=20:
        break
    print('DocID: %d, similarity score = %f'%(docID,s))

DocID: 17, similarity score = 0.078330
DocID: 21, similarity score = 0.061475
DocID: 28, similarity score = 0.065611
DocID: 29, similarity score = 0.071129
DocID: 43, similarity score = 0.067917
DocID: 45, similarity score = 0.067563
DocID: 57, similarity score = 0.047027
DocID: 62, similarity score = 0.088439
DocID: 67, similarity score = 0.052006
DocID: 70, similarity score = 0.055056
DocID: 71, similarity score = 0.087181
DocID: 105, similarity score = 0.057303
DocID: 126, similarity score = 0.068932
DocID: 163, similarity score = 0.083421
DocID: 183, similarity score = 0.120371
DocID: 188, similarity score = 0.075367
DocID: 196, similarity score = 0.095311
DocID: 204, similarity score = 0.068869
DocID: 217, similarity score = 0.050837
DocID: 221, similarity score = 0.075565


## (e) Ranked List
Since we're using `Counter` to store similarity scores, we can use its built-in `most_common(N)` method, which relies on `heapq.nlargest` (a heap-based selection) to extract the N items with the highest values without sorting the whole dict.

In [16]:
def getTopNSimDocs(qID, simScore, N=50): # return top N document for a query
    topN = simScore.most_common(N) # use binary heap for extracting top N
    fmt = '%d Q0 %d %d %.6f jwu74\n' # format for output file lines
    return [fmt % (qID,docID,n+1,score) for n,(docID,score) in enumerate(topN)]
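
A small sketch of the ranking step on made-up scores is shown below. The scores, the query ID 1, and the names `toy_scores`, `top2`, and `lines` are hypothetical; the output format string is the one used above without the trailing newline.

```python
import heapq
from collections import Counter
from operator import itemgetter

toy_scores = Counter({11: 0.42, 5: 0.87, 30: 0.10, 8: 0.65})  # docID -> made-up score

top2 = toy_scores.most_common(2)
same = heapq.nlargest(2, toy_scores.items(), key=itemgetter(1))
print(top2)          # [(5, 0.87), (8, 0.65)]
print(top2 == same)  # True

# One output line per ranked document: queryID, Q0, docID, rank, score, run tag
fmt = '%d Q0 %d %d %.6f jwu74'
lines = [fmt % (1, doc_id, rank, score)
         for rank, (doc_id, score) in enumerate(top2, start=1)]
print(lines[0])  # 1 Q0 5 1 0.870000 jwu74
```

Selecting the top N with a heap avoids a full sort of all scored documents, which matters more as the number of candidate documents per query grows.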

Outputting query results to `time-jwu74.txt`.

In [17]:
def writeQueryResult(outName, qScores, N=50):
    with open(outName, 'w') as fh: 
        for qInd,score in qScores: # loop over query results
            out = getTopNSimDocs(qInd,score,N) # get top N docs based on sim
            fh.writelines(out) # write out lines for output

writeQueryResult('time-jwu74.txt', timeQscores)

## (f) Efficiency

In this section, we build a single function that performs the entire pipeline: building the inverted file, calculating document vector lengths, processing a query file, and writing out the top 50 similar documents for each query.

In [18]:
def queryCorpus(corpusFile, queryFile, outFile):
    ''' Read in a corpus, build the inverted file, and calculate document
    vector lengths for the corpus. Then read in a query file, process and
    score the queries, and write the top 50 similar documents for each
    query to an output file.
    
    The function returns two floats, representing the seconds it took to 
    build a corpus, and to query against this corpus.
    '''
    t0 = time.perf_counter() # time building process
    invFile, nDocs = processDocsFile(corpusFile) # build inverted file
    docLens, idfs = calcDocLens(invFile, nDocs) # calc vector doc lengths
    buildTime = time.perf_counter() - t0 # end time
    
    t0 = time.perf_counter() # time querying process
    qTxts = processQueryFile(queryFile) # read in query file
    qryScores = processQueries(qTxts, invFile, idfs, docLens) # queries
    queryTime = time.perf_counter() - t0 # end time
    
    writeQueryResult(outFile, qryScores) # output query results
    
    return buildTime, queryTime

Executing this pipeline on the fire10 data set takes significantly longer than on the TIME dataset processed above. The results are written to `fire10-jwu74.txt`.

In [19]:
bt, qt = queryCorpus('./data/fire10-documents.txt', 
                     './data/fire10-queries.txt', 'fire10-jwu74.txt' )

print('Build time for fire10: %d minutes %.3f seconds'%(bt//60,bt%60))
print('Query time for fire10: %d minutes %.3f seconds'%(qt//60,qt%60))

Build time for fire10: 5 minutes 36.826 seconds
Query time for fire10: 0 minutes 18.169 seconds
