Iason Myttas and Adrian Figueroa

<p style="text-align: center;">Approach and Improvements</p>
<p style="text-align: center;">As a student with prior biology experience and an interest in medicine, I decided to work on a corpus of medical papers with my partner, Adrian. Each record in the corpus consists of a document ID, a link to the abstract online, the document title and the text of its abstract. We thought this would be an interesting choice because users could search it both to learn about medical topics that interest them and to look up problems they run into. <br /> In the process, we researched how to select and mine data, cleaned our files, built an inverted index, normalized queries and tested them using both a TF-IDF ranking algorithm and BM25. After that, by observing the engine's behavior and testing it with our friends, we found many parts of it that we could fix, so we later went back and reevaluated our project.<br /> More specifically, after we chose our dataset we decided to clean it of all non-letter characters (digits included) and to convert everything to lower case. We also decided to look up each query in the text of the abstracts and print the titles of the matching documents as results. When we created the inverted index, we dissected the one taught in our course and noticed that it would often list the same document more than once for a word. We changed the algorithm so that each document appears only once per word, together with a count of how often the word occurs in it, which avoids wasting space on duplicates. <br /> For ranking, we initially decided to use both TF-IDF and BM25, but later decided that it would be better to use only BM25, which seemed to provide more relevant results. At that point we began testing and observing how different words affected the search engine. After testing it ourselves and getting feedback from some of our friends, we decided that there were a few problems we should take care of. 
One of our main additions was a GUI that makes our search engine more approachable. With our GUI, a user can enter a query, specify the number of results they want to receive and see the results. We also noticed that many words produced different results depending on whether they were entered in singular or plural form. For example, the query &ldquo;brain&rdquo; produces many more results than the query &ldquo;brains&rdquo;. We therefore modified the queries so that, for every word ending in &ldquo;s&rdquo;, the form without the final &ldquo;s&rdquo; is searched as well. We also normalized the queries so that it does not matter whether they are typed in upper case or lower case.<br /> We also read some interesting related articles along the way, such as this one: <a href="http://link.springer.com/chapter/10.1007/978-0-85729-320-6_91#page-2" target="_blank" rel="nofollow">http://link.springer.com/chapter/10.1007/978-0-85729-320-6_91#page-2</a>. In the future, we would like to compute more evaluation metrics and apply machine learning algorithms to the engine.</p>
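
<p style="text-align: left;">The cleaning and indexing steps described above can be sketched on a toy corpus. This is a minimal illustration rather than the project code: the two sample abstracts and the helper names <code>clean</code> and <code>build_index</code> are hypothetical, and the index is assumed to map each word to a dictionary of document IDs and occurrence counts, so that no document is listed twice for a word.</p>

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-zA-Z]+")  # runs of letters only


def clean(text):
    """Drop every non-letter character and lower-case what is left."""
    return " ".join(WORD.findall(text)).lower()


def build_index(corpus):
    """Map each word to {doc_id: occurrences}, one entry per document."""
    index = defaultdict(dict)
    for doc_id, document in enumerate(corpus):
        for word in document.split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical sample abstracts, used only for illustration
samples = ["Brain injury: the brain's response.", "Lung function in 40 adults"]
corpus = [clean(s) for s in samples]
index = build_index(corpus)

print(corpus[0])       # brain injury the brain s response
print(index["brain"])  # {0: 2}
```

<p style="text-align: left;">Storing a count per document keeps a single entry for a repeated word while preserving the term frequency that the ranking step needs.</p>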

In [4]:
import csv
import re
import math
from collections import Counter

# Columns 2 and 3 of each tab-separated row hold the title and the abstract text
with open("mega2.csv") as f:
    r = csv.reader(f, delimiter='\t')
    rgx = re.compile(r'\b[a-zA-Z]+\b')  # keep purely alphabetic words only
    # (title, abstract) pairs, stripped of non-letters and lower-cased
    docs = [(' '.join(rgx.findall(x[2])).lower(), ' '.join(rgx.findall(x[3])).lower())
            for x in r]

items_t = [d[0] for d in docs]  # item titles
items_d = [d[1] for d in docs]  # item descriptions (abstract text)
items_i = range(len(items_t))   # item ids


# index
def create_inverted_index(corpus):
    # Map each word to {document id: number of occurrences in that document}
    idx = {}
    for i, document in enumerate(corpus):
        for word in document.split():
            if word in idx:
                if i in idx[word]:
                    idx[word][i] += 1
                else:
                    idx[word][i] = 1
            else:
                idx[word] = {i: 1}
    return idx

#idx = create_inverted_index(items_d)




def idf(term, idx, n):
    return math.log(float(n) / (1 + len(idx[term])))    


def print_results(results, n, head=True):
    # Print the n best (head=True) or n worst (head=False) scored titles
    if head:
        print('\nTop %d from recall set of %d items:' % (n, len(results)))
        for r in results[:n]:
            print('\t%0.2f - %s' % (r[0], items_t[r[1]]))
    else:
        print('\nBottom %d from recall set of %d items:' % (n, len(results)))
        for r in results[-n:]:
            print('\t%0.2f - %s' % (r[0], items_t[r[1]]))
 
# BM25

def get_results_bm25(qry, corpus, k1=1.5, b=0.75):
    # Note: the inverted index is rebuilt from the corpus on every call
    idx = create_inverted_index(corpus)
    n = len(corpus)                        # number of documents in the corpus
    d = [len(x.split()) for x in corpus]   # number of terms in each document
    d_avg = float(sum(d)) / len(d)         # average document length
    score = Counter()
    for term in qry.split():
        if term in idx:
            i = idf(term, idx, n)
            for doc in idx[term]:
                f = float(idx[term][doc])  # times the term appears in doc
                # BM25 score for this (term, document) pair
                s = i * ((f * (k1 + 1)) / (f + k1 * (1 - b + (b * (float(d[doc]) / d_avg)))))
                score[doc] += s

    # Keep positively scored documents as [score, doc id], best first
    results = [[s, doc] for doc, s in score.items() if s > 0]
    sorted_results = sorted(results, key=lambda t: -t[0])
    return sorted_results

#TF IDF was not used
# #TF IDF
# def get_results_tfidf(qry, idx, n):
#   score = Counter()
  
#   for term in qry.split():
#       # << IMPLEMENT TF-IDF SCORING >> CODE HERE
 
#     if term in idx:
#         i = idf(term, idx, n)
#         #print i
#         for doc in idx[term]:
#             #print "doc", doc
#             #print "idx[term][doc]", idx[term][doc]
#             score[doc] += idx[term][doc] * i  # TF-IDF is tf * idf (this line previously added them)
#             #print score
 
#   results=[]
#   for x in [[r[0],r[1]] for r in zip(score.keys(), score.values())]:
#       if x[1] > 0:
#           results.append([x[1],x[0]])
 
#   sorted_results = sorted(results, key=lambda t: t[0] * -1 )
#   return sorted_results
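
<p style="text-align: left;">To see how document length and term frequency shape a single BM25 score, the expression used in <code>get_results_bm25</code> can be isolated in a small function. This is a sketch for illustration only: the name <code>bm25_term_score</code> and the numbers fed to it are hypothetical, and the IDF value is passed in directly rather than computed from a corpus.</p>

```python
def bm25_term_score(f, doc_len, avg_len, idf, k1=1.5, b=0.75):
    """BM25 contribution of one term to one document.

    f:       occurrences of the term in the document
    doc_len: number of terms in the document
    avg_len: average document length in the corpus
    idf:     inverse document frequency of the term
    """
    length_norm = 1 - b + b * (float(doc_len) / avg_len)
    return idf * (f * (k1 + 1)) / (f + k1 * length_norm)


# Same term frequency, different document lengths (made-up numbers)
short_doc = bm25_term_score(f=2, doc_len=50, avg_len=100, idf=1.0)
long_doc = bm25_term_score(f=2, doc_len=200, avg_len=100, idf=1.0)

# Same average-length document, 1 versus 10 occurrences of the term
once = bm25_term_score(f=1, doc_len=100, avg_len=100, idf=1.0)
ten_times = bm25_term_score(f=10, doc_len=100, avg_len=100, idf=1.0)

print(short_doc > long_doc)  # True: the shorter document is favored
print(once, ten_times)       # the score grows far less than tenfold
```

<p style="text-align: left;">With these defaults the per-term score can never exceed <code>idf * (k1 + 1)</code>, so repeating a word many times in a long abstract cannot dominate the ranking.</p>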



In [5]:
import ipywidgets as widgets
from ipywidgets import Button, Text, IntSlider, HTML
from IPython.display import display

#Input
search = Button(button_style = 'danger', description = "Search")
userinput = Text(placeholder="Enter your search")
maxresults = IntSlider(description = "Max results",max=30, value = 10)
#Layout Organization
ButtonGroup =[search, maxresults]
#HTML
HTMLResults = HTML()
PageTitle = HTML(value = '<p style="text-align: center;"><span style="color: #339966;"><strong>Medical research paper results for</strong></span></p>')


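# Show the text box, then the button and slider side by side.
# (Passing the plain ButtonGroup list to display() only prints its repr.)
display(userinput, widgets.HBox(ButtonGroup))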
tester = "lungs"
def sfixer(inputstring):
    # For every word ending in "s", append the word without its final "s"
    for word in inputstring.split():
        if word.endswith('s'):
            inputstring += " " + word[:-1]
    return inputstring
#print(sfixer(tester))
def searchclick(b):
    # Lower-case first so that a trailing upper-case "S" is also caught by sfixer
    fixed = sfixer(userinput.value.lower())
    result = get_results_bm25(fixed, items_d)
    # Build the markup in one string, then update the widget once
    html = '<p style="text-align: center;"><span style="color: #339966;"><strong>%s</strong></span></p>' % (fixed)
    for r in result[:maxresults.value]:
        html += '<p style="text-align: left;"><span style="font-size: small;">%s</span></p>' % (items_t[r[1]])
        html += '<p style="text-align: center;"><span style="font-size: x-small;">Relevance score - %8.2f</span></p>' % (r[0])
        html += '<hr />'
    HTMLResults.value = html
    display(PageTitle, HTMLResults)

search.on_click(searchclick)
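
<p style="text-align: left;">The effect of the query normalization on a few inputs can be checked without the GUI. The function below is a hypothetical stand-alone variant written for illustration (the name <code>expand_query</code> is not part of the project); it assumes the same rule as above, namely lower-casing the query and adding an s-less form of every word that ends in "s".</p>

```python
def expand_query(query):
    """Lower-case the query and add an s-less variant of each word ending in "s"."""
    words = query.lower().split()
    variants = [w[:-1] for w in words if w.endswith("s") and len(w) > 1]
    return " ".join(words + variants)


print(expand_query("Brains"))       # brains brain
print(expand_query("Lungs virus"))  # lungs virus lung viru
print(expand_query("heart"))        # heart
```

<p style="text-align: left;">Stripping the final "s" is a rough stand-in for stemming: it also produces non-words such as "viru", but terms that are missing from the index are skipped during scoring, so they do not change the results.</p>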


