# Joins and creating an inverted index in Hadoop

### 3.  HW5.2  Memory-backed map-side

Using MRJob, implement a hashside join (memory-backed map-side) for left, right and inner joins. Use the following tables for this HW and join on the country code (the third column of the transactions table and the second column of the Countries table):

<PRE>
transactions.dat
Alice Bob|$10|US
Sam Sneed|$1|CA
Jon Sneed|$20|CA
Arnold Wesise|$400|UK
Henry Bob|$2|US
Yo Yo Ma|$2|CA
Jon York|$44|CA
Alex Ball|$5|UK
Jim Davis|$66|JA

Countries.dat
United States|US
Canada|CA
United Kingdom|UK
Italy|IT

</PRE>

Justify which table you chose as the Left table in this hashside join.

Please report the number of rows resulting from:

- (1) Left joining Table Left with Table Right
- (2) Right joining Table Left with Table Right
- (3) Inner joining Table Left with Table Right
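
The three row counts can be sanity-checked without Hadoop. The following is a minimal plain-Python sketch, not the MRJob solution. It assumes Countries is the left (in-memory) table and transactions is the right (streamed) table; the helper name `hash_join` is hypothetical.

```python
# Plain-Python sketch of a hash join; `hash_join` is a hypothetical helper,
# and Countries is assumed to be the left (in-memory) table.
countries = ["United States|US", "Canada|CA", "United Kingdom|UK", "Italy|IT"]
transactions = [
    "Alice Bob|$10|US", "Sam Sneed|$1|CA", "Jon Sneed|$20|CA",
    "Arnold Wesise|$400|UK", "Henry Bob|$2|US", "Yo Yo Ma|$2|CA",
    "Jon York|$44|CA", "Alex Ball|$5|UK", "Jim Davis|$66|JA",
]


def hash_join(left_lines, right_lines, how="inner"):
    # Build the hash table from the small (left) table: iso2 -> country name
    left = {}
    for line in left_lines:
        name, iso2 = line.split("|")
        left[iso2] = name
    rows, seen = [], set()
    # Stream the large (right) table and probe the hash table
    for line in right_lines:
        person, amount, iso2 = line.split("|")
        if iso2 in left:
            seen.add(iso2)
            rows.append((iso2, left[iso2], person, amount))
        elif how == "right":
            # Right join keeps unmatched right rows with a null country
            rows.append((iso2, None, person, amount))
    if how == "left":
        # Left join adds the left rows that never matched a right row
        for iso2 in sorted(set(left) - seen):
            rows.append((iso2, left[iso2], None, None))
    return rows


for how in ("left", "right", "inner"):
    print(how, len(hash_join(countries, transactions, how)))
```

Under these assumptions the sketch gives 9 rows for the left join, 9 for the right join and 8 for the inner join.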


In [1]:
%%writefile transactions.dat
Alice Bob|$10|US
Sam Sneed|$1|CA
Jon Sneed|$20|CA
Arnold Wesise|$400|UK
Henry Bob|$2|US
Yo Yo Ma|$2|CA
Jon York|$44|CA
Alex Ball|$5|UK
Jim Davis|$66|JA

Writing transactions.dat


In [2]:
%%writefile Countries.dat
United States|US
Canada|CA
United Kingdom|UK
Italy|IT

Writing Countries.dat


In [67]:
%%writefile MRJoin.py
#!/home/cloudera/anaconda2/bin/python

import os
from mrjob.job import MRJob
from collections import defaultdict


# -------------------------------------- define helper functions -------------------------------------------------- #


def choose_join(join_type):
    '''Helps calling the correct join type class that matches the command line argument'''
    
    if join_type == 'inner':
        return 'MRInnerJoin'
    elif join_type == 'left':
        return 'MRLeftJoin'
    elif join_type == 'right':
        return 'MRRightJoin'
    else:
        raise ValueError("Choose 'inner', 'left' or 'right'.")
        
        
# ----------------------------------------- Join superclass -------------------------------------------------------- #        


class MRInMemoryMapperJoin(MRJob):
    '''In memory mapper join super class (implements inner join)'''

    def mapper_init(self):
        self.join_type = 'inner'
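        # a defaultdict with no default factory behaves like a plain dict:
        # looking up a missing key raises KeyError, which the mappers rely on
        self.left_table = defaultdict(None)
        # populate the left table with the entries of the smaller table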
        with open('Countries.dat', 'r') as infile:
            for line in infile:
                country_name, iso2 = line.strip().split('|')
                self.left_table[iso2] = country_name

    def mapper(self, _, right_table_line):
        person_name, transaction_value, iso2 = right_table_line.strip().split('|')
        try:
            country = self.left_table[iso2]
            yield iso2, (country, person_name, transaction_value)
        # if key is not in left table, do not emit it
        except KeyError:
            pass
        
    def reducer(self,key,values):
        for value in values:
            yield key,value
    

    
# ------------------------------------- inner join subclass ---------------------------------------------------- #    
    
# MRInMemoryMapperJoin implements an inner join, for consistency create also an inner join subclass
class MRInnerJoin(MRInMemoryMapperJoin):
     pass
    

    
# -------------------------------------- right join subclass -------------------------------------------------- #    

# right join keeps every line of the right table; keys missing from the left table are emitted with a null country
class MRRightJoin(MRInMemoryMapperJoin):

    def mapper(self, _, right_table_line):
        person_name, transaction_value, iso2 = right_table_line.strip().split('|')
        try:
            # emit joined line
            yield iso2, (self.left_table[iso2], person_name, transaction_value)
        # if the key is in the right table but not in the left table, emit the right line with None for the country
        except KeyError:
            yield iso2, (None, person_name, transaction_value)

            
            
# ---------------------------------------------- left join subclass ----------------------------------------- #            
            
# left join must also emit the left table entries that never match a line of the right table; since each of several
# independent mappers sees only part of the right table, the final check is done in the reducer
class MRLeftJoin(MRInMemoryMapperJoin):


    def mapper_init(self):
        # inherit inits from MRInMemoryMapperJoin
        super(MRLeftJoin, self).mapper_init()
        # create a set which registers the iso2 codes that were emitted by the mapper
        self.iso2s = set()

    def mapper(self ,_,right_table_line):
        person_name, transaction_value, iso2 = right_table_line.strip().split('|')
        try:
            # add iso2 to the emitted key set
            self.iso2s.update([iso2])
            # emit joined line
            yield iso2, (self.left_table[iso2], person_name, transaction_value)
            # if there is a key in the right table which is not in the left table, ignore it
        except KeyError:
            pass

    def mapper_final(self):
        # find left table keys that this mapper never saw in its part of the right table
        remainder = self.left_table.viewkeys() - self.iso2s
        for iso2 in remainder:
            yield iso2, (self.left_table[iso2], None, None)


    # a reducer is needed:
    # with more than one mapper, a mapper may flag a left table key as unmatched even though the key
    # occurs in a part of the right table that only another mapper saw
    def reducer(self, iso2, values):
        possibly_not_in_right_table = []
        n_values = 0
        for value in values:
            n_values += 1
            # a value with a person name is a real match, emit it
            if value[1] is not None:
                yield iso2, value
            # otherwise some mapper flagged the key as possibly missing from the right table
            else:
                possibly_not_in_right_table.append(value)
        # if every value for this key is an unmatched flag, no mapper found a match: emit one flag and drop the
        # repetitions; if at least one real match exists, the flags were wrong and are dropped
        if len(possibly_not_in_right_table) == n_values:
            yield iso2, possibly_not_in_right_table[0]

# ------------------------------- run --------------------------------- #

if __name__ == "__main__":
    join_type = os.environ.get('JOIN_TYPE', 'left')
    # look the class up by name in the module namespace instead of calling eval()
    globals()[choose_join(join_type)].run()


Overwriting MRJoin.py


In [55]:
!chmod +x MRJoin.py

## Run left join

In [45]:
!python MRJoin.py transactions.dat --file Countries.dat --cmdenv JOIN_TYPE='left' \
           -r hadoop --python-bin /home/cloudera/anaconda2/bin/python -q > left_join.txt
!cat left_join.txt
!echo "Number of lines:"
!cat left_join.txt | wc -l

"CA"	["Canada","Jon York","$44"]
"CA"	["Canada","Yo Yo Ma","$2"]
"CA"	["Canada","Jon Sneed","$20"]
"CA"	["Canada","Sam Sneed","$1"]
"IT"	["Italy",null,null]
"UK"	["United Kingdom","Alex Ball","$5"]
"UK"	["United Kingdom","Arnold Wesise","$400"]
"US"	["United States","Henry Bob","$2"]
"US"	["United States","Alice Bob","$10"]
Number of lines:
9


## Run right join

In [71]:
!python MRJoin.py transactions.dat --file Countries.dat --cmdenv JOIN_TYPE='right' \
           -r hadoop --python-bin /home/cloudera/anaconda2/bin/python -q > right_join.txt
!cat right_join.txt
!echo "Number of lines:" 
!cat right_join.txt | wc -l

"CA"	["Canada","Jon York","$44"]
"CA"	["Canada","Yo Yo Ma","$2"]
"CA"	["Canada","Jon Sneed","$20"]
"CA"	["Canada","Sam Sneed","$1"]
"JA"	[null,"Jim Davis","$66"]
"UK"	["United Kingdom","Alex Ball","$5"]
"UK"	["United Kingdom","Arnold Wesise","$400"]
"US"	["United States","Henry Bob","$2"]
"US"	["United States","Alice Bob","$10"]
Number of lines:
9


## Run inner join

In [69]:
!python MRJoin.py transactions.dat --file Countries.dat --cmdenv JOIN_TYPE='inner' \
           -r hadoop --python-bin /home/cloudera/anaconda2/bin/python -q > inner_join.txt
!cat inner_join.txt
!echo "Number of lines:" 
!cat inner_join.txt | wc -l

"CA"	["Canada","Jon York","$44"]
"CA"	["Canada","Yo Yo Ma","$2"]
"CA"	["Canada","Jon Sneed","$20"]
"CA"	["Canada","Sam Sneed","$1"]
"UK"	["United Kingdom","Arnold Wesise","$400"]
"UK"	["United Kingdom","Alex Ball","$5"]
"US"	["United States","Henry Bob","$2"]
"US"	["United States","Alice Bob","$10"]
Number of lines:
8


## Pairwise similarity  

In this part of the assignment we will focus on developing methods for detecting synonyms, using the Google 5-grams dataset. To accomplish this you must script three main tasks using MRJob:


#### (1) Using the systems tests data sets, write mrjob code to build the stripes
#### (2) Write mrjob code to build an inverted index from the stripes
#### (3) Using two (symmetric) comparison methods of your choice (e.g., correlations, distances, similarities), pairwise compare all stripes (vectors), and output to a file.   

__==Design notes for (1)==__  
For this task you will be able to modify the pattern we used in HW 3.2 (feel free to use the solution as reference). To total the word counts across the n-grams, output the support from the mappers using the total order inversion pattern:

<*word,count>   

to ensure that the support arrives before the cooccurrences.   
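
A small sketch of why the `*` prefix gives this ordering. It assumes keys are compared as plain ASCII strings, as in a default byte-wise sort; the records and counts are illustrative only.

```python
# "*" (ASCII 42) sorts before any lowercase letter (ASCII 97 and up), so
# every "*word" support record is ordered ahead of the plain word keys.
# The records below are illustrative, not output of the actual job.
records = [("boon", 10), ("*boon", 60), ("atlas", 50), ("*atlas", 65)]

ordered = sorted(records)
for key, count in ordered:
    print(key, count)
```

The starred support keys come out first, followed by the plain word keys.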

In addition to ensuring the determination of the total word counts, the mapper must also output co-occurrence counts for the pairs of words inside of each n-gram. Treat these words as a basket, as we have in HW 3, but count all stripes or pairs in both orders, i.e., count both orderings: (word1,word2), and (word2,word1), to preserve
symmetry in our output for (2).
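
The both-orderings rule can be sketched in plain Python. This is an illustration under stated assumptions, not the MRJob job: each n-gram is treated as a basket of unique words, and the three n-grams mirror the atlas-boon systems test used later in this notebook.

```python
from collections import defaultdict
from itertools import permutations

# (n-gram text, count) pairs mirroring the atlas-boon systems test
ngrams = [("atlas boon", 50), ("boon cava dipped", 10), ("atlas dipped", 15)]

stripes = defaultdict(lambda: defaultdict(int))
for text, count in ngrams:
    # Treat the n-gram as a basket of unique words (an assumption of this sketch)
    basket = sorted(set(text.split()))
    # permutations(..., 2) yields both (w1, w2) and (w2, w1)
    for w1, w2 in permutations(basket, 2):
        stripes[w1][w2] += count

for word in sorted(stripes):
    print(word, dict(stripes[word]))
```

Because every pair is counted in both orders, `stripes[w1][w2]` always equals `stripes[w2][w1]`.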

__==Design notes for (3)==__   
For this task you will have to determine a method of comparison.
Here are a few that you might consider:

 - Jaccard
 - Cosine similarity
 - Spearman correlation
 - Euclidean distance
 - Taxicab (Manhattan) distance
 - Shortest path graph distance (a graph, because our data is symmetric!)
 - Pearson correlation
 - Kendall correlation
 ...

However, be cautioned that some comparison methods are more difficult to parallelize than others, and do not perform more associations than is necessary, since your choice of association will be symmetric.
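
To illustrate the point about symmetry, here is a minimal sketch that compares each unordered pair only once. It assumes Jaccard over the sets of co-occurring terms; the `neighbours` data mirrors the atlas-boon systems test and the variable names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    # size of the intersection over size of the union
    return len(a & b) / float(len(a | b))


# Co-occurring terms per word (mirrors the atlas-boon systems test)
neighbours = {
    "atlas": {"boon", "dipped"},
    "boon": {"atlas", "cava", "dipped"},
    "cava": {"boon", "dipped"},
    "dipped": {"atlas", "boon", "cava"},
}

# combinations() visits each unordered pair once: n*(n-1)/2 comparisons
pairs = list(combinations(sorted(neighbours), 2))
scores = {(a, b): jaccard(neighbours[a], neighbours[b]) for a, b in pairs}

print(len(pairs))
print(scores[("atlas", "cava")])
```

With four terms this performs 6 comparisons rather than the 12 ordered pairs, since `jaccard(a, b)` equals `jaccard(b, a)`.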

Please use the inverted index (discussed in live session #5) based pattern to compute the pairwise (term-by-term) similarity matrix. 
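
The inverted-index pattern can be sketched in plain Python as follows. This is a single-process illustration, not the MRJob implementation; the stripes mirror the stripe-docs systems test used later in this notebook, and only Jaccard is computed.

```python
from collections import defaultdict
from itertools import combinations

stripes = {
    "DocA": {"X": 20, "Y": 30, "Z": 5},
    "DocB": {"X": 100, "Y": 20},
    "DocC": {"M": 5, "N": 20, "Z": 5, "Y": 1},
}

# Inverted index: term -> list of (doc, stripe length) postings
index = defaultdict(list)
for doc in sorted(stripes):
    for term in stripes[doc]:
        index[term].append((doc, len(stripes[doc])))

# Each posting list adds 1 to every pair of docs that share the term
shared = defaultdict(int)
for postings in index.values():
    for a, b in combinations(sorted(postings), 2):
        shared[(a, b)] += 1

# Jaccard from the shared-term count and the two stripe lengths
jaccard = {}
for ((doc_a, len_a), (doc_b, len_b)), n in shared.items():
    jaccard[(doc_a, doc_b)] = n / float(len_a + len_b - n)

for pair in sorted(jaccard):
    print(pair, round(jaccard[pair], 6))
```

Only pairs that share at least one term ever appear in `shared`, which is the main saving of this pattern over comparing all stripes against each other directly.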

In [79]:
%%writefile buildStripes.py
#!~/anaconda2/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
import re
import mrjob
import json
from mrjob.protocol import RawProtocol
from mrjob.job import MRJob
from mrjob.step import MRStep
import itertools

class MRbuildStripes(MRJob):
  

    
    # ------------------ settings ---------------------- #
    INPUT_PROTOCOL = RawProtocol
    
    # helper function for summing stripes 
    def add_dictionaries(self,dict1, dict2):
        '''Merges two dictionaries and sums up the values of the common keys'''
        new_dict = {}
        for k in itertools.chain(dict1.keys(), dict2.keys()):
            new_dict[k] = dict1.get(k, 0)+dict2.get(k, 0)
        return new_dict
    
    # ---------------- define MR steps ----------------- #

    def steps(self):

        return [MRStep(
            mapper_init=self.mapper_init,
            mapper=self.mapper,
            reducer=self.reducer)
        ]
    
    # -------------- Step 1: emit stripes ---------------- #
    def mapper_init(self):
        self.WORD_RE = re.compile(r"[a-z']+")
        self.COUNT_RE = re.compile(r'[0-9]+')

    def mapper(self, line, freq):
        line = line.lower()
        # extract all words
        words = self.WORD_RE.findall(line)
        # find only the first count
        count = int(self.COUNT_RE.search(freq).group())
        for word in words:
            # all of the other words in the n-gram (drops only the first occurrence of
            # `word`, so a word repeated in the n-gram appears in its own stripe)
            other_words = words[:]
            other_words.remove(word)
            # associate each of the other words with the n-gram count
            yield word, dict.fromkeys(other_words, count)

    def reducer(self,key,values):
        counts = {}
        for value in values:
            counts = self.add_dictionaries(counts,value)
        yield key, counts


  
if __name__ == '__main__':
    MRbuildStripes.run()

Overwriting buildStripes.py


In [100]:
%%writefile invertedIndex.py
#!~/anaconda2/bin/python
# -*- coding: utf-8 -*-


from __future__ import division
import collections
import re
import json
import math
import numpy as np
import itertools
import mrjob
from mrjob.protocol import RawProtocol
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRinvertedIndex(MRJob):
    


    # ---------------------------- settings ------------------------------ #

    # set the option on this job class rather than on the MRJob base class
    SORT_VALUES = True
    INPUT_PROTOCOL = RawProtocol
    
    # --------------------------- define MR steps ------------------------ #
    def steps(self):

        return [MRStep(
            mapper_init=self.mapper_init,
            mapper=self.mapper,
            reducer=self.reducer)
        ]
    
    # ------------------ Step 1: find inverted index, document length ---- #

    def mapper_init(self):
        self.WORD_RE = re.compile(r"[a-zA-Z']+")

    def mapper(self, doc_id, words):
        words = self.WORD_RE.findall(words)
        doc_id = self.WORD_RE.findall(doc_id)
        doc_id_len = list(doc_id)
        _len = len(words)
        doc_id_len.append(_len)
        for word in words:
            # Store the length of the document to use with JACCARD (|A| + |B|)
            yield word, doc_id_len

    def reducer(self, key, values):
        # collect the postings [doc_id, doc_length] for this term
        yield key, list(values)




        
if __name__ == '__main__':
    MRinvertedIndex.run() 

Overwriting invertedIndex.py


In [139]:
%%writefile similarity.py
#!~/anaconda2/bin/python
# -*- coding: utf-8 -*-

from __future__ import division
import collections
import re
import json
import math
import numpy as np
import itertools
import mrjob
from mrjob.protocol import RawProtocol
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRsimilarity(MRJob):
  
    
    #-------------------- define similarity functions ---------------- # 

    def get_jaccard(self,N,lenA,lenB):
        return N/(lenA+lenB-N)

    def get_cosine(self,N,lenA,lenB):
        return N/np.sqrt(lenA*lenB)
    
    def get_overlap(self,N,lenA,lenB):
        return N/min(lenA,lenB)

    def get_dice(self,N,lenA,lenB):
        return 2*N/(lenA+lenB)
    
    # ------------------------------ define MR steps ----------------- #

    def steps(self):
        JOBCONF_STEP1 = {}
        JOBCONF_STEP2 = {
            ######### IMPORTANT: THIS WILL HAVE NO EFFECT IN -r local MODE. MUST USE -r hadoop FOR SORTING #############
            'mapreduce.job.output.key.comparator.class': 'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
            'mapreduce.partition.keycomparator.options': '-k1,1nr',
        }
        return [MRStep(jobconf=JOBCONF_STEP1,
                       mapper=self.mapper_pair_sim,
                       reducer=self.reducer_pair_sim)
            ,
                MRStep(jobconf=JOBCONF_STEP2,
                       mapper=None,
                       reducer=self.reducer_sort)
                ]

    # -------------------- Step 1: find pairs, calculate similarity  ------------------------------- #

    def mapper_pair_sim(self, _, line):
        line = line.strip()
        index, posting = line.split("\t")
        posting = json.loads(posting)

        '''
        @input: lines (postings) from inverted index
         "blue" [["DocA", 4], ["DocC", 4], ["DocE", 3]]

        @output: pairs of doc and doc_length, count the number of pairs
         make complex key and count of 1 as value:
         DocA.4.DocC.4, 1
         DocA.4.DocE.3, 1
         DocC.4.DocE.3, 1
        '''

        X = map(lambda x: x[0] + "." + str(x[1]), posting)

        # taking advantage of symmetry, output only (a,b), but not (b,a)
        for subset in itertools.combinations(sorted(set(X)), 2):
            yield subset[0] + "." + subset[1], 1

    def reducer_pair_sim(self, key, value):
        '''
        @input:  
            key: Doc1_id.Doc1_len.Doc2_id.Doc2_len
            value: count
        @output: 
            key: average of similarity measures
            value: Doc1-Doc2,cosine,jaccard,overlap,dice similarity measures
        '''
        Doc1, Doc1_len, Doc2, Doc2_len = key.split(".")
        t = sum(value)
        lenDoc1,lenDoc2 = int(Doc1_len),int(Doc2_len)
        cosine = self.get_cosine(t,lenDoc1,lenDoc2)
        jaccard = self.get_jaccard(t,lenDoc1,lenDoc2)
        overlap = self.get_overlap(t,lenDoc1,lenDoc2)
        dice = self.get_dice(t,lenDoc1,lenDoc2)
        avg = np.mean([cosine,jaccard,overlap,dice])
        yield avg,(Doc1 + " - " + Doc2,cosine,jaccard,overlap,dice)

    # ----------------------------- Step 2: emit, sort ----------------------------
    def reducer_sort(self, key, value):
        for v in value:
            yield  key, v


  
if __name__ == '__main__':
    MRsimilarity.run()

Overwriting similarity.py


##  Run Systems tests locally on small datasets 

#### 1: unit/systems first-10-lines

In [72]:
%%writefile googlebooks-eng-all-5gram-20090715-0-filtered-first-10-lines.txt
A BILL FOR ESTABLISHING RELIGIOUS	59	59	54
A Biography of General George	92	90	74
A Case Study in Government	102	102	78
A Case Study of Female	447	447	327
A Case Study of Limited	55	55	43
A Child's Christmas in Wales	1099	1061	866
A Circumstantial Narrative of the	62	62	50
A City by the Sea	62	60	49
A Collection of Fairy Tales	123	117	80
A Collection of Forms of	116	103	82

Writing googlebooks-eng-all-5gram-20090715-0-filtered-first-10-lines.txt


#### 2: unit/systems atlas-boon

In [1]:
%%writefile atlas-boon-systems-test.txt
atlas boon	50	50	50
boon cava dipped	10	10	10
atlas dipped	15	15	15

Writing atlas-boon-systems-test.txt


#### 3: unit/systems stripe-docs-test

Three terms, A,B,C and their corresponding stripe-docs of co-occurring terms

- DocA {X:20, Y:30, Z:5}
- DocB {X:100, Y:20}
- DocC {M:5, N:20, Z:5, Y:1}

### (1) build stripes for all the test data sets - run the commands and ensure that your output matches the output below

In [151]:

!python buildStripes.py -r local googlebooks-eng-all-5gram-20090715-0-filtered-first-10-lines.txt -q > systems_test_stripes_1

In [152]:
!cat systems_test_stripes_1

"a"	{"limited":55,"sea":62,"general":92,"female":447,"in":1201,"religious":59,"george":92,"biography":92,"city":62,"for":59,"tales":123,"child's":1099,"the":124,"forms":116,"wales":1099,"christmas":1099,"government":102,"collection":239,"by":62,"case":604,"circumstantial":62,"of":895,"study":604,"bill":59,"establishing":59,"narrative":62,"fairy":123}
"bill"	{"a":59,"religious":59,"for":59,"establishing":59}
"biography"	{"a":92,"of":92,"george":92,"general":92}
"by"	{"a":62,"city":62,"the":62,"sea":62}
"case"	{"a":604,"limited":55,"government":102,"of":502,"study":604,"female":447,"in":102}
"child's"	{"a":1099,"wales":1099,"christmas":1099,"in":1099}
"christmas"	{"a":1099,"wales":1099,"in":1099,"child's":1099}
"circumstantial"	{"a":62,"of":62,"the":62,"narrative":62}
"city"	{"a":62,"the":62,"by":62,"sea":62}
"collection"	{"a":239,"of":239,"fairy":123,"tales":123,"forms":116}
"establishing"	{"a":59,"bill":59,"religious":59,"for":59}
"fairy"	{"a":123,"of":123,"tales":123,"colle

In [149]:


!python buildStripes.py -r local atlas-boon-systems-test.txt -q > systems_test_stripes_2

In [150]:
!cat systems_test_stripes_2

"atlas"	{"dipped":15,"boon":50}
"boon"	{"atlas":50,"dipped":10,"cava":10}
"cava"	{"dipped":10,"boon":10}
"dipped"	{"atlas":15,"boon":10,"cava":10}


In [88]:

with open("systems_test_stripes_3", "w") as f:
    f.writelines([
        '"DocA"\t{"X":20, "Y":30, "Z":5}\n',
        '"DocB"\t{"X":100, "Y":20}\n',  
        '"DocC"\t{"M":5, "N":20, "Z":5, "Y":1}\n'
    ])
!cat systems_test_stripes_3   

"DocA"	{"X":20, "Y":30, "Z":5}
"DocB"	{"X":100, "Y":20}
"DocC"	{"M":5, "N":20, "Z":5, "Y":1}


### (2) Build Inverted Index - run the commands and ensure that your output matches the output below

In [134]:
!python invertedIndex.py -r local systems_test_stripes_1 -q > systems_test_index_1

In [135]:
!python invertedIndex.py -r local systems_test_stripes_2 -q > systems_test_index_2

In [136]:
!python invertedIndex.py -r local systems_test_stripes_3 -q > systems_test_index_3

In [137]:
##########################################################
# Pretty print systems tests for generating Inverted Index
##########################################################

import json

for i in range(1,4):
  print "—"*100
  print "Systems test ",i," - Inverted Index"
  print "—"*100  
  with open("systems_test_index_"+str(i),"r") as f:
      lines = f.readlines()
      for line in lines:
          line = line.strip()
          word,stripe = line.split("\t")
          stripe = json.loads(stripe)
          stripe.extend([["",""] for _ in xrange(3 - len(stripe))])

          print "{0:>16} |{1:>16} |{2:>16} |{3:>16}".format(
              (word), stripe[0][0]+" "+str(stripe[0][1]), stripe[1][0]+" "+str(stripe[1][1]), stripe[2][0]+" "+str(stripe[2][1]))
        


————————————————————————————————————————————————————————————————————————————————————————————————————
Systems test  1  - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
             "a" |          bill 4 |     biography 4 |            by 4
          "bill" |            a 27 |  establishing 4 |           for 4
     "biography" |            a 27 |       general 4 |        george 4
            "by" |            a 27 |          city 4 |           sea 4
          "case" |            a 27 |        female 4 |    government 4
       "child's" |            a 27 |     christmas 4 |            in 7
     "christmas" |            a 27 |       child's 4 |            in 7
"circumstantial" |            a 27 |     narrative 4 |           of 16
          "city" |            a 27 |            by 4 |           sea 4
    "collection" |            a 27 |         fairy 4 |         forms 3
  "establishing" |            a 27 |          bill 4 |

### Inverted Index

In [None]:
————————————————————————————————————————————————————————————————————————————————————————————————————
Systems test  1  - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
             "a" |          bill 4 |     biography 4 |            by 4
          "bill" |            a 27 |  establishing 4 |           for 4
     "biography" |            a 27 |       general 4 |        george 4
            "by" |            a 27 |          city 4 |           sea 4
          "case" |            a 27 |        female 4 |    government 4
       "child's" |            a 27 |     christmas 4 |            in 7
     "christmas" |            a 27 |       child's 4 |            in 7
"circumstantial" |            a 27 |     narrative 4 |           of 15
          "city" |            a 27 |            by 4 |           sea 4
    "collection" |            a 27 |         fairy 4 |         forms 3
  "establishing" |            a 27 |          bill 4 |           for 4
         "fairy" |            a 27 |    collection 5 |           of 15
        "female" |            a 27 |          case 7 |           of 15
           "for" |            a 27 |          bill 4 |  establishing 4
         "forms" |            a 27 |    collection 5 |           of 15
       "general" |            a 27 |     biography 4 |        george 4
        "george" |            a 27 |     biography 4 |       general 4
    "government" |            a 27 |          case 7 |            in 7
            "in" |            a 27 |          case 7 |       child's 4
       "limited" |            a 27 |          case 7 |           of 15
     "narrative" |            a 27 |circumstantial 4 |           of 15
            "of" |            a 27 |     biography 4 |          case 7
     "religious" |            a 27 |          bill 4 |  establishing 4
           "sea" |            a 27 |            by 4 |          city 4
         "study" |            a 27 |          case 7 |        female 4
         "tales" |            a 27 |    collection 5 |         fairy 4
           "the" |            a 27 |            by 4 |circumstantial 4
         "wales" |            a 27 |       child's 4 |     christmas 4
————————————————————————————————————————————————————————————————————————————————————————————————————
Systems test  2  - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
         "atlas" |          boon 3 |        dipped 3 |                
          "boon" |         atlas 2 |          cava 2 |        dipped 3
          "cava" |          boon 3 |        dipped 3 |                
        "dipped" |         atlas 2 |          boon 3 |          cava 2
————————————————————————————————————————————————————————————————————————————————————————————————————
Systems test  3  - Inverted Index
————————————————————————————————————————————————————————————————————————————————————————————————————
             "M" |          DocC 4 |                 |                
             "N" |          DocC 4 |                 |                
             "X" |          DocA 3 |          DocB 2 |                
             "Y" |          DocA 3 |          DocB 2 |          DocC 4
             "Z" |          DocA 3 |          DocC 4 |                


### (3) Calculate similarities - run the commands and ensure that your output matches the output below

#### NOTE: you must run in hadoop mode to generate sorted similarities
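
If only an unsorted output is available (for example from a `-r local` run), the lines can be ranked afterwards in plain Python. This is a sketch under the assumption that each line has the form `<average>\t<json list>`; the sample lines are illustrative, with values rounded from the stripe-docs test.

```python
# Illustrative lines in the assumed output format: "<average>\t<json list>"
lines = [
    '0.276777\t["DocB - DocC", 0.353553, 0.2, 0.5, 0.333333]',
    '0.741582\t["DocA - DocB", 0.816497, 0.666667, 1.0, 0.8]',
    '0.488675\t["DocA - DocC", 0.57735, 0.4, 0.666667, 0.571429]',
]

# Sort numerically on the first tab-separated field, largest first
ranked = sorted(lines, key=lambda l: float(l.split("\t")[0]), reverse=True)

for line in ranked:
    print(line)
```

This reproduces a numeric, descending ordering on the key outside of Hadoop; it does not replace the comparator settings used in the job itself.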

In [140]:
!python similarity.py -r hadoop systems_test_index_1 -q > systems_test_similarities_1

In [141]:
!python similarity.py -r hadoop systems_test_index_2 -q > systems_test_similarities_2

In [142]:
!python similarity.py -r hadoop systems_test_index_3 -q > systems_test_similarities_3

In [146]:
############################################
# Pretty print systems tests
############################################

import json
for i in range(1,4):
  print '—'*110
  print "Systems test ",i," - Similarity measures"
  print '—'*110
  print "{0:>15} |{1:>15} |{2:>15} |{3:>15} |{4:>15} |{5:>15}".format(
          "average", "pair", "cosine", "jaccard", "overlap", "dice")
  print '-'*110

  with open("systems_test_similarities_"+str(i),"r") as f:
      lines = f.readlines()
      for line in lines:
          line = line.strip()
          avg,stripe = line.split("\t")
          stripe = json.loads(stripe)

          print "{0:>15f} |{1:>15} |{2:>15f} |{3:>15f} |{4:>15f} |{5:>15f}".format(
              float(avg), stripe[0], float(stripe[1]), float(stripe[2]), float(stripe[3]), float(stripe[4])
 )

——————————————————————————————————————————————————————————————————————————————————————————————————————————————
Systems test  1  - Similarity measures
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
        average |           pair |         cosine |        jaccard |        overlap |           dice
--------------------------------------------------------------------------------------------------------------
       1.000000 |female - limited |       1.000000 |       1.000000 |       1.000000 |       1.000000
       0.868292 |  forms - tales |       0.866025 |       0.750000 |       1.000000 |       0.857143
       0.868292 |  fairy - forms |       0.866025 |       0.750000 |       1.000000 |       0.857143
       0.830357 |   case - study |       0.857143 |       0.750000 |       0.857143 |       0.857143
       0.723144 |         a - of |       0.721688 |       0.535714 |       0.937500 |       0.697674
       0.712500 |bill

### Pairwise Similarity 

In [None]:
Systems test  3  - Similarity measures
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
        average |           pair |         cosine |        jaccard |        overlap |           dice
--------------------------------------------------------------------------------------------------------------
       0.741582 |    DocA - DocB |       0.816497 |       0.666667 |       1.000000 |       0.800000
       0.488675 |    DocA - DocC |       0.577350 |       0.400000 |       0.666667 |       0.571429
       0.276777 |    DocB - DocC |       0.353553 |       0.200000 |       0.500000 |       0.333333
--------------------------------------------------------------------------------------------------------------

Systems test  2  - Similarity measures
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
        average |           pair |         cosine |        jaccard |        overlap |           dice
--------------------------------------------------------------------------------------------------------------
       1.000000 |   atlas - cava |       1.000000 |       1.000000 |       1.000000 |       1.000000
       0.625000 |  boon - dipped |       0.666667 |       0.500000 |       0.666667 |       0.666667
       0.389562 |  cava - dipped |       0.408248 |       0.250000 |       0.500000 |       0.400000
       0.389562 |    boon - cava |       0.408248 |       0.250000 |       0.500000 |       0.400000
       0.389562 | atlas - dipped |       0.408248 |       0.250000 |       0.500000 |       0.400000
       0.389562 |   atlas - boon |       0.408248 |       0.250000 |       0.500000 |       0.400000
--------------------------------------------------------------------------------------------------------------

Systems test  1  - Similarity measures
——————————————————————————————————————————————————————————————————————————————————————————————————————————————
        average |           pair |         cosine |        jaccard |        overlap |           dice
--------------------------------------------------------------------------------------------------------------
       0.096639 |      bill - of |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.096639 |   child's - of |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.096639 | christmas - of |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.096639 |establishing - of |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.096639 |       for - of |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.096639 | of - religious |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.096639 |     of - wales |       0.129099 |       0.055556 |       0.250000 |       0.105263
       0.120879 |       in - the |       0.142857 |       0.076923 |       0.142857 |       0.142857
       0.142202 |collection - in |       0.169031 |       0.090909 |       0.200000 |       0.166667
       0.142328 |      a - forms |       0.222222 |       0.071429 |       0.666667 |       0.133333
       0.156933 |    bill - case |       0.188982 |       0.100000 |       0.250000 |       0.181818
      ...