# Lab 2 - Creating an inverted index

Overview of inverted indexes: <a href="https://en.wikipedia.org/wiki/Inverted_index">https://en.wikipedia.org/wiki/Inverted_index</a>

In this lab you will create an inverted index for the Gutenberg books. What I want you to do is create a single index that lets you quickly return all the lines from all the books that contain a specific word. We will be using the basic and naive split functionality from the chapter (i.e., don't worry about punctuation, etc.). Those are details that are not necessary for our exploration into distributed computing. We will use GNU Parallel to distribute our solution.

This lab will focus on distributing the workload across multiple cores/processors. We will bring lab 1 and lab 2 together in lab 3 and use gluster and parallel together.

In [1]:
%load_ext autoreload
%autoreload 2


# Put all your solutions into Lab2_helper.py, as this script is autograded
import Lab2_helper
    
import os
from pathlib import Path
home = str(Path.home())

import pandas as pd

### Read in the book files for testing purposes

In [2]:
from os import path
book_files = []
for book in open(f"{home}/csc-369-student/data/gutenberg/order.txt").read().split("\n"):
    if path.isfile(f'{home}/csc-369-student/data/gutenberg/{book}-0.txt'):
        book_files.append(f'{home}/csc-369-student/data/gutenberg/{book}-0.txt')

**Exercise 1:** Create a function that returns a line that is read after seeking to ``pos`` in ``book``.

Hint: You'll need to open a file object and then call seek. Calling readline will then work as expected.
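
Here is a minimal sketch of the seek-then-readline pattern from the hint. It assumes UTF-8 text, and the function body is only one possible approach, not the graded solution. The temporary file is a made-up stand-in for a book so the example runs without the course data.

```python
import os
import tempfile


def read_line_at_pos(book, pos):
    """Return the text from offset `pos` through the end of that line."""
    # Assumption: the file is UTF-8 text and `pos` falls on a character boundary.
    with open(book, encoding="utf-8") as f:
        f.seek(pos)          # move the file pointer to the requested offset
        return f.readline()  # read from there up to and including the newline


# Tiny stand-in for a book (hypothetical content, not Gutenberg data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha beta\ngamma delta\n")
    demo_path = tmp.name

first_line = read_line_at_pos(demo_path, 0)    # starts at a line boundary
mid_line = read_line_at_pos(demo_path, 6)      # starts in the middle of line 1
second_line = read_line_at_pos(demo_path, 11)  # offset just past the first newline
os.remove(demo_path)
```

Only offsets 0 and 11 land on the start of a line here. Offset 6 returns just the tail of the first line.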

In [3]:
line = Lab2_helper.read_line_at_pos(book_files[0],100)
display(line)

'one anywhere in the United States and\n'

**Notice that readline reads from the current position until the end of the line.** For the inverted index, you'll want to make sure to record only the positions that get you to the beginning of the line.

In [4]:
display(Lab2_helper.read_line_at_pos(book_files[0],95))

'f anyone anywhere in the United States and\n'

**Exercise 2:** Create a function that returns a Python dictionary representing the inverted index. The dictionary should map each word to the offsets that put the file pointer at the beginning of the lines containing it. I used ``.split()`` without any arguments.

Hint: I used the ``tell`` function to return the correct offset.
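
One way to use ``tell`` is sketched below, assuming UTF-8 text and the naive ``.split()`` tokenizer. Treat it as an illustration rather than the graded solution. The loop calls ``readline`` explicitly because iterating with ``for line in f`` disables ``tell`` on text files. The sample content is made up.

```python
import os
import tempfile


def inverted_index(book):
    """Map each whitespace-separated token to the offsets of the lines containing it."""
    index = {}
    with open(book, encoding="utf-8") as f:
        while True:
            offset = f.tell()    # position of the start of the line about to be read
            line = f.readline()
            if not line:         # an empty string means end of file
                break
            for word in line.split():
                # One entry per occurrence, so a repeated word repeats the offset.
                index.setdefault(word, []).append(offset)
    return index


# Tiny stand-in for a book (hypothetical content, not Gutenberg data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"to be or\nnot to be\nbe be\n")
    demo_path = tmp.name

demo_index = inverted_index(demo_path)
os.remove(demo_path)
```

Recording the offset before each ``readline`` call is what guarantees that every stored position points at the beginning of a line.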

In [5]:
index = Lab2_helper.inverted_index(book_files[0])
display(index['things'])

[8386,
 13175,
 21912,
 23602,
 24101,
 27549,
 29850,
 37134,
 68890,
 69086,
 69771,
 69845,
 71403,
 74893,
 77502,
 80991,
 91732,
 105218,
 119592,
 120796,
 135001,
 135217]

**Exercise 3:** Write a function that reads all of the inverted indices into a single inverted index in the format shown below.
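
As a rough sketch of that format, the merged index can be a dictionary of dictionaries: word, then book path, then the list of offsets. The per-book helper is repeated here so the example is self-contained. Both function bodies are assumptions about one workable approach, and the two sample files are made up.

```python
import os
import tempfile


def inverted_index(book):
    """Per-book index: token -> list of line-start offsets (UTF-8 text assumed)."""
    index = {}
    with open(book, encoding="utf-8") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            for word in line.split():
                index.setdefault(word, []).append(offset)
    return index


def merged_inverted_index(book_files):
    """Merged index: token -> {book path -> list of line-start offsets}."""
    merged = {}
    for book in book_files:
        for word, offsets in inverted_index(book).items():
            merged.setdefault(word, {})[book] = offsets
    return merged


# Two tiny stand-in books (hypothetical content, not Gutenberg data).
with tempfile.TemporaryDirectory() as tmp_dir:
    path_a = os.path.join(tmp_dir, "a.txt")
    path_b = os.path.join(tmp_dir, "b.txt")
    with open(path_a, "wb") as f:
        f.write(b"good things\nbad things\n")
    with open(path_b, "wb") as f:
        f.write(b"no match\nall things\n")
    merged = merged_inverted_index([path_a, path_b])

things = merged["things"]
```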

In [6]:
index = Lab2_helper.merged_inverted_index(book_files)
display(pd.Series(index.keys()))

0                                         ﻿
1                                       St.
2                                Benedict’s
3                                      Rule
4                                       for
                        ...                
367190                            _there!_”
367191                        dispossessed,
367192                            209-0.txt
367193                            209-0.zip
367194    http://www.gutenberg.org/2/0/209/
Length: 367195, dtype: object

In [7]:
index['things']

{'/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt': [8386,
  13175,
  21912,
  23602,
  24101,
  27549,
  29850,
  37134,
  68890,
  69086,
  69771,
  69845,
  71403,
  74893,
  77502,
  80991,
  91732,
  105218,
  119592,
  120796,
  135001,
  135217],
 '/home/jupyter-pander14/csc-369-student/data/gutenberg/1342-0.txt': [92468,
  126100,
  131655,
  192987,
  202986,
  242634,
  274222,
  281631,
  349511,
  434772,
  439240,
  605843,
  611074,
  612237,
  631212,
  656519,
  764179,
  783530,
  783670],
 '/home/jupyter-pander14/csc-369-student/data/gutenberg/84-0.txt': [31274,
  46864,
  47203,
  75424,
  235797,
  434766,
  434905],
 '/home/jupyter-pander14/csc-369-student/data/gutenberg/6133-0.txt': [24351,
  32691,
  48489,
  56570,
  166077,
  174650,
  212069,
  220879,
  224655,
  284282,
  327966,
  328106],
 '/home/jupyter-pander14/csc-369-student/data/gutenberg/46-0.txt': [8894,
  47600,
  63177,
  71864,
  90711,
  111084,
  121207,
  130800,
  148233,
  

In [8]:
pd.Series(index['things'])

/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt    [8386, 13175, 21912, 23602, 24101, 27549, 2985...
/home/jupyter-pander14/csc-369-student/data/gutenberg/1342-0.txt     [92468, 126100, 131655, 192987, 202986, 242634...
/home/jupyter-pander14/csc-369-student/data/gutenberg/84-0.txt       [31274, 46864, 47203, 75424, 235797, 434766, 4...
/home/jupyter-pander14/csc-369-student/data/gutenberg/6133-0.txt     [24351, 32691, 48489, 56570, 166077, 174650, 2...
/home/jupyter-pander14/csc-369-student/data/gutenberg/46-0.txt       [8894, 47600, 63177, 71864, 90711, 111084, 121...
                                                                                           ...                        
/home/jupyter-pander14/csc-369-student/data/gutenberg/730-0.txt      [119678, 153359, 314959, 343502, 343696, 40217...
/home/jupyter-pander14/csc-369-student/data/gutenberg/113-0.txt      [4685, 6737, 7345, 8278, 8348, 22839, 26064, 2...
/home/jupyter-pander14/csc-369-student/data/gute

**Exercise 4:** Write a function that returns all of the lines from all of the books that contain a word. Duplicate lines are correct if the line has more than one occurrence of the word. Format shown below.
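
The lookup side can be sketched as follows, assuming the word, then book, then offsets layout and UTF-8 text. The function body is one possible approach rather than the graded solution, and the index below is built by hand for a made-up file.

```python
import os
import tempfile


def get_lines(index, word):
    """Return (book, line) tuples for every recorded offset of `word`."""
    results = []
    for book, offsets in index.get(word, {}).items():
        with open(book, encoding="utf-8") as f:  # open each book only once
            for offset in offsets:
                f.seek(offset)                   # jump to the start of the line
                results.append((book, f.readline()))
    return results


# Tiny stand-in for a book (hypothetical content, not Gutenberg data).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"good things\nthings and things\n")
    demo_path = tmp.name

# Hand-built index: the second line holds the word twice, so its offset repeats.
demo_index = {"things": {demo_path: [0, 12, 12]}}
demo_lines = get_lines(demo_index, "things")
missing_lines = get_lines(demo_index, "missing")
os.remove(demo_path)
```

Because the index stores one offset per occurrence, the second line comes back twice, which matches the duplicate-line behavior described above.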

In [9]:
lines = Lab2_helper.get_lines(index,'things')
lines[:10]

[('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  'we must always so serve Him with the good things He has given us, that\n'),
 ('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  'still in the body and are able to fulfil all these things by the light\n'),
 ('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  'justice, and all these things shall be given you besides.” And again:\n'),
 ('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  'also it is his function to dispose all things with prudence and justice.\n'),
 ('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  'the same time, the Abbot himself should do all things in the fear of God\n'),
 ('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  '  60. To obey in all things the commands of the Abbot, even though he\n'),
 ('/home/jupyter-pander14/csc-369-student/data/gutenberg/50040-0.txt',
  'completed, the two thing

**Exercise 5:**

Write a Python script that we can execute using Parallel in the following manner. 

I have hard-coded an example script that will return the incorrect answer, but it will run.

Your job is to remove the hard-coded answer and insert the solution that produces the correct answer. I have supplied the directory structure and the parallel commands. You do need to write code that merges the groups back together.
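
To give a feel for the overall shape, here is a hedged sketch of a script body that indexes one directory and writes JSON to standard output. The names ``index_directory`` and ``main`` are hypothetical and not taken from the supplied script, the ``.txt`` filter is an assumption, and the sample files are made up.

```python
import io
import json
import os
import sys
import tempfile


def index_directory(directory):
    """Build token -> {book path -> [line offsets]} for every .txt file in `directory`."""
    index = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):  # assumption: books are the .txt files
            continue
        book = os.path.join(directory, name)
        with open(book, encoding="utf-8") as f:
            while True:
                offset = f.tell()      # start of the line about to be read
                line = f.readline()
                if not line:
                    break
                for word in line.split():
                    index.setdefault(word, {}).setdefault(book, []).append(offset)
    return index


def main(directory, out=sys.stdout):
    """Write the directory's index to `out` as JSON so the shell can redirect it."""
    json.dump(index_directory(directory), out)


# Demo on a temporary directory, capturing the output instead of printing it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_a = os.path.join(tmp_dir, "a.txt")
    path_b = os.path.join(tmp_dir, "b.txt")
    with open(path_a, "wb") as f:
        f.write(b"good things\n")
    with open(path_b, "wb") as f:
        f.write(b"no match\nall things\n")
    buffer = io.StringIO()
    main(tmp_dir, out=buffer)

result = json.loads(buffer.getvalue())
```

Run as a script, calling ``main(sys.argv[1])`` would let the shell redirect the output into a file, one per group directory.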

**Here are the three groups.** Each directory has about 25 books. We could distribute these to different machines in a cluster, but you get the idea without that step.

In [10]:
!ls -d {home}/csc-369-student/data/gutenberg/group*

/home/jupyter-pander14/csc-369-student/data/gutenberg/group1
/home/jupyter-pander14/csc-369-student/data/gutenberg/group2
/home/jupyter-pander14/csc-369-student/data/gutenberg/group3


In [11]:
!ls {home}/csc-369-student/data/gutenberg/group1

1080-0.txt  1400-0.txt	219-0.txt    43-0.txt	  64244-0.txt
11-0.txt    160-0.txt	25344-0.txt  46-0.txt	  74-0.txt
1250-0.txt  1661-0.txt	2542-0.txt   50040-0.txt  76-0.txt
1260-0.txt  1952-0.txt	25929-0.txt  6133-0.txt   84-0.txt
1342-0.txt  205-0.txt	2701-0.txt   64241-0.txt  98-0.txt


In [12]:
!ls {home}/csc-369-student/data/gutenberg/group2

1184-0.txt  147-0.txt	2600-0.txt  4300-0.txt	 64239-0.txt
120-0.txt   158-0.txt	2852-0.txt  45-0.txt	 64242-0.txt
1232-0.txt  16-0.txt	3600-0.txt  57426-0.txt  64247-0.txt
135-0.txt   2554-0.txt	36-0.txt    58585-0.txt  768-0.txt
140-0.txt   2591-0.txt	408-0.txt   60479-0.txt  996-0.txt


In [13]:
!ls {home}/csc-369-student/data/gutenberg/group3

113-0.txt   203-0.txt  28054-0.txt  41-0.txt	 53854-0.txt  730-0.txt
1399-0.txt  209-0.txt  2814-0.txt   42108-0.txt  6130-0.txt   766-0.txt
1727-0.txt  215-0.txt  30254-0.txt  4517-0.txt	 64238-0.txt  863-0.txt
1998-0.txt  244-0.txt  35-0.txt     521-0.txt	 64246-0.txt  902-0.txt


**Running a single directory:** You can run a single directory with the following command and store the results to a file.

In [14]:
!python Lab2_exercise5.py {home}/csc-369-student/data/gutenberg/group1 > group1.json

We can easily read these back into Python by relying on the JSON format. While JSON is stricter than Python dictionaries, the two are very similar for our purposes (<a href="https://www.json.org/json-en.html">https://www.json.org/json-en.html</a>).
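
A short illustration of what "stricter" means in practice, using made-up values. An index made of string keys and lists of integers survives the round trip unchanged, while integer keys and tuples do not.

```python
import json

# String keys and lists of ints come back exactly as they went in.
index = {"things": {"book.txt": [0, 12]}}
restored = json.loads(json.dumps(index))

# Two places where JSON is stricter than a Python dictionary:
int_keys = json.loads(json.dumps({1: "one"}))      # keys become strings
tuples = json.loads(json.dumps({"pair": (1, 2)}))  # tuples come back as lists
```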

In [15]:
import json
group1_results = json.load(open("group1.json"))
pd.Series(group1_results['things'])

/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/6133-0.txt     [24351, 32691, 48489, 56570, 166077, 174650, 2...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/46-0.txt       [8894, 47600, 63177, 71864, 90711, 111084, 121...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/219-0.txt      [15626, 21421, 34891, 35328, 53751, 58248, 750...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/76-0.txt       [9757, 11753, 11824, 25522, 26420, 27504, 5213...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/25344-0.txt    [75568, 85281, 148509, 218956, 222761, 254728,...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/98-0.txt       [3780, 21317, 60608, 60819, 74433, 188123, 211...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/50040-0.txt    [8386, 13175, 21912, 23602, 24101, 27549, 2985...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/205-0.txt      [5450, 13577, 13844, 24649, 53422, 89572, 

**You can run the files in parallel using**

In [16]:
!ls {home}/csc-369-student/data/gutenberg/group1

1080-0.txt  1400-0.txt	219-0.txt    43-0.txt	  64244-0.txt
11-0.txt    160-0.txt	25344-0.txt  46-0.txt	  74-0.txt
1250-0.txt  1661-0.txt	2542-0.txt   50040-0.txt  76-0.txt
1260-0.txt  1952-0.txt	25929-0.txt  6133-0.txt   84-0.txt
1342-0.txt  205-0.txt	2701-0.txt   64241-0.txt  98-0.txt


In [17]:
# !parallel "python Lab2_exercise5.py" ... # Come and see me to get the ... You'll have to try to come up with it first

Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication
please cite:

  O. Tange (2011): GNU Parallel - The Command-Line Power Tool,
  ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence this citation notice: run 'parallel --citation'.



In [18]:
index = Lab2_helper.merge()
# You've done it!

In [19]:
pd.Series(index['things'])

/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/6133-0.txt     [24351, 32691, 48489, 56570, 166077, 174650, 2...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/46-0.txt       [8894, 47600, 63177, 71864, 90711, 111084, 121...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/219-0.txt      [15626, 21421, 34891, 35328, 53751, 58248, 750...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/76-0.txt       [9757, 11753, 11824, 25522, 26420, 27504, 5213...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group1/25344-0.txt    [75568, 85281, 148509, 218956, 222761, 254728,...
                                                                                                  ...                        
/home/jupyter-pander14/csc-369-student/data/gutenberg/group3/64238-0.txt    [4202, 8980, 12824, 21679, 25943, 30211, 37908...
/home/jupyter-pander14/csc-369-student/data/gutenberg/group3/35-0.txt       [2658, 20215, 53533, 58539, 60512, 62724, 

This solution should match your single-threaded solution above, but now you are a rockstar distributed computing wizard who could process thousands of books on a cluster with nothing other than simple Python and GNU Parallel.
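
The merge step and the comparison can be sketched like this. The helper names ``merge_group_files`` and ``by_basename`` are hypothetical, the sketch assumes each group file holds the word, then book, then offsets layout, and all paths and offsets below are made up. Book paths are compared by file name because the group runs see the books under different directories.

```python
import json
import os
import tempfile


def merge_group_files(json_paths):
    """Combine per-group JSON indexes into one token -> {book: offsets} dictionary."""
    merged = {}
    for json_path in json_paths:
        with open(json_path, encoding="utf-8") as f:
            group_index = json.load(f)
        for word, books in group_index.items():
            # Groups hold different books, so update() never overwrites an entry.
            merged.setdefault(word, {}).update(books)
    return merged


def by_basename(index):
    """Re-key each book by file name so indexes from different directories compare equal."""
    return {
        word: {os.path.basename(book): offsets for book, offsets in books.items()}
        for word, books in index.items()
    }


# Made-up group outputs and a made-up single-process index for comparison.
group1 = {"things": {"/demo/group1/a.txt": [0, 12]}, "good": {"/demo/group1/a.txt": [0]}}
group2 = {"things": {"/demo/group2/b.txt": [9]}}
single = {
    "things": {"/demo/a.txt": [0, 12], "/demo/b.txt": [9]},
    "good": {"/demo/a.txt": [0]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for name, group in (("group1.json", group1), ("group2.json", group2)):
        path = os.path.join(tmp_dir, name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(group, f)
        paths.append(path)
    merged = merge_group_files(paths)

matches = by_basename(merged) == by_basename(single)
```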

In [20]:
# Don't forget to push!


In [21]:
!rm *.json

rm: cannot remove '*.json': No such file or directory
