# Lab 1 - Creating an inverted index

Overview of inverted indexes: <a href="https://en.wikipedia.org/wiki/Inverted_index">https://en.wikipedia.org/wiki/Inverted_index</a>

In this lab you will create an inverted index for the Gutenberg books. What I want you to do is create a single index from which you can quickly return all the lines, from all the books, that contain a specific word. We will use the basic, naive split functionality from the chapter (i.e., don't worry about punctuation, etc.). Those details are not necessary for our exploration of distributed computing. We will use GNU Parallel to distribute our solution.
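
To make the idea concrete before touching any files, here is a minimal sketch of an inverted index over a few in-memory lines. The name `toy_inverted_index` and the sample lines are made up for illustration, and the sketch records line numbers rather than the file offsets the lab asks for.

```python
from collections import defaultdict

def toy_inverted_index(lines):
    """Map each whitespace-separated word to the line numbers that contain it."""
    index = defaultdict(list)
    for line_number, line in enumerate(lines):
        # Naive split: punctuation and case are deliberately ignored.
        for word in line.split():
            index[word].append(line_number)
    return dict(index)

toy_lines = ["the cat sat", "the dog ran", "a cat ran"]
toy_index = toy_inverted_index(toy_lines)
print(toy_index["cat"])  # [0, 2]
```

Looking up a word is now a single dictionary access instead of a scan over every line.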

In [2]:
display_available = True
try:
    display('Verifying you can use display')
    from IPython.display import Image
except (NameError, ImportError):
    display=print
    display_available = False
try:
    import pygraphviz
    graphviz_installed = True # Set this to False if you don't have graphviz
except ImportError:
    graphviz_installed = False
    
import os
from pathlib import Path
home = str(Path.home())
if home == '/home/runner':
    home = os.getcwd()

def isnotebook():
    try:
        shell = get_ipython().__class__.__name__
        if shell == 'ZMQInteractiveShell':
            return True   # Jupyter notebook or qtconsole
        elif shell == 'TerminalInteractiveShell':
            return False  # Terminal running IPython
        else:
            return False  # Other type (?)
    except NameError:
        return False      # Probably standard Python interpreter

'Verifying you can use display'

### Read in the book files for testing purposes

In [6]:
from os import path
book_files = []
gutenberg_dir = f"{home}/csc-369-student/data/gutenberg"
# Close order.txt once it has been read
with open(f"{gutenberg_dir}/order.txt") as order_file:
    for book in order_file.read().split("\n"):
        book_file = f"{gutenberg_dir}/{book}-0.txt"
        if path.isfile(book_file):
            book_files.append(book_file)

**Exercise 1:**

Complete the following function that returns a line that is read after seeking to ``pos`` in ``book``.

In [7]:
def read_line_at_pos(book, pos):
    with open(book,encoding="utf-8") as f:
        f.seek(pos)
        return f.readline()

In [9]:
if isnotebook():
    line = read_line_at_pos(book_files[0],100)
    # Way to get started!
    display(line)

'one anywhere in the United States and\n'

**Notice that readline reads from the current position until the end of the line.** For the inverted index, you'll want to make sure to record only the positions that get you to the beginning of the line.

In [5]:
if isnotebook():
    read_line_at_pos(book_files[0],95)
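
The effect of seeking to the middle of a line versus the start of a line can be seen on a small file. This is a sketch on a made-up temporary file (`seek_demo_path` and its contents are assumptions, not part of the lab data).

```python
import os
import tempfile

# Write a two-line sample file; binary mode avoids any newline translation.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first line\nsecond line\n")
    seek_demo_path = tmp.name

with open(seek_demo_path, encoding="utf-8") as f:
    f.seek(6)                  # lands in the middle of the first line
    mid_line = f.readline()
    f.seek(11)                 # lands exactly where the second line starts
    full_line = f.readline()

os.remove(seek_demo_path)
print(repr(mid_line))   # 'line\n'
print(repr(full_line))  # 'second line\n'
```

Only the second offset returns a whole line, which is why the index should store line-start positions.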

**Exercise 2:**

Complete the following function that returns a Python dictionary representing the inverted index. For each word, the dictionary should store offsets that put the file pointer at the beginning of each line containing that word.

In [6]:
# Read in the file once and build a list of line offsets
def inverted_index(book):
    index = {}
    # YOUR SOLUTION HERE
    # Check out https://stackoverflow.com/a/40546814/9864659 for inspiration using seek and tell
    return index

In [7]:
if isnotebook():
    index = inverted_index(book_files[0])
    index['things']
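
One way to collect line-start positions is to call `tell()` before each `readline()`. The sketch below is not the required solution: `line_start_offsets` is a hypothetical helper, the sample file is made up, and opening in binary mode is an assumption chosen so that `tell()` returns plain byte counts (the lab's `read_line_at_pos` opens the file in text mode).

```python
import os
import tempfile

def line_start_offsets(path):
    """Return the byte offset at which each line of the file begins."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()        # position before reading = start of this line
            if not f.readline():  # empty bytes means end of file
                break
            offsets.append(pos)
    return offsets

with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"alpha beta\ngamma\n")
    offsets_demo_path = tmp.name

toy_offsets = line_start_offsets(offsets_demo_path)
os.remove(offsets_demo_path)
print(toy_offsets)  # [0, 11]
```

The loop uses `readline()` rather than `for line in f` because, in text mode, iterating over the file disables `tell()`.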

**Exercise 3:**

Write a function that merges the inverted indexes of all the books into a single inverted index in the format shown below.

In [8]:
def merged_inverted_index(book_files):
    index = {}
    for book in book_files:
        book_index = inverted_index(book)
        # YOUR SOLUTION HERE
        pass
    return index

In [9]:
if isnotebook():
    index = merged_inverted_index(book_files)
    # Getting there!

In [10]:
if isnotebook():
    import pandas as pd
    pd.Series(index.keys())

In [11]:
if isnotebook():
    index['things']

In [12]:
if isnotebook():
    import pandas as pd
    # I am only using pandas here to make this display nicely on our screens
    pd.Series(index['things'])
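
As a rough illustration of one possible merged layout, the sketch below nests the per-book offsets under each word, giving `word -> {book: [offsets]}`. This layout is an assumption inferred from the `for book in index[word]` loop in Exercise 4; `merge_book_indexes` and the toy data are hypothetical.

```python
def merge_book_indexes(per_book_indexes):
    """Combine {book: {word: [offsets]}} into {word: {book: [offsets]}}."""
    merged = {}
    for book, book_index in per_book_indexes.items():
        for word, offsets in book_index.items():
            # Create the inner dictionary for this word on first sight.
            merged.setdefault(word, {})[book] = offsets
    return merged

toy_per_book = {
    "book-a.txt": {"things": [0, 30], "good": [0]},
    "book-b.txt": {"things": [12]},
}
toy_merged = merge_book_indexes(toy_per_book)
print(toy_merged["things"])  # {'book-a.txt': [0, 30], 'book-b.txt': [12]}
```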

**Exercise 4:**

Write a function that returns all of the lines from all of the books that contain a word. Duplicate lines are correct if the line has more than one occurrence of the word. The format is shown below.

In [13]:
def get_lines(index,word):
    lines = []
    for book in index[word]:
        # YOUR SOLUTION HERE
        pass
    return lines

In [14]:
if isnotebook():
    lines = get_lines(index,'things')
    lines
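
The lookup step can be sketched on a single made-up file. The helper `lines_for_word`, the hand-built index, and the `word -> {book: [offsets]}` layout are all assumptions for illustration; the file is read in binary mode so the stored byte offsets can be used directly.

```python
import os
import tempfile

def lines_for_word(index, word):
    """Return every indexed line for `word`, one entry per recorded offset."""
    lines = []
    for book, offsets in index.get(word, {}).items():
        with open(book, "rb") as f:
            for pos in offsets:
                f.seek(pos)  # jump straight to the start of the line
                lines.append(f.readline().decode("utf-8"))
    return lines

with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"good things come\nnothing here\nthings and things\n")
    lookup_demo_path = tmp.name

# Offset 30 appears twice because "things" occurs twice on that line.
toy_lookup_index = {"things": {lookup_demo_path: [0, 30, 30]}}
toy_found = lines_for_word(toy_lookup_index, "things")
os.remove(lookup_demo_path)
print(toy_found)
```

The repeated offset yields a repeated line, which matches the note above that duplicates are expected.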

**Exercise 5:**

Write a Python script that I can execute using GNU Parallel in the manner shown below. I have hard-coded an example script that runs but returns an incorrect answer. Your job is to remove the hard-coded answer and insert a solution that produces the correct one. I have supplied the directory structure and the parallel commands. You do need to write the code that merges the groups back together.

**Here are the three groups.** Each directory has about 25 books. We could distribute these to different machines in a cluster, but you get the idea without that step.

In [15]:
!ls -d $HOME/csc-369-student/data/gutenberg/group*

/home/jupyter-pander14/csc-369-student/data/gutenberg/group1
/home/jupyter-pander14/csc-369-student/data/gutenberg/group2
/home/jupyter-pander14/csc-369-student/data/gutenberg/group3


In [16]:
!ls $HOME/csc-369-student/data/gutenberg/group1

1080-0.txt  1400-0.txt	219-0.txt    43-0.txt	  64244-0.txt
11-0.txt    160-0.txt	25344-0.txt  46-0.txt	  74-0.txt
1250-0.txt  1661-0.txt	2542-0.txt   50040-0.txt  76-0.txt
1260-0.txt  1952-0.txt	25929-0.txt  6133-0.txt   84-0.txt
1342-0.txt  205-0.txt	2701-0.txt   64241-0.txt  98-0.txt


In [17]:
!ls $HOME/csc-369-student/data/gutenberg/group2

1184-0.txt  147-0.txt	2600-0.txt  4300-0.txt	 64239-0.txt
120-0.txt   158-0.txt	2852-0.txt  45-0.txt	 64242-0.txt
1232-0.txt  16-0.txt	3600-0.txt  57426-0.txt  64247-0.txt
135-0.txt   2554-0.txt	36-0.txt    58585-0.txt  768-0.txt
140-0.txt   2591-0.txt	408-0.txt   60479-0.txt  996-0.txt


In [18]:
!ls $HOME/csc-369-student/data/gutenberg/group3

113-0.txt   203-0.txt  28054-0.txt  41-0.txt	 53854-0.txt  730-0.txt
1399-0.txt  209-0.txt  2814-0.txt   42108-0.txt  6130-0.txt   766-0.txt
1727-0.txt  215-0.txt  30254-0.txt  4517-0.txt	 64238-0.txt  863-0.txt
1998-0.txt  244-0.txt  35-0.txt     521-0.txt	 64246-0.txt  902-0.txt


**Running a single directory:** You can run a single directory with the following command and store the results to a file.

In [19]:
!python Lab1_exercise5.py $HOME/csc-369-student/data/gutenberg/group1 > group1.json

We can easily read these back into Python by relying on the JSON format. While JSON is stricter than Python dictionaries, the two are very similar for our purposes (<a href="https://www.json.org/json-en.html">https://www.json.org/json-en.html</a>).

In [20]:
import json
if isnotebook():
    group1_results = json.load(open("group1.json"))
    group1_results['things']
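
A short round trip shows both the similarity and one place where JSON is stricter. The toy dictionary below is made up and assumes a `word -> {book: [offsets]}` layout.

```python
import json

toy_group = {"things": {"book-a.txt": [0, 30, 30]}}

# String keys, lists, and integers survive a round trip unchanged.
round_tripped = json.loads(json.dumps(toy_group))
print(round_tripped == toy_group)  # True

# JSON object keys must be strings, so non-string keys are converted.
int_keyed = json.loads(json.dumps({1: "x"}))
print(int_keyed)  # {'1': 'x'}
```

Keeping words and file names as string keys avoids that conversion entirely.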

**You can process all three directories in parallel using:**


In [22]:
!parallel "python Lab1_exercise5.py {} > {/}.json" ::: "$HOME/csc-369-student/data/gutenberg/group1" "$HOME/csc-369-student/data/gutenberg/group2" "$HOME/csc-369-student/data/gutenberg/group3"
            




In [23]:
import os
def merge():
    index = {}
    r = os.system('parallel "python Lab1_exercise5.py {} > {/}.json" ::: "$HOME/csc-369-student/data/gutenberg/group1" "$HOME/csc-369-student/data/gutenberg/group2" "$HOME/csc-369-student/data/gutenberg/group3"')
    if r == 0:
        for file in ["group1.json","group2.json","group3.json"]:
            # YOUR SOLUTION HERE
            pass
        os.system("rm group1.json group2.json group3.json")
    return index

In [24]:
if isnotebook():
    index = merge()
    # You've done it!

In [25]:
if isnotebook():
    index['things']
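
For the merge step, here is a hedged sketch that combines several JSON result files into one dictionary. It assumes each file holds `word -> {book: [offsets]}` and that no book appears in more than one group; `merge_group_files` and the toy files are hypothetical.

```python
import json
import os
import tempfile

def merge_group_files(json_paths):
    """Merge per-group JSON indexes into a single dictionary."""
    merged = {}
    for json_path in json_paths:
        with open(json_path) as f:
            group_index = json.load(f)
        for word, books in group_index.items():
            # Groups hold different books, so update() never overwrites an entry.
            merged.setdefault(word, {}).update(books)
    return merged

toy_groups = [
    {"things": {"book-a.txt": [0]}, "cat": {"book-a.txt": [5]}},
    {"things": {"book-b.txt": [12, 40]}},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_paths = []
    for i, group in enumerate(toy_groups, start=1):
        toy_path = os.path.join(tmp_dir, f"toy_group{i}.json")
        with open(toy_path, "w") as f:
            json.dump(group, f)
        toy_paths.append(toy_path)
    toy_combined = merge_group_files(toy_paths)

print(toy_combined["things"])  # {'book-a.txt': [0], 'book-b.txt': [12, 40]}
```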

This result should match your single-process solution above, but now you are a rockstar distributed computing wizard who could process thousands of books on a cluster with nothing more than simple Python and GNU Parallel.

In [26]:
# Don't forget to push!



In [27]:
!rm *.json

rm: cannot remove '*.json': No such file or directory
