<a href="https://colab.research.google.com/github/UniVR-DH/ADHLab/blob/main/lecture03_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Producing N-Grams and Building Inverted Indexes


<img src="https://drive.google.com/uc?export=view&id=1m_EMdnI5C826kgqK7r5vB4TXnB0-Wq7W" alt="Header with institutional logos" width="525"/>

| Instructor   | Course       | Academic Year      |
| :---        |    :----   |          ---: |
| Matteo Lissandrini      | Laboratorio Avanzato di Informatica Umanistica       | 2024/2025   |

## Usual install and basic imports

In [1]:
import gzip
import math
import string
import requests
import numpy as np
import regex as re
import matplotlib.pyplot as plt
from collections import Counter

### Goal: build an inverted index over a set of bigrams



In [2]:
# request the raw text of Alice in Wonderland
r = requests.get(r'https://ia801604.us.archive.org/6/items/alicesadventures19033gut/19033.txt')
alice_text = r.text
print(len(alice_text))

74726


In [4]:
# split into pages on runs of blank lines
alice_pages = alice_text.split("\n\r\n\r\n\r")
# collapse repeated spaces into a single space
space_regex = re.compile(' +') # regex matching one or more consecutive spaces
alice_pages = [space_regex.sub(' ', page).strip() for page in alice_pages ]
# remove empty pages
alice_pages = [ page for page in alice_pages if page != "" ]
# see the result
print(len(alice_pages))

19


In [5]:
# print one page
print(alice_pages[7])

I--DOWN THE RABBIT-HOLE


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do. Once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, "and what is the use of a book," thought Alice, "without pictures or
conversations?"

So she was considering in her own mind (as well as she could, for the
day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so very remarkable in that, nor did Alice think it so
very much out of the way to hear the Rabbit say to itself, "Oh dear! Oh
dear! I shall be too late!" But when the Rabbit actually took a watch
out of its waistcoat-pocket and looked at it and then hurried on, Alice
started to her feet, for it flashed across her mind that she had never
befor

In [6]:
# remove the line breaks so that the text is all on one line
print(" ".join(alice_pages[7].splitlines()))

I--DOWN THE RABBIT-HOLE   Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do. Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"  So she was considering in her own mind (as well as she could, for the day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.  There was nothing so very remarkable in that, nor did Alice think it so very much out of the way to hear the Rabbit say to itself, "Oh dear! Oh dear! I shall be too late!" But when the Rabbit actually took a watch out of its waistcoat-pocket and looked at it and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with

In [7]:
inverted_index = {}

for page_index, page in enumerate(alice_pages):
    # Remove new lines and make the text all on one line
    page_text = " ".join(page.splitlines())

    # Tokenize the page on white space
    words = page_text.lower().split()

    for word in words:
        word = word.strip(string.punctuation) # remove leading and trailing punctuation
        if word: # ignore empty strings after punctuation removal
            if word not in inverted_index:
                inverted_index[word] = set()
            inverted_index[word].add(page_index)

# Example usage
print(inverted_index.get("alice", set())) # show the indexes of the pages containing the word "alice"

{0, 2, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}
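Because each entry of the index is a set of page indexes, a query for pages containing several words reduces to a set intersection. The sketch below illustrates this on a small toy corpus; `sample_pages`, `build_word_index`, and `pages_with_all` are hypothetical names introduced only for this example and are not part of the notebook.

```python
import string

# Toy stand-in for alice_pages (hypothetical sample data, not the real book)
sample_pages = [
    "Alice was beginning to get very tired.",
    "The White Rabbit ran close by her.",
    "Alice saw the Rabbit take a watch out of its pocket.",
]

def build_word_index(pages):
    """Map each lowercased, punctuation-stripped word to the set of page indexes containing it."""
    index = {}
    for page_index, page in enumerate(pages):
        for word in " ".join(page.splitlines()).lower().split():
            word = word.strip(string.punctuation)
            if word:
                index.setdefault(word, set()).add(page_index)
    return index

def pages_with_all(index, *words):
    """Return the pages that contain every query word (set intersection)."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

sample_index = build_word_index(sample_pages)
print(pages_with_all(sample_index, "alice", "rabbit"))  # {2}
```

A word that is missing from the index contributes an empty set, so the intersection is empty as soon as one query word never occurs.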


In [None]:
#TODO: extract bi-grams instead of single words


#TODO: insert bigrams in the inverted index instead of a single word
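
One possible way to approach the two TODOs above, shown on toy data rather than on the real book: pair each token with its successor using `zip`, then use the resulting tuple as the key of the index. The helper names `tokenize`, `bigrams`, and `build_bigram_index`, as well as `sample_pages`, are assumptions made for this sketch.

```python
import string

def tokenize(text):
    """Lowercase, split on white space, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in " ".join(text.splitlines()).lower().split())
    return [w for w in words if w]

def bigrams(tokens):
    """Return the list of consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))

def build_bigram_index(pages):
    """Map each bigram (a tuple of two words) to the set of page indexes containing it."""
    index = {}
    for page_index, page in enumerate(pages):
        for bigram in bigrams(tokenize(page)):
            index.setdefault(bigram, set()).add(page_index)
    return index

# Hypothetical sample data standing in for alice_pages
sample_pages = [
    "The White Rabbit ran close by her.",
    "Alice saw the white rabbit again.",
]
bigram_index = build_bigram_index(sample_pages)
print(bigram_index.get(("white", "rabbit"), set()))  # {0, 1}
```

Note that punctuation is stripped before pairing, so in this sketch a bigram can span a sentence boundary; splitting the text into sentences first would avoid that.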


## Repeat for a different book. What if you want a single index covering the pages of both books?

In [None]:
r = requests.get(r'https://ia600906.us.archive.org/29/items/aesopsfablesanew11339gut/11339.txt')
fables = r.text

fables_pages = fables.split("\n\r\n\r\n\r")
fables_pages = [ space_regex.sub(' ', page).strip() for page in fables_pages ]
fables_pages = [ page for page in fables_pages if page != "" ]
print(len(fables_pages))
print(" ".join(fables_pages[3].splitlines()))

294
INTRODUCTION   _AEsop embodies an epigram not uncommon in human history; his fame is all the more deserved because he never deserved it. The firm foundations of common sense, the shrewd shots at uncommon sense, that characterise all the Fables, belong not him but to humanity. In the earliest human history whatever is authentic is universal: and whatever is universal is anonymous. In such cases there is always some central man who had first the trouble of collecting them, and afterwards the fame of creating them. He had the fame; and, on the whole, he earned the fame. There must have been something great and human, something of the human future and the human past, in such a man: even if he only used it to rob the past or deceive the future. The story of Arthur may have been really connected with the most fighting Christianity of falling Rome or with the most heathen traditions hidden in the hills of Wales. But the word "Mappe" or "Malory" will always mean King Arthur; even though we
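
For a single index over both books, page indexes alone become ambiguous, since page 3 exists in each book. A minimal sketch of one option is to store `(book, page)` pairs in the index instead of bare page numbers. The function `build_combined_index`, the book labels, and the sample pages below are hypothetical and stand in for `alice_pages` and `fables_pages`.

```python
import string

def build_combined_index(books):
    """books maps a book label to its list of pages; entries are (label, page_index) pairs."""
    index = {}
    for book_name, pages in books.items():
        for page_index, page in enumerate(pages):
            for word in " ".join(page.splitlines()).lower().split():
                word = word.strip(string.punctuation)
                if word:
                    index.setdefault(word, set()).add((book_name, page_index))
    return index

# Hypothetical sample data; in the notebook this could be
# {"alice": alice_pages, "fables": fables_pages}
sample_books = {
    "alice": ["Alice saw a White Rabbit.", "The Rabbit was late."],
    "fables": ["The Hare and the Tortoise.", "A rabbit is not a hare."],
}
combined = build_combined_index(sample_books)
print(sorted(combined["rabbit"]))  # [('alice', 0), ('alice', 1), ('fables', 1)]
```

An alternative would be to concatenate the two page lists and remember the offset at which the second book starts; the tuple form keeps the book explicit in every result.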