# Self study 2


In this self-study we build an index that supports Boolean search over the web pages crawled with the crawler from the first self-study. You can continue to extract only the titles of the crawled pages, or you can be more adventurous and index the whole text returned by the .get_text() method of a BeautifulSoup object. In either case, the collection of texts from the crawled web pages is your corpus. You should then:

- construct the vocabulary of terms for your corpus
- build an 'inverted' index for your vocabulary
- implement Boolean search for your index (perhaps only for a limited set of Boolean queries)
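
As a rough illustration of the three steps above, here is a minimal sketch on a hypothetical toy corpus. The names toy_corpus, build_index and toy_index, as well as the naive whitespace tokenization, are assumptions made for illustration only; the notebook cells use nltk for tokenization and stemming instead.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical toy corpus: document id -> text (not crawled data).
toy_corpus: Dict[str, str] = {
    "doc1": "aalborg university research",
    "doc2": "university contact",
    "doc3": "research contact aalborg",
}


def build_index(corpus: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every term to the set of documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return dict(index)


# Step 2: the inverted index. Step 1: its keys form the vocabulary.
toy_index = build_index(toy_corpus)
vocabulary_terms = sorted(toy_index)

# Step 3: Boolean operators become set operations on the posting sets.
print(sorted(toy_index["aalborg"] & toy_index["research"]))    # AND
print(sorted(toy_index["university"] | toy_index["contact"]))  # OR
print(sorted(toy_index["contact"] - toy_index["aalborg"]))     # AND NOT
```

Storing postings as sets keeps the Boolean operators to one line each, at the cost of losing the term positions that a positional index would keep.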

In [2]:
# Some things already used in self study 1:
import requests
from bs4 import BeautifulSoup
from crawler import crawl

In [21]:
import logging

logging.basicConfig(
    level=logging.ERROR, 
    format='%(asctime)s (%(name)s) %(levelname)s: %(message)s', 
    datefmt='%Y-%m-%d %H:%M:%S'
)

In [5]:
visited, _, _ = crawl(["https://notes.bagerbach.com"], timeout=2, host_blacklist=[])
# Show the first five (url, title) pairs.
list(visited.items())[:5]

[('https://www.aau.dk', 'AAU - Viden for verden - Aalborg Universitet'),
 ('https://www.instagram.com/aaustudieliv/',
  'Aalborg Universitet (@aaustudieliv) • Instagram photos and videos'),
 ('https://www.search.aau.dk?site=www.aau.dk&locale=da&mobile=false',
  'AAU søg - Aalborg Universitet'),
 ('https://www.ansatte.aau.dk', 'for ansatte'),
 ('https://www.aau.dk/om-aau/kontakt/whistleblowerordning',
  'whistleblowerordning - Aalborg Universitet')]

A useful resource is the nltk natural language processing package:
https://www.nltk.org/
which provides methods for tokenization, stemming, and much more (the 'punkt' package is needed for tokenization):

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/christian/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now let's build a positional index over the titles of all crawled pages, stemming each token with the Porter stemmer:

In [51]:
from typing import Dict, List, Tuple
from nltk.stem.porter import PorterStemmer

# Maps each stemmed term to its postings: (position in title, url).
vocabulary: Dict[str, List[Tuple[int, str]]] = {}
stemmer = PorterStemmer()

for url, title in visited.items():
    for pos, token in enumerate(nltk.word_tokenize(title)):
        stemmed = stemmer.stem(token)
        if stemmed not in vocabulary:
            vocabulary[stemmed] = []
        vocabulary[stemmed].append((pos, url))
        
# vocabulary maps each stemmed token to a list of (position, url) pairs.
vocabulary

{'aau': [(0, 'https://www.aau.dk'),
  (0, 'https://www.search.aau.dk?site=www.aau.dk&locale=da&mobile=false'),
  (3, 'https://www.staff.aau.dk/'),
  (1, 'https://www.studerende.aau.dk/kontakt'),
  (2, 'https://www.aau.dk/om-aau/profil/baeredygtighed'),
  (2, 'https://www.studerende.aau.dk/studieliv/inklusion-pa-aau'),
  (0, 'https://www.design.aau.dk/'),
  (4, 'https://www.en.aau.dk/contact'),
  (4, 'https://www.aau.dk/om-aau/kontakt'),
  (0, 'https://www.aau.dk/'),
  (0, 'https://www.en.aau.dk/cooperation/aau-connect'),
  (4, 'https://www.stillinger.aau.dk/videnskabelige-stillinger'),
  (2, 'https://www.aau.dk/uddannelser/moed-aau/kandidatdag')],
 '-': [(1, 'https://www.aau.dk'),
  (5, 'https://www.aau.dk'),
  (2, 'https://www.search.aau.dk?site=www.aau.dk&locale=da&mobile=false'),
  (1, 'https://www.aau.dk/om-aau/kontakt/whistleblowerordning'),
  (2, 'https://www.en.aau.dk/'),
  (7, 'https://www.en.aau.dk/'),
  (4, 'https://www.aau.dk/nyheder'),
  (4, 'https://www.en.aau.dk/research'
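
The output above shows that punctuation such as '-' ends up as a term in the index. One possible remedy, sketched below under the assumption that tokens without any letter or digit are never useful query terms, is to filter tokens before indexing. The helper is_indexable is hypothetical, and sample_tokens is a hand-written list that mimics the token positions seen above for the AAU homepage title, not actual tokenizer output.

```python
from typing import List, Tuple


def is_indexable(token: str) -> bool:
    """Keep a token only if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


# Hand-written token list (assumption), resembling a tokenized page title.
sample_tokens: List[str] = [
    "AAU", "-", "Viden", "for", "verden", "-", "Aalborg", "Universitet",
]

# Enumerate before filtering so that positions still refer to the
# original token sequence.
kept: List[Tuple[int, str]] = [
    (pos, tok) for pos, tok in enumerate(sample_tokens) if is_indexable(tok)
]
print(kept)
```

Whether to drop such tokens is a design choice: keeping the original positions preserves the distances between the remaining terms, which matters if the index is later used for phrase or proximity queries.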

In [64]:
def intersection(a: List[str], b: List[str]) -> List[str]:
    """
    Returns the intersection of two lists.
    """
    return list(set(a) & set(b))

def union(a: List[str], b: List[str]) -> List[str]:
    """
    Returns the union of two lists.
    """
    return list(set(a) | set(b))


def boolean_search(query: str) -> List[str]:
    """
    Boolean search with AND, OR and NOT.
    """
    cmds = ['AND', 'NOT', 'OR']
    
    # Tokenize the query; stem the terms but leave the operators unchanged.
    tokens = [stemmer.stem(x) if x not in cmds else x for x in nltk.word_tokenize(query)]
    if len(tokens) == 0:
        raise ValueError("No tokens in query")
    if len(tokens) == 1:
        # Deduplicate: a term can occur several times in the same title.
        return list(dict.fromkeys(url for _, url in vocabulary.get(tokens[0], [])))
    if len(tokens) != 3:
        raise ValueError("Invalid query")
        
    
    # The first token is the first operand.
    operand = tokens[0]
    
    # The second token is the operator.
    operator = tokens[1]
    
    # The third token is the second operand.
    operand2 = tokens[2]

    # Look up the postings of both operands and keep only the urls.
    v1 = [url for _, url in vocabulary.get(operand, [])]
    v2 = [url for _, url in vocabulary.get(operand2, [])]
    
    # If the operator is AND, we intersect the two operands.
    if operator == "AND":
        return intersection(v1, v2)
    
    # If the operator is OR, we union the two operands.
    elif operator == "OR":
        return union(v1, v2)
    
    # If the operator is NOT, we remove the second operand from the first.
    elif operator == "NOT":
        return list(set(v1) - set(v2))
    
    # If the operator is not AND, OR or NOT, we return an empty list.
    else:
        return []
    
print(boolean_search("søg"))
print(boolean_search("AAU OR studienævn"))
    

['https://www.search.aau.dk?site=www.aau.dk&locale=da&mobile=false']
['https://www.aau.dk', 'https://www.design.aau.dk/', 'https://www.en.aau.dk/contact', 'https://www.stillinger.aau.dk/videnskabelige-stillinger', 'https://www.search.aau.dk?site=www.aau.dk&locale=da&mobile=false', 'https://www.studerende.aau.dk/kontakt', 'https://www.aau.dk/om-aau/organisation/studienaevn/', 'https://www.en.aau.dk/cooperation/aau-connect', 'https://www.aau.dk/', 'https://www.staff.aau.dk/', 'https://www.aau.dk/uddannelser/moed-aau/kandidatdag', 'https://www.studerende.aau.dk/studieliv/inklusion-pa-aau', 'https://www.aau.dk/om-aau/kontakt', 'https://www.aau.dk/om-aau/profil/baeredygtighed']
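
boolean_search accepts at most one operator per query. A possible extension, sketched here under the assumption that chained queries are evaluated strictly left to right with no operator precedence and no parentheses, folds the operators over sets of urls. The function evaluate_chain and the toy postings dictionary are hypothetical; the terms are assumed to be already stemmed, so the sketch runs without nltk.

```python
from typing import Callable, Dict, List, Set

# Hypothetical postings: term -> set of urls (terms assumed already stemmed).
postings: Dict[str, Set[str]] = {
    "aau": {"u1", "u2", "u3"},
    "kontakt": {"u2", "u3"},
    "staff": {"u3"},
}

OPERATORS: Dict[str, Callable[[Set[str], Set[str]], Set[str]]] = {
    "AND": set.intersection,
    "OR": set.union,
    "NOT": set.difference,  # binary "a NOT b", as in boolean_search
}


def evaluate_chain(tokens: List[str]) -> Set[str]:
    """Evaluate 'term (OPERATOR term)*' strictly left to right."""
    if len(tokens) % 2 == 0:
        raise ValueError("Expected: term (OPERATOR term)*")
    result = set(postings.get(tokens[0], set()))
    # Pair each operator with the term that follows it.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        if op not in OPERATORS:
            raise ValueError(f"Unknown operator: {op}")
        result = OPERATORS[op](result, postings.get(term, set()))
    return result


print(sorted(evaluate_chain("aau AND kontakt NOT staff".split())))
```

Left-to-right evaluation is the simplest choice; supporting precedence or parentheses would require a proper query parser.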
