# Zot Search - Tutorial
A tiny search engine for Wikipedia - A quick walkthrough of Zot Search

### 1. Bookkeeping: 
First, let's make sure we have all the external libraries that we need:

Note: On some machines you may have to change `python3` to `python`.

In [8]:
!python3 -m pip install -r requirements.txt

Collecting requests (from -r requirements.txt (line 4))
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting charset_normalizer<4,>=2 (from requests->-r requirements.txt (line 4))
  Using cached charset_normalizer-3.4.4-cp313-cp313-macosx_10_13_universal2.whl.metadata (37 kB)
Collecting idna<4,>=2.5 (from requests->-r requirements.txt (line 4))
  Using cached idna-3.11-py3-none-any.whl.metadata (8.4 kB)
Collecting urllib3<3,>=1.21.1 (from requests->-r requirements.txt (line 4))
  Downloading urllib3-2.6.3-py3-none-any.whl.metadata (6.9 kB)
Collecting certifi>=2017.4.17 (from requests->-r requirements.txt (line 4))
  Downloading certifi-2026.1.4-py3-none-any.whl.metadata (2.5 kB)
Using cached requests-2.32.5-py3-none-any.whl (64 kB)
Using cached charset_normalizer-3.4.4-cp313-cp313-macosx_10_13_universal2.whl (208 kB)
Using cached idna-3.11-py3-none-any.whl (71 kB)
Downloading urllib3-2.6.3-py3-none-any.whl (131 kB)
Downloading certifi-2026.1.4-py3-none-any.whl (

### 2. Crawling: 
**Note: This will take a while. If you are short on time, use the provided sample URLs below.**

Start by picking a `start_url`. For this tutorial we will use the Wikipedia page for [Anteater](https://en.wikipedia.org/wiki/Anteater).

Then we run the crawler, which starts at the `start_url` and searches for linked Wikipedia pages. This process repeats recursively until we reach the maximum depth, which is set by the `height` argument.
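
The recursion described above can be sketched without touching the network. This is not the `zot_search.crawler` implementation: `crawl_sketch`, `TOY_PAGES`, and the `visited` set are hypothetical names introduced for illustration, and the toy link graph stands in for real Wikipedia pages.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy link graph standing in for Wikipedia (an assumption for this sketch):
# url -> (page title, urls linked from that page).
TOY_PAGES: Dict[str, Tuple[str, List[str]]] = {
    "wiki/Anteater": ("Anteater", ["wiki/Ant", "wiki/Sloth"]),
    "wiki/Ant": ("Ant", ["wiki/Termite"]),
    "wiki/Sloth": ("Sloth", ["wiki/Anteater"]),
    "wiki/Termite": ("Termite", []),
}


def crawl_sketch(
    url: str,
    height: int,
    pages: Dict[str, Tuple[str, List[str]]] = TOY_PAGES,
    visited: Optional[Set[str]] = None,
) -> Dict[str, str]:
    """Return {title: url} for pages reachable within `height` link hops."""
    if visited is None:
        visited = set()
    found: Dict[str, str] = {}
    # Skip pages we have already seen or cannot resolve.
    if url in visited or url not in pages:
        return found
    visited.add(url)

    title, links = pages[url]
    found[title] = url

    # Only follow links while there is depth budget left.
    if height > 0:
        for link in links:
            found.update(crawl_sketch(link, height - 1, pages, visited))
    return found


titles_within_one_hop = crawl_sketch("wiki/Anteater", height=1)
```

With `height=0` only the start page is recorded, and each extra level follows links one hop further. The shared `visited` set is what keeps the cycle between the toy "Sloth" and "Anteater" pages from recursing forever.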

In [9]:
from zot_search.crawler import crawl

start_url = "https://en.wikipedia.org/wiki/Anteater"

In [None]:
#Use this block to run the crawler (comment it out if you are using the example data)
wikipedia_url_dict = crawl(start_url, height=1)

Crawling: https://en.wikipedia.org/wiki/Anteater Height: 1)
Crawling: https://en.wikipedia.org/wiki/Anteater Height: 0)
Crawling: https://en.wikipedia.org/wiki/Anteater Height: 0)
Crawling: https://en.wikipedia.org/wiki/Anteater Height: 0)
Crawling: https://en.wikipedia.org/wiki/Anteater_(disambiguation) Height: 0)
Crawling: https://en.wikipedia.org/wiki/Miocene Height: 0)
Crawling: https://en.wikipedia.org/wiki/Megaannum Height: 0)
Crawling: https://en.wikipedia.org/wiki/Precambrian Height: 0)
Crawling: https://en.wikipedia.org/wiki/Cambrian Height: 0)
Crawling: https://en.wikipedia.org/wiki/Ordovician Height: 0)
Crawling: https://en.wikipedia.org/wiki/Silurian Height: 0)
Crawling: https://en.wikipedia.org/wiki/Devonian Height: 0)
Crawling: https://en.wikipedia.org/wiki/Carboniferous Height: 0)
Crawling: https://en.wikipedia.org/wiki/Permian Height: 0)
Crawling: https://en.wikipedia.org/wiki/Triassic Height: 0)
Crawling: https://en.wikipedia.org/wiki/Jurassic Height: 0)
Crawling: http

In [None]:
#Use this block when using the example data (comment it out if you are running the crawler from scratch)
#wikipedia_url_dict = {'Anteater': 'https://en.wikipedia.org/wiki/Anteater', 'Anteater (disambiguation)': 'https://en.wikipedia.org/wiki/Anteater_(disambiguation)','Miocene': 'https://en.wikipedia.org/wiki/Miocene', 'Year': 'https://en.wikipedia.org/wiki/Megaannum', 'Precambrian': 'https://en.wikipedia.org/wiki/Precambrian', 'Cambrian': 'https://en.wikipedia.org/wiki/Cambrian', 'Ordovician': 'https://en.wikipedia.org/wiki/Ordovician', 'Silurian': 'https://en.wikipedia.org/wiki/Silurian', 'Devonian': 'https://en.wikipedia.org/wiki/Devonian', 'Carboniferous': 'https://en.wikipedia.org/wiki/Carboniferous', 'Permian': 'https://en.wikipedia.org/wiki/Permian', 'Triassic': 'https://en.wikipedia.org/wiki/Triassic', 'Jurassic': 'https://en.wikipedia.org/wiki/Jurassic', 'Cretaceous': 'https://en.wikipedia.org/wiki/Cretaceous', 'Paleogene': 'https://en.wikipedia.org/wiki/Paleogene', 'Neogene': 'https://en.wikipedia.org/wiki/Neogene', 'Giant anteater': 'https://en.wikipedia.org/wiki/Giant_anteater', 'Taxonomy (biology)': 'https://en.wikipedia.org/wiki/Taxonomy_(biology)', 'Animal': 'https://en.wikipedia.org/wiki/Animal', 'Chordate': 'https://en.wikipedia.org/wiki/Chordate', 'Mammal': 'https://en.wikipedia.org/wiki/Mammal', 'Pilosa': 'https://en.wikipedia.org/wiki/Pilosa', 'Johann Karl Wilhelm Illiger': 'https://en.wikipedia.org/wiki/Johann_Karl_Wilhelm_Illiger', 'Cyclopedidae': 'https://en.wikipedia.org/wiki/Cyclopedidae', 'Myrmecophagidae': 'https://en.wikipedia.org/wiki/Myrmecophagidae', 'Ant': 'https://en.wikipedia.org/wiki/Ant', 'Termite': 'https://en.wikipedia.org/wiki/Termite', 'Sloth': 'https://en.wikipedia.org/wiki/Sloth', 'Aardvark': 'https://en.wikipedia.org/wiki/Aardvark', 'Numbat': 'https://en.wikipedia.org/wiki/Numbat', 'Echidna': 'https://en.wikipedia.org/wiki/Echidna', 'Pangolin': 'https://en.wikipedia.org/wiki/Pangolin', 'Silky anteater': 'https://en.wikipedia.org/wiki/Silky_anteater', 'Southern tamandua': 'https://en.wikipedia.org/wiki/Southern_tamandua', 
'Northern tamandua': 'https://en.wikipedia.org/wiki/Northern_tamandua', 'Species': 'https://en.wikipedia.org/wiki/Species', 'Common name': 'https://en.wikipedia.org/wiki/Common_name', 'Tamandua': 'https://en.wikipedia.org/wiki/Tamandua', 'Tupi language': 'https://en.wikipedia.org/wiki/Tupi_language', 'Genus': 'https://en.wikipedia.org/wiki/Genus', 'List of pilosans': 'https://en.wikipedia.org/wiki/List_of_pilosans', 'Xenarthra': 'https://en.wikipedia.org/wiki/Xenarthra', 'Great American Interchange': 'https://en.wikipedia.org/wiki/Great_American_Interchange', 'Convergent evolution': 'https://en.wikipedia.org/wiki/Convergent_evolution', 'Armadillo': 'https://en.wikipedia.org/wiki/Armadillo', 'Three-toed sloth': 'https://en.wikipedia.org/wiki/Bradypodidae', 'Two-toed sloth': 'https://en.wikipedia.org/wiki/Choloepodidae', 'Palaeomyrmidon': 'https://en.wikipedia.org/wiki/Palaeomyrmidon', 'Neotamandua': 'https://en.wikipedia.org/wiki/Neotamandua', 'Protamandua': 'https://en.wikipedia.org/wiki/Protamandua', 'Snout': 'https://en.wikipedia.org/wiki/Snouts', 'Submandibular gland': 'https://en.wikipedia.org/wiki/Submaxillary_glands', 'Prehensility': 'https://en.wikipedia.org/wiki/Prehensile', 'Albinism': 'https://en.wikipedia.org/wiki/Albinism', 'Leucism': 'https://en.wikipedia.org/wiki/Leucism', 'Melanism': 'https://en.wikipedia.org/wiki/Melanism', 'Amazon basin': 'https://en.wikipedia.org/wiki/Amazon_basin', 'Isthmus of Panama': 'https://en.wikipedia.org/wiki/Isthmus_of_Panama', 'Million years ago': 'https://en.wikipedia.org/wiki/Million_years_ago', 'Pleistocene': 'https://en.wikipedia.org/wiki/Pleistocene', 'Sonora': 'https://en.wikipedia.org/wiki/Sonora,_Mexico', 'Deglaciation': 'https://en.wikipedia.org/wiki/Deglaciation', 'Trinidad': 'https://en.wikipedia.org/wiki/Trinidad', 'The Guianas': 'https://en.wikipedia.org/wiki/Guianas', 'Disjunct distribution': 'https://en.wikipedia.org/wiki/Disjunct_population', 'Tropical and subtropical dry broadleaf forests': 
'https://en.wikipedia.org/wiki/Dry_tropical_forest', 'Rainforest': 'https://en.wikipedia.org/wiki/Rainforest', 'Grassland': 'https://en.wikipedia.org/wiki/Grassland', 'Savanna': 'https://en.wikipedia.org/wiki/Savanna', 'Territory (animal)': 'https://en.wikipedia.org/wiki/Territory_(animal)', 'Testicle': 'https://en.wikipedia.org/wiki/Testes', 'Mammary gland': 'https://en.wikipedia.org/wiki/Mammary_gland', 'Polygyny in animals': 'https://en.wikipedia.org/wiki/Polygyny_in_animals', 'Lingual papillae': 'https://en.wikipedia.org/wiki/Filiform_papilla', 'Gizzard': 'https://en.wikipedia.org/wiki/Gizzard', 'Jaguar': 'https://en.wikipedia.org/wiki/Jaguars', 'Ocelot': 'https://en.wikipedia.org/wiki/Ocelots', 'Felidae': 'https://en.wikipedia.org/wiki/Felids', 'Fox': 'https://en.wikipedia.org/wiki/Foxes', 'Caiman': 'https://en.wikipedia.org/wiki/Caimans', 'Harpy eagle': 'https://en.wikipedia.org/wiki/Harpy_eagle', 'Parasitism': 'https://en.wikipedia.org/wiki/Parasites', 'Tick': 'https://en.wikipedia.org/wiki/Ticks', 'Flea': 'https://en.wikipedia.org/wiki/Fleas', 'Parasitic worm': 'https://en.wikipedia.org/wiki/Parasitic_worms', 'Acanthocephala': 'https://en.wikipedia.org/wiki/Acanthocephalans', 'Ixodidae': 'https://en.wikipedia.org/wiki/Ixodidae', 'Amblyomma': 'https://en.wikipedia.org/wiki/Amblyomma', 'Cestoda': 'https://en.wikipedia.org/wiki/Cestoda', 'Spiruridae': 'https://en.wikipedia.org/wiki/Spiruridae', 'Physalopteridae': 'https://en.wikipedia.org/wiki/Physalopteridae', 'Trichostrongylidae': 'https://en.wikipedia.org/wiki/Trichostrongylidae', 'Ascarididae': 'https://en.wikipedia.org/wiki/Ascarididae', 'Anemia': 'https://en.wikipedia.org/wiki/Anemia', 'Gastritis': 'https://en.wikipedia.org/wiki/Gastritis', 'Type (biology)': 'https://en.wikipedia.org/wiki/Type_host', 'Coccidia': 'https://en.wikipedia.org/wiki/Coccidia', 'Protozoa': 'https://en.wikipedia.org/wiki/Protozoans', 'Bacteria': 'https://en.wikipedia.org/wiki/Bacteria', 'Parabasalid': 
'https://en.wikipedia.org/wiki/Parabasalids', 'Virus': 'https://en.wikipedia.org/wiki/Viruses', 'Sertoli cell tumour': 'https://en.wikipedia.org/wiki/Sertoli_cell_tumour', 'Mineralized tissues': 'https://en.wikipedia.org/wiki/Mineralized_tissues#Diseased_mineralized_tissues', 'Vitamin D toxicity': 'https://en.wikipedia.org/wiki/Hypervitaminosis_D', 'Osteomyelitis': 'https://en.wikipedia.org/wiki/Osteomyelitis', 'Dermatitis': 'https://en.wikipedia.org/wiki/Dermatitis', 'Disease vector': 'https://en.wikipedia.org/wiki/Disease_vector', 'Rickettsia': 'https://en.wikipedia.org/wiki/Rickettsia', 'Spotted fever': 'https://en.wikipedia.org/wiki/Spotted_fever', 'SARS-CoV-2': 'https://en.wikipedia.org/wiki/SARS-CoV-2', 'COVID-19': 'https://en.wikipedia.org/wiki/COVID-19', 'Leishmania': 'https://en.wikipedia.org/wiki/Leishmania', 'Leishmaniasis': 'https://en.wikipedia.org/wiki/Leishmaniasis', 'Canine distemper': 'https://en.wikipedia.org/wiki/Canine_distemper', 'Morbillivirus': 'https://en.wikipedia.org/wiki/Morbillivirus', 'Maned wolf': 'https://en.wikipedia.org/wiki/Maned_wolf', 'Cancer': 'https://en.wikipedia.org/wiki/Cancer', 'Apoptosis': 'https://en.wikipedia.org/wiki/Apoptosis', 'Least-concern species': 'https://en.wikipedia.org/wiki/Least_concern', 'International Union for Conservation of Nature': 'https://en.wikipedia.org/wiki/IUCN', 'Vulnerable species': 'https://en.wikipedia.org/wiki/Vulnerable_species', 'Habitat destruction': 'https://en.wikipedia.org/wiki/Habitat_loss', 'Data deficient': 'https://en.wikipedia.org/wiki/Data_deficient', 'Wildlife trade': 'https://en.wikipedia.org/wiki/Wildlife_trade', 'Oxford English Dictionary': 'https://en.wikipedia.org/wiki/Oxford_English_Dictionary', 'Digital object identifier': 'https://en.wikipedia.org/wiki/Doi_(identifier)', 'Semantic Scholar': 'https://en.wikipedia.org/wiki/S2CID_(identifier)', 'Bibcode': 'https://en.wikipedia.org/wiki/Bibcode_(identifier)', 'PubMed': 'https://en.wikipedia.org/wiki/PMID_(identifier)', 
'Molecular Biology and Evolution': 'https://en.wikipedia.org/wiki/Molecular_Biology_and_Evolution', 'PubMed Central': 'https://en.wikipedia.org/wiki/PMC_(identifier)', 'Paleobiology Database': 'https://en.wikipedia.org/wiki/Paleobiology_Database', 'University of Chicago Press': 'https://en.wikipedia.org/wiki/University_of_Chicago_Press', 'ISBN': 'https://en.wikipedia.org/wiki/ISBN_(identifier)', 'ISSN': 'https://en.wikipedia.org/wiki/ISSN_(identifier)', 'Handle System': 'https://en.wikipedia.org/wiki/Hdl_(identifier)', 'ProQuest': 'https://en.wikipedia.org/wiki/ProQuest', 'Bernhard Grzimek': 'https://en.wikipedia.org/wiki/Bernhard_Grzimek', 'Public domain': 'https://en.wikipedia.org/wiki/Public_domain', 'Hugh Chisholm': 'https://en.wikipedia.org/wiki/Hugh_Chisholm', 'Encyclopædia Britannica Eleventh Edition': 'https://en.wikipedia.org/wiki/Encyclop%C3%A6dia_Britannica_Eleventh_Edition', 'JSTOR': 'https://en.wikipedia.org/wiki/JSTOR_(identifier)', 'IUCN Red List': 'https://en.wikipedia.org/wiki/IUCN_Red_List', 'Placentalia': 'https://en.wikipedia.org/wiki/Placentalia', 'Pygmy three-toed sloth': 'https://en.wikipedia.org/wiki/Pygmy_three-toed_sloth', 'Northern maned sloth': 'https://en.wikipedia.org/wiki/Maned_sloth', 'Pale-throated sloth': 'https://en.wikipedia.org/wiki/Pale-throated_sloth', 'Brown-throated sloth': 'https://en.wikipedia.org/wiki/Brown-throated_sloth', "Linnaeus's two-toed sloth": 'https://en.wikipedia.org/wiki/Linnaeus%27s_two-toed_sloth', "Hoffmann's two-toed sloth": 'https://en.wikipedia.org/wiki/Hoffmann%27s_two-toed_sloth', 'Argyromanis': 'https://en.wikipedia.org/wiki/Argyromanis', 'Orthoarthrus': 'https://en.wikipedia.org/wiki/Orthoarthrus', 'Pseudoglyptodon': 'https://en.wikipedia.org/wiki/Pseudoglyptodon', 'Megalocnidae': 'https://en.wikipedia.org/wiki/Megalocnidae', 'Acratocnus': 'https://en.wikipedia.org/wiki/Acratocnus', 'Imagocnus': 'https://en.wikipedia.org/wiki/Imagocnus', 'Megalocnus': 'https://en.wikipedia.org/wiki/Megalocnus', 
'Neocnus': 'https://en.wikipedia.org/wiki/Neocnus', 'Parocnus': 'https://en.wikipedia.org/wiki/Parocnus', 'Scelidotheriidae': 'https://en.wikipedia.org/wiki/Scelidotheriidae', 'Analcitherium': 'https://en.wikipedia.org/wiki/Analcitherium', 'Catonyx': 'https://en.wikipedia.org/wiki/Catonyx', 'Chubutherium': 'https://en.wikipedia.org/wiki/Chubutherium', 'Nematherium': 'https://en.wikipedia.org/wiki/Nematherium', 'Neonematherium': 'https://en.wikipedia.org/wiki/Neonematherium', 'Proscelidodon': 'https://en.wikipedia.org/wiki/Proscelidodon', 'Scelidodon': 'https://en.wikipedia.org/wiki/Scelidodon', 'Scelidotherium': 'https://en.wikipedia.org/wiki/Scelidotherium', 'Valgipes': 'https://en.wikipedia.org/wiki/Valgipes', 'Mylodontidae': 'https://en.wikipedia.org/wiki/Mylodontidae', 'Baraguatherium': 'https://en.wikipedia.org/wiki/Baraguatherium', 'Eionaletherium': 'https://en.wikipedia.org/wiki/Eionaletherium', 'Octodontotherium': 'https://en.wikipedia.org/wiki/Octodontotherium', 'Octomylodon': 'https://en.wikipedia.org/wiki/Octomylodon', 'Orophodon': 'https://en.wikipedia.org/wiki/Orophodon', 'Pseudoprepotherium': 'https://en.wikipedia.org/wiki/Pseudoprepotherium', 'Urumacotherium': 'https://en.wikipedia.org/wiki/Urumacotherium', 'Mylodontinae': 'https://en.wikipedia.org/wiki/Mylodontinae', 'Archaeomylodon': 'https://en.wikipedia.org/wiki/Archaeomylodon', 'Brievabradys': 'https://en.wikipedia.org/wiki/Brievabradys', 'Mylodonopsis': 'https://en.wikipedia.org/wiki/Mylodonopsis', 'Ocnotherium': 'https://en.wikipedia.org/wiki/Ocnotherium', 'Oreomylodon': 'https://en.wikipedia.org/wiki/Oreomylodon', 'Bolivartherium': 'https://en.wikipedia.org/wiki/Bolivartherium', 'Lestobradys': 'https://en.wikipedia.org/wiki/Lestobradys', 'Lestodon': 'https://en.wikipedia.org/wiki/Lestodon', 'Magdalenabradys': 'https://en.wikipedia.org/wiki/Magdalenabradys', 'Thinobadistes': 'https://en.wikipedia.org/wiki/Thinobadistes', 'Glossotherium': 'https://en.wikipedia.org/wiki/Glossotherium', 
'Mylodon': 'https://en.wikipedia.org/wiki/Mylodon', 'Paramylodon': 'https://en.wikipedia.org/wiki/Paramylodon', 'Simomylodon': 'https://en.wikipedia.org/wiki/Simomylodon', 'Hapalops': 'https://en.wikipedia.org/wiki/Hapalops', 'Hiskatherium': 'https://en.wikipedia.org/wiki/Hiskatherium', 'Pelecyodon': 'https://en.wikipedia.org/wiki/Pelecyodon', 'Schismotherium': 'https://en.wikipedia.org/wiki/Schismotherium', 'Megalonychidae': 'https://en.wikipedia.org/wiki/Megalonychidae', 'Mesopotamocnus': 'https://en.wikipedia.org/wiki/Mesopotamocnus', 'Proplatyarthrus': 'https://en.wikipedia.org/wiki/Proplatyarthrus', 'Urumacocnus': 'https://en.wikipedia.org/wiki/Urumacocnus', 'Eucholoeops': 'https://en.wikipedia.org/wiki/Eucholoeops', 'Hapaloides': 'https://en.wikipedia.org/wiki/Hapaloides', 'Ortotherium': 'https://en.wikipedia.org/wiki/Ortotherium', 'Proschismotherium': 'https://en.wikipedia.org/wiki/Proschismotherium', 'Ahytherium': 'https://en.wikipedia.org/wiki/Ahytherium', 'Australonyx': 'https://en.wikipedia.org/wiki/Australonyx', 'Megalonyx': 'https://en.wikipedia.org/wiki/Megalonyx', 'Megistonyx': 'https://en.wikipedia.org/wiki/Megistonyx', 'Meizonyx': 'https://en.wikipedia.org/wiki/Meizonyx', 'Nohochichak': 'https://en.wikipedia.org/wiki/Nohochichak', 'Pattersonocnus': 'https://en.wikipedia.org/wiki/Pattersonocnus', 'Pliometanastes': 'https://en.wikipedia.org/wiki/Pliometanastes', 'Xibalbaonyx': 'https://en.wikipedia.org/wiki/Xibalbaonyx', 'Nothrotheriidae': 'https://en.wikipedia.org/wiki/Nothrotheriidae', 'Thalassocnus': 'https://en.wikipedia.org/wiki/Thalassocnus', 'Aymaratherium': 'https://en.wikipedia.org/wiki/Aymaratherium', 'Chasicobradys': 'https://en.wikipedia.org/wiki/Chasicobradys', 'Huilabradys': 'https://en.wikipedia.org/wiki/Huilabradys', 'Lakukullus': 'https://en.wikipedia.org/wiki/Lakukullus', 'Mcdonaldocnus': 'https://en.wikipedia.org/wiki/Mcdonaldocnus', 'Mionothropus': 'https://en.wikipedia.org/wiki/Mionothropus', 'Nothropus': 
'https://en.wikipedia.org/wiki/Nothropus', 'Nothrotheriops': 'https://en.wikipedia.org/wiki/Nothrotheriops', 'Nothrotherium': 'https://en.wikipedia.org/wiki/Nothrotherium', 'Pronothrotherium': 'https://en.wikipedia.org/wiki/Pronothrotherium', 'Megatheriidae': 'https://en.wikipedia.org/wiki/Megatheriidae', 'Prepotherium': 'https://en.wikipedia.org/wiki/Prepotherium', 'Prepoplanops': 'https://en.wikipedia.org/wiki/Prepoplanops', 'Megatheriinae': 'https://en.wikipedia.org/wiki/Megatheriinae', 'Diabolotherium': 'https://en.wikipedia.org/wiki/Diabolotherium', 'Eremotherium': 'https://en.wikipedia.org/wiki/Eremotherium', 'Megathericulus': 'https://en.wikipedia.org/wiki/Megathericulus', 'Megatherium': 'https://en.wikipedia.org/wiki/Megatherium', 'Proeremotherium': 'https://en.wikipedia.org/wiki/Proeremotherium', 'Promegatherium': 'https://en.wikipedia.org/wiki/Promegatherium', 'Sibotherium': 'https://en.wikipedia.org/wiki/Sibotherium', 'Wikidata': 'https://en.wikipedia.org/wiki/Wikidata', 'Wikispecies': 'https://en.wikipedia.org/wiki/Wikispecies', 'Animal Diversity Web': 'https://en.wikipedia.org/wiki/Animal_Diversity_Web', 'Catalogue of Life': 'https://en.wikipedia.org/wiki/Catalogue_of_Life', 'Encyclopedia of Life': 'https://en.wikipedia.org/wiki/Encyclopedia_of_Life', 'Integrated Taxonomic Information System': 'https://en.wikipedia.org/wiki/Integrated_Taxonomic_Information_System', 'Mammal Species of the World': 'https://en.wikipedia.org/wiki/Mammal_Species_of_the_World', 'National Center for Biotechnology Information': 'https://en.wikipedia.org/wiki/National_Center_for_Biotechnology_Information'}

### 3. Scraping: 
Now we iterate through `wikipedia_url_dict` and retrieve the contents of each page (we will call these our documents). Pages that come back empty are skipped.
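
To give a feel for what "retrieving the contents" can involve, here is a minimal sketch that pulls paragraph text out of an HTML string using only the standard library. It is an assumption about the general approach, not the code behind `scrape_wiki_page`; `ParagraphText` and `extract_paragraph_text` are hypothetical names, and a real scraper would first download the HTML.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # number of <p> elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            self.chunks.append(" ")  # keep adjacent paragraphs separated

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def extract_paragraph_text(html: str) -> str:
    """Return the whitespace-normalized paragraph text of an HTML document."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result is a single clean line.
    return " ".join("".join(parser.chunks).split())


sample_html = (
    "<html><body><h1>Anteater</h1>"
    "<p>Anteaters eat <b>ants</b>.</p><p>They have long snouts.</p>"
    "</body></html>"
)
sample_text = extract_paragraph_text(sample_html)
```

A page with no paragraph text yields an empty string, which is the case the loop in the next cell skips.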

In [None]:
from zot_search.scrapper import scrape_wiki_page

documents = {}

for title, url in wikipedia_url_dict.items():
    content = scrape_wiki_page(url)

    if content == "":
        continue

    documents[title] = content

Title: Anteater


### 4. Indexing: 
Now it is time to build our inverted index! We will store it in a SQLite database using Python's `sqlite3` module.

In [None]:
from zot_search.index import create_inverted_index, search_index, display_top_results

db_path = "zot_search_index.db"

create_inverted_index(db_path, documents, wikipedia_url_dict)

<sqlite3.Connection at 0x1ac61f7f5b0>

Once the index has been constructed, you can take a look at the database structure. There are two tables:
1. Documents:
    * doc_id: a unique integer key
    * doc_title: the title of the Wikipedia page
    * doc_url: the URL of the Wikipedia page
    * content: a normalized version of the text

2. InvertedIndex:
    * term: the token
    * doc_title: the title of the Wikipedia page
    * frequency: the number of times the token appears on a given webpage
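
The two tables above could be created and filled roughly as follows. This is a sketch under stated assumptions, not the body of `create_inverted_index`: the column types, the primary keys, and the normalization (lowercasing and splitting on whitespace) are guesses made for illustration, and `build_index_sketch` is a hypothetical name.

```python
import sqlite3
from collections import Counter
from typing import Dict


def build_index_sketch(
    conn: sqlite3.Connection, documents: Dict[str, str], urls: Dict[str, str]
) -> sqlite3.Connection:
    """Create the Documents and InvertedIndex tables and populate them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Documents ("
        "doc_id INTEGER PRIMARY KEY, doc_title TEXT UNIQUE, "
        "doc_url TEXT, content TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS InvertedIndex ("
        "term TEXT, doc_title TEXT, frequency INTEGER, "
        "PRIMARY KEY (term, doc_title))"
    )
    for title, text in documents.items():
        # Assumed normalization: lowercase and collapse whitespace.
        normalized = " ".join(text.lower().split())
        conn.execute(
            "INSERT INTO Documents (doc_title, doc_url, content) VALUES (?, ?, ?)",
            (title, urls[title], normalized),
        )
        # One InvertedIndex row per distinct token in the document.
        counts = Counter(normalized.split())
        conn.executemany(
            "INSERT INTO InvertedIndex (term, doc_title, frequency) VALUES (?, ?, ?)",
            [(term, title, count) for term, count in counts.items()],
        )
    conn.commit()
    return conn


# Usage with an in-memory database so nothing is written to disk.
demo_conn = build_index_sketch(
    sqlite3.connect(":memory:"),
    {"Ant": "Ants eat. Ants dig."},
    {"Ant": "https://en.wikipedia.org/wiki/Ant"},
)
demo_rows = demo_conn.execute(
    "SELECT doc_title, frequency FROM InvertedIndex WHERE term = ?", ("ants",)
).fetchall()
```

Looking up a term is then a single indexed query on `InvertedIndex`, which is what makes the structure useful for search.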

### 5. TF-IDF: 
We have our inverted index, and with it we can calculate the TF-IDF (term frequency times inverse document frequency) score for any term.
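
Before running it against the database, it may help to see the arithmetic on its own. The sketch below uses the same smoothed formula as the function in this section, with term frequency `count / document length` and inverse document frequency `ln((1 + N) / (1 + n_t))`, but takes plain numbers instead of querying the index. `tf_idf_sketch` and its parameter names are hypothetical.

```python
import math


def tf_idf_sketch(
    term_count: int,
    document_len: int,
    number_of_documents: int,
    documents_containing_term: int,
) -> float:
    """TF-IDF for one term in one document, using smoothed IDF."""
    # Term frequency: share of the document's tokens that are this term.
    tf = term_count / document_len
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log((1 + number_of_documents) / (1 + documents_containing_term))
    return tf * idf


# A term seen 3 times in a 100-token document, present in 4 of 9 documents.
rare_score = tf_idf_sketch(3, 100, 9, 4)

# The same counts for a term that appears in every document.
common_score = tf_idf_sketch(3, 100, 9, 9)
```

A term that occurs in every document gets an IDF of `ln(1) = 0`, so it contributes nothing to the ranking no matter how often it appears.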

In [None]:
from zot_search.index import get_number_of_documents_in_index, get_len_of_document
import numpy as np

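def calculate_tf_idf_for_single_term(db_path, term):
    """Return {doc_title: TF-IDF score} for every document containing `term`."""
    tf_idfs = {}

    # One result per matching document: result[0] is the title, result[1] the count.
    results = search_index(db_path, term)

    # Smoothed inverse document frequency: adding 1 to both counts avoids dividing by zero.
    number_of_documents = get_number_of_documents_in_index(db_path)
    number_of_documents_containing_term = len(results)
    idf = np.log((1 + number_of_documents) / (1 + number_of_documents_containing_term))

    for result in results:
        title = result[0]
        document_count = result[1]
        document_len = get_len_of_document(db_path, title)

        # Term frequency: occurrences of the term relative to the document length.
        tf = document_count / document_len
        tf_idfs[title] = tf * idf

    return tf_idfs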


In [None]:
results = calculate_tf_idf_for_single_term(db_path, "portuguese")
display_top_results(db_path, results, True)

Tupi language: 0.0031490366952661927
https://en.wikipedia.org/wiki/Tupi_language


The Guianas: 0.0015464278289448275
https://en.wikipedia.org/wiki/Guianas


Integrated Taxonomic Information System: 0.000716990369897091
https://en.wikipedia.org/wiki/Integrated_Taxonomic_Information_System


Anteater: 0.0004578849573615254
https://en.wikipedia.org/wiki/Anteater


Maned wolf: 0.0004026567376210253
https://en.wikipedia.org/wiki/Maned_wolf


Armadillo: 0.0002826226705897219
https://en.wikipedia.org/wiki/Armadillo


Amazon basin: 0.0002348247761757721
https://en.wikipedia.org/wiki/Amazon_basin


Sloth: 0.00018263995904161233
https://en.wikipedia.org/wiki/Sloth


Jaguar: 9.93267145000477e-05
https://en.wikipedia.org/wiki/Jaguars


Ant: 5.192734639774728e-05
https://en.wikipedia.org/wiki/Ant




### 6. Cosine similarity: 
How do we handle a query with multiple terms?

We could average the TF-IDF scores of the individual terms, or we can compute the cosine similarity between the query and each document, treating both as TF-IDF vectors.
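
Cosine similarity measures the angle between two vectors, so it rewards documents whose term weights point in the same direction as the query regardless of document length. The standalone sketch below shows the calculation on small hand-made vectors; `cosine_similarity_sketch` and the example vectors are hypothetical and use only the standard library.

```python
import math
from typing import Sequence


def cosine_similarity_sketch(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; treat it as dissimilar to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy weights over the two-term vocabulary ["worm", "species"].
query_vector = [1.0, 1.0]
doc_both_terms = [2.0, 1.0]     # mentions both terms
doc_species_only = [0.0, 3.0]   # mentions only "species"

score_both = cosine_similarity_sketch(query_vector, doc_both_terms)
score_species_only = cosine_similarity_sketch(query_vector, doc_species_only)
```

Here the document that mentions both query terms scores higher than the one that only repeats "species", even though the latter has the larger raw count.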

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from zot_search.index import union_documents_with_term, get_document_text

def calculate_cosine_similarity(db_path, query):
    """Return {doc_title: cosine similarity to `query`} for the candidate documents."""
    documents = []
    doc_titles = []

    results = {}

    documents_to_compare = union_documents_with_term(db_path, query)
    if len(documents_to_compare) == 0:
        return results

    for document in documents_to_compare:
        doc_titles.append(document)
        documents.append(get_document_text(db_path, document))

    # Fit the TF-IDF vocabulary and weights on the candidate documents only.
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)

    # Project the query into the same vector space as the documents.
    query_vector = vectorizer.transform([query])

    # cosine_similarity returns a 1 x n matrix; flatten it into n scores.
    similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

    for title, score in zip(doc_titles, similarities):
        results[title] = score
    return results

In [None]:
results = calculate_cosine_similarity(db_path, "worm species")
display_top_results(db_path, results, True)

Species: 0.08235954648308656
https://en.wikipedia.org/wiki/Species


Parasitic worm: 0.06869946110382016
https://en.wikipedia.org/wiki/Parasitic_worms


List of pilosans: 0.0628430506157582
https://en.wikipedia.org/wiki/List_of_pilosans


Cestoda: 0.04025662865878257
https://en.wikipedia.org/wiki/Cestoda


Parasitism: 0.037071017034966416
https://en.wikipedia.org/wiki/Parasites


IUCN Red List: 0.03527527749662752
https://en.wikipedia.org/wiki/IUCN_Red_List


Least-concern species: 0.03448918586923339
https://en.wikipedia.org/wiki/Least_concern


Data deficient: 0.031463197120534005
https://en.wikipedia.org/wiki/Data_deficient


Type (biology): 0.03128721583794599
https://en.wikipedia.org/wiki/Type_host


Amazon basin: 0.0305661195915086
https://en.wikipedia.org/wiki/Amazon_basin


