## Preprocessing
In this section we preprocess our corpus.
To provide some context, here is a brief description of it:
Our corpus consists of 500 mainly Italian recipes.\
We obtained these recipes from: https://www.giallozafferano.com/recipes.
We scraped the website and saved the relevant information for each recipe in a text file with this format:
```
name:
[NAME]
ingredients:
[INGREDIENT1]: [QUANTITY]

instructions:
[INSTRUCTIONS]

url:
[URL]
```

During preprocessing we remove punctuation, normalize whitespace (the files mix tabs and spaces), and lowercase the entire text. This keeps irrelevant formatting details in the data from affecting the embeddings.
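
Note that this normalization also strips the colons and line breaks that separate the fields, so any field-level access has to happen on the raw file first. The following is a minimal sketch of how a raw file in the format above could be split into its sections. The helper `parse_recipe`, the constant `SECTION_HEADERS`, and the sample recipe text are hypothetical and only serve as an illustration; they are not part of our pipeline.

```python
from typing import Dict, List, Optional

# Section headers as they appear in the file format shown above.
SECTION_HEADERS = ("name", "ingredients", "instructions", "url")

def parse_recipe(raw: str) -> Dict[str, List[str]]:
    """Split a raw (not yet normalized) recipe file into its sections.

    Hypothetical helper: returns a dict mapping each section name
    to the list of non-empty lines found under that header.
    """
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in raw.splitlines():
        stripped = line.strip()
        # A header is a line such as "name:" that matches a known section.
        header = stripped[:-1].lower() if stripped.endswith(":") else None
        if header in SECTION_HEADERS:
            current = header
            sections[current] = []
        elif stripped and current is not None:
            sections[current].append(stripped)
    return sections

# Made-up sample in the file format described above.
sample = """name:
Spaghetti Carbonara
ingredients:
Spaghetti: 320 g
Guanciale: 150 g

instructions:
Cook the pasta. Fry the guanciale.

url:
https://example.com/carbonara
"""

parsed = parse_recipe(sample)
print(parsed["name"])         # ['Spaghetti Carbonara']
print(parsed["ingredients"])  # ['Spaghetti: 320 g', 'Guanciale: 150 g']
```

Since we embed the full recipe text, we do not need this split ourselves, but it shows which structure is lost once the text has been normalized.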

In [1]:
from pathlib import Path
import string
import os
from typing import List, Dict

In [4]:
# preprocessing functions

def remove_interpunction(text: str) -> str:
    exclude = set(string.punctuation)
    return ''.join(ch for ch in text if ch not in exclude)

def fix_whitespace(text: str) -> str:
    return ' '.join(text.split())

def read_file(path: Path) -> str:
    # read as UTF-8 explicitly, since the recipes contain accented characters
    with open(path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

def normalize_text(text: str) -> str:
    text = remove_interpunction(text)
    text = fix_whitespace(text)
    text = text.lower()
    return text
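
To illustrate what the normalization does to a recipe, here is a small self-contained sketch. The function `normalize_text_sketch` is a hypothetical one-function variant that should behave like `normalize_text` above (it uses `str.translate` instead of a character filter), and the sample string is made up.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_text_sketch(text: str) -> str:
    """Hypothetical compact variant of the normalization steps above."""
    text = text.translate(_PUNCT_TABLE)  # remove ASCII punctuation
    text = " ".join(text.split())        # collapse tabs, newlines and repeated spaces
    return text.lower()                  # lowercase everything

sample = "Spaghetti:\t320 g\n\nCook  the pasta, then DRAIN it."
print(normalize_text_sketch(sample))
# spaghetti 320 g cook the pasta then drain it

# string.punctuation only covers ASCII, so other symbols are kept.
print(normalize_text_sketch("180°C"))
# 180°c
```

Two side effects are worth keeping in mind: `string.punctuation` contains only ASCII characters, so symbols such as `°` or typographic quotes stay in the text, and removing hyphens or slashes glues the surrounding characters together.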

In [None]:
# load the recipes:

def load_text(path: Path) -> str:
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
        text = normalize_text(text)
    return text

def load_recipes(base_dir: Path = Path("data"), max_index: int = 100) -> List[Dict]:
    """
    Load recipes from base_dir/{index}.txt for index in [0, max_index).
    Each loaded recipe is a dict with an id, title, text and path.
    """

    docs = []
    for index in range(max_index):

        recipe_path = base_dir / f"{index}.txt"     # create the full path to recipe

        if not os.path.exists(recipe_path):         # skip missing indices
            print(f"WARNING: {recipe_path} not found, skipping.")
            continue
        
        text = load_text(recipe_path)               # load the data

        docs.append({                               # save the data in a dict
            "id": str(index),
            "title": f"recipe_{index}",
            "text": text,
            "path": recipe_path,
        })
    return docs

data_path = Path("../model/data")
recipes: List[Dict] = load_recipes(base_dir=data_path, max_index=100)
print(len(recipes), "recipes loaded.")

100 recipes loaded.
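
A quick sanity check on the loaded documents can catch empty or truncated files before they reach the embedding step. The sketch below assumes documents shaped like the dicts returned by `load_recipes`; the helper `summarize_docs` and the two example documents are hypothetical.

```python
from typing import Dict, List

def summarize_docs(docs: List[Dict]) -> Dict[str, float]:
    """Hypothetical helper: basic statistics over loaded recipe dicts."""
    # Whitespace tokens per document (the text is already normalized).
    lengths = [len(doc["text"].split()) for doc in docs]
    return {
        "count": len(docs),
        "empty": sum(1 for n in lengths if n == 0),
        "mean_tokens": sum(lengths) / len(lengths) if lengths else 0.0,
    }

# Made-up documents in the same shape as the output of load_recipes.
example_docs = [
    {"id": "0", "title": "recipe_0", "text": "spaghetti 320 g"},
    {"id": "1", "title": "recipe_1", "text": ""},
]
print(summarize_docs(example_docs))
# {'count': 2, 'empty': 1, 'mean_tokens': 1.5}
```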
