In [1]:
import requests
from bs4 import BeautifulSoup

# Proof of Concept
---

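To test web scraping The Doctor's Kitchen, I'll try extracting the data I need from a single recipe. I started with the first one presented on the website, https://thedoctorskitchen.com/recipes/smoky-mushroom-and-tempeh-veggie-burgers/. The HTML samples in this notebook mostly come from that page, while the printed outputs are for whichever URL is left uncommented in the next cell.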

In [2]:
# URL = "https://thedoctorskitchen.com/recipes/smoky-mushroom-and-tempeh-veggie-burgers/"
# URL = "https://thedoctorskitchen.com/recipes/beetroot-apple-and-ginger-soup/"
# URL = "https://thedoctorskitchen.com/recipes/tarragon-mushrooms-on-toast/"
URL = "https://thedoctorskitchen.com/recipes/air-fried-aubergine-tomato-cucumber-and-egg-chopped-salad-with-hummus-and-tahini-dressing/"

page = requests.get(URL, timeout=10)
print("Status Code:", page.status_code)

# Raise on a 4xx/5xx response so `soup` is never left undefined for later cells.
page.raise_for_status()
soup = BeautifulSoup(page.content, "html.parser")

Status Code: 200


# Recipe Title
The recipe title is simply contained in HTML like the following:
```HTML
<h1 class="lg:mb-3 mb-2 xl:text-h1-head lg:text-h2 text-h2-head font-bold text-white print:text-black" itemprop="name">Smoky Mushroom and Tempeh Veggie Burgers </h1>
```

## Implementation
The following code gets this title by simply searching for the HTML tag with the attribute `itemprop="name"`.

In [3]:
def get_recipe_title(soup):
    itemprop_tag = "name"
    return soup.find(itemprop=itemprop_tag).text.strip()

title = get_recipe_title(soup)
print(title)

Air-Fried Aubergine, Tomato, Cucumber and Egg Chopped Salad with Hummus and Tahini Dressing


# Number of Servings
The number of servings is contained in the following HTML:
```HTML
<h3 class="mb-1.5 text-h3 font-medium text-black">Ingredients (Serves <span itemprop="recipeYield">4</span>)</h3>
```

## Implementation
To get the number of servings, we search for the HTML tag with the attribute `itemprop="recipeYield"` and convert the text it contains to an integer.

In [4]:
def get_recipe_servings(soup):
    itemprop_tag = "recipeYield"
    return int(soup.find(itemprop=itemprop_tag).text.strip())

num_servings = get_recipe_servings(soup)
print(num_servings)

2
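
`get_recipe_servings` raises if a page has no `recipeYield` span, or if the span holds something other than a whole number. Below is a minimal sketch of a more forgiving variant, assuming that returning `None` is acceptable in those cases. The name `get_recipe_servings_or_none` and the inline HTML are illustrative only and are not taken from the site.

```python
from bs4 import BeautifulSoup


def get_recipe_servings_or_none(soup):
    """Return the serving count as an int, or None if missing or not a whole number."""
    tag = soup.find(itemprop="recipeYield")
    if tag is None:
        return None
    text = tag.get_text(strip=True)
    return int(text) if text.isdecimal() else None


# Hypothetical snippets for illustration, modelled on the HTML shown above.
with_yield = BeautifulSoup(
    '<h3>Ingredients (Serves <span itemprop="recipeYield">4</span>)</h3>', "html.parser"
)
without_yield = BeautifulSoup("<h3>Ingredients</h3>", "html.parser")

print(get_recipe_servings_or_none(with_yield))     # 4
print(get_recipe_servings_or_none(without_yield))  # None
```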


# Recipe Vegan/Vegetarian/etc
Recipes on The Doctor's Kitchen have tags like 'vegan' or 'vegetarian'. It seems each recipe has at most one such tag. These appear in the HTML as follows:
```HTML
<div class="order-2 relative self-center md:w-5/12 sm:w-1/2 w-full xl:px-12 lg:px-8 md:px-6 xs:px-10 px-7 sm:py-0 xs:py-10 py-6 z-20 print:p-4">
	<h1 class="lg:mb-3 mb-2 xl:text-h1-head lg:text-h2 text-h2-head font-bold text-white print:text-black" itemprop="name">
		Smoky Mushroom and Tempeh Veggie Burgers 
	</h1>
	<span class="inline-block mr-4 mt-2 xl:text-h3 lg:text-lg text-base text-white print:text-black">
		<i class="inline-block mr-2 xl:text-perex text-xl fal fa-leaf align-middle"></i>
        Vegetarian
    </span>
</div>
```
The class name on the tag's parent `<span>` looks auto-generated, but there's not much more I can do than match on it. Searching the HTML for this class name returns only this element, so hopefully it's safe to use. Not every recipe has a tag like this (e.g., https://thedoctorskitchen.com/recipes/cooks-white-bean-prawn-saganaki/).

## Implementation
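We grab the span with the complicated class name `"inline-block mr-4 mt-2 xl:text-h3 lg:text-lg text-base text-white print:text-black"`, then take its third child node (index 2 in `contents`): the first child is whitespace, the second is the `<i>` icon, and the third is the tag text.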

In [5]:
def get_recipe_tag(soup):
    class_name = "inline-block mr-4 mt-2 xl:text-h3 lg:text-lg text-base text-white print:text-black"
    tag_soup = soup.find(class_=class_name)
    if tag_soup:
        return tag_soup.contents[2].strip()
    else:
        return None
    
recipe_tag = get_recipe_tag(soup)
print(recipe_tag)

Vegetarian
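
To see why index 2 is the right one, it can help to print the children of the span. This is a small sketch on a hypothetical snippet modelled on the HTML above, with the class names shortened. It also shows `get_text(strip=True)` as one possible alternative that does not depend on the child order.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the span shown above (class names shortened).
html = """<span class="inline-block mr-4">
    <i class="fal fa-leaf"></i>
    Vegetarian
</span>"""

span = BeautifulSoup(html, "html.parser").find("span")

# contents[0] is the whitespace before <i>, contents[1] is the <i> tag,
# and contents[2] is the text node holding the label.
for index, child in enumerate(span.contents):
    print(index, repr(child))

# get_text(strip=True) gives the same label without relying on the index,
# because the <i> icon contains no text of its own.
label = span.get_text(strip=True)
print(label)
```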


# Meal Tag
Recipes on The Doctor's Kitchen carry tags for the meals they suit. These are:
- Breakfast
- Lunch
- Dinner
- Snack

These tags are contained in the following HTML:
```HTML
<a href="/recipes/breakfast" class="inline-block mr-2 mb-2 px-2 py-1.5 md:text-label text-labelsmall font-bold text-white bg-docGreen uppercase tracking-wider">
	Breakfast
</a>
```
The class name is the same for every meal type, so we use the class to find the meal tags.

In [6]:
def recipe_meals(soup):
    class_name = "inline-block mr-2 mb-2 px-2 py-1.5 md:text-label text-labelsmall font-bold text-white bg-docGreen uppercase tracking-wider"
    name = "a"
    return [subsoup.text.strip() for subsoup in soup.find_all(name, class_=class_name)]

meals = recipe_meals(soup)
print(meals)

['Lunch', 'Dinner']
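
Matching on the full class string breaks if the site reorders or adds a utility class. As a hedged alternative, a CSS selector can match on a single class. The sketch below assumes that `bg-docGreen` on an `<a>` element is distinctive enough to identify meal tags, which has not been checked against the real pages. The HTML is a shortened, hypothetical sample.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup; only the first two links are meal tags.
html = """
<a href="/recipes/lunch" class="inline-block mr-2 bg-docGreen uppercase">Lunch</a>
<a href="/recipes/dinner" class="inline-block mr-2 bg-docGreen uppercase">Dinner</a>
<a href="/about" class="inline-block">About</a>
"""

sample_soup = BeautifulSoup(html, "html.parser")

# "a.bg-docGreen" matches <a> elements whose class list includes bg-docGreen,
# regardless of which other classes are present or their order.
sample_meals = [a.get_text(strip=True) for a in sample_soup.select("a.bg-docGreen")]
print(sample_meals)
```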


# Cooking time
Recipes split their cooking time into two parts: a prep time and a cook time. We will grab both of these. They are contained in the following HTML:
```HTML
<span class="inline-block mr-1 mb-2 px-2 py-1.5 md:text-label text-labelsmall font-bold text-white bg-docGreen uppercase tracking-wider">Prep: <span itemprop="prepTime" content="PT20M">20</span> mins</span>

<span class="inline-block mr-1 mb-2 px-2 py-1.5 md:text-label text-labelsmall font-bold text-white bg-docGreen uppercase tracking-wider">Cooks: <span itemprop="cookTime" content="PT10M">10</span> mins</span>
```
These spans seem to appear on every page. When there is no cooking time, the 'Cooks:' label still appears, just with no number next to it. So it should be safe to grab the numbers contained in the 'prepTime' and 'cookTime' spans and assume the unit is minutes, as long as we allow for missing text in either span.

In [7]:
def get_prep_and_cook_time_in_mins(soup):
    prep_time_itemprop_tag = "prepTime"
    prep_time = soup.find(itemprop=prep_time_itemprop_tag).text.strip()
    prep_time = float(prep_time) if len(prep_time)>0 else None

    
    cook_time_itemprop_tag = "cookTime"
    cook_time = soup.find(itemprop=cook_time_itemprop_tag).text.strip()
    cook_time = float(cook_time) if len(cook_time)>0 else None

    return prep_time, cook_time

prep_time, cook_time = get_prep_and_cook_time_in_mins(soup)
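
The spans above also carry a `content` attribute such as `PT20M`, which looks like an ISO 8601 duration. Reading that attribute would avoid assuming the visible number is in minutes. The following is a minimal sketch of such a parser, assuming only hours and minutes occur. The helper name `iso_duration_to_minutes` is hypothetical, and the notebook has not checked whether the site always fills this attribute.

```python
import re

# Matches durations such as "PT20M", "PT1H" or "PT1H30M" (hours and minutes only).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def iso_duration_to_minutes(value):
    """Convert an ISO 8601 duration like 'PT1H30M' to minutes; None if not parseable."""
    match = _DURATION.match(value or "")
    if match is None or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


print(iso_duration_to_minutes("PT20M"))    # 20
print(iso_duration_to_minutes("PT1H30M"))  # 90
print(iso_duration_to_minutes(""))         # None
```

On a real page the value would come from something like `soup.find(itemprop="prepTime").get("content")`.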

# Ingredients

The ingredients are contained in an unordered list, and each ingredient is a list item like the following:
```HTML
<li class="block mt-4 sm:text-lg text-base text-black text-opacity-70" itemprop="recipeIngredient">
	<span data-min="150" data-max="" class="r4-ingre-metric">
		150 g
	</span> 

	<span data-min="5.290500000000001" data-max="" class="r4-ingre-imperial hidden">
		<span data-nr="5.290500000000001" class="imperial-nr">
			5.3
		</span>
		 oz
	</span> 

	white onion
	
	<span class="block text-md text-gray-400">finely diced</span>
</li>
```
The children of the `<li>` tag with the attribute `itemprop="recipeIngredient"` are:
1. The metric measurement of an ingredient (including the units), is text inside the span with class `r4-ingre-metric`.
	- For an ingredient without a unit, like a number of whole onions, the metric span just contains a number.
2. The imperial measurements are contained in the span with class `r4-ingre-imperial hidden`, and unlike the metric measurement is split by quantity and unit.
3. The ingredient name.
4. Preparation instructions.
	- This tag is present even for an ingredient like olive oil, which does not require preparation. In this case it just contains no text.

## Implementation
The following code uses Beautiful Soup to find the elements, then extracts the required text into an `Ingredient` object with attributes for the ingredient name, quantity, unit, and preparation steps. For some reason the ingredient list appears twice in the HTML, so in `get_ingredients` I just take the first half of the total ingredients list.

> **Note**: The code assumes the measurement text contained in the `r4-ingre-metric` span is split into a list of length 1 or 2. If it's longer (e.g. '3 table spoons'), it will raise a ValueError.

In [8]:
class Ingredient():
    def __init__(self, name, quantity, measurement_unit=None, preparation=None):
        self.name = name
        self.quantity = quantity
        self.measurement_unit = measurement_unit
        self.preparation = preparation
    
    def __repr__(self):
        return (
            f"<Ingredient object: name={self.name!r}; quantity={self.quantity!r}, "
            f"measurement_unit={self.measurement_unit!r}, preparation={self.preparation!r}>"
        )


def get_ingredients(soup):
    itemprop_tag = "recipeIngredient"

    recipeIngredients = soup.find_all(itemprop=itemprop_tag)

    ingredient_objects = []
    for ingredient_soup in recipeIngredients:
        ingredientObject = ingredient_soup_to_IngredientObject(ingredient_soup)
        ingredient_objects.append(ingredientObject)
    
    # recipe ingredients are duplicated in the HTML, so we split the list down the middle
    ingredient_objects = ingredient_objects[:len(ingredient_objects)//2]

    return ingredient_objects


def ingredient_soup_to_IngredientObject(ingredient_soup):
    metric_measurement_class = "r4-ingre-metric"
    measurement = ingredient_soup.find(class_=metric_measurement_class).text.split()
    preparation_name = "span"
    preparation_class = "block text-md text-gray-400"

    quantity = None
    measurement_unit = None
    if len(measurement) >= 1:
        quantity = float(measurement[0])
    if len(measurement) == 2:
        measurement_unit = measurement[1]
    elif len(measurement) > 2:
        raise ValueError(f"Measurement {measurement} has length {len(measurement)} > 2")
    
    name = ingredient_soup.contents[-2].strip()

    preparation = ingredient_soup.find(preparation_name, preparation_class).text.strip()
    if preparation == "":
        preparation = None
    
    return Ingredient(name, quantity, measurement_unit=measurement_unit, preparation=preparation)


ingredient_objects = get_ingredients(soup)
for i in ingredient_objects:
    print(i)

<Ingredient object: name='eggs'; quantity=4.0, measurement_unit=None, preparation=None>
<Ingredient object: name='aubergine'; quantity=300.0, measurement_unit='g', preparation='2cm cubed'>
<Ingredient object: name='olive oil'; quantity=1.0, measurement_unit='tbsp', preparation=None>
<Ingredient object: name='wholegrain pitta'; quantity=2.0, measurement_unit=None, preparation='cut into 2cm pieces'>
<Ingredient object: name='cucumber'; quantity=160.0, measurement_unit='g', preparation='2cm cubed'>
<Ingredient object: name='tomatoes'; quantity=160.0, measurement_unit='g', preparation='2cm cubed'>
<Ingredient object: name='gherkins'; quantity=30.0, measurement_unit='g', preparation='finely diced'>
<Ingredient object: name='lemon'; quantity=1.0, measurement_unit=None, preparation='juiced'>
<Ingredient object: name='tahini'; quantity=4.0, measurement_unit='tbsp', preparation=None>
<Ingredient object: name='water'; quantity=2.0, measurement_unit='tbsp', preparation='or as needed'>
<Ingredient
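
The note above mentions that a measurement such as '3 table spoons' raises a `ValueError`. One way around that, sketched here under the assumption that the quantity is always the first whitespace-separated token, is to split only once so the rest of the text becomes the unit. `split_measurement` is a hypothetical helper and not part of the notebook's code. A non-numeric quantity would still raise in `float`.

```python
def split_measurement(text):
    """Split '150 g' into (150.0, 'g'); the unit may contain spaces or be absent."""
    # maxsplit=1 keeps everything after the first token together as the unit.
    parts = text.split(maxsplit=1)
    if not parts:
        return None, None
    quantity = float(parts[0])
    unit = parts[1].strip() if len(parts) == 2 else None
    return quantity, unit


print(split_measurement("150 g"))           # (150.0, 'g')
print(split_measurement("3 table spoons"))  # (3.0, 'table spoons')
print(split_measurement("4"))               # (4.0, None)
print(split_measurement(""))                # (None, None)
```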

# Description
Recipes on the site have a short description of the recipe. This is contained in the following HTML:
```HTML
<h2 class="text-perex font-medium text-docGreen" itemprop="description">
	If you want to cut back on red meat, have a veggie friend coming to a barbecue, or want to convince a plant-based sceptic, this recipe is guaranteed to please. Tempeh is an excellent source of protein and prebiotic fibres. Combined with mushrooms we think it makes these plant-based burgers taste unbelievably meaty and super good for your gut microbes!
</h2>
```
There are other tags with the attribute `itemprop="description"`, so we need to look for the `<h2>` with this attribute.

In [9]:
def recipe_description(soup):
    itemprop_tag = "description"
    name = "h2"
    desc = soup.find(name, itemprop=itemprop_tag).text.strip()
    desc = None if len(desc) == 0 else desc
    return desc

description = recipe_description(soup)
print(description)

This is our salad version of a sabich, a popular street food sandwich in Tel Aviv that originated in the Iraqi Jewish community. Meltingly tender fried aubergine, crunchy gherkins and a refreshing chopped cucumber and tomato salad, drizzled with a luscious tahini dressing. 

The aubergine is traditionally deep-fried, here we have used an air-fryer to keep things lighter. If you don’t have one the aubergine can be baked in the oven at 190°C for 20 minutes, until tender and dark golden brown.


# Method
The recipe steps are contained in the following HTML:
```HTML
<div class="r4-instruction-list" itemprop="recipeInstructions">
	<div class="r4-instruction-item" itemprop="itemListElement" itemscope itemtype="https://schema.org/ListItem">
		<div class="r4-instruction-img">
			<img src="..." width="960" alt="Gather ..." itemprop="image">
		</div>
		<p itemprop="description">
			Gather and prepare your ingredients.
		</p>
	</div>
	
	<div class="r4-instruction-item" itemprop="itemListElement" itemscope itemtype="https://schema.org/ListItem">
		<div class="r4-instruction-img">
			<img src="..." width="960" alt="Heat ha..." itemprop="image">
		</div>
		<p itemprop="description">
			Heat half the olive oil in a frying pan over medium-high heat. Add the onion, garlic and a light sprinkle of salt and pepper. Cook for 3-4 minutes until softened and translucent.
		</p>
	</div>
	
	<div class="r4-instruction-item" itemprop="itemListElement" itemscope itemtype="https://schema.org/ListItem">
		<div class="r4-instruction-img">
			<img src="..." width="960" alt="Add the..." itemprop="image">
		</div>
		<p itemprop="description">
			Add the mushrooms to the pan and sprinkle lightly with salt and pepper. Cook, stirring occasionally, for 3-4 minutes, until reduced. Then add the tempeh, paprika, tomato puree and tamari and cook, stirring often, for a further 5-6 minutes, until everything is darkened, sticky and caramelised. Remove from the heat, stir through the oats and allow to cool slightly.
		</p>
	</div>
	
	...
	
</div>
```

To get the steps in a list, I just extracted all `<p>` tags with the attribute `itemprop="description"`. Some have sub-steps, so I extracted the text of their contents rather than the straight text of the `<p>`.

In [10]:
def recipe_method(soup):
    name = "p"
    itemprop_tag = "description"

    steps_soup_list = soup.find_all(name, itemprop=itemprop_tag)
    steps_list = []
    for step in steps_soup_list:
        for item in step.contents:
            steps_list.append(item.text.strip())
    
    return steps_list
    # return [s.text.strip() for s in soup.find_all(name, itemprop="description")]

method = recipe_method(soup)
for i, step in enumerate(method):
    print(f"{i+1})  {step}\n")

1)  Gather and prepare your ingredients.

2)  Bring a small pot of water to the boil and gently add the eggs. Boil for 7 minutes, then remove and run under cold water to halt the cooking process. Once cool, peel and slice into wedges.

3)  Lightly drizzle the aubergine with olive oil and sprinkle with salt and pepper. Toss to coat. Place into the basket and ‘fry’ at 190°C for 7-8 minutes, until dark golden brown and tender but not completely cooked.

4)  Lightly drizzle the chopped pitta breads with olive oil, sprinkle with salt and pepper and toss to coat. Add to the air-fryer with the aubergine and cook for a further 4-5 minutes, until the bread is golden and crispy and the aubergine is completely tender.

5)  Place the cucumber, tomato and gherkins into a bowl, squeeze over half the lemon juice and toss to coat. Add salt to taste.

6)  To make the tahini sauce place the tahini and remaining lemon juice into a medium bowl and whisk to combine. It will seize slightly, add water and wh
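
Looping over `step.contents` also visits child tags that carry no text of their own, which would add empty strings to the list. Beautiful Soup's `stripped_strings` is one way to collect only the non-empty text fragments. The sketch below uses a hypothetical step, and the `<br>` separator between sub-steps is an assumption about how the site marks them up.

```python
from bs4 import BeautifulSoup

# Hypothetical step; the <br> separator between sub-steps is an assumption.
html = """<p itemprop="description">
    Make the dressing.<br>
    Whisk the tahini and lemon juice together.
</p>"""

sample_soup = BeautifulSoup(html, "html.parser")

steps = []
for paragraph in sample_soup.find_all("p", itemprop="description"):
    # stripped_strings yields each text fragment with whitespace trimmed
    # and skips fragments that are empty after trimming.
    steps.extend(paragraph.stripped_strings)

print(steps)
```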

# Extracting Equipment from the Recipe Steps
We infer the equipment by checking whether any phrase in `equipment_dict` appears in the method steps or in the ingredients list. See the Obsidian notes.

In [11]:
# words and phrases which indicate use of a specific piece of equipment:
# FORMAT:
#       "equipment_name": {"synonym", "phrase to look for"}

equipment_dict = {
    "pan": {"pan", "frypan"},
    "pot": {"pot", "boil", "saucepan"},
    "blender": {"blend"},
    "air fryer": {"air frier", "air fryer"},
    "toaster": {"toaster", "toast"},
    "oven": {"oven", "bake", "roast"},
    "cast-iron pan": {"cast iron pan"},
    "knife": {"slice", "sliced", "chop", "chopped", "dice", "diced", "cube", "cubed"},
    "chopping board": {"slice", "sliced", "chop", "chopped", "dice", "diced", "cube", "cubed", "board"},
    "measuring spoons": {"tbsp", "tsp"},
    "measuring cup": {"cup", "ml"},
    "scale": {"g", "gram", "grams"},
    "bowl": {"bowl"},
    "baking paper": {"baking paper", "baking parchment"},
    "fridge": {"fridge", "refrigerator"},
    "freezer": {"freezer"}
}

character_replacement_dict = {
    "-": " "
}

def string_to_standard_string(string):
    string = [character_replacement_dict[c] if c in character_replacement_dict else c for c in string]
    string = ''.join([c.lower() for c in string if c.isalnum() or c.isspace()])
    return string

def method_alnum_string_from_list(method):
    method_alnum = []
    for step in method:
        method_alnum.append(string_to_standard_string(step))

    method_alnum_string = ' '.join(method_alnum)
    return method_alnum_string

def method_words_list(soup):
    method_list = recipe_method(soup)
    method_string = method_alnum_string_from_list(method_list)
    return list(method_string.split())

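def equipment_set(soup):
    words = method_words_list(soup)
    # Build the ingredient list from `soup` rather than relying on a notebook global.
    ingredients = get_ingredients(soup)

    equipment_set_object = set()
    for equipment, phrases in equipment_dict.items():
        # One matching phrase is enough to record the equipment.
        if any(phrase_in_str_list(phrase, words) or phrase_in_ingredients(phrase, ingredients) for phrase in phrases):
            equipment_set_object.add(equipment)

    return equipment_set_object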

def phrase_in_str_list(phrase, str_list):
    phrase_list = phrase.split()
    l = len(phrase_list)
    for i in range(len(str_list)-l+1):
        if str_list[i:i+l] == phrase_list:
            return True
    return False

def phrase_in_ingredients(phrase, ingredient_objects):
    # Names and units are compared whole; preparation text is compared word by word.
    name_list = [string_to_standard_string(i.name) for i in ingredient_objects if i.name is not None]
    unit_list = [string_to_standard_string(i.measurement_unit) for i in ingredient_objects if i.measurement_unit is not None]

    preparation_list = []
    for ingredient in ingredient_objects:
        if ingredient.preparation is not None:
            preparation_list += string_to_standard_string(ingredient.preparation).split()

    combined_list = name_list + unit_list + preparation_list

    return phrase in combined_list


equipments = equipment_set(soup)
print(equipments)

{'chopping board', 'scale', 'air fryer', 'bowl', 'pot', 'measuring spoons', 'knife'}
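
The matching above works on whole words after normalisation: hyphens become spaces, other punctuation is dropped, and a phrase matches only if its words appear consecutively. The following condensed sketch shows the same idea on a made-up step. The names `normalise` and `contains_phrase` are illustrative and are not the notebook's functions.

```python
def normalise(text):
    """Lower-case, turn hyphens into spaces, and drop other punctuation."""
    text = text.replace("-", " ")
    return "".join(c.lower() for c in text if c.isalnum() or c.isspace())


def contains_phrase(phrase, words):
    """Return True if the words of `phrase` appear consecutively in `words`."""
    target = phrase.split()
    size = len(target)
    return any(words[i:i + size] == target for i in range(len(words) - size + 1))


# Hypothetical method step, written for this example.
step = "Add to the air-fryer with the aubergine; toast the pitta."
words = normalise(step).split()

print(contains_phrase("air fryer", words))  # True
print(contains_phrase("toaster", words))    # False
print(contains_phrase("toast", words))      # True
```

Because whole words are compared, a phrase like "blend" would not match "blended" in a step, so each inflected form needs its own entry in the phrase set.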


# Summary
We summarise these results in a class, which shows a markdown summary when printed.

In [17]:
class Recipe():
    def __init__(self, url, soup, title, description, tags, meals, servings, prep_time, cook_time, ingredients, equipment, method):
        self.url = url
        self.soup = soup
        self.title = title
        self.description = description
        self.tags = tags
        self.meals = meals
        self.servings = servings
        self.prep_time = prep_time
        self.cook_time = cook_time
        self.ingredients = ingredients
        self.equipment = equipment
        self.method = method
    
    def __str__(self):
        rep = f"# {self.title.upper()}\n"
        rep += f"{self.url}\n"
        # Either time can be None when the page leaves it blank, so count it as 0.
        rep += f"Total time: {(self.prep_time or 0) + (self.cook_time or 0)} mins\n"

        rep += self.subheader("meals")
        for meal in self.meals:
            rep += f"- {meal}\n"

        rep += self.subheader("description")
        rep += f"{self.description}\n"

        rep += self.subheader("equipment")
        for e in self.equipment:
            rep += f"- {e}\n"
        
        rep += self.subheader("ingredients")
        for ingredientObject in self.ingredients:
            rep += f"- {ingredientObject.name}, {ingredientObject.quantity} {ingredientObject.measurement_unit}\n"
        return rep
    
    def subheader(self, string):
        return f"\n## {string.upper()}\n"

exampleRecipe = Recipe(URL, soup, title, description, recipe_tag, meals, num_servings, prep_time, cook_time, ingredient_objects, equipments, method)
print(exampleRecipe)

# AIR-FRIED AUBERGINE, TOMATO, CUCUMBER AND EGG CHOPPED SALAD WITH HUMMUS AND TAHINI DRESSING
https://thedoctorskitchen.com/recipes/air-fried-aubergine-tomato-cucumber-and-egg-chopped-salad-with-hummus-and-tahini-dressing/
Total time: 25.0 mins

## MEALS
- Lunch
- Dinner

## DESCRIPTION
This is our salad version of a sabich, a popular street food sandwich in Tel Aviv that originated in the Iraqi Jewish community. Meltingly tender fried aubergine, crunchy gherkins and a refreshing chopped cucumber and tomato salad, drizzled with a luscious tahini dressing. 

The aubergine is traditionally deep-fried, here we have used an air-fryer to keep things lighter. If you don’t have one the aubergine can be baked in the oven at 190°C for 20 minutes, until tender and dark golden brown.

## EQUIPMENT
- chopping board
- scale
- air fryer
- bowl
- pot
- measuring spoons
- knife

## INGREDIENTS
- eggs, 4.0 None
- aubergine, 300.0 g
- olive oil, 1.0 tbsp
- wholegrain pitta, 2.0 None
- cucumber, 160.0 g
