# Scraping a recipe website

## Metadata

- **Course**: ID Datavisualisatie
- **Workshop**: Datavis API Workshop
- **Lecturers**: Laura Benvenuti & Danny de Vries
- **University**: Amsterdam University of Applied Sciences
- **Programme**: Communication and Multimedia Design
- **Faculty**: Digital Media and Creative Industries

## Description

This notebook shows how to use Beautiful Soup to scrape data from a website, in this case from the HTML of a recipe website. We enter a URL and then scrape the ingredients and steps of the recipe. Based on examples from the book [Mining the Social Web](https://www.oreilly.com/library/view/mining-the-social/9781491973547/).

## Tasks

This notebook:

* Fetches the HTML of a recipe page from a URL.
* Parses the HTML with Beautiful Soup.
* Reads the JSON-LD data embedded in the page.
* Prints the ingredients and steps of the recipe.

### Check Python version

In [2]:
from packaging import version
import platform
import sys

min_version = '3.8'

def check_version(min_version):
    current_version = sys.version.split()[0]
    return version.parse(current_version) >= version.parse(min_version)

# Example usage:
if __name__ == "__main__":
    if check_version(min_version):
        print("Running a sufficiently new version of Python.")
        print("Current version: " + platform.python_version())
        print("Minimum required version: " + min_version)
    else:
        print("Python version is too old. Upgrade to a newer version.")

Running a sufficiently new version of Python.
Current version: 3.12.5
Minimum required version: 3.8

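The check above depends on the third-party `packaging` library. As a minimal sketch, assuming only the major and minor version numbers matter, the same comparison can be made with the standard library alone. The names `MIN_VERSION` and `is_supported` are illustrative and not part of the original notebook.

```python
import sys

# Same minimum as in the cell above, written as a (major, minor) tuple.
MIN_VERSION = (3, 8)


def is_supported(current=None, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    if current is None:
        # sys.version_info behaves like a tuple: (major, minor, micro, ...)
        current = sys.version_info[:2]
    return tuple(current) >= tuple(minimum)


if is_supported():
    print("Running a sufficiently new version of Python.")
else:
    print("Python version is too old. Upgrade to a newer version.")
```

Comparing tuples avoids parsing version strings, which is enough here because the minimum is expressed as a plain major.minor pair.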

### Install packages

In [3]:
!pip install requests
!pip install BeautifulSoup4
# html.parser ships with Python 3, so it needs no separate install
!pip install rdflib
!pip install cloudscraper

Collecting requests
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting charset-normalizer<4,>=2 (from requests)
  Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests)
  Downloading idna-3.8-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Downloading urllib3-2.2.2-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests)
  Downloading certifi-2024.7.4-py3-none-any.whl.metadata (2.2 kB)
Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Downloading certifi-2024.7.4-py3-none-any.whl (162 kB)
Downloading charset_normalizer-3.3.2-cp312-cp312-macosx_11_0_arm64.whl (119 kB)
Downloading idna-3.8-py3-none-any.whl (66 kB)
Downloading urllib3-2.2.2-py3-none-any.whl (121 kB)
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests
Successfully installed certifi-2024.7.4 charset-normalizer-3.3.2 idna-3.8 requests-2

### Import packages

In [4]:
import cloudscraper
import json
import requests
from bs4 import BeautifulSoup

### Parse data from a recipe URL

In [5]:
# A URL to a recipe
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

# Create a cloudscraper object
scraper = cloudscraper.create_scraper()

# Use it to fetch the HTML content
response = scraper.get(URL)

# Parse the HTML with Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

# Make a list of the top-level children of the 'soup' object
listchildren = list(soup.children)

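To make it clearer what `list(soup.children)` contains, here is a small sketch on a hypothetical, hand-written page rather than the real recipe site. It assumes `beautifulsoup4` is installed; `SAMPLE_HTML` and `soup_demo` are names invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, tiny page used only for illustration.
SAMPLE_HTML = (
    "<!DOCTYPE html><html><head><title>Demo</title></head>"
    "<body><p>Hi</p></body></html>"
)

soup_demo = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The top-level children here are the doctype and the <html> element.
child_types = [type(child).__name__ for child in soup_demo.children]
print(child_types)

# Targeted lookups are usually easier to read than the full child list.
print(soup_demo.title.string)
```

The top-level list is short because everything else is nested inside the `<html>` element, which is why printing it for a real page dumps almost the whole document at once.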
### Log the HTML of the web page

In [7]:
# Log all children (i.e. all HTML elements)
print(listchildren)

['html', <html lang="en-US">
<head><meta content="index, nofollow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
<!-- This site is optimized with the Yoast SEO Premium plugin v22.6 (Yoast SEO v22.6) - https://yoast.com/wordpress/plugins/seo/ -->
<title>Vegetarian Carbonara – A Couple Cooks</title>
<meta content="This vegetarian carbonara is quick and delicious; an egg yolk is stirred into the pasta to create a sauce, and smoked mozarella is used instead of bacon." name="description"/>
<link href="https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara/" rel="canonical"/>
<meta content="en_US" property="og:locale"/>
<meta content="article" property="og:type"/>
<meta content="Vegetarian Carbonara" property="og:title"/>
<meta content="This vegetarian carbonara is quick and delicious; an egg yolk is stirred into the pasta to create a sauce, and smoked mozarella is used instead of bacon." property="og:description"/>
<meta content="https://www.acou

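The full dump above is long and hard to scan. As a rough sketch, the structured data can be isolated by looking only for script tags of type `application/ld+json`, the media type used for JSON-LD. The sample HTML and the helper name `load_json_ld_blocks` are assumptions made for illustration; the real page is much larger and may contain several such blocks.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, heavily shortened page; not the real recipe site.
SAMPLE_HTML = """
<html><head>
<title>Example recipe</title>
<script type="application/ld+json" class="yoast-schema-graph">
{"@context": "https://schema.org",
 "@graph": [{"@type": "WebPage", "name": "Example recipe"}]}
</script>
</head><body><p>Hello</p></body></html>
"""


def load_json_ld_blocks(html):
    """Return the parsed content of every JSON-LD script tag in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all("script", attrs={"type": "application/ld+json"}):
        # tag.string is None for an empty tag, so skip those.
        if tag.string:
            blocks.append(json.loads(tag.string))
    return blocks


blocks = load_json_ld_blocks(SAMPLE_HTML)
print(len(blocks))
print(blocks[0]["@graph"][0]["@type"])
```

Matching on the `type` attribute rather than a plugin-specific class name is a design choice that may carry over to sites that do not use the same SEO plugin.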
### Read the JSON-LD and extract ingredients and steps

In [8]:
# Find the script tag with the JSON-LD data
json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

# Get the content of the script tag
script_content = json_ld_script.string

# Load the JSON data
data = json.loads(script_content)

# Get the recipe ingredients key from the graph
# (index 7 is where the Recipe node sits in this page's graph)
recipe_ingredients = data["@graph"][7]["recipeIngredient"]

# Print the list of ingredients
for ingredient in recipe_ingredients:
    print(ingredient)

1 pound spaghetti noodles
½ cup smoked mozzarella cheese
½ cup grated Parmesan cheese, plus more for serving
4 egg yolks
1 cup frozen Earthbound Farm Organic peas
8 cups Earthbound Farm Organic spinach
3 tablespoons butter
Kosher salt
Fresh ground black pepper

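The hard-coded index `7` only works while the Recipe node stays at that position in the graph. A more defensive sketch, assuming the JSON-LD keeps its `@graph` list, looks the node up by its `@type` instead. The sample data and the function name `find_recipe_node` are hypothetical; only two of the ingredients printed above are reused.

```python
import json

# Hypothetical, shortened JSON-LD document in the same "@graph" layout.
SAMPLE_JSON_LD = """
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "WebPage", "name": "Example page"},
    {"@type": ["Person", "Organization"], "name": "Example author"},
    {"@type": "Recipe",
     "name": "Example recipe",
     "recipeIngredient": ["1 pound spaghetti noodles", "4 egg yolks"]}
  ]
}
"""


def find_recipe_node(data):
    """Return the first node in data["@graph"] typed as Recipe, or None."""
    for node in data.get("@graph", []):
        node_type = node.get("@type", [])
        # "@type" may be a single string or a list of strings.
        if isinstance(node_type, str):
            node_type = [node_type]
        if "Recipe" in node_type:
            return node
    return None


sample_data = json.loads(SAMPLE_JSON_LD)
recipe = find_recipe_node(sample_data)

if recipe is None:
    print("No Recipe node found.")
else:
    for ingredient in recipe.get("recipeIngredient", []):
        print(ingredient)
```

Returning `None` when nothing matches lets the caller decide how to handle pages without recipe data, instead of failing with an `IndexError` or `KeyError`.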

In [9]:
# Look up the instructions of the recipe
recipe_instructions = data["@graph"][7]["recipeInstructions"]

# Each step is a dictionary; print its "text" field
for step in recipe_instructions:
    print(step["text"])

In a large pot, combine 6 quarts of water with 2 tablespoons kosher salt and bring it to a boil.
Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.
Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.
In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon kosher salt. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the mozzarella will stick together in some places).
To serve, top each pasta serving with a whole egg yolk and additional Parmesan cheese, and stir the yolk into the pasta at the table (if you are uncomfortable serving egg yolks at the table, stir the egg yolks into the pasta in the skillet to heat them through). Serve immediately. (Note that the mozzarella cheese can become gummy th