*****************************************************************
#  The Social Web: data representation
- Instructors: Davide Ceolin, Emma Beauxis-Aussalet.
- TAs: Zubaria Inayat, Maxim Sergeev, Zhuofan Mei, Alexander Schmatz, Ling Jin.
- Exercises for Hands-on session 2
*****************************************************************

In this session you are going to mine data in various microformats. You will see the differences in what each of the formats can contain and what purpose they serve. We will start by looking at geographical data.

Prerequisites:
- Python 3.8
- Python packages: requests, BeautifulSoup4, HTMLParser, rdflib


In [2]:
# If you're using a virtualenv, make sure it's activated before running
# this cell!
!pip install requests
!pip install BeautifulSoup4
!pip install HTMLParser
!pip install rdflib
!pip install cloudscraper


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To u

##  Exercise 1

Even if web pages do not use microformats, interesting data can often be extracted from the HTML. You can use packages such as BeautifulSoup to extract arbitrary pieces of data from any HTML page.
The example below shows how to find the URL of the first image in the infobox table of the Wikipedia page on Amsterdam. Tip: compare the code below with the HTML source of the Wikipedia page: the image URL is in the "src" attribute of the first "img" element inside the "table" element with class="infobox".

In [3]:
# -*- coding: utf-8 -*-

import requests
from bs4 import BeautifulSoup

# Fetch the Wikipedia page on Amsterdam; change URL to try another page.
URL = 'https://en.wikipedia.org/wiki/Amsterdam'

req = requests.get(URL, headers={'User-Agent' : "Social Web Course Student"})
# Name the parser explicitly so that BeautifulSoup does not have to guess one
soup = BeautifulSoup(req.text, 'html.parser')
# print(req.text)
# First <img> inside the first <table class="infobox">
image1 = soup.find_all('table', class_='infobox')[0].find('img')
print(image1['src'])

//upload.wikimedia.org/wikipedia/commons/thumb/5/57/Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png/268px-Imagen_de_los_canales_conc%C3%A9ntricos_en_%C3%81msterdam.png
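
The printed `src` value is protocol-relative (it starts with `//`), so it is not yet a complete URL. The following is a minimal sketch, assuming you want an absolute URL that can be fetched directly; `absolute_url` is a hypothetical helper name and the image path is a made-up placeholder.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Amsterdam"

def absolute_url(page_url, src):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(page_url, src)

# Made-up, shortened path for illustration only.
print(absolute_url(PAGE_URL, "//upload.wikimedia.org/example.png"))
```

`urljoin` reuses the scheme of the page URL, so the same helper also handles ordinary relative paths such as `/static/logo.png`.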


Next, we extract coordinates from a web page and reformat them in the geo microformat (based on Example 8-1 in Mining the Social Web). Note that Wikipedia pages may encode lat/long information in different ways. One of the ways used by the Amsterdam Wikipedia page is a span element that is not shown to the user: 
<span class="geo">52.367; 4.900</span>
This span element has a single child (len(geoTag) == 1) and no further structure, so we have to get the latitude and longitude manually by splitting the string on the semicolon ';'.

In [4]:

geoTag = soup.find(True, 'geo')
print(geoTag)

if geoTag and len(geoTag) > 1:
    # Structured form: separate child elements for latitude and longitude
    lat = geoTag.find(True, 'latitude').string
    lon = geoTag.find(True, 'longitude').string
    print('Location is at', lat, lon)
elif geoTag and len(geoTag) == 1:
    # Plain form: a single "lat; lon" string
    (lat, lon) = geoTag.string.split(';')
    (lat, lon) = (lat.strip(), lon.strip())
    print('Location is at', lat, lon)
else:
    print('Location not found')


<span class="geo">52.37278; 4.89361</span>
Location is at 52.37278 4.89361
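
The values extracted above are still strings. As a minimal sketch, assuming you want numbers you can validate before writing them to another format, the plain "lat; lon" form could be parsed as follows; `parse_geo_string` is a hypothetical helper, not part of BeautifulSoup.

```python
def parse_geo_string(text):
    """Split a 'lat; lon' string and return the two values as floats."""
    lat_text, lon_text = text.split(";")
    lat, lon = float(lat_text.strip()), float(lon_text.strip())
    # Reject values that cannot be valid coordinates.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return lat, lon

print(parse_geo_string("52.37278; 4.89361"))
```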


### Task 1

Can you convert the output of Exercise 1 into KML? Here is the KML documentation: https://developers.google.com/kml/documentation/?csw=1 and here you can find a simple example of how it is used: https://renenyffenegger.ch/notes/tools/Google-Earth/kml/index

Visualise the point in Google Maps using the following code example: https://developers.google.com/maps/documentation/javascript/examples/layer-kml-features
You will have to create your own KML file for the custom map layer and provide a URL to the KML file inside the JavaScript code, which means that you have to upload the file somewhere. You can use a service like http://pastebin.com/ to obtain a URL for your KML file: paste the code there, request the RAW format URL, and use that URL in this task. If that fails, you can also use a KML viewer website such as https://kmzview.com/.

Is KML a microformat? Why (not)?

In [5]:
!pip3 install lxml


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


In [6]:
KML_FILENAME = "./A2_Task1_KML.kml"

task1_soup = BeautifulSoup('<?xml version="1.0" encoding="UTF-8"?><kml xmlns="http://www.opengis.net/kml/2.2"></kml>' ,features='xml')

task1_soup.find('kml').append(task1_soup.new_tag("Placemark"))

# Make name tag
task1_soup.find("Placemark").append(task1_soup.new_tag("name"))
task1_soup.find("name").append(task1_soup.new_string("Our New Social Web Placemark"))

# Make description
task1_soup.find("Placemark").append(task1_soup.new_tag("description"))
task1_soup.find("description").append(task1_soup.new_string("Oh look it works!"))

# Make point
task1_soup.find("Placemark").append(task1_soup.new_tag("Point"))

# Put coordinates in point: KML expects "longitude,latitude,altitude" without spaces
task1_soup.find("Point").append(task1_soup.new_tag("coordinates"))
task1_soup.find("coordinates").append(task1_soup.new_string(f'{lon},{lat},0'))

print(task1_soup.prettify())
with open(KML_FILENAME, "w") as f:
    f.write(task1_soup.prettify())


<?xml version="1.0" encoding="utf-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
 <Placemark>
  <name>
   Our New Social Web Placemark
  </name>
  <description>
   Oh look it works!
  </description>
  <Point>
   <coordinates>
    4.89361,52.37278,0
   </coordinates>
  </Point>
 </Placemark>
</kml>



KML is not a microformat. Microformats are a set of simple, open data format standards that reuse HTML tags and attributes to add metadata to elements of a web page, whereas KML is a standalone XML-based standard for displaying geographic data in an Earth browser, independent of HTML documents, with its own tags and structure.
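
Because KML is plain XML, it can also be generated without BeautifulSoup. The following is an alternative sketch using the standard library's `xml.etree.ElementTree`; `build_kml` is a hypothetical helper name. Note that KML coordinates are written as longitude,latitude,altitude with no spaces between the values.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def build_kml(name, lat, lon):
    """Return a KML document (as a string) with a single Placemark."""
    ET.register_namespace("", KML_NS)  # serialise KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    placemark = ET.SubElement(kml, f"{{{KML_NS}}}Placemark")
    ET.SubElement(placemark, f"{{{KML_NS}}}name").text = name
    point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
    # Order is longitude,latitude,altitude, with no spaces in between.
    ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat},0"
    return ET.tostring(kml, encoding="unicode")

print(build_kml("Amsterdam", "52.37278", "4.89361"))
```

Unlike `prettify()`, this writes the text of each element without surrounding whitespace, which some KML viewers may handle more reliably.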

## Exercise 2 
To find information on the web we can use microformats such as [hRecipe](https://microformats.org/wiki/hrecipe) or Schema.org's [Recipe](https://schema.org/Recipe). But first, we'll show you how to find arbitrary tags in a web page.


### Task 2 
Parsing data for a <sub><sup>veggie</sup></sub> spaghetti alla carbonara recipe (from Example 2-7 in Mining the Social Web).

In [7]:
import cloudscraper
import json
from bs4 import BeautifulSoup

# A yummy webpage (feel free to change to your likings.)
URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

# Create a CloudScraper object
scraper = cloudscraper.create_scraper()

# Use the CloudScraper object to fetch the HTML content
response = scraper.get(URL)

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Now you can work with the 'soup' object as you did before
listchildren = list(soup.children)
#print(listchildren)


We can find any element in the page through *CSS selectors*.
You can find them all [here](https://www.w3schools.com/cssref/css_selectors.asp), but in short: "." selects classes, "#" selects ids, and plain text selects elements by name.


You can also combine them: ".class1.class2" selects all elements that have both classes. For a deeper overview, please check the link above (or google "CSS selectors"). 
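
The selector types described above can be tried out on a small document first. This is a minimal sketch on made-up HTML (the ids and class names are assumptions for illustration), using BeautifulSoup's `select` and `select_one`.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustrating the selectors.
html = """
<div id="main">
  <p class="intro highlight">First</p>
  <p class="intro">Second</p>
  <span class="highlight">Third</span>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

print([el.text for el in demo_soup.select("p")])                 # by element name
print([el.text for el in demo_soup.select(".intro")])            # by class
print([el.text for el in demo_soup.select(".intro.highlight")])  # both classes
print(demo_soup.select_one("#main").name)                        # by id
```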

In [8]:
print(len(listchildren)) # we can see here how many children the html doc has got.
title_unparsed = soup.select_one("title")
#show the title element
print(title_unparsed)

2
<title>Vegetarian Carbonara – A Couple Cooks</title>


Not so pretty... Use the text attribute.

In [9]:
print(title_unparsed.text)

Vegetarian Carbonara – A Couple Cooks


The website has a block of JSON-LD data embedded. Try to find it in the soup object.
We can load the JSON-LD script with the json module to work with it more easily.
Let's get a list of the ingredients.

In [10]:
# Find the script tag containing the JSON-LD data
json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

# Extract the content of the script tag
script_content = json_ld_script.string

# Load the JSON data from the script content
data = json.loads(script_content)

# Access the "recipeIngredient" list
recipe_ingredients = data["@graph"][7]["recipeIngredient"]

# Print the list of ingredients
for ingredient in recipe_ingredients:
    print(ingredient)

1 pound spaghetti noodles
½ cup smoked mozzarella cheese
½ cup grated Parmesan cheese, plus more for serving
4 egg yolks
1 cup frozen Earthbound Farm Organic peas
8 cups Earthbound Farm Organic spinach
3 tablespoons butter
Kosher salt
Fresh ground black pepper
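
The index `7` used above depends on where the recipe happens to sit inside `@graph` on this particular page. A more robust sketch, assuming the recipe node is marked with `"@type": "Recipe"` as in schema.org, looks the node up by type instead; `find_recipe_node` is a hypothetical helper and the sample document is made up.

```python
import json

def find_recipe_node(data):
    """Return the first node in data["@graph"] whose @type includes "Recipe"."""
    for node in data.get("@graph", []):
        node_type = node.get("@type", [])
        # "@type" may be a single string or a list of strings.
        types = [node_type] if isinstance(node_type, str) else node_type
        if "Recipe" in types:
            return node
    return None

# Tiny made-up document standing in for the real script content.
sample = json.loads("""
{"@graph": [
  {"@type": "Article", "headline": "Vegetarian Carbonara"},
  {"@type": "Recipe", "recipeIngredient": ["4 egg yolks", "3 tablespoons butter"]}
]}
""")

recipe_node = find_recipe_node(sample)
print(recipe_node["recipeIngredient"])
```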


Let's also print out the instructions.

In [11]:
recipe_instructions= data["@graph"][7]["recipeInstructions"]
#the instructions list contains dictionaries as elements, take a look at how the list is organized
for step in recipe_instructions:
    print(step["text"])

In a large pot, combine 6 quarts of water with 2 tablespoons kosher salt and bring it to a boil.
Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.
Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.
In a skillet, melt the butter, then stir in the cheeses, ¼ cup pasta water, and ¼ teaspoon kosher salt. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the mozzarella will stick together in some places).
To serve, top each pasta serving with a whole egg yolk and additional Parmesan cheese, and stir the yolk into the pasta at the table (if you are uncomfortable serving egg yolks at the table, stir the egg yolks into the pasta in the skillet to heat them through). Serve immediately. (Note that the mozzarella cheese can become gummy th

Different websites structure their data differently. Look at the following JSON-LD snippet, in which the instructions are stored as a single HTML string.

In [12]:
json_example = {
    "title": "The anarchist cookbook",
    "recipeInstructions": "<ol class=\"recipeSteps\"><li>Cook the linguine according to the packet instructions. </li><li>Meanwhile, carefully crack the eggs into a small bowl and beat them with a fork. Season with a little black pepper, then stir in the ricotta finely grate in most of the lemon zest. </li><li>When the pasta has 3 minutes left, add the peas. Reserve a little cooking water, then drain the linguine and peas, and return to the pan. </li><li>Stir in the egg mixture and spinach with a wooden spoon – they'll cook gently in the residual heat. Add a little pasta water to loosen, if needed. </li><li>Share between bowls and serve with a green salad.</li></ol>",
    "ingredients": ["a lot of effort", "the right mindset"]
}

recipe_instructions = json_example["recipeInstructions"]
example_soup = BeautifulSoup(recipe_instructions, 'html.parser')

In [13]:
#to get a nice and clean list of the instructions, step by step
#we can use the find method to get the first "ol" element with attribute "class.." and then use find_all to get all list elements in there
#then we can strip the list items to obtain the instructions
list_items = example_soup.find('ol', class_='recipeSteps').find_all('li')
instructions = [item.get_text(strip=True) for item in list_items]
print(instructions)

['Cook the linguine according to the packet instructions.', 'Meanwhile, carefully crack the eggs into a small bowl and beat them with a fork. Season with a little black pepper, then stir in the ricotta finely grate in most of the lemon zest.', 'When the pasta has 3 minutes left, add the peas. Reserve a little cooking water, then drain the linguine and peas, and return to the pan.', "Stir in the egg mixture and spinach with a wooden spoon – they'll cook gently in the residual heat. Add a little pasta water to loosen, if needed.", 'Share between bowls and serve with a green salad.']
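
The two layouts seen so far (a list of dictionaries with a "text" key, and a single HTML string) can be handled by one function. The following is a sketch under that assumption; `normalize_instructions` is a hypothetical helper, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

class _ListItemText(HTMLParser):
    """Collect the text of every <li> element in an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.items[-1] += data

def normalize_instructions(value):
    """Return a list of plain-text steps for either layout."""
    if isinstance(value, str):  # HTML string, as in json_example
        parser = _ListItemText()
        parser.feed(value)
        parser.close()
        return [item.strip() for item in parser.items]
    # Otherwise: a list of dicts with a "text" key (or of plain strings)
    return [step["text"] if isinstance(step, dict) else step for step in value]

print(normalize_instructions("<ol><li>Boil water. </li><li>Add pasta.</li></ol>"))
print(normalize_instructions([{"text": "Boil water."}, {"text": "Add pasta."}]))
```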


### Task 2.1
Now it's your turn. Create a function that can scrape any recipe web page from the same website (other websites will use different class names). 

Make sure to:

- return itemized content (e.g. ingredients) in a list. You may want to use a list comprehension here.
- return human-readable content. Not all items have been cleaned of their HTML markup (see the variables ```ingredients``` vs. ```instructions_unparsed```), so strip it, e.g. by using the ```.text``` attribute.


In [27]:
#Here you can see the solution for our example website

URL = "https://www.acouplecooks.com/spring-vegetarian-spaghetti-carbonara"

def parse_website(url):
    # Create a CloudScraper object
    scraper = cloudscraper.create_scraper()

    # Use the CloudScraper object to fetch the HTML content
    # (use the url parameter, not the global URL)
    response = scraper.get(url)

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    #Get the title
    title_unparsed = soup.select_one("title")
    fn = title_unparsed.text
    
    json_ld_script = soup.find("script", {"class": "yoast-schema-graph"})

    # Extract the content of the script tag
    script_content = json_ld_script.string

    # Load the JSON data from the script content
    data = json.loads(script_content)

    # Access the "recipeIngredient" list of the recipe node
    # (index 7 on this page; the position may differ on other pages)
    recipe_ingredients = data["@graph"][7]["recipeIngredient"]
    
    ingredients = [ingredient for ingredient in recipe_ingredients]
    
    #Access the instructions
    recipe_instructions= data["@graph"][7]["recipeInstructions"]
    #the instructions list contains dictionaries as elements, take a look at how the list is organized
    instructions = [step["text"] for step in recipe_instructions]

    return {'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print(json.dumps(recipe, indent=2))


{
  "name": "Vegetarian Carbonara \u2013 A Couple Cooks",
  "ingredients": [
    "1 pound spaghetti noodles",
    "\u00bd cup smoked mozzarella cheese",
    "\u00bd cup grated Parmesan cheese, plus more for serving",
    "4 egg yolks",
    "1 cup frozen Earthbound Farm Organic peas",
    "8 cups Earthbound Farm Organic spinach",
    "3 tablespoons butter",
    "Kosher salt",
    "Fresh ground black pepper"
  ],
  "instructions": [
    "In a large pot, combine 6 quarts of water with 2 tablespoons kosher salt and bring it to a boil.",
    "Grate the Parmesan and mozzarella cheese. Carefully separate four egg yolks and set aside.",
    "Once boiling, add the pasta and cook until the pasta is just about al dente, about 7 minutes; then add peas and spinach and cook for 1 minute. Reserve 1 cup cooking water, and then drain the pasta and vegetables.",
    "In a skillet, melt the butter, then stir in the cheeses, \u00bc cup pasta water, and \u00bc teaspoon kosher salt. Stir in the pasta and vegetables until creamy over low heat, adding more pasta water if necessary (note that the mozzarella will stick together in some places).",
    "To serve, top each pasta serving with a whole egg yolk and additional Parmesan cheese, and stir the yolk into the pasta at the table (if you are uncomfortable serving egg yolks at the table, stir the egg yolks into the pasta in the skillet to heat them through). Serve immediately. (Note that the mozzarella cheese can become gummy th

In [28]:
# -*- coding: utf-8 -*-

import json

import cloudscraper
from bs4 import BeautifulSoup

# Pass in a URL containing hRecipe, such as
# https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/

URL = "https://www.jamieoliver.com/recipes/pasta-recipes/veggie-carbonara/"

# Parse out some of the pertinent information for a recipe.
# See http://microformats.org/wiki/hrecipe.

#Solution for jamie oliver
def parse_website(url):
    # Create a CloudScraper object
    scraper = cloudscraper.create_scraper()

    # Use the CloudScraper object to fetch the HTML content
    response = scraper.get(url)

    # Parse the HTML content with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    #Get the title
    title_unparsed = soup.select_one("title")
    fn = title_unparsed.text
    
    json_ld_script = soup.find("script", {'type':'application/ld+json'})

    # Extract the content of the script tag
    script_content = json_ld_script.string

    # Load the JSON data from the script content
    data = json.loads(script_content)

    # Access the "recipeIngredient" list
    recipe_ingredients = data["recipeIngredient"]
    ingredients = [ingredient.strip() for ingredient in recipe_ingredients]
    
    #Access the instructions
    # On this website the instructions are a single HTML string (an <ol> list),
    # not a list of dictionaries, so we parse them with BeautifulSoup
    recipe_instructions= data["recipeInstructions"]
    list_items = BeautifulSoup(recipe_instructions, 'html.parser').find('ol').find_all('li')
    instructions = [item.get_text(strip=True) for item in list_items]

    return {
            'name': fn,
            'ingredients': ingredients,
            'instructions': instructions,
            }
    
recipe = parse_website(URL)
print(json.dumps(recipe, indent=2))

{
  "name": "Vegetarian carbonara recipe | Jamie Oliver pasta recipes",
  "ingredients": [
    "400 g dried linguine",
    "4 large free-range eggs",
    "2 tablespoons soft ricotta cheese",
    "1  lemon",
    "100 g fresh or frozen peas",
    "100 g baby spinach"
  ],
  "instructions": [
    "Cook the linguine according to the packet instructions.",
    "Meanwhile, carefully crack the eggs into a small bowl and beat them with \na fork. Season with a little black pepper, then stir in the ricotta finely grate in most of the lemon zest.",
    "When the pasta has 3 minutes left, add the peas. Reserve a little cooking water, then drain the linguine and peas, and return to the pan.",
    "Stir in the egg mixture and spinach with a wooden spoon \u2013 they\u2019ll cook gently in the residual heat. Add a little pasta water to loosen, if needed.",
    "Share between bowls and serve with a green salad."
  ]
}


But how can we get information not only from one website, but from all of them?

The answer: microformats.

Rather than extracting the information manually from the schema.org or hRecipe markup, we can use a package, ```scrape-schema-recipe```.

Feel free to experiment with it. 

### Task 2.2
hRecipe is a microformat specifically created for recipes.
Can you, for example, easily compare the ingredients of different dessert recipes? For inspiration, look back at the exercises in Hands-on session 1, where you compared different sets of tweets.

In [16]:
!pip3 install scrape-schema-recipe


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


In [2]:
from prettytable import PrettyTable
import scrape_schema_recipe
import string
from collections import Counter

# copy the ingredients of a recipe into a plain list
def get_ingredients(ingredients):
    return list(ingredients)



url = 'https://www.jamieoliver.com/recipes/butternut-squash-recipes/ultimate-pumpkin-pie/'
recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
recipe1 = recipe_list[0]

url = 'https://www.jamieoliver.com/recipes/fruit-recipes/frozen-sour-cherry-berry-pie/'
recipe_list = scrape_schema_recipe.scrape_url(url, python_objects=True)
recipe2 = recipe_list[0]


# draw the table with the ingredients side by side
# (zip() stops at the end of the shorter ingredient list)
pt = PrettyTable(field_names=['Recipe1', 'Recipe2'])
for row in zip(recipe1['recipeIngredient'], recipe2['recipeIngredient']):
    pt.add_row(row)
print(pt)


+---------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+
|                                              Recipe1                                              |                                         Recipe2                                         |
+---------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+
| 1 large squash such as butternut, Crown Prince, onion or pumpkin, (or 1 x 425g tin pumpkin purée) |                         400 g plain flour plus extra for dusting                        |
|                                         300 g plain flour                                         |                                    75 g icing sugar                                     |
|                                       
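
The table places the ingredients side by side but does not yet compare them. One possible approach, sketched here under the assumption that a word-level comparison is good enough, counts the words in each ingredient list and intersects the sets. The stopword list and the helper name `ingredient_words` are assumptions, and the sample data are shortened rows from the table above.

```python
import re
from collections import Counter

# Units and filler words to ignore; this list is an assumption, extend as needed.
STOPWORDS = {"g", "x", "or", "as", "such", "for", "plus", "extra", "large", "tin"}

def ingredient_words(ingredients):
    """Count the lower-cased alphabetic words across a list of ingredient strings."""
    words = []
    for line in ingredients:
        words += [w for w in re.findall(r"[a-z]+", line.lower()) if w not in STOPWORDS]
    return Counter(words)

# Shortened first rows of the table above, used here as sample data.
recipe1_ingredients = ["1 large squash such as butternut", "300 g plain flour"]
recipe2_ingredients = ["400 g plain flour plus extra for dusting", "75 g icing sugar"]

words1 = ingredient_words(recipe1_ingredients)
words2 = ingredient_words(recipe2_ingredients)
print("Shared:", sorted(set(words1) & set(words2)))
print("Only in recipe 1:", sorted(set(words1) - set(words2)))
```

With real data you would pass `recipe1['recipeIngredient']` and `recipe2['recipeIngredient']` instead of the sample lists.
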

## Exercise 3

Schema.org is one of the most widely used annotation formats. It is a multipurpose vocabulary created by a consortium consisting of Yahoo!, Google and Microsoft, and it can describe entities, events, products, etc. Check out the vocabulary specs on Schema.org.

### Task 3

Parsing schema.org microdata. To parse this data you need the rdflib package, which you installed in one of the previous steps. The example below loads RDF data from DBpedia.



In [18]:
from rdflib import Graph

# Source: https://www.youtube.com/watch?v=sCU214rbRZ0
# Pass in the URL of a resource that is available as RDF, here from DBpedia
URL = "http://dbpedia.org/resource/Micheal_Jackson"

# Initialize a graph
g = Graph()

# Parse the RDF data behind the URL into the graph
result = g.parse(location=URL)

# Loop through the first 10 triples in the graph
for index, (sub, pred, obj) in enumerate(g):
    if index == 10:
        break
    print(sub, pred, obj)

http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRedirects http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageWikiLink http://dbpedia.org/resource/Michael_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://www.w3.org/2000/01/rdf-schema#label Micheal Jackson
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageLength 68
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageID 14995602
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org/ontology/wikiPageRevisionID 1056738079
http://dbpedia.org/resource/Micheal_Jackson http://xmlns.com/foaf/0.1/isPrimaryTopicOf http://en.wikipedia.org/wiki/Micheal_Jackson
http://dbpedia.org/resource/Micheal_Jackson http://www.w3.org/ns/prov#wasDerivedFrom http://en.wikipedia.org/wiki/Micheal_Jackson?oldid=1056738079&ns=0
http://dbpedia.org/resource/Micheal_Jackson http://dbpedia.org

In [19]:
# Print the size of the Graph
print(f'Graph has {len(g)} facts')

Graph has 9 facts


In [20]:
# Print out the entire Graph in the RDF Turtle format
print(g.serialize(format='ttl'))

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns1: <http://dbpedia.org/ontology/> .
@prefix ns2: <http://dbpedia.org/property/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://dbpedia.org/resource/Micheal_Jackson> rdfs:label "Micheal Jackson"@en ;
    ns1:wikiPageID 14995602 ;
    ns1:wikiPageLength "68"^^xsd:nonNegativeInteger ;
    ns1:wikiPageRedirects <http://dbpedia.org/resource/Michael_Jackson> ;
    ns1:wikiPageRevisionID 1056738079 ;
    ns1:wikiPageWikiLink <http://dbpedia.org/resource/Michael_Jackson> ;
    ns2:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:R_from_misspelling> ;
    prov:wasDerivedFrom <http://en.wikipedia.org/wiki/Micheal_Jackson?oldid=1056738079&ns=0> ;
    foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Micheal_Jackson> .




### Task 3.1 
Compare the schema.org information about a band on last.fm to the Open Graph information about the same band on Facebook. What are the differences? Which format do you think supports better interoperability? In particular, refer to the microformat specifications indicated in the box in the top right corner.

In [21]:
!pip install extruct


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


In [22]:
# Data from LastFM

import extruct
import requests

LASTFM_URL = "https://www.last.fm/music/Pinguini+Tattici+Nucleari"
req_lastfm = requests.get(LASTFM_URL, headers={'User-Agent' : "Social Web Course Student"})
metadata_lastfm = extruct.extract(htmlstring_or_tree=req_lastfm.text)

# Go through all the items
for item_lastfm in metadata_lastfm["microdata"]:
    print(item_lastfm["type"])
    
    if 'properties' in item_lastfm:
        props = item_lastfm['properties']
        
        # Go through all the info about the entry
        for prop in props:
            
            # Go through the albums
            if prop == 'album':
                print("- Albums:")
                for album in props[prop]:
                    print(f'  - {album["properties"]["name"]} ({album["properties"]["url"]})')
                    
            # Go through the tracks
            elif prop == 'track':
                print("- Tracks:")
                for track in props[prop]:
                    print(f'  - {track["properties"]["name"]}')
            
            else:
                print(f'- {prop}: {props[prop]}')
    



http://schema.org/MusicGroup
- image: https://lastfm.freetls.fastly.net/i/u/ar0/3c5c0026bea779f2fd2635b714458935.jpg
- name: Pinguini Tattici Nucleari
- url: /music/Pinguini+Tattici+Nucleari
- Albums:
  - Rubami La Notte - Single (/music/Pinguini+Tattici+Nucleari/Rubami+La+Notte+-+Single)
  - AHIA! (/music/Pinguini+Tattici+Nucleari/AHIA!)
  - Fuori Dall'Hype Ringo Starr (/music/Pinguini+Tattici+Nucleari/Fuori+Dall%27Hype+Ringo+Starr)
  - Fake News (/music/Pinguini+Tattici+Nucleari/Fake+News)
  - Fuori dall'hype (/music/Pinguini+Tattici+Nucleari/Fuori+dall%27hype)
- Tracks:
  - Rubami la Notte
  - Rubami la Notte
  - Pastello Bianco
  - Ricordi
  - Ridere
  - Giovani Wannabe
  - Scrivile Scemo
  - Ringo Starr
  - Scooby Doo
  - Coca zero
  - Hold On
http://schema.org/MusicGroup
- name: GAZZELLE
- url: /music/GAZZELLE
http://schema.org/MusicGroup
- name: Cesare Cremonini
- url: /music/Cesare+Cremonini
http://schema.org/MusicGroup
- name: Tananai
- url: /music/Tananai
http://schema.org/Mu

In [23]:

import extruct

FB_URL = "https://www.facebook.com/pinguinitattici/"
req_fb = requests.get(FB_URL, headers={'User-Agent' : "Social Web Course Student"})
metadata_fb = extruct.extract(htmlstring_or_tree=req_fb.text)

#Go through all the items
for item_fb in metadata_fb["opengraph"]:
    print("Raw:", item_fb)
    for props in item_fb:
        if props == 'properties':
            print("Properties: ")
            for prop in item_fb[props]:
                print(f'- {prop[0]}: {prop[1]}')
        else:
            print(f'{props}: {item_fb[props]}')

    

Raw: {'namespace': {'og': 'http://ogp.me/ns#'}, 'properties': [('og:type', 'video.other'), ('og:title', 'Pinguini Tattici Nucleari'), ('og:description', 'Pinguini Tattici Nucleari. 166.260 vind-ik-leuks · 236 personen praten hierover. I Pinguini Tattici Nucleari sono: Riccardo Zanotti, Matteo Locati, Simone Pagani, Elio Biffi, Lorenzo Pasini e Nicola...'), ('og:url', 'https://www.facebook.com/pinguinitattici'), ('og:image:alt', 'Pinguini Tattici Nucleari'), ('og:image', 'https://scontent-ams2-1.xx.fbcdn.net/v/t39.30808-1/375969548_871318244353138_5791655238800008151_n.jpg?stp=dst-jpg_p720x720&_nc_cat=111&ccb=1-7&_nc_sid=5f2048&_nc_ohc=ecXy-BU8N4wAX87q0E4&_nc_ht=scontent-ams2-1.xx&oh=00_AfD2c4ShaTlxY9LPIF51kGjYUnIAo1otgWoSv6I7wBG59g&oe=6556C2D6'), ('og:locale', 'en_US')]}
namespace: {'og': 'http://ogp.me/ns#'}
Properties: 
- og:type: video.other
- og:title: Pinguini Tattici Nucleari
- og:description: Pinguini Tattici Nucleari. 166.260 vind-ik-leuks · 236 personen praten hierover. I Ping
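
As the raw output shows, the Open Graph properties arrive as a list of (property, value) pairs. For comparing them with the last.fm microdata, a dictionary can be more convenient. The following is a minimal sketch; `opengraph_to_dict` is a hypothetical helper, and the sample item is a shortened copy of the output above with a made-up second image added to show how repeated properties are handled.

```python
def opengraph_to_dict(item):
    """Turn [(property, value), ...] pairs into a dict.

    Properties that occur more than once are collected into a list.
    """
    result = {}
    for key, value in item.get("properties", []):
        if key in result:
            existing = result[key]
            result[key] = existing + [value] if isinstance(existing, list) else [existing, value]
        else:
            result[key] = value
    return result

# Shortened sample; the example.org image URLs are made up.
sample_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:type", "video.other"),
        ("og:title", "Pinguini Tattici Nucleari"),
        ("og:image", "https://example.org/a.jpg"),
        ("og:image", "https://example.org/b.jpg"),
    ],
}
og = opengraph_to_dict(sample_item)
print(og["og:title"])
print(og["og:image"])
```
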

### Answer
Last.fm provides more information and detail than the band's Facebook page. However, this could be attributed to last.fm's specialization in music compared to Facebook's more generic purpose. Therefore, even though the Open Graph data contains less information, this does not necessarily mean that Open Graph is a worse format than schema.org microdata.

### Task 3.2
Explore the various microformats at http://microformats.org/ and compare the output of the exercises with the output of http://microformats.org/. Think about possible microformats you want to support in your final assignment and read up on how to parse them.

### Answer

In [32]:
{
  "items": [
    {
      "type": [
        "h-recipe"
      ],
      "properties": {
        "name": [
          "Bagels"
        ],
        "ingredient": [
          "Flour",
          "Sugar",
          "Yeast"
        ],
        "yield": [
          "4"
        ],
        "instructions": [
          {
            "value": "Start by mixing all the ingredients together.",
            "html": "<ol>       <li>Start by mixing all the ingredients together.</li>     </ol>"
          }
        ]
      }
    }
  ]
}

{'items': [{'type': ['h-recipe'],
   'properties': {'name': ['Bagels'],
    'ingredient': ['Flour', 'Sugar', 'Yeast'],
    'yield': ['4'],
    'instructions': [{'value': 'Start by mixing all the ingredients together.',
      'html': '<ol>       <li>Start by mixing all the ingredients together.</li>     </ol>'}]}}]}

The code above is an example taken from https://microformats.org/wiki/h-recipe. When we compare it to the Vegetarian Carbonara recipe of Task 2, which uses https://schema.org/Recipe, the schema.org format appears to offer far more information than the microformats.org h-recipe. For example, schema.org also has properties such as totalTime and estimatedCost, which could be useful.
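
To make the comparison concrete, the parsed h-recipe item can be mapped onto schema.org property names. This is a sketch only: the property names on both sides come from the two vocabularies, but the mapping is an illustrative assumption that covers just the four properties in the example, and `h_recipe_to_schema` is a hypothetical helper.

```python
# Illustrative mapping from h-recipe property names to schema.org Recipe names.
H_RECIPE_TO_SCHEMA = {
    "name": "name",
    "ingredient": "recipeIngredient",
    "yield": "recipeYield",
    "instructions": "recipeInstructions",
}

def h_recipe_to_schema(item):
    """Convert one parsed h-recipe item into a flat schema.org-style dict."""
    recipe = {"@type": "Recipe"}
    for prop, values in item["properties"].items():
        target = H_RECIPE_TO_SCHEMA.get(prop)
        if target is None:
            continue  # property not covered by this sketch
        # Embedded values are dicts with "value" and "html"; keep the plain text.
        plain = [v["value"] if isinstance(v, dict) else v for v in values]
        # Ingredients stay a list; other properties are unwrapped when single-valued.
        recipe[target] = plain if target == "recipeIngredient" or len(plain) > 1 else plain[0]
    return recipe

h_recipe_item = {
    "type": ["h-recipe"],
    "properties": {
        "name": ["Bagels"],
        "ingredient": ["Flour", "Sugar", "Yeast"],
        "yield": ["4"],
        "instructions": [{"value": "Start by mixing all the ingredients together.",
                          "html": "<ol><li>Start by mixing all the ingredients together.</li></ol>"}],
    },
}
print(h_recipe_to_schema(h_recipe_item))
```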