# Scraping the Guggenheim Bilbao Museum website

This notebook creates a dataset containing information about the different paintings in the [Guggenheim Bilbao Museum collection](https://www.guggenheim-bilbao.eus/en/the-collection/works). 

![collection](https://www.dropbox.com/s/3abnf25pgxrrneb/collection.png?raw=1)

### Getting the basics

Head over to the provided link and take a look at the information displayed there. For each work, the website shows the name of the <b>author</b>, the <b>title</b> of the work, the production <b>year</b> and a small <b>image</b>. These data will be our first target. Scraping will allow you to retrieve this information directly from the webpage's code. To do so **you will need an inspector**. 

#### Inspecting a webpage

Today, most internet browsers come with specific tools that allow you to inspect the content in any webpage. Below you can find instructions on how to open the inspection tool for different internet browsers and operating systems.

- **Google Chrome**. To open the inspection tool press ```ctrl``` + ```shift``` + ```i``` (Windows) or ```alt``` + ```cmd``` + ```c``` (Mac) on your keyboard. Alternatively, you can right-click on the website itself and select ```inspect```. Finally, you can access the inspection tool by clicking on ```View```, ```Developer```, ```Inspect elements``` in the top menu.

- **Safari**. To open the inspection tool press ```alt``` + ```cmd``` + ```i``` (Mac) on your keyboard. Alternatively, you can right-click on the website itself and select ```Inspect Element```. Finally, you can access the inspection tool by clicking on ```Develop```, ```Show Web Inspector``` in the top menu.

- **Firefox**. To open the inspection tool press ```ctrl``` + ```shift``` + ```i``` (Windows) or ```alt``` + ```cmd``` + ```c``` (Mac) on your keyboard. Alternatively, you can right-click on the website itself and select ```Inspect Element```. Finally, you can access the inspection tool by clicking on ```Tools```, ```Web Developer```, ```Inspector``` in the top menu.

#### Identifying content tags

The inspector will show you the source HTML code of the website that you are visiting. Note the different HTML tags and their CSS classes: each tag wraps a different block of content. **If you hover your mouse over the code, you'll notice that different areas of the website get highlighted**. This is the inspector helping you identify the specific pieces of code that correspond to each element in the layout. The inspector also lets you navigate the different tag levels: clicking on a tag expands it and shows the more specific content in its inner tags.

<img src="https://www.dropbox.com/s/gt7rf5lbtb8rfx3/Captura%20de%20pantalla%202022-10-31%20a%20las%2011.16.02.png?dl=1" width="600">

Much of the work of building a web scraper is about identifying the different pieces of content in the code. The inspector can help us locate and identify the different pieces of information that we want to extract, but to retrieve this information we will first need to extract the HTML code from the website.


### A. Extracting painting information
#### 1. Extracting a webpage

To retrieve the full HTML code for the Guggenheim Bilbao Museum Work collection website, we are going to use the ```requests``` library. 

*Note that websites are hosted on remote servers. Making a default GET request to a website's server amounts to asking it for the page's HTML code.*

In [3]:
import requests
source = requests.get("https://www.guggenheim-bilbao.eus/en/the-collection/works")

In [4]:
source

<Response [200]>

Printing `source` shows a <i>status code</i>. A 200 means that your request was handled successfully. You can access the actual body of the response (the HTML code) through either the `text` attribute, which returns a string, or the `content` attribute, which returns raw bytes.

In [5]:
source.text

'<!doctype html><html lang="en" itemscope itemtype="http://schema.org/WebPage"><head><meta charset="utf-8"/><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no,shrink-to-fit=no"/><meta name="theme-color" content="#4cb6cc"/><meta name="msapplication-TileColor" content="#4cb6cc"/><link rel="manifest" href="/manifest.json"/><link rel="shortcut icon" href="/favicon.ico"/><link rel="apple-touch-icon-precomposed" sizes="16x16" href="/icons/icon-16x16.png"/><link rel="apple-touch-icon-precomposed" sizes="32x32" href="/icons/icon-32x32.png"/><link rel="apple-touch-icon-precomposed" sizes="180x180" href="/icons/icon-180x180.png"/><link rel="icon" type="image/png" sizes="192x192" href="/icons/icon-192x192.png"/><link rel="apple-touch-icon-precomposed" sizes="512x512" href="/icons/icon-512x512.png"/><title data-react-helmet="true">Works | The Collection | Guggenheim Museum Bilbao</title><meta data-react-helmet="true" property="og:site_name" content="

The response body includes the full HTML code for the provided url. Note, however, that the code comes as a raw `str`. In this format, we will hardly be able to make sense of it. To make it more comprehensible, we need to parse it.

Parsing refers to the process of converting data from one format into another that is more suitable for the task at hand. You can think of parsing as a synonym for *putting data in a nicer format*.
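
The request above assumes that the server answers promptly and with a 200. Below is a minimal sketch of a slightly more defensive fetch. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded HTML body of `url` (hypothetical helper)."""
    # The timeout stops the call from waiting forever if the server does not answer
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    # .text is the body decoded to str; .content would be the raw bytes
    return response.text
```

Calling `fetch_html("https://www.guggenheim-bilbao.eus/en/the-collection/works")` would then return the same string as `source.text`, or raise an exception instead of silently returning an error page.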

#### 2. Parsing HTML content

To parse the extracted HTML code we can use the ```beautifulsoup4``` or ```bs4``` library. You can take a look at the documentation for this library [here](https://tedboy.github.io/bs4_doc/index.html).

In [6]:
import bs4

This library has a class called ```BeautifulSoup``` that is specifically designed to parse HTML content into a *parse tree*. 

To parse a document, we need to pass it into the ```BeautifulSoup``` constructor and select a parser. Different parsers will create different parse trees from the same document. If you give ```BeautifulSoup``` a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document. But if the document is not perfectly-formed, different parsers will give different results.

Differences between parsers can affect your script. Hence, it is safer to specify a parser in the ```BeautifulSoup``` constructor. In particular, we will use Python’s built-in HTML parser.

In [7]:
soup = bs4.BeautifulSoup(source.text, 'html.parser')

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.

In [8]:
soup

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta charset="utf-8"/><meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no,shrink-to-fit=no" name="viewport"/><meta content="#4cb6cc" name="theme-color"/><meta content="#4cb6cc" name="msapplication-TileColor"/><link href="/manifest.json" rel="manifest"/><link href="/favicon.ico" rel="shortcut icon"/><link href="/icons/icon-16x16.png" rel="apple-touch-icon-precomposed" sizes="16x16"/><link href="/icons/icon-32x32.png" rel="apple-touch-icon-precomposed" sizes="32x32"/><link href="/icons/icon-180x180.png" rel="apple-touch-icon-precomposed" sizes="180x180"/><link href="/icons/icon-192x192.png" rel="icon" sizes="192x192" type="image/png"/><link href="/icons/icon-512x512.png" rel="apple-touch-icon-precomposed" sizes="512x512"/><title data-react-helmet="true">Works | The Collection | Guggenheim Museum Bilbao</title><meta content="Guggenheim Bilbao" data-react-helmet="true" pr

#### 3. Navigating the tree

You can now navigate your parse tree. The simplest way to do so is to specify the name of the tag you want. If you want to extract the `<head>` tag, you can simply access the `head` attribute of your soup.

In [9]:
soup.head

<head><meta charset="utf-8"/><meta content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no,shrink-to-fit=no" name="viewport"/><meta content="#4cb6cc" name="theme-color"/><meta content="#4cb6cc" name="msapplication-TileColor"/><link href="/manifest.json" rel="manifest"/><link href="/favicon.ico" rel="shortcut icon"/><link href="/icons/icon-16x16.png" rel="apple-touch-icon-precomposed" sizes="16x16"/><link href="/icons/icon-32x32.png" rel="apple-touch-icon-precomposed" sizes="32x32"/><link href="/icons/icon-180x180.png" rel="apple-touch-icon-precomposed" sizes="180x180"/><link href="/icons/icon-192x192.png" rel="icon" sizes="192x192" type="image/png"/><link href="/icons/icon-512x512.png" rel="apple-touch-icon-precomposed" sizes="512x512"/><title data-react-helmet="true">Works | The Collection | Guggenheim Museum Bilbao</title><meta content="Guggenheim Bilbao" data-react-helmet="true" property="og:site_name"/><meta content="@MuseoGuggenheim" data-react-helmet="true" n

You can also zoom in to individual elements.

In [10]:
soup.title

<title data-react-helmet="true">Works | The Collection | Guggenheim Museum Bilbao</title>

You can use the same trick again and again to zoom in on certain parts of the parse tree. For example, this code gets the first `<a>` tag beneath the `<body>` tag.

In [11]:
soup.body.a

<a class="header__logo-link" href="/en"></a>

Note that using a tag name as an attribute will give you only the first tag by that name. Yet, when retrieving information from the website, we may need to access not just one but all the tags by a given name or class. To do so, we will need more specific tools.
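
The behaviour described above can be checked on a tiny stand-in document. The markup below is made up for illustration and is much simpler than the museum's page; the variable name `demo` is likewise an assumption, chosen so as not to overwrite `soup`.

```python
import bs4

# Tiny stand-in document; the markup is invented for illustration
html = """
<html><body>
  <a class="header__logo-link" href="/en"></a>
  <a href="/en/the-collection">The Collection</a>
</body></html>
"""
demo = bs4.BeautifulSoup(html, "html.parser")

first_link = demo.body.a     # only the first <a> under <body> is returned
print(first_link["href"])    # /en
print(demo.body.table)       # None: there is no <table> in this document
```

Asking for a tag that does not exist returns `None` rather than raising an error, which is worth keeping in mind when chaining attribute accesses.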

#### 4. Searching the tree

The two most popular methods for searching a parse tree in `BeautifulSoup` are `find` and `find_all`. You can read more about these two [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all). The `find` method looks through a tag's descendants (the entire document, when called on the soup) and returns **the first** result that matches the provided filters. The `find_all` method looks through the same descendants and retrieves **all** those that match your filters. 

For the Guggenheim Museum's webpage, the central part of the webpage contains a grid, where information for each of the paintings is displayed inside a block. Each individual block contains the information we are looking for and the same structure gets repeated for all the paintings.

In [12]:
# Storing the tag type and class of the different blocks of painting information
item_tag_type = "div"
item_tag_class = "column grid__item"

In [13]:
# Find first result that matches the two filters above
painting = soup.find(item_tag_type, {"class": item_tag_class})

In [14]:
type(painting)

bs4.element.Tag

Note that the retrieved object is a `Tag` containing the code for the first painting in the collection: Mark Rothko's *Untitled*.

In [15]:
# Find all the results that match the two filters above
paintings = soup.find_all(item_tag_type, {"class": item_tag_class})

In [16]:
type(paintings)

bs4.element.ResultSet

Note that this time the retrieved object is a `ResultSet`, which holds the code for all the paintings in the provided url in a *list*-like structure.
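
One detail of the class filter used above is easy to miss: a string with several classes is compared with the `class` attribute as it is written in the HTML. The sketch below uses invented markup to contrast this with a CSS selector passed to `select`, which requires both classes but does not depend on their order.

```python
import bs4

# Invented markup: same two classes in a different order, plus a partial match
html = """
<div class="column grid__item"><span>First work</span></div>
<div class="grid__item column"><span>Second work</span></div>
<div class="column"><span>Not a work</span></div>
"""
demo = bs4.BeautifulSoup(html, "html.parser")

# A multi-class string matches the class attribute exactly as written
exact = demo.find_all("div", {"class": "column grid__item"})

# A CSS selector requires both classes, in any order
by_selector = demo.select("div.column.grid__item")

print(len(exact), len(by_selector))  # 1 2
```

On a page where the class order never changes, both approaches return the same tags; the selector form is simply less sensitive to small changes in the markup.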


#### 5. Extracting painting information

Once we have identified the code that corresponds to each individual painting, we can start retrieving the title, artist and date of each painting.


In [17]:
# Store tags corresponding to the pieces of information we want to retrieve (for 1 painting)
artist_tag = painting.find("div", {"class": "work-preview__artists"})
title_tag = painting.find("span", {"class": "typography typography--h5 typography--italic work-preview__name"})
date_tag = painting.find("span", {"class": "typography typography--h5"})

In [18]:
# Extract the actual information by calling the text attribute
artist_tag.text

'Mark Rothko '

In [19]:
# Store the information in the right format to operate later
artist = artist_tag.text.strip()
title = title_tag.text.split(',')[0]
date = date_tag.text.split(', ')[1]
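
The `split(', ')` calls above work as long as the text holds exactly one comma between title and date. As a hedged illustration, assuming a label of the form `Title, date` (the labels below are invented, and `split_label` is a hypothetical helper), splitting on the last comma keeps titles that contain commas intact:

```python
def split_label(label):
    """Split a 'Title, date' label on its last ', ' (hypothetical helper)."""
    # rsplit with maxsplit=1 cuts only at the final ', ',
    # so commas inside the title are preserved
    parts = label.strip().rsplit(", ", 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # No ', ' at all: treat the whole label as the title
    return parts[0], None


print(split_label("Some Title, 1950"))       # ('Some Title', '1950')
print(split_label("One, Two, Three, 1999"))  # ('One, Two, Three', '1999')
print(split_label("Some Title"))             # ('Some Title', None)
```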

In [20]:
# Iterate to retrieve the information for all 12 paintings on the page
artists = []
titles = []
dates = []

for painting in paintings:
    # strip() removes the trailing whitespace seen in 'Mark Rothko '
    artists.append(painting.find("div", {"class": "work-preview__artists"}).text.strip())
    titles.append(painting.find("span", {"class": "typography typography--h5 typography--italic work-preview__name"}).text.split(', ')[0])
    dates.append(painting.find("span", {"class": "typography typography--h5"}).text.split(', ')[1])

These data are a bit scarce. Perhaps we can do better.

If you click on each of the paintings, you'll see that another webpage opens, containing additional information. This includes the <b>original title</b> of each piece, the <b>materials</b> it was made of, its <b>dimensions</b> and the <b>credit</b> line, as well as a short description. Let's retrieve these data too.

![work](https://www.dropbox.com/s/sl7mtoj6mvp22s0/work.png?raw=1)

To do so, we need to be able to navigate to each painting's webpage. The first step is therefore to locate the corresponding url addresses.
A tag's attributes, such as the `href` of a link, can be read with the <i>get</i> method or with dictionary-style indexing (`tag['href']`), which is what the next cell uses.

In [21]:
# Find and store urls
urls = []
for painting in paintings:
    urls.append(painting.find("a")['href'])

These urls are only partial links; the site's base address must be prepended to obtain the full url of each painting's webpage.

In [22]:
# Modify the stored urls
base_url = "https://www.guggenheim-bilbao.eus"
for i in range(len(urls)):
    urls[i] = base_url + urls[i]
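
String concatenation works here because every stored link starts with `/`. As an alternative sketch, `urljoin` from the standard library handles both relative and already-absolute links. The paths below are placeholders, not real pages of the site.

```python
from urllib.parse import urljoin

base = "https://www.guggenheim-bilbao.eus"

# Placeholder paths, used only to show how urljoin behaves
relative = urljoin(base, "/en/some/relative/path")
absolute = urljoin(base, "https://example.org/already/absolute")

print(relative)  # https://www.guggenheim-bilbao.eus/en/some/relative/path
print(absolute)  # https://example.org/already/absolute
```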

You can now use these urls to retrieve the remaining information for each painting.

In [23]:
original_titles = []
materials = []
dimensions = []
credits = []

for url in urls:

    source = requests.get(url)
    soup = bs4.BeautifulSoup(source.text, 'html.parser')

    # Get the detail blocks shown next to the painting
    details = soup.find_all("div", {"class": "aside-detail"})

    # Identify each block by its label and keep the text that follows it
    for item in details:
        if "Original title" in item.text:
            original_titles.append(item.text.split("Original title")[1])
        elif "Medium/Materials" in item.text:
            materials.append(item.text.split("Medium/Materials")[1].split('\n')[0])
        elif "Dimensions" in item.text:
            dimensions.append(item.text.split("Dimensions")[1])
        elif "Credit line" in item.text:
            credits.append(item.text.split("Credit line")[1])

#### 6. Navigating different pages

So far, we've retrieved the data only for the paintings shown in the first page of the collection catalogue. However, there exist additional pages we can navigate through.

In [24]:
artists = []
titles = []
dates = []
urls = []
original_titles = []
materials = []
dimensions = []
credits = []

# Start from first page
main_page = bs4.BeautifulSoup(requests.get("https://www.guggenheim-bilbao.eus/en/the-collection/works").text, 'html.parser')

# Get number of pages
n_pages = int(main_page.find_all("li", {"class": "paginator__item"})[-2].text)

# Write loop to traverse the pages
for run in range(n_pages):
    
    # Get paintings list
    paintings = main_page.find_all("div", {"class": "column grid__item"})
    
    # Get information from each painting
    for painting in paintings:

        artists.append(painting.find("div", {"class": "work-preview__artists"}).text.strip())
        titles.append(painting.find("span", {"class": "typography typography--h5 typography--italic work-preview__name"}).text.split(', ')[0])
        dates.append(painting.find("span", {"class": "typography typography--h5"}).text.split(', ')[1])

        url = base_url+painting.find("a")['href']

        painting_page = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
    
        details = painting_page.find_all("div", {"class": "aside-detail"})
    
        for item in details:
            if "Original title" in item.text:
                original_titles.append(item.text.split("Original title")[1])
            elif "Medium/Materials" in item.text:
                materials.append(item.text.split("Medium/Materials")[1].split('\n')[0])
            elif "Dimensions" in item.text:
                dimensions.append(item.text.split("Dimensions")[1])
            elif "Credit line" in item.text:
                credits.append(item.text.split("Credit line")[1])
    
    # Move on to the next page, unless this was the last one
    if run < n_pages - 1:
        # Get url for the next page
        next_url = base_url + main_page.find_all("li", {"class": "paginator__item"})[-1].find("a")['href']
    
        # Request next page
        main_page = bs4.BeautifulSoup(requests.get(next_url).text, 'html.parser')
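
The loop above counts pages in advance. Another common pattern is to keep following the "next" link until a page no longer offers one. The sketch below shows the idea on made-up pages held in a dictionary: the markup, the class names `item` and `next`, and the `fetch` stand-in are all assumptions for illustration, not the museum's actual structure.

```python
import bs4

# Made-up pages keyed by path; markup and class names are placeholders
pages = {
    "/works?page=1": '<div class="item">A</div><div class="item">B</div>'
                     '<a class="next" href="/works?page=2">next</a>',
    "/works?page=2": '<div class="item">C</div>'
                     '<a class="next" href="/works?page=3">next</a>',
    "/works?page=3": '<div class="item">D</div>',
}


def fetch(path):
    """Stand-in for requests.get(base_url + path).text."""
    return pages[path]


items = []
path = "/works?page=1"
while path is not None:
    page = bs4.BeautifulSoup(fetch(path), "html.parser")
    # Process the current page first, so the last page is included too
    items.extend(tag.text for tag in page.find_all("div", {"class": "item"}))
    # Stop when the page offers no link to a following page
    next_link = page.find("a", {"class": "next"})
    path = next_link["href"] if next_link is not None else None

print(items)  # ['A', 'B', 'C', 'D']
```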

#### 7. Saving the data

Finally, now that all the data has been retrieved, we can gather it in a single table so that it can be accessed later.


In [25]:
import pandas as pd
df_works = pd.DataFrame(data = {'title': titles,
                          'artist': artists,
                          'original_title': original_titles,
                          'date': dates,
                          'materials': materials,
                          'dimensions': dimensions,
                          'credit': credits
                         })

ValueError: All arrays must be of the same length
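
pandas raises this `ValueError` when the lists passed to `DataFrame` have different lengths. One way this can happen in the loop above is a painting page that lacks one of the four detail blocks, since nothing is appended for that painting; whether that is the cause here would need checking against the site. A minimal sketch of one way to keep the fields aligned is to build one record per painting and fill absent fields with `None`. The helper name `details_to_record` and the sample texts are invented for illustration.

```python
# Label shown on the page -> column name in the final table
LABELS = {
    "Original title": "original_title",
    "Medium/Materials": "materials",
    "Dimensions": "dimensions",
    "Credit line": "credit",
}


def details_to_record(detail_texts):
    """Map detail-block texts to one dict, with None for absent fields."""
    record = dict.fromkeys(LABELS.values())  # every key starts as None
    for text in detail_texts:
        for label, key in LABELS.items():
            if label in text:
                # Keep what follows the label, without surrounding whitespace
                record[key] = text.split(label, 1)[1].strip()
                break
    return record


# Invented texts: the second work has no "Original title" block
works = [
    ["Original title Sin título", "Medium/Materials Oil on canvas",
     "Dimensions 10 x 20 cm", "Credit line Some museum"],
    ["Medium/Materials Bronze", "Dimensions 1 x 2 x 3 m",
     "Credit line Some museum"],
]
records = [details_to_record(texts) for texts in works]

print(records[1]["original_title"])     # None
print([len(record) for record in records])  # [4, 4]
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`, which then has one row per painting regardless of which fields were present.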

Notice that an image is displayed for each painting, both on the main page and on each painting's own page. We can also retrieve the url of each of these images.

In [None]:
imgs = []

# Start from first page
main_page = bs4.BeautifulSoup(requests.get("https://www.guggenheim-bilbao.eus/en/the-collection/works").text, 'html.parser')

# Get number of pages
n_pages = int(main_page.find_all("li", {"class": "paginator__item"})[-2].text)

# Write loop to traverse the pages
for run in range(n_pages):
    # Get paintings list
    paintings = main_page.find_all("div", {"class": "column grid__item"})

    for painting in paintings:
        imgs.append(painting.find("img")['src'])
    
    # Move on to the next page, unless this was the last one
    if run < n_pages - 1:
        # Get url for the next page
        next_url = base_url + main_page.find_all("li", {"class": "paginator__item"})[-1].find("a")['href']
    
        # Request next page
        main_page = bs4.BeautifulSoup(requests.get(next_url).text, 'html.parser')

df_works['img'] = imgs

In [None]:
df_works

Once you have defined the list, run the following cell to download every image from its source url and store it in jpg format on your computer (if you are using Anaconda) or in the Data folder (if you are using Colab).

In [None]:
for index, row in df_works.iterrows():
    with open(row['title'].replace('/', '')+'.jpg', 'wb') as handler:
        handler.write(requests.get(row['img']).content)
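
The cell above builds each file name from the title alone, so two works that share a title would be written to the same file, and characters other than `/` may also be unwelcome in file names on some systems. The following sketch shows one possible way to derive distinct, filesystem-friendly names; `unique_filenames` is a hypothetical helper and the sample titles are invented.

```python
import re
from collections import Counter


def unique_filenames(titles, extension=".jpg"):
    """Build one distinct, filesystem-friendly name per title (hypothetical helper)."""
    seen = Counter()
    names = []
    for title in titles:
        # Replace runs of characters that are not letters, digits, '-' or '_' with '_'
        stem = re.sub(r"[^\w\-]+", "_", title).strip("_") or "untitled"
        seen[stem] += 1
        # Add a counter from the second occurrence of the same stem onwards
        suffix = f"_{seen[stem]}" if seen[stem] > 1 else ""
        names.append(f"{stem}{suffix}{extension}")
    return names


print(unique_filenames(["Untitled", "Untitled", "A/B: test"]))
# ['Untitled.jpg', 'Untitled_2.jpg', 'A_B_test.jpg']
```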

### B. Extracting artist information

The artists' information can also be extracted. Back on the home page, clicking on the top menu will take you to a [new catalogue](https://www.guggenheim-bilbao.eus/en/the-collection/artists). Clicking on the different elements in this catalogue gives you access to each artist's webpage, where you can find information about them and their professional career. 

![artist](https://www.dropbox.com/s/5r527vryjj5wtz2/Captura%20de%20pantalla%202022-10-31%20a%20las%2012.38.57.png?dl=1)

In [None]:
# Make a request to retrieve the HTML code and store the response
source = requests.get("https://www.guggenheim-bilbao.eus/en/the-collection/artists")

In [None]:
# Parse the HTML code into a BeautifulSoup tree
soup = bs4.BeautifulSoup(source.text, "html.parser")

The artist catalogue is structured like the paintings catalogue: the information for the different artists is shown in separate blocks inside a grid.

![artists](https://www.dropbox.com/s/88k3woopuru06o1/Captura%20de%20pantalla%202022-10-31%20a%20las%2017.13.00.png?dl=1)


In [None]:
# Identify tag storing the information for the first artist
artist = soup.find("div", {"class": "column grid__item"})

In [None]:
# Extract artist information for the first artist
artist_name = artist.find("h3", {"class": "typography typography--book-h5 artist-preview__name"}).text
place_of_birth = artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(',')[0]
year_of_birth = int(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[-1])

In [None]:
# Repeat for every artist in the first page of the catalogue
artists = soup.find_all("div", {"class": "column grid__item"})

artist_names = []
years_of_birth = []
places_of_birth = []
years_of_death = []
places_of_death = []

for artist in artists:
    artist_names.append(artist.find("h3", {"class": "typography typography--book-h5 artist-preview__name"}).text)
    try:
        years_of_birth.append(int(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[-1]))
        places_of_birth.append(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[0])
    except:
        places_of_birth.append(", ".join(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[1:]))
        years_of_birth.append(int(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[0]))
    try:
        years_of_death.append(int(artist.find("div", {"class": "artist-preview__biography-content"}).find_all("p")[1].text.split(', ')[-1]))
        places_of_death.append(", ".join(artist.find("div", {"class": "artist-preview__biography-content"}).find_all("p")[1].text.split(', ')[:-1]))
    except:
        places_of_death.append(None)
        years_of_death.append(None)  
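
The `try`/`except` pairs above cope with biography lines where the year comes either last or first. The same idea can be written as a small function that is easier to test in isolation. This is only a sketch: `parse_place_and_year` is a hypothetical helper, and the sample lines are invented rather than taken from the site.

```python
def parse_place_and_year(line):
    """Split 'Place, Country, 1903' or '1903, Place' into (place, year)."""
    parts = [part.strip() for part in line.split(",") if part.strip()]
    # Look for the year at the end first, then at the start
    for year_index in (-1, 0):
        if parts and parts[year_index].isdigit():
            year = int(parts[year_index])
            rest = parts[:-1] if year_index == -1 else parts[1:]
            return (", ".join(rest) or None), year
    # No numeric part found: keep the text as the place, leave the year empty
    return (", ".join(parts) or None), None


print(parse_place_and_year("Some Town, Some Country, 1903"))  # ('Some Town, Some Country', 1903)
print(parse_place_and_year("1970, Some Town"))                # ('Some Town', 1970)
print(parse_place_and_year("Some Town"))                      # ('Some Town', None)
```

Unlike the cell above, this version keeps the whole place (town and country) when the year comes last; that is a design choice of the sketch, not something the original code does.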

In [None]:
# Repeat for all the pages in the artist catalogue
artist_names = []
years_of_birth = []
places_of_birth = []
years_of_death = []
places_of_death = []

# Start from first page
main_page = bs4.BeautifulSoup(requests.get("https://www.guggenheim-bilbao.eus/en/the-collection/artists").text, 'html.parser')

# Get number of pages
n_pages = int(main_page.find_all("li", {"class": "paginator__item"})[-2].text)

# Write loop to traverse the pages
for run in range(n_pages):

    # Get the artists listed on the current page
    artists = main_page.find_all("div", {"class": "column grid__item"})

    for artist in artists:
        artist_names.append(artist.find("h3", {"class": "typography typography--book-h5 artist-preview__name"}).text)
        try:
            years_of_birth.append(int(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[-1]))
            places_of_birth.append(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[0])
        except:
            places_of_birth.append(", ".join(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[1:]))
            years_of_birth.append(int(artist.find("div", {"class": "artist-preview__biography-content"}).find("p").text.split(', ')[0]))
        try:
            years_of_death.append(int(artist.find("div", {"class": "artist-preview__biography-content"}).find_all("p")[1].text.split(', ')[-1]))
            places_of_death.append(", ".join(artist.find("div", {"class": "artist-preview__biography-content"}).find_all("p")[1].text.split(', ')[:-1]))
        except:
            places_of_death.append(None)
            years_of_death.append(None)
    
    # Move on to the next page, unless this was the last one
    if run < n_pages - 1:
        # Get url for the next page
        next_url = base_url + main_page.find_all("li", {"class": "paginator__item"})[-1].find("a")['href']
    
        # Request next page
        main_page = bs4.BeautifulSoup(requests.get(next_url).text, 'html.parser')

In [None]:
# Saving the data
import pandas as pd
df_artists = pd.DataFrame(data = {'artist_name': artist_names,
                               'year_of_birth': years_of_birth,
                               'place_of_birth': places_of_birth,
                               'year_of_death': years_of_death,
                               'place_of_death': places_of_death
                         })
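
A `DataFrame` lives only in memory, so it disappears when the notebook session ends. A short sketch of writing a table to a CSV file and reading it back is shown below; the rows are placeholders standing in for the scraped lists, and the temporary folder is used only to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows standing in for the scraped lists
df_demo = pd.DataFrame({
    "artist_name": ["Artist A", "Artist B"],
    "year_of_birth": [1903, 1950],
    "year_of_death": [1970, None],
})

# Write to a temporary folder here; in practice any path would do
path = os.path.join(tempfile.mkdtemp(), "artists.csv")
df_demo.to_csv(path, index=False)  # index=False leaves out the row numbers

# Read the file back to confirm that the table survived the round trip
df_back = pd.read_csv(path)
print(df_back.shape)  # (2, 3)
```

The same two calls would apply to `df_works` and `df_artists` once they have been built.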

If you click on an artist's name, you are taken to another page where you can see all the paintings in the catalogue created by this artist. To retrieve the number of paintings for each artist and store it in a list called <b>n_paintings</b> you can do as follows:

In [None]:
n_paintings = []

# Start from first page
main_page = bs4.BeautifulSoup(requests.get("https://www.guggenheim-bilbao.eus/en/the-collection/artists").text, 'html.parser')

# Get number of pages
n_pages = int(main_page.find_all("li", {"class": "paginator__item"})[-2].text)

# Write loop to traverse the pages
for run in range(n_pages):

    # Get the artists listed on the current page
    artists = main_page.find_all("div", {"class": "column grid__item"})

    for artist in artists:
        url = base_url + artist.find("a").get("href")
        artist_page = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
        n_paintings.append(len(artist_page.find_all("div", {"class": "swiper-slide"})))
    
    # Move on to the next page, unless this was the last one
    if run < n_pages - 1:
        # Get url for the next page
        next_url = base_url + main_page.find_all("li", {"class": "paginator__item"})[-1].find("a")['href']
    
        # Request next page
        main_page = bs4.BeautifulSoup(requests.get(next_url).text, 'html.parser')

df_artists['n_paintings'] = n_paintings