# 3.14.34 Web Scraping

### What is it?

Web scraping is the art and practice of **retrieving information from a website**. It can be done manually and, in a sense, it's what we do every time we look something up on a website or copy a table from a Wikipedia page and paste it into a spreadsheet.

That said, the most efficient way to scrape a website is with an automated script. We will explore this practice in Python using a well-known scraping library called `BeautifulSoup`.

### How does it work?

1. The first step is to locate a website that contains the information you want to extract. It could be an e-commerce site, a news outlet, a weather site or any other website.
2. You then inspect the HTML source code of that web page to find the specific tags and ids that identify the information you're after.
3. The central part of a Python scraper is, of course, the script itself: you write code that uses those tags and ids to retrieve the information.
4. Finally, you store your results in a data structure or file format of your choice (CSV, JSON, ...) that lets you work with the scraped data.
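
As a rough illustration of step 4, the sketch below turns a list of scraped records into JSON and CSV text using only the standard library. The `records` list is a hypothetical stand-in for whatever steps 1 to 3 produce, and the in-memory buffer is used only to keep the example self-contained.

```python
import csv
import io
import json

# Hypothetical records, standing in for the output of steps 1-3.
records = [
    {"name": "Andorra", "capital": "Andorra la Vella"},
    {"name": "Afghanistan", "capital": "Kabul"},
]

# JSON: serialise the list of dictionaries as it is.
json_text = json.dumps(records, indent=2)

# CSV: one row per record, one column per key.
# An in-memory buffer is used here; to write to disk, replace it with
# open("countries.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "capital"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```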

<img src="img/web-scraping.png" width="600">

### Is it legal?

In general terms, web scraping is not illegal. More specifically, according to Wikipedia: *the legality of web scraping varies across the world. In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.* 

It is often argued that, if a website makes its data publicly available, then scraping it should be legal. This view is supported by legal cases such as *hiQ Labs v. LinkedIn*. You can read more on [Wikipedia](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues) and [online](https://www.parsehub.com/blog/web-scraping-legal/). 

### A first simple example

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

We will start with a [simple page](https://www.scrapethissite.com/pages/simple/). Click on the link and familiarise yourself with its contents: it lists the countries of the world along with some information about each one.

Before moving on, inspect the underlying HTML code: right click on an element of the page (for example the name "Andorra") and select "Inspect". This will open the Inspect Element window in your Chrome browser, which allows you to see the code that generates the page you're looking at. 

<img src="img/inspect-element.png" width="600">

In Python, let's start by sending an HTTP GET request to retrieve the HTML content of a desired web page and print to screen the first 1000 characters: 

In [2]:
url = "https://www.scrapethissite.com/pages/simple/"
page = requests.get(url)

print(page.text[0:1000])

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>
    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png" />

    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping.">

    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">
    <link href='https://fonts.googleapis.com/css?family=Lato:400,700' rel='stylesheet' type='text/css'>
    <link rel="stylesheet" type="text/css" href="/static/css/styles.css">

    
<meta name=
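
The request above assumes that the server answers successfully. A minimal sketch of a more defensive fetch is shown below; the helper name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or raise if the request fails."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so an error page is never parsed by mistake.
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.scrapethissite.com/pages/simple/")
```

All exceptions raised by `requests` derive from `requests.exceptions.RequestException`, so a single `except` clause on that class is enough to catch connection errors, timeouts and bad status codes.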

Then we create a Beautiful Soup object named `soup` that takes `page.content` as an input (these are the raw bytes of the HTML we just printed to screen).

*Note: it is better to pass `page.content` (raw bytes) instead of `page.text` (an already decoded string), so that Beautiful Soup can work out the character encoding itself and avoid encoding issues.*

In [3]:
soup = BeautifulSoup(page.content, "html.parser")
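
To see why bytes are the safer input, here is a small offline sketch. It uses a hand-written HTML snippet (not the real page) and simulates a wrongly guessed encoding by decoding UTF-8 bytes as Latin-1, which is roughly what can happen to `page.text` when the server does not declare its encoding.

```python
from bs4 import BeautifulSoup

# Hand-written sample document, encoded as UTF-8 bytes (like page.content).
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>São Tomé</p></body></html>"
).encode("utf-8")

# Simulate a wrong guess: the same bytes decoded with the wrong codec.
wrong_text = html_bytes.decode("latin-1")

soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")
soup_from_text = BeautifulSoup(wrong_text, "html.parser")

print(soup_from_bytes.p.text)  # decoded correctly by Beautiful Soup
print(soup_from_text.p.text)   # garbled, because the damage was done before parsing
```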

If you go back to the **Inspect Element** window, you'll notice that hovering over an element in the panel on the right highlights it on the web page to the left. We want to retrieve all the country names in the list. The `<div id="page">` element contains all of the results that we're after, so we call the `.find()` method on the `soup` object to find that element via its `id="page"` attribute. 

<img src="img/div id=page.png" width="600">

In [7]:
results = soup.find(id="page")
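
One detail worth knowing about `.find()` is that it returns `None` when nothing matches, which would make any later method call fail. The sketch below shows this on a tiny hand-written snippet; the id `does-not-exist` is made up for the example.

```python
from bs4 import BeautifulSoup

html = '<div id="page"><p>inside</p></div><p>outside</p>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="page")
missing = soup.find(id="does-not-exist")

print(found.name)  # the tag name of the matched element
print(missing)     # None, because no element has that id

# Guard before chaining further calls on the result.
if missing is None:
    print("element not found, check the id in the Inspect Element window")
```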

As noted, the `results` object contains all the country names we're interested in, but we still need to isolate them. Going back to the Inspect Element window, we notice that each country's name and related information are stored inside a `div` element with `class="col-md-4 country"`. We therefore call the `.find_all()` method on the `results` object to get the block for each country.

In [8]:
countries = results.find_all("div", class_="col-md-4 country")
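
Passing a string with two classes to `class_` matches the exact value of the `class` attribute, so the order of the classes matters. The sketch below, on hand-written markup, compares it with two more tolerant alternatives: searching for a single class and using a CSS selector via `.select()`.

```python
from bs4 import BeautifulSoup

html = """
<div class="col-md-4 country">Andorra</div>
<div class="country col-md-4">Afghanistan</div>
<div class="col-md-4">not a country</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: class order matters.
exact = [d.text for d in soup.find_all("div", class_="col-md-4 country")]

# Single class: matches any div that has "country" among its classes.
single = [d.text for d in soup.find_all("div", class_="country")]

# CSS selector: matches divs carrying both classes, in any order.
css = [d.text for d in soup.select("div.col-md-4.country")]

print(exact)
print(single)
print(css)
```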

In [9]:
countries[0]

<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>

Now, if you look closely at the above output, you'll notice that each country name is enclosed in an `h3` tag with `class="country-name"`. 

In [10]:
type(countries)

bs4.element.ResultSet

In [11]:
len(countries)

250
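
As a side note, a `ResultSet` is a subclass of Python's built-in `list`, which is why `len()`, indexing and slicing all work on it. A quick check on a hand-written snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

html = "<ul><li>a</li><li>b</li><li>c</li></ul>"
items = BeautifulSoup(html, "html.parser").find_all("li")

print(isinstance(items, ResultSet))  # it is a ResultSet...
print(isinstance(items, list))       # ...and therefore also a list
print(len(items))

# Slicing and iteration work as with any list.
first_two = [li.text for li in items[0:2]]
print(first_two)
```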

The output object `countries` is an iterable object of class `bs4.element.ResultSet` containing 250 elements, so we can use a for loop to print the `h3` tag of each element. Here we do it for the first three elements only. 

In [12]:
for c in countries[0:3]: 
    print(c.find("h3", class_="country-name"))

<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
                            United Arab Emirates
                        </h3>
<h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
                            Afghanistan
                        </h3>


As you can see, this prints everything, including the tag itself; in order **to keep just the text**, we can use the `.text` attribute.

In [13]:
for c in countries[0:3]: 
    print(c.find("h3", class_="country-name").text)



                            Andorra
                        


                            United Arab Emirates
                        


                            Afghanistan
                        


Now, instead of printing the output to screen, it's better to append each new element to a list in order to store that information for later use. 

In [14]:
listy = []
for c in countries: 
    listy.append(c.find("h3", class_="country-name").text)

In [15]:
listy[0:5]

['\n\n                            Andorra\n                        ',
 '\n\n                            United Arab Emirates\n                        ',
 '\n\n                            Afghanistan\n                        ',
 '\n\n                            Antigua and Barbuda\n                        ',
 '\n\n                            Anguilla\n                        ']

Since the strings contain leading and trailing spaces and newline characters, we use the `.strip()` method to remove them from the output. 

In [16]:
names = []
for c in countries: 
    names.append(c.find("h3", class_="country-name").text.strip())

In [17]:
names[0:5]

['Andorra',
 'United Arab Emirates',
 'Afghanistan',
 'Antigua and Barbuda',
 'Anguilla']
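
The same result can be written more compactly. The sketch below runs on a hand-written snippet shaped like the country blocks shown earlier, and compares a list comprehension using `.text.strip()` with Beautiful Soup's `get_text(strip=True)`, which strips whitespace while extracting the text.

```python
from bs4 import BeautifulSoup

# Hand-written sample shaped like the country blocks on the page.
html = """
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ad"></i>
    Andorra
  </h3>
</div>
<div class="col-md-4 country">
  <h3 class="country-name">
    <i class="flag-icon flag-icon-ae"></i>
    United Arab Emirates
  </h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
headings = soup.find_all("h3", class_="country-name")

# Option 1: list comprehension with .text.strip()
names_strip = [h.text.strip() for h in headings]

# Option 2: let Beautiful Soup strip the whitespace itself
names_get_text = [h.get_text(strip=True) for h in headings]

print(names_strip)
print(names_get_text)
```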

That's it: we have successfully created a list of all the country names from our initial web page. 
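
The same approach extends to the other fields in each country block. The sketch below is one possible way to do it: it reuses the Andorra block printed earlier as an inline sample, reads the `country-capital`, `country-population` and `country-area` spans, and converts the numeric fields. The dictionary keys are illustrative names, not something defined by the page.

```python
from bs4 import BeautifulSoup

# Inline copy of the first country block shown earlier.
html = """
<div class="col-md-4 country">
<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>
<div class="country-info">
<strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br/>
<strong>Population:</strong> <span class="country-population">84000</span><br/>
<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">468.0</span><br/>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for c in soup.find_all("div", class_="col-md-4 country"):
    records.append({
        "name": c.find("h3", class_="country-name").text.strip(),
        "capital": c.find("span", class_="country-capital").text.strip(),
        # Convert the numeric fields from strings to numbers.
        "population": int(c.find("span", class_="country-population").text),
        "area_km2": float(c.find("span", class_="country-area").text),
    })

print(records)
```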

### A slightly more difficult example

At [this link](https://www.imdb.com/chart/top/?ref_=nv_mv_250) you will find a list of the top 250 movies as rated by users of the **IMDb website**. We want to retrieve all of the 250 titles in the ranking and save them in a Python list. 

We begin by sending an HTTP GET request to retrieve the HTML page and save its contents to a BeautifulSoup object. 

*Notice the additional `headers={'Accept-Language': 'en-US'}` parameter, which asks the server to return the contents in English, no matter what the client's default language is.*

In [None]:
url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
page = requests.get(url, headers={'Accept-Language': 'en-US'})

In [None]:
soup = BeautifulSoup(page.content, "html.parser")
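
If you want to check which headers a request will carry without actually sending it, `requests` lets you build and prepare the request first. The sketch below does this offline; the header value used is the plain language tag `en-US`, which is the form the `Accept-Language` header expects.

```python
import requests

url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

# Build the request object and prepare it; nothing is sent over the network.
request = requests.Request("GET", url, headers={"Accept-Language": "en-US"})
prepared = request.prepare()

print(prepared.method)
print(prepared.headers["Accept-Language"])
```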

Using the Inspect Element window, we can see that each movie title sits inside a `td` tag with `class="titleColumn"`. Using the `.find_all()` method, we find all those occurrences and save them to an object called `movies`. 

In [None]:
movies = soup.find_all('td', class_='titleColumn')

In [None]:
type(movies)

In [None]:
movies[0:3]

The `movies` object is a BeautifulSoup `ResultSet`, an iterable that contains one element per movie. We extract the title from each element by looping over `movies`, finding the `a` tag and reading its text content.

*Note: in HTML, an `a` tag defines a hyperlink.*

In [None]:
movie_names = []
for m in movies: 
    movie_names.append(m.find('a').text)

In [None]:
movie_names[0:5]
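
Since `.find_all()` returns an empty result when nothing matches (for example if the page layout differs from what we inspected), it can help to fail loudly instead of silently producing an empty list. The sketch below wraps the loop in a hypothetical helper, `extract_titles`, and runs it on made-up markup that only mimics the `td`/`a` structure described above; the titles and links are placeholders, not real IMDb data.

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the structure described above (placeholder data).
html = """
<table>
  <tr><td class="titleColumn">1. <a href="/title/tt0000001/">First Movie</a></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/tt0000002/">Second Movie</a></td></tr>
</table>
"""


def extract_titles(soup):
    """Return the text of the link inside every td with class 'titleColumn'."""
    movies = soup.find_all("td", class_="titleColumn")
    if not movies:
        # An empty result usually means the selector no longer matches the page.
        raise ValueError("no 'titleColumn' cells found; check the page structure")
    return [m.find("a").text for m in movies]


titles = extract_titles(BeautifulSoup(html, "html.parser"))
print(titles)
```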

You may have noticed that the `a` tag has an attribute called `title`; this is the link's tooltip (what is shown when you hover over the link), and it includes the name of the director as well as the main actors in the movie. 

To access this information, we'll use `.attrs.get('title')`: 

- the `.attrs` attribute allows us to access the attributes in the hyperlink tag, 
- while the `.get()` method enables us to retrieve the contents of a specific attribute, in this case the `'title'` attribute.

In [None]:
movie_cast = []
for m in movies: 
    movie_cast.append(m.find('a').attrs.get('title'))

In [None]:
movie_cast[0:5]
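
For reference, there is more than one way to read an attribute from a tag, and they differ in how they treat a missing attribute. The sketch below uses two made-up links (placeholder names, not real IMDb data), one with a `title` attribute and one without.

```python
from bs4 import BeautifulSoup

# Made-up links: the first has a title attribute, the second does not.
html = (
    '<a href="/title/tt0000001/" title="Director One (dir.), Actor A">First Movie</a>'
    '<a href="/other/">No title here</a>'
)
soup = BeautifulSoup(html, "html.parser")
with_title, without_title = soup.find_all("a")

# .attrs is a plain dictionary of the tag's attributes.
print(with_title.attrs)

# Three equivalent ways to read an existing attribute.
print(with_title.attrs.get("title"))
print(with_title.get("title"))
print(with_title["title"])

# When the attribute is missing, .get() returns None...
print(without_title.get("title"))

# ...while square brackets raise a KeyError.
try:
    without_title["title"]
except KeyError:
    print("no title attribute on this link")
```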

Finally, let's combine the two lists `movie_names` and `movie_cast` into a single DataFrame named `df_movies`.

In [None]:
df_movies = pd.DataFrame(
    {'name': movie_names,
     'cast': movie_cast
    })

In [None]:
df_movies
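
As a possible last step, the DataFrame can be exported so the scraped data survives the session. The sketch below uses two short placeholder lists in place of the scraped ones, and produces the CSV as a string; passing a file name to `to_csv` would write it to disk instead.

```python
import pandas as pd

# Placeholder lists standing in for the scraped movie_names and movie_cast.
movie_names = ["First Movie", "Second Movie"]
movie_cast = [
    "Director One (dir.), Actor A, Actor B",
    "Director Two (dir.), Actor C, Actor D",
]

df_movies = pd.DataFrame({"name": movie_names, "cast": movie_cast})

# index=False leaves out the row numbers; with no path, a string is returned.
csv_text = df_movies.to_csv(index=False)
print(csv_text)
```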