<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

![](scraping_meme.jpg)

## Objectives

- Understand what Web Scraping is.
- Understand why, as Data Scientists, we might want to scrape the web.
- Use `requests` and `BeautifulSoup` to scrape data from the web using Python.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1">Objectives</a></span></li><li><span><a href="#Why-do-we-scrape-the-web?" data-toc-modified-id="Why-do-we-scrape-the-web?-2">Why do we scrape the web?</a></span><ul class="toc-item"><li><span><a href="#Getting-Info-from-a-Web-Page" data-toc-modified-id="Getting-Info-from-a-Web-Page-2.1">Getting Info from a Web Page</a></span></li><li><span><a href="#If-I-wanted-to-get-a-list-of-all-of-the-countries-visited,-how-would-I-do-it?" data-toc-modified-id="If-I-wanted-to-get-a-list-of-all-of-the-countries-visited,-how-would-I-do-it?-2.2">If I wanted to get a list of all of the countries visited, how would I do it?</a></span></li></ul></li><li><span><a href="#Getting-Info-from-a-Web-Page" data-toc-modified-id="Getting-Info-from-a-Web-Page-3">Getting Info from a Web Page</a></span><ul class="toc-item"><li><span><a href="#Requests-Library" data-toc-modified-id="Requests-Library-3.1">Requests Library</a></span></li></ul></li><li><span><a href="#Example:-Autotrader" data-toc-modified-id="Example:-Autotrader-4">Example: Autotrader</a></span><ul class="toc-item"><li><span><a href="#Now-that-we-have-the-web-page,-we-can-parse-it-with-BeautifulSoup:" data-toc-modified-id="Now-that-we-have-the-web-page,-we-can-parse-it-with-BeautifulSoup:-4.1">Now that we have the web page, we can parse it with BeautifulSoup:</a></span></li><li><span><a href="#We-can-now-set-up-a-loop-to-go-through-all-the-different-pages-of-this-website-search:" data-toc-modified-id="We-can-now-set-up-a-loop-to-go-through-all-the-different-pages-of-this-website-search:-4.2">We can now set up a loop to go through all the different pages of this website search:</a></span></li></ul></li><li><span><a href="#Pair-Practice:-Rightmove" data-toc-modified-id="Pair-Practice:-Rightmove-5">Pair Practice: Rightmove</a></span></li></ul></div>

In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
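# tqdm_notebook is deprecated in recent tqdm releases; tqdm.notebook.tqdm is its replacement
from tqdm.notebook import tqdm as tqdm_notebook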

## Why do we scrape the web?

* Realistically, the data you want to study won't always be available as a curated data set.
* Often you need to go to the internet to find interesting data, for example:
    * Data from an existing company
    * Text for NLP
    * Images

# Scraping from a Web Page with Python

Scraping a website comes down to two tasks: making a **request from Python** and **parsing the HTML** that is returned from each page. There is a Python library for each of these tasks: **`requests`** for the first and **`bs4`** (BeautifulSoup) for the second.

BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

### Getting Info from a Web Page

Now that we can gain easy access to the HTML for a web page, we need **some way to pull the desired content from it**. Luckily there is already a system in place to do this. With a **combination of HTML and CSS selectors** we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

In [None]:
html = '''<!DOCTYPE html>
<html>
<head>
<title>The title of this web page</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.png">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice"> <br />
<img src="venice2.png" alt="Venice"> <br />
<img src="rome.png" alt="Roma">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
</body>
</html>
'''

In [None]:
from bs4 import BeautifulSoup

# we create a soup object with the html:
soup = BeautifulSoup(html, 'html.parser')

In [None]:
print(soup.prettify())

In [None]:
# now we can query it
soup.title

In [None]:
soup.title.text

In [None]:
soup.h1

In [None]:
soup.h3

In [None]:
soup.find('h3')

In [None]:
soup.find_all('h3')

In [None]:
soup.find_all('h3')[1].text

In [None]:
soup.find_all('div', class_='country')

In [None]:
soup.find_all('img', alt='Venice')

In [None]:
soup.find('div', class_='country').find_previous_siblings('h3')
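
The queries above use `find` and `find_all`. As a minimal sketch of the CSS-selector route, BeautifulSoup also provides `select` and `select_one`, which take a CSS selector string. The HTML below is a shortened copy of the example page so that the block runs on its own; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example page, so this block runs on its own.
html = '''
<h1>My Photos</h1>
<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice">
<img src="rome.png" alt="Roma">
</div>
<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'div.country img' means: every <img> nested inside a <div class="country">.
image_files = [img['src'] for img in soup.select('div.country img')]

# Attribute selectors work too: <img> tags whose alt attribute equals "Venice".
venice = soup.select('img[alt="Venice"]')

# select_one returns the first match (or None), much like find.
first_country_div = soup.select_one('div.country')

print(image_files)
print(venice)
```

Whether to use `find_all` or `select` is mostly a matter of taste: `select` is handy when the selector can be copied from the browser's 'Inspect' panel.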

### If I wanted to get a list of all of the countries visited, how would I do it?

In [None]:
#A
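
One possible answer, sketched under the assumption that every `<h3>` on the page is a country name (true for this example page, but not something to rely on in general). The second option starts from the `country` divs instead, which may be more robust if other `<h3>` headings appear. The HTML is a shortened copy of the example so the block is self-contained.

```python
from bs4 import BeautifulSoup

# A shortened copy of the example page, so this block runs on its own.
html = '''
<h1>My Photos</h1>
<h3>Italy</h3>
<div class='country'><img src="rome.png" alt="Roma"></div>
<h3>Germany</h3>
<div class='country'><img src="berlin.png" alt="Berlin"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Option 1: treat every <h3> on the page as a country name.
countries = [h3.text for h3 in soup.find_all('h3')]

# Option 2: start from each country <div> and take the nearest <h3> before it.
countries_from_divs = [
    div.find_previous_sibling('h3').text
    for div in soup.find_all('div', class_='country')
]

print(countries)
print(countries_from_divs)
```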

## Getting Info from a Web Page

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making **HTTP requests within Python**. The interface is mind-bogglingly simple: call the function for the type of request you want (this will mostly be `get`) with the URL and any optional parameters you'd like passed through the request. The response object that comes back makes the results of the request available via attributes and methods.
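
As a rough sketch of that interface, the block below builds a request without sending it, to show how optional parameters end up in the URL, and then wraps `requests.get` in a small helper. The endpoint `https://www.example.com/search`, the parameter values and the helper name `fetch` are assumptions made up for illustration; they are not part of this lesson's sites.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
base_url = 'https://www.example.com/search'
params = {'postcode': 'e16lt', 'radius': 10, 'page': 2}

# Build the request without sending it, to see the URL that requests would use.
prepared = requests.Request('GET', base_url, params=params).prepare()
print(prepared.url)


def fetch(url, params=None, timeout=10):
    """Send a GET request and return the page HTML, raising on HTTP errors."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

After a real call, `response.status_code` holds the HTTP status and `response.text` holds the body as a string, which is what gets passed to BeautifulSoup.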

## Example: Autotrader

In [None]:
import requests

# The URL is split across adjacent string literals, which Python joins into one string
url = ('https://www.autotrader.co.uk/car-search'
       '?sort=sponsored&radius=10&postcode=e16lt'
       '&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New')

r = requests.get(url)

In [None]:
r.text[:1000] # First 1000 characters of the HTML

### Now that we have the web page, we can parse it with BeautifulSoup:

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
print(soup.prettify())

In [None]:
description = []
price = []
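# Each search result is an <li> with the class 'search-page__result'
for car in soup.find_all('li', attrs={'class':'search-page__result'}):
    # find() returns None when a tag is missing, so .text raises AttributeError
    try:
        description.append(car.find('h2', attrs={'class':'listing-title title-wrap'}).text)
    except AttributeError:
        description.append(np.nan)

    try:
        price.append(car.find('div', attrs={'class':'vehicle-price'}).text)
    except AttributeError:
        price.append(np.nan)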

cars = pd.DataFrame({'Description': description,
                     'Price': price})
cars

### We can now set up a loop to go through all the different pages of this website search:

In [None]:
description = []
price = []
# Loop over result pages 1 to 20, with a progress bar
for x in tqdm_notebook(range(1, 21)):
    url = ('https://www.autotrader.co.uk/car-search'
           '?sort=sponsored&radius=10&postcode=e16lt'
           '&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New'
           '&page={}').format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for car in soup.find_all('li', attrs={'class':'search-page__result'}):
        # find() returns None when a tag is missing, so .text raises AttributeError
        try:
            description.append(car.find('h2', attrs={'class':'listing-title title-wrap'}).text)
        except AttributeError:
            description.append(np.nan)

        try:
            price.append(car.find('div', attrs={'class':'vehicle-price'}).text)
        except AttributeError:
            price.append(np.nan)

cars = pd.DataFrame({'Description': description,
                     'Price': price})

In [None]:
cars.info()

In [None]:
cars.head()
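
The parsing step is repeated in both cells above, so it can be pulled into a function and checked without touching the network. This is a sketch only: the helper names `text_or_nan` and `parse_listings` are made up here, and `sample_page` is a hand-written stand-in that reuses the class names from the cells above; the real site's markup may differ or change.

```python
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd


def text_or_nan(tag):
    """Return the stripped text of a tag, or NaN if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else np.nan


def parse_listings(page_html):
    """Extract one dict per listing from the HTML of a results page."""
    soup = BeautifulSoup(page_html, 'html.parser')
    rows = []
    for car in soup.find_all('li', attrs={'class': 'search-page__result'}):
        title = car.find('h2', attrs={'class': 'listing-title title-wrap'})
        cost = car.find('div', attrs={'class': 'vehicle-price'})
        rows.append({'Description': text_or_nan(title),
                     'Price': text_or_nan(cost)})
    return rows


# Hand-written stand-in for a results page; the second listing has no price.
sample_page = '''
<ul>
  <li class="search-page__result">
    <h2 class="listing-title title-wrap">Example hatchback</h2>
    <div class="vehicle-price">£5,000</div>
  </li>
  <li class="search-page__result">
    <h2 class="listing-title title-wrap">Example estate</h2>
  </li>
</ul>
'''

cars_sample = pd.DataFrame(parse_listings(sample_page))
print(cars_sample)
```

Inside the page loop, the same function could be called on `r.text` for each page and the resulting lists of rows concatenated before building the DataFrame.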

## Pair Practice: Rightmove

Using the URL below:

1. Have a look at the HTML using 'Inspect' on the website.
2. Look at the tags and what is linked to different sections of the website.
3. Write a script that creates a DataFrame of the houses for sale, with their location, description and price.

In [None]:
url = 'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E87490&index=0'