<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping

![](scraping_meme.jpg)

## Objectives

- Understand what Web Scraping is.
- Understand why as Data Scientists we might want to scrape the web.
- Use `requests` and `BeautifulSoup` to scrape data from the web using Python.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1">Objectives</a></span></li><li><span><a href="#Why-do-we-scrape-the-web?" data-toc-modified-id="Why-do-we-scrape-the-web?-2">Why do we scrape the web?</a></span><ul class="toc-item"><li><span><a href="#Getting-Info-from-a-Web-Page" data-toc-modified-id="Getting-Info-from-a-Web-Page-2.1">Getting Info from a Web Page</a></span></li><li><span><a href="#If-I-wanted-to-get-a-list-of-all-of-the-countries-visited,-how-would-I-do-it?" data-toc-modified-id="If-I-wanted-to-get-a-list-of-all-of-the-countries-visited,-how-would-I-do-it?-2.2">If I wanted to get a list of all of the countries visited, how would I do it?</a></span></li></ul></li><li><span><a href="#Getting-Info-from-a-Web-Page" data-toc-modified-id="Getting-Info-from-a-Web-Page-3">Getting Info from a Web Page</a></span><ul class="toc-item"><li><span><a href="#Requests-Library" data-toc-modified-id="Requests-Library-3.1">Requests Library</a></span></li></ul></li><li><span><a href="#Example:-Autotrader" data-toc-modified-id="Example:-Autotrader-4">Example: Autotrader</a></span><ul class="toc-item"><li><span><a href="#Now-that-we-have-the-web-page,-we-can-parse-it-with-BeautifulSoup:" data-toc-modified-id="Now-that-we-have-the-web-page,-we-can-parse-it-with-BeautifulSoup:-4.1">Now that we have the web page, we can parse it with BeautifulSoup:</a></span></li><li><span><a href="#We-can-now-set-up-a-loop-to-go-through-all-the-different-pages-of-this-website-search:" data-toc-modified-id="We-can-now-set-up-a-loop-to-go-through-all-the-different-pages-of-this-website-search:-4.2">We can now set up a loop to go through all the different pages of this website search:</a></span></li></ul></li><li><span><a href="#Pair-Practice:-Rightmove" data-toc-modified-id="Pair-Practice:-Rightmove-5">Pair Practice: Rightmove</a></span></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
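# tqdm_notebook is deprecated in recent tqdm releases; alias the notebook progress bar instead
from tqdm.notebook import tqdm as tqdm_notebook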

## Why do we scrape the web?

* Realistically, the data you want to study won't always be available as a curated data set.
* Often you need to go to the internet to find interesting data, for example:
    * Data from an existing company
    * Text for NLP
    * Images

## Scraping from a Web Page with Python

Scraping a website comes down to two tasks: making a **request from Python** and **parsing the HTML** that is returned from each page. There is a Python library for each task: **`requests`** for the request and **`bs4`** (BeautifulSoup) for the parsing.

See the [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for details.

### Getting Info from a Web Page

Once we have the HTML for a web page, we need **some way to pull the desired content from it**. Luckily there is already a system in place to do this. With a **combination of HTML tags and CSS selectors** we can identify the information on an HTML page that we wish to retrieve and grab it with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

In [2]:
html = '''<!DOCTYPE html>
<html>
<head>
<title>The title of this web page</title>
</head>
<body>
<h1>My Photos</h1>
<div class='intro'>
<p>These are some photos of my trips.</p>
<img src="me.png">
</div>

<h3>Italy</h3>
<div class='country'>
<img src="venice1.png" alt="Venice"> <br />
<img src="venice2.png" alt="Venice"> <br />
<img src="rome.png" alt="Roma">
</div>

<h3>Germany</h3>
<div class='country'>
<img src="berlin.png" alt="Berlin">
</div>
</body>
</html>
'''

In [3]:
from bs4 import BeautifulSoup

# we create a soup object with the html:
soup = BeautifulSoup(html, 'html.parser')

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The title of this web page
  </title>
 </head>
 <body>
  <h1>
   My Photos
  </h1>
  <div class="intro">
   <p>
    These are some photos of my trips.
   </p>
   <img src="me.png"/>
  </div>
  <h3>
   Italy
  </h3>
  <div class="country">
   <img alt="Venice" src="venice1.png"/>
   <br/>
   <img alt="Venice" src="venice2.png"/>
   <br/>
   <img alt="Roma" src="rome.png"/>
  </div>
  <h3>
   Germany
  </h3>
  <div class="country">
   <img alt="Berlin" src="berlin.png"/>
  </div>
 </body>
</html>



In [5]:
# now we can query it
soup.title

<title>The title of this web page</title>

In [6]:
soup.title.text

'The title of this web page'

In [7]:
soup.h1

<h1>My Photos</h1>

In [8]:
soup.h3

<h3>Italy</h3>

In [9]:
soup.find('h3')

<h3>Italy</h3>

In [10]:
soup.find_all('h3')

[<h3>Italy</h3>, <h3>Germany</h3>]

In [11]:
soup.find_all('h3')[1].text

'Germany'

In [12]:
soup.find_all('div', class_='country')

[<div class="country">
 <img alt="Venice" src="venice1.png"/> <br/>
 <img alt="Venice" src="venice2.png"/> <br/>
 <img alt="Roma" src="rome.png"/>
 </div>, <div class="country">
 <img alt="Berlin" src="berlin.png"/>
 </div>]

In [13]:
soup.find_all('img', alt='Venice')

[<img alt="Venice" src="venice1.png"/>, <img alt="Venice" src="venice2.png"/>]

In [14]:
soup.find('div', class_='country').find_previous_siblings('h3')

[<h3>Italy</h3>]
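
The queries above all use `find` and `find_all`. The CSS selectors mentioned earlier are available through BeautifulSoup's `select` and `select_one` methods. The sketch below uses a shortened copy of the example page (an assumption made to keep the block self-contained) to show a descendant selector and an attribute selector.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page, so this block runs on its own
html = '''
<h3>Italy</h3>
<div class="country">
  <img src="venice1.png" alt="Venice">
  <img src="rome.png" alt="Roma">
</div>
<h3>Germany</h3>
<div class="country">
  <img src="berlin.png" alt="Berlin">
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# 'div.country img' matches every <img> nested inside a <div class="country">
images = soup.select('div.country img')
alts = [img['alt'] for img in images]

# Attribute selector: <img> tags whose alt attribute is exactly "Venice"
venice = soup.select('img[alt="Venice"]')

# select_one returns only the first match (or None), much like find
first_heading = soup.select_one('h3').text

print(alts)
print(venice)
print(first_heading)
```

Whether to use `find_all` or `select` is mostly a matter of taste; selectors tend to be shorter when the target is nested several levels deep.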

### If I wanted to get a list of all of the countries visited, how would I do it?

In [15]:
#A
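
One possible answer, offered as a sketch rather than the intended solution: in the example page every country name sits in an `<h3>` tag, so collecting the text of all `<h3>` tags is enough. The second variant starts from each `div` with class `country` and looks back to its heading, which may be safer if a page also uses `<h3>` for other things. The shortened `html` string is an assumption made to keep the block self-contained.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page, so this block runs on its own
html = '''
<h1>My Photos</h1>
<h3>Italy</h3>
<div class="country"><img src="rome.png" alt="Roma"></div>
<h3>Germany</h3>
<div class="country"><img src="berlin.png" alt="Berlin"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Variant 1: every <h3> on this page is a country name
countries = [h3.text for h3 in soup.find_all('h3')]

# Variant 2: start from each country <div> and take the nearest preceding <h3>
countries_from_divs = [
    div.find_previous_sibling('h3').text
    for div in soup.find_all('div', class_='country')
]

print(countries)
print(countries_from_divs)
```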

## Getting Info from a Web Page

### Requests Library

The [requests](http://docs.python-requests.org/en/latest/index.html) library is designed to simplify the process of making **HTTP requests within Python**. The interface is very simple: call the function for the HTTP method you need (this will mostly be `get`) with the URL and any optional parameters you'd like passed along with the request. The response object it returns makes the results of the request available via attributes and methods, such as `status_code` and `text`.
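
As a minimal sketch of the "URL and optional parameters" idea, the block below builds a request without sending it, so the final URL can be inspected offline. The parameter names are taken from the Autotrader search URL used in the next section; splitting them into a `params` dictionary is an illustrative choice, not something the site requires.

```python
import requests

base_url = 'https://www.autotrader.co.uk/car-search'

# Query parameters as a dict; a list value repeats the key once per item
params = {
    'sort': 'sponsored',
    'radius': 10,
    'postcode': 'e16lt',
    'onesearchad': ['Used', 'Nearly New', 'New'],
}

# Build the request but do not send it (no network access needed)
prepared = requests.Request('GET', base_url, params=params).prepare()

print(prepared.method)
print(prepared.url)
```

To actually send the request, `requests.get(base_url, params=params, timeout=10)` returns a response object whose `status_code`, `headers` and `text` attributes hold the result.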

## Example: Autotrader

In [16]:
import requests

# Adjacent string literals are joined, which avoids backslash continuations inside the URL
url = ('https://www.autotrader.co.uk/car-search'
       '?sort=sponsored&radius=10&postcode=e16lt'
       '&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New')

r = requests.get(url)

In [17]:
r.text[:1000] # First 1000 characters of the HTML

'<!DOCTYPE html>\n<html>\n<head>\n\n    <meta name="gpt1" content="/323304435/web/cars/search/listings">\n\n    <meta name="gpt2" content="/323304435/web/cars/content/sponsored">\n\n\n<script>\n    var googletag = googletag || {};\n    googletag.cmd = googletag.cmd || [];\n\n    var pbjs = pbjs || {};\n    pbjs.que = pbjs.que || [];\n</script>\n<script src=\'//d2zv5rkii46miq.cloudfront.net/0/latest/cmp_shim.js\'></script>\n\n    <script type="text/javascript" async="async" src="//ads.rubiconproject.com/prebid/8059_ATfpd.js"></script>\n<!--start Sourcepoint code-->\n<!--//IAB Stub file, implement before header bidding script-->\n<script type="text/javascript">\n    (function () {\n        var e = false;\n        var c = window;\n        var t = document;\n\n        function r() {\n            if (!c.frames["__cmpLocator"]) {\n                if (t.body) {\n                    var a = t.body;\n                    var e = t.createElement("iframe");\n                    e.style.cssText = "

### Now that we have the web page, we can parse it with BeautifulSoup:

In [18]:
soup = BeautifulSoup(r.text, 'html.parser')

In [19]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta content="/323304435/web/cars/search/listings" name="gpt1"/>
  <meta content="/323304435/web/cars/content/sponsored" name="gpt2"/>
  <script>
   var googletag = googletag || {};
    googletag.cmd = googletag.cmd || [];

    var pbjs = pbjs || {};
    pbjs.que = pbjs.que || [];
  </script>
  <script src="//d2zv5rkii46miq.cloudfront.net/0/latest/cmp_shim.js">
  </script>
  <script async="async" src="//ads.rubiconproject.com/prebid/8059_ATfpd.js" type="text/javascript">
  </script>
  <!--start Sourcepoint code-->
  <!--//IAB Stub file, implement before header bidding script-->
  <script type="text/javascript">
   (function () {
        var e = false;
        var c = window;
        var t = document;

        function r() {
            if (!c.frames["__cmpLocator"]) {
                if (t.body) {
                    var a = t.body;
                    var e = t.createElement("iframe");
                    e.style.cssText = "display:none";
            

In [20]:
description = []
price = []
for car in soup.find_all('li', attrs={'class':'search-page__result'}):
    # find() returns None when a tag is missing, so .text raises AttributeError
    try:
        description.append(car.find('h2', attrs={'class':'listing-title title-wrap'}).text)
    except AttributeError:
        description.append(np.nan)

    try:
        price.append(car.find('div', attrs={'class':'vehicle-price'}).text)
    except AttributeError:
        price.append(np.nan)

cars = pd.DataFrame({'Description': description,
                     'Price': price})
cars

| | Description | Price |
|---|---|---|
| 0 | \nPeugeot 308 1.6L Allure BlueHDi 5dr\n | £7,650 |
| 1 | \nHonda Civic 1.8 i-VTEC Sport 5dr\n | £1,480 |
| 2 | \nRenault Clio 1.2 16v Expression 5dr\n | £1,180 |
| 3 | \nVauxhall Vectra 1.8 i VVT SRi 5dr\n | £550 |
| 4 | \nRenault Scenic 1.6 16v Fidji 5dr\n | £495 |
| 5 | \nLexus IS 220d 2.2 TD SE-L 4dr\n | £1,780 |
| 6 | \nFord Streetka 1.6 Winter 2dr\n | £550 |
| 7 | \nVauxhall Astra 1.4 i 16v Club 5dr\n | £900 |
| 8 | \nPeugeot 207 1.6 VTi Sport 5dr\n | £1,360 |
| 9 | \nMercedes-Benz C Class 1.8 C180 Kompressor El... | £790 |
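
The extraction step can also be written without `try`/`except` by checking whether `find` returned `None`. The sketch below wraps that logic in a hypothetical helper, `parse_listings`, and runs it on a small hand-written HTML fragment that reuses the class names from the cell above. The fragment is an assumption for illustration; the live page's markup may differ or change over time.

```python
from bs4 import BeautifulSoup

def parse_listings(html):
    """Return one {'Description', 'Price'} dict per search result in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for car in soup.find_all('li', class_='search-page__result'):
        title = car.find('h2', class_='listing-title title-wrap')
        price = car.find('div', class_='vehicle-price')
        rows.append({
            # get_text(strip=True) also removes the surrounding newlines
            'Description': title.get_text(strip=True) if title else None,
            'Price': price.get_text(strip=True) if price else None,
        })
    return rows

# Hand-written fragment; the second listing deliberately has no price
sample_html = '''
<ul>
  <li class="search-page__result">
    <h2 class="listing-title title-wrap">
Honda Civic 1.8 i-VTEC Sport 5dr
</h2>
    <div class="vehicle-price">£1,480</div>
  </li>
  <li class="search-page__result">
    <h2 class="listing-title title-wrap">Ford Streetka 1.6 Winter 2dr</h2>
  </li>
</ul>
'''

rows = parse_listings(sample_html)
print(rows)
```

A list of dictionaries like this can be passed straight to `pd.DataFrame(rows)`, and missing values then show up as `None`/`NaN` in the corresponding column.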


### We can now set up a loop to go through all the different pages of this website search:

In [22]:
description = []
price = []
for x in tqdm_notebook(range(1, 21)):
    # Same search as before, with the page number filled in
    url = 'https://www.autotrader.co.uk/\
car-search?sort=sponsored&radius=10\
&postcode=e16lt&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&page={}'.format(x)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for car in soup.find_all('li', attrs={'class':'search-page__result'}):
        # find() returns None when a tag is missing, so .text raises AttributeError
        try:
            description.append(car.find('h2', attrs={'class':'listing-title title-wrap'}).text)
        except AttributeError:
            description.append(np.nan)

        try:
            price.append(car.find('div', attrs={'class':'vehicle-price'}).text)
        except AttributeError:
            price.append(np.nan)

cars = pd.DataFrame({'Description': description,
                     'Price': price})
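
The loop above sends 20 requests back to back and assumes each one succeeds. A slightly more defensive variant might check the status code and pause between pages. The sketch below is one way to do that; `fetch_pages` is a hypothetical helper, and the `get` argument is expected to be `requests.get` (or any callable with the same signature), which keeps the block runnable without network access.

```python
import time

def fetch_pages(get, url_template, pages, delay=1.0):
    """Yield (page, html) for every page that responds with HTTP 200.

    `get` is a callable like requests.get; `url_template` contains one
    '{}' placeholder for the page number.
    """
    for page in pages:
        response = get(url_template.format(page), timeout=10)
        if response.status_code != 200:
            # Skip pages that fail instead of parsing an error page
            print(f'Skipping page {page}: HTTP {response.status_code}')
            continue
        yield page, response.text
        # Pause before the next request to go easy on the server
        time.sleep(delay)

# Usage with the real library (not executed here):
#   import requests
#   for page, html in fetch_pages(requests.get, url_template, range(1, 21)):
#       soup = BeautifulSoup(html, 'html.parser')
```

The one-second default delay is an arbitrary choice; a site's terms of use or `robots.txt` may call for something different.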





In [24]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 2 columns):
Description    260 non-null object
Price          260 non-null object
dtypes: object(2)
memory usage: 4.1+ KB


In [23]:
cars.head(30)

| | Description | Price |
|---|---|---|
| 0 | \nVolkswagen Golf S TSI BLUEMOTION TECHNOLOGY ... | £10,500 |
| 1 | \nHonda Civic 1.8 i-VTEC Sport 5dr\n | £1,480 |
| 2 | \nRenault Clio 1.2 16v Expression 5dr\n | £1,180 |
| 3 | \nVauxhall Vectra 1.8 i VVT SRi 5dr\n | £550 |
| 4 | \nRenault Scenic 1.6 16v Fidji 5dr\n | £495 |
| 5 | \nLexus IS 220d 2.2 TD SE-L 4dr\n | £1,780 |
| 6 | \nFord Streetka 1.6 Winter 2dr\n | £550 |
| 7 | \nPeugeot 207 1.6 VTi Sport 5dr\n | £1,360 |
| 8 | \nMercedes-Benz C Class 1.8 C180 Kompressor El... | £790 |
| 9 | \nVolkswagen Polo 1.6 TDI SEL 5dr\n | £2,780 |


## Pair Practice: Rightmove

Using the URL below:

1. Have a look at the HTML using 'Inspect' on the website.
2. Look at the tags and work out which ones correspond to the different sections of the page.
3. Write a script that creates a dataframe of the houses for sale, with their location, description and price.

In [26]:
url = 'https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%5E87490&index=0'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="is-not-modern property-for-sale channel--buy" lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <title>
   Properties For Sale in London - Flats &amp; Houses For Sale in London - Rightmove
  </title>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible">
   <meta content="width=device-width, shrink-to-fit=no, initial-scale=1.0, user-scalable=no" name="viewport"/>
   <meta content="telephone=no" name="format-detection"/>
   <meta content="True" name="HandheldFriendly"/>
   <meta content="Find Properties For Sale in London - Flats &amp; Houses For Sale in London - Rightmove. Search over 900,000 properties for sale from the top estate agents and developers in the UK - Rightmove." name="description"/>
   <meta content="origin-when-cross-origin" name="referrer"/>
   <link crossorigin="" href="https://media.rightmove.co.uk:443" rel="preconnect"/>
   <link crossorigin="" href="//product.rightmove.co.uk" rel="preconnect"/>
   <link href="/pvw/images/favicons/rebra

In [50]:
house = []
location = []
description = []
price = []

for prop in soup.find_all('div', attrs={'class':'propertyCard-wrapper'}):
    # find() returns None when a tag is missing, so .text raises AttributeError
    try:
        house.append(prop.find('h2', attrs={'class':'propertyCard-title'}).text.strip())
    except AttributeError:
        house.append(np.nan)
    try:
        location.append(prop.find('address', attrs={'class':'propertyCard-address'}).text.strip())
    except AttributeError:
        location.append(np.nan)
    try:
        description.append(prop.find('a', attrs={'class':'propertyCard-link'}).text.strip())
    except AttributeError:
        description.append(np.nan)
    try:
        price.append(prop.find('div', attrs={'class':'propertyCard-priceValue'}).text.strip())
    except AttributeError:
        price.append(np.nan)

houses = pd.DataFrame({'House': house,
                       'Location': location,
                       'Description': description,
                       'Price': price})

In [51]:
houses

| | House | Location | Description | Price |
|---|---|---|---|---|
| 0 | 1 bedroom flat for sale | St John Street, Clerkenwell, London, EC1V | 1 bedroom flat for sale \n\n\n\nSt John... | £1,995,000 |
| 1 | 8 bedroom house for sale | Buckingham Gate, St James's Park | 8 bedroom house for sale \n\n\n\nBuckin... | £55,000,000 |
| 2 | 5 bedroom apartment for sale | Mayfair, London | 5 bedroom apartment for sale \n\n\n\nMa... | £55,000,000 |
| 3 | 10 bedroom detached house for sale | Merton Lane, London, N6 | 10 bedroom detached house for sale \n\n... | £40,000,000 |
| 4 | 10 bedroom detached house for sale | Merton Lane, London, N6 | 10 bedroom detached house for sale \n\n... | POA |
| 5 | 8 bedroom house for sale | Wilton Crescent, Belgravia, London, SW1X | 8 bedroom house for sale \n\n\n\nWilton... | £37,000,000 |
| 6 | 5 bedroom town house for sale | Mayfair, London | 5 bedroom town house for sale \n\n\n\nM... | £35,000,000 |
| 7 | 7 bedroom detached house for sale | The Bishops Avenue, Hampstead, London, N2 | 7 bedroom detached house for sale \n\n\... | POA |
| 8 | 6 bedroom terraced house for sale | Cadogan Place, Belgravia, London, SW1X | 6 bedroom terraced house for sale \n\n\... | £34,000,000 |
| 9 | 6 bedroom house for sale | Queen Anne's Gate, St James's Park | 6 bedroom house for sale \n\n\n\nQueen ... | £33,500,000 |
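
The scraped prices are still strings such as `£1,995,000` or `POA`, so they cannot be summed or plotted yet. A possible cleaning step, sketched here on two rows copied from the output above, strips the currency symbol and thousands separators and converts what remains to numbers. The column name `Price_GBP` is made up for this example.

```python
import pandas as pd

# Two rows copied from the scraped output above
houses = pd.DataFrame({
    'House': ['1 bedroom flat for sale', '10 bedroom detached house for sale'],
    'Price': ['£1,995,000', 'POA'],
})

# Remove '£' and ',' then convert; errors='coerce' turns non-numeric
# values such as 'POA' into NaN instead of raising an error
houses['Price_GBP'] = pd.to_numeric(
    houses['Price'].str.replace(r'[£,]', '', regex=True),
    errors='coerce',
)

print(houses)
```

Keeping the original `Price` column alongside the numeric one makes it easy to see which listings were "POA" rather than genuinely missing.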
