## An intro to web scraping

### In this notebook, we explore the country index page to get a feel for bs4 syntax and the methods it offers for extracting data from a static webpage

### Import libraries

In [2]:
import requests
from bs4 import BeautifulSoup as bs

### Load page to scrape

In [3]:
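# load page source
page = requests.get("https://www.scrapethissite.com/pages/simple/")

# convert to BeautifulSoup object, naming the parser explicitly
# (the parser name belongs here, not in requests.get, where it would be sent as a query string)
soup = bs(page.content, 'lxml')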

#print the HTML
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="A single page that lists information about all the countries in the world. Good for those just get started with web scraping." name="description"/>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
  <link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
  <meta content="noindex

### Let's Scrape!

In [4]:
soup.title

<title>Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping</title>

In [5]:
print(soup.title.parent.name)  # name of the tag that encloses <title>
print(soup.title.string)  # title text as a string
print(soup.h1)  # first h1 tag in the document

head
Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping
<h1>
                            Countries of the World: A Simple Example
                            <small>250 items</small>
</h1>
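
The attribute-style navigation used above can be tried on a small inline document. This is a minimal sketch: the `html` string is a simplified stand-in for the real page (an assumption), and it uses the built-in `html.parser` so no network access or extra parser is needed.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page's markup (an assumption, not the full page)
html = """
<html>
  <head><title>Countries of the World: A Simple Example</title></head>
  <body><h1>Countries of the World <small>250 items</small></h1></body>
</html>
"""
doc = BeautifulSoup(html, "html.parser")

title_tag = doc.title                  # first <title> tag in the document
parent_name = title_tag.parent.name    # name of the tag that encloses <title>
title_text = title_tag.string          # the tag's single text child
first_h1 = doc.h1                      # shorthand for doc.find("h1")

print(parent_name)
print(title_text)
print(first_h1 == doc.find("h1"))
```

Dot access such as `doc.h1` only ever returns the first matching tag, which is why the later cells switch to `find_all` when every match is needed.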


In [6]:
# get first occurrence of h1 tag
print(soup.find('h1'))

# get all occurrences of h1's
print(soup.find_all('h1'))

print('\n\n-------------------------------------------\n\n')

# get first occurrence of li tag
li = soup.find('li')
print(li)

# get all occurrences of li's
all_li = soup.find_all('li')
print(all_li)

<h1>
                            Countries of the World: A Simple Example
                            <small>250 items</small>
</h1>
[<h1>
                            Countries of the World: A Simple Example
                            <small>250 items</small>
</h1>]


-------------------------------------------


<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                Scrape This Site
                            </a>
</li>
[<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                Scrape This Site
                            </a>
</li>, <li id="nav-sandbox">
<a class="nav-link" href="/pages/">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                Sandbox
                            </a>
</li>, <li id="nav-lessons">
<a class="nav-link
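
The difference between `find` and `find_all` matters most when nothing matches. A minimal sketch, assuming a hypothetical snippet modelled on the navigation list printed above:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the navigation list above
html = """
<ul>
  <li id="nav-homepage"><a href="/">Scrape This Site</a></li>
  <li id="nav-sandbox"><a href="/pages/">Sandbox</a></li>
  <li id="nav-lessons"><a href="/lessons/">Lessons</a></li>
</ul>
"""
doc = BeautifulSoup(html, "html.parser")

first_li = doc.find("li")              # a single Tag
all_li = doc.find_all("li")            # list-like ResultSet of every match
two_li = doc.find_all("li", limit=2)   # stop after the first two matches

missing_one = doc.find("table")        # None when nothing matches
missing_all = doc.find_all("table")    # empty list when nothing matches

print(first_li["id"], len(all_li), len(two_li))
print(missing_one, list(missing_all))
```

Because `find` can return `None`, chaining an attribute onto its result (for example `doc.find("table").text`) raises an `AttributeError` on pages where the tag is absent, so a check for `None` first is usually worthwhile.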

In [7]:
# get all the text content
content = soup.get_text()
print(content)




Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping

















                                Scrape This Site
                            




                                Sandbox
                            




                                Lessons
                            




                                FAQ
                            



                                Login
                            












                            Countries of the World: A Simple Example
                            250 items







                            A single page that lists information about all the countries in the world. Good for those just get started with web scraping.
                            Practice looking for patterns in the HTML that will allow you to extract information about each country. Then, build a simple web scraper that makes a request to this page, parses the HTML and prints out 
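
The long runs of blank lines above come from whitespace between tags. One way to tidy this, sketched here on a hypothetical fragment of the heading markup, is to pass `separator` and `strip=True` to `get_text`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mirrors the indented <h1> shown earlier
html = """
<h1>
    Countries of the World: A Simple Example
    <small>250 items</small>
</h1>
"""
doc = BeautifulSoup(html, "html.parser")

raw = doc.get_text()                              # keeps all surrounding whitespace
tidy = doc.get_text(separator="\n", strip=True)   # strips each string, skips empty ones

print(tidy)
```

With `strip=True`, each text fragment is stripped and empty fragments are dropped before they are joined with the separator.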

In [8]:
# grabbing all h1 and h3 tags in our site
headers = soup.find_all(['h1', 'h3'])
print(headers)

[<h1>
                            Countries of the World: A Simple Example
                            <small>250 items</small>
</h1>, <h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
                            United Arab Emirates
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
                            Afghanistan
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ag"></i>
                            Antigua and Barbuda
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ai"></i>
                            Anguilla
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-al"></i>
                            Albania
                        </h3>, <h3 class="country-name">
<i class="flag-ic

In [9]:
# find_all can also search for attributes

#searching for all country names
countries = soup.find_all('h3', attrs={'class': 'country-name'})
print(countries)

[<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
                            United Arab Emirates
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
                            Afghanistan
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ag"></i>
                            Antigua and Barbuda
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ai"></i>
                            Anguilla
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-al"></i>
                            Albania
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-am"></i>
                            Armenia
                        </h3>, <h3 class="country-name">
<i class="flag-icon
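
The `attrs` dictionary is one of several equivalent ways to filter by class. The sketch below, on a small hypothetical snippet, compares it with the `class_` keyword and a CSS selector:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one matching heading and one that should be ignored
html = """
<div>
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra</h3>
  <h3 class="other">Not a country</h3>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

by_attrs = doc.find_all("h3", attrs={"class": "country-name"})
by_keyword = doc.find_all("h3", class_="country-name")   # class_ avoids the reserved word
by_css = doc.select("h3.country-name")                   # CSS selector form

names = [tag.get_text(strip=True) for tag in by_keyword]
print(names)
```

All three calls return the same tags here; which one to use is mostly a matter of taste.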

In [10]:
# extracting the text of each matched tag

# getting only country names
for country in countries:
  print(country.text.strip())

Andorra
United Arab Emirates
Afghanistan
Antigua and Barbuda
Anguilla
Albania
Armenia
Angola
Antarctica
Argentina
American Samoa
Austria
Australia
Aruba
Åland
Azerbaijan
Bosnia and Herzegovina
Barbados
Bangladesh
Belgium
Burkina Faso
Bulgaria
Bahrain
Burundi
Benin
Saint Barthélemy
Bermuda
Brunei
Bolivia
Bonaire
Brazil
Bahamas
Bhutan
Bouvet Island
Botswana
Belarus
Belize
Canada
Cocos [Keeling] Islands
Democratic Republic of the Congo
Central African Republic
Republic of the Congo
Switzerland
Ivory Coast
Cook Islands
Chile
Cameroon
China
Colombia
Costa Rica
Cuba
Cape Verde
Curacao
Christmas Island
Cyprus
Czech Republic
Germany
Djibouti
Denmark
Dominica
Dominican Republic
Algeria
Ecuador
Estonia
Egypt
Western Sahara
Eritrea
Spain
Ethiopia
Finland
Fiji
Falkland Islands
Micronesia
Faroe Islands
France
Gabon
United Kingdom
Grenada
Georgia
French Guiana
Guernsey
Ghana
Gibraltar
Greenland
Gambia
Guinea
Guadeloupe
Equatorial Guinea
Greece
South Georgia and the South Sandwich Islands
Guatemala
G

In [11]:
# searching for countries starting with D
import re

counts = [tag.get_text(strip=True) for tag in countries if re.search("D", tag.get_text())]  # re.search matches 'D' anywhere in the name, not only at the start
print(counts)

# without using RE: startswith keeps only names that begin with 'D'
counts = [tag.get_text(strip=True) for tag in countries if tag.get_text(strip=True).startswith("D")]
print(counts)


['Democratic Republic of the Congo', 'Djibouti', 'Denmark', 'Dominica', 'Dominican Republic', 'Heard Island and McDonald Islands']
['Democratic Republic of the Congo', 'Djibouti', 'Denmark', 'Dominica', 'Dominican Republic']
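
The extra 'Heard Island and McDonald Islands' entry in the first list appears because `re.search` scans the whole string. A small sketch with a hand-picked list of names (chosen here for illustration) shows how `re.match`, which only matches at the beginning of the string, behaves like `startswith`:

```python
import re

# Hand-picked sample of names from the output above
names = ["Denmark", "Djibouti", "Heard Island and McDonald Islands", "Andorra"]

anywhere = [n for n in names if re.search("D", n)]   # 'D' at any position
at_start = [n for n in names if re.match("D", n)]    # 'D' at position 0 only

print(anywhere)
print(at_start)
```

Note that `re.match` needs the stripped name, since the raw tag text starts with whitespace.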


### Using select

In [26]:
#selecting all capital cities
capitals = soup.select('span.country-capital')
print(capitals)

#get only text
for capital in capitals:
  print(capital.get_text(strip=True))



[<span class="country-capital">Andorra la Vella</span>, <span class="country-capital">Abu Dhabi</span>, <span class="country-capital">Kabul</span>, <span class="country-capital">St. John's</span>, <span class="country-capital">The Valley</span>, <span class="country-capital">Tirana</span>, <span class="country-capital">Yerevan</span>, <span class="country-capital">Luanda</span>, <span class="country-capital">None</span>, <span class="country-capital">Buenos Aires</span>, <span class="country-capital">Pago Pago</span>, <span class="country-capital">Vienna</span>, <span class="country-capital">Canberra</span>, <span class="country-capital">Oranjestad</span>, <span class="country-capital">Mariehamn</span>, <span class="country-capital">Baku</span>, <span class="country-capital">Sarajevo</span>, <span class="country-capital">Bridgetown</span>, <span class="country-capital">Dhaka</span>, <span class="country-capital">Brussels</span>, <span class="country-capital">Ouagadougou</span>, <span c
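
Since `select` returns matches in document order, country names and capitals can be paired by position. This is a sketch under the assumption that every country has exactly one `span.country-capital`; the inline markup is a simplified stand-in for the page.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for two country entries (an assumption about the layout)
html = """
<h3 class="country-name">Andorra</h3>
<span class="country-capital">Andorra la Vella</span>
<h3 class="country-name">United Arab Emirates</h3>
<span class="country-capital">Abu Dhabi</span>
"""
doc = BeautifulSoup(html, "html.parser")

names = [t.get_text(strip=True) for t in doc.select("h3.country-name")]
capitals = [t.get_text(strip=True) for t in doc.select("span.country-capital")]

# Guard against silent misalignment if the two lists differ in length
if len(names) != len(capitals):
    raise ValueError("names and capitals do not line up")

capital_of = dict(zip(names, capitals))
print(capital_of)
```

If the lengths ever differ, selecting each country's enclosing element and searching inside it would be the more robust approach.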

In [31]:
# selecting by ID
nav = soup.select('#nav-lessons')
nav

[<li id="nav-lessons">
 <a class="nav-link" href="/lessons/">
 <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                 Lessons
                             </a>
 </li>]

### Properties

In [33]:
title = soup.find('title')
title.string  # returns the tag's only text child; gives None when the tag has several children



'Countries of the World: A Simple Example | Scrape This Site | A public sandbox for learning web scraping'
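
The limits of `.string` show up on tags with mixed content, such as the page's `<h1>` with its nested `<small>`. A minimal sketch on a hypothetical one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document with a simple tag and a mixed-content tag
html = "<title>Countries of the World</title><h1>Countries <small>250 items</small></h1>"
doc = BeautifulSoup(html, "html.parser")

simple = doc.title.string                     # one text child, so a string comes back
mixed = doc.h1.string                         # text plus <small>: more than one child
joined = doc.h1.get_text(" ", strip=True)     # get_text collects text from all descendants

print(simple)
print(mixed)
print(joined)
```

When a tag may contain nested markup, `get_text` is the safer choice.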

In [48]:
div = soup.select_one('div.row')
print(div.get_text(strip=True))


Countries of the World: A Simple Example250 items


In [62]:
a = soup.find_all('a')
#print(a)

for link in a:
  print(link['href'])

print('\n----------\n')

images = soup.find_all('img')
for image in images:
  print(image['src'])


/
/pages/
/lessons/
/faq/
/login/
/lessons/
http://peric.github.io/GetCountries/

----------

/static/images/scraper-icon.png
https://www.facebook.com/tr?id=764287443701341&ev=PageView&noscript=1
//googleads.g.doubleclick.net/pagead/viewthroughconversion/950945448/?guid=ON&script=0
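
Most of the `href` and `src` values above are relative to the site root. A minimal sketch using the standard library's `urljoin` turns them into absolute URLs; the protocol-relative `cdn.example.com` entry is a hypothetical value added for illustration.

```python
from urllib.parse import urljoin

base_url = "https://www.scrapethissite.com/pages/simple/"

# Values taken from the output above, plus one hypothetical protocol-relative URL
hrefs = [
    "/",
    "/lessons/",
    "/static/images/scraper-icon.png",
    "http://peric.github.io/GetCountries/",
    "//cdn.example.com/pixel.gif",   # hypothetical
]

absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Separately, `link['href']` raises a `KeyError` for a tag without that attribute, whereas `link.get('href')` returns `None`, which can be easier to handle on less tidy pages.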
