# Data Science Recap

Example Coursera DataScience course: labs/DP0701EN/Webscraping postal codes of Canada-Part 1 2 and 3.ipynb

  * https://towardsdatascience.com/introduction-to-web-scraping-with-beautifulsoup-e87a06c2b857
  * https://towardsdatascience.com/in-10-minutes-web-scraping-with-beautiful-soup-and-selenium-for-data-professionals-8de169d36319

  * https://www.dataquest.io/blog/web-scraping-tutorial-python/ - Beginner
  * https://www.dataquest.io/blog/web-scraping-beautifulsoup/
  * https://www.datacamp.com/community/tutorials/web-scraping-python-nlp
  * https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
  * https://www.datacamp.com/community/tutorials/web-scraping-using-python


## 1 Web Scraping with BeautifulSoup - National Weather Service
https://www.dataquest.io/blog/web-scraping-tutorial-python/


  * **Requests**
  * **Beautiful Soup**
  * Scrapy
  * Selenium

<img src = "https://forecast.weather.gov/wwamap/png/US.png" width = 400 align = 'left'>

### Install the BeautifulSoup4 and Requests Python packages

In [4]:
pip install BeautifulSoup4 requests

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/1a/b7/34eec2fe5a49718944e215fde81288eec1fa04638aa3fb57c1c6cd0f98c3/beautifulsoup4-4.8.0-py3-none-any.whl (97kB)
Collecting soupsieve>=1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/0b/44/0474f2207fdd601bb25787671c81076333d2c80e6f97e92790f8887cf682/soupsieve-1.9.3-py2.py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.8.0 soupsieve-1.9.3
Note: you may need to restart the kernel to use updated packages.


## A. Requests and BeautifulSoup step-by-step example

### Import Libraries and Parse HTML

In [5]:
# importing libraries
from bs4 import BeautifulSoup

# https://www.pythonforbeginners.com/requests/using-requests-in-python
import requests

import pandas as pd

In [13]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
# page              # the Response object itself
# page.status_code  # 200 means the request succeeded
page.content        # raw bytes of the HTML document

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

### Prepare and parse the webpage object with BeautifulSoup

In [14]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

#define url to scrape
url = "http://dataquestio.github.io/web-scraping-pages/simple.html"

# connect to the website; catch only request-related errors
try:
    r = requests.get(url, headers=headers)
    print("Connection to", url, "successful")
except requests.exceptions.RequestException:
    print("An error occurred.")

Connection to http://dataquestio.github.io/web-scraping-pages/simple.html successful
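The try/except above only reports connection-level failures; a 404 or 500 response would still print a success message. A minimal sketch of a stricter fetch follows. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    fetch_html and the 10-second default are illustrative choices,
    not part of the original notebook.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() turns 4xx/5xx responses into exceptions
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print("Request failed:", err)
        return None
    return response.content


# A URL without a scheme is rejected before any network traffic is sent
result = fetch_html("not-a-url")
```

Returning `None` on failure keeps the calling cell simple: check the result before handing it to `BeautifulSoup`.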


In [16]:
# parse the downloaded page content with BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

In [18]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


## Analyse all elements step by step

#### One

In [19]:
# soup.children is an iterator; list() shows the top-level nodes

list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [20]:
#types of the elements in soup
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

#### Two

In [21]:
# the third item is the <html> tag
html = list(soup.children)[2]

In [22]:
# list the direct children of <html>
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [23]:
#types of the elements in html
[type(item) for item in list(html.children)]

[bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

#### Three

In [24]:
# the <body> tag is the fourth child of <html> (index 3)
body = list(html.children)[3]

In [25]:
# list the direct children of <body>
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [26]:
#types of the elements in body
[type(item) for item in list(body.children)]

[bs4.element.NavigableString, bs4.element.Tag, bs4.element.NavigableString]

#### Four

In [27]:
# the <p> tag is the second child of <body> (index 1)
p = list(body.children)[1]

In [28]:
# extract the text inside the <p> tag
p.get_text()

'Here is some simple content for this page.'
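The four steps above index into `children` by position, which breaks as soon as the whitespace in the page changes. As a hedged alternative, the sketch below filters the children by type and uses attribute access to reach the same paragraph. The markup is the simple example page inlined as a string; the names `html_doc` and `demo_soup` are assumptions for this sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same markup as the simple example page, inlined so no request is needed
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

demo_soup = BeautifulSoup(html_doc, "html.parser")

# Keep only Tag children, skipping the whitespace NavigableStrings
tag_names = [child.name for child in demo_soup.html.children if isinstance(child, Tag)]
print(tag_names)

# Attribute access walks to the first matching descendant tag
text = demo_soup.html.body.p.get_text()
print(text)
```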

## B. Requests and BeautifulSoup all-at-once example

Find all instances of p with find_all()

In [29]:
soup2 = BeautifulSoup(page.content, 'html.parser')
soup2.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [30]:
soup2.find_all('p')[0].get_text()

'Here is some simple content for this page.'

Find just a single (the first) instance of p with find()

In [33]:
soup2.find('p')

<p>Here is some simple content for this page.</p>
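The difference between `find()` and `find_all()` matters most when nothing matches. The sketch below uses a small made-up document (`html_doc` and `demo` are assumptions for illustration) to show that `find()` returns `None` while `find_all()` returns an empty result, and that `limit` caps the number of matches.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, not one of the tutorial pages
html_doc = "<html><body><p>First.</p><p>Second.</p></body></html>"
demo = BeautifulSoup(html_doc, "html.parser")

first_p = demo.find("p")               # first match only
all_p = demo.find_all("p")             # every match
missing_one = demo.find("h1")          # None when nothing matches
missing_all = demo.find_all("h1")      # empty result when nothing matches
limited = demo.find_all("p", limit=1)  # stop after the first match

# Guard against None before calling methods on a find() result
heading = missing_one.get_text() if missing_one is not None else ""
```

Checking for `None` avoids the `AttributeError` that calling `.get_text()` on a missing tag would raise.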

## C. Requests and BeautifulSoup: searching for tags by class and id

In [35]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

#define url to scrape
url = "http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html"

# connect to the website; catch only request-related errors
try:
    r = requests.get(url, headers=headers)
    print("Connection to", url, "successful")
except requests.exceptions.RequestException:
    print("An error occurred.")

soup3 = BeautifulSoup(r.content, 'html.parser')
soup3

Connection to http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html successful


<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [37]:
# look up all p tags whose class list contains outer-text
soup3.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [39]:
# look up all tags (of any name) whose class list contains outer-text
soup3.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [40]:
# look up all tags with id="first"
soup3.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [41]:
# use a CSS selector to find p tags nested inside a div
soup3.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
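The class and id lookups above can also be written as CSS selectors, which is handy when several conditions need combining. The sketch below is a hedged illustration on a condensed copy of the ids-and-classes markup; `html_doc` and `demo` are names assumed for this example, and `get_text(strip=True)` is used to drop the surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Condensed copy of the ids_and_classes page body, inlined for the example
html_doc = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo = BeautifulSoup(html_doc, "html.parser")

# p.outer-text matches p tags whose class list includes outer-text
outer_texts = [p.get_text(strip=True) for p in demo.select("p.outer-text")]

# #first matches by id; select_one returns a single tag or None
first_text = demo.select_one("#first").get_text(strip=True)

# Tag attributes are read like dictionary keys
first_item_ids = [p["id"] for p in demo.select("p.first-item")]
```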

<hr>

## 2 Web Scraping with BeautifulSoup - IMDB and Metacritic

https://www.dataquest.io/blog/web-scraping-beautifulsoup/