<a href="https://colab.research.google.com/github/fleshgordo/scrapinghub/blob/main/003_scraping_bs4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping a website without an API with BeautifulSoup


## Requesting a website

To download a webpage and its content, we first send a request to the server, wait for the response, and store it in a Python variable. This is done with the [requests](https://pypi.org/project/requests/) library. Before using it, we need to import it into the current runtime (this only needs to be done once per session).

In [None]:
import requests

Following the [quickstart tutorial](https://requests.readthedocs.io/en/latest/user/quickstart/), we can fetch a website that interests us. In this case we will scrape the very first website, created at CERN near Geneva, where the World Wide Web was proposed in 1989:

In [None]:
r = requests.get('http://info.cern.ch/hypertext/WWW/TheProject.html')
print(r)

<Response [200]>


The above code should output `<Response [200]>`, which means that the request succeeded. To get the HTML source code of the page, we access the `text` property. We print it and store it in a variable called `source`.

In [None]:
print(r.text)
source = r.text
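Before working with `r.text`, it can be worth checking that the request actually succeeded. The sketch below shows one way to do this. The helper name `html_or_error` is made up for this example, while `raise_for_status()` and `text` belong to the requests `Response` object.

```python
import requests


def html_or_error(response):
    """Return the HTML text of a response, or raise if the server reported an error."""
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text


# Hypothetical usage (needs network access, so it is left commented out):
# r = requests.get('http://info.cern.ch/hypertext/WWW/TheProject.html', timeout=10)
# source = html_or_error(r)
```

If the server answers with an error page, the script stops with an exception instead of silently parsing the wrong content.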

## BeautifulSoup

The raw source code is hard to read, and parsing it by hand is tedious and time-consuming. This is where [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) comes into play. This library is known for extracting data out of web pages. It provides elegant ways of navigating, searching, and modifying the parse tree of HTML and XML files. It commonly saves programmers hours or days of work. So, let's import this library:

In [1]:
from bs4 import BeautifulSoup

Our source code is loaded into BeautifulSoup, which creates a Python object that can be navigated, instead of a plain text string.

In [None]:
soup = BeautifulSoup(source, 'html.parser')
print(soup)

With the function `prettify()` the output looks a bit cleaner:

In [None]:
print(soup.prettify())

While the output of the new `soup` variable looks pretty much the same as `source`, the major difference is that it is a Python object with methods for accessing the HTML structure. Let's say we are only interested in the hyperlinks on the page:

In [None]:
soup.find_all("a")

We can also search for other specific HTML tags such as `<h1>` or `<p>`:


In [None]:
headlines = soup.find_all("h1")
texts = soup.find_all("p")

In [None]:
print(headlines)

At the time of writing, the webpage contains only one `<h1>` headline. Technically, the result is a list with a single entry. To access the first element of that list, we use its index:

In [None]:
print(headlines[0])

<h1>World Wide Web</h1>


The text is still wrapped in the `<h1>` HTML tags. If we want to access the "pure" text, we can use the method `getText()`:

In [None]:
print(headlines[0].getText())

World Wide Web
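When only the first match is needed, `find()` can stand in for `find_all(...)[0]`, and `get_text()` (of which `getText()` is an alias) accepts `strip=True` to trim surrounding whitespace. A minimal sketch on a made-up HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Small invented HTML snippet, so the example runs without a network request
html = "<html><body><h1>  World Wide Web </h1><p>First paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_h1 = soup.find("h1")            # first matching tag, or None if there is none
print(first_h1.get_text(strip=True))  # text without the surrounding whitespace

missing = soup.find("h2")             # this snippet has no <h2> tag
print(missing)                        # find() returns None instead of raising an error
```

Checking for `None` before calling `get_text()` avoids an error on pages where the tag is missing.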


### Summary scraping CERN website

To summarize all the actions it took to:

1.   fetch the website content
2.   transform the source code into a BeautifulSoup object
3.   find only h1 tags (titles)
4.   print only the headline

we could write:

In [None]:
import requests # import necessary libraries
from bs4 import BeautifulSoup

r = requests.get('http://info.cern.ch/hypertext/WWW/TheProject.html') # fetch the website
source = r.text # store the response text in the variable source

soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object
headlines = soup.find_all("h1") # search for h1 tags

print(headlines[0].getText()) # get the text for the first found h1 tag

World Wide Web


Let's look only at the links that are present on the website:

In [None]:
links = soup.find_all("a") # find all links on webpage
print(links[0].getText()) # show text of first link in list


hypermedia


There is more than one link in the webpage source. To see how many elements are in the list that we call `links`, we can print its number of elements with the `len()` function:

In [None]:
print(len(links))

25


At the time of writing, there are 25 links. Let's loop over that list. Study this [python tutorial](https://www.w3schools.com/python/python_lists_loop.asp) on looping through a list:

In [None]:
for link in links:
  print(link)

We can again use the `getText()` function to extract only the text of each link. See the [documentation for beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to discover other useful functions for extracting data.

In [None]:
for link in links:
  print(link.getText())
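Besides the link text, the target of each link is often what we want. It is stored in the `href` attribute and can be read with `get()`. The sketch below uses an invented HTML snippet and an assumed base URL; `urljoin` from the standard library turns relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented snippet; on the real page you would reuse the existing `soup` object
html = '<p><a href="WhatIs.html">hypermedia</a> and <a name="top">no target</a></p>'
base_url = "http://info.cern.ch/hypertext/WWW/TheProject.html"  # assumed page address

soup = BeautifulSoup(html, "html.parser")
targets = []
for link in soup.find_all("a"):
    href = link.get("href")  # returns None if the tag has no href attribute
    if href is not None:
        targets.append(urljoin(base_url, href))

print(targets)
```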

### Fake-headers

It can be important to send a suitable user-agent, since some websites won't serve their content if the server detects that the request comes from a Python script. In this case, we need to adapt the request a little.

First, we define "fake" header information with a user-agent that looks inconspicuous (a Chrome browser on a Macintosh computer). I copied this header from a standard web browser.

In [None]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


This `headers` variable needs to be sent along with the request that fetches the website. That's it! Many simple bot checks will now treat the request like one from a regular browser, although a user-agent alone does not guarantee it. Be sure to define that fake header in your requests:

In [None]:
r = requests.get("https://google.ch/",headers=headers) # fetch the website
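To see what difference the header makes without contacting any server, we can compare the default user-agent of requests with the one in a prepared request. This is only an illustrative sketch: `example.com` is a placeholder address, and nothing is actually sent.

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# The user-agent that requests announces when no header is given
default_agent = requests.utils.default_user_agent()
print(default_agent)  # starts with "python-requests/"

# Build the request without sending it, to inspect the headers it would carry
prepared = requests.Request("GET", "https://example.com/", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

The first line printed is what a server sees by default, which makes it easy to recognize a script.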

### Scrape a news page

In this example we will scrape a newspaper website and try to extract only the headlines.


In [2]:
import requests # import necessary libraries
from bs4 import BeautifulSoup

my_url = "https://nzz.ch/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

r = requests.get(my_url,headers=headers) # fetch the website
source = r.text # store the response text in the variable source

soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object
print(soup.prettify())


<!DOCTYPE html>
<html data-n-head="%7B%22lang%22:%7B%22ssr%22:%22de%22%7D%7D" data-n-head-ssr="" lang="de">
 <head>
  <script id="script-loader" type="text/javascript">
   function cloneAttributes(e,t){for(var r=0;r<t.attributes.length;r++)e.setAttribute(t.attributes[r].nodeName,t.attributes[r].nodeValue)}var origInsertBefore=document.head.insertBefore;document.head.insertBefore=function(e,t){var r=["/tracking.js","//ens.",".ensighten.com",".piano.io","chartbeat","jwpcdn.com/player","embed-cdn.surveyhero.com"].some(t=>!e.getAttribute("src")||e.getAttribute("src").indexOf(t)>-1);if("false"===e.getAttribute("postload")&&(r=!0),r||"script"!==e.tagName.toLowerCase()||window.nzzScriptLazy)e.removeAttribute("async"),e.removeAttribute("defer"),e.getAttribute("src")&&e.getAttribute("src").indexOf("facebook")>-1?setTimeout((function(){origInsertBefore.call(document.head,e,t)}),3e3):origInsertBefore.call(document.head,e,t);else{var o=document.createElement("source");cloneAttributes(o,e),origInse

Have a look at the page source in the web inspector first to analyze the website. As a reminder on how to use the web inspector, watch this [tutorial](https://www.youtube.com/watch?v=TuZJD-lKjCo).

In [4]:

# Inspecting the page showed that all headlines are in <h2> tags
# with the attribute class="teaser__title".
# With bs4 we can filter for exactly those tags:

teasers = soup.find_all("h2", {"class": "teaser__title"})
#print(teasers)
print(f"there are {len(teasers)} headlines on the page") # len() returns the length of the list

# teasers is a list-like object; to print each headline we loop over it
for index, teaser in enumerate(teasers):
  print(f"{index}: {teaser.getText()}")

there are 96 headlines on the page
0:   Michael Ambühl: «Es ist realitätsfremd, zu fordern, die Schweiz solle einen Angriff auf einen europäischen Staat nicht verurteilen»
1:   Russlands Feldzug schreitet voran. Das bedeuten die jüngsten Gebietsgewinne
2:   Krieg in der Ukraine: Ermittler nehmen ranghohen russischen General fest +++ US-Aussenminister Blinken ist in Kiew
3:   «Das ist eine hypothetische Frage»: Sergio Ermotti äussert sich zu Spekulationen, wonach die UBS die Schweiz verlassen könnte
4:   1968 protestierten Amerikas Studenten gegen den Vietnamkrieg. Was lernen wir daraus für 2024?
5:   Der berühmteste Bankräuber der Schweiz hat als Müllmann seinen Frieden gefunden. Doch jetzt verweigert die Stadt Zürich Hugo Portmann die Weiterbeschäftigung
6:   Er ist witzig, charmant und schnell: Open AIs neuer KI-Sprachassistent wirkt wie aus Hollywood
7:   Jasmin Paris ist die erste Frau, die den legendären Barkley-Marathon beendet hat: «Das Scheitern zieht mich magisch an»
8:   Die 
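The same filter can also be written as a CSS selector with `select()`. A small sketch on invented markup that mimics the teaser structure described above (the headline texts are made up):

```python
from bs4 import BeautifulSoup

# Invented markup imitating the teaser structure; not the real page
html = """
<h2 class="teaser__title">First headline</h2>
<h2 class="teaser__title teaser__title--large">Second headline</h2>
<h2 class="other">Not a teaser</h2>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("h2", {"class": "teaser__title"})  # filter by attribute
by_selector = soup.select("h2.teaser__title")               # same filter as CSS selector

for teaser in by_selector:
    print(teaser.get_text(strip=True))
```

Both variants also match tags that carry additional classes next to `teaser__title`.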

Can you find a simple website and extract some information from it? Make sure not to choose a page that is too complex or a platform that requires a login.

Start with this code snippet:

In [None]:
import requests # import necessary libraries
from bs4 import BeautifulSoup

my_url = "YOUR_URL_HERE"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

r = requests.get(my_url,headers=headers) # fetch the website
source = r.text # store the response text in the variable source

soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object
print(soup.prettify())

In [None]:
# continue to analyze your soup here:



## Text2speech

In [5]:
!pip install gtts
from gtts import gTTS

Collecting gtts
  Downloading gTTS-2.5.1-py3-none-any.whl (29 kB)
Installing collected packages: gtts
Successfully installed gtts-2.5.1


In [9]:
for index, teaser in enumerate(teasers):
  tts = gTTS(text=teaser.getText(), lang="de")
  tts.save(f"{index}.mp3") # saves the output files as 0.mp3, 1.mp3, 2.mp3, ...
  print(f"{index}: {teaser.getText()}")

0:   Michael Ambühl: «Es ist realitätsfremd, zu fordern, die Schweiz solle einen Angriff auf einen europäischen Staat nicht verurteilen»
1:   Russlands Feldzug schreitet voran. Das bedeuten die jüngsten Gebietsgewinne
2:   Krieg in der Ukraine: Ermittler nehmen ranghohen russischen General fest +++ US-Aussenminister Blinken ist in Kiew
3:   «Das ist eine hypothetische Frage»: Sergio Ermotti äussert sich zu Spekulationen, wonach die UBS die Schweiz verlassen könnte
4:   1968 protestierten Amerikas Studenten gegen den Vietnamkrieg. Was lernen wir daraus für 2024?
5:   Der berühmteste Bankräuber der Schweiz hat als Müllmann seinen Frieden gefunden. Doch jetzt verweigert die Stadt Zürich Hugo Portmann die Weiterbeschäftigung
6:   Er ist witzig, charmant und schnell: Open AIs neuer KI-Sprachassistent wirkt wie aus Hollywood
7:   Jasmin Paris ist die erste Frau, die den legendären Barkley-Marathon beendet hat: «Das Scheitern zieht mich magisch an»
8:   Die hartnäckige Suche der Ukraine nach 

## Saving output to a file

To write all headlines into a CSV file, we can use Python's csv module. This is for demonstration only and the CSV will contain just one column, but the principle is the same if you have more data to write to the file:

In [None]:
import csv

outputfile = "sample_data/output.csv" # filepath

# Open a CSV file for writing
with open(outputfile, 'w', newline='', encoding='utf-8') as file:
    # Create a writer object
    writer = csv.writer(file)

    # Write each string to a new row in the CSV file
    for teaser in teasers:
        writer.writerow([teaser.getText()])
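With more than one column, the same approach works; each row simply becomes a longer list. The sketch below rests on a few assumptions: the headline strings and the file name `headlines_demo.csv` are invented, and a header row plus an explicit UTF-8 encoding are added so that umlauts survive.

```python
import csv
import os
import tempfile

# Invented stand-ins for the scraped teaser texts
headline_texts = ["Erste Schlagzeile", "Zweite Schlagzeile mit Umlaut: Zürich"]

# Write to a temporary folder so the example does not depend on sample_data/
outputfile = os.path.join(tempfile.gettempdir(), "headlines_demo.csv")

with open(outputfile, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["index", "headline"])       # header row
    for index, text in enumerate(headline_texts):
        writer.writerow([index, text])           # one row per headline

# Read the file back to check what was written
with open(outputfile, newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))
print(rows)
```

Note that the csv module reads every value back as a string, including the index column.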

# Requesting a website that uses Javascript

Many modern websites use JavaScript to render elements dynamically. One strategy is to control a web browser with the selenium library, but this has become difficult to run directly in the Google Colab environment. Another library that promises to render JavaScript is [requests-html](https://requests.readthedocs.io/projects/requests-html/en/latest/). It is very similar to the requests commands we have already used. First, we need to install the library:

In [None]:
!pip install requests-html

After importing the module, we need to create a HTML session:

In [None]:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()

As a test, we will visit a webpage that loads images with JavaScript. With the usual requests library, the scraped page won't necessarily contain the `<img>` tags. Let's try:

In [None]:
url = "https://enviragallery.com/demo/lazy-loading-demo/" # url to fetch
r = requests.get(url) # sending the request
source = r.text # store the response text in the variable source
soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object

In [None]:
print(len(soup.find_all("img")))
print(soup.find_all("img"))

0
[]


As expected, the plain requests call doesn't return any `<img>` tags, since the page inserts them dynamically through JavaScript. The same code with the requests-html library looks like this:

In [None]:
url = "https://enviragallery.com/demo/lazy-loading-demo/" # url to fetch
r = session.get(url) # sending the request through HTML session
source = r.text # store the response text in the variable source
soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object

In [None]:
print(len(soup.find_all("img")))
print(soup.find_all("img"))

43
[<img height="1" src="https://www.facebook.com/tr?id=511410959261450&amp;ev=PageView &amp;noscript=1" width="1"/>, <img data-lazy-src="https://enviragallery.com/wp-content/themes/envira-gallery/images/logo.png" src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E"/>, <img src="https://enviragallery.com/wp-content/themes/envira-gallery/images/logo.png"/>, <img alt="" class="envira-gallery-image envira-gallery-image-1 envira-lazy" data-caption="" data-envira-gallery-id="225278" data-envira-index="0" data-envira-item-id="214199" data-envira-src="https://enviragallery.com/wp-content/uploads/2017/03/zion-canyon-overlook-1024x659.jpg" data-envira-srcset="https://enviragallery.com/wp-content/uploads/2017/03/zion-canyon-overlook-1024x659.jpg 400w,https://enviragallery.com/wp-content/uploads/2017/03/zion-canyon-overlook-1024x659.jpg 2x" data-envirabox="225278" data-no-lazy="1" data-title="Zion - Canyon Overlook" draggable="false" height="

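At the time of writing, the page contained 43 images. Note that `session.get()` on its own does not execute JavaScript; requests-html only runs scripts when `r.html.render()` is called, so part of the difference may come from the browser-like user-agent that `HTMLSession` sends by default. We can first store all `<img>` tags in a new list: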

In [None]:
images = soup.find_all("img")

The link to the image itself is usually not stored in the `src` attribute but in the attribute `data-envira-src`. Let's first print the `title` attribute of every image to see what we found:

In [None]:
for image in images:
  print(image.get('title'))
  #print(image.get('data-envira-src'))

None
None
None
Zion - Canyon Overlook
Sunset View - Bryce Canyon
Some Random Highway in Colorado - Because why not stop for mountains and stars
Looking Up
mequite-dunes
It's About the Experience
Great Sand Dunes National Park - Star Trails
pexels-photo-280189
pexels-photo-157233
water2
water
pexels-photo-120153-medium
pexels-photo-127753-medium
pexels-photo-131032-medium
night
pexels-photo-(1)
jetty-landing-stage-sea-sky
desert
Badwater Basin
Great Sand Dunes National Park
70-200mm Lens
Lens is Best For Wedding Photography
Change Lightbox Background Color
Camera is Best For Wedding Photography
Raindrops as Reflectors
Ceremony
Computer System
Lens
Camera
Travers & Brown
grill2
grill1
grill
pexels-photo-133459-medium
bride3
9 Top Wedding Photography Poses for the Groom
stars
photo-1461088778056-88bf6d3cefcd
photo-1460499593944-39e14f96a8c6
photo-1452473767141-7c6086eacf42


Some of the images on the page lack these attributes (most likely icons or similar menu-related elements), which is why `None` is printed for them. We can include a small if-condition to keep only the images we are interested in:


In [None]:
for image in images:
  if image.get('data-envira-src') is not None:
    print(image.get('data-envira-src'))

https://enviragallery.com/wp-content/uploads/2017/03/zion-canyon-overlook-1024x659.jpg
https://enviragallery.com/wp-content/uploads/2017/03/sunset-view-bryce-canyon-1024x419.jpg
https://enviragallery.com/wp-content/uploads/2017/03/random-highway-1024x668.jpg
https://enviragallery.com/wp-content/uploads/2017/03/looking-up-787x1024.jpg
https://enviragallery.com/wp-content/uploads/2017/03/mequite-dunes-1024x683.jpg
https://enviragallery.com/wp-content/uploads/2017/03/experience-1024x427.jpg
https://enviragallery.com/wp-content/uploads/2017/03/great-sand-dunes-star-trails-1024x683.jpg
https://enviragallery.com/wp-content/uploads/2017/02/pexels-photo-280189-1024x683.jpg
https://enviragallery.com/wp-content/uploads/2017/02/pexels-photo-157233-1024x576.jpg
https://enviragallery.com/wp-content/uploads/2016/09/water2.jpeg
https://enviragallery.com/wp-content/uploads/2016/09/water.jpeg
https://enviragallery.com/wp-content/uploads/2016/09/pexels-photo-120153-medium.jpeg
https://enviragallery.com/
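Lazy-loading scripts store the real image address in different attributes; the `<img>` tags printed earlier show both `data-envira-src` and `data-lazy-src`. One possible way to handle this is a small helper that tries several attribute names in order. The helper name `image_url` and the example markup are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented markup with the three attribute variants seen on lazy-loading pages
html = """
<img data-envira-src="https://example.com/gallery.jpg" src="placeholder.gif"/>
<img data-lazy-src="https://example.com/logo.png" src="placeholder.gif"/>
<img src="https://example.com/plain.png"/>
"""


def image_url(tag, candidates=("data-envira-src", "data-lazy-src", "src")):
    """Return the first attribute from `candidates` that the tag defines, else None."""
    for name in candidates:
        value = tag.get(name)
        if value:
            return value
    return None


soup = BeautifulSoup(html, "html.parser")
urls = [image_url(img) for img in soup.find_all("img")]
print(urls)
```

The order of `candidates` matters: the placeholder in `src` is only used when no lazy-loading attribute is present.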

We can combine this with the batch-downloading script from our previous example. Let's take the title as filename:

In [None]:
for image in images:
  url = image.get('data-envira-src')
  if url is not None:
    filename = f"{image.get('title')}.jpg" # use the title as filename
    response = requests.get(url) # download the image
    # the with-statement closes the file automatically after writing
    with open(f"sample_data/{filename}", "wb") as image_file:
      image_file.write(response.content)
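Titles are free text and may contain characters that are awkward in file names, or be missing altogether. A possible safeguard, sketched here with a hypothetical helper `safe_filename`, replaces such characters and falls back to the file name in the URL. The example URLs are placeholders.

```python
import os
import re
from urllib.parse import urlparse


def safe_filename(title, url):
    """Build a file name from the title, falling back to the URL's file name."""
    url_name = os.path.basename(urlparse(url).path)
    extension = os.path.splitext(url_name)[1] or ".jpg"
    # Replace everything except letters, digits, underscore, dot and dash
    cleaned = re.sub(r"[^\w.-]+", "_", title or "").strip("_")
    if not cleaned:
        # No usable title: reuse the file name from the URL
        return url_name
    return f"{cleaned}{extension}"


print(safe_filename("Zion - Canyon Overlook", "https://example.com/uploads/zion.jpg"))
print(safe_filename(None, "https://example.com/uploads/water2.jpeg"))
```

In the download loop, the result could replace the `filename` variable, so that the file extension also follows the actual image URL.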