
# Scraping a website without an API using Beautiful Soup


## Requesting a website

To download a webpage and its content, we first send a request to the server, wait for the response, and store it in a Python variable. The [requests](https://pypi.org/project/requests/) library handles this. Before using it, we need to import it into the current runtime (this only needs to be done once per session).

In [1]:
import requests

Following the [quickstart tutorial](https://requests.readthedocs.io/en/latest/user/quickstart/), we can fetch a website that interests us. In this case we will scrape the very first website, created at CERN in Geneva (where the World Wide Web was proposed in 1989).

In [4]:
r = requests.get('http://info.cern.ch/hypertext/WWW/TheProject.html')
print(r)

<Response [200]>


The code above should print `<Response [200]>`; the status code 200 means the request succeeded. To get the HTML source code of the page, we access the `text` property and store it in a variable called `source`.

In [None]:
print(r.text)
source = r.text
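
The request above assumes that everything went well. As a minimal sketch, the fetch could be wrapped in a small function that fails loudly when the server answers with an error code. The function name `fetch_html` and the 10-second `timeout` are assumptions made for this illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as text (hypothetical helper for illustration)."""
    # timeout (an assumed value) stops the call from waiting forever on a slow server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('http://info.cern.ch/hypertext/WWW/TheProject.html')` should then return the same text as `r.text` above, provided the request succeeds.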

## BeautifulSoup 

The raw HTML is hard to read, and picking information out of it by hand is tedious and time-consuming. This is where [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) comes into play. The library is designed for extracting data from web pages: it provides convenient ways of navigating, searching, and modifying the parse tree of HTML and XML files, which commonly saves programmers hours or days of work. So, let's import it:

In [6]:
from bs4 import BeautifulSoup

We load the source code into Beautiful Soup, which creates a Python object that can be navigated and searched, instead of a plain text string.

In [7]:
soup = BeautifulSoup(source, 'html.parser')
print(soup)

<header>
<title>The World Wide Web project</title>
<nextid n="55"/>
</header>
<body>
<h1>World Wide Web</h1>The WorldWideWeb (W3) is a wide-area<a href="WhatIs.html" name="0">
hypermedia</a> information retrieval
initiative aiming to give universal
access to a large universe of documents.<p>
Everything there is online about
W3 is linked directly or indirectly
to this document, including an <a href="Summary.html" name="24">executive
summary</a> of the project, <a href="Administration/Mailing/Overview.html" name="29">Mailing lists</a>
, <a href="Policy.html" name="30">Policy</a> , November's  <a href="News/9211.html" name="34">W3  news</a> ,
<a href="FAQ/List.html" name="41">Frequently Asked Questions</a> .
<dl>
<dt><a href="../DataSources/Top.html" name="44">What's out there?</a>
<dd> Pointers to the
world's online information,<a href="../DataSources/bySubject/Overview.html" name="45"> subjects</a>
, <a href="../DataSources/WWW/Servers.html" name="z54">W3 servers</a>, etc.
<dt><a href="

While the printed `soup` variable looks much the same as `source`, the major difference is that it is a Python object with methods for accessing the HTML structure. Let's say we are interested only in the hyperlinks on the page:

In [8]:
soup.find_all("a")

[<a href="WhatIs.html" name="0">
 hypermedia</a>,
 <a href="Summary.html" name="24">executive
 summary</a>,
 <a href="Administration/Mailing/Overview.html" name="29">Mailing lists</a>,
 <a href="Policy.html" name="30">Policy</a>,
 <a href="News/9211.html" name="34">W3  news</a>,
 <a href="FAQ/List.html" name="41">Frequently Asked Questions</a>,
 <a href="../DataSources/Top.html" name="44">What's out there?</a>,
 <a href="../DataSources/bySubject/Overview.html" name="45"> subjects</a>,
 <a href="../DataSources/WWW/Servers.html" name="z54">W3 servers</a>,
 <a href="Help.html" name="46">Help</a>,
 <a href="Status.html" name="13">Software Products</a>,
 <a href="LineMode/Browser.html" name="27">Line Mode</a>,
 <a href="Status.html#35" name="35">Viola</a>,
 <a href="NeXT/WorldWideWeb.html" name="26">NeXTStep</a>,
 <a href="Daemon/Overview.html" name="25">Servers</a>,
 <a href="Tools/Overview.html" name="51">Tools</a>,
 <a href="MailRobot/Overview.html" name="53"> Mail robot</a>,
 <a href="S

We can also search for other specific HTML tags, such as `<h1>` or `<p>`:


In [10]:
headlines = soup.find_all("h1")
texts = soup.find_all("p")
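
Besides `find_all()`, Beautiful Soup also offers `find()`, which returns only the first match. The sketch below contrasts the two on a small made-up HTML string (`sample_html` is an assumption for illustration, not the CERN page), including what happens when a tag does not occur at all.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
sample_html = "<h1>World Wide Web</h1><p>First paragraph.</p><p>Second paragraph.</p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_paragraph = sample_soup.find("p")     # a single tag: the first <p>
all_paragraphs = sample_soup.find_all("p")  # a list-like collection of every <p>
missing = sample_soup.find("h2")            # None, because there is no <h2>
missing_all = sample_soup.find_all("h2")    # an empty collection

print(first_paragraph)
print(len(all_paragraphs))
print(missing)
print(len(missing_all))
```

Since `find()` returns `None` when nothing matches, it is worth checking the result before calling further methods on it.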

In [15]:
print(headlines)

[<h1>World Wide Web</h1>]


At the time of writing, the webpage contains only one `<h1>` headline, so `find_all` returns a list with a single entry. To access the first element of that list, we use its index:

In [13]:
print(headlines[0])

<h1>World Wide Web</h1>


The text is still wrapped in `<h1>` tags. To get the plain text, we can use the `getText()` method (spelled `get_text()` in the Beautiful Soup documentation):

In [16]:
print(headlines[0].getText())

World Wide Web


To summarize, the steps were:

1.   fetch the website content
2.   turn the source code into a Beautiful Soup object
3.   find only the `h1` tags (titles)
4.   print only the headline text

Put together, we could write:

In [17]:
import requests # import necessary libraries
from bs4 import BeautifulSoup

r = requests.get('http://info.cern.ch/hypertext/WWW/TheProject.html') # fetch the website
source = r.text # store the response body in the variable source

soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object 
headlines = soup.find_all("h1") # search for h1 tags

print(headlines[0].getText()) # get the text for the first found h1 tag

World Wide Web
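
Indexing with `headlines[0]` raises an `IndexError` when a page has no `<h1>` at all. One possible way to make the parsing steps reusable and tolerant of missing tags is sketched below. The function name `headline_texts` and the sample HTML are assumptions made for this illustration, and the sketch uses `get_text()`, the spelling from the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup


def headline_texts(html, tag="h1"):
    """Return the text of every `tag` element in `html` (hypothetical helper)."""
    parsed = BeautifulSoup(html, "html.parser")
    # strip=True removes whitespace at the start and end of each text
    return [element.get_text(strip=True) for element in parsed.find_all(tag)]


# Made-up HTML for illustration only
sample_html = "<html><body><h1>World Wide Web</h1><p>Some text.</p></body></html>"

print(headline_texts(sample_html))        # ['World Wide Web']
print(headline_texts(sample_html, "h2"))  # [] because there is no <h2>
```

Returning a list keeps the function usable on pages with zero, one, or many headlines.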


Let's look only at the links that are present on the website:

In [19]:
links = soup.find_all("a") # find all links on webpage
print(links[0].getText()) # show text of first link in list


hypermedia


There is more than one link in the webpage source. To see how many elements the `links` list contains, we can print its length with the `len()` function:

In [20]:
print(len(links))

25


At the time of writing, there are 25 links. Let's loop over that list. See this [Python tutorial](https://www.w3schools.com/python/python_lists_loop.asp) on looping through a list:

In [None]:
for link in links:
  print(link)

We can again use the `getText()` method to extract only the text of each link. See the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to discover other useful functions for extracting data.

In [22]:
for link in links:
  print(link.getText())


hypermedia
executive
summary
Mailing lists
Policy
W3  news
Frequently Asked Questions
What's out there?
 subjects
W3 servers
Help
Software Products
Line Mode
Viola
NeXTStep
Servers
Tools
 Mail robot

Library
Technical
Bibliography
People
History
How can I help
Getting code

anonymous FTP
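
Some of the link texts above contain line breaks or leading spaces, and the link targets themselves are not shown. As a rough sketch, the text can be tidied with plain string methods and the `href` attribute can be turned into a full address with `urljoin` from the standard library. The three links in `sample_html` are copied from the page output above; the variable names are assumptions for this illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Address of the page the links were found on
base_url = "http://info.cern.ch/hypertext/WWW/TheProject.html"

# Three links copied from the page output shown earlier
sample_html = """
<a href="WhatIs.html" name="0">
hypermedia</a>
<a href="Summary.html" name="24">executive
summary</a>
<a href="../DataSources/Top.html" name="44">What's out there?</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for link in sample_soup.find_all("a"):
    text = " ".join(link.get_text().split())  # collapse line breaks and extra spaces
    href = link.get("href")                    # None if the tag has no href attribute
    rows.append((text, urljoin(base_url, href)))  # resolve relative to the page address

for text, url in rows:
    print(text, "->", url)
```

The relative path `../DataSources/Top.html` resolves to `http://info.cern.ch/hypertext/DataSources/Top.html`, one directory above the page itself.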


Can you find a simple website and extract some information from it? Avoid pages that are too complex and platforms that require a login. From here on it can matter which user agent you send, since some websites won't serve their content if the server detects that the request comes from a Python script. In that case, we need to adapt the request a little. First, we define "fake" header information with a user agent that looks inconspicuous (here, the Chrome browser on a Macintosh computer). I copied this header from a standard web browser.

In [25]:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}


This `headers` variable needs to be sent along with the `requests.get()` call that fetches the website. That's it! The request now looks like it comes from a regular browser, although a server can still recognize automated traffic in other ways. Be sure to pass these headers in your requests:

In [26]:
r = requests.get("https://google.ch/",headers=headers) # fetch the website
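
To see why a server can tell a script from a browser, it may help to compare the user agent that requests sends by default with the one defined above. The sketch below only builds the request and does not send it, so nothing is downloaded. The variable names `browser_headers`, `default_agent` and `prepared` are assumptions for this illustration.

```python
import requests

# Same user-agent string as in the headers defined above
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/39.0.2171.95 Safari/537.36"
}

# The user agent requests uses when no headers are passed
default_agent = requests.utils.default_user_agent()
print(default_agent)  # starts with "python-requests/"

# Build the request without sending it, then inspect the header it would carry
prepared = requests.Request("GET", "https://google.ch/", headers=browser_headers).prepare()
print(prepared.headers["User-Agent"])
```

The default value openly names the library, which is what some servers react to.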

Start with this code snippet:

In [27]:
import requests # import necessary libraries
from bs4 import BeautifulSoup

my_url = "https://nzz.ch/"

r = requests.get(my_url,headers=headers) # fetch the website
source = r.text # store the response body in the variable source

soup = BeautifulSoup(source, 'html.parser') # parse the webpage source into a bs4 soup object 
print(soup)

<!DOCTYPE html>

<html data-n-head="%7B%22lang%22:%7B%22ssr%22:%22de%22%7D%7D" data-n-head-ssr="" lang="de">
<head>
<script id="script-loader" type="text/javascript">function cloneAttributes(e,t){for(var r=0;r<t.attributes.length;r++)e.setAttribute(t.attributes[r].nodeName,t.attributes[r].nodeValue)}var origInsertBefore=document.head.insertBefore;document.head.insertBefore=function(e,t){var r=["/tracking.js","//ens.",".ensighten.com","chartbeat","jwpcdn.com/player","embed-cdn.surveyhero.com"].some(t=>!e.getAttribute("src")||e.getAttribute("src").indexOf(t)>-1);if("false"===e.getAttribute("postload")&&(r=!0),r||"script"!==e.tagName.toLowerCase()||window.nzzScriptLazy)e.removeAttribute("async"),e.removeAttribute("defer"),origInsertBefore.call(document.head,e,t);else{var o=document.createElement("source");cloneAttributes(o,e),origInsertBefore.call(document.head,o,t)}},setTimeout((function(){document.querySelectorAll('head source, body source[type="text/javascript"]').forEach((function(e){