# 1. Inspect the website's source code

### HTML Refresher
This part is based on chapter 11 of *Automate the Boring Stuff with Python* by Al Sweigart.

HTML files are plain text files containing *tags*, which are words enclosed in angle brackets. Tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags.

There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the `<a>` tag encloses text that should be a link, and its `href` attribute specifies the URL that the link points to.

Some elements have an `id` attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.
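
To make the terms above concrete, here is a minimal sketch that lists the tags and attributes of a small HTML snippet. It uses Python's built-in `html.parser` module rather than BeautifulSoup, which is introduced further down. The snippet, the `author` id, and the `TagLister` class are made up for illustration.

```python
from html.parser import HTMLParser

# A made-up HTML snippet: an <a> element with an href attribute
# and a <span> element with an id attribute.
html_snippet = (
    '<p>Read the <a href="https://example.com">lecture notes</a> by '
    '<span id="author">the lecturer</span>.</p>'
)


class TagLister(HTMLParser):
    """Collects every starting tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(html_snippet)

for tag, attrs in lister.found:
    print(tag, attrs)
```

The `<p>` tag has no attributes, the `<a>` tag carries `href`, and the `<span>` tag carries the `id` that a scraping program could use to find it.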

### View a page's HTML source

In Firefox:
To view a page's source, right-click the page and choose **View page source**, which opens a new tab with the HTML source.
<img src="img/view_page_source.png" width="500"> 

# 2. Fetching data
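Use Python's `requests` library to download a page. `requests.get()` returns a response object whose `text` attribute holds the page's HTML as a string.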

In [25]:
import requests

url = 'https://github.com/benesom/redi-da-cph-spring21'
r = requests.get(url)
r.text

'\n\n\n\n\n\n<!DOCTYPE html>\n<html lang="en" >\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-PYWr2OavT8crCvolPhJe+bHZ6PG6Q6cH7+2eZue+suNLa9t4w/spUoiSCNG+JfpZIL7kq9rnGXwNXCJup7IQdA==" rel="stylesheet" href="https://github.githubassets.com/assets/frameworks-3d85abd8e6af4fc72b0afa253e125ef9.css" />\n  <link crossorigin="anonymous" media="all" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" rel="stylesheet" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" />\n    <link crossorigin="anonymous" media="all" integrity="sha512-jTdvoiCezBiH9yw26ZDI

# 3. Parse data and extract from it with BeautifulSoup

BeautifulSoup is a module for parsing and extracting information from HTML sources. The module’s name is `bs4`. In case it is not already installed on your machine, install it with:
```bash
pip install beautifulsoup4
```

While `beautifulsoup4` is the name used for installation, to import BeautifulSoup in your notebook you have to use `import bs4`.

Documentation: https://www.crummy.com/software/BeautifulSoup/

_"Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."_

### A Creating a BeautifulSoup Object from our local HTML File

- The `bs4.BeautifulSoup()` function is called with a string containing the HTML it will parse and returns a `BeautifulSoup` object.

You can load a local HTML file and pass either its contents as a string (as in the cell below) or the open file object to `bs4.BeautifulSoup()`.

In [7]:
import bs4

with open('./example.html') as f:
    example_html = f.read()
    
# name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = bs4.BeautifulSoup(example_html, 'html.parser')
print(type(soup))
print(soup.prettify())

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
 <head>
  <title>
   Hello!
  </title>
 </head>
 <body>
  <h1>
   Hello World!
  </h1>
  With great web scraping skills comes great responsibility!
  <br/>
  <br/>
  The
  <a href='\"https://github.com/benesom/redi-da-cph-spring21\"'>
   Lecture Notes
  </a>
  .
  <br/>
  <div>
   <p>
    paragraph 1
   </p>
   <p>
    and paragraph 2:
    <span id="span01">
     This is span 1
    </span>
    <span id="span03">
     Second span element
    </span>
    <span class="red_border">
     Here is the third span
    </span>
   </p>
  </div>
 </body>
</html>



### B Creating a BeautifulSoup object from a website

In [16]:
import requests

url = 'https://github.com/benesom/redi-da-cph-spring21'

r = requests.get(url)
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

print(soup.prettify()[:1500])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-3d85abd8e6af4fc72b0afa253e125ef9.css" integrity="sha512-PYWr2OavT8crCvolPhJe+bHZ6PG6Q6cH7+2eZue+suNLa9t4w/spUoiSCNG+JfpZIL7kq9rnGXwNXCJup7IQdA==" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-8da471024fd1ec4aba5d7b71b7675661.css" integrity="sha512-jaRxAk/R7Eq6XXtxt2dWYc6UfgT/Jk9zYWYh4UpAt5LFRnYVaWqEM3sPhUFL3fOBmHhHoOcn4wfLkMS21Q1yaw==" media="all" rel="stylesheet">
    <link crossorigin="anonymous" href="https://github.githubassets.com/assets/behaviors-8d376fa2209ecc1887f72c36e9

## Finding an Element with the `select()` Method

You can retrieve HTML elements from a `BeautifulSoup` object by calling the `select()` method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.

Common CSS selector patterns include:

  * `soup.select('div')` ... selects all elements named `<div>`
  * `soup.select('#lecturer')` ... selects the element with an id attribute of lecturer
  * `soup.select('.notice')` ... selects all elements that use a CSS class attribute named notice
  * `soup.select('div span')` ... selects all elements named `<span>` that are within an element named `<div>`
  * `soup.select('div > span')` ... selects all elements named `<span>` that are directly within an element named `<div>`, with no other element in between
  * `soup.select('input[name]')` ... selects all elements named `<input>` that have a name attribute with any value
  * `soup.select('input[type="button"]')` ... selects all elements named `<input>` that have an attribute named type with value button
  
See more in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
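
A few of the selector patterns above can be tried on a shortened copy of the example HTML used in this notebook. This is only a sketch: the HTML is embedded as a string here so that the cell runs without `example.html`, and the variable names are chosen for illustration.

```python
import bs4

# A shortened copy of the example HTML used in this notebook.
html = """
<html><body>
<div>
  <p>paragraph 1</p>
  <p>and paragraph 2:
    <span id="span01">This is span 1</span>
    <span id="span03">Second span element</span>
    <span class="red_border">Here is the third span</span>
  </p>
</div>
</body></html>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# select() returns a list-like result, even if only one element matches
by_id = soup.select('#span01')        # element with id "span01"
by_class = soup.select('.red_border')  # elements with class "red_border"
nested = soup.select('div span')       # spans anywhere inside a <div>
direct = soup.select('div > span')     # spans that are direct children of a <div>
with_id = soup.select('span[id]')      # spans that have an id attribute

print(by_id[0].getText())
print(by_class[0].getText())
print(len(nested), len(direct), len(with_id))
```

Note that `'div > span'` matches nothing here, because every `<span>` sits inside a `<p>` and is therefore not a direct child of the `<div>`.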

In [17]:
with open('./example.html') as f:
    example_html = f.read()

soup = bs4.BeautifulSoup(example_html, 'html.parser')

elems = soup.select('body')

#print(soup.prettify())
print('1: return type of select()',type(elems))
print('2: length of the returned list',len(elems))
print('3: type of elements in the list',type(elems[0]))
print('4: get text from the element',elems[0].getText()[:40])
print('5: string representation of an element: ',str(elems[0]))
print('6: the attributes of the element: ',elems[0].attrs)

1: return type of select() <class 'bs4.element.ResultSet'>
2: length of the returned list 1
3: type of elements in the list <class 'bs4.element.Tag'>
4: get text from the element 
Hello World!
With great web scraping sk
5: string representation of an element:  <body>
<h1>Hello World!</h1>
With great web scraping skills comes great responsibility!<br/>
<br/>
The <a href='\"https://github.com/benesom/redi-da-cph-spring21\"'>Lecture Notes</a>.<br/>
<div>
<p>paragraph 1</p>
<p>and paragraph 2: <span id="span01">This is span 1</span><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
</div>
</body>
6: the attributes of the element:  {}


#### Extracting a link

In [22]:
a_elems = soup.select('a')
a_elems[0]['href']

'\\"https://github.com/benesom/redi-da-cph-spring21\\"'
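
The extracted value still contains stray backslashes and double quotes, presumably because the quotes are escaped in `example.html`. One possible way to clean such a value is a plain string `strip()`, sketched here on a copy of the output above; the variable names are made up for illustration.

```python
# The value returned by a_elems[0]['href'] in the cell above
raw_href = '\\"https://github.com/benesom/redi-da-cph-spring21\\"'

# strip() removes leading and trailing characters from the given set,
# here the backslash and the double quote
clean_href = raw_href.strip('\\"')
print(clean_href)
```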

### What is the difference between the `select` and the `find`/`find_all` functions?

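You are not the first to wonder about this. In short, `select()` takes a CSS selector string, while `find()` and `find_all()` filter by tag name and by attributes passed as arguments. See: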
https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x#38033910
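
As a small illustration, the following sketch runs the same two queries once with `select()`/`select_one()` and once with `find_all()`/`find()`. The HTML string is a shortened, embedded copy of the example file, and the variable names are made up.

```python
import bs4

html = """
<div>
  <span id="span01">This is span 1</span>
  <span class="red_border">Here is the third span</span>
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

# All matches: a CSS selector string vs. tag name plus keyword filters
via_select = soup.select('span.red_border')
via_find_all = soup.find_all('span', class_='red_border')

# First match only: both variants return None if nothing matches
first_via_select = soup.select_one('#span01')
first_via_find = soup.find('span', id='span01')

print([e.getText() for e in via_select])
print([e.getText() for e in via_find_all])
print(first_via_select.getText(), '|', first_via_find.getText())
```

For simple queries like these, both styles find the same elements, so the choice is largely a matter of taste.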

# Some things to take into account

**Note:** Many web pages are not built to handle high traffic, or they explicitly discourage automated access. Keep this in mind when writing your scraping tool, for example by pausing between requests.

In [27]:
from time import sleep
sleep(3) # script doesn't continue for 3 seconds
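
When several pages are downloaded in a loop, the pause can go between consecutive requests. The following is a minimal sketch of that idea. `fetch_politely` and `fake_fetch` are hypothetical helpers, and the download itself is replaced by a stand-in so that the cell runs without network access.

```python
from time import sleep


def fetch_politely(urls, fetch, delay_seconds=3):
    """Call fetch(url) for every URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # wait before every request except the first one
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in for a real download so that the sketch runs offline.
def fake_fetch(url):
    return f'<html>content of {url}</html>'


pages = fetch_politely(['page-1', 'page-2'], fake_fetch, delay_seconds=0.1)
print(len(pages))
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url).text`.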

# Example: Scraping Events from a Page


Usually, you will use web scraping to collect information that you cannot gather otherwise.
For example, let's imagine we want to compute some statistics about:
- events, such as concerts, in Copenhagen

Since we cannot find an API or any other open dataset, we decide to scrape the publicly available website www.kultunaut.dk.

The website lists events of all kinds in Denmark.
Concerts in Copenhagen, for example, are accessible here:
- http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik


Considering our example:
- First, we have to figure out how many events there are in total.
- We need this information because the events are paginated, i.e., twenty events per page.
- The link given above only returns the first page with the first twenty events.
- From the total number of events, we can generate the URLs for the subsequent result pages.
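
The last step could be sketched as follows. This is only an illustration under assumptions: the query parameter `startnr` is a HYPOTHETICAL name for the page offset (inspect the real pagination links on the site to find the actual one), and the total of 45 events is made up. In practice the total would be scraped from the first result page.

```python
import math

EVENTS_PER_PAGE = 20  # stated above: twenty events per page

base_url = ('http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK'
            '?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik')


def page_urls(total_events, offset_param='startnr'):
    """Build one URL per result page.

    offset_param is a HYPOTHETICAL query parameter name; check the real
    pagination links on the site before relying on it.
    """
    n_pages = math.ceil(total_events / EVENTS_PER_PAGE)
    urls = []
    for page in range(n_pages):
        first_event = page * EVENTS_PER_PAGE + 1  # 1, 21, 41, ...
        urls.append(f'{base_url}&{offset_param}={first_event}')
    return urls


# 45 is a made-up total; it results in three pages of at most twenty events.
for url in page_urls(45):
    print(url)
```

Rounding up with `math.ceil` makes sure that a last, partially filled page is not dropped.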