# Basic Scraping with Beautiful Soup

_Note: adapted from Intro to Scientific Programming's exercises_


### HTML

HTML stands for HyperText Markup Language (https://www.w3schools.com/html/html_intro.asp).

HTML elements are defined by tags
```
<b>Bold text</b>
```
Tags have attributes
```
<a href="https://rwth-aachen.de">RWTH Aachen</a>
```
Tags can be nested
```
<div id="main">
    <div class="sub">
        <a href="https://rwth-aachen.de">RWTH Aachen</a>
    </div>
    <div class="sub">
        <a href="https://hse.ru">HSE University</a>
    </div>
</div>
<div id="footer">
</div>
```
HTML elements can be selected with CSS selectors, for example by tag name
```
a
```
By ID
```
#main
```
By class
```
.sub
```
Nested selection
```
#main .sub
```
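The four selector forms above can be tried out directly in Python. The following is a minimal sketch, assuming the `beautifulsoup4` package (introduced in the next section) is installed; the variable names `snippet` and `page` are chosen here for illustration only.

```python
from bs4 import BeautifulSoup

# The nested example from above, stored as a string
snippet = """
<div id="main">
    <div class="sub">
        <a href="https://rwth-aachen.de">RWTH Aachen</a>
    </div>
    <div class="sub">
        <a href="https://hse.ru">HSE University</a>
    </div>
</div>
<div id="footer">
</div>
"""

page = BeautifulSoup(snippet, "html.parser")

by_name = page.select("a")          # every <a> element
by_id = page.select("#main")        # the element with id="main"
by_class = page.select(".sub")      # every element with class="sub"
nested = page.select("#main .sub")  # .sub elements located inside #main

print(len(by_name), len(by_id), len(by_class), len(nested))  # prints: 2 1 2 2
```

`select` always returns a list, even when only one element matches, which is why the ID lookup has length 1 rather than being a single tag.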

[web page example](https://www.rwth-aachen.de/cms/root/Studium/Vor-dem-Studium/Studiengaenge/~yev/Liste-Aktuelle-Studiengaenge/lidx/1/?page=1&aaaaaaaaaaaaaum=aaaaaaaaaaaaxqh&showall=1)

### Beautiful Soup

[BeautifulSoup library](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [3]:
!pip install beautifulsoup4


Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Obtaining dependency information for soupsieve>1.2 from https://files.pythonhosted.org/packages/4c/f3/038b302fdfbe3be7da016777069f26ceefe11a681055ea1f7817546508e3/soupsieve-2.5-py3-none-any.whl.metadata
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.5


In [1]:

html_doc = """
<html>
    <head>
        <title>Test html-document</title>
    </head>
    <body>
        <div id="main">
        <div class="sub">
            <a href="https://uni-mannheim.de">Uni Mannheim</a>
        </div>
        <div class="sub">
            <a href="https://rwth-aachen.de">RWTH Aachen</a>
        </div>
        <div class="sub">
            <a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
        </div>
        <div id="footer"></div>
    </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

soup, type(soup)

(
 <html>
 <head>
 <title>Test html-document</title>
 </head>
 <body>
 <div id="main">
 <div class="sub">
 <a href="https://uni-mannheim.de">Uni Mannheim</a>
 </div>
 <div class="sub">
 <a href="https://rwth-aachen.de">RWTH Aachen</a>
 </div>
 <div class="sub">
 <a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
 </div>
 <div id="footer"></div>
 </body>
 </html>,
 bs4.BeautifulSoup)

In [2]:
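soup.html.head.title       # the <title> Tag itself (not displayed: only the last expression is shown)
soup.html.head.title.text  # only the text inside the tag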

'Test html-document'

In [3]:
print(len(soup.select('div.sub')))

3


In [4]:

print(len(soup.select('#main div')))  # every <div> nested inside #main
print(len(soup.select('#sub div')))   # no element has id="sub", so nothing matches

3
0


In [5]:
print(soup.prettify())

<html>
 <head>
  <title>
   Test html-document
  </title>
 </head>
 <body>
  <div id="main">
   <div class="sub">
    <a href="https://uni-mannheim.de">
     Uni Mannheim
    </a>
   </div>
   <div class="sub">
    <a href="https://rwth-aachen.de">
     RWTH Aachen
    </a>
   </div>
   <div class="sub">
    <a href="https://uni-heidelberg.de">
     Uni Heidelberg
    </a>
   </div>
  </div>
  <div id="footer">
  </div>
 </body>
</html>



In [10]:
soup.select('a')

[<a href="https://uni-mannheim.de">Uni Mannheim</a>,
 <a href="https://rwth-aachen.de">RWTH Aachen</a>,
 <a href="https://uni-heidelberg.de">Uni Heidelberg</a>]

In [11]:
first = soup.select('a')[0]
first, type(first)

(<a href="https://uni-mannheim.de">Uni Mannheim</a>, bs4.element.Tag)
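Once a `Tag` is in hand, its name, text, and attributes can be read from it. The sketch below is self-contained and uses a shortened one-link document; `link_html` and the other variable names are illustrative, not part of the original notebook.

```python
from bs4 import BeautifulSoup

link_html = '<div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>'
link = BeautifulSoup(link_html, "html.parser").select("a")[0]

tag_name = link.name          # 'a'
link_text = link.text         # 'Uni Mannheim'
href = link["href"]           # dictionary-style access; raises KeyError if the attribute is absent
missing = link.get("title")   # .get() returns None instead of raising
all_attrs = link.attrs        # all attributes as a dict

print(tag_name, link_text, href, missing, all_attrs)
```

Using `.get()` is usually the safer choice when scraping real pages, where an attribute may be missing on some elements.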

In [13]:
first.parent

<div class="sub">
<a href="https://uni-mannheim.de">Uni Mannheim</a>
</div>

In [14]:
for el in first.parents:
    print(type(el), el.name)

<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> div
<class 'bs4.element.Tag'> body
<class 'bs4.element.Tag'> html
<class 'bs4.BeautifulSoup'> [document]


In [None]:
for el in first.parent.parent.children:
    print(el)
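Iterating over `.children` yields the whitespace between tags as well as the tags themselves. One possible way to keep only the tags is to filter by type, sketched here on a shortened copy of the document (the names `short_html`, `main`, and `tag_children` are illustrative).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

short_html = """
<div id="main">
    <div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>
    <div class="sub"><a href="https://rwth-aachen.de">RWTH Aachen</a></div>
</div>
"""

main = BeautifulSoup(short_html, "html.parser").select_one("#main")

all_children = list(main.children)  # tags and whitespace-only strings
tag_children = [child for child in all_children if isinstance(child, Tag)]

print(len(all_children), len(tag_children))
for child in tag_children:
    print(child.name, child.get("class"))
```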

In [None]:
# Whitespace between tags is parsed as a text node, so next_sibling must be applied twice to reach the next tag
first.parent.next_sibling.next_sibling
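As an alternative to chaining `next_sibling`, Beautiful Soup offers `find_next_sibling`, which skips text nodes and returns the next matching tag. A minimal sketch on a shortened document (variable names are illustrative):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

short_html = """
<div id="main">
    <div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>
    <div class="sub"><a href="https://rwth-aachen.de">RWTH Aachen</a></div>
</div>
"""

doc = BeautifulSoup(short_html, "html.parser")
first_div = doc.select_one("div.sub")

raw_sibling = first_div.next_sibling           # the whitespace string between the two divs
next_div = first_div.find_next_sibling("div")  # the next <div> tag, text nodes skipped

print(repr(raw_sibling))
print(next_div.a.text)  # prints: RWTH Aachen
```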

In [None]:
soup.select('#main')
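Because an ID should be unique on a page, `select` returns a one-element list here. When only a single element is expected, `select_one` may be more convenient; the sketch below compares the two on a shortened document (the ID `#sidebar` is a made-up example of a selector that matches nothing).

```python
from bs4 import BeautifulSoup

short_html = '<div id="main"><div class="sub"></div></div><div id="footer"></div>'
doc = BeautifulSoup(short_html, "html.parser")

main_list = doc.select("#main")        # a list containing one Tag
main_tag = doc.select_one("#main")     # the Tag itself
nothing = doc.select_one("#sidebar")   # None when no element matches

print(len(main_list), main_tag["id"], nothing)  # prints: 1 main None
```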

In [None]:
soup.select('.sub')

In [15]:
soup.find_all('div', class_='sub')

[<div class="sub">
 <a href="https://uni-mannheim.de">Uni Mannheim</a>
 </div>,
 <div class="sub">
 <a href="https://rwth-aachen.de">RWTH Aachen</a>
 </div>,
 <div class="sub">
 <a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>]
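`find_all('div', class_='sub')` and `select('div.sub')` find the same elements. The sketch below checks this on a shortened document and then shows one possible way to turn the matches into a list of (name, URL) pairs; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

short_html = """
<div id="main">
    <div class="sub"><a href="https://uni-mannheim.de">Uni Mannheim</a></div>
    <div class="sub"><a href="https://rwth-aachen.de">RWTH Aachen</a></div>
    <div class="sub"><a href="https://uni-heidelberg.de">Uni Heidelberg</a></div>
</div>
<div id="footer"></div>
"""

doc = BeautifulSoup(short_html, "html.parser")

subs_found = doc.find_all("div", class_="sub")  # keyword-argument search
subs_selected = doc.select("div.sub")           # CSS-selector search

# Both calls return the same Tag objects in document order
same = all(a is b for a, b in zip(subs_found, subs_selected))

# Collect the link text and target of every link inside a .sub block
links = [(a.get_text(strip=True), a["href"]) for a in doc.select("div.sub a")]

print(same)
for name, url in links:
    print(name, url)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.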