# Web Scraping

Date: 2024/01/06

In [1]:
#!pip3 install beautifulsoup4
from bs4 import BeautifulSoup

In [2]:
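# Assumes the downloaded file is UTF-8 encoded; state the encoding explicitly
# so the result does not depend on the platform default.
with open('pg35041/The Project Gutenberg EBook of Johann Sebastian Bach by Johann Nikolaus Forkel and Charles Sanford Terry.html', 'r', encoding='utf-8') as f:
    html_doc = f.read()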

In [3]:
soup = BeautifulSoup(html_doc, 'html.parser')

## Title

In [4]:
soup.title

<title>The Project Gutenberg EBook of Johann Sebastian Bach
 by Johann Nikolaus Forkel and Charles Sanford Terry</title>

## All paragraphs

In [5]:
soup.find_all('p')

[<p><strong>Title</strong>: Johann Sebastian Bach: His Life, Art, and Work</p>,
 <p><strong>Author</strong>: Johann Nikolaus Forkel</p>,
 <p><strong>Translator</strong>: Charles Sanford Terry</p>,
 <p><strong>Release date</strong>: January 24, 2011 [eBook #35041]</p>,
 <p><strong>Language</strong>: English</p>,
 <p class="tei tei-p" style="margin-bottom: 1.00em"></p>,
 <p class="tei tei-p" style="margin-bottom: 1.00em">
 Johann Nikolaus Forkel, author of the monograph 
 of which the following pages afford a translation, 
 was born at Meeder, a small village in Saxe-Coburg, on February 22, 1749, seventeen months 
 before the death of Johann Sebastian Bach, whose 
 first biographer he became.  Presumably he would 
 have followed the craft of his father, the village 
 shoemaker, had not an insatiable love of music 
 seized him in early years.  He obtained books, 
 and studied them with the village schoolmaster. 
 In particular he profited by the <span class="tei tei-q">“Vollkommener
 Kape

## The first paragraph after the CHAPTER I header

In [6]:
# find_all() already returns a list-like ResultSet of the matching text nodes
h1_chapter1 = soup.find_all(string='CHAPTER I. THE FAMILY OF BACH')
h1_chapter1

['CHAPTER I. THE FAMILY OF BACH', 'CHAPTER I. THE FAMILY OF BACH']

In [7]:
p_first = h1_chapter1[1].find_next('p')
p_first

<p class="tei tei-p" style="margin-bottom: 1.00em">
If there is such a thing as inherited aptitude for 
art it certainly showed itself in the family of Bach.  
For six successive generations scarcely two or three 
of its members are found whom nature had not 
endowed with remarkable musical talent, and who 
did not make music their profession.<a class="pginternal" href="https://www.gutenberg.org/cache/epub/35041/pg35041-images.html#note_19" id="noteref_19"><span class="tei tei-noteref"><span style="font-size: 60%; vertical-align: super">19</span></span></a>
</p>
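
The search above returns two matches, and the next cell continues from the second one. A minimal sketch on a made-up document (the HTML and the names `sample_html`, `sample_soup`, `after_toc`, and `after_heading` are hypothetical, not taken from the Gutenberg file) illustrates how `find_all(string=...)` returns text nodes and how `find_next('p')` searches forward from whichever node it is called on:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document: the heading text appears twice,
# once in a table of contents and once as the real chapter heading.
sample_html = """
<ul><li><a href="#ch1">CHAPTER I</a></li></ul>
<p>Preface paragraph.</p>
<h1 id="ch1">CHAPTER I</h1>
<p>First paragraph of the chapter.</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# string= matches text nodes (NavigableString objects), not tags
matches = sample_soup.find_all(string='CHAPTER I')

# find_next() searches forward in document order from the given node
after_toc = matches[0].find_next('p')
after_heading = matches[1].find_next('p')

print(len(matches))
print(after_toc.get_text())
print(after_heading.get_text())
```

In this sketch, starting from the first match would pick up the preface paragraph, which is why the choice of index matters.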

In [8]:
import re
re.sub(r'[\r\n]', '', p_first.get_text())

'If there is such a thing as inherited aptitude for art it certainly showed itself in the family of Bach.  For six successive generations scarcely two or three of its members are found whom nature had not endowed with remarkable musical talent, and who did not make music their profession.19'
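
Deleting the line breaks gives readable text here because the source lines appear to end in a space. Where that is not guaranteed, words on adjacent lines would be fused. One possible alternative, sketched on a made-up string (`raw`, `deleted`, and `collapsed` are illustrative names), is to collapse every run of whitespace into a single space:

```python
import re

# Made-up text: the first line break has no space before it
raw = "inherited aptitude for\nart it certainly \nshowed itself.  Next sentence.\n"

# Approach used in the cell above: delete line breaks outright
deleted = re.sub(r'[\r\n]', '', raw)

# Alternative: collapse any whitespace run to one space, then trim the ends
collapsed = re.sub(r'\s+', ' ', raw).strip()

print(deleted)
print(collapsed)
```

Note that the alternative also reduces the double space after a full stop to a single space, which may or may not be wanted.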

## All paragraphs after the CHAPTER I header and before the APPENDIX I header
- Deep copy of the paragraphs: stripped of child elements, used as text for spaCy NLP
- Original paragraphs: left intact, each later given an `id` attribute

In [9]:
import copy

p_all = h1_chapter1[1].find_all_next('p')
p_all_ = copy.deepcopy(p_all)

# Remove <a> and <span> descendants (e.g. footnote references) from the copies only
for p in p_all_:
    for tag in p.find_all('a'):
        tag.decompose()
    for tag in p.find_all('span'):
        tag.decompose()

In [10]:
# Check that all the child elements have been removed from the copied paragraphs
p_all_text_ = [re.sub(r'[\r\n]', '', p.get_text()) for p in p_all_]
p_all_text_

['If there is such a thing as inherited aptitude for art it certainly showed itself in the family of Bach.  For six successive generations scarcely two or three of its members are found whom nature had not endowed with remarkable musical talent, and who did not make music their profession.',
 'Veit Bach, ancestor of this famous family, gained a livelihood as a baker at Pressburg in Hungary.  When the religious troubles of the sixteenth century broke out he was driven to seek another place of abode, and having got together as much of his small property as he could, retired with it to Thuringia, hoping to find peace and security there.  He settled at Wechmar, a village near Gotha, where he continued to ply his trade as a baker and miller.In his leisure hours he was wont to amuse himself with the lute,playing it amid the noise and clatter of the mill.  His taste for music descended to his two sonsand their children, and in time the Bachs grew to be a very numerous family of professional m

In [11]:
# Compare with the original
p_all_text = [re.sub(r'[\r\n]', '', p.get_text()) for p in p_all]
p_all_text

['If there is such a thing as inherited aptitude for art it certainly showed itself in the family of Bach.  For six successive generations scarcely two or three of its members are found whom nature had not endowed with remarkable musical talent, and who did not make music their profession.19',
 'Veit Bach,20 ancestor of this famous family, [pg 2]gained a livelihood as a baker at Pressburg in Hungary.  When the religious troubles of the sixteenth century broke out he was driven to seek another place of abode, and having got together as much of his small property as he could, retired with it to Thuringia, hoping to find peace and security there.  He settled at Wechmar, a village near Gotha,21 where he continued to ply his trade as a baker and miller.22In his leisure hours he was wont to amuse himself with the lute,23playing it amid the noise and clatter of the mill.  His taste for music descended to his two sons24and their children, and in time the Bachs grew to be a very numerous family
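
The two outputs differ only in the footnote numbers and page markers, which survive in the originals and are gone from the copies. A minimal sketch of the same idea on a single made-up paragraph follows; the HTML and the names `original` and `cleaned` are hypothetical, and `copy.copy()` on one tag is used here in place of the notebook's `copy.deepcopy()` on the whole result list:

```python
import copy
from bs4 import BeautifulSoup

# Hypothetical paragraph with a footnote link and a page marker
sample_html = (
    '<p>Veit Bach,<a href="#n20"><span>20</span></a> ancestor of this family, '
    '<span>[pg 2]</span>gained a livelihood as a baker.</p>'
)
original = BeautifulSoup(sample_html, 'html.parser').p

# copy.copy() on a Tag yields a detached copy including its children
cleaned = copy.copy(original)

# Remove links first, then any remaining spans (same order as the notebook)
for tag in cleaned.find_all('a'):
    tag.decompose()
for tag in cleaned.find_all('span'):
    tag.decompose()

print(cleaned.get_text())
print(original.get_text())
```

`decompose()` removes a tag together with everything inside it, so any prose that happens to be wrapped in a `span` would be dropped along with the markers.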

In [12]:
# First paragraph after the second occurrence of the APPENDIX I header
p_after_last = soup.find_all(string="APPENDIX I. CHRONOLOGICAL CATALOGUE OF BACH'S COMPOSITIONS")[1].find_next('p')
p_after_last

<p class="tei tei-p" style="margin-bottom: 1.00em">
Motet: Lobet den Herrn, alle Heiden. 
</p>

## Extract the main paragraphs and add id attributes to them

Make each main paragraph addressable by its paragraph number, e.g. "pg35041/pg35041-with-ids.html#p_5".
Adding the `id` attributes does not change the text content.

In [13]:
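p_main = []
p_idx = 0
for p in p_all:
    # Identity check: `==` on bs4 tags compares name, attributes and contents,
    # so it could also match an earlier paragraph with identical markup.
    if p is p_after_last:
        print('found')
        break
    # Number the paragraph so it can be addressed as #p_<n>
    p['id'] = f'p_{p_idx}'
    p_main.append(p)
    p_idx += 1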

print(f'{len(p_all)}, {len(p_main)}')

found
820, 159
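
The loop stops at the first paragraph of the appendix, so 159 of the 820 paragraphs found after the chapter header receive an id. The same pattern can be sketched on a made-up document; the HTML and the names `start`, `stop`, and `main` are hypothetical, and the stop condition uses an identity check (`is`):

```python
from bs4 import BeautifulSoup

# Hypothetical document with a chapter followed by an appendix
sample_html = """
<h1>CHAPTER I</h1>
<p>First.</p>
<p>Second.</p>
<h1>APPENDIX I</h1>
<p>Catalogue entry.</p>
<p>Another entry.</p>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

start = sample_soup.find(string='CHAPTER I')
# Sentinel: the first paragraph that belongs to the appendix
stop = sample_soup.find(string='APPENDIX I').find_next('p')

main = []
for idx, p in enumerate(start.find_all_next('p')):
    if p is stop:
        break
    p['id'] = f'p_{idx}'  # modifies the tag inside sample_soup in place
    main.append(p)

print([p['id'] for p in main])
print(stop.has_attr('id'))
```

Because the tags are modified in place, serializing `sample_soup` afterwards includes the new attributes.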


## Output the modified HTML (id attributes added to the paragraphs)

In [14]:
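# Write as UTF-8 explicitly so the output does not depend on the platform default
with open('pg35041/pg35041-with-ids.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))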

## Output paragraphs for spaCy NLP

In [15]:
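import yaml

# Keep only the cleaned texts of the main paragraphs;
# list index i corresponds to the paragraph with id p_i.
paragraphs = p_all_text_[:len(p_main)]

with open('pg35041/pg35041-paragraphs.yaml', 'w', encoding='utf-8') as f:
    f.write(yaml.safe_dump(paragraphs))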
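
A later step would presumably read the list back with `yaml.safe_load`. The sketch below round-trips a made-up list in memory rather than through the file (`sample_paragraphs`, `escaped`, `readable`, and `restored` are illustrative names). It also shows the optional `allow_unicode=True` argument, which the notebook does not use: by default PyYAML writes non-ASCII characters as escape sequences, which load back correctly but are harder to read in the file.

```python
import yaml

# Made-up stand-in for the cleaned paragraph list
sample_paragraphs = ['“Quoted” first paragraph.', 'Second paragraph.']

# Default behaviour: non-ASCII characters are written as escapes
escaped = yaml.safe_dump(sample_paragraphs)

# Optional: keep non-ASCII characters literal in the output
readable = yaml.safe_dump(sample_paragraphs, allow_unicode=True)

# Both forms load back to the same list of strings
restored = yaml.safe_load(readable)

print(escaped)
print(readable)
print(restored == sample_paragraphs)
```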