<a id='top'></a><a name='top'></a>
# Chapter 3: Scraping Websites and Extracting Data

Book: [Blueprints for Text Analysis Using Python](https://www.oreilly.com/library/view/blueprints-for-text/9781492074076/)

Repo: https://github.com/blueprints-for-text-analytics-python/blueprints-text

* [Introduction](#introduction)
* [3.0 Imports and Setup](#3.0)
* [3.1 Scraping and Data Extraction](#3.1)
* [3.2 Introducing the Reuters News Archive](#3.2)
* [3.3 URL Generation](#3.3)
* [3.4 Blueprint: Downloading and Interpreting robots.txt](#3.4)
* [3.5 Blueprint: Finding URLs from sitemap.xml](#3.5)
* [3.6 Blueprint: Finding URLs from RSS](#3.6)
* [3.7 Downloading Data](#3.7)
* [3.8 Blueprint: Downloading HTML Pages with Python](#3.8)
* [3.9 Blueprint: Downloading HTML Pages with wget](#3.9)
* [3.10 Extracting Semistructured Data](#3.10)
* [3.11 Blueprint: Extracting Data with Regular Expressions](#3.11)
    - [3.11.1 Using an HTML Parser for Extraction](#3.11.1)
    - [3.11.2 Blueprint: Extracting the Title/Headline](#3.11.2)
    - [3.11.3 Blueprint: Extracting the Article Text](#3.11.3)
    - [3.11.4 Blueprint: Extracting Image Captions](#3.11.4)
    - [3.11.5 Blueprint: Extracting the URL](#3.11.5)
    - [3.11.6 Blueprint: Extracting List Information (Authors)](#3.11.6)
    - [3.11.7 Blueprint: Extracting Text of Links (Section)](#3.11.7)
    - [3.11.8 Blueprint: Extracting Reading Time](#3.11.8)
    - [3.11.9 Blueprint: Extracting Attributes (ID)](#3.11.9)
    - [3.11.10 Blueprint: Extracting Attribution](#3.11.10)
    - [3.11.11 Blueprint: Extracting Timestamp](#3.11.11)
* [3.12 Blueprint: Spidering](#3.12)
    - [3.12.1 Introducing the Use Case](#3.12.1)
    - [3.12.2 Error Handling and Production-Quality Software](#3.12.2)
* [3.13 Density-Based Text Extraction](#3.13)
    - [3.13.1 Extracting Reuters Content with Readability](#3.13.1)
    - [3.13.2 Summary Density-Based Text Extraction](#3.13.2)
* [3.14 All-in-One Approach](#3.14)
* [3.15 Blueprint: Scraping the Reuters Archive with Scrapy](#3.15)
* [3.16 Possible Problems with Scraping](#3.16)

---
<a name='introduction'></a><a id='introduction'></a>
# Introduction
<a href="#top">[back to top]</a>

### Dataset

* news-sitemap: [script](#news-sitemap), [source](http://web.archive.org/web/20200613003232if_/http://feeds.reuters.com/Reuters/worldNews)

### Explore

* How to acquire HTML data from websites.
* How to use tools to extract content from HTML files.

---
<a name='3.0'></a><a id='3.0'></a>
# 3.0 Imports and Setup
<a href="#top">[back to top]</a>

In [1]:
# Start with clean project
# !rm -f *.json 
# !rm -f *.rss 
# !rm -f *.txt 
# !rm -f *.py
# !rm -fr ./site_data 
# !ls -l

In [2]:
!mkdir -p ./site_data

In [3]:
req_file = "requirements_03.txt"

In [4]:
%%writefile {req_file}
bs4
feedparser
isort
readability-lxml
scrapy
tqdm
watermark
xmltodict

Writing requirements_03.txt


In [5]:
import sys

IS_COLAB = 'google.colab' in sys.modules

if IS_COLAB:
    print("Installing packages")
    !pip install --upgrade --quiet -r {req_file}
else:
    print("Running locally.")

Running locally.


In [6]:
%%writefile imports.py
import glob
import locale
import logging
import os.path
import pprint
import re
import urllib.robotparser
import warnings

import feedparser
import matplotlib.pyplot as plt
import pandas as pd
import requests
import scrapy
import seaborn as sns
import xmltodict
from bs4 import BeautifulSoup
from dateutil import parser
from readability import Document
from scrapy.crawler import CrawlerProcess
from tqdm.auto import tqdm
from watermark import watermark

Writing imports.py


In [7]:
!isort imports.py
!cat imports.py

import glob
import locale
import logging
import os.path
import pprint
import re
import urllib.robotparser

import feedparser
import matplotlib.pyplot as plt
import pandas as pd
import requests
import scrapy
import seaborn as sns
import xmltodict
from bs4 import BeautifulSoup
from dateutil import parser
from readability import Document
from scrapy.crawler import CrawlerProcess
from tqdm.auto import tqdm
from watermark import watermark


In [11]:
# Import everything used in this chapter (same list as imports.py)
import glob
import locale
import logging
import os.path
import pprint
import re
import urllib.robotparser
import warnings

import feedparser
import matplotlib.pyplot as plt
import pandas as pd
import requests
import scrapy
import seaborn as sns
import xmltodict
from bs4 import BeautifulSoup
from dateutil import parser
from readability import Document
from scrapy.crawler import CrawlerProcess
from tqdm.auto import tqdm
from watermark import watermark

In [12]:
def HR():
    print("-"*40)
    
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding
warnings.filterwarnings('ignore')
BASE_DIR = '.'
sns.set_style("darkgrid")
tqdm.pandas(desc="progress-bar")
pp = pprint.PrettyPrinter(indent=4)

print(watermark(iversions=True, globals_=globals(),python=True, machine=True))

Python implementation: CPython
Python version       : 3.11.5
IPython version      : 8.18.1

Compiler    : Clang 14.0.0 (clang-1400.0.29.202)
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

feedparser: 6.0.11
logging   : 0.5.1.2
seaborn   : 0.13.2
re        : 2.2.1
sys       : 3.11.5 (main, Jan 16 2024, 17:25:53) [Clang 14.0.0 (clang-1400.0.29.202)]
scrapy    : 2.11.2
matplotlib: 3.9.2
xmltodict : 0.13.0
requests  : 2.32.3
pandas    : 2.2.2
dateutil  : 2.9.0.post0



In [13]:
site_data = 'site_data'

---
<a name='3.1'></a><a id='3.1'></a>
# 3.1 Scraping and Data Extraction
<a href="#top">[back to top]</a>

Scraping a website consists of three main phases:

1. URL generation
2. Download
3. Extraction
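
The three phases can be sketched as three small functions that are chained together. This is only an illustrative skeleton: the archive URL pattern, the function names, and the `fake_fetch` stand-in (which avoids any network access) are assumptions made for this example and are not part of the chapter's code.

```python
import re


def generate_urls(base, pages):
    """Phase 1: URL generation (here: a hypothetical paginated archive)."""
    return [f"{base}?page={p}" for p in range(1, pages + 1)]


def download(url, fetch):
    """Phase 2: download. `fetch` is any callable that maps a URL to HTML text."""
    return fetch(url)


def extract(html):
    """Phase 3: extraction (here: only the content of the <title> element)."""
    match = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
    return match.group(1).strip() if match else None


def fake_fetch(url):
    """Stand-in for a real HTTP request so that the sketch runs offline."""
    return f"<html><head><title>Archive {url[-1]}</title></head></html>"


titles = [
    extract(download(url, fake_fetch))
    for url in generate_urls("https://example.com/archive", 2)
]
print(titles)
```

In the rest of the chapter, each phase is replaced by a real implementation: sitemaps or RSS for URL generation, `requests` for the download, and Beautiful Soup for the extraction.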

---
<a name='3.2'></a><a id='3.2'></a>
# 3.2 Introducing the Reuters News Archive
<a href="#top">[back to top]</a>

Assume we are interested in analyzing the current and past political situation and are looking for an appropriate dataset. We want to find some trends, uncover when a word or topic was introduced for the first time, and so on. For this, we will use the Reuters news archive.

---
<a name='3.3'></a><a id='3.3'></a>
# 3.3 URL Generation
<a href="#top">[back to top]</a>

To download content from the Reuters archive, we need to know the URLs of the content pages. Finding these URLs is called URL generation.

---
<a name='3.4'></a><a id='3.4'></a>
# 3.4 Blueprint: Downloading and Interpreting robots.txt
<a href="#top">[back to top]</a>

Check what we are allowed to download from this website via the robots.txt file.

In [14]:
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reuters.com/robots.txt")
rp.read()

In [15]:
rp.__dict__

{'entries': [<urllib.robotparser.Entry at 0x130ca88d0>,
  <urllib.robotparser.Entry at 0x1306cd8d0>,
  <urllib.robotparser.Entry at 0x130ca8550>,
  <urllib.robotparser.Entry at 0x130ca8510>,
  <urllib.robotparser.Entry at 0x130ca8b90>,
  <urllib.robotparser.Entry at 0x130ca8490>,
  <urllib.robotparser.Entry at 0x130ca9890>,
  <urllib.robotparser.Entry at 0x130ca9950>,
  <urllib.robotparser.Entry at 0x130ca9a10>,
  <urllib.robotparser.Entry at 0x130ca8e90>,
  <urllib.robotparser.Entry at 0x130ca9b50>,
  <urllib.robotparser.Entry at 0x130ca9c10>,
  <urllib.robotparser.Entry at 0x130ca9cd0>,
  <urllib.robotparser.Entry at 0x130ca9d90>,
  <urllib.robotparser.Entry at 0x130ca9e50>],
 'sitemaps': ['https://www.reuters.com/arc/outboundfeeds/sitemap-index/?outputType=xml',
  'https://www.reuters.com/arc/outboundfeeds/news-sitemap-index/?outputType=xml',
  'https://www.reuters.com/plus/sitemap-index.xml',
  'https://www.reuters.com/arc/outboundfeeds/sitemap-plj-index/?outputType=xml',
  'https:

In [16]:
rp.can_fetch("*", "https://www.reuters.com/arc/outboundfeeds/news-sitemap/?outputType=xml")

True

In [17]:
rp.can_fetch("*", "https://www.reuters.com/finance/stocks/option")

False
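
To see how the rules are interpreted without depending on the live Reuters file, the parser can also be fed a robots.txt given as text. The following is a minimal sketch: the rules and the `example.com` URLs are made up for illustration, while `parse()`, `can_fetch()`, `crawl_delay()`, and `site_maps()` are methods of the standard-library `RobotFileParser` (`site_maps()` requires Python 3.8 or later).

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""

rp_demo = urllib.robotparser.RobotFileParser()
# parse() takes the lines of a robots.txt file, so no download is needed
rp_demo.parse(ROBOTS_TXT.splitlines())

allowed = rp_demo.can_fetch("*", "https://example.com/news/article-1")
blocked = rp_demo.can_fetch("*", "https://example.com/private/report")
delay = rp_demo.crawl_delay("*")   # seconds to wait between requests, or None
sitemaps = rp_demo.site_maps()     # list of sitemap URLs, or None

print(allowed, blocked, delay, sitemaps)
```

A polite crawler would check `can_fetch()` before every request and sleep for `crawl_delay()` seconds between requests if a delay is declared.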

---
<a name='3.5'></a><a id='3.5'></a>
# 3.5 Blueprint: Finding URLs from sitemap.xml
<a href="#top">[back to top]</a>

Reuters conveniently has a sitemap, listing the URLs of articles. 

<a id='news-sitemap'></a><a name='news-sitemap'></a>
### Dataset: news-sitemap
<a href="#top">[back to top]</a>

In [18]:
# Might need to install xmltodict
sitemap = xmltodict.parse(requests.get('https://www.reuters.com/arc/outboundfeeds/news-sitemap/?outputType=xml').text)

In [19]:
# Collect the article URLs from the parsed sitemap in a list
urls = [url["loc"] for url in sitemap["urlset"]["url"]]
print("\n".join(urls[0:3]))

https://www.reuters.com/sports/cricket/gill-pant-sparkle-india-reach-195-5-lunch-2024-11-02/
https://www.reuters.com/world/us-election-2024-live-harris-trump-swing-states-66-million-vote-early-2024-11-01/
https://www.reuters.com/world/europe/spain-hunts-bodies-opens-temporary-morgue-flood-death-toll-hits-158-2024-11-01/


In [20]:
# Check how many urls exist
len(urls)

50
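
If `xmltodict` is not available, the same information can be read with the standard library. The sketch below is an alternative, not the approach used in this notebook: it parses a small, made-up sitemap string with `xml.etree.ElementTree`. The sample XML and the function name `sitemap_locs` are assumptions for illustration; the namespace URI is the one defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Made-up sitemap with two entries, used only for illustration
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a/</loc><lastmod>2024-11-02</lastmod></url>
  <url><loc>https://example.com/b/</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must be qualified
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Return the text of all <loc> elements that sit inside a <url> element."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


print(sitemap_locs(SITEMAP_XML))
```

Unlike the dictionary returned by `xmltodict`, `findall()` always returns a list, so a sitemap with a single `<url>` entry needs no special handling.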

---
<a name='3.6'></a><a id='3.6'></a>
# 3.6 Blueprint: Finding URLs from RSS
<a href="#top">[back to top]</a>

Reuters removed its RSS feed after the book was published. We therefore use a saved copy from the Internet Archive.

In [21]:
feed = feedparser.parse('http://web.archive.org/web/20200613003232if_/http://feeds.reuters.com/Reuters/worldNews')

In [22]:
# Examine the format of the RSS file
[(e.title, e.link) for e in feed.entries]

[('Mexico City to begin gradual exit from lockdown on Monday',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/OQtkVdAqHos/mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R'),
 ('Mexico reports record tally of 5,222 new coronavirus cases',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/Rkz9j2G7lJU/mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B'),
 ('Venezuela supreme court to swear in new electoral council leaders, government says',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/cc3R5aq4Ksk/venezuela-supreme-court-to-swear-in-new-electoral-council-leaders-government-says-idUSKBN23J39T'),
 ("One-fifth of Britain's coronavirus patients were infected in hospitals: Telegraph",
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/1_7Wb0S_6-8/one-fifth-of-britains-coronavirus-patients-were-infected-in-hospitals-telegraph-idUSKBN23J382'),
 ('France to lift border controls for EU travellers on June 15',
  'http://feeds.reuters.com/~r/

In [23]:
# Extract the "real" URL, contained in the id field
[e.id for e in feed.entries]

['https://www.reuters.com/article/us-health-coronavirus-mexico-city/mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-mexico/mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-venezuela-politics/venezuela-supreme-court-to-swear-in-new-electoral-council-leaders-government-says-idUSKBN23J39T?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-britain-hospitals/one-fifth-of-britains-coronavirus-patients-were-infected-in-hospitals-telegraph-idUSKBN23J382?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-france-borders/france-to-lift-border-controls-for-eu-travellers-on-june-15-idUSKBN23J385?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-brazil/brazils-covid-19-deaths-sur
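
The `id` values still carry the tracking parameters `feedType` and `feedName`. If plain article URLs are preferred, the query string can be dropped with the standard library. This is a small sketch; the helper name `strip_query` is an assumption for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# First id from the feed output above
feed_id = (
    "https://www.reuters.com/article/us-health-coronavirus-mexico-city/"
    "mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R"
    "?feedType=RSS&feedName=worldNews"
)
clean_url = strip_query(feed_id)
print(clean_url)
```

Applied to the whole feed, this would read `[strip_query(e.id) for e in feed.entries]`.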

---
<a name='3.7'></a><a id='3.7'></a>
# 3.7 Downloading Data
<a href="#top">[back to top]</a>

We can download data easily, both with Python libraries and external tools. 

---
<a name='3.8'></a><a id='3.8'></a>
# 3.8 Blueprint: Downloading HTML Pages with Python
<a href="#top">[back to top]</a>

Use the URLs from the sitemap to download the HTML pages. 

In [24]:
s = requests.Session()
for url in urls[0:10]:
    # The URLs end with "/", so the last path segment is the second-to-last item
    file = url.split("/")[-2]
    
    r = s.get(url)
    print(file)
    with open(f"{site_data}/{file}", "w+b") as f:
        f.write(r.text.encode('utf-8'))
        
print("Done")

gill-pant-sparkle-india-reach-195-5-lunch-2024-11-02
us-election-2024-live-harris-trump-swing-states-66-million-vote-early-2024-11-01
spain-hunts-bodies-opens-temporary-morgue-flood-death-toll-hits-158-2024-11-01
un-cop16-nature-summit-creates-permanent-body-indigenous-peoples-2024-11-02
pakistan-captain-shan-backs-babar-return-stronger-after-drop-2024-11-02
thunder-pull-away-blazers-remain-unbeaten-2024-11-02
cop16-cration-dun-organe-permanent-pour-les-peuples-autochtones-2024-11-02
late-rally-carries-timberwolves-victory-over-nuggets-2024-11-02
dan-vladar-wins-battle-goalies-flames-blank-devils-2024-11-02
bagnaia-smashes-lap-record-storm-pole-malaysian-grand-prix-2024-11-02
Done
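
The index `-2` in the loop above works only because every sitemap URL ends with a slash. A slightly more defensive way to derive the filename is sketched below; the helper name `url_to_filename` and the `.html` suffix are assumptions for this example and are not used elsewhere in the notebook.

```python
def url_to_filename(url, suffix=".html"):
    """Use the last non-empty path segment of the URL as the filename."""
    slug = url.rstrip("/").split("/")[-1]
    return slug + suffix


# Works for URLs with and without a trailing slash
with_slash = url_to_filename(
    "https://www.reuters.com/sports/cricket/gill-pant-sparkle-india-reach-195-5-lunch-2024-11-02/"
)
without_slash = url_to_filename(
    "https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT"
)
print(with_slash)
print(without_slash)
```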


In [25]:
# Create file containing list of urls
with open(f"{site_data}/urls.txt", "w+b") as f:
    f.write("\n".join(urls).encode('utf-8'))
    
!head {site_data}/urls.txt

https://www.reuters.com/sports/cricket/gill-pant-sparkle-india-reach-195-5-lunch-2024-11-02/
https://www.reuters.com/world/us-election-2024-live-harris-trump-swing-states-66-million-vote-early-2024-11-01/
https://www.reuters.com/world/europe/spain-hunts-bodies-opens-temporary-morgue-flood-death-toll-hits-158-2024-11-01/
https://www.reuters.com/business/environment/un-cop16-nature-summit-creates-permanent-body-indigenous-peoples-2024-11-02/
https://www.reuters.com/sports/cricket/pakistan-captain-shan-backs-babar-return-stronger-after-drop-2024-11-02/
https://www.reuters.com/sports/basketball/thunder-pull-away-blazers-remain-unbeaten-2024-11-02/
https://www.reuters.com/fr/international/cop16-cration-dun-organe-permanent-pour-les-peuples-autochtones-2024-11-02/
https://www.reuters.com/sports/basketball/late-rally-carries-timberwolves-victory-over-nuggets-2024-11-02/
https://www.reuters.com/sports/nhl/dan-vladar-wins-battle-goalies-flames-blank-devils-2024-11-02/
https://www.reuters.com/sp

---
<a name='3.9'></a><a id='3.9'></a>
# 3.9 Blueprint: Downloading HTML Pages with wget
<a href="#top">[back to top]</a>

A good tool for mass downloading pages is `wget`.
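
As a sketch of how `wget` could be driven from Python, the snippet below only builds the command line for the URL list written in the previous section; the actual call is left commented out so that nothing is downloaded by accident. The helper name, the target directory, and the one-second wait are assumptions for this example. The options used are standard `wget` options: `--no-clobber` skips files that already exist, `--wait` pauses between requests, `--directory-prefix` sets the output directory, and `--input-file` reads the URLs from a file.

```python
def build_wget_command(url_file, target_dir, wait_seconds=1):
    """Build the argument list for a polite bulk download with wget."""
    return [
        "wget",
        "--no-clobber",                      # do not download existing files again
        f"--wait={wait_seconds}",            # pause between two requests
        f"--directory-prefix={target_dir}",  # where to store the files
        f"--input-file={url_file}",          # file with one URL per line
    ]


cmd = build_wget_command("site_data/urls.txt", "site_data/wget")
print(" ".join(cmd))

# To run the download (requires wget to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```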

---
<a name='3.10'></a><a id='3.10'></a>
# 3.10 Extracting Semistructured Data
<a href="#top">[back to top]</a>

Here we explore different methods to extract data from the Reuters articles. We start with regex and then use a full-fledged HTML parser.

---
<a name='3.11'></a><a id='3.11'></a>
# 3.11 Blueprint: Extracting Data with Regular Expressions
<a href="#top">[back to top]</a>

We will first download a single article.

In [26]:
url = 'https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

file = url.split("/")[-1] + ".html"
print(file)

r = requests.get(url)

with open(f"{site_data}/{file}", "w+") as f:
    f.write(r.text)

us-health-vaping-marijuana-idUSKBN1WG4KT.html


Extract the title with regex from the downloaded file.

In [27]:
with open(f"{site_data}/{file}", "r") as f:
    html = f.read()
    # Non-greedy match: stop at the first closing </title> tag
    g = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
    if g:
        print(g.group(1))

reuters.com
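
Regular expressions are brittle on HTML. One typical pitfall is a greedy quantifier: with `re.DOTALL`, the pattern `(.*)` runs to the *last* `</title>` in the document, which matters as soon as a page contains a second `<title>` element (for example, inside an inline SVG). The HTML snippet below is made up to demonstrate the effect.

```python
import re

# Made-up page with a second <title> element inside an inline SVG
sample_html = (
    "<head><title>Page</title></head>"
    "<body><svg><title>Icon</title></svg></body>"
)

greedy = re.search(r"<title>(.*)</title>", sample_html, re.DOTALL).group(1)
non_greedy = re.search(r"<title>(.*?)</title>", sample_html, re.DOTALL).group(1)

print(repr(greedy))      # spans everything up to the last </title>
print(repr(non_greedy))  # stops at the first </title>
```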


<a name='3.11.1'></a><a id='3.11.1'></a>
## 3.11.1 Using an HTML Parser for Extraction
<a href="#top">[back to top]</a>

Reuters changed its content structure after the book was published. Unfortunately, they also *obfuscated* the content, so the methods in the book no longer work without massive changes.

In this notebook, we stick to the text in the book and download the articles from the Internet Archive, which still has the old HTML structure.

In [31]:
WA_PREFIX = "http://web.archive.org/web/20200118131624/"
html = s.get(WA_PREFIX + url).text

In [32]:
soup = BeautifulSoup(html, 'html.parser')
soup.select("h1.ArticleHeader_headline")

[<h1 class="ArticleHeader_headline">Banned in Boston: Without vaping, medical marijuana patients must adapt</h1>]

<a name='3.11.2'></a><a id='3.11.2'></a>
## 3.11.2 Blueprint: Extracting the Title/Headline
<a href="#top">[back to top]</a>

Beautiful Soup selects content with so-called selectors, which are passed as strings in the Python program. For common tags, attribute shortcuts such as `soup.h1` and `soup.title` are also available.

In [33]:
soup.h1

<h1 class="ArticleHeader_headline">Banned in Boston: Without vaping, medical marijuana patients must adapt</h1>

In [34]:
soup.h1.text

'Banned in Boston: Without vaping, medical marijuana patients must adapt'

In [35]:
soup.title.text

'\n                Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

In [36]:
soup.title.text.strip()

'Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

<a name='3.11.3'></a><a id='3.11.3'></a>
## 3.11.3 Blueprint: Extracting the Article Text
<a href="#top">[back to top]</a>

In [37]:
soup.select_one("div.StandardArticleBody_body").text

'BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah NouvelageThe 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. There are other ways to get the desired effect from  marijuana, and patients have filled dispensaries across the state in recent days to ask about edible or smokeable forms. But Medeiros has come to depend on her battery-powered pen, and wondered how she would cope without her usual supply of cartridges.  “In the midst of something where I’m on t

<a name='3.11.4'></a><a id='3.11.4'></a>
## 3.11.4 Blueprint: Extracting Image Captions
<a href="#top">[back to top]</a>

In [38]:
soup.select("div.StandardArticleBody_body figure")

[<figure class="Image_zoom" style="padding-bottom:"><div class="LazyImage_container LazyImage_dark" style="background-image:none"><img aria-label="FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage" src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20"/><div class="LazyImage_image LazyImage_fallback" style="background-image:url(//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20);background-position:center center;background-color:inherit"></div></div><div aria-label="Expand Image Slideshow" class="Image_expand-button" role="button" tabindex="0"><svg focusable="false" height="18px" version="1.1" viewbox="0 0 18 18" width="18px"

Variant

In [39]:
soup.select("div.StandardArticleBody_body figure img")

[<img aria-label="FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage" src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20"/>,
 <img src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991145&amp;r=LYNXMPEF9039M"/>]

In [40]:
soup.select("div.StandardArticleBody_body figcaption")

[<figcaption><div class="Image_caption"><span>FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage</span></div></figcaption>,
 <figcaption class="Slideshow_caption">Slideshow<span class="Slideshow_count"> (2 Images)</span></figcaption>]

<a name='3.11.5'></a><a id='3.11.5'></a>
## 3.11.5 Blueprint: Extracting the URL
<a href="#top">[back to top]</a>

In [41]:
soup.find("link", {'rel': 'canonical'})['href']

'http://web.archive.org/web/20200118131624/https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

In [42]:
soup.select_one("link[rel=canonical]")['href']

'http://web.archive.org/web/20200118131624/https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

<a name='3.11.6'></a><a id='3.11.6'></a>
## 3.11.6 Blueprint: Extracting List Information (Authors)
<a href="#top">[back to top]</a>

In [43]:
soup.find("meta", {'name': 'Author'})['content']

'Jacqueline Tempera'

Variant

In [44]:
sel = "div.BylineBar_first-container.ArticleHeader_byline-bar div.BylineBar_byline span"
soup.select(sel)

[<span><a href="/web/20200118131624/https://www.reuters.com/journalists/jacqueline-tempera" target="_blank">Jacqueline Tempera</a>, </span>,
 <span><a href="/web/20200118131624/https://www.reuters.com/journalists/jonathan-allen" target="_blank">Jonathan Allen</a></span>]

In [45]:
[a.text for a in soup.select(sel)]

['Jacqueline Tempera, ', 'Jonathan Allen']
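
The first name still ends with the separator `", "`. A small cleanup step removes it; the list below repeats the output shown above so that the snippet runs on its own.

```python
# Output of the previous cell, repeated here so the snippet is self-contained
raw_authors = ['Jacqueline Tempera, ', 'Jonathan Allen']

# Strip surrounding blanks and commas from every name
authors = [name.strip(" ,") for name in raw_authors]
print(authors)
```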

<a name='3.11.7'></a><a id='3.11.7'></a>
## 3.11.7 Blueprint: Extracting Text of Links (Section)
<a href="#top">[back to top]</a>

In [46]:
soup.select_one("div.ArticleHeader_channel a").text

'Health News'

<a name='3.11.8'></a><a id='3.11.8'></a>
## 3.11.8 Blueprint: Extracting Reading Time
<a href="#top">[back to top]</a>

In [47]:
soup.select_one("p.BylineBar_reading-time").text

'6 Min Read'
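
For analysis, the reading time is more useful as a number than as a label. A minimal sketch, assuming that the label always has the form "<number> Min Read" (the helper name `reading_minutes` is hypothetical):

```python
import re


def reading_minutes(label):
    """Extract the number of minutes from a label such as '6 Min Read'."""
    match = re.search(r"(\d+)\s*min", label, re.IGNORECASE)
    return int(match.group(1)) if match else None


minutes = reading_minutes('6 Min Read')
print(minutes)
```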

<a name='3.11.9'></a><a id='3.11.9'></a>
## 3.11.9 Blueprint: Extracting Attributes (ID)
<a href="#top">[back to top]</a>

In [48]:
soup.select_one("div.StandardArticle_inner-container")['id']

'USKBN1WG4KT'

Alternative: extract the ID from the URL.
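
The URL-based alternative can be sketched as follows. It assumes the old Reuters URL scheme, in which the path ends with `-id` followed by the article ID; the helper name `article_id_from_url` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def article_id_from_url(url):
    """Return the ID after the final '-id' in the URL path, or None."""
    path = urlsplit(url).path.rstrip("/")
    match = re.search(r"-id([A-Z0-9]+)$", path)
    return match.group(1) if match else None


article_id = article_id_from_url(
    "https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT"
)
print(article_id)
```

Because only the path is inspected, query parameters such as `?feedType=RSS` do not interfere, and URLs in the newer Reuters format (without an `-id` suffix) yield `None`.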

<a name='3.11.10'></a><a id='3.11.10'></a>
## 3.11.10 Blueprint: Extracting Attribution
<a href="#top">[back to top]</a>

In [49]:
soup.select_one("p.Attribution_content").text

'Reporting Jacqueline Tempera in Brookline and Boston, Massachusetts, and Jonathan Allen in New York; Editing by Frank McGurty and Bill Berkrot'

<a name='3.11.11'></a><a id='3.11.11'></a>
## 3.11.11 Blueprint: Extracting Timestamp
<a href="#top">[back to top]</a>

In [50]:
ptime = soup.find("meta", { 'property': "og:article:published_time"})['content']
print(ptime)

2019-10-01T19:23:16+0000


In [51]:
from dateutil import parser
parser.parse(ptime)

datetime.datetime(2019, 10, 1, 19, 23, 16, tzinfo=tzutc())

In [52]:
parser.parse(soup.find("meta", { 'property': "og:article:modified_time"})['content'])

datetime.datetime(2019, 10, 1, 19, 23, 16, tzinfo=tzutc())
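
When the timestamp format is known and fixed, as it is here, the standard library can parse it without `dateutil`. The sketch below assumes that the format `%Y-%m-%dT%H:%M:%S%z` applies to all articles, which is not guaranteed; `dateutil` is more forgiving with varying formats.

```python
from datetime import datetime, timezone

ptime_example = "2019-10-01T19:23:16+0000"  # value printed above

# %z parses the UTC offset and yields a timezone-aware datetime
published = datetime.strptime(ptime_example, "%Y-%m-%dT%H:%M:%S%z")
print(published.isoformat())
```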

---
<a name='3.12'></a><a id='3.12'></a>
# 3.12 Blueprint: Spidering
<a href="#top">[back to top]</a>

These are typical steps in downloading part of an archive, called *spidering*.

1. Define how many pages of the archive should be downloaded.
2. Download each page of the archive into a file called page-000001.html, page-000002.html, and so on for easier inspection. Skip this step if the file is already present.
3. For each page-*.html file, extract the URLs of the referenced articles.
4. For each article URL, download the article into a local HTML file. Skip this step if the article file is already present.
5. For each article file, extract the content into a dict and combine these dicts into a Pandas DataFrame.

Here, each step can be run individually, and downloads have to be performed only once.

**Note:**
* Web scraping depends on pages that we do not control, so it can be messy and error-prone.
* It therefore deserves more error handling than usual.

<a name='3.12.1'></a><a id='3.12.1'></a>
## 3.12.1 Introducing the Use Case
<a href="#top">[back to top]</a>

In [53]:
def download_archive_page(page):
    filename = f"{site_data}/page-{page:06d}.html"
    print(f"filename: {filename}")
    if not os.path.isfile(filename):
        url = f"https://www.reuters.com/news/archive/?view=page&page={page}&pageSize=10"
        r = requests.get(url)
        with open(filename, "w+") as f:
            f.write(r.text)

def parse_archive_page(page_file):
    with open(page_file, "r") as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = ["https://www.reuters.com" + a['href'] for a in soup.select("article.story div.story-content a")]
    return hrefs

def download_article(url):
    # Check for the file in the download directory so that existing articles are skipped
    filename = f"{site_data}/" + url.split("/")[-1] + ".html"
    if not os.path.isfile(filename):
        r = requests.get(url)
        with open(filename, "w+") as f:
            f.write(r.text)


In [54]:
# Download pages 1 to 9 of the archive (range(1, 10) excludes 10).
for p in range(1, 10):
    download_archive_page(p)
    
print("Done")

filename: site_data/page-000001.html
filename: site_data/page-000002.html
filename: site_data/page-000003.html
filename: site_data/page-000004.html
filename: site_data/page-000005.html
filename: site_data/page-000006.html
filename: site_data/page-000007.html
filename: site_data/page-000008.html
filename: site_data/page-000009.html
Done


In [55]:
# Parse archive and add to article_urls.
article_urls = []
for page_file in glob.glob("site_data/page-*.html"):
    print(f"page_file: {page_file}")
    article_urls += parse_archive_page(page_file)
    
print("Done")

page_file: site_data/page-000005.html
page_file: site_data/page-000009.html
page_file: site_data/page-000008.html
page_file: site_data/page-000004.html
page_file: site_data/page-000003.html
page_file: site_data/page-000002.html
page_file: site_data/page-000001.html
page_file: site_data/page-000007.html
page_file: site_data/page-000006.html
Done


In [56]:
# Download articles.
for url in tqdm(article_urls):
    download_article(url)
    
print("Done")

0it [00:00, ?it/s]

Done


**Note:**

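The `DataFrame.append()` method was removed in pandas 2.0. The code below therefore collects the parsed articles in a list and builds the DataFrame in a single step; `pd.concat()` is the replacement when existing DataFrames have to be combined.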


**Note:**

The cell below fails because `soup.find("link", {'rel': 'canonical'})` returns `None`: the downloaded HTML contains no `link` tag whose `rel` attribute is set to `canonical`, and subscripting `None` raises the error shown in the output.
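
One way to avoid this kind of failure is to look up attributes through a small helper that tolerates a missing tag. The sketch below is an assumption for illustration and is not used in the cell that follows; it relies only on the `get()` method, which Beautiful Soup tags share with dictionaries, so a plain `dict` serves as a stand-in for a tag here.

```python
def attr_or_default(tag, attr, default=None):
    """Return tag[attr], or `default` if the tag or the attribute is missing."""
    if tag is None:
        return default
    return tag.get(attr, default)


# A dict stands in for a Beautiful Soup tag in this offline demonstration
found = attr_or_default({"rel": "canonical", "href": "https://example.com/a"}, "href")
missing = attr_or_default(None, "href")
print(found, missing)
```

With a real soup object, the call would read `attr_or_default(soup.find("link", {'rel': 'canonical'}), "href")`, and the caller decides what to do with a `None` result.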


In [66]:
def parse_article(article_file):
    def find_obfuscated_class(soup, klass):
        return soup.find_all(lambda tag: tag.has_attr("class") and (klass in " ".join(tag["class"])))
    with open(article_file, "r") as f:
        html = f.read()
    r = {}
    soup = BeautifulSoup(html, 'html.parser')
    r['url'] = soup.find("link", {'rel': 'canonical'})['href']
    r['id'] = r['url'].split("-")[-1]
    r['headline'] = soup.h1.text    
    r['text'] = "\n".join([t.text for t in find_obfuscated_class(soup, "Paragraph-paragraph")])
    
    r['time'] = soup.find("meta", { 'property': "og:article:published_time"})['content']
    return r


try:
    # Initialize an empty list to store the parsed articles
    parsed_articles = []
    
    # Parse articles and append to the list
    for article_file in tqdm(glob.glob(f"{site_data}/*-id???????????.html")):
        parsed_articles.append(parse_article(article_file))
    
    # Create the DataFrame from the list of parsed articles
    df = pd.DataFrame(parsed_articles)
    
    # Convert the 'time' column to datetime
    df['time'] = pd.to_datetime(df['time'])

except Exception as e:
    print(e)

  0%|          | 0/1 [00:00<?, ?it/s]

'NoneType' object is not subscriptable


In [67]:
df.head()

In [68]:
df.sort_values("time").head(5)

KeyError: 'time'

In [69]:
(
    df[df["time"]>= '2020-01-01']
        .set_index("time")
        .resample('D')
        .agg({'id': 'count'})
        .plot.bar()
)

plt.show()

KeyError: 'time'

<a name='3.12.2'></a><a id='3.12.2'></a>
## 3.12.2 Error Handling and Production-Quality Software
<a href="#top">[back to top]</a>

For production software, you should use exception handling, especially since pages can change at any time. 
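
As a minimal sketch of what such exception handling could look like for downloads, the helper below retries a failing request a few times before giving up. The function name, the retry count, and the linear back-off are assumptions for this example. `fetch` is any callable that takes a URL and returns the content, for example `lambda u: requests.get(u, timeout=10).text`; the demonstration uses a fake fetcher so that it runs offline.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), retrying on errors with a linearly growing pause."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as err:  # real code should catch specific exceptions
            last_error = err
            if attempt < retries:
                time.sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Fake fetcher that fails twice and then succeeds
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "https://example.com/article", delay=0)
print(result, calls["count"])
```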

---
<a name='3.13'></a><a id='3.13'></a>
# 3.13 Density-Based Text Extraction
<a href="#top">[back to top]</a>

Extracting structured data from HTML is not complicated, but it can be tedious.

<a name='3.13.1'></a><a id='3.13.1'></a>
## 3.13.1 Extracting Reuters Content with Readability
<a href="#top">[back to top]</a>

Instead of selecting elements by hand, we can examine the information density of pages. Tools such as `readability-lxml` contain algorithms that measure the density of information and thereby automatically eliminate repeated content such as headers, navigation, and footers.

In [70]:
doc = Document(html)
doc.title()

'Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

In [71]:
doc.short_title()

'Banned in Boston: Without vaping, medical marijuana patients must adapt'

In [72]:
doc.summary()

'<html><body><div><div class="StandardArticleBody_body"><p>BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 </p><div class="PrimaryAsset_container"><div class="Image_container" tabindex="-1"><figure class="Image_zoom"></figure><figcaption><p class="Image_caption"><span>FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage</span></p></figcaption></div></div><p>The 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. </p><p>There are other ways to get the desired effect from  marijuana, and patients have 

In [73]:
doc.url

Extract the body part via Beautiful Soup.

In [74]:
density_soup = BeautifulSoup(doc.summary(), 'html.parser')
density_soup.body.text

'BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah NouvelageThe 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. There are other ways to get the desired effect from  marijuana, and patients have filled dispensaries across the state in recent days to ask about edible or smokeable forms. But Medeiros has come to depend on her battery-powered pen, and wondered how she would cope without her usual supply of cartridges.  “In the midst of something where I’m on t

---
<a name='3.13.2'></a><a id='3.13.2'></a>
## 3.13.2 Summary Density-Based Text Extraction
<a href="#top">[back to top]</a>

Density-based text extraction is powerful because it combines heuristics with statistical information about how content is distributed on an HTML page.
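
To make the idea concrete, here is a deliberately simplified sketch of a density heuristic. It is *not* the algorithm used by `readability-lxml`; the block tags, the thresholds, and all names are assumptions for illustration. A block is kept if it contains enough text and only a small share of that text sits inside links, which filters out typical navigation and footer elements.

```python
from html.parser import HTMLParser


class BlockDensityParser(HTMLParser):
    """Collect the text and the amount of link text of each <p> and <li> block."""

    BLOCK_TAGS = {"p", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict per finished block
        self._current = None  # block that is currently open, if any
        self._in_link = 0     # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._current = {"text": "", "link_chars": 0}
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._current is not None:
            self.blocks.append(self._current)
            self._current = None
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data
            if self._in_link:
                self._current["link_chars"] += len(data)


def dense_blocks(html, min_chars=40, max_link_ratio=0.3):
    """Return blocks with enough text and a low share of link text."""
    parser = BlockDensityParser()
    parser.feed(html)
    kept = []
    for block in parser.blocks:
        text = " ".join(block["text"].split())  # normalize whitespace
        if len(text) < min_chars:
            continue  # too short: probably a menu entry or a label
        if block["link_chars"] / len(block["text"]) > max_link_ratio:
            continue  # mostly link text: probably navigation or footer
        kept.append(text)
    return kept


# Made-up page: navigation list, one article paragraph, one footer link
page = """
<ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li></ul>
<p>BOSTON (Reuters) - In the first few days of the ban on all vaping products, patients began to worry.</p>
<p><a href="/terms">Terms of use and further legal information for readers</a></p>
"""
content = dense_blocks(page)
print(content)
```

Real implementations score whole subtrees, take class names and punctuation into account, and pick the best candidate container, but the underlying principle is the same.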

---
<a name='3.14'></a><a id='3.14'></a>
# 3.14 All-in-One Approach
<a href="#top">[back to top]</a>

Scrapy is another Python package that offers an all-in-one approach to spidering and content extraction. The methods are similar to the ones described in the earlier sections, although Scrapy is better suited to downloading whole websites than only parts of them.

---
<a name='3.15'></a><a id='3.15'></a>
# 3.15 Blueprint: Scraping the Reuters Archive with Scrapy
<a href="#top">[back to top]</a>

Unfortunately, the `scrapy` code cannot easily be adapted to the changed Reuters pages, which is one more argument for using *up-to-date*, separate libraries. This version still collects the titles of the articles, but nothing more.

In [75]:
class ReutersArchiveSpider(scrapy.Spider):
    name = 'reuters-archive'
    
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        # FEEDS replaces the deprecated FEED_FORMAT and FEED_URI settings
        'FEEDS': {'reuters-archive.json': {'format': 'json'}},
    }
    
    start_urls = [
        'https://www.reuters.com/news/archive/',
    ]

    def parse(self, response):
        for article in response.css("article.story div.story-content a"):
            yield response.follow(article.css("a::attr(href)").extract_first(), self.parse_article)

        next_page_url = response.css('a.control-nav-next::attr(href)').extract_first()
        # Use `and` (short-circuit) so that a missing next-page link does not raise a TypeError
        if (next_page_url is not None) and ('page=2' not in next_page_url):
            yield response.follow(next_page_url, self.parse)

    def parse_article(self, response):
        yield {
          'title': response.css('h1::text').extract_first().strip(),
        }

Scrapy is optimized for command-line usage, but it can also be invoked in a Jupyter notebook. Because Scrapy runs on the (ancient) Twisted framework, whose reactor cannot be restarted, you have only one shot if you try it in the notebook; to run the crawl again, you have to restart the notebook kernel.

In [76]:
# This can be run only once from a Jupyter notebook due to Twisted
process = CrawlerProcess()

process.crawl(ReutersArchiveSpider)
process.start()

2024-11-02 15:41:40 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-11-02 15:41:40 [scrapy.utils.log] INFO: Versions: lxml 5.1.1.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.10.0, Python 3.11.5 (main, Jan 16 2024, 17:25:53) [Clang 14.0.0 (clang-1400.0.29.202)], pyOpenSSL 24.2.1 (OpenSSL 3.3.1 4 Jun 2024), cryptography 43.0.0, Platform macOS-12.7.6-x86_64-i386-64bit


In [77]:
glob.glob("*.json")

['reuters-archive.json']

In [78]:
!cat 'reuters-archive.json'

[

]

---
<a name='3.16'></a><a id='3.16'></a>
# 3.16 Possible Problems with Scraping
<a href="#top">[back to top]</a>

For complicated cases, it can be useful to "remote control" a browser, possibly a headless one, by using Selenium, a framework that was originally conceived for the automated testing of web applications.