# Web Scraping with BeautifulSoup

This web scraper collects roughly 300 of the latest news articles, along with their metadata, from the website "The Hacker News" (thehackernews.com), which covers topics such as cyber attacks, computer security, and entrepreneurship.

Python libraries:
- *requests* : gets webpages from the internet;
- *BeautifulSoup* : converts a webpage from HTML to a searchable object;
- *pandas* : provides a tabular data structure which will contain scraped data;
- *re* : provides regular expressions for matching and substituting substrings.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

## Loading website home page

In [2]:
url = "https://thehackernews.com/"
response = requests.get(url)
response

<Response [200]>
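
A `200` status means the request succeeded, but a long-running scraper will not have anyone inspecting each response by eye. The following is a minimal sketch, not part of the original notebook, of a helper that sets a timeout and raises an exception on HTTP errors. The name `fetch_html`, the optional `session` argument, and the 10-second default are assumptions made for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `fetch_html` is a hypothetical helper; the 10-second timeout is an
    arbitrary choice, not a value taken from the notebook.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://thehackernews.com/")` would then return the same text as `response.text` above, or fail loudly if the site is unreachable or returns an error page.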

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify()[:250] + "\n...")

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="#395697" name="theme-color"/>
  <link as="style" href="/css/roboto.css" rel="preload"/>
  <link href="/css/roboto.css" media="print" onload="this.media='all
...


## Gathering article URLs

The URL of each article is found in the body of the home page, as follows:

In [4]:
article_url_list = []
article_url_list.extend([
    article.find('a', class_='story-link')['href'] for article in
    soup.find_all('div', class_='body-post clear')
])
print('Found {} articles:'.format(len(article_url_list)))
article_url_list

Found 8 articles:


['https://thehackernews.com/2020/11/become-white-hat-hacker-get-10-top.html',
 'https://thehackernews.com/2020/11/interpol-arrest-3-nigerian-bec-scammers.html',
 'https://thehackernews.com/2020/11/2-factor-authentication-bypass-flaw.html',
 'https://thehackernews.com/2020/11/baidus-android-apps-caught-collecting.html',
 'https://thehackernews.com/2020/11/stantinko-botnet-now-targeting-linux.html',
 'https://thehackernews.com/2020/11/critical-unpatched-vmware-flaw-affects.html',
 'https://thehackernews.com/2020/11/why-replace-traditional-web-application.html',
 'https://thehackernews.com/2020/11/facebook-messenger-bug-lets-hackers.html']
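
The list comprehension above assumes that every `body-post clear` div contains a `story-link` anchor; if one does not, `find` returns `None` and the subscript raises a `TypeError`. Below is a small sketch of a more defensive version, run on a made-up HTML snippet so that it works offline. The function name `extract_article_urls` and the sample markup are illustrative assumptions; only the two class names come from the notebook.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure used by the notebook's selectors.
SAMPLE_HTML = """
<div class="body-post clear"><a class="story-link" href="https://example.com/a.html">A</a></div>
<div class="body-post clear"><span>No link in this post</span></div>
<div class="body-post clear"><a class="story-link" href="https://example.com/b.html">B</a></div>
"""


def extract_article_urls(html):
    """Return the href of every story link, skipping posts without one."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for post in soup.find_all("div", class_="body-post clear"):
        link = post.find("a", class_="story-link")
        if link is not None and link.has_attr("href"):
            urls.append(link["href"])
    return urls


print(extract_article_urls(SAMPLE_HTML))
# ['https://example.com/a.html', 'https://example.com/b.html']
```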

Because of the way this website is structured, the home page links to only a small number of articles (8 here), which is far short of the roughly 300 we want.

Additional articles can be found by clicking a button near the bottom of the home page. The following cells show the programme cycling through successive pages, grabbing article URLs, and appending them to the URL list.

In [5]:
next_page_url = soup.find('a', class_='blog-pager-older-link-mobile')['href']
next_page_url

'https://thehackernews.com/search?updated-max=2020-11-20T00:31:00-08:00&max-results=8'
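
On the last available page there may be no "older posts" link at all, in which case subscripting the result of `find` would fail. As a hedged sketch, the lookup can be wrapped in a helper that returns `None` when the link is missing. The helper name `find_next_page_url` and the sample markup are assumptions; the class name is the one used in the cell above.

```python
from bs4 import BeautifulSoup


def find_next_page_url(html):
    """Return the URL of the next (older) page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="blog-pager-older-link-mobile")
    if link is None:
        return None
    return link.get("href")


# Made-up markup for illustration.
page_with_link = '<a class="blog-pager-older-link-mobile" href="https://example.com/page2">Next</a>'
print(find_next_page_url(page_with_link))        # https://example.com/page2
print(find_next_page_url("<p>last page</p>"))    # None
```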

In [6]:
while len(article_url_list) < 300:
    response = requests.get(next_page_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    article_url_list.extend([
        article.find('a', class_='story-link')['href'] for article in
        soup.find_all('div', class_='body-post clear')
    ])
    next_page_url = soup.find('a', class_='blog-pager-older-link-mobile')['href']

In [7]:
print("Found {} article URLs.".format(len(article_url_list)))

Found 304 article URLs.
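
The loop in cell 6 stops only when enough URLs have been collected, so it would fail if the site ran out of pages first, and it sends requests back to back. The sketch below shows one possible bounded variant with de-duplication, a page cap, and a pause between requests. Everything here is an assumption for illustration: `collect_article_urls`, its parameters, and the three callables `fetch`, `parse_links`, and `parse_next`, which stand in for downloading a page, extracting article links, and finding the next-page link.

```python
import time


def collect_article_urls(start_url, fetch, parse_links, parse_next,
                         target=300, max_pages=100, delay=1.0):
    """Follow next-page links until `target` unique URLs are collected.

    fetch(url) -> html, parse_links(html) -> list of URLs, and
    parse_next(html) -> next page URL or None are supplied by the caller.
    """
    urls, seen = [], set()
    page_url, pages = start_url, 0
    while page_url and len(urls) < target and pages < max_pages:
        html = fetch(page_url)
        for url in parse_links(html):
            if url not in seen:          # skip links already collected
                seen.add(url)
                urls.append(url)
        page_url = parse_next(html)
        pages += 1
        if page_url and len(urls) < target:
            time.sleep(delay)            # pause before the next request
    return urls


# Offline demonstration with three fake pages instead of real downloads.
fake_pages = {
    "p1": (["a", "b"], "p2"),
    "p2": (["b", "c"], "p3"),
    "p3": (["d"], None),
}
found = collect_article_urls(
    "p1",
    fetch=lambda url: url,
    parse_links=lambda html: fake_pages[html][0],
    parse_next=lambda html: fake_pages[html][1],
    delay=0,
)
print(found)  # ['a', 'b', 'c', 'd']
```

Like the original loop, this adds whole pages at a time, so the final count can slightly exceed the target.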


## Scraping article data and metadata

For each URL, the corresponding article page is loaded, and the title, publication and modification dates, author, and body text are scraped. The results are collected in a *pandas DataFrame*.

In [8]:
# Collect one dict per article; DataFrame.append was removed in pandas 2.0.
rows = []
for url in article_url_list:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    title = soup.find('h1', class_='story-title').text
    date_published = soup.find('meta', {'itemprop': 'datePublished'})['content']
    date_modified = soup.find('meta', {'itemprop': 'dateModified'})['content']
    author = soup.find('a', {'rel': 'author'}).text

    body_container = soup.find('div', class_='articlebody clear cf')
    # In some articles, the body lives inside another div element.
    if body_container.find('div', {'dir': 'ltr'}):
        body_container = body_container.find('div', {'dir': 'ltr'})
    # We need to throw away any divs because they contain junk such as adverts.
    for div in body_container.find_all('div'):
        div.decompose()
    # The article text is left, either in p elements or between line breaks (br)
    body = re.sub(pattern=r'\n+', repl='\n', string=body_container.text)

    rows.append({
        'author': author,
        'body': body.strip(),
        'date_modified': date_modified,
        'date_published': date_published,
        'title': title,
        'url': url,
    })

# Build the DataFrame once, after the loop.
df = pd.DataFrame(rows)
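
To make the body-cleaning step easier to follow, here is a small offline sketch that applies the same idea to a made-up article fragment: find the container, step into the inner `dir="ltr"` div if present, delete nested divs, and collapse repeated newlines. The function name `extract_body` and the sample HTML are assumptions for illustration; unlike the cell above, this version returns `None` when the container is missing.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment: two paragraphs with an advert div between them.
SAMPLE_ARTICLE = (
    '<div class="articlebody clear cf"><div dir="ltr">'
    '<p>First paragraph.</p>\n\n'
    '<div class="ad">Buy now!</div>\n\n'
    '<p>Second paragraph.</p>'
    '</div></div>'
)


def extract_body(html):
    """Return the cleaned article text, or None if no body container exists."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="articlebody clear cf")
    if container is None:
        return None
    inner = container.find("div", {"dir": "ltr"})
    if inner is not None:
        container = inner
    # Remove nested divs, which hold adverts and other non-article content.
    for div in container.find_all("div"):
        div.decompose()
    # Collapse runs of newlines into single newlines.
    return re.sub(r"\n+", "\n", container.get_text()).strip()


print(repr(extract_body(SAMPLE_ARTICLE)))
# 'First paragraph.\nSecond paragraph.'
```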

## Viewing the first 5 rows of scraped data

In [9]:
df.head()

|   | author | body | date_modified | date_published | title | url |
|---|--------|------|---------------|----------------|-------|-----|
| 0 | The Hacker News | Many of us here would love to turn hacking int... | 2020-11-26T17:43:03Z | 2020-11-25T22:53:00-08:00 | Become a White Hat Hacker — Get 10 Top-Rated C... | https://thehackernews.com/2020/11/become-white... |
| 1 | Ravie Lakshmanan | Three Nigerian citizens suspected of being mem... | 2020-11-26T06:22:23Z | 2020-11-25T22:17:00-08:00 | Interpol Arrests 3 Nigerian BEC Scammers For T... | https://thehackernews.com/2020/11/interpol-arr... |
| 2 | Ravie Lakshmanan | cPanel, a provider of popular administrative t... | 2020-11-25T07:14:18Z | 2020-11-24T23:14:00-08:00 | 2-Factor Authentication Bypass Flaw Reported i... | https://thehackernews.com/2020/11/2-factor-aut... |
| 3 | Ravie Lakshmanan | Two popular Android apps from Chinese tech gia... | 2020-11-26T06:57:12Z | 2020-11-24T22:36:00-08:00 | China's Baidu Android Apps Caught Collecting S... | https://thehackernews.com/2020/11/baidus-andro... |
| 4 | Ravie Lakshmanan | An adware and coin-miner botnet targeting Russ... | 2020-11-24T14:56:39Z | 2020-11-24T06:56:00-08:00 | Stantinko Botnet Now Targeting Linux Servers t... | https://thehackernews.com/2020/11/stantinko-bo... |
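
Note that the two date columns are stored as strings and use different offsets: `date_published` is in a `-08:00` zone, while `date_modified` is in UTC (`Z`). If the dates are to be compared or sorted, one option is to parse both into UTC timestamps. The sketch below does this on a two-row stand-in frame built from the values shown above; the variable name `dates` and the `time_to_last_edit` column are assumptions added for illustration.

```python
import pandas as pd

# Two rows copied from the output above, used as a small stand-in for df.
dates = pd.DataFrame({
    "date_published": ["2020-11-25T22:53:00-08:00", "2020-11-25T22:17:00-08:00"],
    "date_modified": ["2020-11-26T17:43:03Z", "2020-11-26T06:22:23Z"],
})

# utc=True converts every value to UTC, whatever its original offset.
for column in ["date_published", "date_modified"]:
    dates[column] = pd.to_datetime(dates[column], utc=True)

# With both columns in UTC, differences between them are meaningful.
dates["time_to_last_edit"] = dates["date_modified"] - dates["date_published"]
print(dates)
```

After conversion, the first `date_published` value becomes `2020-11-26 06:53:00+00:00`, which makes it directly comparable with `date_modified`.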


## Finally, saving the scraped data to disk

In [10]:
df.to_csv('hacker_news_articles.csv', index=False, encoding='utf-8-sig')
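
The `utf-8-sig` encoding writes a byte order mark at the start of the file, which helps some spreadsheet programs detect UTF-8. When reading the file back with pandas, passing the same encoding is a safe choice. The following round-trip sketch uses a made-up one-row frame and a temporary directory rather than the real `hacker_news_articles.csv`; the sample values are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Made-up row; the body contains a newline, as scraped article text does.
sample = pd.DataFrame({
    "title": ["Example title"],
    "body": ["Line one.\nLine two."],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_articles.csv")
    sample.to_csv(path, index=False, encoding="utf-8-sig")
    # Read with the same encoding so the byte order mark is stripped again.
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(list(restored.columns))  # ['title', 'body']
```

Multi-line text survives the round trip because pandas quotes fields that contain newlines.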