Web Scraping

Web scraping, web harvesting, or web data extraction is the automated process of quickly and efficiently extracting data from websites. It involves fetching a web page and extracting the required information from it.

Motivation

  • Websites have a lot of data. If extracted properly, this data can be very useful in the realm of Machine Learning.
  • Search engine engineering relies a lot on Web Scraping and other forms of Information Retrieval.
  • It can be used to automate mundane search tasks.
  • Learning how to web scrape involves exploring many related domains, such as Cybersecurity, Web Development, Web Services, & Natural Language Processing.

Setup

We will be using the following Python libraries:

  • requests - HTTP Requests for Humans

    Lets us navigate to and fetch resources on the web.

  • beautifulsoup4 - Scraping Library

    Lets us parse a fetched resource and extract specific information from it.

pip3 install requests beautifulsoup4 pandas	# install the libraries
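
To verify the installation, you can import the packages and print their versions (an optional sanity check; the exact versions will vary)

import requests, bs4, pandas
print(requests.__version__, bs4.__version__, pandas.__version__)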

Create the following project layout:

web-scraping
└── main.py

Scraper

We will build a crawler that surfs across Wikipedia and finds the shortest directed path between two articles.

Import the dependencies

import re
import os
import requests
import argparse
import pandas as pd
from bs4 import BeautifulSoup
from collections import deque

Using the get function in the requests module, you can access a webpage

response = requests.get('https://en.wikipedia.org/wiki/Batman')	# pass the webpage url as an argument to the function
print(response.status_code)	# 200 OK response if the webpage is present
print(response.headers)		# contains the date, size, information about the server and type of file the server is sending to the client
print(response.content)		# page content or the html source

Status Codes

There are five classes of response status codes:

  • 1xx informational - the request has been received and processing is continuing
  • 2xx successful - the request was received, understood, and accepted
  • 3xx redirection - the client must take further action to complete the request
  • 4xx client error - the request contains bad syntax or refers to a resource that cannot be accessed
  • 5xx server error - the server is incapable of fulfilling an apparently valid request

For more information, you can refer to the Wikipedia page on the List of HTTP status codes.
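
In practice, you can branch on the status code yourself or let requests raise an exception for error responses; a minimal sketch using the page fetched above

response = requests.get('https://en.wikipedia.org/wiki/Batman')
if response.status_code == 200:		# 2xx - the page was fetched successfully
    print('OK')
response.raise_for_status()			# raises requests.HTTPError for 4xx and 5xx responses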

To parse the content or the page source, we will use the BeautifulSoup module. To do so, first create a BeautifulSoup object

soup = BeautifulSoup(response.content, 'html.parser')

We can now access specific information directly. To get the first link on the page

link = soup.a

To get all the links on the page

links = soup.find_all('a')	# anchor tag contains a link

Let's work with the last link on the page

link = links[-1]	# get the last anchor tag

The link is a Tag object corresponding to the anchor tag. Each Tag object contains a few properties

print(link.name)	# name of the tag
print(link.attrs)	# attributes of the tag

The anchor tag or <a> has an href attribute which contains the actual link to the page

print(link.attrs['href'])	# accessing directly using attrs
print(link['href'])			# accessing by treating like a dict

Trying to access an attribute that does not exist raises a KeyError. A safer way is to use the get method

print(link['id'])		# error
print(link.get('id'))	# returns value if exists else None and does not throw an error

To find only those anchor tags that contain the href attribute

links = soup.find_all('a', href=True)	# anchor tag contains a href attribute

You can also specify a regex to match only the links that fit a requirement. In our case, we only want to keep links to other Wikipedia articles and exclude the links that point to documentation (such as the Help or About pages), media files, or external websites. If you look closely, the links follow a specific format.

Links we need since they point to other Wikipedia pages

/wiki/DC_Thomson
/wiki/Chris_KL-99
/wiki/Wing_(DC_Comics)

Links we don't need since they contain documentation, media or point to other websites

/wiki/Category:American_culture
/wiki/File:Batman_DC_Comics.png
www.wikimediafoundation.org

Hence, we can use the following regex ^/wiki/[^.:#]*$. This is explained below:

  • ^ denotes the start of the string
  • ^/wiki/ the link starts with /wiki/
  • [] denotes a set of allowed characters; a ^ at the start of the set, as in [^], negates it so that it matches any character not in the set
  • * after a character or set denotes 0 or more occurrences, so [^c]* matches 0 or more characters that are not c
  • $ denotes the end of the string

The above regex therefore matches links that start with /wiki/ and are followed by zero or more characters, none of which is a ., : or # symbol, up to the end of the string.

links = soup.find_all('a', href=re.compile('^/wiki/[^.:#]*$'))	# find all anchor tags that contain the href attribute with the specified regex

Special characters

  • . - indicates a url for a media file. For example, '/wiki/url/to/file/batman.jpg'.
  • : - indicates a url for a Wikipedia meta page. For example, '/wiki/Help:Contents'.
  • # - indicates a url for a page with an anchor attached, we chose to not include these.
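
As a quick sanity check, you can match the pattern against the example links above; the results follow directly from the rules listed

pattern = re.compile('^/wiki/[^.:#]*$')
print(bool(pattern.match('/wiki/DC_Thomson')))					# True - a regular article link
print(bool(pattern.match('/wiki/File:Batman_DC_Comics.png')))	# False - contains ':' and '.'
print(bool(pattern.match('/wiki/Help:Contents')))				# False - ':' marks a meta page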

Don't forget, we still have to extract the article title from each href, since the above returns the anchor tags themselves

pages = set([link.get('href')[len("/wiki/"):] for link in links])

Code Modularity

Let's add a helper function to return the entire url given a title

def wiki(title):
    """Takes a title and wraps it to form a https://en.wikipedia.org URL

    Arguments:
        title {str} -- Title of Wikipedia Article

    Returns:
        {str} -- URL on wikipedia
    """
    return f"https://en.wikipedia.org/wiki/{title}"

Let us organize our code into a function that retrieves a page and returns a set of titles of all the articles that appear as links on the page

def get_pages(title):
	"""Returns a set of wikipedia articles linked in a wikipedia article

	Arguments:
		title {str} -- Article title

	Returns:
		{set()} -- A set of wikipedia articles
	"""
	response = requests.get(wiki(title))
	soup = BeautifulSoup(response.content, 'html.parser')
	links = soup.find_all('a', href=re.compile('^/wiki/[^.:#]*$'))
	pages = set([link.get('href')[len("/wiki/"):] for link in links])

	return pages
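
As a quick test, you can call the function directly; the exact contents depend on the live article, so treat the checks below as indicative only

pages = get_pages('Batman')
print(len(pages))				# number of distinct article titles linked from the page
print('Gotham_City' in pages)	# True if the article currently links to Gotham City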

Celebrity Details

In this section, we will learn how to extract information about a celebrity from the infobox on their Wikipedia page. The snippets below use Bill Gates's article as an example.

We will extract the following parameters

header = ['Name', 'Nick Name', 'Born', 'Birth Place', 'Nationality', 'Residence', 'Occupation', 'Parent(s)', 'Website']
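
Before extracting the fields, fetch the celebrity's article and build a fresh soup object. A minimal sketch, assuming the Bill Gates article (which matches the infobox snippets shown below):

response = requests.get(wiki('Bill_Gates'))		# assumed example article
soup = BeautifulSoup(response.content, 'html.parser')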

First we extract the table

body = soup.find('table', class_='infobox').tbody

Most of the parameters can be extracted using the class attributes since they are unique

name = body.find(class_='fn').text
nickname = body.find(class_='nickname').text
born = body.find(class_='bday').text
birthplace = body.find(class_='birthplace').text
residence = body.find(class_='label').text
occupation = body.find(class_='role').li.text
url = body.find(class_='url').text

However, the parent names are not under a unique class attribute

<tr>
	<th scope="row">Parent(s)</th>
	<td>
		<div class="plainlist">
			<ul>
				<li>
					<a class="mw-redirect" href="/wiki/William_Henry_Gates_Sr."
						title="William Henry Gates Sr.">William Henry Gates Sr.</a>
				</li>
				<li>
					<a href="/wiki/Mary_Maxwell_Gates" title="Mary Maxwell Gates">Mary Maxwell Gates</a>
				</li>
			</ul>
		</div>
	</td>
</tr>

We can still extract these by searching for the th tag whose string is Parent(s) and then moving up to its parent tag, which lets us access the li tags

parent = body.find('th', string='Parent(s)').parent
father, mother = [li.text for li in parent.find_all('li')]

You can now write them into a csv file if you prefer. For this, you will need to import the necessary modules first

import os
import pandas as pd

Create a pandas DataFrame and write it into a csv file

csv = 'path/to/file.csv'
row = [name, nickname, born, birthplace, residence, occupation, father, mother, url]	# row of column values
df = pd.DataFrame(row).T	# create a pandas DataFrame

if not os.path.isfile(csv):	# create a new file if it does not exist
	header = ['Name', 'Nickname', 'Born', 'Birthplace', 'Residence', 'Occupation', 'Father', 'Mother', 'Website']	# column names
	df.to_csv(csv, header=header, index=False)
else:						# append to existing csv file
	df.to_csv(csv, mode='a', header=False, index=False)
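
To confirm that the row was written, you can read the file back with pandas

print(pd.read_csv(csv))		# displays the accumulated rows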

Crawler

The crawler uses a Breadth-First Search (BFS) traversal to crawl across the site; since BFS explores pages level by level, the first time it reaches the end page it has found a shortest path.

def shortest_path(start, end):
	"""
	Finds the shortest path in Wikipedia from start page to end page

	Arguments:
		start {str} -- start page in /wiki/name format
		end {str} -- end page in /wiki/name format
	"""
	i = 1
	seen = set()
	d = deque([start])
	tree = {start: None}
	level = {start: 1}

	while d:
		# Get element in front
		topic = d.pop()
		seen.add(topic)
		print(f'{i}) Parsed: {topic}, Deque: {len(d)}, Seen: {len(seen)}, Level: {level[topic]}')

		urls = get_pages(topic)
		urls -= seen

		# Update structures with new urls
		seen |= urls
		d.extendleft(urls)
		for child in urls:
			tree[child] = topic
			level[child] = level[topic] + 1

		# Check if page found
		if end in urls:
			topic = end
			break
		i += 1

	# Get path from start to end
	path = []
	while topic in tree:
		path.append(topic)
		topic = tree[topic]
	print(' \u2192 '.join(reversed(path)))
	print(f'Length: {len(path)-1}')
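
You can also call the function directly before wiring up a command-line interface; the article pair below is only an illustrative choice, and the crawl may take a while depending on how far apart the pages are

shortest_path('Web_scraping', 'Batman')		# illustrative titles; runtime depends on Wikipedia's link graph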

Interface

Let us create an interface for our functions

def main():
    """Command line interface for the program
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', '-s', help='Exact name of start page', required=True)
    parser.add_argument('--end', '-e', help='Exact name of end page', required=True)
    args = parser.parse_args()

    start = '_'.join(args.start.split())
    end = '_'.join(args.end.split())
    shortest_path(start, end)

if __name__ == "__main__":
    main()

You can find the complete source code here

Usage

To learn how to use the command

python main.py -h
usage: main.py [-h] --start START --end END

optional arguments:
  -h, --help            show this help message and exit
  --start START, -s START
                        Exact name of start page
  --end END, -e END     Exact name of end page

Try the program

python main.py -s Web_Scraping -e Hell

Summary

We covered: