# Getting Started with Python Web Scraping

As data scientists, we rely on the availability of data, and the internet is an excellent place to obtain it. Large platforms like Facebook or Twitter usually provide an Application Programming Interface (API), which often allows only limited access to a web site's data. Every so often, this data proves insufficient or comes in a format that is inconvenient for further processing. In such cases we can resort to <strong>web scraping</strong> to extract the data right off a web page.

In this article I want to share my experience of using Python 3 to scrape data from web pages with the <code>BeautifulSoup</code> module. We'll first learn about web page fundamentals and then extract specific data elements from my own web page.

## What is web scraping?

<strong>Web scraping</strong> (also <b>web harvesting</b> or <b>web data extraction</b>) is data scraping used for extracting data from websites. The technique focuses on fetching semi-structured data (HTML or XML) from web pages and transforming it into structured data.

A field that has gained considerable traction in the past years is <a href="https://en.wikipedia.org/wiki/Sentiment_analysis"><strong>Sentiment Analysis</strong></a>. Sentiment analysis, or <strong>Opinion Mining</strong>, refers to examining opinionated user-generated posts, articles, comments or tweets and classifying them according to criteria of interest. For example, we can classify whether a tweet has a positive or negative connotation.

Web scraping comes into play for generating a data set that we can feed to our <a href="https://en.wikipedia.org/wiki/Text_mining"><strong>text data mining</strong></a> routines, so that we can ultimately perform sentiment analysis on the data scraped for a specific topic. Data scientists use sentiment analysis to predict stock market movements and perform trend analytics.


## Components of web pages

Interaction between the user and the internet is facilitated via a <strong>web browser</strong>, or <strong>client</strong>. When you type a URL into your web browser, it sends a <strong>request</strong> to the web server to which the URL points. This is referred to as a <code>GET</code> request because we are requesting data files from the web server. The server then sends back a <strong>response</strong> containing the web site's data files to our browser. Usually, the received data contains the following file types:
<ul>
 	<li><a href="https://en.wikipedia.org/wiki/HTML">HTML</a> - contains main content of a site</li>
 	<li><a href="https://en.wikipedia.org/wiki/Cascading_Style_Sheets">CSS</a> - adds visual styling to a web site</li>
 	<li><a href="https://en.wikipedia.org/wiki/JavaScript">JS</a> - JavaScript files add dynamic behavior to web sites</li>
 	<li>Media - media files, for example pictures (JPG, PNG, etc.)</li>
</ul>
The browser <strong>renders</strong> the page with the information from the received files to display it to the user. The rendering is automatically taken care of by our browser and is of no concern for our web scraping. However, we will briefly elaborate on each of the components.

<strong>Hypertext Markup Language</strong> (<strong>HTML</strong>) is the standard markup language for creating web pages and web applications. Think of the HTML file as being the skeleton of a web page. It is the base structure to which the main content of the web page is added.

HTML so far only provides us with the coarse structure of a web page. Styling specifics like fonts, coloring or text boxes can be added using <strong>Cascading Style Sheets</strong> (<strong>CSS</strong>). CSS is designed primarily to enable the separation of presentation and content, including aspects such as the layout, colors, and fonts. You can think of the style sheets as being the skin in our analogy.

Then, the layer between skeleton and skin would be muscle, which in our analogy corresponds to <strong>JavaScript</strong>. JavaScript is responsible for almost every dynamic behavior we can see on web pages, e.g. moving text elements, responsive buttons or image slide shows.
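
To make the request half of the request/response cycle from the start of this section a bit more concrete, here is a minimal sketch that builds, but deliberately does not send, the kind of request a browser issues. It uses only Python's standard library; the <code>User-Agent</code> value is just an illustrative assumption.

```python
from urllib.request import Request
from urllib.parse import urlparse

url = "https://dacatay.com"

# Build (but do not send) the request a browser would issue for this URL.
# The User-Agent string is an arbitrary example value.
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# Without a request body attached, the method is GET.
method = request.get_method()
# The host part of the URL tells us which web server would receive the request.
host = urlparse(request.full_url).netloc

print(method, host)
```

Nothing leaves your machine here. The request is only sent once it is passed to something like <code>urllib.request.urlopen</code>, or when a library such as <code>requests</code> does the equivalent for us.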


## A Basic HTML file

To see how easy it actually is to work with HTML files try the following:
<ol>
 	<li><strong>Create</strong> a simple text file, named <code>basic_html_file.txt</code>.</li>
 	<li><strong>Open</strong> it and <strong>copy</strong> the content of the below code box inside of it.</li>
 	<li><strong>Rename</strong> the file to <code>basic_html_file.html</code>.</li>
 	<li><strong>Right</strong> <strong>click</strong> the file.</li>
 	<li><strong>Open</strong> the file using your browser (I use <a href="https://www.google.com/chrome/browser/desktop/index.html">Chrome</a>).</li>
</ol>

<!DOCTYPE html>
<html>
  <body>

    <h1>This is a header</h1>

    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>

    <a href="https://dacatay.com/python/python-web-scraping/">This is a link to my site.</a>

    <p>This is a link to an Einstein picture</p>
    <img src="https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg" width="104" height="142">

    <table style="width:20%" class="example_table">
      <tr>
        <th>Firstname</th>
        <th>Lastname</th> 
        <th>Age</th>
      </tr>
      <tr>
        <td>Jill</td>
        <td>Smith</td> 
        <td>50</td>
      </tr>
      <tr>
        <td>Eve</td>
        <td>Jackson</td> 
        <td>94</td>
      </tr>
    </table>

  </body>
</html>

You can also find this file <a href="https://github.com/dacatay/web-scraping">here</a>. Credit is owed to <a href="https://www.w3schools.com">w3schools</a>.

HTML uses <strong>tags</strong> to introduce structure into a document. Most elements start with an opening tag of the form &lt;tag&gt; and end with a closing tag &lt;/tag&gt;; a few, such as &lt;img&gt;, have no closing tag. In the example above we can see the following:
<ol>
 	<li><strong>&lt;!DOCTYPE html&gt;</strong> - declares the document type</li>
 	<li><strong>&lt;html&gt;</strong> - indicates the beginning of the actual HTML content</li>
 	<li><strong>&lt;body&gt;</strong> - contains everything that is displayed to the user</li>
 	<li><strong>&lt;h1&gt;</strong> to <strong>&lt;h6&gt;</strong> - define headings</li>
 	<li><strong>&lt;p&gt;</strong> - defines paragraphs</li>
 	<li><strong>&lt;a&gt;</strong> - defines a hyperlink</li>
 	<li><strong>&lt;img&gt;</strong> - used to include images</li>
 	<li><strong>&lt;table&gt;</strong> - starts a table</li>
 	<li><strong>&lt;tr&gt;</strong> - starts a table row</li>
 	<li><strong>&lt;th&gt;</strong> - defines a table header</li>
 	<li><strong>&lt;td&gt;</strong> - defines table data</li>
</ol>
HTML uses <strong>attributes</strong> like <code>href</code> in hyperlink tags or <code>src</code> in image tags to specify details about the content. For example, the size of the included image can be set with the <code>width</code> and <code>height</code> attributes. When you open the file in your browser, you should see the heading, the paragraphs, the link, the image and the table rendered on the page.
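
The tag and attribute structure described above can also be inspected programmatically. The following is a minimal sketch using <code>HTMLParser</code> from Python's standard library. The class name <code>TagCollector</code> is my own, hypothetical helper, and the snippet is a shortened copy of the example file.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every opening tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


# Shortened copy of the example HTML file from this section
snippet = """
<h1>This is a header</h1>
<a href="https://dacatay.com/python/python-web-scraping/">This is a link to my site.</a>
<img src="https://upload.wikimedia.org/wikipedia/en/8/86/Einstein_tongue.jpg" width="104" height="142">
"""

collector = TagCollector()
collector.feed(snippet)

# Print each tag name next to the attributes it carries
for tag, attributes in collector.tags:
    print(tag, attributes)
```

The heading carries no attributes at all, while the link and the image carry exactly the <code>href</code>, <code>src</code>, <code>width</code> and <code>height</code> attributes discussed above.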


## Web scraping basics using BeautifulSoup

In this section we will go through the basic functionality of <code>BeautifulSoup</code> step by step. A lot of credit is owed to sentdex for his awesome tutorials.

In [90]:
import bs4 as bs
import requests

We'll first use the <code>requests</code> library to send a <code>GET</code> request to the web server the <code>url</code> is pointing at and save the server's response in the variable <code>page</code>. Then we create the <code>soup</code>, a <code>BeautifulSoup</code> object, and pass it <code>page.text</code>, the HTML from which we'd like to extract data.

In [105]:
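url = 'https://dacatay.com'
# Send a browser-like User-Agent; some servers may reject the library default
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

# Parse the response body with Python's built-in HTML parser
soup = bs.BeautifulSoup(page.text, 'html.parser')
print(soup)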

<!DOCTYPE html>

<html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<script>(function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);</script>
<title>Home - dacatay</title>
<!-- This site is optimized with the Yoast SEO plugin v5.1 - https://yoast.com/wordpress/plugins/seo/ -->
<meta content="Welcome to my afterwork life and tech blog. This webpage feature freely usable code and explanation on how to implement it." name="description">
<link href="https://dacatay.com/" rel="canonical">
<meta content="en_US" property="og:locale">
<meta content="website" property="og:type"/>
<meta content="Home - dacatay" property="og:title"/>
<meta content="Welcome to my afterwork life and tech blog. This webpage feature freely usable code and explanation on how to implement it." property="og:des

Looking at the output, one can understand why the developers called it soup. This is the raw HTML document of my home page as it is hosted on the server. <code>BeautifulSoup</code> allows us to make this output look pretty using indentation and line breaks:

In [92]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-US" prefix="og: http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <script>
   (function(html){html.className = html.className.replace(/\bno-js\b/,'js')})(document.documentElement);
  </script>
  <title>
   Home - dacatay
  </title>
  <!-- This site is optimized with the Yoast SEO plugin v5.1 - https://yoast.com/wordpress/plugins/seo/ -->
  <meta content="Welcome to my afterwork life and tech blog. This webpage feature freely usable code and explanation on how to implement it." name="description">
   <link href="https://dacatay.com/" rel="canonical">
    <meta content="en_US" property="og:locale">
     <meta content="website" property="og:type"/>
     <meta content="Home - dacatay" property="og:title"/>
     <meta content="Welcome to my afterwork life and tech blog. This webpage feature freely usable code and exp

Having our <code>soup</code> at the ready, we can search for hyperlinks that are contained on my homepage:

In [93]:
link = soup.find('a')   # or, equivalently, the shorthand:
link = soup.a
print(link)

<a class="skip-link screen-reader-text" href="#content">Skip to content</a>


This searches our soup and assigns the first matching element to the variable <code>link</code>. We can find <strong>all</strong> occurrences of <code>a</code> with:

In [94]:
links = soup.find_all('a')
print(links)

[<a class="skip-link screen-reader-text" href="#content">Skip to content</a>, <a href="https://dacatay.com/" rel="home">dacatay</a>, <a href="http://dacatay.com">Home</a>, <a href="https://dacatay.com/blog/">Blog</a>, <a href="https://dacatay.com/about/">About</a>, <a href="https://dacatay.com/contact/">Contact</a>, <a href="http://facebook.com"><span class="screen-reader-text">Fb</span></a>, <a href="http://twitter.com"><span class="screen-reader-text">Twitter</span></a>, <a href="http://linkedin.com"><span class="screen-reader-text">Linkedin</span></a>, <a href="http://github.com/dacatay"><span class="screen-reader-text">github</span></a>, <a href="https://dacatay.com/" rel="home">
<img alt="dacatay" height="279" sizes="(max-width: 709px) 85vw, (max-width: 909px) 81vw, (max-width: 1362px) 88vw, 1200px" src="https://dacatay.com/wp-content/uploads/2017/07/cropped-morning-glow-2140867-e1501170411534.jpg" srcset="https://dacatay.com/wp-content/uploads/2017/07/cropped-morning-glow-2140867

This prints a long list in which every element is an <code>a</code> tag, that is, a hyperlink found on my website. As we already know, a hyperlink tag usually carries an <code>href</code> attribute. We can retrieve this attribute for every element in our <code>links</code> list using the <code>get</code> method:

In [95]:
for link in links:
    print(link.get('href'))

#content
https://dacatay.com/
http://dacatay.com
https://dacatay.com/blog/
https://dacatay.com/about/
https://dacatay.com/contact/
http://facebook.com
http://twitter.com
http://linkedin.com
http://github.com/dacatay
https://dacatay.com/
http://dacatay.com/about/
https://github.com/dacatay
https://dacatay.com/data-science/set-up-python-development-environment/
https://dacatay.com/data-science/set-up-python-development-environment/
https://dacatay.com/uncategorized/install-java-ubuntu-16-04-14-04-13-04-12-04/
https://dacatay.com/uncategorized/install-java-ubuntu-16-04-14-04-13-04-12-04/
https://dacatay.com/data-science/statistical-learning-with-python-1-introduction/
https://dacatay.com/data-science/statistical-learning-with-python-1-introduction/
https://dacatay.com/uncategorized/install-bash-windows-10/
https://dacatay.com/uncategorized/install-bash-windows-10/
https://dacatay.com/tag/anaconda/
https://dacatay.com/tag/bash/
https://dacatay.com/tag/beginner/
https://dacatay.com/tag/guid
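
One detail about <code>get</code> is worth a small sketch: it returns <code>None</code> when a tag has no <code>href</code>, whereas indexing with <code>link['href']</code> would raise a <code>KeyError</code>. The two-link HTML string below is a made-up stand-in so that the example runs without a network connection.

```python
import bs4 as bs

# Made-up HTML: the second anchor deliberately has no href attribute
html = '<a href="https://dacatay.com/blog/">Blog</a><a name="top">Top</a>'
soup = bs.BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    # .get() returns None for a missing attribute instead of raising KeyError
    print(link.get('href'))

# Passing href=True keeps only the tags that actually carry the attribute,
# so indexing with ['href'] is safe here.
hrefs = [link['href'] for link in soup.find_all('a', href=True)]
print(hrefs)
```

Filtering with <code>href=True</code> is one way to avoid <code>None</code> entries in the result; checking the return value of <code>get</code> works just as well.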

For another example, suppose that we are looking for all hyperlinks that are present in the navigation bar. In this case, we would make use of the <code>nav</code> tag and do the following:

In [96]:
nav = soup.nav
for link in nav.find_all('a'):
    print(link.get('href'))

http://dacatay.com
https://dacatay.com/blog/
https://dacatay.com/about/
https://dacatay.com/contact/


Similarly, to retrieve all paragraphs in the complete body of the HTML document we would do this:

In [97]:
body = soup.body
for p in body.find_all('p'):
    print(p.text)

dacatay
Afterwork Life and Tech Blog
Hello and welcome to my page. In the about tab you can find out who I am and what I do for a living.
This home tab essentially exists as part of my learning how to scrap web pages and retrieve specific information and content from a page.

 
 
Welcome to my afterwork life and tech blog. All code provided is free to use and can also be found on my github.


  (adsbygoogle = window.adsbygoogle || []).push({
    google_ad_client: "ca-pub-6962034930395047",
    enable_page_level_ads: true
  });



This next one is a little more specific. Some web pages do not place their content in basic paragraph <code>p</code> tags alone. They wrap it in elements with custom classes, which also makes automated scraping harder. In the case of my web page, I am using the "Twenty Sixteen" WordPress theme, which comes with such special <code>div</code> tag classes defined for content entry:

In [98]:
for div in soup.find_all('div', class_='entry-content'):
    print(div.text)


Hello and welcome to my page. In the about tab you can find out who I am and what I do for a living.
This home tab essentially exists as part of my learning how to scrap web pages and retrieve specific information and content from a page.

 
 



Firstname
Lastname
Age


Jill
Smith
50


Eve
Jackson
94
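
The same class-based lookup can also be written as a CSS selector with the <code>select</code> method. This is a minimal sketch on a made-up HTML fragment; only the class name <code>entry-content</code> is taken from my page, the rest is a stand-in.

```python
import bs4 as bs

# Made-up fragment: one header div and one content div
html = """
<div class="site-header"><p>dacatay</p></div>
<div class="entry-content">
  <p>Hello and welcome to my page.</p>
  <p>This home tab exists as part of my learning.</p>
</div>
"""
soup = bs.BeautifulSoup(html, 'html.parser')

# 'div.entry-content p' means: every <p> nested inside a <div class="entry-content">
paragraphs = [p.get_text(strip=True) for p in soup.select('div.entry-content p')]
print(paragraphs)
```

The paragraph in the header <code>div</code> is skipped, because the selector only matches paragraphs inside the content <code>div</code>.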



 


## Scraping tables

In this section we will learn to scrape tables. Looking at the HTML code snippet above, every table starts and ends with a <code>table</code> tag, and everything in between the table tags is organized in table rows using <code>tr</code> tags. Within a table row we find table header tags <code>th</code> and table data tags <code>td</code>, which contain the actual data.

As before, we can find a tagged element with:

In [99]:
table = soup.find('table')
print(table)

<table class='"example_table' style="width: 100%;">
<tbody>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</tbody>
</table>


To strip the table data elements from the table we can use the following snippet.

In [100]:
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

[]
['Jill', 'Smith', '50']
['Eve', 'Jackson', '94']


First, a list of all <code>table_rows</code> is created from the <code>table</code> variable. We then iterate over every table row <code>tr</code> to find the corresponding table data cells and assign them to the list <code>td</code>. Lastly, we create the list <code>row</code> using a list comprehension that saves every cell's content. The first printed list is empty because the header row contains only <code>th</code> cells and no <code>td</code> cells.
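
If we also want the column names, we can read the <code>th</code> cells separately and pair them with each data row. The following is a minimal sketch on an inline copy of the example table, so that it runs without a network connection. The variable names <code>header</code> and <code>records</code> are my own.

```python
import bs4 as bs

# Inline copy of the example table
html = """
<table class="example_table">
  <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>
"""
soup = bs.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# The column names live in the <th> cells of the header row
header = [th.text for th in table.find_all('th')]

records = []
for tr in table.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    if cells:  # skips the header row, which has no <td> cells
        records.append(dict(zip(header, cells)))

print(records)
```

Every record is now a dictionary keyed by column name. Note that all values, including the ages, are still strings at this point.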

## Scraping XML files

We are also able to scrape <strong>site maps</strong>. A <b>site map</b> is a list of all URLs of a web site that are accessible to crawlers or users, and it typically comes in <strong><a href="https://de.wikipedia.org/wiki/Extensible_Markup_Language">XML</a></strong> format. <strong>Extensible Markup Language (XML)</strong> defines a set of encoding rules for the representation of hierarchically structured data.

This is <a href="http://dacatay.com/post_tag-sitemap.xml">my site map</a> for all posts available on my web page. In the opened window press <code>Ctrl + U</code> to view the source code. You can see that the document consists of a single <code>urlset</code> tag that contains one <code>url</code> tag for every post.

You can use the site maps of a news page to scrape the links to its newest articles. For example, this is the <a href="http://edition.cnn.com/sitemaps/sitemap-index.xml">index site map</a> of the Washington Post, which gives us an overview of all the site maps available on the web site. This is the highest level in the site map hierarchy, and the structuring tags on this level are <code>sitemapindex</code> and <code>sitemap</code>. We can dig deeper and look, for example, at this <a href="http://edition.cnn.com/sitemaps/sitemap-show-2017-08.xml">site map</a> of all news articles related to the topic 'politics'.

To scrape this content we have to slightly modify our existing code:

In [106]:
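url = 'https://www.washingtonpost.com/news-politics-sitemap.xml'
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
# The 'xml' parser requires the lxml package to be installed
soup = bs.BeautifulSoup(page.text, 'xml')

# Every <loc> tag holds the URL of one article
for link in soup.find_all('loc'):
    print(link.text)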

https://www.washingtonpost.com/politics/trump-still-has-the-bully-pulpit-but-is-facing-more-challenges-to-his-authority/2017/08/04/661f147e-7926-11e7-803f-a6c989606ac7_story.html
https://www.washingtonpost.com/politics/white-house-anger-over-leaks-grows-crackdown-promised/2017/08/04/905cf556-7914-11e7-8c17-533c52b2f014_story.html
https://www.washingtonpost.com/politics/the-latest-trump-blasts-democrats-for-fueling-russia-hoax/2017/08/03/bf8d9b9e-78a8-11e7-8c17-533c52b2f014_story.html
https://www.washingtonpost.com/local/md-politics/maryland-fentanyl-deaths-surge-again-in-first-quarter-of-2017/2017/08/04/07343642-7953-11e7-9eac-d56bd5568db8_story.html
http://www.washingtonpost.com/news/politics/wp/2017/08/04/flooding-in-miami-is-no-longer-news-but-its-certainly-newsworthy/
https://www.washingtonpost.com/politics/federal_government/huff-puff-pass-ags-pot-fury-not-echoed-by-task-force/2017/08/04/75ff158e-7952-11e7-8c17-533c52b2f014_story.html
https://www.washingtonpost.com/politics/courts
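
As an alternative that needs neither <code>lxml</code> nor a network connection, the same <code>loc</code> extraction can be sketched with Python's built-in <code>xml.etree.ElementTree</code>. This swaps in a different parser than the one used above. The two-entry site map and its <code>example.com</code> URLs are placeholders, and the prefix <code>sm</code> is an arbitrary name I chose for the site map namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry site map; the URLs are placeholders, not real posts
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-one/</loc></url>
  <url><loc>https://example.com/post-two/</loc></url>
</urlset>"""

root = ET.fromstring(sitemap)

# Site map tags live in an XML namespace, which ElementTree wants spelled out
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Collect the text of every <loc> inside every <url>
locations = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(locations)
```

Unlike <code>BeautifulSoup</code>, <code>ElementTree</code> does not ignore the namespace, which is why the lookup path carries the <code>sm:</code> prefix.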

Obviously, we'd want to save the data to a file when we scrape it. We can save the links found on the Washington Post site to a <code>.txt</code> file:

In [102]:
filename = 'links.txt'
header = 'links'
# The with statement closes the file automatically, so no explicit close() is needed
with open(filename, 'w') as file:
    file.write(header)
    file.write('\n')
    # Write one link per line
    for link in soup.find_all('loc'):
        file.write(link.text + '\n')
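
If we later want to store more than one field per link, a CSV file is a natural next step. The following is a minimal sketch using the standard library's <code>csv</code> module. The links are placeholders standing in for the scraped <code>loc</code> values, and the file is written to a temporary directory so that the example leaves nothing behind.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder links standing in for the scraped <loc> values
links = ["https://example.com/post-one/", "https://example.com/post-two/"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "links.csv"

    # newline="" lets the csv module control the line endings itself
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["links"])                  # header row
        writer.writerows([link] for link in links)  # one link per row

    # Read the file back to check what was written
    with open(path, newline="", encoding="utf-8") as file:
        rows = list(csv.reader(file))

print(rows)
```

Compared to writing lines by hand, the <code>csv</code> module takes care of quoting, so a value that contains a comma does not break the file.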

## Using pandas to read data tables from web pages

Alternatively, we can use the <code>pandas</code> module to easily scrape the content of tables from web pages. We begin by importing the library and reading the tables on my web page into the <code>dfs</code> variable. This is a <code>list</code> of <code>DataFrame</code> objects, so we loop over it to print its single element:

In [103]:
import pandas as pd

In [104]:
dfs = pd.read_html('https://dacatay.com', header=0)
for df in dfs:
    print(df)

  Firstname Lastname  Age
0      Jill    Smith   50
1       Eve  Jackson   94


We use the <code>read_html</code> function and pass along the web address. This tells pandas to fetch the page, parse every table it finds, and return a list with one <code>DataFrame</code> object per table. Here it finds only one table.
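
If the rows were scraped with <code>BeautifulSoup</code> instead, as in the table section, we can build a comparable <code>DataFrame</code> ourselves. This is a minimal sketch. The <code>header</code> and <code>rows</code> lists are typed in by hand here to mirror the scraped table.

```python
import pandas as pd

# Typed in by hand to mirror the output of the BeautifulSoup table scraping
header = ['Firstname', 'Lastname', 'Age']
rows = [['Jill', 'Smith', '50'], ['Eve', 'Jackson', '94']]

df = pd.DataFrame(rows, columns=header)

# Scraped cell text is always a string, so numeric columns need an explicit conversion
df['Age'] = pd.to_numeric(df['Age'])
print(df)
```

The explicit conversion with <code>to_numeric</code> is the step that <code>read_html</code> otherwise performs for us.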


## Conclusion

In this article we have discovered and implemented several web scraping techniques using the Python modules <code>BeautifulSoup</code> and <code>requests</code>. We also found out that we can use built-in functions from the <code>pandas</code> module to create structured <code>DataFrame</code> objects from scraped HTML tables.