Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.1: Scraping HTML content from the web.

Usually, you see web content in a browser. The browser displays web pages which are either created as static HTML or generated dynamically from databases or other formats. One way to create a corpus from web pages is to download the HTML source.

HTML stands for *Hyper Text Markup Language* and it contains more than just the text of the web page. It includes instructions that tell the browser how to render the content so that people can easily access it. It may also contain other parts such as JavaScript code to run little programs, hyperlinks to other web pages, images, videos, or comments made by the people who created the page.

In the Python course, you have already seen how to use the *requests* package to download a web page as an HTML object. In order to access different elements of the HTML object, we will use the package *beautifulsoup4*, which can use *html5lib* as its underlying parser.

Please make sure that you have prepared your [technical setup](https://canvas.vu.nl/courses/56534/pages/getting-started). Activate the virtual environment LaD and make sure that you choose LaDKernel as a kernel for this notebook. Now, we can start scraping web content as HTML.

## 1. Scraping HTML content

In [1]:
import requests 
import html5lib


url = "http://cltl.nl"
result = requests.get(url)
html = result.text


The variable *html* now contains the HTML content from our CLTL web page. Let's have a closer look at it.

In [2]:
# An HTML document starts with a header that specifies a lot of metadata. 
# Let's print the first 2000 characters:
print('The start of the HTML content:')
print(html[:2000])
print()

The start of the HTML content:
<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<![endif]-->
<!--[if !(IE 7) & !(IE 8)]><!-->
<html lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<!--<![endif]-->
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width" />
<title>CLTL | the Computational Linguistics &amp; Text Mining Lab</title>
<link rel="profile" href="http://gmpg.org/xfn/11" />
<link rel="pingback" href="http://www.cltl.nl/xmlrpc.php" />
<!--[if lt IE 9]>
<script src="http://www.cltl.nl/wp-content/themes/twentytwelve/js/html5.js" type="text/javascript"></script>
<![endif]-->
<link rel='dns-prefetch' href='//s7.addt

You can see that the content is structured by so-called tags between "<" and ">". Don't worry if the header looks confusing. The main content of the web page can be found within the ```<body> .... </body>``` tags. 
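
The position of the body can also be searched for instead of being read off by eye. The following is a minimal sketch using `str.find` on a toy page; `sample_html` and its content are made up for illustration, and the search assumes the tag is written in lowercase.

```python
# Toy page standing in for the downloaded HTML (hypothetical content).
sample_html = (
    '<!DOCTYPE html>\n'
    '<html>\n'
    '<head><title>Demo</title></head>\n'
    '<body class="home">\n'
    '<p>Hello</p>\n'
    '</body>\n'
    '</html>\n'
)

# str.find returns the index of the first match, or -1 if there is none.
body_start = sample_html.find("<body")
body_end = sample_html.find("</body>")

if body_start != -1 and body_end != -1:
    # Include the closing tag itself in the slice.
    body = sample_html[body_start:body_end + len("</body>")]
else:
    body = ""

print(body)
```

On a real page, the same two calls on the downloaded string give the offsets to use for slicing.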

In [3]:
# In this example, the body starts after around 12050 characters. 
print('The beginning of the body:')
print(html[12050:13000])

The beginning of the body:
</head>

<body class="home page-template-default page page-id-25 custom-background do-etfw tribe-no-js custom-background-white custom-font-enabled">
<div id="page" class="hfeed site">
	<header id="masthead" class="site-header" role="banner">
		<hgroup>
			<h1 class="site-title"><a href="http://www.cltl.nl/" title="CLTL" rel="home">CLTL</a></h1>
			<h2 class="site-description">the Computational Linguistics &amp; Text Mining Lab</h2>
		</hgroup>

		<nav id="site-navigation" class="main-navigation" role="navigation">
			<button class="menu-toggle">Menu</button>
			<a class="assistive-text" href="#content" title="Skip to content">Skip to content</a>
			<div class="nav-menu"><ul>
<li class="current_page_item"><a href="http://www.cltl.nl/">Home</a></li><li class="page_item page-item-31"><a href="http://www.cltl.nl/people/">People</a></li>
<li class="page_item page-item-33 page_item_has_children"><a href="http://www.cltl.nl/projects/">Projects


In [4]:
# I looked for a region that contains some text.  
print('Somewhere in the middle:\n')
print(html[30000:32000])

Somewhere in the middle:

ute</a>.</p>
<p>&nbsp;</p>
<h3>Our research</h3>
<p>The Computational Linguistics and Text Mining Lab (CLTL) models the <a href="http://www.understandinglanguagebymachines.org">understanding of natural language by machines</a>. Machines that can read texts and understand what it is about (what, who, when, where), but also machines that create powerful distributional language models from large volume of text using Deep Learning techniques. Our research tries to obtain a better understanding of so-called backbone models, reveal biases and unwanted errors but also to combine distributional approaches with explicit symbolic models to add explanatory power.&nbsp;Please go <a href="https://cltl.github.io">here</a> for an overview of our current research and links to more information.</p>
<p>We see language as a reference system that connects people and systems to their perception of the world. Identity, reference and perspectives are central themes in our research a

Unfortunately, the text content that we are interested in is still scattered and not easy to extract. Here are some examples of structural tags and symbols:
* &lt;p> stands for the beginning of a paragraph, &lt;/p> for the end
* &lt;a href=... starts a link 
* &lt;h1> stands for a big headline, &lt;h2> stands for a smaller headline 
* &amp;#8220; and &amp;#8221; are opening and closing quotation marks. 

You do not need to learn all these terms and symbols. It is enough to know where to look them up when you need them
(https://www.w3schools.com/tags/ref_byfunc.asp, https://dev.w3.org/html5/html-author/charref).
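
Character references such as the quotation marks above can be decoded with the Python standard library. This is a small sketch with a made-up string `raw`. The function is imported by name because this notebook already uses `html` as a variable name, which would otherwise shadow the standard `html` module.

```python
from html import unescape

# Hypothetical snippet containing numeric and named character references.
raw = "&#8220;Text Mining&#8221; &amp; more"

# unescape() replaces the references with the characters they stand for.
decoded = unescape(raw)
print(decoded)
```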
    
**Take some time to print different parts of the HTML content and compare them to what you see in the browser. Play around with other urls!**

## 2. Extracting text from HTML
In order to reduce the complex HTML content to the main text content, we use an HTML parser called BeautifulSoup. This parser processes the different opening and closing tags and extracts only the content that seems to be textual. Compare the different outputs of the functions *prettify()* and *get_text()*.


In [5]:
from bs4 import BeautifulSoup
parser_content = BeautifulSoup(html, 'html5lib')

# The function prettify() provides the HTML content in a more readable way.
# Note that due to the additional line breaks, the character count changes for the same region of text. 
print(parser_content.prettify()[36700:38700])


e.org/" rel="noopener" target="_blank">
          Network Institute
         </a>
         .
        </p>
        <p>
        </p>
        <h3>
         Our research
        </h3>
        <p>
         The Computational Linguistics and Text Mining Lab (CLTL) models the
         <a href="http://www.understandinglanguagebymachines.org">
          understanding of natural language by machines
         </a>
         . Machines that can read texts and understand what it is about (what, who, when, where), but also machines that create powerful distributional language models from large volume of text using Deep Learning techniques. Our research tries to obtain a better understanding of so-called backbone models, reveal biases and unwanted errors but also to combine distributional approaches with explicit symbolic models to add explanatory power. Please go
         <a href="https://cltl.github.io">
          here
         </a>
         for an overview of our current research and links to more i

In [6]:
# The function get_text() extracts textual content from the HTML.
# Again, the character count for our text region changed because all tags are now being ignored. 
print(parser_content.get_text()[6000:8000])

# Uncomment the line below to look at the full output for comparison: 
# print(parser_content.get_text())

t is about (what, who, when, where), but also machines that create powerful distributional language models from large volume of text using Deep Learning techniques. Our research tries to obtain a better understanding of so-called backbone models, reveal biases and unwanted errors but also to combine distributional approaches with explicit symbolic models to add explanatory power. Please go here for an overview of our current research and links to more information.
We see language as a reference system that connects people and systems to their perception of the world. Identity, reference and perspectives are central themes in our research and are studied in combination. You can read more about the Theory of Identify, Reference and Perspective (TIRP) here. In our research projects on Communicative Robots, many of our ideas come together: http://makerobotstalk.nl In these projects, we try to build robots that communicate with people in real-world situations taking perceptions of the conte
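
The difference between the two functions is easier to see on a tiny document. The sketch below uses a made-up snippet and Python's built-in `html.parser` instead of `html5lib` (an assumption made to keep the example small; `html5lib` would additionally wrap the snippet in html, head, and body tags).

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph document.
snippet = '<p>Hello <a href="#">world</a>.</p>'
soup = BeautifulSoup(snippet, "html.parser")

# prettify() keeps all tags and puts each tag and text node on its own line.
pretty = soup.prettify()
print(pretty)

# get_text() drops the tags and concatenates the text nodes.
text = soup.get_text()
print(text)
```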

If you look at the full output of the get_text() method, you notice that it still contains unnecessary elements such as scripts. We remove them using the method extract(). If you want to know more about this, look at the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). For the moment, I recommend simply accepting that this works.

A regular expression helps us to get rid of unnecessary newlines. **If you are new to Python, take some time here to understand what is happening and recap how to use regular expressions. These are very common data processing steps and you will need them very often.**

In [7]:
# Clean up the content step by step. 
# 1. Remove unnecessary elements like scripts. 
for script in parser_content(["script", "style", "aside"]):
    script.extract()
   
text_with_newlines = parser_content.get_text()

# 2. We split the text at each newline or tab. 
# Make sure to recap how to use regular expressions. You will need them very often. 
import re
text_elements = re.split(r'[\n\t]+', text_with_newlines)

# Now join the text elements by simple white spaces: 
text_without_newlines = " ".join(text_elements)

print(text_without_newlines[:5000])

 CLTL | the Computational Linguistics & Text Mining Lab   CLTL the Computational Linguistics & Text Mining Lab Menu Skip to content HomePeople Projects Current projects Rethinking News Recommender Systems (2020-2024) ALANI (2020-2024) Understanding of Language by Machines Hybrid Intelligence (2019-2029) Make Robots Talk and Think (2020-2024) Robot Leolani (ULM5) Weekend of Science 2018: Talking with Robots Leolani in Brno 2018 Weekend of Science 2017: Talking with Robots Dutch Framenet (2019-2023) CLARIAH VU University Research Fellow QuPiD2 Word, Sense and Reference Digital Humanities Open Dutch Wordnet Global WordNet Grid Global WordNet Association Previous projects The Reference Machine NewsReader Reading between the lines BiographyNet Language, Knowledge and People in Perspective Discriminatory Micro Portraits CLIN26 Visualizing Uncertainty and Perspectives HHuCap SERPENS Investigating Criminal Networks INclusive INsight Can we Handle the News (EYR4) OpeNER Mapping Notes and Nodes 
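
The output above still contains runs of several spaces, because the regular expression only splits on newlines and tabs. The following sketch, on a made-up string, compares that approach with `str.split()` without arguments, which splits on any run of whitespace. It is offered as one possible alternative, not as the method used in this notebook.

```python
import re

# Hypothetical text with newlines, tabs, and repeated spaces.
raw_text = "CLTL\n\n\tHome\tPeople   Projects\n"

# Splitting on newlines and tabs keeps the repeated spaces
# and leaves an empty string at the end.
parts = re.split(r"[\n\t]+", raw_text)
joined = " ".join(parts)

# str.split() without arguments splits on any whitespace run
# and drops empty strings at both ends.
collapsed = " ".join(raw_text.split())

print(repr(joined))
print(repr(collapsed))
```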

To make our lives easier, we put all the steps into a single function. 

In [8]:
def url_to_string(url):
    """
    Utility function to get the raw text from a web page.  
    It takes a URL string as input and returns the text.
    """
    res = requests.get(url)
    html = res.text
    parser_content = BeautifulSoup(html, 'html5lib')
    
    for script in parser_content(["script", "style", 'aside']):
        script.extract()
        
    # This is a shorter way to write the code for removing the newlines. 
    # It does the same as above in one step without intermediate variables
    return " ".join(re.split(r'[\n\t]+', parser_content.get_text()))
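
One possible variation is to separate the download from the parsing, so that the cleaning step can be tried on any HTML string without a network connection. The names `fetch_html` and `html_to_string`, the 10-second timeout, and the default `html.parser` are assumptions for this sketch and are not part of the course code.

```python
import re

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


def html_to_string(html, parser="html.parser"):
    """Reduce an HTML string to its text content (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    for element in soup(["script", "style", "aside"]):
        element.extract()
    return " ".join(re.split(r"[\n\t]+", soup.get_text()))


# The parsing step works on any HTML string, here a made-up one.
sample = (
    "<html><body><p>First line</p>\n"
    "<script>var x = 1;</script>\n"
    "<p>Second line</p></body></html>"
)
print(html_to_string(sample))
```

With these two helpers, `html_to_string(fetch_html(url))` would play the role of `url_to_string(url)`.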

We can now apply this function to any URL and save the result in a text file. 

In [9]:
url = "http://cltl.nl"
cltl_content = url_to_string(url)

# Save the text content to a file. 
filename='../results/cltl.txt'
with open(filename, 'w', encoding="utf-8") as outfile:
    outfile.write(cltl_content)
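
Writing the file fails if the target directory does not exist yet. The sketch below shows one way to create it first using `pathlib`. The helper `save_text` is hypothetical, and the demonstration writes into a temporary directory so that it does not touch your results folder.

```python
import tempfile
from pathlib import Path


def save_text(text, filename):
    """Write text to filename, creating parent directories if needed (hypothetical helper)."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Demonstration in a temporary directory that is deleted afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_text("Some scraped text", Path(tmp_dir) / "results" / "demo.txt")
    content_read_back = saved.read_text(encoding="utf-8")

print(content_read_back)
```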

You can see that the text extracted from an HTML page is not clean, linear text because it also contains elements such as menu items. These are glued to the text without proper punctuation or sentence structure. The result will be different for every web page, and it will have an impact on what we represent as language and on the performance of the systems that we run on these texts. **Play around with other URLs to see the differences and think about what kind of text preprocessing could be useful in each case.** 
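
One preprocessing option for glued menu items is the `separator` argument of get_text(). The following sketch uses a made-up menu and the built-in `html.parser` (an assumption for brevity). It shows how neighbouring items run together by default and how a separator keeps them apart.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation menu without whitespace between the items.
menu_html = "<ul><li><a href='#'>Home</a></li><li><a href='#'>People</a></li></ul>"
soup = BeautifulSoup(menu_html, "html.parser")

# By default, text nodes are concatenated directly.
glued = soup.get_text()

# With a separator, a space is placed between text nodes;
# strip=True also trims whitespace around each piece.
separated = soup.get_text(separator=" ", strip=True)

print(glued)
print(separated)
```

Note that a separator is inserted between all text nodes, so it can also introduce spaces inside a sentence, for example before punctuation that follows a link.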

In order to be able to access the function url_to_string(url) from other notebooks, we did the following: 
- We copied the function to a file called *util_html.py*.
- We added an empty file called *\_\_init\_\_.py* to the directory. This file indicates that py-files in this directory can be treated as modules and the functions in these files can be imported. 

When we need the function in another notebook, we can now just write: 

In [10]:
from util_html import url_to_string