# Lab1.2: The web as a source of text

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

The WWW consists of web pages that are either created as static HTML or generated dynamically from databases or other formats.

HTML stands for HyperText Markup Language. It tells the browser how to render a web page so that people can easily access the content.

HTML contains more than just the content. It includes instructions that tell the browser how to render it. It may also contain other data such as JavaScript to run small programs, hyperlinks to other web pages, images, videos, or comments made by the people who created the page.

To get the content from a web page, we need to separate the language from the instructions. For this we use two packages: *requests* and *BeautifulSoup*.

The *requests* package is not part of Python's standard library, but it ships with distributions such as Anaconda (otherwise, install it with *pip install requests*). It sends a request to a web address and returns the response. More precisely, *requests.get(url)* sends an HTTP GET request, as a browser does, for the page specified by *url*. The result is a response object from which we can get the text: *result.text* holds the HTML as a string.
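In practice a request can fail or hang, so it can help to set a timeout and to check the status of the response. The following is a minimal sketch and not part of the original lab: the function name *fetch_html* and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML as text (hypothetical helper)."""
    # The timeout (in seconds) prevents the call from waiting forever
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling *fetch_html(url)* returns the HTML of the page as a string. If something goes wrong, it raises a subclass of *requests.exceptions.RequestException*, which you can catch with a *try* ... *except* block.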

A nice blog post on using BeautifulSoup, for further reading, is the following:

https://towardsdatascience.com/forget-apis-do-python-scraping-using-beautiful-soup-import-data-file-from-the-web-part-2-27af5d666246


In [2]:
import requests

url = "http://cltl.nl"
result = requests.get(url)
html = result.text
# Printing the HTML shows the source code of the web page
print(html)

<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<![endif]-->
<!--[if !(IE 7) & !(IE 8)]><!-->
<html lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<!--<![endif]-->
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width" />
<title>CLTL | the Computational Lexicology &amp; Terminology Lab of Prof. Dr. Piek Vossen</title>
<link rel="profile" href="http://gmpg.org/xfn/11" />
<link rel="pingback" href="http://www.cltl.nl/xmlrpc.php" />
<!--[if lt IE 9]>
<script src="http://www.cltl.nl/wp-content/themes/twentytwelve/js/html5.js" type="text/javascript"></script>
<![endif]-->
<link rel='dns-prefetch' href='//s7.addthis.com

As you can see, an HTML page can be a complex data structure with so-called *tags* between "<" and ">". The actual natural language text is scattered throughout the HTML, usually between the *<body>* and *</body>* tags.
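To make the difference between tags and text concrete, here is a minimal sketch that uses only Python's built-in *html.parser* module (not BeautifulSoup). The class name *TagAndTextCollector* and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collects tag names and text snippets separately (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag such as <p> or <h1>
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only snippets
        if data.strip():
            self.texts.append(data.strip())


# A small hypothetical page, not taken from the CLTL website
sample_html = """<html><head><title>Demo page</title></head>
<body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>"""

collector = TagAndTextCollector()
collector.feed(sample_html)
collector.close()
print(collector.tags)   # the names of the opening tags, in document order
print(collector.texts)  # the text snippets found between the tags
```

Even in this tiny page the sentence "Some bold text." arrives as three separate snippets, because the *<b>* tag interrupts it. Real pages are much messier than this.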

Since HTML can be very complex, with comments and JavaScript inserted anywhere, we will use the BeautifulSoup package to obtain the actual text.

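You must first install the package on your local computer through the command line. The cells below also use the *html5lib* parser, which is a separate package that needs to be installed as well. You can follow the instructions here: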

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

After installing the package, you can import it and pass the HTML string to it to process the structure and represent the data elements. More documentation on how the data is structured and how it can be accessed can be found here:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Basically, BeautifulSoup gives you a *soup* object on which you can call various methods. Here we call the *prettify()* method to print the HTML above a little more nicely.
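Besides *prettify()*, the *soup* object lets you navigate to individual elements. The sketch below shows a few common calls on a small made-up HTML string, so that it runs without a network connection. It uses Python's built-in *html.parser* instead of *html5lib* only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# A small hypothetical page, used so the example runs offline
sample_html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <p>Visit <a href="http://cltl.nl">CLTL</a> or
       <a href="https://www.crummy.com">Crummy</a>.</p>
  </body>
</html>
"""

# 'html.parser' ships with Python; the notebook itself uses 'html5lib'
soup = BeautifulSoup(sample_html, "html.parser")

# Text inside the <title> element
title_text = soup.title.string
# Targets of all hyperlinks on the page
links = [a.get("href") for a in soup.find_all("a")]
# Text of the first <p> element, with the tags removed
first_paragraph = soup.find("p").get_text()

print(title_text)
print(links)
print(first_paragraph)
```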


In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())

<!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" xmlns:fb="https://www.facebook.com/2008/fbml" xmlns:addthis="https://www.addthis.com/help/api-spec" >
<![endif]-->
<!--[if !(IE 7) & !(IE 8)]><!-->
<html lang="en-US" xmlns:addthis="https://www.addthis.com/help/api-spec" xmlns:fb="https://www.facebook.com/2008/fbml">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <title>
   CLTL | the Computational Lexicology &amp; Terminology Lab of Prof. Dr. Piek Vossen
  </title>
  <link href="http://gmpg.org/xfn/11" rel="profile"/>
  <link href="http://www.cltl.nl/xmlrpc.php" rel="pingback"/>
  <!--[if lt IE 9]>
<script src="http://www.cltl.nl/wp-content/themes/twentytwelve/js/html5.js" type="text/javascript"></script>
<![endif]-->
  <link href="//s7.addthis.com"

Luckily, we also have the method *soup.get_text()*, which collects the text snippets. We use Python's regular expression module *re* to remove spurious tabs and newlines and join the text snippets with spaces. Here we provide a complete function, *url_to_string*, that gets the text from a given URL.

In [15]:
from bs4 import BeautifulSoup
import requests
import re

# Utility function to get the raw text from a web page.
# It takes a URL string as input and returns the text.
def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # The next loop removes non-text elements
    for element in soup(["script", "style", "aside"]):
        element.extract()
    # Now, we replace runs of newlines and tabs with single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))
    # To see what happens if you do not remove newlines and tabs,
    # use the following line instead of the return statement above:
    #return soup.get_text()

If we apply this function to the *CLTL* website, we get plain text that we can save to a file as natural language text data. Play with the code above to understand how the text is glued together. See what happens if you skip the for-loop or the removal of tabs and newlines.

In [16]:
url = "http://cltl.nl"
cltl_content = url_to_string(url)
print(cltl_content)

 CLTL | the Computational Lexicology & Terminology Lab of Prof. Dr. Piek Vossen   CLTL the Computational Lexicology & Terminology Lab of Prof. Dr. Piek Vossen Menu Skip to content HomePeople Projects Future Projects The Reference Machine Theory of Identity, Reference and Perspective (TIRP) Discriminatory Micro Portraits Current projects Robot Leolani Weekend of Science 2018: Talking with Robots Leolani in Brno 2018 Weekend of Science 2017: Talking with Robots Understanding of Language by Machines Dutch Framenet Hybrid Intelligence SERPENS Word, Sense and Reference QuPiD2 HHuCap Reading between the lines VU University Research Fellow CLARIAH Visualizing Uncertainty and Perspectives Digital Humanities Open Dutch Wordnet Global WordNet Grid Global WordNet Association Previous projects NewsReader BiographyNet Language, Knowledge and People in Perspective CLIN26 Investigating Criminal Networks INclusive INsight Can we Handle the News (EYR4) OpeNER Mapping Notes and Nodes in Networks KYOTO C
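The effect of the for-loop in *url_to_string* can also be explored offline. The sketch below applies the same steps to a small made-up HTML string; the function name *html_to_text* and its *remove_non_text* switch are assumptions for illustration, and *html.parser* is used only to avoid the *html5lib* dependency. The example focuses on *<aside>*, because depending on the BeautifulSoup version and parser, *get_text()* may already skip the contents of *<script>* and *<style>* by itself.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page with a sidebar element mixed into the content
sample_html = """<html><body><p>First line.</p>
<aside>Sidebar advertisement</aside>
\t<p>Second line.</p></body></html>"""


def html_to_text(html, remove_non_text=True):
    """Same steps as url_to_string, but starting from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    if remove_non_text:
        # Remove elements that do not belong to the main text
        for element in soup(["script", "style", "aside"]):
            element.extract()
    # Replace runs of newlines and tabs with single spaces
    return " ".join(re.split(r"[\n\t]+", soup.get_text()))


with_loop = html_to_text(sample_html)
without_loop = html_to_text(sample_html, remove_non_text=False)
print(repr(with_loop))
print(repr(without_loop))
```

Without the loop, the sidebar text ends up in the middle of the running text, which is usually not what you want in a corpus.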

The value of the variable *cltl_content* can be saved to a file to build up a text data set.

In [22]:
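url = "http://cltl.nl"
cltl_content = url_to_string(url)

filename = 'cltl.txt'
# The with-statement closes the file automatically after writing
with open(filename, "w", encoding="utf-8") as f:
    # write() returns the number of characters written
    n_chars = f.write(cltl_content)
print(n_chars)

#print(cltl_content)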

5164
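When you collect text from several pages, it is convenient to store each text as its own file in one folder. The following is a minimal sketch of that idea; the helper name *save_text*, the folder name *corpus*, and the placeholder texts are assumptions, and a temporary folder is used so that the example leaves no files behind.

```python
from pathlib import Path
import tempfile


def save_text(text, directory, name):
    """Write text to <directory>/<name>.txt and return the path (hypothetical helper)."""
    directory = Path(directory)
    # Create the folder if it does not exist yet
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.txt"
    # An explicit encoding keeps non-ASCII characters intact across platforms
    path.write_text(text, encoding="utf-8")
    return path


# Demonstration with placeholder texts in a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_dir = Path(tmp_dir) / "corpus"
    saved = save_text("Text of the first page.", corpus_dir, "page1")
    save_text("Text of the second page.", corpus_dir, "page2")
    file_names = sorted(p.name for p in corpus_dir.glob("*.txt"))
    reloaded = saved.read_text(encoding="utf-8")

print(file_names)
print(reloaded)
```

In a real project you would pass the output of *url_to_string* as the text and use a permanent folder instead of a temporary one.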

In later notebooks of this course, you will reuse the above function to obtain text and build a corpus from it.

## End of this notebook