# Automatic data collection on the Web

Before tackling real web pages (HTML), we will look at the XML language, which is a good introduction to web scraping.

### XML

XML was created to facilitate data exchange between machines and software.

XML is a language that is written using tags.

XML is a W3C recommendation, so it is a technology with strict rules to follow.

XML is intended to be understandable by everyone: people and machines alike.

XML allows us to create our own vocabulary using a set of customizable rules and tags.

XML is also compatible with the web so that data exchanges can be easily carried out over the Internet.

XML is therefore standardized, simple, but above all extensible and configurable so that any type of data can be described.

Here is an example of an XML document, which we have saved as data.xml in the data directory.

Display its content:

In [1]:
# Open the file in a context manager so it is closed automatically
with open("../data/data.xml", "r", encoding="utf-8") as file:
    print(file.read())


The first line declares the encoding; we always use UTF-8. Next, notice that the "users" tag contains several "user" tags, each of which has its own child tags. The data is organized as a tree, and each node carries a piece of information.
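
Since the content of data.xml is not reproduced here, the sketch below uses a small hypothetical document with the same shape: a `users` root and `user` children carrying a `data-id` attribute, with `nom` and `metier` child tags, as the queries in this section suggest. The names and values are invented for illustration. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs, and the `describe` helper is an assumption of this sketch, not part of either library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real data.xml may contain different users.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<users>
    <user data-id="101">
        <nom>Zorro</nom>
        <metier>Danseur</metier>
    </user>
    <user data-id="102">
        <nom>Hulk</nom>
        <metier>Veterinaire</metier>
    </user>
</users>
"""


def describe(element, depth=0):
    """Return one line per node, indented according to its depth in the tree."""
    attrs = "".join(f' {key}="{value}"' for key, value in element.attrib.items())
    line = "  " * depth + f"<{element.tag}{attrs}>"
    text = (element.text or "").strip()
    if text:
        line += f" {text}"
    lines = [line]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


# Parse from bytes so the encoding declaration on the first line is honoured.
root = ET.fromstring(SAMPLE_XML.encode("utf-8"))
print("\n".join(describe(root)))
```

Each level of indentation in the output corresponds to one level of nesting in the document.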

Here is a small script that displays all the user names.

In [None]:
from lxml import etree
# Parse the source document
tree = etree.parse("../data/data.xml")
# Inspect the document to find the tag path that leads to each user's name:
# the name is stored in a "nom" tag, inside a "user" tag,
# itself inside the "users" tag located at the root of the document,
# so the XPath expression for our search is "/users/user/nom"
for user in tree.xpath("/users/user/nom"):
    # Display only the content (.text) of each /users/user/nom tag
    print(user.text)

In [None]:
tree.xpath("/users/user/nom")[0].text

In [None]:
# You can also read the attributes of a tag, here the data-id of each user
tree = etree.parse("../data/data.xml")
for user in tree.xpath("/users/user"):
    print(user.get("data-id"))

You can refine the query to display only the users whose job (metier) is Veterinaire.

In [None]:
tree = etree.parse("../data/data.xml")
# The [metier='Veterinaire'] predicate keeps only the users with that job
for user in tree.xpath("/users/user[metier='Veterinaire']/nom"):
    print(user.text)
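
To see what such a filter does without depending on data.xml, here is a minimal sketch on hypothetical in-memory data (the users and ids are invented). It relies on the XPath subset of the standard library's `xml.etree.ElementTree` instead of lxml, so paths are written relative to the root element (`./user/...`) rather than absolute (`/users/user/...`).

```python
import xml.etree.ElementTree as ET

# Hypothetical data using the same tag names as the queries above.
root = ET.fromstring(
    b"<users>"
    b"<user data-id='101'><nom>Zorro</nom><metier>Danseur</metier></user>"
    b"<user data-id='102'><nom>Hulk</nom><metier>Veterinaire</metier></user>"
    b"<user data-id='103'><nom>Zidane</nom><metier>Veterinaire</metier></user>"
    b"</users>"
)

# [metier='Veterinaire'] keeps only the <user> elements whose <metier> child has that text.
vets = [nom.text for nom in root.findall("./user[metier='Veterinaire']/nom")]
print(vets)

# [@data-id='102'] filters on an attribute instead of a child tag.
by_id = root.findall("./user[@data-id='102']")
print([user.findtext("nom") for user in by_id])
```

Both kinds of predicate can be used with lxml's `xpath` method as well, since lxml implements full XPath 1.0.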

# Web scraping

We saw earlier how to parse XML. It is also possible to parse HTML, and in my opinion the tool that does the job best is the BeautifulSoup library.

Save a web page that you like (for example becode.org) in the data directory, and display the content of the saved xxx.html file.

Put the content of this page in a variable, for example html_doc.


In [None]:
# Read the saved page into html_doc; the file is closed automatically
with open("../data/becode.html", "r", encoding="utf-8") as file:
    html_doc = file.read()
html_doc

In [1]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")
# Looking at the HTML of the saved page (becode.org), we can see that the main title is in the h1 tag

for h1 in soup.find_all('h1'):
    # We only retrieve the content ==> .text
    print(h1.text)


Do the same with H2 tags

And now, do the same with the "p" tags
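
As a hedged illustration of the same loop for other tags, the sketch below runs on a small hypothetical HTML string rather than the saved page, so the expected text is known in advance. It uses Python's built-in `"html.parser"` backend instead of `"lxml"` to avoid the extra dependency, and the `texts` helper is an assumption of this example, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for the saved HTML file.
SAMPLE_HTML = """
<html><body>
  <h1>Main title</h1>
  <h2>First subtitle</h2>
  <p>First paragraph.</p>
  <h2>Second subtitle</h2>
  <p>Second <b>paragraph</b>.</p>
</body></html>
"""


def texts(soup, tag_name):
    """Return the text of every tag called tag_name, without surrounding whitespace."""
    return [tag.get_text().strip() for tag in soup.find_all(tag_name)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for tag_name in ("h1", "h2", "p"):
    print(tag_name, texts(soup, tag_name))
```

Note that the text of a tag includes the text of its nested tags, so the `<b>` inside the second paragraph does not break the sentence.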

### Scraping via HTTP requests

HTTP is a protocol that allows the client (you, through your browser for example) to communicate with a server connected to the network (the HTTP server software running on a site's machine, for example Apache).

Messages always go in pairs: the request (from the client) and the response (from the server).
If no response comes back, a problem has occurred somewhere in the network.

The syntax of the request (= client request) is always the same:
- Command line (Command, URL, Protocol version)
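
To make the command line concrete, here is a minimal sketch that assembles the text of a request by hand and splits the first line of a response into its parts. The helper names `build_request` and `parse_status_line` are hypothetical, and the host and path simply reuse the becode.org example; in practice a library builds these lines for you.

```python
def build_request(method, path, host, version="HTTP/1.1"):
    """Assemble a minimal request: command line, Host header, then a blank line."""
    request_line = f"{method} {path} {version}"
    # HTTP/1.1 requires a Host header; the empty line marks the end of the headers.
    return f"{request_line}\r\nHost: {host}\r\n\r\n"


def parse_status_line(status_line):
    """Split the first line of a response into (version, status code, reason)."""
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


raw = build_request("GET", "/about/", "www.becode.org")
print(raw.splitlines()[0])                   # GET /about/ HTTP/1.1
print(parse_status_line("HTTP/1.1 200 OK"))  # ('HTTP/1.1', 200, 'OK')
```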

Command is the method to use. It specifies the type of request and can take the following values:

- GET: the most common way to request a resource. A GET request must not modify the resource, so it can be repeated without side effects.
- HEAD: asks only for information about the resource (the response headers), without the resource itself.
- POST: must be used when a request modifies the resource, typically by submitting data to the server.
- OPTIONS: returns the communication options of a resource or of the server in general.
- CONNECT: uses a proxy as a communication tunnel.
- TRACE: asks the server to return what it has received, in order to test and diagnose the connection.
- PUT: adds a resource to the server, or replaces it if it already exists.
- DELETE: deletes a resource from the server.

I will only discuss the most common ones here: HEAD, GET and POST.
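
The difference between these three methods can be observed without touching a real website. The sketch below starts a throwaway server on the local machine and sends it one request of each kind, using only the standard library (`http.server` and `http.client`) rather than the requests package. The `DemoHandler` class, its echo behaviour for POST, and the `call` helper are all assumptions made for this illustration.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers HEAD, GET and POST."""

    body = b"hello"

    def _send_headers(self, length):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(length))
        self.end_headers()

    def do_HEAD(self):
        # Same headers as GET, but no body is written.
        self._send_headers(len(self.body))

    def do_GET(self):
        self._send_headers(len(self.body))
        self.wfile.write(self.body)

    def do_POST(self):
        # Read the data sent by the client and echo it back.
        size = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(size)
        self._send_headers(len(data))
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def call(method, port, body=None):
    """Send one request to the local server and return (status code, body)."""
    conn = HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, "/", body=body)
    resp = conn.getresponse()
    result = (resp.status, resp.read())
    conn.close()
    return result


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

results = {
    "HEAD": call("HEAD", server.server_port),
    "GET": call("GET", server.server_port),
    "POST": call("POST", server.server_port, body=b"ping"),
}
server.shutdown()
server.server_close()
print(results)
```

HEAD returns the status and headers with an empty body, GET returns the resource itself, and POST carries data from the client to the server.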

### Putting it into practice

In [2]:
import requests
from bs4 import BeautifulSoup

# URL of the website
url = 'https://www.becode.org/about/'
# Send an HTTP GET request to the server identified by the URL
r = requests.get(url)
# Display the requested URL and the status code returned by the server
print(url, r.status_code)
# Parse the HTML of the response with BeautifulSoup and keep the result in the soup variable
soup = BeautifulSoup(r.content, 'lxml')
soup

https://www.becode.org/about/ 200



We have thus retrieved the information from the site without physically saving it in a file, only in a variable!

To convince yourself, display the main title, the subtitles, and the paragraphs and their descriptions again.