<small><i>August 2014 - This notebook was created by [Oriol Pujol Vila](http://www.maia.ub.es/~oriol). Source and license info are in the folder.</i></small>

# Data hunting and gathering

<img style = "border-radius:20px;" src = "http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg">

SOFTWARE REQUIREMENTS
    
    + mongoDB 
    + lxml #pip install lxml
    + cssselect #pip install cssselect
    + pyquery #pip install pyquery
    + selenium #pip install selenium
    + pymongo #pip install pymongo
    
NOT REQUIRED BUT USED FOR SOME EXAMPLES: 

    + VLC

CONTENTS

+ Introduction and warm-up project: A web crawler
 
     + Introduction to MongoDB and PyMongo
     + A whirlwind tour into regular expressions
    
+ Using the API

    + Retrieving Twitter data

+ Creating our own web API: Scraping

    + Understanding HTML and CSS
    + CSS selectors
    + XPath selectors
    + Scraping dynamic content with Selenium    
   

# 1. Introduction

Data is the basis of this course. We usually find it in well-structured formats, such as a spreadsheet resulting from our last experiment or the collection of company records in a classical relational database. With the advent of the internet, however, new information sources have to be taken into account, and these sources are home to unstructured data. This lecture presents several methods for retrieving and storing such data.

Let us first introduce the big picture guiding this lecture. Whenever we want to retrieve data from a web site, we should first ask whether the site provides a simple way to do so. Many large sites such as Google, Facebook, Twitter, etc., provide an **Application Programming Interface (API)** that can make data hunting easier. However, most web sites do not have such an interface, and even when they do, the API may not provide the desired information. In those cases we have to use **scraping** techniques. This means dealing with the raw information as it is delivered to the web browser and coding our own data finding methods.  

<img style="border-radius:20px;" src="./files/big_picture.jpg">

Let us start by connecting to the net and checking how to retrieve a basic page. We will start with the `urllib.request` module.

In [1]:
from urllib.request import urlopen
source = urlopen("http://www.google.com/")
print(source)



<http.client.HTTPResponse object at 0x7f71f8032dd8>


In [3]:
#Let us check what is in
source

<http.client.HTTPResponse at 0x7f71f8032dd8>

In [4]:
#Hurray, we got a file-like response object (with a socket underneath). Like any file, we can read() it
something = source.read().decode('latin-1')

In [5]:
#Check on something
print(something)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="es"><head><meta content="Google.es permite acceder a la información mundial en castellano, catalán, gallego, euskara e inglés." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="jBBjMw5yWe8dNPI6XeutIA==">(function(){window.google={kEI:'syyvXaC_ArCOlwSH-ICoDw',kEXPI:'0,1353747,5662,731,223,510,1065,3152,378,206,1017,53,2134,10,299,414,271,67,48,48,94,25,324,274,35,579,65,16,50,99,317,4,1130631,1197706,329561,1294,12383,4855,32691,15248,867,28684,369,3314,5505,2442,5942,1119,2,579,727,2432,1361,4323,4967,774,2251,2820,1923,3122,6192,1719,1808,1976,2044,8909,1900,3397,2016,38,920,873,1217,2975,2736,3061,2,631,3240,8066,2884,20,317,1119,904,101,2024,1,369,2777,919,992,509,776,8,2796,813,72,82,601,11,14,667,612,2212,

In [6]:
#What!!!!
#Let us read more
print(source.read())



b''


In [None]:
#Ooooppss nothing else.
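
The empty result is expected rather than an error: the response object is a stream, and `read()` consumes it. A small offline sketch of the same behaviour, using `io.BytesIO` as a stand-in for the HTTP response (the byte string is made up for illustration):

```python
import io

# io.BytesIO is a file-like object, used here as a stand-in for the HTTP response
fake_response = io.BytesIO(b"<!doctype html><html></html>")

first = fake_response.read()    # consumes the whole stream
second = fake_response.read()   # nothing is left to read

print(first)
print(second)
```

To reuse the content, keep the result of the first `read()` in a variable, as done with `something` above.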


Ok, hands on!!! Some first warm up exercises:

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**WARM UP EXERCISES**
<ol>
<li>Is there the word python in pyladies.org?</li>
<li>Does http://google.com contain an image? (hint: < img > TAG )</li>
<li>What are the first ten characters of python.org?</li>
</ol>
</div>

In [31]:
#Write your code here
from urllib.request import urlopen

pylad = urlopen("http://www.python.org").read().decode('latin-1')

'python' in pylad

print(pylad[0:10])
#print (pylad.split())

<!doctype 


We are retrieving data from a URL! So we are done! 

# Crawling and Scraping

Scraping and **crawling** are two closely related techniques. While scraping is used for retrieving data from a web page, crawling is used to retrieve the web pages themselves. Both are found at the core of search engines. Scraping is used to get keywords, analyze, and extract useful information from the web pages, so that related results can be returned for a user query. Crawling, on the other hand, retrieves the actual pages and uses scraping to get the links in each of them. This makes it possible to build a graph of the connections among web sites, and this information can be used to order the results of a query.

In general, we might want to get data not only from a single page but from several related pages. In those cases crawling is the way to go. 
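
As a rough illustration of the crawling idea, the sketch below walks a made-up in-memory "web" (a dictionary mapping each page to the links it contains) breadth first. The names `toy_web` and `crawl` are hypothetical, and no network access is involved; a real crawler would download each page and scrape its links instead of looking them up.

```python
from collections import deque

# A made-up "web": each page maps to the links it contains
toy_web = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post1", "/about"],
    "/blog/post1": ["/blog"],
}

def crawl(start, max_pages=10):
    """Breadth-first traversal returning the pages in visiting order."""
    to_crawl = deque([start])
    visited = []
    while to_crawl and len(visited) < max_pages:
        page = to_crawl.popleft()
        if page in visited:
            continue
        visited.append(page)
        # "Scrape" the links of the page and queue the ones not seen yet
        for link in toy_web.get(page, []):
            if link not in visited and link not in to_crawl:
                to_crawl.append(link)
    return visited

order = crawl("/")
print(order)
```

The two lists, pages still to crawl and pages already visited, are the core state of any crawler, however simple.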

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**WARM-UP PROJECT:** Let us build a very simple spider. The basic functionality of a spider is to crawl and store all the data in web pages. In this simple project we will take care of a single site. 

<ol>
<li>A crawler must recognize the links to crawl. Take a minute and think how to retrieve the links of a web site.</li>
<li>Let us start the project by creating a Spider class. The constructor will have the following parameters: starting_url, crawl_domain, and max_iter. crawl_domain is the domain used to decide whether an absolute link will be considered or not. max_iter is the maximum number of web items to crawl.</li>
<li>The main method can be Spider.run(). Enumerate the big functionalities/building blocks of the crawler.</li>
</ol>
</div>
    
    

In [1]:
from urllib.request import urlopen
from urllib.error import HTTPError

import time

def getLinks(html, max_links=10):
    url = []
    cursor = 0
    nlinks=0
    while (cursor>=0 and nlinks<max_links):
        start_link = html.find("a href",cursor)
        if start_link==-1:
            return url
        start_quote = html.find('"', start_link)
        end_quote = html.find('"', start_quote + 1)
        url.append(html[start_quote + 1: end_quote])
        cursor = end_quote+1
        nlinks = nlinks +1
    return url

class Spider:
    def __init__(self,starting_url,crawl_domain,max_iter):
        self.crawl_domain = crawl_domain
        self.max_iter = max_iter
        self.links_to_crawl=[]
        self.links_to_crawl.append(starting_url)
        self.links_visited=[]
        self.collection=[]
        
    def retrieveHtml(self):
        try:
            socket = urlopen(self.url)
            self.html = socket.read().decode('latin-1')
            return 0
        except HTTPError:
            # Most probably a 404 (URL not found), possibly due to malformed links built in retrieveAndValidateLinks
            return -1
             
    def run(self):
        while (len(self.links_to_crawl)>0 and len(self.collection)<self.max_iter):
            self.url = self.links_to_crawl.pop(0)
            print (self.links_to_crawl)
            self.links_visited.append(self.url)
            if self.retrieveHtml()>=0:
                self.storeHtml()
                self.retrieveAndValidateLinks()
    
    def retrieveAndValidateLinks(self):
        tmpList=[]
        items = getLinks(self.html)
        # Check the validity of a link
        for item in items:
            item = item.strip('"')
            if self.crawl_domain in item:
                tmpList.append(item)
            if ":" not in item: # Relative link: no scheme such as http://, https:// or mailto:
                tmpList.append(self.crawl_domain+item)
        # Check that the link has not been previously retrieved or is currently on the links_to_crawl list
        for item in tmpList:
            if item not in self.links_visited:
                if item not in self.links_to_crawl:
                    self.links_to_crawl.append(item)
                    print ('Adding: '+item)
                
    def storeHtml(self):
        doc = {}
        doc['url'] = self.url
        doc['date'] = time.strftime("%d/%m/%Y")
        doc['html'] = self.html
        self.collection.append(doc)
       


Let us validate the crawler with the following code: 

In [2]:
spider = Spider('http://www.ub.edu/datascience/postgraduate/','http://www.ub.edu/datascience/postgraduate/',20)
spider.run()

[]
Adding: http://www.ub.edu/datascience/postgraduate/index.html
Adding: http://www.ub.edu/datascience/postgraduate/whatisdatascience.html
Adding: http://www.ub.edu/datascience/postgraduate/content.html
Adding: http://www.ub.edu/datascience/postgraduate/apply.html
Adding: http://www.ub.edu/datascience/postgraduate/faq.html
Adding: http://www.ub.edu/datascience/postgraduate/about.html
['http://www.ub.edu/datascience/postgraduate/whatisdatascience.html', 'http://www.ub.edu/datascience/postgraduate/content.html', 'http://www.ub.edu/datascience/postgraduate/apply.html', 'http://www.ub.edu/datascience/postgraduate/faq.html', 'http://www.ub.edu/datascience/postgraduate/about.html']
['http://www.ub.edu/datascience/postgraduate/content.html', 'http://www.ub.edu/datascience/postgraduate/apply.html', 'http://www.ub.edu/datascience/postgraduate/faq.html', 'http://www.ub.edu/datascience/postgraduate/about.html']
Adding: http://www.ub.edu/datascience/postgraduate/academics.html
Adding: http://www.u

In [11]:
#How many elements does our collection have?
len(spider.collection)


10

In [12]:
spider.collection[2]

{'url': 'http://hunch.net/?p=12224166',
 'date': '22/10/2019',
 'html': '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">\n\n<head profile="http://gmpg.org/xfn/11">\n\t<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n\n\t<title>ICML has 3(!) Real World Reinforcement Learning Workshops &laquo;  Machine Learning (Theory)</title>\n\n\t<style type="text/css" media="screen">\n\t\t@import url( http://hunch.net/wp-content/themes/classic_modified/style.css );\n\t</style>\n\n\t<link rel="alternate" type="application/rss+xml" title="RSS 2.0" href="http://hunch.net/?feed=rss2" />\n\t<link rel="alternate" type="text/xml" title="RSS .92" href="http://hunch.net/?feed=rss" />\n\t<link rel="alternate" type="application/atom+xml" title="Atom 1.0" href="http://hunch.net/?feed=atom" />\n\n\t<link rel="pingback" href="http://hunch.net/xmlrpc.php" />\n\t\t

In [13]:
#Enumerate the urls retrieved
[spider.collection[i]['url'] for i in range(len(spider.collection))]

['http://hunch.net',
 'http://hunch.net/',
 'http://hunch.net/?p=12224166',
 'http://hunch.net/?cat=11',
 'http://hunch.net/?cat=41',
 'http://hunch.net/~jl',
 'http://hunch.net/~mltf',
 'http://hunch.net/~rwil',
 'http://hunch.net/~jl/projects/prediction_bounds/tutorial/langford05a.pdf',
 'http://hunch.net/~jl/interact.pdf']

Let us go for a more complex web site. Run the code on http://hunch.net (a machine learning blog by John Langford).

In [14]:
spider = Spider('http://hunch.net','http://hunch.net/',10)
spider.run()

[]
Adding: http://hunch.net/
Adding: http://hunch.net/?p=12224166
Adding: http://hunch.net/?cat=11
Adding: http://hunch.net/?cat=41
Adding: http://hunch.net/~jl
['http://hunch.net/?p=12224166', 'http://hunch.net/?cat=11', 'http://hunch.net/?cat=41', 'http://hunch.net/~jl']
['http://hunch.net/?cat=11', 'http://hunch.net/?cat=41', 'http://hunch.net/~jl']
['http://hunch.net/?cat=41', 'http://hunch.net/~jl']
['http://hunch.net/~jl']
[]
Adding: http://hunch.net/public_key
Adding: http://hunch.net/~mltf
Adding: http://hunch.net/~rwil
Adding: http://hunch.net/projects/projects.html
Adding: http://hunch.net/conferences/conferences.html
Adding: http://hunch.net/classes/classes.html
Adding: http://hunch.net/resume/resume.html
['http://hunch.net/~mltf', 'http://hunch.net/~rwil', 'http://hunch.net/projects/projects.html', 'http://hunch.net/conferences/conferences.html', 'http://hunch.net/classes/classes.html', 'http://hunch.net/resume/resume.html']
['http://hunch.net/~rwil', 'http://hunch.net/proj

In [15]:
# And check the urls retrieved
[spider.collection[i]['url'] for i in range(len(spider.collection))]

['http://hunch.net',
 'http://hunch.net/',
 'http://hunch.net/?p=12224166',
 'http://hunch.net/?cat=11',
 'http://hunch.net/?cat=41',
 'http://hunch.net/~jl',
 'http://hunch.net/~mltf',
 'http://hunch.net/~rwil',
 'http://hunch.net/~jl/projects/prediction_bounds/tutorial/langford05a.pdf',
 'http://hunch.net/~jl/interact.pdf']

It seems that the simple crawler more or less works as expected. There are still many functionalities to work on, such as valid domains, valid URLs, etc. One important issue to consider is **persistence**, or how to store the retrieved data for further analysis. In this basic scraping tutorial we use MongoDB, a NoSQL database, for persistence purposes. 
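
Regarding valid domains and valid URLs, one possible refinement is to let the standard library resolve and inspect the links instead of concatenating strings. The sketch below is only an illustration: `normalize_link` is a hypothetical helper, not part of the Spider class above, and it assumes we want to stay on a single host.

```python
from urllib.parse import urljoin, urlparse

def normalize_link(page_url, href, allowed_host):
    """Return an absolute http(s) URL on allowed_host, or None to skip the link."""
    absolute = urljoin(page_url, href)          # resolves relative links
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):   # drops mailto:, javascript:, ...
        return None
    if parts.netloc != allowed_host:            # stays inside one site
        return None
    return absolute

page = "http://hunch.net/~jl/index.html"
print(normalize_link(page, "projects/projects.html", "hunch.net"))
print(normalize_link(page, "/?cat=11", "hunch.net"))
print(normalize_link(page, "mailto:someone@example.com", "hunch.net"))
print(normalize_link(page, "http://www.ub.edu/", "hunch.net"))
```

Relative links are resolved against the page where they were found, which the simple `crawl_domain+item` concatenation cannot do for pages that are not at the root of the site.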

## 1.1 Introduction to MongoDB
<small>This introduction is partially inspired by the notes in Alberto Negron's [blog](http://altons.github.io/python/2013/01/21/gentle-introduction-to-mongodb-using-pymongo/)</small>

MongoDB is a document-oriented database, part of the NoSQL family of database systems. MongoDB stores structured data as JSON-like structures. From a Pythonic point of view, it is like storing dictionary data structures. One of its main features is that it is schema-less, i.e. it supports dynamic schemas. A schema in a relational database informally refers to the structure of the data it stores, i.e. what kind of data, which tables, which relations, etc.

Let us change the Spider class to support MongoDB persistence.

First of all let us configure the MongoDB system.

+ Download mongoDB.
+ Rename the folder to mongodb.
+ Add the directories data and log to your working project directory.
+ Check that the server works 
        mongod --dbpath . --nojournal &
+ Check the connection to the server: in another terminal write mongo, check that it does not raise any error and exit the console.
+ Close the mongo daemon (mongod). You may have to kill mongod with kill -9 and remove the lock on the daemon, mongod.lock.
+ Let us configure the database a little by setting the paths of the data storage and log files. Create a [mongo.conf](./mongodb/data/mongo.conf) file such as the one provided and start the server using the following command:

        mongod --config=./mongodb/data/mongo.conf --nojournal &
+ Bonus: we can check the database status using  http://127.0.0.1:27017/

### Connect to a MongoDB database

In [18]:
import pymongo

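# Connection to Mongo DB
try:
    conn=pymongo.MongoClient()
    # MongoClient connects lazily, so run a cheap command to check that the server is reachable
    conn.admin.command('ping')
    print ("Connected successfully!!!")
except pymongo.errors.ConnectionFailure as e:
    print ("Could not connect to MongoDB: %s" % e )
conn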


Connected successfully!!!


MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)

In [22]:
import pymongo
conn = pymongo.MongoClient()

We can **create** a database using attribute access <span style = "font-family:Courier;"> db = conn.name_db</span> or dictionary access <span style = "font-family:Courier;"> db = conn[name_db]</span>.

In [23]:
#Create a database using db = conn.name_db or dictionary access db = conn['name_db']
db = conn['datascienceUB_Octubre_2016']
print (db)
conn.list_database_names() #database_names() is deprecated since PyMongo 3.6 and removed in 4.0
#Empty databases do not show

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'datascienceUB_Octubre_2016')



A database stores **collections**. A collection is a group of documents stored in MongoDB, and can be thought of as the equivalent of a table in a relational database. Getting a collection in PyMongo works the same as getting a database:

In [24]:
collection = db['Hola']
db.list_collection_names()
#Empty collections do not show

  



In [25]:
#The database has a collection, thus ...
conn.list_database_names()

  



MongoDB stores structured data as JSON-like (JavaScript Object Notation) documents, serialized in a binary format called BSON, and uses dynamic schemas rather than predefined ones. An element of data is called a document, and documents are stored in collections. One collection may have any number of documents.

Compared to relational databases, we could say collections are like tables, and documents are like records. But there is one big difference: every record in a table has the same fields (with, usually, differing values) in the same order, while each document in a collection can have completely different fields from the other documents.

All you really need to know when you're using Python, however, is that documents are Python dictionaries that can have strings as keys and can contain various primitive types (int, float, str, datetime) as well as other documents (Python dicts) and arrays (Python lists).
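
To make the idea concrete without a running server, the sketch below models a collection as a plain Python list of dictionaries. The field names and values are made up for illustration; only the `url` field is shared by both documents.

```python
import json

# Two documents that could live in the same collection: they share "url"
# but otherwise have different fields (values are made up for illustration)
page_doc = {"url": "http://www.ub.edu/", "date": "22/10/2019", "tags": ["university", "home"]}
note_doc = {"url": "http://hunch.net/", "comment": {"author": "me", "stars": 5}}

toy_collection = [page_doc, note_doc]

# Every field that appears in at least one document, and the fields common to both
all_fields = set().union(*(doc.keys() for doc in toy_collection))
shared_fields = set(page_doc) & set(note_doc)
print(sorted(all_fields))
print(sorted(shared_fields))

# A document maps directly to JSON-like text, including nested documents
print(json.dumps(note_doc, sort_keys=True))
```

A relational table would force both rows to have the same columns; here each document simply carries the fields it needs.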

To insert some data into MongoDB, all we need to do is create a dict and call .insert_one() on the collection object. Let us exemplify this process by downloading a URL and storing it in the collection.

In [None]:
from urllib.request import urlopen
import time
# dd/mm/yyyy format
print (time.strftime("%d/%m/%Y"))
url = 'http://www.ub.edu/datascience/postgraduate/'
html = urlopen(url).read().decode('latin-1')

#Create a dictionary/document to store
doc = {}
doc['url'] = url
doc['date'] = time.strftime("%d/%m/%Y")
doc['html'] = html
doc['adios'] = 'esto es otra prueba'

In [None]:
doc

In [None]:
#insert the document in the collection
collection.insert_one(doc)

In [None]:
#Check that we have a non empty collection.
db.list_collection_names()

To recap, we have databases containing collections. A collection is made up of documents. Each document is made up of fields.

### Retrieving data

In [None]:
collection.find_one() #Returns a single document (the first one found), or None if the collection is empty

To get more than a single document as the result of a query we use the find() method. find() returns a Cursor instance, which allows us to iterate over all matching documents.


In [None]:
collection.find()

In [None]:
for d in collection.find():
    try:
        print (d['adios'])
    except KeyError:
        pass

If we just want to know how many documents match a query we can perform a count operation instead of a full query. With an empty filter we get a count of all of the documents in a collection:

In [None]:
collection.count_documents({}) #count() is deprecated since PyMongo 3.7 and removed in 4.0

### Basic queries

Querying in pymongo uses .find() 

In [None]:
for i in collection.find({"adios":"esto es otra prueba"}):
    print (i)

Observe that it finds exact matches. Query operators, written with a leading dollar sign (e.g. `$gt`), include *gt* (greater than), *gte* (greater than or equal), *lt* (less than), *lte* (less than or equal), *ne* (not equal), *nin* (not in a list), *regex* (regular expression), *exists*, *not*, *or*, *and*, etc. Let us see some examples:

In [None]:
collection.count_documents({"date":{"$gte":"01/01/2014"}})
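
One caveat about the query above: the dates are stored as `dd/mm/yyyy` strings, so `$gte` compares them character by character rather than chronologically. The sketch below shows the problem in plain Python, together with one possible alternative, namely working with `datetime` objects (which can also be stored as field values). The sample dates are made up.

```python
from datetime import datetime

older = "31/12/2013"
newer = "01/01/2014"

# String comparison is lexicographic: "3" > "0", so the older date looks "greater"
print(older >= newer)

# Parsing into datetime objects restores chronological order
fmt = "%d/%m/%Y"
older_dt = datetime.strptime(older, fmt)
newer_dt = datetime.strptime(newer, fmt)
print(older_dt >= newer_dt)
```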

In [None]:
substring = "datascience"
reg = substring
collection.count_documents({"html":{"$regex":reg}})

In [None]:
for item in collection.find({"html":{"$regex":"datascience"}}):
    print (item['html'])

### A short practical introduction to RegEx

We have come across RegEx a couple of times now, so we had better give a short, pragmatic introduction to it.

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
<ul>
    <li> Regular expressions are pattern matching rules. In essence, everything is a character, and a regular expression is a set of rules describing the character patterns to seek.</li>
    <li> If we provide a raw set of characters it will look for exact matches, e.g. 'aBc1' </li>
    <li> There are wildcards: . (matches any character except a new line), *\d* (matches any digit), *\w* (matches any alphanumeric character or underscore), *\s* (any whitespace such as space, new line, carriage return, tab, etc.)</li>
    <li>An interesting metacharacter is *\b*, which stands for boundary and matches the boundary between a word character and a non-word character.</li>
</ul>
</div>

In [None]:
import re
#Match the three strings
ex1 = 'abc abcde Abcdefg bcde'

pattern = re.compile('[aA]bc')
print(re.findall(pattern,ex1))

ex2 = 'abc123xyz define123 var g = 123'
pattern = re.compile('123')
for m in re.finditer(pattern, ex2):
    print ('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
    


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
<ul>
    <li> Capital letters denote the negation of the concept, i.e. *\D* is everything except digits, *\W* everything except alphanumeric chars (e.g. punctuation marks, etc), *\S* any non-whitespace character.</li>
    <li>Repetitions of a character can be handled explicitly using curly brackets, e.g. *a{3}* means three repetitions of character "a"</li>
    <li>An indefinite number of repetitions is given by the star and plus signs (\* and +). Star will match zero or more times and plus will match one or more times, e.g. .* will stand for any amount of any character</li>
</ul>
</div>

In [None]:
#Find all adverbs (words ending in ly)
import re
text = "He was carefully disguised but captured quickly by police."
for m in re.finditer(r"\w+ly", text):
    print ('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
<ul>
     <li>Optional values can be given by the question mark sign. The preceding character will be optional, e.g. cats? stands for cat and cats.</li>
     <li>Another way of checking for specific options is to use square brackets. For example *[abc]* will match only a, b, or c.</li>
     <li>We can negate a set in square brackets *[^abc]*</li>
     <li>We can select ranges, such as *[a-z]*, *[A-Z]* or *[0-9]*</li>
</ul>
</div>
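
A short sketch of the four rules in the box above, run on a made-up string:

```python
import re

text = "cat cats bat rat cot c4t"

# "s?" makes the trailing s optional
print(re.findall(r"\bcats?\b", text))

# [bcr] accepts exactly one of the listed characters
print(re.findall(r"\b[bcr]at\b", text))

# [^a] accepts any single character except "a" (here between c and t)
print(re.findall(r"\bc[^a]t\b", text))

# [0-9] is a range: any single digit
print(re.findall(r"\bc[0-9]t\b", text))
```

Note that the second pattern does not report "cats": the closing `\b` requires a word boundary right after the "t".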

In [None]:
#Find the names of the first two items without extension
#Both start with "file", so we ask for a word boundary followed by "file", then any amount of word characters, and finally the .pdf ending
#The parentheses capture the name without the extension
import re
text = 'file_a_record_file.pdf file_yesterday.pdf testfile_fake.pdf.tmp' 
for m in re.finditer(r"\b(file\w*)\.pdf", text):
    print('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(1)))

In [None]:
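#Trim starting and ending spaces

text = "               Masters of Ba Gua Zhang    "

#The capture must end with a non-whitespace character (\S); otherwise the greedy .+ would also swallow the trailing spaces
for m in re.finditer(r"\s*(.*\S)\s*$", text):
    print ('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(1))) #Note that we use group(1); group(0) is the complete match, including the spaces outside the capture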

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
<ul>
     <li>Another interesting feature is capturing. In parentheses we can define the group or set of data we want to return. In Python we can access these data by indexing the match: index 0 is the complete match, index 1 the first capture, index 2 the next (possibly nested) capture or group, etc.</li>
</ul>
</div>
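
A minimal sketch of how captures are numbered, using a made-up string. Groups are counted by the position of their opening parenthesis, so nested groups come right after the group that contains them.

```python
import re

text = "2014-08-15 report.pdf"

m = re.search(r"((\d{4})-(\d{2}))-(\d{2})", text)

print(m.group(0))   # the complete match
print(m.group(1))   # first opening parenthesis: year and month
print(m.group(2))   # nested inside group 1: the year
print(m.group(3))   # nested inside group 1: the month
print(m.group(4))   # the day
print(m.groups())   # all captures as a tuple
```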

Check what happens if we replace index 1 with index 0 in the previous example.

In [None]:
#Match the numbers and skip the last item
text = '3.1452 -255.34 128 1.9e10 123,34.00 720p'

for m in re.finditer(r"-?\d+[\.,]?\d*[\.e]?\d*\b", text):
    print ('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))

In [None]:
#Find the links in a html page
from urllib.request import urlopen
url = 'http://www.ub.edu/'
html = urlopen(url).read().decode()


for m in re.finditer(r"href=\"(\S+)\"", html):
    print ('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(1)))

Regular expressions are usually not the answer for HTML, due to the fragility of the pages found on the internet today: common mistakes like missing end tags, mismatched tags, or forgetting to close an attribute quote would all derail a perfectly good regular expression.
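
As a point of comparison, the sketch below extracts links with the tolerant parser from the standard library (`html.parser`) instead of a regular expression. The class name `LinkCollector` and the HTML snippet are made up for illustration; this is one possible approach, not the method used later in the lecture.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Deliberately sloppy HTML: unquoted attribute, single quotes, missing end tags
messy_html = """
<p>See <a href=/about.html>about
<a class="x" href='http://hunch.net/'>blog</a>
<link rel="stylesheet" href="style.css">
"""

collector = LinkCollector()
collector.feed(messy_html)
print(collector.links)
```

The pattern `href=\"(\S+)\"` used above would miss both anchors here (neither uses double quotes) and would report the stylesheet, which is not a link to crawl.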

### Update

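In this section, several methods for updating and deleting documents are reviewed:

+ Update. Called with a plain document rather than an update operator (`replace_one()` in current PyMongo), this operation finds the document defined by the query and **replaces** it with the new document. 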

In [None]:
from urllib.request import urlopen
import time
# dd/mm/yyyy format
print (time.strftime("%d/%m/%Y"))
url = 'http://www.ub.edu/datascience'
html = urlopen(url).read().decode('latin-1')

#Create a dictionary/document to store
doc = {}
doc['url'] = url
doc['date'] = time.strftime("%d/%m/%Y")
doc['html'] = html
doc['foo'] = 'foo'

import pymongo
conn = pymongo.MongoClient()
conn.list_database_names()


In [None]:
db = conn.test
collection = db.coltest
db.list_collection_names()

In [None]:
collection.insert_one(doc)

In [None]:
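#replace_one() swaps the whole matched document for the new one (update_one() with $set would only change one field)
collection.replace_one({'url':{'$regex':'datascience'}},{'html':'mola'},upsert=True)

#Print the document now stored
print(collection.find_one())

We lost the whole document: it was replaced by the new instance, which only has the 'html' key (besides the automatically generated '_id').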

We can modify the behavior of the update command by adding a sub-command. Let us check some of them:

+ Sub-command **Set**:

This statement updates the documents in the collection that match the query by replacing the value of the given field with the new value. The operator adds the specified field or fields if they do not exist in the document, and replaces the existing value of the specified field(s) if they already exist.

An upsert eliminates the need to perform a separate database call to check for the existence of a record before performing either an update or an insert operation. Typically update operations update existing documents, but in MongoDB the update operations can accept an upsert option as an argument. Upserts are a hybrid operation that uses the query argument to determine the write operation:

If the query matches an existing document(s), the upsert performs an update.
If the query matches no document in the collection, the upsert inserts a single document.
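
The two cases can be mimicked with a toy in-memory model. The function `toy_update_one` below is hypothetical and only imitates the semantics described above on a list of dicts with exact-match queries; it is not PyMongo code and it ignores details such as the generated `_id`.

```python
def toy_update_one(collection, query, new_values, upsert=False):
    """Toy model of an update with $set on a list of dicts (exact-match queries only)."""
    for doc in collection:
        if all(doc.get(k) == v for k, v in query.items()):
            doc.update(new_values)          # match found: update in place
            return "updated"
    if upsert:
        # no match: insert a document built from the query plus the new values
        collection.append({**query, **new_values})
        return "inserted"
    return "no match"

docs = [{"url": "a", "html": "mola"}]

print(toy_update_one(docs, {"html": "mola"}, {"html": "new data"}))
print(toy_update_one(docs, {"url": "b"}, {"html": "modify"}))
print(toy_update_one(docs, {"url": "b"}, {"html": "modify"}, upsert=True))
print(docs)
```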

In [None]:
collection.update_one({"html":"mola"},{"$set":{"html":"new data"}})

In [None]:
for doc in collection.find():
    print (doc)

In [None]:
collection.update_many({"html":"new date"},{"$set":{"html":"modify"}})

In [None]:
for doc in collection.find():
    print (doc)

Nothing changes since there is no match for the query. Let us use upsert.

In [None]:
collection.update_many({"html":"new date"},{"$set":{"html":"modify"}}, upsert=True)

In [None]:
for doc in collection.find():
    print (doc)

+ Sub-command **Unset**:

The unset operator deletes a particular field. If documents match the initial query but do not have the field specified in the unset operation, then the statement has no effect on them.

In [None]:
collection.update_many({"html":"modify"},{"$unset":{"html":""}})

In [None]:
for doc in collection.find():
    print (doc)

There are several update commands that work with lists, making it possible to remove the first or last element (pop), remove all the elements that match a condition (pull) or insert an element (push).

By the way, the dollar character placed before a command name (as in `$set`) marks it as an operator. Used on its own inside a field path, the dollar sign is the positional operator: it identifies the element in an array field to update without explicitly specifying the position of the element in the array.  

### Delete operations

We can remove elements by simply:

In [None]:
collection.delete_one({"html":"new data"})

In [None]:
for doc in collection.find():
    print (doc)

And remove a collection by:

In [None]:
db.list_collection_names()

In [None]:
db.drop_collection("coltest")
db.list_collection_names()

And remove a database by:

In [None]:
conn.list_database_names()

In [None]:
conn.drop_database('test')
conn.list_database_names()

And finally close the connection with the database.

In [None]:
conn.close()

## 1.2 Finishing the warm up project with MongoDB storage

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

import time
#Import pymongo
import pymongo


def getLinks(html, max_links=10):
    url = []
    cursor = 0
    nlinks=0
    while (cursor>=0 and nlinks<max_links):
        start_link = html.find("a href",cursor)
        if start_link==-1:
            return url
        start_quote = html.find('"', start_link)
        end_quote = html.find('"', start_quote + 1)
        url.append(html[start_quote + 1: end_quote])
        cursor = end_quote+1
        nlinks = nlinks +1
    return url

class Spider:
    def __init__(self,starting_url,crawl_domain,max_iter):
        self.crawl_domain = crawl_domain
        self.max_iter = max_iter
        self.links_to_crawl=[]
        self.links_to_crawl.append(starting_url)
        self.links_visited=[]
        self.collection=[]
        # Create the connection to MongoDB
        try:
            self.conn=pymongo.MongoClient()
            print ("Connection to Mongo Daemon successful!!!")
        except pymongo.errors.ConnectionFailure as e:
            print ("Could not connect to MongoDB: %s" % e )
        self.db = self.conn['crawlerDB'] # Use the connection created above, not a global conn
        self.collection = self.db[starting_url+'DB']

        
    def retrieveHtml(self):
        try:
            socket = urlopen(self.url)
            self.html = socket.read().decode('latin-1')
            return 0
        except HTTPError:
            # Most probably a 404 (URL not found), possibly due to malformed links built in retrieveAndValidateLinks
            return -1
             
    def run(self):
        #Change the count on the collection
        while (len(self.links_to_crawl)>0 and self.collection.count_documents({})<self.max_iter):
            self.url = self.links_to_crawl.pop(0)
            print (self.links_to_crawl)
            self.links_visited.append(self.url)
            if self.retrieveHtml()>=0:
                self.storeHtml()
                self.retrieveAndValidateLinks()
        self.conn.close()
    
    def retrieveAndValidateLinks(self):
        tmpList=[]
        items = getLinks(self.html,max_links=50)
        # Check the validity of a link
        for item in items:
            item = item.strip('"')
            if '.pdf' not in item:
                if self.crawl_domain in item:
                    tmpList.append(item)
                else:
                    if ":" not in item: # Relative link: no scheme such as http://, https:// or mailto:
                        tmpList.append(self.crawl_domain+item)
        # Check that the link has not been previously retrieved or is currently on the links_to_crawl list
        for item in tmpList:
            if item not in self.links_visited:
                if item not in self.links_to_crawl:
                    self.links_to_crawl.append(item)
                    print ('Adding: '+item)
                
    def storeHtml(self):
        doc = {}
        doc['url'] = self.url
        doc['date'] = time.strftime("%d/%m/%Y")
        doc['html'] = self.html
        #Insert in the collection
        self.collection.insert_one(doc)

In [None]:
spider = Spider('http://hunch.net','http://hunch.net/',20)
spider.run()


In [None]:
conn = pymongo.MongoClient()


In [None]:
print (conn.list_database_names())
db = conn['crawlerDB']

In [None]:
print (db.list_collection_names())
#db.drop_collection("http://hunch.netDB") #Uncomment to delete the stored pages, e.g. before re-running the crawler

In [None]:
collection = db['http://hunch.netDB']
collection.count_documents({})

In [None]:
for doc in collection.find():
    print (doc['url'])
    print (doc['date'])

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">

**PROS and CONS:**
<p>
**MongoDB** querying is powerful but based on basic string operations. This tells us that storing full HTML pages is not going to be efficient for retrieval. In fact, we will see that it is important to break the information into the pieces we really want. However, storing the full pages is a good starting point before post-processing if we are not sure what we are going to do with the data, or if further scraping is going to take long. </p>
</div>
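
One way to break the information into pieces before storing it is to extract a few fields from the HTML and save them next to (or instead of) the raw page. The following is a minimal sketch: the helper `make_document` and the toy page are hypothetical, and regular expressions are used only to keep the example short.

```python
import re
import time

def make_document(url, html):
    """Build a small, queryable document from a raw HTML page."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html,
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    links = re.findall(r'href="([^"]+)"', html)
    return {
        'url': url,
        'date': time.strftime("%d/%m/%Y"),
        'title': title,      # a short field is cheap to match exactly
        'links': links,      # outgoing links as a list
    }

# Toy page, made up for the example
page = ('<html><head><title> Toy page </title></head>'
        '<body><a href="/a">a</a><a href="/b">b</a></body></html>')
doc = make_document('http://example.com/', page)
print(doc['title'], doc['links'])
```

A document like this could be stored with `collection.insert_one(doc)` and later retrieved with a query on a single field, e.g. `collection.find({'title': 'Toy page'})`.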

In the next section we will see more efficient ways of dealing with web based data.

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">

**URLLIB** is good for getting simple things, but in the end you are left with one large HTML string that you still have to process. 
So the next step is to parse the data, and you want to do it in the same way you interact with the web page: you see a menu, a frame on the left side, a nice colorful block showing the price of your flight. So **you want to parse data the way you see it in the web page, so that you can target it**.
</div>

## 2. Using the API

Recall the **big picture**. If we are targeting specific data, we should first check whether the web site has a programmatic interface for querying. If it has one, we can use it.



<img style="border-radius:20px;" src="./files/big_picture.jpg">

## Scraping Twitter data with the API

A standard way to communicate programmatically with a web service is through its API (Application Programming Interface), whenever one is provided. Twitter provides several APIs. The two most important ones are the RESTful API for static queries (e.g. a user's friends and followers, timelines, etc.) and the Streaming API for retrieving live data. The REST API identifies Twitter applications and users using OAuth; responses are available in JSON. The Streaming API also requires OAuth authentication, which is why the streaming examples in this section pass the same `auth` handler.

Ex. 

https://api.twitter.com/oauth/authenticate?oauth_token=XXXXXXXXXXXXXX

https://api.twitter.com/1.1/followers/ids.json?cursor=-1&screen_name=my_user_name&count=5000

Building these queries is not always easy, thus we may use a wrapper around the API. This is what **tweepy** does.
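
To see what such a wrapper does under the hood, the query string of the second example URL can be assembled with the standard library. This is only a sketch: the function name `followers_ids_url` is made up, and a real request would also have to be signed with OAuth, which is not shown here.

```python
from urllib.parse import urlencode

BASE = 'https://api.twitter.com/1.1/followers/ids.json'

def followers_ids_url(screen_name, cursor=-1, count=5000):
    """Assemble the followers/ids query URL (no authentication included)."""
    params = {'cursor': cursor, 'screen_name': screen_name, 'count': count}
    return BASE + '?' + urlencode(params)   # urlencode escapes the values

url = followers_ids_url('my_user_name')
print(url)
```

The printed URL has the same form as the example above; `urlencode` takes care of escaping any special characters in the parameter values.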

Using the API with authentication (needed for the RESTful API)

From wikipedia:

>"Web service APIs that adhere to the architectural constraints are called RESTful. HTTP based RESTful APIs are defined with these aspects:

> <ul><li>base URI (Uniform Resource Identifier), such as http://example.com/resources/</li>
<li>an Internet media type for the data. This is often JSON but can be any other valid Internet media type (e.g. XML, Atom, microformats, images, etc.)</li>
<li>standard HTTP methods (e.g., GET [retrieve], PUT[idempotent update/create], POST[update/create], or DELETE)</li>
<li>hypertext links to reference state</li>
<li>hypertext links to reference related resources"</li>
</ul>

If we want to use the RESTful API in Twitter we have to follow these steps:
<ul>
<li>From your twitter account we want to generate a token: https://apps.twitter.com</li>
<li>Create a new App. This will create the API keys (consumer keys)</li>
<li>Go to API Keys and generate a token. (access keys)</li>
</ul>

In [None]:
import json
import pymongo
import tweepy

consumer_key = ""
consumer_secret = ""

access_key = ""
access_secret = ""

#Authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

#Do something
USER_NAME = "espavilat"
user = api.get_user(id=USER_NAME)

We can access some basic information about the user

In [None]:
user._json

In [None]:
user.id

In [None]:
user.created_at

In [None]:
user.friends_count

In [None]:
user.followers_count

>JSON (JavaScript Object Notation), is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML. JSON is a way to encode complicated information in a platform-independent way.  It could be considered the lingua franca of information exchange on the Internet. 

In [None]:
#We can access the full JSON
user._json['created_at']

We can access all the information as if it were a dictionary structure.
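
As a standalone illustration of this dictionary view of JSON, the standard `json` module converts between JSON text and Python dictionaries. The record below is made up and only mimics the shape of a user object.

```python
import json

# Made-up JSON text with the flavour of a user record
text = '{"screen_name": "some_user", "followers_count": 42, "entities": {"urls": []}}'

record = json.loads(text)             # JSON text -> Python dict
print(record['followers_count'])      # nested values are reached by key
record['checked'] = True              # it behaves like any other dict

# dict -> JSON text, indented for reading
print(json.dumps(record, indent=2, sort_keys=True))
```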

In [None]:
juser = user._json
print (juser['created_at'])

We can apply our basic scraping knowledge and use `urllib` to retrieve more interesting information, such as the profile image.

In [None]:
img_url = juser['profile_image_url']
print (img_url)

In [None]:
from urllib.request import urlopen

# The with block closes the file, so the image is fully written before it is read back
with open('scraped_image','wb') as f:
    f.write(urlopen(img_url).read())

%matplotlib inline
import matplotlib.pyplot as plt
im=plt.imread('scraped_image')
plt.imshow(im,interpolation='nearest')
plt.title(juser['screen_name'],size=16)

Now we want to retrieve the list of follower ids. There are two ways of doing so. Both use the `api.followers_ids` function. The function returns a limited number of ids per call (up to 5000 in API v1.1). If we want to get all of them we may use the pagination variable `cursor`. This can be managed directly in the call `api.followers_ids(id, cursor)` or using a `Cursor` object with the `pages` method that handles the cursor implicitly. This second method is illustrated in the following lines:

In [None]:
#Retrieving all the followers
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name=USER_NAME).pages():
    ids.extend(page)
    time.sleep(20)  # This should be 60 to stay within the rate limits

Notice the `sleep` call. It is needed to respect the rate limits of the Twitter API. 
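
The explicit cursor handling mentioned above can be sketched without touching the network. Here `fake_followers_ids` is a hypothetical stand-in for the API call; it follows the Twitter convention that a cursor of -1 requests the first page and a returned cursor of 0 marks the end.

```python
# Hypothetical stand-in for the API: 12 follower ids, served 5 per page
ALL_IDS = list(range(1000, 1012))
PAGE_SIZE = 5

def fake_followers_ids(cursor=-1):
    """Return (ids, next_cursor); next_cursor == 0 marks the last page."""
    start = 0 if cursor == -1 else cursor
    end = start + PAGE_SIZE
    next_cursor = end if end < len(ALL_IDS) else 0
    return ALL_IDS[start:end], next_cursor

ids = []
cursor = -1                       # -1 asks for the first page
while cursor != 0:
    page, cursor = fake_followers_ids(cursor)
    ids.extend(page)
    # with the real, rate-limited API a time.sleep(...) would go here

print(len(ids))
```

The loop is what `tweepy.Cursor(...).pages()` hides: keep requesting pages, feeding back the cursor returned by the previous call, until the end marker is reached.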

In [None]:
#friends (screen_name) or follower_ids
ids

In [None]:
document={}
document['user'] = user.id
document['followers'] = ids[:]

# Create the connection to MongoDB
try:
    conn=pymongo.MongoClient()
    print ("Connection to Mongo Daemon successful!!!")
except pymongo.errors.ConnectionFailure as e:
    print ("Could not connect to MongoDB: %s" % e )
db = conn['twitter']
collection = db['twitter_users']
collection.insert_one(document)

In [None]:
for doc in collection.find():
    print (doc)

In [None]:
doc['user']

In [None]:
doc['followers']

<div class = "alert alert-error" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:16px;"> **TAKE HOME EXERCISE:** Given a starting user ID, retrieve the user ids corresponding to the set of followers up to two depth levels, that is, the followers of the followers of the named user. This information creates a network of influence that will be used in upcoming sessions.
</div>

The **Streaming API** works by making a request for a specific type of data — filtered by keyword, user, geographic area, or a random sample — and then keeping the connection open as long as there are no errors in the connection. The data you get back will be encoded in JSON. 

One of the main use cases of tweepy is monitoring tweets and performing actions when some event happens. A key component for that is the StreamListener object, which monitors tweets in real time and catches them.

If we check the official twitter streaming API we see that we have several modifiers for filtering the stream, i.e. track (filter by keyword), locations (filter by geographic location), etc

StreamListener has several methods, with on_data() and on_status() being the most useful ones. Here is a sample program which implements this behavior:

In [None]:
from tweepy import Stream,StreamListener

class listener(StreamListener):
    def on_data(self, data):
        # Pretty-print the data
        parsed = json.loads(data)
        print (json.dumps(parsed, indent=4, sort_keys=True))
        return True
    def on_error(self, status):
        print ('ERROR')
        print (status)

Get the twitter data filtered by location inside the following bounding box. (http://boundingbox.klokantech.com)

<img style = "border-radius:10px;" src="./files/ub_location.png">
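
The `locations` filter takes the bounding box as four numbers: longitude and latitude of the south-west corner followed by longitude and latitude of the north-east corner. A small sketch of what "inside the box" means, using the same values as the cell that follows; the helper `in_bounding_box` and the two test points are made up for illustration.

```python
# [west_lon, south_lat, east_lon, north_lat], same values as the filter call
BOX = [2.1622322352, 41.385987385, 2.1651827408, 41.3877173586]

def in_bounding_box(lon, lat, box):
    """True if the point (lon, lat) lies inside the box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

print(in_bounding_box(2.1636, 41.3868, BOX))   # a point inside the box
print(in_bounding_box(2.1700, 41.3868, BOX))   # a point east of the box
```

A check like this can also be applied to the coordinates of the captured tweets when post-processing them.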

In [None]:
twitterStream = Stream(auth, listener()) 
twitterStream.filter(locations=[2.1622322352,41.385987385,2.1651827408,41.3877173586])

In [None]:
# Other examples
twitterStream = Stream(auth, listener()) 
#twitterStream.filter(track=["datascience"])
#Use http://boundingbox.klokantech.com to get the Barcelona bounding box
twitterStream.filter(locations=[2.0504377635,41.2787636541,2.3045074059,41.4725622346])

In [None]:
from tweepy import Stream,StreamListener

class listener(StreamListener):
    def on_data(self, status):
        json_data=json.loads(status)
        print (str(json_data["user"]["screen_name"])+' : ' + json_data["text"])
        return True
    
    def on_error(self, status):
        print ('Error')
        print (status)
        
# Catch all tweets in Barcelona area and print them
twitterStream = Stream(auth, listener()) 
#twitterStream.filter(locations=[2.1622322352,41.385987385,2.1651827408,41.3877173586])
twitterStream.filter(locations=[2.0504377635,41.2787636541,2.3045074059,41.4725622346])

Let us fill the class in order to capture and store the data in a MongoDB database.

In [None]:
from tweepy import Stream,StreamListener

class listener(StreamListener):
    def __init__(self):
        # Initialise the parent StreamListener before adding our own state
        super(listener, self).__init__()
        try:
            self.conn=pymongo.MongoClient()
            print ("Connection to Mongo Daemon successful!!!")
        except pymongo.errors.ConnectionFailure as e:
            print ("Could not connect to MongoDB: %s" % e )
        # Use the connection owned by this object, not a global one
        self.db = self.conn['twitter_stream']
        self.collection = self.db['tweets']
    
    def on_data(self, status):
        jdata = json.loads(status)
        if 'android' in jdata["source"]:
            device = "android"
        else:
            device = "apple"
        document={'text':jdata["text"], 'created':jdata["created_at"], 'screen_name':jdata["user"]["screen_name"], 'device':device}        
        self.collection.insert_one(document) 
        print (document)
        return True
    
    def on_error(self, status):
        print ('ERROR')
        print (status)

# Catch all tweets in the Barcelona area, store them in MongoDB and print them
twitterStream = Stream(auth, listener()) 
twitterStream.filter(locations=[2.0504377635,41.2787636541,2.3045074059,41.4725622346])

In [None]:
#Check captured data
try:
    conn=pymongo.MongoClient()
    print ("Connection to Mongo Daemon successful!!!")
except pymongo.errors.ConnectionFailure as e:
    print ("Could not connect to MongoDB: %s" % e )

db = conn['twitter_stream']
collection = db['tweets']
print (collection.count_documents({}))
for doc in collection.find():
    print (doc)

In [None]:
conn.list_database_names()
db = conn['twitter_stream']  # the database the listener above writes to
coll = db.tweets
for item in coll.find():
    print (item['device'])

APIs are nice. Most large web sites provide useful APIs, e.g. Google, OpenStreetMap, Facebook, etc., subject to some terms of use. However, most web sites do not provide any kind of programmatic access to their data. What to do then?

## 3. Making your own API: Web scraping

Sometimes data is on the web but there is no API to grant access to it, the API lacks functionality, or the terms of service are not adequate. In those cases, since as humans we have visual access to the data, we might wonder how to extract that data automatically. The discipline for doing so is **Web Scraping**. 

Before we start, it is useful to understand a little how web pages are created and data stored. In this section a brief introduction to web front-end development is presented. We will focus on two basic aspects:

+ Basic HTML + CSS static pages.
+ Dynamic HTML (a basic JavaScript example using JQuery).


<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**Firebug example on a page.**

Go to a page and check its contents using Inspect Element
</div>

### 3.1 Basic HTML + CSS 101

The most basic web pages are built upon HTML + CSS technology. This division stands for content and design, respectively. **HTML (HyperText Markup Language)** gives websites structure and stores the contents. This is our target for scraping. On the other hand, **CSS (Cascading Style Sheets)** formats the content and singles out content for visualization purposes, i.e. it defines the style (e.g. font family, color, borders, image style, relative positioning of the content, etc.). HTML files include tags and references to style, so it is worthwhile to understand a little of both technologies, which can help us scrape data more efficiently.


HTML is a tagged language usually rendered by a browser. Tags are specified in the following format:

<p style="text-align: center">&lt;tag_name *attributes*&gt; content &lt;/tag_name&gt;</p>

<p>
<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">STRUCTURE of an HTML file:

<ul>
    <li> HTML files start with &lt;!DOCTYPE html&gt;. This tells the browser that we will use HTML5. Earlier versions of the HTML standard used different doctype declarations. </li>
    <li> The first tag in a web page is &lt;html&gt; and its corresponding &lt;/html&gt; closing tag. All the web page is found inside these tags. </li>
    <li> HTML files have a &lt;head&gt; and a &lt;body&gt; </li>
    <li> In the head, we have the &lt;title&gt; tags, and we use this to specify the webpage's name. We can also find references to CSS stylesheets (&lt;link&gt;) used for formatting the page and links to javascript files (&lt;script&gt;) that give the web page dynamic behavior.</li>
    <li> In the body we find the content of the page. </li> 
        <ul>
            <li> Headings and text paragraphs can be created using &lt;h#&gt; (# is a natural number) and &lt;p&gt; ,respectively. </li>
            <li> Hyperlinks (links) are given in the <strong>href</strong> attribute of the &lt;a&gt; (anchor) tag. </li>
            <li> Images can be embedded using the &lt;img&gt; tag and setting the <strong>src</strong> attribute to the resource. Caution: img is a special tag that does not have a closing tag, e.g. &lt;img src = "my_pic.jpg" /&gt; </li>
        </ul>
</ul>
</div>
</p>
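
From a scraping point of view, this structure tells us where to look: the page name lives in the title tag, links in the href attribute of a tags, and images in the src attribute of img tags. The following sketch pulls those three pieces out of a toy page with the standard library's `html.parser`. The class name and the page are made up, and the libraries listed in the requirements offer more convenient ways of doing the same thing.

```python
from html.parser import HTMLParser

class LinkAndImageCollector(HTMLParser):
    """Collect the title text, link targets and image sources of a page."""
    def __init__(self):
        super().__init__()
        self.title = ''
        self.links = []
        self.images = []
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'img' and 'src' in attrs:
            self.images.append(attrs['src'])
        elif tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Toy page, made up for the example
page = """<!DOCTYPE html>
<html>
  <head><title>Toy page</title></head>
  <body>
    <h1>Heading</h1>
    <p>A paragraph with a <a href="http://example.com">link</a>.</p>
    <img src="my_pic.jpg" />
  </body>
</html>"""

parser = LinkAndImageCollector()
parser.feed(page)
print(parser.title, parser.links, parser.images)
```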

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Let us build a basic HTML web page, adding the following tags. Remember that nearly all tags must be closed using &lt;/tag&gt;

+ DOCTYPE
+ html
+ head
+ title
+ body

<ol>
<li>Create a file 'example.html' in your favorite editor.</li>
<li>Create a basic html web page containing a *title*, *h1*, *p*, *img* and *a* tags.</li>
</ol>
</div>

If you are feeling lazy, go to the files folder and double-click on "example.html". Its HTML code is shown below.


<html>
	<head>
		<title>
			Basic knowledge for web scraping.
		</title>	
	</head>
	<body>
		<h1>About HTML
		</h1>
		<p>HTML (HyperText Markup Language) is the basic language to provide contents in the web. It is a tagged language. You can check more about it in <a href="http://www.w3.org/community/webed/wiki/HTML">World Wide Web Consortium.</a></p>
        
        <p> One of the following rubberduckies is clickable
	</p>
	<p>
            <img src = "files/rubberduck.jpg"/>
        
            <a href="http://www.pinterest.com/misscannabliss/rubber-duck-mania/"><img src = "files/rubberduck.jpg"/></a>
        </p>
	</body>
</html>





Because IPython notebook cells directly interpret Markdown and HTML, we can use a cell as an interactive editor to build our understanding of HTML.
<div class  = "alert alert-success">**EXERCISE** <p>
Change the type of the former cell to *Markdown* and execute it (SHIFT+ENTER). For the images to show, you must use the relative path to the image, e.g. ./files/rubberduck.jpg
</p>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
**Old style HTML** static pages rely heavily on tables and lists: 

<ul>
<li> Making ordered and unordered lists is simple: *ol* (ordered list), *ul* (unordered list) are the main tags. Each item is inserted as *li* (list item) </li>
<li> *table* is the containing tag for building tables, each table row is given as *tr* and columns depend on the table data elements *td*. Tables may have a head (*thead*) and a body (*tbody*). *th* is the same as *td* but for the header. If you want a multi column cell then use colspan=number of cells to cover.
</li>
</ul>
</div>

The next example builds a simple table. Check the Markdown source of the cell.

<table>
<thead>
<tr><th colspan = 2>A table</th></tr>
</thead>
<tbody>
<tr>
<td>Hello I am element 1.1</td><td>Hello I am element 1.2</td>
</tr>
<tr>
<td colspan=2>Hello I am element 2.1 and 2.2</td>
</tr>
</tbody>
</table>
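
Since old-style pages keep much of their data in tables, turning a table into a list of rows is a common scraping step. A minimal sketch with the standard library's `html.parser`, applied to a copy of the table above; the class name `TableRows` is made up, and nested tables or cells outside a row are not handled.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every td/th cell, grouped by tr row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])          # a new row starts
        elif tag in ('td', 'th'):
            self.in_cell = True
            self.rows[-1].append('')      # a new, still empty cell

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()

table_html = """
<table>
<thead><tr><th colspan="2">A table</th></tr></thead>
<tbody>
<tr><td>Hello I am element 1.1</td><td>Hello I am element 1.2</td></tr>
<tr><td colspan="2">Hello I am element 2.1 and 2.2</td></tr>
</tbody>
</table>
"""

parser = TableRows()
parser.feed(table_html)
for row in parser.rows:
    print(row)
```

Note that a cell spanning two columns still appears as a single entry, so rows may have different lengths.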

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
**Current HTML** static pages rely heavily on containers and style: 

<ul>
<li> *div* stands for division and marks a block of content.
</li>
<li> *span* is used to single out an element of a block content.
</li>
</ul>

</div>

By themselves they are not much but when combined with the *style* attribute they become interesting.

For example, consider the following example of code:

<div style = "width:100px;height:100px;background-color:red;padding:10px;font-family:Verdana;font-size:24px;color:pink;display:inline-block">  Box 1
</div>
<div style = "width:100px;height:100px;background-color:blue;padding:10px;font-family:Futura;font-size:24px;color:lightblue;display:inline-block">  Box 2
</div>
<div style = "width:100px;height:100px;background-color:yellow;padding:10px;font-family:Garamond;font-size:24px;color:orange;display:inline-block">  Box 3
</div>
<div style = "width:100px;height:100px;background-color:green;padding:10px;font-family:ArialNarrow;font-size:24px;color:lightgreen;display:inline-block">  Box 4
</div>

The attribute *style* is also referred to as *inline CSS* and lets us give the skeleton some skin and makeup.

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Let us build a basic HTML web page and check the magic of CSS in action before going in detail into CSS.
<ol>
<li>Create a file 'example2.html' using your favorite editor.</li>
<li>Fill the header and body basic HTML structure</li>
<li>Let us add three containers *div* in the body.</li> 
<li>Select one of them. This will be used as a navigation bar and will contain an unordered list with three elements: Home, Brief Bio, Hobbies</li>
<li>Select another division and create a table inside. Each row will contain information about your profile, e.g. the first row may contain Name: Your Name, the second row Position: Your current position, etc.</li>
<li>The last one will contain an image of yourself and a paragraph with your contact info (email)</li>
</ol>
<p>
Check the result. Nearly professional, isn't it?
</p>
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE FOLLOW UP**

Let us add some style.
<ol>
<li>Add the class "navbar" as an attribute to the *div* containing the list. (eg. class = "navbar")</li>
<li>Add the class "head" to the *div* containing the image and the email.</li>
<li>Add the class "right" to the *div* containing the table.</li>
<li>Add the identifier "email" to the paragraph containing the email. (eg. id = "email")</li>
<li>Finally, let us link the classes and ids we have just written to their style definitions by adding the following line to the head tag:
<p>&lt;link type="text/css" rel="stylesheet" href="stylesheet.css"/&gt;</p>
</li>
</ol>
<p>
Check the result now. Do not forget to hover over your navigation bar.
</p>
</div>

The former exercise is an extremely simple one showing the separation between content and styling. Observe that the html file you have created does not have any explicit styling. However, we have added two new elements to the mix: classes and identifiers as attributes of the tags. As you can imagine, styling rules are given for each class and ID, and they are all found in the stylesheet.css file we have just linked.

<div class = "alert alert-warning" style = "border-radius:10px;border-width:3px;border-color:orange;font-family:Verdana,sans-serif;font-size:16px;">**COMMENT:**
Very simple formatting can also be given using html tags. For example, the *strong* and *em* tags produce bold and italic text, respectively.
</div>

**CSS (which stands for Cascading Style Sheets)** is a language used to describe the appearance and formatting of your HTML. A style sheet is a file that describes how an HTML file should look. The word cascading refers to the fact that more specific style rules override more generic ones. We will see that in a minute. 


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">FORMAT of a CSS file:

<ul>
    <li> A CSS file contains a set of style rules applied to a certain selection of the content of the html file. The format is as follows:
    
    <p style = "font-family:Courier;margin-left:200px;"> css_selector { 
                property: value;
        }
    </p>
    A rule may contain many properties.
    </li>
    
    <li> <span style = "font-family:Courier"> css_selector </span> identifies a certain context of the Document Object Model (DOM), i.e. it allows us to traverse the DOM and select specific blocks. For example, 
    <p style = "font-family:Courier;margin-left:200px;">
        div { color:red; }
    </p>
    selects all *div* tags and applies a red font color to their content.
    </li>
    <li> We can use any html tag as element for selection.
    </li>
    </ul>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
    **CLASSES:**
        <ul>
        <li> We may link/format a certain set of tags of the html file with a unique CSS style by means of a **class**. </li>
        <li> In html the class is defined as an attribute and can be shared among tags, e.g. < div class = "my_class" > and < p class = "my_class">
        </li>
        <li> In the css file, the class is identified by a dot preceding the name, e.g.
         <p style = "font-family:Courier;margin-left:200px;">
            .my_class { font-family:Verdana; }
        </p>
        </li>
        </ul>
    </div>


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
    **IDENTIFIERS:**
        <ul>
        <li> If we want to single out an element to apply a certain style we can use an **identifier**. </li>
        <li> In html the identifier is defined as the attribute **ID**, e.g. < div id = "my_ID" >
        </li>
        <li> In the css file, the identifier name is preceded by a hash sign (#), e.g.
         <p style = "font-family:Courier;margin-left:200px;">
            #my_ID { font-size:24px; }
        </p>
        </li>
        </ul>
</div>
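
Classes and identifiers matter for scraping as much as for styling, because they label the blocks we want to extract. The sketch below indexes the tags of a toy page by class and by id with the standard library's `html.parser`. The class name `ClassIdIndex` and the page are made up; the class and id names mirror the ones used in the exercise above.

```python
from html.parser import HTMLParser

class ClassIdIndex(HTMLParser):
    """Index tag names by the classes and ids found in their attributes."""
    def __init__(self):
        super().__init__()
        self.by_class = {}   # class name -> list of tag names
        self.by_id = {}      # id -> tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated class names
        for name in (attrs.get('class') or '').split():
            self.by_class.setdefault(name, []).append(tag)
        if attrs.get('id'):
            self.by_id[attrs['id']] = tag

# Toy page, made up for the example
page = '''
<div class="navbar"><ul><li>Home</li></ul></div>
<div class="head"><p id="email">me@example.com</p></div>
<div class="right box"><table></table></div>
'''

index = ClassIdIndex()
index.feed(page)
print(index.by_class)
print(index.by_id)
```

The last *div* carries two classes, so it is listed under both `right` and `box`, which is the same behavior a `.right` or `.box` selector would show.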

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Check stylesheet.css and inspect the formatting of the identifiers and the classes.
<ol>
<li>
Create an identifier name_props that changes the font-family to Courier.
</li>
<li>
Add this identifier to the *td* tag with your name in the profile.
</li>
</ol>
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE**

Let us make our own css style sheet and explore a little of the more advanced CSS selectors.
<ol>
<li>Create a file 'mystylesheet.css' using your favorite editor and open example3.html in your browser.</li>
<li>What happens if we add the following style command? 
<p style="font-family:Courier">div {font-family:Verdana;font-size:16px;}</p></li>
<li>Add the following line in the style sheet:
<p style="font-family:Courier">div div{color:red;}</p></li>
<li>Add the following line in the style sheet:
<p style="font-family:Courier">div>div{color:green;}</p></li>
<li>Add the style needed to give the container with Bruce Lee's comments on Choy Li Fut *font-size:14px*, *font-family:Courier*, *background-color:#FFCC66;*, and *color:yellow*.</li>
<li>Add a style to the *img* tag with a *height* of *230px* and a *width* of *200px*. Set the width of the table to 700 pixels.</li>
</ol>
</div>

In this last exercise we have seen another type of CSS selection. The html document can be seen as a tree structure. The root of the tree is the *html* tag. It has two children, *head* and *body*. Head may have different children such as *title*, *link*, or *script*. Body may have any combination of tags: *div*, *p*, *a*, etc. These tags can be nested, e.g. we can find a *div* inside a *div* inside a *div*. In the example we have seen how to refer to nested elements. The elements can be html tags, classes, or identifiers.
    + "elem1 elem2" refers to any elem2 inside an elem1, regardless of the degree of nesting (there may be any number of elements in between).
    + "elem1>elem2" refers only to an elem2 whose direct parent is an elem1.
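
The difference between the two combinators can be checked on a small tree. The sketch below emulates them by hand with the standard library's `xml.etree.ElementTree`. It is an illustration of the tree logic, not how a CSS engine is implemented; the helper names and the toy document are made up.

```python
import xml.etree.ElementTree as ET

# Toy document: a div, a div directly inside it, and a div nested deeper
doc = ET.fromstring('''
<body>
  <div id="outer">
    <div id="child">
      <ul><li><div id="grandchild"/></li></ul>
    </div>
  </div>
</body>''')

def select_descendants(root, outer, inner):
    """Emulate "outer inner": inner elements at any depth below an outer."""
    found = []
    def walk(node, inside_outer):
        for child in node:
            if inside_outer and child.tag == inner:
                found.append(child.get('id'))
            walk(child, inside_outer or child.tag == outer)
    walk(root, root.tag == outer)
    return found

def select_children(root, outer, inner):
    """Emulate "outer>inner": inner elements whose direct parent is an outer."""
    return [child.get('id')
            for parent in root.iter(outer)
            for child in parent
            if child.tag == inner]

print(select_descendants(doc, 'div', 'div'))
print(select_children(doc, 'div', 'div'))
```

The descendant version finds both `child` and `grandchild`, while the child version finds only `child`, because the direct parent of `grandchild` is an *li* and not a *div*.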

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE:**
What does "div table>img" select?

</div>

<div class = "alert alert-warning" style = "border-radius:10px;border-width:3px;border-color:orange;font-family:Verdana,sans-serif;font-size:16px;">**BONUS MATERIAL:**

<table style ="border:2px orange solid">
<tr><th style ="border:3px red solid;color:red;">FONT</th><th style ="border:3px red solid;color:red;">BACKGROUND</th></tr>
<tr style ="border:2px orange solid">
<td style ="border:2px orange solid;">
<ul style= "font-family:Courier;">
<li>color</li>
<li>font-size:10px,1em,...</li>
<li>font-family:Verdana,Garamond,Futura,...</li>
</ul>
<div style = "font-size:12px">Font size is specified in pixels (px) or ems (em). One em refers to the current font size (by default, the size the user normally uses). Two ems refers to twice that size, and so on.

Font-families depend on the fonts installed in the computer. However there are some default values: serif, sans-serif, cursive</div>

</td>
<td>
<ul style= "font-family:Courier;">
<li>background-color</li>
<li>text-align: left,right,center;</li>
</ul>
</td>
</tr>

</table>

**CONTAINER LAYOUT**

Any container or element has the following layout:

<img src="./files/margin_border_padding.png"/>

*T* stands for top, *B* for bottom, *L* and *R* for left and right, respectively. We can modify all the attributes: *margin* (outside the container) and *padding* (inside the container) e.g. *margin:10px* (it will add a margin of 10 pixels to all sides), *padding-top:10px* will just add 10 pixels inside the top side of the container. *Border* can be accessed the same way but also we can add other modifiers such as *border-style:dotted/dashed/solid;*, *"border-radius:10px/20%;"*, *"border-width:5px"*, *"border-color:red;"*. 

<br>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE:**
Take a look at the following *div* and identify the container related elements:

<div style = "height:200px;width:400px;background-color:#FFCC66;border-width:10px;border-style:solid;border-color:red;padding:10px 30px 5px 100px;color:black;"> Crux sacra sit mihi lux / Non draco sit mihi dux
Vade retro satana / Numquam suade mihi vana
Sunt mala quae libas / Ipse venena bibas
</div>

</div>

**CSS POSITIONING**

Positioning is one of the toughest parts in CSS styling. For the sake of completeness we will briefly give some hints on how positioning is controlled.

The key attributes are *display*, *float*, and *position*. 

<ul>
<li> <strong>Display:</strong></li>
<ul>
<li>**block:** Makes the element a block box. It does not let anything sit next to it on the page.</li>
<li>**inline-block:** Makes the element a block box, but allows other elements to sit next to it on the same line.</li>
<li>**inline:** Makes the element sit on the same line as another element, but without formatting it like a block. It only takes up as much width as it needs (not the whole line).</li>
<li>**none:** Makes the element disappear.</li>
</ul>
<li><strong>Float:</strong></li>
<ul>
<li>**left:** Floats the element to the left side of its container.</li>
<li>**right:** Floats the element to the right side of its container.</li>
</ul>
<li><strong>Position:</strong></li>
<ul>
<li>**absolute:** The element is positioned relative to its first positioned (not static) ancestor element.</li>
<li>**relative:** The element is positioned relative to its normal position, so "left:20px" shifts the element 20 pixels to the right of where it would normally be.</li>
<li>**fixed:** The element is positioned relative to the browser window and stays in place when the page is scrolled up or down.</li>
</ul>
</ul>
It is worth mentioning the property *clear:both/left/right*, which defines on which sides floats are not allowed, e.g. clear:both will not allow any floating element on the left or on the right. This is useful for placing a container below several floats.

**PSEUDO-SELECTORS**

Pseudo-selectors allow us to change the style according to the state of an element. For example, we can change the color when the mouse is hovering over a link, or change the style of a visited link. They are added using a colon (selector:pseudo-selector). Examples:
<ul>
<li>a:link: An unvisited link.</li>
<li>a:visited: A visited link.</li>
<li>a:hover: A link you're hovering your mouse over.</li>
</ul>
<br>
<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**EXERCISE:**
Take a look at the file 'stylesheet.css' in our second example and locate the positioning elements.
</div>

</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">
    **Attribute based CSS selectors:**
        <ul>
        <li> We can select an element that has a specific attribute value. The syntax is as follows:
        <p style = "font-family:Courier;margin-left:150px;">
            element[attribute = value] { styling_property; }
        </p>
        e.g. The following rule will be applied to all inputs of type button:
        <p style = "font-family:Courier;margin-left:150px;">
            input[type = "button"] { font-size:24px; }
        </p>
        </li>
        <li> More complex search patterns can be used using: 
            <ul> 
            <li><span style = "font-family:Courier">attribute~=value</span> : Matches attributes whose value is a space-separated list of words that contains the whole word value</li> 
            <li><span style = "font-family:Courier">attribute|=value</span> : Matches attributes whose value is exactly value, or starts with value followed by a hyphen (e.g. lang|=en matches "en" and "en-US").</li>
            <li><span style = "font-family:Courier">attribute^=value</span> : Finds attributes that start with the string. It can be part of a larger word. </li> 
             <li><span style = "font-family:Courier">attribute$=value</span> : Finds attributes that end with the string. It can be part of a larger word. </li> 
             <li><span style = "font-family:Courier">attribute\*=value</span> : Finds attributes that **contain** the string. It can be part of a larger word. </li> 
            </ul>
        </li>
        <li>We can chain multiple attribute selectors:
        <p style = "font-family:Courier;margin-left:50px;">
            element[attribute1 = value1][attribute2 = value2]<br> { styling_property; }
        </p>
        </li>
        </ul>
</div>
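The matching rules of the attribute operators listed above can be written down in a few lines of Python. This is only a sketch of their meaning, not how a browser or lxml implements them: the function name `attribute_matches` is made up for this example, and edge cases from the CSS specification (such as empty strings) are ignored.

```python
def attribute_matches(actual, operator, expected):
    """Return True if the attribute value `actual` passes the CSS attribute test."""
    if operator == "=":     # exact match
        return actual == expected
    if operator == "~=":    # whole word in a space-separated list
        return expected in actual.split()
    if operator == "|=":    # exact value, or value followed by a hyphen
        return actual == expected or actual.startswith(expected + "-")
    if operator == "^=":    # value starts with the string
        return actual.startswith(expected)
    if operator == "$=":    # value ends with the string
        return actual.endswith(expected)
    if operator == "*=":    # value contains the string
        return expected in actual
    raise ValueError("unknown operator: " + operator)


print(attribute_matches("red flower pot", "~=", "flower"))  # True
print(attribute_matches("sunflower", "~=", "flower"))       # False
print(attribute_matches("sunflower", "*=", "flower"))       # True
print(attribute_matches("en-US", "|=", "en"))               # True
print(attribute_matches("report.pdf", "$=", ".pdf"))        # True
```

Note how `~=` rejects "sunflower" because it only accepts whole words, whereas `*=` accepts any substring.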

FULL REFERENCE OF CSS SELECTORS:

<table style="font-family:Courier;font-size:14px;">
  <tbody><tr>
    <th style="width:20%">Selector</th>
    <th style="width:20%">Example</th>
    <th style="width:55%">Example description</th>
    <th>CSS</th>
  </tr>
	<tr>
    <td>.<i>class</i></td>
    <td class="notranslate">.intro</td>
    <td>Selects all elements with class="intro"</td>
    <td>1</td>
  </tr>
	<tr>
    <td>#<i>id</i></td>
    <td class="notranslate">#firstname</td>
    <td>Selects the element with id="firstname"</td>
    <td>1</td>
  </tr>  <tr>
  <td>\*</td>
  <td class="code notranslate">\*</td>
    <td>Selects all elements</td>
    <td>2</td>
  </tr>
  <tr>
    <td><i><a href="sel_element.asp">element</a></i></td>
    <td class="notranslate">p</td>
    <td>Selects all &lt;p&gt; elements</td>
    <td>1</td>
  </tr>
  <tr>
    <td><i><a href="sel_element_comma.asp">element,element</a></i></td>
    <td class="notranslate">div,p</td>
    <td>Selects all &lt;div&gt; elements and all &lt;p&gt; elements</td>
    <td>1</td>
  </tr>
  <tr>
    <td><a href="sel_element_element.asp"><i>element</i> <i>element</i></a></td>
    <td class="notranslate">div p</td>
    <td>Selects all &lt;p&gt; elements inside &lt;div&gt; elements</td>
    <td>1</td>
  </tr>
  <tr>
    <td><a href="sel_element_gt.asp"><i>element</i>&gt;<i>element</i></a></td>
    <td class="notranslate">div&gt;p</td>
    <td>Selects all &lt;p&gt; elements where the parent is a &lt;div&gt; element</td>
    <td>2</td>
  </tr>
  <tr>
    <td><a href="sel_element_pluss.asp"><i>element</i>+<i>element</i></a></td>
    <td class="notranslate">div+p</td>
    <td>Selects all &lt;p&gt; elements that are placed immediately after &lt;div&gt; elements</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_gen_sibling.asp"><i>element1</i>~<i>element2</i></a></td>
    <td>p~ul</td>
    <td>Selects every &lt;ul&gt; element that is preceded by a &lt;p&gt; element</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_attribute.asp">[<i>attribute</i>]</a></td>
    <td class="notranslate">[target]</td>
    <td>Selects all elements with a target attribute</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_attribute_value.asp">[<i>attribute</i>=<i>value</i>]</a></td>
    <td class="notranslate">[target=_blank]</td>
    <td>Selects all elements with target="_blank"</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_attribute_value_contains.asp">[<i>attribute</i>~=<i>value</i>]</a></td>
    <td class="notranslate">[title~=flower]</td>
    <td>Selects all elements with a title attribute containing the word "flower"</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_attribute_value_lang.asp">[<i>attribute</i>|=<i>value</i>]</a></td>
    <td class="notranslate">[lang|=en]</td>
    <td>Selects all elements with a lang attribute value starting with "en"</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_attr_begin.asp">[<i>attribute</i>^=<i>value</i>]</a></td>
    <td>a[href^="https"]</td>
    <td>Selects every &lt;a&gt; element whose href attribute value begins with "https"</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_attr_end.asp">[<i>attribute</i>\$=<i>value</i>]</a></td>
    <td>a[href$=".pdf"]</td>
    <td>Selects every &lt;a&gt; element whose href attribute value ends with ".pdf"</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_attr_contain.asp">[<i>attribute</i>\*=<i>value</i>]</a></td>
    <td>a[href*="w3schools"]</td>
    <td>Selects every &lt;a&gt; element whose href attribute value contains the substring 
	"w3schools"</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_active.asp">:active</a></td>
    <td class="notranslate">a:active</td>
    <td>Selects the active link</td>
    <td>1</td>
  </tr>
	<tr>
    <td><a href="sel_after.asp">::after</a></td>
    <td class="notranslate">p::after</td>
    <td>Insert content after every &lt;p&gt; element</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_before.asp">::before</a></td>
    <td class="notranslate">p::before</td>
    <td>Insert content before&nbsp; the content of every &lt;p&gt; element</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_checked.asp">:checked</a></td>
    <td>input:checked</td>
    <td>Selects every checked &lt;input&gt; element</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_disabled.asp">:disabled</a></td>
    <td>input:disabled</td>
    <td>Selects every disabled &lt;input&gt; element</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_empty.asp">:empty</a></td>
    <td>p:empty</td>
    <td>Selects every &lt;p&gt; element that has no children (including text nodes)</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_enabled.asp">:enabled</a></td>
    <td>input:enabled</td>
    <td>Selects every enabled &lt;input&gt; element</td>
    <td>3</td>
  </tr>
  <tr>
    <td><a href="sel_firstchild.asp">:first-child</a></td>
    <td class="notranslate">p:first-child</td>
    <td>Selects every &lt;p&gt; element that is the first child of its parent</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_firstletter.asp">::first-letter</a></td>
    <td class="notranslate">p::first-letter</td>
    <td>Selects the first letter of every &lt;p&gt; element</td>
    <td>1</td>
  </tr>
	<tr>
    <td><a href="sel_firstline.asp">::first-line</a></td>
    <td class="notranslate">p::first-line</td>
    <td>Selects the first line of every &lt;p&gt; element</td>
    <td>1</td>
  </tr>
	<tr>
    <td><a href="sel_first-of-type.asp">:first-of-type</a></td>
    <td>p:first-of-type</td>
    <td>Selects every &lt;p&gt; element that is the first &lt;p&gt; element of its parent</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_focus.asp">:focus</a></td>
    <td class="notranslate">input:focus</td>
    <td>Selects the input element which has focus</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_hover.asp">:hover</a></td>
    <td class="notranslate">a:hover</td>
    <td>Selects links on mouse over</td>
    <td>1</td>
  </tr>
	<tr>
    <td><a href="sel_in-range.asp">:in-range</a></td>
    <td class="notranslate">input:in-range</td>
    <td>Selects input elements with a value within a specified range</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_invalid.asp">:invalid</a></td>
    <td class="notranslate">input:invalid</td>
    <td>Selects all input elements with an invalid value</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_lang.asp">:lang(<i>language</i>)</a></td>
    <td class="notranslate">p:lang(it)</td>
    <td>Selects every &lt;p&gt; element with a lang attribute equal to "it" 
	(Italian)</td>
    <td>2</td>
  </tr>
	<tr>
    <td><a href="sel_last-child.asp">:last-child</a></td>
    <td>p:last-child</td>
    <td>Selects every &lt;p&gt; element that is the last child of its parent</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_last-of-type.asp">:last-of-type</a></td>
    <td>p:last-of-type</td>
    <td>Selects every &lt;p&gt; element that is the last &lt;p&gt; element of its parent</td>
    <td>3</td>
  </tr>
  <tr>
    <td><a href="sel_link.asp">:link</a></td>
    <td class="notranslate">a:link</td>
    <td>Selects all unvisited links</td>
    <td>1</td>
  </tr>
	<tr>
    <td><a href="sel_not.asp">:not(<i>selector</i>)</a></td>
    <td>:not(p)</td>
    <td>Selects every element that is not a &lt;p&gt; element</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_nth-child.asp">:nth-child(<i>n</i>)</a></td>
    <td>p:nth-child(2)</td>
    <td>Selects every &lt;p&gt; element that is the second child of its parent</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_nth-last-child.asp">:nth-last-child(<i>n</i>)</a></td>
    <td>p:nth-last-child(2)</td>
    <td>Selects every &lt;p&gt; element that is the second child of its parent, counting 
	from the last child</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_nth-last-of-type.asp">:nth-last-of-type(<i>n</i>)</a></td>
    <td>p:nth-last-of-type(2)</td>
    <td>Selects every &lt;p&gt; element that is the second &lt;p&gt; element of its parent, counting 
	from the last child</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_nth-of-type.asp">:nth-of-type(<i>n</i>)</a></td>
    <td>p:nth-of-type(2)</td>
    <td>Selects every &lt;p&gt; element that is the second &lt;p&gt; element of its parent</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_only-of-type.asp">:only-of-type</a></td>
    <td>p:only-of-type</td>
    <td>Selects every &lt;p&gt; element that is the only &lt;p&gt; element of its 
	parent</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_only-child.asp">:only-child</a></td>
    <td>p:only-child</td>
    <td>Selects every &lt;p&gt; element that is the only child of its parent</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_optional.asp">:optional</a></td>
    <td class="notranslate">input:optional</td>
    <td>Selects input elements with no "required" attribute</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_out-of-range.asp">:out-of-range</a></td>
    <td class="notranslate">input:out-of-range</td>
    <td>Selects input elements with a value outside a specified range</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_read-only.asp">:read-only</a></td>
    <td class="notranslate">input:read-only</td>
    <td>Selects input elements with the "readonly" attribute specified</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_read-write.asp">:read-write</a></td>
    <td class="notranslate">input:read-write</td>
    <td>Selects input elements with the "readonly" attribute NOT specified</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_required.asp">:required</a></td>
    <td class="notranslate">input:required</td>
    <td>Selects input elements with the "required" attribute specified</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_root.asp">:root</a></td>
    <td>:root</td>
    <td>Selects the document's root element</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_selection.asp">::selection</a></td>
    <td>::selection</td>
    <td>Selects the portion of an element that is selected by a user</td>
    <td>&nbsp;</td>
  </tr>
	<tr>
    <td><a href="sel_target.asp">:target</a></td>
    <td>#news:target </td>
    <td>Selects the current active #news element (clicked on a URL containing that anchor name)</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_valid.asp">:valid</a></td>
    <td class="notranslate">input:valid</td>
    <td>Selects all input elements with a valid value</td>
    <td>3</td>
  </tr>
	<tr>
    <td><a href="sel_visited.asp">:visited</a></td>
    <td class="notranslate">a:visited</td>
    <td>Selects all visited links</td>
    <td>1</td>
  </tr>
</tbody></table>

## 3.2 Hands on with CSS selection

Several web-focused parsing libraries support CSS selection. In this course we will see a couple of them. The first one is **LXML**. 

LXML is built upon the C libraries libxml2 and libxslt. These libraries bring standards-compliant XML support as well as support for (broken) HTML, and they are very fast.

LXML supports CSS selection. Let us do some drills with lxml.

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**LXML EXERCISES** With the source of python.org
<ol>
<li>How many paragraphs are on the page?</li>
<li>What is the text content of the div with the class "shrubbery"? What are the links in that same div?</li>
<li>What is the text in the code elements?</li>
<li><strong>BONUS:</strong> Are there any forms?</li>
</ol>
</div>

In [None]:
from urllib.request import urlopen

#source = urllib2.urlopen("file:///Users/oriol/docencia/DataScience_Postgraduate/datascienceUB_notebooks/scraping/files/example3.html")
source = urlopen("http://www.python.org")
from lxml import html
from lxml.cssselect import CSSSelector

tree = html.document_fromstring(source.read())

In [None]:
#EX1
len(tree.cssselect('p'))

In [None]:
#EX2.a
[elem.text_content() for elem in tree.cssselect('div.shrubbery')]

In [None]:
#EX2.b
for elem in tree.cssselect('div.shrubbery'):
    print ([l for l in elem.iterlinks()])


In [None]:
#EX3
[elem.text_content() for elem in tree.cssselect('code')]

In [None]:
#EX4
tree.cssselect('form')

## 3.3 XPATH as an alternative to CSSSelect

XPath is an alternative way of navigating through XML-like documents. Its syntax resembles the paths used to navigate a file system. An absolute path starts with / and must give the complete path from the root to the elements we want to select. For example xpath('/html/body/p') will select all the paragraphs that are direct children of the body of the html root.

If the path starts with // the search does not start at the root; it matches elements anywhere in the hierarchy. For example xpath('//a/div') will look for every 'div' that is a child of an 'a', anywhere in the document. 

We may also use wildcards such as \*. For example xpath('//a/div/\*') will return all the child elements of any a/div anywhere in the document. And xpath('/\*/\*/div') will look for divs two levels below the root element.

If a step matches more than one element we can choose one by position using brackets. For example xpath('//a/div[1]') will return the first div child of each 'a' and xpath('//a/div[last()]') the last one.

We can work with attributes using @. xpath('//@name') returns all attributes called 'name' anywhere in the document, and xpath('//div[@name]') selects from all the divs in the document only those that have an attribute 'name'. Note that it selects the divs, not the attributes. xpath('//div[not(@\*)]') will return all the divs without attributes. We can even look for specific attribute values: xpath('//div[@name="chachiname"]').

There are built-in functions that help locate elements, such as count(), name(), starts-with(), contains(). For example, xpath('//\*[contains(name(),"iv")]') will select all elements anywhere in the document whose name contains the substring 'iv'; and xpath('//\*[count(div)=2]') will return all elements with exactly two div elements as children.

We can combine the elements coming from several paths using | (union), e.g. xpath('//div/p|//div/a') selects the elements matching either div/p or div/a.

We can refer to the parent, ancestors, children, or descendants in a path, e.g. xpath('//div/div/parent::\*') returns the parent of every div/div match, that is, the divs that have a div as a child.
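The path patterns described above can be tried offline on a toy document. The sketch below deliberately uses the standard library's `xml.etree.ElementTree` instead of lxml, so it runs without extra packages. ElementTree only implements a subset of XPath: paths are written relative to the root element (`./` and `.//`), the parent is written `..`, and functions such as count() or contains() are not available. The document itself is invented for this example.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document invented for this example.
doc = ET.fromstring(
    "<html><body>"
    "<div name='first'><p>one</p><p>two</p></div>"
    "<div><a href='x.html'><div>inner</div></a></div>"
    "</body></html>"
)

# Full path from the root element <html>: plays the role of /html/body/div.
top_divs = doc.findall("./body/div")

# Anywhere in the tree: './/a/div' plays the role of //a/div.
inner = [d.text for d in doc.findall(".//a/div")]

# Wildcard: every child element of any div.
children = [e.tag for e in doc.findall(".//div/*")]

# Position predicates: first and last <p> child of each div.
first_p = [p.text for p in doc.findall(".//div/p[1]")]
last_p = [p.text for p in doc.findall(".//div/p[last()]")]

# Attribute predicates select the elements, not the attributes.
named = doc.findall(".//div[@name]")
exact = doc.findall(".//div[@name='first']")

# Parent step: '..' is ElementTree's spelling of parent::*.
parents = [e.tag for e in doc.findall(".//a/div/..")]

print(len(top_divs), inner, children)   # 2 ['inner'] ['p', 'p', 'a']
print(first_p, last_p)                  # ['one'] ['two']
print(len(named), len(exact), parents)  # 1 1 ['a']
```

With lxml the same queries can be written in full XPath syntax through `tree.xpath(...)`, as in the exercises that follow.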



<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**XPATH Exercises**:
<ol>
<li>What text is in the lists of the page?</li>
<li>All the attributes on the page</li>
<li>Is there any link on the page?</li>
<li>Can you get the style sheet information?</li>
<li>divs that have the class "container"</li>
<li>what if you remove brackets in the last one?</li>
</ol>
</div>

In [None]:
tree.xpath('//li/text()')
#tree.xpath('//li/a/text()')

In [None]:
tree.xpath('//@*')

In [None]:
len(tree.xpath('//a'))

In [None]:
tree.xpath('//style')

In [None]:
tree.xpath('//link/@href')

In [None]:
for item in tree.xpath('//link[contains(@href,"css")]'):
    print (item.values())

In [None]:
tree.xpath('//div[@class="container"]') #exact match
#tree.xpath('//div[contains(@class,"container")]')

In [None]:
tree.xpath('//div/@class="container"') #without brackets the expression is a comparison and returns a boolean


## 3.4 The nonexistent data scraping case

As a simple exercise, try to scrape the numerical value in the text box of the hidden.html file.

In [None]:
from IPython.display import HTML
HTML('<iframe src=./files/hidden.html width=700 height=300></iframe>')

In [None]:
#Write your code here.


In [None]:
#Solution
from urllib.request import urlopen
socket = urlopen("file:./files/hidden.html")
print (socket.read().decode('latin-1'))

... and the value?

Problems and limitations of LXML and basic scraping techniques:

     + Content loaded through the DOM. We acquire the page as it is when the response is closed; any data that scripts add afterwards is not retrieved.
     + Really broken HTML/XML.
     + Proprietary pages and pages that require a login can be difficult, depending on the login flow of the page.
     + Forms that rely on JavaScript interaction.
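The first limitation in the list above can be reproduced with a small offline sketch. The page below is invented for illustration (it is not the content of hidden.html): the text box is empty in the source and a script fills it in once the browser runs it. A parser that only reads the static source, here the standard library's `html.parser`, never sees the value.

```python
from html.parser import HTMLParser

# Invented page: the input is empty in the source and filled in by a script.
PAGE = """
<html><body>
  <input id="secret" type="text">
  <script>document.getElementById("secret").value = 6 * 7;</script>
</body></html>
"""


class InputCollector(HTMLParser):
    """Collect the attributes of every <input> tag found in the static source."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.inputs.append(dict(attrs))


collector = InputCollector()
collector.feed(PAGE)

print(collector.inputs)                # [{'id': 'secret', 'type': 'text'}]
print("value" in collector.inputs[0])  # False: the number only exists after the script runs
```

The number is computed by JavaScript, so it can only be read from a page that has actually been interpreted by a browser or another JavaScript engine.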

### JUST FOR FUN: MARIO MAGIC

In [None]:
from IPython.display import HTML
HTML('<iframe src=./files/mario.html width=700 height=350></iframe>')

<div class = "alert alert-warning" style = "border-radius:10px;border-width:3px;border-color:orange;font-family:Verdana,sans-serif;font-size:16px;">**BONUS MATERIAL:**

In order to add a javascript script we simply declare the script file in the head of the html document. For example <br>
<span style = "font-family:Courier;"> < script type="text/javascript" src="script.js">< /script></span><br>
and <br>
<span style = "font-family:Courier;">< script type='text/javascript' src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min.js">< /script></span><br>

Next, we'll need to start up our jQuery magic using the **\$(document).ready();** syntax. **\$()** says, "hey, jQuery things are about to happen!". Putting document between the parentheses tells us that we're about to work our magic on the HTML document itself.
**.ready();** is a function, or basic action, in jQuery. It says "hey, I'm going to do stuff as soon as the HTML document is ready!". Whatever goes in .ready()'s parentheses is the jQuery event that occurs as soon as the HTML document is ready.

So,
<p style = "font-family:Courier;">
\$(document).ready(something);
</p>

Example:
<p style = "font-family:Courier;">
\$(document).ready(function(){<br>
 \$('div').mouseenter(function(){<br>
    \$('div').fadeTo('fast',1);<br>
 });   <br>
 \$('div').mouseleave(function(){<br>
    \$('div').fadeTo('fast',0.5);<br>
 });
});
</p>


**Some functions:**
<ul>
<li>.click()</li>
<li>.fadeTo(speed='fast,slow',alpha_value)</li>
<li>.fadeOut(speed)</li>
<li>.fadeIn(speed)</li>
<li>.hide()</li>
</ul>

**Variables**
<p style = "font-family:Courier">
var lucky = 7;<br>
var name = "";<br>
var \$p = \$('p');
</p>
Note that \$p is a variable and could be anything, number, etc. In this case we use it to store the JQuery functionality.
<p>
**JQuery objects are defined by: \$()**
<br>
The special tag 'this' e.g. \$(this), will select the item on which the action is being performed. We can use it to change the behavior of the element. For example
<p style = "font-family:Courier">
\$(document).ready(function(){<br>
\$(this).fadeOut('fast');<br>
});<br>
</p>


**Adding elements:**
<ul>
<li>\$(where).append("stuff")</li>
<li>.prepend()</li>

<li>\$(where).before("stuff to append")</li>
<li>.after()</li>

<li>.empty()</li>
<li>.remove()</li>
</ul>
A nice trick with .after() is to select something with a variable and move it using .after(). For example:
<br>
<p style = "font-family:Courier">
    \$p = \$('#one')<br>
    \$('div').after(\$p)
</p>

This will take the contents of element with id = "one" and move it after the tag "div"

<br>
We can also alter attributes:
<ul>
<li>.addClass()</li>
<li>    .removeClass()</li>
<li>    .toggleClass() if the class is present it removes it, otherwise it is added.</li>
<li>    .css(style, value) allows to add style CSS commands ex. .css('color','red')</li>
</ul>

Example:    
<p style = "font-family:Courier">
    \$(document).ready(function(){<br>
    \$('div').height(200);<br>
    \$('div').width(200);<br>
    \$('div').css('border-radius',10);<br>
});
</p>

Finally, we can alter the content!!!! (scraping here we go!) using **.val()** or **.html()**.
Examples:
<br>
<span style = "font-family:Courier">   \$('div').html(); </span> -> will get the content of the first match, that is the first div.
<br><span style = "font-family:Courier"> \$('div').html("I love JQuery!"); </span>-> will set the content of every matched div to the value.
<br> **.val**  will retrieve the value of form elements. For example:
<br><span style = "font-family:Courier">\$('input:checkbox:checked').val();</span> would get the value of the first checked checkbox that jQuery finds. 

<p>
**GENERAL EVENT HANDLERS**
The standard notation goes as follows

<br><span style = "font-family:Courier">\$(document).on('event', 'selector', function() {
Do something!});  </span>
<br>
A nice example can be checked in jquery_events_example.html,.css,.js
</p>  
    
 
**SOME JQUERY EVENTS**

The general way of handling events is as follows:

<p style = "font-family:Courier">
\$(document).ready(function() {<br>
    \$('thingToTouch').event(function() {<br>
        \$('thingToAffect').effect();<br>
    });<br>
});
</p>

Example of events:
<ul>
<li>.dblclick()</li>
<li>.hover()</li>
<li>.focus()</li>
<li>.keydown()</li>
</ul>
Example:
<p style = "font-family:Courier">
\$('div').hover(<br>
   function(){<br>
      \$(this).addClass('highlight');<br>
   },<br>
   function(){<br>
      \$(this).removeClass('highlight');<br>
   }<br>
);<br>
</p>
    
And a final cool **effect**:
<br>
.draggable()
    
</div>

# 4 Advanced scraping using automation tools

We see the data in our web browser, but it is not directly found in the HTML source. However, the "data is out there": it has been dynamically generated by a function call. Thus, there are two versions of the web page. The first contains static data and function calls; the second contains the static data plus the result of interpreting those function calls. The question now is how to access this post-interpretation data. There are several ways. One is to run our own interpreter, such as node.js. Another is to take advantage of the browser's interpretation capabilities and use it as the interpreter.

Automation tools such as mechanize or selenium are suites designed for testing web interfaces automatically from scripts. They let us start a browser and interact with the web page in the same way a human user would. We can use these tools for our scraping purposes.


## The Cepstral demo and our new goal.
<small>An updated version of the case study of Asheesh Laroia (PaulProtheus at Github)</small>

Our new goal is to deal with dynamically generated data, so that we can scrape pages such as the following one. Cepstral is a text-to-speech provider. Let us check its web page.

For Selenium to drive Firefox we need to download geckodriver:
https://github.com/mozilla/geckodriver/releases

In [None]:
from IPython.display import HTML
HTML('<iframe src="http://cepstral.com" width=700 height=350></iframe>')

Our goal is to retrieve the audio file that has been played using web scraping techniques. Let us check how we can do it.

In [None]:
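import os
os.environ["PATH"] += os.pathsep + "." #'!export' only affects a temporary subshell, so we extend the path from Python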

In [None]:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://seleniumhq.org/')

In [None]:
#CEPSTRAL DEMO
%reset -f
#!/usr/bin/python
# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.cepstral.com/en/demos' #Address of the web page
browser = webdriver.Firefox() #Open a Firefox browser
browser.get(url)
element = browser.find_element(By.CSS_SELECTOR, "#demo_text")
element.clear()
s='Hi this is a very boring class! Wow, so awesome!'
element.send_keys(s)
browser.find_element(By.CSS_SELECTOR, '#demo_submit').click()

browser.implicitly_wait(30) #Wait up to 30 seconds for elements to appear
browser.find_element(By.CSS_SELECTOR, 'audio')
html=browser.page_source
browser.quit()


In [None]:
print (html)


In [None]:
#Check the data is in
'.mp3' in html


In [None]:
#locate it
html.find('.mp3')

In [None]:
chunks=html.split('"')
for chunk in chunks:
    if '.mp3' in chunk:
        break


In [None]:
print (chunk)

In [None]:
from urllib.parse import urljoin
furl=urljoin(url,chunk) #Build the absolute address of the audio file
print (furl)
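The split-and-search loop above can also be written as a single regular expression. The following sketch runs offline on an invented HTML fragment; `page_url` and `sample_html` are placeholders for the real `url` and `html` variables, and the actual markup of the Cepstral page may differ.

```python
import re
from urllib.parse import urljoin

# Invented stand-ins for the page address and a fragment of browser.page_source.
page_url = "http://www.example.com/en/demos"
sample_html = '<audio controls><source src="/demos/out/sample123.mp3" type="audio/mpeg"></audio>'

# Capture the contents of the first double-quoted string that ends in .mp3.
match = re.search(r'"([^"]*\.mp3)"', sample_html)
relative = match.group(1)

# Resolve the relative address against the address of the page.
absolute = urljoin(page_url, relative)

print(relative)  # /demos/out/sample123.mp3
print(absolute)  # http://www.example.com/demos/out/sample123.mp3
```

Since the captured path starts with a slash, `urljoin` replaces the whole path of the page address and keeps only its scheme and host.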

In [None]:
import os

player = "/Applications/VLC.app/Contents/MacOS/VLC " 

##Replace the path above with the path to your own media player
os.system(player+furl)


## 4.1 Starting with Selenium 

+ Requirements
        ''pip install selenium''
        
If you use Firefox, Selenium additionally needs the geckodriver executable, which can be downloaded from
https://github.com/mozilla/geckodriver/releases

Once geckodriver is on the executable path, the following code should work fine.


In [None]:
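import os
os.environ["PATH"] += os.pathsep + "." #'!export' only affects a temporary subshell, so we extend the path from Python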

In [None]:
from selenium import webdriver
browser = webdriver.Firefox()

If you want to use Chrome you need the Chrome webdriver interface 'chromedriver'. 

+ Download 'chromedriver' (version 2.10 at the time this notebook was written).
+ Add the folder that contains chromedriver to the executable path. (We will do it directly in Python.)

Check the following code

In [None]:
from selenium import webdriver
import os 
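os.environ["PATH"] += os.pathsep + '.' #Append the current folder; assigning '$PATH:.' would replace the path with that literal string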
browser = webdriver.Chrome()


<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Basic manipulation in Selenium:**
<p>
A webdriver instance lets us manipulate the web session, control cookies, retrieve the HTML code or find elements in the source code.
</p>
Given a webdriver instance (e.g.<span style = "font-family:Courier;">
            browser = webdriver.Firefox()</span>) the most relevant methods are:

<ul>
<li>**Open URL:**  .get(url) (e.g.
<span style = "font-family:Courier;"> browser.get(url)</span>)</li>
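<li>**Selection:** .find_element(s)... [element returns the first match, elements the complete list]. Selenium 4 removed the helpers below in favor of find_element(By.CSS_SELECTOR, '...') and find_elements(By.XPATH, '...'), with By imported from selenium.webdriver.common.by.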
<ul>
<li>..._by_link_text('foo') - find the link with text foo</li>
<li>..._by_partial_link_text() - similar to contains ...</li>
<li>..._by_css_selector()</li>
<li>..._by_tag_name()</li>
<li>..._by_xpath()</li>
<li>..._by_class_name()</li>
</ul>
</li>
<li>**Retrieve source:** .page_source</li>
  
</ul>
</div>

<div class = "alert alert-info" style = "background-color:lightyellow;border-radius:10px;border-width:3px;border-color:darkorange;font-family:Verdana,sans-serif;font-size:16px;color:brown">**Other web driver utilities:**
<ul>
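<li>browser.execute_script('window.close()') - execute any JavaScript on a loaded page</li>
<li>browser.save_screenshot('foo.png') - save a screenshot of the current window</li>
<li>browser.switch_to.alert: access pop-up alerts so that they can be handled automatically (browser.switch_to_alert() in old Selenium releases)</li>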
<li>browser.forward() / browser.back(): navigation</li>
</ul>
</div>

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**Exercise: Simple retrieval of a page source with AJAX.**
The list of prices returned by a flight search on Google is a dynamically generated page. Let us suppose we want to find the cheapest price for a certain flight. We may access Google Flights by means of the API.
</div>

In [None]:
%reset -f
from selenium import webdriver
import time
import os 
os.environ["PATH"] += os.pathsep + '.' #Append the current folder (where the driver is) to the executable path
url = 'https://www.google.es/flights/#search;f=BCN;t=MAD;d=2017-11-12;r=2017-11-16' #Address of the web page
browser = webdriver.Firefox() 
browser.get(url)
time.sleep(3) #Wait for the page to load
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);") #Scroll down

html=browser.page_source

In [None]:
print (html)

In [None]:
#Use basic string manipulation to retrieve the price.
import re
for m in re.finditer(r"€", html):
    print ('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
    print (html[m.start()-10:m.start()+1])


#euro = html.find_all('€')
#html[euro-5:euro-1]
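The string search above can be turned into an actual list of prices with a capturing regular expression. The sketch below runs offline on an invented HTML fragment. It assumes that each price is written as an integer followed by the euro sign; the real markup returned by the site will differ and may need a different pattern.

```python
import re

# Invented fragment standing in for browser.page_source.
sample_html = "<span>85 €</span><span>59 €</span><span>120 €</span>"

# Capture the digits that come right before each euro sign.
prices = [int(p) for p in re.findall(r"(\d+)\s*€", sample_html)]

print(prices)       # [85, 59, 120]
print(min(prices))  # 59
```

Converting the captured strings to integers is what makes `min` compare prices numerically instead of alphabetically.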

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Element manipulation in Selenium:**
<p>
Consider the result of a selection, e.g. 

<span style = "font-family:Courier;">element = browser.find_element(By.CSS_SELECTOR, 'div')</span>

We can do several things with it.
<ul>
<li>element**.click()** - click on the selected element</li>
<li>Element properties:
<ul>
<li>element**.location**: x, y location</li>
<li>element**.parent**: the WebDriver instance the element was found from</li>
<li>element**.tag_name**: the tag of the element</li>
<li>element**.text**: text of the element and its children</li>
</ul>
</li>
   
</ul>
</div>




<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">
**A more elaborate exercise:** Go to Amazon and look for the film "Ip Man". Retrieve the list of all movies bought by customers who also bought "Ip Man". With this list, go to the Internet Movie Database (IMDb) and sort the list according to the score of each movie.
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Form input with Selenium:**
<ul>
<li> element**.send_keys()** - Keys, commands, arrows, etc </li>
<li> element**.clear()** - clear the element</li>
</ul>
<p>

**Example.**

<p style="font-family:Courier;">
from selenium.webdriver.common.keys import Keys
<br>input.send_keys('Ip Man',Keys.RETURN)
</p>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">**Scrolling and moving:**
Moving around the page is tricky; be prepared to show a little patience.

ActionChains provide a way of stringing together one or more actions and then performing them.
<ul>
<li>move_by_offset(x,y)</li>
<li>move_to_element() - for highlighting, hovering, rollover, etc.</li>
<li>move_to_element_with_offset(elem, x, y)</li>
</ul>
</div>

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;"> **Wait**

We can distinguish two types of waiting strategies, namely, implicit and explicit waits.

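*Implicit waits* tell the driver to keep polling the DOM for up to a given time whenever it looks for an element that is not immediately available; once set, the timeout lasts for the full life of the web driver. On the other hand, *explicit waits* tell the driver to poll the DOM until some condition is met, e.g. a certain element has finished loading on the page. 

Example:
<p style="font-family:Courier;">
from selenium.webdriver.common.by import By<br>
from selenium.webdriver.support.ui import WebDriverWait<br>
from selenium.webdriver.support import expected_conditions as EC<br>
from selenium.common.exceptions import TimeoutException<br>
try: <br>
&nbsp;&nbsp;movie_info = WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.ID,'BotMovie')))<br>
&nbsp;&nbsp;title = movie_info.find_element(By.CLASS_NAME,'title').text<br>
&nbsp;&nbsp;link = movie_info.find_element(By.CLASS_NAME,'mdpLink').get_attribute('href')<br>
except TimeoutException:<br>
&nbsp;&nbsp;print('taking too long!!')<br>
 </p>
 
*EC* stands for Expected Conditions, which are the basis of explicit waits (see http://selenium-python.readthedocs.org for more information).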
</div>
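What an explicit wait does can be sketched in plain Python: call a condition repeatedly until it returns something truthy or a timeout expires. The helper `wait_until` and the toy condition `ready` are invented for this illustration. This is not Selenium's implementation, only the polling idea behind it.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Toy condition that only becomes true on the third poll.
calls = {"n": 0}


def ready():
    calls["n"] += 1
    return "loaded" if calls["n"] >= 3 else None


print(wait_until(ready, timeout=2.0, poll_interval=0.01))  # loaded
print(calls["n"])                                          # 3
```

An implicit wait applies this kind of polling to every element lookup with one global timeout, whereas an explicit wait lets us choose the condition and the timeout for each call.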

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;"> **Wrap-up**
<ul>
<li>We understood how data is usually stored in a web site and how to access it using different kinds of accessors, namely APIs and direct selectors.</li>
<li>We have seen how to capture different kinds of data (text, audio and pictures).</li>
<li>We are now familiar with JSON data and basic No-SQL databases.</li>
</ul>
</div>