# A quick scraping example

Here we work through a basic scraping example using <code>lxml</code>. Other libraries serve the same purpose, such as <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>. 
For a more systematic approach to scraping, <a href="http://scrapy.org/">Scrapy</a> is a good solution. If you need to interact with HTML pages, <a href="">Selenium</a> is a good fit. However, Scrapy and Selenium are advanced tools: they require a solid knowledge of HTML and JavaScript, together with an understanding of more advanced architectural concepts. 

## Step one: select and look at the markup of the page

Scraping basically consists of fetching a Web page, processing its HTML content to extract some information, and then saving or converting that information into another format.

As an example, let's take the Forbes ranking of companies: http://www.forbes.com/global2000/list/

The first thing we need to do is look at the markup of the page. In Firefox, you can do this with:
* Tools > Web Developer > Page Source

A more advanced view of the code is usually better. In Firefox, you can use:
* Tools > Web Developer > Inspector

Or use an add-on such as Firebug: https://addons.mozilla.org/es/firefox/addon/firebug/
The advantage of Firebug is that it lets you copy XPath expressions directly from the page (see below).

In the selected Web page, we can see that the list of companies and their values is inside a <code>tbody</code> element that contains a number of elements with class <code>company</code>.

So in this case we can get the company names and values based on CSS classes. 

## Step two: get the Web page

This step simply retrieves the text of the page. We can do it with the <code>requests</code> module or with any other module that serves a similar purpose.

In [23]:
import requests
url = "http://www.forbes.com/global2000/list/"
page = requests.get(url)


## Step three: parse the Web page

In [24]:
from lxml import html, etree
# First attempt: parse the downloaded page as strict XML.
utf8_parser = etree.XMLParser(encoding='utf-8')
s = page.text.encode('utf-8')
tree = etree.fromstring(s, parser=utf8_parser)
# This raises the error shown below: the page is not well-formed XML.
# We take a fragment saved on disk instead.


XMLSyntaxError: Entity 'rsquo' not defined, line 10, column 182
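
The error above comes from the strict XML parser, which does not know HTML entities such as `&rsquo;`. The following minimal sketch reproduces the problem on a tiny made-up snippet (not the Forbes page) and shows that lxml's HTML parser accepts the same input. It is written for Python 3, and the variable names are illustrative.

```python
from lxml import etree, html

# Made-up snippet: it uses an HTML entity that plain XML does not define.
snippet = "<p>It&rsquo;s a test</p>"

try:
    etree.fromstring(snippet)  # strict XML parsing
    strict_ok = True
    strict_error = ""
except etree.XMLSyntaxError as err:
    strict_ok = False
    strict_error = str(err)

# The HTML parser knows the HTML entity set and resolves &rsquo; to a character.
element = html.fromstring(snippet)
text = element.text_content()
```

Here `strict_ok` ends up `False`, while `text` holds the paragraph text with the entity resolved.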

Since strict XML parsing fails, we load a saved fragment of the page from disk and parse it as HTML. Once that is done, <code>tree</code> holds a representation of the contents of the page, and we have two options to extract the data: CSS selectors or XPath.

In [13]:
# Read the saved fragment; the with-block closes the file afterwards.
with open("forbes-fragment.html") as f:
    tree = html.document_fromstring(f.read())

## Step four: get the data with CSS selectors or XPath

We now take advantage of our knowledge that the company names and values can be matched by particular CSS selectors or XPath expressions. The CSS selector syntax is described at: http://www.w3.org/TR/CSS2/selector.html

In [14]:
#companies = tree.cssselect('tr')
# XPath expressions are declarative: they select a part of the HTML document.
# '//tr' matches every tr element; the predicate keeps only those with class "data".
companies = tree.xpath('//tr[@class="data"]')

print(len(companies))
print(type(companies))
print(type(companies[0]))


30
<type 'list'>
<class 'lxml.html.HtmlElement'>


Now we can iterate over the rows and read the fields we need from the <code>td</code> cells of each company.

In [15]:
L = []
# Skip the first row (the header) by starting at index 1.
for row in companies[1:]:
    # The children of each row are its <td> cells;
    # we use cells 2 (name), 3 (country) and 7 (value).
    cells = row.getchildren()
    name = cells[2].xpath("a/text()")[0]
    country = cells[3].text
    value = cells[7].text
    L.append([name, country, value])

In [16]:
L[:5]

[['China Construction Bank', 'China', '$212.9 B'],
 ['Agricultural Bank of China', 'China', '$189.9 B'],
 ['Bank of China', 'China', '$199.1 B'],
 ['Berkshire Hathaway', 'United States', '$354.8 B'],
 ['JPMorgan Chase', 'United States', '$225.5 B']]

Note: the rows above are selected with an <a href="http://www.w3schools.com/xpath/">XPath</a> expression (you can get the XPath of an element in a Web page with Firebug); you can do something similar with CSS selectors through <code>cssselect</code>.
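
The row-and-cell extraction can also be tried offline on a small inline table. The markup below is made up for illustration and is much simpler than the real page, so the cell indices differ from the ones used above. This is a sketch for Python 3, not the layout of the Forbes list.

```python
from lxml import html

# Made-up markup for illustration only; the real page layout differs.
SAMPLE = """
<table>
  <tr class="header"><th>Company</th><th>Country</th><th>Value</th></tr>
  <tr class="data"><td><a href="#">Acme</a></td><td>Spain</td><td>$1.5 B</td></tr>
  <tr class="data"><td><a href="#">Globex</a></td><td>France</td><td>$2.0 B</td></tr>
</table>
"""

tree = html.document_fromstring(SAMPLE)

rows = []
# Keep only the rows whose class attribute is exactly "data".
for tr in tree.xpath('//tr[@class="data"]'):
    cells = tr.xpath("td")                # direct <td> children of the row
    name = cells[0].xpath("a/text()")[0]  # text of the link inside the first cell
    rows.append([name, cells[1].text, cells[2].text])
```

Filtering on the class attribute removes the header row here, so no index offset is needed. The CSS selector `tr.data` would select the same rows through `tree.cssselect`, which needs the separate `cssselect` package.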

## Step five: transform the data (if needed)

Each value is a string in the format "$174.4 B". We need to extract the number from that string.

In [17]:
# The simplest way is to slice off the leading "$" and the trailing "B";
# float() ignores the space that remains.
for row in L:
    row[2] = float(row[2][1:-1])
print(L[:5])
print(type(L[0][2]))

[['China Construction Bank', 'China', 212.9], ['Agricultural Bank of China', 'China', 189.9], ['Bank of China', 'China', 199.1], ['Berkshire Hathaway', 'United States', 354.8], ['JPMorgan Chase', 'United States', 225.5]]
<type 'float'>
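
Slicing works only while every label has exactly the "$…B" shape. A slightly more defensive sketch uses a regular expression and fails loudly on anything else. The helper name `parse_value` is hypothetical, and the pattern assumes that values are always given in billions with a "B" suffix, as in the data above.

```python
import re

# Assumes the "$<number> B" format seen in the scraped data.
VALUE_PATTERN = re.compile(r"^\$\s*(\d+(?:\.\d+)?)\s*B$")

def parse_value(label):
    """Convert a label such as '$212.9 B' to a float, in billions of dollars."""
    match = VALUE_PATTERN.match(label.strip())
    if match is None:
        raise ValueError("unexpected value format: %r" % (label,))
    return float(match.group(1))

values = [parse_value(s) for s in ["$212.9 B", "$189.9 B"]]
```

Raising an error on an unexpected label makes a change in the page format visible immediately, instead of producing wrong numbers silently.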


## Step six: move it to a DataFrame

In [18]:
import pandas as pd
df = pd.DataFrame(L, columns = ["company", "country", "value"])
df.head(6)

|   | company | country | value |
|---|---|---|---|
| 0 | China Construction Bank | China | 212.9 |
| 1 | Agricultural Bank of China | China | 189.9 |
| 2 | Bank of China | China | 199.1 |
| 3 | Berkshire Hathaway | United States | 354.8 |
| 4 | JPMorgan Chase | United States | 225.5 |
| 5 | Exxon Mobil | United States | 357.1 |
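
Step one mentions saving the extracted information in another format. As a minimal sketch of that last step, the rows can be written as CSV with the standard library. The two rows are copied from the output above; the in-memory buffer stands in for a real file, and the file name in the comment is hypothetical. The code is written for Python 3.

```python
import csv
import io

rows = [
    ["China Construction Bank", "China", 212.9],
    ["Agricultural Bank of China", "China", 189.9],
]

# In-memory stand-in for open("companies.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["company", "country", "value"])  # header row
writer.writerows(rows)
csv_text = buffer.getvalue()
```

With the DataFrame built above, `df.to_csv` offers the same result in a single call.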


In [37]:
# html5lib needs to be installed for this to work.
import html5lib
dfs = pd.read_html('https://en.wikipedia.org/wiki/Breguet_14')
dfs
# for df in dfs:
#     print(df)

ImportError: html5lib not found, please install it
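
The error above means that a parser needed by `pd.read_html` is not installed. According to the pandas documentation, the default behaviour is to try `lxml` first and fall back to `bs4` together with `html5lib`. The sketch below checks which of these modules are importable before calling `read_html`. The helper name `missing_modules` is hypothetical, and the code is written for Python 3.

```python
import importlib.util

def missing_modules(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Parser modules that pd.read_html may rely on.
missing = missing_modules(["lxml", "bs4", "html5lib"])
if missing:
    print("Install before calling pd.read_html: " + ", ".join(missing))
```

If the list is empty, all three modules can be imported and the cell above should no longer fail with an ImportError.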