# Web Scraping

###### In day-to-day work you will often want to use data from the internet. Sometimes the data comes in a friendly format such as csv or xlsx, and sometimes you can access it through an API (Application Programming Interface).

###### But sometimes the only way to get the data is a technique called Web Scraping. This technique lets you pull data from a web page into a format you can work with in your analysis.

###### First of all, we need to understand what we will find in a web page, so let's look at a short summary!

## The components of a web page

###### When a web page is downloaded, its data falls into a few categories. The main ones are:
- HTML: Contains the main content of the web page.
- CSS: Contains the instructions that style the web page and make it look good.
- JavaScript: Contains the instructions that add interactivity to the web page.
- Images: Allow the web page to show pictures in formats like JPG and PNG, among others.

###### We will focus on the HTML code, because that is where the data we want to obtain is located.

## HTML (Tags < > )

###### When you perform web scraping you have to deal with tags. Some of the most common tags that you will find in HTML code are listed below.
- 'html': The whole HTML document is contained within these tags.
- 'body': Inside the html tag, the body tag contains all the visible elements that the web page shows to the user.
- 'p': Inside the body tag, each p tag contains one paragraph.
- 'a': Inside the body tag, an a tag contains a link to another web page.
- 'table': Inside the body tag, a table tag contains a table. Each row is contained in a 'tr' tag, and each row is divided into cells with 'td' tags.
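
###### To make the nesting concrete, here is a minimal sketch that lists the opening tags of a small document in the order they appear. It uses only Python's standard library html.parser module, which is not one of the libraries used in this training, and the sample document and the TagCollector class are made up for illustration.

```python
from html.parser import HTMLParser

# A tiny, made-up document that uses the tags described above.
SAMPLE_HTML = """
<html>
  <body>
    <p>A paragraph with a <a href="https://example.com">link</a>.</p>
    <table>
      <tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>
    </table>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.tags)
```

###### The printed list starts with html and body and then shows p, a, table, tr and td, which matches the nesting described in the list above.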

## Required libraries

###### Python has a large open source ecosystem, so there are usually several libraries that can perform the same task. In this training we are going to use a few libraries to explain how to do web scraping:
- Requests: We will use its "get" function to obtain the HTML code from the web page (this library has many other functions).
- BeautifulSoup: This library lets you parse and clean the HTML code to obtain just the data that you want.
- Parsel: This library gives you another way to extract the data from the HTML code.

## And now ... hands on the code !! (finally)

###### First we need to import the requests library and download the page

In [1]:
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>
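
###### The 200 in the output is the HTTP status code, and it means the request succeeded. As a hedged sketch, the download step could be wrapped in a small helper that fails loudly when the server answers with an error. The function name fetch_html and the 10 second timeout are assumptions made for this example, not part of the original training.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text, or raise on failure.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

###### You could call it as fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html"). Passing a timeout keeps the script from waiting forever if the server never answers.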

###### We can print the content of the web page using the content attribute

In [2]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
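
###### The b prefix in the output shows that content holds raw bytes, not a regular string. The sketch below shows the difference using a made-up byte string, and it assumes the page is encoded as UTF-8. With Requests, the text attribute of the response gives you an already decoded string.

```python
# Made-up bytes, similar to what a server might send; the b prefix marks a bytes object.
raw = b"<p>Caf\xc3\xa9 menu</p>"

# Decode the bytes into a regular string, assuming the page is UTF-8.
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)
```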

###### And then you can use BeautifulSoup to parse the downloaded content of the web page

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

###### Now we can use the prettify method of our BeautifulSoup object to see the HTML code in a much more readable form than before

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


###### With the previous code you can see the nested tags. Now it is time to start digging into the tags to extract the paragraph that we want.
###### We can use the children attribute to go into the nested tags one level at a time. Note that children gives us an iterator rather than a list, so we need to pass it to the list function to see the nested tags.

In [5]:
soup.children

<list_iterator at 0x1f8770d9e10>

In [6]:
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

###### We found 3 elements using the children method. We can see the type of the elements using the type function

In [7]:
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

- The first element is a Doctype object, which contains information about the type of the document
- The second element is a NavigableString, which is text found in the code
- And finally the last element is a Tag object. This is the object we are looking for, because it contains the other tags and the information that we need.
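
###### Instead of counting positions by hand, you could keep only the children that are Tag objects. This is a minimal sketch that parses an inline copy of the example page, so it runs without downloading anything. The variable names are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline copy of the simple example page, so no download is needed.
html_doc = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the children that are real tags, skipping the doctype and text nodes.
tag_children = [child for child in soup.children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]
print(tag_names)
```

###### Filtering by type is less fragile than a fixed index, because the position of the html tag changes if the page has extra line breaks or comments.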

###### We can now select the html tag saving the last element of the previous list into a new variable

In [8]:
html = list(soup.children)[2]
html

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

###### Each tag returned by children is a Tag object that supports the same navigation as the BeautifulSoup object, so we can use children on those returned tags as well

###### So here we go again !!

In [9]:
list(html.children)

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

###### Now we have 5 elements. Three of them are just text (line breaks). The other two are the head tag and the body tag, and the body tag contains the p tag that we want to extract (we are almost there!)

###### So we have to repeat the same as before

In [10]:
body = list(html.children)[3]
body

<body>
<p>Here is some simple content for this page.</p>
</body>

###### And now our last effort to get the data

In [11]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [12]:
p = list(body.children)[1]
p

<p>Here is some simple content for this page.</p>

###### We finally have the p tag in our pocket. To get its content, we use the get_text method on our variable, which extracts all of the text inside the tag

In [13]:
p.get_text()

'Here is some simple content for this page.'
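
###### As a side note, BeautifulSoup also lets you write a tag name as an attribute, which returns the first tag with that name below the current point. The sketch below reaches the same paragraph in one line, using an inline copy of the example page so it runs without a download.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page, so no download is needed.
html_doc = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Each attribute access returns the first matching tag below that point.
paragraph_text = soup.html.body.p.get_text()
print(paragraph_text)
```

###### If one of the tags in the chain does not exist, the attribute access returns None and the next step raises an AttributeError, so this shortcut works best on pages whose structure you already know.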

## We did it !!!

###### We can now learn a better way to do the same

## Finding all instances of a tag at once

###### All the previous steps were needed to help you understand how to navigate through a web page and get the data that you want, but as you can see it takes a lot of code and effort to get the data.
###### There is another way to do the same thing. We can use the find_all method to get all the instances of a tag in just one shot.

In [14]:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

###### Note that the find_all method returns a list of elements, so we need to use an index to extract the text

In [15]:
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

###### If you only want the first occurrence of a particular tag in the HTML code, you can use the 'find' method

In [16]:
soup.find('p')

<p>Here is some simple content for this page.</p>

###### Note that the previous output is not a list, so you can call the get_text method on it directly, without using an index
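
###### The two methods also behave differently when nothing matches: find_all gives back an empty list, while find gives back None. This minimal sketch uses a made-up one-paragraph page to show both cases and a simple guard before calling get_text.

```python
from bs4 import BeautifulSoup

# Made-up page with a single paragraph and no tables.
soup = BeautifulSoup("<html><body><p>Only paragraph.</p></body></html>", "html.parser")

first_p = soup.find("p")              # a single Tag
all_tables = soup.find_all("table")   # an empty list: the page has no tables
first_table = soup.find("table")      # None: there is nothing to return

print(first_p.get_text())
print(len(all_tables))
print(first_table)

# Guard against None before calling get_text()
table_text = first_table.get_text() if first_table is not None else ""
```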

## Searching for tags using class and id attributes

###### Class and Id are two attributes that can be included in HTML tags.
- The Class attribute allows you to categorize one or more elements into groups
- The Id attribute allows you to identify an element. A given Id can be used by just one element, and an element can have just one Id

###### We can use these attributes when we are scraping to target particular elements in our searches, so we can filter effectively

###### To understand this we are going to use the following web page: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html

In [17]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

###### Now we are going to search for any p tag that has the class outer-text

In [18]:
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

###### You can also search for elements using the Id attribute

In [19]:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]
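
###### The results above still carry the line breaks and indentation of the original HTML. As a sketch, you could pass strip=True to get_text to obtain clean strings. The HTML below is a shortened inline copy of the example page so the code runs without a download.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the ids_and_classes example page.
html_doc = """
<div>
  <p class="inner-text first-item" id="first">
      First paragraph.
  </p>
</div>
<p class="outer-text first-item" id="second">
  <b>
      First outer paragraph.
  </b>
</p>
<p class="outer-text">
  <b>
      Second outer paragraph.
  </b>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches a tag if any one of its classes equals the given name.
outer = soup.find_all("p", class_="outer-text")

# strip=True trims the whitespace and line breaks around each piece of text.
outer_texts = [p.get_text(strip=True) for p in outer]
print(outer_texts)

# An id should be unique, so find() is enough to fetch that one element.
first = soup.find(id="first")
first_text = first.get_text(strip=True)
print(first_text)
```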

## CSS selectors

###### When you are searching for elements you can also use CSS selectors. Remember that CSS uses selectors to specify which HTML tags a style applies to. Here are some examples:
- p a : finds all "a" tags inside of a "p" tag
- body p a : finds all "a" tags inside of a "p" tag inside of a "body" tag
- html body : finds all "body" tags inside of an "html" tag
- p.outer-text : finds all "p" tags with a class of "outer-text"
- p#first : finds all "p" tags with an id of "first"
- body p.outer-text : finds any "p" tags with a class of "outer-text" inside of a "body" tag

###### The great thing is that BeautifulSoup objects support searching a web page with CSS selectors. The only difference is that you have to use the select method.
###### Note that the select method returns a list of elements, just like find_all (find returns a single element instead)

In [20]:
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
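
###### Here is a short sketch that tries a few more of the selectors listed above on an inline, condensed copy of the same page. It also uses select_one, which returns only the first match instead of a list. The variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Condensed inline copy of the ids_and_classes example page.
html_doc = """
<html>
<body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.select("p.outer-text")        # p tags with class outer-text
by_id = soup.select("p#first")                # p tags with id first
nested = soup.select("body p.outer-text")     # same class, but inside body
second = soup.select_one("p#second")          # first match only, not a list

print(len(by_class), len(by_id), len(nested))
print(second.get_text(strip=True))
```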

## Parsel

###### Parsel is a Python library for extracting data from XML/HTML text using CSS or XPath selectors.

###### In this case we are going to use the following html code

In [21]:
html = u'''
<ul>
    <li><a href="http://blog.scrapinghub.com">Blog</a></li>
...
    <li><a href="https://www.scrapinghub.com">Scrapinghub</a></li>
...
    <li class="external"><a href="http://www.scrapy.org">Scrapy</a></li>
</ul>
'''


###### Now we have to import the Parsel library, load the HTML into a Parsel Selector and extract the links with an XPath expression

In [22]:
import parsel
sel = parsel.Selector(html)
sel.xpath("//a/@href").extract()

['http://blog.scrapinghub.com',
 'https://www.scrapinghub.com',
 'http://www.scrapy.org']

###### One of the best features of Parsel is the ability to chain selectors !!

In [23]:
sel.css('li.external').xpath('./a/@href').extract()

['http://www.scrapy.org']

###### You can also iterate through the results of the .css() and .xpath() methods since each element will be another selector

In [24]:
for li in sel.css('ul li'):
    print(li.xpath('./a/@href').extract_first())

http://blog.scrapinghub.com
https://www.scrapinghub.com
http://www.scrapy.org


# Thank you !!