# Demonstration of basic Web Scraping

Here is the deal: learn in one page how to scrape your Facebook contacts, retrieve all their birthdays, and structure the data!

Facebook does not offer the desired list of birthdays as a spreadsheet download, so this is an example of unstructured data, and in this case we will **scrape** it from the web. Some websites have an API (Application Programming Interface) that allows a user to retrieve structured data, for example by directly querying their database. Levels and constraints of access depend on the specific website, and although Facebook has an API, we want to demonstrate here the skill of scraping data, which is often useful in Data Science.

The basic concept is that data can be retrieved from a webpage programmatically by parsing the source-code (e.g. HTML). Since we might want to do this automatically on many pages we need to **access pages programmatically** via their URL and possibly move from page to page by jumping from one URL to the next.

The simplest tools to access URLs in Python are **urllib3** and **requests**.

## First Web requests and BeautifulSoup

Here is how to submit a request with the *requests* library:

In [14]:
import requests
url = 'https://www.google.ca/'
r = requests.get(url)

Now that we have made a request to get the URL, we need to look at the content, which is available as raw bytes through the attribute r.content:

In [18]:
r.content[:200]

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en-CA"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/'

A better way to visualize and navigate the source code is to use the BeautifulSoup library. We create an object that we will call *soup* and call the method *prettify()*, which re-indents the code into a readable form:

In [25]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, "lxml")
print(soup.prettify()[:400])

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en-CA">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <title>
   Google
  </title>
  <script>
   (function(){window.google={kEI:'Hho8Wb2bN8K_jwTjk4XIBQ',kEXPI:'1353382,1353799,3700324,37003
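
Beyond pretty-printing, the soup object can be queried for specific tags. The following is a minimal sketch on a small hand-written HTML snippet (an assumption for illustration, not the Google page above), using the built-in `html.parser` backend so that no extra parser is required:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
sample_html = """
<html>
  <head><title>Example page</title></head>
  <body>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Text inside the <title> tag
page_title = sample_soup.title.string

# href attribute of every <a> tag
links = [a.get("href") for a in sample_soup.find_all("a")]

print(page_title)
print(links)
```

The same two calls (`.title` and `.find_all`) should work on a soup built from a real response, although the tags worth looking for will differ from site to site.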


## Need for trying other libraries

What I first noticed is that the code returned from this kind of request is not identical to what I get by manually navigating to a page and pressing Ctrl+U to view the source. In my case I want to see all the data that I see on my Facebook page, and the above method was not doing that.

So I tried **urllib3** that can be used in this way:

In [26]:
#import the library used to query a website
import urllib3
import certifi

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
url = 'https://www.google.ca/'

In [27]:
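page = http.request('GET', url)
# page.data holds the response body as bytes; decode it rather than calling str() on bytes
html = page.data.decode('utf-8', errors='replace')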

In [30]:
soup = BeautifulSoup(page.data, "lxml")
print(soup.prettify()[:100])

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en-CA">
 <head>
  <met


Again one can retrieve the source code, but you still don't get the same content as when you navigate manually with your browser. This is because the request is not sent the way a browser sends it, and the website *recognizes* it as a bot request. The standard-library module **urllib.request** (not to be confused with **urllib3**) lets you extend the request with a *header* that identifies it as coming from a browser. Even though this still didn't do what I wanted, here is some example code:

In [None]:
import urllib.request
url = """https://www.facebook.com/"""
req = urllib.request.Request(
    url, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
test = f.read().decode('utf-8')

Therefore we need to find a way to simulate a browser and get an identical behaviour.

For this purpose a powerful tool is the **Selenium** library.

## Success: Webdriver from the Selenium library

This actually worked: inspecting the result obtained with the Selenium library, I found the same code that I would obtain manually. So here is how it works:

In [35]:
from selenium import webdriver  
from selenium.common.exceptions import NoSuchElementException  
from selenium.webdriver.common.keys import Keys  
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
_chrome_options = Options()

In [41]:
url = "https://www.facebook.com"

In [42]:
# Selenium 4 takes the options object via the `options` keyword
driver = webdriver.Chrome(options=_chrome_options)
driver.get(url)  

This code successfully opens the Chrome browser and navigates to the desired URL. Now we just have to get the page_source and save it in a variable:

In [43]:
html_source = driver.page_source  

In [44]:
soup = BeautifulSoup(html_source,'html.parser')
print(soup.prettify()[:200])

<!DOCTYPE html>
<html class="" id="facebook" lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <meta content="origin-when-crossorigin" id="meta_referrer" name="referr


In the case of Facebook, this code will not open my profile, but will actually navigate to the login page. We therefore have to instruct the **webdriver** to insert our credentials and then navigate to the pages we want:

In [46]:
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(options=_chrome_options)
driver.get(url)

# wait for the login elements to load
driver.implicitly_wait(10)

# identify relevant elements (Selenium 4 locator syntax)
email = driver.find_element(By.CSS_SELECTOR, 'input[type=email]')
password = driver.find_element(By.CSS_SELECTOR, 'input[type=password]')
login = driver.find_element(By.CSS_SELECTOR, 'input[value="Log In"]')

# insert the credentials (you will have to modify this)
email.send_keys('name.lastname@gmail.com')
password.send_keys('your_password')

# login
login.click()

# navigate to my friends list (you will have to get the right url for you)
driver.get('https://www.facebook.com/profile.php?id=.....blabla......._friends_tl')

# take a screenshot
#driver.get_screenshot_as_file('yourName-profile.png')

# get source-code from current page
html_source = driver.page_source  

# quit browser
#driver.quit()

In [47]:
soup = BeautifulSoup(html_source,'html.parser')
print(soup.prettify()[:200])

<!DOCTYPE html>
<html class="" id="facebook" lang="it" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <meta charset="utf-8"/>
  <meta content="origin-when-crossorigin" id="meta_referrer" name="referr


With this tool we can now navigate Facebook as if we were logged in to our account. To go from one page to another we just need to figure out where to find the links. If I want to navigate from my profile to my friends list page, I need to parse the source code and find the URL that corresponds to that link. It is now a matter of working out where those URLs are located within the page, extracting them from the soup with **regex** (Regular Expressions), and using another request to go to that page, get the source code from there, look for information or the next URL, and so on!
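
The link-following step can be sketched with the standard library alone. The page address and the HTML fragment below are made up for illustration (they are not Facebook's actual markup); the idea is only to show how href values can be pulled out with a regular expression and turned into absolute URLs ready for the next request:

```python
import re
from urllib.parse import urljoin

# Hypothetical page address and HTML fragment, for illustration only
base_url = "https://www.example.com/profile/"
html_fragment = (
    '<a href="/friends?page=2">Next</a> '
    '<a href="photos">Photos</a> '
    '<a href="https://www.example.com/about">About</a>'
)

# Capture whatever sits between the quotes of each href attribute
hrefs = re.findall(r'href="([^"]+)"', html_fragment)

# Turn relative links into absolute URLs that can be requested next
next_urls = [urljoin(base_url, h) for h in hrefs]

for u in next_urls:
    print(u)
```

Each entry of `next_urls` could then be passed to `driver.get(...)`, repeating the cycle of loading a page, reading its source and extracting the next links.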

Welcome to the art of Web Scraping!

# Parsing HTML code with RegEx

Even if every page has different code, there are patterns that stay the same. In every project we typically search for similar information in very similar pages, so we can identify the code patterns that contain what we need and scrape it. For this we need to learn how to use RegEx, or REGular EXpressions, a ubiquitous tool in programming, independent of the language.

Here we will use this tool in Python to search my friends list for their Facebook URLs, because we want to first go to their profile, then to their info page, and find the date of birth. So here we go.

In [50]:
import re

This imports the Python regex library. If you are already familiar with this concept you can move on and are ready to do your own scraping now; otherwise stick around and see how exactly we can get the info we want. First we need a couple of examples to explain how parsing works.

Let's pretend that we want to find the important dates and people names in the following text:

*DNA was first isolated by Friedrich Miescher in 1869. Its molecular structure was identified by James Watson and Francis Crick from Cold Spring Harbor Laboratory in 1953, whose model-building efforts were guided by X-ray diffraction data acquired by Raymond Gosling, who was a post-graduate student of Rosalind Franklin.*

We want to isolate the numbers and names, here is how we could start doing that:

In [51]:
text = 'DNA was first isolated by Friedrich Miescher in 1869.   \
        Its molecular structure was identified by James Watson  \
        and Francis Crick from Cold Spring Harbor Laboratory in \
        1953, whose model-building efforts were guided by X-ray \
        diffraction data acquired by Raymond Gosling, who was a \
        post-graduate student of Rosalind Franklin.'
# raw string (r'...') so that the backslash reaches the regex engine unchanged
dates = re.findall(r'\d{4}', text)
dates

['1869', '1953']

Here we used the function re.findall(*pattern*, *string*), where we looked for the pattern *\d{4}* in the string contained in *text*. The pattern combines one of the regex **identifiers** (\d), which matches a digit, with the **modifier** ({4}), which specifies how many digits in a row constitute our pattern. And the result is what we wanted.
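
One caveat: *\d{4}* also matches the first four digits of any longer number. A minimal sketch, on a made-up sentence, of how the word boundary `\b` can restrict the match to stand-alone four-digit numbers:

```python
import re

# Made-up sentence containing a year and a longer number
sample = "The paper from 1953 was cited 1234567 times."

# Without boundaries, the long number contributes a match as well
loose = re.findall(r"\d{4}", sample)

# \b requires a word boundary on both sides of the four digits
strict = re.findall(r"\b\d{4}\b", sample)

print(loose)   # ['1953', '1234']
print(strict)  # ['1953']
```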

If we now wanted to find the names of people we would do:

In [55]:
# raw string (r'...') so that \s and \W reach the regex engine unchanged
names = re.findall(r'[A-Z][a-z]*\s[A-Z][a-z]*\W', text)
names

['Friedrich Miescher ',
 'James Watson ',
 'Francis Crick ',
 'Cold Spring ',
 'Harbor Laboratory ',
 'Raymond Gosling,',
 'Rosalind Franklin.']

The pattern here is more complicated, but can be broken down like this:  
**[A-Z]**   &nbsp;&nbsp; any capital letter followed by  
**[a-z]**   &nbsp;&nbsp;&nbsp; any lower case letter with the modifier  
**\***      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; zero or more repetitions (of lower case letters)  
**\s**      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a whitespace character (here, a space)  
**[A-Z][a-z]\***  &nbsp;&nbsp;&nbsp;&nbsp; another word that starts with a capital letter  
**\W**      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; followed by any non-word character (anything other than a letter, digit or underscore)

For a full tutorial on Regex see: https://www.youtube.com/watch?v=sZyAn2TW7GY

Notice that this code successfully avoids counting the capitalized words at the beginning of each sentence, but fails to recognize that **Cold Spring Harbor Laboratory** is not a person's name. There are other clever things that can be implemented to clean this list programmatically, but they go beyond the scope of this brief tutorial.
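
Just to hint at what such a cleanup could look like, here is a minimal sketch on a shortened version of the text. It uses a capture group to drop the trailing non-word character, and a small hand-made stop-list (`not_person`, a hypothetical name and an assumption of this example) to discard pairs that belong to a place name. A real project would need something more robust than a fixed word list:

```python
import re

text = ("DNA was first isolated by Friedrich Miescher in 1869. "
        "Its molecular structure was identified by James Watson "
        "and Francis Crick from Cold Spring Harbor Laboratory in 1953.")

# Parentheses mark the part of the match that findall returns,
# so the trailing non-word character is no longer included
pairs = re.findall(r"([A-Z][a-z]*\s[A-Z][a-z]*)\W", text)

# Hypothetical stop-list of words that signal a place, not a person
not_person = {"Spring", "Harbor", "Laboratory"}
people = [p for p in pairs if not set(p.split()) & not_person]

print(pairs)
print(people)
```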

# Retrieving the right URLs

To be continued!