## TP 6 : crawling the web  
## Adrien HANS & Tanguy JEANNEAU

The aim of this notebook is to show how Python can crawl the web and collect data automatically. This is done here with the `urllib3` package.

Other packages are useful:
- _os_ & _sys_ for operating system instructions,
- _re_ for **r**egular **e**xpressions when manipulating text strings. See <https://docs.python.org/3/library/re.html>
- _datetime_ to play with... dates & times.

In [1]:
import os,sys
import urllib3   # "python -m pip install urllib3" in the command shell if necessary
import re

from datetime import datetime, date

To keep the notebook output readable, we disable the warnings from urllib3:

In [2]:
urllib3.disable_warnings()

To send requests through Centrale Lille's proxy when needed:

In [3]:
# Set to True to send requests through Centrale Lille's proxy
centrale_proxy = False
if centrale_proxy:
    proxy = urllib3.ProxyManager('http://cache.ec-lille.fr:3128')
else:
    proxy = urllib3.PoolManager()

# See https://stackoverflow.com/questions/40490187/get-proxy-address-on-auto-proxy-discovery-mode-on-mac-os-x
# scutil --proxy

### Example: parsing the web page <http://tycho.usno.navy.mil/timer.html>
The purpose of this example is to retrieve the current US Eastern time from this web page.

In [4]:
# Internet access: fetch the web page
response = proxy.request('GET','http://tycho.usno.navy.mil/timer.html')

Decode the response content (raw bytes) into text:

In [5]:
texte_utf8 = response.data.decode('utf-8')
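
`response.data` holds raw bytes, so it has to be decoded before any text search. The sketch below assumes the page is encoded in UTF-8 and uses a hypothetical helper, `decode_body` (not part of urllib3), that replaces undecodable bytes instead of raising an error. The sample bytes stand in for a real response, so no network access is needed.

```python
def decode_body(raw, encoding='utf-8'):
    """Decode raw response bytes into text (hypothetical helper).

    Bytes that are invalid in the given encoding are replaced by the
    Unicode replacement character instead of raising UnicodeDecodeError.
    """
    return raw.decode(encoding, errors='replace')


# Sample bytes standing in for response.data
sample = "10:15:38 AM EDT".encode('utf-8')
print(decode_body(sample))
```

Whether replacing bad bytes is acceptable depends on the use case; for a regex search on ASCII digits it is usually harmless.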
#texte_utf8 = "10:15:38 AM EDT"     # Sample string to check the regex, e.g. on http://regexr.com

Look for eastern time in the web page and print it:

In [6]:
# Search data using Regex (regular expressions)
# Test regex http://regexr.com

regex = "[0-9]+:[0-9]+:[0-9]+ [AP]M EDT"   # looking for a time of the form 10:16:18 AM or 10:16:18 PM, followed by EDT
web_time = re.search(regex, texte_utf8)

print(web_time.group(0))

02:30:41 PM EDT
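
`re.search` returns `None` when nothing matches, in which case calling `.group(0)` raises an `AttributeError`. The sketch below works on an illustrative sample string (not fetched from the web), guards against a missing match, and converts the matched text into a `datetime.time` object. It assumes the default locale, where `%p` recognises `AM` and `PM`.

```python
import re
from datetime import datetime

# Illustrative sample text, standing in for the decoded web page
sample_text = "Eastern: 02:30:41 PM EDT"

# The parentheses capture the time and its AM/PM marker, without the zone name
match = re.search(r"([0-9]+:[0-9]+:[0-9]+ [AP]M) EDT", sample_text)

if match is None:
    eastern_time = None
    print("No Eastern time found")
else:
    # %I is the hour on a 12-hour clock, %p the AM/PM marker
    eastern_time = datetime.strptime(match.group(1), "%I:%M:%S %p").time()
    print(eastern_time)
```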


### Exercise 1
**We do the same as above for Western time:**
<br>
We take "Western time" to mean UTC (Coordinated Universal Time), the reference for western European time.
<br>
We therefore search for the same regular expression (regex), with "UTC" instead of "EDT" (Eastern Daylight Time) and without the AM/PM marker, since the UTC time is given on a 24-hour clock.
<br> We then only have to rerun the previous cell with this new regex to get the wanted result.

In [7]:
# Search data using Regex (regular expressions)
# Test regex http://regexr.com

regex = "[0-9]+:[0-9]+:[0-9]+ UTC"   # looking for a time of the form 10:16:18 UTC
web_time = re.search(regex, texte_utf8)

print(web_time.group(0))

18:30:41 UTC


**This is indeed the current Coordinated Universal Time, and thus the Western time that we wanted to get.**
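
Since the two searches differ only by the zone name and the optional AM/PM marker, they could be merged into one function. This is a minimal sketch: `find_time` is a hypothetical helper and `sample_text` is an illustrative string, not the real page content.

```python
import re


def find_time(text, zone):
    """Return the first 'HH:MM:SS [AM|PM] ZONE' string found in text, or None."""
    # (?:[AP]M )? makes the AM/PM marker optional; re.escape protects the zone name
    pattern = r"[0-9]+:[0-9]+:[0-9]+ (?:[AP]M )?" + re.escape(zone)
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Illustrative sample text
sample_text = "02:30:41 PM EDT ... 18:30:41 UTC"
for zone in ("EDT", "UTC", "CET"):
    print(zone, find_time(sample_text, zone))
```

A zone that does not appear in the text, such as `CET` here, gives `None` instead of an exception.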

### Exercise 2
**Information from Wikipedia**

Retrieve the date of birth and age of the following actors from Wikipedia: Brad Pitt, Laurent Cantet, Jean-Paul Belmondo, Matthew McConaughey, ... and if possible Marion Cotillard, plus any others you may want to add to this list.

To this aim, we fetch each actor's Wikipedia page and look for a date that is likely to be the date of birth.

Then we convert it to a date object and compare it with today's date to estimate their age.
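
One way to get the age in full years is to subtract the birth year from the current year, then remove one year if this year's birthday has not occurred yet. The sketch below uses a hypothetical helper, `age_in_years`, and made-up dates chosen only for illustration.

```python
from datetime import date


def age_in_years(born, today):
    """Age in full years on the day `today` for someone born on `born`."""
    # Tuples compare element by element: month first, then day
    before_birthday = (today.month, today.day) < (born.month, born.day)
    # A boolean counts as 0 or 1 in arithmetic
    return today.year - born.year - before_birthday


born = date(1980, 6, 15)                       # illustrative birth date
print(age_in_years(born, date(2020, 6, 14)))   # day before the birthday
print(age_in_years(born, date(2020, 6, 15)))   # on the birthday
```

Unlike dividing a number of days by an average year length, this comparison does not drift around birthdays or leap years.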

We define a class `Actor`, with `firstname` and `name` as attributes:

In [8]:
# for Brad Pitt or any actor
class Actor:
    def __init__(self, firstname, name):
        self.name = name
        self.firstname = firstname   

We define some actors to test our algorithm on:

In [9]:
List_Actors=[Actor('Matthew','McConaughey'),Actor('Marion','Cotillard'),Actor('Jean-Paul','Belmondo'),Actor('Laurent','Cantet'),Actor('Jean-Pierre','Leaud'),Actor('Anna','Karina'),Actor('Jean-Claude','Brialy'),Actor('Maurice','Ronet')]
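
Each page address is obtained by joining the first name and the name with an underscore. Names with accents or spaces may need percent-encoding; a possible sketch uses `urllib.parse.quote` from the standard library. The helper `wiki_url` and its `base` parameter are hypothetical names introduced for illustration.

```python
from urllib.parse import quote


def wiki_url(firstname, name, base='https://fr.wikipedia.org/wiki/'):
    """Build the Wikipedia page address for an actor (hypothetical helper)."""
    # Page titles use underscores instead of spaces
    title = (firstname + '_' + name).replace(' ', '_')
    # quote() percent-encodes non-ASCII characters and leaves '-' and '_' as they are
    return base + quote(title)


print(wiki_url('Jean-Pierre', 'Léaud'))
print(wiki_url('Marion', 'Cotillard'))
```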

We then iterate over the actors:

In [10]:
for person in List_Actors:
    # Fetch the actor's French Wikipedia page
    response = proxy.request('GET', 'https://fr.wikipedia.org/wiki/' + person.firstname + '_' + person.name)

    # Decode the page bytes into text
    texte_utf8 = response.data.decode('utf-8')

    # Regular expression of the birth date in Wikipedia:
    # we first match the date together with the attribute around it,
    # to make it more likely that we are getting the birth date
    regex_expression = 'datetime="[0-9]{4}-[0-9]{2}-[0-9]{2}"'

    # Search for the birth date expression (first match in the page)
    birth_date_expression = re.search(regex_expression, texte_utf8)

    # Regex of the date itself (YYYY-MM-DD)
    regex_actual_date = "[0-9]{4}-[0-9]{2}-[0-9]{2}"

    # Extract the birth date from the matched expression
    birth_date = re.search(regex_actual_date, birth_date_expression.group(0))

    # Birth date as a string
    web_date = birth_date.group(0)

    # Convert the string to a date object
    born = datetime.strptime(web_date, '%Y-%m-%d').date()

    # Today's date
    now = date.today()

    # Rough estimate of the age from the number of days elapsed (not used below)
    age = now - born
    approx_age = age.days / 365.25

    # Exact age in full years: subtract 1 if this year's birthday has not occurred yet
    result = now.year - born.year - ((now.month, now.day) < (born.month, born.day))

    # Print the result for each actor
    print('According to Wikipedia, ' + person.firstname + ' ' + person.name + ' is ' + str(result) + ' years old.')

According to Wikipedia, Matthew McConaughey is 49 years old.
According to Wikipedia, Marion Cotillard is 44 years old.
According to Wikipedia, Jean-Paul Belmondo is 86 years old.
According to Wikipedia, Laurent Cantet is 58 years old.
According to Wikipedia, Jean-Pierre Leaud is 75 years old.
According to Wikipedia, Anna Karina is 79 years old.
According to Wikipedia, Jean-Claude Brialy is 86 years old.
According to Wikipedia, Maurice Ronet is 92 years old.


**For the actors tested, the results were correct, even for Marion Cotillard, thanks to the two regular expressions that the birth date must match.**
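
The extraction step can also be isolated in a function and tested offline. This is a sketch under assumptions: `extract_birth_date` is a hypothetical helper, and the HTML snippet only imitates the `datetime="..."` attribute searched for above; it is not copied from Wikipedia. The function returns `None` when no date is found, so that a page without a birth date does not stop the loop.

```python
import re
from datetime import datetime


def extract_birth_date(html):
    """Return the first datetime="YYYY-MM-DD" date found in html, or None."""
    # The parentheses capture the date itself, without the attribute around it
    match = re.search(r'datetime="([0-9]{4}-[0-9]{2}-[0-9]{2})"', html)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%Y-%m-%d').date()


# Hypothetical snippet, made up for this test
snippet = '<time datetime="1980-06-15">15 juin 1980</time>'
print(extract_birth_date(snippet))
print(extract_birth_date('<p>no date here</p>'))
```

The first date on a page is not guaranteed to be the birth date, so results for new actors are worth checking by hand.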