### Tutorial from Dataquest: https://www.dataquest.io/blog/web-scraping-tutorial-python/

# National Weather Service

* To scrape a web page, first download it with the 'requests' library, which sends a GET request to the web server.

In [2]:
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page

<Response [200]>

In [3]:
page.status_code  # a status code of 200 means the page was downloaded successfully

200

* Status codes starting with '2' indicate success, whereas codes starting with '4' or '5' indicate an error (404, 500, ...).
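
As a rough illustration of the rule above, the sketch below maps a status code to its class using the first digit. The helper name `status_category` is hypothetical and not part of `requests`; with a real response, `page.raise_for_status()` raises an exception on 4xx and 5xx codes.

```python
# Hypothetical helper: classify an HTTP status code by its first digit.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_category(status_code: int) -> str:
    """Return the class of an HTTP status code, or 'unknown'."""
    return CATEGORIES.get(status_code // 100, "unknown")


for code in (200, 404, 500):
    print(code, status_category(code))
```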

In [4]:
""" Print the page content"""

' Print the page content'

In [5]:
page.content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
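
The `b'...'` prefix shows that `page.content` is a bytes object, with newlines displayed as `\n`. A small sketch, assuming the page is UTF-8 encoded, of decoding those bytes into readable text (the literal below is copied from the output above):

```python
# Bytes copied from the output of page.content above.
content = (
    b'<!DOCTYPE html>\n<html>\n    <head>\n'
    b'        <title>A simple example page</title>\n    </head>\n'
    b'    <body>\n        <p>Here is some simple content for this page.</p>\n'
    b'    </body>\n</html>'
)

# Assumption: the page is UTF-8 encoded.
text = content.decode("utf-8")
print(text)
```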

### Parsing content with BeautifulSoup

Syntactic analysis (parsing)

In [6]:
""" Extract the content of the 'p' tags from a simple page"""

" Extract the content of the 'p' tags from a simple page"

In [7]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content,'html.parser')

In [8]:
soup

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [11]:
""" Display the BeautifulSoup object formatted like source code"""
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [15]:
""" Display the top-level children of the document as a list"""
list(soup.children)

['html', '\n', <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]

In [16]:
""" Show the type of each element in the list"""
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

In [19]:
""" Select the third element of the list, namely the Tag element"""
html = list(soup.children)[2]

In [20]:
print(html)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


In [21]:
list(html.children)  # we end up with two tags: head and body

['\n', <head>
 <title>A simple example page</title>
 </head>, '\n', <body>
 <p>Here is some simple content for this page.</p>
 </body>, '\n']

In [23]:
""" Go one level down, into body """
body = list(html.children)[3]

In [24]:
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [25]:
list(body.children)

['\n', <p>Here is some simple content for this page.</p>, '\n']

In [26]:
p = list(body.children)[1]

In [27]:
p

<p>Here is some simple content for this page.</p>

In [29]:
""" Extract only the text"""
print(p.get_text())

Here is some simple content for this page.
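
The index-based traversal above depends on the exact position of the whitespace strings. Below is a minimal sketch of the same walk that keeps only `Tag` children, so the positions of the newline strings no longer matter. The helper `tag_children` is a hypothetical name, and the HTML is inlined so the example runs without a network request.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


def tag_children(node):
    """Return only the Tag children of a node (hypothetical helper)."""
    return [child for child in node.children if isinstance(child, Tag)]


soup = BeautifulSoup(HTML, "html.parser")
html = tag_children(soup)[0]      # skips the Doctype and the newline string
head, body = tag_children(html)   # the two tags under <html>
p = tag_children(body)[0]         # the single <p> under <body>
print(p.get_text())
```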


### Finding all instances of a tag at once

In [31]:
soup = BeautifulSoup(page.content,'html.parser')
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [33]:
""" Returns a list, so index or loop over it to retrieve the elements"""
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'
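
Because `find_all` returns a list, an empty match is simply an empty list, and indexing it with `[0]` would raise an `IndexError`. A small sketch of looping over the results and guarding against the empty case, using a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
soup = BeautifulSoup("<body><p>One</p><p>Two</p></body>", "html.parser")

# Loop over every match to collect the text.
paragraphs = [p.get_text() for p in soup.find_all("p")]

# No <table> exists, so the result is empty; check before indexing.
tables = soup.find_all("table")
first_table_text = tables[0].get_text() if tables else None

print(paragraphs, first_table_text)
```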

### The first instance of a tag

In [35]:
soup.find('p')

<p>Here is some simple content for this page.</p>

### Searching for tags by 'class' and 'id'

In [37]:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content,'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <div>
   <p class="inner-text first-item" id="first">
    First paragraph.
   </p>
   <p class="inner-text">
    Second paragraph.
   </p>
  </div>
  <p class="outer-text first-item" id="second">
   <b>
    First outer paragraph.
   </b>
  </p>
  <p class="outer-text">
   <b>
    Second outer paragraph.
   </b>
  </p>
 </body>
</html>


In [38]:
""" Find the 'p' tags with the class 'outer-text' """

" Find the 'p' tags with the class 'outer-text' "

In [39]:
soup.find_all('p',class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [41]:
""" Return any tag with the class 'outer-text'"""
soup.find_all(class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [42]:
""" Find elements by 'id' """
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

### Using CSS selectors

CSS selector searches are possible with '.select()'
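
A short sketch of a few selector forms, run on an inline copy of the `ids_and_classes.html` markup shown earlier (condensed by hand, so whitespace differs from the real page). The helper `texts` is a hypothetical convenience function.

```python
from bs4 import BeautifulSoup

HTML = """
<html>
 <body>
  <div>
   <p class="inner-text first-item" id="first">First paragraph.</p>
   <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
  <p class="outer-text"><b>Second outer paragraph.</b></p>
 </body>
</html>
"""

soup = BeautifulSoup(HTML, "html.parser")


def texts(selector):
    """Return the stripped text of every tag matching a CSS selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


print(texts("div p"))         # p tags anywhere inside a div
print(texts("body > p"))      # p tags that are direct children of body
print(texts("p.first-item"))  # p tags carrying the class first-item
print(texts("p#first"))       # the p tag whose id is first
```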

In [43]:
soup.select('div p')  # returns the 'p' tags located inside a 'div'

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]

In [44]:
soup.select('p#first')

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

# ------------------------------------------------------------------- 

### Downloading the weather data
Scrape the weather data for San Francisco

In [53]:
page = requests.get('https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168#.WwqS2yDA9pg')
page

<Response [200]>

In [54]:
soup = BeautifulSoup(page.content,'html.parser')

In [55]:
seven_day = soup.find('div',id='seven-day-forecast')

In [57]:
print(seven_day.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   San Francisco CA
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Today
       <br/>
       <br/>
      </p>
      <p>
       <img alt="Today: Sunny, with a high near 70. West southwest wind 5 to 10 mph increasing to 16 to 21 mph in the afternoon. Winds could gust as high as 28 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 70. West southwest wind 5 to 10 mph increasing to 16 to 21 mph in the afternoon. Winds could gust as high as 28 mph. "/>
      </p>
      <p class="short-desc">
       Sunny
      </p>
      <p class="temp temp-high">
       High: 70


In [59]:
forecast_items = seven_day.find_all(class_='tombstone-container')

In [60]:
tonight = forecast_items[1]

In [62]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Mostly clear, with a low around 54. West southwest wind 16 to 21 mph decreasing to 5 to 10 mph after midnight. Winds could gust as high as 28 mph. " class="forecast-icon" src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 54. West southwest wind 16 to 21 mph decreasing to 5 to 10 mph after midnight. Winds could gust as high as 28 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Clear
 </p>
 <p class="temp temp-low">
  Low: 54 °F
 </p>
</div>


### Extracting information from the page
* Name of the forecast item - (Tonight)
* Description of the conditions - (in the 'title' attribute of 'img')
* Short description of the conditions - (Mostly Clear)
* Low temperature - (54 °F)

In [69]:
period_name = tonight.find('p',class_='period-name').get_text()
#conditions = soup.find()
short_desc = tonight.find('p',class_='short-desc').get_text()
temp = tonight.find('p',class_='temp temp-low').get_text()


In [70]:
print(period_name,short_desc,temp)

Tonight Mostly Clear Low: 54 °F
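
A caveat on the `class_='temp temp-low'` lookup above: as the BeautifulSoup documentation describes it, a `class_` value containing a space is compared against the exact string of the `class` attribute, so the order of the class names matters. A small sketch on inline HTML (the one-line snippet is an assumption modeled on the forecast markup):

```python
from bs4 import BeautifulSoup

# Assumed snippet, modeled on the forecast markup above.
html = '<div><p class="temp temp-low">Low: 54 °F</p></div>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("p", class_="temp temp-low")    # same order as in the page
swapped = soup.find_all("p", class_="temp-low temp")  # same classes, other order
single = soup.find_all("p", class_="temp-low")        # one class name only
css = soup.select("p.temp-low.temp")                  # CSS selector, order-independent

print(len(exact), len(swapped), len(single), len(css))
```

Searching on a single class name or using a CSS selector tends to be less sensitive to changes in the page's markup.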


In [75]:
img = tonight.find('img')

In [76]:
desc = img['title']

In [77]:
desc

'Tonight: Mostly clear, with a low around 54. West southwest wind 16 to 21 mph decreasing to 5 to 10 mph after midnight. Winds could gust as high as 28 mph. '
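
The four lookups above can be gathered into one function. The sketch below is one possible arrangement, not the tutorial's code: `text_or_none` and `parse_forecast` are hypothetical names, and the inline HTML is a copy of the "Tonight" item with its `title` text shortened for readability.

```python
from bs4 import BeautifulSoup

# "Tonight" item from above; the alt/title text is abbreviated by hand.
TONIGHT_HTML = """
<div class="tombstone-container">
 <p class="period-name">Tonight<br/><br/></p>
 <p><img alt="Tonight: Mostly clear, with a low around 54." class="forecast-icon"
  src="newimages/medium/nfew.png" title="Tonight: Mostly clear, with a low around 54."/></p>
 <p class="short-desc">Mostly Clear</p>
 <p class="temp temp-low">Low: 54 °F</p>
</div>
"""


def text_or_none(container, selector):
    """Return the stripped text of the first match, or None if absent."""
    tag = container.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


def parse_forecast(container):
    """Collect the four fields of one forecast item into a dict."""
    img = container.find("img")
    return {
        "period": text_or_none(container, ".period-name"),
        "short_desc": text_or_none(container, ".short-desc"),
        "temp": text_or_none(container, ".temp"),
        "desc": img.get("title") if img is not None else None,
    }


container = BeautifulSoup(TONIGHT_HTML, "html.parser").find(class_="tombstone-container")
forecast = parse_forecast(container)
print(forecast)
```

Returning `None` for a missing field, rather than raising, is a design choice that keeps one incomplete item from stopping the whole scrape.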

### Extracting ALL the information from the page

In [78]:
period_tags = seven_day.select('.tombstone-container .period-name')

In [83]:
periods = [i.get_text() for i in period_tags]

In [84]:
periods

['Today',
 'Tonight',
 'MemorialDay',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday']
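
Names such as 'MemorialDay' and 'MondayNight' in the list above have lost their space. A likely cause, assuming the page separates the two words with a `<br>` tag rather than a space, is that `get_text()` joins the text fragments directly. A sketch of the effect and of the `separator` argument, on a made-up one-line snippet:

```python
from bs4 import BeautifulSoup

# Assumed markup: two words separated only by a <br/> tag.
tag = BeautifulSoup('<p class="period-name">Monday<br/>Night</p>', "html.parser").p

joined = tag.get_text()                            # fragments concatenated as-is
spaced = tag.get_text(separator=" ", strip=True)   # fragments joined with a space

print(repr(joined), repr(spaced))
```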

In [86]:
short_descs = [i.get_text() for i in seven_day.select('.tombstone-container .short-desc')]

In [90]:
temps = [i.get_text() for i in seven_day.select('.tombstone-container .temp')]

In [97]:
descs = [d['title'] for d in seven_day.select('.tombstone-container img')]

In [98]:
descs

['Today: Sunny, with a high near 70. West southwest wind 5 to 10 mph increasing to 16 to 21 mph in the afternoon. Winds could gust as high as 28 mph. ',
 'Tonight: Mostly clear, with a low around 54. West southwest wind 16 to 21 mph decreasing to 5 to 10 mph after midnight. Winds could gust as high as 28 mph. ',
 'Memorial Day: Sunny, with a high near 75. Light west wind increasing to 16 to 21 mph in the afternoon. Winds could gust as high as 28 mph. ',
 'Monday Night: Mostly clear, with a low around 55. West southwest wind 16 to 21 mph decreasing to 9 to 14 mph in the evening. Winds could gust as high as 26 mph. ',
 'Tuesday: Sunny, with a high near 68. West southwest wind 9 to 14 mph increasing to 16 to 21 mph in the morning. Winds could gust as high as 26 mph. ',
 'Tuesday Night: Patchy fog after 11pm.  Otherwise, mostly cloudy, with a low around 54.',
 'Wednesday: Patchy fog before 11am.  Otherwise, mostly sunny, with a high near 63.',
 'Wednesday Night: Partly cloudy, with a low a

### Combining our data into a Pandas Dataframe

In [104]:
import pandas as pd

weather = pd.DataFrame({'period':periods,'short_desc':short_descs,'temp':temps,'desc':descs})

In [105]:
weather

| | desc | period | short_desc | temp |
|---|---|---|---|---|
| 0 | Today: Sunny, with a high near 70. West southw... | Today | Sunny | High: 70 °F |
| 1 | Tonight: Mostly clear, with a low around 54. W... | Tonight | Mostly Clear | Low: 54 °F |
| 2 | Memorial Day: Sunny, with a high near 75. Ligh... | MemorialDay | Sunny | High: 75 °F |
| 3 | Monday Night: Mostly clear, with a low around ... | MondayNight | Mostly Clear | Low: 55 °F |
| 4 | Tuesday: Sunny, with a high near 68. West sout... | Tuesday | Sunny | High: 68 °F |
| 5 | Tuesday Night: Patchy fog after 11pm. Otherwi... | TuesdayNight | Patchy Fog | Low: 54 °F |
| 6 | Wednesday: Patchy fog before 11am. Otherwise,... | Wednesday | Patchy Fogthen Sunny | High: 63 °F |
| 7 | Wednesday Night: Partly cloudy, with a low aro... | WednesdayNight | Partly Cloudy | Low: 53 °F |
| 8 | Thursday: Mostly sunny, with a high near 63. | Thursday | Mostly Sunny | High: 63 °F |
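
Building a DataFrame from a dictionary of lists only works when every list has the same length; otherwise pandas raises a `ValueError`. A minimal sketch of checking the lengths first, using three rows taken from the table above (the `columns` and `lengths` names are illustrative):

```python
import pandas as pd

# Three rows copied from the scraped data above.
periods = ["Today", "Tonight", "MemorialDay"]
short_descs = ["Sunny", "Mostly Clear", "Sunny"]
temps = ["High: 70 °F", "Low: 54 °F", "High: 75 °F"]

columns = {"period": periods, "short_desc": short_descs, "temp": temps}

# A selector that misses one item would yield a shorter list, so check first.
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f"column lengths differ: {lengths}")

weather = pd.DataFrame(columns)
print(weather.shape)
```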


In [107]:
# extract the numeric part of each temperature (raw string so that \d is not treated as a string escape)
temp_nums = weather['temp'].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather['temp_num'] = temp_nums.astype('int')

In [108]:
temp_nums

0    70
1    54
2    75
3    55
4    68
5    54
6    63
7    53
8    63
Name: temp_num, dtype: object
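
The `dtype: object` line shows that the extracted values are still strings, which is why the cell above converts them with `astype('int')`. To illustrate what the pattern does on a single value, here is a sketch with the standard `re` module; `parse_temp` is a hypothetical helper, not part of pandas.

```python
import re

# Same named group as in the str.extract call above.
TEMP_PATTERN = re.compile(r"(?P<temp_num>\d+)")


def parse_temp(text):
    """Return the first run of digits in text as an int, or None."""
    match = TEMP_PATTERN.search(text)
    return int(match.group("temp_num")) if match else None


for raw in ("High: 70 °F", "Low: 54 °F", "n/a"):
    print(repr(raw), parse_temp(raw))
```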

In [109]:
weather

| | desc | period | short_desc | temp | temp_num |
|---|---|---|---|---|---|
| 0 | Today: Sunny, with a high near 70. West southw... | Today | Sunny | High: 70 °F | 70 |
| 1 | Tonight: Mostly clear, with a low around 54. W... | Tonight | Mostly Clear | Low: 54 °F | 54 |
| 2 | Memorial Day: Sunny, with a high near 75. Ligh... | MemorialDay | Sunny | High: 75 °F | 75 |
| 3 | Monday Night: Mostly clear, with a low around ... | MondayNight | Mostly Clear | Low: 55 °F | 55 |
| 4 | Tuesday: Sunny, with a high near 68. West sout... | Tuesday | Sunny | High: 68 °F | 68 |
| 5 | Tuesday Night: Patchy fog after 11pm. Otherwi... | TuesdayNight | Patchy Fog | Low: 54 °F | 54 |
| 6 | Wednesday: Patchy fog before 11am. Otherwise,... | Wednesday | Patchy Fogthen Sunny | High: 63 °F | 63 |
| 7 | Wednesday Night: Partly cloudy, with a low aro... | WednesdayNight | Partly Cloudy | Low: 53 °F | 53 |
| 8 | Thursday: Mostly sunny, with a high near 63. | Thursday | Mostly Sunny | High: 63 °F | 63 |


In [110]:
""" Mean of the High and Low temperatures"""

' Mean of the High and Low temperatures'

In [112]:
weather['temp_num'].mean()

61.666666666666664

In [113]:
""" Select only the night-time data"""
is_night = weather['temp'].str.contains('Low')
weather['is_night'] = is_night

In [114]:
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [115]:
weather[is_night]

| | desc | period | short_desc | temp | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight: Mostly clear, with a low around 54. W... | Tonight | Mostly Clear | Low: 54 °F | 54 | True |
| 3 | Monday Night: Mostly clear, with a low around ... | MondayNight | Mostly Clear | Low: 55 °F | 55 | True |
| 5 | Tuesday Night: Patchy fog after 11pm. Otherwi... | TuesdayNight | Patchy Fog | Low: 54 °F | 54 | True |
| 7 | Wednesday Night: Partly cloudy, with a low aro... | WednesdayNight | Partly Cloudy | Low: 53 °F | 53 | True |
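
The same boolean mask can be reused to compare night and day temperatures. A short sketch, rebuilding only the `temp` and `temp_num` columns from the values shown above so that it runs on its own:

```python
import pandas as pd

# Values copied from the weather table above.
weather = pd.DataFrame({
    "temp": ["High: 70 °F", "Low: 54 °F", "High: 75 °F", "Low: 55 °F", "High: 68 °F",
             "Low: 54 °F", "High: 63 °F", "Low: 53 °F", "High: 63 °F"],
    "temp_num": [70, 54, 75, 55, 68, 54, 63, 53, 63],
})

is_night = weather["temp"].str.contains("Low")

# Mean of the rows where the mask is True, then where it is False.
night_mean = weather.loc[is_night, "temp_num"].mean()
day_mean = weather.loc[~is_night, "temp_num"].mean()

print(night_mean, day_mean)
```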
