<a href="https://colab.research.google.com/github/abdatasci/probable-funicular/blob/master/Webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# The 'requests' module acts as a minimal HTTP client: it lets us request web pages from code,
# much as a browser does when you type a URL into its address bar.
import requests

In [2]:
# Here we specify which web page to load.
# requests.get() retrieves a page by its URL.
# In this example, the server's response is stored in a variable called 'page'.
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(page)

<Response [200]>
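
A status of 200 means the request succeeded. As a minimal sketch of how a fetch could be made more defensive, the helper below adds a timeout and raises on error statuses. The name `fetch_html` and the 10-second default are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw body of `url` (hypothetical helper, not from the notebook)."""
    # The timeout keeps the call from waiting indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")` should return the same bytes as `page.content`, provided the page is still reachable.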


In [3]:
# Let's examine the page's contents (i.e. the HTML inside the page).
# As you will see, we get back the raw bytes of the HTML; it would take some work to parse them
# and extract meaningful information.
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
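
The `b'...'` prefix in the output above shows that `page.content` is a `bytes` object rather than a string. The sketch below decodes a shortened copy of that output by hand, assuming UTF-8 encoding; `requests` also exposes the decoded string as `page.text`.

```python
# A shortened copy of the bytes printed above (no network access needed)
raw = b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n</html>'
print(type(raw))

# Assumption: the page is UTF-8 encoded
text = raw.decode("utf-8")
print(type(text))
print(text)
```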


In [4]:
# Load BeautifulSoup module
from bs4 import BeautifulSoup

In [5]:
# The BeautifulSoup constructor takes two arguments: the content of an HTML page and the name of a parser.
# The parser name tells BeautifulSoup which parser to use for the markup.
# BeautifulSoup can handle HTML and XML, depending on the parser chosen.

# Note that 'page.content' comes from the page that we loaded with the 'requests' module
# a few cells above.

soup = BeautifulSoup(page.content, 'html.parser')

In [6]:
# Let's see what we get back
print(soup)

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


In [7]:
# What we got by printing 'soup' looks OK, but a bit difficult to read without indentations and formatting.
# Let's make it look prettier

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>


In [8]:
# 'soup.children' yields the top-level nodes of the document; we can iterate through them
for item in list(soup.children):
    print(item)
    #print(type(item))

html


<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>
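
The three items printed above are different kinds of node, which is what the commented-out `print(type(item))` line would reveal. The sketch below rebuilds the same page from an inline string, so it needs no network access; `demo_soup` and `html_doc` are names made up for this example.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page
html_doc = (
    "<!DOCTYPE html>\n<html>\n    <head>\n"
    "        <title>A simple example page</title>\n    </head>\n"
    "    <body>\n        <p>Here is some simple content for this page.</p>\n"
    "    </body>\n</html>"
)
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Collect the class name of every top-level node
child_types = [type(item).__name__ for item in demo_soup.children]
print(child_types)  # ['Doctype', 'NavigableString', 'Tag']
```

The `Doctype` is the `<!DOCTYPE html>` declaration, the `NavigableString` is the newline that follows it, and the `Tag` is the `<html>` element.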


In [9]:
# We can also turn 'soup.children' into a list and pick a node by index; index 2 is the <html> tag
html = list(soup.children)[2]
print(html)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>


In [10]:
# List the direct children of the <html> tag
print(list(html.children))

['\n', <head>
<title>A simple example page</title>
</head>, '\n', <body>
<p>Here is some simple content for this page.</p>
</body>, '\n']


In [11]:
# Get HTML <body> tag contents
body = list(html.children)[3]
print(body)

<body>
<p>Here is some simple content for this page.</p>
</body>


In [12]:
# Take a look at elements inside the <body> tag
print(list(body.children))

['\n', <p>Here is some simple content for this page.</p>, '\n']
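
The `'\n'` entries are whitespace text nodes between tags, which is why the useful elements sit at odd indexes such as `[1]` and `[3]`. One way to avoid index arithmetic, sketched below on an inline fragment, is to keep only the children that are `Tag` objects. The names `demo` and `tags_only` are assumptions for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo = BeautifulSoup(
    "<body>\n<p>Here is some simple content for this page.</p>\n</body>",
    "html.parser",
)

# Keep only element nodes and drop the whitespace strings
tags_only = [child for child in demo.body.children if isinstance(child, Tag)]
print(tags_only)
```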


In [16]:

# In this example, we can grab individual tag <p>
p = list(body.children)[1]

In [17]:
# And get the text stored inside that tag
p.get_text()

'Here is some simple content for this page.'

In [18]:
# The find_all() method finds all instances of a particular element in an HTML page.
# In this example, we get back a list with all instances of the <p> element.
soup.find_all('p')

[<p>Here is some simple content for this page.</p>]

In [19]:
# Get text inside the first <p> element in an HTML page
soup.find_all('p')[0].get_text()

'Here is some simple content for this page.'

In [20]:
# We can also use the find() function to find only the first instance of a particular element
soup.find('p').get_text()

'Here is some simple content for this page.'
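
When nothing matches, `find()` returns `None` while `find_all()` returns an empty list, so chaining `.get_text()` directly onto `find()` can fail on pages that lack the element. A minimal sketch of guarding against that, using an inline fragment with no `<h1>`:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>Here is some simple content for this page.</p>", "html.parser")

heading = demo.find("h1")        # there is no <h1> in this fragment
print(heading)                   # None

# Guard before calling methods on the result
heading_text = heading.get_text() if heading is not None else ""

all_headings = demo.find_all("h1")
print(len(all_headings))         # 0
```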

In [21]:
# Load a web page's contents
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
# Parse web page with BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup)

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>


In [22]:
# Find all <p> tags that have the CSS class 'outer-text'
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [23]:
# Find all tags that have the CSS class 'outer-text'
soup.find_all(class_="outer-text")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>, <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]

In [24]:
# Find all HTML elements with id = 'first'
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [25]:
# Criteria can be combined: find the first <p> with class 'outer-text' and id 'second'
soup.find('p', class_='outer-text', id='second')

<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>

In [26]:
# We can also use CSS selectors to pick out nested elements.
# In the example below, we ask BeautifulSoup for every <p> element that is nested anywhere inside a <div> element
soup.select("div p")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
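
The searches from the previous cells can also be written as CSS selectors. The sketch below uses a condensed inline copy of the page (whitespace trimmed for brevity) and shows a few equivalents; `demo` and the result names are made up for this example.

```python
from bs4 import BeautifulSoup

demo_html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
demo = BeautifulSoup(demo_html, "html.parser")

outer = demo.select("p.outer-text")    # same elements as find_all('p', class_='outer-text')
first = demo.select_one("#first")      # same element as find(id='first')
direct = demo.select("div > p")        # only <p> tags that are direct children of a <div>

# class_ matches when ANY one of an element's classes equals the given name
first_items = demo.find_all(class_="first-item")

print(len(outer), first.get_text(strip=True), len(direct), len(first_items))
```

Note that `"div p"` matches descendants at any depth, while `"div > p"` matches direct children only; on this page both return the same two paragraphs.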

In [27]:
# Let's examine the https://forecast.weather.gov/ web page and find the weather for Pittsburgh

# Request the weather web page
# page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
page = requests.get("https://forecast.weather.gov/MapClick.php?CityName=Pittsburgh&state=PA&site=PBZ&textField1=40.4392&textField2=-79.9767&e=0#.XyLIEPhKid0")

# Parse the page with BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <!-- Meta -->
  <meta content="width=device-width" name="viewport"/>
  <link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
  <title>
   National Weather Service
  </title>
  <meta content="National Weather Service" name="DC.title">
   <meta content="NOAA National Weather Service National Weather Service" name="DC.description"/>
   <meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/>
   <meta content="" name="DC.date.created" scheme="ISO8601"/>
   <meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/>
   <meta content="weather, National Weather Service" name="DC.keywords"/>
   <meta content="NOAA's National Weather Service" name="DC.publisher"/>
   <meta content="National Weather Service" name="DC.contributor"/>
   <meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/>
   <meta content="General" name="rating"/>
   <meta content="index,follow" name="robots"/>

In [28]:
# After examining the HTML code, we find that the seven-day forecast is inside
# a <div> tag with id = "seven-day-forecast".
# Let's grab that element
seven_day = soup.find(id="seven-day-forecast")
print(seven_day.prettify())

<div class="panel panel-default" id="seven-day-forecast">
 <div class="panel-heading">
  <b>
   Extended Forecast for
  </b>
  <h2 class="panel-title">
   Pittsburgh PA
  </h2>
 </div>
 <div class="panel-body" id="seven-day-forecast-body">
  <div id="seven-day-forecast-container">
   <ul class="list-unstyled" id="seven-day-forecast-list">
    <li class="forecast-tombstone">
     <div class="tombstone-container">
      <p class="period-name">
       Tonight
       <br/>
       <br/>
      </p>
      <p>
       <img alt="Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. "/>
      </p>
      <p class="short-desc">
       Partly Cloudy
      </p>
      <p class="temp temp-low">
       Low: 45 °F
      </p>
     </div>
    </li>
    <li class="forecast-tombstone">
     <d

In [29]:
# Further examination reveals that each individual forecast is inside another <div> container
# with class = "tombstone-container"
forecast_items = seven_day.find_all(class_="tombstone-container")

# forecast_items is a list with one item per forecast period (days and nights)
# If we want to see only tonight's weather, we only need the first element
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Tonight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. " class="forecast-icon" src="newimages/medium/nsct.png" title="Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. "/>
 </p>
 <p class="short-desc">
  Partly Cloudy
 </p>
 <p class="temp temp-low">
  Low: 45 °F
 </p>
</div>


In [30]:
# The HTML element with class='period-name' holds the name of the forecast period (e.g. 'Tonight')
period = tonight.find(class_="period-name").get_text()

# HTML element with class='short-desc' will allow us to get a short description of the day's weather
short_desc = tonight.find(class_="short-desc").get_text()

# HTML element with class='temp' will give us the temperature for a given day
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Tonight
Partly Cloudy
Low: 45 °F


In [31]:
# The full description is stored in the 'title' attribute of the <img> tag
img = tonight.find("img")
desc = img['title']
print(desc)

Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. 
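
Attributes are read with dictionary-style access, so `img['title']` raises a `KeyError` if the attribute is missing. The sketch below shows the more forgiving `.get()` form on an inline `<img>` tag that has no `title`; falling back to `alt` is an assumption made for this example, based on the two attributes carrying the same text in the output above.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<img alt="Tonight: Partly cloudy" class="forecast-icon" src="newimages/medium/nsct.png"/>',
    "html.parser",
)
demo_img = demo.find("img")

missing = demo_img.get("title")                           # None instead of a KeyError
fallback = demo_img.get("title", demo_img.get("alt", ""))  # fall back to the alt text
classes = demo_img["class"]                               # 'class' is multi-valued, so this is a list

print(missing, fallback, classes)
```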


In [32]:
# Let's get all weather periods
period_tags = seven_day.select(".tombstone-container .period-name")

periods = []
for pt in period_tags:
    periods.append(pt.get_text())

print(periods)

['Tonight', 'MemorialDay', 'MondayNight', 'Tuesday', 'TuesdayNight', 'Wednesday', 'WednesdayNight', 'Thursday', 'ThursdayNight']
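
Names such as 'MemorialDay' and 'MondayNight' come out fused because `get_text()` joins the text pieces with no separator. The sketch below assumes the two words are separated by a `<br/>` tag in the page (the 'Tonight' markup above contains `<br/>` tags, but the exact markup for the other periods is not shown) and passes a separator to keep them apart.

```python
from bs4 import BeautifulSoup

# Assumed markup: the two words are split by a <br/> tag
demo = BeautifulSoup('<p class="period-name">Memorial<br/>Day</p>', "html.parser")
tag = demo.find(class_="period-name")

fused = tag.get_text()                             # 'MemorialDay'
spaced = tag.get_text(separator=" ", strip=True)   # 'Memorial Day'
print(fused)
print(spaced)
```

The same `separator=" ", strip=True` arguments would likely also tidy the short descriptions, such as 'T-stormsLikely'.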


In [33]:
# Let's get all short weather descriptions
short_descs = []
for sd in seven_day.select(".tombstone-container .short-desc"):
    short_descs.append(sd.get_text())
    
short_descs

['Partly Cloudy',
 'Mostly Sunny',
 'Mostly Cloudy',
 'Mostly Cloudy',
 'Mostly Cloudy',
 'Slight ChanceShowers thenChanceShowers',
 'T-stormsLikely',
 'Showers',
 'Showers thenChanceT-storms']

In [34]:

# Let's get all temperatures

temps = []

for t in seven_day.select(".tombstone-container .temp"):
    temps.append(t.get_text())
    
print(temps)

['Low: 45 °F', 'High: 73 °F', 'Low: 53 °F', 'High: 78 °F', 'Low: 58 °F', 'High: 76 °F', 'Low: 62 °F', 'High: 77 °F', 'Low: 62 °F']
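
The temperatures are still strings, which makes them awkward to sort or average. As a sketch, a regular expression can split each one into a label and an integer. `parse_temp` and `TEMP_PATTERN` are hypothetical names, and the sample list is copied from the output above.

```python
import re

# Matches strings such as 'Low: 45 °F' or 'High: 73 °F'
TEMP_PATTERN = re.compile(r"(?P<kind>Low|High):\s*(?P<value>-?\d+)")


def parse_temp(text):
    """Return (kind, degrees) for a forecast string, or None if it does not match."""
    match = TEMP_PATTERN.search(text)
    if match is None:
        return None
    return match.group("kind"), int(match.group("value"))


sample_temps = ['Low: 45 °F', 'High: 73 °F', 'Low: 53 °F']
parsed = [parse_temp(t) for t in sample_temps]
print(parsed)  # [('Low', 45), ('High', 73), ('Low', 53)]
```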


In [35]:
# Last, but not least, let's get all descriptions

descs = []

for d in seven_day.select(".tombstone-container img"):
    descs.append(d["title"])
    
print(descs)

['Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. ', 'Memorial Day: Mostly sunny, with a high near 73. Light northwest wind. ', 'Monday Night: Mostly cloudy, with a low around 53. Calm wind. ', 'Tuesday: Mostly cloudy, with a high near 78. Light southwest wind. ', 'Tuesday Night: Mostly cloudy, with a low around 58. Calm wind. ', 'Wednesday: A slight chance of showers, then a chance of showers and thunderstorms after 2pm.  Mostly cloudy, with a high near 76. Chance of precipitation is 50%. New rainfall amounts of less than a tenth of an inch, except higher amounts possible in thunderstorms. ', 'Wednesday Night: A chance of showers before 8pm, then showers and thunderstorms likely between 8pm and 2am, then showers likely after 2am.  Mostly cloudy, with a low around 62. Chance of precipitation is 60%.', 'Thursday: Showers likely, then showers and possibly a thunderstorm after 2pm.  High near 77. Chance of precipitation is 80%.', 'T
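
With the parallel lists collected, one possible next step is to combine them into one record per period and serialize the result. The sketch below uses only the standard library and a three-item sample copied from the outputs above; the field names and the choice of CSV are assumptions, not something the notebook does.

```python
import csv
import io

sample_periods = ['Tonight', 'MemorialDay', 'MondayNight']
sample_short_descs = ['Partly Cloudy', 'Mostly Sunny', 'Mostly Cloudy']
sample_temps = ['Low: 45 °F', 'High: 73 °F', 'Low: 53 °F']

# zip() walks the lists in parallel and stops at the shortest one
records = [
    {"period": p, "short_desc": s, "temp": t}
    for p, s, t in zip(sample_periods, sample_short_descs, sample_temps)
]

# Write the records to an in-memory CSV buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["period", "short_desc", "temp"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Replacing `io.StringIO()` with an opened file (using `newline=""`) would save the same rows to disk.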

In [36]:
# Print the four lists side by side, one block per forecast period
for pt, sd, t, d in zip(periods, short_descs, temps, descs):
    print("Period: " + pt)
    print("Weather: " + sd)
    print("Temperature: " + t)
    print("Overall forecast: " + d)
    print("____________________________")

Period: Tonight
Weather: Partly Cloudy
Temperature: Low: 45 °F
Overall forecast: Tonight: Partly cloudy, with a low around 45. Northeast wind around 5 mph becoming calm  in the evening. 
____________________________
Period: MemorialDay
Weather: Mostly Sunny
Temperature: High: 73 °F
Overall forecast: Memorial Day: Mostly sunny, with a high near 73. Light northwest wind. 
____________________________
Period: MondayNight
Weather: Mostly Cloudy
Temperature: Low: 53 °F
Overall forecast: Monday Night: Mostly cloudy, with a low around 53. Calm wind. 
____________________________
Period: Tuesday
Weather: Mostly Cloudy
Temperature: High: 78 °F
Overall forecast: Tuesday: Mostly cloudy, with a high near 78. Light southwest wind. 
____________________________
Period: TuesdayNight
Weather: Mostly Cloudy
Temperature: Low: 58 °F
Overall forecast: Tuesday Night: Mostly cloudy, with a low around 58. Calm wind. 
____________________________
Period: Wednesday
Weather: Slight ChanceShowers thenChanceShowe