# Getting Data from the Web

This class will provide an introduction to programmatically accessing data from websites and APIs using Python. 

## Table of Contents

2. [Scraping](#2)
  1. [Weather](#2A)
  2. [UFO Sightings](#2B)

## Web Scraping <a id=2></a>

Data is often not available in the neat, tidy formats we are used to getting from databases and APIs. Sometimes we need to go out into the world and capture the data ourselves.

Enter web scraping: the process of crawling one or more websites and extracting structured information from their pages.

Web scraping raises a number of ethical concerns. Make sure to read a site's `robots.txt` before initiating a web scraping project.
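
As a rough illustration of what reading `robots.txt` can look like in code, the sketch below uses the standard library's `urllib.robotparser`. The rules, the `example.com` URLs, and the `my-class-scraper` user agent are all made up for this example, and the file is inlined so the snippet runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-class-scraper" is a made-up user agent name.
allowed_public = parser.can_fetch("my-class-scraper", "https://example.com/forecast")
allowed_private = parser.can_fetch("my-class-scraper", "https://example.com/private/data")

print(allowed_public)   # public page is permitted by "Allow: /"
print(allowed_private)  # blocked by "Disallow: /private/"
```

For a real site you would instead call `parser.set_url(...)` with the site's `robots.txt` address, followed by `parser.read()`. Note that `robots.txt` only covers crawling rules; a site's terms of service may add further restrictions.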

In [2]:
import re #Regular expressions
from bs4 import BeautifulSoup # a python HTML parser
import requests
import pandas as pd

### Weather Data <a id=2A></a>

Let's focus on grabbing general weather data and forecasts.

In [10]:
# White House, Washington DC
dc_lat='38.8977'
dc_lon='-77.0365'

In [11]:
url = f"https://forecast.weather.gov/MapClick.php?lat={dc_lat}&lon={dc_lon}#.XFzl9s9KiCc"
r = requests.get(url)
r.status_code

200

In [12]:
#Let's make some soup
soup = BeautifulSoup(r.content, 'html.parser')

In [13]:
seven_day = soup.find(id="seven-day-forecast")

In [14]:
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Washington DC	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Today<br/><br/></p>
<p><img alt="Today: Partly sunny, with a high near 56. Southwest wind around 5 mph becoming calm  in the afternoon. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Partly sunny, with a high near 56. Southwest wind around 5 mph becoming calm  in the afternoon. "/></p><p class="short-desc">Partly Sunny</p><p class="temp temp-high">High: 56 °F</p></div></li><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Tonight<br/><br/></p>
<p><img alt="Tonight: A slight chance of showers after 1am.  Mostly cloudy, w

In [9]:
forecast_items = seven_day.find_all(class_="tombstone-container")
# The first item is the current period; on this run that is "Today", despite the variable name
tonight = forecast_items[0]
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Partly sunny, with a high near 56. South wind 3 to 6 mph. " class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Partly sunny, with a high near 56. South wind 3 to 6 mph. "/>
 </p>
 <p class="short-desc">
  Partly Sunny
 </p>
 <p class="temp temp-high">
  High: 56 °F
 </p>
</div>


##### Extracting information from the page

As you can see, the forecast item `tonight` (which on this run holds the "Today" period) contains all the information we want. There are four pieces of information we can extract:

* The name of the forecast item, in this case `Today`.
* The description of the conditions, which is stored in the `title` attribute of the `img` tag.
* A short description of the conditions.
* The temperature, in this case the high.

We'll extract the name of the forecast item, the short description, and the temperature first, since they're all similar:

In [7]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Today
Partly Sunny
High: 56 °F


Now we can extract the `title` attribute from the `img` tag. To do this, we treat the BeautifulSoup tag object like a dictionary and pass in the attribute we want as a key:

In [8]:
img = tonight.find("img")
desc = img['title']

print(desc)

Today: Partly sunny, with a high near 56. South wind 3 to 6 mph. 
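
Dictionary-style access raises a `KeyError` when the attribute is missing, which can stop a scrape partway through a page. The sketch below, run on a small hand-written snippet modeled on the markup above (with the `alt` attribute deliberately left out), shows the tag's `.get()` method as a more forgiving alternative.

```python
from bs4 import BeautifulSoup

# Hand-written snippet modeled on the forecast markup; "alt" is omitted on purpose.
html = '<p><img class="forecast-icon" src="newimages/medium/bkn.png" title="Today: Partly sunny"/></p>'
img_tag = BeautifulSoup(html, "html.parser").find("img")

title_text = img_tag["title"]                          # raises KeyError if the attribute is absent
alt_text = img_tag.get("alt")                          # returns None if the attribute is absent
alt_or_default = img_tag.get("alt", "no description")  # or a fallback value of your choice

print(title_text)
print(alt_text)
print(alt_or_default)
```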


##### Extracting all the information from the page
Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the code below, we:

* Select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.
* Use a list comprehension to call the `get_text` method on each `BeautifulSoup` object.

In [9]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Today',
 'Tonight',
 'ChristmasDay',
 'SaturdayNight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday']
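
Names such as `'ChristmasDay'` and `'SaturdayNight'` run together because `get_text()` joins the text on either side of a `<br/>` with nothing in between. One possible fix, sketched below, is to pass a separator and strip whitespace. The sample markup is an assumption inferred from the output above, not copied from the page.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <br/> between "Christmas" and "Day" is inferred from the output.
html = """
<p class="period-name">Christmas<br/>Day</p>
<p class="period-name">Today<br/><br/></p>
"""
demo_soup = BeautifulSoup(html, "html.parser")
demo_tags = demo_soup.select(".period-name")

# Default behaviour: text fragments are concatenated directly.
joined = [tag.get_text() for tag in demo_tags]

# With a separator and strip=True, fragments are trimmed and joined by a space.
spaced = [tag.get_text(separator=" ", strip=True) for tag in demo_tags]

print(joined)
print(spaced)
```

The same arguments could be used in the `short_desc` comprehension, where values such as `'ChanceShowers'` show the same symptom.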

In [10]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['Partly Sunny', 'Mostly Cloudythen SlightChanceShowers', 'ChanceShowers', 'ChanceShowers', 'Sunny', 'Partly Cloudy', 'Chance Rain', 'Mostly Cloudy', 'Mostly Cloudy']
['High: 56 °F', 'Low: 45 °F', 'High: 65 °F', 'Low: 48 °F', 'High: 59 °F', 'Low: 38 °F', 'High: 47 °F', 'Low: 40 °F', 'High: 52 °F']
['Today: Partly sunny, with a high near 56. South wind 3 to 6 mph. ', 'Tonight: A slight chance of showers after 1am.  Mostly cloudy, with a low around 45. Southeast wind 3 to 6 mph.  Chance of precipitation is 20%.', 'Christmas Day: A chance of showers, mainly before 1pm.  Mostly cloudy, with a high near 65. Southwest wind 7 to 10 mph, with gusts as high as 23 mph.  Chance of precipitation is 30%.', 'Saturday Night: A chance of showers before 1am.  Mostly cloudy, then gradually becoming mostly clear, with a low around 48. Northwest wind around 7 mph.  Chance of precipitation is 30%.', 'Sunday: Sunny, with a high near 59. Northwest wind 9 to 13 mph, with gusts as high as 22 mph. ', 'Sunday Ni

### Exercise
Combine all the newly scraped data and analyze it. To do this, we call the `DataFrame` class and pass in our lists as a dictionary. Each dictionary key becomes a column in the DataFrame, and each list becomes the values in that column.

In [11]:
weather = pd.DataFrame({
        "period": periods, 
        "short_desc": short_descs, 
        "temp": temps, 
        "desc":descs
    })

In [12]:
weather.head()

| | period | short_desc | temp | desc |
|---|---|---|---|---|
| 0 | Today | Partly Sunny | High: 56 °F | Today: Partly sunny, with a high near 56. Sout... |
| 1 | Tonight | Mostly Cloudythen SlightChanceShowers | Low: 45 °F | Tonight: A slight chance of showers after 1am.... |
| 2 | ChristmasDay | ChanceShowers | High: 65 °F | Christmas Day: A chance of showers, mainly bef... |
| 3 | SaturdayNight | ChanceShowers | Low: 48 °F | Saturday Night: A chance of showers before 1am... |
| 4 | Sunday | Sunny | High: 59 °F | Sunday: Sunny, with a high near 59. Northwest ... |


### Analyzing Weather

In [13]:
# Use Series.str.extract with a regular expression to pull out the numeric temperature values
# (raw string, so that \d is not treated as a string escape)
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

0    56
1    45
2    65
3    48
4    59
5    38
6    47
7    40
8    52
Name: temp_num, dtype: object
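
To see what the pattern does without pandas in the way, here is a minimal sketch that applies the same regular expression to single strings with the standard `re` module. The helper name `parse_temp` is hypothetical, introduced only for this illustration.

```python
import re

# Same pattern as above: a named group capturing the first run of digits.
TEMP_PATTERN = re.compile(r"(?P<temp_num>\d+)")

def parse_temp(label):
    """Return the first run of digits in `label` as an int, or None if there is none."""
    match = TEMP_PATTERN.search(label)
    return int(match.group("temp_num")) if match else None

high = parse_temp("High: 56 °F")
low = parse_temp("Low: 45 °F")
missing = parse_temp("No temperature listed")

print(high, low, missing)
```

The named group is also why the resulting pandas Series is called `temp_num`. The values come back as strings (hence `dtype: object`) until they are converted with `astype('int')`.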

In [14]:

# Find the mean of this week's temperature
weather["temp_num"].mean()

50.0

In [15]:
# Select rows that occur only at night
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night

0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool

In [16]:
weather[is_night]

| | period | short_desc | temp | desc | temp_num | is_night |
|---|---|---|---|---|---|---|
| 1 | Tonight | Mostly Cloudythen SlightChanceShowers | Low: 45 °F | Tonight: A slight chance of showers after 1am.... | 45 | True |
| 3 | SaturdayNight | ChanceShowers | Low: 48 °F | Saturday Night: A chance of showers before 1am... | 48 | True |
| 5 | SundayNight | Partly Cloudy | Low: 38 °F | Sunday Night: Partly cloudy, with a low around... | 38 | True |
| 7 | MondayNight | Mostly Cloudy | Low: 40 °F | Monday Night: Mostly cloudy, with a low around... | 40 | True |
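
A natural next step is to compare daytime highs with nighttime lows. The sketch below rebuilds only the two relevant columns by hand, using the values from the outputs above, so that it runs on its own. The `period_type` column and the `temps_demo` name are additions made for this example.

```python
import pandas as pd

# Values copied from the temp_num and is_night outputs above.
temps_demo = pd.DataFrame({
    "temp_num": [56, 45, 65, 48, 59, 38, 47, 40, 52],
    "is_night": [False, True, False, True, False, True, False, True, False],
})

# Map the boolean flag to readable labels, then average within each group.
temps_demo["period_type"] = temps_demo["is_night"].map({True: "night", False: "day"})
mean_by_type = temps_demo.groupby("period_type")["temp_num"].mean()

print(mean_by_type)
```

With the full `weather` DataFrame in memory, the same `groupby` call should work directly on it.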


### UFO Sightings <a id=2B></a>

In [17]:
r = requests.get("http://www.nuforc.org/webreports/ndxe201608.html")
b = BeautifulSoup(r.text, 'html.parser')
r.status_code

200

In [18]:
# Let's take a look at the first sighting
for tr in b.find_all('tr', attrs = {'valign':'TOP'})[:1]:
    # find_all() with no arguments returns every descendant tag, so nested tags repeat the same text
    for child in tr.find_all():
        print(child.text)

8/31/16 23:40
8/31/16 23:40
8/31/16 23:40
Terre Haute
Terre Haute
IN
IN
Light
Light
10 minutes
10 minutes
Four unidentified moving flashing lights that hovered in place for several minutes. Terre Haute, Indiana
Four unidentified moving flashing lights that hovered in place for several minutes. Terre Haute, Indiana
9/2/16
9/2/16


In [19]:
# OK, it's a bit messy. Let's clean it up.
# It looks like the 1st element is the date, the 4th is the city, the 6th is the state,
# the 8th is the shape, and the 13th is the summary.

ufo_sightings = {
        'Date':[],
        'City':[],
        'State':[],
        'Shape':[],
        'Summary':[]
    }

for tr in b.find_all('tr', attrs = {'valign':'TOP'}):
    # find_all() with no arguments returns every descendant tag
    ufo_sighting_info = []
    for child in tr.find_all():
        ufo_sighting_info.append(child.text)
    ufo_sightings['Date'].append(ufo_sighting_info[0])
    ufo_sightings['City'].append(ufo_sighting_info[3])
    ufo_sightings['State'].append(ufo_sighting_info[5])
    ufo_sightings['Shape'].append(ufo_sighting_info[7])
    ufo_sightings['Summary'].append(ufo_sighting_info[12])

pd.DataFrame(ufo_sightings).head()

| | Date | City | State | Shape | Summary |
|---|---|---|---|---|---|
| 0 | 8/31/16 23:40 | Terre Haute | IN | Light | Four unidentified moving flashing lights that ... |
| 1 | 8/31/16 22:00 | Malo | WA | Light | Strange meandering craft pulls near 180 before... |
| 2 | 8/31/16 22:00 | Corte Madera | CA | Circle | Stationary yellow gold lights seen during exte... |
| 3 | 8/31/16 21:00 | Arlington | WI | Light | Fifteen minute sighting of unusual light forma... |
| 4 | 8/31/16 21:00 | Concord | NC | Triangle | We saw 3 triangle objects in the sky with Redi... |
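
Indexing into the list of all descendants works, but the positions depend on how deeply each cell happens to be nested. A possibly sturdier alternative, sketched below, is to read one value per `td` cell. The sample row is hand-written: its nesting (`td` > `font` > `a`) is an assumption inferred from the duplicated output above, and the `Duration` and `Posted` column names are guesses based on the values.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the nesting is inferred from the duplicated output, not copied from the site.
html = """
<table>
<tr valign="TOP">
<td><font><a href="#">8/31/16 23:40</a></font></td>
<td><font>Terre Haute</font></td>
<td><font>IN</font></td>
<td><font>Light</font></td>
<td><font>10 minutes</font></td>
<td><font>Four unidentified moving flashing lights</font></td>
<td><font>9/2/16</font></td>
</tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")

# Column names are assumptions; "Duration" and "Posted" are guessed from the values.
columns = ["Date", "City", "State", "Shape", "Duration", "Summary", "Posted"]

rows = []
for tr in demo.find_all("tr", attrs={"valign": "TOP"}):
    # One entry per table cell, regardless of how many tags are nested inside it.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append(dict(zip(columns, cells)))

print(rows[0])
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`.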
