# Getting Data from the Web


### Workshop Duration

*80 Minutes*

An introduction to programmatically accessing data from websites and APIs using Python. 

## How to use this Notebook <a id="1"></a>

#### 1. Examples: This is your opportunity to observe and learn. Most of the examples contain code and information that you can use to solve your exercises.
#### 2. Exercises: This is your opportunity to get hands-on and try solving the challenges.
 
 - *Running the cells in the example section may fail because some of them reference file paths that do not exist on your local system.*
 - *If you run a cell that references a library that has not been installed, you will see an error. Work with the supporting instructors to get this set up if you are having trouble.*
 - *This workshop contains lots of information and time is limited. We encourage everyone to spend time working through these examples and exercises in detail after completing this bootcamp.*
 

#### AGENDA

- Example 1: Getting Movie Data from IMDB (10 Minutes)
- Exercise 1: Movie Release Date: defining a function (5 Minutes)
- Example 2: Color Wheel (10 Minutes)
- Example 3: Plotly - Brief Overview (5 Minutes)
- Exercise 2: Getting Data Using API Calls (30 Minutes)
- Example 4: Intro to Web Scraping (10 Minutes)
- Example 5: Weather Analysis (5 Minutes - SKIP TO UFO's)
- Exercise 4: UFO Sightings (10-15 Minutes)

## API (Application Programming Interface) <a id="1"></a>

What is an API?
- Structured way to expose specific functionality and data to users
- Web APIs usually follow the [REST](https://en.wikipedia.org/wiki/Representational_state_transfer) architectural style (for example, each request is stateless)

How to interact with an API:
- Make a "request" to a specific URL (an "endpoint"), and get the data back in a "response"
- Most relevant request method for us is GET (other methods: POST, PUT, DELETE)
- Response is often JSON or XML format
- Web console is sometimes available (allows you to explore an API)
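
To make the point about response formats concrete, here is a minimal sketch that reads the same record from a JSON body and from an XML body using only the standard library. Both bodies are made-up sample strings (the XML layout in particular is hypothetical), not output captured from a real API.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response bodies, written by hand for illustration only
json_body = '{"Title": "Jurassic Park", "Year": "1993"}'
xml_body = '<movie title="Jurassic Park" year="1993"/>'

# JSON parses into a plain Python dictionary
movie_from_json = json.loads(json_body)

# XML parses into an element; its attributes behave like a dictionary
movie_from_xml = ET.fromstring(xml_body).attrib

print(movie_from_json["Year"])
print(movie_from_xml["year"])
```

Either way, the parsed result can be indexed by field name, which is why the rest of this notebook works with dictionaries.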

## Supplementary Reading

1. [Requests: Python Library Documentation](http://docs.python-requests.org/en/master/user/quickstart/)

2. [OAuth2 Documentation](https://oauth.net/2/)

3. [What is an API & How Does It Work](https://blogs.mulesoft.com/biz/tech-ramblings-biz/what-are-apis-how-do-apis-work/)

4. [API Directory](https://www.programmableweb.com/apis/directory)

## Example 1: Getting Movie Data from IMDB <a id="1B"></a>
*10 Minutes*

In [2]:
import pandas as pd
import requests

In [3]:
print(requests.__version__)

2.22.0


### Using the Requests Library
We will submit a GET request for a specific movie to the URL: `http://www.omdbapi.com`

In [4]:
API_KEY = "53bfc95d" # <- Super Secret Shhhh
title = "Jurassic Park" # Search for a movie you like
url = 'http://www.omdbapi.com?'

# requests URL-encodes the payload and appends it to the URL as a query string
payload = {'t': title,
           'apikey': API_KEY}

r = requests.get(url, params=payload)

In [5]:
# check the status: 200 means success, 4xx or 5xx means error
r.status_code

200
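
As a rough illustration of the status-code comment above, the sketch below labels a code by its range using the standard library's `http.HTTPStatus`. The helper name `describe_status` is hypothetical and the wording of the labels is our own; note that `HTTPStatus(code)` raises `ValueError` for codes it does not know.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase}: {outcome}"

print(describe_status(200))
print(describe_status(404))
```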

In [6]:
r.url

'http://www.omdbapi.com/?t=Jurassic+Park&apikey=53bfc95d'
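
The query string in `r.url` above can be reproduced by hand. The following is a minimal sketch using the standard library's `urllib.parse.urlencode`, with a placeholder key (`YOUR_KEY`) instead of a real one; it only illustrates the encoding and is not a description of how `requests` is implemented internally.

```python
from urllib.parse import urlencode

# Placeholder key for illustration; substitute your own
payload = {'t': 'Jurassic Park', 'apikey': 'YOUR_KEY'}

# urlencode joins key=value pairs with '&' and encodes spaces as '+'
query_string = urlencode(payload)
full_url = 'http://www.omdbapi.com/?' + query_string

print(full_url)
```

Passing a dictionary through `params` saves you from doing this escaping yourself, which matters once titles contain spaces or punctuation.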

We know from the documentation on omdbapi.com that the response is a `JSON` object.

In [7]:
r.json()

{'Title': 'Jurassic Park',
 'Year': '1993',
 'Rated': 'PG-13',
 'Released': '11 Jun 1993',
 'Runtime': '127 min',
 'Genre': 'Action, Adventure, Sci-Fi, Thriller',
 'Director': 'Steven Spielberg',
 'Writer': 'Michael Crichton (novel), Michael Crichton (screenplay), David Koepp (screenplay)',
 'Actors': 'Sam Neill, Laura Dern, Jeff Goldblum, Richard Attenborough',
 'Plot': "A pragmatic paleontologist visiting an almost complete theme park is tasked with protecting a couple of kids after a power failure causes the park's cloned dinosaurs to run loose.",
 'Language': 'English, Spanish',
 'Country': 'USA',
 'Awards': 'Won 3 Oscars. Another 40 wins & 27 nominations.',
 'Poster': 'https://m.media-amazon.com/images/M/MV5BMjM2MDgxMDg0Nl5BMl5BanBnXkFtZTgwNTM2OTM5NDE@._V1_SX300.jpg',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '8.1/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '91%'},
  {'Source': 'Metacritic', 'Value': '68/100'}],
 'Metascore': '68',
 'imdbRating': '8.1',
 'imdb

In [8]:
#We can call out specific elements
r.json()['Year']

'1993'

What happens if the movie isn't found?

In [9]:
payload = {'apikey': API_KEY, 
          't': 'Machine Learning Rules!'}

r = requests.get(url, params=payload)
r.json()

{'Response': 'False', 'Error': 'Movie not found!'}
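
Because a failed lookup comes back as an ordinary JSON body, it is worth checking the `Response` field before reading anything else. Below is a minimal sketch, assuming the payload shapes shown above; `get_field` is a hypothetical helper name, and the two dictionaries are hand-written stand-ins for `r.json()`.

```python
def get_field(movie, field):
    """Return movie[field], or None when the payload reports a failed lookup."""
    if movie.get('Response') == 'False':
        return None
    return movie.get(field)

# Hand-written stand-ins for r.json(), based on the outputs shown above
found = {'Title': 'Jurassic Park', 'Year': '1993'}
missing = {'Response': 'False', 'Error': 'Movie not found!'}

print(get_field(found, 'Year'))
print(get_field(missing, 'Year'))
```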

## Exercise 1: Movie Release Date
*5 Minutes*
##### Define a function to return the year of release of a given movie title, return None if no movie found.

In [10]:
def get_movie_year_from_title(movie_title):
    return -1

In [11]:
get_movie_year_from_title("Jungle Book")

-1

### Business Use Case Discussion
* How might our clients ask us to leverage text data such as this to inform business decisions?
* What could we do if we mashed this movie metadata up with movie review data from a source such as Rotten Tomatoes?

## Example 2: Color Wheel <a id='1C'></a>
*10 Minutes*

In this example, we access data on some of the colors of the Crayola® palette via the Smithsonian Cooper Hewitt's API. Archivists at Cooper Hewitt use this palette to tag images of objects by color. Downstream, these tags allow for greater accuracy in information retrieval for users looking for objects of a certain hue.

### Business Use Case
A major online retail aggregator has asked Deloitte to help it increase the effectiveness of its search engine. They found that users are searching their catalog of over 100,000 items by color. However, many of the items don't have color tags, making the search process frustrating for users.

Our role will be to leverage the Cooper Hewitt's extensive online catalog of design objects tagged with colors to build a training set for automated tagging for the client with machine learning.

If our future algorithm is successful, it will save our client thousands of hours of manual labor tagging the images on their website.


*__Note:__ This API implements the new standard for API authentication by using OAuth2 with access tokens. I've created a token for us ahead of time.*

[Article on Cooper Hewitt's API](https://labs.cooperhewitt.org/2014/the-api-at-the-center-of-the-museum/)

[API Documentation](https://collection.cooperhewitt.org/api/)

In [12]:
key = "84976de03204c1d366ae0224bf21d103" # less secure
token = "2f49c9d05b2faf779d11420637d99f57" # more secure which uses the OAuth2 authentication

In [13]:
base_url = 'https://api.collection.cooperhewitt.org/rest/?method=cooperhewitt.colors.palettes.getInfo&access_token=%s&palette=crayola' % token

In [14]:
base_url

'https://api.collection.cooperhewitt.org/rest/?method=cooperhewitt.colors.palettes.getInfo&access_token=2f49c9d05b2faf779d11420637d99f57&palette=crayola'

In [15]:
r = requests.get(base_url)

In [16]:
r.json()

{'palette': 'crayola',
 'colors': {'#fc89ac': {'name': 'Tickle Me Pink'},
  '#1f75fe': {'name': 'Blue'},
  '#a8e4a0': {'name': 'Granny Smith Apple'},
  '#fc74fd': {'name': 'Pink Flamingo'},
  '#7366bd': {'name': 'Blue Violet'},
  '#18a7b5': {'name': 'Teal Blue'},
  '#1164b4': {'name': 'Green Blue'},
  '#b2ec5d': {'name': 'Inchworm'},
  '#58427c': {'name': 'Cyber Grape'},
  '#bf4f51': {'name': 'Bittersweet Shimmer'},
  '#5d76cb': {'name': 'Indigo'},
  '#c5e384': {'name': 'Yellow Green'},
  '#8fd400': {'name': 'Sheen Green'},
  '#4a646c': {'name': 'Deep Space Sparkle'},
  '#ffbcd9': {'name': 'Cotton Candy'},
  '#ff7f49': {'name': 'Burnt Orange'},
  '#fefe22': {'name': 'Laser Lemon'},
  '#bc5d58': {'name': 'Chestnut'},
  '#9fe2bf': {'name': 'Sea Green'},
  '#000000': {'name': 'Black'},
  '#414a4c': {'name': 'Outer Space'},
  '#7851a9': {'name': 'Royal Purple'},
  '#ace5ee': {'name': 'Blizzard Blue'},
  '#a2add0': {'name': 'Wild Blue Yonder'},
  '#dd9475': {'name': 'Copper'},
  '#ffffff': 

In [17]:
_hex = list(r.json()['colors'].keys())
names = [k['name'] for k in list(r.json()['colors'].values())]
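
An alternative to building two parallel lists is to walk the dictionary once with `.items()`, which keeps each hex code next to its name. The sketch below uses a three-color excerpt of the response shown above as sample data; the variable names are our own. A list of row dictionaries like this can also be passed directly to `pd.DataFrame`.

```python
# Three-color excerpt of r.json()['colors'], copied from the output above
colors = {
    '#fc89ac': {'name': 'Tickle Me Pink'},
    '#1f75fe': {'name': 'Blue'},
    '#a8e4a0': {'name': 'Granny Smith Apple'},
}

# One row per color; hex code and name can never drift out of alignment
rows = [{'hex': hex_code, 'name': info['name']}
        for hex_code, info in colors.items()]

for row in rows:
    print(row['hex'], row['name'])
```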

In [18]:
_hex

['#fc89ac',
 '#1f75fe',
 '#a8e4a0',
 '#fc74fd',
 '#7366bd',
 '#18a7b5',
 '#1164b4',
 '#b2ec5d',
 '#58427c',
 '#bf4f51',
 '#5d76cb',
 '#c5e384',
 '#8fd400',
 '#4a646c',
 '#ffbcd9',
 '#ff7f49',
 '#fefe22',
 '#bc5d58',
 '#9fe2bf',
 '#000000',
 '#414a4c',
 '#7851a9',
 '#ace5ee',
 '#a2add0',
 '#dd9475',
 '#ffffff',
 '#efdecd',
 '#bab86c',
 '#1974d2',
 '#b4674d',
 '#ebc7df',
 '#ff9baa',
 '#87a96b',
 '#71bc78',
 '#8e4585',
 '#fae7b5',
 '#979aaa',
 '#aaf0d1',
 '#c5d0e6',
 '#fd5e53',
 '#80daeb',
 '#2e5894',
 '#ff48d0',
 '#dd4492',
 '#eceabe',
 '#1a4876',
 '#9aceeb',
 '#f8d568',
 '#e7c697',
 '#1cac78',
 '#9c7c38',
 '#9f8170',
 '#a57164',
 '#ff6e4a',
 '#3bb08f',
 '#fff44f',
 '#ff43a4',
 '#1cd3a2',
 '#cda4de',
 '#ffa343',
 '#efcdb8',
 '#a5694f',
 '#8d4e85',
 '#ff7538',
 '#cb4154',
 '#17806d',
 '#ffa089',
 '#cd9575',
 '#cdc5c2',
 '#85754e',
 '#6699cc',
 '#ea7e5d',
 '#c46210',
 '#fdbcb4',
 '#d68a59',
 '#ffaacc',
 '#45cea2',
 '#ef98aa',
 '#6dae81',
 '#ffa474',
 '#324ab2',
 '#fddde6',
 '#ffcf48',
 '#c

In [20]:
# DataFrame of the results
crayola= pd.DataFrame({'hex': _hex, 'name':names})

# crayola.describe()
crayola.head()

| | hex | name |
|---|---|---|
| 0 | #fc89ac | Tickle Me Pink |
| 1 | #1f75fe | Blue |
| 2 | #a8e4a0 | Granny Smith Apple |
| 3 | #fc74fd | Pink Flamingo |
| 4 | #7366bd | Blue Violet |


In [19]:
names

['Tickle Me Pink',
 'Blue',
 'Granny Smith Apple',
 'Pink Flamingo',
 'Blue Violet',
 'Teal Blue',
 'Green Blue',
 'Inchworm',
 'Cyber Grape',
 'Bittersweet Shimmer',
 'Indigo',
 'Yellow Green',
 'Sheen Green',
 'Deep Space Sparkle',
 'Cotton Candy',
 'Burnt Orange',
 'Laser Lemon',
 'Chestnut',
 'Sea Green',
 'Black',
 'Outer Space',
 'Royal Purple',
 'Blizzard Blue',
 'Wild Blue Yonder',
 'Copper',
 'White',
 'Almond',
 'Olive Green',
 'Navy Blue',
 'Brown',
 'Thistle',
 'Salmon',
 'Asparagus',
 'Fern',
 'Plum',
 'Banana Mania',
 'Manatee',
 'Magic Mint',
 'Periwinkle',
 'Sunset Orange',
 'Sky Blue',
 "B'dazzled Blue",
 'Razzle Dazzle Rose',
 'Cerise',
 'Spring Green',
 'Midnight Blue',
 'Cornflower',
 'Orange Yellow',
 'Gold',
 'Green',
 'Metallic Sunburst',
 'Beaver',
 'Blast Off Bronze',
 'Outrageous Orange',
 'Jungle Green',
 'Lemon Yellow',
 'Wild Strawberry',
 'Caribbean Green',
 'Wisteria',
 'Neon Carrot',
 'Desert Sand',
 'Sepia',
 'Razzmic Berry',
 'Orange',
 'Brick Red',
 '

### Red/Blue/Green
The hex code is by design very dense information. Let's parse out the individual color components from the data.

[Hex to RGB Conversion by Hand](https://www.rapidtables.com/convert/color/how-hex-to-rgb.html)

[Hex to RGB Code](https://stackoverflow.com/questions/29643352/converting-hex-to-rgb-value-in-python)

[int() Class](https://docs.python.org/3.4/library/functions.html?highlight=int#int)

In [21]:
# Convert a single hex color code into its corresponding RGB tuple

def hex_to_rbg(_hexcode):
    h = _hexcode.lstrip('#') # strip the leading hash from the hex code
    rbg = tuple(int(h[i:i+2], 16) for i in (0, 2, 4)) # int(..., 16) parses each two-character pair as a base-16 number
    return rbg
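
As a worked check of the conversion, the sketch below spells out the base-16 arithmetic for the first Crayola color and adds a reverse conversion for a round trip. The names `hex_to_rgb` and `rgb_to_hex` are our own and are separate from the function defined in the notebook cell.

```python
def hex_to_rgb(hex_code):
    """Convert '#rrggbb' into an (r, g, b) tuple of integers."""
    digits = hex_code.lstrip('#')
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple back into a lowercase '#rrggbb' string."""
    return '#{:02x}{:02x}{:02x}'.format(*rgb)

# By hand for '#fc89ac' (Tickle Me Pink):
#   'fc' = 15*16 + 12 = 252
#   '89' =  8*16 +  9 = 137
#   'ac' = 10*16 + 12 = 172
print(hex_to_rgb('#fc89ac'))
print(rgb_to_hex((252, 137, 172)))
```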

In [22]:
rbg = [hex_to_rbg(h) for h in crayola['hex'].tolist()]

In [23]:
rbg

[(252, 137, 172),
 (31, 117, 254),
 (168, 228, 160),
 (252, 116, 253),
 (115, 102, 189),
 (24, 167, 181),
 (17, 100, 180),
 (178, 236, 93),
 (88, 66, 124),
 (191, 79, 81),
 (93, 118, 203),
 (197, 227, 132),
 (143, 212, 0),
 (74, 100, 108),
 (255, 188, 217),
 (255, 127, 73),
 (254, 254, 34),
 (188, 93, 88),
 (159, 226, 191),
 (0, 0, 0),
 (65, 74, 76),
 (120, 81, 169),
 (172, 229, 238),
 (162, 173, 208),
 (221, 148, 117),
 (255, 255, 255),
 (239, 222, 205),
 (186, 184, 108),
 (25, 116, 210),
 (180, 103, 77),
 (235, 199, 223),
 (255, 155, 170),
 (135, 169, 107),
 (113, 188, 120),
 (142, 69, 133),
 (250, 231, 181),
 (151, 154, 170),
 (170, 240, 209),
 (197, 208, 230),
 (253, 94, 83),
 (128, 218, 235),
 (46, 88, 148),
 (255, 72, 208),
 (221, 68, 146),
 (236, 234, 190),
 (26, 72, 118),
 (154, 206, 235),
 (248, 213, 104),
 (231, 198, 151),
 (28, 172, 120),
 (156, 124, 56),
 (159, 129, 112),
 (165, 113, 100),
 (255, 110, 74),
 (59, 176, 143),
 (255, 244, 79),
 (255, 67, 164),
 (28, 211, 162),


In [24]:
rbg = pd.DataFrame(rbg, columns=['red', 'green', 'blue'])
crayola = pd.concat([crayola, rbg], axis=1)
crayola.head()

| | hex | name | red | green | blue |
|---|---|---|---|---|---|
| 0 | #fc89ac | Tickle Me Pink | 252 | 137 | 172 |
| 1 | #1f75fe | Blue | 31 | 117 | 254 |
| 2 | #a8e4a0 | Granny Smith Apple | 168 | 228 | 160 |
| 3 | #fc74fd | Pink Flamingo | 252 | 116 | 253 |
| 4 | #7366bd | Blue Violet | 115 | 102 | 189 |


In [25]:
# Write the results to a relative path so that the cell runs on any machine
crayola.to_csv('crayola.csv', index=False)

## Example 3:  Plotly - Brief Overview
*5 Minutes*

Python plotting library for collaborative, interactive, publication-quality graphs.

[Plotly Website Link](https://pypi.org/project/plotly/)

*Note: The plot will not render correctly if you have run the cells above more than once, because the incoming data must be in its original state. If that happens, click 'Kernel', select 'Restart & Clear Output', and run the notebook again from the top.*

*Note: Plotly also has a rendering bug in JupyterLab. If your graph does not render, open the notebook in the classic Jupyter Notebook interface instead.*

In [26]:
import plotly.offline as py
import plotly.graph_objs as go
from plotly import __version__

# Enable offline plotting inside the notebook
py.init_notebook_mode(connected=False)

ModuleNotFoundError: No module named 'plotly'

In [27]:
print("Plotly Version",__version__)

NameError: name '__version__' is not defined

In [None]:
trace = go.Scatter3d(
    x = crayola['red'],
    y = crayola['green'],
    z = crayola['blue'],
    mode = 'markers',
    marker = dict(
        color = crayola['hex'].tolist(),
        size = 5,
        symbol = 'circle',
        opacity = 1))

layout = go.Layout(margin=dict(l=0, r=0, b=0, t=0))

In [None]:
fig = go.Figure(data=[trace],layout=layout)
py.iplot(fig, filename='Crayola_Scatter.html')

## Exercise 2: Getting Data Using API Calls
*30 Minutes*

Build out our training dataset by studying the API documentation on the Cooper Hewitt website. We need a dataset with the museum's objects currently on display (only 100 items), the images associated with those items, and the color(s) of those items.

* Store the names of the objects and other metadata in a CSV called `current_collection.csv`.
* Place the images in a folder named `collection_images`.
  * You can use the response attribute `content` to access the raw bytes and write them to a file.
  * Raw Example: `open('image.jpg', 'wb').write(r.content)`
* Grab the color information and place it in another CSV, `current_collection_colors.csv`.
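
The raw example above leaves the file handle open and assumes the folder already exists. Here is a slightly more careful sketch, assuming you already hold the image bytes (for instance from a response's `content`); `save_bytes` is a hypothetical helper name, and the demonstration writes fake bytes into a temporary directory so it does not touch your working folder.

```python
import os
import tempfile

def save_bytes(content, folder, filename):
    """Write raw bytes to folder/filename, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, 'wb') as f:  # the with-block closes the file for us
        f.write(content)
    return path

# Demonstration with placeholder bytes instead of a real image
with tempfile.TemporaryDirectory() as tmp:
    target_folder = os.path.join(tmp, 'collection_images')
    saved = save_bytes(b'fake image bytes', target_folder, 'example.jpg')
    print(os.path.basename(saved))
```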

The API documentation is available [here](https://collection.cooperhewitt.org/api/methods/). You will want to use the following end points: 
1. [`getOnDisplay`](https://collection.cooperhewitt.org/api/methods/cooperhewitt.objects.getOnDisplay)
2. [`getImages`](https://collection.cooperhewitt.org/api/methods/cooperhewitt.objects.getImages)
3. [`getColors`](https://collection.cooperhewitt.org/api/methods/cooperhewitt.objects.getColors)

Credentials: `key = "84976de03204c1d366ae0224bf21d103"`, `token = "2f49c9d05b2faf779d11420637d99f57"`

[hint 1: Break down your problem and use pd.DataFrame.from_records(r.json()) or just simply r.json() to view data as a starting point](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_records.html)

## Ex: 2.1 - (getOnDisplay) Let's start by using the objects.getOnDisplay API call to get all items on display at the museum

1. You're free to achieve this however you wish.
2. Limit the number of items in your DataFrame to 100.
3. Write all 100 resultant rows and fields to a DataFrame.

In [28]:
import pandas as pd
import requests

In [29]:
#Here's something to get you started, it contains the URL and payload with embedded method you need to use.
token = "2f49c9d05b2faf779d11420637d99f57"
api = 'https://api.collection.cooperhewitt.org/rest/?'


payload = {'access_token': token,
              'method': 'cooperhewitt.objects.getOnDisplay'}

r = requests.get(api, params=payload)

In [30]:
# Start by breaking down the problem.

In [31]:
# When you understand how your data is structured and what you would like to achieve, try building a function to do this for you. For this exercise it's not a requirement.

In [32]:
# Write your results to a CSV using the to_csv() method - you can quickly determine a path by typing "pwd" into a cell and running it.

## Ex: 2.2 - getImages: Now let's build a function that uses the objects.getImages API call to retrieve and store our images.

1. Grab 100 images using the getImages method.
2. Write all 100 images to the specified folder on your local system (see hints below).
3. Remember that each image has an object_id associated with it, in case you were wondering which variable to loop through.

[hint 2: os python library helps you write/read files on your local system](https://docs.python.org/3/library/os.html)

In [33]:
# Here's something to get you started. It contains the URL and payload with the embedded method you need to use. Note that in this case a function would
# be very helpful for looping through your data. I've given you a head start by passing a single object_id. You will need to figure out how to loop through your set of 100 object
# ids, writing the images to the folder called collection_images.

token = "2f49c9d05b2faf779d11420637d99f57"
api = 'https://api.collection.cooperhewitt.org/rest/?'     
payload = {'access_token': token,
           'method': 'cooperhewitt.objects.getImages',
           'object_id': '18488027'}

r = requests.get(api, params=payload)

obj = pd.DataFrame.from_records(r.json())

In [34]:
# Start by breaking down the problem.

In [35]:
# When you understand how your data is structured and what you would like to achieve, you'll need to build a function that loops through your objects to accomplish this.

In [36]:
# Run your function sending a request to the url/token using the API method getImages and write each image to the specified file path .../your_directory/collection_images

## Ex: 2.3 - getColors: Now let's build a function that uses the objects.getColors API call to retrieve and store the 'colors' in a .csv file

1. You will want to do something similar to the previous two exercises, except this time you will be writing the retrieved data to a CSV.
2. Don't worry about parsing out the 'colors' field or converting the hex values. Just grab the data and drop it into a .csv.
3. Limit the data to 100 rows as in the previous two exercises

In [37]:
# Here's something to get you started. We're going to give you less information this time around, since if you've made it this far you're doing well.
# `token` and `api` are defined in the cells above. The object_id below is the same example id used in Ex 2.2;
# replace it with each id from your own loop.

payload = {'access_token': token,
           'method': 'cooperhewitt.objects.getColors',
           'object_id': '18488027'}

r = requests.get(api, params=payload)

obj_col = pd.DataFrame.from_records(r.json())

In [None]:
# Start by breaking down the problem.

In [None]:
# When you understand how your data is structured and what you would like to achieve, it would probably be smart to build a function that loops through your objects to accomplish this.

In [None]:
# Run your function, sending a request to the url/token using the API method getColors, and write the results to .../your_directory/current_collection_colors.csv

## Example 4: Intro to Web Scraping <a id=2></a>

*5 - 10 Minutes*

Often, data is not available in the neat and tidy formats we are used to from databases and APIs. We need to go out into the world and capture the data.

Enter web scraping: the process of crawling one or more websites and extracting structured information from their pages.

There are a whole host of ethical concerns with web scraping. Make sure to read a site's `robots.txt` before initiating a web scraping project, e.g. https://www.buzzfeed.com/robots.txt

 - [Beautiful Soup: Python bs4 lib](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

 - [Regexp: Python re lib](https://docs.python.org/3/library/re.html)

 - [HTML Tags](https://www.w3schools.com/tags/tag_p.asp)

 - [What is a robots.txt file and how to find it](https://moz.com/learn/seo/robotstxt)
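
Checking `robots.txt` can itself be automated. The sketch below uses the standard library's `urllib.robotparser` on a small, made-up rule set for the placeholder domain `example.com`, so it runs without network access. For a live site you would typically point the parser at the real file with `set_url(...)` and `read()` instead of calling `parse`.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; real sites publish their own at /robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic crawler ("*") may fetch each URL
allowed_public = parser.can_fetch("*", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("*", "https://example.com/private/page.html")

print(allowed_public)
print(allowed_private)
```

Keep in mind that `robots.txt` only expresses the site owner's wishes for crawlers; a site's terms of service may impose further limits.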

In [38]:
from bs4 import BeautifulSoup # a python HTML parser
import re #Regular expressions
import requests
import pandas as pd

### Weather Data <a id=2A></a>

Let's focus on grabbing general weather data & forecasts

In [39]:
url = "http://forecast.weather.gov/MapClick.php?lat=38.8904&lon=-77.032#.WpNL-ejwaUk"
r = requests.get(url)
r.status_code

200

In [40]:
r.content



In [41]:
#Let's make some soup
soup = BeautifulSoup(r.content, 'html.parser')

In [42]:
soup

<!DOCTYPE html>

<html class="no-js">
<head>
<!-- Meta -->
<meta content="width=device-width" name="viewport"/>
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/><title>National Weather Service</title><meta content="National Weather Service" name="DC.title"><meta content="NOAA National Weather Service National Weather Service" name="DC.description"/><meta content="US Department of Commerce, NOAA, National Weather Service" name="DC.creator"/><meta content="" name="DC.date.created" scheme="ISO8601"/><meta content="EN-US" name="DC.language" scheme="DCTERMS.RFC1766"/><meta content="weather, National Weather Service" name="DC.keywords"/><meta content="NOAA's National Weather Service" name="DC.publisher"/><meta content="National Weather Service" name="DC.contributor"/><meta content="http://www.weather.gov/disclaimer.php" name="DC.rights"/><meta content="General" name="rating"/><meta content="index,follow" name="robots"/>
<!-- Icons -->
<link href="./images/favicon.ico" rel="shor

In [43]:
seven_day = soup.find(id="seven-day-forecast")

In [44]:
seven_day

<div class="panel panel-default" id="seven-day-forecast">
<div class="panel-heading">
<b>Extended Forecast for</b>
<h2 class="panel-title">
	    	    Washington DC	</h2>
</div>
<div class="panel-body" id="seven-day-forecast-body">
<div id="seven-day-forecast-container"><ul class="list-unstyled" id="seven-day-forecast-list"><li class="forecast-tombstone">
<div class="tombstone-container">
<p class="period-name">Overnight<br/><br/></p>
<p><img alt="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/nra20.png" title="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%."/></p><p class="short-desc">Patchy<br/>Drizzle and<br/>Patchy Fog</p><p class="temp temp-low">Low: 63 °F</p></div></li><li class="fore

In [45]:
forecast_items = seven_day.find_all(class_="tombstone-container")
forecast_items

[<div class="tombstone-container">
 <p class="period-name">Overnight<br/><br/></p>
 <p><img alt="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/nra20.png" title="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%."/></p><p class="short-desc">Patchy<br/>Drizzle and<br/>Patchy Fog</p><p class="temp temp-low">Low: 63 °F</p></div>,
 <div class="tombstone-container">
 <p class="period-name">Sunday<br/><br/></p>
 <p><img alt="Sunday: Isolated showers.  Areas of fog before 11am.  Otherwise, mostly cloudy, with a high near 80. Calm wind becoming southeast around 6 mph in the afternoon.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/shra20.png" title="Sunday: Isolated sho

In [46]:
tonight = forecast_items[0]
print(tonight)

<div class="tombstone-container">
<p class="period-name">Overnight<br/><br/></p>
<p><img alt="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/nra20.png" title="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%."/></p><p class="short-desc">Patchy<br/>Drizzle and<br/>Patchy Fog</p><p class="temp temp-low">Low: 63 °F</p></div>


In [47]:
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Overnight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/nra20.png" title="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%."/>
 </p>
 <p class="short-desc">
  Patchy
  <br/>
  Drizzle and
  <br/>
  Patchy Fog
 </p>
 <p class="temp temp-low">
  Low: 63 °F
 </p>
</div>


##### Extracting information from the page

As you can see, all the information we want is inside the forecast item `tonight`. There are 4 pieces of information we can extract:

* The name of the forecast item: in this case, Overnight.
* The description of the conditions: this is stored in the `title` property of `img`.
* A short description of the conditions.
* The temperature low.

We'll extract the name of the forecast item, the short description, and the temperature first, since they're all similar:

In [48]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Overnight
PatchyDrizzle andPatchy Fog
Low: 63 °F


In [49]:
img = tonight.find("img")
img

<img alt="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/nra20.png" title="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%."/>

Now, we can extract the `title` attribute from the `img` tag. To do this, we just treat the `Tag` object that BeautifulSoup returns like a dictionary, and pass in the attribute we want as a key:

In [50]:
img = tonight.find("img")
desc = img['title']

print(desc)

Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%.


##### Extracting all the information from the page
Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we:

* Select all items with the class `period-name` inside an item with the class `tombstone-container` in `seven_day`.
* Use a list comprehension to call the `get_text` method on each `BeautifulSoup` object.

In [51]:
# As a reminder here is what we are working with
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Overnight
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%." class="forecast-icon" src="newimages/medium/nra20.png" title="Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%."/>
 </p>
 <p class="short-desc">
  Patchy
  <br/>
  Drizzle and
  <br/>
  Patchy Fog
 </p>
 <p class="temp temp-low">
  Low: 63 °F
 </p>
</div>


In [52]:
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods

['Overnight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight']

In [53]:
#breaking it down
period_tags = seven_day.select(".tombstone-container .period-name")
period_tags

[<p class="period-name">Overnight<br/><br/></p>,
 <p class="period-name">Sunday<br/><br/></p>,
 <p class="period-name">Sunday<br/>Night</p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>]

In [54]:
periods = [pt.get_text() for pt in period_tags]
periods

['Overnight',
 'Sunday',
 'SundayNight',
 'Monday',
 'MondayNight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight']
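
Notice that 'Sunday' and 'Night' come out fused as 'SundayNight'. The `<br/>` tag between them carries no text, so joining the text pieces with an empty string glues the words together. The sketch below reproduces the effect with the standard library's `html.parser` as a stand-in (it is not BeautifulSoup, and `TextCollector` is our own hypothetical class). BeautifulSoup's `get_text` accepts a `separator` argument that serves the same purpose.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text pieces found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

def text_chunks(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks

chunks = text_chunks('<p class="period-name">Sunday<br/>Night</p>')
fused = ''.join(chunks)    # what joining with no separator produces
spaced = ' '.join(chunks)  # joining with a space keeps the words apart

print(fused)
print(spaced)
```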

In [55]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['PatchyDrizzle andPatchy Fog', 'IsolatedShowers andAreas Fog', 'IsolatedShowers thenCloudy', 'Partly Sunny', 'ChanceShowers', 'Showers', 'Showers', 'ShowersLikely', 'ChanceShowers']
['Low: 63 °F', 'High: 80 °F', 'Low: 65 °F', 'High: 82 °F', 'Low: 68 °F', 'High: 78 °F', 'Low: 61 °F', 'High: 72 °F', 'Low: 57 °F']
['Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%.', 'Sunday: Isolated showers.  Areas of fog before 11am.  Otherwise, mostly cloudy, with a high near 80. Calm wind becoming southeast around 6 mph in the afternoon.  Chance of precipitation is 20%.', 'Sunday Night: Isolated showers before 9pm.  Cloudy, with a low around 65. South wind around 6 mph.  Chance of precipitation is 20%.', 'Monday: Partly sunny, with a high near 82. South wind 6 to 10 mph. ', 'Monday Night: A chance of showers, mainly after 3am.  Partly cloudy, with a low around 68. South wind 

In [56]:
short_desc = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_desc)

['PatchyDrizzle andPatchy Fog', 'IsolatedShowers andAreas Fog', 'IsolatedShowers thenCloudy', 'Partly Sunny', 'ChanceShowers', 'Showers', 'Showers', 'ShowersLikely', 'ChanceShowers']


In [57]:
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)

['Low: 63 °F', 'High: 80 °F', 'Low: 65 °F', 'High: 82 °F', 'Low: 68 °F', 'High: 78 °F', 'Low: 61 °F', 'High: 72 °F', 'Low: 57 °F']
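
The temperatures are still strings, which makes them awkward to analyze. As one possible next step, the sketch below uses the `re` module imported earlier to split each entry into a label and an integer. The pattern and the helper name `parse_temp` are our own, and the sample list is an excerpt of the output above.

```python
import re

# Excerpt of the scraped temps list shown above
temps = ['Low: 63 °F', 'High: 80 °F', 'Low: 65 °F']

# Capture the label (Low or High) and the number of degrees that follows it
TEMP_PATTERN = re.compile(r'(?P<kind>Low|High):\s*(?P<degrees>-?\d+)')

def parse_temp(text):
    """Return (kind, degrees) for a forecast string, or None if it does not match."""
    match = TEMP_PATTERN.search(text)
    if match is None:
        return None
    return match.group('kind'), int(match.group('degrees'))

parsed = [parse_temp(t) for t in temps]
print(parsed)
```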


In [58]:
desc = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(desc)

['Overnight: Patchy drizzle and fog with isolated showers before 3am, then isolated showers after 3am.  Cloudy, with a low around 63. Calm wind.  Chance of precipitation is 20%.', 'Sunday: Isolated showers.  Areas of fog before 11am.  Otherwise, mostly cloudy, with a high near 80. Calm wind becoming southeast around 6 mph in the afternoon.  Chance of precipitation is 20%.', 'Sunday Night: Isolated showers before 9pm.  Cloudy, with a low around 65. South wind around 6 mph.  Chance of precipitation is 20%.', 'Monday: Partly sunny, with a high near 82. South wind 6 to 10 mph. ', 'Monday Night: A chance of showers, mainly after 3am.  Partly cloudy, with a low around 68. South wind 6 to 10 mph.  Chance of precipitation is 40%.', 'Tuesday: A chance of showers, then showers and possibly a thunderstorm after 9am.  High near 78. Chance of precipitation is 80%.', 'Tuesday Night: Showers and possibly a thunderstorm before 3am, then showers likely.  Low around 61. Chance of precipitation is 80%.',

## Example 5: Weather Analysis

#### **(Skip to the UFO exercise if time is short)**

Combine all the newly scraped data and analyze it. To do this, we load each list into a pandas DataFrame. The cells below build one single-column DataFrame per list and then join them side by side with `pd.concat`. Passing all the lists to the DataFrame class as a single dictionary gives the same result: each dictionary key becomes a column in the DataFrame, and each list becomes the values in that column.
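As a minimal sketch of the dictionary approach, the snippet below builds one DataFrame directly from three of the lists. The values are the first three entries of the scraped output shown in this notebook; the name `weather_sample` is only illustrative.

```python
import pandas as pd

# First three values copied from the scraped lists shown in this notebook.
periods = ["Overnight", "Sunday", "SundayNight"]
short_desc = [
    "PatchyDrizzle andPatchy Fog",
    "IsolatedShowers andAreas Fog",
    "IsolatedShowers thenCloudy",
]
temps = ["Low: 63 °F", "High: 80 °F", "Low: 65 °F"]

# Each key becomes a column name; each list becomes that column's values.
weather_sample = pd.DataFrame(
    {"periods": periods, "short_desc": short_desc, "temps": temps}
)
print(weather_sample)
```

All lists must have the same length, otherwise pandas raises a `ValueError`.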

In [59]:
import pandas as pd

In [60]:
frame_desc = pd.DataFrame(data=desc,columns=['desc'])
frame_desc

Unnamed: 0,desc
0,Overnight: Patchy drizzle and fog with isolate...
1,Sunday: Isolated showers. Areas of fog before...
2,Sunday Night: Isolated showers before 9pm. Cl...
3,"Monday: Partly sunny, with a high near 82. Sou..."
4,"Monday Night: A chance of showers, mainly afte..."
5,"Tuesday: A chance of showers, then showers and..."
6,Tuesday Night: Showers and possibly a thunders...
7,"Wednesday: Showers likely, mainly before 9am. ..."
8,Wednesday Night: A chance of showers. Mostly ...


In [61]:
frame_periods = pd.DataFrame(data=periods,columns=['periods'])
frame_periods

Unnamed: 0,periods
0,Overnight
1,Sunday
2,SundayNight
3,Monday
4,MondayNight
5,Tuesday
6,TuesdayNight
7,Wednesday
8,WednesdayNight


In [62]:
frame_short_descs = pd.DataFrame(data=short_desc,columns=['short_desc'])
frame_short_descs

Unnamed: 0,short_desc
0,PatchyDrizzle andPatchy Fog
1,IsolatedShowers andAreas Fog
2,IsolatedShowers thenCloudy
3,Partly Sunny
4,ChanceShowers
5,Showers
6,Showers
7,ShowersLikely
8,ChanceShowers


In [63]:
frame_temps = pd.DataFrame(data=temps, columns=['temps'])
frame_temps

Unnamed: 0,temps
0,Low: 63 °F
1,High: 80 °F
2,Low: 65 °F
3,High: 82 °F
4,Low: 68 °F
5,High: 78 °F
6,Low: 61 °F
7,High: 72 °F
8,Low: 57 °F


In [64]:
weather = pd.concat([frame_desc, frame_periods, frame_short_descs, frame_temps], axis=1)
weather

Unnamed: 0,desc,periods,short_desc,temps
0,Overnight: Patchy drizzle and fog with isolate...,Overnight,PatchyDrizzle andPatchy Fog,Low: 63 °F
1,Sunday: Isolated showers. Areas of fog before...,Sunday,IsolatedShowers andAreas Fog,High: 80 °F
2,Sunday Night: Isolated showers before 9pm. Cl...,SundayNight,IsolatedShowers thenCloudy,Low: 65 °F
3,"Monday: Partly sunny, with a high near 82. Sou...",Monday,Partly Sunny,High: 82 °F
4,"Monday Night: A chance of showers, mainly afte...",MondayNight,ChanceShowers,Low: 68 °F
5,"Tuesday: A chance of showers, then showers and...",Tuesday,Showers,High: 78 °F
6,Tuesday Night: Showers and possibly a thunders...,TuesdayNight,Showers,Low: 61 °F
7,"Wednesday: Showers likely, mainly before 9am. ...",Wednesday,ShowersLikely,High: 72 °F
8,Wednesday Night: A chance of showers. Mostly ...,WednesdayNight,ChanceShowers,Low: 57 °F


In [65]:
weather.head()

Unnamed: 0,desc,periods,short_desc,temps
0,Overnight: Patchy drizzle and fog with isolate...,Overnight,PatchyDrizzle andPatchy Fog,Low: 63 °F
1,Sunday: Isolated showers. Areas of fog before...,Sunday,IsolatedShowers andAreas Fog,High: 80 °F
2,Sunday Night: Isolated showers before 9pm. Cl...,SundayNight,IsolatedShowers thenCloudy,Low: 65 °F
3,"Monday: Partly sunny, with a high near 82. Sou...",Monday,Partly Sunny,High: 82 °F
4,"Monday Night: A chance of showers, mainly afte...",MondayNight,ChanceShowers,Low: 68 °F


### Analyzing Weather

This part requires a basic understanding of regular expression (regex) pattern syntax. If you are not familiar with it, the references below (and a quick Google search) will help.

[REGEX CHEAT SHEET](https://www.dataquest.io/blog/regex-cheatsheet/): Refresher

[PYTHON REGEX DOCS](https://docs.python.org/3/library/re.html): PyDocs
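
Before applying a pattern to a whole column, it can help to try it on a single string with Python's built-in `re` module. The sketch below uses the pattern from this section, `(?P<temp_num>\d+)`, on one of the scraped temperature strings. `\d+` matches one or more digits, and `(?P<temp_num>...)` gives the captured group a name.

```python
import re

# \d+ matches one or more digits; (?P<temp_num>...) names the captured group.
pattern = re.compile(r"(?P<temp_num>\d+)")

match = pattern.search("Low: 63 °F")
print(match.group("temp_num"))  # → 63 (still a string at this point)
```

The raw-string prefix `r"..."` keeps Python from treating `\d` as a string escape sequence.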

In [None]:
# First we extract the number from the temps column so we can run some basic calculations.
# The raw string (r"...") keeps Python from treating \d as a string escape sequence.

temp_nums = weather["temps"].str.extract(r"(?P<temp_num>\d+)", expand=False)
temp_nums

In [None]:
# Next we add the temp_num column to our DataFrame, casting it to dtype int so we can run some calculations

weather["temp_num"] = temp_nums.astype('int')
weather

#### Mean Temperatures

In [None]:
weather["temp_num"].mean()

#### Night Time Temperatures

In [None]:
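# Night-time periods are the ones whose temperature is labelled "Low"
is_night = weather["temps"].str.contains("Low")
weather["is_night"] = is_night
# Use the boolean mask to show only the night-time rows
weather[is_night]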
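
The same boolean mask can be used to compare night and day temperatures. This is a self-contained sketch that rebuilds the `temps` column from the values printed earlier in this notebook; the names `sample`, `night_mean`, and `day_mean` are illustrative only.

```python
import pandas as pd

# Temperature strings copied from the scraped output shown earlier.
temps = ["Low: 63 °F", "High: 80 °F", "Low: 65 °F", "High: 82 °F", "Low: 68 °F",
         "High: 78 °F", "Low: 61 °F", "High: 72 °F", "Low: 57 °F"]

sample = pd.DataFrame({"temps": temps})
sample["temp_num"] = sample["temps"].str.extract(r"(\d+)", expand=False).astype(int)
sample["is_night"] = sample["temps"].str.contains("Low")

# The mask selects night rows; ~ inverts it to select day rows.
night_mean = sample.loc[sample["is_night"], "temp_num"].mean()
day_mean = sample.loc[~sample["is_night"], "temp_num"].mean()
print(night_mean, day_mean)
```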

## EXERCISE 5: UFO Sightings

### *With the remaining time, please attempt the exercise below*

1. Use Beautiful Soup to inspect the HTML file behind the http://www.nuforc.org/webreports/ndxe201608.html data.
2. Use the findAll() and findChildren() methods to loop through the HTML data and load it into a DataFrame.

*HINT: you may want to consider a nested for loop using the two find methods above. If you can find a more elegant way to do it, please share with the group.*
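
To illustrate the nested-loop idea without giving away the exercise, here is a minimal sketch on a small, hypothetical HTML table (it is not the real NUFORC markup, and the dates, cities, and states are made up). It assumes each row holds one `<td>` per column.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table for illustration only; not the real NUFORC page.
html = """
<table>
  <tr valign="TOP"><td><a href="#">8/1/16 21:00</a></td><td>Austin</td><td>TX</td></tr>
  <tr valign="TOP"><td><a href="#">8/2/16 22:30</a></td><td>Salem</td><td>OR</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
columns = ["Date", "City", "State"]
rows = {name: [] for name in columns}

for tr in soup.findAll("tr", attrs={"valign": "TOP"}):
    # Restricting findChildren to <td> skips nested tags such as <a>.
    cells = tr.findChildren("td")
    for name, cell in zip(columns, cells):
        rows[name].append(cell.text.strip())

print(rows)
```

A dictionary of equal-length lists like `rows` can be passed straight to `pd.DataFrame`. On the real page, inspect the children first, since nested tags can shift the positions you need.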

In [None]:
from bs4 import BeautifulSoup # a python HTML parser
import re #Regular expressions
import requests
import pandas as pd

In [None]:
r = requests.get("http://www.nuforc.org/webreports/ndxe201608.html")
b = BeautifulSoup(r.text, 'html.parser')
r.status_code

In [None]:
# r.content #go ahead and uncomment this line by hitting ctrl+/

In [None]:
# b #go ahead and uncomment this as well and compare the difference... then bask in the glory of BeautifulSoup

In [None]:
# What data do we have? Let's look at the table header (the <thead> element) to see which columns it contains.
d = b.findAll('thead')
print(d)

In [None]:
# Let's take a look at the first sighting
for tr in b.findAll('tr', attrs = {'valign':'TOP'})[:1]: # remove the '1' in the slice to view all data
    # the findChildren method returns all children underneath it
    for child in tr.findChildren():
        print(child.text)

In [None]:
# OK, it's a bit messy. Let's clean it up. Complete the loops below to load the data into a DataFrame.
# It looks like the 1st element is the date, the 4th is the city, the 6th is the state, the 8th is the shape, etc.


ufo_sightings = {
        'Date':[],
        'City':[],
        'State':[],
        'Shape':[],
        'Duration':[],
        'Summary':[]
    }

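for tr in b.findAll('tr', attrs = {'valign':'TOP'}):
    # your code goes here
    for child in tr.findChildren():
        # your code goes here
        pass  # placeholder so the cell runs; replace with your code

# Build the DataFrame once, after the loops have filled the lists
pd.DataFrame(ufo_sightings).head()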