## Minimize Requests to the HTML Page

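Why

* it's courteous to the site (less load on its servers)
* too many requests can get you blocked from making more requests to the same site
* reading from disk / memory is faster than going over the network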

Strategies

* do the request once... and never run that cell again (keep the HTML in memory)
* download the page: Save As (File menu or right-click), then read the saved file from disk
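
Both strategies can be combined by caching the HTML on disk and only making a request when no saved copy exists. The sketch below is one possible way to do that, not something from the original notebook: `fetch_cached`, the `cache` directory name, and the `fetch` callable passed in are all hypothetical names chosen for illustration.

```python
from pathlib import Path
import hashlib


def fetch_cached(url, fetch, cache_dir="cache"):
    """Return the HTML for url, calling fetch(url) only on a cache miss.

    fetch is any callable that takes a URL and returns the page text,
    for example: lambda u: requests.get(u).text
    """
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = cache / name
    if path.exists():
        # Cache hit: read from disk, no request is made.
        return path.read_text(encoding="utf-8")
    # Cache miss: fetch once and save the result for next time.
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Calling `fetch_cached(url, lambda u: requests.get(u).text)` twice would hit the site only the first time; the second call reads the saved file.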

## Web - a bunch of resources (on the internet)

resources

* images
* documents (web pages)
* css files

## Client vs Server

* client - asks for resource (making a request)
* server - fulfills that request

Make a request, get back a response. The response includes some sort of resource (an HTML page, a JSON file, an image).

## Two options for making requests (in Python)

* `urllib` - built in, but you have to import it
* `requests` - must install this on your own
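
For comparison with `requests`, here is a minimal sketch of the built-in option. It only builds the request object so that it runs without touching the network; `fetch_with_urllib` is a hypothetical helper name, and decoding as UTF-8 is an assumption about the page's encoding.

```python
from urllib.request import Request, urlopen

url = "https://www.imdb.com/title/tt0108778/fullcredits"

# Build the request object; nothing is sent until urlopen() is called.
req = Request(url)
print(req.get_method())  # GET (the default when no data is attached)
print(req.full_url)


def fetch_with_urllib(request):
    """Send the request and return the body as text (hypothetical helper)."""
    with urlopen(request) as response:
        # read() returns bytes, so decode to get a str (UTF-8 assumed).
        return response.read().decode("utf-8")
```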

In [1]:
import requests

In [2]:
res = requests.get('https://www.imdb.com/title/tt0108778/fullcredits?ref_=tt_cl_sm#cast')

In [3]:
res

<Response [200]>
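
Before using a response it helps to check whether the request succeeded. The sketch below builds a `Response` object by hand (an artificial setup, used only so the example runs offline) and shows the `ok` attribute and `raise_for_status()`.

```python
import requests

# Build a Response by hand so the example runs without the network.
fake = requests.Response()
fake.status_code = 404

print(fake.status_code)  # 404
print(fake.ok)           # False, because 404 is an error status

try:
    # raise_for_status() raises for 4xx and 5xx codes, otherwise does nothing.
    fake.raise_for_status()
except requests.HTTPError as err:
    print("request failed:", err)
```

With a real response, `res.raise_for_status()` right after `requests.get(...)` stops the notebook early instead of parsing an error page.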

## HTTP Requests

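The request line: method, path (plus query string), protocol version. The `#cast` fragment is handled by the browser and is not sent to the server.

GET /title/tt0108778/fullcredits?ref_=tt_cl_sm HTTP/1.1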
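
To see which parts of a URL end up in the request line, the URL can be split with the standard library. This is a sketch for illustration only; `requests` does this work for you.

```python
from urllib.parse import urlsplit

url = "https://www.imdb.com/title/tt0108778/fullcredits?ref_=tt_cl_sm#cast"
parts = urlsplit(url)

# The request target is the path plus the query string (if any).
target = parts.path + ("?" + parts.query if parts.query else "")
request_line = f"GET {target} HTTP/1.1"

print(request_line)    # GET /title/tt0108778/fullcredits?ref_=tt_cl_sm HTTP/1.1
print(parts.netloc)    # www.imdb.com (sent separately, in the Host header)
print(parts.fragment)  # cast (stays on the client side)
```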

## HTTP Response

* 200 - OK
* 401 - authentication required
* 403 - forbidden (the server refuses the request)
* 404 - resource / page not found
* 500 - server error

HTTP/1.1 200 OK
...

<html>.....
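
The standard reason phrases for status codes can be looked up with the standard library's `http.HTTPStatus`. The `status_class` helper below is a hypothetical name; it groups codes by their first digit.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to its broad class by its first digit."""
    classes = {1: "informational", 2: "success", 3: "redirect",
               4: "client error", 5: "server error"}
    return classes.get(code // 100, "unknown")


for code in (200, 401, 403, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, "-", status_class(code))
```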



In [5]:
# to get at body (the html, or json payload)
res.text[:500]

'\n\n \n\n\n\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///title/tt0108778?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n   '

## Use Your Browser's Tools for Inspecting HTML and HTTP Requests and Responses

* right-click inspect
* open web developer tools
* view source

## CSS Selectors Allow You to Select Elements Based on Some Criteria

Basic Selectors

* tagname: img 
* class: .someClass
* id: #someId

Combine tags and classes / IDs

* tagName.className: img.primaryPhoto (all images with class="primaryPhoto")
* tagName#idValue: div#someId (the div with id="someId")

Combining Selectors by Relationship

* selector1 selector2: any element selected by 2 that is nested within an element selected by 1 (descendant):
    * .filmography a: all a tags nested within class=filmography (any level of nesting)
* selector1 > selector2: direct child (only one level of nesting)
* selector1, selector2: or... all elements that match either selector 1 or selector 2
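
The relationship selectors are easier to tell apart on a small document. The HTML below is made up for illustration (it is not IMDb markup), and `demo` is a throwaway name so it does not clash with the notebook's `dom`.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the class and id names are only for this example.
html = """
<div class="filmography">
  <a href="/a">direct child</a>
  <p><a href="/b">nested deeper</a></p>
</div>
<a id="outside" href="/c">outside</a>
"""
demo = BeautifulSoup(html, "html.parser")

# Descendant: every <a> anywhere inside .filmography
print(len(demo.select(".filmography a")))    # 2
# Child: only <a> tags directly inside .filmography
print(len(demo.select(".filmography > a")))  # 1
# Or: elements matching either selector
print(len(demo.select("p, #outside")))       # 2
# Tag combined with an id
print(demo.select("a#outside")[0].text)      # outside
```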


In [7]:
res.text[:100]

'\n\n \n\n\n\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook'

## DOM Library / HTML Parser

DOM - Document Object Model: an API for interfacing with an HTML document

* what are the objects that you have
* what are the built in methods, etc.

DOM API can be implemented with any language or library

`BeautifulSoup4` - an HTML parser with a DOM-like API (a little different from the JavaScript DOM API, but the concepts are the same)

In [11]:
from bs4 import BeautifulSoup

BeautifulSoup is a class / constructor... pass it a string... and it parses that string

In [12]:
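# name the parser explicitly to avoid the "no parser was explicitly specified" warning
tmp = BeautifulSoup('<h1>hello</h1>', 'html.parser')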

In [13]:
tmp

<h1>hello</h1>

In [14]:
dom = BeautifulSoup(res.text, 'html.parser')

In [16]:
#dom

* use select to select elements with a CSS selector
    * this can be used on the entire document
    * or to select within an element
* use . to find the first nested element with that specific tag name
    * element.otherElement ... returns the first otherElement found anywhere inside element (not only direct children)

In [18]:
dom.select('img')[:3]

[<img alt="IMDbPro Menu" src="https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_logo_nb._CB484021162_.png"/>,
 <img alt="Go to IMDbPro" height="145" src="https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_navbar_menu_user._CB484021156_.png" srcset="https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_navbar_menu_user._CB484021156_.png 1x, https://m.media-amazon.com/images/G/01/wprs/images/navbar/imdbpro_navbar_menu_user_2x._CB484021157_.png 2x" width="127"/>,
 <img alt="Friends (TV Series 1994–2004) Poster" class="poster" height="98" itemprop="image" src="https://m.media-amazon.com/images/M/MV5BNDVkYjU0MzctMWRmZi00NTkxLTgwZWEtOWVhYjZlYjllYmU4XkEyXkFqcGdeQXVyNTA4NzY1MzY@._V1_UX67_CR0,0,67,98_AL_.jpg" width="67"/>]

In [19]:
dom.select('td img')[:3]

[<img alt="Jennifer Aniston" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BNjk1MjIxNjUxNF5BMl5BanBnXkFtZTcwODk2NzM4Mg@@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Jennifer Aniston" width="32"/>,
 <img src="https://m.media-amazon.com/images/G/01/imdb/images/favorite_theater/spinner-3099941772._CB470042407_.gif"/>,
 <img alt="Courteney Cox" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTA4OTczNDExNDNeQTJeQWpwZ15BbWU3MDUyNTIzMTM@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Courteney Cox" width="32"/>]

In [20]:
len(dom.select('td img'))
#dom.select('td.primary_photo img')[:3]

1626

In [22]:
rows = dom.select('td.primary_photo img')

In [23]:
for row in rows[:3]:
    print(row)

<img alt="Jennifer Aniston" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BNjk1MjIxNjUxNF5BMl5BanBnXkFtZTcwODk2NzM4Mg@@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Jennifer Aniston" width="32"/>
<img alt="Courteney Cox" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTA4OTczNDExNDNeQTJeQWpwZ15BbWU3MDUyNTIzMTM@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Courteney Cox" width="32"/>
<img alt="Lisa Kudrow" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTU5OTA0ODcxNl5BMl5BanBnXkFtZTcwMjE3NjQxMw@@._V1_UY44_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Lisa Kudrow" width="32"/>
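
In the rows above, `src` points at a placeholder image while the `loadlate` attribute appears to hold the actual photo URL. Attributes are read with dictionary-style access. The sketch below uses a shortened stand-in for one row, with placeholder `example.com` URLs that are not from the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one row of the output above (URLs are placeholders).
html = (
    '<td class="primary_photo"><a href="/name/nm0000098/">'
    '<img alt="Jennifer Aniston" class="loadlate hidden" '
    'loadlate="https://example.com/photo.jpg" '
    'src="https://example.com/nopicture.png"/></a></td>'
)
img = BeautifulSoup(html, "html.parser").select("td.primary_photo img")[0]

print(img["alt"])               # Jennifer Aniston
print(img.get("loadlate"))      # https://example.com/photo.jpg
print(img["class"])             # ['loadlate', 'hidden'] (class is multi-valued, so a list)
print(img.get("data-missing"))  # None; .get() avoids a KeyError for absent attributes
```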


In [24]:
links = dom.select('td.primary_photo a')

In [26]:
links[:3]

[<a href="/name/nm0000098/"><img alt="Jennifer Aniston" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BNjk1MjIxNjUxNF5BMl5BanBnXkFtZTcwODk2NzM4Mg@@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Jennifer Aniston" width="32"/></a>,
 <a href="/name/nm0001073/"><img alt="Courteney Cox" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTA4OTczNDExNDNeQTJeQWpwZ15BbWU3MDUyNTIzMTM@._V1_UX32_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopicture/32x44/name-2138558783._CB470041625_.png" title="Courteney Cox" width="32"/></a>,
 <a href="/name/nm0001435/"><img alt="Lisa Kudrow" class="loadlate hidden" height="44" loadlate="https://m.media-amazon.com/images/M/MV5BMTU5OTA0ODcxNl5BMl5BanBnXkFtZTcwMjE3NjQxMw@@._V1_UY44_CR0,0,32,44_AL_.jpg" src="https://m.media-amazon.com/images/G/01/imdb/images/nopi

In [28]:
['https://www.imdb.com' + link['href'] for link in links[:3]]

['https://www.imdb.com/name/nm0000098/',
 'https://www.imdb.com/name/nm0001073/',
 'https://www.imdb.com/name/nm0001435/']
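
String concatenation works here because every `href` starts with `/`. A slightly more general sketch uses `urllib.parse.urljoin` from the standard library, which also copes with relative and already-absolute links. The `base` value is the page the links were found on; the last two inputs are made-up examples.

```python
from urllib.parse import urljoin

base = "https://www.imdb.com/title/tt0108778/fullcredits"
hrefs = ["/name/nm0000098/", "/name/nm0001073/", "/name/nm0001435/"]

# urljoin resolves each href against the page it was found on.
absolute = [urljoin(base, href) for href in hrefs]
print(absolute[0])  # https://www.imdb.com/name/nm0000098/

# A relative href replaces the last path segment of the base.
print(urljoin(base, "bio"))  # https://www.imdb.com/title/tt0108778/bio

# An absolute URL is returned unchanged.
print(urljoin(base, "https://example.com/x.png"))  # https://example.com/x.png
```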

In [29]:
url = 'https://www.imdb.com/name/nm0000098/'

In [30]:
test_res = requests.get(url)

In [31]:
test_res

<Response [200]>

In [32]:
test_res.text[:500]

'\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///name/nm0000098?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("b'

In [33]:
test_dom = BeautifulSoup(test_res.text, 'html.parser')

In [35]:
test_dom.select('H1')

[<h1 class="header"> <span class="itemprop">Jennifer Aniston</span>
 </h1>]

To retrieve text from an element, use `.text`

In [38]:
test_dom.select('H1')[0].text.strip()

'Jennifer Aniston'

In [40]:
test_dom.select('#show-actress')[0].text

'Show\xa0'

In [43]:
# both alternatives need the # prefix to be treated as IDs
raw_credits = test_dom.select('#filmo-head-actress, #filmo-head-actor')[0].text

In [44]:
raw_credits

'\nHide\xa0\nShow\xa0\nActress (64 credits)\n'
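
The `\xa0` characters in these strings are non-breaking spaces. One simple way to tidy such text, sketched here with a hypothetical helper name `clean_text`, is to split on whitespace and rejoin with single spaces.

```python
def clean_text(raw):
    """Collapse every run of whitespace, including non-breaking spaces,
    into a single space and trim both ends."""
    # str.split() with no arguments splits on any Unicode whitespace.
    return " ".join(raw.split())


print(clean_text("Show\xa0"))
# Show
print(clean_text("\nHide\xa0\nShow\xa0\nActress (64 credits)\n"))
# Hide Show Actress (64 credits)
```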

In [45]:
import re

In [54]:
# raw string so the backslashes reach the regex engine unchanged
m = re.search(r'\((\d{1,3}) credits\)', raw_credits)

In [55]:
m[0]

'(64 credits)'

In [56]:
m[1]

'64'
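
`re.search` returns `None` when nothing matches, so indexing the result directly can fail on pages without a credits header. A small sketch of a safer wrapper follows; `parse_credit_count` and `CREDITS_RE` are hypothetical names, and the pattern is loosened slightly (any number of digits, optional plural) as an assumption about what other pages might contain.

```python
import re

# Matches text such as "(64 credits)" or "(1 credit)".
CREDITS_RE = re.compile(r"\((\d+) credits?\)")


def parse_credit_count(text):
    """Return the credit count as an int, or None if the pattern is absent."""
    m = CREDITS_RE.search(text)
    return int(m.group(1)) if m else None


print(parse_credit_count("\nHide\xa0\nShow\xa0\nActress (64 credits)\n"))  # 64
print(parse_credit_count("no numbers here"))                               # None
```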

In [59]:
# dot plus a tag name returns the first matching element nested inside (at any depth)
test_dom.select('H1')[0].span

<span class="itemprop">Jennifer Aniston</span>

In [60]:
# select searches all nested elements within the parent element and returns a list
test_dom.select('H1')[0].select('span')


[<span class="itemprop">Jennifer Aniston</span>]
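
The difference between dot access and `select` only shows up when the tag is nested more than one level deep. The HTML below is made up to show that case; `:scope > span` and `find(..., recursive=False)` are two ways to restrict the match to direct children.

```python
from bs4 import BeautifulSoup

html = "<h1><em><span>inner</span></em><span>outer</span></h1>"
h1 = BeautifulSoup(html, "html.parser").h1

# Dot access returns the first matching descendant, even when it is nested.
print(h1.span.text)                           # inner
# A child combinator restricts the match to direct children of h1.
print(h1.select_one(":scope > span").text)    # outer
# find(..., recursive=False) is the non-CSS way to do the same.
print(h1.find("span", recursive=False).text)  # outer
# select() always returns a list of every match.
print([s.text for s in h1.select("span")])    # ['inner', 'outer']
```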

## pandas: Indexing and Grouping

In [61]:
cols = ["Year", "State", "Title", "Employment", "Salary"]


In [62]:
data = [[2018, "CA", "Web Dev", 20170, 86160],
        [2018, "CA", "DB Admin", 10970, 100890],
        [2018, "NY", "Web Dev", 12030, 79880],
        [2018, "NY", "DB Admin", 7100, 99000],
        [2017, "CA", "Web Dev", 21150, 84270],
        [2017, "CA", "DB Admin", 12030, 95630],
        [2017, "NY", "Web Dev", 11900, 82360],
        [2017, "NY", "DB Admin", 7170, 94330],
        [2016, "CA", "Web Dev", 22650, 82930],
        [2016, "CA", "DB Admin", 12370, 93960],
        [2016, "NY", "Web Dev", 11410, 81140],
        [2016, "NY", "DB Admin", 6650, 91720]]

In [64]:
import pandas as pd
import numpy as np
df = pd.DataFrame(data, columns=cols)

In [65]:
df

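| | Year | State | Title | Employment | Salary |
|---|---|---|---|---|---|
| 0 | 2018 | CA | Web Dev | 20170 | 86160 |
| 1 | 2018 | CA | DB Admin | 10970 | 100890 |
| 2 | 2018 | NY | Web Dev | 12030 | 79880 |
| 3 | 2018 | NY | DB Admin | 7100 | 99000 |
| 4 | 2017 | CA | Web Dev | 21150 | 84270 |
| 5 | 2017 | CA | DB Admin | 12030 | 95630 |
| 6 | 2017 | NY | Web Dev | 11900 | 82360 |
| 7 | 2017 | NY | DB Admin | 7170 | 94330 |
| 8 | 2016 | CA | Web Dev | 22650 | 82930 |
| 9 | 2016 | CA | DB Admin | 12370 | 93960 |
| 10 | 2016 | NY | Web Dev | 11410 | 81140 |
| 11 | 2016 | NY | DB Admin | 6650 | 91720 |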


In [66]:
tmp = df.set_index('Year')

In [67]:
tmp

| Year | State | Title | Employment | Salary |
|---|---|---|---|---|
| 2018 | CA | Web Dev | 20170 | 86160 |
| 2018 | CA | DB Admin | 10970 | 100890 |
| 2018 | NY | Web Dev | 12030 | 79880 |
| 2018 | NY | DB Admin | 7100 | 99000 |
| 2017 | CA | Web Dev | 21150 | 84270 |
| 2017 | CA | DB Admin | 12030 | 95630 |
| 2017 | NY | Web Dev | 11900 | 82360 |
| 2017 | NY | DB Admin | 7170 | 94330 |
| 2016 | CA | Web Dev | 22650 | 82930 |
| 2016 | CA | DB Admin | 12370 | 93960 |
| 2016 | NY | Web Dev | 11410 | 81140 |
| 2016 | NY | DB Admin | 6650 | 91720 |


In [69]:
tmp.loc[2016]

| Year | State | Title | Employment | Salary |
|---|---|---|---|---|
| 2016 | CA | Web Dev | 22650 | 82930 |
| 2016 | CA | DB Admin | 12370 | 93960 |
| 2016 | NY | Web Dev | 11410 | 81140 |
| 2016 | NY | DB Admin | 6650 | 91720 |


In [70]:
df = df.set_index(['Year', 'State'])

In [71]:
df

| Year | State | Title | Employment | Salary |
|---|---|---|---|---|
| 2018 | CA | Web Dev | 20170 | 86160 |
| 2018 | CA | DB Admin | 10970 | 100890 |
| 2018 | NY | Web Dev | 12030 | 79880 |
| 2018 | NY | DB Admin | 7100 | 99000 |
| 2017 | CA | Web Dev | 21150 | 84270 |
| 2017 | CA | DB Admin | 12030 | 95630 |
| 2017 | NY | Web Dev | 11900 | 82360 |
| 2017 | NY | DB Admin | 7170 | 94330 |
| 2016 | CA | Web Dev | 22650 | 82930 |
| 2016 | CA | DB Admin | 12370 | 93960 |
| 2016 | NY | Web Dev | 11410 | 81140 |
| 2016 | NY | DB Admin | 6650 | 91720 |


In [73]:
df.loc[(2016, 'CA')]

| Year | State | Title | Employment | Salary |
|---|---|---|---|---|
| 2016 | CA | Web Dev | 22650 | 82930 |
| 2016 | CA | DB Admin | 12370 | 93960 |


In [74]:
df.index.names

FrozenList(['Year', 'State'])

In [75]:
df.mean(level='Year')

| Year | Employment | Salary |
|---|---|---|
| 2018 | 12567.5 | 91482.5 |
| 2017 | 13062.5 | 89147.5 |
| 2016 | 13270.0 | 87437.5 |


In [76]:
df.count(level='State')

| State | Title | Employment | Salary |
|---|---|---|---|
| CA | 6 | 6 | 6 |
| NY | 6 | 6 | 6 |
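
Newer pandas releases no longer accept the `level=` argument on `mean` and `count`; grouping on the index level gives the same kind of result. The sketch below uses only a four-row subset of the data above, so its numbers differ from the outputs shown. `sort=False` is used to keep the first-seen order of the years.

```python
import pandas as pd

cols = ["Year", "State", "Title", "Employment", "Salary"]
# Four-row subset of the data above, to keep the example short.
data = [[2018, "CA", "Web Dev", 20170, 86160],
        [2018, "NY", "Web Dev", 12030, 79880],
        [2017, "CA", "Web Dev", 21150, 84270],
        [2017, "NY", "Web Dev", 11900, 82360]]
df = pd.DataFrame(data, columns=cols).set_index(["Year", "State"])

# Group on an index level; numeric_only=True skips the text column Title.
by_year = df.groupby(level="Year", sort=False).mean(numeric_only=True)
print(by_year)

# count() works on every column, so no numeric_only is needed.
by_state = df.groupby(level="State").count()
print(by_state)
```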


In [78]:
df = pd.DataFrame(data, columns=cols)
grouped = df['Salary'].groupby(df['Year'])



In [79]:
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x11fb20ed0>

In [80]:
grouped.mean()

Year
2016    87437.5
2017    89147.5
2018    91482.5
Name: Salary, dtype: float64

In [81]:
df['Salary'].groupby(df['Year']).mean()

Year
2016    87437.5
2017    89147.5
2018    91482.5
Name: Salary, dtype: float64

In [82]:
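# numeric_only=True skips the text columns (State, Title); newer pandas raises an error without it
df.groupby(df['Year']).mean(numeric_only=True)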

| Year | Employment | Salary |
|---|---|---|
| 2016 | 13270.0 | 87437.5 |
| 2017 | 13062.5 | 89147.5 |
| 2018 | 12567.5 | 91482.5 |
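
Grouping also works with column names instead of Series objects, and with more than one key at a time. The sketch below uses the 2018 and 2017 rows from the data above; `agg` with a list of function names is one way to get several statistics per group.

```python
import pandas as pd

cols = ["Year", "State", "Title", "Employment", "Salary"]
# The 2018 and 2017 rows from the data above.
data = [[2018, "CA", "Web Dev", 20170, 86160],
        [2018, "CA", "DB Admin", 10970, 100890],
        [2018, "NY", "Web Dev", 12030, 79880],
        [2018, "NY", "DB Admin", 7100, 99000],
        [2017, "CA", "Web Dev", 21150, 84270],
        [2017, "CA", "DB Admin", 12030, 95630],
        [2017, "NY", "Web Dev", 11900, 82360],
        [2017, "NY", "DB Admin", 7170, 94330]]
df = pd.DataFrame(data, columns=cols)

# Two keys produce a result indexed by (Year, State).
mean_salary = df.groupby(["Year", "State"])["Salary"].mean()
print(mean_salary.loc[(2018, "CA")])  # 93525.0

# agg() computes several statistics per group in one call.
summary = df.groupby("Year")["Salary"].agg(["mean", "min", "max"])
print(summary)
```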
