# Session 5 Notes

## Pandas `read_html`

Make sure you have the following modules installed:
- pandas
- html5lib
- lxml
- BeautifulSoup4

Install all of them with:

`pip install pandas html5lib lxml BeautifulSoup4`


In [116]:
import pandas as pd
import html5lib
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_United_States_television_markets")

In [5]:
type(tables)

list

In [117]:
dmas = tables[0]
dmas.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | Rank | Market | State | Counties (or county-equivalents) covered | TV households (2016–17) | Major network affiliates (ABC, CBS, Fox, NBC) ... |
| 1 | 1 | New York | New York | New York: Bronx, Dutchess, Kings, Nassau, New ... | 7,348,620 (6.407%) | WABC-TV (ABC), WCBS-TV (CBS) (WLNY-TV (Indepen... |
| 2 | 2 | Los Angeles | California | Inyo, Kern (Sierra Nevada), Los Angeles, Orang... | 5,476,830 (4.775%) | KABC-TV (ABC), KCBS-TV (CBS) (KCAL-TV (Indepen... |
| 3 | 3 | Chicago | Illinois | Illinois: Cook, DeKalb, DuPage, Grundy, Kane, ... | 3,463,060 (3.019%) | WBBM-TV (CBS), WFLD (Fox) (WPWR-TV (CW/MyNetwo... |
| 4 | 4 | Philadelphia | Pennsylvania | Pennsylvania: Berks, Bucks, Chester, Delaware,... | 2,942,800 (2.566%) | KYW-TV (CBS) (WPSG (CW)), WCAU (NBC) (WWSI (Te... |


In [118]:
dmas.columns = dmas.iloc[0,:]
dmas = dmas.drop(0).reset_index(drop=True)

In [119]:
dmas.head()

|   | Rank | Market | State | Counties (or county-equivalents) covered | TV households (2016–17) | Major network affiliates (ABC, CBS, Fox, NBC) Sister stations |
|---|---|---|---|---|---|---|
| 0 | 1 | New York | New York | New York: Bronx, Dutchess, Kings, Nassau, New ... | 7,348,620 (6.407%) | WABC-TV (ABC), WCBS-TV (CBS) (WLNY-TV (Indepen... |
| 1 | 2 | Los Angeles | California | Inyo, Kern (Sierra Nevada), Los Angeles, Orang... | 5,476,830 (4.775%) | KABC-TV (ABC), KCBS-TV (CBS) (KCAL-TV (Indepen... |
| 2 | 3 | Chicago | Illinois | Illinois: Cook, DeKalb, DuPage, Grundy, Kane, ... | 3,463,060 (3.019%) | WBBM-TV (CBS), WFLD (Fox) (WPWR-TV (CW/MyNetwo... |
| 3 | 4 | Philadelphia | Pennsylvania | Pennsylvania: Berks, Bucks, Chester, Delaware,... | 2,942,800 (2.566%) | KYW-TV (CBS) (WPSG (CW)), WCAU (NBC) (WWSI (Te... |
| 4 | 5 | Dallas-Fort Worth | Texas | Anderson, Bosque, Collin, Comanche, Cooke, Dal... | 2,713,380 (2.366%) | KDFW (Fox) (KDFI (MyNetworkTV)), KTVT (CBS) (K... |
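
The header fix above can be wrapped in a small helper. This is a minimal sketch on a made-up three-column frame (the `raw` data and the `promote_first_row` name are assumptions for illustration, not part of the scraped table), so it runs without a network connection:

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the real header sits in row 0
raw = pd.DataFrame([
    ["Rank", "Market", "State"],
    ["1", "New York", "New York"],
    ["2", "Los Angeles", "California"],
])

def promote_first_row(df):
    """Return a new frame whose first row has become the column labels."""
    out = df.iloc[1:].reset_index(drop=True)  # drop the header row, renumber from 0
    out.columns = df.iloc[0].tolist()         # use the old first row as labels
    return out

dmas_small = promote_first_row(raw)
california = dmas_small.loc[dmas_small["State"] == "California", "Market"].tolist()
print(list(dmas_small.columns))
print(california)
```

Returning a new frame rather than modifying the input keeps the raw scrape around in case the header row needs to be inspected again.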


In [19]:
dmas.loc[dmas.State=="California", "Market"]

1                                    Los Angeles
5                 San Francisco-Oakland-San Jose
19                   Sacramento-Stockton-Modesto
27                                     San Diego
53                                Fresno-Visalia
123    Santa Barbara-Santa Maria-San Luis Obispo
124                             Monterey-Salinas
125                                  Bakersfield
131                                Chico-Redding
145                                 Palm Springs
194                                       Eureka
Name: Market, dtype: object

## Disqus Comments from The Atlantic

In [20]:
import requests
disqus_url = "https://disqus.com/embed/comments/?base=default&f=theatlantic&t_i=mt536010&t_u=https%3A%2F%2Fwww.theatlantic.com%2Fscience%2Farchive%2F2017%2F08%2Fadvice-for-eclipse-newbies%2F536010%2F&t_e=Advice%20for%20Eclipse%20Newbies&t_d=Advice%20for%20Eclipse%20Newbies&t_t=Advice%20for%20Eclipse%20Newbies&s_o=default"
article_url = "https://www.theatlantic.com/science/archive/2017/08/advice-for-eclipse-newbies/536010/#article-comments"
r = requests.get(disqus_url, headers={'referer': article_url})

In [29]:
"Back in the early 90s" in r.text

True
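
To see what the embed URL is made of, it can help to split its query string into parameters. A minimal sketch using only the standard library (the `params` name is just for illustration):

```python
from urllib.parse import urlsplit, parse_qs

disqus_url = "https://disqus.com/embed/comments/?base=default&f=theatlantic&t_i=mt536010&t_u=https%3A%2F%2Fwww.theatlantic.com%2Fscience%2Farchive%2F2017%2F08%2Fadvice-for-eclipse-newbies%2F536010%2F&t_e=Advice%20for%20Eclipse%20Newbies&t_d=Advice%20for%20Eclipse%20Newbies&t_t=Advice%20for%20Eclipse%20Newbies&s_o=default"

# parse_qs returns a list of values per key; keep the first value of each
query = urlsplit(disqus_url).query
params = {key: values[0] for key, values in parse_qs(query).items()}

for key, value in params.items():
    print("{:>5}  {}".format(key, value))
```

In this URL, `t_i` is `mt` followed by the same six-digit ID that ends the article URL in `t_u`, which is the pattern the helper function in the next cell relies on.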

In [55]:
from urllib.parse import urlencode
import re
from collections import OrderedDict

def get_disqus_url(article_url):
    # Capture only the six digits; match.group() alone would include the slashes
    match = re.search(r'/([0-9]{6})/', article_url)
    if match:
        article_id = match.group(1)
    else:
        print("Could not extract article_id: {}".format(article_url))
        return None
    params = OrderedDict([
        ("base","default"),
        ("f","theatlantic"),
        ("t_i","mt{}".format(article_id)),
        ("t_u",article_url),
        ("s_o","default")
    ])
    query = urlencode(params)
    return "https://disqus.com/embed/comments/?{}".format(query)

In [56]:
get_disqus_url("https://www.theatlantic.com/politics/archive/2017/%E2%80%A6mmon-error-in-coverage-of-the-google-memo/536181/")

'https://disqus.com/embed/comments/?base=default&f=theatlantic&t_i=mt536181&t_u=https%3A%2F%2Fwww.theatlantic.com%2Fpolitics%2Farchive%2F2017%2F%25E2%2580%25A6mmon-error-in-coverage-of-the-google-memo%2F536181%2F&s_o=default'
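
A detail worth keeping in mind with a regex like the one in `get_disqus_url`: `match.group()` returns the entire match, slashes included, while a capturing group returns just the digits. A small standalone check (the variable names here are only for illustration):

```python
import re

article_url = "https://www.theatlantic.com/science/archive/2017/08/advice-for-eclipse-newbies/536010/"

# The parentheses mark the part of the match we actually want to keep
match = re.search(r"/([0-9]{6})/", article_url)

whole_match = match.group()    # everything the pattern matched
digits_only = match.group(1)   # only the first capturing group

print(repr(whole_match))   # '/536010/'
print(repr(digits_only))   # '536010'
```

Since the thread ID in the original embed URL is `mt536010` with no slashes, the captured digits are the value that belongs after `mt`.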

## PyQuery Basics

In [85]:
import requests
from pyquery import PyQuery as PQ

url = "https://arstechnica.com/tech-policy/2017/07/linkedin-its-illegal-to-scrape-our-website-without-permission/"


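# Option 1: fetch the page with requests, then build the PyQuery object from the HTML
r = requests.get(url)
raw_html = r.text
pq = PQ(raw_html)

# Option 2: let PyQuery download the HTML directly from the URL
# (this replaces the object created above; you only need one of the two)
pq = PQ(url=url)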

In [86]:
# Select all the a tags (i.e., link tags):
pq("a")

[<a#header-logo>, <a.dropdown-toggle>, <a.nav-search-close>, <a>, <a>, <a>, <a.active>, <a>, <a>, <a>, <a.dropdown-toggle>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a.active>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a.dropdown-toggle>, <a>, <a.signup-btn.button.button-wide>, <a.dropdown-toggle>, <a>, <a>, <a.caption-link>, <a.comment-count.icon-comment-bubble-down>, <a.social-icon.share-facebook>, <a.social-icon.share-twitter>, <a.social-icon.share-reddit>, <a.social-icon.share-gplus>, <a>, <a>, <a>, <a>, <a>, <a>, <a.author-photo>, <a.author-name>, <a>, <a>, <a.comment-count.icon-comment-bubble-down>, <a.social-icon.share-facebook>, <a.social-icon.share-twitter>, <a.social-icon.share-reddit>, <a.social-icon.share-gplus>, <a>, <a.vote_login>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a>, <a.icon.icon-logo-cn-us>, <a>, <a>, <a>, <a>, <a>]

In [90]:
# Get all the href values from the link tags:
hrefs = []
for a in pq("a"):
    href = pq(a).attr("href")
    hrefs.append(href)
hrefs[0:10]

['https://arstechnica.com',
 '/search/',
 None,
 '/information-technology/',
 '/gadgets/',
 '/science/',
 '/tech-policy/',
 '/cars/',
 '/gaming/',
 '/civis/']

In [91]:
# Get full links
hrefs = []
base_url = "https://arstechnica.com"
for a in pq("a")[0:10]:
    href = pq(a).attr("href")
    if not href:
        continue
    if href.startswith("/"):
        href = base_url + href
        hrefs.append(href)
    elif href.startswith("http"):
        hrefs.append(href)
    else:
        continue
hrefs[0:10]

['https://arstechnica.com',
 'https://arstechnica.com/search/',
 'https://arstechnica.com/information-technology/',
 'https://arstechnica.com/gadgets/',
 'https://arstechnica.com/science/',
 'https://arstechnica.com/tech-policy/',
 'https://arstechnica.com/cars/',
 'https://arstechnica.com/gaming/',
 'https://arstechnica.com/civis/']
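
The loop above builds absolute URLs by hand. As an alternative, the standard library's `urljoin` resolves an href against a base URL and leaves already-absolute links alone. A minimal sketch, assuming a hypothetical `sample_hrefs` list in place of the scraped page (the `absolutize` helper is also just an illustrative name):

```python
from urllib.parse import urljoin, urlsplit

def absolutize(hrefs, base_url):
    """Resolve each href against base_url, keeping only http(s) links."""
    full_links = []
    for href in hrefs:
        if not href:
            continue  # skip <a> tags with no href attribute
        joined = urljoin(base_url, href)
        if urlsplit(joined).scheme in ("http", "https"):
            full_links.append(joined)
    return full_links

# Hypothetical hrefs, similar in shape to what pq("a") yields
sample_hrefs = [
    "https://arstechnica.com",
    "/search/",
    None,
    "/gadgets/",
    "mailto:someone@example.com",
    "http://video.arstechnica.com/",
]

full_links = absolutize(sample_hrefs, "https://arstechnica.com")
for link in full_links:
    print(link)
```

Filtering on the URL scheme after joining drops things like `mailto:` links, which the `startswith("http")` check in the loop above also excludes.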

In [92]:
# Get full links and link text
links = []
base_url = "https://arstechnica.com"
for a in pq("a"):
    href = pq(a).attr("href")
    if not href:
        continue
    if href.startswith("/"):
        href = base_url + href
    elif not href.startswith("http"):
        continue
    links.append((a.text, href))

links[0:10]


[('\n      ', 'https://arstechnica.com'),
 ('\n          ', 'https://arstechnica.com/search/'),
 ('Biz & IT', 'https://arstechnica.com/information-technology/'),
 ('Tech', 'https://arstechnica.com/gadgets/'),
 ('Science', 'https://arstechnica.com/science/'),
 ('Policy', 'https://arstechnica.com/tech-policy/'),
 ('Cars', 'https://arstechnica.com/cars/'),
 ('Gaming & Culture', 'https://arstechnica.com/gaming/'),
 ('Forums', 'https://arstechnica.com/civis/'),
 ('Videos', 'http://video.arstechnica.com/')]

### Quirky things about PyQuery

When your selector matches more than one element, PyQuery returns a list-like collection of elements. To get a specific element out of it, index into it (e.g., `[0]`) or iterate over it with a for loop.

When you have selected a **specific** element (i.e., not a list of elements), you can read its `.text` attribute to get the text associated with that specific tag. This will only give you the *top level* text associated with that tag, meaning the text that comes before its first child tag.

**However!** If you want text that may be embedded within that tag (i.e., inside another child tag), you will want to call the full PyQuery `.text()` extraction function.

To do this, pass your extracted element back into the PyQuery function, and use `.text()`. Example below:

In [99]:
first_para = pq("div.article-content.post-page > p")[0]
first_para

<Element p at 0x7f50c8611688>

In [103]:
print(first_para.text)

print("\n")

print(PQ(first_para).text())

A small company called hiQ is locked in a high-stakes battle over 


A small company called hiQ is locked in a high-stakes battle over Web scraping with LinkedIn. It's a fight that could determine whether an anti-hacking law can be used to curtail the use of scraping tools across the Web.
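
The elements PyQuery hands back are lxml elements, and lxml follows the same `.text` convention as the standard library's `xml.etree.ElementTree`. The sketch below reproduces the quirk with the standard library alone; the markup is a hypothetical stand-in that assumes the paragraph contains a link right where `.text` stopped:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a paragraph with a child <a> tag in the middle
snippet = (
    '<p>A small company called hiQ is locked in a high-stakes battle over '
    '<a href="#">Web scraping</a> with LinkedIn.</p>'
)
para = ET.fromstring(snippet)

# .text stops at the first child tag
top_level_text = para.text

# itertext() walks the element and all of its descendants
full_text = "".join(para.itertext())

print(repr(top_level_text))
print(repr(full_text))
```

Here `top_level_text` ends right before the link, while `full_text` includes the link text and everything after it, which mirrors the difference between `first_para.text` and `PQ(first_para).text()` above.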
