# STA 141B Assignment 4

Due __Feb 22, 2019__ by 11:59pm. Submit by editing this file, committing the changes with git, and then pushing to your private GitHub repo for the assignment. This assignment will be graded according to the class rubric.

Please do not rename this file or delete the exercise cells, because doing so will interfere with our grading tools. Put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text. Answer all questions in complete sentences.

The purpose of this assignment is to practice scraping data from web pages.

## The San Francisco Chronicle

In this assignment, you'll scrape text from [The San Francisco Chronicle](https://www.sfchronicle.com/) newspaper and then analyze the text.

The Chronicle is organized by category into article lists. For example, there's a [Local](https://www.sfchronicle.com/local/) list, [Sports](https://www.sfchronicle.com/sports/) list, and [Food](https://www.sfchronicle.com/food/) list.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Chronicle for analysis in exercise 1.4.

__Exercise 1.1.__ Write a function that extracts all of the links to articles in a Chronicle article list. The function should:

* Have a parameter `url` for the URL of the article list.

* Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

* Be polite and save time by setting up [requests_cache](https://pypi.python.org/pypi/requests-cache) before you write your function.

* You can use any of the XML/HTML parsing packages mentioned in class. Choose one and use it throughout the entire assignment.
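
As a rough illustration of the link-filtering step, the sketch below resolves relative links and keeps only URLs with an `article` path segment. It uses only the standard library; the helper name `filter_article_links`, the `/particles/` link, and the assumption that every article URL contains an `article` segment are illustrative and not confirmed by the Chronicle's site.

```python
from urllib.parse import urljoin, urlparse

def filter_article_links(base_url, hrefs):
    """Resolve hrefs against base_url and return unique article URLs in page order."""
    seen = set()
    articles = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Assumption: article pages have "article" as a whole path segment.
        # A plain substring test would also match paths such as "/particles/".
        if "article" not in urlparse(absolute).path.split("/"):
            continue
        if absolute not in seen:
            seen.add(absolute)
            articles.append(absolute)
    return articles

BASE = "https://www.sfchronicle.com/business/"
HREFS = [
    "/business/article/Levi-Strauss-files-to-go-public-again-13613503.php",
    "/business/article/Levi-Strauss-files-to-go-public-again-13613503.php",  # duplicate
    "/business/",    # section page, not an article
    "/particles/",   # hypothetical link that merely contains the substring
]
found = filter_article_links(BASE, HREFS)
print(found)
```

Matching on a whole path segment is one possible design choice; a substring test is simpler but can let unrelated links through.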

In [2]:
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp # other science tools
# statsmodels -- "traditional statistical models"
# scikit-learn -- machine learning models
import seaborn as sns
#from plotnine import *

%matplotlib inline

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

import nltk
import nltk.corpus

requests_cache.install_cache("../mycache")

In [14]:
def get_article_list(url):
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    html.make_links_absolute(url)

    # Alternative (disabled): only <a> tags inside <h2> tags with class "headline".
    # links = html.xpath("//h2[contains(@class, 'headline')]/a")
    # links = [tag.attrib.get('href') for tag in links]

    # Collect every href on the page, then keep the unique ones containing 'article'.
    links = html.xpath("//a/@href")
    links = {link for link in links if 'article' in link}

    return list(links)

get_article_list("https://www.sfchronicle.com/business/")

['https://www.sfchronicle.com/news/world/article/Global-stocks-rise-on-optimism-over-US-China-13613104.php',
 'https://www.sfchronicle.com/business/article/Long-winding-road-led-to-get-touch-registration-12617780.php',
 'https://www.sfchronicle.com/news/article/Asian-stocks-slip-on-Wall-Street-leads-as-trade-13618693.php',
 'https://www.sfchronicle.com/bayarea/article/CleanPowerSF-tripling-households-served-with-13618155.php',
 'https://www.sfchronicle.com/business/article/Levi-Strauss-files-to-go-public-again-13613503.php',
 'https://www.sfchronicle.com/business/article/PG-E-says-it-s-still-trying-to-limit-power-13615287.php',
 'https://www.sfchronicle.com/business/article/California-solar-jobs-fall-for-second-year-13611311.php',
 'https://www.sfchronicle.com/realestate/article/Designer-Profile-Lowell-Strauss-of-Amalfi-West-13620501.php',
 'https://www.sfchronicle.com/business/article/Amazon-dumps-NYC-headquarters-and-its-promised-13617774.php',
 'https://www.sfchronicle.com/world/art

In [15]:
get_article_list("https://www.sfchronicle.com/elections/")

['https://www.sfchronicle.com/politics/article/Cannabis-reform-no-laughing-matter-for-Oakland-13552373.php',
 'https://www.sfchronicle.com/bayarea/article/SF-Mayor-Breed-attempts-to-free-up-some-Prop-C-13563063.php',
 'https://www.sfchronicle.com/bayarea/article/SF-Mayor-Breed-s-effort-to-free-brother-turned-13489538.php',
 'https://www.sfchronicle.com/politics/article/Progressive-majority-on-SF-Board-of-Supervisors-13525527.php',
 'https://www.sfchronicle.com/bayarea/article/Judson-True-named-to-new-position-with-goal-to-13465171.php',
 'https://www.sfchronicle.com/education/article/SF-mayor-appoints-her-own-adviser-to-school-board-13552306.php',
 'https://www.sfchronicle.com/politics/article/Calif-Gov-Newsom-bringing-wife-on-board-as-13527968.php',
 'https://www.sfchronicle.com/news/article/Newsom-names-Bay-Area-businessman-as-top-economic-13511487.php',
 'https://www.sfchronicle.com/politics/article/California-Republicans-have-a-big-problem-women-13418436.php',
 'https://www.sfchron

__Exercise 1.2.__ Write a function that extracts data from a Chronicle article. The function should:

* Have a parameter `url` for the URL of the article.

* Return a dictionary with keys for:
    + `url`: The URL of the article.
    + `title`: The title of the article.
    + `text`: The complete text of the article.
    + `author`: The author's name (if available) or a suitable missing value.
    + `date`: The date and time the article was published.
    + `date_updated`: The date and time the article was last updated (if available) or a suitable missing value.

For example, for [this article](https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php) your function should return a dictionary with the form:
```python
{'url': 'https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php',
 'title': '‘Gardenlust’ looks at best 21st century gardens in the world',
 'text': 'The book...',
 'author': 'Pam Peirce',
 'date': '2019-02-01T18:02:33+00:00',
 'date_updated': '2019-02-01T18:12:53+00:00'}
```
The value of the `text` field is omitted here to save space. Your function should return the full text in the `text` field.

Hints:

* Many parsing packages allow you to delete elements from an HTML document. Deleting elements is one way to avoid extracting unwanted tags.
* You can union multiple XPath paths with `|`.
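
The first hint can be sketched with the standard library alone. The example below removes unwanted elements from a small, well-formed snippet before collecting paragraph text. The markup and the helper name `text_without` are made up for illustration; `xml.etree.ElementTree` stands in for whichever parsing package you chose in class, and it does not support the `|` union from the second hint.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed snippet standing in for an article page (hypothetical markup).
SNIPPET = """
<div class="article-body">
  <p>First paragraph.</p>
  <aside><p>Related: an unrelated promo.</p></aside>
  <p>Second paragraph.</p>
</div>
"""

def text_without(root, unwanted_tag):
    """Delete every `unwanted_tag` element, then join the remaining <p> text."""
    # ElementTree elements do not track their parent, so visit each parent
    # and remove matching children from it.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == unwanted_tag:
                parent.remove(child)
    return " ".join(p.text.strip() for p in root.iter("p") if p.text)

root = ET.fromstring(SNIPPET)
body_text = text_without(root, "aside")
print(body_text)
```

Deleting the `<aside>` first means the later paragraph query needs no special cases for promotional blocks.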

In [98]:
def get_article_data(url):
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    
    try:
        title = html.xpath("//h1[contains(@class, 'header-title')] | //h1[contains(@class, 'headline entry-title')]")[0].text_content()
    except IndexError:
        title = None
    
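    text_blocks = html.xpath("//section[contains(@class, 'body')]/p | //div[contains(@class, 'article-body')]/p")
    # xpath() returns an empty list when nothing matches, so no IndexError is
    # raised here; check for emptiness to produce a missing value instead of ''.
    text = ' '.join(block.text_content() for block in text_blocks) if text_blocks else None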
    
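    try:
        author = html.xpath("//span[contains(@class, 'header-authors-name')] | //p[contains(@class, 'byline')]/a | //span[contains(@class, 'header-byline')]")[0].text_content()
        # Collapse whitespace, then drop a leading "By " prefix. Note that
        # str.lstrip('By ') strips any leading 'B', 'y', or ' ' characters, which
        # would mangle names such as "Bob", so slice the prefix off instead.
        author = ' '.join(author.split())
        if author.startswith('By '):
            author = author[len('By '):]
    except IndexError:
        author = None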
    
    try:
        date = html.xpath("//time[contains(@itemprop, 'datePublished')]")[0].attrib.get("datetime")
    except IndexError:
        date = None
    
    try:
        update = html.xpath("//time[contains(@itemprop, 'dateModified')]")[0].attrib.get("datetime")
    except IndexError:
        update = None
    
    data = {'url': url, 'title': title, 'text': text, 'author': author, 'date': date, 'date_updated': update}
    
    return data

# get_article_data("https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php")
# get_article_data("https://www.sfchronicle.com/bayarea/article/Owner-of-destroyed-Richard-Neutra-house-sues-San-13621201.php")
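
Byline cleanup is easy to get subtly wrong, because `str.lstrip` treats its argument as a set of characters rather than as a prefix. The sketch below shows the pitfall and one possible fix; the helper name `clean_author` and the sample byline "Bob Yates" are hypothetical.

```python
import re

def clean_author(raw):
    """Collapse whitespace and drop a leading 'By ' prefix; return None if empty."""
    if raw is None:
        return None
    name = " ".join(raw.split())
    # Remove the word "By" only when it is a separate leading word.
    name = re.sub(r"^By\s+", "", name, flags=re.IGNORECASE)
    return name or None

# lstrip removes any leading 'B', 'y', or ' ' characters, so it also eats
# the first letter of the name itself.
mangled = "By Bob Yates".lstrip("By ")
cleaned = clean_author("  By   Bob Yates ")
print(repr(mangled), repr(cleaned))
```

Returning `None` for an empty or missing byline keeps the missing-value convention consistent with the other fields.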

In [99]:
get_article_data("https://www.sfchronicle.com/politics/article/Kamala-Harris-prosecutor-past-could-be-a-2020-13524773.php")

{'url': 'https://www.sfchronicle.com/politics/article/Kamala-Harris-prosecutor-past-could-be-a-2020-13524773.php',
 'title': 'Kamala Harris in the Senate',
 'text': 'WASHINGTON — Sen. Kamala Harris’ communications director has a note in all capital letters stuck on her office computer: “Show the math.” It’s a quote from her boss — one the senator uses frequently. “There’s, I think, a running joke in the office about certain phrases I use all the time,” the California Democrat said. “This is how I would train lawyers, prosecutors about trial techniques: I’d say, ‘When you’re standing before the jury, in your closing argument ... show them the math.’ Instead of saying, ‘You must find 8,’ show them 2 plus 2 plus 2 plus 2.” The “Kamalism,” as one former staffer called it, is far from the only remnant of her prosecutor past that followed her to Washington. That background has helped define Harris’ time in the Senate — which will draw intensive scrutiny if, as expected, she declares her cand

In [100]:
get_article_data('https://www.sfchronicle.com/politics/article/SF-mayor-supes-differ-on-divvying-up-181-13443089.php')

{'url': 'https://www.sfchronicle.com/politics/article/SF-mayor-supes-differ-on-divvying-up-181-13443089.php',
 'title': 'SF mayor, supes differ on use of city’s $181 million windfall: Let wrangling begin',
 'text': 'The political wrangling around the $181 million the city unexpectedly received last week began Tuesday, as Mayor London Breed and a number of supervisors offered differing visions on how they want to spend the extra money. The proposals, which were introduced Tuesday at the Board of Supervisors meeting, concern the unexpected $415 million windfall the city received from excess revenue in a county education fund. While the City Charter mandates that more than half the money go toward budget reserves and certain city agencies, such as the Municipal Transportation Agency, the mayor and board have free rein over how to spend $181 million. Breed and the six supervisors who sponsored the competing proposal  — Aaron Peskin, Rafael Mandelman, Sandra Lee Fewer, Hillary Ronen, Norman

__Exercise 1.3.__ Use your functions from exercises 1.1 and 1.2 to get data frames of articles for the "Biz+Tech" category as well as two other categories of your choosing (except for "Vault: Archive", "Podcasts", and "In Depth").

Add a column to each that indicates the category, then combine them into one big data frame. Clean up the data, stripping excess whitespace and converting columns to appropriate dtypes.

The `text` column of this data frame will be your corpus for natural language processing in exercise 1.4.
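
One possible way to do the cleanup is to tidy each record before building the data frame. The sketch below strips excess whitespace and parses the ISO 8601 timestamps with the standard library. The helper name `clean_record` and the category label `"home"` are illustrative assumptions; converting the columns with pandas after building the data frame is an equally reasonable route.

```python
from datetime import datetime

def clean_record(record, category):
    """Return a copy of an article dict with tidy strings, parsed dates, and a category."""
    cleaned = dict(record, category=category)
    for key in ("title", "text", "author"):
        value = cleaned.get(key)
        # Collapse runs of whitespace; treat empty strings as missing.
        cleaned[key] = " ".join(value.split()) if value else None
    for key in ("date", "date_updated"):
        value = cleaned.get(key)
        # Timestamps look like '2019-02-01T18:02:33+00:00' (ISO 8601 with offset).
        cleaned[key] = datetime.fromisoformat(value) if value else None
    return cleaned

example = clean_record(
    {
        "url": "https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php",
        "title": "‘Gardenlust’ looks at best 21st century gardens in the world",
        "text": "The book...",
        "author": "  Pam   Peirce ",
        "date": "2019-02-01T18:02:33+00:00",
        "date_updated": None,
    },
    category="home",  # hypothetical category label
)
print(example["author"], example["date"], example["date_updated"])
```

A list of cleaned dictionaries can then be passed to `pd.DataFrame` as in the cells below.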

In [101]:
business = get_article_list("https://www.sfchronicle.com/business/")
local = get_article_list("https://www.sfchronicle.com/local/")
politics = get_article_list("https://www.sfchronicle.com/elections/")

In [102]:
business = pd.DataFrame([get_article_data(art) for art in business])
local = pd.DataFrame([get_article_data(art) for art in local])
politics = pd.DataFrame([get_article_data(art) for art in politics])

In [103]:
business['category'] = 'business'
local['category'] = 'local'
politics['category'] = 'politics'

df = pd.concat([business, local, politics], ignore_index = True)
df

| | author | date | date_updated | text | title | url | category |
|---|---|---|---|---|---|---|---|
| 0 | ANNABELLE LIANG, Associated Press | 2019-02-14T03:24:22+00:00 | 2019-02-14T03:28:32+00:00 | SINGAPORE (AP) — Asian stocks were mostly lowe... | Asian shares retreat as China, US begin trade ... | https://www.sfchronicle.com/news/world/article... | business |
| 1 | Carolyn Said | 2018-02-16T14:00:00+00:00 | 2018-02-18T00:18:33+00:00 | If anyone knew the ropes about Airbnb rentals,... | Long, winding road to SF’s get-tough registrat... | https://www.sfchronicle.com/business/article/L... | business |
| 2 | ANNABELLE LIANG, Associated Press | 2019-02-15T04:47:23+00:00 | 2019-02-15T04:50:21+00:00 | SINGAPORE (AP) — Asian shares were broadly low... | Asian stocks slip on Wall Street leads as trad... | https://www.sfchronicle.com/news/article/Asian... | business |
| 3 | Dominic Fracassa | 2019-02-15T00:55:14+00:00 | 2019-02-15T00:56:10+00:00 | More than 250,000 San Francisco homes and busi... | CleanPowerSF tripling households served with m... | https://www.sfchronicle.com/bayarea/article/Cl... | business |
| 4 | Shwanika Narayan | 2019-02-14T00:31:17+00:00 | 2019-02-14T00:32:46+00:00 | Levi Strauss & Co., the jean maker with a 165-... | Levi Strauss files to go public — again | https://www.sfchronicle.com/business/article/L... | business |
| 5 | J.D. Morris | 2019-02-14T04:59:31+00:00 | 2019-02-14T23:57:55+00:00 | While millions more Pacific Gas and Electric C... | PG&E says it’s still trying to limit power shu... | https://www.sfchronicle.com/business/article/P... | business |
| 6 | J.D. Morris | 2019-02-13T00:02:44+00:00 | 2019-02-13T00:03:56+00:00 | Jobs in California’s once-steadily growing sol... | California solar jobs fall for second year | https://www.sfchronicle.com/business/article/C... | business |
| 7 | Jordan Guinn | 2019-02-16T17:17:19+00:00 | 2019-02-16T17:18:29+00:00 | Despite his penchant for building multimillion... | Designer Profile: Lowell Strauss of Amalfi West | https://www.sfchronicle.com/realestate/article... | business |
| 8 | Joseph Pisani and Alexandra Olson | 2019-02-15T03:24:18+00:00 | 2019-02-15T03:25:36+00:00 | NEW YORK — Amazon abruptly dropped plans Thurs... | Amazon dumps NYC headquarters and its promised... | https://www.sfchronicle.com/business/article/A... | business |
| 9 | Edward Wong | 2019-02-11T22:01:48+00:00 | | WASHINGTON — The Trump administration is press... | Iraq rebuffs U.S. on cutting off energy purcha... | https://www.sfchronicle.com/world/article/Iraq... | business |


In [75]:
df.count()

author          239
date            233
date_updated    217
text            239
title           238
url             239
category        239
dtype: int64

In [105]:
df.loc[216]

author                        Trisha Thadani and Dominic Fracassa
date                                    2018-12-05T17:29:38+00:00
date_updated                            2018-12-05T17:30:52+00:00
text            The political wrangling around the $181 millio...
title           SF mayor, supes differ on use of city’s $181 m...
url             https://www.sfchronicle.com/politics/article/S...
category                                                 politics
Name: 216, dtype: object

In [107]:
df.loc[df['date'].isna()]

| | author | date | date_updated | text | title | url | category |
|---|---|---|---|---|---|---|---|
| 15 | Carolyn Said | | | Under the threat of huge penalties, Airbnb, Ho... | SF short-term rentals transformed as Airbnb, o... | https://www.sfchronicle.com/business/article/S... | business |
| 55 | Lizzie Johnson | | | California was on fire again. Through this pas... | A place called home | https://www.sfchronicle.com/california-wildfir... | local |
| 60 | Lizzie Johnson | | | The wind pounded against the window panes and ... | Escape from the fire | https://www.sfchronicle.com/california-wildfir... | local |
| 69 | Lizzie Johnson | | | In the early years of their marriage, moving w... | Survivors once again | https://www.sfchronicle.com/california-wildfir... | local |
| 75 | Lizzie Johnson | | | Melissa Geissinger measured grief by the numbe... | A new, unhappy normal | https://www.sfchronicle.com/california-wildfir... | local |
| 144 | John Wildermuth | | | SACRAMENTO — When Jerry Brown’s fourth term in... | What Jerry Brown has learned | https://www.sfchronicle.com/politics/article/W... | politics |


In [110]:
df.loc[15]['text']

'Under the threat of huge penalties, Airbnb, HomeAway, FlipKey and others have jettisoned hosts who ignored the city’s registration requirement for short-term rentals. That’s dramatically revamped the universe of listings, erasing more than half, tilting the market even more toward Airbnb, easing enforcement of local laws, and returning some rental units to a city that desperately needs them. The Chronicle asked Host Compliance, a San Francisco company that helps cities monitor vacation rentals, to capture snapshots of Airbnb, HomeAway and FlipKey in late August, just before a legal agreement required Airbnb and HomeAway to start telling hosts to register or get kicked off. Host Compliance then extracted data from the three sites on Jan. 19, just after a deadline for all Airbnb and HomeAway hosts to register. (FlipKey did not have the same deadline but is subject to the same requirements.) “The regulations had a massive impact on the number of rentals in the city, with an overall 55 pe

__Exercise 1.4.__  What topics has the Chronicle covered recently? How does the category affect the topics? Support your analysis with visualizations.

Hints:

*   The [nltk book](http://www.nltk.org/book/) may be helpful here.

*   This question will be easier to do after we've finished NLP in class.
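
As a starting point for comparing topics across categories, the sketch below counts non-stopword tokens per category using only the standard library. It is a simplification, not the method taught in class: the tiny stopword set, the regex tokenizer, and the helper name `word_counts_by_category` are assumptions, and a handful of article titles stand in for the full `text` corpus.

```python
import re
from collections import Counter

# Deliberately tiny stopword set (assumption); a real analysis would use a fuller list.
STOPWORDS = {"a", "the", "for", "to", "from", "has", "what", "of", "and", "in"}

# (category, text) pairs; article titles stand in for full article text here.
DOCS = [
    ("business", "California solar jobs fall for second year"),
    ("business", "Designer Profile: Lowell Strauss of Amalfi West"),
    ("local", "Escape from the fire"),
    ("local", "A place called home"),
    ("politics", "What Jerry Brown has learned"),
]

def word_counts_by_category(docs):
    """Return {category: Counter of lowercase tokens, stopwords excluded}."""
    counts = {}
    for category, text in docs:
        tokens = re.findall(r"[a-z']+", text.lower())
        counter = counts.setdefault(category, Counter())
        counter.update(t for t in tokens if t not in STOPWORDS)
    return counts

counts = word_counts_by_category(DOCS)
for category, counter in counts.items():
    print(category, counter.most_common(3))
```

The per-category counters can feed a bar chart of the most common words, which is one simple way to support the comparison visually.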