# STA 141B Assignment 4

Due __Feb 22, 2019__ by 11:59pm. Submit by editing this file, committing the changes with git, and then pushing to your private GitHub repo for the assignment. This assignment will be graded according to the class rubric.

Please do not rename this file or delete the exercise cells, because it will interfere with our grading tools. Put your answers in new cells after each exercise. You can make as many new cells as you like. Use code cells for code and Markdown cells for text. Answer all questions with complete sentences.

The purpose of this assignment is to practice scraping data from web pages.

## The San Francisco Chronicle

In this assignment, you'll scrape text from [The San Francisco Chronicle](https://www.sfchronicle.com/) newspaper and then analyze the text.

The Chronicle is organized by category into article lists. For example, there's a [Local](https://www.sfchronicle.com/local/) list, [Sports](https://www.sfchronicle.com/sports/) list, and [Food](https://www.sfchronicle.com/food/) list.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Chronicle for analysis in exercise 1.4.

__Exercise 1.1.__ Write a function that extracts all of the links to articles in a Chronicle article list. The function should:

* Have a parameter `url` for the URL of the article list.

* Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

* Be polite and save time by setting up [requests_cache](https://pypi.python.org/pypi/requests-cache) before you write your function.

* You can use any of the XML/HTML parsing packages mentioned in class. Choose one and use it throughout the entire assignment.

In [2]:
# Our usual data science tools
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp # other science tools
# statsmodels -- "traditional statistical models"
# scikit-learn -- machine learning models
import seaborn as sns
#from plotnine import *

%matplotlib inline

# Web scraping tools
import lxml.html as lx
import requests
import requests_cache

requests_cache.install_cache("../mycache")

In [15]:
response = requests.get("https://www.sfchronicle.com/sports/")
response.raise_for_status()
html = lx.fromstring(response.text)
#html.make_links_absolute("https://www.sfchronicle.com/sports/")

html.text_content()

'\n    \n    \n        \n                \n        \n        var HDN = HDN || {}; HDN.t_firstbyte = Number(new Date());\n        \n        \n\n                \n\n        \n\n    // <![CDATA[\n\n        var HDN = HDN || {};\n        HDN.dataLayer = HDN.dataLayer || {};\n\n        // HDN.dataLayer object for content and href data\n        HDN.dataLayer.content = HDN.dataLayer.content || {};\n        HDN.dataLayer.href = HDN.dataLayer.href || {};\n\n    \n                    HDN.dataLayer.content.title = "Sports";\n                HDN.dataLayer.content.subtitle = "";\n        HDN.dataLayer.content.objectId = "8476";\n        \n        HDN.dataLayer.content.objectType = "channel";\n        HDN.dataLayer.content.sectionPath = [\'sports\'];\n        HDN.dataLayer.content.pubDate = "2013-01-22 23:12:00";\n        HDN.dataLayer.content.lastModifiedDate = "2019-01-09 14:29:36";\n        HDN.dataLayer.content.wordCount = 0;\n        HDN.dataLayer.content.keywords = [];\n        HDN.dataLayer.co

In [26]:
# Get all <h2> tags with class "headline"
links = html.xpath("//h2[contains(@class, 'headline')]")
[l.text_content() for l in links]

['\n            Steve Kerr ejection highlights Warriors’ late collapse in loss to Portland\n        ',
 '\n            Bay Area sports calendar, Feb. 14-15\n        ',
 '\n            A’s potential Opening Day starter Mike Fiers already mentor to Jesus Luzardo\n        ',
 '\n            Kareem Hunt back in uniform? NFL is betting you don’t care\n        ',
 '\n            Tiger Woods, Phil Mickelson not ready to relinquish stage to young players\n        ',
 '\n            49ers release high-priced backup OT Gilliam\n        ',
 '\n            Warriors’ Draymond Green not too upset about missing...\n        ',
 '\n            Report: 49ers to part ways with Pierre Garcon\n        ',
 '\n            Raiders, Coliseum Authority discussing playing 2019 season in...\n        ',
 '\n            A’s add Robbie Grossman to already strong outfield mix\n        ',
 '\n            Giants emerge as odds-on favorites to sign Bryce Harper\n        ',
 '\n            Ducks fire head coach Randy Car

In [None]:
def get_article_list(url):
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    html.make_links_absolute(url)

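    # Get the href of the <a> tag inside each <h2> tag with class "headline".
    # Assumption: each headline <h2> wraps the article link; adjust if the markup differs.
    links = html.xpath("//h2[contains(@class, 'headline')]//a/@href")

    # Return plain Python strings rather than lxml's string subclass.
    return [str(link) for link in links]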

# Test on one category; repeat with other category URLs to check the function.
links = get_article_list("https://www.sfchronicle.com/sports/")
links
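
The link extraction asked for in exercise 1.1 can be checked offline on a small HTML snippet before running it against the live site. The following is a minimal sketch, not the expected solution: the function name `extract_article_links`, the sample markup, and the rule that article URLs contain `/article/` are all assumptions made for illustration (the last one is based only on the example article URL given in exercise 1.2).

```python
import lxml.html as lx


def extract_article_links(html_text, base_url):
    """Return unique absolute article URLs found in headline links."""
    doc = lx.fromstring(html_text)
    # Resolve relative hrefs such as "/sports/article/..." against the list URL.
    doc.make_links_absolute(base_url)
    # Assumption: each headline is an <h2> with "headline" in its class wrapping an <a>.
    hrefs = doc.xpath("//h2[contains(@class, 'headline')]//a/@href")
    links = []
    for href in hrefs:
        href = str(href)
        # Assumption: article pages have "/article/" in their path.
        if "/article/" in href and href not in links:
            links.append(href)
    return links


# Made-up markup standing in for a category page.
sample = """
<html><body>
  <h2 class="headline"><a href="/sports/article/Example-one-1.php">One</a></h2>
  <h2 class="headline"><a href="/sports/article/Example-one-1.php">One again</a></h2>
  <h2 class="headline"><a href="https://www.sfchronicle.com/sports/article/Example-two-2.php">Two</a></h2>
  <h2 class="other"><a href="/sports/">Section link</a></h2>
</body></html>
"""

links = extract_article_links(sample, "https://www.sfchronicle.com/sports/")
print(links)
```

Separating parsing from downloading keeps the parsing logic testable without network access; a wrapper that calls `requests.get(url)` and passes `response.text` to this function would cover the live case.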

__Exercise 1.2.__ Write a function that extracts data from a Chronicle article. The function should:

* Have a parameter `url` for the URL of the article.

* Return a dictionary with keys for:
    + `url`: The URL of the article.
    + `title`: The title of the article.
    + `text`: The complete text of the article.
    + `author`: The author's name (if available) or a suitable missing value.
    + `date`: The date and time the article was published.
    + `date_updated`: The date and time the article was last updated (if available) or a suitable missing value.

For example, for [this article](https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php) your function should return a dictionary with the form:
```python
{'url': 'https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php',
 'title': '‘Gardenlust’ looks at best 21st century gardens in the world',
 'text': 'The book...',
 'author': 'Pam Peirce',
 'date': '2019-02-01T18:02:33+00:00',
 'date_updated': '2019-02-01T18:12:53+00:00'}
```
The value of the `text` field is omitted here to save space. Your function should return the full text in the `text` field.

Hints:

* Many parsing packages allow you to delete elements from an HTML document. Deleting elements is one way to avoid extracting unwanted tags.
* You can union multiple XPath paths with `|`.
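
Both hints can be illustrated with `lxml` on a small snippet. This is a sketch under assumptions: the markup below is made up, and the real article pages may use different tags and class names. It deletes unwanted elements with `drop_tree()` and then unions two XPath paths with `|`.

```python
import lxml.html as lx

# Made-up markup standing in for an article body.
sample = """
<html><body>
  <section class="body">
    <p>First paragraph.</p>
    <div class="ad">Advertisement</div>
    <p>Second paragraph.<script>var x = 1;</script></p>
    <blockquote>A quoted line.</blockquote>
  </section>
</body></html>
"""
doc = lx.fromstring(sample)

# Delete unwanted elements first. drop_tree() removes the element and its
# children but keeps any tail text that follows it.
for unwanted in doc.xpath("//script | //div[@class='ad']"):
    unwanted.drop_tree()

# Union two paths with | to collect paragraphs and block quotes together.
# lxml returns the combined node set in document order.
blocks = doc.xpath(
    "//section[contains(@class, 'body')]/p"
    " | //section[contains(@class, 'body')]/blockquote"
)
text = " ".join(block.text_content().strip() for block in blocks)
print(text)
```

Without the deletion step, `text_content()` on the second paragraph would also include the script source.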

In [47]:
url = "https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php"
response = requests.get(url)
response.raise_for_status()
html = lx.fromstring(response.text)

data = {'url': url}
title = html.xpath("//h1[contains(@class, 'header-title')]")[0].text_content()
data['title'] = title

text_blocks = html.xpath("//section[contains(@class, 'body')]/p")
text = ' '.join([block.text_content() for block in text_blocks])
data['text'] = text

header = html.xpath("//div[contains(@class, 'header-author-time h5')]")[0]
# Use ".//" so each search is relative to header; a bare "//" searches the whole document.
author = header.xpath(".//span[contains(@class, 'header-authors-name')]")[0].text_content()
data['author'] = author

date = header.xpath(".//time[contains(@itemprop, 'datePublished')]")[0].attrib.get("datetime")
update = header.xpath(".//time[contains(@itemprop, 'dateModified')]")[0].attrib.get("datetime")
data['date'] = date
data['date_updated'] = update

data

{'url': 'https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php',
 'title': '‘Gardenlust’ looks at best 21st century gardens in the world',
 'text': 'The book “Gardenlust: A Botanical Tour of the World’s Best New Gardens” by Christopher Woods (Timber Press, 416 pages, $40) shows you gardens from around the globe, including some surprising places such as Oman, Singapore and the island of Nevis in the West Indies, all of them created in the past 20 years. Fifty gardens or designers are featured. The sites are all engaging; some are breathtaking. Woods, the author and major photographer of the book, began his career as a gardener at Kew Gardens in London. He then worked at a number of public gardens in England and the U.S., including 20 years as director and chief designer of Chanticleer Arboretum in Pennsylvania. (I have been to Chanticleer and have seen the Ruins Garden he created there from the stone and slate ruins of the home of the origi

In [32]:
def get_article_data(url):
    response = requests.get(url)
    response.raise_for_status()
    html = lx.fromstring(response.text)
    
    data = {'url': url}
    title = html.xpath("//h1[contains(@class, 'header-title')]")[0].text_content()
    data['title'] = title
    return data

get_article_data("https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php")

{'url': 'https://www.sfchronicle.com/homeandgarden/article/Gardenlust-looks-at-best-21st-century-13580871.php',
 'title': '‘Gardenlust’ looks at best 21st century gardens in the world'}
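
The exercise asks for a suitable missing value when the author or update time is unavailable, but indexing an empty XPath result with `[0]` raises an `IndexError`. One way to handle this is a small helper that returns `None` when nothing matches. This is a sketch only: `first_or_none` and `parse_article` are hypothetical names, the class names are copied from the exploratory cell above, and the sample markup is made up. The `text` field is left out to keep the example short.

```python
import lxml.html as lx


def first_or_none(element, path):
    """Return the first XPath match as stripped text, or None if there is none."""
    results = element.xpath(path)
    if not results:
        return None
    first = results[0]
    # Attribute queries return strings; element queries return elements.
    if isinstance(first, str):
        return first.strip()
    return first.text_content().strip()


def parse_article(html_text, url):
    """Build the article dictionary, using None for fields that are missing."""
    doc = lx.fromstring(html_text)
    return {
        "url": url,
        "title": first_or_none(doc, "//h1[contains(@class, 'header-title')]"),
        "author": first_or_none(doc, "//span[contains(@class, 'header-authors-name')]"),
        "date": first_or_none(doc, "//time[@itemprop='datePublished']/@datetime"),
        "date_updated": first_or_none(doc, "//time[@itemprop='dateModified']/@datetime"),
    }


# Made-up page with no author and no update time.
sample = """
<html><body>
  <h1 class="header-title"> Placeholder headline </h1>
  <time itemprop="datePublished" datetime="2019-02-01T18:02:33+00:00">Feb. 1</time>
</body></html>
"""

record = parse_article(sample, "https://example.com/article/placeholder.php")
print(record)
```

Using `None` lets pandas convert the missing entries to `NaN` or `NaT` later when the dictionaries are loaded into a data frame.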

__Exercise 1.3.__ Use your functions from exercises 1.1 and 1.2 to get data frames of articles for the "Biz+Tech" category as well as two other categories of your choosing (except for "Vault: Archive", "Podcasts", and "In Depth").

Add a column to each that indicates the category, then combine them into one big data frame. Clean up the data, stripping excess whitespace and converting columns to appropriate dtypes.

The `text` column of this data frame will be your corpus for natural language processing in exercise 1.4.
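
The combine-and-clean step might look like the sketch below. The records are placeholders standing in for the dictionaries returned by the exercise 1.2 function, and the category names and variable names are assumptions; only the pandas calls are the point.

```python
import pandas as pd

# Placeholder records; real ones would come from the scraping functions.
biz = [
    {"url": "https://example.com/article/a.php", "title": "  First title \n",
     "text": " Body one. ", "author": "A. Writer",
     "date": "2019-02-01T18:02:33+00:00", "date_updated": None},
    {"url": "https://example.com/article/b.php", "title": "Second title",
     "text": "Body two.", "author": None,
     "date": "2019-02-02T09:00:00+00:00", "date_updated": "2019-02-02T10:30:00+00:00"},
]
food = [
    {"url": "https://example.com/article/c.php", "title": "\n Third title ",
     "text": "Body three.\n", "author": "B. Writer",
     "date": "2019-02-03T12:00:00+00:00", "date_updated": "2019-02-03T12:05:00+00:00"},
]

# One data frame per category, each tagged with its category name.
frames = []
for category, records in {"biz": biz, "food": food}.items():
    frame = pd.DataFrame(records)
    frame["category"] = category
    frames.append(frame)

# Combine into one big data frame with a fresh index.
articles = pd.concat(frames, ignore_index=True)

# Strip excess whitespace from the string columns.
for col in ["title", "text", "author"]:
    articles[col] = articles[col].str.strip()

# Convert timestamps to timezone-aware datetimes; missing values become NaT.
for col in ["date", "date_updated"]:
    articles[col] = pd.to_datetime(articles[col], utc=True)

articles["category"] = articles["category"].astype("category")
print(articles.dtypes)
```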

__Exercise 1.4.__ What topics has the Chronicle covered recently? How does the category affect the topics? Support your analysis with visualizations.

Hints:

*   The [nltk book](http://www.nltk.org/book/) may be helpful here.

*   This question will be easier to do after we've finished NLP in class.
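
As a starting point before the NLP material is covered, word frequencies per category can be computed with the standard library alone. This is a rough sketch, not the analysis the exercise expects: the texts are made-up placeholders, the stopword list is a tiny hand-picked set, and `top_words` is a hypothetical helper. The nltk tokenizers and stopword lists mentioned in the hint would be a more thorough choice.

```python
import re
from collections import Counter

# Tiny hand-picked stopword set, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "at"}


def top_words(texts, n=3):
    """Return the n most common non-stopword tokens across the given texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)


# Made-up texts standing in for the `text` column grouped by category.
corpus = {
    "sports": ["The Warriors lost to Portland.", "Warriors coach ejected in loss."],
    "food": ["A new bakery opens in the Mission.", "The bakery sells bread."],
}

for category, texts in corpus.items():
    print(category, top_words(texts, n=1))
```

The resulting counts could be plotted as one bar chart per category to compare which terms dominate each section.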