# Scraping newspapers

So far we've used APIs to access data on the internet. But sometimes we want to access data that doesn't have an API. How do we work around this?

We can _scrape_ sites instead. This means writing a program that visits the webpage much as you normally would, then extracts the information you want and returns it to you directly.

For example, what if you wanted to grab all recent New York Times articles to analyze how much they mention "Trump" vs "climate"? You could visit their page, click through all the links, search for those terms, and count them up. But that would be a ton of work.

Or we could spend a bit of time and write a program to do that for us. Which is what we're going to do here.

### The `newspaper` library

We're going to use a Python library called `newspaper`. This provides us a way to very easily download articles from almost any news site without having to mess with the low-level guts of writing our own scrapers.


## Scraping a single article

Let's see how we can download a single New York Times article using `newspaper`:

In [1]:
import requests_cache
requests_cache.install_cache('/tmp/newspaper.cache', backend='sqlite', expire_after=60*60*24*2) # expire after two days

In [2]:
import newspaper

url = 'https://www.nytimes.com/2017/09/20/opinion/duterte-philippines.html'

# use newspaper's Article class
# and save the resulting data into a variable named `article`
article = newspaper.Article(url)
article.download()
article.parse()

Now we have access to all of the article's data:

```python
article.title
article.authors
article.top_image
article.text
article.publish_date
article.meta_keywords
```

If we want to access even more data, we can have `newspaper` analyze the article for us by calling `article.nlp()`, which fills in extra fields such as `article.summary` and `article.keywords`:

In [None]:
article.nlp()
article.summary

We want to get the counts of specific words in the article. You can refer to the Python Tips notebook for how to do this, but we'll also do it right here. Let's define a function so we can re-use this later:

In [None]:
def split_into_words(article):
    # First, lowercase the article text and replace common punctuation with spaces,
    # because we don't want "Hello!" to be treated differently from "hello".
    text = article.text.lower()
    text = text.replace('.', ' ')
    text = text.replace('!', ' ')
    text = text.replace('?', ' ')
    text = text.replace("'", ' ')
    text = text.replace("’", ' ') # curly apostrophe, a different character from the straight one above
    text = text.replace('"', ' ')
    text = text.replace(',', ' ')

    # Now we can split it into words
    return text.split()

And let's try using it on the article we just downloaded:

In [None]:
words = split_into_words(article)
print(words)
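
To see what the function does without downloading anything, here is a minimal sketch that runs it on a made-up sentence. The `fake_article` object is a hypothetical stand-in (anything with a `.text` attribute works), and the function is restated in a compact loop form only so the block runs on its own; it is meant to behave the same as the version above.

```python
from types import SimpleNamespace

def split_into_words(article):
    # Same idea as above: lowercase, then turn common punctuation into spaces
    text = article.text.lower()
    for mark in ['.', '!', '?', "'", '’', '"', ',']:
        text = text.replace(mark, ' ')
    return text.split()

# Hypothetical stand-in for a downloaded article: just an object with `.text`
fake_article = SimpleNamespace(text='Hello! Is this "hello", or isn’t it?')
sample_words = split_into_words(fake_article)
print(sample_words)
# ['hello', 'is', 'this', 'hello', 'or', 'isn', 't', 'it']
```

Notice that replacing apostrophes with spaces splits contractions such as "isn’t" into two pieces. That is fine for counting words like "trump" or "climate", but worth keeping in mind if you count other words.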

In [None]:
# Now, to count the words, we'll use the Counter class
from collections import Counter
counts = Counter(words)
print(counts)

To get the count of an individual word, we can just do:

In [None]:
counts['drugs']
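
As a small illustration of how `Counter` behaves, here is a sketch on a made-up word list (the `sample_words` list is invented for this example, not taken from the article). It shows that looking up a word that never appears gives `0` rather than an error, and that `most_common` returns the most frequent words first.

```python
from collections import Counter

# Made-up word list, standing in for the output of split_into_words
sample_words = ['the', 'war', 'on', 'drugs', 'and', 'the', 'drugs', 'trade']
sample_counts = Counter(sample_words)

print(sample_counts['drugs'])        # 2
print(sample_counts['climate'])      # 0, words that never appear count as zero
print(sample_counts.most_common(2))  # [('the', 2), ('drugs', 2)]
```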

---

## Scraping a news site

Ok, so that's how we get one article. But we want to get as many as we can.

Fortunately, `newspaper` provides a way to do that too!

In [3]:
# This will grab as many NYT articles as it can from the main page
nyt = newspaper.build('https://nytimes.com/', memoize_articles=False)
len(nyt.articles)

357

The articles are available in `nyt.articles`; we can loop over them:

In [None]:
for article in nyt.articles:
    print(article.url)

This got all the article URLs, but it has not yet downloaded the text. You can see for yourself:

In [None]:
nyt.articles[0].text

So we need to download and parse each article like we did when we scraped the single article.

We're dealing with the messiness of the internet here though, so some articles won't download properly. We only want to keep those that do.

So what we'll do below is loop over each article and _try_ to download it, and if it doesn't work, we'll skip. We'll keep the articles that were successful in a list called `ok_articles`.

When you run the code below, you may see lines saying "You must `download()` an article before calling `parse()` on it!". You can safely ignore these. This is `newspaper` warning us that an article wasn't downloaded correctly, and those are the articles we're not keeping.

We'll write this as a function so we can re-use it later:

In [7]:
def download_articles(articles):
    ok_articles = []
    for article in articles:
        try:
            article.download()
            article.parse()
            ok_articles.append(article)
        except newspaper.ArticleException:
            pass
    return ok_articles
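
If you'd like to check the skip-on-failure logic without touching the network, here is a rough sketch of the same pattern using pretend articles. Everything in it (`FakeArticle`, `FakeDownloadError`, `keep_successful`) is a hypothetical stand-in invented for this example, not part of `newspaper`.

```python
class FakeDownloadError(Exception):
    """Hypothetical stand-in for newspaper.ArticleException."""

class FakeArticle:
    """Hypothetical stand-in for an article; no network involved."""
    def __init__(self, url, fails=False):
        self.url = url
        self.fails = fails
        self.text = ''

    def download(self):
        # Pretend some downloads go wrong
        if self.fails:
            raise FakeDownloadError(self.url)

    def parse(self):
        self.text = 'text of ' + self.url

def keep_successful(articles):
    kept = []
    for article in articles:
        try:
            article.download()
            article.parse()
            kept.append(article)
        except FakeDownloadError:
            pass  # this one failed, so skip it and move on
    return kept

fake_articles = [FakeArticle('a'), FakeArticle('b', fails=True), FakeArticle('c')]
kept = keep_successful(fake_articles)
print([article.url for article in kept])
# ['a', 'c']
```

The failing article raises inside the `try`, so it never reaches `append`, and the loop carries on with the next one.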

Now let's use it on the NYT articles:

In [None]:
ok_articles = download_articles(nyt.articles)

Now the articles in `ok_articles` should have text:

In [None]:
ok_articles[0].text

## Challenge #1

Now see if you can get the counts for the words "trump" and "climate" across all of these articles.

Hints:

- We've already defined a function that will take an article and give us back its words, `split_into_words`. Take advantage of that.
- Remember, when we're working with collections of data, e.g. a list of articles, we want to use `for` loops.
- Also notice the pattern we've used a few times in this class, of looping over a list and collecting data into another list.
    - If you want to dump an entire list (let's say it's called `a`) into another list (let's say it's called `b`), you can use `extend`. For example (you can try this out in a block below):
    
```python
a = [0,1,2]
b = [3,4,5]
a.extend(b)
print(a)
```

## Challenge #2

Grab the articles from another news site using `newspaper` and get the counts for 'trump' and 'climate' on that page. How do those mentions compare to the NYT?