# Scraping News Headlines

This notebook shows how to scrape news headlines from a single news source over a longer period of time. As an example we'll use the website of the NOS (Nederlandse Omroep Stichting / Dutch Broadcasting Foundation), a Dutch state-funded news organisation similar to the BBC or ORF.

In [None]:
import requests as req

from lxml import html

import pandas as pd

from ipywidgets import IntProgress
from IPython.display import display

from time import sleep
from datetime import *

import glob

We need to define the date range for the archive: the date where scraping starts and the date where it stops.

In [None]:
# ISO dates (YYYY-MM-DD) avoid any day/month ambiguity when pandas parses them
archive_start_date = '2017-01-01'
archive_end_date = '2020-01-01'


Here we generate the date range using pandas' `date_range` function and put it into a list of properly formatted date strings.

In [None]:
date_range = [d.strftime('%Y-%m-%d') for d in pd.date_range(start=archive_start_date,end=archive_end_date)]
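
To see what this produces, here is a minimal sketch on a three-day range. The shorter dates and the name `sample_range` are chosen for illustration only. Note that `pd.date_range` includes both the start and the end date.

```python
import pandas as pd

# Illustrative three-day range; the notebook itself covers 2017-01-01 to 2020-01-01.
sample_range = [
    d.strftime('%Y-%m-%d')
    for d in pd.date_range(start='2017-01-01', end='2017-01-03')
]

print(sample_range)  # ['2017-01-01', '2017-01-02', '2017-01-03']
```

Because both endpoints are included, the full range from 1 January 2017 to 1 January 2020 yields 1,096 dates, and therefore 1,096 archive pages to download.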

We then use the list of dates to generate a list of URLs that point to the archive pages, one page per day.

In [None]:
date_urls = [f"https://www.nos.nl/nieuws/archief/{date}" for date in date_range]

We put the list of URLs in a pandas DataFrame so we can manipulate it more easily and save it for future usage.

In [None]:
seed_df = pd.DataFrame({'urls':date_urls})

We then save the DataFrame to a `csv` file, a text-based format that spreadsheet programs such as Google Sheets and Excel can read.

In [None]:
seed_df.to_csv('urls.csv',sep=';')
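
When the file is read back later, the same separator has to be passed, and the first column holds the row index that `to_csv` writes by default. The following is a small sketch of that round trip. It uses an in-memory buffer instead of `urls.csv`, and `demo_df` is a hypothetical two-row stand-in for `seed_df`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for seed_df.
demo_df = pd.DataFrame({'urls': [
    'https://www.nos.nl/nieuws/archief/2017-01-01',
    'https://www.nos.nl/nieuws/archief/2017-01-02',
]})

buffer = io.StringIO()           # in-memory stand-in for 'urls.csv'
demo_df.to_csv(buffer, sep=';')  # same separator as in the notebook
buffer.seek(0)                   # rewind so the buffer can be read again

# index_col=0 turns the unnamed first column back into the row index.
restored = pd.read_csv(buffer, sep=';', index_col=0)
print(restored)
```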

We can get an individual value by selecting a column and a row number. Splitting it on `'/'` gives a list with the individual elements of the URL.

In [None]:
seed_df['urls'][0].split("/")

Again, we can access different parts of the list by index.

In [None]:
seed_df['urls'][0].split("/")[2]
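
As a sketch of what the split returns for one archive URL, and of a standard-library alternative that names the components instead of relying on list positions (`urlsplit` is not used elsewhere in this notebook; it is shown here only for comparison):

```python
from urllib.parse import urlsplit

example_url = 'https://www.nos.nl/nieuws/archief/2017-01-01'

parts = example_url.split('/')
# ['https:', '', 'www.nos.nl', 'nieuws', 'archief', '2017-01-01']

host = parts[2]           # 'www.nos.nl'
archive_date = parts[-1]  # '2017-01-01'

# Alternative: urlsplit gives the same pieces under explicit names.
split_url = urlsplit(example_url)
print(split_url.netloc)  # www.nos.nl
print(split_url.path)    # /nieuws/archief/2017-01-01
```

The empty string at index 1 comes from the two consecutive slashes in `https://`, which is why the host name sits at index 2.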

In [None]:
url = seed_df['urls'][0]

We then use Requests to get the HTML page from the archive.

In [None]:
r = req.get(url)

We then build a tree that we can query in order to get information in a structured manner from the document.

In [None]:
tree = html.fromstring(r.content)

We then use the XPath expressions that we got from the web inspector to extract the titles (the headlines) and their timestamps.

In [None]:
titles = tree.xpath('//*[@id="archief"]/ul/li/a/div[2]/text()')

In [None]:
timestamps = tree.xpath('//*[@id="archief"]/ul/li/a/div[1]/time/@datetime')
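
To make the XPath expressions easier to follow, here is a sketch that runs them against a small inline HTML snippet. The markup is hypothetical: it was written only to match the expressions used in this notebook, and the real archive page may be structured differently.

```python
from lxml import html

# Hypothetical markup that mirrors the XPath expressions used in this notebook.
sample_page = """
<html><body>
<div id="archief">
  <ul>
    <li><a href="/artikel/1-example-one">
      <div><time datetime="2017-01-01T08:00:00+0100">08:00</time></div>
      <div>Example headline one</div>
    </a></li>
    <li><a href="/artikel/2-example-two">
      <div><time datetime="2017-01-01T09:30:00+0100">09:30</time></div>
      <div>Example headline two</div>
    </a></li>
  </ul>
</div>
</body></html>
"""

sample_tree = html.fromstring(sample_page)

# div[1] holds the <time> element, div[2] holds the headline text.
sample_titles = sample_tree.xpath('//*[@id="archief"]/ul/li/a/div[2]/text()')
sample_timestamps = sample_tree.xpath('//*[@id="archief"]/ul/li/a/div[1]/time/@datetime')
sample_links = sample_tree.xpath('//*[@id="archief"]/ul/li/a/@href')

for stamp, title, link in zip(sample_timestamps, sample_titles, sample_links):
    print(stamp, title, link)
```

Each expression returns a plain Python list of strings, so the three lists line up by position as long as every list item on the page contains all three pieces.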

We then go over the list of URLs, download the HTML pages and extract the headlines, timestamps and article URLs. Finally we put those in a DataFrame, one per day, and save it to disk for later use.

In [None]:
count = 0
max_count = len(date_urls) 

f = IntProgress(min=0, max=max_count, layout={'width':'auto'}) # instantiate the progress bar
display(f) # display the bar

while count < max_count:

    cur_date = seed_df['urls'][count].split("/")[-1] # current date of the archive-url
    cur_url = seed_df['urls'][count] # current archive url

    r = req.get(cur_url) #get the html

#     with open(f'cache/nos.nl/{c}/{c}_{cur_date}.html', mode='wb') as localfile:
#         localfile.write(r.content) # write retrieved html to cache

    tree = html.fromstring(r.content)

    urls = tree.xpath('//*[@id="archief"]/ul/li/a/@href') # retrieve urls
    timestamps = tree.xpath('//*[@id="archief"]/ul/li/a/div[1]/time/@datetime') # retrieve timestamps as non-UTC Strings
    titles = tree.xpath('//*[@id="archief"]/ul/li/a/div[2]/text()') # retrieve article titles

    urls = [f"https://www.nos.nl{u}" for u in urls] # create a list of article urls
    df = pd.DataFrame({'timestamp':timestamps,'title':titles,'url':urls})
    df.to_csv(f'../data/demo/nos.nl_{cur_date}.csv') # one csv per day; the ../data/demo directory must already exist

    count += 1    

    f.value = count # signal to increment the progress bar
    f.description = f'[{count}/{max_count}]'

    sleep(0.1)
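
The loop above assumes that every request succeeds. Over more than a thousand downloads a timeout or a server error is likely at some point, so a more defensive fetch step may be worth considering. The sketch below is one possible approach, not part of the original notebook: `fetch_with_retry` and its parameters are hypothetical names, and `get` is meant to receive `req.get` (any callable that accepts a `timeout` keyword and returns an object with `raise_for_status()` and `content` will do).

```python
import time


def fetch_with_retry(get, url, retries=3, delay=1.0, timeout=10):
    """Return the response body for `url`, retrying on failure.

    `get` is assumed to behave like `requests.get`.
    """
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # turn HTTP error codes into exceptions
            return response.content
        except Exception as error:  # with Requests, requests.RequestException is narrower
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise RuntimeError(f'giving up on {url} after {retries} attempts') from last_error
```

Inside the loop, `r = req.get(cur_url)` followed by `r.content` could then be replaced with `content = fetch_with_retry(req.get, cur_url)`.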

In [None]:
for c in cat:

