# Download articles from The Guardian (UK) using its API

Author: Jorge de Leon 

This script downloads news articles from The Guardian newspaper (https://www.theguardian.com/us) that match your search parameters.

The results are saved as JSON files, one file per day.

### Import packages

In [1]:
import os
import glob
import json
import requests
import pandas as pd 

from os import makedirs
from os.path import join, exists
from datetime import date, timedelta

### Create a directory for the downloaded JSON files

In [2]:
ARTICLES_DIR = join('theguardian', 'articles','microsoft')
makedirs(ARTICLES_DIR, exist_ok=True)

### Select parameters and provide API key

In [3]:
# Set the start and end dates (both inclusive).
# These could later be defined once, globally, for all related pieces of code.

START_DATE_GLOBAL = date(2020, 3, 1)
END_DATE_GLOBAL = date(2020, 4, 12)

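# Enter the API key and parameters. Suitable parameter values can be found by experimenting with the Guardian API explorer:
# https://open-platform.theguardian.com/explore/

# Read the key with a context manager so the file is closed; join() keeps the path portable
with open(join('..', 'input files', 'creds_guardian.txt')) as key_file:
    MY_API_KEY = key_file.read().strip()
API_ENDPOINT = 'https://content.guardianapis.com/search'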
my_params = {
    'from-date': "",
    'to-date': "",
    'show-fields': 'all',
    'order-by': "newest",
    'page-size': 200,
    'q': '"Microsoft Corporation"',  # exact-phrase query; requests URL-encodes the value itself
    'api-key': MY_API_KEY
}
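
One detail worth checking in the `q` parameter: `requests` percent-encodes parameter values on its own, so a value that is already percent-encoded ends up encoded twice. The sketch below illustrates this with the standard library's `urlencode`, on the assumption that it encodes simple string values the same way `requests` does. The `encode_query` helper is a hypothetical name used only for illustration.

```python
from urllib.parse import urlencode


def encode_query(value):
    """Return the query string produced for a single 'q' parameter."""
    return urlencode({'q': value})


# A raw phrase query is encoded exactly once.
raw = encode_query('"Microsoft Corporation"')

# A pre-encoded phrase has its '%' signs encoded again.
double = encode_query('%22Microsoft%20Corporation%22')

print(raw)     # q=%22Microsoft+Corporation%22
print(double)  # q=%2522Microsoft%2520Corporation%2522
```

Passing the raw phrase, quotes included, is therefore the safer way to request an exact-phrase match.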

### Request files from the Guardian

All content (articles, blogs, etc.) that matches the search query defined by the parameters above is downloaded one day at a time and saved to a JSON file, i.e. there is one JSON file per day.
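
The one-file-per-day layout can be sketched as a small generator. This is a minimal illustration rather than the code used in this notebook: the `daily_files` helper and its arguments are hypothetical names.

```python
from datetime import date, timedelta
from os.path import join


def daily_files(start_date, end_date, directory):
    """Yield (date string, JSON path) for each day from start_date to end_date, inclusive."""
    for offset in range((end_date - start_date).days + 1):
        day = start_date + timedelta(days=offset)
        datestr = day.strftime('%Y-%m-%d')
        yield datestr, join(directory, datestr + '.json')


days = list(daily_files(date(2020, 3, 1), date(2020, 3, 3), 'articles'))
print([datestr for datestr, _ in days])  # ['2020-03-01', '2020-03-02', '2020-03-03']
```

Because the end date is included, the range from 2020-03-01 to 2020-04-12 covers 43 days.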

In [4]:
# day iteration from here:
# http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates
start_date = START_DATE_GLOBAL
end_date = END_DATE_GLOBAL
dayrange = range((end_date - start_date).days + 1)
for daycount in dayrange:
    dt = start_date + timedelta(days=daycount)
    datestr = dt.strftime('%Y-%m-%d')
    fname = join(ARTICLES_DIR, datestr + '.json')
    if not exists(fname):
        # then let's download it
        print("Downloading", datestr)
        all_results = []
        my_params['from-date'] = datestr
        my_params['to-date'] = datestr
        current_page = 1
        total_pages = 1
        while current_page <= total_pages:
            print("...page", current_page)
            my_params['page'] = current_page
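            resp = requests.get(API_ENDPOINT, params=my_params)
            resp.raise_for_status()  # fail early on an HTTP error response
            data = resp.json()
            all_results.extend(data['response']['results'])
            # the response reports the total number of pages; loop again if there is more than one
            total_pages = data['response']['pages']
            current_page += 1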

        with open(fname, 'w') as f:
            print("Writing to", fname)

            # re-serialize it for pretty indentation
            f.write(json.dumps(all_results, indent=2))

Downloading 2020-03-01
...page 1
Writing to theguardian\articles\microsoft\2020-03-01.json
Downloading 2020-03-02
...page 1
Writing to theguardian\articles\microsoft\2020-03-02.json
Downloading 2020-03-03
...page 1
Writing to theguardian\articles\microsoft\2020-03-03.json
Downloading 2020-03-04
...page 1
Writing to theguardian\articles\microsoft\2020-03-04.json
Downloading 2020-03-05
...page 1
Writing to theguardian\articles\microsoft\2020-03-05.json
Downloading 2020-03-06
...page 1
Writing to theguardian\articles\microsoft\2020-03-06.json
Downloading 2020-03-07
...page 1
Writing to theguardian\articles\microsoft\2020-03-07.json
Downloading 2020-03-08
...page 1
Writing to theguardian\articles\microsoft\2020-03-08.json
Downloading 2020-03-09
...page 1
Writing to theguardian\articles\microsoft\2020-03-09.json
Downloading 2020-03-10
...page 1
Writing to theguardian\articles\microsoft\2020-03-10.json
Downloading 2020-03-11
...page 1
Writing to theguardian\articles\microsoft\2020-03-11.json
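
The paging logic in the loop above can be isolated into a function, which makes it testable without network access. This is a sketch under assumptions: `fetch_all_pages` and `fake_fetch` are hypothetical names, and the fake response only mimics the `response.results` and `response.pages` keys that the download code reads.

```python
def fetch_all_pages(fetch_page):
    """Collect results from every page.

    fetch_page(page_number) must return a dict shaped like
    {'response': {'results': [...], 'pages': <total page count>}}.
    """
    all_results = []
    current_page = 1
    total_pages = 1  # updated from each response
    while current_page <= total_pages:
        data = fetch_page(current_page)
        all_results.extend(data['response']['results'])
        total_pages = data['response']['pages']
        current_page += 1
    return all_results


def fake_fetch(page):
    """Stand-in for the HTTP request: three pages of two results each."""
    results = [{'id': f'page{page}-item{i}'} for i in range(2)]
    return {'response': {'results': results, 'pages': 3}}


articles = fetch_all_pages(fake_fetch)
print(len(articles))  # 6
```

In the notebook, `fetch_page` would wrap the `requests.get` call with the `page` parameter set.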

### Convert the JSON files into CSV files

For convenience, this code converts every JSON file to a CSV file so the content can be reviewed quickly and corrected if needed.

In [5]:
test_directory = 'theguardian/articles/microsoft/'

for file_name in [file for file in os.listdir(test_directory) if file.endswith('.json')]:
    # each CSV is written next to its source file, e.g. 2020-03-01.json.csv
    with open(join(test_directory, file_name)) as json_file:
        data = pd.read_json(json_file)
        data.to_csv(join(test_directory, file_name + '.csv'), index=False)
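
With `'show-fields': 'all'`, each result is expected to carry a nested `fields` object, which would end up as a single dictionary-valued column in the CSV. The sketch below shows one way to flatten such records before writing them, using only the standard library. The `flatten` and `write_csv` helpers and the sample record are hypothetical, and the shape of the record is an assumption about the API response.

```python
import csv
import io


def flatten(record, parent_key='', sep='.'):
    """Flatten nested dicts into one level, joining keys with sep."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def write_csv(records, out_file):
    """Write flattened records as CSV, using the union of all keys as the header."""
    rows = [flatten(record) for record in records]
    header = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out_file, fieldnames=header, restval='')
    writer.writeheader()
    writer.writerows(rows)


# Made-up record for illustration only.
sample = [{'id': 'technology/example', 'fields': {'headline': 'Example', 'wordcount': '120'}}]
buffer = io.StringIO()
write_csv(sample, buffer)
print(buffer.getvalue())
```

With pandas, `pd.json_normalize` applied to the loaded list of results should give a comparable flat table.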
