# [Collect Data From The New York Times Over Any Period of Time](https://medium.com/@briennakh/collecting-data-from-the-new-york-times-over-any-period-of-time-3e365504004)

Load dependencies.

In [24]:
import os
import pandas as pd
import requests
import json
import time
import dateutil
import datetime
import configparser
from dateutil.relativedelta import relativedelta
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Specify the date range.

In [3]:
end = datetime.date.today() 
start = end - relativedelta(years=1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 2019-05-26
End date: 2020-05-26


Make a list of the months that fall within this range, even if partially. We need this information for making calls to the Archive API, since it works with only one month at a time.

In [4]:
months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]
months_in_range

[['2019', '6'],
 ['2019', '7'],
 ['2019', '8'],
 ['2019', '9'],
 ['2019', '10'],
 ['2019', '11'],
 ['2019', '12'],
 ['2020', '1'],
 ['2020', '2'],
 ['2020', '3'],
 ['2020', '4'],
 ['2020', '5']]

Get our API key via configparser. This is how we avoid exposing the key to the world. Would not be a big loss with this specific use case but is good practice.

In [15]:
configs = configparser.ConfigParser()
configs.read('config.ini')
YOUR_API_KEY = configs['NYT']['ACCESS_KEY']

I wrote some code to request and process article data from the Archive API. This code works with only one month at a time for optimal memory management. We send a request to the NYT Archive API for a given month, receive and parse the response, and populate a data frame with some details about each article, including its publication date, main headline, section, subject keywords, document type, and material type. Finally, we save the data frame as a CSV file and move on to the next month, until we have reached the end of the desired time range.

In [27]:
def send_request(date):
    '''Sends a request to the NYT Archive API for given date.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    url = base_url + '/' + date[0] + '/' + date[1] + '.json?api-key=' + YOUR_API_KEY
    response = requests.get(url, verify=False).json()
    time.sleep(6)
    return response


def is_valid(article, date):
    '''An article is only worth checking if it is in range, and has a headline.'''
    is_in_range = date > start and date < end
    has_headline = type(article['headline']) == dict and 'main' in article['headline'].keys()
    return is_in_range and has_headline


def parse_response(response):
    '''Parses and returns response as pandas data frame.'''
    data = {'headline': [],  
        'date': [], 
        'doc_type': [],
        'material_type': [],
        'section': [],
        'keywords': []}
    
    articles = response['response']['docs'] 
    for article in articles: # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main']) 
            if 'section' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article: 
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
    return pd.DataFrame(data) 


def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    total = 0
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    for date in dates:
        response = send_request(date)
        df = parse_response(response)
        total += len(df)
        if not os.path.exists('headlines'):
            os.mkdir('headlines')
        df.to_csv('headlines/' + date[0] + '-' + date[1] + '.csv', index=False)
        print('Saving headlines/' + date[0] + '-' + date[1] + '.csv...')
    print('Number of articles collected: ' + str(total))

Note that there are two [rate limits](https://developer.nytimes.com/faq#a11) per API: 4,000 requests per day and 10 requests per minute. We sleep for 6 seconds between calls to avoid hitting the per minute rate limit.

Run the code to get and process articles from **months_in_range**.

In [28]:
get_data(months_in_range)

Date range: ['2019', '6'] to ['2020', '5']
Saving headlines/2019-6.csv...
Saving headlines/2019-7.csv...
Saving headlines/2019-8.csv...
Saving headlines/2019-9.csv...
Saving headlines/2019-10.csv...
Saving headlines/2019-11.csv...
Saving headlines/2019-12.csv...
Saving headlines/2020-1.csv...
Saving headlines/2020-2.csv...
Saving headlines/2020-3.csv...
Saving headlines/2020-4.csv...
Saving headlines/2020-5.csv...
Number of articles collected: 80460


We have collected data for 80,460 articles from the past year! Each month has been saved to a CSV file in the headlines directory.

Let's see what the data looks like for the last month that we processed, which is still in memory.

In [29]:
response

{'copyright': 'Copyright (c) 2020 The New York Times Company. All Rights Reserved.',
 'response': {'meta': {'hits': 7559},
  'docs': [{'abstract': 'The protests started as peaceful marches and rallies against an unpopular bill. Then came dozens of rounds of tear gas and a government that refused to back down.',
    'web_url': 'https://www.nytimes.com/interactive/2019/world/asia/hong-kong-protests-arc.html',
    'snippet': 'The protests started as peaceful marches and rallies against an unpopular bill. Then came dozens of rounds of tear gas and a government that refused to back down.',
    'lead_paragraph': 'The protests started as peaceful marches and rallies against an unpopular bill. Then came dozens of rounds of tear gas and a government that refused to back down.',
    'print_section': 'A',
    'print_page': '6',
    'source': 'The New York Times',
    'multimedia': [],
    'headline': {'main': 'Six Months of Hong Kong Protests. How Did We Get Here?',
     'kicker': None,
     'con

There's a lot more. The documentation tells us what is all there.

See what the CSV file for this month looks like.

In [30]:
df

Unnamed: 0,headline,date,doc_type,material_type,section,keywords
0,Six Months of Hong Kong Protests. How Did We G...,2019-10-01,multimedia,Interactive Feature,,[Hong Kong Protests (2019)]
1,"Hong Kong, India, North Korea: Your Wednesday ...",2019-10-01,article,briefing,,[]
2,Looted Ethiopian Crown Resurfaces in the Nethe...,2019-10-03,article,News,,"[Arts and Antiquities Looting, Smuggling, Robb..."
3,Cora Cahan Named President of the Baryshnikov ...,2019-10-02,article,News,,"[Dancing, Nonprofit Organizations]"
4,These Butterflies Evolved to Eat Poison. How C...,2019-10-02,article,News,,"[Flies, Insects, Genetics and Heredity, Evolut..."
...,...,...,...,...,...,...
7554,The Age of ‘The Age of Innocence’,2019-11-01,article,Review,,[Books and Literature]
7555,Uber Fights to Get Its Edge Back,2019-11-01,article,News,,"[Layoffs and Job Reductions, Car Services and ..."
7556,Breath Tests Aim to Stop Drunk Driving. Can We...,2019-11-01,article,News,,"[Tests (Sobriety), Drunken and Reckless Driving]"
7557,A Defense of Clowns,2019-10-31,article,News,,"[Clowns, Hospitals, Infertility, Children and ..."


As seen in the data frame, there is the occasional consistency quirk to watch out for. Sometimes the 1st of a month is considered part of the previous month. Sometimes there is missing data, like with September and October 1978 due to a [multi-union strike](https://github.com/nytimes/public_api_specs/issues/42).