Opened an issue: https://github.com/nytimes/public_api_specs/issues?q=is%3Aopen+is%3Aissue

# Collecting data from The New York Times over any period of time

Load dependencies.

In [304]:
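import configparser
import datetime
import json
import time

import dateutil.parser  # 'import dateutil' alone does not load the parser submodule
import pandas as pd
import requests
from dateutil.relativedelta import relativedelta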

Specify our date range. 

In [317]:
end = datetime.date.today() 
start = end - relativedelta(years=1)
print('Start date: ' + str(start))
print('End date: ' + str(end))

Start date: 2019-05-01
End date: 2020-05-01


Make a list of the months that fall within this range, even if partially. We need this information for making calls to the Archive API, since it works with only one month at a time.

In [318]:
# Snap the start to the first of its month so a partial first month is included
months_in_range = [[str(d.year), str(d.month)] for d in pd.date_range(start.replace(day=1), end, freq='MS')]
months_in_range

[['2019', '5'],
 ['2019', '6'],
 ['2019', '7'],
 ['2019', '8'],
 ['2019', '9'],
 ['2019', '10'],
 ['2019', '11'],
 ['2019', '12'],
 ['2020', '1'],
 ['2020', '2'],
 ['2020', '3'],
 ['2020', '4'],
 ['2020', '5']]
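
Note that `pd.date_range(..., freq='MS')` only yields month-start dates that fall inside the range, so a start date in the middle of a month needs care. A minimal sketch of a standard-library alternative that includes every month the range touches; the helper name `months_between` is an assumption for illustration, not part of the original notebook:

```python
import datetime

def months_between(start, end):
    """Return [year, month] string pairs for every month touched by start..end."""
    months = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month that contains `end`
    while (year, month) <= (end.year, end.month):
        months.append([str(year), str(month)])
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months

print(months_between(datetime.date(2019, 5, 15), datetime.date(2019, 7, 2)))
# [['2019', '5'], ['2019', '6'], ['2019', '7']]
```

The output has the same shape as `months_in_range`, so it could be passed to the Archive API loop unchanged.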

Get our API key via configparser, so that the key lives in a local config file rather than in the notebook's source. Leaking this particular key would not be a big loss, but keeping credentials out of code is good practice.

In [311]:
configs = configparser.ConfigParser()
configs.read('config.ini')
api_key = configs['NYT']['ACCESS_KEY']
api_key

'OvGWufbEAQB2qxGFSDIc0kzUXu7IS0Pb'
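
For reference, a minimal sketch of how a `config.ini` matching the lookup above could be laid out and read. The key value is a placeholder, and `read_string` stands in for reading the file from disk so that the example is self-contained:

```python
import configparser

# Hypothetical contents of config.ini; the key below is a placeholder, not a real key
EXAMPLE_CONFIG = """
[NYT]
ACCESS_KEY = your-api-key-here
"""

example_configs = configparser.ConfigParser()
example_configs.read_string(EXAMPLE_CONFIG)  # .read('config.ini') would load the real file
example_key = example_configs['NYT']['ACCESS_KEY']
```

Listing `config.ini` in `.gitignore` keeps the real key out of version control.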

Make calls to the Archive API to request our data, one call for each month in months_in_range. We will filter this data later to remove fringe dates, those that fall outside our specified time period.

Note that each API has two rate limits: 4,000 requests per day and 10 requests per minute. We sleep for 6 seconds between calls (60 seconds / 10 requests) to stay within the per-minute limit.

In [395]:
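responses = []
base_url = 'https://api.nytimes.com/svc/archive/v1'  # No trailing slash; one is added below
for year, month in months_in_range:
    url = base_url + '/' + year + '/' + month + '.json?api-key=' + api_key
    response = requests.get(url)  # TLS certificate verification stays on (the default)
    response.raise_for_status()   # Stop early on HTTP errors instead of parsing an error body
    responses.append(response.json())
    time.sleep(6)  # Stay within 10 requests per minute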
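
If a call does run into a rate limit, it can be retried after a pause. A minimal sketch, assuming the API signals rate limiting with HTTP status 429; the helper `get_with_retry` and its parameters are illustrative and not part of the original notebook:

```python
import time

def get_with_retry(fetch, url, max_attempts=3, pause=6, sleep=time.sleep):
    """Call fetch(url) until it returns a non-429 response or attempts run out.

    fetch is any callable that returns an object with a status_code attribute,
    for example requests.get.
    """
    for attempt in range(1, max_attempts + 1):
        response = fetch(url)
        if response.status_code != 429:  # Assumption: 429 means "rate limited"
            return response
        if attempt < max_attempts:
            sleep(pause * attempt)  # Wait longer after each rate-limited attempt
    return response  # Still rate limited; the caller decides what to do
```

In the loop above it could be called as `get_with_retry(requests.get, url)`.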



Let's see what the data looks like for, say, April 2020.

In [396]:
responses[11]['response']['docs']

[{'abstract': 'The first quarter was one of the worst in history for many stock markets around the world. The start of the second isn’t looking any better.',
  'web_url': 'https://www.nytimes.com/2020/04/01/business/dealbook/coronavirus-stocks-earnings.html',
  'snippet': 'The first quarter was one of the worst in history for many stock markets around the world. The start of the second isn’t looking any better.',
  'lead_paragraph': ' We are holding a conference call for DealBook readers tomorrow, April 2, at 11 a.m. Eastern. We will go behind the scenes of the Trump administration’s response to the coronavirus and what policy actions may come next. Our special guest will be Maggie Haberman, one of the NYT’s top White House correspondents. You’ll be able to ask Maggie about her reporting during the call, or submit questions in advance to dealbook@nytimes.com. For details about how to join, visit the R.S.V.P. page. This is the first in a weekly series of calls we’re calling the DealBook

There's a lot more. The documentation describes every field that is available.

Populate a data frame with a bunch of details about each article, including its publication date, main headline, section, subject keywords, document type, and material type.

In [401]:
data = {'headline': [],  
        'date': [], 
        'doc_type': [],
        'material_type': [],
        'section': [],
        'keywords': []}

for response in responses: # For each response, get all the articles
    articles = response['response']['docs'] 
    for article in articles: # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        is_in_range = start < date < end  # Strictly between start and end
        if is_in_range and article['headline']['main']: # Collect its details, only if it has a headline 
            data['date'].append(date)
            data['headline'].append(article['headline']['main']) 
            data['section'].append(article['section_name'])
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article: 
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
                
df = pd.DataFrame(data)
df.to_csv('NYT.csv')
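
The loop above indexes several fields directly, which raises a `KeyError` if an article lacks one of them. A minimal sketch of a more defensive extraction using `dict.get`; the helper name `article_to_row` and the sample article are made up for illustration, and which fields can actually be missing is an assumption:

```python
def article_to_row(article):
    """Flatten one Archive API article dict into a row; missing fields become None."""
    headline = (article.get('headline') or {}).get('main')
    keywords = [k['value'] for k in article.get('keywords', [])
                if k.get('name') == 'subject']
    return {
        'headline': headline,
        'pub_date': article.get('pub_date'),
        'doc_type': article.get('document_type'),
        'material_type': article.get('type_of_material'),
        'section': article.get('section_name'),
        'keywords': keywords,
    }

# Made-up article with only some of the fields present
sample = {
    'headline': {'main': 'Example headline'},
    'pub_date': '2020-04-01T10:00:00+0000',
    'keywords': [{'name': 'subject', 'value': 'Movies'},
                 {'name': 'persons', 'value': 'Someone'}],
}
row = article_to_row(sample)
```

A list of such rows can be passed straight to `pd.DataFrame`.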

In [402]:
df

| Unnamed: 0 | headline | date | doc_type | material_type | section | keywords |
|---|---|---|---|---|---|---|
| 0 | ‘Nobody Is Above the Law’: House Democrats Are... | 2019-05-02 | multimedia | Video | U.S. | [United States Politics and Government] |
| 1 | Watch ‘Boyz N the Hood’ Free at the Tribeca Fi... | 2019-05-02 | article | News | Movies | [Tribeca Film Festival (NYC)] |
| 2 | Ryan Reynolds Keeps a Bare Closet | 2019-05-03 | article | News | Fashion & Style | [Fashion and Apparel] |
| 3 | How Will Satan & Adam Play in 2019? | 2019-05-02 | article | News | New York | [Blues Music, Race and Ethnicity, Documentary ... |
| 4 | Trump Says He Discussed the ‘Russian Hoax’ in ... | 2019-05-03 | article | News | U.S. | [Russian Interference in 2016 US Elections and... |
| ... | ... | ... | ... | ... | ... | ... |
| 83463 | Andy Dalton Joins Crowded N.F.L. Free Agent Pool | 2020-04-30 | article | News | Sports | [Football, Free Agents (Sports)] |
| 83464 | ‘Bull’ Review: A Lot to Wrangle With | 2020-04-30 | article | Review | Movies | [Movies] |
| 83465 | ‘Liberté’ Review: A Miserable Orgy From the Pr... | 2020-04-30 | article | Review | Movies | [Movies] |
| 83466 | Coronavirus Briefing: What Happened Today | 2020-04-29 | article | briefing | U.S. | [Coronavirus (2019-nCoV)] |


Drop duplicates that appear on the same day.

AttributeError: 'DataFrame' object has no attribute 'duplicates'
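
The error above comes from calling a method that pandas does not define; the DataFrame method for this task is `drop_duplicates`. A minimal sketch on a small made-up frame, assuming that a duplicate means the same headline on the same date (the original cell is not shown, so the choice of columns is a guess):

```python
import pandas as pd

# Made-up rows: the first two share both headline and date
example = pd.DataFrame({
    'headline': ['A', 'A', 'A', 'B'],
    'date': ['2020-04-01', '2020-04-01', '2020-04-02', '2020-04-01'],
})

# Keep the first row of each (headline, date) pair and renumber the index
deduped = example.drop_duplicates(subset=['headline', 'date']).reset_index(drop=True)
print(len(example), len(deduped))
```

The same headline on a different date is kept, since only rows that match on both columns count as duplicates.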

[{'abstract': '“She tried to do her job, and it killed her,” said the father of Dr. Lorna M. Breen, who worked at a Manhattan hospital hit hard by the coronavirus outbreak.',
  'web_url': 'https://www.nytimes.com/2020/04/27/nyregion/new-york-city-doctor-suicide-coronavirus.html',
  'snippet': '“She tried to do her job, and it killed her,” said the father of Dr. Lorna M. Breen, who worked at a Manhattan hospital hit hard by the coronavirus outbreak.',
  'lead_paragraph': 'A top emergency room doctor at a Manhattan hospital that treated many coronavirus patients died by suicide on Sunday, her father and the police said.',
  'print_section': 'A',
  'print_page': '13',
  'source': 'The New York Times',
  'multimedia': [{'rank': 0,
    'subtype': 'xlarge',
    'caption': None,
    'credit': None,
    'type': 'image',
    'url': 'images/2020/04/27/nyregion/27nyvirus-ersuicideNEW/27nyvirus-ersuicideNEW-articleLarge.jpg',
    'height': 900,
    'width': 600,
    'legacy': {'xlarge': 'images/20

We have collected 82,957 headlines from the past year!

There is a lot we can do with this data. In the next story, we will analyze it to explore how The New York Times has evolved in its reporting over the past year, with an emphasis on the coronavirus pandemic. This is interesting because it shows not just what happened, but how the media discussed it.

## OLD CODE... DELETE LATER?

In [12]:
date_range_for_plots = [str(x).split(' ')[0] for x in date_range]
date_range_for_plots[0]

'2019-12-08'

Get the New York Times link for each day (it will give us the first snapshot).

In [64]:
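# Written against the Selenium 3 API (find_elements_by_* and executable_path)
import re
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_links(browser, url):
    # The browser is passed in explicitly; it is created inside scrape_archives
    browser.get(url)
    calendar_grid = WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.CLASS_NAME, 'calendar-grid'))) 
    if '2019' in url:
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(5)
    dates = calendar_grid.find_elements_by_tag_name('a') 
    return [date.get_attribute('href') for date in dates]

def scrape_archives(urls):
    browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver') 
    browser.maximize_window() 
    data = {}
    for url in urls: 
        links = get_links(browser, url)
        for link in links: 
            # Parse only the leading YYYYMMDD digits of the snapshot timestamp
            date = datetime.datetime.strptime(re.search(r'web/(\d{8})', link)[1], '%Y%m%d')
            if date in date_range:
                data[date] = {'link': link,
                              'headlines': [],
                              'sections': []}
    browser.quit()
    return data  # Hand the collected links back to the caller
    
urls = ['http://web.archive.org/web/2019*/https://www.nytimes.com/', 
        'http://web.archive.org/web/*/nytimes.com']

data = scrape_archives(urls)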

In [251]:
# For each date, the previous day's snapshot link shows that day's front page (reason unclear)
data2 = {}

for date in date_range:
    date_str = (date - datetime.timedelta(days=1)).strftime('%Y%m%d')
    link = 'http://web.archive.org/web/' + date_str + '/https://www.nytimes.com/'
    data2[date] = {'link': link,
                   'headlines': [],
                   'sections': []}
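
The link construction above can be wrapped in a small function, which makes the one-day offset explicit and easy to check. A minimal sketch; `wayback_link` is a hypothetical name, and the URL pattern is the one used in the cell above:

```python
import datetime

def wayback_link(date, site='https://www.nytimes.com/'):
    """Build the Wayback Machine URL for the day before `date`."""
    date_str = (date - datetime.timedelta(days=1)).strftime('%Y%m%d')
    return 'http://web.archive.org/web/' + date_str + '/' + site

print(wayback_link(datetime.date(2019, 12, 8)))
# http://web.archive.org/web/20191207/https://www.nytimes.com/
```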

In [258]:
data2

{Timestamp('2019-12-08 00:00:00', freq='D'): {'link': 'http://web.archive.org/web/20191207/https://www.nytimes.com/',
  'headlines': ['Video Games and Online Chats Are ‘Hunting Grounds’ for Sexual Predators',
   'Here’s how to protect your children.',
   'Florida Shooting Suspect Showed Videos of Mass Shootings at Party',
   'After two attacks on Navy bases in a week, officials are confronting how persistent such incidents have become.',
   'Judiciary Committee Releases Report Defining Impeachable Offense',
   'Behind the Scenes of Impeachment: Crammed Offices, Late Nights, Cold Pizza',
   'Buttigieg Struggles to Square Transparency With Lack of Disclosure on Consulting',
   'As Candidates Jostle for Position, a Long Race May Become a Marathon',
   'Can Biology Class Reduce Racism?',
   '11 of Our Best Weekend Reads',
   'Did you stay up-to-date this week? Take our news quiz.',
   'Listen: ‘Modern Love’ Podcast',
   'The ‘In Her Words’ Newsletter',
   'The Neediest Cases Fund',
   'The

Get headlines and sections from each link.

In [255]:
browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver') 
unacceptable_headlines = ['Listen to ‘The Daily’']

for date in data2:

    print('Scraping ' + str(date) + '...')
    browser.get(data2[date]['link'])
    time.sleep(5)
    body = WebDriverWait(browser, 30).until(EC.element_to_be_clickable((By.CLASS_NAME, 'e6b6cmu0'))) 
    
    elements = [x for x in body.find_elements_by_class_name('esl82me0') if x.text not in unacceptable_headlines]

    # Parse headlines
    data2[date]['headlines'] = [headline.text for headline in elements]

    # Parse sections (anchor elements contain headlines)
    links = []
    for x in elements:
        link = x.find_element_by_xpath('../..').get_attribute('href')
        if not link:
            link = x.find_element_by_xpath('..').get_attribute('href')
        if not link:
            link = x.find_element_by_xpath('../../..').get_attribute('href')
        links.append(link)

    sections = []
    for x in links: 
        if 'cooking.nytimes.com' in x:
            section = 'cooking'
        elif '/us/' in x and '/politics/' in x:
            section = 'us-politics'
        elif 'www.nytimes.com/weekender' in x:
            section = 'weekender'
        elif 'www.nytimes.com/live' in x:
            section = 'live'
        elif 'www.nytimes.com/spotlight' in x:
            section = 'spotlight'
        else:
            matches = re.search(r'/(?:\d{2,4}/){1,3}([\w-]*)|nytimes.com/(\D*)/', x)
            if matches[1]:
                section = matches[1]
            else:
                section = matches[2]
        sections.append(section)

    data2[date]['sections'] = sections
    print('Found ' + str(len(elements)) + ' headlines') 

Scraping 2019-12-08 00:00:00...
Found 28 headlines
Scraping 2019-12-09 00:00:00...
Found 26 headlines
Scraping 2019-12-10 00:00:00...
Found 30 headlines
Scraping 2019-12-11 00:00:00...
Found 29 headlines
Scraping 2019-12-12 00:00:00...
Found 27 headlines
Scraping 2019-12-13 00:00:00...
Found 28 headlines
Scraping 2019-12-14 00:00:00...
Found 27 headlines
Scraping 2019-12-15 00:00:00...
Found 28 headlines
Scraping 2019-12-16 00:00:00...
Found 26 headlines
Scraping 2019-12-17 00:00:00...
Found 25 headlines
Scraping 2019-12-18 00:00:00...
Found 28 headlines
Scraping 2019-12-19 00:00:00...
Found 31 headlines
Scraping 2019-12-20 00:00:00...
Found 28 headlines
Scraping 2019-12-21 00:00:00...
Found 27 headlines
Scraping 2019-12-22 00:00:00...
Found 28 headlines
Scraping 2019-12-23 00:00:00...
Found 26 headlines
Scraping 2019-12-24 00:00:00...
Found 24 headlines
Scraping 2019-12-25 00:00:00...
Found 26 headlines
Scraping 2019-12-26 00:00:00...
Found 25 headlines
Scraping 2019-12-27 00:00:00...
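
The section rules in the scraping cell can be pulled into a standalone function so that they can be checked without a browser. A minimal sketch that follows the same rules; `section_from_url` is a hypothetical name, the `/us/` and `/politics/` test is written as two separate membership checks, and returning `None` for missing or unmatched links is an added assumption:

```python
import re

# Same pattern as in the scraping cell, with the dot in "nytimes.com" escaped
SECTION_PATTERN = re.compile(r'/(?:\d{2,4}/){1,3}([\w-]*)|nytimes\.com/(\D*)/')

def section_from_url(url):
    """Guess an article's section from its URL; None if it cannot be determined."""
    if not url:
        return None
    if 'cooking.nytimes.com' in url:
        return 'cooking'
    if '/us/' in url and '/politics/' in url:
        return 'us-politics'
    for name in ('weekender', 'live', 'spotlight'):
        if 'www.nytimes.com/' + name in url:
            return name
    match = SECTION_PATTERN.search(url)
    if not match:
        return None
    # Group 1 is the path segment after the date; group 2 is the segment before it
    return match[1] or match[2]

print(section_from_url('https://www.nytimes.com/2020/03/27/world/coronavirus-news.html'))
# world
```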

In [225]:
# Test
for element in elements:
    print(element.text)
    link = element.find_element_by_xpath('../..').get_attribute('href')
    link2 = element.find_element_by_xpath('..').get_attribute('href')
    link3 = element.find_element_by_xpath('../../..').get_attribute('href')
    print('link 1: ' + str(link))
    print('link 2: ' + str(link2))
    print('link 3: ' + str(link3))

Trump Signs $2 Trillion Coronavirus Relief Package
link 1: http://web.archive.org/web/20200327235926/https://www.nytimes.com/2020/03/27/world/coronavirus-news.html
link 2: None
link 3: None
Updates: Measure Is Largest Stimulus in Modern History
link 1: http://web.archive.org/web/20200327235926/https://www.nytimes.com/2020/03/27/world/coronavirus-news.html
link 2: None
link 3: None
Updates: New York Region
link 1: http://web.archive.org/web/20200327235926/https://www.nytimes.com/2020/03/27/nyregion/coronavirus-new-york-update.html
link 2: None
link 3: None
Some U.S. Cities Could Have Outbreaks Worse Than Wuhan’s
link 1: http://web.archive.org/web/20200327235926/https://www.nytimes.com/interactive/2020/03/27/upshot/coronavirus-new-york-comparison.html
link 2: None
link 3: None
Updates: Business and Markets
link 1: http://web.archive.org/web/20200327235926/https://www.nytimes.com/2020/03/27/business/stock-market-today-coronavirus.html
link 2: None
link 3: None
‘White-Collar Quarantine’: V

In [232]:
for i in elements:
    if i.text == 'Listen to ‘The Daily’':
        print(i)

<selenium.webdriver.remote.webelement.WebElement (session="bac102a63ca05026b1c33d33b511cba0", element="de5c11e7-124c-498c-a4cb-754f95db04f2")>
