## NYTimes API
- This notebook contains the code for retrieving and parsing New York Times archive headlines over a period of time and exporting them as a CSV file.
- The code is adapted and modified from [Brienna Herold](https://brienna.medium.com/)'s amazing [article](https://towardsdatascience.com/collecting-data-from-the-new-york-times-over-any-period-of-time-3e365504004).

In [22]:
import requests
from pprint import pprint as pp
import os
import json
import time
import datetime
import dateutil.parser  # makes dateutil.parser.parse available explicitly
import pandas as pd
from dateutil.relativedelta import relativedelta
from loguru import logger

ModuleNotFoundError: No module named 'loguru'

In [3]:
# I'm behind a proxy so had to include this. Set this to None by default
# proxies = {
#    'http': os.environ["http_proxy"],
#    'https': os.environ["https_proxy"],
# }

# set your own NYT Developer's API key
API = os.environ["NYT_dev_API"]

In [12]:
def send_request():
    url = f'https://api.nytimes.com/svc/news/v3/content/all/business.json?api-key={API}'
    response = requests.get(url, proxies=None).json()
    # print("Sleep for 20 secs...")
    # time.sleep(20)
    # print("Resuming parsing...")
    return response
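
The section (`business`) is hard-coded in the URL above. As a minimal sketch, assuming other Times Newswire sections follow the same URL pattern, a small helper can assemble the request URL from its parts. `build_newswire_url` and its parameters are hypothetical names that do not appear in the original notebook, and `"DEMO_KEY"` is a placeholder.

```python
from urllib.parse import quote, urlencode

def build_newswire_url(section, api_key, source="all"):
    """Build a Times Newswire URL following the pattern used in send_request()."""
    base = "https://api.nytimes.com/svc/news/v3/content"
    # quote() escapes characters that are not safe inside a URL path segment
    path = f"{base}/{quote(source, safe='')}/{quote(section, safe='')}.json"
    return f"{path}?{urlencode({'api-key': api_key})}"

url = build_newswire_url("business", "DEMO_KEY")  # "DEMO_KEY" is a placeholder
print(url)
```

The resulting string can be passed to `requests.get` in place of the f-string, which keeps the section configurable without editing the function body.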

In [13]:
results = send_request()

In [18]:
pp(results["results"])

[{'abstract': 'She reported on conflicts around the world and for a time was '
              'the only American broadcast journalist reporting from Baghdad '
              'during the U.S. “shock and awe” bombing campaign in 2003.',
  'byline': 'BY KATHARINE Q. SEELYE',
  'created_date': '2022-09-07T17:59:35-04:00',
  'des_facet': ['Deaths (Obituaries)',
                'Radio',
                'News and News Media',
                'Iraq War (2003-11)',
                'Afghanistan War (2001- )'],
  'first_published_date': '2022-09-07T17:59:35-04:00',
  'geo_facet': ['Baghdad (Iraq)', 'Iraq', 'Afghanistan'],
  'item_type': 'Article',
  'kicker': '',
  'material_type_facet': 'Obituary (Obit)',
  'multimedia': [{'caption': 'The NPR correspondent Anne Garrels in Iraq in '
                             '2006. She became known for conveying how '
                             'momentous events like wars affected the people '
                             'who lived through them.',
           

In [20]:
for result in results["results"]:
    print(result["title"])
    print(result["published_date"])
    print()

Anne Garrels, Fearless NPR Correspondent, Dies at 71
2022-09-07T17:59:35-04:00

Boston Globe Editor to Step Down
2022-09-07T17:07:07-04:00

Regal Cinemas’ parent, crippled by the pandemic, files for bankruptcy.
2022-09-07T13:33:18-04:00

Fed’s Vice Chair Signals More Rate Increases Ahead as Inflation Remains Too Hot
2022-09-07T12:40:29-04:00

Vice, Exploring a Sale, Weighs a Content Deal With a Saudi-Backed Firm
2022-09-07T12:26:12-04:00

United Airlines Plans to Halt J.F.K. Service Unless It Gets More Slots
2022-09-07T11:49:12-04:00

Antitrust regulators expand their global reach.
2022-09-07T09:58:02-04:00

Judge Hands Elon Musk a Win in Court
2022-09-07T08:12:51-04:00

How policy changes in California could influence the nation.
2022-09-07T05:02:04-04:00

India’s economy bucks the global slowdown.
2022-09-07T05:02:00-04:00

The Supply Chain Broke. Robots Are Supposed to Help Fix It.
2022-09-07T05:00:38-04:00

How Russian Gas in Europe Is Dwindling
2022-09-07T05:00:36-04:00

From the 
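
The `published_date` values printed above are ISO 8601 strings with a UTC offset. A minimal sketch of turning them into `datetime` objects for sorting or filtering follows; the `sample` list holds two records copied from the output above and stands in for `results["results"]`.

```python
from datetime import datetime

# Two records copied from the output above, standing in for results["results"]
sample = [
    {"title": "Boston Globe Editor to Step Down",
     "published_date": "2022-09-07T17:07:07-04:00"},
    {"title": "Judge Hands Elon Musk a Win in Court",
     "published_date": "2022-09-07T08:12:51-04:00"},
]

# fromisoformat() understands the "-04:00" offset and returns an aware datetime
parsed = [(datetime.fromisoformat(r["published_date"]), r["title"]) for r in sample]

# Sort oldest first
parsed.sort(key=lambda pair: pair[0])
for published, title in parsed:
    print(published.date(), published.strftime("%H:%M"), title)
```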

In [108]:
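def send_request(date):
    '''Sends a request to the NYT Archive API for the given [year, month] pair.'''
    base_url = 'https://api.nytimes.com/svc/archive/v1/'
    endpoint = f'{base_url}{date[0]}/{date[1]}.json'
    # Log the URL with a placeholder so the real key never reaches the output
    print(f'Retrieving response from {endpoint}?api-key=$API_KEY')
    response = requests.get(f'{endpoint}?api-key={API}', proxies=None).json()
    # Pause between calls to avoid hitting the API rate limit
    print("Sleep for 20 secs...")
    time.sleep(20)
    print("Resuming parsing...")
    return response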

def parse_response(response):
    ''' 
    Parses the response into pandas dataframe
    '''
    logger.info("Parsing response...")
    data = {'headline': [],  
        'date': [], 
        'doc_type': [],
        'material_type': [],
        'section': [],
        'keywords': []}
    
    articles = response['response']['docs'] 
    for article in articles: # For each article, make sure it falls within our date range
        date = dateutil.parser.parse(article['pub_date']).date()
        if is_valid(article, date):
            data['date'].append(date)
            data['headline'].append(article['headline']['main']) 
            if 'section_name' in article:
                data['section'].append(article['section_name'])
            else:
                data['section'].append(None)
            data['doc_type'].append(article['document_type'])
            if 'type_of_material' in article: 
                data['material_type'].append(article['type_of_material'])
            else:
                data['material_type'].append(None)
            keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
            data['keywords'].append(keywords)
    return pd.DataFrame(data) 

def get_data(dates):
    '''Sends and parses request/response to/from NYT Archive API for given dates.'''
    print('Date range: ' + str(dates[0]) + ' to ' + str(dates[-1]))
    os.makedirs('headlines', exist_ok=True)
    # Resume from a previously saved file if it exists; otherwise start empty
    previous = "headlines/2020-5_2022-8_NYtimes_headlines.csv"
    df_headlines = pd.read_csv(previous) if os.path.exists(previous) else pd.DataFrame()
    # Output file is named after the first and last [year, month] pair
    out_path = 'headlines/' + dates[0][0] + '-' + dates[0][1] + '_' + dates[-1][0] + '-' + dates[-1][1] + '_NYtimes_headlines.csv'
    for date in dates:
        response = send_request(date)
        df = parse_response(response)
        print("Concatenating headlines...")
        df_headlines = pd.concat([df_headlines, df], ignore_index=True)
        print(f"Headlines retrieved for {date[0]}/{date[1]}.")
        print()
        # Save after every month so progress survives an interrupted run
        print(f'Saving current data to "{out_path}"...')
        df_headlines.to_csv(out_path, index=False)
    print('Number of articles collected: ' + str(len(df_headlines)))
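
`parse_response` calls `is_valid`, which is not defined in the cells shown here. A minimal sketch of such a check follows, assuming it should keep articles dated strictly between the module-level `start` and `end` and carrying a `main` headline; the exact criteria the author used are an assumption. The dates below mirror the three-year window defined in a later cell.

```python
import datetime

# Assumed module-level date range; the notebook defines `start` and `end` in a later cell
start = datetime.date(2019, 8, 1)
end = datetime.date(2022, 8, 1)

def is_valid(article, date):
    """Return True if the article falls inside (start, end) and has a main headline."""
    is_in_range = start < date < end
    headline = article.get('headline')
    # Some records may carry a missing or malformed headline, so check the type first
    has_headline = isinstance(headline, dict) and 'main' in headline
    return is_in_range and has_headline

print(is_valid({'headline': {'main': 'Example'}}, datetime.date(2021, 6, 15)))
```

Because the function reads `start` and `end` from the enclosing scope, those two names need to be defined before `get_data` is called.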

In [92]:
end = datetime.date(2022,8,1)
start = end - relativedelta(years=3)

months_in_range = [x.split(' ') for x in pd.date_range(start, end, freq='MS').strftime("%Y %-m").tolist()]
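
The `%-m` directive (month without zero padding) is a platform-specific `strftime` extension and is not available everywhere, for example on Windows. A sketch of building the same `[year, month]` string pairs with the standard library only follows, assuming `start` falls on the first day of a month as it does here; `month_pairs` is a hypothetical helper name.

```python
import datetime

def month_pairs(start, end):
    """Return [[year, month], ...] as strings for every month from start to end, inclusive."""
    pairs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        pairs.append([str(year), str(month)])
        # Advance one month, rolling over into the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return pairs

months = month_pairs(datetime.date(2021, 6, 1), datetime.date(2022, 8, 1))
print(months[0], months[-1], len(months))
```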

In [107]:
get_data(months_in_range)

Date range: ['2021', '6'] to ['2022', '8']
Retrieving response from https://api.nytimes.com/svc/archive/v1/2021/6.json?api-key=$API_KEY
Sleep for 20 secs...
Resuming parsing...
Parsing response...
Concatenating headlines...
Headlines retrieved for 2021/6.

Saving current data to "headlines/2021-6_2022-8_NYtimes_headlines.csv"...
Retrieving response from https://api.nytimes.com/svc/archive/v1/2021/7.json?api-key=$API_KEY
Sleep for 20 secs...
Resuming parsing...
Parsing response...
Concatenating headlines...
Headlines retrieved for 2021/7.

Saving current data to "headlines/2021-6_2022-8_NYtimes_headlines.csv"...
Retrieving response from https://api.nytimes.com/svc/archive/v1/2021/8.json?api-key=$API_KEY
Sleep for 20 secs...
Resuming parsing...
Parsing response...
Concatenating headlines...
Headlines retrieved for 2021/8.

Saving current data to "headlines/2021-6_2022-8_NYtimes_headlines.csv"...
Retrieving response from https://api.nytimes.com/svc/archive/v1/2021/9.json?api-key=$API_KEY
