# Python Web APIs: Accessing NYT Data

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [The New York Times API](#nyt)
2. [Top stories API](#top)
3. [Most Viewed and Most Shared APIs](#most)
4. [Article Search API](#search)
5. [Data Analysis](#analysis)
6. [Demo: Handling Nested Arrays of Keywords](#demo)

In [32]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime

<a id='nyt'></a>

# The New York Times API

We are going to use the NYT API to demonstrate how Web APIs can be used to access useful information in an easy way. The New York Times offers a treasure trove of data about their articles that is easily accessible and available for free! We'll now get set up with API keys so that we can make some API calls to the NYT servers.

**Before** proceeding with this lesson, you need an API key.

## Getting API Access

For most APIs, a key or other user credentials are required for any database querying.  Generally, this requires that you register with the organization. Go to the [NYT Developer Page](http://developer.nytimes.com/) and create an account:

![](../img/nytimes_start.png)

Most APIs are set up for developers, so you'll likely be asked to register an "application".  All this really entails is coming up with a name for your project, and providing your real name, organization, and email.  Note that some more popular APIs (e.g. Twitter, Facebook) will require additional information, such as a web address or mobile number.

## Getting your API Keys

Once you've successfully registered, you will be assigned one or more keys, tokens, or other credentials that must be supplied to the server as part of any API call you make.  To make sure that users aren't abusing their data access privileges (e.g. by making many rapid queries), each set of keys will be given several **rate limits** governing the total number of calls that can be made over certain intervals of time.  For the NYT Article API, we have relatively generous rate limits: 10 calls per minute and 4,000 calls per day.

1. Log in with your new username and password.

2. Click on your email in the top right corner and you'll see a dropdown menu that says **Apps**.

3. Click on **Apps** and then click on the **+ New App** button.

4. You'll be prompted to add a name for your app. You can call it anything. Then enable the APIs that are enabled in the screenshot below. You can enable them all, but make sure you enable at least the ones in the screenshot.

![](../img/nytimes_app.png)

5. You'll see an API key next to your App ID. Have that key ready to copy into the first notebook.

![](../img/nytimes_key.png)
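
To stay under a rate limit like the one quoted above, one option is to space out calls on the client side. The following is a minimal sketch and not part of `pynytimes`: the `Throttle` class and its parameters are hypothetical names, and the 10-calls-per-minute figure is simply the limit mentioned earlier in this section.

```python
import time


class Throttle:
    """Keep successive calls at least 60 / calls_per_minute seconds apart."""

    def __init__(self, calls_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / calls_per_minute
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Hypothetical usage: call throttle.wait() right before each API request.
throttle = Throttle(calls_per_minute=10)
```

Calling `throttle.wait()` before every request keeps a loop over many sections or queries from bursting past the per-minute limit. It does not track the daily limit.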

## Handling API Keys

API keys are sensitive data! You **do not** want to accidentally check them into a publicly shared GitHub repo.

The following cell will:

1. first try to load previously saved credentials with `configparser` (if a key is found, you'll be asked whether you want to update it);
2. if no key is found, use `getpass` to request the credentials from the user (which works in notebooks as an input prompt);
3. then save those credentials with `configparser` to `~/.notebook-api-keys`, which is outside of the Git-controlled directory, so the key doesn't accidentally get added and checked in.

Run the following cell and add the API Key you just created when prompted.

In [1]:
import configparser
import os
from getpass import getpass

def get_api_key(api_name):
    config_file_path = os.path.expanduser("~/.notebook-api-keys")
    config = configparser.ConfigParser(interpolation=None)  # Disable interpolation to avoid issues with special characters
    
    # Try reading the existing config file
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    
    # Check if API key is present
    if config.has_option("API_KEYS", api_name):
        # Ask if the user wants to update the key
        update_key = input(f"An API key for {api_name} already exists. Do you want to update it? (y/n): ").lower()
        if update_key == 'n':
            return config.get("API_KEYS", api_name)
    
    # If no key exists or user opts to update, prompt for the new key
    api_key = getpass(f"Enter your {api_name} API key: ")

    # Save the API key in the config file
    if not config.has_section("API_KEYS"):
        config.add_section("API_KEYS")
    config.set("API_KEYS", api_name, api_key)
    
    with open(config_file_path, "w") as f:
        config.write(f)
    
    return api_key

# Example usage to retrieve the NYT API key
api_key = get_api_key("NYT")  # Please don't print the API key

print("NYT API key retrieved successfully.")


NYT API key retrieved successfully.


💡 **Tip**: Another way to keep your credentials secure and provide convenient access is through the [JupyterLab Credential Store](https://towardsdatascience.com/the-jupyterlab-credential-store-9cc3a0b9356). If you are using JupyterLab, this is a great general solution for handling API keys!
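
Another common pattern is to read the key from an environment variable, which also keeps it out of the notebook and out of version control. This is only a sketch: the helper `key_from_env` and the variable name `NYT_API_KEY` are assumptions made for illustration, not something the NYT or `pynytimes` requires.

```python
import os


def key_from_env(var_name="NYT_API_KEY"):
    """Return the key stored in an environment variable, or None if it is unset or empty."""
    value = os.environ.get(var_name, "").strip()
    return value or None


# Hypothetical usage: prefer the environment variable when it is set,
# and fall back to the get_api_key() prompt otherwise.
env_key = key_from_env()
```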

## Using `pynytimes`

To access the NYTimes' databases, we'll be using a third-party library called [pynytimes](https://github.com/michadenheijer/pynytimes). This package provides an easy-to-use tool for accessing the wealth of data hosted by the Times.

To install the library, follow the instructions taken from their [GitHub repo](https://github.com/michadenheijer/pynytimes).

There are multiple options to install `pynytimes`, but the easiest is by just installing it using `pip` in the Jupyter notebook itself, using a magic command:

In [2]:
%pip install pynytimes

Collecting pynytimes
  Downloading pynytimes-0.10.0-py3-none-any.whl.metadata (8.2 kB)
Downloading pynytimes-0.10.0-py3-none-any.whl (20 kB)
Installing collected packages: pynytimes
Successfully installed pynytimes-0.10.0
Note: you may need to restart the kernel to use updated packages.


You can also install it via the command line - whichever you're more comfortable with.

Once the package is installed, let's go ahead and import the library and initialize a connection to their servers using our API key.

In [3]:
# Import the NYTAPI object which we'll use to access the API
from pynytimes import NYTAPI

In [4]:
# Initialize the NYT API class into an object using your API key
nyt = NYTAPI(api_key, parse_dates=True)

Ta-da! We are now ready to make some API calls!

## Making API Calls

Now that we've established a connection to the New York Times' rich database, let's go over what kind of data and privileges we have access to.

### APIs

[Here is the collection of the APIs the NYT gives us:](https://developer.nytimes.com/apis)

- [Top stories](https://developer.nytimes.com/docs/top-stories-product/1/overview): Returns an array of articles currently on the specified section.
- [Most viewed/shared articles](https://developer.nytimes.com/docs/most-popular-product/1/overview): Provides services for getting the most popular articles on NYTimes.com based on emails, shares, or views.
- [Article search](https://developer.nytimes.com/docs/articlesearch-product/1/overview): Look up articles by keyword. You can refine your search using filters and facets.
- [Books](https://developer.nytimes.com/docs/books-product/1/overview): Provides information about book reviews and The New York Times Best Sellers lists.
- [Movie reviews](https://developer.nytimes.com/docs/movie-reviews-api/1/overview): Search movie reviews by keyword and opening date and filter by Critics' Picks.
- [Times Wire](https://developer.nytimes.com/docs/timeswire-product/1/overview): Get links and metadata for Times' articles as soon as they are published on NYTimes.com. The Times Newswire API provides an up-to-the-minute stream of published articles.
- [Tag query (TimesTags)](https://developer.nytimes.com/docs/timestags-product/1/overview): Provide a string of characters and the service returns a ranked list of suggested terms.
- [Archive metadata](https://developer.nytimes.com/docs/archive-product/1/overview): Returns an array of NYT articles for a given month, going back to 1851.

<a id='top'></a>

# Top Stories API

Let's look at the top stories of the day. All we have to do is call a single method on the `nyt` object:

In [5]:
# Get all the top stories from the home page
top_stories = nyt.top_stories()

print(f"top_stories is a list of length {len(top_stories)}")

top_stories is a list of length 23


The `top_stories` method has a single parameter called `section`, which defaults to "home".

In [6]:
# Preview the results
top_stories[:2]

[{'section': 'world',
  'subsection': 'middleeast',
  'title': 'How Tough Is Iran? A String of Military Losses Raises Questions.',
  'abstract': 'Iran is often portrayed as one of the world’s most dangerous actors, but with its attacks on Iranian defenses, nuclear sites and proxy militias, Israel has exposed a compromised and weakened adversary.',
  'url': 'https://www.nytimes.com/2025/06/16/world/middleeast/iran-military-defense.html',
  'uri': 'nyt://article/45e5f07b-2c53-5f6b-af1f-b401b00dbcde',
  'byline': 'By Vivian Yee',
  'item_type': 'Article',
  'updated_date': datetime.datetime(2025, 6, 16, 15, 47, 55, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
  'created_date': datetime.datetime(2025, 6, 16, 11, 13, 28, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
  'published_date': datetime.datetime(2025, 6, 16, 11, 13, 28, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
  'material_type_facet': '',
  'kicker': ''

This is pretty typical output for data pulled from an API. We are looking at a list of nested JSON dictionaries.

When working with a new API, a good way to establish an understanding of the data is to inspect a single object in the collection. Let's grab the first story in the array and inspect its attributes and data:

In [7]:
top_story = top_stories[0]
top_story

{'section': 'world',
 'subsection': 'middleeast',
 'title': 'How Tough Is Iran? A String of Military Losses Raises Questions.',
 'abstract': 'Iran is often portrayed as one of the world’s most dangerous actors, but with its attacks on Iranian defenses, nuclear sites and proxy militias, Israel has exposed a compromised and weakened adversary.',
 'url': 'https://www.nytimes.com/2025/06/16/world/middleeast/iran-military-defense.html',
 'uri': 'nyt://article/45e5f07b-2c53-5f6b-af1f-b401b00dbcde',
 'byline': 'By Vivian Yee',
 'item_type': 'Article',
 'updated_date': datetime.datetime(2025, 6, 16, 15, 47, 55, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'created_date': datetime.datetime(2025, 6, 16, 11, 13, 28, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'published_date': datetime.datetime(2025, 6, 16, 11, 13, 28, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'material_type_facet': '',
 'kicker': '',
 'des_facet

We are provided a diverse collection of data for the article, ranging from the expected (title, author, section) to NLP-derived information such as named entities. Notice that the full article itself is not included - the API does not provide that to us.
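
When a record has many keys, a quick way to get an overview is to map each key to the type of its value. The sketch below uses a hypothetical helper, `describe_record`, on a trimmed-down stand-in for a story (a few fields copied from the output shown in this notebook), so it runs without an API call.

```python
def describe_record(record):
    """Map each key in an API record to the type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


# Trimmed-down stand-in for one element of `top_stories`
sample_story = {
    "section": "world",
    "title": "How Tough Is Iran? A String of Military Losses Raises Questions.",
    "org_facet": ["Hezbollah"],
    "geo_facet": ["Iran", "Israel"],
}

for key, type_name in describe_record(sample_story).items():
    print(f"{key}: {type_name}")
```

With a live result, `describe_record(top_story)` would show at a glance which fields are plain strings and which are lists or dates that need extra handling.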

💡 **Tip**: If we are interested in a specific section, we can pass in one of the following tags into the `section` parameter:


```arts```, ```automobiles```, ```books```, ```business```, ```fashion```, ```food```, ```health```, ```home```, ```insider```, ```magazine```, ```movies```, ```national```, ```nyregion```, ```obituaries```, ```opinion```, ```politics```, ```realestate```, ```science```, ```sports```, ```sundayreview```, ```technology```, ```theater```, ```tmagazine```, ```travel```, ```upshot```, and ```world```.


NYT and other API providers can and do change their API tags and other aspects of their API usage. It is always a good idea to check out their documentation and see what they suggest.

This [link](https://developer.nytimes.com/docs/timeswire-product/1/routes/content/section-list.json/get) shows you how we can get all the section names. You can run it in the browser or with the code below.

![](../img/nyt_section_list.png)


In [8]:
import requests

url = "https://api.nytimes.com/svc/news/v3/content/section-list.json"
params = {"api-key": api_key}
response = requests.get(url, params=params)
sections_data = response.json()
sections_data

{'status': 'OK',
 'copyright': 'Copyright (c) 2025 The New York Times Company. All Rights Reserved.',
 'num_results': 50,
 'results': [{'section': 'admin', 'display_name': 'Admin'},
  {'section': 'arts', 'display_name': 'Arts'},
  {'section': 'automobiles', 'display_name': 'Automobiles'},
  {'section': 'books', 'display_name': 'Books'},
  {'section': 'briefing', 'display_name': 'Briefing'},
  {'section': 'business', 'display_name': 'Business'},
  {'section': 'climate', 'display_name': 'Climate'},
  {'section': 'corrections', 'display_name': 'Corrections'},
  {'section': 'education', 'display_name': 'Education'},
  {'section': 'en español', 'display_name': 'En español'},
  {'section': 'fashion', 'display_name': 'Fashion'},
  {'section': 'food', 'display_name': 'Food'},
  {'section': 'gameplay', 'display_name': 'Gameplay'},
  {'section': 'guide', 'display_name': 'Guide'},
  {'section': 'health', 'display_name': 'Health'},
  {'section': 'home & garden', 'display_name': 'Home & Garden'},
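
The response above nests the section records under the `results` key. A small sketch for pulling out just the section tags follows. The helper name `section_names` is made up for illustration, and the sample payload is a shortened copy of the output above rather than a live response.

```python
def section_names(payload):
    """Return the `section` value of every record under the payload's `results` key."""
    return [item["section"] for item in payload.get("results", [])]


# Shortened copy of the response shown above
sample_payload = {
    "status": "OK",
    "results": [
        {"section": "arts", "display_name": "Arts"},
        {"section": "books", "display_name": "Books"},
        {"section": "home & garden", "display_name": "Home & Garden"},
    ],
}

print(section_names(sample_payload))
```

With the live data, `section_names(sections_data)` would give the full list of tags returned by the endpoint.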
 

In [9]:
top_arts_stories = nyt.top_stories(section='arts')
print(top_arts_stories[0]['section'])
top_arts_stories[0]

arts


{'section': 'arts',
 'subsection': 'design',
 'title': 'Crowning New York’s Top ‘Pigeon’',
 'abstract': 'Thousands of people gathered on the High Line on Saturday for Pigeon Fest, inspired by an artist’s sculpture and an appreciation for the city’s most resilient birds.',
 'url': 'https://www.nytimes.com/2025/06/16/arts/design/pigeon-fest-high-line.html',
 'uri': 'nyt://article/eb52bf89-c5b8-5d5a-a44a-fe5f329f25b7',
 'byline': 'By Melena Ryzik',
 'item_type': 'Article',
 'updated_date': datetime.datetime(2025, 6, 16, 15, 20, 15, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'created_date': datetime.datetime(2025, 6, 16, 5, 2, 17, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'published_date': datetime.datetime(2025, 6, 16, 5, 2, 17, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'material_type_facet': '',
 'kicker': '',
 'des_facet': ['Festivals',
  'Pigeons',
  'Art',
  'Parks and Other Recreation Areas',
  '

## 🥊 Challenge: Find the top stories for a section

- Choose a section. Grab the top stories and store them in a list.
- How many stories are in the section?
- What is the title of the first story?

In [27]:
top_world_stories = nyt.top_stories(section="world")
print(f"There are {len(top_world_stories)} world stories.")

There are 36 world stories.


In [28]:
# Education
section = "education"
top_education_stories = nyt.top_stories(section=section)
print(f"There are {len(top_education_stories)} {section} stories.")

There are 38 education stories.


In [29]:
# Grab first story
top_education_story = top_education_stories[0]
top_education_story

{'section': 'us',
 'subsection': '',
 'title': 'Hispanic-Serving College Program Is Discriminatory, Lawsuit Argues',
 'abstract': 'A group behind the Supreme Court case that ended affirmative action is now targeting a federal support for schools that enroll large numbers of Hispanic students.',
 'url': 'https://www.nytimes.com/2025/06/11/us/hispanic-serving-institutions-lawsuit.html',
 'uri': 'nyt://article/213c3b9e-5107-526f-9b6f-b2b7daac26e9',
 'byline': 'By Anemona Hartocollis',
 'item_type': 'Article',
 'updated_date': datetime.datetime(2025, 6, 12, 11, 41, 54, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'created_date': datetime.datetime(2025, 6, 11, 16, 30, 35, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'published_date': datetime.datetime(2025, 6, 11, 16, 30, 35, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=72000))),
 'material_type_facet': '',
 'kicker': '',
 'des_facet': ['Hispanic-Americans',
  'Federal Ai

In [30]:
# Get the title of the top education story
top_education_story_title = top_education_story["title"]
top_education_story_title

'Hispanic-Serving College Program Is Discriminatory, Lawsuit Argues'

## Organizing the API Results into a `pandas` DataFrame

In order to conduct subsequent data analysis, we need to convert the list of JSON data to a `pandas` DataFrame. `pandas` allows us to simply pass in the JSON list and produce a clean table in one line of code. 

First, let's see what happens when we pass in `top_stories` to `pd.json_normalize`:

In [33]:
# Convert to DataFrame
df = pd.json_normalize(top_stories)
# View the first 5 rows
df.head()

Unnamed: 0,section,subsection,title,abstract,url,uri,byline,item_type,updated_date,created_date,published_date,material_type_facet,kicker,des_facet,org_facet,per_facet,geo_facet,multimedia,short_url
0,world,middleeast,How Tough Is Iran? A String of Military Losses...,Iran is often portrayed as one of the world’s ...,https://www.nytimes.com/2025/06/16/world/middl...,nyt://article/45e5f07b-2c53-5f6b-af1f-b401b00d...,By Vivian Yee,Article,2025-06-16 15:47:55-04:00,2025-06-16 11:13:28-04:00,2025-06-16 11:13:28-04:00,,,"[War and Armed Conflicts, Iran-Israel Proxy Co...",[Hezbollah],[],"[Iran, Israel]",[{'url': 'https://static01.nyt.com/images/2025...,
1,world,middleeast,"Israel attacks Iran’s state television, live o...",,https://www.nytimes.com/live/2025/06/16/world/...,nyt://article/dbd8827b-e460-59f2-8406-0a72da7c...,"By Aaron Boxerman, Farnaz Fassihi, Aric Toler ...",Article,2025-06-16 15:44:03-04:00,2025-06-16 13:00:15-04:00,2025-06-16 13:00:15-04:00,,,[],[],[],[],[{'url': 'https://static01.nyt.com/images/2025...,
2,us,,How the Minnesota Shootings Suspect Was Caught,A two-day manhunt ended Sunday night as police...,https://www.nytimes.com/2025/06/16/us/minnesot...,nyt://article/19c09471-4e34-5eb2-ae55-e2919a49...,"By Nicholas Bogel-Burroughs, Mitch Smith, Jeff...",Article,2025-06-16 16:00:46-04:00,2025-06-16 08:44:58-04:00,2025-06-16 08:44:58-04:00,,,"[Shootings of Minnesota Legislators (2025), Po...",[],"[Boelter, Vance L]",[],[{'url': 'https://static01.nyt.com/images/2025...,
3,us,,How the Shootings of the Minnesota Lawmakers U...,A manhunt ended after a man suspected in the k...,https://www.nytimes.com/2025/06/15/us/minnesot...,nyt://article/5514380b-7303-5529-a90f-26d85d8a...,By Leo Dominguez and Ashley Cai,Article,2025-06-16 15:42:45-04:00,2025-06-15 22:10:53-04:00,2025-06-15 22:10:53-04:00,,,[Shootings of Minnesota Legislators (2025)],[],"[Boelter, Vance L, Hortman, Melissa (1970-2025...","[Brooklyn Park (Minn), Champlin (Minn)]",[{'url': 'https://static01.nyt.com/images/2025...,
4,us,politics,Mike Lee Draws Outrage for Posts Blaming Assas...,The Republican senator from Utah suggested in ...,https://www.nytimes.com/2025/06/16/us/politics...,nyt://article/2669488d-d9cb-55f4-ab80-4cd4fe29...,By Annie Karni,Article,2025-06-16 16:46:57-04:00,2025-06-16 15:51:21-04:00,2025-06-16 15:51:21-04:00,,,"[United States Politics and Government, Assass...",[],"[Hortman, Melissa (1970-2025), Hoffman, John A...","[Minnesota, Utah]",[{'url': 'https://static01.nyt.com/images/2025...,


In [34]:
# Inspect the metadata
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype                    
---  ------               --------------  -----                    
 0   section              23 non-null     object                   
 1   subsection           23 non-null     object                   
 2   title                23 non-null     object                   
 3   abstract             23 non-null     object                   
 4   url                  23 non-null     object                   
 5   uri                  23 non-null     object                   
 6   byline               23 non-null     object                   
 7   item_type            23 non-null     object                   
 8   updated_date         23 non-null     datetime64[ns, UTC-04:00]
 9   created_date         23 non-null     datetime64[ns, UTC-04:00]
 10  published_date       23 non-null     datetime64[ns, UTC-04:00]
 11  material

For the most part, `pandas` does a good job of producing a table where:

- The columns correspond with the JSON dictionary keys from our API call.
- The number of rows matches the number of articles.
- Each cell holds the corresponding value found under that article's dictionary key.

<a id='most'></a>

# Most Viewed and Most Shared APIs

Retrieving the most viewed and shared articles is also quite simple. The `days` parameter returns the most popular articles based on the last $N$ days. Keep in mind, however, that `days` can only take on one of three values: 1, 7, or 30.

In [35]:
# Retrieve the most viewed articles for today.
# The days parameter defaults to 1
most_viewed_today = nyt.most_viewed()
print(f"Title: {most_viewed_today[0]['title']}")
print(f"Section: {most_viewed_today[0]['section']}")
most_viewed_today[0]

Title: Takeaways From Trump’s Military Parade in Washington
Section: U.S.


{'uri': 'nyt://article/da7c3b3d-a55c-54d4-bec5-f12c066d0aee',
 'url': 'https://www.nytimes.com/2025/06/14/us/politics/trump-military-parade-takeaways.html',
 'id': 100000010212934,
 'asset_id': 100000010212934,
 'source': 'New York Times',
 'published_date': datetime.date(2025, 6, 14),
 'updated': datetime.datetime(2025, 6, 15, 16, 11, 3),
 'section': 'U.S.',
 'subsection': 'Politics',
 'nytdsection': 'u.s.',
 'adx_keywords': 'United States Politics and Government;United States Defense and Military Forces;Parades;Trump, Donald J;United States Army;Washington (DC)',
 'column': None,
 'byline': 'By Zach Montague',
 'type': 'Article',
 'title': 'Takeaways From Trump’s Military Parade in Washington',
 'abstract': 'The events in the capital were overshadowed by an assassination in Minnesota and turmoil in the Middle East.',
 'des_facet': ['United States Politics and Government',
  'United States Defense and Military Forces',
  'Parades'],
 'org_facet': ['United States Army'],
 'per_facet': 

🔔 **Question**:  How many stories are provided to us via this function call?

In [36]:
len(most_viewed_today)

20

For this piece of data, we can consult a guide, known as a schema, to understand the information at our fingertips.

The [Most Viewed Schema](https://developer.nytimes.com/docs/most-popular-product/1/types/ViewedArticle) can answer any questions we may have about this article's data:

| Attribute      | Data Type | Definition      |
| ----------- | ----------- | ----------- |
| url      | string       | Article's URL.       |
| adx_keywords   | string        | Semicolon separated list of keywords.        |
| column   | string        | Deprecated. Set to null.        |
| section   | string        | Article's section (e.g. Sports).        |
| byline   | string        | Article's byline (e.g. By Thomas L. Friedman).        |
| type   | string        | Asset type (e.g. Article, Interactive, ...).        |
| title   | string        | Article's headline (e.g. When the Cellos Play, the Cows Come Home).        |
| abstract   | string        | Brief summary of the article.|
| published_date   | string        | When the article was published on the web (e.g. 2021-04-19).        |
| source   | string        | Publisher (e.g. New York Times).        |
| id   | integer        | Asset ID number (e.g. 100000007772696).        |
| asset_id   | integer        | Asset ID number (e.g. 100000007772696).        |
| des_facet   | array        | Array of description facets (e.g. Quarantine (Life and Culture)).        |
| org_facet   | array        | Array of organization facets (e.g. Sullivan Street Bakery).        |
| per_facet   | array        | Array of person facets (e.g. Bittman, Mark).        |
| geo_facet   | array        | Array of geographic facets (e.g. Canada).        |
| media   | array        | Array of images.        |
| media.type   | string        | Asset type (e.g. image).        |
| media.subtype   | string        | Asset subtype (e.g. photo).        |
| media.caption   | string        | Media caption        |
| media.copyright   | string        | Media credit        |
| media.approved_for_syndication   | boolean        | Whether media is approved for syndication.        |
| media.media-metadata   | array        | Media metadata (url, width, height, ...).        |
| media.media-metadata.url   | string        | Image's URL.        |
| media.media-metadata.format   | string        | Image's crop name     |
| media.media-metadata.height   | integer        | Image's height |
| media.media-metadata.width   | integer        | Image's width      |
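
According to the schema, `adx_keywords` is a single semicolon-separated string rather than a list. A minimal sketch for turning it into a Python list is shown below. The helper name `split_keywords` is hypothetical, and the sample string is copied from the article output above.

```python
def split_keywords(adx_keywords):
    """Split a semicolon-separated keyword string into a list, dropping empty entries."""
    return [kw.strip() for kw in adx_keywords.split(";") if kw.strip()]


sample_keywords = (
    "United States Politics and Government;"
    "United States Defense and Military Forces;"
    "Parades;Trump, Donald J;United States Army;Washington (DC)"
)

for keyword in split_keywords(sample_keywords):
    print(keyword)
```

Splitting on `;` rather than `,` matters here, because person names such as `Trump, Donald J` contain commas themselves.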

To pull the most popular articles for the past week or month, we pass the number 7 or 30 into `days`.

In [37]:
most_viewed_week = nyt.most_viewed(days=7)

🔔 **Question**: What is the most viewed article of the last week?

In [38]:
most_viewed_week[0]['title']

'How the Man in Seat 11A Became a Plane Crash’s Sole Survivor'

<a id='search'></a>

# Article Search API

Let's take it up a notch and use the search API to retrieve a set of articles about a particular topic in a chosen period of time.

We'll use the `article_search` function. Two relevant parameters include:

- `query`: The search query
- `results`: Number of articles returned. The default is 10.

Let's try pulling the most recent articles about Berkeley:

In [39]:
articles = nyt.article_search(query="Berkeley")

Let's look at the main headlines of these articles:

In [40]:
headlines = [article['headline']['main'] for article in articles]
headlines

['Liberal Berkeley’s Toughened Stance on Homeless Camps Is a Bellwether',
 'As Kamala Harris Claims Oakland, Berkeley Forgives',
 'Four Generations of Quilts Come Out of the Family ‘Treasure Chest’',
 'Why Amy Tan Decided Not to Shred Her Archive',
 'Energy Dept. Unveils Supercomputer That Merges With A.I.',
 'Why on Earth Should Air Traffic Controllers Be Pro-Trump?',
 'In Berkeley Public Schools, a War Gives Rise to Unusual Tensions',
 'A Toxic Pit Could Be a Gold Mine for Rare-Earth Elements',
 '$800,000 Homes in California',
 'My Journey Deep in the Heart of Trump Country']

We can also take a peek at the first article provided. We're going to remove the `multimedia` key in order to make it easier to view:

In [41]:
del articles[0]['multimedia']
articles[0]

{'abstract': 'The progressive stronghold in California plans to target large encampments, relying on a Supreme Court decision handed down by a conservative majority.',
 'byline': {'original': 'By Shawn Hubler'},
 'document_type': 'article',
 'headline': {'main': 'Liberal Berkeley’s Toughened Stance on Homeless Camps Is a Bellwether',
  'kicker': '',
  'print_headline': 'Berkeley Stiffens Homeless Rules as Camps Test Empathy’s Limits'},
 '_id': 'nyt://article/9429aa02-c3c4-5140-ae49-9279cb9565d3',
 'keywords': [{'name': 'Location', 'value': 'Berkeley (Calif)', 'rank': 1},
  {'name': 'Subject', 'value': 'Homeless Persons', 'rank': 2},
  {'name': 'Subject', 'value': 'Local Government', 'rank': 3},
  {'name': 'Subject', 'value': 'Law and Legislation', 'rank': 4},
  {'name': 'Subject', 'value': 'Liberalism (US Politics)', 'rank': 5},
  {'name': 'Person', 'value': 'Newsom, Gavin', 'rank': 6},
  {'name': 'Location', 'value': 'California', 'rank': 7},
  {'name': 'Organization', 'value': 'Supre

Notice that not all article data comes in the same format. Data from the search API is presented differently from that of the Most Viewed and Top Stories APIs.

There are schemas for the above data. 

- [Article Schema](https://developer.nytimes.com/docs/articlesearch-product/1/types/Article)
- [Byline](https://developer.nytimes.com/docs/articlesearch-product/1/types/Byline)
- [Headline](https://developer.nytimes.com/docs/articlesearch-product/1/types/Headline)
- [Keyword](https://developer.nytimes.com/docs/articlesearch-product/1/types/Keyword)
- [Multimedia](https://developer.nytimes.com/docs/articlesearch-product/1/types/Multimedia)
- [Person](https://developer.nytimes.com/docs/articlesearch-product/1/types/Person)
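
Because the search results nest the headline, byline, and keywords inside dictionaries and lists, it can help to pull the fields of interest into a flat record. The sketch below assumes the structure shown in the output above. `summarize_article` is a hypothetical helper, and the sample is an abbreviated copy of the first Berkeley result.

```python
def summarize_article(article):
    """Flatten the nested headline, byline, and keywords of one search result."""
    headline = article.get("headline") or {}
    byline = article.get("byline") or {}
    keywords = article.get("keywords") or []
    return {
        "headline": headline.get("main"),
        "byline": byline.get("original"),
        "keywords": [kw["value"] for kw in keywords],
    }


# Abbreviated copy of the first search result shown above
sample_article = {
    "byline": {"original": "By Shawn Hubler"},
    "document_type": "article",
    "headline": {
        "main": "Liberal Berkeley’s Toughened Stance on Homeless Camps Is a Bellwether",
        "kicker": "",
    },
    "keywords": [
        {"name": "Location", "value": "Berkeley (Calif)", "rank": 1},
        {"name": "Subject", "value": "Homeless Persons", "rank": 2},
    ],
}

print(summarize_article(sample_article))
```

Applying the same function to every element, for example `[summarize_article(a) for a in articles]`, gives a list of flat dictionaries that is easier to scan or tabulate.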

Let's search for some articles again, but within a specific time period. 

For example, how would we retrieve all the articles about the first two months of the George Floyd protests?

We need to pass a dictionary to the `dates` argument which contains keys named "begin" and "end". Those two keys point to `datetime` objects that we'll use as time markers. The `options` argument can also filter and sort results; the call below leaves it at its default.

In [42]:
# Set up start and end date objects
begin = datetime(2020, 5, 23) # May 23, 2020
end = datetime(2020, 7, 23) # July 23, 2020

# Create a dictionary containing the datetime objects
date_dict = {"begin": begin, "end": end}

articles = nyt.article_search(
    query="George Floyd protests",
    results=100,
    dates=date_dict,
    )



In [43]:
# Grab first article and drop the multimedia key to reduce clutter
article = articles[0]
del article["multimedia"]

# Check out results
article

{'abstract': 'From Minneapolis to Buffalo, Homeland Security officials dispatched drones, helicopters and airplanes to monitor Black Lives Matter protests.',
 'byline': {'original': 'By Zolan Kanno-Youngs'},
 'document_type': 'article',
 'headline': {'main': 'U.S. Watched George Floyd Protests in 15 Cities Using Aerial Surveillance',
  'kicker': '',
  'print_headline': 'Surveillance Aircraft Hovered  As Marchers Filled the Streets'},
 '_id': 'nyt://article/079d4446-d165-53bb-941d-6747c32bfc9b',
 'keywords': [{'name': 'Subject',
   'value': 'Drones (Pilotless Planes)',
   'rank': 1},
  {'name': 'Subject', 'value': 'Military Aircraft', 'rank': 2},
  {'name': 'Subject', 'value': 'George Floyd Protests (2020)', 'rank': 3},
  {'name': 'Subject', 'value': 'Black Lives Matter Movement', 'rank': 4},
  {'name': 'Subject', 'value': 'Privacy', 'rank': 5},
  {'name': 'Subject',
   'value': 'Surveillance of Citizens by Government',
   'rank': 6},
  {'name': 'Organization',
   'value': 'Customs and 

In [44]:
len(articles)

10

### We wanted 100 articles but we only got 10?

This is because we are using the `article_search` function from the Python package `pynytimes`. This package is very useful because it wraps the NYT API, which makes working with the API smoother.

But the API itself has been updated, and that update has not yet been reflected in the `pynytimes` code. When you are using Python packages, the first thing to check when something does not work as expected is the [Issues tab on the repository](https://github.com/michadenheijer/pynytimes/issues). Often, someone else will have noticed the problem before you.

![](../img/pynytimes_issue.png)


Do not despair! We can fix this :D

First let's check out [the link](https://developer.nytimes.com/docs/timeswire-product/1/routes/content/%7Bsource%7D/%7Bsection%7D.json/get) shared in the GitHub issue.

![](../img/nyt_solution1.png)

Now we know that the key difference is **limits**.

Let's see what is not working:

In [45]:
help(nyt.article_search)

Help on method article_search in module pynytimes.api:

article_search(query: 'Optional[str]' = None, dates: "Optional[dict[Literal['begin', 'end'], DateType]]" = None, options: 'Optional[ArticleSearchOptions]' = None, results: 'int' = 10) -> 'list[dict[str, Any]]' method of pynytimes.api.NYTAPI instance
    Search New York Times articles

    Args:
        query (Optional[str], optional): Search query. Defaults to None.
        dates (Optional[dict[Literal["begin", "end"], DateType]],
        optional): Dictionary with "begin" and "end" of search range.
        Defaults to None.
        options (Optional[ArticleSearchOptions], optional): Options for the
        search results.
        Defaults to None.
        results (int, optional): Load at most this many articles. Defaults to 10.

    Returns:
        list[dict[str, Any]]: Article metadata



Let's investigate further. How is the *results* argument handled?
> results (int, optional): Load at most this many articles. Defaults to 10.

This is the code from the [pynytimes repository](https://github.com/michadenheijer/pynytimes/blob/bd3d47f74f347f1beaf5b9fe517d3e2cd4630423/pynytimes/api.py#L548C1-L549C1):

```python
def tag_query(
        self,
        query: str,
        filter_option: Optional[dict[str, Any]] = None,
        filter_options: Optional[str] = None,
        max_results: Optional[int] = None,
    ) -> list[str]:
        """Load Times Tags

        Args:
            query (str): Search query to find a tag
            filter_option (Optional[dict[str, Any]], optional): Filter the tags.
            Defaults to None.
            filter_options (Optional[str], optional): Filter options. Defaults
            to None.
            max_results (Optional[int], optional): Maximum number of results.
            None means no limit. Defaults to None.

        Returns:
            list[str]: List of tags
        """
        # Raise error for TypeError
        tag_query_check_types(query, max_results)

        _filter_options = (
            tag_query_get_filter_options(filter_options) or filter_option
        )

        # Add options to request params
        options = {"query": query, "filter": _filter_options}

        # Define amount of results wanted
        if max_results is not None:
            options["max"] = str(max_results)

        # Set URL, load and return data
        # FIXME what is this, why is this?
        return self.__load_data(url=BASE_TAGS, options=options, location=[])[
            1
        ]  # type:ignore
```

And this is where the number of results is being determined:

```python
# Define amount of results wanted
if max_results is not None:
    options["max"] = str(max_results)
```

This code appears to pass an option called `max` to the NYT API to communicate the maximum number of results.

Unfortunately, that is no longer a valid query parameter. Instead, we have the **limit** parameter.

![](../img/nyt_solution2.png)


### Can we fix this code?

Not really. Unless we download a copy of the repo, edit it, and use our edited version, we can't fix the package ourselves. It also might not be an easy fix: as you can see above, the search maximum is now 500 articles, while elsewhere in the repo it is about 2,000. There are many other details about this code that we do not know.

But we can write our own code that does what we want it to do.

## 🥊 Challenge: Article Searching Updated

- Let's create the correct function for article search.
- Retrieve a set of articles for a query of your choice.
- Use a relevant time interval in constructing your `dates` dictionary

Let's take a look at the relevant [NYT Article Search API](https://developer.nytimes.com/docs/articlesearch-product/1/overview)

In [46]:
import requests
import time
# Import the class, not the module, so `datetime(...)` keeps working in later cells
from datetime import datetime

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
#QUERY = "\"George Floyd protests\"" # "..." is used for exact phrase matching
QUERY = "George Floyd protests"  # No quotes for general keyword search

begin = datetime(2020, 5, 23)
end = datetime(2020, 7, 23)

begin_str = begin.strftime("%Y%m%d")
end_str = end.strftime("%Y%m%d")

articles = []
page = 0
max_pages = 10  # 10 pages * 10 articles per page = 100 articles max

while page < max_pages:
    params = {
        "q": QUERY,
        "begin_date": begin_str,
        "end_date": end_str,
        "api-key": api_key, # we got this earlier and we can still use it, even without the pynytimes package
        "page": page,
        "sort": "newest"
    }

    response = requests.get(BASE_URL, params=params)
    if response.status_code != 200:
        print(f"Request failed on page {page}: {response.text}")
        break

    data = response.json()
    docs = data.get("response", {}).get("docs", [])
    if not docs:
        break

    articles.extend(docs)
    print(f"Fetched page {page + 1} with {len(docs)} articles.")
    page += 1
    time.sleep(5)  # Pause between requests to reduce the risk of throttling

print(f"\nTotal articles fetched: {len(articles)}")

# Example: Print headline and URL
for i, article in enumerate(articles[:5]):  # just previewing 5
    print(f"{i+1}. {article['headline']['main']}")
    print(f"   {article['web_url']}\n")




Fetched page 1 with 10 articles.
Fetched page 2 with 10 articles.
Fetched page 3 with 10 articles.
Fetched page 4 with 10 articles.
Fetched page 5 with 10 articles.
Request failed on page 5: {"fault":{"faultstring":"Rate limit quota violation. Quota limit  exceeded. Identifier : 626daee0-bbf2-4251-a867-f19748e32b39","detail":{"errorcode":"policies.ratelimit.QuotaViolation"}}}

Total articles fetched: 50
1. House Votes to Remove Confederate Statues From U.S. Capitol
   https://www.nytimes.com/2020/07/22/us/politics/confederate-statues-us-capitol.html

2. Painting Bleak Portrait of Urban Crime, Trump Sends More Agents to Chicago and Other Cities
   https://www.nytimes.com/2020/07/22/us/politics/trump-federal-agents-cities.html

3. To Battle a Militarized Foe, Portland Protesters Use Umbrellas, Pool Noodles and Fire
   https://www.nytimes.com/2020/07/22/us/portland-protest-tactics.html

4. ‘Occupy City Hall’ Encampment Taken Down in Pre-Dawn Raid by N.Y.P.D.
   https://www.nytimes.com/202
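
The run above stopped at the sixth request with a `Rate limit quota violation` error. One way to cope is to wait and retry when a request is rejected. The following is a minimal sketch, not the approach used in this notebook: `fetch_with_retry`, `fake_request`, and the 60-second wait are hypothetical, and the sketch assumes the server signals rate limiting with HTTP status 429.

```python
import time

def fetch_with_retry(request_fn, page, max_retries=3, wait_seconds=60, sleep=time.sleep):
    """Call request_fn(page) and retry after a pause when rate limited.

    request_fn is a hypothetical callable returning (status_code, payload).
    """
    for attempt in range(max_retries + 1):
        status, payload = request_fn(page)
        if status == 200:
            return payload
        if status != 429 or attempt == max_retries:
            # Not a rate-limit error, or no retries left
            raise RuntimeError(f"Request failed on page {page} with status {status}")
        sleep(wait_seconds)  # wait for the quota window to reset


# Stand-in for a real API call: rejects the first attempt, then succeeds
calls = []

def fake_request(page):
    calls.append(page)
    if len(calls) == 1:
        return 429, None
    return 200, {"response": {"docs": [f"article on page {page}"]}}

waits = []  # record the requested pauses instead of actually sleeping
result = fetch_with_retry(fake_request, page=0, sleep=waits.append)
print(result["response"]["docs"], waits)
```

With `requests`, `request_fn` could wrap `requests.get(BASE_URL, params=params)` and return `(response.status_code, response.json())`.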

<a id='analysis'></a>

# Data Analysis

Now, we'll analyze a large set of articles about the 2020 presidential election.

We are working with a previously queried set of articles because making the API calls would take too much time. The code used to query the articles we'll analyze can be found in the following cell:

## Query Using the Article Search API

In [None]:
# Change this variable if you'd like to run the query yourself
run_query = False

# Only run this code if you're able to wait for the query to finish
if run_query:
    # Create datetime objects
    begin = datetime(2020, 9, 7) # September 7, 2020
    end = datetime(2020, 11, 7) # November 7, 2020
    date_dict = {"begin": begin, "end": end}

    options_dict = {
        "sort": "oldest",
        "sources": ["New York Times",],
        "type_of_material": ["News Analysis", "News", "Article", "Editorial"]
    }

    # To get the dataset we use, set n_results to 2000
    n_results = 2000
    # n_results = 10

    # Perform article search query
    articles = nyt.article_search(
         query="presidential election",
         results=n_results,
         dates=date_dict,
         options=options_dict)

    # Create DataFrame 
    df = pd.json_normalize(articles)
    
    # Ensure 'lead_paragraph' column has no NaN 
    df['lead_paragraph'] = df['lead_paragraph'].fillna('')
    
    # Save DataFrame
    df.to_csv("../data/election2020_articles.csv")

Let's load in the previously saved data:

In [None]:
df = pd.read_csv("../data/election2020_articles.csv")
df.head()

In [None]:
# Inspect metadata
df.info()

## Perform Sentiment Analysis

Sentiment analysis is a common task when working with text data. Let's track the sentiment of articles about the election over the two-month period. We'll use the `vadersentiment` package to evaluate the sentiment of each article.

According to the [VADER Github Repo](https://github.com/cjhutto/vaderSentiment), "VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is *specifically attuned to sentiments expressed in social media*."

We'll start by installing the `vadersentiment` library.

In [None]:
# Install the vadersentiment library
%pip install vadersentiment

In [None]:
# Import the SentimentIntensityAnalyzer object
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
# Initialize analyzer object
analyzer = SentimentIntensityAnalyzer()
# Calculate the polarity scores of the lead paragraph 
df["sentiment"] = df["lead_paragraph"].apply(lambda x: analyzer.polarity_scores(x) if isinstance(x, str) else np.nan)

In [None]:
# Inspect the sentiment column
df.sentiment.head()

In [None]:
# View single row
df.sentiment.iloc[0]

The `compound` score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most negative) and +1 (most positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. We can think of this score as a normalized, weighted composite score. It is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. 
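
To make the normalization step concrete, here is a small sketch. It assumes the formula used in the VADER source code, x / sqrt(x² + α) with α = 15; the raw sums below are made-up inputs, not values from our articles.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded sum of valence scores into the interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum**2 + alpha)

# Made-up raw sums: strongly negative, nearly neutral, strongly positive
for raw in [-4.0, 0.1, 4.0]:
    print(raw, round(normalize(raw), 3))
```

Larger raw sums move the result toward -1 or +1 without ever reaching them, which is what makes scores comparable across texts of different lengths.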

Typical threshold values are:

1. **Positive Sentiment**: compound score $\geq 0.05$
 
2. **Neutral  Sentiment**: $-0.05 <$ compound score $< 0.05$
 
3. **Negative Sentiment**: compound score $\leq -0.05$

In [None]:
# Re-assign sentiment as the compound score
df["sentiment"] = df["sentiment"].apply(lambda x: x["compound"] if isinstance(x, dict) else np.nan)

Let's get a sense of the distribution of scores by calculating some summary statistics and plotting a histogram:

In [None]:
# Summary statistics
df.sentiment.describe()

In [None]:
bins = np.linspace(-1, 1, 17)
df.sentiment.hist(bins=bins, figsize= (9, 7))
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.xlim([-1.0, 1.0])

## 🥊 Challenge: Most Positive, Most Negative

What are the top 3 most positive and negative texts? Tip: try using the `sort_values()` method on the "sentiment" column in your df!

In [None]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np 

df = pd.read_csv("../data/election2020_articles.csv")
# Initialize analyzer object
analyzer = SentimentIntensityAnalyzer()
# Calculate the polarity scores of the lead paragraph and save it in df
df["sentiment"] = df["lead_paragraph"].apply(lambda x: analyzer.polarity_scores(x) if isinstance(x, str) else np.nan)
df["sentiment"] = df["sentiment"].apply(lambda x: x["compound"] if isinstance(x, dict) else np.nan)

In [None]:
# Most positive texts
df.sort_values("sentiment", ascending = False)["headline.main"].iloc[:3].tolist()

In [None]:
# Most negative texts
df.sort_values("sentiment", ascending = True)["headline.main"].iloc[:3].tolist()

Finally, using the VADER thresholds for positive, neutral, and negative, we can see how many articles qualify for each of those labels:

In [None]:
# Counts of positive, negative, and neutral texts
def bin_func(x):
    # Use >= and <= so the bins match the thresholds listed above
    if x >= 0.05:
        return "positive"
    elif x <= -0.05:
        return "negative"
    else:
        # Note: missing scores (NaN) also end up here
        return "neutral"
# Calculate counts
df.sentiment.apply(bin_func).value_counts()

## Sentiment Over the Course of the Campaign

Let's examine how the compound score evolved over the course of the campaign. Do you have expectations about how this quantity might behave as the election nears?

First, let's create a new `pandas` Series that tracks the sentiment over time:

In [None]:
# Convert pub_date from strings to datetime values
df["pub_date"] = pd.to_datetime(df["pub_date"])

In [None]:
# Create a time series with publication date as the index and sentiment score as the value
sentiment_ts = pd.Series(index= df.pub_date.tolist(),
                         data = df.sentiment.tolist())

Next, we'll calculate daily and weekly averages:

In [None]:
# Resample the data with daily averages and weekly averages
# (uppercase "D" and "W" are the canonical pandas frequency aliases)
daily = sentiment_ts.resample("D").mean()
weekly = sentiment_ts.resample("W").mean()
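
If resampling is new to you, the following toy example may help. It is only a sketch with three made-up scores; `toy_ts`, `toy_daily`, and `toy_weekly` are hypothetical names that do not touch the election data.

```python
import pandas as pd

# Made-up scores: two articles on 1 September 2020, one on 2 September 2020
toy_ts = pd.Series(
    data=[0.5, -0.1, 0.3],
    index=pd.to_datetime(
        ["2020-09-01 08:00", "2020-09-01 17:00", "2020-09-02 09:00"]
    ),
)

# One row per calendar day, holding the mean of that day's scores
toy_daily = toy_ts.resample("D").mean()
# One row per week, holding the mean of that week's scores
toy_weekly = toy_ts.resample("W").mean()

print(toy_daily)
print(toy_weekly)
```

Each timestamp is assigned to a day or week bin, and `.mean()` then averages the scores within each bin.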

🔔 **Question**: We can plot the results below. Do you notice any patterns?

In [None]:
# Daily average sentiment of articles.
daily.plot(figsize = (11, 7))
plt.xlabel("Dates")
plt.ylabel("Sentiment Score");

In [None]:
# Weekly average sentiment of articles.
weekly.plot(figsize = (11, 7))
plt.xlabel("Dates")
plt.ylabel("Sentiment Score");

# 🎬 Demo: Handling Nested Arrays of Keywords

The Times has done us a favor in providing the named entities in the articles, thus relieving us of having to do the tagging ourselves. However, the data structure that it comes in can be tricky to handle. Here, we provide a short tutorial showing one way to cleanly extract keyword data.

In [None]:
# Refer to a sample article's set of keywords
df.keywords.iloc[1]

We see a number of things here:
- Each article's keywords are laid out in a list of dictionaries.
- Each dictionary tells us the name, type, ranking, and major of the keyword.
- The five types of keywords are: `subject`, `persons`, `glocations`, `organizations`, and `creative_works`.
- The ordering of the list corresponds to the ranking.
- Not all articles have the same number of ranked keywords.

We've created a function to extract keyword data based on the ranking. This function will be applied over the pandas series of keyword data.

In [None]:
import ast

# Convert the string representation of the list into actual lists of dictionaries
df['keywords'] = df['keywords'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
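
Why is this conversion needed? When a DataFrame is saved to CSV, each list of keywords is written out as text, so it comes back as a string when we reload the file. The sketch below illustrates this with a made-up string in the same shape as one stored cell; `raw` and `parsed` are hypothetical names.

```python
import ast

# A keyword list as it would appear after a round trip through a CSV file
raw = "[{'name': 'Subject', 'value': 'Privacy', 'rank': 5}]"

# literal_eval safely parses Python literals (lists, dicts, strings, numbers)
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed[0]["value"])
```

`ast.literal_eval` is used rather than `json.loads` because the stored text uses Python-style single quotes, which are not valid JSON.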

In [None]:
df["keywords"].head()

In [None]:
def rank_extractor(data, rank):
    """Extracts keyword data based on the 'rank' field."""
    if isinstance(data, list):
        for keyword in data:
            if isinstance(keyword, dict) and keyword.get("rank") == rank:
                return {"name": keyword.get("name"), "value": keyword.get("value")}
    return None
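
To see what the function returns, here is a quick check on a hand-built list in the same shape as the keyword data shown earlier. The function is repeated so the example runs on its own; `sample_keywords` is a hypothetical name.

```python
def rank_extractor(data, rank):
    """Extracts keyword data based on the 'rank' field."""
    if isinstance(data, list):
        for keyword in data:
            if isinstance(keyword, dict) and keyword.get("rank") == rank:
                return {"name": keyword.get("name"), "value": keyword.get("value")}
    return None

sample_keywords = [
    {"name": "Subject", "value": "Drones (Pilotless Planes)", "rank": 1},
    {"name": "Subject", "value": "Military Aircraft", "rank": 2},
]

print(rank_extractor(sample_keywords, 2))   # a rank that exists
print(rank_extractor(sample_keywords, 5))   # a rank that does not exist
print(rank_extractor(float("nan"), 1))      # a missing value instead of a list
```

Returning `None` for missing ranks and non-list inputs means the function can be applied to every row without raising an error.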

In [None]:
# Extract the first, second, and third keywords
rank1 = df.keywords.apply(lambda x: rank_extractor(x, 1))
rank2 = df.keywords.apply(lambda x: rank_extractor(x, 2))
rank3 = df.keywords.apply(lambda x: rank_extractor(x, 3))

In [None]:
# View results
rank1.head()

Let's expand these dictionaries into DataFrame columns by converting each one to a `pandas` Series:

In [None]:
rank1 = rank1.apply(pd.Series)
rank2 = rank2.apply(pd.Series)
rank3 = rank3.apply(pd.Series)
rank1.head()

Voila! A nice clean format. Now we can conduct some light analysis:

In [None]:
# Most frequent type of keyword in ranking #1
rank1.name.value_counts()

In [None]:
# The most common keywords in ranking #1:
rank1.value.value_counts()

<div class="alert alert-success">

## ❗ Key Points

* APIs allow structured web interactions, often using URLs to query databases and retrieve data.
* API keys authenticate users, enabling access to APIs while monitoring and limiting the number of requests.
* The NYT API allows users to do things like retrieve top stories, find most shared stories, and search for stories.
* Text data acquired through APIs can be analyzed using natural language processing tools such as sentiment analysis.
  
</div>