# Lab1.5 New York Times news API

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

The New York Times provides a search API to access various news sources:

https://developer.nytimes.com

In this notebook, we show how their API can be used to retrieve news for a keyword search, how to set filters, and how to obtain the metadata and the text and save them.

Credits:
We thank Rochelle Terman, from whose blog we reused some code and explanation:
https://dlab.berkeley.edu/blog/scraping-new-york-times-articles-python-tutorial

## 1. Preparations for using the NYT news API

In order to access the news stream and the archives, you first need to install the package *nytimesarticle* locally on your computer from the command line:

https://pypi.org/project/nytimesarticle/0.1.0/

`pip install nytimesarticle==0.1.0`

*nytimesarticle* is a Python wrapper for the New York Times Article Search API. This allows you to query the API through Python.

With the Article Search API, you can search New York Times articles from Sept. 18, 1851 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata.

The API will not return full text of articles. But it will return a number of helpful metadata such as subject terms, abstract, and date, as well as URLs, which one could conceivably use to scrape the full text of articles.

To begin, you first need to obtain an API key from the New York Times, which is fast and easy to do: https://developer.nytimes.com/get-started

* create an account
* log in
* create app
* get the id and key

Once you have the key you can start using the API. 

In the code below, we already created a key for an application that we called *LanguageAsData*. The application was defined for the Archive API and the Article Search API. You can use the key below or define your own application and key. Note that if too many people use the key given below, the calls could be blocked. In that case, follow the instructions above to create your own account, create your app and get the credentials.

Below is the code to create an API client that we call *api*.

## 2. Using the Search API

In [1]:
from nytimesarticle import articleAPI

app='LanguageAsData'
appId='9a81707d-bc58-4f34-a69b-77681c2facc0'
appKey='AoT94xb4XOlfyNSV2MkIGpihZsPCktxb'
api = articleAPI(appKey)
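
If you create your own key, you may prefer not to paste it into the notebook. The following is a minimal sketch of reading it from an environment variable instead. The variable name `NYT_API_KEY` and the helper `load_api_key` are hypothetical names chosen for this illustration; they are not part of *nytimesarticle*.

```python
import os

def load_api_key(env_var="NYT_API_KEY", default=None):
    """Return the API key stored in an environment variable.

    'NYT_API_KEY' is a hypothetical variable name chosen for this sketch.
    """
    key = os.environ.get(env_var, default)
    if not key:
        raise RuntimeError("Set the environment variable " + env_var + " to your API key")
    return key
```

With the variable set in your shell, the client could then be created with `api = articleAPI(load_api_key())`.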

We can use the *api.search()* function to launch our query. It takes various parameters of which we illustrate a few.

In [2]:
query="vaccination"
beginDate=20151231
endDate=20181231
# Alternative filter that also includes The New York Times itself:
# filter={'source':['Reuters','AP', 'The New York Times']}
filter={'source':['Reuters','AP']}


articles = api.search(q = query,
                      fq = str(filter), 
                      begin_date = beginDate, 
                      end_date= endDate)


The q (for query) parameter searches the article's body, headline and byline for a particular term. In this case, we are looking for the search term 'vaccination'. The fq (for filter query) parameter filters search results by various dimensions. For instance, 'source':['Reuters','AP'] restricts the results to these sources (Reuters, The New York Times, and AP are available through the API). The begin_date and end_date parameters (in YYYYMMDD format) limit the date range of the search.
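
If your dates come from Python `date` objects rather than hand-typed numbers, the YYYYMMDD values can be derived as in the sketch below. The helper name `to_yyyymmdd` is our own and is not part of the wrapper.

```python
from datetime import date

def to_yyyymmdd(day):
    """Convert a datetime.date into the integer YYYYMMDD form used above."""
    return int(day.strftime("%Y%m%d"))

beginDate = to_yyyymmdd(date(2015, 12, 31))
endDate = to_yyyymmdd(date(2018, 12, 31))
print(beginDate, endDate)  # prints: 20151231 20181231
```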

As you can see, we can specify multiple filters by using a python dictionary and multiple values by using a list: fq = {'source':['Reuters','AP', 'The New York Times']}
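
The Article Search API itself expects the filter query as a string in Lucene syntax, for example `source:("Reuters" "AP")`. The sketch below shows one way such a string could be built from a dictionary like the one above. The function `format_fq` is a hypothetical helper written for this illustration; we assume, but have not verified, that the wrapper performs a similar conversion internally.

```python
def format_fq(filters):
    """Turn a dict of filters into a Lucene-style filter query string."""
    parts = []
    for field, values in filters.items():
        # Accept a single string as well as a list of strings
        if isinstance(values, str):
            values = [values]
        quoted = " ".join('"{}"'.format(v) for v in values)
        parts.append("{}:({})".format(field, quoted))
    # Several fields are combined with AND
    return " AND ".join(parts)

print(format_fq({'source': ['Reuters', 'AP', 'The New York Times']}))
# prints: source:("Reuters" "AP" "The New York Times")
```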

There are many other parameters and filters we can use to specify our search. The full list is given in the Article Search API documentation on https://developer.nytimes.com.

The output of the search is a dictionary that has the following keys:

In [3]:
print(articles.keys())

dict_keys(['status', 'copyright', 'response'])


We are interested in the 'response', which contains a dictionary whose 'docs' entry holds the actual content as a list. The next command shows the first element of that list.

In [4]:
print(articles['response']['docs'][0])

{'web_url': 'https://www.nytimes.com/2018/10/16/health/child-flu-death-florida.html', 'snippet': 'A child who had not gotten the flu shot tested positive for influenza B, state health officials said. The flu season has just begun, and flu activity remains low across the country.', 'lead_paragraph': 'A child in Florida who had not received the flu vaccine died from the virus, state officials announced on Monday, the first influenza-related pediatric death reported in the country this flu season.', 'abstract': 'A child who had not gotten the flu shot tested positive for influenza B, state health officials said. The flu season has just begun, and flu activity remains low across the country.', 'print_page': '15', 'source': 'The New York Times', 'multimedia': [{'rank': 0, 'subtype': 'xlarge', 'caption': None, 'credit': None, 'type': 'image', 'url': 'images/2018/10/16/us/17xp-flu/17xp-flu-articleLarge.jpg', 'height': 400, 'width': 600, 'legacy': {'xlarge': 'images/2018/10/16/us/17xp-flu/17xp
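
Printing a whole document is hard to read. As a small sketch of how to pick out individual fields, the code below walks over 'docs' and collects the source and URL of each item. `sample_articles` is a hand-made stand-in for the real response, reduced to two fields, and `list_urls` is a helper name of our own.

```python
# A hand-made stand-in for the 'articles' dictionary, reduced to two fields.
sample_articles = {
    'response': {
        'docs': [
            {'web_url': 'https://www.nytimes.com/2018/10/16/health/child-flu-death-florida.html',
             'source': 'The New York Times'},
        ]
    }
}

def list_urls(articles):
    """Return (source, web_url) pairs for every document in a response."""
    pairs = []
    # .get() avoids a KeyError when a field is absent from the response
    for doc in articles.get('response', {}).get('docs', []):
        pairs.append((doc.get('source'), doc.get('web_url')))
    return pairs

for source, url in list_urls(sample_articles):
    print(source, url)
```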

The search function returns a dictionary with the first 10 results. To get the next 10, we have to use the page parameter. Pages are numbered from 0 in the Article Search API: page = 0 returns the first 10 results, page = 1 the second 10, and so on.
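
To collect more than 10 results, the page parameter can be stepped through in a loop. The sketch below assumes that pages are numbered from 0, as the NYT documentation describes, and that a short pause between calls helps to stay within the API's rate limits; the pause length is our own guess. `fetch_pages` is a hypothetical helper that accepts any search function, so it can be tried out without a key.

```python
import time

def fetch_pages(search, n_pages, pause=1.0, **params):
    """Call search(page=0), search(page=1), ... and collect all 'docs'."""
    docs = []
    for page in range(n_pages):
        result = search(page=page, **params)
        page_docs = result.get('response', {}).get('docs', [])
        if not page_docs:
            # An empty page means there are no further results
            break
        docs.extend(page_docs)
        time.sleep(pause)
    return docs
```

With the client from above, this could be called as `all_docs = fetch_pages(api.search, 3, q=query, begin_date=beginDate, end_date=endDate)`. The result is a flat list of documents rather than a full response dictionary.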

If you run the code, you'll see that the returned dictionary is pretty messy. What we’d really like to have is a list of dictionaries, with each dictionary representing an article and each key representing a field of metadata from that article (e.g. headline, date, etc.). We can do this with a custom function:

In [1]:
def parse_articles(articles):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    # we create a list structure to capture the results
    news = []
    for doc in articles['response']['docs']:
        # we define a dictionary to store all meta data and the text
        dic = {}
        dic['id'] = doc['_id']  ## obtain the identifier for the document
        if doc['abstract'] is not None:
            dic['abstract'] = doc['abstract'].encode("utf8")
        dic['headline'] = doc['headline']['main'].encode("utf8")
        dic['desk'] = doc['news_desk']
        dic['date'] = doc['pub_date'][0:10] # cutting time of day.
        if doc['snippet'] is not None:
            dic['snippet'] = doc['snippet'].encode("utf8")
        dic['source'] = doc['source']
        dic['url'] = doc['web_url']
        # locations: keywords whose name contains 'glocations'
        locations = []
        for keyword in doc['keywords']:
            if 'glocations' in keyword['name']:
                locations.append(keyword['value'])
        dic['locations'] = locations
        # subjects: keywords whose name contains 'subject'
        subjects = []
        for keyword in doc['keywords']:
            if 'subject' in keyword['name']:
                subjects.append(keyword['value'])
        dic['subjects'] = subjects
        news.append(dic)
    return news

We can use the above function *parse_articles* to process the articles retrieved through the API.

In [8]:
news=parse_articles(articles)
print(len(news))

10


We use the same approach as for the Google news API to store the news with the metadata in a CSV file, using the pandas framework.

In [9]:
COLS = ['id', 'abstract', 'headline', 'desk','date',  'snippet', 'source', 'url', 'locations', 'subjects']

In [10]:
import os
import pandas as pd

# We create a data frame that we name 'all_news_dataframe'
# with pandas imported as 'pd', using the columns list that we defined before.
# Each item in 'news' is a dictionary with the metadata of a single article.
# pandas matches the dictionary keys to the names in COLS, so the order of
# the keys does not matter; a key that is missing (e.g. no abstract) becomes NaN.
# Note: the older row-by-row approach with DataFrame.append no longer works,
# because DataFrame.append was removed in pandas 2.0.
all_news_dataframe = pd.DataFrame(news, columns=COLS)

In [11]:
print(all_news_dataframe.shape)

(10, 10)


In [12]:
print(news[0])

{'id': 'nyt://article/ca1851eb-c6f2-56c5-a849-cef1d9a54234', 'abstract': b'A child who had not gotten the flu shot tested positive for influenza B, state health officials said. The flu season has just begun, and flu activity remains low across the country.', 'headline': b'Florida Child Dies From Flu, the First Young Death Reported in the U.S. This Season', 'desk': 'Express', 'date': '2018-10-16', 'snippet': b'A child who had not gotten the flu shot tested positive for influenza B, state health officials said. The flu season has just begun, and flu activity remains low across the country.', 'source': 'The New York Times', 'url': 'https://www.nytimes.com/2018/10/16/health/child-flu-death-florida.html', 'locations': ['Florida'], 'subjects': ['Influenza', 'Deaths (Fatalities)', 'Vaccination and Immunization']}
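
Note the `b'...'` prefix on the abstract, headline and snippet: in Python 3, `.encode("utf8")` produces bytes rather than strings. If you prefer plain strings, a record can be converted back as in this sketch. `decode_record` is a helper name of our own, and the example record is a shortened, hand-made one.

```python
def decode_record(record, encoding="utf8"):
    """Return a copy of the record with all bytes values decoded to str."""
    return {key: value.decode(encoding) if isinstance(value, bytes) else value
            for key, value in record.items()}

# A shortened, hand-made record for illustration
example = {'headline': b'Florida Child Dies From Flu', 'desk': 'Express'}
print(decode_record(example))
# prints: {'headline': 'Florida Child Dies From Flu', 'desk': 'Express'}
```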


In [13]:
# We define a file path to store the results as CSV. Make sure the folder 'nyt_search_results' exists 
# or that you specify another path to an existing location. The 'news_results_<query>.csv' file will be created in that location.
csvFilePath='nyt_search_results/news_results_'+query+'.csv'

# we now open the csvFile for writing our result (an existing file is overwritten);
# the 'with' block closes the file when we are done
with open(csvFilePath, "w", newline='') as csvFile:
    all_news_dataframe.to_csv(csvFile, columns=COLS, index=False)
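
When such a CSV file is read back, list-valued columns such as 'locations' and 'subjects' come back as plain strings, because CSV stores only text. The sketch below shows one possible way to restore them with `ast.literal_eval`. It uses a tiny hand-made data frame and an in-memory buffer instead of a file, so it runs without the search results.

```python
import ast
import io
import pandas as pd

# A tiny hand-made frame with one list-valued column
frame = pd.DataFrame([{'id': 'a1', 'subjects': ['Influenza', 'Deaths (Fatalities)']}])

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# The converter turns the text "['Influenza', ...]" back into a Python list
restored = pd.read_csv(buffer, converters={'subjects': ast.literal_eval})
print(restored.loc[0, 'subjects'])
```

For a real file, the buffer would be replaced by the path, for instance `pd.read_csv(csvFilePath, converters={...})`.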

## End of notebook