# Workshop 7 - NYT Article API Call & Clean

This workshop covers an example API call and how to parse the result, which is returned as JSON.

So what is an API? 

API stands for Application Programming Interface. There is a long explanation of APIs on [Wikipedia](https://en.wikipedia.org/wiki/Application_programming_interface),
but you can simply think of an API as a menu for the web: it lists what you can ask for and how to ask for it. You make a request using a specified URL format. Each API has its own format, so check that API's documentation for details. In most cases, you need an API key (or token). An API key works like a password: it gives you the right to access a particular API. Some API keys are free, and others are paid. Therefore, it is important to _not share your API keys_ - you might end up paying for API requests that you did not make.
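
To make the "specified URL format" idea concrete, here is a minimal sketch of how a request URL is commonly assembled from a base address and query parameters. The host `api.example.com`, the path, the parameter names, and the helper `build_request_url` are hypothetical placeholders, not part of any real API.

```python
from urllib.parse import urlencode

def build_request_url(base_url, params):
    """Join a base URL and a dict of query parameters into one request URL."""
    return f"{base_url}?{urlencode(params)}"

# Hypothetical endpoint and key, for illustration only
example_url = build_request_url(
    "https://api.example.com/v1/articles",
    {"q": "climate change", "api-key": "YOUR_KEY_HERE"},
)
print(example_url)
```

`urlencode` takes care of escaping characters such as spaces, which is why building the query string by hand is usually avoided for anything beyond a single parameter.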

Below is an example using the New York Times API. The New York Times API information is available [here](https://developer.nytimes.com/).

In [1]:
# load packages
import os
import pandas as pd
import requests 

In [2]:
# Saving the API key in an environment variable is good practice
# Here I read my New York Times API key from the environment
nyt_api_key = os.getenv("NYT_API_KEY") # You need to create your own NYT API key
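
One thing to watch: `os.getenv` returns `None` when the variable is not set, and joining `None` onto a URL string in a later cell raises a `TypeError`. A small sketch of a check that fails early with a clearer message is shown below. The helper name `load_api_key` is an assumption made for this example.

```python
import os

def load_api_key(var_name="NYT_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error."""
    key = os.getenv(var_name)
    if not key:
        # Covers both an unset variable (None) and an empty string
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return key

# Usage (raises RuntimeError if the variable is missing):
# nyt_api_key = load_api_key()
```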

For this example, I will use the New York Times Archive API. Documentation is available [here](https://developer.nytimes.com/docs/archive-product/1/overview).

In [3]:
# This is the url format
# This returns data for all articles published in September 2018
url = 'https://api.nytimes.com/svc/archive/v1/2018/9.json?api-key='+nyt_api_key
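
If you want other months, the same URL pattern can be wrapped in a small function. This is only a sketch that mirrors the format used in the cell above. The helper name `archive_url` and the month check are assumptions added for illustration.

```python
def archive_url(year, month, api_key):
    """Build an Archive API URL in the same format as the cell above."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}"

# Same month as the example, with a placeholder key
print(archive_url(2018, 9, "YOUR_KEY_HERE"))
```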

In [4]:
# Using the "get" method, I get a response
response = requests.get(url)

In [9]:
# Check the response
# 200 means the request succeeded. Codes in the 400s (e.g. 400, 401, 404) mean there was a problem with the request.
response.status_code

200
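
If you are unsure what a status code means, Python's standard library can translate it. The sketch below uses `http.HTTPStatus` to label a few common codes. The helper name `describe_status` is made up for this example, and which codes a particular API actually returns is defined by that API's documentation.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code using the standard library."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    if 200 <= code < 300:
        kind = "success"
    elif code >= 400:
        kind = "error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"

for code in (200, 401, 404, 429):
    print(describe_status(code))
```

With `requests`, you can also call `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses instead of letting the code continue silently.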

In [10]:
type(response)

requests.models.Response

In [6]:
# The response comes back as a "Response" object
# Call .json() to parse the JSON body into Python objects
# JSON is a text data format; once parsed here, it becomes a Python dictionary
json_file = response.json() 
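
To see what "parsing JSON" means without calling an API, the sketch below parses a small JSON string with the standard `json` module. The string is an invented sample that only imitates the shape of the response. It is not real API output.

```python
import json

# Invented sample text that imitates the shape of the response
raw_text = '{"copyright": "example", "response": {"meta": {"hits": 2}, "docs": []}}'

parsed = json.loads(raw_text)  # JSON text -> Python objects
print(type(parsed))
print(parsed["response"]["meta"]["hits"])
```

JSON objects become Python dictionaries and JSON arrays become Python lists, which is why the rest of this workshop only needs dictionary keys and list indexes.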

In [8]:
print(str(json_file)[:1500]) # print out just some of the result

{'copyright': 'Copyright (c) 2019 The New York Times Company. All Rights Reserved.', 'response': {'meta': {'hits': 7102}, 'docs': [{'web_url': 'https://www.nytimes.com/interactive/2018/09/01/us/politics/trump-officials-crimes-and-ethical-violations.html', 'snippet': 'Here are all the people connected to President Trump who have been charged with crimes or found to have violated federal ethics rules.', 'blog': [], 'source': 'The New York Times', 'multimedia': [{'rank': 0, 'subtype': 'xlarge', 'caption': None, 'credit': None, 'type': 'image', 'url': 'images/2018/08/31/us/trump-officials-crimes-and-ethical-violations-promo-1535755498785/trump-officials-crimes-and-ethical-violations-promo-1535755498785-articleLarge.jpg', 'height': 434, 'width': 600, 'legacy': {'xlarge': 'images/2018/08/31/us/trump-officials-crimes-and-ethical-violations-promo-1535755498785/trump-officials-crimes-and-ethical-violations-promo-1535755498785-articleLarge.jpg', 'xlargewidth': 600, 'xlargeheight': 434}, 'subType

In [10]:
type(json_file) # what is the type of this json_file ?

dict

In [11]:
json_file.keys() # because it is a dictionary, use .keys() to see its keys

dict_keys(['copyright', 'response'])

As the printed sample above shows, 'response' carries the bulk of the information. I will dig deeper and check the data type of `json_file['response']`.

In [12]:
type(json_file['response']) # check the type - it's a dictionary

dict

In [13]:
json_file['response'].keys() # see the keys

dict_keys(['meta', 'docs'])

As the printed sample above shows, **'docs'** carries the bulk of the information. I will dig deeper and check the data type of `json_file['response']['docs']`.

In [14]:
type(json_file['response']['docs']) # This is a list

list
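
The pattern so far is: check the type, then look at the keys (for a dictionary) or the first element (for a list), and repeat. That routine can be sketched as a small recursive helper. The function name `outline` and the tiny `sample` dictionary are assumptions for illustration. The sample only imitates the shape seen above.

```python
def outline(obj, max_depth=3, depth=0):
    """Return lines describing nested dicts and lists, one level per indent."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict, keys: {list(obj)}"]
        children = obj.values()
    elif isinstance(obj, list):
        lines = [f"{pad}list, length: {len(obj)}"]
        children = obj[:1]  # look at the first element only
    else:
        return [f"{pad}{type(obj).__name__}"]
    if depth < max_depth:
        for child in children:
            if isinstance(child, (dict, list)):
                lines.extend(outline(child, max_depth, depth + 1))
    return lines

# Invented sample with the same nesting as the response
sample = {
    "copyright": "example",
    "response": {"meta": {"hits": 1}, "docs": [{"web_url": "u", "snippet": "s"}]},
}
print("\n".join(outline(sample)))
```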

In [15]:
json_file['response']['docs'][0] # Lists can be indexed with numbers. Check the first element

{'web_url': 'https://www.nytimes.com/interactive/2018/09/01/us/politics/trump-officials-crimes-and-ethical-violations.html',
 'snippet': 'Here are all the people connected to President Trump who have been charged with crimes or found to have violated federal ethics rules.',
 'blog': [],
 'source': 'The New York Times',
 'multimedia': [{'rank': 0,
   'subtype': 'xlarge',
   'caption': None,
   'credit': None,
   'type': 'image',
   'url': 'images/2018/08/31/us/trump-officials-crimes-and-ethical-violations-promo-1535755498785/trump-officials-crimes-and-ethical-violations-promo-1535755498785-articleLarge.jpg',
   'height': 434,
   'width': 600,
   'legacy': {'xlarge': 'images/2018/08/31/us/trump-officials-crimes-and-ethical-violations-promo-1535755498785/trump-officials-crimes-and-ethical-violations-promo-1535755498785-articleLarge.jpg',
    'xlargewidth': 600,
    'xlargeheight': 434},
   'subType': 'xlarge',
   'crop_name': 'articleLarge'},
  {'rank': 0,
   'subtype': 'thumbnail',
   'c

The first element of the list `json_file['response']['docs']` carries a lot of information about one article. I will check the type of an element in the `json_file['response']['docs']` list.

In [16]:
# Check type of an element in the list `json_file['response']['docs']`
type(json_file['response']['docs'][0])

dict

In [17]:
# keys in a dictionary
json_file['response']['docs'][0].keys()

dict_keys(['web_url', 'snippet', 'blog', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'score', 'uri'])

In this example, I want to have `web_url` and `snippet` information.

In [18]:
# Remember, you can index the value of a dictionary using the key
json_file['response']['docs'][0]['web_url']

'https://www.nytimes.com/interactive/2018/09/01/us/politics/trump-officials-crimes-and-ethical-violations.html'

In [19]:
json_file['response']['docs'][0]['snippet']

'Here are all the people connected to President Trump who have been charged with crimes or found to have violated federal ethics rules.'

In [20]:
# This is for the second element in json_file['response']['docs'] list.
# The second element is also a dictionary, and I can find the value through indexing with keys
json_file['response']['docs'][1]['web_url']

'https://www.nytimes.com/2018/08/31/opinion/north-dakota-senate.html'

In [21]:
json_file['response']['docs'][1]['snippet']

'We have seen the future, and it’s in North Dakota.'

Now, if we loop through each element in the list `json_file['response']['docs']`, we will be able to get the urls and the snippets.

Each element in the list `json_file['response']['docs']` is a dictionary. The keys to use are `web_url` and `snippet`. Each value is looked up by its key, and the resulting URL and snippet are appended to lists that start out empty.

Records with an empty `web_url` or an empty `snippet` are not needed. An if statement in the loop filters them out.

In [22]:
# empty lists to save the results
# empty lists should be made **outside** the for loop
# If the empty list is inside the loop, it will generate an empty list every time it loops
urls = []
snippets = []

# for loop
# i is each element in the list "json_file['response']['docs']"
# i is a dictionary
# i['web_url'] will have url result
# i['snippet'] will have snippet result
for i in json_file['response']['docs']:
    if (len(i['web_url']) > 0) and (len(i['snippet']) > 0): # both values must be non-empty
        urls.append(str(i['web_url'])) # append the value to list urls
        snippets.append(str(i['snippet'])) # append the value to list snippets
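
The loop above assumes every record has both keys and that both values are strings. A record with a missing key would raise a `KeyError`, and a value of `None` would make `len()` raise a `TypeError`. Whether the Archive API ever returns such records is not confirmed here. The sketch below shows a more defensive variant using `.get()`. The `docs` list is invented test data and `extract_pairs` is a hypothetical helper name.

```python
# Invented records covering the awkward cases
docs = [
    {"web_url": "https://example.com/a", "snippet": "First snippet."},
    {"web_url": "https://example.com/b", "snippet": ""},    # empty snippet
    {"web_url": "https://example.com/c"},                   # no snippet key
    {"web_url": "https://example.com/d", "snippet": None},  # snippet is None
]

def extract_pairs(records):
    """Return (url, snippet) pairs, skipping records where either field is missing or empty."""
    pairs = []
    for record in records:
        url = record.get("web_url")      # None if the key is absent
        snippet = record.get("snippet")
        if url and snippet:              # False for None and for ""
            pairs.append((str(url), str(snippet)))
    return pairs

pairs = extract_pairs(docs)
print(pairs)
```

Only the first record survives the filter, because the other three have an empty, missing, or `None` snippet.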

In [23]:
urls # parsed result (urls)

['https://www.nytimes.com/interactive/2018/09/01/us/politics/trump-officials-crimes-and-ethical-violations.html',
 'https://www.nytimes.com/2018/08/31/opinion/north-dakota-senate.html',
 'https://www.nytimes.com/2018/09/02/world/europe/hitler-bell-swastika-germany-church.html',
 'https://www.nytimes.com/2018/09/02/us/politics/meghan-mccain-funeral-trump.html',
 'https://www.nytimes.com/2018/09/02/sports/tennis/us-open-serena-williams.html',
 'https://www.nytimes.com/2018/09/02/business/was-that-serena-williams-in-the-hotel-room-next-to-mine.html',
 'https://www.nytimes.com/2018/09/02/world/europe/pope-francis-archbishop-vigano-kim-davis.html',
 'https://www.nytimes.com/2018/09/02/business/richard-liu-arrest-minnesota-china.html',
 'https://www.nytimes.com/2018/09/02/world/europe/spain-street-vendors-migrants.html',
 'https://www.nytimes.com/2018/09/02/arts/television/adventure-time-appraisal-series-finale.html',
 'https://www.nytimes.com/2018/09/01/style/sarahjessicaparker-nixon-tea-ho

In [24]:
snippets # parsed result (snippets)

['Here are all the people connected to President Trump who have been charged with crimes or found to have violated federal ethics rules.',
 'We have seen the future, and it’s in North Dakota.',
 'Germany’s painful history has become highly personal in a sleepy village, where neighbors are fighting over what to do with a “Hitler bell” in the local church.',
 'Ms. McCain’s emotional call to arms at her father’s funeral was proof that she is her father’s daughter, a paradoxical figure willing to pay the price of being direct.',
 'Williams reached the quarterfinals by beating Kaia Kanepi, who had eliminated top-seeded Simona Halep. Sloane Stephens, the defending champ, also won.',
 'Many players competing in the U.S. Open, including the Williams sisters, get discounts on lodging in New York hotels in return for social media posts and promotional events.',
 'In a new statement, the allies appeared to concede a central claim of Archbishop Carlo Maria Viganò, who has called on the pope to ste

In [25]:
# Make an empty dataframe to save the data
sept2018 = pd.DataFrame()

In [26]:
# Make new columns, assign values to the column from the list urls and snippets
sept2018['url'] = urls
sept2018['snippet'] = snippets

In [27]:
sept2018.head() # check the data

| | url | snippet |
|---|---|---|
| 0 | https://www.nytimes.com/interactive/2018/09/01... | Here are all the people connected to President... |
| 1 | https://www.nytimes.com/2018/08/31/opinion/nor... | We have seen the future, and it’s in North Dak... |
| 2 | https://www.nytimes.com/2018/09/02/world/europ... | Germany’s painful history has become highly pe... |
| 3 | https://www.nytimes.com/2018/09/02/us/politics... | Ms. McCain’s emotional call to arms at her fat... |
| 4 | https://www.nytimes.com/2018/09/02/sports/tenn... | Williams reached the quarterfinals by beating ... |


In [28]:
sept2018.info() # check the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6996 entries, 0 to 6995
Data columns (total 2 columns):
url        6996 non-null object
snippet    6996 non-null object
dtypes: object(2)
memory usage: 109.4+ KB


In [30]:
sept2018.to_csv('nyt_sept_2018.csv', index = False) # export the data
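
Many snippets contain commas, so it is worth knowing that CSV writers wrap such values in quotes to keep each one in a single column. The sketch below shows this with the standard `csv` module and an in-memory buffer instead of a file on disk. The row is invented sample data, and this is a pandas-free illustration rather than what `to_csv` does internally.

```python
import csv
import io

# Invented row whose snippet contains a comma
rows = [
    {"url": "https://example.com/a", "snippet": "We have seen the future, and it is here."},
]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["url", "snippet"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())  # the snippet is wrapped in double quotes

buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back[0]["snippet"])  # the comma survives the round trip
```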

This is how you parse a JSON response, using a mix of for loops, dictionaries, lists and if/else statements!

You can use this as a guide to parse through more json files - and become an expert on API calls.