# New York Times Scraping
[The New York Times Developer Network](https://developer.nytimes.com/)

In [None]:
import getpass
APIKEY = getpass.getpass()

## Making API calls using terminal/Linux


Use curl to request information from the URL, and use the `-o` option to save the response to a new file, trial.json.

In [None]:
!curl --request GET -o trial.json "https://api.nytimes.com/svc/archive/v1/1970/12.json?api-key={APIKEY}"

### Viewing the JSON


We can use jq in the terminal to view and filter JSON. Install jq with apt-get (via sudo), then view the file by passing trial.json to jq.

In [None]:
!sudo apt-get install -y jq
!jq '.' < trial.json

In [None]:
!head trial.json

### Filtering using `jq`
To filter by key, use `jq '.key'`, where `key` is one of the keys in the JSON file; `jq` returns the corresponding value.
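For readers more comfortable in Python, a `jq '.key'` filter corresponds roughly to a dictionary lookup on the parsed JSON. The sketch below uses a small made-up document (not real API output) to show the equivalence.

```python
import json

# Made-up sample shaped loosely like an API response; not real data.
raw = '{"copyright": "example", "response": {"docs": [], "meta": {"hits": 0}}}'

data = json.loads(raw)

# Rough equivalent of: jq '.copyright'
copyright_value = data["copyright"]

# Rough equivalent of: jq 'keys' (jq returns the key names sorted)
top_level_keys = sorted(data.keys())

print(copyright_value)
print(top_level_keys)
```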

In [None]:
from __future__ import print_function

To grab just the articles, we filter down to the `docs` array under `response` and save the output to trialarticles.json.

In [None]:
!jq '.' < trial.json
!jq '.response | .docs' < trial.json > trialarticles.json

We can also grab a single article (in this case, the fourth) by pulling the element at index 3 from the array of articles, since jq arrays are zero-indexed. We can then filter further using the `.headline` and `.main` keys.

In [None]:
!jq '.response | .docs[3]' < trial.json > trialex.json
!jq '.headline | .main' < trialex.json
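The chained jq filters above amount to walking a path of keys and indexes through nested data. As a minimal sketch, the hypothetical helper `follow_path` below does the same walk in Python on a made-up sample (the headlines are placeholders, not real articles).

```python
from functools import reduce


def follow_path(data, path):
    """Walk nested dicts and lists; each step is a dict key or a list index."""
    return reduce(lambda node, step: node[step], path, data)


# Made-up sample with the same nesting as the filters above.
sample = {
    "response": {
        "docs": [
            {"headline": {"main": "First headline"}},
            {"headline": {"main": "Second headline"}},
            {"headline": {"main": "Third headline"}},
            {"headline": {"main": "Fourth headline"}},
        ]
    }
}

# Rough equivalent of: jq '.response | .docs[3] | .headline | .main'
fourth_headline = follow_path(sample, ["response", "docs", 3, "headline", "main"])
print(fourth_headline)
```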

## API requests using Python

We can also stay in Python for the entire process. First, we import the requests package.

In [None]:
import requests as req

Using the same URL, we can fetch the JSON and store the parsed result in a local variable.

In [None]:
url = f"https://api.nytimes.com/svc/archive/v1/1970/12.json?api-key={APIKEY}"
response = req.get(url).json()
response
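To request a different month, only the year and month in the URL need to change. The sketch below builds the URL from its parts; `archive_url` and `BASE` are hypothetical names introduced here, and the URL pattern is simply inferred from the one used above.

```python
from urllib.parse import urlencode

# Base path inferred from the URL used above.
BASE = "https://api.nytimes.com/svc/archive/v1"


def archive_url(year, month, api_key):
    """Build the archive URL for one month (hypothetical helper)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    query = urlencode({"api-key": api_key})
    return f"{BASE}/{year}/{month}.json?{query}"


# "YOUR_KEY" is a placeholder, not a working key.
example_url = archive_url(1970, 12, "YOUR_KEY")
print(example_url)
```

The resulting string can be passed to `req.get(...)` exactly as before.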

In [None]:
from google.colab import drive
drive.mount('/content/drive')

We can then collect the articles in a local list called "articles." After indexing into the JSON to get just the entries under `docs`, we loop through every element of that array and pull the main headline, abstract, and lead paragraph. We append each result to "articles" and view the first five items.

In [None]:
articles = []
docs = response['response']['docs']
for doc in docs:
    # Keep only the fields we care about from each article record.
    filtered_doc = {}
    filtered_doc['title'] = doc['headline']['main']
    filtered_doc['abstract'] = doc['abstract']
    filtered_doc['paragraph'] = doc['lead_paragraph']
    articles.append(filtered_doc)
articles[:5]
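The loop above raises a `KeyError` if any record lacks one of the three fields. Whether that happens for a given month is not something checked here, so the following is only a defensive sketch: the hypothetical helper `extract_fields` falls back to `None` for missing fields, shown on made-up records.

```python
def extract_fields(doc):
    """Pull title, abstract and lead paragraph, using None where a field is missing."""
    headline = doc.get("headline") or {}
    return {
        "title": headline.get("main"),
        "abstract": doc.get("abstract"),
        "paragraph": doc.get("lead_paragraph"),
    }


# Made-up records: one complete, one partial, one empty.
sample_docs = [
    {"headline": {"main": "Complete record"}, "abstract": "A", "lead_paragraph": "P"},
    {"headline": {"main": "No abstract"}},
    {},
]

tolerant_articles = [extract_fields(doc) for doc in sample_docs]
print(tolerant_articles)
```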

# From JSON to CSV
For working with structured data in notebooks, the most popular and full-featured package is `pandas`, which can load the JSON into a dataframe and write it out as a CSV file.

First we import the pandas package. It is a common convention to import it under the *alias* `pd` so that you do not need to type `pandas` in full every time you refer to the package.

In [None]:
import pandas as pd

Then, we use the `read_json()` function in pandas to transform the filtered json into a dataframe.

In [None]:
dfterm = pd.read_json('trialarticles.json')
dfterm.head(5)

In [None]:
dfterm.to_csv('trialarticles.csv')

This is how we would perform the same transformation on the data we pulled with Python: "articles" is a list of dictionaries rather than a JSON file, so we pass it to the `pd.DataFrame` constructor instead of `read_json()`.

In [None]:
pythondf = pd.DataFrame(articles)
pythondf.head(5)

In [None]:
pythondf.to_csv('pythontrialarticles.csv')
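If pandas is not available, roughly the same CSV can be produced with the standard library's `csv` module. This is a sketch on made-up rows, written to an in-memory buffer rather than a file; unlike `to_csv()` with its default settings, it does not write an index column.

```python
import csv
import io

# Made-up rows with the same keys as the "articles" list above.
sample_articles = [
    {"title": "Example headline", "abstract": "Short summary", "paragraph": "Opening text"},
    {"title": "Another, with a comma", "abstract": None, "paragraph": "More text"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "abstract", "paragraph"])
writer.writeheader()
writer.writerows(sample_articles)  # None values are written as empty fields

csv_text = buffer.getvalue()
print(csv_text)
```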

## Trying to obtain full body text

The URLs in the terminal dataframe link to HTML pages, which are pretty messy and do not include the body of an article.

In [None]:
dfterm['web_url'][0]

Requesting one of these pages returns the error message "Please enable JS and disable any ad blocker." Selenium would be required for further operations, but even then it is unclear whether the body would be available.

In [None]:
from bs4 import BeautifulSoup
import requests as req

# Fetch the article page and parse the returned HTML (the 'lxml' parser must be installed).
page = req.get("https://www.nytimes.com/1970/12/01/archives/egebergs-ouster-is-expected-soon-dismissal-of-health-official-seen.html")
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())
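When fetching many pages, it may help to detect this block page automatically instead of reading the printed HTML. The sketch below simply searches for the message quoted above; `needs_javascript` is a hypothetical helper, and the two HTML strings are made-up stand-ins for real responses.

```python
def needs_javascript(html):
    """Return True if the page text contains the 'enable JS' block message."""
    marker = "please enable js"
    return marker in html.lower()


# Made-up stand-ins for a blocked response and a normal page.
blocked_page = "<html><body>Please enable JS and disable any ad blocker</body></html>"
normal_page = "<html><body><p>Article text</p></body></html>"

print(needs_javascript(blocked_page))
print(needs_javascript(normal_page))
```

In the notebook, the same check could be applied to the text of a fetched response before any parsing is attempted.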