# Playing with Trove API

<b>Author: Zhixin Zheng (ivy.zheng@unimelb.edu.au)</b>

<font size=4.5>
    
This Jupyter notebook is updated from the demo material of the <b>'Working with APIs' training session</b> for [UoM Digital Studio](https://arts.unimelb.edu.au/research/digital-studio).
    
Click the <b>Start Button</b> on the upper bar or press <b>Shift+Enter</b> to run a cell in the notebook.

## 【IMPORTANT】1. Prerequisite Setting

<font size=4.5>

Most of the packages used below ship with Python's standard library. Only `pandas` and `requests` need to be installed separately.
    
Uncomment the following cell (remove the `#`) and run it to install them in your environment.

In [1]:
#!pip install pandas requests

<font size=4.5>

Run the following cell to import the packages we need in this notebook.

In [1]:
###
# Import package
###

# library for api key input without echoing
import getpass

# library for handling json files
import json

# libraries for setting timestamp
import time
from datetime import datetime

# library for reading json in a better format
import pandas as pd

# library for handling http requests
import requests

###
# Set pre-defined function
###


def time_now():
    """
    Returns current time.

    :return: current time.
    """
    return datetime.fromtimestamp(time.time()).strftime("%Y%m%d%H%M")


###
# set root url
###

root_url = "https://api.trove.nla.gov.au/v2/result"

## 2. Set up your API key

<font size=4.5>
An input box will appear when you run the following cell. Once you enter your API key, a masked string like "················" will be shown in the output area. 

In [3]:
api_key = getpass.getpass("Enter your API key: ")

Enter your API key:  ················


## 3. Your first Trove search using the API

<font size=4.5 color=blue>
    
In this part, we will send a basic search URL to Trove and get our first response.
    
</font>

<font size=4.5>
The following diagram illustrates the structure of the URL we'll send. The parts shown in red are mandatory parameters for a Trove search.
    
![alt text](./img/basic_searching_trove_url.png "structure of URL for basic searching")
    
<font size=4.5>
<b>Parameters for a basic search</b>

* `key` - API key (Mandatory)
* `zone` - Restricts the search to a particular zone. Multiple zones can be searched by separating them with a comma. (Mandatory)
* `q` - The search query. (Mandatory)
* `s` - If there are more pages of results, the response contains a CursorMark value called <b>'nextStart'</b>. Pass this value as the `s` parameter to get the next page, and repeat to page through a long list of results. The default value `*` requests the first page.
* `n` - The number of results to return per zone. The maximum is 100 and the default is 20. If your query fetches many records, they are divided into pages, and `n` sets the number of records on each page.
* `encoding` - Get the response in `xml` (default) or `json`.

For more information about Parameters available when searching, check [here](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#parameters-available-when-searching).
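
To make the link between these parameters and the final URL concrete, here is a minimal sketch that assembles the query string with the standard library only, without contacting Trove. `build_search_url` is a hypothetical helper written for this illustration, and `YOUR_API_KEY` is a placeholder rather than a real key.

```python
from urllib.parse import urlencode

ROOT_URL = "https://api.trove.nla.gov.au/v2/result"


def build_search_url(params, root_url=ROOT_URL):
    """Return the full search URL for a dict of request parameters."""
    # urlencode percent-encodes the values, so "*" becomes "%2A"
    return f"{root_url}?{urlencode(params)}"


example_params = {
    "key": "YOUR_API_KEY",  # placeholder, not a real key
    "q": "covid-19",
    "zone": "newspaper",
    "n": 10,
    "s": "*",
    "encoding": "json",
}
example_url = build_search_url(example_params)
print(example_url)
```

In the notebook itself this step is not needed: passing the same dict as `params` to `requests.get` builds the URL for you, although the exact percent-encoding may differ slightly.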

In [4]:
# edit your searching terms here
query_term = "covid-19"

# Don't edit the following lines
start = "*"
reqeust_params_first = {
    "key": api_key,
    "q": query_term,
    "zone": "newspaper",
    "n": 10,
    "s": start,
    "encoding": "json",
}

# send our query using GET request and store the response in response_first
response_first = requests.get(root_url, params=reqeust_params_first)

## 4. Check your query response

<font size=4.5 color=blue>
    
In this part, we will try to explore the response we get.
    
</font>

### 4.1 Check the URL sent to the server/database

<font size=4.5>
    Run the following cell to check the url you send to Trove.

In [None]:
response_first.url
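
Because the API key travels as the `key` query parameter, the URL shown by this cell contains your key. If you plan to share the notebook or paste the URL somewhere, a small helper like the one sketched below can mask it first. `redact_key` is a hypothetical name and the key `abc123` is made up for the demo.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def redact_key(url, param="key", mask="REDACTED"):
    """Return the URL with the value of one query parameter masked."""
    parts = urlsplit(url)
    query = [
        (name, mask if name == param else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


sample_url = (
    "https://api.trove.nla.gov.au/v2/result"
    "?key=abc123&q=covid-19&zone=newspaper&n=10&s=%2A&encoding=json"
)
safe_url = redact_key(sample_url)
print(safe_url)
```

With a live response you could call `redact_key(response_first.url)` instead of displaying the raw URL.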

### 4.2 Check the response status

<font size=4.5>
    
After sending the GET request, we need to check whether it was successful. We can use the HTTP [response status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to monitor the response.
    
Here are some common status codes relating to `get` requests:

* `200` - Success
* `301` - The API is redirecting to a different endpoint
* `400` - Bad request
* `401` - Not authenticated
* `403` - Forbidden
* `404` - Not found
* `429` - Too many requests
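
As a rough sketch of how such a list can be used in code, the hypothetical helper below turns a status code into a short message. The dictionary only covers a handful of common codes; the range checks for redirection (3xx) and server errors (5xx) follow the general HTTP convention.

```python
STATUS_MEANINGS = {
    200: "Success",
    400: "Bad request",
    401: "Not authenticated",
    403: "Forbidden",
    404: "Not found",
    429: "Too many requests",
}


def describe_status(code):
    """Return a short description for an HTTP status code."""
    if code in STATUS_MEANINGS:
        return STATUS_MEANINGS[code]
    if 300 <= code < 400:
        return "Redirection"
    if 500 <= code < 600:
        return "Server error"
    return "Unknown status"


for code in (200, 301, 429, 503):
    print(code, describe_status(code))
```

In this notebook you could call `describe_status(response_first.status_code)` right after sending a request.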

In [6]:
response_first.status_code

200

### 4.3 Understand the response json

<font size=4.5>
Now that the request has been successfully completed, we can convert the response into JSON and parse it.
    
Run the following cell to convert our response into json.

In [7]:
content_first = response_first.json()

<font size=4.5>
Run the following cell to take a look at the JSON response we got.

In [8]:
content_first

{'response': {'query': 'covid-19',
  'zone': [{'name': 'newspaper',
    'records': {'s': '*',
     'n': '10',
     'total': '224',
     'next': '/result?q=covid-19&n=10&encoding=json&zone=newspaper&s=AoIIQZ0yuCkxOTk3MDk2ODM%3D',
     'nextStart': 'AoIIQZ0yuCkxOTk3MDk2ODM=',
     'article': [{'id': '258999523',
       'url': '/newspaper/258999523',
       'heading': 'Defence acts on COVID-19 pandemic',
       'category': 'Article',
       'title': {'id': '1672',
        'value': 'Air Force News (National : 1997 - 2020)'},
       'date': '2020-03-19',
       'page': 3,
       'pageSequence': 3,
       'relevance': {'score': '1439.2992', 'value': 'very relevant'},
       'snippet': 'IN THE wake of the World Health Organisation declaring Novel Coronavirus (COVID-19) a pandemic, Defence has implemented steps to mitigate the spread of the',
       'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/258999523?searchTerm=covid-19'},
      {'id': '1430999',
       'url': '/newspaper/1430999',

<font size=4.5>

The raw result is hard to read, right? With a little basic Python you can peel the structure of your JSON response layer by layer. Briefly: use `content_first.keys()` to list the attributes of the top layer, then use `content_first['one_attribute_at_a_time']` to access the value of an attribute you found. Repeat with `content_first['upper_attribute_name'].keys()` and `content_first['upper_attribute_name']['lower_attribute_name']` until the values contain no more nested attributes.
    </font>
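
The layer-by-layer exploration described above can also be automated. The sketch below uses a hypothetical helper, `outline`, that walks a nested structure and lists each key with the type of its value. The sample dict is a cut-down imitation of a Trove response, not real output.

```python
def outline(node, depth=0, max_depth=4):
    """Return indented lines describing the keys of a nested JSON value."""
    lines = []
    if depth > max_depth:
        return lines
    indent = "  " * depth
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key} ({type(value).__name__})")
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Only inspect the first element, assuming list items share a shape
        lines.append(f"{indent}[0] ({type(node[0]).__name__})")
        lines.extend(outline(node[0], depth + 1, max_depth))
    return lines


sample = {
    "response": {
        "query": "covid-19",
        "zone": [{"name": "newspaper", "records": {"total": "224"}}],
    }
}
print("\n".join(outline(sample)))
```

Calling `outline(content_first)` on a real response would give a compact map of where `records`, `article` and the other attributes live.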
    
<font size=4.5>
But here, let's skip the parsing step and run the following cell to get some useful information from our JSON response directly.
    
Here is what we can get from the response:

* Name of Searching Zone;
* `nextStart` value (A value that can point you to next page of searching);
* The total number of fetched records;
* A list of fetched records. (The number of records in this list depends on the `n` parameter you set in url.)  
    

In [9]:
query = content_first["response"]["query"]
zone = content_first["response"]["zone"][0]
zone_name = zone["name"]  # Name of Searching Zone
records = zone["records"]
start = records["nextStart"]
total_num = records["total"]  # Total number of fetched records
# These are the article records we want (but just the first page)
articles = records["article"]

print(f"We just made a query in {zone_name} zone")
print(f"The query terms are {query}.")
print(f"There are {total_num} records found in this request.")
print(f"The 's' parameter for next page is '{start}'.")

We just made a query in newspaper zone
The query terms are covid-19.
There are 224 records found in this request.
The 's' parameter for next page is 'AoIIQZ0yuCkxOTk3MDk2ODM='.


<font size=4.5>
    
Please note that <b>the structure may vary if you set different parameters in your request</b>, so don't forget to inspect the structure of the first page of your response before you capture a larger dataset. Generally, records of the same record type (work/newspaper and gazette/list) share the same structure. Check [record types of metadata](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#record-types) for more information. 

<font size=4.5>
Let's see what the inner structure of each record in the <b>newspaper</b> zone looks like:
    
In a basic search, you only get the metadata of each article (no full text included).

In [10]:
# the articles variable stores a list of records of article
# we retrieve the first article record here
articles[0]

{'id': '258999523',
 'url': '/newspaper/258999523',
 'heading': 'Defence acts on COVID-19 pandemic',
 'category': 'Article',
 'title': {'id': '1672', 'value': 'Air Force News (National : 1997 - 2020)'},
 'date': '2020-03-19',
 'page': 3,
 'pageSequence': 3,
 'relevance': {'score': '1439.2992', 'value': 'very relevant'},
 'snippet': 'IN THE wake of the World Health Organisation declaring Novel Coronavirus (COVID-19) a pandemic, Defence has implemented steps to mitigate the spread of the',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/258999523?searchTerm=covid-19'}

## 5. Advanced Searching

<font size=4.5 color=blue>
In this part, we will see more <u>parameters</u> and <u>searching commands</u> for advanced searching.

### 5.1 How to limit the searching? (Filtering!)

<font size=4.5>
    
* `facet` - Facets are categories that describe all the records in a particular result set. For example, if you have 20 results, you can check the <b>decade</b> facet to find out that 18 articles are from 2010-2019 and 2 are from 1950-1959. Then you could modify your search to retrieve records only from 1950-1959 using <b>l-decade=195</b> (representing 1950-1959).<br> 
* `l-<facet name>` - Limit the search results using one of the available facets. e.g. <b>l-format=Book</b>

Records in different zones have different supported facets, check [here](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#facetValues) to find the list of supported facets.
    
Edit the parameters in the following cell.

In [11]:
# edit your searching terms & facet here
query_term = "influenza"
facet = "decade"

# Don't edit the following lines
start = "*"
reqeust_params_first = {
    "key": api_key,
    "q": query_term,
    "zone": "newspaper",
    "n": 0,  # we won't get any records by setting n as 0
    "s": start,
    "facet": facet,
    "encoding": "json",
}

<font size=4.5>
Run the following cell to send our query and get the response.

In [12]:
response_first = requests.get(root_url, params=reqeust_params_first)
content_first = response_first.json()

query = content_first["response"]["query"]
zone = content_first["response"]["zone"][0]
zone_name = zone["name"]
records = zone["records"]
start = records["nextStart"]
total_num = records["total"]

print(f"We just made a query in {zone_name} zone")
print(f"The query terms are {query}.")
print(f"There are {total_num} records found in this request.")
print(f"The 's' parameter for next page is '{start}'.")

We just made a query in newspaper zone
The query terms are influenza.
There are 1715104 records found in this request.
The 's' parameter for next page is '*'.


<font size=4.5> 

Run the following cell to check our response. No records are returned because we set the `n` parameter to 0. <br>
Notice the new attribute called `facets` in the response. It shows the number of articles our query matches in each decade.

In [13]:
content_first["response"]

{'query': 'influenza',
 'zone': [{'name': 'newspaper',
   'records': {'s': '*',
    'n': '0',
    'total': '1715104',
    'next': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&s=*',
    'nextStart': '*'},
   'facets': {'facet': {'name': 'decade',
     'displayname': 'Decade',
     'term': [{'count': '31',
       'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=201',
       'search': 201,
       'display': '2010-2019'},
      {'count': '89',
       'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=200',
       'search': 200,
       'display': '2000-2009'},
      {'count': '369',
       'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=199',
       'search': 199,
       'display': '1990-1999'},
      {'count': '494',
       'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=198',
       'search': 198,
       'display': '1980-1989'},
    

<font size=4.5>

Run the following cell to see what the values of the `facet` attribute look like:

In [14]:
searching_count_by_facet = content_first["response"]["zone"][0]["facets"]["facet"][
    "term"
]
searching_count_by_facet

###
# Code for normalizing your facet values if you request more than one facet
###

# facets = zone['facets']
# facets_collection = {}

# for facet in facets['facet']:
#     facets_collection[facet['name']] = pd.json_normalize(facet['term'])
# facets_collection['decade']
# facets_collection['year']

[{'count': '31',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=201',
  'search': 201,
  'display': '2010-2019'},
 {'count': '89',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=200',
  'search': 200,
  'display': '2000-2009'},
 {'count': '369',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=199',
  'search': 199,
  'display': '1990-1999'},
 {'count': '494',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=198',
  'search': 198,
  'display': '1980-1989'},
 {'count': '943',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=197',
  'search': 197,
  'display': '1970-1979'},
 {'count': '881',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zone=newspaper&l-decade=196',
  'search': 196,
  'display': '1960-1969'},
 {'count': '24843',
  'url': '/result?q=influenza&n=0&facet=decade&encoding=json&zon

<font size=4.5>
    
Still confused? Run the following cell to see the facet counts as a human-readable table, with the help of `pandas.json_normalize()`.<br>
We can see how many newspaper results about 'influenza' can be found in each decade.<br> 
By using `facet` parameter to get an overview of fetched records in each decade, you can either <b>decide the best time range for collecting your records</b>, or <b>rescale the range of fetching time to get sufficient records for your research</b>.

In [15]:
pd.json_normalize(searching_count_by_facet)

Unnamed: 0,count,url,search,display
0,31,/result?q=influenza&n=0&facet=decade&encoding=...,201,2010-2019
1,89,/result?q=influenza&n=0&facet=decade&encoding=...,200,2000-2009
2,369,/result?q=influenza&n=0&facet=decade&encoding=...,199,1990-1999
3,494,/result?q=influenza&n=0&facet=decade&encoding=...,198,1980-1989
4,943,/result?q=influenza&n=0&facet=decade&encoding=...,197,1970-1979
5,881,/result?q=influenza&n=0&facet=decade&encoding=...,196,1960-1969
6,24843,/result?q=influenza&n=0&facet=decade&encoding=...,195,1950-1959
7,98777,/result?q=influenza&n=0&facet=decade&encoding=...,194,1940-1949
8,228273,/result?q=influenza&n=0&facet=decade&encoding=...,193,1930-1939
9,305344,/result?q=influenza&n=0&facet=decade&encoding=...,192,1920-1929
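
One detail worth noting: the `count` values in the response are strings (for example `'31'`), so they need converting to integers before they can be compared or sorted numerically. The sketch below ranks decades by count using three of the terms shown above; `rank_facet_terms` is a hypothetical helper.

```python
facet_terms = [
    {"count": "31", "search": 201, "display": "2010-2019"},
    {"count": "24843", "search": 195, "display": "1950-1959"},
    {"count": "305344", "search": 192, "display": "1920-1929"},
]


def rank_facet_terms(terms):
    """Return (display, count) pairs sorted from largest to smallest count."""
    pairs = [(term["display"], int(term["count"])) for term in terms]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_facet_terms(facet_terms)
for display, count in ranked:
    print(f"{display}: {count}")
```

Passing `searching_count_by_facet` from the earlier cell to `rank_facet_terms` should give the same ranking for the full list of decades.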


<font size=4.5>

To further refine the fetched records by decade, we can use the `l-decade` parameter to constrain our search.

In [16]:
# edit here if you want to change the value of l-decade
l_decade = 181  # this means we want articles from 1810-1819 only

# don't edit below!
start = "*"
query_term = "influenza"

reqeust_params_advanced = {
    "key": api_key,
    "q": query_term,
    "zone": "newspaper",
    "n": 20,
    "s": start,
    "encoding": "json",
    "l-decade": l_decade,
}

response = requests.get(root_url, reqeust_params_advanced)
if response.status_code == 200:
    records = response.json()["response"]["zone"][0]["records"]

# show the records we get
records

{'s': '*',
 'n': '1',
 'total': '1',
 'article': [{'id': '2178581',
   'url': '/newspaper/2178581',
   'heading': 'SHIP NEWS.',
   'category': 'Article',
   'title': {'id': '3',
    'value': 'The Sydney Gazette and New South Wales Advertiser (NSW : 1803 - 1842)'},
   'date': '1819-03-06',
   'page': 2,
   'pageSequence': 2,
   'relevance': {'score': '6.024128', 'value': 'very relevant'},
   'snippet': 'On Thursday last arrived the transport ship Surrey, commanded by Captain Raine, the Gentleman whose navigated her from hence to England after she had lost her first Commander, and every other senior Officer',
   'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/2178581?searchTerm=influenza'}]}
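
The decade values used here are simply the year with its last digit dropped (181 for 1810-1819, 195 for 1950-1959). A tiny sketch of that conversion, with hypothetical helper names, is:

```python
def decade_value(year):
    """Return the l-decade value for a year, e.g. 1819 -> 181."""
    return year // 10


def decade_label(value):
    """Return the year span a decade value stands for, e.g. 181 -> '1810-1819'."""
    first_year = value * 10
    return f"{first_year}-{first_year + 9}"


print(decade_value(1819), decade_label(181))
print(decade_value(1955), decade_label(195))
```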

### 5.2 How to construct a complex query? (More Filtering!)

<font size=4.5>

We can use different kinds of searching commands to construct a more complex query term for `q` parameter.
    
* Boolean Operators: `AND`, `OR`, `NOT`
* Punctuation: `"quotation marks"`, `(brackets)`.
* Indexes: `date`, `text`, `fulltext`, etc. For more indexes that can be used in a query, please check [here](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#list-of-supported-indexes).

For more information about constructing a complex search query, please check [here](https://trove.nla.gov.au/help/searching/constructing-complex-search-query).
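
Typing long queries by hand invites mismatched quotes and brackets. As a sketch, the hypothetical helpers below compose query strings from the building blocks listed above (Boolean operators, quotation marks, brackets and the `date` index). They only build strings; whether a given combination returns what you expect still has to be checked against Trove.

```python
def quote_phrase(phrase):
    """Wrap a phrase in double quotes for phrase searching."""
    return f'"{phrase}"'


def date_range(start_year, end_year):
    """Return a date index clause such as date:[1810 TO 1819]."""
    return f"date:[{start_year} TO {end_year}]"


def any_of(*terms):
    """Join terms with OR and wrap them in brackets."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


query_a = all_of("influenza", any_of("America", "Spanish"))
query_b = f"{quote_phrase('Spanish influenza')} {date_range(1918, 1920)}"
print(query_a)
print(query_b)
```

The resulting strings can be passed as the `q` parameter in the same way as hand-written queries.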

<font size=4.5>

Run the following cell to test queries that use different search commands in the `q` parameter. Although we can't compare the returned records at a glance, we can compare the <u>total number of fetched records</u> for each query to get a feel for their filtering effect.

In [17]:
# edit the terms you want to test here
query_terms = [
    "influenza",
    "influenza date:[1810 TO 1819]",  # you must provide a specific date range
    "influenza NOT Spanish",
    "influenza AND America AND Spanish",
    "influenza AND (America OR Spanish)",
    "Spanish influenza",
    '"Spanish influenza"',  # this is a phrase searching
]

# don't modify here
for query_term in query_terms:
    reqeust_params_advanced = {
        "key": api_key,
        "q": query_term,
        "zone": "newspaper",
        "n": 100,
        "s": "*",
        "encoding": "json",
    }

    response = requests.get(root_url, reqeust_params_advanced)
    if response.status_code == 200:
        zone = response.json()["response"]["zone"][0]
        total_num = zone["records"]["total"]
        print("*" * 10)
        print("Query term：", query_term)
        print(f"We got {total_num} results in this query.")

**********
Query term： influenza
We got 1715104 results in this query.
**********
Query term： influenza date:[1810 TO 1819]
We got 1 results in this query.
**********
Query term： influenza NOT Spanish
We got 1680924 results in this query.
**********
Query term： influenza AND America AND Spanish
We got 6197 results in this query.
**********
Query term： influenza AND (America OR Spanish)
We got 143408 results in this query.
**********
Query term： Spanish influenza
We got 34180 results in this query.
**********
Query term： "Spanish influenza"
We got 8625 results in this query.


### 5.3 How to sort your response? (Sorting!)

<font size=4.5>

* `sortby` - The sort order for the results. Both date of publication and relevance are supported. The possible values are <b>dateasc</b>, <b>datedesc</b>, and <b>relevance</b>. The default value is <b>relevance</b>.

<font size=4.5>

Run the following cell to get records sorted from the latest date to the earliest (using <b>datedesc</b>). 

In [18]:
# edit here if you want to change the value of sortby parameter
sortby = "datedesc"

# don't edit below!
start = "*"
query_term = "influenza"

reqeust_params_advanced = {
    "key": api_key,
    "q": query_term,
    "zone": "newspaper",
    "n": 100,
    "s": "*",
    "encoding": "json",
    "sortby": sortby,
}

response = requests.get(root_url, reqeust_params_advanced)
if response.status_code == 200:
    articles = response.json()["response"]["zone"][0]["records"]["article"]

<font size=4.5>

Run the following cell and you will see that the first article in the returned records was published on the latest date Trove can find. 

In [19]:
articles[0]

{'id': '260653327',
 'url': '/newspaper/260653327',
 'heading': 'Stand-out team The airbase operations team push through the challenges of Red Flag, FLTLT Stephanie Anderson writes',
 'category': 'Article',
 'title': {'id': '1672', 'value': 'Air Force News (National : 1997 - 2020)'},
 'date': '2019-03-07',
 'page': 13,
 'pageSequence': 13,
 'relevance': {'score': '6.03076', 'value': 'very relevant'},
 'snippet': 'IN SUPPORT of warfare elements, air- base operations stepped up to ensure the smooth and safe conduct of Exercise Red Flag Nellis.',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/260653327?searchTerm=influenza'}
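
Looking at one record is not proof that the whole page is ordered. Since the `date` values are ISO-style strings (`YYYY-MM-DD`), they sort correctly as plain text, so a quick check is possible. The helper name is hypothetical, and all dates in the sample except the first are made up.

```python
def is_sorted_by_date(records, descending=True):
    """Check that the 'date' values of the records are in order."""
    dates = [record["date"] for record in records]
    return dates == sorted(dates, reverse=descending)


sample_records = [
    {"date": "2019-03-07"},  # date of the article shown above
    {"date": "2018-11-30"},  # made-up date for the demo
    {"date": "2018-01-15"},  # made-up date for the demo
]
print(is_sorted_by_date(sample_records))
print(is_sorted_by_date(sample_records, descending=False))
```

With real data, `is_sorted_by_date(articles)` would confirm the effect of `sortby="datedesc"` on the page you fetched.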

### 5.4 How to control the metadata returned? (More Attributes!)

<font size=4.5>

You may not be satisfied with the level of detail in the response to a basic search:
    
<font color=blue>
    
* How can I get the text of article?
* I don't want that poor-quality OCR text! How can I get the PDF?
    </font>
    
We can use the following parameters to get more details of records:
    
* `reclevel` - Indicates whether to return a full or brief metadata record. The possible values are <b>brief</b> (default) and <b>full</b>. With <b>full</b>, the record includes the link to the PDF version of the article.
* `include` - This parameter is mostly used when requesting a single item in Trove, as a <b>full</b> reclevel already covers the attributes that the `include` parameter offers. However, if you want the full text (OCR text) of newspaper/gazette articles in a search, set it to <b>articletext</b>.

The following table shows different record structures based on different parameters:

| reclevel | include       | record attributes                                                                                                                                                                                                                                                 |
| -------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "brief"  | none          | id, url, heading, category, edition, date, page, pageSequence, snippet, troveUrl, title.id, title.value, relevance.score, relevance.value                                                                                                                         |
| "brief"  | "all"         | same as above                                                                                                                                                                                                                                                     |
| "brief"  | "articletext" | id, url, heading, category, edition, date, page, pageSequence, snippet, troveUrl, title.id, title.value, relevance.score, relevance.value, <font color=red><b>articleText</b></font>                                                                                                            |
| "full"   | none          | id, url, heading, category, edition, date, page, pageSequence, snippet, troveUrl, title.id, title.value, relevance.score, relevance.value, <font color=blue>illustrated, wordCount, correctionCount, listCount, tagCount, commentCount, identifier, trovePageUrl,</font>  <font color=red><b>pdf</b></font>             |
| "full"   | "all"         | same as above                                                                                                                                                                                                                                                     |
| "full"    | "articletext" | id, url, heading, category, edition, date, page, pageSequence, snippet, troveUrl, title.id, title.value, relevance.score, relevance.value, <font color=blue>illustrated, wordCount, correctionCount, listCount, tagCount, commentCount, identifier, trovePageUrl,</font> <font color=red><b>pdf,  articleText</b></font> |

For more explanations for parameters controlling the metadata returned, please check [here](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#parameters-available-when-requesting-a-record), and [here](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#examples-01)  for more examples.

<font size=4.5>

Run the following cell to get the most detailed records in Trove newspaper searching. 

In [20]:
# edit here
reclevel = "full"
include = "articletext"


# don't edit below!
start = "*"
query_term = "influenza"

reqeust_params_advanced = {
    "key": api_key,
    "q": query_term,
    "zone": "newspaper",
    "n": 10,
    "s": "*",
    "encoding": "json",
    "reclevel": reclevel,
    "include": include,
}

response = requests.get(root_url, reqeust_params_advanced)
if response.status_code == 200:
    articles = response.json()["response"]["zone"][0]["records"]["article"]

<font size=4.5>

Run the following cell to check the first record we get. <br> 
The value of the `pdf` attribute leads you to the PDF version of the newspaper page that contains this article.<br>
The value of the `articleText` attribute holds the OCR text of this article in HTML format. (Save this text into an HTML file to see a "normal view" of the article OCR text.)

In [21]:
articles[0]

{'id': '174293919',
 'url': '/newspaper/174293919',
 'heading': 'INFLUENZA. INFLUENZA RESTRICTIONS.',
 'category': 'Article',
 'title': {'id': '840',
  'value': 'The Telegraph (Brisbane, Qld. : 1872 - 1947)'},
 'edition': 'SECOND EDITION',
 'date': '1919-08-08',
 'page': 5,
 'pageSequence': '5 S',
 'relevance': {'score': '233.26903', 'value': 'very relevant'},
 'snippet': 'The influenza restrictions have, been removed from nearly all, the districts in which they were imposed, but they have now been applied to the shire of Eacham',
 'troveUrl': 'https://trove.nla.gov.au/ndp/del/article/174293919?searchTerm=influenza',
 'illustrated': 'N',
 'wordCount': 42,
 'correctionCount': 0,
 'listCount': 0,
 'tagCount': 0,
 'commentCount': 0,
 'identifier': 'https://nla.gov.au/nla.news-article174293919',
 'trovePageUrl': 'https://trove.nla.gov.au/ndp/del/page/19078175',
 'pdf': 'https://trove.nla.gov.au/ndp/imageservice/nla.news-page19078175/print',
 'articleText': '<p><span>  INFLUEKZA.</span></p>
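
If you would rather have plain text than HTML, the markup in `articleText` can be stripped with the standard library. The sketch below assumes the text arrives as `<p>` paragraphs containing `<span>` elements, as in the fragment above; `TextExtractor` and `html_to_text` are hypothetical names, and the second paragraph of the sample is made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of an HTML fragment, one entry per <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current = []

    def handle_data(self, data):
        self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            # Collapse runs of whitespace left over from the markup
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = []


def html_to_text(fragment):
    """Return the paragraphs of an HTML fragment joined by newlines."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "\n".join(parser.paragraphs)


sample_html = "<p><span>  INFLUEKZA.</span></p><p><span>Second  line.</span></p>"
print(html_to_text(sample_html))
```

OCR mistakes such as "INFLUEKZA" are kept as they are; cleaning them up is a separate task.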

## 6. How to capture more records?

<font size=4.5 color=blue>
    
In this part, we will see how to capture more records by passing the value of the `nextStart` attribute to the `s` parameter iteratively.

In [22]:
# set the number of records you want to harvest
# please set a number that is an exact multiple of the n parameter
harvest_num = 1500

# request parameters setting as usual
start = "*"
query_term = "influenza"

reqeust_params_more = {
    "key": api_key,
    "q": query_term,
    "zone": "newspaper",
    "n": 100,  # the maximum value of n is 100
    "s": start,
    "encoding": "json",
}

<font size=4.5>

Run the following cell to harvest more records after you construct the query you want to try.

In [23]:
# collect the records
articles_more = []

# quit the iteration when there's no next page or we get enough records
while len(articles_more) < harvest_num and start:
    response = requests.get(root_url, reqeust_params_more)
    if response.status_code == 200:
        zone = response.json()["response"]["zone"][0]
        records = zone["records"]
        articles_more.extend(records["article"])
        # get s parameter for next page (if there is)
        start = records.get("nextStart")
        # update s parameter in request parameters
        reqeust_params_more["s"] = start
    else:
        # report the failure and stop instead of requesting the same page forever
        print(response.status_code)
        break

print(f"There are {records['total']} record found in this request.")
print(f"You got {len(articles_more)} records in this request. ")

There are 1715104 record found in this request.
You got 1500 records in this request. 
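
The paging logic can be separated from the network call, which makes it easy to try out offline. In the sketch below, `harvest` is a hypothetical function that follows `nextStart` cursors through whatever `fetch_page` callable it is given. Here `fetch_page` is a lookup in a fake three-page dictionary, so nothing is sent to Trove.

```python
def harvest(fetch_page, limit, start="*"):
    """Collect up to `limit` records by following nextStart cursors.

    fetch_page(s) must return the "records" dict for cursor s,
    or None if the request failed.
    """
    collected = []
    while len(collected) < limit and start:
        records = fetch_page(start)
        if records is None:
            break  # stop instead of retrying the same page forever
        collected.extend(records.get("article", []))
        start = records.get("nextStart")
    # Trim the surplus so `limit` need not be a multiple of the page size
    return collected[:limit]


# Fake three-page source used only to demonstrate the loop offline
FAKE_PAGES = {
    "*": {"article": [1, 2, 3], "nextStart": "page2"},
    "page2": {"article": [4, 5, 6], "nextStart": "page3"},
    "page3": {"article": [7]},  # last page: no nextStart
}

print(harvest(FAKE_PAGES.get, limit=5))
print(harvest(FAKE_PAGES.get, limit=100))
```

For real harvesting, `fetch_page` would wrap the `requests.get` call and return `zone["records"]` on status 200 or `None` otherwise.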


## 7. Store your grains

<font size=4.5 color=blue>
    
In this part, we will see how to store our harvested records as a JSON file on disk.<br>
(In the previous tests, the harvested records are just stored in memory.)

In [24]:
# record the time you do this query
timestamp = time_now()
# set your output information
output_folder = "./"
output_filename = f"trove_harvest_{timestamp}_records.json"

# save your harvested  records
with open(output_folder + output_filename, "w") as j:
    json.dump(articles_more, j, indent=4)
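
A slightly more defensive variant is sketched below. It creates the output folder if it is missing, writes UTF-8 explicitly, and keeps non-ASCII characters readable with `ensure_ascii=False`. The function names are hypothetical, and the demo writes to a temporary directory so it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, folder, filename):
    """Write records as UTF-8 JSON and return the path that was written."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / filename
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(records, f, indent=4, ensure_ascii=False)
    return path


def load_records(path):
    """Read records back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    demo = [{"id": "258999523", "heading": "Defence acts on COVID-19 pandemic"}]
    saved = save_records(demo, Path(tmp) / "harvest", "demo_records.json")
    round_trip = load_records(saved)

print(round_trip == demo)
```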

## 8. A regular workflow

<font size=4.5 color=blue>
    
Finally, we will integrate all the steps we've tried and build up a regular workflow for Trove Harvesting.<br>
    
![A Regular Workflow for Trove Harvesting](./img/trove_harvesting_workflow.png)
    </font>

In [25]:
# set the request parameters here
query_terms = "musical"  # Try to use more searching commands here!
zones = "newspaper"
num_per_page = 100
sort_by = "relevance"
reclevel = "full"
include = "articletext"

# set the harvest number you want here
harvest_num = 1000

# record the time you do this query
timestamp = time_now()

###
# set your output information
###

# set your output folder
output_folder = "./"
# set the name of your json file storing harvested records
output_records = f"trove_harvest_{timestamp}_records.json"
# set the name of your json file storing  request information
output_requestInfo = f"trove_harvest_{timestamp}_requestInfo.json"

In [26]:
# always start from the first page
start = "*"

reqeust_params_regular = {
    "key": api_key,
    "q": query_terms,
    "zone": zones,
    "n": num_per_page,
    "s": start,
    "sortby": sort_by,  # the Trove parameter is named `sortby`
    "reclevel": reclevel,
    "include": include,
    "encoding": "json",
}

# collect the records
articles = []

while len(articles) < harvest_num and start:
    response = requests.get(root_url, reqeust_params_regular)
    if response.status_code == 200:
        zone = response.json()["response"]["zone"][0]
        records = zone["records"]
        articles.extend(records["article"])
        # get s parameter for next page (if there is)
        start = records.get("nextStart")
        # update s parameter in request parameters
        reqeust_params_regular["s"] = start
    else:
        # report the failure and stop instead of requesting the same page forever
        print(response.status_code)
        break

# save information of searching
request_info = {}
request_info["query"] = response.json()["response"]["query"]
request_info["zone_name"] = zone["name"]
request_info["total_num"] = records["total"]

print(f"There are {records['total']} record found in this request.")
print(f"You got {len(articles)} records in this request. ")

# save your harvested  records
with open(output_folder + output_records, "w") as j:
    json.dump(articles, j, indent=4)
print(f"{output_folder + output_records} saved!")

# save your request information
with open(output_folder + output_requestInfo, "w") as j:
    json.dump(request_info, j, indent=4)
print(f"{output_folder + output_requestInfo} saved!")

There are 10435590 record found in this request.
You got 1000 records in this request. 
./trove_harvest_202109060320_records.json saved!
./trove_harvest_202109060320_requestInfo.json saved!


## The End?

### Bonus1: Wait! How can I use these json files?!

<font size=4.5>

Many kinds of tools can handle these JSON files, from software like Excel and Tableau to programming languages like R and Python. You can choose the tool you're most familiar with. <br>
* If you are more comfortable with Excel, try loading the JSON file into Excel by following [this post](https://allthings.how/how-to-convert-json-to-excel/).<br>
* If you still want to use Python as your analysis tool, the following cell can help you load the JSON file using Python.

In [27]:
import json

input_folder = "./sample/"
j_name = "trove_harvest_202109060042_records.json"

with open(input_folder + j_name) as j:
    articles_get = json.load(j)

### Bonus2: I don't want json!! Give me a human readable output file!

<font size=4.5>
This notebook only uses JSON because <u>it focuses on how to use the Trove API to accurately harvest the large number of records you want</u>. Converting JSON is more of a data wrangling problem.<br>
What's more, some JSON files have several levels of nesting that cannot be completely converted into table-like files such as CSV and Excel files.<br>
<font color=red>We highly recommend that you first store all the records you get in JSON files, which keep all of the information intact, and then convert only the attributes you want into a flat format.</font>
    
<br><br>
If you really want a shortcut in Python, you can try [pandas.json_normalize()](https://pandas.pydata.org/pandas-docs/version/1.2.0/reference/api/pandas.json_normalize.html) to flatten your JSON (this works best when the nesting is no more than two levels deep).

In [28]:
import json

import pandas as pd

input_folder = "./sample/"
j_name = "trove_harvest_202109060042_records.json"

with open(input_folder + j_name) as j:
    articles_get = json.load(j)

pd.json_normalize(articles_get)

Unnamed: 0,id,url,heading,category,date,page,pageSequence,snippet,troveUrl,illustrated,...,title.id,title.value,relevance.score,relevance.value,supplement,section,pageLabel,lastCorrection.by,lastCorrection.lastupdated,edition
0,162389066,/newspaper/162389066,MUSICAL. MUSICAL NOTES.,Article,1898-02-19,45,45,In accordance with general expectations Saturd...,https://trove.nla.gov.au/ndp/del/article/16238...,N,...,821,Adelaide Observer (SA : 1843 - 1904),214.46603,very relevant,,,,,,
1,90545756,/newspaper/90545756,MUSICAL. MUSICAL EDUCATION.,Article,1884-10-25,1,1 S,Hand in hand I would willingly go with the mus...,https://trove.nla.gov.au/ndp/del/article/90545...,N,...,74,Launceston Examiner (Tas. : 1842 - 1899),214.12755,very relevant,Supplement to the Launceston Examiner.,,,,,
2,261972892,/newspaper/261972892,Nostalgic musical Musical Countdown the Musical,Article,1998-11-27,6,6 S,IT was a case of turning back the clock for a ...,https://trove.nla.gov.au/ndp/del/article/26197...,Y,...,1685,"The Australian Jewish News (Melbourne, Vic. : ...",209.74936,very relevant,,What's On,6.0,,,
3,100253743,/newspaper/100253743,MUSICAL.,Article,1912-08-31,5,5,"Mr. Harold F. Burstall, A.T.C.L., late of St. ...",https://trove.nla.gov.au/ndp/del/article/10025...,N,...,422,Forbes Times (NSW : 1899 - 1920),205.68875,very relevant,,,,,,
4,100546064,/newspaper/100546064,MUSICAL.,Article,1905-10-20,2,2,To citizens of Goulburn the young lady who mad...,https://trove.nla.gov.au/ndp/del/article/10054...,N,...,367,Goulburn Herald (NSW : 1881 - 1907),205.68875,very relevant,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,121806293,/newspaper/121806293,Musical.,Article,1901-08-07,10,10 S,"Signer Louis Aisoll, the well-Known vio-linist...",https://trove.nla.gov.au/ndp/del/article/12180...,N,...,499,"Referee (Sydney, NSW : 1886 - 1939)",204.68811,very relevant,,,,,,Edition 2
996,121807440,/newspaper/121807440,Musical.,Article,1901-06-12,10,10 S,Mdme. Bello Colo left for Melbourne on Sunday ...,https://trove.nla.gov.au/ndp/del/article/12180...,N,...,499,"Referee (Sydney, NSW : 1886 - 1939)",204.68811,very relevant,,,,,,Edition 2
997,121807813,/newspaper/121807813,Musical.,Article,1901-04-17,10,10 S,Mdlle. Antonia Dolores has appeared in three c...,https://trove.nla.gov.au/ndp/del/article/12180...,N,...,499,"Referee (Sydney, NSW : 1886 - 1939)",204.68811,very relevant,,,,,,Edition 2
998,121813389,/newspaper/121813389,Musical.,Article,1900-08-29,10,10,It is reported in a London paper that a resolu...,https://trove.nla.gov.au/ndp/del/article/12181...,N,...,499,"Referee (Sydney, NSW : 1886 - 1939)",204.68811,very relevant,,,,,,
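
To go from the flattened table to a file that opens in a spreadsheet, one option is to keep only the columns you need and export them as CSV. The sketch below uses two made-up records shaped like the harvested articles; the values and the choice of columns are illustrative assumptions, not harvested data.

```python
import pandas as pd

# Two made-up records shaped like the harvested articles (illustrative values only).
sample_articles = [
    {"id": "1", "heading": "MUSICAL.", "date": "1912-08-31",
     "title": {"id": "422", "value": "Forbes Times"}},
    {"id": "2", "heading": "Musical.", "date": "1901-08-07",
     "title": {"id": "499", "value": "Referee"}, "edition": "Edition 2"},
]

# Nested dictionaries become dotted column names such as "title.value".
flat = pd.json_normalize(sample_articles)

# Keep only the attributes you need before exporting.
wanted = ["id", "heading", "date", "title.value"]
table = flat[wanted]

# Without a file path, to_csv returns the CSV text; index=False drops the row numbers.
csv_text = table.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument of `to_csv` writes the same content to disk instead of returning it.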


### Bonus3: I don't want my article text in the cell! How to free it? 

<font size=4.5>
    
Again, this is more like a data wrangling problem. 
    
The key idea is to read your JSON file and save the article text (the `articleText` attribute) of each record into its own <b>html</b> file. Don't forget to put the `id` attribute into the name of each html file, so that the text can be traced back to its metadata through the value of `id`.

In [29]:
import json

input_folder = "./sample/"
j_name = "trove_harvest_202109060042_records.json"
output_folder = "./"

with open(input_folder + j_name) as j:
    articles_get = json.load(j)

In [62]:
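for article in articles_get:
    article_text = article.get("articleText")
    # a record may have no articleText; skip it instead of failing on write
    if article_text is None:
        continue
    html_name = f"newspaper_{article.get('id')}.html"
    with open(output_folder + html_name, "w", encoding="utf-8") as h:
        h.write(article_text)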
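
Tracing a saved html file back to its metadata could look like the following minimal sketch. The helper `metadata_for` and the lookup dictionary `articles_by_id` are hypothetical names; the two sample records reuse ids, headings and dates from the table shown earlier.

```python
from pathlib import Path

# Sample records; in practice these come from the harvested JSON file.
sample_articles = [
    {"id": "162389066", "heading": "MUSICAL. MUSICAL NOTES.", "date": "1898-02-19"},
    {"id": "100253743", "heading": "MUSICAL.", "date": "1912-08-31"},
]

# Index the metadata by id for quick lookup.
articles_by_id = {str(a["id"]): a for a in sample_articles}


def metadata_for(html_path):
    """Return the record whose id is embedded in a 'newspaper_<id>.html' file name."""
    article_id = Path(html_path).stem.split("_", 1)[1]
    return articles_by_id.get(article_id)


print(metadata_for("./newspaper_100253743.html")["date"])
```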

### Bonus4: I need the pdf too! How to free it? 

<font size=4.5>
    
Same idea as saving the OCR article text! This time, we use the `pdf` attribute.<br>
Since the pdf files have to be downloaded from the web, we also import the `request` module from Python's built-in `urllib` library.

In [66]:
import json
from urllib import request

input_folder = "./sample/"
j_name = "trove_harvest_202109060042_records.json"
output_folder = "./"

with open(input_folder + j_name) as j:
    articles_get = json.load(j)

In [None]:
for article in articles_get:
    pdf_link = article.get("pdf")
    # a record may carry no pdf link; skip it instead of failing on download
    if not pdf_link:
        continue
    pdf_name = f"newspaper_{article.get('id')}.pdf"
    request.urlretrieve(pdf_link, output_folder + pdf_name)
    print("successfully downloaded: " + pdf_link)
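
When a download run is interrupted, it can help to work out first which files are still missing. The sketch below builds such a list without touching the network; `plan_downloads` is a hypothetical helper, and the records and URLs are placeholders rather than real Trove links.

```python
import os


def plan_downloads(articles, output_folder="./"):
    """Return (url, path) pairs for records that have a pdf link and no local copy yet."""
    plan = []
    for article in articles:
        pdf_link = article.get("pdf")
        if not pdf_link:
            continue  # no pdf attribute in this record
        path = os.path.join(output_folder, f"newspaper_{article.get('id')}.pdf")
        if os.path.exists(path):
            continue  # already downloaded in an earlier run
        plan.append((pdf_link, path))
    return plan


# Made-up records; the URL is a placeholder, not a real Trove link.
sample_articles = [
    {"id": "1", "pdf": "https://example.org/page1.pdf"},
    {"id": "2"},
]
plan = plan_downloads(sample_articles, output_folder="./pdf_sketch_output")
print(plan)
```

Each pair in `plan` can then be passed to `request.urlretrieve(url, path)`, ideally with a short pause between downloads.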

### Bonus5: A few more things if you are confident with the contents above...

<font size=4.5>

1. How can I use these harvested records for analysis?<br>
2. The pdf I get from the Trove API is a full newspaper page. How can I get a pdf of a single article?<br>
3. How can I improve the harvesting efficiency? <br>
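
As one possible starting point for the first question, the `date` attribute alone already supports a simple count of articles per decade. This is only a sketch: the dates below are copied from the sample table shown earlier, and `per_decade` is a made-up name.

```python
from collections import Counter

# Dates taken from the sample rows shown earlier (format: YYYY-MM-DD).
sample_articles = [
    {"date": "1898-02-19"}, {"date": "1884-10-25"}, {"date": "1998-11-27"},
    {"date": "1912-08-31"}, {"date": "1905-10-20"}, {"date": "1901-08-07"},
    {"date": "1901-06-12"}, {"date": "1901-04-17"}, {"date": "1900-08-29"},
]

# The first three characters of the year identify the decade, e.g. "190" -> "1900s".
per_decade = Counter(a["date"][:3] + "0s" for a in sample_articles)

for decade, count in sorted(per_decade.items()):
    print(decade, count)
```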

## More for Trove API

### [Get information about a single item in Trove.](https://trove.nla.gov.au/about/create-something/using-api/api-technical-guide#api-get-metadata-records)

### [Look up other associated data in Trove](https://trove.nla.gov.au/about/create-something/using-api/api-version-2-technical-guide#api-look-up-associated-data)