# CAS API Document Module

The CAS API supports workflow integration, chemical research, machine learning, and cheminformatics.  Learn more about workflow integration solutions from __[CAS Custom Services℠](https://www.cas.org/cas-custom-services).__

This documentation is provided as a demonstration of some of the capabilities available in the Document Module of the CAS API.  Please contact CAS Custom Services for more information about licensing.  The __[Swagger page](https://helium.cas.org/integration/v1/api-docs/)__ for multiple endpoints is also available as reference.

We will be accessing the [CAS Reference Collection](https://www.cas.org/cas-data/cas-references).

__[Terms of service](https://brandcdn.cas.org/G2HB3KG8/as/kwvzg9mcr7tkkct5fnm35nk/cas-api-content-terms)__

### Table of Contents
[Getting Started](#Getting_Started)

[Document Lookup by Document Identifier](#Document_Lookup)
- [Construct Query](#Construct_Query)
- [Display Document Information](#Display)
- [Substance Count](#Substance_Count)
- [Substances Associated with a Document](#Substances_Associated_with_a_Document)

[Document Details](#Document_Details)

[Document Search by Structure](#Document_Search_by_Structure)

[Appendix](#Appendix)
- [Document Request](#Document_Request)
- [Document Response](#Document_Response)
- [Document Summary JSON](#Document_Summary)
- [Available Facets](#Document_Facets)

<a id="Getting_Started"></a>
## Getting Started
<a id="Authentication"></a>
### Authentication
__To begin utilizing the CAS API, you will first need valid credentials.__  The CAS API handles authorization via __[OAuth2.0](https://oauth.net/2/)__.  The easiest way to authenticate is with a _client ID_ and _client secret_ which can be provided by CAS Custom Services.

To make any request to the API you will need an access token. The access token represents the authorization to access specific parts of CAS' data.  There are several modules available to CAS Integration API users.  Depending on your specific setup, you will have access to all or some of them.  For questions about your modules, contact CAS Custom Services.
<a id="Import_Libraries"></a>
### Import libraries

In [None]:
import http.client
import requests
from pprint import pprint
import json
import urllib.parse

<a id="Set_Credentials"></a>
### Set your supplied credentials from CAS

You will need to enter your company's client ID and client secret provided by CAS Custom Services.


In [None]:
# Input Client ID and Client Secret credentials supplied to you by CAS

clientID = input("Please input your client ID.")
secret = input("Please input your client secret.")

<a id="Authorization_Method"></a>
### Authorization Method

_The recommended method requires users to store their client ID and client secret as variables in the code that runs the workflow. The workflow submits the user's credentials programmatically to obtain an Access Token, which can be used for the duration of the workflow's execution._

_If the workflow is expected to run for longer than 24 hours, the user must add code that periodically renews the Access Token before it expires. Failure to do so may result in some or all of the requests after the 24-hour period returning a 401 error._
<a id="Request_Access_Token"></a>
> **Request your Access Token** 

**Definitions**
> **Client ID** is an Organizational-level ID.  Used to initially validate the Organization of a User of the CAS API.  This ID will be provided by a member of CAS Custom Services upon start of the contract.
>
> **Client Secret** is an Organizational-level password/secret.  Used to initially authenticate the Organization of a User of the CAS API.  This secret will be provided by a member of CAS Custom Services upon start of the contract.
>
> **Access Token** (_Also known as the Bearer Token_) is a daily token allowing access to the CAS API.  The Access token is passed with a POST or GET request to allow a User to authenticate requests and retrieve data.  This token is valid for _**24 hours**_ from its creation or until a new Access Token is generated via a new request.
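
Because the Access Token is valid for 24 hours, a long-running workflow needs some way to renew it before it expires. The following is a minimal sketch of one possible approach and is not part of the CAS API: `TokenCache`, `fetch_token`, and the five-minute safety margin are hypothetical names and values introduced for illustration. `fetch_token` stands in for any function that performs the token request and returns the `access_token` string.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Cache an access token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], str],
                 lifetime_s: float = 24 * 60 * 60,
                 margin_s: float = 5 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_token = fetch_token  # hypothetical callable performing the token POST
        self._lifetime_s = lifetime_s    # 24 hours, per the Access Token definition
        self._margin_s = margin_s        # assumed safety margin; adjust as needed
        self._clock = clock              # injectable clock, which makes testing easier
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return the cached token, renewing it first if it is missing or nearly expired."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin_s:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime_s
        return self._token
```

A workflow could then call `cache.get()` each time it builds the `Authorization` header instead of reusing a single token variable. If the token response includes an `expires_in` value, as OAuth 2.0 token responses commonly do, that value could replace the fixed 24-hour lifetime assumed here.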

In [None]:
# specify the content type in the header
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

# insert the credentials specified above into the payload sent to the OAuth token endpoint
payload = ('grant_type=client_credentials&client_id=' + clientID + '&scope=cas_content%20&client_secret='
            + secret)

# create a connection
conn = http.client.HTTPSConnection("sso.cas.org")

# make our request
conn.request("POST", "/as/token.oauth2", payload, headers)

# retrieve the response and read it
res = conn.getresponse()
data = res.read()

# read the json and store in a dictionary
response = data.decode("utf-8")
response_json = json.loads(response)

# print POST Response (JSON)
pprint(response_json)
print()

# hold onto this token, we'll need it for future requests to Integration API
access_token = response_json['access_token']

<a id="Document_Lookup"></a>
## Use Case: Document Lookup by Document Identifier & Document Details

We are interested in searching for a specific patent.  We would like to know who requested patent protection and obtain other important details, such as the abstract.


<a id="Construct_Query"></a>
### Construct our Query
We will look for a specific patent, 'WO2018237389', by running a query against the document collection.  You can also use other document identifiers such as DOI, Patent Application Number, Accession Number, CAS Accession Number, or PubMed ID.


In [None]:
# make sure our access token is present on requests and that we will receive JSON in return
headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer {0}'.format(access_token)}
base_url = 'https://helium.cas.org/integration/v1'

# select the document module
module = 'documents'

# interested in the following patent number
patent_number = 'WO2018237389'

offset = 0
length = 5

# create request url
docRequest = f'{base_url}/{module}?q={patent_number}&offset={offset}&length={length}&echo=false'

# make request (remember we are reusing our headers with our access token from above)
resp = requests.get(docRequest, headers=headers)

# get the json from the response
response_json = resp.json()

pprint(response_json, depth=5)

There are 5 records related to this patent; they represent patent family members.  We will display information about each one.

<a id="Display"></a>
### Display the document-uri, title, patent publication number, and publication date


In [None]:
# pull out the document information
for document in response_json['documents']:
    print(f"Document URI: {document['uri']}")
    print(f"Title: {document['title']}")
    print(f"Patent Number: {document['patentPublicationNumber']}")
    print(f"Publication Date: {document['publicationDate']}\n")


<a id="Substance_Count"></a>
### Substance Count

In order to obtain the substance count for a specific document we will need to extract the document uri and build a query for a different endpoint.

In [None]:
# obtain the first document-uri from the above response_json
doc_uri = response_json['documents'][0]['uri']
patent_number = response_json['documents'][0]['patentPublicationNumber']
print(patent_number)

# URL-encode the doc_uri so it can be used as a path segment
doc_uri = urllib.parse.quote(doc_uri, safe='')

# request the count of substances associated with this document but no answers (length=0)
# create request url (this is a different endpoint)
count_req = f'{base_url}/{module}/{doc_uri}/substances?offset={offset}&length=0&echo=true'

# make request (remember we are reusing our headers with our access token from above)
resp = requests.get(count_req, headers=headers)

# get the json from the response
sub_count = resp.json()

print()
print('NUMBER OF SUBSTANCES:' + str(sub_count['count']))

<a id="Substances_Associated_with_a_Document"></a>
### Substances Associated with a Document

Once you have a document of interest, you may want to know which substances CAS has determined to exhibit the novelty of the paper or invention.  Let's use the previous example and build this endpoint.

In [None]:
# reuse the URL-encoded document uri (doc_uri) from the previous cell
# change length to 100 to retrieve up to 100 substances
length = 100

# create request url (same substances endpoint, this time returning answers)
req_url = f'{base_url}/{module}/{doc_uri}/substances?offset={offset}&length={length}&echo=true'

# make request (remember we are reusing our headers with our access token from above)
resp = requests.get(req_url, headers=headers)

# get the json from the response
response_json = resp.json()

# parse substances from json
for substance in response_json['substances']:
    print(substance['casRn'] + "||" + str(substance['name']))


<a id="Document_Search_by_Structure"></a>
## Use Case: Document Search by Structure

You can also use a structure input, such as a SMILES string, to run a structure search across all structures within documents.  We can parse the response to get all of the documents returned by that search, or simply print the document count.  For brevity, only 2 documents are output.

In [None]:
# our target structure
target_smiles = ('CC(C)(C)[Si](C)(C)O[C@H]1C[C@@H](O)C1')

#encode the target_smiles
target_smiles = urllib.parse.quote(target_smiles, safe='')

#specify an "As Drawn" search
structure_search_type = 'drw'

request_url = f'{base_url}/{module}?str={target_smiles}&strMode={structure_search_type}'

# make request
resp = requests.get(request_url, headers=headers)

# reusing the response_json variable
response_json = resp.json()

print('Document Count: ' + str(response_json['count']))
truncated_resp = response_json['documents'][:2]
pprint(truncated_resp)

<a id="Appendix"></a>
# Appendix #
<a id="Document_Request"></a>
## Document Request ##
A request for Document information is sent to the endpoint `https://helium.cas.org/integration/v1/documents` and can be defined using the following parameters:

>**q** is a *query string* containing the DOI (Digital Object Identifier), Patent Number, Patent Application Number, Accession Number, CAS Accession Number, or PubMed ID to search by.
> *This parameter can only be used if the `str` parameter is absent*

>**str** is a *chemical structure query*.
> This query can be a molfile, SMILES, InChI, or CXF JSON (exported by the structure editor in SciFinder or STN). Maximum length is 20000.
> *Note that this must be a URL-encoded string*
> *This parameter can only be used if the `q` parameter is absent*

>**strMode** is the *type of structure search* to be conducted.
> There are three options: `drw` (As Drawn), `sub` (Substructure), or `sim` (Similarity).
> *This parameter can only be used if the `str` parameter is defined*
> _Default: `drw`_

>**offset** is the *offset* from the start of the result set to be returned in the response.
> This offset indicates the ordinal at which to start returning results.  Together with the `length` parameter, it defines a "page" of answers.
> _Default: 0, minimum: 0_

>**length** is the *maximum number of results* to be returned in the response.
> This length indicates the maximum number of results to return starting at the 'offset' parameter.
> _Default: 100, minimum: 0, maximum: 100_

>**echo** is an indication of whether the response should contain the request parameters in the response.
> When set to true the response will contain a `request` attribute
> _Default: false_

>**strHighlight** is an indication of whether *structure highlighting* should be shown in the structure images (svg) returned
> When set to `true`, the structure images (svg) returned will be highlighted.
> _This parameter can only be used if the `str` parameter is defined_
> _Default: false_

>**facet** is an indication of whether to return the *facets* in the response.
> Facets summarize the results by slicing into facet bins.  Each facet bin will indicate the number of results in that bin.
> When set to true, the response will contain a list of faceted bins.
> _Default: false_

>**fq** is an array of *facet queries*.
> Each facet query filters the results. Multiple filters can be combined (AND'ed) to make the constraints more restrictive (including more filters typically returns fewer results). For example, `fq=Document Type:Patent` restricts the results to patents; see [Available Facets](#Document_Facets).

### Example ###
```
https://helium.cas.org/integration/v1/documents?q=WO2018237389 \
    &offset=0 \
    &length=10 \
    &echo=false \
    &facet=false
```
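
The parameter rules listed above can be checked before a request is sent. The following is a minimal sketch under stated assumptions: `build_document_url` is a hypothetical helper, not part of the CAS API, and passing multiple `fq` values as repeated `fq=` parameters is an assumption about how the array is serialized. It relies only on `urllib.parse` from the standard library.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://helium.cas.org/integration/v1'
STRUCTURE_MODES = {'drw', 'sub', 'sim'}


def build_document_url(q=None, structure=None, str_mode=None,
                       offset=0, length=100, echo=False, facet=False, fq=()):
    """Build a /documents URL while enforcing the documented parameter rules."""
    if q is not None and structure is not None:
        raise ValueError('q and str cannot be used together')
    if str_mode is not None:
        if structure is None:
            raise ValueError('strMode can only be used when str is defined')
        if str_mode not in STRUCTURE_MODES:
            raise ValueError(f'unknown strMode: {str_mode}')
    if offset < 0 or not 0 <= length <= 100:
        raise ValueError('offset must be >= 0 and length must be within 0..100')

    params = []
    if q is not None:
        params.append(('q', q))
    if structure is not None:
        params.append(('str', structure))
    if str_mode is not None:
        params.append(('strMode', str_mode))
    params += [('offset', offset), ('length', length),
               ('echo', str(echo).lower()), ('facet', str(facet).lower())]
    # Assumption: each facet query is sent as its own fq parameter.
    params += [('fq', value) for value in fq]
    # quote_via=quote percent-encodes spaces as %20 rather than '+'.
    return f'{BASE_URL}/documents?' + urlencode(params, quote_via=quote)
```

For example, `build_document_url(q='WO2018237389', length=10)` produces the same URL as the example above, while combining `q` with `str_mode` raises a `ValueError` before any request is made.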

<a id="Document_Response"></a>
## Document Search Response JSON ##
Results of a document search.  The response is in [JSON](https://en.wikipedia.org/wiki/JSON#Syntax) format.

The Search Response utilizes an additional important construct, a [Document Summary](#Document_Summary).

>**count** is the *number of results* returned from the query.
> *This count can be 0 or more*
> *Note: This count is not the number of `documents` being returned in this response.*

>**facet** is an array of *facets* for the result set.
> Facets contain facet bins which are subdivided into small slices based on the values. Facets summarize the results by slicing into facet bins.  Each facet bin will indicate the number of results in that bin.
> *Note: This is only present if the request has the `facet` parameter set to `true`*

>**request** is an array of *request parameters* from the original request.
> The request parameters are defined above in the [Document Request](#Document_Request) Appendix
> *Note: This is only present if the request has the `echo` parameter set to `true`*

>**documents** is an array of *document summaries* returned in the response.
> The number of `documents` is determined by the `length` parameter on the request or the count, whichever is less.

### Example ###
```
{'count': 1,
 'documents': [...]
 }
```
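
Because `count` can exceed the number of `documents` in a single response, retrieving every result means requesting successive pages with `offset` and `length`. The following is a minimal sketch of that loop: `fetch_all_documents` and `fetch_page` are hypothetical names, where `fetch_page(offset, length)` stands in for any function that issues the request and returns the parsed response JSON. Whether the API limits how deep a result set can be paged is not covered here.

```python
def fetch_all_documents(fetch_page, length=100):
    """Collect every document summary by requesting one page at a time."""
    if not 1 <= length <= 100:
        raise ValueError('length must be within 1..100 when paging')

    # The first response tells us how many results exist in total.
    first = fetch_page(0, length)
    documents = list(first['documents'])

    # Request the remaining pages, advancing the offset by one page each time.
    for offset in range(length, first['count'], length):
        documents.extend(fetch_page(offset, length)['documents'])
    return documents
```

Taking `fetch_page` as an argument keeps the paging logic separate from the HTTP call, so the same loop can be exercised with a stub before it is pointed at the live endpoint.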

<a id="Document_Summary"></a>
## Document Summary JSON

A Document Summary contains important bibliographic information about a document.

>**uri** is a unique key specific to the Integration API for this document. This uri can be used to retrieve a detail record for a document or the substances within this document.

>**title** is the title of the document.

>**primaryType** is the document type.  This can be a multitude of values; however, the most common are 'Patent' and 'Journal'.

>**abstract** is a summary of the contents of the document.

>**publicationTitle** is the title of the publication, typically present for non-patents. (Ex. Journal of the American Chemical Society)

>**database** is the collection which contains the document.

>**doi** is a persistent identifier or handle used to identify documents uniquely.
  
>**accessionNumber** is a document identifier used to represent a document as it enters the CAplus database.

>**casAccessionNumber** is a document identifier used to represent a document which has been completely indexed.

>**pubmedID** is a document identifier associated with a document from the Medline collection.

>**publicationDate** is the date on which the document was published.

>**publicationYear** is the year in which the document was published.

>**volume** is the number of years a journal has been in publication.

>**issue** is the number of individual publications during the year.

>**pages** are the numbers or marks used to indicate the sequence of pages of a document.

>**authors** represents the writer(s) of the document.

>**companyOrganization** represents the affiliation of the author.

>**inventors** represents the corporate body or person who invented the subject matter of a patent.

>**assignees** represents the corporate body or person who owns the industrial property rights of a patent.

>**patentOffice** is the country in which the patent was filed.

>**patentPublicationNumber** is a number identifier that is assigned to a patent application when it is published.

>**patentApplicationNumber** is a number identifier that is assigned to a patent application when it is filed.

>**patentKindCode** is a letter, and in many cases a number, used to distinguish the kind of patent document. (Ex. A1)
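
Some of these fields apply only to certain document types (for example, `publicationTitle` is typically present for non-patents), so code that reads a summary should probably not assume every key exists. The following is a minimal sketch of defensive access: `summarize_document` is a hypothetical helper, and the fallback strings are arbitrary choices made for illustration.

```python
def summarize_document(document):
    """Return a one-line summary of a Document Summary dict, tolerating absent fields."""
    # Prefer the patent number, then fall back to other identifiers if present.
    identifier = (document.get('patentPublicationNumber')
                  or document.get('doi')
                  or document.get('accessionNumber')
                  or 'n/a')
    fields = [
        document.get('primaryType') or 'Unknown type',
        identifier,
        document.get('publicationDate') or 'n/a',
        document.get('title') or '(no title)',
    ]
    return ' | '.join(str(field) for field in fields)
```

Using `dict.get` with a fallback avoids a `KeyError` when a field is missing, at the cost of hiding gaps in the data; a stricter workflow might prefer to log missing fields instead.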




<a id="Document_Facets"></a>
## Available Facets 

<div class="alert alert-block alert-success">
<b>Tip:</b> Facets need to be URL encoded.
</div>

>Use of a facet within a query is constructed as follows:
>>Facet Name:Value
>>Example: facet=true&fq=Document Type:Patent (Encoded: facet=true&fq=Document%20Type%3APatent)
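
The encoded form shown above can be produced with the standard library. This is a minimal sketch: `encode_facet_query` is a hypothetical helper name, and it simply applies `urllib.parse.quote` with no characters treated as safe.

```python
from urllib.parse import quote


def encode_facet_query(facet_name, value):
    """Return a URL-encoded fq value in the form Facet Name:Value."""
    return quote(f'{facet_name}:{value}', safe='')


query = 'facet=true&fq=' + encode_facet_query('Document Type', 'Patent')
print(query)  # facet=true&fq=Document%20Type%3APatent
```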

#### Document Type
> **Patent**
> 
> **Journal**
> 
> **Book**
> 
> **Dissertation**
> 
> **Preprint**
> 
> **Report**
> 
> **Review**
> 
> **Letter**
> 
> **Historical**
> 
> **Editorial**
> 
> **Conference**
> 
> **Commentary**
> 
> **Clinical Trial**
> 
> **Biography**