# Markush Module
The CAS API supports workflow integration, chemical research, machine learning, and cheminformatics.  Learn more about workflow integration solutions from __[CAS Custom Services℠](https://www.cas.org/solutions/cas-custom-services/workflow-integration).__

This documentation demonstrates some of the capabilities available in the Markush Module of the CAS API.  Please contact CAS Custom Services for more information about licensing.  The __[Swagger page](https://helium.cas.org/integration/v1/api-docs/)__ for the endpoints is also available as a reference.

We will be accessing [Registry](https://www.cas.org/cas-data/cas-registry) and [MARPAT](https://www.cas.org/support/documentation/markush).

__[Terms of service](https://www.cas.org/sites/default/files/documents/cas-api-terms.pdf)__

### Table of Contents
[Getting Started](#Getting_Started)

[Novelty Search](#Novelty_Search)
- [Construct Registry Query](#Construct_Registry_Query)
- [Handling Multiple Pages](#Multipage)
- [Constructing a Markush Query](#Markush_Query)
- [Display Results of Markush Search](#Display_Markush_Results)

[Appendix](#Appendix)
- [Markush Request](#Markush_Request)
- [Markush Response](#Markush_Response)
- [Markush Summary JSON](#Markush_Summary)

<a id="Getting_Started"></a>
## Getting Started
<a id="Authentication"></a>
### Authentication
__To begin using the CAS API, you will first need valid credentials.__  The CAS API handles authorization via __[OAuth 2.0](https://oauth.net/2/)__.  The easiest way to authenticate is with a _client ID_ and _client secret_, which can be provided by CAS Custom Services.

To make any request of the API you will need an access token. The access token represents the authorization to access specific parts of CAS' data.  There are several modules available to CAS Integration API users.  Depending on your specific setup, you will have access to all or some of these modules.  For questions about your modules, contact CAS Custom Services.
<a id="Import_Libraries"></a>
### Import libraries

In [None]:
import http.client
import requests
from pprint import pprint
import json
import urllib.parse
from IPython.display import SVG, display

<a id="Set_Credentials"></a>
### Set your supplied credentials from CAS

You will need to enter your company's client ID and client secret provided by CAS Custom Services.

In [None]:
# Input client ID and client Secret credentials supplied to you by CAS

clientID = input("Please input your client ID.")
secret = input("Please input your client secret.")

<a id="Authorization_Method"></a>
### Authorization Method

_The recommended method requires the user to store their client ID and client secret as variables in the code that runs the workflow. The workflow submits the user's credentials programmatically to obtain an Access Token, which can be used for the duration of the workflow's execution._

_If the workflow is expected to run for longer than 24 hours, the user must add code that periodically renews the Access Token before it expires. Otherwise, some or all requests made after the 24-hour period may return a 401 error._
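
The renewal described above can be sketched as a small cache that requests a new token shortly before the old one expires. This is a minimal sketch, not part of the CAS API: `TokenCache`, `margin_seconds`, and the `fetch_token` callable (which would wrap the token request) are hypothetical names, and the 24-hour lifetime is taken from the text.

```python
import time


class TokenCache:
    """Cache an access token and renew it shortly before it expires."""

    def __init__(self, fetch_token, lifetime_seconds=24 * 60 * 60,
                 margin_seconds=300, clock=time.monotonic):
        self._fetch_token = fetch_token    # hypothetical callable returning a token string
        self._lifetime = lifetime_seconds  # 24 hours, as stated in this documentation
        self._margin = margin_seconds      # assumption: renew 5 minutes early
        self._clock = clock                # injectable so the logic can be tested
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, requesting a new one when the old one is near expiry."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch_token()
            self._expires_at = now + self._lifetime
        return self._token
```

A long-running workflow would call `cache.get()` when building the `Authorization` header for each request instead of reusing a single token variable.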
<a id="Request_Access_Token"></a>
> **Request your Access Token** 

**Definitions**
> **Client ID** is an Organizational-level ID.  Used to initially validate the Organization of a User of the CAS API.  This ID will be provided by a member of CAS Custom Services upon start of the contract.
>
> **Client Secret** is an Organizational-level password/secret.  Used to initially authenticate the Organization of a User of the CAS API.  This secret will be provided by a member of CAS Custom Services upon start of the contract.
>
> **Access Token** (_Also known as the Bearer Token_) is a daily token allowing access to the CAS API.  The Access token is passed with a POST or GET request to allow a User to authenticate requests and retrieve data.  This token is valid for _**24 hours**_ from its creation or until a new Access Token is generated via a new request.

In [None]:
# specify the content type in the header
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

# insert the credentials specified above into the payload sent to the OAuth server
payload = ('grant_type=client_credentials&client_id=' + clientID + '&scope=cas_content%20&client_secret='
            + secret)

# create a connection
conn = http.client.HTTPSConnection("sso.cas.org")

# make our request
conn.request("POST", "/as/token.oauth2", payload, headers)

# retrieve the response and read it
res = conn.getresponse()
data = res.read()

# read the json and store in a dictionary
response = data.decode("utf-8")
response_json = json.loads(response)

# print POST Response (JSON)
pprint(response_json)
print()

# hold onto this token, we'll need it for future requests to Integration API
access_token = response_json['access_token']

<a id="Novelty_Search"></a>
## Use Case: Novelty Search

We want to determine whether any prior art is associated with our target molecule.  We will specify our target molecule using SMILES notation.


<a id="Construct_Registry_Query"></a>
### Construct Registry Query
The first thing we are going to do is a structure search for a substance defined as a _SMILES_ string.  The code below runs a substructure (`sub`) search; use `drw` for an "as drawn" search instead.  In the request we specify how many results come back in the initial response by using a _results window_, which is defined by the _offset_ and the _length_.  Typically, the initial search starts with the _offset_ set to 0 (the first result) and the _length_ set to the number of results you want back.  _The default value is 0 for offset and 100 for length._
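
To make the window arithmetic concrete, the following sketch lists the `offset` and `length` pairs needed to cover a result set of a known size. `result_windows` is a hypothetical helper written for illustration; it only does the arithmetic and makes no API calls.

```python
def result_windows(total_count, length=100, start_offset=0):
    """Yield (offset, length) pairs covering results start_offset..total_count-1."""
    if length <= 0:
        raise ValueError("length must be positive")
    for offset in range(start_offset, total_count, length):
        # the last window may be shorter than the requested length
        yield offset, min(length, total_count - offset)


# 12 results fetched 5 at a time need three windows
print(list(result_windows(12, length=5)))   # [(0, 5), (5, 5), (10, 2)]
```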


In [None]:
# make sure our access token is present on requests and that we will receive JSON in return
headers = {'Content-Type': 'application/json',
           'Authorization': 'Bearer {0}'.format(access_token)}
base_url = 'https://helium.cas.org/integration/v1'

# using the substance module endpoint for now
module = 'substances'

# our target structure
target_smiles = 'O=C1C2=C(N=CN2C)N(C(=O)N1C)C'

#specify a substructure search
structure_search_type = 'sub'

<a id="Multipage"></a>
### Handling Multiple Pages

In [None]:
# let's keep track of the offset
offset = 0

# define page size
page_size = 5

# indicator that there are more pages
more_pages = True

#keep track of pages already requested
page_count = 1

# let's save all our substance JSONs in a dictionary
# the keys will be CAS Registry Numbers since they are unique
substance_results = {}

# we'll make all our requests to get all the results
while more_pages:
    # let requests URL-encode the query parameters (SMILES contain reserved characters)
    params = {'str': target_smiles, 'strMode': structure_search_type,
              'offset': offset, 'length': page_size}

    # make request (remember we are reusing our headers with our access token from above)
    resp = requests.get(f'{base_url}/{module}', params=params, headers=headers)

    # reusing the response_json variable
    response_json = resp.json()

    # retrieve the number of results
    result_count = response_json['count']

    # Do we need more pages?
    if result_count / page_size <= page_count:
        print(f'There are {result_count} results')
        more_pages = False
    else:
        # more results remain, so keep looping and request the next page
        page_count += 1

    # increment to get the next page
    offset += page_size

    # add this page's results
    for substance in response_json['substances']:
        substance_results[substance['casRn']] = [substance['canonicalSmiles'], substance['suppliersCount'], substance['uri'], substance['molecularFormula']]

# let's pretty-print them
pprint(substance_results)
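
The same paging pattern can be written as a generator that is independent of the HTTP client. This is a sketch under stated assumptions: `iter_results` and the `fetch_page(offset, length)` callable are hypothetical names, the callable is assumed to return the decoded response JSON with a `count` and a list under `items_key`, and the `max_pages` cap is an added safety measure because a broad substructure search can match many records.

```python
def iter_results(fetch_page, page_size=5, items_key='substances', max_pages=None):
    """Yield every item from a paged endpoint, one page request at a time."""
    offset = 0
    pages_fetched = 0
    while max_pages is None or pages_fetched < max_pages:
        page = fetch_page(offset, page_size)
        pages_fetched += 1
        items = page.get(items_key, [])
        yield from items
        offset += page_size
        # stop when the window has passed the total count or the page was empty
        if not items or offset >= page.get('count', 0):
            break
```

In this notebook, `fetch_page` would wrap the `requests.get` call from the cell above and return `resp.json()`.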

<a id="Markush_Query"></a>
## Constructing a Markush Query

To check whether the structure is covered by any [Markush structures](https://en.wikipedia.org/wiki/Markush_structure), we need to search the CAS [Markush database](https://www.cas.org/support/training/scifinder-n/markush-results).  We use the Markush endpoint in the API (_Markush module required_) with the same _str_ query parameter we used for the Registry structure search.

In [None]:
# we will use the markush module to conduct a markush search
module = 'markush'

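# start again from the first result; offset was advanced by the paging loop above
offset = 0

# let requests URL-encode the query parameters
params = {'str': target_smiles, 'offset': offset, 'length': page_size}

# make request (remember we are reusing our headers with our access token from above)
resp = requests.get(f'{base_url}/{module}', params=params, headers=headers)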

# reusing the response_json variable
response_json = resp.json()



<a id="Display_Markush_Results"></a>
## Display Results of Markush Search ##

To complete our use case we will print out the Patent Number, the patent location (claim), the Patent Title and a hit assembled display which matches the query with the raw Markush representation from MARPAT.

In [None]:
# If the result returns a non-zero count . . .
if response_json['count']:

    # output number of results
    print('There are ' + str(response_json['count']) + ' hit records')

    # grab the answers from the response
    markush_answers = response_json['markushAnswers']

    # keep track of result number ordinal
    result_ordinal = 1

    # look at each document in the results
    for answer in markush_answers:
        #output the ordinal and patent number of this result
        print(str(result_ordinal) + ') ' + str(answer['patentNumber']))

        # in order to get document title, we must request a document detail
        module = 'documents'

        # request the title for this document.  Note: Need to encode the uri
        encoded_uri = urllib.parse.quote(answer['documentUri'], safe='')

        # create request url
        docRequest = f'{base_url}/{module}/{encoded_uri}'

        # make request (remember we are reusing our headers with our access token from above)
        resp = requests.get(docRequest, headers=headers)

        # get the json from the response
        docResponse = resp.json()

        # output document hit information
        print('title: ' + docResponse['document']['title'])
        print('in: ' + str(answer['patentLocation']))

        structure = answer['structure']

        component = structure['component']

        # each structure may have several components
        for  sub_component in component:
            # retrieve the svg for this component
            comp_svg = sub_component['svg']

            #display the image if present
            if comp_svg is not None:
                display(SVG(comp_svg))
            else:
                print("No image available")

        # increment our result ordinal
        result_ordinal += 1
else:
    print("There are zero results")

<a id="Appendix"></a>
# Appendix #
<a id="Substance_Request"></a>
## Substance Request ##

See [Substance Module](API_Substance.html#Substance_Request) for details

<a id="Substance_Response"></a>
## Substance Response ##

See [Substance Module](API_Substance.html#Search_Response) for details


<a id="Markush_Request"></a>
## Markush Request ##
A request for Markush information is sent to the endpoint `https://helium.cas.org/integration/v1/markush` with the following parameters:

>**str** is a *chemical structure query*
> This query can be a molfile, SMILES, InChI, or CXF JSON (exported by structure editor in SciFinder or STN). Maximum length is 20000.
> *Note that this must be a URL encoded string*
> *This parameter must be present*

>**offset** is the *offset* from the start of the result set to be returned in the response.
> This offset indicates which ordinal to start returning results.  Along with the `length` parameter a "page" of answers can be defined.
> _Default: 0, minimum: 0_

>**length** is the *maximum number of results* to be returned in the response.
> This length indicates the maximum number of results to return starting at the 'offset' parameter.
> _Default: 100, minimum: 0, maximum: 100_

>**echo** is an indication of whether the response should contain the request parameters in the response.
> When set to true the response will contain a `request` attribute
> _Default: false_


### Example ###
```
'https://helium.cas.org/integration/v1/markush?str=O=C1C2=C(N=CN2C)N(C(=O)N1C)C \
    &offset=0 \
    &echo=False \
    &length=5'
```
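
Because `str` must be URL encoded, it is safer to build the query string with `urllib.parse.urlencode` than to paste the SMILES into the URL by hand. The sketch below assumes the parameter names and bounds listed above; `build_markush_url` is a hypothetical helper, and sending `echo` as lowercase `true` is an assumption about the accepted spelling.

```python
from urllib.parse import urlencode

MARKUSH_ENDPOINT = 'https://helium.cas.org/integration/v1/markush'


def build_markush_url(structure, offset=0, length=100, echo=False):
    """Return a Markush search URL with every parameter percent-encoded."""
    if offset < 0:
        raise ValueError("offset must be 0 or greater")
    if not 0 <= length <= 100:
        raise ValueError("length must be between 0 and 100")
    params = {'str': structure, 'offset': offset, 'length': length}
    if echo:
        params['echo'] = 'true'   # assumption: lowercase boolean is accepted
    return f'{MARKUSH_ENDPOINT}?{urlencode(params)}'


# reserved characters in the SMILES such as '=' and '(' are percent-encoded
print(build_markush_url('O=C1C2=C(N=CN2C)N(C(=O)N1C)C', length=5))
```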

<a id="Markush_Response"></a>
## Markush Search Response JSON ##
Results of a Markush search.  The response is in [JSON](https://en.wikipedia.org/wiki/JSON#Syntax) format.

The Search Response utilizes an additional important construct, a [Markush Summary](#Markush_Summary).

>**count** is the *number of results* returned from the query.
> *This count can be 0 or more*
> *Note: This count is not the number of `markush hits` being returned in this response.*

>**request** an array of *request parameters* from the original request.
> The request parameters are defined above in the [Markush Request](#Markush_Request) Appendix
> *Note: This is only present if the request has the `echo` parameter set to `true`*

>**markushAnswers** an array of *markush summaries* returned in the response
> The number of `markushAnswers` is determined by the `length` parameter on the request or the count, whichever is less

### Example ###
```
{'count': 1,
 'request': {...},     # optional
 'markushAnswers': [...]
 }
```

<a id="Markush_Summary"></a>
## Markush Summary JSON ##
This is the definition of a Markush Summary of MARPAT entries.  A Markush Summary contains many of the fields available for this Markush representation, including a document URI for obtaining document information for the Markush representation.

The Markush Summary utilizes an additional important construct, a `structure`.  The `structure` is described in the [Markush Structure JSON](TODO) below.

> **uri** is a unique key specific to the Integration API for this Markush representation.
> This URI can be used to retrieve this record.

> **documentUri** is a unique key specific to the Integration API for the patent document where this Markush representation appeared.
> This URI can be used to retrieve a detail record for the document.  Useful for retrieving the title, assignee, etc.  Consult the Document Module for more information.

> **patentLocation** is the claim number which this Markush representation relates to.
> This usually provides a claim number but can sometimes return just "claims" without a number

> **patentNumber** is the patent number of the document where this Markush representation appeared.

> **broaderDisclosure** is not yet documented here.
> TODO

> **structure** is a construct to describe the substance with the *structural characteristics*
> This is described below in the Markush JSON

### Example ###
```
[{'broaderDisclosure': None,
  'documentUri': 'document/pt/patent/65935126',
  'patentLocation': 'claim 1',
  'patentNumber': 'US20100087455',
  'structure': {...},
  'uri': 'markush/pt/patent/65935126/1/1'},
]
```
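
As a sketch of how these fields might be consumed, the snippet below turns a Markush Summary into a small record that includes the URL of the document detail request. The sample data is copied from the example above with the `structure` left empty; `summarize` and `BASE_URL` are illustrative names, and the `documents` path segment follows the display cell earlier in this notebook.

```python
from urllib.parse import quote

BASE_URL = 'https://helium.cas.org/integration/v1'

# sample data copied from the example above, with the structure left empty
markush_answers = [
    {'broaderDisclosure': None,
     'documentUri': 'document/pt/patent/65935126',
     'patentLocation': 'claim 1',
     'patentNumber': 'US20100087455',
     'structure': {},
     'uri': 'markush/pt/patent/65935126/1/1'},
]


def summarize(answer):
    """Return the patent number, location, and document detail URL for one answer."""
    # the document URI contains slashes, so encode it as a single path segment
    encoded_uri = quote(answer['documentUri'], safe='')
    return {
        'patentNumber': answer.get('patentNumber'),
        'patentLocation': answer.get('patentLocation'),
        'documentUrl': f'{BASE_URL}/documents/{encoded_uri}',
    }


for row in map(summarize, markush_answers):
    print(row)
```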
