## DataFusion Restful API in Python

This documentation describes the DataFusion, a triples stores database from Thomson Reuters. DataFusion is a graphical database that uses the RDF format. RDF stands for Resource Document Format and is a form of serialisation for the triples store. 

It is divided into the sections:
1. Graphical database concepts
2. RDF vs Neo4j
3. How to search in DataFusion
2. DataFusion RESTFUL API concepts
3. Glossary :
    i.   concepts
    ii.  tokens
    iii. properties
    iv.  relationships
    v.   entities
    vi.  annotations
    vii. documents
    viii.
 

The documentation is not meant to be exhaustive with ongoing revisions. This notebook works in an iPython notebook. Python code in the cells are executed by pressing "Shift-Enter". 

### Graphical Database concepts

A graphical database uses a graph or network structure to hold nodes, edges and properties. Nodes are entities in the graphs which are described by the properties. The nodes are then connected to one another through edges. 

This form of modelling reflects the typical relationships in the real world, as compared to the relational databases. To visulaise the difference between a RDMS and a graphical database, imagine a RDMS as a cube whilst a graphical database as an oddly-shaped sphere. A cube has fixed links between its edges, and add-on cubes can be added on a fixed side. In a sphere, there additional entities can be placed anywhere in the sphere. 

The graphical databases reflect inter-relationships in the real world. These relationships are used in the financial sector in areas ranging from [risk stress testing](http://as.wiley.com/WileyCDA/WileyTitle/productCd-0470666013.html), compliance and risks, [credit risks network monitoring](http://dbsreuters.s3-website-ap-southeast-1.amazonaws.com/), hierarchical relationships modelling, [geopolitical risk dashboard](http://datalab.int.thomsonreuters.com:3838/jr/pro-risk/) and [thirty party risk monitoring](http://datalab.int.thomsonreuters.com/bromer/3pr/).

Neo4j is a commonly used product in the market for graph database - the Panama Papers were built on Neo4j. There are notable differences between the Neo4j and the *RDF*. 


#### RDF vs Neo4j
A RDF is a triples stores consisting of an subject, predicate and object. In a statement, "Thomas bought 30 thousands of Apple Shares', the subject is "Thomas", the predicate is the action verb "bought" whilst the object is the '30 thousands of Apple shares'. 

DataFusion has an additional quad store with annotations that contains the properties of the edges. This 

#### Thomson Reuters DataFusion

### Using DataFusion:
The DataFusion API reflects the DataFusion functionalities on its web interface. It can be used for string searches, finding of relationships amongst entities, listing of properties and the upload of data sources. It can further be used for the annotations which are 'properties' defined for edges, that is not available on the web interface. 

A context in DataFusion is synomynous with a datasource. DataSources can be created on the DataFusion interface for different database format eg. RDF, delimited and RDMS. Being a default RDF database, DataFusion is most expressive in this format with relationships and entities. It is expected that client applications will involve coding to transform other database sources to RDF for the input of data.

#### Contextual Lists

Contexts are retrieved by the following RESTFUL API. Note that the response.status_code returns the HTTP return code. A code of 200 indicates successful retrieval of data. HTML codes can be referenced in the [link](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).
Common errors include 404 - data is not found and 400 - bad request.

In [17]:
import requests

# Obtain contextual lists:
url_dds_test = "http://dds-test.thomsonreuters.com/app/api/context/list"
url_dds = "https://dds.thomsonreuters.com/app/api/context/list"

def retURL(url):
    headers = {  'Content-Type': 'application/json'   , 'Authorization' : 'Bearer hboalirnc3d4n04phhvv2bas8fdjd6h9' } 
    response = requests.get(url, headers=headers)
    
    try:
        if response.status_code <> 200:
            raise ValueError('Error in HTTP ')
        else:
            data = response.json()  # output is in Python list
            print 'HTTP request is successful with code ' + str(response.status_code )
    except:
        print 'Error in HTTP call ' + str(response.status_code )
        data = ""
    return data

retURL(url_dds_test)

HTTP request is successful with code 200


[{u'id': 43,
  u'jdbcRepository': None,
  u'lastBatchReceived': u'2016-06-08T15:12:02Z',
  u'name': u'msft_poc',
  u'predicates': [{u'id': 14230,
    u'uri': u'http://knowledge.microsoft.com/mso/tv.series_season.to'},
   {u'id': 14141,
    u'uri': u'http://knowledge.microsoft.com/mso/business.consumer_company.brands'},
   {u'id': 14159,
    u'uri': u'http://knowledge.microsoft.com/mso/automotive.automotive_class.combined_fuel_economy_minimum'},
   {u'id': 13755,
    u'uri': u'http://knowledge.microsoft.com/mso/cvg.game_version.developer'},
   {u'id': 13808,
    u'uri': u'http://knowledge.microsoft.com/mso/statistics.population_group.income_share_lowest_20_percent'},
   {u'id': 14262,
    u'uri': u'http://knowledge.microsoft.com/mso/games.game.publisher'},
   {u'id': 14676,
    u'uri': u'http://knowledge.microsoft.com/mso/book.written_work.next_in_series'},
   {u'id': 14231,
    u'uri': u'http://knowledge.microsoft.com/mso/tv.series_season.previous_season'},
   {u'id': 14052,
    u'uri'

A sample of 'data' for a context is displayed below. 'data' is a Python list type. Each context key is an 'id'; in this case, for MyRepublic the id is 357 (useful later) in queries. Each context also consists of predicates and rdftypes. A dict of the keys for 'data' is shown below - uri, predicates, rdfTypes etc  It is useful to think of the API actions as corresponding to what's appeared on the DataFusion web interface.
In the below, the datasource names are printed along with the associated ids. Alternatively, given that the HTTP endpoint is open and if DataFusion is logged on, click on [here](http://dds-test.thomsonreuters.com/app/api/context/list) for the context list.

In [14]:
n = len(data)
datasources={data[i][u'name']:data[i][u'id'] for i in range(n)}
print datasources

TypeError: object of type 'NoneType' has no len()

Note that these datasources correspond directly to that on the interface as partly shown below. <img src="RDF_Gui1.png">

#### Master Entity Types

Master entity types can be defined on the DataFusion web interface as below. These entity types function as nodes on the DataFusion graph, with their own specific IDs.<img src="RDF_Master.PNG" height="2"> A list of the Master entity types can be obtained by executing the code:

In [None]:
url_master = "http://dds-test.thomsonreuters.com/app/api/entity/types"
entity_data = retURL(url_master)

n_ent = len(entity_data)
datasources={entity_data[i][u'name']:entity_data[i][u'id'] for i in range(n_ent)}
print datasources

#### Examples of searches in DataFusion

A few examples of searching through DataFusion are listed below: 

##### i. Searching for entity in Worldcheck

This is done for an entity "ASEAN COMMODITIES INC" . In the search string below, %20 is the HTML url code [click here](http://www.degraeve.com/reference/urlencoding.php) corresponding to a blank space. Other popular examples of HTML url codes are %3A to denote ":", %7C to denote '|' and %2F to denote "/".
Context in Worldcheck is specified by "queryFilter=context%7C%7C%7CWorldcheck".

We note in this case that the resulting match in Worldcheck for "ASEAN%20COMMODITIES%20INC" is 0.859 with further information on the entity as having been blacklisted by the MAS being on the ASIC list of unauthorised cold callers. If there is no matching entity in Worldcheck, an empty dataset will be returned.

In [12]:
check_name="ASEAN%20COMMODITIES%20INC"

url_check_1="http://dds-test.thomsonreuters.com/app/api/entity/search?searchString="+check_name+"&dir=asc&includePredicates=false"
url_check_2="&includeRelDir=false&queryFilter=context%7C%7C%7CWorldcheck&filterType=and&extraFields=further_information_attr_exact&includeHiddenFields=false"
url_check=url_check_1+url_check_2
wc_data = retURL(url_check)
print url_check
print wc_data

Error in HTTP call 503


UnboundLocalError: local variable 'data' referenced before assignment

##### ii. Searching for entities directly related to an entity in 1MDB

In [11]:
srchstring="Rosman%20Abdullah"
entity_type="-1"# Organization
contextID="|||1MDB"
#url_string="http://dds-test.thomsonreuters.com/app/api/entity/search/tokenize?searchString="+srchstring+"&entityTypeId="+entity_type
url_string="http://dds.thomsonreuters.com/app/api/entity/search/tokenize?searchString="+srchstring+"&entityTypeId="+entity_type

data = retURL(url_string)
print "No of matches to string is "+str(data[u'count'])
print "data token is "+str(data[u'tokenPart'])

# url_token="http://dds-test.thomsonreuters.com/app/api/entity/search?q=*%3A*&"
url_token="http://dds.thomsonreuters.com/app/api/entity/relationships?parentUrisToken="+str(data[u'tokenPart'])
# with relationships
data_2nd_pred = retURL(url_token)
print "No of 2nd order matches is "+str(len(data_2nd_pred[u'links']))

SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

In [None]:
data_2nd_pred

##### iii. Displaying entity properties

In [None]:
len(data_2nd_pred[u'entities'])