# Accessing IRIS NLP from Python

In this notebook, we'll leverage the NLP capabilities of InterSystems IRIS, also known as [iKnow](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GIKNOW). The iKnow NLP engine is also available as an [open-source Python library `iknowpy`](https://github.com/intersystems/iknow), which offers the same linguistic analysis capabilities through a standalone engine for use in Python scripts and applications. However, the version embedded in IRIS leverages the platform's database capabilities to enable cross-document analyses, storing the engine output in a "domain" and offering SQL projections of the raw contents as well as rich REST and ObjectScript APIs on top. In this notebook, we'll show how to access the ObjectScript APIs from Python.

We'll first connect to IRIS using the [Native API for Python](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=BPYNAT_about). If you haven't already, install the DB-API from the wheel posted [here](https://intersystems-community.github.io/iris-driver-distribution/) to get access to the `iris` module. After establishing a connection, we can use the [`%SYSTEM.iKnow`](https://docs.intersystems.com/irislatest/csp/documatic/%25CSP.Documatic.cls?LIBRARY=%25SYS&CLASSNAME=%25SYSTEM.iKnow) API to print the domains already created in this namespace. This may show up empty if you haven't used iKnow before.
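
The connection settings used in the next cell are hardcoded for a default local instance. As a minimal sketch, and assuming you prefer to keep credentials out of the notebook, the same dictionary could be built from environment variables. The variable names (`IRIS_HOSTNAME`, `IRIS_PORT`, and so on) are hypothetical and not something the `iris` module reads by itself; the fallback values are the ones used in this notebook.

```python
import os

def connection_args(env=None):
    """Build keyword arguments for iris.connect() from environment variables.

    The IRIS_* variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return {
        'hostname': env.get('IRIS_HOSTNAME', 'localhost'),
        'port': int(env.get('IRIS_PORT', '1972')),
        'namespace': env.get('IRIS_NAMESPACE', 'USER'),
        'username': env.get('IRIS_USERNAME', '_SYSTEM'),
        'password': env.get('IRIS_PASSWORD', 'SYS'),
    }

# Example with an explicit mapping instead of the real environment
args = connection_args({'IRIS_PORT': '51773'})
print(args['hostname'], args['port'])
```

The resulting dictionary can be passed on as `iris.connect(**args)`, exactly like the literal one below.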

In [1]:
import iris

# Open a connection to the server
args = {
        'hostname':'localhost', 
        'port': 1972,
        'namespace':'USER', 
        'username':'_SYSTEM', 
        'password':'SYS'
}
connection = iris.connect(**args)

# Create an IRIS object
irispy = iris.createIRIS(connection)

# Invoke the ListDomains() method
irispy.classMethodVoid('%SYSTEM.iKnow','ListDomains')


Domains for Namespace USER:
 Domain ID : Domain name                              : # of sources : version
 --------- : ---------------------------------------- : ------------ : -------
         1 : HelloWorld                               :            3 :       5


## Creating an iKnow Domain

iKnow Domains are repositories to which you can add "sources". Source text is indexed by the iKnow engine and the resulting concepts and relationships are then stored in the domain for you to access through the various APIs. The easiest way to create a domain is using the [iKnow Architect](https://community.intersystems.com/post/creating-domain-iknow-domain-architect), a GUI available from the System Management Portal. This will create a Domain Definition, which registers where data needs to be loaded from so you can easily rebuild it from scratch. 

In this demo, we'll use a lower-level, fully programmatic approach and interact with the [`%iKnow.Domain`](https://docs.intersystems.com/irislatest/csp/documatic/%25CSP.Documatic.cls?LIBRARY=%25SYS&CLASSNAME=%25iKnow.Domain) class directly. Note that you may also programmatically create a Domain Definition by instantiating the [`%iKnow.Model.domain`](https://docs.intersystems.com/irislatest/csp/documatic/%25CSP.Documatic.cls?LIBRARY=%25SYS&CLASSNAME=%25iKnow.Model.domain) class and then using its [`%SaveToClass()`](https://docs.intersystems.com/irislatest/csp/documatic/%25CSP.Documatic.cls?LIBRARY=%25SYS&CLASSNAME=%25iKnow.Model.domain#%25SaveToClass) method if you'd like to take advantage of the rebuildability of Domain Definitions.

In [2]:
# First drop any prior domain
if (irispy.classMethodValue('%iKnow.Domain','Exists','HelloWorld') > 0):
    print('Domain already exists, dropping first...')
    irispy.classMethodVoid('%iKnow.Domain','Delete','HelloWorld')
    
# Now create a new domain. 
domain = irispy.classMethodValue('%iKnow.Domain','Create','HelloWorld')
domain_id = domain.get("Id")
print('Domain created with ID: '+str(domain_id))

# Note that we could also use the %New()/%Save() methods, but that's slightly more verbose from Python

Domain already exists, dropping first...
Domain created with ID: 1


To add data to the domain, the easiest way to get started is using the [`%SYSTEM.iKnow`](https://docs.intersystems.com/irislatest/csp/documatic/%25CSP.Documatic.cls?LIBRARY=%25SYS&CLASSNAME=%25SYSTEM.iKnow) API again, whose indexing methods take the domain name as their first argument.
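
If you have more than a handful of strings, a small wrapper keeps the calls tidy. The following is only a sketch: `index_strings()` and the `DryRunIRIS` stand-in are hypothetical helpers introduced here so the loop can be tried without a server. With a live connection you would pass the `irispy` object created earlier instead.

```python
def index_strings(irispy, domain_name, texts):
    """Index each (external ID, text) pair with %SYSTEM.iKnow.IndexString()."""
    for external_id, text in texts.items():
        irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexString',
                               domain_name, external_id, text)
    return len(texts)


class DryRunIRIS:
    """Hypothetical stand-in that records calls instead of contacting IRIS."""
    def __init__(self):
        self.calls = []

    def classMethodVoid(self, class_name, method_name, *args):
        self.calls.append((class_name, method_name, args))


dry_run = DryRunIRIS()
count = index_strings(dry_run, 'HelloWorld',
                      {'Doc1': 'A first text.', 'Doc2': 'A second text.'})
print(count, dry_run.calls[0])
```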

In [3]:
# Add some text straight from a string argument using IndexString()
# Note that the second argument to that method (and therefore the fourth to classMethodVoid()) is a unique "external"
# identifier we can use to look up our source afterwards. We stick to dull identifiers for now :-)
irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexString', 'HelloWorld', 'Test1',
                           'Hello! Here\'s a first example of a piece of text indexed by the iKnow engine.')
irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexString', 'HelloWorld', 'Test2',
                           'Let\'s add more text for the iKnow engine to process. Of course there can be multiple sentences.')

# Next we'll create a separate configuration to auto-detect the language for English or French text
languages = iris.IRISList()
languages.add('fr')
languages.add('en')
irispy.classMethodVoid('%iKnow.Configuration', 'Create', 'EnglishOrFrench', 1, languages)
irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexString', 'HelloWorld', 'Test3',
                           'Hello! On peut également utiliser plusieurs langues dans la même domaine.\n'+
                           'In fact we can even use different languages in the same text! Merci.', 'EnglishOrFrench')

# If you have some text files accessible from the IRIS server, you can use IndexFile() or IndexDirectory()
#irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexDirectory', 'HelloWorld', '/data/nlp/input')

# Similarly, there's a simple method to index data accessible through SQL.
# This method takes additional arguments to identify columns for building the external ID and of course the text column.
#irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexTable', 'HelloWorld', 'Aviation.Event', 
#    'EventID', 'LocationCountry', 'NarrativeSummary')

# And this utility method also prints the number of sources in our domain
irispy.classMethodVoid('%SYSTEM.iKnow','ListDomains')


Domains for Namespace USER:
 Domain ID : Domain name                              : # of sources : version
 --------- : ---------------------------------------- : ------------ : -------
         1 : HelloWorld                               :            3 :       5


### Advanced indexing scenarios

The `%SYSTEM.iKnow` API addresses a few basic use cases, but does not cover every possible data source. Domain Definitions offer more flexibility, such as loading the results of a SQL query, and most importantly an easy way to rebuild a domain. They can also be modified through the Architect. The following paragraph shows [the more hardcore API usage](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GIKNOW_load) to load the results of a SQL query directly from Python if Domain Definitions are not an option.

You can either use your own query or first run the following to have something to index:
```SQL
CREATE TABLE IF NOT EXISTS paragraphs (
    book VARCHAR(100),
    chapter_id INT,
    sequence_number INT,
    paragraph VARCHAR(32000)
)

INSERT INTO paragraphs VALUES ('Python bites!', 101, 1011, 'This is the first paragraph in a great chapter. We''ll introduce the main characters of this exciting murder mystery.')
INSERT INTO paragraphs VALUES ('Python bites!', 101, 1012, 'In the second paragraph, the murder happens!')
INSERT INTO paragraphs VALUES ('Python bites!', 110, 1101, 'In the final chapter, the identity of the killer is finally revealed!')
```
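
The sample rows hold one paragraph each, while the indexing query in the next cell groups them into one source per chapter with `GROUP BY` and `LIST()`. The pure-Python sketch below mimics that grouping so you can see what text ends up in each source. It assumes `LIST()` joins the values with commas and orders the paragraphs by `sequence_number`, which the SQL query itself does not guarantee. The first paragraph is abbreviated.

```python
from itertools import groupby

# (book, chapter_id, sequence_number, paragraph), abbreviated sample rows
rows = [
    ('Python bites!', 101, 1011, 'This is the first paragraph in a great chapter.'),
    ('Python bites!', 101, 1012, 'In the second paragraph, the murder happens!'),
    ('Python bites!', 110, 1101, 'In the final chapter, the identity of the killer is finally revealed!'),
]

def group_by_chapter(rows):
    """Concatenate paragraphs per (book, chapter_id), each followed by CR LF."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    chapters = {}
    for key, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        # assumption: LIST() separates aggregated values with a comma
        chapters[key] = ','.join(r[3] + '\r\n' for r in group)
    return chapters

chapters = group_by_chapter(rows)
for (book, chapter_id), full_text in chapters.items():
    print(book, chapter_id, repr(full_text))
```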

In [4]:
# instantiate a Loader ..
loader = irispy.classMethodValue('%iKnow.Source.Loader','%New',domain_id)

# .. and a Lister that understands SQL
lister = irispy.classMethodValue('%iKnow.Source.SQL.Lister','%New',domain_id)

# then queue our batch and index
sql = "SELECT book, chapter_id, LIST(paragraph || CHAR(13) || CHAR(10)) as full_text FROM paragraphs GROUP BY book, chapter_id"
lister.invokeVoid('AddListToBatch', sql, 'chapter_id', 'book', 'full_text')
loader.invokeVoid('ProcessBatch')

## Querying the domain

The iKnow infrastructure in IRIS includes a comprehensive query API, implementing a few common types of "questions" you might want to ask of the data processed. In our HelloWorld demo domain there isn't as much data to query right now (unless you also indexed a directory or table with interesting contents), but let's take a closer look anyway. 

The iKnow query APIs return data by reference in an array with a `$list` for each result row. We'll first create a utility method for browsing these results. Note that there is also a SQL version of the API that is somewhat more straightforward to consume from Python, but when we start working with filters later on, programmatic access through the Native API will be easier again.
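
The core of that utility is turning each positional result row into a dictionary keyed by column name. The sketch below isolates that step in plain Python, without a server. It assumes the column specification is a comma-separated string of `name:type` pairs, which is the shape the utility method relies on. The concrete `spec` string and the type names in it are illustrative assumptions.

```python
def parse_result_spec(spec):
    """Split 'name:type,name:type' into a list of (name, type) tuples."""
    columns = []
    for part in spec.split(','):
        name, _, type_name = part.partition(':')
        columns.append((name, type_name))
    return columns

def rows_to_dicts(spec, rows):
    """Pair each positional row with the column names from the spec."""
    names = [name for name, _ in parse_result_spec(spec)]
    return [dict(zip(names, row)) for row in rows]

# Illustrative spec and rows; a real spec comes from the API class definition
spec = 'entUniId:%Integer,entity:%String,frequency:%Integer,spread:%Integer'
rows = [[7, 'text', 2, 2], [11, 'more text', 1, 1]]
for row in rows_to_dicts(spec, rows):
    print(row)
```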

In [5]:
def iknow_query(api : str, method : str, domain_id : int, *args, **kwargs):
    kwargs = { 'outGlo': '^result', **kwargs }
    outGlo = kwargs['outGlo']
    
    # figure out result columns to build a nice return object
    proxy = irispy.classMethodValue('%Dictionary.ParameterDefinition', '%OpenId', api+'||'+method+'RT')
    raw = proxy.get('Default').split(',')
    return_cols = list(map(lambda x:x.split(':')[0], raw))
    return_types = list(map(lambda x:x.split(':')[1], raw))
    
    # parse method spec so we can build positional argument list out of kwargs
    # this is getting a little into the weeds, and may be dealt with inside the Native API
    # more elegantly in a future release
    kwupper = {}
    for k, v in kwargs.items():
        if k == 'outGlo':
            continue
        kwupper[k.upper()] = v
    proxy = irispy.classMethodValue('%Dictionary.CompiledMethod', '%OpenId', api+'||'+method)
    raw = proxy.get('FormalSpec').split(',')
    full_args = []
    for i, argument in enumerate(raw):
        if (i < 2): 
            continue  # skip &result and domainId
        if ((argument[0]=='&') or (argument[0]=='*')):
            argument = argument[1:]
        argument = argument.split('=')
        argument_name = argument[0].split(':')[0].upper()
        if argument_name in kwupper:
            full_args.append(kwupper[argument_name])
        elif len(args) > (i-2):
            full_args.append(args[i-2])
        elif len(argument)>1:
            default = argument[1]
            if default == '""':
                full_args.append('')
            elif default[0:3] == '$$$':
                full_args.append(iknow_macro(default))
            else:
                full_args.append(argument[1])
        else:
            full_args.append(None)
    
    # call actual API method
    irispy.classMethodVoid(api, method, outGlo, domain_id, *full_args)
    
    # now iterate through output global
    result = []
    direction = 0
    subscript = 0
    while True:
        subscript = irispy.nextSubscript(direction, outGlo, subscript)
        if subscript is None:
            break
        raw_row = iris.IRISList(irispy.getBytes(outGlo,subscript))
        row = {}
        for i, col in enumerate(return_cols):
            row[col] = raw_row.get(i+1)
        result.append(row)
        
    # clean up results
    irispy.kill(outGlo)
    
    return result

def iknow_macro(macro : str):
    # this hits some internal APIs and exploits the known simple structure of %IKPublic.INC
    # DON'T TRY THIS AT HOME!
    if macro[0:3] == '$$$':
        macro = macro[3:]
    subscript = 0
    value = None
    while True:
        subscript = irispy.nextSubscript(0, 'rINC("%IKPublic",0)', subscript)
        if subscript is None:
            break
        raw_line = irispy.getString('rINC("%IKPublic",0)', subscript).split(' ')
        if (len(raw_line) > 1) and (raw_line[1] == macro):
            value = raw_line[2]
            break
    return value

print('Top concepts:')
print(iknow_query('%iKnow.Queries.EntityAPI', 'GetTop', domain_id))

print()
print('Top entities similar to "text":')
print(iknow_query('%iKnow.Queries.EntityAPI', 'GetSimilar', domain_id, 'text'))

Top concepts:
[{'entUniId': 1, 'entity': 'hello', 'frequency': 2, 'spread': 2}, {'entUniId': 7, 'entity': 'text', 'frequency': 2, 'spread': 2}, {'entUniId': 9, 'entity': 'iknow engine', 'frequency': 2, 'spread': 2}, {'entUniId': 4, 'entity': 'first example', 'frequency': 1, 'spread': 1}, {'entUniId': 6, 'entity': 'piece', 'frequency': 1, 'spread': 1}, {'entUniId': 11, 'entity': 'more text', 'frequency': 1, 'spread': 1}, {'entUniId': 15, 'entity': 'multiple sentences', 'frequency': 1, 'spread': 1}, {'entUniId': 18, 'entity': 'plusieurs langues', 'frequency': 1, 'spread': 1}, {'entUniId': 21, 'entity': 'domaine', 'frequency': 1, 'spread': 1}, {'entUniId': 24, 'entity': 'different languages', 'frequency': 1, 'spread': 1}]

Top entities similar to "text":
[{'entUniId': 7, 'entity': 'text', 'frequency': 2, 'spread': 2}, {'entUniId': 11, 'entity': 'more text', 'frequency': 1, 'spread': 1}]
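
The trickiest part of `iknow_query()` is mapping Python keyword arguments onto the positional argument list of the ObjectScript method. The sketch below isolates that logic so it can be tried without a server. `build_positional_args()` is a hypothetical helper and the `spec` string is an illustrative assumption rather than the actual signature of any query method. Like the original, it splits on commas and would therefore not cope with default values that themselves contain commas. It also leaves macro defaults unresolved.

```python
def build_positional_args(formal_spec, args=(), kwargs=None, skip=2):
    """Turn positional and keyword arguments into one positional list.

    The first `skip` entries of the spec (result and domain ID) are ignored.
    """
    kwargs = {k.upper(): v for k, v in (kwargs or {}).items()}
    result = []
    for i, argument in enumerate(formal_spec.split(',')[skip:]):
        argument = argument.lstrip('&*')  # drop by-reference / output markers
        declaration, has_default, default = argument.partition('=')
        name = declaration.split(':')[0].upper()
        if name in kwargs:
            result.append(kwargs[name])      # keyword argument wins
        elif i < len(args):
            result.append(args[i])           # then positional argument
        elif has_default:
            result.append('' if default == '""' else default)
        else:
            result.append(None)
    return result

# Illustrative spec only
spec = ('&result,domainId:%Integer,page:%Integer=1,pageSize:%Integer=10,'
        'filter:%String="",skiplistIds:%String=""')
print(build_positional_args(spec, kwargs={'skiplistIds': 1}))
```

Keyword names are matched case-insensitively, so `skiplistIds` and `SkipListIds` resolve to the same argument.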


## Miscellaneous

The following paragraphs capture a few specific use cases raised by users.

### Managing skiplists

[Skiplists](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GIKNOW_skiplist) help filter specific entities such as stop words out of query results. You can manage them through the [`%iKnow.Utils.MaintenanceAPI`](https://docs.intersystems.com/irislatest/csp/documatic/%25CSP.Documatic.cls?LIBRARY=%25SYS&CLASSNAME=%25iKnow.Utils.MaintenanceAPI) or, when using a Domain Definition, through the [NLP Architect](https://docs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GIKNOW_architect#GIKNOW_architect_skiplists).

In [6]:
# drop earlier skiplist if it exists and create a new one
skip_id = irispy.classMethodInteger('%iKnow.Utils.MaintenanceAPI', 'GetSkipListId', domain_id, 'StopWords')
if skip_id is not None:
    irispy.classMethodVoid('%iKnow.Utils.MaintenanceAPI', 'DropSkipList', domain_id, skip_id)
skip_id = irispy.classMethodInteger('%iKnow.Utils.MaintenanceAPI', 'CreateSkipList', domain_id, 'StopWords')

irispy.classMethodVoid('%iKnow.Utils.MaintenanceAPI', 'AddStringToSkipList', domain_id, skip_id, 'stop 123')
irispy.classMethodVoid('%iKnow.Utils.MaintenanceAPI', 'AddStringToSkipList', domain_id, skip_id, 'hello')
irispy.classMethodVoid('%iKnow.Utils.MaintenanceAPI', 'AddStringToSkipList', domain_id, skip_id, 'thanks')
irispy.classMethodVoid('%iKnow.Utils.MaintenanceAPI', 'AddStringToSkipList', domain_id, skip_id, 'merci')

print('Created skiplist with ID '+str(skip_id)+' and the following strings:')
print(iknow_query('%iKnow.Utils.MaintenanceAPI', 'GetSkipListElements', domain_id, skip_id))

print()
print('Applying to a simple query:')
print(iknow_query('%iKnow.Queries.EntityAPI', 'GetTop', domain_id, skiplistIds = skip_id))

Created skiplist with ID 1 and the following strings:
[{'entUniId': 1, 'entity': 'hello'}, {'entUniId': 27, 'entity': 'merci'}, {'entUniId': 42, 'entity': 'stop 123'}, {'entUniId': 43, 'entity': 'thanks'}]

Applying to a simple query:
[{'entUniId': 7, 'entity': 'text', 'frequency': 2, 'spread': 2}, {'entUniId': 9, 'entity': 'iknow engine', 'frequency': 2, 'spread': 2}, {'entUniId': 4, 'entity': 'first example', 'frequency': 1, 'spread': 1}, {'entUniId': 6, 'entity': 'piece', 'frequency': 1, 'spread': 1}, {'entUniId': 11, 'entity': 'more text', 'frequency': 1, 'spread': 1}, {'entUniId': 15, 'entity': 'multiple sentences', 'frequency': 1, 'spread': 1}, {'entUniId': 18, 'entity': 'plusieurs langues', 'frequency': 1, 'spread': 1}, {'entUniId': 21, 'entity': 'domaine', 'frequency': 1, 'spread': 1}, {'entUniId': 24, 'entity': 'different languages', 'frequency': 1, 'spread': 1}, {'entUniId': 29, 'entity': 'exciting murder mystery', 'frequency': 1, 'spread': 1}]
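
Conceptually, a skiplist removes matching entities from a result set. The sketch below approximates that effect on the client side for a few rows taken from the earlier `GetTop` output. It only illustrates the idea and is not how iKnow implements skiplists. In the output above the query still returns ten rows, which suggests the skiplist is applied on the server before the top entries are selected. `apply_skiplist()` is a hypothetical helper, and the case-insensitive match is an assumption.

```python
def apply_skiplist(entities, skiplist):
    """Drop result rows whose entity value appears in the skiplist."""
    skipped = {s.lower() for s in skiplist}
    return [e for e in entities if e['entity'].lower() not in skipped]

# A few rows from the earlier GetTop output
top = [
    {'entUniId': 1, 'entity': 'hello', 'frequency': 2, 'spread': 2},
    {'entUniId': 7, 'entity': 'text', 'frequency': 2, 'spread': 2},
    {'entUniId': 9, 'entity': 'iknow engine', 'frequency': 2, 'spread': 2},
]
skiplist = ['stop 123', 'hello', 'thanks', 'merci']

filtered = apply_skiplist(top, skiplist)
print([e['entity'] for e in filtered])
```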


### Managing and using a User Dictionary

User Dictionaries can be used to tweak iKnow's built-in language models, for example to feed additional marker terms that should be taken into account for specific semantic attributes. Here, we'll treat the thriller terminology in the `paragraphs` table (created in the advanced indexing section above) as carrying negative sentiment.

In [7]:
# clean up older versions
irispy.classMethodVoid('%iKnow.Configuration','NameIndexDelete', 'EnglishThrillers')
irispy.classMethodVoid('%iKnow.UserDictionary','NameIndexDelete', 'Thrillers')

# create User Dictionary and a Configuration that uses it
udict = irispy.classMethodValue('%iKnow.UserDictionary','%New','Thrillers')
udict.invokeVoid('%Save')
udict.invokeVoid('AddNegativeSentimentTerm','murder')
udict.invokeVoid('AddNegativeSentimentTerm','killer')
udict.invokeVoid('%Save')
config = irispy.classMethodValue('%iKnow.Configuration','%New','EnglishThrillers',0,'en','Thrillers')
config.invokeVoid('%Save')

# index the paragraphs table again, this time simply using %SYSTEM.iKnow.IndexTable() without any grouping
irispy.classMethodVoid('%SYSTEM.iKnow', 'IndexTable', 'HelloWorld', 'paragraphs', 
                           'sequence_number', 'book', 'paragraph', None, None, 'EnglishThrillers')

# you can now use the Domain Explorer to see the indexed text, in which phrases with negative sentiment will
# now be highlighted.