# Welcome to Xilinx Cosine Similarity Acceleration Demo 
---

**This Notebook demonstrates how to use the Xilinx Cosine Similarity product and shows the power of Xilinx FPGAs to accelerate Cosine Similarity**

---

### The Demo : Drug Similarity 

In this demo, we find similar drugs and healthcare terms in the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/quickstart.html) dataset. UMLS is a collection of health and biomedical vocabularies from a wide variety of healthcare data sources. One of the knowledge sources in UMLS is the [Metathesaurus](https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html), a collection of medical concepts called <b>atoms</b> that are linked through useful relationships.

For this demo, we use a small subset of the <b>atoms</b> file <I>MRCONSO.RRF</I> together with the relationships between atoms from the relationships file <I>MRREL.RRF</I>. The atoms and their relations are modeled as a graph in [TigerGraph Enterprise Database](https://www.tigergraph.com/), and [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) is used as the similarity measure. The Xilinx Cosine Similarity library accelerates the similarity score computation by offloading it to [Xilinx Alveo U50](https://www.xilinx.com/products/boards-and-kits/alveo/u50.html) FPGA cards.

Each atom is converted into an embedding representation that probabilistically captures its relations. We use the [Node2Vec](https://docs.tigergraph.com/tigergraph-platform-overview/graph-algorithm-library#node-2-vec) algorithm to compute an embedding for each atom vertex in the graph.
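
To give some intuition for how Node2Vec samples relations, here is a minimal sketch of a second-order biased random walk in plain Python. The toy graph, the function name `biased_walk`, and the `p`/`q` values are illustrative assumptions; the demo itself relies on the installed TigerGraph queries and UDF, whose implementation may differ.

```python
import random

def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Return one node2vec-style walk of up to `length` vertices.

    adj maps each vertex to a list of its neighbours (undirected toy graph).
    p controls the chance of stepping back; q controls moving outward.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbours = adj.get(cur, [])
        if not neighbours:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbours:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous vertex
            elif nxt in adj.get(prev, []):
                weights.append(1.0)       # stay close to the previous vertex
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

# Hypothetical toy graph with made-up vertex names (not taken from UMLS)
toy_graph = {
    "A1": ["A2", "A3"],
    "A2": ["A1", "A3"],
    "A3": ["A1", "A2", "A4"],
    "A4": ["A3"],
}
walk = biased_walk(toy_graph, "A1", length=6, p=0.5, q=2.0, rng=random.Random(0))
print(walk)
```

In Node2Vec, many such walks are typically fed to a skip-gram model, which produces the embedding vector for each vertex.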

This example selects a random vertex in the graph as the query, searches the UMLS database, and returns the top-matching similar drugs, drug ingredients, or related healthcare concepts.

In general, computing cosine similarity on large datasets takes a long time on a CPU.
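
To make the CPU cost concrete, the following is a minimal sketch of a brute-force top-K cosine similarity search in plain Python. The function names and the tiny embeddings are hypothetical and only for illustration; this is not the code used by the Xilinx library or by the demo's software query.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k_matches(query, embeddings, k):
    """Score the query against every stored vector and keep the k best."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in embeddings.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional embeddings keyed by made-up IDs
embeddings = {
    "x": [1.0, 0.0, 0.0],
    "y": [0.9, 0.1, 0.0],
    "z": [0.0, 1.0, 0.0],
}
best = top_k_matches(embeddings["x"], embeddings, k=2)
print(best)
```

Every query touches every stored vector, so the work per query grows with the number of vectors times the embedding dimension.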

With Xilinx Cosine Similarity acceleration, speedups of more than roughly 80x can be achieved.

We use Xilinx Cosine Similarity in the backend and set up a drug database against which the similarity of target vectors can be calculated.

 
### The Demo is Structured into Seven Sections :

1. [**Create New Graph**](#newg)
<br><br>
2. [**Create Graph Schema**](#schema)
<br><br>
3. [**Load Graph Data**](#loadd)
<br><br>
4. [**Install Queries**](#install)
<br><br>
5. [**Create Embeddings**](#embed)
<br><br>
6. [**Send Embeddings to FPGA**](#send)
<br><br>
7. [**Compute Cosine Similarity**](#run)

---
Steps [ 1 - 6 ] are **One-Time** preparation of the Database. 

Step [ 7 ] is **Repeatable** use of Query to run *accelerated* similarity computation on FPGA 

---

### Prerequisites <a id="prerequisites"></a> :

#### Load Necessary Libraries 

In [1]:
import time
import random as rand
from pathlib import Path, PurePosixPath
import pyTigerGraph as tg
import os
from shutil import copyfile

#### Check the Python version. It should be 3.6

In [2]:
from platform import python_version
print(python_version())

3.6.9


#### Provide the Host Name, User Name & Password

In [3]:
hostName = "localhost"                            # TG server hostname
userName = "tigergraph"                           # TG user name
passWord = "tigergraph"                           # TG password

#### Provide the number of results to find ('topK') and the number of U50 cards installed on the system

In [4]:
topK = 10                                         # Number of highest scoring drugs
numDevices = 1                                    # Number of FPGA devices to distribute the queries to

#### Path Setup

**Local**: Location of the query files under the Xilinx graphanalytics GitHub repo. Set this to the location of your local copy of the repo.

In [5]:
localRepoLocation = Path("/opt/xilinx/apps")
exampleLocation = Path("graphanalytics/integration/Tigergraph-3.x/1.2/examples/drug_similarity/") # when running from github repo
queryFileLocation = localRepoLocation / exampleLocation / "query"

**Remote**: Location of UMLS data on the server. **NOTE**: Data should exist on the TigerGraph server

In [6]:
serverRepoLocation = PurePosixPath("/opt/xilinx/apps")
serverDataLocation = serverRepoLocation / PurePosixPath(exampleLocation) / "data"
copyfile(str(serverDataLocation / "csv/embeddings.csv"), "/tmp/embeddings.csv")

'/tmp/embeddings.csv'

---
### 1. Create New Graph <a id="newg"></a>
- Connect to the TigerGraph server, omitting the graph name. This is needed to establish a valid REST endpoint that is used to create the new graph.
- Create the new graph with a gsql command, then open a new connection that uses the new graph.

In [7]:
graphName = f'xgraph_drugsim_{userName}'
conn = tg.TigerGraphConnection(host='http://' + hostName, graphname='', username=userName, password=passWord, useCert=False)
print("\n--------- Creating New graph ----------")
print(conn.gsql(f'create graph {graphName}()', options=[]))

# connect to TG server with new graph
print(f'Using graph {graphName}')
conn = tg.TigerGraphConnection(host='http://' + hostName, graphname=graphName, username=userName, password=passWord, useCert=False)


--------- Creating New graph ----------
Semantic Check Fails: The graph name conflicts with another type or existing graph names! Please use a different name.
The graph xgraph_drugsim_tigergraph could not be created!
Using graph xgraph_drugsim_tigergraph
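
The output above shows what happens when the graph already exists from an earlier run: the create command fails and the notebook continues with the existing graph. A minimal sketch of detecting this case from the returned text follows. The helper name `graph_was_created` is hypothetical, and matching on message text is an assumption that may break if the server wording changes.

```python
def graph_was_created(gsql_output):
    """Return False when the GSQL output reports that creation failed."""
    # Markers copied from the failure message shown in this notebook
    failure_markers = ("could not be created", "Semantic Check Fails")
    return not any(marker in gsql_output for marker in failure_markers)

sample_output = (
    "Semantic Check Fails: The graph name conflicts with another type or "
    "existing graph names! Please use a different name.\n"
    "The graph xgraph_drugsim_tigergraph could not be created!"
)
if not graph_was_created(sample_output):
    print("Graph was not created; continuing with the existing graph.")
```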


### 2. Create Graph Schema <a id="schema"></a>
TigerGraph stores a graph as vertices that can be associated with other vertices through directed or undirected edges. This structure is specified in a graph schema. For this demo, the schema is already defined in a query file. Load the file, set the graph name, and run it as gsql commands. 

Users can create a schema for their own graph in a similar way. 

In [8]:
print("\n--------- Creating New Schema ----------")
schemaFile = queryFileLocation / "schema_xgraph.gsql"

with open(schemaFile) as fh:
    qStrRaw = fh.read()
    qStr = qStrRaw.replace('@graph', graphName)
    print(conn.gsql(qStr))


--------- Creating New Schema ----------
Using graph 'xgraph_drugsim_tigergraph'
All jobs are dropped.
The query client_cosinesim_embed_vectors is dropped.
The query client_cosinesim_get_alveo_status is dropped.
The query cosinesim_get_num_devices is dropped.
The query cosinesim_embed_vectors is dropped.
The query node2vec_query is dropped.
The query remove_dangling_vertices is dropped.
The query cosinesim_clear_embeddings is dropped.
The query cosinesim_set_num_devices is dropped.
The query random_walk is dropped.
The query client_cosinesim_load_alveo is dropped.
The query create_embeddings is dropped.
The query cosinesim_match_sw is dropped.
The query atom_embedding is dropped.
The query client_cosinesim_embed_normals is dropped.
The query client_cosinesim_match_sw is dropped.
The query client_cosinesim_match_alveo is dropped.
The query load_node2vec is dropped.
The query cosinesim_ss_fpga_core is dropped.
The query client_cosinesim_set_num_devices is dropped.
The query load_graph_c

### 3. Load Graph Data <a id="loadd"></a>
The UMLS data is split into files. Each Atom entry attribute (a vertex attribute in the schema) is loaded from the Metathesaurus. Each file is loaded and parsed. Open the load query file, then set the graph name and the location of the data files.
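
The placeholder substitution used for the schema, load, and query files can be collected in one small helper. This is a sketch only: `render_gsql` is a hypothetical name, and the sample template is a made-up string, not the contents of the real `load_xgraph.gsql`.

```python
def render_gsql(template, graph_name, data_root=None):
    """Fill in the placeholders used by the demo's .gsql files."""
    rendered = template.replace("@graph", graph_name)
    if data_root is not None:
        rendered = rendered.replace("$sys.data_root", str(data_root))
    return rendered

# Made-up template text that only demonstrates the two placeholders
template = "USE GRAPH @graph\nDATA_ROOT=$sys.data_root"
print(render_gsql(template, "xgraph_drugsim_tigergraph", "/opt/xilinx/apps/data"))
```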

In [9]:
print("\n--------- Loading data into graph ----------")
loadFile = queryFileLocation / "load_xgraph.gsql"

with open(loadFile) as fh:
    qStrRaw = fh.read()
    qStrRaw = qStrRaw.replace('@graph', graphName)
    qStr    = qStrRaw.replace('$sys.data_root', str(serverDataLocation))
    print(conn.gsql(qStr))
    print(conn.gsql(f"USE GRAPH {graphName}\n RUN LOADING JOB load_xgraph"))
    print(conn.gsql(f"USE GRAPH {graphName}\n DROP JOB load_xgraph"))


--------- Loading data into graph ----------
Using graph 'xgraph_drugsim_tigergraph'
The job load_xgraph is created.
Using graph 'xgraph_drugsim_tigergraph'
[Tip: Use "CTRL + C" to stop displaying the loading status update, then use "SHOW LOADING STATUS jobid" to track the loading progress again]
[Tip: Manage loading jobs with "ABORT/RESUME LOADING JOB jobid"]
Starting the following job, i.e.
JobName: load_xgraph, jobid: xgraph_drugsim_tigergraph.load_xgraph.file.m1.1627455198237
Loading log: '/home2/tigergraph/tigergraph/log/restpp/restpp_loader_logs/xgraph_drugsim_tigergraph/xgraph_drugsim_tigergraph.load_xgraph.file.m1.1627455198237.log'

Job "xgraph_drugsim_tigergraph.load_xgraph.file.m1.1627455198237" loading status
[RUNNING] m1 ( Finished: 0 / Total: 2 )
Job "xgraph_drugsim_tigergraph.load_xgraph.file.m1.1627455198237" loading status
[FINISHED] m1 ( Finished: 2 / Total: 2 )
[LOADED]
+---------------------------------------------------------------------------------

### 4. Install Queries <a id="install"></a>
The cosine similarity application is implemented using gsql queries and UDFs (user-defined functions). The Node2Vec algorithm used in this demo is implemented partly as a query and partly as a UDF. UDFs are installed into TigerGraph as a plugin beforehand, while the queries need to be installed for the graph before running.

Users can create their own queries and install them instead. Users who write their own UDFs need to compile and install them as a TigerGraph plugin (not covered in this demo).

In [10]:
print("\n--------- Installing Queries ----------")
baseQFile = queryFileLocation / "base.gsql"
clientQFile = queryFileLocation / "client.gsql"

with open(baseQFile) as bfh, open(clientQFile) as cfh:
    print("installing base queries ...")
    qStrRaw = bfh.read()
    qStr = qStrRaw.replace('@graph', graphName)
    print(conn.gsql(qStr))
    
    print("\ninstalling client queries ...")
    qStrRaw = cfh.read()
    qStr = qStrRaw.replace('@graph', graphName)
    print(conn.gsql(qStr))


--------- Installing Queries ----------
installing base queries ...
[                                                                                   ] 0% (0/15)
[===                                                                                ] 3% (0/15)

Using graph 'xgraph_drugsim_tigergraph'
All queries are dropped.
The query random_walk has been added!
The query node2vec_query has been added!
The query load_node2vec has been added!
The query create_embeddings has been added!
The query remove_dangling_vertices has been added!
The query atom_embedding has been added!
The query cosinesim_clear_embeddings has been added!
The query cosinesim_embed_vectors has been added!
The query cosinesim_embed_normals has been added!
The query cosinesim_match_sw has been added!
The query cosinesim_set_num_devices has been added!
The query cosinesim_get_num_devices has been added!
The query cosinesim_
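
The installation output lists each query as it is added. A minimal sketch of collecting those names, for example to confirm that the queries used later are present, is shown below. The helper name `added_queries` is hypothetical, and the pattern is taken from the message wording in the output above, so it is an assumption about the server's text format.

```python
import re

# Pattern based on the "has been added!" lines printed by the install step
ADDED_PATTERN = re.compile(r"The query (\w+) has been added!")

def added_queries(gsql_output):
    """Extract query names from 'has been added!' lines in GSQL output."""
    return ADDED_PATTERN.findall(gsql_output)

sample_output = """The query random_walk has been added!
The query node2vec_query has been added!
The query load_node2vec has been added!"""
names = added_queries(sample_output)
print(names)

# Report any expected query that did not appear in the output
missing = {"random_walk", "create_embeddings"} - set(names)
print(missing)
```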

Now that the queries are installed, the rest of the operations can be performed simply by running the queries as follows.

### 5. Create Embeddings <a id="embed"></a>
---
As seen earlier in the schema, each Drug Record has a set of attributes and is represented as a vertex. The relations (connections) of each Drug Record, or Atom, are embedded into a vector representation called an embedding, which is then stored as part of the vertex itself. Read more about Node2Vec [here](https://en.wikipedia.org/wiki/Node2vec).

In [11]:
print('Creating Drug concept embeddings and storing them in Atom vertices...')
tStart = time.perf_counter()
conn.runInstalledQuery('client_cosinesim_embed_vectors', timeout=240000000)
conn.runInstalledQuery('client_cosinesim_embed_normals', timeout=240000000)
print(f'completed in {time.perf_counter() - tStart:.4f} sec')

Creating Drug concept embeddings and storing them in Atom vertices...
completed in 12.2411 sec


### 6. Send embeddings to FPGA <a id="send"></a>
---
Finally, the embeddings are collected in a buffer which is sent/copied to HBM memory on the FPGA device. 
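
As an illustration of what collecting embeddings in a buffer can look like, here is a minimal sketch that flattens vectors into one contiguous array. The float element type, the row-major layout, and the name `pack_embeddings` are all assumptions made for this example; the notebook does not describe the actual layout used by the Xilinx library.

```python
from array import array

def pack_embeddings(vectors):
    """Flatten equal-length vectors into one contiguous float buffer (row-major)."""
    if not vectors:
        return array("f"), 0
    dim = len(vectors[0])
    flat = array("f")
    for vec in vectors:
        if len(vec) != dim:
            raise ValueError("all embeddings must have the same length")
        flat.extend(vec)
    return flat, dim

# Made-up 2-dimensional embeddings
buffer, dim = pack_embeddings([[0.5, 0.25], [1.0, 2.0], [3.0, 4.0]])
print(len(buffer), dim, buffer.itemsize * len(buffer), "bytes")
```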

In [12]:
print('Loading data into FPGA memory...')
# set number of FPGAs to use
conn.runInstalledQuery('client_cosinesim_set_num_devices', {'numDevices': numDevices}, timeout=240000000)

# distribute data to FPGA memory
tStart = time.perf_counter()
resultHwLoad = conn.runInstalledQuery('client_cosinesim_load_alveo', timeout=240000000)
print(f'completed in {time.perf_counter() - tStart:.4f} sec\n')

# Check status
status = conn.runInstalledQuery('client_cosinesim_get_alveo_status', timeout=240000000)
isInit = status[0]["IsInitialized"]
numDev = status[0]["NumDevices"]
print(f'FPGA Init: {isInit}, Dev: {numDev}\n')

Loading data into FPGA memory...
completed in 0.2258 sec

FPGA Init: True, Dev: 1
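
Before running queries, it can be useful to turn the status reply into a simple pass/fail check. The sketch below assumes the reply has the shape used in the cell above (a list whose first element carries `IsInitialized` and `NumDevices`); the helper name `fpga_ready` is hypothetical.

```python
def fpga_ready(status, expected_devices):
    """True when the status reply reports an initialized FPGA with enough devices."""
    if not status:
        return False
    first = status[0]
    return bool(first.get("IsInitialized")) and first.get("NumDevices", 0) >= expected_devices

# Reply shaped like the one printed above
sample_status = [{"IsInitialized": True, "NumDevices": 1}]
print(fpga_ready(sample_status, expected_devices=1))
```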



#### Definitions to get a Drug Record/Atom and print the TopK matches

In [13]:
def getRecord(recordId):
    # fetch a single Atom vertex by ID; returns [] when no record comes back
    recordList = conn.getVerticesById('Atom', recordId)
    return [] if len(recordList) == 0 else recordList[0]

def printResults(result, newRecord):
    # print the query record, then one ranked row per match
    matches = result[0]['Matches']
    print(f"Matches for Record: {newRecord['attributes']['atom_id']} {newRecord['attributes']['string_text']}\n")
    print("RANK   ATOM ID    Concept Description" + 50*" " + "Confidence")
    print("----|-----------|" + 60*"-" + "|------------")
    for rank, m in enumerate(matches, start=1):
        matchingRecord = getRecord(m['Id'])
        print(f" {rank:02d}   {matchingRecord['attributes']['atom_id']:13} {matchingRecord['attributes']['string_text'][:55]:60} {m['score']:.6f}")

This completes the preparation of the TigerGraph database and the cosine similarity computation. We can now run as many similarity queries as we want. 

### 7. Compute Cosine Similarity <a id="run"></a>
---
For the purpose of this demo, we get the first 100 Atoms and choose one at random. Atoms are represented by an ID which is passed to the match query.

In [14]:
print('Running Query...')
# pick a random drug record from the first 100 (or fewer, if the graph is smaller)
targetDrugRecords = conn.getVertices('Atom', limit=100)
targetDrugRecord = rand.choice(targetDrugRecords)

# run similarity on the chosen drug record
tStart = time.perf_counter()
result = conn.runInstalledQuery('client_cosinesim_match_alveo',
                                  {'queryRecord': targetDrugRecord['v_id'], 'topK': topK}, timeout=240000000)
tDuration = 1000*(time.perf_counter() - tStart)

printResults(result, targetDrugRecord)
resTime = result[0]["ExecTimeInMs"]
print(f"\nRound Trip time: {tDuration:.2f} msec")
print(f"     Query time: {resTime:.2f} msec")

Running Query...
Matches for Record: A7568440 20-Methylcholanthrene

RANK   ATOM ID    Concept Description                                                  Confidence
----|-----------|------------------------------------------------------------|------------
 01   A7568440      20-Methylcholanthrene                                        1.000000
 02   A7668772      Dimethylbenzanthracene                                       0.942820
 03   A18399960     Mercaptopurine                                               0.568780
 04   A0003594      beta Alanine                                                 0.556910
 05   A23492397     Mercaptopurine                                               0.556750
 06   A22722081     mercaptopurine                                               0.555170
 07   A0017579      8-Hydroxyquinoline                                           0.534590
 08   A1231129      Eicosapentanoic acid                                         0.520670
 09   A0085141      Me

Notice that as a sanity check, the top matching Atom is the query Atom itself with a match score of 1.
Feel free to play with the query!
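
One way to experiment is to wrap the sanity check and the timing in small helpers. The sketch below assumes the result shape used in this notebook (`Matches` entries with `Id` and `score`); the helper names and `fake_run_query` are hypothetical, and the fake stands in for the server so the example runs on its own.

```python
import time

def top_match_is_query(result, query_id, tolerance=1e-6):
    """Check that the best match is the query vertex with a score of about 1."""
    matches = result[0]["Matches"]
    if not matches:
        return False
    best = matches[0]
    return best["Id"] == query_id and abs(best["score"] - 1.0) <= tolerance

def time_query(run_query, query_id, top_k):
    """Run one query and return (result, round-trip milliseconds)."""
    start = time.perf_counter()
    result = run_query(query_id, top_k)
    return result, 1000 * (time.perf_counter() - start)

# Stand-in for conn.runInstalledQuery so the sketch runs without a server
def fake_run_query(query_id, top_k):
    matches = [{"Id": query_id, "score": 1.0}, {"Id": "A7668772", "score": 0.94282}]
    return [{"Matches": matches[:top_k], "ExecTimeInMs": 0.5}]

result, round_trip_ms = time_query(fake_run_query, "A7568440", top_k=2)
print(top_match_is_query(result, "A7568440"), f"{round_trip_ms:.2f} msec")
```

With a live connection, `run_query` could wrap the `conn.runInstalledQuery('client_cosinesim_match_alveo', ...)` call from the cell above.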

#### Thanks for your time!