# Welcome to Xilinx Cosine Similarity Acceleration Demo 
---

**This Notebook demonstrates how to use the Xilinx Cosine Similarity product and shows the power of Xilinx FPGAs to accelerate Cosine Similarity**

---

### The Demo : Log Similarity 

This example uses a TigerGraph graph database to represent log messages within their contexts and finds past trouble-log messages that are similar to a given query message. 

A user can then apply solutions that worked in the past to resolve current issues. <br><br>

The user can provide a <u>message string</u> to search for similar trouble logs that occurred in the past.

This example selects a random vertex in the graph as the query, searches the log database, and returns the top matching logs. 

The top matches are ranked by the similarity between the given message and every log entry.
The similarity measure used here is [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
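
As a rough illustration of the ranking described above, the following pure-Python sketch scores a query vector against a few candidates and keeps the best `k`. The function names, the toy 3-dimensional vectors, and the zero-vector convention are assumptions made for this example; the demo itself performs this computation on the FPGA through installed queries.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: all-zero vectors score 0
    return dot / (norm_a * norm_b)

def top_k_matches(query, candidates, k):
    """Return the k (id, score) pairs with the highest similarity to query."""
    scored = ((cid, cosine_similarity(query, vec)) for cid, vec in candidates.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy vectors (hypothetical data); the demo uses 50-dimensional GloVe embeddings.
log_vectors = {
    "log_a": [1.0, 0.0, 0.0],
    "log_b": [0.9, 0.1, 0.0],
    "log_c": [0.0, 1.0, 0.0],
}
best = top_k_matches([1.0, 0.0, 0.0], log_vectors, k=2)
```

Every candidate is scored once per query, so the cost grows linearly with the number of log entries. That is the part of the workload the accelerator is meant to speed up.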

Instead of computing similarity on a direct one-hot word representation, we use [GloVe](https://en.wikipedia.org/wiki/GloVe_(machine_learning)) word embeddings, which map words into a more meaningful vector space.

In general, computing cosine similarity on a large dataset takes a long time on a CPU.

Xilinx Cosine Similarity Acceleration can speed up the process by multiple orders of magnitude.

We will use Xilinx Cosine Similarity in the backend and set up a log database against which the similarity of target vectors can be calculated.

 
### The Demo is Structured into Seven Sections :

1. [**Create New Graph**](#newg)
<br><br>
2. [**Create Graph Schema**](#schema)
<br><br>
3. [**Load Graph Data**](#loadd)
<br><br>
4. [**Install Queries**](#install)
<br><br>
5. [**Create Embeddings**](#embed)
<br><br>
6. [**Send Embeddings to FPGA**](#send)
<br><br>
7. [**Compute Cosine Similarity**](#run)

---
Steps [ 1 - 6 ] are **One-Time** preparation of the Database. 

Step [ 7 ] is **Repeatable** use of Query to run *accelerated* similarity computation on FPGA 

---

### Prerequisites <a id="prerequisites"></a> :

#### Load Necessary Libraries 

In [None]:
import time
import random as rand
from pathlib import Path, PurePosixPath
import pyTigerGraph as tg
import os

#### Check the Python version. It should be 3.6

In [None]:
from platform import python_version
print(python_version())

#### Provide the Host Name, User Name & Password

In [None]:
hostName = "localhost"                            # TG server hostname
userName = "tigergraph"                           # TG user name
passWord = "tigergraph"                           # TG password

#### Provide the number of results 'topK' to find & Number of U50 Cards installed on System

In [None]:
topK = 10                                         # Number of highest scoring log record matches
numDevices = 1                                    # Number of FPGA devices to distribute the queries to

#### Path Setup

**Local**: Location of query files under the Xilinx graphanalytics github repo. Set location of the local repo.

In [None]:
localRepoLocation = Path("/proj/xsjhdstaff3/sachink/ghe")
exampleLocation = Path("graphanalytics/plugin/tigergraph/recomengine/staging/examples/log_similarity") # when running from github repo
queryFileLocation = localRepoLocation / exampleLocation / "query"

**Remote**: Location of the log data on the server. NOTE: The data must already exist on the TigerGraph server.

In [None]:
serverRepoLocation = PurePosixPath("/proj/xsjhdstaff3/sachink/ghe")
serverDataLocation = serverRepoLocation / PurePosixPath(exampleLocation) / "data"

#### Download the GloVe File  <a id="DownloadFiles"></a>

In [None]:
if not os.path.isfile("/tmp/glove.6B.50d.txt") :
    if not os.path.isfile("/tmp/glove.6B.50d.txt.tar") :
        print("Downloading GloVe Embedding File ...")
        os.chdir("/tmp")
        os.system("wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ogyMmAu0fcZBdSwTQJuX6jHLzlTJnql0' -O glove.6B.50d.txt.tar")
        print("Download Completed!")
    print("Extracting GloVe Embedding File ...")
    os.chdir("/tmp")
    os.system("tar -xvzf glove.6B.50d.txt.tar")
    print("Done!")
else:
    print("GloVe Model File already present.")

---
### 1. Create New Graph <a id="newg"></a>
- Connect to the TigerGraph server, omitting the graph name. This establishes a valid REST endpoint that is then used to create the new graph.
- Create the new graph with a gsql command, then open a new connection to that graph.

In [None]:
graphName = f'xgraph_logsim_{userName}'
conn = tg.TigerGraphConnection(host='http://' + hostName, graphname='', username=userName, password=passWord, useCert=False)
print("\n--------- Creating New graph ----------")
print(conn.gsql(f'create graph {graphName}()', options=[]))

# connect to TG server with new graph
print(f'Using graph {graphName}')
conn = tg.TigerGraphConnection(host='http://' + hostName, graphname=graphName, username=userName, password=passWord, useCert=False)

### 2. Create Graph Schema <a id="schema"></a>
TigerGraph stores a graph as vertices that can be connected to other vertices by directed or undirected edges. This structure is specified in a graph schema. For this demo, the schema is already defined in a query file. Load the file, set the graph name, and run it as gsql commands. 

Users can create a schema for their own graph in a similar way. 
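
The load, substitute, and run pattern described above can be factored into a small helper. This is only a sketch: `render_gsql` is a hypothetical name, and the temporary file and graph name stand in for a real `.gsql` file. The `@graph` placeholder is the one used by this demo's query files.

```python
from pathlib import Path
import tempfile

def render_gsql(path, substitutions):
    """Read a gsql template and replace each placeholder with its value."""
    text = Path(path).read_text()
    for placeholder, value in substitutions.items():
        text = text.replace(placeholder, str(value))
    return text

# Demonstrate on a throwaway file so the sketch runs without the repo.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.gsql"
    sample.write_text("USE GRAPH @graph\n")
    rendered = render_gsql(sample, {"@graph": "xgraph_logsim_demo"})
```

The rendered string could then be passed to `conn.gsql(...)` in the same way the notebook cells do.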

In [None]:
print("\n--------- Creating New Schema ----------")
schemaFile = queryFileLocation / "schema_xgraph.gsql"

with open(schemaFile) as fh:
    qStrRaw = fh.read()
    qStr = qStrRaw.replace('@graph', graphName)
    print(conn.gsql(qStr))

### 3. Load Graph Data <a id="loadd"></a>
The log dataset is split into separate files for the vertices and edges. Each file is loaded and parsed to populate vertex attributes and edges. Open the load query file, then set the graph name and the location of the data files.

In [None]:
print("\n--------- Loading data into graph ----------")
loadFile = queryFileLocation / "load_xgraph.gsql"

with open(loadFile) as fh:
    qStrRaw = fh.read()
    qStrRaw = qStrRaw.replace('@graph', graphName)
    qStr    = qStrRaw.replace('$sys.data_root', str(serverDataLocation))
    print(conn.gsql(qStr))
    print(conn.gsql(f"USE GRAPH {graphName}\n RUN LOADING JOB load_xgraph"))
    print(conn.gsql(f"USE GRAPH {graphName}\n DROP JOB load_xgraph"))

### 4. Install Queries <a id="install"></a>
The cosine similarity functionality is implemented using gsql queries and UDFs. The queries must be installed before they can be run.

Users can create and install their own queries instead. Users who write their own UDFs will need to compile them and load them as a TigerGraph plugin (this is not covered in this demo).

In [None]:
print("\n--------- Installing Queries ----------")
baseQFile = queryFileLocation / "base.gsql"
clientQFile = queryFileLocation / "client.gsql"

with open(baseQFile) as bfh, open(clientQFile) as cfh:
    print("installing base queries ...")
    qStrRaw = bfh.read()
    qStr = qStrRaw.replace('@graph', graphName)
    print(conn.gsql(qStr))
    
    print("\ninstalling client queries ...")
    qStrRaw = cfh.read()
    qStr = qStrRaw.replace('@graph', graphName)
    print(conn.gsql(qStr))

Now that the queries are installed, the rest of the operations can be performed simply by running the queries, as follows.

### 5. Create Embeddings <a id="embed"></a>
---
As seen earlier in the schema, each Log Record has a set of attributes and is represented as a vertex. The attributes of each Log Record are converted into a vector representation, called an embedding, which is then stored on the Log Record vertex itself. 

To create embeddings, we use GloVe (Global Vectors) word representations and simply average the embeddings of the words occurring in a Log Record. Users are encouraged to experiment with other ways of computing the embeddings to improve the quality of the similarity results.
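
The averaging step can be sketched in plain Python as follows. The helper names, the whitespace tokenization, the lowercasing, and the zero-vector fallback for messages with no known words are all assumptions made for illustration; the demo's actual embedding logic lives in the installed queries and UDFs. The input format (a word followed by space-separated floats on each line) is the one used by GloVe text files.

```python
def load_glove(lines):
    """Parse GloVe text lines of the form 'word v1 v2 ... vn' into a dict."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def embed_message(message, table, dim):
    """Average the vectors of the known words in message; zeros if none are known."""
    vectors = [table[w] for w in message.lower().split() if w in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Tiny 2-dimensional table (hypothetical data) standing in for glove.6B.50d.txt.
toy = load_glove(["disk 1.0 0.0", "full 0.0 1.0", "error 1.0 1.0"])
vec = embed_message("Disk full", toy, dim=2)
```

With a real file, `load_glove(open("/tmp/glove.6B.50d.txt", encoding="utf-8"))` would build the table, with `dim=50`.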

In [None]:
print('Creating Log embeddings and storing them in LogRecord vertices...')
tStart = time.perf_counter()
conn.runInstalledQuery('client_cosinesim_embed_vectors', timeout=240000000)
conn.runInstalledQuery('client_cosinesim_embed_normals', timeout=240000000)
print(f'completed in {time.perf_counter() - tStart:.4f} sec')

### 6. Send embeddings to FPGA <a id="send"></a>
---
Finally, the embeddings are collected in a buffer which is sent/copied to HBM memory on the FPGA device. 

In [None]:
print('Loading data into FPGA memory...')
# set number of FPGAs to use
conn.runInstalledQuery('client_cosinesim_set_num_devices', {'numDevices': numDevices}, timeout=240000000)

# distribute data to FPGA memory
tStart = time.perf_counter()
resultHwLoad = conn.runInstalledQuery('client_cosinesim_load_alveo', timeout=240000000)
print(f'completed in {time.perf_counter() - tStart:.4f} sec\n')

# Check status
status = conn.runInstalledQuery('client_cosinesim_get_alveo_status', timeout=240000000)
isInit = status[0]["IsInitialized"]
numDev = status[0]["NumDevices"]
print(f'FPGA Init: {isInit}, Dev: {numDev}\n')

#### Helper functions to fetch a Log Record and print the top-K matches

In [None]:
def getRecord(recordId):
    # fetch a single LogRecord vertex by ID; None if it does not exist
    recordList = conn.getVerticesById('LogRecord', recordId)
    return recordList[0] if recordList else None

def printResults(result, newRecord):
    matches = result[0]['Matches']
    print(f"Matches for Record: {newRecord['v_id']} {newRecord['attributes']['SEVERITY']} {newRecord['attributes']['MESSAGE']}\n")
    print("RANK     ID        SEVERITY       MESSAGE" + 50*" " + "Confidence")
    print("----|---------|---------------|" + 60*"-" + "|------------")
    for rank, m in enumerate(matches, start=1):
        matchingRecord = getRecord(m['Id'])
        if matchingRecord is None:
            continue  # skip matches whose vertex could not be fetched
        attrs = matchingRecord['attributes']
        print(f" {rank:02d}  {m['Id']:10}  {attrs['SEVERITY']:13} {attrs['MESSAGE'][:55]:60} {m['score']:.6f}")

This completes the preparation of the TigerGraph database and the cosine similarity compute. We can now run as many similarity queries as we want. 

### 7. Compute Cosine Similarity <a id="run"></a>
---
For the purpose of this demo, we fetch up to 100 Log Records and choose one at random. Each Log Record is identified by an ID, which is passed to the match query.

In [None]:
print('Running Query...')
# pick a random log record out of (up to) 100
targetLogRecords = conn.getVertices('LogRecord', limit=100)
targetLogRecord = rand.choice(targetLogRecords)  # works even if fewer than 100 are returned

# run similarity on the chosen log record
tStart = time.perf_counter()
result = conn.runInstalledQuery('client_cosinesim_match_alveo',
                                  {'queryRecord': targetLogRecord['v_id'], 'topK': topK}, timeout=240000000)
tDuration = 1000*(time.perf_counter() - tStart)

printResults(result, targetLogRecord)
resTime = result[0]["ExecTimeInMs"]
print(f"\nRound Trip time: {tDuration:.2f} msec")
print(f"     Query time: {resTime:.2f} msec")

Notice that, as a sanity check, one of the top matching records should be the query Log Record itself, with a match score of 1.
Feel free to play with the query!
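
Because step 7 is repeatable, it can be useful to time several runs rather than just one. The sketch below is a generic timing helper; `time_calls` and `summarize` are hypothetical names, and the lambda is a stand-in workload so the example runs without a TigerGraph server. In the notebook, the callable would wrap the `conn.runInstalledQuery('client_cosinesim_match_alveo', ...)` call shown above.

```python
import statistics
import time

def time_calls(fn, repeats):
    """Call fn() repeats times and return per-call wall-clock times in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(1000 * (time.perf_counter() - start))
    return durations

def summarize(durations):
    """Reduce a list of timings to min, median, and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }

# Stand-in workload (assumption); replace with a function that runs the match query.
samples = time_calls(lambda: sum(range(1000)), repeats=5)
stats = summarize(samples)
```

Reporting the median alongside the minimum and maximum makes it easier to separate steady-state query time from one-off warm-up effects.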

#### Thanks for your time!