# Welcome to Xilinx Cosine Similarity Acceleration Demo 
---

**This Notebook demonstrates how to use the Xilinx Cosine Similarity product and shows the power of Xilinx FPGAs to accelerate Cosine Similarity**

---

### The Demo : Wiki Search Engine 

In this demo, we build a search engine over Wikipedia data.

The user provides a <u>keyword</u> or a <u>context phrase</u> to search for related information.

The demo takes the given keyword or phrase, compares it against the Wikipedia documents, and returns the top-matching pages.

The top matches are ranked by the similarity between the query and every Wikipedia page. This similarity measure is known as [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

Instead of computing similarity on a direct one-hot word representation, we use [GloVe](https://en.wikipedia.org/wiki/GloVe_(machine_learning)) word embeddings, which map words into a more meaningful vector space.

In general, computing cosine similarity over a large dataset takes a long time on a CPU.

Xilinx Cosine Similarity acceleration speeds up this process by roughly 80x or more.

We will use the Xilinx Cosine Similarity module (**xilCosineSim**) to set up a population against which the similarity of target vectors can be calculated.
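
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The following is a minimal CPU sketch in NumPy, not the xilCosineSim implementation; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros (assumed convention)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        return 0.0
    return float(np.dot(a, b) / norm_product)

# Vectors pointing the same way score close to 1, perpendicular vectors score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the score depends only on direction, scaling either vector by a positive constant leaves it unchanged.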

 
### The Demo is Structured in Six Sections :
1. [**Download Wikipedia Data & GloVe Embeddings File**](#DownloadFiles)
<br><br>
2. [**Load and Parse Wikipedia XML File**](#LoadandParse)
<br><br>
3. [**Clean the XML Data**](#DataClean)
<br><br>
4. [**Calculate the Embeddings Representation for All Wiki Pages**](#GloVe)
<br><br>
5. [**Load the Embeddings representation of Wiki Pages into U50 HBM Memory**](#ConfigureDevice)
<br><br>
6. [**Run Cosine Similarity to Find out the TopK Matchings for the given Query**](#TopKMatchings)

 #### Load Xilinx Cosine Similarity Library

In [None]:
import xilCosineSim as xcs

#### Load Necessary Libraries 

In [None]:
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
import time
import string
import re
import os

#### Download the Wikipedia Data Dump <a id="DownloadFiles"></a>

In [None]:
wiki_xml_file = "enwiki-latest-pages-articles-multistream1.xml-p1p41242"
# bzip2 -d deletes the .bz2 archive, so only download when the extracted XML is also missing
if not os.path.isfile(wiki_xml_file):
    if not os.path.isfile(wiki_xml_file + ".bz2"):
        print("Downloading Wikipedia File ...")
        os.system("wget https://dumps.wikimedia.org/enwiki/latest/" + wiki_xml_file + ".bz2")
        print("Download Completed !!")
    os.system("bzip2 -d " + wiki_xml_file + ".bz2")

#### Download the GloVe File

In [None]:
if not os.path.isfile("glove.6B.50d.txt.tar") :
    print("Downloading GloVe Embedding File ...")
    os.system("wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ogyMmAu0fcZBdSwTQJuX6jHLzlTJnql0' -O glove.6B.50d.txt.tar")
    print("Download Completed !!")
if not os.path.isfile("glove.6B.50d.txt") :
    os.system("tar -xvzf glove.6B.50d.txt.tar")

#### Load and Parse the Wikipedia XML File <a id="LoadandParse"></a>

In [None]:
def Parse_Wikipedia_XML(wikipedia_xml, Max_Pages=None):
    
    '''
    The input is a decompressed Wikipedia XML dump (whole or partial).
    The compressed dump can be downloaded from https://dumps.wikimedia.org/enwiki/
    To decompress, use: bzip2 -d <bz2_compressed_xml_file_path>
    
    The XML file is parsed from the root. This function walks the children of the root and their sub-elements.
    Every child whose tag contains 'page' is a single Wikipedia page.
    The XML file consists of many such pages.
    Each page contains a title, text and other metadata items.
    We build a dictionary with the title as key and the text as value.
    '''
    
    tStart = time.perf_counter()
    tree = ET.parse(wikipedia_xml)
    print(f'Wikipedia XML file Load completed in  : {(time.perf_counter() - tStart):.6f} sec')
    tStart = time.perf_counter()
    root = tree.getroot()
    dictionary = {}
    if Max_Pages is not None:
        root = root[0:Max_Pages]
    for child in root :
        if "page" in child.tag : 
            for branch in child:
                if "title" in branch.tag:
                    if branch.text.isupper() :
                        title = branch.text
                    else : 
                        title_list = re.findall('[A-Z][a-z]*', branch.text)
                        title = " ".join(title_list)
                    dictionary[title] = ""
                if "redirect" in branch.tag :
                    dictionary.pop(title)
                if "revision" in branch.tag : 
                    for chunk in branch:
                        if "text" in chunk.tag:
                            number = 0
                            for line in chunk.text.split("\n") : 
                                if "|" not in line[0:5] and "{{" not in line[0:5] and "}}" not in line[0:5]:
                                    if len(line) > 100 : 
                                        number = number + 1
                                        try:
                                            dictionary[title] = dictionary[title] + " " + line
                                        except:
                                            pass
                                        if number >= 5:  # Only read the first five paragraphs
                                            break
    print(f'Wikipedia XML Data Parse completed in : {(time.perf_counter() - tStart):.6f} sec')
    return dictionary
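
The parser above tests tags with `in` rather than `==` because ElementTree prefixes every tag with the document's XML namespace. The sketch below shows this on a tiny hand-written sample; the sample XML, its namespace URI, and the helper `local_name` are assumptions for illustration and are not taken from the real dump.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that imitates the dump layout (namespace URI is assumed)
sample_xml = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Anarchism</title>
    <revision><text>Anarchism is a political philosophy.</text></revision>
  </page>
  <page>
    <title>AccessibleComputing</title>
    <redirect title="Computer accessibility" />
    <revision><text>#REDIRECT [[Computer accessibility]]</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(sample_xml)
first_tag = root[0].tag  # namespace-qualified, e.g. '{...}page'

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]

pages = {}
for page in root:
    if local_name(page.tag) != "page":
        continue
    fields = {local_name(child.tag): child for child in page}
    if "redirect" in fields:  # skip redirect stubs, as the parser above does
        continue
    text = ""
    for chunk in fields["revision"]:
        if local_name(chunk.tag) == "text":
            text = chunk.text or ""
    pages[fields["title"].text] = text

print(first_tag)
print(pages)
```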

#### Helper Functions to Clean the Data <a id="DataClean"></a>
---
###### 1. Remove words which do not contribute to context 
###### 2. Remove Punctuation 
###### 3. Remove Extra Spaces <br>

In [None]:
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",\
             "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",\
             "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",\
             "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", \
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", \
             "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", \
             "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",\
             "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",\
             "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", \
             "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

def Remove_Stopwords(text):
    return " ".join([item for item in text.split(" ") if item not in stopwords])

def Remove_Punctuation(text):
    new_text = ""
    for char in text: 
        if char in string.punctuation:
            new_text = new_text + " "
        else : 
            new_text = new_text + char
    return new_text  

def Remove_Extraspaces(text):
    return " ".join([ item for item in text.split() if item])
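
To see what the cleaning steps do to a sentence, here is a compact, self-contained sketch that chains lowercasing, punctuation removal, whitespace collapsing and stopword removal in the same order as `Apply_Cleaning`. The function `clean_text`, the reduced `STOPWORDS` set and the sample sentence are illustrative assumptions; the notebook itself uses the helpers and the full stopword list defined above.

```python
import string

# Small illustrative subset of the full stopword list
STOPWORDS = {"the", "of", "in", "is", "a", "and"}

# Translation table that maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, replace punctuation with spaces, collapse spaces, drop stopwords."""
    tokens = text.lower().translate(PUNCT_TO_SPACE).split()
    return " ".join(token for token in tokens if token not in stopwords)

cleaned = clean_text("The Battle of Hastings, fought in 1066, is well-known!")
print(cleaned)  # battle hastings fought 1066 well known
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words such as "well-known" as two separate tokens.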


#### Function to Apply the Cleaning Methods on Data


In [None]:
def Apply_Cleaning(data_frame):
    
    tStart = time.perf_counter()
    data_frame["Lower"] = data_frame["Text"].apply(lambda x: x.lower())
    data_frame["RemovePunctuation"] = data_frame["Lower"].apply(lambda x: Remove_Punctuation(x))
    data_frame["RemoveExtraSpaces"] = data_frame["RemovePunctuation"].apply(lambda x: Remove_Extraspaces(x))
    data_frame["RemoveStopWords"] = data_frame["RemoveExtraSpaces"].apply(lambda x: Remove_Stopwords(x))  
    print(f'Data Cleaning Completed in {(time.perf_counter() - tStart):.6f} sec')
    return data_frame

#### Load the GloVe File & Create an Accessible Lookup Dictionary <a id="GloVe"></a>


In [None]:
def Load_Glove(glove_file_path):
    
    '''
    The GloVe file contains embeddings for 400,000 words.
    In each line, the first item is the word itself.
    The rest of the values in the line, separated by spaces, are its 50 embedding values.
    Here we create a dictionary with each word as a key and its 50 embedding values as the value.
    '''
    
    tStart = time.perf_counter()
    glove_dict = {}  # word -> list of 50 float32 embedding values
    with open(glove_file_path, encoding="utf8") as glove_file:
        for line in glove_file:
            line_list = line.rstrip("\n").split(" ")
            glove_dict[line_list[0]] = list(np.array(line_list[1:], dtype=np.float32))
    print(f'Loading GloVe Embedding File completed in : {(time.perf_counter() - tStart):.6f} sec')
    print(f'Number of words in vocabulary, having GloVe representation : {len(glove_dict)}')
    return glove_dict 
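
The file layout described in the docstring can be tried out without the full download. The sketch below parses two made-up lines in the same "word value value ..." layout from an in-memory buffer; the words, the 3-dimensional values and the name `embeddings` are assumptions for illustration (the real file has 50 values per line).

```python
import io
import numpy as np

# Two made-up lines in GloVe text layout, shortened to 3 dimensions
sample_file = io.StringIO("king 0.5 -0.25 1.0\nqueen 0.25 0.75 -1.5\n")

embeddings = {}
for line in sample_file:
    # First field is the word, the remaining fields are its embedding values
    word, *values = line.rstrip("\n").split(" ")
    embeddings[word] = np.array([float(v) for v in values], dtype=np.float32)

print(embeddings["king"])
print(len(embeddings))
```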

#### Map GloVe Vector for each Word in Sentence

In [None]:
def Glove_Formatting(x):
    
    '''
    Given a sentence (as a list of words), map each word to its 50-dimensional embedding.
    If a word is not found in the GloVe dictionary, use a 50-dimensional vector of all zeros.
    '''
    
    glove_vector = []
    for item in x:
        try:
            glove_vector.append(glove_dict[item])
        except KeyError:  # Word has no GloVe representation
            glove_vector.append([0.0] * 50)
    return np.array(glove_vector)

#### Map Embeddings of each Word and find the Average of Embeddings for Sentence

In [None]:
def Map_GloVe_Embeddings(data_frame):
    
    tStart = time.perf_counter()
    data_frame["Embeddings"] = data_frame["RemoveStopWords"].apply(lambda x : Glove_Formatting(x.split(" ")) )
    data_frame["EmbeddingsAverage"] = data_frame["Embeddings"].apply(lambda x : np.sum(x, axis=0)/len(x))
    print(f'Embeddings Mapping for the entire Wiki data, Completed in {(time.perf_counter() - tStart):.6f} sec')
    return data_frame
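
The averaging step reduces a page of any length to a single fixed-size vector. A minimal sketch with a toy 2-dimensional lookup table follows; `toy_glove`, `average_embedding` and the token list are hypothetical names and data chosen for illustration.

```python
import numpy as np

# Toy lookup table standing in for the GloVe dictionary (2 dimensions instead of 50)
toy_glove = {"sea": [1.0, 0.0], "battle": [0.0, 2.0]}
DIM = 2

def average_embedding(tokens, lookup, dim):
    """Average the word vectors; unknown words contribute an all-zero vector."""
    vectors = [lookup.get(token, [0.0] * dim) for token in tokens]
    return np.mean(np.array(vectors, dtype=np.float32), axis=0)

# "zzz" is not in the table, so it is counted as a zero vector
avg = average_embedding(["sea", "battle", "zzz"], toy_glove, DIM)
print(avg)
```

Note that unknown words still count toward the divisor, so a page with many out-of-vocabulary words gets an average that is pulled toward zero in magnitude. Cosine similarity ignores magnitude, so this affects the score only through the direction of the remaining known words.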

#### Provide Wikipedia XML File Path & Call Parsing Function

In [None]:
wikipedia_xml = "enwiki-latest-pages-articles-multistream1.xml-p1p41242"
dictionary = Parse_Wikipedia_XML(wikipedia_xml)

#### Load GloVe File & Create GloVe Dictionary

In [None]:
glove_file_path = "glove.6B.50d.txt"
glove_dict = {}
glove_dict = Load_Glove(glove_file_path)

#### Create a Pandas DataFrame from the Parsed XML, with Title and Text as Columns and One Row per Wiki Page

In [None]:
# Build both columns directly from the parsed {title: text} dictionary
data_frame = pd.DataFrame({"Title": list(dictionary.keys()), "Text": list(dictionary.values())})
data_frame.head()

#### Apply Cleaning for Dataframe & derive Embeddings Average for Wiki Pages 

In [None]:
data_frame = Apply_Cleaning(data_frame)
data_frame = Map_GloVe_Embeddings(data_frame)

data_frame.head()

#### Assertion Check for the Size of the Final Average Embedding Dimension <br>

In [None]:
for i in range(len(data_frame)):
    assert data_frame["EmbeddingsAverage"][i].shape[0] ==  50

#### Set the Embedding Vector Size, Population Length & Bytes per Value to Configure the Load for the FPGA  <a id="ConfigureDevice"></a>

In [None]:
VectorLength = 50
NumVectors = len(data_frame)
Bytes_Per_value = 4
NumDevices = 1

####  Configure Population Load in FPGA 

In [None]:
opt = xcs.options()
opt.vecLength = VectorLength
opt.numDevices = NumDevices

#### The U50 has 8 GB of HBM Memory. Check that the Data Load does not Exceed this Limit

In [None]:
assert VectorLength * NumVectors * Bytes_Per_value  <  NumDevices * 8 * 2**30, "Each U50 has 8 GB of HBM memory. Cannot load the given amount of data into memory"
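
The check above is plain arithmetic: vectors times values per vector times bytes per value, compared with 8 GiB per device. The sketch below works through it for a hypothetical population of 1,000,000 vectors; `population_bytes` and the vector count are assumptions for illustration, and the result is only an upper bound since it ignores any padding or reserved memory the library may need.

```python
def population_bytes(num_vectors, vector_length, bytes_per_value):
    """Raw size of the population: one value per vector element."""
    return num_vectors * vector_length * bytes_per_value

HBM_BYTES = 8 * 2**30  # 8 GiB per U50, as stated in the check above

# Hypothetical population of one million 50-dimensional, 4-byte vectors
example_bytes = population_bytes(1_000_000, 50, 4)

# Largest population of such vectors that fits the raw 8 GiB budget
max_vectors = HBM_BYTES // (50 * 4)

print(example_bytes)  # 200000000
print(max_vectors)    # 42949672
```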

#### Load the Population Embeddings into U50 HBM Memory

In [None]:
cs = xcs.cosinesim(opt, Bytes_Per_value)
cs.startLoadPopulation(NumVectors)
for vecNum in range(NumVectors):
    vecBuf = cs.getPopulationVectorBuffer(vecNum)

    valVec = []
    for vecIdx in range(VectorLength):
        valVec.append((int(data_frame["EmbeddingsAverage"][vecNum][vecIdx]*1000)))  # Converting Float32 Value to Int Type
    vecBuf.append(valVec)
    cs.finishCurrentPopulationVector(vecBuf)

cs.finishLoadPopulation()
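
The load loop converts each float embedding to an integer by multiplying by 1000 and truncating. The CPU-only sketch below mimics that conversion in NumPy and measures how much the cosine similarity moves as a result; `quantize`, `cosine` and the random test vectors are assumptions for illustration and do not call the xilCosineSim API.

```python
import numpy as np

def quantize(vec, scale=1000):
    """Scale and truncate toward zero, like int(value * 1000) in the load loop."""
    return (np.asarray(vec, dtype=np.float64) * scale).astype(np.int32)

def cosine(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

small = quantize([0.1234, -0.9876])  # truncation, not rounding

# Random 50-dimensional vectors standing in for two averaged embeddings
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
drift = abs(cosine(a, b) - cosine(quantize(a), quantize(b)))

print(small.tolist())  # [123, -987]
print(drift)
```

Since cosine similarity is unaffected by a uniform positive scale, the only error comes from truncating the digits beyond the third decimal place.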

####  Find the TopK Matchings for the Given Query <a id="TopKMatchings"></a>

In [None]:
def Find_TopK_Matchings(query, topK=10):
    
    '''
    Apply the Cleaning, GloVe Mapping function on the given Query.
    Call the Xilinx Cosine Similarity Match Target Vector API to find the top matches within the loaded population.
    Display the matching pages' information.
    '''
    
    query_clean = Remove_Stopwords(Remove_Extraspaces(Remove_Punctuation(query.lower())))
    query_embedding = np.sum(Glove_Formatting(query_clean.split(" ")), axis=0)/len(query_clean.split(" "))*1000
    targetVec = query_embedding.astype("int32")
    
    tStart = time.perf_counter()
    result = cs.matchTargetVector(topK, targetVec)
    print(f'completed in {1000*(time.perf_counter() - tStart):.6f} msec\n')
    print("RANK  ID    Wiki Title \t\t\t\t\t\t  MESSAGE \t\t\t\t        CONFIDENCE")
    print("----|-----|-------------|" + 65 *"-" + "---------------------|---------")
    num = 0
    for item in result:
        num = num +1
        Message = data_frame["RemoveStopWords"][item.index]
        print("{:02d}".format(num) + 3*" " + "{:05d}".format(item.index) + 3*" "+ \
              '{message: <10}'.format(message=Remove_Extraspaces(data_frame["Title"][item.index][0:10])) + \
              3*" " + Message[0:35] + " ... " + Message[-45:-1] + 3* " " +\
              '{:.6f}'.format(item.similarity) )
    print(f'\nTopK Matchings completed in {1000*(time.perf_counter() - tStart):.6f} msec')
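
To sanity-check the ranking on a small population, a CPU reference for the top-K search can be written in a few lines of NumPy. This is a sketch for cross-checking only and makes no claim about how `matchTargetVector` works internally; `topk_cosine` and the toy population are hypothetical.

```python
import numpy as np

def topk_cosine(population, target, k):
    """Return [(row_index, similarity), ...] for the k most similar rows."""
    population = np.asarray(population, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    norms = np.linalg.norm(population, axis=1) * np.linalg.norm(target)
    safe_norms = np.where(norms == 0.0, 1.0, norms)  # avoid division by zero
    sims = np.where(norms == 0.0, 0.0, (population @ target) / safe_norms)
    order = np.argsort(-sims, kind="stable")[:k]     # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy population of four 2-dimensional vectors
population = [[1, 0], [0, 1], [1, 1], [-1, 0]]
results = topk_cosine(population, [1.0, 0.0], k=2)
print(results)
```

Running the same query through both paths and comparing the returned indices is a simple way to confirm that the population was loaded in the expected order.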

---
#### Up to this point, everything is a One-Time Load & Execution. Once all the Data is Loaded, we can run any number of Queries 
---

#### Call the TopK Matchings Function with Your Query Input
<br>

In [None]:
Find_TopK_Matchings(query="Agriculture", topK=5)

In [None]:
Find_TopK_Matchings(query="Battle Ships", topK=10)

In [None]:
Find_TopK_Matchings(query="Second World War", topK=20)

In [None]:
Find_TopK_Matchings(query="Drug Discovery", topK=15)

In [None]:
Find_TopK_Matchings(query="United Nations", topK=10)

### <center> End of the Notebook </center>