# Welcome to Xilinx Cosine Similarity Acceleration Demo 
---

**This Notebook demonstrates how to use the Xilinx Cosine Similarity product and shows the power of Xilinx FPGAs to accelerate Cosine Similarity**

---

### The Demo : Wiki Search Engine 

In this demo, we build a search engine on top of Wikipedia data.

The user provides a <u>Keyword</u> or a <u>Context Phrase</u> to search for related information.

The example takes the given keyword or phrase, compares it against the Wikipedia documents, and returns the top matching pages.

The top matches are ranked by the similarity between the given keyword and every Wikipedia page. This similarity measure is known as [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).
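
As a point of reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below is a plain NumPy illustration, not the Xilinx implementation; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Vectors pointing the same way score 1, perpendicular vectors score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on direction, two documents of very different length can still be close matches.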

Instead of computing similarity on a direct one-hot word representation, we use [GloVe](https://en.wikipedia.org/wiki/GloVe_(machine_learning)) word embeddings, which map words into a more meaningful vector space.

In general, computing Cosine Similarity over a large dataset takes a long time on a CPU.

Xilinx Cosine Similarity Acceleration speeds up this process by more than approximately 80x.

We will use the Xilinx Cosine Similarity module (**xilCosineSim**) to set up a population against which the similarity of target vectors can be calculated.

 
### The Demo is Structured in Six Sections :
1. [**Download Wikipedia Data & GloVe Embeddings File**](#DownloadFiles)
<br><br>
2. [**Load and Parse Wikipedia XML File**](#LoadandParse)
<br><br>
3. [**Clean the XML Data**](#DataClean)
<br><br>
4. [**Calculate the Embeddings Representation for All Wiki Pages**](#GloVe)
<br><br>
5. [**Load the Embeddings representation of Wiki Pages into U50 HBM Memory**](#ConfigureDevice)
<br><br>
6. [**Run Cosine Similarity to Find out the TopK Matchings for the given Query**](#TopKMatchings)

 #### Load Xilinx Cosine Similarity Library

In [1]:
import xilCosineSim as xcs

#### Load Necessary Libraries 

In [2]:
import xml.etree.ElementTree as ET
import pandas as pd
import numpy as np
import time
import string
import re
import os

#### Download the Wikipedia Data Dump <a id="DownloadFiles"></a>

In [3]:
wiki_xml = "enwiki-latest-pages-articles-multistream1.xml-p1p41242"
if not os.path.isfile(wiki_xml):
    # bzip2 -d deletes the archive, so download only when the extracted file is missing too
    if not os.path.isfile(wiki_xml + ".bz2"):
        print("Downloading Wikipedia File ...")
        os.system("wget https://dumps.wikimedia.org/enwiki/latest/" + wiki_xml + ".bz2")
        print("Download Completed !!")
    os.system("bzip2 -d " + wiki_xml + ".bz2")

Downloading Wikipedia File ...
Download Completed !!


#### Download the GloVe File

In [4]:
if not os.path.isfile("glove.6B.50d.txt.tar") :
    print("Downloading GloVe Embedding File ...")
    os.system("wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ogyMmAu0fcZBdSwTQJuX6jHLzlTJnql0' -O glove.6B.50d.txt.tar")
    print("Download Completed !!")
if not os.path.isfile("glove.6B.50d.txt") :
    os.system("tar -xvzf glove.6B.50d.txt.tar")

Downloading GloVe Embedding File ...
Download Completed !!


#### Load and Parse the Wikipedia XML File <a id="LoadandParse"></a>

In [5]:
def Parse_Wikipedia_XML(wikipedia_xml, Max_Pages=None):
    
    '''
    The input is a decompressed Wikipedia XML dump (whole or partial). 
    The compressed file can be downloaded from https://dumps.wikimedia.org/enwiki/ 
    To decompress, use : bzip2 -d <bz2_compressed_xml_file_path>
    
    Parse the XML file from its root. This function walks the children of the root and their sub-elements. 
    Every child whose tag contains 'page' is a single Wikipedia page. 
    The XML file consists of many such pages. 
    Each page contains a title, text, and other metadata items. 
    We build a dictionary with the title as key and the text as value.
    '''
    
    tStart = time.perf_counter()
    tree = ET.parse(wikipedia_xml)
    print(f'Wikipedia XML file Load completed in  : {(time.perf_counter() - tStart):.6f} sec')
    tStart = time.perf_counter()
    root = tree.getroot()
    dictionary = {}
    if Max_Pages is not None:
        root = root[0:Max_Pages]
    for child in root :
        if "page" in child.tag : 
            for branch in child:
                if "title" in branch.tag:
                    if branch.text.isupper() :
                        title = branch.text
                    else : 
                        title_list = re.findall('[A-Z][a-z]*', branch.text)
                        title = " ".join(title_list)
                    dictionary[title] = ""
                if "redirect" in branch.tag :
                    dictionary.pop(title)
                if "revision" in branch.tag : 
                    for chunk in branch:
                        if "text" in chunk.tag:
                            number = 0
                            for line in chunk.text.split("\n") : 
                                if "|" not in line[0:5] and "{{" not in line[0:5] and "}}" not in line[0:5]:
                                    if len(line) > 100 : 
                                        number = number + 1
                                        try:
                                            dictionary[title] = dictionary[title] + " " + line
                                        except:
                                            pass
                                        if number > 5: # Stop once the sixth long line (paragraph) has been appended
                                            break
    print(f'Wikipedia XML Data Parse completed in : {(time.perf_counter() - tStart):.6f} sec')
    return dictionary
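
The substring tests such as `"page" in child.tag` are needed because ElementTree prefixes every tag with the document's XML namespace. The sketch below shows this on a hypothetical two-page miniature dump; the sample XML, the namespace version, and the helper `local_name` are assumptions for illustration and are not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; real dumps declare a versioned MediaWiki export namespace
sample_xml = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Albedo</title>
    <revision><text>Albedo is the fraction of sunlight that is reflected.</text></revision>
  </page>
  <page>
    <title>AfghanistanHistory</title>
    <redirect title="History of Afghanistan" />
    <revision><text>#REDIRECT [[History of Afghanistan]]</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(sample_xml)
tags = [child.tag for child in root]  # each tag looks like "{namespace}page"

def local_name(tag):
    """Strip the "{namespace}" prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]

pages = {}
for page in root:
    if local_name(page.tag) != "page":
        continue
    children = {local_name(c.tag): c for c in page}
    if "redirect" in children:
        continue  # redirect stubs carry no article text of their own
    text = next(c.text for c in children["revision"] if local_name(c.tag) == "text")
    pages[children["title"].text] = text
```

Comparing the local name exactly is a little stricter than a substring test, at the cost of one helper function.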

#### Helper Functions to Clean the Data <a id="DataClean"></a>
---
###### 1. Remove words that do not contribute to context 
###### 2. Remove Punctuation 
###### 3. Remove Extra Spaces <br>

In [6]:
stopwords = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",\
             "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",\
             "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",\
             "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", \
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", \
             "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", \
             "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",\
             "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",\
             "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", \
             "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

def Remove_Stopwords(text):
    return " ".join([item for item in text.split(" ") if item not in stopwords])

def Remove_Punctuation(text):
    new_text = ""
    for char in text: 
        if char in string.punctuation:
            new_text = new_text + " "
        else : 
            new_text = new_text + char
    return new_text  

def Remove_Extraspaces(text):
    return " ".join([ item for item in text.split() if item])


#### Function to Apply the Cleaning Methods to the Data


In [7]:
def Apply_Cleaning(data_frame):
    
    tStart = time.perf_counter()
    data_frame["Lower"] = data_frame["Text"].apply(lambda x: x.lower())
    data_frame["RemovePunctuation"] = data_frame["Lower"].apply(lambda x: Remove_Punctuation(x))
    data_frame["RemoveExtraSpaces"] = data_frame["RemovePunctuation"].apply(lambda x: Remove_Extraspaces(x))
    data_frame["RemoveStopWords"] = data_frame["RemoveExtraSpaces"].apply(lambda x: Remove_Stopwords(x))  
    print(f'Data Cleaning Completed in {(time.perf_counter() - tStart):.6f} sec')
    return data_frame
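
To see what the cleaning steps do to a single string, here is a compact, self-contained sketch of the same sequence (lowercase, punctuation to spaces, collapse whitespace, drop stopwords). It uses `str.translate` rather than the character loop above, and a tiny illustrative stopword set; `clean` and `sample_stopwords` are names invented for this example.

```python
import string

# Tiny illustrative subset; the notebook uses the full stopword list defined earlier
sample_stopwords = {"is", "a", "the", "of", "and"}

def clean(text, stopwords=sample_stopwords):
    lowered = text.lower()
    # Replace every ASCII punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = lowered.translate(table).split()  # split() also collapses extra whitespace
    return " ".join(tok for tok in tokens if tok not in stopwords)

cleaned = clean("'''Anarchism''' is a [[political philosophy]] and movement.")
# cleaned == "anarchism political philosophy movement"
```

Note that `string.punctuation` covers ASCII only, so characters such as IPA symbols or en dashes survive cleaning, as the captured outputs in this notebook show.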

#### Load the GloVe File & Create an Accessible Lookup Dictionary <a id="GloVe"></a>


In [8]:
def Load_Glove(glove_file_path):
    
    '''
    The GloVe file contains embeddings for 400,000 words.
    In each line, the first item is the word itself, 
    and the remaining space-separated values are its 50 embedding values.  
    Here we create a dictionary with each word as key and its 50-value embedding as value. 
    '''
    
    tStart = time.perf_counter()
    glove_dict = {}  # Built locally so the function does not rely on a global being defined first
    with open(glove_file_path, encoding="utf8") as glove_file:
        for line in glove_file:
            line_list = line.rstrip("\n").split(" ")
            # First token is the word; the remaining tokens are its embedding values
            float_vector = list(np.array(line_list[1:], dtype=np.float32))
            glove_dict[line_list[0]] = float_vector
    print(f'Loading GloVe Embedding File completed in : {(time.perf_counter() - tStart):.6f} sec')
    print(f'Number of words in vocabulary, having GloVe representation : {len(glove_dict)}')
    return glove_dict 
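
The file layout described in the docstring can be illustrated on an in-memory sample. The two lines below are hypothetical 3-dimensional entries (the real file has 50 values per word), and `embeddings` is a name chosen for this sketch.

```python
import io
import numpy as np

# Hypothetical entries in the GloVe text layout: a word followed by its float values
sample_file = io.StringIO("the 0.418 0.24968 -0.41242\ncat 0.45281 -0.50108 -0.53714\n")

embeddings = {}
for line in sample_file:
    word, *values = line.split()  # first token is the word, the rest are numbers
    embeddings[word] = np.asarray(values, dtype=np.float32)
```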

#### Map GloVe Vector for each Word in Sentence

In [9]:
def Glove_Formatting(x):
    
    '''
    Given a tokenized sentence, map each word to its 50-dimensional embedding. 
    If a word is not found in the GloVe dictionary, use a 50-dimensional vector of zeros. 
    '''
    
    glove_vector = []
    for item in x:
        try:
            glove_vector.append(glove_dict[item])
        except KeyError:
            glove_vector.append([0.0 for i in range(50)])
    return np.array(glove_vector)

#### Map Embeddings of each Word and find the Average of Embeddings for the Sentence

In [10]:
def Map_GloVe_Embeddings(data_frame):
    
    tStart = time.perf_counter()
    data_frame["Embeddings"] = data_frame["RemoveStopWords"].apply(lambda x : Glove_Formatting(x.split(" ")) )
    data_frame["EmbeddingsAverage"] = data_frame["Embeddings"].apply(lambda x : np.sum(x, axis=0)/len(x))
    print(f'Embeddings Mapping for the entire Wiki data, Completed in {(time.perf_counter() - tStart):.6f} sec')
    return data_frame
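
A toy example may help show what the averaging step produces, and what happens when a word has no embedding. The 3-dimensional vectors and the names `toy_glove` and `average_embedding` are hypothetical and only mirror the logic above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only
toy_glove = {
    "naval": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "ship": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

def average_embedding(tokens, table, dim=3):
    # Unknown words contribute a zero vector, as in Glove_Formatting
    vectors = [table.get(tok, np.zeros(dim, dtype=np.float32)) for tok in tokens]
    return np.sum(vectors, axis=0) / len(vectors)

known_only = average_embedding(["naval", "ship"], toy_glove)            # [2, 1, 1]
with_unknown = average_embedding(["naval", "ship", "zzzz"], toy_glove)  # [4/3, 2/3, 2/3]
```

The unknown word shrinks the average but leaves its direction unchanged, so the cosine similarity against any other vector is unaffected.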

#### Provide Wikipedia XML File Path & Call Parsing Function

In [11]:
wikipedia_xml = "enwiki-latest-pages-articles-multistream1.xml-p1p41242"
dictionary = Parse_Wikipedia_XML(wikipedia_xml)

Wikipedia XML file Load completed in  : 8.053361 sec
Wikipedia XML Data Parse completed in : 2.622177 sec


#### Load GloVe File & Create GloVe Dictionary

In [12]:
glove_file_path = "glove.6B.50d.txt"
glove_dict = {}
glove_dict = Load_Glove(glove_file_path)

Loading GloVe Embedding File completed in : 6.805943 sec
Number of words in vocabulary, having GloVe representation : 400000


#### Create a Pandas Data Frame from the XML with Title and Text as Columns; each Row is one Wiki Page

In [13]:
key_list = []
value_list = []
data_frame = pd.DataFrame()
for key, value in dictionary.items():
    key_list.append(key) 
    value_list.append(value)
data_frame["Title"] = key_list
data_frame["Text"]  = value_list
data_frame.head()

Unnamed: 0,Title,Text
0,Anarchism,'''Anarchism''' is a [[political philosophy]]...
1,Autism,'''Autism''' is a [[developmental disorder]] ...
2,Albedo,[[File:Albedo-e hg.svg|thumb|upright=1.3|The ...
3,Alabama,"'''Alabama''' ({{IPAc-en|,|æ|l|ə|'|b|æ|m|ə|}}..."
4,Achilles,[[File:Achilles fighting against Memnon Leide...


#### Apply Cleaning for Dataframe & derive Embeddings Average for Wiki Pages 

In [14]:
data_frame = Apply_Cleaning(data_frame)
data_frame = Map_GloVe_Embeddings(data_frame)

data_frame.head()

Data Cleaning Completed in 17.073280 sec
Embeddings Mapping for the entire Wiki data, Completed in 39.806676 sec


Unnamed: 0,Title,Text,Lower,RemovePunctuation,RemoveExtraSpaces,RemoveStopWords,Embeddings,EmbeddingsAverage
0,Anarchism,'''Anarchism''' is a [[political philosophy]]...,'''anarchism''' is a [[political philosophy]]...,anarchism is a political philosophy ...,anarchism is a political philosophy and politi...,anarchism political philosophy political movem...,"[[-0.5725899934768677, -0.22443999350070953, -...","[0.012981447413275738, 0.024957372031251243, -..."
1,Autism,'''Autism''' is a [[developmental disorder]] ...,'''autism''' is a [[developmental disorder]] ...,autism is a developmental disorder ...,autism is a developmental disorder characteriz...,autism developmental disorder characterized di...,"[[1.3366999626159668, 0.4839800000190735, -0.0...","[0.09803162424081913, 0.2675378602812644, 0.12..."
2,Albedo,[[File:Albedo-e hg.svg|thumb|upright=1.3|The ...,[[file:albedo-e hg.svg|thumb|upright=1.3|the ...,file albedo e hg svg thumb upright 1 3 the ...,file albedo e hg svg thumb upright 1 3 the per...,file albedo e hg svg thumb upright 1 3 percent...,"[[0.2034599930047989, -0.3614400029182434, 1.0...","[0.10138701840600607, 0.4683248054349573, 0.24..."
3,Alabama,"'''Alabama''' ({{IPAc-en|,|æ|l|ə|'|b|æ|m|ə|}}...","'''alabama''' ({{ipac-en|,|æ|l|ə|'|b|æ|m|ə|}}...",alabama ipac en æ l ə b æ m ə ...,alabama ipac en æ l ə b æ m ə is a state in th...,alabama ipac en æ l ə b æ m ə state southeaste...,"[[-1.351199984550476, -0.25477999448776245, 0....","[-0.10759588041018844, 0.23540118394727452, 0...."
4,Achilles,[[File:Achilles fighting against Memnon Leide...,[[file:achilles fighting against memnon leide...,file achilles fighting against memnon leide...,file achilles fighting against memnon leiden r...,file achilles fighting memnon leiden rijksmuse...,"[[0.2034599930047989, -0.3614400029182434, 1.0...","[0.06961334054730832, 0.2537376214703545, -0.0..."


#### Assertion Check for the Size of the Final Average Embedding Dimension <br>

In [15]:
for i in range(len(data_frame)):
    assert data_frame["EmbeddingsAverage"][i].shape[0] ==  50

#### Set the Embedding Vector Size, Population Length & Bytes per Value to Configure the Load for the FPGA  <a id="ConfigureDevice"></a>

In [16]:
VectorLength = 50
NumVectors = len(data_frame)
Bytes_Per_value = 4
NumDevices = 1

####  Configure Population Load in FPGA 

In [17]:
opt = xcs.options()
opt.vecLength = VectorLength
opt.numDevices = NumDevices

#### The U50 has 8GB of HBM Memory. Check that the given Data Load does not exceed the Limit 

In [18]:
assert VectorLength * NumVectors * Bytes_Per_value  <  NumDevices * 8 * 2**30, "Each U50 has 8GB of HBM memory. Cannot load the given amount of data into device memory"
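
The arithmetic behind this check is simple enough to spell out. The sketch below assumes, as the assertion does, that the full 8 GiB is available to the population; the library may reserve part of it, which the notebook does not state. `population_bytes` is a hypothetical helper.

```python
def population_bytes(num_vectors, vector_length=50, bytes_per_value=4):
    """Raw size of the population: one 4-byte value per vector element."""
    return num_vectors * vector_length * bytes_per_value

hbm_bytes = 8 * 2**30                               # 8 GiB on one U50
max_vectors = hbm_bytes // population_bytes(1)      # each vector takes 200 bytes
```

With 50 values of 4 bytes each, a single vector occupies 200 bytes, so under this assumption one device would hold a little under 43 million vectors.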

#### Load the Population Embeddings into U50 HBM Memory

In [19]:
cs = xcs.cosinesim(opt, Bytes_Per_value)
cs.startLoadPopulation(NumVectors)
for vecNum in range(NumVectors):
    vecBuf = cs.getPopulationVectorBuffer(vecNum)

    valVec = []
    for vecIdx in range(VectorLength):
        valVec.append((int(data_frame["EmbeddingsAverage"][vecNum][vecIdx]*1000)))  # Scale by 1000, then truncate the float32 value to an int
    vecBuf.append(valVec)
    cs.finishCurrentPopulationVector(vecBuf)

cs.finishLoadPopulation()
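
The loop above scales each float by 1000 and truncates it to an integer before loading. The following NumPy sketch, which does not touch the Xilinx API, estimates how much this conversion changes a cosine score. The random test vectors, their scale of 0.3, and the helper names `cosine` and `to_fixed` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for two 50-dimensional average embeddings
a = rng.normal(scale=0.3, size=50).astype(np.float32)
b = rng.normal(scale=0.3, size=50).astype(np.float32)

def cosine(u, v):
    u = u.astype(np.float64)
    v = v.astype(np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def to_fixed(v, scale=1000):
    # astype(int32) truncates toward zero, matching int(value * 1000)
    return (v * scale).astype(np.int32)

float_score = cosine(a, b)
fixed_score = cosine(to_fixed(a), to_fixed(b))
difference = abs(float_score - fixed_score)
```

The uniform scale factor cancels out of the cosine formula, so only the truncation matters; it grows in relative terms when the embedding values are small, which is worth keeping in mind for very short or sparse pages.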

####  Find the TopK Matchings for the Given Query <a id="TopKMatchings"></a>

In [20]:
def Find_TopK_Matchings(query, topK=10):
    
    '''
    Apply the cleaning and GloVe mapping functions to the given query.
    Call the Xilinx Cosine Similarity matchTargetVector API to find the top matches in the loaded population.
    Display information about the matching pages.
    '''
    
    query_clean = Remove_Stopwords(Remove_Extraspaces(Remove_Punctuation(query.lower())))
    query_embedding = np.sum(Glove_Formatting(query_clean.split(" ")), axis=0)/len(query_clean.split(" "))*1000
    targetVec = query_embedding.astype("int32")
    
    tStart = time.perf_counter()
    result = cs.matchTargetVector(topK, targetVec)
    print(f'completed in {1000*(time.perf_counter() - tStart):.6f} msec\n')
    print("RANK  ID    Wiki Title \t\t\t\t\t\t  MESSAGE \t\t\t\t        CONFIDENCE")
    print("----|-----|-------------|" + 65 *"-" + "---------------------|---------")
    num = 0
    for item in result:
        num = num +1
        Message = data_frame["RemoveStopWords"][item.index]
        print("{:02d}".format(num) + 3*" " + "{:05d}".format(item.index) + 3*" "+ \
              '{message: <10}'.format(message=Remove_Extraspaces(data_frame["Title"][item.index][0:10])) + \
              3*" " + Message[0:35] + " ... " + Message[-45:-1] + 3* " " +\
              '{:.6f}'.format(item.similarity) )
    print(f'\nTopK Matchings completed in {1000*(time.perf_counter() - tStart):.6f} msec')
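
For cross-checking the FPGA results or timing a CPU baseline, a brute-force top-K search can be written in a few lines of NumPy. This is a hedged reference sketch, not part of the xilCosineSim library; `cpu_top_k` and the toy population are hypothetical.

```python
import numpy as np

def cpu_top_k(population, query, k):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    population = np.asarray(population, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    norms = np.linalg.norm(population, axis=1) * np.linalg.norm(query)
    # Rows with zero norm get a score of 0 instead of a division error
    scores = np.divide(population @ query, norms,
                       out=np.zeros(len(population)), where=norms > 0)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

toy_population = [[1, 0], [1, 1], [0, 1], [-1, 0]]
ranked = cpu_top_k(toy_population, [1, 0], k=2)
```

In this notebook the population could plausibly be built with `np.stack(data_frame["EmbeddingsAverage"].to_numpy())`, though small differences from the FPGA scores should be expected because of the integer conversion.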

---
#### Up to this point, everything is a One Time Load & Setup. Once all the Data is Loaded, we can run any number of Queries 
---

#### Call the TopK Matchings Function with Your Query Input
<br>

In [21]:
Find_TopK_Matchings(query="Agriculture", topK=5)

completed in 1.917142 msec

RANK  ID    Wiki Title 						  MESSAGE 				        CONFIDENCE
----|-----|-------------|--------------------------------------------------------------------------------------|---------
01   00465   Aquacultur   file world capture fisheries aquacu ... production food fish aquatic plants 1990–201   0.709072
02   16389   DTE          directorate technical education mah ... ation governance body government kerala indi   0.701151
03   04379   Economy Fr   economy french guiana tied closely  ...  miquelon economy wallis futuna wallis futun   0.694272
04   15095   Fertilizer   file lite trac spreader jpg thumb l ... bsite world data access date 7 march 2020 re   0.679945
05   12514   SADC         southern african development commun ... can development coordination conference sadc   0.672894

TopK Matchings completed in 3.379484 msec


In [22]:
Find_TopK_Matchings(query="Battle Ships", topK=10)

completed in 3.854516 msec

RANK  ID    Wiki Title 						  MESSAGE 				        CONFIDENCE
----|-----|-------------|--------------------------------------------------------------------------------------|---------
01   04009   Escort       escort carrier escort aircraft carr ... ft carriers new construction became availabl   0.869463
02   05859   Harpers Fe   harpers ferry armory second federal ... ing ship united states navy commissioned 199   0.868337
03   05837   H M S Drea   several ships one submarine royal n ... rine periscope publishing isbn 978 190438109   0.862728
04   13335   Torpedo      file ataquechocrane png thumb torpe ... ip blockading fleet form asymmetrical warfar   0.859361
05   02804   Cruiser      file uss port royal cg 73 jpg thumb ... eavy cruiser design designated cruiser kille   0.858787
06   04660   Frigate      frigate ipac en ˈ f r ɪ ɡ ə type wa ... word books london 2005 isbn 1 84415 301 0 re   0.858240
07   07278   Kriegsmari   kriegsmarine ipa de ˈkʁiːksmaˌ

In [23]:
Find_TopK_Matchings(query="Second World War", topK=20)

completed in 2.881894 msec

RANK  ID    Wiki Title 						  MESSAGE 				        CONFIDENCE
----|-----|-------------|--------------------------------------------------------------------------------------|---------
01   14000   World War    german junkers ju 87 stuka dive bom ... war crimes war crimes trials japanese leader   0.936312
02   04364   French Arm   french armed forces lang fr forces  ...  war independence italy prussia within franc   0.923572
03   13314   Twilight     twilight 2000 1984 post apocalyptic ... ow intensity conflict low intensity civil wa   0.922851
04   01880   Balkan War   balkan wars consisted two conflicts ... man army assassinated young turks due failur   0.916367
05   14012   Battle Mon   battle monte cassino also known bat ... would fall october 1943 proved far optimisti   0.912933
06   03825   Erwin Romm   johannes erwin eugen rommel 15 nove ... 1 polish italian descent sfn butler 2015 p 3   0.909934
07   06063   Foreign It   foreign relations italian repu

In [24]:
Find_TopK_Matchings(query="Drug Discovery", topK=15)

completed in 2.283171 msec

RANK  ID    Wiki Title 						  MESSAGE 				        CONFIDENCE
----|-----|-------------|--------------------------------------------------------------------------------------|---------
01   04120   Experiment   experimental cancer treatments non  ... gns optimism bias irrational belief beat odd   0.819006
02   03324   Duesberg     duesberg hypothesis claim associate ...  hypothesis basis fact ref name drugusenatur   0.793924
03   06130   I R S        leibniz institute research society  ... tional football competition australia irelan   0.786492
04   02848   Chemothera   chemotherapy often abbreviated chem ... rolong life palliative care palliate symptom   0.784181
05   01742   Geneticall   noinclude good article noinclude te ...  gm food labeled status gene edited organism   0.779460
06   02558   Carcinogen   carcinogen substance radionuclide r ... onvert less toxic carcinogen toxic carcinoge   0.776690
07   16222   Darbepoeti   darbepoetin alfa international

In [25]:
Find_TopK_Matchings(query="United Nations", topK=10)

completed in 3.069359 msec

RANK  ID    Wiki Title 						  MESSAGE 				        CONFIDENCE
----|-----|-------------|--------------------------------------------------------------------------------------|---------
01   08009   Foreign Ma   malawi former president bakili mulu ... atute international criminal court article 9   0.917747
02   07413   Foreign La   foreign relations latvia primary re ...  sweden switzerland thailand turkey venezuel   0.916239
03   12761   Foreign Tr   modern trinidad tobago maintains cl ...  represented governor general trinidad tobag   0.913249
04   06458   Foreign Ja   jamaica diplomatic relations many n ... coordinating discussions invigorating societ   0.912989
05   13585   United Nat   united nations trusteeship council  ... rust territory united states permanent membe   0.911421
06   02115   Foreign Ca   cameroon noncontentious low profile ... es well memberships francophonie commonwealt   0.907410
07   07465   Foreign Li   lithuania country south easter

### <center> End of the Notebook </center>