<a href="https://colab.research.google.com/github/TutubanaS/udot_tensorflow/blob/develop/udot_scann.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>UDOT: Google ScaNN powered by NLP methods</h1>
The goal of this notebook is to create a ScaNN model that can be reached through a REST API, while providing the NLP methods (keyword extraction and sentence embeddings) alongside the model.

For more information about ScaNN, please refer to https://github.com/google-research/google-research/tree/master/scann

For more information about UDOT NLP functions, please refer to https://github.com/UniversalDot/tensorflow




Install the required packages

In [1]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann
!pip install -q sentence_transformers
!pip install -q yake
!pip install -q bentoml


In [41]:
from typing import Dict, Text

import os
import tqdm
import pprint
import tempfile
import random
import math
from datetime import datetime, date


import pandas as pd
import numpy as np

import bentoml

import tensorflow as tf
import tensorflow_hub as hub

from transformers import DistilBertTokenizer, TFDistilBertModel
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from transformers import TFAutoModel
from transformers import AutoTokenizer


import yake

from sentence_transformers import SentenceTransformer


### Import the dataset that is going to be loaded into ScaNN

---



*   Load the job descriptions and the precomputed embeddings from CSV into dataframes
*   Drop the `Unnamed: 0` index column since it has no purpose for the model





In [3]:
# Load the job descriptions to a dataframe
jobs_df = pd.read_csv("/content/drive/MyDrive/Models/Data/job_desc.csv")
jobs_df = jobs_df.drop("Unnamed: 0", axis = 1)

# Load the embeddings to a dataframe
embeddings_df = pd.read_csv("/content/drive/MyDrive/Models/Data/embeddings-msmarco-distilbert-base-v4.csv")

# Drop the axis
embeddings_df = embeddings_df.drop("Unnamed: 0", axis = 1)

#### Turn the dataframe into a tensorflow.dataset for training purposes of ScaNN
---


*   Turn the dataframe into a numpy.array with dtype = float32, the dtype this notebook uses throughout (tf.data.Dataset itself is not limited to float32)
*   Create a tensorflow.dataset from the created numpy array in batches (size of 32)



In [4]:
# Turn the created df above into a np.array
embeddings_array = np.array(embeddings_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_tf = tf.data.Dataset.from_tensor_slices(embeddings_array).batch(32)

## Create a ScaNN object for searching purposes
---


*   First create a ScaNN object from tfrs.layers.factorized_top_k with the parameters num_reordering_candidates, num_leaves, num_leaves_to_search and k. These parameters can be tuned further.
*   Index the tf.data.Dataset created from the embedding data loaded above
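
To sanity-check the approximate neighbours on a small sample, one option is to compare them with an exact brute-force search. The sketch below is plain Python and assumes dot-product scoring; `exact_top_k` and the toy vectors are hypothetical and not part of the notebook.

```python
import heapq

def exact_top_k(query, candidates, k=5):
    """Return (scores, indices) of the k candidates with the largest dot product."""
    scored = (
        (sum(q * c for q, c in zip(query, vec)), idx)
        for idx, vec in enumerate(candidates)
    )
    top = heapq.nlargest(k, scored)
    return [s for s, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings" (made up for illustration)
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
scores, indices = exact_top_k([1.0, 0.2], candidates, k=2)
```

If the indices returned by ScaNN for the same query differ a lot from this baseline, raising num_leaves_to_search or num_reordering_candidates is a reasonable first thing to try.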



In [5]:
# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 100
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_tf
    )

# Build the ScaNN

scann.build(embeddings_tf.element_spec.shape)

### Helper NLP Functions for an example input creation
---

The two functions below (text2Keywords and text2Embeddings) are meant for input creation. The embeddings dataset loaded above was already produced with these functions. To create an input from a natural-language sentence, we first extract its keywords (text2Keywords) and then compute the embedding of those keywords (text2Embeddings).
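
As a toy illustration of the first step: the helper treats the extractor output as pairs whose first element is the keyword, keeps only that text, and joins everything into one string. The pairs below are made up, and `join_keywords` is a hypothetical name.

```python
def join_keywords(pairs):
    """Keep the keyword text of each (keyword, score) pair and join with spaces."""
    return " ".join(keyword for keyword, _score in pairs)

# Made-up extractor output for demonstration
mock_pairs = [("python developer", 0.02), ("machine learning", 0.05)]
keyword_sentence = join_keywords(mock_pairs)
```

Unlike the helper in the next cell, `str.join` leaves no trailing space at the end of the string.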

In [6]:
def text2Keywords(data: list, _ngram: int = 3, _top: int = 25, _windowSize: int = 1) -> list:
    """
    :param data: List of texts. It is a 1D list where each element is a text.
    :param _ngram: Maximum number of words in a keyword phrase (n-gram size).
    :param _top: Maximum number of keyword phrases to return per text.
    :param _windowSize: Window size passed to YAKE as windowsSize.
    :return: A 1D list where each element is a string with all keywords of one text, separated by spaces.
    """

    kw_extractor = yake.KeywordExtractor(n=_ngram, top=_top, windowsSize=_windowSize)

    return_list = []
    # Accept a single string as well by wrapping it in a list
    if not isinstance(data, list):
        print("The given input is not a list, converting to list")
        data = [data]

    for text in data:
        keywords = kw_extractor.extract_keywords(text)

        keywords_list = []
        for _keywords in keywords:
            keywords_list.append(_keywords[0])

        return_list.append(keywords_list)

    for i in range(len(return_list)):
        str_tmp = ""
        for keyword in return_list[i]:
            str_tmp += str(keyword) + " "
        return_list[i] = str_tmp

    return return_list


def text2Embeddings(sentences: list, sentence_transformer: str = "msmarco-distilbert-base-v4") -> list:
    """
    :param sentences: List of texts. It is a 1D list where each element is a text.
    :param sentence_transformer: Model name of the sentence_transformer
    :return: A list with one embedding (as returned by model.encode) per input text.
    """
    model = SentenceTransformer(sentence_transformer)

    embeddings = []
    for sentence in sentences:
        embedding = model.encode(sentence)
        embeddings.append(embedding)

    return embeddings
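
text2Embeddings builds a new SentenceTransformer on every call. If the function is called repeatedly, for example once per request, the model could be cached instead. The sketch below shows the idea with `functools.lru_cache`; `make_cached_loader` and the stand-in `fake_load` are hypothetical names used only for this demonstration.

```python
from functools import lru_cache

def make_cached_loader(load_fn):
    """Wrap a loader so each model name is loaded only once per process."""
    return lru_cache(maxsize=None)(load_fn)

# Stand-in loader that records how often it is called (for demonstration only)
load_calls = []

def fake_load(name):
    load_calls.append(name)
    return {"name": name}

get_model = make_cached_loader(fake_load)
first = get_model("msmarco-distilbert-base-v4")
second = get_model("msmarco-distilbert-base-v4")
# In the notebook this could be: get_model = make_cached_loader(SentenceTransformer)
```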

## Testing the Model
---
`txt` is the user or task input for which we want to find the nearest neighbours.


1.   First turn `txt` into its keyword sentence (text2Keywords)
2.   Turn the keyword sentence into an embedding (text2Embeddings)
3.   Turn the list of embeddings into an np.array
4.   Turn the np.array into a tensor (tf.convert_to_tensor)
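
The shape produced by step 3 can be checked without loading any model. The sketch below uses a zero vector as a stand-in for the real embedding; the dimension 768 is taken from the 768 embedding columns (0 to 767) of the embeddings dataframe, and `fake_embeddings` is a hypothetical placeholder.

```python
import numpy as np

EMBEDDING_DIM = 768  # number of embedding columns (0..767) in the embeddings dataframe

# Stand-in for the list returned by text2Embeddings for a single input text
fake_embeddings = [np.zeros(EMBEDDING_DIM, dtype=np.float32)]

# One query becomes a batch of one row: shape (1, 768)
query = np.array(fake_embeddings)
```

A single text therefore yields a 2D array with one row, which is why the results are later read from `res[0]`.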


In [7]:
txt = "I am a python developer and a Machine learning"

txt_keywords = text2Keywords(txt)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)

The given input is not a list, converting to list



`score` holds the scores of the retrieved results (the closer the match, the higher the score).

`res` holds the retrieved results (just indices of jobs_df):

In [8]:
# Get the results
score, res = scann.predict(txt_embeddings_tf)

# Print the results
for i in res[0]:
  print(jobs_df.iloc[i].jobdescription, "\n")

     A Saas firm in Boston that specializes in data analytics is looking for a Full Stack-Developer, with a strong Python and Machine Learning background, for a promising position with their growing staff. In this role, the Python Developer will be responsible for working on development of new product prototypes, internal tools, and public facing data visualizations. Apply today!The Python Developer will be responsible for:Building new web and data products for market analyticsDeveloping tools used by the Analytics team to parse and understand market dataContributing to development of an efficient data infrastructure to put data models developed by the analytics team into production including the development of APIs, optimizing databases, and automating workflowsBuilding data visualizations as part of product prototypes and for public-oriented micrositesImplementing machine learning models in new products Skills:Experience using Python, SQL, and javaScriptExperience with natural langua
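
To see scores and descriptions side by side, the two outputs can be zipped together. This is a minimal sketch with made-up values standing in for `score[0]`, `res[0]` and the `jobdescription` column; `pair_results` is a hypothetical helper.

```python
def pair_results(scores, indices, descriptions, max_chars=60):
    """Return (index, score, truncated description) for each retrieved neighbour."""
    return [
        (int(idx), float(score), descriptions[int(idx)][:max_chars])
        for score, idx in zip(scores, indices)
    ]

# Made-up values for demonstration
toy_scores = [0.91, 0.85]
toy_indices = [2, 0]
toy_descriptions = ["Java developer ...", "PHP engineer ...", "Python and ML developer ..."]

rows = pair_results(toy_scores, toy_indices, toy_descriptions)
```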

## Saving the Model for TF Serving
Save the built model for later use.


1.   path: the directory the model is saved to
2.   tf.saved_model.save: saves the created ScaNN model




In [9]:
# Export the query model.

path = os.path.join("/content/drive/MyDrive/Models", "udot_scann")

# Save the index.
tf.saved_model.save(
  scann,
  path,
  options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)



In [10]:
# Load it back; can also be done in TensorFlow Serving.
loaded = tf.saved_model.load("/content/drive/MyDrive/Models/udot_scann")

In [11]:
score, res = loaded(txt_embeddings_tf)

# Print the results
for i in res.numpy():
  print(jobs_df.iloc[i].jobdescription, "\n")


1459          A Saas firm in Boston that specializes in...
3688     Our client a Quantitative Hedge Fund is seekin...
6014     One of our clients is looking for a Fullstack ...
14246    We are looking for that rare combination of ma...
7242     LAMP / PHP Software Developer / Engineer  Atla...
Name: jobdescription, dtype: object 
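
If the saved model is hosted with TensorFlow Serving, a REST predict request could be assembled roughly as below. This is a sketch under assumptions: the host, port 8501 and the model name `udot_scann` are placeholders, `build_predict_request` is a hypothetical helper, the exported model is assumed to expose a default serving signature that takes the embedding as input, and the serving binary must include the ScaNN custom ops.

```python
import json

def build_predict_request(embedding, host="localhost", port=8501, model_name="udot_scann"):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # "instances" holds one entry per query; here a single embedding vector
    body = json.dumps({"instances": [list(map(float, embedding))]})
    return url, body

url, body = build_predict_request([0.1, -0.2, 0.3])
```

The body can then be sent with any HTTP client as a POST request to `url`.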



## Expanding the Dataset
Assign new attributes next to the jobdescription. The added features are: Location, Budget, Deadline

In [164]:
# Add Location (Latitude and Longitude)
# Values are sampled over the whole globe; limiting them to the Netherlands is not implemented yet


longitude_list, latitude_list = [], []
longitude_val, latitude_val = 180, 90

for i in range(len(jobs_df)):
  long_tmp = random.randint(-longitude_val, longitude_val)
  longitude_list.append(long_tmp)
  latit_tmp = random.randint(-latitude_val, latitude_val)
  latitude_list.append(latit_tmp)

location_df = pd.DataFrame([longitude_list, latitude_list]).transpose()
location_df.columns = ["Longitude", "Latitude"]
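
To restrict the random locations to the Netherlands, one option is to sample inside a bounding box. The bounds below are rough, assumed values for the European part of the country, and `random_nl_location` is a hypothetical helper.

```python
import random

# Approximate bounding box for the European Netherlands (assumed values, in degrees)
NL_LON_RANGE = (3.3, 7.3)
NL_LAT_RANGE = (50.7, 53.6)

def random_nl_location(rng=random):
    """Draw a (longitude, latitude) pair uniformly inside the bounding box."""
    lon = rng.uniform(*NL_LON_RANGE)
    lat = rng.uniform(*NL_LAT_RANGE)
    return round(lon, 4), round(lat, 4)

rng = random.Random(0)  # seeded so the sample is reproducible
locations = [random_nl_location(rng) for _ in range(1000)]
```

A bounding box also covers some sea and parts of neighbouring countries, which is usually acceptable for synthetic test data.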

In [176]:
# Add Budget

budget_list = []
for i in range(len(jobs_df)):
  tmp = random.randint(15,1500)
  # Round down to a multiple of 50; draws below 50 therefore become 0
  sharp = tmp - (tmp % 50)
  budget_list.append(sharp)

budget_df = pd.DataFrame(budget_list)
budget_df.columns = ["Budget"]
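
Because the cell above rounds down to a multiple of 50, draws between 15 and 49 end up as a budget of 0, which later has no finite logarithm. A possible alternative, sketched here with an assumed lower bound of 50, is to draw directly from the allowed multiples. `random_budget` is a hypothetical helper, and its distribution differs slightly from the original cell.

```python
import random

def random_budget(rng=random, low=50, high=1500, step=50):
    """Draw a budget that is a multiple of `step` and never below `low`."""
    return rng.randrange(low, high + 1, step)

rng = random.Random(0)  # seeded so the sample is reproducible
budgets = [random_budget(rng) for _ in range(1000)]
```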

In [177]:
# Add deadline
deadline_list = []

for i in range(len(jobs_df)):
  tmp_month = random.randint(1, 12)
  
  if tmp_month in [1, 3, 5, 7, 8 , 10, 12]:
    day_range = 31
  elif tmp_month in [2]:
    day_range = 28
  else:
    day_range = 30

  tmp_day = random.randint(1, day_range)
  tmp_year_index = random.randint(0, 9)
  years_list = [2023] * 10

  deadline = str(tmp_day) + '/' + str(tmp_month) + '/' + str(years_list[tmp_year_index])
  deadline_list.append(deadline)

deadline_df = pd.DataFrame(deadline_list)
deadline_df.columns = ["Deadline"]
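
The month-length table above can also be delegated to the standard library. The sketch below uses `calendar.monthrange`, which accounts for leap years as well, and keeps the same `d/m/Y` string format without zero padding; `random_deadline` is a hypothetical helper.

```python
import calendar
import random
from datetime import date, datetime

def random_deadline(year=2023, rng=random):
    """Return a random valid date in `year`, formatted as d/m/Y without zero padding."""
    month = rng.randint(1, 12)
    last_day = calendar.monthrange(year, month)[1]  # number of days in that month
    day = rng.randint(1, last_day)
    d = date(year, month, day)
    return f"{d.day}/{d.month}/{d.year}"

rng = random.Random(0)  # seeded so the sample is reproducible
deadlines = [random_deadline(rng=rng) for _ in range(500)]
```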
  

In [178]:
jobs_extended = jobs_df.copy()
jobs_extended = jobs_extended.join(location_df)
jobs_extended = jobs_extended.join(budget_df)
jobs_extended = jobs_extended.join(deadline_df)
jobs_extended

| | jobdescription | Longitude | Latitude | Budget | Deadline |
| --- | --- | --- | --- | --- | --- |
| 0 | Looking for Selenium engineers...must have sol... | -56 | -49 | 250 | 15/9/2023 |
| 1 | The University of Chicago has a rapidly growin... | -48 | -80 | 450 | 10/2/2023 |
| 2 | GalaxE.SolutionsEvery day, our solutions affec... | -161 | -64 | 750 | 15/8/2023 |
| 3 | Java DeveloperFull-time/direct-hireBolingbrook... | 29 | -79 | 1000 | 28/9/2023 |
| 4 | Midtown based high tech firm has an immediate ... | 129 | 81 | 250 | 6/11/2023 |
| ... | ... | ... | ... | ... | ... |
| 21995 | Company Description We are searching for a ta... | 62 | 71 | 500 | 2/9/2023 |
| 21996 | CONTACT - priya@omegasolutioninc.com / 408-45... | 163 | 28 | 1450 | 14/9/2023 |
| 21997 | Do you take pride in your work knowing that th... | 64 | -27 | 1250 | 12/8/2023 |
| 21998 | Company Description What We Can Offer YouAs th... | -143 | 70 | 1350 | 14/5/2023 |


### On the profile side, the data is mapped as Deadline -> Availability and Budget -> Reputation

In [179]:
profile_df = jobs_extended.copy()
# A Budget of 0 makes np.log10 return -inf and emit a RuntimeWarning
profile_df["Requiered Reputation"] = 350*np.log10(profile_df["Budget"])
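
As a quick arithmetic check of this mapping, the formula can be evaluated for single budgets with the standard library. `required_reputation` is a hypothetical helper name. Note that `math.log10(0)` raises a `ValueError`, whereas `np.log10(0)` returns `-inf` with a warning.

```python
import math

def required_reputation(budget):
    """Same formula as the cell above: 350 * log10(budget)."""
    return 350 * math.log10(budget)

rep_250 = required_reputation(250)    # about 839.279
rep_1000 = required_reputation(1000)  # 350 * 3 = 1050
```

Since the logarithm compresses the range, a budget four times larger (1000 instead of 250) raises the required reputation only from about 839 to 1050.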



In [180]:
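# Fixed reference date so the example is reproducible.
# Use datetime.combine(date.today(), datetime.min.time()) to measure from the current day instead.
today = datetime.strptime("31/12/2022", '%d/%m/%Y')

# Remaining time until each deadline, in hours (whole days * 24)
remaining_time_list = []
for i in range(len(profile_df)):
  deadline = datetime.strptime(profile_df["Deadline"].iloc[i], '%d/%m/%Y')
  remaining_time_list.append((deadline - today).days*24)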

profile_df["Remaining Time"] = remaining_time_list
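
The remaining-time calculation can be checked in isolation for a single deadline. The sketch below reproduces it with the standard library; `hours_until` is a hypothetical helper, and the dates are the reference date and two deadlines from the table.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"

def hours_until(deadline, today):
    """Whole days between `today` and `deadline`, expressed in hours."""
    delta = datetime.strptime(deadline, DATE_FORMAT) - datetime.strptime(today, DATE_FORMAT)
    return delta.days * 24

remaining = hours_until("15/9/2023", "31/12/2022")  # 258 days * 24
```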

In [181]:
profile_df

| | jobdescription | Longitude | Latitude | Budget | Deadline | Requiered Reputation | Remaining Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Looking for Selenium engineers...must have sol... | -56 | -49 | 250 | 15/9/2023 | 839.279003 | 6192 |
| 1 | The University of Chicago has a rapidly growin... | -48 | -80 | 450 | 10/2/2023 | 928.624380 | 984 |
| 2 | GalaxE.SolutionsEvery day, our solutions affec... | -161 | -64 | 750 | 15/8/2023 | 1006.271442 | 5448 |
| 3 | Java DeveloperFull-time/direct-hireBolingbrook... | 29 | -79 | 1000 | 28/9/2023 | 1050.000000 | 6504 |
| 4 | Midtown based high tech firm has an immediate ... | 129 | 81 | 250 | 6/11/2023 | 839.279003 | 7440 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21995 | Company Description We are searching for a ta... | 62 | 71 | 500 | 2/9/2023 | 944.639502 | 5880 |
| 21996 | CONTACT - priya@omegasolutioninc.com / 408-45... | 163 | 28 | 1450 | 14/9/2023 | 1106.478801 | 6168 |
| 21997 | Do you take pride in your work knowing that th... | 64 | -27 | 1250 | 12/8/2023 | 1083.918505 | 5376 |
| 21998 | Company Description What We Can Offer YouAs th... | -143 | 70 | 1350 | 14/5/2023 | 1095.616819 | 3216 |


### Running an example query on profile_df

In [189]:
# Profile parameters
AVAILABILITY = 50000  # hours, compared against "Remaining Time"
REPUTATION = 600000
INTEREST = "Data Manager and SQL developer"
# LOCATION is not used yet

In [190]:
search_df = profile_df.copy()
# Filter on Reputation
search_df = search_df[search_df["Requiered Reputation"] < REPUTATION]

# Filter on Remaining Time
search_df = search_df[search_df["Remaining Time"] < AVAILABILITY]

# Filter on Location (not implemented yet)


# Filter on Interest
embeddings_search_list = search_df.reset_index()["index"].tolist()
embeddings_search_area = []
for i in embeddings_search_list:
  embeddings_search_area.append(embeddings_df.iloc[i])

embeddings_search_df = pd.DataFrame(embeddings_search_area)

print(len(embeddings_search_df))
# Turn the created df above into a np.array
embeddings_search_array = np.array(embeddings_search_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_search_tf = tf.data.Dataset.from_tensor_slices(embeddings_search_array).batch(32)


# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 10
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_search_tf
    )

# Build the ScaNN

scann.build(embeddings_search_tf.element_spec.shape)


txt_keywords = text2Keywords(INTEREST)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)


# Get the results
score, res = scann.predict(txt_embeddings_tf)

22000
The given input is not a list, converting to list
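
The location filter is left empty in the cell above. One possible way to fill it, sketched here as an assumption rather than the notebook's method, is a great-circle (haversine) distance cut-off around the profile's position. `haversine_km`, `within_radius` and the toy coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (longitude, latitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_radius(jobs, lon, lat, max_km):
    """Return the positions of jobs located within max_km of (lon, lat)."""
    return [
        i for i, (jlon, jlat) in enumerate(jobs)
        if haversine_km(lon, lat, jlon, jlat) <= max_km
    ]

# Made-up (longitude, latitude) pairs for demonstration
toy_jobs = [(4.9, 52.4), (5.1, 52.1), (-74.0, 40.7)]
nearby = within_radius(toy_jobs, lon=4.9, lat=52.4, max_km=100)
```

The returned positions could then be used to narrow search_df before the ScaNN index is rebuilt, in the same way as the reputation and remaining-time filters.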


In [187]:
# Print the results
#search_df = search_df.reset_index()

for i in list(res[0]):
  print(i, search_df["Budget"].iloc[i], search_df["Remaining Time"].iloc[i], search_df["jobdescription"].iloc[i])
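
The indices returned by a ScaNN index built on the filtered embeddings are positions within that filtered set, not row labels of jobs_df. Using `.iloc` on search_df works because both share the same order. To recover the original row labels, the positions can be looked up in the list of kept indices (`embeddings_search_list` in the notebook). The sketch below uses illustrative values, and `to_original_ids` is a hypothetical helper.

```python
def to_original_ids(positions, kept_ids):
    """Translate positions in the filtered index into row labels of the full table."""
    return [kept_ids[p] for p in positions]

# kept_ids plays the role of embeddings_search_list; the values are illustrative
kept_ids = [75, 791, 1074, 2237]
positions = [2, 0]  # stands in for res[0] from the filtered ScaNN index
original_ids = to_original_ids(positions, kept_ids)
```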


725 950 288 We are looking for Oracle PL/SQL Developer who works independently or under only general direction on complex problems which require competence in all phases of programming concepts and practices. Working from diagrams and charts which identify the nature of desired results, processing steps to be accomplished and the relationships between various steps of the problem-solving routine; plans the full range of programming actions needed to efficiently utilize the computer system in achieving desired end products. Analyze, design, develop, test, and implement distributed applications as part of a systems development team. Provide feedback on and adhere to delivery dates. Provide end-user support as a technical expert. Analyze, design, develop and test Oracle PL/SQL applications that enhance customer's operations through the dissemination of pertinent and relevant information in a timely and efficient manner.    Responsibilities: Develop and revise program code based on clearly

In [125]:
embeddings_search_df.reset_index()

| | index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 758 | 759 | 760 | 761 | 762 | 763 | 764 | 765 | 766 | 767 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 75 | -0.093673 | -0.104458 | -0.159039 | -0.401531 | -0.590320 | 0.181259 | 0.246020 | -0.151857 | -0.399698 | ... | 0.969038 | -0.115263 | 0.353825 | -0.330405 | -0.090375 | -0.000762 | 0.389528 | -0.259913 | 0.614505 | 0.375233 |
| 1 | 791 | -0.186912 | 0.543980 | 0.527288 | -0.417753 | 0.272471 | -0.080391 | -0.635116 | -0.231990 | -0.221937 | ... | 0.241050 | 0.069798 | 0.094456 | -0.068452 | 0.104520 | 0.221484 | -0.161030 | 0.217160 | 0.326267 | -0.055203 |
| 2 | 1074 | 0.257794 | 1.036860 | 1.037097 | -0.458673 | -0.093990 | 0.119062 | -1.434599 | -0.053094 | -0.162293 | ... | 0.018496 | -0.887077 | 0.264888 | -0.256127 | 0.553513 | -0.320905 | -0.735900 | 0.125063 | 0.010833 | -0.274815 |
| 3 | 2237 | 0.232777 | 0.565806 | -0.127907 | -0.477396 | -0.188041 | 0.925820 | 0.322576 | 0.011772 | -0.059456 | ... | 0.180081 | -0.648644 | -0.347376 | 0.024427 | -0.106345 | -0.167204 | -0.476882 | 0.412351 | 0.287164 | 0.217456 |
| 4 | 2836 | 0.686018 | -0.290015 | -0.090677 | 0.200527 | 0.279837 | -0.273464 | -0.871957 | -0.227879 | -0.255226 | ... | 0.684522 | 0.310224 | 0.277779 | -0.133384 | -0.364601 | -0.084516 | -0.272205 | -0.317342 | 0.093202 | -0.700192 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 60 | 21083 | -0.541942 | 0.420903 | 0.021134 | -0.111896 | -0.614373 | -0.350253 | 0.628237 | 0.012102 | -1.083844 | ... | 0.465889 | 0.207660 | -0.077400 | -0.100981 | -0.426435 | 0.261722 | 0.292192 | 0.259310 | 0.380801 | -0.420755 |
| 61 | 21482 | 0.099495 | 0.431860 | 0.252433 | 0.285679 | -0.130381 | -0.106937 | -0.383345 | -0.670608 | -0.214792 | ... | 0.581133 | 0.652204 | 0.258190 | -0.303915 | 0.048109 | 0.099006 | -0.588022 | -0.805941 | -0.136485 | 0.221726 |
| 62 | 21587 | -0.024222 | -0.000127 | 0.684660 | -0.477939 | 0.420923 | -0.381712 | -0.647077 | -0.342256 | -0.831487 | ... | -0.290259 | 0.081488 | 0.963119 | 0.185108 | -0.223313 | 0.617947 | -0.038255 | -0.100441 | 0.188845 | -0.140206 |
| 63 | 21829 | -0.512376 | 0.655734 | 0.463378 | 0.325330 | 0.564522 | -0.149710 | -1.230864 | -0.121083 | -0.178507 | ... | 0.537187 | 0.061442 | 0.175575 | -0.259259 | 0.677057 | -0.040027 | -0.411762 | -0.336087 | -0.080398 | -0.249226 |


[75,
 791,
 1074,
 2237,
 2836,
 3257,
 3289,
 3683,
 4062,
 4125,
 4218,
 4509,
 4594,
 4684,
 5224,
 5378,
 5670,
 5692,
 5935,
 6284,
 6323,
 6442,
 6559,
 6681,
 6861,
 8606,
 9114,
 9157,
 9358,
 9507,
 9567,
 9940,
 10342,
 10974,
 11012,
 11101,
 12141,
 12303,
 12569,
 12996,
 13160,
 13530,
 13676,
 13862,
 13964,
 14539,
 14813,
 15268,
 15276,
 15758,
 15915,
 16113,
 16472,
 17621,
 17990,
 18960,
 19105,
 19112,
 19131,
 20308,
 21083,
 21482,
 21587,
 21829,
 21859]