<a href="https://colab.research.google.com/github/TutubanaS/udot_tensorflow/blob/develop/udot_scann.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>UDOT: Google ScaNN powered by NLP methods</h1>
This notebook creates a ScaNN model that can be queried through a REST API, together with the NLP methods that turn text into embeddings for it.

For more information about ScaNN, please refer to https://github.com/google-research/google-research/tree/master/scann

For more information about UDOT NLP functions, please refer to https://github.com/UniversalDot/tensorflow




Install the required packages

In [1]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann
!pip install -q sentence_transformers
!pip install -q yake
!pip install -q geopy
!pip install -q annoy


In [2]:
from typing import Dict, Text

import os
from tqdm.notebook import tqdm
import pprint
import tempfile
import random
import math
from datetime import datetime, date

import plotly.express as px
import plotly.graph_objects as go


import pandas as pd
import numpy as np


import tensorflow as tf
import tensorflow_hub as hub

from transformers import DistilBertTokenizer, TFDistilBertModel
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from transformers import TFAutoModel
from transformers import AutoTokenizer

from annoy import AnnoyIndex


import yake

from sentence_transformers import SentenceTransformer


### Import the dataset that is going to be loaded into ScaNN

---



*   Load the embeddings dataset to a dataframe from csv
*   Drop the index column since it has no purpose for the model





In [3]:
# Load the job descriptions into a dataframe
jobs_df = pd.read_csv("/content/drive/MyDrive/Models/Data/job_desc.csv")
# Drop the index column that was written into the CSV
jobs_df = jobs_df.drop("Unnamed: 0", axis = 1)

# Load the precomputed embeddings into a dataframe
embeddings_df = pd.read_csv("/content/drive/MyDrive/Models/Data/embeddings-msmarco-distilbert-base-v4.csv")

# Drop the index column that was written into the CSV
embeddings_df = embeddings_df.drop("Unnamed: 0", axis = 1)

#### Turn the dataframe into a tensorflow.dataset for training purposes of ScaNN
---


*   Turn the dataframe into a numpy.array with dtype = float32, so that the indexed embeddings have the same dtype as the float32 query embeddings (pandas reads the CSV values as float64)
*   Create a tensorflow.dataset from the created numpy array in batches (size of 32)



In [4]:
# Turn the created df above into a np.array
embeddings_array = np.array(embeddings_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_tf = tf.data.Dataset.from_tensor_slices(embeddings_array).batch(32)

## Create a ScaNN object for searching purposes
---


*   First create a ScaNN object from tfrs.layers.factorized_top_k with the parameters num_reordering_candidates, num_leaves, num_leaves_to_search and k. These parameters can be tuned further
*   Index the tf.data.Dataset that was created from the embeddings loaded above



In [5]:
# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 100
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_tf
    )

# Build the ScaNN

scann.build(embeddings_tf.element_spec.shape)
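
One way to judge whether the parameters above are tuned well is to compare the approximate results against an exact search on a small sample. The sketch below is plain Python on toy vectors, not part of the original notebook, and it assumes dot-product similarity; `brute_force_top_k` and `recall_at_k` are hypothetical helper names.

```python
import heapq


def brute_force_top_k(query, candidates, k=5):
    """Return (scores, indices) of the k candidates with the largest dot product."""
    scored = [
        (sum(q * c for q, c in zip(query, candidate)), idx)
        for idx, candidate in enumerate(candidates)
    ]
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [score for score, _ in top], [idx for _, idx in top]


def recall_at_k(approx_indices, exact_indices):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_indices)
    return len(exact & set(approx_indices)) / len(exact)


# Toy 2D vectors standing in for the real embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.2]

exact_scores, exact_indices = brute_force_top_k(query, candidates, k=2)

# Pretend an approximate index returned candidates 0 and 1.
recall = recall_at_k([0, 1], exact_indices)
```

A recall close to 1.0 on sampled queries suggests the approximate settings lose little; a low recall suggests raising num_leaves_to_search or num_reordering_candidates.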

### Helper NLP Functions for an example input creation
---

The two functions below (text2Keywords and text2Embeddings) are used to create inputs. The embeddings dataset loaded earlier was produced with the same functions. To create an input from a natural-language sentence, we first extract its keywords (text2Keywords) and then compute the embedding of those keywords (text2Embeddings).

In [6]:
def text2Keywords(data: list, _ngram: int = 3, _top: int = 25, _windowSize: int = 1) -> list:
    """
    :param data: List of texts. It is a 1D list where each element is a text.
    :param _ngram: Maximum number of words in a keyword phrase.
    :param _top: How many keyword phrases to return per text.
    :param _windowSize: How many neighbouring words are compared against each word.
    :return: A 1D list where each element is a string holding all keywords of one text, separated by spaces.
    """

    kw_extractor = yake.KeywordExtractor(n=_ngram, top=_top, windowsSize=_windowSize)

    return_list = []
    if not (type(data) == list):
        print("The given input is not a list, converting to list")
        data = [data]

    for text in data:
        keywords = kw_extractor.extract_keywords(text)

        keywords_list = []
        for _keywords in keywords:
            keywords_list.append(_keywords[0])

        return_list.append(keywords_list)

    for i in range(len(return_list)):
        str_tmp = ""
        for keyword in return_list[i]:
            str_tmp += str(keyword) + " "
        return_list[i] = str_tmp

    return return_list


def text2Embeddings(sentences: list, sentence_transformer: str = "msmarco-distilbert-base-v4") -> list:
    """

    :param sentences: List of texts. It is a 1D list where each element is a text.
    :param sentence_transformer: Model name of the sentence_transformer
    :return: A list with one embedding (numpy array) per input text.
    """
    model = SentenceTransformer(sentence_transformer)

    embeddings = []
    for sentence in sentences:
        embedding = model.encode(sentence)
        embeddings.append(embedding)

    return embeddings
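
The keyword step reduces to joining YAKE-style (keyword, score) pairs into one string that is then embedded. A minimal sketch of that joining step follows; `join_keywords` is a hypothetical helper, and the pairs are made-up illustration values, not real YAKE output.

```python
def join_keywords(scored_keywords):
    """Join (keyword, score) pairs into one space-separated keyword sentence."""
    return " ".join(keyword for keyword, _score in scored_keywords)


# Made-up pairs for illustration only; real extractor output will differ.
example_pairs = [("statistics", 0.04), ("good", 0.15)]
keyword_sentence = join_keywords(example_pairs)
```

Unlike the loop in text2Keywords, this version leaves no trailing space at the end of the string.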

## Testing the Model
---
txt = the input of a user/task that we want to find neighbours for.


1.   First turn txt into its keyword sentence (text2Keywords)
2.   Turn the keyword sentence into an embedding (text2Embeddings)
3.   Turn the list of embeddings into an np.array
4.   Turn the np.array into a tensor (tf.convert_to_tensor)


In [7]:
txt = "I am very good at statistics"

txt_keywords = text2Keywords(txt)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
# Convert the embedding array into a tensor that the ScaNN layer can query
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)

The given input is not a list, converting to list



score = similarity scores of the retrieved results (the closer the match, the higher the score)

res = the retrieved results, given as row positions in jobs_df:

In [8]:
# Get the results
score, res = scann.predict(txt_embeddings_tf)

# Print the results
for i in res[0]:
  print(jobs_df.iloc[i].jobdescription, "\n")

Strong Business Analyst with good mathematical abilities. Ideally will have experience working with ratings and pricing in insurance. Actuarial environment experience a plus. Strong analytical skills and good communication skills required. Consulting opoortunity. Longterm spot. Develop strategic partnerships with the actuarial, analytics, operations and sales teams to integrate tools and models into an IT platformLead, plan and facilitate brainstorming workshopsProvide guidance to actuarial and analytical teams with regards to IT requirements / best practices for integrating models into IT platformIdentify Key Performance Indicators for the business unitsProvide support and training to users of analytical tools and modelsParticipate in designing and implementing an automated rating and pricing infrastructureCreate specification requirements for data collection and validation in support of modelsAnalyze data quality for pricing elements across underwriting workflowsAnalyze and document 

## Saving the Model for TF Serving
Save the built model for later use.


1.   path : the directory to be saved
2.   tf.saved_model.save : save the created scann model




In [9]:
# Export the query model.

path = os.path.join("/content/drive/MyDrive/Models", "udot_scann")

# Save the index.
tf.saved_model.save(
  scann,
  path,
  options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)



In [10]:
# Load it back; can also be done in TensorFlow Serving.
loaded = tf.saved_model.load("/content/drive/MyDrive/Models/udot_scann")

In [11]:
score, res = loaded(txt_embeddings_tf)

# Print the results
for i in res.numpy():
  print(jobs_df.iloc[i].jobdescription, "\n")


684      Strong Business Analyst with good mathematical...
3165     The following is a one year contract with our ...
11115    Ascent Pharma is hiring a Statistical Programm...
4376     ESSENTIAL FUNCTIONS: Demonstrate working knowl...
7061     Must be authorized to work in the U.S./ Sponso...
Name: jobdescription, dtype: object 
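
Once the saved model is hosted by TensorFlow Serving, a query could be sent to its REST predict endpoint. The sketch below only builds the URL and JSON body and does not send anything. The host, port 8501 and the model name `udot_scann` are assumptions about the deployment, and `build_predict_request` is a hypothetical helper. The serving binary would also need the ScaNN ops available, which a stock TensorFlow Serving build may not include.

```python
import json


def build_predict_request(embedding, host="localhost", port=8501,
                          model_name="udot_scann"):
    """Build the URL and JSON body for a TF Serving REST predict call.

    host, port and model_name are assumed deployment values.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    # "instances" holds one row per query; here a single embedding vector.
    body = json.dumps({"instances": [[float(x) for x in embedding]]})
    return url, body


# A short stand-in vector; a real query would be the full sentence embedding.
url, body = build_predict_request([0.1, 0.2])
```

The body can then be posted with any HTTP client; the query embedding itself still has to be computed on the client side with text2Keywords and text2Embeddings.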



## Expanding the Dataset
Add new, randomly generated attributes alongside jobdescription. The added features are: Location, Budget and Deadline.

In [12]:
# Add Location (Latitude and Longitude)

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="geoapiUdot")

longitude_list, latitude_list = [], []
longitude_val_max, longitude_val_min = 7.2971, 2.7685
latitude_val_max, latitude_val_min = 53.8061, 50.4413
extension_val = 10000


for i in tqdm(range(len(jobs_df))):

  # Assign a random point inside the bounding box, with 4 decimals of precision.
  # random.randint expects integer arguments, so the scaled bounds are rounded.
  longitude_tmp = random.randint(
    round(extension_val*longitude_val_min),
    round(extension_val*longitude_val_max))/extension_val

  latitude_tmp = random.randint(
    round(extension_val*latitude_val_min),
    round(extension_val*latitude_val_max))/extension_val

  longitude_list.append(longitude_tmp)
  latitude_list.append(latitude_tmp)

  # Reverse geocoding is left disabled:
  #location = geolocator.reverse(f"{latitude_tmp}, {longitude_tmp}")


location_df = pd.DataFrame([longitude_list, latitude_list]).transpose()
location_df.columns = ["Longitude", "Latitude"]

  0%|          | 0/22000 [00:00<?, ?it/s]

In [13]:
# Add Budget

budget_list = []
for i in range(len(jobs_df)):
  tmp = random.randint(50,1500)
  sharp = tmp - (tmp % 50)
  budget_list.append(sharp)

budget_df = pd.DataFrame(budget_list)
budget_df.columns = ["Budget"]
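
In the cell above, rounding down after drawing from 50..1500 makes the value 1500 much rarer than the others, since only a draw of exactly 1500 maps to it. If a uniform spread over the multiples of 50 is wanted, a sketch like the following could be used instead; `random_budget` is a hypothetical helper, not part of the original notebook.

```python
import random


def random_budget(low=50, high=1500, step=50):
    """Draw a budget uniformly from low, low + step, ..., high."""
    return random.randrange(low, high + 1, step)


budgets = [random_budget() for _ in range(1000)]
```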

In [14]:
# Add deadline
deadline_list = []

for i in range(len(jobs_df)):
  tmp_month = random.randint(1, 12)
  
  if tmp_month in [1, 3, 5, 7, 8 , 10, 12]:
    day_range = 31
  elif tmp_month in [2]:
    day_range = 28
  else:
    day_range = 30

  tmp_day = random.randint(1, day_range)
  tmp_year_index = random.randint(0, 9)
  years_list = [2023] * 10

  deadline = str(tmp_day) + '/' + str(tmp_month) + '/' + str(years_list[tmp_year_index])
  deadline_list.append(deadline)

deadline_df = pd.DataFrame(deadline_list)
deadline_df.columns = ["Deadline"]
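
The month-length bookkeeping above can also be delegated to the standard library. This is a sketch under the assumption that any day of 2023 is an acceptable deadline; `random_deadline` is a hypothetical helper. Note that it samples uniformly over days, whereas the cell above samples the month first.

```python
import random
from datetime import date, datetime, timedelta


def random_deadline(year=2023):
    """Return a random day of the given year as a 'day/month/year' string."""
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    day = start + timedelta(days=random.randrange(days_in_year))
    return f"{day.day}/{day.month}/{day.year}"


# The string uses the same format that is parsed later with '%d/%m/%Y'.
parsed = datetime.strptime(random_deadline(), "%d/%m/%Y")
```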
  

In [15]:
# Join the Tables
jobs_extended = jobs_df.copy()
jobs_extended = jobs_extended.join(location_df)
jobs_extended = jobs_extended.join(budget_df)
jobs_extended = jobs_extended.join(deadline_df)
jobs_extended

| | jobdescription | Longitude | Latitude | Budget | Deadline |
|---|---|---|---|---|---|
| 0 | Looking for Selenium engineers...must have sol... | 5.6869 | 51.9334 | 300 | 27/9/2023 |
| 1 | The University of Chicago has a rapidly growin... | 2.7817 | 50.5483 | 1300 | 19/9/2023 |
| 2 | GalaxE.SolutionsEvery day, our solutions affec... | 2.9957 | 53.4748 | 600 | 9/6/2023 |
| 3 | Java DeveloperFull-time/direct-hireBolingbrook... | 4.4241 | 51.4284 | 600 | 26/6/2023 |
| 4 | Midtown based high tech firm has an immediate ... | 4.2634 | 52.9686 | 300 | 31/12/2023 |
| ... | ... | ... | ... | ... | ... |
| 21995 | Company Description We are searching for a ta... | 4.1324 | 51.1184 | 1350 | 9/5/2023 |
| 21996 | CONTACT - priya@omegasolutioninc.com / 408-45... | 2.9341 | 51.2382 | 1400 | 23/1/2023 |
| 21997 | Do you take pride in your work knowing that th... | 3.6920 | 50.7641 | 1450 | 4/2/2023 |
| 21998 | Company Description What We Can Offer YouAs th... | 6.9158 | 52.8719 | 1250 | 14/2/2023 |


### On the profile side, the data is mapped as Deadline -> Availability and Budget -> Reputation

In [16]:
# Budget - Reputation Function
profile_df = jobs_extended.copy()
profile_df["Requiered Reputation"] = 350*np.log10(profile_df["Budget"])
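
The mapping above is a logarithmic scale: required reputation = 350 * log10(Budget). A small worked example in plain Python, with `required_reputation` as a hypothetical helper name:

```python
import math


def required_reputation(budget, scale=350):
    """Required reputation grows with the base-10 logarithm of the budget."""
    return scale * math.log10(budget)


low = required_reputation(50)     # smallest budget generated in this notebook
mid = required_reputation(300)
high = required_reputation(1500)  # largest budget generated in this notebook
```

With budgets between 50 and 1500, the required reputation therefore stays between roughly 595 and 1112, so a tenfold larger budget adds only 350 points.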

In [17]:
#today = datetime.combine(date.today(), datetime.min.time())
today = datetime.strptime("31/12/2022", '%d/%m/%Y')

remaining_time_list = []
for i in range(len(profile_df)):
  remaining_time_list.append(-(today - datetime.strptime(profile_df["Deadline"].iloc[i], '%d/%m/%Y')).days*24)

profile_df["Remaining Time"] = remaining_time_list
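
The loop above computes the number of whole days between the fixed reference date and each deadline, expressed in hours. The same arithmetic as a standalone sketch, with `remaining_hours` as a hypothetical helper name:

```python
from datetime import datetime


def remaining_hours(deadline, today="31/12/2022", fmt="%d/%m/%Y"):
    """Hours from `today` until `deadline`, counted in whole days."""
    delta = datetime.strptime(deadline, fmt) - datetime.strptime(today, fmt)
    return delta.days * 24


hours_first_row = remaining_hours("27/9/2023")  # deadline of row 0 in jobs_extended
```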

In [18]:
profile_df

| | jobdescription | Longitude | Latitude | Budget | Deadline | Requiered Reputation | Remaining Time |
|---|---|---|---|---|---|---|---|
| 0 | Looking for Selenium engineers...must have sol... | 5.6869 | 51.9334 | 300 | 27/9/2023 | 866.992439 | 6480 |
| 1 | The University of Chicago has a rapidly growin... | 2.7817 | 50.5483 | 1300 | 19/9/2023 | 1089.880173 | 6288 |
| 2 | GalaxE.SolutionsEvery day, our solutions affec... | 2.9957 | 53.4748 | 600 | 9/6/2023 | 972.352938 | 3840 |
| 3 | Java DeveloperFull-time/direct-hireBolingbrook... | 4.4241 | 51.4284 | 600 | 26/6/2023 | 972.352938 | 4248 |
| 4 | Midtown based high tech firm has an immediate ... | 4.2634 | 52.9686 | 300 | 31/12/2023 | 866.992439 | 8760 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21995 | Company Description We are searching for a ta... | 4.1324 | 51.1184 | 1350 | 9/5/2023 | 1095.616819 | 3096 |
| 21996 | CONTACT - priya@omegasolutioninc.com / 408-45... | 2.9341 | 51.2382 | 1400 | 23/1/2023 | 1101.144812 | 552 |
| 21997 | Do you take pride in your work knowing that th... | 3.6920 | 50.7641 | 1450 | 4/2/2023 | 1106.478801 | 840 |
| 21998 | Company Description What We Can Offer YouAs th... | 6.9158 | 52.8719 | 1250 | 14/2/2023 | 1083.918505 | 1080 |


### Running an example query on profile_df

In [19]:
#profile_params
AVAILABILITY = 100 #hrs a week
REPUTATION = 2000
INTEREST = "Data Manager and SQL developer"
#LOCATION

In [20]:
search_df = profile_df.copy()
# Filter on Reputation
search_df = search_df[search_df["Requiered Reputation"] < REPUTATION]

# Filter on Remaining Time
search_df = search_df[search_df["Remaining Time"] < AVAILABILITY]

# Filter on Interest: collect the embeddings of the jobs that passed the filters above
embeddings_search_list = search_df.reset_index()["index"].tolist()
embeddings_search_area = []
for i in embeddings_search_list:
  embeddings_search_area.append(embeddings_df.iloc[i])

embeddings_search_df = pd.DataFrame(embeddings_search_area)

# Turn the created df above into a np.array
embeddings_search_array = np.array(embeddings_search_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_search_tf = tf.data.Dataset.from_tensor_slices(embeddings_search_array).batch(32)


# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 10
_num_leaves_to_search = 30
_k = 3


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_search_tf
    )

# Build the ScaNN

scann.build(embeddings_search_tf.element_spec.shape)


txt_keywords = text2Keywords(INTEREST)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)


# Get the results
score, res = scann.predict(txt_embeddings_tf)
res

The given input is not a list, converting to list


array([[ 32, 111, 135]], dtype=int32)

In [21]:
# Print the results
#search_df = search_df.reset_index()

for i in list(res[0]):
  print(i, search_df["jobdescription"].iloc[i],search_df["Budget"].iloc[i], search_df["Remaining Time"].iloc[i])


32 The Senior Database Developer will be responsible for database architecture, ELT process development, Data Management policy development, support and QA activities of data management developments. The individual will be responsible for coordinating with the various IT and functional system owners to design and deliver integrated Data Management solutions that will result in operational efficiency and a high data quality standard in the provisioning and distribution of enterprise data. RESPONSBILITIESBuild/enhance data models and database architecture for Enterprise data Model.Assist in defining, developing and documenting Application model and architectureHelp define data standards and implement best practices working with the other members of the Data Management teamEvaluate, design and develop functional specifications through detailed technical requirements documentEvaluate MDM Applications, developments or models and provide recommendations for improvementsEvaluate existing data
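
The indices returned here are positions within the filtered embedding array, not the original row labels of jobs_df, which is why the loop above uses .iloc on search_df. A sketch of the mapping back to the original labels, in plain Python; `positions_to_labels` is a hypothetical helper, the label list reuses the first five index labels shown for search_df in this notebook, and the positions are toy values.

```python
def positions_to_labels(positions, index_labels):
    """Map positions in the filtered set back to the original row labels."""
    return [index_labels[p] for p in positions]


# First five index labels of search_df as displayed in this notebook.
filtered_labels = [108, 228, 630, 646, 722]

# Toy positions standing in for a ScaNN result over the filtered set.
original_labels = positions_to_labels([0, 3], filtered_labels)
```

With pandas, the same lookup would be `search_df.index[position]`, and the resulting label can be used with `jobs_df.loc`.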

In [22]:
radius = 10 #in km

In [23]:
# Visualize on Location
vector_size = 2  # Length of item vector that will be indexed

t = AnnoyIndex(vector_size, 'euclidean')
for i in range(len(location_df)):
    v = location_df.iloc[i].values.flatten().tolist()
    t.add_item(i, v)

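t.build(100) # 100 trees
# Note: the second argument is the number of neighbours to return (radius*10 = 100),
# not a distance, and 'euclidean' is measured here in degrees of longitude/latitude
closest_points = t.get_nns_by_vector([5.479431, 51.444935], radius*10)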

closest_long, closest_lat = [], []
for i in closest_points:
  closest_long.append(location_df.iloc[i].values.flatten().tolist()[0])
  closest_lat.append(location_df.iloc[i].values.flatten().tolist()[1])
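
If the goal is to keep only jobs within a real distance in kilometres, one option is to filter candidates with the haversine formula. This is a sketch, not the method used in the notebook; it assumes a spherical Earth with a radius of 6371 km, and `haversine_km` and `within_radius` are hypothetical helpers.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2, earth_radius_km=6371.0):
    """Great-circle distance in km between two (longitude, latitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def within_radius(center, points, radius_km):
    """Indices of the (longitude, latitude) points within radius_km of center."""
    return [
        i for i, (lon, lat) in enumerate(points)
        if haversine_km(center[0], center[1], lon, lat) <= radius_km
    ]


center = (5.479431, 51.444935)  # query point used in the cell above
points = [(5.479431, 51.444935), (5.479431, 52.444935)]  # toy points
nearby = within_radius(center, points, radius_km=10)
one_degree_north = haversine_km(center[0], center[1], points[1][0], points[1][1])
```

The nearest neighbours returned by Annoy could be passed through `within_radius` so that the final set respects the radius in km.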

In [24]:
# Visualize on:
#       All Jobs through their location (Yellow)
#       Jobs within the given radius (Purple)
#       Jobs that are relevant to the given reputation, availability and interests (Red)


profile_df_wo = profile_df.loc[profile_df.index.difference(search_df.index),]

fig1 = px.scatter(x = profile_df_wo["Longitude"], y= profile_df_wo["Latitude"], color_discrete_sequence=['yellow'], opacity=0.2 )
fig2 = px.scatter(x = search_df["Longitude"], y= search_df["Latitude"], color_discrete_sequence=['red'] )
fig3 = px.scatter(x = closest_long, y = closest_lat, color_discrete_sequence=['purple'], opacity = 0.5)
figC = px.scatter(x = [5.479431], y = [51.444935], color_discrete_sequence = ["blue"])

figF = go.Figure(data=fig1.data + fig2.data + fig3.data + figC.data)
figF

In [25]:
search_df

| | jobdescription | Longitude | Latitude | Budget | Deadline | Requiered Reputation | Remaining Time |
|---|---|---|---|---|---|---|---|
| 108 | Denali Advanced Integration is one of the nati... | 4.5059 | 51.7913 | 450 | 3/1/2023 | 928.624380 | 72 |
| 228 | TAD PGS INC. is currently seeking an Electrica... | 6.2774 | 51.9647 | 400 | 4/1/2023 | 910.720997 | 96 |
| 630 | SOC level Integration and functional verificat... | 3.2519 | 52.8770 | 900 | 1/1/2023 | 1033.984878 | 24 |
| 646 | Principal Software Engineer - Signal Processin... | 4.7930 | 53.2687 | 50 | 1/1/2023 | 594.639502 | 24 |
| 722 | GUI Tester/Automation Engineer - Boston,MAJob#... | 4.9773 | 51.5168 | 1200 | 4/1/2023 | 1077.713436 | 96 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 21202 | Sr. Director of Software Engineering & Technol... | 4.8052 | 51.4881 | 450 | 2/1/2023 | 928.624380 | 48 |
| 21343 | ATR International, Inc. is a leader in the sta... | 6.7667 | 52.5782 | 1250 | 1/1/2023 | 1083.918505 | 24 |
| 21347 | 1 year contract Location - San Ramon, CA Res... | 5.5767 | 50.7634 | 1350 | 4/1/2023 | 1095.616819 | 96 |
| 21358 | Position: Display Marketing ManagerLocation: S... | 3.7945 | 52.0135 | 700 | 1/1/2023 | 995.784314 | 24 |
