<a href="https://colab.research.google.com/github/UniversalDot/tensorflow/blob/develop/model_creation/udot_scann.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>UDOT: Google ScaNN powered by NLP methods</h1>
This notebook builds a ScaNN model that can be exposed through a REST API and bundles the NLP methods used to prepare its inputs.

For more information about ScaNN, please refer to https://github.com/google-research/google-research/tree/master/scann

For more information about UDOT NLP functions, please refer to https://github.com/UniversalDot/tensorflow




Install the required packages

In [1]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann
!pip install -q sentence_transformers
!pip install -q yake
!pip install -q bentoml


In [2]:
from typing import Dict, Text

import os
import tqdm
import pprint
import tempfile
import random
import math
from datetime import datetime, date


import pandas as pd
import numpy as np

import bentoml

import tensorflow as tf
import tensorflow_hub as hub

from transformers import DistilBertTokenizer, TFDistilBertModel
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from transformers import TFAutoModel
from transformers import AutoTokenizer


import yake

from sentence_transformers import SentenceTransformer


### Import the dataset that is going to be loaded into ScaNN

---



*   Load the embeddings dataset to a dataframe from csv
*   Drop the index column since it has no purpose for the model





In [7]:
# Load the job descriptions into a dataframe
jobs_df = pd.read_csv("/content/drive/MyDrive/Work/UniversalDot/3. ProductDevelopment/ML /Datasets/job_desc.csv")
jobs_df = jobs_df.drop("Unnamed: 0", axis = 1)

# Load the embeddings to a dataframe
embeddings_df = pd.read_csv("/content/drive/MyDrive/Work/UniversalDot/3. ProductDevelopment/ML /Datasets/embeddings-msmarco-distilbert-base-v4.csv")

# Drop the axis
embeddings_df = embeddings_df.drop("Unnamed: 0", axis = 1)

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Turn the dataframe into a tensorflow.dataset for training purposes of ScaNN
---


*   Turn the dataframe into a numpy.array with dtype = float32 (pandas parses the CSV values as float64, while the sentence-transformer query embeddings are float32)
*   Create a tensorflow.dataset from the created numpy array in batches (size of 32)
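The two steps above can be sketched in plain NumPy to show what the cast and the batching do to the data. The array below is random stand-in data rather than the real embeddings, and the slicing only mimics what batching into groups of 32 produces; it is an illustration, not the tf.data implementation.

```python
import math
import numpy as np

# Stand-in for embeddings_df: 100 rows of 768-dimensional vectors (random, illustration only).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 768))   # float64 by default

# Cast to float32, as the notebook does before building the dataset.
fake_array = fake_embeddings.astype(np.float32)

# Split into groups of 32 rows, mirroring what batching with size 32 yields.
batch_size = 32
batches = [fake_array[i:i + batch_size] for i in range(0, len(fake_array), batch_size)]

print(fake_array.dtype)              # float32
print(len(batches))                  # 4
print([b.shape for b in batches])    # last batch holds the 4 leftover rows
```

The last batch is smaller than 32 whenever the row count is not a multiple of the batch size.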



In [9]:
# Turn the created df above into a np.array
embeddings_array = np.array(embeddings_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_tf = tf.data.Dataset.from_tensor_slices(embeddings_array).batch(32)

## Create a ScaNN object for searching purposes
---


*   First create a ScaNN object from tfrs.layers.factorized_top_k with the parameters num_reordering_candidates, num_leaves, num_leaves_to_search and k. These values are starting points and can be tuned further
*   Index the tf.data.Dataset created from the embeddings loaded above
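ScaNN returns an approximate top-k, so it can help to keep an exact baseline at hand when tuning the parameters. The sketch below is a brute-force search in NumPy, assuming dot-product scoring; `exact_top_k` and the toy vectors are hypothetical names and values introduced only for this illustration.

```python
import numpy as np

def exact_top_k(query, candidates, k=5):
    """Return (scores, indices) of the k candidates with the highest dot product."""
    scores = candidates @ query          # one score per candidate row
    top = np.argsort(-scores)[:k]        # indices ordered by descending score
    return scores[top], top

# Toy data: four 3-dimensional candidates and one query (illustrative values only).
candidates = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.9, 0.1, 0.0],
                       [0.0, 0.0, 1.0]], dtype=np.float32)
query = np.array([1.0, 0.0, 0.0], dtype=np.float32)

scores, indices = exact_top_k(query, candidates, k=2)
print(indices)   # [0 2]
print(scores)
```

Comparing the indices from the approximate search against such a baseline on a sample of queries gives a rough recall estimate for a given parameter setting.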



In [10]:
# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 100
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_tf
    )

# Build the ScaNN

scann.build(embeddings_tf.element_spec.shape)

### Helper NLP Functions for an example input creation
---

The two functions below (text2Keywords and text2Embeddings) create the model input. The embeddings dataset loaded above was produced with these same functions. To turn a natural-language sentence into an input, we first extract its keywords (text2Keywords) and then compute the embedding of those keywords (text2Embeddings).

In [11]:
def text2Keywords(data: list, _ngram: int = 3, _top: int = 25, _windowSize: int = 1) -> list:
    """
    :param data: List of texts. It is a 1D list where each element is a text.
    :param _ngram: Maximum number of words in an extracted keyword phrase.
    :param _top: Maximum number of keyword phrases to return per text.
    :param _windowSize: Number of neighbouring words considered when comparing words.
    :return: A 1D list where each element is a string holding all keywords of one text, separated by spaces.
    """

    kw_extractor = yake.KeywordExtractor(n=_ngram, top=_top, windowsSize=_windowSize)

    return_list = []
    if not isinstance(data, list):
        print("The given input is not a list, converting to list")
        data = [data]

    for text in data:
        keywords = kw_extractor.extract_keywords(text)

        keywords_list = []
        for _keywords in keywords:
            keywords_list.append(_keywords[0])

        return_list.append(keywords_list)

    for i in range(len(return_list)):
        str_tmp = ""
        for keyword in return_list[i]:
            str_tmp += str(keyword) + " "
        return_list[i] = str_tmp

    return return_list


def text2Embeddings(sentences: list, sentence_transformer: str = "msmarco-distilbert-base-v4") -> list:
    """

    :param sentences: List of texts. It is a 1D list where each element is a text.
    :param sentence_transformer: Model name of the sentence transformer to load.
    :return: A list with one embedding (numpy array) per input text.
    """
    model = SentenceTransformer(sentence_transformer)

    embeddings = []
    for sentence in sentences:
        embedding = model.encode(sentence)
        embeddings.append(embedding)

    return embeddings
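text2Embeddings constructs the sentence-transformer model on every call, which can be slow when the function is called repeatedly. One possible remedy is to cache the loaded model per model name, as sketched below. `FakeEncoder`, `get_encoder` and `text2embeddings_cached` are hypothetical names for this illustration; the fake class stands in for SentenceTransformer so the example runs without downloading a model.

```python
from functools import lru_cache

class FakeEncoder:
    """Hypothetical stand-in for SentenceTransformer; counts how often it is built."""
    instances = 0

    def __init__(self, name):
        FakeEncoder.instances += 1
        self.name = name

    def encode(self, sentence):
        # Dummy 2-number "embedding"; a real model returns a float vector.
        return [float(len(sentence)), float(len(sentence.split()))]

@lru_cache(maxsize=None)
def get_encoder(name):
    # Built once per model name, then reused on later calls.
    return FakeEncoder(name)

def text2embeddings_cached(sentences, model_name="msmarco-distilbert-base-v4"):
    model = get_encoder(model_name)
    return [model.encode(s) for s in sentences]

first = text2embeddings_cached(["graphic design"])
second = text2embeddings_cached(["drawing", "user interface"])
print(FakeEncoder.instances)   # 1
```

In the real function, `get_encoder` would return `SentenceTransformer(name)` instead of the stand-in.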

## Testing the Model
---
txt = the user or task input for which we want to find neighbours.


1.   First turn txt into its keyword sentence (text2Keywords)
2.   Turn the keyword sentence into an embedding (text2Embeddings)
3.   Turn the list of embeddings into an np.array
4.   Turn the np.array into a tensor (tf.convert_to_tensor)
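The shapes involved in these steps are easy to get wrong, so here is a small NumPy-only sketch of them. The zero vector and the placeholder indices are stand-ins for illustration; the 768 dimensions match the width of the embeddings table in this notebook.

```python
import numpy as np

# Stand-in for the helper output on one keyword sentence:
# a list containing a single 768-dimensional vector (zeros here, illustration only).
fake_embedding_list = [np.zeros(768, dtype=np.float32)]

query = np.array(fake_embedding_list)
print(query.shape)   # (1, 768)

# A top-k lookup on a batch of one query yields arrays of shape (1, k);
# row 0 holds the neighbours of the only query.
k = 5
fake_indices = np.arange(k).reshape(1, k)   # placeholder for returned indices
print(fake_indices[0].tolist())   # [0, 1, 2, 3, 4]
```

This is why the result cell iterates over `res[0]` rather than over `res` itself.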


In [14]:
txt = "drawing"

txt_keywords = text2Keywords(txt)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
#embeddings_tf = tf.data.Dataset.from_tensor_slices(txt_embedding)#.batch(32)
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)#.batch(32) 

The given input is not a list, converting to list


score = similarity scores of the retrieved results (the closer the match, the higher the score)

res = retrieved results (positional indices into jobs_df):

In [15]:
# Get the results
score, res = scann.predict(txt_embeddings_tf)

# Print the results
for i in res[0]:
  print(jobs_df.iloc[i].jobdescription, "\n")

Consultant required to assist with Graphic Design and User Experience for Mobile Applications on iOS platform.  Specific target platform iPad and iPhone. Job FunctionAssist with Graphic design and user experience for prototypes, enterprise applications for Mobile applicationsSpecific experience required doing graphic design and user interfaces on Apple mobile devices.Business Analysis: Gather requirements, generate use cases, activity diagrams, sequence diagrams, and business and functional requirements documents.Gather requirements for iOS based applications. Generate annotated wireframes and functional specifications.The results to be used to illicit and confirm business and functional requirements. Additional Skills:Background with React, Angular and HTML5 

Convert 17 Adobe Captivate 7 and 8 modules to Adobe Captivate 9 and publish in HTML 5.All files need to be tested for: closed captioning, keyboard navigation, button navigation (along the top and the player), links at the end of

## Saving the Model for TF Serving
Save the built model for later use.


1.   path : the directory to save the model to
2.   tf.saved_model.save : saves the created scann model




In [16]:
# Export the query model.

path = os.path.join("/content/drive/MyDrive/Models", "udot_scann")

# Save the index.
tf.saved_model.save(
  scann,
  path,
  options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)



In [17]:
# Load it back; can also be done in TensorFlow Serving.
loaded = tf.saved_model.load("/content/drive/MyDrive/Models/udot_scann")

In [18]:
score, res = loaded(txt_embeddings_tf)

# Print the results
for i in res.numpy():
  print(jobs_df.iloc[i].jobdescription, "\n")


5680     Consultant required to assist with Graphic Des...
19925    Convert 17 Adobe Captivate 7 and 8 modules to ...
6504     Role: Learning Asset Graphic DesignerLocation:...
16242    The following statements are intended to descr...
14936    Brief Description:The Sr. Graphic/Web Designer...
Name: jobdescription, dtype: object 
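Once the saved model is hosted by TensorFlow Serving, a query can be sent over its REST predict endpoint. The sketch below only builds the request and does not send it. The host, port and model name in `SERVING_URL` are assumptions that depend on how the server is started, `build_predict_request` is a hypothetical helper, and a serving binary that includes the ScaNN ops is assumed.

```python
import json
import urllib.request

# Assumed values: host, port and model name depend on the serving setup.
SERVING_URL = "http://localhost:8501/v1/models/udot_scann:predict"

def build_predict_request(embedding, url=SERVING_URL):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    payload = json.dumps({"instances": [[float(x) for x in embedding]]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_predict_request([0.1, 0.2, 0.3])
print(request.full_url)
print(json.loads(request.data))

# Sending it requires a running server:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

In practice the embedding would be the 768-dimensional vector produced by text2Embeddings, converted to a plain list.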



## Expanding the Dataset
Assign new attributes next to the jobdescription. The added features are: Location, Budget, Deadline

In [19]:
# Add Location (Latitude and Longitude)
# Note: the values below span the whole globe, not only the Netherlands


longitude_list, latitude_list = [], []
longitude_val, latitude_val = 180, 90

for i in range(len(jobs_df)):
  long_tmp = random.randint(-longitude_val, longitude_val)
  longitude_list.append(long_tmp)
  latit_tmp = random.randint(-latitude_val, latitude_val)
  latitude_list.append(latit_tmp)

location_df = pd.DataFrame([longitude_list, latitude_list]).transpose()
location_df.columns = ["Longitude", "Latitude"]
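If the locations should stay within the Netherlands, one option is to sample from a bounding box. The sketch below uses rounded, approximate bounds for the European part of the country; the constants and `random_nl_location` are assumptions made for this illustration, and a bounding box also includes some sea and neighbouring territory.

```python
import random

# Approximate bounding box of the European Netherlands (assumed, rounded values).
NL_LAT_RANGE = (50.75, 53.55)
NL_LON_RANGE = (3.36, 7.23)

def random_nl_location(rng):
    """Return a (longitude, latitude) pair drawn uniformly from the bounding box."""
    longitude = rng.uniform(*NL_LON_RANGE)
    latitude = rng.uniform(*NL_LAT_RANGE)
    return round(longitude, 4), round(latitude, 4)

rng = random.Random(42)   # seeded so the sample is reproducible
locations = [random_nl_location(rng) for _ in range(5)]
for lon, lat in locations:
    print(lon, lat)
```

Unlike the integer draws above, this produces fractional coordinates, which is needed at this scale.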

In [20]:
# Add Budget

budget_list = []
for i in range(len(jobs_df)):
  tmp = random.randint(15,1500)
  # Round down to a multiple of 50; draws below 50 become 0
  sharp = tmp - (tmp % 50)
  budget_list.append(sharp)

budget_df = pd.DataFrame(budget_list)
budget_df.columns = ["Budget"]

In [21]:
# Add deadline
deadline_list = []

for i in range(len(jobs_df)):
  tmp_month = random.randint(1, 12)
  
  if tmp_month in [1, 3, 5, 7, 8 , 10, 12]:
    day_range = 31
  elif tmp_month in [2]:
    day_range = 28
  else:
    day_range = 30

  tmp_day = random.randint(1, day_range)
  tmp_year_index = random.randint(0, 9)
  years_list = [2023] * 10

  deadline = str(tmp_day) + '/' + str(tmp_month) + '/' + str(years_list[tmp_year_index])
  deadline_list.append(deadline)

deadline_df = pd.DataFrame(deadline_list)
deadline_df.columns = ["Deadline"]
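The month and day bookkeeping above can also be left to the standard library. The sketch below picks a random day offset within the year, which is a slightly different distribution from the cell above (uniform over days rather than over months first); `random_deadline` is a hypothetical helper name.

```python
import random
from datetime import date, datetime, timedelta

def random_deadline(rng, year=2023):
    """Pick a uniformly random calendar day in `year`, formatted as day/month/year."""
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    day = start + timedelta(days=rng.randrange(days_in_year))
    # Built by hand to keep the unpadded format used above, e.g. "7/2/2023".
    return f"{day.day}/{day.month}/{day.year}"

rng = random.Random(0)   # seeded so the sample is reproducible
deadlines = [random_deadline(rng) for _ in range(3)]
print(deadlines)

# Round trip: every generated string parses with the format used later on.
parsed = [datetime.strptime(d, "%d/%m/%Y") for d in deadlines]
```

This also handles leap years without a hard-coded day count for February.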
  

In [22]:
jobs_extended = jobs_df.copy()
jobs_extended = jobs_extended.join(location_df)
jobs_extended = jobs_extended.join(budget_df)
jobs_extended = jobs_extended.join(deadline_df)
jobs_extended

Unnamed: 0,jobdescription,Longitude,Latitude,Budget,Deadline
0,Looking for Selenium engineers...must have sol...,-124,4,1350,22/12/2023
1,The University of Chicago has a rapidly growin...,120,-37,950,7/2/2023
2,"GalaxE.SolutionsEvery day, our solutions affec...",125,19,250,1/1/2023
3,Java DeveloperFull-time/direct-hireBolingbrook...,-74,-68,350,24/1/2023
4,Midtown based high tech firm has an immediate ...,-78,-62,1000,21/1/2023
...,...,...,...,...,...
21995,Company Description We are searching for a ta...,-114,-70,750,7/3/2023
21996,CONTACT - priya@omegasolutioninc.com / 408-45...,35,-44,600,29/10/2023
21997,Do you take pride in your work knowing that th...,42,86,0,23/6/2023
21998,Company Description What We Can Offer YouAs th...,97,35,1250,4/12/2023


### On the profile side, the data is mapped as Deadline -> Availability and Budget -> Reputation

In [23]:
profile_df = jobs_extended.copy()
# Budgets of 0 give log10(0) = -inf and make NumPy emit a RuntimeWarning
profile_df["Requiered Reputation"] = 350*np.log10(profile_df["Budget"])

  result = getattr(ufunc, method)(*inputs, **kwargs)
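The warning above comes from budgets of 0, for which the logarithm is undefined. A possible guard is sketched below. Mapping a zero budget to a reputation of 0 is an assumption made for this illustration, not a rule stated in the notebook, and `required_reputation` is a hypothetical helper name.

```python
import numpy as np

def required_reputation(budget):
    """350 * log10(budget), with budgets of 0 mapped to 0 instead of -inf."""
    budget = np.asarray(budget, dtype=np.float64)
    logs = np.zeros_like(budget)
    # `where` skips the zero entries, so no divide-by-zero warning is raised.
    np.log10(budget, out=logs, where=budget > 0)
    return 350 * logs

reputation = required_reputation([1000, 250, 0])
print(reputation)
```

For the non-zero budgets this reproduces the values in the table, for example 1050.0 for a budget of 1000.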


In [24]:
#today = datetime.combine(date.today(), datetime.min.time())
today = datetime.strptime("31/12/2022", '%d/%m/%Y')

remaining_time_list = []
for i in range(len(profile_df)):
  # Hours from `today` until the deadline (whole days * 24)
  remaining_time_list.append(-(today - datetime.strptime(profile_df["Deadline"].iloc[i], '%d/%m/%Y')).days*24)

profile_df["Remaining Time"] = remaining_time_list
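The remaining-time computation can be checked on single values with a small standalone function. This is a sketch using only the standard library; `remaining_hours` and `DATE_FORMAT` are hypothetical names, and the dates are taken from the first rows of the table in this notebook.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"

def remaining_hours(deadline, today):
    """Hours between `today` and `deadline`, counted as whole days times 24."""
    delta = datetime.strptime(deadline, DATE_FORMAT) - datetime.strptime(today, DATE_FORMAT)
    return delta.days * 24

print(remaining_hours("22/12/2023", "31/12/2022"))   # 8544
print(remaining_hours("7/2/2023", "31/12/2022"))     # 912
print(remaining_hours("1/1/2023", "31/12/2022"))     # 24
```

Subtracting `today` from the deadline directly avoids the extra negation used in the loop.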

In [25]:
profile_df

Unnamed: 0,jobdescription,Longitude,Latitude,Budget,Deadline,Requiered Reputation,Remaining Time
0,Looking for Selenium engineers...must have sol...,-124,4,1350,22/12/2023,1095.616819,8544
1,The University of Chicago has a rapidly growin...,120,-37,950,7/2/2023,1042.203262,912
2,"GalaxE.SolutionsEvery day, our solutions affec...",125,19,250,1/1/2023,839.279003,24
3,Java DeveloperFull-time/direct-hireBolingbrook...,-74,-68,350,24/1/2023,890.423816,576
4,Midtown based high tech firm has an immediate ...,-78,-62,1000,21/1/2023,1050.000000,504
...,...,...,...,...,...,...,...
21995,Company Description We are searching for a ta...,-114,-70,750,7/3/2023,1006.271442,1584
21996,CONTACT - priya@omegasolutioninc.com / 408-45...,35,-44,600,29/10/2023,972.352938,7248
21997,Do you take pride in your work knowing that th...,42,86,0,23/6/2023,-inf,4176
21998,Company Description What We Can Offer YouAs th...,97,35,1250,4/12/2023,1083.918505,8112


### Running an example query on profile_df

In [29]:
#profile_params
AVAILABILITY = 50000 # hours, compared against "Remaining Time"
REPUTATION = 600000
INTEREST = "drawing"
#LOCATION

In [31]:
search_df = profile_df.copy()
# Filter on Reputation
search_df = search_df[search_df["Requiered Reputation"] < REPUTATION]

# Filter on Remaining Time
search_df = search_df[search_df["Remaining Time"] < AVAILABILITY]

# Filter on Location


# Filter on Interest
embeddings_search_list = search_df.reset_index()["index"].tolist()
embeddings_search_area = []
for i in embeddings_search_list:
  embeddings_search_area.append(embeddings_df.iloc[i])

embeddings_search_df = pd.DataFrame(embeddings_search_area)

print(len(embeddings_search_df))
# Turn the created df above into a np.array
embeddings_search_array = np.array(embeddings_search_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_search_tf = tf.data.Dataset.from_tensor_slices(embeddings_search_array).batch(32)


# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 10  # NOTE: lower than _num_leaves_to_search (30); the first index above used 100
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_search_tf
    )

# Build the ScaNN

scann.build(embeddings_search_tf.element_spec.shape)


txt_keywords = text2Keywords(INTEREST)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)


# Get the results
score, res = scann.predict(txt_embeddings_tf)

22000
The given input is not a list, converting to list
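When the index is built on a filtered subset, the returned indices are positions within that subset. The notebook looks rows up with `search_df.iloc`, which is already positional within the filtered frame. If rows were instead looked up in the unfiltered table, the positions would need to be translated back, as in the NumPy sketch below. All arrays and thresholds here are toy values made up for the illustration.

```python
import numpy as np

# Toy profile table: one required-reputation and one remaining-time value per job.
required_reputation = np.array([900.0, 1100.0, 850.0, 1000.0, 700.0])
remaining_time = np.array([500, 24, 9000, 1200, 300])

REPUTATION = 1050      # illustrative thresholds, not the notebook's values
AVAILABILITY = 2000

# Combine both filters into one boolean mask.
mask = (required_reputation < REPUTATION) & (remaining_time < AVAILABILITY)

# Original row positions of the jobs that pass both filters.
kept_positions = np.flatnonzero(mask)
print(kept_positions)    # [0 3 4]

# Indices returned by a search over the subset are positions within the subset;
# translate them back before looking rows up in the full table.
subset_result = np.array([2, 0])   # placeholder for a top-k result
original_rows = kept_positions[subset_result]
print(original_rows)     # [4 0]
```

In the run above the filters kept all 22000 rows, so subset positions and original positions happen to coincide.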


In [32]:
# Print the results
#search_df = search_df.reset_index()

for i in list(res[0]):
  print(i, search_df["Budget"].iloc[i], search_df["Remaining Time"].iloc[i], search_df["jobdescription"].iloc[i])


20879 350 7488 Responsibilities Include:   *   Hands-on analysis of experimental results to draw physical conclusions   *   Development and optimization of algorithms to extract features from large (GB-TB+) data sets   *   Creation of insightful, simple graphics to represent complex trends   *   Ownership of the intellectual full stack, from open-ended data exploration to algorithm optimization and deployment  Skills:   *   MUST have scientific programming experience in Python, especially numpy / pandas / scikitlearn / matplotlib   *   Real world experience with clustering, regression and developing your own algorithms   *   Basic understanding of analog electronic circuits a strong plus   *   Experience with GPU or other high performance computing processing is a plus   *   Chemistry or biophysics background is a plus  Background:   *   PhD or equivalent experience in computer science, physics, electrical engineering, etc   *   Demonstrated potential to lead a complex, long-term data 

In [33]:
embeddings_search_df.reset_index()

Unnamed: 0,index,0,1,2,3,4,5,6,7,8,...,758,759,760,761,762,763,764,765,766,767
0,0,-0.128923,0.259915,0.511753,0.520825,-0.549695,-0.913610,0.780484,-0.991843,-1.018243,...,0.416153,-0.288344,0.329478,-0.667825,-0.157305,0.447131,-0.320265,-0.212382,1.234798,0.261012
1,1,-0.283205,0.039988,0.742433,-0.008165,0.206873,0.215959,-0.050514,0.452861,-0.538875,...,0.533794,0.658975,0.565936,-0.470865,0.330151,-0.469326,-0.081789,-0.204120,0.493776,0.203026
2,2,-0.568932,0.089363,0.706494,0.521851,0.168672,-0.016307,0.080094,-0.022547,-0.313722,...,0.859631,0.555092,0.797540,-0.696948,0.111468,0.119220,0.359304,-0.296155,0.423150,0.028656
3,3,-0.209362,-0.198573,0.673175,-0.160385,-0.748360,-0.214682,0.146603,-0.432475,0.302616,...,-0.015766,0.374512,-0.464608,-0.084709,0.315031,0.500760,-0.006033,-0.234904,0.150014,-0.373916
4,4,0.283782,-0.081082,-0.059787,-0.294946,-0.103132,0.296134,-0.563419,-0.460248,-0.481913,...,0.163403,0.838825,0.074732,-0.010965,-0.375263,-0.204015,-0.078879,0.186385,0.819544,-0.747072
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21995,21995,0.470394,0.478569,0.328696,-0.116298,-0.145526,-0.260389,0.552319,-0.973348,0.287998,...,0.617041,0.589229,0.288992,-0.348303,0.344829,-0.552856,0.039843,-0.647164,0.589345,0.172956
21996,21996,0.306475,0.501359,0.623540,-0.532646,0.012801,0.272884,-1.071700,-0.152364,-0.887616,...,0.713691,-0.729606,0.107242,0.396348,-0.104560,0.530711,-0.184785,-0.351451,0.060839,0.262252
21997,21997,0.388179,0.477288,1.054785,0.007797,-0.254003,0.075748,-0.058488,-0.537227,-0.144432,...,0.316495,0.442676,0.279777,-0.515668,-0.200802,-0.645020,-0.196196,-0.134393,0.243428,0.076101
21998,21998,-0.164381,0.117790,0.895539,-0.010977,-0.192806,0.108548,0.732970,-0.396491,-0.154540,...,0.992401,0.662473,0.487699,-0.121635,0.528992,0.457051,0.143177,-0.257758,0.560051,-0.375056
