<a href="https://colab.research.google.com/github/UniversalDot/tensorflow/blob/master/udot_scann_0.3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>UDOT: Google ScaNN powered by NLP methods</h1>
This notebook builds a ScaNN model that can be served behind a REST API, with the NLP preprocessing methods provided alongside the model.

For more information about ScaNN, please refer to https://github.com/google-research/google-research/tree/master/scann

For more information about UDOT NLP functions, please refer to https://github.com/UniversalDot/tensorflow




Install the required packages

In [1]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann
!pip install -q sentence_transformers
!pip install -q yake
!pip install -q bentoml


In [2]:
from typing import Dict, Text

import os
import tqdm
import pprint
import tempfile
import random
import math
from datetime import datetime, date


import pandas as pd
import numpy as np

import bentoml

import tensorflow as tf
import tensorflow_hub as hub

from transformers import DistilBertTokenizer, TFDistilBertModel
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

from transformers import TFAutoModel
from transformers import AutoTokenizer


import yake

from sentence_transformers import SentenceTransformer


### Import the dataset that is going to be loaded into ScaNN

---



*   Load the embeddings dataset to a dataframe from csv
*   Drop the index column since it has no purpose for the model





In [3]:
# Load the job descriptions into a dataframe
jobs_df = pd.read_csv("/content/drive/MyDrive/Models/Data/job_desc.csv")
jobs_df = jobs_df.drop("Unnamed: 0", axis = 1)

# Load the embeddings to a dataframe
embeddings_df = pd.read_csv("/content/drive/MyDrive/Models/Data/embeddings-msmarco-distilbert-base-v4.csv")

# Drop the index column
embeddings_df = embeddings_df.drop("Unnamed: 0", axis = 1)

#### Turn the dataframe into a tensorflow.dataset for training purposes of ScaNN
---


*   Turn the dataframe into a numpy.array with dtype = float32 (pandas reads the CSV values as float64, while the ScaNN index works on float32 embeddings)
*   Create a tensorflow.dataset from the created numpy array in batches (size of 32)
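
As a rough picture of what batching in groups of 32 produces, here is a minimal pure-Python sketch. The helper `batch_rows` and the 70 sample rows are hypothetical and only mirror the default behaviour of `.batch(32)`, which keeps the final, smaller batch.

```python
from typing import Iterator, List, Sequence


def batch_rows(rows: Sequence[List[float]], batch_size: int = 32) -> Iterator[List[List[float]]]:
    """Yield consecutive chunks of at most batch_size rows, keeping the last partial chunk."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


# 70 hypothetical embedding rows of dimension 4
rows = [[float(i)] * 4 for i in range(70)]
batch_sizes = [len(batch) for batch in batch_rows(rows)]
print(batch_sizes)  # [32, 32, 6]
```

The batch size only affects how the rows are fed into the index, not which rows end up in it.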



In [4]:
# Turn the created df above into a np.array
embeddings_array = np.array(embeddings_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_tf = tf.data.Dataset.from_tensor_slices(embeddings_array).batch(32)

## Create a ScaNN object for searching purposes
---


*   First create a ScaNN object from tfrs.layers.factorized_top_k with the parameters num_reordering_candidates, num_leaves, num_leaves_to_search and k. These parameters can be tuned further later on
*   Load the created tf.data.Dataset based on embedding data we loaded already
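
To build some intuition for what these parameters trade off, here is a toy, pure-Python sketch of a partitioned search. It is not ScaNN's algorithm (its real partitioning and quantization are far more sophisticated); the random centroids, the rounding used as a "cheap" score, and all function names are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def coarse(v: Vector) -> Vector:
    """Crude stand-in for a compressed (quantized) vector."""
    return [round(x, 1) for x in v]


def build_leaves(vectors: List[Vector], num_leaves: int, seed: int = 0):
    """Pick random vectors as centroids and put every vector into its best-scoring leaf."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, num_leaves)
    leaves: List[List[int]] = [[] for _ in range(num_leaves)]
    for idx, vec in enumerate(vectors):
        best = max(range(num_leaves), key=lambda c: dot(vec, centroids[c]))
        leaves[best].append(idx)
    return centroids, leaves


def toy_search(query: Vector, vectors: List[Vector], centroids, leaves,
               num_leaves_to_search: int, num_reordering_candidates: int,
               k: int) -> List[Tuple[float, int]]:
    # 1) Visit only the leaves whose centroids score best for this query.
    leaf_order = sorted(range(len(centroids)),
                        key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in leaf_order[:num_leaves_to_search] for i in leaves[c]]
    # 2) Shortlist with a cheap approximate score.
    shortlist = sorted(candidates,
                       key=lambda i: dot(query, coarse(vectors[i])),
                       reverse=True)[:num_reordering_candidates]
    # 3) Re-score the shortlist exactly and keep the top k.
    exact = sorted(((dot(query, vectors[i]), i) for i in shortlist), reverse=True)
    return exact[:k]


rng = random.Random(42)
vectors = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(200)]
query = [rng.uniform(-1, 1) for _ in range(8)]
centroids, leaves = build_leaves(vectors, num_leaves=10)

fast = toy_search(query, vectors, centroids, leaves, 3, 20, 5)     # fewer leaves: faster, may miss items
full = toy_search(query, vectors, centroids, leaves, 10, 200, 5)   # all leaves, no shortlist cut
brute = sorted(((dot(query, v), i) for i, v in enumerate(vectors)), reverse=True)[:5]
print(full == brute)  # True
```

Searching every leaf and re-scoring every candidate reproduces the exact result; lowering either number saves work at the risk of missing some true neighbours.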



In [5]:
# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 100
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_tf
    )

# Build the ScaNN

scann.build(embeddings_tf.element_spec.shape)

### Helper NLP Functions for an example input creation
---

The two functions below (text2Keywords and text2Embeddings) are meant for input creation. The embeddings dataset loaded above was already produced with these functions. To create an input from a natural-language sentence, we first extract the keywords (text2Keywords) and then compute the embedding of the keywords (text2Embeddings).

In [6]:
def text2Keywords(data: list, _ngram: int = 3, _top: int = 25, _windowSize: int = 1) -> list:
    """
    :param data: List of texts. It is a 1D list where each element is a text.
    :param _ngram: Maximum number of words in a keyword phrase (n-gram size).
    :param _top: How many keyword phrases to return per text.
    :param _windowSize: YAKE co-occurrence window size, in words.
    :return: A 1D list where each element is a string holding all keywords of one text, separated by spaces.
    """

    kw_extractor = yake.KeywordExtractor(n=_ngram, top=_top, windowsSize=_windowSize)

    return_list = []
    # Accept a single string as well by wrapping it in a list
    if not isinstance(data, list):
        print("The given input is not a list, converting to list")
        data = [data]

    for text in data:
        keywords = kw_extractor.extract_keywords(text)

        keywords_list = []
        for _keywords in keywords:
            keywords_list.append(_keywords[0])

        return_list.append(keywords_list)

    for i in range(len(return_list)):
        str_tmp = ""
        for keyword in return_list[i]:
            str_tmp += str(keyword) + " "
        return_list[i] = str_tmp

    return return_list


def text2Embeddings(sentences: list, sentence_transformer: str = "msmarco-distilbert-base-v4") -> list:
    """
    :param sentences: List of texts. It is a 1D list where each element is a text.
    :param sentence_transformer: Model name of the sentence_transformer
    :return: A list with one embedding per input text.
    """
    model = SentenceTransformer(sentence_transformer)

    embeddings = []
    for sentence in sentences:
        embedding = model.encode(sentence)
        embeddings.append(embedding)

    return embeddings
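
The keyword-joining step of text2Keywords can be pictured in isolation with the minimal sketch below. It assumes keywords arrive as (keyword, score) tuples, as the function above does when it reads element 0; the sample keywords and scores are made up, and unlike the loop above this version leaves no trailing space.

```python
from typing import List, Tuple


def keywords_to_sentence(keywords: List[Tuple[str, float]]) -> str:
    """Join the keyword strings into one space-separated sentence, dropping the scores."""
    return " ".join(keyword for keyword, _score in keywords)


# Hypothetical extractor output for one text
sample = [("python developer", 0.02), ("machine learning", 0.05)]
sentence = keywords_to_sentence(sample)
print(sentence)  # python developer machine learning
```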

## Testing the Model
---
txt = the input of a user/task that we want to search neighbours for.


1.   First turn txt into its keywords sentence (text2Keywords)
2.   Turn the keyword sentence into an embedding (text2Embeddings)
3.   Turn the list of embeddings into an np.array
4.   Turn the np.array into a tensor (tf.convert_to_tensor)


In [7]:
txt = "I am a python developer and a Machine learning"

txt_keywords = text2Keywords(txt)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
# A single query does not need a tf.data.Dataset; a plain tensor is enough
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)

The given input is not a list, converting to list



score = scores of the retrieved results (the closer the match, the higher the score)

res = retrieved results (just indices of jobs_df):

In [8]:
# Get the results
score, res = scann.predict(txt_embeddings_tf)

# Print the results
for i in res[0]:
  print(jobs_df.iloc[i].jobdescription, "\n")

     A Saas firm in Boston that specializes in data analytics is looking for a Full Stack-Developer, with a strong Python and Machine Learning background, for a promising position with their growing staff. In this role, the Python Developer will be responsible for working on development of new product prototypes, internal tools, and public facing data visualizations. Apply today!The Python Developer will be responsible for:Building new web and data products for market analyticsDeveloping tools used by the Analytics team to parse and understand market dataContributing to development of an efficient data infrastructure to put data models developed by the analytics team into production including the development of APIs, optimizing databases, and automating workflowsBuilding data visualizations as part of product prototypes and for public-oriented micrositesImplementing machine learning models in new products Skills:Experience using Python, SQL, and javaScriptExperience with natural langua
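
What the (score, res) pair means can be sketched with an exact, brute-force search. This is only an illustration that assumes dot-product scoring; ScaNN approximates this kind of result much faster. The function name and the tiny vectors are hypothetical.

```python
from typing import List, Tuple


def top_k_dot(query: List[float], candidates: List[List[float]], k: int) -> Tuple[List[float], List[int]]:
    """Return the k highest dot-product scores and the row indices they belong to."""
    scored = [(sum(q * c for q, c in zip(query, cand)), idx)
              for idx, cand in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]


candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
scores, indices = top_k_dot([1.0, 0.2], candidates, k=2)
print(indices)  # [0, 2]
```

The indices play the role of res: they are row positions that can be looked up in the dataframe the embeddings came from.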

## Saving the Model for TF Serving
Save the built model for later use.


1.   path : the directory to save the model to
2.   tf.saved_model.save : saves the created ScaNN model




In [9]:
# Export the query model.

path = os.path.join("/content/drive/MyDrive/Models", "udot_scann")

# Save the index.
tf.saved_model.save(
  scann,
  path,
  options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)



In [10]:
# Load it back; can also be done in TensorFlow Serving.
loaded = tf.saved_model.load("/content/drive/MyDrive/Models/udot_scann")

In [11]:
score, res = loaded(txt_embeddings_tf)

# Print the results
for i in res.numpy():
  print(jobs_df.iloc[i].jobdescription, "\n")


1459          A Saas firm in Boston that specializes in...
3688     Our client a Quantitative Hedge Fund is seekin...
6014     One of our clients is looking for a Fullstack ...
14246    We are looking for that rare combination of ma...
7242     LAMP / PHP Software Developer / Engineer  Atla...
Name: jobdescription, dtype: object 



## Expanding the Dataset
Add new, randomly generated attributes next to jobdescription. The added features are: Location (Longitude and Latitude), Budget, Hours Needed and Deadline.

In [12]:
# Add Location (Latitude and Longitude)
# Random integer coordinates over the whole globe (not yet limited to the Netherlands)


longitude_list, latitude_list = [], []
longitude_val, latitude_val = 180, 90

for i in range(len(jobs_df)):
  long_tmp = random.randint(-longitude_val, longitude_val)
  longitude_list.append(long_tmp)
  latit_tmp = random.randint(-latitude_val, latitude_val)
  latitude_list.append(latit_tmp)

location_df = pd.DataFrame([longitude_list, latitude_list]).transpose()
location_df.columns = ["Longitude", "Latitude"]

In [13]:
# Add Budget

budget_list = []
for i in range(len(jobs_df)):
  tmp = random.randint(15,1500)
  # Round down to a multiple of 50 (values below 50 become 0)
  sharp = tmp - (tmp % 50)
  budget_list.append(sharp)

budget_df = pd.DataFrame(budget_list)
budget_df.columns = ["Budget"]

In [14]:
# Add deadline
deadline_list = []

for i in range(len(jobs_df)):
  tmp_month = random.randint(1, 12)
  
  if tmp_month in [1, 3, 5, 7, 8 , 10, 12]:
    day_range = 31
  elif tmp_month in [2]:
    day_range = 28
  else:
    day_range = 30

  tmp_day = random.randint(1, day_range)
  tmp_year_index = random.randint(0, 9)
  years_list = [2023] * 10

  deadline = str(tmp_day) + '/' + str(tmp_month) + '/' + str(years_list[tmp_year_index])
  deadline_list.append(deadline)

deadline_df = pd.DataFrame(deadline_list)
deadline_df.columns = ["Deadline"]
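
An alternative way to draw random deadlines, shown here only as a sketch, is to add a random day offset to 1 January. This avoids the month-length table, but it samples uniformly over days rather than over months as the cell above does. The function name is hypothetical; the output format matches the `day/month/year` strings used above.

```python
import random
from datetime import date, datetime, timedelta


def random_deadline(rng: random.Random, year: int = 2023) -> str:
    """Return a random calendar date within `year`, formatted as d/m/yyyy without zero padding."""
    start = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - start).days
    d = start + timedelta(days=rng.randrange(days_in_year))
    return f"{d.day}/{d.month}/{d.year}"


rng = random.Random(0)
deadlines = [random_deadline(rng) for _ in range(100)]

# Every generated string parses with the format used later in the notebook
parsed_years = {datetime.strptime(s, "%d/%m/%Y").year for s in deadlines}
print(parsed_years)  # {2023}
```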
  

In [15]:
# Add Required Hours a Week
hours_list = []

for i in range(len(jobs_df)):

  tmp_hours = random.randint(2, 120)
  hours_list.append(tmp_hours)


hours_df = pd.DataFrame(hours_list)
hours_df.columns = ["Hours Needed"]

hours_df


|       | Hours Needed |
|-------|--------------|
| 0     | 80           |
| 1     | 73           |
| 2     | 118          |
| 3     | 15           |
| 4     | 58           |
| ...   | ...          |
| 21995 | 26           |
| 21996 | 72           |
| 21997 | 58           |
| 21998 | 72           |


In [16]:
jobs_extended = jobs_df.copy()
jobs_extended = jobs_extended.join(location_df)
jobs_extended = jobs_extended.join(budget_df)
jobs_extended = jobs_extended.join(hours_df)
jobs_extended = jobs_extended.join(deadline_df)
jobs_extended

|       | jobdescription | Longitude | Latitude | Budget | Hours Needed | Deadline |
|-------|----------------|-----------|----------|--------|--------------|----------|
| 0     | Looking for Selenium engineers...must have sol... | 142 | 43 | 250 | 80 | 28/9/2023 |
| 1     | The University of Chicago has a rapidly growin... | -81 | -72 | 400 | 73 | 25/2/2023 |
| 2     | GalaxE.SolutionsEvery day, our solutions affec... | -50 | -72 | 50 | 118 | 17/5/2023 |
| 3     | Java DeveloperFull-time/direct-hireBolingbrook... | -78 | 60 | 850 | 15 | 23/1/2023 |
| 4     | Midtown based high tech firm has an immediate ... | 28 | -20 | 600 | 58 | 24/7/2023 |
| ...   | ... | ... | ... | ... | ... | ... |
| 21995 | Company Description We are searching for a ta... | -162 | -84 | 1350 | 26 | 5/12/2023 |
| 21996 | CONTACT - priya@omegasolutioninc.com / 408-45... | -48 | -19 | 1300 | 72 | 19/1/2023 |
| 21997 | Do you take pride in your work knowing that th... | -122 | -61 | 900 | 58 | 4/2/2023 |
| 21998 | Company Description What We Can Offer YouAs th... | -140 | 38 | 800 | 72 | 16/6/2023 |


### On the profile side, the data is mapped as Deadline -> Availability and Budget -> Reputation

In [17]:
profile_df = jobs_extended.copy()
# Note: budgets that were rounded down to 0 give log10(0) = -inf, and NumPy emits a RuntimeWarning
profile_df["Requiered Reputation"] = 350*np.log10(profile_df["Budget"])

  result = getattr(ufunc, method)(*inputs, **kwargs)
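
The reputation mapping can be worked through by hand with the small sketch below. The guard that maps a budget of 0 to a reputation of 0.0 is an assumption of this sketch, not something the cell above does. With budgets capped at 1500, the value tops out at roughly 1112.

```python
import math


def required_reputation(budget: float, scale: float = 350.0) -> float:
    """scale * log10(budget); non-positive budgets map to 0.0 instead of -inf (an assumption)."""
    if budget <= 0:
        return 0.0
    return scale * math.log10(budget)


for budget in (0, 50, 100, 1000, 1500):
    print(budget, round(required_reputation(budget), 1))
```

Because the scale is logarithmic, multiplying the budget by ten always adds 350 to the required reputation.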


In [18]:
# Fixed reference date, used instead of the current date
#today = datetime.combine(date.today(), datetime.min.time())
today = datetime.strptime("31/12/2022", '%d/%m/%Y')

# Remaining time until each deadline, in hours
remaining_time_list = []
for i in range(len(profile_df)):
  deadline_dt = datetime.strptime(profile_df["Deadline"].iloc[i], '%d/%m/%Y')
  remaining_time_list.append((deadline_dt - today).days*24)

profile_df["Remaining Time"] = remaining_time_list
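
As a worked example of the remaining-time computation, the sketch below converts one deadline string into hours from the same fixed reference date. The function name is hypothetical. Since all deadlines fall in 2023, the result never exceeds 365 * 24 = 8760 hours.

```python
from datetime import datetime

REFERENCE = datetime.strptime("31/12/2022", "%d/%m/%Y")


def remaining_hours(deadline: str, today: datetime = REFERENCE) -> int:
    """Whole days between `today` and the deadline, expressed in hours."""
    return (datetime.strptime(deadline, "%d/%m/%Y") - today).days * 24


print(remaining_hours("28/9/2023"))   # 6504
print(remaining_hours("31/12/2023"))  # 8760
```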

In [19]:
profile_df.to_csv("jobsdesc_extended.csv")

### Running an example query on profile_df

In [20]:
#profile_params
AVAILABILITY = 50000 # hours; compared against "Remaining Time"
REPUTATION = 600000 # compared against "Requiered Reputation"
INTEREST = "Data Manager and SQL developer"
#LOCATION

In [21]:
search_df = profile_df.copy()
# Filter on Reputation
search_df = search_df[search_df["Requiered Reputation"] < REPUTATION]

# Filter on Remaining Time
search_df = search_df[search_df["Remaining Time"] < AVAILABILITY]

# Filter on Location


# Filter on Interest
embeddings_search_list = search_df.reset_index()["index"].tolist()
embeddings_search_area = []
for i in embeddings_search_list:
  embeddings_search_area.append(embeddings_df.iloc[i])

embeddings_search_df = pd.DataFrame(embeddings_search_area)

print(len(embeddings_search_df))
# Turn the created df above into a np.array
embeddings_search_array = np.array(embeddings_search_df, dtype = np.float32)

# Turn the created np.array above into a tf.data.Dataset
# from_tensor_slices takes an np.array and returns a tf.data.Dataset
embeddings_search_tf = tf.data.Dataset.from_tensor_slices(embeddings_search_array).batch(32)


# Param Window ------------------


_num_reordering_candidates = 500
_num_leaves = 10
_num_leaves_to_search = 30
_k = 5


# -------------------------------


# Create a ScaNN layer
scann = tfrs.layers.factorized_top_k.ScaNN(
    num_reordering_candidates =_num_reordering_candidates,
    num_leaves = _num_leaves,
    num_leaves_to_search =_num_leaves_to_search,
    k = _k
)

# Load the data into ScaNN
scann.index_from_dataset(
    embeddings_search_tf
    )

# Build the ScaNN

scann.build(embeddings_search_tf.element_spec.shape)


txt_keywords = text2Keywords(INTEREST)
txt_embedding = text2Embeddings(txt_keywords)
txt_embedding = np.array(txt_embedding)
txt_embeddings_tf = tf.convert_to_tensor(txt_embedding)


# Get the results
score, res = scann.predict(txt_embeddings_tf)

22000
The given input is not a list, converting to list


In [22]:
# Print the results
#search_df = search_df.reset_index()

for i in list(res[0]):
  print(i, search_df["Budget"].iloc[i], search_df["Remaining Time"].iloc[i], search_df["jobdescription"].iloc[i])


13559 250 2616 One of our clients located in Columbia, MD is seeking a Database Manager who will have the ability to manage a team and manage projects. (Previous management experience is not required but would be a plus.)  The Database Manager MUST have experience with Data Warehouse, Cube, and SSAS.This is a Direct Hire opportunity.  No C2C candidates. Requirements:Minimum of 8 years extensive expertise with MS SQL Server 2005, 2008, 2008 R2, and 2012Hands on experience with ETL, SSIS, SSAS, and SSRS design and developmentExperience with ETL, SSIS, SSRS, DR, and HA and demonstrated experience with SQL migrations and upgradesWindows Active Directory and SQL security integrationSQL query and index performance optimizationSQL Server Integration Services and DTSIn depth understanding of SQL Server architecture, tools and securityExperience supporting systems that have high transaction rates, and large volumes of dataExperience implementing High Availability / Disaster RecoveryHands on exp
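
One subtlety of filtering before searching is that the positions returned by the search refer to the filtered subset, not to the original table. The pure-Python sketch below makes that mapping explicit; the dictionary keys, thresholds and tiny vectors are hypothetical, and a brute-force dot product stands in for ScaNN.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def filtered_search(rows: List[Dict[str, float]], embeddings: List[List[float]],
                    query: List[float], max_reputation: float,
                    max_remaining: float, k: int) -> List[int]:
    """Filter rows first, search only the survivors, and return ORIGINAL row ids."""
    # Original positions of the rows that pass both filters, in order
    keep = [i for i, row in enumerate(rows)
            if row["reputation"] < max_reputation and row["remaining"] < max_remaining]
    # Local positions index into `keep`, just as search results index into the filtered set
    local = sorted(range(len(keep)),
                   key=lambda j: dot(query, embeddings[keep[j]]),
                   reverse=True)[:k]
    # Map local positions back to original row ids
    return [keep[j] for j in local]


rows = [
    {"reputation": 700, "remaining": 2000},
    {"reputation": 1100, "remaining": 500},   # fails the reputation filter
    {"reputation": 600, "remaining": 9000},   # fails the remaining-time filter
    {"reputation": 800, "remaining": 1000},
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.0, 1.0]]
result = filtered_search(rows, embeddings, [1.0, 0.0],
                         max_reputation=1000, max_remaining=5000, k=2)
print(result)  # [0, 3]
```

In the notebook the same effect is obtained by looking results up with `.iloc` on the filtered dataframe, which is positional as well.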

# TensorFlow Model Specifics
## Model Creation
1. The retrieval stage is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.

2. The ranking stage takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.
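
The two stages can be sketched on toy data as follows. The retrieval score (dot product), the ranking signal (a penalty for jobs that need more hours than the user has available) and all names are assumptions chosen for illustration, not the models used in this notebook.

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: List[float], items: List[Dict], n: int) -> List[int]:
    """Stage 1: cheap similarity over all items, keep the n best ids."""
    order = sorted(items, key=lambda item: dot(query, item["vec"]), reverse=True)
    return [item["id"] for item in order[:n]]


def rank(query: List[float], items: List[Dict], candidate_ids: List[int],
         available_hours: float) -> List[int]:
    """Stage 2: richer score on the few retrieved candidates only."""
    by_id = {item["id"]: item for item in items}

    def score(item_id: int) -> float:
        item = by_id[item_id]
        overtime = max(0.0, item["hours"] - available_hours)
        return dot(query, item["vec"]) - 0.01 * overtime

    return sorted(candidate_ids, key=score, reverse=True)


items = [
    {"id": 0, "vec": [1.0, 0.0], "hours": 100},
    {"id": 1, "vec": [0.9, 0.1], "hours": 20},
    {"id": 2, "vec": [0.0, 1.0], "hours": 10},
    {"id": 3, "vec": [0.8, 0.2], "hours": 40},
]
query = [1.0, 0.0]

retrieved = retrieve(query, items, n=3)
ranked = rank(query, items, retrieved, available_hours=40)
print(retrieved)  # [0, 1, 3]
print(ranked)     # [1, 3, 0]
```

Item 0 is the closest match but needs far more hours than available, so the ranking stage moves it to the end of the shortlist.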

In [23]:
def get_repo() -> None:
  """
  Clones the repository automatically
  """

  # os.system returns the exit status instead of raising, so check it explicitly
  status = os.system('git clone https://github.com/UniversalDot/tensorflow')
  if status != 0:
    print('it was not possible to get the dataset')
    return
  print('dataset downloaded')


def get_dataset() -> tf.data.Dataset:
  """
  Loads the dataset in a format readable by the create_save function (one 512-dimensional embedding per element)
  """
  df = tf.data.Dataset.load('/content/tensorflow/dataset/key_embeddings')
  df = tf.reshape(df.get_single_element(), (-1, 512))
  df = tf.data.Dataset.from_tensor_slices(df)
  return df

get_repo() #loading the github repo
df = get_dataset()

dataset downloaded


In [24]:
module_url = 'https://tfhub.dev/google/universal-sentence-encoder-large/5'
embed = hub.KerasLayer(module_url, trainable=True, name='USE_embedding')

model = tf.keras.Sequential()
model.add(embed)

model.compile()
#model.predict(["I am a machine learning engineer"])



In [25]:
scann = tfrs.layers.factorized_top_k.ScaNN(query_model= model)
scann.index_from_dataset(df.batch(512))

<tensorflow_recommenders.layers.factorized_top_k.ScaNN at 0x7fea55bcd850>

In [26]:
score, res = scann.predict(["Machine Learning Engineer"])



In [27]:
# Assumes the key_embeddings rows are in the same order as the rows of search_df
for i in list(res[0]):
  print(i, search_df["Budget"].iloc[i], search_df["Remaining Time"].iloc[i], search_df["jobdescription"].iloc[i])

8785 600 2880 Job Description:Our client is seeking a PHP Developer able to build modern web applications from the ground up.Highly skilled in PHP, MySQL, JavaScript, HTML, and CSS. Candidate will be detail-oriented and demonstrate a keen sense of aesthetics. Familiar with mainstream PHP frameworks (Laravel, CodeIgniter, Symfony, etc). Required to work independently and take the lead on small to medium projects. Excellent problem solving skills. Top notch verbal and written communication skills. Ability to install and maintain development environments for a LAMP platform.Requirements:5+ years of PHP2+ years PHP Frameworks such as Laravel, CodeIgniter, Symfony, etc2+ years Front-end technologies (AJAX, HTML, CSS, JavaScript)2+ years jQuery, Angular, Backbone, Prototype, Bootstrap or other front end frameworks2+ years MySQL experience, with strong ability create efficient, normalized database design2+ years Linux experience Good understanding of SVN Assets: Already familiar with Laravel,