# An Example Using TensorFlow

1. Download the pre-trained model
2. Download the dataset and extract the embeddings
3. Create a collection in Milvus with ID, Message and Embedding fields
4. Insert the data into the collection
5. Build a vector index
6. Load the collection into memory
7. Define a test sentence, get its embedding, and find similar sentences in the dataset

The SMS Spam Collection dataset contains around 5,500 SMS messages, each labeled as either spam or ham.

To download the dataset, go to https://archive.ics.uci.edu/dataset/228/sms+spam+collection
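
Each line of the dataset file holds a label and the message text separated by a tab. The sketch below parses that layout under this assumption; the `parse_line` helper and the sample lines are made up for illustration and are not part of the notebook.

```python
def parse_line(line):
    """Split one 'label<TAB>message' line into a (label, message) pair."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message


# Hypothetical sample lines in the same layout as the dataset file
sample_lines = [
    "ham\tSee you at lunch?\n",
    "spam\tYou have won a prize! Call now.\n",
]

parsed = [parse_line(line) for line in sample_lines]
for label, message in parsed:
    print(f"{label}: {message}")
```

Splitting with a maximum of one split keeps any tab characters inside the message text intact.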

In [1]:
! pip install tensorflow tensorflow-hub numpy pymilvus
import os
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import re
from pymilvus import Collection, FieldSchema, CollectionSchema, DataType, connections, utility




Extract an embedding vector for each message in the dataset using the Universal Sentence Encoder.

We use TensorFlow with a pre-trained model that is readily available on TensorFlow Hub:

https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder 

In [2]:
# Download the model and load the model
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)

We encode the messages in this dataset and store the resulting embeddings in the Milvus database.

In [5]:
# Function to generate embeddings
def embeddings(text):
    return np.array(model(text)).flatten().tolist()

In [10]:
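# Create the embeddings for this dataset using the universal-sentence-encoder model

file_path = os.path.join('data', 'SMSSpamCollection')

# Read as UTF-8 so that characters such as '£' are decoded correctly
with open(file_path, encoding='utf-8') as file:
    lines = [line for line in file]

# Each line is "<label>\t<message>"; keep only the message text
msgs = [x.split('\t')[1].replace('\n', '') for x in lines]
# Embed one message at a time; each call returns a flat list of floats
embdngs = [embeddings([x]) for x in msgs]
# Sequential primary keys starting at 1
indx = list(range(1, len(msgs)+1))

# Column-oriented data: one list per field (id, message, embedding)
data_to_insert = [indx, msgs, embdngs]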
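
`data_to_insert` is column-oriented: one list for the IDs, one for the messages, and one for the embeddings, and all three must line up row by row. A small sanity check before inserting could look like the sketch below. The `check_columns` helper is hypothetical, the dimension of 512 is assumed from the vector field used in this example, and dummy values stand in for real embeddings.

```python
def check_columns(ids, messages, vectors, dim=512):
    """Verify that column-oriented insert data is consistent.

    Returns the number of rows, or raises ValueError on a mismatch.
    """
    if not (len(ids) == len(messages) == len(vectors)):
        raise ValueError("all columns must have the same number of rows")
    for i, vec in enumerate(vectors):
        if len(vec) != dim:
            raise ValueError(f"row {i}: expected {dim} values, got {len(vec)}")
    return len(ids)


# Dummy data standing in for the real ids, messages and embeddings
ids = [1, 2]
messages = ["hello", "claim your prize"]
vectors = [[0.0] * 512, [0.1] * 512]

row_count = check_columns(ids, messages, vectors)
print(f"{row_count} rows ready to insert")
```

Catching a length or dimension mismatch locally tends to give a clearer error than waiting for the insert call to reject the data.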

In [11]:
# Connect to milvus database
connections.connect(
  alias="default",
  host='localhost',
  port='19530'
)

In [12]:
# Field schema (named id_field to avoid shadowing the built-in id())
id_field = FieldSchema(
  name="id",
  dtype=DataType.INT64,
  is_primary=True,
)
message = FieldSchema(
  name="message",
  dtype=DataType.VARCHAR,
  max_length=6000,
)
message_vec = FieldSchema(
  name="message_embeddings",
  dtype=DataType.FLOAT_VECTOR,
  dim=512
)
# Collection schema
collection_schema = CollectionSchema(
  fields=[id_field, message, message_vec],
  description="Spam SMS collection"
)
# Create collection
collection = Collection(
  name="Spam_Test",
  schema=collection_schema,
  using='default'
)
utility.list_collections()

['Album1', 'dynamic_schema_example', 'partition_key_collection', 'Spam_Test']

In [13]:
# Insert entities
data_insert = collection.insert(data_to_insert)

In [14]:
# Create Index
index_params = {
  "metric_type": "L2",
  "index_type": "IVF_FLAT",
  "params": {"nlist": 1024}
}

# Index on vector field; the index name is passed as a keyword argument
collection.create_index(
  field_name="message_embeddings",
  index_params=index_params,
  index_name="SMS_IVF_FLAT_TEST"
)

Status(code=0, message=)
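
An IVF_FLAT index groups the vectors into `nlist` clusters, and a search only scans the clusters closest to the query. The back-of-the-envelope sketch below estimates how much data a single query touches. It assumes evenly sized clusters, a dataset of roughly 5,500 vectors, and the `nprobe` value of 64 used in the search step of this example; the `ivf_estimate` helper is hypothetical.

```python
def ivf_estimate(num_vectors, nlist, nprobe):
    """Estimate average cluster size and vectors scanned per query.

    Assumes vectors are spread evenly across the nlist clusters.
    """
    avg_per_cluster = num_vectors / nlist
    scanned = avg_per_cluster * min(nprobe, nlist)
    return avg_per_cluster, scanned


avg, scanned = ivf_estimate(num_vectors=5500, nlist=1024, nprobe=64)
print(f"about {avg:.1f} vectors per cluster, about {scanned:.0f} scanned per query")
```

With a dataset this small, each cluster holds only a handful of vectors, so a smaller `nlist` may be worth trying; treat that as a tuning suggestion rather than a rule.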

In [15]:
# Load the collection
collection.load(replica_number=1)

Next, we define a custom test message, compute its embedding, and use that embedding to search the collection for similar messages.

In [16]:
# test message
test_message = ["claim prize"]
test_message_vector = embeddings(test_message)

In [17]:
# Vector similarity search
search_params = {"metric_type": "L2", "params": {"nprobe": 64}}

results = collection.search(
    data=[test_message_vector],
    anns_field="message_embeddings",
    param=search_params,
    limit=5,
    expr=None,
    output_fields=['message']
)

for result in results[0]:
    print(result)

id: 1119, distance: 1.1402018070220947, entity: {'message': '449050000301 You have won a £2,000 price! To claim, call 09050000301.'}
id: 577, distance: 1.1955623626708984, entity: {'message': 'You have won ?1,000 cash or a ?2,000 prize! To claim, call09050000327'}
id: 4048, distance: 1.2085381746292114, entity: {'message': 'Win a £1000 cash prize or a prize worth £5000'}
id: 651, distance: 1.23152494430542, entity: {'message': 'You have won ?1,000 cash or a ?2,000 prize! To claim, call09050000327. T&C: RSTM, SW7 3SS. 150ppm'}
id: 9, distance: 1.2500566244125366, entity: {'message': 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.'}
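
With the `L2` metric, a smaller `distance` means a closer match, which is why the hits are listed in ascending order of distance. The sketch below ranks toy vectors by hand to show the idea. It assumes that Milvus reports the squared Euclidean distance for `L2` (no square root), which is worth verifying against the Milvus documentation for your version; the `squared_l2` helper and the sample vectors are made up for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 3-dimensional vectors standing in for 512-dimensional embeddings
query = [1.0, 0.0, 0.0]
candidates = {
    "near": [0.9, 0.1, 0.0],
    "far": [0.0, 1.0, 0.0],
}

# Rank candidates from closest to farthest, as the search results are
ranked = sorted(candidates, key=lambda name: squared_l2(query, candidates[name]))
for name in ranked:
    print(name, squared_l2(query, candidates[name]))
```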


In [18]:
# Save this vector to use later
with open("test_message_vector.txt", "w") as file:
    for item in test_message_vector:
        file.write(f"{item}\n")
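
The saved file holds one float per line, so it can be read back with a few lines of Python. This is a minimal round-trip sketch using a short dummy vector and a temporary directory; the `save_vector` and `load_vector` helpers are hypothetical names, not part of the notebook.

```python
import os
import tempfile


def save_vector(path, vector):
    """Write one float per line, matching the format used above."""
    with open(path, "w") as f:
        for item in vector:
            f.write(f"{item}\n")


def load_vector(path):
    """Read the file back into a list of floats, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


# Short dummy vector standing in for the 512-dimensional test vector
vector = [0.125, -0.5, 0.03125]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_message_vector.txt")
    save_vector(path, vector)
    restored = load_vector(path)

print(restored == vector)
```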