In [2]:
 pip install "pymilvus[model]"


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.10 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
from pymilvus import model
from pymilvus import MilvusClient
import json

# This will download a small embedding model "paraphrase-albert-small-v2" (~50MB).
# https://milvus.io/docs/embeddings.md
embedding_fn = model.DefaultEmbeddingFunction()




# Database creation 

Create a database with the vector embeddings of the one-word thematic summaries of the sentence structures in options.json
We use the one word summaries rather than the sentence to avoid the variable placeholders from influencing the semantic embeddings

- One word summaries were generated via Claude-Sonnet 3.5 (In case you want to generate more sentences and one-word summaries)


In [30]:
with open('options.json') as f:
    options = json.load(f)

sentence_structures = options['sentence_structures']

# Creates a database with the vector embeddings of the one-word thematic summaries of the sentence structures in options.json
# We use the one word summaries rather than the sentence to avoid the variable placeholders from influencing the semantic embeddings
data = [
    {"id": i, 
     "vector": embedding_fn.encode_documents(sentence_structures[i]['summary_word'])[0],  # Embeds summary word into vector
     "text": sentence_structures[i]['summary_word'],
     "template": sentence_structures[i]['template'],
     "x_type": sentence_structures[i]['x_type']
     } for i in range(len(sentence_structures))
]

# all that matters for search is the "id" and "vector" fields, the rest is metadata


In [31]:
client = MilvusClient("sentence_summaries.db")
# This collection can take input with mandatory fields named "id", "vector" and
# any other fields as "dynamic schema". You can also define the schema explicitly.
client.create_collection(
    collection_name="summary_word", # This is the partition of the database, treat it like a SQL table
    dimension=768  # Dimension for vectors.
)

client.insert(collection_name="summary_word", data=data)


# Vector search

Example of vector search 

Given the user selects "Guidance, Heritage, Passionate" to describe their Shaper, we search for the most similar one-word summary in the database we created earlier

- This can be run locally as long as the database above has been generated. The embedding model is lightweight, should work on most systems.
- I've added some randomness to this by returning 3 options, then randomly selecting from those 3.

Currently main.py runs this method in the "choose_sentence_structure()" function

Alternatively, you could precompute it for each n-tuple up to the number of options you want users to select, pre-select the sentence structure. This might be more in line with the "One JSON to rule them all" approach they've mentioned. Be warned though, this will take a while.

The same strategy could also make sense for the {x} component of the "My {shaper} is {x} and that {y} my {z}" format

In [10]:

import random 

query_vectors = embedding_fn.encode_queries(["Guidance, Heritage, Passionate"])

res = client.search(
    collection_name="summary_word",  # target collection
    data=query_vectors,  # query vectors
    limit=3,  # number of returned entities
    output_fields=["text", "template", "x_type"],  # specifies fields to be returned
)
dumped_data = json.dumps(res)
parsed_data = json.loads(dumped_data)[0]
chosen_data = random.choice(parsed_data)
chosen_data['entity']['template']


"My {shaper}'s reputation as {x} in our community actively {y} my own {z}."

In [11]:
parsed_data

[{'id': 19,
  'distance': 0.31910303235054016,
  'entity': {'text': 'Instructive',
   'template': 'In the classroom of life, my {shaper}, {x} in every lesson, patiently {y} my {z}.',
   'x_type': 'x'}},
 {'id': 16,
  'distance': 0.31910303235054016,
  'entity': {'text': 'Interwoven',
   'template': "In the tapestry of my life, my {shaper}'s influence as {x} intricately {y} the threads of my {z}.",
   'x_type': 'x'}},
 {'id': 6,
  'distance': 0.31910303235054016,
  'entity': {'text': 'Influential',
   'template': "My {shaper}'s reputation as {x} in our community actively {y} my own {z}.",
   'x_type': 'x'}}]

In [85]:
chosen_data

{'id': 6,
 'distance': 0.3191031217575073,
 'entity': {'text': 'Influential',
  'template': "My {shaper}'s reputation as {x} in our community actively {y} my own {z}.",
  'x_type': 'x'}}

# Not Implemented

1. Filtering options for the sentence construction via answers about self. 
2. Viewing other similar sentences after constructing your own


# 1. Filtering options for the sentence construction via answers about self

I believe this makes this most sense when chosing options to display to users for {z}, and potentially {y} of the "My {shaper} is {x} and that {y} my {z}"

To do this, we take the answers about self, for example: "Rebel", "Fearless", "Timid", "Immigrant", and search through semantically similar options in the {z} or {y} options.



In [6]:
# Partition Setup

with open('options.json') as f:
    options = json.load(f)
    
client = MilvusClient("sentence_summaries.db")

client.create_collection(
    collection_name="z_words", # This is the partition of the database, treat it like a SQL table
    dimension=768  # Dimension for vectors.
)

data = [
    {"id": i, 
     "vector": embedding_fn.encode_documents(options["z"][i])[0],  # Embeds summary word into vector
     "text": options["z"][i],
     } for i in range(len(options["z"]))
]

client.insert(collection_name="z_words", data=data)



{'insert_count': 30, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'cost': 0}

In [7]:
# Then, we search through and return the 20 most similar, we could limit the number we show to the user to ~8, and randomly select from these

query_vectors = embedding_fn.encode_queries(["Rebel, Fearless, Timid, Immigrant"])

res = client.search(
    collection_name="z_words",  # target collection
    data=query_vectors,  # query vectors
    limit=20,  # number of returned entities
    output_fields=["text"],  # specifies fields to be returned
)
dumped_data = json.dumps(res)
parsed_data = json.loads(dumped_data)[0]

final_choices = [choice['entity']['text'] for choice in parsed_data]
final_choices

In [12]:
final_choices

['assertiveness',
 'adaptability',
 'innovation',
 'integrity',
 'leadership skills',
 'social responsibility',
 'self-discipline',
 'sense of purpose',
 'self-confidence',
 'work ethic',
 'time management',
 'digital literacy',
 'mindfulness',
 'financial literacy',
 'emotional intelligence',
 'empathy',
 'ethical decision-making',
 'environmental consciousness',
 'resilience',
 'global perspective']

# 2. Viewing similar sentences after constructing your own

Suggested strategy
- Construct all possible sentences
- Add metadata to each of them that they would want to filter by.
- Add a flag that indicates if the sentence is in-use
- Add metadata for source template

Then you can limit your search for similar sentences using filters which will speed up processing time. 

See here for an example (https://milvus.io/docs/quickstart.md#Load-Existing-Data)
