## Preliminary Exploration
### P1: Using `sentence_transformers` 
The `sentence_transformers` (a.k.a. SBERT) library for Python provides many pre-trained models. [ *Reference: https://sbert.net/* ]

To make sure the library is installed, run the following:
<pre>
pip install sentence_transformers    
</pre>

In [1]:
!pip install sentence_transformers



Now, let's use the `sentence_transformers` library to load a pre-trained model named `all-MiniLM-L6-v2`.

1. First we import the library using:

   `from sentence_transformers import SentenceTransformer`

  
2. Then we will need to define a variable to hold the pre-trained model:

   `embedding_model = SentenceTransformer('all-MiniLM-L6-v2')`

   ***Note:** Several pre-trained models are available, for example:*
   - `all-MiniLM-L6-v2`: fast and good for general text (Dim: 384)
   - `all-mpnet-base-v2`: slower but better quality (Dim: 768)
    
   For more details: https://huggingface.co/sentence-transformers

3. Once the model is loaded, use the `encode()` method of the embedding model to transform the input text into its corresponding vector.

    `embedding_model.encode(input_text_here)`


In [2]:
## Code here

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

text = 'Today is a sunny day.'
result = embedding_model.encode(text)

print(f'Original text: {text}')
print(f'Vector length: {len(result)}')
print(f'Embedding vector: {result}')



Original text: Today is a sunny day.
Vector length: 384
Embedding vector: [ 2.36809417e-03  9.73476768e-02  9.59824622e-02  9.13097858e-02
  3.10873557e-02 -5.90473823e-02  1.50612637e-01 -9.29311365e-02
 -3.90396342e-02 -1.75307281e-02  2.53553204e-02  1.69024300e-02
  2.84761633e-03  7.10389912e-02  8.43285397e-02  6.34109676e-02
 -2.40979977e-02 -9.74420982e-04 -6.98793307e-02  2.91991327e-02
 -1.09861061e-01  9.27486364e-03 -3.49005498e-02  2.89231036e-02
  9.71570984e-03  7.08094239e-02  4.00788449e-02 -1.34750886e-03
 -3.73694114e-02 -2.54107974e-02 -2.10091248e-02  5.22645079e-02
 -3.07825413e-02 -4.27118735e-03 -1.77656952e-02 -3.46577615e-02
 -6.61386782e-03 -1.32381096e-01 -4.01914157e-02  1.00402310e-01
 -5.23531204e-03 -1.35367617e-01  6.83695031e-03  1.96054019e-02
 -6.42637396e-03 -2.14972515e-02 -3.69780436e-02  6.56473190e-02
  1.03695050e-01 -4.45263945e-02 -5.79723977e-02 -4.58558882e-03
 -2.62573659e-02 -4.00760807e-02 -3.66745777e-02  9.90855992e-02
  4.08710763e-02

4. Finally, write a reusable function named `st_embed()` that receives the input text as a parameter and returns the embedding. Convert the result to a list with `tolist()`; this will be useful when preparing the MongoDB document.

In [3]:
## Code here

def st_embed(text):
    embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
    result = embedding_model.encode(text)
    return result

st_embed('This is cold.')

array([-5.77207468e-02,  2.38114316e-02, -4.45517071e-04,  8.69354308e-02,
        4.62214313e-02,  3.00016571e-02,  8.20653215e-02,  6.98352326e-03,
        1.36114284e-02, -2.21099500e-02, -5.80056608e-02, -3.94593813e-02,
        3.89880687e-02,  5.68154827e-02,  1.16812531e-02, -2.18262542e-02,
        1.38524882e-02, -4.20515314e-02, -2.16452852e-02, -4.42779697e-02,
       -1.06043527e-02,  1.08371042e-01, -4.88939062e-02,  2.00086948e-03,
        3.89887057e-02,  4.43795323e-02,  5.04380278e-03, -1.52779408e-02,
       -5.70711493e-02, -1.05001044e-03, -3.28431316e-02,  3.55217829e-02,
        2.01757737e-02, -2.72000227e-02, -1.45785976e-02, -7.56126409e-03,
       -1.22766867e-02, -3.43601108e-02, -4.34343442e-02,  3.11367698e-02,
        3.63502987e-02, -1.80783253e-02,  3.79497744e-02, -1.18687879e-02,
        3.74216475e-02,  9.84410048e-02, -5.72689697e-02,  9.83748510e-02,
        7.06411749e-02, -2.04079095e-02, -7.95387402e-02,  2.16492135e-02,
       -1.01804703e-01,  
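
The `st_embed()` cell above returns a NumPy array and reloads the model on every call. Below is a minimal sketch of a variant that loads the model once and returns a plain list, as step 4 asks. The names `get_st_model` and `st_embed_list` and the optional `model` parameter are illustrative and not part of the original notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_st_model(name="all-MiniLM-L6-v2"):
    """Load the SentenceTransformer model once and cache it (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the library installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def st_embed_list(text, model=None):
    """Return the embedding of `text` as a plain Python list of floats."""
    if model is None:
        model = get_st_model()
    # encode() returns an array-like object; tolist() makes it storable in MongoDB.
    return model.encode(text).tolist()
```

Calling `st_embed_list('This is cold.')` should then return a list rather than an `array(...)`, which is the form a MongoDB document needs.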

Great! You have created a sentence embedding!
<hr>

### P2: Using local LLMs from `ollama`
#### Method 1: using `requests`
Ollama is a platform for running open-source LLMs locally. While Ollama is active, it serves an HTTP API on localhost at port `11434` by default. We can use this local server to interact with the models in Ollama.

Once Ollama is installed, open any browser and enter `localhost:11434` in the address bar. If the browser shows `Ollama is running`, then good news: Ollama is up and working!

Then we will use the `requests` library to send HTTP requests to the models in Ollama.

1. Import the library:

   `import requests`
   
2. Then define the API URL for Ollama embeddings:

    `url = 'http://localhost:11434/api/embeddings'`
    
3. Use the `nomic-embed-text` model.

    *Please make sure the model is installed in Ollama. If it is not, open a terminal and install it with: `ollama pull nomic-embed-text`*

4. To interact with the model, use the `post()` function in `requests` and pass the URL, the model, and the text in the following format:

    <pre>
    requests.post(
        url,
        json = {
            "model" : model,
            "prompt" : text
        }
    )</pre>

5. To retrieve the embedding, parse the JSON body of the response and read its `embedding` key. Example: `response.json()['embedding']`



In [4]:
## Code here

import requests

url = 'http://localhost:11434/api/embeddings'
model = 'nomic-embed-text' 

response = requests.post(
        url,
        json = {
            "model" : model,
            "prompt" : text
        }
    )

result = response.json()['embedding']

6. Finally, write a reusable function named `ollama_embed()` that receives the input text as a parameter and returns the embedding.

In [51]:
## Code here
def ollama_embed(text):
    url = 'http://localhost:11434/api/embeddings'
    model = 'nomic-embed-text'
    response = requests.post(url, json={"model":model, "prompt":text})
    result = response.json()['embedding']
    return result
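
If Ollama is not running or the model has not been pulled, `ollama_embed()` fails with a fairly opaque `KeyError` or connection error. The sketch below adds a timeout and explicit checks. The function name `ollama_embed_checked` and the injectable `post` parameter are assumptions made for illustration; the URL, model, and payload are the ones used above.

```python
def ollama_embed_checked(text, post=None,
                         url="http://localhost:11434/api/embeddings",
                         model="nomic-embed-text", timeout=30):
    """Like ollama_embed(), but fails loudly on HTTP or payload errors."""
    if post is None:
        import requests  # lazy import; `post` can be swapped out for testing
        post = requests.post
    response = post(url, json={"model": model, "prompt": text}, timeout=timeout)
    response.raise_for_status()  # raise on HTTP 4xx/5xx responses
    embedding = response.json().get("embedding")
    if not embedding:
        raise ValueError(f"No embedding returned for model {model!r}")
    return embedding
```

It can be used as a drop-in replacement, for example `ollama_embed_checked('Today is a sunny day.')`.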


#### Method 2: using `langchain`
LangChain is a comprehensive Python library for LLM integration. With LangChain, we can interact with the open-source models in Ollama directly through an object-oriented interface.

[*Reference: https://python.langchain.com/docs/introduction/*]

Please make sure the following library is installed:
- `langchain-ollama`

In [6]:
## 

!pip install langchain-ollama




Then, to interact with Ollama:
1. Import the library: `from langchain_ollama import OllamaEmbeddings`
2. Create an `OllamaEmbeddings` object and define the model to be used.

    Example: `OllamaEmbeddings(model=xxx)`
3. Call the `embed_query()` method of the `OllamaEmbeddings` object and pass in the text. It returns the embedding directly.

    *The length of the vector depends on the embedding model used.*
4. Finally, write a reusable function named `ollama_embed_lc()` that receives the input text as a parameter and returns the embedding.

In [52]:
## Code here
from langchain_ollama import OllamaEmbeddings

embedding = OllamaEmbeddings(model="nomic-embed-text")

example_text = "This is a test sentence to embed."
embedding_vector = embedding.embed_query(example_text)


print("Embedding vector for example_text:")
print(embedding_vector)

def ollama_embed_lc(input_text: str) -> list:
    """
    Takes input text and returns its embedding vector using Ollama embeddings.
    """
    return embedding.embed_query(input_text)


another_text = "LMAO!!! x9 ANIVIA MID!"
another_vector = ollama_embed_lc(another_text)

print("\nEmbedding vector for another_text:")
print(another_vector)


Embedding vector for example_text:
[0.045759533, 0.048638687, -0.15519412, -0.07119232, 0.0648026, -0.006465471, 0.04468354, -0.0060434523, 0.017531652, -0.023755439, -0.043661922, 0.023491938, 0.004667976, 0.049296096, -0.064267315, 0.023731759, 0.077339865, -0.104143724, -0.018971408, 0.04803753, -0.021584792, -0.012928062, -0.031158226, -0.016311025, 0.06997232, 0.0154662365, 0.0044312472, 0.0031901796, -0.06508122, -0.031204365, 0.055903103, 0.002314534, -0.0034225208, -0.047710445, -0.022882627, -0.056000713, -0.020222919, 0.0852439, -0.0032614095, -0.04707582, 0.01913203, 0.0095965015, -0.029933885, -0.007653282, 0.02193616, -0.032393187, 0.08501813, 0.018890873, 0.02683022, -0.08258654, 0.0007857107, -0.044959612, 0.021984795, -0.047106717, 0.09804634, -0.00891882, 0.012330126, -0.04701691, -0.0027929994, -0.015128974, 0.028743735, 0.041340705, -0.096452326, 0.07011951, 0.028670367, -0.0043691387, -0.035164267, 0.057154763, 0.027297853, 0.0037948925, 0.043571137, 0.018515568, 0.

Great! Now you can embed text using local LLMs in Ollama.
<hr>

## Embed protein data and store in MongoDB
### 1. Preparing the data and embeddings
Now we will work through an example use case: take protein data (retrieved from the UniProt database) and create a MongoDB database that stores the data together with its embeddings for vector search.

1. Read the data from the CSV file `uniprot_proteins.csv`. Please download it from e-learning.

In [47]:
## Code here

import pandas as pd

proteins_data = pd.read_csv("uniprot_proteins.csv")
proteins_data.head()

Unnamed: 0,uniprot_id,entry_name,protein_name,gene_names,organism,sequence_length,sequence,function,subcellular_location,pathways,domains,go_molecular_function,go_biological_process,go_cellular_component,keywords,ec_numbers,protein_families
0,P12821,ACE_HUMAN,Angiotensin-converting enzyme,['ACE'],Homo sapiens,1306,MGAASGRRGPGLLLPLPLLLLLPPQPALALDPGLQPGNFSADEAGA...,Isoform produced by alternative promoter usage...,Cell membrane; Secreted,[],"['Peptidase M2 1', 'Peptidase M2 2']","['actin binding', 'bradykinin receptor binding...","['amyloid-beta metabolic process', 'angiogenes...","['basal plasma membrane', 'brush border membra...","['3D-structure', 'Alternative promoter usage',...",[],[]
1,P23368,MAOM_HUMAN,"NAD-dependent malic enzyme, mitochondrial",['ME2'],Homo sapiens,584,MLSRLRVVSTTCTLACRHLHIKEKGKPLMLNPRTNKGMAFTLQERQ...,NAD-dependent mitochondrial malic enzyme that ...,Mitochondrion matrix,[],[],"['electron transfer activity', 'malate dehydro...","['malate metabolic process', 'pyruvate metabol...","['intracellular membrane-bounded organelle', '...","['3D-structure', 'Acetylation', 'Allosteric en...",[],[]
2,P49427,UB2R1_HUMAN,Ubiquitin-conjugating enzyme E2 R1,['CDC34'],Homo sapiens,236,MARPLVPSSQKALLLELKGLQEEPVEGFRVTLVDEGDLYNWEVAIF...,E2 ubiquitin-conjugating enzyme that accepts u...,Cytoplasm; Nucleus,['Protein modification; protein ubiquitination'],['UBC core'],"['ATP binding', 'ubiquitin conjugating enzyme ...","['cellular response to interferon-beta', 'DNA ...","['cytosol', 'nuclear speck', 'nucleoplasm', 'n...","['3D-structure', 'ATP-binding', 'Cell cycle', ...",[],[]
3,P62256,UBE2H_HUMAN,Ubiquitin-conjugating enzyme E2 H,['UBE2H'],Homo sapiens,183,MSSPSPGKRRMDTDVVKLIESKHEVTILGGLNEFVVKFYGPQGTPY...,Accepts ubiquitin from the E1 complex and cata...,,['Protein modification; protein ubiquitination'],['UBC core'],"['ATP binding', 'ubiquitin conjugating enzyme ...",['proteasome-mediated ubiquitin-dependent prot...,"['cytosol', 'nucleus']","['3D-structure', 'Acetylation', 'Alternative s...",[],[]
4,P61077,UB2D3_HUMAN,Ubiquitin-conjugating enzyme E2 D3,['UBE2D3'],Homo sapiens,147,MALKRINKELSDLARDPPAQCSAGPVGDDMFHWQATIMGPNDSPYQ...,Accepts ubiquitin from the E1 complex and cata...,Cell membrane; Endosome membrane,['Protein modification; protein ubiquitination'],['UBC core'],"['ATP binding', 'ubiquitin conjugating enzyme ...","['apoptotic process', 'DNA repair', 'negative ...","['cytosol', 'endosome membrane', 'extracellula...","['3D-structure', 'Alternative splicing', 'Apop...",[],[]


2. Decide which protein information to embed. There are several fields; we will combine `protein_name` and `function` into a single "sentence" to be embedded.

    Let's say the format of the text for every row is: <pre>Proteins: protein_name | Function: function</pre>

   Wrap this in a function named `format_searchable()` that takes a row of the table as input, so that it can be reused later on all the rows in the dataset.

In [48]:
## Code here

import pandas as pd

def format_searchable(row):
    # Combine the protein name and function into one labelled sentence,
    # following the "Proteins: ... | Function: ..." format described above.
    searchable_text = f"Proteins: {row['protein_name']} | Function: {row['function']}"
    return searchable_text

In [49]:
format_searchable(proteins_data.iloc[0])

'Proteins: Angiotensin-converting enzyme | Function: Isoform produced by alternative promoter usage that is specifically expressed in spermatocytes and adult testis, and which is required for male fertility (PubMed:1651327, PubMed:1668266). In contrast to somatic isoforms, only contains one catalytic domain (PubMed:1651327, PubMed:1668266). Acts as a dipeptidyl carboxypeptidase that removes dipeptides from the C-terminus of substrates (PubMed:1668266, PubMed:24297181). The identity of substrates that are needed for male fertility is unknown (By similarity). May also have a glycosidase activity which releases GPI-anchored proteins from the membrane by cleaving the mannose linkage in the GPI moiety. The GPIase activity was reported to be essential for the egg-binding ability of the sperm (By similarity). This activity is however unclear and has been challenged by other groups, suggesting that it may be indirect (By similarity)'

3. Then loop over every entry in the `proteins_data` DataFrame imported in Step 1. For each row, create a document in the following format:
<pre>{
        'uniprot_id':
        'entry_name':
        'protein_name':
        'gene_names':
        'organism':
        'sequence_length':
        'sequence':
        'function':
        'subcellular_location':
        'domains':
        'keywords':
        'go_molecular_function':
        'go_biological_process':
        'go_cellular_component':
        'searchable_text': #this is the text produced in Step 2
        'embedding': #embedding of searchable_text (the code below uses ollama_embed())
   }</pre>

4. During processing, also create the embedding for `searchable_text`. The code below uses the `ollama_embed()` function defined above, which returns 768-dimensional `nomic-embed-text` vectors. `st_embed()` would also work, but it returns 384-dimensional vectors, so the vector index and the query embeddings would have to use the same model.

5. This process may take some time to complete.
   
6. At the end, the documents are collected in a variable named `protein_docs`.

In [53]:
## Code here

from IPython.display import clear_output

protein_docs = []

for idx, row in proteins_data.iterrows():
    clear_output(wait=True)
    print(f'Processing document: {idx+1}/{proteins_data.shape[0]}')
    searchable_text = format_searchable(row)

    doc = {
        'uniprot_id': row['uniprot_id'],
        'entry_name': row['entry_name'],
        'protein_name': row['protein_name'],
        'gene_names': row['gene_names'],
        'organism': row['organism'],
        'sequence_length': int(row['sequence_length']),
        'sequence': row['sequence'],
        'function': row['function'],
        'subcellular_location': row['subcellular_location'],
        'domains': row['domains'],
        'keywords': row['keywords'],
        'go_molecular_function': row['go_molecular_function'],
        'go_biological_process': row['go_biological_process'],
        'go_cellular_component': row['go_cellular_component'],
        'searchable_text': searchable_text,
        'embedding': ollama_embed(searchable_text)
    }

    protein_docs.append(doc)

print("Completed")


Processing document: 349/349
Completed
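
Because the vector index is created with a fixed number of dimensions, it can help to confirm that every document carries an embedding of the expected length before inserting. The following is a small sketch of such a check; the helper names are hypothetical, and `768` is the dimension used for the index later in this notebook.

```python
from collections import Counter


def embedding_dims(docs, key="embedding"):
    """Count how many documents have each embedding length."""
    return Counter(len(doc[key]) for doc in docs)


def assert_uniform_dims(docs, expected, key="embedding"):
    """Raise ValueError unless every document has an `expected`-length embedding."""
    dims = embedding_dims(docs, key)
    if set(dims) != {expected}:
        raise ValueError(f"Expected only {expected}-dim embeddings, got {dict(dims)}")


# Usage in the notebook (assumes protein_docs was built as above):
# assert_uniform_dims(protein_docs, 768)
```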


### 2. Ingest into MongoDB
#### Initialize connection with MongoDB Atlas
1. Establish the connection to MongoDB Atlas. First create a new MongoDB Atlas account and a free-tier MongoDB database. Then obtain the connection URI and establish the connection with `pymongo`.

In [35]:
## Code here

from pymongo import MongoClient

MONGO_URI = "mongodb+srv://benLab4:benlab4ass@cluster0.goopkpm.mongodb.net/"

try:
    client = MongoClient(MONGO_URI)

    client.admin.command("ping")

    db = client["imported"]
    collection = db["uniprot_proteins"]

    print("Successfully connected to MongoDB Atlas!")

except Exception as e:
    print("Failed to connect to MongoDB Atlas:")
    print(e)

Successfully connected to MongoDB Atlas!
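
The connection string above embeds the database password directly in the notebook. A common alternative is to read it from an environment variable. The sketch below shows one way to do that; the variable name `MONGO_URI` and the helper `get_mongo_uri` are assumptions for illustration.

```python
import os


def get_mongo_uri(env=None, var="MONGO_URI"):
    """Read the MongoDB connection string from an environment variable."""
    env = os.environ if env is None else env
    uri = env.get(var)
    if not uri:
        raise RuntimeError(f"Set the {var} environment variable first")
    return uri


# Usage in the notebook (after exporting MONGO_URI in the shell):
# client = MongoClient(get_mongo_uri())
```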


#### Insert documents
2. Then insert all the documents into the MongoDB Atlas database.

In [54]:
## Code here
import math

batch_size = 50
# Total number of batches, rounded up so the last partial batch is counted.
total_batches = math.ceil(len(protein_docs) / batch_size)

for i in range(0, len(protein_docs), batch_size):
    batch = protein_docs[i:i+batch_size]
    collection.insert_many(batch)
    print(f"Inserted batch {i // batch_size + 1}/{total_batches}")


Inserted batch 1/7
Inserted batch 2/7
Inserted batch 3/7
Inserted batch 4/7
Inserted batch 5/7
Inserted batch 6/7
Inserted batch 7/7


#### Creating vector index
3. Then create a vector search index in MongoDB Atlas. In this case we will use the `langchain-mongodb` and `langchain-ollama` libraries.

    Here we use the `MongoDBAtlasVectorSearch` class from the `langchain-mongodb` library. It requires the following information:

   <pre>
       MongoDBAtlasVectorSearch(
           collection = 
           embedding = 
           index_name = 
           relevance_score_fn = 
       )
   </pre>

In [37]:
%pip install langchain langchain-mongodb langchain-ollama pymongo


In [55]:
## Code here

from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_ollama import OllamaEmbeddings

embed_model = OllamaEmbeddings(model='nomic-embed-text')

vectorstore = MongoDBAtlasVectorSearch(
    collection = collection,
    embedding = embed_model,
    index_name = "prot_index",
    relevance_score_fn='cosine'
)

vectorstore.create_vector_search_index(dimensions=768)
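
For reference, the index created here should correspond to an Atlas Vector Search definition along the following lines. The sketch builds it as a plain dictionary. The helper name is hypothetical, and the field layout assumes the Atlas Vector Search index format, so check it against the Atlas documentation before relying on it.

```python
def vector_index_definition(dimensions=768, path="embedding", similarity="cosine"):
    """Build an Atlas Vector Search index definition as a dictionary (sketch)."""
    return {
        "fields": [
            {
                "type": "vector",             # a field holding embedding vectors
                "path": path,                 # document key that stores the vector
                "numDimensions": dimensions,  # must match the embedding model output
                "similarity": similarity,     # similarity function used for ranking
            }
        ]
    }


definition = vector_index_definition()
print(definition)
```

The `numDimensions` value has to match the embedding model: `nomic-embed-text` vectors have 768 dimensions, whereas `all-MiniLM-L6-v2` vectors have 384.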


Great! You have successfully embedded the protein data and stored it in MongoDB Atlas.
<hr>

## Vector Search in MongoDB Atlas
### Testing the vector search
1. Create a vector search pipeline and run it with `aggregate` in MongoDB to perform the search.

   The pipeline uses the `$vectorSearch` operator. The vector search stage generally looks as follows:

   <pre>
       {
            "$vectorSearch": {
                "index": "prot_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": min(limit * 10, 1000),
                "limit": 
            }
        }
   </pre>

In [63]:
## Code here

query = "enzyme that catalyzes protein breakdown"
query_embedding = ollama_embed(query)
limit = 10

pipeline = [
    {
        "$vectorSearch": {
            "index": "prot_index",
            "path": "embedding",
            "queryVector": query_embedding,
            "numCandidates": min(limit * 10, 1000),
            "limit": limit
        }
    },
    {
        "$project": {
            "uniprot_id": 1,
            "protein_name": 1,
            "organism": 1,
            "function": 1,
            "keywords": 1,
            "searchable_text": 1,
            "score": {"$meta": "vectorSearchScore"}  # Include similarity score
        }
    }
]

In [64]:
result = list(collection.aggregate(pipeline))

2. Create a function called `search_vector()` that accepts:
    - the query text,
    - the limit on the number of results,
    - the MongoDB collection object
   
   <br>as input, and returns the search results from MongoDB.

In [65]:
## Code here
def search_vector(query, limit, collection):
    query_embedding = ollama_embed(query)
    pipeline = [
        {
            "$vectorSearch": {
                "index": "prot_index",
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": min(limit * 10, 1000),
                "limit": limit
            }
        },
        {
            "$project": {
                "uniprot_id": 1,
                "protein_name": 1,
                "organism": 1,
                "function": 1,
                "keywords": 1,
                "searchable_text": 1,
                "score": {"$meta": "vectorSearchScore"}  # Include similarity score
            }
        }
    ]
    result = list(collection.aggregate(pipeline))
    return result

In [66]:
search_vector("transcription factor that binds DNA", 10, collection)

[{'_id': ObjectId('685ebb252516af4a76508da7'),
  'uniprot_id': 'P17275',
  'protein_name': 'Transcription factor JunB',
  'organism': 'Homo sapiens',
  'function': "Transcription factor involved in regulating gene activity following the primary growth factor response. Binds to the DNA sequence 5'-TGA[GC]TCA-3'. Heterodimerizes with proteins of the FOS family to form an AP-1 transcription complex, thereby enhancing its DNA binding activity to an AP-1 consensus sequence and its transcriptional activity (By similarity)",
  'keywords': "['Acetylation', 'DNA-binding', 'Isopeptide bond', 'Nucleus', 'Phosphoprotein', 'Proteomics identification', 'Reference proteome', 'Transcription', 'Transcription regulation', 'Ubl conjugation']",
  'searchable_text': "Proteins: Transcription factor JunB | Function: Transcription factor involved in regulating gene activity following the primary growth factor response. Binds to the DNA sequence 5'-TGA[GC]TCA-3'. Heterodimerizes with proteins of the FOS family
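
The `score` returned by `$meta: "vectorSearchScore"` reflects how similar each stored embedding is to the query embedding. The sketch below computes cosine similarity in plain Python to show what is being compared. The `(1 + cosine) / 2` normalization to the range 0 to 1 is an assumption about how Atlas reports cosine scores, and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_cosine_score(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed Atlas-style score)."""
    return (1 + cosine_similarity(a, b)) / 2


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))        # identical direction
print(normalized_cosine_score([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```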

Great! You have done your first vector search using MongoDB Atlas.
<hr>

## LAB ASSIGNMENT 4 TASK

Given the following requirements:
1. Users may search for proteins from a specific organism, for example only Homo sapiens.
2. GO molecular functions should be included.
3. Users may search by sequence length.

You are given the following tasks to complete:
1. Enhance the `searchable_text` representation to include more information, as per the requirements.
2. Then rebuild the embeddings in MongoDB Atlas. You may delete the existing database/collection and recreate it.
3. Re-create the search index based on the latest collection.
4. Create a simple Streamlit UI with a search query box to demonstrate vector search on your MongoDB Atlas data. Design a simple display of the information retrieved from MongoDB.
5. For demo purposes, make a recording that demonstrates how your Streamlit app works (voice explanation is optional).
6. Submit your work:
     - The revised final version of the Python notebook or Python script, showing how you prepare the data, store it in MongoDB Atlas, and define the index. (The link to MongoDB Atlas should be working.)
     - The Streamlit app code used to demonstrate the vector search.
       

In [67]:
def format_searchable(row):
    searchable_text = (
        f"Protein: {row['protein_name']} | "
        f"Function: {row['function']} | "
        f"Organism: {row['organism']} | "
        f"GO Molecular Function: {row['go_molecular_function']} | "
        f"Sequence Length: {row['sequence_length']}"
    )
    return searchable_text
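
In the CSV, `go_molecular_function` is stored as a stringified Python list (for example `['ATP binding', ...]`), so the brackets and quotes end up inside `searchable_text`. The sketch below shows one way to turn such a value into plain comma-separated text before embedding. The helper name `list_field_to_text` is hypothetical and is not used in the cells of this notebook.

```python
import ast


def list_field_to_text(value, sep=", "):
    """Convert a stringified list into plain text joined by `sep`."""
    if not isinstance(value, str):
        return ""  # e.g. NaN from pandas for a missing cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # not a Python literal; keep the raw text
    if isinstance(parsed, (list, tuple)):
        return sep.join(str(item) for item in parsed)
    return str(parsed)


print(list_field_to_text("['ATP binding', 'ubiquitin conjugating enzyme activity']"))
```

Inside `format_searchable()`, this could be applied as `list_field_to_text(row['go_molecular_function'])`.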

In [68]:
collection.drop()
print("Old collection dropped.")

Old collection dropped.


In [69]:
protein_docs = []

for idx, row in proteins_data.iterrows():
    clear_output(wait=True)
    print(f'Processing document: {idx+1}/{proteins_data.shape[0]}')

    searchable_text = format_searchable(row)

    doc = {
        'uniprot_id': row['uniprot_id'],
        'entry_name': row['entry_name'],
        'protein_name': row['protein_name'],
        'gene_names': row['gene_names'],
        'organism': row['organism'],
        'sequence_length': int(row['sequence_length']),
        'sequence': row['sequence'],
        'function': row['function'],
        'subcellular_location': row['subcellular_location'],
        'domains': row['domains'],
        'keywords': row['keywords'],
        'go_molecular_function': row['go_molecular_function'],
        'go_biological_process': row['go_biological_process'],
        'go_cellular_component': row['go_cellular_component'],
        'searchable_text': searchable_text,
        'embedding': ollama_embed(searchable_text)  # 768-dim vector; st_embed() gives 384 dims and would not match the index
    }

    protein_docs.append(doc)


Processing document: 349/349
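
The exported cells do not show the rebuilt documents being inserted again after the collection was dropped. A minimal sketch of that step follows, reusing the batch size of 50 from earlier; the helpers `batched` and `insert_in_batches` are hypothetical names.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(collection, docs, batch_size=50):
    """Insert `docs` into `collection` batch by batch; return the number inserted."""
    inserted = 0
    for batch in batched(docs, batch_size):
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


# Usage in the notebook (assumes `collection` and `protein_docs` from above):
# insert_in_batches(collection, protein_docs)
```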


In [70]:
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_ollama import OllamaEmbeddings

embed_model = OllamaEmbeddings(model='nomic-embed-text')

vectorstore = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=embed_model,
    index_name="prot_index",
    relevance_score_fn='cosine'
)

vectorstore.create_vector_search_index(dimensions=768)
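
Requirements 1 and 3 (a specific organism, a specific sequence length) can also be handled with a pre-filter in the `$vectorSearch` stage rather than through the embedded text alone. The sketch below builds such a pipeline. It assumes that `organism` and `sequence_length` have been added to the vector index as `filter` fields, which is not shown here, and the function name and its parameters are hypothetical.

```python
def build_filtered_pipeline(query_embedding, limit=10, organism=None,
                            min_length=None, max_length=None):
    """Build a $vectorSearch pipeline with optional organism/length filters."""
    conditions = []
    if organism is not None:
        conditions.append({"organism": {"$eq": organism}})
    length_range = {}
    if min_length is not None:
        length_range["$gte"] = min_length
    if max_length is not None:
        length_range["$lte"] = max_length
    if length_range:
        conditions.append({"sequence_length": length_range})

    stage = {
        "index": "prot_index",
        "path": "embedding",
        "queryVector": query_embedding,
        "numCandidates": min(limit * 10, 1000),
        "limit": limit,
    }
    if conditions:
        # Only documents matching every condition are considered candidates.
        stage["filter"] = {"$and": conditions}

    project = {
        "uniprot_id": 1,
        "protein_name": 1,
        "organism": 1,
        "sequence_length": 1,
        "go_molecular_function": 1,
        "score": {"$meta": "vectorSearchScore"},
    }
    return [{"$vectorSearch": stage}, {"$project": project}]
```

For example, `build_filtered_pipeline(ollama_embed("ubiquitin ligase"), organism="Homo sapiens", max_length=300)` could be passed to `collection.aggregate()`.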

Video Link - BEN LIM CHOONG CHUEN B23CS0032

https://youtu.be/ksHqlBc30xU