<a href="https://colab.research.google.com/github/adidror005/youtube-videos/blob/main/Actual_CHROMADB_FINAL_ACTUAL_video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installs
* ChromaDB
* Sentence Transformers

In [2]:
!pip install chromadb
!pip install sentence_transformers

Collecting sentence_transformers
  Using cached sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl.me

### ChromaDB

*Chroma is the open-source AI application database. Batteries included.*

* Embeddings Vector Database and More


Embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal.


https://www.trychroma.com/

* Can run in client-server mode or in Docker
* Has some async functionality
* In constant development
* A hosted version is coming soon.
* Compatible with LLM app development frameworks such as LangChain and LlamaIndex.

### Why would we use ChromaDB?
* **Efficient Vector Storage & Retrieval:** ChromaDB is purpose-built for handling vector embeddings, which are crucial for many AI applications. It provides fast and efficient storage and retrieval of these embeddings, enabling tasks like semantic search and recommendation systems.

* **Open-Source & Flexible:** ChromaDB's open-source nature offers transparency and community support. It's also flexible, allowing integration with various embedding models and supporting different storage backends to suit your needs.


### Goal of Video:
* Give brief overview of ChromaDB
  - Query based on embedding semantic similarity
  - Filter based on text and metadata
* Show you how to further customize ChromaDB by
  - persisting to disk for later use, so we don't have to recompute and reinsert embeddings
  - choosing the embedding function for text embeddings
  - choosing the distance function for vector search



##### Load some earnings transcripts data

In [4]:
import pandas as pd
from google.colab import drive

file_path = "https://raw.githubusercontent.com/adidror005/youtube-videos/main/earnings_sample%20(1).csv"
df = pd.read_csv(file_path)

df

Unnamed: 0,symbol,quarter,year,date,content
0,TSLA,1,2024,2024-04-23 22:30:10,Martin Viecha: Tesla's First Quarter 2024 Q&A ...
1,TSLA,2,2024,2024-07-24 01:45:26,"Travis Axelrod: Good afternoon, everyone and w..."
2,TSLA,1,2023,2023-04-19 18:54:06,"Martin Viecha: Good afternoon, everyone, and w..."
3,ON,1,2023,2023-05-01 11:43:05,"Operator: Good day, and thank you for standing..."
4,RIVN,1,2023,2023-05-09 20:53:09,"Operator: Good day, ladies and gentlemen. Than..."
5,ON,2,2023,2023-07-31 12:44:02,"Operator: Good day, and thank you for standing..."
6,RIVN,2,2023,2023-08-08 21:07:07,Operator: Good day and thank you for standing ...
7,TSLA,3,2023,2023-10-18 22:06:06,Elon Musk: [Call Starts Abruptly]…of new facto...
8,TSLA,3,2022,2022-10-19 22:14:04,"Martin Viecha: Good afternoon, everyone and we..."
9,ON,3,2022,2022-10-31 12:32:06,"Operator: Good day, and thank you for standing..."


In [5]:
print(df.iloc[0].content)


Martin Viecha: Tesla's First Quarter 2024 Q&A Webcast. My name is Martin Viecha, VP of Investor Relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q1 results were announced at about 3.00 p.m. Central Time in the Update Deck we published at the same link as this webcast. During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and expectations as of today. Actual events and results could differ materially due to a number of risks and uncertainties, including those mentioned in our most recent filings with the SEC. During the question-and-answer portion of today's call, please limit yourself to one question and one follow-up. Please use the raise hand button to join the question queue. But before we jump into Q&A, Elon has some opening remarks. Elon?
Elon Musk: Thanks, Martin. So to recap in Q1 we navigated several unforeseen challenges as well as the ramp of th

#### End to End Simple Example to get Started

##### Step 1: Create Chroma client

In [6]:
import chromadb
chroma_client = chromadb.Client()

##### Step 2: Create a collection
* Collections are where you'll store your embeddings, documents, and any additional metadata.
* You can create a collection with a name



In [7]:
collection = chroma_client.create_collection(name="my_collection")

#### Step 3: Add some text documents to the collection
* documents: list of texts
  - Note: for long earnings calls like these we would probably want to split each transcript into chunks, but for simplicity we use the entire document. LlamaIndex, for example, can help you split a document into chunks.
* metadatas: list of dicts with metadata values
* ids: unique strings that identify each record
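
The chunking idea mentioned above can be sketched in plain Python. This is a minimal illustration, not what the notebook does: the helper names `chunk_text` and `build_chunk_records`, the character-based window sizes, and the `chunk` metadata field are all assumptions made for the example. It produces the same three lists (`documents`, `metadatas`, `ids`) that `collection.add` expects.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_chunk_records(texts, metadatas, chunk_size=1000, overlap=200):
    """Return (documents, metadatas, ids) with one record per chunk."""
    documents, chunk_metadatas, ids = [], [], []
    for doc_index, (text, metadata) in enumerate(zip(texts, metadatas)):
        for chunk_index, chunk in enumerate(chunk_text(text, chunk_size, overlap)):
            documents.append(chunk)
            # Copy the parent metadata and record the chunk position (assumed field name).
            chunk_metadatas.append({**metadata, "chunk": chunk_index})
            ids.append(f"id_{doc_index}_{chunk_index}")
    return documents, chunk_metadatas, ids


# Tiny stand-in for a transcript: 30 characters, split into windows of 12 with overlap 2.
sample_texts = ["abcdefghij" * 3]
sample_metadatas = [{"symbol": "TSLA", "quarter": 1, "year": 2024}]
chunk_docs, chunk_metas, chunk_ids = build_chunk_records(
    sample_texts, sample_metadatas, chunk_size=12, overlap=2
)
print(chunk_ids)
```

Because each chunk keeps its parent's metadata, filters on `symbol`, `quarter`, and `year` would keep working on chunked records.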

In [8]:
documents = df['content'].tolist()
documents[0]

"Martin Viecha: Tesla's First Quarter 2024 Q&A Webcast. My name is Martin Viecha, VP of Investor Relations, and I'm joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q1 results were announced at about 3.00 p.m. Central Time in the Update Deck we published at the same link as this webcast. During this call, we will discuss our business outlook and make forward-looking statements. These comments are based on our predictions and expectations as of today. Actual events and results could differ materially due to a number of risks and uncertainties, including those mentioned in our most recent filings with the SEC. During the question-and-answer portion of today's call, please limit yourself to one question and one follow-up. Please use the raise hand button to join the question queue. But before we jump into Q&A, Elon has some opening remarks. Elon?\nElon Musk: Thanks, Martin. So to recap in Q1 we navigated several unforeseen challenges as well as the ramp of 

In [10]:
metadata_cols = ['symbol','quarter','year','date']
metadatas = df[metadata_cols].to_dict(orient='records')
metadatas[0:2]

[{'symbol': 'TSLA', 'quarter': 1, 'year': 2024, 'date': '2024-04-23 22:30:10'},
 {'symbol': 'TSLA', 'quarter': 2, 'year': 2024, 'date': '2024-07-24 01:45:26'}]

In [12]:
ids=[f"id_{i}" for i in df.index.tolist()]
ids[0:2]

['id_0', 'id_1']

* Actual adding of the documents, metadatas, and ids to the database

In [13]:
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 49.4MiB/s]


##### Step 4: Query the collection
* query_texts: the texts you want to query with
* n_results: number of results to return per query text

In [15]:
results = collection.query(
    query_texts=["What is ON saying about EV demand?",'I am Michael Jordan'],
    n_results=2
)
results

{'ids': [['id_1', 'id_2'], ['id_12', 'id_15']],
 'distances': [[1.1667418479919434, 1.2638124227523804],
  [1.794992208480835, 1.7955803871154785]],
 'metadatas': [[{'date': '2024-07-24 01:45:26',
    'quarter': 2,
    'symbol': 'TSLA',
    'year': 2024},
   {'date': '2023-04-19 18:54:06',
    'quarter': 1,
    'symbol': 'TSLA',
    'year': 2023}],
  [{'date': '2023-02-06 12:18:06', 'quarter': 4, 'symbol': 'ON', 'year': 2022},
   {'date': '2020-05-11 22:12:05',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2020}]],
 'embeddings': None,
 'documents': [["Travis Axelrod: Good afternoon, everyone and welcome to Tesla's Second Quarter 2024 Q&A Webcast. My name is Travis Axelrod, Head of Investor Relations and I’m joined today by Elon Musk, Vaibhav Taneja, and a number of other executives. Our Q2 results were announced at about 3.00 p.m. Central Time and the Update Deck we published at the same link as this webcast. During this call, we will discuss our business outlook and make forward-

* *include*: you can choose which of `"embeddings"`, `"metadatas"`, `"documents"`, and `"distances"` to include in the results.


In [18]:
results = collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"]
)
results

{'ids': [['id_1', 'id_2', 'id_18']],
 'distances': None,
 'metadatas': [[{'date': '2024-07-24 01:45:26',
    'quarter': 2,
    'symbol': 'TSLA',
    'year': 2024},
   {'date': '2023-04-19 18:54:06',
    'quarter': 1,
    'symbol': 'TSLA',
    'year': 2023},
   {'date': '2023-07-19 21:33:10',
    'quarter': 2,
    'symbol': 'TSLA',
    'year': 2023}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}
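
The result is a dict of nested lists: one inner list per query text, and fields that were not requested are `None`. A small helper can flatten that into one row per hit. This is a sketch only; `flatten_query_results` is a hypothetical name, and the sample dict reuses ids, distances, and metadata from the two-query output earlier in the notebook, with the documents left out.

```python
def flatten_query_results(results):
    """Turn a Chroma-style query result into a flat list of row dicts."""
    rows = []
    for query_index, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            row = {"query_index": query_index, "rank": rank, "id": doc_id}
            for key in ("distances", "metadatas", "documents"):
                values = results.get(key)
                if values is not None:  # skip fields that were not included
                    row[key[:-1]] = values[query_index][rank]
            rows.append(row)
    return rows


# Values copied from the two-query output above; documents omitted for brevity.
sample_results = {
    "ids": [["id_1", "id_2"], ["id_12", "id_15"]],
    "distances": [[1.1667418479919434, 1.2638124227523804],
                  [1.794992208480835, 1.7955803871154785]],
    "metadatas": [
        [{"symbol": "TSLA", "quarter": 2, "year": 2024},
         {"symbol": "TSLA", "quarter": 1, "year": 2023}],
        [{"symbol": "ON", "quarter": 4, "year": 2022},
         {"symbol": "ON", "quarter": 1, "year": 2020}],
    ],
    "documents": None,
    "embeddings": None,
}

rows = flatten_query_results(sample_results)
for row in rows:
    print(row["query_index"], row["rank"], row["id"], row["metadata"]["symbol"])
```

A list of flat rows like this can be passed straight to `pd.DataFrame(rows)` for inspection.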

* Note the query has some issues: it returns TSLA earnings calls, not ON!
* So additional filters could really help us...

#####  2 Main Types of Filters for Queries

* Document Filters: *where_document*
* Metadata Filters: *where*





###### Document Filters ("*where_document*")
* Filter on exact text in a document (operators: `$contains`, `$not_contains`)
```
where_document = {
    "$operator": "your_search_term"
}
```

* For example, filter for documents that contain the string 'ON' (a case-sensitive substring match)

In [20]:
results = collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where_document={"$contains":"ON"}
)
results

{'ids': [['id_17', 'id_15', 'id_14']],
 'distances': None,
 'metadatas': [[{'date': '2020-08-10 14:10:48',
    'quarter': 2,
    'symbol': 'ON',
    'year': 2020},
   {'date': '2020-05-11 22:12:05', 'quarter': 1, 'symbol': 'ON', 'year': 2020},
   {'date': '2021-05-03 16:22:18',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2021}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

### Metadata Filters ("*where*")

```
where = {
    "your_metadata_field": {
        "$operator": "your_search_term"
    }
}
```
or for multiple conditions (can also be further nested)

```
where = {
    "$and": [  # or "$or"
        {"your_metadata_field_1": {"$operator_1": "your_search_term_1"}},
        {"your_metadata_field_2": {"$operator_2": "your_search_term_2"}}
    ]
}
```

* You can also nest `$and` and `$or` conditions

* https://cookbook.chromadb.dev/core/filters/

* FORMAT `{COL: {$OPERATOR: VALUE}}`

| Filter Type          | Description                                 | Example                                                                                                                 |
|------------------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| `=` (`$eq`)           | Equality                                     | `{"color": {"$eq": "red"}}`                                                                                           |
| `!=` (`$ne`)         | Inequality                                  | `{"price": {"$ne": 100}}`                                                                                            |
| `>` (`$gt`)           | Greater Than                                  | `{"age": {"$gt": 30}}`                                                                                              |
| `>=` (`$gte`)         | Greater Than or Equal To                     | `{"quantity": {"$gte": 5}}`                                                                                          |
| `<` (`$lt`)           | Less Than                                    | `{"rating": {"$lt": 4.5}}`                                                                                          |
| `<=` (`$lte`)         | Less Than or Equal To                        | `{"year": {"$lte": 2023}}`                                                                                          |
| `in` (`$in`)         | Inclusion                                   | `{"category": {"$in": ["electronics", "clothing"]}}`                                                                   |
| `nin` (`$nin`)        | Exclusion                                  | `{"status": {"$nin": ["pending", "cancelled"]}}`                                                                      |
| `contains` (`$contains`) | Substring (document filter, passed via `where_document`) | `{"$contains": "sale"}`                                                                               |
| `and`  (`$and`)               | Conjunction (All conditions must be true)   | `{"$and": [{"color": {"$eq": "blue"}}, {"size": {"$eq": "large"}}]}`                                                 |
| `or`   (`$or`)               | Disjunction (At least one condition is true) | `{"$or": [{"price": {"$lt": 50}}, {"discount": {"$gt": 0.2}}]}`                                                       |
| `not contains` (`$not_contains`) | Excludes substring (document filter, passed via `where_document`) | `{"$not_contains": "sale"}`                                                                      |
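
To make the semantics of the comparison and logical operators concrete, here is a plain-Python sketch that evaluates a filter against a single metadata dict. It is not Chroma's implementation: the `matches` function and `COMPARATORS` table are hypothetical names, and returning `False` for a missing field is an assumption of this sketch.

```python
import operator

# Comparison operators mapped to plain Python checks.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a Chroma-style where filter."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    # Otherwise the filter is a single {field: condition} pair.
    (field, condition), = where.items()
    if not isinstance(condition, dict):
        condition = {"$eq": condition}  # shorthand: {"symbol": "ON"}
    (op_name, expected), = condition.items()
    if field not in metadata:
        return False  # assumption of this sketch, not documented Chroma behavior
    return COMPARATORS[op_name](metadata[field], expected)


records = [
    {"symbol": "ON", "quarter": 1, "year": 2023},
    {"symbol": "TSLA", "quarter": 2, "year": 2024},
    {"symbol": "RIVN", "quarter": 1, "year": 2023},
]
clauses = [
    {"symbol": {"$eq": "ON"}},
    {"symbol": {"$in": ["TSLA"]}},
    {"symbol": {"$nin": ["TSLA", "RIVN"]}},
]
and_hits = [r["symbol"] for r in records if matches(r, {"$and": clauses})]
or_hits = [r["symbol"] for r in records if matches(r, {"$or": clauses})]
print(and_hits)  # no symbol can equal ON and be in ["TSLA"] at the same time
print(or_hits)
```

With `$and`, the three clauses contradict each other, so nothing matches. With `$or`, ON and TSLA records pass while RIVN fails every clause.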

* Symbol = 'ON'

In [21]:
results = collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where={"symbol":{"$eq":"ON"}}
)
results

{'ids': [['id_17', 'id_15', 'id_14']],
 'distances': None,
 'metadatas': [[{'date': '2020-08-10 14:10:48',
    'quarter': 2,
    'symbol': 'ON',
    'year': 2020},
   {'date': '2020-05-11 22:12:05', 'quarter': 1, 'symbol': 'ON', 'year': 2020},
   {'date': '2021-05-03 16:22:18',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2021}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

* Can also use shorthand

In [23]:
results = collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where={"symbol":"ON"}
)
results

{'ids': [['id_17', 'id_15', 'id_14']],
 'distances': None,
 'metadatas': [[{'date': '2020-08-10 14:10:48',
    'quarter': 2,
    'symbol': 'ON',
    'year': 2020},
   {'date': '2020-05-11 22:12:05', 'quarter': 1, 'symbol': 'ON', 'year': 2020},
   {'date': '2021-05-03 16:22:18',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2021}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

### AND condition for metadata filter

* Symbol is ON
* Quarter is Q1
* Year is >= 2020

In [27]:
results = collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where={
        "$and":[
            {"symbol":"ON"},
            {"quarter":1},
            {"year":{"$gte":2020}}
        ]
    }
)
results

{'ids': [['id_15', 'id_14', 'id_3']],
 'distances': None,
 'metadatas': [[{'date': '2020-05-11 22:12:05',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2020},
   {'date': '2021-05-03 16:22:18', 'quarter': 1, 'symbol': 'ON', 'year': 2021},
   {'date': '2023-05-01 11:43:05',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2023}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

* `$or` example and `$in`/`$nin` example

In [29]:
results=collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where={
        "$and":[
            {"symbol":{"$eq":"ON"}},
            {"symbol":{"$in":["ON"]}},
            {"symbol":{"$nin":["TSLA","RIVN"]}},
        ]
    }
)
results

{'ids': [['id_17', 'id_15', 'id_14']],
 'distances': None,
 'metadatas': [[{'date': '2020-08-10 14:10:48',
    'quarter': 2,
    'symbol': 'ON',
    'year': 2020},
   {'date': '2020-05-11 22:12:05', 'quarter': 1, 'symbol': 'ON', 'year': 2020},
   {'date': '2021-05-03 16:22:18',
    'quarter': 1,
    'symbol': 'ON',
    'year': 2021}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

In [30]:
results=collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where={
        "$and":[
            {"symbol":{"$eq":"ON"}},
            {"symbol":{"$in":["TSLA"]}},
            {"symbol":{"$nin":["TSLA","RIVN"]}},
        ]
    }
)
results

{'ids': [[]],
 'distances': None,
 'metadatas': [[]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

In [31]:
results=collection.query(
    query_texts=["What is ON saying about EV demand?"],
    n_results=3,
    include=["metadatas"],
    where={
        "$or":[
            {"symbol":{"$eq":"ON"}},
            {"symbol":{"$in":["TSLA"]}},
            {"symbol":{"$nin":["TSLA","RIVN"]}},
        ]
    }
)
results

{'ids': [['id_1', 'id_2', 'id_18']],
 'distances': None,
 'metadatas': [[{'date': '2024-07-24 01:45:26',
    'quarter': 2,
    'symbol': 'TSLA',
    'year': 2024},
   {'date': '2023-04-19 18:54:06',
    'quarter': 1,
    'symbol': 'TSLA',
    'year': 2023},
   {'date': '2023-07-19 21:33:10',
    'quarter': 2,
    'symbol': 'TSLA',
    'year': 2023}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

* Can also have nested conditions

In [33]:
results=collection.query(
    query_texts=["What is ON saying about EV demand?"],
    include=["metadatas"],
    where={
        "$and":[
            {"symbol":{"$eq":"ON"}},
            {
                "$or":[
                    {"symbol":{"$in":["TSLA"]}},
                    {"symbol":{"$nin":["TSLA","RIVN"]}},
                ]
            }
        ]})
results

{'ids': [['id_17',
   'id_15',
   'id_14',
   'id_9',
   'id_19',
   'id_3',
   'id_12',
   'id_5']],
 'distances': None,
 'metadatas': [[{'date': '2020-08-10 14:10:48',
    'quarter': 2,
    'symbol': 'ON',
    'year': 2020},
   {'date': '2020-05-11 22:12:05', 'quarter': 1, 'symbol': 'ON', 'year': 2020},
   {'date': '2021-05-03 16:22:18', 'quarter': 1, 'symbol': 'ON', 'year': 2021},
   {'date': '2022-10-31 12:32:06', 'quarter': 3, 'symbol': 'ON', 'year': 2022},
   {'date': '2023-10-30 12:00:27', 'quarter': 3, 'symbol': 'ON', 'year': 2023},
   {'date': '2023-05-01 11:43:05', 'quarter': 1, 'symbol': 'ON', 'year': 2023},
   {'date': '2023-02-06 12:18:06', 'quarter': 4, 'symbol': 'ON', 'year': 2022},
   {'date': '2023-07-31 12:44:02',
    'quarter': 2,
    'symbol': 'ON',
    'year': 2023}]],
 'embeddings': None,
 'documents': None,
 'uris': None,
 'data': None,
 'included': ['metadatas']}

### Some other useful functions


* Count # of documents

In [34]:
collection.count()

22

* See first few items

In [35]:
collection.peek(3)

{'ids': ['id_0', 'id_1', 'id_10'],
 'embeddings': [[-0.10578790307044983,
   0.02799503318965435,
   0.06110765039920807,
   0.008655349723994732,
   0.028378978371620178,
   0.025117190554738045,
   -0.0009924240875989199,
   0.055910076946020126,
   0.06650234013795853,
   -0.006881932262331247,
   -0.02189655974507332,
   0.00545143848285079,
   0.017506247386336327,
   -0.08112282305955887,
   0.01222716923803091,
   0.014148831367492676,
   -0.008501795120537281,
   -0.12598687410354614,
   -0.09730604290962219,
   0.09277162700891495,
   0.005388063844293356,
   0.006667647045105696,
   -0.0313338004052639,
   0.03445453196763992,
   0.019005367532372475,
   -0.05221017077565193,
   -0.055838827043771744,
   0.03990521281957626,
   -0.04021830856800079,
   -0.03800611570477486,
   -0.049822963774204254,
   0.027562396600842476,
   0.0016175449127331376,
   0.04498564824461937,
   0.025101711973547935,
   0.03933669626712799,
   -0.0017539982218295336,
   -0.05786225199699402,
   

# More customized example

### Can persist for later use
* So we don't have to recompute embeddings and so on.
* Note: on Colab, local storage is lost when the session ends, so copy the database to Google Drive if you want to keep it.

### Most important components of a RAG system / vector database that can be customized with ChromaDB
1. Embedding Function
2. Distance Function


### Persisting
* Previously the database lived in memory only
* Now we save the database to the local machine

In [36]:
client = chromadb.PersistentClient(path="location")

##### Component 1: Embedding Function

* Chroma uses the Sentence Transformers model `all-MiniLM-L6-v2` by default https://www.sbert.net/index.html
* You can choose any sentence transformer as the embedding function https://docs.trychroma.com/guides/embeddings
* Can also use OpenAI, Google, Cohere, Hugging Face, etc. embeddings
* **NOTE** You must pass the embedding function every time you get the collection.

In [39]:
import torch
from sentence_transformers import SentenceTransformer
from chromadb.utils import embedding_functions
device = 'cuda' if torch.cuda.is_available() else 'cpu'
emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2",device=device)

In [40]:
collection = client.create_collection(name="new_collection",embedding_function=emb_fn)

##### Component 2: Distance Metric for Vector Search

---



| Distance            | Parameter | Equation |
| :---------------- | :------ |:-------: |
| Squared L2           |   ```l2```   | $$\sum_i (A_i - B_i)^2$$                               |
| Inner Product    |  ```ip```   | $$1-\sum_i A_i B_i$$ |
| Cosine | ```cosine```   | $$1-\frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$ |
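
For intuition, the three distances can be written out in plain Python. This is a sketch of the formulas as they are usually defined for these three settings, not Chroma's internal code, and the function names are made up for the example.

```python
import math


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """One minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [0.6, 0.8]  # unit-length vector
b = [1.0, 0.0]  # unit-length vector
print(squared_l2(a, b), inner_product_distance(a, b), cosine_distance(a, b))
```

For unit-length vectors the inner product and cosine distances coincide, and the squared L2 distance is exactly twice the cosine distance, so all three rank neighbors the same way. They differ only when the embeddings are not normalized.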


In [42]:
collection = client.get_or_create_collection(
    name="new_collection",
    embedding_function=emb_fn,
    metadata={"hnsw:space": "cosine"},
)

#### Add to collection as done previously

In [45]:
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

* We can view the persisted database in the *location* folder

* Copy it to Google Drive if you want to use it later on in Colab.
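
One way to do that copy from Python is sketched below. The helper name `copy_chroma_dir` is hypothetical, and the demo runs on temporary directories with a placeholder file standing in for the real database files. The commented Drive path assumes the folder layout that the later restore step copies back from.

```python
import shutil
import tempfile
from pathlib import Path


def copy_chroma_dir(source, destination):
    """Copy a persisted Chroma directory, overwriting files that already exist."""
    source = Path(source)
    if not source.is_dir():
        raise FileNotFoundError(f"No Chroma directory at {source}")
    shutil.copytree(source, destination, dirs_exist_ok=True)  # needs Python 3.8+
    return Path(destination)


# Demo on throwaway directories so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    fake_db = Path(tmp) / "location"
    fake_db.mkdir()
    (fake_db / "placeholder.bin").write_bytes(b"stand-in for the database files")
    backup = copy_chroma_dir(fake_db, Path(tmp) / "drive" / "location")
    copied_names = sorted(p.name for p in backup.iterdir())

print(copied_names)

# On Colab, after drive.mount('/content/drive'), the call would look like:
# copy_chroma_dir("location", "/content/drive/MyDrive/location")
```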

In [1]:
!pip install chromadb


Collecting chromadb
  Downloading chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.112.1-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.30.6-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.19.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Collecting opentelemetry-api>=1.2.0 (from chromadb)
  Downloading opentelemetry_api-1.26.0-py3-none-any.whl.metadata (1.4 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_pro

In [2]:
!pip install sentence_transformers


### Copy to local file system from Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')
!cp -r /content/drive/MyDrive/location /content/location

Mounted at /content/drive


#### Note we can load up and query the database later ...

In [4]:
import torch
from sentence_transformers import SentenceTransformer
from chromadb.utils import embedding_functions
device = 'cuda' if torch.cuda.is_available() else 'cpu'
emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2",device=device)


In [6]:
import chromadb

client = chromadb.PersistentClient(path="location")
collection = client.get_or_create_collection(name="new_collection",embedding_function=emb_fn,metadata={"hnsw:space":"cosine"})

In [7]:
results = collection.query(
    query_texts=["What is ON saying about EVs?"], # Chroma will embed this for you
    n_results=2, # how many results to return
    where_document={"$contains": "ON"}
)
print(results)


{'ids': [['id_14', 'id_15']], 'distances': [[1.398598028361473, 1.4031562993243114]], 'metadatas': [[{'date': '2021-05-03 16:22:18', 'quarter': 1, 'symbol': 'ON', 'year': 2021}, {'date': '2020-05-11 22:12:05', 'quarter': 1, 'symbol': 'ON', 'year': 2020}]], 'embeddings': None, 'documents': [['Operator: Ladies and gentlemen, thank you for standing by and welcome to the ON Semiconductor First Quarter 2021 Earnings Conference Call.  I would now like to turn the call over to Parag Agarwal, Vice President of Investor Relations and Corporate Development. Thank you. Please go ahead.\nParag Agarwal: Thank you, Denise. Good morning and thank you for joining ON Semiconductor Corporation’s first quarter 2021 quarterly results conference call. I am joined today by Hassane El-Khoury, our President and CEO and Thad Trent, our CFO. This call is being webcast on the Investor Relations section of our website at www.onsemi.com. A replay of this webcast, along with our 2021 first quarter earnings release 