# Embedding Gemma
- https://huggingface.co/blog/embeddinggemma#usage
- You'll need to accept the model license agreement on Hugging Face to access the model, and then log in with an API token in the notebook here.
- TBD: how this login functionality will be included in `got3`.


In [6]:
# authenticate with Hugging Face
from huggingface_hub import notebook_login
notebook_login()

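For the open question above about handling login inside `got3`, one possible approach is a non-interactive login that reads the token from an environment variable. This is only a sketch: the helper names `resolve_hf_token` and `login_from_env` are hypothetical, and using `HF_TOKEN` as the variable name is an assumption about how the token would be supplied. The only library call used is `huggingface_hub.login(token=...)`.

```python
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from the environment, or None if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()  # assumed variable name
    return token or None


def login_from_env(env=None):
    """Log in without a widget when a token is available; return True on login."""
    token = resolve_hf_token(env)
    if token is None:
        return False
    # Imported lazily so resolve_hf_token works without huggingface_hub installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

In a notebook, `notebook_login()` remains the simpler choice; a helper like this would mainly matter for scripts or batch jobs.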

## Getting Started

- The first run downloads the model files and displays progress output.

In [1]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

torch.Size([4, 4])


In [2]:
similarities

tensor([[1.0000, 0.8078, 0.9839, 0.5595],
        [0.8078, 1.0000, 0.7960, 0.5486],
        [0.9839, 0.7960, 1.0000, 0.5417],
        [0.5595, 0.5486, 0.5417, 1.0000]])
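To read the matrix above, note that the diagonal is 1.0 because each sentence is compared with itself, so the interesting values are off the diagonal. The sketch below copies the printed values into a plain list and finds the most similar pair of distinct sentences. The helper name `most_similar_pair` is hypothetical; with the live tensor you could pass `similarities.tolist()` instead of the hard-coded values.

```python
sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]

# Values copied from the output above
similarity_matrix = [
    [1.0000, 0.8078, 0.9839, 0.5595],
    [0.8078, 1.0000, 0.7960, 0.5486],
    [0.9839, 0.7960, 1.0000, 0.5417],
    [0.5595, 0.5486, 0.5417, 1.0000],
]


def most_similar_pair(matrix):
    """Return (i, j, score) for the highest-scoring pair with i < j."""
    best = None
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):  # skip the diagonal and mirrored half
            if best is None or matrix[i][j] > best[2]:
                best = (i, j, matrix[i][j])
    return best


i, j, score = most_similar_pair(similarity_matrix)
print(f"{sentences[i]!r} <-> {sentences[j]!r}: {score:.4f}")
```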

## Sample Query & Documents
- Using `sentence-transformers` and `torch` to find the document most relevant to a query

In [3]:
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")

# Run inference with queries and documents
query = "Which planet is the Red Planet?"
documents = [
    "Mercury is the closest planet to the Sun and has a very thin atmosphere.",
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Earth is the only planet known to support life.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn is famous for its rings, is sometimes mistaken for the Red Planet.",
    "Uranus often is depicted with a blue-green color due to methane in its atmosphere.",
    "Neptune, the enormous distant blue planet with strong winds, is known for its storms.",
    "Pluto, once considered the ninth planet, is now classified as a dwarf planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (9, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3376, 0.2881, 0.3313, 0.5894, 0.5091, 0.4406, 0.1767, 0.2805, 0.2955]])


(768,) (9, 768)
tensor([[0.3376, 0.2881, 0.3313, 0.5894, 0.5091, 0.4406, 0.1767, 0.2805, 0.2955]])


In [4]:
import torch

# Get the index of the most similar document
print('################# QUERY ###################')
print('query:', query)
print('###########################################')
print('')


best_idx = torch.argmax(similarities)
print("Best answer:", documents[best_idx])
worst_idx = torch.argmin(similarities)
print("Worst answer:", documents[worst_idx])
# Rank every document from most to least similar to the query
ranked_indices = torch.argsort(similarities, descending=True)[0]
print('')
# Print the documents as a numbered list (1 is most similar, 9 is least similar)
for i, idx in enumerate(ranked_indices):
    print(f"{i+1}. {documents[idx]}")

################# QUERY ###################
query: Which planet is the Red Planet?
###########################################

Best answer: Mars, known for its reddish appearance, is often referred to as the Red Planet.
Worst answer: Uranus often is depicted with a blue-green color due to methane in its atmosphere.

1. Mars, known for its reddish appearance, is often referred to as the Red Planet.
2. Jupiter, the largest planet in our solar system, has a prominent red spot.
3. Saturn is famous for its rings, is sometimes mistaken for the Red Planet.
4. Mercury is the closest planet to the Sun and has a very thin atmosphere.
5. Earth is the only planet known to support life.
6. Pluto, once considered the ninth planet, is now classified as a dwarf planet.
7. Venus is often called Earth's twin because of its similar size and proximity.
8. Neptune, the enormous distant blue planet with strong winds, is known for its storms.
9. Uranus often is depicted with a blue-green color due to methan

In [6]:
# Question Answering with EmbeddingGemma
qa_query = 'Which planet is known as the Red Planet?'
qa_prompt = f'task: question answering | query: {qa_query}'
qa_embedding = model.encode_query(qa_prompt)

documents = [
    'title: Mars | text: Mars, known for its reddish appearance, is often referred to as the Red Planet.',
    "title: Venus | text: Venus is often called Earth's twin because of its similar size and proximity.",
    'title: Jupiter | text: Jupiter, the largest planet in our solar system, has a prominent red spot.',
    'title: Saturn | text: Saturn, famous for its rings, is sometimes mistaken for the Red Planet.'
]
document_embeddings = model.encode_document(documents)
similarities = model.similarity(qa_embedding, document_embeddings)

import torch
best_idx = torch.argmax(similarities)
print('Best answer:', documents[best_idx])

Best answer: title: Mars | text: Mars, known for its reddish appearance, is often referred to as the Red Planet.


### Consider how to integrate this into the getout_of_text_3 framework

- You might use your corpora to get collocates and concordances around terms of interest from statutory language.
- Set up a document collection of the collocates and concordances as a vector space.
- You can then run retrieval (search) and question-answering tasks against this document collection.
- AI LLMs could then summarize the results.

- https://huggingface.co/google/embeddinggemma-300m#prompt-instructions
    - `task: search result | query: {content}`
    - `task: question answering | query: {content}`
    - Consider how part-of-speech tagging in the title could be important. It may also help to use a single type of classification, so that multiple words do not cause confusion.
    - Manually encode your collocates according to the various meanings of an ambiguous term,
    - then query with a question that includes the statutory language, for a context-aware approach.
    - The `getout_of_text_3` framework should streamline this workflow to make the process more accessible: take a term, get its collocates in a corpus, build the document embeddings, then query those embeddings with a question or search task. The result is not a declarative answer, just one way to surface the 'ordinary' meanings of a term in context.
    - Lastly, COCA is divided by genre, so you could use genre as a filter for the collocates and concordances and then run question-answering queries, depending on the model.
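The prompt conventions above can be captured in a couple of small helpers. This is a minimal sketch, not part of `got3`: the function names are hypothetical, and falling back to `none` for a missing title is an assumption based on the model card's document prompt. The cells in this notebook add the prompt text by hand; `encode_query` and `encode_document` may also apply default prompts of their own, so it is worth checking the model configuration before combining the two.

```python
def format_query(content, task="question answering"):
    """Build a query prompt such as 'task: search result | query: ...'."""
    return f"task: {task} | query: {content}"


def format_document(text, title=None):
    """Build a document prompt; a missing title falls back to 'none' (assumed)."""
    return f"title: {title or 'none'} | text: {text}"


def concordances_to_documents(concordances, sense_label=None):
    """Turn raw concordance lines into prompt-formatted documents.

    sense_label is an optional, manually assigned label for one meaning
    of the ambiguous term (for example 'River Bank').
    """
    return [format_document(line.strip(), title=sense_label) for line in concordances]


print(format_query("Which planet is known as the Red Planet?"))
print(concordances_to_documents(["  The river overflowed its bank. "], "River Bank"))
```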

### Another Example

- The term 'bank' is often used in embedding examples for BERT and similar models. This performs OK, but note that I made up the documents.

In [7]:
# Question Answering for ambiguous term 'bank' in different contexts
qa_query = 'What type of bank is Wells Fargo?'
qa_prompt = f'task: question answering | query: {qa_query}'
qa_embedding = model.encode_query(qa_prompt)

documents = [
    'title: Bank Deposit | text: I went to the bank to deposit my paycheck.',
    'title: River Bank | text: The children played on the grassy bank of the river.',
    'title: Making Bank | text: After selling his startup, he is really making bank now.',
    'title: Bank Loan | text: She applied for a loan at the local bank.',
    'title: River Flood | text: The river overflowed its bank after the heavy rain.',
    'title: Unit of Items | text: A bank of computers was set up in the server room.',
    'title: Reserve or Supply | text: The food bank provided meals for those in need',
]
document_embeddings = model.encode_document(documents)
similarities = model.similarity(qa_embedding, document_embeddings)
import torch
best_idx = torch.argmax(similarities)
print('Best answer:', documents[best_idx])
print('All answers ranked:')
for i, idx in enumerate(torch.argsort(similarities, descending=True)[0]):
    print(f'{i+1}. {documents[idx]}')

Best answer: title: Bank Deposit | text: I went to the bank to deposit my paycheck.
All answers ranked:
1. title: Bank Deposit | text: I went to the bank to deposit my paycheck.
2. title: Making Bank | text: After selling his startup, he is really making bank now.
3. title: Bank Loan | text: She applied for a loan at the local bank.
4. title: Unit of Items | text: A bank of computers was set up in the server room.
5. title: Reserve or Supply | text: The food bank provided meals for those in need
6. title: River Bank | text: The children played on the grassy bank of the river.
7. title: River Flood | text: The river overflowed its bank after the heavy rain.


### Outlining the workflow

1. Get your sample corpus for collocates and output the results as documents.
2. Load your model.
3. Encode the collocate documents with the model.
4. Ask your query about the statutory language containing the term `<TOKEN>`.

```python
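from sentence_transformers import SentenceTransformer

# Load the embedding model (requires accepting the license on Hugging Face)
model = SentenceTransformer("google/embeddinggemma-300m")

statutory_language = "The bank shall maintain sufficient reserves."

year_enacted = 2024  # for historical reference, if you have a historical corpus

ambiguous_term = "sufficient"  # the term that causes confusion or ambiguity

# Build a natural-language question around the term and its statutory context
query = (
    f'What is the ordinary meaning of the ambiguous term "{ambiguous_term}" '
    f'in the context of the following statutory language, "{statutory_language}", '
    f'enacted in the year {year_enacted}?'
)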

query_prompt = f"task: question answering | query: {query}"
query_embedding = model.encode_query(query_prompt)

documents = [
    "title: Adequate | text: Enough to meet a need or requirement; satisfactory.",
    "title: Ample | text: More than enough in size, scope, or capacity.",
    "title: Competent | text: Having the necessary ability, knowledge, or skill to do something successfully.",
    "title: Decent | text: Conforming to standards of propriety, good taste, or morality; acceptable.",
    "title: Fair | text: In accordance with the rules or standards; legitimate.",
    "title: Good | text: Having desirable or positive qualities; satisfactory in quality, quantity, or degree.",
    "title: Plenty | text: A large or sufficient amount or quantity; more than enough.",
    "title: Reasonable | text: Based on or using good judgment; fair and sensible.",
    "title: Satisfactory | text: Meeting the requirements or expectations; adequate.",
    "title: Suitable | text: Appropriate for a particular purpose, person, or occasion."
]
document_embeddings = model.encode_document(documents)
similarities = model.similarity(query_embedding, document_embeddings)
most_similar_idx = similarities.argmax()
most_relevant_document = documents[most_similar_idx]
print("Most relevant document:", most_relevant_document)

```
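The four workflow steps can also be wrapped in one reusable function that returns a ranked list rather than only the top hit. The sketch below is an assumption about how such a wrapper might look, not an existing `got3` function. The name `rank_documents` is hypothetical, and `model` is expected to be a loaded `SentenceTransformer` exposing `encode_query`, `encode_document`, and `similarity` as used elsewhere in this notebook.

```python
def rank_documents(model, query, documents, task="question answering", top_k=3):
    """Return the top_k documents as (index, score, document) tuples."""
    # Encode the query with the task prompt, following the cells above
    query_embedding = model.encode_query(f"task: {task} | query: {query}")
    document_embeddings = model.encode_document(documents)

    # similarity() returns one row of scores per query; take the only row
    similarities = model.similarity(query_embedding, document_embeddings)
    scores = [float(s) for s in similarities[0]]

    # Sort document indices by descending score and keep the best top_k
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i], documents[i]) for i in order[:top_k]]


# Example usage (needs the model loaded as in the block above):
# for idx, score, doc in rank_documents(model, query, documents):
#     print(f"{score:.4f}  {doc}")
```

Returning the index alongside the score keeps it possible to trace each hit back to its source text or genre.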

In [8]:
documents_sample=[
    "title: Text 18 | text: for education and the World **Bank** database for economic indicators .",
    "title: Text 18 | text: World Development Report ( World **Bank** , 1988 ) . <p>",
    "title: Text 18 | text: primary schools in the West **Bank** and Gaza , whether public",
    "title: Text 18 | text: primary schools in the West **Bank** were ordered to close 63",
    "title: Text 18 | text: much so that the World **Bank** Report on Education in Africa",
    "title: Text 20 | text: outward investment , the Export-Import **Bank** ( Eximbank ) , a",
    "title: Text 20 | text: ) , a specialized government **bank** established in 1979 , is",
    "title: Text 20 | text: supported by the Asian Development **Bank** . 23 In addition ,",
    "title: Text 20 | text: 1991 to the Central American **Bank** for Economic Integration to help",
    "title: Text 20 | text: As a result , the **bank** revised its charter to allow",
    "title: Text 20 | text: channeled aid through the European **Bank** for Reconstruction and Development by",
    "title: Text 20 | text: not qualified to join the **bank** because it is a non-European",
    "title: Text 20 | text: regional banks -- the European **Bank** for Reconstruction and Development ,",
    "title: Text 20 | text: Development , the Asian Development **Bank** , and the Central American",
    "title: Text 20 | text: , and the Central American **Bank** -- totaled $32.5 million ,",
    "title: Text 20 | text: president of Taiwan 's Central **Bank** , estimated that Taiwan 's",
    "title: Text 20 | text: industrialized states , the World **Bank** , and the International Monetary",
    "title: Text 20 | text: GATT , IMF , World **Bank** , or even the United",
    "title: Text 21 | text: . Last year the central **bank** once refused to hire a",
    "title: Text 21 | text: and been assigned to the **bank** . Another time it refused",
    "title: Text 21 | text: membership exam of the central **bank** . Recently , the dispute",
    "title: Text 29 | text: to obtain for the foreign **bank** agencies in Florida , and",
    "title: Text 29 | text: , even estimates of Colombian **bank** deposits in the state would",
    "title: Text 29 | text: Florida via calculations of Colombian **bank** deposits there or via efforts",
    "title: Text 32 | text: figures , ' including World **Bank** President James Wolfensohn ( al-Ahram",
    "title: Text 32 | text: gracious sanctuary on the west **bank** of the Nile , with",
    "title: Text 558 | text: checking account at the conspiracy **bank** . There 's only $1.58",
    "title: Text 558 | text: check ) to a local **bank** or @ @ @ ",
    "title: Text 558 | text: days , the bigger the **bank** , the more reason to",
    "title: Text 558 | text: I actually did close my **Bank** of America account long ago",
    "title: Text 558 | text: should point out that US **Bank** is one of the better",
    "title: Text 571 | text: Reuters about the current food **bank** situation in these United States",
    "title: Text 574 | text: ( across from the Berkshire **Bank** ) <p> In closing I",
    "title: Text 579 | text: which also draw directly on **bank** deposits , is zero .",
    "title: Text 581 | text: day that they charged my **bank** card , and provided them",
    "title: Text 609 | text: an unprecedented spending bill for **bank** bailouts , Detroit rescues ,",
    "title: Text 618 | text: there exists nothing fishy with **bank** accounts . <p> Do n't",
    "title: Text 629 | text: were given to the national **bank** by the minister of finance",
    "title: Text 629 | text: EBRD or the Africa Development **Bank** . ' <p> Philippe Doizelet",
    "title: Text 639 | text: carlzimmer.com <p> ' The Tangled **Bank** is the best written and",
    "title: Text 662 | text: dealer 's collection of old **bank** notes caught my eye",
    "title: Text 672 | text: Cayman Islands or a Swiss **bank** account . He 's one",
    "title: Text 697 | text: to Work <p> The World **Bank** has just released its 2013",
    "title: Text 702 | text: because someone raided the piggy **bank** to pay for tax cuts",
    "title: Text 722 | text: by Monsanto , the World **Bank** and USAID , and Burkina",
    "title: Text 752 | text: A source in the Deutsche **Bank** claims that in 2008 our",
    "title: Text 752 | text: law . <p> The Deutsche **Bank** informant says that the cause",
    "title: Text 759 | text: of Aswan on the east **bank** . <h> 20 . Bazaruto",
    "title: Text 771 | text: time , with the State **Bank** printing almost Rs. 1 trillion",
    "title: Text 782 | text: a new job as a **bank** teller . He had gotten"
]


statutory_language = "The bank shall maintain sufficient reserves."

year_enacted = 2024  # for historical reference, if you have a historical corpus

ambiguous_term = "sufficient"  # the term that causes confusion or ambiguity

query = 'What is the ordinary meaning of the ambiguous term "{}" in the context of the following statutory language, "{}", enacted in the year {}?'.format(ambiguous_term,statutory_language,year_enacted)

query_prompt = f"task: question answering | query: {query}"
query_embedding = model.encode_query(query_prompt)

#print('documents:', documents_sample)

document_embeddings = model.encode_document(documents_sample)

similarities = model.similarity(query_embedding, document_embeddings)
most_similar_idx = similarities.argmax()
most_relevant_document = documents_sample[most_similar_idx]
print('################# RESULTS ###################')
print('')
print('statutory_language:', statutory_language)
print('ambiguous_term:', ambiguous_term)
print('year_enacted:', year_enacted)
print('')
print('query:', query)
print('')
print("Most relevant document:")
print(most_relevant_document)


################# RESULTS ###################

statutory_language: The bank shall maintain sufficient reserves.
ambiguous_term: sufficient
year_enacted: 2024

query: What is the ordinary meaning of the ambiguous term "sufficient" in the context of the following statutory language, "The bank shall maintain sufficient reserves.", enacted in the year 2024?

Most relevant document:
title: Text 29 | text: , even estimates of Colombian **bank** deposits in the state would


______________
## ⭐️ Using `got3` for EmbeddingGemma on keywords from the COCA sample.

- I still need to figure out how embedding fits most meaningfully into this process! I suspect I should be coding content in the title or something similar.
- Reviewing the classic example of 'bank' along with 'modify'.

In [1]:
import getout_of_text_3 as got3
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

In [None]:
# 1. Load the COCA sample corpus
corpus_data = got3.read_corpora("../coca-samples-text/", "my_legal_corpus")

# 2. Search for legal terms with context
results = got3.search_keyword_corpus(
    keyword="modify",
    db_dict=corpus_data,
    case_sensitive=False,
    show_context=True,
    context_words=10,
    output="json"
)

📚 Loading my_legal_corpus corpus from ../coca-samples-text/
📂 Processing acad...
  ✅ text_acad.txt: (265, 1)
📂 Processing blog...
  ✅ text_blog.txt: (991, 1)
📂 Processing fic...
  ✅ text_fic.txt: (273, 1)
📂 Processing mag...
  ✅ text_mag.txt: (948, 1)
📂 Processing news...
  ✅ text_news.txt: (871, 1)
📂 Processing spok...
  ✅ text_spok.txt: (263, 1)
📂 Processing tvm...
  ✅ text_tvm.txt: (233, 1)
📂 Processing web...
  ✅ text_web.txt: (892, 1)

🎯 SUMMARY:
   - my_legal_corpus: 8 genres loaded
   - Total corpora in collection: 1


In [3]:
# how many hits across the genre keys
total_hits = sum(len(v) for v in results.values())
print(f"Total hits for 'modify': {total_hits}")
# per key genre
print('--------------------------------')
for genre, hits in results.items():
    print(f"{genre}: {len(hits)} hits")

Total hits for 'modify': 70
--------------------------------
acad: 10 hits
blog: 12 hits
mag: 10 hits
news: 19 hits
spok: 3 hits
tvm: 4 hits
web: 12 hits


In [4]:
# take got3 json output and format as documents for embeddinggemma
document_sample_keywords=[]
for key in results:
    genre_hits = list(results[key].values())
    for snippet in genre_hits:
        document_sample_keywords.append(snippet)
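Flattening the results this way drops the genre of each snippet, which makes it harder to use genre as a filter later. A small variation keeps a parallel list of genres so that a ranked index can be mapped back to its source. This sketch assumes the JSON output has the shape implied by the cell above (a dict of genre to a dict of id to snippet string); the helper name `flatten_hits` and the sample data are hypothetical.

```python
def flatten_hits(results):
    """Flatten {genre: {hit_id: snippet}} into parallel lists of snippets and genres."""
    documents, genres = [], []
    for genre, hits in results.items():
        for snippet in hits.values():
            documents.append(snippet)
            genres.append(genre)
    return documents, genres


# Hypothetical miniature result set, for illustration only
sample_results = {
    "acad": {"0": "researchers **modify** the protocol"},
    "news": {"0": "lawmakers may **modify** the bill", "1": "plans to **modify** the rule"},
}

documents, genres = flatten_hits(sample_results)
for genre, doc in zip(genres, documents):
    print(f"[{genre}] {doc}")
```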

In [None]:
statutory_language = "The agency may modify the requirements as necessary to ensure compliance."
ambiguous_term = "modify"
year_enacted = 2001  # for historical reference, if you have a historical corpus
#ambiguous_term = "sufficient" # the term which leads towards confusion or ambiguity
query="What does {} mean in the following statement '{}'?".format(ambiguous_term,statutory_language)
#query = 'What is the ordinary meaning of the ambiguous term "{}" in the context of the following statutory language, "{}", enacted in the year {}?'.format(ambiguous_term,statutory_language,year_enacted)


query_prompt = f"task: search result | query: {query}"
query_embedding = model.encode_query(query_prompt)
#print('documents:', document_sample_keywords)
document_embeddings = model.encode_document(document_sample_keywords)
similarities = model.similarity(query_embedding, document_embeddings)
most_similar_idx = similarities.argmax()
most_relevant_document = document_sample_keywords[most_similar_idx]

print('################# RESULTS ###################')
print('')
print('statutory_language:', statutory_language)
print('ambiguous_term:', ambiguous_term)
print('year_enacted:', year_enacted)
print('query_task:', query_prompt)
print('')
print("Most relevant document {}:".format(most_similar_idx))
print(most_relevant_document)
print('')
print('All answers ranked:')

# similarities is a 2D tensor with shape [1, N], so we flatten it
sim_values = similarities.flatten()
for rank, idx in enumerate(sim_values.argsort(descending=True)):
    print(f"{rank+1}. Document {idx.item()}: {sim_values[idx].item():.4f}")
    #print(document_sample_keywords[idx.item()])

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


################# RESULTS ###################

statutory_language: The agency may modify the requirements as necessary to ensure compliance.
ambiguous_term: modify
year_enacted: 2001
query_task: task: search result | query: What does modify mean in the following statement 'The agency may modify the requirements as necessary to ensure compliance.'?

Most relevant document 64:
standards : <p> Use existing Multi-Modal Level-of-Service indicators , and **modify** them to reflect the needs of a particular situation .

All answers ranked:
1. Document 64: 0.4391
2. Document 14: 0.4290
3. Document 63: 0.3993
4. Document 32: 0.3931
5. Document 51: 0.3845
6. Document 0: 0.3808
7. Document 33: 0.3708
8. Document 15: 0.3556
9. Document 21: 0.3455
10. Document 12: 0.3437
11. Document 41: 0.3424
12. Document 55: 0.3407
13. Document 16: 0.3378
14. Document 28: 0.3358
15. Document 20: 0.3344
16. Document 19: 0.3330
17. Document 34: 0.3261
18. Document 13: 0.3251
19. Document 65: 0.3240
20. Document 42

### Using got3.embedding.gemma with search results

Now we can use the new `got3.embedding.gemma.task()` function, which accepts the JSON search results directly.

In [8]:
# Use the new got3.embedding.gemma function with search results
statutory_language = "Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products."
ambiguous_term="bovine"
year_enacted = 2001  # for historical reference, if you have a historical corpus

keyword_results = got3.search_keyword_corpus(
    keyword=ambiguous_term,
    db_dict=corpus_data,
    case_sensitive=False,
    show_context=True,
    context_words=10,
    output="json"
)


result = got3.embedding.gemma.task(
    statutory_language=statutory_language,
    ambiguous_term=ambiguous_term,
    year_enacted=year_enacted,
    search_results=keyword_results, # Pass the JSON results from search_keyword_corpus
    model="google/embeddinggemma-300m"
)
print('')
print("🎯 Top 3 most relevant contexts:")
for i, item in enumerate(result['all_ranked'][:3]):
    print(f"{i+1}. Genre: {item['genre']}, Score: {item['score']:.4f}")
    print(f"   Context: {item['context'][:100]}...")
    print()

📚 Using pre-computed search results for 'bovine'
📚 Found 5 context examples across 4 genres
🤖 Loading model: google/embeddinggemma-300m

🎯 RESULTS:

Most relevant context from blog (score: 0.2334)
Context: his mouth , I commented loudly on the quantity of **bovine** excrement spewing from it . " That 's fucking bullshit

🎯 Top 3 most relevant contexts:
1. Genre: blog, Score: 0.2334
   Context: his mouth , I commented loudly on the quantity of **bovine** excrement spewing from it . " That 's f...

2. Genre: blog, Score: 0.2244
   Context: " Brown told TheDC . <p> This is not merely **bovine** excrement , it 's delusional . Once you under...

3. Genre: fic, Score: 0.2222
   Context: his intelligence , and merely having to look at that **bovine** lump of clichs brought his blood to ...
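Since each ranked item carries a `genre` and a `score`, the ranking can be summarized per genre as a first step toward genre-based filtering. The sketch below assumes only the `genre` and `score` keys seen in the loop above; the helper name `mean_score_by_genre` is hypothetical, and the sample list reuses the three scores printed in the output.

```python
from collections import defaultdict


def mean_score_by_genre(all_ranked):
    """Average the similarity scores of ranked items, grouped by genre."""
    by_genre = defaultdict(list)
    for item in all_ranked:
        by_genre[item["genre"]].append(item["score"])
    return {genre: sum(scores) / len(scores) for genre, scores in by_genre.items()}


# Genres and scores copied from the top 3 contexts printed above
top_three = [
    {"genre": "blog", "score": 0.2334},
    {"genre": "blog", "score": 0.2244},
    {"genre": "fic", "score": 0.2222},
]

for genre, mean_score in mean_score_by_genre(top_three).items():
    print(f"{genre}: {mean_score:.4f}")
```

With so few hits the averages mean little on their own; the point is only to show how the result structure could feed a per-genre comparison.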



_____________
### Analysis of results

#### Most Similar

- In this case it refers more directly to an actual cow (its excrement), although the phrase as a whole is a figurative insult.
- `...his mouth , I commented loudly on the quantity of **bovine** excrement spewing from it . " That 's f...`

#### Least Similar

- Of the three contexts shown, the lowest-scoring one uses bovine as an adjective meaning dull or boring, not so much literally as a cow!
- It is a quote from the 1997 film L.A. Confidential: `...and merely having to look at that **bovine** lump of clichs brought his blood to...`
  - ![L.A. Confidential (1997) poster](https://a.ltrbxd.com/resized/sm/upload/3n/0w/ax/pt/rIXzJCAvyd3Ci8ipylDQ5wUKqwh-0-230-0-345-crop.jpg?v=40685f4e4e)