#This notebook uses Gemini 1.5 Flash and covers generating embeddings for different types of data through the Gemini API

###Embedding: a list of floating-point numbers that represents the meaning of a word/sentence/paragraph. The number of values in the list is the *dimensionality* of the embedding.

###Embeddings can be used in document search, anomaly detection, text classification, and many other tasks!

### **Embeddings can greatly enhance text processing and retrieval capabilities**

##Setup

In [2]:
import google.generativeai as genai

###To run the next cell, store your API key in a Colab secret named GOOGLE_API_KEY
###To get a key, go to ai.google.dev

In [3]:
from google.colab import userdata

GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

## Generating embeddings

To generate embeddings for a piece of data, call the `embed_content` method, passing the `model` to use and the `content` to convert. It accepts the following parameters:

- `model`: Required. Must be either `models/text-embedding-004` or `models/embedding-001`.

- `content`: Required. The data to embed.

- `output_dimensionality`: Optional. Reduced dimension for the output embedding. If set, excess values in the output embedding are truncated from the end. This is supported by `models/text-embedding-004`, but cannot be specified for `models/embedding-001`.

- `task_type`: Optional. The task type for which the embeddings will be used.

- `title`: Optional. You should only set this parameter if your task type is `retrieval_document` (or `document`).
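
The constraints in the parameter list above can be checked before any request is sent. The sketch below is only an illustration: `build_embed_kwargs` is a hypothetical helper (not part of the `google.generativeai` library) that collects the arguments into a dictionary and rejects the two combinations the list rules out.

```python
def build_embed_kwargs(content, model='models/text-embedding-004',
                       task_type=None, title=None,
                       output_dimensionality=None):
    """Hypothetical helper: collect keyword arguments for genai.embed_content."""
    # `title` should only be set for document retrieval (see the list above).
    if title is not None and task_type not in ('retrieval_document', 'document'):
        raise ValueError("`title` requires task_type='retrieval_document'")
    # `output_dimensionality` cannot be specified for models/embedding-001.
    if output_dimensionality is not None and model == 'models/embedding-001':
        raise ValueError('`output_dimensionality` is not supported by this model')

    kwargs = {'model': model, 'content': content}
    # Only pass the optional parameters that were actually set.
    if task_type is not None:
        kwargs['task_type'] = task_type
    if title is not None:
        kwargs['title'] = title
    if output_dimensionality is not None:
        kwargs['output_dimensionality'] = output_dimensionality
    return kwargs


kwargs = build_embed_kwargs(
    'Embeddings map text to vectors.',
    task_type='retrieval_document',
    title='Embeddings overview',
)
print(kwargs)
# With the API key configured as in the Setup section, the call would be:
# embed_data = genai.embed_content(**kwargs)
```

Keeping the validation separate from the API call makes it easy to test without a key.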

In [4]:
data = 'Hi there, this is Gemini tutorial!'

embed_data = genai.embed_content(
    model='models/text-embedding-004',
    content=data
)

In [5]:
print(len(embed_data['embedding']))

768


###Because the embedding has a high dimensionality (an array of length 768), we print only the first 10 values of the embedding for the `data` variable

In [6]:
print(embed_data['embedding'][:10])

[0.058438968, 0.00044316906, -0.009973634, -0.016529616, 0.041487474, -0.013944049, 0.080064304, 0.055516902, -0.03406745, 0.046189807]


###Embedding a batch of data (list of strings)

####Passing a list lets a single API call embed several strings, instead of making one call per string



In [8]:
data = [
    'What is the tuition at UC Berkeley?',
    'Name top 10 books about machine learning.',
    'How to stop procrastinating?'
]

embed_data = genai.embed_content(
    model='models/text-embedding-004',
    content=data
)

for embedding in embed_data['embedding']:
    print(embedding[:10])

[0.05558104, 0.066228844, -0.0832767, -0.0021693057, 0.027031252, 0.0060264543, 0.027830552, -0.017377941, -0.05279287, 0.035122644]
[0.017529475, -0.026426788, 0.0011475749, -0.022670645, -0.03672292, 0.009089399, -0.021839937, 0.022415882, -0.021802153, -0.027643519]
[0.003983932, -0.026754132, 0.061292212, -0.013258428, -0.02447549, 0.04371135, -0.00851315, 0.012199542, 0.02470629, -0.012365352]
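
If the list of strings is long, it may be useful to split it into smaller batches and embed one batch per call. The sketch below shows only the splitting step. `chunked` is a hypothetical helper, and the batch size of 3 is an arbitrary value chosen for illustration, not a limit documented in this notebook.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


questions = [f'Question {i}' for i in range(7)]
batches = list(chunked(questions, batch_size=3))  # batch size is an assumption

for batch in batches:
    print(len(batch), batch)
    # Each batch could then be passed as `content`, for example:
    # genai.embed_content(model='models/text-embedding-004', content=batch)
```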


##Truncating embeddings
###To reduce the dimensionality, we can specify the `output_dimensionality` argument

In [9]:
embed_data = genai.embed_content(
    model='models/text-embedding-004',
    content='Below is how you truncate data!',
    output_dimensionality=10
)

print(len(embed_data['embedding']))

10
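
Truncation "from the end" means keeping the leading values and dropping the rest. The sketch below illustrates that idea on a toy vector, without calling the API. `truncate_embedding` is a hypothetical helper, and the optional re-normalization to unit length is an added assumption; this notebook does not state whether the API rescales truncated embeddings.

```python
import math


def truncate_embedding(values, dim, renormalize=False):
    """Keep the first `dim` values; optionally rescale to unit length."""
    if not 1 <= dim <= len(values):
        raise ValueError('dim must be between 1 and len(values)')
    truncated = list(values[:dim])  # drop values from the end
    if renormalize:
        norm = math.sqrt(sum(v * v for v in truncated))
        if norm > 0:
            truncated = [v / norm for v in truncated]
    return truncated


full = [3.0, 4.0, 1.0, 2.0]  # toy vector, not a real embedding
short = truncate_embedding(full, 2)
unit = truncate_embedding(full, 2, renormalize=True)
print(short)
print(unit)
```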


##Specifying embeddings

Since embeddings are used for many different tasks, you can tailor them by specifying which of the following `task_type` values you are using the embeddings for:

- `unspecified`: If you do not set the value, it will default to `retrieval_query`.

- `retrieval_query` (or `query`): The given text is a query in a search/retrieval setting.

- `retrieval_document` (or `document`): The given text is a document from a corpus being searched. Optionally, also set the `title` parameter with the title of the document.

- `semantic_similarity` (or `similarity`): The given text will be used for Semantic Textual Similarity (STS).

- `classification`: The given text will be classified.

- `clustering`: The embeddings will be used for clustering.

- `question_answering`: The given text will be used for question answering.

- `fact_verification`: The given text will be used for fact verification.
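
The short aliases in the list above can be mapped to their full names before use. The sketch below is a minimal illustration built only from the names in that list. `normalize_task_type` and the two lookup tables are hypothetical helpers, not part of the `google.generativeai` library.

```python
# Short aliases and their full names, taken from the list above.
TASK_TYPE_ALIASES = {
    'query': 'retrieval_query',
    'document': 'retrieval_document',
    'similarity': 'semantic_similarity',
}

KNOWN_TASK_TYPES = {
    'unspecified',
    'retrieval_query',
    'retrieval_document',
    'semantic_similarity',
    'classification',
    'clustering',
    'question_answering',
    'fact_verification',
}


def normalize_task_type(name):
    """Return the full task type name, resolving short aliases."""
    key = name.strip().lower()
    key = TASK_TYPE_ALIASES.get(key, key)
    if key not in KNOWN_TASK_TYPES:
        raise ValueError(f'unknown task type: {name!r}')
    return key


print(normalize_task_type('document'))
print(normalize_task_type('clustering'))
```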

###Depending on the `task_type`, the embeddings differ

In [10]:
data = 'Finally touching the grass!'

embed_data_1 = genai.embed_content(
    model='models/text-embedding-004',
    content=data
    # no `task_type` specification; defaults to `retrieval_query`
)

embed_data_2 = genai.embed_content(
    model='models/text-embedding-004',
    content=data,
    task_type='retrieval_document'
)

print(embed_data_1['embedding'][:10])  # print the first 10 values
print(embed_data_2['embedding'][:10])

[-0.044086624, -0.017505981, -0.014579227, 0.013675352, -0.008016501, 0.025225215, -0.035069168, 0.034015447, 0.037304174, 0.007439849]
[-0.023972979, -0.019189887, -0.0222573, -0.012185102, 0.019129641, 0.035173632, -0.012367243, 0.016283926, 0.022629406, 0.014976865]
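
One way to quantify how much two embeddings differ is cosine similarity. The sketch below applies it to the two 10-value prefixes printed above. Because it uses only the first 10 of 768 values, the result is illustrative and says little about the similarity of the full embeddings. `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('vectors must be non-zero')
    return dot / (norm_a * norm_b)


# First 10 values of each embedding, copied from the output above.
default_prefix = [-0.044086624, -0.017505981, -0.014579227, 0.013675352,
                  -0.008016501, 0.025225215, -0.035069168, 0.034015447,
                  0.037304174, 0.007439849]
document_prefix = [-0.023972979, -0.019189887, -0.0222573, -0.012185102,
                   0.019129641, 0.035173632, -0.012367243, 0.016283926,
                   0.022629406, 0.014976865]

similarity = cosine_similarity(default_prefix, document_prefix)
print(round(similarity, 3))
```

A value of 1 would mean the two prefixes point in exactly the same direction; a lower value reflects the difference introduced by the `task_type`.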
