# Gemini API: Embeddings
## Importing libraries

In [2]:
!pip install -U -q google-generativeai

In [3]:
import google.generativeai as genai

## Configuring the API

In [4]:
from google.colab import userdata

# Read the API key from Colab's secret store, saved under the name GOOGLE_API_KEY.
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

An **embedding** is a list of floating-point numbers that represents the meaning of a word, sentence, or paragraph. The length of that list is the embedding's dimensionality.

Embeddings can be used in document search, anomaly detection, text classification, and many other tasks!
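Most of these tasks come down to comparing embeddings with each other. A common measure is cosine similarity. The sketch below is a minimal illustration on made-up 3-dimensional vectors, not real model output; the helper name `cosine_similarity` and the sample values are assumptions introduced here.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (values are made up).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

A score close to 1 means the two vectors point in nearly the same direction; a score close to 0 means they are unrelated.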

## Generating embeddings
To generate embeddings for your data, call the `embed_content` method, passing the `model` to use and the `content` to convert:

- `model`: Required. Must be either `models/text-embedding-004` or `models/embedding-001`.

- `content`: Required. The data to embed.

- `output_dimensionality`: Optional. Reduced dimension for the output embedding. If set, excess values in the output embedding are truncated from the end. This is supported by `models/text-embedding-004`, but cannot be specified for `models/embedding-001`.

- `task_type`: Optional. The task type for which the embeddings will be used.

- `title`: Optional. You should only set this parameter if your task type is `retrieval_document` (or `document`).

(The last three will be explored later.)

Here, we will use `models/text-embedding-004` to generate text embeddings:

In [5]:
data = 'Hi there, this is Gemini tutorial!'

embed_data = genai.embed_content(
    model='models/text-embedding-004',
    content=data
)

Since embeddings can have a large dimensionality (i.e. length), we will print only the first 10 values:

In [7]:
print(embed_data['embedding'][:10])


[0.05843896, 0.0004431678, -0.009973633, -0.016529618, 0.041487474, -0.013944048, 0.08006431, 0.055516902, -0.03406745, 0.046189807]


By default, the embeddings have a dimensionality of 768:

In [6]:
print(len(embed_data['embedding']))

768


Instead of a single string value, we can also pass in a batch of data (i.e. a list of strings):

In [8]:
data = [
    'What is the tuition at UC Berkeley?',
    'Name top 10 books about machine learning.',
    'How to stop procrastinating?'
]

embed_data = genai.embed_content(
    model='models/text-embedding-004',
    content=data
)

In [9]:
for embedding in embed_data['embedding']:
    print(embedding[:10])

[0.055581044, 0.06622884, -0.083276704, -0.0021693043, 0.02703125, 0.0060264487, 0.02783055, -0.017377941, -0.052792866, 0.035122644]
[0.017529475, -0.026426781, 0.0011475772, -0.022670647, -0.03672292, 0.009089401, -0.02183994, 0.022415891, -0.021802155, -0.02764353]
[0.0039839307, -0.02675414, 0.06129221, -0.013258428, -0.024475494, 0.04371136, -0.008513142, 0.012199538, 0.024706287, -0.0123653505]


Batching lets us embed several inputs with a single API call instead of making one call per input.
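For longer lists, the inputs can be split into fixed-size chunks so that each request stays small. The helper below is a sketch under stated assumptions: `embed_in_batches`, the `embed_fn` parameter, and the default `batch_size` of 100 are all hypothetical choices made here, and the default is a placeholder rather than a documented limit. A stand-in embedder is used so the example runs offline.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` in chunks of `batch_size`, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # `embed_fn` must return one vector per input string, in order.
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder (not the real API): one fake 1-d vector per string.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(['a', 'bb', 'ccc'], fake_embed, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the real API, `embed_fn` could be a small wrapper such as `lambda batch: genai.embed_content(model='models/text-embedding-004', content=batch)['embedding']`.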

## Truncating embeddings
To reduce the dimensionality of the embeddings, pass the desired length in the `output_dimensionality` argument:

In [10]:
embed_data = genai.embed_content(
    model='models/text-embedding-004',
    content='Below is how you truncate data!',
    output_dimensionality=10
)

Now, the dimensionality of the output is limited to 10:

In [11]:
print(len(embed_data['embedding']))

10
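Dropping values from the end of a vector generally changes its length (norm). This tutorial does not say whether the API rescales truncated embeddings, so the sketch below makes no claim about that; it only shows how a vector could be truncated and rescaled to unit length locally, which is useful when a plain dot product is meant to act as cosine similarity. The function name `truncate_and_normalize` and the sample vector are illustrative assumptions.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values of `vector` and rescale them to unit length."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be rescaled
    return [x / norm for x in head]

# Made-up 3-d vector; keeping the first two values leaves a vector of norm 5.
full = [3.0, 4.0, 12.0]
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```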


## Specifying embeddings
Embeddings are used for many different tasks, so you can tailor them by specifying which of the following `task_type` values applies to your use case:

- `unspecified`: If you do not set the value, it will default to `retrieval_query`.

- `retrieval_query` (or `query`): The given text is a query in a search/retrieval setting.

- `retrieval_document` (or `document`): The given text is a document from a corpus being searched. Optionally, also set the `title` parameter with the title of the document.

- `semantic_similarity` (or `similarity`): The given text will be used for Semantic Textual Similarity (STS).

- `classification`: The given text will be classified.

- `clustering`: The embeddings will be used for clustering.

- `question_answering`: The given text will be used for question answering.

- `fact_verification`: The given text will be used for fact verification.

Depending on the `task_type`, embeddings differ:

In [12]:
data = 'Finally touching the grass!'

embed_data_1 = genai.embed_content(
    model='models/text-embedding-004',
    content=data
    # no `task_type` specification; defaults to `retrieval_query`
)

embed_data_2 = genai.embed_content(
    model='models/text-embedding-004',
    content=data,
    task_type='retrieval_document'
)

print(embed_data_1['embedding'][:10])
print(embed_data_2['embedding'][:10])

[-0.044086628, -0.017505977, -0.014579225, 0.013675356, -0.008016502, 0.025225218, -0.035069168, 0.034015443, 0.03730417, 0.007439851]
[-0.023972979, -0.019189889, -0.022257302, -0.012185106, 0.019129643, 0.035173625, -0.012367235, 0.016283924, 0.02262941, 0.014976868]
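One place where the task types pair up naturally is search: documents are embedded with `retrieval_document`, the query with `retrieval_query`, and the documents are ranked by their similarity to the query. The following is a rough sketch of that flow, not a definitive implementation. `rank_documents` and the `embed_fn(content, task_type)` adapter are hypothetical names introduced here, and a keyword-counting stand-in replaces the real model so the example runs offline.

```python
def rank_documents(query, documents, embed_fn):
    """Return (score, text) pairs sorted by descending dot-product score.

    `embed_fn(content, task_type)` is a hypothetical adapter that returns
    one embedding (a list of floats) for a single string.
    """
    query_vec = embed_fn(query, 'retrieval_query')
    scored = []
    for text in documents:
        doc_vec = embed_fn(text, 'retrieval_document')
        score = sum(q * d for q, d in zip(query_vec, doc_vec))
        scored.append((score, text))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Offline stand-in (not the real API): counts two keywords, ignores task_type.
def fake_embed(content, task_type):
    words = content.lower().split()
    return [float(words.count('grass')), float(words.count('tuition'))]

docs = ['the grass is green', 'tuition fees rose again']
ranked = rank_documents('touching grass', docs, fake_embed)
print(ranked[0][1])  # the grass is green
```

To try this against the real model, the adapter could be `lambda content, task_type: genai.embed_content(model='models/text-embedding-004', content=content, task_type=task_type)['embedding']`.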
