<a href="https://colab.research.google.com/github/alex-smith-uwec/NLP_Spring2025/blob/main/document_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Copyright 2024 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Document search with embeddings

<table class="tfo-notebook-buttons" align="left">
      <td>
    <a target="_blank" href="https://ai.google.dev/gemini-api/tutorials/document_search"><img src="https://ai.google.dev/static/site-assets/images/docs/notebook-site-button.png" height="32" width="32" />View on ai.google.dev</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemini-api/tutorials/document_search.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/google/generative-ai-docs/blob/main/site/en/gemini-api/tutorials/document_search.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview

This example demonstrates how to use the Gemini API to create embeddings so that you can perform document search. You will use the Python client library to generate text embeddings that let you compare search strings, or questions, to document contents.

In this tutorial, you'll use embeddings to perform document search over a set of documents to ask questions related to the Google Car.

## Prerequisites

You can run this tutorial in Google Colab.

To complete this tutorial in your own development environment, ensure that your environment meets the following requirements:

-  Python 3.9+
-  An installation of `jupyter` to run the notebook.

## Setup

First, download and install the Gemini API Python library.

In [1]:
!pip install -U -q google-generativeai

In [2]:
import textwrap
import numpy as np
import pandas as pd

import google.generativeai as genai

# Used to securely store your API key
from google.colab import userdata

from IPython.display import Markdown

### Grab an API Key

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

<a class="button button-primary" href="https://makersuite.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `API_KEY`.

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GOOGLE_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`

In [4]:
# Or use `os.getenv('API_KEY')` to fetch an environment variable.
API_KEY=userdata.get('API_KEY')

genai.configure(api_key=API_KEY)
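
Outside Colab, `google.colab.userdata` is not available, so the environment-variable route is the practical one. The following is a minimal sketch of that route, assuming the key is stored under either `GOOGLE_API_KEY` or `API_KEY`. The helper `load_api_key` is hypothetical and not part of the SDK.

```python
import os

def load_api_key(env_names=("GOOGLE_API_KEY", "API_KEY")):
    """Return the first non-empty key found in the given environment variables.

    `load_api_key` and the variable order are assumptions for illustration.
    """
    for name in env_names:
        value = os.environ.get(name)
        if value:
            return value
    raise RuntimeError(f"No API key found in any of: {', '.join(env_names)}")
```

The returned string can then be passed to `genai.configure(api_key=...)` as in the cell above.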

Key Point: Next, you will choose a model. Any embedding model will work for this tutorial, but for real applications it's important to choose a specific model and stick with it. The outputs of different models are not compatible with each other.

**Note**: At this time, the Gemini API is [only available in certain regions](https://ai.google.dev/gemini-api/docs/available-regions).

In [5]:
for m in genai.list_models():
  if 'embedContent' in m.supported_generation_methods:
    print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


## Embedding generation

In this section, you will see how to generate embeddings for a piece of text using the embeddings from the Gemini API.

### API changes to Embeddings with model embedding-001

For the new embeddings model, `embedding-001`, there is a new `task_type` parameter and an optional `title` parameter (only valid with `task_type=RETRIEVAL_DOCUMENT`).

These new parameters apply only to the newest embeddings models. The task types are:

Task Type | Description
---       | ---
RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.
SEMANTIC_SIMILARITY | Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION | Specifies that the embeddings will be used for classification.
CLUSTERING | Specifies that the embeddings will be used for clustering.

Note: Specifying a `title` for `RETRIEVAL_DOCUMENT` provides better quality embeddings for retrieval.

In [6]:
title = "The next generation of AI for developers and Google Workspace"
sample_text = ("Title: The next generation of AI for developers and Google Workspace"
    "\n"
    "Full article:\n"
    "\n"
    "Gemini API & Google AI Studio: An approachable way to explore and prototype with generative AI applications")

model = 'models/embedding-001'
embedding = genai.embed_content(model=model,
                                content=sample_text,
                                task_type="retrieval_document",
                                title=title)

print(embedding)

{'embedding': [0.034113422, -0.055176623, -0.020209068, -0.0041249595, 0.058917794, 0.01412951, 0.004535358, 0.0014303708, 0.059766337, 0.08292115, 0.007162965, 0.0069041667, -0.05308343, -0.01090513, 0.03214021, -0.037164, 0.050372463, 0.019348344, -0.037328612, 0.026647933, 0.03078176, -0.011288503, -0.031485256, -0.060248997, -0.026219437, -0.009794238, 0.0066301385, -0.018465156, -0.026324723, 0.020442627, -0.06317685, 0.014559578, -0.052296046, 0.016451124, -9.71961e-05, -0.051706687, -0.0054406007, -0.05696762, 0.01114414, -0.009201795, -0.0021951075, -0.10997012, -0.0117121935, 0.021221716, 0.009171805, -0.029621972, 0.034534886, 0.039578073, 0.019021517, -0.062691696, 0.039473332, 0.052403256, 0.061814193, -0.034507953, -0.009557814, -0.004955104, 0.01783901, -0.021176832, 0.015043591, 0.015390573, -0.0063342857, 0.043696415, -0.028341988, 0.02843402, 0.014726859, -0.06585564, -0.044533547, 0.0055523184, 0.035775978, 0.031099156, 0.027357664, 0.028062243, 0.056972917, -0.054656

# New Section

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
my_path="/content/drive/MyDrive/CS491/Data/US_Inaugural_Addresses_2"
file_names=!ls {my_path}
file_names=sorted(file_names)
# print(file_names)

In [13]:
file_names=!ls {my_path}
file_names=sorted(file_names)
# print(file_names)
file_names

['01_washington_1789.txt\t       21_grant_1869.txt\t       41_truman_1949.txt',
 '02_washington_1793.txt\t       22_grant_1873.txt\t       42_eisenhower_1953.txt',
 '03_adams_john_1797.txt\t       23_hayes_1877.txt\t       43_eisenhower_1957.txt',
 '04_jefferson_1801.txt\t       24_garfield_1881.txt\t       44_kennedy_1961.txt',
 '05_jefferson_1805.txt\t       25_cleveland_1885.txt\t       45_johnson_1965.txt',
 '06_madison_1809.txt\t       26_harrison_1889.txt\t       46_nixon_1969.txt',
 '07_madison_1813.txt\t       27_cleveland_1893.txt\t       47_nixon_1973.txt',
 '08_monroe_1817.txt\t       28_mckinley_1897.txt\t       48_carter_1977.txt',
 '09_monroe_1821.txt\t       29_mckinley_1901.txt\t       49_reagan_1981.txt',
 '10_adams_john_quincy_1825.txt  30_roosevelt_theodore_1905.txt  50_reagan_1985.txt',
 '11_jackson_1829.txt\t       31_taft_1909.txt\t\t       51_bush_george_h_w_1989.txt',
 '12_jackson_1833.txt\t       32_wilson_1913.txt\t       52_clinton_1993.txt',
 '13_van_buren_1

In [14]:
# Split combined string by tabs and newlines, then filter out any empty strings
combined_string = ' '.join(file_names)
file_list = [name.strip() for name in combined_string.split() if name.strip() and '.' in name]

# Remove the extension from each file name
text_titles = sorted([name.split('.')[0] for name in file_list])

In [15]:
text_titles

['01_washington_1789',
 '02_washington_1793',
 '03_adams_john_1797',
 '04_jefferson_1801',
 '05_jefferson_1805',
 '06_madison_1809',
 '07_madison_1813',
 '08_monroe_1817',
 '09_monroe_1821',
 '10_adams_john_quincy_1825',
 '11_jackson_1829',
 '12_jackson_1833',
 '13_van_buren_1837',
 '14_harrison_1841',
 '15_polk_1845',
 '16_taylor_1849',
 '17_pierce_1853',
 '18_buchanan_1857',
 '19_lincoln_1861',
 '20_lincoln_1865',
 '21_grant_1869',
 '22_grant_1873',
 '23_hayes_1877',
 '24_garfield_1881',
 '25_cleveland_1885',
 '26_harrison_1889',
 '27_cleveland_1893',
 '28_mckinley_1897',
 '29_mckinley_1901',
 '30_roosevelt_theodore_1905',
 '31_taft_1909',
 '32_wilson_1913',
 '33_wilson_1917',
 '34_harding_1921',
 '35_coolidge_1925',
 '36_hoover_1929',
 '37_roosevelt_franklin_1933',
 '38_roosevelt_franklin_1937',
 '39_roosevelt_franklin_1941',
 '40_roosevelt_franklin_1945',
 '41_truman_1949',
 '42_eisenhower_1953',
 '43_eisenhower_1957',
 '44_kennedy_1961',
 '45_johnson_1965',
 '46_nixon_1969',
 '47_

In [24]:
import glob
text_files=glob.glob(f"{my_path}/*.txt")
text_files=sorted(text_files)

In [32]:
text_files[0]

'/content/drive/MyDrive/CS491/Data/US_Inaugural_Addresses_2/01_washington_1789.txt'
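
Because `text_files` already holds full paths, the titles could also be derived from those paths directly, which keeps titles and files in the same order without parsing the output of `ls`. This is only a sketch: `titles_from_paths` is a hypothetical helper, and `sample_paths` is a made-up two-item stand-in for `text_files`.

```python
from pathlib import Path

# Hypothetical sample shaped like the entries of text_files.
sample_paths = [
    "/content/drive/MyDrive/CS491/Data/US_Inaugural_Addresses_2/02_washington_1793.txt",
    "/content/drive/MyDrive/CS491/Data/US_Inaugural_Addresses_2/01_washington_1789.txt",
]

def titles_from_paths(paths):
    """Return file names without directory or extension, in sorted path order."""
    return [Path(p).stem for p in sorted(paths)]

sample_titles = titles_from_paths(sample_paths)
print(sample_titles)
```

Deriving both lists from one source avoids a length or ordering mismatch when they are later combined into a dataframe.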

In [33]:
# prompt: produce a dataframe whose columns are text_titles and the contents of the files in text_files

import pandas as pd

# Assuming text_titles and text_files are already defined as in the provided code
# Create an empty list to store the file contents
file_contents = []

# Iterate over the text files
for file in text_files:
    try:
        with open(file, 'r') as f:
            content = f.read()
            file_contents.append(content)
    except Exception as e:
        print(f"Error reading file {file}: {e}")
        file_contents.append("")  # Append an empty string if there's an error

# Create the DataFrame
df = pd.DataFrame({'text_titles': text_titles, 'file_contents': file_contents})
df


Unnamed: 0,text_titles,file_contents
0,01_washington_1789,George Washington\t1789-04-30\tFellow-Citizens...
1,02_washington_1793,George Washington\t1793-03-04\tFellow Citizens...
2,03_adams_john_1797,John Adams\t1797-03-04\tWHEN it was first perc...
3,04_jefferson_1801,Thomas Jefferson\t1801-03-04\tFriends and Fell...
4,05_jefferson_1805,"Thomas Jefferson\t1805-03-04\tPROCEEDING, fell..."
5,06_madison_1809,James Madison\t1809-03-04\tUnwilling to depart...
6,07_madison_1813,James Madison\t1813-03-04\tAbout to add the so...
7,08_monroe_1817,James Monroe\t1817-03-04\tI should be destitut...
8,09_monroe_1821,James Monroe\t1821-03-04\tFellow-Citizens I sh...
9,10_adams_john_quincy_1825,John Quincy Adams\t1825-03-04\tIn compliance w...


In [36]:
# prompt: each entry of file_contents starts with a field like this:
# George Washington\t1789-04-30\t
# I want to remove these fields

import re

# Assuming df is already created as in the previous code
# Remove fields like "George Washington\t1789-04-30\t" from file_contents

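def remove_fields(text):
  # Strip only the leading "name<TAB>date<TAB>" header at the very start of the text.
  # No re.MULTILINE here: with it, `^` would also match after every newline.
  pattern = r'^[^\t]+\t[^\t]+\t'
  return re.sub(pattern, '', text, count=1)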


df['file_contents'] = df['file_contents'].apply(remove_fields)
df.head()


Unnamed: 0,text_titles,file_contents
0,01_washington_1789,Fellow-Citizens of the Senate and of the House...
1,02_washington_1793,Fellow Citizens I AM again called upon by the ...
2,03_adams_john_1797,"WHEN it was first perceived, in early times, t..."
3,04_jefferson_1801,Friends and Fellow-Citizens CALLED upon to und...
4,05_jefferson_1805,"PROCEEDING, fellow-citizens, to that qualifica..."
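
If the speaker and date are worth keeping as metadata rather than discarding, the same header can be split off instead of deleted. A minimal sketch, assuming each file starts with `speaker<TAB>date<TAB>body` as shown in the dataframe above; `split_header` is a hypothetical helper.

```python
def split_header(raw):
    """Split 'speaker<TAB>date<TAB>body' into its three parts.

    Returns (None, None, raw) when the header is not present.
    """
    parts = raw.split("\t", 2)  # at most two splits, so tabs in the body survive
    if len(parts) != 3:
        return None, None, raw
    speaker, date, body = parts
    return speaker, date, body

sample = "George Washington\t1789-04-30\tFellow-Citizens of the Senate"
speaker, date, body = split_header(sample)
print(speaker, date, body, sep=" | ")
```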


In [40]:
!pip install tiktoken

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def embed_fn(title, text):
  # Split the text into smaller chunks if it exceeds the token limit
  max_tokens = 3000  # Adjust this value as needed
  encoding_name = "cl100k_base"  # OpenAI tokenizer, used only as a rough proxy; Gemini uses its own tokenizer

  num_of_tokens = num_tokens_from_string(text, encoding_name)

  if num_of_tokens > max_tokens:
    # Split the text into chunks based on tokens
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(text)
    chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
    text_chunks = [encoding.decode(chunk) for chunk in chunks]

    # Embed each chunk separately and combine the embeddings
    embeddings = [genai.embed_content(model=model, content=chunk, task_type="retrieval_document", title=title)["embedding"] for chunk in text_chunks]
    # Average the embeddings (you can use other methods to combine them)
    embedding = np.mean(embeddings, axis=0)
    return embedding
  else:
    return genai.embed_content(model=model, content=text, task_type="retrieval_document", title=title)["embedding"]


df['Embeddings'] = df.apply(lambda row: embed_fn(row['text_titles'], row['file_contents']), axis=1)
df

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0




BadRequest: 400 POST https://generativelanguage.googleapis.com/v1beta/models/embedding-001:embedContent?%24alt=json%3Benum-encoding%3Dint: Request payload size exceeds the limit: 10000 bytes.
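
The error above states the limit in bytes (10000), not tokens, so a 3000-token chunk of English prose is likely to be too large. One possible workaround is to chunk by UTF-8 byte length instead. The sketch below is an assumption-laden illustration: `split_by_bytes` is a hypothetical helper, the 9000-byte default is an arbitrary safety margin below the reported limit, and whether the title and other request fields count toward the payload size is not confirmed here.

```python
def split_by_bytes(text, max_bytes=9000):
    """Greedily pack whitespace-separated words into chunks of at most
    max_bytes UTF-8 bytes. A single word longer than max_bytes still
    becomes its own (oversized) chunk.
    """
    chunks, current, current_size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        # One extra byte for the joining space when the chunk is non-empty.
        extra = word_size + (1 if current else 0)
        if current and current_size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks

# Hypothetical long text standing in for one inaugural address.
demo_text = "word " * 5000
demo_chunks = split_by_bytes(demo_text)
print(len(demo_chunks), [len(c.encode("utf-8")) for c in demo_chunks])
```

Each chunk could then be passed to `genai.embed_content` in place of the token-based chunks in `embed_fn`. Note that splitting on whitespace collapses runs of spaces and newlines into single spaces.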

# Resume original content

## Building an embeddings database

Here are three sample texts to use to build the embeddings database. You will use the Gemini API to create an embedding for each document.

In [None]:
DOCUMENT1 = {
    "title": "Operating the Climate Control System",
    "content": "Your Googlecar has a climate control system that allows you to adjust the temperature and airflow in the car. To operate the climate control system, use the buttons and knobs located on the center console.  Temperature: The temperature knob controls the temperature inside the car. Turn the knob clockwise to increase the temperature or counterclockwise to decrease the temperature. Airflow: The airflow knob controls the amount of airflow inside the car. Turn the knob clockwise to increase the airflow or counterclockwise to decrease the airflow. Fan speed: The fan speed knob controls the speed of the fan. Turn the knob clockwise to increase the fan speed or counterclockwise to decrease the fan speed. Mode: The mode button allows you to select the desired mode. The available modes are: Auto: The car will automatically adjust the temperature and airflow to maintain a comfortable level. Cool: The car will blow cool air into the car. Heat: The car will blow warm air into the car. Defrost: The car will blow warm air onto the windshield to defrost it."}
DOCUMENT2 = {
    "title": "Touchscreen",
    "content": "Your Googlecar has a large touchscreen display that provides access to a variety of features, including navigation, entertainment, and climate control. To use the touchscreen display, simply touch the desired icon.  For example, you can touch the \"Navigation\" icon to get directions to your destination or touch the \"Music\" icon to play your favorite songs."}
DOCUMENT3 = {
    "title": "Shifting Gears",
    "content": "Your Googlecar has an automatic transmission. To shift gears, simply move the shift lever to the desired position.  Park: This position is used when you are parked. The wheels are locked and the car cannot move. Reverse: This position is used to back up. Neutral: This position is used when you are stopped at a light or in traffic. The car is not in gear and will not move unless you press the gas pedal. Drive: This position is used to drive forward. Low: This position is used for driving in snow or other slippery conditions."}

documents = [DOCUMENT1, DOCUMENT2, DOCUMENT3]

Organize the contents of the dictionary into a dataframe for better visualization.

In [None]:
df = pd.DataFrame(documents)
df.columns = ['Title', 'Text']
df

Unnamed: 0,Title,Text
0,Operating the Climate Control System,Your Googlecar has a climate control system th...
1,Touchscreen,Your Googlecar has a large touchscreen display...
2,Shifting Gears,Your Googlecar has an automatic transmission. ...


Get the embeddings for each of these bodies of text. Add this information to the dataframe.

In [None]:
# Get the embeddings of each text and add to an embeddings column in the dataframe
def embed_fn(title, text):
  return genai.embed_content(model=model,
                             content=text,
                             task_type="retrieval_document",
                             title=title)["embedding"]

df['Embeddings'] = df.apply(lambda row: embed_fn(row['Title'], row['Text']), axis=1)
df

Unnamed: 0,Title,Text,Embeddings
0,Operating the Climate Control System,Your Googlecar has a climate control system th...,"[-0.03336111, -0.021217091, -0.04958193, -0.01..."
1,Touchscreen,Your Googlecar has a large touchscreen display...,"[0.009660737, -0.030662702, -0.01728142, -0.05..."
2,Shifting Gears,Your Googlecar has an automatic transmission. ...,"[-0.042707954, -0.007160867, -0.03242516, -0.0..."


## Document search with Q&A

Now that the embeddings are generated, let's create a Q&A system to search these documents. You will ask a question about the Google car, create an embedding of the question, and compare it against the collection of embeddings in the dataframe.

The embedding of the question will be a vector (list of float values), which will be compared against the vector of the documents using the dot product. This vector returned from the API is already normalized. The dot product represents the similarity in direction between two vectors.

The values of the dot product can range between -1 and 1, inclusive. If the dot product between two vectors is 1, then the vectors are in the same direction. If the dot product value is 0, then these vectors are orthogonal, or unrelated, to each other. Lastly, if the dot product is -1, then the vectors point in the opposite direction and are not similar to each other.
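
The three cases described above can be checked with a small NumPy sketch. The vectors are toy two-dimensional examples rather than real embeddings, and `normalize` is a hypothetical helper. It also illustrates one caveat: the mean of several unit vectors is generally shorter than unit length, so an averaged embedding would need renormalizing before its dot products fall on the same scale.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = normalize([3.0, 4.0])
b = normalize([-4.0, 3.0])  # perpendicular to a

same = np.dot(a, a)         # close to 1: same direction
orthogonal = np.dot(a, b)   # close to 0: unrelated
opposite = np.dot(a, -a)    # close to -1: opposite direction

# Averaging unit vectors shrinks the result, so renormalize afterwards.
mean_vec = np.mean([a, b], axis=0)
mean_norm = np.linalg.norm(mean_vec)
renormalized_norm = np.linalg.norm(normalize(mean_vec))

print(same, orthogonal, opposite, mean_norm, renormalized_norm)
```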

Note: with the new embeddings model (`embedding-001`), specify the task type as `RETRIEVAL_QUERY` for a user query and `RETRIEVAL_DOCUMENT` when embedding document text.

Task Type | Description
---       | ---
RETRIEVAL_QUERY | Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.

In [None]:
query = "How do you shift gears in the Google car?"
model = 'models/embedding-001'

request = genai.embed_content(model=model,
                              content=query,
                              task_type="retrieval_query")

Use the `find_best_passage` function to calculate the dot products, and then sort the dataframe from the largest to smallest dot product value to retrieve the relevant passage out of the database.

In [None]:
def find_best_passage(query, dataframe):
  """
  Compute the dot-product similarity between the query and each document in
  the dataframe, and return the text of the best-matching document.
  """
  query_embedding = genai.embed_content(model=model,
                                        content=query,
                                        task_type="retrieval_query")
  dot_products = np.dot(np.stack(dataframe['Embeddings']), query_embedding["embedding"])
  idx = np.argmax(dot_products)
  return dataframe.iloc[idx]['Text'] # Return text from index with max value

View the most relevant document from the database:

In [None]:
passage = find_best_passage(query, df)
passage

'Your Googlecar has an automatic transmission. To shift gears, simply move the shift lever to the desired position.  Park: This position is used when you are parked. The wheels are locked and the car cannot move. Reverse: This position is used to back up. Neutral: This position is used when you are stopped at a light or in traffic. The car is not in gear and will not move unless you press the gas pedal. Drive: This position is used to drive forward. Low: This position is used for driving in snow or other slippery conditions.'
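
With a larger collection it can be useful to see several ranked candidates and their scores rather than only the single best match. The sketch below shows one way to do that with `np.argsort`. It uses toy two-dimensional vectors in place of real embeddings, and `top_k_indices` is a hypothetical helper, not part of the Gemini SDK.

```python
import numpy as np

def top_k_indices(doc_embeddings, query_embedding, k=2):
    """Return the indices and dot-product scores of the k best-matching documents,
    ordered from most to least similar."""
    scores = np.dot(np.stack(doc_embeddings), query_embedding)
    order = np.argsort(scores)[::-1][:k]  # descending by score
    return order.tolist(), scores[order].tolist()

# Toy unit vectors standing in for document and query embeddings.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.0, 1.0]

indices, scores = top_k_indices(toy_docs, toy_query, k=2)
print(indices, scores)
```

With a dataframe like the one above, the returned indices could be used with `dataframe.iloc[indices]` to retrieve the matching rows.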

## Question and Answering Application

Let's try to use the text generation API to create a Q & A system. Input your own custom data below to create a simple question and answering example. You will still use the dot product as a metric of similarity.

In [None]:
def make_prompt(query, relevant_passage):
  escaped = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
  prompt = textwrap.dedent("""You are a helpful and informative bot that answers questions using text from the reference passage included below. \
  Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
  However, you are talking to a non-technical audience, so be sure to break down complicated concepts and \
  strike a friendly and conversational tone. \
  If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: '{query}'
  PASSAGE: '{relevant_passage}'

    ANSWER:
  """).format(query=query, relevant_passage=escaped)

  return prompt

In [None]:
prompt = make_prompt(query, passage)
print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below.   Be sure to respond in a complete sentence, being comprehensive, including all relevant background information.   However, you are talking to a non-technical audience, so be sure to break down complicated concepts and   strike a friendly and conversational tone.   If the passage is irrelevant to the answer, you may ignore it.
  QUESTION: 'How do you shift gears in the Google car?'
  PASSAGE: 'Your Googlecar has an automatic transmission. To shift gears, simply move the shift lever to the desired position.  Park: This position is used when you are parked. The wheels are locked and the car cannot move. Reverse: This position is used to back up. Neutral: This position is used when you are stopped at a light or in traffic. The car is not in gear and will not move unless you press the gas pedal. Drive: This position is used to drive forward. Low: This position is used for drivi

Choose one of the Gemini content generation models in order to find the answer to your query.

In [None]:
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)

models/gemini-1.5-pro
models/gemini-1.5-flash


In [None]:
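# Use a separate name so the embedding model string stored in `model`
# (used by find_best_passage) is not overwritten.
generation_model = genai.GenerativeModel('gemini-1.5-pro-latest')
answer = generation_model.generate_content(prompt)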

In [None]:
Markdown(answer.text)

The provided passage does not contain information about how to shift gears in a Google car, so I cannot answer your question from this source.

## Next steps

To learn how to use other services in the Gemini API, see the [Python quickstart](https://ai.google.dev/tutorials/python_quickstart).

To learn more about how you can use embeddings, see these other tutorials:

 * [Anomaly Detection with Embeddings](https://ai.google.dev/gemini-api/tutorials/anomaly_detection)
 * [Clustering with Embeddings](https://ai.google.dev/gemini-api/tutorials/clustering_with_embeddings)
 * [Training a Text Classifier with Embeddings](https://ai.google.dev/gemini-api/tutorials/text_classifier_embeddings)