In [None]:
!apt install tesseract-ocr libtesseract-dev
!pip install --no-deps --force-reinstall  google-generativeai chromadb pytesseract

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libtesseract-dev is already the newest version (4.1.1-2.1build1).
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.
Collecting google-generativeai
  Downloading google_generativeai-0.8.3-py3-none-any.whl.metadata (3.9 kB)
Collecting chromadb
  Using cached chromadb-0.5.23-py3-none-any.whl.metadata (6.8 kB)
Collecting pytesseract
  Using cached pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading google_generativeai-0.8.3-py3-none-any.whl (160 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m160.8/160.8 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hUsing cached chromadb-0.5.23-py3-none-any.whl (628 kB)
Using cached pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract, google-generativeai, chromadb
  Attempting uninstall: pytesseract
    Found exist

In [None]:
import time
from tqdm import tqdm
import pathlib
import google.generativeai as genai
import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings
import pandas as pd
from PIL import Image
import pytesseract
from IPython.display import Markdown

In [None]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

genai.configure(api_key=GOOGLE_API_KEY)

## Project Description
You're an astronomy student who's very curious about the Apollo 11 missions, and through your research, you've found a lot of different types of data (otherwise known as multimodal) from NASA's public archive.

- Text: You have the full final NASA report post-mission, spanning over 300 pages of incredibly informative content that details a summary of everything that happened as well as conclusions that NASA researchers and engineers came to. For the sake of this exercise, we've selected 3 particularly interesting pages, and converted them to images (you'll see soon why).

- Video: You also have several clips of the famous Neil Armstrong and Buzz Aldrin footage as they first stepped onto the moon, containing highlights of their moonwalks as well as raising the American flag.

- Audio: Finally, you have highlights from the audio recorded throughout the mission, which provides insights into how communication between the astronauts occurred as well as from the astronauts to mission control.

Now, you want to search through and summarize this information for your upcoming research paper. Using your newfound skills from this course, you can accomplish this using Gemini! In particular, we will build a Retrieval Augmented Generation (RAG) system that you can directly interact with.

## Data Preparation
Before we begin, ensure that you've uploaded the resources.zip folder and unzipped it using the following command:

In [None]:
!wget -O resources.zip "https://video.udacity-data.com/topher/2024/June/66744e79_resources/resources.zip"

--2024-12-25 17:49:46--  https://video.udacity-data.com/topher/2024/June/66744e79_resources/resources.zip
Resolving video.udacity-data.com (video.udacity-data.com)... 104.19.142.72, 104.19.139.72, 104.19.141.72, ...
Connecting to video.udacity-data.com (video.udacity-data.com)|104.19.142.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 286142532 (273M) [application/zip]
Saving to: ‘resources.zip’


2024-12-25 17:49:48 (180 MB/s) - ‘resources.zip’ saved [286142532/286142532]



In [None]:
!unzip resources.zip

Archive:  resources.zip
   creating: resources/
  inflating: __MACOSX/._resources    
   creating: resources/video/
  inflating: __MACOSX/resources/._video  
  inflating: resources/.DS_Store     
  inflating: __MACOSX/resources/._.DS_Store  
   creating: resources/audio/
  inflating: __MACOSX/resources/._audio  
   creating: resources/text/
  inflating: __MACOSX/resources/._text  
  inflating: resources/video/Apollo11PlaqueComparison.mov  
  inflating: __MACOSX/resources/video/._Apollo11PlaqueComparison.mov  
  inflating: resources/video/Apollo11Intro.mov  
  inflating: __MACOSX/resources/video/._Apollo11Intro.mov  
  inflating: resources/video/Apollo11MoonwalkMontage.mov  
  inflating: __MACOSX/resources/video/._Apollo11MoonwalkMontage.mov  
  inflating: resources/video/OneSmallStepCompilation.mov  
  inflating: __MACOSX/resources/video/._OneSmallStepCompilation.mov  
  inflating: resources/video/RaisingTheAmericanFlag.mov  
  inflating: __MACOSX/resources/video/._RaisingTheAmericanFl

We first need to parse it in a way that Gemini can understand. We will prepare our data by extracting all file names from the `resources` directory.

In [None]:
data_dir = pathlib.Path("resources/")
all_file_names = [str(file) for file in data_dir.rglob("*") if file.is_file() and not file.name.startswith('.')]

In [None]:
for file_name in all_file_names:
    print(file_name)

print(len(all_file_names))

resources/video/BuzzDescendsCompilation.mov
resources/video/OneSmallStepCompilation.mov
resources/video/Apollo11Intro.mov
resources/video/Apollo11PlaqueComparison.mov
resources/video/Apollo11MoonwalkMontage.mov
resources/video/RaisingTheAmericanFlag.mov
resources/text/images-020.jpg
resources/text/images-333.jpg
resources/text/images-023.jpg
resources/audio/Apollo11OnboardAudioHighlightClip1.mp3
resources/audio/Apollo11OnboardAudioHighlightClip5.mp3
resources/audio/Apollo11OnboardAudioHighlightClip2.mp3
resources/audio/Apollo11OnboardAudioHighlightClip4.mp3
resources/audio/Apollo11OnboardAudioHighlightClip3.mp3
14


You should expect to see 14 files

## Retrieval Augmented Generation (RAG)
To showcase how we build a RAG, we will first build one for the Text case, and generalize it further after. Here is the general idea:

1. **Data Preparation** (done above): We first collected various types of data from NASA's public archive related to the Apollo 11 mission, including text, video, and audio files.
2. **Data Extraction and Summarization**: Extract the multimodal data from images, e.g. extract text from images using Optical Character Recognition (OCR), and use Gemini to generate summaries using a specialized prompt.
3. **Embedding Generation**: Convert the generated summaries into vector embeddings using Gemini's Text Embedding Model. These embeddings represent the summaries in a numerical format suitable for efficient similarity searches.
4. **Creating a Vector Database**: A Vector database was created to store the embeddings. This database facilitates fast and efficient retrieval of relevant documents based on similarity searches. We chose to use Chroma DB.
5. **Querying the RAG System**: For a given query, the system retrieves the most relevant documents (based on their embeddings) and generates a response using the retrieved documents as context.

Something important to note is that RAGs are usually used only when there is a surplus of data. In other words, if the data can't fit into the model prompt. In this case, the data we provided likely can fit into Gemini's 1 million token window, but for the sake of simplicity and restrictions of Google Colab's runtime, we opted to use a smaller set of data.

## Text
We will use Tesseract OCR (Optical Character Recognition) to extract text from images of the NASA report.

In [None]:
pytesseract.pytesseract.tesseract_cmd = (r'/usr/bin/tesseract')

Let's create a function to take in our images of a PDF, transcribe them into text, and summarize each of them.

In [None]:
def create_text_summary():
  path = pathlib.Path("resources/text")

  text_summary_prompt = f"""You are an assistant tailored for summarizing text for retrieval.
  These summaries will be turned into vector embeddings and used to retrieve the raw text.
  Give a concise summary of the text that is well optimized for retrieval. Here is the text."""

  images = []
  text_summaries = []

  for f in path.glob("*"):
    if f.is_dir() or f.name.startswith('.'):
      continue

    image = Image.open(f)
    response = model.generate_content([text_summary_prompt, pytesseract.image_to_string(image)])

    images.append(image)
    text_summaries.append(response.text)

  return images, text_summaries

In [None]:
safety_settings = [
    {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_NONE",
    },
]

model = genai.GenerativeModel('models/gemini-1.5-flash', safety_settings=safety_settings)

In [None]:
image_files, text_summaries = create_text_summary()

Now, we can check out the generated summaries of the three pages we have!

In [None]:
for text_summary in text_summaries:
  print(text_summary)

Apollo 11 Flight Plan, prepared by Flight Planning Branch with TRW support, schedules AS-506/CSM-107/LM-5 operations for a July 16, 1969 launch.  Trajectory parameters are from Mission Planning and Analysis Division.  Changes require Crew Procedures Change Request via the Crew Procedures Control Board (CPCB), addressing crew training, test objectives, RCS/EPS budget, major activity shifts, and flight data file changes.  T. A. Guillory coordinates changes; W. J. North handles distribution requests.

Lunar mission assumptions: Descent stage batteries activated 30 minutes pre-liftoff; 3.8-hour lunar orbit checkout; ascent/descent batteries paralleled during powered descent and pre-liftoff; S-band equipment 100% operational; rendezvous radar operational per flight plan; PGNCS in operate mode throughout lunar stay; forward window heaters off.

Apollo mission launch (9:32 EDT July 16, 1969) and translunar coast details.  Earth orbit insertion at 11 min 43 sec.  Translunar injection (TLI) at 

We create the Chroma database using the generated summaries. You might be wondering what Vector DB and Chroma DB are.

**Vector Database**: A specialized database designed to store and manage high-dimensional vectors, which are numerical representations of data points. It allows efficient similarity searches to find vectors (and their corresponding data) that are close to a given query vector.

**Chroma DB**: An implementation of a vector database used to store and retrieve vector embeddings. These embeddings are generated from our summaries and allow us to perform efficient similarity searches.

In [12]:
class GeminiEmbeddingFunction(EmbeddingFunction):
  def __call__(self, input: Documents) -> Embeddings:
    model = 'models/text-embedding-004'
    title = "Custom query"
    return genai.embed_content(model=model,
                                content=input,
                                task_type="retrieval_document",
                                title=title)["embedding"]

In [13]:
def create_chroma_db(documents, name):
  chroma_client = chromadb.Client()
  db = chroma_client.get_or_create_collection(name=name, embedding_function=GeminiEmbeddingFunction())

  for i, d in enumerate(documents):
    db.add(
      documents=d,
      ids=str(i)
    )
  return db

In [14]:
text_db = create_chroma_db(text_summaries, "text_nasa")

Let's also take a peak at the `text_db` and ensure that embeddings were generated:

In [17]:
peek_data = text_db.peek()

data = {
    'documents': peek_data['documents'],
    'embeddings': []
}
for emb in peek_data['embeddings']:
    data['embeddings'].append(emb)


df = pd.DataFrame.from_dict(data, orient='index').transpose()
df

Unnamed: 0,documents,embeddings
0,"Apollo 11 Flight Plan, prepared by Flight Plan...","[0.07943899929523468, 0.004107338842004538, 0...."
1,Lunar mission assumptions: Descent stage batte...,"[0.0692022368311882, 0.007176002021878958, -0...."
2,"Apollo mission launch (9:32 EDT July 16, 1969)...","[0.0758017897605896, 0.012810258194804192, 0.0..."


You should see a column called `embeddings` with what are seemingly random values, but these values are actually high-dimensional vectors that represent the semantic meaning of your summaries.

Now let's actually try querying our information. We'll test a simple example like getting some file that has to do with the Apollo 11 Flight Plan.

In [18]:
def get_relevant_files(query, db):
  results = db.query(query_texts=[query], n_results=3)
  return results["ids"][0]

In [19]:
files = get_relevant_files("Apollo 11 Flight Plan", text_db)
print(files)

['0', '2', '1']


You should expect to see something like `['0', '2', '1']`. This means that the first entry in the `text_db` is most similar. If you look above at our `pd.DataFrame` output, the document with id 1 is the document about the Apollo 11 Flight Plan, so this is working as we expected!

## Video and Audio
Congrats! W've successfully built a working RAG for text. Now, let's extend this concept to Video and Audio, and build out some more complex queries. We'll begin by generalizing the above summary creation function to all sorts of modalities.

In [39]:
def create_summary(modality):
  path = data_dir / modality

  summary_prompt = f"""You are an assistant tailored for summarizing {modality} for retrieval.
  These summaries will be turned into vector embeddings and used to retrieve the raw {modality}.
  Give a concise summary of the {modality} that is well optimized for retrieval. Here is the {modality}."""

  files = []
  summaries = []

  for f in path.glob("*"):
    if f.is_dir() or f.name.startswith('.'):
      continue
    print(f)

    if modality == "text":
      file = Image.open(f)
      response = model.generate_content([summary_prompt, pytesseract.image_to_string(file)])

    else:
      file = genai.upload_file(f)

      while file.state.name == "PROCESSING":
        print("Waiting for video file upload...\n", end='')
        time.sleep(5)
        file = genai.get_file(file.name)

      response = model.generate_content([summary_prompt, file])

    files.append(file)
    summaries.append(response.text)

  return files, summaries

Now, we will create a folder with all of our data of different modalities. In particular, the first 5 are audio files, next 3 are text files, and final 6 are video files.

In [41]:
all_files = []
all_summaries = []

for modality_type in ["audio", "text", "video"]:
  files, summaries = create_summary(modality_type)
  all_files.extend(files)
  all_summaries.extend(summaries)

resources/audio/Apollo11OnboardAudioHighlightClip1.mp3
resources/audio/Apollo11OnboardAudioHighlightClip5.mp3
resources/audio/Apollo11OnboardAudioHighlightClip2.mp3
resources/audio/Apollo11OnboardAudioHighlightClip4.mp3
resources/audio/Apollo11OnboardAudioHighlightClip3.mp3
resources/text/images-020.jpg
resources/text/images-333.jpg
resources/text/images-023.jpg
resources/video/BuzzDescendsCompilation.mov
Waiting for video file upload...
resources/video/OneSmallStepCompilation.mov
Waiting for video file upload...
resources/video/Apollo11Intro.mov
Waiting for video file upload...
Waiting for video file upload...
resources/video/Apollo11PlaqueComparison.mov
Waiting for video file upload...
resources/video/Apollo11MoonwalkMontage.mov
Waiting for video file upload...
resources/video/RaisingTheAmericanFlag.mov
Waiting for video file upload...


In [None]:
db = create_chroma_db(all_summaries, "nasa")

Again, ensure that the embeddings were generated. Notice that now, we have audio, video, and text data.

In [44]:
peek_data = db.peek()
data = {
    'documents': peek_data['documents'],
    'embeddings': []
}
for emb in peek_data['embeddings']:
    data['embeddings'].append(emb)

df = pd.DataFrame.from_dict(data, orient='index').transpose()
df


Unnamed: 0,documents,embeddings
0,Audio Summary 1,"[0.018710002303123474, 0.05939813330769539, -0..."
1,Audio Summary 2,"[0.01287598442286253, 0.06406120955944061, -0...."
2,Audio Summary 3,"[0.022134069353342056, 0.06920761615037918, -0..."
3,Audio Summary 4,"[0.02908291108906269, 0.07372202724218369, -0...."
4,Audio Summary 5,"[0.014025907963514328, 0.08065280318260193, -0..."
5,Text Summary 1,"[-0.0128444479778409, 0.062167178839445114, -0..."
6,Text Summary 2,"[-0.01730366423726082, 0.0636621043086052, -0...."
7,Text Summary 3,"[-0.007510444615036249, 0.06765061616897583, -..."
8,Video Summary 1,"[-0.011761386878788471, 0.013827462680637836, ..."
9,Video Summary 2,"[-0.01832258142530918, 0.018737012520432472, -..."


In [48]:
files = get_relevant_files("communication with Mission Control", db)
print(files)

['5', '6', '7']


Can we do more than just return the most relevant file? Yes we can! We can ask Gemini to return a response to the query using the files it thinks are most relevant, provide an answer and tell us what files it used! This is really exciting, and has vast applications in many industries.

In [45]:
def query_rag(query, db):
    files = get_relevant_files(query, db)
    prompt = [all_files[int(f)] for f in files]
    prompt.append("Generate a response to the query using the provided files. Here is the query.")
    prompt.append(query)
    return model.generate_content(prompt).text, [all_file_names[int(f)] for f in files]

In [46]:
for response in query_rag("Explain what happened with the Apollo 11 Mission.", db):
    print(response)

The Apollo 11 mission was the first crewed mission to land on the Moon.  On July 20, 1969, astronauts Neil Armstrong and Buzz Aldrin landed the Lunar Module, Eagle, on the Moon's surface in the Sea of Tranquility. Armstrong became the first human to walk on the Moon, followed shortly by Aldrin. They spent about 21.5 hours on the lunar surface, collecting samples, planting a US flag, and deploying scientific instruments.  Afterward, they rejoined Michael Collins in the Command Module, Columbia, and returned safely to Earth, splashing down in the Pacific Ocean on July 24, 1969.  The mission was a monumental achievement in human history and a significant victory in the Space Race.

['resources/video/BuzzDescendsCompilation.mov', 'resources/text/images-023.jpg', 'resources/video/OneSmallStepCompilation.mov']


In [47]:
for response in query_rag("What happens at the Translunar Coast in the Mission Description?", db):
    print(response)

According to the provided document, the Translunar Coast is where the spacecraft transitions from Earth orbit to a trajectory towards the Moon.  There is no further detail on events at the Translunar Coast in this document.
['resources/video/RaisingTheAmericanFlag.mov', 'resources/video/BuzzDescendsCompilation.mov', 'resources/text/images-020.jpg']


# Congrats! We've built a full end to end multimodal RAG with just a few tools.