## Project Statement - Enhancing Search Engine Relevance for Video Subtitles

### Background:
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in
connecting users with relevant information. For Google, providing a seamless and accurate
search experience is paramount. This project focuses on improving the search relevance for
video subtitles, enhancing the accessibility of video content.

### Objective:
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user
queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural
language processing and machine learning techniques to enhance the relevance and accuracy
of search results.

### Step 1: Read the database, decode the content, and save it in a DataFrame

In [None]:
# # Read the code below and write your observation in the next cell

# conn = sqlite3.connect('eng_subtitles_database.db')
# cursor = conn.cursor()
# cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
# # print(cursor.fetchall())

### Define a function to extract the content of the subtitles

1. Some subtitle files are in .ass format, which begins with a "[Script Info]" header. For these we keep only the dialogue text.
2. The remaining files are plain subtitle files (.srt or similar). For these we keep only the subtitle text.

In [None]:
def extract_dialogue_text(subtitle_content):
    """Utilizes regular expressions to find all dialogue lines and extract text."""

    # Check if content is in .ass format (contains "[Script Info]" header)
    if "[Script Info]" in subtitle_content:
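        # In the standard .ass layout a Dialogue line has nine comma-separated
        # fields before the text, so skip exactly nine fields. Capturing only what
        # follows the last comma would truncate dialogue that itself contains commas.
        dialogue_lines = re.findall(r'^Dialogue:\s*(?:[^,\n]*,){9}(.*)$', subtitle_content, flags=re.MULTILINE)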

        extracted_text = "\n".join(dialogue_lines)
    else: # Assume .srt or similar format

        # Remove numeric identifiers and timestamps
        subtitle_content = re.sub(r'^\d+\s+|\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', subtitle_content, flags=re.MULTILINE)

        # Split by blank lines to isolate dialogue lines
        dialogue_lines = subtitle_content.strip().split('\n\n')

        # Combine text, skip empty and non-dialogue lines
        extracted_text = "\n".join(line for line in dialogue_lines if not line.isdigit() and line not in ['', ' '])

    return extracted_text

# Define function to decode compressed binary data and extract dialogue text
def decode_and_extract_dialogues(binary_data):
    with io.BytesIO(binary_data) as f:
        with zipfile.ZipFile(f, 'r') as zip_file:
            subtitle_content = zip_file.read(zip_file.namelist()[0]).decode('latin-1')

    # Clean and extract dialogue text from subtitle content
    cleaned_text = extract_dialogue_text(subtitle_content)

    return cleaned_text
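
The recorded samples further down contain sequences such as `ï»¿` and `RÃ©gis`, which is what UTF-8 bytes (including a byte-order mark) look like when decoded as latin-1. The sketch below is one possible alternative, assuming many of the archived files are UTF-8: it tries UTF-8 first and falls back to latin-1. The helper names `decode_subtitle_bytes`, `read_first_member`, and `make_zip` are hypothetical and not part of the notebook.

```python
import io
import zipfile


def decode_subtitle_bytes(raw):
    """Decode bytes as UTF-8 (dropping a leading BOM), else fall back to latin-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def read_first_member(binary_data):
    """Return the decoded text of the first file inside a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(binary_data)) as zip_file:
        return decode_subtitle_bytes(zip_file.read(zip_file.namelist()[0]))


def make_zip(name, payload):
    """Build a small in-memory zip archive (demo helper only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zip_file:
        zip_file.writestr(name, payload)
    return buffer.getvalue()


# One UTF-8 file with a BOM and one latin-1 file, both zipped in memory.
utf8_blob = make_zip("a.srt", "\ufeff1\r\nRégis\r\n".encode("utf-8"))
latin_blob = make_zip("b.srt", "Régis\r\n".encode("latin-1"))

print(repr(read_first_member(utf8_blob)))
print(repr(read_first_member(latin_blob)))
```

Trying UTF-8 first is safe in the sense that invalid byte sequences raise an error instead of silently producing wrong characters, whereas latin-1 accepts any byte string.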

In [25]:
import sqlite3
import zipfile
import io
import pandas as pd
import re
import numpy as np

# Connect to the SQLite database
conn = sqlite3.connect('/content/drive/MyDrive/Copy of eng_subtitles_database.db')
cursor = conn.cursor()

# Execute SQL query to retrieve table names
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = cursor.fetchall()


# Iterate through tables and read data
for table in tables:
    table_name = table[0]
    # Fetch data, assuming the column with compressed data is named 'content'
    df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)

    # Apply our decoding and extraction function to the 'content' column
    df['content'] = df['content'].apply(decode_and_extract_dialogues)
    print(f"Table: {table_name}")
#     print(df.head())  # Optionally, print first few rows of DataFrame to check results

# Close connection to the database
conn.close()

Table: zipfiles
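
Before loading every row, it can help to inspect the schema cheaply. The sketch below uses an in-memory SQLite database as a stand-in for the real file; the table name `zipfiles` and the columns `num`, `name`, and `content` are taken from the output in this notebook, while the column types and the sample row are assumptions made for the demo.

```python
import sqlite3

# In-memory stand-in for the subtitle database (column types are assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zipfiles (num INTEGER, name TEXT, content BLOB)")
conn.execute(
    "INSERT INTO zipfiles VALUES (?, ?, ?)",
    (1, "demo.(2000).eng.1cd", b"PK"),  # placeholder bytes, not a real archive
)

# Table names come from the sqlite_master catalog, as in the cell above.
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute('PRAGMA table_info("zipfiles")')]

# A row count is much cheaper than SELECT * when the blobs are large.
row_count = conn.execute('SELECT COUNT(*) FROM "zipfiles"').fetchone()[0]
conn.close()

print(table_names, columns, row_count)
```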


In [None]:
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,Watch any video online with Open-SUBTITLES\r\n...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,Ah! There's Princess\r\nDawn and Terry with th...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,<i>Yumi's Cells 2</i>\r\n\r\n\r\n<i>Episode 36...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,Watch any video online with Open-SUBTITLES\r\n...
4,9180600,broker.(2022).eng.1cd,ï»¿1\r\n\r\nWatch any video online with Open-S...


In [None]:
df.shape

(82498, 3)

In [None]:
df.tail()

Unnamed: 0,num,name,content
82493,9521935,the.prophets.game.(2000).eng.1cd,"ï»¿1\r\n\r\nGod,\r\nwhy are you punishing me?\..."
82494,9521937,west.beirut.(1998).eng.1cd,"api.OpenSubtitles.org is deprecated, please\r\..."
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,(Dramatic orchestral music)\r\n\r\n\r\nAdverti...
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,Advertise your product or brand here\r\ncontac...
82497,9521941,zombie.island.massacre.(1984).eng.1cd,"(Sharp whistling)\r\n\r\n\r\n- [Man] Hey, wait..."


### Randomly sample 30% of the data

In [26]:
# Get the number of rows in your dataset
total_rows = len(df)

# Calculate the number of rows for 30% of the data
thirty_percent = int(total_rows * 0.3)

# Randomly select 30% of the data
random_indices = np.random.choice(total_rows, thirty_percent, replace=False)
data = df.iloc[random_indices]

In [None]:
data.shape

(24749, 3)

In [None]:
# data.to_csv('Subtitles.csv',index=False)

In [None]:
## view the dataframe of 30% of the dataset
data.head()

Unnamed: 0,num,name,content
6663,9209381,the.nanny.s06.e09.oh.say.can.you.ski.(1998).en...,ï»¿1\r\n\r\nThank you so much\r\nfor coming by...
38841,9340677,ancient.unexplained.files.s01.e09.legend.of.th...,ï»¿1\n[josh] a digital autopsy\n\nReveals the ...
77442,9500973,alfred.hitchcock.presents.s01.e05.into.thin.ai...,394)}Tonight we are going to tell\r\n431)}the ...
55699,9415187,le.pelican.(1974).eng.1cd,ï»¿1\r\n\r\nWatch any video online with Open-S...
35355,9323167,beyond.oak.island.s03.e05.the.atocha.secrets.o...,ï»¿1\r\n\r\nTonight on <i>Beyond Oak Island......


In [None]:
data['content'][6663]

'ï»¿1\r\n\r\nThank you so much\r\nfor coming by, Dr. Reynolds.\r\n\r\n\r\nIt\'s nice to find a doctor\r\nthat will make house calls.\r\n\r\n\r\nIt\'s nice to find a patient\r\nwho can afford them.\r\n\r\n\r\nI just don\'t know what to\r\ndo anymore. Fran is obsessing\r\nabout still not being pregnant.\r\n\r\n\r\nNo need to worry.\r\n\r\n\r\nAll we have to do\r\nis talk to her calmly,\r\n\r\n\r\nlet her know that she\'s special\r\nand that there\'s nothing\r\nwrong with her.\r\n\r\n\r\n- Hi, Fran.\r\n- Hi.\r\n\r\n\r\nWhy don\'t you tell me\r\nwhy you\'re upside down?\r\n\r\n\r\nWell, I was\r\nwatching "The View"\r\nand Barbara Walters says\r\n\r\n\r\nthat this helps you\r\nget pregnant.\r\n\r\n\r\nPlus you know\r\nI\'m taking those hormones\r\n\r\n\r\nand I also got some fancy herbs\r\nand he\'s been taking zinc.\r\n\r\n\r\n- No, I haven\'t.\r\n- Yeah, you have.\r\n\r\n\r\nRemember those Tic Tacs\r\nthat you said that you thought\r\ntasted a little chalky...\r\n\r\n\r\nThis is sick.\r\n

In [None]:
data['content'][55699]

"ï»¿1\r\n\r\nWatch any video online with Open-SUBTITLES\r\nFree Browser extension: osdb.link/ext\r\n\r\n\r\nSurname: Boyer\r\n\r\n\r\nFirst name: Marc RÃ©gis Jean\r\n\r\n\r\nBorn on 25 April 1962 in Boulogne\r\n\r\n\r\nto Boyer Paul, musician,\r\nand BorÃ© Isabelle, unemployed.\r\n\r\n\r\nCould you please sign?\r\n\r\n\r\n- Here it is. Thank you.\r\n- Thank you.\r\n\r\n\r\nIsabelle, the child is crying.\r\n\r\n\r\n- What will you have?\r\n- Schweppes.\r\n\r\n\r\n- Have you thought about it?\r\n- A little.\r\n\r\n\r\nI don't know... Sounds risky...\r\n\r\n\r\nThere is no danger, believe me.\r\n\r\n\r\nEverything has been meticulously prepared.\r\n\r\n\r\nBe sure we'd all benefit from\r\na success of that job.\r\n\r\n\r\nLet's say I accept..\r\nHow will I get my money?\r\n\r\n\r\nHalf in New York and the rest\r\nwhen you return to Paris.\r\n\r\n\r\nWhen do you need my answer?\r\n\r\n\r\n- When do you go on tour?\r\n- In about a month and a half.\r\n\r\n\r\nWell...\r\n\r\n\r\nIf I get you

### Expand contractions

In [None]:
# Handling Contractions
contraction_mapping = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'll": "he will",
    "he's": "he is",
    "how'd": "how did",
    "how'll": "how will",
    "how's": "how is",
    "I'd": "I would",
    "I'll": "I will",
    "I'm": "I am",
    "I've": "I have",
    "isn't": "is not",
    "it'd": "it would",
    "it'll": "it will",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "must've": "must have",
    "mustn't": "must not",
    "needn't": "need not",
    "oughtn't": "ought not",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "she'd": "she would",
    "she'll": "she will",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "that'd": "that would",
    "that's": "that is",
    "there'd": "there had",
    "there's": "there is",
    "they'd": "they would",
    "they'll": "they will",
    "they're": "they are",
    "they've": "they have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'll": "we will",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who's": "who is",
    "won't": "will not",
    "wouldn't": "would not",
    "you'd": "you would",
    "you'll": "you will",
    "you're": "you are",
    "you've": "you have"
}

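def expand_contractions(text, contraction_mapping):
    """Replace each known contraction in `text` with its expanded form (case-insensitive)."""
    # Look contractions up by lowercase key so "I'm", "i'm" and "I'M" all resolve.
    lookup = {key.lower(): value for key, value in contraction_mapping.items()}

    # Try longer keys first so "can't've" is not consumed by the shorter "can't".
    keys = sorted(lookup, key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(key) for key in keys)), flags=re.IGNORECASE)

    def expand_match(contraction):
        match = contraction.group(0)
        # Fall back to the matched text itself if the lookup unexpectedly misses.
        return lookup.get(match.lower(), match)

    expanded_text = contractions_pattern.sub(expand_match, text)
    # Drop any apostrophes that remain (possessives, unknown contractions).
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text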

### Define a preprocessing function

In [None]:
import re
import nltk

from tqdm import tqdm

# Ensure the NLTK resources (stopwords, punkt, wordnet) are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Function to preprocess text
def preprocess_text(text):

    # Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove timestamps
    # text = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}', '', text)

    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Remove blank new lines
    text = re.sub(r'\n\s*\n', '\n', text)  # Remove blank lines
    # text = re.sub(r'\s+', ' ', text)  # Remove extra spaces

    # Remove special characters except alphanumeric and spaces
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove SDH (Subtitles for the Deaf and Hard of Hearing)
    text = re.sub(r'\{.*?\}', '', text)

    # Remove speaker labels
    text = re.sub(r'[A-Z]+:', '', text)

    # Remove text between angle brackets <>
    text = re.sub(r'<.*?>', '', text)

    # Remove text between curly brackets {}
    text = re.sub(r'\{.*?\}', '', text)

    # Remove text between parentheses ()
    text = re.sub(r'\(.*?\)', '', text)

    # Remove text between square brackets []
    text = re.sub(r'\[.*?\]', '', text)

    # Remove text between asterisks *...*
    text = re.sub(r'\*.*?\*', '', text)

    # Remove music note
    text = text.replace('🎵', '')

    # Remove ellipses ...
    text = text.replace('...', '')

    # Remove specific words
    words_to_remove = ["Oh","Wow","Hey","Uh","Ah","Hmm","Huh","Ouch","Oops","Aha","Eek","Umm","Gah","Yay","Phew","Hm","D'oh","Ahem"]  # Add words you want to remove here
    for word in words_to_remove:
        text = text.replace(word, '')

    # Convert to lowercase
    text = text.lower()

    # Handling Contractions
    text = expand_contractions(text, contraction_mapping)

    return text


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
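
In `preprocess_text` above, the filter that keeps only letters and whitespace runs before the bracket, speaker-label, and contraction steps, so those later steps have nothing left to match; this is consistent with the recorded output, which still shows forms like `dont` and `youre`. The filler words are also removed with plain substring replacement, which can cut into longer words. The sketch below shows one possible reordering. It is not the notebook's method: `preprocess_reordered` is a hypothetical name, and `CONTRACTIONS` and `FILLERS` are small demo subsets of the lists defined above.

```python
import re

CONTRACTIONS = {"don't": "do not", "it's": "it is", "you're": "you are"}  # demo subset
FILLERS = ["oh", "wow", "hey", "uh", "ah", "hmm"]  # demo subset, lowercase


def preprocess_reordered(text):
    """Clean subtitle text, stripping markup before punctuation is removed."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)            # URLs
    text = re.sub(r'<[^>]*>', '', text)                          # HTML-style tags
    text = re.sub(r'\{.*?\}|\(.*?\)|\[.*?\]', '', text)          # bracketed annotations
    text = re.sub(r'^[A-Z]+:\s*', '', text, flags=re.MULTILINE)  # speaker labels at line start
    text = text.lower()

    # Expand contractions while the apostrophes are still present.
    keys = sorted(CONTRACTIONS, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(key) for key in keys))
    text = pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

    # Remove filler words only when they stand alone as whole words.
    text = re.sub(r'\b(?:' + '|'.join(FILLERS) + r')\b', '', text)

    text = re.sub(r'[^a-z\s]', '', text)       # drop digits and punctuation
    text = re.sub(r'[^\S\n]+', ' ', text)      # collapse spaces and tabs, keep newlines
    text = re.sub(r'\s*\n\s*', '\n', text)     # drop blank lines and edge spaces
    return text.strip()


sample = "FRAN: Oh, it's <i>nice</i> [laughs] in Ohio.\nDon't worry (sighs) 42 times"
cleaned = preprocess_reordered(sample)
print(repr(cleaned))
```

Note that "Ohio" survives because the filler pattern uses word boundaries, while a standalone "Oh" is removed.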


In [None]:
# Apply preprocessing to the 'content' column
tqdm.pandas()
data['content'] = data['content'].progress_apply(preprocess_text)

100%|██████████| 24749/24749 [13:21<00:00, 30.90it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['content'] = data['content'].progress_apply(preprocess_text)


In [None]:
data['content'][6663]

'\r\nthank you so much\r\nfor coming by dr reynolds\r\nits nice to find a doctor\r\nthat will make house calls\r\nits nice to find a patient\r\nwho can afford them\r\ni just dont know what to\r\ndo anymore fran is obsessing\r\nabout still not being pregnant\r\nno need to worry\r\nall we have to do\r\nis talk to her calmly\r\nlet her know that shes special\r\nand that theres nothing\r\nwrong with her\r\n hi fran\r\n hi\r\nwhy dont you tell me\r\nwhy youre upside down\r\nwell i was\r\nwatching the view\r\nand barbara walters says\r\nthat this helps you\r\nget pregnant\r\nplus you know\r\nim taking those hormones\r\nand i also got some fancy herbs\r\nand hes been taking zinc\r\n no i havent\r\n yeah you have\r\nremember those tic tacs\r\nthat you said that you thought\r\ntasted a little chalky\r\nthis is sick\r\nyou are obsessive now stop\r\nplease talk to her doctor\r\nthis is sick\r\nyou are obsessive now stop\r\nnow max and fran\r\nim getting the feeling\r\nthat youre not following\r\n

In [None]:
data['content'][55699]

'\r\nwatch any video online with opensubtitles\r\nfree browser extension osdblinkext\r\nsurname boyer\r\nfirst name marc rgis jean\r\nborn on  april  in boulogne\r\nto boyer paul musician\r\nand bor isabelle unemployed\r\ncould you please sign\r\n here it is thank you\r\n thank you\r\nisabelle the child is crying\r\n what will you have\r\n schweppes\r\n have you thought about it\r\n a little\r\ni dont know sounds risky\r\nthere is no danger believe me\r\neverything has been meticulously prepared\r\nbe sure wed all benefit from\r\na success of that job\r\nlets say i accept\r\nhow will i get my money\r\nhalf in new york and the rest\r\nwhen you return to paris\r\nwhen do you need my answer\r\n when do you go on tour\r\n in about a month and a half\r\nwell\r\nif i get your answer  days\r\nbefore departure thats fine\r\nyou have some time think again\r\nfor you its still a lot of money\r\nnow if you dont need the money\r\nmy turn i have to go\r\nlets meet somewhere else\r\nwell be more com

### Split the documents into overlapping chunks so we don't lose any information

In [None]:
def chunk_documents(text, size=600, overlap=100):
    """
    Breaks down a large string into overlapping chunks of a specified size.
    """
    chunks = []
    for start in range(0, len(text), size-overlap):
        end = start + size
        chunk = text[start:end]
        chunks.append(chunk)
    return chunks
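
The loop above keeps stepping until `start` reaches the end of the text, so the last chunk can be a short tail that is already fully contained in the previous chunk. The variant below is a sketch of the same sliding window that stops as soon as a chunk reaches the end of the text and rejects an overlap that would stall the window. The name `chunk_text` is hypothetical.

```python
def chunk_text(text, size=600, overlap=100):
    """Split text into overlapping character windows without a redundant tail."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once this window already covers the end of the text.
        if start + size >= len(text):
            break
    return chunks


# Small numbers make the stride visible: size 4, overlap 1, so the step is 3.
demo_chunks = chunk_text("abcdefghij", size=4, overlap=1)
print(demo_chunks)
```

With the same arguments, the original loop would append one more chunk containing only the final character, which adds an extra vector to the index without adding information.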


In [None]:
data['chunks']= data['content'].progress_apply(chunk_documents)

100%|██████████| 24749/24749 [00:01<00:00, 13119.16it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['chunks']= data['content'].progress_apply(chunk_documents)


In [None]:
data['chunks'][47818]

['welcome to the repair shop where\nprecious but faded treasures\ntheres an awful lot of work\nto do here\nthings are definitely going to have\nto get worse before they get better\nare restored\nto their former glory\nlook at that\nfurniture restorer jay blades\nbringing history back to life\nis what makes the repair shop\nso special\nand a dream team\nof expert craftspeople\nsolid as a rock\nits actually quite miraculous\nto be honest\ncome together to work\ntheir magic\nlook at that tailormade\njust got to keep calm and carry on\nok here we go\nits going to look great\nemploying heritage craft skills\npassed down the ',
 ' and carry on\nok here we go\nits going to look great\nemploying heritage craft skills\npassed down the generations\nthis is how it was so this\nis how it will be again\npreserving irreplaceable\nheirlooms\nsome objects can have so much\nemotional attachment to the family\nand thats what pushes me\nto want to get it right\nthe team will restore the items\nthe memori

### Vectorize the chunks with sentence-transformers

In [None]:
!pip install sentence_transformers

Collecting sentence_transformers
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.11.0->sentence_transform

In [None]:
from sentence_transformers import SentenceTransformer, util

In [30]:
# model = SentenceTransformer('all-distilroberta-v1')

In [None]:
!pip install chromadb

Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.2-py3-none-any.whl (91 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.29.0-py3-none-any.whl (60 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2.

### Store the chunk vectors in a ChromaDB collection

In [34]:
import chromadb
from chromadb.utils import embedding_functions

CHROMA_DATA_PATH = "/content/drive/MyDrive/Colab Notebooks/chroma_data1/"
EMBED_MODEL = "all-distilroberta-v1"
COLLECTION_NAME = "movie_subtitle_collection"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)


In [35]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL,
    device=device  # Set the device for the embedding function
)


collection = client.create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)


In [None]:
# Iterate over each row of the DataFrame and add chunks individually
for index, row in data.iterrows():
    # Create the metadata dictionary
    metadata = {"source": row['name']}  # Store the subtitle file name under the 'source' key
    # Prepare document(s) and ID(s)
    document_chunks = row['chunks']  # This should be a list of chunks for the current row
    document_ids = [f"id{index}_{i}" for i, _ in enumerate(document_chunks)]  # Unique IDs for each chunk

    # Add the chunks to the collection
    collection.add(documents=document_chunks, ids=document_ids, metadatas=[metadata for _ in document_chunks])
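
The loop above issues one `collection.add` call per subtitle file. A possible alternative, sketched below, is to flatten all chunks into records and group them into fixed-size batches, which might reduce per-call overhead. The sketch is pure Python and does not call ChromaDB itself; `iter_chunk_records` and `batched_records` are hypothetical helpers, and the batch size is an arbitrary demo value.

```python
from itertools import islice


def iter_chunk_records(rows):
    """Yield (id, document, metadata) for every chunk of every (index, name, chunks) row."""
    for index, name, chunks in rows:
        for i, chunk in enumerate(chunks):
            # Same ID scheme and metadata key as the loop above.
            yield f"id{index}_{i}", chunk, {"source": name}


def batched_records(records, batch_size):
    """Group records into (ids, documents, metadatas) batches of at most batch_size."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        ids, documents, metadatas = (list(column) for column in zip(*batch))
        yield ids, documents, metadatas


# Demo rows standing in for (index, row['name'], row['chunks']).
demo_rows = [(7, "a.eng.1cd", ["x", "y", "z"]), (9, "b.eng.1cd", ["w"])]
demo_batches = list(batched_records(iter_chunk_records(demo_rows), batch_size=3))
print(demo_batches)
```

Each yielded batch could then be passed to `collection.add(documents=documents, ids=ids, metadatas=metadatas)`, mirroring the call used in the loop above.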


In [1]:
# # Iterate over each row of the DataFrame and add chunks individually
# for index, row in tqdm(data.iterrows(),total=len(data), desc="Adding chunks documents to collection"):
#     # Create the metadata dictionary
#     metadata = {"source": row['name']}  # Assuming 'source' is a column in your DataFrame
#     # Prepare document(s) and ID(s)
#     document_chunks = row['chunks']  # This should be a list of chunks for the current row
#     document_ids = [f"id{index}_{i}" for i, _ in enumerate(document_chunks)]  # Unique IDs for each chunk

#     # Add the chunks to the collection
#     collection.add(documents=document_chunks, ids=document_ids, metadatas=[metadata for _ in document_chunks])


### Query the collection and display the semantic search results from ChromaDB

In [46]:
query_results = collection.query(
    query_texts=["what was the condition of Nicole Garder's?"],
    n_results=1,
)
query_results

{'ids': [['id71894_39']],
 'distances': [[0.5873623490333557]],
 'metadatas': [[{'source': 'csi.ny.s03.e09.and.heres.to.you.mrs.azrael.(2006).eng.1cd'}]],
 'embeddings': None,
 'documents': [['tubes away\r\ncause this things\r\nabout to bust wide open\r\nellen garner had an insurance\r\npolicy on her daughter\r\nand mommy dearest\r\nis the beneficiary\r\nellen has something to gain by\r\nher daughters untimely death\r\nmeans we got a motive\r\ndid you swab the heartsensor\r\npads for dna\r\nyeah the three pads you pulled\r\nfrom nicole came back to her\r\nno surprise but the fourth pad\r\nwas a lowlevel sample\r\nall i could get was amelogenin\r\nher unknown donor was female\r\nthe question is\r\nwhy was she wearing a sensor pad\r\nin nicoles room\r\nthats why nicoles heart rate\r\nnever indicated she was\r\nbeing suffocated\r\no']],
 'uris': None,
 'data': None}
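
Because the collection was created with `"hnsw:space": "cosine"`, the value under `'distances'` should, to my understanding, be a cosine distance, that is, one minus the cosine similarity of the query and chunk embeddings. Under that assumption, a distance of about 0.587 corresponds to a similarity of about 0.413. The pure-Python sketch below illustrates the formula on toy vectors; it does not reproduce ChromaDB's internals.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])       # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # opposite direction
print(same_direction, orthogonal, opposite)
```

Smaller values mean closer matches, and the measure ignores vector length, so only the direction of the embeddings matters.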

In [47]:
query_results = collection.query(
    query_texts=["What was the process of analyzing satellite photos to identify potential locations?"],
    n_results=1,
)
query_results

{'ids': [['id68199_11']],
 'distances': [[0.7553659677505493]],
 'metadatas': [[{'source': 'csi.miami.s01.e09.kill.zone.(2002).eng.1cd'}]],
 'embeddings': None,
 'documents': [['whats the first thing\r\ni would do\r\nyoud pick your spot\r\nprone position\r\nis best for shooting\r\nright the problem is\r\nis this wall\r\nobscures my view of the target\r\nyeah so maybe\r\nyou were kneeling\r\nand maybe i went higher\r\ntake a look at that\r\nso what do you get\r\nwhen a sixfoottall man\r\nlays down with\r\na threefootlong rifle\r\nhot flashes\r\nbut thats just me\r\nwhat you get is a gsr cone\r\nthis is his location\r\nthis is where he shot from\r\nthis is a tough location\r\nfor exposure\r\nyou can be seen by a helicopter\r\nany one of these buildings\r\nive got burlap\r\nwith gravel glued on\r\ncamouflage\r\nbetter than ca']],
 'uris': None,
 'data': None}
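
The result dictionaries above nest every field one list per query and one entry per hit. The sketch below flattens that shape into one record per hit, which is easier to print or rank. `flatten_query_results` is a hypothetical helper, and the demo input copies the first result shown above with the document text shortened for illustration.

```python
def flatten_query_results(results):
    """Turn a nested query result into a flat list of one dict per hit."""
    rows = []
    for query_index, ids in enumerate(results["ids"]):
        for rank, doc_id in enumerate(ids):
            rows.append({
                "query_index": query_index,
                "rank": rank,
                "id": doc_id,
                "distance": results["distances"][query_index][rank],
                "source": results["metadatas"][query_index][rank]["source"],
                "document": results["documents"][query_index][rank],
            })
    return rows


# Same structure as the output above; the document string is shortened here.
demo_results = {
    "ids": [["id71894_39"]],
    "distances": [[0.5873623490333557]],
    "metadatas": [[{"source": "csi.ny.s03.e09.and.heres.to.you.mrs.azrael.(2006).eng.1cd"}]],
    "documents": [["tubes away\r\ncause this things\r\nabout to bust wide open"]],
}

for hit in flatten_query_results(demo_results):
    print(f"{hit['distance']:.3f}  {hit['source']}  ({hit['id']})")
```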