## Embeddings, Vector Databases, and Search using ChromaDB and Transformers

In [None]:
# !pip install chromadb==0.3.21 tiktoken==0.3.3

In [6]:
# !pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2


In [11]:
import pandas as pd

qadf = pd.read_excel("data/Copy of Merged-QuestionsAnswers.xlsx")
display(qadf)

Unnamed: 0,Questions,Answers
0,What-is-it-like-to-be-an-AI-developer,Being an AI developer can be both rewarding an...
1,How-useful-is-R-in-AI-Development,R is a powerful language for data analysis and...
2,Will-coding-become-less-important-as-AI-develops,Coding will always be an important part of dev...
3,Will-AI-develop-or-adopt-religion,It is unlikely that AI will develop or adopt a...
4,Can-an-AI-develop-cognitive-dissonance,It is possible for an AI to develop cognitive ...
...,...,...
16166,Do-sales-managers-travel-a-lot,"Yes, sales managers often travel a lot. They m..."
16167,Whats-the-pay-like-at-Pinterest-for-sales-mana...,The pay for a Sales Manager role at Pinterest ...
16168,What-profile-a-MBA-Marketing-fresher-should-jo...,A MBA Marketing fresher should join a profile ...
16169,What-feature-must-be-included-for-sales-manage...,1. Customer Relationship Management (CRM) soft...


In [12]:
def strip_hyphen(x):
    return x.replace('-', ' ')

qadf['Questions'] = qadf["Questions"].apply(strip_hyphen)

In [16]:
qadf.rename(columns={'Answers': 'Contexts'}, inplace=True)

In [34]:
# Build string ids ("id0", "id1", ...) without shadowing the built-in id()
qadf['id'] = [f"id{i}" for i in range(len(qadf))]

In [122]:
qadf["Questions"][23]

'Why is society ignoring the potentially devastating consequences of AI development'

In [37]:
qadf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16171 entries, 0 to 16170
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Questions  16171 non-null  object
 1   Contexts   16025 non-null  object
 2   id         16171 non-null  object
dtypes: object(3)
memory usage: 505.3+ KB
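
The `info()` output above reports 16025 non-null `Contexts` out of 16171 rows, so 146 answers are missing. A minimal sketch of dropping such rows before indexing follows. The `sample` frame and its rows are made up for illustration, and assigning ids after cleaning is an assumption that differs from the notebook, which assigns ids before any filtering.

```python
import pandas as pd

# Hypothetical stand-in for qadf; the real frame comes from the Excel file.
sample = pd.DataFrame(
    {
        "Questions": ["q one", "q two", "q three"],
        "Contexts": ["answer one", None, "answer three"],
    }
)

# Keep only rows that have a context, then rebuild a contiguous index.
clean = sample.dropna(subset=["Contexts"]).reset_index(drop=True)

# Assign ids after cleaning so they line up with the surviving rows.
clean["id"] = [f"id{i}" for i in range(len(clean))]

print(clean)
```

The notebook only indexes the first 100 rows, so this step would matter mainly when the whole frame is added to the collection.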


In [18]:
contexts = qadf["Contexts"].to_list()

In [19]:
import chromadb
from chromadb.config import Settings

chroma_client = chromadb.Client(
    Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory="chroma_data", 
    )
)

Using embedded DuckDB with persistence: data will be stored in: chroma_data


In [20]:
collection_name = "tech_ans"

In [21]:
# Recreate the collection from scratch: delete it if it already exists, then create it
existing_names = [c.name for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)

print(f"Creating collection: '{collection_name}'...")
talks_collection = chroma_client.create_collection(name=collection_name)
print("Collection Created successfully!")

No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


Creating collection: 'tech_ans'...


Collection Created successfully!
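
Conceptually, a vector store keeps one embedding per document and answers a query by returning the stored vectors closest to the query's embedding. The toy sketch below illustrates that idea with a brute-force scan. The vectors, the `nearest` helper, and the choice of Euclidean distance are all assumptions for illustration and do not describe ChromaDB's internals, which typically rely on an approximate index rather than a full scan.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, store, n_results=2):
    """Return the ids of the n_results stored vectors closest to query."""
    ranked = sorted(store, key=lambda doc_id: euclidean(query, store[doc_id]))
    return ranked[:n_results]


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_store = {
    "id0": [0.9, 0.1, 0.0],
    "id1": [0.0, 1.0, 0.0],
    "id2": [0.8, 0.2, 0.1],
}

neighbours = nearest([1.0, 0.0, 0.0], toy_store)
print(neighbours)
```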


In [38]:
talks_collection.add(
    documents=qadf["Contexts"][:100].tolist(),
    ids=qadf['id'][:100].tolist()
)
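
The cell above indexes only the first 100 rows. One way to index a larger frame is to add documents in fixed-size batches. The sketch below shows only the batching logic, with toy lists standing in for `qadf["Contexts"]` and `qadf["id"]`. The `batched` helper and the batch size are hypothetical, and the `talks_collection.add` call is left as a comment so the example runs without ChromaDB.

```python
from typing import Iterator, List, Sequence, Tuple


def batched(
    documents: Sequence[str], ids: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield aligned (documents, ids) chunks of at most batch_size items."""
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        yield list(documents[start:end]), list(ids[start:end])


# Toy data standing in for the notebook's columns.
docs = [f"document {i}" for i in range(7)]
doc_ids = [f"id{i}" for i in range(7)]

batches = list(batched(docs, doc_ids, batch_size=3))
for batch_docs, batch_ids in batches:
    # In the notebook, this is where
    # talks_collection.add(documents=batch_docs, ids=batch_ids) would go.
    print(len(batch_docs), batch_ids)
```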

In [40]:
import json

results = talks_collection.query(
    query_texts="what is it like to be an AI Developer",
    n_results=5
)

print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id0",
            "id42",
            "id45",
            "id78",
            "id79"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "Being an AI developer can be both rewarding and challenging. On the one hand, AI developers have the opportunity to create innovative solutions to complex problems, and to make a real difference in the world. On the other hand, AI development requires a deep understanding of both the technology and the domain in which it is being applied. AI developers must also be able to think critically and creatively, and to work with a wide range of stakeholders.",
            "1. Get a degree in computer science, mathematics, or a related field.\n2. Learn the fundamentals of AI, such as machine learning, deep learning, natural language processing, and computer vision.\n3. Gain experience with programming languages such as Python, Java, and C++.\n4. Familiarize yourself with AI frameworks such a
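
The query result is a dictionary of nested lists: the outer list has one entry per query text, and the inner list holds the hits for that query. A small sketch of pairing each id with its document follows. The `results` dictionary here is a shortened, hand-written stand-in with the same shape as the output above, and `top_hits` is a hypothetical helper.

```python
# Hand-written stand-in shaped like the query output (documents shortened).
results = {
    "ids": [["id0", "id42"]],
    "documents": [["Being an AI developer ...", "1. Get a degree ..."]],
}


def top_hits(results, query_index=0):
    """Pair each returned id with its document for one query."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    return list(zip(ids, documents))


hits = top_hits(results)
for doc_id, text in hits:
    print(doc_id, "->", text[:40])
```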

In [52]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# model_id = 'EleutherAI/gpt-neo-125M'
model_id = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    device_map="auto",
    handle_long_generation="hole",
)

In [123]:

question = 'Why is society ignoring the potentially devastating consequences of AI development'

In [124]:
results = talks_collection.query(
    query_texts=question,
    n_results=5
)

In [125]:
context = results['documents'][0][0]

prompt_template = f"Answer the given question only using the context provided. Do not Hallucinate.\n\nContext: {context}\n\nQuestion: {question}\n\n\
Answer:"
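
The prompt above uses only the top-ranked document. A sketch of a variant that joins several retrieved contexts is shown below. `build_prompt` and `max_contexts` are hypothetical names, and the example passages are made up. Longer prompts consume more of the model's context window (1024 tokens for GPT-2), so the number of contexts is kept small here.

```python
def build_prompt(question, contexts, max_contexts=3):
    """Join the top retrieved contexts into a single prompt string."""
    selected = contexts[:max_contexts]
    context_block = "\n\n".join(selected)
    return (
        "Answer the given question only using the context provided. "
        "Do not Hallucinate.\n\n"
        f"Context: {context_block}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


# Made-up passages standing in for results['documents'][0].
example_contexts = [
    "First passage.",
    "Second passage.",
    "Third passage.",
    "Fourth passage.",
]
prompt = build_prompt("What is in the passages", example_contexts, max_contexts=2)
print(prompt)
```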

In [127]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the given question only using the context provided. Do not Hallucinate.

Context: Society is largely ignoring the potentially devastating consequences of AI development because it is difficult to predict the long-term effects of AI. Additionally, the potential benefits of AI development are often seen as outweighing the potential risks. AI is seen as a tool that can help us solve many of the world's problems, and the potential for AI to be used for malicious purposes is often overlooked. Additionally, the development of AI is often seen as a way to create jobs and economic growth, which can be seen as more important than the potential risks.

 Question: Why is society ignoring the potentially devastating consequences of AI development

Answer: The reason is simple: human beings don't know what this will be like. How is it possible that such a development will take generations to occur? Many social scientists argue that it is quite possible that AI will be used for some other rea
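
As the output shows, the text-generation pipeline returns the prompt followed by the completion. A minimal sketch of keeping only the newly generated part is given below. `extract_answer` is a hypothetical helper and the strings are made up. The pipeline also accepts `return_full_text=False`, which should make this post-processing unnecessary.

```python
def extract_answer(prompt, generated_text):
    """Return only the text the model added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to splitting on the last "Answer:" marker.
    return generated_text.rsplit("Answer:", 1)[-1].strip()


# Made-up prompt and completion for illustration.
prompt = "Context: some context\n\nQuestion: some question\n\nAnswer:"
generated = prompt + " A made-up completion."

answer = extract_answer(prompt, generated)
print(answer)
```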

In [71]:
from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M')

In [128]:
lm_response = generator(prompt_template, do_sample=True, min_length=20, max_new_tokens=200)
print(lm_response[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer the given question only using the context provided. Do not Hallucinate.

Context: Society is largely ignoring the potentially devastating consequences of AI development because it is difficult to predict the long-term effects of AI. Additionally, the potential benefits of AI development are often seen as outweighing the potential risks. AI is seen as a tool that can help us solve many of the world's problems, and the potential for AI to be used for malicious purposes is often overlooked. Additionally, the development of AI is often seen as a way to create jobs and economic growth, which can be seen as more important than the potential risks.

 Question: Why is society ignoring the potentially devastating consequences of AI development

Answer: As usual, AI is typically viewed as a threat to human flourishing. Yet, the benefits and risks of AI are rarely mentioned in the social sciences or in industry as compared with the benefits of government-mandated or individual skills. The 

In [126]:
print(json.dumps(results, indent=4))

{
    "ids": [
        [
            "id23",
            "id31",
            "id99",
            "id74",
            "id47"
        ]
    ],
    "embeddings": null,
    "documents": [
        [
            "Society is largely ignoring the potentially devastating consequences of AI development because it is difficult to predict the long-term effects of AI. Additionally, the potential benefits of AI development are often seen as outweighing the potential risks. AI is seen as a tool that can help us solve many of the world's problems, and the potential for AI to be used for malicious purposes is often overlooked. Additionally, the development of AI is often seen as a way to create jobs and economic growth, which can be seen as more important than the potential risks.",
            "1. Lack of Diversity: AI development is often dominated by a small group of people with similar backgrounds and experiences. This limits the potential of AI to be truly innovative and creative.\n2. Over-Re