## Retrieval: Similarity search

In [1]:
from configs import API_KEY, DEFAULT_MODEL
from langchain_openai.embeddings import OpenAIEmbeddings
import os
import openai
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [2]:
# Expose the key through the environment so both the openai client and LangChain can read it
os.environ["OPENAI_API_KEY"] = API_KEY
openai.api_key = os.getenv('OPENAI_API_KEY')

embedding = OpenAIEmbeddings(model='text-embedding-ada-002')

In [3]:
vectorstore_from_dir = Chroma(persist_directory = "./into-to-DS-vectorstore", embedding_function = embedding)



#### Adding a duplicate entry

In [4]:
added_doc = Document(page_content='Alright! So… Let’s discuss the not-so-obvious differences between the terms analysis and analytics. Due to the similarity of the words, some people believe they share the same meaning, and thus use them interchangeably. Technically, this isn’t correct. There is, in fact, a distinct difference between the two. And the reason for one often being used instead of the other is the lack of a transparent understanding of both. So, let’s clear this up, shall we? First, we will start with analysis. Consi',
                     metadata = {'Course Title': 'Introduction to Data and Data Science',
                                 'Lecture Title': 'Analysis vs Analytics'}
                    )

vectorstore_from_dir.add_documents([added_doc])

['4fb2d5f6-43e6-4791-bd31-3cd59bf2efce']

In [19]:
question = "What programming languages do data scientists use ?"

In [20]:
retrieved_docs = vectorstore_from_dir.similarity_search(query = question, k=5)

In [21]:
for i in retrieved_docs:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: What about big data? Apart from R and Python, people working in this area are often proficient in other languages like Java or Scala. These two have not been developed specifically for doing statistical analyses, however they turn out to be very useful when combining data from multiple sources. All right! Let’s finish off with machine learning. When it comes to machine learning, we often deal with big data. Thus, we need a lot of computational power, and we can expect people to use the languages
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specif
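
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The block below is a minimal sketch of that ranking step only, not Chroma's implementation: the two-dimensional `toy_docs` vectors, the `toy_query` vector, and the helpers `cosine` and `top_k` are hypothetical stand-ins, and cosine similarity is assumed as the metric (the metric a real collection uses depends on how it was configured). It also shows why the duplicate entry added earlier matters: identical vectors score identically, so both copies can occupy slots in the result.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings"; index 3 is an exact duplicate of index 1.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.9, 0.1]]
toy_query = [1.0, 0.05]

top = top_k(toy_query, toy_docs, k=3)
print(top)  # both copies of the duplicated vector appear in the top 3
```

Plain similarity ranking has no notion of redundancy, which is the gap maximum marginal relevance search is meant to close.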

## Maximum marginal relevance search

In [22]:
question = "What software do data scientists use?"

In [23]:
retrieved_docs = vectorstore_from_dir.similarity_search(query = question, k=3)
for i in retrieved_docs:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action. As you can see from the infographic, R, and Python are the
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations. In ter

In [25]:
retrieved_docs_mmr = vectorstore_from_dir.max_marginal_relevance_search(query = question, k=3, lambda_mult = 0.1)
for i in retrieved_docs_mmr:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action. As you can see from the infographic, R, and Python are the
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: g formulas and algorithms to numbers you have gathered from your analysis. Here are a couple of examples. Say, you are an owner of an online clothing store. You are ahead of the competition and have a great understanding of what your customer's needs and wants are. You’ve performed a very detailed analysis from women’s clothing articles and

In [26]:
retrieved_docs_mmr = vectorstore_from_dir.max_marginal_relevance_search(query = question, 
                        k=3, 
                        lambda_mult = 0.1,
                       filter={"Lecture Title":"Programming Languages & Software Employed in Data Science - All the Tools You Need"})
for i in retrieved_docs_mmr:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action. As you can see from the infographic, R, and Python are the
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: l analysis. Among the many applications we have plotted, we can say there is an increasing amount of software designed for working with big data such as Apache Hadoop, Apache Hbase, and Mongo DB. In terms of big data, Hadoop is the name that must stick with you. Hadoop is listed as a software in the sense that it is a collection of programs

In [27]:
retrieved_docs_mmr = vectorstore_from_dir.max_marginal_relevance_search(query = question, 
                        k=3, 
                        lambda_mult = 0.7,
                       filter={"Lecture Title":"Programming Languages & Software Employed in Data Science - All the Tools You Need"})
for i in retrieved_docs_mmr:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action. As you can see from the infographic, R, and Python are the
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations. In ter

In [28]:
retrieved_docs_mmr = vectorstore_from_dir.max_marginal_relevance_search(query = question, 
                        k=3, 
                        lambda_mult = 1,
                       filter={"Lecture Title":"Programming Languages & Software Employed in Data Science - All the Tools You Need"})
for i in retrieved_docs_mmr:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action. As you can see from the infographic, R, and Python are the
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations. In ter
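
The three runs above differ only in `lambda_mult`, which trades relevance to the query against diversity among the chunks already selected: values near 0 favor diversity, and a value of 1 reduces to plain relevance ranking. The block below is a minimal sketch of the greedy MMR selection rule under stated assumptions, not LangChain's code: the helpers `cosine` and `mmr` and the toy vectors are hypothetical, cosine similarity is assumed as the metric, and the `fetch_k` candidate pool that the real method draws from is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mmr(query_vec, doc_vecs, k, lambda_mult):
    """Greedy maximal marginal relevance over a small candidate list."""
    candidates = list(range(len(doc_vecs)))
    if not candidates or k <= 0:
        return []
    # The first pick is always the candidate most relevant to the query.
    first = max(candidates, key=lambda i: cosine(query_vec, doc_vecs[i]))
    selected = [first]
    candidates.remove(first)
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalize similarity to the closest already-selected vector.
            redundancy = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical toy "embeddings": 0 and 1 are near-duplicates, 2 is distinct.
toy_docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
toy_query = [1.0, 0.1]

for lam in (0.1, 0.7, 1.0):
    print(lam, mmr(toy_query, toy_docs, k=2, lambda_mult=lam))
```

In this toy setup a low `lambda_mult` skips the near-duplicate in favor of the distinct vector, while 0.7 and 1.0 select the same pair, which mirrors the pattern in the outputs above.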

## Retriever

In [31]:
retriever = vectorstore_from_dir.as_retriever(search_type = 'mmr',
                                             search_kwargs = {'k':3, 'lambda_mult':0.7})

In [32]:
retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x000002C9F8F69810>, search_type='mmr', search_kwargs={'k': 3, 'lambda_mult': 0.7})

In [33]:
retrieved_docs = retriever.invoke(question)

In [34]:
for i in retrieved_docs:
    print(f"Page content: {i.page_content}\n------------------------------\nLecture Title:{i.metadata['Lecture Title']}\n")

Page content: Alright! So… How are the techniques used in data, business intelligence, or predictive analytics applied in real life? Certainly, with the help of computers. You can basically split the relevant tools into two categories—programming languages and software. Knowing a programming language enables you to devise programs that can execute specific operations. Moreover, you can reuse these programs whenever you need to execute the same action. As you can see from the infographic, R, and Python are the
------------------------------
Lecture Title:Programming Languages & Software Employed in Data Science - All the Tools You Need

Page content: to address the complexity of big data and its computational intensity. Most notably, Hadoop distributes the computational tasks on multiple computers which is basically the way to handle big data nowadays. Power BI, SaS, Qlik, and especially Tableau are top-notch examples of software designed for business intelligence visualizations. In ter