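# Analysis of Twitter the-algorithm source code with LangChain, GPT-4 and Activeloop's Deep Lake
In this tutorial, we use LangChain and Activeloop's Deep Lake with GPT-4 to analyze the code base of the Twitter algorithm. The cells below run a local variant of that setup: Ollama provides both the embeddings and the LLM, and a FAISS index stands in for the Deep Lake vector store.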

Define the embeddings and the vector store API, then authenticate. For the full Deep Lake documentation, see the [docs](https://docs.activeloop.ai/) and the [API reference](https://docs.deeplake.ai/en/latest/).

Authenticate with Deep Lake if you want to create your own dataset and publish it. You can get an API key from the [platform](https://app.activeloop.ai).

In [3]:
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
import os
from langchain_community.vectorstores import FAISS

In [5]:
#embeddings = OpenAIEmbeddings(disallowed_special=())
embeddings = OllamaEmbeddings(model='nomic-embed-text:v1.5')

`disallowed_special=()` applies to the commented-out `OpenAIEmbeddings` option. It is required to avoid `Exception: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte` from tiktoken for some repositories.

### 1. Index the code base (optional)
You can skip this part and jump straight to using an already indexed dataset. Otherwise, we first clone the repository, then parse and chunk the code base, and finally compute embeddings to index it.

Load all files inside the repository
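
The loading cell itself is not shown here. As a minimal sketch of this step, assuming the repository has already been cloned to a local directory, the tree can be walked and every readable source file collected. The function name `load_source_files` and the default extension list are illustrative assumptions, not part of the original notebook.

```python
import os


def load_source_files(root_dir, extensions=(".scala", ".py", ".java")):
    """Return a list of {"source": path, "text": contents} for matching files."""
    documents = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in sorted(filenames):
            if not filename.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
            except UnicodeDecodeError:
                # Skip binary or non-UTF-8 files instead of aborting the run.
                continue
            documents.append({"source": path, "text": text})
    return documents
```

Keeping the file path under a `source` key mirrors the `metadata["source"]` field that the filter later in this notebook relies on.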

Then, chunk the files
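
The chunking cell is likewise not shown. A minimal sketch of the idea, using a plain fixed-size character splitter as a stand-in for a LangChain text splitter, could look like the following. The name `chunk_text` and the default sizes are assumptions made for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters so that context
    cut at a boundary still appears in the next piece.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```

Each chunk would then be embedded separately, so smaller chunks give more precise retrieval at the cost of more vectors to compute and store.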

Execute the indexing. It takes about 4 minutes to compute the embeddings and upload them to Activeloop. You can then make the dataset public.

`Optional`: You can also use Deep Lake's Managed Tensor Database as a hosting service and run queries there. To do so, specify the runtime parameter as `{'tensor_db': True}` when creating the vector store. Queries then run on the Managed Tensor Database rather than on the client side. This does not apply to datasets stored locally or in memory. If a vector store has already been created outside the Managed Tensor Database, it can be transferred there by following the prescribed steps.

In [None]:
# username = "davitbun"  # replace with your username from app.activeloop.ai
# db = DeepLake(
#     dataset_path=f"hub://{username}/twitter-algorithm",
#     embedding_function=embeddings,
#     runtime={"tensor_db": True}
# )
# db.add_documents(texts)

### 2. Question Answering on Twitter algorithm codebase
First load the dataset, then construct the retriever, and finally construct the conversational chain.

In [6]:
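# approximately 37 minutes
faiss_path = "faiss_index"
if not os.path.exists(faiss_path):
    # Fail early: every later cell needs `db` to be defined.
    raise FileNotFoundError(f"FAISS index not found at {faiss_path!r}; build it first (section 1).")
# allow_dangerous_deserialization=True unpickles the stored docstore,
# so only load an index that you created yourself.
db = FAISS.load_local(faiss_path, embeddings, allow_dangerous_deserialization=True)
print(db.index.ntotal)  # number of indexed vectors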

20873


In [7]:
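# "distance_metric" and "maximal_marginal_relevance" are Deep Lake search options.
# With FAISS, MMR is requested through search_type, and the distance metric is
# fixed when the index is built.
# Fetch 100 candidates, then keep the 10 that best balance relevance and diversity.
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 10, "fetch_k": 100})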
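
To make the maximal marginal relevance (MMR) setting concrete, here is a small pure-Python sketch of the selection rule: each pick maximizes relevance to the query minus similarity to what is already selected. It illustrates the technique only and is not the code LangChain runs; `cosine`, `mmr_select`, and the toy vectors in the usage note are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidate_vecs, k=10, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Penalize candidates that resemble something already picked.
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking. Lower values trade relevance for diversity, which helps when many chunks of a code base are near-duplicates.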

You can also specify user-defined functions using [Deep Lake filters](https://docs.deeplake.ai/en/latest/deeplake.core.dataset.html#deeplake.core.dataset.Dataset.filter).

In [8]:
def source_filter(x):  # renamed so it no longer shadows the built-in filter()
    # filter based on source code: drop samples that mention com.google
    if "com.google" in x["text"].data()["value"]:
        return False

    # filter based on path: keep only Scala and Python files by extension
    metadata = x["metadata"].data()["value"]
    return metadata["source"].endswith((".scala", ".py"))


# uncomment the line below for custom filtering
# retriever.search_kwargs['filter'] = source_filter
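
The filter above reads Deep Lake samples through `x["text"].data()`. The index loaded in this notebook is FAISS, and to our understanding the LangChain FAISS wrapper passes only each document's metadata dict to a callable filter. Under that assumption, a sketch of the path-based half of the filter could be written as follows. The name `faiss_metadata_filter` is illustrative.

```python
import os


def faiss_metadata_filter(metadata):
    """Keep only documents whose source path ends in .scala or .py."""
    source = metadata.get("source", "")
    _, extension = os.path.splitext(source)
    return extension in {".scala", ".py"}


# Hypothetical usage with the retriever defined earlier:
# retriever.search_kwargs["filter"] = faiss_metadata_filter
```

Comparing the file extension rather than testing for a substring avoids accidentally matching paths such as `python/README.md`.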

In [10]:
from langchain.chains import ConversationalRetrievalChain


model = Ollama(model="mistral:7b")
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)
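
As a rough mental model of what one turn of such a chain does, the sketch below condenses a follow-up question using the chat history, retrieves context for it, and asks the model to answer from that context. It is a simplified illustration, not LangChain's implementation: `conversational_answer`, the prompt wording, and the `llm` and `retrieve` callables are assumptions.

```python
def conversational_answer(question, chat_history, llm, retrieve):
    """Run one turn: condense the question, retrieve context, answer from it.

    llm is any callable mapping a prompt string to a reply string.
    retrieve is any callable mapping a query string to a list of text chunks.
    """
    if chat_history:
        transcript = "\n".join(
            f"Human: {q}\nAssistant: {a}" for q, a in chat_history
        )
        # Rewrite the follow-up so it can be understood without the history.
        standalone = llm(
            "Rephrase the follow-up as a standalone question.\n"
            f"{transcript}\nFollow-up: {question}"
        )
    else:
        standalone = question
    context = "\n\n".join(retrieve(standalone))
    answer = llm(
        f"Answer using only this context:\n{context}\n\nQuestion: {standalone}"
    )
    return {"question": question, "answer": answer}
```

The condensing step is why the loop in the next cell keeps appending `(question, answer)` pairs to `chat_history`: later questions are interpreted in light of earlier ones before retrieval happens.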

In [13]:
questions = [
    "What is the similarity_search_with_score method?",
    "What is the maximal_marginal_relevance option?",
    "What is the ConversationalRetrievalChain class used for?"
]
chat_history = []

for question in questions:
    # invoke() replaces the deprecated direct call qa({...})
    result = qa.invoke({"question": question, "chat_history": chat_history})
    # keep the history so follow-up questions are answered in context
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What is the similarity_search_with_score method? 

**Answer**:  The method `similarity_search_with_score` is a function provided by the Dingo library, which returns a list of documents along with their scores based on their similarity to a given query. This function uses vector search algorithms and natural language processing techniques to find the most relevant documents in a collection. The scores represent how closely the documents match the query in terms of context relevance, factual accuracy, response completeness, sub-query completeness, context reranking, and context conciseness. By default, it returns up to 4 documents with their respective scores. You can also pass optional arguments like `search_params` to filter on metadata or `timeout` to specify a maximum search duration. If you only need the documents without their scores, you can call the simpler function `similarity_search`. 

-> **Question**: What is the maximal_marginal_relevance option? 

**Answer**:  The `maximal_marginal_relevance` option is not a valid argument for the `similarity_search_with_score` function in Qdrant. It appears that you might be confusing this with another relevance strategy used by Qdrant, which is called "Maximal Marginal Relevance Feedback (MMR)".

MMR is a ranking technique that optimizes for both query relevance and document diversity among the search results. This means that MMR tries to return documents that are most similar to the query while also ensuring that the selected documents cover a wide range of topics, rather than repeating the same information multiple times.

However, the `similarity_search_with_score` function is used to retrieve documents based on their similarity scores to a given query and does not directly support the MMR strategy. If you want to use MMR in Qdrant, you can consider using the `maximal_marginal_relevance_search` function instead.

Here's an example of how to use `maximal_marginal_relevance_search`:

```python
import json
from qdrant_client import QdrantClient, Document, MetadataFilter
from typing import List

# Initialize the Qdrant client
qdrant = QdrantClient()

# Define your documents and index them
documents = [
    {"id": 1, "content": "The quick brown fox jumps over the lazy dog"},
    {"id": 2, "content": "A red fox jumps over a rabbit hole"},
    {"id": 3, "content": "A snow leopard is a large cat native to Central Asia"}
]
for doc in documents:
    qdrant.upsert([Document(data=doc)], index="my_index")

# Perform a maximal marginal relevance search
search_results = qdrant.maximal_marginal_relevance_search("fox", k=3, lambda_mult=0.5)
print(json.dumps(search_results, indent=2))
```

In this example, we perform a maximal marginal relevance search for the term "fox" and return three documents that are most similar to the query while also ensuring some level of diversity in the search results (specified by the `lambda_mult` parameter). 

-> **Question**: What is the ConversationalRetrievalChain class used for? 

**Answer**:  The `ConversationalRetrievalChain` class is a part of Langchain library and is designed to retrieve information from a given dataset based on a conversational interaction between the user and an assistant. It's responsible for processing user input, generating a question to be answered by the language model, retrieving relevant documents using the `BaseRetriever` instance, and combining those documents into a single context for the language model.

This class is particularly useful when dealing with large datasets or situations where the desired information might not be present in a single document but can be gleaned from multiple sources. Additionally, it allows for a conversational interaction between the user and the assistant, making the retrieval process more engaging and user-friendly. 