# LangChain & GPT-4 for Code Understanding: Twitter Algorithm

In this lesson we will explore how LangChain, Deep Lake, and GPT-4 can transform our understanding of complex codebases, such as Twitter's open-sourced recommendation algorithm.

## Introduction

This approach lets us put any question directly to the source code, which significantly speeds up code comprehension.

## The Workflow

Understanding source code with LangChain takes four steps:

1. Install the necessary libraries (langchain, deeplake, openai, and tiktoken) and authenticate with Deep Lake and OpenAI.
2. Optionally, index a codebase: clone the repository, parse the code, divide it into chunks, and embed the chunks with OpenAI.
3. Establish a Conversational Retriever Chain: load the dataset, set up the retriever, and connect it to a language model such as GPT-4 for question answering.
4. Query the codebase in natural language and retrieve answers. The lesson ends by asking several questions about the indexed codebase.
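Before starting, it can help to confirm that the credentials from step 1 are in place. The following is a minimal sketch, assuming the keys are supplied through the `OPENAI_API_KEY` and `ACTIVELOOP_TOKEN` environment variables; the helper name `missing_credentials` is hypothetical and not part of any library.

```python
import os

# Assumption: credentials are provided through these environment variables.
REQUIRED_ENV_VARS = ("OPENAI_API_KEY", "ACTIVELOOP_TOKEN")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Set these before continuing:", ", ".join(missing))
    else:
        print("Credentials found.")
```

Running this check first surfaces a missing key immediately, rather than partway through a long indexing run.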

### What is LangChain Conversational Retriever Chain?
A Conversational Retriever Chain is a retrieval-centric system that interacts with data stored in a vector store such as Deep Lake. It retrieves the code snippets and details most relevant to a user's request, using similarity search over embeddings together with optional filtering and ranking. Because it takes the conversation history into account, follow-up questions are answered in the context of earlier ones.
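To make the idea concrete, here is a toy sketch of the three moves such a chain makes: fold the chat history into the new question, retrieve the closest snippets, and assemble a prompt. It is not LangChain's implementation. Word counts stand in for real embeddings, `condense` stands in for the step where a language model rewrites the question, and the snippets are hypothetical strings rather than code from the repository.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def condense(question, chat_history):
    """Stand-in for the LLM rewrite step: prepend the previous question."""
    if not chat_history:
        return question
    previous_question, _previous_answer = chat_history[-1]
    return f"{previous_question} {question}"


def retrieve(query, snippets, k=2):
    """Return the k snippets most similar to the query."""
    query_vec = embed(query)
    return sorted(snippets, key=lambda s: cosine(query_vec, embed(s)), reverse=True)[:k]


def build_prompt(question, context):
    """Combine retrieved snippets and the question into one prompt string."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical snippets, for illustration only.
snippets = [
    "def assignSimClusters(userId: Long): Seq[ClusterId]",
    "object HeavyRanker { def score(features: FeatureMap): Double }",
    "val favCountParams = LinearRankingParams(weight = 30.0)",
]

history = [("What does favCountParams do?", "It is a ranking parameter.")]
standalone = condense("Is it likes plus bookmarks?", history)
top = retrieve(standalone, snippets, k=1)
print(build_prompt(standalone, top))
```

The follow-up question never mentions `favCountParams`, yet the matching snippet is retrieved because the history was folded into the query first.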

### Twitter Recommendation Pipeline

Twitter’s open-sourced recommendation algorithm works in three main steps:

1. Candidate Sourcing (fancy speak for data aggregation): The algorithm collects data about your followers, your tweets, and you. The “For You” timeline typically comprises 50% In-Network Tweets (from people you follow) and 50% Out-of-Network Tweets (from people you don’t follow).
2. Feature Formation & Ranking: The data is turned into key feature buckets: Embedding Space (SimClusters and TwHIN), In Network (RealGraph and Trust & Safety), and Social Graph (Follower Graph, Engagements). Several of these come up in the questions we ask the codebase later in this lesson. A neural network trained on Tweet interactions to optimize for positive engagement then produces the final ranking.
3. Mixing: Finally, the algorithm groups all features into candidate sources and uses a model called Heavy Ranker to predict user actions, applying heuristics and filtering.
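The shape of such a pipeline can be sketched in a few lines. This is a toy illustration only, not Twitter's code: the `Candidate` fields, the scores, and the even split between pools are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    tweet_id: int
    in_network: bool       # True if the viewer follows the author
    score: float           # stand-in for a ranking model's engagement estimate
    flagged: bool = False  # stand-in for a heuristic or safety filter


def rank(candidates):
    """Order candidates by descending score."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)


def mix(candidates, size):
    """Drop flagged tweets, then fill about half the slots from each pool."""
    kept = [c for c in rank(candidates) if not c.flagged]
    in_net = [c for c in kept if c.in_network]
    out_net = [c for c in kept if not c.in_network]
    half = size // 2
    chosen = in_net[:half] + out_net[: size - half]
    return rank(chosen)


candidates = [
    Candidate(1, True, 0.9),
    Candidate(2, True, 0.4),
    Candidate(3, False, 0.8),
    Candidate(4, False, 0.7, flagged=True),
    Candidate(5, False, 0.2),
    Candidate(6, True, 0.1),
]
timeline = mix(candidates, size=4)
print([c.tweet_id for c in timeline])  # → [1, 3, 2, 5]
```

Tweet 4 scores well but is filtered out, and the final timeline holds two in-network and two out-of-network tweets.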



## Step 1: Installation and Setup

In [1]:
import os

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake


embeddings = OpenAIEmbeddings()



## Step 2: Index the Codebase (Optional)

First, download the source code:

In [None]:
# Uncomment to clone the repository into the directory read by the next cell
#!git clone https://github.com/twitter/the-algorithm data/the-algorithm

Next, load all files inside the repository.

In [2]:
from langchain.document_loaders import TextLoader

root_dir = 'data/the-algorithm'  # path of the cloned repository
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            # Read each file as UTF-8 text and split it into documents
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # Skip binary or otherwise unreadable files
            continue

print('Documents: ', len(docs))


Documents:  10842
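The loading loop above silently skips any file that fails to load. The sketch below shows the same walk-and-read pattern with only the standard library, and records which files were skipped; the function name `read_text_files` and the two sample files are hypothetical.

```python
import os
import tempfile


def read_text_files(root_dir):
    """Return (loaded, skipped): (path, text) pairs for UTF-8 files, and paths that failed."""
    loaded, skipped = [], []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    loaded.append((path, handle.read()))
            except (UnicodeDecodeError, OSError):
                skipped.append(path)  # binary or unreadable file
    return loaded, skipped


# Demonstrate on a throwaway directory with one text file and one binary file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "Ranker.scala"), "w", encoding="utf-8") as f:
        f.write("object Ranker")
    with open(os.path.join(tmp, "model.bin"), "wb") as f:
        f.write(b"\xff\xfe\x00\x80")  # not valid UTF-8
    loaded, skipped = read_text_files(tmp)

print(len(loaded), "loaded,", len(skipped), "skipped")  # → 1 loaded, 1 skipped
```

Keeping a list of skipped paths makes it easy to check that only binaries, and not source files, were left out of the index.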


Subsequently, divide the loaded files into chunks:

In [3]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

print('Chunks: ', len(texts))

Created a chunk of size 2549, which is longer than the specified 1000
Created a chunk of size 2095, which is longer than the specified 1000
Created a chunk of size 1983, which is longer than the specified 1000
Created a chunk of size 1020, which is longer than the specified 1000
Created a chunk of size 1540, which is longer than the specified 1000
Created a chunk of size 1245, which is longer than the specified 1000
Created a chunk of size 1257, which is longer than the specified 1000
Created a chunk of size 2273, which is longer than the specified 1000
Created a chunk of size 1411, which is longer than the specified 1000
Created a chunk of size 1263, which is longer than the specified 1000
Created a chunk of size 1672, which is longer than the specified 1000
Created a chunk of size 1794, which is longer than the specified 1000
Created a chunk of size 1034, which is longer than the specified 1000
Created a chunk of size 1201, which is longer than the specified 1000
[...]

Chunks:  31211
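The "longer than the specified 1000" warnings appear because a separator-based splitter only cuts at its separator. A stretch of text with no separator inside it cannot be shortened, so it becomes an oversized chunk. The following is a simplified sketch of that behavior, not the library's code; `split_on_separator` is a hypothetical name, and the blank-line separator is an assumption.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is emitted whole, because this
    strategy never cuts inside a piece.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one long paragraph with no blank line inside it.
text = "a" * 30 + "\n\n" + "b" * 30 + "\n\n" + "c" * 120
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])  # → [62, 120]
```

The two short paragraphs merge into one 62-character chunk, while the 120-character paragraph exceeds the limit and is kept whole. Source files with long runs of code and no blank lines behave the same way.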


Finally, embed the chunks and upload them to a Deep Lake dataset:

In [4]:
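username = "edumunozsala" # replace with your username from app.activeloop.ai
# Create (or open) the dataset under your account
db = DeepLake(dataset_path=f"hub://{username}/twitter-algorithm", embedding_function=embeddings)
# Embed every chunk and store the vectors, text, and metadata
db.add_documents(texts)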





Your Deep Lake dataset has been successfully created!
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/edumunozsala/twitter-algorithm
hub://edumunozsala/twitter-algorithm loaded successfully.


Evaluating ingest: 100%|██████████| 31/31 [07:44<00:00
 

Dataset(path='hub://edumunozsala/twitter-algorithm', tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype       shape       dtype  compression
  -------   -------     -------     -------  ------- 
 embedding  generic  (31211, 1536)  float32   None   
    ids      text     (31211, 1)      str     None   
 metadata    json     (31211, 1)      str     None   
   text      text     (31211, 1)      str     None   


['a6853dde-6cd4-11ee-a049-cc2f714963ed',
 'a6853ddf-6cd4-11ee-928c-cc2f714963ed',
 'a6853de0-6cd4-11ee-b56f-cc2f714963ed',
 ...]
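The dataset summary above gives a feel for the index size. As a quick back-of-the-envelope calculation, using only the shape and dtype reported there, the uncompressed `embedding` tensor takes:

```python
num_chunks = 31211        # rows reported in the dataset summary
embedding_dim = 1536      # columns of the `embedding` tensor
bytes_per_float32 = 4     # dtype float32

total_bytes = num_chunks * embedding_dim * bytes_per_float32
print(f"{total_bytes:,} bytes, about {total_bytes / 1e6:.1f} MB")
# → 191,760,384 bytes, about 191.8 MB
```

This counts the vectors alone; the `text`, `metadata`, and `ids` tensors add to the total.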

## Step 3: Conversational Retriever Chain

First, load the dataset, establish the retriever, and create the Conversational Chain. You can also define custom filtering functions using Deep Lake filters.

In [5]:
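# Load the public, pre-indexed dataset in read-only mode
# (to query your own index from Step 2, use hub://{username}/twitter-algorithm instead)
db = DeepLake(dataset_path="hub://davitbun/twitter-algorithm", read_only=True, embedding_function=embeddings)

# Create a retriever
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'  # cosine similarity
retriever.search_kwargs['fetch_k'] = 100  # candidates fetched before re-ranking
retriever.search_kwargs['maximal_marginal_relevance'] = True  # favor diverse results
retriever.search_kwargs['k'] = 10  # documents passed to the model

# Named filter_fn to avoid shadowing Python's built-in filter()
def filter_fn(x):
    # Exclude chunks that mention com.google
    if 'com.google' in x['text'].data()['value']:
        return False
    # Keep only chunks whose source path contains 'scala' or 'py'
    metadata = x['metadata'].data()['value']
    return 'scala' in metadata['source'] or 'py' in metadata['source']

# Uncomment the following line to apply custom filtering
# retriever.search_kwargs['filter'] = filter_fn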

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/davitbun/twitter-algorithm

hub://davitbun/twitter-algorithm loaded successfully.

Deep Lake Dataset in hub://davitbun/twitter-algorithm already exists, loading from the storage
Dataset(path='hub://davitbun/twitter-algorithm', read_only=True, tensors=['embedding', 'ids', 'metadata', 'text'])

  tensor     htype       shape       dtype  compression
  -------   -------     -------     -------  ------- 
 embedding  generic  (23152, 1536)  float32   None   
    ids      text     (23152, 1)      str     None   
 metadata    json     (23152, 1)      str     None   
   text      text     (23152, 1)      str     None   
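The retriever settings combine cosine distance with maximal marginal relevance (MMR). First the `fetch_k` nearest candidates are collected, then `k` of them are chosen so that each pick is relevant to the query but not a near-duplicate of earlier picks. Below is a minimal sketch of that selection on toy 2-D vectors. It is an illustration under stated assumptions, not the library's implementation: the trade-off parameter `lambda_mult` and its default of 0.5 are choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, vectors, k, fetch_k, lambda_mult=0.5):
    """Return the indices of k vectors chosen by maximal marginal relevance."""
    # Stage 1: keep the fetch_k vectors most similar to the query.
    by_relevance = sorted(
        range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True
    )
    pool = by_relevance[:fetch_k]
    selected = []
    # Stage 2: repeatedly pick the candidate that is relevant yet unlike earlier picks.
    while pool and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, vectors[i])
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected


query = [1.0, 0.2]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8], [0.0, 1.0]]

print(mmr(query, vectors, k=2, fetch_k=3))                   # → [1, 2]
print(mmr(query, vectors, k=2, fetch_k=3, lambda_mult=1.0))  # → [1, 0]
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking and returns the two near-identical vectors. With the balanced setting, the second pick is the less similar but more distinct vector, which is the point of enabling MMR when several chunks of a codebase are almost the same.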


 

Connect to a chat model for question answering. The cell below uses gpt-3.5-turbo; change the model name to 'gpt-4' to use GPT-4.

In [6]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model='gpt-3.5-turbo')  # switch to 'gpt-4' to use GPT-4
qa = ConversationalRetrievalChain.from_llm(model, retriever=retriever)



## Step 4: Ask Questions to the Codebase in Natural Language

Define all the juicy questions you want to be answered:

In [7]:
questions = [
    "What does favCountParams do?",
    "is it Likes + Bookmarks, or not clear from the code?",
    "What are the major negative modifiers that lower your linear ranking parameters?",
    "How do you get assigned to SimClusters?",
    "What is needed to migrate from one SimClusters to another SimClusters?",
    "How much do I get boosted within my cluster?",
    "How does Heavy ranker work? What are its main inputs?",
    "How can one influence Heavy ranker?",
    "why threads and long tweets do so well on the platform?",
    "Are thread and long tweet creators building a following that reacts to only threads?",
    "Do you need to follow different strategies to get most followers vs to get most likes and bookmarks per tweet?",
    "Content meta data and how it impacts virality (e.g. ALT in images).",
    "What are some unexpected fingerprints for spam factors?",
    "Is there any difference between company verified checkmarks and blue verified individual checkmarks?",
]
# (question, answer) pairs from earlier turns, passed back to the chain on every call
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history": chat_history})
    # Record the turn so that follow-up questions can refer to earlier ones
    chat_history.append((question, result['answer']))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")

-> **Question**: What does favCountParams do? 

**Answer**: I don't know the specific functionality of `favCountParams` as it is not mentioned in the provided context. It seems to be a parameter related to ranking or feature scoring, but without more information, I cannot provide a precise answer. 

-> **Question**: is it Likes + Bookmarks, or not clear from the code? 

**Answer**: No, the given code does not clarify whether `favCountParams` is equal to the sum of Likes and Bookmarks or not. 

-> **Question**: What are the major negative modifiers that lower your linear ranking parameters? 

**Answer**: The major negative modifiers that affect linear ranking parameters are:

1. No text hit demotion: This modifier applies a demotion to tweets that do not have any text content.

2. URL only hit demotion: This modifier applies a demotion to tweets that only contain URLs.

3. Name only hit demotion: This modifier applies a demotion to tweets that only contain usernames or profile names.

4