# Source code analysis

In [None]:
!pip install --upgrade langchain chromadb openai tiktoken ipython

Define the OpenAI embeddings and import the Chroma vector store, then set your OpenAI API key. This notebook stores the vector database locally with Chroma, so no Deep Lake authentication is needed.

If you would rather create your own dataset and publish it to Deep Lake, see the [docs](https://docs.activeloop.ai/) and [API reference](https://docs.deeplake.ai/en/latest/). You can get an API key from the [platform](https://app.activeloop.ai).

In [2]:
import os

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# set your OpenAI api key
os.environ['OPENAI_API_KEY'] = "OpenAI API Key"

In [3]:
embeddings = OpenAIEmbeddings(disallowed_special=())

`disallowed_special=()` is required to avoid `Exception: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte` from tiktoken for some repositories.

### 1. Index the code base
You can skip this part and jump straight to using an already indexed dataset. Otherwise, first clone the repository, then parse and chunk the code base, and finally index the chunks with OpenAI embeddings.

In [None]:
!git clone https://github.com/lballabio/quantlib

Load all files inside the repository

In [5]:
import os
from langchain.document_loaders import TextLoader

root_dir = './quantlib'
docs = []
for dirpath, dirnames, filenames in os.walk(root_dir):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception:
            # binary files (git packs, images, ...) cannot be decoded as UTF-8
            print(file + " not loaded")

index not loaded
pack-601e8394e954cf8f74b2603f267babad5185b42d.idx not loaded
pack-601e8394e954cf8f74b2603f267babad5185b42d.pack not loaded
favicon.ico not loaded
QL-title.jpg not loaded
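
The files reported above are binary (git internals and images). One way to avoid attempting them at all is to walk the tree with an allow-list of extensions. The sketch below is only an illustration: `SOURCE_SUFFIXES` and `iter_source_files` are hypothetical names, and the suffix list is an assumption that should be adjusted to the repository.

```python
import os

# Hypothetical allow-list of text/source suffixes; adjust to the repository.
SOURCE_SUFFIXES = {".hpp", ".cpp", ".h", ".c", ".txt", ".md"}


def iter_source_files(root_dir, suffixes=SOURCE_SUFFIXES):
    """Yield paths under root_dir whose extension is in suffixes, skipping .git."""
    for dirpath, dirnames, filenames in os.walk(root_dir):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in suffixes:
                yield os.path.join(dirpath, name)
```

Each yielded path could then be passed to `TextLoader(path, encoding='utf-8')` exactly as in the loop above.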


Then, chunk the files

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

language = Language.CPP
chunk_size = 1000
chunk_overlap = 0

cpp_splitter = RecursiveCharacterTextSplitter.from_language(
    language=language, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
texts = cpp_splitter.split_documents(docs)
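
To give an intuition for what recursive splitting does, here is a simplified, standard-library-only sketch. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, it uses generic separators rather than the C++-specific ones chosen by `from_language`, and it ignores chunk overlap. The idea is to try the coarsest separator first and fall back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

With `chunk_size=16`, two short declarations separated by a blank line stay together in one chunk, while a longer function body is split at its line breaks.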

### Run this cell only the first time
You only need to run this cell once; afterwards you can just load the vector database.
Computing the embeddings and saving them locally takes about 8 minutes. You can save them in the cloud instead, for example with Activeloop (see https://python.langchain.com/en/latest/use_cases/code/code-analysis-deeplake.html?highlight=Activeloop for an example).

In [7]:
# folder where the vector database will be stored locally
persist_directory = 'db'

vector_db = Chroma.from_documents(documents=texts, embedding=embeddings, persist_directory=persist_directory)
vector_db.persist()
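
To avoid recomputing the embeddings by accident, one option is to guard the cell above with a check on the persist directory. This is a minimal sketch: `has_persisted_index` is a hypothetical helper, and it assumes that a non-empty directory means a usable database, which it does not verify.

```python
import os


def has_persisted_index(persist_directory):
    """Return True if persist_directory exists and contains at least one entry."""
    return os.path.isdir(persist_directory) and bool(os.listdir(persist_directory))


# Possible use with the objects defined in this notebook:
# if has_persisted_index(persist_directory):
#     vector_db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
# else:
#     vector_db = Chroma.from_documents(documents=texts, embedding=embeddings,
#                                       persist_directory=persist_directory)
#     vector_db.persist()
```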

### 2. Question Answering on QuantLib codebase
First load the dataset, then construct the retriever, and finally construct the conversational chain.

In [8]:
vector_db = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

In [9]:
# using max marginal relevance (see https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf)
# with Chroma, MMR is selected through search_type
retriever = vector_db.as_retriever(
    search_type='mmr',
    search_kwargs={
        'fetch_k': 100,  # number of Documents to fetch to pass to MMR algorithm
        'k': 4,          # number of Documents to return
    },
)
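
For readers unfamiliar with maximal marginal relevance, the following pure-Python sketch shows the selection rule on toy vectors. It is an illustration of the algorithm only, not the code LangChain or Chroma runs; `cosine`, `mmr`, and the `lambda_mult` trade-off parameter are names chosen for this sketch. In the notebook's terms, `doc_vecs` plays the role of the `fetch_k` candidates and `k` is the number of documents returned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking; lower values penalize candidates that are close to documents already selected, which is why a near-duplicate chunk can lose to a less similar but more novel one.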

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from IPython.display import display, Markdown

model_name = 'gpt-4'  # switch to gpt-3.5-turbo if you don't have API access to gpt-4
llm = ChatOpenAI(model_name=model_name,
                 temperature=0,
                 streaming=True,
                 callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]))
# the conversation history is passed explicitly as `chat_history` on each call
qa = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, verbose=True, return_source_documents=True)

In [11]:
chat_history = []

In [12]:
question = "How to build ir curves on deposit, fra, futures and swaps?"

response = qa({"question": question, "chat_history": chat_history})
chat_history.append((question, response['answer']))
display(Markdown(response['answer']))



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
vars.termStructure = ext::shared_ptr<YieldTermStructure>(new
        PiecewiseYieldCurve<Discount,LogLinear>(vars.settlement,
                                                swapHelpers,
                                                Actual360()));

    Handle<YieldTermStructure> curveHandle =
        Handle<YieldTermStructure>(vars.termStructure);

    ext::shared_ptr<IborIndex> index(new Euribor6M(curveHandle));
    for (Size i=0; i<vars.swaps; i++) {
        Period tenor = swapData[i].n*swapData[i].units;

        VanillaSwap swap = MakeVanillaSwap(tenor, index, 0.0)
            .withEffectiveDate(vars.settlement)
            .withFixedLegDayCount(vars.fixedLegDayCount

To build interest rate curves using deposits, FRAs, futures, and swaps, you can follow these steps:

1. Collect market data: Obtain the market data for deposits, FRAs, futures, and swaps, including their tenors, rates, and other relevant information.

2. Create rate helpers: For each type of instrument (deposits, FRAs, futures, and swaps), create rate helpers using the market data. Rate helpers are objects that help in bootstrapping the yield curve. For example, you can use DepositRateHelper for deposits, FraRateHelper for FRAs, FuturesRateHelper for futures, and SwapRateHelper for swaps.

3. Combine rate helpers: Combine all the rate helpers created in step 2 into a single vector or list.

4. Choose interpolation method: Select an interpolation method for the yield curve, such as linear, log-linear, or cubic spline.

5. Bootstrap the yield curve: Using the combined rate helpers and the chosen interpolation method, bootstrap the yield curve. You can use the PiecewiseYieldCurve class for this purpose. For example:

```cpp
ext::shared_ptr<YieldTermStructure> termStructure(
    new PiecewiseYieldCurve<Discount, LogLinear>(
        settlementDate, rateHelpers, termStructureDayCounter));
```

6. Link the yield curve to the relevant indices: If you need to use the constructed yield curve for pricing other instruments, link the curve to the relevant indices, such as Euribor or Libor.

Here's a high-level example of how to create a yield curve using deposits, FRAs, futures, and swaps:

```cpp
// Step 1: Collect market data (example data)
// ...

// Step 2: Create rate helpers
std::vector<ext::shared_ptr<RateHelper>> rateHelpers;

// Add DepositRateHelper
// ...
rateHelpers.push_back(depositRateHelper);

// Add FraRateHelper
// ...
rateHelpers.push_back(fraRateHelper);

// Add FuturesRateHelper
// ...
rateHelpers.push_back(futuresRateHelper);

// Add SwapRateHelper
// ...
rateHelpers.push_back(swapRateHelper);

// Step 3: Combine rate helpers
// (Already combined in the rateHelpers vector)

// Step 4: Choose interpolation method (LogLinear in this example)
// ...

// Step 5: Bootstrap the yield curve
ext::shared_ptr<YieldTermStructure> termStructure(
    new PiecewiseYieldCurve<Discount, LogLinear>(
        settlementDate, rateHelpers, termStructureDayCounter));

// Step 6: Link the yield curve to the relevant indices
// (Example: Linking to Euribor index)
Handle<YieldTermStructure> curveHandle(termStructure);
ext::shared_ptr<IborIndex> euriborIndex(new Euribor6M(curveHandle));
```

This example demonstrates the high-level process of building a yield curve using deposits, FRAs, futures, and swaps. You'll need to adapt the code to your specific market data and requirements.
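
To ask follow-up questions, the pattern from the cell above (call the chain, then append the turn to `chat_history`) can be wrapped in a small helper. This is a sketch under assumptions: `ask` is a hypothetical function, and it relies on each returned document carrying its file path under `metadata["source"]`, as documents loaded with `TextLoader` do.

```python
def ask(chain, question, chat_history):
    """Send question with the accumulated history and record the new turn.

    chain is any callable that accepts {"question", "chat_history"} and
    returns a dict with "answer" and, optionally, "source_documents".
    """
    response = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, response["answer"]))
    # Collect the distinct source paths of the retrieved chunks, if present.
    sources = sorted(
        {doc.metadata.get("source", "?") for doc in response.get("source_documents", [])}
    )
    return response["answer"], sources
```

In this notebook it could be called as `answer, sources = ask(qa, question, chat_history)`, followed by `display(Markdown(answer))`.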