# Experiments with indexing Jupyter Notebooks

In [1]:
import os 

from langchain_community.document_loaders import NotebookLoader
from langchain_community.llms import Ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser



In [2]:
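def load_notebooks_docs(folder_path):
    """Load every .ipynb file in folder_path as a single LangChain Document."""
    docs = []
    for file_name in os.listdir(folder_path):
        if file_name.endswith('.ipynb'):
            loader = NotebookLoader(
                path=os.path.join(folder_path, file_name),
                include_outputs=True,    # keep cell outputs alongside the source
                max_output_length=500,   # truncate each cell output to 500 characters
                traceback=False          # omit full error tracebacks
            )
            # NotebookLoader yields one Document per notebook, so take the first.
            docs.append(loader.load()[0])
    return docs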
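
`os.listdir` returns entries in arbitrary order, so the order of the loaded documents can differ between machines. A minimal sketch of a deterministic file selection step follows; `select_notebooks` is a hypothetical helper name, not part of LangChain.

```python
def select_notebooks(file_names):
    """Return the .ipynb names from file_names in sorted order."""
    return sorted(name for name in file_names if name.endswith(".ipynb"))


# Example with a hard-coded listing; in the loader above the input
# would be os.listdir(folder_path).
print(select_notebooks(["b.ipynb", "notes.txt", "a.ipynb"]))
```

Sorting only affects the order of `docs`; the content of each loaded document is unchanged.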

In [3]:
docs = load_notebooks_docs('./notebooks/')

In [4]:
docs

[Document(page_content='\'markdown\' cell: \'[\'Notebook Source: https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook\']\'\n\n \'markdown\' cell: \'[\'---\']\'\n\n \'markdown\' cell: \'["Logging into Kaggle for the first time can be daunting. Our competitions often have large cash prizes, public leaderboards, and involve complex data. Nevertheless, we really think all data scientists can rapidly learn from machine learning competitions and meaningfully contribute to our community. To give you a clear understanding of how our platform works and a mental model of the type of learning you could do on Kaggle, we\'ve created a Getting Started tutorial for the Titanic competition. It walks you through the initial steps required to get your first decent submission on the leaderboard. By the end of the tutorial, you\'ll also have a solid understanding of how to use Kaggle\'s online coding environment, where you\'ll have trained your own machine learning model.\\n", \'\\n\', \'So i

In [5]:
# loader = NotebookLoader(
#     path='./notebooks', 
#     include_outputs=True,
#     max_output_length=500,
#     traceback=False
# )

In [72]:
# documents = loader.load()

In [73]:
# doc = documents[0]

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50, add_start_index=True
)
splits = text_splitter.split_documents(docs)

In [7]:
len(splits)

233
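
To make `chunk_size=500`, `chunk_overlap=50`, and the recorded `start_index` more concrete, here is a simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines); `sliding_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        # Keep the offset so a chunk can be traced back to its source position.
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
for start, chunk in sliding_chunks(sample):
    print(start, len(chunk))
```

Each window starts 450 characters after the previous one, so consecutive chunks repeat 50 characters of context.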

In [8]:
# The bge-large-en model is approximately 1.3 GB
model_name = "BAAI/bge-large-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
bge_embedding = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)
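
`normalize_embeddings=True` asks the encoder to scale each vector to unit length. A small pure-Python sketch of why that is convenient: for unit vectors, the dot product equals the cosine similarity. The helper names below are hypothetical and the vectors are toy values, not real BGE embeddings.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [1.0, 2.0]
ua, ub = l2_normalize(a), l2_normalize(b)
print(dot(ua, ub), cosine(a, b))  # the two values agree up to rounding
```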


In [9]:
vectorstore = Chroma.from_documents(documents=splits, embedding=bge_embedding)

# Build a retriever over the indexed notebook chunks.
retriever = vectorstore.as_retriever()


In [10]:
retriever.invoke("What is the percentage of women who survived?")

[Document(page_content='\'markdown\' cell: \'[\'The code above calculates the percentage of male passengers (in **train.csv**) who survived.\\n\', \'\\n\', \'From this you can see that almost 75% of the women on board survived, whereas only 19% of the men lived to tell about it. Since gender seems to be such a strong indicator of survival, the submission file in **gender_submission.csv** is not a bad first guess!\\n\', \'\\n\', "But at the end of the day, this gender-based submission bases its predictions on only a single column.', metadata={'source': 'notebooks/titanic-tutorial.ipynb', 'start_index': 11500}),
 Document(page_content='\'code\' cell: \'[\'women = train_data.loc[train_data.Sex == \\\'female\\\']["Survived"]\\n\', \'rate_women = sum(women)/len(women)\\n\', \'\\n\', \'print("% of women who survived:", rate_women)\']\'\n with output: \'[\'% of women who survived: 0.7420382165605095\\n\']\'\n\n \'markdown\' cell: \'[\'Before moving on, make sure that your code returns the out
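
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below ranks toy unit vectors by dot product; for unit-length vectors, ranking by dot product, cosine similarity, or Euclidean distance gives the same order. `top_k` is a hypothetical helper, and `k=4` is an assumed value chosen for illustration, not something this notebook sets.

```python
import heapq


def top_k(query_vec, chunk_vecs, k=4):
    """Return (score, index) pairs for the k chunks with the highest dot product."""
    scored = (
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(chunk_vecs)
    )
    return heapq.nlargest(k, scored)


query = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
for score, idx in top_k(query, chunks, k=2):
    print(idx, score)
```

A vector store does the same job with an index rather than a full scan, but the ranking idea is the same.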

In [12]:
# Prompt template
template = """Answer the question based only on the following context:
{context}
Question: {question}
Do not provide an answer if there is not enough context or you don't know.
The answer should include a reference to the source document file name.
"""
prompt = ChatPromptTemplate.from_template(template)

model = Ollama(model="llama2")

# RAG pipeline
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
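
In the chain above the retriever output goes into `{context}` as-is, so the prompt receives the string form of a list of `Document` objects (the answers further down quote `Document(page_content=...` fragments). One possible alternative is to join the chunks into plain text with a source label per chunk. This is a sketch under assumptions: `Doc` is a hypothetical stand-in that only mirrors the `page_content` and `metadata` attributes visible in the outputs above, and `format_docs` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the two attributes seen in the outputs above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join chunks into one context string, labelling each with its source."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        start = doc.metadata.get("start_index", "?")
        parts.append(f"[source: {source}, start_index: {start}]\n{doc.page_content}")
    return "\n\n".join(parts)


example = [Doc("alpha", {"source": "notebooks/a.ipynb", "start_index": 0}), Doc("beta")]
print(format_docs(example))
```

If this were wired into the chain, the context entry would likely become `retriever | format_docs`; that wiring was not run as part of this notebook.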

In [13]:
chain.invoke(
    "What is the percentage of women who survived?"
)

'Based on the provided context, the answer to the question "What is the percentage of women who survived?" is:\n\nAccording to the source document "notebooks/eda-to-prediction-dietanic.ipynb" at line 7083, the percentage of women who survived is:\n\n"% of women who survived: 0.95"\n\nReference: notebooks/eda-to-prediction-dietanic.ipynb (start index 7083)'
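
The answer above quotes 0.95, while the retrieved code cell shown earlier prints 0.7420382165605095. The retriever output in this notebook is truncated, so it is not possible to tell from here whether 0.95 appears in another retrieved chunk. A crude sanity check, sketched below, is to list the numbers in an answer that never occur in the context text. `unsupported_numbers` is a hypothetical helper and the regular expression only handles plain integers and decimals.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer, context):
    """Return numbers quoted in the answer that never appear in the context."""
    context_numbers = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(answer) if n not in context_numbers]


context = "% of women who survived: 0.7420382165605095"
print(unsupported_numbers("% of women who survived: 0.95", context))
```

An empty result does not prove the answer is correct; a non-empty one only marks the answer for manual review.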

In [14]:
chain.invoke(
    "What is the percentage of men who survived?"
)

'Based on the provided context, the answer to the question "What is the percentage of men who survived?" is:\n\n"From the **FactorPlot** and **CrossTab**, we can see that out of 891 passengers in the training set, only around 19% of the men survived the accident. This information can be found in Document(page_content=\'\\\'markdown\\\' cell: \\\'[It is evident that irrespective of Pclass, Women were given first priority while rescue. Even Men from Pclass1 have a very low survival rate. \\\\n\\\', metadata={\'source\': \'notebooks/eda-to-prediction-dietanic.ipynb\', \'start_index\': 7083\')"]\n\nThe answer is referenced to the source document file name: "notebooks/eda-to-prediction-dietanic.ipynb".'

In [15]:
response = chain.invoke(
    "What models we already tried training?"
)
print(response)

Based on the provided context, it seems that the following models have been tried training:

1. Gradient Boosted Trees (GBT) with default parameters
2. GBT with improved default parameters
3. Training a model with hyperparameter tuning

These references can be found in the following documents:

* Document 1: `notebooks/titanic-competition-w-tensorflow-decision-forests.ipynb`, `start_index`: 17224
* Document 2: `notebooks/titanic-competition-w-tensorflow-decision-forests.ipynb`, `start_index`: 591
* Document 3: `notebooks/titanic-tutorial.ipynb`, `start_index`: 12403


In [16]:
response = chain.invoke(
    "What features were one-hot encoded?"
)
print(response)

Based on the provided context, it appears that the following features were one-hot encoded:

* Age

This information can be found in Document 3, which mentions that "**Continous Features in the dataset: Age**".


In [17]:
response = chain.invoke(
    "Describe what data preprocessing was done?"
)
print(response)

Based on the given context, it appears that the following data preprocessing steps were done:

1. Removing redundant features: The document mentions that there may be many redundant features in the dataset, and suggests removing them. However, no specific examples or details are provided.
2. Converting features into suitable form for modeling: The document mentions that some features may need to be converted into a more suitable form for modeling. Again, no specific examples or details are provided.
3. Replacing null values with mean or median: The document mentions replacing null values with the mean or median age of the dataset. This is done to handle missing data.
4. Grouping features: The document mentions grouping features, but no further details are provided.

References:

* Source document file name: notebooks/eda-to-prediction-dietanic.ipynb
* Start index: 1595, 20173, 8721, 20811


In [18]:
response = chain.invoke(
    "Did we try any hyperparameters tuning?"
)
print(response)

Yes, we tried hyperparameter tuning in the tutorial. According to the source documents, we tried tuning the hyperparameters for the two best classifiers, SVM and RandomForests. The tuning process was done using the Automated Hyperparameter Tuning tutorial provided by TensorFlow. The reference to the source document file name is "notebooks/eda-to-prediction-dietanic.ipynb" at line 36890.


In [19]:
response = chain.invoke(
    "give me a code we used to train GradientBoostedTreesModel"
)
print(response)

Based on the provided context, the code used to train a `GradientBoostedTreesModel` is located in the following document:

Document(page_content='\'code\' cell: \'[\'model = tfdf.keras.GradientBoostedTreesModel(\\n\', ...')

This code trains a `GradientBoostedTreesModel` with the following parameters:

* `verbose`: 0 (very few logs)
* `features`: a list of feature usages, where each usage is represented by a `FeatureUsage` object
* `exclude_non_specified_features`: True (only use the features in "features")
* `random_seed`: 1234

The model is then fit to the training data using the `fit()` method, and the evaluation metric (`accuracy` and `loss`) are printed.


In [20]:
response = chain.invoke(
    "give me a code we used to train GradientBoostedTreesModel"
)
print(response)

Based on the provided context, the code used to train a Gradient Boosted Trees Model is located in the following document:

Document(page_content="'code' cell: '[\'model = tfdf.keras.GradientBoostedTreesModel(\\n\', ...")

This code defines the Gradient Boosted Trees Model using the `tfdf.keras` module, and then trains it on the training data using the `fit()` method. The `random_seed` parameter is set to 1234 for reproducibility.
