# Chapter 5

In [chapter 5](https://learn.deeplearning.ai/courses/langchain/lesson/5/question-and-answer), the topic was "Question and Answer".

To re-work the lesson, I will first re-create the example presented in the lesson without diving into too much detail, because I covered embeddings and the underlying theory in my blog posts [Visualizing Embeddings in 2D](https://chrwittm.github.io/posts/2024-03-15-embeddings/) and [Remembering the Wittmann Tours World Trip with RAG](https://chrwittm.github.io/posts/2024-03-22-rag1-remembering-world-trip/).

Speaking of Wittmann Tours, I will afterwards re-create the mini RAG scenario using LangChain as a personal exercise.

## Recreating the example from the lesson

Here is the example CSV file used in this chapter. Let's take a quick look at it in a pandas data frame.

In [35]:
import pandas as pd

file = "OutdoorClothingCatalog_1000.csv"
df = pd.read_csv(file)

print(f"The columns of the dataset are: {df.columns.tolist()}")

The columns of the dataset are: ['Unnamed: 0', 'name', 'description']


In [36]:
df.head()

|   | Unnamed: 0 | name | description |
|---|------------|------|-------------|
| 0 | 0 | Women's Campside Oxfords | This ultracomfortable lace-to-toe Oxford boast... |
| 1 | 1 | Recycled Waterhog Dog Mat, Chevron Weave | Protect your floors from spills and splashing ... |
| 2 | 2 | Infant and Toddler Girls' Coastal Chill Swimsu... | She'll love the bright colors, ruffles and exc... |
| 3 | 3 | Refresh Swimwear, V-Neck Tankini Contrasts | Whether you're going for a swim or heading out... |
| 4 | 4 | EcoFlex 3L Storm Pants | Our new TEK O2 technology makes our four-seaso... |


Here is how we load the data with the LangChain `CSVLoader`:

In [37]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [38]:
# Load the data from the CSV file
data = loader.load()
print(data[0])


page_content=': 0
name: Women's Campside Oxfords
description: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. 

Size & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. 

Specs: Approx. weight: 1 lb.1 oz. per pair. 

Construction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. 

Questions? Please contact us for any inquiries.' metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}
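To make the structure of this output easier to read, here is a minimal sketch of the idea using only the standard library: each CSV row becomes one document whose content is a list of `column: value` lines, plus metadata with the source and row number. This is not LangChain's actual implementation; the function name `rows_to_documents` and the two-row sample are made up for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, source="example.csv"):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, in column order
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


# Made-up sample with an unnamed first column, like the catalog file
sample = ",name,description\n0,Trail Shoe,Light and grippy.\n1,Rain Hat,Keeps you dry.\n"
docs = rows_to_documents(sample)
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

The unnamed first column explains why the real output above starts with `: 0`: the column name is empty, so only the separator and the value remain.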


In [39]:
import os
from dotenv import load_dotenv

load_dotenv()  # loads the OPENAI_API_KEY from the .env file

True

When running this notebook for the first time, you might need to install `docarray`.

In [40]:
#!pip install docarray

In [41]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.embeddings.openai import OpenAIEmbeddings  # OpenAI embeddings model

# Initialize the embedding model
embedding = OpenAIEmbeddings()

# Create the index with the embedding model
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embedding  # Provide the required embedding model
).from_loaders([loader])
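
Conceptually, the index embeds every document, stores the vectors, and at query time retrieves the documents most similar to the question so they can be handed to the LLM. The following toy sketch illustrates only the retrieval part. It assumes word counts as a stand-in for real embeddings, and `ToyIndex` is a hypothetical class, not part of LangChain.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts. A real embedding model returns dense vectors."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyIndex:
    """Stores (text, vector) pairs and returns the texts closest to a query."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def search(self, query, k=2):
        query_vector = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


# Made-up mini catalog
toy_index = ToyIndex(
    [
        "shirt with sun protection",
        "waterproof storm pants",
        "dog mat for muddy paws",
    ]
)
print(toy_index.search("sun protection shirt", k=1))
```

A real embedding model would also match paraphrases that share no words with the document, which this word-count version cannot do.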


Notice that we use `gpt-4o-mini` instead of `gpt-3.5-turbo-instruct`. The reason is simply that time has moved on, and there is no instruct variant of GPT-4o. As GPT-4o points out:

_"While GPT-4 chat models, like gpt-4o-mini, do not have a separate instruct variant, they are already trained to handle instruction-based tasks very effectively."_

In [42]:
from langchain_openai import ChatOpenAI

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

In [43]:
from IPython.display import display, Markdown

def display_response(response):
    display(Markdown(response))

In [44]:
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."
response = index.query(query, llm=llm)

display_response(response)

Here is a table listing the shirts with sun protection along with a summary of each:

| Name                                      | Description Summary                                                                                     |
|-------------------------------------------|--------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt   | Lightweight, UPF 50+ sun protection, 100% polyester, wrinkle-resistant, with cape venting and pockets. |
| Men's Plaid Tropic Shirt, Short-Sleeve    | UPF 50+ protection, made of 52% polyester and 48% nylon, wrinkle-free, evaporates perspiration, with cape venting and pockets. |
| Men's TropicVibe Shirt, Short-Sleeve      | UPF 50+ rated, lightweight, 71% nylon and 29% polyester, wrinkle resistant, with cape venting and pockets. |
| Sun Shield Shirt                          | UPF 50+ rated, made of 78% nylon and 22% Lycra, moisture-wicking, abrasion resistant, fits over swimsuits. |

## Wittmann-Tours

Let's now apply the example implementation to re-create my manual implementation from [Remembering the Wittmann Tours World Trip with RAG](https://chrwittm.github.io/posts/2024-03-22-rag1-remembering-world-trip/).

The implementation idea is simple: instead of indexing the rows of a CSV file, let's index the markdown files of the Wittmann-Tours blog.

Since we have already downloaded the dataset in chapter 2, here are just the helper functions:

In [45]:
import os
import glob

def get_blog_post_files(path_to_blog):

    pattern = os.path.join(path_to_blog, "**/*.md")
    return sorted(glob.glob(pattern, recursive=True))

def get_blogpost(path_to_blogpost):
    with open(path_to_blogpost, 'r', encoding='utf-8') as file:  # the posts are German text, so read them as UTF-8
        content = file.read()

    return content

In [46]:
blogpost_files = get_blog_post_files("./../wt-blogposts")
blogpost_files[0:5]

['./../wt-blogposts/3-tage-in-melbourne/index.md',
 './../wt-blogposts/addis-abeba-die-hauptstadt-athiopiens/index.md',
 './../wt-blogposts/aksum-aufbewahrungsort-der-bundeslade/index.md',
 './../wt-blogposts/am-fusse-des-cotopaxi/index.md',
 './../wt-blogposts/an-der-grenze-von-mexiko-nach-belize/index.md']

If you're running this for the first time, you might need to install `unstructured`.

In [47]:
#!pip install unstructured

Let's create a loader for each markdown file. All these loaders are stored in the `loaders` list.

In [23]:
from langchain.document_loaders import UnstructuredMarkdownLoader

loaders = [UnstructuredMarkdownLoader(blogpost_file) for blogpost_file in blogpost_files]

In [24]:
loaders[:5]

[<langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader at 0x336694ac0>,
 <langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader at 0x3366971c0>,
 <langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader at 0x3366972e0>,
 <langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader at 0x336696f20>,
 <langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader at 0x336697370>]

As it turned out, what we are about to do requires the `Punkt` tokenizer, a data package used by NLTK (Natural Language Toolkit) for sentence splitting.

While implementing this prototype, I received a few weird error messages regarding the file locations of the punkt tokenizer. It turned out that you need to be on the latest NLTK version (3.9.1); my earlier version 3.8.1 caused the problem.

In [52]:
#!pip uninstall nltk
#!pip install nltk
#!pip install --upgrade nltk

In [53]:
import nltk
print(nltk.__version__)

3.9.1
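
If you want to guard against the old version programmatically, a small check like the following could help. It is a sketch that assumes purely numeric version strings such as `3.9.1`; the helper names are made up, and for anything more complex the `packaging` library is the safer choice.

```python
def version_tuple(version):
    """Convert '3.9.1' into (3, 9, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed, required):
    """Return True if the installed version is the required one or newer."""
    return version_tuple(installed) >= version_tuple(required)


print(is_at_least("3.9.1", "3.9.1"))  # True
print(is_at_least("3.8.1", "3.9.1"))  # False
```

Comparing tuples of integers rather than raw strings matters because a plain string comparison would rank `3.10.0` below `3.9.1`.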


In [54]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /Users/chrwittm/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Same as before, we can index all the markdown files by loading them into the vector store.

In [55]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embedding
).from_loaders(loaders)

Now we should be able to query the Wittmann-Tours blog:

In [34]:
#query = "What was the name of our guide in the Masoala rain forest?"
query = "Wie hieß unser Guide im Masoala Regenwald?"
response = index.query(query, llm=llm)

display_response(response)

Euer Guide im Masoala Regenwald hieß Armand.

As it turned out, I needed to prompt the model in German to get the right result, because the Wittmann-Tours blog is written in German. Unlike the example that I created in my [previous blog post](https://chrwittm.github.io/posts/2024-03-22-rag1-remembering-world-trip/), this version is not multilingual. However, this could easily be fixed with other LangChain functionality, as we've seen in previous chapters.
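
One possible way to add multilingual support is to translate the question into the blog's language, query the index, and translate the answer back. The sketch below shows only this control flow. `multilingual_query` and the two stand-in functions are hypothetical, so that the example runs without any API calls. In the notebook, `translate` could be an LLM call and `query_index` could wrap `index.query(question, llm=llm)`.

```python
def multilingual_query(question, translate, query_index,
                       blog_language="German", user_language="English"):
    """Translate the question, query the index, and translate the answer back."""
    translated_question = translate(question, blog_language)
    answer = query_index(translated_question)
    return translate(answer, user_language)


# Stand-ins for illustration only: they tag the text instead of calling a model
def fake_translate(text, target_language):
    return f"[{target_language}] {text}"


def fake_query_index(question):
    return f"answer to: {question}"


result = multilingual_query("Who was our guide?", fake_translate, fake_query_index)
print(result)
```

Passing the two steps in as functions keeps the wrapper independent of any particular model or vector store.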