# Dataset Upload

In addition to creating and editing datasets in the LangSmith UI, you can create and edit them with the LangSmith SDK.

Let's go ahead and upload a list of examples from our RAG application to LangSmith as a new dataset.

In [8]:
from dotenv import load_dotenv
load_dotenv()  

import os
print("LangSmith Key Set:", os.getenv("LANGCHAIN_API_KEY") is not None)


LangSmith Key Set: True


In [16]:
from langsmith import Client

example_inputs = [
    (
        "What is Retrieval-Augmented Generation (RAG)?",
        "RAG is a technique that combines information retrieval with generative models. It retrieves relevant documents from a knowledge base and uses them to ground the model’s response, improving factual accuracy."
    ),
    (
        "Why is RAG preferred over vanilla LLM responses?",
        "RAG helps improve factual correctness and reduces hallucinations by incorporating external context. It also enables real-time updates without the need to retrain the model."
    ),
    (
        "What are the main components of a RAG pipeline?",
        "A typical RAG pipeline includes a retriever, a chunked document store (vector DB), and a language model that generates answers using the retrieved documents as context."
    ),
    (
        "How do you choose the right chunk size for a RAG pipeline?",
        "Chunk size should balance context completeness and token limits. Smaller chunks may miss important context, while larger ones can exceed token limits or dilute relevance."
    ),
    (
        "Can I use RAG with proprietary documents?",
        "Yes, RAG is ideal for using proprietary or internal documents. You index them in a vector database and retrieve them at query time without exposing data to the base model."
    ),
    (
        "What vector stores are commonly used in RAG?",
        "Common vector stores include FAISS, Pinecone, Weaviate, Chroma, Qdrant, and even lightweight ones like SKLearnVectorStore for local setups."
    ),
    (
        "How does document retrieval affect RAG accuracy?",
        "If retrieval fails to return relevant documents, the generation step will likely produce incorrect or vague responses. Retrieval quality is crucial for RAG effectiveness."
    ),
    (
        "Is it possible to use hybrid search in a RAG system?",
        "Yes, hybrid search combines dense vector search with keyword-based techniques like BM25 to improve recall, especially in noisy or long-text domains."
    ),
    (
        "How can LangSmith help debug a RAG pipeline?",
        "LangSmith lets you trace each RAG step — retrieval, document context, and model output — so you can inspect failures, measure latency, and iterate effectively."
    ),
    (
        "How do I evaluate the performance of a RAG system?",
        "You can evaluate RAG systems using metrics like answer correctness, retrieval precision, or reference comparison with datasets. LangSmith supports custom evaluations for this purpose."
    ),
]

from config import DATASET_ID  # ID of an existing LangSmith dataset

client = Client()

# Prepare inputs and outputs for bulk creation
inputs = [{"question": input_prompt} for input_prompt, _ in example_inputs]
outputs = [{"output": output_answer} for _, output_answer in example_inputs]

# Upload all examples in one call; the two lists must line up index by index
client.create_examples(
    inputs=inputs,
    outputs=outputs,
    dataset_id=DATASET_ID,
)

{'example_ids': ['665d4c91-f8e5-46be-8e22-f4f8b9c9b89a',
  'a63005ce-9dc9-4581-bb48-6f1de0945132',
  '71f26c5c-4f65-46c3-9127-183a225b578d',
  '7387e305-7bac-4704-a750-4a52d537e6ff',
  '6231eb50-8c19-4c4b-9028-a6e332475df0',
  'd3f006fa-1d6f-4dbb-9466-374c0dce1575',
  'ca7c18b7-15d6-45fa-96cf-b2cbb06c98d9',
  '1c60e230-8082-41bc-be79-48261ee61a7a',
  '758a3f88-0656-4b04-a64b-fbbb1d32ca69',
  'fb2c22f9-eebf-4f5d-a945-bf90e6e30b2e'],
 'count': 10}
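
The cell above assumes the target dataset already exists and that its ID is available. If you are starting from scratch, a minimal sketch along the following lines could create the dataset first and then upload the pairs. The helper names `split_pairs` and `upload_pairs`, and the `dataset_name` argument, are hypothetical and not part of the notebook; `Client.create_dataset` and `Client.create_examples` are the SDK calls assumed here. Only the offline helper runs at import time, so no API key is needed to try it.

```python
from typing import Dict, List, Sequence, Tuple


def split_pairs(
    pairs: Sequence[Tuple[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Turn (question, answer) tuples into parallel inputs/outputs lists."""
    inputs: List[Dict[str, str]] = []
    outputs: List[Dict[str, str]] = []
    for question, answer in pairs:
        # Reject blank entries before anything is sent to LangSmith
        if not question.strip() or not answer.strip():
            raise ValueError("Each example needs a non-empty question and answer")
        inputs.append({"question": question})
        outputs.append({"output": answer})
    return inputs, outputs


def upload_pairs(pairs: Sequence[Tuple[str, str]], dataset_name: str):
    """Hypothetical helper: create a dataset, then bulk-upload the pairs.

    Requires a valid LangSmith API key in the environment.
    """
    # Imported lazily so split_pairs can be used without the SDK installed
    from langsmith import Client

    client = Client()
    dataset = client.create_dataset(dataset_name=dataset_name)
    inputs, outputs = split_pairs(pairs)
    return client.create_examples(
        inputs=inputs,
        outputs=outputs,
        dataset_id=dataset.id,
    )


# Offline demonstration of the payload shape used in the upload cell
demo_pairs = [("What is RAG?", "Retrieval-Augmented Generation.")]
demo_inputs, demo_outputs = split_pairs(demo_pairs)
print(demo_inputs)
print(demo_outputs)
```

Calling `upload_pairs(example_inputs, "my-rag-dataset")` would then perform the same upload as above against a freshly created dataset; the dataset name here is only a placeholder.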

## Submitting another Trace

I've moved our RAG application definition to `app.py` so we can quickly import it.

In [None]:
from app import langsmith_rag

Let's ask another question to create a new trace!

In [18]:
question = "What is RAG?"
langsmith_rag(question)

"RAG stands for Retrieval-Augmented Generation, a technique that combines retrieval and generation to answer questions. It involves indexing and retrieving relevant documents based on a user's question, and then using a language model to generate an answer. This approach is used in the provided code to build a chatbot that answers questions about Lilian Weng's blog posts."