# End-to-End RAG Tutorial Using Jira, PyAirbyte, Pinecone, and LangChain

This notebook demonstrates an end-to-end Retrieval-Augmented Generation (RAG) pipeline. We will extract data from Jira using PyAirbyte, store it in a Pinecone vector store, and then use LangChain to perform RAG on the stored data. This workflow showcases how to integrate these tools to build a scalable RAG system.

## Prerequisites

1. **Jira**:
   - Follow the instructions in the [Jira Source Connector Documentation](https://docs.airbyte.com/integrations/sources/jira) to set up your Jira source in Airbyte.

2. **Pinecone Account**:
   - **Create a Pinecone Account**: Sign up for an account on the [Pinecone website](https://www.pinecone.io/).
   - **Obtain Pinecone API Key**: Generate a new API key from your Pinecone project settings. For detailed instructions, refer to the [Pinecone documentation](https://docs.pinecone.io/docs/quickstart).

3. **OpenAI API Key**:
   - **Create an OpenAI Account**: Sign up for an account on [OpenAI](https://www.openai.com/).
   - **Generate an API Key**: Go to the API section and generate a new API key. For detailed instructions, refer to the [OpenAI documentation](https://beta.openai.com/docs/quickstart).


## Install PyAirbyte and other dependencies

In [1]:
!pip3 install airbyte openai langchain pinecone-client langchain-openai langchain-pinecone langchainhub 



## Set Up the Jira Source with PyAirbyte

The code below configures an Airbyte source that extracts issue data from Jira.

To adjust the configuration to your requirements, see the [Jira connector reference](https://docs.airbyte.com/integrations/sources/jira#reference).

Note: The credentials are retrieved with the `get_secret()` method, which looks for a matching Google Colab secret or environment variable, so they are not hard-coded into the notebook. In Colab, add your keys to the Secrets section on the left. If no match is found, you are prompted to enter the value, as in the cell output below.
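
The next cell passes the `projects_list` secret through `json.loads`, so the secret presumably has to hold a JSON array of Jira project keys rather than a bare comma-separated string. A minimal sketch of that parsing step follows; `parse_projects` is a hypothetical helper and the example value is made up for illustration.

```python
import json


def parse_projects(raw: str) -> list[str]:
    """Parse a JSON array of Jira project keys (hypothetical helper)."""
    projects = json.loads(raw)
    if not isinstance(projects, list) or not all(isinstance(p, str) for p in projects):
        raise ValueError("projects_list must be a JSON array of strings")
    return projects


# Illustrative secret value only; JSON requires double quotes around each key.
print(parse_projects('["IT", "TK"]'))
```

Validating the shape up front gives a clearer error than a failed connector check later on.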


In [2]:
import airbyte as ab
import json

projects = json.loads(ab.get_secret('projects_list'))

source = ab.get_source(
    "source-jira",
    install_if_missing=True,
    config={
        "api_token": ab.get_secret("jira_api_token"),
        "domain": ab.get_secret("jira_domain"),
        "email": ab.get_secret("jira_email_id"),
        "start_date": "2021-01-01T00:00:00Z",  # optional field; can be omitted
        "projects": projects,
    },
)

# Verify the config and creds by running `check`:
source.check()

Enter the value for secret 'projects_list':  ········
Enter the value for secret 'jira_api_token':  ········
Enter the value for secret 'jira_domain':  ········
Enter the value for secret 'jira_email_id':  ········


In [3]:
source.select_streams(['issues'])  # Select only the issues stream
read_result: ab.ReadResult = source.read()

# Convert every record in each selected stream into a document
documents_list = []
for _stream_name, dataset in read_result.items():
    documents_list.extend(dataset.to_documents())

print(str(documents_list))

## Read Progress

Started reading at 16:27:38.

Read **4** records over **4 seconds** (1.0 records / second).

Wrote **4** records over 1 batches.

Finished reading at 16:27:43.

Started finalizing streams at 16:27:43.

Finalized **1** batches over 0 seconds.

Completed 1 out of 1 streams:

  - issues


Completed writing at 16:27:43. Total time elapsed: 4 seconds


------------------------------------------------


IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
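
The IOPub warning above appears because the cell prints every document in full. One way to avoid it, sketched here, is to print only a count and a truncated preview. `preview_documents` and the `SampleDoc` stand-in are hypothetical names for this illustration; the sketch assumes each document exposes its text as `page_content`, which is the attribute the later cells rely on. With the real pipeline you would pass `documents_list` instead of `sample`.

```python
from dataclasses import dataclass


@dataclass
class SampleDoc:
    """Stand-in for a document object; only `page_content` is assumed."""
    page_content: str


def preview_documents(docs, limit=3, width=80):
    """Return a short summary instead of dumping every document."""
    lines = [f"{len(docs)} documents"]
    for doc in docs[:limit]:
        text = str(doc.page_content)
        # Truncate long bodies so the notebook output stays small.
        lines.append(text[:width] + ("..." if len(text) > width else ""))
    return "\n".join(lines)


sample = [SampleDoc("x" * 500), SampleDoc("short issue")]
print(preview_documents(sample))
```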



In [4]:
# store and display the issues stream in data frame
issues_df = read_result["issues"].to_pandas()
display(issues_df)

Unnamed: 0,expand,id,self,key,renderedfields,properties,names,schema,transitions,operations,...,versionedrepresentations,fieldstoinclude,fields,projectid,projectkey,created,updated,_airbyte_raw_id,_airbyte_extracted_at,_airbyte_meta
0,"customfield_10030.properties,operations,versio...",10622,https://airbyteio.atlassian.net/rest/api/3/iss...,TESTKEY11-1,"{""statuscategorychangedate"":""13/Apr/21 8:04 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2021-04-13T08:04:...",10014,TESTKEY11,2021-04-13 15:04:43.876,2021-04-15 18:38:09.705,01J0RKNCB9PK3W5N0MGX227W3Y,2024-06-19 16:00:48.487,{}
1,"customfield_10030.properties,operations,versio...",10077,https://airbyteio.atlassian.net/rest/api/3/iss...,TESTKEY1-15,"{""statuscategorychangedate"":""11/Mar/21 6:17 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2021-03-11T06:17:...",10004,TESTKEY1,2021-03-11 14:17:30.375,2021-04-15 18:38:12.811,01J0RKNCC10DJ4WPNN61KP9SG2,2024-06-19 16:00:48.512,{}
2,"customfield_10030.properties,operations,versio...",10073,https://airbyteio.atlassian.net/rest/api/3/iss...,TESTKEY1-14,"{""statuscategorychangedate"":""11/Mar/21 6:17 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2021-03-11T06:17:...",10004,TESTKEY1,2021-03-11 14:17:26.555,2021-04-15 18:38:13.477,01J0RKNCC854EPP807X91WVHA4,2024-06-19 16:00:48.519,{}
3,"customfield_10030.properties,operations,versio...",10070,https://airbyteio.atlassian.net/rest/api/3/iss...,TESTKEY1-13,"{""statuscategorychangedate"":""11/Mar/21 6:17 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2021-03-11T06:17:...",10004,TESTKEY1,2021-03-11 14:17:23.878,2021-04-15 18:38:14.686,01J0RKNCCEB6HZ7JY7S7NY2697,2024-06-19 16:00:48.525,{}
4,"customfield_10030.properties,operations,versio...",10064,https://airbyteio.atlassian.net/rest/api/3/iss...,TESTKEY1-12,"{""statuscategorychangedate"":""11/Mar/21 6:17 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2021-03-11T06:17:...",10004,TESTKEY1,2021-03-11 14:17:18.170,2021-04-15 18:38:17.691,01J0RKNCCPEJ60EZH420TA37EJ,2024-06-19 16:00:48.532,{}
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90,"customfield_10030.properties,operations,versio...",10075,https://airbyteio.atlassian.net/rest/api/3/iss...,IT-23,"{""statuscategorychangedate"":""11/Mar/21 6:17 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2021-03-11T06:17:...",10000,IT,2021-03-11 14:17:28.477,2023-10-12 20:43:50.735,01J0RN6KQQ6YZF5KYS2YAEDTDA,2024-06-19 16:27:41.685,{}
91,"customfield_10030.properties,operations,versio...",10636,https://airbyteio.atlassian.net/rest/api/3/iss...,TK-4,"{""statuscategorychangedate"":""14/Dec/23 9:39 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2023-12-14T09:39:...",10061,TK,2023-12-14 17:39:35.925,2023-12-14 17:39:36.392,01J0RKNFDEEP5FWYYFZZQ9XFZE,2024-06-19 16:00:51.630,{}
92,"customfield_10030.properties,operations,versio...",10635,https://airbyteio.atlassian.net/rest/api/3/iss...,TK-3,"{""statuscategorychangedate"":""14/Dec/23 9:24 AM...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2023-12-14T09:24:...",10061,TK,2023-12-14 17:24:51.587,2023-12-14 17:47:14.151,01J0RKNFDMAE38NRVS0JN4KCMN,2024-06-19 16:00:51.636,{}
93,"customfield_10030.properties,operations,versio...",10629,https://airbyteio.atlassian.net/rest/api/3/iss...,TK-2,"{""statuscategorychangedate"":""06/Jul/22 11:42 A...",,,,"[{""id"":""11"",""name"":""To Do"",""to"":{""self"":""https...",,...,,,"{""statuscategorychangedate"":""2022-07-06T11:42:...",10061,TK,2022-07-06 18:42:59.583,2023-12-14 18:06:01.025,01J0RKNFDTNZY806N72X4S5S65,2024-06-19 16:00:51.641,{}


## Use LangChain to Build a RAG Pipeline

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.utils import filter_complex_metadata

# Split each document into chunks of up to 512 characters, with 50 characters of overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(documents_list)

# Drop metadata values that are not simple types (str, int, float, bool)
chunked_docs = filter_complex_metadata(chunked_docs)
print(f"Created {len(chunked_docs)} document chunks.")

# Convert the remaining metadata values to strings before indexing
for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])

Created 13637 document chunks.
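
To make the `chunk_size=512` and `chunk_overlap=50` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which additionally prefers to break at separators such as paragraph and line breaks; `chunk_text` is a hypothetical helper that only illustrates how the two parameters interact.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character chunks; consecutive chunks share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


chunks = chunk_text("a" * 1000)
print([len(c) for c in chunks])  # [512, 512, 76]
```

The overlap means text near a chunk boundary appears in both neighbouring chunks, so a sentence split by the boundary can still be retrieved with some surrounding context.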


In [6]:
from langchain_openai import OpenAIEmbeddings
import os

os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")

embeddings = OpenAIEmbeddings()

Enter the value for secret 'OPENAI_API_KEY':  ········


## Setting up Pinecone

Pinecone is a managed vector database service designed for storing, indexing, and querying high-dimensional vector data efficiently.

In [7]:
from pinecone import Pinecone, ServerlessSpec
os.environ['PINECONE_API_KEY'] = ab.get_secret("PINECONE_API_KEY")

index_name = "airbytejiraindex"

pc = Pinecone()

# Create the Pinecone index only if it does not already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )




Enter the value for secret 'PINECONE_API_KEY':  ········
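
The index is created with `dimension=1536` and `metric="cosine"`, so every stored vector needs 1536 components (presumably chosen to match the output size of the embedding model configured earlier) and matches are ranked by cosine similarity. The sketch below shows what that metric computes on small toy vectors; `cosine_similarity` is a hypothetical helper for illustration, not part of the Pinecone client.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
```

Because the score depends only on direction, two vectors that differ in length but point the same way are treated as a perfect match.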


In [8]:
index = pc.Index(index_name)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [9]:
from langchain_pinecone import PineconeVectorStore

pinecone = PineconeVectorStore.from_documents(
    chunked_docs, embedding=embeddings, index_name=index_name
)

## RAG

In [10]:
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = pinecone.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

# The key was already set in an earlier cell; repeating it here is harmless
os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")

llm = ChatOpenAI(model_name="gpt-3.5-turbo")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("Langchain RAG pipeline set up successfully.")


Langchain RAG pipeline set up successfully.
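
The first stage of `rag_chain` is a dictionary that sends the question to the retriever, formats the retrieved chunks with `format_docs`, and passes the question through unchanged. The same data flow can be sketched with plain functions. `FakeDoc`, `fake_retriever`, and `build_prompt_inputs` are hypothetical stand-ins, the canned chunk texts are made up, and nothing here calls Pinecone or OpenAI.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Minimal stand-in for a retrieved document chunk."""
    page_content: str


def format_docs(docs):
    # Same helper as in the cell above: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Stand-in for the vector store retriever; returns canned chunks.
    return [FakeDoc("summary: IT test 2"), FakeDoc("status: To Do")]


def build_prompt_inputs(question, retriever):
    # Mirrors {"context": retriever | format_docs, "question": RunnablePassthrough()}.
    return {"context": format_docs(retriever(question)), "question": question}


inputs = build_prompt_inputs("Summarize the issue of key IT-20", fake_retriever)
print(inputs["context"])
```

In the real chain, this dictionary fills the `context` and `question` slots of the prompt before it is sent to the model.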


In [12]:
print(rag_chain.invoke("Summarize the issue of key IT-20"))

The issue of key IT-20 involves a test related to IT, with the summary stating "IT test 2." The issue has a time spent of 2 hours and 23 minutes, with no remaining estimate for completion. The status of the issue has been updated multiple times.


In [21]:
print(rag_chain.invoke("What is the source data about?"))

The source data is updated at various timestamps with null versioned representations.
