# Jira Analysis

This notebook processes a set of tickets exported from Jira: it chunks and embeds the ticket descriptions, stores the embeddings in Pinecone, and queries them with a LangChain chain.

## Imports

I hate having imports strewn all over the code, so I'm creating a section where I'll keep adding them.

I recognize that this requires me to do a "Run All" in the notebook each time, but that's better than import hell.

In [323]:
import pandas as pd
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
import os
from langchain.embeddings.openai import OpenAIEmbeddings
import pinecone
from uuid import uuid4
from tqdm.auto import tqdm

In [324]:
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("jira_csv.csv")

# Let's make sure we don't have an empty description
df.Description = df.Description.fillna("No Description Available")
# Tickets without a resolution date get "0" as a placeholder
df.Resolved = df.Resolved.fillna("0")

## Validate the data

In [325]:
desc_col = df.columns.get_loc('Description')

# I use the 4th record in the list because it has a really long description.
df.iloc[[3],[desc_col]]

Unnamed: 0,Description
3,"From Andre:\n\nApp: [https://app.crowdbotics.com/dashboard/app/39571|https://app.crowdbotics.com/dashboard/app/39571]\nI created a Connector to test OpenAPI response and as it’s an authenticated request, I added the Bearer Token. It looks like an EnvVar was added to the api.js file but when trying to deploy, it failed (stack trace is the first message in the app Activity Log) because of the newly added token.\n\nIssue with react-native-dotenv: [https://app.circleci.com/pipelines/github/crowdbotics-apps/andre-test-mar-27-39571/4/workflows/c8bdd4fc-a811-4993-afe7-7a610591a870/jobs/12|https://app.circleci.com/pipelines/github/crowdbotics-apps/andre-test-mar-27-39571/4/workflows/c8bdd4fc-a811-4993-afe7-7a610591a870/jobs/12]\n\n\nAnother issue is regarding the connector code generated in GitHub. When I first added the connector, and it has a token, the token was correctly added to the openAPI store, but after changing the connector detail to add a few more fields to the response and save, the token was removed from the code. I had to go back to the connector and add the token again and save, then it was added back to the connector’s store.\n\n----\n\nSteps to test and reproduce\n\n# Go to Connectors page\n# Create a connector with Bearer auth (can be fake information)\n# Save\n# Check if env var is added to the connector code like this: [https://github.com/crowdbotics-dev/aline-032923-dev-73007/blob/323de66db33c0ccd349eb64c10a0bf33958c89cc/store/rapidAPICocktails/api.js#L8|https://github.com/crowdbotics-dev/aline-032923-dev-73007/blob/323de66db33c0ccd349eb64c10a0bf33958c89cc/store/rapidAPICocktails/api.js#L8|smart-link] \n# Go to ""Active in my project"" tab\n# Edit the connector (like the description or new data call) - but do not edit the auth token. Save\n# Expect the Bearer header to still exist in the connector generated code."


## Chunk the data

In our case, we will only chunk up the `Description` field from Jira.

In [326]:
tokenizer = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text):
    '''
    Creates tokens from input text and returns the number of tokens.

        Parameters:
            text (str): The text to be tokenized
        
        Returns:
            The number of tokens created from the text (int)
    '''
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

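# Note: this measures the string form of a one-cell DataFrame (index and column
# header included), so the count is not exactly that of the raw description text.
print(tiktoken_len(str(df.iloc[[3],[desc_col]])))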

544


Now that we have a way of finding the number of tokens, let us initialize a splitter that uses the `tiktoken_len` function that we just created to split input text so that each chunk is never larger than a maximum that we set.

In [327]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
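
To make the role of `length_function` concrete, here is a minimal pure-Python sketch of length-bounded recursive splitting. It is not LangChain's implementation: `split_by_length` is a hypothetical helper, it ignores `chunk_overlap`, and the example uses `len` (character count) as a stand-in for `tiktoken_len` so that it runs without downloading an encoding.

```python
def split_by_length(text, max_len, length_function=len,
                    separators=("\n\n", "\n", " ", "")):
    """Split text so that every chunk measures at most max_len (no overlap)."""
    if length_function(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing finer to split on; return the oversized text unchanged.
        return [text]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if length_function(candidate) <= max_len:
            # The piece still fits in the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_function(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_by_length(piece, max_len, length_function, finer))
    if current:
        chunks.append(current)
    return chunks


chunks = split_by_length("aaaa bbbb cccc\n\ndddd eeee", max_len=10)
print(chunks)  # ['aaaa bbbb', 'cccc', 'dddd eeee']
```

Passing `length_function=tiktoken_len` instead of `len` would bound each chunk by token count rather than by characters, which is the point of the splitter configured above.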

Let us initialize the OpenAI API and create a test embedding just so we know everything works.

In [328]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

###### TEST TO MAKE SURE OPENAI API KEY WORKS
texts = [
    'this is the first chunk of text',
    'then another second chunk of text is here'
]

res = embed.embed_documents(texts)
len(res), len(res[0])
###### END TEST

(2, 1536)

It's time to create and initialize our vector database using Pinecone.

In [329]:
index_name = "gpt-test"

PINECONE_API_KEY = os.getenv("PINECONE_API_KEY") or "PINECONE_API_KEY"
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT") or "PINECONE_ENVIRONMENT"

pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENVIRONMENT)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )

index = pinecone.Index(index_name)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1001}},
 'total_vector_count': 1001}
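
The index is created with `metric='cosine'`, so queries are ranked by the cosine similarity between embedding vectors. As a rough illustration only (Pinecone computes this on its side; `cosine_similarity` is a hypothetical helper, and the toy 2-dimensional vectors stand in for 1536-dimensional embeddings):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Orthogonal vectors score 0; vectors pointing the same way score close to 1.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
print(orthogonal, parallel)
```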

## (Optional) Testing

In [330]:
test_row = df.iloc[3]
metadata = {
    "id": test_row["Issue key"],
    "type": test_row["Issue Type"],
    "status": test_row["Status"],
    "summary": test_row["Summary"],
    "created": test_row["Created"],
    "resolved": test_row["Resolved"]
}

print(len(test_row['Description']))

test_row_chunks = text_splitter.split_text(test_row['Description'])

print(len(test_row_chunks))
for i, desc in enumerate(test_row_chunks):
    print("Chunk", i, tiktoken_len(desc), desc)

1836
2
Chunk 0 307 From Andre:

App: [https://app.crowdbotics.com/dashboard/app/39571|https://app.crowdbotics.com/dashboard/app/39571]
I created a Connector to test OpenAPI response and as it’s an authenticated request, I added the Bearer Token. It looks like an EnvVar was added to the api.js file but when trying to deploy, it failed (stack trace is the first message in the app Activity Log) because of the newly added token.

Issue with react-native-dotenv: [https://app.circleci.com/pipelines/github/crowdbotics-apps/andre-test-mar-27-39571/4/workflows/c8bdd4fc-a811-4993-afe7-7a610591a870/jobs/12|https://app.circleci.com/pipelines/github/crowdbotics-apps/andre-test-mar-27-39571/4/workflows/c8bdd4fc-a811-4993-afe7-7a610591a870/jobs/12]


Another issue is regarding the connector code generated in GitHub. When I first added the connector, and it has a token, the token was correctly added to the openAPI store, but after changing the connector detail to add a few more fields to the response 

## (RUN WITH CAUTION) Create Embeddings

This insert of embeddings into the vector index is currently not idempotent: each run assigns fresh random IDs, so re-running it creates duplicate records in the index. Do NOT run this unless you intend to create duplicates, as I ended up doing.
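
One way to make the insert idempotent, sketched here as an assumption rather than what this notebook does, is to derive each vector ID from the issue key and chunk number instead of calling `uuid4()`. Pinecone's upsert overwrites an existing vector that has the same ID, so a re-run would replace records rather than duplicate them. `make_vector_id` is a hypothetical helper.

```python
def make_vector_id(issue_key, chunk_index):
    """Build a stable ID such as 'PLAT-10056-0' for one chunk of one ticket."""
    return f"{issue_key}-{chunk_index}"


# Toy metadata in the same shape as the per-chunk metadata the embedding cell builds.
metadata_chunks = [
    {"id": "PLAT-10056", "chunk": 0},
    {"id": "PLAT-10056", "chunk": 1},
]
ids = [make_vector_id(m["id"], m["chunk"]) for m in metadata_chunks]
print(ids)  # ['PLAT-10056-0', 'PLAT-10056-1']
```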

In [331]:
batch_limit = 100

descriptions = [] # list to store chunked descriptions
metadata_list = []

data_list = df.to_dict("records")   # converts df to list of dicts
                                    # Makes it easier to iterate.

for row in tqdm(data_list):
    metadata = {
        "id": row["Issue key"],
        "type": row["Issue Type"],
        "status": row["Status"],
        "summary": row["Summary"],
        "created": row["Created"],
        "resolved": row["Resolved"]
    }

    # Create chunks for description of each row
    row_chunks = text_splitter.split_text(row['Description'])
    # Create metadata for each chunk
    metadata_chunks = [{
        "chunk": j, "description": description, **metadata
    } for j, description in enumerate(row_chunks)]
    descriptions.extend(row_chunks)
    metadata_list.extend(metadata_chunks)
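    # Once enough chunks have accumulated, embed and upsert them as one batch
    if len(descriptions) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(descriptions))]
        embeds = embed.embed_documents(descriptions)
        print(len(ids), len(descriptions), len(metadata_list), len(embeds))
        upsert_vectors = list(zip(ids, embeds, metadata_list))
        index.upsert(vectors=upsert_vectors)
        descriptions = []
        metadata_list = []

# Flush the final partial batch; without this, chunks left over after the
# last full batch are never upserted.
if descriptions:
    ids = [str(uuid4()) for _ in range(len(descriptions))]
    embeds = embed.embed_documents(descriptions)
    index.upsert(vectors=list(zip(ids, embeds, metadata_list)))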

  0%|          | 0/1000 [00:00<?, ?it/s]

100 100 100 100
100 100 100 100
100 100 100 100
100 100 100 100
100 100 100 100
100 100 100 100
100 100 100 100
100 100 100 100
101 101 101 101
100 100 100 100
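
The batching pattern used above can also be written as a small generator that yields full batches and then whatever remains, so a trailing partial batch is not lost. This is a sketch: `iter_batches` is a hypothetical helper, plain integers stand in for chunks, and unlike the loop above it batches item by item rather than row by row.

```python
def iter_batches(items, batch_limit):
    """Yield lists of at most batch_limit items, including the final partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_limit:
            yield batch
            batch = []
    if batch:
        # Leftover items that did not fill a whole batch.
        yield batch


sizes = [len(b) for b in iter_batches(range(250), batch_limit=100)]
print(sizes)  # [100, 100, 50]
```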


## Query the Vector Store

In [332]:
desc_field = "description"

index = pinecone.Index(index_name)

vectorstore = Pinecone(index, embed.embed_query, desc_field)


In [343]:
query = "What tickets relate to visual design?"

vectorstore.similarity_search(
    query,  # our search query
    k=5  # return the 5 most relevant docs
)

[Document(page_content='Any new tickets to improve the model builder (a.k.a Data Models) as a feature\n\nReference → [https://docs.google.com/spreadsheets/d/148Pqr4Jtk86L5yYyEVJqx17gLXNtc6Xwg70toq_cU-U/edit?usp=sharing|https://docs.google.com/spreadsheets/d/148Pqr4Jtk86L5yYyEVJqx17gLXNtc6Xwg70toq_cU-U/edit?usp=sharing|smart-link]', metadata={'chunk': 0.0, 'created': datetime.datetime(2023, 1, 26, 12, 39), 'id': 'PLAT-10056', 'resolved': '0', 'status': 'To Do', 'summary': 'Model Builder Improvements', 'type': 'Epic'}),
 Document(page_content='Any new tickets to improve the model builder (a.k.a Data Models) as a feature\n\nReference → [https://docs.google.com/spreadsheets/d/148Pqr4Jtk86L5yYyEVJqx17gLXNtc6Xwg70toq_cU-U/edit?usp=sharing|https://docs.google.com/spreadsheets/d/148Pqr4Jtk86L5yYyEVJqx17gLXNtc6Xwg70toq_cU-U/edit?usp=sharing|smart-link]', metadata={'chunk': 0.0, 'created': datetime.datetime(2023, 1, 26, 12, 39), 'id': 'PLAT-10056', 'resolved': '0', 'status': 'To Do', 'summary': 
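
Because the index contains duplicate records, the same chunk can appear more than once in the results, as in the output above. A minimal sketch of filtering such duplicates on the client side, keyed on the ticket `id` and `chunk` number; `dedupe_results` is a hypothetical helper, and plain dicts stand in for LangChain `Document` objects:

```python
def dedupe_results(results):
    """Keep the first occurrence of each (ticket id, chunk number) pair."""
    seen = set()
    unique = []
    for doc in results:
        key = (doc["metadata"]["id"], doc["metadata"]["chunk"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Toy results: the first two entries are the same chunk stored twice.
results = [
    {"page_content": "chunk A", "metadata": {"id": "PLAT-10056", "chunk": 0}},
    {"page_content": "chunk A", "metadata": {"id": "PLAT-10056", "chunk": 0}},
    {"page_content": "chunk B", "metadata": {"id": "PLAT-10056", "chunk": 1}},
]
unique = dedupe_results(results)
print(len(unique))  # 2
```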

In [346]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Chat model that generates answers from the retrieved chunks
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever())
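
`chain_type="stuff"` means the retrieved chunks are all placed ("stuffed") into a single prompt alongside the question. The following sketch illustrates the idea only; `build_stuff_prompt` and its prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt ahead of the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Can you classify these tickets?",
    ["First ticket description.", "Second ticket description."],
)
print(prompt)
```

Since everything goes into one prompt, the number of retrieved chunks and the chunk size together determine how much of the model's context window is used.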

In [347]:
qa.run("Can you classify these tickets?")

'Yes, these tickets seem to be related to UI/UX improvements and minor code changes. They are not specifically related to improving the model builder feature.'