Proof of concept notebook to build a chat bot that I can talk to about jazz music. This will be able to run locally, with the eventual goal of packaging this up and being able to run it in the cloud. 

1. Firstly I'll use the ADA-002 text embedding functionality to vectorize RAG data.
2. I'll use pinecone to store the vectorized text
3. Also, GPT-3.5 Turbo will be used as the Gen AI model that our vector database will send the prompt/context to.
4. Lastly, LangChain will be used to create the pipline that connects these disparate parts.

The data soure I'll use is Wikipedia data that we'll scrap using the *mwcleint* and *mwparserfromhell*. You could also you *wikipedia*, although I had seen feedback that that library might be a little outdated at this point.

To help with the Wikipedia article scraping the **OpenAI Cookbook** article on the same was consulted frequently (https://cookbook.openai.com/examples/embedding_wikipedia_articles_for_search)

In [75]:
# import libraries for interacting with wikipedia
import pandas as pd # to use the DataFrames to store our data :) 
import mwclient # to download Wikipedia articles 
import mwparserfromhell # to parse through the articles afterwards

In [76]:
import os 
# setting constants 

# for Wikipedia data gather 
SITE_NAME = 'en.wikipedia.org'
ARTICLE_NAME_1 = 'Category:American male jazz musicians'
ARTICLE_NAME_2 = 'Category:American women jazz musicians'
SECTIONS_TO_IGNORE = [ # sections that we don't want to store
    "See also",
    "References",
    "External links",
    "Further reading",
    "Footnotes",
    "Bibliography",
    "Sources",
    "Citations",
    "Literature",
    "Footnotes",
    "Notes and references",
    "Photo gallery",
    "Works cited",
    "Photos",
    "Gallery",
    "Notes",
    "References and sources",
    "References and notes",
    'Sources',
    "Discography",
    "Selected discography",
    "Sessionography",
    "Filmography",
    "Concert films",
    "Books",
    "Awards",
    "Awards and honors",
    "Awards and accolades"
]

In [33]:
import yaml

with open("secrets.yaml", "r") as file:
    keys = yaml.safe_load(file)

OPENAI_KEY = keys["secrets"]["openai_key"]
PINECONE_KEY = keys["secrets"]["pinecone_key"]

In [77]:
# instantiate API client and iterate through each entry in the category page and
# add it to a set of titles - will use this list to grab page content iteratively

titles = set() # relevent article titles 

# get American male jazz musician article titles
site = mwclient.Site(SITE_NAME)
category_page_1 = site.pages[ARTICLE_NAME_1]
for cm in category_page_1.members():
    if type(cm) == mwclient.page.Page:
        titles.add(cm.name)

# get American female jazz musician article titles 
category_page_2 = site.pages[ARTICLE_NAME_2]
for cm in category_page_2.members():
    if type(cm) == mwclient.page.Page:
        titles.add(cm.name)

print(f"Found {len(titles)} article titles in {ARTICLE_NAME_1} and {ARTICLE_NAME_2}.")

Found 2775 article titles in Category:American male jazz musicians and Category:American women jazz musicians.


In [79]:
# store article contents 

wikipedia_sections = []
for title in titles: 
    site = mwclient.Site(SITE_NAME)
    page = site.pages[title]
    text = page.text()
    parsed_text = mwparserfromhell.parse(text)
    for section in parsed_text.get_sections():
        if section.get(0).title not in SECTIONS_TO_IGNORE:
            wikipedia_sections.append(section.strip_code())

print(f"Extracted text from {len(wikipedia_sections)} Wikipedia sections.")

Extracted text from 12213 Wikipedia sections.


In [95]:
print(wikipedia_sections[0])

John T. Klemmer (born July 3, 1946) is an American saxophonist, composer, songwriter, and arranger.

He was born in Chicago, Illinois, United States, and began playing guitar at the age of five and alto saxophone at the age of 11. His other early interests included graphics and visual art, writing, dance, puppetry, painting, sculpting, and poetry. He studied at the Art Institute of Chicago and began touring with midwestern "ghost big bands" (Les Elgart, Woody Herman) as well as playing with small local jazz and rock groups. After switching to tenor saxophone in high school, Klemmer played with commercial small groups and big bands in Chicago while leading his own groups and touring.


In [110]:
from langchain.text_splitter import RecursiveCharacterTextSplitter 

# instantiate text splitter break up articles into embed-able strings 
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size = 1500, 
    chunk_overlap = 300,
    length_function = len,
)

# run the splitter on each article text
wikipedia_strings = []
for ws in wikipedia_sections:
    wikipedia_strings.extend(splitter.split_text(ws))

print(f"{len(wikipedia_sections)} Wikipedia sections split into {len(wikipedia_strings)} strings.")

12213 Wikipedia sections split into 19006 strings.


In [111]:
# remove short strings 
filtered_wikipedia_strings = [ws for ws in wikipedia_strings if len(ws) > 16]
print(f"Filtered out {len(wikipedia_strings) - len(filtered_wikipedia_strings)} strings.")

Filtered out 429 strings.


In [112]:
# look at a test string
print(wikipedia_strings[0])

John T. Klemmer (born July 3, 1946) is an American saxophonist, composer, songwriter, and arranger.

He was born in Chicago, Illinois, United States, and began playing guitar at the age of five and alto saxophone at the age of 11. His other early interests included graphics and visual art, writing, dance, puppetry, painting, sculpting, and poetry. He studied at the Art Institute of Chicago and began touring with midwestern "ghost big bands" (Les Elgart, Woody Herman) as well as playing with small local jazz and rock groups. After switching to tenor saxophone in high school, Klemmer played with commercial small groups and big bands in Chicago while leading his own groups and touring.


#### Create RAG database and language model pipline

* Create embed model and embed text
* Create and populate RAG database with using embed model
* Instantiate LLM and connect pipeline together

In [85]:
import openai
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI


# set openai key
openai.key = OPENAI_KEY

# instantiate embedding object to feed strings into
embed = OpenAIEmbeddings(
    model = "text-embedding-ada-002",
    openai_api_key = OPENAI_KEY
)

# create embeddings with ada-002 model with a batch of 1000 strings at a time
BATCH_SIZE = 1000
embeddings = []
for batch_start in range(0, len(wikipedia_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = wikipedia_strings[batch_start:batch_end]
    # print(f"Batch {batch_start} to {batch_end-1}")
    embedding = embed.embed_documents(batch)
    embeddings.extend(embedding)

df = pd.DataFrame({"text": wikipedia_strings, "embedding": embeddings})
print(f"Created {len(embeddings)} embeddings from {len(wikipedia_strings)} strings.")

Created 25594 embeddings from 25594 strings.


In [91]:
# create pinecone vector database 
from pinecone import Pinecone 
from pinecone import ServerlessSpec
import time 

index_name = 'seis-630-final-project-vectorstore'

pc = Pinecone(api_key = PINECONE_KEY)
spec = ServerlessSpec(cloud='aws', region='us-east-1')

pc.create_index(
    index_name,
    dimension=1536, # dimensionality used by ADA-002 text embedding
    metric='dotproduct',
    spec=spec 
)

# wait for index initialization 
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

In [92]:
pc.describe_index(index_name)

{'dimension': 1536,
 'host': 'seis-630-final-project-vectorstore-kg3o9zv.svc.aped-4627-b74a.pinecone.io',
 'metric': 'dotproduct',
 'name': 'seis-630-final-project-vectorstore',
 'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
 'status': {'ready': True, 'state': 'Ready'}}

In [94]:
index = pc.Index(index_name)
BATCH_SIZE = 100

for batch_start in range(0, len(df), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = df.iloc[batch_start:batch_end]
    # set up batch to upsert, ids, context, and embeddings
    ids = [str(i) for i in range(batch_start, batch_end)]
    metadata = [{"context" : row["text"]} for i, row in batch.iterrows()]
    vectors = batch["embedding"].tolist()
    print(f"Batch {batch_start} to {batch_end-1}")
    to_upsert = zip(ids, vectors, metadata)
    index.upsert(vectors=to_upsert)
    time.sleep(5)


Batch 0 to 99
Batch 100 to 199
Batch 200 to 299
Batch 300 to 399
Batch 400 to 499
Batch 500 to 599
Batch 600 to 699
Batch 700 to 799
Batch 800 to 899
Batch 900 to 999
Batch 1000 to 1099
Batch 1100 to 1199
Batch 1200 to 1299
Batch 1300 to 1399
Batch 1400 to 1499
Batch 1500 to 1599
Batch 1600 to 1699
Batch 1700 to 1799
Batch 1800 to 1899
Batch 1900 to 1999
Batch 2000 to 2099
Batch 2100 to 2199
Batch 2200 to 2299
Batch 2300 to 2399
Batch 2400 to 2499
Batch 2500 to 2599
Batch 2600 to 2699
Batch 2700 to 2799
Batch 2800 to 2899
Batch 2900 to 2999
Batch 3000 to 3099
Batch 3100 to 3199
Batch 3200 to 3299
Batch 3300 to 3399
Batch 3400 to 3499
Batch 3500 to 3599
Batch 3600 to 3699
Batch 3700 to 3799
Batch 3800 to 3899
Batch 3900 to 3999
Batch 4000 to 4099
Batch 4100 to 4199
Batch 4200 to 4299
Batch 4300 to 4399
Batch 4400 to 4499
Batch 4500 to 4599
Batch 4600 to 4699
Batch 4700 to 4799
Batch 4800 to 4899
Batch 4900 to 4999
Batch 5000 to 5099
Batch 5100 to 5199
Batch 5200 to 5299
Batch 5300 to 53

### Creating GPT 3.5 Turbo Chatbot with 5 response memory

In [116]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA
from langchain.vectorstores import Pinecone
import os

vectorstore = Pinecone(index, embed, "context")

# Create reference to OpenAI
llm = ChatOpenAI(openai_api_key = OPENAI_KEY,
                    model_name = "gpt-3.5-turbo",
                    temperature = 0
                    )


# Include previous 5 messages in memory
conv_mem = ConversationBufferWindowMemory(
                memory_key = "history",
                k = 5,
                return_messages = True
                )

# Create chain to manage the chat session 
qa = RetrievalQA.from_chain_type(
                llm = llm, 
                chain_type = "stuff",
                retriever = vectorstore.as_retriever()
                )


In [118]:
qa.invoke("What was Miles Davis' childhood like?")

/home/charlie/Documents/UST_Database_Man/Final_Project/630env/lib/python3.9/site-packages/pydantic/main.py:1051: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.7/migration/


{'query': "What was Miles Davis' childhood like?",
 'result': "Miles Davis' childhood was spent in a musical family in Grand Rapids, Michigan. He often played music with his brother, Xavier Davis, and started playing the trumpet and tuba in grade school. He attended the Interlochen Arts Academy in northern Michigan towards the end of his high school career, where he played jazz and studied classical percussion and trap-set drumming."}

In [119]:
query = "What was Miles Davis' childhood like?"
vectorstore.similarity_search(query, k=3)

[Document(page_content='Life and career\nMiles was born Barry Miles Silverlight to Arthur and Hermine (née Klein) in Newark, New Jersey and grew up in North Plainfield, New Jersey.\n\nHe joined the musicians union at age nine in 1956 as a child prodigy drummer/pianist/vibist appearing with Miles Davis and John Coltrane among other talents of the day live and on TV shows including To Tell the Truth, Dick Van Dyke\'s variety show, and The Andy Williams Show. He made his solo artist debut recording at age fourteen in 1961, "Miles of Genius", as drummer and composer with sidemen Al Hall and Duke Jordan. Miles continued to perform with his own band in the early 1960s in which he composed the material that enabled up and coming talents such as Woody Shaw, Eddie Gómez and Robin Kenyatta to display their talents.\n\nHe wrote the instruction book, "Twelve Themes With Improvisations", published in 1963 by Belwin-Mills, and currently out of print.'),
 Document(page_content='Biography\nDavis grew 