## Data Cleaning

In [49]:
import pandas as pd

In [50]:
df = pd.read_csv("hf://datasets/datastax/philosopher-quotes/philosopher-quotes.csv")

df.head(10)

Unnamed: 0,author,quote,tags
0,aristotle,True happiness comes from gaining insight and ...,knowledge
1,aristotle,"The roots of education are bitter, but the fru...",education;knowledge
2,aristotle,Before you heal the body you must first heal t...,ethics
3,aristotle,The proof that you know something is that you ...,education;knowledge
4,aristotle,Those who are not angry at the things they sho...,
5,aristotle,"Whatever we learn to do, we learn by actually ...",education;knowledge
6,aristotle,The greatest thing by far is to be a master of...,
7,aristotle,The society that loses its grip on the past is...,history;ethics;knowledge
8,aristotle,The man who is truly good and wise will bear w...,knowledge;ethics
9,aristotle,The greatest of all pleasures is the pleasure ...,knowledge;education;history


In [51]:
def write_stories_to_txt(df, output_file):
    # Open the output file in write mode
    with open(output_file, 'w') as f:
        # Group the dataframe by 'author'; each author acts as the "story type" heading
        grouped = df.groupby('author')
        
        # Loop through each author and the rows (quotes) belonging to that author
        for story_type, stories in grouped:
            # Write the author name as a heading
            f.write(f"{story_type}\n")
            f.write("=" * len(story_type) + "\n")  # Underline the heading with '=' characters
            
            # Write every quote by this author
            for story in stories['quote']:
                f.write(f"{story}\n\n")  # Leave a blank line after each quote
            
            # Add a couple of blank lines between different authors
            f.write("\n\n")

    print(f"Stories have been written to {output_file}")

In [52]:
write_stories_to_txt(df, "new_stories.txt")

Stories have been written to new_stories.txt
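
The `df.head(10)` output above shows an unnamed index column, semicolon-separated `tags`, and some rows with no tags at all. A minimal sketch of how those could be tidied, assuming a small hypothetical sample with the same column names (the quote text here is placeholder data, not taken from the dataset):

```python
import pandas as pd

# Hypothetical three-row sample mirroring the columns shown by df.head(10)
sample = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 4],
        "author": ["aristotle", "aristotle", "aristotle"],
        "quote": ["placeholder quote A", "placeholder quote B", "placeholder quote C"],
        "tags": ["knowledge", "education;knowledge", None],
    }
)

# Drop the leftover index column written by an earlier to_csv call
cleaned = sample.drop(columns=["Unnamed: 0"])

# Replace missing tags with an empty string, then split on ';' into a list
cleaned["tags"] = cleaned["tags"].fillna("").apply(
    lambda s: s.split(";") if s else []
)

print(cleaned["tags"].tolist())
```

None of this is required for `write_stories_to_txt`, which only reads the `author` and `quote` columns; it would matter only if the tags were used later, for example as retrieval metadata.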


## RAG

In [54]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from langchain_community.document_loaders import TextLoader  # load the document
from langchain_text_splitters import RecursiveCharacterTextSplitter  # create chunks from the loaded document
from langchain_openai import OpenAIEmbeddings  # convert chunks into embeddings
from langchain_chroma import Chroma  # database for storing the embeddings

In [55]:
from dotenv import load_dotenv
load_dotenv()

True

In [56]:
import os
base_dir = os.getcwd()  # named base_dir to avoid shadowing the built-in dir()
db_dir = os.path.join(base_dir, "new_chroma_db")
print(db_dir)

/Users/caochunqin/Desktop/githomework/milestone2_pt2/new_chroma_db


### Create vector DB

In [57]:
# Read the text content from the .txt file and load it as a LangChain document
loader = TextLoader('new_stories.txt')
document = loader.load()

In [59]:
#Split the document into chunks using text splitters 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(document)

print("Document chunk info:\n")
print(f"Number of document chunks: {len(chunks)}")
print(f"Sample chunk: \n{chunks[3].page_content}\n")

Document chunk info:

Number of document chunks: 76
Sample chunk: 
If you would understand anything, observe its beginning and its development

A friend is another I.

He who hath many friends hath none.

The hand is the tool of tools.

Good moral character is not something that we can achieve on our own. We need a culture that supports the conditions under which self-love and friendship flourish.

We give up leisure in order that we may have leisure, just as we go to war in order that we may have peace.

We must be neither cowardly nor rash but courageous.

The true nature of anything is what it becomes at its highest.

To give away money is an easy matter and in any man's power. But to decide to whom to give it and how large and when, and for what purpose and how, is neither in every man's power nor an easy matter.

A man's happiness consists in the free exercise of his highest faculties.

For what is the best choice for each individual is the highest it is possible for him to achiev
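
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified sketch that cuts fixed-size character windows with a fixed overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as blank lines before falling back to smaller units); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means text near a chunk boundary appears in two neighbouring chunks, so a quote that straddles the boundary is less likely to be lost to retrieval.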

In [60]:
# Create embeddings using OpenAI embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

In [61]:
#store the embeddings and chunks into Chroma DB
Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_dir)

<langchain_chroma.vectorstores.Chroma at 0x31e3674a0>

### Retrieve and generate

In [62]:
#setting up the DB for retrieval
embeddings_used = OpenAIEmbeddings(model="text-embedding-3-small")
vectorDB = Chroma(persist_directory=db_dir,embedding_function=embeddings_used)

In [63]:
# Set up the retriever
retriever = vectorDB.as_retriever(search_type="similarity", search_kwargs={"k": 3})
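
Conceptually, a similarity retriever with `k=3` embeds the query and returns the 3 stored chunks whose vectors are closest to it. A toy sketch of that ranking step follows, assuming cosine similarity as the measure and hand-written 3-dimensional vectors; both are illustrative assumptions, since the distance metric Chroma actually applies depends on how the collection is configured, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical (text, vector) pairs standing in for embedded chunks
stored = [
    ("chunk about friendship", [1.0, 0.0, 0.0]),
    ("chunk about education", [0.0, 1.0, 0.0]),
    ("chunk about courage", [0.0, 0.0, 1.0]),
    ("chunk about learning", [0.1, 0.9, 0.0]),
]

result = top_k([0.0, 1.0, 0.1], stored, k=2)
print(result)  # ['chunk about education', 'chunk about learning']
```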

In [64]:
def getRetriever(db_path):
    """
    db_path is the directory of the persisted vector DB
    """
    embeddings_used = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorDB = Chroma(persist_directory=db_path, embedding_function=embeddings_used)
    retriever = vectorDB.as_retriever(search_type="similarity", search_kwargs={"k": 3})
    return retriever

In [65]:
def textGeneration_langChain_RAG(msg, story_type, retrieverDir):
    """
    msg is the scenario for the story from the pic (Hugging Face model output);
    story_type is the genre of the story: Horror, Fantasy, Adventure, Comedy, Mystery, Romance;
    retrieverDir is the directory of the vector DB used for retrieval. The docstring
        originally described a txt version of the stories dataset from Hugging Face,
        https://huggingface.co/datasets/ShehryarAzhar/stories; in this notebook the
        DB passed in is the one built from new_stories.txt.
    """
    llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0.2,
            max_tokens=200,
            timeout=None,
            max_retries=2
        )

    system_prompt = (
        "You are an expert short {story_type} story teller. "
        "Use the following pieces of retrieved context to generate a {story_type} story. "
        "Use a simple narrative structure to generate the {story_type} story based on the given scenario. "
        "Keep the story to fewer than 200 words."
        "\n\n"
        "{context}"
    )

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{scenario_lang}"),
        ]
    )

    rag_chain = prompt | llm | StrOutputParser()

    # Run the retrieval step: fetch the chunks most similar to the scenario
    # and join their text, so the prompt receives text rather than the retriever object
    retriever = getRetriever(retrieverDir)
    retrieved_docs = retriever.invoke(msg)
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)

    out_message = rag_chain.invoke({
            "story_type": story_type,
            "context": context,
            "scenario_lang": msg,
        })
    
    return out_message
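
The `{context}` placeholder in the system prompt is filled with text, so retrieved chunks have to be turned into a single string before they reach the prompt. A minimal sketch of that formatting step, assuming only that each retrieved document exposes a `page_content` attribute (as `chunks[3].page_content` does earlier); `FakeDocument` and `format_docs` are hypothetical names used so the example runs without an API key or a vector DB:

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs):
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDocument("A friend is another I."),
    FakeDocument("The hand is the tool of tools."),
]

context = format_docs(docs)
print(context)
```

The resulting string is what would be supplied under the `"context"` key when invoking the chain.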

In [66]:
scenario = "bookshelves with many different colored books on them in a library" #example output from huggingface model
story = textGeneration_langChain_RAG(scenario,"Horror", db_dir)
print(story)

In the heart of the ancient library, where the air was thick with the scent of aged paper and dust, stood a peculiar set of bookshelves. Each shelf was filled with books of every imaginable color, their spines creating a rainbow that seemed to pulse with an eerie life of its own.

Lena, a curious librarian, had always been drawn to this section. She noticed that each time she passed by, the arrangement of the books seemed to change, as if they were whispering secrets to one another in the dead of night. Intrigued, she decided to investigate.

One evening, after the library had closed, Lena returned to the shelves. As she reached out to touch a particularly vibrant red book, the room grew cold, and the lights flickered. The books began to hum, a low, unsettling melody that resonated deep within her bones.

Suddenly, the books flew from the shelves, swirling around her in a chaotic dance. The colors blurred together, forming a vortex
