## Data Cleaning

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("hf://datasets/cmsolson75/artist_song_lyric_dataset/training_lyric_data.csv")
df.head(10)




Unnamed: 0,Prompt,Lyric,Context
0,Song lyric in the style of Dua Lipa.,one one one one one talkin' in my sleep at nig...,New Rules from the album Dua Lipa
1,Song lyric in the style of Dua Lipa.,if you don't wanna see me did a full 80 crazy ...,Don’t Start Now from the album Future Nostalgia
2,Song lyric in the style of Dua Lipa.,you call me all friendly tellin' me how much y...,IDGAF from the album Dua Lipa
3,Song lyric in the style of Dua Lipa.,i know it's hot i know we've got something tha...,Blow Your Mind (Mwah) from the album Dua Lipa
4,Song lyric in the style of Dua Lipa.,i see the moon i see the moon i see the moon o...,Be the One from the album Dua Lipa
5,Song lyric in the style of Dua Lipa.,i've always been the one to say the first good...,Break My Heart from the album Future Nostalgia
6,Song lyric in the style of Dua Lipa.,here where the sky's falling i'm covered in b...,Homesick from the album Dua Lipa
7,Song lyric in the style of Dua Lipa.,if you wanna run away with me i know a galaxy ...,Levitating from the album Future Nostalgia
8,Song lyric in the style of Dua Lipa.,common love isn't for us we created something ...,Physical from the album Future Nostalgia
9,Song lyric in the style of Dua Lipa.,ah yeah ah yeah he calls me the devil i make h...,Hotter Than Hell from the album Dua Lipa


In [3]:
def write_lyrics_to_txt(df, output_file):
    # Open the file for writing
    with open(output_file, 'w') as f:
        # Group the rows by the 'Context' column
        grouped = df.groupby('Context')
        
        # Iterate over each Context value and its lyrics
        for lyric_type, lyrics in grouped:
            # Write the Context as a category heading
            f.write(f"## {lyric_type}\n")
            f.write("=" * len(lyric_type) + "\n")  # Add a divider line
            
            # Iterate over every record in the group
            for idx, row in lyrics.iterrows():
                # Write the Prompt as a title
                f.write(f"Prompt: {row['Prompt']}\n")
                # Write the Lyric
                f.write(f"Lyric: {row['Lyric']}\n\n")
            
            # Add blank lines to separate different Context groups
            f.write("\n\n")

    print(f"Lyrics have been written to {output_file}")


In [4]:
write_lyrics_to_txt(df, "lyrics.txt")

Lyrics have been written to lyrics.txt


## RAG

In [5]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

from langchain_community.document_loaders import TextLoader  # loads the document
from langchain_text_splitters import RecursiveCharacterTextSplitter  # splits the loaded document into chunks
from langchain_openai import OpenAIEmbeddings  # converts chunks into embeddings
from langchain_chroma import Chroma  # vector database for storing the embeddings

In [7]:
from dotenv import load_dotenv
load_dotenv()

True

In [8]:
import os
base_dir = os.getcwd()  # named base_dir to avoid shadowing the built-in dir()
db_dir = os.path.join(base_dir, "new_chroma_db")
print(db_dir)

/Users/caochunqin/Desktop/githomework/milestone2_pt2/new_chroma_db


### Create vector DB

In [9]:
#Read the text content from the .txt file and load it as langchain document
loader = TextLoader('lyrics.txt')
document = loader.load()

In [10]:
#Split the document into chunks using text splitters 
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(document)

print("Document chunk info:\n")
print(f"Number of document chunks: {len(chunks)}")
print(f"Sample chunk: \n{chunks[3].page_content}\n")

Document chunk info:

Number of document chunks: 14018
Sample chunk: 
speak my mind my man made me feel so god damn fine i'm flawless  you wake up flawless post up flawless ridin' round in it flawless flossin' on that flawless this diamond flawless my diamond flawless this rock flawless my roc flawless i woke up like this i woke up like this we flawless ladies tell 'em i woke up like this i woke up like this we flawless ladies tell 'em say i look so good tonight god damn god damn say i look so good tonight god damn god damn god damn  i wake up looking this good god damn god damn god damn and i wouldn't change it if i could if i if i if i if i and you can say what you want i'm the shit what you want i'm the shit i'm the shit i'm the shit i'm the shit i'm the shit i want everyone to feel like this tonight god damn god damn god damn sample spottieottiedopaliscious onika nicki minaj  looking trinidadian japanese and indian got malaysian got that yaki that wavy brazilian them bitches thirst
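
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as newlines and spaces), and `sliding_window_chunks` is a hypothetical helper name introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where neighbours share chunk_overlap characters.

    Simplified illustration only: it cuts at fixed character offsets and
    ignores separators, unlike RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

With `chunk_size=1000` and `chunk_overlap=200`, as in the cell above, the window in this simplified model would advance 800 characters at a time, so each chunk would repeat the last 200 characters of the previous one.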

In [11]:
#create embeddings using openAI embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

In [12]:
#store the embeddings and chunks into Chroma DB
Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=db_dir)

<langchain_chroma.vectorstores.Chroma at 0x172b064b0>
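
Embedding 14018 chunks costs time and API calls, and re-running the cell above against an existing `persist_directory` will most likely add the same chunks to the collection a second time rather than replace them. A minimal guard, assuming a non-empty directory means the DB has already been built, could look like the sketch below. `has_persisted_db` is a hypothetical helper, not part of LangChain or Chroma.

```python
import os


def has_persisted_db(path: str) -> bool:
    """Return True if path is an existing, non-empty directory.

    Assumption: a non-empty persist directory means the vector DB was
    already built. This does not inspect or validate the Chroma files.
    """
    return os.path.isdir(path) and len(os.listdir(path)) > 0
```

In the notebook this could wrap the build step, for example `if not has_persisted_db(db_dir): Chroma.from_documents(...)`, so the embeddings are only computed on the first run.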

### Retrieve and generate

In [18]:
#setting up the DB for retrieval
embeddings_used = OpenAIEmbeddings(model="text-embedding-3-small")
vectorDB = Chroma(persist_directory=db_dir,embedding_function=embeddings_used)

In [14]:
#setting up the retriever
retriever = vectorDB.as_retriever(search_type="similarity", search_kwargs={"k": 3})
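
To illustrate what a similarity search with `k=3` does conceptually, here is a small pure-Python sketch that ranks stored vectors against a query vector and returns the indices of the top `k`. It uses cosine similarity as an assumption for illustration; the distance metric Chroma actually applies depends on how the collection is configured, and a real vector store uses an approximate index rather than this brute-force scan. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real embeddings have many more dimensions
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

In the notebook, the query vector would be the embedding of the scenario text and the stored vectors would be the embeddings of the lyric chunks.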

In [15]:
def getRetriever(db_path):
    """
    db_path is the directory of the persisted vector DB
    """
    embeddings_used = OpenAIEmbeddings(model="text-embedding-3-small")
    vectorDB = Chroma(persist_directory=db_path, embedding_function=embeddings_used)
    retriever = vectorDB.as_retriever(search_type="similarity", search_kwargs={"k": 3})
    return retriever

In [16]:
def textGeneration_langChain_RAG(msg, artist, retrieverDir):
    """
    msg is the scenario for the story from the picture (Hugging Face model output);

    artist is the name of the artist, one of: Dua Lipa, Ariana Grande, Charlie Puth, Drake, Billie Eilish, Eminem, Lady Gaga, Beyoncé, Ed Sheeran, Justin Bieber, Taylor Swift, Rihanna, Coldplay, Nicki Minaj, Katy Perry, Maroon 5, Selena Gomez, Post Malone

    retrieverDir is the directory of the vector DB built from the txt version of the
       lyrics dataset from Hugging Face - https://huggingface.co/datasets/cmsolson75/artist_song_lyric_dataset
    """
    llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0.2,
            max_tokens=200,
            timeout=None,
            max_retries=2
        )

    system_prompt = (
        "You are an expert in recommending and writing song lyrics. "
        "Based on the emotional content given by the user, choose lyrics by the artist {lyric_type} "
        "that match this emotion as inspiration, using the retrieved lyric excerpts below where relevant. "
        "Then write several lyric lines that fit the style of {lyric_type} and show what this artist "
        "would sing when they encounter such a situation.\n\n"
        "Retrieved lyric excerpts:\n{context}"
    )

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{scenario_lang}"),
        ]
    )

    rag_chain = prompt | llm | StrOutputParser()

    # Run the retriever on the scenario and join the retrieved chunks into one
    # context string, so the prompt's {context} placeholder receives the lyrics
    retriever = getRetriever(retrieverDir)
    retrieved_docs = retriever.invoke(msg)
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)

    out_message = rag_chain.invoke({
            "lyric_type": artist,
            "context": context,
            "scenario_lang": msg,
        })

    return out_message

In [17]:
scenario = "butterflies" #example output from huggingface model
lyric = textGeneration_langChain_RAG(scenario,"Lady Gaga", db_dir)
print(lyric)

For the emotion of "butterflies," which often signifies excitement, nervousness, or the thrill of new love, Lady Gaga's song "Speechless" captures a sense of vulnerability and emotional intensity. Another song that embodies the feeling of butterflies is "The Edge of Glory," which speaks to the exhilaration of living in the moment.

Inspired by Lady Gaga, here are some original lyric lines that capture the essence of "butterflies":

---

When I see you, my heart takes flight,  
In your eyes, the world feels right.  
Every word you say, a gentle breeze,  
With you, I'm floating with such ease.  

---

In the glow of your smile, I find my way,  
Butterflies dance, come what may.  
On the edge of something new,  
With every heartbeat, I fall for you.  

---

When Lady Gaga encounters a moment filled with butterflies, she might sing:

---

Caught in the thrill of this sweet embrace,  
In your
