<a href="https://colab.research.google.com/github/fibleep/adam-mickiewicz-ai/blob/main/LLaMa_Dataset_Enrichment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enriching and formatting the dataset

We will load the Adam Mickiewicz dataset to clean it, enrich it and upload it to hugging face hub

Load the repository with the data

In [24]:
!git clone https://github.com/fibleep/adam-mickiewicz-ai.git

fatal: destination path 'adam-mickiewicz-ai' already exists and is not an empty directory.


In [25]:
!pip install pandas



In [26]:
import sys
sys.path.insert(0,'/data_extraction')


In [27]:
import pandas as pd
df = pd.read_csv("data_extraction/books.csv")
df

Unnamed: 0,Book,Text
0,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,Adomas Mickevičius\n\nKrymo sonetai\nAjudagas\...
1,Adam Mickiewicz Do D. D.,Adam Mickiewicz\n\nDo D. D.\n\n\n\nElegia\n\n ...
2,Adam Mickiewicz Zaloty,Adam Mickiewicz\n\nZaloty\n\n\n\nPóki córeczki...
3,Mickiewicz Adam Ballady i romanse Tukaj albo ...,Mickiewicz Adam\n\nBallady i romanse\nTukaj al...
4,Adam Mickiewicz Pieśń filaretów,Adam Mickiewicz\n\nPieśń filaretów\n\n\n\n He...
...,...,...
149,Adam Mickiewicz Do D… D…,Adam Mickiewicz\n\nDo D… D…\n\n\n\nMoja pieszc...
150,Adam Mickiewicz Dziadów części III Ustęp Oles...,Adam Mickiewicz\n\nDziadów części III Ustęp\nO...
151,"Adam Mickiewicz Dziady. Poema Dziady, część IV","Adam Mickiewicz\n\nDziady. Poema\nDziady, częś..."
152,Adam Mickiewicz Sonety odeskie Sonet II. Do L...,Adam Mickiewicz\n\nSonety odeskie\nSonet II. D...


For each book, we will split each verse and annotate it with the title, later on we will feed it to the model and ask to create a question and answer based on it

In [28]:
# Remove the first 5 lines from the text column, theyre the same as the book column
def remove_first_five_lines(text):
    return '\n'.join(text.split('\n')[5:])

# Apply the function to the 'Text' column
df['Text'] = df['Text'].apply(remove_first_five_lines)

df.head()

Unnamed: 0,Book,Text
0,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"\n\n\nMėgstu ant Ajudago rymodams matyti,\nKai..."
1,Adam Mickiewicz Do D. D.,\nElegia\n\n Gdybyś ty na dzień jeden była w ...
2,Adam Mickiewicz Zaloty,\nPóki córeczki opiewałem wdzięki:\nMamunia sł...
3,Mickiewicz Adam Ballady i romanse Tukaj albo ...,\n\n\n(we czterech częściach)\n\n\n\n\nI\n\n ...
4,Adam Mickiewicz Pieśń filaretów,\n Hej użyjmy żywota!\nWszak żyjem tylko raz:...


In [29]:
# Check the text
df.Text[0]

'\n\n\nMėgstu ant Ajudago rymodams matyti,\nKaip putojančios bangos verpetais puškuoja,\nBei sidabro vainikais juosia marių srytį,\nIr tarsi vaivorykštės aplinkui ratuoja,\n\nAtsimuša į seklių, sekliaus išblaškyti,\nLyg didžiažuvių būriai krantus atakuoja,\nPaveldėtų sausumą, tečiaus nuvaryti,\nVėl kiaukerus, koralus ir perlus tebšluoja.\n\nJaunasis dainiau! Lygiai ir tavo širdyje\nKensmas dažnai sujudin vėsulas grumzdingas,\nBet sveikamjam pakėlus bard’ną įkvėpimo,\n\nAnas be blėdies žūva gelmėj užmiršimo,\nIr už save palieka giesmes nemirtingas,\nTavo garbei vainiką nupint ateityje.\n\n\n\n\n'

Most of the poems the poems/books are in verses, by splitting each per verse we can perform some pretty efficient chunking without losing much of the context. Later on we will perform some extra cleaning.

In [30]:
exploded_df = pd.DataFrame(columns=df.columns)

verse_series_list = []

for row in df.iterrows():

  # Split the verse in each row
  split_text = row[1].Text.split("\n\n")

  for idx,verse in enumerate(split_text):
    verse = verse.replace("\n", "")


    if (not verse) | (len(verse)<10):
      continue

    verse = verse

    verse_series_list.append(pd.Series([row[1].Book, verse], index=df.columns))

print(len(verse_series_list))

4924


In [31]:
exploded_df = pd.concat(verse_series_list, axis=1).T
exploded_df.head()

Unnamed: 0,Book,Text
0,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"Mėgstu ant Ajudago rymodams matyti,Kaip putoja..."
1,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"Atsimuša į seklių, sekliaus išblaškyti,Lyg did..."
2,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,Jaunasis dainiau! Lygiai ir tavo širdyjeKensma...
3,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"Anas be blėdies žūva gelmėj užmiršimo,Ir už sa..."
4,Adam Mickiewicz Do D. D.,Gdybyś ty na dzień jeden była w mojej duszy…...


In [32]:
result = pd.DataFrame([[]])
result['result'] = exploded_df['Text'].apply(len).mean()
result

Unnamed: 0,result
0,246.628554


# Enriching the data
We're going to load in a llama model to enrich the dataset and prepare it for finetuning

# Important!

llama is a gated model, request it at https://huggingface.co/meta-llama/Llama-2-70b-chat-hf and add your api key

In [33]:
!huggingface-cli login --token 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

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/fifi/.cache/huggingface/token
Login successful


In [34]:
!pip install light-the-torch torchvision torchaudio sentencepiece accelerate bitsandbytes

Collecting light-the-torch
  Downloading light_the_torch-0.7.5-py3-none-any.whl.metadata (9.5 kB)
Collecting torchvision
  Downloading torchvision-0.16.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (6.6 kB)
Collecting torchaudio
  Downloading torchaudio-2.1.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (6.4 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp311-cp311-macosx_11_0_arm64.whl (1.2 MB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m[31m8.5 MB/s[0m eta [36m0:00:01[0m
[?25hCollecting accelerate
  Using cached accelerate-0.24.1-py3-none-any.whl.metadata (18 kB)
Collecting bitsandbytes
  Using cached bitsandbytes-0.41.2.post2-py3-none-any.whl.metadata (9.8 kB)
Collecting pip<23.3,>=22.3 (from light-the-torch)
  Downloading pip-23.2.1-py3-none-any.whl.metadata (4.2 kB)
Collecting torch==2.1.1 (from torchvision)
  Using cached torch-2.1.1-cp311-none-macosx_11_0_arm64.whl.metadata (

In [35]:
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM

# model_name = "HuggingFaceH4/zephyr-7b-beta"
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, load_in_4bit=True
# )
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)

In [36]:
# import transformers

# pipe = transformers.pipeline(
#     "text-generation",
#     model=model,
#     tokenizer=tokenizer,
# )

In [37]:
# sequences = pipe(
#    "Who is adam mickiewicz?",
#     max_length=400,
#     do_sample=True,
#     top_k=10,
#     num_return_sequences=1,
#     eos_token_id=tokenizer.eos_token_id,
# )
# for seq in sequences:
#     print(f"Result: {seq['generated_text']}")

After trying Llama 70b (couldn't get it to load properly) and zephyr-7b (not enough polish), let's try gpt-3 for this:

In [38]:
exploded_df

Unnamed: 0,Book,Text
0,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"Mėgstu ant Ajudago rymodams matyti,Kaip putoja..."
1,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"Atsimuša į seklių, sekliaus išblaškyti,Lyg did..."
2,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,Jaunasis dainiau! Lygiai ir tavo širdyjeKensma...
3,Adomas Mickevičius Krymo sonetai Ajudagas tłu...,"Anas be blėdies žūva gelmėj užmiršimo,Ir už sa..."
4,Adam Mickiewicz Do D. D.,Gdybyś ty na dzień jeden była w mojej duszy…...
...,...,...
4919,Adam Mickiewicz Sonety odeskie Sonet II. Do L...,O luba! niech twe oczy przyznać się nie boją;J...
4920,Adam Mickiewicz Sonety odeskie Sonet II. Do L...,Że uciekać i kochać bez nadziei muszę.Niech śl...
4921,Adam Mickiewicz Liryki lozańskie Widzenie,"Dźwięk mię uderzył… Nagle moje ciało,Jak ó..."
4922,Adam Mickiewicz Liryki lozańskie Widzenie,"Przeszedłem ludzkie ciała, jak przebiegaPr..."


In [39]:
!pip install langchain

Collecting langchain
  Obtaining dependency information for langchain from https://files.pythonhosted.org/packages/ce/3f/1dafc52526337d1c554227b0e6f16a1aee18e63bf5cd03fd7774297059b2/langchain-0.0.338-py3-none-any.whl.metadata
  Downloading langchain-0.0.338-py3-none-any.whl.metadata (16 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Obtaining dependency information for SQLAlchemy<3,>=1.4 from https://files.pythonhosted.org/packages/c7/55/d1d2ad054fb7e9188681d56df40ed81c2c198314a805b180b0ec99019da1/SQLAlchemy-2.0.23-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Using cached SQLAlchemy-2.0.23-cp311-cp311-macosx_11_0_arm64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Obtaining dependency information for aiohttp<4.0.0,>=3.8.3 from https://files.pythonhosted.org/packages/c0/c3/3491f4a4b54798415f9a3bf69f2be2f2edaf2aaa5d2b0171b4420feaf45b/aiohttp-3.9.0-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading aiohttp-3.9.0-cp311-cp311-macosx_11_0_arm64.whl.met

In [40]:
# Create a model

from langchain.pydantic_v1 import BaseModel, Field


class Conversation(BaseModel):
    """Identifying information about a person."""

    question: str = Field(..., description="Zadaj pytanie Adamowi Mickiewiczowi")
    answer: str = Field(..., description="Odpowiedź Adama Mickiewicza, zawsze pisana wierszem i po polsku, minimum 50 słów")

In [45]:

from langchain.chains.openai_functions import (
    create_openai_fn_chain,
    create_openai_fn_runnable,
    create_structured_output_chain,
    create_structured_output_runnable,
)
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate



llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7, openai_api_key="API KEY")
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "Nazywasz się Adam Mickiewicz, jesteś polskim poetą i mówisz wierszem. Przetłumacz wszystkie utwory na polski.",
        ),
        (
            "human",
            "Zbuduj pytanie (skierowane do adama mickiewicza, w formie krotkiego generalnego pytania) i odpowiedź, MUSZA BYC PO POLSKU i musi mieć max 50 słów: {input}",
        ),
        ("human", "ZAWSZE ODPOWIADAJ WIERSZEM I MAKSYMALNIE 50 SŁÓW,PYTANIE I ODPOWIEDŹ MUSZĄ BYĆ PO POLSKU! ODPOWIEDŹ TO MUSI BYĆ WIERSZ W STYLU ADAMA MICKIEWICZA (NA PODSTAWIE TEKSTU POWYŻEJ). Użyj poprawnego formatu!!!!"),
    ]
)


In [67]:
import pandas as pd
import threading
from queue import Queue
from langchain.chains.openai_functions import create_structured_output_runnable
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Assuming llm and prompt are already initialized as before

def generate_conversation_threaded(row_index, row, output_queue, stop_event, csv_file):
    """
    Generate a conversation using threading and save the result to a CSV file.
    """
    try:
        if stop_event.is_set():
            raise TimeoutError("Thread execution timed out")

        runnable = create_structured_output_runnable(Conversation, llm, prompt)
        conversation = runnable.invoke({"input": row[1]})
        result = pd.Series([conversation.question, conversation.answer], index=['Question', 'Answer'])
        output_queue.put(result)

        # Save result to CSV
        with open(csv_file, 'a') as f:
            result.to_frame().T.to_csv(f, header=f.tell()==0, index_label='Index')
    except Exception as e:
        print(f"Error occurred: {e}")
        output_queue.put(pd.Series([None, None], index=['Question', 'Answer']))

def process_dataframe(df, csv_file, timeout=60):
    """
    Process each row of the DataFrame using threading with a timeout and save to CSV.
    """
    output_queue = Queue()
    threads = []
    stop_events = []

    for index, row in df.iterrows():
        stop_event = threading.Event()
        thread = threading.Thread(target=generate_conversation_threaded, args=(index, row, output_queue, stop_event, csv_file))
        threads.append(thread)
        stop_events.append(stop_event)
        thread.start()

        # Set a timer to stop the thread after a timeout
        timer = threading.Timer(timeout, stop_event.set)
        timer.start()

    for thread in threads:
        thread.join()

    results = [output_queue.get() for _ in range(len(df))]
    return pd.concat(results, axis=1)

# File to save the results
csv_file = "conversation_results.csv"



In [68]:
updated_df = process_dataframe(exploded_df, csv_file)

  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversati

Error occurred: 1 validation error for _OutputFormatter
__root__
  Expecting property name enclosed in double quotes: line 5 column 3 (char 140) (type=value_error.jsondecode; msg=Expecting property name enclosed in double quotes; doc={
  "output": {
    "question": "Dobranoc, jak dźwięk w twoim uchu przemówił,",
    "answer": "Przez chwilę cichą i uroczą, rozbrzmiał,",
  }
}; pos=140; lineno=5; colno=3)


  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversation = runnable.invoke({"input": row[1]})
  conversati

RuntimeError: can't start new thread

Error occurred: 1 validation error for _OutputFormatter
__root__
  Invalid control character at: line 4 column 51 (char 108) (type=value_error.jsondecode; msg=Invalid control character at; doc={
  "output": {
    "question": "Tyżeś to? i tak późno?",
    "answer": "Nie troszcz się, droga myślicielko,
Zbladłym księżycem zgasłym w lesie,
Tęsknota zawsze serce mą rozgrzewa,
O mnie myślić musisz, mój kochany niewdzięczniku!"
  }
}; pos=108; lineno=4; colno=51)
Error occurred: 1 validation error for _OutputFormatter
__root__
  Invalid control character at: line 4 column 54 (char 128) (type=value_error.jsondecode; msg=Invalid control character at; doc={
  "output": {
    "question": "Jakie wyzwanie czeka Adama Mickiewicza?",
    "answer": "Nieznajomy szpieg wyzwaniem mu stanie,
Sąd krzywoprzysiężny walkę ogłosi,
W dół kryjomy walczyć będzie panie,
Wróg potężny wyrok na niego wzniesie."
  }
}; pos=128; lineno=4; colno=54)
Error occurred: 1 validation error for _OutputFormatter
output -> answe