# Fetching datasets

In [3]:
import pandas as pd
import os
from classes.finetuning.preprocessing import Preprocessor

preoprocessor = Preprocessor()

## Fetching Openhermes

In [4]:
url_openhermes = "https://datasets-server.huggingface.co/rows?dataset=teknium%2Fopenhermes&config=default&split=train&"
df_openhermes = pd.DataFrame()

try:
    path = os.path.join(os.getcwd(), "backups", "finetuning", "df_openhermes.pkl")

    if os.path.exists(path):
        df_openhermes = pd.read_pickle(path)
        print(f"Backup loaded: {path}")
except Exception as e:
    print(f"An exception occurred: {str(e)}")

if df_openhermes.empty:
    df_openhermes = preoprocessor.fetch_rows(url_openhermes, 5000)
    df_openhermes = preoprocessor.df_openhermes_preproc(df_openhermes)

df_openhermes

Backup loaded: /workspaces/datamanagement_project/backups/finetuning/df_openhermes.pkl


Unnamed: 0,question,answer
0,Write a Perl script that processes a log file ...,```perl\n#!/usr/bin/perl\n\nuse strict;\nuse w...
1,"What can be seen once in a minute, twice in a ...",The letter 'M'.
2,Famous inventors and their inventions: Identif...,1. Thomas Edison: One of his most significant ...
3,Generate a list of 12 words that start with 'qu'.,1. Quail\n2. Quarry\n3. Quasar\n4. Quench\n5. ...
4,"Who was the first woman to win a Nobel Prize, ...",Marie Curie; Physics
...,...,...
4995,"BEGININPUT\nBEGINCONTEXT\ndate: June 12, 2023\...",Some popular sustainable design features menti...
4996,Write an Erlang function that creates a distri...,To create a distributed key-value store using ...
4997,BEGININPUT\nBEGINCONTEXT\nauthor: Orlo Orloff\...,| Name | Sport | Score ...
4998,"Alice had 10 marbles, gave 3 to Bob, and then ...","This question is about marbles, not cookies. A..."


## Fetching SlimOrca

In [5]:
url_slimOrca = "https://datasets-server.huggingface.co/rows?dataset=Open-Orca%2FSlimOrca&config=default&split=train&"
df_slimOrca_clean = pd.DataFrame()

try:
    path = os.path.join(os.getcwd(), "backups", "finetuning", "df_slimOrca_clean.pkl")

    if os.path.exists(path):
        df_slimOrca_clean = pd.read_pickle(path)
        print(f"Backup loaded: {path}")
except Exception as e:
    print(f"An exception occurred: {str(e)}")

if df_slimOrca_clean.empty:
    df_slimOrca = preoprocessor.fetch_rows(url_slimOrca, 5000)
    df_slimOrca_clean = preoprocessor.df_slimOrca_preproc(df_slimOrca)

df_slimOrca_clean

Backup loaded: /workspaces/datamanagement_project/backups/finetuning/df_slimOrca_clean.pkl


Unnamed: 0,question,answer
0,"Write an article based on this ""A man has been...",Title: Tragedy Strikes in Sydney: Victims Stab...
1,Answer the following question: - number is 54 ...,The information provided seems to refer to Ria...
2,Produce a long descriptive sentence that uses ...,"Stretching across a vast areaOfLand, totaling ..."
3,Write a title for this article:\n\nArbitration...,"""The Sneaky Clauses Taking Away Your Day in Co..."
4,"Definition: In this task, you are given a hate...",geopolitical\n\nStep 1: Understand the text\nI...
...,...,...
4995,Here is an article:\n\nOhio high school senior...,"Ohio High School Student Wins $250,000 Scholar..."
4996,Q:Answer the following question given this par...,"The correct answer is called ""epistasis."" When..."
4997,"Teacher:In this task, you are given two phrase...","Yes\nExplanation: In this problem, the Head is..."
4998,,


## Merging the two cleaned datasets

In [6]:
df = pd.concat([df_openhermes, df_slimOrca_clean])
df = df.assign(language="en", accuracy=-1, acc_explanation="")
df

Unnamed: 0,question,answer,language,accuracy,acc_explanation
0,Write a Perl script that processes a log file ...,```perl\n#!/usr/bin/perl\n\nuse strict;\nuse w...,en,-1,
1,"What can be seen once in a minute, twice in a ...",The letter 'M'.,en,-1,
2,Famous inventors and their inventions: Identif...,1. Thomas Edison: One of his most significant ...,en,-1,
3,Generate a list of 12 words that start with 'qu'.,1. Quail\n2. Quarry\n3. Quasar\n4. Quench\n5. ...,en,-1,
4,"Who was the first woman to win a Nobel Prize, ...",Marie Curie; Physics,en,-1,
...,...,...,...,...,...
4995,Here is an article:\n\nOhio high school senior...,"Ohio High School Student Wins $250,000 Scholar...",en,-1,
4996,Q:Answer the following question given this par...,"The correct answer is called ""epistasis."" When...",en,-1,
4997,"Teacher:In this task, you are given two phrase...","Yes\nExplanation: In this problem, the Head is...",en,-1,
4998,,,en,-1,


## Splitting the dataset

In [7]:
import numpy as np

# Split the DataFrame into two parts with 50% of the shuffled rows
df_split = np.array_split(df.sample(frac=1, axis=0, random_state=42).sample(frac=1).reset_index(drop=True), 2)
df_split[0]

  return bound(*args, **kwds)


Unnamed: 0,question,answer,language,accuracy,acc_explanation
0,Read the text and determine if the sentence is...,"A). Yes\n\nIn the given text, Megan was eating...",en,-1,
1,Please answer the following question: I'm taki...,The best title for the passage is A: A School ...,en,-1,
2,Given the question: Title: drying Review: foun...,The review is negative. \n\nThe review provide...,en,-1,
3,Which musical instrument did jazz legend Louis...,Trumpet,en,-1,
4,"BEGININPUT\nBEGINCONTEXT\ndate: September 15, ...",The principle that states it is fundamentally ...,en,-1,
...,...,...,...,...,...
4995,Alexandra Elizabeth Shiva is an American film ...,"From the information provided, it is not entir...",en,-1,
4996,An amusement park is designing a new roller co...,To calculate the percentage of children who wi...,en,-1,
4997,,,en,-1,
4998,"Given the question: Given this review: ""Good w...",Definitely.\n\nStep 1: Analyze the review\nI w...,en,-1,


## Translating the Split

In [8]:
from classes.finetuning.translator import Translator

translator = Translator()

df_translated = translator.translate(df_split[0])
df_translated

Searching backup in: ['/workspaces/datamanagement_project/backups/finetuning/0122_5000_translated.pkl']
Backup found: /workspaces/datamanagement_project/backups/finetuning/0122_5000_translated.pkl
Backup loaded: /workspaces/datamanagement_project/backups/finetuning/0122_5000_translated.pkl


Unnamed: 0,question,answer,language,accuracy,acc_explanation
5000,"Se getti una pietra rossa nel mare blu, cosa d...",Bagnato o sommerso.,it,-1,
5001,"Dada la definici√≥n de la tarea y los aportes, ...","Para llegar a la respuesta, analic√© el comenta...",es,-1,
5002,Question : Qui tue Mal√©fique ? S'il n'y a pas ...,Informations insuffisantes pour apporter une r...,fr,-1,
5003,Ti viene fornita una dichiarazione scritta in ...,‡≤¨‡≤π‡≤æ‡≤Æ‡≤æ‡≤∏‡≥ç,it,-1,
5004,Responda la siguiente pregunta: Estoy haciendo...,C: Mucha gente tiene una idea err√≥nea de la re...,es,-1,
...,...,...,...,...,...
9995,Develop a C++ program that calculates the fact...,Here's a simple C++ program that calculates th...,it,-1,
9996,Describe el proceso creativo detr√°s del dise√±o...,"El dise√±o de la portada del √°lbum ""Dark Side o...",es,-1,
9997,"au d√©part, il y avait 32 indicateurs de r√©ussi...","Initialement, il y avait 32 indicateurs de r√©u...",fr,-1,
9998,Entri in una stanza con un fiammifero. C'√® una...,Accendi prima il fiammifero.\n\nEsempio 9:,it,-1,


## Paper implementation

### Evaluator LLM Setup

In [9]:
from classes.llm.evaluator import Evaluator
from classes.llm.gemini import Gemini

evaluator = Evaluator()
gemini = Gemini()

df_to_evaluate = pd.concat([df_split[0], df_translated])

Searching backup in: ['/workspaces/datamanagement_project/backups/finetuning/0122_10000_evaluated.pkl']
Backup found: /workspaces/datamanagement_project/backups/finetuning/0122_10000_evaluated.pkl


Backup loaded: /workspaces/datamanagement_project/backups/finetuning/0122_10000_evaluated.pkl


In [10]:
if evaluator.df_to_evaluate is not None or evaluator.df_to_evaluate.empty:
    df_to_evaluate = evaluator.df_to_evaluate
    
df_to_evaluate

Unnamed: 0,question,answer,language,accuracy,acc_explanation
0,"On the Origin of Species:\n\nBy the mid-1870s,...",Saltationism means that new species were thoug...,en,5,\nExplanation:\nThe response accurately define...
1,Read the text and answer the questions.\n\nWar...,1. The music assets were sold in 2003.\n2. The...,en,5,Explanation: The response accurately answers b...
2,"At a playground, there are four children: Alic...",The children think that a dog ran off with bot...,en,5,\nExplanation:\nThe response accurately reflec...
3,Compile a list of 19 words that commence with ...,1. Reassemble\n2. Reassure\n3. Rebuild\n4. Rec...,en,5,\nExplanation: The AI assistant's response is ...
4,Here is some data about a restaurant: name = A...,Step 1: Identify the given information - In th...,en,5,\nExplanation:\nThe response accurately follow...
...,...,...,...,...,...
9995,Develop a C++ program that calculates the fact...,Here's a simple C++ program that calculates th...,it,5,\nExplanation:\nThe response accurately addres...
9996,Describe el proceso creativo detr√°s del dise√±o...,"El dise√±o de la portada del √°lbum ""Dark Side o...",es,5,\nExplanation:\nThe response accurately and co...
9997,"au d√©part, il y avait 32 indicateurs de r√©ussi...","Initialement, il y avait 32 indicateurs de r√©u...",fr,5,Explanation: The response is accurate and comp...
9998,Entri in una stanza con un fiammifero. C'√® una...,Accendi prima il fiammifero.\n\nEsempio 9:,it,5,\nExplanation:\nThe response is accurate and c...


In [11]:
df_to_evaluate =  evaluator.evaluate(df_to_evaluate, gemini, evaluator)
df_to_evaluate

10000it [00:00, 17934.39it/s]


Unnamed: 0,question,answer,language,accuracy,acc_explanation
0,"On the Origin of Species:\n\nBy the mid-1870s,...",Saltationism means that new species were thoug...,en,5,\nExplanation:\nThe response accurately define...
1,Read the text and answer the questions.\n\nWar...,1. The music assets were sold in 2003.\n2. The...,en,5,Explanation: The response accurately answers b...
2,"At a playground, there are four children: Alic...",The children think that a dog ran off with bot...,en,5,\nExplanation:\nThe response accurately reflec...
3,Compile a list of 19 words that commence with ...,1. Reassemble\n2. Reassure\n3. Rebuild\n4. Rec...,en,5,\nExplanation: The AI assistant's response is ...
4,Here is some data about a restaurant: name = A...,Step 1: Identify the given information - In th...,en,5,\nExplanation:\nThe response accurately follow...
...,...,...,...,...,...
9995,Develop a C++ program that calculates the fact...,Here's a simple C++ program that calculates th...,it,5,\nExplanation:\nThe response accurately addres...
9996,Describe el proceso creativo detr√°s del dise√±o...,"El dise√±o de la portada del √°lbum ""Dark Side o...",es,5,\nExplanation:\nThe response accurately and co...
9997,"au d√©part, il y avait 32 indicateurs de r√©ussi...","Initialement, il y avait 32 indicateurs de r√©u...",fr,5,Explanation: The response is accurate and comp...
9998,Entri in una stanza con un fiammifero. C'√® una...,Accendi prima il fiammifero.\n\nEsempio 9:,it,5,\nExplanation:\nThe response is accurate and c...


In [12]:
from classes.database import DatabaseHandler

db_handler = DatabaseHandler(collection_name= "multilingual_finetuning")

In [17]:
# convert df_to_evaluate to dictionary
df_to_evaluate_dict = df_to_evaluate.to_dict(orient="index")
df_to_evaluate_dict

{0: {'question': 'On the Origin of Species:\n\nBy the mid-1870s, most scientists accepted evolution, but relegated natural selection to a minor role as they believed evolution was purposeful and progressive. The range of evolutionary theories during "the eclipse of Darwinism" included forms of "saltationism" in which new species were thought to arise through "jumps" rather than gradual adaptation, forms of orthogenesis claiming that species had an inherent tendency to change in a particular direction, and forms of neo-Lamarckism in which inheritance of acquired characteristics led to progress. The minority view of August Weismann, that natural selection was the only mechanism, was called neo-Darwinism. It was thought that the rediscovery of Mendelian inheritance invalidated Darwin\'s views.\n\nPlease answer a question about this article. If the question is unanswerable, say "unanswerable". What was meant by the term saltationism?',
  'answer': 'Saltationism means that new species were 

In [18]:
df_to_evaluate_dict = db_handler.clean_and_prepare_data(df_to_evaluate_dict)
db_handler.insert_data(df_to_evaluate_dict)

Inserted 10000 documents.
