# Challenge

Fine-tune a foundation model (Llama, BERT, Mistral, etc.) using the "AmazonTitles-1.3MM" dataset. <br>
The trained model must:
- Receive questions together with context obtained through a RAG (Retrieval-Augmented Generation) integration over documents related to Amazon products.
- Generate an answer from the prompt formed by the user's question and the data returned by the RAG step, citing the sources used.
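
As a rough illustration of the second requirement, the sketch below combines a question with retrieved documents into a single prompt and numbers each document so the answer can cite it. The function name `build_prompt`, the prompt wording, and the sample documents are assumptions made for this example, not part of the challenge statement.

```python
def build_prompt(question, retrieved_docs):
    """Combine the user question and retrieved documents into one prompt.

    `retrieved_docs` is a list of dicts with the (assumed) keys
    'uid', 'title' and 'content'.
    """
    context_lines = []
    for n, doc in enumerate(retrieved_docs, start=1):
        # Number each document so the answer can refer to it as a source
        context_lines.append(
            f"[{n}] {doc['title']} (uid: {doc['uid']}): {doc['content']}"
        )
    context = "\n".join(context_lines)
    return (
        f"Question: {question}\n"
        f"Context:\n{context}\n"
        "Answer (cite sources by their [number]):"
    )


# Made-up documents, for illustration only
docs = [
    {"uid": "A1", "title": "Example dash kit", "content": "Fits a sample car model."},
    {"uid": "B2", "title": "Example repair manual", "content": "Covers basic maintenance."},
]
prompt = build_prompt("How do I install a car stereo?", docs)
print(prompt)
```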

# Steps for fine-tuning the model

The AmazonTitles-1.3MM dataset consists of real user text queries and the associated titles of relevant Amazon products, with relevance measured by implicit or explicit user actions.

1.  Dataset preparation
    - Download the AmazonTitles-1.3MM dataset.
    - Prepare the data for fine-tuning, making sure it is organized in a format suitable for training the model.
    - Clean and pre-process the data as required by the chosen model.
2.  Fine-tuning
    - Fine-tune the selected foundation model on the prepared dataset.
    - Document the fine-tuning process, including the parameters used and any model-specific adjustments.
3.  RAG integration setup
    - Set up a RAG (Retrieval-Augmented Generation) integration that supplies the model with context from documents related to Amazon products.
    - Verify that the integration correctly retrieves contextual data and passes it to the model.
4.  Answer generation
    - Configure the trained model to receive user questions.
    - When a question arrives, use the RAG integration to retrieve relevant information from the AmazonTitles-1.3MM dataset.
    - Combine the user's question with the data returned by the RAG step to form a complete prompt.
    - The model must generate an answer based on the user's question and the retrieved data, including the sources provided.

# Dataset preparation
- Download the AmazonTitles-1.3MM dataset.
- Prepare the data for fine-tuning, making sure it is organized in a format suitable for training the model.
- Clean and pre-process the data as required by the chosen model.
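
A minimal sketch of what the cleaning step could look like, assuming each record is a dict with `uid`, `title` and `content` fields. The helper `clean_records` and its rules (collapse whitespace, drop records with an empty title or content, keep the first record per `uid`) are illustrative assumptions, not the notebook's actual pre-processing.

```python
def clean_records(records):
    """Normalize whitespace, drop incomplete records, and de-duplicate by uid."""
    cleaned, seen_uids = [], set()
    for rec in records:
        # Collapse runs of whitespace (including newlines) into single spaces
        title = " ".join((rec.get("title") or "").split())
        content = " ".join((rec.get("content") or "").split())
        uid = rec.get("uid")
        # Skip records with missing text or a uid that was already kept
        if not title or not content or uid in seen_uids:
            continue
        seen_uids.add(uid)
        cleaned.append({"uid": uid, "title": title, "content": content})
    return cleaned


# Made-up records, for illustration only
raw = [
    {"uid": "1", "title": "  Widget  A ", "content": "A useful\nwidget."},
    {"uid": "2", "title": "Widget B", "content": ""},
    {"uid": "1", "title": "Widget A", "content": "Duplicate entry."},
]
cleaned = clean_records(raw)
print(cleaned)
```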

In [None]:
!pip install transformers datasets faiss-cpu sentence_transformers langchain



In [None]:
FOLDER_PATH = '/content/drive/MyDrive/TheAmazonTitles'

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

In [None]:
import gzip
import json
import random

# Reservoir sampling over a .json.gz file containing one JSON object per line
def reservoir_sampling_gz_json(file_path, sample_size):
    sample = []
    seen = 0  # number of valid records read so far (malformed lines are not counted)
    with gzip.open(file_path, 'rt', encoding='utf-8') as f:
        for i, line in enumerate(f):
            try:
                # Try to parse each line as a JSON object
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line {i}: {e}")
                continue

            # Fill the reservoir until it reaches the sample size
            if seen < sample_size:
                sample.append(obj)
            else:
                # Replace a random reservoir slot with probability sample_size / (seen + 1)
                j = random.randint(0, seen)
                if j < sample_size:
                    sample[j] = obj
            seen += 1
    return sample

# Sample sizes: 4,000 training + 1,000 test examples (5,000 in total), plus 4,000 labels
sample_size_train = 4000
sample_size_test = 1000
sample_size_label = 4000

# Load a random sample from each compressed JSON file
train_sample = reservoir_sampling_gz_json(f'{FOLDER_PATH}/LF-Amazon-1.3M/trn.json.gz', sample_size_train)
test_sample = reservoir_sampling_gz_json(f'{FOLDER_PATH}/LF-Amazon-1.3M/tst.json.gz', sample_size_test)
labels_sample = reservoir_sampling_gz_json(f'{FOLDER_PATH}/LF-Amazon-1.3M/lbl.json.gz', sample_size_label)


# Show the number of examples loaded in each sample
print(f"Training set size: {len(train_sample)}")
print(f"Test set size: {len(test_sample)}")
print(f"Number of labels: {len(labels_sample)}")

Training set size: 4000
Test set size: 1000
Number of labels: 4000
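
Reservoir sampling draws a uniform random sample of fixed size from a stream in a single pass, without knowing the stream length in advance, which is why it suits large compressed files. The sketch below shows the same idea on an in-memory iterable; the name `reservoir_sample` and the seeded `random.Random` instance are assumptions made so the example is self-contained and repeatable.

```python
import random


def reservoir_sample(iterable, k, rng=random):
    """Return up to k items drawn uniformly at random from iterable in one pass."""
    reservoir = []
    for n, item in enumerate(iterable):
        if n < k:
            # The first k items fill the reservoir
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / (n + 1)
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


rng = random.Random(0)  # seeded only to make the illustration repeatable
picked = reservoir_sample(range(100_000), 5, rng)
short = reservoir_sample(range(3), 5, rng)  # stream shorter than k: everything is kept
print(picked, short)
```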


In [None]:
!pip install -U langchain-community



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments, pipeline
import torch
from datasets import Dataset
import json
from tqdm import tqdm
# The LangChain integrations live in the langchain-community package installed above
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_core.documents import Document

In [None]:
# Prepare training samples
training_samples = []
for item in train_sample:
    prompt = f"Title: {item['title']}\n"
    completion = item['content']
    training_samples.append({'prompt': prompt, 'completion': completion})

# Create Dataset
train_dataset = Dataset.from_list(training_samples)


In [None]:
train_dataset

Dataset({
    features: ['prompt', 'completion'],
    num_rows: 4000
})

In [None]:
model_name = 'distilgpt2'  # You can choose another model if you prefer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set pad_token to eos_token

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))  # Resize token embeddings

def tokenize_function(examples):
    # Concatenate 'prompt' and 'completion' for each example
    inputs = [i + t for i, t in zip(examples['prompt'], examples['completion'])]
    # Tokenize the inputs
    tokenized_inputs = tokenizer(
        inputs,
        truncation=True,
        padding='max_length',
        max_length=128,
    )
    return tokenized_inputs


# Tokenize the dataset
tokenized_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['prompt', 'completion'],  # Remove unused columns
)




In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)
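
To make the tokenization and collation settings above more concrete, here is a library-free sketch using made-up token IDs. It mimics what `truncation=True, padding='max_length'` does to each example (right-side padding is assumed), and then how a causal language modeling collator typically derives labels: a copy of the input IDs with padding positions set to -100 so the loss ignores them. The helper names and the pad ID are illustrative assumptions.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut the sequence to max_length, then right-pad it to exactly max_length."""
    ids = list(token_ids[:max_length])
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


def make_causal_lm_labels(input_ids, pad_id):
    """Copy input_ids as labels, masking every position that holds the pad ID."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


PAD_ID = 0  # made-up pad ID for this toy example
short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=PAD_ID)
long = pad_or_truncate(list(range(1, 11)), max_length=5, pad_id=PAD_ID)
labels = make_causal_lm_labels(short["input_ids"], PAD_ID)
print(short, long, labels)
```

One consequence worth keeping in mind: because the notebook reuses the end-of-sequence token as the padding token, masking by pad ID also masks any genuine end-of-sequence tokens in the labels.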


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=f'{FOLDER_PATH}/train/results',
    num_train_epochs=3,  # Increase epochs as needed
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    logging_steps=250,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

# Start training
trainer.train()
trainer.save_model(f'{FOLDER_PATH}/fine-tuned-model')

| Step | Training Loss |
|------|---------------|
| 250  | 4.6074 |
| 500  | 4.5343 |
| 750  | 4.4874 |
| 1000 | 4.4688 |
| 1250 | 4.4175 |
| 1500 | 4.4554 |
| 1750 | 4.3596 |
| 2000 | 4.3413 |
| 2250 | 3.9216 |
| 2500 | 3.9351 |
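
One way to read these numbers, assuming the logged value is the mean token-level cross-entropy in nats (the usual convention for causal language model training), is to convert it to perplexity with `exp(loss)`. Under that assumption, the first and last logged losses correspond to a drop from roughly 100 to roughly 51.

```python
import math

# First and last training losses from the log above
logged_losses = {250: 4.6074, 2500: 3.9351}

# Perplexity is the exponential of the mean cross-entropy (in nats)
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.1f}")
```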


In [None]:
# After training, save the model and tokenizer to the same folder
model_save_path = f'{FOLDER_PATH}/fine-tuned-model'

model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

('/content/drive/MyDrive/TheAmazonTitles/fine-tuned-model/tokenizer_config.json',
 '/content/drive/MyDrive/TheAmazonTitles/fine-tuned-model/special_tokens_map.json',
 '/content/drive/MyDrive/TheAmazonTitles/fine-tuned-model/vocab.json',
 '/content/drive/MyDrive/TheAmazonTitles/fine-tuned-model/merges.txt',
 '/content/drive/MyDrive/TheAmazonTitles/fine-tuned-model/added_tokens.json',
 '/content/drive/MyDrive/TheAmazonTitles/fine-tuned-model/tokenizer.json')

In [None]:
# Prepare documents for the vector store
documents = []
for item in train_sample:
    content = item['content']
    documents.append(Document(page_content=content, metadata={'uid': item['uid'], 'title': item['title']}))

# Create embeddings and vector store
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)

retriever = vectorstore.as_retriever()
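
Conceptually, the vector store embeds every document and, at query time, returns the documents whose vectors lie closest to the embedded question. The brute-force sketch below illustrates that idea with made-up three-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions) and assumes Euclidean distance as the similarity measure. The names `l2_distance` and `top_k` are hypothetical and are not part of FAISS or LangChain.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents closest to the query vector."""
    scored = sorted((l2_distance(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings", invented for illustration
doc_vecs = {
    "dash_kit": [0.9, 0.1, 0.0],
    "car_magnet": [0.5, 0.5, 0.2],
    "cookbook": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
nearest = top_k(query, doc_vecs, k=2)
print(nearest)
```

A real index avoids comparing the query against every vector one by one in Python, but the ranking principle is the same.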




In [None]:
tokenizer = AutoTokenizer.from_pretrained(f'{FOLDER_PATH}/fine-tuned-model')
model = AutoModelForCausalLM.from_pretrained(f'{FOLDER_PATH}/fine-tuned-model')

# Run the pipeline on the GPU when one is available, otherwise on the CPU
device = 0 if torch.cuda.is_available() else -1
nlp = pipeline('text-generation', model=model, tokenizer=tokenizer, device=device)



In [None]:
def get_relevant_context(question):
    docs = retriever.get_relevant_documents(question)
    context = "\n".join([doc.page_content for doc in docs])
    # Truncate context if necessary
    max_context_tokens = 500  # Adjust based on your model's max length
    context_tokens = tokenizer.tokenize(context)
    if len(context_tokens) > max_context_tokens:
        context_tokens = context_tokens[:max_context_tokens]
        context = tokenizer.convert_tokens_to_string(context_tokens)
    sources = [doc.metadata for doc in docs]
    return context, sources

def generate_response(question):
    context, sources = get_relevant_context(question)
    prompt = f"Question: {question}\nContext: {context}\nAnswer:"
    # Calculate input length
    input_length = len(tokenizer.encode(prompt))
    # Cap the generation length so prompt + answer do not exceed the model's max length
    max_new_tokens = min(128, model.config.max_position_embeddings - input_length)
    response = nlp(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    answer = response[0]['generated_text'][len(prompt):]
    return answer.strip(), sources


In [None]:
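def get_relevant_context(question, max_docs=4, max_context_tokens=500):
    """
    Retrieve relevant documents and build the context string for the prompt.
    """
    # Retrieve the top `max_docs` documents. Query the vector store directly with an
    # explicit k: passing `top_k` to the retriever did not change how many documents
    # it returned (the recorded run returned 4 sources).
    docs = vectorstore.similarity_search(question, k=max_docs)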

    # Combine document content into a single context string
    context = ""
    for doc in docs:
        context += f"Title: {doc.metadata['title']}\nContent: {doc.page_content}\n\n"

    # Truncate the context if it exceeds the max number of tokens
    context_tokens = tokenizer.encode(context)
    if len(context_tokens) > max_context_tokens:
        context_tokens = context_tokens[:max_context_tokens]
        context = tokenizer.decode(context_tokens)

    # Collect the sources metadata
    sources = [doc.metadata for doc in docs]
    return context.strip(), sources

def generate_response(question, max_new_tokens=128):
    """
    Generate a response using the fine-tuned language model based on a user query.
    """
    # Get the relevant context and sources
    context, sources = get_relevant_context(question)

    # Construct a well-structured prompt
    prompt = (
        f"Question: {question}\n"
        f"---\n"
        f"Context:\n{context}\n"
        f"---\n"
        "Based on the above information, provide a detailed and coherent answer to the question:"
    )

    # Cap the generation length so prompt + answer fit within the model's context window
    input_length = len(tokenizer.encode(prompt))
    max_new_tokens = min(max_new_tokens, model.config.max_position_embeddings - input_length)

    # Generate the response from the language model
    with torch.no_grad():  # Disable gradient calculation to save memory
        response = nlp(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)

    # Extract the generated answer, removing the prompt part
    answer = response[0]['generated_text'][len(prompt):].strip()
    return answer, sources

In [None]:
question = input("Please enter your question: ")
answer, sources = generate_response(question)
print("Answer:", answer)
print("Sources:", sources)

Please enter your question: I wanna fix my Car
Answer: This is a car that will sell for as little as 2,000 vehicle parts a year and for as long as you do your math with the basic installation instructions. We don't need to show every detail to customers how to get that car online. We hope you buy it for your car. Thank you! ---STereo Install Dash Kit Chevy Impala 2001 2-6 -0.04L/S; 1,250 Vehicle Parts (2-6 -ft -L) + Vehicle Parts (2-6 -ft) + Vehicle Parts (2-6 -ft) **Car Parts (2-6 -ft) **
Sources: [{'uid': 'B000KL1AD0', 'title': 'Stereo Install Dash Kit VW Beetle Activ 98 99 00 2000 (car radio wiring installation parts)'}, {'uid': 'B002656GZK', 'title': 'Pittsburgh Steelers Car Magnet Decal (12 -inch)'}, {'uid': 'B000KL2FXE', 'title': 'Stereo Install Dash Kit Chevy Impala 00 01 02 03 04 05 -car radio wiring installation parts'}, {'uid': 'B003GDDMC6', 'title': 'Haynes Repair Manuals 38040 Equinox Pont Torrent 05-09'}]
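
The raw metadata list is hard to scan, so a small formatting step can make the cited sources easier to read. The sketch below uses a hypothetical helper, `format_sources`, that numbers each source and skips repeated `uid` values. The sample entries reuse two of the sources printed above, with the first one repeated to show de-duplication.

```python
def format_sources(sources):
    """Turn a list of metadata dicts into a numbered, de-duplicated source list."""
    lines, seen = [], set()
    for meta in sources:
        uid = meta.get("uid")
        if uid in seen:
            continue  # the same product was retrieved more than once
        seen.add(uid)
        title = meta.get("title", "(untitled)")
        lines.append(f"[{len(lines) + 1}] {title} (uid: {uid})")
    return "\n".join(lines)


example_sources = [
    {"uid": "B000KL1AD0", "title": "Stereo Install Dash Kit VW Beetle Activ 98 99 00 2000 (car radio wiring installation parts)"},
    {"uid": "B003GDDMC6", "title": "Haynes Repair Manuals 38040 Equinox Pont Torrent 05-09"},
    {"uid": "B000KL1AD0", "title": "Stereo Install Dash Kit VW Beetle Activ 98 99 00 2000 (car radio wiring installation parts)"},
]
formatted = format_sources(example_sources)
print(formatted)
```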
