Name: Fabio Grassiotto  
RA: 890441

# Prompt Engineering Exercise Using the Groq API


In [7]:
%%capture
%pip install -q torch
%pip install datasets
%pip install groq

### Imports

In [12]:
import os
import sys
import warnings
from collections import Counter

import torch
from datasets import Dataset, load_dataset, load_dataset_builder
from groq import Groq
from tqdm import tqdm  # progress bar used in eval_model

warnings.simplefilter('ignore')

### Global Variables

## Colab Environment Setup and GPU Device

In [9]:
# Colab environment
IN_COLAB = 'google.colab' in sys.modules

if (IN_COLAB):
    # Google Drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)

    project_folder="/content/drive/MyDrive/Classes/IA024/Aula_7_8"
    os.chdir(project_folder)
    !ls -la

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


## Setup Groq Library

In [10]:
def load_groq_key(path="groq-key.txt"):
    """Read the Groq API key from a local text file; return None on failure."""
    try:
        with open(path, 'r') as file:
            # Strip the trailing newline that editors usually append
            return file.read().strip()
    except FileNotFoundError:
        print(f"The file '{path}' does not exist.")
        return None
    except Exception as e:
        # Handle other potential exceptions (e.g., permission errors)
        print(f"An error occurred while reading the file: {str(e)}")
        return None

groq_key = load_groq_key()
if groq_key:
    # os.environ only accepts strings, so skip the assignment when loading failed
    os.environ["GROQ_API_KEY"] = groq_key
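
As a hedged alternative to the file-only loader above, the key can be looked up in the environment first, with the file as a fallback. This is a minimal sketch: the helper name `resolve_groq_key` and its parameters are hypothetical and not part of the Groq library; only `GROQ_API_KEY` and `groq-key.txt` come from this notebook.

```python
import os
from pathlib import Path


def resolve_groq_key(env_var="GROQ_API_KEY", key_file="groq-key.txt"):
    """Return the API key from the environment, else from a local file.

    Hypothetical helper for illustration. It raises instead of returning
    None so that a missing key fails early with a clear message.
    """
    # 1) Prefer a key that is already exported in the environment
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    # 2) Fall back to the text file used in this notebook
    path = Path(key_file)
    if path.is_file():
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key

    raise RuntimeError(
        f"No API key found in ${env_var} or in the file '{key_file}'."
    )
```

Raising here is a design choice: it avoids passing `None` further down, where the failure would be harder to trace.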

### Basic Groq test

In [16]:
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Is LLama3 better than GPT-4?",
        }
    ],
    model="llama3-70b-8192",
)

print(chat_completion.choices[0].message.content)

LLaMA (not Llama3, which doesn't exist) and GPT-4 are both large language models, but they have different architectures, training objectives, and use cases. Which one is "better" ultimately depends on your specific needs and goals. Here's a brief comparison:

**LLaMA:**

1. Developed by Meta AI, LLaMA is a family of conversation-oriented language models.
2. Trained on a massive dataset of online conversations, with a focus on generating human-like responses.
3. Designed to be more conversational and engaging, with a emphasis on understanding context and nuances of human communication.
4. Models are smaller and more efficient than some other large language models, making them more suitable for deployment on lower-resource hardware.

**GPT-4:**

1. Developed by OpenAI, GPT-4 is a family of language models that builds upon the success of GPT-3.
2. Trained on a massive dataset of text from the internet, with a focus on generating coherent and context-aware text.
3. Designed to be highly ve

# Downloading Dataset

Dataset Card for the IMDB Dataset (https://huggingface.co/datasets/stanfordnlp/imdb):

*Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.*

In this example, we're using the **`datasets`** library to download and load the training and test sets of the dataset.

In [None]:
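# Raw string keeps the Windows backslashes literal.
# Note: HF_HOME is typically read when the Hugging Face libraries are imported,
# so set it before those imports for it to take effect.
os.environ['HF_HOME'] = r'D:\Research\models\hf'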

ds_builder = load_dataset_builder("imdb")
print(ds_builder.info.description)
print(ds_builder.info.features)

Shuffle both datasets and select 1000 samples randomly from the test dataset to speed up evaluation.

In [None]:
train_dataset = load_dataset('imdb', split='train')
test_dataset = load_dataset('imdb', split='test')
train_dataset = train_dataset.shuffle(seed=42)
# Keep 1000 shuffled test samples to speed up evaluation
test_dataset = test_dataset.shuffle(seed=42).select(range(1000))

In [None]:
train_labels = train_dataset['label']
item_counts = Counter(train_labels)
print(f'Train Dataset Label breakdown: {item_counts}')  

tst_labels = test_dataset['label']
item_counts = Counter(tst_labels)
print(f'Test Dataset Label breakdown: {item_counts}')  

# Data Preparation

Now we prepare the data for prompting the model. First, we define a prompt template with a single `{text}` placeholder for the review. Then we use the `map` method to apply this template to every example, and we add a `class` column holding the label as a string, so each example ends up with a templated `text` field and a `class` field.

In [None]:
template = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.

Sentence: {text}
Answer:"""

We also need to convert the labels from 0 and 1 to "negative" and "positive". We do this with the `map` method, which applies a function to each example in the dataset. The function takes the label as input and stores the corresponding string in the column `class`. The same cell also replaces the `<br />` HTML tags in the reviews with spaces before the template is applied.

In [None]:
POSITIVE_LABEL = "positive"
NEGATIVE_LABEL = "negative"

# HTML tags removal
train_dataset = train_dataset.map(lambda example: {'text': example['text'].replace("<br />", " ")})
test_dataset = test_dataset.map(lambda example: {'text': example['text'].replace("<br />", " ")})

train_dataset = train_dataset.map(lambda example: {"text": template.format(**example)})
train_dataset = train_dataset.map(lambda example: {'class': POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL})

test_dataset = test_dataset.map(lambda example: {"text": template.format(**example)})
test_dataset = test_dataset.map(lambda example: {"class": POSITIVE_LABEL if example["label"] == 1 else NEGATIVE_LABEL})

In [None]:
print(train_dataset[0])
print(test_dataset[0])
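
The three `map` steps above can be traced on a single hand-written record, without downloading the dataset. This is a minimal sketch: `prepare_example` is a hypothetical helper, and the sample review is made up rather than taken from IMDB.

```python
TEMPLATE = """Your task is to classify sentences' sentiment as 'positive' or 'negative'. Your answer should be one word, either 'positive' or 'negative'.

Sentence: {text}
Answer:"""


def prepare_example(example):
    """Apply the notebook's cleaning, templating, and labeling to one record."""
    text = example["text"].replace("<br />", " ")   # strip HTML line breaks
    return {
        "text": TEMPLATE.format(text=text),         # wrap the review in the prompt
        "label": example["label"],                  # keep the numeric label
        "class": "positive" if example["label"] == 1 else "negative",
    }


# Made-up record, not taken from the IMDB dataset.
sample = {"text": "Great cast.<br />Weak ending.", "label": 0}
prepared = prepare_example(sample)
print(prepared["text"])
print(prepared["class"])
```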

## Evaluation Function

In [None]:

def classify_sentence(model, tokenizer, sentence):
  encodeds = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
  model_inputs = encodeds.to(device)

  with torch.no_grad():
    outputs = model.generate(
        **model_inputs,
        max_new_tokens=15,
        bos_token_id=model.config.bos_token_id,
        eos_token_id=model.config.eos_token_id,
        pad_token_id=model.config.eos_token_id,
    )
    torch.cuda.empty_cache()

  return tokenizer.decode(outputs[0][len(model_inputs["input_ids"][0]):], skip_special_tokens=True)

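def check_sentiment(raw_answer, after_finetune):
  # The answer usually starts with ' Negative' or ' Positive', caps or not.
  # after_finetune is kept for interface compatibility; it is unused here.
  answer = raw_answer.strip().lower()
  for label in (POSITIVE_LABEL, NEGATIVE_LABEL):
    if answer.startswith(label):
      return label
  # Fall back to the first eight characters so the mismatch stays visible
  return answer[:8]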
  
def eval_model(model, tokenizer, after_finetune=False):
  predictions = []
  predictions_raw = []

  references = test_dataset["class"]

  text_test_dataset = test_dataset['text']

  for item in tqdm(text_test_dataset):
    predicted_raw = classify_sentence(model, tokenizer, item)
    predicted = check_sentiment(predicted_raw, after_finetune)
    predictions.append(predicted)
    predictions_raw.append(predicted_raw)
  
  correct = sum([1 for p, r in zip(predictions, references) if p.lower() == r.lower()])
  total = len(predictions)
  acc = correct/total
  
  print(f'Model Accuracy = {acc*100:.2f}%')

  return predictions_raw, predictions
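
The evaluation functions above expect a local `model` and `tokenizer`. Since this exercise targets the Groq API, the following is a hedged sketch of how the same loop might look with a chat client. The helper names `parse_sentiment`, `classify_with_groq`, and `eval_with_groq` are hypothetical, and the `temperature=0` and `max_tokens=5` settings are assumptions rather than values from this notebook. The client is passed in, so any object exposing `chat.completions.create` works.

```python
def parse_sentiment(raw_answer):
    """Map a free-form model answer to 'positive', 'negative', or 'unknown'."""
    # Drop whitespace, case, and surrounding quotes or periods
    cleaned = raw_answer.strip().lower().strip("'\".")
    for label in ("positive", "negative"):
        if cleaned.startswith(label):
            return label
    return "unknown"


def classify_with_groq(client, prompt, model="llama3-70b-8192"):
    """Send one templated prompt through a Groq-style chat client."""
    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        temperature=0,  # assumption: deterministic answers are preferred
        max_tokens=5,   # assumption: one word is enough for the label
    )
    return completion.choices[0].message.content


def eval_with_groq(client, prompts, references, model="llama3-70b-8192"):
    """Return accuracy plus the raw and parsed predictions."""
    raw = [classify_with_groq(client, p, model) for p in prompts]
    parsed = [parse_sentiment(r) for r in raw]
    correct = sum(p == r.lower() for p, r in zip(parsed, references))
    accuracy = correct / len(parsed) if parsed else 0.0
    return accuracy, raw, parsed
```

With the objects defined earlier, a call might look like `eval_with_groq(client, test_dataset['text'], test_dataset['class'])`. API rate limits and retries are not handled in this sketch.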