**Step 1: Install Necessary Packages**

In [6]:
!pip install transformers[sentencepiece]
!pip install datasets


**Step 2: Select a Pretrained Model in Hugging Face**

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # Assuming 3 classes (rock, paper, scissors)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
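
The warning above is expected: the checkpoint ships without a classification head, so one is created from scratch for the three classes. As a rough sketch of how small that new head is, the arithmetic below assumes a hidden size of 768 for `bert-base-uncased` and a single linear layer as the classifier; both are assumptions stated here, not values read from this notebook's output.

```python
# Assumption: bert-base-uncased uses a hidden size of 768.
hidden_size = 768
# From the notebook: num_labels=3 (rock, paper, scissors).
num_labels = 3

# A linear classification layer stores one weight row per label plus one bias per label.
classifier_weight_shape = (num_labels, hidden_size)
classifier_bias_shape = (num_labels,)
new_parameters = num_labels * hidden_size + num_labels

print(classifier_weight_shape, classifier_bias_shape, new_parameters)
```

Only these newly created parameters start from random values; the rest of the network is loaded from the pretrained checkpoint.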


**Step 3: Modify Code to Use Rock-Paper-Scissors GitHub for Training**

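BERT is a text model, so it cannot consume the Rock-Paper-Scissors images directly. To keep the example simple, this notebook reduces each image to its class name ("rock", "paper", or "scissors") and fine-tunes the Hugging Face model on those strings. Note that the pixel data is never used: the input text is the label itself, so the resulting scores show that the training pipeline runs, not that the model can recognize hand gestures. Classifying the images themselves would require an image model or a real image-to-text step.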
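
One consequence of reducing each image to its class name is that the text given to the model already contains the answer. The sketch below uses plain Python and hypothetical file paths (they are illustrative, not taken from the dataset) to show that a dictionary lookup classifies such examples perfectly without reading a single pixel.

```python
# Hypothetical records for illustration; real paths come from the extracted dataset.
examples = [
    {"image": "rps/rock/rock01.png", "label": "rock"},
    {"image": "rps/paper/paper01.png", "label": "paper"},
    {"image": "rps/scissors/scissors01.png", "label": "scissors"},
    {"image": "rps/rock/rock02.png", "label": "rock"},
]
label_mapping = {"rock": 0, "paper": 1, "scissors": 2}


def to_text_example(example):
    """Build the (input text, target id) pair a text classifier would receive."""
    return example["label"], label_mapping[example["label"]]


pairs = [to_text_example(e) for e in examples]

# A lookup table stands in for the classifier: it maps each input text to its target.
lookup = dict(pairs)
accuracy = sum(lookup[text] == target for text, target in pairs) / len(pairs)

print(pairs)
print(accuracy)
```

Because the input and the target carry the same information, a very low loss on this setup is expected and says nothing about image recognition.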

**Step 4: Download the Rock-Paper-Scissors Dataset**

In [3]:
!mkdir -p tmp
!wget --no-check-certificate \
    https://storage.googleapis.com/learning-datasets/rps.zip \
    -O ./tmp/rps.zip

!wget --no-check-certificate \
    https://storage.googleapis.com/learning-datasets/rps-test-set.zip \
    -O ./tmp/rps-test-set.zip

# Unzip the datasets
import zipfile
import os

os.makedirs('./data/train', exist_ok=True)
os.makedirs('./data/test', exist_ok=True)

with zipfile.ZipFile('./tmp/rps.zip', 'r') as zip_ref:
    zip_ref.extractall('./data/train')

with zipfile.ZipFile('./tmp/rps-test-set.zip', 'r') as zip_ref:
    zip_ref.extractall('./data/test')


--2024-12-24 13:20:00--  https://storage.googleapis.com/learning-datasets/rps.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.127.207, 172.217.218.207, 142.251.31.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.127.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 200682221 (191M) [application/zip]
Saving to: ‘./tmp/rps.zip’


2024-12-24 13:20:07 (34.2 MB/s) - ‘./tmp/rps.zip’ saved [200682221/200682221]

--2024-12-24 13:20:07--  https://storage.googleapis.com/learning-datasets/rps-test-set.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 108.177.127.207, 172.217.218.207, 142.251.31.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|108.177.127.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29516758 (28M) [application/zip]
Saving to: ‘./tmp/rps-test-set.zip’


2024-12-24 13:20:09 (19.6 MB/s) - ‘./tmp/rps-test-set.zip’ saved [29516758/2951
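
The later cells read from `./data/train/rps` and `./data/test/rps-test-set`, which assumes each archive unpacks into a single top-level folder with one subfolder per class. A minimal way to check that layout, using only the standard library, might look like the sketch below; `folders_at_depth` is a hypothetical helper name introduced here.

```python
import os
import zipfile


def folders_at_depth(zip_path, depth):
    """Return sorted folder prefixes of the given depth that contain deeper entries."""
    found = set()
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            parts = [part for part in name.split("/") if part]
            # Only count a prefix as a folder if something lives below it.
            if len(parts) > depth:
                found.add("/".join(parts[:depth]))
    return sorted(found)


# Run the check only if the archive from the download step is present.
if os.path.exists("./tmp/rps.zip"):
    print(folders_at_depth("./tmp/rps.zip", 1))
    print(folders_at_depth("./tmp/rps.zip", 2))
```

If the printed folders differ from the paths used in the following cells, adjust those paths rather than the archive.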

**Training Process**

In [4]:
import os

def preprocess_images(image_dir):
    """Collect {'image': path, 'label': class_name} records, one folder per class."""
    data = []
    for label in ['rock', 'paper', 'scissors']:
        path = os.path.join(image_dir, label)
        # Sort the listing so the record order is reproducible across runs
        for img_file in sorted(os.listdir(path)):
            # Keep image files only; skip anything else in the folder
            if img_file.endswith(('.png', '.jpg')):
                data.append({'image': os.path.join(path, img_file), 'label': label})
    return data

train_data = preprocess_images('./data/train/rps')
test_data = preprocess_images('./data/test/rps-test-set')
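
Before training, it can help to confirm that every class actually produced records. The sketch below counts records per label with `collections.Counter`; `count_by_label` is a hypothetical helper, and the sample records are made up so the block runs on its own.

```python
from collections import Counter


def count_by_label(records):
    """Count how many records carry each label."""
    return Counter(record["label"] for record in records)


# Made-up records in the same shape as the output of preprocess_images.
sample_records = [
    {"image": "rock/a.png", "label": "rock"},
    {"image": "rock/b.png", "label": "rock"},
    {"image": "paper/a.png", "label": "paper"},
    {"image": "scissors/a.png", "label": "scissors"},
]

counts = count_by_label(sample_records)
print(counts)
```

In the notebook, the same call would be `count_by_label(train_data)` and `count_by_label(test_data)`; a missing or nearly empty class usually points to a wrong directory path.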


**Fine-Tune the Hugging Face Model**
Convert the file list into a Hugging Face `Dataset`, map the class names to integer labels, tokenize the class names as the input text, and fine-tune the model.

In [9]:
from transformers import Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset
import os

# Define the label mapping
label_mapping = {"rock": 0, "paper": 1, "scissors": 2}

# Function to preprocess data
def preprocess_images(image_dir):
    data = []
    for label in ['rock', 'paper', 'scissors']:
        path = os.path.join(image_dir, label)
        for img_file in os.listdir(path):
            img_path = os.path.join(path, img_file)
            if img_file.endswith('.png') or img_file.endswith('.jpg'):
                data.append({'image': img_path, 'label': label})
    return data

# Load and preprocess datasets
train_data = preprocess_images('./data/train/rps')
test_data = preprocess_images('./data/test/rps-test-set')

# Convert datasets into Hugging Face Dataset format
train_dataset = Dataset.from_list(train_data)
test_dataset = Dataset.from_list(test_data)

# Map string labels to integers
def preprocess_data(example):
    example["label"] = label_mapping[example["label"]]  # Convert to numeric label
    return example

train_dataset = train_dataset.map(preprocess_data)
test_dataset = test_dataset.map(preprocess_data)

# Load the Hugging Face model and tokenizer
model_name = "bert-base-uncased"  # Use any Hugging Face model for classification
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

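# Inverse of label_mapping: integer id -> class name
id_to_label = {idx: name for name, idx in label_mapping.items()}

def tokenize_function(examples):
    # Convert numeric labels back to text labels; this text is the model's only input
    text_labels = [id_to_label[label] for label in examples["label"]]
    # Tokenize the text labels (padding="max_length" pads each one to the model's maximum length)
    return tokenizer(text_labels, padding="max_length", truncation=True)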


train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Remove unnecessary columns
train_dataset = train_dataset.remove_columns(["image"])
test_dataset = test_dataset.remove_columns(["image"])

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",             # Directory to save model checkpoints
    evaluation_strategy="epoch",       # Evaluate at each epoch (newer transformers releases name this eval_strategy)
    learning_rate=2e-5,                # Learning rate
    per_device_train_batch_size=8,     # Batch size per device
    num_train_epochs=3,                # Number of epochs
    weight_decay=0.01,                 # Weight decay for regularization
    logging_dir="./logs",              # Directory for logs
    save_strategy="epoch"              # Save model at each epoch
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Train the model
trainer.train()


Map:   0%|          | 0/2520 [00:00<?, ? examples/s]

Map:   0%|          | 0/372 [00:00<?, ? examples/s]



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | No log        | 0.000558        |
| 2     | 0.038900      | 0.000268        |
| 3     | 0.038900      | 0.000219        |
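
The results above report loss only. If an accuracy figure is wanted as well, one option is to pass a metrics function to the `Trainer` through its `compute_metrics` argument. The sketch below is written in plain Python so it runs without extra packages; `accuracy_from_logits` is a hypothetical helper, and the sketch assumes the evaluation object unpacks into a `(logits, labels)` pair.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class index equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)


def compute_metrics(eval_pred):
    # Assumption: eval_pred unpacks into (logits, labels), one row of logits per example.
    logits, labels = eval_pred
    rows = [list(row) for row in logits]
    return {"accuracy": accuracy_from_logits(rows, list(labels))}


# Tiny hand-made example: three classes, two examples, both predicted correctly.
demo = compute_metrics(([[0.1, 2.0, 0.3], [3.0, 0.0, 1.0]], [1, 0]))
print(demo)
```

With this notebook's inputs the accuracy would be expected to be near perfect, because the input text is the class name itself.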


TrainOutput(global_step=945, training_loss=0.02077899726610335, metrics={'train_runtime': 800.4358, 'train_samples_per_second': 9.445, 'train_steps_per_second': 1.181, 'total_flos': 1989137438023680.0, 'train_loss': 0.02077899726610335, 'epoch': 3.0})
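
The figures in `TrainOutput` can be reproduced from the settings used above. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps (the notebook does not override it); under those assumptions it recovers the step count and throughput, and suggests why epoch 1 shows "No log".

```python
import math

num_train_examples = 2520  # size of the training split, from the "Map" output
batch_size = 8             # per_device_train_batch_size
num_epochs = 3             # num_train_epochs
train_runtime = 800.4358   # seconds, from TrainOutput

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

# Assumption: training loss is logged every 500 steps (the Trainer default).
logging_steps = 500
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
first_logged_epoch = math.ceil(logged_at[0] / steps_per_epoch)

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(logged_at, first_logged_epoch)
```

Under these assumptions the only training-loss log falls inside epoch 2, which is consistent with the same value (0.038900) appearing for epochs 2 and 3 and nothing appearing for epoch 1.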