# Gemma-2B Model Training in Google Colab Pro


In this notebook we load a lightweight language model, Gemma-2B, in Google Colab Pro, train it on a custom dataset, and then save the trained model. The notebook is geared toward limited resources and uses models that can run efficiently on Google Colab.
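
To make "limited resources" concrete, here is a back-of-the-envelope sketch that estimates GPU memory for full fine-tuning with an Adam-style optimizer. The function name, the parameter count of roughly 2.5 billion for Gemma-2B, and the accounting (weights, gradients, and two optimizer buffers in the same precision, activations ignored) are illustrative assumptions, not measurements.

```python
def estimate_training_memory_gb(num_params: float, bytes_per_value: int = 4) -> float:
    """Rough lower bound: weights + gradients + two Adam moment buffers."""
    copies = 4  # weights, gradients, Adam first moment, Adam second moment
    return num_params * bytes_per_value * copies / 1024**3


# Assumed (approximate) parameter count for Gemma-2B.
GEMMA_2B_PARAMS = 2.5e9

fp32_gb = estimate_training_memory_gb(GEMMA_2B_PARAMS, bytes_per_value=4)
print(f"fp32 full fine-tuning needs at least ~{fp32_gb:.0f} GB")
```

Under these assumptions the estimate is about 37 GB, which is more than the 16 GB of a T4 GPU. This is why small batch sizes, quantization, or parameter-efficient methods are usually needed on Colab.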


In [1]:

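# Install the required libraries (accelerate is needed by the transformers Trainer)
!pip install transformers datasets accelerate bitsandbytes huggingface_hub
# ctransformers is used for working with GGUF models.
# Keep this comment on its own line: some shells pass a trailing "# ..." to pip,
# which then fails with "Invalid requirement: '#'" (see the recorded output).
!pip install ctransformers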


Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests (from transformers)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm>=4.27 (from transformers)
  Using cached tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-win_amd64.whl.metadata (13 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.24.6-py3-none-any.whl.metadata (13 kB)
INFO: pip is looking at multiple versions of tokenizers to determine which version is compatible with other requirements. This could take a while.
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.0-cp311-none-win_amd64.whl.metadata (6.8 kB)
Collecting d

ERROR: Invalid requirement: '#': Expected package name at the start of dependency specifier
    #
    ^


## Loading the Model from Hugging Face

In [2]:

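from huggingface_hub import login, snapshot_download

# google/gemma-2b is a gated repository: accept the license on the model page,
# then authenticate with a Hugging Face access token. Without this step the
# download fails with GatedRepoError (401), as in the recorded output.
login()

# Download the Gemma-2B model
model_id = "google/gemma-2b"
snapshot_download(repo_id=model_id, local_dir="gemma-2b")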


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

Downloading README.md:   0%|          | 0.00/21.5k [00:00<?, ?B/s]

GatedRepoError: 401 Client Error. (Request ID: Root=1-66d06ab8-07588401657b07a3538d9179;e8c91b55-8ae2-4d6b-8108-2ca0c00a97a3)

Cannot access gated repo for url https://huggingface.co/google/gemma-2b/resolve/68e273d91b1d6ea57c9e6024c4f887832f7b43fa/.gitattributes.
Access to model google/gemma-2b is restricted. You must be authenticated to access it.

## Converting the Model to GGUF Format

In [None]:

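# Clone the llama.cpp repository and install the converter's dependencies
!git clone https://github.com/ggerganov/llama.cpp.git
!pip install -r llama.cpp/requirements.txt

# Convert the model to GGUF. Recent llama.cpp checkouts ship the Hugging Face
# converter as convert_hf_to_gguf.py; its --outtype accepts types such as f16 or
# q8_0, not q4_K_M. K-quants like q4_K_M need a separate pass with the
# llama-quantize tool, which has to be built first.
!python llama.cpp/convert_hf_to_gguf.py gemma-2b --outfile gemma-2b.gguf --outtype f16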
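
After the conversion it can be useful to check that the output really is a GGUF file. A GGUF file starts with the four magic bytes `GGUF`, followed by the format version as a little-endian unsigned 32-bit integer. The helper below is a minimal sketch of such a check; the function name `read_gguf_version` is hypothetical and not part of llama.cpp.

```python
import struct
from pathlib import Path


def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise ValueError if the magic bytes are wrong."""
    with Path(path).open("rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Bytes 4..8 hold the format version as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

Calling `read_gguf_version("gemma-2b.gguf")` once the conversion cell has finished should return a small integer; a `ValueError` suggests the file was not written correctly.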


## Preparing the Dataset and Training the Model

In [None]:

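from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")

# Load the dataset and tokenize the "quote" column.
# max_length=128 is an assumed cap to keep memory use low; adjust as needed.
dataset = load_dataset("Abirate/english_quotes")
dataset = dataset.map(
    lambda samples: tokenizer(samples["quote"], truncation=True, max_length=128),
    batched=True,
)

# Pad each batch and derive labels from input_ids; without labels the
# Trainer cannot compute a causal language modeling loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    logging_dir="./logs",
)

# Configure the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    data_collator=data_collator,
)

# Start training
trainer.train()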
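
For a causal language model, the training labels are the input token IDs themselves, with padding positions replaced by -100 so that the loss ignores them. The pure-Python sketch below illustrates that batching step on made-up token IDs. The helper name `pad_and_label` and the pad ID of 0 are hypothetical; in practice a collator such as `DataCollatorForLanguageModeling(tokenizer, mlm=False)` from transformers does this on tensors.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def pad_and_label(sequences, pad_id=0):
    """Right-pad token ID lists to equal length and build matching labels."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(list(seq) + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs; padded positions are excluded from the loss.
        batch["labels"].append(list(seq) + [IGNORE_INDEX] * n_pad)
    return batch


batch = pad_and_label([[5, 6, 7], [8, 9]])
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```

The shift by one position (predicting token t+1 from tokens up to t) is not shown here because Hugging Face causal LM models apply it internally when given `labels`.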


## Saving the Model

In [None]:

from google.colab import drive
drive.mount('/content/drive')

# Save the trained model to Google Drive
model.save_pretrained("/content/drive/MyDrive/trained_model")
tokenizer.save_pretrained("/content/drive/MyDrive/trained_model")


## Uploading the Model to Hugging Face

In [None]:

from huggingface_hub import HfApi

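# Upload the model to Hugging Face (requires a token with write access).
# Note: gemma-2b.gguf was converted from the base checkpoint before training.
api = HfApi()
# Create the target repository if it does not exist yet.
api.create_repo(repo_id="your-username/gemma-2b", exist_ok=True)
api.upload_file(
    path_or_fileobj="gemma-2b.gguf",
    path_in_repo="gemma-2b.gguf",
    repo_id="your-username/gemma-2b"
)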
