In [1]:
!pip install torch torchvision torchaudio transformers datasets

[notice] A new release of pip is available: 24.0 -> 24.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
from transformers import pipeline, GPTNeoForCausalLM, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import re


In [3]:
model_name = 'EleutherAI/gpt-neo-2.7B'  
model = GPTNeoForCausalLM.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
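Before loading a checkpoint of this size, it can help to estimate how much memory it needs. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes the "2.7B" in the checkpoint name means exactly 2.7e9 parameters (the true count differs slightly), that weights load in fp32 unless a lower precision is requested, and that full fine-tuning with an Adam-style optimizer costs roughly 16 bytes per parameter before activations. The names `N_PARAMS` and `gib` are made up for this illustration.

```python
# Rough memory estimate for a 2.7B-parameter model (assumed parameter count).
N_PARAMS = 2.7e9


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 1024**3


# fp32 weights: 4 bytes per parameter
weights_fp32 = N_PARAMS * 4
# fp16/bf16 weights: 2 bytes per parameter
weights_fp16 = N_PARAMS * 2
# Rule of thumb for full fp32 fine-tuning with Adam:
# 4 (weights) + 4 (gradients) + 8 (two optimizer moment buffers) = 16 bytes/param,
# not counting activations.
train_fp32_adam = N_PARAMS * 16

print(f"fp32 weights:            {gib(weights_fp32):.1f} GiB")
print(f"fp16 weights:            {gib(weights_fp16):.1f} GiB")
print(f"fp32 training with Adam: {gib(train_fp32_adam):.1f} GiB (plus activations)")
```

Under these assumptions, inference in fp32 needs roughly 10 GiB for the weights alone, and full fine-tuning needs several times that, which is worth checking against the available hardware before running the training cells.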


In [4]:
# Optional: load WikiText-103 (raw) for fine-tuning; generation works without it
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")

Downloading readme: 100%|█████████████████████████████████████████████████████████| 10.5k/10.5k [00:00<00:00, 1.37MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████| 733k/733k [00:00<00:00, 1.17MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████| 157M/157M [00:04<00:00, 34.6MB/s]
Downloading data: 100%|█████████████████████████████████████████████████████████████| 157M/157M [00:04<00:00, 35.8MB/s]
Downloading data: 100%|██████████████████████████████████████████████████████████████| 657k/657k [00:00<00:00, 863kB/s]
Generating test split: 100%|█████████████████████████████████████████████| 4358/4358 [00:00<00:00, 36651.38 examples/s]
Generating train split: 100%|█████████████████████████████████████| 1801350/1801350 [00:03<00:00, 527326.31 examples/s]
Generating validation split: 100%|██████████████████████████████████████| 3760/3760 [00:00<00:00, 367955.74 examples/s]


In [5]:
# The GPT-Neo tokenizer has no pad token, so padding="max_length" raises
# "ValueError: Asking to pad but the tokenizer does not have a padding token".
# Reuse EOS as the pad token, as that error message suggests.
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

In [6]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)
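To make the `padding="max_length"` and `truncation=True` settings concrete, here is a minimal pure-Python sketch of what they do to a single sequence. It does not call the real tokenizer. The helper `pad_or_truncate` is hypothetical, and the pad id 50256 is an assumption taken from the EOS id that the generation log reports later in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic padding="max_length" with truncation=True for one sequence.

    Returns (input_ids, attention_mask), both exactly max_length long.
    """
    ids = list(token_ids[:max_length])   # truncation: drop tokens past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # right-padding with the pad id
    mask += [0] * n_pad                  # 0 marks padding
    return ids, mask


PAD_ID = 50256  # assumption: the EOS id doubles as the pad id

short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(8)), max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short sequence is filled up to five positions and its mask marks the filler as ignorable; the long sequence is cut to five positions and needs no padding. A pad id must exist for the first case, which is why a tokenizer without a pad token cannot pad.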

In [8]:
!pip install transformers[torch]

Collecting accelerate>=0.21.0 (from transformers[torch])
  Downloading accelerate-0.32.1-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.32.1-py3-none-any.whl (314 kB)

In [10]:
%pip install -U accelerate

Note: you may need to restart the kernel to use updated packages.




In [None]:
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
)
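A quick calculation shows what these arguments imply for the length of the run. This is a sketch under stated assumptions: a single device, no gradient accumulation, and the train split size of 1,801,350 examples reported when the dataset was generated. The constant names are made up for this illustration.

```python
import math

TRAIN_EXAMPLES = 1_801_350  # train split size from the dataset loading output
BATCH_SIZE = 2              # per_device_train_batch_size
EPOCHS = 3                  # num_train_epochs
SAVE_STEPS = 10_000         # save_steps
SAVE_TOTAL_LIMIT = 2        # save_total_limit

# Assumptions: one device and gradient_accumulation_steps=1,
# so one optimizer step consumes exactly BATCH_SIZE examples.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
checkpoints_written = total_steps // SAVE_STEPS
checkpoints_kept = min(checkpoints_written, SAVE_TOTAL_LIMIT)

print(f"steps per epoch:     {steps_per_epoch:,}")
print(f"total steps:         {total_steps:,}")
print(f"checkpoints written: {checkpoints_written}")
print(f"checkpoints kept:    {checkpoints_kept}")
```

Under these assumptions the run takes 900,675 steps per epoch and about 2.7 million steps in total, so training on a subset of the data or for fewer epochs may be more practical.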

In [None]:
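# Causal LM training needs a `labels` field; with mlm=False this collator
# copies input_ids into labels and masks padding out of the loss.
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)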
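Fine-tuning a causal language model needs a `labels` field alongside `input_ids`, otherwise there is nothing to compute a loss against. The sketch below shows the usual construction in plain Python: labels are a copy of the input ids, with padded positions set to -100 so the loss skips them. The helper `make_causal_lm_labels` is hypothetical, and the shift by one position (predicting token t+1 from tokens up to t) is assumed to happen inside the model, as it does in the Hugging Face causal LM classes as far as I know.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def make_causal_lm_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Toy example: three real tokens followed by two padding positions.
input_ids = [10, 11, 12, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0]

labels = make_causal_lm_labels(input_ids, attention_mask)
print(labels)
```

Only the three real tokens contribute to the loss here; the two padded positions are ignored.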

In [None]:
trainer.train()

In [15]:
# creating the generation pipeline 
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

In [16]:
prompt = "How cool is Machine Learning?"

In [20]:
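res = generator(
    prompt,
    max_length=150,          # total length in tokens, prompt included
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # lower values give more focused output
    top_k=40,                # sample only from the 40 most likely tokens
    top_p=0.9,               # nucleus sampling: keep the top 90% of probability mass
    repetition_penalty=1.2,  # values above 1 discourage repeated tokens
)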

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
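To see how `temperature`, `top_k`, and `top_p` interact, here is a simplified pure-Python sketch applied to a toy set of five logits. It is not the library's implementation: it assumes the filters run in the order temperature, then top-k, then top-p, and it leaves out `repetition_penalty` entirely. The function `filtered_distribution` and the toy logits are made up for this illustration.

```python
import math


def filtered_distribution(logits, temperature=0.7, top_k=40, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability

    # Top-k: keep only the k highest-weight tokens, then renormalize.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    probs = {i: weights[i] / total for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; sampling would draw from this.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0]
dist = filtered_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)
sharp = filtered_distribution(toy_logits, temperature=0.3, top_k=3, top_p=0.9)

print(dist)   # two candidate tokens survive
print(sharp)  # at the lower temperature only the top token survives
```

Lowering the temperature concentrates probability on the top token, so fewer candidates survive the top-p cut. That is the sense in which a lower temperature gives more focused output.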


In [21]:
generated_text = res[0]['generated_text']
cleaned_text = generated_text.strip().replace('\n', ' ')

print(cleaned_text)

How cool is Machine Learning?  Machine learning is a big deal. It’s already changing the way we do business and make decisions, and it’s going to continue to grow exponentially. But just how cool is it? Let’s find out.  I was recently asked to give a talk on machine learning at an industry conference, and I had to think long and hard about how to approach the subject. I’ve always been fascinated by it, and I’ve written a lot about it, but I’ve never seen it as a subject of high interest to a large audience. So I thought I’d take a few minutes to explain what machine learning is, why it


In [22]:
# Optional post-processing: collapse repeated spaces and trim both ends
def clean_text(text):
    text = re.sub(r' +', ' ', text)  # runs of spaces -> single space
    return text.strip()

cleaned_text = clean_text(cleaned_text)
print(cleaned_text)

How cool is Machine Learning? Machine learning is a big deal. It’s already changing the way we do business and make decisions, and it’s going to continue to grow exponentially. But just how cool is it? Let’s find out. I was recently asked to give a talk on machine learning at an industry conference, and I had to think long and hard about how to approach the subject. I’ve always been fascinated by it, and I’ve written a lot about it, but I’ve never seen it as a subject of high interest to a large audience. So I thought I’d take a few minutes to explain what machine learning is, why it
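Because generation stops at a fixed token budget, the text above ends mid-sentence. One optional extra clean-up step is to cut the output back to the last complete sentence. The sketch below is one simple way to do that, assuming sentences end in `.`, `!`, or `?` followed by whitespace or the end of the string. The helper `trim_to_last_sentence` is hypothetical, and abbreviations such as "e.g." would confuse it.

```python
import re


def trim_to_last_sentence(text):
    """Drop a trailing fragment, keeping text up to the last '.', '!' or '?'."""
    # A sentence end is punctuation followed by whitespace or end of string.
    matches = list(re.finditer(r'[.!?](?=\s|$)', text))
    if not matches:
        return text  # no sentence boundary found; leave the text unchanged
    return text[:matches[-1].end()]


sample = "Let's find out. So I thought I'd explain what machine learning is, why it"
print(trim_to_last_sentence(sample))
```

In the notebook this could be applied as `trim_to_last_sentence(cleaned_text)` after `clean_text`.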
