## Assignment 1: Build a Toy Llama-2 Language Model

> CISC7021 Applied Natural Language Processing (2024/2025)

In this assignment, we will prepare a toy language model that employs the **Llama-2** architecture and evaluate its perplexity on the provided data sets.

We will learn how to perform continual pre-training of a base language model using the PyTorch and Hugging Face libraries. Detailed instructions for building this language model can be found in the attached notebook file.

Acknowledgement: The base model checkpoint was converted from the [llama2.c](https://github.com/karpathy/llama2.c) project. The data instances were sampled from the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset.

---

🚨 Please note that running this on CPU may be slow. If running on Google Colab or Kaggle, you can avoid this by going to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**. This should be included within the free tier of Colab.

---

We start by doing a `pip install` of all required libraries.
- 🤗 `transformers`, `datasets`, and `accelerate` are Hugging Face libraries.
- By default, Colab has the `transformers` and `pytorch` libraries installed. If you are using a local machine, please install them via `pip` or `conda`.

In [None]:
#!pip install torch torchvision torchaudio
#!pip install transformers

In [1]:
!pip install datasets accelerate -q

### (Optional) Uploading the model/data to Google Colab or Kaggle.

Please upload your dataset and model to the platform if you are using a Colab or Kaggle environment.

For Colab users, you can mount your Google Drive files by running the following code snippet:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install pyarrow==15.0.2

Collecting pyarrow==15.0.2
  Downloading pyarrow-15.0.2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Downloading pyarrow-15.0.2-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.3/38.3 MB 35.4 MB/s eta 0:00:00
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 16.1.0
    Uninstalling pyarrow-16.1.0:
      Successfully uninstalled pyarrow-16.1.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 24.8.2 requires cubinlinker, which is not installed.
cudf 24.8.2 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 24.8.2 requires ptxcompiler, which is not installed.
cuml 24.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 24.8.2 requires cupy-cuda11

### Necessary Packages, Environment Setups

In [1]:
import torch
import transformers

from typing import List, Optional, Tuple, Union
from transformers import LlamaForCausalLM, LlamaTokenizer, AutoTokenizer
from transformers import Trainer, TrainingArguments
from itertools import chain
from datasets import load_dataset

from tqdm.notebook import tqdm
from torch.nn import CrossEntropyLoss

Please set the correct file path based on your environment.

- If you are using Colab, the path may be: `/content/drive/MyDrive/xxxxxx`
- If you are using Kaggle, the path may be: `/kaggle/input/xxxxxx`

In [2]:
# Please set the correct file path based on your environment.
TRAIN_FILE = '/kaggle/input/assignment1/assignment1/data/zh_train.jsonl'
VALIDATION_FILE = '/kaggle/input/assignment1/assignment1/data/zh_dev.jsonl'
TEST_FILE = '/kaggle/input/assignment1/assignment1/data/zh_test.jsonl'
EN_TEST_FILE = '/kaggle/input/assignment1/assignment1/data/en_test.jsonl'
MODEL_FOLDER = "/kaggle/input/assignment1/assignment1/llama-42m"

Load the model checkpoint onto either a GPU or a CPU (training will be slow on a CPU, but decoding speed will be acceptable).

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device type: {device}")

model_path = MODEL_FOLDER
# Load model from local files
model = LlamaForCausalLM.from_pretrained(model_path).to(device)
# Load tokenizer from local files
tokenizer = AutoTokenizer.from_pretrained(model_path, device=device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


As we can see from the statistics, this model is much smaller than Llama-2 but shares the same decoder-only architecture.


😄 **You do not need to check complex details!** We just present the architecture and number of parameters here.

In [5]:
total_para = sum(v.numel() for k, v in model.state_dict().items() if k != 'model.embed_tokens.weight') / 1e6
print(model)
print(f"#Parameters: {total_para:.2f}M")

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 512)
    (layers): ModuleList(
      (0-7): 8 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=512, out_features=512, bias=False)
          (k_proj): Linear(in_features=512, out_features=512, bias=False)
          (v_proj): Linear(in_features=512, out_features=512, bias=False)
          (o_proj): Linear(in_features=512, out_features=512, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=512, out_features=1376, bias=False)
          (up_proj): Linear(in_features=512, out_features=1376, bias=False)
          (down_proj): Linear(in_features=1376, out_features=512, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((512,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((512,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((
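As a rough cross-check of the model size, the sketch below recomputes the number of weights by hand from the layer shapes printed above. The helper name `count_llama_params` is hypothetical, and the sketch assumes an untied `lm_head` of shape `vocab_size x hidden_size`, which the printout above does not show.

```python
def count_llama_params(vocab_size=32000, hidden=512, intermediate=1376, n_layers=8):
    """Count the weights of a Llama-style decoder from its layer shapes (no biases)."""
    attention = 4 * hidden * hidden          # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * hidden * intermediate          # gate_proj, up_proj, down_proj
    norms = 2 * hidden                       # input and post-attention RMSNorm
    per_layer = attention + mlp + norms
    decoder = n_layers * per_layer + hidden  # plus the final RMSNorm
    embedding = vocab_size * hidden          # embed_tokens
    lm_head = vocab_size * hidden            # assumption: the output head is not tied
    return {"decoder": decoder, "embedding": embedding, "lm_head": lm_head}


counts = count_llama_params()
non_embedding = counts["decoder"] + counts["lm_head"]
print(f"Decoder blocks: {counts['decoder'] / 1e6:.2f}M")
print(f"Excluding embed_tokens: {non_embedding / 1e6:.2f}M")
```

Most of the weights outside the embedding table sit in the MLP projections and the output head, which is typical for a model this small with a 32,000-token vocabulary.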

### Task 1: Decoding


If you are familiar with the usage of the `model.generate()` function in the `transformers` library, please feel free to jump to [Task 1 Playground](#scrollTo=Task_1_Playground).


#### 💡Tutorials: model.generate() function.
---
Minimal example:

```python
prompt = "Once upon a time, " # Input, prefix of generation
```

**Step 1**: Encode raw text using tokenizer model.
```python
tokenized_input = tokenizer.encode(prompt, return_tensors='pt').to(device)
```

**Step 2**: Set decoding hyper-parameters. Get the model output.
```python
output_ids = model.generate(tokenized_input, do_sample=True, max_new_tokens=300, temperature=0.6)
```
Important parameters:
- `max_new_tokens`: The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
- `temperature`: The value used to modulate the next-token probabilities. A higher temperature generates more diverse text; a lower temperature generates more deterministic text.
- `do_sample`: `do_sample=False` uses the greedy decoding strategy. To enable greedy decoding, we also need to set the other sampling parameters, `top_p` and `temperature`, to `None`.
- [If you are interested in other decoding algorithms, please refer to this link for setting parameters.](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/text_generation#transformers.GenerationConfig)
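
To make the effect of `temperature` concrete, here is a minimal sketch in plain Python of how dividing the scores by the temperature reshapes the next-token distribution. The function name `softmax_with_temperature` and the toy scores are illustrative assumptions, not part of the `transformers` API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass moves to the highest-scoring token, so sampling behaves nearly like greedy decoding; at a high temperature the distribution flattens and less likely tokens are drawn more often.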

**Step 3**: Convert model outputs into raw text.
```python
output_text = tokenizer.decode(output_ids[0])
```
or (when input instances >=1)
```python
output_text = tokenizer.batch_decode(output_ids)
```
Important parameters:
- Setting `skip_special_tokens=True` will prevent special tokens, such as `<s>`, from appearing in the results.

---


To understand the outputs of each step, let us do a simple generation task step by step! (Note: the base model is only able to produce fluent story text).

In [28]:
prompt = "Once upon a time, Stella Lou had a dream." # Feel free to use other generation prefix

In [29]:
# Step 1: Encode raw text using the tokenizer. Run tokenization and convert strings into token ids in the vocabulary.
tokenized_input = tokenizer.encode(prompt, return_tensors='pt').to(device)
# See the tokenized results.
print(tokenized_input)

tensor([[    1,  9038,  2501,   263,   931, 29892,   624,  3547,  4562,   750,
           263, 12561, 29889]], device='cuda:0')


In [30]:
# Step 2: Set decoding hyperparameters.

# For greedy decoding
max_new_tokens = 300
do_sample = False  # `do_sample=False` means using the greedy decoding strategy. To enable greedy decoding, we also need to set `top_p` and `temperature` to `None`.
temperature = None

# call generation function model.generate()
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
    top_p=None,
)

# The decoded results are token ids.
print("=" * 20 + "Token IDs" + "=" * 20)
print(output_ids)

tensor([[    1,  9038,  2501,   263,   931, 29892,   624,  3547,  4562,   750,
           263, 12561, 29889,  2296,  5131,   304,   367,   263, 12456,   985,
         29889,  2296,  5131,   304, 19531,   263,  9560, 10714,   322,   263,
           528,  4901, 20844, 29889,  1205,  1183,   471,  2086,  2319,   322,
           278, 10714,   471,  2086,  4802, 29889,    13,  6716,  2462, 29892,
           624,  3547,  4446,   263,  4802, 29892,   528,  4901, 10714,   297,
           263,  3787, 29889,  2296,  4433,   902, 16823,   565,  1183,  1033,
           505,   372, 29889,  2439, 16823,  1497,  4874,   322, 18093,   372,
           363,   902, 29889,    13,   855,  3547,   471,   577,  9796, 29889,
          2296,  1925,   373,   278, 10714,   322,  3252,   381,   839,  2820,
         29889,  2296,  7091,   763,   263,  1855, 12456,   985, 29889,    13,
          6246,   769, 29892,  1554,  8515,  9559, 29889,   624,  3547,  4687,
           304,  4459,   270,   466,  1537, 29889,  

In [31]:
# Step 3: Convert model outputs into raw text.
# decode token ids into tokens
print("=" * 20 + "Decoded Results" + "=" * 20)
# We only have one input instance. So we directly decode the first item of model output, i.e., `output_ids[0]`.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Once upon a time, Stella Lou had a dream. She wanted to be a princess. She wanted to wear a beautiful dress and a shiny crown. But she was too small and the dress was too big.
One day, Stella saw a big, shiny dress in a store. She asked her mom if she could have it. Her mom said yes and bought it for her.
Stella was so happy. She put on the dress and twirled around. She felt like a real princess.
But then, something strange happened. Stella started to feel dizzy. She couldn't stand up straight. She felt like she was spinning around and around.
Stella's mom saw her and said, "Stella, you need to take a break. You look dizzy."
Stella took off the dress and lay down on the floor. She closed her eyes and took a deep breath. After a few minutes, she felt better.
Stella smiled and said, "Mom, I'm ready to be a princess again!"


#### Another pipeline example: Sampling decoding with temperature.

In [32]:
prompt = "Once upon a time, Stella Lou had a dream."

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
# The value of temperature used to modulate the next token probabilities.
# Higher temperature -> generate more diverse text. Lower temperature -> generate more deterministic text.
temperature = 0.3

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)


<s> Once upon a time, Stella Lou had a dream. She wanted to be a princess and live in a castle. She was so excited to see what the world had to offer.
One day, Stella went to the park and saw a beautiful princess. She was so happy and ran up to her. The princess said, "Hello Stella, I am the princess of this land. Would you like to be my friend?"
Stella was so excited and said, "Yes, I would love to be your friend!"
The princess smiled and said, "Let's go to my castle and have a tea party. We can have a tea party and eat yummy treats."
Stella was so happy and said, "Yes, let's do that!"
So the princess and Stella went to the castle and had a wonderful tea party. They laughed and talked and had a wonderful time. Stella was so happy to have a new friend.
The princess said, "You are so special Stella Lou. I am so glad you are my friend."
Stella Lou smiled and said, "I am so glad too. I am so happy to have you as my friend."
The princess smiled and said, "I am so happy to have you as my fr

#### Task 1 Playground

---

📚 Task 1: Please generate English stories using various prompts and decoding settings. Please feel free to explore any interesting phenomena, such as the impact of different prompts and the effects of various decoding algorithms and parameters. For example, quantify the text properties using linguistic-driven metrics like story length and Type-Token Ratio (TTR). In addition to objective metrics, you are encouraged to discuss your findings based on subjective case studies.
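
As one possible starting point for the objective metrics mentioned above, the sketch below computes story length and Type-Token Ratio for a piece of text. The helper name `story_stats` and the simple word-splitting rule are assumptions made for illustration; you may prefer a proper word tokenizer.

```python
import re


def story_stats(text):
    """Return word count, distinct word count, and Type-Token Ratio of a text."""
    # Assumption: lowercase alphabetic words (with apostrophes) count as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    return {"length": len(tokens), "types": len(types), "ttr": ttr}


sample = "Stella was so happy. She was so happy to be a princess."
print(story_stats(sample))
```

TTR tends to drop as texts get longer, so it is most informative when comparing stories of similar length, for example by truncating every story to the same number of words first.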

We provide two types of skeleton code: one that takes a single prompt as input and another that can process batched inputs and decoding. Please use the version that best fits your preferences and data types.

---

In [5]:
prompt = "Blair is a cute little girl lived in a happy family."

In [6]:
tokenized_input = tokenizer.encode(prompt, return_tensors='pt').to(device)
print(tokenized_input)

tensor([[    1, 10465,   381,   338,   263,   274,  1082,  2217,  7826, 10600,
           297,   263,  9796,  3942, 29889]], device='cuda:0')


In [8]:
# For greedy decoding
max_new_tokens = 300
do_sample = False  # `do_sample=False` means using the greedy decoding strategy. To enable greedy decoding, we also need to set `top_p` and `temperature` to `None`.
temperature = None

# call generation function model.generate()
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
    top_p=None,
)

# The decoded results are token ids.
print("=" * 20 + "Token IDs" + "=" * 20)
print(output_ids)

print("=" * 20 + "Decoded Results" + "=" * 20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

tensor([[    1, 10465,   381,   338,   263,   274,  1082,  2217,  7826, 10600,
           297,   263,  9796,  3942, 29889,  7569,  2462, 29892,  1183,   723,
           748,  5377,   322,  1708,   297,   278, 16423, 29889,  3118,  2462,
         29892,  1183,  4446,   263,  4802, 29892, 13328,   541,   357, 17652,
         29889,  2296,  5131,   304,  4380,   372, 29892,   577,  1183,  6350,
          1156,   372, 29889,    13,  1576,   541,   357, 17652,  9115, 29893,
          3448,   322,  5331,   902,   304,   263,  4802,  5447, 29889,   739,
           471,   577,  4802, 29892,   372,  6140,   304,   367, 25508,  1554,
         29889,  2296,  5148,  2820,   322,  4446,   263,  4802, 29892, 13328,
         28149, 29889,  2296,   471,   577, 24173,   322,  6350,   304,  5839,
           372, 29889,    13,  6246,   746,  1183, 23051,   372, 29892,   278,
         28149,  4687,   304,  4337, 29991,   739,   471,   263,   274,  1008,
         29886,   453,   279, 29991,   739,   471,  

In [9]:
max_new_tokens = 300
do_sample = True
temperature = 0.5
top_p = 0.95

# call generation function model.generate()
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
    top_p=top_p,
)

# The decoded results are token ids.
print("=" * 20 + "Token IDs" + "=" * 20)
print(output_ids)

print("=" * 20 + "Decoded Results" + "=" * 20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

tensor([[    1, 10465,   381,   338,   263,   274,  1082,  2217,  7826, 10600,
           297,   263,  9796,  3942, 29889,  2296,   471,  2337,  8743,   411,
           902,   304,   952,   322,  2734,  2820,   278,  3699, 29889,  3118,
          2462, 29892,  1183,   471,  8743,   411,   902,   304, 29891,  1559,
           746,  1183,  6091,   263, 22526, 11462, 29889,  2296,  5148,   701,
           322,  4446,   263,  4802, 29892,  4628,  9570,   297,   278, 14744,
         29889,    13, 29924,   290,  1357,  1497, 29892,   376,  4806,   817,
           304,   748,   304,   278, 11619, 29892, 22827,  3850,    13, 29933,
           433,   381,   471,   885,  1965, 29892,   541,  1183,  6363,   393,
           341,   290,  1357,   471,  1492, 29889,  2688,  3512,   304,   278,
         11619,   322,   540,  1497,   393,   278,  9570,   471,   263,  4319,
         14280, 29889,   940,  1497,   896,  4312,   304,  7952,  2768,  2745,
           278, 14280,   471,   975, 29889,    13, 2

In [33]:
# Skeleton Code: Single input (same as previous code blocks)

prompt = "Blair is a cute little girl lived in a happy family." # ⬅️ try to construct different prompts.

# ⬇️ Try to tune different decoding hyperparameters.
# You can also add more hyperparameters like `top_p`, `top_k`.
max_new_tokens = 500
do_sample = True
temperature = 0.5
top_p = 0.95

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
    top_p=top_p,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

<s> Blair is a cute little girl lived in a happy family. Every day, she would play with her toys and laugh with her family.
One day, her mommy asked her to help with the laundry. "Bishies are not available right now," said mommy.
Blair was sad and said, "I don't want to help. I want to play."
Mommy said, "I know you want to play, but we need to do the laundry first. It's important to help our family."
Blair thought for a moment and then said, "Okay, I'll help." She smiled and ran to get the laundry.
Mommy was very happy and said, "Thank you, honey. You are such a good helper."
Blair smiled and said, "I'm glad I could help. Now, let's do the laundry together."<s>
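
The `top_p` setting used above restricts sampling to the most likely tokens. The sketch below illustrates the idea of nucleus (top-p) filtering in plain Python on a toy distribution. The function name `top_p_filter` and the probabilities are hypothetical, and this is a conceptual illustration rather than the exact `transformers` implementation.

```python
def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise the surviving probabilities so they sum to 1 again.
    return {index: p / cumulative for index, p in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution
print(top_p_filter(toy_probs, top_p=0.9))
```

A smaller `top_p` discards more of the unlikely tail, which usually makes the generated text safer but less varied.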


In [34]:
# Skeleton Code: Batched input-output

prompts = ["Blair is a cute little girl lived in a happy family. ", "She got a little puppy as a gift one day."]  # ⬅️ try to construct different prompts.

batch_size = 2 # If you have multiple data inputs, please control the batch size to prevent out-of-memory issues.

# ⬇️ Try to tune different decoding hyperparameters.
# You can also add more hyperparameters like `top_p`, `top_k`.
max_new_tokens = 500
do_sample = True
temperature = 0.8
top_p = 0.95

for i in range(0, len(prompts), batch_size):
    batch_input = prompts[i:i+batch_size]
    tokenized_input = tokenizer(batch_input, return_tensors="pt", padding=True).to(device)

    # For decoder-only models, pass both the input ids and the attention mask so that padding tokens are ignored.
    output_ids = model.generate(
        tokenized_input["input_ids"],
        attention_mask=tokenized_input["attention_mask"],
        max_new_tokens=max_new_tokens,
        eos_token_id=1,
        do_sample=do_sample,
        temperature=temperature,
        top_p=top_p,
    )
    output_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

    for idx, result in enumerate(output_text):
        print(f"{result}\n")

Blair is a cute little girl lived in a happy family.  She loved to play in the sunshine and explore the garden. 
One day, when her mom was busy in the kitchen, she decided to go outside and explore the garden. She was excited as she ran around the garden, looking for new things to discover. 
Suddenly, she heard a loud noise. It was coming from the garden. She followed the sound and saw a large, round, yellow thing in the middle of the garden. It looked like a ball of gas. She was curious and wanted to know what it was. 
She asked her mom, "What is that thing in the garden?" 
Her mom smiled and said, "That is gas. It comes from the gas machine. It can be dangerous for you to touch it." 
Blair was scared and ran back into the house. But then she thought of something else. She said, "I can be a smart girl and learn how to be safe around gas." 
Her mom smiled and said, "That is a great idea! Let's go back outside and learn about gas."
So, the little girl and her mom went outside and spent 

#### What about other languages?

Oops! This English language model cannot generate stories in other languages!

Why? Let us evaluate the perplexity of different languages in the next task.

In [35]:
prompt = "两只小老虎过马路"

# Decoding hyperparameters
max_new_tokens = 500
do_sample = True
temperature = 0.5
top_p = 0.95

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
    top_p=top_p,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)


<s> 两只小老虎过马路 was a very adventurous boy. He loved to explore and find new things. One day, he decided to go on a big adventure. He packed his bag and set off.
As he was walking, he saw a big tree with a big, juicy apple at the top. He wanted to eat it, but it was too high for him to reach. Suddenly, a friendly bird flew down and offered to help him. The bird flew up and grabbed the apple with its beak. The boy was so happy and grateful that he thanked the bird and shared the apple with it.
As they were eating, the bird told the boy a story about a magical apple that could grant wishes. The boy was so excited and wished for a new toy. Suddenly, a toy appeared in front of him. The boy was amazed and thanked the bird again. But as he turned around to thank the bird, he saw that it had disappeared. The boy realized that the bird had given him a special gift that made him forget about his adventure. He went home with a smile on his face, grateful for the unexpected surprise.<s>


### Task 2: Perplexity Evaluation

#### Background

---

Perplexity is a key metric for evaluating language models. It quantifies how well a model predicts a sample, with lower perplexity indicating better performance. For a tokenized sequence $X = (x_0, x_1, \dots, x_t)$, the perplexity is defined as:

$$\text{Perplexity}(X) = \exp \left( -\frac{1}{t} \sum_{i=1}^t \log p_\theta (x_i | x_{<i}) \right)$$

Here, $p_\theta(x_i | x_{<i})$ is the probability the model assigns to token $x_i$ given its preceding tokens. Perplexity is therefore the exponential of the average negative log-probability per predicted token.
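
As a small worked example of the formula, the sketch below computes perplexity directly from a list of per-token probabilities. The function name `perplexity` and the probability values are made up for illustration; in practice the probabilities come from the model.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-probability of the predicted tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


confident = [0.9, 0.8, 0.95]     # hypothetical p(x_i | x_<i) for a well-modelled text
uncertain = [0.01, 0.02, 0.005]  # hypothetical values for a poorly modelled text
print(f"{perplexity(confident):.2f}")
print(f"{perplexity(uncertain):.2f}")
```

A useful sanity check: a model that assigns probability $1/k$ to every token has a perplexity of exactly $k$, so perplexity can be read as the effective number of equally likely choices per token.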

---

⚠️ Please make sure to **run the following cell first** to define the evaluation function.

😄 **You do not need to check these complex details! Too hard for beginners!** However, if you are interested, you can compare the following code with the explanations above to better understand how to implement PPL evaluation using PyTorch.

In [6]:
# The following code was adapted from the `evaluate` library. Licensed under the Apache License, Version 2.0 (the "License").
# We modified it to avoid causing serious memory issues in the Colab environment.

def compute_ppl(
        model, tokenizer, inputs, device, batch_size: int = 16, add_start_token: bool = True, max_length=None
):

    if device is not None:
        assert device in ["gpu", "cpu", "cuda"], "device should be either gpu or cpu."
        if device == "gpu":
            device = "cuda"
    else:
        device = "cuda" 

    # if batch_size > 1 (which generally leads to padding being required), and
    # if there is not an already assigned pad_token, assign an existing
    # special token to also be the padding token
    if tokenizer.pad_token is None and batch_size > 1:
        existing_special_tokens = list(tokenizer.special_tokens_map_extended.values())
        # check that the model already has at least one special token defined
        assert (
            len(existing_special_tokens) > 0
        ), "If batch_size > 1, model must have at least one special token to use for padding. Please use a different model or set batch_size=1."
        # assign one of the special tokens to also be the pad token
        tokenizer.add_special_tokens({"pad_token": existing_special_tokens[0]})

    if add_start_token and max_length:
        # leave room for <BOS> token to be added:
        assert (
            tokenizer.bos_token is not None
        ), "Input model must already have a BOS token if using add_start_token=True. Please use a different model, or set add_start_token=False"
        max_tokenized_len = max_length - 1
    else:
        max_tokenized_len = max_length

    encodings = tokenizer(
        inputs,
        add_special_tokens=False,
        padding=True,
        truncation=True if max_tokenized_len else False,
        max_length=max_tokenized_len,
        return_tensors="pt",
        return_attention_mask=True,
    )

    encoded_texts = encodings["input_ids"]
    attn_masks = encodings["attention_mask"]

    # check that each input is long enough:
    if add_start_token:
        assert torch.all(torch.ge(attn_masks.sum(1), 1)), "Each input text must be at least one token long."
    else:
        assert torch.all(
            torch.ge(attn_masks.sum(1), 2)
        ), "When add_start_token=False, each input text must be at least two tokens long. Run with add_start_token=True if inputting strings of only one token, and remove all empty input strings."

    ppls = []
    loss_fct = CrossEntropyLoss(reduction="none")

    for start_index in tqdm(range(0, len(encoded_texts), batch_size)):
        end_index = min(start_index + batch_size, len(encoded_texts))
        encoded_batch = encoded_texts[start_index:end_index].to(device)
        attn_mask = attn_masks[start_index:end_index].to(device)

        if add_start_token:
            bos_tokens_tensor = torch.tensor([[tokenizer.bos_token_id]] * encoded_batch.size(dim=0)).to(device)
            encoded_batch = torch.cat([bos_tokens_tensor, encoded_batch], dim=1)
            attn_mask = torch.cat(
                [torch.ones(bos_tokens_tensor.size(), dtype=torch.int64).to(device), attn_mask], dim=1
            )

        labels = encoded_batch

        with torch.no_grad():
            out_logits = model(encoded_batch, attention_mask=attn_mask).logits

            shift_logits = out_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            shift_attention_mask_batch = attn_mask[..., 1:].contiguous()

            perplexity_batch = torch.exp(
                (loss_fct(shift_logits.transpose(1, 2), shift_labels) * shift_attention_mask_batch).sum(1)
                / shift_attention_mask_batch.sum(1)
            )

            ppls += perplexity_batch.tolist()

    del encoded_batch, attn_mask
    if device == "cuda":
        torch.cuda.empty_cache()

    return {"perplexities": ppls, "mean_perplexity": sum(ppls)/float(len(ppls))}


#### 💡Tutorials: compute_ppl() function.

---
Minimal example:

```python
test_dataset = ["Once upon a time,"]

compute_ppl(
    model=model,
    tokenizer=tokenizer,
    device=device,
    inputs=test_dataset,
    batch_size = 16
)
```

Important parameters:
- `inputs`: a list of input texts; each text snippet is one list entry.
- `batch_size`: the batch size used during evaluation.

Returns:
- `perplexity`: `{"perplexities": [x.x, x.x, ...], "mean_perplexity": x.x}` dictionary containing the perplexity score of each text in the input list, as well as the mean perplexity.
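
To show one way of working with the returned dictionary, the sketch below ranks texts by their individual perplexity. The texts and scores are invented placeholders in the same format that `compute_ppl()` returns; they are not real model outputs.

```python
# Hypothetical texts and a hypothetical result in the format compute_ppl() returns.
texts = ["Once upon a time,", "A short story.", "Text in another language"]
results = {"perplexities": [5.2, 12.7, 950.0], "mean_perplexity": 322.63}

# Pair each text with its score and sort from hardest to easiest for the model.
ranked = sorted(zip(texts, results["perplexities"]), key=lambda pair: pair[1], reverse=True)
for text, ppl in ranked:
    print(f"{ppl:10.2f}  {text}")

# The reported mean is the arithmetic mean of the per-text perplexities.
mean_ppl = sum(results["perplexities"]) / len(results["perplexities"])
```

Because the mean is an arithmetic average, a few texts with very high perplexity can dominate it, so inspecting the per-text scores is often worthwhile.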


---

#### Task 2 Playground

---

📚 Task 2: Evaluate the perplexity. Ensure that you evaluate both the English and Chinese test data we provided. You are encouraged to collect more diverse text data and discuss your findings regarding the language understanding capacity of the base model.


Note: If you want to reuse the evaluation code for JSONL data, please structure the content as follows:
```json
{"text": "one data"}
{"text": "two data."}
...
```
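
If you collect your own texts, a file in this format can be produced with the standard `json` module. The sketch below is one possible way to do it; the helper names `write_jsonl` and `read_jsonl` and the file name are hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # ensure_ascii=False keeps non-English characters readable in the file.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the "text" field of every non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


# Hypothetical file name inside a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "my_test.jsonl")
write_jsonl(["one data", "two data."], path)
print(read_jsonl(path))
```
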
**You may find that the PPL value for Chinese text is significantly higher than that for English text. This helps explain why the base model could not generate a Chinese story at the end of the previous task.**

---

In [6]:
# Skeleton Code: Evaluate the perplexity (PPL) on a list of raw text.

test_dataset = ["Blair is a cute little girl lived in a happy family. ", "She got a little puppy as a gift one day."] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


Perplexity: 78.03


In [7]:
test_dataset = ["两只老虎跑得快","一直没有尾巴，一直没有眼睛"] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

Perplexity: 126800.88


In [8]:
# Skeleton Code: Evaluate the perplexity (PPL) on an external test set file (JSONL).

# English test set.
data_file = EN_TEST_FILE # ⬅️ you can change your file path
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

model.to(device)
results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(English Text) Test Perplexity: {dataset_ppl:.2f}")

# Chinese test set.
data_file = TEST_FILE # ⬅️ you can change your file path
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(Chinese Text) Test Perplexity: {dataset_ppl:.2f}")


# Try your own data file!

Generating test split: 0 examples [00:00, ? examples/s]

  0%|          | 0/63 [00:00<?, ?it/s]

(English Text) Test Perplexity: 4.14


Generating test split: 0 examples [00:00, ? examples/s]

  0%|          | 0/63 [00:00<?, ?it/s]

(Chinese Text) Test Perplexity: 70030.40


In [11]:
# 🚨 Release the GPU cache before training the model

import gc
gc.collect()  # Free unreferenced Python objects
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Release cached GPU memory held by PyTorch

### Task 3: Continual Pre-training (in Chinese or in another language you are proficient in)

Currently, our base English LM is proficient in English but lacks the capability to generate or comprehend other languages (e.g., Chinese). The objective of this task is to enhance the base English LM by continually pre-training it on text in another language. This process aims to enable the model to understand and generate mini-stories in that language.

We have provided 10,000 Chinese training samples. The training process for any language is the same. We have included useful resource links (in the assignment description PDF) to help you create additional data. If you encounter any issues in creating a dataset in another language, please do not hesitate to contact us.

We have implemented data preprocessing and the training pipeline, so you are not required to optimize these components. Instead, focus on tuning the training hyperparameters and observe the changes in model performance.


---

⚠️ Please **make sure to run the following cell first to pre-process data**.

😄 You do not need to check the details of the whole pipeline construction! Please pay attention to the hyper-parameters of `trainer`.

#### Preprocess Data
Here, we preprocess (tokenize and group) the text for the subsequent evaluation and pre-training phases.

Load the prepared Chinese dataset from Google Drive (or local disk).

In [12]:
chinese_dataset = load_dataset('json', data_files={'train': TRAIN_FILE, 'validation':VALIDATION_FILE, 'test': TEST_FILE})
print(chinese_dataset)
print(chinese_dataset["test"][2]["text"])

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 500
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1000
    })
})
从前，有一个小女孩名叫莉莉。她喜欢和家人一起去度假。有一天，她的家人决定去海边旅行。莉莉非常兴奋，她跳起来又跳下去，像发了疯一样。

当他们到达海滩时，他们搭起了遮阳伞和毯子。莉莉想立刻去游泳，但她的父母告诉她要等吃完午餐再说。莉莉感到很不耐烦，她说：“我现在就想去游泳！”她妈妈回答：“莉莉，我们需要先吃东西。游泳需要能量。”

莉莉意识到妈妈说得对，于是耐心地等待午餐结束。她学会了有时候要克制自己的激动情绪，并听从父母的意见。从那天起，莉莉变得更善于倾听，也更加享受她的假期时光。


We tokenize the raw text using Llama-2's tokenizer and group the tokenized text as inputs.

In [13]:
block_size = 512

def group_texts(examples):
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
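
To see what `group_texts` does, the toy run below applies the same logic with a block size of 4 instead of 512. The function name `group_texts_demo`, the explicit `block_size` argument, and the token ids are made up for illustration. All tokenized samples in a batch are concatenated, cut into fixed-size blocks, and the leftover tail is dropped.

```python
from itertools import chain

def group_texts_demo(examples, block_size):
    # Same logic as group_texts, with block_size passed in explicitly.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result

# Three short "documents" (toy token ids; 1 stands in for the BOS token).
toy_batch = {
    "input_ids": [[1, 10, 11], [1, 20, 21, 22], [1, 30, 31]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 10, 11, 1], [20, 21, 22, 1]]
```

The ten tokens yield two blocks of four, and the last two tokens are discarded. A block can therefore start in the middle of one story and end in the middle of the next.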

In [14]:
tokenized_zh_datasets = chinese_dataset.map(lambda examples: tokenizer(examples["text"]), batched=True, num_proc=4, remove_columns=["text"])
lm_datasets = tokenized_zh_datasets.map(
    group_texts,
    batched=True,
    batch_size=512,
    num_proc=4,
)



Map (num_proc=4):   0%|          | 0/10000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/500 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/10000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/500 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

#### 💡Tutorials: TrainingArguments().

**Important Training Hyper-parameters**
- learning_rate: The initial learning rate for the optimizer.
- num_train_epochs: Total number of training epochs to perform (if not an integer, only the corresponding fraction of the last epoch is run before training stops).
- evaluation_strategy / save_strategy: The evaluation/saving strategy to adopt during training. Possible values are:
    - `"no"`: No evaluation/saving is done during training.
    - `"steps"`: Evaluation/saving is done (and logged) every `eval_steps` / `save_steps` steps.
    - `"epoch"`: Evaluation/saving is done at the end of each epoch.
- per_device_train_batch_size: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for training.
- per_device_eval_batch_size: The batch size per GPU/XPU/TPU/MPS/NPU core/CPU for evaluation.
- save_total_limit: If a value is passed, limits the total number of checkpoints kept on disk.


---

If you are not familiar with the `AdamW` optimizer and learning-rate schedulers, you may use the default settings.

**Optimizer Hyper-parameters**
- weight_decay: The weight decay to apply (if not zero) to all layers except bias and LayerNorm weights in the `AdamW` optimizer.
- adam_beta1: The beta1 hyperparameter for the `AdamW` optimizer.
- adam_beta2: The beta2 hyperparameter for the `AdamW` optimizer.

**Learning schedule**
- lr_scheduler_type: The scheduler type to use.
- warmup_ratio: Ratio of total training steps used for a linear warmup from 0 to `learning_rate`.

[Explore more parameters here](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/trainer#transformers.TrainingArguments)
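
As a rough illustration of how `learning_rate`, `warmup_ratio` and the `"linear"` scheduler interact, the sketch below computes the learning rate at a given optimizer step. The function `linear_schedule_lr` is a hypothetical stand-in written for this tutorial, not the library's implementation. It assumes the number of warmup steps is the total number of steps times `warmup_ratio`, rounded up, followed by a linear decay to zero.

```python
import math

def linear_schedule_lr(step, total_steps, peak_lr, warmup_ratio):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Example values: 2000 total steps, peak learning rate 1e-4, 1% warmup.
for step in (0, 10, 20, 1010, 2000):
    lr_at_step = linear_schedule_lr(step, total_steps=2000, peak_lr=1e-4, warmup_ratio=0.01)
    print(f"step {step:4d}: lr = {lr_at_step:.2e}")
```

With these values the warmup lasts only 20 steps, so almost the whole run is spent in the decay phase. A smaller `learning_rate` lowers the entire curve, not just its peak.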

#### Task 3 Playground

---

📚 Simply run the following code to perform continual pre-training, then try your best to tune the hyperparameters or collect more data to improve model performance.

---

In [13]:
# =========Pre-training hyperparameters, please feel free to tune them~=========
# =Important=
lr = 1e-4
epochs = 8
save_steps=200
strategy="steps"
train_bsz = 32 # reduce batch size if you encounter out-of-memory errors.
eval_bsz = 16

# If you do not understand AdamW optimizer and learning scheduler, you may use default settings.
# =Optimizer=
optimizer = "adamw_torch"
weight_decay = 0.01
adam_beta1 = 0.9
adam_beta2 = 0.98
# =Learning scheduler=
lr_scheduler = "linear"
warmup_ratio = 0.01
# =========End of pre-training hyperparameters=========


training_args = TrainingArguments(
    "llama-42m-zh-fairytales",
    evaluation_strategy = strategy,
    eval_steps=save_steps,
    save_strategy = strategy,
    save_steps=save_steps,
    logging_strategy="steps",
    logging_steps = 10,
    learning_rate=lr,
    weight_decay=weight_decay,
    seed=42,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=eval_bsz,
    save_total_limit=1,
    optim = optimizer,
    lr_scheduler_type = lr_scheduler,
    adam_beta1 = adam_beta1,
    adam_beta2 = adam_beta2,
    warmup_ratio = warmup_ratio,
    num_train_epochs = epochs,
    report_to="none"  # the string "none" disables logging integrations; Python None does not
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)




In [14]:
trainer.train()

Step,Training Loss,Validation Loss
200,1.8536,1.84143
400,1.5042,1.522449
600,1.3477,1.399509
800,1.238,1.333789
1000,1.1981,1.29798
1200,1.1493,1.266596
1400,1.0832,1.255093
1600,1.0339,1.247563
1800,0.9945,1.243147
2000,1.0087,1.238736


TrainOutput(global_step=2000, training_loss=1.3321396446228027, metrics={'train_runtime': 3367.7817, 'train_samples_per_second': 18.93, 'train_steps_per_second': 0.594, 'total_flos': 4956004181606400.0, 'train_loss': 1.3321396446228027, 'epoch': 8.0})
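
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, so one optimizer step consumes `train_bsz` blocks of 512 tokens. It derives the number of steps per epoch and bounds the number of training blocks, which the notebook does not print directly.

```python
# Values copied from the run above.
global_step = 2000
epochs = 8
train_bsz = 32
train_runtime = 3367.7817           # seconds
train_samples_per_second = 18.93

steps_per_epoch = global_step // epochs

# ceil(n_blocks / train_bsz) == steps_per_epoch bounds the block count.
min_blocks = (steps_per_epoch - 1) * train_bsz + 1
max_blocks = steps_per_epoch * train_bsz

# Independent estimate from the reported throughput.
est_blocks = train_runtime * train_samples_per_second / epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"blocks per epoch: between {min_blocks} and {max_blocks}")
print(f"throughput-based estimate: {est_blocks:.0f}")
```

Under these assumptions the 10,000 stories were packed into roughly 8,000 blocks of 512 tokens, which is why doubling `train_bsz` would halve the number of steps per epoch.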

Load the continually pre-trained checkpoint and try generating a mini-story in the new language.

In [15]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device type: {device}")

new_model_path = "llama-42m-zh-fairytales/checkpoint-2000" # saved checkpoint path
model = LlamaForCausalLM.from_pretrained(new_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, device=device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


Evaluate the PPL on Chinese text (or another language) again.

You will notice that we actually achieve a much lower PPL after continual pre-training.
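
A quick sanity check is possible before running the evaluation: the validation loss logged by the trainer is a mean per-token cross-entropy, so its exponential is itself a perplexity. The sketch below applies this to the last logged validation loss (1.238736 at step 2000). It is only a rough cross-check, because the trainer averages over 512-token blocks of the validation split while `compute_ppl` is assumed to average per-sample perplexities over the test split, so the two values should be close but need not match.

```python
import math

final_val_loss = 1.238736  # validation loss at step 2000, from the training log

# exp(mean cross-entropy) is the perplexity implied by the loss.
approx_ppl = math.exp(final_val_loss)
print(f"exp(validation loss) = {approx_ppl:.2f}")
```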

In [19]:
data_file = TEST_FILE
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Test Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/63 [00:00<?, ?it/s]

Test Perplexity: 3.37


In [21]:
test_dataset = ["两只老虎跑得快","一直没有尾巴，一直没有眼睛"] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

Perplexity: 13.89


---

The original English base model was pre-trained on 2 million data samples. Since we use only 10,000 training samples (0.5% of the original pre-training data), the model can generate a few fluent sentences but may still struggle with long-text generation or with common-sense knowledge in the new language. You can try using more data or more training steps, depending on your computational resources.

---

In [20]:
prompt = "从前，有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。"

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
temperature = 0.3

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids = model.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids[0])
print(output_text)

<s> 从前，有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。他们喜欢在公园里玩耍。有一天，汤姆看到一个大滑梯。他想滑下来，但是它太高了。



汤姆的朋友，一只名叫马克斯的狗，来了，他说：“马克斯，你能帮我滑下来吗？”马克斯看了看汤姆，说：“好的，汤姆。我们一起滑下来吧。”



汤姆和马克斯整天都在一起玩耍。他们滑下来，吃了蛋糕。马克斯很高兴他能帮助他的朋友。从那天起，汤姆和马克斯成为了最好的朋友。<s>


Hyperparameter variation: example 1

In [25]:
if device == "cuda":
    torch.cuda.empty_cache()

In [26]:
# =========Pre-training hyperparameters, please feel free to tune them~=========
# =Important=
lr = 1e-5
epochs = 10
save_steps=200
strategy="epoch"
train_bsz = 32 # reduce batch size if you encounter out-of-memory errors.
eval_bsz = 16

# If you do not understand AdamW optimizer and learning scheduler, you may use default settings.
# =Optimizer=
optimizer = "adamw_torch"
weight_decay = 0.01
adam_beta1 = 0.9
adam_beta2 = 0.98
# =Learning scheduler=
lr_scheduler = "linear"
warmup_ratio = 0.01
# =========End of pre-training hyperparameters=========


training_args = TrainingArguments(
    "llama-42m-zh-fairytales",
    evaluation_strategy = strategy,
    eval_steps=save_steps,
    save_strategy = strategy,
    save_steps=save_steps,
    logging_strategy="steps",
    logging_steps = 10,
    learning_rate=lr,
    weight_decay=weight_decay,
    seed=42,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=eval_bsz,
    save_total_limit=1,
    optim = optimizer,
    lr_scheduler_type = lr_scheduler,
    adam_beta1 = adam_beta1,
    adam_beta2 = adam_beta2,
    warmup_ratio = warmup_ratio,
    num_train_epochs = epochs,
    report_to="none"  # the string "none" disables logging integrations; Python None does not
)

trainer2 = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)




In [None]:
trainer2.train()

Epoch,Training Loss,Validation Loss


In [20]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device type: {device}")

new_model_path = "llama-42m-zh-fairytales/checkpoint-2500" # saved checkpoint path
model2 = LlamaForCausalLM.from_pretrained(new_model_path).to(device)
tokenizer2 = AutoTokenizer.from_pretrained(model_path, device=device)

if tokenizer2.pad_token is None:
    tokenizer2.pad_token = tokenizer2.eos_token

Device type: cuda


In [21]:
# model2
prompt = "从前，有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。"

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
temperature = 0.3

tokenized_input2 = tokenizer2.encode(prompt, return_tensors="pt").to(device)
output_ids2 = model2.generate(
    tokenized_input2,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text2 = tokenizer2.decode(output_ids2[0])
print(output_text2)

<s> 从前，有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。她喜欢在一个叫莉莉。她喜欢在一个玩很很妈。她在一个玩具。



有一天，莉莉的妈妈妈说：“妈妈，莉莉。我们把它。”



莉莉和她的妈妈说：“妈妈妈，妈妈，我们的朋友们帮助你。”



莉莉莉她的妈妈说：“我们怎么叫莉莉。我们很妈妈。”



莉莉莉她的妈妈说：“妈妈，莉莉。我们可以很妈妈妈。”



莉莉莉的妈妈说：“妈妈，����


In [22]:
# model2
test_dataset = ["两只老虎跑得快","一直没有尾巴，一直没有眼睛"] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model2, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

Perplexity: 39.42


In [23]:
# model2
data_file = TEST_FILE
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model2, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Test Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/63 [00:00<?, ?it/s]

Test Perplexity: 16.73


Hyperparameter tuning: example 3

In [15]:
# =========Pre-training hyperparameters, please feel free to tune them~=========
# =Important=
lr = 5e-5
epochs = 12
save_steps=200
strategy="epoch"
train_bsz = 32 # reduce batch size if you encounter out-of-memory errors.
eval_bsz = 16

# If you do not understand AdamW optimizer and learning scheduler, you may use default settings.
# =Optimizer=
optimizer = "adamw_torch"
weight_decay = 0.01
adam_beta1 = 0.9
adam_beta2 = 0.98
# =Learning scheduler=
lr_scheduler = "linear"
warmup_ratio = 0.01
# =========End of pre-training hyperparameters=========


training_args = TrainingArguments(
    "llama-42m-zh-fairytales",
    evaluation_strategy = strategy,
    eval_steps=save_steps,
    save_strategy = strategy,
    save_steps=save_steps,
    logging_strategy="steps",
    logging_steps = 10,
    learning_rate=lr,
    weight_decay=weight_decay,
    seed=42,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=eval_bsz,
    save_total_limit=1,
    optim = optimizer,
    lr_scheduler_type = lr_scheduler,
    adam_beta1 = adam_beta1,
    adam_beta2 = adam_beta2,
    warmup_ratio = warmup_ratio,
    num_train_epochs = epochs,
    report_to="none"
)

trainer3 = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)



In [16]:
trainer3.train()

Epoch,Training Loss,Validation Loss
1,3.6144,3.610271
2,2.2235,2.264548
3,1.8207,1.813517
4,1.6033,1.641949
5,1.4955,1.543603
6,1.4125,1.483981
7,1.3759,1.445553
8,1.3626,1.419579
9,1.3102,1.401846
10,1.273,1.393842


TrainOutput(global_step=2500, training_loss=1.9941860012054444, metrics={'train_runtime': 2109.7646, 'train_samples_per_second': 37.772, 'train_steps_per_second': 1.185, 'total_flos': 6195005227008000.0, 'train_loss': 1.9941860012054444, 'epoch': 10.0})

In [17]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device type: {device}")

new_model_path = "llama-42m-zh-fairytales/checkpoint-2500" # saved checkpoint path
model3 = LlamaForCausalLM.from_pretrained(new_model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, device=device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


In [18]:
# model3
prompt = "从前，有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。"

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
temperature = 0.3

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids2 = model3.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids2[0])
print(output_text)

<s> 从前，有一只叫做汤姆的猫。汤姆和他的朋友们一起玩。他们喜欢在树林里玩耍。有一天，他们在树林里散步时看到了一个很大的杯子。它很害怕，但它太大了。

汤姆把杯子带到了杯子，但他很害怕。他不想吃掉它。他想帮助它。他说：“我们去找吧。”汤姆把杯子找到了一个漂亮的杯子。他说：“哇，我们可以找到它！”

汤姆和他的朋友们很高兴。他们找到了一个愉快的地方。他们找到了一个拿着杯子。他们决定玩杯子。他们把杯子放在树上，然后�


In [19]:
# model3
test_dataset = ["两只老虎跑得快","一直没有尾巴，一直没有眼睛"] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model3, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

Perplexity: 20.75


In [20]:
# model3
data_file = TEST_FILE
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

results = compute_ppl(model=model3, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Test Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/63 [00:00<?, ?it/s]

Test Perplexity: 3.98


# Training on an Indonesian dataset

In [11]:
IN_RAW_DATA = '/kaggle/input/indonisan-raw/Indonesian_raw.csv'

In [16]:
import pandas as pd

DF_IN_RAW = pd.read_csv(IN_RAW_DATA)

column_name = 'targets'  
column_data = DF_IN_RAW[column_name]

print(column_data)
column_data.to_csv('IN_targets.csv', index=False)

0       Terjemahan atau padanan teks tersebut dalam Ba...
1       Terjemahan atau padanan teks tersebut dalam Ba...
2       Terjemahan atau padanan teks tersebut dalam Ba...
3       Terjemahan atau padanan teks tersebut dalam Ba...
4       Terjemahan atau padanan teks tersebut dalam Ba...
                              ...                        
1060    Terjemahan atau padanan teks tersebut dalam Ba...
1061    Terjemahan atau padanan teks tersebut dalam Ba...
1062    Terjemahan atau padanan teks tersebut dalam Ba...
1063    Terjemahan atau padanan teks tersebut dalam Ba...
1064    Terjemahan atau padanan teks tersebut dalam Ba...
Name: targets, Length: 1065, dtype: object


In [20]:
# Clean data: strip the boilerplate prefix that precedes every translated story
IN_RAW = pd.read_csv('IN_targets.csv')
IN_CLEANED = column_data.apply(lambda x: x.replace('Terjemahan atau padanan teks tersebut dalam Bahasa Indonesia adalah:\n\n', ''))
IN_CLEANED

0       Dung Tak Tak Dung Tak! Dung Tat Tak Dung Tak! ...
1       "Bolehkah kami ikut?" pinta Syam. "Lain kali, ...
2       Dada memandu bagal (anak kuda dan keledai) mer...
3       Kaki kiri Kiki terjepit di sela-sela batu yang...
4       Chacha meninggalkan Kiki dengan Ma yang begitu...
                              ...                        
1060    Pada hari Kamis, Manu berpiknik. "Ibu, bagaima...
1061    Hari Jumat cuaca mendung. "Ibu, apakah hujan a...
1062    Hari Sabtu, langit bergemuruh. Geluduk terdeng...
1063    Akhirnya, hujan turun juga! "Hujan! Hujan!" te...
1064    "Manu," panggil Ibu seraya mengejarnya, "Kamu ...
Name: targets, Length: 1065, dtype: object

In [25]:
IN_CLEANED.to_csv('IN_CLEANED.csv', index=False)

In [22]:
from sklearn.model_selection import train_test_split

train_ratio = 0.6
val_ratio = 0.3
test_ratio = 0.1

# Hold out the test set first.
train_val_df, test_df = train_test_split(IN_CLEANED, test_size=test_ratio, random_state=42)

# Split the remainder; the validation share is rescaled to the train+validation pool.
val_share = val_ratio / (train_ratio + val_ratio)
train_df, val_df = train_test_split(train_val_df, test_size=val_share, random_state=42)
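
The two-stage split can be checked by hand. The sketch below reproduces only the size arithmetic in plain Python, without shuffling any data. The helper `two_stage_split_sizes` is hypothetical, and it assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample.

```python
import math

def two_stage_split_sizes(n_samples, train_ratio, val_ratio, test_ratio):
    # Stage 1: hold out the test set (size assumed to be rounded up).
    n_test = math.ceil(test_ratio * n_samples)
    n_train_val = n_samples - n_test
    # Stage 2: the validation share is rescaled because it is now a fraction
    # of the remaining train+validation pool, not of the full data.
    val_share = val_ratio / (train_ratio + val_ratio)
    n_val = math.ceil(val_share * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

sizes = two_stage_split_sizes(1065, train_ratio=0.6, val_ratio=0.3, test_ratio=0.1)
print(sizes)  # (638, 320, 107)
```

Without the rescaling, passing `val_ratio` directly to the second split would give a validation set of only about 27% of the full data instead of 30%.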

In [26]:
print(train_df)
print(val_df)
print(test_df)

184    Mengenal Mohana 9 Mei 1971: Hari lahir Rajmoha...
845    "Nah, kalian berdua memiliki hal-hal membahagi...
404    Pintu melihat bannya dan mengerti apa yang dis...
901    Ada beberapa legenda tentang penemuan kopi. Sa...
108    Manusia telah ada selama sekitar 200.000  tahu...
                             ...                        
72     Ranj tidak tahan. Dia merindukan semua teman-t...
329    Ibu Ammu kemudian mengambil peniti dan menusuk...
812    Kakek memakai tas yang sangat kecil, dan memba...
618    Aku pikir mungkin wanita tua yang tinggal di l...
55     Kita bisa dengan mudah mengetahui cuaca di lua...
Name: targets, Length: 638, dtype: object
200     Suara itu terdengar lebih pelan sekarang. KECI...
93      "Jadi, aku mendapat gen ini dari Ibu?" tanya V...
481     Si adik menjawab, "Ada banyak hewan ternak di ...
729     Pak Tani meminta, "Tolong turunkan hujan. Aku ...
1055    Aku tidak keberatan , karena kami akan bermain...
                              ...        

In [37]:
df = train_df.reset_index(drop=True).to_frame()
df.columns = ['text']

# Save in JSON Lines format
df.to_json('in_train.jsonl', orient='records', lines=True, force_ascii=False)

In [38]:
df = val_df.reset_index(drop=True).to_frame()
df.columns = ['text']
df.to_json('in_val.jsonl', orient='records', lines=True, force_ascii=False)

df = test_df.reset_index(drop=True).to_frame()
df.columns = ['text']
df.to_json('in_test.jsonl', orient='records', lines=True, force_ascii=False)
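
For reference, JSON Lines stores one JSON object per line, which is the layout `load_dataset('json', ...)` reads here with a single `text` field per record. The sketch below builds and parses such content with the standard library only. The two sample sentences are toy data, and `ensure_ascii=False` plays the same role as `force_ascii=False` above by keeping non-ASCII characters readable instead of escaping them.

```python
import json

# Toy records with the same single-column layout as the real files.
records = [
    {"text": "Dahulu kala, ada seekor kucing bernama Tom."},
    {"text": "从前，有一只叫做汤姆的猫。"},
]

# Serialize: one JSON object per line.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
print(jsonl)

# Parse back line by line.
restored = [json.loads(line) for line in jsonl.splitlines()]
```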

In [39]:
indonisia_dataset = load_dataset('json', data_files={'train': '/kaggle/working/in_train.jsonl', 'validation':'/kaggle/working/in_val.jsonl', 'test': '/kaggle/working/in_test.jsonl'})
print(indonisia_dataset)
print(indonisia_dataset["test"][2]["text"])

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 638
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 320
    })
    test: Dataset({
        features: ['text'],
        num_rows: 107
    })
})
Sepanjang  tahun  berikutnya  tiap  pergantian  musim,  aku mengamati perubahan yang sangat luar biasa di kolam itu. Pada musim panas, kolam itu seperti lahan tandus yang kering bahkan tak bernyawa. Tetapi, begitu hujan turun, kehidupan seketika mulai kembali. Seperti orkestra yang menunggu aba-aba dari konduktor untuk mulai beraksi. Rintik hujan pertama yang jatuh ke tanah kering menimbulkan aroma alami. Bahkan dia juga melakukan apa yang pesulap tidak bisa wujudkan, mengubah tanah coklat menjadi oasis indah berwarna hijau dan biru.


In [45]:
block_size = 512

def group_texts(examples):
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [46]:
tokenized_in_datasets = indonisia_dataset.map(lambda examples: tokenizer(examples["text"]), batched=True, num_proc=4, remove_columns=["text"])
lm_datasets = tokenized_in_datasets.map(
    group_texts,
    batched=True,
    batch_size=512,
    num_proc=4,
)



Map (num_proc=4):   0%|          | 0/638 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/320 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/107 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/638 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/320 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/107 [00:00<?, ? examples/s]

Train the model

In [65]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device type: {device}")

model_path = MODEL_FOLDER
# Load model from local files
model = LlamaForCausalLM.from_pretrained(model_path).to(device)
# Load tokenizer from local files
tokenizer = AutoTokenizer.from_pretrained(model_path, device=device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


In [66]:
test_dataset = ["Sepanjang  tahun  berikutnya  tiap  pergantian  musim","Seperti orkestra yang menunggu aba-aba dari konduktor untuk mulai beraksi"] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

Perplexity: 146538.51


In [67]:
# Indonesian test set.
data_file = '/kaggle/working/in_test.jsonl'
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

model.to("cuda")
results = compute_ppl(model=model, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(Indonesian Text) Test Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/7 [00:00<?, ?it/s]

(Indonesian Text) Test Perplexity: 34980.24


In [68]:
# =========Pre-training hyperparameters, please feel free to tune them~=========
# =Important=
lr = 5e-5
epochs = 30
save_steps= 20
strategy="steps"
train_bsz = 32 # reduce batch size if you encounter out-of-memory errors.
eval_bsz = 16

# If you do not understand AdamW optimizer and learning scheduler, you may use default settings.
# =Optimizer=
optimizer = "adamw_torch"
weight_decay = 0.01
adam_beta1 = 0.9
adam_beta2 = 0.98
# =Learning scheduler=
lr_scheduler = "linear"
warmup_ratio = 0.01
# =========End of pre-training hyperparameters=========


training_args = TrainingArguments(
    "llama-42m-in-fairytales",
    evaluation_strategy = strategy,
    eval_steps=save_steps,
    save_strategy = strategy,
    save_steps=save_steps,
    logging_strategy="steps",
    logging_steps = 10,
    learning_rate=lr,
    weight_decay=weight_decay,
    seed=42,
    per_device_train_batch_size=train_bsz,
    per_device_eval_batch_size=eval_bsz,
    save_total_limit=1,
    optim = optimizer,
    lr_scheduler_type = lr_scheduler,
    adam_beta1 = adam_beta1,
    adam_beta2 = adam_beta2,
    warmup_ratio = warmup_ratio,
    num_train_epochs = epochs,
    report_to="none"
)

trainer_IN = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)



In [69]:
trainer_IN.train()

Step,Training Loss,Validation Loss
20,6.3192,5.8448
40,4.9271,4.840263
60,4.1854,4.381457
80,3.7599,4.187562
100,3.4265,4.094593
120,3.1791,4.071829
140,2.9628,4.083347
160,2.8128,4.102041
180,2.7481,4.11271


TrainOutput(global_step=180, training_loss=4.007002978854709, metrics={'train_runtime': 147.3581, 'train_samples_per_second': 34.406, 'train_steps_per_second': 1.222, 'total_flos': 394135732224000.0, 'train_loss': 4.007002978854709, 'epoch': 30.0})
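
The log above shows the training loss still falling while the validation loss bottoms out and then rises, which is the usual sign of overfitting on a small dataset. The sketch below simply locates the step with the lowest validation loss in the logged values. It does not change the training run, and because `save_total_limit=1` is set, the checkpoint from that step is probably no longer on disk.

```python
# (step, validation loss) pairs copied from the training log above.
eval_log = [
    (20, 5.8448), (40, 4.840263), (60, 4.381457), (80, 4.187562),
    (100, 4.094593), (120, 4.071829), (140, 4.083347), (160, 4.102041),
    (180, 4.11271),
]

best_step, best_loss = min(eval_log, key=lambda row: row[1])
last_step, last_loss = eval_log[-1]

print(f"best validation loss {best_loss:.4f} at step {best_step}")
print(f"final validation loss {last_loss:.4f} at step {last_step}")
```

One option worth trying is fewer epochs, so that training stops near the best step. Another is the `load_best_model_at_end` argument of `TrainingArguments`, which is meant to restore the best evaluated checkpoint once training finishes.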

In [71]:
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device type: {device}")

new_model_path_in = "llama-42m-in-fairytales/checkpoint-180" # saved checkpoint path
model_in = LlamaForCausalLM.from_pretrained(new_model_path_in).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, device=device)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Device type: cuda


In [72]:
# model_in
prompt = "Dahulu kala, ada seekor kucing bernama Tom。"

# Decoding hyperparameters
max_new_tokens = 300
do_sample = True
temperature = 0.3

tokenized_input = tokenizer.encode(prompt, return_tensors="pt").to(device)
output_ids2 = model_in.generate(
    tokenized_input,
    max_new_tokens=max_new_tokens,
    eos_token_id=1,
    do_sample=do_sample,
    temperature=temperature,
)
output_text = tokenizer.decode(output_ids2[0])
print(output_text)

<s> Dahulu kala, ada seekor kucing bernama Tom。 alam. Mereka mengikatnya keluar keluar dari kelu. Mereka menghampungnya menghormkannya di antar kelu. Mereka menghantungnya tidak menghanda dibakan kelu.<s>


In [73]:
test_dataset = ["Sepanjang  tahun  berikutnya  tiap  pergantian  musim","Seperti orkestra yang menunggu aba-aba dari konduktor untuk mulai beraksi"] # ⬅️ you can use your examples / or read from raw text file

results = compute_ppl(model=model_in, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/1 [00:00<?, ?it/s]

Perplexity: 289.63


In [74]:
# Indonesian test set.
data_file = '/kaggle/working/in_test.jsonl'
test_dataset = load_dataset('json', data_files={'test': data_file})["test"]["text"]

model.to("cuda")
results = compute_ppl(model=model_in, tokenizer=tokenizer, device=device, inputs=test_dataset, batch_size = 16)
dataset_ppl = results['mean_perplexity']
print(f"(Indonesian Text) Test Perplexity: {dataset_ppl:.2f}")

  0%|          | 0/7 [00:00<?, ?it/s]

(Indonesian Text) Test Perplexity: 85.05
