If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and the other libraries it relies on. Run the following cell:

In [1]:
!pip install -q  torch bitsandbytes==0.41.3 trl==0.4.7 accelerate
!pip install git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/peft.git

If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!), then execute the following cell and paste your access token when prompted:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS by running the following cell:

In [3]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 38 not upgraded.


Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [4]:
import accelerate
import transformers

transformers.__version__, accelerate.__version__

('4.39.0.dev0', '0.28.0')

You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/language-modeling).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [5]:
from transformers.utils import send_example_telemetry

send_example_telemetry("language_modeling_from_scratch_notebook", framework="pytorch")

# Train a language model

In this notebook, we'll see how to train a [🤗 Transformers](https://github.com/huggingface/transformers) model on a language modeling task. We will cover two types of language modeling tasks which are:

- Causal language modeling: the model has to predict the next token in the sentence (so the labels are the same as the inputs shifted to the right). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when trying to predict token i+1 in the sentence.

![Widget inference representing the causal language modeling task](https://github.com/huggingface/notebooks/blob/main/examples/images/causal_language_modeling.png?raw=1)

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens masked to predict their value.
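
The attention mask used for causal language modeling can be pictured as a lower-triangular matrix. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the helper name `causal_mask` and the toy tokens are hypothetical.

```python
def causal_mask(seq_len):
    """Return a lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat", "down"]  # toy example, not taken from the dataset
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    # Collect the tokens this position is allowed to look at.
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r:>7} can attend to {visible}")
```

Each row only exposes the current token and the ones before it, which is what keeps the model from looking at the token it is asked to predict.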

We will see how to easily load and preprocess the dataset for each one of those tasks, and how to use the `Trainer` API to train a model on it.

This notebook assumes you have trained a tokenizer on the corpus you are using, see the [How to train a tokenizer](https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb) notebook ([open in colab](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb)).

A script version of this notebook you can directly run on a distributed environment or on TPU is available in our [examples folder](https://github.com/huggingface/transformers/tree/master/examples).

## Preparing the dataset

For each of those tasks, we will use the Wikitext 2 dataset as an example. You can load it very easily with the 🤗 Datasets library.

In [6]:
from datasets import load_dataset
# datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

# Place the dataset_cleaned.csv in contents

In [8]:
import random
import math
import csv

def generate_random_problem():
    num1 = random.randint(1, 10000)
    num2 = random.randint(1, 10000)
    operation = random.choice(['+', '-', '*', '/', 'sqrt'])

    if operation == 'sqrt':
        problem = f"\\sqrt({num1 ** 2}) = {num1}"  # escape the backslash: "\s" is an invalid escape sequence
    elif operation == '/':
        problem = f"{num1*num2} / {num1} = {num2}"
    elif operation == '+':
        problem = f"{num1}+{num2} = {num1+num2}"
    elif operation == '-':
        problem = f"{num1} - {num2} = {num1-num2}"
    elif operation == '*':
        problem = f"{num1} * {num2} = {num1*num2}"

    return problem

# Generate 2000 random arithmetic problems, with the operation picked at random for each one
problems = []
for _ in range(2000):
    problems.append(generate_random_problem())

# Write the problems to a CSV file
with open('/content/merged_dataset.csv', 'a', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for problem in problems:
        writer.writerow([problem])

In [9]:
import pandas as pd
df = pd.read_csv('/content/merged_dataset.csv')
df.columns = ['Title']
df.to_csv('/content/merged_dataset.csv')
df.head()

Unnamed: 0,Title
0,Telecom Networks are made up of various compon...
1,"In recent years, telecom networks have evolved..."
2,Telecom Networks are mostly used today for wid...
3,Routing a Telephone Call: A call is routed up ...
4,Connection-Oriented Services – I: A dedicated ...


In [10]:
from sklearn.model_selection  import train_test_split
# Load the dataset from CSV
dataset = pd.read_csv('merged_dataset.csv')

# Split the dataset into train and validation sets (95% train, 5% validation)
train_data, test_data = train_test_split(dataset, test_size=0.05, random_state=42)

# Save the train and test sets to CSV files
train_data.to_csv('merged_dataset_train.csv', index=False)
test_data.to_csv('merged_dataset_val.csv', index=False)

In [11]:
datasets = load_dataset("csv", data_files={"train": '/content/merged_dataset_train.csv',"val": '/content/merged_dataset_val.csv'})


You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split first, then give an index:

In [12]:
datasets["train"][600]

{'Unnamed: 0': 23,
 'Title': 'The packet’s network protocol type, in this case, TCP/IP, is identified by the data-link layer. Error prevention and “framing” are also provided by the data-link layer. Point-to-Point Protocol (PPP) framing and Ethernet IEEE 802.2 framing are two examples of data-link layer protocols.'}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [13]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [14]:
show_random_elements(datasets["train"])

Unnamed: 0.1,Unnamed: 0,Title
0,7477,It is used for transferring the data from one node to another node.
1,4295,"Like in Infrared communication, there will be a Radio wave transmitter and a receiver. All the Radios today use continuous sine waves to transmit information, as almost every single person on the planet uses these waves in one form or another. The information can be from audio, video, sound, and textual data. Suppose a person is using Radio, sine waves are transmitted from it, and if another person uses a TV, it also broadcasts sine waves. How are these signals separated and identified? Every single Radio signal will have a different frequency for the sine waves."
2,6722,There are some disadvantages of Kerberos. Some of them are as follows:
3,6779,This layer delivers protocols that are accountable for creating seamless transmission between applications.
4,1796,This algorithm makes the routing decisions based on the topology and network traffic.
5,5335,It is the cheapest form of Internet access in a limited way.
6,3851,"Before proceeding towards the TCP termination, it is essential to understand the concept of TCP connection. It will help us to better understand the termination process."
7,844,"Communication medium: Computer network behaves as a communication medium among the users. For example, a company contains more than one computer has an email system which the employees use for daily communication."
8,795,"Attackers’ task is comparatively very easy when they can enter the network they want to attack. Ethernet LANs are very much vulnerable to attack as the switch ports are open to use by default. Various attacks such as Dos attack at layer 2, address spoofing can take place. If the administrator has control over the network then obviously the network is safe. To take total control over the switch ports, the user can use a feature called port-security. If somehow prevent an unauthorized user to use these ports, then the security will increase up to a great extent at layer 2."
9,6600,Step 5: Start the virtual machine and boot into NST


As we can see, some of the texts are full paragraphs while others are just single sentences or short headings.

## Causal Language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them in examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether they span over several of the original texts in the dataset or not. The labels will be the same as the inputs, shifted to the left.
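
The concatenate-then-chunk idea can be sketched on toy token ids before applying it to the real dataset. This is a minimal illustration under stated assumptions: `chunk_token_ids` is a hypothetical helper, and the ids (including `BOS = 1`) are made up for the example.

```python
def chunk_token_ids(tokenized_texts, block_size):
    """Concatenate lists of token ids and cut the result into blocks of block_size."""
    concatenated = [tok for text in tokenized_texts for tok in text]
    # Drop the trailing tokens that do not fill a whole block.
    usable = (len(concatenated) // block_size) * block_size
    return [concatenated[i:i + block_size] for i in range(0, usable, block_size)]


BOS = 1  # hypothetical id for the beginning-of-sequence token
texts = [[BOS, 10, 11, 12], [BOS, 20, 21], [BOS, 30, 31, 32, 33]]

blocks = chunk_token_ids(texts, block_size=5)
print(blocks)  # [[1, 10, 11, 12, 1], [20, 21, 1, 30, 31]]
```

The first block ends with the `BOS` token of the second text, and the second block spans the end of text 2 and the beginning of text 3. The last two ids (32 and 33) do not fill a block and are dropped.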

We will use the [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) checkpoint for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead. For the tokenizer, you can replace the checkpoint by the one you trained yourself.

In [15]:
model_checkpoint = "meta-llama/Llama-2-7b-chat-hf"
tokenizer_checkpoint = "meta-llama/Llama-2-7b-chat-hf"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [16]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)


We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [17]:
def tokenize_function(examples):
    return tokenizer(examples["Title"])

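Then we apply it to all the splits in our `datasets` object. Here we process one example at a time in a single process (`batched=False`, `num_proc=1`); passing `batched=True` and a higher `num_proc` would speed up the preprocessing. We won't need the `Title` and `Unnamed: 0` columns afterward, so we discard them.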

In [18]:
tokenized_datasets = datasets.map(tokenize_function, batched=False, num_proc=1, remove_columns=["Title","Unnamed: 0"])


In [19]:
datasets["train"][10]

{'Unnamed: 0': 6730,
 'Title': 'In corporate settings, LDAP is frequently used to manage user and group data as well as authentication and authorization. Numerous directory service products, such as Microsoft Active Directory, OpenLDAP, and Novell eDirectory, enable it.'}

If we now look at an element of our datasets, we will see the texts have been replaced by the `input_ids` the model will need:

In [20]:
tokenized_datasets["train"][10]

{'input_ids': [1, 512, 17266, 403, 6055, 29892, 365, 29928, 3301, 338, 13672, 1304, 304,
  10933, 1404, 322, 2318, 848, 408, 1532, 408, 10760, 322, 28733, 29889, 405,
  4680, 681, 3884, 2669, 9316, 29892, 1316, 408, 7783, 10731, 18862, 29892, 4673,
  10249, 3301, 29892, 322, 2864, 514, 321, 9882, 29892, 9025, 372, 29889],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a much smaller value of just 128.

In [21]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [22]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [23]:
tokenized_datasets["train"]

Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 7145
})

First note that we duplicate the inputs for our labels. This is because the models of the 🤗 Transformers library apply the shifting themselves, so we don't need to do it manually.
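
To make the shifting concrete, here is a simplified pure-Python sketch of what happens when the labels are a copy of the inputs: the prediction made at position i is scored against the label at position i+1. The function names and the toy probabilities are hypothetical; the real models do this on tensors of logits.

```python
import math


def shifted_pairs(input_ids):
    """Pair each position with the token that follows it (its prediction target)."""
    return list(zip(input_ids[:-1], input_ids[1:]))


def causal_lm_loss(step_probs, labels):
    """Mean negative log-likelihood, where step_probs[i] is the distribution predicted at position i."""
    targets = labels[1:]           # the first label has no preceding prediction
    predictions = step_probs[:-1]  # the last prediction has no following label
    nll = [-math.log(probs[target]) for probs, target in zip(predictions, targets)]
    return sum(nll) / len(nll)


input_ids = [0, 2, 1]     # toy sequence over a 3-token vocabulary
labels = list(input_ids)  # labels are a plain copy of the inputs
step_probs = [
    [0.1, 0.1, 0.8],  # after token 0, the model favors token 2
    [0.2, 0.7, 0.1],  # after token 2, the model favors token 1
    [1 / 3, 1 / 3, 1 / 3],  # last position: ignored, it has no target
]

print(shifted_pairs(input_ids))
print(causal_lm_loss(step_probs, labels))
```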

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed slower). You can also speed up the preprocessing by using multiprocessing:

In [24]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    num_proc=4,
)


And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [25]:
tokenizer.decode(lm_datasets["train"][10]["input_ids"])

'read-only memory).<s> Step 4: And, at last, the server connected to other servers and then share the files to the client.<s> This collision signal is received by all the stations on that link. Then,<s> The protocols necessary to implement the mesh topology are present in the three layers of the OSI model. These protocols define the standards required to communicate between two nodes. The protocols can be defined as the set of rules that must be implemented to facilitate communication between devices.<s> There are two types of a network element in the router which are as follows:<s> ATM by choice provides the networking'

In [26]:
tokenizer.decode(lm_datasets["val"][10]["input_ids"])

"must perform remote login.<s> Online Transactions: Voice biometric authentication can be used to authenticate online transactions and reduce the risk of fraudulent transactions.<s> Authorization and Authentication Attacks: Attacks on authorization and authentication can happen when a hacker gets past the security system and accesses information they are not allowed to see. For instance, attackers may get access to user data by abusing improperly configured authentication systems, exposing sensitive data. To stop such attacks, it's crucial to create strong authentication procedures.<s> The installation is very easy, and it also provides a high data transmission rate"

Now that the data has been prepared, we're ready to instantiate our `Trainer`. First we load the pretrained weights of our checkpoint, quantized to 4 bits with bitsandbytes to reduce GPU memory usage:

In [27]:
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             quantization_config=quant_config,
                                             torch_dtype=torch.bfloat16,
                                              device_map={"": 0})
model.config.use_cache = False
model.config.pretraining_tp = 1


And we will need some `TrainingArguments`:

In [29]:
from transformers import Trainer, TrainingArguments

In [30]:
training_args = TrainingArguments(
    f"Llama_CN_pretrain",
    evaluation_strategy = "steps",
    max_steps=100,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.01,
    push_to_hub=True
)

In [31]:
from peft import LoraConfig
peft_parameters = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"
)
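
To get a feel for what `r=8` means, the sketch below counts the parameters a LoRA adapter adds. It is back-of-the-envelope arithmetic under stated assumptions: a hidden size of 4096 and 32 decoder layers for Llama-2-7B, and adapters applied only to the query and value projections of each layer. The helper `lora_param_count` is hypothetical, not part of PEFT.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: matrix A is (r x d_in), matrix B is (d_out x r)."""
    return r * d_in + d_out * r


hidden_size = 4096     # assumption: Llama-2-7B hidden size
num_layers = 32        # assumption: number of decoder layers in Llama-2-7B
adapted_per_layer = 2  # assumption: only the query and value projections are adapted
r = 8                  # matches `r=8` in the LoraConfig above

per_matrix = lora_param_count(hidden_size, hidden_size, r)
total_lora = per_matrix * adapted_per_layer * num_layers
full_matrix = hidden_size * hidden_size

print(f"LoRA parameters per adapted matrix: {per_matrix:,}")
print(f"Parameters in the full matrix:      {full_matrix:,}")
print(f"Total trainable LoRA parameters:    {total_lora:,}")
```

Under these assumptions each adapted matrix gains 65,536 trainable parameters instead of the 16,777,216 that full fine-tuning would update, which is why training fits alongside the quantized base model.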


The last argument of `TrainingArguments`, `push_to_hub=True`, sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/gpt-finetuned-wikitext2"` or `"huggingface/gpt-finetuned-wikitext2"`).

We pass along all of those to the `Trainer` class:

In [32]:
model.add_adapter(peft_parameters)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["val"],
)



And we can train our model:

In [33]:
trainer.train()


Step,Training Loss,Validation Loss
10,3.0225,3.102206
20,2.8635,2.789849
30,2.6242,2.74398
40,2.6995,2.685394
50,2.4791,2.642572
60,2.5932,2.589226
70,2.4885,2.563111
80,2.494,2.550784
90,2.4346,2.536127
100,2.4207,2.5331


TrainOutput(global_step=100, training_loss=2.6119864082336424, metrics={'train_runtime': 2500.4313, 'train_samples_per_second': 0.32, 'train_steps_per_second': 0.04, 'total_flos': 4062128898048000.0, 'train_loss': 2.6119864082336424, 'epoch': 0.26})

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [34]:
import pandas as pd
df = pd.DataFrame(trainer.state.log_history)

csv_filename = "pretraining_history_llama_v.2.0.0.csv"
df.to_csv(csv_filename, index=False)

In [35]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 12.59
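
Perplexity is simply the exponential of the mean cross-entropy loss, so it can be recomputed from any logged validation loss. The short sketch below uses three validation losses copied from the training table above; the helper name `perplexity` is hypothetical.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)


# Validation losses copied from the training log above (steps 10, 50 and 100).
validation_losses = {10: 3.102206, 50: 2.642572, 100: 2.5331}

for step, loss in validation_losses.items():
    print(f"step {step:>3}: loss={loss:.4f}  perplexity={perplexity(loss):.2f}")
```

The last entry reproduces the perplexity of 12.59 reported by `trainer.evaluate()`.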


### Perplexity of CN related token prediction is good!!!

In [44]:
from transformers import TextStreamer

output_tokens = []

def stream():
    # query= '''Assuming a network with a maximum segment size (MSS) of 276 bytes, a round trip time (RTT) of 2300 milliseconds, and a packet loss rate of 0.01, determine the TCP throughput of this network.'''
    prompt = f"Explain the setup, data transfer and tear-down phases"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
    streamer = TextStreamer(tokenizer)

    for _ in trainer.model.generate(**inputs, streamer=streamer, max_new_tokens=300):
       output_tokens.append(_.cpu().numpy())
    # Decode generated tokens into text
    output_text = tokenizer.decode(output_tokens[0])
    return output_text

In [45]:
stream()

<s> Explain the setup, data transfer and tear-down phases of the TCP connection.ϵ. In the absence of a network, the two nodes can communicate directly through a point-to-point connection. In a network, the two nodes are connected through a network, and the data is transmitted from one node to the other through the network. The network can be a LAN, WAN, or MAN. The data transfer phase is the third phase of the TCP connection. In this phase, the data is transferred from the sender to the receiver. The sender sends the data in the form of packets, and the receiver receives the packets and reassembles them to get the original data. The tear-down phase is the fourth phase of the TCP connection. In this phase, the connection between the sender and the receiver is terminated. This is done by sending a FIN packet from the sender to the receiver, and then sending an ACK packet from the receiver to the sender.<s> The sender then sends an ACK packet to the receiver, which confirms the terminatio


In [42]:
from transformers import TextStreamer

output_tokens = []

def stream():
    # query= '''Assuming a network with a maximum segment size (MSS) of 276 bytes, a round trip time (RTT) of 2300 milliseconds, and a packet loss rate of 0.01, determine the TCP throughput of this network.'''
    prompt = f"Explain TCP and it's phases"

    inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
    streamer = TextStreamer(tokenizer)

    for _ in trainer.model.generate(**inputs, streamer=streamer, max_new_tokens=300):
       output_tokens.append(_.cpu().numpy())
    # Decode generated tokens into text
    output_text = tokenizer.decode(output_tokens[0])
    return output_text

In [43]:
stream()

<s> Explain TCP and it's phases
everybody can understand it.
TCP is a connection-oriented protocol which means that a connection is established between the source and destination before sending any data. The connection is established using a three-way handshake between the source, destination, and ARP (Address Resolution Protocol).<s>Љ<s>.<s>.<s>Ъ.
<s> <s> (1) The source sends a SYN (synchronize) packet to the destination.
<s>- The destination acknowledges the packet by sending an ACK (acknowledgment) packet.
- The source sends an ACK packet to the destination.
- The destination acknowledges the packet by sending an ACK packet.
- The source sends the data packet to the destination.
- The destination acknowledges the packet by sending an ACK packet.<s>
- The source acknowledges the packet by sending an ACK packet.
<s> <s> <s> (2) The destination sends an ACK packet to the source.
- The source acknowledges the packet by sending an ACK packet.
- The source sends the data packet to the des

The perplexity can still be improved, since for this demo we trained on a small dataset for only 100 steps (about a quarter of an epoch). For real LM training, you would need a larger dataset and more epochs.

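You can now save the result of the training. In this run, calling `save_pretrained` directly on the model raised the error shown below, so the cell after it saves the adapter through the `Trainer` and merges it into the base model instead: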

In [None]:
trainer.model.save_pretrained("llama-temp")



UnboundLocalError: local variable 'active_adapters' referenced before assignment

In [46]:
from peft import PeftModel
# Save the adapter
trainer.save_model('/content/pretrained_llama_chat_2b_conv')

# Retrieve the model
model_base = trainer.model.base_model

# Loading the adapter
model_new = PeftModel.from_pretrained(model_base, '/content/pretrained_llama_chat_2b_conv', torch_dtype=torch.float16, device_map="cuda")

# Merge the base model and the adapter
model_new = model_new.merge_and_unload()

# Save the overall model
model_new.save_pretrained('/content/pretrained_llama_chat')






In [47]:
# Push the locally saved model to a Hugging Face model repository.
# No token is passed here: `HfApi` falls back to the credentials stored by `notebook_login()`.
# Never hard-code an access token in a notebook, and revoke any token that has been exposed.
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
    folder_path="/content/pretrained_llama_chat",
    repo_id="VikrantRamesh/llama_CN_pretrain",
    repo_type="model",
)


CommitInfo(commit_url='https://huggingface.co/VikrantRamesh/llama_CN_pretrain/commit/a0046579f2b1fdbe29d521507cf8ad99ae9ec49a', commit_message='Upload folder using huggingface_hub', commit_description='', oid='a0046579f2b1fdbe29d521507cf8ad99ae9ec49a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# trainer.push_to_hub()




CommitInfo(commit_url='https://huggingface.co/VikrantRamesh/Llama-2-CN/commit/43e9c9b8219c3f93d2ed59aacc4e273f795be23b', commit_message='End of training', commit_description='', oid='43e9c9b8219c3f93d2ed59aacc4e273f795be23b', pr_url=None, pr_revision=None, pr_num=None)


You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sgugger/my-awesome-model")
```

## Masked language modeling

For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them by `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the `[MASK]` token is among the special tokens you passed during training!
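
The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not the data collator shipped with Transformers: it always substitutes `[MASK]`, whereas the library's collator also leaves some selected tokens unchanged or swaps them for random ones. The function name `mask_tokens` and the example sentence are hypothetical; `-100` is the label value that the cross-entropy loss ignores.

```python
import random

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def mask_tokens(tokens, mask_token="[MASK]", mlm_probability=0.15, rng=None):
    """Randomly replace tokens by mask_token; labels keep the original token only where masked."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mlm_probability:
            inputs.append(mask_token)    # the model sees the mask...
            labels.append(token)         # ...and must predict the original token
        else:
            inputs.append(token)
            labels.append(IGNORE_INDEX)  # non-masked positions do not contribute to the loss
    return inputs, labels


tokens = "the router forwards each packet to the next hop".split()
inputs, labels = mask_tokens(tokens, mlm_probability=0.3, rng=random.Random(0))
print(inputs)
print(labels)
```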

We will use the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead. For the tokenizer, replace the checkpoint with the one you trained.

In [None]:
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"

We can apply the same tokenization function as before; we just need to update our tokenizer to use the checkpoint we just picked:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["Title"])

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["Title","Unnamed: 0"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


And like before, we group texts together and chunk them in samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [None]:
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
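To make the chunking concrete, here is a minimal sketch of the same grouping logic run on a toy batch. The function name `group_texts_demo`, the explicit `block_size` argument and the token ids are made up for illustration; the body follows `group_texts` above.

```python
def group_texts_demo(examples, block_size):
    # Concatenate every list in the batch, column by column.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Cut each column into consecutive chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Hypothetical batch of three tokenized texts (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With a block size of 4, the 10 tokens give two full blocks and the last two tokens (9 and 10) are dropped. Note that a block can span the boundary between two original texts.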

The rest is very similar to what we had, with two exceptions. First we use a model suitable for masked LM:

In [None]:
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_config(config)

We redefine our `TrainingArguments`:

In [None]:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    "test-clm",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
    push_to_hub_model_id=f"{model_checkpoint}-wikitext2",
)



Like before, the last two arguments are to set everything up so we can push the model to the [Hub](https://huggingface.co/models) at the end of training. Remove the two of them if you didn't follow the installation steps at the top of the notebook; otherwise you can change the value of `push_to_hub_model_id` to something you would prefer.

Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them into tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to do the random masking. We could do it as a pre-processing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
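To give an idea of what happens to one sample inside such a collator, here is a rough pure-Python sketch of BERT-style random masking. It is an illustration under stated assumptions, not the library's actual implementation: `mask_tokens`, its parameters and the token ids are hypothetical. The 80/10/10 split between `[MASK]`, a random token and the unchanged token follows the scheme described for BERT.

```python
import random

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so these
# positions do not contribute to the loss.
IGNORE_INDEX = -100


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (masked_input_ids, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        # Never select special tokens; select the others with mlm_probability.
        if token_id in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token_id  # the model has to predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace by the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace by a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Hypothetical ids; a real run would take them from the tokenizer.
sample = list(range(1000, 1020))
first_pass = mask_tokens(sample, mask_token_id=4, vocab_size=30000, rng=random.Random(0))
second_pass = mask_tokens(sample, mask_token_id=4, vocab_size=30000, rng=random.Random(1))
print(first_pass[1])
print(second_pass[1])
```

Because the selection is drawn again on every call, two passes over the same sample generally mask different positions, which is the reason for doing this in the collator rather than in preprocessing.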

Then we just have to pass everything to `Trainer` and begin training:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["train"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,8.267723
2,No log,7.906465
3,No log,7.762846


TrainOutput(global_step=153, training_loss=8.179368361928105, metrics={'train_runtime': 47.831, 'train_samples_per_second': 25.277, 'train_steps_per_second': 3.199, 'total_flos': 79552237759488.0, 'train_loss': 8.179368361928105, 'epoch': 3.0})

Like before, we can evaluate our model. Note that `eval_dataset` above is the `train` split, so this is not a held-out validation score. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 2445.09


The perplexity is still quite high since for this demo we trained on a small dataset for a small number of epochs. For a real LM training, you would need a larger dataset and more epochs.
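As a small worked example of the formula used above, perplexity is the exponential of the mean cross-entropy loss. The sketch below checks the reference point of a model that guesses uniformly over the vocabulary; the `perplexity` helper and the vocabulary size are hypothetical values chosen for illustration.

```python
import math


def perplexity(mean_cross_entropy):
    # Perplexity is exp(mean cross-entropy), with the loss in nats.
    return math.exp(mean_cross_entropy)


vocab_size = 1000  # hypothetical vocabulary size
# A model that assigns probability 1/vocab_size to every token has loss log(vocab_size).
uniform_loss = math.log(vocab_size)
print(round(perplexity(uniform_loss)))  # 1000
```

A uniform guesser therefore has a perplexity equal to the vocabulary size, which gives a scale against which to read the number above. Since the collator draws a new random mask on every pass, the loss returned by `trainer.evaluate()` need not match the last logged validation loss exactly.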

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

events.out.tfevents.1709358388.ddedf55b6789.5652.0:   0%|          | 0.00/4.18k [00:00<?, ?B/s]

events.out.tfevents.1709358430.ddedf55b6789.5652.1:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

Upload 5 LFS files:   0%|          | 0/5 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

events.out.tfevents.1709358481.ddedf55b6789.5652.2:   0%|          | 0.00/630 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/VikrantRamesh/bert-base-cased-wikitext2/commit/733a5a6aaa21fbbbfd94ac69429ca23bacdaee38', commit_message='End of training', commit_description='', oid='733a5a6aaa21fbbbfd94ac69429ca23bacdaee38', pr_url=None, pr_revision=None, pr_num=None)

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sgugger/my-awesome-model")
```

In [None]:
!pip install -i https://pypi.org/simple/ bitsandbytes
!pip install -U accelerate

Looking in indexes: https://pypi.org/simple/


In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoModelForMaskedLM,  # used below to load the base model
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)

In [None]:
import torch

# Model and tokenizer names
base_model_name = "VikrantRamesh/bert-base-cased-wikitext2"
refined_model = "metaMath_CN" #You can give it your own name

# Tokenizer
# Use the token stored by notebook_login() rather than hard-coding a secret in the notebook
llama_tokenizer = AutoTokenizer.from_pretrained("sgugger/gpt2-like-tokenizer", use_auth_token=True, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"  # Fix for fp16


# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Model
base_model = AutoModelForMaskedLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0},
    use_auth_token=True,  # reuse the token stored by notebook_login()
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

  return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  **kwargs,


ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes: `pip install -i https://pypi.org/simple/ bitsandbytes`

In [None]:
!pip install bitsandbytes-cuda116

Collecting bitsandbytes-cuda116
  Downloading bitsandbytes_cuda116-0.26.0.post2-py3-none-any.whl (4.3 MB)
Installing collected packages: bitsandbytes-cuda116
Successfully installed bitsandbytes-cuda116-0.26.0.post2


In [None]:
from transformers import TextStreamer

output_tokens = []

def stream():
    # query= '''Assuming a network with a maximum segment size (MSS) of 276 bytes, a round trip time (RTT) of 2300 milliseconds, and a packet loss rate of 0.01, determine the TCP throughput of this network.'''
    prompt = "Explain the setup, data transfer and tear-down phases"

    inputs = llama_tokenizer([prompt], return_tensors="pt").to("cuda:0")
    streamer = TextStreamer(llama_tokenizer)

    # NOTE: `EosListStoppingCriteria` is not defined in this cell; it is assumed
    # to be defined in an earlier cell.
    generated = base_model.generate(
        **inputs,
        streamer=streamer,
        max_new_tokens=500,
        stopping_criteria=[EosListStoppingCriteria()],
        eos_token_id=llama_tokenizer.convert_tokens_to_ids("####"),
    )
    # `generate` returns one row of token ids per prompt.
    for sequence in generated:
        output_tokens.append(sequence.cpu().numpy())
    # Decode generated tokens into text
    output_text = llama_tokenizer.decode(output_tokens[0])
    return output_text

In [None]:
output_text = stream()