- **Author:** **Kandimalla Hemanth**
- **Date modified:** **1-13-2024**
- **E-mail:** **speechcodehemanth2@gmail.com**


# Cloud TPU v5p Overview

Cloud TPU v5p is Google Cloud's fifth-generation TPU (Tensor Processing Unit) and the successor to TPU v4. It is designed for large-scale training and targets foundational Large Language Models (LLMs), diffusion models, and generative Artificial Intelligence (AI).

## Performance Enhancements

At a high level, Cloud TPU v5p delivers up to 2x the performance of its predecessor, the v4, which makes it better suited to the compute-intensive work of AI model training.

## Enhanced Scalability

Cloud TPU v5p packs 2x more TPUs into a Pod. The largest v4 slice holds about 3k TPUs, while v5p scales up to about 6k. Combined with the up to 2x per-chip gain, this yields up to 4x the performance at the Pod level.
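
The Pod-level figure follows from multiplying the per-chip gain by the increase in chip count. A minimal sketch of that arithmetic, using the rounded slice sizes quoted in this section; the function name `pod_level_speedup` is illustrative, and the calculation assumes ideal linear scaling, which real workloads rarely reach:

```python
def pod_level_speedup(per_chip_speedup, chips_new, chips_old):
    """Ideal Pod-level speedup: per-chip gain times the growth in chip count."""
    return per_chip_speedup * (chips_new / chips_old)


# Rounded figures from the text: up to 2x per chip, ~6k vs ~3k TPUs per slice
speedup = pod_level_speedup(per_chip_speedup=2.0, chips_new=6000, chips_old=3000)
print(f"Ideal Pod-level speedup: {speedup:.1f}x")  # prints "Ideal Pod-level speedup: 4.0x"
```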

## Increased Clock Frequency

Cloud TPU v5p runs at a higher clock frequency: 1.75 GHz versus 1.05 GHz for the v4. The higher clock contributes to faster processing of complex AI workloads.
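
As a rough check, the two clock speeds can be compared directly. The sketch below is back-of-the-envelope arithmetic on the numbers quoted in this document, not a published performance breakdown; it only shows that the clock increase by itself is smaller than the "up to 2x" figure:

```python
V4_CLOCK_GHZ = 1.05
V5P_CLOCK_GHZ = 1.75

# Ratio of the two clock frequencies
clock_ratio = V5P_CLOCK_GHZ / V4_CLOCK_GHZ
print(f"Clock ratio: {clock_ratio:.2f}x")  # prints "Clock ratio: 1.67x"

# If the overall gain is taken as 2x, the part not explained by clock speed
remaining_factor = 2.0 / clock_ratio
print(f"Remaining factor: {remaining_factor:.2f}x")  # prints "Remaining factor: 1.20x"
```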

## SparseCore Integration

Cloud TPU v5p adds SparseCore, which is designed specifically for large-scale embeddings. It addresses the growing need to handle very large amounts of data efficiently in AI applications, particularly in workloads where sparse data structures are prevalent.
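
To make "sparse" concrete: an embedding lookup reads only a handful of rows from a very large table. The pure-Python sketch below illustrates that access pattern with a sum-pooled lookup. It is not SparseCore code or any TPU API; the function name `embedding_bag` and the toy table are made up for illustration:

```python
def embedding_bag(table, ids):
    """Sum the embedding rows selected by `ids` (a sparse gather plus a reduce)."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for row_id in ids:
        for d in range(dim):
            pooled[d] += table[row_id][d]
    return pooled


# Toy table with 4 rows of dimension 2; real tables can have millions of rows
table = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

# Only rows 1 and 3 are touched; the rest of the table is never read
print(embedding_bag(table, [1, 3]))  # prints [8.0, 10.0]
```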

## Expanded High Bandwidth Memory (HBM) Capacity

Cloud TPU v5p triples the High Bandwidth Memory (HBM) capacity compared to the v4. The expansion gives workloads more memory with fast access, which helps data handling and manipulation.

- Cloud TPU v5p is a significant step forward in AI hardware, with an emphasis on performance, scalability, and specialized features for the demands of modern AI model development.


In [None]:
!pip install -q -U transformers datasets

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import multiprocessing

# Load the dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")

# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Extract the split name
split_name = list(dataset.keys())[0]

# Get the list of columns
columns = dataset[split_name].column_names

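if len(columns) <= 1:
    # Assumed handling (the original body of this check is missing): later cells
    # use columns[0] as the text column and columns[1] as the label column
    raise ValueError(f"Expected at least two columns, got {columns}")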


In [None]:
print(split_name,columns)

In [None]:
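# Pick the text and label columns from the first split of the loaded dataset
split_name = list(dataset.keys())[0]
columns = dataset[split_name].column_names
num_columns = len(columns)  # number of columns, not a batch size
text_column = columns[0]
label_column = columns[1]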

In [None]:
from datasets import load_dataset

dataset = load_dataset('csv', data_files='/content/drive/MyDrive/training.csv')

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import multiprocessing
import torch

# Load the dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")

# dataset = load_dataset('csv', data_files='/content/drive/MyDrive/training.csv')

# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Use the first split to discover the column names
split_name = list(dataset.keys())[0]
print(split_name)
columns = dataset[split_name].column_names
print(columns)

# Define the maximum sequence length
max_length = 512

def preprocess_function(examples):
    tokenized_outputs = {}
    for column in columns:
        # Ensure that every input to the tokenizer is a string
        column_text = [str(item) for item in examples[column]]
        # Do not pad here: padding each column to max_length would push every
        # later column past the truncation point once the columns are concatenated
        column_tokens = tokenizer(column_text, truncation=True, max_length=max_length)
        tokenized_outputs[column] = column_tokens['input_ids']

    # Build input_ids and attention_mask row by row from the concatenated columns
    input_ids = []
    attention_mask = []
    for i in range(len(tokenized_outputs[columns[0]])):  # Number of examples
        row_tokens = []
        for column in columns:
            row_tokens.extend(tokenized_outputs[column][i])
        row_tokens = row_tokens[:max_length]  # Truncate to max_length if necessary
        num_padding = max_length - len(row_tokens)
        # Mark real tokens with 1 and padding with 0; this stays correct even
        # though GPT-2 reuses the EOS token id as the padding id
        attention_mask.append([1] * len(row_tokens) + [0] * num_padding)
        input_ids.append(row_tokens + [tokenizer.pad_token_id] * num_padding)  # Pad if necessary

    # Convert input_ids and attention_mask to tensors
    model_inputs = {'input_ids': torch.tensor(input_ids), 'attention_mask': torch.tensor(attention_mask)}

    return model_inputs

processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=16,  # Example of setting batch size to 16
    num_proc=multiprocessing.cpu_count(),
    remove_columns=columns,  # Drop the original columns so only 'input_ids' and 'attention_mask' remain
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)
# Now 'processed_datasets' should be ready to feed to a model for training
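
The concatenate, truncate, and pad steps in the cell above can be checked in isolation on made-up token ids, without downloading a tokenizer or dataset. This is a minimal sketch; the helper name `concat_truncate_pad` and the token values are hypothetical. It builds the attention mask from lengths rather than by comparing against the padding id, because with GPT-2 the padding id is the same as the EOS id:

```python
def concat_truncate_pad(column_token_ids, max_length, pad_token_id):
    """Concatenate per-column token ids, then truncate and right-pad to max_length."""
    row_tokens = []
    for token_ids in column_token_ids:
        row_tokens.extend(token_ids)
    row_tokens = row_tokens[:max_length]
    num_padding = max_length - len(row_tokens)
    input_ids = row_tokens + [pad_token_id] * num_padding
    attention_mask = [1] * len(row_tokens) + [0] * num_padding
    return input_ids, attention_mask


# Two columns with 2 and 3 unpadded tokens each
ids, mask = concat_truncate_pad([[11, 12], [21, 22, 23]], max_length=8, pad_token_id=0)
print(ids)   # prints [11, 12, 21, 22, 23, 0, 0, 0]
print(mask)  # prints [1, 1, 1, 1, 1, 0, 0, 0]
```

Note that if each column were padded to `max_length` before concatenation, the first column alone would fill the row and every later column would be truncated away.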

In [None]:
print(len(processed_datasets['train']['input_ids']))

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import multiprocessing
import torch

# Load the dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")

# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Get the list of columns for the first split
columns = dataset[str(list(dataset.keys())[0])].column_names

# Define the maximum sequence length
max_length = 512

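def preprocess_function(examples):
    # `examples` is already a batch from a single split: a dict mapping each
    # column name to a list of values, so no split lookup is needed here
    # Tokenize each column separately and concatenate their token IDs.
    tokenized_outputs = {}
    for column in columns:
        # Cast to str so non-string values do not break the tokenizer, and skip
        # padding here so later columns survive the truncation to max_length
        column_text = [str(item) for item in examples[column]]
        column_tokens = tokenizer(column_text, truncation=True, max_length=max_length)
        tokenized_outputs[column] = column_tokens['input_ids']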

    # Create a tensor for input_ids by concatenating the tokens from each column
    input_ids = []
    for i in range(len(tokenized_outputs[columns[0]])): # Number of examples
        row_tokens = []
        for column in columns:
            row_tokens.extend(tokenized_outputs[column][i])
        row_tokens = row_tokens[:max_length]  # Truncate to max_length if necessary
        row_tokens += [tokenizer.pad_token_id] * (max_length - len(row_tokens))  # Pad if necessary
        input_ids.append(row_tokens)

    # Create a tensor for attention_mask based on input_ids
    attention_mask = [[int(token_id != tokenizer.pad_token_id) for token_id in input_row] for input_row in input_ids]

    # Convert input_ids and attention_mask to tensors
    model_inputs = {'input_ids': torch.tensor(input_ids), 'attention_mask': torch.tensor(attention_mask)}

    return model_inputs

# Map the preprocessing function across the entire dataset
processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=multiprocessing.cpu_count(),
    remove_columns=columns,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)

# Now 'processed_datasets' should be ready to feed to a model for training

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer
import multiprocessing

# Load the dataset
dataset = load_dataset("fka/awesome-chatgpt-prompts")

# Define the tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Extract the split name
split_name = list(dataset.keys())[0]

# Get the list of columns
columns = dataset[split_name].column_names

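if len(columns) <= 1:
    # Assumed handling (the original body of this check is missing): the
    # preprocessing below needs a text column and a label column
    raise ValueError(f"Expected at least two columns, got {columns}")

max_length = 512
import torch

text_column = columns[0]
label_column = columns[1]

def preprocess_function(examples):
    # `examples` is a batch: a dict mapping each column name to a list of values
    batch_size = len(examples[text_column])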
    inputs = [f"{text_column} : {x} Label : " for x in examples[text_column]]
    targets = [str(x) for x in examples[label_column]]
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets, add_special_tokens=False)  # don't add bos token because we concatenate with inputs
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id]
        # print(i, sample_input_ids, label_input_ids)
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])
    # print(model_inputs)
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


processed_datasets = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=multiprocessing.cpu_count(),
    remove_columns=dataset["train"].column_names,
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)
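
The label construction in `preprocess_function` can be followed on a single toy example. The sketch below mirrors its steps in pure Python: append EOS to the target, mask the prompt positions with `-100`, then left-pad everything to `max_length`. The helper name `build_causal_lm_example` and the token ids are hypothetical. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so masked positions do not contribute to the loss:

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_causal_lm_example(prompt_ids, target_ids, eos_token_id, pad_token_id, max_length):
    """Build left-padded input_ids, attention_mask, and labels for one example."""
    target_ids = target_ids + [eos_token_id]
    input_ids = prompt_ids + target_ids
    # Only the target tokens (and the EOS) are scored
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    attention_mask = [1] * len(input_ids)

    # Left-pad to max_length, then truncate in case the example is too long
    num_padding = max(max_length - len(input_ids), 0)
    input_ids = ([pad_token_id] * num_padding + input_ids)[:max_length]
    attention_mask = ([0] * num_padding + attention_mask)[:max_length]
    labels = ([IGNORE_INDEX] * num_padding + labels)[:max_length]
    return input_ids, attention_mask, labels


# As with GPT-2 in this notebook, the padding id equals the EOS id
ids, mask, labels = build_causal_lm_example(
    prompt_ids=[5, 6, 7], target_ids=[8, 9], eos_token_id=2, pad_token_id=2, max_length=8
)
print(ids)     # prints [2, 2, 5, 6, 7, 8, 9, 2]
print(mask)    # prints [0, 0, 1, 1, 1, 1, 1, 1]
print(labels)  # prints [-100, -100, -100, -100, -100, 8, 9, 2]
```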