# Traning Data for GEN4 Finetuning 
We train GEN4 with 100% synthetic data thus we can just use the data/sd/gen3/gen3_sd.txt which is about the same size as the human data file with which we trained GEN0 data/hd/combined0/train_combined0.txt.

In [2]:
import random
import os

def read_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read().strip().split('\n\n')
    return data

def write_data(data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write('\n\n'.join(data))

def get_file_size(filename):
    """Returns the size of the file in bytes."""
    return os.path.getsize(filename)


In [3]:
print("Original GEN0 HD Size: ", get_file_size("data/hd/combined0/train_combined0.txt"))
print("Training Dataset for GEN3 Size: ", get_file_size("data/sd/gen3/gen3_sd.txt"))

Original GEN0 HD Size:  307558317
Training Dataset for GEN3 Size:  307561876


# Finetuning

Finetune the model GEN4 for 5 epochs with 100% synthetic data from GEN3.  
The global_step parameter in "models/distilgpt2-finetuned_gen3_25/trainer_state.json" and "models/distilgpt2-finetuned_gen3_25/checkpoint-10325/trainer_state.json" is originally "global_step": 10010, but in order to force run_clm.py to  
train the model from the last checkpoint with the new dataset we set this parameter to 0 to force it to learn from the entirety of the new dataset. 

In [4]:
!deepspeed run_clm.py \
    --model_name_or_path distilgpt2 \
    --train_file data/sd/gen3/gen3_sd.txt \
    --validation_file data/hd/initial_combined/valid_combined.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/distilgpt2-finetuned_gen4_0 \
    --num_train_epochs 5 \
    --save_strategy epoch \
    --learning_rate 5e-5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --deepspeed ds_config.json \
    --resume_from_checkpoint ./models/distilgpt2-finetuned_gen3_25/checkpoint-10010


[2024-04-23 09:10:57,060] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-23 09:10:57,774] [INFO] [runner.py:568:main] cmd = /home/vasi/Documents/BA_Thesis_Experiment/.venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_clm.py --model_name_or_path distilgpt2 --train_file data/sd/gen3/gen3_sd.txt --validation_file data/hd/initial_combined/valid_combined.txt --do_train --do_eval --output_dir ./models/distilgpt2-finetuned_gen4_0 --num_train_epochs 5 --save_strategy epoch --learning_rate 5e-5 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --deepspeed ds_config.json --resume_from_checkpoint ./models/distilgpt2-finetuned_gen3_25/checkpoint-10010
[2024-04-23 09:10:59,585] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-23 09:11:00,133] [INFO] [launch.py:14

# Inference
### Step 3
Let the model generate 100 stories from the first 100 prompts of the original test.wp-source file containing human prompts.

In [2]:
with open("data/hd/prepro/test.wp_source") as pfile:
    prompts = pfile.readlines()

prompts = [prompt[6:] for prompt in prompts][:100] # clear starting characters and limit set to first 100 prompts
print(len(prompts))

100


In [4]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "./models/distilgpt2-finetuned_gen4_0/checkpoint-9655" 
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)


for prompt in prompts:
    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")


    # Generate text
    output_sequences = model.generate(
        input_ids=inputs,
        attention_mask=None,
        max_length=500,  # determines the maximum length of the generated text
        temperature=0.7,  # controls randomness: lower values make text less random
        top_k=50,  # the K most likely next words are considered for each step
        top_p=0.9,  # only the most probable tokens with probabilities that add up to top_p are considered for each step
        repetition_penalty=1.2,  # penalty applied to repeated words
        do_sample=True,  # set to True to return diverse samples
        num_return_sequences=1,  # number of independently computed samples to generate
        pad_token_id=tokenizer.eos_token_id,
    )


       # Decode the output sequences to get the generated text
    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

    generated_story = generated_text.split("\n")[1]

    with open("./outputs/gen4/stories4.txt", "a") as f:
        f.write(generated_story + "\n\n")

# Evaluation
### Step 5
Evaluate each story in the stories output file on all metrics and write values to df with last line being the average values of each column.
Metrics represent the model's average for all stories it generated.

In [1]:
from metrics.LexicalDiversity.lexical_diversity import *
from metrics.SemanticDiversity.sementic_diversity import *
from metrics.SyntacticDiversity.syntactic_diversity import *
from nltk.tokenize import sent_tokenize
import pandas as pd
import spacy

# Define the column names
columns = ["Distinct-2", "Distinct-3", "Self-BLEU", "OV-TTR", "MS-TTR", "S-DIV-AV", "S-DIV-C", "SYN-DIV"]

# Create an empty DataFrame with these columns
df_eval_gen1 = pd.DataFrame(columns=columns)

# Load a spaCy model for dependency parsing
nlp = spacy.load("en_core_web_sm")

with open("./outputs/gen4/stories4.txt", 'r') as f:
    stories = f.read().split("\n\n")

for story in stories:
    print(stories.index(story))
    #print(story)
    # Tokenize the text into sentences
    sentences = sent_tokenize(story)
    graphs = construct_dependency_graphs(sentences)


    # Tokenize the text into sentences
    sentences = sent_tokenize(story)

    

    graphs = construct_dependency_graphs(sentences)

    # Example usage: Adding a new row of data to the DataFrame
    new_data = {
        "Distinct-2": calculate_distinct_n(story, 2),
        "Distinct-3": calculate_distinct_n(story, 3),
        "Self-BLEU": 1-calculate_self_bleu(sentences),
        "OV-TTR": calculate_ttr(story, truncate_length=300),
        "MS-TTR": calculate_mean_segmental_ttr(story, segment_size=50),
        "S-DIV-AV": calculate_semantic_diversity(sentences, 'average'),
        "S-DIV-C": calculate_semantic_diversity(sentences, 'centroid'),
        "SYN-DIV": calculate_syntactic_diversity(graphs)
    }

    # Convert new_data dictionary to a DataFrame
    new_row_df = pd.DataFrame([new_data])

    # Concatenate the new row DataFrame to the original DataFrame
    df_eval_gen1 = pd.concat([df_eval_gen1, new_row_df], ignore_index=True)
        

# Calculate the mean for each column and append as a new row
averages = df_eval_gen1.mean().to_dict()
averages = {key: [value] for key, value in averages.items()}  # Convert each mean value into a list
average_df = pd.DataFrame(averages)  # Create a DataFrame for the averages
average_df.index = ['Average']  # Label the index as 'Average'

# Append the average row to the original DataFrame
df = pd.concat([df_eval_gen1, average_df])

# Specify the file path and name
file_path = './outputs/gen4/eval_table_gen4.csv'

# Write the DataFrame to a CSV file
df.to_csv(file_path, index=False)  # Set index=False to not include row indices in the file

print(f"Data has been written to {file_path}")
# Print the last row (average values)
print("Average values for each metric:")
print(df.iloc[-1])



  from .autonotebook import tqdm as notebook_tqdm


0


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  df_eval_gen1 = pd.concat([df_eval_gen1, new_row_df], ignore_index=True)
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
Data has been written to ./outputs/gen4/eval_table_gen4.csv
Average values for each metric:
Distinct-2    0.976052
Distinct-3    0.983941
Self-BLEU     0.999391
OV-TTR        0.868769
MS-TTR        0.935354
S-DIV-AV      0.803510
S-DIV-C       0.458173
SYN-DIV       0.798070
Name: Average, dtype: float64


To create a dictionary that organizes words from your text file into thematic categories, we'll need to follow a process of topic modeling or clustering. Topic modeling, specifically Latent Dirichlet Allocation (LDA), can help identify latent topics in text by clustering words that commonly appear together. Here's a way to approach this:

Explanation

    Preprocessing: The text is cleaned to remove punctuation, numbers, and convert everything to lowercase.
    Document-Term Matrix: The cleaned text is converted into a document-term matrix using CountVectorizer.
    LDA Topic Modeling: The LDA model identifies clusters of words (topics) that commonly appear together.
    Extracting Topics: The top 10 words associated with each topic are selected to represent that topic.

This method provides a basic categorization of the text into themes without explicit labels. Each topic will be named as topic_0, topic_1, etc., and the associated keywords represent the thematic categories.

In [1]:
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Load and preprocess the text
with open("data/hd/prepro/test.wp_source") as pfile:
    prompts = pfile.readlines()

prompts = [prompt[6:] for prompt in prompts] # Remove starting characters
prompts = [re.sub(r'[^a-zA-Z\s]', '', prompt.lower()) for prompt in prompts] # Preprocess each prompt separately

# Convert the prompts into a document-term matrix
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(prompts)

# Fit LDA to find topics
n_topics = 100  # Adjust the number of topics based on experimentation
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(X)

# Extract the words associated with each topic
feature_names = vectorizer.get_feature_names_out()
category_keywords = {}

for topic_idx, topic in enumerate(lda.components_):
    keywords = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    category_keywords[f'topic_{topic_idx}'] = keywords

print(category_keywords)



{'topic_0': ['family', 'bathroom', 'knowing', 'turn', 'reason', 'lyrics', 'short', 'judge', 'story', 'songs'], 'topic_1': ['people', 'day', 'locked', 'member', 'marriage', 'just', 'life', 'bedroom', 'trying', 'living'], 'topic_2': ['god', 'dead', 'heaven', 'sins', 'does', 've', 'dying', 'like', 'life', 'alternate'], 'topic_3': ['door', 'lost', 'time', 'years', 'ancient', 'people', 'knock', 'buried', 'nt', 'ability'], 'topic_4': ['nt', 'day', 'ago', 'does', 'longer', 'live', 'world', 'away', 'people', 'left'], 'topic_5': ['life', 'time', 'suicide', 'people', 'commit', 'nt', 'lies', 'decides', 'man', 'end'], 'topic_6': ['changes', 'year', 'life', 'security', 'time', 'world', 'person', 'assassin', 'old', 'important'], 'topic_7': ['conversation', 'human', 'moon', 'water', 'man', 'wants', 'night', 'place', 'soft', 'girl'], 'topic_8': ['open', 'just', 'hair', 'black', 'stuck', 'hours', 'little', 'door', 'invented', 'nt'], 'topic_9': ['making', 'just', 'place', 'law', 've', 'hate', 'won', 'lo

To create a frequency reference corpus from a text file, you can read the file, tokenize the words, and calculate the frequency of each word. The frequencies will then be normalized to represent the proportion of each word's occurrence compared to the entire corpus. Here's how you can approach this:  
    Reading the File: The text file is read and processed as a single string.
    Preprocessing and Tokenization: Non-alphabetic characters are removed, and the text is converted to lowercase. The word_tokenize function is used to split the text into individual words.
    Counting Word Frequencies: A Counter is used to count the occurrence of each word.
    Calculating Normalized Frequencies: The word counts are normalized by dividing each word's count by the total number of words to obtain the frequency of each word as a proportion of the entire text.

This way, the reference_corpus_freq dictionary will contain the normalized frequencies of words from your text file, which can then be used to assess originality using the provided calculate_originality function.

The threshold in the originality function determines how frequently a word appears in the reference corpus to be considered unoriginal. Words that occur less frequently than this threshold are classified as "original" and contribute to the originality score.
Guidelines for Setting the Threshold

    Purpose of Analysis:
        High Originality Detection: If you want to identify very original texts, set a low threshold (e.g., 0.001). This will ensure that only very rare words count as original.
        Moderate Originality Detection: For less stringent originality detection, set a higher threshold (e.g., 0.01). This allows more words to be considered original.

    Size of Reference Corpus:
        Larger Corpus: If the reference corpus is large, a low threshold is appropriate, as it reflects more accurate frequency distributions.
        Smaller Corpus: If the reference corpus is small, the threshold might need to be higher, as word frequencies are less representative.

    Distribution of Frequencies:
        Analyze the distribution of word frequencies in the reference corpus. If most words have very low frequencies, consider a higher threshold to capture words with slight variations in frequency.
        A threshold set too high might classify too many words as original, skewing the results.

Experimentation

The threshold should ideally be determined through experimentation and evaluation:

    Analyze Different Thresholds: Run the originality function with different threshold values on sample texts and observe how the scores vary.
    Evaluation: If possible, compare the results against known originality standards or human judgments to find a threshold that best matches your requirements.

Example

    Lower Threshold (e.g., 0.001): Use this to identify texts with only very rare words, emphasizing high originality.
    Higher Threshold (e.g., 0.01): This value is more lenient and may classify texts with moderately rare words as original.

Ultimately, the threshold should align with your specific goals for originality assessment and the characteristics of your reference corpus.

In [2]:
import re
from collections import Counter
from nltk import word_tokenize

def create_reference_corpus_freq(file_path):
    # Load the text
    with open(file_path, 'r') as file:
        text = file.read()

    # Basic preprocessing: Remove non-alphabetic characters and make lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    
    # Tokenize the words
    words = word_tokenize(text)
    
    # Count word frequencies
    word_counts = Counter(words)
    total_words = sum(word_counts.values())
    
    # Calculate normalized frequencies
    reference_corpus_freq = {word: count / total_words for word, count in word_counts.items()}
    
    return reference_corpus_freq

# Example usage
file_path = 'data/hd/prepro/test.wp_target'
reference_corpus_freq = create_reference_corpus_freq(file_path)
print(reference_corpus_freq)

# You can then pass this `reference_corpus_freq` dictionary to your calculate_originality function.




1. MS-Jaccard

This metric measures diversity by calculating the Jaccard similarity between the n-grams of each generated story and every other story in the dataset.  
Using a pseudocount of 0.5 is an interesting choice for your specific use case where the primary objective is to investigate the differences in similarity scores between various generations of synthetic text and a base of human texts. This choice is particularly insightful because it moderates the impact of n-grams that do not overlap between the sets, providing a more stable and "neutral" base score, rather than skewing the results dramatically towards zero. This could help in achieving a more balanced comparison across model generations.
Rationale for Using 0.5 as a Pseudocount

    Balanced Impact: A pseudocount of 0.5 helps in ensuring that missing n-grams don't completely nullify the similarity scores, but rather contribute a moderate base value to the geometric mean calculation. This is beneficial in cases where you expect some level of inherent dissimilarity due to generational changes in model outputs but want to prevent those differences from being exaggerated by zeros.
    Neutral Base Score: By setting the pseudocount to 0.5, the impact of each missing n-gram on the overall score is effectively neutralized, enabling a focus on the n-grams that do exist in both sets. This approach ensures that the similarity score reflects meaningful linguistic features present in both texts rather than being overly penalized for differences.  
2. Feature-based Similarity

We will use a pre-trained model (like BERT) to extract embeddings for the stories and then compute cosine similarities between these embeddings to measure how diverse the stories are in terms of semantic content.  

3. Fluency: This metric assesses the quantity of relevant ideas generated. In text, this can be translated to the number of relevant responses or ideas mentioned.

4. Flexibility: This measures the variety of ideas or categories used. In text, it evaluates how many different themes or subjects are touched upon.

5. Originality: This evaluates the uniqueness of the ideas relative to a typical response. It often requires a larger dataset to determine what counts as "typical."

In [3]:
from metrics.diversity_quality_metrics import *
from metrics.flu_flex_ori import *
import pandas as pd

# Define the column names
columns = ["Jaccard-Sim-2", "Feature-Based-Sim", "Fluency", "Flexibility", "Originality"]

# Create an empty DataFrame with these columns
df_eval = pd.DataFrame(columns=columns)

with open("data/hd/initial_combined/test_combined.txt", "r") as file:
    real_content = file.read().strip()
    real_stories_prompts = real_content.split("\n\n") # use just the story without the prompt from the real data
    real_stories = [story.split("\n")[1] for story in real_stories_prompts]
#print(real_stories[0])

# Assuming 'stories0.txt' is formatted correctly as described:
with open('./outputs/gen4/stories4.txt', 'r') as file:
    synthed_content = file.read()
    synthed_stories = synthed_content.split('\n\n')  # Each story separated by two newlines


# Example usage: Adding a new row of data to the DataFrame
new_data = {
    "Jaccard-Sim-2": calculate_ms_jaccard(real_stories[:len(synthed_stories)], synthed_stories, n=2, pseudocount=0.5),
    "Feature-Based-Sim": calculate_feature_based_similarity(synthed_stories),
    "Fluency": calculate_fluency(synthed_stories),
    "Flexibility": calculate_flexibility(synthed_stories, category_keywords),
    "Originality": calculate_originality(synthed_stories, reference_corpus_freq, threshold=0.001)
}

# Convert new_data dictionary to a DataFrame
new_row_df = pd.DataFrame([new_data])
#print(new_row_df)
# Concatenate the new row DataFrame to the original DataFrame
df_eval = pd.concat([df_eval, new_row_df], ignore_index=True)

# Specify the file path and name
file_path = './outputs/gen4/eval_table_gen4_2.csv'

# Write the DataFrame to a CSV file
df_eval.to_csv(file_path, index=False)  # Set index=False to not include row indices in the file

print(f"Data has been written to {file_path}")
# Print the last row (average values)
print("Average values for each metric:")
print(df_eval.iloc[-1])




  from .autonotebook import tqdm as notebook_tqdm
[nltk_data] Downloading package stopwords to /home/vasi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[99, 88, 94, 95, 97, 98, 98, 96, 100, 98, 95, 99, 93, 94, 98, 91, 99, 94, 98, 97, 92, 94, 95, 96, 93, 99, 17, 93, 95, 99, 100, 94, 99, 91, 98, 98, 94, 95, 97, 97, 96, 94, 98, 91, 92, 87, 96, 97, 93, 92, 99, 95, 94, 94, 92, 96, 97, 84, 83, 93, 94, 95, 96, 94, 94, 96, 94, 97, 97, 91, 90, 93, 100, 96, 96, 96, 88, 99, 99, 93, 97, 98, 99, 97, 95, 96, 96, 99, 98, 98, 97, 95, 95, 91, 98, 99, 94, 90, 96, 99]
Data has been written to ./outputs/gen4/eval_table_gen4_2.csv
Average values for each metric:
Jaccard-Sim-2          0.454027
Feature-Based-Sim      0.888057
Fluency              212.320000
Flexibility           94.450000
Originality            0.684917
Name: 0, dtype: float64


  df_eval = pd.concat([df_eval, new_row_df], ignore_index=True)


# Generate Synthetic Data
### Step 6
From GEN 4 we change our data paradigm to a fully synthetic loop training each subsequent generation only with a synthetic dataset created by the previous generation. The dataset is still about the same size as RD0
Contribute to the synthetic dataset by producing stories from the finetuned model.
We use 100% of the original prompt data as our prompt list.

In [1]:
def count_entries(filepath):
    """Counts the number of double-newline-separated entries in a file."""
    with open(filepath, 'r', encoding='utf-8') as file:
        content = file.read().strip()
    return len(content.split('\n\n'))

total_entries = count_entries("./data/hd/combined0/train_combined0.txt")
print("Total entries in GEN0 Dataset with real data: ", total_entries)

sd_entries_count= total_entries
print("100% \of total entries = ", sd_entries_count)

Total entries in GEN0 Dataset with real data:  109040
100% \of total entries =  109040


In [5]:
import random
from transformers import pipeline, set_seed, GPT2LMHeadModel, GPT2Tokenizer
import os

prompts = []
prompt_files = ["train", "test"]
for name in prompt_files:
    # Path to the file with prompts
    file_path = './data/hd/prepro/'+name+'.wp_source'

    # Read prompts from the file removing the initials [ XX ]
    with open(file_path, 'r', encoding='utf-8') as file:
        prompts += ([line.strip()[7:] for line in file.readlines() if line.strip()])

#print(prompts[0:10])
# Randomly select 100% of the prompts
sample_size = len(prompts)
selected_prompts = random.sample(prompts, int(sd_entries_count))

print(len(selected_prompts))
print(selected_prompts[0:10])


109040
["After a brutal fight a dying enemy soldier grabs you by your clothes and forces on your hand a picture of him and his kids , while saying `` Take care of them . There is no one else . ''", "`` Where are you ? ''", 'You are being sorted into Hogwarts . As the sorting hat is placed on your head , it refuses to sort you , and pleads to the headmaster to expel you ...', "You wake up submerged in water with only a flashlight and a note . The note reads `` You 're now immortal . Welcome to the bottom of the Marianna Trench . This is your first test . ''", 'Mr. Rogers wakes up one day to find that he has been transported to Westeros , and is now a member of the Game of Thrones universe .', "Every morning when you get to work there is a hot cup of coffee sitting on your desk , you still have n't figured out how it 's getting there .", "`` Please do n't leave . ''", 'A white light bursts into the horizon and time was reset 9 years to the past . Everyone seems to recall their memories f

In [3]:
import os

def get_file_size(filename):
    """Returns the size of the file in bytes."""
    return os.path.getsize(filename)

gen0_data_filename = './data/hd/combined0/train_combined0.txt'
gen0_data_size = get_file_size(gen0_data_filename)
print(f"The size of '{gen0_data_filename}' is {gen0_data_size} bytes.")
# 100% of gen0_data_size
gen4_sd_target_size = gen0_data_size
print("Target size (100% of GEN0) for SD File for GEN4: ", gen4_sd_target_size)


The size of './data/hd/combined0/train_combined0.txt' is 307558317 bytes.
Target size (100% of GEN0) for SD File for GEN4:  307558317


In [6]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

model_name = "./models/distilgpt2-finetuned_gen4_0/checkpoint-9655"
tokenizer = GPT2Tokenizer.from_pretrained(model_name, padding_side='left')
model = GPT2LMHeadModel.from_pretrained(model_name)

# Define the device based on CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"


if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

if torch.cuda.is_available():
    model = torch.nn.DataParallel(model)
    model.cuda()
else:
    model.to("cpu")

batch_size = 64

def generate_text_batch(prompts):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move all tensors to the right device
    outputs = model.module.generate(
        **inputs, 
        max_length=500, 
        num_return_sequences=1, 
        temperature=0.7,  # More randomness
        repetition_penalty=1.2,  # Increase penalty to reduce repetitions
        top_k=50, 
        top_p=0.9
    )
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

output_synth_data = './data/sd/gen4/gen4_sd.txt'

try:
    with open(output_synth_data, 'a', encoding='utf-8') as file:
        for i in range(0, len(selected_prompts), batch_size):
            batch_prompts = selected_prompts[i:i + batch_size]
            print(f"Generating text for batch {i//batch_size+1}/{len(selected_prompts)//batch_size}")
            generated_texts = generate_text_batch(batch_prompts)
            
            for prompt, generated_text in zip(batch_prompts, generated_texts):
                prompt_length = len(tokenizer.encode(prompt))
                #print(prompt)
                # Remove the prompt by slicing the tokens to skip the prompt length
                generated_text_tokens = tokenizer.encode(generated_text)[prompt_length:]
                clean_generated_text = tokenizer.decode(generated_text_tokens, skip_special_tokens=True)

                # Remove leading and ending spaces and special characters
                clean_generated_text = re.sub(r'^[\s\WP]+', '', clean_generated_text)
                clean_generated_text = re.sub(r'^[\s\W]+|[\s\W]+$', '', clean_generated_text)


                output_text = f"{prompt}\n{clean_generated_text}\n\n"
                #print(output_text)
                if get_file_size(output_synth_data) < gen4_sd_target_size:
                    file.write(output_text)
                else:
                    print("Target size reached!")
                    break
    print("Finished generating stories.")
except Exception as e:
    print(f"An error occurred: {e}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 1/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 2/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 3/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 4/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 5/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 6/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 7/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 8/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 9/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 10/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 11/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 12/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 13/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 14/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 15/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 16/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 17/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 18/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 19/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 20/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 21/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 22/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 23/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 24/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 25/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 26/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 27/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 28/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 29/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 30/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 31/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 32/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 33/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 34/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 35/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 36/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 37/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 38/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 39/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 40/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 41/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 42/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 43/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 44/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 45/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 46/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 47/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 48/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 49/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 50/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 51/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 52/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 53/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 54/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 55/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 56/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 57/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 58/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 59/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 60/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 61/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 62/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 63/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 64/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 65/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 66/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 67/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 68/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 69/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 70/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 71/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 72/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 73/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 74/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 75/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 76/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 77/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 78/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 79/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 80/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 81/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 82/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 83/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 84/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 85/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 86/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 87/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 88/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 89/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 90/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 91/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 92/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 93/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 94/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 95/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 96/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 97/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 98/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 99/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 100/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 101/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 102/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 103/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 104/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 105/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 106/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 107/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 108/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 109/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 110/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 111/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 112/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Generating text for batch 113/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 114/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 115/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 116/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 117/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 118/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 119/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 120/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 121/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 122/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 123/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 124/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 125/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 126/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 127/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 128/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 129/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 130/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 131/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 132/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 133/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 134/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 135/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 136/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 137/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 138/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 139/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 140/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 141/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 142/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 143/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 144/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Generating text for batch 145/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 146/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 147/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 148/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 149/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 150/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 151/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 152/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 153/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 154/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 155/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 156/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 157/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 158/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 159/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 160/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 161/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 162/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 163/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 164/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 165/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 166/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 167/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 168/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 169/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 170/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 171/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 172/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 173/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 174/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 175/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 176/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 177/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 178/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 179/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 180/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 181/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 182/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 183/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 184/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 185/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 186/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 187/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 188/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 189/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 190/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 191/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 192/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 193/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 194/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 195/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 196/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 197/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 198/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 199/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 200/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 201/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 202/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 203/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 204/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 205/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 206/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 207/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 208/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 209/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 210/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 211/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 212/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 213/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 214/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 215/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 216/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 217/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 218/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 219/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 220/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 221/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 222/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 223/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 224/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 225/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 226/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 227/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 228/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 229/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 230/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 231/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 232/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 233/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 234/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 235/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 236/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 237/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 238/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 239/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 240/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 241/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 242/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 243/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 244/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 245/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 246/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 247/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 248/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 249/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 250/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 251/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 252/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 253/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 254/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 255/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 256/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 257/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 258/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 259/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 260/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 261/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 262/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 263/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 264/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 265/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 266/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 267/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 268/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 269/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 270/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 271/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 272/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 273/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 274/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 275/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 276/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 277/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 278/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 279/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 280/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 281/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 282/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 283/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 284/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 285/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 286/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 287/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 288/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 289/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 290/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 291/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 292/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 293/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 294/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 295/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 296/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 297/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 298/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 299/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 300/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 301/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 302/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 303/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 304/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 305/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 306/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 307/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 308/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 309/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 310/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 311/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 312/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 313/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 314/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 315/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 316/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 317/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 318/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 319/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 320/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 321/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 322/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 323/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 324/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 325/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 326/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 327/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 328/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 329/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 330/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 331/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 332/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 333/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 334/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 335/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 336/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 337/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 338/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 339/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 340/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 341/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 342/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 343/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 344/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 345/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 346/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 347/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 348/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 349/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 350/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 351/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 352/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 353/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 354/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 355/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 356/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 357/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 358/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 359/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 360/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 361/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 362/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 363/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 364/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 365/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 366/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 367/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 368/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 369/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 370/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 371/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 372/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 373/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 374/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 375/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 376/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 377/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 378/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 379/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 380/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 381/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 382/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 383/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 384/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 385/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 386/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 387/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 388/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 389/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 390/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 391/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 392/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 393/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 394/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 395/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 396/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 397/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 398/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 399/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 400/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 401/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 402/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 403/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 404/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 405/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 406/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 407/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 408/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Generating text for batch 409/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 410/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 411/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 412/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 413/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 414/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 415/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 416/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 417/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 418/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 419/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 420/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 421/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 422/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 423/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 424/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 425/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 426/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 427/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 428/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 429/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 430/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 431/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 432/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 433/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 434/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 435/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 436/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 437/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 438/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 439/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 440/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 441/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 442/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 443/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 444/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 445/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 446/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 447/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 448/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 449/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 450/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 451/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 452/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 453/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 454/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 455/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 456/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 457/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 458/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 459/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 460/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 461/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 462/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 463/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 464/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 465/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 466/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 467/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 468/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 469/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 470/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 471/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 472/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 473/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 474/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 475/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 476/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 477/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 478/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 479/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 480/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 481/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 482/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generating text for batch 483/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Target size reached!
Generating text for batch 484/1703


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Target size reached!
Generating text for batch 485/1703


KeyboardInterrupt: 

In [7]:
# check total size of generated sd file
print("Target size for GEN4 sd corresponding to 1/1 of original HD data file: ", gen4_sd_target_size)
print("Actual size of GEN2's generated SD file: ", get_file_size(output_synth_data))

Target size for GEN4 sd corresponding to 1/1 of original HD data file:  307558317
Actual size of GEN2's generated SD file:  307566967
