# Preprocess Training Data for GEN2 Training
We create a new text file, "data/md/combined2/train2_combined.txt", of about the same size as the original training file, "data/hd/prepro/combined0/train_combined.txt".
Half of it (50%) is randomly sampled from the original text file and the other half (50%) is randomly sampled from the synthetic data: 1/3 (33.3%) of that half comes from GEN0, "data/sd/gen0/gen0_sd.txt", and 2/3 (66.7%) from GEN1, "data/sd/gen1/gen1_sd.txt".
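
As a quick check of the proportions described above, the following sketch computes the three sample sizes for a given number of original stories. The function name `mixture_sizes` is hypothetical and not part of the notebook; it assumes integer division, so that the two synthetic parts add up to exactly the same count as the real part.

```python
def mixture_sizes(n_original):
    """Return (n_real, n_gen0, n_gen1) for a 1/2 real, 1/6 GEN0, 1/3 GEN1 mix."""
    n_half = n_original // 2      # size of the real half (and of the synthetic half)
    n_gen0 = n_half // 3          # one third of the synthetic half comes from GEN0
    n_gen1 = n_half - n_gen0      # the remainder (about two thirds) comes from GEN1
    return n_half, n_gen0, n_gen1


print(mixture_sizes(1000))  # (500, 166, 334)
```

With 1000 original stories the combined file would hold 500 real, 166 GEN0 and 334 GEN1 stories, which is 1000 in total.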

In [2]:
import random

def read_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read().strip().split('\n\n')
    return data

def write_data(data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write('\n\n'.join(data))

# Load data
original_data = read_data('./data/hd/prepro/combined0/train_combined.txt')
synthetic_data_gen0 = read_data('./data/sd/gen0/gen0_sd.txt')
synthetic_data_gen1 = read_data('./data/sd/gen1/gen1_sd.txt')

# Determine portions: half original, half synthetic (1/3 GEN0, 2/3 GEN1)
one_half_len = len(original_data) // 2
one_third_len = one_half_len // 3
two_thirds_len = one_half_len - one_third_len  # remainder, so both synthetic parts sum to exactly one half

# Randomly sample data
selected_original = random.sample(original_data, one_half_len)
selected_synthetic_gen0 = random.sample(synthetic_data_gen0, one_third_len)
selected_synthetic_gen1 = random.sample(synthetic_data_gen1, two_thirds_len)

# Combine and shuffle
combined_data = selected_original + selected_synthetic_gen0 + selected_synthetic_gen1
random.shuffle(combined_data)

# Write combined data to a new file
write_data(combined_data, './data/md/combined2/train2_combined.txt')
print("Data has been successfully combined and written to train2_combined.txt.")


Data has been successfully combined and written to train2_combined.txt.


(Optional: it is arguably more correct to keep the same validation set for every generation, so that the performance metrics of the different generations can be compared.)
We follow the same logic to create a hybrid validation dataset as well.

In [2]:
import random

# Load data
original_data = read_data('./data/hd/prepro/combined0/valid_combined.txt')
synthetic_data = read_data('./data/sd/gen0/gen0_sd.txt')

# Determine portions
three_quarters_len = int(0.75 * len(original_data))
one_quarter_len = int(0.25 * len(original_data))

# Randomly sample data
selected_original = random.sample(original_data, three_quarters_len)
selected_synthetic = random.sample(synthetic_data, one_quarter_len)

# Combine and shuffle
combined_data = selected_original + selected_synthetic
random.shuffle(combined_data)

# Write combined data to a new file
write_data(combined_data, './data/md/combined1/valid1_combined.txt')
print("Data has been successfully combined and written to valid1_combined.txt.")


Data has been successfully combined and written to valid1_combined.txt.


# Finetuning
### Step 2

Finetune the model GEN2 for 5 epochs on the combined text file, which contains 1/2 real data and 1/2 synthetic data (1/3 GEN0 + 2/3 GEN1), starting from the last checkpoint of the model GEN1.  
The num_train_epochs value must be increased by 5 relative to the previous generation so that this generation trains for 5 more epochs.
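
Because each generation resumes from the previous checkpoint, the value passed to --num_train_epochs is a cumulative total, not the number of new epochs. A minimal sketch of that arithmetic, assuming GEN0 was itself trained for 5 epochs (the helper name `total_epochs` is hypothetical):

```python
def total_epochs(generation, epochs_per_generation=5):
    """Cumulative epoch count to pass as --num_train_epochs for a given generation."""
    return (generation + 1) * epochs_per_generation


for g in range(3):
    print(f"GEN{g}: --num_train_epochs {total_epochs(g)}")
```

For GEN2 this gives 15, which matches the value used in the command that follows.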

In [3]:
!deepspeed run_clm.py \
    --model_name_or_path distilgpt2 \
    --train_file data/md/combined2/train2_combined.txt \
    --validation_file data/hd/prepro/combined0/valid_combined.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/distilgpt2-finetuned_gen2 \
    --num_train_epochs 15 \
    --save_strategy epoch \
    --learning_rate 5e-5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --deepspeed ds_config.json \
    --resume_from_checkpoint ./models/distilgpt2-finetuned_gen1/checkpoint-51270


[2024-04-16 17:54:06,156] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-16 17:54:06,765] [INFO] [runner.py:568:main] cmd = /home/vasi/Documents/BA_Thesis_Experiment/.venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_clm.py --model_name_or_path distilgpt2 --train_file data/md/combined2/train2_combined.txt --validation_file data/hd/prepro/combined0/valid_combined.txt --do_train --do_eval --output_dir ./models/distilgpt2-finetuned_gen2 --num_train_epochs 15 --save_strategy epoch --learning_rate 5e-5 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --deepspeed ds_config.json --resume_from_checkpoint ./models/distilgpt2-finetuned_gen1/checkpoint-51270
[2024-04-16 17:54:08,704] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-16 17:54:09,306] [INFO] [lau

# Inference
### Step 3
Let the model generate a story from a specified prompt.

In [7]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "./models/distilgpt2-finetuned_gen1/checkpoint-51270" 
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)


prompt = "When the time came she knew she had to take a very difficult descision, a descision of life or death for the whole planet." 
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")


# Generate text
output_sequences = model.generate(
    input_ids=inputs,
    attention_mask=None,
    max_length=500,  # determines the maximum length of the generated text
    temperature=0.7,  # controls randomness: lower values make text less random
    top_k=50,  # the K most likely next words are considered for each step
    top_p=0.9,  # only the most probable tokens with probabilities that add up to top_p are considered for each step
    repetition_penalty=1.2,  # penalty applied to repeated words
    do_sample=True,  # set to True to return diverse samples
    num_return_sequences=1,  # number of independently computed samples to generate
    pad_token_id=tokenizer.eos_token_id,
)

# Decode the output sequences to get the generated text
generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)


print(generated_text)



When the time came she knew she had to take a very difficult descision, a descision of life or death for the whole planet. The only thing keeping her sane though was that even if someone tried it, they would die anyway. So instead, she went through with this plan : She would find out whether there is any other way to kill herself before dying so long as everyone else has already died already. And then after several days of searching, and finally finding nothing more than an empty shell filled by people trying desperately not to die yet. Then suddenly everything changed again ; Everything seemed normal except those who lived longer than usual. People began noticing strange patterns appearing throughout their lives which made sense since most things happen around them. One person noticed a pattern on his wrist showing that he/she could move freely while still alive. Another saw another sign saying `` Run '' followed by something similar to `` Run '' leading to
The year 2099 - A man disco
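
To make the roles of temperature, top_k and top_p in the generation call more concrete, here is a simplified pure-Python sketch of how these three settings narrow down the candidate tokens at a single step. It is an illustration under simplifying assumptions, not the transformers implementation; the function name `filter_top_k_top_p` and the toy logits are hypothetical.

```python
import math


def filter_top_k_top_p(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax numerators

    # Top-k: keep only the k highest-scoring tokens, then renormalize.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)
    probs = {i: weights[i] / total for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


candidates = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print(sorted(candidates))  # indices of the tokens that remain eligible for sampling
```

The next token is then drawn at random from the remaining candidates according to their renormalized probabilities.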

### Step 4
Clear impurities in the generated text and write it to the output file.

In [8]:
import re

# Define regex pattern for impurities
#pattern = r"(<newline>|<newline \d+ :>|<newline\*>|\[.*?\]|“|”|``|''|--|__________________________________________________________________|\*)"

# Remove the prompt (first sentence): split once at the first period that is followed by a capital letter
parts = re.split(r'\.\s*[”]*\s*(?=[A-Z])', generated_text, maxsplit=1)
# Fall back to the full text if no sentence boundary was found
text_without_prompt = parts[1] if len(parts) > 1 else generated_text


# Regex to remove specified impurities
#cleaned_text = re.sub(r"\<[^\>]*\>|\[WP\]|\-\-", "", text_without_prompt)
#cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Remove extra spaces and strip leading/trailing spaces
# Apply regex to remove impurities
#cleaned_text = re.sub(pattern, "", cleaned_text)

print(text_without_prompt)

with open("./outputs/gen1/story1.txt", "w") as f:
    f.write(text_without_prompt)


The only thing keeping her sane though was that even if someone tried it, they would die anyway. So instead, she went through with this plan : She would find out whether there is any other way to kill herself before dying so long as everyone else has already died already. And then after several days of searching, and finally finding nothing more than an empty shell filled by people trying desperately not to die yet. Then suddenly everything changed again ; Everything seemed normal except those who lived longer than usual. People began noticing strange patterns appearing throughout their lives which made sense since most things happen around them. One person noticed a pattern on his wrist showing that he/she could move freely while still alive. Another saw another sign saying `` Run '' followed by something similar to `` Run '' leading to
The year 2099 - A man discovers himself trapped inside a box containing 100 % human DNA ( 99.999 % ) stored within 100 % of human DNA. Write about how
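
The cell above only strips the prompt; the impurity-removal regexes are commented out. A minimal sketch of what such a cleaning step could look like, loosely based on those commented-out patterns. The function name `clean_story` and the exact pattern are assumptions, not the notebook's code.

```python
import re

# Markup tags such as <newline>, the [WP] tag, LaTeX-style quotes, dashes,
# long underscore rules and asterisks (an assumed list of impurities).
IMPURITY_PATTERN = re.compile(r"<[^>]*>|\[\s*WP\s*\]|``|''|--|_{3,}|\*")


def clean_story(text):
    """Replace impurities with spaces, then collapse runs of whitespace."""
    text = IMPURITY_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_story("Another saw a sign saying `` Run '' <newline> -- [WP] and left ."))
```

Replacing with a space instead of an empty string avoids gluing neighbouring words together when a tag sits between them.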

# Evaluation
### Step 5
Evaluate the output according to predefined metrics.

In [10]:
from metrics.LexicalDiversity.lexical_diversity import *
from metrics.SemanticDiversity.sementic_diversity import *
from metrics.SyntacticDiversity.syntactic_diversity import *
import nltk
from nltk.tokenize import sent_tokenize

with open("./outputs/gen1/story1.txt", 'r') as f:
    story = f.read()


# Download the required Punkt tokenizer models
#nltk.download('punkt')

# Tokenize the text into sentences
sentences = sent_tokenize(story)
#print(type(sentences))

#print(story.split("."))
print("GEN1")
print("Lexical Diversity: ")
print("Distinct 2: ", calculate_distinct_n(story, 2))
print("Distinct 3: ", calculate_distinct_n(story, 3))
print(f"Self_BLEU Score: {1-calculate_self_bleu(sentences)}")
print("Over-ALL-TTR: ", calculate_ttr(story, truncate_length=300))
print("Mean-Segmental-TTR: ", calculate_mean_segmental_ttr(story, segment_size=50))
print("\nSemantic Diversity: ")
print("Semantic diversity (average):", calculate_semantic_diversity(sentences, 'average'))
print("Semantic diversity (centroid):", calculate_semantic_diversity(sentences, 'centroid'))

import spacy

# Load a spaCy model for dependency parsing
nlp = spacy.load("en_core_web_sm")

graphs = construct_dependency_graphs(sentences)

syntactic_diversity = calculate_syntactic_diversity(graphs)
print("\nSyntactic diversity:", syntactic_diversity)

GEN1
Lexical Diversity: 
Distinct 2:  0.9744245524296675
Distinct 3:  0.9923076923076923
Self_BLEU Score: 1.0
Over-ALL-TTR:  0.82
Mean-Segmental-TTR:  0.8821428571428572

Semantic Diversity: 
Semantic diversity (average): 0.8000928248197676
Semantic diversity (centroid): 0.49299244327590885

Syntactic diversity: 0.8527120187625716
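
The functions imported from the metrics package are not shown in this notebook. For orientation only, here is a common way to define distinct-n and the type-token ratio on whitespace tokens. The names `distinct_n_sketch` and `ttr_sketch` are hypothetical, and the notebook's own implementations may tokenize, truncate or normalize differently.

```python
def distinct_n_sketch(text, n):
    """Share of unique n-grams among all n-grams of the whitespace-tokenized text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def ttr_sketch(text):
    """Type-token ratio: number of unique tokens divided by number of tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "the cat sat on the mat"
print(distinct_n_sketch(sample, 2), ttr_sketch(sample))
```

Both values lie between 0 and 1, and higher values indicate less repetition.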


# Generate Synthetic Data
### Step 6
Contribute to the synthetic dataset by producing stories from the finetuned model.
We use 50% of the original prompt data as our prompt list.

In [11]:
import random
from transformers import pipeline, set_seed, GPT2LMHeadModel, GPT2Tokenizer
import os

prompts = []
prompt_files = ["train", "test", "valid"]
for name in prompt_files:
    # Path to the file with prompts
    file_path = './data/hd/prepro/'+name+'.wp_source'

    # Read prompts from the file
    with open(file_path, 'r', encoding='utf-8') as file:
        prompts += ([line.strip() for line in file.readlines() if line.strip()])

#print(prompts[0:10])
# Randomly select 50% of the prompts
sample_size = len(prompts) // 2  # 50%
selected_prompts = random.sample(prompts, sample_size)

print(selected_prompts[0:10])




In [13]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

model_name = "./models/distilgpt2-finetuned_gen1/checkpoint-51270"
tokenizer = GPT2Tokenizer.from_pretrained(model_name, padding_side='left')
model = GPT2LMHeadModel.from_pretrained(model_name)

# Define the device based on CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"


if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# generate() is called on the model itself, so no DataParallel wrapper is needed
model.to(device)
model.eval()

batch_size = 16

def generate_text_batch(prompts):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move all tensors to the right device
    # Note: temperature, top_k and top_p only take effect when do_sample=True is passed
    outputs = model.generate(
        **inputs,
        max_length=500,
        num_return_sequences=1,
        temperature=0.7,  # Values below 1 make sampling less random
        repetition_penalty=1.2,  # Penalize repeated tokens to reduce repetitions
        top_k=50,
        top_p=0.9
    )
    return [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

output_synth_data = './data/sd/gen1/gen1_sd.txt'

try:
    with open(output_synth_data, 'a', encoding='utf-8') as file:
        for i in range(0, len(selected_prompts), batch_size):
            batch_prompts = selected_prompts[i:i + batch_size]
            generated_texts = generate_text_batch(batch_prompts)
            
            for prompt, generated_text in zip(batch_prompts, generated_texts):
                prompt_length = len(tokenizer.encode(prompt))
                #print(prompt)
                # Remove the prompt by slicing the tokens to skip the prompt length
                generated_text_tokens = tokenizer.encode(generated_text)[prompt_length:]
                clean_generated_text = tokenizer.decode(generated_text_tokens, skip_special_tokens=True)

                # Strip leading and trailing whitespace and non-word characters
                clean_generated_text = re.sub(r'^[\s\W]+|[\s\W]+$', '', clean_generated_text)


                output_text = f"{prompt}\n{clean_generated_text}\n\n"
                #print(output_text)
                file.write(output_text)
    print("Finished generating stories.")
except Exception as e:
    print(f"An error occurred: {e}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Finished generating stories.
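
The generation loop writes each record as the prompt, a newline, the story and a blank line, and read_data in the preprocessing cell splits the file on blank lines. A minimal round-trip sketch of that record format follows; the helper names `format_record` and `parse_records` are hypothetical, and the sketch assumes that a story never contains a blank line itself.

```python
def format_record(prompt, story):
    """Serialize one prompt/story pair the way the generation loop writes it."""
    return f"{prompt}\n{story}\n\n"


def parse_records(text):
    """Split a synthetic-data file into (prompt, story) pairs on blank lines."""
    blocks = text.strip().split("\n\n")
    return [tuple(block.split("\n", 1)) for block in blocks if block]


content = format_record("A prompt", "A story.") + format_record("Second prompt", "Second story.")
print(parse_records(content))
```

If a generated story did contain a blank line, splitting on blank lines would cut it into several records, so it may be worth checking the generated file for that case.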
