# Create Training Data for GEN1 Finetuning
We create a new txt file "data/md/mixed_gen1/mix1_train_combined.txt" with the same total number of entries as the GEN0 human dataset (HD0), "data/hd/combined0/train_combined0.txt".
Of these entries, 3/4 (75%) are human data from "data/hd/combined1/train_combined1.txt" and 1/4 (25%) are synthetic data generated by GEN0, "data/sd/gen0/gen0_sd.txt".
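
The cell that follows concatenates the two input files as they are, so the 75/25 ratio depends on the files already having the right sizes. As a minimal sketch of how such a mixture could be sampled explicitly, assuming both sources hold enough entries, the helper below draws the two parts and shuffles them. The name `mix_datasets`, its parameters and the seed are hypothetical and not part of the original notebook.

```python
import random

def mix_datasets(human, synthetic, total, synthetic_fraction=0.25, seed=0):
    """Return a shuffled list of `total` entries, of which about
    `synthetic_fraction` are synthetic and the rest are human."""
    rng = random.Random(seed)  # fixed seed (assumption) for a reproducible mixture
    n_synthetic = round(total * synthetic_fraction)
    n_human = total - n_synthetic
    if n_human > len(human) or n_synthetic > len(synthetic):
        raise ValueError("not enough entries to reach the requested mixture")
    # Sample without replacement from each source, then shuffle the union
    mixed = rng.sample(human, n_human) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed
```

For a total of 109040 entries and a synthetic fraction of 0.25, this would select 27260 synthetic and 81780 human entries.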

In [2]:
import random

def read_data(file_path):
    """Read a file and return its entries, which are separated by blank lines."""
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read().strip().split('\n\n')
    return data

def write_data(data, file_path):
    """Write the entries to a file, separated by blank lines."""
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write('\n\n'.join(data))

def count_entries(filepath):
    """Count the number of double-newline-separated entries in a file."""
    return len(read_data(filepath))

# Load data
human_data = read_data('./data/hd/combined1/train_combined1.txt')
synthetic_data = read_data('./data/sd/gen0/gen0_sd.txt')

# Combine and shuffle
combined_data = human_data + synthetic_data
random.shuffle(combined_data)

# Write combined data to a new file
write_data(combined_data, './data/md/mixed_gen1/mix1_train_combined.txt')
print("Data has been successfully combined and written to mix1_train_combined.txt.")


Data has been successfully combined and written to mix1_train_combined.txt.


Check that the total number of entries in mix1_train_combined.txt is about the same as in the original dataset for GEN0, "data/hd/combined0/train_combined0.txt".

In [3]:
print("Original GEN0 HD Entries Count: ", count_entries("data/hd/combined0/train_combined0.txt"))
print("Training Dataset for GEN1 Entries Count: ", count_entries("data/md/mixed_gen1/mix1_train_combined.txt"))

Original GEN0 HD Entries Count:  109040
Training Dataset for GEN1 Entries Count:  109040


# Finetuning

Finetune the model GEN1 for 5 epochs on the combined txt file "data/md/mixed_gen1/mix1_train_combined.txt", which contains 3/4 real data and 1/4 synthetic data.
The global_step parameter in "models/distilgpt2-finetuned_gen0_100/trainer_state.json" and "models/distilgpt2-finetuned_gen0_100/checkpoint-11105/trainer_state.json" is originally "global_step": 11105.
So that run_clm.py resumes from the last checkpoint but still trains on the entirety of the new dataset, we set this parameter to 0.
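
The manual edit of `global_step` can also be scripted. The following is a minimal sketch, assuming that only the top-level `"global_step"` key of `trainer_state.json` needs to change; the helper name `reset_global_step` is hypothetical.

```python
import json

def reset_global_step(trainer_state_path, new_step=0):
    """Set "global_step" in a trainer_state.json file and return the old value."""
    with open(trainer_state_path, "r", encoding="utf-8") as f:
        state = json.load(f)
    old_step = state.get("global_step")
    state["global_step"] = new_step  # all other keys are left untouched
    with open(trainer_state_path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    return old_step

# Usage with the two files named in the text:
# reset_global_step("models/distilgpt2-finetuned_gen0_100/trainer_state.json")
# reset_global_step("models/distilgpt2-finetuned_gen0_100/checkpoint-11105/trainer_state.json")
```

Keeping a copy of the original files before overwriting them makes it possible to restore the GEN0 training state later.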

In [1]:
!deepspeed run_clm.py \
    --model_name_or_path distilgpt2 \
    --train_file data/md/mixed_gen1/mix1_train_combined.txt \
    --validation_file data/hd/prepro/combined0/valid_combined.txt \
    --do_train \
    --do_eval \
    --output_dir ./models/distilgpt2-finetuned_gen1_75 \
    --num_train_epochs 5 \
    --save_strategy epoch \
    --learning_rate 5e-5 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --deepspeed ds_config.json \
    --resume_from_checkpoint ./models/distilgpt2-finetuned_gen0_100/checkpoint-11105


[2024-04-17 08:26:20,433] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 08:26:21,144] [INFO] [runner.py:568:main] cmd = /home/vasi/Documents/BA_Thesis_Experiment/.venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_clm.py --model_name_or_path distilgpt2 --train_file data/md/combined1/train1_combined.txt --validation_file data/hd/prepro/combined0/valid_combined.txt --do_train --do_eval --output_dir ./models/distilgpt2-finetuned_gen1 --num_train_epochs 5 --save_strategy epoch --learning_rate 5e-5 --per_device_train_batch_size 4 --gradient_accumulation_steps 4 --deepspeed ds_config.json --resume_from_checkpoint ./models/distilgpt2-finetuned_gen0/checkpoint-27765
[2024-04-17 08:26:22,964] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-17 08:26:23,525] [INFO] [laun

# Inference
### Step 3
Let the model generate a story from a specified prompt.

In [1]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "./models/distilgpt2-finetuned_gen1/checkpoint-25635" 
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)


prompt = "When the time came she knew she had to take a very difficult descision, a descision of life or death for the whole planet." 
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")


# Generate text
output_sequences = model.generate(
    input_ids=inputs,
    attention_mask=None,
    max_length=500,  # maximum total length in tokens, prompt included
    temperature=0.7,  # controls randomness: lower values make text less random
    top_k=50,  # only the K most likely next tokens are considered at each step
    top_p=0.9,  # only the smallest set of most probable tokens whose probabilities add up to at least top_p is considered
    repetition_penalty=1.2,  # penalty applied to tokens that have already appeared
    do_sample=True,  # sample from the distribution instead of decoding greedily
    num_return_sequences=1,  # number of independently computed samples to generate
    pad_token_id=tokenizer.eos_token_id,
)

# Decode the output sequences to get the generated text
generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)


print(generated_text)



When the time came she knew she had to take a very difficult descision, a descision of life or death for the whole planet. ”   The girl turned and looked at me. She had been sitting there on her bed, staring at my face, wondering what I was seeing in her eyes. I stared back at her, and then at her head, and then down into the floor. She looked up at me, and then down into that dark brown space between the two of us. I wasn't looking away from her, I was looking at her.  It was all over now, though. There were people walking towards her ; they hadn't noticed anything until they reached her. And she looked straight ahead, and then out of her peripheral vision, and through her nose, and into mine, and into hers, and into mine.   It was like a dream come true.
You are part of an elite team tasked with guarding your city's most notorious criminals - you have just witnessed them being executed by one of their own men who is also known as `` Villain ''... You walk around while talking to him/
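
The effect of `temperature`, `top_k` and `top_p` in the generation call above can be illustrated on a toy distribution. This is a simplified sketch in plain Python, not the implementation used by the transformers library; the function name `filter_top_k_top_p` and the example logits are hypothetical.

```python
import math

def filter_top_k_top_p(logits, temperature=0.7, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature scaling,
    top-k filtering and top-p (nucleus) filtering."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep the k most probable tokens and renormalize
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; sampling would draw from this
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

candidates = filter_top_k_top_p([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8)
print(candidates)
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens are needed to reach the `top_p` mass.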

### Step 4
Strip the prompt from the generated text and write the result to the output file. The regexes for further impurities are kept in the cell but commented out.

In [2]:
import re

# Define regex pattern for impurities
#pattern = r"(<newline>|<newline \d+ :>|<newline\*>|\[.*?\]|“|”|``|''|--|__________________________________________________________________|\*)"

# Remove the prompt (first sentence) by splitting at the first period that is followed by a capital letter
parts = re.split(r'\.\s*[”]*\s*(?=[A-Z])', generated_text, maxsplit=1)
text_without_prompt = parts[1] if len(parts) > 1 else generated_text


# Regex to remove specified impurities
#cleaned_text = re.sub(r"\<[^\>]*\>|\[WP\]|\-\-", "", text_without_prompt)
#cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  # Remove extra spaces and strip leading/trailing spaces
# Apply regex to remove impurities
#cleaned_text = re.sub(pattern, "", cleaned_text)

print(text_without_prompt)

with open("./outputs/gen1/story1.txt", "w") as f:
    f.write(text_without_prompt)


The girl turned and looked at me. She had been sitting there on her bed, staring at my face, wondering what I was seeing in her eyes. I stared back at her, and then at her head, and then down into the floor. She looked up at me, and then down into that dark brown space between the two of us. I wasn't looking away from her, I was looking at her.  It was all over now, though. There were people walking towards her ; they hadn't noticed anything until they reached her. And she looked straight ahead, and then out of her peripheral vision, and through her nose, and into mine, and into hers, and into mine.   It was like a dream come true.
You are part of an elite team tasked with guarding your city's most notorious criminals - you have just witnessed them being executed by one of their own men who is also known as `` Villain ''... You walk around while talking to him/her about how he got his hands dirty during interrogation ( but does n´t know which ). Describe this encounter :
I woke up earl
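
The impurity patterns that are commented out in the cell above could be applied along the following lines. This is a sketch under assumptions: the pattern set is adapted from those commented-out regexes, the `[ WP ]` tag is matched with optional spaces, and the helper name `clean_story` is hypothetical.

```python
import re

# Adapted from the commented-out patterns; the exact set is an assumption
IMPURITY_PATTERN = re.compile(r"<newline[^>]*>|\[\s*WP\s*\]|“|”|``|''|--|_{3,}|\*")

def clean_story(text):
    """Replace known impurities with a space, then collapse runs of whitespace."""
    cleaned = IMPURITY_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_story("`` Hello '' <newline> world -- [ WP ] *end*"))
```

Replacing each match with a space rather than an empty string avoids gluing neighbouring words together.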

# Evaluation
### Step 5
Evaluate the output according to predefined metrics.

In [4]:
from metrics.LexicalDiversity.lexical_diversity import *
from metrics.SemanticDiversity.sementic_diversity import *
from metrics.SyntacticDiversity.syntactic_diversity import *
import nltk
from nltk.tokenize import sent_tokenize

with open("./outputs/gen1/story1.txt", 'r') as f:
    story = f.read()


# Download the required Punkt tokenizer models
#nltk.download('punkt')

# Tokenize the text into sentences
sentences = sent_tokenize(story)
#print(type(sentences))

#print(story.split("."))
print("GEN1")
print("Lexical Diversity: ")
print("Distinct 2: ", calculate_distinct_n(story, 2))
print("Distinct 3: ", calculate_distinct_n(story, 3))
print(f"Self_BLEU Score: {1-calculate_self_bleu(sentences)}")
print("Over-ALL-TTR: ", calculate_ttr(story, truncate_length=300))
print("Mean-Segmental-TTR: ", calculate_mean_segmental_ttr(story, segment_size=50))
print("\nSemantic Diversity: ")
print("Semantic diversity (average):", calculate_semantic_diversity(sentences, 'average'))
print("Semantic diversity (centroid):", calculate_semantic_diversity(sentences, 'centroid'))

import spacy

# Load a spaCy model for dependency parsing
nlp = spacy.load("en_core_web_sm")

graphs = construct_dependency_graphs(sentences)

syntactic_diversity = calculate_syntactic_diversity(graphs)
print("\nSyntactic diversity:", syntactic_diversity)

GEN1
Lexical Diversity: 
Distinct 2:  0.9224598930481284
Distinct 3:  0.9785522788203753
Self_BLEU Score: 0.9889999239163768
Over-ALL-TTR:  0.6366666666666667
Mean-Segmental-TTR:  0.7875

Semantic Diversity: 
Semantic diversity (average): 0.7536183355481566
Semantic diversity (centroid): 0.48239625026225647

Syntactic diversity: 0.847389705521799
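
The implementations inside the `metrics` package are not shown in this notebook. As a rough sketch of the standard definitions behind the lexical scores above, assuming simple whitespace tokenization, the hypothetical helpers below compute distinct-n, the type-token ratio (TTR) and the mean segmental TTR. Their values need not match the numbers reported above.

```python
def distinct_n(text, n):
    """Share of unique n-grams among all n-grams of the text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_segmental_ttr(text, segment_size=50):
    """Average TTR over consecutive, non-overlapping segments of equal size."""
    tokens = text.lower().split()
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    if not segments:  # text shorter than one segment: fall back to plain TTR
        return type_token_ratio(tokens)
    return sum(type_token_ratio(s) for s in segments) / len(segments)
```

Segmenting the text before computing TTR reduces the dependence on text length, which is why both an overall and a segmental value are commonly reported.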


# Generate Synthetic Data
### Step 6
Contribute to the synthetic dataset by producing stories from the finetuned model.
We use 50% of the original prompt data as our prompt list.
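
Calling `random.sample` without a seed yields a different prompt selection on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for the experiment; the helper name `sample_half` and the seed value are hypothetical.

```python
import random

def sample_half(prompts, seed=42):
    """Select 50% of the prompts without replacement, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return rng.sample(prompts, len(prompts) // 2)
```

With the same prompt list and seed, the same subset is returned, which makes it possible to regenerate or extend the synthetic dataset for exactly the same prompts.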

In [2]:
import random
from transformers import pipeline, set_seed, GPT2LMHeadModel, GPT2Tokenizer
import os

prompts = []
prompt_files = ["train", "test", "valid"]
for name in prompt_files:
    # Path to the file with prompts
    file_path = './data/hd/prepro/'+name+'.wp_source'

    # Read prompts from the file
    with open(file_path, 'r', encoding='utf-8') as file:
        prompts += ([line.strip() for line in file.readlines() if line.strip()])

#print(prompts[0:10])
# Randomly select 50% of the prompts
sample_size = len(prompts) // 2  # 50%
selected_prompts = random.sample(prompts, sample_size)

print(selected_prompts[0:10])


["[ WP ] Your best friend commits suicide . The last line of their suicide note reads : `` calm down . if everything goes according to plan ill be back soon enough . '' Now everyone is looking to you for answers", '[ WP ] At some point in human history , we had to make a choice between two paths , but our choice doomed us to the self-destructive species we are today . But what if we had chosen the other path ?', '[ WP ] When something is created ( humans , fire , lotion , etc . ) , a god is born to reign over its domain . You are the god of what most consider to be a completely mundane object but , somehow , you are becoming the most feared .', '[ WP ] Supporters of cybernetics and genetic augmentation clash like console fanboys', '[ EU ] Humanity , not the Protheans , were the race that came before the current galactic community . As before , the Turians , Asari , and Salarians believe that we built the Citadel and the Mass Relays .', '[ WP ] Knowing it will save their life , you have

In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re

model_name = "./models/distilgpt2-finetuned_gen1/checkpoint-25635"
tokenizer = GPT2Tokenizer.from_pretrained(model_name, padding_side='left')
model = GPT2LMHeadModel.from_pretrained(model_name)

# Define the device based on CUDA availability
device = "cuda" if torch.cuda.is_available() else "cpu"


if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Move the model to the selected device; generate() is called on the model directly
model.to(device)

batch_size = 16

def generate_text_batch(prompts):
    """Generate one continuation per prompt and return only the newly generated text."""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
    inputs = {key: value.to(device) for key, value in inputs.items()}  # Move all tensors to the right device
    outputs = model.generate(
        **inputs,
        max_length=500,  # maximum total length in tokens, prompt included
        num_return_sequences=1,
        do_sample=True,  # without sampling, temperature, top_k and top_p have no effect
        temperature=0.7,  # values below 1 make sampling less random
        repetition_penalty=1.2,  # penalize tokens that have already appeared
        top_k=50,
        top_p=0.9
    )
    # The tokenizer pads on the left, so every prompt ends at the same position and
    # slicing at the input length leaves only the newly generated tokens.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(tokens, skip_special_tokens=True) for tokens in new_tokens]

output_synth_data = './data/sd/gen1/gen1_sd.txt'

try:
    with open(output_synth_data, 'a', encoding='utf-8') as file:
        for i in range(0, len(selected_prompts), batch_size):
            batch_prompts = selected_prompts[i:i + batch_size]
            generated_texts = generate_text_batch(batch_prompts)

            for prompt, generated_text in zip(batch_prompts, generated_texts):
                # Remove leading and trailing whitespace and special characters
                clean_generated_text = re.sub(r'^[\s\W]+|[\s\W]+$', '', generated_text)

                output_text = f"{prompt}\n{clean_generated_text}\n\n"
                file.write(output_text)
    print("Finished generating stories.")
except Exception as e:
    print(f"An error occurred: {e}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

KeyboardInterrupt: 
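
Because the output file is opened in append mode, an interrupted run like the one above leaves a partial dataset on disk. One way to continue without generating duplicates is to skip prompts that already have an entry. The sketch below assumes the entry layout written by the generation cell (prompt on the first line, entries separated by a blank line) and that the same prompt list is available again; the helper name `remaining_prompts` is hypothetical.

```python
import os

def remaining_prompts(prompts, output_path):
    """Return the prompts that do not yet appear as the first line of an entry."""
    if not os.path.exists(output_path):
        return list(prompts)
    with open(output_path, "r", encoding="utf-8") as file:
        entries = file.read().split("\n\n")
    # The first line of every non-empty entry is treated as an already processed prompt
    done = {entry.strip().split("\n", 1)[0] for entry in entries if entry.strip()}
    return [p for p in prompts if p not in done]
```

Stories that themselves contain blank lines are split into several fragments by this check, which is harmless as long as no fragment starts with a line identical to a pending prompt.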