In [61]:
import json

In [62]:
with open('data/datasets/train.json', 'r') as f:
    data = json.load(f)

In [63]:
data = [{"text": x} for x in data]
data[0]

{'text': 'Ever wonder why companies want you to come back to the office? \n \nOr spend all day at your desk? \n \nMe too. \n \nSure...I understand the camaraderie. \n \nOr teamwork. \n \nAnd some things are more easily done in person than virtually. I get that. \n \nBut here\'s something else I think: \n \nSuccess is not about long hours in the office. Or sitting at your desk from 9 to 5. \n \nIt\'s about good ideas and sharp execution. \n \nAnd sometimes, the best ideas come when you\'re away from your desk. \n \nLike at the dinner table, on a run, or playing with your kids. \n \nSpending time with family isn\'t just nice, it helps us think better. \n \nDon\'t get stuck on the old idea that more hours at work means more success. \n \nTry something different. \n \nSpend more time living life, and you might find you\'re better at your job. \n \nMaybe a new way to measure success is in the quality of our ideas, and how we execute them - not the quantity of our hours. \n \nYour desk isn\'

In [91]:
model_checkpoint = "gpt2"

In [65]:
from datasets import Dataset
dataset = Dataset.from_list(data)

In [93]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [94]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

In [95]:
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

In [96]:
block_size = 128

In [97]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk is exactly block_size tokens long.
    # Padding the last chunk instead is possible if the model supports it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
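    # For causal language modeling the labels are the inputs themselves;
    # the model shifts them by one position internally when computing the loss.
    result["labels"] = result["input_ids"].copy()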
    return result
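
To make the chunking concrete, here is a minimal sketch of the same concatenate-then-split logic run on toy data. The name `group_texts_demo`, the toy token ids, and the block size of 4 are illustrative assumptions, not part of the notebook.

```python
from typing import Dict, List


def group_texts_demo(
    examples: Dict[str, List[List[int]]], block_size: int
) -> Dict[str, List[List[int]]]:
    # Flatten every column (input_ids, attention_mask, ...) into one long list.
    concatenated = {
        key: [token for seq in seqs for token in seq]
        for key, seqs in examples.items()
    }
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size; the tail is discarded.
    total_length = (total_length // block_size) * block_size
    result = {
        key: [tokens[i : i + block_size] for i in range(0, total_length, block_size)]
        for key, tokens in concatenated.items()
    }
    # Labels mirror the inputs, as in the notebook's group_texts.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Toy batch (hypothetical ids): three "documents" with 3 + 2 + 5 = 10 tokens.
toy = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
chunks = group_texts_demo(toy, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Chunks can span document boundaries, and tokens 9 and 10 are dropped because they do not fill a whole block.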

In [98]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=128,
    num_proc=4,
)

In [99]:
len(lm_datasets)

173

In [100]:
tokenizer.decode(lm_datasets[1]["input_ids"])

", the best ideas come when you're away from your desk. \n \nLike at the dinner table, on a run, or playing with your kids. \n \nSpending time with family isn't just nice, it helps us think better. \n \nDon't get stuck on the old idea that more hours at work means more success. \n \nTry something different. \n \nSpend more time living life, and you might find you're better at your job. \n \nMaybe a new way to measure success is in the quality of our ideas, and how we execute them"

In [101]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

In [102]:
from transformers import Trainer, TrainingArguments

In [103]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-justin-welsh",
    per_device_train_batch_size=8,
    save_strategy="steps",
    num_train_epochs=150,
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    save_total_limit=5,
)

In [104]:
tokenizer.pad_token = tokenizer.eos_token

In [105]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets,
    tokenizer=tokenizer
)

In [106]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step    Training Loss
 500    2.6065
1000    1.6115
1500    0.9907
2000    0.6412
2500    0.469
3000    0.3874


TrainOutput(global_step=3300, training_loss=1.048942660707416, metrics={'train_runtime': 746.1907, 'train_samples_per_second': 34.777, 'train_steps_per_second': 4.422, 'total_flos': 847581334732800.0, 'train_loss': 1.048942660707416, 'epoch': 150.0})
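
The reported `global_step` and throughput can be sanity-checked from numbers already shown in this notebook. The sketch below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly.

```python
import math

num_examples = 173           # len(lm_datasets) reported earlier
batch_size = 8               # per_device_train_batch_size
num_epochs = 150             # num_train_epochs
train_runtime_s = 746.1907   # 'train_runtime' from TrainOutput

# Assumption: one device, gradient_accumulation_steps = 1,
# so one optimizer step consumes one batch of 8 examples.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the step count is 22 × 150 = 3300, which matches `global_step` in the output above. The throughput figures should land close to the reported `train_samples_per_second` and `train_steps_per_second`.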

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
trainer.push_to_hub()

'https://huggingface.co/debal/gpt2-finetuned-justin-welsh/tree/main/'

In [117]:
from transformers import GPT2LMHeadModel

# The tokenizer loaded earlier is reused; only the fine-tuned weights are reloaded from the Hub.
model = GPT2LMHeadModel.from_pretrained('debal/gpt2-finetuned-justin-welsh')


In [118]:
answer_1 = input("What do you want to learn about: ")

What do you want to learn about: How to develop a niche?


Comparing Prompting Techniques

In [119]:
# Simple Prompt
text = f"""Main Question: {answer_1} ?
Answer:
"""
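encoded_input = tokenizer.encode(text, return_tensors='pt')
# Note: do_sample is not set, so this runs plain beam search;
# top_k and top_p only take effect when do_sample=True.
generated_output = model.generate(encoded_input, max_length=300, num_beams=5, no_repeat_ngram_size=3, top_k=10, top_p=0.95)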
generated_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [120]:
print(generated_text)

Main Question: How to develop a niche??
Answer:
As a solopreneur, it's easy to feel like you're drowning in a sea of competitors, all vying for the same clients.

But here's the thing:

Your personal brand is the life raft that keeps you afloat and propels you forward. When you invest in branding, you're not just creating a pretty face for your business.


It's about owning your learning, taking control, and not being dedicated to the best practices of the business world..

Branding is the secret sauce to building a strong, self-reliant brand that resonates with your target audience and inspires them to chase dreams of entrepreneurship and freedom.
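
The printed prompt ends in `??` because the typed question already ends with `?` and the template appends another one. A minimal sketch of one way to normalize this is shown below; `build_prompt` is a hypothetical helper, not something the notebook defines.

```python
def build_prompt(question: str) -> str:
    # Strip surrounding whitespace and any trailing question marks,
    # then add exactly one "?" back so the prompt is consistent.
    cleaned = question.strip().rstrip("?").rstrip()
    return f"Main Question: {cleaned}?\nAnswer:\n"


print(build_prompt("How to develop a niche?"))
```

Both `"How to develop a niche?"` and `"How to develop a niche"` then produce the same prompt text.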


In [127]:
# Using Latent Space priming in Prompting
text = f"""Main Question: {answer_1} ?
Dialog 1: Well, first I need to think about solopreneurs in general
Dialog 2: Next, maybe I need to figure out how I define the answer. What criteria am I looking to judge on?
Dialog 3: Based on all this, what can I answer ?
Answer:"""

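encoded_input = tokenizer.encode(text, return_tensors='pt')
# Note: do_sample is not set, so this runs plain beam search;
# top_k and top_p only take effect when do_sample=True.
generated_output = model.generate(encoded_input, max_length=400, num_beams=5, no_repeat_ngram_size=3, top_k=10, top_p=0.95)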
generated_text = tokenizer.decode(generated_output[0], skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [122]:
print(generated_text)

Main Question: How to develop a niche??
Dialog 1: Well, first I need to think about solopreneurs in general
Dialog 2: Next, maybe I need to figure out how I define the answer. What criteria am I looking to judge on?
Dialog 3: Based on all this, what can I answer?
Answer: You can either:

Create a niche by selling your skills or creating valuable content
Produce valuable content in a timely manner
Difficulty level
Once you decide on a sub-niche, you can apply this formula to the content you create to arrive at your audience.

For example, let's say you’re a financial analyst and your interest is in wine. You can create content that revolves around wine. The result?

Videos.


2. Ideation
Ideation is the name of the social media game. It's not enough to simply post content and hope for the best. You need to actively seek out sponsorships, audience comments, and audience testimonials.



3. Motivational
People on social media want to learn from others who are where they want to be.motivat