<a href="https://colab.research.google.com/github/andreianmatos/best_practices_chat/blob/main/Best_Practices_Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Please make a copy of this notebook (or the whole folder) in your personal Drive!

*   Once it is in your Drive, run Section 0 to install the dependencies and mount your Drive, then adjust `drive_path` accordingly.
*   Copy the shared [data folder](https://drive.google.com/drive/folders/1AV9G0t6KOqid8wgq_sIkjyN24kaYpaGL?usp=drive_link) to your personal Drive as well,
*   or generate the training files yourself in Sections 1 and 2.
*   If you run Section 3 to retrain a model, you may need to turn on the GPU.
*   To just try inference with the existing models in the models folder, go to Section 4.
*   Section 5 is for uploading a model to the Hugging Face model hub (to use it from the JavaScript API).

# Section 0

**Must run!** Installs the dependencies, imports the libraries, and mounts Google Drive.

In [1]:
!pip install youtube_transcript_api spacy

Collecting youtube_transcript_api
  Downloading youtube_transcript_api-0.6.2-py3-none-any.whl (24 kB)
Installing collected packages: youtube_transcript_api
Successfully installed youtube_transcript_api-0.6.2


In [2]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from googleapiclient.discovery import build
from google.colab import userdata, drive
import os
from tqdm import tqdm
import pprint
import json
import spacy
import pandas as pd
import numpy as np
from collections import defaultdict

In [18]:
drive.mount('/content/drive')
# drive path in personal drive (copy data folder from transcripts folder to your drive)
drive_path = '/content/drive/MyDrive/Colab Notebooks/modina/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Section 1: Getting transcripts from a YouTube playlist

**No need to run** unless you want to do more web scraping.

If you do want to, update `YOUTUBE_API_KEY` in the notebook's secrets (key symbol on the left) with your own key ([tutorial](https://medium.com/swlh/how-to-get-youtubes-api-key-7c28b59b1154)).

In [None]:
youtube_api_key = userdata.get('YOUTUBE_API_KEY')
youtube = build('youtube', 'v3', developerKey=youtube_api_key)

# Playlist ID
BEST_PRACTICES_CHAT_ID = 'PLVIdoREykT8Jk_vGehGzNbhzex9HshB5x'

In [None]:
playlist_id = BEST_PRACTICES_CHAT_ID
max_results = 150

playlist_response = []
next_page_token = None

while True:
    request = youtube.playlistItems().list(
        part='snippet',
        playlistId=playlist_id,
        maxResults=min(max_results - len(playlist_response), 50),
        pageToken=next_page_token
    )
    response = request.execute()
    playlist_response.extend(response['items'])
    next_page_token = response.get('nextPageToken')

    if not next_page_token or len(playlist_response) >= max_results:
        break

In [None]:
video_ids = [item['snippet']['resourceId']['videoId'] for item in playlist_response]

In [None]:
languages = ["en"]
transcripts_by_language = {lang: {} for lang in languages}

In [None]:
for video_id in tqdm(video_ids, desc="Processing Videos"):

  try:

    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    transcript = transcript_list.find_transcript(["en"])  # start from the English transcript
    transcript_text = transcript.fetch()
    transcripts_by_language["en"][video_id] = ' '.join([phrase['text'] for phrase in transcript_text])
    for lang in languages[1:]:
      # translate from the English transcript each time, not from the previous translation
      translated = transcript.translate(lang)
      translated_text = translated.fetch()
      transcripts_by_language[lang][video_id] = ' '.join([phrase['text'] for phrase in translated_text])

  except Exception as e:
    # typically happens when subtitles are disabled for the video
    print(f"An error occurred for video ID {video_id}: {e}")

In [None]:
with open(drive_path + 'data/transcripts.json', 'w') as file:
    json.dump(transcripts_by_language, file)

# Section 2: Inspecting + Cleaning Transcripts

**Run if** you want to inspect the data. **If not**, get `compiled_english_transcripts.txt` from the shared drive and add it to `drive_path + 'data/'` in your personal Drive.

In [4]:
transcripts_path = drive_path + 'data/transcripts.json'

with open(transcripts_path, 'r') as file:
  transcripts_data = json.load(file)

transcripts_info = defaultdict(dict)

# Extract video IDs, languages, and text
for lang, transcripts in transcripts_data.items():
    for video_id, transcript_text in transcripts.items():
        transcripts_info[video_id][lang] = transcript_text

df_transcripts = pd.DataFrame(transcripts_info).T.reset_index().rename(columns={'index': 'Video ID'})

In [5]:
df_transcripts.head()

| | Video ID | en |
|---|---|---|
| 0 | iDnMcqv3lRU | hello hey hey oh that's a loft I haven't seen ... |
| 1 | 1gBWFcNGyFc | hey so you were hiking yes it was a it was a l... |
| 2 | c6zZyZh__Bk | hello hello y what's up looks cold is it cold ... |
| 3 | E992uCyNXe0 | hello hello I just came back from the park fro... |
| 4 | wsjeRczd1uI | River run past Eve and Adams from swerve of sh... |


In [6]:
# drive_path already ends with '/', and the compiled file belongs in the data folder
output_transcripts_path = drive_path + 'data/compiled_english_transcripts.txt'
with open(output_transcripts_path, 'w', encoding='utf-8') as output_file:
    for index, row in df_transcripts.iterrows():
        en_transcript = row['en']
        output_file.write(en_transcript + '\n\n')

**Clean data + add structure**

Adding sentence boundary detection: because the transcripts lack punctuation entirely, splitting them into sentences is harder than usual. We need to rely on other linguistic cues or patterns to find likely sentence boundaries.

In [7]:
# attempt to extract sentences

nlp = spacy.load("en_core_web_sm")

def extract_sentences(text):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    return sentences

df_transcripts['en'] = df_transcripts['en'].apply(lambda x: '\n'.join(extract_sentences(x)))
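
To make the idea of "linguistic cues" concrete, here is a minimal sketch of a much cruder alternative to spaCy's `doc.sents`: it starts a new line before a few discourse markers. The cue list and the function name `split_on_cues` are assumptions made up for illustration; they are not part of the notebook's pipeline and will over-split or under-split real transcripts.

```python
import re

# Hypothetical cue words; this list is an assumption, not taken from the notebook.
CUE_WORDS = ("okay", "yeah", "so", "but", "and then")

def split_on_cues(text, cues=CUE_WORDS):
    """Break unpunctuated text before each cue word (illustrative heuristic only)."""
    # Longest cues first so multi-word cues are tried before shorter ones.
    alternatives = "|".join(re.escape(c) for c in sorted(cues, key=len, reverse=True))
    # Split on whitespace that is immediately followed by a whole cue word.
    pattern = re.compile(r"\s+(?=(?:" + alternatives + r")\b)", flags=re.IGNORECASE)
    return [chunk.strip() for chunk in pattern.split(text) if chunk.strip()]

sample = "hello hey hey so you were hiking yeah it was a long one but it was nice"
segments = split_on_cues(sample)
print("\n".join(segments))
```

A heuristic like this is only useful as a baseline to compare against the parser-based split used above.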

In [8]:
output_transcripts_path = drive_path + 'data/clean_compiled_english_transcripts.txt'
with open(output_transcripts_path, 'w', encoding='utf-8') as output_file:
    for index, row in df_transcripts.iterrows():
        en_transcript = row['en']
        output_file.write(en_transcript + '\n\n')

# Section 3: Fine-Tuning with Hugging Face's Transformers Library

This section takes a pretrained model from the Hugging Face transformers library and trains it further on the chats, so that it generates text in this particular style.

Fine-tuning adjusts the model's parameters based on the new dataset, allowing it to learn more specific patterns and improve its performance on the targeted task.

In [9]:
! pip install -U accelerate transformers diffusers
! pip install matplotlib evaluate einops

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting transformers
  Downloading transformers-4.37.2-py3-none-any.whl (8.4 MB)
Collecting diffusers
  Downloading diffusers-0.26.1-py3-none-any.whl (1.9 MB)
Installing collected packages: diffusers, accelerate, transformers
  Attempting uninstall: transformers
    Found existing installation: transform

In [10]:
from evaluate import load
from transformers import TrainerCallback, TrainerControl

from transformers import GPT2Tokenizer, AutoTokenizer, TextDataset, GPT2Model, GPT2LMHeadModel, TrainingArguments, Trainer, \
                        DataCollatorForLanguageModeling, pipeline

The function `load_dataset` builds the training dataset for the language model from a text file, using the tokenizer passed in to tokenize the text. It also creates the data collator, which batches and collates the examples during training.

The function `train_model` trains the language model. It sets up the training configuration with `TrainingArguments`: the output directory, the number of training epochs, the batch sizes, the evaluation, logging and saving steps, and the learning rate. All of these can be tweaked to alter the resulting model.


In [38]:
def load_dataset(train_path, tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,data_collator

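def train_model(model, tokenizer, train_dataset, data_collator, learning_rate):
    # model_name and drive_path are notebook globals set in other cells

    training_args = TrainingArguments(
      # drive_path already ends with '/', so no leading slash before "models/"
      output_dir=drive_path + "models/" + model_name + "/" + str(learning_rate),
      overwrite_output_dir=True,
      num_train_epochs=3,
      per_device_train_batch_size=8,
      per_device_eval_batch_size=8,
      eval_steps=500,
      logging_steps=250,
      save_steps=1000,
      warmup_steps=500,
      learning_rate=learning_rate  # use the argument instead of a hard-coded value
    )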

    trainer = Trainer(
      model=model,
      args=training_args,
      data_collator=data_collator,
      train_dataset=train_dataset,
    )

    result = trainer.train()
    trainer.save_model()

    return result
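
To give a feel for what `block_size=128` means, here is a small sketch of the chunking idea: the tokenized file is treated as one long stream and cut into consecutive fixed-size blocks. This is my understanding of the general approach, not a copy of the library code; the helper names `chunk_into_blocks` and `optimizer_steps` are made up, dropping the incomplete tail is an assumption, and the step count assumes a single device with no gradient accumulation.

```python
import math

def chunk_into_blocks(token_ids, block_size):
    """Cut a flat list of token ids into consecutive full blocks; the short tail is dropped."""
    last_start = len(token_ids) - block_size
    return [token_ids[i:i + block_size] for i in range(0, last_start + 1, block_size)]

def optimizer_steps(num_blocks, batch_size, epochs):
    """Number of training steps, assuming one device and no gradient accumulation."""
    return math.ceil(num_blocks / batch_size) * epochs

# Toy example: 10 "tokens" cut into blocks of 4 (the last 2 tokens are dropped).
blocks = chunk_into_blocks(list(range(10)), block_size=4)
print(blocks)
print(optimizer_steps(len(blocks), batch_size=8, epochs=3))
```

With the real settings (`block_size=128`, batch size 8, 3 epochs), the same arithmetic relates the size of the training file to the number of steps shown in the training log.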

To learn more about the model, consult the [GPT2 model documentation](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2#transformers.GPT2LMHeadModel).

GPT-2 can also be replaced here by other Hugging Face models; some of them might require alterations to the `load_dataset` and `train_model` functions.

In [39]:
# Initializes the GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
lr = 1e-5
training_results = []

## Youtube Chats

Choose the path to the file you want to train your model on...

In [None]:
model_name = "gpt2_chats"

In [None]:
model_GPT2_path = drive_path + f'models/{model_name}/1e-05'
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
training_path = output_transcripts_path # set by the last Section 2 cell that ran (the cleaned transcripts if the whole section was run)

In [None]:
train_dataset, data_collator = load_dataset(training_path, tokenizer)

result = train_model(model, tokenizer, train_dataset, data_collator, lr)
training_results.append({"model_name": model_name, "learning_rate": lr, "training_loss": result.metrics["train_loss"]})

| Step | Training Loss |
|---|---|
| 250 | 4.2547 |
| 500 | 3.9148 |
| 750 | 3.7904 |
| 1000 | 3.7203 |
| 1250 | 3.6904 |
| 1500 | 3.6638 |
| 1750 | 3.6503 |
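
One way to read these numbers: assuming the logged training loss is the mean per-token cross-entropy in nats (the usual case for a causal language model), its exponential is the perplexity, roughly the number of equally likely next tokens the model is choosing between. The helper name `perplexity` is made up for this sketch; the two loss values are copied from the log above.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)

# First and last logged values from the training run above.
logged_losses = {250: 4.2547, 1750: 3.6503}

for step, loss in logged_losses.items():
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity(loss):.1f}")
```

Note that this is perplexity on the training data, so it says nothing yet about how well the model generalizes.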


## Proposal Texts

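To train the model on the **proposal texts** instead of the YouTube chats, add them to the data folder in your personal Drive and repeat the process with that training file. Note that `model` still holds the weights from the previous training run; re-run the cell that initializes the tokenizer and model first if you want to start again from pretrained GPT-2.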

In [None]:
model_name = "gpt2_proposals"

In [None]:
model_GPT2_path = drive_path + f'models/{model_name}/1e-05'
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
proposal_texts_path = drive_path + "data/bpicd-applications.txt"
training_path = proposal_texts_path

In [None]:
train_dataset, data_collator = load_dataset(training_path, tokenizer)

result = train_model(model, tokenizer, train_dataset, data_collator, lr)
training_results.append({"model_name": model_name, "learning_rate": lr, "training_loss": result.metrics["train_loss"]})

Step,Training Loss


## Both

To train the model on **both** the YouTube chats and the proposal texts, concatenate the two training files into a single file and repeat the process with that combined file.

In [40]:
model_name = "gpt2_chats_proposals"

In [41]:
model_GPT2_path = drive_path + f'models/{model_name}/1e-05'
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained("gpt2")

In [42]:
proposal_texts_path = drive_path + "data/bpicd-applications.txt"
transcripts_path = drive_path + "data/clean_compiled_english_transcripts.txt"

with open(proposal_texts_path, 'r', encoding='utf-8') as file1:
    content_file1 = file1.read()

with open(transcripts_path, 'r', encoding='utf-8') as file2:
    content_file2 = file2.read()

combined_content = content_file1 + content_file2

combined_file_path = drive_path + "data/combined_texts.txt"
with open(combined_file_path, 'w', encoding='utf-8') as combined_file:
    combined_file.write(combined_content)

In [43]:
training_path = combined_file_path

In [44]:
train_dataset, data_collator = load_dataset(training_path, tokenizer)

result = train_model(model, tokenizer, train_dataset, data_collator, lr)
training_results.append({"model_name": model_name, "learning_rate": lr, "training_loss": result.metrics["train_loss"]})



| Step | Training Loss |
|---|---|
| 250 | 4.2532 |
| 500 | 3.9095 |
| 750 | 3.8087 |
| 1000 | 3.7475 |
| 1250 | 3.7169 |
| 1500 | 3.6926 |
| 1750 | 3.6709 |
| 2000 | 3.6568 |
| 2250 | 3.6435 |
| 2500 | 3.6493 |


Checkpoint destination directory /content/drive/MyDrive/Colab Notebooks/modina//models/gpt2_chats_proposals/1e-05/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory /content/drive/MyDrive/Colab Notebooks/modina//models/gpt2_chats_proposals/1e-05/checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


# Section 4: Inference

If you don't want to train the model again, just experiment with the inference...

## Youtube Chats

In [None]:
model_GPT2_chats_path = drive_path + 'models/gpt2_chats/1e-05'
model_finetuned_GPT2 = GPT2LMHeadModel.from_pretrained(model_GPT2_chats_path)
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
gpt2_chats_generator = pipeline('text-generation', model=model_finetuned_GPT2, tokenizer=tokenizer_GPT2)

For GPT-2, the prompt is the starting point (the initial context) from which the model continues generating text.

In [None]:
prompt = 'generate a dance\n'
generated_chat_gpt2 = gpt2_chats_generator(prompt, max_length=100)
print(generated_chat_gpt2[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


generate a dance
but it was not like the dancers I was doing these things I was not just practicing it I had it from the beginning to be the dancer for instance like
um
so
it was like I was trying to keep the conversation alive like like so you're still talking about what I said about you
I didn't it was like a part of the narrative of it
but also it was like more like when you go to the dance studio like you are talking about dancing


## Proposal Texts

In [None]:
model_GPT2_proposals_path = drive_path + 'models/gpt2_proposals/1e-05'
model_finetuned_GPT2 = GPT2LMHeadModel.from_pretrained(model_GPT2_proposals_path)
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained("gpt2")

In [None]:
gpt2_proposal_generator = pipeline('text-generation', model=model_finetuned_GPT2, tokenizer=tokenizer_GPT2)

In [None]:
prompt = 'generate a dance'
generated_proposal_gpt2 = gpt2_proposal_generator(prompt, max_length=100)
print(generated_proposal_gpt2[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


generate a dance between them we want to know what happened
okay
that's it
just to say that i was thinking like if you guys can come back to this i think we can expand
yeah i want to talk to you about it um with my girlfriend
and it's kind of complicated
yeah
but i think that it's because i thought and i think that we could we could to make this a bit more like a research exercise to talk about it um
and if


## Both with Beam Search with N-Gram Penalties
Beam search extends greedy search: instead of committing to the single most likely token, the model keeps the `num_beams` most likely hypotheses at each time step, so it can compare alternative paths as it generates text. We also add an n-gram penalty by setting `no_repeat_ngram_size = 2`, which ensures that no 2-gram appears twice. With `num_return_sequences = 1` only the best beam is returned; raise it (up to `num_beams`) to see what the other beams looked like.
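
To illustrate the two ideas without loading a model, here is a toy sketch of beam search with a no-repeat n-gram rule. Everything in it is made up for illustration: `TOY_MODEL` is a hand-written table of next-token probabilities, and `beam_search` and `creates_repeat` are hypothetical helpers, not the transformers implementation.

```python
import math

# Toy next-token distribution keyed by the previous token (made-up numbers).
TOY_MODEL = {
    "<s>": {"we": 0.6, "you": 0.4},
    "we": {"dance": 0.5, "we": 0.3, "talk": 0.2},
    "you": {"dance": 0.9, "talk": 0.1},
    "dance": {"we": 0.6, "you": 0.4},
    "talk": {"we": 0.5, "you": 0.5},
}

def creates_repeat(tokens, candidate, n):
    """True if appending candidate would repeat an n-gram already present in tokens."""
    extended = tokens + [candidate]
    if len(extended) < n:
        return False
    new_ngram = tuple(extended[-n:])
    # All n-grams except the one that ends with the candidate token.
    seen = {tuple(extended[i:i + n]) for i in range(len(extended) - n)}
    return new_ngram in seen

def beam_search(num_beams, steps, no_repeat_ngram_size=0):
    """Keep the num_beams highest-scoring sequences after each step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                if no_repeat_ngram_size and creates_repeat(tokens, token, no_repeat_ngram_size):
                    continue  # skip continuations that would repeat an n-gram
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

greedy = beam_search(num_beams=1, steps=2)[0]
best_of_two = beam_search(num_beams=2, steps=2)[0]
print("greedy:", greedy[1], round(math.exp(greedy[0]), 2))
print("beam  :", best_of_two[1], round(math.exp(best_of_two[0]), 2))
print(beam_search(num_beams=2, steps=4, no_repeat_ngram_size=2))
```

In this toy table, greedy search commits to the likelier first token and ends up with a less probable two-token sequence than the one beam search finds by keeping a second hypothesis alive.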



In [45]:
model_GPT2_chats_proposals_path = drive_path + 'models/gpt2_chats_proposals/1e-05'
model_finetuned_GPT2 = GPT2LMHeadModel.from_pretrained(model_GPT2_chats_proposals_path)
tokenizer_GPT2 = GPT2Tokenizer.from_pretrained("gpt2")

In [46]:
gpt2_text_generator = pipeline('text-generation', model=model_finetuned_GPT2, tokenizer=tokenizer_GPT2)

In [48]:
prompt = 'generate a dance'
generated_proposal_gpt2 = gpt2_text_generator(prompt,
                                              max_length=200,
                                              num_beams = 5,
                                              no_repeat_ngram_size = 2,
                                              num_return_sequences = 1,
                                              early_stopping = True
                                              )
print(generated_proposal_gpt2[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


generate a dance that's not just a choreographic thing but it's something that you're doing in a way
and then you don't have to think about it
right
yeah
it's just you know like you can just do it like it doesn't matter if you want to do the choreography or not
but i think it would be really nice to have that kind of experience with the dance because then i would like to know what is it that i'm doing and how do i do that
okay
so
i think i i will try to be honest with you about what i was trying to say because i feel like i have a lot of things to talk about and i mean i haven't really talked about this in the last few days but i just wanted to share a little bit of my experience so that we can be more honest about the practices and what we are doing
no
that's why i felt like this is really important for us because we have


## Mass Inference

A function that generates a batch of texts and saves each one as a .txt file in a folder in your personal Drive.

In [None]:
def mass_generation(prompt, gpt2_generator, generated_texts_path, number_texts=20, max_length=200):

    # Create the output directory if it doesn't exist
    os.makedirs(generated_texts_path, exist_ok=True)

    for i in range(number_texts):
        generated_chat_gpt2 = gpt2_generator(prompt, max_length=max_length)

        file_name = f"generated_{i}.txt"
        file_path = os.path.join(generated_texts_path, file_name)

        # Write as UTF-8 so non-ASCII characters in the generated text are saved reliably
        with open(file_path, 'w', encoding='utf-8') as file:
            generated_text = generated_chat_gpt2[0]['generated_text']
            file.write(generated_text)

        print(f"Generated text {i + 1} and saved to {file_name}")

In [None]:
prompt = "Start of the conversation..."
folder_path = drive_path + "generated_texts/generated_chats_2"  # drive_path already ends with '/'
mass_generation(prompt, gpt2_chats_generator, folder_path)

In [None]:
prompt = "Start of the conversation..."
folder_path = drive_path + "generated_texts/generated_proposal"  # drive_path already ends with '/'
mass_generation(prompt, gpt2_proposal_generator, folder_path)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

Generated text 1 and saved to generated_0.txt
Generated text 2 and saved to generated_1.txt
Generated text 3 and saved to generated_2.txt
Generated text 4 and saved to generated_3.txt
Generated text 5 and saved to generated_4.txt
Generated text 6 and saved to generated_5.txt
Generated text 7 and saved to generated_6.txt
Generated text 8 and saved to generated_7.txt
Generated text 9 and saved to generated_8.txt
Generated text 10 and saved to generated_9.txt
Generated text 11 and saved to generated_10.txt
Generated text 12 and saved to generated_11.txt
Generated text 13 and saved to generated_12.txt
Generated text 14 and saved to generated_13.txt
Generated text 15 and saved to generated_14.txt
Generated text 16 and saved to generated_15.txt
Generated text 17 and saved to generated_16.txt
Generated text 18 and saved to generated_17.txt
Generated text 19 and saved to generated_18.txt
Generated text 20 and saved to generated_19.txt


# Section 5: Upload models to hugging face model hub...

**Run if** you want to save the model to the Hugging Face model hub, which is necessary for running inference from the JavaScript library.

If you do want to, set up a Hugging Face account and log in with `notebook_login` as follows.

In [49]:
!pip install huggingface-hub



In [50]:
from huggingface_hub import notebook_login

In [51]:
notebook_login()


I found that the tokenizer also needs to be pushed to the same model repository for the JavaScript API to work.

In [None]:
model_finetuned_GPT2.push_to_hub("gpt2_bestpractices_chats")
tokenizer_GPT2.push_to_hub("gpt2_bestpractices_chats")

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/blasees/gpt2_bestpractices_chats/commit/8985a6a5c77ebe9adaf4d33b69e1a0e2d15cb280', commit_message='Upload tokenizer', commit_description='', oid='8985a6a5c77ebe9adaf4d33b69e1a0e2d15cb280', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
model_finetuned_GPT2.push_to_hub("gpt2_bestpractices_proposal")
tokenizer_GPT2.push_to_hub("gpt2_bestpractices_proposal")

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/blasees/gpt2_bestpractices_proposal/commit/2f25e2735ab7e10e204719e01ff0af8b4956413b', commit_message='Upload tokenizer', commit_description='', oid='2f25e2735ab7e10e204719e01ff0af8b4956413b', pr_url=None, pr_revision=None, pr_num=None)

In [52]:
model_finetuned_GPT2.push_to_hub("gpt2_chats_proposals")
tokenizer_GPT2.push_to_hub("gpt2_chats_proposals")

Non-default generation parameters: {'max_length': 50, 'do_sample': True}


model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/blasees/gpt2_chats_proposals/commit/38c518f6ba6e8bd8ec61d65e2d72fa8f444dc19b', commit_message='Upload tokenizer', commit_description='', oid='38c518f6ba6e8bd8ec61d65e2d72fa8f444dc19b', pr_url=None, pr_revision=None, pr_num=None)