<a href="https://colab.research.google.com/github/blainemartin/ml_training_prep/blob/main/Star_Trek_ML_Training_Data_Prep_Tool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Training Data Prep Tool (Star Trek)

This is an all-in-one tool for building consolidated training data sets to support "Game Master" chatbots for Star Trek text-based adventures. It can be run on a non-GPU instance. The tool can compile the following training data for use in LLM LoRA training:

*   Episode/Movie Scripts
*   YouTube Playlist Subtitle Transcriptions
*   Wiki Articles (Memory-Alpha, Memory-Beta, and Wikipedia supported)

Steps 1.x ensure you have a stable connection and the right runtime type provisioned.

Steps 2.x prepare the folder structure that will be used in later training stages. These steps also download data into the /content/text-generation-webui/training/datasets folder.
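
As a quick sanity check after steps 2.x, the sketch below counts the `.txt` files under each dataset subfolder. The subfolder names (`Scripts`, `Commentary`, `Wiki`) are the default output directories used in steps 2.1 to 2.3; the helper name `count_txt_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Default output root used by steps 2.1-2.3 of this notebook.
DATASETS_ROOT = Path("/content/text-generation-webui/training/datasets")


def count_txt_files(root, subfolders=("Scripts", "Commentary", "Wiki")):
    """Return {subfolder: number of .txt files found recursively}.

    Hypothetical helper: a subfolder that does not exist yet counts as 0.
    """
    root = Path(root)
    counts = {}
    for name in subfolders:
        folder = root / name
        counts[name] = sum(1 for _ in folder.rglob("*.txt")) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    for name, n in count_txt_files(DATASETS_ROOT).items():
        print(f"{name}: {n} .txt files")
```

A count of zero for a subfolder most likely means the corresponding download step was skipped or wrote to a different output directory.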

Step 3 consolidates all of the data obtained in steps 2.x into one subfolder, /content/text-generation-webui/training/datasets/consolidated, which can be used by the raw data input portion of the Web UI training module.
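
The consolidation idea can be sketched as a small function: copy every `.txt` file from a nested tree into one flat folder, and append a numeric suffix when two files share a name. This is a minimal sketch under stated assumptions, not the notebook's exact cell; `flatten_txt_files` is a hypothetical name.

```python
import shutil
from pathlib import Path


def flatten_txt_files(source_root, target_dir):
    """Copy every .txt under source_root into target_dir, renaming on collision.

    Hypothetical helper. Returns the list of file names written to target_dir.
    """
    source_root = Path(source_root).resolve()
    target_dir = Path(target_dir).resolve()
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() materializes the listing first, so files copied during the loop
    # are never revisited, and the copy order is reproducible.
    for src in sorted(source_root.rglob("*.txt")):
        # Skip files that already live inside the target folder.
        if target_dir in src.parents:
            continue
        dest = target_dir / src.name
        counter = 1
        while dest.exists():
            # Name collision: try name_1.txt, name_2.txt, ...
            dest = target_dir / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.copy(src, dest)
        copied.append(dest.name)
    return copied
```

For example, `flatten_txt_files("/content/text-generation-webui/training/datasets", "/content/text-generation-webui/training/datasets/consolidated")` would gather the downloaded files into the folder the training module reads from. Clearing the target folder beforehand avoids accumulating renamed duplicates across runs.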

In [None]:
#@title 1.0 Keep this tab alive to prevent Colab from disconnecting you { display-mode: "form" }

#@markdown Press play on the music player that will appear below:
%%html
<audio src="https://oobabooga.github.io/silence.m4a" controls></audio>

In [None]:
#@title 1.1 GPU Check { display-mode: "form" }


gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
#@title 1.2 Memory Check { display-mode: "form" }


from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of total RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

In [None]:
# @title 2.0 Prep File Structure & Training Data
%cd /content
!git clone https://github.com/oobabooga/text-generation-webui
#!git clone https://github.com/camenduru/text-generation-webui

In [None]:
# @title 2.1 Star Trek Episode Scripts
series = "TNG,TOS,DS9,VOY" # @param {type:"string"}

#Return to home
%cd /content
#Use the git clone command to clone the repository:
!git clone https://github.com/blainemartin/ml_training_prep.git
#Navigate into the new directory:
%cd ml_training_prep
#Then, navigate into the Star Trek Episode Scripts directory:
%cd "Star Trek Episode Scripts"
#Install required Python modules
!pip install -r requirements.txt
#Now run the script. Use argument to specify output directory.
!python script_downloader.py {series} /content/text-generation-webui/training/datasets/Scripts


In [None]:
# @title 2.2 YouTube Commentary
output_dir = "/content/text-generation-webui/training/datasets/Commentary" # @param {type:"string"}

#Return to home
%cd /content
#Clone the repository, or pull the latest changes if it was already cloned:
!git clone https://github.com/blainemartin/ml_training_prep.git || (cd ml_training_prep && git pull)
%cd /content
#Navigate into the new directory:
%cd ml_training_prep
#Then, navigate into the YouTube Transcripts directory:
%cd "YouTube Transcripts"
#Install required Python modules
!pip install yt-dlp
#Now run the script. Use argument to specify playlist URL and output directory.
playlist_urls = [
    "https://www.youtube.com/playlist?list=PLLs9RolP5tC5BnuJq4z5P6UOzyjRfkBQV",
    "https://www.youtube.com/playlist?list=PL0bMaYlUR-3D22hUlvSuhpOAzLei2jkA_",
    "https://www.youtube.com/playlist?list=PL5Pso33oqJDidBC83byR7Mlna6gak_4fx",
    "https://www.youtube.com/playlist?list=PLAXhpI9PdbZYF9gX4d8SHTk56eQ7w912Q",
    "https://www.youtube.com/playlist?list=PL8FWJwq6-Yp50V5fM_uGfh1BwFNDeKRJW",
    "https://www.youtube.com/playlist?list=PLjNbxX7w4eojMaTakwmoqbDF9BXG1j6FI",
]

for url in playlist_urls:
    # Quote the arguments so the shell does not interpret "?" in the URL
    !python transcripts_downloader.py "{url}" "{output_dir}"


In [None]:
# @title 2.3 Wiki Articles
series = "TNG,DS9,VOY,TOS" # @param {type:"string"}
wikis = "MemAlpha,Wikipedia" # @param {type:"string"}
output_dir = "/content/text-generation-webui/training/datasets/Wiki" # @param {type:"string"}

#Return to home
%cd /content
#Clone the repository, or pull the latest changes if it was already cloned:
!git clone https://github.com/blainemartin/ml_training_prep.git || (cd ml_training_prep && git pull)
%cd /content
#Navigate into the new directory:
%cd ml_training_prep
#Then, navigate into the Star Trek Wiki Articles directory:
%cd "Star Trek Wiki Articles"
#Install required Python modules
!pip install requests beautifulsoup4 urljoin
#Now run the script. Use arguments to specify the series, the wikis, and the output directory.
!python article_downloader.py {series} {wikis} {output_dir}


In [None]:
# @title 3.0 Consolidate .txt files for Text Generation Web UI training module(s)
#%cd /content
#%cd ml_training_prep
#!python txt_consolidator.py /content/text-generation-webui/training/datasets


import os
import shutil
import glob

# Define paths
base_path = "/content/text-generation-webui/training/datasets"
consolidated_folder_path = os.path.join(base_path, "consolidated")
consolidated_file_path = os.path.join(base_path, "consolidated.txt")
trainer_datasets_file_path = os.path.join(base_path, "put-trainer-datasets-here.txt")
new_consolidated_folder_path = "/content/consolidated"

# Delete existing folders and files if they exist (including the staging folder,
# so that re-running this cell does not accumulate renamed duplicates)
for path in [consolidated_folder_path, consolidated_file_path, trainer_datasets_file_path, new_consolidated_folder_path]:
    if os.path.exists(path):
        if os.path.isfile(path):
            os.remove(path)
        else:
            shutil.rmtree(path)

# Create a fresh staging folder
os.makedirs(new_consolidated_folder_path, exist_ok=True)

# Copy all .txt files to the new consolidated folder
for dirpath, dirnames, filenames in os.walk(base_path):
    for filename in filenames:
        if filename.endswith('.txt'):
            # Keep the original filename; append a counter if that name is already taken
            new_filename = filename
            counter = 1
            while os.path.exists(os.path.join(new_consolidated_folder_path, new_filename)):
                name, ext = os.path.splitext(filename)
                new_filename = f"{name}_{counter}{ext}"
                counter += 1
            shutil.copy(os.path.join(dirpath, filename), os.path.join(new_consolidated_folder_path, new_filename))

# Copy the new consolidated folder to the original location
shutil.copytree(new_consolidated_folder_path, consolidated_folder_path)

# Concatenate all .txt files in the consolidated folder (sorted for a reproducible order)
with open(consolidated_file_path, 'w', encoding='utf-8') as outfile:
    for filename in sorted(glob.glob(os.path.join(consolidated_folder_path, '*.txt'))):
        with open(filename, 'r', encoding='utf-8', errors='replace') as readfile:
            outfile.write(readfile.read() + '\n' * 3)



In [None]:
# @title 3.1 Export Dataset to Google Drive

%cd /content/text-generation-webui/training/datasets

import os
import shutil
from google.colab import drive

#Mount Google Drive to /content/drive
drive.mount('/content/drive')

# Define the file and directory paths
file_path = "/content/drive/MyDrive/ST_ML_Training_Set.zip"
dir_path = "/content/text-generation-webui/training/datasets"

# Delete the file if it exists
if os.path.exists(file_path):
    os.remove(file_path)

# Create a zip file from the directory
shutil.make_archive("/content/drive/MyDrive/ST_ML_Training_Set", 'zip', dir_path, "consolidated")