Pretraining a large language model from scratch with your own custom domain data and Amazon SageMaker

arunprsh/train-bert-from-scratch-on-sagemaker

Large Language Model (LLM) from Scratch on Amazon SageMaker

This repository contains the supporting notebooks for a Medium article series on pretraining a large language model (LLM) from scratch with a custom domain corpus and Amazon SageMaker.


The series covers the following topics:

Module 1: Acquiring the dataset

Learn how to acquire and preprocess your custom domain corpus for training the LLM.
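As a rough illustration of the preprocessing step, the sketch below normalizes whitespace and drops fragments too short to be useful training text. The function name, the sentence-per-line format, and the length threshold are assumptions for illustration, not taken from the notebooks.

```python
# Minimal corpus-cleaning sketch (illustrative; the repo's notebooks may
# preprocess differently -- check Module 1 for the actual steps).
import re

def preprocess_corpus(raw_texts, min_chars=20):
    """Collapse whitespace and keep only lines long enough to train on."""
    cleaned = []
    for text in raw_texts:
        for line in text.splitlines():
            line = re.sub(r"\s+", " ", line).strip()
            if len(line) >= min_chars:
                cleaned.append(line)
    return cleaned

if __name__ == "__main__":
    docs = [
        "  Amazon SageMaker lets you   train models at scale.  ",
        "ok",  # too short, dropped
    ]
    for line in preprocess_corpus(docs):
        print(line)
```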

Module 2: Customizing vocabulary

Understand how to customize the vocabulary of the LLM to fit your dataset.
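To convey the idea of a custom vocabulary, here is a toy frequency-based sketch: count words in the domain corpus, keep the most frequent ones, and reserve the BERT special tokens. This simple word-level vocabulary is a stand-in assumption; the notebooks use proper subword (WordPiece-style) training via the Hugging Face `tokenizers` library.

```python
# Toy vocabulary builder (illustrative stand-in for WordPiece training).
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(corpus, vocab_size=30522):
    """Map special tokens plus the most frequent corpus words to IDs."""
    counts = Counter(word for line in corpus for word in line.lower().split())
    most_common = [w for w, _ in counts.most_common(vocab_size - len(SPECIAL_TOKENS))]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + most_common)}

vocab = build_vocab(["sagemaker trains models", "sagemaker scales training"])
print(vocab["[PAD]"], vocab["sagemaker"])  # special tokens come first
```

A domain-trained vocabulary keeps frequent in-domain terms as single tokens instead of splitting them into many generic subwords.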

Module 3: Tokenizing datasets

Learn how to tokenize your dataset and prepare it for training.
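The essence of this step is converting cleaned text into fixed-length ID sequences with an attention mask. The sketch below assumes whitespace tokenization and a toy vocabulary purely for illustration; the notebooks use the trained subword tokenizer from the previous module.

```python
# Illustrative encoding step: text -> padded/truncated ID sequence.
def encode(line, vocab, max_length=8):
    """Wrap tokens in [CLS]/[SEP], map to IDs, then truncate and pad."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(w, vocab["[UNK]"]) for w in line.lower().split()]
    ids.append(vocab["[SEP]"])
    ids = ids[:max_length]                # truncate long sequences
    attention_mask = [1] * len(ids)
    while len(ids) < max_length:          # pad short sequences
        ids.append(vocab["[PAD]"])
        attention_mask.append(0)
    return {"input_ids": ids, "attention_mask": attention_mask}

vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "sagemaker": 4, "trains": 5}
print(encode("SageMaker trains models", vocab))
```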

Module 4: Intermediate training

Discover how to perform intermediate training on the LLM using Amazon SageMaker.
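The general shape of a SageMaker training launch looks like the sketch below. The entry-point script name, instance type, framework versions, and hyperparameter names are assumptions for illustration; check the Module 4 notebook for the values the repo actually uses. The launch is wrapped in a function so the snippet can be read without AWS credentials.

```python
# Sketch of launching the pretraining job with the SageMaker Python SDK.
# All names/values below are illustrative, not taken from the repo.
HYPERPARAMETERS = {
    "epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

def launch_training(role, s3_train_uri):
    from sagemaker.huggingface import HuggingFace  # requires the SageMaker SDK

    estimator = HuggingFace(
        entry_point="train.py",           # hypothetical training script
        instance_type="ml.p3.2xlarge",    # illustrative GPU instance
        instance_count=1,
        role=role,
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
        hyperparameters=HYPERPARAMETERS,
    )
    estimator.fit({"train": s3_train_uri})  # starts the managed training job
    return estimator
```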

Module 5: Deployment as an endpoint

Explore how to deploy the trained LLM as an endpoint for inference.
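Deployment typically wraps the trained artifacts in a model object and calls `deploy`. The sketch below assumes a `model.tar.gz` in S3 and an illustrative instance type; it is wrapped in a function so it reads without AWS access, and the fill-mask request is only an example of what a BERT-style endpoint can serve.

```python
# Sketch of deploying the trained model as a real-time endpoint
# (all names/values are illustrative assumptions).
def deploy_endpoint(role, model_s3_uri):
    from sagemaker.huggingface import HuggingFaceModel  # requires the SageMaker SDK

    model = HuggingFaceModel(
        model_data=model_s3_uri,          # s3:// path to model.tar.gz
        role=role,
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
    )
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",     # illustrative CPU instance
    )
    # Example request for a BERT-style masked-language model:
    # predictor.predict({"inputs": "Amazon [MASK] trains models at scale."})
    return predictor
```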

Prerequisites

To follow along with the notebooks in this repository, you will need an AWS account with access to Amazon SageMaker. Additionally, you will need to have a custom domain corpus ready for training.

Getting started

To get started, clone this repository and open the notebooks in the order they are presented in the series. Each notebook contains detailed instructions and explanations for the corresponding module.
