Pretraining a large language model from scratch with your own custom domain data and Amazon SageMaker

arunprsh/train-bert-from-scratch-on-sagemaker

Large Language Model (LLM) from Scratch on Amazon SageMaker

This repository contains the supporting notebooks for a Medium article series on pretraining a large language model (LLM) from scratch with a custom domain corpus and Amazon SageMaker.


The series covers the following topics:

Module 1: Acquiring the dataset

Learn how to acquire and preprocess your custom domain corpus for training the LLM.
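As a rough illustration of the preprocessing step, the sketch below normalizes whitespace and drops fragments too short to be useful training text. The function name, the sentence-per-line format, and the length threshold are assumptions for illustration, not taken from the notebooks.

```python
# Minimal corpus-cleaning sketch (illustrative; the repo's notebooks may
# preprocess differently -- check Module 1 for the actual steps).
import re

def preprocess_corpus(raw_texts, min_chars=20):
    """Collapse whitespace and keep only lines long enough to train on."""
    cleaned = []
    for text in raw_texts:
        for line in text.splitlines():
            line = re.sub(r"\s+", " ", line).strip()
            if len(line) >= min_chars:
                cleaned.append(line)
    return cleaned

if __name__ == "__main__":
    docs = [
        "  Amazon SageMaker lets you   train models at scale.  ",
        "ok",  # too short, dropped
    ]
    for line in preprocess_corpus(docs):
        print(line)
```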

Module 2: Customizing vocabulary

Understand how to customize the vocabulary of the LLM to fit your dataset.
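To convey the idea of a custom vocabulary, here is a toy frequency-based sketch: count words in the domain corpus, keep the most frequent ones, and reserve the BERT special tokens. This simple word-level vocabulary is a stand-in assumption; the notebooks use proper subword (WordPiece-style) training via the Hugging Face `tokenizers` library.

```python
# Toy vocabulary builder (illustrative stand-in for WordPiece training).
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def build_vocab(corpus, vocab_size=30522):
    """Map special tokens plus the most frequent corpus words to IDs."""
    counts = Counter(word for line in corpus for word in line.lower().split())
    most_common = [w for w, _ in counts.most_common(vocab_size - len(SPECIAL_TOKENS))]
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + most_common)}

vocab = build_vocab(["sagemaker trains models", "sagemaker scales training"])
print(vocab["[PAD]"], vocab["sagemaker"])  # special tokens come first
```

A domain-trained vocabulary keeps frequent in-domain terms as single tokens instead of splitting them into many generic subwords.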

Module 3: Tokenizing datasets

Learn how to tokenize your dataset and prepare it for training.
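The essence of this step is converting cleaned text into fixed-length ID sequences with an attention mask. The sketch below assumes whitespace tokenization and a toy vocabulary purely for illustration; the notebooks use the trained subword tokenizer from the previous module.

```python
# Illustrative encoding step: text -> padded/truncated ID sequence.
def encode(line, vocab, max_length=8):
    """Wrap tokens in [CLS]/[SEP], map to IDs, then truncate and pad."""
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(w, vocab["[UNK]"]) for w in line.lower().split()]
    ids.append(vocab["[SEP]"])
    ids = ids[:max_length]                # truncate long sequences
    attention_mask = [1] * len(ids)
    while len(ids) < max_length:          # pad short sequences
        ids.append(vocab["[PAD]"])
        attention_mask.append(0)
    return {"input_ids": ids, "attention_mask": attention_mask}

vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "sagemaker": 4, "trains": 5}
print(encode("SageMaker trains models", vocab))
```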

Module 4: Intermediate training

Discover how to perform intermediate training on the LLM using Amazon SageMaker.
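The general shape of a SageMaker training launch looks like the sketch below. The entry-point script name, instance type, framework versions, and hyperparameter names are assumptions for illustration; check the Module 4 notebook for the values the repo actually uses. The launch is wrapped in a function so the snippet can be read without AWS credentials.

```python
# Sketch of launching the pretraining job with the SageMaker Python SDK.
# All names/values below are illustrative, not taken from the repo.
HYPERPARAMETERS = {
    "epochs": 3,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

def launch_training(role, s3_train_uri):
    from sagemaker.huggingface import HuggingFace  # requires the SageMaker SDK

    estimator = HuggingFace(
        entry_point="train.py",           # hypothetical training script
        instance_type="ml.p3.2xlarge",    # illustrative GPU instance
        instance_count=1,
        role=role,
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
        hyperparameters=HYPERPARAMETERS,
    )
    estimator.fit({"train": s3_train_uri})  # starts the managed training job
    return estimator
```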

Module 5: Deployment as an endpoint

Explore how to deploy the trained LLM as an endpoint for inference.
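Deployment typically wraps the trained artifacts in a model object and calls `deploy`. The sketch below assumes a `model.tar.gz` in S3 and an illustrative instance type; it is wrapped in a function so it reads without AWS access, and the fill-mask request is only an example of what a BERT-style endpoint can serve.

```python
# Sketch of deploying the trained model as a real-time endpoint
# (all names/values are illustrative assumptions).
def deploy_endpoint(role, model_s3_uri):
    from sagemaker.huggingface import HuggingFaceModel  # requires the SageMaker SDK

    model = HuggingFaceModel(
        model_data=model_s3_uri,          # s3:// path to model.tar.gz
        role=role,
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
    )
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.xlarge",     # illustrative CPU instance
    )
    # Example request for a BERT-style masked-language model:
    # predictor.predict({"inputs": "Amazon [MASK] trains models at scale."})
    return predictor
```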

Prerequisites

To follow along with the notebooks in this repository, you will need an AWS account with access to Amazon SageMaker. Additionally, you will need to have a custom domain corpus ready for training.

Getting started

To get started, clone this repository and open the notebooks in the order they are presented in the series. Each notebook contains detailed instructions and explanations for the corresponding module.
