python -m pip install git+https://github.com/huggingface/optimum-intel.git
pip install git+https://github.com/huggingface/transformers
pip install datasets
pip install evaluate
pip install wandb
The Script folder contains all the files:
- v2_mp_mistral.py is the main training script
- it requires training_utils.py as a utility file providing the model and dataset methods
To run the script, the following model-specific arguments are available (the original argparse definitions are sketched after this list):
--device", type=str, default="0", help="cuda device number", choices=["0", "1", "2", "3"]
--layer", type=int, default=11, help="number of layers"
--seq_len", type=int, default=4*1024, help="sequence length"
--batch_size", type=int, default=1, help="batch size"
--float16", action="store_true", help="use float16"
--adafactor", action='store_true', help="use adafactor"
--enb_grad_checkpoint", action='store_true', help="disable use cache in model config and enable gradient checkpointing"
--data_percent", type=float, default=0.001, help="data row percent"
--vocab", type=int, default=30000, help="vocab size"
--checkpoint", type=str, default=None, help="checkpoint path"
We can use the following command to run the model:
python v2_mp_mistral.py --float16 --enb_grad_checkpoint --layer 11
| Using the --checkpoint argument we can pass the path to the checkpoint to be used.
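For example, to start from a saved checkpoint (the path below is a placeholder):

python v2_mp_mistral.py --float16 --enb_grad_checkpoint --layer 11 --checkpoint /path/to/checkpoint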
The idea is to implement distributed data parallelism (DDP):
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 v2_mp_mistral.py --layer 6 --seq_len 4096 --float16 --enb_grad_checkpoint
Explanation of the above command:
| torchrun is equivalent to python -m torch.distributed.launch --use_env filename.py: it launches the script once per GPU and passes rank information through environment variables, so no --use_env flag is needed with torchrun itself.
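Concretely, torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in each process's environment. A minimal sketch of the DDP setup a script launched this way needs (illustrative only; the function and variable names are not taken from the actual v2_mp_mistral.py):

```python
# Minimal DDP initialization sketch for a torchrun-launched script.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> DDP:
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun, one value per process
    dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from the environment
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    # Gradients are all-reduced across both GPUs after every backward pass.
    return DDP(model, device_ids=[local_rank])
```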
Inspired by the Hugging Face command below:
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 training_vanila.py --model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
Ref LINK
- torchrun loads the data on both GPUs and tokenizes it twice, once per process. This should not happen.
- Required implementation: load the data once, tokenize it, convert it to a dataset, and then load it onto both GPUs (see the sketch below).
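One common way to get this behaviour with the datasets library is to let only rank 0 download and tokenize while the other rank waits at a barrier, then shard the result with a DistributedSampler. A sketch under that assumption (the dataset name, tokenizer handling, and function name are placeholders, not the actual training_utils.py code):

```python
# Sketch: load and tokenize once, then shard across the GPUs.
# Assumes dist.init_process_group() has already been called (see the DDP sketch above).
import torch.distributed as dist
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_dataloader(tokenizer, batch_size, seq_len):
    rank = dist.get_rank()

    # Non-zero ranks wait here; rank 0 does the download and tokenization first,
    # so the other process reuses the on-disk cache instead of redoing the work.
    if rank != 0:
        dist.barrier()

    raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # placeholder dataset
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                padding="max_length", max_length=seq_len),
        batched=True,
        remove_columns=["text"],
    )
    tokenized.set_format("torch")

    if rank == 0:
        dist.barrier()  # release the waiting rank once the cache is populated

    # DistributedSampler gives each GPU a disjoint shard of the tokenized dataset.
    sampler = DistributedSampler(tokenized)
    return DataLoader(tokenized, batch_size=batch_size, sampler=sampler)
```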