This notebook fine-tunes the [LLAMA-2]() model on a custom dataset for a summarisation task.

In [1]:
import os

# https://discuss.pytorch.org/t/use-first-available-gpu/42718
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

DATA_DIR = '/workspace/data/llama'
LLAMA_DIR = '/workspace/data/llama'
os.environ["HUGGINGFACE_TOKEN"] = "<yours>"   # llama_private
os.environ['HF_DATASETS_CACHE'] = f"{DATA_DIR}/hf_cache"
os.environ['TRANSFORMERS_CACHE'] = f"{DATA_DIR}/hf_cache"


import traceback
import sys
print(f"Python {sys.version}")

from matplotlib import pyplot as plt

"""
We need to make this deterministic so we can keep track of the changes we make to the model.
If we use the same initialisation every time, then changes in the results come from
our changes rather than from random variation.
"""
import numpy as np
np.random.seed(317)

import random
random.seed(317)

try:
    import torch as pt
    DEVICE = 'cuda' if pt.cuda.is_available() else 'cpu'
    # DEVICE = 'cpu'
    print(f"PyTorch {pt.__version__}")
    print(f"DEVICE={DEVICE}")
    if pt.cuda.is_available():
        print(f"\tGPU: {pt.cuda.get_device_name(0)}")
        print(f"\t\tcapability: {pt.cuda.get_device_capability('cuda')[0]}")
        print(f"\tCUDA version: {pt.version.cuda}")
        print("\tcuDNN available: ", pt.backends.cudnn.is_available())

        if pt.backends.cudnn.is_available():
            print("\t\tcuDNN version: ", pt.backends.cudnn.version())

    from torch import nn
    from torch.nn import functional as F
    from torch.optim import AdamW

    pt.manual_seed(317)
except Exception:
    print("No PyTorch")
    print(traceback.format_exc())


try:
    import tensorflow as tf
    print(f"TensorFlow {tf.__version__}")
    print(f"Build with CUDA: {tf.test.is_built_with_cuda()}")
    print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
except Exception:
    print("No TensorFlow")

%cd /workspace/data
if 'llama' not in os.getcwd():
    os.makedirs("./llama", exist_ok=True)
    os.chdir("./llama")
%pwd

Python 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
PyTorch 2.1.0a0+4136153
DEVICE=cuda
	GPU: Tesla V100-SXM2-32GB
		capability: 7
	CUDA version: 12.1
	cuDNN available:  True
		cuDNN version:  8902
No TensorFlow
/workspace/data


'/workspace/data/llama'
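
The seeding done in the cell above can be gathered into one helper, so that every run starts from the same random state. This is a minimal sketch and not part of the original notebook: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they can be imported.

```python
import random


def seed_everything(seed: int = 317) -> None:
    """Seed the Python, NumPy and PyTorch RNGs (the last two only if installed)."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)  # seed every visible GPU
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


seed_everything(317)
first = random.random()
seed_everything(317)
second = random.random()
print(first == second)  # True
```

Seeding makes the random number streams repeatable; it does not by itself guarantee bit-identical GPU results.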

Create a Dockerfile that will be used to build the image and train the model.

In [7]:
%%writefile Dockerfile
FROM nvcr.io/nvidia/pytorch:23.06-py3

RUN apt-get update  && apt-get install -y git python3-virtualenv wget 

ENV HUGGINGFACE_TOKEN="<yours>"

RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git
RUN pip install -U --no-cache-dir accelerate bitsandbytes
RUN pip install -U transformers
RUN pip install flash-attn==2.0.0.post1  #--no-build-isolation
RUN pip uninstall --yes transformer-engine

WORKDIR /workspace
RUN wget https://raw.githubusercontent.com/facebookresearch/llama-recipes/main/examples/custom_dataset.py

RUN apt-get update && apt-get install -y mc

ENV HF_DATASETS_CACHE="/volume/.cache/huggingface/datasets"
ENV TRANSFORMERS_CACHE="/volume/.cache/huggingface/hub"

RUN mkdir -p /volume/output_dir
RUN mkdir -p /volume/fine-tuned

WORKDIR /workspace

RUN wget https://raw.githubusercontent.com/facebookresearch/llama-recipes/main/examples/finetuning.py

CMD [ "/bin/bash", "-e", "-c","huggingface-cli login --token $HUGGINGFACE_TOKEN && python3 -m llama_recipes.finetuning  --model_name meta-llama/Llama-2-7b-hf --use_peft --peft_method lora --quantization --batch_size_training 4 --dataset custom_dataset --custom_dataset.file /workspace/custom_dataset.py --output_dir /volume/output_dir"]

Overwriting Dockerfile


In [8]:
!docker build -t crilun:llama_demo .

[+] Building 0.0s (18/18) FINISHED                               docker:default
 => [internal] load .dockerignore                                          0.0s
 => => transferring context: 2B                                            0.0s
 => [internal] load build definition from Dockerfile                       0.0s
 => => transferring dockerfile: 1.41kB                                     0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:23.06-py3          0.0s
 => [ 1/14] FROM nvcr.io/nvidia/pytorch:23.06-py3                          0.0s
 => CACHED [ 2/14] RUN apt-get update  && apt-get install -y git python3-  0.0s
 => CACHED [ 3/14] RUN pip install -U --no-cache-dir git+https://github.c  0.0s

In [9]:
!docker run --gpus '"device=0"' -v /mnt/QNAP/crilun/llama:/volume --name crilun_llama_demo --rm crilun:llama_demo


== PyTorch ==

NVIDIA Release 23.06 (build 63009835)
PyTorch Version 2.1.0a0+4136153

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rig
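
The container reads the Hugging Face token from the `HUGGINGFACE_TOKEN` environment variable. One way to supply it without storing it in the image is to forward it from the host when the container starts. The sketch below only builds the `docker run` argument list and does not execute it; `docker_run_command` is a hypothetical helper, and the image, volume and container names are the ones used in the cell above.

```python
import shlex
from typing import List


def docker_run_command(image: str, volume: str, name: str, gpu: str = "0") -> List[str]:
    """Build a `docker run` argument list that forwards HUGGINGFACE_TOKEN from the host."""
    return [
        "docker", "run",
        "--gpus", f'"device={gpu}"',
        "-v", f"{volume}:/volume",
        # `-e NAME` without a value copies NAME from the calling environment,
        # so the token value never appears in the image or on the command line.
        "-e", "HUGGINGFACE_TOKEN",
        "--name", name,
        "--rm", image,
    ]


cmd = docker_run_command("crilun:llama_demo", "/mnt/QNAP/crilun/llama", "crilun_llama_demo")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `HUGGINGFACE_TOKEN` is set in the host environment.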

This fine-tunes the [Llama-2 7B model](https://huggingface.co/meta-llama/Llama-2-7b). Before using it, you need to request access to the model and be granted it by Meta.
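
The notebook does not show how the summarisation examples are formatted for training. As an illustration only, assuming each example is a dialogue with a reference summary and uses the same `[INST] ... [/INST]` wrapper as the inference prompt in this notebook, a training string could be built as follows. `build_prompt` and `build_training_text` are hypothetical helpers, not llama-recipes functions, and the sample summary is made up.

```python
def build_prompt(dialogue: str) -> str:
    """Wrap a dialogue in the [INST] ... [/INST] markers used by the inference prompt."""
    # Collapse all whitespace so the dialogue becomes a single line.
    flat = " ".join(dialogue.split())
    return f"[INST] {flat} [/INST] "


def build_training_text(dialogue: str, summary: str) -> str:
    """Prompt followed by the target summary (an assumed training format)."""
    return build_prompt(dialogue) + summary.strip()


example = build_training_text(
    "Tim: Hi, what's up?\nKim: Bad mood tbh.",
    "Kim is in a bad mood.",
)
print(example)
# [INST] Tim: Hi, what's up? Kim: Bad mood tbh. [/INST] Kim is in a bad mood.
```

Keeping the prompt builder separate means the same function can be reused at inference time, so the model sees the wrapper it was trained on.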

## Usage of the model

In [2]:
import os

from huggingface_hub import login
from transformers import LlamaTokenizer, LlamaForCausalLM
from llama_recipes.inference.model_utils import load_model, load_peft_model

login(token=os.environ["HUGGINGFACE_TOKEN"])


# ---------
# inference
# ---------

DEVICE = "cuda"

# model = load_model('meta-llama/Llama-2-7b-hf', True)
model = LlamaForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    return_dict=True,
    load_in_8bit=False,
    device_map=DEVICE,
    low_cpu_mem_usage=True
)
model = load_peft_model(model, f"{DATA_DIR}/output_dir")
model.eval()

tokenizer = LlamaTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

prompt = "[INST] Tim: Hi, what's up? Kim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating Tim: What did you plan on doing? Kim: Oh you know, uni stuff and unfucking my room Kim: Maybe tomorrow I'll move my ass and do everything Kim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies Tim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores Tim: It really helps Kim: thanks, maybe I'll do that Tim: I also like using post-its in kaban style. [/INST] "
encoded = tokenizer(prompt, return_tensors="pt")

max_new_tokens=200
temperature=0.8
top_k=1000
prompt_length = encoded['input_ids'].shape[1]  # prompt length in tokens, not characters

with pt.no_grad():
    outputs = model.generate(
        input_ids=encoded['input_ids'].to(device=DEVICE),
        attention_mask=encoded['attention_mask'].to(device=DEVICE),
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        return_dict_in_generate=True,
        output_scores=True,
    )

output = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
tokens_generated = outputs.sequences[0].size(0) - prompt_length

output[:len(prompt)], " >>" , output[len(prompt):]



Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:10<00:00,  5.48s/it]


("[INST] Tim: Hi, what's up? Kim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating Tim: What did you plan on doing? Kim: Oh you know, uni stuff and unfucking my room Kim: Maybe tomorrow I'll move my ass and do everything Kim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies Tim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores Tim: It really helps Kim: thanks, maybe I'll do that Tim: I also like using post-its in kaban style. [/INST] ",
 ' >>',
 "1. This conversation appears to be between two individuals, Tim and Kim.\n2. Based on the conversation, it seems that Tim and Kim were planning to do various tasks and chores, but ended up procrastinating.\n3. Tim suggested using the Pomodoro technique to break down tasks into smaller chunks and set specific times for breaks, which could help with productivity and staying on track.\n4. Kim also mentioned using post-it notes in a kaban-style, whi

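For a causal model such as this one, `generate` returns the prompt tokens followed by the newly generated tokens. The number of generated tokens is therefore the sequence length minus the prompt length measured in tokens, not characters. A minimal sketch, with plain lists standing in for token-id tensors (`split_generated` is a hypothetical helper and the ids are toy values):

```python
from typing import List, Sequence, Tuple


def split_generated(sequence: Sequence[int], prompt_len: int) -> Tuple[List[int], List[int]]:
    """Split a generated sequence into (prompt ids, newly generated ids)."""
    if not 0 <= prompt_len <= len(sequence):
        raise ValueError("prompt_len must be between 0 and len(sequence)")
    return list(sequence[:prompt_len]), list(sequence[prompt_len:])


# Toy ids: 4 prompt tokens followed by 3 generated tokens.
prompt_ids, new_ids = split_generated([1, 518, 25580, 29962, 910, 338, 2], prompt_len=4)
print(len(new_ids))  # 3
```

With the objects from the cell above, the prompt length in tokens would be `encoded['input_ids'].shape[1]`, and `tokenizer.decode(outputs.sequences[0][prompt_len:], skip_special_tokens=True)` would decode only the generated part.
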
## Pricing


https://instances.vantage.sh/aws/ec2/p3dn.24xlarge

https://lambdalabs.com/blog/8-v100-server-on-prem-vs-p3-instance-tco-analysis-cost-comparison