### Set-up: Training Mistral

#### Instruction fine-tuning
This notebook instruction-tunes the huggingface-llm-mistral-7b model for a new task. Mistral-7B-v0.1 is a pretrained generative text model (LLM) with 7 billion parameters. According to its model card, Mistral-7B-v0.1 outperforms Llama 2 13B on all benchmarks its authors tested. For details, see its HuggingFace webpage.

#### Training data
Training data is in JSON Lines (.jsonl) format, where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it can be split across multiple .jsonl files. The training folder can also contain a template.json file describing the input and output formats.
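As a rough illustration of the layout described above, the sketch below writes a two-record `.jsonl` file and a `template.json` into a temporary folder. The file name `train.jsonl`, the field names, and the sample values are assumptions chosen for illustration; only the `prompt` and `completion` template keys are taken from the template used later in this notebook.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records; the real data comes from the processed CSV.
records = [
    {"instruction": "Classify this email address.",
     "context": '{"email_address":"info@example.com"}',
     "response": '{"email_address_type":"non-person"}'},
    {"instruction": "Classify this email address.",
     "context": '{"email_address":"jane.doe@example.com"}',
     "response": '{"email_address_type":"person"}'},
]

# Template describing how record fields map to the prompt and the completion.
template = {
    "prompt": "### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}",
}

train_dir = Path(tempfile.mkdtemp())

# One JSON object per line, as the JSON Lines format requires.
with open(train_dir / "train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

with open(train_dir / "template.json", "w", encoding="utf-8") as f:
    json.dump(template, f)

print(sorted(p.name for p in train_dir.iterdir()))
```

Every placeholder in the template has to exist as a key in each record, otherwise the prompt cannot be rendered for that sample.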


In [2]:
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm
import pandas as pd
import sys
import datetime
sys.path.append('..')
from utils.metrics import Evaluate
# from utils.utils import Mistral_7B_V1
from prompts.mistral_7b_email_type import prompt_data
from utils.s3_helper import read_s3_csv_to_dataframe
import re
import json
import sagemaker
from sagemaker.s3 import S3Uploader
import random
from sagemaker import hyperparameters
from sagemaker.jumpstart.estimator import JumpStartEstimator
from sagemaker import TrainingJobAnalytics



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


### 1. Instruction fine-tuning

####  Pre-requisites: Initialize

In [3]:
# endpoint and model data
endpoint_name = 'hf-llm-mistral-7b-2024-03-26-20-15-13-644'
model = "Mistral_7B"
model_version = "2.3.0"
model_id = "huggingface-llm-mistral-7b"

# S3 bucket used
bucket_name = 'sagemaker-sigparser-caylent-mlops'

# input data
s3_file_key = 'data/email-type/input/processed/28-03-2024_train.csv'
# s3_file_key = 'data/email-type/input/processed/27-03-2024_train.csv'

###  1.1. Preparing training data

In [4]:
cleaned_train_df = read_s3_csv_to_dataframe(bucket_name, s3_file_key)

# Use the below code to read the cleaned data locally
# cleaned_train_df = pd.read_csv('../data/test-data/data-March-11/cleaned_test_data.csv')
cleaned_train_df.shape

(8204, 4)

#### Configure the number of training records

In [5]:
# Set record_count to a small number (e.g. 5) for a quick test run.
# By default, use the full length of the dataframe.
record_count = len(cleaned_train_df)
# record_count = 5
temp_train_data = cleaned_train_df.head(record_count).copy()
temp_train_data.iloc[0]

Email Address                 \t-ng@nationalgypsum.com
Email Address Name                               \t-ng
Email Address Display Name                \t- NG EMAIL
Email Type                                  Non-Person
Name: 0, dtype: object

In [6]:
temp_train_data.shape

(8204, 4)

In [7]:
temp_train_data

Unnamed: 0,Email Address,Email Address Name,Email Address Display Name,Email Type
0,\t-ng@nationalgypsum.com,\t-ng,\t- NG EMAIL,Non-Person
1,\t+12134588429.30119168@resources.lync.com,12134588429,\t+12134588429 30119168,Non-Person
2,\t+12134588429.61498480@resources.lync.com,12134588430,\t+12134588429 61498480,Non-Person
3,\t+146238799022001@voicemail.com,146238799022001,146238799022001,Non-Person
4,!badlandsroom@acuitybrands.com,!badlandsroom,!JLS-Badlands Room,Non-Person
...,...,...,...,...
8199,s.masters@robparal.com,s.masters,Stanley Masters| Power Markets Corp.,Person
8200,martinp@ijohep.net,martinp,Martin Park l CDEEF Conference Group,Person
8201,mark@foxwoodtaxsearch.com,mark,Mark Morgan/ Service/ Princeton Finance Ltd.,Person
8202,colep@robsonsweb.com,colep,COLE POWERS | Keystone Minig Corp.,Person


#### Get the prompt and print prompt version to confirm.

In [8]:
system_prompt = prompt_data["system_prompt"]
instruction = prompt_data["instruction"]
prompt_version = prompt_data["prompt_version"]
print(" prompt_version:", prompt_version)

 prompt_version: version-6


#### Build the model input (context) from the relevant fields of each record

In [9]:
def get_context(email_address, email_address_name, email_display_name):
    """Build the input text for one record: an 'Output:' prefix followed by a JSON-style string."""
    email_address = email_address.strip()
    email_address_name = email_address_name.strip()
    email_display_name = email_display_name.strip()
    context_input_str = "Output:"
    # Values are interpolated as-is; quotes inside a value are not escaped.
    context_data = f"""{{"email_address":"{email_address}", "email_address_name":"{email_address_name}", "email_display_name":"{email_display_name}"}}"""
    context = context_input_str + context_data
    return context


context = temp_train_data.apply(lambda x: get_context(x['Email Address'], x['Email Address Name'], x['Email Address Display Name']), axis=1)

In [10]:
context[0]

'Output:{"email_address":"-ng@nationalgypsum.com", "email_address_name":"-ng", "email_display_name":"- NG EMAIL"}'
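Because `get_context` builds the JSON string by hand, a display name that contains a double quote or a backslash would produce invalid JSON. The variant below is a minimal sketch, not the notebook's method: `get_context_safe` is a hypothetical helper that delegates escaping to `json.dumps`, with `separators` chosen so that ordinary records keep the same spacing as the output shown above.

```python
import json

def get_context_safe(email_address, email_address_name, email_display_name):
    """Build the same 'Output:{...}' string, letting json.dumps handle escaping."""
    payload = {
        "email_address": email_address.strip(),
        "email_address_name": email_address_name.strip(),
        "email_display_name": email_display_name.strip(),
    }
    # separators=(", ", ":") mirrors the spacing of the hand-built string;
    # ensure_ascii=False keeps non-ASCII names readable.
    return "Output:" + json.dumps(payload, separators=(", ", ":"), ensure_ascii=False)

print(get_context_safe("\t-ng@nationalgypsum.com", "\t-ng", "\t- NG EMAIL"))

# A display name containing double quotes still round-trips as valid JSON.
tricky = get_context_safe("a@example.com", "a", 'Ann "AJ" Lee')
print(json.loads(tricky[len("Output:"):])["email_display_name"])
```

Switching to this helper would change the training text only for records that contain characters needing escapes, so it is worth checking how many such rows exist before adopting it.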

In [11]:
def get_output(email_address_type):
    """Build the expected completion: the lower-cased label wrapped in a JSON string."""
    email_address_type = email_address_type.strip().lower()
    output = f"""{{"email_address_type":"{email_address_type}"}}"""
    return output

output = temp_train_data.apply(lambda x: get_output(x['Email Type']), axis=1)

In [12]:
output[0]

'{"email_address_type":"non-person"}'

#### Prepare the prompts for all the training records

In [13]:
train_df = pd.DataFrame({
    'system_prompt': system_prompt,
    'instruction': instruction,
    'context': context,
    'response': output,
})
train_df.head()

Unnamed: 0,system_prompt,instruction,context,response
0,You are a helpful and detail-oriented assistan...,Please classify this email address for me. All...,"Output:{""email_address"":""-ng@nationalgypsum.co...","{""email_address_type"":""non-person""}"
1,You are a helpful and detail-oriented assistan...,Please classify this email address for me. All...,"Output:{""email_address"":""+12134588429.30119168...","{""email_address_type"":""non-person""}"
2,You are a helpful and detail-oriented assistan...,Please classify this email address for me. All...,"Output:{""email_address"":""+12134588429.61498480...","{""email_address_type"":""non-person""}"
3,You are a helpful and detail-oriented assistan...,Please classify this email address for me. All...,"Output:{""email_address"":""+146238799022001@voic...","{""email_address_type"":""non-person""}"
4,You are a helpful and detail-oriented assistan...,Please classify this email address for me. All...,"Output:{""email_address"":""!badlandsroom@acuityb...","{""email_address_type"":""non-person""}"


In [14]:
template = {
    "prompt": "{system_prompt}\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}",
}
with open("../data/Mistral_7B/template.json", "w") as f:
    json.dump(template, f)
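To see what a single training example looks like once the template is applied, the sketch below renders the template for one made-up record. It assumes the training container substitutes the `{field}` placeholders in a way equivalent to Python's `str.format_map`; the record values are hypothetical.

```python
template = {
    "prompt": "{system_prompt}\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
    "completion": "{response}",
}

# Hypothetical record with the same four fields as train_df.
record = {
    "system_prompt": "You are a helpful assistant.",
    "instruction": "Please classify this email address for me.",
    "context": 'Output:{"email_address":"info@example.com"}',
    "response": '{"email_address_type":"non-person"}',
}

# format_map substitutes each {field}; braces inside the values are left untouched.
prompt = template["prompt"].format_map(record)
completion = template["completion"].format_map(record)

print(prompt)
print("---")
print(completion)
```

Printing a rendered example like this is a cheap way to catch a misspelled placeholder before a training job is launched.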

In [16]:
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
output_path = f"../data/Mistral_7B/mistral-7b-fine-tuning-dataset-{prompt_version}.jsonl"

with open(output_path, "w") as f:
    f.write(train_df.to_json(orient='records', lines=True, force_ascii=False))


object_name = f"data/email-type/input/training/{model}/{timestamp}"
# Name the file according to the task (e.g. name-parse, email-signature)
file_name = f"mistral-7b-fine-tuning-dataset-{prompt_version}.jsonl"
print(file_name)

mistral-7b-fine-tuning-dataset-version-6.jsonl
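Before uploading, it can help to confirm that every line of the `.jsonl` file parses and carries the four fields the template refers to. The checker below is a sketch under that assumption; `check_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the sample lines are made up.

```python
import json

# Fields referenced by the prompt/completion template in this notebook.
REQUIRED_KEYS = {"system_prompt", "instruction", "context", "response"}

def check_jsonl(lines):
    """Return (line_number, problem) tuples for lines that would break the template."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems

# Made-up sample: one good line, one missing a field, one malformed.
sample_lines = [
    '{"system_prompt": "s", "instruction": "i", "context": "c", "response": "r"}',
    '{"system_prompt": "s", "instruction": "i", "context": "c"}',
    '{not json',
]
problems = check_jsonl(sample_lines)
print(problems)
```

For the real file, the same function can be fed an open file handle, for example `check_jsonl(open(output_path, encoding="utf-8"))`, since iterating a file yields its lines.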


In [17]:
local_data_file = output_path
train_data_location = f"s3://{bucket_name}/{object_name}"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("../data/Mistral_7B/template.json", train_data_location)
print(f"Training data: {train_data_location}")

Training data: s3://sagemaker-sigparser-caylent-mlops/data/email-type/input/training/Mistral_7B/2024-03-28_23-15-26


### 1.2. Prepare training parameters

##### Figure out the hyperparameters for the use case

In [18]:
my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)
print(my_hyperparameters)

{'peft_type': 'None', 'instruction_tuned': 'True', 'chat_dataset': 'False', 'epoch': '1', 'learning_rate': '6e-06', 'lora_r': '64', 'lora_alpha': '16', 'lora_dropout': '0', 'bits': '16', 'double_quant': 'True', 'quant_type': 'nf4', 'per_device_train_batch_size': '2', 'per_device_eval_batch_size': '8', 'add_input_output_demarcation_key': 'True', 'warmup_ratio': '0.1', 'train_from_scratch': 'False', 'fp16': 'False', 'bf16': 'True', 'evaluation_strategy': 'steps', 'eval_steps': '20', 'gradient_accumulation_steps': '8', 'logging_steps': '8', 'weight_decay': '0.2', 'load_best_model_at_end': 'True', 'max_train_samples': '-1', 'max_val_samples': '-1', 'seed': '10', 'max_input_length': '-1', 'validation_split_ratio': '0.2', 'train_data_split_seed': '0', 'preprocessing_num_workers': 'None', 'max_steps': '-1', 'gradient_checkpointing': 'True', 'early_stopping_patience': '3', 'early_stopping_threshold': '0.0', 'adam_beta1': '0.9', 'adam_beta2': '0.999', 'adam_epsilon': '1e-08', 'max_grad_norm': '

##### Overwrite the hyperparameters. Note: you can select the LoRA method for fine-tuning by setting peft_type=lora in the hyperparameters.

In [19]:
my_hyperparameters["epoch"] = "2"
my_hyperparameters["per_device_train_batch_size"] = "2"
my_hyperparameters["gradient_accumulation_steps"] = "2"
my_hyperparameters["instruction_tuned"] = "True"
print(my_hyperparameters)

{'peft_type': 'None', 'instruction_tuned': 'True', 'chat_dataset': 'False', 'epoch': '2', 'learning_rate': '6e-06', 'lora_r': '64', 'lora_alpha': '16', 'lora_dropout': '0', 'bits': '16', 'double_quant': 'True', 'quant_type': 'nf4', 'per_device_train_batch_size': '2', 'per_device_eval_batch_size': '8', 'add_input_output_demarcation_key': 'True', 'warmup_ratio': '0.1', 'train_from_scratch': 'False', 'fp16': 'False', 'bf16': 'True', 'evaluation_strategy': 'steps', 'eval_steps': '20', 'gradient_accumulation_steps': '2', 'logging_steps': '8', 'weight_decay': '0.2', 'load_best_model_at_end': 'True', 'max_train_samples': '-1', 'max_val_samples': '-1', 'seed': '10', 'max_input_length': '-1', 'validation_split_ratio': '0.2', 'train_data_split_seed': '0', 'preprocessing_num_workers': 'None', 'max_steps': '-1', 'gradient_checkpointing': 'True', 'early_stopping_patience': '3', 'early_stopping_threshold': '0.0', 'adam_beta1': '0.9', 'adam_beta2': '0.999', 'adam_epsilon': '1e-08', 'max_grad_norm': '
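The run above keeps `peft_type` at `'None'`, i.e. full fine-tuning. As a sketch of the LoRA option mentioned in the note, the hypothetical helper `with_lora` below returns a copy of a hyperparameter dictionary with `peft_type` set to `lora`. The dictionary is a hand-copied subset of the defaults printed above, and the exact accepted values are an assumption that `hyperparameters.validate` should confirm.

```python
# Subset of the defaults printed above; values are strings, as the SDK returns them.
defaults = {"peft_type": "None", "lora_r": "64", "lora_alpha": "16", "lora_dropout": "0"}

def with_lora(params, r="64", alpha="16", dropout="0"):
    """Return a copy of params with LoRA enabled, leaving the input untouched."""
    updated = dict(params)
    updated.update({
        "peft_type": "lora",       # assumed value, taken from the note above
        "lora_r": r,               # rank of the low-rank update matrices
        "lora_alpha": alpha,       # scaling factor applied to the update
        "lora_dropout": dropout,   # dropout on the LoRA layers
    })
    return updated

lora_hyperparameters = with_lora(defaults)
print(lora_hyperparameters)
```

Returning a copy keeps the full fine-tuning configuration available for comparison against the LoRA run.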

##### Validate hyperparameters

In [20]:
hyperparameters.validate(
    model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters
)
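The overrides interact: the number of samples consumed per optimizer step is the per-device batch size times the gradient accumulation steps times the number of devices. The sketch below works through that arithmetic for this dataset. It assumes plain data-parallel training, a validation split taken from the 8204 records according to `validation_split_ratio`, and `num_devices=4` (the GPU count of an ml.g5.24xlarge instance, the type used for the training job in this notebook); the container's actual behavior may differ.

```python
import math

def steps_per_epoch(num_records, validation_split_ratio, per_device_batch_size,
                    gradient_accumulation_steps, num_devices):
    """Estimate (effective batch size, optimizer steps per epoch)."""
    # Records left for training after the validation split is held out.
    train_records = int(num_records * (1 - validation_split_ratio))
    # Samples consumed per optimizer step across all devices.
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    return effective_batch, math.ceil(train_records / effective_batch)

effective_batch, steps = steps_per_epoch(
    num_records=8204,
    validation_split_ratio=0.2,
    per_device_batch_size=2,
    gradient_accumulation_steps=2,
    num_devices=4,  # assumption: all four GPUs are used
)
print(effective_batch, steps)
```

Lowering `gradient_accumulation_steps` from the default of 8 to 2, as done above, shrinks the effective batch by a factor of four and therefore quadruples the number of optimizer steps per epoch.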

## Under implementation: run the cells below only after creating the training data

### 1.3. Starting training

In [None]:
%%time
instruction_tuned_estimator = JumpStartEstimator(
    model_id=model_id,
    # Pin the same version the hyperparameters were retrieved and validated against.
    model_version=model_version,
    hyperparameters=my_hyperparameters,
    output_path=f"s3://{bucket_name}/model/email-type/{model}/{timestamp}",
    instance_type="ml.g5.24xlarge",
)
instruction_tuned_estimator.fit({"train": train_data_location}, logs=True)

INFO:sagemaker:Creating training-job with name: hf-llm-mistral-7b-2024-03-28-23-21-00-628


2024-03-28 23:21:01 Starting - Starting the training job...
2024-03-28 23:21:28 Pending - Training job waiting for capacity...
2024-03-28 23:21:54 Pending - Preparing the instances for training......
2024-03-28 23:22:51 Downloading - Downloading input data.......................................
2024-03-28 23:29:32 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-03-28 23:29:34,319 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-03-28 23:29:34,373 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-03-28 23:29:34,383 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-03-28 23:29:34,385 sagemaker_pytorch_container.training INFO     Invoking user training script.
202

##### Extract training performance metrics. Performance metrics such as training loss and validation accuracy/loss can be accessed through CloudWatch during training. We can also fetch these metrics and analyze them within the notebook.

In [None]:
training_job_name = instruction_tuned_estimator.latest_training_job.job_name

df = TrainingJobAnalytics(training_job_name=training_job_name).dataframe()
df.head(10)
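The dataframe returned by `TrainingJobAnalytics` is in long format, with one row per metric reading. The sketch below shows one way to reshape such a frame so that each metric becomes its own column. The rows and the metric names (`train:loss`, `eval:loss`) are made up for illustration, and the column names `timestamp`, `metric_name`, and `value` are assumed to match what the SDK returns.

```python
import pandas as pd

# Hypothetical rows in long format; real values come from the training job.
metrics_df = pd.DataFrame({
    "timestamp": [0.0, 0.0, 60.0, 60.0],
    "metric_name": ["train:loss", "eval:loss", "train:loss", "eval:loss"],
    "value": [1.20, 1.10, 0.80, 0.90],
})

# One column per metric, indexed by timestamp, so curves can be compared side by side.
wide = metrics_df.pivot_table(index="timestamp", columns="metric_name", values="value")
print(wide)
```

With the metrics in wide form, `wide.plot()` gives a quick loss curve, and a growing gap between training and evaluation loss is an early hint of overfitting.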

### 1.4. Deploying inference endpoints

In [None]:
instruction_tuned_predictor = instruction_tuned_estimator.deploy()
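Once the endpoint is up, a request should present the prompt in the same shape the model saw during training. The sketch below renders the training prompt template for one made-up record and wraps it in a payload. `build_payload` is a hypothetical helper, and the `{"inputs": ..., "parameters": ...}` layout and the `max_new_tokens` parameter are assumptions based on common text-generation containers. Whether a response marker needs to be appended to the prompt should also be checked against the model's documentation.

```python
import json

# Same prompt layout as the template written to template.json.
PROMPT_TEMPLATE = "{system_prompt}\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}"

def build_payload(system_prompt, instruction, context, max_new_tokens=32):
    """Render the training prompt template and wrap it in a request payload."""
    prompt = PROMPT_TEMPLATE.format(
        system_prompt=system_prompt, instruction=instruction, context=context
    )
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}

payload = build_payload(
    system_prompt="You are a helpful assistant.",
    instruction="Please classify this email address for me.",
    context='Output:{"email_address":"info@example.com"}',
)
print(json.dumps(payload, indent=2))

# With a live endpoint, the request would then be sent along these lines:
# response = instruction_tuned_predictor.predict(payload)
```

Keeping the prompt construction in one function makes it easier to guarantee that training and inference use an identical format.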

### 1.5. Clean up the endpoint

In [None]:
# Delete the SageMaker model and endpoint to stop incurring charges
instruction_tuned_predictor.delete_model()
instruction_tuned_predictor.delete_endpoint()