Description
Link to the notebook
(https://github.com/anapt/amazon-sagemaker-examples/blob/9c3f66a740e4c7805422814adf34ea0c06aedc85/training/distributed_training/pytorch/model_parallel/gpt-j/11_train_gptj_smp_tensor_parallel_notebook.ipynb)
Describe the bug
When I run the notebook as is on an ml.g5.48xlarge instance and then run inference with the resulting model, the output is nonsensical (see below):
Example output:
Can you please let us know more the of a of the. a to of with, -., and to and a, a 's of and the of, of its,, that , with is and ands the a the a. of'and as, the first-, in the and,.- and it it ',,-a, it to the, to, or as on the the-'of as the as a and is . and to and- of- and in has have film with the is. the movie the '. to ','the that, as- it of of is- that of. that the to a- in of that a a with of to ofs is and. a to.y s, but the film and'a that and ofly in a is thely- to-- a but,int anda- by of it and that'in and this you, is a in, all that. is as of in to you 't's his rit as at the for an ly the with a '- the in is,a. in that to that is to is of n of n of be, its'that in. you andly toa'to be circum be. by it the time in that more that that-the documentary === but the this ofthe story.. have.'is'' as that as and for,s to to with- with in-s a story, story hard `,the thea and its re by the it a it, his of an, not ollywood this and with and time a have and on f the not thes- as with. as is in this, this the best tos and story the up ベ. about, movie, one into a as it 'the, adaptation for theit still ... story in'it has,/ its "[ riter l comedy Manchester are y and than- is thats that it all,it of un movie this movie and film'not- movie of characters than the into hard time, for as it- own
I would like to understand whether you saw the same output when you ran the notebook, and whether I just need to adjust the parameters. If so, it would be great to see a clearer example of the notebook running with correct output. This would really help with additional requests for instance costs.
If not - I have seen a similar issue mentioned when fine-tuning BART here https://discuss.huggingface.co/t/what-can-cause-model-generate-bart-output-to-be-gibberish-after-fine-tuning/934/2 but have not been able to deduce whether this is the same for GPT-J.
Alternatively, it could be because I am not deploying the model (I don't have permission to do this until I can demonstrate that the fine-tuned model is valid for our use case). Instead, I load the standard "EleutherAI/gpt-j-6B" model and then load the model state from the fine-tuned job.
To reproduce
I do not have admin access to AWS and am not permitted to deploy a model within our company without first testing whether the trained model works. To test the trained model's output, I first need to load it from S3 in a notebook and query it that way. Please see below:
- Run the notebook (https://github.com/anapt/amazon-sagemaker-examples/blob/9c3f66a740e4c7805422814adf34ea0c06aedc85/training/distributed_training/pytorch/model_parallel/gpt-j/11_train_gptj_smp_tensor_parallel_notebook.ipynb) up until smp_estimator.fit and note the location where the model has been saved.
- Define the paths to your S3 bucket and to where you would like to save the fine-tuned model in your SageMaker notebook instance. Then download the model archive, extract it, and load the trained model and tokenizer so you can query the output:
import os
import tarfile

import boto3
import torch
from transformers import AutoTokenizer, GPTJForCausalLM

s3_bucket_name = "S3_BUCKET_NAME"
trained_model = 'PATH_TO_FINE_TUNED_TRAINING_JOB_MODEL'
local_model_file_path = './trained_models_test/' + trained_model
local_extracted_model_path = './trained_models_test/extracted_models/'
# Create directories if they don't exist
os.makedirs(os.path.dirname(local_model_file_path), exist_ok=True)
os.makedirs(local_extracted_model_path, exist_ok=True)
# Download object from S3
s3 = boto3.client('s3')
s3.download_file(s3_bucket_name, trained_model, local_model_file_path)
# Open the tar file that the fine-tuned model was saved to and extract the model weights and hyperparameters
with tarfile.open(local_model_file_path, 'r:gz') as tar:
    tar.extractall(local_extracted_model_path)
# Load the base GPT-J model
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", cache_dir='cached_models')
# Load the fine-tuned model state into the base model
model.load_state_dict(torch.load('./trained_models_test/extracted_models/fullmodel.pt'))
#Load the tokenizer to convert our inputs
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
#Provide input to model
prompt = (
"Can you please let us know more"
)
input_ids = tokenizer.encode(prompt, return_tensors="pt")
gen_tokens = model.generate(
input_ids,
do_sample=True,
temperature=0.7,
max_length=500,
no_repeat_ngram_size=2)
gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)
print(gen_text)
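One way to rule out a mismatch between the checkpoint and the base model in the loading step above is to compare the two state dicts before calling load_state_dict. The sketch below is only an illustration: diff_state_dict_keys is a hypothetical helper (not part of PyTorch, transformers, or the notebook), and it assumes that fullmodel.pt contains a flat mapping from parameter names to tensors, which I have not verified.

```python
from typing import Any, Dict, List, Mapping, Tuple


def diff_state_dict_keys(
    model_state: Mapping[str, Any],
    checkpoint_state: Mapping[str, Any],
) -> Dict[str, List]:
    """Hypothetical helper: compare parameter names and shapes of two state dicts.

    Values only need a `.shape` attribute (as torch tensors have); values
    without one are treated as having an empty shape.
    """
    model_keys = set(model_state)
    ckpt_keys = set(checkpoint_state)

    # Parameters the model expects but the checkpoint does not provide.
    missing = sorted(model_keys - ckpt_keys)
    # Parameters in the checkpoint that the model does not know about.
    unexpected = sorted(ckpt_keys - model_keys)

    # Parameters present in both, but with different shapes.
    shape_mismatch: List[Tuple[str, tuple, tuple]] = []
    for key in sorted(model_keys & ckpt_keys):
        model_shape = tuple(getattr(model_state[key], "shape", ()))
        ckpt_shape = tuple(getattr(checkpoint_state[key], "shape", ()))
        if model_shape != ckpt_shape:
            shape_mismatch.append((key, model_shape, ckpt_shape))

    return {
        "missing": missing,
        "unexpected": unexpected,
        "shape_mismatch": shape_mismatch,
    }
```

In the code above this could be called as diff_state_dict_keys(model.state_dict(), torch.load('./trained_models_test/extracted_models/fullmodel.pt', map_location='cpu')). Since load_state_dict uses strict=True by default and raises on missing keys, unexpected keys, or size mismatches, an empty report here would suggest that the gibberish comes from somewhere else, for example the training run itself or the generation settings.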