[Bug Report] Fine tuned GPT-J model outputs gibberish, could be due to decoder token issue in training? #3804

**Link to the notebook**
(https://github.com/anapt/amazon-sagemaker-examples/blob/9c3f66a740e4c7805422814adf34ea0c06aedc85/training/distributed_training/pytorch/model_parallel/gpt-j/11_train_gptj_smp_tensor_parallel_notebook.ipynb)

**Describe the bug**
When I run the notebook as-is on an ml.g5.48xlarge instance and then run inference with the fine-tuned model, the output is nonsensical (see below):

Example output:

Can you please let us know more the of a of the. a  to  of with, -., and to and a, a 's of and the of, of  its,, that , with is and ands the   a the a. of'and as, the first-, in the and,.-  and it it ',,-a, it to the, to, or  as  on the the-'of as the as a and is . and to and- of- and in has  have  film  with the is. the movie the '. to ','the that, as- it of of is- that of. that the to a- in of that a a with of to ofs  is and. a to.y s, but  the film and'a that and ofly  in a is thely- to-- a but,int anda- by of it and that'in and this  you, is a in, all  that. is as of in to you 't's  his rit as  at the for an ly the with a '- the in is,a. in that to that is to is of n of  n of be, its'that in. you andly toa'to be  circum  be. by  it the time in that  more that that-the  documentary ===  but the this ofthe story.. have.'is'' as that as and for,s to to with- with in-s a story, story  hard  `,the thea and its re  by the it a it, his of an, not ollywood this and with and time a have and on f the not thes- as with. as is in this, this the best tos and story the up ベ. about, movie, one  into a as it 'the, adaptation  for theit  still ...  story in'it has,/ its "[ riter l  comedy  Manchester are y and than- is thats that it all,it of un  movie  this movie and film'not- movie of characters  than the into hard  time, for as it- own

I'm keen to understand whether you saw the same output when you ran the notebook, and whether I just need to adjust the parameters. If so, it would be great to see a clearer example of the notebook running with correct output. This would really help me justify further requests for instance costs.

If not - I have seen a similar issue mentioned for fine-tuning BART here: https://discuss.huggingface.co/t/what-can-cause-model-generate-bart-output-to-be-gibberish-after-fine-tuning/934/2, but I have not been able to work out whether the same cause applies to GPT-J.

Alternatively, it could be because I am not deploying the model (I don't have permission to do this until I can demonstrate that the fine-tuned model is valid for our use case). Instead, I load the standard "EleutherAI/gpt-j-6B" model and then load the model state from the fine-tuned job.
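One way to test this hypothesis would be to compare the parameter names and shapes in the fine-tuned checkpoint against those of the base model before calling `load_state_dict`. The sketch below is only an illustration: `diff_state_dict_shapes` is a hypothetical helper (not part of the notebook or of any library), it works on plain name-to-shape mappings, and it assumes `fullmodel.pt` holds a flat state dict, which I have not confirmed.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_state_dict_shapes(
    expected: Dict[str, Shape], loaded: Dict[str, Shape]
) -> Dict[str, List[str]]:
    """Compare two name -> shape mappings and report how they differ."""
    # Parameters the base model expects but the checkpoint does not provide
    missing = sorted(set(expected) - set(loaded))
    # Parameters in the checkpoint that the base model does not know about
    unexpected = sorted(set(loaded) - set(expected))
    # Parameters present in both mappings but with different shapes
    mismatched = sorted(
        name
        for name in set(expected) & set(loaded)
        if tuple(expected[name]) != tuple(loaded[name])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical usage with the objects from the reproduction steps
# (assumes torch.load returns a flat dict of tensors):
#   expected = {k: tuple(v.shape) for k, v in model.state_dict().items()}
#   loaded = {k: tuple(v.shape) for k, v in torch.load(path).items()}
#   print(diff_state_dict_shapes(expected, loaded))
```

If all three lists come back empty, the checkpoint at least lines up structurally with the base model, and the cause is more likely in training or generation than in loading.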

**To reproduce**
I do not have admin access to AWS and am not permitted to deploy a model within our company without first testing whether the trained model works. So, to test the trained model output, I first need to load it from S3 in a notebook and query it that way. Please see below:


1. Run the notebook (https://github.com/anapt/amazon-sagemaker-examples/blob/9c3f66a740e4c7805422814adf34ea0c06aedc85/training/distributed_training/pytorch/model_parallel/gpt-j/11_train_gptj_smp_tensor_parallel_notebook.ipynb) up until `smp_estimator.fit` and get the location of where the model has been saved.

2.  Define the paths to your S3 bucket and where you would like to save the fine-tuned model in your SageMaker notebook instance. Then save the model locally and load the trained model and tokenizer so you can query the output:

```
import os
import tarfile

import boto3
import torch
from transformers import AutoTokenizer, GPTJForCausalLM

s3_bucket_name = "S3_BUCKET_NAME"
trained_model = 'PATH_TO_FINE_TUNED_TRAINING_JOB_MODEL'
local_model_file_path = './trained_models_test/' + trained_model
local_extracted_model_path = './trained_models_test/extracted_models/'

# Create directories if they don't exist
os.makedirs(os.path.dirname(local_model_file_path), exist_ok=True)
os.makedirs(local_extracted_model_path, exist_ok=True)

# Download object from S3
s3 = boto3.client('s3')
s3.download_file(s3_bucket_name, trained_model, local_model_file_path)

# Open the tar file that the fine-tuned model was saved to and extract the model weights and hyperparameters
with tarfile.open(local_model_file_path, 'r:gz') as tar:
    tar.extractall(local_extracted_model_path)

# Load the base model
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", cache_dir='cached_models')

# Load the fine-tuned model state into the base model
model.load_state_dict(torch.load('./trained_models_test/extracted_models/fullmodel.pt'))

#Load the tokenizer to convert our inputs
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

#Provide input to model
prompt = (
    "Can you please let us know more"
)

input_ids = tokenizer.encode(prompt, return_tensors="pt")

gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    max_length=500,
    no_repeat_ngram_size=2)

gen_text = tokenizer.decode(gen_tokens[0], skip_special_tokens=True)

print(gen_text)
```
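Before loading the weights, it may also be worth confirming what the training job's archive actually contains. The following is a minimal sketch using only the standard library; `list_archive_files` is a hypothetical helper name, and the expectation that the archive contains `fullmodel.pt` comes from the path used in the snippet above rather than from any documented guarantee.

```python
import tarfile
from typing import List


def list_archive_files(archive_path: str) -> List[str]:
    """Return the names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Skip directories and links; keep only regular files
        return [member.name for member in tar.getmembers() if member.isfile()]


# Hypothetical usage with the path defined in the reproduction snippet:
#   print(list_archive_files(local_model_file_path))
```

If `fullmodel.pt` is missing from the listing, or partial checkpoints appear instead, the state loaded in the snippet above may not be the full fine-tuned model.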


