
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 237414383616 bytes. Error code 12 (Cannot allocate memory) #35

Closed

keloemma opened this issue Apr 24, 2021 · 3 comments

@keloemma

Environment info

  • transformers version: 2.5.1
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.4
  • Tensorflow version (GPU?):
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Model I am using: FlauBERT

The problem arises when trying to produce features with the model: the output that is generated causes the system to run out of memory.

  • the official example scripts (I did not change much; it is pretty close to the original):
import torch
from transformers import FlaubertModel, FlaubertTokenizer
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased', 
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased' 

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
  • My own modified scripts: (give details below)
import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def get_flaubert_layer(texte):
	modelname = "flaubert-base-uncased"
	path = './flau/flaubert-base-unc/'

	flaubert = FlaubertModel.from_pretrained(path)
	flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
	# Tokenize every sentence in the pandas Series
	tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))
	# Pad every sequence to the length of the longest one
	max_len = max(len(i) for i in tokenized.values)
	padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
	token_ids = torch.tensor(padded)
	# Single forward pass over the whole dataset -- this is the step that exhausts memory
	with torch.no_grad():
		last_layer = flaubert(token_ids)[0][:, 0, :].numpy()

	return last_layer, modelname

The task I am working on is:

  • Producing vectors/features from a language model and passing them to other classifiers

To reproduce

Steps to reproduce the behavior:

  1. Install the transformers library, along with scikit-learn, pandas, numpy, and PyTorch
  2. Run the last lines of code:
import os
import pandas as pd

# Reading the file (root, the data directory, is defined elsewhere in the script)
filename = "corpus"
sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name=0)
data_id = sentences.identifiant
print("Total phrases: ", len(data_id))
data = sentences.sent
label = sentences.etiquette
emb, mdlname = get_flaubert_layer(data)  # corpus is a dataframe of approximately 40,000 lines

Apparently this line produces something huge that takes a lot of memory (the failed allocation of 237,414,383,616 bytes is roughly 221 GiB):
last_layer = flaubert(token_ids)[0][:,0,:].numpy()

I would have expected it to run, but I think passing the whole dataset to the model at once is what causes the system to break. I wanted to know if it is possible to tell the model to process the dataset maybe 500 or 1,000 lines at a time, so as to not pass the whole dataset. I know there is a batch_size parameter, but I am not training a model, merely using it to produce embeddings as input for other classifiers.
Do you know how to set a batch size so the whole dataset is not processed at once? I am not really familiar with this type of architecture. In the example, they encode just one single sentence, but in my case I load a whole dataset (dataframe).

My expectation is for the model to process all the sentences and then produce the vectors I need for the classification task.

@keloemma
Author

I found the solution.

@schwabdidier
Member

Could you indicate what the problem was? (For people who might run into the same problem.) Thanks in advance.

@keloemma
Author

It was a problem linked to insufficient memory when using the model. I passed small batches to the model to avoid this error: I created a loop over i in range(0, len(padded), batch_size), passed padded[i:i + batch_size] to the model, and then concatenated the predictions back together.
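
A minimal sketch of that loop, reusing padded and flaubert from get_flaubert_layer above (the batch_size value of 500 and the features list are illustrative choices, not the exact code I ran):

import numpy as np
import torch

batch_size = 500  # illustrative value; pick whatever fits in memory

features = []
with torch.no_grad():
	for i in range(0, len(padded), batch_size):
		batch = torch.tensor(padded[i:i + batch_size])
		# [CLS] embedding (first hidden state of the last layer) for each sentence
		features.append(flaubert(batch)[0][:, 0, :].numpy())

# Stitch the per-batch predictions back together
last_layer = np.concatenate(features, axis=0)

This keeps peak memory bounded by a single batch's activations instead of the whole 40,000-row dataset.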
