Who can help
Model I am using: FlauBERT
The problem arises when producing features with the model: the generated output causes the system to run out of memory.
The official example scripts (I did not change much; they are pretty close to the original):
import torch
from transformers import FlaubertModel, FlaubertTokenizer
# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased',
# 'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'
# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones
sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])
last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768]) -> (batch size x number of tokens x embedding dimension)
# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
My own modified script:
import numpy as np
import torch
from transformers import FlaubertModel, FlaubertTokenizer

def get_flaubert_layer(texte):
    modelname = "flaubert-base-uncased"
    path = './flau/flaubert-base-unc/'
    flaubert = FlaubertModel.from_pretrained(path)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(path)
    # Encode every sentence in the pandas Series
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))
    # Pad all sequences to the length of the longest one
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    token_ids = torch.tensor(padded)
    with torch.no_grad():
        # Keep only the [CLS] embedding (first token) of the last layer
        last_layer = flaubert(token_ids)[0][:, 0, :].numpy()
    return last_layer, modelname
The task I am working on is:
Producing vectors/features from a language model and passing them to other classifiers
To reproduce
Steps to reproduce the behavior:
Install the transformers library along with scikit-learn, pandas, numpy, and PyTorch
Last lines of code:
import os
import pandas as pd

# Reading the file
filename = "corpus"
sentences = pd.read_excel(os.path.join(root, filename + ".xlsx"), sheet_name=0)
data_id = sentences.identifiant
print("Total phrases: ", len(data_id))
data = sentences.sent
label = sentences.etiquette
emb, mdlname = get_flaubert_layer(data)  # corpus is a dataframe of approximately 40,000 lines
Apparently this line produces something huge that takes a lot of memory:
last_layer = flaubert(token_ids)[0][:,0,:].numpy()
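For a rough sense of scale (an estimate, assuming the longest sentence pads everything out to max_length = 512): the last hidden state for the whole dataset is a float32 tensor of about 40,000 × 512 × 768 values, i.e. roughly 60 GB, before even counting the intermediate activations of the forward pass.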
I would have expected it to run, but I think passing the whole dataset to the model at once is what breaks the system. So I wanted to know if it is possible to tell the model to process the dataset, say, 500 or 1000 lines at a time instead of all at once. I know there is a batch_size parameter, but I am not training a model, merely using it to produce embeddings as input for other classifiers.
Do you perhaps know how to set the batch size so the whole dataset is not processed in one pass? I am not really familiar with this type of architecture. The example uses a single sentence, but in my case I load a whole dataset (dataframe).
My expectation is for the model to process all the sentences and produce the vectors I need for the classification task.
It was a problem linked to insufficient memory when running the model. I passed small batches to the model to avoid the error: I created a loop over i in range(0, len(padded), batch_size), passed padded[i : i + batch_size] to the model, and concatenated the predictions back together.
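For reference, here is a minimal sketch of that loop (the batch_size value and the variable names are assumptions, not the exact code used):

import numpy as np
import torch

batch_size = 500  # assumed value; adjust to the memory available
features = []
with torch.no_grad():
    for i in range(0, len(padded), batch_size):
        batch = torch.tensor(padded[i:i + batch_size])
        # first hidden state of the last layer = [CLS] embedding
        cls_batch = flaubert(batch)[0][:, 0, :].numpy()
        features.append(cls_batch)
emb = np.concatenate(features, axis=0)  # shape: (n_sentences, 768)

Note that, as in the original script, the padded positions are not masked; building an attention_mask from the non-zero positions and passing it to the model would keep the pad tokens from influencing the [CLS] state.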
Environment info
transformers version: 2.5.1
Using distributed or parallel set-up in script?: no