RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()] #4

Open
vincsous opened this issue May 15, 2020 · 12 comments
Labels: bug (Something isn't working)

@vincsous

Hi,
First, thanks for your work.

When I run preprocessing, I get the following error message:
RuntimeError: Internal: /sentencepiece/src/trainer_interface.cc(336) [!sentences_.empty()]

I am using a *.txt file uploaded to my Colab.
I would like to know what this error means and how to fix it.
Thanks

Vincent

@akanyaani akanyaani self-assigned this May 16, 2020
@RomanPlusPlus

RomanPlusPlus commented May 16, 2020

I have the same problem while doing preprocessing locally.

I cd'ed to the gpt-2-tensorflow2.0 dir and ran the following command:
python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data" --vocab-size=32000

Tried it with the data from the "scraped" dir provided with the repo.

Please find the log in the attached file.

log.txt

I've installed the dependencies using conda, as follows:
conda install setuptools ftfy tqdm Click tensorflow numpy
pip install sentencepiece

conda list output:

packages_versions.txt

@akanyaani
Owner

Hi @vincsous and @RomanPlusPlus

Thanks for reporting the issue.
I have fixed the issue; please pull the latest code and test.

Thanks

@vincsous
Author

Hi @akanyaani and thank you. Preprocessing is working for me now, but I have another problem with training.
First, as I am using Colab, I do not have multiple GPUs, so I set --distributed=False.
It seems that training starts but then stops ("Training Done....") at step 20, with 11% accuracy.
Here is the log.
log_train.txt

Thanks again

@RomanPlusPlus

Hi @akanyaani, thank you for your speedy response.

Unfortunately, the problem persists. I still get the same [!sentences_.empty()] error.

Please find the log in the attached file.

log200517.txt

@akanyaani
Owner

Hi @RomanPlusPlus

It's working on my system, though; could you please print the files found in that directory?

Add a print in the pre_process.py train method:

text_files = glob.glob((data_dir + "/*.txt"))

print(text_files)  # Add this line and check whether it prints your text files

process_text(text_files)
train_byte_pair_encoding(vocab_size)
create_tf_records(min_seq_len, max_seq_len)
print("Pre-processing is done............")

This error comes when text_files does not contain any text files. If it prints an empty list, then the data directory path needs to be fixed.
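A small defensive check along these lines (just a sketch; find_text_files is a hypothetical helper, not part of the repo) would surface the problem before SentencePiece fails with [!sentences_.empty()]:

import glob
import os

def find_text_files(data_dir):
    # Collect the .txt files that preprocessing expects to find.
    text_files = glob.glob(os.path.join(data_dir, "*.txt"))
    if not text_files:
        # Fail fast with a clear message instead of letting SentencePiece
        # crash later on an empty corpus.
        raise FileNotFoundError("No .txt files found in %r; check --data-dir." % data_dir)
    return text_files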

@akanyaani
Owner

Hi @vincsous

I will look into that.

Thanks

@akanyaani akanyaani added the bug Something isn't working label May 19, 2020
@RomanPlusPlus

Hi @akanyaani ,

I added the line you suggested.
It prints out the following:

['/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/processed.txt']

I also checked the "processed.txt" file. It's empty.

@akanyaani
Owner

Hi @RomanPlusPlus

You are getting this error because you are passing the wrong data directory. This repo has sample data in /data/scraped, so try this instead:

python pre_process.py --data-dir="/home/<my_linux_username>/Desktop/gpt-2-tensorflow2.0/data/scraped" --vocab-size=32000

@apteryxlabs

I am also getting this error. My command:
python pre_process.py --data-dir=/media/b/F:/patent_data_v2/patent_data_joined --vocab-size=50000

Checked the processed.txt file - it's got PLENTY of data.

Notably, this ran fine on my Mac (running Catalina). However, Macs don't have GPUs, so I'm moving all this over to a client's Linux machine.

My OS: Ubuntu Linux 20 (latest version)

Running in a custom conda environment.

My conda env.yaml file:
name: tf
channels:
  - anaconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _tflow_select=2.1.0=gpu
  - absl-py=0.9.0=py36_0
  - astunparse=1.6.3=py_0
  - blas=1.0=mkl
  - blinker=1.4=py36_0
  - brotlipy=0.7.0=py36h7b6447c_1000
  - c-ares=1.15.0=h7b6447c_1001
  - ca-certificates=2020.6.24=0
  - cachetools=4.1.0=py_1
  - certifi=2020.6.20=py36_0
  - cffi=1.14.0=py36he30daa8_1
  - chardet=3.0.4=py36_1003
  - click=7.1.2=py_0
  - cryptography=2.9.2=py36h1ba5d50_0
  - cudatoolkit=10.1.243=h6bb024c_0
  - cudnn=7.6.5=cuda10.1_0
  - cupti=10.1.168=0
  - ftfy=5.7=py_0
  - gast=0.3.3=py_0
  - google-auth=1.14.1=py_0
  - google-auth-oauthlib=0.4.1=py_2
  - google-pasta=0.2.0=py_0
  - grpcio=1.27.2=py36hf8bcb03_0
  - h5py=2.10.0=py36hd6299e0_1
  - hdf5=1.10.6=hb1b8bf9_0
  - idna=2.10=py_0
  - intel-openmp=2020.1=217
  - keras-preprocessing=1.1.0=py_1
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.3=he6710b0_2
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libprotobuf=3.12.3=hd408876_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - markdown=3.1.1=py36_0
  - mkl=2019.4=243
  - mkl-service=2.3.0=py36he904b0f_0
  - mkl_fft=1.1.0=py36h23d657b_0
  - mkl_random=1.1.0=py36hd6b4f25_0
  - ncurses=6.2=he6710b0_1
  - numpy=1.18.5=py36ha1c710e_0
  - numpy-base=1.18.5=py36hde5b4d6_0
  - oauthlib=3.1.0=py_0
  - openssl=1.1.1g=h7b6447c_0
  - opt_einsum=3.1.0=py_0
  - pip=20.1.1=py36_1
  - protobuf=3.12.3=py36he6710b0_0
  - pyasn1=0.4.8=py_0
  - pyasn1-modules=0.2.7=py_0
  - pycparser=2.20=py_0
  - pyjwt=1.7.1=py36_0
  - pyopenssl=19.1.0=py36_0
  - pysocks=1.7.1=py36_0
  - python=3.6.10=h7579374_2
  - readline=8.0=h7b6447c_0
  - requests=2.24.0=py_0
  - requests-oauthlib=1.3.0=py_0
  - rsa=4.0=py_0
  - scipy=1.5.0=py36h0b6359f_0
  - setuptools=47.3.1=py36_0
  - six=1.15.0=py_0
  - sqlite=3.32.3=h62c20be_0
  - tensorboard=2.2.1=pyh532a8cf_0
  - tensorboard-plugin-wit=1.6.0=py_0
  - tensorflow=2.2.0=gpu_py36hf933387_0
  - tensorflow-base=2.2.0=gpu_py36h8a81be8_0
  - tensorflow-estimator=2.2.0=pyh208ff02_0
  - tensorflow-gpu=2.2.0=h0d30ee6_0
  - termcolor=1.1.0=py36_1
  - tk=8.6.10=hbc83047_0
  - tqdm=4.47.0=py_0
  - urllib3=1.25.9=py_0
  - wcwidth=0.2.5=py_0
  - werkzeug=1.0.1=py_0
  - wheel=0.34.2=py36_0
  - wrapt=1.12.1=py36h7b6447c_1
  - xz=5.2.5=h7b6447c_0
  - zlib=1.2.11=h7b6447c_3
  - pip:
    - sentencepiece==0.1.85
prefix: /home/b/anaconda3/envs/tf

@elbowdonkey

elbowdonkey commented Aug 4, 2020

You can run into this error even if your path is correct, because the train method assumes your data files use the .txt file extension. Files without a .txt extension are not picked up, which causes the error.

I'd recommend that the train method be changed to:

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
	text_files = glob.glob((data_dir + "/*"))
	process_text(text_files)
	train_byte_pair_encoding(vocab_size)
	create_tf_records(min_seq_len, max_seq_len)
	print("Pre-processing is done............")

In other words, change "/*.txt" to "/*".

Better yet, gather the file paths recursively (note that glob only expands the ** pattern when recursive=True is passed):

text_files = glob.glob(data_dir + "/**/*", recursive=True)

This allows you to have your data files within their own directories - useful if you have thousands of them and want to work with subsets of those thousands sometimes.
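A slightly fuller sketch of that idea (using the same helper functions as the snippet above; the os.path.isfile filter is an extra addition, not in the original train method, since the recursive pattern also matches directories):

import glob
import os

def train(data_dir, vocab_size, min_seq_len, max_seq_len):
    # Recursively collect every path under data_dir, then keep only
    # regular files so process_text never tries to open a directory.
    paths = glob.glob(data_dir + "/**/*", recursive=True)
    text_files = [p for p in paths if os.path.isfile(p)]
    process_text(text_files)
    train_byte_pair_encoding(vocab_size)
    create_tf_records(min_seq_len, max_seq_len)
    print("Pre-processing is done............")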

@tkahn

tkahn commented Dec 23, 2021

I encountered this error when running the code on Windows. I fixed it by adding an explicit UTF-8 encoding to the open calls, like this:

with open(PROCESS_DATA_PATH, 'r', encoding = 'utf-8') as f:
with open(BPE_TSV_PATH, 'w', encoding = 'utf-8', newline='') as f_output:

The files that are read need to be encoded in UTF-8, but I guess that goes without saying.
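If some of your input files are not already UTF-8, a small helper along these lines can re-encode them before preprocessing (a sketch only; reencode_to_utf8 and the latin-1 source encoding are assumptions, not part of the repo):

def reencode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    # Read with the original encoding (replacing undecodable bytes),
    # then write the text back out as UTF-8 for pre_process.py to consume.
    with open(src_path, "r", encoding=src_encoding, errors="replace") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)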
