'str' object has no attribute 'decode' #17

Open
asingh9530 opened this issue Sep 18, 2018 · 4 comments

Comments

@asingh9530

When I run
python preprocessors/preprocess_movie_dialogs.py --raw_data movie_lines.txt --out_file preprocessed_movie_lines.txt

I get this error:
python preprocessors/preprocess_movie_dialogs.py --raw_data movie_lines.txt --out_file preprocessed_movie_lines.txt
/home/abhinavsingh/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
File "preprocessors/preprocess_movie_dialogs.py", line 24, in <module>
tf.app.run()
File "/home/abhinavsingh/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "preprocessors/preprocess_movie_dialogs.py", line 18, in main
s = dialog_line.strip().lower().decode("utf-8", "ignore")
AttributeError: 'str' object has no attribute 'decode'

This is expected, since each line is already a string, but if I remove the decode call it doesn't work either.

@aayushee

aayushee commented Mar 6, 2019

I was also getting this error. I ran the script with python2 and it worked.

@micooke

micooke commented Apr 4, 2019

In Python 3, decoding happens in the str constructor, which takes a bytes object. So first encode the raw string as UTF-8, then pass the result through str, like so:
s = str(bytes(dialog_line.strip().lower(), "utf-8"), "utf-8", "ignore")
Or just skip the round trip entirely if your string is already decoded text:
s = dialog_line.strip().lower()
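
For anyone who wants the same line to run under both Python 2 style byte input and Python 3 text input, here is a minimal sketch of the idea above. The helper name normalize_line is hypothetical and not part of the repository; it only decodes when it is actually handed bytes.

```python
def normalize_line(dialog_line):
    """Strip and lowercase a line, accepting either str or bytes.

    Hypothetical helper for illustration; not part of the original script.
    """
    if isinstance(dialog_line, bytes):
        # Only bytes objects have .decode() in Python 3.
        dialog_line = dialog_line.decode("utf-8", "ignore")
    return dialog_line.strip().lower()


# In Python 3, str has no decode method, which is what the traceback reports.
print(hasattr("abc", "decode"))
print(hasattr(b"abc", "decode"))

print(normalize_line("  Hello There \n"))
print(normalize_line(b"  Hello There \n"))
```

With the "ignore" error handler, bytes that are not valid UTF-8 are silently dropped rather than raising.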

@lipsajohny

When I try the above solution, a different error appears:

Traceback (most recent call last):
File "preprocessors/preprocess_movie_dialogs.py", line 23, in <module>
tf.app.run()
File "/home/lipsa/anaconda3/envs/iia/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/lipsa/anaconda3/envs/iia/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/lipsa/anaconda3/envs/iia/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "preprocessors/preprocess_movie_dialogs.py", line 15, in main
for line in raw_data:
File "/home/lipsa/anaconda3/envs/iia/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3767: invalid start byte
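
This second traceback is raised while iterating over the file, before the decode line is ever reached, which suggests the input contains bytes that are not valid UTF-8. One possible workaround, sketched below under that assumption, is to open the file with an explicit encoding and a lenient error handler. The function name read_lines_lenient and the sample line are made up for the demonstration; only the " +++$+++ " separator comes from the script discussed here.

```python
import os
import tempfile


def read_lines_lenient(path):
    """Read text lines, dropping any bytes that are not valid UTF-8.

    Hypothetical helper; assumes that losing the undecodable bytes is acceptable.
    """
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return [line.rstrip("\n") for line in f]


# Build a small sample file containing the byte 0xad from the traceback.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"L1 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do\xad not!\n")
    sample_path = tmp.name

lines = read_lines_lenient(sample_path)
os.remove(sample_path)

# The dialog text is the last field after the separator.
print(lines[0].split(" +++$+++ ")[-1])  # prints: They do not!
```

If the file's real encoding is known, passing that encoding to open() instead would preserve the characters rather than discard them.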

@d4buss

d4buss commented Jun 17, 2021

This seemed to get it working for me:

def main(_):
    # open the input in binary mode ("rb") so reading never raises UnicodeDecodeError
    with open(FLAGS.raw_data, "rb") as raw_data, \
            open(FLAGS.out_file, "w") as out:
        for line in raw_data:
            # decode the bytes to text, dropping invalid bytes
            # (str(line) would return the "b'...'" repr instead)
            line = line.decode("utf-8", "ignore")
            parts = line.split(" +++$+++ ")
            dialog_line = parts[-1]
            # modify this line to match below: keep only ASCII characters
            s = "".join(c for c in dialog_line.strip().lower() if ord(c) < 128)
            preprocessed_line = " ".join(nltk.word_tokenize(s))
            out.write(preprocessed_line + "\n")
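
One detail worth checking when reading in binary mode: calling str() on a bytes object without an encoding returns its printable representation, including the b prefix and an escaped newline, whereas decode() returns the actual text. A small sketch of the difference, using a made-up sample line:

```python
raw = b"They do not!\n"  # made-up sample, as read from a file opened with "rb"

as_repr = str(raw)                        # representation of the bytes object
as_text = raw.decode("utf-8", "ignore")   # the decoded text

print(as_repr)
print(as_text.strip())

# The same ASCII-only filter used in the snippet above, applied to decoded text.
ascii_only = "".join(c for c in "caf\u00e9 au lait" if ord(c) < 128)
print(ascii_only)
```

The repr form would leak the b prefix and a literal backslash-n into the tokenized output, so decoding is probably the safer choice here.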
