SPARK-11406: Patch for a utf-8 decode issue that occurs when we get a… #9360
… bad (non utf8) message through Kinesis.
|
Good catch. Could you add unit tests for this? |
python/pyspark/streaming/kinesis.py
Isn't it better to just change this line to s.decode('utf-8', errors='ignore')?
|
this is ok to test |
|
ok to test |
|
Test build #44649 has finished for PR 9360 at commit
|
|
will do... |
|
I don't believe the new signature for decode is available in 2.6.... I've left the code as-is, and added a test. Update coming... |
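For reference, a small sketch of the point about signatures: the keyword form `errors='ignore'` was only added to `decode` in Python 2.7, while the positional form should be accepted more widely. The helper name `decode_lenient` and the sample bytes are made up for illustration, and the snippet is written for Python 3.

```python
def decode_lenient(raw):
    # Hypothetical helper: pass the error handler positionally rather than
    # as errors='ignore', so the call does not rely on keyword support.
    return raw.decode('utf-8', 'ignore')


bad = b'ok \xff\xfe bytes'  # \xff and \xfe are never valid in UTF-8

# Strict decoding (the default) raises on the invalid bytes.
try:
    bad.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding silently drops them.
text = decode_lenient(bad)
```

Note that dropping bytes hides data corruption, which is the trade-off debated in this thread.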
|
Test build #44833 has finished for PR 9360 at commit
|
|
I feel throwing an Error for invalid bytes does make sense. If people want to ignore such error, they can use a custom decoder. |
|
I'm inclined to agree as long as there is a way to catch that Exception and continue. I'm not a python wizard, but it appeared as though the process died in worker.py, without giving the job a chance to catch the Exception. |
You can use the |
|
I guess I'm just accustomed to explicit exceptions in Java, especially for something that might kill a job. But perhaps it's sufficient to document this and let people know that everyone should implement a custom decoder if they want to protect against bad bytes in a record. (feel free to close PR) |
|
@boneill42 could you close this PR? We don't have permission to close it. It would be better if you can submit another PR to document this method. Thanks a lot! |
|
@boneill42 can you close this issue? |
|
Yep, closed. |
|
Hi, just FYI: I had exactly the same issue while reading Twitter data from Kafka (collected using Flume). Create a decoder function with the parameter "ignore":
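A minimal sketch of what such a decoder might look like, assuming the payload arrives as bytes (or `None`). The name `ignore_decoder` is hypothetical and not taken from the original comment.

```python
def ignore_decoder(s):
    """Hypothetical decoder: UTF-8 decode that drops invalid bytes."""
    if s is None:
        # Pass missing keys/values through untouched.
        return None
    # 'ignore' discards byte sequences that are not valid UTF-8
    # instead of raising UnicodeDecodeError.
    return s.decode('utf-8', 'ignore')
```

For example, `ignore_decoder(b'abc\xffdef')` yields `'abcdef'` rather than raising.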
Then use that decoder in createDirectStream as below:
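One possible wiring, sketched under assumptions: it relies on the `pyspark.streaming.kafka` module from the Spark 1.x/2.x line, whose `KafkaUtils.createDirectStream` accepts `keyDecoder` and `valueDecoder` arguments. The names `ignore_decoder` and `build_stream` are hypothetical, and `ssc`, `topics`, and `kafka_params` are supplied by the caller.

```python
def ignore_decoder(s):
    """Hypothetical decoder: None-safe UTF-8 decode that drops invalid bytes."""
    if s is None:
        return None
    return s.decode('utf-8', 'ignore')


def build_stream(ssc, topics, kafka_params):
    """Hypothetical helper: direct Kafka stream using the lenient decoder."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.streaming.kafka import KafkaUtils
    return KafkaUtils.createDirectStream(
        ssc, topics, kafka_params,
        keyDecoder=ignore_decoder,
        valueDecoder=ignore_decoder)
```

Records with bad bytes then come through with those bytes removed instead of failing the worker.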
-Obaid |