SPARK-11406: Patch for a utf-8 decode issue that occurs when we get a… #9360

boneill42 · 2015-10-29T18:54:54Z

… bad (non utf8) message through Kinesis.

tdas · 2015-10-29T22:17:14Z

Good catch. Could you add unit tests for this?

tdas · 2015-10-29T22:21:03Z

python/pyspark/streaming/kinesis.py

Isn't it better to just change this line to s.decode('utf-8', errors='ignore')?

tdas · 2015-10-30T00:26:16Z

this is ok to test

tdas · 2015-10-30T00:26:20Z

ok to test

SparkQA · 2015-10-30T00:53:18Z

Test build #44649 has finished for PR 9360 at commit 0bbdc85.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

boneill42 · 2015-10-30T13:29:54Z

will do...
Note: The case that triggered this is KPL aggregation. I commented on SPARK-11198, and will explore a solution.

boneill42 · 2015-11-02T21:05:21Z

I don't believe the new signature for decode is available in 2.6....

(spark)blu:streaming brianoneill$ python2.6
Python 2.6.9 (unknown, Aug 22 2015, 20:33:41)
>>> s = "foo"
>>> s.decode('utf-8', errors='ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decode() takes no keyword arguments

(spark)blu:streaming brianoneill$ python2.7
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
>>> s = "foo"
>>> s.decode('utf-8', errors='ignore')
u'foo'

I've left the code as-is, and added a test. Update coming...
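
Editorial note, not part of the PR: the transcript above shows that Python 2.6 rejects the keyword form of the call. A minimal sketch of a version-neutral alternative, assuming the goal is simply to drop undecodable bytes, is to pass the error handler positionally. The sample bytes and variable names here are made up for illustration.

```python
# Sketch: pass the error handler positionally so the same call works on
# Python 2.6, 2.7 and 3.x (keyword arguments to decode() arrived in 2.7).
raw = b"foo\xff"  # "foo" followed by a byte that is not valid UTF-8

decoded = raw.decode("utf-8", "ignore")  # second positional argument is "errors"
print(decoded)
```

The invalid trailing byte is dropped and only the decodable text remains.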

SparkQA · 2015-11-02T21:57:26Z

Test build #44833 has finished for PR 9360 at commit a2e9d18.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
 * public abstract class MemoryConsumer
 * final class ShuffleExternalSorter extends MemoryConsumer
 * public final class BytesToBytesMap extends MemoryConsumer
 * public final class MapIterator implements Iterator<Location>
 * public final class UnsafeExternalSorter extends MemoryConsumer
 * class SpillableIterator extends UnsafeSorterIterator
 * public final class UnsafeSorterSpillReader extends UnsafeSorterIterator
 * public final class UnsafeSorterSpillWriter
 * abstract class CentralMomentAgg(child: Expression) extends ImperativeAggregate with Serializable
 * case class Variance(child: Expression,
 * case class VarianceSamp(child: Expression,
 * case class VariancePop(child: Expression,
 * case class Skewness(child: Expression,
 * case class Kurtosis(child: Expression,
 * case class Kurtosis(child: Expression) extends UnaryExpression with AggregateExpression1
 * case class Skewness(child: Expression) extends UnaryExpression with AggregateExpression1
 * case class Variance(child: Expression) extends UnaryExpression with AggregateExpression1
 * case class VariancePop(child: Expression) extends UnaryExpression with AggregateExpression1
 * case class VarianceSamp(child: Expression) extends UnaryExpression with AggregateExpression1

zsxwing · 2015-11-04T19:44:28Z

I feel throwing an Error for invalid bytes does make sense. If people want to ignore such error, they can use a custom decoder.

boneill42 · 2015-11-04T19:58:14Z

I'm inclined to agree as long as there is a way to catch that Exception and continue. I'm not a python wizard, but it appeared as though the process died in worker.py, without giving the job a chance to catch the Exception.

zsxwing · 2015-11-04T20:03:42Z

> I'm inclined to agree as long as there is a way to catch that Exception and continue. I'm not a python wizard, but it appeared as though the process died in worker.py, without giving the job a chance to catch the Exception.

You can use the decoder parameter in the createStream method to set a custom decoder to catch the exception.
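
Editorial note, not part of the PR: a minimal sketch of what such a custom decoder might look like, assuming the desired policy is to return None for a bad record rather than fail the task. The name safe_utf8_decoder and the None-on-error policy are hypothetical; only the decoder parameter itself comes from the comment above.

```python
def safe_utf8_decoder(s):
    """Decode bytes as UTF-8; return None for missing or undecodable input."""
    if s is None:
        return None
    try:
        return s.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical policy: skip the bad record instead of killing the job.
        return None


# Assumed usage (needs a live StreamingContext, so it is not executed here):
#   stream = KinesisUtils.createStream(<other arguments>, decoder=safe_utf8_decoder)
#   good = stream.filter(lambda record: record is not None)
```

Returning None keeps the choice of what to do with bad records (drop, count, log) in the job itself rather than in the decoder.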

boneill42 · 2015-11-04T20:16:56Z

I guess I'm just accustomed to explicit exceptions in Java, especially for something that might kill a job.

But perhaps it's sufficient to document this and let people know that they should implement a custom decoder if they want to protect against bad bytes in a record.

(feel free to close PR)

zsxwing · 2015-12-11T00:16:23Z

@boneill42 could you close this PR? We don't have permission to close it. It would be better if you can submit another PR to document this method. Thanks a lot!

andrewor14 · 2015-12-14T22:45:27Z

@boneill42 can you close this issue?

boneill42 · 2015-12-15T00:00:57Z

Yep, closed.

obaidcuet · 2016-05-04T03:21:01Z

Hi,

Just FYI.
Hope it may be helpful for someone.

I had exactly the same issue while reading Twitter data from Kafka (collected using Flume).
I have no idea where the non-UTF-8 data is coming from. It should be rejected by Flume itself.
However, I do not need those special characters, so I used the approach below to ignore them:

Create a decoder function with parameter "ignore":

def utf8_decoder_ignore_error(s):
    """ Decode the unicode as UTF-8 """
    if s is None:
        return None
    return s.decode('utf-8', "ignore")

Then use that decoder in createDirectStream as below:

kvs = KafkaUtils.createDirectStream(
    ssc, [topic], {"metadata.broker.list": brokers},
    keyDecoder=utf8_decoder_ignore_error,
    valueDecoder=utf8_decoder_ignore_error)

-Obaid
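
Editorial note, not part of the thread: a small sketch of what the "ignore" handler does to a record containing a stray byte, next to the standard "replace" handler for comparison. The sample bytes are invented for illustration.

```python
# Valid UTF-8 for "café", a space, a stray 0xFF byte, then " ok".
raw = b"caf\xc3\xa9 \xff ok"

ignored = raw.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for the bad byte

print(repr(ignored))
print(repr(replaced))
```

With "ignore" the damage leaves no trace in the output, so "replace" may be the better choice when it matters that a record was corrupted.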

Brian O'Neill added 2 commits October 29, 2015 14:49

SPARK-11406: Patch for a utf-8 decode issue that occurs when we get a…

d2144ae

… bad (non utf8) message through Kinesis.

Fixing docstring.

0bbdc85

Merge branch 'master' of github.com:apache/spark

34ee632

Added unit test.

a2e9d18

boneill42 closed this Dec 15, 2015


6 participants
