[SPARK-2788] [STREAMING] Add location filtering to Twitter streams #1717

sjbrunst · 2014-08-01T15:29:44Z

TwitterUtils.createStream(...) allows users to specify keywords that restrict the tweets that are returned. This change adds a location parameter that also restricts the returned tweets.

closes #2098

JoshRosen · 2014-08-24T02:50:32Z

It looks like this and #2098 are both trying to add geolocation filters to TwitterStream.

/cc @tdas for review.

tdas · 2014-08-24T23:46:45Z

Jenkins, this is ok to test

SparkQA · 2014-08-24T23:50:43Z

QA tests have started for PR 1717 at commit 9dcad31.

This patch merges cleanly.

tdas · 2014-08-25T00:41:50Z

@sjbrunst This is a great addition! Thanks for the effort. However, from the patch, I can see that this changes the signature of a few methods, which required the examples to be changed. This is not desirable, as we want to maintain binary compatibility as much as possible across different Spark versions. So I strongly suggest that the existing methods in TwitterUtils not be touched and that new methods with the new location parameter be added.

tdas · 2014-08-25T00:43:34Z

external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

Just to confirm, can text filters and location filters be added simultaneously?

Yes, text filters and locations can be added simultaneously. If both are added, then Twitter will return a mixture of tweets that satisfy either filter.
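To make the "either filter" behaviour concrete, here is a small local model of it. This is only a sketch: TweetSketch and FilterSemanticsSketch are hypothetical names, the substring keyword match is a simplification, and none of this is Twitter's or twitter4j's actual matching code.

```scala
// Hypothetical, simplified tweet model; not a twitter4j type.
final case class TweetSketch(text: String, longitude: Double, latitude: Double)

object FilterSemanticsSketch {
  // Simplified keyword match: case-insensitive substring search.
  def matchesKeyword(tweet: TweetSketch, keywords: Seq[String]): Boolean =
    keywords.exists(k => tweet.text.toLowerCase.contains(k.toLowerCase))

  // A box is modelled as (west, south, east, north).
  def inBox(tweet: TweetSketch, box: (Double, Double, Double, Double)): Boolean = {
    val (west, south, east, north) = box
    tweet.longitude >= west && tweet.longitude <= east &&
    tweet.latitude >= south && tweet.latitude <= north
  }

  // A tweet is returned if it satisfies EITHER filter (logical OR, not AND).
  def matches(
      tweet: TweetSketch,
      keywords: Seq[String],
      boxes: Seq[(Double, Double, Double, Double)]
  ): Boolean =
    matchesKeyword(tweet, keywords) || boxes.exists(box => inBox(tweet, box))
}
```

Under this model, a tweet that mentions a tracked keyword is returned even when it was posted outside every box, and a tweet posted inside a box is returned even when it mentions no keyword.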

SparkQA · 2014-08-25T00:45:06Z

QA tests have finished for PR 1717 at commit 9dcad31.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2014-08-25T00:46:11Z

external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterUtils.scala

Rather than changing this signature and adding another one (the one above), it is probably better (in terms of binary compatibility) to add a single new method, that is

def createStream(
    jssc: JavaStreamingContext,
    twitterAuth: Authorization,
    filters: Array[String],
    locations: Array[Array[Double]],
    storageLevel: StorageLevel
  ): JavaReceiverInputDStream[Status] = {

Same applies to the Scala API.
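A minimal sketch of the suggestion above, using hypothetical stand-in types (StorageLevelSketch, StreamSpec, TwitterUtilsSketch) instead of the real Spark classes so that it compiles on its own. The existing method keeps its signature, and a new overload carries the extra locations parameter. One thing to keep in mind for the Scala API: Scala allows only one overloaded alternative of a method to declare default arguments, so the new overload in this sketch takes every parameter explicitly.

```scala
// Hypothetical stand-ins so the sketch compiles without Spark on the classpath.
final case class StorageLevelSketch(name: String)
final case class StreamSpec(
    filters: Seq[String],
    locations: Seq[Seq[Double]],
    storageLevel: StorageLevelSketch
)

object TwitterUtilsSketch {
  // Placeholder default; the name is illustrative only.
  val DefaultLevel: StorageLevelSketch = StorageLevelSketch("MEMORY_AND_DISK_SER_2")

  // Existing method: signature left untouched, so already-compiled callers still link.
  def createStream(
      filters: Seq[String] = Nil,
      storageLevel: StorageLevelSketch = DefaultLevel
  ): StreamSpec = createStream(filters, Nil, storageLevel)

  // New overload: adds locations without changing the method above.
  // It declares no default arguments, because only one overload may have them.
  def createStream(
      filters: Seq[String],
      locations: Seq[Seq[Double]],
      storageLevel: StorageLevelSketch
  ): StreamSpec = StreamSpec(filters, locations, storageLevel)
}
```

Callers that use the old two-parameter form keep working unchanged, and only callers that want location filtering need the three-parameter form.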

tdas · 2014-08-25T00:49:22Z

The unit tests failed because these new functions are not binary compatible with previous versions of Spark.

sjbrunst · 2014-08-26T19:26:51Z

@tdas Thanks for the comments! I'll work on fixing the binary compatibility, though I might not have it done until sometime next week since I'm currently on vacation.

tdas · 2014-08-26T19:29:15Z

That's cool.

SparkQA · 2014-09-04T01:24:16Z

QA tests have started for PR 1717 at commit 9f35379.

This patch merges cleanly.

SparkQA · 2014-09-04T02:17:10Z

QA tests have finished for PR 1717 at commit 9f35379.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

sjbrunst · 2014-09-04T13:20:37Z

Unit tests fail because my changes are not completely binary compatible yet. I'm having some trouble overloading the Scala version of the createStream method. See my comment in TwitterUtils.scala.

tdas · 2014-09-05T22:39:41Z

external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

Actually, there is no need to create two constructors. Since this is a non-public class internal to Spark, we don't need to maintain binary compatibility. So one common constructor is fine.

Sounds good. I'll take that out.

SparkQA · 2014-09-05T23:44:06Z

Can one of the admins verify this patch?

tdas · 2014-09-07T03:06:56Z

Jenkins, this is ok to test.

SparkQA · 2014-09-07T03:45:18Z

QA tests have started for PR 1717 at commit 1e88a04.

This patch merges cleanly.

tdas · 2014-09-07T04:08:45Z

external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

I should have caught and commented on this earlier, but why is this Seq[Seq[Double]] and not Seq[(Double, Double)]? It's not as if the location will ever be a sequence of more than two doubles, so having a Seq[Double] for latitude and longitude is pretty confusing. In fact, having (Double, Double) is still confusing, as it is not obvious which one is latitude and which one is longitude. Hence, I think that it's best to define a case class Location(latitude: Double, longitude: Double) (within the org.apache.spark.streaming.twitter package), and use that. This should be the most intuitive and least ambiguous option.

What do you think?

Good question. It definitely is confusing. I went with Seq[Seq[Double]] because the FilterQuery created in TwitterInputDStream.scala requires a double[][] (http://twitter4j.org/javadoc/twitter4j/FilterQuery.html#locations-double:A:A-). This way the only change I have to make to the input is to change between Scala sequences and Java arrays.

The Location case class you described still does not remove all ambiguity, because the FilterQuery requires the south-west corner then the north-east corner for the boundary, and that would not prevent someone from giving them in the wrong order and getting unexpected results. If we're going to define a case class anyways, I think it would be better to make something like case class Boundary(west: Double, south: Double, east: Double, north: Double). Then the locations parameter would be of type Seq[Boundary], and I can convert it to a double[][] just before passing it to the FilterQuery in TwitterInputDStream.scala. Should I go ahead and implement that?

Yes, that makes sense! Please go ahead and do so. Can you make the order of the directions the same as the order expected by the twitter4j API?
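A minimal sketch of the case class discussed above. The field order follows the comment (west, south, east, north); the corners helper and the BoundingBoxExample object are hypothetical additions for illustration, not code from the patch. It assumes each corner is passed as a (longitude, latitude) pair, with the south-west corner first and the north-east corner second.

```scala
// Named fields make it hard to mix up the corners or the coordinate order.
final case class BoundingBox(west: Double, south: Double, east: Double, north: Double) {
  // Hypothetical helper: south-west corner first, then north-east,
  // each as a (longitude, latitude) pair, ready to be used as a double[][].
  def corners: Array[Array[Double]] =
    Array(Array(west, south), Array(east, north))
}

object BoundingBoxExample {
  // Example coordinates only; named arguments document which value is which.
  val example: BoundingBox =
    BoundingBox(west = -122.75, south = 36.8, east = -121.75, north = 37.8)
}
```

With named arguments at the call site, a swapped pair of coordinates is visible in the code itself, which an anonymous Seq[Double] cannot offer.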

SparkQA · 2014-09-07T04:46:36Z

QA tests have finished for PR 1717 at commit 1e88a04.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

tdas · 2014-09-12T02:26:34Z

@sjbrunst ping! Any updates on this PR?

sjbrunst · 2014-09-12T02:44:34Z

I have the new case class written, I just haven't tested it with an actual stream yet. It should be ready sometime tomorrow or Saturday.

tdas · 2014-09-12T06:01:31Z

Okie dokie!

sjbrunst · 2014-09-14T00:22:39Z

@tdas It's ready for another look! I added a BoundingBox class that can be used to pass in the coordinates, which should be much more intuitive.

SparkQA · 2014-09-14T00:24:32Z

QA tests have started for PR 1717 at commit 8937fc7.

This patch merges cleanly.

SparkQA · 2014-09-14T01:29:48Z

QA tests have finished for PR 1717 at commit 8937fc7.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class BoundingBox(west: Double, south: Double, east: Double, north: Double)
- class RatingDeserializer(FramedSerializer):
- class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
- class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
- class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
- class Encoder extends compression.Encoder[IntegerType.type]
- class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])
- class Encoder extends compression.Encoder[LongType.type]
- class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])

SparkQA · 2015-02-04T04:32:08Z

Test build #26709 has finished for PR 1717 at commit 250407e.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class BoundingBox(west: Double, south: Double, east: Double, north: Double)

philcontrolf1 · 2015-07-09T08:58:35Z

Hi all - this functionality is certainly something I'm interested in, but this discussion seems to have stalled somewhat. Is this still the "main" discussion for geofiltered tweets, or has it moved somewhere else?

If it's the main discussion, what needs to happen to get this moving again? I'm more than happy to write code if necessary :-)

sjbrunst · 2015-07-09T18:27:00Z

The discussion hasn't moved anywhere, as far as I know. I was waiting for @tdas to look at the latest changes.

JoshRosen · 2015-07-09T21:19:00Z

To throw a monkey wrench into this discussion: do we really want to be maintaining a Twitter library inside of Spark itself or should we try to move the ongoing development of this source into a separate third-party package?

huitseeker · 2015-07-13T08:19:06Z

external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

Why not Array(Array(box.west, box.south), Array(box.east, box.north)) ?

Good catch. It won't compile without the .toArray at the end, but I can change the Seq to Array. I will update the code.
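A sketch of the conversion being discussed, under the assumption that the locations parameter is a Seq[BoundingBox]. LocationsSketch and toLocations are hypothetical names, and the actual code in the patch may differ. The flatMap over a Seq produces a Seq[Array[Double]], so the trailing .toArray is what turns the result into the Array[Array[Double]] (a Java double[][]) that FilterQuery expects.

```scala
final case class BoundingBox(west: Double, south: Double, east: Double, north: Double)

object LocationsSketch {
  // Each box contributes two (longitude, latitude) corners:
  // south-west first, then north-east.
  def toLocations(boxes: Seq[BoundingBox]): Array[Array[Double]] =
    boxes
      .flatMap(box => Array(Array(box.west, box.south), Array(box.east, box.north)))
      .toArray // Seq[Array[Double]] => Array[Array[Double]]
}
```

Two boxes therefore yield four corner entries, in the same order in which the boxes were given.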

AmplabJenkins · 2015-07-13T21:54:44Z

Can one of the admins verify this patch?

dmvieira · 2015-08-17T16:28:16Z

Hey guys, I need this patch too.

srowen · 2015-08-17T16:31:44Z

I don't think this PR is going forward, and should be closed. Do you mind closing this PR @sjbrunst ? You can see some other related efforts, but the impression I have from many related discussions is that this belongs outside Spark.

dmvieira · 2015-08-17T16:40:45Z

But without this patch you're restricting a lot of Twitter functionality inside Spark while still supporting the Twitter interface. Spark still maintains the Twitter API interface even without this patch. IMHO, if Spark doesn't want to maintain the Twitter interface, you should remove Twitter streaming as a package inside Spark.

sjbrunst · 2015-08-17T16:56:09Z

@srowen I am willing to close this PR, but I agree with @dmvieira . It doesn't matter to me whether this functionality gets added inside Spark or as an external library, but either way it should go somewhere because there is enough demand for it. This patch just expands on the Twitter library that is already part of Spark. If there are plans to make the Twitter library external to Spark, then this change (or a similar PR) can move along with it.

JoshRosen · 2015-08-17T17:01:56Z

Personally I would love to see the Twitter package moved from Spark itself into a separate project / package; the only reason that we have it is for legacy reasons.

dmvieira · 2015-08-17T17:17:57Z

So, why not improve it with this PR and then move it to a new project / package when we think of a better solution? We can create an issue, or you can talk with stakeholders to discuss it.

srowen · 2015-08-17T17:40:29Z

@dmvieira I'm not sure who you're addressing there, but if this isn't something that will go in Spark, there's no reason to discuss the change here, right? Make a new github repo. Maybe that's what you mean.

sjbrunst · 2015-08-17T17:50:23Z

What is the timeline on moving the Twitter package out of Spark and into a separate project? If it is going to be another year before that actually happens then it might be worth it to finish this PR so users have a way to use this feature until then.

JoshRosen · 2015-08-17T17:56:25Z

Should we deprecate these Twitter APIs in order to encourage them to be split into a third-party package?

dmvieira · 2015-08-18T20:26:05Z

I'm starting a third-party package as suggested by @srowen, and I hope you enjoy it. Feel free to collaborate: https://github.com/dmvieira/spark-twitter-stream-receiver

sjbrunst · 2015-08-20T12:56:44Z

@JoshRosen Do we really want to deprecate the Twitter APIs before there is a user-friendly way to use it in an external package? I think there is a large user base for this feature, and it is a motivating example for the use of Spark Streaming in the programming guide and the code examples. I've also seen AMP Camp exercises built around this Twitter package. If this package is so heavily featured in teaching Spark Streaming and demonstrating its applications, then I think it would be strange to deprecate it before there is an easy-to-use alternative.

srowen · 2015-08-20T13:02:31Z

You can add any third-party package with --packages or add it to your application as a dependency. I don't think Josh is saying it will be deprecated, certainly not now. But this PR should be closed in favor of continuing development elsewhere. That's a separate issue.

sjbrunst changed the title ~~[STREAMING] SPARK-2788 Add location filtering to Twitter streams~~ [SPARK-2788] [STREAMING] Add location filtering to Twitter streams Aug 19, 2014

JoshRosen mentioned this pull request Aug 24, 2014

Geolocation to twitter stream #2098

Closed

tdas reviewed Aug 25, 2014

tdas reviewed Sep 5, 2014

tdas reviewed Sep 7, 2014

Fix follow type in Java unit tests

250407e

This was referenced Feb 6, 2015

[SPARK-3182][Streaming]: Add geolocation bounding for Twitter Streaming #3404

Closed

[SPARK-4382] Add locations parameter to Twitter Stream #3246

Closed

sjbrunst added 2 commits April 21, 2015 09:00

Merge remote-tracking branch 'upstream/master'

08c754e

Merge remote-tracking branch 'upstream/master'

beff8ae

huitseeker reviewed Jul 13, 2015

sjbrunst added 3 commits July 14, 2015 09:22

Merge remote-tracking branch 'upstream/master'

113a573

Change a Seq to Array

63aa054

Merge remote-tracking branch 'upstream/master'

acfbab0

sjbrunst closed this Sep 15, 2015

BenFradet mentioned this pull request Feb 8, 2016

[SPARK-13065] [Streaming] streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream() #11003

Closed
