Conversation

sjbrunst

@sjbrunst sjbrunst commented Aug 1, 2014

TwitterUtils.createStream(...) allows users to specify keywords that restrict the tweets that are returned. This change adds a location parameter that also restricts the returned tweets.

closes #2098
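
(For reference, a hedged usage sketch of the API being extended. The keyword-filter call is the existing TwitterUtils API; the proposed location filtering appears only as a comment because its exact shape evolves over the review below, and OAuth credentials are assumed to be provided through the usual twitter4j system properties.)

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterKeywordExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "TwitterKeywordExample", Seconds(10))
    // Existing API: restrict the stream to tweets containing the given keywords.
    // Passing None makes the receiver read OAuth credentials from system properties.
    val tweets = TwitterUtils.createStream(ssc, None, Seq("spark"))
    tweets.map(_.getText).print()
    // This patch proposes an additional locations argument that further restricts
    // the stream to tweets originating from given geographic bounding boxes.
    ssc.start()
    ssc.awaitTermination()
  }
}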

@sjbrunst sjbrunst changed the title [STREAMING] SPARK-2788 Add location filtering to Twitter streams [SPARK-2788] [STREAMING] Add location filtering to Twitter streams Aug 19, 2014
@JoshRosen
Contributor

It looks like this and #2098 are both trying to add geolocation filters to TwitterStream.

/cc @tdas for review.

@tdas
Contributor

tdas commented Aug 24, 2014

Jenkins, this is ok to test

@SparkQA

SparkQA commented Aug 24, 2014

QA tests have started for PR 1717 at commit 9dcad31.

  • This patch merges cleanly.

@tdas
Contributor

tdas commented Aug 25, 2014

@sjbrunst This is a great addition! Thanks for the effort. However, from the patch I can see that it changes the signature of a few methods, which required the examples to be changed. This is not desirable, as we want to maintain binary compatibility as much as possible across different Spark versions. So I strongly suggest that the existing methods in TwitterUtils not be touched and that new methods with the new location parameter be added.
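
(A toy, self-contained sketch of the suggested approach, not the real TwitterUtils: the existing method keeps its exact signature, and a new overload carries the extra location parameter.)

object ToyTwitterUtils {
  // Existing method: signature left untouched for binary compatibility.
  def createStream(filters: Seq[String]): String =
    createStream(filters, Nil)

  // New method: same name, extra locations parameter (modelled here as
  // (longitude, latitude) pairs purely for illustration).
  def createStream(filters: Seq[String], locations: Seq[(Double, Double)]): String =
    s"filters=$filters, locations=$locations"
}

object ToyTwitterUtilsDemo {
  def main(args: Array[String]): Unit = {
    println(ToyTwitterUtils.createStream(Seq("spark")))                      // old callers unaffected
    println(ToyTwitterUtils.createStream(Seq("spark"), Seq((-123.1, 49.2))))
  }
}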

Contributor

Just to confirm, can text filters and location filters be added simultaneously?

Author

Yes, text filters and locations can be added simultaneously. If both are added, then Twitter will return a mixture of tweets that satisfy either filter.
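
(For context, a minimal twitter4j sketch, independent of this patch, of how both filter types can be combined in a single FilterQuery; the coordinates are illustrative.)

import twitter4j.FilterQuery

object CombinedFilterSketch {
  def main(args: Array[String]): Unit = {
    val query = new FilterQuery()
      .track("spark")                                              // keyword filter
      .locations(Array(Array(-123.3, 49.0), Array(-122.5, 49.4))) // SW corner, then NE corner
    // Passing this query to TwitterStream.filter returns tweets that match the
    // keyword OR fall inside the bounding box.
    println(query)
  }
}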

@SparkQA

SparkQA commented Aug 25, 2014

QA tests have finished for PR 1717 at commit 9dcad31.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Rather than changing this signature and adding another one (the one above), it is probably better (in terms of binary compatibility) to add a single new method, that is

def createStream(
      jssc: JavaStreamingContext,
      twitterAuth: Authorization,
      filters: Array[String],
      locations: Array[Array[Double]],
      storageLevel: StorageLevel
    ): JavaReceiverInputDStream[Status] = {

Same applies to the Scala API.

@tdas
Contributor

tdas commented Aug 25, 2014

The unit tests failed because these new functions are not binary compatible with previous versions of Spark.

@sjbrunst
Author

@tdas Thanks for the comments! I'll work on fixing the binary compatibility, though I might not have it done until sometime next week since I'm currently on vacation.

@tdas
Contributor

tdas commented Aug 26, 2014

That's cool.


@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 1717 at commit 9f35379.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 1717 at commit 9f35379.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sjbrunst
Author

sjbrunst commented Sep 4, 2014

Unit tests fail because my changes are not completely binary compatible yet. I'm having some trouble overloading the Scala version of the createStream method. See my comment in TwitterUtils.scala.
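
(One possible culprit, stated purely as an assumption since the referenced comment in TwitterUtils.scala is not shown here: Scala does not allow two overloads of a method to both declare default arguments, which bites when the existing createStream already defaults some of its parameters.)

object OverloadSketch {
  // Existing-style method with a default argument.
  def createStream(filters: Seq[String] = Nil): String =
    s"filters=$filters"

  // Adding a second overload that also declares defaults fails to compile with
  // "multiple overloaded alternatives of method createStream define default arguments":
  // def createStream(filters: Seq[String] = Nil,
  //                  locations: Seq[(Double, Double)] = Nil): String =
  //   s"filters=$filters, locations=$locations"

  def main(args: Array[String]): Unit =
    println(createStream())
}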

Contributor

Actually, there's no need to create two constructors. Since this is a non-public class internal to Spark, we don't need to maintain binary compatibility, so one common constructor is fine.

Author

Sounds good. I'll take that out.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@tdas
Contributor

tdas commented Sep 7, 2014

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Sep 7, 2014

QA tests have started for PR 1717 at commit 1e88a04.

  • This patch merges cleanly.

Contributor

I should have caught and commented on this earlier, but why is this Seq[Seq[Double]] and not Seq[(Double, Double)]? It's not likely that a location will ever be a sequence of more than two doubles, so having a Seq[Double] for the latitude and longitude is pretty confusing. In fact, even (Double, Double) is still confusing, as it is not obvious which one is the latitude and which one is the longitude. Hence, I think it's best to define a case class Location(latitude: Double, longitude: Double) (within the org.apache.spark.streaming.twitter package) and use that. This should be the most intuitive and least ambiguous.

What do you think?

Author

Good question. It definitely is confusing. I went with Seq[Seq[Double]] because the FilterQuery created in TwitterInputDStream.scala requires a double[][] (http://twitter4j.org/javadoc/twitter4j/FilterQuery.html#locations-double:A:A-). This way, the only change I have to make to the input is converting between Scala sequences and Java arrays.

The Location case class you described still does not remove all ambiguity, because the FilterQuery requires the south-west corner and then the north-east corner of the boundary, and that would not prevent someone from giving them in the wrong order and getting unexpected results. If we're going to define a case class anyway, I think it would be better to make something like case class Boundary(west: Double, south: Double, east: Double, north: Double). Then the locations parameter would be of type Seq[Boundary], and I can convert it to a double[][] just before passing it to the FilterQuery in TwitterInputDStream.scala. Should I go ahead and implement that?
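
(A small sketch of the ambiguity under discussion, with illustrative values only: the raw nested sequence gives no hint which number is the longitude, which is the latitude, or which corner comes first, while the named case class proposed here does.)

object BoundaryNamingSketch {
  // Raw form: which value is longitude, which is latitude, and which corner is first?
  val rawBox: Seq[Seq[Double]] = Seq(Seq(-123.3, 49.0), Seq(-122.5, 49.4))

  // Named form as proposed in this comment.
  case class Boundary(west: Double, south: Double, east: Double, north: Double)
  val namedBox = Boundary(west = -123.3, south = 49.0, east = -122.5, north = 49.4)

  def main(args: Array[String]): Unit = println(namedBox)
}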

Contributor

Yes, that makes sense! Please go ahead and do so. Can you make the order of the directions the same as the order expected by the twitter4j API?

@SparkQA

SparkQA commented Sep 7, 2014

QA tests have finished for PR 1717 at commit 1e88a04.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tdas
Contributor

tdas commented Sep 12, 2014

@sjbrunst ping! Any updates on this PR?

@sjbrunst
Author

I have the new case class written, I just haven't tested it with an actual stream yet. It should be ready sometime tomorrow or Saturday.

@tdas
Contributor

tdas commented Sep 12, 2014

Okie dokie!

@sjbrunst
Author

@tdas It's ready for another look! I added a BoundingBox class that can be used to pass in the coordinates, which should be much more intuitive.
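
(A hedged sketch of how the patched API might be called: BoundingBox and its field names come from the QA output below, while the parameter order of createStream is assumed, and the snippet would only compile with this patch applied.)

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.{BoundingBox, TwitterUtils}

object BoundingBoxUsageSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "BoundingBoxUsageSketch", Seconds(10))
    // Roughly the Vancouver area: west, south, east, north.
    val vancouver = BoundingBox(-123.3, 49.0, -122.5, 49.4)
    // Assumed signature: keyword filters plus bounding boxes (order not confirmed).
    val tweets = TwitterUtils.createStream(ssc, None, Seq("spark"), Seq(vancouver))
    tweets.map(_.getText).print()
    ssc.start()
    ssc.awaitTermination()
  }
}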

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have started for PR 1717 at commit 8937fc7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 14, 2014

QA tests have finished for PR 1717 at commit 8937fc7.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BoundingBox(west: Double, south: Double, east: Double, north: Double)
    • class RatingDeserializer(FramedSerializer):
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder extends compression.Encoder[IntegerType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])
    • class Encoder extends compression.Encoder[LongType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26709 has finished for PR 1717 at commit 250407e.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class BoundingBox(west: Double, south: Double, east: Double, north: Double)

@philcontrolf1

Hi all - this functionality is certainly something I'm interested in, but this discussion seems to have stalled somewhat. Is this still the "main" discussion for geofiltered tweets, or has it moved somewhere else?

If it's the main discussion, what needs to happen to get this moving again? I'm more than happy to write code if necessary :-)

@sjbrunst
Author

sjbrunst commented Jul 9, 2015

The discussion hasn't moved anywhere, as far as I know. I was waiting for @tdas to look at the latest changes.

@JoshRosen
Contributor

To throw a monkey wrench into this discussion: do we really want to be maintaining a Twitter library inside of Spark itself or should we try to move the ongoing development of this source into a separate third-party package?

Contributor

Why not Array(Array(box.west, box.south), Array(box.east, box.north)) ?

Author

Good catch. It won't compile without the .toArray at the end, but I can change the Seq to Array. I will update the code.
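
(The conversion under discussion, sketched with the Array-based form and the trailing .toArray mentioned here; BoundingBox is redefined locally only to keep the sketch self-contained, with its field names taken from this PR.)

case class BoundingBox(west: Double, south: Double, east: Double, north: Double)

object LocationsConversionSketch {
  // twitter4j's FilterQuery.locations expects a double[][]: for each box, the
  // south-west (west, south) pair followed by the north-east (east, north) pair.
  def toLocations(boxes: Seq[BoundingBox]): Array[Array[Double]] =
    boxes.flatMap { box =>
      Array(Array(box.west, box.south), Array(box.east, box.north))
    }.toArray

  def main(args: Array[String]): Unit = {
    val boxes = Seq(BoundingBox(-123.3, 49.0, -122.5, 49.4))
    toLocations(boxes).foreach(pair => println(pair.mkString(", ")))
  }
}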

@AmplabJenkins

Can one of the admins verify this patch?

@dmvieira

Hey guys, I need this patch too.

@srowen
Member

srowen commented Aug 17, 2015

I don't think this PR is going forward, and it should be closed. Do you mind closing this PR, @sjbrunst? You can see some other related efforts, but the impression I have from many related discussions is that this belongs outside Spark.

@dmvieira

But without this patch you're restricting a lot of Twitter functionality inside Spark while still supporting the Twitter interface. Spark still maintains the Twitter API interface even without this patch. IMHO, if Spark doesn't want to maintain the Twitter interface, you should remove Twitter streaming as a package inside Spark.

@sjbrunst
Author

@srowen I am willing to close this PR, but I agree with @dmvieira . It doesn't matter to me whether this functionality gets added inside Spark or as an external library, but either way it should go somewhere because there is enough demand for it. This patch just expands on the Twitter library that is already part of Spark. If there are plans to make the Twitter library external to Spark, then this change (or a similar PR) can move along with it.

@JoshRosen
Contributor

Personally I would love to see the Twitter package moved from Spark itself into a separate project / package; the only reason that we have it is for legacy reasons.

@dmvieira

So, why not improve it with this PR and then move it to a new project / package when we come up with a better solution? We can create an issue, or you can talk with the stakeholders to discuss it.

@srowen
Member

srowen commented Aug 17, 2015

@dmvieira I'm not sure who you're addressing there, but if this isn't something that will go in Spark, there's no reason to discuss the change here, right? Make a new github repo. Maybe that's what you mean.

@sjbrunst
Author

What is the timeline on moving the Twitter package out of Spark and into a separate project? If it is going to be another year before that actually happens then it might be worth it to finish this PR so users have a way to use this feature until then.

@JoshRosen
Contributor

Should we deprecate these Twitter APIs in order to encourage them to be split into a third-party package?

@dmvieira

I'm starting a third-party package as suggested by @srowen, and I hope you enjoy it. Feel free to collaborate: https://github.com/dmvieira/spark-twitter-stream-receiver

@sjbrunst
Author

@JoshRosen Do we really want to deprecate the Twitter APIs before there is a user-friendly way to use them in an external package? I think there is a large user base for this feature, and it is a motivating example for the use of Spark Streaming in the programming guide and the code examples. I've also seen AMP Camp exercises built around this Twitter package. If this package is so heavily featured in teaching Spark Streaming and demonstrating its applications, then I think it would be strange to deprecate it before there is an easy-to-use alternative.

@srowen
Member

srowen commented Aug 20, 2015

You can add any third-party package with --packages or add it to your application as a dependency. I don't think Josh is saying it will be deprecated, certainly not now. But this PR should be closed in favor of continuing development elsewhere. That's a separate issue.
