-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-2788] [STREAMING] Add location filtering to Twitter streams #1717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Jenkins, this is ok to test |
QA tests have started for PR 1717 at commit
|
@sjbrunst This is great addition! Thanks for the effort. However, from the patch, I can see that this changes the signature of a few methods, which required the examples to be changed. This is not desirable as we want to maintain binary compatibility as much as possible across different Spark versions. So I strongly suggest that the existing methods in TwitterUtils not be touched and new methods with the new location parameter by added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to confirm, can text filters and locations filters be added simultaneously?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, text filters and locations can be added simultaneously. If both are added, then Twitter will return a mixture of tweets that satisfy either filter.
QA tests have finished for PR 1717 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than changing this signature and adding another one (the above one), it probably better (in terms binary compatibility) to add a single new method, that is
def createStream(
jssc: JavaStreamingContext,
twitterAuth: Authorization,
filters: Array[String],
locations: Array[Array[Double]],
storageLevel: StorageLevel
): JavaReceiverInputDStream[Status] = {
Same applies to the Scala API.
The units tests failed because these new functions are not binary compatible with previous versions of Spark. |
@tdas Thanks for the comments! I'll work on fixing the binary compatibility, though I might not have it done until sometime next week since I'm currently on vacation. |
That's cool. On Tue, Aug 26, 2014 at 12:27 PM, Shawn Brunsting notifications@github.com
|
QA tests have started for PR 1717 at commit
|
QA tests have finished for PR 1717 at commit
|
Unit tests fail because my changes are not completely binary compatible yet. I'm having some trouble overloading the Scala version of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no need to create two constructors. Since this is a non-public class internal to Spark, we dont need to maintain binary compatibility. So one common constructor is fine enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I'll take that out.
Can one of the admins verify this patch? |
Jenkins, this is ok to test. |
QA tests have started for PR 1717 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have caught and commented on this earlier, but why is this Seq[Seq[Double]]
and not of Seq[(Double, Double)]
? Its not like that the location will ever be a sequence of more two doubles. So having a Seq[Double]
for latitude and longitude is pretty confusing. In fact having (Double, Double)
is still confusing, as it is not obvious which one is latitude and which one is longitude. Hence, i think that its best to define a case class Location(latitude: Double, longitude: Double)
(within the org.apache.spark.streaming.twitter
package), and use that. This should be most intuitive and least ambiguous.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. It definitely is confusing. I went with Seq[Seq[Double]]
because the FilterQuery
created in TwitterInputDStream.scala requires a double[][]
(http://twitter4j.org/javadoc/twitter4j/FilterQuery.html#locations-double:A:A-). This way the only change I have to make to the input is to change between Scala sequences and Java arrays.
The Location
case class you described still does not remove all ambiguity, because the FilterQuery
requires the south-west corner then the north-east corner for the boundary, and that would not prevent someone from giving them in the wrong order and getting unexpected results. If we're going to define a case class
anyways, I think it would be better to make something like case class Boundary(west: Double, south: Double, east: Double, north: Double)
. Then the locations parameter would be of type Seq[Boundary]
, and I can convert it to a double[][]
just before passing it to the FilterQuery
in TwitterInputDStream.scala. Should I go ahead and implement that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense! Please go ahead a do so. Can you make the order of directions same as the order in the expected twitter4j API.
QA tests have finished for PR 1717 at commit
|
@sjbrunst ping! Any updates on this PR? |
I have the new case class written, I just haven't tested it with an actual stream yet. It should be ready sometime tomorrow or Saturday. |
Okie dokie! |
@tdas It's ready for another look! I added a BoundingBox class that can be used to pass in the coordinates, which should be much more intuitive. |
QA tests have started for PR 1717 at commit
|
QA tests have finished for PR 1717 at commit
|
Test build #26709 has finished for PR 1717 at commit
|
Hi all - this functionality is certainly something I'm interested in, but this discussion seems to have stalled somewhat. Is this still the "main" discussion for geofiltered tweets, or has it moved somewhere else? If it's the main discussion, what needs to happen to get this moving again? I'm more than happy to write code if necessary :-) |
The discussion hasn't moved anywhere, as far as I know. I was waiting for @tdas to look at the latest changes. |
To throw a monkey wrench into this discussion: do we really want to be maintaining a Twitter library inside of Spark itself or should we try to move the ongoing development of this source into a separate third-party package? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not Array(Array(box.west, box.south), Array(box.east, box.north))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. It won't compile without the .toArray
at the end, but I can change the Seq
to Array
. I will update the code.
Can one of the admins verify this patch? |
Hey guys, I need this patch too. |
I don't think this PR is going forward, and should be closed. Do you mind closing this PR @sjbrunst ? You can see some other related efforts, but the impression I have from many related discussions is that this belongs outside Spark. |
But without this path you're restricting a lot Twitter functionalities inside Spark and still supporting Twitter interface. Spark still maintain Twitter API interface even without this path. IMHO if Spark don't want to maintain Twitter interface you should remove Twitter streaming as a package inside Spark |
@srowen I am willing to close this PR, but I agree with @dmvieira . It doesn't matter to me whether this functionality gets added inside Spark or as an external library, but either way it should go somewhere because there is enough demand for it. This patch just expands on the Twitter library that is already part of Spark. If there are plans to make the Twitter library external to Spark, then this change (or a similar PR) can move along with it. |
Personally I would love to see the Twitter package moved from Spark itself into a separate project / package; the only reason that we have it is for legacy reasons. |
So, why not improve it with this PR and then move it to a new project / package when we think about a better solution? We can create an issue or you can talk with stakeholders to discuss about it. |
@dmvieira I'm not sure who you're addressing there, but if this isn't something that will go in Spark, there's no reason to discuss the change here, right? Make a new github repo. Maybe that's what you mean. |
What is the timeline on moving the Twitter package out of Spark and into a separate project? If it is going to be another year before that actually happens then it might be worth it to finish this PR so users have a way to use this feature until then. |
Should we deprecate these Twitter APIs in order to encourage them to be split into a third-party package? |
I'm starting a third-party package as suggested by @srowen and I hope you enjoy. Feel free to collaborate: https://github.com/dmvieira/spark-twitter-stream-receiver |
@JoshRosen Do we really want to deprecate the Twitter APIs before there is a user-friendly way to use it in an external package? I think there is a large user base for this feature, and it is a motivating example for the use of Spark Streaming in the programming guide and the code examples. I've also seen AMP Camp exercises built around this Twitter package. If this package is so heavily featured in teaching Spark Streaming and demonstrating its applications, then I think it would be strange to deprecate it before there is an easy-to-use alternative. |
You can add any third-party package with |
TwitterUtils.createStream(...) allows users to specify keywords that restrict the tweets that are returned. This change adds a location parameter that also restricts the returned tweets.
closes #2098