New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-1615] [java api] SimpleTweetInputFormat #442
Conversation
InputStreamReader jsonReader = new InputStreamReader(new ByteArrayInputStream(bytes)); | ||
jsonReader.skip(offset); | ||
|
||
JSONParser parser = new JSONParser(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to create a new parser and handler for each record.
You can probably improve the performance by reusing the parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, I have tried in the beginning to create them as instance fields in the class, however I received an error because the fields are not serializable. I tried to declare them as transient, but Flink threw "Can not submit Job" exception, If you have suggestions please let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting the parser into a transient
field and initalizing it in the open()
method is the way to go.
Can you give me the full stacktrace for the "Can not submit Job" exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worked now, Thanks
Thank you for the contribution. Is the json format parsed by this input format a standard format produced by twitter? |
@Elbehery Nice addition, I would like to add this. The build checks also mentions that the license headers are missing. Can you add them to the files? |
tweet = new Tweet(); | ||
tweet = simpleTweetInputFormat.nextRecord(tweet); | ||
|
||
if(tweet != null){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make the test fail if tweet
is null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to do so, but the problem is that the reachedEnd() is updated inside nextRecord() in DelimitedInputFormat. I tried to make a condition when the readPos == limit, do not call nextRecord, but I could not because these fields are private.
So, I have to stop reading when the reachedEnd is true, which in turn means that the returned object from nextRecord [ the tweet ] is null. To add the required assert, all the tests will fail. I do not know how to avoid this problem, do u have a suggestion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool.
Then I'd recommend to add a similar test to your test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
@rmetzger This is the standard Tweet format as per Twitter. Here You can find Twitter Official Documentation. My parser is retrieving all the tweet except Bounding Box object, and retweeted object. |
@StephanEwen I have added the license, and the JSON-Simple API I am using is already used in many Apache project, such as Cassandra. |
Why Failed ?? |
Have a look at the build server logs. They still complain about unapproved license headers. |
I have revised the commit .. All the files has the license header, except a resource file which contains 4 tweets for testing purpose .. Could this be the problem ? |
Yes, this can be the problem. Can you add a licence header (with comments) to this file? |
DONE |
Your own tests fail for this code. Probably because they get thrown off by the comment in the sample JSON file. A way to make this work is to exclude this file from the RAT test and add a notice text file next to it that states the license header. |
@StephanEwen I have checked and I found that the RAT plugin is included in parent pom.xml file. Shall I exclude the resources folder in flink-contribute, or add the plugin to flink-contribute and exclude the file from there ?!!! |
@StephanEwen, I have removed the license from the resource file, excluded it from RAT, and added the license file. Also I did rebase against upstream/master, before committing. |
I triggered another travis build: https://travis-ci.org/rmetzger/flink/builds/55459787 |
@rmetzger I think it failed again, but I cant see the reason |
The failure is not your fault. It failed because your code has been rebased to a master version with failing tests. |
@rmetzger Still not merged, any updates ? |
This looks good to merge. Any objections? |
I did rebase against the master before creating the PR .. but this was long time ago, could this be the problems for the conflicts ?!! |
Where is your git repository? So that I can checkout your commit and merge it? |
@aljoscha you should be able to see it from the PR .. anyway this is mine https://github.com/Elbehery/flink |
The problem is, that I can't see it in the github interface. On what branch are your changes? Could you please rebase them on top of the current master? |
This closes apache#621 This closes apache#442
This closes apache#621 This closes apache#442
This closes apache#621 This closes apache#442
This is a contribution with a TweetInputFormat, Jira FLINK-1615.
This commit contains, the InputFormat, the Pojos for tweet and nested object, the Hanlder, and a UnitTest with a Test File of four tweets.
Before Pushing, the branch was rebased against upstream/master branch.