Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-1615] [java api] SimpleTweetInputFormat #442

Closed
wants to merge 1 commit into from
Closed

[FLINK-1615] [java api] SimpleTweetInputFormat #442

wants to merge 1 commit into from

Conversation

Elbehery
Copy link

This is a contribution with a TweetInputFormat, Jira FLINK-1615.

This commit contains, the InputFormat, the Pojos for tweet and nested object, the Hanlder, and a UnitTest with a Test File of four tweets.

Before Pushing, the branch was rebased against upstream/master branch.

InputStreamReader jsonReader = new InputStreamReader(new ByteArrayInputStream(bytes));
jsonReader.skip(offset);

JSONParser parser = new JSONParser();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to create a new parser and handler for each record.
You can probably improve the performance by reusing the parser.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, I have tried in the beginning to create them as instance fields in the class, however I received an error because the fields are not serializable. I tried to declare them as transient, but Flink threw "Can not submit Job" exception, If you have suggestions please let me know

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting the parser into a transient field and initalizing it in the open() method is the way to go.

Can you give me the full stacktrace for the "Can not submit Job" exception?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worked now, Thanks

@rmetzger
Copy link
Contributor

rmetzger commented Mar 1, 2015

Thank you for the contribution.

Is the json format parsed by this input format a standard format produced by twitter?
How can I get such input data?

@StephanEwen
Copy link
Contributor

@Elbehery Nice addition, I would like to add this.

The build checks also mentions that the license headers are missing. Can you add them to the files?

tweet = new Tweet();
tweet = simpleTweetInputFormat.nextRecord(tweet);

if(tweet != null){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make the test fail if tweet is null?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do so, but the problem is that the reachedEnd() is updated inside nextRecord() in DelimitedInputFormat. I tried to make a condition when the readPos == limit, do not call nextRecord, but I could not because these fields are private.

So, I have to stop reading when the reachedEnd is true, which in turn means that the returned object from nextRecord [ the tweet ] is null. To add the required assert, all the tests will fail. I do not know how to avoid this problem, do u have a suggestion ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this behavior was intended, by checking the DelimitedInputFormat UnitTest I found this. I will edit my test case according to this.
selection_019

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool.
Then I'd recommend to add a similar test to your test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

@Elbehery
Copy link
Author

@rmetzger This is the standard Tweet format as per Twitter. Here You can find Twitter Official Documentation. My parser is retrieving all the tweet except Bounding Box object, and retweeted object.

@Elbehery
Copy link
Author

@StephanEwen I have added the license, and the JSON-Simple API I am using is already used in many Apache project, such as Cassandra.

@Elbehery
Copy link
Author

Why Failed ??

@StephanEwen
Copy link
Contributor

Have a look at the build server logs. They still complain about unapproved license headers.

@Elbehery
Copy link
Author

I have revised the commit .. All the files has the license header, except a resource file which contains 4 tweets for testing purpose .. Could this be the problem ?

@StephanEwen
Copy link
Contributor

Yes, this can be the problem. Can you add a licence header (with comments) to this file?

@Elbehery
Copy link
Author

DONE

@StephanEwen
Copy link
Contributor

Your own tests fail for this code. Probably because they get thrown off by the comment in the sample JSON file.

A way to make this work is to exclude this file from the RAT test and add a notice text file next to it that states the license header.
Or to make the input format aware of line comments (starting with // or #) and define the header like this.

@Elbehery
Copy link
Author

@StephanEwen I have checked and I found that the RAT plugin is included in parent pom.xml file. Shall I exclude the resources folder in flink-contribute, or add the plugin to flink-contribute and exclude the file from there ?!!!

@Elbehery
Copy link
Author

@StephanEwen, I have removed the license from the resource file, excluded it from RAT, and added the license file. Also I did rebase against upstream/master, before committing.

@rmetzger
Copy link
Contributor

I triggered another travis build: https://travis-ci.org/rmetzger/flink/builds/55459787

@Elbehery
Copy link
Author

@rmetzger I think it failed again, but I cant see the reason

@rmetzger
Copy link
Contributor

The failure is not your fault. It failed because your code has been rebased to a master version with failing tests.

@Elbehery
Copy link
Author

@rmetzger Still not merged, any updates ?

@aljoscha
Copy link
Contributor

This looks good to merge. Any objections?

@Elbehery
Copy link
Author

I did rebase against the master before creating the PR .. but this was long time ago, could this be the problems for the conflicts ?!!

@aljoscha
Copy link
Contributor

Where is your git repository? So that I can checkout your commit and merge it?

@Elbehery
Copy link
Author

@aljoscha you should be able to see it from the PR .. anyway this is mine https://github.com/Elbehery/flink

@aljoscha
Copy link
Contributor

The problem is, that I can't see it in the github interface. On what branch are your changes? Could you please rebase them on top of the current master?

@Elbehery
Copy link
Author

@aljoscha u were right.. I could not find the code on my repo, somehow it was lost !!! ..

I have created a new PR now, #621 .. it has the same commit of #442 ..

Please let me know if there is any issue.

@asfgit asfgit closed this in 8e429ce Apr 29, 2015
bhatsachin pushed a commit to bhatsachin/flink that referenced this pull request May 5, 2015
marthavk pushed a commit to marthavk/flink that referenced this pull request Jun 9, 2015
nltran pushed a commit to nltran/flink that referenced this pull request Jan 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants