Skip to content

pom.xml is passed as the input file to DebuggingWordCount Example#304

Closed
edhgoose wants to merge 1 commit intoapache:asf-sitefrom
edhgoose:asf-site
Closed

pom.xml is passed as the input file to DebuggingWordCount Example#304
edhgoose wants to merge 1 commit intoapache:asf-sitefrom
edhgoose:asf-site

Conversation

@edhgoose
Copy link

The DebuggingWordCount example says:

$ mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.DebuggingWordCount \
     -Dexec.args="--inputFile=pom.xml --output=counts" -Pdirect-runner

This passes the pom.xml file as the input which is processed, which causes the PAssert at the end to fail.

The default file (kinglear.txt) is correct.

@asfgit
Copy link

asfgit commented Aug 26, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/656/

Jenkins built the site at commit id 45a46d7 with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

@aaltay
Copy link
Member

aaltay commented Aug 27, 2017

Hey @edhgoose, thank you for your contribution. I believe it is good to keep --inputFile argument here, because the default for that is file stored on GCS. It may required additional dependencies to be installed for the default input file to be used. We avoid imposing this dependency on new users just trying out the example by passing a local input file.

cc: @melap

@edhgoose
Copy link
Author

Thanks @aaltay, all the prior examples use GCS too so I believe it's expected already. I didn't have to do anything special to set up the prior examples. I'd happily agree to make it inputFile=gs://... though as that's a useful lesson. As long as it doesn't try word counting the pom anymore :)

@aaltay
Copy link
Member

aaltay commented Aug 27, 2017

@edhgoose you are right. It looks like the above examples do not have --inputFile for DirectRunner but it is there for other runners. Because other runners do not always have access to GCS files.

I think it makes sense to remove from DirectRunner but still keep it for other runners.

@edhgoose
Copy link
Author

What should it be instead? Surely there aren't any other inputs that would work?

If --inputFile=pom.xml it literally reads the pom.xml file, which parses out words like java, Apache, NOTICE, ASF, etc. So it can't be that.

And because of the PAssert at the end which expects Flourish and Stomach to appear exactly 3 and 1 times respectively, if the file is not the king-lear input then the pipeline is pretty much guaranteed to fail is it not?

@aaltay
Copy link
Member

aaltay commented Aug 28, 2017

@edhgoose you are right, I was not aware of thatPAssert. There are two conflicting issues here:

  1. If we do not use --inputFile with a local file argument some runners (apex, flink, spark) will fail because they do not necessarily have a way to access the default input file hosted in GCS.
  2. If we pass the argument with a local input file, all of them will fail because of the PAssert.

How about we put a copy of the kinglear as a local file into the repository as an example input?

(cc: @lukecwik @melap)

@reuvenlax
Copy link

R: @melap

@melap
Copy link

melap commented Aug 30, 2017

hmm, yeah, what about putting kinglear in a new directory in the examples directory for each language (alongside complete, cookbook, etc.) called exampledata or something? only downside is it would add an extra step of downloading the file locally, but everything would be consistent.

@melap
Copy link

melap commented Aug 30, 2017

cc: @davidcavazos

@lukecwik
Copy link
Member

lukecwik commented Sep 7, 2017

The kinglear.txt file on gs:// is world readable, it requires the SDK/workers to have public internet access and for beam-sdks-java-extensions-google-cloud-platform-core to be a dependency (which it already is). I think removing inputFile=pom.xml is the way to go and just add a bit that instructs our users that the default input file is gs://.../kinglear.txt which they can download themselves manually if they have trouble accessing the remote file.

@aaltay
Copy link
Member

aaltay commented Sep 8, 2017

Ideally this should be fixed for Python examples as well, but https://issues.apache.org/jira/browse/BEAM-2101 might a blocker for doing this.

@melap
Copy link

melap commented Sep 8, 2017

I looked at this more, there seem to be multiple consistency issues.

  1. Java Dataflow runner command appears to get all files in the shakespeare directory (which contains 42 files -- too many to ask a user to download). Python Dataflow command is only getting kinglear.txt.
  2. Python direct runner is using README.md and not kinglear.
  3. Java is getting the shakespeare files from apache-beam-samples bucket, and Python is getting it from the dataflow-samples bucket.

For consistency, would it make sense to change all runner commands on this page to use a local kinglear as input, and perhaps change the text to something like this?
To run this example in Java, download the sample input file and execute the command for your runner:

@lukecwik
Copy link
Member

lukecwik commented Sep 8, 2017

The issue about recommending and downloading a local file is that it prevents running with a runner on a cluster and really only works for runners which have a purely local execution mode like Flink, Spark, and DirectRunner.

@kennknowles
Copy link
Member

I favor both Java and Python pointing to the copy in apache-beam-samples. Echoing Luke, local files are actually more problematic than public-readable file on the web. I think it is just fine for the examples to depend on a variety of Beam filesystem implementations.

Incidentally it makes sense to me to package each filesystem implementation in its own artifact for easy plugging in. Not sure how easy it is to search for how to get gs: URL support in Beam since the package is sort of a grab bag. I guess we are working on the assumption that you are either on GCP or not but without that assumption in mind you won't know what you are getting with that package.

@jkff
Copy link

jkff commented Feb 7, 2018

What are the next steps here?

@kennknowles
Copy link
Member

Is there consensus that this is good? It looks like it. @melap ?

@melap
Copy link

melap commented Feb 22, 2018

I agree it sounds like this is fine: specifically to use default, the copy in apache-beam-samples.

@melap
Copy link

melap commented Feb 22, 2018

retest this please

@melap
Copy link

melap commented Feb 22, 2018

If anyone has any objection to this, do holler. If no objections, I will merge this tomorrow.

@melap
Copy link

melap commented Feb 26, 2018

@asfgit merge

@asfgit asfgit closed this in 422e81b Feb 26, 2018
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Jun 5, 2018
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Jun 5, 2018
melap pushed a commit to apache/beam that referenced this pull request Jun 20, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants