pom.xml is passed as the input file to DebuggingWordCount Example #304
edhgoose wants to merge 1 commit into apache:asf-site from edhgoose:asf-site
|
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 45a46d7 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again. |
|
Hey @edhgoose, thank you for your contribution. I believe it is good to keep cc: @melap |
|
Thanks @aaltay, all the prior examples use GCS too so I believe it's expected already. I didn't have to do anything special to set up the prior examples. I'd happily agree to make it |
|
@edhgoose you are right. It looks like the above examples do not have I think it makes sense to remove from |
|
What should it be instead? Surely there aren't any other inputs that would work? If And because of the PAssert at the end which expects Flourish and Stomach to appear exactly 3 and 1 times respectively, if the file is not the king-lear input then the pipeline is pretty much guaranteed to fail is it not? |
|
@edhgoose you are right, I was not aware of that.
How about we put a copy of kinglear.txt into the repository as a local example input? |
|
R: @melap |
|
hmm, yeah, what about putting kinglear in a new directory in the examples directory for each language (alongside complete, cookbook, etc.) called exampledata or something? only downside is it would add an extra step of downloading the file locally, but everything would be consistent. |
|
cc: @davidcavazos |
|
The kinglear.txt file on gs:// is world-readable. Reading it requires the SDK/workers to have public internet access and beam-sdks-java-extensions-google-cloud-platform-core to be a dependency (which it already is). I think removing inputFile=pom.xml is the way to go, plus a short note telling our users that the default input file is gs://.../kinglear.txt, which they can download manually if they have trouble accessing the remote file. |
|
Ideally this should be fixed for the Python examples as well, but https://issues.apache.org/jira/browse/BEAM-2101 might be a blocker for doing this. |
|
I looked at this more, there seem to be multiple consistency issues.
For consistency, would it make sense to change all runner commands on this page to use a local kinglear as input, and perhaps change the text to something like this? |
|
The issue about recommending and downloading a local file is that it prevents running with a runner on a cluster and really only works for runners which have a purely local execution mode like Flink, Spark, and DirectRunner. |
|
I favor both Java and Python pointing to the copy in Incidentally it makes sense to me to package each filesystem implementation in its own artifact for easy plugging in. Not sure how easy it is to search for how to get |
|
What are the next steps here? |
|
Is there consensus that this is good? It looks like it. @melap ? |
|
I agree it sounds like this is fine: specifically to use default, the copy in apache-beam-samples. |
|
retest this please |
|
If anyone has any objection to this, do holler. If no objections, I will merge this tomorrow. |
|
@asfgit merge |
The DebuggingWordCount example says:
This passes pom.xml as the input file, which causes the PAssert at the end of the pipeline to fail.
The default file (kinglear.txt) is correct.
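To illustrate why the input file matters, here is a minimal sketch in plain Python (not Beam code) of the kind of check discussed in this thread: count words, keep only the two words of interest, and require exact counts. The two sample strings are hypothetical stand-ins, not the real contents of kinglear.txt or pom.xml, and the lowercase spelling of "stomach" is an assumption.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two inputs; NOT the real file contents.
KINGLEAR_LIKE = "Flourish. Enter Lear. Flourish. my stomach turns. Flourish."
POM_LIKE = "<project><artifactId>example</artifactId></project>"

# Expected counts as described in the discussion: 3 and 1 occurrences.
# The exact casing of the words is an assumption made for this sketch.
EXPECTED = {"Flourish": 3, "stomach": 1}


def filtered_counts(text, wanted=EXPECTED):
    """Count all words, then keep only the words the final check cares about."""
    words = re.findall(r"[A-Za-z']+", text)
    counts = Counter(words)
    return {w: counts[w] for w in wanted if counts[w] > 0}


def check(text):
    """Mimic the final assertion: filtered counts must match exactly."""
    return filtered_counts(text) == EXPECTED


if __name__ == "__main__":
    print("kinglear-like input passes:", check(KINGLEAR_LIKE))
    print("pom-like input passes:", check(POM_LIKE))
```

With any input that does not contain those words the expected number of times, such as a Maven build file, the final comparison cannot succeed, which matches the failure reported here.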