Nifi Atlas Bridge Development Opportunity #1 #2

Closed
ccoldwell opened this issue Aug 30, 2017 · 1 comment
ccoldwell commented Aug 30, 2017

Value: $10,000.00
Closes: 23:00 PST, Tuesday, September 19, 2017
Location: Victoria (in-person work NOT required)

Opportunity Description

We are looking for help building an Apache NiFi-Atlas bridge. We want it to process NiFi provenance data and record it in Apache Atlas as lineage metadata, to support our planned use of the Hortonworks data governance framework.

More specifically, we want to be able to record in Atlas the "logical" lineage of a file from input all the way to being saved to HDFS or a database, even though it is manipulated several times along the way. We would also like to process the provenance data as the flow runs rather than waiting until it is finished, so that Atlas can show the current status of a file before the whole flow completes.
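As a rough illustration of the kind of integration we have in mind (not a prescription for how to implement it), a NiFi ReportingTask can pull provenance events in batches and process them incrementally. The sketch below only logs each event; the class name is made up, and the place where a real implementation would map events to Atlas entities is marked with a comment.

```java
import java.util.List;

import org.apache.nifi.provenance.ProvenanceEventRecord;
import org.apache.nifi.reporting.AbstractReportingTask;
import org.apache.nifi.reporting.ReportingContext;

// Minimal sketch of a reporting task that consumes provenance events incrementally.
// A real bridge would map each event to Atlas lineage entities instead of logging it.
public class AtlasLineageReportingTask extends AbstractReportingTask {

    private static final int BATCH_SIZE = 1000;

    // Last provenance event id already processed, so each run only reads new events.
    private volatile long lastEventId = -1L;

    @Override
    public void onTrigger(final ReportingContext context) {
        try {
            final List<ProvenanceEventRecord> events =
                    context.getEventAccess().getProvenanceEvents(lastEventId + 1, BATCH_SIZE);
            for (final ProvenanceEventRecord event : events) {
                // Placeholder: a real implementation would build Atlas entities
                // from this event and post them to the Atlas REST API.
                getLogger().info("Provenance event {} of type {} for FlowFile {}",
                        new Object[] { event.getEventId(), event.getEventType(), event.getFlowFileUuid() });
                lastEventId = event.getEventId();
            }
        } catch (final Exception e) {
            getLogger().error("Failed to process provenance events", e);
        }
    }
}
```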

We are aware of the code at https://github.com/vakshorton/NifiAtlasBridge and https://github.com/vakshorton/NifiAtlasLineageReporter; however, it does not seem to do exactly what we want. The main problem with that code is that it generates Atlas input/output metadata from too many flow ingress and egress events. Many processors change or clone data, so the FlowFile's id changes, which in turn breaks the lineage metadata in Atlas.
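One possible way to keep the logical lineage unbroken when a FlowFile's id changes is to resolve every derived FlowFile back to a "root" id by following the parent/child UUIDs carried on CLONE, FORK, and JOIN provenance events. The sketch below shows only that bookkeeping, with an in-memory map standing in for whatever durable, cluster-aware store a real solution would need; all names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.nifi.provenance.ProvenanceEventRecord;

// Sketch of mapping a FlowFile's changing ids back to a single logical root id,
// so CLONE/FORK/JOIN events do not break the lineage reported to Atlas.
// The in-memory map is only illustrative; a real solution needs a durable store
// that is shared across nodes in a clustered / site-to-site deployment.
public class LogicalLineageResolver {

    private final Map<String, String> rootIdByFlowFileId = new ConcurrentHashMap<>();

    /** Returns the logical root id for a FlowFile uuid (itself, if no ancestor is known). */
    public String resolveRootId(final String flowFileUuid) {
        return rootIdByFlowFileId.getOrDefault(flowFileUuid, flowFileUuid);
    }

    /** Records that the event's child FlowFiles descend from the same logical root. */
    public void recordDerivations(final ProvenanceEventRecord event) {
        if (event.getParentUuids().isEmpty() || event.getChildUuids().isEmpty()) {
            return; // Only events that relate parents to children create new FlowFile ids.
        }
        // Simplification: when several parents are joined, inherit the first parent's root.
        final String parentUuid = event.getParentUuids().get(0);
        final String rootId = resolveRootId(parentUuid);
        for (final String childUuid : event.getChildUuids()) {
            rootIdByFlowFileId.put(childUuid, rootId);
        }
    }
}
```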

A typical simplified use case would be: get a ZIP file from the file system, move it to a different directory based on filename and date, perform various actions on it (unzip to CSV, split, join, update attributes, run custom processors, manually edit it, save to a different directory, infer an Avro schema, convert the CSV to Avro, save the file, convert it to ORC, put it in HDFS, generate Apache Hive DDL, create a Hive table). The file and directory names will not be hard-coded in the flow but will instead be parameter driven. In this scenario, we would like to be able to trace the Hive table's lineage all the way back to the original ZIP file.
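For illustration only, the lineage we would expect Atlas to show for this simplified flow might look roughly like the chain below. The dataset type names (fs_path, hdfs_path, hive_table) come from Atlas's filesystem and Hive models; the process and file names are purely hypothetical.

```
input.zip (fs_path)
  -> [unzip / split / update attributes]   -> part1.csv (fs_path)
  -> [infer schema / convert CSV to Avro]  -> part1.avro (fs_path)
  -> [convert to ORC / put to HDFS]        -> /data/part1.orc (hdfs_path)
  -> [generate Hive DDL / create table]    -> my_table (hive_table)
```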

Acceptance Criteria

To be paid the fixed price for this opportunity, you need to meet all of the following criteria:

  • A merge request from your GitHub account to this repo (https://github.com/bcgov/nifi-atlas) that works with the following software versions:
    • Java 8
    • Apache Atlas 0.8 
    • Apache NiFi 1.3+
  • The solution must work in a clustered environment with provenance data coming from several nodes. We would like it to work in a site-to-site environment (cluster A does some processing, then hands it over to cluster B to do further processing - we would like to be able to have the lineage follow all the way through).
  • It is acceptable for the code to work with a local file system, even though we will deploy it in the cloud (e.g. Azure or AWS).
  • The code must allow us to add more processors and custom processors quite easily.
  • The code needs to include a build script (Maven or Gradle) to compile the Java code and produce a NAR file (see the build sketch after this list).
  • The code needs to include some basic unit testing, as well as an example of use in a Dataflow template to demonstrate end-to-end functionality.
  • This code does not need to be production ready, but must be a good starting point for us to use.
  • The code needs to be commented where possible, especially in the areas of provenance processing.
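As one possible starting point for the build-script criterion above, a Maven module can produce a NAR by using the nifi-nar-maven-plugin with "nar" packaging. The coordinates and versions below are illustrative placeholders only and would need to match the actual project.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Illustrative coordinates only; replace with the real project values. -->
  <groupId>ca.bc.gov.nifi</groupId>
  <artifactId>nifi-atlas-nar</artifactId>
  <version>0.1.0-SNAPSHOT</version>

  <!-- "nar" packaging is provided by the nifi-nar-maven-plugin below. -->
  <packaging>nar</packaging>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.nifi</groupId>
        <artifactId>nifi-nar-maven-plugin</artifactId>
        <version>1.2.0</version>
        <extensions>true</extensions>
      </plugin>
    </plugins>
  </build>
</project>
```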

How to Apply

Go to the Opportunity Page, click the Apply button above, and submit your proposal by 23:00 PST on Tuesday, September 19, 2017.

We plan to assign this opportunity by Tuesday, September 26, 2017 with work to start on Tuesday, September 26, 2017.

If your proposal is accepted and you are assigned to the opportunity, you will be notified by email and asked to confirm your agreement to the Code With Us terms and contract.

Proposal Evaluation Criteria

We will score proposals by the following criteria:

  • Demonstrated knowledge of the NiFi provenance system and of best practices for processing provenance data efficiently, as expressed in pseudocode / structures (30 points),
  • Experience contributing Java code to any public code repository with more than 5 contributors (10 points),
  • Experience contributing Java code to either of the following projects: https://github.com/apache/nifi, https://github.com/apache/incubator-atlas (10 points),
  • Ability to deliver a complete solution on or before October 13, 2017 (10 points).
bcgov locked and limited conversation to collaborators Aug 30, 2017
bcgov unlocked this conversation Sep 5, 2017
ccoldwell (Author) commented

This opportunity has been assigned
