[epic] MVP v0.1 for new Data Load approach with AirCan #1

Closed
22 of 29 tasks
rufuspollock opened this issue Jun 4, 2020 · 1 comment

rufuspollock commented Jun 4, 2020

Load data into the CKAN DataStore using Airflow as the runner. A replacement for DataPusher and Xloader. Clean separation of components so you can reuse what you want (e.g. don't use Airflow but use your own runner).

Acceptance

  • Upload of a raw file to CKAN triggers a load to the DataStore
  • Manual triggering via UI and API
    • API: datapusher_submit (see the sketch after this list)
  • Loads CSV OK (no type casting, so all strings)
  • Loads XLSX OK (uses types)
  • Loads Google Sheets
  • Loads a Tabular Data Resource (uses types)
  • Deploy to a cloud instance (e.g. GCP Cloud Composer)
  • Continuous deployment of AirCan (done with DX)
  • Error handling (when a DAG fails, the CKAN instance must know the status)
  • Successful end-to-end run: a CKAN instance with ckanext-aircan-connector; upload a CSV file and have a DAG triggered on GCP, ending up with the successfully parsed data in the CKAN DataStore.
  • Failed end-to-end run: a CKAN instance with ckanext-aircan-connector; upload a CSV file and have a DAG triggered on GCP; the CKAN instance must know something went wrong.
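
A minimal sketch of the manual API trigger, assuming the connector keeps the datapusher_submit action name listed above; the instance URL, API key and resource id are placeholders:

```python
# Hedged sketch: trigger a (re)load of one resource via the CKAN action API.
# The action name follows the datapusher_submit convention mentioned above;
# CKAN_URL, API_KEY and RESOURCE_ID are placeholders, not real values.
import requests

CKAN_URL = "https://ckan.example.com"
API_KEY = "your-ckan-api-key"
RESOURCE_ID = "the-resource-uuid"

resp = requests.post(
    f"{CKAN_URL}/api/3/action/datapusher_submit",
    json={"resource_id": RESOURCE_ID},
    headers={"Authorization": API_KEY},
)
resp.raise_for_status()
print(resp.json()["success"])  # True if the submit was accepted
```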

Tasks

Deploy to GCP

  • Enable Composer on GCP with access for external calls [infra]
  • Deploy AirCan code on Composer (we're doing it via PyPI and also by pasting the raw code into the dependencies folder in the GCS bucket; the second method is easier but for development purposes only)
  • A DAG that is triggered remotely by a CKAN instance and processes a CSV file, converting it to JSON and sending the results back to CKAN via the API (a sketch is below)
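
A minimal sketch of that DAG, assuming Airflow 1.10's PythonOperator and the standard CKAN datastore_create / datastore_upsert actions; the DAG id, conf keys and paths are illustrative, not AirCan's actual names:

```python
# Hedged sketch of the CSV -> DataStore DAG. The resource id, CKAN URL, API
# key and CSV path are assumed to arrive via dag_run.conf when CKAN triggers
# the run; key names here are placeholders.
import csv
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def csv_to_datastore(**context):
    conf = context["dag_run"].conf            # supplied by the CKAN connector
    ckan_url = conf["ckan_site_url"]          # assumed key names, for illustration
    api_key = conf["ckan_api_key"]
    resource_id = conf["resource_id"]
    csv_path = conf["csv_input"]

    with open(csv_path) as f:
        reader = csv.DictReader(f)
        # No type casting for the MVP: every field is loaded as text.
        fields = [{"id": name, "type": "text"} for name in reader.fieldnames]
        records = list(reader)

    headers = {"Authorization": api_key}
    # Create (or recreate) the DataStore table, then push the records.
    requests.post(f"{ckan_url}/api/3/action/datastore_create",
                  json={"resource_id": resource_id, "fields": fields,
                        "force": True},
                  headers=headers).raise_for_status()
    requests.post(f"{ckan_url}/api/3/action/datastore_upsert",
                  json={"resource_id": resource_id, "records": records,
                        "method": "insert", "force": True},
                  headers=headers).raise_for_status()


dag = DAG("ckan_api_load_single_step",        # assumed DAG id
          schedule_interval=None,             # only runs when triggered by CKAN
          start_date=datetime(2020, 6, 1))

load_task = PythonOperator(task_id="csv_to_datastore",
                           python_callable=csv_to_datastore,
                           provide_context=True,
                           dag=dag)
```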

Auth and logging

  • Authentication with GCP: the AirCan connector must authenticate with Composer (see the sketch after this list)
  • Logging: logs are enabled on Composer and can be consumed via the API. Note: there is no standard logging format yet
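
One possible shape for the Composer authentication, assuming the environment's Airflow web server sits behind Identity-Aware Proxy and exposes the Airflow 1.10 experimental REST API; the client id, URL and DAG id are placeholders, not the connector's real configuration:

```python
# Hedged sketch: obtain an OpenID Connect token for the IAP-protected Airflow
# web server of a Composer environment, then trigger a DAG run through the
# experimental REST API. All identifiers below are placeholders.
import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

IAP_CLIENT_ID = "1234567890-abc.apps.googleusercontent.com"  # IAP OAuth client
AIRFLOW_WEB_URL = "https://xxxxxxxx.appspot.com"             # Composer web server
DAG_ID = "ckan_api_load_single_step"                         # assumed DAG id

# Fetch an ID token for the IAP audience using the service-account
# credentials available in the environment.
token = id_token.fetch_id_token(Request(), IAP_CLIENT_ID)

resp = requests.post(
    f"{AIRFLOW_WEB_URL}/api/experimental/dags/{DAG_ID}/dag_runs",
    json={"conf": {"resource_id": "the-resource-uuid"}},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
```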

CKAN integration

  • New ckanext with hooks to trigger the run automatically (a sketch of the hook is below)
  • Can trigger a manual run somehow (?)
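
A minimal sketch of what that hook could look like, assuming CKAN's IResourceController plugin interface; the plugin class name and the trigger_dag_run helper are hypothetical:

```python
# Hedged sketch of the ckanext hook: when a resource is created or updated,
# kick off the DAG run. trigger_dag_run() stands in for the connector's
# actual trigger code (e.g. the Composer call sketched above).
import ckan.plugins as plugins


def trigger_dag_run(resource):
    # Placeholder: call the Airflow / Composer API with the resource id
    # in the dag_run conf.
    pass


class AircanConnectorPlugin(plugins.SingletonPlugin):
    plugins.implements(plugins.IResourceController, inherit=True)

    def after_create(self, context, resource):
        # Fired by CKAN after a new resource is uploaded.
        trigger_dag_run(resource)

    def after_update(self, context, resource):
        # Fired after a resource file is replaced or its metadata edited.
        trigger_dag_run(resource)
```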

Analysis

  • For the MVP, should we use data-loader-next [for what] [Ans: processors in the Airflow pipeline]? Yes, this is going to be our data processing library
  • Is anyone using data-loader-next? Should we invest time improving its docs? (Not sure if it's necessary.) We will be using the core code. Whether we keep this as its own lib is debatable, but 1-2h improving the docs now is fine
  • Does data-loader-next offer support for Tabular Data Resources? Yes, sort of, but it does not currently respect types in the Table Schema, as it followed the xloader approach where everything is a string.
    • TODO: fix Table Schema support so types are respected, as we will need that when doing a proper load (see the type-mapping sketch after this list)
  • Will Airflow orchestrate the load? Yes
  • In the Airflow design, how atomic do we make the steps? E.g. do we break out creating the table in the DataStore from the load ...? Up to us. Maybe ... but actually not crucial either way.
  • How do we do logging / reporting? See Analysis for Logging system #2
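
To make the Table Schema point concrete, a small sketch of what "respecting types" could mean when building the datastore_create fields; the type mapping is illustrative, not data-loader-next's actual table:

```python
# Hedged sketch: map Table Schema field types to DataStore (PostgreSQL)
# column types instead of loading everything as text. Mapping is illustrative.
TABLE_SCHEMA_TO_DATASTORE = {
    "string": "text",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "date": "date",
    "datetime": "timestamp",
}


def datastore_fields(table_schema):
    """Build the `fields` argument for datastore_create from a Table Schema."""
    return [
        {"id": f["name"],
         "type": TABLE_SCHEMA_TO_DATASTORE.get(f.get("type", "string"), "text")}
        for f in table_schema["fields"]
    ]


# Example:
schema = {"fields": [{"name": "id", "type": "integer"},
                     {"name": "name", "type": "string"}]}
print(datastore_fields(schema))
# [{'id': 'id', 'type': 'int'}, {'id': 'name', 'type': 'text'}]
```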
@rufuspollock rufuspollock changed the title MVP for new Data Load approach with AirCan [epic] MVP for new Data Load approach with AirCan Jun 4, 2020
@rufuspollock rufuspollock changed the title [epic] MVP for new Data Load approach with AirCan [epic] MVP v0.1 for new Data Load approach with AirCan Jul 14, 2020
@rufuspollock rufuspollock added this to the Sprint - 20 July 2020 milestone Jul 14, 2020
@rufuspollock

FIXED. One outstanding item is continuous deployment, which is tracked in #66.
