Load data into CKAN DataStore using Airflow as the runner. A replacement for DataPusher and Xloader, with clean separation of components so you can reuse what you want (e.g. don't use Airflow; use your own runner).
Acceptance
Upload of raw file to CKAN triggers load to datastore
Manual triggering via UI and API
API: datapusher_submit
Loads CSV ok (no type casting so all strings)
Loads XLSX ok (uses types)
Loads Google Sheets
Load Tabular Data Resource (uses types)
Deploy to cloud instance (e.g. GCP cloud dataflow)
Continuous deployment of AirCan (done with DX)
Error Handling (when DAG fails, the CKAN instance must know the status)
Successful end-to-end run: a CKAN instance with ckanext-aircan-connector; upload a CSV file and have a DAG on GCP triggered, ending with the successfully parsed data in the CKAN DataStore.
Failed end-to-end run: a CKAN instance with ckanext-aircan-connector; upload a CSV file and have a DAG on GCP triggered; the CKAN instance must know something went wrong.
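The success and failure acceptance runs above both require the DAG to report its final status back to CKAN. A minimal sketch of the callback body such a report could carry; the endpoint and field names are illustrative, not the actual ckanext-aircan-connector API:

```python
import json

def build_status_callback(resource_id, state, error=None):
    """Build the JSON body a final DAG task could POST back to CKAN.

    `state` is "complete" on success or "error" on failure, so the
    CKAN instance always knows how the run ended (field names are
    hypothetical placeholders).
    """
    payload = {"resource_id": resource_id, "state": state}
    if error is not None:
        payload["error"] = str(error)
    return json.dumps(payload)

success = build_status_callback("abc-123", "complete")
failure = build_status_callback("abc-123", "error", error="DAG task failed")
```

Sending this from a single terminal task (or an Airflow `on_failure_callback`) keeps the reporting in one place regardless of which step failed.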
Tasks
README describing how to use it (README-driven development for a new user)
Migrate Loader library code to this repo (#9), from xloader or https://gitlab.com/datopian/tech/data-loader-next-gen/
Run via UI or tests - can have a hardcoded local CSV and CKAN target
Trigger it from the API - maybe add this to the README (e.g. `curl ...`) with the ability to configure both the CKAN instance and the source CSV file
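A sketch of what the API-trigger snippet in the README could look like; shown in Python with the stdlib `urllib` rather than curl, and with placeholder parameter names, since both the CKAN instance and the source CSV must be configurable. It builds the request without sending it:

```python
import json
from urllib.request import Request

def build_trigger_request(ckan_url, api_key, resource_id, csv_url):
    """Build (but do not send) a POST request that would trigger a
    datapusher_submit-style action. Parameter names are illustrative;
    check the actual action signature before using."""
    body = json.dumps({"resource_id": resource_id, "source_url": csv_url}).encode()
    return Request(
        f"{ckan_url}/api/3/action/datapusher_submit",
        data=body,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_trigger_request(
    "https://demo.ckan.org", "MY-API-KEY", "res-1", "https://example.com/data.csv"
)
```

Sending it is then one `urllib.request.urlopen(req)` call, which the README could show alongside the equivalent curl command.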
Deploy to GCP
Enable Composer on GCP with access for external calls [infra]
Deploy AirCan code on Composer (we're doing it via PyPI and also by pasting the raw code into the dependencies folder on the GCP bucket; the second method is easier but for development purposes only)
DAG that is triggered remotely by a CKAN instance and processes a CSV file, converting it to JSON and sending the results back to CKAN via the API
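The CSV-to-JSON conversion step of that DAG can be sketched with the stdlib alone. Following the current xloader-style behaviour noted in the Analysis below, every value stays a string (no type casting yet); the function name is illustrative:

```python
import csv
import io
import json

def csv_to_json_records(csv_text):
    """Convert raw CSV text into a JSON array of row dicts,
    leaving every value as a string (xloader-style, no casting)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader))

records = csv_to_json_records("id,name\n1,alice\n2,bob\n")
```

The resulting JSON array is the shape the DAG would send back to CKAN's DataStore API.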
Auth and logging
Authentication with GCP - the AirCan connector must authenticate with Composer
Logging - Logs are enabled on Composer and can be consumed via API. Note: There is no standard format for logging yet
CKAN integration
New ckanext with hooks to trigger the run automatically
Can run manually somehow (?)
Analysis
For the MVP, should we use data-loader-next [for what]? [Ans: processors in the Airflow pipeline] Yes, this is going to be our data processing library
Is anyone using data-loader-next? Should we invest time improving its docs? (Not sure if it's necessary.) We will be using the core code. Whether we keep this as its own lib is debatable, but 1-2h improving the docs now is fine
Does data-loader-next offer support for Tabular Data Resources? Yes, sort of, but it does not currently respect types in Table Schema, as it follows the xloader approach where everything is a string.
TODO: fix Table Schema support so types are respected, as we will need that when doing a proper load
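A minimal sketch of what respecting Table Schema types could look like. This covers only a small subset of the types the Table Schema spec defines, and is not the data-loader-next implementation:

```python
def cast_value(value, field_type):
    """Cast a string cell to the Python type implied by a Table Schema
    field type; only integer, number, and boolean are handled here."""
    if field_type == "integer":
        return int(value)
    if field_type == "number":
        return float(value)
    if field_type == "boolean":
        return value.lower() in ("true", "1", "yes")
    return value  # default: keep as string

def cast_row(row, schema):
    """Apply the schema's field types to one row dict."""
    types = {f["name"]: f["type"] for f in schema["fields"]}
    return {k: cast_value(v, types.get(k, "string")) for k, v in row.items()}

schema = {"fields": [{"name": "id", "type": "integer"},
                     {"name": "score", "type": "number"}]}
typed = cast_row({"id": "7", "score": "3.5"}, schema)
```

A real implementation would also need dates, missing-value handling, and cast errors surfaced back through the DAG's error reporting.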
Will Airflow orchestrate the load? Yes
In the Airflow design, how atomic do we make the steps? E.g. do we break out creating the table in the DataStore from the load? Up to us. Maybe... but actually not crucial either way.
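The atomicity question can be made concrete by splitting the work into separately callable steps that Airflow could wire as individual tasks. A sketch under assumed names; these functions stand in for the real CKAN API calls and are not AirCan's actual task layout:

```python
def create_datastore_table(resource_id, fields):
    """Step 1: would call CKAN's datastore_create; here it just returns
    the table definition so the step stays independently testable."""
    return {"resource_id": resource_id, "fields": fields}

def load_records(table, records):
    """Step 2: would call CKAN's datastore_upsert; kept separate so a
    failed load is distinguishable from a failed table creation."""
    return {"resource_id": table["resource_id"], "inserted": len(records)}

# In an Airflow DAG these could be two chained PythonOperator tasks
# (create >> load), or collapsed into one task; that is the trade-off
# discussed above.
table = create_datastore_table("res-1", [{"id": "name", "type": "text"}])
result = load_records(table, [{"name": "alice"}, {"name": "bob"}])
```

Separate tasks give finer-grained retries and clearer failure reporting back to CKAN, at the cost of a slightly larger DAG.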