Data pipeline #8
Maybe release a Python and/or R API client so users can access the API easily from Jupyter notebooks?
@neotheicebird Yeah, we discussed it at the meeting - using AWS API.
@Rotzke awesome! Just to keep us on the same page: I mean a Python/R client-side library, apart from the web API development. Thanks
@neotheicebird Up to you, good sir :) Created a new issue on Teams.
@neotheicebird - at least some standard code to access the data will be very useful. In pandas we have something like:

```python
dfm = pd.read_csv(url_m,
                  converters={'time_index': pd.to_datetime},
                  index_col='time_index')
```

This works to read monthly data from a stable URL, but it is slow to query the internet every time we run the program, so we may have some class to load/update data, similar to the one below (from here):

Maybe this can be a client/small library/PyPI package, but at the least we can provide some preferred code to download and manipulate the data. Updated pipeline to indicate this.
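The load/update helper referenced above was not captured in this thread; a minimal sketch of what such a class might look like, assuming the stable-URL CSV layout from the snippet (the URL, class name and cache path are hypothetical, not project decisions):

```python
import os
import pandas as pd

# Hypothetical stable URL for the monthly CSV (placeholder, not a real endpoint)
URL_MONTHLY = "https://example.com/data/monthly.csv"


class Dataset:
    """Load monthly data from a stable URL, caching it locally so that
    repeated runs do not hit the network every time."""

    def __init__(self, url=URL_MONTHLY, cache_path="monthly_cache.csv"):
        self.url = url
        self.cache_path = cache_path

    def update(self):
        """Force a fresh download and refresh the local cache."""
        dfm = pd.read_csv(self.url,
                          converters={"time_index": pd.to_datetime},
                          index_col="time_index")
        dfm.to_csv(self.cache_path)
        return dfm

    def load(self):
        """Read from the local cache if present, otherwise download."""
        if os.path.exists(self.cache_path):
            return pd.read_csv(self.cache_path,
                               converters={"time_index": pd.to_datetime},
                               index_col="time_index")
        return self.update()
```

Usage would then be `df = Dataset().load()` in a notebook, with `Dataset().update()` to refresh.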
Awesome, didn't know about
@epogrebnyak the code example and a simple PyPI package to access the API sound good
@neotheicebird @epogrebnyak Guys, we have Slack for chatting! :)
Based on discussion with @Rotzke, updated pipeline:
Some more detail on the pipeline, based on mini-kep:

- Raw data:
- Parsing:
- Transformation:
- Frontend:
- End-user:

This is to discuss the role of the interim database.
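The stage details above were lost in this thread, but the parsing-to-frontend flow could be sketched roughly as below; all function names and the toy data are illustrative assumptions, not actual mini-kep code:

```python
import pandas as pd

def parse(raw_text):
    """Parsing: turn raw publication text into tidy rows of
    (time_index, varname, value)."""
    rows = []
    for line in raw_text.strip().splitlines():
        date, varname, value = line.split(",")
        rows.append({"time_index": pd.Timestamp(date),
                     "varname": varname,
                     "value": float(value)})
    return pd.DataFrame(rows)

def transform(df):
    """Transformation: pivot to one column per variable, indexed by time."""
    return df.pivot(index="time_index", columns="varname", values="value")

def to_frontend_csv(df):
    """Frontend: the CSV string an end user would download."""
    return df.to_csv()

# Toy raw input standing in for a parsed publication
raw = """2017-01-31,EXPORT_bln_usd,24.5
2017-01-31,USDRUR,59.6"""
wide = transform(parse(raw))
```

The end user then sees `wide` (or its CSV) rather than the raw publications.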
My thoughts are still about a minimum working example (MWE) for several parsers that can produce compatible output, and a pipeline that allows them to work together. Here is an example of this kind.

End user case - MWE

The end user wants to calculate Russian monthly non-oil exports and see this figure in roubles. This is a bit of a simplistic task, but it is not two galaxies away from some everyday Excel calculation, just about one. We need something that drags data from different sources. The formula will be:

The sources are:
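A sketch of what this calculation might look like once the series are in pandas; the series names and all numbers here are made up for illustration, not published data:

```python
import pandas as pd

# Toy monthly series from three hypothetical sources
idx = pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"])
export_total_usd = pd.Series([24.5, 26.1, 28.0], index=idx)  # total exports, bln USD
export_oil_usd = pd.Series([8.1, 8.9, 9.4], index=idx)       # oil exports, bln USD
usd_rub = pd.Series([59.6, 58.1, 57.0], index=idx)           # monthly average FX rate

# Non-oil exports in roubles: subtract oil exports, convert at the monthly rate
non_oil_export_rub = (export_total_usd - export_oil_usd) * usd_rub
```

The point of the MWE is that each of the three series comes from a different parser, yet they align on a shared monthly index.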
Implementation - MWE and extensions

Multiple data sources. Imagine you have working parsers for Rosstat, EIA and Bank of Russia publications. Each parser will produce output in its data/processed folder. For this to work well:
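A sketch of what "compatible output" could mean in practice: every parser emits CSV in one agreed column layout, so a meta-parser can simply stack them. The column convention and series names below are assumptions for illustration, not a settled project spec:

```python
import io
import pandas as pd

# Assumed common output format: CSV with columns time_index, varname, value
csv_rosstat = """time_index,varname,value
2017-01-31,EXPORT_GOODS_bln_usd,24.5"""
csv_cbr = """time_index,varname,value
2017-01-31,USDRUR,59.6"""

def read_parser_output(csv_text):
    """Read one parser's CSV output into a tidy dataframe."""
    return pd.read_csv(io.StringIO(csv_text),
                       converters={"time_index": pd.to_datetime})

# A meta-parser stacks compatible outputs into one dataframe
combined = pd.concat([read_parser_output(csv_rosstat),
                      read_parser_output(csv_cbr)],
                     ignore_index=True)
```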
This is a parser-to-notebook solution: no database, no API. Single data source. Imagine someone took the burden of collecting the output CSVs into one dataframe for you and told you this is your reference dataset, go ahead with it. In other words, someone took care of problems #1, #5, #6 and, hopefully, #7. You deal with just one URL, but when needed you can check it at the source. This single data source may be a meta-parser and probably can also be a GitHub repo. There is still no single database and no API, but this:
Still not convinced where exactly an interim database fits (storing parsing inputs?), but so far @Rotzke says we need one, so I take it for granted. A kind of little roadmap to keep going, I think, is the following:
From this skeleton we can quickly build a common database, a database API and much other magnificent stuff (even an interim database), as well as add more parsers. Hope someone still wants to do this (this way). ;)
After the 20.06.17 video chat, brief notes:
todo to follow!
Our project is about aggregating data from individual parsers under a common namespace and releasing the data through a final API (correct me if something is missing):

- each parser writes its output CSVs to the data/processed folder
- data/processed CSVs from several parsers are collected in a common database
- the final API releases the data (the simplest call returns an aggregate CSV with variables from different parsers)
- … designated package)
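One way to sketch the "simplest call" is a tiny WSGI app that serves the aggregate CSV; the route, the column layout and the toy numbers are all assumptions, not a settled design:

```python
import pandas as pd

def load_aggregate():
    """Stand-in for the common database: here, a toy in-memory frame
    with variables from two hypothetical parsers."""
    return pd.DataFrame({
        "time_index": ["2017-01-31", "2017-01-31"],
        "varname": ["EXPORT_GOODS_bln_usd", "USDRUR"],
        "value": [24.5, 59.6],
    })

def api_app(environ, start_response):
    """Minimal WSGI app: GET /api/v1/frame returns the aggregate CSV
    with variables from different parsers."""
    if environ.get("PATH_INFO") == "/api/v1/frame":
        body = load_aggregate().to_csv(index=False).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/csv")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

This could be served with the stdlib `wsgiref.simple_server` for a demo, or swapped for Flask/Django once the real API design is agreed.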
Comments welcome.