
Data pipeline #8

epogrebnyak opened this issue Jun 18, 2017 · 12 comments

epogrebnyak commented Jun 18, 2017

Our project is about aggregating data from individual parsers under a common namespace and releasing the data through a final API (correct me if something is missing):

  1. Raw data source - a file or a web page at Rosstat or another agency.
  2. A parser reads the source and writes the resulting CSVs to its data/processed folder.
  3. Some automated and human-eye validation is done during/after parsing.
  4. A 'daemon' collects the data/processed CSVs from several parsers into a common database.
  5. The user queries the database through an API to get aggregated data from the parsers
    (the simplest call returns an aggregate CSV with variables from different parsers).
  6. The user reads the data from a URL or stores it as a local file (using standard code in R/pandas or
    a designated package).
  7. The user writes code in a Jupyter notebook with R or pandas using the data.

Comments welcome.

@neotheicebird

Maybe release a Python and/or R API client for users to access the API easily from Jupyter notebooks?


Rotzke commented Jun 18, 2017

@neotheicebird Yeah, we discussed it at the meeting - using AWS API.

@neotheicebird

@Rotzke awesome! Just to keep us on the same page, I mean a Python/R client-side library apart from the web API development. Thanks


Rotzke commented Jun 18, 2017

@neotheicebird Up to you, good sir :) Created a new issue on teams.

@epogrebnyak

@neotheicebird - at least some standard code to access the data will be very useful. In pandas we have something like:

import pandas as pd  # url_m is a stable URL pointing to the monthly CSV

dfm = pd.read_csv(url_m, converters={'time_index': pd.to_datetime},
                  index_col='time_index')

This works to read monthly data from a stable URL, but it is slow to query the internet every time we run the program, so we may want some class to load/update the data, similar to the one below (from here):

class LocalDataset:
    """Time series loaded from a local cache, with fallback to the remote source."""

    def __init__(self, _id):
        self._id = _id
        try:
            # read the cached copy from disk
            self.ts = get_local_data_as_series(_id)
        except Exception:
            print("Cannot load from local file for id: " + str(_id))
            self.update()

    def update(self):
        # query the remote source and refresh the local cache
        self.ts = get_data_as_series(self._id)
        save_local_data(self._id, self.ts)
        return self

Maybe this can be a client/small library/PyPI package, but for now we can just provide some preferred code to download and manipulate the data. Updated the pipeline to indicate this.
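For illustration, a minimal usage sketch of the class above (the series id "CPI_rog" is made up, and the helper functions come from the linked code, so this is not a fixed API):

```python
ds = LocalDataset("CPI_rog")   # hypothetical series id
print(ds.ts.tail())            # pandas Series indexed by time
ds.update()                    # force a refresh from the remote source
```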

@neotheicebird
Copy link
Collaborator

Awesome, didn't know about pd.read_csv accepting a URL argument.

@neotheicebird

@epogrebnyak the code example and a simple PyPI package to access the API sound good


Rotzke commented Jun 18, 2017

@neotheicebird @epogrebnyak Guys, we have Slack for chatting! :)


epogrebnyak commented Jun 18, 2017

Based on a discussion with @Rotzke, here is the updated pipeline:

  1. Scrapers collect source files/web pages from Rosstat or other agencies into a document file store / database (Mongo, a NoSQL database). This is raw immutable data.
  2. Parsers take raw files from the file store / database and produce processed files (CSV or other).
  3. Automated and human-eye validators check the parsing result.
  4. An aggregator daemon collects data from the processed files of different parsers into a production database (see the sketch after this list).
  5. A web frontend gives end users a database overview and access instructions.
  6. End users apply standard code or packages in R/pandas to access the production database (they may also cache data to a local file):
  • by reading downloadable CSV dumps at a stable URL
  • by an end-user API
  7. The user writes code in a Jupyter notebook with R or pandas using the data.
  8. Visualization and data analysis by the end user impress everyone.
  9. ...
  10. Profit! ;-)
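A minimal sketch of what the aggregator daemon in step 4 might do, assuming each parser publishes CSVs with time_index/varname/value columns under a data/processed folder (the folder layout, column names and SQLite target are assumptions for illustration, not a spec):

```python
import glob
import sqlite3

import pandas as pd

# Assumed layout: parsers/<name>/data/processed/*.csv with columns time_index, varname, value.
frames = [pd.read_csv(path, parse_dates=['time_index'])
          for path in glob.glob('parsers/*/data/processed/*.csv')]
dataset = pd.concat(frames, ignore_index=True)

# Upload to a production database (SQLite used here only to keep the sketch self-contained).
with sqlite3.connect('production.sqlite') as conn:
    dataset.to_sql('timeseries', conn, if_exists='replace', index=False)
```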


epogrebnyak commented Jun 19, 2017

Some more detail on the pipeline, based on mini-kep:

  • marks something a parser does

Raw data:

  • store archive publications at AWS S3 in zip/rar (now a local folder)
  • download newest publications from Rosstat web site (now a local folder)
  • unzip/unrar Word files (now done manually)
  • convert Word to interim CSV (Windows only with MS Word installed)

Parsing:

  • parse interim CSV using parsing definitions to get pandas dataframes
  • save dataframes as CSV in the processed folder (canonical dataset)
  • save the latest dataset at a stable URL

Transformation:

  • seasonal adjustment
  • diff transformations (see the sketch below)
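A minimal sketch of a diff transformation on a monthly series, assuming data is loaded as a pandas Series indexed by time (the series here is synthetic, just to keep the example runnable):

```python
import pandas as pd

# Synthetic monthly series; in the pipeline this would come from the canonical CSV.
idx = pd.date_range('2015-01-31', periods=24, freq='M')
ts = pd.Series(range(100, 124), index=idx, name='EXPORT_GOODS_TOTAL')

yoy = ts.pct_change(12) * 100   # year-on-year growth, %
mom = ts.diff()                 # month-on-month absolute change
print(yoy.tail())
print(mom.tail())
```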

Frontend:

  • obtain variable descriptions and grouping
  • plot graphs (sparklines/regular graphs)
  • generate frontend as a README.md file
  • generate HTML access

End-user:

  • use common code to access data in the processed folder
  • plot visualisations and add comments
  • in plain code files
  • in notebooks
  • build forecasting models
  • generate reports (PDF/presentations)

This is to discuss the role of an interim database.

@epogrebnyak

My thoughts are still about a minimum working example (MWE) for several parsers that can produce compatible output, and a pipeline that allows them to work together. Here is an example of this kind.

End user case - MWE

The end user wants to calculate Russian monthly non-oil exports and see this figure in roubles. This is a somewhat simplistic task, but it is still a step beyond everyday Excel calculations: we need something that pulls data from different sources.

The formula will be:

```
EXPORT_EX_OIL = FX_USD * (EXPORT_GOODS_TOTAL - NAT_EXPORT_OIL_t * PRICE_BRENT * CONVERSION_FACTOR)

EXPORT_EX_OIL      - non-oil export, rub
FX_USD             - exchange rate, rub/usd
EXPORT_GOODS_TOTAL - total goods export
NAT_EXPORT_OIL_t   - oil export volume, mln t
PRICE_BRENT        - oil price, usd/barrel
CONVERSION_FACTOR  - about 6.3 b/t
```

The sources are:

  • Rosstat (EXPORT_GOODS_TOTAL, FX_USD)
  • EIA (BRENT)
  • Bank of Russia (EXPORT_OIL, FX_USD)

Implementation - MWE and extensions

Multiple data sources. Imagine you have working parsers for Rosstat, EIA and Bank of Russia publications. Each parser produces some output CSV files in its data/processed folder. To complete the task, the end user queries the URLs of the data/processed folders with pd.read_csv and merges the dataframes; the rest is calculation on the dataframes.
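A minimal sketch of this parser-to-notebook flow, assuming each parser exposes a monthly CSV with a time_index column at a stable URL (the URLs and column sets are invented for illustration; CONVERSION_FACTOR is the value quoted in the formula above):

```python
import pandas as pd

# Hypothetical stable URLs of each parser's data/processed output.
URLS = ['https://example.com/rosstat/data/processed/monthly.csv',
        'https://example.com/eia/data/processed/monthly.csv',
        'https://example.com/cbr/data/processed/monthly.csv']

def read_monthly(url):
    """Read one parser's monthly CSV, indexed by time."""
    return pd.read_csv(url, converters={'time_index': pd.to_datetime},
                       index_col='time_index')

# Merge the three sources on the time index (assumes no overlapping column names).
df = pd.concat([read_monthly(url) for url in URLS], axis=1)

# Non-oil export in roubles, following the formula above.
CONVERSION_FACTOR = 6.3   # barrels per tonne, as quoted above
df['EXPORT_EX_OIL'] = df['FX_USD'] * (
    df['EXPORT_GOODS_TOTAL'] - df['NAT_EXPORT_OIL_t'] * df['PRICE_BRENT'] * CONVERSION_FACTOR)
```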

For this to work well:

  1. variables in different output CSVs follow the same naming convention (relatively easy)
  2. output CSVs must have the same format (easy)
  3. output CSVs are found at stable URLs (easy)
  4. there is common code that simplifies reading output CSVs from URLs (easy)
  5. someone tells you there are no conflicts between CSVs (remember FX_USD appears in two sources) - easy with few variables; see the check sketched after this list
  6. CSVs were updated at the same time (easy the first time, more challenging the next time you come back)
  7. before everything else, each parser works right
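A minimal sketch of the conflict check in item 5, assuming each parser's output is loaded as a DataFrame (the parser names and columns below are illustrative):

```python
from collections import Counter

import pandas as pd

def check_no_conflicts(frames):
    """frames: dict mapping parser name -> DataFrame of its output CSV.
    Raise if the same variable is published by more than one parser."""
    counts = Counter(col for df in frames.values() for col in df.columns)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError("Variable name conflict between parsers: %s" % duplicates)

# Tiny in-memory example: FX_USD deliberately appears twice, so this raises ValueError.
frames = {'rosstat': pd.DataFrame(columns=['EXPORT_GOODS_TOTAL', 'FX_USD']),
          'cbr': pd.DataFrame(columns=['NAT_EXPORT_OIL_t', 'FX_USD'])}
check_no_conflicts(frames)
```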

This is a parser-to-notebook solution, no database, no API.

Single data source. Imagine someone took on the burden of collecting the output CSVs into one dataframe for you and told you: this is your reference dataset, go ahead with it. In other words, someone took care of problems #1, #5, #6 and, hopefully, #7. You deal with just one URL, but when needed you can check it at the source. This single data source may be a meta-parser and can probably also be a GitHub repo.

There is still no single database and no API, but this:

  • gives data to test a single database
  • gives a framework to develop more parsers and validation procedures for parsing results.

Still not convinced where exactly an interim database fits (storing parsing inputs?), but so far @Rotzke says we need one, so I take it for granted.

A kind of little roadmap to keep going, I think, is the following:

  1. present some existing parsers to the team
  2. unify the parser output format in data/processed
  3. make a dummy version of a 'meta-parser' that merges data/processed outputs into a single CSV
  4. see what kind of information is needed at the user front-end (e.g. variable descriptions, latest values, graphs, etc.)
  5. provide common end-user access code to a single CSV set at a stable URL (in pandas/R)
  6. make a demo Jupyter notebook with calculations using the dataset

From this skeleton we can quickly build a common database, a database API and much other magnificent stuff (even an interim database), as well as add more parsers.

Hope someone still wants to do this (this way). ;)

@epogrebnyak

Brief notes after the 20.06.17 video chat.
Our pipeline to work with the data is the following:

  • source (a URL, a file, a group of files, an API request)
  • a parser script downloads the source, extracts raw data, parses it, transforms it to canonical form (CSV or JSON) and stores the output in the data/processed folder. There is an individual parser for each source. Variable names do not overlap between sources.
  • an aggregator script collects outputs from the parsers, validates them (e.g. checks variable names, but the check will become more extensive as the number of sources and variables grows) and uploads them to the database
  • the user queries the database API to get individual time series or parts of the dataset. Variables have a common naming pattern. Several users can query the API at once.
  • the user makes useful plots or models using the data from the API in Jupyter notebooks (or otherwise)
  • there is a web frontend for the database with variable names, usage statistics and API guidelines

todo to follow!
