Evgeny Pogrebnyak edited this page Aug 10, 2015 · 3 revisions

Main task:

Extract time-series data from tables published as MS Word files, then store and retrieve this data as dataframe-like csv and xls(x) files.

  • By dataframe-like we mean that data is organised by columns, with a variable label in each column header and time in the row labels. This is the standard representation of a dataframe in R.

  • Data from MS Word files requires some reshaping, as tables may contain combinations of annual, quarterly and monthly data.

  • Table headers are associated with variables through header-to-variable label dictionaries. This is a user-defined specification, stored in a yaml-readable file.
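
The header-to-variable association can be sketched as follows. This is only an illustration under assumptions: the dictionary contents, the raw csv layout, the function name label_rows and the numbers are all made up, and a plain Python dict stands in for the specification that the project keeps in a yaml file.

```python
import csv
import io

# Hypothetical header-to-label dictionary (assumption: in the project
# this mapping would be read from the user-defined yaml specification).
HEADER_TO_LABEL = {
    "Gross domestic product": "GDP",
    "Consumer price index": "CPI",
}

# Made-up raw csv: a table header row followed by its data rows.
RAW_CSV = """Gross domestic product
1999,100
2000,110
Consumer price index
1999,100
2000,120
"""


def label_rows(raw_text, header_to_label):
    """Yield (label, row) pairs.

    Each data row receives the label of the most recent header row
    found in the dictionary. Rows that appear before any known header
    are skipped.
    """
    current_label = None
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:
            continue
        first_cell = row[0].strip()
        if first_cell in header_to_label:
            current_label = header_to_label[first_cell]
        elif current_label is not None:
            yield current_label, row


labelled = list(label_rows(RAW_CSV, HEADER_TO_LABEL))
for label, row in labelled:
    print(label, row)
```

Keeping the mapping outside the code means a new table can be supported by editing the specification only.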

Desired outcomes:

  1. Nicely formatted xls(x) files
  2. Long history of time series (before 1999)
  3. Quick updates upon publication of new data
  4. High-level checks of data validity
    • does a value equal the value from another source or a hardcoded value?
    • does a sum of values make sense?
  5. Navigable list of variables, variables arranged by group
  6. Each variable described
  7. Possibly less meaningful variables omitted
  8. Importable data
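
The high-level validity checks in item 4 could look roughly like the sketch below. The function names, tolerances and numbers are assumptions made for illustration, not part of the project.

```python
import math


def matches_reference(value, reference, rel_tol=1e-6):
    """Check a value against a hardcoded value or one from another source."""
    return math.isclose(value, reference, rel_tol=rel_tol)


def sum_is_consistent(parts, total, rel_tol=0.01):
    """Check that the parts add up to the total, within a relative tolerance.

    The tolerance allows for rounding in the published tables
    (the 1% default is an arbitrary assumption).
    """
    return math.isclose(sum(parts), total, rel_tol=rel_tol)


# Made-up numbers: four quarterly values should add up to the annual value.
quarterly = [25.0, 25.0, 25.0, 25.0]
annual = 100.0
print(matches_reference(annual, 100.0))
print(sum_is_consistent(quarterly, annual))
```

A sum check like this only fits flow variables; index or rate variables would need a different rule.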

Not todo:

  • FRED/quandl style API https://research.stlouisfed.org/fred2/data/CPIAUCSL.txt
  • Data from sources other than Rosstat STEI
  • NY Fed style charts... in PDF
  • API to generate xls(x)/csv from database: vars=[list of vars], start=[date], end=[date], freq = "aqm", description_sheet = True,

Workflow:

8	data usage examples (plotting, forecasting)
	  ^
	  |
7	suggested statements for reading csv/xls(x) in pandas / R
	  ^
	  |
6	csv/xls(x)
	  ^
	  |
5	pandas dataframes
	  ^
	  |
4	database
	  ^
	  |
3	stream (as date-label-value)
	  ^
	  |
2	labelled csv
	  ^
	  |   <- 2* label specification in yaml
	  |
1	raw csv
	  ^
	  |
0	doc file(s)
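
The step from labelled csv (stage 2) to a date-label-value stream (stage 3) can be sketched as a row splitter. The row layout assumed here (year, one annual value, four quarterly values, twelve monthly values) and the record format are assumptions for illustration; real tables may be laid out differently and would need their own splitter.

```python
def split_row(label, row):
    """Split one labelled row into stream records.

    Assumed row layout: [year, annual, q1..q4, m1..m12], with empty
    strings for missing values. Each record is a tuple
    (freq, year, period, label, value), where freq is "a", "q" or "m"
    and period is 0 for annual data.
    """
    year = int(row[0])
    values = [float(x) if x.strip() else None for x in row[1:]]
    annual, quarterly, monthly = values[0:1], values[1:5], values[5:17]

    records = []
    for freq, chunk in (("a", annual), ("q", quarterly), ("m", monthly)):
        # Annual data has a single value, reported as period 0.
        start = 0 if freq == "a" else 1
        for period, value in enumerate(chunk, start=start):
            if value is not None:
                records.append((freq, year, period, label, value))
    return records


# Made-up row: annual and quarterly values present, monthly values missing.
row = ["2014", "100", "25", "25", "25", "25"] + [""] * 12
records = split_row("GDP", row)
for record in records:
    print(record)
```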

Testing by workflow stage (with priority):

8 no tests
7 no tests
6 no tests
5 database not empty + no same key rows
4 no tests
3 test row splitter functions + add new splitter
2 test behaviour on row generators
2* write vars as sample yaml + read other sample yaml + read as OrderedDict
1 simple table + breaks + special characters + double space (low)
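
The database test ("database not empty + no same key rows") could be sketched with sqlite3 as below. The table name stream, its columns and the choice of (freq, year, period, label) as the key are assumptions made for illustration; the project's actual schema may differ.

```python
import sqlite3


def make_db(records):
    """Create an in-memory database holding the stream records."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE stream "
        "(freq TEXT, year INTEGER, period INTEGER, label TEXT, value REAL)"
    )
    con.executemany("INSERT INTO stream VALUES (?, ?, ?, ?, ?)", records)
    return con


def is_not_empty(con):
    """Return True if the stream table has at least one row."""
    return con.execute("SELECT COUNT(*) FROM stream").fetchone()[0] > 0


def duplicate_keys(con):
    """Return keys that occur more than once, with their row counts."""
    sql = (
        "SELECT freq, year, period, label, COUNT(*) FROM stream "
        "GROUP BY freq, year, period, label HAVING COUNT(*) > 1"
    )
    return con.execute(sql).fetchall()


# Made-up records: the first two share the same key on purpose.
con = make_db([
    ("a", 2014, 0, "GDP", 100.0),
    ("a", 2014, 0, "GDP", 101.0),
    ("q", 2014, 1, "GDP", 25.0),
])
print(is_not_empty(con))
print(duplicate_keys(con))
```

Returning the offending keys rather than a bare True/False makes a failing test easier to trace back to the source table.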

More testing:

  • List weaknesses
  • Test on a different month
  • Inspection functions:
    doc2db.inspect.
      • show as labelled and to be imported (dump stream)
      • compare labels in stream and in spec
      • show all full labels (implemented in query.py)
    (these should be docstrings)

Questions and features:

Question:

  • working with packages
  • pdb run until exception in ipython
  • command to create packages

Feature:

  • dump headers - make headers pivotal
  • maybe: split to dictionaries variables
  • kill wrong variable
  • segment part of file to new yaml and read to database
  • labels + varnames in database
  • write list of variables to topic / alphabetic