Home

Main task:

Extract time-series data from tables published as MS Word files, then store and retrieve this data as dataframe-like csv and xls(x) files.

By dataframe-like we mean that data is organised by columns, with a variable label in each column header and time in the row labels. This is the standard representation of a dataframe in R.
Data from MS Word files requires some reshaping, as tables may contain combinations of annual, quarterly and monthly data.
Table headers are associated with variables through header-to-variable label dictionaries. This is a user-defined specification, stored in a YAML-readable file.
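
A minimal sketch of the two ideas above: looking up a variable label by table header, and splitting a mixed-frequency row. The names `SPEC`, `label_for` and `split_row`, the sample headers and the 18-cell row layout (year, annual value, 4 quarterly values, 12 monthly values) are assumptions made for illustration, not the project's actual code. In the project the dictionary would be read from the YAML specification instead of being hardcoded.

```python
# Hypothetical header-to-label dictionary; a stand-in for the YAML specification.
SPEC = {
    "Gross domestic product": "GDP",
    "Consumer price index": "CPI",
}


def label_for(header, spec):
    """Return the label of the first spec key that the table header starts with."""
    for text, label in spec.items():
        if header.strip().startswith(text):
            return label
    return None


def split_row(row):
    """Split [year, annual, 4 quarterly, 12 monthly] cells into tuples.

    Each tuple is (freq, year, period, value), where freq is "a", "q" or "m".
    The period is always 1 for annual values. The row layout is an assumption.
    """
    year = int(row[0])
    values = [float(x) for x in row[1:]]
    if len(values) != 17:
        raise ValueError("expected 1 + 4 + 12 values, got %d" % len(values))
    out = [("a", year, 1, values[0])]
    out += [("q", year, q, v) for q, v in enumerate(values[1:5], start=1)]
    out += [("m", year, m, v) for m, v in enumerate(values[5:17], start=1)]
    return out
```

Tables with a different layout (for example, annual and quarterly values only) would need their own splitter function.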

Desired outcomes:

Nicely formatted xls(x) files
Long history of time series (before 1999)
Quick updates upon publication of new data
High-level checks of data validity
- does a value equal the value from another source or a hardcoded value?
- does a sum of values make sense?
Navigable list of variables, variables arranged by group
Each variable described
Possibly less meaningful variables omitted
Importable data
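
The high-level validity checks listed above could be sketched roughly as follows. The function names and tolerances are assumptions chosen for illustration. A tolerance is used because published figures are rounded, so components rarely add up to the total exactly.

```python
import math


def equals_reference(value, reference, rel_tol=1e-6):
    """Check a value against another source or a hardcoded value."""
    return math.isclose(value, reference, rel_tol=rel_tol)


def parts_sum_to_total(parts, total, abs_tol=0.5):
    """Check that components add up to the reported total, allowing for rounding."""
    return abs(sum(parts) - total) <= abs_tol
```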

Not todo:

FRED/quandl style API https://research.stlouisfed.org/fred2/data/CPIAUCSL.txt
Data from sources other than Rosstat STEI
NY Fed style charts... in PDF
API to generate xls(x)/csv from database: vars=[list of vars], start=[date], end=[date], freq="aqm", description_sheet=True

Workflow:

8	data usage examples (plotting, forecasting)
	  ^
	  |
7	suggested statements for reading csv/xls(x) in pandas / R
	  ^
	  |
6	csv/xls(x)
	  ^
	  |
5	pandas dataframes
	  ^
	  |
4	database
	  ^
	  |
3	stream (as date-label-value)
	  ^
	  |
2	labelled csv
	  ^
	  |   <- 2* label specification in yaml
	  |
1	raw csv
	  ^
	  |
0	doc file(s)
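
As a rough illustration of the path from stage 3 (stream) to stage 6 (csv), the sketch below pivots date-label-value tuples into a dataframe-like CSV text using only the standard library. The function name `stream_to_csv`, the tuple layout and the ISO date strings are assumptions. The project itself goes through a database and pandas dataframes, which this sketch skips.

```python
import csv
import io


def stream_to_csv(stream):
    """Pivot (date, label, value) tuples into CSV text.

    Output has one row per date and one column per label. Missing values are
    left empty. Dates are assumed to be ISO strings, so they sort correctly.
    """
    labels = sorted({label for _, label, _ in stream})
    table = {}
    for date, label, value in stream:
        table.setdefault(date, {})[label] = value
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["date"] + labels)
    for date in sorted(table):
        writer.writerow([date] + [table[date].get(label, "") for label in labels])
    return buf.getvalue()
```

The result has variable labels in the column header and time in the first column, which is the dataframe-like layout described under the main task.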

Testing by workflow stage (with priority):

8 no tests
7 no tests
6 no tests
5 database not empty + no same key rows
4 no tests
3 test row splitter functions + add new splitter
2 test behaviour on row generators
2* write vars as sample yaml + read other sample yaml + read as OrderedDict
1 simple table + breaks + special characters + double space (low)
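
The database check listed above ("database not empty + no same key rows") might look like the sketch below. It uses an in-memory SQLite database with a hypothetical `data(date, label, value)` table. The schema and the function names are assumptions, not the project's actual database layout.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database holding (date, label, value) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (date TEXT, label TEXT, value REAL)")
    conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
    return conn


def is_empty(conn):
    """Return True if the data table has no rows."""
    return conn.execute("SELECT COUNT(*) FROM data").fetchone()[0] == 0


def duplicate_keys(conn):
    """Return (date, label) pairs that occur in more than one row."""
    query = (
        "SELECT date, label FROM data "
        "GROUP BY date, label HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()
```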

More testing:

List weaknesses
Test on a different month
Inspection functions:
doc2db.inspect.
- show as labelled and to be imported (dump stream)
- compare labels in stream and in spec
- show all full labels (implemented in query.py)
(these should be docstrings)
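
The "compare labels in stream and in spec" inspection could be sketched as a set comparison. The function name `compare_labels`, the report keys and the sample labels are hypothetical. The stream is assumed to hold (date, label, value) tuples and the spec to map header text to labels.

```python
def compare_labels(stream, spec):
    """Report labels that appear only in the spec or only in the stream."""
    stream_labels = {label for _, label, _ in stream}
    spec_labels = set(spec.values())
    return {
        "in_spec_only": sorted(spec_labels - stream_labels),
        "in_stream_only": sorted(stream_labels - spec_labels),
    }
```

A non-empty "in_spec_only" list suggests a header that was not found in the source tables. A non-empty "in_stream_only" list suggests a label missing from the specification.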

Questions and features:

Question:

working with packages
pdb run until exception in ipython
command to create packages

Feature:

dump headers - make headers pivotal
maybe: split to dictionaries variables
kill wrong variable
segment part of file to new yaml and read to database
labels + varnames in database
write list of variables to topic / alphabetic
