Home
Extract time-series data from tables published as MS Word files, then store and retrieve this data as dataframe-like csv and xls(x) files.
-
By dataframe-like we mean that data is organised by columns, with a variable label in each column header and time in the row labels. This is the standard representation of a data frame in R.
-
Data from MS Word files requires some reshaping, as tables may contain combinations of annual, quarterly and monthly data.
-
Table headers are associated with variables through header-to-variable label dictionaries. This is a user-defined specification, stored in a YAML file.
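A minimal sketch of how a header-to-variable label dictionary could be applied. The header texts, the labels and the `label_for` helper are hypothetical, and a plain Python dict stands in for the YAML specification so that the example needs no extra packages.

```python
# Hypothetical specification: table header text -> variable label.
# In the project this mapping is user-defined and stored in a YAML file.
HEADER_TO_LABEL = {
    "Gross domestic product": "GDP",
    "Industrial production index": "IND_PROD",
}


def label_for(header, spec=HEADER_TO_LABEL):
    """Return the label whose header text starts the given header, or None."""
    cleaned = " ".join(header.split())  # collapse repeated whitespace
    for header_text, label in spec.items():
        if cleaned.startswith(header_text):
            return label
    return None
```

Matching on the start of the header is only one possible rule; an exact match or a regular expression would fit the same dictionary.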
- Nicely formatted xls(x) files
- Long history of time series (before 1999)
- Quick updates upon publication of new data
- High-level checks of data validity
- does a value equal the value from another source or a hardcoded value?
- does a sum of values make sense?
- Navigable list of variables, variables arranged by group
- Each variable described
- Possibly less meaningful variables omitted
- Importable data
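The two high-level validity checks in the list above could be sketched as follows. The function names and the tolerances are assumptions for illustration, not part of the project.

```python
import math


def matches_reference(value, reference, rel_tol=1e-6):
    """Check a value against another source or a hardcoded value."""
    return math.isclose(value, reference, rel_tol=rel_tol)


def sum_is_consistent(parts, total, rel_tol=0.01):
    """Check that parts (e.g. four quarters) add up to a total (e.g. the year).

    The 1% tolerance is an arbitrary allowance for rounding in published tables.
    """
    return math.isclose(sum(parts), total, rel_tol=rel_tol)
```

The sum check only makes sense for variables that are additive over time, such as flows; it should not be applied to indices or rates.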
Not todo:
- FRED/quandl style API https://research.stlouisfed.org/fred2/data/CPIAUCSL.txt
- Data from sources other than Rosstat STEI
- NY Fed style charts... in PDF
- API to generate xls(x)/csv from database:
vars=[list of vars], start=[date], end=[date], freq = "aqm", description_sheet = True,
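As a rough illustration of the parameters listed above, the sketch below filters date-label-value records held in memory. The record layout `(date, freq, label, value)` with ISO date strings, the `select` name and the sample values are assumptions; `varnames` is used in place of `vars` to avoid shadowing a Python builtin, and `description_sheet` is left out.

```python
def select(records, varnames, start, end, freq="aqm"):
    """Return records for the given variables, date range and frequencies.

    records: iterable of (date, freq, label, value) tuples, where date is an
    ISO 'YYYY-MM-DD' string and freq is one of 'a', 'q', 'm'.
    """
    wanted = set(varnames)
    # ISO date strings sort chronologically, so plain comparison is enough.
    return [r for r in records
            if r[2] in wanted and r[1] in freq and start <= r[0] <= end]


# Made-up sample records, for illustration only.
SAMPLE = [
    ("2013-12-31", "a", "GDP", 1.0),
    ("2014-03-31", "q", "GDP", 2.0),
    ("2014-12-31", "a", "GDP", 3.0),
    ("2014-12-31", "a", "CPI", 4.0),
]
annual_gdp = select(SAMPLE, ["GDP"], "2014-01-01", "2014-12-31", freq="a")
```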
8 data usage examples (plotting, forecasting)
^
|
7 suggested statements for reading csv/xls(x) in pandas / r
^
|
6 csv/xls(x)
^
|
5 pandas dataframes
^
|
4 database
^
|
3 stream (as date-label-value)
^
|
2 labelled csv
^
| <- 2* label specification in yaml
|
1 raw csv
^
|
0 doc file(s)
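The step from labelled csv (2) to stream (3) is where mixed annual, quarterly and monthly rows get reshaped. Below is a minimal sketch of one row splitter, assuming a row laid out as year, annual value, four quarterly values and twelve monthly values, with empty strings for missing cells. The layout, the `split_row` name and the end-of-period dating are assumptions; the actual tables may use several layouts, each needing its own splitter.

```python
import calendar
from datetime import date


def end_of_month(year, month):
    """Last calendar day of the given month."""
    return date(year, month, calendar.monthrange(year, month)[1])


def split_row(row, label):
    """Yield (date, freq, label, value) tuples from one labelled csv row.

    Assumed layout: [year, annual, Q1..Q4, M1..M12], all cells as strings.
    """
    if len(row) != 18:
        raise ValueError("expected 18 cells, got %d" % len(row))
    year = int(row[0])
    values = [float(cell) if cell != "" else None for cell in row[1:]]
    annual, quarters, months = values[0], values[1:5], values[5:17]
    if annual is not None:
        yield end_of_month(year, 12), "a", label, annual
    for q, value in enumerate(quarters, start=1):
        if value is not None:
            yield end_of_month(year, 3 * q), "q", label, value
    for m, value in enumerate(months, start=1):
        if value is not None:
            yield end_of_month(year, m), "m", label, value
```

Keeping the frequency next to the date lets annual, quarterly and monthly values for the same variable share one table in the database.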
8 no tests
7 no tests
6 no tests
5 database not empty + no same key rows
4 no tests
3 test row splitter functions + add new splitter
2 test behaviour on row generators
2* write vars as sample yaml + read other sample yaml + read as OrderedDict
1 simple table + breaks + special characters + double space (low)
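The two database checks named above ("database not empty" and "no same key rows") could look like the sketch below. It uses an in-memory SQLite database; the table name `data`, its columns and the choice of `(date, freq, label)` as the key are assumptions about the schema.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory database with a hypothetical 'data' table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE data (date TEXT, freq TEXT, label TEXT, value REAL)")
    con.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    return con


def count_rows(con):
    """Number of rows in the table; zero means the database is empty."""
    return con.execute("SELECT COUNT(*) FROM data").fetchone()[0]


def duplicate_keys(con):
    """Return (date, freq, label, count) for every key stored more than once."""
    query = """SELECT date, freq, label, COUNT(*) FROM data
               GROUP BY date, freq, label HAVING COUNT(*) > 1"""
    return con.execute(query).fetchall()
```

A test would then assert `count_rows(con) > 0` and `duplicate_keys(con) == []`. Declaring the key as a primary key in the table definition would instead reject duplicates at insert time.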
More testing:
- List weaknesses
- Test on a different month
- Inspection functions:
doc2db.inspect.
- show as labelled and to be imported (dump stream)
- compare labels in stream and in spec
- show all full labels (implemented in query.py)
(these should be docstrings)
Question:
- working with packages
- pdb run until exception in ipython
- command to create packages
Feature:
- dump headers - make headers pivotal
- maybe: split to dictionaries variables
- kill wrong variable
- segment part of file to new yaml and read to database
- labels + varnames in database
- write list of variables to topic / alphabetic