Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Pune 2017 18 Budget data processing, flat tables and hierarchy
Working on 2017-18 Budget Data
Author: Nikhil VJ
Date : during first half of August 2017 and first week and last week of September 2017
Screenshot of a table from Budget excel published by PMC:
To bring Pune Municipal Corporation's 2017-18 Budget book data to a stage where it can be read by a program and be open for visualizations, data analysis etc.
The Pune budget document has a multi-hierarchical nature. To get it
to a stage where it can be read by a program and be open for visualizations, data analysis etc, it needs two things : a flattened table to load into a db, and a separate parent-child codes sheet to express the complex hierarchies of the budget book that get lost when we flatten the tables.
Part 1 : Analysis
Listing the tables in the 2017-18 PMC Budget Book
How are table-wise budget amounts controlled
The A_Revenue_Expenditures sheet has many entries that are called transfers to other sections. It seems like the other expenditure sheets 'get their funding' from Revenue_Expenditures.
To do: trace the budget items showing transfers to other sections, verify figures.
Part 2: Flattening tables
Each sheet is to be cleaned up to make it program-readable.
Process to be done for each table:
- Unfreeze top rows/columns
- Put a cleaner header row at top, from the corresponding sheet in this file: 2017-18 budget simplified headers.ods.
- delete the original headers with the merged cells etc.
- Delete the lines with blank code cell, or which are just totalling etc.
Note from Nikhil : Not to be a spoilsport but this is easier to do in ms-excel than libre-calc, because when you set the filter to show just blanks and then select all the rows displayed and delete them, in excel it doesn't also delete all the other invisible (filtered) rows in between of the ones you have selected.
(eg: if row 10 and row 100 were blanks, if I used the filter to show only them and then shift- or drag- selected them both and deleted the rows, in libre-calc it also deletes rows 11 thru 99 whereas ms-excel leaves them untouched as desired.)
Handling strange entries:
- There are lines with code like "CE11A" that stand for total of other lines.. we can delete these lines. If not sure, then keep them. You can find these by searching for "बेरीज" (but check before deleting!)
- There may be multiple lines with code like "#+" and they probably have expenditures listed only for previous years or are for incomplete works, and you do want to include them in the db as miscellaneous entries. Assign serial numbered codes to them in this format: [prefix][MSC].
- Example : Capital Expenditures : CEMSC001 (CE code, MSC subheader, 001 serial)
(you'll get the prefix code from the other entries in that table.)
- Example : Capital Expenditures : CEMSC001 (CE code, MSC subheader, 001 serial)
We would now have the budget data in a flat tabular structure. What's more, headers are standardised across tables, so for example 2017-18 budgeted expenditure has the same column name in all sheets.
Screenshot of same table, cleaned up:
File prepared :: 2017-18 SC flat tables-working.ods
Note: This is not final file; scroll to bottom for downloading final work.
Part 3: Expressing the Budget document's hierarchy
The Pune budget document has a multi-hierarchical nature. See this screenshot to see an example of the hierarchy of one particular line item.With flattening of the tables in the budget doc, some vital information gets lost. See this screenshot for example:
Info in selected cells is lost when the table is flattened; and without this info the line items by themselves won't make sense.
Plus, in the expanded lists, budget codes no longer implicitly reflect the right hierarchy, possibly due to re-assigning of departments etc. Example:
CE20C106 comes under CE20C551#
So there is a need to express this hierarchy and include these name/titles so they can be retrieved when we visualize the budget data via the flat tables.
We start with a parent-child codes table, also known as self-referencing table. Below is a sample table:
This can be interpreted by a program as:
grandparent |-parent |-child |-pet
The sections are still kept separate, as at this point trying to really establish parent-child lineage between the sheets may get too complicated.
File prepared : 2017-18 budget parent-child relationship.ods
This file captures the budget book's hierarchy through self-referencing / parent-child tables. Here's a screenshot:
Mapping budget sheets to hierarchy sheets
|budget sheet||hierarchy sheet|
As seen above, data from multiple sections comes in common code sheets. Example: both
A_Revenue_List have their hierarchy expressed in
RE. That is because
A_Revenue_List has items whose parents are in
Quick reference for RE PB etc:
- RE = Revenue (running) Expenditure
- CE = Capital Expenditure
- FE = Special Expenditure
- XE = Water Budget (Part C) Revenue Expenditure
- ZE = Water Budget (Part C) Capital Expenditure
- ADM = Ward Office Expenditure
- PB = Participatory Budget Expenditure
- RI = Revenue (running) Income
- XI = Water Budget Revenue (running) Income
Generating hierarchical JSON
Hierarchical JSON is a de facto standard for building visualizations for hierarchical data. In this project we have expressed the hierarchy in a self-referencing tabular form. It's hard to find open source visualizations that use that, but there are many json users. See this page for examples.
That said, it is much tougher for a person to CREATE this kind of data in the nested JSON form, and much easier to do in tabular form. Hence we start with the tabular form. And once made, we can convert it to JSON via a program.
Nikhil has created a script, hosted on a webpage, to do exactly this:
So, from the budget hierarchy spreadsheet that we created, each sheet was output as CSV, and then fed into the converter to generate corresponding JSON. The CSVs and JSONs are stored in the repo in the budget/hierarchy2017-18 folder.
Part 4: Documenting challenges encountered in the process
Involves both part 1 and 2 which were done in tandem.
Duplicate codes check:
Some line items with same code were found across Capital Expenditures and Capital List tables.
repetitions | code 2 | CE20C106 2 | CE20C466 2 | CE20C533 2 | CE20C536 2 | CE22A109 2 | CE22A133 2 | CE22A362
They all are zero in the main list, but are present in extended lists as sub-items under other items, with a value.
How to resolve : suffix a
_0 for the main list entries to differentiate them.
Confirmed that there are no such repeaters in the other sheets.
Why this is needed : The budget code is the primary key for this data. So, it needs to be non-NULL and unique.
Participatory Budget works
The works listed in CitizensPlan1718 and CitizensNonPlan1718 sheets are Participatory Budget works, but they are also a mixup of expanded lists of various parent items listed in other sheets. The parent items are as follows:
RE20A302/C2- RE20C302/C1- RE22C302/C4- RE14B302/C5- XE12B302/C6- XE23A302/C7- CE20A1196/C2- CE20C406/C1- CE22A547/C4- FE14A121/C5- ZE16A117/C6- ZE16B130/C7-
Example: 'RE20A302/C2-' corresponds to line item 'RE20A302#' in Revenue Expenses sheet. See screenshot.
In the parent-child relationship, independently made a 'PB' sheet, defined a header 'PB', defined these parent items with common parent 'PB', and listed all the line items in both 'CitizensPlan1718' and 'CitizensNonPlan1718' sheets under them.
Extra: Shell script used for getting parentcodes from a list of codes : take substring up till '-'.
cat pblist1 | cut -d'-' -f1 > pblist2
Amounts were in Rs. lakhs. They had to be multiplied by 1e5 (100,000) as in the rest of the budget data the values are is Rs. only.
Admin ward works
SMRY1718 lists works at Admin ward level. There are 15 admin wards in the city. Till the previous year there used to be works listed here in plan and nonplan sheets just like how the Participatory Budget is done. But this year we see only a table of totals per admin ward. Usually each of the categories (roads, water supply etc) are an expansion of a budget item in one of the main expenditure sheets. But this year, those items have
0 budget allocation. Listing them below
RE14B301/W5-# RE20A301/W2-# RE20C301/W1-# RE22C301/W4-# XE12B301/W6-# XE23A301/W7-#
So, with just 15 line items in the admin ward sheet this time, and no specific budget code, codes
ADM15 were created for the flat table. In the hierarchy doc, these were put in a separate sheet named
- And amounts were changed to be in Rs. instead of in Rs. lakhs.
Zapping all zeros
.. because there's a difference between no data and rs.0 budgeted. The original budget doc has a number formatting rule to keep invisible all cells having 0 value. Screenshot:
Find + Replace command works well, with
entire cells option selected. Select
all sheets to replace all the zero's at a go.
Removing special chars
- XE12B302/C6-54 : changed
2ftto avoid having double-quotes char in the data.
- At a few places, undid line wrapping in the work description (ie, made it from multi line to single line)
Part 5: Flat tables : Combining expenditures into one
After the various tables / sections / lists data has been flattened, we now have to combine them.
- Listed out the columns from all the tables.
- Optimized the column titles, translating to english, and keeping titles same for common fields (like
2017-18 SC budgeted)
- Reflected optimized titles back to the flat tables doc.
- Calculated which columns repeat in how many lists.
- Created a combined layout for the expenditure tables that includes all columns used by various lists.
- Added a "list" and a "list sr" column to enable traceback.
- Working doc for planning this: 2017-18 combined layout planning.ods
- Final combined expenditures layout (total 28 columns):
|.||Column||Features in these many lists|
|E||2017-18 SC budgeted||16|
|H||2013-14 actual exp||3|
|I||2014-15 actual exp||3|
|J||2015-16 actual exp||4|
|K||actual expenditure till 31 march 2015||1|
|L||actual expenditure till 31 march 2016||2|
|N||2016-17 budgeted: from debt||2|
|O||2017-18: from debt||3|
|P||2017-18: nonplan and other||3|
|Q||allotted till 31 march 2017: from debt||3|
|R||allotted till 31 march 2017: from nonplan||3|
|S||allotted till 31 march 2017: total||3|
|V||no. of wards||1|
- Saved the flat tables as a new file: 2017-18 SC flat tables-reordering.ods
- Shifted the columns in the various sheets so they all reflect this new layout. Added the "list" and "list sr" columns to all and populated them (second one is a serial number for that sheet).
- Copied rows from all the expenditure sheets to a new sheet "combined exp".
- Exported the combined sheet into a CSV with delimiting char set as semicolon (
;) (choosing this instead of comma (
,) because many work description cells have comma's but there is no semicolon in the data).
File prepared : 2017-18-SC-budget-exp-combined.csv
Part 6: Tracking inter-sheet allocations
While there are 7 separate expenditure sections found in the hierarchy doc (RE, CE, FE, XE, ZE, ADM, PB), they are all not getting funds independently. Here let us trace the budget items that actually stand for re-allocation of outlay to another section.
Main focus: RE (A_Revenue_Expenditure)
|code||work||2017-18 SC budgeted|
|RE61A101||59. अंदाजपत्रक 'क' कडे वर्ग||-25,33,67,047|
... yet to be completed.
- Data: 2017-18-SC-budget-exp-combined.csv
These can be used for budget data viz etc.
Next : Let's Visualize! Budget data visualizer