Pune 2017 18 Budget data processing, flat tables and hierarchy

answerquest edited this page Oct 3, 2017 · 9 revisions

Working on 2017-18 Budget Data

Author: Nikhil VJ
Date : during first half of August 2017 and first week and last week of September 2017

Screenshot of a table from Budget excel published by PMC:
RE orig

Aim

To bring Pune Municipal Corporation's 2017-18 Budget book data to a stage where it can be read by a program and be open for visualizations, data analysis etc.

Explanation

The Pune budget document has a multi-hierarchical nature. To get it to a stage where it can be read by a program and be open for visualizations, data analysis etc, it needs two things : a flattened table to load into a db, and a separate parent-child codes sheet to express the complex hierarchies of the budget book that get lost when we flatten the tables.

Part 1 : Analysis

Listing the tables in the 2017-18 PMC Budget Book

  1. A_Revenue_Income
  2. A_Revenue_Expenditure
  3. A_Revenue_List
  4. A_Capital_Expenditure
  5. A_Land_Acquisition
  6. A_Special_Fund
  7. A_DPDC_Machinery_Tree
  8. A_Capital_List
  9. C_Revenue_Income
  10. C_Revenue_Expenditure
  11. C_Revenue_List
  12. C_Capital_Expenditure
  13. C_Capital_List
  14. JNNURM
  15. SPV
  16. SMRY1718
  17. CitizensNonPlan1718
  18. CitizensPlan1718

Links

How are table-wise budget amounts controlled

The A_Revenue_Expenditures sheet has many entries that are called transfers to other sections. It seems like the other expenditure sheets 'get their funding' from Revenue_Expenditures.

To do: trace the budget items showing transfers to other sections, verify figures.

Part 2: Flattening tables

Each sheet is to be cleaned up to make it program-readable.

Process to be done for each table:

  • Unfreeze top rows/columns
  • Put a cleaner header row at top, from the corresponding sheet in this file: 2017-18 budget simplified headers.ods.
  • delete the original headers with the merged cells etc.
  • Delete the lines with blank code cell, or which are just totalling etc.

Note from Nikhil : Not to be a spoilsport but this is easier to do in ms-excel than libre-calc, because when you set the filter to show just blanks and then select all the rows displayed and delete them, in excel it doesn't also delete all the other invisible (filtered) rows in between of the ones you have selected.
(eg: if row 10 and row 100 were blanks, if I used the filter to show only them and then shift- or drag- selected them both and deleted the rows, in libre-calc it also deletes rows 11 thru 99 whereas ms-excel leaves them untouched as desired.)

Handling strange entries:

  • There are lines with code like "CE11A" that stand for total of other lines.. we can delete these lines. If not sure, then keep them. You can find these by searching for "बेरीज" (but check before deleting!)
  • There may be multiple lines with code like "#+" and they probably have expenditures listed only for previous years or are for incomplete works, and you do want to include them in the db as miscellaneous entries. Assign serial numbered codes to them in this format: [prefix][MSC][001].
    • Example : Capital Expenditures : CEMSC001 (CE code, MSC subheader, 001 serial)
      (you'll get the prefix code from the other entries in that table.)

Post cleanup

We would now have the budget data in a flat tabular structure. What's more, headers are standardised across tables, so for example 2017-18 budgeted expenditure has the same column name in all sheets.

Screenshot of same table, cleaned up:
RE cleaned

File prepared :: 2017-18 SC flat tables-working.ods
Note: This is not final file; scroll to bottom for downloading final work.

Part 3: Expressing the Budget document's hierarchy

The Pune budget document has a multi-hierarchical nature. See this screenshot to see an example of the hierarchy of one particular line item.With flattening of the tables in the budget doc, some vital information gets lost. See this screenshot for example:

budget1718 hierarchy info lost

Info in selected cells is lost when the table is flattened; and without this info the line items by themselves won't make sense.

Plus, in the expanded lists, budget codes no longer implicitly reflect the right hierarchy, possibly due to re-assigning of departments etc. Example: CE20C106 comes under CE20C551#

So there is a need to express this hierarchy and include these name/titles so they can be retrieved when we visualize the budget data via the flat tables.

We start with a parent-child codes table, also known as self-referencing table. Below is a sample table:

code parentcode name
grandparent root Granny
parent grandma Mom
child parent Kid
pet child Kitty

This can be interpreted by a program as:

grandparent
|-parent
   |-child
     |-pet

The sections are still kept separate, as at this point trying to really establish parent-child lineage between the sheets may get too complicated.

File prepared : 2017-18 budget parent-child relationship.ods

This file captures the budget book's hierarchy through self-referencing / parent-child tables. Here's a screenshot:

budget17-18 parentchild file

Mapping budget sheets to hierarchy sheets

budget sheet hierarchy sheet
A_Revenue_Income RI
A_Revenue_Expenditure RE
A_Revenue_List RE
A_Capital_Expenditure CE
A_Land_Acquisition FE
A_Special_Fund FE
A_DPDC_Machinery_Tree FE
A_Capital_List CE
C_Revenue_Income XI
C_Revenue_Expenditure XE
C_Revenue_List XE
C_Capital_Expenditure ZE
C_Capital_List ZE
JNNURM FE
SPV FE
SMRY1718 ADM
CitizensNonPlan1718 PB
CitizensPlan1718 PB

As seen above, data from multiple sections comes in common code sheets. Example: both A_Revenue_Expenditure and A_Revenue_List have their hierarchy expressed in RE. That is because A_Revenue_List has items whose parents are in A_Revenue_Expenditure.

Quick reference for RE PB etc:

  • RE = Revenue (running) Expenditure
  • CE = Capital Expenditure
  • FE = Special Expenditure
  • XE = Water Budget (Part C) Revenue Expenditure
  • ZE = Water Budget (Part C) Capital Expenditure
  • ADM = Ward Office Expenditure
  • PB = Participatory Budget Expenditure
  • RI = Revenue (running) Income
  • XI = Water Budget Revenue (running) Income

Generating hierarchical JSON

Hierarchical JSON is a de facto standard for building visualizations for hierarchical data. In this project we have expressed the hierarchy in a self-referencing tabular form. It's hard to find open source visualizations that use that, but there are many json users. See this page for examples.

That said, it is much tougher for a person to CREATE this kind of data in the nested JSON form, and much easier to do in tabular form. Hence we start with the tabular form. And once made, we can convert it to JSON via a program.

Convert tabular:
csv
to JSON:
json

Nikhil has created a script, hosted on a webpage, to do exactly this:

Self-referencing table to Hierarchical JSON converter

So, from the budget hierarchy spreadsheet that we created, each sheet was output as CSV, and then fed into the converter to generate corresponding JSON. The CSVs and JSONs are stored in the repo in the budget/hierarchy2017-18 folder.


Part 4: Documenting challenges encountered in the process

Involves both part 1 and 2 which were done in tandem.

Duplicate codes check:

Some line items with same code were found across Capital Expenditures and Capital List tables.

repetitions | code
2 | CE20C106
2 | CE20C466
2 | CE20C533
2 | CE20C536
2 | CE22A109
2 | CE22A133
2 | CE22A362

They all are zero in the main list, but are present in extended lists as sub-items under other items, with a value.
How to resolve : suffix a _0 for the main list entries to differentiate them.
Confirmed that there are no such repeaters in the other sheets.
Why this is needed : The budget code is the primary key for this data. So, it needs to be non-NULL and unique.

Participatory Budget works

The works listed in CitizensPlan1718 and CitizensNonPlan1718 sheets are Participatory Budget works, but they are also a mixup of expanded lists of various parent items listed in other sheets. The parent items are as follows:

RE20A302/C2-
RE20C302/C1-
RE22C302/C4-
RE14B302/C5-
XE12B302/C6-
XE23A302/C7-
CE20A1196/C2-
CE20C406/C1-
CE22A547/C4-
FE14A121/C5-
ZE16A117/C6-
ZE16B130/C7-

Example: 'RE20A302/C2-' corresponds to line item 'RE20A302#' in Revenue Expenses sheet. See screenshot.

budget1718 find all CE20A1196 in CE and PBB sheets.jpg

In the parent-child relationship, independently made a 'PB' sheet, defined a header 'PB', defined these parent items with common parent 'PB', and listed all the line items in both 'CitizensPlan1718' and 'CitizensNonPlan1718' sheets under them.

  • Extra: Shell script used for getting parentcodes from a list of codes : take substring up till '-'. cat pblist1 | cut -d'-' -f1 > pblist2

  • Amounts were in Rs. lakhs. They had to be multiplied by 1e5 (100,000) as in the rest of the budget data the values are is Rs. only.

Admin ward works

The sheet SMRY1718 lists works at Admin ward level. There are 15 admin wards in the city. Till the previous year there used to be works listed here in plan and nonplan sheets just like how the Participatory Budget is done. But this year we see only a table of totals per admin ward. Usually each of the categories (roads, water supply etc) are an expansion of a budget item in one of the main expenditure sheets. But this year, those items have 0 budget allocation. Listing them below

RE14B301/W5-#
RE20A301/W2-#
RE20C301/W1-#
RE22C301/W4-#
XE12B301/W6-#
XE23A301/W7-#

So, with just 15 line items in the admin ward sheet this time, and no specific budget code, codes ADM01 through ADM15 were created for the flat table. In the hierarchy doc, these were put in a separate sheet named ADM.

  • And amounts were changed to be in Rs. instead of in Rs. lakhs.

Zapping all zeros

.. because there's a difference between no data and rs.0 budgeted. The original budget doc has a number formatting rule to keep invisible all cells having 0 value. Screenshot:
budget1718 zap zeros.jpg

Find + Replace command works well, with entire cells option selected. Select all sheets to replace all the zero's at a go.

Removing special chars

  • XE12B302/C6-54 : changed 2" to 2ft to avoid having double-quotes char in the data.
  • At a few places, undid line wrapping in the work description (ie, made it from multi line to single line)

Part 5: Flat tables : Combining expenditures into one

After the various tables / sections / lists data has been flattened, we now have to combine them.

  • Listed out the columns from all the tables.
  • Optimized the column titles, translating to english, and keeping titles same for common fields (like 2017-18 SC budgeted)
  • Reflected optimized titles back to the flat tables doc.
  • Calculated which columns repeat in how many lists.
  • Created a combined layout for the expenditure tables that includes all columns used by various lists.
  • Added a "list" and a "list sr" column to enable traceback.
  • Working doc for planning this: 2017-18 combined layout planning.ods
  • Final combined expenditures layout (total 28 columns):
. Column Features in these many lists
A list --
B list sr --
C code 16
D work 16
E 2017-18 SC budgeted 16
F ward 10
G BLP 4
H 2013-14 actual exp 3
I 2014-15 actual exp 3
J 2015-16 actual exp 4
K actual expenditure till 31 march 2015 1
L actual expenditure till 31 march 2016 2
M 2016-17 budgeted 7
N 2016-17 budgeted: from debt 2
O 2017-18: from debt 3
P 2017-18: nonplan and other 3
Q allotted till 31 march 2017: from debt 3
R allotted till 31 march 2017: from nonplan 3
S allotted till 31 march 2017: total 3
T project budget 7
U revised allocation 1
V no. of wards 1
W road 3
X electrical 3
Y building 3
Z SRA 3
AA water supply 3
AB drainage 3
  • Saved the flat tables as a new file: 2017-18 SC flat tables-reordering.ods
  • Shifted the columns in the various sheets so they all reflect this new layout. Added the "list" and "list sr" columns to all and populated them (second one is a serial number for that sheet).
  • Copied rows from all the expenditure sheets to a new sheet "combined exp".
  • Exported the combined sheet into a CSV with delimiting char set as semicolon (;) (choosing this instead of comma (,) because many work description cells have comma's but there is no semicolon in the data).

File prepared : 2017-18-SC-budget-exp-combined.csv

Part 6: Tracking inter-sheet allocations

While there are 7 separate expenditure sections found in the hierarchy doc (RE, CE, FE, XE, ZE, ADM, PB), they are all not getting funds independently. Here let us trace the budget items that actually stand for re-allocation of outlay to another section.

Main focus: RE (A_Revenue_Expenditure)

code work 2017-18 SC budgeted
RE61A101 59. अंदाजपत्रक 'क' कडे वर्ग -25,33,67,047

... yet to be completed.


Final files

These can be used for budget data viz etc.

Next : Let's Visualize! Budget data visualizer

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.