# Extracting data from PovCalNet

PovCalNet is primarily designed for generating poverty and distributional statistics for custom groupings of countries and years. It does not, at present, provide an easy way to extract machine-readable data on the underlying distributions. We therefore wrote a script to extract these data.

## PovCalNet raw data

PovCalNet contains data on income distributions for TODO country-year or country-part-year combinations. Underlying each of these is distribution data in one of two formats:

- **Unit record**. A vector of raw income (or consumption) observations from an actual survey. This is the most common case, particularly for recent years in which survey availability has expanded. Sample sizes are typically several thousand, but range from under 1,000 to over 100,000. For these datasets, statistics such as the poverty headcount and the Gini index are calculated directly from the observations.
- **Grouped**. A list of points on a Lorenz curve, the exact number depending on the detail at which grouped data were available. This format is most common for older data (especially pre-1990), where original survey observations are not available. It is also used for certain countries (e.g. China) where only grouped data are made available to the PovCalNet team. For these datasets, two parametric Lorenz curve models are fitted to the points, and statistics are calculated from the better-fitting distribution.
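
To make the unit record case concrete, the following is a minimal sketch of how a headcount ratio and a Gini index can be computed directly from a vector of observations. It is an illustration rather than PovCalNet's own procedure: the function names are hypothetical, the sample values are invented, and the observations are assumed to be unweighted, whereas real surveys normally carry sampling weights.

```python
def headcount_ratio(incomes, poverty_line):
    """Share of observations strictly below the poverty line."""
    return sum(1 for y in incomes if y < poverty_line) / len(incomes)


def gini(incomes):
    """Gini index of unweighted observations, using the sorted-rank formula.

    G = 2 * sum(i * y_i) / (n * sum(y)) - (n + 1) / n, with ranks i = 1..n
    assigned after sorting the observations in ascending order.
    """
    ys = sorted(incomes)
    n = len(ys)
    total = sum(ys)
    rank_weighted = sum(i * y for i, y in enumerate(ys, start=1))
    return 2 * rank_weighted / (n * total) - (n + 1) / n


# Invented example values, not taken from any survey.
sample_incomes = [1.0, 2.0, 3.0, 4.0, 10.0]
print(round(headcount_ratio(sample_incomes, poverty_line=2.5), 3))
print(round(gini(sample_incomes), 3))
```

For grouped data the same statistics cannot be computed this way, because only points on the Lorenz curve are available; that is why a fitted parametric curve is used instead.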

## The detailed PovCalNet output

PovCalNet is designed to provide formatted output on a particular country-year query, as in Figure 2.1.

![_**Figure 2.1:** PovCalNet output for Argentina 1991_](images/povcalnet-argentina1991.png)

However, via the "Detail output" link at the right of the table, a far more detailed output can be obtained. For one of the shorter "grouped" format surveys, the first few lines look like this, with data following:

In [1]:
with open("../data/txtcache/ARG_2_1991.txt") as f:
    for line in f.readlines()[:17]:
        print(line, end="")



**********************************************************************************************
**                                   Basic Information                                      **
**********************************************************************************************

----------------- Dataset Information -----------------
                         Country: Argentina
                    Country dode: ARG
                       Data Year: 1991
                        Coverage: Urban
             Welfare measurement: Income
                     Data format: Unit record
                     Data source: ARG_U1991Y
                  Data time span: UnDefined
-------------------------------------------------------



Crucially, this detailed output contains:

- a specification of the Lorenz curve: either the original points themselves (for grouped data cases) or a gridded sample of 100 points (for unit record cases).
- the survey mean (in local currency and PPP$)

This is enough information to reconstruct the survey data distribution with usably high accuracy.
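
As a rough illustration of why a Lorenz curve plus a mean is sufficient, the sketch below treats the Lorenz curve as piecewise linear between the given points. Under that assumption, the average income in each population bin is the survey mean multiplied by the slope of the Lorenz curve over that bin. The function names and the sample points are hypothetical, and the piecewise-linear treatment is a simplification, not necessarily the method used in the modelling stage.

```python
def bin_means(lorenz_points, mean):
    """Return (population share, average income) for each Lorenz segment.

    lorenz_points: iterable of (p, L) pairs, where p is the cumulative
    population share and L the cumulative income share, ending at (1, 1).
    """
    pts = sorted(lorenz_points)
    if pts[0] != (0.0, 0.0):
        pts = [(0.0, 0.0)] + pts  # the Lorenz curve always starts at the origin
    bins = []
    for (p0, l0), (p1, l1) in zip(pts, pts[1:]):
        pop_share = p1 - p0
        income_share = l1 - l0
        # Slope of the Lorenz curve times the mean gives the bin's average income.
        bins.append((pop_share, mean * income_share / pop_share))
    return bins


def gini_from_lorenz(lorenz_points):
    """Gini index as one minus twice the trapezoidal area under the curve."""
    pts = sorted(lorenz_points)
    if pts[0] != (0.0, 0.0):
        pts = [(0.0, 0.0)] + pts
    area_x2 = sum((p1 - p0) * (l0 + l1) for (p0, l0), (p1, l1) in zip(pts, pts[1:]))
    return 1.0 - area_x2


# Invented quintile shares and mean, for illustration only.
points = [(0.2, 0.05), (0.4, 0.15), (0.6, 0.30), (0.8, 0.55), (1.0, 1.0)]
for pop_share, avg_income in bin_means(points, mean=100.0):
    print(round(pop_share, 2), round(avg_income, 2))
print(round(gini_from_lorenz(points), 3))
```

The population-weighted average of the bin incomes recovers the survey mean, which is a convenient sanity check on any reconstruction. A linear interpolation understates inequality within each bin, so a finer grid of points (such as the 100-point sample for unit record cases) gives a closer approximation.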

However, while this detailed output contains much useful information, it is not in a format that another process can ingest automatically.

## Detailed PovCalNet output: as JSON


We use a scraping (text parsing) technique to read the individual data items in the detailed output and re-output them as JSON (JavaScript Object Notation), a flexible text-based format which can be read by a variety of tools and programming libraries.
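
The idea can be sketched as follows for the "Dataset Information" block. This is not the project's actual script: the function name, the regular expression, and the `KEY_MAP` table are assumptions, with the label-to-key mapping inferred from the text and JSON excerpts shown in this section.

```python
import json
import re

# Mapping from labels in the text output to JSON keys (inferred from the
# excerpts in this section; the real script may use a different table).
KEY_MAP = {
    "Country": "country",
    "Country dode": "iso3c",  # label spelled as it appears in the text excerpt
    "Data Year": "year",
    "Coverage": "coverage",
    "Welfare measurement": "measure",
    "Data format": "format",
    "Data source": "source",
    "Data time span": "timespan",
}

# Matches "   Label: value" lines; banner and separator lines have no colon.
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.*?)\s*$")


def parse_dataset_block(text):
    """Collect recognised 'Label: value' lines into a nested dictionary."""
    dataset = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        label, value = match.groups()
        key = KEY_MAP.get(label)
        if key is None:
            continue  # ignore labels this sketch does not know about
        dataset[key] = int(value) if value.isdigit() else value
    return {"dataset": dataset}


sample = """
----------------- Dataset Information -----------------
                         Country: Argentina
                    Country dode: ARG
                       Data Year: 1991
                        Coverage: Urban
-------------------------------------------------------
"""
print(json.dumps(parse_dataset_block(sample), indent=4))
```

Converting purely numeric values to integers is a design choice in this sketch, so that the year is stored as a number rather than a string.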

The first few lines of the JSON output for the Argentina 1991 example look different, with more of the structure made explicit, but the same values can be seen:

In [2]:
with open("../data/jsoncache/ARG_2_1991.json") as f:
    for line in f.readlines()[:11]:
        print(line, end="")

{
    "dataset": {
        "source": "ARG_U1991Y",
        "timespan": "UnDefined",
        "coverage": "Urban",
        "year": 1991,
        "iso3c": "ARG",
        "format": "Unit record",
        "country": "Argentina",
        "measure": "Income"
    },


This forms the input to the modelling stage of the project.