[DATA] population statistics error #436

Open
lethabo24 opened this issue Jun 10, 2020 · 9 comments
@lethabo24

Which Dataset

The za_province_pop

Error Description

The Gauteng and NorthWest populations do not correspond to the National Statistics PDF document

Suggested fixes

@lethabo24 lethabo24 added the data label Jun 10, 2020
@shaze
Contributor

shaze commented Jun 11, 2020

Thanks -- I see digits transposed in the North West figures, but I can't see the problem in Gauteng: 15176115 corresponds to Figure 1 on page vi. Could you elaborate, please?

@lethabo24
Author

lethabo24 commented Jun 11, 2020 via email

@vukosim vukosim changed the title from "[DATA]" to "[DATA] population statistics error" Jun 11, 2020
@vukosim
Member

vukosim commented Jun 11, 2020

Thanks @18306063

@shaze we also have the statssa midyear estimates now in the staging area folder. We might want to just make a choice on where to put that, maybe

data/official_statistics/

@vukosim vukosim added this to the Repo Cleanup and Enhacements milestone Jun 11, 2020
@shaze
Contributor

shaze commented Jun 11, 2020

OK -- the one in data/district_data has been there longer, so there may be scripts dependent on it. But that is easy to change, and it is more important to have the file in the right logical place, so I have no objection to moving or replacing it.

But if we use the new file, I think it needs to be made program friendly -- if you read it in with Pandas, the columns seem to come in as text by default, and it is even harder to handle if not using Pandas.
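A quick way to see which columns Pandas treats as text is to inspect the dtypes after loading. The sketch below is only an illustration: the sample text, the `;` separator, and the helper name `text_columns` are assumptions, and the values are placeholders rather than real statistics.

```python
import io

import pandas as pd

# Made-up sample: numbers grouped with spaces and written with decimal commas.
# These values are placeholders, not figures from the actual file.
RAW = (
    "province;population;share\n"
    "Province A;1 234 567;12,5\n"
    "Province B;7 654 321;87,5\n"
)


def text_columns(csv_text: str, sep: str = ";") -> list[str]:
    """Return the names of columns that Pandas did not read as numeric."""
    df = pd.read_csv(io.StringIO(csv_text), sep=sep)
    return [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]


print(text_columns(RAW))
```

With input like this, the numeric-looking columns are reported alongside the province column, which is the symptom described above.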

@vukosim
Member

vukosim commented Jun 11, 2020

@elolelo Can you comment.

@shaze
Contributor

shaze commented Jun 11, 2020

Hi Lethabo

Thanks -- it seems that they have slightly contradictory figures in the same document. Fortunately they are only off by 1, so way below any margin of error (also, adding the provincial figures does not give the total figure, so we can't check that way to find which is correct).

The NW figure is definitely wrong. Will fix and push with today's figures in a few minutes.

@elolelo
Collaborator

elolelo commented Jun 11, 2020

So, the Gauteng value in question can be found in this file; a breakdown of that figure can be found in this one.

@elolelo
Collaborator

elolelo commented Jun 11, 2020

> @elolelo Can you comment.

I am not sure to what extent these new files are program friendly. They can be changed if necessary.

@shaze
Contributor

shaze commented Jun 11, 2020

Thanks. Ideally they must be computer-readable -- Pandas is the most flexible so readable by Pandas is essential.

  • No spaces in numbers
  • Use decimal points not commas
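Until the file itself is cleaned up, the two rules above can be applied at load time. This is a minimal sketch, not the repository's method: `to_number` is a hypothetical helper, and it assumes a comma only ever appears as a decimal separator, never as a thousands separator.

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Remove whitespace inside numbers, swap decimal commas for points, convert."""
    cleaned = (
        series.astype(str)
        .str.replace(r"\s+", "", regex=True)  # "1 234 567" -> "1234567"
        .str.replace(",", ".", regex=False)   # "12,5" -> "12.5"
    )
    # errors="raise" makes anything still non-numeric fail loudly.
    return pd.to_numeric(cleaned, errors="raise")
```

For example, `to_number(df["population"])` could replace a text column in place once the file has been read.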

Also, for the age breakdown file, I think having 5 provinces followed by 4 provinces is very difficult for a computer to follow.

Two possible formats are below. My preference would be for 1, though 2 is what we're doing in other places and may be more human friendly.

  1. Column-wise

Have columns: province, age group, male, female, total

Province is repeated
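To illustrate why the column-wise layout is easy to work with, here is a small sketch. The exact column names (`province`, `age_group`, `male`, `female`, `total`) and all values are assumptions made for the example, not the real file.

```python
import io

import pandas as pd

# Hypothetical column-wise file; the numbers are placeholders.
LONG_CSV = """province,age_group,male,female,total
Province A,0-4,10,11,21
Province A,5-9,12,13,25
Province B,0-4,20,21,41
Province B,5-9,22,23,45
"""


def province_totals(csv_text: str) -> pd.Series:
    """Sum the 'total' column per province."""
    df = pd.read_csv(io.StringIO(csv_text))
    return df.groupby("province")["total"].sum()


print(province_totals(LONG_CSV))
```

Because the province is repeated on every row, filtering and grouping need no knowledge of how the columns are laid out.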

  2. Row-wise

Using the same format that we're using for keys
Have 27 columns, 3 for each province
Eastern Cape\tMales,Eastern Cape\tFemales,Eastern Cape\tTotal,Free State\tMales,......

Note that this uses the same convention as we do for districts -- spaces separate the words in province names, and a tab separates the province name from the category. This approach is very readable in GitHub, and programs can parse it easily; using the convention of tabs separating the province name from the category means that
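As a rough illustration of how a program might consume the tab convention, the sketch below splits each header on the tab to build a two-level column index. The leading `age_group` column, the helper name `read_wide`, and the values are assumptions for the example; only one province is shown.

```python
import io

import pandas as pd

# Placeholder data; headers follow the "Province<TAB>Category" convention.
WIDE_CSV = (
    "age_group,Eastern Cape\tMales,Eastern Cape\tFemales,Eastern Cape\tTotal\n"
    "0-4,10,11,21\n"
    "5-9,12,13,25\n"
)


def read_wide(csv_text: str) -> pd.DataFrame:
    """Read the row-wise layout and split headers into (province, category)."""
    df = pd.read_csv(io.StringIO(csv_text), index_col=0)
    df.columns = pd.MultiIndex.from_tuples(
        [tuple(name.split("\t")) for name in df.columns],
        names=["province", "category"],
    )
    return df


wide = read_wide(WIDE_CSV)
print(wide["Eastern Cape"])  # all three categories for one province
```

Splitting on the tab rather than on spaces keeps multi-word province names such as Eastern Cape intact.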

Final point -- I note in several places that the total is not equal to the sum of males and females. I doubt that these figures were compiled at a time when non-binary categories were allowed, so they are likely to be errors (in the source document). It might be worth pointing this out in the README. The discrepancy is so small as to be inconsequential for any work being done.
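If it helps with documenting this in the README, the affected rows could be listed with a check along these lines. It is a sketch only: the column names `male`, `female`, and `total` are assumed, and `total_mismatches` is a hypothetical helper.

```python
import pandas as pd


def total_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where total != male + female, with the difference attached."""
    diff = df["total"] - (df["male"] + df["female"])
    bad = diff != 0
    return df.loc[bad].assign(difference=diff[bad])
```

An empty result would mean every total agrees with the sum of its parts.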

Many thanks for all this work -- it is very helpful
