Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter not working in cbs_get_data, on column SoortRegio_2 #33

Closed
victorhartman opened this issue Nov 24, 2021 · 4 comments
Closed

Filter not working in cbs_get_data, on column SoortRegio_2 #33

victorhartman opened this issue Nov 24, 2021 · 4 comments

Comments

@victorhartman
Copy link

I am only interested in data from the gemeentes from the 'Kerncijfers wijken en buurten'.

So my code looks like this:

cbs_get_data("84583NED", SoortRegio_2 = "Gemeente ", verbose = T)

(with two spaces after 'gemeente'). This does not seem to work however, all types of regions are loaded. Any idea why?

Could it possibly have something to do with the '_' in the column name?

@datadwerg
Copy link

Hello, I hope it is OK for another user to reply instead of the package creator?

In the vignette there is an example which uses has_substring() to filter. I also think it is better to filter on a dimension column and not a measure column. I think this works:

cbs_get_data(id = "84583NED", WijkenEnBuurten = has_substring("GM"))

@ayoub-abdessadak
Copy link

CODE in python:

from logging import exception
import pandas 
import cbsodata
from openpyxl import Workbook


# Filter met cbsodata api
def cbs_filter(ned, soort_regio=None, gemeente=None):
    
    # NED = bijvoorbeeld 84583NED, table wordt gedownload en omgezet naar pandas tabel (https://pandas.pydata.org/docs/reference/frame.html)
    try:
        data = pandas.DataFrame(cbsodata.get_data(ned))
    except exception as E:
        print(E)
        return 'Hmmm, er is iets fout gegaan bekijk je logs'

    #Indien er een soort_regio is doorgegeven dan wordt er gefilterd op soort_regio
    if soort_regio:
       data = data[data.SoortRegio_2.str.contains(soort_regio[0].upper() + soort_regio[1::].lower())]

    #Indien gemeente is doorgegeven dan wordt er gefilterd op gemeente
    if gemeente:
        data = data[data.Gemeentenaam_1.str.contains(gemeente[0].upper() + gemeente[1::].lower())]
    

    #Verder kunnen er nog meerdere filters worden toegevoegd aan de hand van de beschikbare kolommen

    return data


# Filter zonder cbsodata api
def cbs_filter_dt(data, soort_regio=None, gemeente=None):

    #Indien er een soort_regio is doorgegeven dan wordt er gefilterd op soort_regio
    if soort_regio:
       data = data[data.SoortRegio_2.str.contains(soort_regio[0].upper() + soort_regio[1::].lower())]

    #Indien gemeente is doorgegeven dan wordt er gefilterd op gemeente
    if gemeente:
        data = data[data.Gemeentenaam_1.str.contains(gemeente[0].upper() + gemeente[1::].lower())]

    #Verder kunnen er nog meerdere filters worden toegevoegd aan de hand van de beschikbare kolommen

    return data

Uitvoering:

>>> from cbs_data import *

>>> data = cbs_filter('84583NED', 'wijk', 'amsterdam')

>>> data
        ID          WijkenEnBuurten  ... MateVanStedelijkheid_115 Omgevingsadressendichtheid_116
921    921    Burgwallen-Oude Zijde  ...                      1.0                         7622.0
927    927  Burgwallen-Nieuwe Zijde  ...                      1.0                         9222.0
936    936      Grachtengordel-West  ...                      1.0                        10638.0
941    941      Grachtengordel-Zuid  ...                      1.0                         9212.0
949    949       Nieuwmarkt/Lastage  ...                      1.0                         6995.0
...    ...                      ...  ...                      ...                            ...
1464  1464     Bijlmer Oost (E,G,K)  ...                      1.0                         3329.0
1479  1479               Nellestein  ...                      2.0                         2021.0
1483  1483   Holendrecht/Reigersbos  ...                      1.0                         2539.0
1491  1491                     Gein  ...                      2.0                         2081.0
1496  1496                 Driemond  ...                      5.0                          453.0


[99 rows x 118 columns]

>>> data.to_excel("amsterdam_wijk.xlsx")

amsterdam_wijk.xlsx

@edwindj
Copy link
Owner

edwindj commented Nov 25, 2021

@datadwerg thx for answering the question: seems indeed the intention of the question.
Another option of course is to do post-filtering: download the data and then filter on this column (but that will take a lot more memory)

data_all <- cbs_get_data(id = "84583NED")
data_gm <- subset(data_all, grepl("GM", WijkenEnBuurten))
# or using tidyverse
library(tidyverse)
data_gm <- data_all %>% filter(str_detect(WijkenEnBuurten, "GM"))

@victorhartman
Copy link
Author

Thanks for the responses. The solution by @datadwerg answers my question and works.

Kerncijfers wijken en buurten 2020 contains ~17.000 rows and the gemeentes only take 355 of those. So although post-filtering works, it is much slower/requires more resources.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants