# Error Messages for Bad Arguments

`ced.download` has a lot of arguments and many different
ways to describe geographies. As human coders, we don't always get things
exactly right. Sometimes we forget the exact name of an argument, and
sometimes we misspell it. Other times we copy and paste a call, then change
one part of the geography but not another.

This notebook demonstrates a new set of error messages that come
up in these scenarios. We hope they will guide people to the underlying
cause and the solution as quickly as possible.

This work is inspired by [Scott Pham](http://scottpham.com/) who reported
some issues with older versions of the error messaging that were confusing
at times.

In [1]:
import censusdis.data as ced
from censusdis.datasets import ACS1
from censusdis.states import ALL_STATES_AND_DC, NJ
from censusdis.counties.new_jersey import HUDSON

## Arguments

`ced.download` has a lot of arguments as the docstring indicates.

In [2]:
help(ced.download)

Help on function download in module censusdis.data:

download(dataset: str, vintage: Union[int, Literal['timeseries']], download_variables: Union[str, Iterable[str], NoneType] = None, *, group: Union[str, Iterable[str], NoneType] = None, leaves_of_group: Union[str, Iterable[str], NoneType] = None, set_to_nan: Union[bool, Iterable[int]] = True, skip_annotations: bool = True, query_filter: Optional[Dict[str, str]] = None, with_geometry: bool = False, with_geometry_columns: bool = False, tiger_shapefiles_only: bool = False, remove_water: bool = False, download_contained_within: Optional[Dict[str, Union[str, Iterable[str]]]] = None, area_threshold: float = 0.8, api_key: Optional[str] = None, variable_cache: Optional[ForwardRef('VariableCache')] = None, row_keys: Union[str, Iterable[str], NoneType] = None, **kwargs: Union[str, Iterable[str]]) -> Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame]
    Download data from the US Census API.
    
    This is the main API for

On top of all the named arguments, there are `kwargs` where we can express geography, like `state=NJ, county='*'`.
All of this makes `ced.download` super powerful and flexible, but at the same time it opens up a variety of ways
to make mistakes.

## Getting Args Wrong

There are two major categories of mistakes that come up when calling `ced.download`. Any longtime `censusdis` user
is likely to have experienced them both.

- *Category 1* consists of errors involving misspelling a required or optional argument when passing it by name.
For example, instead of `download_variables=`, you might type `variables=`.

- *Category 2* consists of errors due to a bad geography spec in `kwargs`. For example, you might say 
`state=NJ, county='*', tract='*'` when querying a data set that only provides data at the county level.

These two kinds of errors are very different, but they are also similar in that they involve passing
a named argument that is not recognized as either a named argument or a geographic argument. In earlier
versions of `censusdis`, both cases generated exceptions that suggested the problem involved geography.
Now, as we will see below, the error messages are different and more complete for both cases. We hope
that this will make the developer experience better, and make it easier to fix the root cause, when
either of these cases comes up.
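
To make the idea of an argument that is "neither a named argument nor a geographic argument" concrete, here is a minimal sketch. It is not how `censusdis` implements the check. `fake_download`, `KNOWN_GEO_KEYWORDS`, and `unrecognized_arguments` are hypothetical names invented for this illustration, and the keyword set is only a small subset of the geography keywords that appear in the error messages below.

```python
import inspect


def fake_download(dataset, vintage, download_variables=None, *, group=None, **kwargs):
    """Hypothetical stand-in whose signature is shaped like ced.download."""
    return kwargs


# Assumption: a small, hand-copied subset of geography keywords for acs/acs1.
KNOWN_GEO_KEYWORDS = {"us", "region", "division", "state", "county", "place"}


def unrecognized_arguments(func, passed_names):
    """Return the names that are neither named parameters nor geography keywords."""
    named = {
        name
        for name, param in inspect.signature(func).parameters.items()
        # Skip the catch-all **kwargs entry; it is not a real argument name.
        if param.kind is not inspect.Parameter.VAR_KEYWORD
    }
    return [
        name
        for name in passed_names
        if name not in named and name not in KNOWN_GEO_KEYWORDS
    ]


bad = unrecognized_arguments(
    fake_download, ["dataset", "vintage", "variables", "states"]
)
print(bad)  # ['variables', 'states']
```

Both a misspelled named argument (`variables`) and a misspelled geography keyword (`states`) end up in the same bucket, which is why a single error message has to cover both possibilities.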



## Category 1

Here is an example of a case where `download_variables=` is accidentally passed as `variables=`.

In [3]:
variables = [
    "NAME",
    "B25003_001E",
    "B25003_002E",
    "B25003_003E",
]

In [4]:
# Note that we are going to catch the exception and print the message.
# Normally we would let it bubble up to the notebook kernel and your
# IDE or browser would display it in some form including the stack trace
# and other details.

try:
    ced.download(
        dataset=ACS1,
        vintage=2023,
        # Here is the problem. This is neither a properly
        # named arg nor part of a geography spec.
        variables=variables,
        # This is the geography.
        state=ALL_STATES_AND_DC,
    )
except Exception as e:
    print(str(e))


The following arguments are not recognized as non-geographic arguments or goegraphic arguments
for the dataset acs/acs1 in vintage 2023: 'variables'.

There are two reasons why this might happen:

1. The arg(s) mentioned above are mispelled versions of named or geopgrahic arguments.
2. The arg(s) mentioned above are valid geographic arguments for some data sets and
   vintages, but not for acs/acs1 in vintage 2023.

Supported geographies for dataset='acs/acs1' in year=2023 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'place']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['metropolitan_statistical_area_micropolitan_statistical_area']
['metropolitan_statistical_area_micropolitan_statistical_area', 'state_or_part', 'principal_city_or_part']
['metropolitan_statistical_area_micropolitan_statistical_area', 'metropolitan_division']
['combined_statistical_

Notice how the error message warns us of two possible cases, and specifically identifies the
argument `variables` that caused the error.

Let's try another category 1 example, but with a misspelled geography constraint `states=` instead of `state=`.

In [5]:
try:
    ced.download(
        dataset=ACS1,
        vintage=2023,
        download_variables=variables,
        # Incorrect `states=` instead of `state=` in the geography.
        states=ALL_STATES_AND_DC,
    )
except Exception as e:
    print(str(e))


The following arguments are not recognized as non-geographic arguments or goegraphic arguments
for the dataset acs/acs1 in vintage 2023: 'states'.

There are two reasons why this might happen:

1. The arg(s) mentioned above are mispelled versions of named or geopgrahic arguments.
2. The arg(s) mentioned above are valid geographic arguments for some data sets and
   vintages, but not for acs/acs1 in vintage 2023.

Supported geographies for dataset='acs/acs1' in year=2023 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'place']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['metropolitan_statistical_area_micropolitan_statistical_area']
['metropolitan_statistical_area_micropolitan_statistical_area', 'state_or_part', 'principal_city_or_part']
['metropolitan_statistical_area_micropolitan_statistical_area', 'metropolitan_division']
['combined_statistical_are

We got a similar error message, but this time we want to look at the second half, where all the
allowable geographies are listed, and see that `state`, not `states`, is the proper way to express
what we want.
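
Scanning a long list of supported geographies by eye can be tedious. As a rough sketch, and not a `censusdis` feature, the standard library's `difflib.get_close_matches` can propose the nearest valid keyword. `SUPPORTED_GEOGRAPHIES` below is hand-copied from the first few entries of the listing above, and `suggest_keyword` is a hypothetical helper name.

```python
import difflib

# Assumption: hand-copied from the first entries of the error message listing.
SUPPORTED_GEOGRAPHIES = [
    ["us"],
    ["region"],
    ["division"],
    ["state"],
    ["state", "county"],
    ["state", "county", "county_subdivision"],
    ["state", "place"],
]


def suggest_keyword(bad_name, supported=SUPPORTED_GEOGRAPHIES):
    """Return supported geography keywords that closely resemble bad_name."""
    keywords = sorted({keyword for geo in supported for keyword in geo})
    return difflib.get_close_matches(bad_name, keywords, n=3, cutoff=0.6)


print(suggest_keyword("states"))  # ['state']
```

The `cutoff` of 0.6 is the `difflib` default. A stricter value returns fewer but more confident suggestions.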

Of course, we might be having a bad day and make both mistakes at once.

In [6]:
try:
    ced.download(
        dataset=ACS1,
        vintage=2023,
        # Here is the problem. This is neither a properly
        # named arg nor part of a geography spec.
        variables=variables,
        # Incorrect `states=` instead of `state=` in the geography.
        states=ALL_STATES_AND_DC,
    )
except Exception as e:
    print(str(e))


The following arguments are not recognized as non-geographic arguments or goegraphic arguments
for the dataset acs/acs1 in vintage 2023: 'variables', 'states'.

There are two reasons why this might happen:

1. The arg(s) mentioned above are mispelled versions of named or geopgrahic arguments.
2. The arg(s) mentioned above are valid geographic arguments for some data sets and
   vintages, but not for acs/acs1 in vintage 2023.

Supported geographies for dataset='acs/acs1' in year=2023 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'place']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['metropolitan_statistical_area_micropolitan_statistical_area']
['metropolitan_statistical_area_micropolitan_statistical_area', 'state_or_part', 'principal_city_or_part']
['metropolitan_statistical_area_micropolitan_statistical_area', 'metropolitan_division']
['combined_st

Notice how in this case both of the arguments that caused a problem are identified.
Hopefully this helps us recover and get back on track with the query we wanted to
write from the beginning. Here it is.


In [7]:
df = ced.download(
    dataset=ACS1,
    vintage=2023,
    download_variables=variables,
    state=ALL_STATES_AND_DC,
)

df.head()

Unnamed: 0,STATE,NAME,B25003_001E,B25003_002E,B25003_003E
0,1,Alabama,2051545,1438756,612789
1,2,Alaska,276852,183575,93277
2,4,Arizona,2907014,1967704,939310
3,5,Arkansas,1232871,816038,416833
4,6,California,13699816,7658458,6041358


## Category 2

This is the other type of error, where every argument name looks valid, but the
geography specification is malformed. Here is an example:

In [8]:
try:
    ced.download(
        dataset=ACS1,
        vintage=2023,
        download_variables=variables,
        state=NJ,
        county=HUDSON,
        place="*",
    )
except Exception as e:
    print(str(e))


Unable to match the geography specification {'state': '34', 'county': '017', 'place': '*'}.

Supported geographies for dataset='acs/acs1' in year=2023 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'place']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['metropolitan_statistical_area_micropolitan_statistical_area']
['metropolitan_statistical_area_micropolitan_statistical_area', 'state_or_part', 'principal_city_or_part']
['metropolitan_statistical_area_micropolitan_statistical_area', 'metropolitan_division']
['combined_statistical_area']
['urban_area']
['state', 'congressional_district']
['state', 'public_use_microdata_area']
['state', 'school_district_elementary']
['state', 'school_district_secondary']
['state', 'school_district_unified']


In this case, all of the arguments we provided in trying to specify a geography are
valid geography keywords for this dataset. However, the three of them cannot be
combined in this way because places are not nested inside counties in the geography
hierarchy. As the extended details in the error message indicate, we can provide
`state` and `county`, or `state` and `place`, but not all three.
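
The rule at work here can be sketched in a few lines: a specification is acceptable only if its set of keywords equals one of the supported combinations. This is a simplified illustration, assuming exact set equality is the criterion. The actual matching logic inside `censusdis` may differ. `SUPPORTED_GEOGRAPHIES` is hand-copied from the error message above, and `matching_geography` is a hypothetical helper name.

```python
# Assumption: hand-copied from the supported geographies listed in the error message.
SUPPORTED_GEOGRAPHIES = [
    ["us"],
    ["region"],
    ["division"],
    ["state"],
    ["state", "county"],
    ["state", "county", "county_subdivision"],
    ["state", "place"],
    ["state", "alaska_native_regional_corporation"],
    ["american_indian_area_alaska_native_area_hawaiian_home_land"],
    ["metropolitan_statistical_area_micropolitan_statistical_area"],
    [
        "metropolitan_statistical_area_micropolitan_statistical_area",
        "state_or_part",
        "principal_city_or_part",
    ],
    [
        "metropolitan_statistical_area_micropolitan_statistical_area",
        "metropolitan_division",
    ],
    ["combined_statistical_area"],
    ["urban_area"],
    ["state", "congressional_district"],
    ["state", "public_use_microdata_area"],
    ["state", "school_district_elementary"],
    ["state", "school_district_secondary"],
    ["state", "school_district_unified"],
]


def matching_geography(spec, supported=SUPPORTED_GEOGRAPHIES):
    """Return the supported geography whose keywords equal the spec's keys, or None."""
    keys = set(spec)
    for geo in supported:
        if set(geo) == keys:
            return geo
    return None


# The failing specification from the example above: no single entry has all three keys.
print(matching_geography({"state": "34", "county": "017", "place": "*"}))  # None

# Dropping the county leaves a combination that is supported.
print(matching_geography({"state": "34", "place": "*"}))  # ['state', 'place']
```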

This is actually a pretty common error, because we sometimes like to think in terms
of containment that the U.S. Census data model does not support. For example, we might
want to query data on all the places in Hudson County, NJ. `censusdis` has another way to
do this, which is explored in another notebook called
[Geographies Contained within Geographies.ipynb](./Geographies%20Contained%20within%20Geographies.ipynb).
Using the technique described there, we can run a query to get data on places
that are within the county.

In [9]:
df_places_in_hudson_co = ced.contained_within(
    state=NJ,
    county=HUDSON,
).download(
    dataset=ACS1, vintage=2023, download_variables=variables, state=NJ, place="*"
)

df_places_in_hudson_co

Unnamed: 0,STATE,COUNTY,PLACE,NAME,B25003_001E,B25003_002E,B25003_003E
0,34,17,3580,"Bayonne city, New Jersey",30335,9959,20376
1,34,17,36000,"Jersey City city, New Jersey",130020,35907,94113
2,34,17,74630,"Union City city, New Jersey",25001,6041,18960


Another variation of category 2 errors occurs when we try to query a geography that is valid for some data
sets but not for this one. For example, we can try our query at the census tract level.

In [10]:
try:
    ced.download(
        dataset=ACS1,
        vintage=2023,
        download_variables=variables,
        state=NJ,
        county=HUDSON,
        tract="*",
    )
except Exception as e:
    print(str(e))


The following arguments are not recognized as non-geographic arguments or goegraphic arguments
for the dataset acs/acs1 in vintage 2023: 'tract'.

There are two reasons why this might happen:

1. The arg(s) mentioned above are mispelled versions of named or geopgrahic arguments.
2. The arg(s) mentioned above are valid geographic arguments for some data sets and
   vintages, but not for acs/acs1 in vintage 2023.

Supported geographies for dataset='acs/acs1' in year=2023 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'place']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['metropolitan_statistical_area_micropolitan_statistical_area']
['metropolitan_statistical_area_micropolitan_statistical_area', 'state_or_part', 'principal_city_or_part']
['metropolitan_statistical_area_micropolitan_statistical_area', 'metropolitan_division']
['combined_statistical_area

In this case, `tract` is identified as a bad argument and we get an error message like we
did in category 1. We can then see in the list of supported geographies that unlike some
other data sets, `ACS1` does not offer tract-level data.