## Census Scraper
In this notebook, we use the [ONS API](https://developer.ons.gov.uk) to obtain Census 2021 data. It was written for another exploratory ONS data project hosted on [GitHub](https://github.com/cwtravisyip/ONS_Census2021/tree/main).

In [1]:
!curl -O https://raw.githubusercontent.com/cwtravisyip/ONS_Census2021/main/custom_function.py



In [3]:
# import data management packages
import pandas as pd 
import geopandas as gpd
import geojson
import numpy as np
import os

# import web data packages
import requests


# import custom defined function from the ONS data exploratory project
from custom_function import *

from api_keys import user_agent_ons

# define the request header (the standard HTTP header name is "User-Agent")
headers = {"User-Agent": user_agent_ons}





## Define Geography
For this particular exercise, we look at census data at the Lower Layer Super Output Area (LSOA) level. The list of LSOAs can be found on the [ONS Open Geography website](https://geoportal.statistics.gov.uk/). We focus only on England and Wales. Note that the API returns at most 2,000 features per call.

Alternatively, we could look at the following geographical levels:
* [Lower-Tier Local Authority](https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Local_Authority_Districts_December_2011_FEB_EW_2022/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson)
* Middle Layer Super Output Area
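
Because each call is capped at 2,000 features, downloading a whole layer means paging through it. The sketch below only builds the request URLs. It assumes the layer supports the ArcGIS REST `resultOffset` and `resultRecordCount` query parameters and that `FID` is a stable sort key, neither of which has been verified for this service. `build_page_urls`, the base URL and the feature count are hypothetical and serve only as an illustration.

```python
from urllib.parse import urlencode


def build_page_urls(api_base, total_features, page_size=2000, out_crs=27700):
    """Return one query URL per page of at most `page_size` features."""
    urls = []
    for offset in range(0, total_features, page_size):
        params = {
            "where": "1=1",            # no attribute filter
            "outFields": "*",          # return every attribute column
            "outSR": out_crs,          # output coordinate reference system
            "f": "geojson",
            "orderByFields": "FID",    # assumed sort key, keeps pages disjoint
            "resultOffset": offset,    # assumed to be supported by the layer
            "resultRecordCount": page_size,
        }
        urls.append(api_base + "query?" + urlencode(params))
    return urls


# hypothetical base URL and feature count, for illustration only
urls = build_page_urls("https://example.org/FeatureServer/0/", total_features=4500)
print(len(urls))
```

Each URL could then be passed to `gpd.read_file` in turn and the results concatenated.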

In [19]:
# instantiate an empty geopandas dataframe
gpd.GeoDataFrame()

In [44]:
object_id_start = 0
chunksize = 2000  # the service returns at most 2,000 features per request
out_crs = 27700   # EPSG:27700, British National Grid
lsoa = gpd.GeoDataFrame()
api_base = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3/FeatureServer/0/"

# only the first page is requested here; raise the bound to download more LSOAs
while object_id_start <= 0:
    print(object_id_start)
    # request one page of features, ordered by FID so that pages do not overlap
    # (assumes the layer supports resultOffset / resultRecordCount paging)
    api_query = (
        f"query?outFields=*&outSR={out_crs}&f=geojson&where=1%3D1"
        f"&orderByFields=FID&resultOffset={object_id_start}&resultRecordCount={chunksize}"
    )
    api_url = api_base + api_query
    # get the lsoa data
    newd = gpd.read_file(api_url)
    print(len(newd))
    # append to the existing GeoDataFrame
    lsoa = pd.concat([lsoa, newd], ignore_index=True)

    # prepare for next loop
    object_id_start += chunksize


# lsoa.plot()


0
2000


In [45]:
lsoa

Unnamed: 0,FID,LSOA21CD,LSOA21NM,BNG_E,BNG_N,LONG,LAT,Shape__Area,Shape__Length,GlobalID,geometry
0,1,E01000001,City of London 001A,532123,181632,-0.097140,51.51816,133759.153557,2289.743177,1a259a13-a525-4858-9cb0-e4952ba01af6,"POLYGON ((532105.312 182010.574, 532162.491 18..."
1,2,E01000002,City of London 001B,532480,181715,-0.091970,51.51882,225673.949043,2486.578125,1233e433-0b0d-4807-8117-17a83c23960d,"POLYGON ((532634.497 181926.016, 532619.141 18..."
2,3,E01000003,City of London 001C,532239,182033,-0.095320,51.52174,57288.376411,1142.183482,5163b7cb-4ffe-4f41-95b9-aa6cfc0508a3,"POLYGON ((532135.138 182198.131, 532158.250 18..."
3,4,E01000005,City of London 001E,533581,181283,-0.076270,51.51468,190508.858498,2167.942170,2af8015e-386e-456d-a45a-d0a223c340df,"POLYGON ((533808.018 180767.774, 533649.037 18..."
4,5,E01000006,Barking and Dagenham 016A,544994,184274,0.089317,51.53875,144195.364548,1935.412725,b492b45e-175e-4e77-b0b5-5b2fd6993ef4,"POLYGON ((545122.049 184314.931, 545271.849 18..."
...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,E01002098,Haringey 007A,530700,190552,-0.114330,51.59865,195366.483536,1883.235139,f4c81ac2-3972-4cd5-b289-ac4a7852d8b9,"POLYGON ((530589.132 190809.272, 530831.600 19..."
1996,1997,E01002099,Haringey 008A,531285,190815,-0.105790,51.60088,158564.879272,2245.082485,fe55d1f8-9308-4dfc-82bf-d4b942601017,"POLYGON ((531330.000 190898.000, 531410.000 19..."
1997,1998,E01002100,Haringey 008B,532029,190698,-0.095100,51.59966,143076.640450,2273.799699,b2cf7af1-fc49-4b1e-b366-e3cb53358d35,"POLYGON ((532074.813 191104.953, 532293.170 19..."
1998,1999,E01002101,Haringey 007B,531231,191454,-0.106330,51.60664,203592.155052,3349.433629,0c6ef04e-c742-448e-96d3-d7af02df59ae,"POLYGON ((531725.752 191665.306, 531867.263 19..."


In [5]:
# get the ltla data
# ltla = gpd.read_file("https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Local_Authority_Districts_December_2011_FEB_EW_2022/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson")
# filter for England and Wales only
# ltla_EW = ltla[ltla['lad11cd'].str.contains("[EW]")]

# # fill in for wales
# wales_ind = ltla_EW['lad11cd'].str.contains("W")
# ltla_EW.loc[wales_ind, "rgn16nm"] = "Wales"

# # inspect the result
# ltla_EW.sample(n=10)

## Exploring the ONS API
Following the API's introductory documentation, we will dissect how the API is structured.

In [4]:
# retrieve the dataset available
# dataset_info = get_dataset_info()
# dataset_info.head()

## Send API Request
From the census API, a few datasets are relevant to the study of food deserts:
* Car or van availability TS045 (2021:4)
* Car or van availability by household type RM008 (2021:3)
* Distance travelled to work by car or van availability RM015 (2021:3)
* Method used to travel to work TS061 (2021:6)
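
The bracketed suffix in the list above appears to read as `edition:version`, the two values a request needs alongside the dataset ID. A minimal sketch of turning those labels into a lookup table follows; `parse_dataset_label` and `CENSUS_DATASETS` are hypothetical names and are not part of `custom_function.py`.

```python
import re


def parse_dataset_label(label):
    """Split a label such as 'TS045 (2021:4)' into (dataset_id, edition, version)."""
    match = re.fullmatch(r"\s*(\w+)\s*\((\d+):(\d+)\)\s*", label)
    if match is None:
        raise ValueError(f"unrecognised dataset label: {label!r}")
    dataset_id, edition, version = match.groups()
    return dataset_id, int(edition), int(version)


labels = ["TS045 (2021:4)", "RM008 (2021:3)", "RM015 (2021:3)", "TS061 (2021:6)"]

# map each dataset ID to the edition and version to request
CENSUS_DATASETS = {}
for label in labels:
    dataset_id, edition, version = parse_dataset_label(label)
    CENSUS_DATASETS[dataset_id] = {"edition": edition, "version": version}

print(CENSUS_DATASETS["RM008"])
```

Keeping the edition and version in one place avoids hard-coding them separately in every request.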

To get the API data, we will first need to get the area codes as elements of a list object.


In [46]:
# get the lsoa area codes
lsoa_code = list(lsoa['LSOA21CD'])

# subset the list into chunks to loop over
chunksize = 10
# np.ceil returns a float, so cast to int to get a whole number of chunks
n_chunks = int(np.ceil(len(lsoa_code) / chunksize))
area_chunks = np.array_split(lsoa_code, n_chunks)
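
`np.array_split` divides the codes into `n_chunks` pieces of nearly equal size, so no piece is longer than `chunksize`, although the pieces are not necessarily exactly `chunksize` long. For comparison, here is a dependency-free sketch that yields fixed-size chunks with a shorter final chunk. `chunked` is a hypothetical helper, and the codes are the first five LSOA codes shown earlier.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


codes = ["E01000001", "E01000002", "E01000003", "E01000005", "E01000006"]
chunks = list(chunked(codes, 2))
print(chunks)
```

Either approach works for the request loop; the generator returns plain lists rather than NumPy arrays.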

### Car or Van Availability by Household Type (RM008, edition 2021, version 3)

In [48]:
# instantiate an empty pd.DataFrame
rm008 = pd.DataFrame()

# loop over all lsoa code chunks
for areas in area_chunks:
    # send census2021_api requests for the current chunk and return as pd.df
    df = requests_census2021_api(area_code=areas, datasetId="RM008", version=3, area_type="lsoa", verbose=1)
    rm008 = pd.concat([rm008, df], ignore_index=True)

In [49]:
rm008.head()

Unnamed: 0,href,id,label,Does not apply.Does not apply,Does not apply.One-person household: Aged 66 years and over,Does not apply.One-person household: Other,Does not apply.Single family household: Couple family household: No children,Does not apply.Single family household: Couple family household: Dependent children,Does not apply.Single family household: Couple family household: All children non-dependent,Does not apply.Single family household: Lone parent household,...,No cars or vans in household.Single family household: Lone parent household,No cars or vans in household.Other household types,1 or more cars or vans in household.Does not apply,1 or more cars or vans in household.One-person household: Aged 66 years and over,1 or more cars or vans in household.One-person household: Other,1 or more cars or vans in household.Single family household: Couple family household: No children,1 or more cars or vans in household.Single family household: Couple family household: Dependent children,1 or more cars or vans in household.Single family household: Couple family household: All children non-dependent,1 or more cars or vans in household.Single family household: Lone parent household,1 or more cars or vans in household.Other household types
0,,E01000001,City of London 001A,0,0,0,0,0,0,0,...,9,82,0,31,48,83,43,5,8,64
1,,E01000002,City of London 001B,0,0,0,0,0,0,0,...,19,70,0,28,52,64,35,12,7,51
2,,E01000003,City of London 001C,0,0,0,0,0,0,0,...,38,82,0,17,60,49,23,13,6,22
3,,E01000005,City of London 001E,0,0,0,0,0,0,0,...,38,56,0,2,24,12,24,6,14,19
4,,E01000006,Barking and Dagenham 016A,0,0,0,0,0,0,0,...,39,40,0,6,26,24,104,39,37,135


### Data Diagnosis
Looking at the column names, we can see the data is disaggregated in the following manner:
* Number of cars
    * Household Type

From a quick inspection, many columns contain "Does not apply" and most of their values are 0, so we may want to drop them. We first confirm that every data point in these columns is 0. Note that, compared with the TS045 dataset, RM008 offers fewer categories for the number of cars: it effectively only divides households into those with and those without a car or van. In contrast, the TS045 dataset disaggregates into the following categories:
* No cars or vans in household
* 1 car or van in household
* 2 cars or vans in household
* 3 or more cars or vans in household
* Does not apply
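
Because each RM008 column name joins the car-availability category and the household type with a full stop, the totals per car-availability category can be recovered by summing over household types. The sketch below works on a plain dictionary. `totals_by_car_availability` is a hypothetical helper, and the row holds only a handful of columns from the City of London 001A record, so the sums are not that area's full totals.

```python
from collections import defaultdict


def totals_by_car_availability(row):
    """Sum counts over household types for each car-availability category."""
    totals = defaultdict(int)
    for column, count in row.items():
        # split on the first full stop only: "<car category>.<household type>"
        car_category, separator, _household_type = column.partition(".")
        if not separator:
            continue  # columns such as "id" and "label" carry no count
        totals[car_category] += count
    return dict(totals)


# a handful of columns from the City of London 001A row, not the full record
row = {
    "id": "E01000001",
    "No cars or vans in household.Single family household: Lone parent household": 9,
    "No cars or vans in household.Other household types": 82,
    "1 or more cars or vans in household.One-person household: Aged 66 years and over": 31,
    "1 or more cars or vans in household.One-person household: Other": 48,
}
print(totals_by_car_availability(row))
```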

In [8]:
# inspect the does not apply columns
dna_cols = rm008.columns[rm008.columns.str.contains("Does not apply")]
rm008[dna_cols].sum()

Does not apply.Does not apply                                                                  0
Does not apply.One-person household: Aged 66 years and over                                    0
Does not apply.One-person household: Other                                                     0
Does not apply.Single family household: Couple family household: No children                   0
Does not apply.Single family household: Couple family household: Dependent children            0
Does not apply.Single family household: Couple family household: All children non-dependent    0
Does not apply.Single family household: Lone parent household                                  0
Does not apply.Other household types                                                           0
No cars or vans in household.Does not apply                                                    0
1 or more cars or vans in household.Does not apply                                             0
dtype: int64

Since counts cannot be negative and every column sums to 0, all values in the "Does not apply" columns are 0 and the columns can safely be ignored. Hence we will drop these columns.

In [9]:
# drop returns a new DataFrame, so assign the result back
rm008 = rm008.drop(columns=dna_cols)
rm008

Unnamed: 0,href,id,label,No cars or vans in household.One-person household: Aged 66 years and over,No cars or vans in household.One-person household: Other,No cars or vans in household.Single family household: Couple family household: No children,No cars or vans in household.Single family household: Couple family household: Dependent children,No cars or vans in household.Single family household: Couple family household: All children non-dependent,No cars or vans in household.Single family household: Lone parent household,No cars or vans in household.Other household types,1 or more cars or vans in household.One-person household: Aged 66 years and over,1 or more cars or vans in household.One-person household: Other,1 or more cars or vans in household.Single family household: Couple family household: No children,1 or more cars or vans in household.Single family household: Couple family household: Dependent children,1 or more cars or vans in household.Single family household: Couple family household: All children non-dependent,1 or more cars or vans in household.Single family household: Lone parent household,1 or more cars or vans in household.Other household types
0,,E06000001,Hartlepool,3384,3779,782,649,212,2305,993,2264,4190,5955,6058,2702,3323,4336
1,,E06000002,Middlesbrough,4631,6190,1222,1213,345,4180,2182,3106,6035,7194,9036,3469,4986,6472
2,,E06000003,Redcar and Cleveland,5297,4208,774,721,249,2763,1276,4442,6716,9312,8934,4072,5040,7834
3,,E06000004,Stockton-on-Tees,5755,5613,1048,1017,303,3352,1625,5057,9555,13270,15123,5303,6755,9981
4,,E06000005,Darlington,3632,3990,753,656,164,2113,945,3219,5873,7982,7597,2635,3644,5715
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
345,,E06000006,Halton,3733,4120,673,657,227,2824,1075,3422,6427,8370,9126,3940,5039,6318
346,,E06000007,Warrington,5108,5137,911,668,177,2437,1338,6336,10784,15150,17438,6192,7240,11621
347,,E06000008,Blackburn with Darwen,3818,5122,929,1190,298,3063,1394,2856,5782,6989,11705,3726,4763,7130
348,,E06000009,Blackpool,5660,7700,1475,1142,311,3775,1956,4138,7090,8245,7610,3196,5110,7382


In [10]:
# instantiate an empty pd.DataFrame
ts045 = pd.DataFrame()

# loop over all lsoa code chunks
for areas in area_chunks:
    # send census2021_api requests for the current chunk and return as pd.df
    # TS045 is listed above as edition 2021, version 4
    df = requests_census2021_api(area_code=areas, datasetId="TS045", version=4, area_type="lsoa", verbose=1)
    ts045 = pd.concat([ts045, df], ignore_index=True)

TypeError: 'module' object is not callable