In [45]:
import requests

postcode_data_url = "http://www.nomisweb.co.uk/output/census/2011/Postcode_Estimates_Table_1.csv"

# Download the CSV and fail loudly on an HTTP error
download = requests.get(postcode_data_url)
download.raise_for_status()

# Save the raw bytes unchanged so nothing is re-encoded on write
with open("Postcode_Estimates_Table_1.csv", "wb") as csv_file:
    csv_file.write(download.content)



In [51]:
import pandas as pd

df_postcodes = pd.read_csv("Postcode_Estimates_Table_1.csv")
df_postcodes

Unnamed: 0,Postcode,Total,Males,Females,Occupied_Households
0,AL1 1AG,14,6,8,6
1,AL1 1AJ,124,60,64,51
2,AL1 1AR,32,17,15,17
3,AL1 1AS,34,17,17,13
4,AL1 1BH,52,15,37,41
...,...,...,...,...,...
1308775,YO8 9YA,23,14,9,8
1308776,YO8 9YB,33,17,16,13
1308777,YO8 9YD,9,4,5,4
1308778,YO8 9YE,13,6,7,3


# Data Cleaning

## Postcode Format

UK postcodes can appear in inconsistent formats, so we first check how they are formatted in this data set.

A good first check is whether the postcode strings are all the same length:

In [52]:
#check the lengths of the postcode strings
df_postcodes["Postcode"].apply(len).value_counts()


7    1308780
Name: Postcode, dtype: int64

They are all of length 7, which is a good start.

To extract the different parts of a postcode we first need to know its structure. Postcodes can be in one of 6 formats:

| ![UK Postcode Formats](images/uk_postcode_format.png) |
|:--:|
| Obtained from https://ideal-postcodes.co.uk/guides/uk-postcode-format |

Inspecting the postcodes in the data set, we see that they are formatted as follows:
 - Fixed length of 7 characters.
 - Outcodes left-aligned.
 - Incodes right-aligned.
 - Padded with spaces between the outcode and incode where the two together are shorter than 7 characters.
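
The layout described above can be illustrated with a minimal sketch. The helper name `split_fixed_width` and the sample postcodes are assumptions made for illustration; they are not taken from the data set. The sketch relies only on the observation that the incode is always the last 3 characters.

```python
def split_fixed_width(postcode):
    """Split a 7-character postcode into (outcode, incode).

    Assumes the layout observed above: the incode occupies the last
    3 characters and the outcode is left-aligned and space-padded.
    """
    if len(postcode) != 7:
        raise ValueError(f"expected 7 characters, got {len(postcode)}")
    outcode = postcode[:4].rstrip()  # drop any padding spaces
    incode = postcode[4:]
    return outcode, incode

# Hypothetical samples covering 2-, 3- and 4-character outcodes
samples = ["M1  1AA", "AL1 1AG", "SW1A1AA"]
split_samples = [split_fixed_width(p) for p in samples]
print(split_samples)
# [('M1', '1AA'), ('AL1', '1AG'), ('SW1A', '1AA')]
```

Note that a 4-character outcode leaves no room for a separating space, which is why the pattern used for extraction has to treat the whitespace as optional.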

We will use regex with capture groups to extract some of the postcode parts:
- Sector
- Sub-District (which we will treat the same as the district for postcodes without a sub-district)
- District
- Area

These represent progressively larger geographical areas, with the full postcode being the smallest.

In [73]:
#regex to extract Sector (1), Sub-District (2), District (3) and Area (4)
regex_str = r"^(((([A-Z]{1,2})[0-9]{1,2})[A-Z]?)\s*[0-9])[A-Z]{2}$"
postcode_parts = ["Sector", "Sub-District", "District", "Area"]
df_postcodes[postcode_parts] = df_postcodes["Postcode"].str.extract(regex_str)
df_postcodes

Unnamed: 0,Postcode,Total,Males,Females,Occupied_Households,Sector,Sub-District,District,Area
0,AL1 1AG,14,6,8,6,AL1 1,AL1,AL1,AL
1,AL1 1AJ,124,60,64,51,AL1 1,AL1,AL1,AL
2,AL1 1AR,32,17,15,17,AL1 1,AL1,AL1,AL
3,AL1 1AS,34,17,17,13,AL1 1,AL1,AL1,AL
4,AL1 1BH,52,15,37,41,AL1 1,AL1,AL1,AL
...,...,...,...,...,...,...,...,...,...
1308775,YO8 9YA,23,14,9,8,YO8 9,YO8,YO8,YO
1308776,YO8 9YB,33,17,16,13,YO8 9,YO8,YO8,YO
1308777,YO8 9YD,9,4,5,4,YO8 9,YO8,YO8,YO
1308778,YO8 9YE,13,6,7,3,YO8 9,YO8,YO8,YO
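
The rows displayed above all have identical Sub-District and District values, so it may help to see how the nested capture groups behave on a postcode that does have a sub-district. The following is a small sketch using the standard `re` module with the same pattern; the sample postcode is a hypothetical value written in the 7-character layout, not a row taken from the data set.

```python
import re

# Same pattern as in the cell above. Groups are numbered by the position
# of their opening parenthesis, so group 1 is the outermost (Sector)
# and group 4 the innermost (Area).
pattern = re.compile(r"^(((([A-Z]{1,2})[0-9]{1,2})[A-Z]?)\s*[0-9])[A-Z]{2}$")
part_names = ["Sector", "Sub-District", "District", "Area"]

# Hypothetical sample with a 4-character outcode and therefore no space
match = pattern.match("SW1A1AA")
parts = dict(zip(part_names, match.groups()))
print(parts)
# {'Sector': 'SW1A1', 'Sub-District': 'SW1A', 'District': 'SW1', 'Area': 'SW'}
```

Because the groups are nested, each part is a prefix of the one before it, and the optional `[A-Z]?` is what separates the sub-district from the district.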


We should check that we have an expected range of lengths for each part, and that we have no nulls:

In [75]:
for col in postcode_parts:
    print(col)
    print("Nulls: " + str(df_postcodes[col].isnull().sum()))
    print(df_postcodes[col].apply(len).value_counts())
    print()

Sector
Nulls: 0
5    1308780
Name: Sector, dtype: int64

Sub-District
Nulls: 0
3    644619
4    631294
2     32867
Name: Sub-District, dtype: int64

District
Nulls: 0
3    644978
4    627471
2     36331
Name: District, dtype: int64

Area
Nulls: 0
2    1162930
1     145850
Name: Area, dtype: int64



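This all looks as expected: there are no nulls, every sector is 5 characters long, sub-districts and districts are 2 to 4 characters long, and areas are 1 or 2 characters long.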
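
One further sanity check, sketched here on a small hypothetical sample rather than the full data set, is that each coarser part is a prefix of the next finer one. The sample rows and the variable names `sample`, `extracted`, and `nested_ok` are assumptions for illustration; the same loop could be applied to `df_postcodes`.

```python
import pandas as pd

# Hypothetical rows in the same 7-character layout as the data set
sample = pd.DataFrame({"Postcode": ["M1  1AA", "AL1 1AG", "SW1A1AA"]})
regex_str = r"^(((([A-Z]{1,2})[0-9]{1,2})[A-Z]?)\s*[0-9])[A-Z]{2}$"
postcode_parts = ["Sector", "Sub-District", "District", "Area"]

# Extract the parts and name the resulting columns explicitly
extracted = sample["Postcode"].str.extract(regex_str)
extracted.columns = postcode_parts
sample = sample.join(extracted)

# Pair each finer column with the next coarser one:
# (Postcode, Sector), (Sector, Sub-District), and so on
finer_cols = ["Postcode"] + postcode_parts[:-1]
nested_ok = all(
    all(fine.startswith(coarse) for fine, coarse in zip(sample[f], sample[c]))
    for f, c in zip(finer_cols, postcode_parts)
)
print(nested_ok)
# True
```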