School data, including the postcode of each school, is available on the [DfE](https://www.compare-school-performance.service.gov.uk/) website.
We can convert a postcode to latitude and longitude using the free API [Postcodes.io](https://postcodes.io/).
We can then use the latitude and longitude of the school postcodes and the LSOA midpoint to measure the distance to the closest school and add it as a variable in the machine learning model. This will hopefully improve model accuracy considerably.

The API would take a while to install. It uses ONS data that is available online [here](https://data.gov.uk/dataset/national-statistics-postcode-lookup-uk), so we work with that data directly instead.
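
The closest-school distance described above could be computed roughly as follows. This is only a sketch: it assumes great-circle (haversine) distance on a spherical Earth of radius 6371 km, and the function names `haversine_km` and `nearest_school_km` are hypothetical, not part of this notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_school_km(lsoa_lat, lsoa_lon, school_lats, school_lons):
    """Distance from one LSOA midpoint to the closest of several schools."""
    distances = haversine_km(lsoa_lat, lsoa_lon,
                             np.asarray(school_lats), np.asarray(school_lons))
    return float(distances.min())
```

Applying `nearest_school_km` to every LSOA midpoint against the full list of school coordinates would give one distance feature per LSOA.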

## Download DfE data

In [1]:
import pandas as pd
import numpy as np


url_2015 = 'https://www.compare-school-performance.service.gov.uk/download-data?download=true&regions=0&filters=SPINE&fileformat=csv&year=2014-2015&meta=false'
url_2014 = 'https://www.compare-school-performance.service.gov.uk/download-data?download=true&regions=0&filters=SPINE&fileformat=csv&year=2013-2014&meta=false'

save_as_2015 = 'open-data/schools-v2-2015.csv'
save_as_2014 = 'open-data/schools-v2-2014.csv'

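These files cannot be downloaded from Python. Download them manually from the URLs above and save them to the paths in `save_as_2015` and `save_as_2014`.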
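
Because the download is a manual step, a quick check that the files are in place can save a confusing error later. The helper `missing_files` below is a hypothetical sketch, not part of the original notebook; the paths are the two save locations defined above.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not exist on disk yet."""
    return [p for p in paths if not Path(p).is_file()]


expected = ['open-data/schools-v2-2015.csv', 'open-data/schools-v2-2014.csv']
missing = missing_files(expected)
if missing:
    print('Download these files manually before continuing:', missing)
```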

## Download postcode data

Manually download the CSV from [https://data.gov.uk/dataset/national-statistics-postcode-lookup-uk](https://data.gov.uk/dataset/national-statistics-postcode-lookup-uk) and save it as 'open-data/postcodes.csv'.

In [2]:
df_postcodes = pd.read_csv('open-data/postcodes.csv')

In [3]:
# Normalise each postcode column from its own values: upper case, no spaces (NaN stays NaN)
for col in ['Postcode 1', 'Postcode 2', 'Postcode 3']:
    df_postcodes[col] = df_postcodes[col].str.upper().str.replace(' ', '', regex=False)
p1 = df_postcodes.drop(['Postcode 2', 'Postcode 3'], axis=1).rename(columns={'Postcode 1': 'POSTCODE'})
p2 = df_postcodes.drop(['Postcode 1', 'Postcode 3'], axis=1).rename(columns={'Postcode 2': 'POSTCODE'})
p3 = df_postcodes.drop(['Postcode 1', 'Postcode 2'], axis=1).rename(columns={'Postcode 3': 'POSTCODE'})
df_postcodes_long = pd.concat([p1, p2, p3], axis = 0, ignore_index=True)
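
The wide-to-long reshape above can also be expressed with `DataFrame.melt`. The sketch below uses a small made-up frame (`toy`) and a hypothetical helper `postcodes_to_long`; it is an alternative illustration, not the code this notebook ran.

```python
import pandas as pd

# Toy stand-in for the postcode lookup; the values are illustrative only
toy = pd.DataFrame({
    'Postcode 1': ['S20 6RU', 'TW4 7BD'],
    'Postcode 2': ['S20  6RU', 'TW4  7BD'],
    'Postcode 3': ['s20 6ru', None],
    'Latitude': [53.340953, 51.466899],
})


def postcodes_to_long(df, postcode_cols):
    """Stack several postcode columns into a single POSTCODE column."""
    id_cols = [c for c in df.columns if c not in postcode_cols]
    long_df = df.melt(id_vars=id_cols, value_vars=postcode_cols,
                      value_name='POSTCODE')
    long_df = long_df.drop(columns='variable').dropna(subset=['POSTCODE'])
    # Same normalisation as above: upper case, spaces removed
    long_df['POSTCODE'] = (long_df['POSTCODE'].str.upper()
                           .str.replace(' ', '', regex=False))
    return long_df.drop_duplicates().reset_index(drop=True)


toy_long = postcodes_to_long(toy, ['Postcode 1', 'Postcode 2', 'Postcode 3'])
```

Dropping missing postcodes before normalising avoids calling string methods on `NaN`, and `drop_duplicates` collapses spellings that only differed in spacing or case.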


In [4]:
df_postcodes_long = df_postcodes_long[['POSTCODE','Lower Super Output Area Code',
                            'Longitude', 'Latitude']].drop_duplicates()
df_postcodes_long = df_postcodes_long[df_postcodes_long['Lower Super Output Area Code'].notnull()]

In [5]:
df_postcodes_long.head()

Unnamed: 0,POSTCODE,Lower Super Output Area Code,Longitude,Latitude
0,S206RU,E01008038,-1.379193,53.340953
1,TW47BD,E01002660,-0.383652,51.466899
2,GU513ZQ,E01022870,-0.831674,51.288637
3,OX46BE,E01028518,-1.198788,51.723313
4,TA79JH,E01029150,-2.877144,51.149999


In [6]:
df_postcodes_long.shape

(1755145, 4)

### Merge the two datasets on the postcodes

In [7]:
def school_by_lsoa(year):
    # Pick the manually downloaded DfE file for the requested year
    files = {2014: save_as_2014, 2015: save_as_2015}
    if year not in files:
        raise ValueError('year must be 2014 or 2015')

    df_schools = pd.read_csv(files[year])
    df_schools = df_schools[['URN', 'SCHNAME', 'POSTCODE', 'ISPRIMARY', 'ISSECONDARY', 'ISPOST16']]
    df_schools = df_schools[df_schools.POSTCODE.notnull()].copy()
    # Normalise postcodes to match df_postcodes_long: upper case, no spaces
    df_schools['POSTCODE'] = df_schools['POSTCODE'].apply(lambda x: x.upper().replace(' ', ''))
    # Left merge keeps every school, even when its postcode has no LSOA match
    merged = df_schools.merge(df_postcodes_long, how='left', on='POSTCODE')
    not_merged = merged['Lower Super Output Area Code'].isnull().sum() / len(merged)
    print('Percent of schools where the LSOA could not be determined from the postcode')
    print('{:.2f}'.format(not_merged*100) + '%')
    merged['year'] = year
    return merged

In [8]:
df_2014 = school_by_lsoa(2014)
df_2015 = school_by_lsoa(2015)
df = pd.concat([df_2014, df_2015], ignore_index=True).rename(columns={'Lower Super Output Area Code': 'LSOA_code'})
secondary = df[df.ISSECONDARY == 1]
primary = df[df.ISPRIMARY == 1]
post16 = df[df.ISPOST16 == 1]

Percent of schools where the LSOA could not be determined from the postcode
0.54%

Percent of schools where the LSOA could not be determined from the postcode
0.37%
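
To see which schools failed to match, rather than only the percentage, a left merge with `indicator=True` is one option. The frames below are made-up toy data (the URNs and the unmatched postcode are invented for illustration).

```python
import pandas as pd

# Toy data: the third school has a postcode missing from the lookup
schools = pd.DataFrame({'URN': [1, 2, 3],
                        'POSTCODE': ['S206RU', 'TW47BD', 'ZZ999ZZ']})
postcodes = pd.DataFrame({'POSTCODE': ['S206RU', 'TW47BD'],
                          'LSOA_code': ['E01008038', 'E01002660']})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
merged = schools.merge(postcodes, how='left', on='POSTCODE', indicator=True)
unmatched_share = (merged['_merge'] == 'left_only').mean()
unmatched = merged.loc[merged['_merge'] == 'left_only', ['URN', 'POSTCODE']]
```

Inspecting `unmatched` shows whether the failures are typos, closed schools, or postcodes missing from the lookup.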


In [9]:
secondary = secondary.groupby(['year', 'LSOA_code']).size().reset_index(name='schools_secondary')
primary = primary.groupby(['year', 'LSOA_code']).size().reset_index(name='schools_primary')
post16 = post16.groupby(['year', 'LSOA_code']).size().reset_index(name='schools_post16')

In [10]:
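# merge() defaults to an inner join on the shared columns (year, LSOA_code),
# so only LSOAs with at least one school of every type are kept
df = primary.merge(secondary).merge(post16)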
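
If the goal is a count for every LSOA that has any school, an outer merge with missing counts filled as zero may be closer to what is wanted. The sketch below contrasts the two on made-up toy frames; it is an alternative to consider, not what the cells in this notebook ran.

```python
import pandas as pd
from functools import reduce

# Toy counts: zone A has primary + secondary, B only primary, C only post-16
toy_primary = pd.DataFrame({'year': [2014, 2014], 'LSOA_code': ['A', 'B'],
                            'schools_primary': [2, 1]})
toy_secondary = pd.DataFrame({'year': [2014], 'LSOA_code': ['A'],
                              'schools_secondary': [1]})
toy_post16 = pd.DataFrame({'year': [2014], 'LSOA_code': ['C'],
                           'schools_post16': [1]})

keys = ['year', 'LSOA_code']

# Inner join (the merge default): a zone must appear in all three frames
inner = toy_primary.merge(toy_secondary).merge(toy_post16)

# Outer join: keep every zone, then treat a missing count as zero schools
outer = reduce(lambda left, right: left.merge(right, how='outer', on=keys),
               [toy_primary, toy_secondary, toy_post16])
count_cols = ['schools_primary', 'schools_secondary', 'schools_post16']
outer[count_cols] = outer[count_cols].fillna(0).astype(int)
```

Here `inner` is empty because no toy zone has all three school types, while `outer` keeps zones A, B and C.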

## LSOA codes are 2001. Convert to 2011
Download the 2001-to-2011 LSOA lookup from the [ONS Geoportal](http://geoportal.statistics.gov.uk/datasets?q=Lower%20Layer%20Super%20Output%20Area%20(2001)%20to%20Lower%20Layer%20Super%20Output%20Area%20(2011)%20to%20Local%20Authority%20District%20(2011)%20Lookup&sort=name) and save it as 'open-data/LSOA01_to_LSOA11.csv'.

In [11]:
lsoa_lookup = pd.read_csv('open-data/LSOA01_to_LSOA11.csv', encoding = 'latin')

In [12]:
lsoa_lookup.sort_values('LSOA11CD').head()

Unnamed: 0,LSOA01CD,LSOA01NM,LSOA11CD,LSOA11NM,CHGIND,LAD11CD,LAD11NM,LAD11NMW
7397,E01000001,City of London 001A,E01000001,City of London 001A,U,E09000001,City of London,
7398,E01000002,City of London 001B,E01000002,City of London 001B,U,E09000001,City of London,
7399,E01000003,City of London 001C,E01000003,City of London 001C,U,E09000001,City of London,
7400,E01000005,City of London 001E,E01000005,City of London 001E,U,E09000001,City of London,
666,E01000006,Barking and Dagenham 016A,E01000006,Barking and Dagenham 016A,U,E09000002,Barking and Dagenham,


In [13]:
df = df.merge(lsoa_lookup, how='left', left_on='LSOA_code', right_on='LSOA01CD')

In [14]:
df_2011_zones = df.groupby(['year', 'LSOA11CD' ])[['schools_primary', 'schools_secondary', 'schools_post16' ]].sum().reset_index()
df_2011_zones = df_2011_zones.rename(columns={'LSOA11CD': 'LSOA_code'})
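
One thing to watch in this conversion: if a 2001 LSOA maps to more than one 2011 LSOA, the left merge repeats its row and the summed counts are attributed to each 2011 zone. The sketch below shows the effect on made-up toy codes; it assumes that `CHGIND == 'S'` marks a split zone, which should be checked against the lookup's documentation.

```python
import pandas as pd

# Toy lookup: X1 is unchanged, X2 is split into Y2 and Y3 (codes are invented)
toy_lookup = pd.DataFrame({
    'LSOA01CD': ['X1', 'X2', 'X2'],
    'LSOA11CD': ['Y1', 'Y2', 'Y3'],
    'CHGIND': ['U', 'S', 'S'],
})
toy_counts = pd.DataFrame({'year': [2014, 2014],
                           'LSOA_code': ['X1', 'X2'],
                           'schools_primary': [1, 2]})

# 2001 codes that map to more than one 2011 code
n_targets = toy_lookup.groupby('LSOA01CD')['LSOA11CD'].nunique()
split_codes = n_targets[n_targets > 1].index.tolist()

joined = toy_counts.merge(toy_lookup, how='left',
                          left_on='LSOA_code', right_on='LSOA01CD')
total_before = toy_counts['schools_primary'].sum()
total_after = joined.groupby(['year', 'LSOA11CD'])['schools_primary'].sum().sum()
```

Comparing `total_before` with `total_after` is a cheap check on whether the conversion has inflated the school counts.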

In [15]:
df_2011_zones.head()

Unnamed: 0,year,LSOA_code,schools_primary,schools_secondary,schools_post16
0,2014,E01000002,1,1,1
1,2014,E01000010,3,1,1
2,2014,E01000056,1,1,1
3,2014,E01000065,2,1,1
4,2014,E01000066,2,1,1


## Join the new data onto the existing lot

In [16]:
df_old = pd.read_csv('02-enriched-data.csv')

In [17]:
df_new = df_old.merge(df_2011_zones, how='left', on=['LSOA_code', 'year'])

In [18]:
df_new.schools_secondary.isnull().sum() / len(df_new)

0.93210327609304588

In [19]:
city_of_london_lsoa = 'E01000001'
# `in` on a Series checks the index labels, so test against the values instead
city_of_london_lsoa in df.LSOA_code.values

False
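
A small illustration of how membership tests behave on a pandas Series, since the `in` operator looks at the index labels rather than the values. The Series here is toy data.

```python
import pandas as pd

codes = pd.Series(['E01000002', 'E01000010'])  # default index labels: 0, 1

in_index = 'E01000002' in codes           # False: compares against labels 0 and 1
label_found = 0 in codes                  # True: 0 is an index label
in_values = 'E01000002' in codes.values   # True: compares against the values
via_isin = codes.isin(['E01000002']).any()  # True: vectorised equivalent
```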

In [20]:
df_new[df_new.schools_primary.notnull()]

Unnamed: 0,LSOA_code,Region,LA_Code,LA_Name,year,mode,travel_time,nearest,urban_rural,area_square_km,...,road_LA_trunk_length_km,bus_LA_vehicle_km_travelled,schools_all_LA,schools_nursery_LA,schools_primary_LA,schools_private_LA,schools_secondary_LA,schools_primary,schools_secondary,schools_post16
48,E01000002,London,E09000001,City of London,2014,car,7.197691,employment_centre,Urban major conurbation,0.2284,...,0.0,14.730303,5,0,1,4,0,1.0,1.0,1.0
49,E01000002,London,E09000001,City of London,2014,cycle,7.211952,employment_centre,Urban major conurbation,0.2284,...,0.0,14.730303,5,0,1,4,0,1.0,1.0,1.0
50,E01000002,London,E09000001,City of London,2014,public transport,4.338976,employment_centre,Urban major conurbation,0.2284,...,0.0,14.730303,5,0,1,4,0,1.0,1.0,1.0
51,E01000002,London,E09000001,City of London,2015,car,7.427854,employment_centre,Urban major conurbation,0.2284,...,0.0,14.696970,5,0,1,4,0,1.0,1.0,1.0
52,E01000002,London,E09000001,City of London,2015,cycle,7.295751,employment_centre,Urban major conurbation,0.2284,...,0.0,14.696970,5,0,1,4,0,1.0,1.0,1.0
53,E01000002,London,E09000001,City of London,2015,public transport,4.217422,employment_centre,Urban major conurbation,0.2284,...,0.0,14.696970,5,0,1,4,0,1.0,1.0,1.0
54,E01000002,London,E09000001,City of London,2015,public transport,9.000000,primary_school,Urban major conurbation,0.2284,...,0.0,14.696970,5,0,1,4,0,1.0,1.0,1.0
55,E01000002,London,E09000001,City of London,2014,public transport,8.000000,primary_school,Urban major conurbation,0.2284,...,0.0,14.730303,5,0,1,4,0,1.0,1.0,1.0
56,E01000002,London,E09000001,City of London,2015,cycle,9.000000,primary_school,Urban major conurbation,0.2284,...,0.0,14.696970,5,0,1,4,0,1.0,1.0,1.0
57,E01000002,London,E09000001,City of London,2014,cycle,8.000000,primary_school,Urban major conurbation,0.2284,...,0.0,14.730303,5,0,1,4,0,1.0,1.0,1.0


In [21]:
df_new.head()

Unnamed: 0,LSOA_code,Region,LA_Code,LA_Name,year,mode,travel_time,nearest,urban_rural,area_square_km,...,road_LA_trunk_length_km,bus_LA_vehicle_km_travelled,schools_all_LA,schools_nursery_LA,schools_primary_LA,schools_private_LA,schools_secondary_LA,schools_primary,schools_secondary,schools_post16
0,E01000001,London,E09000001,City of London,2014,car,6.75308,employment_centre,Urban major conurbation,0.1298,...,0.0,14.730303,5,0,1,4,0,,,
1,E01000001,London,E09000001,City of London,2014,cycle,6.610821,employment_centre,Urban major conurbation,0.1298,...,0.0,14.730303,5,0,1,4,0,,,
2,E01000001,London,E09000001,City of London,2014,public transport,3.648643,employment_centre,Urban major conurbation,0.1298,...,0.0,14.730303,5,0,1,4,0,,,
3,E01000001,London,E09000001,City of London,2015,car,6.153411,employment_centre,Urban major conurbation,0.1298,...,0.0,14.69697,5,0,1,4,0,,,
4,E01000001,London,E09000001,City of London,2015,cycle,6.501751,employment_centre,Urban major conurbation,0.1298,...,0.0,14.69697,5,0,1,4,0,,,
