# Notebook: Data Loading
*Description:* This notebook is the foundational step of the project: it reads the project datasets and produces data objects for modification and use in later project steps. The cells in the notebook perform the following roles:

* Mount Google Drive data and import libraries
* Define data file paths
* Define function for property price paid data loading
* Define function for borough name data loading
* Define function for local council tax data loading
* Define function for English indices of deprivation data loading
* Define function for Airbnb listings data loading

*Data:* The notebook loads the following datasets:

* Attribute-linked property price paid and energy performance data (57,719 rows; sourced from [University College London / UK Data Service](https://reshare.ukdataservice.ac.uk/854942/))
* CSV of London boroughs categorized by binary `inner/outer` position. Borough names were derived from [borough council tax data](https://data.london.gov.uk/dataset/council-tax-charges-bands-borough), while geographic location was informed by definition in the London Government Act 1963 as described by [Wikipedia](https://en.wikipedia.org/wiki/Inner_London). Note that while City of London and Newham are not officially recognized as boroughs of Inner London, for common statistical purposes, we include both.
* Council tax charges by band and borough (32 rows; sourced from [London Datastore](https://data.london.gov.uk/dataset/council-tax-charges-bands-borough))
* English Indices of Deprivation 2019 for Greater London (4,836 rows; sourced from [London Datastore/UK Government](https://data.london.gov.uk/dataset/indices-of-deprivation))
* Condensed short-term rental (STR) listings (86,359 rows; sourced from [Inside Airbnb](http://insideairbnb.com/explore/))

*Return:* The notebook provides the basis for data loading in forthcoming notebooks. We borrow and repurpose the data loader functions as needed.


# 1. Initialization
*Description:* In this section, we mount Google Drive data, import libraries, and initialize filepaths.



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import libraries
from collections import defaultdict
from dask import dataframe as dd
import pandas as pd

In [None]:
# define data filepaths to pass as args into loader functions, filepaths based on JR G-Drive
filepath_boroughs = "/content/drive/MyDrive/CIS5450/Term_Project/data/ldn_borough/ldn_borough.csv"
filepath_counciltax = "/content/drive/MyDrive/CIS5450/Term_Project/data/ldn_borough/ldn_counciltax2019.csv"
filepath_epdc = "/content/drive/MyDrive/CIS5450/Term_Project/data/uk_property/uk_epdc_prop.csv"
filepath_indices = "/content/drive/MyDrive/CIS5450/Term_Project/data/ldn_borough/ldn_id2019.csv"
filepath_listings = "/content/drive/MyDrive/CIS5450/Term_Project/data/ldn_airbnb/listings_cond.csv"
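
The five paths above share one root directory, so they could also be derived from a single base path. The following is a minimal sketch rather than the notebook's own code: `DATA_ROOT` and `FILEPATHS` are hypothetical names, and `PurePosixPath` is used so the example runs without Drive mounted.

```python
from pathlib import PurePosixPath

# hypothetical base directory, copied from the paths defined above
DATA_ROOT = PurePosixPath("/content/drive/MyDrive/CIS5450/Term_Project/data")

# map a short dataset key to its location under the base directory
FILEPATHS = {
    "boroughs": DATA_ROOT / "ldn_borough" / "ldn_borough.csv",
    "counciltax": DATA_ROOT / "ldn_borough" / "ldn_counciltax2019.csv",
    "epdc": DATA_ROOT / "uk_property" / "uk_epdc_prop.csv",
    "indices": DATA_ROOT / "ldn_borough" / "ldn_id2019.csv",
    "listings": DATA_ROOT / "ldn_airbnb" / "listings_cond.csv",
}

# the loader functions take plain strings, so convert before passing a path in
filepath_boroughs = str(FILEPATHS["boroughs"])
print(filepath_boroughs)
```

With this layout, moving the project folder only requires changing `DATA_ROOT`.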

# 2a. Function: Read Property Price Paid Data

*Description:* This cell comprises two parts: a function `read_epdc_prop_to_df` to read an attribute-linked property file and store the result as a dataframe, and a function `parse_epdc_prop_df` to parse the resulting dataframe.

*Data:* The reader function is designed to accept attribute-linked [property price data](https://reshare.ukdataservice.ac.uk/854942/) for England and Wales for the period 2011-2019. This dataset is a consolidation of two disparate datasets prepared by the UK Government: a [price paid dataset](https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads) and an [energy performance of buildings certificates dataset](https://www.gov.uk/government/statistical-data-sets/live-tables-on-energy-performance-of-buildings-certificates). Use of the attribute-linked data allows for the generation of a property price per square metre valuation not possible with either individual dataset.

As a consolidation of two comprehensive datasets, the attribute-linked dataset contains many columns. Metadata for the most relevant columns is included below:

```
transactionid: A reference number generated automatically when each sale is published.
postcode: Postal code used at time of original transaction.
price: Sale price stated on transfer deed.
dateoftransfer: Date when sale was completed, as stated on transfer deed.
propertytype: Letter code {D: detached, S: semi-detached, T: terraced, F: flat/maisonette, O: other}
oldnew: Indicates age of property {Y: a newly built property, N: an established build}
duration: Housing tenure {F: freehold, L: leasehold}
PAON/SAON: Primary and secondary addressable object name (i.e. address, unit).
tfarea: Total floor area of dwelling unit, in square metres (m^2).
```

**Note:** For more information on the data acquisition process, please refer to the [Project Code Archive](https://colab.research.google.com/drive/1Qog7FldvTOogSIrSKqsDEx8MRwefYr3J?usp=sharing).

*Return:* Function returns a Pandas dataframe of parsed attribute-linked property data.
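
The price per square metre valuation described above is the sale price divided by the total floor area. A minimal sketch on toy rows follows. The prices are taken from the sample output in Section 3, while the floor areas are illustrative values chosen for this example and are not read from the dataset.

```python
import pandas as pd

# toy rows: prices come from the Section 3 sample output;
# the floor areas (m^2) are illustrative assumptions
sample = pd.DataFrame({
    "price": [445000, 580000],
    "tfarea": [47.0, 72.0],
})

# price per square metre, rounded to two decimal places
sample["cost_fl_area"] = (sample["price"] / sample["tfarea"]).round(2)
print(sample)
```

A floor area of zero would yield an infinite ratio here, so such rows are worth checking for before the ratio is used.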

In [None]:
def read_epdc_prop_to_df(filepath):
    # use dask for multiprocessor csv reading support
    file_in = dd.read_csv(filepath, dtype="string")
    # return a pandas df using parsing helper method
    return parse_epdc_prop_df(file_in)

def parse_epdc_prop_df(df):
    # define cols to cast to int
    cols_int = ['id', 'price', 'year', 'numberrooms',
                'BUILDING_REFERENCE_NUMBER']
    # define cols to drop (mostly redundant, unneeded, or artefact)
    to_drop = ['Unnamed: 0', 'towncity', 'county', 'lad11nm', 'rgn11nm',
               'LOCAL_AUTHORITY_LABEL', 'CONSTITUENCY_LABEL']
    # convert dask df to pandas df
    pd_df = df.compute()
    # drop specified cols from df
    pd_df.drop(axis=1, columns=to_drop, inplace=True)
    # for each col in df, attempt to convert to numeric value
    for col in pd_df.columns:
        # convert to numeric format where every value parses; otherwise keep
        # the column as-is (errors='ignore' is deprecated in recent pandas)
        try:
            pd_df[col] = pd.to_numeric(pd_df[col])
        except (ValueError, TypeError):
            pass
        # if column targeted as integer
        if col in cols_int:
            # convert to int rather than float
            pd_df[col] = pd_df[col].astype(int)
    # convert district column to title case
    pd_df['district'] = pd_df['district'].str.title()
    # get (cost / floor area) ratio for each row and round to two dec points
    pd_df['cost_fl_area'] = round(pd_df['price']/pd_df['tfarea'], 2)
    return pd_df
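
The integer cast in `parse_epdc_prop_df` relies on the targeted columns having no missing values, because `astype(int)` raises on gaps. The source does not say whether those columns are complete, so the sketch below is optional hardening on toy data rather than a description of the project's behaviour. It uses pandas' nullable `Int64` dtype.

```python
import pandas as pd

# toy column read as strings, with one missing entry (illustrative values)
raw = pd.Series(["3", None, "5"])

# turn missing or unparseable entries into NaN instead of raising
numeric = pd.to_numeric(raw, errors="coerce")

# nullable integer dtype keeps whole numbers as integers and gaps as <NA>
rooms = numeric.astype("Int64")
print(rooms)
```

The trade-off is that `errors="coerce"` silently hides malformed values, so it suits columns where gaps are expected.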

# 2b. Function: Read Borough Names

*Description:* This cell contains a function `read_boroughs_to_list` which, given a csv of London boroughs and their location in Greater London (inner, outer), returns the Inner London boroughs as a list.

*Data:* The function is designed to accept a two-column .csv of London borough names and location (inner, outer) in Greater London. Borough names were derived from [borough council tax data](https://data.london.gov.uk/dataset/council-tax-charges-bands-borough), while geographic location was informed by definition in the London Government Act 1963 as described by [Wikipedia](https://en.wikipedia.org/wiki/Inner_London). Note that while City of London and Newham are not officially recognized as boroughs of Inner London, for common statistical purposes, we include both.

*Return:* Function returns a list of Inner London boroughs:


```
boroughs = ['Camden', 'City of London', 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Islington', 'Kensington and Chelsea',
'Lambeth', 'Lewisham', 'Newham', 'Southwark', 'Tower Hamlets', 'Wandsworth', 'City of Westminster']
```



In [None]:
def read_boroughs_to_list(filepath):
    # read file
    file_in = pd.read_csv(filepath, dtype={"neighbourhood": "str"})
    # filter by inner boroughs
    file_in = file_in[file_in['location'] == 'inner']
    # return boroughs as list
    return file_in['neighbourhood'].values.tolist()
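
Because `pd.read_csv` also accepts file-like objects, the filter can be checked without Drive access. The sketch below repeats the loader and feeds it a small in-memory csv. The three rows are illustrative only, and the column names `neighbourhood` and `location` are the ones the function expects.

```python
import io
import pandas as pd

def read_boroughs_to_list(filepath):
    # same logic as the loader above
    file_in = pd.read_csv(filepath, dtype={"neighbourhood": "str"})
    file_in = file_in[file_in["location"] == "inner"]
    return file_in["neighbourhood"].values.tolist()

# illustrative rows only; the real file covers every London borough
toy_csv = io.StringIO(
    "neighbourhood,location\n"
    "Camden,inner\n"
    "Barnet,outer\n"
    "Hackney,inner\n"
)
inner = read_boroughs_to_list(toy_csv)
print(inner)
```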

# 2c. Function: Read Local Council Tax Data

*Description:* This cell contains a function `read_council_tax_to_df`, which given a .csv of historical council tax data for each borough in London, will return the data as a Pandas dataframe.

*Data:* The function is designed to accept a .csv of London [borough council tax data](https://data.london.gov.uk/dataset/council-tax-charges-bands-borough). This data is intended as a supplement to provide insight into any prospective variability within the relationship between housing purchase prices and STR frequency in inner London.

*Return:* Function returns London borough council tax data (including bands) as Pandas dataframe.

In [None]:
def read_council_tax_to_df(filepath):
    # initialize dictionary for col types (default factory must take no args)
    d_tax = defaultdict(lambda: "int32")
    d_tax['code'], d_tax['borough'] = "string", "string"
    # read file, applying the col types (defaultdict dtype needs pandas >= 1.5)
    file_in = pd.read_csv(filepath, dtype=d_tax)
    return file_in
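
A column-type dictionary with a default can be checked on a small in-memory sample, as in the sketch below. It assumes pandas 1.5 or newer, which accepts a `defaultdict` for `dtype`. The two rows are copied from the council tax sample output in Section 3, reduced to two band columns for brevity.

```python
import io
from collections import defaultdict
import pandas as pd

# every column defaults to int32 unless listed explicitly;
# the default factory is called with no arguments
d_tax = defaultdict(lambda: "int32")
d_tax["code"], d_tax["borough"] = "string", "string"

# two rows from the Section 3 sample output, with a reduced set of bands
toy_csv = io.StringIO(
    "code,borough,band_a,band_d\n"
    "E09000001,City of London,648,973\n"
    "E09000003,Barnet,1030,1545\n"
)
# passing a defaultdict as dtype assumes pandas >= 1.5
tax = pd.read_csv(toy_csv, dtype=d_tax)
print(tax.dtypes)
```

An `int32` default fails on columns with missing or fractional values, so it only suits files where every band is a whole number.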

# 2d. Function: Read English Indices of Deprivation Data

*Description:* This cell contains a function `read_indices_to_df`, which given a .csv of English Indices of Deprivation (2019) for each borough in London, will return the data as a Pandas dataframe.

*Data:* The function is designed to accept a .csv of [summary indices of deprivation](https://data.london.gov.uk/dataset/indices-of-deprivation) for each borough in London, including indices for income, education, employment, health deprivation and disability, crime, barriers to housing and services, and living environment quality.

The English Indices of Deprivation (2019) are a set of indicators generated by the UK Ministry of Housing, Communities and Local Government to provide a relative measure of deprivation for small areas across England. These small areas (Lower-layer Super Output Areas) are aggregated to the borough level for purposes of this study.

As with the tax data, this data is intended as a supplement to provide insight into any prospective variability within the relationship between housing purchase prices and STR frequency in inner London.

*Return:* Function returns the health, crime, housing-barrier, and living-environment indices of deprivation for London boroughs as a Pandas dataframe.

In [None]:
def read_indices_to_df(filepath):
    # read file
    file_in = pd.read_csv(filepath)
    # generate prefixes and suffixes of columns to keep
    prefix = ['hdep', 'crime', 'housebar', 'env']
    suffix = ['-avgrank', '-rankavgrank', '-avgscore', '-rankavgscore']
    # generate list of columns to keep (every prefix paired with every suffix)
    to_keep = [pfx + sfx for pfx in prefix for sfx in suffix]
    # return subset of data
    return file_in[to_keep]
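
The kept columns are every combination of a domain prefix and a measure suffix, which gives 16 names. The sketch below builds those names with `itertools.product` and compares them with a header before any indexing happens. The `available` list is a hypothetical stand-in for the real csv header.

```python
from itertools import product

prefix = ['hdep', 'crime', 'housebar', 'env']
suffix = ['-avgrank', '-rankavgrank', '-avgscore', '-rankavgscore']

# every prefix paired with every suffix, prefix first
to_keep = [pfx + sfx for pfx, sfx in product(prefix, suffix)]

# hypothetical header, standing in for the columns of the real csv
available = ['code', 'borough', 'hdep-avgrank', 'hdep-avgscore']

# names that would raise a KeyError if used to index the dataframe
missing = [col for col in to_keep if col not in available]
print(len(to_keep), "expected;", len(missing), "missing from the toy header")
```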

# 2e. Function: Read Short-term Rental Listing Data

*Description:* This cell comprises two parts: a function `read_listings_to_df` to read a .csv and store the result as a dataframe, and a function `parse_listings_df` to perform basic data cleanup and encoding as loading proceeds.

*Data:* The reader function is designed to accept condensed Airbnb listing data for Greater London for the period ending in 2020. This dataset is prepared by advocacy group [Inside Airbnb](http://insideairbnb.com/about/) and is drawn from their data archive.

Metadata for the most relevant columns is included below:

```
id: Unique identifier of STR property
name: Name of STR property as it appears on the platform
host_id, host_name: Unique identifier of host and their associated name
neighbourhood: Borough in which the STR is located (renamed to `borough` during parsing)
price: Dollar price per night
minimum_nights: Minimum number of nights required to book STR
```

*Return:* Function returns Pandas dataframe of parsed STR listings for Greater London for a period ending in 2020.

In [None]:
def read_listings_to_df(filepath):
    # assign dtypes to each column
    d_listing = {
        "id": "int32",
        "name": "string",
        "host_id": "int32",
        "host_name": "string",
        "neighbourhood_group": "string",
        "neighbourhood": "string",
        "latitude": "float",
        "longitude": "float",
        "room_type": "string",
        "price": "float",
        "minimum_nights": "int32",
        "number_of_reviews": "int32",
        "last_review": "string",
        "reviews_per_month": "float",
        "calculated_host_listings_count": "int32",
        "availability_365": "int32",
    }
    # read csv using dtype dict
    file_in = pd.read_csv(filepath, dtype=d_listing)
    return parse_listings_df(file_in)


def parse_listings_df(dataframe):
    # rename column for common alignment
    dataframe.rename(inplace=True, columns={'neighbourhood': 'borough'})
    # convert last_review as date/time
    dataframe['last_review'] = pd.to_datetime(dataframe['last_review'],format='%Y-%m-%d')
    # build array of columns to drop
    to_drop = ['name', 'host_name', 'neighbourhood_group']
    # drop specified columns from listings
    dataframe.drop(axis=1, columns=to_drop, inplace=True)
    return dataframe
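
The three parsing steps (rename, date conversion, column drop) can be traced on a two-row sample, as sketched below. The ids, boroughs, and dates follow the listings sample output in Section 3. The `name` and `host_name` values are placeholders, since those columns are dropped before any output is shown.

```python
import io
import pandas as pd

# ids, boroughs, and dates follow the Section 3 sample output;
# the name and host_name values are placeholders
toy_csv = io.StringIO(
    "id,name,host_id,host_name,neighbourhood_group,neighbourhood,last_review\n"
    "13913,Placeholder A,54730,Host A,,Islington,2020-02-22\n"
    "17506,Placeholder B,67915,Host B,,Hammersmith and Fulham,\n"
)
listings = pd.read_csv(toy_csv)

# same three steps as parse_listings_df: rename, parse dates, drop columns
listings = listings.rename(columns={"neighbourhood": "borough"})
listings["last_review"] = pd.to_datetime(listings["last_review"], format="%Y-%m-%d")
listings = listings.drop(columns=["name", "host_name", "neighbourhood_group"])
print(listings)
```

This variant reassigns the result instead of using `inplace=True`, so the frame passed in by a caller is left unmodified. The listing with no reviews ends up with `NaT` in `last_review`.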

# 3. Demonstration: Load Data
*Description:* In this section, we load the requisite data on a test basis. With working data loader functions in place, we reuse them in other notebooks as needed. Note that the data read in is only cleaned in the barest sense; full data cleaning is performed as needed in the later notebooks.

In [None]:
# create dataframes from data loader functions
# initialize list of boroughs
boroughs = read_boroughs_to_list(filepath_boroughs)
# initialize dataframe of English Indices of Deprivation
indices = read_indices_to_df(filepath_indices)
# initialize dataframe of STR listings
listings = read_listings_to_df(filepath_listings)
# initialize dataframe of council tax rates
tax_rates = read_council_tax_to_df(filepath_counciltax)
# initialize dataframe of properties (price paid, energy performance)
properties = read_epdc_prop_to_df(filepath_epdc)

In [None]:
# print boroughs as list
print("List of boroughs: ", boroughs)

List of boroughs:  ['Camden', 'City of London', 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Islington', 'Kensington and Chelsea', 'Lambeth', 'Lewisham', 'Newham', 'Southwark', 'Tower Hamlets', 'Wandsworth', 'City of Westminster']


In [None]:
# print 'filtered' boroughs (i.e. City of Westminster -> Westminster)
borough_filter = list(map(lambda b: "Westminster" if b == "City of Westminster" else b, boroughs))
print("Modified list of boroughs: ", borough_filter)

Modified list of boroughs:  ['Camden', 'City of London', 'Greenwich', 'Hackney', 'Hammersmith and Fulham', 'Islington', 'Kensington and Chelsea', 'Lambeth', 'Lewisham', 'Newham', 'Southwark', 'Tower Hamlets', 'Wandsworth', 'Westminster']
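
The renaming above exists because borough names differ between datasets: the boroughs file uses "City of Westminster" while the listings sample uses "Westminster". If more such differences turn up, a lookup table may scale better than a chain of conditionals. The sketch below is one possible shape for that. `BOROUGH_ALIASES` and `normalise_borough` are hypothetical names, and only the Westminster entry comes from the notebook.

```python
# hypothetical alias table; only the Westminster entry comes from the cell above
BOROUGH_ALIASES = {
    "City of Westminster": "Westminster",
}

def normalise_borough(name):
    # collapse repeated or stray whitespace, then apply any known alias
    cleaned = " ".join(name.split())
    return BOROUGH_ALIASES.get(cleaned, cleaned)

boroughs = ["Camden", "City of London", "City of Westminster"]
borough_filter = [normalise_borough(b) for b in boroughs]
print(borough_filter)
```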


In [None]:
# print English Indices of Deprivation (5 row instances, for review)
indices.head(5)

Unnamed: 0,hdep-avgrank,hdep-rankavgrank,hdep-avgscore,hdep-rankavgscore,crime-avgrank,crime-rankavgrank,crime-avgscore,crime-rankavgscore,housebar-avgrank,housebar-rankavgrank,housebar-avgscore,housebar-rankavgscore,env-avgrank,env-rankavgrank,env-avgscore,env-rankavgscore
0,8758.3,247,-0.7,254,1234.8,317,-1.7,317,28055.8,10,35.7,10,26411.2,10,39.0,13
1,19499.7,94,0.2,96,23668.3,21,0.5,28,31669.7,2,46.3,2,23068.5,41,29.1,52
2,5011.8,297,-1.1,298,17479.5,111,0.1,113,25355.6,20,31.2,21,20275.3,71,24.7,91
3,9446.2,228,-0.6,230,14596.5,165,-0.1,156,20691.0,58,25.9,61,16997.0,123,19.8,144
4,10642.8,210,-0.5,207,21996.4,45,0.4,49,30661.4,4,42.8,3,22811.3,46,28.7,54


In [None]:
# print STR listings (5 row instances, for review)
listings.head(5)

Unnamed: 0,id,host_id,borough,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,13913,54730,Islington,51.56802,-0.11121,Private room,65.0,1,21,2020-02-22,0.18,2,359
1,15400,60302,Kensington and Chelsea,51.48796,-0.16898,Entire home/apt,100.0,10,89,2020-03-16,0.71,1,232
2,17402,67564,Westminster,51.52195,-0.14094,Entire home/apt,300.0,3,42,2019-11-02,0.38,15,307
3,17506,67915,Hammersmith and Fulham,51.47935,-0.19743,Private room,150.0,3,0,NaT,,2,362
4,25023,102813,Wandsworth,51.44687,-0.21874,Entire home/apt,65.0,21,35,2020-03-30,0.7,1,15


In [None]:
# print council tax rates (5 row instances, for review)
tax_rates.head(5)

Unnamed: 0,code,borough,band_a,band_b,band_c,band_d,band_e,band_f,band_g,band_h
0,E09000001,City of London,648,757,865,973,1189,1405,1621,1945
1,E09000002,Barking & Dagenham,1037,1210,1383,1556,1902,2248,2593,3112
2,E09000003,Barnet,1030,1202,1374,1545,1889,2232,2576,3091
3,E09000004,Bexley,1119,1306,1492,1679,2052,2425,2798,3358
4,E09000005,Brent,1055,1231,1407,1583,1935,2286,2638,3166


In [None]:
# print properties listings (5 row instances, for review)
properties.head(5)

Unnamed: 0,id,transactionid,oa11,postcode,price,dateoftransfer,propertytype,oldnew,duration,paon,...,LIGHTING_ENV_EFF,MAIN_FUEL,WIND_TURBINE_COUNT,HEAT_LOSS_CORRIDOOR,UNHEATED_CORRIDOR_LENGTH,FLOOR_HEIGHT,PHOTO_SUPPLY,SOLAR_WATER_HEATING_FLAG,MECHANICAL_VENTILATION,cost_fl_area
0,12186,{79A74E22-352D-1289-E053-6B04A8C01627},E00171041,SW11 8NJ,445000,2018-10-01,F,N,L,"WARWICK BUILDING, 366",...,Very Poor,electricity (not community),0,no corridor,,,,N,natural,9468.09
1,12863,{7E86B6FB-5F2C-458C-E053-6B04A8C0C84C},E00171047,SW11 8NP,580000,2018-10-17,F,N,L,"THE BRIDGE, 334",...,Very Poor,electricity (not community),0,unheated corridor,6.9,,,N,natural,8055.56
2,12951,{79A74E22-35CE-1289-E053-6B04A8C01627},E00171047,SW11 8NG,121250,2018-07-03,F,N,L,"BURNELLI BUILDING, 352",...,Good,electricity (not community),0,unheated corridor,6.3,,,N,natural,2694.44
3,13941,{93E6821E-DF80-40FD-E053-6B04A8C0C1DF},E00171047,SW11 8NG,128000,2019-08-27,F,N,L,"BURNELLI BUILDING, 352",...,Very Good,electricity (not community),0,heated corridor,,,,N,natural,2666.67
4,1446,{8A78B2B0-1418-5CB0-E053-6B04A8C0F504},E00171044,SW11 8PG,1250000,2019-04-30,F,N,L,"OSWALD BUILDING, 374",...,Very Poor,electricity (not community),0,unheated corridor,11.2,,,N,"mechanical, extract only",13888.89
