# Zoopla Data Ingestion

The purpose of this notebook is to extract and wrangle Zoopla data and then store it in a SQL database for further analysis.

Within the Notebook there are several steps:

1. Extract the basic information and the individual links from every search page of each borough search.
2. Scrape the JSON file of each individual house and parse it to extract the interesting information.
3. Wrangle the data to get a structured DataFrame.
4. Extract as much information as possible from the text features.
5. Drop the useless features.
6. Drop the records without enough data.
7. Save the data to a SQL server.

In [1]:
import requests
from bs4 import BeautifulSoup as soup
import numpy as np
import pandas as pd
import math
import json
import time
import re
from datetime import datetime, date, timedelta
from DataIngestZoopla import number_of_search_pages
from DataIngestZoopla import get_borough_url
from DataIngestZoopla import get_main_house_details
from DataIngestZoopla import get_house_inner_details
from DataIngestZoopla import sq_m_features_find
from DataIngestZoopla import sq_ft_features_find
import ast

## 1. Basic information extraction

First I load the list of London boroughs and the keys needed to build the search links.

As the search links did not follow an obvious pattern, I preferred to create them manually.

In [2]:
df_boroughs = pd.read_csv('list_of_boroughs.csv')

Then, using the custom functions get_borough_url and number_of_search_pages, I build the search link for each borough and get the number of result pages and houses in each borough search.

In [4]:
# df_boroughs is already loaded from list_of_boroughs.csv in the previous cell
df_boroughs['borough_url'] = df_boroughs.apply(lambda x: get_borough_url(x['london_brough_links_1'],
                                                                         x['london_brough_links_2']),
                                               axis = 1)
df_boroughs[['pages','house_num']] = df_boroughs['borough_url'].apply(lambda x: number_of_search_pages(x))
df_boroughs
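
The internals of number_of_search_pages are not shown in this notebook. As a rough sketch of the arithmetic such a helper probably needs, assuming a fixed number of results per search page (the page size of 25 and the name `pages_needed` are hypothetical and not taken from DataIngestZoopla):

```python
import math

RESULTS_PER_PAGE = 25  # assumption: hypothetical page size, not confirmed by the notebook

def pages_needed(house_num, per_page=RESULTS_PER_PAGE):
    """Return how many search pages are needed to list house_num results."""
    if house_num <= 0:
        return 0
    # round up so that a partially filled last page is still requested
    return math.ceil(house_num / per_page)

print(pages_needed(101))  # 101 / 25 = 4.04, rounded up to 5
print(pages_needed(100))  # exactly 4 full pages
```

Rounding up rather than down matters here, because the last page of a search is usually only partially filled.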

Using the first search page link of each borough, I call the custom function get_main_house_details to scrape the link and listing date of each house from every search page.

As there are many old ads, I keep only those listed less than 6 months (180 days) ago.

In [15]:
frames = []
for i in range(0, 2):  # replace 2 with df_boroughs.shape[0] to process every borough
    data_link, data_listed = get_main_house_details(df_boroughs['borough_url'][i],
                                                    df_boroughs['pages'][i],
                                                    df_boroughs['house_num'][i])
    print(len(data_link))
    df_houses_temp = pd.DataFrame({'Link': data_link, 'Listed': data_listed},
                                  columns = ['Link', 'Listed'])
    df_houses_temp['Borough'] = df_boroughs['borough_name'][i]
    frames.append(df_houses_temp)

# DataFrame.append was removed in pandas 2.0, so the borough frames are concatenated instead
df_houses = pd.concat(frames)

def parse_listed(text):
    # 'Listed on 26th Jun 2021' -> day without its ordinal suffix, month abbreviation, year
    day, month, year = text.split(' ')[-3:]
    return datetime.strptime(day[:-2] + '-' + month + '-' + year, '%d-%b-%Y')

df_houses['date_listed'] = df_houses['Listed'].apply(parse_listed)
# keep only the ads listed within the last 180 days
df_houses_to_get = df_houses[df_houses['date_listed'] > pd.to_datetime(date.today() - timedelta(days=180))]

Request 1 of 115
Request 2 of 115
Request 3 of 115
...
Request 56 of 115
...
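
To make the date handling in the cell above easier to follow, here is a minimal standalone sketch of the same two steps: parsing a string such as 'Listed on 26th Jun 2021' and checking it against a 180-day cutoff. The helper names `parse_listed` and `is_recent` are illustrative, and the sketch assumes an English locale for the month abbreviation.

```python
from datetime import datetime, date, timedelta

def parse_listed(text):
    """Turn a string such as 'Listed on 26th Jun 2021' into a datetime."""
    day, month, year = text.split(' ')[-3:]
    # day[:-2] strips the ordinal suffix ('st', 'nd', 'rd', 'th')
    return datetime.strptime(f"{day[:-2]}-{month}-{year}", '%d-%b-%Y')

def is_recent(listed_on, today, max_age_days=180):
    """True when the ad was listed less than max_age_days before today."""
    cutoff = datetime.combine(today - timedelta(days=max_age_days), datetime.min.time())
    return listed_on > cutoff

parsed = parse_listed('Listed on 26th Jun 2021')
print(parsed)                               # 2021-06-26 00:00:00
print(is_recent(parsed, date(2021, 7, 1)))  # True
print(is_recent(parsed, date(2022, 7, 1)))  # False
```

Passing `today` explicitly, instead of calling date.today() inside the function, keeps the filter reproducible when the notebook is rerun later.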

## 2. Individual information extraction

Once I have all the individual links, I scrape the JSON file of each house and extract the useful features.

The output of this operation is a DataFrame with all the features for all the houses.

In [175]:
df_houses_details = get_house_inner_details(df_houses_to_get['Link'][:60],
                                            df_houses_to_get['date_listed'][:60],
                                            df_houses_to_get['Borough'][:60])
df_houses_result = df_houses_to_get.join(df_houses_details)
df_houses_wrangling = df_houses_result.copy()

Unnamed: 0,Link,Listed,Borough,date_listed,agency_name,agency_phone,chain_free,address,isRetirementHome,isSharedOwnership,...,firstPublished,lastSale,firstPublishedDate,firstPublishedPrice,lastSaleDate,lastSaleNewBuild,lastSalePrice,sq_ft,sq_ft_num,sq_m_feat
0,/for-sale/details/59016388/,Listed on 26th Jun 2021,Camden,2021-06-26,Keatons - Kentish Town,020 8033 5916,True,"Cliff Court, Cliff Road, London NW1",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-26 16:52:13,285000.0,,,,,,
1,/for-sale/details/59014076/,Listed on 26th Jun 2021,Camden,2021-06-26,London Habitat,020 8033 8980,False,"Rose Joan Mews, London NW6",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-26 11:32:09,1150000.0,,,,[over 1400sqft],1400.0,130.0
2,/for-sale/details/59013561/,Listed on 26th Jun 2021,Camden,2021-06-26,Next Property,020 8022 0180,False,"Leather Lane, Farringdon EC1N",False,False,...,,,,,,,,,,
3,/for-sale/details/59012418/,Listed on 26th Jun 2021,Camden,2021-06-26,Foxtons - Camden,020 3463 6980,False,"North Villas, Camden, London NW1",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-26 00:05:11,850000.0,,,,,,
4,/for-sale/details/59012325/,Listed on 25th Jun 2021,Camden,2021-06-25,AbbeySpring London,020 3544 2531,True,"Goldhurst Terrace, London NW6",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...","{'__typename': 'PriceHistoryLastSale', 'date':...",2021-06-25 23:35:16,1100000.0,2017-10-13,False,655000.0,,,
5,/for-sale/details/59011798/,Listed on 25th Jun 2021,Camden,2021-06-25,Hunters - West Hampstead,020 3542 2160,False,"Kylemore Road, London NW6",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 22:01:56,2350000.0,,,,,,
6,/for-sale/details/59010843/,Listed on 25th Jun 2021,Camden,2021-06-25,Hamptons - City Sales,020 3463 2556,True,"Woburn Court, Bernard Street, London WC1N",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 19:47:54,1250000.0,,,,[1171 sq ft],1171.0,109.0
7,/for-sale/details/59010666/,Listed on 25th Jun 2021,Camden,2021-06-25,Savills - Marylebone & Fitzrovia,020 8022 3384,False,"Goodge Street, Fitzrovia, London W1T",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...","{'__typename': 'PriceHistoryLastSale', 'date':...",2021-06-25 19:21:38,1350000.0,1996-01-22,False,209000.0,,,
8,/for-sale/details/59010250/,Listed on 25th Jun 2021,Camden,2021-06-25,Niche Estates,020 8033 4236,True,"Prince Of Wales Road, London NW5",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 18:13:10,1200000.0,,,,,,
9,/for-sale/details/59010209/,Listed on 25th Jun 2021,Camden,2021-06-25,Foxtons - Clerkenwell,020 3463 2552,False,"Kings Cross Road, King's Cross, London WC1X",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 18:07:40,575000.0,,,,,,


In [2]:
# Temporary: load the borough batches saved earlier
borough_files = ['df_houses_result_Camden.csv', 'df_houses_result_Greenwich.csv',
                 'df_houses_result_Hackney.csv', 'df_houses_result_Hammersmith.csv']
# DataFrame.append was removed in pandas 2.0, so the batches are concatenated instead
df_boroughs = pd.concat([pd.read_csv(f, index_col = 0) for f in borough_files])
df_boroughs.reset_index(drop = True, inplace = True)
df_houses_wrangling = df_boroughs.copy()
df_boroughs

Unnamed: 0,Link,Listed,Borough,date_listed,agency_name,agency_phone,chain_free,address,isRetirementHome,isSharedOwnership,...,priceHistory,floorArea,tenure,detailedDescription,features,furnishedState,title,latitude,longitude,statusSummary
0,/new-homes/details/54204561/,Listed on 30th Jun 2021,Camden,2021-06-30,Savills - Margaret Street RDS,020 8022 3382,True,"Postmark, Mount Pleasant, London WC1X",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,leasehold,"Postmark marks one of the most unique, central...","[""Residents' gym"", '24-hour concierge service'...",,2 bed flat for sale,51.523426,-0.112205,Just added
1,/for-sale/details/59054637/,Listed on 30th Jun 2021,Camden,2021-06-30,"Purplebricks, Head Office",024 7511 8874,False,"West End Lane, West Hampstead NW6",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,leasehold,Very well presented four bedroom ground floor ...,"['Beautiful four bedroom ground floor flat', '...",,4 bed flat for sale,51.544358,-0.191971,Just added
2,/for-sale/details/59054349/,Listed on 30th Jun 2021,Camden,2021-06-30,Chestertons - Hampstead,020 8115 2458,False,"Duncan House, Fellows Road, Belsize Park, Lond...",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,leasehold,An extremely bright south facing upper floor s...,"['Bright upper floor studio apartment', 'Purpo...",,Studio for sale,51.544589,-0.160580,Just added
3,/for-sale/details/59053356/,Listed on 30th Jun 2021,Camden,2021-06-30,Foxtons - Islington,020 3463 9586,False,"Staveley Close, Islington, London N7",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,,Arranged over 2 floors with a separate entranc...,"['Well arranged 4 bedroom maisonette', 'Large ...",,4 bed flat for sale,51.554136,-0.120189,Just added
4,/for-sale/details/59053349/,Listed on 30th Jun 2021,Camden,2021-06-30,Foxtons - Hampstead,020 3463 7492,False,"Willow Road, Hampstead, London NW3",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,,Ideally located in the heart of Hampstead Vill...,['Modern two bedroom raised ground floor flat'...,,2 bed flat for sale,51.557677,-0.174122,Just added
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6076,/for-sale/details/57314883/,Listed on 4th Jan 2021,Hammersmith,2021-01-04,Douglas & Gordon - Fulham,020 8115 3521,False,"Bishops Road, London SW6",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,leasehold,A lovely two double bedroom garden flat of cir...,"['2 double bedrooms', '1 bathroom', '1 recepti...",,2 bed flat for sale,51.479042,-0.204174,
6077,/new-homes/details/57312040/,Listed on 4th Jan 2021,Hammersmith,2021-01-04,Prime London (Central and Riverside),020 7768 1152,True,"Westwood House, Chelsea Creek, London SW6",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,,This brand new and expertly crafted one bedroo...,"['New one bedroom apartment', '612 sq ft / 57 ...",,1 bed flat for sale,51.475423,-0.184906,
6078,/for-sale/details/57311651/,Listed on 4th Jan 2021,Hammersmith,2021-01-04,Stanley Chelsea,020 3641 2739,False,"Tierney Lane, London W6",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...","{'__typename': 'FloorArea', 'unitsLabel': 'sq....",,Luxurious river facing second floor apartment ...,"['96 sq m / 1033,34 sq ft', 'Service Charge - ...",,2 bed flat for sale,51.487442,-0.225316,
6079,/new-homes/details/57311537/,Listed on 4th Jan 2021,Hammersmith,2021-01-04,Prime London (Central and Riverside),020 7768 1152,True,"Westwood House, Chelsea Creek, London SW6",False,False,...,"{'__typename': 'PriceHistory', 'firstPublished...",,,This brand new and expertly crafted two bedroo...,"['Canal facing two bedroom apartment', '844 sq...",,2 bed flat for sale,51.475423,-0.184906,


## 3. Data Wrangling

The first data wrangling step is to extract the floor area from its dictionary and convert it to m^2 if necessary.

In [4]:
df_houses_wrangling['floorArea'].fillna(value = np.nan, inplace = True)
# ast.literal_eval parses the stored dictionary string without executing arbitrary code, unlike eval
df_houses_wrangling['floor_area_amount'] = df_houses_wrangling['floorArea'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else ast.literal_eval(x)['value'])
df_houses_wrangling['floor_area_units'] = df_houses_wrangling['floorArea'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else ast.literal_eval(x)['unitsLabel'])
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: None if 
                                                                  pd.isnull(x['floor_area_amount']) else
                                                                  ((round(x['floor_area_amount']/10.7639)) 
                                                                  if (x['floor_area_units'] == 'sq. ft') 
                                                                  else x['floor_area_amount']), axis = 1)
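
As a worked example of the conversion above, the following sketch parses one stringified floorArea dictionary and converts it to m^2 with the same factor (10.7639 sq ft per m^2). The function name `floor_area_in_sq_m` and the sample records are hypothetical; only the keys 'value' and 'unitsLabel' and the 'sq. ft' label come from the cell above.

```python
import ast

SQ_FT_PER_SQ_M = 10.7639  # same factor as in the cell above

def floor_area_in_sq_m(raw):
    """Parse a stringified floorArea dict and return the area in m^2."""
    if raw is None:
        return None
    area = ast.literal_eval(raw)  # only Python literals are evaluated
    value = area['value']
    if area['unitsLabel'] == 'sq. ft':
        # square feet are converted and rounded to whole square metres
        return round(value / SQ_FT_PER_SQ_M)
    # any other unit is assumed to be square metres already
    return value

# hypothetical records, shaped like the scraped floorArea field
in_feet = "{'__typename': 'FloorArea', 'unitsLabel': 'sq. ft', 'value': 1171}"
in_metres = "{'__typename': 'FloorArea', 'unitsLabel': 'sq. m', 'value': 96}"
print(floor_area_in_sq_m(in_feet))    # 1171 / 10.7639 = 108.79, rounded to 109
print(floor_area_in_sq_m(in_metres))  # 96
```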

Next, I extract some basic information, such as the number of bedrooms, bathrooms and living rooms, from its dictionary.

In [5]:
# Temporary: drop the rows that have no RoomCount
df_houses_wrangling.drop(df_houses_wrangling[df_houses_wrangling['RoomCount'].isnull()].index, 
                         inplace = True, axis = 0)
df_houses_wrangling[df_houses_wrangling['RoomCount'].isnull()]

Unnamed: 0,Link,Listed,Borough,date_listed,agency_name,agency_phone,chain_free,address,isRetirementHome,isSharedOwnership,...,detailedDescription,features,furnishedState,title,latitude,longitude,statusSummary,floor_area_amount,floor_area_units,floor_area_msq


In [6]:
df_houses_wrangling['numBedrooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(ast.literal_eval(x)['numBedrooms']) 
                                                                        else ast.literal_eval(x)['numBedrooms'])
df_houses_wrangling['numBathrooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(ast.literal_eval(x)['numBathrooms']) 
                                                                        else ast.literal_eval(x)['numBathrooms'])
df_houses_wrangling['numLivingRooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(ast.literal_eval(x)['numLivingRooms']) 
                                                                        else ast.literal_eval(x)['numLivingRooms'])
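
The cell above parses each RoomCount string up to six times. A possible alternative, sketched here under the assumption that RoomCount always holds the string form of a dictionary, parses it once and fills missing counts with 0. The helper name `room_counts` and the sample record are hypothetical.

```python
import ast

ROOM_KEYS = ('numBedrooms', 'numBathrooms', 'numLivingRooms')

def room_counts(raw):
    """Parse the RoomCount string once and replace missing counts with 0."""
    parsed = ast.literal_eval(raw)
    # .get tolerates an absent key, which the cell above would report as a KeyError
    return {key: (parsed.get(key) or 0) for key in ROOM_KEYS}

# hypothetical record, shaped like the scraped RoomCount field
example = "{'__typename': 'RoomCount', 'numBedrooms': 2, 'numBathrooms': 1, 'numLivingRooms': None}"
print(room_counts(example))
# {'numBedrooms': 2, 'numBathrooms': 1, 'numLivingRooms': 0}
```

Note the small behaviour difference: a key that is missing altogether becomes 0 here instead of raising an error, which may or may not be what is wanted.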

The next step is to extract historical information. In this case the dictionary varies depending on the information available, so it is important to handle None values explicitly to avoid errors.

In [7]:
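# ast.literal_eval is used instead of eval so the stored string is parsed without executing code
df_houses_wrangling['firstPublished'] = df_houses_wrangling['priceHistory'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else ast.literal_eval(x)['firstPublished'])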
df_houses_wrangling['lastSale'] = df_houses_wrangling['priceHistory'].apply(lambda x: None if 
                                                                            pd.isnull(x) 
                                                                            else ast.literal_eval(x)['lastSale'])
df_houses_wrangling['firstPublishedDate'] = df_houses_wrangling['firstPublished'].apply(lambda x: None if 
                                                                                    pd.isnull(x) 
                                                                    else x['firstPublishedDate'])
df_houses_wrangling['firstPublishedPrice'] = df_houses_wrangling['firstPublished'].apply(lambda x: None 
                                                                                if pd.isnull(x) 
                                                                            else x['priceLabel'])
df_houses_wrangling['lastSaleDate'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                            if pd.isnull(x) 
                                                                            else x['date'])
df_houses_wrangling['lastSaleNewBuild'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                                if pd.isnull(x) 
                                                                            else x['newBuild'])
df_houses_wrangling['lastSalePrice'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                             if pd.isnull(x) 
                                                                             else x['price'])
df_houses_wrangling['firstPublishedPrice'] = df_houses_wrangling['firstPublishedPrice'].apply(lambda x:None 
                                                                                              if pd.isnull(x)\
                                                                 else int(x.replace(',','').split('£')[-1]))
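
The chain of steps above can be illustrated on a single record. The sketch below goes from a stringified priceHistory dictionary to an integer price and returns None whenever a level is missing. The helper name `first_published_price` and the sample records are hypothetical; the keys 'firstPublished' and 'priceLabel' are the ones used in the cell above.

```python
import ast

def first_published_price(raw_history):
    """Return the first published price as an int, or None when unavailable."""
    if raw_history is None:
        return None
    history = ast.literal_eval(raw_history)
    first = history.get('firstPublished')
    if not first or not first.get('priceLabel'):
        return None
    # '£285,000' -> '285000' -> 285000
    return int(first['priceLabel'].replace(',', '').split('£')[-1])

# hypothetical records, shaped like the scraped priceHistory field
with_price = ("{'__typename': 'PriceHistory', "
              "'firstPublished': {'firstPublishedDate': '2021-06-26 16:52:13', "
              "'priceLabel': '£285,000'}, 'lastSale': None}")
without_price = "{'__typename': 'PriceHistory', 'firstPublished': None, 'lastSale': None}"
print(first_published_price(with_price))     # 285000
print(first_published_price(without_price))  # None
```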

The features data can also be used to recover floor area information. This feature is very important, but most of its records are currently empty.

I will use the functions I created to extract any feature that mentions sq ft or sq m in some way.

In [11]:
# features is either a stringified list (when loaded from CSV) or an actual list
df_houses_wrangling['sq_m_feat_from_ft'] = df_houses_wrangling['features'].apply(lambda x: None if x is None
                                                                    else (sq_ft_features_find(ast.literal_eval(x))
                                                                         if isinstance(x,str)
                                                                         else sq_ft_features_find(x)))

df_houses_wrangling['sq_m_feat_from_sq'] = df_houses_wrangling['features'].apply(lambda x: None if x is None 
                                                                    else (sq_m_features_find(ast.literal_eval(x))
                                                                         if isinstance(x,str)
                                                                         else sq_m_features_find(x)))
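
The code of sq_ft_features_find lives in DataIngestZoopla and is not shown here. As a hedged illustration of the idea only, the sketch below uses a regular expression to find the first feature that mentions square feet and converts it to m^2. The function `sq_m_from_sq_ft_features` and its pattern are a hypothetical stand-in, not the author's implementation.

```python
import re

SQ_FT_PER_SQ_M = 10.7639
# a number (optionally with thousands separators) followed by 'sq ft', 'sq. ft' or 'sqft'
SQ_FT_PATTERN = re.compile(r'(\d[\d,]*(?:\.\d+)?)\s*sq\.?\s*ft', re.IGNORECASE)

def sq_m_from_sq_ft_features(features):
    """Return the first floor area found in features, converted to whole m^2."""
    for feature in features:
        match = SQ_FT_PATTERN.search(feature)
        if match:
            sq_ft = float(match.group(1).replace(',', ''))
            return round(sq_ft / SQ_FT_PER_SQ_M)
    return None

print(sq_m_from_sq_ft_features(['2 double bedrooms', 'over 1400sqft']))  # 130
print(sq_m_from_sq_ft_features(['1171 sq ft', 'Balcony']))               # 109
print(sq_m_from_sq_ft_features(['Close to the station']))                # None
```

One limitation of this sketch: it treats commas as thousands separators, so a feature written with a decimal comma, such as '1033,34 sq ft', would be misread and needs separate handling.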

In [13]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6077 entries, 0 to 6080
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Link                 6077 non-null   object 
 1   Listed               6077 non-null   object 
 2   Borough              6077 non-null   object 
 3   date_listed          6077 non-null   object 
 4   agency_name          6077 non-null   object 
 5   agency_phone         6071 non-null   object 
 6   chain_free           6077 non-null   object 
 7   address              6077 non-null   object 
 8   isRetirementHome     6077 non-null   object 
 9   isSharedOwnership    6077 non-null   object 
 10  listingCondition     6077 non-null   object 
 11  listingStatus        6077 non-null   object 
 12  RoomCount            6077 non-null   object 
 13  price                6077 non-null   float64
 14  propertyType         5834 non-null   object 
 15  isAuction            6077 non-null   o

One of the difficulties I am finding is filling the floor size feature. Even though I believe it is a feature of great importance, very few houses have that information as it is: only around 20% so far.

I tried to recover this information from the features object by extracting any number related to sq ft or sq m. Let's see how it improves the previous percentage.

In [28]:
df_houses_wrangling['floor_area_amount'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_ft'] 
                                                                     if np.isnan(x['floor_area_amount'])
                                                                    else x['floor_area_amount'], axis = 1)
df_houses_wrangling['floor_area_amount'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_sq'] 
                                                                    if np.isnan(x['floor_area_amount'])
                                                                    else x['floor_area_amount'], axis = 1)

In [29]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6077 entries, 0 to 6080
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Link                 6077 non-null   object 
 1   Listed               6077 non-null   object 
 2   Borough              6077 non-null   object 
 3   date_listed          6077 non-null   object 
 4   agency_name          6077 non-null   object 
 5   agency_phone         6071 non-null   object 
 6   chain_free           6077 non-null   object 
 7   address              6077 non-null   object 
 8   isRetirementHome     6077 non-null   object 
 9   isSharedOwnership    6077 non-null   object 
 10  listingCondition     6077 non-null   object 
 11  listingStatus        6077 non-null   object 
 12  RoomCount            6077 non-null   object 
 13  price                6077 non-null   float64
 14  propertyType         5834 non-null   object 
 15  isAuction            6077 non-null   o

The situation has improved: coverage is now 26%, although that is still not great.
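
Rather than reading the coverage off the info() output, it can be computed directly as the share of non-null values. The sketch below does this on toy data (the values are made up; only the column names come from this notebook) and also shows a fillna-based way to apply the same two fallbacks.

```python
import numpy as np
import pandas as pd

# toy data: hypothetical values, only the column names come from the notebook
df = pd.DataFrame({
    'floor_area_amount': [96.0, np.nan, np.nan, np.nan, np.nan],
    'sq_m_feat_from_ft': [np.nan, 130.0, np.nan, np.nan, np.nan],
    'sq_m_feat_from_sq': [np.nan, np.nan, 57.0, np.nan, np.nan],
})

# share of rows that already have a floor area
before = df['floor_area_amount'].notna().mean()

# fill the gaps first from the sq ft features, then from the sq m features
df['floor_area_amount'] = (df['floor_area_amount']
                           .fillna(df['sq_m_feat_from_ft'])
                           .fillna(df['sq_m_feat_from_sq']))

after = df['floor_area_amount'].notna().mean()
print(f'coverage before: {before:.0%}, after: {after:.0%}')
# coverage before: 20%, after: 60%
```

The same two lines for `before` and `after` could be run on df_houses_wrangling to track the percentage after each extraction step.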

Let's try now with the description, applying a similar method.