# Zoopla Data Ingestion

The purpose of this notebook is to extract and wrangle Zoopla listing data and finally load it into a SQL database for further analysis.

Within the Notebook there are several steps:

1. Extract the basic information and the individual listing link from every search page of each borough search.
2. Scrape the JSON file of each individual house and parse it to extract the relevant information.
3. Wrangle the data into a structured DataFrame.
4. Extract as much information as possible from the text features.
5. Drop the features that are not useful.
6. Drop the records without enough data.
7. Save the dataset to a SQL server.

In [164]:
import requests
from bs4 import BeautifulSoup as soup
import numpy as np
import pandas as pd
import math
import json
import time
import re
from datetime import datetime, date, timedelta
from DataIngestZoopla import number_of_search_pages
from DataIngestZoopla import get_borough_url
from DataIngestZoopla import get_main_house_details
from DataIngestZoopla import get_house_inner_details
from DataIngestZoopla import sq_m_features_find
from DataIngestZoopla import sq_ft_features_find
from DataIngestZoopla import extract_feature
import ast
import mysql.connector
from mysql.connector import Error
from sqlalchemy import create_engine
import MySQLdb

## 1. Basic information extraction

First I load the list of London boroughs and the keys used to build the search links.

Because the search links do not follow a consistent pattern, I preferred to compile these keys manually.

In [91]:
df_boroughs = pd.read_csv('list_of_boroughs.csv')

Then, using the custom functions get_borough_url and number_of_search_pages, I build the search link for each borough and extract the number of result pages and houses in each borough search.

In [92]:
# df_boroughs was already loaded from list_of_boroughs.csv in the previous cell
df_boroughs['borough_url'] = df_boroughs.apply(lambda x: get_borough_url(x['london_brough_links_1'],
                                                                         x['london_brough_links_2']),
                                               axis = 1)
df_boroughs[['pages','house_num']] = df_boroughs['borough_url'].apply(lambda x: number_of_search_pages(x))
df_boroughs

Starting from each borough's first search page, I use the custom function get_main_house_details to scrape the link and listing date of every house on each search page.

Because many adverts are old, I keep only those listed less than 6 months (180 days) ago.

In [15]:
frames = []
for i in range(df_boroughs.shape[0]):
    data_link, data_listed = get_main_house_details(df_boroughs['borough_url'][i],df_boroughs['pages'][i],
                                                    df_boroughs['house_num'][i])
    print(len(data_link))
    df_houses_temp = pd.DataFrame({'Link':data_link,'Listed':data_listed},
                                  columns = ['Link','Listed'])
    df_houses_temp['Borough'] = df_boroughs['borough_name'][i]
    frames.append(df_houses_temp)

# DataFrame.append was removed in pandas 2.0, so concatenate the per-borough frames instead
df_houses = pd.concat(frames)
        
df_houses['date_listed'] = df_houses['Listed'].apply(lambda x: datetime.strptime(x.split(' ')[-3][:-2]\
                                                                                 + '-' + x.split(' ')[-2]\
                                                                                 + '-' + x.split(' ')[-1],
                                                                                 '%d-%b-%Y'))
df_houses_to_get = df_houses[df_houses['date_listed'] > pd.to_datetime(date.today() - timedelta(days=180))]        

Request 1 of 115
Request 2 of 115
...
Request 56 of 115
...
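The date handling in this step can be sketched in isolation. This is a minimal illustration using only the standard library, assuming the listing text always looks like "Listed on 26th Jun 2021"; the names `parse_listed` and `is_recent` are hypothetical and are not part of DataIngestZoopla.

```python
import re
from datetime import date, datetime, timedelta

# Matches the day, month abbreviation and year in text such as "Listed on 26th Jun 2021".
LISTED_RE = re.compile(r"(\d{1,2})(?:st|nd|rd|th)\s+([A-Za-z]{3})\s+(\d{4})")


def parse_listed(text):
    """Return the listing date, or None when the text has no recognisable date."""
    match = LISTED_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    return datetime.strptime(f"{day}-{month}-{year}", "%d-%b-%Y").date()


def is_recent(listed, today, max_age_days=180):
    """True when the listing is strictly newer than today minus max_age_days."""
    return listed > today - timedelta(days=max_age_days)


listed = parse_listed("Listed on 26th Jun 2021")
print(listed, is_recent(listed, today=date(2021, 7, 1)))
```

Compared with splitting the string on spaces, the regular expression returns None for unexpected text instead of raising an error, which may be easier to handle when a page layout changes.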

## 2. Individual information extraction

Once I have all the individual links, I scrape the JSON file of each house and extract the useful features.

The output of this operation is a DataFrame with one row per house and one column per feature.

In [175]:
df_houses_details = get_house_inner_details(df_houses_to_get['Link'],
                                            df_houses_to_get['date_listed'],
                                            df_houses_to_get['Borough'])
df_houses_result = df_houses_to_get.join(df_houses_details)
df_houses_wrangling = df_houses_result.copy()

Unnamed: 0,Link,Listed,Borough,date_listed,agency_name,agency_phone,chain_free,address,isRetirementHome,isSharedOwnership,...,firstPublished,lastSale,firstPublishedDate,firstPublishedPrice,lastSaleDate,lastSaleNewBuild,lastSalePrice,sq_ft,sq_ft_num,sq_m_feat
0,/for-sale/details/59016388/,Listed on 26th Jun 2021,Camden,2021-06-26,Keatons - Kentish Town,020 8033 5916,True,"Cliff Court, Cliff Road, London NW1",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-26 16:52:13,285000.0,,,,,,
1,/for-sale/details/59014076/,Listed on 26th Jun 2021,Camden,2021-06-26,London Habitat,020 8033 8980,False,"Rose Joan Mews, London NW6",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-26 11:32:09,1150000.0,,,,[over 1400sqft],1400.0,130.0
2,/for-sale/details/59013561/,Listed on 26th Jun 2021,Camden,2021-06-26,Next Property,020 8022 0180,False,"Leather Lane, Farringdon EC1N",False,False,...,,,,,,,,,,
3,/for-sale/details/59012418/,Listed on 26th Jun 2021,Camden,2021-06-26,Foxtons - Camden,020 3463 6980,False,"North Villas, Camden, London NW1",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-26 00:05:11,850000.0,,,,,,
4,/for-sale/details/59012325/,Listed on 25th Jun 2021,Camden,2021-06-25,AbbeySpring London,020 3544 2531,True,"Goldhurst Terrace, London NW6",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...","{'__typename': 'PriceHistoryLastSale', 'date':...",2021-06-25 23:35:16,1100000.0,2017-10-13,False,655000.0,,,
5,/for-sale/details/59011798/,Listed on 25th Jun 2021,Camden,2021-06-25,Hunters - West Hampstead,020 3542 2160,False,"Kylemore Road, London NW6",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 22:01:56,2350000.0,,,,,,
6,/for-sale/details/59010843/,Listed on 25th Jun 2021,Camden,2021-06-25,Hamptons - City Sales,020 3463 2556,True,"Woburn Court, Bernard Street, London WC1N",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 19:47:54,1250000.0,,,,[1171 sq ft],1171.0,109.0
7,/for-sale/details/59010666/,Listed on 25th Jun 2021,Camden,2021-06-25,Savills - Marylebone & Fitzrovia,020 8022 3384,False,"Goodge Street, Fitzrovia, London W1T",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...","{'__typename': 'PriceHistoryLastSale', 'date':...",2021-06-25 19:21:38,1350000.0,1996-01-22,False,209000.0,,,
8,/for-sale/details/59010250/,Listed on 25th Jun 2021,Camden,2021-06-25,Niche Estates,020 8033 4236,True,"Prince Of Wales Road, London NW5",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 18:13:10,1200000.0,,,,,,
9,/for-sale/details/59010209/,Listed on 25th Jun 2021,Camden,2021-06-25,Foxtons - Clerkenwell,020 3463 2552,False,"Kings Cross Road, King's Cross, London WC1X",False,False,...,"{'__typename': 'PriceHistoryFirstPublished', '...",,2021-06-25 18:07:40,575000.0,,,,,,
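The notebook does not show how get_house_inner_details locates the JSON, so the following is only a sketch under a stated assumption: that the listing data is embedded in the page inside a script tag of type application/json. The function name `extract_listing` and the sample HTML are hypothetical.

```python
import json
import re

# Assumption: the listing JSON sits inside a <script type="application/json"> tag.
SCRIPT_RE = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_listing(html, wanted_keys):
    """Return the wanted keys from the first embedded JSON object, or None if absent."""
    match = SCRIPT_RE.search(html)
    if match is None:
        return None
    data = json.loads(match.group(1))
    # dict.get leaves missing keys as None instead of raising KeyError.
    return {key: data.get(key) for key in wanted_keys}


sample = '<html><script type="application/json">{"price": 285000, "tenure": null}</script></html>'
print(extract_listing(sample, ["price", "tenure", "floorArea"]))
```

Returning None for keys that are absent keeps every house on the same set of columns, which matches the many empty cells in the table above.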


## 3. Data Wrangling

The first wrangling step is to extract the floor area from its dictionary and convert it to m^2 when it is given in square feet.

In [86]:
# ast.literal_eval parses the stringified dictionary without executing arbitrary code, unlike eval
df_houses_wrangling['floor_area_amount'] = df_houses_wrangling['floorArea'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else ast.literal_eval(x)['value'])
df_houses_wrangling['floor_area_units'] = df_houses_wrangling['floorArea'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else ast.literal_eval(x)['unitsLabel'])
# Convert square feet to square metres (1 sq m = 10.7639 sq ft); other units are kept as they are
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: None if 
                                                                  pd.isnull(x['floor_area_amount']) else
                                                                  ((round(x['floor_area_amount']/10.7639)) 
                                                                  if (x['floor_area_units'] == 'sq. ft') 
                                                                  else x['floor_area_amount']), axis = 1)
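The same conversion can be written as one small function, which is easier to test on its own. This is a sketch only: `floor_area_m2` is a hypothetical name, and the 'sq. m' label in the second example is an assumption about how metric areas are labelled.

```python
import ast

SQ_FT_PER_SQ_M = 10.7639


def floor_area_m2(raw):
    """Convert a stringified floorArea dictionary to square metres."""
    if raw is None:
        return None
    area = ast.literal_eval(raw)  # evaluates literals only, unlike eval
    if area["unitsLabel"] == "sq. ft":
        # Round to whole square metres, as the notebook cell does.
        return round(area["value"] / SQ_FT_PER_SQ_M)
    # Any other unit is assumed to be square metres already.
    return area["value"]


print(floor_area_m2("{'value': 1400, 'unitsLabel': 'sq. ft'}"))
print(floor_area_m2("{'value': 85, 'unitsLabel': 'sq. m'}"))
```

Parsing the dictionary once per row, rather than once per extracted key, also avoids repeated work on a table of this size.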

Next, I extract basic information such as the number of bedrooms, bathrooms and living rooms from the RoomCount dictionary.

In [87]:
# temporarily drop the rows with an empty RoomCount
df_houses_wrangling.drop(df_houses_wrangling[df_houses_wrangling['RoomCount'].isnull()].index, 
                         inplace = True, axis = 0)
df_houses_wrangling[df_houses_wrangling['RoomCount'].isnull()]

Unnamed: 0,Link,Listed,Borough,date_listed,agency_name,agency_phone,chain_free,address,isRetirementHome,isSharedOwnership,...,detailedDescription,features,furnishedState,title,latitude,longitude,statusSummary,floor_area_amount,floor_area_units,floor_area_msq


In [88]:
df_houses_wrangling['numBedrooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(ast.literal_eval(x)['numBedrooms']) 
                                                                        else ast.literal_eval(x)['numBedrooms'])
df_houses_wrangling['numBathrooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(ast.literal_eval(x)['numBathrooms']) 
                                                                        else ast.literal_eval(x)['numBathrooms'])
df_houses_wrangling['numLivingRooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(ast.literal_eval(x)['numLivingRooms']) 
                                                                        else ast.literal_eval(x)['numLivingRooms'])
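Each of the three columns above parses the same string twice. A possible alternative, sketched here with the hypothetical helper `room_counts`, parses it once and returns all three counts. The `__typename` entry in the sample is an assumption based on the other dictionaries shown in this notebook.

```python
import ast

ROOM_KEYS = ("numBedrooms", "numBathrooms", "numLivingRooms")


def room_counts(raw):
    """Parse the stringified RoomCount dictionary once; missing or null counts become 0."""
    rooms = ast.literal_eval(raw)
    return {key: rooms.get(key) or 0 for key in ROOM_KEYS}


sample = "{'__typename': 'RoomCount', 'numBedrooms': 2, 'numBathrooms': None, 'numLivingRooms': 1}"
print(room_counts(sample))
```

Unlike the cell above, this version also returns 0 when a key is absent altogether, rather than raising a KeyError.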

The next step is to extract the price history. Here the dictionary varies depending on the information available, so each lookup first checks for missing values to avoid errors when indexing None.

In [89]:
df_houses_wrangling['firstPublished'] = df_houses_wrangling['priceHistory'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else ast.literal_eval(x)['firstPublished'])
df_houses_wrangling['lastSale'] = df_houses_wrangling['priceHistory'].apply(lambda x: None if 
                                                                            pd.isnull(x) 
                                                                            else ast.literal_eval(x)['lastSale'])
df_houses_wrangling['firstPublishedDate'] = df_houses_wrangling['firstPublished'].apply(lambda x: None if 
                                                                                    pd.isnull(x) 
                                                                    else x['firstPublishedDate'])
df_houses_wrangling['firstPublishedPrice'] = df_houses_wrangling['firstPublished'].apply(lambda x: None 
                                                                                if pd.isnull(x) 
                                                                            else x['priceLabel'])
df_houses_wrangling['lastSaleDate'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                            if pd.isnull(x) 
                                                                            else x['date'])
df_houses_wrangling['lastSaleNewBuild'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                                if pd.isnull(x) 
                                                                            else x['newBuild'])
df_houses_wrangling['lastSalePrice'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                             if pd.isnull(x) 
                                                                             else x['price'])
df_houses_wrangling['firstPublishedPrice'] = df_houses_wrangling['firstPublishedPrice'].apply(lambda x:None 
                                                                                              if pd.isnull(x)\
                                                                 else int(x.replace(',','').split('£')[-1]))
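Splitting the price label on the currency symbol only works if that symbol is encoded exactly as expected. A more tolerant sketch keeps just the digits; `parse_price_label` is a hypothetical helper, and it assumes labels are whole pounds such as '£285,000'.

```python
import re


def parse_price_label(label):
    """Turn a label such as '£285,000' into 285000; None when no digits are present."""
    if label is None:
        return None
    # Assumption: labels are whole pounds, so dropping every non-digit is safe.
    digits = re.sub(r"[^\d]", "", label)
    return int(digits) if digits else None


print(parse_price_label("£285,000"))
```

A label with pence or an abbreviation like "1.5m" would need extra handling, so this is only suitable if the labels follow the pattern seen in the table above.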

## 4. Feature Extraction

From the features data we can try to extract floor area information, since the floor area column is currently mostly empty but very important.

I will use the functions I created to extract any feature that mentions sq ft or sq m in some form.

In [90]:
df_houses_wrangling['sq_m_feat_from_ft'] = df_houses_wrangling['features'].apply(lambda x: None if pd.isnull(x)
                                                                                else sq_ft_features_find(x))

df_houses_wrangling['sq_m_feat_from_m'] = df_houses_wrangling['features'].apply(lambda x: None if pd.isnull(x)
                                                                                else sq_m_features_find(x))
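The implementations of sq_ft_features_find and sq_m_features_find live in DataIngestZoopla and are not shown here. As a rough idea of what such an extraction could look like, the hypothetical function `area_m2_from_text` below searches free text for a number followed by a square feet or square metres unit and returns whole square metres.

```python
import re

SQ_FT_PER_SQ_M = 10.7639
NUMBER = r"(\d[\d,]*(?:\.\d+)?)"
SQ_FT_RE = re.compile(NUMBER + r"\s*sq\.?\s*(?:ft|feet)", re.IGNORECASE)
SQ_M_RE = re.compile(NUMBER + r"\s*sq\.?\s*(?:m\b|metres|meters)", re.IGNORECASE)


def area_m2_from_text(text):
    """Return the first floor area mentioned in text as whole square metres, else None."""
    match = SQ_FT_RE.search(text)
    if match:
        return round(float(match.group(1).replace(",", "")) / SQ_FT_PER_SQ_M)
    match = SQ_M_RE.search(text)
    if match:
        return round(float(match.group(1).replace(",", "")))
    return None


for text in ["over 1400sqft", "1171 sq ft", "Approx. 85 sq m", "Two double bedrooms"]:
    print(text, "->", area_m2_from_text(text))
```

Taking the first match is a simplification: a description may quote the area of a single room before the total, which could explain implausibly small values for multi-bedroom properties.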

In [91]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50675 entries, 0 to 50739
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Link                 50675 non-null  object 
 1   Listed               50675 non-null  object 
 2   Borough              50675 non-null  object 
 3   date_listed          50675 non-null  object 
 4   agency_name          50675 non-null  object 
 5   agency_phone         50628 non-null  object 
 6   chain_free           50675 non-null  object 
 7   address              50675 non-null  object 
 8   isRetirementHome     50675 non-null  object 
 9   isSharedOwnership    50675 non-null  object 
 10  listingCondition     50675 non-null  object 
 11  listingStatus        50675 non-null  object 
 12  RoomCount            50675 non-null  object 
 13  price                50675 non-null  float64
 14  propertyType         48731 non-null  object 
 15  isAuction            50675 non-null 

One of the difficulties I am finding is filling the floor size feature. I believe it is a feature of great importance, but very few houses provide it directly: only around 20% so far.

I tried to extract this information from the features object by picking up any number associated with sq ft or sq m. Let's see how much it improves the previous percentage.

In [92]:
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_ft'] 
                                                                     if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_m'] 
                                                                    if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)
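The fallback logic above keeps the structured value when it exists and otherwise takes the first extracted candidate. The sketch below shows that rule and a coverage measure in plain Python; `is_missing`, `first_known` and `coverage` are hypothetical helpers, and the three rows are made-up examples.

```python
import math


def is_missing(value):
    """True for None and for float NaN, the two markers of a missing area."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_known(*candidates):
    """Return the first candidate that is not missing, or None if all are."""
    for value in candidates:
        if not is_missing(value):
            return value
    return None


def coverage(values):
    """Share of entries that are not missing, as a fraction between 0 and 1."""
    values = list(values)
    return sum(not is_missing(v) for v in values) / len(values)


rows = [
    (float("nan"), 130, None),   # structured value missing, sq ft feature found
    (85.0, None, None),          # structured value present, kept as is
    (None, None, None),          # nothing found anywhere
]
filled = [first_known(*row) for row in rows]
print(filled, coverage(filled))
```

In pandas the same fallback can be expressed column-wise with Series.fillna, passing the candidate column as the fill value, which avoids the row-by-row apply.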

In [93]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50675 entries, 0 to 50739
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Link                 50675 non-null  object 
 1   Listed               50675 non-null  object 
 2   Borough              50675 non-null  object 
 3   date_listed          50675 non-null  object 
 4   agency_name          50675 non-null  object 
 5   agency_phone         50628 non-null  object 
 6   chain_free           50675 non-null  object 
 7   address              50675 non-null  object 
 8   isRetirementHome     50675 non-null  object 
 9   isSharedOwnership    50675 non-null  object 
 10  listingCondition     50675 non-null  object 
 11  listingStatus        50675 non-null  object 
 12  RoomCount            50675 non-null  object 
 13  price                50675 non-null  float64
 14  propertyType         48731 non-null  object 
 15  isAuction            50675 non-null 

The situation has improved: coverage is now 26%, although that is still not great.

Let's try now with the description, applying a similar method.

In [94]:
df_houses_wrangling['sq_m_feat_from_dft'] = df_houses_wrangling['detailedDescription'].apply(lambda x: None 
                                                                                             if pd.isnull(x) 
                                                                                    else sq_ft_features_find(x))

df_houses_wrangling['sq_m_feat_from_dm'] = df_houses_wrangling['detailedDescription'].apply(lambda x: None 
                                                                                            if pd.isnull(x) 
                                                                                     else sq_m_features_find(x))

In [95]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50675 entries, 0 to 50739
Data columns (total 43 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Link                 50675 non-null  object 
 1   Listed               50675 non-null  object 
 2   Borough              50675 non-null  object 
 3   date_listed          50675 non-null  object 
 4   agency_name          50675 non-null  object 
 5   agency_phone         50628 non-null  object 
 6   chain_free           50675 non-null  object 
 7   address              50675 non-null  object 
 8   isRetirementHome     50675 non-null  object 
 9   isSharedOwnership    50675 non-null  object 
 10  listingCondition     50675 non-null  object 
 11  listingStatus        50675 non-null  object 
 12  RoomCount            50675 non-null  object 
 13  price                50675 non-null  float64
 14  propertyType         48731 non-null  object 
 15  isAuction            50675 non-null 

In [96]:
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_dft'] 
                                                                     if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_dm'] 
                                                                    if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)

In [97]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50675 entries, 0 to 50739
Data columns (total 43 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Link                 50675 non-null  object 
 1   Listed               50675 non-null  object 
 2   Borough              50675 non-null  object 
 3   date_listed          50675 non-null  object 
 4   agency_name          50675 non-null  object 
 5   agency_phone         50628 non-null  object 
 6   chain_free           50675 non-null  object 
 7   address              50675 non-null  object 
 8   isRetirementHome     50675 non-null  object 
 9   isSharedOwnership    50675 non-null  object 
 10  listingCondition     50675 non-null  object 
 11  listingStatus        50675 non-null  object 
 12  RoomCount            50675 non-null  object 
 13  price                50675 non-null  float64
 14  propertyType         48731 non-null  object 
 15  isAuction            50675 non-null 

Using the description feature, coverage improves to almost 40%, but I don't think it is possible to improve it further, so I will drop the rows without floor area information.

Between 15k and 20k houses is a good sample for a reliable study.

In [98]:
df_houses_wrangling2 = df_houses_wrangling.copy()

In [99]:
df_houses_wrangling2.columns

Index(['Link', 'Listed', 'Borough', 'date_listed', 'agency_name',
       'agency_phone', 'chain_free', 'address', 'isRetirementHome',
       'isSharedOwnership', 'listingCondition', 'listingStatus', 'RoomCount',
       'price', 'propertyType', 'isAuction', 'priceHistory', 'floorArea',
       'tenure', 'detailedDescription', 'features', 'furnishedState', 'title',
       'latitude', 'longitude', 'statusSummary', 'floor_area_amount',
       'floor_area_units', 'floor_area_msq', 'numBedrooms', 'numBathrooms',
       'numLivingRooms', 'firstPublished', 'lastSale', 'firstPublishedDate',
       'firstPublishedPrice', 'lastSaleDate', 'lastSaleNewBuild',
       'lastSalePrice', 'sq_m_feat_from_ft', 'sq_m_feat_from_m',
       'sq_m_feat_from_dft', 'sq_m_feat_from_dm'],
      dtype='object')

In [100]:
df_houses_wrangling2 = df_houses_wrangling2.drop(['Link','Listed','agency_phone','RoomCount','priceHistory',
                                                 'floorArea','floor_area_amount','floor_area_units',
                                                  'firstPublished','lastSale','sq_m_feat_from_ft',
                                                  'sq_m_feat_from_m','sq_m_feat_from_dft','sq_m_feat_from_dm'],
                                                axis = 1)
df_houses_wrangling2 = df_houses_wrangling2[df_houses_wrangling2['floor_area_msq'].notnull()].reset_index(drop = True)
df_houses_wrangling2

Unnamed: 0,Borough,date_listed,agency_name,chain_free,address,isRetirementHome,isSharedOwnership,listingCondition,listingStatus,price,...,statusSummary,floor_area_msq,numBedrooms,numBathrooms,numLivingRooms,firstPublishedDate,firstPublishedPrice,lastSaleDate,lastSaleNewBuild,lastSalePrice
0,Camden,2021-06-30,Savills - Margaret Street RDS,True,"Postmark, Mount Pleasant, London WC1X",False,False,new,for_sale,1330000.0,...,Just added,85.0,2,2,1,2021-06-30 23:07:17,1330000.0,,,
1,Camden,2021-06-30,TK International,False,"Holly Bush Vale, Hampstead Village NW3",False,False,pre-owned,for_sale,695000.0,...,Just added,47.0,1,1,1,2021-06-30 18:06:20,695000.0,2018-02-21,False,465000.0
2,Camden,2021-06-30,Knight Frank - Hampstead Sales,False,"Frognal Lane, London NW3",False,False,pre-owned,for_sale,595000.0,...,Just added,49.0,1,1,1,2021-06-30 16:29:58,595000.0,,,
3,Camden,2021-06-30,Barnard Marcus - Auctions,False,"Cosway Street, London NW1",False,False,pre-owned,for_sale,1750000.0,...,Just added,234.0,7,7,0,2021-06-30 16:28:37,1750000.0,,,
4,Camden,2021-06-30,Amberden Estates,False,"Parliament Hill, London NW3",False,False,pre-owned,for_sale,4995000.0,...,Just added,321.0,7,2,0,2021-06-30 16:04:51,4995000.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18438,Waltham Forest,2021-02-01,Linus Jackson Property Agent,True,"Church Road, London E10",False,False,pre-owned,for_sale,335000.0,...,,58.0,2,1,1,2021-02-01 16:49:35,350000.0,,,
18439,Waltham Forest,2021-02-01,Quatremain & Co,False,"Walnut Court, Woodmill Road, London E5",False,False,pre-owned,for_sale,300000.0,...,,46.0,2,1,1,2021-02-01 14:08:26,350000.0,2007-12-06,False,240000.0
18440,Waltham Forest,2021-01-28,The Stow Brothers,False,"Belgrave Road, London E11",False,False,pre-owned,for_sale,342995.0,...,,15.0,2,1,1,2021-01-28 15:16:47,349995.0,2011-03-18,False,152000.0
18441,Waltham Forest,2021-01-28,The Stow Brothers E4,False,"Parade Gardens, London E4",False,False,pre-owned,for_sale,395000.0,...,,26.0,2,1,1,2021-01-28 09:46:03,395000.0,,,


In [101]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18443 entries, 0 to 18442
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Borough              18443 non-null  object 
 1   date_listed          18443 non-null  object 
 2   agency_name          18443 non-null  object 
 3   chain_free           18443 non-null  object 
 4   address              18443 non-null  object 
 5   isRetirementHome     18443 non-null  object 
 6   isSharedOwnership    18443 non-null  object 
 7   listingCondition     18443 non-null  object 
 8   listingStatus        18443 non-null  object 
 9   price                18443 non-null  float64
 10  propertyType         17848 non-null  object 
 11  isAuction            18443 non-null  object 
 12  tenure               11915 non-null  object 
 13  detailedDescription  18443 non-null  object 
 14  features             17085 non-null  object 
 15  furnishedState       788 non-null   

I will drop furnished state and status summary, as almost all of their records are missing and they add little information.

In [102]:
df_houses_wrangling2 = df_houses_wrangling2.drop(['furnishedState','statusSummary'],
                                                axis = 1)

In [103]:
df_houses_wrangling2['listingStatus'].value_counts()

for_sale            18431
sold                    6
sale_under_offer        6
Name: listingStatus, dtype: int64

The listing status feature also carries very little information, since almost every record is for_sale, so it is useless for the purpose of the study and I drop it as well.

In [104]:
df_houses_wrangling2 = df_houses_wrangling2.drop(['listingStatus'],
                                                axis = 1)

In [105]:
df_houses_wrangling2['propertyType'].value_counts()

flat                      13089
terraced                   1471
semi_detached               919
detached                    682
maisonette                  569
studio                      479
end_terrace                 300
town_house                  107
mews                         52
land                         36
bungalow                     29
block_of_flats               25
detached_bungalow            20
cottage                      16
link_detached                10
houseboat                    10
semi_detached_bungalow        9
parking                       8
lodge                         5
retail                        5
barn_conversion               3
country_house                 1
chalet                        1
finca                         1
villa                         1
Name: propertyType, dtype: int64

I will fill the missing values of property type by setting them to "Other".

In the future I will have to deal with the large number of different categories.

In [106]:
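# assign the result back; an in-place fillna on a selected column may not update the DataFrame in recent pandas
df_houses_wrangling2['propertyType'] = df_houses_wrangling2['propertyType'].fillna('Other')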

In [107]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18443 entries, 0 to 18442
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Borough              18443 non-null  object 
 1   date_listed          18443 non-null  object 
 2   agency_name          18443 non-null  object 
 3   chain_free           18443 non-null  object 
 4   address              18443 non-null  object 
 5   isRetirementHome     18443 non-null  object 
 6   isSharedOwnership    18443 non-null  object 
 7   listingCondition     18443 non-null  object 
 8   price                18443 non-null  float64
 9   propertyType         18443 non-null  object 
 10  isAuction            18443 non-null  object 
 11  tenure               11915 non-null  object 
 12  detailedDescription  18443 non-null  object 
 13  features             17085 non-null  object 
 14  title                18443 non-null  object 
 15  latitude             18443 non-null 

In [108]:
df_houses_wrangling2['tenure'].value_counts()

leasehold            7873
freehold             2627
share_of_freehold    1412
feudal                  3
Name: tenure, dtype: int64

Now I will try to fill the missing tenure values using the features and description columns. This time it is simpler, as we just need to find a mention of one of the four options.

In [109]:
patterns = [r'(leasehold)',
            r'(lease)',
            r'(freehold)',
            r'(share.of.freehold)',
            r'(feudal)',
            r'(leaseholder)',
            r'(freeholder)',
            r'(commonhold)']

df_houses_wrangling2['tenure_from_feat'] = df_houses_wrangling2['features'].apply(lambda x: None if pd.isnull(x) 
                                                                                else extract_feature(x,patterns))
df_houses_wrangling2['tenure_from_desc'] = df_houses_wrangling2['detailedDescription'].apply(lambda x: None 
                                                                                             if pd.isnull(x) 
                                                                                else extract_feature(x,patterns))
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18443 entries, 0 to 18442
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Borough              18443 non-null  object 
 1   date_listed          18443 non-null  object 
 2   agency_name          18443 non-null  object 
 3   chain_free           18443 non-null  object 
 4   address              18443 non-null  object 
 5   isRetirementHome     18443 non-null  object 
 6   isSharedOwnership    18443 non-null  object 
 7   listingCondition     18443 non-null  object 
 8   price                18443 non-null  float64
 9   propertyType         18443 non-null  object 
 10  isAuction            18443 non-null  object 
 11  tenure               11915 non-null  object 
 12  detailedDescription  18443 non-null  object 
 13  features             17085 non-null  object 
 14  title                18443 non-null  object 
 15  latitude             18443 non-null 

In [110]:
df_houses_wrangling2['tenure'] = df_houses_wrangling2.apply(lambda x: x['tenure_from_feat'] 
                                                                     if pd.isnull(x['tenure'])
                                                                    else x['tenure'], axis = 1)
df_houses_wrangling2['tenure'] = df_houses_wrangling2.apply(lambda x: x['tenure_from_desc'] 
                                                                     if pd.isnull(x['tenure'])
                                                                    else x['tenure'], axis = 1)
df_houses_wrangling2['tenure'].value_counts()

leasehold            7943
freehold             2976
lease                1583
share_of_freehold    1412
feudal                  3
Name: tenure, dtype: int64

In [111]:
tenure_map = {'lease':'leasehold','feudal':'Other'}
df_houses_wrangling2['tenure'] = df_houses_wrangling2['tenure'].replace(tenure_map)
df_houses_wrangling2['tenure'] = df_houses_wrangling2['tenure'].fillna('Other')
df_houses_wrangling2['tenure'].value_counts()

leasehold            9526
Other                4529
freehold             2976
share_of_freehold    1412
Name: tenure, dtype: int64

In [112]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18443 entries, 0 to 18442
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Borough              18443 non-null  object 
 1   date_listed          18443 non-null  object 
 2   agency_name          18443 non-null  object 
 3   chain_free           18443 non-null  object 
 4   address              18443 non-null  object 
 5   isRetirementHome     18443 non-null  object 
 6   isSharedOwnership    18443 non-null  object 
 7   listingCondition     18443 non-null  object 
 8   price                18443 non-null  float64
 9   propertyType         18443 non-null  object 
 10  isAuction            18443 non-null  object 
 11  tenure               18443 non-null  object 
 12  detailedDescription  18443 non-null  object 
 13  features             17085 non-null  object 
 14  title                18443 non-null  object 
 15  latitude             18443 non-null 

In [113]:
patterns = [r'(terrace)',
            r'(balcony)',
            r'(roof.deck)']

# Flag a listing when any pattern matches in the features or in the description;
# missing text counts as no match.
in_features = df_houses_wrangling2['features'].apply(
    lambda x: pd.notnull(x) and pd.notnull(extract_feature(x, patterns)))
in_description = df_houses_wrangling2['detailedDescription'].apply(
    lambda x: pd.notnull(x) and pd.notnull(extract_feature(x, patterns)))
df_houses_wrangling2['balcony_terrace'] = in_features | in_description
df_houses_wrangling2['balcony_terrace'].value_counts()

False    9901
True     8542
Name: balcony_terrace, dtype: int64

In [114]:
patterns = [r'(car.park)',
            r'(carpark)',
            r'(parking)']

# Flag a listing when any pattern matches in the features or in the description;
# missing text counts as no match.
in_features = df_houses_wrangling2['features'].apply(
    lambda x: pd.notnull(x) and pd.notnull(extract_feature(x, patterns)))
in_description = df_houses_wrangling2['detailedDescription'].apply(
    lambda x: pd.notnull(x) and pd.notnull(extract_feature(x, patterns)))
df_houses_wrangling2['parking'] = in_features | in_description
df_houses_wrangling2['parking'].value_counts()

False    12404
True      6039
Name: parking, dtype: int64

In [115]:
patterns = [r'(garden)',
            r'(backyard)',
            r'(back.yard)',
            r'(patio)']

# Flag a listing when any pattern matches in the features or in the description;
# missing text counts as no match.
in_features = df_houses_wrangling2['features'].apply(
    lambda x: pd.notnull(x) and pd.notnull(extract_feature(x, patterns)))
in_description = df_houses_wrangling2['detailedDescription'].apply(
    lambda x: pd.notnull(x) and pd.notnull(extract_feature(x, patterns)))
df_houses_wrangling2['garden'] = in_features | in_description
df_houses_wrangling2['garden'].value_counts()

True     10196
False     8247
Name: garden, dtype: int64
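
The three cells above repeat the same idea: search the features and the description for a small set of keywords and store a boolean flag. Since extract_feature lives in DataIngestZoopla and is not shown here, the following is only a sketch of an equivalent vectorised check with pandas string methods. The helper name flag_keywords and the toy rows are hypothetical, and the patterns are written without capture groups so that str.contains does not warn about them.

```python
import pandas as pd

def flag_keywords(df, patterns, text_columns=('features', 'detailedDescription')):
    """Return a boolean Series: True where any pattern matches any text column."""
    # Join the alternatives into one non-capturing regex, e.g. (?:garden|patio).
    regex = '(?:' + '|'.join(patterns) + ')'
    flags = pd.Series(False, index=df.index)
    for column in text_columns:
        # na=False treats missing text as "no match" instead of propagating NaN.
        matches = df[column].str.contains(regex, case=False, na=False)
        flags = flags | matches
    return flags

# Toy rows standing in for the scraped listings (made-up values).
toy = pd.DataFrame({
    'features': ['South-facing garden', None, 'Lift'],
    'detailedDescription': ['Two bedrooms', 'Paved back yard', 'Close to station'],
})
toy['garden'] = flag_keywords(toy, [r'garden', r'backyard', r'back.yard', r'patio'])
print(toy['garden'].tolist())  # [True, True, False]
```

Whether this gives exactly the same counts as extract_feature depends on how that function handles case and partial words, so it would need checking against the real data before swapping it in.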

In [116]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18443 entries, 0 to 18442
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Borough              18443 non-null  object 
 1   date_listed          18443 non-null  object 
 2   agency_name          18443 non-null  object 
 3   chain_free           18443 non-null  object 
 4   address              18443 non-null  object 
 5   isRetirementHome     18443 non-null  object 
 6   isSharedOwnership    18443 non-null  object 
 7   listingCondition     18443 non-null  object 
 8   price                18443 non-null  float64
 9   propertyType         18443 non-null  object 
 10  isAuction            18443 non-null  object 
 11  tenure               18443 non-null  object 
 12  detailedDescription  18443 non-null  object 
 13  features             17085 non-null  object 
 14  title                18443 non-null  object 
 15  latitude             18443 non-null 

firstPublishedDate shows a small number of missing values. I will fill them with the value in date_listed, on the assumption that the house is being published for the first time. I will do the same for firstPublishedPrice, filling it with price.

First I need to convert the feature to the same format as date_listed by keeping only the date part of the string.

In [132]:
df_houses_wrangling2['firstPublishedDate'] = df_houses_wrangling2['firstPublishedDate'].apply(lambda x: None 
                                                                                              if pd.isnull(x)
                                                                                             else x.split(' ')[0])
df_houses_wrangling2['firstPublishedDate']

0        2021-06-30
1        2021-06-30
2        2021-06-30
3        2021-06-30
4        2021-06-30
            ...    
18438    2021-02-01
18439    2021-02-01
18440    2021-01-28
18441    2021-01-28
18442    2021-01-27
Name: firstPublishedDate, Length: 18443, dtype: object

In [141]:
df_houses_wrangling2['firstPublishedDate'] = df_houses_wrangling2['firstPublishedDate'].fillna(df_houses_wrangling2['date_listed'])

In [145]:
df_houses_wrangling2['firstPublishedPrice'] = df_houses_wrangling2['firstPublishedPrice'].fillna(df_houses_wrangling2['price'])
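
The date trimming and the two fills can be checked on a small example. This is a minimal sketch with made-up values, assuming the raw firstPublishedDate is a string with the time after a space (which is what the split(' ') above implies).

```python
import pandas as pd

# Toy rows (made-up values); the second row has no first-published information.
toy = pd.DataFrame({
    'date_listed': ['2021-06-30', '2021-02-01', '2021-01-28'],
    'price': [1330000.0, 335000.0, 342995.0],
    'firstPublishedDate': ['2021-06-30 10:15:00', None, '2021-01-20 08:00:00'],
    'firstPublishedPrice': [1330000.0, None, 349995.0],
})

# Keep only the date part; missing values stay missing.
toy['firstPublishedDate'] = toy['firstPublishedDate'].str.split(' ').str[0]

# Fall back to the listing date and the current price where the first-published values are missing.
toy['firstPublishedDate'] = toy['firstPublishedDate'].fillna(toy['date_listed'])
toy['firstPublishedPrice'] = toy['firstPublishedPrice'].fillna(toy['price'])

print(toy[['firstPublishedDate', 'firstPublishedPrice']])
```

Passing a Series to fillna aligns on the index, so each missing value is filled from the same row.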

In [148]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18443 entries, 0 to 18442
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Borough              18443 non-null  object 
 1   date_listed          18443 non-null  object 
 2   agency_name          18443 non-null  object 
 3   chain_free           18443 non-null  object 
 4   address              18443 non-null  object 
 5   isRetirementHome     18443 non-null  object 
 6   isSharedOwnership    18443 non-null  object 
 7   listingCondition     18443 non-null  object 
 8   price                18443 non-null  float64
 9   propertyType         18443 non-null  object 
 10  isAuction            18443 non-null  object 
 11  tenure               18443 non-null  object 
 12  detailedDescription  18443 non-null  object 
 13  features             17085 non-null  object 
 14  title                18443 non-null  object 
 15  latitude             18443 non-null 

Now I will drop the text features that have already been used, along with the last sale columns, which have too many missing values and provide very little information.

In [149]:
df_houses_wrangling2.drop(['detailedDescription','features','lastSaleDate','lastSaleNewBuild',
                           'lastSalePrice','tenure_from_feat','tenure_from_desc'],
                         inplace = True,
                         axis = 1)
df_houses_wrangling2

| | Borough | date_listed | agency_name | chain_free | address | isRetirementHome | isSharedOwnership | listingCondition | price | propertyType | ... | longitude | floor_area_msq | numBedrooms | numBathrooms | numLivingRooms | firstPublishedDate | firstPublishedPrice | balcony_terrace | parking | garden |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Camden | 2021-06-30 | Savills - Margaret Street RDS | True | Postmark, Mount Pleasant, London WC1X | False | False | new | 1330000.0 | flat | ... | -0.112205 | 85.0 | 2 | 2 | 1 | 2021-06-30 | 1330000.0 | True | False | False |
| 1 | Camden | 2021-06-30 | TK International | False | Holly Bush Vale, Hampstead Village NW3 | False | False | pre-owned | 695000.0 | flat | ... | -0.179551 | 47.0 | 1 | 1 | 1 | 2021-06-30 | 695000.0 | False | True | True |
| 2 | Camden | 2021-06-30 | Knight Frank - Hampstead Sales | False | Frognal Lane, London NW3 | False | False | pre-owned | 595000.0 | flat | ... | -0.187055 | 49.0 | 1 | 1 | 1 | 2021-06-30 | 595000.0 | False | False | True |
| 3 | Camden | 2021-06-30 | Barnard Marcus - Auctions | False | Cosway Street, London NW1 | False | False | pre-owned | 1750000.0 | block_of_flats | ... | -0.165689 | 234.0 | 7 | 7 | 0 | 2021-06-30 | 1750000.0 | False | False | True |
| 4 | Camden | 2021-06-30 | Amberden Estates | False | Parliament Hill, London NW3 | False | False | pre-owned | 4995000.0 | semi_detached | ... | -0.162663 | 321.0 | 7 | 2 | 0 | 2021-06-30 | 4995000.0 | True | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18438 | Waltham Forest | 2021-02-01 | Linus Jackson Property Agent | True | Church Road, London E10 | False | False | pre-owned | 335000.0 | flat | ... | -0.016124 | 58.0 | 2 | 1 | 1 | 2021-02-01 | 350000.0 | False | False | False |
| 18439 | Waltham Forest | 2021-02-01 | Quatremain & Co | False | Walnut Court, Woodmill Road, London E5 | False | False | pre-owned | 300000.0 | flat | ... | -0.050792 | 46.0 | 2 | 1 | 1 | 2021-02-01 | 350000.0 | False | False | False |
| 18440 | Waltham Forest | 2021-01-28 | The Stow Brothers | False | Belgrave Road, London E11 | False | False | pre-owned | 342995.0 | flat | ... | 0.023347 | 15.0 | 2 | 1 | 1 | 2021-01-28 | 349995.0 | False | True | True |
| 18441 | Waltham Forest | 2021-01-28 | The Stow Brothers E4 | False | Parade Gardens, London E4 | False | False | pre-owned | 395000.0 | flat | ... | -0.015026 | 26.0 | 2 | 1 | 1 | 2021-01-28 | 395000.0 | True | True | True |


In [171]:
engine = create_engine('mysql+mysqldb://root:d12d12@127.0.0.1/zoopla_houses', echo=False)

In [172]:
df_houses_wrangling2.to_sql('zoopla_houses', con=engine)
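
Two defaults of to_sql are worth keeping in mind here: if_exists='fail' raises an error when the table already exists, so the cell above cannot be re-run as is, and index=True writes the DataFrame index as an extra column. The sketch below shows both parameters and a read-back check. It uses an in-memory SQLite connection purely as a stand-in so that it runs without a MySQL server, and the toy rows are made up; with the engine created above the same to_sql arguments apply.

```python
import sqlite3
import pandas as pd

# Toy rows standing in for the final DataFrame (made-up subset of columns).
toy = pd.DataFrame({
    'Borough': ['Camden', 'Waltham Forest'],
    'price': [695000.0, 395000.0],
    'garden': [True, True],
})

# In-memory SQLite stands in for the MySQL engine (assumption for illustration only).
con = sqlite3.connect(':memory:')

# if_exists='replace' makes the cell safe to re-run; index=False skips the index column.
toy.to_sql('zoopla_houses', con=con, if_exists='replace', index=False)
toy.to_sql('zoopla_houses', con=con, if_exists='replace', index=False)  # second run does not fail

# Read the table back to confirm what was stored.
stored = pd.read_sql('SELECT * FROM zoopla_houses', con)
print(len(stored), list(stored.columns))
con.close()
```

If the table should accumulate rows from several scraping runs instead of being overwritten, if_exists='append' is the alternative to consider.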