# Zoopla Data Ingestion

The purpose of this Notebook is to extract and wrangle zoopla data to finally input in a SQL file for further analysis.

Within the Notebook there are several steps:

1. Extract the basic information and individual link for all the search pages of each borough search.
2. Scrape the JSON file of each individual house and parse it to extract the interesting information.
3. Wrangle the data to get an structured DataFrame.
4. Extract from the text features as much information as possible.
5. Drop the useless features.
6. Drop the records with not enough data.
7. Save the database in a SQL server.

In [1]:
import requests
from bs4 import BeautifulSoup as soup
import numpy as np
import pandas as pd
import math
import json
import time
import re
from datetime import datetime, date, timedelta
from DataIngestZoopla import number_of_search_pages
from DataIngestZoopla import get_borough_url
from DataIngestZoopla import get_main_house_details
from DataIngestZoopla import get_house_inner_details
from DataIngestZoopla import sq_m_features_find
from DataIngestZoopla import sq_ft_features_find
from DataIngestZoopla import extract_feature
import ast
import mysql.connector
from mysql.connector import Error
from sqlalchemy import create_engine
import MySQLdb

## 1. Basic information extraction

First I load the list of London Boroughs and the keys to make the search links.

As the search links had not a complete sense, I preferred to create them manually.

In [2]:
df_boroughs = pd.read_csv('data_temp/list_of_boroughs.csv', index_col = 0)

Then using the custom functions get_borough_url and number_of_search_pages I extract the links and number of pages for each page in each borough search.

In [3]:
#df_boroughs = pd.DataFrame(df_dict)
df_boroughs['borough_url'] = df_boroughs.apply(lambda x: get_borough_url(x['london_brough_links_1']\
                                                                         ,x['london_brough_links_2']), 
                                               axis = 1)
df_boroughs[['pages','house_num']] = df_boroughs['borough_url'].apply(lambda x: number_of_search_pages(x))
df_boroughs

Unnamed: 0,borough_name,london_brough_links_1,london_brough_links_2,borough_url,pages,house_num
0,Camden,camden-london-borough,Camden (London Borough),https://www.zoopla.co.uk/for-sale/property/cam...,102,2544
1,Greenwich,greenwich-royal-borough,Greenwich (Royal Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/gre...,57,1417
2,Hackney,hackney-london-borough,Hackney (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/hac...,69,1720
3,Hammersmith,hammersmith-and-fulham-london-borough,Hammersmith and Fulham (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/ham...,80,1988
4,Islington,islington-london-borough,Islington (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/isl...,90,2246
5,Kensington and Chelsea,kensington-and-chelsea-royal-borough,Kensington and Chelsea (Royal Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/ken...,147,3672
6,Lambeth,lambeth-london-borough,Lambeth (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/lam...,112,2798
7,Lewisham,lewisham-london-borough,Lewisham (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/lew...,67,1670
8,Southwark,southwark-london-borough,Southwark (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/sou...,94,2338
9,Tower Hamlets,tower-hamlets-london-borough,Tower Hamlets (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/tow...,152,3776


In [4]:
df_boroughs = df_boroughs[df_boroughs['borough_name'] == 'City of London']
df_boroughs.reset_index(inplace = True, drop = True)
df_boroughs

Unnamed: 0,borough_name,london_brough_links_1,london_brough_links_2,borough_url,pages,house_num
0,City of London,city-of-london-london-borough,City of London (London Borough)%2C London,https://www.zoopla.co.uk/for-sale/property/cit...,13,307


Using the list of boroughs with its link for the first search page I use the custom function get_main_house_details to scrape from each search page the link and listed date of each house.

As there are many old ads, I filter to get just the ones less than 6 months old.

In [5]:
for i in range(0,df_boroughs.shape[0]):
    data_link, data_listed = get_main_house_details(df_boroughs['borough_url'][i],df_boroughs['pages'][i],
                                                    df_boroughs['house_num'][i])
    print(len(data_link))
    if i == 0:
        df_houses = pd.DataFrame({'Link':data_link,'Listed':data_listed},
                     columns = ['Link','Listed'])
        df_houses['Borough'] = df_boroughs['borough_name'][i]
    else:
        df_houses_temp = pd.DataFrame({'Link':data_link,'Listed':data_listed},
                     columns = ['Link','Listed'])
        df_houses_temp['Borough'] = df_boroughs['borough_name'][i]
        df_houses = df_houses.append(df_houses_temp)
        
df_houses['date_listed'] = df_houses['Listed'].apply(lambda x: datetime.strptime(x.split(' ')[-3][:-2]\
                                                                                 + '-' + x.split(' ')[-2]\
                                                                                 + '-' + x.split(' ')[-1],
                                                                                 '%d-%b-%Y'))
df_houses_to_get = df_houses[df_houses['date_listed'] > pd.to_datetime(date.today() - timedelta(days=180))]        

Request 1 of 13
Request 2 of 13
Request 3 of 13
Request 4 of 13
Request 5 of 13
Request 6 of 13
Request 7 of 13
Request 8 of 13
Request 9 of 13
Request 10 of 13
Request 11 of 13
Request 12 of 13
Request 13 of 13
307


## 2. Individual information extraction

Once I have all the individual links I scrape the JSON file of each house and extract the useful features.

The output of this operation would be a DataFrame with all the features and all the houses.

In [6]:
df_houses_wrangling = df_houses_wrangling.append(df_houses_result, ignore_index = True)
display(df_houses_wrangling.shape)
display(df_houses_wrangling.tail())

NameError: name 'df_houses_wrangling' is not defined

In [7]:
df_houses_to_get_temp = df_houses_to_get.copy()
df_houses_details = get_house_inner_details(df_houses_to_get_temp['Link'],
                                            df_houses_to_get_temp['date_listed'],
                                            df_houses_to_get_temp['Borough'])
df_houses_to_get_temp.reset_index(inplace = True, drop = True)
df_houses_result = df_houses_to_get_temp.join(df_houses_details)
df_houses_wrangling = df_houses_result.copy()

Request 0 of 149
Request 1 of 149
Request 2 of 149
Request 3 of 149
Request 4 of 149
Request 5 of 149
Request 6 of 149
Request 7 of 149
Request 8 of 149
Request 9 of 149
Request 10 of 149
Request 11 of 149
Request 12 of 149
Request 13 of 149
Request 14 of 149
Request 15 of 149
Request 16 of 149
Request 17 of 149
Request 18 of 149
Request 19 of 149
Request 20 of 149
Request 21 of 149
Request 22 of 149
Request 23 of 149
Request 24 of 149
Request 25 of 149
Request 26 of 149
Request 27 of 149
Request 28 of 149
Request 29 of 149
Request 30 of 149
Request 31 of 149
Request 32 of 149
Request 33 of 149
Request 34 of 149
Request 35 of 149
Request 36 of 149
Request 37 of 149
Request 38 of 149
Request 39 of 149
Request 40 of 149
Request 41 of 149
Request 42 of 149
Request 43 of 149
Request 44 of 149
Request 45 of 149
Request 46 of 149
Request 47 of 149
Request 48 of 149
Request 49 of 149
Request 50 of 149
Request 51 of 149
Request 52 of 149
Request 53 of 149
Request 54 of 149
Request 55 of 149
Re

## 3. Data Wrangling

The first item of data wrangling is to extract and convert the floor area from its dictionary and convert to m^2 if necessary.

In [8]:
df_houses_wrangling['floorArea'].fillna(value = np.nan, inplace = True)
df_houses_wrangling['floor_area_amount'] = df_houses_wrangling['floorArea'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else x['value'])
df_houses_wrangling['floor_area_units'] = df_houses_wrangling['floorArea'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else x['unitsLabel'])
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: None if 
                                                                  pd.isnull(x['floor_area_amount']) else
                                                                  ((round(x['floor_area_amount']/10.7639)) 
                                                                  if (x['floor_area_units'] == 'sq. ft') 
                                                                  else x['floor_area_amount']), axis = 1)

Now, to extract some basic information such as amount of beds, baths, etc. from its dictionary.

In [9]:
# temporally to delete empty lines
df_houses_wrangling.drop(df_houses_wrangling[df_houses_wrangling['RoomCount'].isnull()].index, 
                         inplace = True, axis = 0)
df_houses_wrangling[df_houses_wrangling['RoomCount'].isnull()]

Unnamed: 0,Link,Listed,Borough,date_listed,agency_name,agency_phone,chain_free,address,isRetirementHome,isSharedOwnership,...,detailedDescription,features,furnishedState,title,latitude,longitude,statusSummary,floor_area_amount,floor_area_units,floor_area_msq


In [10]:
df_houses_wrangling['numBedrooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(x['numBedrooms']) 
                                                                        else x['numBedrooms'])
df_houses_wrangling['numBathrooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(x['numBathrooms']) 
                                                                        else x['numBathrooms'])
df_houses_wrangling['numLivingRooms'] = df_houses_wrangling['RoomCount'].apply(lambda x: 0\
                                                  if pd.isnull(x['numLivingRooms']) 
                                                                        else x['numLivingRooms'])

Next step is to extract historical information. In this case the dictionary varies depending on the information available, so it is important to avoid errors trying to use None values.

In [11]:
df_houses_wrangling['firstPublished'] = df_houses_wrangling['priceHistory'].apply(lambda x: None if 
                                                                                  pd.isnull(x) 
                                                                        else x['firstPublished'])
df_houses_wrangling['lastSale'] = df_houses_wrangling['priceHistory'].apply(lambda x: None if 
                                                                            pd.isnull(x) 
                                                                            else x['lastSale'])
df_houses_wrangling['firstPublishedDate'] = df_houses_wrangling['firstPublished'].apply(lambda x: None if 
                                                                                    pd.isnull(x) 
                                                                    else x['firstPublishedDate'])
df_houses_wrangling['firstPublishedPrice'] = df_houses_wrangling['firstPublished'].apply(lambda x: None 
                                                                                if pd.isnull(x) 
                                                                            else x['priceLabel'])
df_houses_wrangling['lastSaleDate'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                            if pd.isnull(x) 
                                                                            else x['date'])
df_houses_wrangling['lastSaleNewBuild'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                                if pd.isnull(x) 
                                                                            else x['newBuild'])
df_houses_wrangling['lastSalePrice'] = df_houses_wrangling['lastSale'].apply(lambda x: None 
                                                                             if pd.isnull(x) 
                                                                             else x['price'])
df_houses_wrangling['firstPublishedPrice'] = df_houses_wrangling['firstPublishedPrice'].apply(lambda x:None 
                                                                                              if pd.isnull(x)\
                                                                 else int(x.replace(',','').split('£')[-1]))

## 4. Feature Extraction

From features data we can try to extract floor are information, as currently this feature is full of empty records but it is very important.

I will use the functions I created to extract any feature that contains the words sq ft or sq m in some way.

In [12]:
df_houses_wrangling['sq_m_feat_from_ft'] = df_houses_wrangling['features'].apply(lambda x: None if np.all(pd.isnull(x))
                                                                                else sq_ft_features_find(" - ".join(x)))

df_houses_wrangling['sq_m_feat_from_m'] = df_houses_wrangling['features'].apply(lambda x: None if np.all(pd.isnull(x))
                                                                                else sq_m_features_find(" - ".join(x)))

In [13]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Link                 150 non-null    object        
 1   Listed               150 non-null    object        
 2   Borough              150 non-null    object        
 3   date_listed          150 non-null    datetime64[ns]
 4   agency_name          150 non-null    object        
 5   agency_phone         150 non-null    object        
 6   chain_free           150 non-null    bool          
 7   address              150 non-null    object        
 8   isRetirementHome     150 non-null    bool          
 9   isSharedOwnership    150 non-null    bool          
 10  listingCondition     150 non-null    object        
 11  listingStatus        150 non-null    object        
 12  RoomCount            150 non-null    object        
 13  price                150 non-null  

One the difficulties I am finding is to fill the floor size feature, even though I believe it is a feature of great importance there is very little houses with that information as it is. In fact, just around the 20% so far.

I tried to extract this information from the features object by extracting any number related with sq ft or sq m, let's see how it improves the previous percentage.

In [14]:
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_ft'] 
                                                                     if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_m'] 
                                                                    if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)

In [15]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 41 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Link                 150 non-null    object        
 1   Listed               150 non-null    object        
 2   Borough              150 non-null    object        
 3   date_listed          150 non-null    datetime64[ns]
 4   agency_name          150 non-null    object        
 5   agency_phone         150 non-null    object        
 6   chain_free           150 non-null    bool          
 7   address              150 non-null    object        
 8   isRetirementHome     150 non-null    bool          
 9   isSharedOwnership    150 non-null    bool          
 10  listingCondition     150 non-null    object        
 11  listingStatus        150 non-null    object        
 12  RoomCount            150 non-null    object        
 13  price                150 non-null  

The situation has improved and now there is a better percentage (26%), even though still not great.

Let's try now with the description, applying a similar method.

In [16]:
df_houses_wrangling['sq_m_feat_from_dft'] = df_houses_wrangling['detailedDescription'].apply(lambda x: None 
                                                                                             if pd.isnull(x) 
                                                                                    else sq_ft_features_find(x))

df_houses_wrangling['sq_m_feat_from_dm'] = df_houses_wrangling['detailedDescription'].apply(lambda x: None 
                                                                                            if pd.isnull(x) 
                                                                                     else sq_m_features_find(x))

In [17]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 43 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Link                 150 non-null    object        
 1   Listed               150 non-null    object        
 2   Borough              150 non-null    object        
 3   date_listed          150 non-null    datetime64[ns]
 4   agency_name          150 non-null    object        
 5   agency_phone         150 non-null    object        
 6   chain_free           150 non-null    bool          
 7   address              150 non-null    object        
 8   isRetirementHome     150 non-null    bool          
 9   isSharedOwnership    150 non-null    bool          
 10  listingCondition     150 non-null    object        
 11  listingStatus        150 non-null    object        
 12  RoomCount            150 non-null    object        
 13  price                150 non-null  

In [18]:
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_dft'] 
                                                                     if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)
df_houses_wrangling['floor_area_msq'] = df_houses_wrangling.apply(lambda x: x['sq_m_feat_from_dm'] 
                                                                    if np.isnan(x['floor_area_msq'])
                                                                    else x['floor_area_msq'], axis = 1)

In [19]:
df_houses_wrangling.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 43 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Link                 150 non-null    object        
 1   Listed               150 non-null    object        
 2   Borough              150 non-null    object        
 3   date_listed          150 non-null    datetime64[ns]
 4   agency_name          150 non-null    object        
 5   agency_phone         150 non-null    object        
 6   chain_free           150 non-null    bool          
 7   address              150 non-null    object        
 8   isRetirementHome     150 non-null    bool          
 9   isSharedOwnership    150 non-null    bool          
 10  listingCondition     150 non-null    object        
 11  listingStatus        150 non-null    object        
 12  RoomCount            150 non-null    object        
 13  price                150 non-null  

Using the description feature, the situation improves to almost 40%, but I don't think it is possible to improve further. If that is the case, I will drop the rows without floor area information. 

In that case I will drop the null values rows. Between 15k and 20k is a good sample of houses to make a reliable study,

In [20]:
df_houses_wrangling2 = df_houses_wrangling.copy()

In [21]:
df_houses_wrangling2.columns

Index(['Link', 'Listed', 'Borough', 'date_listed', 'agency_name',
       'agency_phone', 'chain_free', 'address', 'isRetirementHome',
       'isSharedOwnership', 'listingCondition', 'listingStatus', 'RoomCount',
       'price', 'propertyType', 'isAuction', 'priceHistory', 'floorArea',
       'tenure', 'detailedDescription', 'features', 'furnishedState', 'title',
       'latitude', 'longitude', 'statusSummary', 'floor_area_amount',
       'floor_area_units', 'floor_area_msq', 'numBedrooms', 'numBathrooms',
       'numLivingRooms', 'firstPublished', 'lastSale', 'firstPublishedDate',
       'firstPublishedPrice', 'lastSaleDate', 'lastSaleNewBuild',
       'lastSalePrice', 'sq_m_feat_from_ft', 'sq_m_feat_from_m',
       'sq_m_feat_from_dft', 'sq_m_feat_from_dm'],
      dtype='object')

In [22]:
df_houses_wrangling2 = df_houses_wrangling2.drop(['Link','Listed','agency_phone','RoomCount','priceHistory',
                                                 'floorArea','floor_area_amount','floor_area_units',
                                                  'firstPublished','lastSale','sq_m_feat_from_ft',
                                                  'sq_m_feat_from_m','sq_m_feat_from_dft','sq_m_feat_from_dm'],
                                                axis = 1)
df_houses_wrangling2 = df_houses_wrangling2[df_houses_wrangling2['floor_area_msq'].notnull()].reset_index(drop = True)
df_houses_wrangling2

Unnamed: 0,Borough,date_listed,agency_name,chain_free,address,isRetirementHome,isSharedOwnership,listingCondition,listingStatus,price,...,statusSummary,floor_area_msq,numBedrooms,numBathrooms,numLivingRooms,firstPublishedDate,firstPublishedPrice,lastSaleDate,lastSaleNewBuild,lastSalePrice
0,City of London,2021-10-28,Foxtons - Clerkenwell,False,"Moor Lane, Moorgate, London EC2Y",False,False,pre-owned,for_sale,1400000,...,Just added,80.0,2,2,1,2021-10-28T18:06:32,1400000.0,,,
1,City of London,2021-10-27,Savills - Clerkenwell,True,"Cock Lane, London EC1A",False,False,pre-owned,for_sale,900000,...,Just added,73.0,2,2,1,2021-10-27T18:25:11,900000.0,,,
2,City of London,2021-10-27,Hamptons - City Sales,False,"Priory House, 5 Friar Street, London EC4V",False,False,pre-owned,for_sale,600000,...,Just added,54.0,1,1,1,2021-10-27T15:29:28,600000.0,2015-10-23,False,920000.0
3,City of London,2021-10-26,Foxtons - West End,False,"Endell Street, Covent Garden, London WC2H",False,False,pre-owned,for_sale,650000,...,Just added,46.0,1,1,2,2021-10-26T14:15:11,650000.0,,,
4,City of London,2021-10-21,Circa London,True,"Dyer's Buildings, London EC1N",False,False,pre-owned,for_sale,1595000,...,,86.0,2,2,1,2021-10-21T13:12:37,1595000.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,City of London,2021-05-29,JBrown Property UK,False,"Principal Place, Worship Street, London, Great...",False,False,pre-owned,for_sale,875000,...,,42.0,1,1,1,2021-05-29T10:26:57,875000.0,,,
72,City of London,2021-05-18,Chestertons - Islington,False,"Aldersgate Street, Clerkenwell, Islington, Lon...",False,False,pre-owned,for_sale,8000000,...,,604.0,3,0,0,2021-05-18T14:47:24,8800000.0,,,
73,City of London,2021-05-13,JLL - City,True,"2 Principal Place, London EC2A",False,False,pre-owned,for_sale,850000,...,,51.0,1,1,1,2021-05-13T07:15:59,875000.0,,,
74,City of London,2021-05-07,Relocate Me,True,"Vicary House, St Barts Square, London EC1A",False,False,new,for_sale,925000,...,,51.0,1,1,1,2021-05-07T20:57:42,925000.0,,,


In [23]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Borough              76 non-null     object        
 1   date_listed          76 non-null     datetime64[ns]
 2   agency_name          76 non-null     object        
 3   chain_free           76 non-null     bool          
 4   address              76 non-null     object        
 5   isRetirementHome     76 non-null     bool          
 6   isSharedOwnership    76 non-null     bool          
 7   listingCondition     76 non-null     object        
 8   listingStatus        76 non-null     object        
 9   price                76 non-null     int64         
 10  propertyType         76 non-null     object        
 11  isAuction            76 non-null     bool          
 12  tenure               51 non-null     object        
 13  detailedDescription  76 non-null     

I will get rid of furnished state and status summary as they do not give much information and almost all records are missing.

In [24]:
df_houses_wrangling2 = df_houses_wrangling2.drop(['furnishedState','statusSummary'],
                                                axis = 1)

In [25]:
df_houses_wrangling2['listingStatus'].value_counts()

for_sale    76
Name: listingStatus, dtype: int64

Listing status feature also gives very little information and completely useless for the purpose of the study.

In [26]:
df_houses_wrangling2 = df_houses_wrangling2.drop(['listingStatus'],
                                                axis = 1)

In [27]:
df_houses_wrangling2['propertyType'].value_counts()

flat      71
studio     4
           1
Name: propertyType, dtype: int64

I will complete the missing values of property type seting them as "Other".

In the future I will have to deal with the fact that there are many different categorical options.

In [28]:
df_houses_wrangling2['propertyType'].fillna(value = 'Other', inplace = True)

In [29]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Borough              76 non-null     object        
 1   date_listed          76 non-null     datetime64[ns]
 2   agency_name          76 non-null     object        
 3   chain_free           76 non-null     bool          
 4   address              76 non-null     object        
 5   isRetirementHome     76 non-null     bool          
 6   isSharedOwnership    76 non-null     bool          
 7   listingCondition     76 non-null     object        
 8   price                76 non-null     int64         
 9   propertyType         76 non-null     object        
 10  isAuction            76 non-null     bool          
 11  tenure               51 non-null     object        
 12  detailedDescription  76 non-null     object        
 13  features             76 non-null     

In [30]:
df_houses_wrangling2['tenure'].value_counts()

leasehold            48
share_of_freehold     2
freehold              1
Name: tenure, dtype: int64

I will try now to complete tenure feature missing values using features and description features. This time will be more simple as we just need to find a mention to any of the 4 options.

In [31]:
patterns = [r'(leasehold)',
            r'(lease)',
            r'(freehold)',
            r'(share.of.freehold)',
            r'(feudal)',
            r'(leaseholder)',
            r'(freeholder)',
            r'(commonhold)']

df_houses_wrangling2['tenure_from_feat'] = df_houses_wrangling2['features'].apply(lambda x: None if np.any(pd.isnull(x)) 
                                                                                else extract_feature(" - ".join(x),patterns))
df_houses_wrangling2['tenure_from_desc'] = df_houses_wrangling2['detailedDescription'].apply(lambda x: None 
                                                                                             if np.any(pd.isnull(x)) 
                                                                                else extract_feature(" - ".join(x),patterns))
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Borough              76 non-null     object        
 1   date_listed          76 non-null     datetime64[ns]
 2   agency_name          76 non-null     object        
 3   chain_free           76 non-null     bool          
 4   address              76 non-null     object        
 5   isRetirementHome     76 non-null     bool          
 6   isSharedOwnership    76 non-null     bool          
 7   listingCondition     76 non-null     object        
 8   price                76 non-null     int64         
 9   propertyType         76 non-null     object        
 10  isAuction            76 non-null     bool          
 11  tenure               51 non-null     object        
 12  detailedDescription  76 non-null     object        
 13  features             76 non-null     

In [32]:
df_houses_wrangling2['tenure'] = df_houses_wrangling2.apply(lambda x: x['tenure_from_feat'] 
                                                                     if pd.isnull(x['tenure'])
                                                                    else x['tenure'], axis = 1)
df_houses_wrangling2['tenure'] = df_houses_wrangling2.apply(lambda x: x['tenure_from_desc'] 
                                                                     if pd.isnull(x['tenure'])
                                                                    else x['tenure'], axis = 1)
df_houses_wrangling2['tenure'].value_counts()

leasehold            49
lease                 4
share_of_freehold     2
freehold              1
Name: tenure, dtype: int64

In [33]:
tenure_map = {'lease':'leasehold','feudal':'Other'}
df_houses_wrangling2['tenure'].replace(tenure_map, inplace = True)
df_houses_wrangling2['tenure'] = df_houses_wrangling2['tenure'].fillna('Other')
df_houses_wrangling2['tenure'].value_counts()

leasehold            53
Other                20
share_of_freehold     2
freehold              1
Name: tenure, dtype: int64

In [34]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Borough              76 non-null     object        
 1   date_listed          76 non-null     datetime64[ns]
 2   agency_name          76 non-null     object        
 3   chain_free           76 non-null     bool          
 4   address              76 non-null     object        
 5   isRetirementHome     76 non-null     bool          
 6   isSharedOwnership    76 non-null     bool          
 7   listingCondition     76 non-null     object        
 8   price                76 non-null     int64         
 9   propertyType         76 non-null     object        
 10  isAuction            76 non-null     bool          
 11  tenure               76 non-null     object        
 12  detailedDescription  76 non-null     object        
 13  features             76 non-null     

In [37]:
patterns = [r'(terrace)',
            r'(balcony)',
            r'(roof.deck)']

df_houses_wrangling2['balcony_terrace'] = df_houses_wrangling2['features'].apply(lambda x: False if np.any(pd.isnull(x)) 
                                                            else (True if pd.notnull(extract_feature(" - ".join(x),patterns)) 
                                                            else False))
df_houses_wrangling2['balcony_terrace'] = df_houses_wrangling2.apply(lambda x: False 
                                                if pd.isnull(x['detailedDescription']) 
                                                else (True if x['balcony_terrace'] == True 
                                                else (True 
                                                if pd.notnull(extract_feature(x['detailedDescription'],patterns)) 
                                                else False)), 
                                                axis = 1)
df_houses_wrangling2['balcony_terrace'].value_counts()

True     45
False    31
Name: balcony_terrace, dtype: int64

In [39]:
patterns = [r'(car.park)',
            r'(carpark)',
            r'(parking)']

df_houses_wrangling2['parking'] = df_houses_wrangling2['features'].apply(lambda x: False if np.any(pd.isnull(x)) 
                                                            else (True if pd.notnull(extract_feature(" - ".join(x),patterns)) 
                                                            else False))
df_houses_wrangling2['parking'] = df_houses_wrangling2.apply(lambda x: False 
                                                if pd.isnull(x['detailedDescription']) 
                                                else (True if x['parking'] == True 
                                                else (True 
                                                if pd.notnull(extract_feature(x['detailedDescription'],patterns)) 
                                                else False)), 
                                                axis = 1)
df_houses_wrangling2['parking'].value_counts()

False    54
True     22
Name: parking, dtype: int64

In [41]:
patterns = [r'(garden)',
            r'(backyard)',
           r'(back.yard)',
           r'(patio)']

df_houses_wrangling2['garden'] = df_houses_wrangling2['features'].apply(lambda x: False if np.any(pd.isnull(x)) 
                                                            else (True if pd.notnull(extract_feature(" - ".join(x),patterns)) 
                                                            else False))
df_houses_wrangling2['garden'] = df_houses_wrangling2.apply(lambda x: False 
                                                if pd.isnull(x['detailedDescription']) 
                                                else (True if x['garden'] == True 
                                                else (True 
                                                if pd.notnull(extract_feature(x['detailedDescription'],patterns)) 
                                                else False)), 
                                                axis = 1)
df_houses_wrangling2['garden'].value_counts()

False    46
True     30
Name: garden, dtype: int64

In [42]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 31 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Borough              76 non-null     object        
 1   date_listed          76 non-null     datetime64[ns]
 2   agency_name          76 non-null     object        
 3   chain_free           76 non-null     bool          
 4   address              76 non-null     object        
 5   isRetirementHome     76 non-null     bool          
 6   isSharedOwnership    76 non-null     bool          
 7   listingCondition     76 non-null     object        
 8   price                76 non-null     int64         
 9   propertyType         76 non-null     object        
 10  isAuction            76 non-null     bool          
 11  tenure               76 non-null     object        
 12  detailedDescription  76 non-null     object        
 13  features             76 non-null     

First published date shows a small amount of missing values. I will fill them using its value in date_listed, basically asuming that the house is the first time published. I will do the same for price.

First I need to convert the feature in datetime of the same format as date_listed.

In [43]:
df_houses_wrangling2['firstPublishedDate'] = df_houses_wrangling2['firstPublishedDate'].apply(lambda x: None 
                                                                                              if pd.isnull(x)
                                                                                             else x.split(' ')[0])
df_houses_wrangling2['firstPublishedDate']

0     2021-10-28T18:06:32
1     2021-10-27T18:25:11
2     2021-10-27T15:29:28
3     2021-10-26T14:15:11
4     2021-10-21T13:12:37
             ...         
71    2021-05-29T10:26:57
72    2021-05-18T14:47:24
73    2021-05-13T07:15:59
74    2021-05-07T20:57:42
75    2021-05-06T13:49:16
Name: firstPublishedDate, Length: 76, dtype: object

In [44]:
df_houses_wrangling2['firstPublishedDate'] = df_houses_wrangling2['firstPublishedDate'].fillna(df_houses_wrangling2['date_listed'])

In [45]:
df_houses_wrangling2['firstPublishedPrice'] = df_houses_wrangling2['firstPublishedPrice'].fillna(df_houses_wrangling2['price'])

In [48]:
df_houses_wrangling2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Borough              76 non-null     object        
 1   date_listed          76 non-null     datetime64[ns]
 2   agency_name          76 non-null     object        
 3   chain_free           76 non-null     bool          
 4   address              76 non-null     object        
 5   isRetirementHome     76 non-null     bool          
 6   isSharedOwnership    76 non-null     bool          
 7   listingCondition     76 non-null     object        
 8   price                76 non-null     int64         
 9   propertyType         76 non-null     object        
 10  isAuction            76 non-null     bool          
 11  tenure               76 non-null     object        
 12  title                76 non-null     object        
 13  latitude             76 non-null     

Now I will drop the used features and also last sale, as it shows too many missing values and also it provides very little information.

In [47]:
df_houses_wrangling2.drop(['detailedDescription','features','lastSaleDate','lastSaleNewBuild',
                           'lastSalePrice','tenure_from_feat','tenure_from_desc'],
                         inplace = True,
                         axis = 1)
df_houses_wrangling2

Unnamed: 0,Borough,date_listed,agency_name,chain_free,address,isRetirementHome,isSharedOwnership,listingCondition,price,propertyType,...,longitude,floor_area_msq,numBedrooms,numBathrooms,numLivingRooms,firstPublishedDate,firstPublishedPrice,balcony_terrace,parking,garden
0,City of London,2021-10-28,Foxtons - Clerkenwell,False,"Moor Lane, Moorgate, London EC2Y",False,False,pre-owned,1400000,flat,...,-0.089223,80.0,2,2,1,2021-10-28T18:06:32,1400000.0,True,False,False
1,City of London,2021-10-27,Savills - Clerkenwell,True,"Cock Lane, London EC1A",False,False,pre-owned,900000,flat,...,-0.103014,73.0,2,2,1,2021-10-27T18:25:11,900000.0,False,False,True
2,City of London,2021-10-27,Hamptons - City Sales,False,"Priory House, 5 Friar Street, London EC4V",False,False,pre-owned,600000,flat,...,-0.101676,54.0,1,1,1,2021-10-27T15:29:28,600000.0,False,False,False
3,City of London,2021-10-26,Foxtons - West End,False,"Endell Street, Covent Garden, London WC2H",False,False,pre-owned,650000,flat,...,-0.124905,46.0,1,1,2,2021-10-26T14:15:11,650000.0,True,False,True
4,City of London,2021-10-21,Circa London,True,"Dyer's Buildings, London EC1N",False,False,pre-owned,1595000,flat,...,-0.110003,86.0,2,2,1,2021-10-21T13:12:37,1595000.0,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,City of London,2021-05-29,JBrown Property UK,False,"Principal Place, Worship Street, London, Great...",False,False,pre-owned,875000,flat,...,-0.079653,42.0,1,1,1,2021-05-29T10:26:57,875000.0,True,False,False
72,City of London,2021-05-18,Chestertons - Islington,False,"Aldersgate Street, Clerkenwell, Islington, Lon...",False,False,pre-owned,8000000,,...,-0.097849,604.0,3,0,0,2021-05-18T14:47:24,8800000.0,True,True,True
73,City of London,2021-05-13,JLL - City,True,"2 Principal Place, London EC2A",False,False,pre-owned,850000,flat,...,-0.082622,51.0,1,1,1,2021-05-13T07:15:59,875000.0,True,False,False
74,City of London,2021-05-07,Relocate Me,True,"Vicary House, St Barts Square, London EC1A",False,False,new,925000,flat,...,-0.100153,51.0,1,1,1,2021-05-07T20:57:42,925000.0,False,False,True


In [49]:
engine = create_engine('mysql+mysqldb://root:d12d12@127.0.0.1/zoopla_houses', echo=False)

In [50]:
df_houses_wrangling2.to_sql('zoopla_houses_city', con=engine)