# Web Scraping
 
 
 
The goal is to build a CSV file from the data at https://www.rightmove.co.uk/house-prices/e10.html?page=1. I first extract the data from the first page, then go on to scrape the other 40 pages.


In [301]:
import requests
import bs4

In [302]:
url = 'https://www.rightmove.co.uk/house-prices/e10.html?page=1'
res = requests.get(url)
res.text

'<!DOCTYPE html><html><head lang="en"><meta charset="utf-8"><title>House Prices in E10</title><meta name="viewport" content="width=device-width,initial-scale=1"><meta name="description" content="The average price for a property in E10 is £519,239 over the last year. Use Rightmove online house price checker tool to find out exactly how much properties sold for in E10 since 1995 (based on official Land Registry data)."><!-- Favicons --><link rel="shortcut icon" href="/spw/images/favicons/favicon.ico?v=1"><!-- APPLE ICONS --><link rel="apple-touch-icon" sizes="72x72" href="/spw/images/favicons/apple-touch-icon-72x72.png"><link rel="apple-touch-icon" sizes="114x114" href="/spw/images/favicons/apple-touch-icon-114x114.png"><link rel="apple-touch-icon" sizes="120x120" href="/spw/images/favicons/apple-touch-icon-120x120.png"><link rel="apple-touch-icon" sizes="144x144" href="/spw/images/favicons/apple-touch-icon-144x144.png"><link rel="apple-touch-icon" sizes="152x152" href="/spw/images/favic

In [303]:
soup = bs4.BeautifulSoup(res.text,'lxml')
soup

<!DOCTYPE html>
<html><head lang="en"><meta charset="utf-8"/><title>House Prices in E10</title><meta content="width=device-width,initial-scale=1" name="viewport"/><meta content="The average price for a property in E10 is £519,239 over the last year. Use Rightmove online house price checker tool to find out exactly how much properties sold for in E10 since 1995 (based on official Land Registry data)." name="description"/><!-- Favicons --><link href="/spw/images/favicons/favicon.ico?v=1" rel="shortcut icon"/><!-- APPLE ICONS --><link href="/spw/images/favicons/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/><link href="/spw/images/favicons/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/><link href="/spw/images/favicons/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/><link href="/spw/images/favicons/apple-touch-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/><link href="/spw/images/favicons/apple-touch-icon-152x152.pn

In [304]:
soup.select('.main')[0]

<div class="main"><!-- PEBBLE_RENDER --></div>

In [305]:
soup.title.text

'House Prices in E10'

In [306]:
soup.title

<title>House Prices in E10</title>

In [307]:
soup.find_all('div')[1]

<div class="main-content" id="content"><div class="main"><!-- PEBBLE_RENDER --></div></div>

In [308]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head lang="en">
  <meta charset="utf-8"/>
  <title>
   House Prices in E10
  </title>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="The average price for a property in E10 is £519,239 over the last year. Use Rightmove online house price checker tool to find out exactly how much properties sold for in E10 since 1995 (based on official Land Registry data)." name="description"/>
  <!-- Favicons -->
  <link href="/spw/images/favicons/favicon.ico?v=1" rel="shortcut icon"/>
  <!-- APPLE ICONS -->
  <link href="/spw/images/favicons/apple-touch-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
  <link href="/spw/images/favicons/apple-touch-icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
  <link href="/spw/images/favicons/apple-touch-icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
  <link href="/spw/images/favicons/apple-touch-icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
  <link href="/sp

In [309]:
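from subprocess import check_output
import json

# The property data is set on window.__PRELOADED_STATE__ by the second <script> tag
s = soup.select('script')
s = s[1]
# Wrap the script so Node.js can run it and print the state as JSON
js = 'window = {};\n'+s.text.strip()+';\nprocess.stdout.write(JSON.stringify(window.__PRELOADED_STATE__));'
with open('temp.js','w') as f:
    f.write(js)
# Run the script with Node.js and parse its JSON output into a Python dictionary
window_init_state = check_output(['node','temp.js'])
big_file = json.loads(window_init_state.decode("utf-8"))
big_file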

{'results': {'title': 'House Prices in E10',
  'metaTagDescription': 'The average price for a property in E10 is £519,239 over the last year. Use Rightmove online house price checker tool to find out exactly how much properties sold for in E10 since 1995 (based on official Land Registry data).',
  'containsScotland': False,
  'resultCount': '7,979',
  'properties': [{'address': '11, Whitney Road, London, Greater London E10 7HG',
    'propertyType': 'Terraced',
    'bedrooms': 3,
    'images': {'imageUrl': 'https://media.rightmove.co.uk/dir/66k/65394/109597502/65394_whitneyrd_IMG_00_0000_max_135x100.jpeg',
     'count': 71},
    'hasFloorPlan': True,
    'transactions': [{'displayPrice': '£720,050',
      'dateSold': '17 Dec 2021',
      'tenure': 'Freehold',
      'newBuild': False},
     {'displayPrice': '£445,000',
      'dateSold': '12 Aug 2015',
      'tenure': 'Freehold',
      'newBuild': False}],
    'location': {'lat': 51.57167, 'lng': -0.01677},
    'detailUrl': 'https://www.r

By running the page's second script tag through Node.js, I was able to extract the preloaded state as a Python dictionary.
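
The same state can sometimes be read without Node.js. The sketch below is only an illustration: it assumes the script body is a plain assignment of a JSON literal to `window.__PRELOADED_STATE__`, which I have not verified for every page, and the helper name `extract_preloaded_state` is made up for this example. It uses `json.JSONDecoder().raw_decode`, which parses one JSON value and ignores any trailing JavaScript.

```python
import json

def extract_preloaded_state(script_text, marker="window.__PRELOADED_STATE__"):
    """Parse the JSON object assigned to `marker` inside a script body.

    Assumption: the right-hand side of the assignment is valid JSON.
    Arbitrary JavaScript (functions, undefined, single quotes) would fail here.
    """
    start = script_text.index(marker)       # raises ValueError if the marker is absent
    brace = script_text.index("{", start)   # first '{' after the marker
    # raw_decode parses one JSON value starting at `brace` and ignores what follows it
    state, _end = json.JSONDecoder().raw_decode(script_text, brace)
    return state

# Made-up script body, shaped like the assumption described above
sample = 'window.__PRELOADED_STATE__ = {"results": {"resultCount": "7,979", "properties": []}};'
state = extract_preloaded_state(sample)
print(state["results"]["resultCount"])
```

If the script contains anything other than a JSON literal, this raises `json.JSONDecodeError`, and the Node.js route is the safer fallback.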

In [310]:
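# A list of property dictionaries; named 'properties' to avoid shadowing the built-in dict
properties = big_file['results']['properties']
properties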

[{'address': '11, Whitney Road, London, Greater London E10 7HG',
  'propertyType': 'Terraced',
  'bedrooms': 3,
  'images': {'imageUrl': 'https://media.rightmove.co.uk/dir/66k/65394/109597502/65394_whitneyrd_IMG_00_0000_max_135x100.jpeg',
   'count': 71},
  'hasFloorPlan': True,
  'transactions': [{'displayPrice': '£720,050',
    'dateSold': '17 Dec 2021',
    'tenure': 'Freehold',
    'newBuild': False},
   {'displayPrice': '£445,000',
    'dateSold': '12 Aug 2015',
    'tenure': 'Freehold',
    'newBuild': False}],
  'location': {'lat': 51.57167, 'lng': -0.01677},
  'detailUrl': 'https://www.rightmove.co.uk/house-prices/detailMatching.html?prop=109597502&sale=14270152&country=england'},
 {'address': '82, Manor Road, Leyton, London, Greater London E10 7HN',
  'propertyType': 'Terraced',
  'bedrooms': 4,
  'images': {'imageUrl': 'https://media.rightmove.co.uk/dir/71k/70202/113322182/70202_102565007736_IMG_00_0000_max_135x100.jpeg',
   'count': 13},
  'hasFloorPlan': True,
  'transactio

Time to scrape the rest of the pages!

In [311]:
list_of_dicts = []  # one list of property dictionaries per page

for n in range(1,40):  # pages 1 to 39; range() excludes the stop value
    scrape_url = 'https://www.rightmove.co.uk/house-prices/e10.html?page={}'.format(n)
    res = requests.get(scrape_url)

    soup = bs4.BeautifulSoup(res.text,'lxml')
    s = soup.select('script')
    s = s[1]
    js = 'window = {};\n'+s.text.strip()+';\nprocess.stdout.write(JSON.stringify(window.__PRELOADED_STATE__));'
    with open('temp.js','w') as f:
        f.write(js)
    window_init_state = check_output(['node','temp.js'])
    big_file = json.loads(window_init_state.decode("utf-8"))
    list_of_dicts.append(big_file['results']['properties'])

In [312]:
len(list_of_dicts)

39

In [313]:
list_of_dicts[38]

[{'address': '4, Manor Road, Leyton, London, Greater London E10 7AL',
  'propertyType': 'Semi-Detached',
  'bedrooms': 5,
  'images': {'imageUrl': 'https://media.rightmove.co.uk/dir/33k/32706/66821818/32706_S41300_IMG_07_0000_max_135x100.jpg',
   'count': 21},
  'hasFloorPlan': True,
  'transactions': [{'displayPrice': '£830,000',
    'dateSold': '23 Jul 2019',
    'tenure': 'Freehold',
    'newBuild': False},
   {'displayPrice': '£103,500',
    'dateSold': '30 Jan 1995',
    'tenure': 'Freehold',
    'newBuild': False}],
  'location': {'lat': 51.56845, 'lng': -0.01792},
  'detailUrl': 'https://www.rightmove.co.uk/house-prices/detailMatching.html?prop=66821818&sale=10188928&country=england'},
 {'address': '13, Alexandra Road, Leyton, London, Greater London E10 5QQ',
  'propertyType': 'Terraced',
  'bedrooms': 3,
  'images': {'imageUrl': 'https://media.rightmove.co.uk/dir/7k/6944/51728841/6944_2158_IMG_13_0001_max_135x100.jpg',
   'count': 12},
  'hasFloorPlan': False,
  'transactions':

In [314]:
import pandas as pd
import numpy as np

In [325]:
np.save('list_of_dicts.npy', list_of_dicts)
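
`np.save` stores Python objects such as these nested dictionaries by pickling them, so reading the file back needs `np.load(..., allow_pickle=True)`. As a possible alternative, the sketch below round-trips the same kind of structure through a JSON file using only the standard library. The helper names `save_pages` and `load_pages` and the sample records are made up for illustration.

```python
import json
import os
import tempfile

def save_pages(pages, path):
    """Write a list of per-page property lists to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as '£' readable in the file
        json.dump(pages, f, ensure_ascii=False, indent=2)

def load_pages(path):
    """Read the structure written by save_pages back into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Made-up data with the same nesting as list_of_dicts: pages -> properties -> transactions
sample_pages = [
    [{"address": "1, Example Road", "bedrooms": 3,
      "transactions": [{"displayPrice": "£100,000"}]}],
    [{"address": "2, Example Road", "bedrooms": None, "transactions": []}],
]

path = os.path.join(tempfile.mkdtemp(), "list_of_dicts.json")
save_pages(sample_pages, path)
restored = load_pages(path)
print(restored == sample_pages)
```

A JSON file is also readable in a text editor, which helps when checking what was actually scraped.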

Time to create a dataframe from the extracted list of dictionaries.

In [316]:
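list_of_df = []
for page in list_of_dicts:  # each page is a list of property dictionaries
    x = pd.DataFrame(page)
    list_of_df.append(x)

# Each page keeps its own row index, so index values repeat across pages
prop_df = pd.concat(list_of_df)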

In [317]:
prop_df.head()

| | address | propertyType | bedrooms | images | hasFloorPlan | transactions | location | detailUrl |
|---|---|---|---|---|---|---|---|---|
| 0 | 11, Whitney Road, London, Greater London E10 7HG | Terraced | 3.0 | {'imageUrl': 'https://media.rightmove.co.uk/di... | True | [{'displayPrice': '£720,050', 'dateSold': '17 ... | {'lat': 51.57167, 'lng': -0.01677} | https://www.rightmove.co.uk/house-prices/detai... |
| 1 | 82, Manor Road, Leyton, London, Greater London... | Terraced | 4.0 | {'imageUrl': 'https://media.rightmove.co.uk/di... | True | [{'displayPrice': '£643,756', 'dateSold': '9 D... | {'lat': 51.57082, 'lng': -0.01955} | https://www.rightmove.co.uk/house-prices/detai... |
| 2 | 118, Newport Road, London, Greater London E10 6PG | Flat | 2.0 | {'imageUrl': 'https://media.rightmove.co.uk/di... | True | [{'displayPrice': '£504,000', 'dateSold': '29 ... | {'lat': 51.56547, 'lng': -0.00145} | https://www.rightmove.co.uk/house-prices/detai... |
| 3 | 8, Holden House, 616, High Road Leyton, London... | Flat | NaN | {'imageUrl': '/spw/images/placeholder/no-image... | False | [{'displayPrice': '£130,000', 'dateSold': '26 ... | {'lat': 51.56901, 'lng': -0.00823} | |
| 4 | 3, Buckingham Road, London, Greater London E10... | Terraced | 3.0 | {'imageUrl': 'https://media.rightmove.co.uk/di... | True | [{'displayPrice': '£714,000', 'dateSold': '18 ... | {'lat': 51.56015, 'lng': -0.01063} | https://www.rightmove.co.uk/house-prices/detai... |


As the preview shows, some columns need further unpacking before I can conduct exploratory data analysis. For example, 'transactions' holds a list of sale records, each with a price, date sold, tenure and new-build flag, so it has to be broken down into separate features. Because it is a list, a single address can also carry several sales.
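
One possible way to unpack 'transactions' is to emit one flat row per sale. The sketch below is a minimal illustration in plain Python, not necessarily the approach used later. The function name `flatten_transactions` and the choice of output columns are my own, and it assumes every `displayPrice` looks like `'£720,050'`. The sample record is the first property from the output above.

```python
def flatten_transactions(properties):
    """Return one flat dict per (property, transaction) pair."""
    rows = []
    for prop in properties:
        for sale in prop.get("transactions", []):
            rows.append({
                "address": prop.get("address"),
                "propertyType": prop.get("propertyType"),
                "bedrooms": prop.get("bedrooms"),
                # Assumption: prices are formatted like '£720,050'
                "price": int(sale["displayPrice"].replace("£", "").replace(",", "")),
                "dateSold": sale.get("dateSold"),
                "tenure": sale.get("tenure"),
                "newBuild": sale.get("newBuild"),
            })
    return rows

# First property from the scraped output, trimmed to the fields used here
sample = [{
    "address": "11, Whitney Road, London, Greater London E10 7HG",
    "propertyType": "Terraced",
    "bedrooms": 3,
    "transactions": [
        {"displayPrice": "£720,050", "dateSold": "17 Dec 2021",
         "tenure": "Freehold", "newBuild": False},
        {"displayPrice": "£445,000", "dateSold": "12 Aug 2015",
         "tenure": "Freehold", "newBuild": False},
    ],
}]

rows = flatten_transactions(sample)
for row in rows:
    print(row["address"], row["price"], row["dateSold"])
```

The resulting list of flat dictionaries can be passed straight to `pd.DataFrame(rows)`. pandas' `DataFrame.explode('transactions')` is another route to one row per sale.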

In [318]:
prop_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 975 entries, 0 to 24
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   address       975 non-null    object 
 1   propertyType  975 non-null    object 
 2   bedrooms      586 non-null    float64
 3   images        975 non-null    object 
 4   hasFloorPlan  975 non-null    bool   
 5   transactions  975 non-null    object 
 6   location      975 non-null    object 
 7   detailUrl     975 non-null    object 
dtypes: bool(1), float64(1), object(6)
memory usage: 61.9+ KB
