## Lägenhetsjägaren (The Apartment Hunter, 2022)

29 January 2022

A Swedish drama about a man who must find a flat. But not all is as it seems...

In [1]:
import requests
import pandas as pd
import numpy as np
from lxml import html

from scraping_utils import get_data_from_page, create_urllist
from cleaning_utils import clean_price_column

import re

In [2]:
headers_info = {'Host': 'www.realestate.com.au', 
                'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
                'Cookie': 'reauid=165f301772670000e401f2629c030000cf740100; mid=11604324607491749210; utag_main=v_id:0183a1b78bfa000df68706d9f1870504e004900900bd0; split_audience=c; KP2_UIDz-ssn=0dwTSAWPurUNrYSnIzsFxpGTMNjc8M6Sjmoc9m5E6JesN6RTHSO4bnaxbIKDsl8lvUTczXiLmsl3Fxwn9Hf73Ypfh4VHaSSQkUdJYbX22exsC4HXWEwNzJ7I733M3Zvw0jHUTAa2cJpKzWN3H4cp09pV; KP2_UIDz=0dwTSAWPurUNrYSnIzsFxpGTMNjc8M6Sjmoc9m5E6JesN6RTHSO4bnaxbIKDsl8lvUTczXiLmsl3Fxwn9Hf73Ypfh4VHaSSQkUdJYbX22exsC4HXWEwNzJ7I733M3Zvw0jHUTAa2cJpKzWN3H4cp09pV; Country=AU; fullstory_audience_split=B',
                'Upgrade-Insecure-Requests': '1'}

## Scrape data

This step gathers the summary data shown on the search-result listing pages for each of the suburb-postcode combinations specified below.

In [3]:
suburb_postcodes = ['fairfield,+vic+3078', 'brunswick,+vic+3056', 'carlton,+vic+3053', 'hawthorn,+vic+3122',
                   'camberwell,+vic+3124', 'moonee+ponds,+vic+3039', 'fitzroy,+vic+3065', 'elsternwick,+vic+3185']

In [4]:
urls = create_urllist(suburb_postcodes, 10)
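
`create_urllist` comes from `scraping_utils` and is not shown in this post. Judging only by the URLs printed by the scraping loop (pages `list-1` to `list-9` for each suburb when called with `10`), it may behave roughly like the sketch below. The name `sketch_create_urllist`, the `base` parameter and the exclusive upper page bound are assumptions inferred from that output, not the author's code.

```python
def sketch_create_urllist(suburb_postcodes, stop_page,
                          base='https://www.realestate.com.au/buy'):
    """Build listing-page URLs for pages 1 .. stop_page - 1 of each suburb.

    Hypothetical stand-in for scraping_utils.create_urllist; the exclusive
    upper bound is an assumption based on the printed URLs.
    """
    return [
        f'{base}/in-{suburb}/list-{page}'
        for suburb in suburb_postcodes   # outer loop: one suburb at a time
        for page in range(1, stop_page)  # inner loop: result pages 1, 2, ...
    ]


example_urls = sketch_create_urllist(
    ['fairfield,+vic+3078', 'brunswick,+vic+3056'], 10
)
```

Keeping the suburb loop outermost means all pages of one suburb are requested before moving to the next, which matches the order of the printed URLs.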

In [5]:
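dfs = []

# Fetch each listing page and turn the properties on it into a DataFrame
for url in urls:
    print(f'Scraping URL: {url}')
    page = requests.get(url, headers=headers_info)
    tree = html.fromstring(page.text)  # parse the HTML into an element tree
    df_temp = pd.DataFrame(get_data_from_page(tree))
    dfs.append(df_temp)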

Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-1
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-2
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-3
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-4
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-5
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-6
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-7
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-8
Scraping URL: https://www.realestate.com.au/buy/in-fairfield,+vic+3078/list-9
Scraping URL: https://www.realestate.com.au/buy/in-brunswick,+vic+3056/list-1
Scraping URL: https://www.realestate.com.au/buy/in-brunswick,+vic+3056/list-2
Scraping URL: https://www.realestate.com.au/buy/in-brunswick,+vic+3056/list-3
Scraping URL: https://www.realestate.com.au/buy/in-brunswick,+vi

In [6]:
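# Combine the per-page results, drop duplicate listings and renumber the rows
df = pd.concat(dfs).drop_duplicates().reset_index(drop=True)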

## Data Cleaning

Prices are recorded in a free-text field, often alongside other, unnecessary text. The cleaning step removes this extra text and creates columns for the minimum and maximum price specified for each property.

In [7]:
df = clean_price_column(df)
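
`clean_price_column` lives in `cleaning_utils` and is not shown here. As a rough illustration of the kind of parsing described above, the sketch below pulls dollar amounts out of a free-text price string and returns the smallest and largest. The function name `parse_price_range`, the accepted formats (`$650,000`, `$850k`, `$1.2 million`) and the use of NaN for listings without a price are all assumptions made for illustration, not the author's implementation.

```python
import math
import re

# A dollar sign, a number with optional thousands separators and decimals,
# and an optional magnitude suffix (assumed formats, see the text above).
_AMOUNT = re.compile(
    r'\$\s*(\d[\d,]*(?:\.\d+)?)\s*(million|m|k)?\b',
    re.IGNORECASE,
)


def parse_price_range(text):
    """Return (min, max) of the dollar amounts found in text.

    Returns (nan, nan) when no amount is present, e.g. 'Contact agent'.
    """
    amounts = []
    for number, suffix in _AMOUNT.findall(str(text)):
        value = float(number.replace(',', ''))
        suffix = suffix.lower()
        if suffix == 'k':
            value *= 1_000
        elif suffix in ('m', 'million'):
            value *= 1_000_000
        amounts.append(value)
    if not amounts:
        return (math.nan, math.nan)
    return (min(amounts), max(amounts))
```

Applied to a DataFrame, a function like this could feed the two new columns, for example via `df['price'].map(parse_price_range)`, where the column name `price` is again hypothetical.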

## Scrape additional data for each property

This step fetches each property's own page and extracts the additional data available there, chiefly the free-text description.

In [10]:
texts = []

# XPath selecting the text nodes of the description element on a property page
path_text = './/span[@class="property-description__content"]/text()'

In [11]:
# for each of the properties, get the text description
url_base = 'https://www.realestate.com.au'

for n, url in enumerate(df.link.values):
    print(f'Scraping property {n}')
    property_url = f'{url_base}{url}'
    page = requests.get(property_url, headers=headers_info)
    tree = html.fromstring(page.text)
    property_text = ' '.join(tree.xpath(path_text))
    texts.append(property_text)

Scraping property 0
Scraping property 1
Scraping property 2
...
Scraping property 1516
Scraping property 1517
Scraping property 1518
...

In [12]:
df = (
    df
    .assign(description=texts)
)
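
The `assign` call above relies on `texts` holding exactly one entry per row of `df`, in the same order. The scraping loop as written would stop at the first failed request, leaving `texts` shorter than `df`. A minimal sketch of a loop that keeps the two aligned is shown below; `collect_descriptions`, `fetch_description` and `pause_seconds` are hypothetical names, and recording a failed page as an empty string is just one possible convention.

```python
import time


def collect_descriptions(links, fetch_description, pause_seconds=1.0):
    """Call fetch_description(link) for every link, in order.

    A failed fetch is recorded as an empty string so that the result
    always has one entry per link (hypothetical convention).
    """
    texts = []
    for n, link in enumerate(links):
        if n > 0:
            time.sleep(pause_seconds)  # pause between consecutive requests
        try:
            texts.append(fetch_description(link))
        except Exception as exc:  # broad on purpose: keep going on any failure
            print(f'Property {n} failed: {exc}')
            texts.append('')
    return texts
```

Here `fetch_description` would wrap the request, parse and XPath steps from the loop above, so that the alignment logic stays separate from the network code.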

## Write the data

In [14]:
output_folder = '/home/alex/Desktop/Data/scraped/apartments'

df.to_csv(f'{output_folder}/scraped_161022.csv', index=False)