# Web Scraping Apartments

## General Approach

When the data we need is not fully available in HTML tables, we can scrape it from the page itself. In this notebook, we will scrape apartment listings from ss.com, a general marketplace, and store the extracted data in a pandas DataFrame.

## Steps

1. We will use the requests library to get the HTML code of the website.
2. We will use the BeautifulSoup library to parse the HTML code.
3. We will extract the data from the HTML code. We will use BeautifulSoup to find the data we need.
4. We will store the data in a pandas DataFrame.
5. Optionally, we will clean the data.
6. We will save the data in a CSV, XLSX, or JSON file.
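
Steps 4 and 6 can be sketched without touching the network. The example below is only an illustration, not the notebook's own code: the two records reuse values that appear later in this notebook, while the `apartments_` file prefix and the temporary output folder are assumptions made for the example.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd

# two hand-written records standing in for scraped ads
ads = [
    {"Iela": "Dzirnavu 157", "Ist.": "4", "Cena": "25,300 €"},
    {"Iela": "Cēsu 5", "Ist.": "3", "Cena": "185,000 €"},
]

# step 4: a list of dictionaries converts directly to a DataFrame
df = pd.DataFrame(ads)

# step 6: a timestamp in the file name keeps repeated runs from overwriting each other
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir = Path(tempfile.mkdtemp())  # assumption: a throwaway folder for the demo
csv_path = out_dir / f"apartments_{timestamp}.csv"
json_path = out_dir / f"apartments_{timestamp}.json"

df.to_csv(csv_path, index=False)
df.to_json(json_path, orient="records", force_ascii=False)
# df.to_excel(...) works the same way but needs an Excel engine such as openpyxl
print(csv_path.name)
```

Passing `index=False` leaves the row numbers out of the CSV, and `force_ascii=False` keeps Latvian characters readable in the JSON file.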


In [1]:
# first we load libraries
# let's start with standard libraries
# we will use time to pause between requests - good practice
import time
# we will use datetime to get the current date and time for custom file names
from datetime import datetime

# we will need requests to get the data from the web
import requests

# we will want BeautifulSoup to parse the data
from bs4 import BeautifulSoup
# if you do not have BeautifulSoup installed, you can install it with pip install beautifulsoup4
# official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
# in our case pip install pandas[html] will install BeautifulSoup as well

# we will need pandas to manipulate the data - once we have it
import pandas as pd
# pandas version
print(f"pandas version: {pd.__version__}")

pandas version: 2.2.2


In [2]:
# now we just need a url to scrape
url = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/"
print(f"Will open the following url: {url}")

Will open the following url: https://www.ss.com/lv/real-estate/flats/riga/centre/sell/


In [3]:
# now we will request the data from the web
response = requests.get(url, timeout=10)  # timeout (in seconds) keeps the request from hanging indefinitely
# check response and raise error if it is not 200
if response.status_code != 200:
    raise Exception(f"Failed to load page, status code: {response.status_code}")
else:
    print(f"Page {url} loaded successfully")

Page https://www.ss.com/lv/real-estate/flats/riga/centre/sell/ loaded successfully
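
The imports mention pausing between requests as good practice. One possible way to do that when several pages are needed is sketched below. The helper names `fetch_pages` and `fetch_with_requests`, the one-second delay, and the ten-second timeout are all assumptions chosen for illustration, not something the site prescribes.

```python
import time


def fetch_pages(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    pages = {}
    for i, page_url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        pages[page_url] = fetch(page_url)
    return pages


def fetch_with_requests(page_url):
    """Download one page and return its HTML, raising on HTTP errors."""
    import requests  # imported here so fetch_pages can be tried without it

    response = requests.get(page_url, timeout=10)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx answers
    return response.text


# offline demonstration with a stand-in fetch function
demo = fetch_pages(["page-a", "page-b"], fetch=str.upper, delay=0)
print(demo)
```

With a real list of page addresses, the call would be `fetch_pages(urls, fetch_with_requests)`. Passing the fetch function as a parameter also makes the loop easy to try out without network access.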


In [4]:
# now we have ALL the data for the page in our memory
# let's look at first 100 characters of text
print(response.text[:100])

<!DOCTYPE html>
<HTML lang="lv"><HEAD>
<title>SS.COM Dzīvokļi - Rīga - Centrs, Cenas, Pārdod - Slu


In [7]:
# we could parse our html by hand but it would be quite painful and unnecessary
# for example I could find where Valdemāra is mentioned
valdemara = response.text.find("Valdemāra")
print(f"Valdemāra is mentioned at position {valdemara}")
# I could print some 60 characters around it
print(response.text[valdemara-30:valdemara+30])
# so it is possible, but too slow and error-prone

Valdemāra is mentioned at position 17057
</option><option value="4545">Valdemāra</option><option valu


## Making soup out of HTML

Instead, we will use the BeautifulSoup library to parse the HTML code and find the data we need.

In [8]:
# so we will make soup out of response
soup = BeautifulSoup(response.text, 'lxml')
# naming the parser is optional, but omitting it triggers a warning
# lxml is generally faster than Python's built-in html.parser
# if you do not have lxml installed, you can install it with pip install lxml
# alternatively you could use the built-in parser
# soup = BeautifulSoup(response.text, 'html.parser')
# print title of page
print(soup.title)

<title>SS.COM Dzīvokļi - Rīga - Centrs, Cenas, Pārdod - Sludinājumi</title>


In [9]:
# let's get a table row headline
# notice that tr element has an id of head_line
# id attributes are supposed to be unique
# we can use this to find our row
headline_row = soup.find("tr", {"id": "head_line"})
# so we passed two arguments to find
# first is the name of the tag we are looking for  
# second is a dictionary of attributes we are looking for
print(headline_row)
# in this case I bypassed the need to look for a specific table since I already know the id


<tr id="head_line">
<td class="msg_column" colspan="3" width="70%">
<span style="float:left;"> Sludinājumi
</span>
<span align="right" class="msg_column" style="float:right;text-align:right;padding-right:3px;">
<noindex>
<a class="a19" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4S.html" rel="nofollow">datums</a></noindex></span>
</td>
<td class="msg_column_td" nowrap=""><noindex><a class="a18" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4SFDwT.html" rel="nofollow" title="">Iela</a></noindex></td><td class="msg_column_td" nowrap=""><noindex><a class="a18" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4SelM=.html" rel="nofollow" title="">Ist.</a></noindex></td><td class="msg_column_td" nowrap=""><noindex><a class="a18" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4QelM=.html" rel="nofollow" title="">m2</a></noindex></td><td class="msg_column_td" nowrap=""><noindex><a class="a18" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4XelM=.html" rel="nofollow" title
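
The same lookup can be written in a few equivalent ways. The sketch below uses a tiny hand-written table rather than the real page, and the variable name `soup_demo` is chosen only for this example. It also shows that `find` returns `None` when nothing matches, which is why checking the result is worthwhile.

```python
from bs4 import BeautifulSoup

# a tiny hand-written table, not the real page
html = """
<table>
  <tr id="head_line"><td>Iela</td><td>Cena</td></tr>
  <tr id="tr_1"><td>Dzirnavu 157</td><td>25,300 €</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

by_dict = soup_demo.find("tr", {"id": "head_line"})  # attribute dictionary
by_keyword = soup_demo.find("tr", id="head_line")    # keyword argument
by_css = soup_demo.select_one("tr#head_line")        # CSS selector

# all three approaches locate the same row
print(by_dict == by_keyword == by_css)
# a missing id gives None instead of raising an error
print(soup_demo.find("tr", id="no_such_id"))
```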

In [11]:
# now we could find all td - table data elements in the row
# td docs: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td
table_data = headline_row.find_all("td")
# how many
print(f"Found {len(table_data)} table data elements")
# print last two
print(table_data[-2:])

Found 8 table data elements
[<td background="https://i.ss.com/img/pl.gif" class="msg_column" nowrap="" style="border-left:1px #FFFFFF solid;"><noindex><a class="a18" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4bRDwT.html" rel="nofollow">Cena, m2</a></noindex></td>, <td class="msg_column_td" nowrap=""><noindex><a class="a18" href="/lv/real-estate/flats/riga/centre/sell/fDgSeF4belM=.html" rel="nofollow" title="">Cena</a></noindex></td>]


In [13]:
# now let's extract all text from each td
# we could use a simple list comprehension
# table_data_text = [td.text.strip() for td in table_data]
# here we do the same with a for loop
table_data_text = []
for td in table_data:
    table_data_text.append(td.text.strip())
print(table_data_text)
# the first element is not useful; we want the rest

['Sludinājumi\r\n\n\n\ndatums', 'Iela', 'Ist.', 'm2', 'Stāvs', 'Sērija', 'Cena, m2', 'Cena']
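
The first element keeps its inner line breaks because `strip()` only trims the ends of the string. If a tidy version were needed, `get_text` with a separator and `strip=True` is one option. The sketch below uses a hand-written cell that only imitates the first header cell.

```python
from bs4 import BeautifulSoup

# hand-written cell imitating the first header cell
html = '<td><span> Sludinājumi\r\n</span>\n<span><a href="#">datums</a></span></td>'
cell = BeautifulSoup(html, "html.parser").td

raw = cell.text.strip()                 # trims only the ends of the string
tidy = cell.get_text(" ", strip=True)   # strips each text piece, joins with a space

print(repr(raw))
print(repr(tidy))
```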


In [28]:
# let's write a function that takes soup as a parameter and returns a tuple of column names
# we will also provide three more parameters:
# row_id - the id used to find the row
# skip - how many columns to skip
# start - a tuple of column names to start with
def get_column_names(soup, row_id="head_line", skip=1, start=("URL", "Description")):
    # find the row
    headline_row = soup.find("tr", {"id": row_id})
    # check that we found the row
    if headline_row is None:
        raise ValueError(f"Failed to find row with id: {row_id}")
    # find all table data elements
    table_data = headline_row.find_all("td")
    # extract text
    table_data_text = [td.text.strip() for td in table_data]
    # return the column names as a tuple
    # (if start were a list, we would use start + table_data_text[skip:] instead)
    return start + tuple(table_data_text[skip:])

# let's test our function on our soup
column_names = get_column_names(soup)
print(column_names)

('URL', 'Description', 'Iela', 'Ist.', 'm2', 'Stāvs', 'Sērija', 'Cena, m2', 'Cena')


In [17]:
# now we want to gather all table rows that have id that starts with tr_
# first we get all table rows
all_rows = soup.find_all("tr")
# how many
print(f"Found {len(all_rows)} rows")

Found 39 rows


In [19]:
# now let's filter those rows that have id starting with tr_
# let's start with a loop solution
apartment_rows = []
for row in all_rows:
    # the id attribute is not guaranteed to exist,
    # so get with a default is safer than direct access
    # and lets us always call startswith
    row_id = row.get("id", "")
    # we skip rows whose id starts with tr_bnr:
    # those are banners and we are not interested in them
    if row_id.startswith("tr_") and not row_id.startswith("tr_bnr"):
        apartment_rows.append(row)
# how many
print(f"Found {len(apartment_rows)} apartment rows")

Found 30 apartment rows
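
The loop is one solution. Two more compact alternatives are sketched below on a hand-written table. The row ids here, including the banner id `tr_bnr_1`, are made up for the example. The second variant relies on BeautifulSoup accepting a compiled regular expression as an attribute filter.

```python
import re

from bs4 import BeautifulSoup

# hand-written rows; the ids are invented for this example
html = """
<table>
  <tr id="head_line"><td>Iela</td></tr>
  <tr id="tr_bnr_1"><td>banner</td></tr>
  <tr id="tr_101"><td>Dzirnavu 157</td></tr>
  <tr><td>row without an id</td></tr>
  <tr id="tr_102"><td>Cēsu 5</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# variant 1: list comprehension with the same two checks as the loop
rows_a = [
    row for row in soup_demo.find_all("tr")
    if row.get("id", "").startswith("tr_")
    and not row.get("id", "").startswith("tr_bnr")
]

# variant 2: a regular expression; rows without an id simply do not match
pattern = re.compile(r"^tr_(?!bnr)")
rows_b = soup_demo.find_all("tr", id=pattern)

print([row["id"] for row in rows_a])
print([row["id"] for row in rows_b])
```

The negative lookahead `(?!bnr)` rejects ids where `tr_` is immediately followed by `bnr`, which mirrors the second `startswith` check.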


In [20]:
# let's analyze the first row
first_row = apartment_rows[0]
# print first row
print(first_row)

<tr id="tr_55022175"><td class="msga2 pp0"><input id="c55022175" name="mid[]" type="checkbox" value="55022175_1106_0"/></td><td class="msga2"><a href="/msg/lv/real-estate/flats/riga/centre/bcmghe.html" id="im55022175"><img alt="" class="isfoto foto_list" src="https://i.ss.com/gallery/7/1231/307705/61540873.th2.jpg"/></a></td><td class="msg2"><div class="d1"><a class="am" data="ZCU5QSU5NyU4RSU3RSVBQXolQUVqaiU5NyU5QyU4OSU3RCVBRndjbWolOUIlOUUlODV3JUE4c2I=|3ffUFxC29" href="/msg/lv/real-estate/flats/riga/centre/bcmghe.html" id="dm_55022175">Plašas telpas jūsu plāniem. Ideāli gan sev, gan kā Invest projek</a></div></td><td c="1" class="msga2-o pp6" nowrap="">Dzirnavu 157</td><td c="1" class="msga2-o pp6" nowrap="">4</td><td c="1" class="msga2-o pp6" nowrap="">68</td><td c="1" class="msga2-o pp6" nowrap="">1/3</td><td c="1" class="msga2-o pp6" nowrap="">P. kara</td><td c="1" class="msga2-o pp6" nowrap="">372 €</td><td c="1" class="msga2-o pp6" nowrap="">25,300  €</td></tr>


In [23]:
# let us extract text from all td elements
# we will use list comprehension
first_row_data = [td.text.strip() for td in first_row.find_all("td")]
# print first row data
print(*first_row_data, sep="\n")



Plašas telpas jūsu plāniem. Ideāli gan sev, gan kā Invest projek
Dzirnavu 157
4
68
1/3
P. kara
372 €
25,300  €


In [24]:
# so the first cell is not needed at all; we can skip it
# the second cell contains an anchor link which we could use to make a url to the ad
# the third cell contains the description, and so on

# let's get url from the second cell
# we will use find method to find the first anchor tag
# then we will get the href attribute
# we will use get method to get the attribute

# first we get the second cell
second_cell = first_row.find_all("td")[1]
# then we get the anchor tag
anchor = second_cell.find("a")
# then we get the href attribute
# note that this overwrites the page url variable defined earlier
url = anchor.get("href") # we expect the anchor in this cell to have an href attribute
# print url
print(url)

/msg/lv/real-estate/flats/riga/centre/bcmghe.html


In [25]:
# we just need a prefix to make it a full url
prefix = "https://www.ss.com"
full_url = prefix + url
print(full_url)
# so this is the info we could not extract with pandas table reading

https://www.ss.com/msg/lv/real-estate/flats/riga/centre/bcmghe.html
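
String concatenation works here because the link starts with a slash. As an alternative sketch, `urljoin` from the standard library resolves a link against the address of the page it came from, and it also copes with links that are already absolute. The variable names below are chosen for the example.

```python
from urllib.parse import urljoin

page_url = "https://www.ss.com/lv/real-estate/flats/riga/centre/sell/"
href = "/msg/lv/real-estate/flats/riga/centre/bcmghe.html"

# a root-relative link replaces the whole path of the page address
joined_url = urljoin(page_url, href)
print(joined_url)

# a link that is already absolute passes through unchanged
print(urljoin(page_url, "https://www.ss.com/lv/"))
```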


In [26]:
# now we can make a function to extract all data from a row
# we will take the row and column names as parameters
# we will return a dictionary
# why? because a list of dictionaries converts nicely to a pandas DataFrame
def get_ad_dict(row, column_names, url_td=1, prefix="https://www.ss.com"):
    # first we get all table data elements
    table_data = row.find_all("td")
    # now we find url from index with url_td
    # typically it will be second element with index 1
    url = prefix + table_data[url_td].find("a").get("href")
    # then we extract text from each element after url_td
    table_data_text = [td.text.strip() for td in table_data[url_td+1:]]
    # add url to beginning of the list
    table_data_text.insert(0, url)
    # then we make a dictionary from column names and row data
    # check that column names and row values have the same length
    assert len(column_names) == len(table_data_text)
    # the assert is not required, but it catches mismatched rows early
    return dict(zip(column_names, table_data_text))

In [29]:
# now let us test our function
first_ad_dict = get_ad_dict(first_row, column_names)
print(first_ad_dict)

{'URL': 'https://www.ss.com/msg/lv/real-estate/flats/riga/centre/bcmghe.html', 'Description': 'Plašas telpas jūsu plāniem. Ideāli gan sev, gan kā Invest projek', 'Iela': 'Dzirnavu 157', 'Ist.': '4', 'm2': '68', 'Stāvs': '1/3', 'Sērija': 'P. kara', 'Cena, m2': '372 €', 'Cena': '25,300  €'}
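
If the page layout changes, a row may lack a link or have a different number of cells. A more defensive variant is sketched below. The name `get_ad_dict_checked` is hypothetical, and the demo row with its `/msg/demo.html` link is hand-written. It raises `ValueError` with a readable message instead of relying on `assert`, which Python skips when run with the `-O` flag.

```python
from bs4 import BeautifulSoup


def get_ad_dict_checked(row, column_names, url_td=1, prefix="https://www.ss.com"):
    """Like get_ad_dict, but raises ValueError with a readable message."""
    cells = row.find_all("td")
    if len(cells) <= url_td:
        raise ValueError(f"Expected more than {url_td} cells, found {len(cells)}")
    anchor = cells[url_td].find("a")
    if anchor is None or not anchor.get("href"):
        raise ValueError(f"No link found in cell {url_td}")
    # url first, then the text of every cell after the link cell
    values = [prefix + anchor["href"]]
    values += [td.text.strip() for td in cells[url_td + 1:]]
    if len(values) != len(column_names):
        raise ValueError(f"Expected {len(column_names)} values, found {len(values)}")
    return dict(zip(column_names, values))


# hand-written row with a made-up link
row_html = (
    '<tr id="tr_1"><td></td><td><a href="/msg/demo.html">img</a></td>'
    "<td>Nice flat</td><td>Dzirnavu 157</td></tr>"
)
demo_row = BeautifulSoup(row_html, "html.parser").tr
print(get_ad_dict_checked(demo_row, ("URL", "Description", "Iela")))
```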


In [30]:
# now we can make a function that will create a list of dictionaries from all rows
# we pass in all_rows and get a list of dictionaries
def get_all_ads(all_rows, column_names):
    # we will use list comprehension
    return [get_ad_dict(row, column_names) for row in all_rows]

# let's test our function
all_ads = get_all_ads(apartment_rows, column_names)
# how many
print(f"Found {len(all_ads)} ads")
# let's print first 3 ads
for ad in all_ads[:3]:
    print(ad)

Found 30 ads
{'URL': 'https://www.ss.com/msg/lv/real-estate/flats/riga/centre/bcmghe.html', 'Description': 'Plašas telpas jūsu plāniem. Ideāli gan sev, gan kā Invest projek', 'Iela': 'Dzirnavu 157', 'Ist.': '4', 'm2': '68', 'Stāvs': '1/3', 'Sērija': 'P. kara', 'Cena, m2': '372 €', 'Cena': '25,300  €'}
{'URL': 'https://www.ss.com/msg/lv/real-estate/flats/riga/centre/dbhnb.html', 'Description': 'Mūsdienīgs dzīvoklis centrā, atjaunotā namā, kas renovēts cienot', 'Iela': 'Cēsu 5', 'Ist.': '3', 'm2': '72', 'Stāvs': '2/5', 'Sērija': 'P. kara', 'Cena, m2': '2,569 €', 'Cena': '185,000  €'}
{'URL': 'https://www.ss.com/msg/lv/real-estate/flats/riga/centre/adoki.html', 'Description': 'Omulīgs dzīvoklis ar labu mājas sajūtu, 2 istabas, 35 kv. m, 2/5', 'Iela': 'Antonijas 15', 'Ist.': '2', 'm2': '35', 'Stāvs': '2/5', 'Sērija': 'P. kara', 'Cena, m2': '2,143 €', 'Cena': '75,000  €'}


In [31]:
# now I can convert all_ads into a DataFrame
all_ads_df = pd.DataFrame(all_ads)
# let's check the first few rows
all_ads_df.head()

| | URL | Description | Iela | Ist. | m2 | Stāvs | Sērija | Cena, m2 | Cena |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.ss.com/msg/lv/real-estate/flats/ri... | Plašas telpas jūsu plāniem. Ideāli gan sev, ga... | Dzirnavu 157 | 4 | 68 | 1/3 | P. kara | 372 € | 25,300 € |
| 1 | https://www.ss.com/msg/lv/real-estate/flats/ri... | Mūsdienīgs dzīvoklis centrā, atjaunotā namā, k... | Cēsu 5 | 3 | 72 | 2/5 | P. kara | 2,569 € | 185,000 € |
| 2 | https://www.ss.com/msg/lv/real-estate/flats/ri... | Omulīgs dzīvoklis ar labu mājas sajūtu, 2 ista... | Antonijas 15 | 2 | 35 | 2/5 | P. kara | 2,143 € | 75,000 € |
| 3 | https://www.ss.com/msg/lv/real-estate/flats/ri... | Renovētā mājā pārdod gaišu, klusu 4 istabu dzī... | Čaka 31 | 4 | 74 | 3/4 | P. kara | 1,959 € | 145,000 € |
| 4 | https://www.ss.com/msg/lv/real-estate/flats/ri... | Renovētā mājā pārdod gaišu, klusu 3 istabu Dzī... | Čaka 31 | 3 | 84 | 3/4 | P. kara | 1,452 € | 122,000 € |
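
All columns are still text at this point, so prices and floors cannot be sorted or averaged yet. The optional cleaning step could look roughly like the sketch below. It uses three records copied from the output above. The helper `to_int` and the new column names are hypothetical, and the code assumes that every price is a whole number of euros and that every `Stāvs` value has the form `floor/total`.

```python
import pandas as pd

# three records copied from the output above, reduced to the columns being cleaned
df = pd.DataFrame({
    "Stāvs": ["1/3", "2/5", "2/5"],
    "Cena, m2": ["372 €", "2,569 €", "2,143 €"],
    "Cena": ["25,300  €", "185,000  €", "75,000  €"],
})


def to_int(series):
    """Drop every non-digit character, then convert to integers."""
    return series.str.replace(r"\D", "", regex=True).astype(int)


df["price_eur"] = to_int(df["Cena"])
df["price_per_m2_eur"] = to_int(df["Cena, m2"])

# split "1/3" into the floor of the flat and the number of floors in the building
floor_parts = df["Stāvs"].str.split("/", expand=True)
df["floor"] = floor_parts[0].astype(int)
df["floors_total"] = floor_parts[1].astype(int)

print(df[["price_eur", "price_per_m2_eur", "floor", "floors_total"]])
```

Values that break these assumptions, such as a price with decimals or a floor entry with extra text, would need additional handling before the conversion to integers.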
