[< Back to the main notebook](./index.md)

# Detour no.1: Rental properties in Prague

> This is a rendered version of a Jupyter notebook. The source notebook can be found [in my GitHub repository](https://github.com/barjin/ndbi023-project), along with the data used in this analysis.

For the data about the rental prices in Prague, I scraped the website [sreality.cz](https://www.sreality.cz/). The website is a real estate portal that lists properties for sale and rent in the Czech Republic. The data was collected in the first week of May, 2024.

The website provides a search interface where users can filter the properties based on various criteria. The frontend application sends requests to the API at the backend, which returns the search results in JSON format. The API is not documented, but it is possible to reverse-engineer the requests sent by the frontend application to the API.

The API also doesn't seem to employ any kind of rate limiting or bot detection. This makes it possible to scrape the data from the website with a very simple `curl` call (thanks!). The command used to pull the data from the server is at [`scripts/sreality/index.sh`](https://github.com/barjin/ndbi023-project/blob/master/scripts/sreality/index.sh).

```bash
#!/bin/bash
# This script downloads data from sreality.cz, processes it and saves the result to data/sreality/index.json
# Jindřich Bär (barjin), 2024

# Expected usage: ./index.sh
#  - the script stores the data in `data/sreality/index.json`
#  - in case of any errors, try running `chmod +x ./*.sh` in this directory first.

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

# Download the listing ids from Sreality.cz
"$SCRIPT_DIR/downloader.sh"

# Parse the response and pull the hash_ids for each listing
jq '._embedded.estates[]' data.json | jq -s . > array.json
jq '.[].hash_id' ./array.json > ids.csv

# Download the details for each listing, based on the hash_id and store them in details.json
"$SCRIPT_DIR/download.details.sh"
jq . details.json | jq -s . > details_array.json

# Combine the locality and recommendations_data into a single object and store it in index.json
jq ".[] | { locality: .locality.value } + .recommendations_data" ./details_array.json | jq -s . > ../../data/sreality/index.json

# Clean up
rm data.json array.json details.json details_array.json ids.csv
```
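
For readers less familiar with `jq`, the two transformations in the script can be sketched in Python. The field names (`_embedded.estates`, `hash_id`, `locality.value`, `recommendations_data`) come from the script above. The function names and the sample payloads are made up for illustration and are far smaller than the real API responses.

```python
import json


def extract_hash_ids(search_response: dict) -> list:
    """Mirror `jq '._embedded.estates[]'` followed by `jq '.[].hash_id'`."""
    return [estate["hash_id"] for estate in search_response["_embedded"]["estates"]]


def flatten_detail(detail: dict) -> dict:
    """Mirror `{ locality: .locality.value } + .recommendations_data`.

    As with jq's `+` on objects, keys from the right-hand side win on conflict.
    """
    return {"locality": detail["locality"]["value"], **detail["recommendations_data"]}


# Hypothetical, heavily trimmed payloads (not real API output)
sample_search = json.loads(
    '{"_embedded": {"estates": [{"hash_id": 111}, {"hash_id": 222}]}}'
)
sample_detail = {
    "locality": {"value": "Maiselova, Praha 1 - Josefov"},
    "recommendations_data": {"hash_id": 111, "price_summary_czk": 35000},
}

hash_ids = extract_hash_ids(sample_search)
flat = flatten_detail(sample_detail)
print(hash_ids)
print(flat)
```

Each flattened object becomes one row of the final `index.json`, which is why the file can be read directly into a DataFrame.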

## First look into the data

Once we've acquired the data, we can load it into a pandas DataFrame and start the analysis.

In [7]:
```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('./data/sreality/index.json')
df
```

| | locality | category_main_cb | furnished | object_type | parking_lots | price_summary_unit_cb | locality_street_id | terrace | balcony | locality_district_id | ... | low_energy | easy_access | building_condition | locality_country_id | garage | room_count_cb | energy_efficiency_rating_cb | price_summary_czk | garden_area | estate_area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maiselova, Praha 1 - Josefov | 1.0 | 2.0 | 0.0 | 0.0 | 2.0 | 120458.0 | 0.0 | 1.0 | 5001.0 | ... | 0.0 | 0.0 | 1.0 | 112.0 | 0.0 | 0.0 | 7.0 | 35000.0 | NaN | NaN |
| 1 | Opatovická, Praha 1 - Nové Město | 1.0 | 3.0 | 0.0 | 0.0 | 2.0 | 121234.0 | 1.0 | 1.0 | 5001.0 | ... | 0.0 | 0.0 | 9.0 | 112.0 | 0.0 | 0.0 | 3.0 | 49500.0 | NaN | NaN |
| 2 | Grafická, Praha 5 - Smíchov | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 119641.0 | 1.0 | 0.0 | 5005.0 | ... | 1.0 | 1.0 | 6.0 | 112.0 | 1.0 | 0.0 | 1.0 | 32500.0 | NaN | NaN |
| 3 | Holečkova, Praha 5 - Košíře | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 119726.0 | 0.0 | 0.0 | 5005.0 | ... | 0.0 | 0.0 | 1.0 | 112.0 | 0.0 | 1.0 | 7.0 | 33000.0 | NaN | NaN |
| 4 | Na zátorách, Praha 7 - Holešovice | 1.0 | 1.0 | 0.0 | 0.0 | 2.0 | 120960.0 | 0.0 | 1.0 | 5007.0 | ... | 0.0 | 1.0 | 1.0 | 112.0 | 0.0 | 0.0 | 0.0 | 10158.0 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3655 | Řeporyjská, Praha 5 - Jinonice | 1.0 | 2.0 | 0.0 | 1.0 | 2.0 | 121745.0 | 0.0 | 0.0 | 5005.0 | ... | 0.0 | 0.0 | 6.0 | 112.0 | 1.0 | 0.0 | 3.0 | 23000.0 | NaN | NaN |
| 3656 | Na Poříčí, Praha 1 - Nové Město | 1.0 | 3.0 | 0.0 | 1.0 | 2.0 | 120809.0 | 1.0 | 0.0 | 5001.0 | ... | 0.0 | 0.0 | 1.0 | 112.0 | 1.0 | 0.0 | 7.0 | 75000.0 | NaN | NaN |
| 3657 | náměstí Jiřího z Poděbrad, Praha 3 - Vinohrady | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 122964.0 | 0.0 | 0.0 | 5003.0 | ... | 0.0 | 0.0 | 2.0 | 112.0 | 0.0 | 0.0 | 0.0 | 40609.0 | NaN | NaN |
| 3658 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


There seem to be some `NaN` values in the data, which we need to handle. We can also drop the columns that are not relevant to our analysis.

In [8]:
```python
df.isna().sum()
```

```text
locality                          3
category_main_cb                  3
furnished                         3
object_type                       3
parking_lots                      3
price_summary_unit_cb             3
locality_street_id                3
terrace                           3
balcony                           3
locality_district_id              3
locality_ward_id                  3
loggia                            3
category_sub_cb                   3
building_type                     3
elevator                          3
locality_gps_lat                  3
basin                             3
locality_region_id                3
category_type_cb                  3
hash_id                           3
cellar                            3
object_kind                       3
locality_gps_lon                  3
usable_area                       3
locality_quarter_id               3
ownership                         3
locality_municipality_id          3
low_energy
```

We can drop the columns with too many missing values (`garden_area` and `estate_area`) first, and then the remaining rows that still contain a `NaN`:

In [9]:
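```python
# Drop the columns with too many missing values
df.drop(['garden_area', 'estate_area'], inplace=True, axis=1)

# Drop the remaining rows that contain any missing value
df.dropna(inplace=True)
```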
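
The order of the two steps matters: `dropna()` removes every row that has at least one missing value, so a column that is empty nearly everywhere would take almost the whole table with it if it were still present. A small sketch on made-up data (the three-row frame and its values are illustrative, not taken from the scraped dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: `garden_area` is missing in every row, the last row is missing entirely
toy = pd.DataFrame({
    "locality": ["A", "B", None],
    "price_summary_czk": [35000.0, 49500.0, np.nan],
    "garden_area": [np.nan, np.nan, np.nan],
})

# Dropping rows first discards everything, because every row has a NaN in `garden_area`
rows_if_dropna_first = len(toy.dropna())

# Dropping the sparse column first keeps the two complete rows
cleaned = toy.drop(columns=["garden_area"]).dropna()
rows_after_cleaning = len(cleaned)

print(rows_if_dropna_first, rows_after_cleaning)
```

On the toy frame, the first approach leaves no rows at all, while the second one only loses the fully empty row.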

Now we simply save the cleaned data to a new file, and we're ready to start the analysis.

In [12]:
```python
df.to_csv('./data/sreality/index.csv', index=False)
```
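
Passing `index=False` keeps the DataFrame's row index out of the file, so reading the CSV back does not produce an extra unnamed column. A minimal sketch, using in-memory buffers and made-up values instead of the real file:

```python
import io

import pandas as pd

toy = pd.DataFrame({"locality": ["A", "B"], "price_summary_czk": [35000.0, 49500.0]})

# Default behaviour: the index is written as a first column with an empty header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the actual data columns end up in the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```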

---

[< Back to the main notebook](./index.md)