---


<img width=25% src="https://raw.githubusercontent.com/gabrielcapela/credit_risk/main/images/myself.png" align=right>

# **Rent Forecast Project**

*by Gabriel Capela*

[<img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white"/>](https://www.linkedin.com/in/gabrielcapela)
[<img src="https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white" />](https://medium.com/@gabrielcapela)

---

This project applies **Machine Learning** techniques to predict the rental value of residential properties in the city of São Paulo, based on characteristics such as location, number of bedrooms, parking spaces, and other relevant attributes.

The goal is to develop a **decision-making support tool** that helps tenants and owners identify fair prices that are consistent with the market.

To this end, a web interface will be created where the user can enter information about the property and obtain an estimated rental value in return. 

<p align="center">
<img width=70% src="https://raw.githubusercontent.com/gabrielcapela/rent-predict-sp/main/images/crisp-dm.jpeg">
</p>

The development of the project involves steps such as **understanding the problem, analyzing and preparing the data, building and evaluating predictive models, and finally implementing a web application** that allows the practical use of the solution.

# Business Understanding

## Problem Contextualization

The city of **São Paulo** is home to one of the largest and most dynamic real estate markets in Brazil. In such a competitive and volatile market, understanding and predicting rental prices is essential for different agents in the sector: landlords, tenants, investors, and digital rental platforms.

Using historical data on properties available for rent in São Paulo during 2019, this project aims to **explore and model the factors that directly influence rental prices**. In addition, it seeks to develop an accessible and functional interface, allowing users to enter characteristics of a property and quickly and conveniently obtain a price estimate, thus becoming an effective tool for decision support in the real estate market.

Since the **main objective of this project is to deploy a functional solution**, AutoML tools are used to accelerate the predictive modeling process. The YData Profiling library handles automated exploratory data analysis, and PyCaret handles preprocessing, rapid testing of different regression models, and selection of the best approach based on performance metrics. This strategy concentrates effort on delivering a usable end product.

For the deployment phase, the project will use **Flask to build a lightweight and flexible web interface**, enabling users to interact with the predictive model in a simple and efficient way. **This application will be hosted on Heroku**, a cloud platform that facilitates easy and scalable deployment of web applications. This setup ensures that the solution is not only technically robust, but also accessible to end users without requiring advanced technical expertise.
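As a rough illustration of the deployment idea described above, the sketch below exposes a single JSON endpoint with Flask. The route name `/predict`, the `create_app` factory, the response key `estimated_rent`, and the `placeholder_predict` stub are all assumptions made for this example; in the real application the stub would be replaced by a call to the trained model.

```python
from flask import Flask, jsonify, request

# Feature names as they appear in the dataset (the target 'Price' is excluded).
FEATURES = ["Condo", "Size", "Rooms", "Toilets", "Suites", "Parking",
            "Elevator", "Furnished", "Swimming Pool", "New", "District"]


def create_app(predict_fn):
    """Build a Flask app around any callable mapping a feature dict to a price."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True)
        if not isinstance(payload, dict):
            payload = {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": "missing fields", "fields": missing}), 400
        features = {name: payload[name] for name in FEATURES}
        return jsonify({"estimated_rent": float(predict_fn(features))})

    return app


def placeholder_predict(features):
    """Hypothetical stand-in for the trained model; always returns 0.0."""
    return 0.0


app = create_app(placeholder_predict)

# Exercise the endpoint in-process with Flask's test client (no server needed).
example = {"Condo": 220, "Size": 47, "Rooms": 2, "Toilets": 2, "Suites": 1,
           "Parking": 1, "Elevator": 0, "Furnished": 0, "Swimming Pool": 0,
           "New": 0, "District": "Artur Alvim/São Paulo"}
with app.test_client() as client:
    response = client.post("/predict", json=example)
    print(response.status_code, response.get_json())
```

Passing the predictor in as a function keeps the web layer independent of the modeling library, so the route can be exercised before a model file exists.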

## Source of Data

<p align="center">
<img width=70% src="https://raw.githubusercontent.com/gabrielcapela/rent-predict-sp/main/images/kaggle_data.png">
</p>

The data used in this project were obtained from the [Kaggle](https://www.kaggle.com/datasets/argonalyst/sao-paulo-real-estate-sale-rent-april-2019/data) platform and contain detailed information about properties located in the city of São Paulo in 2019. The variables include physical characteristics of the properties (such as number of bedrooms, bathrooms, parking spaces), as well as economic and location variables. The dataset is already presented in a structured and relatively clean format, which allows for a more direct advance to the modeling and implementation phases.

## Importance of the Problem

Predicting rental prices in urban centers like São Paulo is relevant for several reasons:

*   Landlords and real estate agencies can better price their properties, avoiding vacancies or outdated prices;
*   Tenants can identify neighborhoods with better value for money;
*   Online platforms can integrate intelligent price suggestion systems;
*   Startups and real estate businesses can use predictive models as part of their solutions.

However, there are some important limitations:

*   The model does not consider subjective variables such as finishing, perceived security or view;
*   The data is from 2019, which may not reflect the current post-pandemic market;
*   External factors, such as public policies and urban infrastructure, are not directly considered.

## Project Objective

Although predictive modeling is an essential step, the main focus of this project is the deployment phase: that is, putting the model into production through an application accessible via a web interface.

**The goal is to deliver a functional solution**, where users can enter characteristics of a property and instantly obtain an estimate of the rental value, simulating a real usage scenario in systems or online platforms.

## Criteria for Model Evaluation

**The model's performance will be assessed based on the R² metric** (coefficient of determination), which indicates the proportion of the rental price variability that is explained by the model's independent variables. In other words, R² measures how well the model reproduces the real values from the available data. R² is at most 1, with values closer to 1 indicating greater explanatory power; a value of 0 corresponds to always predicting the mean price, and negative values are possible when a model does worse than that.

For this project, **we expect to achieve an R² greater than 0.7**, meaning that the model explains most of the price variability.
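To make the metric concrete, here is a minimal sketch that computes the coefficient of determination by hand as 1 - SS_res / SS_tot. The helper name `r2_score_manual` and the rent values are made up for illustration; they are not taken from the dataset.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Made-up rents (R$), only to exercise the formula.
y_true = [1000, 1500, 2000, 2500, 3000]

perfect = r2_score_manual(y_true, y_true)                           # exact predictions
baseline = r2_score_manual(y_true, [2000] * 5)                      # always predict the mean
poor = r2_score_manual(y_true, [3000, 2500, 2000, 1500, 1000])      # reversed predictions

print(perfect, baseline, poor)  # 1.0 0.0 -3.0
```

The three cases show the reference points for reading the score: 1 for a perfect fit, 0 for a model no better than the mean, and a negative value for a model that is worse than the mean.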

# Data Understanding

This step is essential to gain initial insights into the dataset, **identify patterns**, **detect anomalies**, and **assess data quality**.

The dataset used can be downloaded from this [page](https://github.com/gabrielcapela/rent-predict-sp/blob/7e157b64910df7c095880ea7d5c59964f9d72a45/data/sao-paulo-properties-april-2019.csv) and includes several quantitative and categorical variables for each apartment, as well as **a numeric variable indicating the rental or sale value**.

The complete data analysis will be performed using the YData Profiling package, a tool that generates automatic reports.

## Load and inspect the dataset

First, let's import the necessary packages and the dataset and check its first few lines.

In [1]:
#Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

In [2]:
#Importing the dataset
df = pd.read_csv('https://github.com/gabrielcapela/rent-predict-sp/blob/7e157b64910df7c095880ea7d5c59964f9d72a45/data/sao-paulo-properties-april-2019.csv?raw=true')
pd.set_option('display.max_columns', None)  #Show all the columns
#Showing the first lines
df.head()

| | Price | Condo | Size | Rooms | Toilets | Suites | Parking | Elevator | Furnished | Swimming Pool | New | District | Negotiation Type | Property Type | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 930 | 220 | 47 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.543138 | -46.479486 |
| 1 | 1000 | 148 | 45 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.550239 | -46.480718 |
| 2 | 1000 | 100 | 48 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.542818 | -46.485665 |
| 3 | 1000 | 200 | 48 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.547171 | -46.483014 |
| 4 | 1300 | 410 | 55 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.525025 | -46.482436 |


In [3]:
#Showing the shape of dataset
print(f"The data set has {df.shape[0]} rows and {df.shape[1]} columns.")

The data set has 13640 rows and 16 columns.


Below is the meaning of each variable:

| **Variable**       | **Description**                                                                 |
|--------------------|---------------------------------------------------------------------------------|
| `Price`            | Monthly rent price (in Brazilian Reais - R$).                                   |
| `Condo`            | Monthly condominium fee (in R$).                                                |
| `Size`             | Property size in square meters (m²).                                            |
| `Rooms`            | Total number of rooms (usually includes living room and bedrooms).              |
| `Toilets`          | Total number of bathrooms/toilets.                                              |
| `Suites`           | Number of ensuite bathrooms (bathrooms directly connected to a bedroom).        |
| `Parking`          | Number of available parking spots.                                              |
| `Elevator`         | Indicates whether the building has an elevator (`1` = Yes, `0` = No).           |
| `Furnished`        | Indicates whether the property is furnished (`1` = Yes, `0` = No).              |
| `Swimming Pool`    | Indicates whether the building has a swimming pool (`1` = Yes, `0` = No).       |
| `New`              | Indicates whether the property is new (`1` = Yes, `0` = No).                    |
| `District`         | Neighborhood and city (e.g., `Belém/São Paulo`).                                |
| `Negotiation Type` | Type of transaction (e.g., `rent`).                                             |
| `Property Type`    | Type of property (e.g., `apartment`).                                           |
| `Latitude`         | Geographic coordinate (latitude) of the property.                               |
| `Longitude`        | Geographic coordinate (longitude) of the property.                              |

First, we notice that there is a column with the type of negotiation. This study is limited to rentals, so the sale listings will be excluded:

In [4]:
# Show the unique values of the 'Negotiation Type' column
print(df['Negotiation Type'].value_counts())
# Removing the rows whose 'Negotiation Type' is 'sale'
df = df[df['Negotiation Type'] != 'sale']
#Showing the shape of dataset
print(f"The data set has {df.shape[0]} rows and {df.shape[1]} columns.")

Negotiation Type
rent    7228
sale    6412
Name: count, dtype: int64
The data set has 7228 rows and 16 columns.


## YData Profiling

YData Profiling is a powerful open-source Python tool that automates the process of exploratory data analysis (EDA). With just a single line of code, it generates an interactive and comprehensive HTML report that summarizes key insights about the dataset.

In [5]:
profile = ProfileReport(df, title="report", explorative=True)
#profile.to_notebook_iframe()
profile.to_file("report.html")


Click [here](https://gabrielcapela.github.io/rent-predict-sp/report.html) to see the full report!

### Some observations can already be made:

* 🟢 There are no missing values;
* 🟢 Some values are equal to zero, which is consistent with the meaning of the data;
* 🔴 The columns 'Negotiation Type' and 'Property Type' can be dropped, as each holds a single value;
* 🟢 Many variables are boolean (has or does not have), which simplifies processing;
* 🟡 The 'Longitude' and 'Latitude' columns can be used for a map visualization; and
* 🔴 The 'District' column must be encoded before modeling.
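Several of the observations above can also be checked directly with pandas. The sketch below does so on a small toy frame; the frame `toy` and its values are illustrative stand-ins for the real dataset, and the same expressions would apply to `df`.

```python
import pandas as pd

# Toy frame mimicking a few dataset columns (values are illustrative only).
toy = pd.DataFrame({
    "Price": [930, 1000, 1300],
    "Elevator": [0, 0, 1],
    "Negotiation Type": ["rent", "rent", "rent"],
    "Property Type": ["apartment", "apartment", "apartment"],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Columns holding a single distinct value carry no information for a model.
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]

# Columns whose values are only 0 and/or 1 behave as boolean flags.
binary_columns = [col for col in toy.columns if set(toy[col].unique()) <= {0, 1}]

print(missing_per_column.to_dict())
print(constant_columns)
print(binary_columns)
```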

## Plotting a Map

Interactive visualization of the geographic locations of properties available for rent in the city of São Paulo. **Each point on the map represents a property**, with the color varying according to the rental value: higher values are represented by more intense colors. A **logarithmic transformation** was applied to the rental values to improve the sensitivity of the color scale, since most prices are concentrated in lower ranges.

In [6]:
import folium
import branca.colormap as cm

# Keep only the columns needed for the map and add a log-scaled price
# (np was already imported in the first cell)
df_map = df[['Latitude', 'Longitude', 'Price']].copy()
df_map['Price_log'] = np.log1p(df_map['Price'])

# Map centered in SP
mapa_sp = folium.Map(location=[-23.55052, -46.63331], zoom_start=12)

# Colormap adjusted to log scale
colormap = cm.linear.YlOrRd_09.scale(df_map['Price_log'].min(), df_map['Price_log'].max())
colormap.caption = 'Rent Price (log scale)'

# Adding points to the map
for _, row in df_map.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=3,
        color=colormap(row['Price_log']),
        fill=True,
        fill_color=colormap(row['Price_log']),
        fill_opacity=0.7
    ).add_to(mapa_sp)

colormap.add_to(mapa_sp)
mapa_sp
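The effect of the logarithmic transformation on the color scale can be seen in a small numeric sketch. The prices below are made up to mimic a right-skewed distribution, and `min_max` is a helper introduced only for this example; it mirrors how a linear colormap places each value between its minimum and maximum.

```python
import numpy as np

# Made-up, right-skewed rents (R$): most are low, one is very high.
prices = np.array([900, 1200, 1500, 2000, 15000], dtype=float)


def min_max(values):
    """Rescale values linearly to the [0, 1] interval."""
    return (values - values.min()) / (values.max() - values.min())


linear_position = min_max(prices)            # position on a linear color scale
log_position = min_max(np.log1p(prices))     # position after the log transform

for price, lin, log in zip(prices, linear_position, log_position):
    print(f"R$ {price:>7.0f}  linear={lin:.2f}  log={log:.2f}")
```

On the linear scale the four cheaper listings are squeezed near the bottom of the range, so they would all receive almost the same color; after `log1p` they are spread over a wider portion of the scale.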




#### TO DO LIST 🚮

* 🔴 Delete columns 'Longitude', 'Latitude', 'Negotiation Type' and 'Property Type'.
* 🔴 Encode the column 'District'.
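One common way to encode a column with many categories, such as 'District', is target (mean) encoding; the pipeline that PyCaret prints later in this notebook lists a `TargetEncoder` step for this column. The sketch below is a plain-pandas simplification of the idea, not the library's implementation (library encoders typically add smoothing, which is omitted here). The toy prices and the helper `encode_district` are assumptions made for illustration.

```python
import pandas as pd

# Toy training data; district names follow the dataset's format, prices are illustrative.
train = pd.DataFrame({
    "District": ["Lapa/São Paulo", "Lapa/São Paulo", "Pinheiros/São Paulo",
                 "Pinheiros/São Paulo", "Cambuci/São Paulo"],
    "Price": [2000, 2400, 3600, 4000, 1500],
})

# Learn the mapping district -> mean price on the training data only.
district_means = train.groupby("District")["Price"].mean()
global_mean = train["Price"].mean()


def encode_district(frame):
    """Replace each district by its training mean; unseen districts get the global mean."""
    return frame["District"].map(district_means).fillna(global_mean)


new_rows = pd.DataFrame({"District": ["Lapa/São Paulo", "Saúde/São Paulo"]})

print(encode_district(train).tolist())     # [2200.0, 2200.0, 3800.0, 3800.0, 1500.0]
print(encode_district(new_rows).tolist())  # [2200.0, 2700.0]
```

Learning the means on the training rows only matters: computing them on the full dataset would leak test-set prices into the features.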

# Data Preparation

In this step, we will clean and modify the data based on the insights gathered from the previous analysis. 

## Data Cleaning

First, let's delete the columns marked in the previous step:

In [7]:
# Making a copy
df_clean = df.copy()
# List of columns to drop
columns_to_drop = ['Longitude', 'Latitude', 'Negotiation Type', 'Property Type']

# Drop the columns from the DataFrame
df_clean = df_clean.drop(columns=columns_to_drop, errors='ignore')

print(f"Data shape after dropping columns: {df_clean.shape}")

Data shape after dropping columns: (7228, 12)


## Division of the dataset

In [8]:
#Importing the package needed for split
from sklearn.model_selection import train_test_split

#Dividing the dataset into train and test
df_train, df_test = train_test_split(df_clean, test_size=0.2, random_state=42)

#Checking the sizes of the divisions
print(f"Training set size: {df_train.shape}")
print(f"Testing set size: {df_test.shape}")


Training set size: (5782, 12)
Testing set size: (1446, 12)


# Modeling

## PyCaret

PyCaret is a **Python AutoML library** that makes it easy to build machine learning models with just a few lines of code. It automates tasks like preprocessing, model comparison, and hyperparameter tuning, making it well suited to fast, efficient projects.

In [9]:
# Importing the necessary packages
from pycaret.regression import setup, compare_models, models, create_model, predict_model
from pycaret.regression import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.regression import save_model, load_model

### Defining the Setup

In [10]:
# Creating the PyCaret setup
reg = setup(data=df_train, target='Price')

| | Description | Value |
|---|---|---|
| 0 | Session id | 4203 |
| 1 | Target | Price |
| 2 | Target type | Regression |
| 3 | Original data shape | (5782, 12) |
| 4 | Transformed data shape | (5782, 12) |
| 5 | Transformed train set shape | (4047, 12) |
| 6 | Transformed test set shape | (1735, 12) |
| 7 | Numeric features | 10 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |
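The row counts in the setup summary follow from two nested splits: the 80/20 split made earlier with scikit-learn, and a further split that PyCaret makes inside `setup`. The sketch below reproduces the arithmetic on placeholder rows, assuming PyCaret's default `train_size` of 0.7; the `random_state=0` of the inner split is arbitrary and does not affect the sizes.

```python
from sklearn.model_selection import train_test_split

rows = list(range(7228))  # one placeholder per rental listing

# Outer split used in this notebook: 20% reserved as the final test set.
outer_train, outer_test = train_test_split(rows, test_size=0.2, random_state=42)

# Inner split, assuming PyCaret's default train_size of 0.7 inside setup().
inner_train, inner_holdout = train_test_split(outer_train, train_size=0.7, random_state=0)

print(len(outer_train), len(outer_test))      # 5782 1446
print(len(inner_train), len(inner_holdout))   # 4047 1735
```

So models are fitted and cross-validated on 4,047 rows, checked on a 1,735-row hold-out, and only at the end scored on the 1,446 rows of `df_test`.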


### Creating the pipeline

In [32]:
# Creating the pipeline
reg = setup(data = df_train,
            target = 'Price',
            normalize = True,
            #log_experiment = True,
            experiment_name = 'test_01',
            session_id= 2025)

| | Description | Value |
|---|---|---|
| 0 | Session id | 2025 |
| 1 | Target | Price |
| 2 | Target type | Regression |
| 3 | Original data shape | (5782, 12) |
| 4 | Transformed data shape | (5782, 12) |
| 5 | Transformed train set shape | (4047, 12) |
| 6 | Transformed test set shape | (1735, 12) |
| 7 | Numeric features | 10 |
| 8 | Categorical features | 1 |
| 9 | Preprocess | True |


### Comparing the models

In [33]:
import logging

# Suppressing the log output from LightGBM and other libraries
logging.getLogger("pycaret").setLevel(logging.WARNING)
logging.getLogger("lightgbm").setLevel(logging.WARNING)

# Now running compare_models
best = compare_models()


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| rf | Random Forest Regressor | 765.8808 | 3030991.7038 | 1705.3054 | 0.7521 | 0.2875 | 0.2328 | 0.978 |
| lightgbm | Light Gradient Boosting Machine | 803.0385 | 3232111.1943 | 1762.3425 | 0.7361 | 0.2989 | 0.2476 | 0.642 |
| et | Extra Trees Regressor | 790.8812 | 3191199.9458 | 1754.6646 | 0.7356 | 0.3019 | 0.2415 | 0.807 |
| gbr | Gradient Boosting Regressor | 859.8203 | 3421607.2012 | 1825.5669 | 0.7127 | 0.3204 | 0.2727 | 0.298 |
| knn | K Neighbors Regressor | 934.4019 | 4072433.6987 | 1987.5811 | 0.6655 | 0.3451 | 0.2789 | 0.09 |
| br | Bayesian Ridge | 1065.1606 | 4197167.7844 | 2017.3592 | 0.6538 | 0.5323 | 0.3757 | 0.065 |
| ridge | Ridge Regression | 1066.1086 | 4198019.5836 | 2017.6226 | 0.6537 | 0.5316 | 0.3763 | 0.093 |
| lar | Least Angle Regression | 1066.1993 | 4198126.1275 | 2017.6538 | 0.6537 | 0.5317 | 0.3763 | 0.069 |
| llar | Lasso Least Angle Regression | 1065.6765 | 4198048.5383 | 2017.5974 | 0.6537 | 0.5332 | 0.376 | 0.072 |
| lasso | Lasso Regression | 1065.6894 | 4198060.0696 | 2017.6014 | 0.6537 | 0.5332 | 0.376 | 0.08 |
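Conceptually, a comparison table like the one above comes from cross-validating each candidate model and ranking the candidates by their mean score. The sketch below imitates that loop with plain scikit-learn on synthetic data; it is not PyCaret's implementation, and the data, the two candidates, and the 10-fold setting are choices made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with a non-linear target, used only to exercise the loop.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 1000 + 3000 * X[:, 0] ** 2 + 500 * (X[:, 1] > 0.5) + rng.normal(0, 50, size=300)

candidates = {
    "ridge": Ridge(),
    "rf": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Same folds for every candidate, so the scores are comparable.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
mean_r2 = {
    name: cross_val_score(model, X, y, cv=folds, scoring="r2").mean()
    for name, model in candidates.items()
}

# Rank candidates from best to worst mean R², as the comparison table does.
for name, score in sorted(mean_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```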


### Verifying and instantiating the best model

In [34]:
# Checking the best model 
print(best)

RandomForestRegressor(n_jobs=-1, random_state=2025)


In [35]:
# Instantiating the model
rf = create_model('rf')

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 776.3301 | 3966904.1566 | 1991.7089 | 0.7865 | 0.2803 | 0.23 |
| 1 | 691.2646 | 2022424.5442 | 1422.1197 | 0.7694 | 0.2765 | 0.2276 |
| 2 | 557.2631 | 1001578.9491 | 1000.7892 | 0.794 | 0.2612 | 0.2169 |
| 3 | 787.7459 | 2366808.307 | 1538.4435 | 0.7939 | 0.2884 | 0.2311 |
| 4 | 894.3285 | 5143134.9457 | 2267.8481 | 0.7199 | 0.2999 | 0.2398 |
| 5 | 762.7625 | 3507694.6434 | 1872.884 | 0.7106 | 0.2897 | 0.231 |
| 6 | 758.1093 | 2904089.0892 | 1704.1388 | 0.7601 | 0.2993 | 0.2215 |
| 7 | 722.9952 | 2006349.313 | 1416.4566 | 0.8348 | 0.2797 | 0.2315 |
| 8 | 839.6349 | 4109407.3576 | 2027.1673 | 0.6179 | 0.2975 | 0.2459 |
| 9 | 868.3739 | 3281525.7326 | 1811.4982 | 0.7344 | 0.3028 | 0.2525 |


### Hyperparameter Tuning

In [36]:
tuned_rf = tune_model(rf, optimize='R2')

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 825.8662 | 4252288.4751 | 2062.1078 | 0.7711 | 0.2893 | 0.2503 |
| 1 | 710.8573 | 2106364.9797 | 1451.3321 | 0.7599 | 0.2897 | 0.243 |
| 2 | 614.8172 | 1227814.5408 | 1108.0679 | 0.7475 | 0.2828 | 0.2402 |
| 3 | 834.1097 | 2424267.7183 | 1557.006 | 0.7889 | 0.3092 | 0.2607 |
| 4 | 931.0277 | 6409433.2707 | 2531.6859 | 0.6509 | 0.3086 | 0.2537 |
| 5 | 796.6593 | 3523053.3278 | 1876.9798 | 0.7093 | 0.3044 | 0.2555 |
| 6 | 820.5907 | 3019802.2242 | 1737.7578 | 0.7505 | 0.3187 | 0.2473 |
| 7 | 763.7136 | 1959644.1173 | 1399.8729 | 0.8386 | 0.2886 | 0.2528 |
| 8 | 866.7328 | 3867891.1074 | 1966.6955 | 0.6403 | 0.3099 | 0.2677 |
| 9 | 918.2176 | 3252156.2822 | 1803.3736 | 0.7368 | 0.3216 | 0.2775 |


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
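The message above can be verified from the per-fold R² values printed in the two tables. The sketch below copies those values and compares their means; the variable names are introduced here for illustration.

```python
import numpy as np

# Per-fold R² copied from the create_model and tune_model tables above.
base_r2 = np.array([0.7865, 0.7694, 0.7940, 0.7939, 0.7199,
                    0.7106, 0.7601, 0.8348, 0.6179, 0.7344])
tuned_r2 = np.array([0.7711, 0.7599, 0.7475, 0.7889, 0.6509,
                     0.7093, 0.7505, 0.8386, 0.6403, 0.7368])

print(f"base : mean={base_r2.mean():.4f}  std={base_r2.std():.4f}")
print(f"tuned: mean={tuned_r2.mean():.4f}  std={tuned_r2.std():.4f}")
print("folds where tuning helped:", int((tuned_r2 > base_r2).sum()))
```

The base model averages about 0.752 against about 0.739 for the tuned candidate, and tuning improved only 3 of the 10 folds, which is consistent with PyCaret returning the original model. In practice this means `tuned_rf` holds the same default Random Forest as `rf`.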


# Evaluation


In [37]:
# Evaluating the model
evaluate_model(tuned_rf)


### Prediction on the hold-out set

In [47]:
# Making the predictions
predict_model(tuned_rf)

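| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Regressor | 681.8337 | 2556682.6251 | 1598.963 | 0.777 | 0.2792 | 0.225 |

| | Condo | Size | Rooms | Toilets | Suites | Parking | Elevator | Furnished | Swimming Pool | New | District | Price | prediction_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4027 | 835 | 76 | 3 | 1 | 0 | 2 | 0 | 0 | 1 | 0 | Saúde/São Paulo | 2750 | 2660.96 |
| 10152 | 870 | 71 | 3 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | Lapa/São Paulo | 2200 | 2189.07 |
| 1121 | 400 | 70 | 3 | 2 | 1 | 1 | 0 | 1 | 1 | 0 | Jabaquara/São Paulo | 2928 | 1950.20 |
| 9858 | 281 | 54 | 2 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | Campo Limpo/São Paulo | 1500 | 1542.07 |
| 10495 | 310 | 70 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | Cambuci/São Paulo | 1500 | 1546.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2630 | 368 | 49 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | Casa Verde/São Paulo | 2500 | 2291.20 |
| 1062 | 830 | 127 | 3 | 4 | 3 | 2 | 1 | 1 | 1 | 0 | Ipiranga/São Paulo | 4900 | 4629.30 |
| 1980 | 1250 | 65 | 2 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | Pinheiros/São Paulo | 3700 | 3235.69 |
| 11145 | 2500 | 205 | 4 | 6 | 4 | 4 | 1 | 1 | 1 | 0 | Brooklin/São Paulo | 15000 | 16093.93 |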


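The model achieved an **R² score of 0.777**, which is above the 0.7 target. However, when `predict_model` is called without a `data` argument, PyCaret scores the model on the **hold-out split it set aside from the training data** during `setup` (the 1,735-row "transformed test set" in the setup summary), not on the separate `df_test`. We will now finalize the model and perform the final test.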

### Finalizing the model

In [48]:
# Finalizing the model
final_rf = finalize_model(tuned_rf)

In [49]:
# Checking the parameters
print(final_rf)

Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(include=['Condo', 'Size', 'Rooms',
                                             'Toilets', 'Suites', 'Parking',
                                             'Elevator', 'Furnished',
                                             'Swimming Pool', 'New'],
                                    transformer=SimpleImputer())),
                ('categorical_imputer',
                 TransformerWrapper(include=['District'],
                                    transformer=SimpleImputer(strategy='most_frequent'))),
                ('rest_encoding',
                 TransformerWrapper(include=['District'],
                                    transformer=TargetEncoder(cols=['District'],
                                                              handle_missing='return_nan'))),
                ('normalize', TransformerWrapper(transformer=StandardScaler())),
                ('clean_column_

### Predicting on new data

In [50]:
# Making Predictions with Unseen Data (test set)
unseen_predictions = predict_model(final_rf, data=df_test)


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest Regressor | 788.3934 | 3703779.1139 | 1924.5205 | 0.7377 | 0.2804 | 0.2206 |


In [51]:
unseen_predictions.head()

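| | Condo | Size | Rooms | Toilets | Suites | Parking | Elevator | Furnished | Swimming Pool | New | District | Price | prediction_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020 | 0 | 45 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Pirituba/São Paulo | 1399 | 1117.0 |
| 10852 | 610 | 65 | 2 | 2 | 0 | 1 | 0 | 0 | 1 | 0 | Butantã/São Paulo | 2000 | 1706.968333 |
| 3043 | 517 | 64 | 2 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | Carrão/São Paulo | 2000 | 1862.0 |
| 9391 | 230 | 55 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | Santana/São Paulo | 1300 | 1316.14 |
| 1383 | 900 | 45 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | Saúde/São Paulo | 1600 | 1934.02 |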


After testing with unseen data, the **R² score was 0.7377**, above the 0.7 target and slightly below the 0.777 obtained in the previous step, which suggests that **the model generalizes reasonably well to new data**. The mean absolute error on the test set was about R$ 788, with a MAPE of roughly 22%. The next step will be to **deploy the model** and integrate it into the application for practical use.
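To show how the error metrics in the table relate to individual predictions, the sketch below recomputes MAE, RMSE, and MAPE on the five rows displayed by `unseen_predictions.head()`. Five rows are far too few to be representative; the figures for the full test set are the ones reported in the metrics table above.

```python
import numpy as np

# The five rows displayed by unseen_predictions.head() above.
actual = np.array([1399, 2000, 2000, 1300, 1600], dtype=float)
predicted = np.array([1117.0, 1706.968333, 1862.0, 1316.14, 1934.02])

errors = predicted - actual
mae = np.mean(np.abs(errors))            # average absolute error, in R$
rmse = np.sqrt(np.mean(errors ** 2))     # penalizes large errors more heavily
mape = np.mean(np.abs(errors) / actual)  # average error relative to the true rent

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.3f}")
```

RMSE is never smaller than MAE, and the gap between them grows when a few predictions are far off. On the full test set the gap is large (RMSE of about 1,925 against MAE of about 788), which points to a small number of listings with large errors.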

### Saving the model

In [52]:
save_model(final_rf,'Final_Model_02_05_25')

Transformation Pipeline and Model Successfully Saved



# Deployment
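Before the saved pipeline can score a property entered by a user, the input has to be arranged as a one-row DataFrame with the same feature columns used in training. The sketch below shows one way to do that; the helper `build_input_row` and the constant `FEATURE_COLUMNS` are assumptions introduced for this example, and the PyCaret calls are left as comments because they need the saved model file.

```python
import pandas as pd

# Feature columns the saved pipeline was trained on (the target 'Price' is excluded).
FEATURE_COLUMNS = ["Condo", "Size", "Rooms", "Toilets", "Suites", "Parking",
                   "Elevator", "Furnished", "Swimming Pool", "New", "District"]


def build_input_row(form_values):
    """Turn a dict of user-supplied values into a one-row DataFrame."""
    missing = [col for col in FEATURE_COLUMNS if col not in form_values]
    if missing:
        raise ValueError(f"Missing fields: {missing}")
    row = {col: form_values[col] for col in FEATURE_COLUMNS}
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


# Example values taken from the first row of the dataset shown earlier.
row = build_input_row({
    "Condo": 220, "Size": 47, "Rooms": 2, "Toilets": 2, "Suites": 1,
    "Parking": 1, "Elevator": 0, "Furnished": 0, "Swimming Pool": 0,
    "New": 0, "District": "Artur Alvim/São Paulo",
})
print(row.shape)  # (1, 11)

# With PyCaret installed and the saved file available, the estimate would come from:
#   from pycaret.regression import load_model, predict_model
#   model = load_model('Final_Model_02_05_25')
#   estimate = predict_model(model, data=row)['prediction_label'].iloc[0]
```

Validating the fields before calling the model gives the web interface a clear error to show when a form value is missing, instead of a failure inside the pipeline.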