---


<img width=25% src="https://raw.githubusercontent.com/gabrielcapela/credit_risk/main/images/myself.png" align=right>

# **Rent Forecast Project**

*by Gabriel Capela*

[<img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white"/>](https://www.linkedin.com/in/gabrielcapela)
[<img src="https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white" />](https://medium.com/@gabrielcapela)

---

This project aims to develop a **Machine Learning model capable of predicting the monthly rental price of apartments** in the city of São Paulo from their characteristics, such as size, number of rooms, condominium fee, and district.

Accurate rent estimates help landlords price their properties, help tenants compare offers, and allow online platforms to suggest prices. However, the task is challenging because the available data describe only objective features of each property and date from 2019.

The ultimate goal is to provide a **functional web application** in which users enter the characteristics of a property and obtain an estimate of its rental value.



<p align="center">
<img width=70% src="https://raw.githubusercontent.com/gabrielcapela/rent-predict-sp/main/images/crisp-dm.jpeg">
</p>

The Cross-Industry Standard Process for Data Mining ([CRISP-DM](https://www.ibm.com/docs/pt-br/spss-modeler/saas?topic=dm-crisp-help-overview)) methodology will **guide** the stages of this project, in which **supervised machine learning** algorithms will be used to predict rental prices.

The process begins with the **Business Understanding** stage, where the business objective of estimating rental prices for properties in São Paulo is defined. Next, in the **Data Understanding** stage, the available data, obtained from Kaggle, is explored to assess quality, identify patterns, and understand the relationship between the features and the rental price; obvious modifications are already made at this point. In **Data Preparation**, the data is cleaned, transformed, and prepared for modeling. In the **Modeling** stage, regression models are trained and compared using appropriate metrics. Once the best model is selected and tuned, its performance is reviewed in the **Evaluation** stage. Finally, in the **Deployment** stage, the model will be deployed in a web application that returns the estimated rent of a property, given its characteristics.

This notebook covers the **first five phases** of the project: Business Understanding, Data Understanding, Data Preparation, Modeling, and Evaluation.

# Business Understanding

## Problem Contextualization

The city of **São Paulo** is home to some of the largest and most dynamic real estate activity in Brazil. With a highly competitive and volatile market, understanding and predicting property rental prices becomes essential for different agents in the sector, from landlords, tenants, and investors to digital rental platforms.

Using historical data on properties available for rent in São Paulo during 2019, this project aims to **explore and model the factors that directly influence rental prices**. In addition, it seeks to develop an accessible and functional interface, allowing users to enter characteristics of a property and quickly and conveniently obtain a price estimate, thus becoming an effective tool for decision support in the real estate market.

Since the **main objective of this project is to deploy a functional solution**, it was decided to use AutoML tools, which accelerate the predictive modeling process. The YData Profiling library will be used for automated exploratory data analysis, and PyCaret will be used to rapidly test different regression models, apply preprocessing, and select the best approach based on performance metrics. This strategy focuses effort on delivering a usable end product, with significant productivity gains.

For the deployment phase, the project will use **Flask to build a lightweight and flexible web interface**, enabling users to interact with the predictive model in a simple and efficient way. **This application will be hosted on Heroku**, a cloud platform that facilitates easy and scalable deployment of web applications. This setup ensures that the solution is not only technically robust, but also accessible to end users without requiring advanced technical expertise.

## Source of Data

<p align="center">
<img width=70% src="https://raw.githubusercontent.com/gabrielcapela/rent-predict-sp/main/images/kaggle_data.png">
</p>

The data used in this project were obtained from the [Kaggle](https://www.kaggle.com/datasets/argonalyst/sao-paulo-real-estate-sale-rent-april-2019/data) platform and contain detailed information about properties located in the city of São Paulo in 2019. The variables include physical characteristics of the properties (such as number of bedrooms, bathrooms, parking spaces), as well as economic and location variables. The dataset is already presented in a structured and relatively clean format, which allows for a more direct advance to the modeling and implementation phases.

## Importance of the Problem

Predicting rental prices in urban centers like São Paulo is relevant for several reasons:

*   Landlords and real estate agencies can better price their properties, avoiding vacancies or outdated prices;
*   Tenants can identify neighborhoods with better value for money;
*   Online platforms can integrate intelligent price suggestion systems;
*   Startups and real estate businesses can use predictive models as part of their solutions.

However, there are some important limitations:

*   The model does not consider subjective variables such as finishing, perceived security or view;
*   The data is from 2019, which may not reflect the current post-pandemic market;
*   External factors, such as public policies and urban infrastructure, are not directly considered.

## Project Objective

Although predictive modeling is an essential step, the main focus of this project is the deployment phase: that is, putting the model into production through an application accessible via a web interface.

### The goal is to deliver a functional solution, where users can enter characteristics of a property and instantly obtain an estimate of the rental value, simulating a real usage scenario in systems or online platforms.

## Criteria for Model Evaluation

**The model's performance will be assessed based on the R² metric** (coefficient of determination), which indicates the proportion of the rental price variability that is explained by the model's independent variables. In other words, R² measures how well the model reproduces the real values from the available data. R² is at most 1, with values closer to 1 indicating greater explanatory power; it usually lies between 0 and 1, but it can be negative when a model predicts worse than simply using the mean price.

For this project, **we expect to achieve an R² greater than 0.7**, meaning that the model performs well, explaining most of the price variability.
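
To make the metric concrete, here is a minimal sketch of how R² can be computed from actual and predicted rents. The function name `r2_score_manual` and the sample values are illustrative assumptions, not part of the project code.

```python
def r2_score_manual(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Squared error left unexplained by the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total squared deviation of the actual values from their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Hypothetical rents (R$) and predictions, for illustration only
actual = [1000, 1500, 2000, 2500, 3000]
predicted = [1100, 1400, 2100, 2400, 3100]

r2_good = r2_score_manual(actual, predicted)
r2_mean = r2_score_manual(actual, [2000] * 5)  # always predicting the mean
r2_bad = r2_score_manual(actual, [3000, 2500, 2000, 1500, 1000])  # reversed order

print(r2_good, r2_mean, r2_bad)
```

Predictions close to the actual values give an R² near 1, always predicting the mean gives exactly 0, and the reversed predictions give a negative value, which is why a score above 0.7 is treated here as a meaningful target.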

# Data Understanding

This step is essential to gain initial insights into the dataset, **identify patterns**, **detect anomalies**, and **assess data quality**.

The dataset used can be downloaded from this [page](https://github.com/gabrielcapela/rent-predict-sp/blob/7e157b64910df7c095880ea7d5c59964f9d72a45/data/sao-paulo-properties-april-2019.csv) and includes several quantitative and categorical variables for each apartment, as well as **a numeric variable indicating the rental or sale value**.

The complete data analysis will be performed using the YData Profiling package, a tool that generates automatic reports.

## Load and inspect the dataset

First, let's import the necessary packages and the dataset and check its first few lines.

In [29]:
#Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

In [30]:
#Importing the dataset
df = pd.read_csv('https://github.com/gabrielcapela/rent-predict-sp/blob/7e157b64910df7c095880ea7d5c59964f9d72a45/data/sao-paulo-properties-april-2019.csv?raw=true')
pd.set_option('display.max_columns', None)  #Show all the columns
#Showing the first lines
df.head()

Unnamed: 0,Price,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Negotiation Type,Property Type,Latitude,Longitude
0,930,220,47,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.543138,-46.479486
1,1000,148,45,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.550239,-46.480718
2,1000,100,48,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.542818,-46.485665
3,1000,200,48,2,2,1,1,0,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.547171,-46.483014
4,1300,410,55,2,2,1,1,1,0,0,0,Artur Alvim/São Paulo,rent,apartment,-23.525025,-46.482436


In [31]:
#Showing the shape of dataset
print(f"The data set has {df.shape[0]} rows and {df.shape[1]} columns.")

The data set has 13640 rows and 16 columns.


Below is the meaning of each variable:

| **Variable**       | **Description**                                                                 |
|--------------------|---------------------------------------------------------------------------------|
| `Price`            | Monthly rent price (in Brazilian Reais - R$).                                   |
| `Condo`            | Monthly condominium fee (in R$).                                                |
| `Size`             | Property size in square meters (m²).                                            |
| `Rooms`            | Total number of rooms (usually includes living room and bedrooms).              |
| `Toilets`          | Total number of bathrooms/toilets.                                              |
| `Suites`           | Number of ensuite bathrooms (bathrooms directly connected to a bedroom).        |
| `Parking`          | Number of available parking spots.                                              |
| `Elevator`         | Indicates whether the building has an elevator (`1` = Yes, `0` = No).           |
| `Furnished`        | Indicates whether the property is furnished (`1` = Yes, `0` = No).              |
| `Swimming Pool`    | Indicates whether the building has a swimming pool (`1` = Yes, `0` = No).       |
| `New`              | Indicates whether the property is new (`1` = Yes, `0` = No).                    |
| `District`         | Neighborhood and city (e.g., `Belém/São Paulo`).                                |
| `Negotiation Type` | Type of transaction (e.g., `rent`).                                             |
| `Property Type`    | Type of property (e.g., `apartment`).                                           |
| `Latitude`         | Geographic coordinate (latitude) of the property.                               |
| `Longitude`        | Geographic coordinate (longitude) of the property.                              |

First, we notice that there is a column with the type of negotiation. This study is limited to rentals, so the sale listings will be excluded:

In [32]:
# Show the unique values of the 'Negotiation Type' column
print(df['Negotiation Type'].value_counts())
# Deleting the 'sale' entries from the 'Negotiation Type' column
df = df[df['Negotiation Type'] != 'sale']
#Showing the shape of dataset
print(f"The data set has {df.shape[0]} rows and {df.shape[1]} columns.")

Negotiation Type
rent    7228
sale    6412
Name: count, dtype: int64
The data set has 7228 rows and 16 columns.


## YData Profiling

YData Profiling is a powerful open-source Python tool that automates the process of exploratory data analysis (EDA). With just a single line of code, it generates an interactive and comprehensive HTML report that summarizes key insights about the dataset.

In [33]:
profile = ProfileReport(df, title="report", explorative=True)
#profile.to_notebook_iframe()
profile.to_file("report.html")


Click [here](https://gabrielcapela.github.io/rent-predict-sp/report.html) to see the full report!

### Some observations can already be made and will be checked in the following steps:

* 🟢 There are no missing values;
* 🟢 There are values equal to zero, but it is consistent with the meaning of the data;
* 🔴 The columns 'Negotiation Type' and 'Property Type' can be excluded, as they have a single value;
* 🟢 There are many boolean variables (has or does not have), which will facilitate processing;
* 🟡 The columns 'Longitude' and 'Latitude' can be used to build a map visualization; and
* 🔴 The column 'District' must go through some kind of encoding for processing.
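
These observations can also be double-checked directly with pandas. The sketch below uses a tiny hypothetical frame (`sample`) that mimics a few columns of the dataset; on the real data, the same calls would be applied to `df`.

```python
import pandas as pd

# Tiny hypothetical sample that mimics a few columns of the dataset
sample = pd.DataFrame({
    "Price": [930, 1000, 1300],
    "Elevator": [0, 0, 1],
    "District": ["Artur Alvim/São Paulo", "Belém/São Paulo", "Belém/São Paulo"],
    "Negotiation Type": ["rent", "rent", "rent"],
    "Property Type": ["apartment", "apartment", "apartment"],
})

# Number of missing values per column (all zeros means no missing data)
missing_per_column = sample.isna().sum()

# Columns holding a single distinct value carry no information for a model
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]

# Non-numeric columns need some kind of encoding before modeling
non_numeric_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

print(missing_per_column)
print("Constant columns:", constant_columns)
print("Non-numeric columns:", non_numeric_columns)
```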

## Plotting a Map

The map below is an interactive visualization of the geographic locations of properties available for rent in the city of São Paulo. **Each point on the map represents a property**, with the color varying according to the rental value: higher values are represented by more intense colors. A **logarithmic transformation** was applied to the rental values to improve the sensitivity of the color scale, since most prices are concentrated in lower ranges.

In [35]:
import folium
import branca.colormap as cm
import numpy as np

df_map = df[['Latitude', 'Longitude', 'Price']].copy()
df_map['Price_log'] = np.log1p(df_map['Price'])  # log(1 + price) compresses the long right tail

# Map centered in SP
mapa_sp = folium.Map(location=[-23.55052, -46.63331], zoom_start=12)

# Colormap adjusted to log scale
colormap = cm.linear.YlOrRd_09.scale(df_map['Price_log'].min(), df_map['Price_log'].max())
colormap.caption = 'Rent Price (log scale)'

# Adding points to the map
for _, row in df_map.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=3,
        color=colormap(row['Price_log']),
        fill=True,
        fill_color=colormap(row['Price_log']),
        fill_opacity=0.7
    ).add_to(mapa_sp)

colormap.add_to(mapa_sp)
mapa_sp
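
To see why the logarithmic transformation helps the color scale, the sketch below compares where a few rents land on a linear scale and on a log scale. The prices and the helper `rescale` are hypothetical and only illustrate the effect; they are not taken from the dataset.

```python
import math

# Hypothetical rents (R$): most are low, one is very high
prices = [1000, 1500, 2000, 3000, 50000]


def rescale(values):
    """Map values linearly onto [0, 1], as a color scale would."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


linear_positions = rescale(prices)
log_positions = rescale([math.log1p(p) for p in prices])

for price, lin, log in zip(prices, linear_positions, log_positions):
    print(f"R$ {price:>6}: linear={lin:.3f}  log={log:.3f}")
```

On the linear scale, the four cheaper listings all fall within the bottom 5% of the color range, so they would look almost identical; on the log scale they are spread over more than a quarter of it.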




#### TO DO LIST 🚮

* 🔴 Delete columns 'Longitude', 'Latitude', 'Negotiation Type' and 'Property Type'.
* 🔴 Encode the column 'District'.
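
One common way to encode a column with many categories, such as 'District', is target (mean) encoding, which replaces each category with the average target value observed for it; the pipeline printed in the Modeling section shows PyCaret applying a `TargetEncoder` to this column. The sketch below is a simplified plain-Python illustration with hypothetical rows and hypothetical helper names. Library implementations typically add smoothing toward the global mean, which is omitted here.

```python
from collections import defaultdict

# Hypothetical training rows: (district, monthly rent in R$)
train_rows = [
    ("Pinheiros/São Paulo", 3300),
    ("Pinheiros/São Paulo", 3700),
    ("Artur Alvim/São Paulo", 930),
    ("Artur Alvim/São Paulo", 1070),
    ("Saúde/São Paulo", 2000),
]


def fit_target_encoding(rows):
    """Return (per-district mean rent, global mean rent) from training rows."""
    totals = defaultdict(lambda: [0.0, 0])  # district -> [sum of rents, count]
    for district, price in rows:
        totals[district][0] += price
        totals[district][1] += 1
    means = {d: s / n for d, (s, n) in totals.items()}
    global_mean = sum(price for _, price in rows) / len(rows)
    return means, global_mean


def encode(district, means, global_mean):
    """Encode one district; unseen districts fall back to the global mean."""
    return means.get(district, global_mean)


district_means, overall_mean = fit_target_encoding(train_rows)

print(encode("Pinheiros/São Paulo", district_means, overall_mean))
print(encode("Unknown/São Paulo", district_means, overall_mean))
```

The means should be learned from the training split only; computing them on the full dataset would leak target information from the test rows into the features.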

# Data Preparation

In this step, we will clean and modify the data based on the insights gathered from the previous analysis. 

## Data Cleaning

First, let's delete the columns marked in the previous step:

In [36]:
# Making a copy
df_clean = df.copy()
# List of columns to drop
columns_to_drop = ['Longitude', 'Latitude', 'Negotiation Type', 'Property Type']

# Drop the columns from the DataFrame
df_clean = df_clean.drop(columns=columns_to_drop, errors='ignore')

print(f"Data shape after dropping columns: {df_clean.shape}")

Data shape after dropping columns: (7228, 12)


## Division of the dataset

In [11]:
#Importing the package needed for split
from sklearn.model_selection import train_test_split

#Dividing the dataset into train and test
df_train, df_test = train_test_split(df_clean, test_size=0.2, random_state=42)

#Checking the sizes of the divisions
print(f"Training set size: {df_train.shape}")
print(f"Testing set size: {df_test.shape}")


Training set size: (5782, 12)
Testing set size: (1446, 12)


# Modeling

## PyCaret

PyCaret is a **Python AutoML library** that makes it easy to build machine learning models with just a few lines of code. It automates tasks like preprocessing, comparison, and model tuning, making it ideal for fast, efficient projects.

In [37]:
# Importing the necessary packages
from sklearn.utils._testing import ignore_warnings # or other utilities

from pycaret.regression import setup, compare_models, models, create_model, predict_model
from pycaret.regression import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.regression import save_model, load_model

## Defining the Setup

In [38]:
# Creating the PyCaret setup
reg = setup(data=df_train, target='Price')

Unnamed: 0,Description,Value
0,Session id,4002
1,Target,Price
2,Target type,Regression
3,Original data shape,"(5782, 12)"
4,Transformed data shape,"(5782, 12)"
5,Transformed train set shape,"(4047, 12)"
6,Transformed test set shape,"(1735, 12)"
7,Numeric features,10
8,Categorical features,1
9,Preprocess,True


## Creating the pipeline

In [39]:
# Creating the pipeline
reg = setup(data = df_train,
            target = 'Price',
            normalize = True,
            #log_experiment = True,
            experiment_name = 'test_01')

Unnamed: 0,Description,Value
0,Session id,4409
1,Target,Price
2,Target type,Regression
3,Original data shape,"(5782, 12)"
4,Transformed data shape,"(5782, 12)"
5,Transformed train set shape,"(4047, 12)"
6,Transformed test set shape,"(1735, 12)"
7,Numeric features,10
8,Categorical features,1
9,Preprocess,True


## Comparing the models

In [47]:
import logging

# Suppressing the log output from LightGBM and other libraries
logging.getLogger("pycaret").setLevel(logging.WARNING)
logging.getLogger("lightgbm").setLevel(logging.WARNING)

# Now running compare_models
best = compare_models()


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
lightgbm,Light Gradient Boosting Machine,769.6515,3266666.1907,1748.8054,0.7474,0.2879,0.2383,0.473
rf,Random Forest Regressor,763.6512,3379130.4055,1785.8902,0.7309,0.2821,0.2282,0.898
et,Extra Trees Regressor,777.8501,3618419.017,1835.8625,0.7211,0.2902,0.2312,0.678
gbr,Gradient Boosting Regressor,837.4317,3544703.776,1836.174,0.7144,0.3119,0.2649,0.293
br,Bayesian Ridge,1078.673,4567862.2778,2066.5082,0.6407,0.5499,0.3802,0.095
lasso,Lasso Regression,1079.4257,4567829.6812,2066.7146,0.6405,0.5571,0.3806,0.067
llar,Lasso Least Angle Regression,1079.4137,4567775.1789,2066.7011,0.6405,0.5572,0.3806,0.086
ridge,Ridge Regression,1079.9274,4568854.3962,2067.036,0.6403,0.5559,0.3809,0.061
lar,Least Angle Regression,1080.0703,4569051.5837,2067.1133,0.6403,0.5548,0.381,0.068
lr,Linear Regression,1080.0703,4569051.5837,2067.1133,0.6403,0.5548,0.381,0.093


## Verifying and instantiating the best model

In [48]:
# Checking the best model 
print(best)

LGBMRegressor(n_jobs=-1, random_state=4409)


In [49]:
# Instantiating the model
lightgbm = create_model('lightgbm')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,705.7612,2599566.6473,1612.3172,0.7933,0.2846,0.2372
1,699.529,2246438.4844,1498.8124,0.7461,0.279,0.2327
2,737.0278,2279803.7666,1509.9019,0.767,0.3019,0.2537
3,685.5571,1632144.2137,1277.554,0.8235,0.283,0.2379
4,942.2798,7496872.8571,2738.0418,0.6898,0.2917,0.2331
5,704.4168,2055213.218,1433.6015,0.6925,0.2892,0.2412
6,789.2938,2723980.9043,1650.4487,0.7822,0.2775,0.2251
7,895.6641,3904735.3214,1976.0403,0.7137,0.2975,0.2411
8,701.0855,1892477.5547,1375.6735,0.7148,0.2969,0.2468
9,835.8996,5835428.9394,2415.6633,0.7508,0.2779,0.2343


## Hyperparameter Tuning

In [50]:
tuned_lightgbm = tune_model(lightgbm, optimize='R2')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,734.8497,2960994.2514,1720.754,0.7646,0.2958,0.2396
1,713.4773,2202180.8174,1483.9747,0.7511,0.2958,0.2387
2,717.8852,1792664.0599,1338.9041,0.8168,0.2947,0.2402
3,703.2536,1825889.9219,1351.2549,0.8026,0.2741,0.2287
4,1001.3049,8119964.3872,2849.5551,0.664,0.3051,0.2338
5,686.7794,2148180.4857,1465.6672,0.6786,0.2975,0.2334
6,811.3027,3094287.3113,1759.0586,0.7526,0.2856,0.2227
7,927.2848,3669521.9161,1915.5996,0.7309,0.31,0.2523
8,766.3161,2298551.7882,1516.0976,0.6536,0.3231,0.25
9,924.0038,7656818.283,2767.0956,0.6731,0.3041,0.2491


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
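
This message can be checked against the two cross-validation tables: averaging the per-fold R² values shows the tuned candidate scoring lower than the original model. The sketch below assumes the comparison is based on the mean cross-validated R², which is consistent with the 0.7474 reported for LightGBM by `compare_models`.

```python
from statistics import mean

# R2 per fold, copied from the two cross-validation tables above
r2_original = [0.7933, 0.7461, 0.7670, 0.8235, 0.6898,
               0.6925, 0.7822, 0.7137, 0.7148, 0.7508]
r2_tuned = [0.7646, 0.7511, 0.8168, 0.8026, 0.6640,
            0.6786, 0.7526, 0.7309, 0.6536, 0.6731]

mean_original = mean(r2_original)
mean_tuned = mean(r2_tuned)

print(f"original model: mean R2 = {mean_original:.4f}")
print(f"tuned model:    mean R2 = {mean_tuned:.4f}")
print("tuning improved the mean R2:", mean_tuned > mean_original)
```

Because the original model is the one returned, `tuned_lightgbm` in the following cells refers to the untuned LightGBM model.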


# Evaluation


In [51]:
# Evaluating the model
evaluate_model(tuned_lightgbm)


### Prediction with training data

In [52]:
# Making the predictions
predict_model(tuned_lightgbm)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,716.7275,2115480.1035,1454.469,0.7894,0.2861,0.2372


Unnamed: 0,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Price,prediction_label
4049,550,70,3,1,0,0,0,1,0,0,Saúde/São Paulo,2000,2039.252302
4429,0,76,2,2,1,1,0,0,1,0,Freguesia do Ó/São Paulo,2000,1848.532565
9415,690,68,2,2,1,2,0,0,0,0,Tremembé/São Paulo,1350,1652.439362
1693,0,77,2,2,1,2,1,0,1,0,Butantã/São Paulo,2800,2368.960824
1990,1500,93,2,3,1,2,0,0,0,0,Pinheiros/São Paulo,3300,5569.459360
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9783,300,66,3,2,1,1,0,0,0,0,Vila Formosa/São Paulo,1200,1489.524152
2552,1280,73,2,3,2,1,0,1,1,0,Santa Cecília/São Paulo,3320,3599.569681
3307,532,45,1,1,0,1,1,0,0,0,Vila Prudente/São Paulo,1000,1270.333649
607,730,77,3,2,1,1,1,0,1,0,Vila Formosa/São Paulo,1200,2190.521130


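The model achieved an R² score of 0.79, which is above the 0.7 target. However, `predict_model` called without a `data` argument scores PyCaret's internal hold-out split, which was carved out of `df_train`, so this is not yet a test on fully independent data. We will now finalize the model and perform the final test on `df_test`.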

### Finalizing the model

In [55]:
# Finalizing the model
final_lightgbm = finalize_model(tuned_lightgbm)

In [56]:
# Checking the parameters
print(final_lightgbm)

Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(include=['Condo', 'Size', 'Rooms',
                                             'Toilets', 'Suites', 'Parking',
                                             'Elevator', 'Furnished',
                                             'Swimming Pool', 'New'],
                                    transformer=SimpleImputer())),
                ('categorical_imputer',
                 TransformerWrapper(include=['District'],
                                    transformer=SimpleImputer(strategy='most_frequent'))),
                ('rest_encoding',
                 TransformerWrapper(include=['District'],
                                    transformer=TargetEncoder(cols=['District'],
                                                              handle_missing='return_nan'))),
                ('normalize', TransformerWrapper(transformer=StandardScaler())),
                ('clean_column_

### Predicting on new data

In [57]:
# Making Predictions with Unseen Data (test set)
unseen_predictions = predict_model(final_lightgbm, data=df_test)
unseen_predictions.head()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,804.4476,3333630.1066,1825.8231,0.7639,0.2961,0.2441


Unnamed: 0,Condo,Size,Rooms,Toilets,Suites,Parking,Elevator,Furnished,Swimming Pool,New,District,Price,prediction_label
2020,0,45,2,2,1,1,0,0,0,0,Pirituba/São Paulo,1399,1108.741457
10852,610,65,2,2,0,1,0,0,1,0,Butantã/São Paulo,2000,1661.742803
3043,517,64,2,2,1,1,0,0,1,0,Carrão/São Paulo,2000,1905.557969
9391,230,55,1,2,1,1,1,0,0,0,Santana/São Paulo,1300,1315.99621
1383,900,45,1,2,1,1,1,1,1,0,Saúde/São Paulo,1600,2140.668342


After testing with unseen data, the R² score was **0.7639**, above the 0.7 target and close to the 0.79 obtained in the previous step, indicating that **the model generalizes well to new data**. The mean absolute error on the test set is about R$ 804, with a MAPE of about 24%, so individual estimates can still deviate noticeably from the listed rent. The next step will be to **deploy the model** and integrate it into the application for practical use.
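
The error metrics in the table above can be illustrated on the five test rows displayed in the output. The sketch below recomputes MAE and MAPE for those five rows only; the helper functions are written here for illustration, and five rows are far too few to represent the full test set, so the values differ from the table.

```python
# Actual rents and predictions for the five test rows displayed above
actual = [1399, 2000, 2000, 1300, 1600]
predicted = [1108.741457, 1661.742803, 1905.557969, 1315.99621, 2140.668342]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error expressed as a fraction of the actual value."""
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


sample_mae = mean_absolute_error(actual, predicted)
sample_mape = mean_absolute_percentage_error(actual, predicted)

print(f"MAE on the 5 displayed rows:  R$ {sample_mae:.2f}")
print(f"MAPE on the 5 displayed rows: {sample_mape:.1%}")
```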

### Saving the model

In [59]:
save_model(final_lightgbm,'Final_Model_02_05_25')

Transformation Pipeline and Model Successfully Saved



# Deployment