This file describes, step by step, the data transformations performed in the following files:
* pipeline_pre-processing.py
* missvalue_outliers_analysis.ipynb
* model_choosing_analysis.ipynb
* pipeline_for_training_data.ipynb
* pipeline_for_production.ipy

# Importing

In [1]:
import pandas as pd
import numpy as np
import os
import joblib

from collections import Counter

import geocoder
import re

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer

In [2]:
np.set_printoptions(suppress=False)

work_dir = r'C:\Users\krasavica\Desktop\Projekty - DS\python-project-ApartmentPriceAnalysis'
os.chdir(work_dir)

pd.set_option('display.float_format', '{:.2f}'.format)

In [3]:
data_initial = pd.read_csv('data_2024-01.csv', index_col=0).reset_index(drop = True)

# Data description
The data includes the following information:
1. **link** - link to the ad
2. **price** - the price in PLN given in the ad or "Zapytaj o cenę" ("Ask price") in case of no price given
3. **address** - the address given in the ad
4. **area** - apartment area in m²
5. **num_rooms** - number of rooms in the apartment
6. **floor** - the floor on which the apartment is located, usually given in the form of floor by the number of floors in the entire building, for example, 1/7
7. **rent** - monthly rent value in PLN
8. **ownership_status** - form of ownership of the apartment, in Poland there are full ownership, cooperative ownership right to premises, cooperative tenant right to premises and right to municipal premises
9. **flat_condition** - condition of the apartment (to be moved in/to be finished/to be renovated)
10. **perks** - information whether the apartment has a balcony, garden or terrace
11. **parking** - whether the apartment has parking space
12. **heating** - type of heating of the apartment (municipal/gas/electric/boiler room/tiled stoves/other)
13. **market** - primary or secondary market
14. **ad_type** - advertiser type (real estate office/developer/private)
15. **availability** - date from when the apartment is available
16. **year** - year of building
17. **devel_type** - building type (Apartment block/Condominium/Townhouse/Row house etc.)
18. **windows** - material of windows in the apartment (plastic/wooden/aluminum)
19. **lift** - whether the apartment building has an elevator
20. **mater** - apartment building material (brick/hollow block/silicate/large slab etc.)
21. **utilities** - whether the apartment has Internet, cable TV, telephone
22. **security** - whether the apartment has intercom, video intercom, territory monitoring etc.
23. **equipment** - whether the apartment has dishwasher, refrigerator, furniture, oven etc.
24. **add_inf** - additional information, for example, whether the apartment has air conditioning, basement, separate kitchen etc.

## Approach to missing data
In the scraped data, missing values were marked with "Zapytaj o cenę" ("Ask price"), "Zapytaj" ("Ask") and "brak informacji" ("no information").<br>
Variables **'address'**, **'area'**, **'num_rooms'**, **'lift'**, **'market'** and **'ad_type'** had no missing data.<br><br>
The following approach to missing data was applied:
1. Variables with too much missing data (above 80%) were removed (**'availability'**)
2. Observations with missing key variables were removed (**'devel_type'**)
3. For categorical variables **'ownership_status'**, **'flat_condition'**, **'heating'**, **'windows'**, **'mater'**, a separate category named "nie podano" ("not provided") was created for missing data
4. Missing values were not filled for variables **'parking'**, **'perks'**, **'utilities'**, **'security'**, **'equipment'**, **'add_inf'**. A missing value was taken to mean that the apartment has none of the facilities described by the variable
5. For numerical variables, the gaps were filled using the KNN method (**'rent_cat'**, **'year'**, **'floor'**)
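
Steps 1 to 3 above could be sketched roughly as follows. This is a minimal illustration on a made-up frame, not the project's code: the toy values, the `threshold` variable and the choice of columns are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) using the same placeholders as the scraped data
toy = pd.DataFrame({
    "availability": ["brak informacji"] * 9 + ["2024-03-01"],
    "devel_type": ["blok"] * 9 + ["brak informacji"],
    "heating": ["miejskie", "Zapytaj"] * 5,
})

# Placeholders become NaN so that pandas can count them as missing
toy = toy.replace(["Zapytaj o cenę", "Zapytaj", "brak informacji"], np.nan)

# Step 1: drop columns whose share of missing values exceeds the threshold
threshold = 0.80  # assumed cut-off, matching the "above 80%" rule
missing_share = toy.isna().mean()
toy = toy.drop(columns=missing_share[missing_share > threshold].index)

# Step 2: drop observations that lack a key variable
toy = toy.dropna(subset=["devel_type"])

# Step 3: keep missing categorical values as their own category
toy["heating"] = toy["heating"].fillna("nie podano")  # 'not provided'
```

In this toy frame 'availability' is 90% missing and is dropped, one row without 'devel_type' is removed, and the remaining gaps in 'heating' become the "nie podano" category.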

# Pre-processing data transformation
Used in file: **pipeline_pre-processing.py**<br>
The file above defines the data preprocessing function (**preliminary_transform**).<br>
It consists of the following functions:
1. **standardize_missing_values** - Converts placeholders like "Zapytaj o cenę" ("Ask price") to NaN
2. **clean_numeric_columns** - Cleans and converts numeric columns such as 'price', 'area', 'rent'
3. **categorize_rent** - Categorizes rental prices into bins
4. **process_floor_data** - Splits and normalizes apartment floor info
5. **fill_missing_categoricals** - Fills missing categorical values with "nie podano" ("not provided")
6. **encode_parking_presence** - Encodes binary presence of parking (e.g., yes/no)
7. **convert_year_to_int** - Converts the 'year' column to integer type
8. **standardize_ownership_labels** - Standardizes ownership labels (e.g., unifying similar terms)
9. **multiple_choice_transform** - One-hot encodes multi-choice variables based on a predefined dictionary
10. **location_transform** - Extracts region, city, and street from the address
11. **city_info_transform** - Adds population and administrative info by merging with external city data
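
A chain of steps like the one listed above can be composed with `DataFrame.pipe`. The sketch below uses two simplified stand-in steps (`to_nan` and `add_parking_flag` are hypothetical names, not the functions from pipeline_pre-processing.py) and only shows the composition pattern. How **preliminary_transform** is actually assembled is not shown in this file.

```python
import numpy as np
import pandas as pd

def to_nan(data):
    # Stand-in for a placeholder-cleaning step
    return data.replace(["Zapytaj", "brak informacji"], np.nan)

def add_parking_flag(data):
    # Stand-in for a binary-encoding step; works on a copy to avoid side effects
    data = data.copy()
    data["parking_coded"] = data["parking"].notna().astype(int)
    return data

def toy_preliminary_transform(data):
    # Each step receives the output of the previous one
    return data.pipe(to_nan).pipe(add_parking_flag)

toy = pd.DataFrame({"parking": ["garaż/miejsce parkingowe", "Zapytaj"]})
result = toy_preliminary_transform(toy)
```

Returning a new frame from every step leaves the input frame unchanged, which makes intermediate results easier to compare.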

### standardize_missing_values

In [4]:
def standardize_missing_values(data):
    
    """
    Replaces custom placeholders for missing values with standard NaN values.

    This function searches the DataFrame for specific strings that are used
    to indicate missing or unavailable information (e.g., 'Zapytaj o cenę',
    'Zapytaj', 'brak informacji') and replaces them with `np.nan`, which is
    the standard missing value marker in pandas.

    Parameters:
    ----------
    data : pd.DataFrame
        The input DataFrame to be cleaned.

    Returns:
    -------
    pd.DataFrame
        A DataFrame with specified placeholder values replaced by NaN.
    """
    missing_placeholders = [
        'Zapytaj o cenę', # 'Ask price'
        'Zapytaj',        # 'Ask'
        'brak informacji' # 'no information'
        ]
    
    return data.replace(missing_placeholders, np.nan)


In [5]:
# Application of the function
standardized_missing_values_data = standardize_missing_values(data_initial)

# View the changes that have been made to the data
missing_info = (
    standardized_missing_values_data.isna()
    .sum()
    .to_frame(name='missing_count')
    .assign(
        missing_percent=lambda df: round(100 * df['missing_count'] / len(standardized_missing_values_data), 2)
    )
    .sort_values(by='missing_count', ascending=False)
)

print(missing_info)

                  missing_count  missing_percent
availability              38287            81.84
equipment                 32486            69.44
rent                      25514            54.54
utilities                 21380            45.70
parking                   20529            43.88
add_inf                   19672            42.05
windows                   16767            35.84
mater                     16710            35.72
security                  14992            32.05
perks                     10959            23.43
ownership_status           8158            17.44
flat_condition             7465            15.96
heating                    6028            12.89
price                      2109             4.51
floor                       751             1.61
year                         26             0.06
devel_type                    9             0.02
link                          0             0.00
address                       0             0.00
area                

### clean_numeric_columns

In [6]:
def clean_numeric_columns(data):
    
    """
    Cleans and converts specified columns containing numeric values with extra characters to float type.

    This function is designed to process the columns "price", "area", and "rent" in a DataFrame
    where numeric values may be represented as strings containing non-numeric characters such as
    units (e.g., "m²"), letters, or whitespace. The steps include:

    1. Converting each value to string format to enable regex processing.
    2. Removing all ASCII letters, the Polish letters 'ł' and 'Ł', the superscript '²', and spaces.
    3. Replacing commas with dots to correctly format decimal numbers.
    4. Converting cleaned strings to numeric (float) values using `pd.to_numeric`.

    Parameters:
    ----------
    data : pd.DataFrame
        The input DataFrame that contains the columns "price", "area", and "rent".

    Returns:
    -------
    pd.DataFrame
        The modified DataFrame with "price", "area", and "rent" columns cleaned and converted to float.
    """
    
    for var in ["price", "area", "rent"]:
        
        data[var] = (
            data[var]
            .astype(str)                              
            .str.replace('[ a-zA-ZłŁ²]*', '', regex=True)
            .str.replace(',', '.', regex=False)
        )
        
        data[var] = pd.to_numeric(data[var])
        
    return data

In [7]:
# Application of the function
cleaned_numeric_columns_data = clean_numeric_columns(standardized_missing_values_data)

# View the changes that have been made to the data
clean_numeric_columns_show = pd.concat([data_initial[["price", "area", "rent"]].head(),
               cleaned_numeric_columns_data[["price", "area", "rent"]].head()],
              axis=1)

clean_numeric_columns_show.columns = ["price_before", "area_before", "rent_before",
                                      "price", "area", "rent"]
print(clean_numeric_columns_show)

   price_before area_before rent_before      price  area    rent
0    415 000 zł     37,4 m²     Zapytaj  415000.00 37.40     NaN
1    880 000 zł     68,5 m²      750 zł  880000.00 68.50  750.00
2    590 000 zł       60 m²        1 zł  590000.00 60.00    1.00
3    699 000 zł       77 m²     Zapytaj  699000.00 77.00     NaN
4  1 378 000 zł    69,03 m²    1 120 zł 1378000.00 69.03 1120.00


### categorize_rent

In [8]:
def categorize_rent(data):
    
    """
    Categorizes rental prices into discrete bins and creates a new column 'rent_cat'.

    This function groups the values from the 'rent' column into five predefined ranges (bins)
    to simplify analysis or modeling. Each bin is assigned a numeric label from 1 to 5.
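    The bins are right-closed intervals (the `pd.cut` default):
        - above 0 up to 500
        - above 500 up to 1000
        - above 1000 up to 1500
        - above 1500 up to 2000
        - above 2000
    A rent of exactly 0 falls outside the first bin and becomes NaN.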

    The original 'rent' column is removed after categorization, and the new 'rent_cat'
    column is added with float-typed values.

    Parameters:
    ----------
    data : pd.DataFrame
        The input DataFrame that contains a 'rent' column with numeric values.

    Returns:
    -------
    pd.DataFrame
        The modified DataFrame with the 'rent' column replaced by a categorical 'rent_cat' column.
    """
    
    # Group the values from the 'rent' column into ranges
    data['rent_cat'] = pd.cut(data['rent'],
                              bins = [0, 500, 1000, 1500, 2000, np.inf],
                              labels = np.arange(1, 6, 1))
    data['rent_cat'] = data['rent_cat'].astype("float")
    
    # Drop original 'rent' column
    data = data.drop('rent', axis=1)
    
    return data

In [9]:
# Application of the function
categorized_rent_data = categorize_rent(cleaned_numeric_columns_data)

# View the changes that have been made to the data
categorize_rent_show = pd.concat([data_initial[["rent"]].head(7),
                                  categorized_rent_data[["rent_cat"]].head(7)],
                                 axis=1)

categorize_rent_show.columns = ["rent_before", "rent_cat"]
print(categorize_rent_show.dropna())

  rent_before  rent_cat
1      750 zł      2.00
2        1 zł      1.00
4    1 120 zł      3.00
5      750 zł      2.00
6      700 zł      2.00


### process_floor_data

In [10]:
def process_floor_data(data):
    
    """
    Process floor information in the dataset by extracting and standardizing floor-related features.

    This function performs the following steps:
    1. Extracts the total number of floors in the building from the 'floor' column.
       - Assumes the format is 'apartment_floor/number_of_floors'.
       - If the format does not contain '/', sets the value to NaN.
    2. Extracts the apartment's floor number from the 'floor' column.
       - Converts special floor names ('parter', 'suterena') to '0'.
       - Replaces '> 10' with None (to handle inconsistent data).
    3. If the apartment's floor is labeled as 'poddasze' (attic), replaces it with the total number of floors.
    4. Converts the apartment floor and total floors columns to float type.
    5. Removes the original 'floor' column from the dataframe.

    Parameters:
    ----------
    data : pandas.DataFrame
        The input dataframe containing a 'floor' column with floor information.

    Returns:
    -------
    pandas.DataFrame
        The dataframe with two new columns:
        - 'number_floor_in_building': total floors in the building as float.
        - 'ap_floor': apartment floor number as float.
        The original 'floor' column is dropped.
    """
    
    # Extract total number of floors (right part after '/'); NaN if there is no '/'
    data['number_floor_in_building'] = (
        data['floor']
        .apply(lambda x: str(x).split('/')[1] if '/' in str(x) else np.nan)
        .astype('float')
        )
    # Extract apartment floor (left part before '/') and convert special names
    data['ap_floor'] = data['floor'].apply(lambda x: str(x).split('/')[0]).replace({'parter':'0',
                                                                                    'suterena':'0',
                                                                                    '> 10': None})
    # Replace 'poddasze' (attic) with total floors in building
    data['ap_floor'] = np.where(data['ap_floor'] == 'poddasze',
                                        data['number_floor_in_building'], 
                                        data['ap_floor'])
    # Convert apartment floor to float
    data['ap_floor'] = data['ap_floor'].astype('float')
    
    # Drop original 'floor' column
    data = data.drop('floor', axis=1)
    
    return data

In [11]:
# Application of the function
processed_floor_data = process_floor_data(categorized_rent_data)

# View the changes that have been made to the data
process_floor_data_show = pd.concat([data_initial[["floor"]].head(),
                                  processed_floor_data[["ap_floor", "number_floor_in_building"]].head()],
                                 axis=1)

print(process_floor_data_show)

      floor  ap_floor  number_floor_in_building
0       1/3      1.00                      3.00
1       4/7      4.00                      7.00
2  parter/1      0.00                      1.00
3    parter      0.00                       NaN
4       3/4      3.00                      4.00
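
The parsing rules can be restated for a single value in plain Python. The helper below (`parse_floor` is a hypothetical name, written only for illustration) follows the rules listed in the docstring of **process_floor_data**. It is not the implementation used in the pipeline.

```python
import math

def parse_floor(value):
    """Return (apartment_floor, floors_in_building) for one raw 'floor' value."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return (math.nan, math.nan)

    # 'left/right' -> apartment floor / total floors; no '/' means total is unknown
    left, sep, right = str(value).partition("/")
    total = float(right) if sep else math.nan

    if left in ("parter", "suterena"):   # ground floor / basement -> 0
        floor = 0.0
    elif left == "poddasze":             # attic -> top floor of the building
        floor = total
    elif left == "> 10":                 # imprecise value -> treated as missing
        floor = math.nan
    else:
        floor = float(left)
    return (floor, total)

examples = {raw: parse_floor(raw) for raw in ["1/3", "parter/1", "parter", "poddasze/4"]}
```

Here 'poddasze/4' maps to floor 4 of 4, and 'parter' without a slash gives floor 0 with an unknown building height.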


### fill_missing_categoricals

In [12]:
def fill_missing_categoricals(data):
    
    """
    Fill missing values in selected categorical columns with a placeholder string.

    This function targets a predefined list of categorical columns and replaces any
    missing (NaN) values with the string 'nie podano' (Polish for 'not provided').

    Parameters:
    ----------
    data : pandas.DataFrame
        The input DataFrame containing the columns to process.

    Returns:
    -------
    pandas.DataFrame
        The DataFrame with missing values in the specified categorical columns
        filled with the placeholder string.
    """
    # single-choice variables in the data
    cols = ['ownership_status', 'flat_condition', 'heating', 'windows', 'mater']
    data[cols] = data[cols].fillna('nie podano') # 'not provided'
    
    return data

In [13]:
# Application of the function
fill_missing_categoricals_data = fill_missing_categoricals(processed_floor_data)

# View the changes that have been made to the data
missing_info = (
    fill_missing_categoricals_data.isna()
    .sum()
    .to_frame(name='missing_count')
    .assign(
        missing_percent=lambda df: round(100 * df['missing_count'] / len(fill_missing_categoricals_data), 2)
    )
    .sort_values(by='missing_count', ascending=False)
)

print(missing_info)

                          missing_count  missing_percent
availability                      38287            81.84
equipment                         32486            69.44
rent_cat                          25514            54.54
utilities                         21380            45.70
parking                           20529            43.88
add_inf                           19672            42.05
security                          14992            32.05
perks                             10959            23.43
number_floor_in_building           2238             4.78
price                              2109             4.51
ap_floor                           1235             2.64
year                                 26             0.06
devel_type                            9             0.02
address                               0             0.00
link                                  0             0.00
area                                  0             0.00
num_rooms                      

### encode_parking_presence

In [14]:
# review of the data
data_initial['parking'].value_counts()

parking
garaż/miejsce parkingowe    26253
Zapytaj                     20529
Name: count, dtype: int64

Only the values “garage/parking space” and “Ask” appear in the set. As a simplification, it is assumed that the “Ask” values mean that no such parking space is available.

In [15]:
def encode_parking_presence(data):
    
    """
    Encode the presence of parking information into a binary indicator column.

    This function creates a new column 'parking_coded' where:
    - 1 indicates that parking information is present (non-missing),
    - 0 indicates that parking information is missing (NaN).

    Parameters:
    ----------
    data : pandas.DataFrame
        The input DataFrame containing the 'parking' column.

    Returns:
    -------
    pandas.DataFrame
        The DataFrame with an additional 'parking_coded' column representing
        parking presence as a binary indicator.
    """
    
    data['parking_coded'] = data['parking'].apply(lambda x: 0 if pd.isna(x) else 1)
    
    return data

In [16]:
# Application of the function
encoded_parking_presence_data = encode_parking_presence(fill_missing_categoricals_data)

# View the changes that have been made to the data
encode_parking_presence_show = pd.concat([data_initial[["parking"]].head(6),
                                  encoded_parking_presence_data[["parking_coded"]].head(6)],
                                 axis=1)
print(encode_parking_presence_show)

                    parking  parking_coded
0  garaż/miejsce parkingowe              1
1  garaż/miejsce parkingowe              1
2  garaż/miejsce parkingowe              1
3  garaż/miejsce parkingowe              1
4                   Zapytaj              0
5                   Zapytaj              0
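
The same indicator can be computed without a row-wise `apply`. This is a small vectorized sketch on made-up values, given as an equivalent formulation rather than the code used in the pipeline.

```python
import numpy as np
import pandas as pd

# Made-up column after placeholders were replaced with NaN
parking = pd.Series(["garaż/miejsce parkingowe", np.nan, "garaż/miejsce parkingowe"])

# notna() is True where a value is present; casting to int gives 1/0
parking_coded = parking.notna().astype(int)
```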


### convert_year_to_int

In [17]:
def convert_year_to_int(data):
    
    """
    Convert the 'year' column to integer type, coercing invalid strings to NaN.

    Uses vectorized conversion with pandas.to_numeric, converting
    any non-convertible string to NaN, then casts to nullable integer dtype.

    Parameters:
    ----------
    data : pandas.DataFrame
        DataFrame with a 'year' column to be converted.

    Returns:
    -------
    pandas.DataFrame
        DataFrame with 'year' column converted to nullable integer dtype.
    """
    
    data['year'] = pd.to_numeric(data['year'], errors='coerce').astype('Int64')
    
    return data

In [18]:
# Application of the function
converted_year_to_int_data = convert_year_to_int(encoded_parking_presence_data)

# View the changes that have been made to the data
print("Description before:")
print(data_initial['year'].describe())
print()
print("Description after:")
print(converted_year_to_int_data['year'].describe())

Description before:
count     46782
unique      207
top        2023
freq       9691
Name: year, dtype: object

Description after:
count   46756.00
mean     1998.22
std        83.95
min         1.00
25%      1983.00
50%      2020.00
75%      2023.00
max      2027.00
Name: year, dtype: Float64
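
The conversion described in the docstring of **convert_year_to_int** can be tried on a few made-up values. The entry "unknown" is a hypothetical example of an unparseable string; whether such values occur in the real data is not shown here.

```python
import pandas as pd

# Made-up raw values, including an unparseable string and a missing entry
year_raw = pd.Series(["2023", "1983", "unknown", None])

# errors="coerce" turns unparseable entries into NaN instead of raising;
# the nullable 'Int64' dtype can then hold integers next to missing values
year = pd.to_numeric(year_raw, errors="coerce").astype("Int64")
```

The nullable `Int64` dtype is what allows whole-number years and missing values to share one column, which the plain NumPy `int64` dtype cannot do.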


### standardize_ownership_labels

In [19]:
def standardize_ownership_labels(data):
    
    """
    Standardize specific labels in the 'ownership_status' column.

    This function replaces the label 'spółdzielcze własnościowe' with
    the standardized label 'spółdzielcze wł. prawo do lokalu' in the
    'ownership_status' column. All other values remain unchanged.

    Parameters:
    ----------
    data : pandas.DataFrame
        Input DataFrame containing the 'ownership_status' column.

    Returns:
    -------
    pandas.DataFrame
        DataFrame with updated 'ownership_status' labels.
    """
    
    data['ownership_status'] = data['ownership_status'].apply(
        lambda x: 'spółdzielcze wł. prawo do lokalu' if x == 'spółdzielcze własnościowe' else x
    )
    
    return data

In [20]:
# Application of the function
standardized_ownership_labels_data = standardize_ownership_labels(converted_year_to_int_data)

# View the changes that have been made to the data
print("Value counts before:")
print(data_initial[['ownership_status']].value_counts())
print()
print("Value counts  after:")
print(standardized_ownership_labels_data[['ownership_status']].value_counts())

Value counts before:
ownership_status                 
pełna własność                       36790
Zapytaj                               8158
spółdzielcze wł. prawo do lokalu      1576
udział                                 171
użytkowanie wieczyste / dzierżawa       86
spółdzielcze własnościowe                1
Name: count, dtype: int64

Value counts  after:
ownership_status                 
pełna własność                       36790
nie podano                            8158
spółdzielcze wł. prawo do lokalu      1577
udział                                 171
użytkowanie wieczyste / dzierżawa       86
Name: count, dtype: int64


### multiple_choice_transform

In [21]:
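def items_of_var(var, data = standardized_ownership_labels_data):
    """Return the unique items found in a comma-separated multiple-choice column."""
    item_list = []
    for items in data[var]:
        # Missing values become the string 'nan' here and are removed below
        present = str(items).split(',')
        present = [x.strip(' ') for x in present]
        item_list.extend(present)

    # dict.fromkeys removes duplicates while preserving first-seen order
    item_list = list(dict.fromkeys(item_list))
    if 'nan' in item_list:
        item_list.remove('nan')
    return item_list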

perklist = items_of_var('perks')
utilitylist = items_of_var('utilities')
securitylist = items_of_var('security')
equipmentlist = items_of_var('equipment')
additionallist = items_of_var('add_inf')

var_values_dict = {'utilities': utilitylist,
                   'security': securitylist,
                   'equipment': equipmentlist,
                   'add_inf': additionallist,
                   'perks': perklist}

for var in var_values_dict:
    print(f'{var}: {var_values_dict[var]}')

utilities: ['telewizja kablowa', 'internet', 'telefon']
security: ['drzwi / okna antywłamaniowe', 'teren zamknięty', 'domofon / wideofon', 'monitoring / ochrona', 'rolety antywłamaniowe', 'system alarmowy']
equipment: ['zmywarka', 'lodówka', 'meble', 'piekarnik', 'kuchenka', 'pralka', 'telewizor']
add_inf: ['pom. użytkowe', 'piwnica', 'dwupoziomowe', 'oddzielna kuchnia', 'klimatyzacja']
perks: ['balkon', 'taras', 'ogródek']


In [22]:
# saving the dictionary
var_values_dict_path = "1. Data Preparation/multiple_choice_var_dict.joblib"
joblib.dump(var_values_dict, var_values_dict_path)

['1. Data Preparation/multiple_choice_var_dict.joblib']

In [23]:
def splitcolumn(serieslike, colname, items, missing_categories):
    
    """
    Splits a string column into binary indicator columns for a list of expected items.

    This function processes a string from a specified column in a Series-like object,
    splits it by commas into a list of features, and maps the presence of each expected
    item in that list to a binary value (1 if present, 0 if not). It also tracks any
    unexpected values (not included in `items`) by appending them to the `missing_categories` list.

    Parameters:
    ----------
    serieslike : pandas.Series or pandas.DataFrame
        The data structure containing the column to split.
        
    colname : str
        The name of the column to process.

    items : list of str
        The list of expected categorical values to detect in the column.

    missing_categories : list
        A list that will be extended with unexpected values encountered during processing.

    Returns:
    -------
    pandas.Series
        A Series of binary values (0 or 1) indicating the presence of each item from `items`.
    """
    
    # Extract the value from the given column
    input_value = serieslike[colname]
    
    # If the value is missing, return a Series of 0s (none of the items are present)
    if pd.isna(input_value):
        return pd.Series([0 for x in items])
    
    # Split the string into individual items using commas
    present = input_value.split(',')

    # Remove leading/trailing spaces
    present = [x.strip(' ') for x in present]

    # Filter out empty or very short items (likely noise)
    present = [x for x in present if len(x) > 1]
    
    # Create a binary list indicating whether each expected item is present
    presences = [1 if x in present else 0 for x in items]
    
    # Identify any new/unexpected items not in the `items` list
    new_items = [x for x in present if x not in items]
    if new_items:
        missing_categories.extend(new_items) # Add them to the missing_categories list
    
    return pd.Series(presences)

In [24]:
def multiple_choice_transform(data, train_dataset):
    
    """
    Transforms multiple-choice categorical variables into binary indicator columns.

    This function applies one-hot encoding to columns containing multiple-choice values
    (e.g., "option1, option2") based on predefined expected values stored in a dictionary
    loaded from a joblib file. For each such variable, it creates new binary columns
    indicating the presence of each expected value.

    If unexpected (missing) categories are found in the data and `train_dataset` is True,
    they are collected and printed as a warning.

    Parameters:
    -----------
    data : pandas.DataFrame
        The dataset containing the multiple-choice categorical variables.

    train_dataset : bool
        Flag indicating whether the function is applied on training data.
        If True, the function will print information about any new, unexpected categories.

    Returns:
    --------
    pandas.DataFrame
        The transformed DataFrame with multiple-choice variables expanded into binary columns.
    """
    
    # Load the predefined dictionary mapping each multiple-choice column
    # to the list of expected values (options)
    var_values_dict = joblib.load("1. Data Preparation/multiple_choice_var_dict.joblib")
    
    # List to collect unexpected (missing) categories encountered in the data
    missing_categories = []
    
    for key in var_values_dict:
        # Prepare names for the new binary columns
        column_names = [key + '_' + x for x in var_values_dict[key]]
        
        # Apply the splitcolumn function row-wise, creating binary indicators
        data[column_names] = data.apply(
            splitcolumn,
            args = (key, var_values_dict[key], missing_categories),
            axis = 1
        )
    
    # Drop the original multiple-choice columns after transformation
    data = data.drop(columns=list(var_values_dict))
    
    # If this is the training dataset, report any unexpected categories found
    if train_dataset:
        if missing_categories:
            category_counts = Counter(missing_categories)
            print("There are new categories that are not in the dictionary:")
            for category, count in category_counts.items():
                print(f"  - {category}: {count} times")
        else:
            print("All the categorizations occurring in the set in multi-vector selection variables were coded.")
    
    return data

In [25]:
# Application of the function
multiple_choice_transformed_data = multiple_choice_transform(standardized_ownership_labels_data, train_dataset = True)

# View the changes that have been made to the data
vars_before = ['utilities', 'security', 'equipment', 'add_inf', 'perks']
vars_after = ['utilities_telewizja kablowa', 'security_domofon / wideofon',
              'equipment_zmywarka', 'add_inf_piwnica', 'perks_balkon']

multiple_choice_transform_show = pd.concat([data_initial[vars_before].head(6),
                                  multiple_choice_transformed_data[vars_after].head(6)],
                                 axis=1)

print(multiple_choice_transform_show[['utilities', 'utilities_telewizja kablowa',
                                     'security', 'security_domofon / wideofon',
                                     'equipment', 'equipment_zmywarka',
                                     'add_inf', 'add_inf_piwnica',
                                     'perks', 'perks_balkon']])

All the categorizations occurring in the set in multi-vector selection variables were coded.
                              utilities  utilities_telewizja kablowa  \
0  telewizja kablowa, internet, telefon                            1   
1           telewizja kablowa, internet                            1   
2           telewizja kablowa, internet                            1   
3                              internet                            0   
4  telewizja kablowa, internet, telefon                            1   
5  telewizja kablowa, internet, telefon                            1   

                                            security  \
0  drzwi / okna antywłamaniowe, teren zamknięty, ...   
1  drzwi / okna antywłamaniowe, teren zamknięty, ...   
2                teren zamknięty, domofon / wideofon   
3  drzwi / okna antywłamaniowe, teren zamknięty, ...   
4  drzwi / okna antywłamaniowe, domofon / wideofo...   
5                                 domofon / wideofon   

   securi
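
For comparison, a similar binary encoding of one multiple-choice column can be obtained with the built-in `Series.str.get_dummies`. This is an alternative sketch on made-up rows, not the approach used in **multiple_choice_transform**. Unlike **splitcolumn**, it does not report unexpected items: the `reindex` call silently discards them.

```python
import numpy as np
import pandas as pd

# Expected items for 'utilities', as printed from the saved dictionary
expected = ["telewizja kablowa", "internet", "telefon"]

# Made-up rows, including a missing value
utilities = pd.Series([
    "telewizja kablowa, internet, telefon",
    "internet",
    np.nan,
])

dummies = (
    utilities
    .str.get_dummies(sep=", ")                 # one indicator column per item found
    .reindex(columns=expected, fill_value=0)   # fixed column set and order
    .add_prefix("utilities_")
)
```

Reindexing against the expected list keeps the output columns identical across datasets, even when an item is absent from a particular batch.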

### location_transform

In [26]:
# review of the data
for i in range(5):
    print(data_initial["address"].iloc[i])

ul. Henryka Strobanda, Wrzosy, Toruń, kujawsko-pomorskie
łąkowa 27 B, Stare Polesie, Polesie, Łódź, łódzkie
ul. Błękitna, Marki, wołomiński, mazowieckie
Szyce, Wielka Wieś, krakowski, małopolskie
ul. Rakowiecka 43A, Stary Mokotów, Mokotów, Warszawa, mazowieckie


The address always has the following structure:
1. street - one optional field
2. city district - two optional fields
3. city - one required field
4. powiat - one optional field
5. region (voivodeship) - one required field
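
Given this structure, the city can be located by reading the address from the right: the region is last, and the city is either second or third from the end, depending on whether a powiat is present. The sketch below illustrates that idea with a tiny hand-written lookup. `cities_by_region` and `find_city` are hypothetical names, and the real dictionary is built from an external file of localities.

```python
# Hypothetical two-region lookup, written by hand for the example
cities_by_region = {
    "mazowieckie": {"Warszawa", "Marki"},
    "kujawsko-pomorskie": {"Toruń"},
}

def find_city(address):
    """Return (region, city) for one address string; city is None if not recognised."""
    parts = [p.strip() for p in address.split(",")]
    region = parts[-1]
    known = cities_by_region.get(region, set())

    # Without a powiat the city is second from the end, with one it is third
    for offset in (2, 3):
        if len(parts) >= offset and parts[-offset] in known:
            return region, parts[-offset]
    return region, None

with_powiat = find_city("ul. Błękitna, Marki, wołomiński, mazowieckie")
without_powiat = find_city("ul. Rakowiecka 43A, Stary Mokotów, Mokotów, Warszawa, mazowieckie")
```

In the first address the powiat 'wołomiński' sits between the city and the region, so the city is found at the third position from the end.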

In [27]:
def address_transform(addressline, cities_dict):
    
    """
    Extracts region, location (city/town), and street name from a raw address string.

    This function processes a single address line to parse and separate the region,
    city (location), and street components based on commas and a dictionary of cities
    grouped by regions.

    Parameters:
    -----------
    addressline : str or NaN
        A string representing the full address, typically in the format:
        'Street, City, Region'. If missing (NaN), the function returns three NaN values.

    cities_dict : dict
        A dictionary where keys are region names and values are lists of known cities/towns
        in that region. Used to validate which element of the address refers to the location.

    Returns:
    --------
    pandas.Series
        A Series containing three values: [region, location, street].
    """
    
    # Handle missing or null addresses
    if pd.isna(addressline):
        return pd.Series([np.nan, np.nan, np.nan])
    
    # Split the address by commas and strip whitespace
    parts = [x.strip() for x in addressline.split(',')]
    
    # Extract the region (last element)
    region = parts[-1]
    
    # Attempt to identify location (city/town) using cities_dict
    # (an unknown region yields an empty list, so location falls back to NaN)
    city_in_region = cities_dict.get(region, [])
    
    if len(parts) >= 2 and parts[-2] in city_in_region:
        location = parts[-2]
    elif len(parts) >= 3 and parts[-3] in city_in_region:
        location = parts[-3]
    else:
        location = np.nan

    # Determine if the first part of the address is a valid street
    potential_street = parts[0]
    if potential_street == location:
        street = np.nan
    else:
        # Remove common prefixes from street names
        street = potential_street
        for prefix in ['ul. ', 'al. ', 'pl. ']:
            street = street.removeprefix(prefix)
        street = street.strip()

    return pd.Series([region, location, street])

In [28]:
def location_transform(data):
    
    """
    Extracts structured geographic information (region, location, and street/district)
    from a raw address column using a reference dataset of Polish localities.

    This function reads a CSV file containing Polish place names, filters valid settlement types,
    and constructs a dictionary that maps regions (voivodeships) to cities and villages.
    It then uses this dictionary to parse the 'address' column and extract three elements:
    region, location (city/village), and street/district, which are added as new columns.

    Parameters:
    -----------
    data : pandas.DataFrame
        A DataFrame containing a column named 'address' with full address strings.

    Returns:
    --------
    pandas.DataFrame
        The same DataFrame with three new columns:
        - 'region': the voivodeship (region) the address belongs to,
        - 'location': the specific city/town/village found in the address,
        - 'street/district': the remaining part of the address (typically street).
    """

    # Load external CSV file containing Polish place names and administrative regions
    locations = pd.read_csv('1. Data Preparation/locations_and_regions.csv')

    # Keep only rows with relevant types of settlements
    valid_types = ['wieś', 'miasto', 'osada', 'kolonia', 'osada leśna']
    locations = locations[locations['Rodzaj'].isin(valid_types)]

    # Group places by region (voivodeship)
    locations = locations.groupby('Województwo')

    # Create a dictionary: region -> list of cities/villages in that region
    cities_dict = {}
    for reg in locations:
        cities_dict[reg[0]] = list(reg[1]['Nazwa miejscowości'])

    # Manually correct known naming mismatch: Stargard (used to be Stargard Szczeciński)
    cities_dict['zachodniopomorskie'].append('Stargard')

    # Apply address transformation to extract region, location, and street/district
    data[['region', 'location', 'street/district']] = data['address'].apply(
        address_transform,
        args=[cities_dict]
    )

    return data

In [29]:
# Application of the function
location_transformed_data = location_transform(multiple_choice_transformed_data)

# View the changes that have been made to the data
location_transform_var = ['address', 'region', 'location', 'street/district']

print(location_transformed_data[location_transform_var].info())
location_transformed_data[location_transform_var].head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46782 entries, 0 to 46781
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   address          46782 non-null  object
 1   region           46782 non-null  object
 2   location         46726 non-null  object
 3   street/district  41415 non-null  object
dtypes: object(4)
memory usage: 1.4+ MB
None


Unnamed: 0,address,region,location,street/district
0,"ul. Henryka Strobanda, Wrzosy, Toruń, kujawsko...",kujawsko-pomorskie,Toruń,Henryka Strobanda
1,"łąkowa 27 B, Stare Polesie, Polesie, Łódź, łód...",łódzkie,Łódź,łąkowa 27 B
2,"ul. Błękitna, Marki, wołomiński, mazowieckie",mazowieckie,Marki,Błękitna
3,"Szyce, Wielka Wieś, krakowski, małopolskie",małopolskie,Wielka Wieś,Szyce
4,"ul. Rakowiecka 43A, Stary Mokotów, Mokotów, Wa...",mazowieckie,Warszawa,Rakowiecka 43A
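
The region-to-localities dictionary used by `location_transform` can also be built in a single chained expression. The sketch below is a minimal illustration on a toy table that reuses the column names from the code above ('Nazwa miejscowości', 'Rodzaj', 'Województwo'); the rows themselves, including the place name "Przykładowo", are invented and do not come from `locations_and_regions.csv`.

```python
import pandas as pd

# Toy stand-in for the reference file (rows invented for illustration)
toy_locations = pd.DataFrame({
    "Nazwa miejscowości": ["Toruń", "Marki", "Warszawa", "Przykładowo"],
    "Rodzaj": ["miasto", "miasto", "miasto", "część wsi"],
    "Województwo": ["kujawsko-pomorskie", "mazowieckie",
                    "mazowieckie", "mazowieckie"],
})

# Keep only the settlement types accepted by the pipeline
valid_types = ["wieś", "miasto", "osada", "kolonia", "osada leśna"]
filtered = toy_locations[toy_locations["Rodzaj"].isin(valid_types)]

# Group by region and collect the locality names of each group into a list
toy_cities_dict = (
    filtered.groupby("Województwo")["Nazwa miejscowości"]
    .apply(list)
    .to_dict()
)
print(toy_cities_dict)
```

The row with the type "część wsi" is filtered out before grouping, so it never reaches the dictionary.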


### city_info_transform

In [30]:
def city_info_transform(data):
        
    """
    Enriches the dataset with demographic and administrative information about cities.

    This function merges the input dataset with external city-level statistics, such as
    population size, population density, and administrative rights (powiat rights).
    It also creates categorized bins for population size and density to facilitate analysis.

    Parameters:
    -----------
    data : pandas.DataFrame
        Input DataFrame that must include 'location' and 'region' columns,
        typically produced by the `location_transform` function.

    Returns:
    --------
    pandas.DataFrame
        A DataFrame with additional features:
        - 'pop_numb_cat': population size category (0-7), where:
              0 → up to 10,000 people  
              1 → 10,001-20,000  
              2 → 20,001-50,000  
              3 → 50,001-100,000  
              4 → 100,001-250,000  
              5 → 250,001-500,000  
              6 → 500,001-1,000,000  
              7 → more than 1,000,000
        - 'pop_dens_cat': population density category (0-7), where:
              0 → up to 500 people/km²  
              1 → 501-1000  
              2 → 1001-1500  
              3 → 1501-2000  
              4 → 2001-2500  
              5 → 2501-3000  
              6 → 3001-3500  
              7 → more than 3500
        - 'with_powiat_rights': binary indicator (0 or 1) whether the city has powiat (county-level) administrative rights.

        The function also drops unneeded columns from the merged location data.

    """    
    
    # Load city-level demographic and administrative data    
    locations_data = pd.read_excel("1. Data Preparation/locations_info.xlsx")
    
    # Categorize population size into 8 bins
    locations_data['pop_numb_cat'] = pd.cut(
        locations_data['Liczba ludności'],
        bins=[0, 10_000, 20_000, 50_000, 100_000, 250_000, 500_000, 1_000_000, 2_000_000],
        labels=np.arange(0, 8)
    )
    
    # Categorize population density into 8 bins (step = 500, open-ended last bin)
    locations_data['pop_dens_cat'] = pd.cut(
        locations_data['Gęstość zaludnienia'],
        bins=[0, 500, 1000, 1500, 2000, 2500, 3000, 3500, np.inf],
        labels=np.arange(0, 8)
    )
    
    # Merge with the main dataset using location and region
    data_merged = data.loc[:, 'link':'street/district'].merge(
        locations_data,
        left_on=['location', 'region'],
        right_on=['Miasto', 'Województwo'],
        how='left'
    )
    
    # Replace NaNs in powiat rights with 0 (default: no rights)    
    data_merged['with_powiat_rights'] = data_merged['na_prawach_powiatu'].fillna(0)
    
    # Convert categories to numeric and fill missing values with 0
    # Missing data refers to villages, i.e., the file contains data only for cities
    data_merged['pop_numb_cat'] = pd.to_numeric(data_merged['pop_numb_cat']).fillna(0)
    data_merged['pop_dens_cat'] = pd.to_numeric(data_merged['pop_dens_cat']).fillna(0)

    # Remove unnecessary columns from location metadata
    data_merged.drop([
        'Miasto', 'Powiat', 'Województwo', 'Powierzchnia',
        'Liczba ludności', 'Gęstość zaludnienia', 'na_prawach_powiatu'
    ], axis=1, inplace=True)
    
    return data_merged

In [31]:
# Application of the function
city_info_transformed_data = city_info_transform(location_transformed_data)

# View the changes that have been made to the data
location_transform_var = ['address', 'region','location','street/district',
                          'with_powiat_rights', 'pop_numb_cat',
                          'pop_dens_cat']
city_info_transformed_data[location_transform_var].head()

Unnamed: 0,address,region,location,street/district,with_powiat_rights,pop_numb_cat,pop_dens_cat
0,"ul. Henryka Strobanda, Wrzosy, Toruń, kujawsko...",kujawsko-pomorskie,Toruń,Henryka Strobanda,1.0,4.0,3.0
1,"łąkowa 27 B, Stare Polesie, Polesie, Łódź, łód...",łódzkie,Łódź,łąkowa 27 B,1.0,6.0,4.0
2,"ul. Błękitna, Marki, wołomiński, mazowieckie",mazowieckie,Marki,Błękitna,0.0,2.0,2.0
3,"Szyce, Wielka Wieś, krakowski, małopolskie",małopolskie,Wielka Wieś,Szyce,0.0,0.0,0.0
4,"ul. Rakowiecka 43A, Stary Mokotów, Mokotów, Wa...",mazowieckie,Warszawa,Rakowiecka 43A,1.0,7.0,6.0
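
The binning step can be illustrated on a few invented density values. This is a minimal sketch, assuming right-closed intervals that follow the ranges listed in the docstring and an open-ended last bin; the values and the `edges` list are made up for the example and are not taken from `locations_info.xlsx`.

```python
import numpy as np
import pandas as pd

# Invented density values (people/km²); NaN stands for a village
# that is absent from the city-level file
density = pd.Series([120.0, 500.0, 501.0, 1700.0, 3600.0, np.nan])

# Assumed bin edges: (0, 500], (500, 1000], ..., (3500, inf)
edges = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, np.inf]
density_cat = pd.cut(density, bins=edges, labels=np.arange(0, 8))

# Convert the categorical labels to numbers and treat missing values as category 0
density_cat = density_cat.astype(float).fillna(0).astype(int)
print(density_cat.tolist())
```

Because `pd.cut` closes intervals on the right by default, a density of exactly 500 still falls into category 0, while 501 moves to category 1.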


### preliminary_transform

In [32]:
def preliminary_transform(data, train_dataset):
    
    """
    Performs a full preliminary transformation pipeline on the input dataset.

    This function executes a sequence of data cleaning and feature engineering steps
    that prepare the dataset for further analysis or modeling. It ensures consistency
    in missing values, encodes categorical data, standardizes formats, and enriches
    the data with external geographic and demographic information.

    Parameters:
    -----------
    data : pandas.DataFrame
        The raw dataset to be processed. Must contain all necessary columns 
        such as 'floor', 'year', 'address', etc.

    train_dataset : bool
        Indicates whether the data being processed is training data.
        This is used to display warnings about unexpected new categories
        in multi-label (multi-choice) variables.

    Returns:
    --------
    pandas.DataFrame
        A cleaned and enriched DataFrame ready for modeling or further processing.

    The pipeline performs the following transformations:
    ----------------------------------------------------
    1. 'standardize_missing_values' - Converts placeholders like "Zapytaj o cenę" to NaN.
    2. 'clean_numeric_columns' - Cleans and converts numeric columns such as 'price', 'area', 'rent'.
    3. 'categorize_rent' - Categorizes rental prices into bins.
    4. 'process_floor_data' - Splits and normalizes apartment floor info.
    5. 'fill_missing_categoricals' - Fills missing categorical values with "nie podano".
    6. 'encode_parking_presence' - Encodes binary presence of parking (e.g., yes/no).
    7. 'convert_year_to_int' - Converts the 'year' column to integer type.
    8. 'standardize_ownership_labels' - Standardizes ownership labels (e.g., unifying similar terms).
    9. 'multiple_choice_transform' - One-hot encodes multi-choice variables based on a predefined dictionary.
    10. 'location_transform' - Extracts region, city, and street from the address.
    11. 'city_info_transform' - Adds population and administrative info by merging with external city data.
    """
    
    standardized_missing_values_data = standardize_missing_values(data)
    cleaned_numeric_columns_data = clean_numeric_columns(standardized_missing_values_data)
    categorized_rent_data = categorize_rent(cleaned_numeric_columns_data)
    processed_floor_data = process_floor_data(categorized_rent_data)
    fill_missing_categoricals_data = fill_missing_categoricals(processed_floor_data)
    encoded_parking_presence_data = encode_parking_presence(fill_missing_categoricals_data)
    converted_year_to_int_data = convert_year_to_int(encoded_parking_presence_data)
    standardized_ownership_labels_data = standardize_ownership_labels(converted_year_to_int_data)
    multiple_choice_transformed_data = multiple_choice_transform(standardized_ownership_labels_data,
                                                                 train_dataset)
    location_transformed_data = location_transform(multiple_choice_transformed_data)
    city_info_transformed_data = city_info_transform(location_transformed_data)
    
    return city_info_transformed_data

In [33]:
# Application of the function
preliminary_transformed_data = preliminary_transform(data_initial, True)

All the categorizations occurring in the set in multi-vector selection variables were coded.
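
Since every step in `preliminary_transform` takes a DataFrame and returns a DataFrame, the same chain could also be written with `DataFrame.pipe`. The sketch below only illustrates that pattern: `strip_price_suffix` and `price_to_number` are hypothetical stand-ins, not the notebook's real transformation functions, and the two-row input is invented.

```python
import pandas as pd

# Hypothetical stand-ins for two pipeline steps (illustration only)
def strip_price_suffix(df):
    out = df.copy()
    out["price"] = out["price"].str.replace(" zł", "", regex=False)
    return out

def price_to_number(df):
    out = df.copy()
    # Non-numeric placeholders such as "Zapytaj o cenę" become NaN
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    return out

raw = pd.DataFrame({"price": ["350000 zł", "Zapytaj o cenę"]})

# Each step receives the output of the previous one
processed = raw.pipe(strip_price_suffix).pipe(price_to_number)
print(processed)
```

Compared with a sequence of intermediate variables, the chained form makes the order of the steps visible at a glance, at the cost of not keeping the intermediate results for inspection.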


In [34]:
# Information for changed data
preliminary_transformed_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46782 entries, 0 to 46781
Data columns (total 51 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   link                                  46782 non-null  object 
 1   price                                 44673 non-null  float64
 2   address                               46782 non-null  object 
 3   area                                  46782 non-null  float64
 4   num_rooms                             46782 non-null  int64  
 5   ownership_status                      46782 non-null  object 
 6   flat_condition                        46782 non-null  object 
 7   parking                               26253 non-null  object 
 8   heating                               46782 non-null  object 
 9   market                                46782 non-null  object 
 10  ad_type                               46782 non-null  object 
 11  availability   

# Missing and outlier observations transformation
Used in file: **pipeline_pre-processing.py**<br>
Analysis of missing and outlier observations was conducted in the file: **missvalue_outliers_analysis** <br>
The file shows the distributions of each variable, establishes the variables taken into the model, and sets limits beyond which an observation will be considered an outlier. <br><br>

In summary, the model will be trained on apartments that meet all of the following conditions:<br>
1. the price is between 100 thousand PLN and 1.5 million PLN
2. the area is between 20 m² and 200 m²
3. the building has no more than 20 floors
4. the building was built no earlier than 1900
5. there is no missing data in the variables 'area', 'num_rooms', 'market', 'devel_type', 'ad_type', 'location', 'region'

In addition, in some categorical variables, categories that each accounted for less than about 5% of the set were merged into an "other" ("inny" in Polish) category. The affected variables and the merged categories were:
1. **ownership_status** - 'spółdzielcze wł. prawo do lokalu', 'udział', 'użytkowanie wieczyste / dzierżawa'
2. **heating** - 'kotłownia', 'elektryczne', 'piece kaflowe'
3. **devel_type** - 'plomba', 'loft', 'dom wolnostojący', 'szeregowiec'
4. **windows** - 'drewniane', 'aluminiowe'
5. **mater** - 'drewno', 'keramzyt', 'beton', 'beton komórkowy', 'żelbet'
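
One way to find such rare categories is to compare each category's share with a threshold. The following is a minimal sketch on an invented sample of heating types; the counts and the exact 5% cut-off are assumptions made for the example, while the actual lists above were chosen in the analysis notebook.

```python
import pandas as pd

# Invented sample of heating types (illustration only)
heating = pd.Series(
    ["miejskie"] * 70 + ["gazowe"] * 26 + ["kotłownia"] * 3 + ["piece kaflowe"] * 1
)

threshold = 0.05  # assumed cut-off of about 5%

# Share of each category in the sample
shares = heating.value_counts(normalize=True)
rare = sorted(shares[shares < threshold].index)

# Replace the rare categories with the common label 'inny'
heating_grouped = heating.where(~heating.isin(rare), "inny")
print(rare)
print(heating_grouped.value_counts().to_dict())
```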

In [35]:
# creating dictionary with the defined range of outlier values
outlier_values_dict = {"max_floor": 20,
                       "min_price": 100_000,
                       "max_price": 1_500_000,
                       "min_area": 20,
                       "max_area": 200,
                       "min_year": 1900,
                       "categories_to_replace": ['spółdzielcze wł. prawo do lokalu', 'udział', 'użytkowanie wieczyste / dzierżawa',
                                                 'kotłownia', 'elektryczne', 'piece kaflowe',
                                                 'plomba', 'loft', 'dom wolnostojący', 'szeregowiec',
                                                 'drewniane', 'aluminiowe',
                                                 'drewno', 'keramzyt', 'beton', 'beton komórkowy', 'żelbet',
                                                 'inne', 'inny'],
                       "variables_to_drop_na": ['area', 'num_rooms', 'market', 'devel_type', 'ad_type', 'location', 'region']}

In [36]:
# saving the dictionary
outlier_values_dict_path = "1. Data Preparation/outlier_values_dict.joblib"
joblib.dump(outlier_values_dict, outlier_values_dict_path)

['1. Data Preparation/outlier_values_dict.joblib']

To clear the set of missing and outlier values, the **cleaning_data** function is defined. It applies the following functions in order:
1. **'replace_mistakes_with_na'** - replacing clearly incorrect numerical values with missing values
2. **'clean_missing_values'** - dropping observations with missing values in key variables
3. **'proceed_outlier_categories'** - replacing rare categorical values
4. **'proceed_outliers'** - removing outliers based on defined thresholds

### replace_mistakes_with_na

In [37]:
def replace_mistakes_with_na(data):
    
    """
    Cleans data by replacing obviously incorrect or implausible values in selected numerical columns with NaN.

    Specifically:
    - Replaces values in the 'area' column below 10 and above 500 with NaN,
      as these are likely due to data entry errors.
    - Replaces values above 100 in 'number_floor_in_building' and 'ap_floor' with NaN,
      as these are likely due to data entry errors.
    - Replaces values in the 'year' column below 1700 with NaN, assuming such years are invalid
      for apartment construction.

    Parameters:
    -----------
    data : pd.DataFrame
        The input dataset to be cleaned.

    Returns:
    --------
    data_clean : pd.DataFrame
        The cleaned dataset with incorrect values replaced by NaN.
    """

    # Create copy of the original dataset to avoid modifying it in-place
    data_clean = data.copy()
    
    # Set area values below 10 or above 500 to NaN
    data_clean['area'] = (
        data_clean['area']
        .apply(lambda x: x if pd.isna(x) or 10 <= x <= 500 else np.nan)
    )
    data_clean['area'] = pd.to_numeric(data_clean['area'])

    # Set floor information with floor number above 100 to NaN
    data_clean['number_floor_in_building'] = (
        data_clean['number_floor_in_building']
        .apply(lambda x: x if pd.isna(x) or x <= 100 else np.nan)
    )
    data_clean['number_floor_in_building'] = pd.to_numeric(data_clean['number_floor_in_building'])

    data_clean['ap_floor'] = (
        data_clean['ap_floor']
        .apply(lambda x: x if pd.isna(x) or x <= 100 else np.nan)
    )
    data_clean['ap_floor'] = pd.to_numeric(data_clean['ap_floor'])

    # Set construction years below 1700 to NaN
    data_clean['year'] = (
        data_clean['year']
        .apply(lambda x: x if pd.isna(x) or x >= 1700 else np.nan)
    )
    data_clean['year'] = pd.to_numeric(data_clean['year']).astype('Int64')
    
    return data_clean
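
The same range check can be expressed without a row-wise `apply`. The sketch below is an alternative formulation, not the function used in the pipeline: it relies on `Series.between` (inclusive on both ends by default) and `Series.where`, applied to invented area values.

```python
import numpy as np
import pandas as pd

# Invented area values in m² (illustration only)
area = pd.Series([8.0, 10.0, 45.5, 500.0, 1200.0, np.nan])

# Keep values inside the closed interval [10, 500]; everything else becomes NaN.
# Missing values fail the between() check and therefore stay NaN.
area_clean = area.where(area.between(10, 500))
print(area_clean.tolist())
```

Values exactly at the limits (10 and 500) are kept, which matches the description "below 10 and above 500" in the docstring.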

### clean_missing_values

In [38]:
def clean_missing_values(data, train_dataset, variables_to_drop_na):

    """
    Drops observations that have missing values in any of the specified key variables.

    Parameters:
    -----------
    data : pd.DataFrame
        The input dataset to be cleaned.

    train_dataset : bool
        Indicates whether the data is used for training (True) or inference/prediction (False).
        For inference data, a message is printed when any observations are removed.

    variables_to_drop_na : list of str
        Names of the columns in which missing values are not allowed.

    Returns:
    --------
    data_clean_missing : pd.DataFrame
        A copy of the input dataset without the rows that had missing values
        in the specified columns.
    """
    
    # Create copy of the original dataset to avoid modifying it in-place
    data_clean_missing = data.copy()
    
    # Remove rows with NA values in the specified columns
    data_clean_missing = data_clean_missing.dropna(subset = variables_to_drop_na)
    
    # If this is test/inference data, print a message (in Polish) saying that rows
    # with missing values in the required features were removed from the prediction set
    if not train_dataset:
        if len(data) > len(data_clean_missing):
            print("""
                  W zbiorze wystąpiły braki w cechach: 'typ budynku', 'powierzchnia',
                  'liczba pokoi', 'rynek', 'ogłoszeniodawca', 'miasto', 'województwo'.
                  Do otrzymania prognozy wszystkie z powyższych cech
                  muszą być wypełnione.
                  Obserwacje te zostały usunięte ze zbioru do predykcji.
                  """)    
    
    return data_clean_missing

### proceed_outlier_categories

In [39]:
def proceed_outlier_categories(data, categories_to_replace):
    
    """
    Replaces outlier category values in the dataset with a general label 'inny'.

    Parameters:
    -----------
    data : pd.DataFrame
        The input dataset containing categorical variables.
    
    categories_to_replace : list
        A list of category values
        that should be replaced with the label 'inny' (Polish for "other").

    Returns:
    --------
    data_clean : pd.DataFrame
        A cleaned version of the input dataset with specified category values replaced.
    """

    # Create copy of the original dataset to avoid modifying it in-place
    data_clean = data.copy()
    
    # Replace rare or inconsistent category values with 'inny' (Polish for "other")
    data_clean = data_clean.replace(categories_to_replace, 'inny')
    
    return data_clean
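
Note that passing a list to `DataFrame.replace` matches the values in every column of the frame, not only in the variable a category was originally listed for. The sketch below shows this behaviour on a small invented frame, together with a hypothetical per-column variant for comparison; the pipeline itself uses the frame-wide form.

```python
import pandas as pd

# Invented two-column frame (illustration only)
df = pd.DataFrame({
    "heating": ["miejskie", "kotłownia", "inne"],
    "mater": ["cegła", "beton", "drewno"],
})
categories_to_replace = ["kotłownia", "beton", "drewno", "inne"]

# A list passed to replace() is matched against every column of the frame
replaced_everywhere = df.replace(categories_to_replace, "inny")

# Hypothetical variant: limit the replacement to a single column
replaced_heating_only = df.copy()
replaced_heating_only["heating"] = df["heating"].replace(categories_to_replace, "inny")

print(replaced_everywhere)
print(replaced_heating_only)
```

The frame-wide form is convenient as long as no listed value is a legitimate category in some other column.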

### proceed_outliers

In [40]:
def proceed_outliers(data, train_dataset,
                     minprice, maxprice,
                     minarea, maxarea,
                     maxfloor, minyear,
                     to_delete_from_test_set):
    
    """
    Filters out observations that are likely outliers based on provided thresholds 
    for selected numerical features such as price, area, floor level, and year of construction.

    Parameters:
    -----------
    data : pd.DataFrame
        The input dataset containing apartment listings.

    train_dataset : bool
        Indicates whether the data is used for training (True) or inference/prediction (False).

    minprice : float
        Minimum acceptable price. Listings with a lower price are considered outliers.

    maxprice : float
        Maximum acceptable price. Listings with a higher price are considered outliers.

    minarea : float
        Minimum acceptable area (m²). Listings with a smaller area are considered outliers.

    maxarea : float
        Maximum acceptable area (m²). Listings with a larger area are considered outliers.

    maxfloor : int
        Maximum acceptable floor level. Higher floor numbers are treated as outliers.

    minyear : int
        Minimum acceptable construction year. Years below this threshold are treated as invalid.
        
    to_delete_from_test_set : bool
        Indicates whether outlier observations will remain in the test set or be removed.
        
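    Returns:
    --------
    pd.DataFrame
        For training data (`train_dataset=True`): the dataset with outliers filtered out.
        For inference data (`train_dataset=False`): the filtered dataset if outliers were found
        and `to_delete_from_test_set=True`; otherwise the original dataset. In both cases
        a warning is printed when outliers are found.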
    """

    # Create copy of the original dataset to avoid modifying it in-place
    data_clean = data.copy()
    
    # Filter observations based on price range
    data_clean = data_clean[data_clean['price'] <= maxprice]
    data_clean = data_clean[data_clean['price'] >= minprice]
    
    # Filter observations based on area range
    data_clean = data_clean[
        (data_clean['area'] <= maxarea) |
        (data_clean['area'].isna())
    ]
    data_clean = data_clean[
        (data_clean['area'] >= minarea)|
        (data_clean['area'].isna())
    ]
    
    # Filter based on floor information (if not missing)
    data_clean = data_clean[
        (data_clean['number_floor_in_building'] <= maxfloor) | 
        (data_clean['number_floor_in_building'].isna())
    ]
    data_clean = data_clean[
        (data_clean['ap_floor'] <= maxfloor) |
        (data_clean['ap_floor'].isna())
    ]

    # Filter based on year of building information
    data_clean = data_clean[
        (data_clean['year'] >= minyear) |
        (data_clean['year'].isna())
    ]
    
    if train_dataset:
        # For training data, return only the cleaned records
        return data_clean
    else:
        # For inference data, warn about outliers and either keep or drop them
        if len(data) > (len(data_clean) + sum(data['price'].isna())):
            if not to_delete_from_test_set:
                print("""
                      W zbiorze wystąpiły obserwację o skrajnych wartościach
                      ze względu na zaproponowaną cenę, powierzchnię, piętro lub rok budynku.
                      Może zmniejszeć dokładność prognozy.
                      """)
                return data
            else:
                print("""
                      W zbiorze wystąpiły obserwację o skrajnych wartościach
                      ze względu na zaproponowaną cenę, powierzchnię, piętro lub rok budynku.
                      Obserwacje te zostały usunięte ze zbioru do predykcji.
                      """)
                return data_clean
        # No outliers detected: return the original data
        return data
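
The filters for area, floor and year all follow the same pattern: keep a row when the value is within the allowed range or when it is missing. A minimal sketch of that pattern is shown below; `in_range_or_missing` is a hypothetical helper introduced for this example and the listings are invented.

```python
import numpy as np
import pandas as pd

# Invented listings (illustration only)
listings = pd.DataFrame({
    "area": [15.0, 55.0, 250.0, np.nan, 60.0],
    "year": [1995, 1850, 2010, 2005, 1980],
})

def in_range_or_missing(series, low=None, high=None):
    """Boolean mask: True where the value is missing or within [low, high]."""
    mask = pd.Series(True, index=series.index)
    if low is not None:
        mask &= series >= low
    if high is not None:
        mask &= series <= high
    # Comparisons with NaN are False, so missing values are added back explicitly
    return mask | series.isna()

keep = (
    in_range_or_missing(listings["area"], low=20, high=200)
    & in_range_or_missing(listings["year"], low=1900)
)
filtered = listings[keep]
print(filtered)
```

Rows with a missing area survive this step so that the value can be imputed later, whereas the price filter in `proceed_outliers` has no such exception and drops rows without a price.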

### cleaning_data

In [41]:
def cleaning_data(data, train_dataset, to_delete_from_test_set = True):
    
    """
    Applies a sequence of data cleaning operations on a dataset, in this order:
    1. `replace_mistakes_with_na` - Replacing clearly incorrect numerical values with missing values
    2. `clean_missing_values` - Dropping observations with missing values in key variables
    3. `proceed_outlier_categories` - Replacing rare categorical values
    4. `proceed_outliers` - Removing outliers based on defined thresholds

    Parameters:
    -----------
    data : pd.DataFrame
        The input dataset to be cleaned.

    train_dataset : bool
        Indicates whether the data is used for model training (True)
        or for inference/prediction (False). This affects how aggressively outliers and missing data are removed.
        
    to_delete_from_test_set : bool
        Indicates whether outlier observations will remain in the test set or be removed. Default is set to True.
        
    Returns:
    --------
    pd.DataFrame
        Cleaned dataset after all preprocessing steps.
    """
    
    # Load thresholds and rules for outlier and missing value handling from file
    outlier_values_dict = joblib.load("1. Data Preparation/outlier_values_dict.joblib")
    
    variables_to_drop_na = outlier_values_dict["variables_to_drop_na"]
    categories_to_replace = outlier_values_dict["categories_to_replace"]
    minprice = outlier_values_dict["min_price"]
    maxprice = outlier_values_dict["max_price"]
    minarea = outlier_values_dict["min_area"]
    maxarea = outlier_values_dict["max_area"]
    maxfloor = outlier_values_dict["max_floor"]
    minyear = outlier_values_dict["min_year"]
    
    # Replace implausible or erroneous numeric values with NaN
    replaced_mistakes_data = replace_mistakes_with_na(data)
    
    # Drop observations with missing values in critical variables
    cleaned_missind_values_data = clean_missing_values(replaced_mistakes_data, train_dataset, variables_to_drop_na)
    
    # Replace rare categorical values with a standard label
    proceeded_outlier_cat_data = proceed_outlier_categories(cleaned_missind_values_data, categories_to_replace)
    
    # Remove or flag outliers based on provided thresholds
    procceded_outliers_data = proceed_outliers(proceeded_outlier_cat_data,
                                               train_dataset,
                                               minprice, maxprice,
                                               minarea, maxarea,
                                               maxfloor, minyear,
                                               to_delete_from_test_set)
    
    return procceded_outliers_data
    

In [42]:
# Application of the function
cleaned_outliers_data = cleaning_data(preliminary_transformed_data, True)

# View the changes that have been made to the data
len_before = len(preliminary_transformed_data)
len_after = len(cleaned_outliers_data)

print(f"Number of observations before cleaning of outliers: {len_before}")
print(f"Number of observations after cleaning of outliers: {len_after}")

len_change = np.round((1 - len_after/len_before)*100, 2)
print(f"Data set decreased by {len_before-len_after} ads ({len_change}%)")

Number of observations before cleaning of outliers: 46782
Number of observations after cleaning of outliers: 41862
Data set decreased by 4920 ads (10.52%)


# Data for modeling

In [43]:
# defining variables for the model
features_to_use = ['area', 'num_rooms', 'ownership_status', 'flat_condition',
                   'parking_coded', 'heating', 'ad_type', 'year', 'devel_type',
                   'windows','lift', 'mater', 'rent_cat', 'market',
                   'number_floor_in_building', 'ap_floor',
                   
                   'utilities_telewizja kablowa', 'utilities_internet', 'utilities_telefon',
                   
                   'security_drzwi / okna antywłamaniowe','security_teren zamknięty',
                   'security_domofon / wideofon', 'security_monitoring / ochrona',
                   'security_rolety antywłamaniowe', 'security_system alarmowy',
                   
                   'equipment_zmywarka', 'equipment_lodówka', 'equipment_meble',
                   'equipment_piekarnik', 'equipment_kuchenka', 'equipment_pralka', 'equipment_telewizor',
                   
                   'add_inf_pom. użytkowe', 'add_inf_piwnica', 'add_inf_dwupoziomowe',
                   'add_inf_oddzielna kuchnia', 'add_inf_klimatyzacja',
                   
                   'perks_balkon', 'perks_taras', 'perks_ogródek',
                   
                   'region', 'with_powiat_rights', 'pop_numb_cat', 'pop_dens_cat']

target = 'price'

features_to_onehotencode = ['ownership_status', 'flat_condition', 'heating', 'market',
                            'ad_type', 'windows', 'lift', 'mater', 'devel_type', 'region']

features_for_kNN_impute = ['rent_cat','year','ap_floor','number_floor_in_building',
                           'devel_type', 'lift', 'market',
                           'perks_balkon', 'perks_taras', 'perks_ogródek']

In [44]:
# creation of a dictionary of variables
model_features_dict = {"features_to_use": features_to_use,
                       "target": target,
                       "features_to_onehotencode": features_to_onehotencode,
                       "features_for_kNN_impute": features_for_kNN_impute}

# saving the dictionary
model_features_dict_path = "1. Data Preparation/model_features_dict.joblib"
joblib.dump(model_features_dict, model_features_dict_path)

['1. Data Preparation/model_features_dict.joblib']

In [45]:
# check that all variables defined for the model are present in the dataset
model_features_dict = joblib.load("1. Data Preparation/model_features_dict.joblib")
    
all(x in cleaned_outliers_data.columns for x in model_features_dict['features_to_use'])

True
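
If the check above ever returns `False`, it helps to see which columns are absent. The sketch below is a small illustrative variant on an invented frame and feature list; `frame` and `required` are placeholder names, not objects from the notebook.

```python
import pandas as pd

# Invented frame and feature list (illustration only)
frame = pd.DataFrame({"area": [50.0], "num_rooms": [2], "region": ["mazowieckie"]})
required = ["area", "num_rooms", "year"]

# List the required columns that are not present in the frame
missing = [col for col in required if col not in frame.columns]
print(missing)
```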

# Data processing for model selection
Before modeling, the data were processed as follows:
1. missing and outlier data were cleaned - **cleaning_data**
2. categorical features were one-hot encoded - **one_hot_encode_train_test**
3. all features were scaled with z-score normalization - **scale_train_test**
4. missing data were filled using kNN imputation - **kNN_impute_train_test**
5. data were split into train and test datasets and all the steps above were combined - **prepare_train_test_data**
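
A point common to the encoding, scaling and imputation steps is that their parameters are estimated on the training set only and then applied unchanged to the test set. The sketch below illustrates this for z-score scaling with plain pandas on invented numbers; it is not the notebook's `scale_train_test` function, which is defined elsewhere.

```python
import pandas as pd

# Invented training and test values (illustration only)
train = pd.DataFrame({"area": [30.0, 50.0, 70.0]})
test = pd.DataFrame({"area": [50.0, 90.0]})

# Statistics come from the training set only, so no test information leaks in
mean = train["area"].mean()
std = train["area"].std(ddof=0)  # population std, the convention StandardScaler uses

train_scaled = (train["area"] - mean) / std
test_scaled = (test["area"] - mean) / std

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```

The scaled training values have zero mean by construction, while the test values generally do not, because they are expressed relative to the training distribution.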

In [46]:
# Split the dataset into training and testing sets
train_data, test_data = train_test_split(preliminary_transformed_data,
                                         test_size = 0.2,
                                         random_state = 99)

# Clean the train and test data
train_cleaned_data = cleaning_data(train_data, train_dataset = True)
test_cleaned_data = cleaning_data(test_data, train_dataset = False,
                                  to_delete_from_test_set = True)

# Extract features (X) and target variable (y)
X_train = train_cleaned_data[features_to_use]
y_train = train_cleaned_data[target]

X_test = test_cleaned_data[features_to_use]
y_test = test_cleaned_data[target]

# Ensure NaNs are np.nan, not pd.NA
X_train = X_train.replace({pd.NA: np.nan})
X_test = X_test.replace({pd.NA: np.nan})


                  W zbiorze wystąpiły braki w cechach: 'typ budynku', 'powierzchnia',
                  'liczba pokoi', 'rynek', 'ogłoszeniodawca', 'miasto', 'województwo'.
                  Do otrzymania prognozy wszystkie z powyższych cech
                  muszą być wypełnione.
                  Obserwacje te zostały usunięte ze zbioru do predykcji.
                  

                      The dataset contains observations with extreme values
                      of the proposed price, area, floor or building year.
                      These observations were removed from the prediction set.
                      


### one_hot_encode

In [47]:
def one_hot_encode_train_test(X_train_to_encode, X_test_to_encode, features_to_onehotencode):
    
    """
    One-hot encodes selected categorical features, dropping one reference category per feature.
    The encoders are fitted on the training data only and then applied to both datasets.
    
    Parameters:
    -----------
    X_train_to_encode : pd.DataFrame
        Training data.
    X_test_to_encode : pd.DataFrame
        Test data.
    features_to_onehotencode : list of str
        List of categorical feature names to be one-hot encoded.

    Returns:
    --------
    X_train : pd.DataFrame
        Training data with encoded features.
    X_test : pd.DataFrame
        Test data with encoded features.
    """
    
    # Create copies of the original datasets to avoid modifying them in-place
    X_train = X_train_to_encode.copy()
    X_test = X_test_to_encode.copy()
    
    for column in features_to_onehotencode:
        # Define categories to drop if found
        categories_to_drop = ['nie podano', 'inny', 'wtórny', 'prywatny', 'opolskie']
        
        # Decide which category to drop for this feature
        cat_to_drop = None
        for cat in categories_to_drop:
            if cat in X_train[column].unique():
                cat_to_drop = [cat]
                break
                
        # Create and fit the encoder
        encoder = OneHotEncoder(drop = cat_to_drop if cat_to_drop else 'first',
                                sparse_output = False,
                                handle_unknown='ignore')
        encoder.fit(X_train[[column]])
        
        # Column names for output
        cols = encoder.get_feature_names_out([column])
        
        # Transform both train and test
        X_train_encoded = pd.DataFrame(encoder.transform(X_train[[column]]), columns=cols, index=X_train.index)
        X_test_encoded = pd.DataFrame(encoder.transform(X_test[[column]]), columns=cols, index=X_test.index)
        
        # Join with original data
        X_train = pd.concat([X_train, X_train_encoded], axis=1)
        X_test = pd.concat([X_test, X_test_encoded], axis=1)
        
    # Drop the original categorical columns
    X_train.drop(features_to_onehotencode, axis=1, inplace=True)
    X_test.drop(features_to_onehotencode, axis=1, inplace=True)
    
    return X_train, X_test

In [48]:
# Application of the function
X_train_encoded, X_test_encoded = one_hot_encode_train_test(X_train, X_test, features_to_onehotencode)

# View the changes that have been made to the data
vars_after_encoding = []
for var in features_to_onehotencode:
        col_after_encode = [col for col in X_train_encoded.columns if col.startswith(var)]
        vars_after_encoding = vars_after_encoding + col_after_encode
        
print(X_train_encoded[vars_after_encoding].info())
print(X_train_encoded[vars_after_encoding].head())

<class 'pandas.core.frame.DataFrame'>
Index: 33509 entries, 4068 to 29313
Data columns (total 37 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ownership_status_inny            33509 non-null  float64
 1   ownership_status_pełna własność  33509 non-null  float64
 2   flat_condition_do remontu        33509 non-null  float64
 3   flat_condition_do wykończenia    33509 non-null  float64
 4   flat_condition_do zamieszkania   33509 non-null  float64
 5   heating_gazowe                   33509 non-null  float64
 6   heating_inny                     33509 non-null  float64
 7   heating_miejskie                 33509 non-null  float64
 8   market_pierwotny                 33509 non-null  float64
 9   ad_type_biuro nieruchomości      33509 non-null  float64
 10  ad_type_deweloper                33509 non-null  float64
 11  windows_inny                     33509 non-null  float64
 12  windows_plastikowe  

### scale

In [49]:
def scale_train_test(X_train_to_scale, X_test_to_scale):
    
    """
    Standardizes numerical features in training and test datasets using StandardScaler.

    This function applies z-score normalization to ensure each feature has a mean of 0
    and standard deviation of 1. It fits the scaler only on the training data and uses
    the same transformation on the test data to avoid data leakage.

    Parameters:
    -----------
    X_train_to_scale : pd.DataFrame
        The training dataset with numerical features to be standardized.

    X_test_to_scale : pd.DataFrame
        The test dataset with the same structure as X_train_to_scale.

    Returns:
    --------
    X_train : pd.DataFrame
        The standardized training dataset.

    X_test : pd.DataFrame
        The standardized test dataset, transformed using the scaler fitted on training data.
    """
    # Create copies of the original datasets to avoid modifying them in-place
    X_train = X_train_to_scale.copy()
    X_test = X_test_to_scale.copy() 
    
    # Create and fit the scaler
    scaler = StandardScaler()

    X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns, index = X_train.index)
    X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns, index = X_test.index)
        
    return X_train, X_test

In [50]:
# Application of the function
X_train_scaled, X_test_scaled = scale_train_test(X_train_encoded, X_test_encoded)

# View the changes that have been made to the data
print(X_train_scaled.head())

       area  num_rooms  parking_coded  year  rent_cat  \
4068  -0.15       0.40           0.92  0.73       NaN   
21202  1.92       0.40           0.92  0.70       NaN   
44497  1.04       0.40          -1.09  0.70     -0.96   
22440 -0.49      -0.71          -1.09 -0.37      0.77   
16880 -0.81      -0.71           0.92  0.76     -0.96   

       number_floor_in_building  ap_floor  utilities_telewizja kablowa  \
4068                        NaN       NaN                        -0.84   
21202                     -0.06      0.52                         1.19   
44497                     -1.21     -0.54                        -0.84   
22440                      1.08     -0.54                         1.19   
16880                     -0.06     -0.01                         1.19   

       utilities_internet  utilities_telefon  ...  region_małopolskie  \
4068                -1.02              -0.56  ...               -0.31   
21202                0.98               1.77  ...                3
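
The output above still contains NaN values because scaling happens before imputation. The sketch below, on made-up numbers, illustrates why this order works: `StandardScaler` disregards missing values when fitting and leaves them in place when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with missing entries
train = pd.DataFrame({"area": [40.0, 60.0, np.nan, 80.0]})
test = pd.DataFrame({"area": [60.0, np.nan]})

# Fit on the training data only, then reuse the fitted scaler for the test data
scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)

print(scaler.mean_)   # mean computed from the non-missing training values
print(train_scaled)   # the missing entry is still NaN
print(test_scaled)
```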

### kNN_impute

In [51]:
def kNN_impute_train_test(X_train_to_impute, X_test_to_impute,
                          features_for_kNN_impute, features_to_onehotencode):       
    
    """
    Performs K-Nearest Neighbors (KNN) imputation on selected features in training and test datasets.

    This function uses the KNNImputer to fill in missing values in specified features by
    leveraging similarity between instances. It also automatically handles one-hot encoded
    categorical variables by including all derived columns in the imputation process.

    Parameters:
    -----------
    X_train_to_impute : pd.DataFrame
        Training dataset containing missing values to be imputed.

    X_test_to_impute : pd.DataFrame
        Test dataset with the same structure as the training data.

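    features_for_kNN_impute : list of str
        List of original feature names to apply KNN imputation to.
        If a feature has been one-hot encoded, its columns after encoding
        are included in the imputation process.

    features_to_onehotencode : list of str
        List of features that were one-hot encoded, used to expand them
        into their encoded column names.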

    Returns:
    --------
    X_train : pd.DataFrame
        Training dataset with missing values imputed using KNN.

    X_test : pd.DataFrame
        Test dataset imputed using the same fitted KNN model from training data.
    """
    
    # Create copies of the original datasets to avoid modifying them in-place
    X_train = X_train_to_impute.copy()
    X_test = X_test_to_impute.copy()
    features_for_kNN_impute_new = features_for_kNN_impute.copy()
    
    # Identify which features were one-hot encoded
    encoded_vars = list(set(features_for_kNN_impute) & set(features_to_onehotencode))
    
    # Replace original variable name with all derived one-hot encoded columns
    for var in encoded_vars:
        col_after_encode = [col for col in X_train.columns if col.startswith(var)]
        features_for_kNN_impute_new.remove(var)
        features_for_kNN_impute_new = features_for_kNN_impute_new + col_after_encode
        
    # Create the imputer
    imputer = KNNImputer()
    
    # Fit and apply KNN imputation to the selected columns
    X_train[features_for_kNN_impute_new] = imputer.fit_transform(X_train[features_for_kNN_impute_new])
    X_test[features_for_kNN_impute_new] = imputer.transform(X_test[features_for_kNN_impute_new])
    
    return X_train, X_test

In [52]:
# Application of the function
X_train_imputed, X_test_imputed = kNN_impute_train_test(X_train_scaled, X_test_scaled,
                                                        features_for_kNN_impute, features_to_onehotencode)

In [53]:
# View the changes that have been made to the data
print("Are there missing values in data?", X_train_imputed.isna().any().any())
print(X_train_imputed.head())

Are there missing values in data? False
       area  num_rooms  parking_coded  year  rent_cat  \
4068  -0.15       0.40           0.92  0.73     -0.96   
21202  1.92       0.40           0.92  0.70      0.08   
44497  1.04       0.40          -1.09  0.70     -0.96   
22440 -0.49      -0.71          -1.09 -0.37      0.77   
16880 -0.81      -0.71           0.92  0.76     -0.96   

       number_floor_in_building  ap_floor  utilities_telewizja kablowa  \
4068                       1.16      1.05                        -0.84   
21202                     -0.06      0.52                         1.19   
44497                     -1.21     -0.54                        -0.84   
22440                      1.08     -0.54                         1.19   
16880                     -0.06     -0.01                         1.19   

       utilities_internet  utilities_telefon  ...  region_małopolskie  \
4068                -1.02              -0.56  ...               -0.31   
21202                0.98 
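
How `KNNImputer` fills a gap can be traced on a tiny example. This is a sketch under stated assumptions: the columns `a` and `b` and their values are invented, and `n_neighbors=2` is chosen only to keep the arithmetic easy to follow (the function above uses the default settings). A missing value is replaced by the mean of that feature over the nearest training rows that have it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values); column "b" has gaps
train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 8.0],
                      "b": [2.0, 4.0, 6.0, np.nan]})
test = pd.DataFrame({"a": [1.5], "b": [np.nan]})

# n_neighbors=2 is an assumption made for this illustration only
imputer = KNNImputer(n_neighbors=2)

train_imputed = pd.DataFrame(imputer.fit_transform(train),
                             columns=train.columns, index=train.index)
# The test set is imputed using neighbors from the training data
test_imputed = pd.DataFrame(imputer.transform(test),
                            columns=test.columns, index=test.index)

# Training row with a=8: nearest donors are a=3 and a=2, so b = (6 + 4) / 2
print(train_imputed)
# Test row with a=1.5: nearest donors are a=1 and a=2, so b = (2 + 4) / 2
print(test_imputed)
```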

### prepare_train_test_data

In [54]:
def prepare_train_test_data (data, to_delete_from_test_set, test_size, random_seed):
    
    """
    Prepares and processes the dataset for model training and evaluation by applying
    feature selection, encoding, scaling, and missing value imputation.

    The function loads preprocessing configuration from a saved dictionary and performs the following steps:
    1. Splits the dataset into train and test sets.
    2. Cleans missing and outlier values in train and test sets.
    3. Selects relevant features and target column.
    4. Applies one-hot encoding to categorical features.
    5. Scales numeric features using StandardScaler.
    6. Imputes missing values using K-Nearest Neighbors imputation.

    Parameters:
    -----------
    data : pd.DataFrame
        The full dataset to be prepared for training and testing.
    
    to_delete_from_test_set : bool
        If True, outlier observations are removed from the test set; if False, they are kept.

    test_size : float
        Proportion of the dataset to include in the test split (e.g., 0.2 for 20%).

    random_seed : int
        Random seed to ensure reproducibility of the train-test split.

    Returns:
    --------
    X_train_imputed : pd.DataFrame
        Fully preprocessed training feature set.

    X_test_imputed : pd.DataFrame
        Fully preprocessed test feature set.

    y_train : pd.Series
        Target variable for training data.

    y_test : pd.Series
        Target variable for test data.
    """
    
    # Load preprocessing configuration with selected features and target definition
    model_features_dict = joblib.load("1. Data Preparation/model_features_dict.joblib")
    
    features_to_use = model_features_dict['features_to_use']
    target = model_features_dict['target']
    features_to_onehotencode = model_features_dict['features_to_onehotencode']
    features_for_kNN_impute = model_features_dict['features_for_kNN_impute']
    
    # Validate that all required features exist in the provided dataset
    if not all(x in data.columns for x in features_to_use):
        print("The dataset does not have all the variables defined for modeling")
    
    # Split the dataset into training and testing sets
    train_data, test_data = train_test_split(data,
                                             test_size = test_size,
                                             random_state = random_seed)

    # Clean the train and test data
    train_cleaned_data = cleaning_data(train_data, train_dataset = True)
    
    test_cleaned_data = cleaning_data(test_data,
                                      train_dataset = False,
                                      to_delete_from_test_set = to_delete_from_test_set)
    
    # Since the test set is needed for the evaluation of the model, it must not contain missing data in the target variable
    test_cleaned_data = test_cleaned_data.dropna(subset = target)

    # Extract features (X) and target variable (y)
    X_train = train_cleaned_data[features_to_use]
    y_train = train_cleaned_data[target]

    X_test = test_cleaned_data[features_to_use]
    y_test = test_cleaned_data[target]
    
    # Ensure NaNs are np.nan, not pd.NA
    X_train = X_train.replace({pd.NA: np.nan})
    X_test = X_test.replace({pd.NA: np.nan})
    
    # One-hot encode categorical features
    X_train_encoded, X_test_encoded = one_hot_encode_train_test(X_train, X_test, features_to_onehotencode)
    
    # Scale features using StandardScaler
    X_train_scaled, X_test_scaled = scale_train_test(X_train_encoded, X_test_encoded)
    
    # Impute missing values using KNN imputer
    X_train_imputed, X_test_imputed = kNN_impute_train_test(X_train_scaled, X_test_scaled,
                                                            features_for_kNN_impute, features_to_onehotencode)
    
    return X_train_imputed, X_test_imputed, y_train, y_test
    

In [55]:
# Application of the function
X_train, X_test, y_train, y_test = prepare_train_test_data(preliminary_transformed_data,
                                                           to_delete_from_test_set = False,
                                                           test_size = 0.2, random_seed = 99)


                  The dataset contains missing values in the features: 'building type', 'area',
                  'number of rooms', 'market', 'advertiser', 'city', 'voivodeship'.
                  To obtain a prediction, all of the above features
                  must be filled in.
                  These observations were removed from the prediction set.
                  

                      The dataset contains observations with extreme values
                      of the proposed price, area, floor or building year.
                      This may reduce the accuracy of the prediction.
                      


In [56]:
# View the changes that have been made to the data
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 33509 entries, 4068 to 29313
Data columns (total 71 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   area                                  33509 non-null  float64
 1   num_rooms                             33509 non-null  float64
 2   parking_coded                         33509 non-null  float64
 3   year                                  33509 non-null  float64
 4   rent_cat                              33509 non-null  float64
 5   number_floor_in_building              33509 non-null  float64
 6   ap_floor                              33509 non-null  float64
 7   utilities_telewizja kablowa           33509 non-null  float64
 8   utilities_internet                    33509 non-null  float64
 9   utilities_telefon                     33509 non-null  float64
 10  security_drzwi / okna antywłamaniowe  33509 non-null  float64
 11  security_teren za
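
The feature check in `prepare_train_test_data` only prints a message and then continues. A stricter variant could stop early and name the missing columns. The sketch below is one possible approach; the helper `check_required_features` is hypothetical and not part of the project code.

```python
import pandas as pd


def check_required_features(data, required_features):
    """Raise an error listing every required column absent from the data."""
    missing = [col for col in required_features if col not in data.columns]
    if missing:
        raise ValueError(
            f"The dataset is missing features required for modeling: {missing}"
        )


# Toy data (hypothetical values)
toy = pd.DataFrame({"area": [50.0], "num_rooms": [2]})

# All required columns are present, so nothing happens
check_required_features(toy, ["area", "num_rooms"])

# "year" is absent, so the check raises and reports it
try:
    check_required_features(toy, ["area", "year"])
except ValueError as err:
    message = str(err)
    print(message)
```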

# Data processing for training the final model
Before modeling, the data were processed as follows:
1. categorical features were one-hot encoded - **one_hot_encode_train_encoder**
2. all features were scaled with z-score normalization - **scale_train_scaler**
3. missing data were filled using kNN imputation - **kNN_impute_train_imputer**
4. all the steps above were combined, and the fitted encoder, scaler and imputer were saved - **prepare_final_training_data**
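
Saving the fitted objects is what lets production data be transformed with the parameters learned on the training data. A minimal sketch of the save and reload round trip with `joblib`, using a toy scaler and a temporary directory (both are assumptions for illustration, not the project's actual paths):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on toy training data (hypothetical values)
X_toy = np.array([[40.0], [60.0], [80.0]])
scaler = StandardScaler().fit(X_toy)

# Save the fitted object and load it back, as a later session would
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler.pkl")
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The restored scaler applies the same transformation to new data
X_new = np.array([[60.0], [100.0]])
same = np.allclose(scaler.transform(X_new), restored.transform(X_new))
print(same)
```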

### one_hot_encode

In [57]:
def one_hot_encode_train_encoder(X_to_encode, features_to_onehotencode):
    
    """
    Applies one-hot encoding to specified categorical features in the training dataset
    and returns both the transformed DataFrame and a dictionary of trained encoders.

    For each categorical feature, the function:
    - Checks whether specific unwanted categories (e.g., 'nie podano') are present.
    - Drops the first matching category from the one-hot encoding if found, otherwise drops the first by default.
    - Uses sklearn's OneHotEncoder with 'handle_unknown=ignore' to handle unseen categories gracefully.
    - Returns a transformed DataFrame and a dictionary of fitted encoders for future use (e.g., for encoding test data).

    Parameters:
    -----------
    X_to_encode : pd.DataFrame
        Input DataFrame containing categorical features to be encoded.

    features_to_onehotencode : list of str
        List of column names in `X_to_encode` to apply one-hot encoding on.

    Returns:
    --------
    X : pd.DataFrame
        DataFrame with original categorical features replaced by one-hot encoded columns.

    encoders : dict
        Dictionary mapping each encoded feature name to its corresponding fitted OneHotEncoder instance.
    """
    
    # Create copy of the original dataset to avoid modifying it in-place
    X = X_to_encode.copy()
    
    # Dictionary to store fitted encoders for each column
    encoders = {}
    
    for column in features_to_onehotencode:
        # Define categories that should be dropped if they exist
        categories_to_drop = ['nie podano', 'inny', 'wtórny', 'prywatny', 'opolskie']
        
        # Determine which category to drop (first match found)
        cat_to_drop = None
        
        for cat in categories_to_drop:
            if cat in X[column].unique():
                cat_to_drop = [cat]
                break
                
        # Create and fit the encoder
        encoder = OneHotEncoder(drop = cat_to_drop if cat_to_drop else 'first',
                                sparse_output = False, handle_unknown = 'ignore')
        encoder.fit(X[[column]])
        
        # Get names of one-hot encoded columns
        cols = encoder.get_feature_names_out([column])
        
        # Transform the data and wrap in DataFrame
        X_encoded = pd.DataFrame(encoder.transform(X[[column]]), columns=cols, index=X.index)
        
        # Append the encoded columns to the dataset
        X = pd.concat([X, X_encoded], axis=1)
        
        # Save the encoder for this column
        encoders[column] = encoder
        
    # Drop the original categorical columns
    X.drop(features_to_onehotencode, axis=1, inplace=True)
    
    return X, encoders

### scale

In [58]:
def scale_train_scaler(X_to_scale):
    
    """
    Applies standard scaling (zero mean, unit variance) to the input dataset and returns
    both the scaled dataset and the fitted scaler for future use (e.g., for test data).

    The function performs the following steps:
    - Copies the input DataFrame to avoid modifying the original.
    - Fits a 'StandardScaler' from 'sklearn.preprocessing' to the data.
    - Applies the transformation and returns the scaled data along with the scaler.

    Parameters:
    -----------
    X_to_scale : pd.DataFrame
        The DataFrame containing numerical features to be scaled.

    Returns:
    --------
    X : pd.DataFrame
        The scaled version of the input DataFrame, with the same column names.

    scaler : StandardScaler
        The fitted 'StandardScaler' object, which can be used to transform new data
        (e.g., test set) using the same scaling parameters.
    """
    
    # Create copy of the original dataset to avoid modifying it in-place
    X = X_to_scale.copy()
    
    # Create the scaler
    scaler = StandardScaler()

    # Fit and apply the scaler to the data and preserve column names
    X = pd.DataFrame(scaler.fit_transform(X), columns = X.columns)
        
    return X, scaler
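
Note that `scale_train_scaler` rebuilds the DataFrame without passing `index`, so the scaled rows get a fresh 0..n-1 index while the target keeps its original labels. Positional use (for example, fitting a model on both) is unaffected, but label-based operations would not line up. The sketch below shows the difference on made-up data; the `price` name is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with a non-default index
X = pd.DataFrame({"area": [40.0, 60.0, 80.0]}, index=[10, 20, 30])
y = pd.Series([300.0, 450.0, 600.0], index=[10, 20, 30], name="price")

scaled = StandardScaler().fit_transform(X)

# Without index=...: rows are relabeled 0, 1, 2
X_reset = pd.DataFrame(scaled, columns=X.columns)
# With index=X.index: rows keep the labels 10, 20, 30
X_kept = pd.DataFrame(scaled, columns=X.columns, index=X.index)

# Joining on the index only works when the labels were preserved
misaligned = X_reset.join(y)["price"].isna().all()
aligned = X_kept.join(y)["price"].notna().all()
print(misaligned, aligned)
```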

### kNN_impute

In [None]:
def kNN_impute_train_imputer(X_to_impute, features_for_kNN_impute, features_to_onehotencode):       
    
    """
    Performs K-Nearest Neighbors (KNN) imputation on selected features of a dataset,
    including those that have been one-hot encoded.

    The function:
    - Identifies features that were originally categorical and have been one-hot encoded.
    - Replaces these original variables in the list of features to impute with their
      one-hot encoded column names.
    - Fits a 'KNNImputer' on the specified features.
    - Returns the dataset with imputed values and a dictionary containing the fitted
      imputer and the actual list of features used.

    Parameters:
    -----------
    X_to_impute : pd.DataFrame
        The input dataset (typically already encoded and scaled) with missing values to be imputed.

    features_for_kNN_impute : list of str
        List of original features intended for imputation.

    features_to_onehotencode : list of str
        List of features that were one-hot encoded (to help expand into multiple columns).

    Returns:
    --------
    X : pd.DataFrame
        The dataset with imputed values for the specified features.

    imputer_dict : dict
        A dictionary containing:
            - 'imputer' : the fitted 'KNNImputer' object
            - 'features': the actual list of column names used for imputation,
                          including one-hot encoded columns.
    """
    
    # Create copy of the original dataset to avoid modifying it in-place
    X = X_to_impute.copy()
    
    # Copy the list of features to impute
    features_for_kNN_impute_new = features_for_kNN_impute.copy()
    
    # Identify one-hot encoded variables among features to impute
    encoded_vars = list(set(features_for_kNN_impute) & set(features_to_onehotencode))
    
    # Replace encoded variables with their actual one-hot encoded column names
    for var in encoded_vars:
        col_after_encode = [col for col in X.columns if col.startswith(var)]
        features_for_kNN_impute_new.remove(var)
        features_for_kNN_impute_new = features_for_kNN_impute_new + col_after_encode
        
    # Create the imputer
    imputer = KNNImputer()
    
    # Fit and apply imputation only on the selected columns
    X[features_for_kNN_impute_new] = imputer.fit_transform(X[features_for_kNN_impute_new])
    
    # Return the transformed data and the fitted imputer with the column names used
    imputer_dict = {'imputer': imputer,
                    'features': features_for_kNN_impute_new}
    
    return X, imputer_dict

### prepare_final_training_data

In [60]:
def prepare_final_training_data(data_to_train, to_save = True):
    
    """
    Prepares the final training dataset by applying a complete preprocessing pipeline,
    including one-hot encoding, scaling, and KNN imputation.

    This function:
    - Loads the model configuration (feature lists and target) from a predefined dictionary.
    - Applies preprocessing steps in the following order:
        1. Cleans missing and outlier values.
        2. One-hot encoding of categorical variables.
        3. Feature standardization (z-score scaling).
        4. Missing data imputation using K-Nearest Neighbors.
    - Optionally saves the fitted preprocessing objects (encoder, scaler, imputer) for future use.

    Parameters:
    -----------
    data_to_train : pd.DataFrame
        The raw input dataset containing both features and the target variable.

    to_save : bool, default=True
        If True, saves the fitted encoder, scaler, and imputer to disk under the
        folder 'production_pipeline_objects'.

    Returns:
    --------
    X_imputed : pd.DataFrame
        The fully preprocessed feature matrix ready for training.

    y : pd.Series
        The target variable extracted from the input data.
    """
    
    # Load the list of features and settings used for preprocessing
    model_features_dict = joblib.load("1. Data Preparation/model_features_dict.joblib")
    
    features_to_use = model_features_dict['features_to_use']
    target = model_features_dict['target']
    features_to_onehotencode = model_features_dict['features_to_onehotencode']
    features_for_kNN_impute = model_features_dict['features_for_kNN_impute']
    
    # Check whether all required features are present in the data
    if not all(x in data_to_train.columns for x in features_to_use):
        print("The dataset does not have all the variables defined for modeling")
    
    # Copy the dataset to avoid in-place changes
    data = data_to_train.copy()
    
    # Process missing and outlier values
    data = cleaning_data(data, train_dataset = True)
    
    # Ensure compatibility with np.nan
    data = data.replace({pd.NA: np.nan})
    
    # Separate input features and target
    X = data[features_to_use]
    y = data[target]
    
    # One-hot encode categorical features
    X_encoded, encoder = one_hot_encode_train_encoder(X, features_to_onehotencode)
    
    # Scale features using StandardScaler
    X_scaled, scaler = scale_train_scaler(X_encoded)
    
    # Impute missing values using KNN imputer
    X_imputed, imputer = kNN_impute_train_imputer(X_scaled, features_for_kNN_impute, features_to_onehotencode)
    
    # Optionally save preprocessing objects to disk for future use
    if to_save: 
        joblib.dump(encoder, "production_pipeline_objects/encoder.pkl")
        joblib.dump(scaler, "production_pipeline_objects/scaler.pkl")
        joblib.dump(imputer, "production_pipeline_objects/imputer.pkl")
    
    return X_imputed, y

In [61]:
# Application of the function
X_final, y_final = prepare_final_training_data(preliminary_transformed_data)

In [62]:
# View the changes that have been made to the data
X_final.head()

Unnamed: 0,area,num_rooms,parking_coded,year,rent_cat,number_floor_in_building,ap_floor,utilities_telewizja kablowa,utilities_internet,utilities_telefon,...,region_małopolskie,region_podkarpackie,region_podlaskie,region_pomorskie,region_warmińsko-mazurskie,region_wielkopolskie,region_zachodniopomorskie,region_łódzkie,region_śląskie,region_świętokrzyskie
0,-0.96,-0.72,0.92,0.7,0.08,-0.44,-0.54,1.19,0.98,1.77,...,-0.31,-0.19,-0.14,-0.37,-0.16,-0.27,-0.34,-0.21,-0.33,-0.12
1,0.5,0.4,0.92,0.6,0.78,1.09,1.05,1.19,0.98,-0.56,...,-0.31,-0.19,-0.14,-0.37,-0.16,-0.27,-0.34,4.86,-0.33,-0.12
2,0.1,0.4,0.92,0.7,-0.96,-1.21,-1.08,1.19,0.98,-0.56,...,-0.31,-0.19,-0.14,-0.37,-0.16,-0.27,-0.34,-0.21,-0.33,-0.12
3,0.9,1.52,0.92,0.7,-0.61,-1.21,-1.08,-0.84,0.98,-0.56,...,3.24,-0.19,-0.14,-0.37,-0.16,-0.27,-0.34,-0.21,-0.33,-0.12
4,0.52,0.4,-1.09,-2.1,2.51,-0.06,0.52,1.19,0.98,1.77,...,-0.31,-0.19,-0.14,-0.37,-0.16,-0.27,-0.34,-0.21,-0.33,-0.12


# Data processing used in the production function
For prediction, the data were processed by the **prepare_production_modeling_data** function, using the saved preprocessing pipeline objects, as follows:
1. categorical features were one-hot encoded
2. all features were scaled with z-score normalization
3. missing data were filled using kNN imputation
4. all the steps above were combined

In [63]:
def prepare_production_modeling_data(data_to_predict, to_delete_outliers = True):
    
    """
    Prepares production (inference) data by applying the saved preprocessing pipeline objects.

    This function:
    1. Loads the saved model preprocessing objects (encoders, scaler, and imputer).
    2. Processes missing and outlier values.
    3. Applies one-hot encoding using pre-fitted encoders.
    4. Scales the features using the pre-fitted StandardScaler.
    5. Performs KNN imputation for missing values using the saved imputer.

    This ensures consistency with the transformations applied during model training.

    Parameters:
    -----------
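    data_to_predict : pd.DataFrame
        The new dataset (e.g., from production or deployment environment) containing both
        input features and the target variable.

    to_delete_outliers : bool, default=True
        If True, outlier observations are removed from the data; if False, they are kept.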

    Returns:
    --------
    X : pd.DataFrame
        The fully preprocessed feature matrix ready for inference.

    y : pd.Series
        The target variable from the provided data.
    """
    
    # Load feature definitions and preprocessing components
    model_features_dict = joblib.load("1. Data Preparation/model_features_dict.joblib")
    
    features_to_use = model_features_dict['features_to_use']
    target = model_features_dict['target']
    features_to_onehotencode = model_features_dict['features_to_onehotencode']
    
    folder_path = "production_pipeline_objects"
    
    encoder = joblib.load(f"{folder_path}/encoder.pkl")
    scaler = joblib.load(f"{folder_path}/scaler.pkl")
    
    # The imputer file stores the fitted imputer and the columns it was fitted on
    imputer_dict = joblib.load(f"{folder_path}/imputer.pkl")
    imputer = imputer_dict['imputer']
    features_for_kNN_impute = imputer_dict['features']
    
    # Copy the dataset to avoid in-place changes
    data = data_to_predict.copy()
    
    # Process missing and outlier values
    data = cleaning_data(data, train_dataset = False, to_delete_from_test_set = to_delete_outliers)
    
    # Ensure compatibility with np.nan
    data = data.replace({pd.NA: np.nan})
    
    # Separate input features and target
    X = data[features_to_use]
    y = data[target]
    
    # Step 1: Apply saved one-hot encoders for each categorical column
    for column in features_to_onehotencode:
        
        encoder_col = encoder[column]   
        # Column names for output
        cols = encoder_col.get_feature_names_out([column])
        
        X_encoded = pd.DataFrame(encoder_col.transform(X[[column]]), columns=cols, index=X.index)
        
        X = pd.concat([X, X_encoded], axis=1)
        
    # Drop original categorical columns after encoding
    X.drop(features_to_onehotencode, axis=1, inplace=True)
    
    # Step 2: Scale all features using saved StandardScaler
    X = pd.DataFrame(scaler.transform(X), columns = X.columns)
    
    # Step 3: Impute missing values using saved KNNImputer
    X[features_for_kNN_impute] = imputer.transform(X[features_for_kNN_impute])
    
    return X, y

In [64]:
# Application of the function
X, y = prepare_production_modeling_data(preliminary_transformed_data.iloc[5:1000],
                                        to_delete_outliers = True)


                      The dataset contains observations with extreme values
                      of the proposed price, area, floor or building year.
                      These observations were removed from the prediction set.
                      


In [65]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 917 entries, 0 to 916
Data columns (total 71 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   area                                  917 non-null    float64
 1   num_rooms                             917 non-null    float64
 2   parking_coded                         917 non-null    float64
 3   year                                  917 non-null    float64
 4   rent_cat                              917 non-null    float64
 5   number_floor_in_building              917 non-null    float64
 6   ap_floor                              917 non-null    float64
 7   utilities_telewizja kablowa           917 non-null    float64
 8   utilities_internet                    917 non-null    float64
 9   utilities_telefon                     917 non-null    float64
 10  security_drzwi / okna antywłamaniowe  917 non-null    float64
 11  security_teren zamk