![Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}


# Import libraries

In [1]:
from pathlib import Path

import catboost
import numpy as np
import pandas as pd
import shap
from data import pre_process, utils
from IPython.display import clear_output
from lets_plot import *
from lets_plot.mapping import as_discrete
from models import train_model
from sklearn import metrics, model_selection
from tqdm import tqdm

LetsPlot.setup_html()



**Objectives**:

1. **Cross-validation**: Implement a robust cross-validation strategy so that model performance is estimated consistently across multiple folds of the data.
2. **Outlier Identification**: Identify and examine potential outliers in the dataset so that they do not unduly influence the model's predictions.
3. **Feature Engineering**: Continue refining and expanding the feature set, exploring new informative features that improve the model's predictive performance.



# Prepare dataframe before modelling
## Read in dataframe


In [2]:
df = pd.read_parquet(
    utils.Configuration.INTERIM_DATA_PATH.joinpath(
        "2023-10-01_Processed_dataset_for_NB_use.parquet.gzip"
    )
)

# Cross-validation

In [17]:
pre_processed_df = pre_process.prepare_data_for_modelling(train)

Shape of X and y: (2928, 17), (2928,)
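
The cross-validation loop itself is not shown in this excerpt, so the following is only a minimal sketch of what a K-fold evaluation might look like. It rests on several assumptions: the data is synthetic (generated to match the shape printed above), `GradientBoostingRegressor` is a stand-in for the CatBoost model, and both RMSE as the metric and the 5-fold shuffled split are illustrative choices rather than the notebook's actual settings.

```python
import numpy as np
from sklearn import metrics, model_selection
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data with the same shape as the notebook's X and y.
rng = np.random.default_rng(42)
X = rng.normal(size=(2928, 17))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=2928)

# Shuffled 5-fold split; the fold count and the seed are arbitrary choices.
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

fold_rmse = []
for train_idx, valid_idx in kfold.split(X):
    # Stand-in estimator (assumption); a CatBoost regressor could be used instead.
    model = GradientBoostingRegressor(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    # Score on the held-out fold only, so each row is evaluated exactly once.
    rmse = np.sqrt(metrics.mean_squared_error(y[valid_idx], preds))
    fold_rmse.append(rmse)

print(f"RMSE per fold: {np.round(fold_rmse, 3)}")
print(f"Mean RMSE: {np.mean(fold_rmse):.3f} (std {np.std(fold_rmse):.3f})")
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable the score is, which is useful when comparing feature sets later on.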


# Feature Engineering

**SOME IDEAS**

1. **Categorical Features:**
   - Encode categorical variables like "state," "kitchen_type," "building_condition," and "city" using one-hot encoding or label encoding.

2. **Geospatial Features:**
   - Calculate the distance from each apartment to key locations in the city (e.g., city center, schools, parks, public transportation) using "lat" and "lng."
   - Create clusters or neighborhoods based on the latitude and longitude coordinates (lat, lng) to capture the regional effect on prices.

3. **Spatial Features:**
   - Explore how "number_of_frontages" affects property prices. It could be treated as a categorical variable or kept as a numerical feature.

4. **Area-related Features:**
   - Calculate the ratio of "living_area" to "surface_of_the_plot" to get an idea of the density or spaciousness of the property.
   - Create bins or categories for "living_area" and "surface_of_the_plot" to capture different property sizes.

5. **Energy Efficiency Features:**
   - Compute the energy efficiency ratio by dividing "yearly_theoretical_total_energy_consumption" by "primary_energy_consumption."
   - Normalize energy-related features to have a similar scale if they are measured in different units.

6. **Toilet and Bathroom Features:**
   - Combine "toilets" and "bathrooms" into a single "total_bathrooms" feature to simplify the model.

7. **Parking Features:**
   - If parking information is available, create binary features indicating the presence of covered or uncovered parking spaces.

8. **Taxation Features:**
   - Incorporate "cadastral_income" as a measure of property value for taxation. You can create bins or categories for this variable.

9. **Combining Features:**
   - Experiment with interactions and products of different features to capture complex relationships. For instance, you could multiply "living_area" by "cadastral_income" to get a new feature.
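
A few of the ideas above can be sketched with pandas. This is an illustrative example, not the notebook's implementation: the toy values are made up, the helper names `haversine_km` and `add_features`, the new column names, the bin edges, and the reference coordinates are all hypothetical, and only the input column names come from the list above.

```python
import numpy as np
import pandas as pd

# Toy rows using column names from the list above; the values are made up.
toy = pd.DataFrame(
    {
        "lat": [50.8503, 51.2194, 50.6326],
        "lng": [4.3517, 4.4025, 5.5797],
        "living_area": [85.0, 120.0, 60.0],
        "surface_of_the_plot": [200.0, 0.0, 150.0],
        "toilets": [1, 2, 1],
        "bathrooms": [1, 1, 1],
    }
)


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def add_features(df, center_lat, center_lng):
    """Return a copy of df with a few engineered columns (hypothetical names)."""
    out = df.copy()
    # Geospatial: distance from each property to a chosen reference point.
    out["distance_to_center_km"] = haversine_km(
        out["lat"], out["lng"], center_lat, center_lng
    )
    # Area ratio: treat a zero plot size as missing to avoid division by zero.
    plot = out["surface_of_the_plot"].replace(0, np.nan)
    out["living_area_to_plot_ratio"] = out["living_area"] / plot
    # Combine toilets and bathrooms into one count.
    out["total_bathrooms"] = out["toilets"] + out["bathrooms"]
    # Size bins; the edges are arbitrary and would need tuning on real data.
    out["living_area_bin"] = pd.cut(
        out["living_area"],
        bins=[0, 75, 100, np.inf],
        labels=["small", "medium", "large"],
    )
    return out


# Hypothetical reference point, chosen only for illustration.
features = add_features(toy, center_lat=50.8503, center_lng=4.3517)
print(features)
```

Returning a copy keeps the original dataframe untouched, so each candidate feature can be added or dropped independently when comparing cross-validated scores.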
