![Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}


# Import data

In [1]:
import time
from pathlib import Path

import pandas as pd
from data import utils
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

**Goal**:
- Identify the basic pre-processing steps required before loading the data into a database

# Select Columns to Retain Based on the Quantity of Missing Values


In web scraping, the first hurdle is often the sheer volume of data. The question is less about what to collect than about what to retain. The Immoweb website hosts a large number of listings, each with its own set of attributes.

Some attributes are shared by most listings: location and price are almost always present. Others are rare, such as the number of swimming pools available. These rare details can matter when assessing the value of a particular listing, but keeping all of them produces a sparse dataset.

For now, our objective is to identify which features appear in most listings, using a pre-scraped dataset of around 1000 ads. Once we know which features are common, we can streamline the data collection by keeping those attributes and discarding the ones that are rarely filled in.
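As a minimal sketch of this selection step, the share of missing values per column can be computed and compared against a cutoff. The toy DataFrame, its values, and the `MAX_MISSING_SHARE` threshold below are assumptions made for illustration only; they are not taken from the scraped data or from the project's configuration.

```python
import pandas as pd

# Toy stand-in for the scraped listings (made-up values, illustration only).
toy = pd.DataFrame(
    {
        "Address": ["A", "B", "C", "D"],
        "Price": ["€ 1", "€ 2", None, "€ 4"],
        "Swimming pool": [None, None, None, "Yes"],
    }
)

# Hypothetical cutoff: keep columns with at most 50% missing values.
MAX_MISSING_SHARE = 0.5

# isna() marks missing cells; the column-wise mean is the missing share.
missing_share = toy.isna().mean().sort_values()

# Retain only the columns that are filled in often enough.
columns_to_keep = missing_share[missing_share <= MAX_MISSING_SHARE].index.tolist()

print(missing_share)
print(columns_to_keep)
```

The right cutoff depends on how the data will be used, so it is worth inspecting the full `missing_share` table before fixing a value.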

In [31]:
# Load the pre-scraped listings. If several files match, the last one read is kept.
for filename in utils.Configuration.RAW_DATA_PATH.glob("*.gzip"):
    if "data" in filename.stem:
        df = pd.read_parquet(filename)
df

Unnamed: 0,Address,Available as of,CO₂ emission,Covered parking spaces,Energy class,External reference,Flood zone type,Outdoor parking spaces,Planning permission obtained,Possible priority purchase right,...,Surroundings type,Furnished,Heating type,Bedroom 3 surface,Office,Latest land use designation,Street frontage width,"Gas, water & electricity",Price,Double glazing
0,Sint-Denijslaan 1 9000 - Gent,At delivery,Not specified,1,Not specified,5530472,Possible flood zone,1,Yes,No,...,,,,,,,,,,
1,Rue Simon 46/2 6990 - Hotton,Depending on the tenant,71 kg CO₂/m²,2,D,5530890,Property partially or completely located in a ...,,Yes,No,...,,,,,,,,,,
2,Sint-Denijslaan 1 9000 - Gent,At delivery,Not specified,,Not specified,5533819,Possible flood zone,,Yes,,...,,,,,,,,,,
3,Hoogstraat 20 9340 - Lede,After signing the deed,Not specified,,F,5523589,Non flood zone,,Yes,No,...,,,,,,,,,,
4,Heidestatiestraat 22 2920 - Kalmthout,,Not specified,,Not specified,95278 - NWB-23-SEM-0,Non flood zone,,Yes,No,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,Avenuedes Ardennes 7d 4500 - Huy,,Not specified,1,E,18115 - 111115428,,,,,...,,,Fuel oil,,,,,No,"€ 265,000 265000 €",Yes
87,Sint-Denijslaan 1 9000 - Gent,,Not specified,,Not specified,5528536,Non flood zone,,,No,...,,,,,,,,,,
88,Avenue Louise 523 1050 - Ixelles,,87 kg CO₂/m²,2,B,,,4,,,...,,,Gas,21 m² square meters,,,,Yes,"€ 1,495,000 1495000 €",Yes
89,Avenue des Tournesols 14 1640 - Rhode-St-Genèse,,37 kg CO₂/m²,,D,5531996,,,,,...,,,Gas,16 m² square meters,,,5 m,Yes,"€ 445,000 445000 €",Yes
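One thing the preview suggests is that some cells hold the placeholder string "Not specified" (for example in `CO₂ emission` and `Energy class`) rather than a true missing value, so they would not be counted as missing. The sketch below shows one possible way to treat such placeholders as missing before counting. The toy data and the `PLACEHOLDERS` list are assumptions for illustration; whether other placeholder strings occur in the real data is not known from the preview.

```python
import pandas as pd

# Toy stand-in mimicking two columns from the preview (made-up rows).
toy = pd.DataFrame(
    {
        "Energy class": ["Not specified", "D", "Not specified", "F"],
        "CO₂ emission": ["Not specified", "71 kg CO₂/m²", "Not specified", "Not specified"],
    }
)

# Assumed list of strings that should count as missing.
PLACEHOLDERS = ["Not specified"]

# mask() replaces cells where the condition is True with NaN.
cleaned = toy.mask(toy.isin(PLACEHOLDERS))

# Missing share per column before and after the replacement.
before = toy.isna().mean()
after = cleaned.isna().mean()

print(before)
print(after)
```

In this toy example no cell counts as missing before the replacement, while afterwards half of `Energy class` and three quarters of `CO₂ emission` do.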


In [38]:
# Print each column name with the shape of its array of unique values (NaN counts as a value).
for column in df:
    print(column, df[column].unique().shape)

Address (49,)
Available as of (5,)
CO₂ emission (32,)
Covered parking spaces (4,)
Energy class (8,)
External reference (85,)
Flood zone type (4,)
Outdoor parking spaces (6,)
Planning permission obtained (2,)
Possible priority purchase right (3,)
Primary energy consumption (67,)
Proceedings for breach of planning regulations (2,)
Reference number of the EPC report (60,)
Subdivision permit (3,)
Tenement building (2,)
Website (34,)
Yearly theoretical total energy consumption (20,)
ad_url (91,)
day_of_retrieval (91,)
Kitchen type (7,)
Width of the lot on the street (11,)
Kitchen surface (19,)
Dining room (2,)
Living area (43,)
Bedroom 1 surface (22,)
Basement (1,)
Connection to sewer network (2,)
As built plan (2,)
TV cable (2,)
Cadastral income (40,)
Bedrooms (8,)
Toilets (6,)
Building condition (6,)
Surface of the plot (50,)
Bedroom 2 surface (11,)
Garden surface (1,)
Construction year (42,)
Number of frontages (4,)
Bathrooms (5,)
Living room surface (26,)
Surroundings type (5,)
Furnishe