![Photo by Stephen Phillips - Hostreviews.co.uk on UnSplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}


# Import data

In [1]:
import time
from pathlib import Path

import pandas as pd
from data import utils
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

**Objective**:

- Select Columns to Retain Based on the Quantity of Missing Values
- Identify Fundamental Data Preprocessing Procedures Post-Scraping

::: {.callout-tip title="How to import your own module using a .pth file"}

Following [this SO question](https://stackoverflow.com/questions/700375/how-to-add-a-python-import-path-using-a-pth-file), I created a file named `sth.pth` containing the path `C:\Users\s0212777\OneDrive - Universiteit Antwerpen\Jupyter_projects\Articles\house_price_prediction\src` and placed it in `C:\Users\s0212777\AppData\Roaming\Python\Python310\site-packages`. Because this site-packages folder is already on Python's module search path, Python reads the `.pth` file at startup and adds my package directory to `sys.path` (check with `import sys; print(sys.path)`).

So now utils can be imported as `from data import utils`.

:::
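The mechanism described in the tip can be sketched in a few lines. This is a minimal illustration and not the exact steps used above: the helper `write_pth_file` and the file name `my_project.pth` are hypothetical, and the example works in temporary directories rather than the real site-packages folder.

```python
import site
import sys
import tempfile
from pathlib import Path


def write_pth_file(site_dir: Path, src_dir: Path, name: str = "my_project.pth") -> Path:
    """Write a .pth file into `site_dir` that points at `src_dir` (hypothetical helper)."""
    pth_file = site_dir / name
    # A .pth file is plain text: each line naming an existing directory is added to sys.path
    pth_file.write_text(f"{src_dir}\n", encoding="utf-8")
    return pth_file


# Temporary stand-ins for the real site-packages folder and the project's src folder
site_dir = Path(tempfile.mkdtemp())
src_dir = Path(tempfile.mkdtemp())

pth_file = write_pth_file(site_dir, src_dir)

# Python processes .pth files in site directories at startup;
# site.addsitedir() triggers the same processing for this directory right away
site.addsitedir(str(site_dir))

print(str(src_dir) in sys.path)
```

In a real setup, `site.getusersitepackages()` returns the per-user site-packages folder, which is one place where such a file can live.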

# Select Columns to Retain Based on the Quantity of Missing Values


When scraping a website, the first hurdle is often the sheer volume of data. The question is less what to collect than what to keep. The Immoweb website contains a large number of listings, each with its own set of attributes.

Some details, such as location and price, appear in nearly every listing. Others, such as the number of swimming pools, appear in only a few. These rare details can matter when assessing the value of particular listings, but keeping all of them produces a sparse dataset.

Our objective here is to identify which features are common across listings, using a pre-scraped dataset of around 1000 ads. Once we know which features are common, we can streamline data collection by retaining them and discarding the rarely populated ones.
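To make the sparsity problem concrete, here is a small sketch with three invented listings. The feature names and values are hypothetical and only illustrate how stacking ads with different attributes fills the table with missing values.

```python
import pandas as pd

# Hypothetical listings: feature names and values are invented for illustration
listing_a = pd.DataFrame({"Price": [250000], "Address": ["Street 1"], "Swimming pool": ["Yes"]})
listing_b = pd.DataFrame({"Price": [180000], "Address": ["Street 2"]})
listing_c = pd.DataFrame({"Price": [320000], "Address": ["Street 3"], "Office": [1]})

# Stacking listings aligns on column names; features absent from an ad become NaN
combined = pd.concat([listing_a, listing_b, listing_c], ignore_index=True)

# Share of listings (in %) for which each feature is present
pct_present = combined.notna().mean().mul(100).round(1)
print(pct_present)
```

Features shared by every ad are fully populated, while a feature that appears in a single ad is present in only a third of the rows.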

In [5]:
def read_in_pilot_data():
    """
    Reads and preprocesses data from multiple CSV files in a directory.

    This function reads CSV files located in a directory specified by the 'RAW_DATA_PATH'
    variable in the 'utils.Configuration' module. It loads each CSV file, transposes it,
    and then concatenates all dataframes into one, removing rows with all missing values.

    Returns:
        pandas.DataFrame: A consolidated dataframe with missing rows removed.

    Example:
        To load and preprocess data, call this function as follows:
        >>> pilot_data = read_in_pilot_data()
    """

    csv_paths = utils.Configuration.RAW_DATA_PATH.joinpath("for_sale").glob("*.csv")

    # Read each CSV file and transpose it so that the feature names become the index
    list_of_dfs = [pd.read_csv(csv_path).T for csv_path in csv_paths]

    # Concatenate side by side (aligning on feature names), transpose back so that
    # each row is a listing, and drop rows in which every value is missing
    dfs = (
        pd.concat(list_of_dfs, axis=1)
        .transpose()
        .dropna(axis=0, how="all")
    )

    return dfs


# Call the function to read and preprocess the data
df = read_in_pilot_data()

df

| Unnamed: 0 | Accessible for disabled people | Address | Available as of | Bathrooms | Bedrooms | Building condition | CO₂ emission | Connection to sewer network | Construction year | Covered parking spaces | ... | TB1.A.c.4.1 | TB1.A.c.4.2 | TB1.A.c.4.3 | TB1.A.c.5.1 | TB1.A.c.5.2 | TB1.A.c.5.3 | TB1.A.c.6.1 | TB1.A.c.6.2 | How many fireplaces? | Current monthly revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | Grotestraat 28 9500 - Geraardsbergen | After signing the deed | 2 | 4 | Good | Not specified | Connected | 1967.0 | 1.0 | ... | | | | | | | | | | |
| 1 | Yes | Gistelsesteenweg 291 8200 - Sint Andries | Immediately | 1 | 5 | To renovate | Not specified | Connected | 1954.0 | 1.0 | ... | | | | | | | | | | |
| 2 | No | Dokter Honore Dewolfstraat 23 9700 - Oudenaarde | | 2 | 3 | As new | 9714 kg CO₂/m² | Connected | 1982.0 | 1.0 | ... | | | | | | | | | | |
| 3 | | Parklaan 187 9300 - Aalst | Immediately | 2 | 3 | Just renovated | Not specified | | | | ... | | | | | | | | | | |
| 4 | Yes | Grotestraat 28 9500 - Geraardsbergen | To be defined | 1 | 2 | To be done up | Not specified | | | 1.0 | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25 | | Ieperstraat 35 8970 - Poperinge | Immediately | 1.0 | 4 | To renovate | Not specified | | | | ... | | | | | | | | | | |
| 26 | No | Brugsestraat 1 8020 - Oostkamp | | 1.0 | 3 | To be done up | Not specified | Connected | 1974.0 | | ... | | | | | | | | | | |
| 27 | | Alphonse Claeys-Bouüaertlaan 57/001 9030 - Ma... | | 1.0 | 5 | To be done up | Not specified | Connected | 1899.0 | | ... | | | | | | | | | | |
| 28 | | Place Paul Heupgen 9/4.1 7000 - Mons | | 2.0 | 5 | Good | Not specified | Connected | | | ... | | | | | | | | | | |


In [7]:
# | fig-cap: "Top 50 Features with Non-Missing Values Above 50%"
# | label: fig-fig1


# Percentage of non-missing values per column, keeping the 50 most complete columns
lowest_missing_value_columns = (
    df.notna()
    .sum()
    .div(df.shape[0])
    .mul(100)
    .sort_values(ascending=False)
    .head(50)
    .round(1)
)
indexes_to_keep = lowest_missing_value_columns.index

(
    lowest_missing_value_columns.reset_index()
    .rename(columns={"index": "column", 0: "perc_values_present"})
    .assign(
        Has_non_missing_values_above_50_pct=lambda df: df.perc_values_present.gt(50),
        perc_values_present=lambda df: df.perc_values_present - 50,
    )
    .pipe(
        lambda df: ggplot(
            df,
            aes(
                "perc_values_present",
                "column",
                fill="Has_non_missing_values_above_50_pct",
            ),
        )
        + geom_bar(stat="identity", orientation="y", show_legend=False)
        + ggsize(800, 1000)
        + labs(
            title="Top 50 Features with Non-Missing Values Above 50%",
            subtitle="""The plot illustrates that the features 'day of retrieval,' 'url,' and 'reference number' in EPC records 
            exhibited the highest completeness, with over 90% of instances present. Conversely, 'dining room,' 
            'as built plan,' and 'office' were among the least populated features, with approximately 10% of 
            non-missing instances.
            """,
            x="Percentage of Instances Present with Reference Point at 50%",
            y="",
            caption="https://www.immoweb.be/",
        )
        + theme(
            plot_subtitle=element_text(
                size=12, face="italic"
            ),  # Customize subtitle appearance
            plot_title=element_text(size=15, face="bold"),  # Customize title appearance
        )
    )
)
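
Once the completeness of each column is known, selecting the columns to retain is a short filtering step. The sketch below uses invented toy data and a 50% cutoff as assumptions; it shows one possible way to apply such a threshold, not necessarily the selection used later in this project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped data (values are invented for illustration)
toy = pd.DataFrame(
    {
        "Price": [250000, 180000, 320000, 275000],
        "Bedrooms": [3, 2, np.nan, 4],
        "Swimming pool": [np.nan, np.nan, "Yes", np.nan],
    }
)

# Percentage of non-missing values per column
pct_present = toy.notna().mean().mul(100)

# Keep only the columns that are present in more than 50% of the rows
columns_to_keep = pct_present[pct_present.gt(50)].index
reduced = toy.loc[:, columns_to_keep]

print(list(reduced.columns))
```

The same pattern works with any cutoff: a stricter threshold gives a denser but narrower dataset, while a looser one keeps rarer features at the cost of more missing values.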