![Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash](https://cf.bstatic.com/xdata/images/hotel/max1024x768/408003083.jpg?k=c49b5c4a2346b3ab002b9d1b22dbfb596cee523b53abef2550d0c92d0faf2d8b&o=&hp=1){fig-align="center" width=50%}


# Import data

In [1]:
import time
from pathlib import Path

import pandas as pd
from data import utils
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

::: {.callout-tip title="How to import your own module using a .pth file"}
Based on [this SO question](https://stackoverflow.com/questions/700375/how-to-add-a-python-import-path-using-a-pth-file), I added a file named `sth.pth` containing the line `C:\Users\s0212777\OneDrive - Universiteit Antwerpen\Jupyter_projects\Articles\house_price_prediction\src` to the folder `C:\Users\s0212777\AppData\Roaming\Python\Python310\site-packages`. This folder is already on Python's module search path, so Python reads the `.pth` file at startup and can see my package directory (check with `import sys; print(sys.path)`).

`utils` can now be imported with `from data import utils`.

:::
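The mechanism in the tip above can be tried without touching a real installation. The sketch below is only an illustration: the temporary folders, the `demo_pkg` package and the `demo.pth` file are hypothetical stand-ins for the `src` folder, the `data` package and `sth.pth`. It uses `site.addsitedir`, which processes `.pth` files in a folder much like Python does at startup for the real site-packages directories.

```python
import importlib
import site
import sys
import tempfile
from pathlib import Path

# Throwaway folders standing in for the real locations (hypothetical names).
root = Path(tempfile.mkdtemp())
src_dir = root / "src"             # plays the role of .../house_price_prediction/src
site_dir = root / "site-packages"  # plays the role of the user site-packages folder
(src_dir / "demo_pkg").mkdir(parents=True)
site_dir.mkdir()

# A tiny package with one module, similar in spirit to data/utils.py.
(src_dir / "demo_pkg" / "__init__.py").write_text("")
(src_dir / "demo_pkg" / "utils.py").write_text("ANSWER = 42\n")

# A .pth file lists one directory per line; directories that exist are added to sys.path.
(site_dir / "demo.pth").write_text(f"{src_dir}\n")

# Process the .pth files found in site_dir.
site.addsitedir(str(site_dir))
importlib.invalidate_caches()

from demo_pkg import utils as demo_utils

print(str(src_dir) in sys.path)
print(demo_utils.ANSWER)
```

In a real setup there is no need to call `site.addsitedir` yourself: dropping the `.pth` file into an existing site-packages folder is enough, because Python scans those folders on every start.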

# Determine which columns to keep based on the number of missing values

In [6]:
# I pre-scraped some ads to see which columns are most likely to be missing

csv_paths = utils.Configuration.RAW_DATA_PATH.joinpath("for_sale").glob("*.csv")

# Read every scraped CSV and transpose it
csvs = [pd.read_csv(csv_path).T for csv_path in csv_paths]
len(csvs)

33

In [7]:
# The dataframes were transposed so that their column names became the index (helps with concat);
# after concatenating we transpose back so that each row is one ad again
dfs = pd.concat(csvs, axis=1).T

In [8]:
dfs.shape

(1024, 434)
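To see why the combined frame ends up with so many columns, here is a minimal sketch on made-up data (the two `batch_*` frames and their values are invented for illustration). It applies the same transpose, concatenate, transpose-back pattern: columns that exist in only one batch are kept and filled with `NaN` for the other rows.

```python
import pandas as pd

# Two made-up "scraped" batches whose columns only partly overlap.
batch_a = pd.DataFrame({"Price": [250000, 310000], "Bedrooms": [3, 4]})
batch_b = pd.DataFrame({"Price": [199000], "Living area": [120]})

# Same pattern as above: transpose, concatenate side by side, transpose back.
combined = pd.concat([batch_a.T, batch_b.T], axis=1).T

print(combined.shape)
print(combined.isna().sum())
```

The result has one column for every column name seen in any batch, which is how a few hundred sparse columns can accumulate. The row labels of each batch are kept as they were, so they repeat across batches.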

In [9]:
# Preview the rows that are not entirely empty (the result is not assigned, so dfs is unchanged)
dfs.dropna(axis=0, how="all")

Unnamed: 0,Accessible for disabled people,Address,Available as of,Bathrooms,Bedrooms,Building condition,CO₂ emission,Connection to sewer network,Construction year,Covered parking spaces,...,TB1.A.c.4.1,TB1.A.c.4.2,TB1.A.c.4.3,TB1.A.c.5.1,TB1.A.c.5.2,TB1.A.c.5.3,TB1.A.c.6.1,TB1.A.c.6.2,How many fireplaces?,Current monthly revenue
0,Yes,Grotestraat 28 9500 - Geraardsbergen,After signing the deed,2,4,Good,Not specified,Connected,1967.0,1.0,...,,,,,,,,,,
1,Yes,Gistelsesteenweg 291 8200 - Sint Andries,Immediately,1,5,To renovate,Not specified,Connected,1954.0,1.0,...,,,,,,,,,,
2,No,Dokter Honore Dewolfstraat 23 9700 - Oudenaarde,,2,3,As new,9714 kg CO₂/m²,Connected,1982.0,1.0,...,,,,,,,,,,
3,,Parklaan 187 9300 - Aalst,Immediately,2,3,Just renovated,Not specified,,,,...,,,,,,,,,,
4,Yes,Grotestraat 28 9500 - Geraardsbergen,To be defined,1,2,To be done up,Not specified,,,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,,Ieperstraat 35 8970 - Poperinge,Immediately,1.0,4,To renovate,Not specified,,,,...,,,,,,,,,,
26,No,Brugsestraat 1 8020 - Oostkamp,,1.0,3,To be done up,Not specified,Connected,1974.0,,...,,,,,,,,,,
27,,Alphonse Claeys-Bouüaertlaan 57/001 9030 - Ma...,,1.0,5,To be done up,Not specified,Connected,1899.0,,...,,,,,,,,,,
28,,Place Paul Heupgen 9/4.1 7000 - Mons,,2.0,5,Good,Not specified,Connected,,,...,,,,,,,,,,
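The idea of ranking columns by how often they are filled in can be tried on a toy frame first. The sketch below uses invented data, and the 60% cut-off is an arbitrary assumption, not a value used in this article. `notna().mean()` gives the same share as `notna().sum()` divided by the number of rows.

```python
import numpy as np
import pandas as pd

# Made-up listing data with some missing values.
toy = pd.DataFrame(
    {
        "Price": [250000, 310000, np.nan, 199000],
        "Bedrooms": [3, np.nan, np.nan, 2],
        "Address": ["A", "B", "C", "D"],
    }
)

# Percentage of non-missing values per column, highest first.
fill_rate = toy.notna().mean().mul(100).sort_values(ascending=False).round(1)
print(fill_rate)

# Keep only columns filled in for at least 60% of the rows (arbitrary cut-off).
columns_to_keep = fill_rate[fill_rate >= 60].index.tolist()
print(columns_to_keep)
```

A threshold and a fixed top-N (such as `.head(50)`) are two ways of making the same decision; the top-N variant guarantees a fixed number of columns, while the threshold guarantees a minimum fill rate.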


In [10]:
# Get the 50 columns with the highest percentage of non-missing values
lowest_missing_value_columns = (
    dfs.notna()
    .sum()
    .div(dfs.shape[0])
    .mul(100)
    .sort_values(ascending=False)
    .head(50)
    .round(1)
)
print(lowest_missing_value_columns)
indexes_to_keep = lowest_missing_value_columns.index

day_of_retrieval                                  98.0
ad_url                                            98.0
Reference number of the EPC report                94.8
Energy class                                      94.8
Primary energy consumption                        94.8
Yearly theoretical total energy consumption       94.8
CO₂ emission                                      94.8
Tenement building                                 93.7
Address                                           89.1
Bedrooms                                          86.8
Living area                                       80.7
Bathrooms                                         78.9
Surface of the plot                               76.9
Price                                             75.3
Building condition                                71.4
Double glazing                                    70.3
Number of frontages                               64.5
Website                                           59.8
Toilets   