# 1. Data preprocessing

Some notes about the cleaned dataset in **data/properties.csv**:

- There are about 76,000 properties, split roughly equally between houses and apartments
- Each property has a unique identifier **id**
- The target variable is **price**
- Variables prefixed with **fl_** are dummy variables (1/0)
- Variables suffixed with **_sqm** indicate the measurement is in square meters
- Missing values in categorical variables are encoded as the string **MISSING**
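
The conventions above can be checked programmatically. The following is a minimal sketch on a small toy frame, not on the real file; the column names `fl_garden`, `garden_sqm` and `equipped_kitchen` are assumptions chosen only to illustrate the prefix, suffix and **MISSING** conventions.

```python
import pandas as pd

# Toy frame mimicking the naming conventions (column names are hypothetical)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [250000.0, 180000.0, 320000.0],
    "fl_garden": [1, 0, 1],
    "garden_sqm": [50.0, 0.0, 120.0],
    "equipped_kitchen": ["INSTALLED", "MISSING", "MISSING"],
})

# Select columns by naming convention
flag_cols = [c for c in toy.columns if c.startswith("fl_")]
sqm_cols = [c for c in toy.columns if c.endswith("_sqm")]

# The identifier should be unique, and dummy variables should only hold 0 or 1
id_is_unique = toy["id"].is_unique
flags_are_binary = bool(toy[flag_cols].isin([0, 1]).all().all())

# Share of rows in a categorical column that carry the "MISSING" label
missing_share = toy["equipped_kitchen"].eq("MISSING").mean()

print(flag_cols, sqm_cols, id_is_unique, flags_are_binary, round(missing_share, 2))
```

The same checks should carry over to the real data once the toy frame is replaced by the loaded DataFrame and the actual column names.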

## Preparation of the dataset for machine learning

- Handling NaNs (hint: **imputation**)
- Converting categorical data into numeric features (hint: **one-hot encoding**)
- Rescaling numeric features (hint: **standardization**)
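
As a rough illustration of these three steps, here is a minimal sketch using plain pandas on a toy frame. The column `property_type` and its values are assumptions made for the example, and median imputation is just one possible choice.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric column with a NaN, one categorical column (hypothetical values)
toy = pd.DataFrame({
    "total_area_sqm": [100.0, np.nan, 80.0, 140.0],
    "property_type": ["HOUSE", "APARTMENT", "HOUSE", "APARTMENT"],
})

# 1. Imputation: fill missing numeric values with the column median
toy["total_area_sqm"] = toy["total_area_sqm"].fillna(toy["total_area_sqm"].median())

# 2. One-hot encoding: one 0/1 column per category
encoded = pd.get_dummies(toy, columns=["property_type"], dtype=int)

# 3. Standardization: subtract the mean and divide by the (population) standard deviation
area = encoded["total_area_sqm"]
encoded["total_area_sqm"] = (area - area.mean()) / area.std(ddof=0)

print(encoded)
```

When a train/test split is involved, the imputation values and the scaling statistics would normally be computed on the training set only and then reused on the test set.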

## Exploring the dataset


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import csv

In [None]:
# Read the csv file
df = pd.read_csv("../data/properties.csv")

In [None]:
# Display the head
df.head()

In [None]:
print(f"There are {len(df)} rows of data")

In [None]:
# (rows,columns)
df.shape

In [None]:
# Inspect the row index
df.index

In [None]:
# Describe df columns
df.columns

In [None]:
# Info on df
df.info()

In [None]:
# Number of non-NA values
df.count()

In [None]:
# Descriptive statistics
df.describe()

In [None]:
# Descriptive statistics for all columns, numeric and non-numeric (categorical)
df.describe(include="all").T  # Transpose the data frame so that it fits in a cell

In [None]:
# Count the missing (NaN or null) values per column
# and sort the counts in descending order
df.isna().sum().sort_values(ascending=False)

In [None]:
# Check for fully duplicated rows (False means there are none)
df.duplicated().any()

In [None]:
# Drop the "id" column (a unique identifier, not a feature)
df_drop_id = df.drop(columns=["id"])

## Cleaning the data

In [None]:
# Replace the "MISSING" labels with NaN (returns a new DataFrame, df is unchanged)
df_missing = df.replace("MISSING", np.nan)
display(df_missing)


In [None]:
df_missing.isna().sum().sort_values(ascending=False)


In [None]:
df_missing.describe(include="all").T

In [None]:
df_missing.dtypes

## Handling NaNs with imputation

Mean/median imputation: each missing value in a numeric column is replaced with that column's mean or median.

In [None]:
# Specify columns to impute
impute_columns = ["total_area_sqm", "surface_land_sqm", "nbr_frontages", "nbr_bedrooms", "terrace_sqm", "garden_sqm", "primary_energy_consumption_sqm", "cadastral_income"]

# Select numerical data
impute_df = df[impute_columns]

In [None]:
# Show only the float columns
df.select_dtypes(include=float)

In [None]:
# Mean imputation: fill each NaN with the mean of its column
mean_values = df[impute_columns].mean()
mean_imputation = df[impute_columns].fillna(mean_values)
mean_imputation.head()

In [None]:
# Median imputation: fill each NaN with the median of its column
median_values = df[impute_columns].median()
median_imputation = df[impute_columns].fillna(median_values)
median_imputation.head()
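
To see how the two strategies can differ, here is a small sketch on made-up numbers (not taken from the dataset) where one extreme value pulls the mean far away from the typical value, while the median stays close to it.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry and one outlier
s = pd.Series([50.0, 60.0, 70.0, np.nan, 2000.0], name="garden_sqm")

# Fill the missing entry with the mean and with the median of the observed values
mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

print("mean fill:  ", mean_filled[3])
print("median fill:", median_filled[3])
```

With skewed columns, which area and income variables often are, the median is usually the more conservative fill value; whether that holds here is worth checking against the output of `df.describe()`.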