# 1️⃣ Data Preparation 
**designed by:** [datamover.ai](https://www.datamover.ai)

In [1]:
# standard library (used to build paths and create the data folder)
import os

# import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# make script reproducible
np.random.seed(42)

**1. Fetch dataset**

Download the dataset from this [url](https://www.kaggle.com/datasets/thomasnibb/amsterdam-house-price-prediction) and load it into a `pd.DataFrame`.

See this [article](https://www.datamover.ai/post/the-right-way-to-set-absolute-path-in-python) to learn how to load a dataset in an OS-agnostic way.
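
One way to build a path that works on any operating system is `pathlib` from the standard library. The snippet below is a minimal sketch and not necessarily the approach of the linked article; the helper name `build_data_path` is hypothetical, and it assumes the `data` folder sits in the current working directory.

```python
from pathlib import Path


def build_data_path(base_dir, data_dir, filename):
    """Join path components using the separator of the current OS."""
    return Path(base_dir) / data_dir / filename


# Assumption: the "data" folder lives in the directory the notebook is started from
csv_path = build_data_path(
    Path.cwd(), "data", "HousingPrices-Amsterdam-August-2021.csv"
)
print(csv_path.name)  # file name only
print(csv_path.exists())  # True only if the dataset has been downloaded there
```

The resulting `Path` object can be passed directly to `pd.read_csv`.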

In [7]:
DATA_DIR = "data"
FILENAME = "HousingPrices-Amsterdam-August-2021.csv"

data = pd.read_csv(
    os.path.join(DATA_DIR, FILENAME), 
    index_col=0,
)
data.head()

Unnamed: 0,Address,Zip,Price,Area,Room,Lon,Lat
1,"Blasiusstraat 8 2, Amsterdam",1091 CR,685000.0,64,3,4.907736,52.356157
2,"Kromme Leimuidenstraat 13 H, Amsterdam",1059 EL,475000.0,60,3,4.850476,52.348586
3,"Zaaiersweg 11 A, Amsterdam",1097 SM,850000.0,109,4,4.944774,52.343782
4,"Tenerifestraat 40, Amsterdam",1060 TH,580000.0,128,6,4.789928,52.343712
5,"Winterjanpad 21, Amsterdam",1036 KN,720000.0,138,5,4.902503,52.410538


**2. Check the size of the dataset and, when dealing with big datasets, make sure your workspace has enough storage.**

In [8]:
size_b = data.memory_usage(deep=True).sum()  # size in bytes
size_mb = size_b / (1024 * 1024)  # convert bytes to MB
print(f"Size data: {size_mb:.2f} MB")

# Note: deep=True inspects object dtypes (e.g. strings) and includes their actual memory consumption in the returned values, instead of counting only the references to them.

Size data: 0.18 MB
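
`memory_usage` reports the size of the DataFrame in memory, not the free space in the workspace. For the storage side of this step, a rough check could use `shutil.disk_usage` from the standard library. This is a sketch under assumptions: the helper names `free_space_mb` and `fits_on_disk` and the safety factor of 2 are hypothetical choices, not part of the original workflow.

```python
import shutil


def free_space_mb(path="."):
    """Return the free disk space, in MB, of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / (1024 * 1024)


def fits_on_disk(required_mb, path=".", safety_factor=2.0):
    """Check that `required_mb` times a safety margin fits in the free space."""
    return required_mb * safety_factor <= free_space_mb(path)


print(f"Free space: {free_space_mb():.0f} MB")
print(f"Enough room for a 0.18 MB dataset: {fits_on_disk(0.18)}")
```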


**3. Check type of data (time series, sample, geographical, etc.) and make sure they are what they should be.**

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 924 entries, 1 to 924
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Address  924 non-null    object 
 1   Zip      924 non-null    object 
 2   Price    920 non-null    float64
 3   Area     924 non-null    int64  
 4   Room     924 non-null    int64  
 5   Lon      924 non-null    float64
 6   Lat      924 non-null    float64
dtypes: float64(3), int64(2), object(2)
memory usage: 57.8+ KB
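
The output above shows 920 non-null values for `Price` out of 924 entries, so the target is missing in 4 rows. A per-column summary can make such gaps easier to spot. The sketch below runs on toy data invented for illustration (not taken from the Amsterdam dataset), and the helper name `summarize_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Return dtype, number of missing values, and missing share per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": df.isna().mean() * 100,
        }
    )


# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame(
    {
        "Price": [500000.0, np.nan, 750000.0, 620000.0],
        "Area": [64, 60, 109, 128],
        "Zip": ["1000 AA", "1000 AB", "1000 AC", "1000 AD"],
    }
)
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(data)` on the loaded dataset would give the same kind of table for the real columns.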


**4. If necessary, convert the data to a format that is easy to manipulate (e.g. `.csv`, `.json`) without changing the data itself.**

In this case the dataset is already in a format that is easy to manipulate, i.e., `.csv`.
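
When a conversion is needed, pandas can usually read one format and write another without touching the values. The following is a minimal sketch that converts JSON records to CSV; the records are toy values invented for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Toy records for illustration only (hypothetical values)
json_text = '[{"Area": 64, "Room": 3}, {"Area": 60, "Room": 3}]'

# Read the JSON records into a DataFrame
records = pd.read_json(io.StringIO(json_text), orient="records")

# Serialise to CSV text; passing a file path instead would write to disk
csv_text = records.to_csv(index=False)
print(csv_text)
```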

**5. For training of ML models, sample a hold-out set, put it aside, and never look at it ⚠️.**

- typical train/test splits are `60/40`, `70/30`, `80/20`;
- it is convenient to store train and test data separately;
- **Note:** the terms test set and hold-out set are often used interchangeably.

<ins>For this project, aim for an 80/20 train/test split ratio.</ins>
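
As a rough illustration of what setting aside a hold-out set means, the sketch below splits a toy DataFrame with plain pandas. It is an alternative shown for clarity, not the method used in this notebook, and its row assignment will differ from that of `train_test_split`. The helper name `holdout_split` is hypothetical, and the approach assumes the index values are unique.

```python
import pandas as pd


def holdout_split(df, test_fraction=0.20, seed=42):
    """Randomly set aside `test_fraction` of the rows as a hold-out set."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(index=test.index)  # everything not sampled stays in train
    return train, test


# Toy frame for illustration only: 10 rows with a unique index
toy = pd.DataFrame({"Area": range(10), "Price": range(100, 110)})
toy_train, toy_test = holdout_split(toy)
print(len(toy_train), len(toy_test))
```

Because the two parts are returned as complete DataFrames, features and target stay together and can be stored separately right away.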

In [12]:
TARGET = "Price"  # get target name

# split data in train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=[TARGET]), data[TARGET], test_size=0.20, random_state=42
)

# re-merge X,y for both train and test
data_train = pd.merge(left=y_train, right=X_train, left_index=True, right_index=True)
data_test = pd.merge(left=y_test, right=X_test, left_index=True, right_index=True)

# double check sample size (":.2%" multiplies the fraction by 100 and appends "%")
print(f"# sample train set: {data_train.shape[0]} ({data_train.shape[0]/len(data):.2%})")
print(f"# sample test set: {data_test.shape[0]} ({data_test.shape[0]/len(data):.2%})")

# sample train set: 739 (79.98%)
# sample test set: 185 (20.02%)


**⬇️ Store train and test locally**
- store both datasets in `csv` format;
- save the train and test sets as `data_train.csv` and `data_test.csv`, respectively;
- in both datasets, make sure to retain the column names and discard the index, as it is not informative.

In [13]:
# create the data folder only if it does not exist yet
os.makedirs("data", exist_ok=True)

# save data
data_train.to_csv(
    path_or_buf="./data/data_train.csv",
    header=True,  # Write out the column names
    index=False,  # discard index as it is not informative
)

data_test.to_csv(
    path_or_buf="./data/data_test.csv",
    header=True,  # Write out the column names
    index=False,  # discard index as it is not informative
)
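
To confirm that a file written this way keeps the column names and drops the index, it can be read back and inspected. The sketch below does this on a toy DataFrame in a temporary directory, so it does not touch the real `data` folder; the values and the file name `toy.csv` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only (hypothetical values, non-default index)
toy = pd.DataFrame({"Price": [500000.0, 620000.0], "Area": [64, 128]}, index=[7, 3])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(path, header=True, index=False)  # same options as in the cell above
    reloaded = pd.read_csv(path)

# Column names survive; the original index (7, 3) is replaced by a default one
print(list(reloaded.columns))
print(list(reloaded.index))
```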