## 2.2 Data Preprocessing

### 2.2.1 Reading the Dataset

In [None]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
  f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

In [None]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

### 2.2.2 Data Preparation

In [None]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)

In [None]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

### 2.2.3 Conversion to the Tensor Format

In [None]:
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

## Discussion & Exercises 2.2

2.2.1

  - Reading a CSV File
    - Use pandas to read CSV files, which are commonly used to store tabular data.
```
    import pandas as pd
    data = pd.read_csv('path_to_file.csv')
    data
```
    - This will load the data into a pandas DataFrame, with each column representing a field and each row representing a record.
  - Handling Missing Data
    - In the CSV file, missing entries are written as `NA`. When pandas reads the file, it represents them as NaN (Not a Number). The following steps handle these values with pandas functions.
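
To see how many entries are missing in each column, a quick check like the one below can help. This is a minimal sketch: it reads the same rows from an in-memory string (`csv_text` is a stand-in introduced here, not part of the notebook) so that it runs without the file on disk.

```python
import io

import pandas as pd

# Stand-in for house_tiny.csv, held in memory so no file is required.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# pandas parses the string "NA" as NaN by default,
# so isna() flags those cells and sum() counts them per column.
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Here `NumRooms` has two missing entries, `RoofType` has three, and `Price` has none.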

2.2.2

1. Separating Input and Target Values
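  - In supervised learning, separate the input columns from the target column with integer-location indexing (`iloc`). Here the first two columns are the inputs and the last column, the price, is the target.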

2. Handling Missing Values
  - Categorical Data: Use `pd.get_dummies` with `dummy_na=True` to one-hot encode categorical columns. Missing values are treated as their own category and get a separate indicator column.

  - Numerical Data: Replace missing numerical values (NaN) with the mean value of the column.
  
  `inputs = inputs.fillna(inputs.mean())`

3. Imputation Techniques
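  - Categorical Values: Treat NaN as a category of its own, so each category, including "missing", becomes a binary indicator column.
  - Numerical Values: Use mean imputation, filling each missing entry with the mean of the observed values in its column.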
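
The three steps above can be sketched end to end as follows. This is an illustrative sketch rather than the notebook's exact cells: the data is read from an in-memory string, and the names `csv_text`, `encoded`, and `imputed` are introduced here for clarity.

```python
import io

import pandas as pd

# Stand-in for house_tiny.csv, held in memory so no file is required.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""

data = pd.read_csv(io.StringIO(csv_text))

# Step 1: the first two columns are inputs, the last column is the target.
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]

# Step 2: one-hot encode the categorical column; dummy_na=True adds
# an extra indicator column for rows where RoofType is missing.
encoded = pd.get_dummies(inputs, dummy_na=True)

# Step 3: mean imputation for the remaining NaNs (here only in NumRooms).
# The observed values are 2 and 4, so missing entries become 3.0.
imputed = encoded.fillna(encoded.mean())
print(imputed)
```

After these steps, `imputed` contains no missing values and every column is numeric or boolean, which is what the tensor conversion in 2.2.3 requires.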

