## Creating the Dataset (CSV)

In [2]:
import os

# make directory at '../data/'
os.makedirs(os.path.join('..', 'data'), exist_ok=True)

# path of the csv file to create: '../data/house_tiny.csv'
data_file = os.path.join('..', 'data', 'house_tiny.csv')

# write lines to the created csv file.
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

## Reading the CSV with pandas

In [3]:
import pandas as pd

# read csv with pandas
data = pd.read_csv(data_file)

print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


## Data Preparation
- Your objective is to infer `Price` from `NumRooms` and `RoofType`.
    - `data.iloc[:, i]` selects the i-th column. `iloc` stands for 'integer location'.
    - With this indexer, you can separate the `inputs` from the `targets`.

- But there are `NaN` (Not a Number) values in `NumRooms` and `RoofType`. How should we handle them?
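
As a rough sketch of the split described above, the snippet below rebuilds the same four rows in memory and counts the missing entries per input column. The name `df` and the variable `missing` are illustrative stand-ins, not part of the notebook.

```python
import pandas as pd

# Stand-in for the `data` DataFrame (rebuilt here so the sketch runs on its own).
nan = float("nan")
df = pd.DataFrame({
    "NumRooms": [nan, 2.0, 4.0, nan],
    "RoofType": [nan, nan, "Slate", nan],
    "Price": [127500, 106000, 178100, 140000],
})

# iloc selects by integer position: all rows, columns 0 and 1 -> inputs.
inputs = df.iloc[:, 0:2]
# All rows, column 2 only -> targets (a Series).
targets = df.iloc[:, 2]

# Count the missing entries in each input column.
missing = inputs.isna().sum()
print(missing)
```

`isna().sum()` is one simple way to see how many values in each column need imputing before going further.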

### Imputation Heuristics
1. For categorical input fields
    - We can treat `NaN` as a category.
    - `RoofType` contains the values `Slate` and `NaN`.
    - We can treat `NaN` as a category of its own by converting `RoofType` into two columns, `RoofType_Slate` and `RoofType_nan`.
        - This can be done with `pd.get_dummies` as below.

2. For numerical input fields
    - One common heuristic is to replace each `NaN` with the mean of the corresponding column.
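
To make the numerical heuristic concrete, here is a minimal sketch on a standalone Series holding the `NumRooms` values. The names `num_rooms`, `col_mean`, and `imputed` are illustrative only.

```python
import pandas as pd

# The NumRooms column from the dataset, with two missing entries.
nan = float("nan")
num_rooms = pd.Series([nan, 2.0, 4.0, nan], name="NumRooms")

# Series.mean() skips NaN by default, so the mean is (2 + 4) / 2.
col_mean = num_rooms.mean()

# fillna returns a new Series with each NaN replaced by the mean.
imputed = num_rooms.fillna(col_mean)
print(col_mean)
print(imputed.tolist())
```

Because the mean is computed only over the observed values, both missing entries are filled with the same number.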

## pandas.get_dummies
**pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)**

- Convert categorical variable into dummy/indicator variables.

- Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

- **dummy_na: bool, default False**
    - Add a column to indicate NaNs; if False, NaNs are ignored.
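
The effect of `dummy_na` is easiest to see side by side. The sketch below encodes a one-column DataFrame both ways. The names `roof`, `without_na`, and `with_na` are illustrative, and `dtype=int` is passed here only so that the indicators show up as 0/1 regardless of the pandas version.

```python
import pandas as pd

# The RoofType column from the dataset: one 'Slate' and three missing values.
nan = float("nan")
roof = pd.DataFrame({"RoofType": [nan, nan, "Slate", nan]})

# dummy_na=False: NaN rows get no indicator column and are encoded as all zeros.
without_na = pd.get_dummies(roof, dummy_na=False, dtype=int)

# dummy_na=True: an extra column marks the rows where RoofType is missing.
with_na = pd.get_dummies(roof, dummy_na=True, dtype=int)

print(list(without_na.columns))
print(list(with_na.columns))
```

With `dummy_na=False`, a missing roof type cannot be told apart from "not Slate", which is why the notebook sets `dummy_na=True`.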

In [4]:
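# first two columns are the inputs, the last column is the target
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
# dtype=int keeps the indicator columns as 0/1 (pandas 2.0+ defaults to booleans)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)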

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1


In [7]:
# replace each NaN with the mean of its column
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1


## Conversion to the Tensor Format
- Since all values in `inputs` and `targets` are numerical, we can load them into a tensor.


In [8]:
inputs

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1


In [None]:
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
print(X)
print(y)

tensor([[3., 0., 1.],
        [2., 0., 1.],
        [4., 1., 0.],
        [3., 0., 1.]], dtype=torch.float64)
tensor([127500., 106000., 178100., 140000.], dtype=torch.float64)
