# Data Preprocessing

So far, we have been working with synthetic data that arrived in ready-made tensors. However, to apply deep learning in the wild, we must extract messy data stored in arbitrary formats and preprocess it to suit our needs. Fortunately, the pandas library does most of the heavy lifting for us.

## Reading the Dataset

In [2]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    # No spaces after the commas and no indentation inside the string:
    # a field such as ' NA' with surrounding whitespace would be read
    # as a string, not as a missing value.
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,14000''')

In [3]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN   14000


## Data Preparation

We can select columns either by name or via integer-location based indexing (iloc). Below, iloc splits the data into inputs (the first two columns) and targets (the last column).

Missing values are the bed bugs of data science, a persistent menace that you will confront throughout your career. Depending upon the context, missing values might be handled either via imputation or deletion. Imputation replaces missing values with estimates of their values, while deletion simply discards either the rows or the columns that contain missing values.

Note that pandas parsed every NA entry in the CSV file as NaN. For the categorical RoofType column, we treat NaN as a category of its own: get_dummies with dummy_na=True expands the column into the indicator columns RoofType_Slate and RoofType_nan.

In [5]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
# dtype=int prints the indicators as 0/1 on every pandas version
# (pandas 2.x would otherwise return booleans)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1


For the numerical NumRooms column, a common heuristic is to replace each missing entry with the mean of the column, here (2 + 4) / 2 = 3.

In [6]:
inputs = inputs.fillna(inputs.mean())
print(inputs)

   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1


In [8]:
import torch

# to_numpy(dtype=float) yields a single numeric array that torch.tensor accepts
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))
X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100.,  14000.], dtype=torch.float64))
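
The walkthrough above selects columns with iloc and relies on imputation. As a minimal sketch of the two alternatives mentioned in the text, the snippet below selects columns by name and handles missing values by deletion. The `toy` frame and all variable names are illustrative assumptions; the frame is built inline so that the snippet does not depend on the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the house data (an assumption for
# illustration, not loaded from house_tiny.csv).
toy = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'RoofType': [np.nan, np.nan, 'Slate', np.nan],
    'Price': [127500, 106000, 178100, 14000],
})

# Selection by name: equivalent to toy.iloc[:, 0:2] and toy.iloc[:, 2],
# but it keeps working if the column order changes.
inputs_by_name = toy[['NumRooms', 'RoofType']]
targets_by_name = toy['Price']

# Row-wise deletion: keep only rows without any missing entry.
# Only row 2 is complete in this frame.
complete_rows = toy.dropna(axis=0)

# Column-wise deletion: drop the column with the most missing entries.
na_counts = toy.isna().sum()
most_missing = na_counts.idxmax()
fewer_columns = toy.drop(columns=most_missing)

print(complete_rows)
print(fewer_columns)
```

Deletion is simple, but on a dataset this small it discards most of the information: row-wise deletion leaves a single example, which is one reason imputation is often preferred when data are scarce.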