# Data preparation (pre-processing)

We'll create a minimal dummy CSV file, load it, and then use pandas to do the following pre-processing:

- handle missing values
  - for numerical columns, fill with the mean of the column
  - for categorical columns, treat NA as its own category and one-hot encode (pandas's `get_dummies(xxx, dummy_na=True)`)
- convert the resulting NumPy arrays to PyTorch tensors

## Create dummy csv data

In [3]:
import os
os.getcwd()

'/Users/xiaolishen/projects/d2l/notes'

In [4]:
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    # column names
    f.write('NumRooms,Alley,Price\n')
    f.write('NA,Pave,127500\n')
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

## Import dummy data

In [6]:
!pip3 install pandas

Successfully installed numpy-1.24.2 pandas-2.0.0 pytz-2023.3 tzdata-2023.3


In [42]:
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000


## Missing Value (NA)

For numerical columns, we fill missing values with the column mean.

In [43]:
# all rows; the first two columns are the inputs, the last column is the output
# (.copy() so that the assignment below modifies an independent DataFrame)
inputs, outputs = data.iloc[:, 0:2].copy(), data.iloc[:, 2]

# fill NA in the numeric column with its mean
inputs['NumRooms'] = inputs['NumRooms'].fillna(inputs['NumRooms'].mean())
inputs

   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
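With more than one numerical column, filling them one at a time gets repetitive. A minimal sketch of a generalization, assuming the same toy data (read from an in-memory string here so the snippet is self-contained; the names `csv_text`, `features`, `numeric_means`, and `filled` are only illustrative):

```python
import io
import pandas as pd

# Same toy data as above, read from a string instead of a file.
csv_text = """NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
"""
df = pd.read_csv(io.StringIO(csv_text))
features = df.iloc[:, 0:2].copy()

# Means of the numeric columns only; NaN entries are skipped by default.
numeric_means = features.mean(numeric_only=True)

# Passing a Series to fillna fills each column with the value whose label
# matches the column name. Columns without an entry (Alley) are left as-is.
filled = features.fillna(numeric_means)
print(filled)
```

The `Alley` column still contains NaN afterwards, which is what we want here, since the categorical column is handled separately by one-hot encoding.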


For categorical columns, we convert to one-hot encoding, with NA treated as its own category.

In [44]:
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs

   NumRooms  Alley_Pave  Alley_nan
0       3.0        True      False
1       2.0       False       True
2       4.0       False       True
3       3.0       False       True
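To see what `dummy_na=True` contributes, the sketch below encodes just the `Alley` column with and without it. The small DataFrame `alley` is rebuilt by hand for illustration and is not part of the notebook's own variables.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the Alley column: one known value, three missing.
alley = pd.DataFrame({'Alley': ['Pave', np.nan, np.nan, np.nan]})

without_na = pd.get_dummies(alley)                # default: dummy_na=False
with_na = pd.get_dummies(alley, dummy_na=True)    # adds an indicator column for NaN

print(list(without_na.columns))
print(list(with_na.columns))
```

Without `dummy_na`, a missing value is encoded as all-zero across the category columns. With it, missingness gets its own indicator column.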


## Convert pre-processed dataframes to PyTorch tensor

First let's take a look at the pre-processed dataframes.

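Note that `get_dummies` produces one-hot columns of dtype bool, while the numerical column is float64 (it was read as float because of the missing values, and `fillna` keeps it that way). Because the columns have different dtypes, `inputs.values` is an array of dtype object.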

In [49]:
inputs.values, outputs.values
# inputs.dtypes

(array([[3.0, True, False],
        [2.0, False, True],
        [4.0, False, True],
        [3.0, False, True]], dtype=object),
 array([127500, 106000, 178100, 140000]))
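The object dtype above comes from mixing a float column with bool columns. As a sketch of one alternative (not what this notebook does), `get_dummies` accepts a `dtype` argument, so the one-hot columns can be created as floats directly. The `features` DataFrame below is a hand-built stand-in for `inputs` after the mean fill.

```python
import numpy as np
import pandas as pd

# Stand-in for `inputs` after filling NumRooms with its mean.
features = pd.DataFrame({
    'NumRooms': [3.0, 2.0, 4.0, 3.0],
    'Alley': ['Pave', np.nan, np.nan, np.nan],
})

# Ask for float one-hot columns instead of the default bool.
encoded = pd.get_dummies(features, dummy_na=True, dtype=float)

print(encoded.dtypes)
print(encoded.values.dtype)
```

With every column numeric, `encoded.values` is a plain float array rather than an object array.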

To create tensors from NumPy arrays, we can use either `torch.tensor` or `torch.from_numpy`. The main difference is that `torch.tensor` creates a new tensor that copies the data from the input array, while `torch.from_numpy` creates a tensor that shares the same underlying memory with the input array, so a change to one is visible in the other.
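A small sketch of the copy-versus-share difference, using a throwaway array `arr` that is not part of the notebook's data:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)

copied = torch.tensor(arr)       # independent copy of the data
shared = torch.from_numpy(arr)   # view onto the same memory as arr

# Modify the NumPy array in place after both tensors exist.
arr[0] = 100.0

print(copied)   # still holds the original first element
print(shared)   # reflects the in-place change
```

Sharing avoids a copy, which matters for large arrays, but it also means later in-place edits to the array silently change the tensor.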

Neither `torch.tensor` nor `torch.from_numpy` can convert a NumPy array of dtype object, which is what `inputs.values` is here because it mixes float and bool columns. We therefore use `astype` to cast all elements to `float32` first.

In [50]:
import torch

X = torch.tensor(inputs.values.astype('float32'))
y = torch.from_numpy(outputs.values)

X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]]),
 tensor([127500, 106000, 178100, 140000]))