# Data Preprocessing 
In this tutorial, the following topics will be covered:
- Conversion to other Python objects
    - from Tensor to numpy
    - from numpy to Tensor
- Data Preprocessing
    - Reading the dataset
    - Handling Missing Data
    - Conversion to the Tensor Format
    - Exercise
    

As usual, let's start by importing the packages we need:
- PyTorch (torch)
- NumPy (for numerical computing)
- pandas (for data manipulation)

If any of these modules is missing, install it with pip, e.g. !pip install xxx


In [33]:
!pip3 install pandas



In [34]:
import pandas as pd
import numpy as np
import torch

# Conversion to other Python objects
 - convert a tensor to a NumPy array
 - convert a NumPy array back to a tensor

In [35]:
X = torch.tensor([[2, 1, 2],
                  [1, 2, 1],
                  [3, 1, 3],
                  [4, 1, 3]], dtype = torch.float32)
A = X.numpy()
type(A)

numpy.ndarray

In [37]:
X = torch.from_numpy(A)
type(X)

torch.Tensor
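
One detail worth knowing about the two conversions above: for CPU tensors, both X.numpy() and torch.from_numpy(A) return an object that shares memory with the original, so an in-place change on one side is visible on the other. The sketch below illustrates this with small stand-in values (t, arr, src, t2 and independent are names chosen for this example only); clone() is one way to get an independent copy.

```python
import numpy as np
import torch

# A CPU tensor and the array returned by .numpy() share one buffer.
t = torch.ones(3)
arr = t.numpy()
t.add_(1)        # in-place update on the tensor
print(arr)       # the array reflects the update

# torch.from_numpy shares memory in the same way.
src = np.zeros(3)
t2 = torch.from_numpy(src)
src[0] = 5.0     # in-place update on the array
print(t2)        # the tensor reflects the update

# clone() copies the data, so later changes to src do not propagate.
independent = torch.from_numpy(src).clone()
src[1] = 7.0
print(independent)
```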

In [38]:
# Convert a size-1 tensor to a Python scalar with item(), float(), or int():

a = torch.tensor([2.2])
a, a.item(), float(a), int(a)

(tensor([2.2000]), 2.200000047683716, 2.200000047683716, 2)

# Data Preprocessing
  - Reading the dataset
  - Handling Missing Data
  - Conversion to the Tensor Format
  - Exercise
    

In [39]:
import os
os.makedirs(os.path.join('..', 'data'), exist_ok = True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms, Alley, Price\n')
    f.write('NA, Pave, 127500\n')
    f.write('2, NA, 106000\n')
    f.write('4, NA, 178100\n')
    f.write('NA, NA, 140000\n')

In [40]:
data = pd.read_csv(data_file)
print(data)

   NumRooms  Alley   Price
0       NaN   Pave  127500
1       2.0     NA  106000
2       4.0     NA  178100
3       NaN     NA  140000
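
Notice that the Alley column prints NA rather than NaN. Each field in the file is written with a space after the comma, so pandas reads the string ' NA' (and column names such as ' Alley' and ' Price') instead of recognizing a missing-value marker. A minimal sketch, assuming we would rather have those entries treated as missing, is to pass skipinitialspace=True to read_csv. The in-memory csv_text below is a stand-in for the file written above, and raw and clean are names used only in this example.

```python
import io

import pandas as pd

# Same content as house_tiny.csv, kept in memory for this sketch.
csv_text = (
    'NumRooms, Alley, Price\n'
    'NA, Pave, 127500\n'
    '2, NA, 106000\n'
    '4, NA, 178100\n'
    'NA, NA, 140000\n'
)

# Default parsing keeps the leading spaces in names and string values.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# skipinitialspace=True drops whitespace that follows each delimiter,
# so 'NA' is recognized as a missing value in every column.
clean = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(list(clean.columns))
print(clean.isna().sum())
```

Writing the file without spaces after the commas would have the same effect.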


In [42]:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
inputs

   NumRooms  Alley
0       NaN   Pave
1       2.0     NA
2       4.0     NA
3       NaN     NA


In [43]:
# Fill missing numeric entries with the column mean using fillna;
# numeric_only=True skips non-numeric columns such as Alley
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)

   NumRooms  Alley
0       3.0   Pave
1       2.0     NA
2       4.0     NA
3       3.0     NA




In [44]:
# One-hot encode the categorical column; dummy_na=True adds an
# indicator column (Alley_nan) for missing values
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs

   NumRooms  Alley_ NA  Alley_ Pave  Alley_nan
0       3.0          0            1          0
1       2.0          1            0          0
2       4.0          1            0          0
3       3.0          1            0          0
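
In the output above, Alley_nan is all zeros because the string ' NA' was encoded as a category of its own (Alley_ NA) rather than as a missing value. The sketch below shows what dummy_na=True does when the entries really are missing. It assumes a hypothetical frame, demo, whose Alley column holds NaN, and passes dtype=int so the indicators print as 0/1 regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Alley is genuinely missing (NaN) in three rows.
demo = pd.DataFrame({
    'NumRooms': [3.0, 2.0, 4.0, 3.0],
    'Alley': ['Pave', np.nan, np.nan, np.nan],
})

# Only the non-numeric column is encoded; dummy_na=True adds Alley_nan,
# which is 1 exactly where Alley was missing.
encoded = pd.get_dummies(demo, dummy_na=True, dtype=int)
print(encoded)
```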


# Conversion to the Tensor Format

In [45]:
X, y = inputs.iloc[:, 0:3], outputs
print(f'inputs \n\n{X}\n')
print(f'outputs\n\n{y}')

inputs 

   NumRooms   Alley_ NA   Alley_ Pave
0       3.0           0             1
1       2.0           1             0
2       4.0           1             0
3       3.0           1             0

outputs

0    127500
1    106000
2    178100
3    140000
Name:  Price, dtype: int64
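
At this point X and y are still pandas objects. One way to finish the conversion is to go through NumPy, as sketched below. The frame and series here are stand-ins that reproduce the values printed above, and casting to float32 is an assumption (it is the default floating-point type in PyTorch, and an explicit cast also handles indicator columns that newer pandas versions return as booleans).

```python
import pandas as pd
import torch

# Stand-ins for the preprocessed inputs and outputs printed above.
inputs = pd.DataFrame({
    'NumRooms': [3.0, 2.0, 4.0, 3.0],
    'Alley_ NA': [0, 1, 1, 1],
    'Alley_ Pave': [1, 0, 0, 0],
})
outputs = pd.Series([127500, 106000, 178100, 140000], name=' Price')

# to_numpy(dtype=...) yields one homogeneous array that torch.tensor accepts.
X = torch.tensor(inputs.to_numpy(dtype='float32'))
y = torch.tensor(outputs.to_numpy(dtype='float32'))
print(X.shape, X.dtype)
print(y.shape, y.dtype)
```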


# Exercise
Create a raw dataset with more rows and columns.
1. Delete the column with the most missing values.
2. Convert the preprocessed dataset to the tensor format.
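
As a hint for step 1, one possible approach (not the only one) is to count the missing values per column and drop the column with the highest count. The frame df and its values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan, 3.0],
    'Alley': ['Pave', np.nan, np.nan, np.nan, np.nan],
    'Price': [127500, 106000, 178100, 140000, 155000],
})

missing_counts = df.isna().sum()    # missing values per column
worst = missing_counts.idxmax()     # label of the column with the most
reduced = df.drop(columns=worst)    # new frame without that column
print(worst)
print(list(reduced.columns))
```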

# Next Tutorial
The next tutorial covers linear algebra. See you soon!