## Data Manipulation
- We need to understand how to manage n-dimensional tensors before diving into deep learning
- Why use PyTorch / TensorFlow?
  - Their tensors can run on GPUs to accelerate numerical computation, whereas NumPy only runs on CPUs

In [7]:
import torch
x = torch.arange(12, dtype=torch.float32)
x, x.numel(), x.shape

(tensor([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11.]),
 12,
 torch.Size([12]))

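- `reshape()` restructures the tensor into the requested shape without changing its values or its number of elements
- The same thing can be achieved with the tensor's `view()` method (e.g. `x.view(3, -1)`, where `-1` lets PyTorch infer that dimension)
- We can find the dimensions using the `size()` method or the `shape` attribute of the tensor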

In [40]:
id_before_reshape = id(x)
x_reshape = x.reshape(3, 4)
x_view = x.view(3, -1)
x_reshape, x_reshape.shape, x_view, x_view.size()

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]),
 torch.Size([3, 4]),
 tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]),
 torch.Size([3, 4]))
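
The two methods are not fully interchangeable. The sketch below is an illustration (the names `base`, `viewed`, and `transposed` are made up for it): `view()` always shares memory with the original tensor and refuses to work on non-contiguous data, while `reshape()` falls back to copying when a view is not possible.

```python
import torch

base = torch.arange(12, dtype=torch.float32)

# view() never copies: the result shares storage with `base`
viewed = base.view(3, 4)
viewed[0, 0] = 100.0
shares_memory = base[0].item() == 100.0  # the write is visible through `base`

# A transposed tensor is non-contiguous, so view() cannot flatten it
transposed = viewed.t()
try:
    transposed.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() still works here because it silently makes a copy
flat = transposed.reshape(12)
print(shares_memory, view_failed, flat.shape)
```

So `reshape()` is the safer default, and `view()` is useful when we want a guarantee that no copy happens.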

- We can use `torch.zeros` or `torch.ones` to create a tensor of a desired shape filled with zeros or ones
- We can also use `torch.zeros_like` or `torch.ones_like` to create a tensor with the same shape as another tensor

In [44]:
A = torch.zeros((2,3))
B = torch.ones((1,3))
C = torch.ones_like(A)
A, B, C

(tensor([[0., 0., 0.],
         [0., 0., 0.]]),
 tensor([[1., 1., 1.]]),
 tensor([[1., 1., 1.],
         [1., 1., 1.]]))

- We can also initialize tensors using the `randn` function, which draws each value from a standard normal distribution (`mean = 0`, `sd = 1`)

In [47]:
X = torch.randn((2,3,4))
X

tensor([[[-1.0703, -0.6006,  0.6795,  0.7969],
         [-0.6514, -0.6167,  0.3797, -0.7482],
         [ 0.6418,  0.2202,  0.2801,  0.2707]],

        [[-1.1957, -1.7886, -0.0371,  0.7337],
         [-0.8295,  1.0147, -0.8393,  0.2184],
         [ 0.4706,  0.1834,  0.1571, -1.5279]]])
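
A quick sanity check of the distribution claim, as a rough sketch: with a large sample, the empirical mean and standard deviation should land close to 0 and 1. The seed and sample size are arbitrary choices for this illustration.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only so the run is repeatable
samples = torch.randn(100_000)

sample_mean = samples.mean().item()
sample_std = samples.std().item()
print(f"mean={sample_mean:.3f}, std={sample_std:.3f}")
```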

In [50]:
# We can apply elementwise operations to tensors using functions and the standard arithmetic operators
X_exp = torch.exp(X)
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
# These are all elementwise operations; linear-algebra operations (e.g. matrix products) are also available
X_exp, x + y, x - y, x * y, x / y, x ** y

(tensor([[[0.3429, 0.5485, 1.9729, 2.2187],
          [0.5213, 0.5397, 1.4618, 0.4732],
          [1.8999, 1.2463, 1.3233, 1.3109]],
 
         [[0.3025, 0.1672, 0.9636, 2.0828],
          [0.4363, 2.7586, 0.4320, 1.2441],
          [1.6010, 1.2013, 1.1702, 0.2170]]]),
 tensor([ 3.,  4.,  6., 10.]),
 tensor([-1.,  0.,  2.,  6.]),
 tensor([ 2.,  4.,  8., 16.]),
 tensor([0.5000, 1.0000, 2.0000, 4.0000]),
 tensor([ 1.,  4., 16., 64.]))
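
Note that `x` holds floats while `y` holds integers, yet every result above is a float tensor. The sketch below shows the type promotion that seems to be at work, using `torch.result_type` to ask PyTorch which dtype a mixed operation would produce.

```python
import torch

x = torch.tensor([1.0, 2, 4, 8])  # float32, because of the 1.0
y = torch.tensor([2, 2, 2, 2])    # int64, the default for Python ints

promoted = torch.result_type(x, y)  # dtype that x (op) y will have
total = x + y
print(x.dtype, y.dtype, promoted, total.dtype)
```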

In [52]:
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
X == Y

tensor([[False,  True, False,  True],
        [False, False, False, False],
        [False, False, False, False]])

### Broadcasting
- Two tensors are “broadcastable” if the following rules hold:
  - Each tensor has at least one dimension.
  - When iterating over the dimension sizes, starting at the trailing dimension, the sizes must be equal, one of them must be 1, or one of them must not exist
- Broadcasting works according to the following two-step procedure:
  - expand one or both tensors by copying elements along axes of length 1 so that, after this transformation, the two tensors have the same shape
  - perform an elementwise operation on the resulting tensors

In [53]:
a = torch.arange(3).reshape((3, 1)) # 3 x 1
b = torch.arange(2).reshape((1, 2)) # 1 x 2
a, b

(tensor([[0],
         [1],
         [2]]),
 tensor([[0, 1]]))

In [78]:
a + b

tensor([[0, 1],
        [1, 2],
        [2, 3]])

In [59]:
a_broadcasted_internally = torch.tensor([[0,0], [1,1], [2,2]])
b_broadcasted_internally = torch.tensor([[0,1], [0,1], [0,1]])
a_broadcasted_internally + b_broadcasted_internally

tensor([[0, 1],
        [1, 2],
        [2, 3]])
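
Instead of writing the expanded tensors by hand, we can ask PyTorch for them. This is a small sketch using `torch.broadcast_shapes` and `torch.broadcast_tensors`; the expanded tensors it returns are views, so in practice no elements are physically copied.

```python
import torch

a = torch.arange(3).reshape((3, 1))  # 3 x 1
b = torch.arange(2).reshape((1, 2))  # 1 x 2

# Shape that the two operands broadcast to
out_shape = torch.broadcast_shapes(a.shape, b.shape)

# Expanded views of a and b, both of shape 3 x 2
a_exp, b_exp = torch.broadcast_tensors(a, b)

same = torch.equal(a_exp + b_exp, a + b)
print(out_shape, a_exp.shape, b_exp.shape, same)
```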

### Saving memory with `+=` instead of `Y = X + Y`

In [70]:
X, Y = torch.randn((2,3))  # unpacking a 2 x 3 tensor gives two rows, each of shape (3,)
before_id = id(Y)
Y = X + Y
before_id, id(Y), Y

(5276189424, 5275877008, tensor([-0.8869,  1.1944,  1.5369]))

In [71]:
before_id = id(Y)
Y += X
before_id, id(Y), Y

(5275877008, 5275877008, tensor([-0.9505,  2.0809,  2.3685]))
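
`id()` only tells us whether the Python object changed. A sketch that checks the underlying buffer instead, using `data_ptr()`, and that shows two other ways of writing into existing memory (slice assignment and the `out=` argument); the tensor `Z` is introduced just for this illustration.

```python
import torch

X = torch.randn(3)
Y = torch.randn(3)
Z = torch.zeros_like(Y)  # pre-allocated buffer for the result
ptr_before = Z.data_ptr()

# Slice assignment writes into Z's existing memory
# (X + Y itself is still computed into a temporary first)
Z[:] = X + Y
same_after_slice = Z.data_ptr() == ptr_before

# out= writes the result directly into Z
torch.add(X, Y, out=Z)
same_after_out = Z.data_ptr() == ptr_before

# Plain assignment rebinds the name Y to a newly allocated tensor
y_ptr = Y.data_ptr()
Y = X + Y
rebinding_allocates = Y.data_ptr() != y_ptr
print(same_after_slice, same_after_out, rebinding_allocates)
```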

### Exercise
- Run the code in this section. Change the conditional statement X == Y to X < Y or X > Y, and then see what kind of tensor you can get.
- Replace the two tensors that operate by element in the broadcasting mechanism with other shapes, e.g., 3-dimensional tensors. Is the result the same as expected?

In [74]:
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
X, Y, X > Y # elementwise comparison: each entry tells whether the element of X is greater than the matching element of Y

(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]),
 tensor([[2., 1., 4., 3.],
         [1., 2., 3., 4.],
         [4., 3., 2., 1.]]),
 tensor([[False, False, False, False],
         [ True,  True,  True,  True],
         [ True,  True,  True,  True]]))

3 x 2 x 2<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 x 2<br>
From the broadcasting rules, the last dimensions match (2 and 2), the second-to-last dimension of the smaller tensor is 1, and its first dimension does not exist, so the two tensors can be broadcast together.

In [81]:
a = torch.arange(12).reshape((3, 2, 2)) # 3 x 2 x 2
b = torch.arange(2).reshape((1, 2)) # 1 x 2
a, b

(tensor([[[ 0,  1],
          [ 2,  3]],
 
         [[ 4,  5],
          [ 6,  7]],
 
         [[ 8,  9],
          [10, 11]]]),
 tensor([[0, 1]]))
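
To confirm the reasoning, here is a short sketch that actually adds the two tensors: `b` is applied to every row of every 2 x 2 block, so each row gains `[0, 1]` and the result keeps the shape 3 x 2 x 2.

```python
import torch

a = torch.arange(12).reshape((3, 2, 2))  # 3 x 2 x 2
b = torch.arange(2).reshape((1, 2))      # 1 x 2

result = a + b  # b is broadcast across the first two dimensions
print(result.shape)
print(result)
```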

## Data Preprocessing
There are a few aspects of preprocessing the data:
- reading the dataset
- cleaning the dataset and processing it for better structure
- storing it in tensor format for better processing

In [3]:
# Let's create a CSV file for our data
import os

os.makedirs(os.path.join('./data'), exist_ok=True)
data_file = os.path.join('./data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')

We can use `read_csv()` from pandas to read a CSV file

In [14]:
import pandas as pd

data = pd.read_csv(data_file)
data

|   | NumRooms | RoofType | Price |
|---|----------|----------|-------|
| 0 | NaN | NaN | 127500 |
| 1 | 2.0 | NaN | 106000 |
| 2 | 4.0 | Slate | 178100 |
| 3 | NaN | NaN | 140000 |


- In supervised learning, we need to separate the target values from the input features
- pandas replaces empty entries with a special NaN value. These missing values can cause problems for our models
- `get_dummies()` makes categorical variables easier to handle: with `dummy_na=True`, NaN gets its own indicator column (`RoofType_nan`) alongside one column per category. This is similar to one-hot encoding

In [13]:
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs

|   | NumRooms | RoofType_Slate | RoofType_nan |
|---|----------|----------------|--------------|
| 0 | NaN | False | True |
| 1 | 2.0 | False | True |
| 2 | 4.0 | True | False |
| 3 | NaN | False | True |


- For missing numerical values, we can either replace them with the column mean (imputation) or discard the affected rows

In [17]:
inputs = inputs.fillna(inputs.mean())
inputs

|   | NumRooms | RoofType_Slate | RoofType_nan |
|---|----------|----------------|--------------|
| 0 | 3.0 | False | True |
| 1 | 2.0 | False | True |
| 2 | 4.0 | True | False |
| 3 | 3.0 | False | True |
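
The two options for missing numerical values can be compared side by side. This sketch rebuilds the same tiny dataset in memory (via `io.StringIO`, so no file is needed) and applies mean imputation and row deletion to `NumRooms`; the variable names `imputed` and `dropped` are just for illustration.

```python
import io
import pandas as pd

csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))
inputs = data[["NumRooms", "RoofType"]]

# Option 1: mean imputation keeps all rows; missing values become (2 + 4) / 2
room_mean = inputs["NumRooms"].mean()
imputed = inputs.assign(NumRooms=inputs["NumRooms"].fillna(room_mean))

# Option 2: discard every row where NumRooms is missing
dropped = inputs.dropna(subset=["NumRooms"])

print(imputed["NumRooms"].tolist())  # [3.0, 2.0, 4.0, 3.0]
print(len(dropped))                  # 2
```

With only four rows, dropping loses half of the data, which is why imputation is often preferred on small datasets.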


We can also convert these values to a tensor format for later processing

In [18]:
import torch

X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))

X, y

(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))
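
The tensors above come out as `float64` because that is NumPy's default float type, whereas PyTorch creates `float32` tensors by default (as seen with `torch.randn` earlier). A sketch of two ways to get `float32` instead, assuming a small stand-in array named `values`:

```python
import numpy as np
import torch

values = np.array([[3.0, 0.0, 1.0], [2.0, 0.0, 1.0]])  # float64 by default

X64 = torch.tensor(values)                       # keeps float64
X32 = torch.tensor(values, dtype=torch.float32)  # cast while copying
X32_alt = torch.from_numpy(values).float()       # cast after wrapping the array

print(X64.dtype, X32.dtype, X32_alt.dtype)
```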

### Exercise

#### Dataset Inspection
1. Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?<br>
<B>Thoughts:</B> A good way to inspect the data is to check the dtypes of all columns, count the missing values, and then use `describe()` to summarize the inputs and targets

In [20]:
!pip install ucimlrepo

Collecting ucimlrepo
  Obtaining dependency information for ucimlrepo from https://files.pythonhosted.org/packages/3e/4a/ecc3456479d687202b34ee42317c3a63e09793c9409a720052d38356431a/ucimlrepo-0.0.3-py3-none-any.whl.metadata
  Downloading ucimlrepo-0.0.3-py3-none-any.whl.metadata (5.2 kB)
Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3

In [23]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
abalone = fetch_ucirepo(id=1) 
  
# data (as pandas dataframes) 
X = abalone.data.features 
y = abalone.data.targets 

In [26]:
X.head()

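|   | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight |
|---|-----|--------|----------|--------|--------------|----------------|----------------|--------------|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 |
| 3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 |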


In [29]:
X.isna().mean(), y.isna().mean()

(Sex               0.0
 Length            0.0
 Diameter          0.0
 Height            0.0
 Whole_weight      0.0
 Shucked_weight    0.0
 Viscera_weight    0.0
 Shell_weight      0.0
 dtype: float64,
 Rings    0.0
 dtype: float64)

In [30]:
X.dtypes

Sex                object
Length            float64
Diameter          float64
Height            float64
Whole_weight      float64
Shucked_weight    float64
Viscera_weight    float64
Shell_weight      float64
dtype: object

In [31]:
X.describe()

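|       | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight |
|-------|--------|----------|--------|--------------|----------------|----------------|--------------|
| count | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 |
| mean | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 |
| std | 0.120093 | 0.09924 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 |
| min | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 |
| 25% | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 |
| 50% | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.234 |
| 75% | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.329 |
| max | 0.815 | 0.65 | 1.13 | 2.8255 | 1.488 | 0.76 | 1.005 |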


In [33]:
y.describe()

|       | Rings |
|-------|-------|
| count | 4177.0 |
| mean | 9.933684 |
| std | 3.224169 |
| min | 1.0 |
| 25% | 8.0 |
| 50% | 9.0 |
| 75% | 11.0 |
| max | 29.0 |
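
The exercise asks for fractions, which the cells above only show indirectly. The following is a sketch of how they could be computed with `select_dtypes` and `isna`. It runs on a tiny made-up frame (`sample`, with one missing value added on purpose) rather than on the real Abalone data, so the printed numbers say nothing about that dataset.

```python
import pandas as pd

# Hypothetical stand-in for a dataset; None becomes NaN in the float column
sample = pd.DataFrame({
    "Sex": ["M", "M", "F", "I"],
    "Length": [0.455, 0.350, 0.530, None],
    "Diameter": [0.365, 0.265, 0.420, 0.255],
})

numeric_cols = sample.select_dtypes(include="number").columns
other_cols = sample.select_dtypes(exclude="number").columns

frac_numeric = len(numeric_cols) / sample.shape[1]         # share of numeric columns
frac_cols_with_missing = sample.isna().any().mean()        # columns with at least one NaN
frac_rows_with_missing = sample.isna().any(axis=1).mean()  # rows with at least one NaN

print(list(numeric_cols), list(other_cols))
print(frac_numeric, frac_cols_with_missing, frac_rows_with_missing)
```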


#### Indexing dataset
2. Try indexing and selecting data columns by name rather than by column number. The pandas documentation on [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) has further details on how to do this.<br>
<B>Thoughts: </B>We can use `.loc[]` with column names to select those columns together with the rows we want. Label-based slices include both endpoints, so `0:5` returns 6 rows

In [39]:
X_index_using_column_names = X.loc[0:5, ['Length', 'Diameter']]
X_index_using_column_names

|   | Length | Diameter |
|---|--------|----------|
| 0 | 0.455 | 0.365 |
| 1 | 0.350 | 0.265 |
| 2 | 0.530 | 0.420 |
| 3 | 0.440 | 0.365 |
| 4 | 0.330 | 0.255 |
| 5 | 0.425 | 0.300 |
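
One detail that is easy to trip over: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end, like ordinary Python slicing. A small sketch on a made-up frame (`frame` is a stand-in with a few Abalone-like columns):

```python
import pandas as pd

frame = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530, 0.440],
    "Diameter": [0.365, 0.265, 0.420, 0.365],
    "Height": [0.095, 0.090, 0.135, 0.125],
})

by_label = frame.loc[0:2, ["Length", "Diameter"]]  # labels 0, 1, 2 -> 3 rows
by_position = frame.iloc[0:2, 0:2]                 # positions 0, 1 -> 2 rows
single_column = frame["Length"]                    # one column by name, as a Series

print(len(by_label), len(by_position), single_column.name)
```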


#### Limitations of loading data
3. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server?<br>
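<B>Thoughts: </B>The amount of data we can load this way is bounded by the available memory (RAM), because the whole dataset is held in RAM. Once memory runs out, the system starts swapping and slows down, or the process crashes with an out-of-memory error. A server typically has more RAM, so larger datasets fit, but the same limit applies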
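
Two pandas tools help here, shown as a rough sketch on a tiny in-memory CSV (the column name `value` and the chunk size of 4 are arbitrary choices for the illustration): `memory_usage(deep=True)` reports how much RAM a loaded frame takes, and the `chunksize` argument of `read_csv` lets us process a file piece by piece instead of loading it all at once.

```python
import io
import pandas as pd

# Tiny stand-in for a large file: one column holding 0..9
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

# Memory footprint of the fully loaded frame, in bytes
full = pd.read_csv(io.StringIO(csv_text))
bytes_in_memory = int(full.memory_usage(deep=True).sum())

# Chunked reading: only `chunksize` rows are held in memory at a time
running_total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    running_total += int(chunk["value"].sum())
    n_chunks += 1

print(bytes_in_memory, running_total, n_chunks)
```

Chunking only helps when the computation can be done incrementally (sums, counts, filtering); anything that needs all rows at once still hits the RAM limit.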