# Images

We use the `imageio` package to load images.

In [0]:
drive_dir = 'drive/My Drive/dlwpt-code/data/'

In [0]:
import torch
import imageio

In [21]:
# Loading an image
img_arr = imageio.imread(drive_dir + 'p1ch4/image-dog/bobby.jpg')  # drive_dir already ends with '/'
img_arr.shape

(720, 1280, 3)

PyTorch modules that deal with image data require tensors laid out as `C x H x W` (channels, height, width).
We can use the `permute` method to get this layout from the `H x W x C` array that `imageio` returns.

In [0]:
img = torch.from_numpy(img_arr)
out = img.permute(2, 0, 1)
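
To make the effect of `permute` concrete, here is a minimal sketch on a small synthetic array (the 4 x 5 size and the name `shares_storage` are made up for illustration). It suggests that `permute` only reorders the axes and returns a view over the same storage rather than a copy.

```python
import numpy as np
import torch

# Synthetic stand-in for a loaded image: height 4, width 5, 3 channels (H x W x C).
img_arr = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)

img = torch.from_numpy(img_arr)   # shares memory with img_arr
out = img.permute(2, 0, 1)        # reorder the axes to C x H x W without copying

print(img.shape, out.shape)

# permute returns a view: a write through `img` is visible in `out`.
img[0, 0, 0] = 255
shares_storage = out[0, 0, 0].item() == 255
```

Because `out` is a view, call `out.contiguous()` or `out.clone()` if an independent, contiguous copy is needed.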

To store a batch of images in one tensor, we use the layout `N x C x H x W`, where `N` is the number of images.

As a more efficient alternative to using `stack` to build up the tensor, we can pre-allocate a tensor of appropriate size and fill it with images loaded from a directory.

In [0]:
batch_size = 100
# Pre-allocate room for batch_size RGB images of 256 x 256 pixels
batch = torch.zeros(batch_size, 3, 256, 256, dtype=torch.uint8)
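
The two ways of building a batch can be compared on synthetic data. This is only a sketch: the random 8 x 8 "images" and the names `images`, `stacked` and `same` are hypothetical stand-ins, not part of the original notebook. `stack` needs every image held in a list first, whereas the pre-allocated tensor is filled one slot at a time.

```python
import torch

# Hypothetical stand-ins for images already loaded as C x H x W uint8 tensors.
images = [torch.randint(0, 256, (3, 8, 8), dtype=torch.uint8) for _ in range(4)]

# Option 1: stack builds a new N x C x H x W tensor from the whole list.
stacked = torch.stack(images)

# Option 2: pre-allocate once, then copy each image into its slot.
batch = torch.zeros(len(images), 3, 8, 8, dtype=torch.uint8)
for i, img_t in enumerate(images):
    batch[i] = img_t

same = torch.equal(stacked, batch)  # both approaches yield the same tensor
```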

## Loading images from a directory

In [0]:
import os


data_dir = 'drive/My Drive/dlwpt-code/data/p1ch4/image-cats/'
filenames = [name for name in os.listdir(data_dir) if os.path.splitext(name)[-1] == '.png']
for i, filename in enumerate(filenames):
  img_arr = imageio.imread(os.path.join(data_dir, filename))
  img_t = torch.from_numpy(img_arr)
  img_t = img_t.permute(2, 0, 1)
  img_t = img_t[:3] # Keep only the first three channels. Only RGB inputs.
  batch[i] = img_t

Neural networks work best when the input data ranges from 0 to 1, or from -1 to 1. Typically we want to cast a tensor to floating point and normalize the values of the pixels.

Normalization is trickier because it depends on which input range we choose: 0 to 1 or -1 to 1. One possibility is to divide the pixel values by 255, the maximum number representable in 8-bit unsigned format.

In [0]:
batch = batch.float()
batch /= 255.0
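
For the other range mentioned above, -1 to 1, one simple option (an assumption here, not something the notebook does) is to rescale the 0 to 1 values linearly. A small sketch on made-up pixel values:

```python
import torch

# Synthetic uint8 pixel values covering the full 8-bit range.
pixels = torch.tensor([0, 64, 128, 255], dtype=torch.uint8)

unit_range = pixels.float() / 255.0      # values in [0, 1]
signed_range = unit_range * 2.0 - 1.0    # values in [-1, 1]
```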

Another way is to compute the **mean** and **std** of the input data and scale it so that the output has **zero mean** and **unit std** across each channel.

In [0]:
n_channels = batch.shape[1]
for c in range(n_channels):
  mean = torch.mean(batch[:, c])
  std = torch.std(batch[:, c])
  batch[:, c] = (batch[:, c] - mean) / std

The cell above normalizes just a single batch of images, because we do not yet know how to operate on an entire dataset. It is good practice to compute the mean and standard deviation over the entire training set in advance and then subtract and divide by these fixed, precomputed quantities.
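
A possible way to apply fixed, precomputed statistics is sketched below. The random tensor `train` stands in for a real training set, and `normalize` is a hypothetical helper, not a PyTorch function. The per-channel loop is replaced by a reduction over the `N`, `H` and `W` axes plus broadcasting.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for a training set: N x C x H x W, values in [0, 1].
train = torch.rand(16, 3, 8, 8)

# One mean and one std per channel, reduced over N, H and W.
mean = train.mean(dim=(0, 2, 3))
std = train.std(dim=(0, 2, 3))

def normalize(batch, mean, std):
    # Reshape the statistics to 1 x C x 1 x 1 so they broadcast over the batch.
    return (batch - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)

normalized = normalize(train, mean, std)
```

The same `mean` and `std` would then be reused for every later batch, including validation data.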

# Volumetric Data

Volumetric data adds a **depth** dimension after the channel dimension, leading to a 5D tensor of shape:

 `N x C x D x H x W`

In [32]:
# Loading in a sample CT scan
dir_path = drive_dir + "p1ch4/volumetric-dicom/2-LUNG 3.0  B70f-04083"
vol_arr = imageio.volread(dir_path, 'DICOM')
vol_arr.shape

(99, 512, 512)

In this case, the layout is different from what PyTorch expects, due to having no channel information. We will have to make room for the `channel` dimension using `unsqueeze`.

In [36]:
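vol = torch.from_numpy(vol_arr).float()
# Swap the first and last axes: (D, H, W) -> (W, H, D).
# Note: skip this transpose to keep the C x D x H x W ordering described above.
vol = torch.transpose(vol, 0, 2)
# Add a leading channel dimension of size 1
vol = torch.unsqueeze(vol, 0)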

vol.shape

torch.Size([1, 512, 512, 99])
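
As a sketch of how the `N x C x D x H x W` layout could be reached, the example below uses a small synthetic volume (sizes chosen arbitrarily) and, unlike the cell above, omits the transpose so that the depth axis stays right after the channel axis.

```python
import torch

# Synthetic stand-in for a CT volume: 5 slices (depth) of 8 x 8 pixels, D x H x W.
vol_arr = torch.zeros(5, 8, 8)

vol = vol_arr.unsqueeze(0)    # add the channel dimension: C x D x H x W
batch = vol.unsqueeze(0)      # add the batch dimension:   N x C x D x H x W
```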

# Tabular Data

In [0]:
import csv
import numpy as np

In [28]:
wine_path = drive_dir + 'p1ch4/tabular-wine/winequality-white.csv'
wineq_numpy = np.loadtxt(wine_path, dtype=np.float32, delimiter=';', skiprows=1)
wineq_numpy

array([[ 7.  ,  0.27,  0.36, ...,  0.45,  8.8 ,  6.  ],
       [ 6.3 ,  0.3 ,  0.34, ...,  0.49,  9.5 ,  6.  ],
       [ 8.1 ,  0.28,  0.4 , ...,  0.44, 10.1 ,  6.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]], dtype=float32)

In [29]:
# Read only the header row; the with-block closes the file afterwards
with open(wine_path, newline='') as f:
  col_list = next(csv.reader(f, delimiter=';'))
wineq_numpy.shape, col_list

((4898, 12),
 ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol',
  'quality'])

Convert the NumPy array into a PyTorch tensor.

In [30]:
wineq = torch.from_numpy(wineq_numpy)

wineq.shape, wineq.dtype

(torch.Size([4898, 12]), torch.float32)

In [31]:
data = wineq[:, :-1]
data, data.shape

(tensor([[ 7.0000,  0.2700,  0.3600,  ...,  3.0000,  0.4500,  8.8000],
         [ 6.3000,  0.3000,  0.3400,  ...,  3.3000,  0.4900,  9.5000],
         [ 8.1000,  0.2800,  0.4000,  ...,  3.2600,  0.4400, 10.1000],
         ...,
         [ 6.5000,  0.2400,  0.1900,  ...,  2.9900,  0.4600,  9.4000],
         [ 5.5000,  0.2900,  0.3000,  ...,  3.3400,  0.3800, 12.8000],
         [ 6.0000,  0.2100,  0.3800,  ...,  3.2600,  0.3200, 11.8000]]),
 torch.Size([4898, 11]))

In [32]:
target = wineq[:,-1]
target, target.shape

(tensor([6., 6., 6.,  ..., 6., 7., 6.]), torch.Size([4898]))

In [33]:
# Treating labels as an integer vector of scores:
target = wineq[:, -1].long()
target

tensor([6, 6, 6,  ..., 6, 7, 6])

In PyTorch we can achieve `one-hot` encoding using the `scatter_` method, which fills the tensor with values from a source tensor along the indices provided as arguments.

The `scatter_` call reads as follows: for each row, take the index of the target label (which coincides with the score in our case) and use it as the column index at which to set the value 1.0. The end result is a tensor encoding categorical information.

In [34]:
target_onehot = torch.zeros(target.shape[0], 10)

target_onehot.scatter_(1, target.unsqueeze(1), 1.0)

tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 1., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
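
Since the full output is elided, a tiny example may make the pattern easier to see. The three labels and four classes below are made up for illustration. The result is also compared with `torch.nn.functional.one_hot`, which produces the same encoding as an integer tensor.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0, 2, 1])     # three labels drawn from 4 classes
n_classes = 4

onehot = torch.zeros(target.shape[0], n_classes)
# For each row i, write 1.0 into column target[i].
onehot.scatter_(1, target.unsqueeze(1), 1.0)
print(onehot)
# tensor([[1., 0., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 1., 0., 0.]])

same = torch.equal(onehot, F.one_hot(target, num_classes=n_classes).float())
```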

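The `unsqueeze` method adds a singleton dimension, turning the 1D `target` tensor into a 2D tensor with one column, so that it has the same number of dimensions as `target_onehot`, as `scatter_` requires.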

In [35]:
target_unsqueezed = target.unsqueeze(1)
target_unsqueezed

tensor([[6],
        [6],
        [6],
        ...,
        [6],
        [7],
        [6]])

In [39]:
# Average over the rows (dim=0), giving one mean for each of the 11 columns
data_mean = torch.mean(data, dim=0)
data_mean

tensor([6.8548e+00, 2.7824e-01, 3.3419e-01, 6.3914e+00, 4.5772e-02, 3.5308e+01,
        1.3836e+02, 9.9403e-01, 3.1883e+00, 4.8985e-01, 1.0514e+01])

In [40]:
# Variance
data_var = torch.var(data, dim=0)
data_var

tensor([7.1211e-01, 1.0160e-02, 1.4646e-02, 2.5726e+01, 4.7733e-04, 2.8924e+02,
        1.8061e+03, 8.9455e-06, 2.2801e-02, 1.3025e-02, 1.5144e+00])

In [41]:
data_normalized = (data - data_mean) / torch.sqrt(data_var)
data_normalized

tensor([[ 1.7209e-01, -8.1764e-02,  2.1325e-01,  ..., -1.2468e+00,
         -3.4914e-01, -1.3930e+00],
        [-6.5743e-01,  2.1587e-01,  4.7991e-02,  ...,  7.3992e-01,
          1.3467e-03, -8.2418e-01],
        [ 1.4756e+00,  1.7448e-02,  5.4378e-01,  ...,  4.7502e-01,
         -4.3677e-01, -3.3662e-01],
        ...,
        [-4.2042e-01, -3.7940e-01, -1.1915e+00,  ..., -1.3131e+00,
         -2.6152e-01, -9.0544e-01],
        [-1.6054e+00,  1.1666e-01, -2.8253e-01,  ...,  1.0048e+00,
         -9.6250e-01,  1.8574e+00],
        [-1.0129e+00, -6.7703e-01,  3.7852e-01,  ...,  4.7502e-01,
         -1.4882e+00,  1.0448e+00]])

In [43]:
data_normalized.shape

torch.Size([4898, 11])
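
The normalization above relies on broadcasting: the per-column statistics have shape `(11,)` and are applied to every row of the `(4898, 11)` data. A small sketch with made-up numbers, assuming the same unbiased variance that `torch.var` uses by default:

```python
import torch

data = torch.tensor([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])

data_mean = data.mean(dim=0)     # shape (2,): one mean per column -> [2., 20.]
data_std = data.std(dim=0)       # unbiased std, equal to sqrt(torch.var(data, dim=0))
# (3, 2) minus (2,) broadcasts the column statistics over all rows.
data_normalized = (data - data_mean) / data_std
print(data_normalized)
# tensor([[-1., -1.],
#         [ 0.,  0.],
#         [ 1.,  1.]])
```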

Determine which rows in `target` correspond to a score less than or equal to 3.

In [44]:
bad_indexes = target <= 3
bad_indexes.shape, bad_indexes.dtype, bad_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(20))

We can use PyTorch's advanced indexing to filter `data` with this boolean mask.

In [45]:
bad_data = data[bad_indexes]
bad_data.shape

torch.Size([20, 11])
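
Boolean-mask indexing is easier to follow on a toy example. The scores and features below are invented; the mask keeps exactly the rows where the condition holds.

```python
import torch

scores = torch.tensor([3, 6, 2, 7])
features = torch.tensor([[0.1, 1.0],
                         [0.2, 2.0],
                         [0.3, 3.0],
                         [0.4, 4.0]])

mask = scores <= 3            # boolean tensor, True for rows 0 and 2
selected = features[mask]     # keeps only the rows where mask is True
```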

Get information from different groups of wine.

In [0]:
bad_data = data[target <= 3]
mid_data = data[(target > 3) & (target < 7)]
good_data = data[target >= 7]

In [0]:
bad_mean = torch.mean(bad_data, dim=0)
mid_mean = torch.mean(mid_data, dim=0)
good_mean = torch.mean(good_data, dim=0)

In [64]:
print("{:2} {:20} {:^6} {:^6} {:^6}".format("No", "Columns", "Bad", "Mid", "Good"))
for i, args in enumerate(zip(col_list, bad_mean, mid_mean, good_mean)):
  print("{:2} {:20} {:6.2f} {:6.2f} {:6.2f}".format(i, *args))

No Columns               Bad    Mid    Good 
 0 fixed acidity          7.60   6.89   6.73
 1 volatile acidity       0.33   0.28   0.27
 2 citric acid            0.34   0.34   0.33
 3 residual sugar         6.39   6.71   5.26
 4 chlorides              0.05   0.05   0.04
 5 free sulfur dioxide   53.33  35.42  34.55
 6 total sulfur dioxide 170.60 141.83 125.25
 7 density                0.99   0.99   0.99
 8 pH                     3.19   3.18   3.22
 9 sulphates              0.47   0.49   0.50
10 alcohol               10.34  10.26  11.42


Here we use a threshold on total sulfur dioxide, set to the mean of the mid-quality group (141.83), to discriminate good wines from bad ones: wines below the threshold are predicted to be good.

In [65]:
total_sulfur_threshold = 141.83
total_sulfur_data = data[:, 6]
predicted_indexes = torch.lt(total_sulfur_data, total_sulfur_threshold)

predicted_indexes.shape, predicted_indexes.dtype, predicted_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(2727))

Get the indexes of the wines that are actually good (score above 5).

In [67]:
actual_indexes = target > 5

actual_indexes.shape, actual_indexes.dtype, actual_indexes.sum()

(torch.Size([4898]), torch.bool, tensor(3258))

With the sulfur dioxide threshold we predict 2727 good wines, while 3258 wines are actually good, about 500 more than predicted.

Now we need to see how well our predictions line up with the actual rankings. We will perform a logical "and" between our prediction indexes and the actual good indexes, and use that intersection of wines-in-agreement to determine how well we did.

In [0]:
n_matches = torch.sum(actual_indexes & predicted_indexes).item()
n_predicted = torch.sum(predicted_indexes).item()
n_actual = torch.sum(actual_indexes).item()

In [71]:
n_matches, n_matches / n_predicted, n_matches / n_actual

(2018, 0.74000733406674, 0.6193984039287906)

We got 2018 wines right. Since we predicted 2727 wines to be good, this gives us a 74% chance that a wine we predict to be high quality actually is. Unfortunately, there are 3258 good wines, and we identified only about 62% of them. That is only slightly better than random.
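
The two ratios computed above are commonly called precision (matches over predicted) and recall (matches over actual). The sketch below reproduces the computation on five made-up wines; the names `precision` and `recall` are introduced here and do not appear in the notebook.

```python
import torch

predicted = torch.tensor([True, True, False, True, False])
actual = torch.tensor([True, False, True, True, True])

n_matches = torch.sum(predicted & actual).item()      # 2 wines flagged and actually good
precision = n_matches / torch.sum(predicted).item()   # 2 / 3
recall = n_matches / torch.sum(actual).item()         # 2 / 4
```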

# Time series data

In [74]:
bikes_numpy = np.loadtxt(drive_dir + "p1ch4/bike-sharing-dataset/hour-fixed.csv",
                         dtype=np.float32,
                         delimiter=",",
                         skiprows=1,
                         converters={1: lambda x: float(x[8:10])}) # Convert date strings to numbers corresponding to the day of the month in column 1.

bikes = torch.from_numpy(bikes_numpy)
bikes

tensor([[1.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 3.0000e+00, 1.3000e+01,
         1.6000e+01],
        [2.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 8.0000e+00, 3.2000e+01,
         4.0000e+01],
        [3.0000e+00, 1.0000e+00, 1.0000e+00,  ..., 5.0000e+00, 2.7000e+01,
         3.2000e+01],
        ...,
        [1.7377e+04, 3.1000e+01, 1.0000e+00,  ..., 7.0000e+00, 8.3000e+01,
         9.0000e+01],
        [1.7378e+04, 3.1000e+01, 1.0000e+00,  ..., 1.3000e+01, 4.8000e+01,
         6.1000e+01],
        [1.7379e+04, 3.1000e+01, 1.0000e+00,  ..., 1.2000e+01, 3.7000e+01,
         4.9000e+01]])

In [77]:
bikes.shape, bikes.stride()

(torch.Size([17520, 17]), (17, 1))

Right now, the data holds 17520 hourly rows across 17 columns. We will reshape it to have three axes: day, hour, and then our 17 columns.

In [78]:
# -1 acts as a placeholder for "however many indexes are left" given the other dimensions and the original number of elements
daily_bikes = bikes.view(-1, 24, bikes.shape[1]) 
daily_bikes.shape, daily_bikes.stride()

(torch.Size([730, 24, 17]), (408, 17, 1))

Transpose `daily_bikes` to get the `N x C x L` ordering, where `C` is the 17 columns and `L` is the 24 hours of the day.

In [80]:
daily_bikes = daily_bikes.transpose(1, 2)
daily_bikes.shape, daily_bikes.stride()


(torch.Size([730, 17, 24]), (408, 1, 17))
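
The stride changes shown above can be reproduced on a small synthetic tensor (2 days, 3 columns; the sizes are chosen only for illustration). The sketch suggests that both `view` and `transpose` reinterpret the same storage instead of copying it.

```python
import torch

# Synthetic stand-in: 2 days x 24 hours of readings with 3 columns each.
hourly = torch.arange(2 * 24 * 3, dtype=torch.float32).view(48, 3)

daily = hourly.view(-1, 24, hourly.shape[1])   # N x L x C
print(daily.shape, daily.stride())             # torch.Size([2, 24, 3]) (72, 3, 1)

daily_t = daily.transpose(1, 2)                # N x C x L
print(daily_t.shape, daily_t.stride())         # torch.Size([2, 3, 24]) (72, 1, 3)

# Both results are views over the same storage; transpose only swaps two strides.
same_storage = daily_t.data_ptr() == hourly.data_ptr()
```

After the transpose the tensor is no longer contiguous, so a later `view` on it may require calling `contiguous()` first.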

To make the data easier to render, we use only the first day.

In [0]:
first_day = bikes[:24].long()

In [0]:
weather_onehot = torch.zeros(first_day.shape[0], 4)

In [85]:
first_day[:, 9]

tensor([1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2])

Scatter ones into our matrix according to the weather level in each row. As with the wine labels, we use `unsqueeze` to add a singleton dimension; we also subtract 1 because the levels run from 1 to 4 while column indexes start at 0.

In [88]:
weather_onehot.scatter_(dim=1,
                        index=first_day[:, 9].unsqueeze(1).long() - 1,
                        value=1.0)

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.]])
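
One possible next step, shown here only as a sketch on invented data, is to append the one-hot columns to the original columns with `torch.cat`. The tensor `day` and the names `levels` and `day_with_onehot` are hypothetical; the weather level is assumed to be the last column and to take values 1 to 4.

```python
import torch

# Synthetic stand-in for one day: 4 hourly rows, 2 feature columns plus a weather level in 1..4.
day = torch.tensor([[0.5, 0.1, 1.0],
                    [0.6, 0.2, 2.0],
                    [0.7, 0.3, 2.0],
                    [0.8, 0.4, 3.0]])

levels = day[:, 2].long() - 1                  # shift levels 1..4 to column indexes 0..3
weather_onehot = torch.zeros(day.shape[0], 4)
weather_onehot.scatter_(dim=1, index=levels.unsqueeze(1), value=1.0)

# Append the four one-hot columns to the original columns.
day_with_onehot = torch.cat((day, weather_onehot), dim=1)
```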