# Basic Data Handling

This lab will cover the basics of Pandas, NumPy and Jax, and how they can be used to load data into PyTorch to train models. 

While teaching Scientific Python is outside the scope of the course, being able to load and preprocess data is an essential aspect of training and deploying deep learning models.

There are plenty of resources online that cover this topic in much more detail such as:

https://github.com/guiwitz/NumpyPandas_course
https://docs.jax.dev/en/latest/notebooks/thinking_in_jax.html

If you are unfamiliar with these packages, it is very much worth your while spending time getting to grips with them.

## NumPy (https://numpy.org/)

Described as 'the fundamental package for scientific computing with Python'.



In [None]:
import numpy as np

The main purpose of NumPy is to allow us to perform mathematical operations easily and efficiently over multi-dimensional arrays

### Python Lists

In [None]:
simple_list = [1,2,3,4,5]

In [None]:
print(simple_list[0])

In [None]:
print(simple_list[0:3])

In [None]:
###### EXERCISE ######
# create a new list from simple_list, with values double that of simple_list
new_list = []

# .. your code here

In [None]:
simple_2d_list = [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]]

In [None]:
print(simple_2d_list[0])

In [None]:
print(simple_2d_list[0][0:3])

In [None]:
###### EXERCISE ######
# Is it possible to slice the 2D list, to get the first 3 elements of the first 2
# rows in a new 2D list?

small_slice = None  # ... your code

### Creating NumPy Arrays


In [None]:
np_1d_list = np.array(simple_list)
np_1d_list

In [None]:
np_2d_list = np.array(simple_2d_list)
np_2d_list

In [None]:
np.zeros(5)

In [None]:
np.zeros((2,2))

When creating NumPy arrays, it is also possible to declare the data type (dtype) that they will contain

In [None]:
np_1d_list = np.array(simple_list, dtype=np.float32)
np_1d_list

In [None]:
np_1d_list = np.array(simple_list, dtype=np.complex128)
np_1d_list

In [None]:
np_1d_list = np.array(simple_list, dtype=str)
np_1d_list

In [None]:
np_1d_list = np.array(simple_list, dtype=np.int16)
np_1d_list
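
As a small illustration of why the dtype matters, the sketch below (using made-up values, not the lab data) shows that the dtype fixes the storage size of each element, and that `astype` returns a converted copy rather than changing the array in place.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration
ints = np.array([1, 2, 3], dtype=np.int16)

# Each int16 element occupies 2 bytes
bytes_per_int = ints.itemsize

# astype returns a new array; `ints` keeps its original dtype
floats = ints.astype(np.float32)
bytes_per_float = floats.itemsize

# True division of an integer array produces a floating-point result
halves = ints / 2

print(ints.dtype, bytes_per_int)
print(floats.dtype, bytes_per_float)
print(halves)
```

Choosing a smaller dtype reduces memory use, at the cost of a narrower range of representable values.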

### Indexing and Slicing

Selecting or indexing elements from NumPy arrays is similar to doing so with standard Python lists, but with much more flexibility

In [None]:
np_1d_list[0]

In [None]:
np_1d_list[0:2]

In [None]:
np_2d_list[0,2]

In [None]:
np_2d_list[0:2,0:3]

In [None]:
elements_selected = np.array([True,False,False,True,True])
np_1d_list[elements_selected]
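
A Boolean mask like `elements_selected` is usually not typed by hand but computed from a condition on the array itself. A minimal sketch, using an illustrative array `values` that is not part of the lab data:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])  # illustrative data

# Comparing an array with a scalar yields a Boolean array of the same shape
mask = values > 2

# Indexing with the mask keeps only the elements where the mask is True
selected = values[mask]

# Conditions can be combined element-wise with & (and) and | (or);
# the parentheses are required because & binds tighter than comparisons
middle = values[(values > 1) & (values < 5)]

print(mask)
print(selected)
print(middle)
```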

### Mathematical Operations

In [None]:
np_1d_list * 5

In [None]:
np_1d_list / 2

In [None]:
np.exp(np_1d_list)

In [None]:
np.max(np_1d_list)

In [None]:
second_np_1d_list = np.array([10,20,30,40,50])

In [None]:
np_1d_list+second_np_1d_list


These same operations work with higher dimensional arrays

In [None]:
np_2d_list+2

What do you expect the outcome of multiplying a 1D array with a 2D array will be?

In [None]:
np_2d_list * np_1d_list
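
The multiplication above works because of NumPy's broadcasting rules: shapes are compared from the last dimension backwards, and two dimensions are compatible if they are equal or if one of them is 1. The sketch below uses freshly created arrays (the names `matrix`, `row` and `col` are illustrative) to show one case that broadcasts, one that does not, and a way to fix the latter.

```python
import numpy as np

matrix = np.arange(1, 16).reshape(3, 5)     # shape (3, 5)
row = np.array([1, 10, 100, 1000, 10000])   # shape (5,)
col = np.array([1, 2, 3])                   # shape (3,)

# (3, 5) * (5,): the trailing dimensions match, so `row` is applied to every row
scaled_columns = matrix * row

# (3, 5) * (3,): the trailing dimensions are 5 and 3, so this raises ValueError
try:
    matrix * col
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Adding an axis gives shape (3, 1), which broadcasts across the columns
scaled_rows = matrix * col[:, np.newaxis]

print(scaled_columns.shape, broadcast_failed, scaled_rows.shape)
```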

### Linear Algebra in NumPy

Basic linear algebraic operations can also be performed in NumPy

Matrix-vector multiplication

In [None]:
np_2d_list @ np_1d_list

This also works for matrix-matrix multiplication

In [None]:
a = np.array([[1,0],[0,1]])
b = np.array([[4,1],[2,2]])

a@b

**EXERCISE**

Why does the following code fail?

`np_1d_list @ np_2d_list`

#### Shapes

NumPy arrays have some very useful attributes - particularly size, shape and dtype

1. Size tracks the number of scalar values contained within the array (and any sub-arrays)
2. Shape contains the size of each dimension of the array - e.g. shape=(3,4) corresponds to a 3x4 matrix

In [None]:
np_2d_list.shape

In [None]:
np_2d_list.size

In [None]:
np_1d_list.shape

This becomes even more important when working with tensors:

In [None]:
np_tensor_a = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
np_tensor_a

In [None]:
np_tensor_a.shape

It is possible to transpose an array (swap its rows and columns or, more generally, reverse the order of its axes) with `.T`

In [None]:
np_2d_list.T

In [None]:
np_1d_list.T

In [None]:
np_tensor_a.T
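
For arrays with more than two dimensions, `.T` reverses the order of all axes. A short sketch with a zero-filled array (illustrative only) makes the effect on the shape explicit, and shows `np.transpose` for choosing a specific axis order:

```python
import numpy as np

tensor = np.zeros((2, 3, 4))

# .T reverses the axis order: (2, 3, 4) becomes (4, 3, 2)
reversed_shape = tensor.T.shape

# np.transpose accepts an explicit permutation of the axes
swapped_last_two = np.transpose(tensor, (0, 2, 1)).shape

# A 1D array has a single axis, so transposing leaves it unchanged
vector = np.array([1, 2, 3])
vector_t_shape = vector.T.shape

print(reversed_shape, swapped_last_two, vector_t_shape)
```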

**EXERCISE**

What are the requirements (in terms of shape) for two matrices to be multiplicable?

#### Re-shaping

Given a NumPy array, it is possible to change its shape - provided that the total number of elements matches between the original and new shapes

In [None]:
flat_list = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
flat_list

In [None]:
flat_list.reshape( (2,6) )

In [None]:
flat_list.reshape( (2,3,2) )

In [None]:
flat_list.reshape( (12,1) )

In [None]:
square_mat = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
square_mat

In [None]:
square_mat.reshape( (12,) )

In [None]:
square_mat.reshape( (1,12) )

In [None]:
square_mat.reshape( (1,6,2) )
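
Two details of `reshape` are worth a quick sketch: one dimension may be given as `-1`, in which case NumPy infers it from the total number of elements, and a shape with a mismatched element count raises an error. The array `values` below is illustrative.

```python
import numpy as np

values = np.arange(12)  # 12 elements: 0, 1, ..., 11

# -1 asks NumPy to infer this dimension: 12 / 3 = 4 columns
inferred = values.reshape((3, -1))

# 12 elements cannot fill a 5 x 3 array, so this raises ValueError
try:
    values.reshape((5, 3))
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(inferred.shape, reshape_failed)
```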

## Pandas (https://pandas.pydata.org/)

Pandas is a powerful data analysis and manipulation package built on top of NumPy, that operates on tabular (relational) data. This is an extremely powerful tool, especially for dealing with large and unprocessed datasets. It supports SQL-style queries, and is integrated with PySpark for large-scale processing over distributed databases.


In [None]:
import pandas as pd

The basic data structures of Pandas are the Series and DataFrame

In [None]:
data = np.array([2,3,62,1,2])
series1 = pd.Series(data, name="Example Series 1")

In [None]:
data2 = np.log(data)
series2 = pd.Series(data2, name="Example Series 2")
series2

In [None]:
df = pd.DataFrame({"Series 1": series1, "Series 2": series2})
df

In [None]:
df.describe()

Pandas supports many useful operations over DataFrames

1. Indexing over rows and columns
2. Simple summary statistics can be calculated over the columns
3. Transformations can be applied to different columns
4. SQL-like queries over the whole DataFrame, including grouping
5. SQL-like merge and intersection operations between DataFrames
6. And more ...

In [None]:
df.iloc[0:2,0]

In [None]:
df['Series 1'].sum()

In [None]:
df['Series 1'].apply(lambda x: "High" if x>6 else "Low")

In [None]:
df[(df["Series 1"] > 5) & (df["Series 2"] > 0)]

In [None]:
df["Series 1"].to_numpy()

In [None]:
df.to_numpy()
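
The cells above cover indexing, summary statistics, transformations and filtering, but not grouping or merging. The following sketch illustrates both on two small made-up tables (`orders` and `customers` are hypothetical and unrelated to the lab data).

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "amount": [10.0, 20.0, 30.0, 40.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Cork", "Dublin", "Cork"],
})

# SQL-like JOIN: attach each customer's city to their orders
merged = orders.merge(customers, on="customer_id", how="inner")

# SQL-like GROUP BY: total order amount per city
totals = merged.groupby("city")["amount"].sum()

print(merged)
print(totals)
```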

### Reading from CSV

In [None]:
!wget https://github.com/datasciencedojo/datasets/raw/refs/heads/master/titanic.csv
!wget https://github.com/andrew-nash/CS6421-labs-2026/raw/refs/heads/main/titanic_test.csv
!wget https://github.com/andrew-nash/CS6421-labs-2026/raw/refs/heads/main/titanic_train.csv

In [None]:
titanic_df = pd.read_csv("titanic.csv")

In [None]:
titanic_df

In [None]:
selected_columns = ["Pclass","Fare"]
titanic_df[selected_columns]

In [None]:
######## EXERCISE
# Find the average fare for first class passengers

In [None]:
######## EXERCISE
# Get a NumPy array containing the ticket class of passengers under 18 who did not survive the sinking

In [None]:
######## ADVANCED EXERCISE
# find the number of each class in this array
# hint:  https://numpy.org/doc/stable/reference/routines.html

# Jax

Ref. https://docs.jax.dev/en/latest/notebooks/thinking_in_jax.html

At the most superficial level, Jax can be considered a hardware-accelerated re-implementation of NumPy (along with some components of SciPy). Amongst other things, it features JIT compilation and auto-differentiation.



In [None]:
import jax.numpy as jnp

Arrays can be created similarly to NumPy

In [None]:
l = [1,2,3,5]

jnp_array = jnp.array(l)
jnp_array

In [None]:
jnp.zeros((2,5))

In [None]:
x_jnp = jnp.linspace(0, 10, 1000)
print(x_jnp[0:10])

Mathematical operations can also be performed similarly, as well as operations between arrays

In [None]:
jnp.exp(jnp_array)

In [None]:
2 * jnp.sin(x_jnp) * jnp.cos(x_jnp)

In [None]:
jnp_array@jnp_array.T

By default, Jax compiles and executes each command individually in sequence. For better performance, Jax allows us to combine operations into functions which can then be optimized and JIT compiled together.

In [None]:
from jax import jit 

In [None]:
def norm(X):
    X = X - X.mean(0)
    return X / X.std(0)

In [None]:
compiled_norm = jit(norm)

In [None]:
np.random.seed(1701)
X = jnp.array(np.random.rand(10, 2))
np.allclose(norm(X), compiled_norm(X), atol=1E-6)


Key Caveats/Gotchas when using Jax, compared to NumPy:

1. JIT compiled Jax functions must be "functionally pure", i.e. the output is dependent only on the function arguments, with no side-effects. Such functions cannot read or update global variables, or otherwise interact with any object or variable outside the scope of the function. **This includes performing `print()` statements**. Further, JIT functions require the shapes of all arrays to be static, and known at compile time.


In [None]:
g = 0.
def impure_uses_globals(x):
    return x + g

# JAX captures the value of the global during the first run
print ("First call: ", jit(impure_uses_globals)(4.))
g = 10.  # Update the global

# Subsequent runs may silently use the cached value of the globals
print ("Second call: ", jit(impure_uses_globals)(5.))

# JAX re-runs the Python function when the type or shape of the argument changes
# This will end up reading the latest value of the global
print ("Third call, different type: ", jit(impure_uses_globals)(jnp.array([4.])))

In [None]:
def get_negatives(x):
    return x[x < 0]

x = jnp.array(np.random.randn(10))
get_negatives(x)

In [None]:
x = jnp.array(np.random.randn(10))
get_negatives(x)

Observe how, without knowing the specific values of `x`, it is impossible to determine the output shape of this function, and therefore it cannot be JIT compiled.
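
One common workaround, sketched below, is to keep the output the same shape as the input and use `jnp.where` to replace the unwanted entries instead of removing them. The function name `mask_non_negatives` and the fill value of 0 are illustrative choices, not part of the original lab.

```python
import jax.numpy as jnp
from jax import jit

@jit
def mask_non_negatives(x):
    # The output has the same shape as x, so the shape is known at compile time
    return jnp.where(x < 0, x, 0.0)

x = jnp.array([-1.5, 2.0, -0.5, 3.0])
masked = mask_non_negatives(x)
print(masked)
```

Whether a fill value is acceptable depends on what is done with the result afterwards; a sum over the negatives is unaffected by zeros, for instance, while a mean would not be.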


2. Jax arrays (unlike NumPy) are immutable.


In [None]:
jax_array = jnp.zeros((3,3), dtype=jnp.float32)

# In place update of JAX's array will yield an error!
jax_array[1, :] = 1.0

Instead, if in place updates are required, Jax provides a functional approach:


In [None]:
updated_array = jax_array.at[1, :].set(1.0)
print("updated array:\n", updated_array)

Be careful when using `+=` and similar contractions in Jax. Bear in mind that unlike NumPy, these are not in-place updates, but involve re-creating the entire array. It is better practice to use Jax's functional API.

In [None]:
updated_array += 1.0
print("updated array:\n", updated_array)
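
To make the difference concrete, the sketch below (with illustrative variable names) shows that `+=` on a Jax array rebinds the name to a new array and leaves any other reference to the original untouched, whereas the same operation on a NumPy array modifies the shared buffer.

```python
import numpy as np
import jax.numpy as jnp

# NumPy: += updates the array in place, so the alias and the original agree
np_original = np.zeros(3)
np_alias = np_original
np_alias += 1.0

# Jax: += creates a new array and rebinds the name; the original is unchanged
jax_original = jnp.zeros(3)
jax_alias = jax_original
jax_alias += 1.0

# The functional equivalent makes the creation of a new array explicit
jax_added = jax_original.at[:].add(1.0)

print(np_original)   # modified through the alias
print(jax_original)  # still all zeros
print(jax_added)
```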

# PyTorch

Ref. https://docs.pytorch.org/tutorials/beginner/basics/intro.html

In [None]:
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

## Tensor

Ref. https://docs.pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#bridge-to-np-label

In PyTorch, the main underlying structure in which basic data is stored is the Tensor. These are very similar to NumPy arrays -- to the point that a PyTorch Tensor and NumPy array can in some instances share the same location in memory. The key advantage of the PyTorch Tensors is that, similarly to Jax arrays, they can run on GPU (and other accelerator hardware).

Tensors can be created from data, random or constant values, or from NumPy arrays. In practice, I would recommend creating Tensors from constant values where that suffices, and from a NumPy array in all other situations.

In [None]:
data = [[1, 2], [3, 4]]
x_data = torch.tensor(data)

In [None]:
shape = (2, 3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

In [None]:
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
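
The shared-memory behaviour mentioned above can be checked directly. In this sketch (array names are illustrative), `torch.from_numpy` creates a CPU tensor that shares its buffer with the NumPy array, while `torch.tensor` makes an independent copy.

```python
import numpy as np
import torch

source = np.array([[1, 2], [3, 4]])

shared = torch.from_numpy(source)  # shares memory with `source`
copied = torch.tensor(source)      # copies the data

# Modify the NumPy array in place
source[0, 0] = 100

print(shared[0, 0].item())  # reflects the change
print(copied[0, 0].item())  # keeps the original value
```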

Similarly to Jax and NumPy, PyTorch supports a number of operations on Tensors. If you wish to perform these operations on a GPU (including running and training a DL model), however, you must instruct PyTorch to store the Tensor on the GPU.

In [None]:
x_np = x_np.to('cuda')

We will see this in practice shortly.
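
Hard-coding `'cuda'` raises an error on a machine without a CUDA GPU. A common pattern, sketched here with an illustrative `device` variable, is to pick the device once and pass it to `.to(...)`, so that the same code also runs on CPU.

```python
import torch

# Fall back to the CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.ones((2, 2)).to(device)
doubled = x * 2  # executes on whichever device holds x

print(doubled.device.type)
```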

## Datasets

Given that we have pre-processed our data, and are ready to train some deep learning models, we need to load them into a format that is compatible with PyTorch's own operations.

PyTorch is designed to train over user-defined `Dataset` objects, which implement a `__len__` function returning the number of rows, and a `__getitem__` function. From the latter, we return one row (X,Y) of data.
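
Before the full Titanic example, a stripped-down sketch may help show the interface. The class `ArrayDataset` below is hypothetical: it wraps two NumPy arrays and implements the `__len__` and `__getitem__` methods that a map-style `Dataset` is expected to provide.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Hypothetical minimal dataset wrapping a feature matrix and a label vector."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of rows available
        return len(self.features)

    def __getitem__(self, idx):
        # Return one (X, y) row
        return self.features[idx], self.labels[idx]

toy_features = np.arange(12, dtype=np.float32).reshape(6, 2)
toy_labels = np.array([0, 1, 0, 1, 0, 1])
toy_dataset = ArrayDataset(toy_features, toy_labels)

# The DataLoader stacks individual rows into batches of tensors
first_X, first_y = next(iter(DataLoader(toy_dataset, batch_size=4)))
print(len(toy_dataset), first_X.shape, first_y.shape)
```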

In the following example, we will use Pandas and NumPy to load and pre-process the Titanic dataset. We are going to consider as factors: Age, Sex, Ticket Class, Fare and Embarkation Location


Ref. https://medium.com/@akashgajjar8/titanic-survival-prediction-using-pytorch-a5b9fb4eca53, https://medium.com/@aisgandy/predicting-titanic-survival-with-deep-learning-a-stripped-down-approach-7aa8cdb37c0e

In [None]:


class TitanicDataset (torch.utils.data.Dataset):
    # the Train argument defines whether the dataset is being queried for train or test data
    # In practice, you would likely be handling separate datasets for each
    def __init__(self, file_name, Train=True):
        self.dataframe = pd.read_csv(file_name)
        #print(self.dataframe.head())
        self.dataframe = self.dataframe.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
        self.dataframe = self.dataframe.drop(['SibSp', 'Parch'], axis=1)    

        self.dataframe = self.dataframe.dropna(subset=['Age', 'Embarked', 'Sex', 'Pclass', 'Fare'])
        
        # Instead of using strings ("Male" and "Female"), we need to convert these to numerical values -- in this 
        # case 1 for male, 0 for female
        self.dataframe['Male'] = np.where(self.dataframe['Sex'] == 'male', 1, 0)

        # Manual one-hot encoding for Embarked, using np.where

        # Embarked locations are: C = Cherbourg, Q = Queenstown, S = Southampton
        # Embarked_C = 1, if embarked from Cherbourg, 0 otherwise
        # Embarked_S = 1, if embarked from Southampton, 0 otherwise
        # Embarked_C = 0 and Embarked_S = 0, if embarked from Queenstown (now Cobh ...)
        self.dataframe['Embarked_C'] = np.where(self.dataframe['Embarked'] == 'C', 1, 0)
        self.dataframe['Embarked_S'] = np.where(self.dataframe['Embarked'] == 'S', 1, 0)

        # Remove original Sex and Embarked columns
        self.dataframe = self.dataframe.drop(['Sex', 'Embarked'], axis=1)

        # We can achieve the same one-hot encoding for Pclass using Pandas get_dummies function, instead of the 
        # manual np.where approach above
        self.dataframe[['Pclass_1', 'Pclass_2']] = pd.get_dummies(self.dataframe['Pclass'], prefix='Pclass').iloc[:, :2].astype(int)
        self.dataframe = self.dataframe.drop(['Pclass'], axis=1)


        # Normalisation
        self.dataframe['Age_N'] = self.dataframe['Age']/self.dataframe['Age'].max()

        # An example of a log transform
        self.dataframe['log_Fare'] = np.log10(self.dataframe['Fare'] + 1)
        self.dataframe = self.dataframe.drop(['Age', 'Fare'], axis=1)
        
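        # reset_index returns a new DataFrame, so the result must be assigned;
        # drop=True avoids adding the old index as an extra feature column
        self.dataframe = self.dataframe.reset_index(drop=True)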
        self.Train = Train
        
    def __len__(self):
        return self.dataframe.shape[0]
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # Bear in mind that the test dataset will not have a Survived column
        if self.Train:
            # Survived is the first column; the remaining columns are the features
            survived = self.dataframe['Survived'].to_numpy()[idx]
            features = self.dataframe.iloc[idx, 1:].to_numpy()
            return (features, survived)

        features = self.dataframe.iloc[idx, :].to_numpy()
        return features

In [None]:
training_data = TitanicDataset('titanic_train.csv')
testing_data = TitanicDataset('titanic_test.csv', Train=False)

In [None]:
training_data[6]

In [None]:
testing_data[4]
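
The two one-hot encoding approaches used in `TitanicDataset` can be compared side by side on a toy column. In this sketch the DataFrame `toy` is made up; it shows that dropping one category loses no information, since the dropped category corresponds to all remaining indicator columns being 0.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Embarked": ["C", "S", "Q", "S"]})

# Manual encoding with np.where, one indicator column per kept category
toy["Embarked_C"] = np.where(toy["Embarked"] == "C", 1, 0)
toy["Embarked_S"] = np.where(toy["Embarked"] == "S", 1, 0)

# get_dummies creates one column per category, sorted by category value
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked").astype(int)

# The manual columns match the corresponding get_dummies columns
same_C = bool((toy["Embarked_C"] == dummies["Embarked_C"]).all())
same_S = bool((toy["Embarked_S"] == dummies["Embarked_S"]).all())

# Queenstown ("Q") rows are exactly those where both kept indicators are 0
is_q = (toy["Embarked_C"] == 0) & (toy["Embarked_S"] == 0)

print(dummies.columns.tolist(), same_C, same_S, is_q.tolist())
```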

## Defining the PyTorch Model

In [None]:
from torch import nn
import torch.nn.functional as F

class SimpleFeedforward(nn.Module):
    
    def __init__(self):
        super(SimpleFeedforward, self).__init__()
        
        # Defines our modules/layers, in this case all
        # linear layers of the form y = Ax + b
        
        # The number of inputs must match our features
        self.dl1 = nn.Linear(7, 4)
        self.dl2 = nn.Linear(4, 5)
        self.dl3 = nn.Linear(5, 1)
        
        # finish with a sigmoid layer, to generate an 
        # output between 0 and 1        
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        # Apply forward propagation on these layers
        x = F.relu(self.dl1(x))
        x = F.relu(self.dl2(x))
        x = self.sigmoid(self.dl3(x))
        return x

In [None]:
from torch import optim

Since we have been given access to a machine with high-power GPU capabilities, we might as well make sure that our model is executing on it.

In [None]:
torch.accelerator.is_available()

In [None]:
torch.accelerator.current_accelerator().type

If the above outputs are True and 'cuda', then PyTorch is configured correctly to work with the GPU. When creating an instance of our model, we must make sure to specify that it should execute on the GPU.

In [None]:
model = SimpleFeedforward().to('cuda')

In [None]:
print(model)
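
To relate the printed summary to the layer definitions, the number of trainable parameters can be counted by hand: an `nn.Linear(n_in, n_out)` layer holds `n_in * n_out` weights plus `n_out` biases. The sketch below rebuilds the same three layer sizes with `nn.Sequential` (an equivalent stand-in, used only to keep the example self-contained) and checks the total.

```python
from torch import nn

# Same layer sizes as SimpleFeedforward: 7 -> 4 -> 5 -> 1
stand_in = nn.Sequential(
    nn.Linear(7, 4), nn.ReLU(),
    nn.Linear(4, 5), nn.ReLU(),
    nn.Linear(5, 1), nn.Sigmoid(),
)

# Count every trainable weight and bias
counted = sum(p.numel() for p in stand_in.parameters() if p.requires_grad)

# Hand calculation: (7*4 + 4) + (4*5 + 5) + (5*1 + 1) = 32 + 25 + 6
expected = (7 * 4 + 4) + (4 * 5 + 5) + (5 * 1 + 1)

print(counted, expected)
```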

We can execute the feedforward phase simply by passing data to the model.

We first need to convert the data into a PyTorch tensor (of type float, which in PyTorch corresponds to a 32-bit float), and specify that it should be stored on the GPU.

In [None]:
X = torch.from_numpy(training_data[0][0]).float().to('cuda')
X

In [None]:
model(X)

## Defining the Backpropagation and Gradient Updating

### Hyper-Parameters

Before we can perform Backprop, we must define our hyper-parameters, loss function and optimizer

In [None]:
learning_rate = 1e-3
batch_size = 64
epochs = 5

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Further, we must wrap our Dataset in a PyTorch DataLoader. Before we do so, it can be worthwhile to split our train dataset into train and validation sets, which allows us to better assess model performance:

In [None]:
sub_train_dataset, val_dataset = torch.utils.data.random_split(training_data, [0.8,0.2])
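
`random_split` draws a different partition each time it is called. If a repeatable split is wanted, a seeded `torch.Generator` can be passed in, as in this sketch on a small stand-in dataset (the seed value and the `TensorDataset` contents are arbitrary). Fractional lengths such as `[0.8, 0.2]` assume a reasonably recent PyTorch release.

```python
import torch
from torch.utils.data import TensorDataset, random_split

toy_dataset = TensorDataset(torch.arange(10))

# Two splits with the same seed select the same indices
train_a, val_a = random_split(
    toy_dataset, [0.8, 0.2], generator=torch.Generator().manual_seed(0)
)
train_b, val_b = random_split(
    toy_dataset, [0.8, 0.2], generator=torch.Generator().manual_seed(0)
)

same_split = list(train_a.indices) == list(train_b.indices)
print(len(train_a), len(val_a), same_split)
```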

We can now wrap these with DataLoaders 

In [None]:
train_dataloader = DataLoader(sub_train_dataset, batch_size=batch_size)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)

We can then define our model training loop, which:

1. Performs forward propagation
2. Performs backward propagation to compute the weight and bias gradients
3. Applies weight and bias updates 

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X = X.float().to('cuda')
        
        # add a dummy dimension to the outputs to match the model 
        # output
        y = torch.reshape(y.float().to('cuda'), (-1, 1))
        # Compute prediction and loss
        pred = model(X)
        
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * batch_size + len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

In [None]:
def val_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            X = X.float().to('cuda')
            y = torch.reshape(y.float().to('cuda'), (-1, 1))
            
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (torch.round(pred)==y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Finally, all that remains to train the model is to invoke these functions:

In [None]:
epochs = 10
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    val_loop(val_dataloader, model, loss_fn)
print("Done!")

# Evolving the Models

The performance of this model is obviously very poor. We can increase the complexity of the model, and observe the difference in performance:

In [None]:
from torch import nn
import torch.nn.functional as F

class ComplexFeedforward(nn.Module):
    
    def __init__(self):
        super(ComplexFeedforward, self).__init__()
        
        # Defines our layers, all linear layers
        # of the form y = Ax + b
        
        # The number of inputs must match our features
        self.dl1 = nn.Linear(7, 32)
        self.dl2 = nn.Linear(32, 64)
        self.dl3 = nn.Linear(64, 128)
        self.dl4 = nn.Linear(128, 64)
        self.dl5 = nn.Linear(64, 1)
        
        # finish with a sigmoid layer, to generate a 
        # output between 0 and 1        
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        # Apply forward propagation on these layers
        x = F.relu(self.dl1(x))
        x = F.relu(self.dl2(x))
        x = F.relu(self.dl3(x))
        x = F.relu(self.dl4(x))
        x = self.sigmoid(self.dl5(x))
        return x

In [None]:
model = ComplexFeedforward().to('cuda')

learning_rate = 1e-3
batch_size = 64
epochs = 100
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)


for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    val_loop(val_dataloader, model, loss_fn)
print("Done!")