# Basic Python Data Manipulation

<a target="_blank" href="https://colab.research.google.com/github/andrew-nash/CS6421-labs-2025/blob/main/CS6421_Lab_01.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This lab covers the basics of NumPy and Pandas, and gives a brief introduction to basic operations in TensorFlow.

While teaching Scientific Python is outside the scope of the course, we will touch on the use of these packages throughout the term.

There are plenty of resources online that cover this topic in much more detail such as:

https://github.com/guiwitz/NumpyPandas_course



## NumPy (https://numpy.org/)

Described as 'the fundamental package for scientific computing with Python'.



In [None]:
import numpy as np

The main purpose of NumPy is to allow us to perform mathematical operations easily and efficiently over multi-dimensional arrays.
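
As a rough illustration of what "easily" means here, the sketch below computes column sums of a small table twice: once with plain Python lists and once with a NumPy array. The data and variable names are made up for this example.

```python
import numpy as np

rows = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]

# Pure Python: column sums need an explicit loop over columns and rows
col_sums_loop = [sum(row[j] for row in rows) for j in range(3)]

# NumPy: one call, with the axis to reduce over given explicitly
arr = np.array(rows)
col_sums_np = arr.sum(axis=0)

print(col_sums_loop)
print(col_sums_np)
```

Both versions give the same numbers; the NumPy version also tends to be much faster on large arrays because the loop runs in compiled code.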

### Python Lists

In [None]:
simple_list = [1,2,3,4,5]

In [None]:
print(simple_list[0])

In [None]:
print(simple_list[0:3])

In [None]:
###### EXERCISE ######
# create a new list from simple_list, with values double that of simple_list
new_list = []

# .. your code here

In [None]:
simple_2d_list = [[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]]

In [None]:
print(simple_2d_list[0])

In [None]:
print(simple_2d_list[0][0:3])

In [None]:
###### EXERCISE ######
# Is it possible to slice the 2D list, to get the first 3 elements of the first 2
# rows in a new 2D list?

small_slice = None  # ... your code

### Creating NumPy Arrays


In [None]:
np_1d_list = np.array(simple_list)
np_1d_list

In [None]:
np_2d_list = np.array(simple_2d_list)
np_2d_list

In [None]:
np.zeros(5)

In [None]:
np.zeros((2,2))

When creating NumPy arrays, it is also possible to declare the data type (dtype) that they will contain.

In [None]:
np_1d_list = np.array(simple_list, dtype=np.float32)
np_1d_list

In [None]:
np_1d_list = np.array(simple_list, dtype=np.complex128)
np_1d_list

In [None]:
np_1d_list = np.array(simple_list, dtype=str)
np_1d_list

In [None]:
np_1d_list = np.array(simple_list, dtype=np.int16)
np_1d_list
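
The dtype also determines how much memory each element takes. A minimal sketch, using small made-up arrays, of inspecting and converting dtypes:

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int16)
print(ints.dtype, ints.itemsize)      # int16 2

# astype returns a converted copy; the original array is unchanged
floats = ints.astype(np.float64)
print(floats.dtype, floats.itemsize)  # float64 8

# Without an explicit dtype, NumPy infers one from the values
print(np.array([1, 2.5]).dtype)       # float64
```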

### Indexing and Slicing

Selecting elements of NumPy arrays is similar to selecting elements of standard Python lists, but with much more flexibility.

In [None]:
np_1d_list[0]

In [None]:
np_1d_list[0:2]

In [None]:
np_2d_list[0,2]

In [None]:
np_2d_list[0:2,0:3]

In [None]:
elements_selected = np.array([True,False,False,True,True])
np_1d_list[elements_selected]
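
Boolean masks like the one above are usually not typed by hand but produced by a comparison. A short sketch on a made-up array (`arr` is introduced here only for illustration):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# A comparison produces a boolean array of the same shape
mask = arr > 2
print(mask)

# Using it as an index keeps only the elements where the mask is True
print(arr[mask])     # [3 4 5]

# An integer list as index picks elements by position, in the given order
print(arr[[4, 0]])   # [5 1]
```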

### Mathematical Operations

In [None]:
np_1d_list * 5

In [None]:
np_1d_list / 2

In [None]:
np.exp(np_1d_list)

In [None]:
np.max(np_1d_list)

In [None]:
second_np_1d_list = np.array([10,20,30,40,50])

In [None]:
np_1d_list+second_np_1d_list


These same operations work with higher-dimensional arrays.

In [None]:
np_2d_list+2

What do you expect the outcome of multiplying a 2D array by a 1D array to be?

In [None]:
np_2d_list * np_1d_list
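
The behaviour seen here is called broadcasting: when the trailing dimensions match, the smaller array is reused along the missing dimension. A minimal sketch with made-up arrays `m`, `v` and `w`:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
v = np.array([10, 20, 30])     # shape (3,)

# v is treated as a row and multiplied into every row of m
row_scaled = m * v
print(row_scaled)

# Shapes that do not line up raise a ValueError
w = np.array([10, 20])         # shape (2,)
try:
    m * w
except ValueError as err:
    print("broadcast failed:", err)

# Adding an axis turns w into a column of shape (2, 1), which does broadcast
col_scaled = m * w[:, np.newaxis]
print(col_scaled)
```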

### Linear Algebra in NumPy

Basic linear algebraic operations can also be performed in NumPy

Vector-matrix multiplication

In [None]:
np_2d_list @ np_1d_list

This also works for matrix-matrix multiplication

In [None]:
a = np.array([[1,0],[0,1]])
b = np.array([[4,1],[2,2]])

a@b

**EXERCISE**

Why does the following code fail?

`np_1d_list @ np_2d_list`

#### Shapes

NumPy arrays have some very useful attributes - particularly size, shape and dtype

1. Size tracks the number of scalar values contained within the array (and any sub-arrays)
2. Shape contains the size of each dimension of the array - e.g. shape=(3,4) corresponds to a 3x4 matrix
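
The relationship between these attributes can be checked directly. A small sketch on a made-up array (`ndim`, the number of dimensions, is an extra attribute not mentioned above):

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.float32)

print(arr.shape)   # (3, 4)
print(arr.size)    # 12
print(arr.ndim)    # 2
print(arr.dtype)   # float32

# size is the product of the entries of shape
assert arr.size == np.prod(arr.shape)
```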

In [None]:
np_2d_list.shape

In [None]:
np_2d_list.size

In [None]:
np_1d_list.shape

This becomes even more important when working with tensors:

In [None]:
np_tensor_a = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
np_tensor_a

In [None]:
np_tensor_a.shape

It is possible to transpose an array (swap its rows and columns) with `.T`

In [None]:
np_2d_list.T

In [None]:
np_1d_list.T

In [None]:
np_tensor_a.T
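
To make the effect of `.T` on shapes explicit, here is a short sketch on made-up arrays of one, two and three dimensions:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)

# Element [i, j] of m becomes element [j, i] of m.T
print(m.T.shape)                   # (3, 2)
print(m[0, 2] == m.T[2, 0])        # True

# A 1D array has only one axis, so .T leaves it unchanged
v = np.array([1, 2, 3])
print(v.T.shape)                   # (3,)

# For higher-dimensional arrays, .T reverses the order of all axes
t = np.zeros((2, 3, 4))
print(t.T.shape)                   # (4, 3, 2)
```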

**EXERCISE**

What are the requirements (in terms of shape) for two matrices to be multiplicable?

#### Re-shaping

Given a NumPy array, it is possible to change its shape - provided that the total number of elements matches between the original and new shapes

In [None]:
flat_list = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
flat_list

In [None]:
flat_list.reshape( (2,6) )

In [None]:
flat_list.reshape( (2,3,2) )

In [None]:
flat_list.reshape( (12,1) )

In [None]:
square_mat = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
square_mat

In [None]:
square_mat.reshape( (12,) )

In [None]:
square_mat.reshape( (1,12) )

In [None]:
square_mat.reshape( (1,6,2) )
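
Two further points about reshaping, sketched on a made-up array `flat`: a shape with the wrong number of elements is rejected, and one dimension can be left for NumPy to infer by passing `-1`.

```python
import numpy as np

flat = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

# 12 elements cannot fill a 5x3 array, so NumPy raises a ValueError
try:
    flat.reshape((5, 3))
except ValueError as err:
    print("reshape failed:", err)

# One dimension may be given as -1; NumPy infers it from the total size
inferred = flat.reshape((3, -1))
print(inferred.shape)              # (3, 4)

# Elements are filled row by row (C order) by default
print(inferred[0])                 # [1 2 3 4]
```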

## Pandas (https://pandas.pydata.org/)

Pandas is a powerful data analysis and manipulation package built on top of NumPy, that targets tabular data


In [None]:
import pandas as pd

The basic data structures of Pandas are the Series and the DataFrame

In [None]:
data = np.array([2,3,62,1,2])
series1 = pd.Series(data, name="Example Series 1")

In [None]:
data2 = np.log(data)
series2 = pd.Series(data2, name="Example Series 2")
series2

In [None]:
df = pd.DataFrame({"Series 1": series1, "Series 2": series2})
df

In [None]:
df.describe()

Pandas supports many useful operations over DataFrames

1. Indexing over rows and columns
2. Simple summary statistics can be calculated over the columns
3. Transformations can be applied to different columns
4. SQL-like queries over the whole DataFrame, including grouping
5. SQL-like merge and intersection operations between DataFrames
6. And more ...

In [None]:
df.iloc[0:2,0]

In [None]:
df['Series 1'].sum()

In [None]:
df['Series 1'].apply(lambda x: "High" if x>6 else "Low")

In [None]:
df[(df["Series 1"] > 5) & (df["Series 2"] > 0)]
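
When combining conditions in a filter like this, each comparison needs its own parentheses, because Python's `&` binds more tightly than `>`. A minimal sketch on a small made-up DataFrame (`df_demo` and its columns are invented here):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [2, 3, 62, 1, 2],
                        "b": [0.5, -1.0, 4.0, 0.0, 2.0]})

# Each comparison is wrapped in its own parentheses before combining with &
both = df_demo[(df_demo["a"] > 1) & (df_demo["b"] > 0)]
print(both)

# | is element-wise "or", ~ is element-wise "not"
either = df_demo[(df_demo["a"] > 5) | ~(df_demo["b"] > 0)]
print(either)
```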

In [None]:
df["Series 1"].to_numpy()

In [None]:
df.to_numpy()
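
Grouping and merging (items 4 and 5 in the list above) are not shown in the cells so far. The sketch below illustrates both on toy data; the `sales` and `cities` tables and their contents are invented for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
sales = pd.DataFrame({
    "shop": ["A", "A", "B", "B", "C"],
    "amount": [10.0, 20.0, 5.0, 15.0, 8.0],
})
cities = pd.DataFrame({
    "shop": ["A", "B"],
    "city": ["Cork", "Dublin"],
})

# Grouping: mean amount per shop, similar to SQL GROUP BY
mean_per_shop = sales.groupby("shop")["amount"].mean()
print(mean_per_shop)

# Merging: inner join on the shared "shop" column, similar to SQL JOIN.
# Shop "C" has no match in cities, so its row is dropped.
merged = sales.merge(cities, on="shop", how="inner")
print(merged)
```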

### Reading from CSV

In [None]:
!wget https://github.com/datasciencedojo/datasets/raw/refs/heads/master/titanic.csv

In [None]:
titanic_df = pd.read_csv("titanic.csv")

In [None]:
titanic_df

In [None]:
selected_columns = ["Pclass","Fare"]
titanic_df[selected_columns]

In [None]:
######## EXERCISE
# Find the average fare for first class passengers

In [None]:
######## EXERCISE
# Get a NumPy array containing the ticket class of passengers under 18 who did not survive the sinking

In [None]:
######## ADVANCED EXERCISE
# find the number of each class in this array
# hint:  https://numpy.org/doc/stable/reference/routines.html

# Brief intro to TensorFlow (https://www.tensorflow.org)

TensorFlow is an end-to-end platform for machine learning.

## Tensors (https://www.tensorflow.org/guide/basics)

The basic data structure in TensorFlow is the `tf.Tensor`, which is very similar to the NumPy array (`np.ndarray`)

In [None]:
import tensorflow as tf

In [None]:
# An immutable Tensor
x = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])
# A mutable Tensor
vx = tf.Variable([[1., 2., 3.],
                 [4., 5., 6.]])


In [None]:
x

In [None]:
vx
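
A short sketch of how tensors relate to NumPy arrays, and of what "mutable" means for a `tf.Variable`. It assumes TensorFlow 2 with eager execution, which is the default:

```python
import numpy as np
import tensorflow as tf

x = tf.constant([[1., 2., 3.],
                 [4., 5., 6.]])

# Like NumPy arrays, tensors carry a shape and a dtype
print(x.shape)     # (2, 3)
print(x.dtype)

# .numpy() converts a tensor to a NumPy array
x_np = x.numpy()
print(type(x_np))

# A tf.Variable can be updated in place with assign; a tf.constant cannot
vx = tf.Variable([[1., 2., 3.],
                  [4., 5., 6.]])
vx.assign(vx * 2)
print(vx.numpy())
```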

## Mathematical operations

These can be performed in much the same way as in NumPy

In [None]:
x = tf.constant(1.75)
x*2

In [None]:
tf.exp(x)

In [None]:
A = tf.constant([[1,2,3],[4,5,6]])
B = tf.constant([[1,2,3,4],[5,6,7,8],[9,10,11,12]])

C=tf.matmul(A,B)
C

In [None]:
C.shape

## Auto-differentiation

One of the most important differences from NumPy is TensorFlow's ability to automatically differentiate user-defined functions

In [None]:
def f(x):
  y = x**2 + 2*x - 5
  return y

In [None]:
x = tf.Variable(2.0)

with tf.GradientTape() as tape:
  y = f(x)

g_x = tape.gradient(y, x)
g_x
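
One way to sanity-check a gradient like this is to compare it with the derivative worked out by hand, $f'(x) = 2x + 2$, and with a numerical approximation. The sketch below does this in plain Python, without TensorFlow; `f_prime`, `x0` and `h` are names introduced only for this check.

```python
def f(x):
    return x**2 + 2*x - 5

def f_prime(x):
    # Derivative worked out by hand: d/dx (x^2 + 2x - 5) = 2x + 2
    return 2*x + 2

x0 = 2.0
h = 1e-5

# Central finite difference approximation to the derivative at x0
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)

print(f_prime(x0))   # 6.0
print(numeric)       # approximately 6.0
```

Both values should agree with the gradient returned by the tape at $x = 2$.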

This also works over multivariate functions

In [None]:
def f2(x):
  # y = 5*x + 2*exp(x)
  A = tf.constant(5.0)
  B = tf.constant(2.0)
  y = tf.add(tf.multiply(x,A), tf.multiply(B, tf.exp(x)))
  return y

In [None]:
x = tf.Variable([1.0,2.0,3.0,4.0,5.0])

with tf.GradientTape() as tape:
  y = f2(x)

g_x = tape.gradient(y, x)
g_x
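
Since this function is applied element-wise, its derivative by hand is $5 + 2e^{x}$ for each entry of `x`. The sketch below compares that with the tape's result; it rewrites `f2` with Python operators, which is assumed to be equivalent to the `tf.add`/`tf.multiply` form above, and uses a shorter input vector.

```python
import numpy as np
import tensorflow as tf

def f2(x):
    # y = 5*x + 2*exp(x), applied element-wise
    return 5.0 * x + 2.0 * tf.exp(x)

x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    y = f2(x)

# y is a vector; tape.gradient differentiates the sum of its entries,
# which here gives one partial derivative per entry of x
g_x = tape.gradient(y, x)

# Derivative worked out by hand: 5 + 2*exp(x)
expected = 5.0 + 2.0 * np.exp(x.numpy())

print(g_x.numpy())
print(expected)
```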

# Keras - a preview

In practice, you will rarely build models from basic operations like those above. Instead, you will typically use Keras, which looks something like the following

In [None]:
mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

x_train

Notice how x_train is simply a NumPy array

In [None]:
x_train.shape

In [None]:
y_train.shape

Creating a model:

In [None]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Input(shape=(28, 28)),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10)
])

Performing Inference:

In [None]:
model(x_train[0].reshape(1,28,28)).numpy()
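
The last layer of the model above has no activation, so the ten values returned by inference are unnormalised scores (logits) rather than probabilities. A softmax turns them into probabilities. The sketch below shows the idea in NumPy on made-up logits; in TensorFlow the same step is available as `tf.nn.softmax`.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum avoids overflow in exp and does not change the result
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical logits for one image over 10 classes (values invented for illustration)
logits = np.array([[0.3, -1.2, 2.5, 0.0, 0.7, -0.4, 1.1, -2.0, 0.2, 0.9]])

probs = softmax(logits)
print(probs.sum(axis=-1))      # each row sums to 1 (up to rounding)
print(probs.argmax(axis=-1))   # [2]
```

The index of the largest probability is the predicted class, and it is the same as the index of the largest logit.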

Training your model:

In [None]:
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test,  y_test, verbose=2)


# Optional Practice - not graded

In [None]:
!wget https://github.com/datasets/house-prices-uk/raw/refs/heads/main/data/data.csv

## NumPy

1. Create a vector v1 of the values $1,2,\dots,24$
2. Create a second vector v2 of the values $1,3,5,\dots,47$ (bonus: try to use np.arange)
3. Create a vector consisting of the values of v1/v2
4. Find the dot product between v1 and v2 using the '@' and '.T' operations. Make sure that the shape of the result is (1,)

## Pandas
1. Load the data from data.csv to a Pandas dataframe
2. Get the values of 'Price (New)' and 'Price (Modern)' for years where 'Change (All)' was negative
3. Get the mean of 'Price (New)' and 'Price (Modern)'

## TensorFlow

1. Create the following matrix as a tf.constant

\begin{equation}
  W = \left(\begin{array}{cc}
  7.0 & -5.0 \\
  2.5 & 3.0
  \end{array}\right)
\end{equation}

2. Define the following function where $Wx$ is matrix multiplication with 2-vector x:

\begin{equation}
  f(x) = \frac{1}{1+e^{-Wx}}
\end{equation}

3. Get the gradient using tf's autodifferentiation of $f$ at \begin{equation}x=\left(\begin{array}{c}
    2.56 \\
    1.75
  \end{array}\right)\end{equation}