# Working with Numerical Data (Part II)


In [None]:
import numpy as np
import struct

<hr style="border: none; height: 20px; background-color: green;">

## 1. Numerical data types

NumPy uses fixed-size, machine-oriented data types. That enables predictable memory usage and fast vectorized operations, but it also means **overflow** can happen for integers and **rounding** is unavoidable for floats.


### Integer literals in any base print as base-10 integers

In [None]:
x = 0b1010   # Binary input (0b = binary)
y = 0o12     # Octal input (0o = octal)
z = 0xA      # Hexadecimal input (0x = hex)

print(x, y, z)   # Output: 10 10 10 (printed in decimal)

Python does not remember the base a literal was written in. To display a value in another base, convert it explicitly with `bin()`, `oct()`, or `hex()`.

In [None]:
print(bin(x))   # 0b1010
print(oct(y))   # 0o12
print(hex(z))   # 0xa
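
The opposite direction, and output with a fixed number of digits, can be sketched with built-ins only. The sample values are illustrative and not part of the original notebook.

```python
# Parse strings written in a given base back into integers
from_binary = int("1010", 2)
from_octal = int("12", 8)
from_hex = int("A", 16)
print(from_binary, from_octal, from_hex)

# format() and f-strings give control over prefix and zero padding
padded = format(10, "08b")    # 8 binary digits, zero-padded, no prefix
prefixed = format(10, "#x")   # hexadecimal including the 0x prefix
octal = f"{255:o}"            # octal digits without prefix
print(padded, prefixed, octal)
```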

### Endianness

Endianness is the byte order used to store multi-byte values.
- Little-endian: least significant byte first
- Big-endian: most significant byte first

In [None]:
value = 0x87654321
little = value.to_bytes(4, byteorder="little", signed=False)
big = value.to_bytes(4, byteorder="big", signed=False)
print("little bytes:", [hex(b) for b in little])
print("big bytes:   ", [hex(b) for b in big])
print(int.from_bytes(little, "little"), int.from_bytes(big, "big"))
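
NumPy encodes the byte order in the dtype string (`<` for little-endian, `>` for big-endian). The following minimal sketch reuses the value from above; the variable names are chosen for illustration only.

```python
import sys
import numpy as np

value = 0x87654321

# Same number, stored with two different byte orders
little = np.array([value], dtype="<u4")
big = np.array([value], dtype=">u4")

print("native byte order:", sys.byteorder)
print("little:", little.tobytes().hex())
print("big:   ", big.tobytes().hex())

# The numerical value is identical; only the memory layout differs
print("same value:", little[0] == big[0])
```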

### Integer types

- Python `int` is arbitrary precision.
- NumPy integers (`np.int8`, `np.int16`, ...) have fixed width.
- `np.int_` is the platform default (typically 64-bit on modern systems).


In [None]:
# Python ints: arbitrary precision
x = 10
y = 10**100
print(type(x), type(y))
print(y)

#### Understanding `np.iinfo()`:

`np.iinfo()` returns the minimum and maximum values of a given NumPy integer type.

In [None]:
# NumPy fixed-width integers
for t in [np.int8, np.uint8, np.int16, np.uint16, np.int32, np.uint32, np.int64, np.uint64]:
    info = np.iinfo(t)
    print(f"{t.__name__:7}: min={info.min:>22} max={info.max:22}")

#### NumPy’s Flexible Integer Type: np.int_

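When is `np.int_` used? It is the dtype NumPy chooses when an array is built from Python integers and no `dtype` is given.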

In [None]:
arr = np.array([1, 2, 3])    # Uses np.int_

print(arr.dtype)
print(np.dtype(np.int_))
print(arr.dtype == np.dtype(np.int_))

#### Python int vs NumPy int

In NumPy, `int` does not mean Python’s arbitrary-precision integer.   
It is mapped to NumPy’s platform-dependent default integer type (`np.int_`), which is usually `int64` on 64-bit systems.
Therefore, `np.iinfo(int)` reports the limits of `int64`, not of Python’s unlimited `int`.

In [None]:
print(np.dtype(int))
print(np.dtype(np.int_))
print(np.dtype(np.int64))

### Integer overflow in NumPy

Fixed-width integers wrap around on overflow.



In [None]:
x = np.int16(np.iinfo(np.int16).max)
print("int16 max:", x)
print("max + 1:  ", np.int16(x + 1))  # wrap-around
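
The same wrap-around occurs in array arithmetic, where NumPy typically does not emit a warning. A small sketch with illustrative values, including one possible remedy (widening the dtype first):

```python
import numpy as np

a = np.array([127, 100, -128], dtype=np.int8)

# The result stays int8, so 127 + 1 wraps around to -128
wrapped = a + np.int8(1)
print(wrapped, wrapped.dtype)

# Converting to a wider type before the operation avoids the wrap-around
widened = a.astype(np.int16) + np.int16(1)
print(widened, widened.dtype)
```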

<hr style="border: none; height: 20px; background-color: green;">

### Float types

NumPy float types follow IEEE 754 (e.g., `float32`, `float64`).
- rounding errors are normal
- overflows produce `inf` unless configured otherwise


In [None]:
0.1 + 0.2
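
Because of this rounding, floats are usually compared with a tolerance instead of `==`. A minimal sketch using `math.isclose` and `np.isclose` with their default tolerances:

```python
import math
import numpy as np

total = 0.1 + 0.2

exact = (total == 0.3)               # exact comparison fails
close_py = math.isclose(total, 0.3)  # tolerance-based comparison (standard library)
close_np = np.isclose(total, 0.3)    # tolerance-based comparison (NumPy, works on arrays too)

print(repr(total))
print(exact, close_py, close_np)
```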

#### Inspect IEEE-754 bit pattern for `float32`

We can reinterpret the bytes of a `float32` as an unsigned 32-bit integer and print its bit string.


In [None]:
def float32_bits(x: np.float32) -> str:
    bits = struct.unpack("!I", struct.pack("!f", float(x)))[0]
    return format(bits, "032b")  # sign|exponent|mantissa

vals = [np.float32(1.0), np.float32(-5.5), np.float32(1.316554e-36)]
for v in vals:
    print(float32_bits(v), v)
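
The three fields of the bit string can also be extracted numerically. The helper `decode_float32` below is a hypothetical name for this sketch, and the reconstruction formula only holds for normal numbers (not for subnormals, `inf`, or `nan`).

```python
import struct

def decode_float32(x: float):
    """Split a value, stored as float32, into sign, biased exponent, and mantissa."""
    bits = struct.unpack("!I", struct.pack("!f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF        # 23 bits, implicit leading 1
    return sign, exponent, mantissa

sign, exponent, mantissa = decode_float32(-5.5)

# Rebuild the value: (-1)^sign * (1 + mantissa / 2^23) * 2^(exponent - 127)
rebuilt = (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)
print(sign, exponent, mantissa, rebuilt)
```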

#### `np.finfo`

`np.finfo` provides properties like epsilon, min/max normal, and smallest positive subnormal.


In [None]:
info32 = np.finfo(np.float32)
info64 = np.finfo(np.float64)
print("float32 eps:", info32.eps)
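print("float32 max:", info32.max)
print("float32 min:", info32.min)    # most negative finite value (-max), not the smallest positive one
print("float32 tiny:", info32.tiny)  # smallest positive normal value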
print("float64 eps:", info64.eps)
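
What epsilon means in practice can be sketched as follows: it is the gap between 1.0 and the next representable number, so adding anything clearly smaller leaves 1.0 unchanged. All operands are kept as `np.float32` so that the arithmetic is not silently promoted to `float64`.

```python
import numpy as np

eps32 = np.finfo(np.float32).eps
one = np.float32(1.0)
half_eps = eps32 / np.float32(2)

changes_value = (one + eps32) > one       # eps is large enough to reach the next float32
is_absorbed = (one + half_eps) == one     # half of eps is rounded away

print("eps:", eps32)
print("1 + eps   > 1:", changes_value)
print("1 + eps/2 == 1:", is_absorbed)
```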

#### Controlling floating point warnings/errors

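By default, a floating-point overflow returns `inf` and emits a `RuntimeWarning`. Use `np.seterr` to change how NumPy reacts to such events (for example `ignore`, `warn`, or `raise`).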


In [None]:
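x = np.float32(1e38)
# Both operands are float32, so the product is not promoted to float64 on older NumPy versions
y = x * np.float32(100)  # exceeds the float32 maximum (about 3.4e38) -> inf
print(f"y={y}")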

### Temporary Change of NumPy’s Floating-Point Error Handling

We temporarily change how NumPy handles floating-point overflows by setting the overflow mode to `raise`.  
As a result, overflows trigger a `FloatingPointError` instead of returning `inf`.    

At the same time, the **previous configuration** is stored in `old`.  
After the computation, the previous settings are restored using `np.seterr(**old)`, so the temporary change does not leak into later cells.

In [None]:
np.seterr(all="warn")
x = np.float32(1e38)
y = x * np.float32(100)  # float32 operands keep the result in float32
print("y:", y, "isinf:", np.isinf(y))

old = np.seterr(over="raise")  # returns the previous settings as a dict
try:
    y = x * np.float32(100)
except FloatingPointError as e:
    print("Raised:", e)
finally:
    np.seterr(**old)  # restore the previous settings
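
The same save-and-restore pattern is available as the context manager `np.errstate`, which restores the previous settings automatically when the block ends. A minimal sketch:

```python
import numpy as np

x = np.float32(1e38)
before = np.geterr()   # settings before the temporary change

raised = False
try:
    with np.errstate(over="raise"):
        y = x * np.float32(100)   # overflows float32
except FloatingPointError as e:
    raised = True
    print("Raised:", e)

# Outside the with-block the earlier settings are active again
print("settings restored:", np.geterr() == before)
```

Compared with the manual `np.seterr` calls, this version cannot forget the restore step.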

<hr style="border: none; height: 20px; background-color: green;">

### Complex numbers

NumPy supports complex types (`complex64`, `complex128`). Operations can overflow similarly to floats.


In [None]:
np.seterr(over='warn')  # Treat overflows as warnings

z = np.complex128(1e308 + 1e308j)  # Near float64 max
z_overflow = z * 10  # Causes overflow

print(z_overflow)  # Output: (inf+infj)
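
A complex number is stored as two floats of half its width, which is why the float limits apply to each component. The following sketch uses an illustrative value:

```python
import numpy as np

z = np.complex64(3 + 4j)

# complex64 = two float32 components, complex128 = two float64 components
print(z.real, z.imag, type(z.real))
print(np.dtype(np.complex64).itemsize, np.dtype(np.complex128).itemsize)  # bytes per element

magnitude = np.abs(z)   # sqrt(3^2 + 4^2)
print(magnitude)
```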

#### Complex plane example

Visualize the magnitude of a complex function on a grid.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define the range for complex numbers
x = np.linspace(-2, 2, 100)  # Real axis
y = np.linspace(-2, 2, 100)  # Imaginary axis

# Create a grid for the complex plane
X, Y = np.meshgrid(x, y)
Z = np.array(X + 1j * Y)  # Generates numpy.complex128 numbers x + i*y

# Compute the function f(z) = z^2 + 1
F_Z = Z**2 + 1

# Compute the magnitude of f(z) for 3D visualization
Z_abs = np.abs(F_Z)

# Define the real function for the x-axis
real_x = np.linspace(-2, 2, 100)
real_f = real_x**2 + 1  # f(x) = x^2 + 1 for real numbers

# Create subplots
fig, axes = plt.subplots(
    1, 2,
    figsize=(14, 6),
    subplot_kw={"projection": "3d"}
)

# First 3D plot: side view along the imaginary axis (elev=0)
ax1 = axes[0]
ax1.plot_surface(X, Y, Z_abs, cmap="viridis", alpha=0.1, edgecolor="k")
ax1.plot(real_x, np.zeros_like(real_x), real_f, color="red", linewidth=3)

ax1.set_xlabel("Re(z)")
ax1.set_yticks([])
ax1.set_zlabel("|f(z)|")
ax1.set_title("Function f(x) = x² + 1", fontsize=24, fontweight="bold")
ax1.view_init(elev=0, azim=90)

# Second 3D plot: elevated view (elev=45)
ax2 = axes[1]
ax2.plot_surface(X, Y, Z_abs, cmap="viridis", alpha=0.4, edgecolor="k")
ax2.plot(real_x, np.zeros_like(real_x), real_f, color="red", linewidth=3)

ax2.set_xlabel("Re(z)")
ax2.set_ylabel("Im(z)")
ax2.set_zlabel("|f(z)|")
ax2.view_init(elev=45, azim=90)

# Show plot
plt.show()

<hr style="border: none; height: 20px; background-color: green;">

## 2. Comparisons, boolean masks, boolean logic

NumPy comparisons are element-wise and return boolean arrays. Boolean arrays can be used to filter data.


In [None]:
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Operator syntax that internally calls the same ufunc
print("x == y:", x == y)
print("x != y:", x != y)
print("x < y: ", x < y)
print("x >= y:", x >= y)

In [None]:
# Explicit NumPy ufunc call
print("x == y:", np.equal(x, y))
print("x != y:", np.not_equal(x, y))
print("x < y: ", np.less(x, y))
print("x >= y:", np.greater_equal(x, y))

### Broadcasting (Preview)

In these comparisons, NumPy automatically extends the scalar `3` to match the shape of `x`.  
This mechanism is called broadcasting. We will study it in detail later.

In [None]:
print("x == 3:", x == 3)
print("x != 3:", x != 3)
print("x < 3: ", x < 3)
print("x >= 3:", x >= 3)

### Boolean masking

A boolean mask selects elements where the mask is `True`.


In [None]:
arr = np.array([2, 4, 1, 5, 9, 0])
print(arr > 2)
print(arr[arr > 2])

#### Combined with multiple conditions

Create a boolean mask for values strictly between 2 and 6

In [None]:
# Direct filtering: the combined condition is used as the index
print("selected:", arr[(arr > 2) & (arr < 6)])

In [None]:
# Cleaner solution: store the boolean mask first, then index with it
mask = (arr > 2) & (arr < 6)
print(f"mask: {mask}")
print(f"selected: {arr[mask]}")
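
A boolean mask and a list of integer positions are two different kinds of index. The sketch below shows how they relate, using `np.flatnonzero` to convert one into the other and `~` to invert the mask:

```python
import numpy as np

arr = np.array([2, 4, 1, 5, 9, 0])
mask = (arr > 2) & (arr < 6)         # boolean array, same length as arr

positions = np.flatnonzero(mask)     # integer positions where the mask is True
print("positions:", positions)

# Both index forms select the same elements
print("by mask:     ", arr[mask])
print("by positions:", arr[positions])

# ~ inverts the mask and selects everything outside the range
print("outside:     ", arr[~mask])
```

Indexing with the selected values themselves (here `[4, 5]`) would instead be interpreted as positions and return different elements.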

### 2D masking

Indexing with a boolean mask on a 2D array returns a 1D array (flattened selection).


In [None]:
# Create 2D array
arr = np.array([2, 4, 1, 5, 9, 0]).reshape(3,2)
print(arr)

In [None]:
# The mask preserves the shape of the original array
print(arr > 2)


In [None]:
# Create a flattened 1D array containing only the elements that satisfy the condition
print(arr[arr > 2])
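
If the 2D shape should be preserved, one option is `np.where`, which keeps matching elements and substitutes a fill value elsewhere. The fill value `0` is an arbitrary choice for this sketch.

```python
import numpy as np

arr = np.array([2, 4, 1, 5, 9, 0]).reshape(3, 2)

# Keep elements greater than 2, replace all others with 0
kept = np.where(arr > 2, arr, 0)
print(kept)
print("shape:", kept.shape)
```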

### Loading Kaggle Datasets into NumPy


#### What Are Structured Arrays?

- A special NumPy `array` type that stores multiple columns with different data types.
- Enables named access to columns instead of numeric indices: `data["age"]` instead of `data[:, 0]`.
- Compact storage for structured, fixed-format data (typically with less memory overhead than a pandas DataFrame).

In [None]:
# Structured array creation
data = np.array(
    [
        (25, 0, 204),
        (30, 1, 173),
        (64, 0, 309)
    ],
    dtype=[
        ("age", np.uint8),
        ("sex", np.uint8),
        ("chol", np.uint16)
    ]
)

print(data["age"])   # Access "age" column
print(data["sex"])   # Access "sex" column
print(data["chol"])  # Access "chol" column
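
Rows of a structured array can be filtered with a boolean mask built from one field. The sketch below repeats the small example array so that it runs on its own.

```python
import numpy as np

data = np.array(
    [(25, 0, 204), (30, 1, 173), (64, 0, 309)],
    dtype=[("age", np.uint8), ("sex", np.uint8), ("chol", np.uint16)],
)

# One record is a tuple-like row; fields are accessed by name
print(data[0], data[0]["age"])

# Mask built from one field, applied to whole rows, then another field selected
chol_sex0 = data[data["sex"] == 0]["chol"]
print(chol_sex0)

# Bytes per record: 1 (age) + 1 (sex) + 2 (chol)
print(data.dtype.itemsize)
```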

#### Loading Kaggle data as structured Array with Named Columns

In [None]:
# Load Kaggle dataset as a structured array
data = np.genfromtxt(
    "../data/csv/heart.csv",
    delimiter=",",
    dtype=np.uint16,
    names=True
)

# Retrieve specific columns
print(f'age: {data["age"]}')           # Retrieve the "age" column
print(f'sex: {data["sex"]}')           # Retrieve the "sex" column
print(f'cholesterol: {data["chol"]}')  # Retrieve the "chol" (cholesterol) column

# Print dataset size
print(f"\nNumber of rows: {len(data)}")                # Total number of records
print(f"Number of columns: {len(data.dtype.names)}")  # Total number of features

#### Loading Kaggle data as dense Numeric Array with Index-Based Access

In [None]:
data_index_based = np.genfromtxt(
    "../data/csv/heart.csv",
    delimiter=",",
    dtype=np.uint16,
    skip_header=1
)

print(f'first column: {data_index_based[:, 0]}')   

# Print dataset size
print("\nNumber of rows:", len(data_index_based)) 
print("Number of columns:", data_index_based.shape[1])  

#### Boolean Masking: Selecting Cholesterol Values Below 180

In [None]:
# Select all cholesterol values that are less than 180
print(data["chol"][data["chol"] < 180])

# Assign the "chol" column to a separate variable for cleaner code
chol = data["chol"]

# Select cholesterol values that are less than 180 (same as above, using the variable)
print(chol[chol < 180])

#### Boolean arrays for counting entries

Using `np.count_nonzero()` counts **how many values** satisfy a condition:

In [None]:
print(np.count_nonzero(chol < 180))

In [None]:
# Since True counts as 1 and False as 0, summing the mask gives the same count
print(np.sum(chol < 180))

#### Get the percentage of entries matching a condition

In [None]:
print(np.mean(chol < 180) * 100)

#### Unique counts of categorical values (e.g., sex distribution)

In [None]:
unique_values, counts = np.unique(data["sex"], return_counts=True)
print(f"unique_values: {unique_values}")
print(f"counts: {counts}")

In [None]:
# Formatting the output nicely
total = np.sum(counts)

for value, count in zip(unique_values, counts):
    print(f"Sex {value}: {count} entries ({count / total * 100:.2f}%)")

### Checking Conditions with np.any() and np.all()

- `np.any(mask)` checks if at least one is True
- `np.all(mask)` checks if all are True

In [None]:
# np.any: is at least one element True?
print(np.any(chol < 180)) # Is any patient's cholesterol below 180 mg/dl?
print(np.any(chol < 100)) # Is any patient's cholesterol below 100 mg/dl?

# np.all: are all elements True?
print(np.all(chol > 100)) # Is every patient's cholesterol above 100 mg/dl?
print(np.all(chol < 180)) # Is every patient's cholesterol below 180 mg/dl?
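
Both functions accept an `axis` argument, which is useful for 2D data. The array `readings` below is made-up illustration data (rows as patients, columns as two measurements), not taken from the dataset.

```python
import numpy as np

# Hypothetical data: 3 patients, 2 cholesterol measurements each
readings = np.array([[170, 210],
                     [250, 310],
                     [150, 160]])

any_high = np.any(readings > 300, axis=1)   # per patient: any value above 300?
all_low = np.all(readings < 180, axis=1)    # per patient: all values below 180?

print("any > 300 per patient:", any_high)
print("all < 180 per patient:", all_low)
```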

### Bitwise operators vs boolean operators

For NumPy boolean arrays, use `&`, `|`, `~` instead of `and`, `or`, `not`.  
Always put each comparison in parentheses, because `&` and `|` have higher precedence than comparison operators such as `<` and `==`.
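
What happens without the parentheses can be sketched as follows (the array values are illustrative):

```python
import numpy as np

arr = np.array([2, 4, 1, 5, 9, 0])

# With parentheses: two boolean arrays are combined element-wise
good = (arr > 2) & (arr < 6)
print(good)

# Without parentheses, & is evaluated first: arr > (2 & arr) < 6.
# This chained comparison needs a single truth value for an array and fails.
failed = False
try:
    bad = arr > 2 & arr < 6
except ValueError as e:
    failed = True
    print("ValueError:", e)
```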


In [None]:
a = 0b1100  # 12 in binary: 1100
b = 0b1010  # 10 in binary: 1010

print(bin(a & b))   # 0b1000 (bitwise AND)
print(bin(a | b))   # 0b1110 (bitwise OR)
print(bin(a ^ b))   # 0b110 (bitwise XOR; bin() drops leading zeros)
print(bin(~a))      # -0b1101 (bitwise NOT, two’s complement)

NumPy provides bitwise operations that work element-wise on arrays.    
The operators `&`, `|`, `^`, and `~` behave like Python’s bitwise operators but operate on entire NumPy arrays.

In [None]:
arr_1 = np.array([0b1100, 0b1010], dtype=np.uint8)  # [12, 10] in decimal
arr_2 = np.array([0b1010, 0b1100], dtype=np.uint8)  # [10, 12] in decimal

print(np.bitwise_and(arr_1, arr_2))  # [0b1000, 0b1000] -> [8, 8]   (bitwise AND)
print(np.bitwise_or(arr_1, arr_2))   # [0b1110, 0b1110] -> [14, 14] (bitwise OR)
print(np.bitwise_xor(arr_1, arr_2))  # [0b0110, 0b0110] -> [6, 6]   (bitwise XOR)
print(np.bitwise_not(arr_1))         # [243, 245]: all 8 bits inverted (uint8 is unsigned, so no negative result)

### Two's complement and fixed width integers

NumPy integer types are fixed width; negative values have a two's complement bit representation.


In [None]:
print("Python bin(-24):", bin(-24))  # not fixed width
print("NumPy int8(-24):", np.binary_repr(np.int8(-24), width=8))
print("NumPy int8(24): ", np.binary_repr(np.int8(24), width=8))

#### Understanding Binary Representation 

In [None]:
# The 0b prefix indicates that the output is in binary format
bin(24)

In [None]:
# Python displays negative numbers using -0bxxx, but this is not two’s complement.
# Python itself does not store integers with a fixed bit width.
bin(-24)

In [None]:
# Here, a binary number is directly used as input (0b_0001_1000)
bin(0b_0001_1000)

In [None]:
# Use np.binary_repr() to get the binary representation of an integer as a string
np.binary_repr(np.int8(24), width=8)

In [None]:
# Here, -24 is converted into two’s complement (8-bit representation)
np.binary_repr(np.int8(-24), width=8)

#### Bitwise Operators in NumPy - `OR`

We want extreme values of cholesterol:   
How many patients with cholesterol higher than 300 `OR` lower than 180?

In [None]:
print(f"chol > 300 : {chol > 300}")
print(f"chol < 180 : {chol < 180}")
print(f"bitwise or : {(chol > 300) | (chol < 180)}")
print(f"True values: {np.sum((chol > 300) | (chol < 180))}")

#### Bitwise Operators in NumPy - `AND`

We want a range of cholesterol values:   
How many patients with cholesterol higher than 150 `AND` lower than 180?

In [None]:
print(f"True values: {np.sum((chol > 150) & (chol < 180))}")

<hr style="border: none; height: 20px; background-color: green;">

## 3. Broadcasting

**Default:** Binary ufuncs operate element by element on arrays of the same shape.  
**Broadcasting:** A set of rules that allows binary ufuncs to also combine arrays of different shapes.  


#### Add a **scalar** to **each element** in a given NumPy array

In [None]:
# By default, we use arrays of the same size

arr_1 = np.array([0, 1, 2])
arr_2 = np.array([5, 5, 5]) 
print(arr_1 + arr_2)

In [None]:
# Use Broadcasting. Adding a scalar
arr_1 = np.array([0, 1, 2])
arr_1 = arr_1 + 5
print(arr_1)

In [None]:
arr_1 = np.array([0, 1, 2])
arr_1 += 5
print(arr_1)

#### Broadcasting for one array

Extending broadcasting to arrays of higher dimensions.  
The one-dimensional array `arr_1` is stretched (broadcast) to match the shape of the two-dimensional array `arr_2`.

In [None]:
arr_1 + 5

In [None]:
arr_2 = np.ones((3, 3))
print(arr_2)

In [None]:
arr_1 = np.array([0, 1, 2])
print(arr_2 + arr_1)

#### Broadcasting for two arrays

In some cases, dimensions from both arrays conceptually change so that the operation becomes possible.   

In [None]:
arr_1 = np.array([0, 1, 2])
arr_2 = np.array([[0],
              [1],
              [2]])

print(arr_1 + arr_2)

### Example of broadcasting I

We add a 1D array to a 2D array. Broadcasting pads the 1D array with a leading 1 and then stretches it.

In [None]:
arr_1 = np.ones((2, 3))
arr_2 = np.arange(3)

print("arr_1.shape:", arr_1.shape)
print("arr_2.shape:", arr_2.shape)

In [None]:
print(arr_1 + arr_2)

### Example of broadcasting II

We add a column vector `(3,1)` to a row vector `(3,)`. Both are broadcast to `(3,3)`.

In [None]:
arr_1 = np.arange(3).reshape((3, 1))
arr_2 = np.arange(3)

print("arr_1.shape:", arr_1.shape)
print("arr_2.shape:", arr_2.shape)

In [None]:
print(arr_1 + arr_2)

### Example of broadcasting III

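Here broadcasting fails: `arr_2` with shape `(3,)` is padded to `(1,3)` and stretched to `(3,3)`, but `arr_1` has shape `(3,2)`. The last dimensions (2 and 3) differ and neither is 1, which violates Rule 3 (in every dimension the sizes must be equal, or one of them must be 1).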

In [None]:
arr_1 = np.ones((3, 2))
arr_2 = np.arange(3)

print("arr_1.shape:", arr_1.shape)
print("arr_2.shape:", arr_2.shape)

In [None]:
try:
    print(arr_1 + arr_2)
except ValueError as e:
    print("ValueError:", e)
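
The resulting shape can also be checked without creating any data. The sketch below assumes NumPy 1.20 or newer, where `np.broadcast_shapes` is available, and replays the shapes of the three examples.

```python
import numpy as np

shape_1 = np.broadcast_shapes((2, 3), (3,))   # Example I
shape_2 = np.broadcast_shapes((3, 1), (3,))   # Example II
print(shape_1, shape_2)

# Example III: incompatible shapes raise a ValueError
compatible = True
try:
    np.broadcast_shapes((3, 2), (3,))
except ValueError as e:
    compatible = False
    print("ValueError:", e)

# A column of shape (3, 1) is compatible with (3, 2)
shape_4 = np.broadcast_shapes((3, 2), (3, 1))
print(shape_4)
```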

### Example: `np.newaxis`

Instead of relying on implicit padding, you can explicitly add a new axis to control broadcasting.

In [None]:
arr_1 = np.ones((3, 2))
arr_2 = np.arange(3)

print(f"Array before adding new axis:\n{arr_2}")
print(f"Shape: {arr_2.shape}")

arr_2 = arr_2[:, np.newaxis]  # same as arr_2.reshape((3, 1))
print(f"\nArray after adding new axis:\n{arr_2}")
print(f"Shape: {arr_2.shape}")

print("\narr_1 + arr_2:")
print(arr_1 + arr_2)


### Broadcasting & Ufuncs: Efficient Mean-Centering

Mean-centering subtracts the mean of each column so each column has mean 0.   
This is common in ML (e.g., linear regression, PCA). Broadcasting makes this efficient.

In [None]:
# Create a random 2D array (e.g., 5 observations with 3 features)
np.random.seed(42)  # For reproducibility
arr = np.random.randint(10, 100, (5, 3)).astype(float)
print(f"Original Data:\n{arr}")

# Compute the mean per column (axis=0)
column_means = arr.mean(axis=0)
print(f"\nMean of each column before centering: {column_means}")

# Apply broadcasting: subtract the mean of each column from the data
centered_arr = arr - column_means
print(f"\nCentered Data:\n{centered_arr}")

print(f"\nMean of each column after centering: {centered_arr.mean(axis=0)}")
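
Centering each row instead of each column needs one extra step: the row means have shape `(5,)`, which does not broadcast against `(5, 3)`. A minimal sketch using `keepdims=True`, with a small deterministic array chosen for illustration:

```python
import numpy as np

arr = np.arange(15, dtype=float).reshape(5, 3)

# keepdims=True keeps the reduced axis with length 1: shape (5, 1) instead of (5,)
row_means = arr.mean(axis=1, keepdims=True)
print("row_means.shape:", row_means.shape)

# (5, 3) - (5, 1): the column of means is stretched across the 3 columns
centered_rows = arr - row_means
print(centered_rows)
print("row means after centering:", centered_rows.mean(axis=1))
```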

<hr style="border: none; height: 20px; background-color: green;">

## 4. Fancy indexing and combined indexing

Fancy indexing uses integer arrays/lists to select elements.
- 1D: `x[[i, j, k]]`
- 2D: `M[rows, cols]` selects pairwise elements


In [None]:
rand = np.random.RandomState(42)
arr = rand.randint(100, size=10)
print(arr) 

### How can we access multiple elements of an array at once?

#### Different options:

In [None]:
# NOT fancy indexing
[arr[3], arr[7], arr[2]]

In [None]:
# fancy indexing
ind = [3, 7, 2] 
arr[ind]     

### Fancy indexing for multiple dimensions

In [None]:
arr = np.arange(12).reshape(3, 4)
print(arr)

In [None]:
# first index row; second index col
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

print(arr[row[:, np.newaxis], col])

In [None]:
# Fancy indexing: Broadcasting rules apply
print(row[:, np.newaxis] * col)
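
The difference between pairwise selection and the broadcast grid can be sketched side by side. `np.ix_` is shown as an alternative way to build the grid index.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

# Pairwise: arr[0, 2], arr[1, 1], arr[2, 3] -> 1D result
pairs = arr[row, col]
print(pairs)

# Broadcast: every row index combined with every column index -> 3x3 result
grid = arr[row[:, np.newaxis], col]
print(grid)

# np.ix_ builds the same open grid of indices
grid_ix = arr[np.ix_(row, col)]
print(np.array_equal(grid, grid_ix))
```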

### Combined indexing

You can combine slicing, boolean masks, and fancy indexing.


In [None]:
arr = np.arange(12).reshape((3, 4))
print(arr)

In [None]:
arr[2, [2, 0, 1]]

In [None]:
arr[1:, [2, 0, 1]]
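
Boolean masks can take part in combined indexing as well. In this sketch the names `col_mask` and `row` are illustrative.

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))
col_mask = np.array([True, False, True, False])   # select columns 0 and 2

# Slice for the rows, boolean mask for the columns
sliced = arr[1:, col_mask]
print(sliced)

# Fancy index for the rows, boolean mask for the columns
row = np.array([0, 2])
picked = arr[row[:, np.newaxis], col_mask]
print(picked)
```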

<hr style="border: none; height: 20px; background-color: green;">

## 5. Sorting

NumPy supports sorting values and also obtaining the indices that would sort an array.
- `np.sort` returns sorted values
- `np.argsort` returns sorting indices
- `np.take_along_axis` is useful for applying argsort indices

### `np.sort`

In [None]:
arr = np.array([[1, 4],
                [3, 2]])
print(arr)

#### Sort along the last axis

In [None]:
print(np.sort(arr))

#### Sort the flattened array                           

In [None]:
print(np.sort(arr, axis=None))

#### Sort along the first axis

In [None]:
print(np.sort(arr, axis=0))

### `np.argsort`

`np.argsort()` returns the indices that would sort an array in ascending order, so you can obtain the sorted order without modifying the original data.


In [None]:
arr = np.array([10, 3, 7, 1])
idx = np.argsort(arr)
print("arr:", arr)
print("argsort:", idx)
print("arr[idx]:", arr[idx])
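
Two common uses of these indices are sketched below: sorting in descending order and reordering a second array in the same way. The `labels` array is made up for illustration.

```python
import numpy as np

arr = np.array([10, 3, 7, 1])
labels = np.array(["a", "b", "c", "d"])   # hypothetical companion array

idx = np.argsort(arr)

# Reversing the index array gives descending order
descending = arr[idx[::-1]]
print(descending)

# The same indices reorder any array of matching length
labels_sorted = labels[idx]
print(labels_sorted)
```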

#### 2D array

In [None]:
arr = np.array([[3, 1, 2],
              [9, 7, 8]])

idx_last = np.argsort(arr, axis=-1)
sorted_last = np.take_along_axis(arr, idx_last, axis=-1)

idx_first = np.argsort(arr, axis=0)
sorted_first = np.take_along_axis(arr, idx_first, axis=0)

print("arr:\n", arr)
print("\nargsort axis=-1:\n", idx_last)
print("sorted axis=-1:\n", sorted_last)
print("\nargsort axis=0:\n", idx_first)
print("sorted axis=0:\n", sorted_first)

In [None]:
arr = np.array([[3, 7, 2],
              [9, 2, 8]])
print(arr)

In [None]:
# sort along first axis
indexes = np.argsort(arr, axis=0)
print(indexes)

In [None]:
# sort along last axis
indexes = np.argsort(arr)
print(indexes)
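
Indices from one column can also be used to reorder whole rows, which keeps each row intact. A minimal sketch that sorts the rows by the values of the second column:

```python
import numpy as np

arr = np.array([[3, 7, 2],
                [9, 2, 8]])

# Sorting order of the second column (values 7 and 2)
order = np.argsort(arr[:, 1])
print(order)

# Fancy indexing with the order rearranges complete rows
rows_sorted = arr[order]
print(rows_sorted)
```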