## Data Types Overview

References and further reading:

- https://blog.demofox.org/2017/11/21/floating-point-precision/
- https://fabiensanglard.net/floating_point_visually_explained/
- https://lukaskollmer.de/ieee-754-visualizer/
- https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

In [None]:
# required imports
import numpy as np
import copy

In Atmospheric and Earth Science, we typically (although not exclusively!) work with *numerical* data at some level. Whether our input data are pictures from a satellite, numerical model output, radar output, or something else, each pixel is represented as a single numerical value or a set of numerical values. In this module, we will examine the data types used to represent those values, focusing on numerical data types.

![Python standard data types](python_standard_data_types.png)

While the above image represents the basic *Python* data types, we almost never want to work in that space in big data. Instead, we typically work inside the `numpy` ecosystem (even `Dask`, which provides its own array type that we will learn later, uses `numpy` under the hood). We will see one reason (although certainly not the only one) here.

### Integer Types

Integer types are just that: representations of *integers*. That means anything after the decimal point cannot be stored in an integer type. In AES, we often see integers used for things like raw satellite retrievals and other instrument data (e.g., GOES-R series satellites use 16-bit scaled integers rather than 32-bit floats to save space), and for things like model timesteps. We typically don't work directly with integers, but we need to cover them for completeness.
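
To make the idea of "scaled integers" concrete, here is a minimal sketch of unpacking 16-bit counts into physical values. The `scale_factor` and `add_offset` numbers are made up for illustration and are not taken from any real GOES-R product; the names follow the common netCDF packing convention.

```python
import numpy as np

# Hypothetical packing parameters, chosen only for illustration.
scale_factor = 0.01   # physical units per integer count
add_offset = 200.0    # physical value that corresponds to a count of 0

# Raw counts as they might be stored on disk (16-bit unsigned integers).
raw_counts = np.array([0, 1234, 65535], dtype=np.uint16)

# Unpack to floating point: physical = raw * scale_factor + add_offset
physical = raw_counts.astype(np.float32) * scale_factor + add_offset

print(raw_counts.dtype, raw_counts.nbytes, "bytes")
print(physical.dtype, physical.nbytes, "bytes")
print(physical)
```

The packed array takes half the memory of its 32-bit float counterpart, which is the space saving mentioned above.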

Integer data types are generally simple to work with (although their backend representation is more complicated than it would seem on its face; we won't cover that in detail in this class). Any integer that falls within a type's range is represented exactly in memory, rather than being approximated in the way that floating point numbers are.

Integer data types have a range of values that they can store, depending on the number of bits used. An integer data type can hold up to $2^x$ unique values, where $x$ is the number of bits (typically 8, 16, 32, 64, or 128). 

There are generally two kinds of integers: **signed** and **unsigned**. Signed integers allow negative numbers, but have a smaller maximum value because half of the bit patterns are used for the negative side. Unsigned integers do *not* allow negative numbers, but have a larger maximum value. Both kinds hold the same $2^x$ unique values; they just cover different ranges. An unsigned integer can represent whole numbers from $0$ to $2^x-1$ (remember that 0 must also be represented), and a signed integer can represent numbers from $-2^{x-1}$ to $2^{x-1}-1$ (we lose a power of 2 because the negative side must be represented as well).
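
The range formulas above can be checked against what `numpy` reports through `np.iinfo`. This is a small sketch; the variable names are only for illustration.

```python
import numpy as np

ranges = {}
for dtype in [np.int8, np.uint8, np.int16, np.uint16]:
    info = np.iinfo(dtype)
    bits = info.bits
    if info.min < 0:
        # signed: -2^(x-1) to 2^(x-1) - 1
        expected = (-2 ** (bits - 1), 2 ** (bits - 1) - 1)
    else:
        # unsigned: 0 to 2^x - 1
        expected = (0, 2 ** bits - 1)
    ranges[dtype.__name__] = (info.min, info.max)
    print(dtype.__name__, "numpy:", (info.min, info.max), "formula:", expected)
```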


In [None]:
# Let's cast 255 to each signed integer type. int8 only reaches 127, so 255 does not fit there.
dtypes = [np.int8, np.int16, np.int32, np.int64]
int_value = 255
for dtype in dtypes:
    print(dtype, np.array(int_value).astype(dtype))

In [None]:
# Now let's look at unsigned integers. With the same number of bits the maximum is larger, so 255 fits even in uint8.
dtypes = [np.uint8, np.uint16, np.uint32, np.uint64]
int_value = 255
for dtype in dtypes:
    print(dtype, np.array(int_value).astype(dtype))
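
Casting with `astype` does not raise an error when a value is out of range, so it can be useful to check the range first. The helper below, `fits_in`, is a hypothetical name introduced for this sketch and is not part of `numpy`.

```python
import numpy as np

def fits_in(value, dtype):
    """Hypothetical helper: True if the Python int `value` lies inside the dtype's range."""
    info = np.iinfo(dtype)
    return info.min <= value <= info.max

for dtype in [np.int8, np.uint8, np.int16]:
    print(dtype.__name__, "can hold 255:", fits_in(255, dtype))
```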

### Float Types

In [None]:
# first, a simple float with a single digit after the decimal point
x_python = 0.1
dtypes = [np.float16, np.float32, np.float64]
for dtype in dtypes:
    print(dtype, np.array(x_python).astype(dtype))

In [None]:
# next, increase the number of digits
x_python = 34.7304
dtypes = [np.float16, np.float32, np.float64]
for dtype in dtypes:
    print(dtype, np.array(x_python).astype(dtype))

In [None]:
# What happens when we add a small number to a much larger one?
dtypes = [np.float16, np.float32, np.float64]
for dtype in dtypes:
    print(dtype, np.array(3.5).astype(dtype)+np.array(0.0001).astype(dtype))

In [None]:
# Repeat the addition with an even smaller increment.
dtypes = [np.float16, np.float32, np.float64]
for dtype in dtypes:
    print(dtype, np.array(3.5).astype(dtype)+np.array(0.0000001).astype(dtype))
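
One way to understand the two addition cells above is to look at the gap between 3.5 and the next representable number in each type. As a sketch, `np.spacing` reports that gap; an increment smaller than about half the gap is rounded away entirely.

```python
import numpy as np

gaps = {}
for dtype in [np.float16, np.float32, np.float64]:
    # distance from 3.5 to the next larger representable value in this dtype
    gap = np.spacing(dtype(3.5))
    gaps[dtype.__name__] = float(gap)
    print(dtype.__name__, "gap near 3.5:", gap)
```

Comparing each gap with the increments `0.0001` and `0.0000001` shows which additions can change the stored value at all.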

In [None]:
# What happens when we integrate? Add 0.1 to a running total 2000 times.
# In exact arithmetic the answer is 200.
dtypes = [np.float16, np.float32, np.float64]
initial_value_py = 0
increment_value_py = 0.1
num_timesteps = 2000
for dtype in dtypes:
    initial_value = np.array(initial_value_py).astype(dtype)
    increment_value = np.array(increment_value_py).astype(dtype)
    # copy so the in-place += below does not modify initial_value
    curr_value = copy.deepcopy(initial_value)
    for timestep in range(num_timesteps):
        curr_value += increment_value

    print(dtype, curr_value)
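
The drift in the loop above comes from rounding after every single addition. One possible way to reduce it, shown here as a sketch and not as the only option, is to compute the total from the step count in a single multiplication.

```python
import numpy as np

num_timesteps = 2000
increment = np.float32(0.1)

# Running sum: every addition rounds to the nearest float32, and the errors accumulate.
running_total = np.float32(0.0)
for _ in range(num_timesteps):
    running_total = running_total + increment

# Direct product: a single multiplication, so only one rounding step.
direct_total = np.float32(num_timesteps) * increment

print("running sum   :", format(float(running_total), ".10f"))
print("direct product:", format(float(direct_total), ".10f"))
```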

### Float Accuracy

In [None]:
# Earlier, printing rounded the values for display, so they looked close enough.
# Ask for up to 60 significant digits to see what is actually stored.
x_python = 0.1
dtypes = [np.float16, np.float32, np.float64]
for dtype in dtypes:
    print(dtype, format(np.array(x_python).astype(dtype), '.60g'))


In [None]:
# Add the stored value to itself and print the result at full precision.
x_python = 0.1
dtypes = [np.float16, np.float32, np.float64]
for dtype in dtypes:
    print(dtype, format(np.array(x_python).astype(dtype)+np.array(x_python).astype(dtype), '.60g'))
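
Because stored floats are approximations, comparing them with `==` can give surprising answers. A minimal sketch using the well-known `0.1 + 0.2` case and `np.isclose`, which compares within a tolerance:

```python
import numpy as np

a = np.float64(0.1) + np.float64(0.2)
b = np.float64(0.3)

exactly_equal = bool(a == b)              # exact, bit-for-bit comparison
approximately_equal = bool(np.isclose(a, b))  # comparison within a tolerance

print("a =", format(a, ".20g"))
print("b =", format(b, ".20g"))
print("a == b:", exactly_equal)
print("np.isclose(a, b):", approximately_equal)
```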


### Float Storage

In [None]:
out_dir = "./"
# array size
arr_size = (100, 100)
dtypes = [np.float16, np.float32, np.float64]
dtype_names = ["float16", "float32", "float64"]
for dtype, dtype_name in zip(dtypes, dtype_names):
    arr = np.random.random_sample(arr_size).astype(dtype)
    # pass the path directly so np.save opens and closes the file itself
    np.save("{0}test_arr{1}.npy".format(out_dir, dtype_name), arr)


In [None]:
!ls -lh
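
The file sizes listed above can be cross-checked in memory without touching the filesystem. As a sketch, `itemsize` gives the bytes per value and `nbytes` the total for the array; the `.npy` files are slightly larger because they also store a small header.

```python
import numpy as np

arr_size = (100, 100)
sizes = {}
for dtype in [np.float16, np.float32, np.float64]:
    arr = np.zeros(arr_size, dtype=dtype)
    sizes[dtype.__name__] = arr.nbytes
    print(dtype.__name__, arr.itemsize, "bytes per value,", arr.nbytes, "bytes total")
```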