<a href="https://colab.research.google.com/github/bamacgabhann/IEOS2023/blob/main/ieos2023/1_Data_Types.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Types

To understand how best to use Python to manipulate geospatial data, I think it's important to first understand something of how computers handle data of different types.

You can run this notebook locally in a virtual environment, or on Google Colab.

## 1. Computers handle text and numbers differently

With numbers, if we try to add 2 + 2, we get 4.

In [None]:
2 + 2

4

But every character on a keyboard or beyond is just that - a character. We type (or copy/paste) a symbol. It's easy enough for software to recognise the set of characters which we use to represent numbers, and treat them like numbers - but sometimes we'll use text and numbers together, in which case the number characters will just get treated exactly the same as any other character.

In [None]:
"2" + "2"

'22'

This isn't a mathematical error - it's no different than if we tried:

In [None]:
"two" + "two"

'twotwo'

For *string* data, the computing term for text characters, Python "adds" values by simply putting them together, as you see above. Hence "2" + "2" gives '22': it's just joining the characters.

You might be thinking "OK, but it's easy enough to remember to type a number without putting quotes around it, since I don't usually do that". Sure, but remember that when you're trying to process environmental data, you're not usually typing the data - you're usually bringing in a dataset generated by sensors, or from other sources, and that data isn't always perfect.
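When numbers do arrive as text, they have to be converted explicitly before arithmetic will work. The following is a minimal sketch using Python's built-in `int()`; the `raw_values` list is a made-up stand-in for values read from a file.

```python
# Values read from a file often arrive as text, even when they look numeric.
raw_values = ["2", "2"]   # hypothetical readings, stored as strings

# Concatenation: the strings are simply joined together.
joined = raw_values[0] + raw_values[1]

# Explicit conversion: int() turns each string into a whole number first.
total = int(raw_values[0]) + int(raw_values[1])

print(joined)   # 22
print(total)    # 4

# Text that does not represent a number cannot be converted.
try:
    int("NA")
except ValueError as error:
    print("Could not convert:", error)
```

The `try`/`except` at the end shows what happens with imperfect data: Python raises a `ValueError` rather than guessing what 'NA' should mean.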

Here's a quick example using the pandas library, which we'll cover a bit more later.

In [5]:
import pandas

If we import data from a spreadsheet or some other data source which is all numbers, that's fine.

In [7]:
pandas.Series([1, 2, 3, 4])

0    1
1    2
2    3
3    4
dtype: int64

dtype: int64 means the data is being treated as a 64-bit integer - an integer being a number without any decimal places. That's the correct data type for that set of values. But say one of the values in our spreadsheet has an error.

In [11]:
pandas.Series([1, 2, 'NA', 4])

0     1
1     2
2    NA
3     4
dtype: object

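Now, `dtype: object` is telling you that pandas could not treat the column as numbers: it is holding a mix of generic Python objects, here three integers and the text 'NA'. That will be a problem if you try to add the values together, or do most other numerical operations on them. Simplifying that again: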

In [None]:
type("2")

str

*str* means *string*, the data type for text characters. As a number, we would get a different data type:

In [None]:
type(2)

int

*int* is the *integer* data type. Now, here we come to our second distinction:
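Before moving on, it may help to see one possible way of rescuing the pandas series with the 'NA' entry from earlier. This sketch uses `pandas.to_numeric` with `errors="coerce"`; the variable names `readings` and `cleaned` are only for illustration, and whether replacing bad entries with missing values is appropriate depends on your data.

```python
import pandas

# The same series as above, with one entry that is not a number.
readings = pandas.Series([1, 2, "NA", 4])

# errors="coerce" replaces anything that cannot be parsed with NaN
# ("not a number", the marker pandas uses for missing values).
cleaned = pandas.to_numeric(readings, errors="coerce")

print(cleaned.dtype)   # float64
print(cleaned.sum())   # 7.0, since sum() skips NaN by default
```

The result is `float64` rather than `int64` because NaN is itself a floating point value, so the whole series is stored as floats.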

## 2. Not all numbers are created equal


Integers are whole numbers. These are stored in computer memory differently than numbers with decimal places.

In [None]:
type(2.0)

float

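*float* is a floating point number. Floats are stored as approximations rather than exact values, and calculations with them behave slightly differently, so if our numbers don't have anything after the decimal point, it's generally better to work with them as integers. How much memory a number takes depends on the size of the type it is stored in, which part 6 covers. That's not an issue when we only have a couple of numbers, but if we're working with huge datasets, choosing the right type can make a sizeable difference.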
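To give a rough sense of scale, here is a small sketch comparing how many bytes pandas uses for the same 1,000 values stored as different types. The numbers in the type names (16, 32, 64) are sizes in bits, explained in part 6; the `values` and `sizes` names are just for this example.

```python
import pandas

# 1,000 whole numbers, stored explicitly as 64-bit integers.
values = pandas.Series(range(1000), dtype="int64")

# nbytes reports the memory used by the underlying data, in bytes.
sizes = {
    "int64": values.nbytes,
    "int16": values.astype("int16").nbytes,
    "float64": values.astype("float64").nbytes,
    "float32": values.astype("float32").nbytes,
}
print(sizes)
# {'int64': 8000, 'int16': 2000, 'float64': 8000, 'float32': 4000}
```

In pandas, the memory used is set by the size of the type, so picking a smaller type that still fits the data is where the savings come from.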

## 3. Some numbers represent particular things

When does 15+15 not equal 30?

When it's February.

In [16]:
import datetime


In [17]:
datetime.date(2023,2,15)

datetime.date(2023, 2, 15)

In [18]:
datetime.date(2023,2,15) + datetime.timedelta(days=15)

datetime.date(2023, 3, 2)

Dates and times can be awkward, as you'll probably have noticed if you've used Excel much. Which brings up another issue: not all languages and software handle dates and times in the same way. Excel has one way, Python has another, and there are also some Python modules which have their own alternative formats. It's just something to be mindful of.
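As a small illustration of the format problem, the same date can arrive written in different ways, and Python has to be told how to read each one. This sketch uses only the standard `datetime` module; the two input strings are invented examples.

```python
import datetime

# The same date written two ways (hypothetical input strings).
european = "15/02/2023"   # day/month/year
iso = "2023-02-15"        # year-month-day (ISO 8601)

# strptime needs a format code describing the layout of the text.
date_a = datetime.datetime.strptime(european, "%d/%m/%Y").date()

# ISO 8601 dates can be read directly.
date_b = datetime.date.fromisoformat(iso)

print(date_a == date_b)     # True
print(date_a.isoformat())   # 2023-02-15
```

Once both strings are converted to `datetime.date` objects, they compare as equal, even though the original text looked different.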

Dates and times are one example, but not the only example, of cases where numbers don't come in isolation. A particularly relevant example in our case is coordinates.

## 4. Tuples, lists, and dictionaries

A single number as a coordinate is not very useful without the other coordinate. We'll get more into coordinate systems later, but for cases like these, Python has a data type called a tuple.

Tuples are groups of multiple values which are unchangeable. In Python, they're denoted by round brackets.

In [23]:
mytuple = (5, 8)
mytuple[1] = 6

TypeError: 'tuple' object does not support item assignment

Because they're unchangeable, you can't accidentally modify them, so they're good for coordinates. But sometimes you'll need a group of multiple values which you *can* change, and for that we have lists.

In [27]:
mylist = [5, 8]
type(mylist)


list

In [28]:
mylist[1] = 9
mylist

[5, 9]

And sometimes you'll want to store a group of multiple values with references to what the numbers represent, and for that we have dictionaries.

In [31]:
mydictionary = {"Temperature": 15, "Pressure": 1016, "Humidity": 83}
mydictionary["Temperature"]

15

You'll get used to these if and when you need them, but it's at least good to be aware of these different kinds of data from the start.
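These three types are often used together. The sketch below combines them for a single imaginary weather station; the coordinates, readings, and key names are all made up for illustration.

```python
# A tuple for a fixed pair of values (hypothetical latitude and longitude).
location = (52.67, -8.57)

# A list for readings that can grow over time.
temperatures = [15, 14, 16]
temperatures.append(17)

# A dictionary tying the pieces together under descriptive names.
station = {"Location": location, "Temperatures": temperatures}

# A tuple can be unpacked into separate names in one step.
latitude, longitude = station["Location"]

print(latitude)                       # 52.67
print(len(station["Temperatures"]))   # 4
print(max(station["Temperatures"]))   # 17
```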

## 5. Booleans

One last type it's worth being aware of. Sometimes you don't need text or numbers, you just need to indicate a straight choice. Let's demonstrate by returning to our first example:

In [32]:
2 == 2

True

In [33]:
2 == "2"

False

The `True` and `False` here are what we refer to as Boolean values. Data can be stored in this format, for example, from checkboxes.
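Booleans also turn up whenever you test data against a condition. Here is a minimal sketch with an invented list of temperature readings and an arbitrary threshold:

```python
temperatures = [15, 9, 21, 12]   # hypothetical readings
threshold = 13                   # arbitrary cut-off for this example

# Each comparison produces a Boolean value.
above = [t > threshold for t in temperatures]
print(above)        # [True, False, True, False]

# In Python, True counts as 1 and False as 0, so sum() counts the matches.
print(sum(above))   # 2

# The same test can be used to keep only the values that pass.
warm = [t for t in temperatures if t > threshold]
print(warm)         # [15, 21]
```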

## 6. The more complicated bits

Numbers in a computer are not stored as the actual number. Computers store all information in short term memory or on a hard disk where each spot can have one of two values - just like a Boolean. Usually, these are considered as 1 or 0.

(In Python, we can write numbers in binary if we prefix them with `0b`, which I'll use in the next couple of examples.)

So, a binary number with one digit can give us only two values.



In [40]:
0b0

0

In [41]:
0b1

1

If we put two of these together, using two digits, we get four possible combinations, which can represent the values 0 to 3.

In [43]:
print(0b00)
print(0b01)
print(0b10)
print(0b11)

0
1
2
3


If we go up to eight digits, that's enough to get us up to:

In [44]:
0b11111111

255

This is a bit oversimplified (that's a terrible joke which you'll understand in a moment) but, inside a computer, those single points in memory can be grouped together like this to save larger numbers. One point in memory is a *bit*. A group of eight is a *byte*, and can store values 0-255. Early computers had 8-bit processors, which could only handle values 0-255 at one time. This didn't mean they couldn't process larger numbers; parts of the memory could be combined to store larger values. But it was a limit, and led to some image file types which could only use 256 colours, and so on.
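The pattern in the examples above can be written as a formula: with n bits, the largest whole number (counting up from zero) is 2 to the power of n, minus 1. A short sketch, where `max_unsigned` is a helper name made up for this example:

```python
def max_unsigned(bits):
    """Largest whole number, counting from zero, that fits in this many bits."""
    return 2 ** bits - 1

for bits in (1, 2, 8, 16):
    print(bits, max_unsigned(bits))
# 1 1
# 2 3
# 8 255
# 16 65535

# Going the other way: how many bits does a number need?
print((255).bit_length())   # 8
print((256).bit_length())   # 9

# bin() shows the binary digits of an integer as text.
print(bin(255))             # 0b11111111
```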

From the 1990s, when home computers became common, most had 32-bit processors.

In [45]:
0b11111111111111111111111111111111

4294967295

That's a much larger number, and allows a lot more, but one limit on that was still the references to the points in memory. A 32-bit computer could only address that many locations in memory, and since each address refers to one byte, that's a limit of 4 GB of RAM.

As computers became more powerful, more memory was needed, and now, standard computers like the one you're using are generally 64-bit.

In [46]:
0b1111111111111111111111111111111111111111111111111111111111111111

18446744073709551615

That will probably last a while - nobody needs 18.4 exabytes of RAM in their computer.
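The memory limits quoted above can be checked with a little arithmetic. This sketch assumes each memory address refers to one byte, which is the usual arrangement; the variable names are just for illustration.

```python
# Number of distinct addresses available with 32 and 64 bits.
addresses_32 = 2 ** 32
addresses_64 = 2 ** 64

# Assuming one byte per address, convert to larger units.
gib_32 = addresses_32 / 1024 ** 3   # gibibytes (1024 x 1024 x 1024 bytes)
exabytes_64 = addresses_64 / 10 ** 18

print(gib_32)                  # 4.0
print(round(exabytes_64, 1))   # 18.4
```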

The broader point here is that numbers are usually stored by computers as 8-bit, 16-bit, 32-bit, or 64-bit. We saw an example in part 1 above, our pandas series:

In [47]:
pandas.Series([1, 2, 3, 4])

0    1
1    2
2    3
3    4
dtype: int64

These numbers are being stored as 64-bit integers, which means the computer has allocated 64 bits of memory to hold each value. Part of my reason for explaining this is so that you know what it means when you see int32 or int64, or float32 or float64. But there's another quirk to this as well, and it's about the float32s and float64s. You might have noticed that I've only used integers in this part so far. How can we represent a floating point number in binary?

In [49]:
bin(0.5)

TypeError: 'float' object cannot be interpreted as an integer

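You can't, not directly: `bin()` only accepts integers. Internally, computers store floating point numbers as binary fractions, built up from halves, quarters, eighths, and so on. A value like 0.5 fits exactly, but most decimal values, including 0.1, can only be approximated. This means that what you see is not always what you get.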

In [50]:
0.1

0.1

In [53]:
"{:.24f}".format(0.1)

'0.100000000000000005551115'

This can produce some very unexpected results:

In [54]:
0.1 + 0.1 + 0.1 == 0.3

False

This isn't a bug in Python, it's simply a result of how computers work, and will be the same in any software. Again, it's just something to be aware of.
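In practice, the usual workaround is to compare floats with a small tolerance rather than testing for exact equality. The sketch below uses `math.isclose` from the standard library, and `as_integer_ratio` to show the exact fraction Python has stored for a value.

```python
import math

total = 0.1 + 0.1 + 0.1
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False

# isclose() allows for a tiny relative difference between the two values.
print(math.isclose(total, 0.3))   # True

# 0.5 is exactly one half, so it is stored without error...
print((0.5).as_integer_ratio())   # (1, 2)

# ...but 0.1 is stored as the nearest fraction with a power of 2 underneath.
print((0.1).as_integer_ratio())   # (3602879701896397, 36028797018963968)
```

The default tolerance in `math.isclose` is fine for a quick check like this one; for real data you may want to choose a tolerance that matches the precision of your measurements.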

## 7. Hexadecimal

Aside from decimal and binary, there's one other number system commonly used by computers, and that's hexadecimal, which is a base 16 number system. Binary is base 2, with two characters (0 and 1), and our normal decimal numbers are base 10, with ten characters (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9). Hex adds the letters a, b, c, d, e, and f to represent the values 10 to 15, so f = 15.

(In Python, the prefix `0x` is used to represent hex numbers).

In [56]:
0xf == 15

True

You'll see this most commonly in HTML code for websites, where colours are represented by hex values for RGB (red, green, and blue). These are commonly written as a # followed by 6 digits, e.g. #000000 is black, #ffffff is white, #00ff00 is green, and #f8ed62 would be a shade of yellow.

You might have realised why (but don't worry if you haven't):

In [57]:
0xff

255

Two hex digits are exactly enough to represent eight bits - one byte - so each pair of digits in a colour code is one value from 0 to 255. For this reason, hex might come up in other contexts, so best to be aware of it. Some environmental sensors I've been working with, for example, output their raw values as hex bytes, which means the code I had to write to process the values had to convert them to integers. Again though, it's generally an example of "when you have to deal with it, you figure out how", but it's good to be aware of it in case it comes up.
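As a rough sketch of what that kind of conversion can look like, the code below splits a hex colour into its red, green, and blue values, and then converts a short string of hex bytes to integers. The `hex_to_rgb` helper and the `"01ff"` sample are invented for this example and are not taken from any particular sensor.

```python
def hex_to_rgb(colour):
    """Convert a colour like '#f8ed62' into a (red, green, blue) tuple."""
    digits = colour.lstrip("#")
    # Each pair of hex digits is one byte; int(text, 16) reads text as base 16.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb("#f8ed62"))   # (248, 237, 98)
print(hex_to_rgb("#00ff00"))   # (0, 255, 0)

# Hypothetical raw output: two bytes written as four hex digits.
raw = bytes.fromhex("01ff")
print(list(raw))                    # [1, 255]

# Treating the two bytes as one larger number, most significant byte first.
print(int.from_bytes(raw, "big"))   # 511
```

Whether multi-byte values should be read most significant byte first ("big") or last ("little") depends on the device, so that is something to check in its documentation.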

## 8. Summary

Okay, so you won't usually have to deal with the hex bytes or binary bits, but you will have to deal with data which comes in different kinds. You should now have a handle on the concept that computers store and treat data in different ways - text characters (strings, str), integers (whole numbers, int), decimals (floating point numbers, float), dates and times, Booleans (True or False, bool), and combinations like tuples, lists, and dictionaries.