In this file we are going to learn how much space Python objects use in memory. We will also learn how other programming languages represent some of the most common data types such as integers. Python is actually quite different from most other languages in this regard. However, even though we will mostly program in Python, this knowledge will prove helpful when we learn about databases.

The language used to interact with a database uses these more common datatypes and knowing about them will be important because:

1. It will help us understand how to choose the correct datatypes for representing our data and why.
2. It will help us know how much storage space our data will occupy on disk.

![image.png](attachment:image.png)

As the numbers grow large, Python will use more and more bits to represent them. In theory, the numbers can grow as large as our computer's memory allows! However, most programming languages use a fixed number of bits to represent their numbers. The most commonly used sizes are 32-bit and 64-bit integers. These are often known as `int` and `long`, respectively, in other languages. When we fix the number of bits to a specific size, the representation needs to change.

Since these types do not exist natively in Python, we will be using the `numpy` library. This library provides an extensive amount of very useful functions for handling data.

The names used for these in `numpy` are `int` followed by the number of bits used. For example, to use 32-bit integers we use the `int32` type, for 8-bit integers `int8`, and so on. These types are not defined for all sizes, however. We can find a complete list [here](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html).

Let's try to create our own 8-bit integers and add them together.

**Task**

![image.png](attachment:image.png)

**Answer**

In [1]:
import numpy as np
x = np.int8(100)
y = np.int8(28)
z = x + y
print(z)

-128




Above we added 100 and 28 together and got -128. If this is the first time we encounter this kind of behaviour, it probably looks like a bug. Let's explain what is going on.

![image.png](attachment:image.png)

In most other programming languages and in our computer's hardware, only the bits are used to represent the numbers. Therefore we need a way to also represent negative numbers using only the available bits.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [2]:
x = 1 * (-2) + 0 * 1
y = 1 * (-8) + 0 * 4 + 1 * 2 + 0 * 1
z = 0 * (-8) + 1 * 4 + 1 * 2 + 0 * 1
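The weighted sums in the answer above can be generalized into a small helper. The following is only a sketch: the function name `twos_complement_value` is hypothetical (it is not part of the original material), and it assumes the input is a string made of `0` and `1` characters. As in the answer, the leftmost bit gets a negative weight and every other bit gets its usual positive weight.

```python
def twos_complement_value(bits: str) -> int:
    """Decode a string of '0'/'1' characters as a two's complement integer."""
    n = len(bits)
    # The leftmost bit carries a negative weight of -(2 ** (n - 1)).
    total = -int(bits[0]) * 2 ** (n - 1)
    # Every remaining bit carries its usual positive power-of-two weight.
    for position, bit in enumerate(bits[1:], start=1):
        total += int(bit) * 2 ** (n - 1 - position)
    return total


print(twos_complement_value("10"))    # -2
print(twos_complement_value("1010"))  # -6
print(twos_complement_value("0110"))  # 6
```

The three calls reproduce the values of `x`, `y` and `z` computed by hand in the answer.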

Above we've learned a new way to represent base 10 numbers in binary. What is the range of values that can be represented using **n** bits in this way?

![image.png](attachment:image.png)

![image.png](attachment:image.png)

If we look at the [`numpy` datatypes](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html) table we will see that the ranges match these ones.
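As a quick check, the range for **n** bits in two's complement goes from -2^(n-1) to 2^(n-1) - 1. The sketch below computes it for a few sizes; the helper name `twos_complement_range` is an assumption made for illustration.

```python
def twos_complement_range(n_bits: int) -> tuple:
    """Return (smallest, largest) value representable with n_bits."""
    smallest = -(2 ** (n_bits - 1))   # only the negative-weight bit is set
    largest = 2 ** (n_bits - 1) - 1   # every bit except the negative one is set
    return smallest, largest


for n_bits in (3, 8, 16, 32, 64):
    print(n_bits, twos_complement_range(n_bits))
```

In `numpy`, `np.iinfo(np.int8)` reports the minimum and maximum of a type, so it can be used to compare against these values.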

![image.png](attachment:image.png)

Let's visualize this using 3-bit integers. We learned that, using 3 bits, we can represent numbers from -4 to 3. Let's arrange the numbers 0, 1, 2, 3, -4, -3, -2, -1 in a circle in clockwise order. Then, adding 1 to a number corresponds to moving to the next number in the circle:

![image.png](attachment:image.png)

![image.png](attachment:image.png)
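The circle can also be simulated in plain Python. This is a minimal sketch, assuming we model the wrap-around with modular arithmetic; `wrap_to_bits` is a hypothetical helper, not a `numpy` function.

```python
def wrap_to_bits(value: int, n_bits: int) -> int:
    """Wrap an integer into the two's complement range for n_bits."""
    modulus = 2 ** n_bits       # number of positions on the circle
    half = 2 ** (n_bits - 1)    # size of the negative half of the circle
    return (value + half) % modulus - half


# Walk once around the 3-bit circle, starting at 0 and adding 1 each step.
value = 0
circle = []
for _ in range(8):
    circle.append(value)
    value = wrap_to_bits(value + 1, 3)

print(circle)                     # [0, 1, 2, 3, -4, -3, -2, -1]
print(wrap_to_bits(100 + 28, 8))  # -128
```

The last line gives the same result as adding `np.int8(100)` and `np.int8(28)` earlier: 128 is one step past 127, so it lands on -128.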

**Task**

![image.png](attachment:image.png)

**Answer**

In [3]:
import numpy as np

print(np.binary_repr(-2147483648, width=32))
print(np.binary_repr(2147483647, width=32))

10000000000000000000000000000000
01111111111111111111111111111111


The two most commonly used sizes are 32-bit and 64-bit integers. However, when using these we need to remember that, unlike Python's integers, these types only support a limited range of values. If we are not careful and our computations exceed this range, we will get incorrect results, as we saw when we added 100 and 28 together using 8-bit integers.

We might be wondering at this point why we don't always do it like Python and use an object that is able to represent arbitrarily large values. The reason is simple: performing computations with fixed bit length values is much more efficient than using Python's `int` object. Processors are designed with built-in circuits that handle arithmetic operations on these 32-bit or 64-bit integers.

Another reason is that Python's `int` values occupy much more memory than pure two's complement values. Usually, when reasoning about the size of objects inside a computer, we reason in terms of the number of bytes (remember that 1 byte is the same as 8 bits). So a 32-bit integer uses 4 bytes and a 64-bit integer uses 8.

Using the [`sys.getsizeof()` function](https://docs.python.org/3/library/sys.html#sys.getsizeof) we can compute the size, in bytes, of any Python object. For example:

![image.png](attachment:image.png)
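To see how this plays out for integers of different magnitudes, we can try something like the sketch below. The exact numbers printed depend on the Python implementation and version (this assumes CPython), so no specific output is shown.

```python
import sys

small = 1
large = 2 ** 200   # far beyond what a 64-bit integer can hold

size_small = sys.getsizeof(small)
size_large = sys.getsizeof(large)

# Both sizes are in bytes. Even the small value needs more than the
# 8 bytes that a 64-bit integer would use.
print(size_small, size_large)
```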

We have probably already heard of size units such as **megabytes**, **gigabytes**, **terabytes** and **petabytes**. Each of these is used to represent a number of bytes. The following table shows how many bytes are in one of each of these units:

![image.png](attachment:image.png)

If instead we had used Python `int` objects we would have needed 36 GB! We most likely don't have a computer that is able to fit this into memory, whereas 8 GB of memory is becoming more and more common. This shows that these apparently small differences in memory usage add up when we face huge amounts of data.
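The arithmetic behind these figures can be sketched as follows. Two assumptions are made here because the original figures are not reproduced in the text: the data consists of one billion integers, and each Python `int` takes 36 bytes (which is what the 36 GB figure implies). The units use the decimal convention (1 GB = 10^9 bytes), the same one used in the answers in this file.

```python
# Decimal size units, expressed in bytes.
BYTES_PER_UNIT = {
    "MB": 10 ** 6,
    "GB": 10 ** 9,
    "TB": 10 ** 12,
    "PB": 10 ** 15,
}


def to_unit(num_bytes: int, unit: str) -> float:
    """Convert a number of bytes to the given unit."""
    return num_bytes / BYTES_PER_UNIT[unit]


num_values = 10 ** 9                           # assumed: one billion integers
fixed_gb = to_unit(num_values * 8, "GB")       # 64-bit integers, 8 bytes each
python_gb = to_unit(num_values * 36, "GB")     # assumed: 36 bytes per Python int

print(fixed_gb)   # 8.0
print(python_gb)  # 36.0
```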

**Task**

![image.png](attachment:image.png)

**Answer**

In [4]:
import sys

x = 2147483647 # maximum value for 32-bit integers
num_bytes = sys.getsizeof(x)
num_mb = 1000000000 * num_bytes / 1000000  # MB used by one billion such integers

We are going to learn how we can compute the minimum number of bits required to represent integer data.

As a data engineer we will need to find good ways to store our data so that data scientists can access it. By analyzing the kind of data that we can expect, we can find the most efficient datatype for representing it as well as the memory requirements for storing it.

Imagine that we know that we are going to have to store a huge number of integers. If we know that all of them can be represented with 32 bits, then we can use 32-bit integers rather than 64-bit integers for storing them, thus saving a lot of space.

Python provides a method, `int.bit_length()`, that computes the number of bits required to represent a given integer. However, this method uses the binary representation that we learned earlier, which only considers the magnitude of the integer and ignores its sign, rather than the two's complement representation.

![image.png](attachment:image.png)

![image.png](attachment:image.png)
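The difference between the two counts can be illustrated with a short sketch. The helper `twos_complement_bits` is hypothetical and encodes one possible rule (an assumption, since the original figures are not reproduced here): non-negative values need one extra bit for the sign position, and a negative value `v` needs as many bits as `-v - 1` does.

```python
def twos_complement_bits(value: int) -> int:
    """Smallest number of bits that can hold value in two's complement."""
    if value >= 0:
        # One extra bit for the leftmost position, which must be 0.
        return value.bit_length() + 1
    # For negatives, ~value equals -value - 1. For example, -4 fits in
    # 3 bits, exactly like 3 does.
    return (~value).bit_length() + 1


print((5).bit_length())            # 3, because 5 is 101 in binary
print((-5).bit_length())           # 3, the sign is ignored
print(twos_complement_bits(3))     # 3
print(twos_complement_bits(4))     # 4
print(twos_complement_bits(-4))    # 3
print(twos_complement_bits(127))   # 8
print(twos_complement_bits(-128))  # 8
```

The last two lines agree with the range of 8-bit integers, which goes from -128 to 127.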

**Task**

![image.png](attachment:image.png)

**Answer**

In [5]:
def minimum_required_bits(list_of_integers):
    min_req_bits = 0 
    for value in list_of_integers:
        nb_bits = int.bit_length(value)
        min_req_bits = max(min_req_bits, nb_bits)
    return min_req_bits


with open('identifiers.txt') as file:
    values = list(file)
    values = [int(value) for value in values]
    print(minimum_required_bits(values))

34


So far in this file we have learned about the two's complement integer representation and its advantages. This type of representation is the most commonly used one in computers, even though Python uses another representation. As we mentioned, databases use this fixed-bit two's complement representation.

We also learned about the most common units used to represent memory capacities in a computer. Developing a sense of how much space data will occupy in memory or on disk is an important skill for us as data engineers.

We are now going to focus on textual data and files. We have already learned about encodings and therefore we already know that the number of bytes used to represent textual data depends on the encoding that is used.

![image.png](attachment:image.png)

We might have expected this value to be 4 since all characters used in the string `"data"` are ASCII characters and we have learned that we can represent them using 1 byte each. This indicates that, as with the `int` type, there is some overhead. If we are correct then the empty string should be 49 bytes (53 - 4) and `"data!"` should have a size of 54 bytes (one more byte). Let's try it out:

![image.png](attachment:image.png)
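The same check can be written in terms of differences, which avoids relying on the exact overhead. This is a sketch that assumes CPython; the absolute sizes can change between Python versions, but for ASCII strings each extra character should add one byte.

```python
import sys

empty_size = sys.getsizeof("")
data_size = sys.getsizeof("data")
longer_size = sys.getsizeof("data!")

print(data_size - empty_size)   # 4, one byte per ASCII character
print(longer_size - data_size)  # 1, one more byte for the extra character
```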

**Task**

1. A string `s` has been provided. Using the function `sys.getsizeof()` assign the number of bytes used by `s` to a variable named `size_s`.

2. Using the function `sys.getsizeof()`, assign the number of bytes used by `s + s` to a variable named `size_ss`.

**Answer**

In [6]:
import sys

s = "你"
size_s = sys.getsizeof(s)
size_ss = sys.getsizeof(s + s)

Above we saw that `你` used 76 bytes and `你你` used 78. If the overhead was 49 as we said, then it would mean that `你` uses 76 - 49 = 27 bytes. Not only is this a lot, but it also contradicts the fact that `你你` uses only 2 more bytes. So what is going on?

![image.png](attachment:image.png)

If all characters in a string are representable in Latin-1, then Python will use it to save space. If even a single character does not exist in Latin-1, Python will switch to UCS-2. In the same way, if the string contains characters that are not representable in UCS-2, then it will switch to UCS-4.
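We can observe the three widths by measuring how much a string grows when one more copy of the same character is added. The sketch below assumes CPython 3.3 or later; `bytes_per_character` is a hypothetical helper written for this illustration.

```python
import sys


def bytes_per_character(char: str) -> int:
    """Extra bytes needed when a one-character string is doubled."""
    return sys.getsizeof(char * 2) - sys.getsizeof(char)


print(bytes_per_character("a"))       # 1, ASCII fits in Latin-1
print(bytes_per_character("\u00e9"))  # 1, e with acute accent is in Latin-1
print(bytes_per_character("你"))      # 2, needs UCS-2
print(bytes_per_character("🐍"))      # 4, needs UCS-4
```

The result for `你` matches what we saw above, where `你你` used only 2 more bytes than `你`.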

We might wonder why it does not use UTF-8, since this already saves space by representing some characters with 1 byte, some with 2 bytes and so on. The reason is precisely the fact that UTF-8 is not a fixed length encoding!

Without a fixed length encoding, to access a character at a given index, Python would need to loop over the string from the start to find it. With a fixed length encoding, it can jump to the right position by multiplying the index by the number of bytes of each character!
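The difference between the two lookups can be illustrated on raw bytes. This is only a sketch of the idea, not how Python is implemented internally; the helper names are hypothetical, and UTF-32 stands in for a fixed length encoding with 4 bytes per character.

```python
text = "I like 你 and 🐍"

fixed = text.encode("utf-32-le")   # every character takes exactly 4 bytes
variable = text.encode("utf-8")    # characters take between 1 and 4 bytes


def char_at_fixed_width(data: bytes, index: int, width: int = 4) -> str:
    """Jump straight to the character by multiplying the index by the width."""
    start = index * width
    return data[start:start + width].decode("utf-32-le")


def utf8_length(first_byte: int) -> int:
    """Number of bytes in a UTF-8 character, judged from its first byte."""
    if first_byte < 0x80:
        return 1
    if first_byte >= 0xF0:
        return 4
    if first_byte >= 0xE0:
        return 3
    return 2


def char_at_utf8(data: bytes, index: int) -> str:
    """Walk over every earlier character to find where the wanted one starts."""
    position = 0
    for _ in range(index):
        position += utf8_length(data[position])
    end = position + utf8_length(data[position])
    return data[position:end].decode("utf-8")


print(char_at_fixed_width(fixed, 7))  # 你
print(char_at_utf8(variable, 13))     # 🐍
```

The fixed width lookup does the same amount of work for any index, while the UTF-8 lookup gets slower the further into the string we go.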

The overhead changes depending on the contents of the string, but it will be a value between 49 and 80 bytes. Understanding the exact value of the overhead is not very important, as it does not grow indefinitely. What is more important is the fact that by adding a single non-Latin-1 character to a string, we will roughly double its size, since each character will start being encoded using 2 bytes.

![image.png](attachment:image.png)
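The doubling effect is easy to reproduce on a longer string. The sketch below assumes CPython; exact sizes vary between versions, so the printed values are not listed.

```python
import sys

latin_text = "a" * 1000            # 1000 characters, all in Latin-1
mixed_text = latin_text + "你"     # one extra character outside Latin-1

latin_size = sys.getsizeof(latin_text)
mixed_size = sys.getsizeof(mixed_text)

# The single extra character forces every character to use 2 bytes.
print(latin_size, mixed_size, round(mixed_size / latin_size, 2))
```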

In practice this means that when we are processing data in Python we should try to limit our textual data to data that can be represented using Latin-1 or at least UCS-2. For example, imagine that we are processing instant messaging information. Maybe the emojis are not important for the analysis of the data and can thus be removed.

![image.png](attachment:image.png)

**Task**

![image.png](attachment:image.png)

**Answer**

In [7]:
import sys
message = """I really like learning about Python! 🐍\n
Me too! 😀😀\n I can't wait to see what we will learn in the next course 🙃\n"""

message_latin_bytes = message.encode(encoding='Latin-1', errors='ignore')
cleaned_message = message_latin_bytes.decode(encoding='Latin-1')

message_size = sys.getsizeof(message)
cleaned_message_size = sys.getsizeof(cleaned_message)

Above we looked at the memory consumption of strings in Python. We learned that Python uses one of three fixed length encodings in order to try to reduce the amount of memory used whenever possible.

Python also provides a [`os.path.getsize()` function](https://docs.python.org/3.7/library/os.path.html#os.path.getsize) to compute the size in bytes that a file occupies on disk. We can use this function by passing the path of the file like so:

![image.png](attachment:image.png)
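A self-contained way to try this is to write a small file into a temporary folder and measure it. This is a minimal sketch; the file name `example.txt` is hypothetical and the temporary folder is removed automatically at the end of the `with` block.

```python
import os
import tempfile

text = "data"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")   # hypothetical file name
    with open(path, mode="w", encoding="utf-8") as file:
        file.write(text)
    # The file is closed at this point, so its contents are fully written.
    size_on_disk = os.path.getsize(path)

print(size_on_disk)  # 4, one byte per ASCII character in UTF-8
```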

**Task**

![image.png](attachment:image.png)

**Answer**

In [8]:
import os

messages = """I really like learning about Python! 🐍\n 
Me too! 😀😀\n I can't wait to see what we will learn in the next course 🙃\n"""

with open('utf8.txt', mode='w', encoding='utf8') as file:
    file.write(messages)
    
size_utf8 = os.path.getsize('utf8.txt')

with open('utf32.txt', mode='w', encoding='utf32') as file:
    file.write(messages)
    
size_utf32 = os.path.getsize('utf32.txt')
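The difference between the two file sizes comes from how each encoding turns characters into bytes. The sketch below shows this on a short sample string (chosen here for illustration) without writing any file. Note that Python's `utf-32` codec also writes a 4-byte byte order mark at the start.

```python
sample = "Python 🐍"

utf8_bytes = sample.encode("utf-8")
utf32_bytes = sample.encode("utf-32")

print(len(sample))       # 8 characters
print(len(utf8_bytes))   # 11: 7 ASCII characters at 1 byte each, 4 for the emoji
print(len(utf32_bytes))  # 36: 4 for the byte order mark, 4 for each character
```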

Let's wrap up this file with a challenge about estimating the disk size requirements for storing transactions in a sales website.

Now that we are experts in estimating how much disk space textual data occupies, a sales company has contacted us to help them estimate how much disk space will be required to store all transactions that will occur on their website for the next **20 years**.

![image.png](attachment:image.png)

**Task**

`num_days_in_a_year = 365
num_years = 20
bytes_per_char = 32 / 8
num_transactions = 1000000
username_size = 20
product_name_size = 50`

Compute an estimate of the number of GB that you think are enough to store the data given the conditions highlighted above. Assign your answer to a variable named `num_gb`.

**Answer**

In [9]:
num_days_in_a_year = 365
num_years = 20
bytes_per_char = 32 / 8
num_transactions = 1000000
username_size = 20
product_name_size = 50

total_days = num_years * num_days_in_a_year
bytes_per_transaction = bytes_per_char * (2 * username_size + product_name_size)
bytes_per_day = bytes_per_transaction * num_transactions
total_bytes = total_days * bytes_per_day
bytes_in_gb = 10 ** 9

num_gb = total_bytes / bytes_in_gb

In this file we have learned a new way to represent integers that uses a fixed number of bits and allows us to represent both positive and negative integers. We have learned about size units in a computer and how to compute the size of a Python object in memory as well as the size of a file on disk.

We have also leveraged what we have learned about encodings to be able to estimate the size that data will occupy.