# 02: Data types in Python, with functions for dessert
**Author:** Benedikt Goodman \
**Assistants:** Mistral Large, ChatGPT-4

This lesson is a hodgepodge of things I have stolen, things I have made myself, and things I got from both Mistral and ChatGPT. That is why this notebook is in English.

## An introduction to Python's in-built datatypes

Python has several in-built datatypes that we can use to structure and manipulate our data. These can be broadly categorized into two types: atomic datatypes and container datatypes.

### Atomic Datatypes

Atomic datatypes are the simplest forms of data that Python can work with. They are `immutable`, meaning their value cannot be changed once they are created. Examples of atomic datatypes include:

- `int` (integer): This is a whole number. For example, `3`, `-10`, `0`.
- `float` (floating point number): This is a decimal number. For example, `3.14`, `-0.1`, `10.0`.
- `str` (string): This is a sequence of characters. For example, `"Hello"`, `'World'`, `"123"`.
- `bool` (boolean): This is a binary value, either `True` or `False`.

As we covered in the previous lecture, all of these are `objects` with their own `properties` and `methods`.

Here's an example of how to use these datatypes:

In [None]:
# Atomic datatypes, we call them this because they only store "one thing"
my_int = 10
my_float = 3.14
my_str = "Hello, World!"
my_bool = True

In [None]:
# All of these have built-in methods, even boolean values. Here are some of them
print('Some of the methods that bool has:\n', dir(my_bool)[-10:])

# bool is a subclass of int, so it inherits int's methods.
# conjugate() returns the complex conjugate, which for True is the integer 1
my_bool.conjugate()
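
To make the immutability claim concrete, here is a small sketch. The names `greeting` and `shouted` are made up for illustration. String methods return a new object instead of changing the original, and assigning to a single character is rejected.

```python
# Hypothetical names, used only for illustration
greeting = "hello"

# String methods return a NEW string; the original is left untouched
shouted = greeting.upper()
print(greeting)  # hello
print(shouted)   # HELLO

# Trying to change a character in place raises a TypeError
try:
    greeting[0] = "J"
    modified_in_place = True
except TypeError:
    modified_in_place = False

print("Could modify in place:", modified_in_place)
```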

### Container datatypes

Container datatypes, on the other hand, can hold multiple values at once. Most of them are mutable, meaning their content can be changed after they are created; the tuple is the immutable exception. Examples of container datatypes include:

- `list`: This is an ordered, mutable sequence of items. Items in a list can be of different types.
- `tuple`: This is an ordered, immutable sequence of items. Like lists, items in a tuple can be of different types.
- `dict` (dictionary): This is a mutable collection of key-value pairs that keeps its insertion order (Python 3.7 and later).
- `set`: This is an unordered, mutable collection of unique items.

Note, all of these have their own bound methods which can do all sorts of handy things. That's not the point of this lecture, the point here is to simply illustrate their versatility.

Here's a simple example of how to instantiate the different containers:


In [None]:
# Container datatypes

# Lists use square brackets
my_list = [1, 2, 3, "Python"]

# Tuples use parentheses
my_tuple = (1, 2, 3, "Python")

# Dicts use curly brackets but have key: value pairs
my_dict = {"language": "Python", "year": 1991}

# Sets have curly brackets but only items like a list, i.e. no key: value
my_set = {1, 2, 3}
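
As a rough sketch of how the four containers behave once they exist, the example below changes a list and a dict, shows that a set drops duplicates, and shows that a tuple refuses to be changed. All names and values are invented for illustration.

```python
# Hypothetical example data, for illustration only
languages = ["Python", "R"]             # list: mutable
release = ("Python", 1991)              # tuple: immutable
info = {"language": "Python"}           # dict: mutable
unique_numbers = {1, 2, 2, 3, 3, 3}     # set: duplicates are dropped

languages.append("SQL")     # lists can grow
info["year"] = 1991         # dicts can get new keys

try:
    release[1] = 1992       # tuples cannot be changed
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(languages)
print(info)
print(unique_numbers)
print("Tuple changed:", tuple_changed)
```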

### Hashed Containers vs Non-Hashed Containers

One key difference between dictionaries and lists (or tuples) is how they handle lookups. Dictionaries are hashed containers, meaning they use a `hash-table` for storing items, which allows for *fast lookups*. Lists and tuples, on the other hand, are `non-hashed` containers, meaning lookups are done sequentially, which can be slower for large datasets.

**TLDR about hashing**: When we say that something is hashed, we mean that something is indexed and can be quickly retrieved using that index. It basically means that the computer already knows where to look and only uses one operation to find it. If something is non-hashed it is not indexed and the computer has to loop through it to find it, hence using many operations.

The longer, nerdier version:

In Python, hashing refers to the process of converting an object into a fixed-size integer, known as a hash value. The hash value is used to quickly locate the object in a hash table, which is the data structure behind dictionaries and sets.

When we say that an object is hashable in Python, we mean that it has a hash value that can be used to store and retrieve the object in a hash table. The hash value must remain constant over the lifetime of the object, which is why Python's built-in mutable containers (lists, dicts and sets) are not hashable.

Non-hashable objects, on the other hand, cannot be used as keys in a dictionary or as elements in a set. They can be mutable, like a list, or immutable, like a tuple that contains a list.
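
The built-in `hash()` function lets us poke at this directly. The sketch below uses made-up example data: a string and a tuple can be hashed, a tuple works as a dictionary key, and a list is rejected.

```python
# Immutable built-ins are hashable: hash() returns an integer
print(type(hash("Python")))                # <class 'int'>
print(hash((1, 2, 3)) == hash((1, 2, 3)))  # True: equal objects hash equally

# A tuple can therefore be used as a dictionary key (hypothetical data)
coordinates = {(59.91, 10.75): "Oslo"}
print(coordinates[(59.91, 10.75)])

# A list is mutable and not hashable, so hash() rejects it
try:
    hash([1, 2, 3])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print("List hashable:", list_is_hashable)
```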


In [None]:
large_list  = [item for item in range(1000000)]
large_dict = {f'{item}': item for item in range(1000000)}

In [None]:
%%timeit
# Filtration of the list if we don't know the exact position of an entry means we have to iterate through the entire list
[item for item in large_list if item == 666]

In [None]:
%%timeit
# Dicts on the other hand allow you to name the keys. 
# And you usually tend to know what keys you want to extract from a dict.
large_dict['666']
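
The same difference shows up for membership tests with `in`. The sketch below uses the standard library `timeit` module, so it also runs outside a notebook where `%%timeit` is not available. The container size and the names are arbitrary choices for illustration, and the exact timings will vary from machine to machine.

```python
import timeit

# Smaller than the million-element containers above, to keep the run short
numbers_list = list(range(100_000))
numbers_set = set(numbers_list)

target = 99_999  # worst case for the list: the last element

# Time 100 membership checks on each container
list_time = timeit.timeit(lambda: target in numbers_list, number=100)
set_time = timeit.timeit(lambda: target in numbers_set, number=100)

print(f"list: {list_time:.6f} s")
print(f"set:  {set_time:.6f} s")
```

The list has to be scanned element by element, while the set jumps straight to the right slot in its hash table.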

One of the powerful features of Python's container datatypes is that they can store any type of object, including other iterables, functions, and even classes. This makes them extremely versatile and useful for storing complex data, functions, and so forth. You can even store *containers within containers*.

In [None]:
# Storing iterables in container datatypes is no problem
my_list_of_lists = [[1, 2, 3], [4, 5, 6], [7, [8, [9]]]]
my_dict_of_tuples = {"tuple1": (1, 2, 3), "tuple2": (4, 5, 6)}

# This is useful for when you want to treat a subset of data stored in a container
[sum(item) for key, item in my_dict_of_tuples.items() if key == 'tuple1']

You can also store functions in container datatypes. This is useful when you want to pass around functions as data or apply different functions to the same data.

In [None]:
# Storing functions in container datatypes
def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

# Make a dict of functions
my_dict_of_functions = {"add": add, "subtract": subtract}

# You can call these functions like this
print(my_dict_of_functions["add"](5, 3))  # Output: 8
print(my_dict_of_functions["subtract"](5, 3))  # Output: 2
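
Applying different functions to the same data can then be a simple loop over the dict. This is a minimal sketch; the name `operations` and the numbers are chosen only for illustration.

```python
def add(x, y):
    return x + y

def subtract(x, y):
    return x - y

operations = {"add": add, "subtract": subtract}

# Apply every stored function to the same pair of numbers
results = {name: func(5, 3) for name, func in operations.items()}
print(results)  # {'add': 8, 'subtract': 2}
```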

Can you store dataframes and other classes inside containers? Absolutely. Let's see how we can use containers to apply the same treatment to several pandas dataframes.

Suppose we wanted to select the string columns from three different DataFrames and concatenate them into a separate DataFrame.

In [None]:
import pandas as pd

# Create DataFrame 1
data1 = {
    "A": [1, 2, 3],
    "B": ["foo", "bar", "baz"],
    "C": [4.1, 5.2, 6.3]
}
df1 = pd.DataFrame(data1)

# Create DataFrame 2
data2 = {
    "D": [7, 8, 9],
    "E": ["qux", "quux", "corge"],
    "F": [8.4, 9.5, 10.6]
}
df2 = pd.DataFrame(data2)

# Create DataFrame 3
data3 = {
    "G": [11, 12, 13],
    "H": ["grault", "garply", "waldo"],
    "I": [12.7, 13.8, 14.9]
}
df3 = pd.DataFrame(data3)

df1.head()

In [None]:
# Select the string columns from each DataFrame
string_cols = [df.select_dtypes(include=["object"]) for df in [df1, df2, df3]]

# String cols is now a list of small dataframes
print('Datatypes of string_cols:',[type(col) for col in string_cols])

string_cols

In [None]:
# Make them into one dataframe
pd.concat(string_cols, axis=1)

### Containers like dicts and lists are easily combined

In [None]:
# Lets make two dicts
dict1 = {"a": 1, "b": 2}
dict2 = {"c": 3, "d": 4}

# Combine the dictionaries with **
combined = {**dict1, **dict2}

combined
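
One thing worth knowing when combining dicts is what happens when a key exists in both. The sketch below uses invented settings dicts; with `**` unpacking the right-most value wins, and Python 3.9 and later offer the `|` operator for the same merge.

```python
# Hypothetical settings, for illustration only
defaults = {"sep": ";", "encoding": "utf-8"}
overrides = {"sep": ","}

# With ** unpacking, the value from the right-most dict wins
merged = {**defaults, **overrides}
print(merged)  # {'sep': ',', 'encoding': 'utf-8'}

# Python 3.9+ offers the | operator for the same merge
merged_with_pipe = defaults | overrides
print(merged == merged_with_pipe)  # True

# The original dicts are left unchanged
print(defaults)  # {'sep': ';', 'encoding': 'utf-8'}
```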

In [None]:
# Making two lists
list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Combine the lists with *, let's stuff the dataframes in there as well to prove how versatile they are
[*list1, *list2, *string_cols]

## Immutable objects vs mutable objects, a word of caution

In Python, objects can be either `mutable` or `immutable`. Mutable objects **can be modified after they are created**, while **immutable objects cannot**. Lists are an example of a mutable object, while tuples are an example of an immutable object.

Here's an example that illustrates the difference between mutability and immutability:

In [None]:
# Create a list (mutable object)
my_list = [1, 2, 3]

# Create a tuple (immutable object)
my_tuple = (1, 2, 3)

# Modify the list
my_list[0] = 10
print(my_list)  # Output: [10, 2, 3]

# Try to modify the tuple (this will raise an error)
try:
    my_tuple[0] = 10
except TypeError:
    print("Cannot modify a tuple")

In the example above, we first create a list `my_list` and a tuple `my_tuple`. We then modify the first element of the list to be `10`. Since lists are mutable, this modification is allowed and the output is `[10, 2, 3]`.

We then try to modify the first element of the tuple to be `10`. However, since tuples are immutable, this modification is not allowed and a `TypeError` is raised.

### The Problem with Mutability

While mutability can be useful in some cases, it can also be a problem in others. Here's an example that illustrates how mutability can lead to unexpected behavior:

In [None]:
# Create a list of lists
my_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Create a second name that references the same list as my_list (no copy is made)
my_other_list = my_list

# Modify the first element of the first sublist in my_other_list
my_other_list[0][0] = 10

# Print both lists
print("my_list:", my_list)
print("my_other_list:", my_other_list)

In the example above, we first create a list of lists `my_list`. We then create a second name, `my_other_list`, that references the same list object as `my_list`. No new list is created by the assignment.

We then modify the first element of the first sublist through `my_other_list`, setting it to `10`. Since `my_other_list` and `my_list` are two names for the same object, the change is visible through both.

When we print both names, we see the same modified list twice.
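
If an independent list is what we want, we have to copy explicitly. The sketch below, which assumes the standard library `copy` module is the tool of choice, also shows why a shallow copy is not enough for nested lists: only `copy.deepcopy` copies the sublists as well.

```python
import copy

my_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# A shallow copy creates a new outer list, but the sublists are still shared
shallow = my_list.copy()
# A deep copy also copies the sublists
deep = copy.deepcopy(my_list)

my_list[0][0] = 10

print("my_list:", my_list)   # [[10, 2, 3], [4, 5, 6], [7, 8, 9]]
print("shallow:", shallow)   # [[10, 2, 3], [4, 5, 6], [7, 8, 9]]
print("deep:   ", deep)      # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```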

This behaviour applies to all mutable objects in Python, including dataframe objects in pandas and other libraries in which we manipulate data in dataframe-like objects: plain assignment gives the same object a second name and never copies it.

In [None]:
# Mutability
# Create a DataFrame with some data
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Create a second name that references the same DataFrame as df1
df2 = df1

# Modify df2 by adding a new column
df2["C"] = [7, 8, 9]

# Print both DataFrames, we see that they both have the new column C
print("df1:")
print(df1)
print("\ndf2:")
print(df2)

To fix mutability-related issues in pandas we can use `df.copy()`.

In [None]:
# df.copy() saves the day
df3 = df2.copy()
df3['D'] = 'yay, no mutability issues detected'

print("df2:")
print(df2)
print("\ndf3:")
print(df3)


## A primer on functions

Functions are reusable blocks of code that perform a specific task. In Python, functions are defined using the `def` keyword, followed by the function name, a set of parentheses that may contain input parameters, and a colon. The code block that makes up the function is indented beneath the function definition.

In [None]:
def add_numbers(x, y):
    # Use the name total, since sum would shadow Python's built-in sum() function
    total = x + y
    return total

### Why are functions useful?

Functions are useful because they allow you to encapsulate a specific task in a reusable block of code. This makes your code more modular, easier to read and understand, and easier to test and debug.

Functions also allow you to pass input parameters to the function, which can be used to customize the behavior of the function for different use cases. Functions can also return output values, which can be used in other parts of your code.
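
As a small sketch of how parameters customize behaviour, here is a variant of `add_numbers` with a default value for `y`. The default of `10` is an arbitrary choice made for this illustration and is not part of the original function.

```python
def add_numbers(x, y=10):
    """Return the sum of x and y; y defaults to 10 (an arbitrary choice)."""
    return x + y

print(add_numbers(1, 2))      # 3, both arguments given by position
print(add_numbers(1))         # 11, y falls back to its default
print(add_numbers(x=5, y=5))  # 10, arguments given by keyword
```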

### Scope inside of the function vs outside

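In Python, variables defined inside a function have local scope, which means that they are only accessible within the function. Variables defined outside of a function have global scope, which means that they can be read from within any function in the same module. Assigning to the same name inside a function does not change the global variable; it creates a new local variable instead.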

In [None]:
# Define a global variable
x = 10

# Define a function that uses the global variable
def use_global_variable():
    print("x inside function:", x)

# Call the function
use_global_variable()

# Define a function that defines a local variable with the same name as the global variable
def use_local_variable():
    x = 20
    print("x inside function:", x)

# Call the function
use_local_variable()

# Print the value of the global variable
print("x outside function:", x)
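
The other direction can be sketched as well: a name created inside a function is not visible outside it, and the return value is the way to get data out. The function and variable names below are invented for illustration.

```python
def make_greeting():
    # secret_message is local: it exists only while the function runs
    secret_message = "hello from inside the function"
    return secret_message

# The return value is the way to get data out of the function
greeting = make_greeting()
print(greeting)

# The local name itself is not visible out here
try:
    print(secret_message)
    local_name_visible = True
except NameError:
    local_name_visible = False

print("Local name visible outside:", local_name_visible)
```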

### Functions can only return one thing

In Python, a function can only return one object. However, that object can be a tuple, which lets you return multiple values from a function. Most of the time, however, it is better to have one function return one thing. It makes your code easier to understand.

It is generally a good guideline to have a function do one thing and one thing only.

In [None]:
def calculate_stats(numbers):
    total = sum(numbers)
    average = total / len(numbers)
    minimum = min(numbers)
    maximum = max(numbers)
    return total, average, minimum, maximum

# Call the function with a list of numbers
stats = calculate_stats([1, 2, 3, 4, 5])

# Function returns a tuple (i.e. "one thing") with the calculated values
print('What the function returns:', stats)

# Unpack the tuple returned by the function
total, average, minimum, maximum = stats

# Print the calculated stats
print("Total:", total)
print("Average:", average)
print("Minimum:", minimum)
print("Maximum:", maximum)
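
Following the one-function-one-thing guideline, the same statistics could also be split into small single-purpose functions. This is only a sketch of that alternative, with function names invented for illustration; it is not meant as the one right way to structure it.

```python
def calculate_total(numbers):
    """Return the sum of the numbers."""
    return sum(numbers)

def calculate_average(numbers):
    """Return the arithmetic mean of the numbers."""
    return sum(numbers) / len(numbers)

numbers = [1, 2, 3, 4, 5]
print("Total:", calculate_total(numbers))      # 15
print("Average:", calculate_average(numbers))  # 3.0
```

Each function can now be named, tested and reused on its own.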

### Optional reading on how mutability-issues come about

In Python, mutability issues stem from how Python stores objects in memory and assigns names to them. When you create an object in Python, such as a list or a dictionary, Python stores that object in memory and assigns it a unique memory address.

When you assign a name to an object in Python, you are creating a reference to that object in memory. For example, consider the following code:

```python
my_list = [1, 2, 3]
```
In this code, we are creating a list object `[1, 2, 3]` and assigning it to the name `my_list`. Python stores this list object in memory and assigns it a unique memory address. The name `my_list` is then associated with this memory address, so that we can use `my_list` to refer to the list object.

Now, suppose we create a new name that refers to the same list object:
```python
my_other_list = my_list
```
In this code, we are creating a new name `my_other_list` and assigning it to the same memory address as `my_list`. This means that both `my_list` and `my_other_list` now refer to the same list object in memory.

Since lists are mutable in Python, we can modify the list object in place using either name. For example:
```python
my_list.append(4)
print(my_other_list)  # Output: [1, 2, 3, 4]
```
In this code, we are using the `append()` method to add a new element to the list object referred to by `my_list`. Since `my_other_list` refers to the same list object, this modification is also visible when we print `my_other_list`.

This behavior is what leads to mutability issues in Python. Since multiple names can refer to the same mutable object in memory, modifications to that object can have unexpected effects on other parts of your code that are using the same object.

To avoid mutability issues, it's often a good idea to use immutable objects whenever possible. Immutable objects, such as tuples in Python, cannot be modified after they have been created. Because their contents can never change, several names can safely refer to the same immutable object. Note that a tuple only protects its own slots: a mutable object stored inside a tuple, such as a list, can still be modified.
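
A minimal sketch of why sharing a tuple is safe: "adding" to a tuple builds a new tuple and rebinds only the name being assigned, so the other name keeps seeing the original value. The variable names mirror the list example and are chosen for illustration.

```python
my_tuple = (1, 2, 3)
my_other_tuple = my_tuple  # both names refer to the same tuple object

# "Adding" to a tuple builds a new tuple and rebinds only this one name
my_tuple = my_tuple + (4,)

print(my_tuple)        # (1, 2, 3, 4)
print(my_other_tuple)  # (1, 2, 3)
```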

In [None]:
# create a list
my_list = [1, 2, 3]

# my_list and my_other_list are now assigned the same memory-address
my_other_list = my_list

# proof they are the same
print('memory-address of my_list:', id(my_list))
print('memory-address of my_other_list:', id(my_other_list))

When we create `my_other_list` from `my_list` directly they share a memory-address. This means that both `my_list` and `my_other_list` now refer to the same list object in memory.

In [None]:
my_list.append(4)
print(my_other_list)
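
When two names should not share one list, an explicit copy gives each name its own object. The sketch below uses the hypothetical name `independent_copy` and compares identity (`is`) with equality (`==`).

```python
my_list = [1, 2, 3]
independent_copy = my_list.copy()  # list(my_list) or my_list[:] work too

print(my_list is independent_copy)  # False: two separate objects
print(my_list == independent_copy)  # True: same contents for now

my_list.append(4)
print(my_list)           # [1, 2, 3, 4]
print(independent_copy)  # [1, 2, 3]
```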
