# Generators

Generators are functions that return an object that can be iterated over. What makes them special is that they produce their items lazily: one at a time, and only when you ask for the next one. Because of this, they are much more memory efficient than sequence objects such as lists when you have to deal with large data sets. They are a powerful, advanced Python technique.

In [10]:
def mygenerator():
    yield 1
    yield 2
    yield 3

for i in mygenerator():
    print(i)

1
2
3
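
To make the lazy behaviour visible, a generator can also be advanced by hand with the built-in `next()`. The following is a minimal sketch; `countdown` is a hypothetical generator introduced only for illustration, and its `print` calls show when the function body actually runs.

```python
def countdown(n):
    """Hypothetical generator that logs when its body actually runs."""
    print("generator body started")
    while n > 0:
        yield n
        n -= 1
    print("generator body finished")

gen = countdown(2)      # nothing is printed yet: the body has not started
first = next(gen)       # runs the body up to the first yield
second = next(gen)      # resumes right after the first yield

try:
    next(gen)           # no values left: the body finishes and StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)
```

A `for` loop does the same thing behind the scenes: it calls `next()` repeatedly and stops when `StopIteration` is raised.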


Let's take a closer look by comparing a function that builds a complete list with an equivalent generator. The generator is memory efficient because, instead of storing every value of the collection up front, it computes each value only when it is requested.

In [25]:
import sys

# Builds the complete list in memory before returning it
def firstn(n):
    nums = []
    num = 0
    while num < n:
        nums.append(num)
        num+=1
    return nums

# Generator version: yields one value at a time
def firstn_generator(n):
    num = 0
    while num < n:
        yield num
        num += 1

print(sys.getsizeof(firstn(1000000)))
print(sys.getsizeof(firstn_generator(1000000)))

8448728
200


We can see that the generator object is orders of magnitude smaller than the list. Note that `sys.getsizeof` reports only the size of the container itself, not of the integers it refers to, so the real gap is even larger. Furthermore, we don't have to wait for the whole collection to be built in memory before we can use its first items. Consequently, generators are very handy when the data can be computed from a known pattern.
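
As an illustration of data that follows a known pattern, here is a minimal sketch of a generator that never ends. The `fibonacci` function is a hypothetical example, not part of the cells above; `itertools.islice` from the standard library is used to take only the first ten values.

```python
from itertools import islice

def fibonacci():
    """Yield Fibonacci numbers indefinitely; only two integers are kept at a time."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# islice stops after ten items, so the infinite loop is never a problem
first_ten = list(islice(fibonacci(), 10))
print(first_ten)
```

A list could not represent this sequence at all, since it would have to hold infinitely many values.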

Generator expressions are handier still. They are written like list comprehensions, but with parentheses instead of square brackets, which makes them a concise shortcut for defining simple generators.

In [113]:
# Using expression as generator definition
mygenerator = (i for i in range(100000) if i % 2 == 0)
print("Size of generator is: ", sys.getsizeof(mygenerator))

# Using expression as list definition
mylist = [i for i in range(100000) if i % 2 == 0]
print("Size of list is: ", sys.getsizeof(mylist))

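# Count the items produced by the generator
# (named `count` so the built-in `sum` is not shadowed)
count = 0
for i in mygenerator:
    count += 1
print(count)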

Size of generator is:  208
Size of list is:  444376
50000
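
One consequence of lazy evaluation is that a generator can be iterated over only once. The sketch below uses illustrative variable names to show this, and also that a generator expression can be passed directly to a function.

```python
squares = (n * n for n in range(5))

first_pass = list(squares)     # consumes every value
second_pass = list(squares)    # the generator is exhausted, so this is empty

# A generator expression can be passed directly as the only argument of a function
largest = max(n * n for n in range(5))

print(first_pass, second_pass, largest)
```

If the values are needed more than once, either create a new generator or store them in a list.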


# Threading vs Multiprocessing

## Thread

A thread is an entity within a process that can be scheduled for execution. All threads within a process share the same memory space, which allows them to communicate quickly but also requires careful management to avoid conflicts.

| Pros                                         | Explanation                                                                                     | Cons                                    | Explanation                                                                                     |
|----------------------------------------------|--------------------------------------------------------------------------------------------------|-----------------------------------------|--------------------------------------------------------------------------------------------------|
| All threads within a process share the same memory | This allows threads to communicate quickly and share data easily.                                | Threading is limited by the GIL: only one thread runs at a time | In CPython, the Global Interpreter Lock (GIL) means that only one thread can execute Python bytecode at a time, which prevents true parallelism. |
| Lightweight                                  | Threads use fewer resources and less overhead than processes.                                    | No effect for CPU-bound tasks           | For CPU-bound tasks, threads do not provide performance improvements due to the GIL.             |
| Starting a thread is faster than starting a process | Creating a thread incurs less overhead compared to creating a process.                           | Not interruptible/killable              | Threads cannot be forcibly terminated from the outside; they must complete their task.           |
| Great for I/O-bound tasks                    | Threads can handle multiple I/O operations concurrently, improving performance.                  | Careful with race conditions            | Shared memory access can lead to race conditions if not managed properly.                        |



**When to Use Threads:**

- *I/O-bound tasks:* Threads are excellent for tasks that spend a lot of time waiting for I/O operations, such as reading from a disk, network communication, or user inputs.
- *Lightweight operations:* When you need to perform many small tasks simultaneously, threads are more efficient because they require less overhead to create and manage.
- *Quick startup time:* If your application needs to start many parallel tasks quickly, threads are advantageous due to their fast startup time compared to processes.

**Avoid Threads:**

- *CPU-bound tasks:* Due to the Global Interpreter Lock (GIL) in Python, threads are not suitable for CPU-bound tasks because only one thread can execute Python code at a time.
- *High control needs:* If you need to be able to forcibly stop a task, threads are not ideal since they cannot be interrupted or killed externally.
- *Complex synchronization:* If your tasks require complex sharing and synchronization of data, the potential for race conditions and the complexity of managing thread safety might outweigh the benefits.
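
The benefit for I/O-bound work can be sketched with `concurrent.futures.ThreadPoolExecutor` from the standard library, which is not used in the cells of this section. `fake_download` is a hypothetical task in which `time.sleep` stands in for waiting on a network or disk; the GIL is released during the sleep, so the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(task_id):
    """Hypothetical I/O-bound task: sleeping stands in for waiting on a network."""
    time.sleep(0.2)
    return task_id * 2

task_ids = [1, 2, 3, 4, 5]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map() returns the results in the same order as the inputs
    results = list(pool.map(fake_download, task_ids))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s (running the tasks one after another would take about 1.0s)")
```

With a CPU-bound function in place of the sleep, the same code would show little or no speed-up, for the reasons listed above.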

In [115]:
# Import the basic threading library
from threading import Thread
import os
import time

# Number of threads to start
threads = []
num_threads = 10

# The function that each thread will run
def square_numbers():
    for i in range(100):
        i * i

# Create the threads, each targeting the function
for i in range(num_threads):
    t = Thread(target=square_numbers)
    threads.append(t)

# Start each thread
for t in threads:
    t.start()

# Block the main thread until all threads are finished
for t in threads:
    t.join()

print('end main')

end main


Since threads live in the same memory space, sharing data is easy. In Python, we can do it with a global variable, which here simulates a database. We have to be careful, though, as the integrity of the database can easily be compromised when several threads modify it at the same time (a race condition). Therefore, we have to use synchronization: a lock is acquired before the shared data is processed and released afterwards.

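In addition, a queue can be used to exchange data between threads. A queue is a linear data structure that follows the first-in, first-out (FIFO) principle, so items are handled in the order they arrive and no thread is systematically kept from getting work. Note that any input or output stream is a shared resource as well and has to be protected in the same way as the database.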

In [146]:
# Import special type for lock
from threading import Lock

database_value = 0

def increase(lock):
    global database_value

    # Only one thread can hold the lock at a time
    with lock:
        local_copy = database_value
        local_copy += 1
        time.sleep(0.03)
        database_value = local_copy

# Creating threads
lock = Lock()
thread1 = Thread(target=increase, args=(lock,))
thread2 = Thread(target=increase, args=(lock,))

# Starting threads
thread1.start()
thread2.start()

# Wait for both threads to finish
thread1.join()
thread2.join()

print("End value of database: ", database_value)

End value of database:  2
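
The queue-based approach mentioned above can be sketched with `queue.Queue` from the standard library, which does its own locking internally. The `worker` function and the `None` sentinel used to stop the workers are assumptions made for this illustration, not part of the cell above.

```python
import queue
import threading

def worker(tasks, results):
    """Take numbers from the task queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work for this worker
            break
        results.put(item * item)

tasks = queue.Queue()
results = queue.Queue()

workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):
    tasks.put(n)
for _ in workers:
    tasks.put(None)               # one sentinel per worker

for w in workers:
    w.join()

# The order in which results arrive depends on thread scheduling, so sort them
squares = sorted(results.get() for _ in range(10))
print(squares)
```

No explicit `Lock` is needed here because the threads never touch a shared variable directly; they only communicate through the two queues.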


The main thread is the initial thread of execution in any Python program: when a script starts, a single main thread is created by default. A daemon thread, in contrast, runs in the background and is typically used for tasks that should not block the program from exiting.

| Aspect           | Main Thread                                              | Daemon Thread                                            |
|------------------|----------------------------------------------------------|----------------------------------------------------------|
| **Role**         | Handles the primary logic and essential tasks.           | Handles background tasks like logging and monitoring.    |
| **Lifecycle**    | Program runs until the main thread and all other non-daemon threads finish. | Program exits when only daemon threads are left running. |
| **Termination**  | Prevents program from exiting until it completes.        | Does not prevent program from exiting.                   |
| **Creation**     | Automatically created when the program starts.           | Created by setting `daemon` attribute to `True` before starting the thread. |
| **Usage**        | Used for core application logic and critical operations. | Used for periodic or auxiliary tasks that can terminate abruptly. |
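
A minimal sketch of a daemon thread, meant to be run as a script. `heartbeat` is a hypothetical background task that never returns on its own.

```python
import threading
import time

def heartbeat():
    """Hypothetical background task that would run forever."""
    while True:
        time.sleep(0.05)

# daemon=True has to be set before the thread is started
background = threading.Thread(target=heartbeat, daemon=True)
background.start()

time.sleep(0.1)                       # the main thread does its own work
still_running = background.is_alive()

print("daemon:", background.daemon, "alive:", still_running)
# The script can end here even though heartbeat() never returns.
# With daemon=False the interpreter would wait for the thread forever.
```

Because a daemon thread is stopped abruptly at exit, it should not be in charge of work that must be completed, such as writing a file to the end.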


## Process

A process is an instance of a program running in its own memory space. Each process is independent and isolated from others, which provides stability and security but requires more resources.

| Pros                                               | Explanation                                                                                     | Cons                                               | Explanation                                                                                     |
|----------------------------------------------------|--------------------------------------------------------------------------------------------------|----------------------------------------------------|--------------------------------------------------------------------------------------------------|
| Takes advantage of multiple CPUs and cores         | Processes can run on different CPUs or cores, providing true parallelism.                       | Heavyweight                                        | Processes use more system resources and are heavier than threads.                               |
| Separate memory space -> Memory is not shared between processes | Each process has its own memory space, which enhances stability and security.                   | Starting a process is slower than starting a thread | Creating a new process takes more time and resources compared to a thread.                      |
| Great for CPU-bound processing                     | Processes can fully utilize multiple cores for CPU-intensive tasks.                             | More memory                                        | Each process requires its own memory allocation, leading to higher memory usage.                |
| New process is started independently from other processes | Processes are isolated, so the failure of one does not affect others.                            | IPC (inter-process communication) is more complicated | Communicating between processes is more complex and requires additional mechanisms.              |
| Processes are interruptible/killable               | Processes can be terminated externally, allowing for better control.                            |                                                    |                                                                                                  |
| One GIL for each process -> avoids GIL limitation  | Each process has its own GIL, so they can run Python code in parallel.                          |                                                    |                                                                                                  |


**When to Use Processes:**

- *CPU-bound tasks:* Processes can fully utilize multiple CPU cores because each process runs independently and has its own Python interpreter and GIL.
- *Memory isolation:* If you need tasks to run in isolated memory spaces to avoid conflicts and improve stability, processes are the better choice.
- *Parallel execution:* For true parallel execution on multiple cores, processes are necessary.

**Avoid Processes:**

- *I/O-bound tasks:* The overhead of creating and managing processes can be too high for tasks that spend a lot of time waiting for I/O operations.
- *Resource efficiency:* If you need to run many tasks simultaneously with minimal overhead, the resource demands of processes may be too high.
- *Complex IPC:* Communication between processes is more complicated than between threads, which can add complexity to your application if frequent inter-process communication is needed.

In [117]:
# Import the basic multiprocessing library
from multiprocessing import Process

# A good number of processes is the number of cores
processes = []
num_processes = os.cpu_count()

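# Create and start one process per core, each running the function
# (as a script on Windows or macOS, put this code under `if __name__ == "__main__":`)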
for i in range(num_processes):
    p = Process(target=square_numbers)
    processes.append(p)
    p.start()

# Block the main thread until the processes are finished
for p in processes:
    p.join()

print('end main')

end main
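
The processes above discard their results. Getting values back from a worker process requires some form of inter-process communication. One way to sketch it is with `concurrent.futures.ProcessPoolExecutor` from the standard library, which is not used in the cell above and handles the communication internally. `sum_of_squares` is a hypothetical CPU-bound function, and the example is meant to be run as a script rather than inside a notebook.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(n):
    """CPU-bound work; defined at module level so worker processes can import it."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this part when they
    # import the main module (needed with the "spawn" start method).
    inputs = [10, 100, 1000]
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        totals = list(pool.map(sum_of_squares, inputs))
    print(totals)
```

Arguments and return values are pickled and sent between processes, so this pays off only when the computation is heavy compared to the data that has to be transferred.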
