# Lab 15 Example
This lab covers two advanced Python topics: **Concurrency and Parallelism**, and **Packaging and Distribution**.

## Concurrency and Parallelism
In Python, **concurrency and parallelism** are two ways of making progress on multiple tasks to improve performance, responsiveness, or throughput. The terms are often used interchangeably, but they refer to different concepts:

* **Concurrency** is about dealing with many tasks at once—managing multiple operations in overlapping time frames (e.g., using threads or asynchronous I/O).
* **Parallelism** is about doing many tasks at the same time—actually running code simultaneously, typically on multiple CPU cores (e.g., using the `multiprocessing` module).

In standard CPython builds, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, which limits true parallelism in multi-threaded code. Tools like **`asyncio`**, **`threading`**, and **`multiprocessing`** offer different models for achieving concurrency and parallelism. Understanding when and how to use each is critical for building high-performance applications, especially in web servers, data pipelines, or scientific computing.
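As a rough illustration of concurrency with threads, the sketch below runs five simulated I/O waits in a thread pool. `wait_for_io` and the 0.2-second delay are hypothetical stand-ins for a real network or disk call. `time.sleep` releases the GIL while it waits, which is why the waits can overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_for_io(task_id: int) -> int:
    """Hypothetical I/O-bound task: sleep instead of waiting on a real socket."""
    time.sleep(0.2)  # the GIL is released during the sleep
    return task_id


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # map() returns results in the same order as the inputs
    finished = list(executor.map(wait_for_io, range(5)))
elapsed = time.perf_counter() - start

print(f"{len(finished)} tasks finished in {elapsed:.2f} s")
```

Run one after another, the five waits would take about one second. With the thread pool they overlap, so the total should be close to a single 0.2-second wait.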


### **Concurrency (AsyncIO)**

Some programs spend much of their running time waiting. For example, a web scraping script often waits for network responses, and during that time the CPU is mostly idle. A natural idea is to use this idle time for other work, such as sending another request, processing previously received data, or performing background tasks.

This is where **asynchronous programming** comes in. Python's `asyncio` library allows you to write code that can pause (await) during slow operations without blocking the entire program. This makes it possible to handle many tasks concurrently in a single thread — ideal for I/O-bound workloads such as:

* Fetching data from multiple APIs
* Reading and writing files or databases
* Handling thousands of web clients in a server

With `async def` functions and the `await` keyword, you can build efficient, non-blocking applications that are easier to read and maintain than traditional callback-based approaches.
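To make the `async def` / `await` pattern concrete, here is a minimal sketch. `slow_operation` is a hypothetical coroutine that uses `asyncio.sleep` in place of a real network call. The script form below uses `asyncio.run`. In a notebook cell, where an event loop is already running, you would write `await main()` instead.

```python
import asyncio
import time


async def slow_operation(name: str, delay: float) -> str:
    """Hypothetical slow task: pauses without blocking the event loop."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> list:
    # gather() runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(
        slow_operation("first", 0.3),
        slow_operation("second", 0.2),
        slow_operation("third", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)
print(f"Elapsed: {elapsed:.2f} s")
```

The three delays add up to 0.6 seconds, but because the coroutines wait concurrently the total should be close to the longest single delay.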


Let's use Lab 9 as an example, in which we scraped 50 pages from [https://books.toscrape.com/](https://books.toscrape.com/).

In [9]:
import asyncio
import time
import os
import aiohttp

os.makedirs("data_async", exist_ok=True)

url_format = "https://books.toscrape.com/catalogue/page-{}.html"

In [10]:
# Coroutine that downloads one page and saves it to disk
async def get_html(session: aiohttp.ClientSession, page_id: int) -> None:
    """Fetch the specified page and save its HTML under data_async/."""
    async with session.get(url_format.format(page_id)) as response:
        response.raise_for_status()  # fail loudly on HTTP errors
        # Decode the response body as UTF-8
        html = await response.text(encoding="utf-8")
    # The file write is synchronous, but it is short compared with the network wait
    with open(f"data_async/page-{page_id}.html", "w", encoding="utf-8") as f:
        f.write(html)


# Schedule all 50 downloads and wait for them together
async def main():
    """Main function to run the async tasks."""
    tasks = []
    async with aiohttp.ClientSession() as session:
        for page_id in range(1, 51):
            tasks.append(get_html(session, page_id))
        await asyncio.gather(*tasks)

start = time.time()
# Top-level await works in Jupyter; in a plain script use asyncio.run(main())
await main()
end = time.time()
print(f"Time taken: {end - start:.2f} seconds")
# check if the files are created
for page_id in range(1, 51):
    assert os.path.exists(f"data_async/page-{page_id}.html"), f"File page-{page_id}.html not found"
print("All files created successfully.")

Time taken: 0.89 seconds
All files created successfully.


In [7]:
# without async
import requests

os.makedirs("data_no_async", exist_ok=True)

def get_html_no_async(page_id: int) -> None:
    """Fetch the specified page and save its HTML under data_no_async/."""
    url = "https://books.toscrape.com/catalogue/page-{}.html"
    response = requests.get(url.format(page_id))
    # change encoding to utf-8
    response.encoding = "utf-8"
    html = response.text
    with open(f"data_no_async/page-{page_id}.html", "w", encoding="utf-8") as f:
        f.write(html)

start = time.time()
for page_id in range(1, 51):
    get_html_no_async(page_id)
end = time.time()
print(f"Time taken: {end - start:.2f} seconds")
# check if the files are created
for page_id in range(1, 51):
    assert os.path.exists(f"data_no_async/page-{page_id}.html")
print("All files created successfully.")

Time taken: 19.88 seconds
All files created successfully.
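The async version above starts all 50 requests at once. When a server should not be hit that hard, one common option is to cap the number of requests in flight with `asyncio.Semaphore`. The following is a minimal sketch under assumptions: `fetch_page` simulates a request with `asyncio.sleep` rather than calling `aiohttp`, and `max_in_flight` is a hypothetical parameter name.

```python
import asyncio


async def fetch_page(page_id: int, limiter: asyncio.Semaphore) -> str:
    """Hypothetical fetch: at most `max_in_flight` of these run at a time."""
    async with limiter:
        await asyncio.sleep(0.05)  # stand-in for the network request
        return f"page-{page_id}"


async def crawl(num_pages: int, max_in_flight: int = 10) -> list:
    limiter = asyncio.Semaphore(max_in_flight)
    tasks = [fetch_page(page_id, limiter) for page_id in range(1, num_pages + 1)]
    return await asyncio.gather(*tasks)


# In a notebook cell, use `pages = await crawl(50)` instead of asyncio.run
pages = asyncio.run(crawl(50))
print(len(pages), pages[0], pages[-1])
```

A smaller limit is gentler on the server but takes longer overall, so the value is a trade-off rather than a fixed rule.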


### **Parallelism (multiprocessing)**

`asyncio` speeds up tasks that spend most of their time waiting, such as downloading files or querying web APIs, by putting the idle time to use. But what if your program is **CPU-intensive**, for example multiplying large numbers, sorting huge datasets, or processing images?

While the total execution time for CPU-bound tasks is limited by your machine's hardware, your task may not be using all the available computational resources (i.e., CPU time and CPU cores).

Modern operating systems can execute multiple threads or processes in parallel across multiple CPU cores. To take advantage of this, you can break your workload into smaller chunks and run them **in parallel**. In Python, this is typically done using the **`multiprocessing`** module.

`multiprocessing` creates separate processes, each with its own interpreter and its own GIL, so they can run truly in parallel across multiple CPU cores. For compute-heavy operations this can bring significant speedups, although starting processes and copying data between them adds overhead.
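The idea of splitting a workload into chunks and handing them to worker processes can be sketched as follows. `sum_of_squares` and `split_into_chunks` are hypothetical helpers, and the worker count of 4 is an arbitrary choice. The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module, so this pattern is usually kept in a `.py` file.

```python
from multiprocessing import Pool


def sum_of_squares(chunk):
    """CPU-bound work applied to one chunk of the data."""
    return sum(x * x for x in chunk)


def split_into_chunks(values, num_chunks):
    """Split a list into at most `num_chunks` contiguous pieces."""
    size = max(1, -(-len(values) // num_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


if __name__ == "__main__":
    numbers = list(range(1_000_000))
    chunks = split_into_chunks(numbers, 4)

    # Each chunk is pickled, sent to a worker process, and processed there
    with Pool(processes=4) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)

    total = sum(partial_sums)
    print(total == sum_of_squares(numbers))
```

Combining the partial results at the end is cheap here because each worker returns a single number. Returning large objects from workers would add more copying overhead.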


In [23]:
import random
random.seed(0)

def rand_array(n):
    return [random.randint(0, 10000000) for _ in range(n)]

arrays = [rand_array(100000) for _ in range(100)]

In [26]:
from multiprocessing import Pool, cpu_count
import time

def norm_vector(v):
    norm = sum(x**2 for x in v) ** 0.5
    return norm

# Use all available CPU cores
num_workers = cpu_count()

print(f"Running on {num_workers} cores...")

start = time.time()

with Pool(processes=num_workers) as pool:
    results = pool.map(norm_vector, arrays)

end = time.time()
print(f"[Multiprocessing] Computed norms of {len(arrays)} arrays in {end - start:.2f} seconds.")

Running on 32 cores...
[Multiprocessing] Computed norms of 100 arrays in 0.56 seconds.


In [27]:
# Single-threaded version
start = time.time()
results = []
for arr in arrays:
    results.append(norm_vector(arr))
end = time.time()
print(f"[Single-threaded] Computed norms of {len(arrays)} arrays in {end - start:.2f} seconds.")

[Single-threaded] Computed norms of 100 arrays in 1.12 seconds.
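It can help to turn the two timings above into a speedup and a per-worker efficiency. The sketch below uses only the numbers reported in the outputs (1.12 s, 0.56 s, 32 workers). The variable names are illustrative.

```python
single_seconds = 1.12    # single-threaded timing reported above
parallel_seconds = 0.56  # multiprocessing timing reported above
num_workers = 32         # worker count reported above

speedup = single_seconds / parallel_seconds
efficiency = speedup / num_workers  # fraction of the ideal 32x speedup

print(f"Speedup: {speedup:.2f}x")
print(f"Parallel efficiency: {efficiency:.2%}")
```

A speedup well below the worker count is common for short tasks. One plausible cause here is overhead: `pool.map` has to start the worker processes and pickle each 100,000-element list to send it to a worker, and that cost is comparable to the computation itself.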


## Packaging and Distribution

Throughout this class, we've worked with several popular libraries such as `NumPy`, `Pandas`, `Matplotlib`, and `Seaborn`. You might now be wondering how to create your own Python library and share it with others. In this section, we'll walk you through the process of packaging your code into a reusable library and distributing it so others can install and use it just like any other Python package.


To build our own library, we need to have several important files. Here is the structure of the most basic Python library:

```
my_lib/
├── my_lib/
│   ├── __init__.py
│   └── ... (your module code here)
├── setup.py
└── README.md
```

Where:

* `__init__.py` marks the directory as a Python package. It can also be used to expose selected functions or classes at the package level.
* `setup.py` contains the package configuration. It tells Python (and tools like `pip`) how to install and manage your package — including its name, version, dependencies, and author information.
* `README.md` is optional but strongly recommended. It provides users with an overview of your library — what it does, how to install it, and how to use it.

Once these files are in place, you can build and install the package locally, or even publish it to PyPI to share it with others.
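To see how `__init__.py` exposes a function at the package level, the sketch below builds a throwaway package in a temporary directory and imports from it. All names (`demo_pkg_sketch`, `core.py`, `greet`) are hypothetical and chosen only for illustration. The real lab package uses `my_lib` and `attention`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a temporary project root containing one package directory
project_root = Path(tempfile.mkdtemp())
package_dir = project_root / "demo_pkg_sketch"
package_dir.mkdir()

# core.py holds the implementation
(package_dir / "core.py").write_text(
    "def greet(name):\n"
    '    """Return a short greeting."""\n'
    '    return f"Hello, {name}!"\n',
    encoding="utf-8",
)

# __init__.py re-exports greet so callers can skip the submodule name
(package_dir / "__init__.py").write_text(
    "from .core import greet\n\n__all__ = ['greet']\n",
    encoding="utf-8",
)

# Make the project root importable, then import from the package level
sys.path.insert(0, str(project_root))
importlib.invalidate_caches()
from demo_pkg_sketch import greet

print(greet("PyPI"))
```

Without the line in `__init__.py`, users would have to write `from demo_pkg_sketch.core import greet`. Re-exporting keeps the public interface short and lets you reorganize internal modules later.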

### 1. Build local package
1. Write some code in `my_lib/my_lib`. We will use the `attention` function from Lab 13.
2. In `my_lib/my_lib/__init__.py`, import the `attention` function.
3. Write the package configuration in `my_lib/setup.py`.

In [None]:
# Contents of my_lib/setup.py
from setuptools import setup, find_packages

setup(
    name='my_lib_dsci510',  # distribution name used by pip and PyPI; the import name stays my_lib
    version='0.1.0',
    description='A simple demo library with an attention function',
    author='Your Name',
    packages=find_packages(),
    install_requires=[
        'numpy>=1.20.0'  # Add any version you depend on
    ],
    python_requires='>=3.6',
)

Now we can install the library locally:

In [16]:
!pip install my_lib/

Processing ./my_lib
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: my_lib_dsci510
  Building wheel for my_lib_dsci510 (pyproject.toml) ... done
  Crea

Try using it in a Python notebook:

In [18]:
from my_lib import attention

import numpy as np
def test_attention():
    """Test the attention function."""
    # Create dummy data
    query = np.random.rand(5, 3)
    key = np.random.rand(5, 3)
    value = np.random.rand(5, 3)

    # Call the attention function
    output = attention(query, key, value)

    # Check the shape of the output
    print(output.shape)
    print(output)

test_attention()

(5, 3)
[[0.58703092 0.59007248 0.5074895 ]
 [0.54361973 0.5679409  0.44604053]
 [0.6952744  0.72870416 0.57378038]
 [0.59618506 0.59568121 0.51795619]
 [0.52681412 0.50852617 0.47150766]]


### 2. Publish your library to PyPI
Once your library is ready and tested locally, you can publish it to [PyPI (Python Package Index)](https://pypi.org/) so others can install it using `pip`. Here’s how to do it:

#### 1. Add a `pyproject.toml` file

This file tells modern packaging tools such as `pip` and `build` which build backend to use and what it needs.

Create `pyproject.toml` in the root of your project (`my_lib/`) with the following content:

```toml
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"
```

#### 2. Build your package (generate the source archive and the wheel)
Install the build tool:


In [13]:
!pip install build

Collecting build
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting pyproject_hooks (from build)
  Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB)
Downloading build-1.2.2.post1-py3-none-any.whl (22 kB)
Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing 

Then run the build from inside the `my_lib/` directory:

In [19]:
!cd my_lib && python -m build

* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools
  - wheel
* Getting build dependencies for sdist...
running egg_info
writing my_lib_dsci510.egg-info/PKG-INFO
writing dependency_links to my_lib_dsci510.egg-info/dependency_links.txt
writing requirements to my_lib_dsci510.egg-info/requires.txt
writing top-level names to my_lib_dsci510.egg-info/top_level.txt
reading manifest file 'my_lib_dsci510.egg-info/SOURCES.txt'
writing manifest file 'my_lib_dsci510.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing my_lib_dsci510.egg-info/PKG-INFO
writing dependency_links to my_lib_dsci510.egg-info/dependency_links.txt
writing requirements to my_lib_dsci510.egg-info/requires.txt
writing top-level names to my_lib_dsci510.egg-info/top_level.txt
reading manifest file 'my_lib_dsci510.egg-info/SOURCES.txt'
writing manifest file 'my_lib_dsci510.egg-info/SOURCES.txt'
running check
creatin

This creates a `dist/` folder containing files like:

```
dist/
├── my_lib_dsci510-0.1.0.tar.gz
└── my_lib_dsci510-0.1.0-py3-none-any.whl
```


#### 3. Upload to PyPI
Install Twine:

In [20]:
!pip install twine



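Then upload your package. You will be prompted for your PyPI credentials. PyPI no longer accepts account passwords for uploads, so [register an account](https://pypi.org/account/register/), create an API token, and supply the token when prompted: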
```bash
twine upload my_lib/dist/*
```

#### Now everyone can install the library!
Once uploaded, anyone can install your package using:

```bash
pip install my_lib_dsci510
```
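After installing, one way to confirm that the distribution is present is to query its metadata from Python. This is a small sketch using the standard library's `importlib.metadata`. `installed_version` is a hypothetical helper name. Note that the lookup uses the distribution name (`my_lib_dsci510`), not the import name (`my_lib`).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints the version if the package is installed in this environment, else None
print(installed_version("my_lib_dsci510"))
```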