# Assignment 10: Parallel Computing

### Due 24 November 2025

### Introduction

This assignment is about parallel computing with `joblib` and Dask. Please implement all calculations in Python. If possible, submit your answers in PDF or HTML format. If you have trouble installing Dask via pip, try the following command:

```bash
python -m pip install "dask[complete]" --use-deprecated=legacy-resolver
```

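This command installs Dask together with its optional dependencies. The `--use-deprecated=legacy-resolver` flag tells pip to use its older dependency resolver, which can work around some resolution errors.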

You can also install Dask using conda (which the authors recommend):

```bash
conda install dask
```

If you encounter any issues, please check the Dask installation guide at <https://docs.dask.org/en/stable/install.html> and let us know.

1. Explain the concept of "overhead" in parallel computing with `joblib`. Why might running a very simple task (like adding 1 to a number) in parallel with `joblib` be slower than running it serially?
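
One way to observe the effect empirically is to time both versions of a trivial task. The snippet below is only a minimal sketch: the function name `add_one`, the input size, and `n_jobs=2` are illustrative assumptions, and the timings will vary from machine to machine.

```python
import time
from joblib import Parallel, delayed

def add_one(x):
    """Trivial task (hypothetical): the work is tiny compared with dispatching it."""
    return x + 1

numbers = list(range(1_000))  # hypothetical input size

# Serial version: a plain list comprehension.
start = time.perf_counter()
serial = [add_one(x) for x in numbers]
serial_time = time.perf_counter() - start

# Parallel version: joblib has to start workers and send each task to them.
start = time.perf_counter()
parallel = Parallel(n_jobs=2)(delayed(add_one)(x) for x in numbers)
parallel_time = time.perf_counter() - start

print(f"serial:   {serial_time:.4f} s")
print(f"parallel: {parallel_time:.4f} s")
```

Both versions produce the same list, so any difference in elapsed time comes from how the work is scheduled rather than from the work itself.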

Your answer here

2. Write a Python function `count_vowels(text)` that counts the vowels (a, e, i, o, u, case-insensitive) in a given string. Then, use the `Parallel` and `delayed` functions from the `joblib` library to apply your function in parallel to each sentence in the list below. Use all available cores. The parallel call should return a list of integers, where each integer is the number of vowels in the corresponding sentence.

```python
sentences = [
    "Joblib makes parallel computing easy",
    "Dask scales Python code effectively",
    "Parallelism can speed up computations",
    "Always consider the overhead"
]
```
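
As a reminder of the general `Parallel`/`delayed` pattern, here is a minimal sketch on a different task. The function `count_words` and the list `phrases` are hypothetical and are not part of the assignment; `n_jobs=-1` asks `joblib` to use all available cores.

```python
from joblib import Parallel, delayed

def count_words(text):
    """Hypothetical helper: number of whitespace-separated words in a string."""
    return len(text.split())

phrases = ["one two three", "four five", "six"]  # hypothetical input

# delayed(f)(arg) records the call without running it;
# Parallel runs the recorded calls and returns results in input order.
word_counts = Parallel(n_jobs=-1)(delayed(count_words)(p) for p in phrases)

print(word_counts)  # [3, 2, 1]
```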

## Your answer here

3. Write a function called `get_length` that takes a word as input and returns its length. Then, using the provided list `words`, do the following:

* Use a standard (sequential) for loop to calculate the length of each word by calling your function.
* Use the `joblib` library to calculate the length of each word in parallel, also calling your function. Use `Parallel` and `delayed` from `joblib` again.
* Compare the syntax of the sequential and parallel approaches. How do they differ when writing the loop?

```python
words = ["joblib", "parallel", "computing", "example"]
```

## Your answer here

4. Create a 10000x10000 Dask array `da_a` filled with random integers between 0 and 100, chunked into (500, 1000) blocks. Use `RandomState(350)` to make your code reproducible. Create a second Dask array `da_b` of the same shape and chunks, filled with ones. Compute `da_c = (da_a + da_b) * 2` and its mean value.
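
The calls involved can be sketched on a much smaller array. Everything specific below (the seed `0`, the 1000x1000 shape, the `(250, 500)` chunks, and the subtraction) is a hypothetical choice for illustration, not the values the question asks for.

```python
import dask.array as da

# Seeded random state so the array is reproducible (hypothetical seed).
rng = da.random.RandomState(0)

# Random integers in [0, 10), split into 250x500 blocks.
x = rng.randint(0, 10, size=(1000, 1000), chunks=(250, 500))

# Array of ones with the same shape and chunking.
y = da.ones((1000, 1000), chunks=(250, 500))

# Building the expression is lazy; nothing is computed yet.
z = x - y

# .compute() triggers the actual work and returns a plain number.
result = z.mean().compute()

print(x.numblocks)  # number of blocks along each axis
print(result)
```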

## Your answer here

5. What is the difference between calling `.compute()` and `.persist()` on a Dask DataFrame? When would you typically use `.persist()`?

Your answer here.

6. In this question, you will compare the performance of a regular `for` loop and `dask` for a simple computation. First, create a function called `intensive_task` as follows:

```python
import numpy as np
import time
import dask

def intensive_task(n):
    loop_limit = 10_000_000 # How many iterations inside the function
    total = 0
    for i in range(loop_limit):
        total += i*i
    return total
```

Then, create a list called `inputs` with 6 values:

```python
inputs = [1, 2, 3, 4, 5, 6] 
```

Now, use `time.time()` to measure how long it takes to run `intensive_task` on each value in `inputs` with a regular `for` loop. Store the results in a list called `results`. Record `start_time` before the loop and `end_time` after it, then print the elapsed time (`end_time - start_time`).

Repeat the same task using `dask`. However, instead of using the `@dask.delayed` decorator, use the code below:

```python
tasks = [dask.delayed(intensive_task)(i) for i in inputs]
```

Then, use `dask.compute()` to compute the results. Again, measure the time taken for the computation and print the result. Which one is faster?
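
The delayed-then-compute pattern with timing can be sketched as follows. The function `slow_square` and the list `values` are hypothetical stand-ins that use `time.sleep` instead of real work, so the timings here say nothing about `intensive_task`.

```python
import time
import dask

def slow_square(x):
    """Hypothetical stand-in for real work: wait briefly, then square."""
    time.sleep(0.1)
    return x * x

values = [1, 2, 3]  # hypothetical inputs

# Wrapping the function with dask.delayed records each call lazily.
tasks = [dask.delayed(slow_square)(v) for v in values]

start_time = time.time()
# Unpack the list so each task is computed; the results come back as a tuple.
results = dask.compute(*tasks)
end_time = time.time()

print(results)  # (1, 4, 9)
print(f"Elapsed: {end_time - start_time:.2f} s")
```

Note that `dask.compute(*tasks)` returns a tuple with one entry per task, whereas `dask.compute(tasks)` would return a one-element tuple containing a list.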

## Your answer here

7. In the same folder as this notebook, you will find a Parquet file named `data.parquet`. It is available here: <https://github.com/danilofreire/qtm350/blob/main/assignments/data.parquet>. This file contains student records with the following columns:

* `emory_id` (integer) 
* `student_name` (string)
* `major` (string)
* `gpa` (float)

Write Python code using `dask.dataframe` to read the `data.parquet` file, but only load the `major` and `gpa` columns. Then, print the first 5 rows of the resulting Dask DataFrame using the `.head()` method, and calculate the average GPA by major.

You will need a Parquet engine to read the file. If you don't have one installed, you can use `pyarrow`. You can install it using conda (or pip):

```bash
conda install pyarrow
```

## Your answer here

8. You have two CSV files in this directory:

* `students.csv`: Contains columns `student_id`, `student_name`. Available here: <https://github.com/danilofreire/qtm350/blob/main/assignments/students.csv>.
* `grades.csv`: Contains columns `student_id`, `course`, `grade`. Available here: <https://github.com/danilofreire/qtm350/blob/main/assignments/grades.csv>.

Write Python code using `dask.dataframe` to:

* Read `students.csv` into a Dask DataFrame called `ddf_students`.
* Read `grades.csv` into a Dask DataFrame called `ddf_grades`.
* Merge these two DataFrames together based on the common `student_id` column. An inner merge is recommended (only include students present in both files).
* From the merged DataFrame, select only the `student_name`, `course`, and `grade` columns. Save it as `ddf_final`.
* Compute and print the first 5 rows of this final merged DataFrame using `.head()`.

Good luck! 😃