## Overview

In structural biology, we often analyze large numbers of atomic distances in
protein structures. A key task is identifying potential hydrogen bonds between
atoms, as these bonds play a crucial role in stabilizing protein structures and
determining their biological function.

A hydrogen bond is a relatively weak attraction that can form between atoms
under specific conditions. We'll cover the chemistry of hydrogen bonds in detail
later in the course. For this assignment, we'll use a greatly simplified
definition: a hydrogen bond can potentially exist between any two atoms that are
between 2.5 and 3.5 Angstroms (Å) apart, inclusive. Real hydrogen bonds have
additional geometric and chemical requirements that we're ignoring for now.
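
To make the distance criterion concrete, here is a minimal sketch that computes the distance between two atoms from their 3D coordinates and applies the 2.5-3.5 Å cutoff. The coordinates and the helper names (`distance_between`, `could_be_hydrogen_bond`) are invented for illustration. The assignment itself starts from precomputed distances rather than coordinates.

```python
import math

# Hypothetical (x, y, z) coordinates in Angstroms, invented for this example.
donor_atom = (0.0, 0.0, 0.0)
acceptor_atom = (1.0, 2.0, 2.0)
distant_atom = (4.0, 4.0, 2.0)

def distance_between(atom_a, atom_b):
    """Return the Euclidean distance between two (x, y, z) points."""
    return math.dist(atom_a, atom_b)

def could_be_hydrogen_bond(distance):
    """Apply the simplified criterion: 2.5 <= distance <= 3.5 (in Angstroms)."""
    return 2.5 <= distance <= 3.5

near = distance_between(donor_atom, acceptor_atom)  # sqrt(1 + 4 + 4)
far = distance_between(donor_atom, distant_atom)    # sqrt(16 + 16 + 4)

print(f"{near:.1f} Å -> {could_be_hydrogen_bond(near)}")
print(f"{far:.1f} Å -> {could_be_hydrogen_bond(far)}")
```

The first pair falls inside the window and the second does not, which is the only distinction the simplified definition makes.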

(An Angstrom, written as Å, is a unit of length equal to 0.1 nanometers or
10^-10 meters. It's commonly used for atomic-scale measurements because it's
conveniently sized; most covalent bonds are roughly 1-2 Å in length.)
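
If the unit is unfamiliar, the conversion can be sketched in a few lines. The constant and function names below are made up for this example and are not part of the assignment code.

```python
# Conversion factors taken from the definition above: 1 Å = 0.1 nm = 1e-10 m.
ANGSTROMS_PER_NANOMETER = 10.0
METERS_PER_ANGSTROM = 1e-10

def angstroms_to_nanometers(length_in_angstroms):
    """Convert a length from Angstroms to nanometers."""
    return length_in_angstroms / ANGSTROMS_PER_NANOMETER

def angstroms_to_meters(length_in_angstroms):
    """Convert a length from Angstroms to meters."""
    return length_in_angstroms * METERS_PER_ANGSTROM

# The hydrogen-bond window used in this assignment, expressed in nanometers.
lower_nm = angstroms_to_nanometers(2.5)
upper_nm = angstroms_to_nanometers(3.5)
print(f"{lower_nm:.2f} nm to {upper_nm:.2f} nm")
```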

For this assignment, you'll analyze both serial and parallel implementations of
code that processes simulated structural data to identify atom pairs that could
form hydrogen bonds based on their distances. The goal is to understand how
parallel processing can speed up this type of structural analysis.

**NOTE:** This assignment is fairly challenging. Each student should turn in
their own work, but please do feel free to work in groups to discuss the
concepts.

### Part 1: Understanding Serial Implementation

First, study this serial implementation, run it in Google Colab, and answer the
relevant questions in Canvas:

import time      # For measuring execution time of our analysis
import random    # For generating random numbers

def make_fake_distances():
    """
    Generate a list of random atomic distances for testing purposes.
    
    This function creates synthetic data that simulates distances between atoms,
    useful for testing analysis algorithms without needing real molecular data.
    
    Returns:
        list: A list of 100 floating-point numbers between 0 and 10,
              representing atomic distances in Angstroms (Å).
    
    Note:
        - Uses a fixed random seed (42) to ensure reproducible results
        - Distances are in Angstroms (Å), a common unit in molecular analysis
        - The range 0-10 Å represents typical interatomic distances in molecules
    """
    # Set random seed to ensure we all get the same random numbers. By tradition, we
    # use 42, though there is no special meaning to this number.
    random.seed(42)

    # Generate test data: 100 random atomic distances between 0 and 10 Angstroms
    distances = []
    for _ in range(100):  # Generate 100 distances
        # random.uniform generates a random float between 0 and 10
        distances.append(random.uniform(0, 10))

    return distances

def analyze_distance(distance):
    """
    Analyze whether a given atomic distance falls within the distance range
    typical of a hydrogen bond.
    
    This function checks if an atomic distance falls within the range of 2.5 to
    3.5 Angstroms. It includes an artificial delay to simulate computational
    processing time.
    
    Args:
        distance (float): The atomic distance to analyze, in Angstroms (Å)
    
    Returns:
        bool: True if the distance is between 2.5 and 3.5 Å (inclusive),
              False otherwise
    
    Note:
        The artificial delay (0.04 seconds) is added to simulate real-world
        processing time and demonstrate the impact of sequential vs.
        parallel processing.
    """
    # Add artificial delay to simulate computational processing time
    time.sleep(0.04)  # Wait for 0.04 seconds
    
    # Check if the distance falls within our target range. Returns True if
    # distance is between 2.5 and 3.5 Angstroms (inclusive).
    return (distance >= 2.5) and (distance <= 3.5)

def analyze_distances_serial(distances):
    """
    Analyze a list of atomic distances using a serial (one-at-a-time) approach
    to determine what percentage fall within a specific range of interest
    (2.5-3.5 Å), the length of a typical hydrogen bond.
    
    This function processes each distance sequentially, which is straightforward
    but may be slower than parallel approaches when dealing with large datasets.
    
    Args:
        distances (list): List of float values representing atomic distances in
                          Angstroms (Å)
    
    Returns:
        float: Percentage (0-100) of distances that fall within the target range
               of 2.5-3.5 Å
    
    Example:
        If 30 out of 100 distances fall within the range, the function returns
        30.0, which corresponds to 30.0% of the distances.
    """
    # Initialize a counter for distances that fall within our target range. This
    # will keep track of how many distances meet our criteria.
    count = 0
    
    # Iterate through each distance in our list one at a time. This is the
    # "serial" part - we process distances sequentially.
    for distance in distances:
        # Check if this distance falls within our target range. The
        # analyze_distance function returns True if the distance is 2.5-3.5 Å.
        if analyze_distance(distance):
            count += 1
    
    # Convert our count to a percentage: divide the number of distances in our
    # target range by the total number of distances, then multiply by 100
    # (e.g., 30.0 for 30%).
    percentage = (count / len(distances)) * 100
    
    return percentage

# Generate our test dataset using the make_fake_distances function. This creates
# a list of random distances that we can analyze.
distances = make_fake_distances()

# Record the starting time of our analysis. We'll use this to measure how long
# the computation takes.
start_time = time.time()

# Perform the distance analysis using our serial implementation. This will
# calculate what percentage of distances fall within our target range.
result = analyze_distances_serial(distances)

# Calculate how long the analysis took. Subtract the start time from the current
# time to get elapsed time.
end_time = time.time()
serial_time = end_time - start_time

# Output the results of our analysis. Print both the percentage of distances in
# our target range and the time taken.
print(f"Percentage of distances in target range: {result:.2f}%")
print(f"Time taken for serial analysis: {serial_time:.3f} seconds")
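
Before running the serial code, it can help to estimate how long it should take. Each call to `analyze_distance` sleeps for 0.04 seconds and there are 100 distances, so the serial run should take a little over 4 seconds. The sketch below does that arithmetic and also gives an ideal-case estimate for two workers. `estimate_runtime` is a hypothetical helper, and the estimate ignores the overhead of starting processes, so real timings will be somewhat higher.

```python
import math

def estimate_runtime(num_items, delay_per_item, num_workers=1):
    """Ideal-case estimate: each worker handles ceil(num_items / num_workers) items."""
    items_per_worker = math.ceil(num_items / num_workers)
    return items_per_worker * delay_per_item

serial_estimate = estimate_runtime(100, 0.04)          # 100 items, one worker
two_worker_estimate = estimate_runtime(100, 0.04, 2)   # 50 items per worker

print(f"Estimated serial time: {serial_estimate:.1f} s")
print(f"Estimated time with 2 workers: {two_worker_estimate:.1f} s")
```

If your measured serial time is far from the estimate, something other than the artificial delay is dominating the run.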

### Part 2: Parallel Implementation

Create a parallel version that:

- Uses Python's multiprocessing library (`Pool`)
- Splits the data across at least 2 worker processes (Google Colab typically
  provides 2 CPU cores)
- Produces exactly the same numerical result as the serial version above
- Runs faster than the serial version
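
If you have not used `Pool` before, here is a minimal sketch of the split, map, and combine pattern on an unrelated toy task: summing a short list of integers. It is not the assignment solution. The helper names (`split_in_half`, `parallel_total`) are made up, and the contiguous split used here is a different strategy from the alternating split that the scaffold asks for.

```python
from multiprocessing import Pool

def split_in_half(values):
    """Split a list into two contiguous halves (toy splitting strategy)."""
    middle = len(values) // 2
    return [values[:middle], values[middle:]]

def parallel_total(values, num_processes=2):
    """Sum each half in worker processes, then combine the partial sums."""
    chunks = split_in_half(values)
    with Pool(processes=num_processes) as pool:
        # The built-in sum() is called once per chunk inside a worker process.
        # pool.map returns the partial sums in the same order as the chunks.
        partial_sums = pool.map(sum, chunks)
    return sum(partial_sums)

if __name__ == "__main__":
    print(parallel_total([1, 2, 3, 4, 5, 6]))
```

The `if __name__ == "__main__":` guard is a common precaution when a script creates worker processes. In a Colab cell the call can also be made directly.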

I have scaffolded the code for you below, but you will need to fill in the
missing parts, which are marked `YOUR CODE HERE`.

Given the size of the class, I won't be able to review each submission in
detail. To grade this section, I will evaluate (1) whether your code produces
the right answer and (2) whether your code runs faster than the serial version.
That said, I do plan to spot-check some submissions as necessary.

from multiprocessing import Pool  # Pool lets us distribute work across CPU cores
import time      # For measuring execution time of our analysis
import random    # For generating random numbers

def make_fake_distances():
    """
    Generate a list of random atomic distances for testing purposes. NOTE: This
    is the same function defined in the serial example above.
    
    This function creates synthetic data that simulates distances between atoms,
    useful for testing analysis algorithms without needing real molecular data.
    
    Returns:
        list: A list of 100 floating-point numbers between 0 and 10,
              representing atomic distances in Angstroms (Å).
    
    Note:
        - Uses a fixed random seed (42) to ensure reproducible results
        - Distances are in Angstroms (Å), a common unit in molecular analysis
        - The range 0-10 Å represents typical interatomic distances in molecules
    """
    # Set random seed to ensure we all get the same random numbers. By tradition, we
    # use 42, though there is no special meaning to this number.
    random.seed(42)

    # Generate test data: 100 random atomic distances between 0 and 10 Angstroms
    distances = []
    for _ in range(100):  # Generate 100 distances
        # random.uniform generates a random float between 0 and 10
        distances.append(random.uniform(0, 10))

    return distances

def analyze_distance(distance):
    """
    Analyze whether a given atomic distance falls within the distance range
    typical of a hydrogen bond.  NOTE: This is the same function defined in the
    serial example above.
    
    This function checks if an atomic distance falls within the range of 2.5 to
    3.5 Angstroms. It includes an artificial delay to simulate computational
    processing time.
    
    Args:
        distance (float): The atomic distance to analyze, in Angstroms (Å)
    
    Returns:
        bool: True if the distance is between 2.5 and 3.5 Å (inclusive),
              False otherwise
    
    Note:
        The artificial delay (0.04 seconds) is added to simulate real-world
        processing time and demonstrate the impact of sequential vs. parallel
        processing.
    """
    # Add artificial delay to simulate computational processing time
    time.sleep(0.04)  # Wait for 0.04 seconds
    
    # Check if the distance falls within our target range. Returns True if
    # distance is between 2.5 and 3.5 Angstroms (inclusive).
    return (distance >= 2.5) and (distance <= 3.5)

def count_in_range(distance_subset):
    """
    Count how many distances in a subset fall within the target range (2.5-3.5
    Å).
    
    This function is designed to run in parallel on different CPU cores. Each
    core will process its own subset of the full distance list independently.
    
    Args:
        distance_subset (list): A portion of the full distance list, containing
                                float values representing atomic distances in
                                Angstroms (Å)
    
    Returns:
        int: Number of distances in this subset that fall within 2.5-3.5 Å
    
    Note:
        This function is similar to the code used in the serial-programming
        example, but instead of processing the entire list, it only processes its
        assigned portion.
    """
    # Process each distance in this subset, counting how many fall within range,
    # similar to the serial version. Return the number of distances that meet
    # the criteria.

    count = 0
    # YOUR CODE HERE
    return count

def analyze_distances_parallel(distances):
    """
    Analyze distances using parallel processing across multiple CPU cores.
    
    This function implements a parallel approach that's different from the
    serial approach above. Instead of processing distances one at a time, it:

    1. Splits the data into chunks
    2. Processes these chunks simultaneously on different CPU cores
    3. Combines the results
    
    Args:
        distances (list): List of float values representing atomic distances in
                         Angstroms (Å)
    
    Returns:
        float: Percentage (0-100) of distances that fall within 2.5-3.5 Å
    """
    # We'll use 2 processes for this example because Google Colab typically
    # provides 2 CPU cores per session.
    num_processes = 2
   
    # Split our data into chunks, one chunk for each worker process. With two
    # processes, the goal is two chunks built from alternating elements:
    # Chunk 1: [distances[0], distances[2], distances[4], ...]
    # Chunk 2: [distances[1], distances[3], distances[5], ...]
    
    # Build these chunks with list slicing so that you end up with
    # num_processes chunks in total. Store the result in a variable named
    # `distance_chunks`, which should be a list of lists of numbers.

    distance_chunks = # YOUR CODE HERE
   
    # === Key Concept: Using Pool for Parallel Processing ===
    # 
    # Pool is a tool from Python's multiprocessing library that manages parallel
    # processing for us. Think of it like having multiple workers (CPU cores)
    # ready to help with our task:
    #
    # 1. We create a "pool" of worker processes (2 in this case)
    # 2. The pool.map() function:
    #    - Takes our function (count_in_range) and our data chunks
    #    - Automatically sends each chunk to an available worker
    #    - Collects all the results when they're done
    #
    # This is different from our in-class example because instead of writing our
    # own code to manage the parallel processing, we're letting Pool handle all
    # the complex details of:
    #
    #   - Starting worker processes
    #   - Distributing work
    #   - Collecting results
    #   - Managing process communication
    #   - Cleaning up when done
    with Pool(processes=num_processes) as pool:
        # pool.map sends each chunk to a separate process. `results` will
        # contain the counts from each chunk.
        results = pool.map(count_in_range, distance_chunks)
    
    # Calculate the final percentage. `results` is a list containing the count
    # from each chunk, in the same order as the chunks. So, for example, if you
    # have two chunks, `results` might look like [12, 4].
    # 
    # First sum the values in `results` to get the total count. Then divide
    # that total by the total number of distances and multiply by 100. Place
    # this value in a variable called `percent`.

    percent = # YOUR CODE HERE

    return percent

# Generate our test dataset of 100 fake distances
distances = make_fake_distances()

# Time the parallel implementation
start_time = time.time()
result = analyze_distances_parallel(distances)
end_time = time.time()

# Print the results
print(f"Percentage of distances in target range: {result:.2f}%")
print(f"Time taken for parallel analysis: {end_time - start_time:.3f} seconds")

## Final instructions

Fill in the missing sections in the parallel code above. Test it to make sure it
gives the same answer as the serial code, only faster. Once you're ready, copy
and paste the entire contents of the cell with your parallel code into the
appropriate field in the Canvas homework assignment.
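
One way to run that check is sketched below. `summarize_comparison` is a hypothetical helper, and the numbers passed to it are placeholders rather than real measurements, so substitute the values printed by your own serial and parallel runs. The comparison uses `math.isclose` to tolerate floating-point rounding in case the two versions do their arithmetic in a different order.

```python
import math

def summarize_comparison(serial_result, parallel_result, serial_time, parallel_time):
    """Return (results_match, speedup) for one serial run and one parallel run."""
    results_match = math.isclose(serial_result, parallel_result)
    speedup = serial_time / parallel_time
    return results_match, speedup

# Placeholder values, NOT real measurements; replace them with your own output.
results_match, speedup = summarize_comparison(
    serial_result=12.0,
    parallel_result=12.0,
    serial_time=4.0,
    parallel_time=2.0,
)

print(f"Results match: {results_match}")
print(f"Speedup: {speedup:.2f}x")
```

A speedup above 1.0 means the parallel version was faster. With two worker processes, the best you can reasonably expect is a value close to 2.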