# Homework: Documenting Your Code + Testing Your Code

## Problem 1 - Write docstrings

The following functions are missing docstrings. Write Google-style docstrings for each function, including `Args`, `Returns`, and `Raises` sections where appropriate. Make sure to document default values and explain what each parameter means.

In [None]:
import numpy as np

def normalize(data, method="zscore"):
    """Normalize an array of numerical data.

    Normalization rescales the data according to the specified method.

    Args:
        data (np.ndarray): Input array of numerical values to normalize.
        method (str, optional): Normalization method to use.
            Supported values are:
            - "zscore": Standardize data to mean 0 and standard deviation 1.
            - "minmax": Scale data to the range [0, 1].
            Defaults to "zscore".

    Returns:
        np.ndarray: A new array containing the normalized data.

    Raises:
        ValueError: If an unsupported normalization method is specified.
    """
    if method == "zscore":
        return (data - np.mean(data)) / np.std(data)
    elif method == "minmax":
        return (data - np.min(data)) / (np.max(data) - np.min(data))
    else:
        raise ValueError(f"Unknown method: {method}")


def weighted_mean(values, weights=None):
    """Compute the weighted mean of a set of values.

    If no weights are provided, this function returns the arithmetic mean.

    Args:
        values (np.ndarray): Array of numerical values.
        weights (np.ndarray, optional): Array of weights corresponding to
            each value. Must have the same length as `values`.
            Defaults to None.

    Returns:
        float: The weighted mean of the values.

    Raises:
        ValueError: If `weights` is provided and its length does not match
            the length of `values`.
    """
    if weights is None:
        return np.mean(values)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return np.sum(values * weights) / np.sum(weights)


def remove_outliers(data, threshold=3.0):
    """Remove outliers from data using a z-score threshold.

    Data points are considered outliers if their absolute deviation from
    the mean exceeds `threshold` standard deviations.

    Args:
        data (np.ndarray): One-dimensional array of numerical data.
        threshold (float, optional): Number of standard deviations from
            the mean beyond which a point is considered an outlier.
            Defaults to 3.0.

    Returns:
        np.ndarray: A filtered array containing only non-outlier values.
    """
    mean = np.mean(data)
    std = np.std(data)
    mask = np.abs(data - mean) <= threshold * std
    return data[mask]

## Problem 2 - Add type hints

The following functions have incomplete or missing type hints. Add appropriate type hints for all parameters and return values. Use `|` syntax for union types where a parameter can accept multiple types or return `None`.

In [None]:
import numpy as np

def clip_values(arr: np.ndarray,
                lower: int | float,
                upper: int | float
                ) -> np.ndarray:
    """Clip array values to be within [lower, upper] range."""
    return np.clip(arr, lower, upper)


def find_peaks(data: list[float],
               min_height: float | None = None
               ) -> list[int] | None:
    """Find indices where values are local maxima above min_height.

    Returns None if no peaks are found.
    """
    peaks = []
    for i in range(1, len(data) - 1):
        if data[i] > data[i - 1] and data[i] > data[i + 1]:
            if min_height is None or data[i] >= min_height:
                peaks.append(i)
    if len(peaks) == 0:
        return None
    return peaks


def summarize(data: np.ndarray,
              stats: list[str]
              ) -> dict[str, float]:
    """Calculate summary statistics for data.

    Args:
        data: Input array of numeric values.
        stats: List of statistic names to compute.
            Valid options: "mean", "median", "std", "min", "max"

    Returns:
        Dictionary mapping statistic names to computed values.
    """
    result = {}
    for stat in stats:
        if stat == "mean":
            result[stat] = np.mean(data)
        elif stat == "median":
            result[stat] = np.median(data)
        elif stat == "std":
            result[stat] = np.std(data)
        elif stat == "min":
            result[stat] = np.min(data)
        elif stat == "max":
            result[stat] = np.max(data)
    return result

## Problem 3: Identifying Test Types

For each scenario below, identify whether the test being described is a **unit test**, **integration test**, or **regression test**. Briefly explain your reasoning.

**(a)** You write a test that verifies `calculate_variance()` returns 0 for the input `[3.0, 3.0, 3.0]`.

**Unit test**. This targets a single function and checks an isolated behavior for a specific input, with no other components involved.
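A minimal sketch of what such a unit test could look like. The body of `calculate_variance` is not given in the scenario, so the implementation here is a hypothetical stand-in (population variance) included only so the block runs on its own.

```python
def calculate_variance(values):
    """Hypothetical stand-in: population variance of a non-empty sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def test_variance_of_constant_input_is_zero():
    # One function, one input, one expected value: no other components involved
    assert calculate_variance([3.0, 3.0, 3.0]) == 0
```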

**(b)** After discovering that `fit_model()` crashes when given a dataset with a single row, you fix the bug and add a test with a one-row input.

**Regression test**. This test is added after discovering a bug, and ensures that the bug does not reappear in the future.

**(c)** You write a test that loads data from a CSV file, passes it through `clean_data()`, fits a model with `fit_linear_regression()`, and verifies the model's R-squared value is within an expected range.

**Integration test**. This test verifies that the file is loaded, data is cleaned, and model is fit correctly, checking that these components work together.

**(d)** A user reports that `normalize()` returns incorrect values when all input values are negative. After fixing the issue, you add a test with input `[-5.0, -3.0, -1.0]`.

**Regression test**. This test is once again added after discovering a bug, and ensures that the function no longer returns incorrect values.
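As an illustration, the regression test from (d) might look like the sketch below. It assumes `normalize` behaves like the function in Problem 1, which is reproduced in trimmed form so the block is self-contained. The test name and the specific assertions are illustrative choices, not part of the scenario.

```python
import numpy as np


def normalize(data, method="zscore"):
    # Trimmed copy of normalize from Problem 1 so this block runs on its own
    if method == "zscore":
        return (data - np.mean(data)) / np.std(data)
    elif method == "minmax":
        return (data - np.min(data)) / (np.max(data) - np.min(data))
    raise ValueError(f"Unknown method: {method}")


def test_normalize_all_negative_input():
    # The exact input from the bug report pins the fixed behavior in place
    data = np.array([-5.0, -3.0, -1.0])

    # Min-max scaling should still map the smallest value to 0 and the largest to 1
    assert np.allclose(normalize(data, method="minmax"), [0.0, 0.5, 1.0])

    # Z-scores should still have mean 0 and standard deviation 1
    z = normalize(data)
    assert np.isclose(np.mean(z), 0.0)
    assert np.isclose(np.std(z), 1.0)
```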

## Problem 4: Code Review - What's Wrong with These Tests?

Review the following test code and identify at least **four** problems with the test design or implementation. Explain why each is problematic and suggest how to fix it.

In [None]:
import numpy as np

def test_all_statistics():
    data = [10, 20, 30, 40, 50]

    # Test mean
    assert np.mean(data) == 30

    # Test median
    assert np.median(data) == 30

    # Test standard deviation
    assert np.std(data) > 0

    # Test min and max
    assert np.min(data) == 10
    assert np.max(data) == 50

    # Test sum
    assert np.sum(data) == 150

def verify_variance_positive(arr):
    var = np.var(arr)
    assert var >= 0

def test_correlation():
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    corr = np.corrcoef(x, y)[0, 1]
    assert corr == 1.0

results = []

def test_append_result():
    global results
    results.append(42)
    assert 42 in results

def test_check_results():
    assert len(results) == 1

In [None]:
# The tests re-check NumPy's built-in behavior rather than our own code, so they tell us nothing about whether our functions work.
# To fix this, we should test our own functions instead of NumPy's. For example, with a hypothetical calculate_mean defined elsewhere:
data = [10, 20, 30, 40, 50]
assert calculate_mean(data) == 30

In [None]:
# test_all_statistics checks several statistics in one test, so if one assertion fails, the rest are never executed.
# A better approach is to write one test function per statistic. For example,
def test_mean():
    data = [10, 20, 30, 40, 50]
    assert np.mean(data) == 30

def test_median():
    data = [10, 20, 30, 40, 50]
    assert np.median(data) == 30

In [None]:
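# verify_variance_positive does not start with `test_`, so test runners such as pytest will not collect it, and it takes an argument that nothing supplies.
# We can make it a proper test by renaming it and building the data inside the test:
def test_variance_non_negative():
    arr = np.array([10, 20, 30, 40, 50])
    assert np.var(arr) >= 0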

In [None]:
# test_append_result and test_check_results share the global `results`, so test_check_results only passes if test_append_result ran first, and ran exactly once.
# To fix this, each test should build its own local state rather than rely on a global:
def test_append_result():
    results = []
    results.append(42)
    assert results == [42]

def test_check_results():
    results = [42]
    assert len(results) == 1
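One further candidate problem, beyond the four listed in the answers: `test_correlation` in the code under review compares a floating-point result with `==`. Rounding error can make a mathematically exact correlation come out slightly below 1.0, so a tolerance-based comparison is safer. A possible rewrite, using `np.isclose` with its default tolerances (the test name is an illustrative choice):

```python
import numpy as np


def test_correlation_is_close_to_one():
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])
    corr = np.corrcoef(x, y)[0, 1]
    # Compare with a tolerance instead of exact float equality
    assert np.isclose(corr, 1.0)
```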

## Problem 5: The Flaky Test

Your colleague wrote the following test for a bootstrap confidence interval function:

In [None]:
import numpy as np

def bootstrap_ci(data, confidence=0.95, n_bootstrap=1000):
    """Compute bootstrap confidence interval for the mean."""
    means = []
    n = len(data)
    for _ in range(n_bootstrap):
        sample = np.random.choice(data, size=n, replace=True)
        means.append(np.mean(sample))

    alpha = 1 - confidence
    lower = np.percentile(means, 100 * alpha / 2)
    upper = np.percentile(means, 100 * (1 - alpha / 2))
    return lower, upper

def test_bootstrap_ci_contains_true_mean():
    data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    true_mean = 5.5
    lower, upper = bootstrap_ci(data)
    assert lower < true_mean < upper

**(a)** The test passes most of the time but occasionally fails. Explain why this test is "flaky" (non-deterministic).

This test is non-deterministic because `bootstrap_ci` draws random resamples without a fixed random seed. With a finite number of resamples (1000), the estimated interval endpoints vary from run to run, so the same assertion can pass on one run and fail on the next.

**(b)** Your colleague argues: "The test is correct because a 95% confidence interval should contain the true mean 95% of the time, so occasional failures are expected." Is this a good argument for keeping the test as-is? Why or why not?

Although the colleague's interpretation of a confidence interval is correct, it is a poor argument for keeping the test. A unit test should be deterministic; if it is not, a failure caused by a real bug cannot be distinguished from random noise, and the test stops being trustworthy. Because this test is not reproducible, it should be changed.

**(c)** Rewrite the test to be deterministic and reliable while still meaningfully testing the `bootstrap_ci` function. Your solution should: ensure reproducible results and verify that the confidence interval has reasonable properties.

See the `test_bootstrap_ci_properties` code cell below.

**(d)** Propose an alternative testing strategy that could verify the 95% coverage property without making the test flaky. You don't need to implement it, but describe the approach.

To test the 95% coverage property, we could use a simulation-based test. First, we would fix a distribution with a known mean, such as a normal distribution. Then, for a fixed number of simulations, we would repeat the following: generate a dataset, compute the 95% bootstrap confidence interval, and record whether it contains the true mean. Finally, we would check whether the empirical coverage is close to 0.95, within a tolerance of, say, ±0.02. Fixing the random seed for the whole simulation keeps the test deterministic.
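The simulation approach in (d) could be sketched roughly as follows. This is an illustration under several assumptions, not the assignment's required solution: `bootstrap_ci_seeded` and `estimate_coverage` are hypothetical helpers, the bootstrap is a vectorized variant that takes an explicit `np.random.Generator` instead of using NumPy's global state, and the sample size, simulation count, and seed are arbitrary. The acceptance band is deliberately wider than ±0.02, because percentile bootstrap intervals tend to under-cover slightly on small samples and 200 simulations give only a coarse coverage estimate.

```python
import numpy as np


def bootstrap_ci_seeded(data, rng, confidence=0.95, n_bootstrap=300):
    """Percentile bootstrap CI for the mean, drawing from an explicit generator."""
    n = len(data)
    # Draw all resamples at once: one row per bootstrap replicate
    samples = rng.choice(data, size=(n_bootstrap, n), replace=True)
    means = samples.mean(axis=1)
    alpha = 1 - confidence
    lower = np.percentile(means, 100 * alpha / 2)
    upper = np.percentile(means, 100 * (1 - alpha / 2))
    return lower, upper


def estimate_coverage(true_mean=0.0, n=50, n_sims=200, seed=0):
    """Fraction of simulated datasets whose interval contains the true mean."""
    rng = np.random.default_rng(seed)  # one seeded generator drives everything
    hits = 0
    for _ in range(n_sims):
        data = rng.normal(loc=true_mean, scale=1.0, size=n)
        lower, upper = bootstrap_ci_seeded(data, rng)
        hits += int(lower <= true_mean <= upper)
    return hits / n_sims


def test_bootstrap_ci_coverage():
    coverage = estimate_coverage()
    # Wide band on purpose: see the assumptions described above
    assert 0.85 <= coverage <= 1.0
```

Because a single seeded generator produces both the datasets and the resamples, repeated runs give the same coverage estimate, so a failure points to a change in the code rather than to chance.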

In [10]:
import numpy as np

def test_bootstrap_ci_properties():
    """
    Test that bootstrap_ci is reproducible with a fixed random seed and
    returns a valid confidence interval.

    Requires bootstrap_ci from the cell above to be defined in the session.
    """
    data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    sample_mean = np.mean(data)

    # Seed the global NumPy generator that bootstrap_ci draws from
    np.random.seed(12345)
    lower, upper = bootstrap_ci(data, confidence=0.95, n_bootstrap=5000)

    # Re-seeding should reproduce exactly the same interval
    np.random.seed(12345)
    assert bootstrap_ci(data, confidence=0.95, n_bootstrap=5000) == (lower, upper)

    # Interval should be ordered correctly, with nonzero width
    assert lower < upper

    # Interval should contain the sample mean that the resamples are centered on
    assert lower <= sample_mean <= upper


In [11]:
test_bootstrap_ci_properties()

NameError: name 'bootstrap_ci' is not defined