# 6. Reduction Operations in NumPy

In [None]:
import numpy as np

For this part, we create a dummy dataset consisting of $200$ rows and $5$ columns by sampling from the uniform distribution $\mathcal{U}(-10, 10)$.

In [None]:
dataset = np.random.uniform(-10, 10, (200, 5))
print(dataset[:10]) # Print the first 10 rows of our dataset

To compute the mean of all elements in an array, we can use the function `np.mean(arr)` or, equivalently, the array method `arr.mean()`.

In [None]:
mean = np.mean(dataset) # or dataset.mean()
print(mean)

Sometimes, we want to compute the mean for each column or row. We can do this by specifying which dimension we wish to compute the mean along.

In [None]:
# Compute the mean along axis 1. This will take the mean over all columns for each row
# resulting in an array of shape (200,)
row_means = np.mean(dataset, axis=1) # or dataset.mean(axis=1)
print(row_means.shape) # (200,): one mean per row
print(row_means[:10])  # Print the mean for the first 10 rows

# Similarly, for computing the mean over all rows
col_means = np.mean(dataset, axis=0) # or dataset.mean(axis=0)
print(col_means.shape)
print(col_means)
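
To make the axis convention concrete, here is a minimal sketch on a small hand-written array (the names `small`, `col_means_small` and `row_means_small` are only illustrative). Reducing along `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row.

```python
import numpy as np

# A tiny 2x3 array so the results can be checked by hand
small = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

col_means_small = small.mean(axis=0)  # collapses the 2 rows    -> shape (3,)
row_means_small = small.mean(axis=1)  # collapses the 3 columns -> shape (2,)

print(col_means_small)  # [2.5 3.5 4.5]
print(row_means_small)  # [2. 5.]
```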

There are similar functions for computing the sum, minimum, maximum and standard deviation of an array. As with the mean, we can specify the axis along which to compute.

In [None]:
print(np.sum(dataset, axis=0))      # The sum of each column (compute along row-dimension, axis 0)
print(np.max(dataset))              # The largest value in the entire dataset
print(np.min(dataset, axis=1)[:10]) # The smallest value in each of the first 10 rows (along the column dimension, axis 1)
print(np.std(dataset, axis=0))      # Standard deviation for each column
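
As a hand-checkable sketch of the same reductions, the illustrative array `small` below is small enough to verify each result by eye. Note that `np.std()` computes the population standard deviation by default (`ddof=0`), meaning it divides by $N$ rather than $N - 1$.

```python
import numpy as np

small = np.array([[1.0, 5.0],
                  [3.0, 9.0]])

col_sums = np.sum(small, axis=0)  # [1 + 3, 5 + 9]
row_mins = np.min(small, axis=1)  # smallest value in each row
largest = np.max(small)           # largest value in the whole array
col_stds = np.std(small, axis=0)  # population std (ddof=0) of each column

print(col_sums)  # [ 4. 14.]
print(row_mins)  # [1. 3.]
print(largest)   # 9.0
print(col_stds)  # [1. 2.]
```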

If we want to find the index of the smallest or largest value, we can use `np.argmin()` and `np.argmax()`, respectively. See the following example where we find the index (and value) of the smallest and largest value in the first column of `dataset`.

In [None]:
first_column = dataset[:, 0]      # Select only the first column

min_idx = np.argmin(first_column) # Find the index of the smallest value
min_val = first_column[min_idx]   # Get the value at index min_idx

print(f"The smallest value in the first column is {min_val:.4f} at index {min_idx}")

# Do the same with argmax
max_idx = np.argmax(first_column) # Find the index of the largest value
max_val = first_column[max_idx]   # Get the value at index max_idx

print(f"The largest value in the first column is {max_val:.4f} at index {max_idx}")
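
`np.argmin()` and `np.argmax()` also accept an `axis` argument. Without it, they return an index into the flattened array, which `np.unravel_index()` can convert back to a (row, column) pair. A minimal sketch on an illustrative array `small`:

```python
import numpy as np

small = np.array([[4, 9, 2],
                  [7, 1, 8]])

flat_idx = np.argmax(small)                         # index into the flattened array
row, col = np.unravel_index(flat_idx, small.shape)  # convert back to (row, column)
col_argmins = np.argmin(small, axis=0)              # row index of the minimum in each column

print(flat_idx)     # 1
print(row, col)     # 0 1
print(col_argmins)  # [0 1 0]
```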

## Exercises

### 1. Computing the Mean, Sum and Maximum

You are given an array `A` in the code cell below. 

Compute
1. The mean of all elements in `A` using `np.mean()`,
2. The sum of each row, using `np.sum()`, and
3. The largest value in each column using `np.max()`.

Print the results to verify that your code works as intended.

You need to use the `axis=` argument for 2. and 3.

The output should be
```
5.0
[ 6 15 24]
[7 8 9]
```

In [None]:
A = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]])

# Your code here
total_mean = ...
row_sums = ...
column_maxes = ...

### 2. Standardize Each Column of a Dataset

You are given a dataset of shape `(200, 5)` stored in the variable `X`. Think of this as 200 observations of 5 different measurements/variables.

We now want to [standardize](https://en.wikipedia.org/wiki/Standard_score#Calculation) each column $X_i$ by applying the transform $\hat{X}_i = \frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}$ where $\hat{\mu}_i$ and $\hat{\sigma}_i$ are the empirical mean and standard deviation of column $i$, respectively.

Compute the means and standard deviations for each column and standardize `X`. Store the standardized version of `X` in a new variable `X_hat`.

Verify that the mean and standard deviation of each column of `X_hat` are close to 0 and 1, respectively.

**Hints:**

1. Compute the mean and standard deviation of each column by using `np.mean()` and `np.std()` with the argument `axis=0`.
2. Taking advantage of broadcasting, all columns can be standardized simultaneously by simply writing `X_hat = (X - mean) / std`.
3. Use `np.mean(X_hat, axis=0)` and `np.std(X_hat, axis=0)` to verify that each column now has mean $\approx 0$ and standard deviation $\approx 1$.
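
The broadcasting in hint 2 works because combining an array of shape `(n, d)` with an array of shape `(d,)` applies the smaller array to every row. The following is a minimal sketch that only centers the columns of an illustrative array `small`; it is not the full exercise solution.

```python
import numpy as np

small = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])

col_means = small.mean(axis=0)  # shape (2,)
centered = small - col_means    # (3, 2) - (2,) broadcasts across the rows

print(col_means)              # [ 3. 30.]
print(centered.mean(axis=0))  # [0. 0.]
```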

In [None]:
X = np.random.uniform(-10, 10, (200, 5))

# Your code here
mean = ...
std = ...
X_hat = ...

### 3. Different Ways to Compute the Same Result

Below are three cases where two different expressions give the same result.

Try to understand why they give (approximately) the same answers.

In [None]:
X = np.random.uniform(-10, 10, (200, 5))

# Case 1
print(X.sum(axis=0) / X.shape[0])
print(X.mean(axis=0))

# Case 2
print(X.std())
print(np.mean((X - X.mean())**2) ** 0.5)

# Case 3
print(X.sum())
print((X.mean(0) * len(X)).sum())

### 4. Count With Boolean Arrays

You are given an array `X` with 100000 random integers from 0 to 99 (the upper bound passed to `np.random.randint()` is exclusive) in the code cell below.

1. Create a boolean array `ge_50_mask` with the expression `(X >= 50)`. This creates an array of the same shape as `X` whose entries are `True` or `False` depending on whether the corresponding value in `X` is greater than or equal to 50.
2. Print the sum and mean of `ge_50_mask`.

What do the sum and mean of `ge_50_mask` tell us? And why are they approximately 50000 and 0.5, respectively?

**Note:** When you do `np.sum()` or `arr.sum()` on a boolean array, NumPy will treat `False` as 0 and `True` as 1 and sum these. The same is true for mean and standard deviation.
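
A minimal sketch of this behavior on a small illustrative array (`values` and `mask` are hypothetical names, and the threshold 4 is arbitrary):

```python
import numpy as np

values = np.array([3, 8, 1, 9, 5])
mask = values > 4   # [False  True False  True  True]

print(mask.sum())   # 3   -> number of elements greater than 4
print(mask.mean())  # 0.6 -> fraction of elements greater than 4
```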

In [None]:
X = np.random.randint(0, 100, 100000)

# Your code here
ge_50_mask = ...