# 6. Reduction Operations in NumPy

We start by importing NumPy under the alias `np`.

In [1]:
import numpy as np

For this part, let us create a dummy dataset consisting of $200$ rows and $5$ columns by sampling from the uniform distribution $\mathcal{U}(-10, 10)$.

In [2]:
dataset = np.random.uniform(-10, 10, (200, 5))
print(dataset[:10]) # Print the first 10 rows of our dataset

[[-0.95291225 -7.88760638 -6.32944755  6.82201217 -9.88205561]
 [ 0.6569027  -5.82003347 -8.80213799  2.16207931  9.53315234]
 [ 2.52892116 -6.78134855 -0.94057844 -1.08461561 -4.98683087]
 [ 2.23337794 -3.49704286 -9.04510842  3.38548853 -4.54048506]
 [-4.93210839 -7.59604558 -5.93473537  2.09462866  9.96653294]
 [ 7.2098027  -4.72641501 -8.0392037   5.73166895 -5.47568065]
 [-9.44297286  3.08925855  5.75237348  5.25270725 -5.97310322]
 [-8.91742973  9.83162102  9.86012818 -2.57557482 -3.31439253]
 [-6.90016792 -8.03964213  8.85504355 -1.85708362 -8.69426668]
 [ 3.75752665  6.5168397   5.07611002  8.82824967 -6.42085207]]


To compute the mean of all elements, we can use `np.mean(arr)` or, equivalently, the array method `arr.mean()`.

In [3]:
mean = np.mean(dataset) # or dataset.mean()
print(mean)

-0.1595651869779748


Sometimes, we want to compute the mean for each column or row. We can do this with the `axis` argument, which specifies the dimension along which the mean is computed.

In [4]:
# Compute the mean along axis 1. This will take the mean over all columns for each row
# resulting in an array of shape (200,)
row_means = np.mean(dataset, axis=1) # or dataset.mean(axis=1)
print(row_means.shape) # One mean per row, so the shape is (200,)
print(row_means[:10])  # Print the mean for the first 10 rows

# Similarly, for computing the mean over all rows
col_means = np.mean(dataset, axis=0) # or dataset.mean(axis=0)
print(col_means.shape)
print(col_means)

(200,)
[-3.64600192 -0.45400742 -2.25289046 -2.29275397 -1.28034555 -1.05996554
 -0.26434736  0.97687042 -3.32722336  3.55157479]
(5,)
[-0.08217597 -0.41152732 -0.13137848  0.08140434 -0.2541485 ]
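
To make the `axis` argument concrete, here is a minimal sketch on a small hand-written array, so the results can be checked by hand. The name `small` and its values are made up for illustration and are not part of the dataset above.

```python
import numpy as np

# A hypothetical 2x3 array, small enough to verify by hand
small = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

row_means = small.mean(axis=1)  # collapses axis 1 (columns): one mean per row
col_means = small.mean(axis=0)  # collapses axis 0 (rows): one mean per column

print(row_means)  # [2. 5.]
print(col_means)  # [2.5 3.5 4.5]

# keepdims=True keeps the reduced axis with length 1
print(small.mean(axis=1, keepdims=True).shape)  # (2, 1)
```

A useful way to remember this: the axis you pass is the one that disappears from the shape of the result.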


There are similar functions for computing the sum, minimum, maximum and standard deviation of an array. These functions also accept the `axis` argument to reduce along a specific dimension.

In [5]:
print(np.sum(dataset, axis=0))      # The sum of each column (compute along row-dimension, axis 0)
print(np.max(dataset))              # The largest value in dataset
print(np.min(dataset, axis=1)[:10]) # The smallest value in each of the first 10 rows
print(np.std(dataset, axis=0))      # Standard deviation for each column

[-16.43519463 -82.30546321 -26.27569629  16.28086719 -50.82970003]
9.988540507193314
[-9.88205561 -8.80213799 -6.78134855 -9.04510842 -7.59604558 -8.0392037
 -9.44297286 -8.91742973 -8.69426668 -6.42085207]
[5.80766697 6.25527085 5.67548057 5.44940171 5.84684364]
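
One detail worth knowing is that `np.std()` divides by $N$ by default (the population standard deviation, `ddof=0`), while passing `ddof=1` divides by $N - 1$ instead. The sketch below illustrates the difference on a hypothetical array `values` chosen so the arithmetic is easy to follow.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical data; the mean is 2.5

pop_std = np.std(values)             # divides by N (default, ddof=0)
sample_std = np.std(values, ddof=1)  # divides by N - 1

# Squared deviations from the mean: 2.25 + 0.25 + 0.25 + 2.25 = 5.0
print(np.isclose(pop_std, (5.0 / 4) ** 0.5))     # True
print(np.isclose(sample_std, (5.0 / 3) ** 0.5))  # True
```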


If we want to find the index of the smallest or largest value, we can use `np.argmin()` and `np.argmax()`, respectively. See the following example where we find the index (and value) of the smallest and largest value in the first column of `dataset`.

In [6]:
first_column = dataset[:, 0]      # Select only the first column

min_idx = np.argmin(first_column) # Find the index of the smallest value
min_val = first_column[min_idx]   # Get the value at index min_idx

print(f"The smallest value in the first column is {min_val:.4f} at index {min_idx}")

# Do the same with argmax
max_idx = np.argmax(first_column) # Find the index of the largest value
max_val = first_column[max_idx]   # Get the value at index max_idx

print(f"The largest value in the first column is {max_val:.4f} at index {max_idx}")

The smallest value in the first column is -9.9271 at index 111
The largest value in the first column is 9.8887 at index 157
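
When called on a 2-D array without an `axis` argument, `np.argmax()` returns an index into the flattened array. The sketch below, using a hypothetical array `grid`, converts that flat index back to a (row, column) pair with `np.unravel_index()` and also shows the per-column variant.

```python
import numpy as np

grid = np.array([[3, 8, 1],
                 [9, 2, 7]])  # hypothetical example data

flat_idx = np.argmax(grid)  # index into the flattened array [3, 8, 1, 9, 2, 7]
row, col = np.unravel_index(flat_idx, grid.shape)

print(flat_idx, row, col)  # 3 1 0
print(grid[row, col])      # 9

# With axis=0 we get, for each column, the row index of its largest value
print(np.argmax(grid, axis=0))  # [1 0 1]
```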


## Exercises

### 1. Computing the Mean, Sum and Maximum

You are given an array `A` in the code cell below. 

Compute
1. The mean of all elements in `A` using `np.mean()`,
2. The sum of each row, using `np.sum()`, and
3. The largest value in each column using `np.max()`.

Print the result to verify that it works as you intended. 

You need to use the `axis=` argument for tasks 2 and 3.

In [7]:
A = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]])

# Your code here
total_mean = ...
row_sums = ...
column_maxes = ...

# Solution
total_mean = np.mean(A)
row_sums = np.sum(A, axis=1)
column_maxes = np.max(A, axis=0)

print(total_mean)
print(row_sums)
print(column_maxes)

5.0
[ 6 15 24]
[7 8 9]


### 2. Standardize Each Column of a Dataset

You are given a dataset of shape `(200, 5)` stored in the variable `X`. Think of this as 200 observations of 5 different measurements/variables.

We now want to [standardize](https://en.wikipedia.org/wiki/Standard_score#Calculation) each column $X_i$ by applying the transform $\hat{X}_i = \frac{X_i - \hat{\mu}_i}{\hat{\sigma}_i}$ where $\hat{\mu}_i$ and $\hat{\sigma}_i$ are the mean and standard deviation of column $i$, respectively.

Compute the means and standard deviations for each column and standardize `X`. Store the standardized version of `X` in a new variable `X_hat`.

Verify that the mean and standard deviation of each column of `X_hat` are (very close to) 0 and 1, respectively.

**Hints:**

1. Compute the means and standard deviations for each column by using `np.mean()` and `np.std()` with the argument `axis=0` ("compute along rows", giving one value per column).
2. Taking advantage of broadcasting, all columns can be standardized simultaneously by simply writing `X_hat = (X - mean) / std`.
3. Use `np.mean(X_hat, axis=0)` and `np.std(X_hat, axis=0)` to verify that each column now has mean $\approx 0$ and standard deviation $\approx 1$.
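
The broadcasting mentioned in hint 2 can be sketched on a tiny example. The array `small` below is hypothetical; the point is only that an array of shape `(3, 2)` combined with per-column statistics of shape `(2,)` is processed column by column.

```python
import numpy as np

small = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])  # hypothetical data, shape (3, 2)

col_mean = small.mean(axis=0)  # shape (2,)
col_std = small.std(axis=0)    # shape (2,)

# Shapes (3, 2) and (2,) broadcast: every row is shifted and scaled column-wise
small_hat = (small - col_mean) / col_std

print(small_hat.shape)                           # (3, 2)
print(np.allclose(small_hat.mean(axis=0), 0.0))  # True
print(np.allclose(small_hat.std(axis=0), 1.0))   # True
```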

In [8]:
X = np.random.uniform(-10, 10, (200, 5))

# Your code here
mean = ...
std = ...
X_hat = ...

# Solution
mean = np.mean(X, axis=0)  # Per-column means of X, shape (5,)
std = np.std(X, axis=0)    # Per-column standard deviations of X, shape (5,)

X_hat = (X - mean) / std   # Broadcasting standardizes all columns at once
print(X_hat.mean(axis=0))  # Every entry is 0 up to floating-point error
print(X_hat.std(axis=0))   # Every entry is 1 up to floating-point error


### 3. Different Ways of Doing Things

Below are three cases where two different expressions give the same result.

Try to understand why they give (approximately) the same answers.

In [9]:
X = np.random.uniform(-10, 10, (200, 5))

# Case 1
print(X.sum(axis=0) / X.shape[0])
print(X.mean(axis=0))

# Case 2
print(X.std())
print(np.mean((X - X.mean())**2) ** 0.5)

# Case 3
print(X.sum())
print((X.mean(0) * len(X)).sum())

[0.16115021 0.37252166 0.47825551 0.20733643 0.0964006 ]
[0.16115021 0.37252166 0.47825551 0.20733643 0.0964006 ]
5.5951332948722055
5.5951332948722055
263.1328846981138
263.1328846981139
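
The last digits of the two results in Case 3 differ because floating-point addition is not exactly associative, so summing in a different order can change the rounding. For that reason, such results are better compared with `np.isclose()` or `np.allclose()` than with `==`. The sketch below assumes a seeded generator and a separate array `Y`, both introduced here only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, for illustration only
Y = rng.uniform(-10, 10, (200, 5))

# Case 1: the mean is the sum divided by the number of rows
print(np.allclose(Y.sum(axis=0) / Y.shape[0], Y.mean(axis=0)))   # True

# Case 2: the standard deviation is the root of the mean squared deviation
print(np.isclose(Y.std(), np.mean((Y - Y.mean()) ** 2) ** 0.5))  # True

# Case 3: each column mean times the row count is that column's sum
print(np.isclose(Y.sum(), (Y.mean(axis=0) * len(Y)).sum()))      # True
```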


### 4. Count With Boolean Arrays

You are given an array `X` with 100000 random integers from 0 to 99 in the code cell below (the upper bound 100 passed to `np.random.randint()` is exclusive).

1. Create a boolean array `ge_50_mask` by using `(X > 50)`. This creates an array of the same shape as `X` that holds `True` where the value in `X` is greater than 50 and `False` otherwise.
2. Print the sum and mean of `ge_50_mask`.

What do the sum and mean of `ge_50_mask` tell us? And why are they close to, but slightly below, 50000 and 0.5, respectively?

**Note:** When you do `np.sum()` or `arr.sum()` on a boolean array, NumPy will treat `False` as 0 and `True` as 1 and sum these. The same is true for mean and standard deviation.
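
As a minimal illustration of this note, the sketch below uses a small hypothetical array `scores` so the counts can be checked by eye. `np.count_nonzero()` gives the same count as summing the mask.

```python
import numpy as np

scores = np.array([10, 60, 50, 75, 30])  # hypothetical values

mask = scores > 50             # [False  True False  True False]
print(mask.sum())              # 2   (number of True entries)
print(mask.mean())             # 0.4 (fraction of True entries)
print(np.count_nonzero(mask))  # 2
```

Note that 50 itself is excluded, because `>` is a strict comparison.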

In [10]:
X = np.random.randint(0, 100, 100000)

# Your code here
ge_50_mask = ...

# Solution
ge_50_mask = (X > 50)
print(ge_50_mask.sum())
print(ge_50_mask.mean())

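# The sum counts the values greater than 50, and the mean is the fraction of such values.
# X holds integers from 0 to 99, and 49 of those 100 values (51 to 99) exceed 50,
# so we expect roughly 49000 and 0.49.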

49093
0.49093
