# Introduction to NumPy

This course is presented by Izzy Weber, Core Curriculum Manager, DataCamp. Collaborators are James Chapman and Amy Person.

Prerequisite:
- Intermediate Python

This course is part of this track:
- Data Scientist with Python (career track)

## Data Sets

| Name | File | Notes |
| :--- | :--- | :--- |
| Monet RGB Array | rgb_array.npy | Image of Monet's painting "Cliff Walk at Pourville" |
| Tree Census Array | tree_census.npy | Tree census data |
| Monthly Sales Array | monthly_sales.npy | Monthly sales for liquor stores, restaurants, and department stores |
| Sudoku Game Array | sudoku_game.npy | 9 x 9 array containing a Sudoku game; first row is missing |
| Sudoku Solution Array | sudoku_solution.npy | 9 x 9 array containing solution to above Sudoku game; first row is missing |
| Sudoku Game CSV | sudoku.csv | CSV file with no header containing Sudoku game data (supplied by DataCamp support) |
| Sudoku Solution CSV | sudoku_solution.csv | CSV file with no header containing solution to Sudoku game (supplied by DataCamp support) |
| NumPy Logo | numpylogo2.png | PNG file from NumPy code repository containing the 1001 x 1001 NumPy logo |

## Resources

The NumPy website is located at https://numpy.org/.

## Imports
Import the modules required by this notebook.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PIL

## Understanding NumPy Arrays
TensorFlow, pandas, SciPy, matplotlib, and scikit-learn make use of NumPy.

A NumPy array can have any number of dimensions and can be any length.

### Introducing Arrays

In [None]:
# Create a 1-D array.
python_list = [3, 2, 5, 8, 4, 9, 7, 6, 1]
array_1d = np.array(python_list)
array_1d

In [None]:
print(array_1d)

In [None]:
type(array_1d)

In [None]:
print(type(array_1d))

In [None]:
# Create a 2-D array.
python_list_of_lists = [[3, 2, 5],
                       [9, 7, 1],
                       [4, 3, 6]]
array_2d = np.array(python_list_of_lists)
array_2d

Python lists can contain many different data types. A NumPy array contains a single data type, which reduces storage needs and makes a `numpy.ndarray` more efficient to work with than a list.
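
As a rough illustration of the storage point, the sketch below compares a list of 1,000 integers with an array built from it. The variable names are hypothetical, and `dtype=np.int64` is set explicitly so the item size does not depend on the platform default.

```python
import sys

import numpy as np

values = list(range(1000))
array = np.array(values, dtype=np.int64)

# Every element shares one dtype, so each element takes a fixed 8 bytes.
print(array.dtype, array.itemsize, array.nbytes)

# A list stores references to separate Python objects. sys.getsizeof counts
# only the list's own storage, not the int objects it refers to.
print(sys.getsizeof(values))
```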

In [None]:
# Create a 2-D array filled with zeros. The tuple argument defines the shape
# of the ndarray.
np.zeros((5, 3))

In [None]:
# Create an array containing random numbers between 0 and 1.
# (See more below.)
# Use the new interface.
# np.random.random((2, 4))
np.random.default_rng().random((2, 4))

In [None]:
# Create an array containing an evenly spaced range of values.
# The first argument is included in the array, the second argument
# is not.
np.arange(-3, 4)

In [None]:
# Omit the start value if the range begins with 0.
np.arange(4)

In [None]:
# A third argument is interpreted as a step value.
np.arange(-3, 4, 3)

In [None]:
# np.arange is especially useful for plotting.
plt.scatter(np.arange(0, 7), np.arange(-3, 4))
plt.show()

#### Exercises

In [None]:
# Load the Sudoku puzzle from sudoku_game.npy.
# This file is damaged; it is missing the first row of data.
# Printing the shape after loading confirms that a row is missing.
print("Loading damaged file sudoku_game.npy...")
with open("sudoku_game.npy", "rb") as file:
    sudoku_game_npy = np.load(file)
print(sudoku_game_npy.shape)

# This also works.
sudoku_game_npy2 = np.load("sudoku_game.npy")
print(sudoku_game_npy2.shape)
print()

In [None]:
# Load the data from the sudoku.csv file provided by DataCamp support.
# They kindly supplied the code for loading the file correctly.
sudoku_game_from_csv = pd.read_csv("sudoku.csv", header=None).to_numpy()
print(sudoku_game_from_csv.shape)

In [None]:
# Create a 9 x 9 2-D array for a Sudoku puzzle.
# This contains the complete data.
# I coded this when I encountered the damaged sudoku_game.npy file.
# This array is identical to the array read from the sudoku.csv file.
sudoku_list = [
    [0, 0, 4, 3, 0, 0, 2, 0, 9],
    [0, 0, 5, 0, 0, 9, 0, 0, 1],
    [0, 7, 0, 0, 6, 0, 0, 4, 3],
    [0, 0, 6, 0, 0, 2, 0, 8, 7],
    [1, 9, 0, 0, 0, 7, 4, 0, 0],
    [0, 5, 0, 0, 8, 3, 0, 0, 0],
    [6, 0, 0, 0, 0, 0, 1, 0, 5],
    [0, 0, 3, 5, 0, 8, 6, 9, 0],
    [0, 4, 2, 9, 1, 0, 3, 0, 0]]
sudoku_game = np.array(sudoku_list)
print(type(sudoku_game))
print(sudoku_game.shape)
# Are the arrays equivalent?
print(np.array_equal(sudoku_game_from_csv, sudoku_game))

In [None]:
# Create small NumPy arrays from scratch.
zero_array = np.zeros((2, 4))
print(zero_array)

Create a NumPy array containing random numbers between 0 and 1. DataCamp's exercise uses `np.random.random`, which belongs to NumPy's legacy random API. For the current `Generator` API, see https://numpy.org/doc/stable/reference/random/index.html#random-quick-start, https://numpy.org/doc/1.23/reference/random/index.html, and https://numpy.org/doc/1.23/reference/random/generator.html#distributions.
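
One practical detail when using `np.random.default_rng` is seeding. The sketch below assumes an arbitrary seed of 42 (not taken from the course) and shows that two generators created with the same seed produce the same values, which is useful for reproducible notebooks.

```python
import numpy as np

# The seed value 42 is an arbitrary choice for illustration.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.random((2, 4))
sample_b = rng_b.random((2, 4))

print(sample_a.shape)
# Generators created with the same seed yield the same sequence.
print(np.array_equal(sample_a, sample_b))
```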

In [None]:
# Create a 2-D array of random numbers from the uniform distribution
# [0.0, 1.0).
rng = np.random.default_rng()
rng.random((3, 6))

In [None]:
# Extra credit. Create an array of random numbers from
# the standard normal distribution.
rng.standard_normal(8)

In [None]:
# Extra credit. Create an array from the normal distribution.
# The first parameter, loc, is the mean. The second parameter,
# scale, is the standard deviation. The third parameter is the size of the array.
rng.normal(5, 2, (6, 3))

In [None]:
# numpy.arange works best with ints.
# With floats, use numpy.linspace.
doubling_array = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
one_to_ten = np.arange(1, 11)
plt.scatter(one_to_ten, doubling_array)
plt.show()

### Array Dimensionality

In [None]:
# We can create a 3-D array from a list of lists of lists.
# Or we can create a 3-D array from three 2-D arrays.
array_1_2D = np.array([[1, 2], [5, 7]])
array_2_2D = np.array([[8, 9], [5, 7]])
array_3_2D = np.array([[1, 2], [5, 7]])
array_A_3D = np.array([array_1_2D, array_2_2D, array_3_2D])
print(array_A_3D)

In [None]:
# Create a 4-D array from two 3-D arrays.
array_4_2D = np.array([[1, 2], [5, 7]])
array_5_2D = np.array([[3, 4], [7, 8]])
array_6_2D = np.array([[7, 6], [1, 4]])
array_B_3D = np.array([array_4_2D, array_5_2D, array_6_2D])
array_4D = np.array([array_A_3D, array_B_3D])
print(array_4D)
print(array_4D.shape)

Think of a 4-D array as a collection of 3-D arrays. Here, `array_4D` holds two 3-D arrays of shape (3, 2, 2), so its shape is (2, 3, 2, 2).

#### Vector Arrays

In [None]:
# A vector array is an array with one dimension.
# A 1-D NumPy array with five elements has shape (5,); it has no row or
# column orientation.
# To create a column vector, specify the shape as (5, 1), making this a 2-D
# array with five rows and one column.
# To create a row vector, specify the shape as (1, 5), making this a 2-D
# array with one row and five columns.

# Extra credit:
# I experimented with creating the numpy equivalents of row and column vectors.
# Create the numpy equivalent of a row vector.
row_vector = np.array([1, 2, 3, 4, 5])
row_vector.shape = (1, 5)
print(row_vector.shape)
print(row_vector)

In [None]:
# Create a row vector using a list of lists.
row_vector2 = np.array([[1, 2, 3, 4, 5]])
print(row_vector2.shape)
print(row_vector2)

In [None]:
col_vector = np.array([1, 2, 3, 4, 5])
col_vector.shape=(5, 1)
print(col_vector.shape)
print(col_vector)

In [None]:
# Create a column vector from a list of lists.
col_vector2 = np.array([[1], [2], [3], [4], [5]])
print(col_vector2.shape)
print(col_vector2)

#### Matrix and Tensor Arrays

In mathematics, a matrix has two dimensions. An array with three or more dimensions is called a tensor.
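
The number of dimensions can be checked with the `.ndim` attribute. The following is a small sketch with hypothetical arrays, one for each of the terms vector, matrix, and tensor.

```python
import numpy as np

vector = np.arange(5)         # 1-D array
matrix = np.zeros((3, 2))     # 2-D array
tensor = np.zeros((4, 3, 2))  # 3-D array

# .ndim is the number of dimensions; it equals the length of .shape.
for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    print(name, arr.ndim, arr.shape)
```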

#### Shapeshifting

A NumPy array has a `.shape` attribute: a tuple containing the length of each dimension.

The `.flatten()` and `.reshape()` methods return an array with a different shape; neither changes the shape of the original array.
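
One difference between the two methods is worth seeing in code: `.flatten()` always copies the data, while `.reshape()` returns a view of the same data when it can. The sketch below uses a small hypothetical array and also shows `-1`, which asks NumPy to infer one dimension.

```python
import numpy as np

original = np.array([[1, 2], [5, 7], [6, 6]])

flat = original.flatten()           # always a copy of the data
reshaped = original.reshape(2, 3)   # a view here, because the data is contiguous
inferred = original.reshape(-1, 3)  # -1 asks NumPy to infer that dimension

flat[0] = 99         # changes only the copy
reshaped[0, 1] = 42  # writes through to the original array

print(original)
print(inferred.shape)
print(np.shares_memory(original, flat), np.shares_memory(original, reshaped))
```

Call `.copy()` on the reshaped array if an independent array is needed.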

In [None]:
# Create a 2-D array.
array1 = np.array([[1, 2], [5, 7], [6, 6]]) # 3 rows, 2 columns
print(array1.shape)

In [None]:
# Reshape the 2-D array. This returns a new array.
array2 = array1.reshape((2, 3)) # 2 rows, 3 columns
print(array2.shape)

In [None]:
# Extra credit.
# This also works (pass two arguments, not a tuple).
array3 = array1.reshape(2, 3) # 2 rows, 3 columns
print(array3.shape)

In [None]:
# Extra credit.
# Another way to reshape an array.
array4 = np.reshape(array1, (2, 3)) # 2 rows, 3 columns
print(array4.shape)

In [None]:
# Flatten the 2-D array; this returns a 1-D array.
array3 = array1.flatten()
print(type(array3))
print(array3.shape)
print(array3)

#### Exercises
To copy the data for some of the exercises, I used the DataCamp IPython shell to print the data, which I copied and pasted into this notebook. However, printing does not include the comma separator. The numpy.array2string() function will add a separator to the output.

`print(np.array2string(sudoku_solution, separator=", "))`
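
A minimal sketch of this approach on a small hypothetical array follows. It also parses the text back with `ast.literal_eval` to check that the comma-separated output can be pasted into code as a list of lists.

```python
import ast

import numpy as np

small = np.array([[1, 2], [3, 4]])

# Plain printing omits the commas between elements.
print(small)

# array2string with a separator produces text that is valid list syntax.
text = np.array2string(small, separator=", ")
print(text)

# Round trip: parse the text as nested lists and rebuild the array.
rebuilt = np.array(ast.literal_eval(text))
print(np.array_equal(small, rebuilt))
```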

Store the original sudoku puzzle and its solution in a 3-D NumPy array.

In [None]:
# Load the Sudoku solution data from the file.
# This file is also damaged; it is missing the first row of data.
print("Loading damaged Sudoku solution data...")
with open("sudoku_solution.npy", "rb") as file:
    sudoku_solution_npy = np.load(file)
print(type(sudoku_solution_npy))
print(sudoku_solution_npy.shape)
print()

In [None]:
# Load the data from the sudoku_solution.csv file provided by DataCamp
# support. They kindly supplied the code for loading the file correctly.
sudoku_solution_from_csv = pd.read_csv("sudoku_solution.csv", header=None).to_numpy()
print(sudoku_solution_from_csv.shape)

In [None]:
# Create a 2-D array of 9 rows with 9 columns.
# I wrote this code when I couldn't obtain the data from sudoku_solution.npy.
sudoku_solution = np.array(
    [[8, 6, 4, 3, 7, 1, 2, 5, 9],
     [3, 2, 5, 8, 4, 9, 7, 6, 1],
     [9, 7, 1, 2, 6, 5, 8, 4, 3],
     [4, 3, 6, 1, 9, 2, 5, 8, 7],
     [1, 9, 8, 6, 5, 7, 4, 3, 2],
     [2, 5, 7, 4, 8, 3, 9, 1, 6],
     [6, 8, 9, 7, 3, 4, 1, 2, 5],
     [7, 1, 3, 5, 2, 8, 6, 9, 4],
     [5, 4, 2, 9, 1, 6, 3, 7, 8]])
print(type(sudoku_solution))
print(sudoku_solution.shape)
print()

# Create a 3-D array from two 2-D arrays.
game_and_solution = np.array([sudoku_game, sudoku_solution])
print(game_and_solution.shape)

A good way to verify a NumPy array is to examine its `shape` attribute.

In [None]:
# Build a 4-D array from two 3-D arrays. Each 3-D array contains two
# 2-D arrays, the original Sudoku game and its solution.
# Create a new 3-D array containing two 2-D arrays; this represents a
# second Sudoku game and its solution.
new_sudoku_game = np.array(
    [[0, 0, 4, 3, 0, 0, 0, 0, 0],
     [8, 9, 0, 2, 0, 0, 6, 7, 0],
     [7, 0, 0, 9, 0, 0, 0, 5, 0],
     [5, 0, 0, 0, 0, 8, 1, 4, 0],
     [0, 7, 0, 0, 3, 2, 0, 6, 0],
     [6, 0, 0, 0, 0, 1, 3, 0, 8],
     [0, 0, 1, 7, 5, 0, 9, 0, 0],
     [0, 0, 5, 0, 4, 0, 0, 1, 2],
     [9, 8, 0, 0, 0, 6, 0, 0, 5]])
new_sudoku_solution = np.array(
    [[2, 5, 4, 3, 6, 7, 8, 9, 1],
     [8, 9, 3, 2, 1, 5, 6, 7, 4],
     [7, 1, 6, 9, 8, 4, 2, 5, 3],
     [5, 3, 2, 6, 9, 8, 1, 4, 7],
     [1, 7, 8, 4, 3, 2, 5, 6, 9],
     [6, 4, 9, 5, 7, 1, 3, 2, 8],
     [4, 2, 1, 7, 5, 3, 9, 8, 6],
     [3, 6, 5, 8, 4, 9, 7, 1, 2],
     [9, 8, 7, 1, 2, 6, 4, 3, 5]])
new_game_and_solution = np.array([new_sudoku_game, new_sudoku_solution])

# Create the 4D array from the two 3D arrays.
games_and_solutions = np.array([game_and_solution, new_game_and_solution])
# Interpret the shape (2, 2, 9, 9) as follows:
# There is a 4-D array containing two 3-D arrays, where each 3-D array 
# contains two 2-D arrays of dimensions nine rows by nine columns.
print(games_and_solutions.shape)

In [None]:
# Flatten the sudoku_game array.
flattened_game = sudoku_game.flatten()
print(flattened_game.shape)

In [None]:
# Reshape flattened_game as a 9 x 9 array.
reshaped_game = flattened_game.reshape(9, 9)
# Are the arrays equivalent?
print(np.array_equal(sudoku_game, reshaped_game))

### NumPy Data Types

NumPy has data types that are more specific than general Python data types. Use the `numpy.ndarray.dtype` attribute to learn the NumPy data type.

See https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases for aliases of NumPy data types.
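
To see what the sized integer types mean in practice, the sketch below uses `np.iinfo` to print the range and item size of each one. The 9 x 9 array of zeros is a hypothetical stand-in for a Sudoku grid.

```python
import numpy as np

# Range and size in bytes of each sized integer type.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max, np.dtype(dtype).itemsize)

# The same 81 values need eight times less memory as int8 than as int64.
grid_int64 = np.zeros((9, 9), dtype=np.int64)
grid_int8 = grid_int64.astype(np.int8)
print(grid_int64.nbytes, grid_int8.nbytes)
```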

In [None]:
# Float array.
float_array = np.array([1.32, 5.78, 175.55])
print(float_array.dtype)
float_array.dtype

In [None]:
# Int array.
int_array = np.array([[1, 2, 3], [4, 5, 6]])
print(int_array.dtype)

In [None]:
# String array. "<" means little-endian. U means Unicode. 12 is the
# maximum string length in characters, set by the longest string.
string_array = np.array(["Introduction", "to", "NumPy"])
print(string_array.dtype)

In [None]:
# The dtype of a numpy.ndarray can be set during creation of the array.
float32_array = np.array([1.32, 5.78, 175.55], dtype=np.float32)
print(float32_array.dtype)

In [None]:
# We can convert the type of elements within an array.
boolean_array = np.array([[True, False], [False, False]], dtype=np.bool_)
print(boolean_array.dtype)

In [None]:
boolean_array2 = boolean_array.astype(np.int32)
print(boolean_array2.dtype)

In [None]:
# NumPy performs type coercion when dealing with multiple data types.
# Here, all elements become strings.
mixed_array = np.array([True, "Boop", 42, 42.42])
print(mixed_array.dtype)

In [None]:
mixed_array

In [None]:
# Specify the dtype during creation.
# Shorten the length of the strings here.
mixed_array2 = np.array([True, "Boop", 42, 42.42], dtype="<U5")
print(mixed_array2.dtype)

In [None]:
# Create a float32 array.
float32_array = np.array([1.32, 5.78, 175.55], dtype=np.float32)
print(float32_array.dtype)

In [None]:
# Convert data to a different dtype (type conversion).
# Strings are not easily cast into numbers while preserving the original data.
boolean_array = np.array([[True, False], [False, False]], dtype=np.bool_)
int32_bool_array = boolean_array.astype(np.int32)
print(boolean_array)
print(int32_bool_array)
print(int32_bool_array.dtype)

In [None]:
# There is a type coercion hierarchy.
# Adding a single string to an array will cause all elements to be cast
# to string.
# Adding a float to an array of ints changes all integers to floats.
float_array2 = np.array([0, 42, 42.42])
print(float_array2)
print(float_array2.dtype)

In [None]:
# Adding an integer to an array of booleans changes all elements to integers.
int_array2 = np.array([True, False, 42])
print(int_array2)
print(int_array2.dtype)

#### Exercises

In [None]:
# Create an array using np.zeros(). The argument must be a shape tuple.
zeros_array = np.zeros((3, 2))
print(zeros_array)
print(zeros_array.dtype)

In [None]:
# Create an array of zeros specifying dtype np.int32.
zero_int_array = np.zeros((3, 2), dtype=np.int32)
print(zero_int_array.dtype)

In [None]:
# Specifying dtype=np.int8 coerces the values: 45.67 is truncated to 45
# and True becomes 1.
np.array([45.67, True], dtype=np.int8)

In [None]:
# Change the dtype of sudoku_game to make it more memory-efficient.
# np.int8 is sufficient because its range, -128 to 127, covers the values 0 to 9.
print(sudoku_game.dtype)

In [None]:
small_sudoku_game = sudoku_game.astype(np.int8)
print(small_sudoku_game.dtype)

## Selecting and Updating Data

### Indexing and Slicing Data

In [None]:
# Indexing a 1-D array.
array2 = np.array([2, 4, 6, 8, 10])
print(array2[3])

In [None]:
# Indexing a 2-D array.
print(sudoku_game[2, 4])

In [None]:
# This works, too.
print(sudoku_game[2][4])

In [None]:
print(sudoku_game)

In [None]:
# When giving NumPy one index into a 2-D array, it assumes it
# is a row index.
print(sudoku_game[0])

In [None]:
# Use a colon to index all rows and a column number to obtain a
# column from an ndarray.
print(sudoku_game[:, 3])

#### Exercises

In [None]:
# Read the data from the tree_census.npy file.
# The columns are:
#   tree ID
#   block ID
#   trunk diameter
#   stump diameter (0 for living trees)
tree_census = np.load("tree_census.npy")
print("shape:", tree_census.shape)
print("dtype:", tree_census.dtype)
print("nbytes:", tree_census.nbytes)
print("itemsize:", tree_census.itemsize)

In [None]:
# Look at a summary of the data.
tree_census

In [None]:
# Get the block IDs and print the first five rows.
block_ids = tree_census[:, 1]
print(block_ids[:5])

In [None]:
# Select the 10th item in block_ids.
tenth_block_id = block_ids[9]
print(tenth_block_id)

In [None]:
# Select five consecutive block IDs from block_ids,
# starting with the tenth ID.
block_id_slice = block_ids[9:14]
print(block_id_slice)

In [None]:
# Select the first 100 trunk diameters from the ndarray.
hundred_diameters = tree_census[0:100, 2]
print(hundred_diameters.shape)
print(hundred_diameters)

In [None]:
# Select every other tree diameter for trees with row indices 50 to 100, inclusive.
# The indexes can be obtained using np.arange.
np.arange(50, 101, 2)

In [None]:
# Select the data subset.
every_other_diameter = tree_census[np.arange(50, 101, 2), 2]
print(every_other_diameter)

In [None]:
# Create a new array containing the sorted trunk diameters.
sorted_trunk_diameters = np.sort(tree_census[:, 2])
print(sorted_trunk_diameters)

In [None]:
# Extra credit.
# np.sort doesn't give a choice of descending sort.
# Reverse the order of the 1-D array using np.flip.
np.flip(sorted_trunk_diameters)

In [None]:
# Extra credit.
# My first approach used the .sort method of the ndarray. This does
# an in-place sort, and it results in sorting column 2 in place in the 2-D
# ndarray because it's using a reference to the data.
# That is not what we want!
tree_census2 = tree_census.copy()
print(tree_census2[:, 2])
tree_census2[:, 2].sort()
print(tree_census2[:, 2])

### Filtering Arrays

In [None]:
# Fancy indexing with a Boolean mask. Use this when we want the elements
# themselves: indexing with the mask returns an array containing only the
# elements that meet the filtering condition.
# A Boolean mask has the same shape as the target array.
# Here, extract the members of the array that are even.
one_to_five = np.arange(1, 6)
print(one_to_five)
mask = one_to_five % 2 == 0
print(mask)
evens = one_to_five[mask]
print(evens)

In [None]:
# Extra credit.
# Try this with a 2-D array, extracting the elements with even values.
# The result is a 1-D array.
one_to_twenty = np.arange(1, 21)
one_to_twenty = one_to_twenty.reshape(4, 5)
print(one_to_twenty)
mask2 = one_to_twenty % 2 == 0
print(mask2)
evens2 = one_to_twenty[mask2]
print(evens2)

In [None]:
# Fancy indexing of a 2-D array.
# We have classrooms and students in classrooms that we want to pair for an
# activity. We want to know the classrooms with an even number of students.
classroom_ids_and_sizes = np.array([[1, 22], [2, 21], [3, 27], [4, 26]])
print(classroom_ids_and_sizes)
# Create a mask for all rows and column 1, which contains the size value.
mask3 = classroom_ids_and_sizes[:, 1] % 2 == 0
print(mask3)
# Apply the mask to identify the room IDs from column 0 that meet the condition.
result = classroom_ids_and_sizes[:, 0][mask3]
print(result)

In [None]:
# Or, step by step:
# Extract an array containing the classroom IDs.
classroom_ids = classroom_ids_and_sizes[:, 0]
print(classroom_ids)
# Apply the mask.
result2 = classroom_ids[mask3]
print(result2)

In [None]:
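# np.where() called with only a condition returns the indices of the
# elements that meet the condition, as a tuple with one index array per
# dimension.
# With two more arguments, it creates an array by choosing a value for each
# element based on whether the element does or doesn't meet the condition.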
where = np.where(classroom_ids_and_sizes[:, 1] % 2 == 0)
print(where)

In [None]:
# Same result using the mask array.
where2 = np.where(mask3)
print(where2)

In [None]:
# Use np.where() to return the indices of sudoku_game elements that are 0.
# There are 46 elements where this is true.
print("sudoku_game")
print(sudoku_game)
print()

# Create a boolean mask.
sudoku_zeros_mask = sudoku_game == 0
print("mask")
print(sudoku_zeros_mask)
print()

# Get the indices of the elements that are zero.
sudoku_zero_indices = np.where(sudoku_zeros_mask)
print("sudoku_zero_indices")
print(sudoku_zero_indices)
print("shape")
print(np.shape(sudoku_zero_indices))
print()

# Find 0 and replace with "" using np.where.
# This builds a new array, entering "" when the mask value is True,
# else entering the element from sudoku_game.
# The mask determines the shape of the array returned.
sudoku_game_2 = np.where(sudoku_zeros_mask, "", sudoku_game)
print(sudoku_game_2)

In [None]:
# Combine the steps.
sudoku_game3 = np.where(sudoku_game == 0, "", sudoku_game)
print(sudoku_game3)

In [None]:
# Extra credit.
# Find and replace 0 with " " to make the array look nicer when printed.
np.where(sudoku_game == 0, " ", sudoku_game)

In [None]:
# Extra credit.
# Find and replace 0 with " ", else replace the value with "X".
np.where(sudoku_game == 0, " ", "X")

#### Exercises

In [None]:
# Print the row of data from tree_census where the diameter of the
# trunk is 51.
tree_data_51 = tree_census[tree_census[:, 2] == 51]
print(tree_data_51)

In [None]:
# Print the block ID of the block(s) having trees with trunk diameter
# of 51.
# tree_data_51_blocks = tree_census[tree_census[:, 2] == 51][:, 1]
tree_data_51_blocks = tree_data_51[:, 1]
print(tree_data_51_blocks)

In [None]:
# Given the ID of the block containing the tree with trunk diameter 51,
# return an array of all trees on that block.
# np.isin also handles the case where more than one block ID was found.
block_501882_trees = tree_census[np.isin(tree_census[:, 1], tree_data_51_blocks)]
print(block_501882_trees)

In [None]:
# Create an array with rows for trees on block 313879.
block_313879 = tree_census[tree_census[:, 1] == 313879]
print(block_313879)
print()
# Create an array of row indices for trees having block 313879.
# We want the first element in the tuple returned by np.where.
row_indices = np.where(tree_census[:, 1] == 313879)[0]
print(row_indices)
print()
# Test the indices.
print(tree_census[row_indices])

In [None]:
# Tree diameters are in column 2, and stump diameters are in column 3.
# Tree diameters are 0 when a stump diameter value is not 0.
# Stump diameters are 0 when a tree diameter is not 0.
# Create a 1-D array containing the diameters of trees and stumps.
trunk_stump_diameters = np.where(tree_census[:, 2] == 0, tree_census[:, 3], tree_census[:, 2])
print(trunk_stump_diameters)

In [None]:
# Extra credit.
# Break down what happened above.
trunks_0_row_indices = np.where(tree_census[:, 2] == 0)
print("row_indices")
print(trunks_0_row_indices)
print()

# Get the array of trunk diameters.
trunk_diameters = tree_census[:, 2]
print("trunk diameters")
print(trunk_diameters)
print()

# Test: Using these indices, the values for the trunk diameter should all be 0.
trunks_0 = tree_census[:, 2][trunks_0_row_indices]
print("trunk zero diameters")
print(trunks_0)
print()

# Get the stump diameters.
stump_diameters = tree_census[:, 3]
print("stump diameters")
print(stump_diameters)
print()

# Use the indices to get the stump diameter when the trunk diameter is 0.
stumps_trunk_0 = tree_census[:, 3][trunks_0_row_indices]
print("stump diameters for trunk zero diameters")
print(stumps_trunk_0)
print()

# Use np.where again to get the same result.
all_diameters2 = np.where(tree_census[:, 2] == 0, stump_diameters, trunk_diameters)
print("all diameters")
print(all_diameters2)
print()

# Test. Reversing the columns should give an array with all elements being zero.
zero_diameters = np.where(tree_census[:, 2] == 0, trunk_diameters, stump_diameters)
print("test zero diameters")
print(zero_diameters)

### Adding and Removing Data

In [None]:
# Concatenate using the np.concatenate() function.
# Arrays must be compatible for concatenation.
# A 1-D array must be converted into a 2-D array to concatenate
# it to another 2-D array. For example, shape (3,) => shape (3, 1).
# np.concatenate will not create new dimensions (for example, concatenating
# two 3-D arrays does not create a 4-D array).

# Add rows to classroom_ids_and_sizes.
# This is concatenation along axis 0 -- adding new rows -- the default.
# The course should have set a unique ID for the final classroom (6, not 5).
print("classroom_ids_and_sizes")
print(classroom_ids_and_sizes)
print()
new_classes = np.array([[5, 30], [6, 17]])
print("new_classes")
print(new_classes)
print()
# np.concatenate takes the arrays as a single sequence argument, such as a tuple.
all_classes = np.concatenate((classroom_ids_and_sizes, new_classes))
print("all_classes")
print(all_classes)

In [None]:
# To concatenate columns, pass the axis=1 argument.
# Note all data is coerced to dtype "<U21".
grade_levels_and_teachers = np.array([[1, "James"], [1, "George"], [3, "Amy"], [3, "Meehir"]])
classroom_data = np.concatenate((classroom_ids_and_sizes, grade_levels_and_teachers), axis=1)
print(classroom_data)

In [None]:
# Creating a compatible 1-D array.
# Prepare a column array for concatenation.
array_1D = np.array([1, 2, 3])
print("1-D array")
print(array_1D)
print()
column_array_2D = np.reshape(array_1D, (3, 1))
print("column array")
print(column_array_2D)
print()
row_array_2D = np.reshape(array_1D, (1, 3))
print("row array")
print(row_array_2D)

In [None]:
# Deleting with np.delete().
# Need to specify the input array, the indices to delete, and the axis for deletion.
# np.delete() will delete rows or columns from a 2-D array.
# It makes no sense to delete a single cell of a 2-D array.
# Delete row 1 from classroom_data.
classroom_data2 = np.delete(classroom_data, 1, axis=0)
print(classroom_data2)

In [None]:
# Delete the student count column (column 1).
classroom_data3 = np.delete(classroom_data, 1, axis=1)
print(classroom_data3)

In [None]:
# If you don't specify an axis, np.delete() deletes the element at the
# specified index from a flattened array.
classroom_data4 = np.delete(classroom_data, 1)
print(classroom_data4)

#### Exercises

The following arrays are not compatible for concatenation:
- (5, 2) and (7, 4)
- (4, 2) and (4, )
- (4, 2) and (2, )

The following arrays are compatible for row concatenation (axis 0)
because they have the same number of columns:
- (4, 2) and (6, 2)
- (15, 5) and (100, 5)

The following arrays are compatible for column concatenation (axis 1)
because they have the same number of rows:
- (4, 2) and (4, 3)
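
The compatibility rules above can be checked directly. The sketch below uses hypothetical arrays of zeros with the listed shapes, shows the `ValueError` raised for a 2-D and a 1-D array, and then fixes it by reshaping.

```python
import numpy as np

a = np.zeros((4, 2))
b = np.zeros((6, 2))
c = np.zeros((4, 3))
d = np.zeros(4)  # 1-D, shape (4,)

rows = np.concatenate((a, b), axis=0)  # same number of columns
cols = np.concatenate((a, c), axis=1)  # same number of rows
print(rows.shape, cols.shape)

# A 2-D array and a 1-D array have different numbers of dimensions.
raised = False
try:
    np.concatenate((a, d), axis=1)
except ValueError as error:
    raised = True
    print("incompatible:", error)

# Reshaping the 1-D array into a column makes the shapes compatible.
fixed = np.concatenate((a, d.reshape(4, 1)), axis=1)
print(fixed.shape)
```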

In [None]:
# Add two more rows to tree_census.
# We can add these rows because they contain the same number of columns.
new_trees = np.array([[1211, 227386, 20, 0], [1212, 227386, 8, 0]])
print(tree_census.shape)
print(new_trees.shape)

In [None]:
updated_tree_census = np.concatenate((tree_census, new_trees), axis=0)

In [None]:
print(updated_tree_census)

In [None]:
# Add the column trunk_stump_diameters to tree_census.
# The shape of trunk_stump_diameters is not compatible for concatenation.
print("incompatible shapes")
print(tree_census.shape, trunk_stump_diameters.shape)
print()

# Reshape the column so it is compatible for concatenation.
# Both arrays have 1000 rows, so they are compatible for column concatenation.
reshaped_diameters = np.reshape(trunk_stump_diameters, (1000, 1))
print("compatible shapes")
print(tree_census.shape, reshaped_diameters.shape)
print()

# Perform the concatenation. Column 4 contains the diameter of the trunk
# or stump.
concatenated_tree_census = np.concatenate((tree_census, reshaped_diameters), axis=1)
print("concatenated tree census")
print(concatenated_tree_census)

In [None]:
# Delete dead trees and trees not on city-owned blocks.
# First, delete the stump diameter column (column 3) from tree_census.
tree_census_no_stumps = np.delete(tree_census, 3, axis=1)
print("tree_census_no_stumps")
print(tree_census_no_stumps)
print()

# Get the indices of trees on block 313879, a private block.
private_block_indices = np.where(tree_census[:, 1] == 313879)
print("private_block_indices")
print(private_block_indices)

# Delete the rows for trees on block 313879.
tree_census_clean = np.delete(tree_census_no_stumps, private_block_indices[0], axis=0)
print(tree_census_clean.shape)

## Array Mathematics!

### Summarizing Data

NumPy provides several methods for summarizing array data.
- .sum()
- .min()
- .max()
- .mean()
- .cumsum()

From the documentation:

An operation along axis n of array a behaves as if its argument were an
array of slices of a where each slice has a successive index of axis n.

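So when axis is 0, the method collapses the rows and returns one value per column. When axis is 1, it collapses the columns and returns one value per row. For a 2-D array the result is a 1-D array in both cases unless `keepdims=True` is passed.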
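
A small worked example may make the axis argument easier to follow. The 2 x 3 array below is hypothetical; the comments give the result of each call.

```python
import numpy as np

data = np.array([[0, 5, 1],
                 [0, 2, 0]])

column_totals = data.sum(axis=0)            # [0, 7, 1], one value per column
row_totals = data.sum(axis=1)               # [6, 2], one value per row
row_totals_2d = data.sum(axis=1, keepdims=True)  # shape (2, 1)

print(column_totals, column_totals.shape)
print(row_totals, row_totals.shape)
print(row_totals_2d.shape)
```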

In [None]:
# We have data for security breaches per year for three clients and five years.
# The rows correspond to years and the columns to clients.
# (We really need a pandas DataFrame here.)
# When axis is 0, the function returns one value per column (client).
# When axis is 1, the function returns one value per row (year).

security_breaches = np.array([[0, 5, 1], [0, 2, 0], [1, 1, 2], [2, 2, 1], [0, 0, 0]])
print(security_breaches)
print()
# sum()
print("sums")
print(security_breaches.sum()) # all elements
print(security_breaches.sum(axis=0)) # one value per column (client)
print(security_breaches.sum(axis=1)) # one value per row (year)
print()
# cumsum() -- cumulative sum
print("cumsum")
print(security_breaches.cumsum(axis=0)) # running total down each column
print(security_breaches.cumsum(axis=1)) # running total across each row
print()
# min()
print("minimums")
print(security_breaches.min()) # all elements
print(security_breaches.min(axis=0)) # one value per column (client)
print(security_breaches.min(axis=1)) # one value per row (year)
print()
# max()
print("maximums")
print(security_breaches.max()) # all elements
print(security_breaches.max(axis=0)) # one value per column (client)
print(security_breaches.max(axis=1)) # one value per row (year)

In [None]:
# The keepdims argument keeps the dimensions of the output array to
# make it easy to concatenate to the original array.
print(security_breaches.sum(axis=1))

In [None]:
# This creates a column vector.
print(security_breaches.sum(axis=1, keepdims=True))

In [None]:
# Plotting something like the cumsum versus the average can
# give insight into each client's experience.
cum_sums_by_client = security_breaches.cumsum(axis=0)
print(cum_sums_by_client)
mean = cum_sums_by_client.mean(axis=1)
print(mean)

plt.plot(np.arange(1, 6), cum_sums_by_client[:, 0], label="Client 1")
plt.plot(np.arange(1, 6), cum_sums_by_client[:, 1], label="Client 2")
plt.plot(np.arange(1, 6), cum_sums_by_client[:, 2], label="Client 3")
plt.plot(np.arange(1, 6), cum_sums_by_client.mean(axis=1), label="Average")
plt.xticks(np.arange(1,6))
plt.ylabel("Security breaches")
plt.xlabel("Year")
plt.title("Cumulative Security Breaches")
plt.legend()
plt.show()

#### Exercises

In [None]:
# Read the monthly sales data from the data file.
# This file is correct.
monthly_sales_npy = np.load("monthly_sales.npy")
print(type(monthly_sales_npy))
print(monthly_sales_npy)
print(monthly_sales_npy.shape)
print()

# Given monthly sales (for 12 months) for different industries (3 industries),
# calculate the total sales per month.
# Column 1: liquor stores
# Column 2: restaurants
# Column 3: department stores
monthly_sales = np.array(
    [   [ 4134, 23925,  8657],
        [ 4116, 23875,  9142],
        [ 4673, 27197, 10645],
        [ 4580, 25637, 10456],
        [ 5109, 27995, 11299],
        [ 5011, 27419, 10625],
        [ 5245, 27305, 10630],
        [ 5270, 27760, 11550],
        [ 4680, 24988,  9762],
        [ 4913, 25802, 10456],
        [ 5312, 25405, 13401],
        [ 6630, 27797, 18403]])
print(monthly_sales)
print(monthly_sales.shape)
print()
# Are the two arrays equivalent?
print(np.array_equal(monthly_sales_npy, monthly_sales))
print()
monthly_industry_sales = monthly_sales.sum(axis=1, keepdims=True) # one total per month (row)
print(monthly_industry_sales)

In [None]:
monthly_sales_with_totals = np.concatenate((monthly_sales, monthly_industry_sales), axis=1)
print(monthly_sales_with_totals)

In [None]:
# We want to know if sales increase more at the end of the year
# for department stores.
# Calculate "average monthly sales", which turns out to be by
# month, not by industry.
# I would want to center these values before plotting them.
avg_monthly_sales = monthly_sales.mean(axis=1).astype(np.int64) # one average per month (row)
print("average monthly sales")
print(avg_monthly_sales)

# Plot the data. I worked on the plotting parameters to make the plot
# easier to interpret and to move the legend where it didn't cover anything.
# It appears that department store sales increase quite a bit compared
# to the two other industries.
plt.plot(np.arange(1, 13), monthly_sales[:, 0], label="Liquor stores")
plt.plot(np.arange(1, 13), monthly_sales[:, 1], label="Restaurants")
plt.plot(np.arange(1, 13), monthly_sales[:, 2], label="Department stores")
plt.plot(np.arange(1, 13), avg_monthly_sales, label="Average sales across industries")
plt.xticks(np.arange(1, 13))
plt.yticks(np.arange(0, 50000, 5000))
plt.xlabel("month")
plt.ylabel("sales")
plt.title("Industry Sales")
plt.legend()
plt.show()

In [None]:
# Plot cumulative monthly sales for each industry.
# The plot will indicate where sales are greater or less than usual by looking
# at the slope of the line.
cumulative_monthly_industry_sales = monthly_sales.cumsum(axis=0) # index by row
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:, 0], label="Liquor stores")
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:, 1], label="Restaurants")
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:, 2], label="Department stores")
plt.xticks(np.arange(1, 13))
plt.xlabel("Month")
plt.ylabel("Cumulative sales")
plt.title("Cumulative Monthly Sales")
plt.legend()
plt.show()
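
The choice of axis determines the direction in which `cumsum` accumulates. The sketch below uses a small made-up array (`small_sales` is illustrative only) to compare the two directions and to show that differences between consecutive running totals give back the monthly values:

```python
import numpy as np

# Hypothetical 3-month, 2-industry sales array (illustrative values only).
small_sales = np.array([[1, 2],
                        [3, 4],
                        [5, 6]])

running_by_month = small_sales.cumsum(axis=0)     # accumulate down the rows
running_by_industry = small_sales.cumsum(axis=1)  # accumulate across the columns
print(running_by_month)
print(running_by_industry)

# Differences between consecutive running totals recover the monthly values
# (from the second month on), which is why the slope reflects monthly sales.
recovered = np.diff(running_by_month, axis=0)
print(np.array_equal(recovered, small_sales[1:]))
```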

### Vectorized Operations

NumPy delegates array computations to compiled C code for speed and efficiency. A vectorized operation is written as a single array expression in Python, while the element-by-element loop runs in C.

In [None]:
# A vectorized operation.
print(np.arange(1000000).sum())

In [None]:
# Doing it the hard way, with a for loop.
total = 0  # named "total" to avoid shadowing the built-in sum()
for i in range(1000000):
    total += i
print(total)
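
To get a feel for the speed difference between the two cells above, one option is to time both approaches. This is only a rough sketch: the helper names `loop_sum` and `vectorized_sum` are made up for illustration, the array is smaller than in the cells above to keep the run short, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the cells above, to keep the timing quick


def loop_sum():
    """Sum 0..n-1 with a plain Python loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def vectorized_sum():
    """Sum 0..n-1 with a single NumPy call."""
    return int(np.arange(n, dtype=np.int64).sum())


# Both approaches must agree before the timings mean anything.
print(loop_sum() == vectorized_sum())

loop_time = timeit.timeit(loop_sum, number=10)
vector_time = timeit.timeit(vectorized_sum, number=10)
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s")
```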

In [None]:
# Using vectorized operations increases execution speed, and it also reduces
# the amount of code that needs to be written.
# Add 3 to each element in an array using Python loops.
array = np.array([[1, 2, 3], [4, 5, 6]])
for row in range(array.shape[0]):
    for column in range(array.shape[1]):
        array[row, column] += 3
print(array)


In [None]:
# Add 3 to each element in an array using NumPy vectorization.
# 3 is referred to as a scalar.
array2 = np.array([[1, 2, 3], [4, 5, 6]])
array2 += 3
print(array2)

In [None]:
# Vectorized multiplication, etc.
print(array2 * 3)

In [None]:
# We can add two arrays if they have the same shape.
array_a = np.array([[1, 2, 3], [4, 5, 6]])
array_b = np.array([[0, 1, 0], [1, 0, 1]])
print(array_a + array_b)

In [None]:
# Vector operations are used for other purposes, such as creating
# boolean masks.
print(array_a > 2)

In [None]:
# We can vectorize Python code.
# This code gets the length of the entire array, whereas we want
# the length of each element.
array = np.array(["NumPy", "is", "awesome"])
print(len(array) > 2)
vectorized_len = np.vectorize(len)
print(vectorized_len(array) > 2)
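
The NumPy documentation describes `np.vectorize` as a convenience that is essentially a for loop, so it is not a performance tool. For string lengths specifically, `np.char.str_len` appears to give the same result; the sketch below compares the two on the same array.

```python
import numpy as np

words = np.array(["NumPy", "is", "awesome"])

# np.vectorize wraps the Python function len so that it is applied element by element.
vectorized_len = np.vectorize(len)
lengths_via_vectorize = vectorized_len(words)

# np.char.str_len computes the same lengths with a built-in string routine.
lengths_via_char = np.char.str_len(words)

print(lengths_via_vectorize)
print(lengths_via_char)
print(np.array_equal(lengths_via_vectorize, lengths_via_char))
```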

#### Exercises

In [None]:
# Compute sales tax at 5% for monthly sales from above.
tax_collected = monthly_sales * 0.05
print(tax_collected)

In [None]:
# Combine monthly sales with sales taxes.
total_tax_and_revenue = monthly_sales + tax_collected
print(total_tax_and_revenue)

Extra credit. Multiplying two arrays of the same shape with the `*` operator multiplies them element by element and produces the Hadamard product; see https://en.wikipedia.org/wiki/Hadamard_product_(matrices).

Normal matrix multiplication uses the `@` operator, a binary operator (an operator having two operands) that was introduced in Python 3.5. NumPy provides the numpy.matmul function for standard matrix multiplication.
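
A small sketch of the difference between the two products, using made-up 2 x 2 arrays (`a` and `b` are illustrative only):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

elementwise = a * b       # Hadamard (element-wise) product
matrix_product = a @ b    # standard matrix multiplication
print(elementwise)
print(matrix_product)

# The @ operator and np.matmul give the same result for these arrays.
print(np.array_equal(matrix_product, np.matmul(a, b)))
```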

In [None]:
# We have projections of next year's sales, month by month, for each of the 
# three industries, as a proportion of this year's sales; these are stored
# in a 2-D array the same size as monthly_sales. Multiply these together to
# project next year's sales in the currency amount.
monthly_industry_multipliers = np.array(
    [  [0.98, 1.02, 1.  ],
       [1.00, 1.01, 0.97],
       [1.06, 1.03, 0.98],
       [1.08, 1.01, 0.98],
       [1.08, 0.98, 0.98],
       [1.1 , 0.99, 0.99],
       [1.12, 1.01, 1.  ],
       [1.1 , 1.02, 1.  ],
       [1.11, 1.01, 1.01],
       [1.08, 0.99, 0.97],
       [1.09, 1.  , 1.02],
       [1.13, 1.03, 1.02]])
projected_monthly_sales = monthly_sales * monthly_industry_multipliers
print(projected_monthly_sales)

In [None]:
# Given monthly_sales and projected_monthly_sales, plot the values for liquor stores.
months = np.arange(1, 13)
plt.plot(months, monthly_sales[:, 0], label="Current sales")
plt.plot(months, projected_monthly_sales[:, 0], label="Projected sales")
plt.xticks(months)
plt.xlabel("Month")
plt.ylabel("Monthly sales")
plt.legend()
plt.title("Liquor Store Sales")
plt.show()

In [None]:
# Extra credit.
# Create a plot for each industry.
industries = "Liquor Store", "Restaurant", "Department Store"
for index, industry in enumerate(industries):
    plt.plot(months, monthly_sales[:, index], label="Current sales")
    plt.plot(months, projected_monthly_sales[:, index], label="Projected sales")
    plt.xticks(months)
    plt.xlabel("Month")
    plt.ylabel("Monthly sales")
    plt.legend()
    plt.title(industry + " Sales")
    plt.show()

In [None]:
# Vectorize converting a 2-D array of names to upper case.
names = np.array([["Izzy", "Monica", "Marvin"],
                  ["Weber", "Patel", "Hernandez"]])
print(names)
print()
vectorized_upper = np.vectorize(str.upper)
upper_names = vectorized_upper(names)
print(upper_names)

### Broadcasting

In [None]:
# The calculation of the addition of a scalar to a 2-D array uses
# broadcasting of the scalar to an array of shape suitable for addition.
# Broadcasting is efficient in terms of programming and computer time.
a1 = np.array([[5, 6, 13], [6, 10, 12], [11, 8, 1]])
print(a1 + 2)
print()
# Effectively, this is like computing:
a2 = np.array([[2, 2, 2], [2, 2, 2], [2, 2, 2]])
print(a1 + a2)

In [None]:
# For compatibility for broadcasting, NumPy compares sets of array dimensions
# from right to left.
# shape(10, 5)
# shape(10, 1)
# Dimensions are compatible if one of them is 1 or if the dimensions
# have the same size. All dimension sets must be compatible.
# Looking at the two shapes above, right to left, the second dimensions are
# 5 and 1, so they are compatible since one dimension is 1. The first dimensions
# are 10 and 10, so they are compatible since they have the same length.
# Two arrays need not have the same number of dimensions to be broadcastable.
# Broadcastable or not?
# (10, 5) and (10, 1) are broadcastable.
# (10, 5) and (5,) are broadcastable.
# (10, 5) and (5, 10) are not broadcastable.
# (10, 5) and (10,) are not broadcastable.
# NumPy's default behavior is to broadcast row-wise, because shapes are aligned from the right.
# This can be overridden by reshaping an array for column-wise broadcasting.
# The lesson here seems to be to shape the arrays appropriately.
# Broadcastable rows:
a4 = np.arange(10).reshape((2, 5))
print(a4)
a5 = np.array([0, 1, 2, 3, 4])
print(a5)
print("a4 + a5:")
print(a4 + a5)
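
The compatibility rules in the cell above can be checked without building any arrays. This sketch assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; the `results` dictionary is just an illustrative way of collecting the outcomes.

```python
import numpy as np

# Shape pairs discussed in the cell above.
shape_pairs = [((10, 5), (10, 1)),
               ((10, 5), (5,)),
               ((10, 5), (5, 10)),
               ((10, 5), (10,))]

results = {}
for first, second in shape_pairs:
    try:
        # np.broadcast_shapes returns the shape of the broadcast result,
        # or raises ValueError if the shapes are incompatible.
        results[(first, second)] = np.broadcast_shapes(first, second)
    except ValueError:
        results[(first, second)] = None

for (first, second), result in results.items():
    status = f"broadcastable to {result}" if result is not None else "not broadcastable"
    print(first, "and", second, "are", status)
```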

In [None]:
# Experimentation
print("row-wise broadcasting:")
print(a1)
print(a1.shape)
print()
a3 = np.array([4, 5, 3])
print(a3)
print(a3.shape)
print()
print(a1 + a3)
print()

In [None]:
# This fails.
print(a4)
print(a4.shape)
print()
a5 = np.array([0, 1])
print(a5)
print(a5.shape)
print()
try:
    print(a4 + a5)
except ValueError as ex:
    print("Exception caught:", ex)
print()

# To get this to work, must reshape a5 to make the dimensions compatible.
# This allows column-wise broadcasting.
a5 = a5.reshape(2, 1)
print(a5)
print(a5.shape)
print()
print(a4 + a5)

In [None]:
# Row-wise broadcasting.
a6 = np.array([[1, 2], [3, 4]])
a7 = np.array([5, 10])
print(a6 * a7)

# Row-wise again. This is (2, 2) x (1, 2), so row-wise broadcasting.
a8 = a7.reshape((1, 2))
print(a6 * a8)

# Column-wise broadcasting.
# This is (2, 2) x (2, 1), so column broadcasting.
a9 = a7.reshape((2, 1)) # Two rows, one column
print(a6 * a9)

In [None]:
# More experimentation.
print("column-wise broadcasting:")
print(a1)
print(a1.shape)
print()
a3 = np.array([[4], [5], [3]]) # a column vector
print(a3)
print(a3.shape)
print()
print(a1 + a3)

In [None]:
print("multiplication, column-wise broadcasting")
print(a1)
print(a1.shape)
print()
print(a3)
print(a3.shape)
print()
print(a1 * a3)

#### Exercises

In [None]:
# broadcastable:
# (3, 4) and (3, 1) -- column-wise
# (3, 4) and (1, 4) -- row-wise
# (3, 4) and (4,)   -- row-wise
# not broadcastable:
# (3, 4) and (1, 2)
# (3, 4) and (4, 1)
# (3, 4) and (3,)
#
# It's best to use 2-D arrays when doing array arithmetic.
# Turn an array of shape, say, (5,)
# into (1, 5) (equivalent to a row vector)
# or (5, 1) (equivalent to a column vector).
# This removes any ambiguity about the operation.
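
A minimal sketch of turning a 1-D array into an explicit row or column, using a made-up array (`values` is illustrative only). Both `reshape` and indexing with `np.newaxis` add the missing dimension.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])      # shape (5,)

row_vector = values.reshape(1, 5)        # one row, five columns
column_vector = values[:, np.newaxis]    # five rows, one column; same as reshape(5, 1)
print(row_vector.shape, column_vector.shape)

# A row vector and a column vector broadcast against each other to a 5 x 5 grid.
grid = row_vector + column_vector
print(grid.shape)
```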

In [None]:
# Broadcasting across columns.
# When broadcasting across columns, NumPy needs a 2-D array of n rows and 1 column.
# NumPy will broadcast this array column-wise.
# Initialize a list of monthly year-over-year expected growth for the economy.
# Create a new array by multiplying monthly_sales by these values.
monthly_growth_rate = [1.01, 1.03, 1.03, 1.02, 1.05, 1.03, 1.06, 1.04, 1.03, 1.04, 1.02, 1.01]
monthly_growth_1D = np.array(monthly_growth_rate)
monthly_growth_2D = monthly_growth_1D.reshape((12, 1))
print(monthly_sales * monthly_growth_2D)

In [None]:
# Broadcasting across rows.
# We used monthly_industry_multipliers above. But now we think we can't make
# monthly predictions of growth rates, and we'd rather create an average
# multiplier for the entire year and use that to project sales in the
# next year.

# Find the mean sales multiplier for each industry. Averaging over the rows
# (months) collapses axis 0, leaving one value per column (industry).
mean_multipliers = monthly_industry_multipliers.mean(axis=0)
print(mean_multipliers)
print()

# Examine the shapes of the arrays to determine whether they are compatible
# for broadcasting. (They are.)
print(monthly_sales.shape)
print(mean_multipliers.shape)
print()

# Multiply the two arrays to get projected sales for next year.
projected_sales = monthly_sales * mean_multipliers
print(projected_sales)
print()

# Reshape mean_multipliers to (1, 3) and multiply.
# The results are the same.
print(monthly_sales * mean_multipliers.reshape(1, 3))

## Array Transformations

### Saving and Loading Arrays

In [None]:
# Create and display a small image array.
# I modified the lesson to create an array with shape (4, 5, 3) to
# remove ambiguity about the number of elements on each axis.
rgb = np.array([[[255,   0,   0], [255, 255,   0], [255, 255, 255], [255,   0,   0], [255, 255,   0]],
                [[255,   0, 255], [  0, 255,   0], [  0, 255, 255], [255,   0, 255], [  0, 255,   0]],
                [[  0,   0,   0], [  0, 255, 255], [  0,   0, 255], [  0,   0,   0], [  0, 255, 255]],
                [[255,   0, 255], [  0, 255,   0], [  0, 255, 255], [255,   0, 255], [  0, 255,   0]]])
print("image array:")
print(rgb.shape)
plt.imshow(rgb)
plt.show()

In [None]:
# Load a NumPy file containing image data. The tutorial opens logo.npy,
# but this was not provided. Here, I load rgb_array.npy.
# The image is a painting called "Cliff Walk at Pourville", by Monet.
with open("rgb_array.npy", "rb") as file:
    monet = np.load(file)
print(monet.shape)
plt.imshow(monet)
plt.show()

In [None]:
# Extra credit.
# To replicate the tutorial, I needed to obtain the NumPy logo and convert it to
# an array. The logo used in the course was available at:
# https://github.com/numpy/numpy/blob/main/branding/logo/secondary/numpylogo2.png.
# See the parent directories for other NumPy logo files.
# I saved numpylogo2.png to the working directory for this notebook.
# To create a NumPy array from the data, I followed this tutorial:
# https://thecleverprogrammer.com/2021/06/08/convert-image-to-array-using-python/
# The file contains data for red, green, blue, and alpha. The image has a size
# of 1001 x 1001 pixels.
import PIL.Image  # "import PIL" alone does not guarantee that the Image submodule is loaded
numpy_logo = PIL.Image.open("numpylogo2.png")
print(type(numpy_logo))
# matplotlib can show this PIL.PngImagePlugin.PngImageFile object directly.
plt.imshow(numpy_logo)
plt.show()
# Convert the object to a NumPy array and show it.
logo_rgb_array = np.asarray(numpy_logo)
print(logo_rgb_array.shape)
plt.imshow(logo_rgb_array)
plt.show()

In [None]:
# Pull out the red, green, blue, and alpha data by slicing on the third axis.
# Show the first row of each subarray.
logo_red_array = logo_rgb_array[:, :, 0]
logo_green_array = logo_rgb_array[:, :, 1]
logo_blue_array = logo_rgb_array[:, :, 2]
logo_alpha_array = logo_rgb_array[:, :, 3]
print(logo_red_array.shape)
print(logo_alpha_array.shape)
print(logo_red_array[0], logo_green_array[0], logo_blue_array[0], logo_alpha_array[0], sep="\n")

In [None]:
# Change the bright elements to dark elements by changing values
# from 255 to 50. This didn't work (because of the alpha channel).
dark_logo_array = np.where(logo_rgb_array == 255, 50, logo_rgb_array)
print(dark_logo_array[0, :, 0])
plt.imshow(dark_logo_array)
plt.show()
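
One way to keep the alpha channel untouched is to apply `np.where` to the first three channels only. The sketch below assumes a tiny made-up RGBA array (`rgba` is illustrative, not the logo data) and is only one possible approach.

```python
import numpy as np

# Hypothetical 1 x 2 RGBA image: one white pixel and one dark pixel, both opaque.
rgba = np.array([[[255, 255, 255, 255],
                  [ 10,  20,  30, 255]]], dtype=np.uint8)

darkened = rgba.copy()
# Replace 255 with 50 in the red, green, and blue channels only;
# the alpha channel (index 3) is left as it was.
darkened[..., :3] = np.where(rgba[..., :3] == 255, 50, rgba[..., :3])
print(darkened)
```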

In [None]:
# Extra credit.
# Remove the alpha channel.
logo_rgb_array3 = np.delete(logo_rgb_array, 3, axis=2)
print("original (no alpha):")
print(logo_rgb_array3.shape)
print(logo_rgb_array3[0, :, 0])

In [None]:
# Replace values to darken the image.
dark_logo_array = np.where(logo_rgb_array3 == 255, 50, logo_rgb_array3)
print("dark:")
print(dark_logo_array.shape)
print(dark_logo_array[0, :, 0])
plt.imshow(dark_logo_array)
plt.show()

In [None]:
# Save the array as a NumPy file.
with open("dark_logo.npy", "wb") as file:
    np.save(file, dark_logo_array)

In [None]:
# Getting help on NumPy, including help for a class method.
# help(np.unique)
# help(np.load)
# help(np.ndarray.flatten)

#### Exercises

In [None]:
# Reload the image from "rgb_array.npy".
# This file was named "mystery_image.npy" in the tutorial.
with open("rgb_array.npy", "rb") as f:
    monet = np.load(f)
plt.imshow(monet)
plt.show()

In [None]:
# Practice getting help.
# help(np.ndarray.astype)

In [None]:
# View the type of the array.
# This is a good type for an image array.
print(monet.dtype)

In [None]:
# Darken the image by multiplying each value in the array by 0.5.
# Create an array that uses np.uint8 for the data.
darker_monet = monet * 0.5
darker_monet_uint8 = darker_monet.astype(np.uint8)
plt.imshow(darker_monet_uint8)
plt.show()
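
The conversion back to `np.uint8` is needed because multiplying by 0.5 produces floating-point values. A small sketch with made-up pixel values (`pixels` is illustrative only):

```python
import numpy as np

# Hypothetical pixel values (illustrative only).
pixels = np.array([0, 101, 255], dtype=np.uint8)

halved = pixels * 0.5                    # the result is float64
halved_uint8 = halved.astype(np.uint8)   # the fractional part is dropped
print(halved.dtype, halved)
print(halved_uint8.dtype, halved_uint8)
```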

In [None]:
# Save the darker image.
with open("darker_monet.npy", "wb") as f:
    np.save(f, darker_monet_uint8)
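
A quick way to confirm that saving and loading preserve both the values and the data type is a round trip through a temporary file. The sketch below uses a made-up array and a temporary directory (both illustrative), and passes a file name to `np.save` and `np.load` instead of an open file object.

```python
import os
import tempfile

import numpy as np

# Hypothetical small image-like array (illustrative values only).
pixels = np.array([[[255, 0, 0], [0, 255, 0]],
                   [[0, 0, 255], [50, 50, 50]]], dtype=np.uint8)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "pixels.npy")
    np.save(path, pixels)      # np.save also accepts a file name
    reloaded = np.load(path)

print(reloaded.dtype, reloaded.shape)
print(np.array_equal(pixels, reloaded))
```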

### Array Acrobatics

Data augmentation is the process of taking existing data and making small changes to it, then including the modified data when training a model. The course's example is a model trained to recognize images of recyclable items: take the image of a plastic bottle, flip it, and add the flipped image to the training set.

`np.flip` reverses the order of array elements.
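
A small sketch of how the `axis` argument changes what gets reversed, using a made-up 2 x 3 array (`grid` is illustrative only):

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

flipped_all = np.flip(grid)            # reverses every axis
flipped_rows = np.flip(grid, axis=0)   # reverses the order of the rows
flipped_cols = np.flip(grid, axis=1)   # reverses the order of the columns
print(flipped_all)
print(flipped_rows)
print(flipped_cols)
```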

In [None]:
# Flip the image. This reverses the columns and the rows,
# causing the image to rotate 180 degrees.
# Red is flipped with blue, but green stays the same.
flipped_logo = np.flip(logo_rgb_array3)
plt.imshow(flipped_logo)
plt.show()

In [None]:
# Flip the image on axis 0. This reverses the rows, turning the image
# upside down: a mirror image reflected across a horizontal line.
flipped_rows_logo = np.flip(logo_rgb_array3, axis=0)
plt.imshow(flipped_rows_logo)
plt.show()

In [None]:
# Extra credit.
# Flip the image on axis 1. This reverses the columns,
# creating a left-right mirror image (reflected across a vertical line).
flipped_columns_logo = np.flip(logo_rgb_array3, axis=1)
plt.imshow(flipped_columns_logo)
plt.show()

In [None]:
# Flip red and blue in the RGB colors by flipping the image on axis 2.
flipped_colors_logo = np.flip(logo_rgb_array3, axis=2)
plt.imshow(flipped_colors_logo)
plt.show()

In [None]:
# Flip the image by axis 0 and axis 1, reversing both the rows and columns,
# rotating the image 180 degrees.
flipped_except_colors_logo = np.flip(logo_rgb_array3, axis=(0, 1))
plt.imshow(flipped_except_colors_logo)
plt.show()

In [None]:
# Compare flipping and transposing an array.
# The floats indicate the original row and column.
# When transposing, columns become rows and rows become columns.
array = np.array([[1.1, 1.2, 1.3], [2.1, 2.2, 2.3], [3.1, 3.2, 3.3], [4.1, 4.2, 4.3]])
print("original:")
print(array)
print("flipped:")
print(np.flip(array))
print("transposed:")
print(np.transpose(array))

In [None]:
# Setting transposed axis order.
# Transpose axes 0 and 1.
# Note the use of "axes" instead of "axis" here.
# This reflects the image across its main diagonal, which is the same as
# mirroring it left-to-right and then rotating it 90 degrees counter-clockwise.
print("original:")
plt.imshow(logo_rgb_array3)
plt.show()
print("transposed:")
transposed_logo = np.transpose(logo_rgb_array3, axes=(1, 0, 2))
plt.imshow(transposed_logo)
plt.show()
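
One way to check what a transpose does to an image is to compare it with explicit flips and rotations on a tiny array. The sketch below uses a made-up 2 x 3 array (`image` is illustrative only) and suggests that swapping axes 0 and 1 matches a left-right mirror followed by a 90-degree counter-clockwise rotation.

```python
import numpy as np

# Hypothetical 2 x 3 "image" with a distinct value in each pixel.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

transposed = np.transpose(image)

# np.fliplr mirrors left-to-right; np.rot90 rotates counter-clockwise by default.
mirrored_then_rotated = np.rot90(np.fliplr(image))

print(transposed)
print(np.array_equal(transposed, mirrored_then_rotated))
```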

#### Exercises

In [None]:
# Flip the Monet image so that it is the mirror image of the original, with 
# the ocean on the right and the grassy knoll on the left. This requires
# reversing the columns, axis 1.
monet_mirror = np.flip(monet, axis=1)
plt.imshow(monet_mirror)
plt.show()

In [None]:
# Flip the Monet image (loaded from rgb_array.npy) so that it is upside down.
# Flipping on both axis 0 and axis 1 also reverses left and right, which
# amounts to rotating the image 180 degrees.
monet_upside_down = np.flip(monet, axis=(0, 1))
plt.imshow(monet_upside_down)
plt.show()

In [None]:
# Transpose the 3-D array so that the image is mirrored left-to-right and
# then rotated 90 degrees counter-clockwise.
monet_t = np.transpose(monet, axes=(1, 0, 2))
plt.imshow(monet_t)
plt.show()

### Splitting and Stacking

"A key part of reshaping is adding and removing dimensions while keeping the underlying data intact."

In [None]:
# Slicing dimensions.
# We can slice the image array to extract the red, green, and blue data separately.
# I modified the lesson to create an array with shape (4, 5, 3) to
# remove ambiguity about the number of elements on each axis.
# Reuse rgb created above.
print(rgb.shape)
plt.imshow(rgb)
plt.show()

rgb_red_array = rgb[:, :, 0]
rgb_green_array = rgb[:, :, 1]
rgb_blue_array = rgb[:, :, 2]
print("red array:")
print(rgb_red_array)
print(rgb_red_array.shape)
print()
print("green array:")
print(rgb_green_array)
print(rgb_green_array.shape)
print()
print("blue array:")
print(rgb_blue_array)
print(rgb_blue_array.shape)

In [None]:
# We can unpack arrays using np.split, which takes three arguments:
# the array to split, the number of equally sized arrays in the result,
# and the axis for splitting.
# The resultant arrays have the same number of dimensions as the original
# array (here, three).
red_array2, green_array2, blue_array2 = np.split(rgb, 3, axis=2)
print("red_array2:")
print(red_array2.shape)

In [None]:
# Behavior of plt.imshow. See https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html.
# matplotlib happily plots 2-D array data as an image; an array of shape
# (M, N, 1), as produced by np.split here, is treated as 2-D.
# This uses a default color map (cmap) named "viridis", which is designed to
# be visible to persons with red-green color blindness.
# Values for display are either 0 or 255.
print("original:")
plt.imshow(rgb)
plt.show()
print("red levels:")
print(red_array2.shape)
plt.imshow(red_array2)
plt.show()
print("green levels:")
print(green_array2.shape)
plt.imshow(green_array2)
plt.show()
print("blue levels:")
print(blue_array2.shape)
plt.imshow(blue_array2)
plt.show()

In [None]:
# Reshape red_array2 to remove the unneeded dimension.
red_array3 = np.reshape(red_array2, (4, 5))
print(red_array3)
print(red_array3.shape)
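
`np.reshape` needs the target shape spelled out. Two alternatives that avoid hard-coding it are `np.squeeze` and slicing with an integer index. A minimal sketch on a made-up array (`image` is illustrative only):

```python
import numpy as np

# Hypothetical (2, 3, 3) image-like array (illustrative values only).
image = np.arange(18).reshape(2, 3, 3)

first, second, third = np.split(image, 3, axis=2)
print(first.shape)                     # each piece keeps a trailing axis of length 1

squeezed = np.squeeze(first, axis=2)   # drop that axis without spelling out the shape
sliced = image[:, :, 0]                # an integer index drops the axis as well
print(squeezed.shape, sliced.shape)
print(np.array_equal(squeezed, sliced))
```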

`np.concatenate` will not concatenate data in a new dimension. To do this, we
must use `np.stack`.

"It's easier to work with arrays with fewer dimensions, especially when we 
need to apply a transformation to only part of an array. Slice or split an 
array to unpack it, then use `np.stack` to put it back together once changes
have been made."
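
The difference between the two functions shows up in the shape of the result. A small sketch with made-up color planes (`red`, `green`, and `blue` are illustrative only):

```python
import numpy as np

red = np.zeros((2, 3), dtype=np.uint8)
green = np.full((2, 3), 128, dtype=np.uint8)
blue = np.full((2, 3), 255, dtype=np.uint8)

joined = np.concatenate([red, green, blue], axis=1)   # extends an existing axis
stacked = np.stack([red, green, blue], axis=2)        # creates a new third axis
print(joined.shape)
print(stacked.shape)
print(stacked[0, 0])   # one pixel: its red, green, and blue values
```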

In [None]:
# For reference, display the logo.
plt.imshow(logo_rgb_array)
plt.show()

In [None]:
# Let's replace the red array values with zeros and rebuild the RGB
# array of the NumPy logo using np.stack.
zero_color_array = np.zeros((1001, 1001), dtype=np.uint8)
logo_red_array2 = zero_color_array
logo_green_array2 = np.reshape(logo_green_array, (1001, 1001))
logo_blue_array2 = np.reshape(logo_blue_array, (1001, 1001))

# Stack the arrays, specifying the axis for stacking.
# Display the image. With red removed from white, we see a mix of
# green and blue in the background of the image.
stacked_logo = np.stack([logo_red_array2, logo_green_array2, logo_blue_array2], axis=2)
print(stacked_logo.shape)
plt.imshow(stacked_logo)
plt.show()

In [None]:
# Extra credit.
# Show red only, green only, and blue only.
print("red only:")
red_only = np.stack([logo_red_array, zero_color_array, zero_color_array], axis=2)
print(red_only.shape)
plt.imshow(red_only)
plt.show()
print("green only:")
green_only = np.stack([zero_color_array, logo_green_array, zero_color_array], axis=2)
print(green_only.shape)
plt.imshow(green_only)
plt.show()
print("blue only:")
blue_only = np.stack([zero_color_array, zero_color_array, logo_blue_array], axis=2)
print(blue_only.shape)
plt.imshow(blue_only)
plt.show()

#### Exercises

In [None]:
# Return to the monthly_sales data.
# The first dimension of monthly_sales (rows) is the month.
# The second dimension of monthly_sales (columns) is the industry. 
# Split the data into fourths, where each fourth contains data
# for a quarter of a year.

# For review, look at monthly_sales.
print("monthly sales for year")
print(monthly_sales.shape)
print(monthly_sales)
print()
# Split the data along the row axis.
q1_sales, q2_sales, q3_sales, q4_sales = np.split(monthly_sales, 4, axis=0)
print("monthly sales for Q1")
print(q1_sales.shape)
print(q1_sales)

In [None]:
# Create quarterly sales from the four 2-D arrays.
quarterly_sales = np.stack([q1_sales, q2_sales, q3_sales, q4_sales], axis=0)
print(quarterly_sales.shape)
print(quarterly_sales)
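
For an array laid out like this one, splitting into quarters and stacking them appears to be equivalent to a single reshape. The sketch below checks this on a made-up (12, 3) array (`sales` is illustrative, not the course data) and then uses the 3-D result to total each quarter.

```python
import numpy as np

# Hypothetical 12-month, 3-industry array (illustrative values only).
sales = np.arange(36).reshape(12, 3)

quarters = np.split(sales, 4, axis=0)
stacked = np.stack(quarters, axis=0)
reshaped = sales.reshape(4, 3, 3)   # (quarter, month within quarter, industry)
print(stacked.shape)
print(np.array_equal(stacked, reshaped))

# Summing over axis 1 (the months within a quarter) gives quarterly totals per industry.
quarterly_totals = stacked.sum(axis=1)
print(quarterly_totals)
```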

In [None]:
# Investigate Monet's use of the color blue by making the blues in the
# painting even bluer!

In [None]:
# Extra credit.
# Show only the blues in Monet's painting.
# This turns out not to be very helpful.
rows, cols, colors = monet.shape
monet_red, monet_green, monet_blue = np.split(monet, 3, axis=2)
monet_red = np.reshape(monet_red, (rows, cols))
monet_green = np.reshape(monet_green, (rows, cols))
monet_blue = np.reshape(monet_blue, (rows, cols))
monet_zeros = np.zeros((rows, cols), dtype=np.uint8)
monet_blue_only = np.stack([monet_zeros, monet_zeros, monet_blue], axis=2)
plt.imshow(monet_blue_only)
plt.show()

In [None]:
# If a pixel contains more than the mean amount of blue, set the blue in the
# pixel to 255.
mean_blue = monet_blue.mean()
print("mean blue: ", mean_blue)

In [None]:
# Change the pixels, emphasizing the blue.
emphasized_blue_array = np.where(monet_blue > mean_blue, 255, monet_blue)
print(emphasized_blue_array.shape)

In [None]:
# Display the original for reference.
print("original:")
plt.imshow(monet)
plt.show()
# Combine the arrays to create an image with emphasized blue color.
print("emphasized blue:")
emphasized_blue_monet = np.stack([monet_red, monet_green, emphasized_blue_array], axis=2)
plt.imshow(emphasized_blue_monet)
plt.show()

In [None]:
# Extra credit. Let's see the difference.
# The difference is mostly in the sky and ocean.
monet_emphasized_blue = np.stack([monet_zeros, monet_zeros, emphasized_blue_array], axis=2)
plt.imshow(monet_emphasized_blue)
plt.show()

In [None]:
# Extra credit.
# Let's see only where the blue is greater than the mean.
# This provides more contrast.
emphasized_blue_only = np.where(monet_blue > mean_blue, 255, 0)
monet_emphasized_blue_only = np.stack([monet_zeros, monet_zeros, emphasized_blue_only], axis=2)
plt.imshow(monet_emphasized_blue_only)
plt.show()