# Introduction to NumPy
Run the hidden code cell below to import the data used in this course.

In [1]:
# Importing numpy
import numpy as np

# Importing the data
with open("datasets/rgb_array.npy", "rb") as f:
    rgb_array = np.load(f)
with open("datasets/tree_census.npy", "rb") as f:
    tree_census = np.load(f)
with open("datasets/monthly_sales.npy", "rb") as f:
    monthly_sales = np.load(f)
with open("datasets/sudoku_game.npy", "rb") as f:
    sudoku_game = np.load(f)
with open("datasets/sudoku_solution.npy", "rb") as f:
    sudoku_solution = np.load(f)

## Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

## Introducing arrays

NumPy is the core library for scientific computing in Python, and many other libraries build on it.

The array is the main object in NumPy. It's a grid-like structure that holds data. An array can have any number of dimensions, and each dimension can be any length. You can create an array from a Python list by passing the list as an argument to the np.array() function. The resulting object has the type numpy.ndarray. A list of lists becomes a 2D array, and a list of lists of lists becomes a 3D array.


In [None]:
# Import NumPy
import numpy as np

# Convert sudoku_list into an array
sudoku_array = np.array(sudoku_list)

# Print the type of sudoku_array 
print(type(sudoku_array))

NameError: name 'sudoku_list' is not defined

Lists can contain different data types, but all the elements in an array must share the same data type, which makes arrays very efficient and lets them take up less space in memory.

You can create arrays from scratch using np.zeros(), np.random.random(), and np.arange()

np.zeros((5,3)) makes an array of zeros, with 5 rows and 3 columns, and you can fill with data later on. A tuple of integers is used as the argument. 

np.random.random((2,4)) also accepts a tuple for the array's shape, and returns an array of random floats between 0 and 1. It is called random.random because np.random is a NumPy module for random sampling, and random() is a function within it.

In [None]:
# Create an array of zeros which has four columns and two rows
zero_array = np.zeros((2,4))
print(zero_array)

# Create an array of random floats which has six columns and three rows
random_array = np.random.random((3,6))
print(random_array)

np.arange() creates an evenly spaced array of numbers based on given start and stop values; by default, it creates an array of sequential integers. np.arange(-3,4) returns -3, -2, -1, 0, 1, 2, 3. The start value is included in the output, but the stop value is not. The start value can be omitted if the range begins with 0. A third argument, if given, is interpreted as the step value.
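
As a quick check on the start, stop, and step rules described above, here is a minimal sketch. The variable names are only illustrative and are not part of the course code.

```python
import numpy as np

# Start omitted: the range begins at 0 and stops before 5
zero_to_four = np.arange(5)
print(zero_to_four)

# Start and stop given: -3 is included, 4 is not
minus_three_to_three = np.arange(-3, 4)
print(minus_three_to_three)

# A third argument is the step: 0, 3, 6, 9
every_third = np.arange(0, 10, 3)
print(every_third)
```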

In [None]:
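# Import Matplotlib for plotting
import matplotlib.pyplot as plt

# Create an array of integers from one to ten
one_to_ten = np.arange(1, 11)

# Create your scatterplot
# doubling_array is assumed to be preloaded by the course environment
plt.scatter(one_to_ten, doubling_array)
plt.show()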

## Array dimensionality

*The slides will help with visualizing nD arrays

Data saved in higher dimensions can be harder to work with. Instead of making a list of lists of lists, you can create an array of 2D arrays. You can visualize a 3D array as a bunch of 2D arrays with the same shape (like 2 rows and 2 columns) stacked on top of each other.

In [None]:
# Create the game_and_solution 3D array
game_and_solution = np.array([sudoku_game, sudoku_solution])

# Print game_and_solution
print(game_and_solution) 


4D arrays can be harder to visualize since we don't have a 4th dimension; think of it like a 2D array filled with 3D arrays


In [None]:
# Create a second 3D array of another game and its solution 
new_game_and_solution = np.array([new_sudoku_game, new_sudoku_solution])

# Create a 4D array of both game and solution 3D arrays
games_and_solutions = np.array([game_and_solution, new_game_and_solution])

# Print the shape of your 4D array
print(games_and_solutions.shape)

# Output is (2, 2, 9, 9), meaning there are 2 game/solution pairs, each pair holds 2 arrays, and every array has nine rows and nine columns.

Arrays can be referred to as vectors, matrices, or tensors; these are more mathematical terms than NumPy terms, they all describe types of arrays. The difference between them is the number of dimensions an array has. 

A vector is an array with one dimension. There's no difference between row and column vectors in NumPy, since 1D arrays have no second axis: both have a shape like (5,). If you need an array that is _**explicitly**_ horizontal or vertical, it must be a 2D array so that NumPy knows which axis it lies on (shape (1, 5) or shape (5, 1)). At that point it's no longer a vector; it's a matrix.

In math, a 2D array is a matrix, and an array with 3 or more dimensions is a tensor
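
To make the vector versus matrix distinction concrete, here is a small sketch comparing the shapes. The array names are hypothetical.

```python
import numpy as np

vector = np.arange(5)                    # 1D array: no row/column orientation
column_matrix = vector.reshape((5, 1))   # 2D array: five rows, one column
row_matrix = vector.reshape((1, 5))      # 2D array: one row, five columns

# .ndim reports the number of dimensions of each array
print(vector.shape, vector.ndim)
print(column_matrix.shape, column_matrix.ndim)
print(row_matrix.shape, row_matrix.ndim)
```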

Array attributes are properties of an instance of an array:
- array_name.shape - describes the shape and returns a tuple with the length of each dimension. Referring to rows and columns only gets us so far in NumPy, since many arrays have more than 2 dimensions. Instead of rows we can refer to the 1st dimension, and instead of columns the 2nd dimension, although the row and column terms remain common

Array methods are called directly on the array object itself, rather than passing the array as an argument like we do with NumPy functions like np.array:
- .flatten() - takes all elements and flattens them into one 1D array
- .reshape() - redefines the shape of an array without changing the elements that make up the array. Pass in a tuple with the desired length of each dimension (such as row and column counts); the new shape must be compatible with the number of elements in the array

In [None]:
# Flatten sudoku_game
flattened_game = sudoku_game.flatten()

# Print the shape of flattened_game
print(flattened_game.shape)

# Reshape flattened_game back to a nine by nine array
reshaped_game = flattened_game.reshape((9,9))

# Print sudoku_game and reshaped_game
print(sudoku_game)
print(reshaped_game)

## NumPy data types

NumPy data types are more specific than Python data types in that NumPy data types include both the type of data (like integer or string) and the amount of memory available in bits (like np.int64). You can save memory by reducing the data type's bitsize when the data doesn't require a large bitsize.

A bit is a binary digit. It can hold only a value of 0 or 1 and is the smallest unit of memory available on a computer. A byte is a sequence of 8 bits. np.int32 can store 2 to the 32nd power distinct numbers, since this is the number of possible combinations of 0 and 1 available in 32 bits; that's about 4 billion integers, from around -2 billion to 2 billion. Anything outside those bounds needs a larger bitsize, like 64. Booleans don't have a bitsize since Booleans don't vary in size.
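
The bounds described above can be checked with np.iinfo(), which reports the limits of an integer data type. This is just a sketch for verifying the arithmetic; it is not part of the course code.

```python
import numpy as np

int32_info = np.iinfo(np.int32)
print(int32_info.min, int32_info.max)  # -2147483648 2147483647

# Count of distinct values representable in 32 bits
n_values = int(int32_info.max) - int(int32_info.min) + 1
print(n_values == 2 ** 32)

# .itemsize reports bytes per element: 4 bytes is 32 bits
bytes_per_element = np.zeros(3, dtype=np.int32).itemsize
print(bytes_per_element)
```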

To find the data type of an array, use array.dtype. NumPy automatically decides the data type based on the content at array creation, like detecting integers (the default bitsize is 64 for floats and, on most platforms, for integers). For strings, NumPy selects a data type with capacity large enough for the longest string; <U12 is a Unicode string with a max length of 12.

In [None]:
# Create an array of zeros with three rows and two columns
zero_array = np.zeros((3, 2))

# Print the data type of zero_array
print(zero_array.dtype)

# Create a new array of int32 zeros with three rows and two columns
zero_int_array = np.zeros((3,2), dtype=np.int32)

# Print the data type of zero_int_array
print(zero_int_array.dtype)

You can declare a data type at creation using dtype= argument in the np.array() function. You can convert datatypes using array.astype(datatype) method. 

In [None]:
# Sudoku games only ever store integers from one to nine
# So we change its bit size to something smaller, like int8
# Print the data type of sudoku_game
print(sudoku_game.dtype)

# Change the data type of sudoku_game to int8
small_sudoku_game = sudoku_game.astype(np.int8)

# Print the data type of small_sudoku_game
print(small_sudoku_game.dtype)

If you try to make an array from a list with several types, NumPy automatically converts all the values to the same data type, like strings; this is called type coercion. Numbers are easily cast into strings, but strings are not easily cast into numbers while still preserving the original data.

A single string will make NumPy cast the whole array as strings. Adding a single float to a list of integers will change all the integers to floats, and adding a single integer will convert a list of Booleans to integers.

Pay attention, because element types in your array will change without notice.
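
The coercion rules above can be seen by checking .dtype on a few mixed lists. The lists here are made up for illustration, and the exact bitsize of the results may vary by platform.

```python
import numpy as np

# One string turns every element into a string
with_string = np.array([1, 2, "three"])
print(with_string.dtype)   # a Unicode string dtype

# One float turns the integers into floats
with_float = np.array([1, 2, 3.5])
print(with_float.dtype)    # float64

# One integer turns the Booleans into integers
with_int = np.array([True, False, 2])
print(with_int.dtype)      # an integer dtype (int64 on most platforms)
```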

## Indexing and slicing arrays

The first index is 0. To index in 2D, provide a row and column index, like sudoku[2, 4]. Giving only one index returns that whole row, and using a colon for the rows, like sudoku[:, 3], returns all rows of that one column.

In slicing, the element at the start of the slice is included, but the element at the stop value is not. To slice in 2D, provide start and stop indices for both the rows and the columns, like sudoku[3:6, 3:6].

You could also give NumPy a third number in a slice: the step value. A step value of 2 gives you every other number in the array. You can use step values in both rows and columns, or just one or the other.

np.sort(array) sorts an array along a given axis. If you just call np.sort(sudoku), the values within each row are sorted across the columns, so that low values are on the left and high values on the right. To sort down the rows instead, you need to know the order of array axes.

In a 2D array, the direction running vertically down the rows is axis zero (top to bottom), and the direction running along the columns is axis one (left to right). Remember that a column looks like the number one, and that's axis 1.

The default axis in np.sort() is the last axis of the array passed to it. If a 2D array is being sorted, NumPy sorts along axis 1, since columns are the last axis. To sort each column instead, pass the keyword argument axis=0 into np.sort() so that the highest number in each column ends up at the bottom of the array.
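
A small made-up grid makes the two sorting directions easier to see. This is a sketch with hypothetical data, not the course's sudoku array.

```python
import numpy as np

grid = np.array([[3, 1, 2],
                 [9, 7, 8],
                 [6, 4, 5]])

# Default (last axis): each row is sorted left to right
sorted_within_rows = np.sort(grid)
print(sorted_within_rows)

# axis=0: each column is sorted top to bottom
sorted_within_columns = np.sort(grid, axis=0)
print(sorted_within_columns)
```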

In [None]:
# Select all rows of block ID data from the second column
block_ids = tree_census[:, 1]

# Print the first five block_ids
print(block_ids[0:5])

# Select the tenth block ID from block_ids
tenth_block_id = block_ids[9]
print(tenth_block_id)

# Select five block IDs from block_ids starting with the tenth ID
block_id_slice = block_ids[9:14]
print(block_id_slice)

In [None]:
# Create an array of the first 100 trunk diameters from tree_census
hundred_diameters = tree_census[0:100, 2]
print(hundred_diameters)

# Create an array of trunk diameters with even row indices from 50 to 100 inclusive
every_other_diameter = tree_census[50:101:2, 2]
print(every_other_diameter)

In [None]:
# Extract trunk diameters information and sort from smallest to largest
sorted_trunk_diameters = np.sort(tree_census[:, 2])
print(sorted_trunk_diameters)

## Filtering Arrays

You can use masks and fancy indexing, or np.where()

The code to create a Boolean mask checks whether a condition is true for each element in an array. The mask itself is an array of Booleans with the same shape as the evaluated array. To filter an array (one_to_five) down to only its even numbers, first create a Boolean mask of True and False values based on whether each element is evenly divisible by 2, with a statement like mask = one_to_five % 2 == 0. You can then index the array using the mask, like one_to_five[mask]. This is fancy indexing, which is useful when we are interested in the elements that meet a condition: only the elements where the mask is True are returned.

You may want to filter with a condition on one row or column but return data from another, like returning class IDs for even-sized classes. You would make a mask like classroom_ids_and_sizes[:, 1] % 2 == 0, which checks all values in the 2nd column for the condition. You would then index the first column using that mask, like classroom_ids_and_sizes[:, 0][mask].
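
Here is a minimal sketch of the classroom example. The data values are invented for illustration (column 0 is assumed to hold the class ID and column 1 the class size).

```python
import numpy as np

# Hypothetical data: [class ID, class size]
classroom_ids_and_sizes = np.array([[1, 22],
                                    [2, 21],
                                    [3, 27],
                                    [4, 26]])

# Mask built from the size column
even_size_mask = classroom_ids_and_sizes[:, 1] % 2 == 0
print(even_size_mask)

# Use the mask to pull values from the ID column
even_size_ids = classroom_ids_and_sizes[:, 0][even_size_mask]
print(even_size_ids)
```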

In [None]:
# Create an array which contains row data on the largest tree in tree_census
# Tree diameter is saved in column with index 2
largest_tree_data = tree_census[tree_census[:, 2] == 51]
print(largest_tree_data)

# Slice largest_tree_data to get only the block ID
largest_tree_block_id = largest_tree_data[:, 1]
print(largest_tree_block_id)

# Create an array which contains row data on all trees with largest_tree_block_id
trees_on_largest_tree_block = tree_census[tree_census[:, 1] == largest_tree_block_id]
print(trees_on_largest_tree_block)

Fancy indexing returns an array of the elements which meet a condition, while np.where() returns the indices of those elements. The indices can be used to direct NumPy where to apply code. np.where() can also be used for combining data as well as filtering arrays, by pulling different elements into a new array based on whether they meet the condition. With the classroom example, np.where(classroom_ids_and_sizes[:, 1] % 2 == 0) would return (array([0, 3]),). This means classrooms at indices 0 and 3 meet the criteria.

np.where() returns a tuple of arrays. This is because when the filtered array is multi-dimensional, each element can only be located by including an index for every dimension. For a 2D array, it returns something like (array([0,0,0]), array([0,0,0])), since identifying each element requires a row index (first array) and a column index (second array). It's helpful to unpack the results of np.where() into separate variables.
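
The sketch below unpacks the tuple returned by np.where() for a small 2D array. The grid is made up for illustration.

```python
import numpy as np

grid = np.array([[0, 5, 0],
                 [7, 0, 9]])

# One index array per dimension: rows first, then columns
row_indices, column_indices = np.where(grid == 0)
print(row_indices)
print(column_indices)

# Pairing the two arrays locates each matching element
zeros_found = grid[row_indices, column_indices]
print(zeros_found)
```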

In [None]:
# Create the block_313879 array containing trees on block 313879
block_313879 = tree_census[tree_census[:, 1] == 313879]
print(block_313879)

In [None]:
# Create an array of row_indices for trees on block 313879
row_indices = np.where(tree_census[:, 1] == 313879)

# Create an array which only contains data for trees on block 313879
block_313879 = tree_census[row_indices]
print(block_313879)

The real power of np.where() is its ability to check whether rows, columns, or elements meet a condition and then pull one element if the condition is met and another if not. So if you wanted to replace all the 0s in the sudoku game with empty strings, you could use np.where(sudoku == 0, '', sudoku). The second argument is the value to use where the condition is met, and the third argument is the value to use where it is not. Since we want the non-zeros to remain unchanged, we pass the original array as the third argument.

In [None]:
# Create and print a 1D array of tree and stump diameters
# The colon in each index (like [:, 2]) specifies all rows in that column
trunk_stump_diameters = np.where(tree_census[:, 2] == 0, tree_census[:, 3], tree_census[:, 2])
print(trunk_stump_diameters)

## Adding and removing data

Concatenate: add data to an array along any axis with np.concatenate().
If you have named arrays, pass a tuple of the array names into np.concatenate(). By default this concatenates along the first axis, adding rows. For other axes (like columns), use the axis= keyword.

Shapes must be compatible; the arrays must have the same shape along all axes except the one being concatenated along. They must also have the same number of dimensions, so you couldn't concatenate a 1D array with a 2D array, even if the row/column sizes were the same. You would need to use .reshape() on the 1D array to make them compatible. Depending on whether the data is vertical or horizontal, set a value of 1 in the reshape tuple argument as the length of the flat dimension. So if you have a 1D array with 3 elements that should become a column, use array_1D.reshape((3,1)). You can't add new dimensions with concatenate; it only adds data along an existing axis.

In [None]:
# Print the shapes of tree_census and new_trees
print(tree_census.shape, new_trees.shape)

# Add rows to tree_census which contain data for the new trees
updated_tree_census = np.concatenate((tree_census, new_trees))
print(updated_tree_census)

In [None]:
# Print the shapes of tree_census and trunk_stump_diameters
print(trunk_stump_diameters.shape, tree_census.shape)

# Reshape trunk_stump_diameters
reshaped_diameters = trunk_stump_diameters.reshape((1000, 1))

# Concatenate reshaped_diameters to tree_census as the last column
concatenated_tree_census = np.concatenate((tree_census, reshaped_diameters), axis=1)
print(concatenated_tree_census)

np.delete() takes 3 arguments: the array to delete from; a slice, index, or array of indices to be deleted; and the axis to be deleted along. To delete the second row from a 2D array, use np.delete(classroom_data, 1, axis=0); deleting a column would use axis=1. If no axis is specified, NumPy deletes the indicated index/indices from a flattened version of the array.
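
The three cases can be compared side by side on a small array. The classroom_data values below are hypothetical.

```python
import numpy as np

# Hypothetical data: [class ID, class size]
classroom_data = np.array([[1, 22],
                           [2, 21],
                           [3, 27]])

# Delete the second row
without_second_row = np.delete(classroom_data, 1, axis=0)
print(without_second_row)

# Delete the first column
without_first_column = np.delete(classroom_data, 0, axis=1)
print(without_first_column)

# No axis: index 1 is removed from the flattened array
flattened_delete = np.delete(classroom_data, 1)
print(flattened_delete)
```

Note that np.delete() returns a new array in every case; classroom_data itself is unchanged.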

In [None]:
# Delete the stump diameter column from tree_census
tree_census_no_stumps = np.delete(tree_census, 3, axis=1)

# Save the indices of the trees on block 313879
private_block_indices = np.where(tree_census_no_stumps[:, 1] == 313879)

# Delete the rows for trees on block 313879 from tree_census_no_stumps
tree_census_clean = np.delete(tree_census_no_stumps, private_block_indices, axis=0)

# Print the shape of tree_census_clean
print(tree_census_clean.shape)

## Summarizing data

Aggregating methods: .sum(), .min(), .max(), .mean(), .cumsum()

Calling these methods on an array would calculate over the whole array. You can add an axis= argument to define which axis to sum across. axis=0 would sum across all rows in each column and would return an array with the results, the number of elements in the array being the number of columns. The axis listed in the axis= argument refers to the axis we want collapsed

keepdims= is an optional argument in the .sum(), .min(), .max(), .mean() methods. If set to True, then the dimensions that are collapsed when aggregating are left in the output array and set to one. For instance, if axis=1 and dimension is 2D, then the output would be in a 2D column format. This can help with dimension compatibility 

.cumsum() provides a cumulative sum of elements along a given axis. If axis=0, the cumulative sum is calculated down each column, and the bottom row holds the final cumulative sums.
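
A tiny array shows how axis=, keepdims=, and .cumsum() change the output shape. The sales values here are invented, not the monthly_sales data.

```python
import numpy as np

sales = np.array([[1, 2, 3],
                  [4, 5, 6]])

total = sales.sum()                              # whole array
column_totals = sales.sum(axis=0)                # collapses the rows
row_totals = sales.sum(axis=1)                   # collapses the columns
row_totals_2d = sales.sum(axis=1, keepdims=True) # collapsed axis kept with length 1
running_totals = sales.cumsum(axis=0)            # running sum down each column

print(total, column_totals, row_totals)
print(row_totals_2d.shape)
print(running_totals)
```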

Graphs are helpful with communicating summary information

In [None]:
# Create a 2D array of total monthly sales across industries
monthly_industry_sales = monthly_sales.sum(axis=1, keepdims=True)
print(monthly_industry_sales)

# Add this column as the last column in monthly_sales
monthly_sales_with_total = np.concatenate((monthly_sales, monthly_industry_sales), axis=1)
print(monthly_sales_with_total)

In [None]:
# Create the 1D array avg_monthly_sales
avg_monthly_sales = monthly_sales.mean(axis=1)
print(avg_monthly_sales)

# Plot avg_monthly_sales by month
plt.plot(np.arange(1,13), avg_monthly_sales, label="Average sales across industries")

# Plot department store sales by month
plt.plot(np.arange(1,13), monthly_sales[:, 2], label="Department store sales")
plt.legend()
plt.show()

In [None]:
# Find cumulative monthly sales for each industry
cumulative_monthly_industry_sales = monthly_sales.cumsum(axis=0)
print(cumulative_monthly_industry_sales)

# Plot each industry's cumulative sales by month as separate lines
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:, 0], label="Liquor Stores")
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:, 1], label="Restaurants")
plt.plot(np.arange(1, 13), cumulative_monthly_industry_sales[:, 2], label="Department stores")
plt.legend()
plt.show()

## Vectorized operations

When elements of an array are summed, NumPy adds them all at once rather than one by one. Since every element of the array has the same data type, NumPy can hand the work to optimized C code, which is fast and memory-efficient. It also reduces the amount of code you need to write. This is vectorization.

If you add a single number to an array, NumPy adds that number to all elements; that single number is a scalar. You can also multiply by a scalar. If you add two arrays of the same shape together, like array_a + array_b, NumPy adds the corresponding elements and outputs an array of the same shape. The same goes for multiplying, subtracting, and dividing.

We use vectorized operations for stuff like making Boolean masks and filtering arrays

You can create your own vectorized functions from Python functions using np.vectorize(). If you have an array of strings and want to check whether any string lengths are greater than two, you couldn't use len(array) > 2; you would get a single True value if there are more than 2 elements in the array. len() is a Python function, not a NumPy function, so it isn't vectorized. If you pass len (no trailing parentheses) into np.vectorize() and assign the result to a variable, then calling that variable with the array as an argument, followed by the condition, gives you a Boolean array.
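
The len() case described above might look like this. The words array and variable names are hypothetical.

```python
import numpy as np

words = np.array(["NumPy", "is", "awesome"])

# len() on the array counts elements, so this is a single Boolean
single_result = len(words) > 2
print(single_result)

# Vectorizing len applies it to each string in the array
vectorized_len = np.vectorize(len)
longer_than_two = vectorized_len(words) > 2
print(longer_than_two)
```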

In [None]:
# .upper() is a string method, meaning that it must be called on an instance of a string: str.upper()

# Vectorize the .upper() string method
vectorized_upper = np.vectorize(str.upper)

# Apply vectorized_upper to the names array
uppercase_names = vectorized_upper(names)
print(uppercase_names)

## Broadcasting

Remember that 1D arrays are displayed as a row of data by default; to express one as a column, you must reshape it to (n, 1) so it has n rows and one column.

Broadcasting takes vectorized operations to the next level. It stretches a smaller array across a larger one, letting you do operations between arrays of different shapes; broadcasting also refers to this type of array math. Adding a scalar to an array uses broadcasting: conceptually, NumPy acts as if there were an array the same size as the original filled with the scalar value.

Broadcasting only works with compatible arrays. Compare the array shapes dimension by dimension, from right to left. A pair of dimensions is compatible when one of them is 1 or when they are equal; this must hold for every pair. Arrays of shape (10, 5) and (5,) (a 1D array of 5 elements) are broadcastable because their rightmost dimensions are both 5 (the 1D array has only one dimension, so it counts as the rightmost). Arrays of shape (10, 5) and (10,) are not compatible. The arrays don't need to have the same number of dimensions.

A 1D array is broadcastable because NumPy operates as though there's a copy of the 1D array for each row of the 2D array, then operates on them as necessary. NumPy assumes that the user is trying to broadcast row-wise. You could also use .reshape() to convert a 1D array into a 2D column (like (2,) --> (2, 1)) so that it broadcasts column-wise, as long as the dimensions still meet the broadcasting criteria.
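
The (10, 5) example can be tried directly. This sketch uses placeholder arrays of ones and np.arange() values rather than any course data.

```python
import numpy as np

table = np.ones((10, 5))
per_column = np.arange(5)    # shape (5,): matches the rightmost dimension
per_row = np.arange(10)      # shape (10,): does not match the rightmost dimension

# (10, 5) and (5,) broadcast row-wise
row_wise = table + per_column
print(row_wise.shape)

# (10, 5) and (10,) are incompatible and raise a ValueError
try:
    table + per_row
    per_row_compatible = True
except ValueError:
    per_row_compatible = False
print(per_row_compatible)

# Reshaping to (10, 1) makes the same values broadcast column-wise
column_wise = table + per_row.reshape((10, 1))
print(column_wise.shape)
```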

## Saving and loading arrays

RGB array: a type of 3D array used in image-based machine learning. Each 1D array in the 3D array contains red, green, and blue values, respectively, which together describe the color of a single pixel. Each 2D array made up of these 1D arrays represents a row of pixel colors. 255 is the highest value for standard 8-bit color channels. A 1D array of all 255s is white.

Arrays can be saved as .csv, .txt, pickle files, etc. NumPy's .npy format is generally the best choice for speed and storage efficiency.

Load a .npy file:

with open("file.npy", 'rb') as f:
    array_name = np.load(f)
plt.imshow(array_name)
plt.show()

open() arguments are the name of the file to open, and the mode to open the file in ('rb' is read binary)

np.load() takes the file object returned by open() (here, the alias f)

In this 3D array, you can slice the array to select all of the first, second, and third components of the 1D arrays, which correspond to the red, green, and blue values, and save them as arrays representing the three colors like this, which results in 2D arrays:

red_array = logo_rgb_array[:, :, 0]
green_array = logo_rgb_array[:, :, 1]
blue_array = logo_rgb_array[:, :, 2]

Printing the top row of each 2D array, you can look at the values across the arrays to figure out what color parts of the first row are. You can use np.where() to replace values and change the color of the images. 255 is often associated with brighter colors, and lower numbers are associated with darker colors

dark_logo_array = np.where(logo_rgb_array == 255, 50, logo_rgb_array)

To save an array as .npy:

with open('dark_logo.npy', 'wb') as f:
    np.save(f, dark_logo_array)

But this time you use "wb" for write binary
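
To see saving and loading work together, here is a round-trip sketch. It assumes a tiny made-up all-white image instead of the course's logo file, and writes to a temporary folder so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical 2x2 all-white RGB image
image = np.full((2, 2, 3), 255, dtype=np.uint8)

# Darken every 255 value, as in the logo example
dark_image = np.where(image == 255, 50, image)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dark_image.npy")

    # "wb" = write binary
    with open(path, "wb") as f:
        np.save(f, dark_image)

    # "rb" = read binary
    with open(path, "rb") as f:
        reloaded = np.load(f)

print(np.array_equal(reloaded, dark_image))
```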

## Array acrobatics

Rearranging data by flipping the order of array elements and changing axis order

In machine learning, data augmentation is the process of adding additional data by performing small manipulations on data that is already available. For example, you can flip images and use both the flipped and original images to train the model.

np.flip() reverses the order of array elements, along every axis by default. For an RGB image array, the first axis (the ordered rows of pixels) is flipped so that the bottom rows are now at the top. The second axis (the ordered columns of pixels) is flipped so that columns that were on the left are now on the right. The RGB values in the third axis are also flipped, so blue is swapped with red and vice versa (green is in the middle, so it stays the same).

You can flip along a specific axis by specifying it with axis=. Use a tuple to specify multiple axes.

With a 2D array of floats, np.flip() reverses the element order along both axes when no axis keyword argument is passed: the shape of the array stays the same, but the last row appears first and the elements within every row are reversed. np.transpose() flips the axis order while keeping the element order within each axis the same: the shape reverses, so columns become rows and vice versa (the first elements of each row together form the first row of the result).

The default behavior for transpose is to reverse the axis order; you can also specify a custom axis order using the axes= keyword argument. For instance, axes=(1, 0, 2) will make column values into row values and row values into column values, because it changes the axis order from 0, 1, 2 to 1, 0, 2. The third axis will be in the same position
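
The difference between flipping and transposing shows up clearly on a small array. The grid and cube arrays below are hypothetical.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# No axis: element order is reversed along both axes, shape is unchanged
flipped_all = np.flip(grid)
print(flipped_all)

# axis=0: only the row order is reversed
flipped_rows = np.flip(grid, axis=0)
print(flipped_rows)

# Transpose: rows become columns, so shape (2, 3) becomes (3, 2)
transposed = np.transpose(grid)
print(transposed)

# Custom axis order on a 3D array: swap the first two axes, keep the third
cube = np.zeros((2, 3, 4))
swapped = np.transpose(cube, axes=(1, 0, 2))
print(swapped.shape)
```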

## Stacking and splitting

You can unpack arrays using np.split(), a function that accepts 3 arguments: the array to split, the number of equally sized arrays desired after the split, and the axis to split along. np.split() returns a list of arrays, so you can use multiple assignment to assign the split arrays to separate variables. The resulting arrays have the same number of dimensions as the original array. If a resulting array has a dimension with a length of one (a trailing dimension), you can remove that dimension using the .reshape() method.

With np.concatenate(), we are not able to join data along a new dimension, but np.stack() can. You can use slicing or np.split() to unpack an array, and then use np.stack() to put it back together once you're done with the changes. np.stack() takes a list of the arrays you want to stack plus an axis; to turn 2D arrays into a 3D array with the stacked values last, set the axis to 2.

You can plot 2D data pulled from a 3D array, because some images (like black and white) are 2D

Matplotlib's default colormap is called Viridis. It's like grayscale but more readable for people with colorblindness. Yellow = lighter values, purple = darker values

np.stack requires that all arrays have the same shape and number of dimensions, so you may need to reshape the arrays. For the axis argument, the axis should be set to 2 (the last axis) because plt.imshow() requires 3D RGB arrays to have a shape that corresponds to the pixel width and height of the image in the first 2 dimensions, with the RGB values in the third dimension 
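
Splitting an RGB-style array and stacking it back together might look like the sketch below. The rgb array is a made-up 2x4 "image" filled with sequential numbers, not real pixel data.

```python
import numpy as np

# Hypothetical image: 2 pixels high, 4 pixels wide, 3 color values per pixel
rgb = np.arange(24).reshape((2, 4, 3))

# Split into three arrays along the last axis; each keeps 3 dimensions
red, green, blue = np.split(rgb, 3, axis=2)
print(red.shape)

# Drop the trailing dimension of length one
red_2d = red.reshape((2, 4))
green_2d = green.reshape((2, 4))
blue_2d = blue.reshape((2, 4))

# Stack the 2D arrays along a new last axis to rebuild the 3D array
restacked = np.stack([red_2d, green_2d, blue_2d], axis=2)
print(restacked.shape)
print(np.array_equal(restacked, rgb))
```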

### Tuple

A tuple is a Python data type used to store collections of data, like a list. Created with parentheses

## Axes

The first axis is represented by 0, and is along the rows. 

The second axis is represented by 1, and is along the columns; think how columns look like 1s

## Trailing/Rightmost Dimensions

This answer on Stack Overflow helps: https://stackoverflow.com/questions/11178179/numpy-array-broadcasting-rules

This relates to broadcasting with arrays that have different numbers of dimensions. If you have two arrays with a different number of dimensions, say one 1x2x3 and the other 2x3, then you compare only the trailing common dimensions, in this case 2x3. But if both arrays are two-dimensional, then their corresponding sizes have to be either equal or one of them has to be 1.

So with the class example of (10, 5) and (5,), the second array is 1D and the first is 2D. The rightmost dimension of the 1D array is 5, and the rightmost dimension of the 2D array is also 5, which makes them compatible.

Because the arrays have different numbers of dimensions, their shapes are lined up from the right, so the row/column numbers can look mismatched at first glance even when the arrays are compatible.
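
The 1x2x3 and 2x3 case can be checked by looking at the shape of the result. The arrays here are placeholders filled with zeros and ones.

```python
import numpy as np

three_d = np.zeros((1, 2, 3))
two_d = np.ones((2, 3))

# Trailing dimensions (2, 3) match, so the arrays broadcast
combined = three_d + two_d
print(combined.shape)

# A dimension of length 1 also counts as compatible: (2, 1) stretches to (2, 3)
column = np.ones((2, 1))
stretched = three_d + column
print(stretched.shape)
```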

## Help function

help(np.unique) for a function
help(np.ndarray.flatten) for a method, where you prefix it with the object type that the method is called on - in this case, a NumPy n-dimensional array

### Slicing a 3D array

In the example in the course, to return each first 1D element for a 3D array (splitting the RGB array into red, green, blue), you set axis=2 in np.split(). I think this is because it's kind of splitting the columns through the array of arrays of arrays

https://towardsdatascience.com/a-visual-guide-to-multidimensional-numpy-array-aggregation-97a8960b3c59

When adding dimensions, the new, highest dimension is 0, and the other dimensions are increased by 1. So in a 3D array, 0 is the relationship between the two 2D arrays, rows become the 1st dimension, and columns become the 2nd dimension, which is why you use axis=2 to split by columns!