# Python cheatsheets and tutorials

1. [Python basics](https://www.pythoncheatsheet.org/)
2. [numpy](http://datacamp-community-prod.s3.amazonaws.com/da466534-51fe-4c6d-b0cb-154f4782eb54)
3. [pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
4. [Basics, numpy and pandas](https://www.kaggle.com/lavanyashukla01/pandas-numpy-python-cheatsheet)
5. [matplotlib for plotting](https://github.com/matplotlib/cheatsheets#cheatsheets) 
6. [seaborn for plotting](https://seaborn.pydata.org/tutorial.html) 

# Numpy

<img src="img/logo/numpy.png" width="200"/>

> We need arrays in scientific computation!

* Arrays are essential tools in numerical computing. When a computation must be repeated for a set of input values, **it is natural and advantageous to represent the data as arrays and the computation in terms of array operations**. <br><br>

* Computations that are formulated this way are said to be vectorized. **Vectorized computing eliminates the need for many explicit loops over the array elements by applying batch operations on the array data.** The result is more concise and maintainable code. Because the loops run in compiled code rather than in Python, vectorized computations can also be significantly faster than sequential element-by-element computations.<br><br>

* The core of NumPy is implemented in C and provides efficient functions for manipulating and processing arrays. **NumPy provides the numerical backend for nearly every scientific or technical library for Python.** 
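As a rough illustration of the points above, the sketch below converts a set of temperatures twice: once with an explicit loop and once with a single array expression. The data and the variable names (`celsius`, `fahrenheit_loop`, `fahrenheit_vec`) are made up for this example.

```python
import numpy as np

# Hypothetical example data: 1000 temperatures in degrees Celsius
celsius = np.linspace(-10.0, 40.0, 1000)

# Loop version: convert one element at a time
fahrenheit_loop = np.empty_like(celsius)
for i in range(celsius.size):
    fahrenheit_loop[i] = celsius[i] * 9.0 / 5.0 + 32.0

# Vectorized version: one expression applied to the whole array
fahrenheit_vec = celsius * 9.0 / 5.0 + 32.0

# Both approaches produce the same numbers
print(np.allclose(fahrenheit_loop, fahrenheit_vec))
```

The vectorized line is shorter and, for large arrays, typically much faster.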

In [1]:
# run this cell first before running the cells below!!!
import numpy as np

The core of the `numpy` package is the `ndarray` class, usually created with the `np.array()` function. Let's examine that first. We can make an array out of a sequence, like a list.

In [None]:
# this is a list
li = [1, 2, 3, 4, 5]

In [None]:
li

In [None]:
type(li)

In [None]:
# make an array
dd = np.array(li)

In [None]:
type(dd)

In [None]:
dd

* At first glance, NumPy arrays bear some resemblance to Python’s list data structure. But an important difference is that while Python lists are generic containers of objects, **NumPy arrays are homogeneous and typed arrays of fixed size.**<br><br>

* Homogeneous means that all elements in the array have the **same data type**. Fixed size means that an **array cannot be resized (without creating a new array)**. For these and other reasons, operations and functions acting on NumPy arrays can be much more efficient than those using Python lists.<br><br>

* In addition, compared with list, NumPy also provides a **large collection of basic operators and functions** that act on these data structures, as well as **submodules with higher-level algorithms** such as linear algebra.<br><br>

Homogeneous:

In [None]:
li = [1,2,3,"abc"]
li

In [None]:
dd = np.array(li)
dd

Fixed size:

In [None]:
li.append(2)
li

In [None]:
# dd.append(2)  # uncommenting this raises AttributeError: arrays have no append() method

### Array numeric type

A very nice summary of the array data types can be found [here](https://betterprogramming.pub/a-comprehensive-guide-to-numpy-data-types-8f62cb57ea83)

<img src="img/array_numeric_type.png" alt="beginning python" width="800"/>

In [None]:
z = np.array([1,2,3])
z

**You can query the data type of the elements in the array by using the `dtype` attribute**. Don't confuse this with the built-in `type()` function, which checks the type of the whole object.

In [None]:
type(z)

In [None]:
z.dtype # the default integer type is int64 on most platforms

In [None]:
z = np.array([1.2,2.3,3.4])
z

In [None]:
z.dtype # the default is float64 for floating point numbers


Arrays are homogeneous. To satisfy this property, an array is **upcast** to a type that can represent all of its elements. So, if one element is a float, all elements will be converted to floats.

In [None]:
# a mixture of integer and floating point number
d = [1, 2, 3.1415, 4, 5]
arr = np.array(d)
arr

In [None]:
arr.dtype

Array types may be defined explicitly in the call. When creating a new array, you can set the data type of its elements using the **string name** shown in the array numeric type figure above.

In [None]:
# assign the array type during creation
z = np.array([1,2,3], dtype="int8")
z

In [None]:
# check array type
z.dtype

In [None]:
z = np.array([1,2,3], dtype="int64")
z

In [None]:
# check array type
z.dtype

You can use the `astype()` method to convert the array data type

In [None]:
z.astype("float16")

In [None]:
# the original array data type is not changed!
z.dtype

In [None]:
# need to assign the result to variable
z = z.astype("float16")

In [None]:
z.dtype

Arrays are like **multidimensional sequences**. We can create a 2D/3D array by supplying **a list of lists** as the argument. And note how many square brackets are surrounding the output!

In [None]:
# 2D array
arr = np.array([[1, 2, 3,], [4, 5, 6]])
arr

In [None]:
# 3D array
arr = np.array([[[1, 2, 3,], [4, 5, 6]], [[3, 2, 1,], [6, 5, 4]]])

In [None]:
arr

### Array attributes

Arrays have a few other important attributes. **Note attributes never have parentheses after them. Methods always do.**

![array_axis](img/array_axis.png)

In [None]:
# 1D array
arr = np.array([1,2,3])
arr

In [None]:
arr.ndim

In [None]:
arr.shape

In [None]:
arr.size

In [None]:
# 2D array
arr = np.array([[1, 2], [4, 5],[5,5]])
arr

In [None]:
arr.ndim # The number of dimensions (indices) of the array

In [None]:
arr.shape # The shape of the array (i.e., the size of array in each dimension)

In [None]:
arr.size # The total number of elements in the array

### Change array shape

You can set the `array.shape` attribute to change the shape of the array.

In [None]:
arr = np.array([[1, 2, 3,], [4, 5, 6]])

In [None]:
arr

In [None]:
arr.shape

In [None]:
arr.shape = (3, 2)
arr

You can also use the `reshape()` method to change the shape of an array. **Note that it will not change the array itself.**

In [None]:
arr = np.array([[1, 2, 3,], [4, 5, 6]])

In [None]:
arr

In [None]:
arr.reshape(3, 2)

Note that it will not change the array itself.

In [None]:
arr

Assign the result back to the original variable to change its shape:

In [None]:
arr = arr.reshape(3, 2)
arr

### 1D vs. 2D array for the same number of elements

This part can be tricky, BUT it is important to understand the difference between a 1D and a 2D array.

In [None]:
arr = np.array([[1, 2, 3,], [4, 5, 6]])
arr

In [None]:
# convert it to 1D array
arr.shape = (6, )

In [None]:
arr

Singleton dimensions (i.e., dimensions in which the array has size 1) **add** to the dimensionality of an array. The last example is a 1D array (also called a vector), but the next is a 2D array.

In [None]:
# convert it to a 2D array with a single row
arr.shape = (1, 6)
arr

# Note that there are *two* square brackets in the output sequence to denote it is a 2D array.
# It looks like a row vector.

In the above example, the arr shape is (1, 6), indicating we have a singleton dimension for the row. In other words, the size of array in the row dimension is 1, so we have 1 single element for each column.

In [None]:
arr.shape = (6, 1)
arr   # this is also a 2D array, like a column vector

**What if we don't know the size of the array but we still want to convert the original 1D or 2D array to a 2D array with only 1 row or 1 column?**

*Convert an array from 1D to 2D with only 1 row or 1 column*

In [None]:
arr = np.array([[1, 2, 3,], [4, 5, 6]])
arr

In [None]:
# let's change arr back to 1D array
arr.shape = (6, ) # not (, 6)!!!
arr

**Method 1: use `size` attribute**

In [None]:
# convert to 2D array with only 1 row
arr.shape = (1, arr.size)
arr

In [None]:
# back to 1D array
arr.shape = (arr.size, )
arr

In [None]:
# convert to 2D array with only 1 column
arr.shape = (arr.size, 1)
arr

**Method 2: use -1**

* use -1 at the dimension of unknown size

In [None]:
# back to 1D array
arr.shape = (arr.size, )
arr

In [None]:
# use the -1 trick to create a 2D array with 1 row
arr.shape = (1,-1)
arr

In [None]:
# back to 1D array
arr.shape = (arr.size, )
arr

In [None]:
# use the -1 trick to create a 2D array with 1 column
arr.shape = (-1, 1)
arr

*Convert an array from 2D with multiple rows and columns to 2D with only 1 row or 1 column*

In [None]:
arr = np.array([[1, 2, 3,], [4, 5, 6]])
arr

**method 1: use the `size` attribute**

In [None]:
# create a 2D array with 1 row
arr.shape = (1, arr.size)
arr

In [None]:
# back to multiple rows and columns
arr.shape = (2, 3)
arr

In [None]:
# create a 2D array with 1 column
arr.shape = (arr.size, 1)
arr

**method 2: use the `-1` trick**

In [None]:
# back to multiple rows and columns
arr.shape = (2, 3)
arr

In [None]:
# use the -1 trick to create a 2D array with 1 row
arr.shape = (1, -1)
arr

In [None]:
# back to multiple rows and columns
arr.shape = (2, 3)
arr

In [None]:
# use the -1 trick to create a 2D array with 1 column
arr.shape = (-1, 1)
arr

*Convert an array from 2D to 1D (i.e., flatten the array)*

In [None]:
arr = np.array([[1, 2, 3,], [4, 5, 6]])
arr

**method 1: use the `size` attribute**

In [None]:
arr.shape = (arr.size, )
arr

**method 2: use the `-1` trick**

In [None]:
# back to 2D array
arr.shape = (2, 3)
arr

In [None]:
arr.shape = (-1,) # not (, -1)!!!
arr

This may seem confusing at first, but it is consistent. Recall that the shape of a 1D array is `(n, )`, where `n` is the number of elements in the array. With `arr.shape = (arr.size, )`, we ask NumPy for a 1D array containing `arr.size` elements. Similarly, with `arr.shape = (-1, )`, we let NumPy automatically determine the number of elements in the resulting 1D array. Only one dimension may be given as -1.
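To make the `-1` behavior concrete, here is a small sketch using a made-up six-element array. It shows NumPy inferring the missing dimension, and what happens when no whole-number size fits.

```python
import numpy as np

arr = np.arange(6)  # hypothetical example array: 0, 1, ..., 5

# NumPy fills in the dimension marked -1 from the total number of elements
print(arr.reshape(-1, 2).shape)   # (3, 2)
print(arr.reshape(3, -1).shape)   # (3, 2)
print(arr.reshape(-1).shape)      # (6,)

# 6 elements cannot be arranged into rows of 4, so this fails
try:
    arr.reshape(-1, 4)
except ValueError as err:
    print("cannot reshape:", err)
```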

**method 3: use the `flatten()` method**. Similar to `.reshape()`, `.flatten()` will not change the original array.

In [None]:
# remember to assign the conversion back to arr
arr = arr.flatten()

In [None]:
arr

---
### *Exercise*

> Go over the above examples about 1D-2D conversion


In [None]:
yee = 1

## Array indexing

Arrays are indexed in a similar way to sequences (like list).

### 1-dimensional array

![array_index](img/array_index.jpeg)

In [None]:
arr = np.array([2,  3,  0,  1,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

In [None]:
arr

In [None]:
arr.size

In [None]:
arr.shape

**Similar to string slicing or list slicing, we can get a subset of the data in that particular dimension of the array by giving a range of the index to extract**. This is done by using the following format

    start:stop:step

1. `s[start:stop:step]`: This extracts the elements from the start index up to but not including the stop index, with a step size of step.

1. `s[start:stop]`:  This extracts the elements from the start index up to but not including the stop index, with a default step size of 1.

1. `s[start:]`: This extracts the elements from the start index to the last one, with a default step size of 1.

1. `s[:stop]`: This extracts the elements from the beginning up to but not including the stop index, with a default step size of 1.

1. `s[::step]`: This extracts the elements from the beginning up to the last one (or from the last one back to the first, if step is negative), with a step size of step.

1. `s[:]` or `s[::]`: This extracts all elements.
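The slicing forms listed above can be tried on a small array. This is a minimal sketch; the array `s` is a made-up example holding the integers 0 through 9, so each value equals its own index.

```python
import numpy as np

s = np.arange(10)   # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

print(s[2:8:3])     # start=2, stop=8, step=3 -> [2 5]
print(s[2:5])       # default step of 1       -> [2 3 4]
print(s[7:])        # from index 7 to the end -> [7 8 9]
print(s[:3])        # from the start up to 3  -> [0 1 2]
print(s[::4])       # every 4th element       -> [0 4 8]
print(s[::-1])      # negative step: the whole array reversed
print(s[:])         # all elements
```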

In [None]:
arr[2]

In [None]:
arr[1:10:2]

### 2-dimensional array

![array_index](img/array_index_2d.jpeg)

In [None]:
arr

In [None]:
arr.size

In [None]:
# convert it to 2D array
arr.shape = (5, 5)
arr

In [None]:
arr[3, 4] # an individual element

In [None]:
# A common use case is to get a single row or column from a 2D array.

arr[:, 4]   # the 5th column with all rows

In [None]:
arr[2, :]   # the 3rd row with all columns

In [None]:
arr[2]     # Trailing colons do not need to be explicitly typed. This is equivalent to the last example.

In [None]:
arr[1::2, :] # get the second row and beyond with a step of 2

---
### *Exercise*

> Use array `arr` and grab out every other row and the 4th column and beyond


In [None]:
arr[0::2, 3:]   # every other row, and the 4th column and beyond

## Array methods

Arrays have a number of methods. Let's take a look at the `sum()` method as an example. 

In [None]:
arr = np.array([[0, 1, 2], [3, 4, 5]])  # reset the array to our 2x3 array.

In [None]:
arr

In [None]:
arr.sum()        # The sum of all of the elements in the array

**Note:** 

* `sum()` takes the optional argument `axis`, which can be used to take the sum along a single axis of the array. Just like with indexing, the axes are referenced in a **zero-based** system.
* In the 2D case, `axis=0` means the first dimension, which is the row direction, pointing downwards; `axis=1` means the second dimension, which is the column direction, pointing to the right.

![array_axis](img/array_axis.png)

<div class="alert alert-info">When you run arr.sum(axis = 0 or axis = 1) (here arr is a 2D array), it will produce a 1D array. For axis = 0, you can think the operation collapses the row axis. For axis = 1, you can think the operation collapses the column axis. The following figure shows what happens.</div>




Here is a good [tutorial](https://www.sharpsightlabs.com/blog/numpy-axes-explained/).


![numpy_axis](img/numpy_axis.png)

In [None]:
arr

In [None]:
arr.sum(axis=0)  # takes the sum in the 'row' direction, resulting in a 1D array that is the sum across the rows

In [None]:
arr.sum(axis=1)  # takes the sum in the 'column' direction, resulting in a 1D array that is the sum across the columns

---
### *Exercise*

> Use the `mean()` method to get the mean of the numbers in each column of `arr`. The result should be a 1D array with three elements.


In [None]:
arr.mean(axis=0)   # collapse the row axis to get one mean per column

You can find the minimum and maximum of an array with the `min()` and `max()` methods.

In [None]:
arr = np.array([[0, 1, 2], [3, 4, 5]]) 
arr

In [None]:
arr.min()

In [None]:
arr.max()

In [None]:
# find the min value along each column
arr.min(axis=0)

In [None]:
# find the max value along each row
arr.max(axis=1)

Sometimes it is useful to find the indices of these minima and maxima. For this use `argmin` and `argmax`:

**Note:** By default, `argmin` and `argmax` operate on the flattened array (i.e., the index returned is the position of the minimum or maximum in the flattened 1D array).

In [None]:
arr2 = np.array([[2,3,5], [1,8,4]])
arr2

In [None]:
arr2.argmin()

In [None]:
arr2.argmax()

**Use `flatten()` to confirm that these indices are correct:**

In [None]:
arr2.flatten()

In [None]:
arr2.flatten()[arr2.argmin()]

In [None]:
arr2.flatten()[arr2.argmax()]
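If you need the row and column of the minimum rather than its position in the flattened array, one option is `np.unravel_index`, which converts a flat index into a tuple of per-dimension indices. This is a brief sketch using the same values as `arr2`; the names `row` and `col` are only for illustration.

```python
import numpy as np

arr2 = np.array([[2, 3, 5], [1, 8, 4]])

flat_index = arr2.argmin()                       # position in the flattened array
row, col = np.unravel_index(flat_index, arr2.shape)

print(int(flat_index))       # 3
print(int(row), int(col))    # 1 0
print(arr2[row, col])        # 1, the minimum value
```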

#### Find the index of the min or max value along a single axis

In [None]:
arr2

In [None]:
# min index in each column
arr2.argmin(axis=0)

In [None]:
# max index in each row
arr2.argmax(axis=1)

Sort the array using the `sort()` method

In [None]:
arr = np.array([2,1,4,3,5])
arr

**Method 1: use `sort()` method (sort the array in place)**

In [None]:
arr.sort() # sort occurs inplace (arr itself will be changed after using the sort method)

In [None]:
arr

**Method 2: use the `np.sort()` function (sort and create a new array)**

In [None]:
arr = np.array([2,1,4,3,5])
arr

In [None]:
np.sort(arr)

In [None]:
arr

In [None]:
arr = np.sort(arr)
arr

---
### *Exercise*

> Sort the arr in descending order


In [None]:
np.sort(arr)[::-1]

## Creating standard arrays

There are a few standard arrays, for example, arrays filled with zeros or ones (or empty). Here are some examples of creating arrays.

Create an array filled with ones:

In [None]:
o = np.ones(6)
o

In [None]:
# create two-dimensional array with all 1
# The argument should be a tuple or a list with the length of each dimension as an argument
o = np.ones((3, 4)) # note that we have two round brackets on each side!!!
o

In [None]:
# np.ones(3, 4) # this raises an error: 3 and 4 must be passed together as one tuple or list, otherwise 4 is taken as the dtype

In [None]:
# check default array type
o.dtype

Create an array filled with zeros:

In [None]:
z = np.zeros((2, 3))
z

In [None]:
# check default array type
z.dtype

In [None]:
# explicitly denote the variable type
z = np.zeros((2, 3), dtype="int8")
z

In [None]:
z.dtype

You can also create these arrays with the same shape and datatype of the input array using `np.ones_like` and `np.zeros_like`.

In [None]:
z

In [None]:
zo = np.ones_like(z)
zo

In [None]:
zo = np.zeros_like(z)
zo

You can also create a diagonal array with a given vector along the diagonal.

In [None]:
np.diag(-2*np.ones(6))

**There are also a number of ways to generate sequences of numbers.**

 - `np.arange([start,] stop[, step])` Create a sequence of numbers, similar to `range`. **Up-to-but-not-including stop.**
 - `np.linspace(start, stop, num)` Create a uniform series of `num` values between `start` and `stop`, **inclusive**.

In [None]:
np.arange(10)

In [None]:
np.linspace(0, 10, 11)

In [None]:
np.arange(1, 10, 2)

**You can create arrays of random numbers easily with functions in `np.random`.**

* `np.random.uniform(low=0.0, high=1.0, size=None)`: Return random samples from a uniform distribution of `size` (int or tuple of ints) in the interval [low, high). <br><br>

* `np.random.normal(loc=0.0, scale=1.0, size=None)`: Return random samples from a normal distribution (mean is loc and standard deviation is scale) of `size` (int or tuple of ints). <br><br>

* `np.random.randint(low, high=None, size=None)`: Return random integers from `low` (inclusive) to `high` (exclusive). If `high` is None then return integers from [0, `low`). `size` is an int or tuple of ints to give the output shape. <br><br>
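If you need the same random numbers on every run, one option is NumPy's `Generator` interface, created with `np.random.default_rng`. The sketch below mirrors the three functions above; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence every time it is created
rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

u = rng.uniform(low=3, high=10, size=5)    # floats in [3, 10)
n = rng.normal(loc=5, scale=2, size=5)     # mean 5, standard deviation 2
k = rng.integers(low=5, high=15, size=5)   # integers in [5, 15), high excluded

# A second generator with the same seed reproduces the first draw
rng_again = np.random.default_rng(seed=42)
u_again = rng_again.uniform(low=3, high=10, size=5)
print(np.array_equal(u, u_again))
```

Note that the generator's integer method is called `integers`, not `randint`.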


**np.random.uniform**

In [None]:
x = np.random.uniform(low=3, high=10, size=10000)
x

In [None]:
# we will learn how to plot soon
import matplotlib.pyplot as plt
%matplotlib inline

fig, ax = plt.subplots() 

ax.hist(x, color="red", edgecolor="white", bins=30);

**np.random.normal**

In [None]:
x = np.random.normal(loc=5, scale=2, size=10000)
x

In [None]:
fig, ax = plt.subplots()

ax.hist(x, color="red", edgecolor="white", bins=30);

**np.random.randint**

In [None]:
x = np.random.randint(low=5, high=15, size=10000)
x

In [None]:
fig, ax = plt.subplots()

ax.hist(x, color="red", edgecolor="white", bins=30);

---
### *Exercise*

> Create an array of random floats between 0 and 1 that has dimension 5 x 3. Calculate the standard deviation of each column of the array.

---

In [None]:
zeroto1 = np.random.uniform(low=0,high=1,size = (5,3))
zeroto1

In [None]:
zeroto1.std(axis = 0 )

In [None]:
np.std(zeroto1, axis = 0 )

## Combining arrays

Generally, arrays can be combined with the `np.concatenate` function. The arguments are a sequence of arrays to join, and the axis along which to join them (default=0).




In [None]:
x = np.ones((2,3))
y = x * 9

In [None]:
x

In [None]:
y

<table><tr>
<td> <img src="img/np_conc_0.png"/> </td>
<td> <img src="img/np_conc_1.png"/> </td>
</tr></table>

In [None]:
np.concatenate((x, y), axis=0)

In [None]:
np.concatenate((x, y), axis=1)

There are a number of convenience functions that act like concatenate for specific axes:

 - `np.vstack` – vertical stack (stack along axis=0)
 - `np.hstack` – horizontal stack (stack along axis=1)

In [None]:
np.vstack((x, y))

In [None]:
np.hstack((x, y))

**Note:**
* Everything that I’ve said about `concatenate` or `stack` applies to 2D arrays (as well as multi-dimensional arrays).

* The axes of 1D array work differently. For beginners, this is likely to cause issues (especially for `concatenate`)

* Note that 1D array only has one axis, and since axes start from 0, the axis of the 1D array will be 0 

In [None]:
x1d = np.array([0,0,0])
y1d = np.array([1,1,1])

In [None]:
x1d

In [None]:
y1d

In [None]:
np.concatenate((x1d, y1d), axis = 0)

In [None]:
# np.concatenate((x1d, y1d), axis = 1) # a 1D array has no axis 1, so this raises an error

If we want to stack the two 1D arrays vertically (and obtain a 2D array as a result), we can convert each 1D array to a 2D array and then concatenate them.

In [None]:
x2d = x1d.reshape(1, -1) # reshape to 2D
x2d

In [None]:
y2d = y1d.reshape(1, -1) # reshape to 2D
y2d

In [None]:
np.concatenate((x2d, y2d), axis = 0)

In [None]:
# we can concatenate using axis=1
np.concatenate((x2d, y2d), axis = 1)

**You can also use vstack and hstack directly on 1D array for vertical and horizontal stack**

In [None]:
x1d

In [None]:
y1d

In [None]:
np.vstack((x1d, y1d)) # output is 2D, nice!

In [None]:
np.hstack((x1d, y1d)) # output is 1D. nice? Maybe not. Your call!

In [None]:
# same as:
np.concatenate((x1d, y1d)) # output is 1D. nice? Maybe not. Your call!

## Splitting array

Likewise, arrays can be split with `np.split`. There are also convenience functions to split horizontally, vertically, and with depth.

In [None]:
x = np.random.randint(low=1, high=10, size=(12, 2))

In [None]:
x.shape

In [None]:
x

In [None]:
# split into four arrays (the outcome will be a list)
np.split(x, 4, axis=0)

In [None]:
# check the shape of each array in the list
[a.shape for a in np.split(x, 4, axis=0)]
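The convenience functions mentioned above can be sketched as follows. The array here is a made-up 12 x 2 example built with `np.arange` so the result is deterministic. `np.array_split` is included because `np.split` raises an error when the array cannot be divided evenly.

```python
import numpy as np

x = np.arange(24).reshape(12, 2)   # hypothetical 12 x 2 array

# vsplit splits along axis=0 (rows); hsplit splits along axis=1 (columns)
row_blocks = np.vsplit(x, 3)
col_blocks = np.hsplit(x, 2)
print([a.shape for a in row_blocks])   # [(4, 2), (4, 2), (4, 2)]
print([a.shape for a in col_blocks])   # [(12, 1), (12, 1)]

# 12 rows do not divide evenly into 5 parts; array_split allows unequal pieces
uneven = np.array_split(x, 5, axis=0)
print([a.shape for a in uneven])       # [(3, 2), (3, 2), (2, 2), (2, 2), (2, 2)]
```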

---
### *Exercise*

1. Create an array, A, of shape (40, 50). Fill the first 10 columns with 1, the second 10 columns with 2, ..., and so on up to the fifth 10 columns with 5.

1. Split it along axis=1 into a list containing the five resulting arrays.

1. What is the resulting shape of each array? *[Advanced: can you calculate this on one line?]*

1. Concatenate the first two arrays back together along axis 1.

In [None]:
# answer to question 1, approach A: fill the columns in a loop

A = np.ones((40,50))

for i in range(5):
    col_start = i*10
    col_end = col_start + 10 
    A[:, col_start:col_end] = i + 1
    
print(A.shape)    
print(A)

In [None]:
A_split = np.split(A, 5, axis =1)

In [None]:
# answer to question 1, approach B: stack five constant blocks side by side
xx = np.hstack([np.ones((40,10)) * (i+1) for i in range(5)])

In [None]:

[x.shape for x in A_split]


In [None]:
np.concatenate((A_split[0], A_split[1]), axis=1)  # the arrays must be passed together as one sequence

## Finding values

There are a number of ways to find values in an array.

In [None]:
x = np.random.normal(size=(5,5))
x

The simplest is always to create a **boolean array index**, like

In [None]:
ind = x > 0.5
ind

The boolean array can be used as an index to the original array or other arrays. **Note that using the boolean array as an index will return a 1D array, no matter what dimension the original arrays are, because there is no way to know what structure the `True` values have.**

In [None]:
# array itself
x[ind]

You can also use the boolean array index on another array of the same shape to filter its values.

In [None]:
# other array
x = np.random.normal(size=(5,5))
y = np.sin(x)

print(x)
print() # this is to add some space between the two prints
print(y)

In [None]:
idx = x > 0.5 # create boolean array index

print()
print(y[idx]) # filter y values based on the array index

In [None]:
print(y[x > 0.5])
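Boolean index arrays can also be combined before indexing. The sketch below uses a small made-up array and the element-wise operators `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import numpy as np

x = np.array([[-1.0, 0.2, 0.7],
              [1.5, 0.6, -0.3]])   # hypothetical example values

both = (x > 0.5) & (x < 1.0)     # True where 0.5 < x < 1.0
either = (x < 0.0) | (x > 1.0)   # True where x is negative or above 1.0

print(x[both])     # selects 0.7 and 0.6
print(x[either])   # selects -1.0, 1.5 and -0.3
print(x[~both])    # everything that `both` did not select
```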

To get the indices of the places where the conditional is true (i.e., the locations of the `True` values in the boolean array), use the `np.where` command. 

Note that, when called with only a condition, `np.where` always returns **a tuple of index arrays, one for each dimension**.

In [None]:
x = np.random.normal(size=(3,3))
x

In [None]:
idx = np.where(x > 0.5)
idx

In [None]:
# use the tuple as the index to filter the array and it will return a 1D array
x[idx]


Returning a tuple of indices for each dimension is a little strange for 1D arrays, but is done for consistency across all input values.

In [2]:
x = np.random.normal(size=10)

In [3]:
x

array([-1.53055986, -0.61285888,  1.10110529, -0.70312264, -0.83367902,
        1.10030764, -0.1907624 ,  0.61935245,  0.17727296,  1.14321342])

In [4]:
idx = np.where(x > 0.5)
idx # still a tuple

(array([2, 5, 7, 9]),)

In [5]:
x[idx] # use the array index to filter

array([1.10110529, 1.10030764, 0.61935245, 1.14321342])

You can also use these calculated indices, or boolean matrices for assignment.

In [6]:
x

array([-1.53055986, -0.61285888,  1.10110529, -0.70312264, -0.83367902,
        1.10030764, -0.1907624 ,  0.61935245,  0.17727296,  1.14321342])

In [7]:
x>0.8

array([False, False,  True, False, False,  True, False, False, False,
        True])

In [8]:
# assign 10 to all elements > 0.8
x[x>0.8] = 10

In [9]:
x

array([-1.53055986, -0.61285888, 10.        , -0.70312264, -0.83367902,
       10.        , -0.1907624 ,  0.61935245,  0.17727296, 10.        ])

In [10]:
# find the indices of all elements > 0.5
np.where(x>0.5)

(array([2, 5, 7, 9]),)

In [11]:
x[np.where(x>0.5)] = 10
x

array([-1.53055986, -0.61285888, 10.        , -0.70312264, -0.83367902,
       10.        , -0.1907624 , 10.        ,  0.17727296, 10.        ])
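`np.where` also has a three-argument form, `np.where(condition, a, b)`, which builds a new array by taking values from `a` where the condition is true and from `b` elsewhere. A minimal sketch with made-up values follows; unlike the assignments above, it leaves the original array untouched.

```python
import numpy as np

x = np.array([-1.5, -0.6, 1.1, 0.6, 0.2])   # hypothetical example values

# Take 10.0 where x > 0.5, and the original value everywhere else
capped = np.where(x > 0.5, 10.0, x)

print(capped)   # 10.0 replaces 1.1 and 0.6
print(x)        # the original array is unchanged
```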

### *Exercise*

> Create a 3x3 random array, with values between 0 and 1. Replace all of the numbers smaller than 0.5 with zero.

> Complete this without using `where`.

> Complete this using `where`.

In [15]:
d = np.random.uniform(low = 0, high = 1, size = (3,3))

In [17]:
new = d<0.5
new

array([[False,  True, False],
       [False,  True, False],
       [False,  True, False]])

In [19]:
new2 = np.where(d<0.5)
new2

(array([0, 1, 2]), array([1, 1, 1]))

In [21]:
d[np.where(d<0.5)] = 0


## Array views

The data for an array may be stored in memory in `C` (row-major) or `FORTRAN` (column-major) order. Typically, there is no need to think about this; some details can be found [here](http://docs.scipy.org/doc/numpy-1.10.0/reference/internals.html).

**However, it is important to remember that subsets of an array can produce a different 'view' of the array that addresses the same memory as the original array**. This can lead to some unexpected behaviors. One way to think of this is that assignment in Python is more like a C-pointer (i.e., a reference to a memory location) than an actual value.

In [None]:
a = np.arange(10)
a

In [None]:
b = a[:5]
b

In [None]:
b[1] = 100 # change the b value

In [None]:
b

In [None]:
a # a is also changed!

In [None]:
a[4] = -999   # change the a value
a

In [None]:
b # b is also changed

Normally, this will not be a problem, but if you need to make sure that a subset of an array has its own memory, make a `copy` of the array, like

In [None]:
a = np.arange(10.0)
a

In [None]:
b = a[:5].copy()
b

In [None]:
b[2] = -999 # change b value

In [None]:
b

In [None]:
a # a value is not changed
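When it is unclear whether two arrays share data, you can check directly. The sketch below uses `np.shares_memory` and the `base` attribute; the names `view` and `separate` are only for illustration.

```python
import numpy as np

a = np.arange(10)
view = a[:5]              # a slice is a view onto a's data
separate = a[:5].copy()   # a copy owns its own data

print(np.shares_memory(a, view))       # True
print(np.shares_memory(a, separate))   # False

# `base` points at the array whose memory a view borrows
print(view.base is a)          # True
print(separate.base is None)   # True
```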

## Array broadcasting

(Largely taken from [SciPy docs](https://docs.scipy.org/doc/numpy-1.10.0/user/basics.broadcasting.html))

Generally, arrays should have the same shape to be **combined element-wise**:

In [None]:
a = np.random.uniform(low=0, high=5, size=(3,2))

b = np.random.uniform(low=0, high=5, size=(3,2))

In [None]:
a

In [None]:
b

In [None]:
a + b

In [None]:
a - b

In [None]:
a * b

In [None]:
a / b

The term broadcasting describes how `numpy` treats arrays with **different shapes** during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

For example, the simplest broadcasting example occurs when an array and a scalar value are combined in an operation:

In [None]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b

In [None]:
a

In [None]:
b

In [None]:
# the above is equivalent to:
a * np.array([2.0, 2.0, 2.0])

In the `a * b` example, we can think of the scalar `b` being stretched during the arithmetic operation into an array with the same shape as `a`. The new elements in `b` are simply copies of the original scalar. The stretching analogy is only conceptual. **NumPy is smart enough to use the original scalar value without actually making copies, so that broadcasting operations are as memory and computationally efficient as possible.**

### General Broadcasting Rules

When operating on two arrays, NumPy compares their shapes element-wise. It **starts with the trailing dimensions, and works its way to the front**. Two dimensions are compatible when

1. they are equal, or
1. one of them is 1

Here are some more examples:

    a      (2d array):  2 x 3
    b      (1d array):      3
    Result (2d array):  2 x 3

    a      (2d array):  2 x 3
    b      (1d array):      1
    Result (2d array):  2 x 3
    
    
**The size of the resulting array is the maximum size along each dimension of the input**
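The rule above can be written out as a short function. This is a sketch for illustration only: `broadcast_shape` is a hypothetical helper, not part of NumPy, and the last line compares its answer with the shape NumPy itself reports through `np.broadcast`.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rule to two shape tuples."""
    # Pad the shorter shape with leading 1s so both have the same length
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)   # sizes match, or b is stretched to da
        elif da == 1:
            result.append(db)   # a is stretched to db
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not compatible")
    return tuple(result)

print(broadcast_shape((2, 3), (3,)))   # (2, 3)
print(broadcast_shape((2, 3), (1,)))   # (2, 3)

# Compare with what NumPy computes for arrays of those shapes
print(np.broadcast(np.empty((2, 3)), np.empty(3)).shape)   # (2, 3)
```

Calling `broadcast_shape((2, 3), (2,))` raises a `ValueError`, because the trailing dimensions 3 and 2 differ and neither is 1.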

*1D example*

In [None]:
a = np.array([[1,2,3], [2,4,6]])
a

In [None]:
a.shape

In [None]:
b = np.array([1,2,3])
b

In [None]:
b.shape

In [None]:
a * b

the above is equal to:

In [None]:
a * np.vstack([b, b])

*2D example*

In [None]:
b_2d = b.reshape(1,3)

In [None]:
b_2d

In [None]:
b_2d.shape

In [None]:
a

In [None]:
a.shape

In [None]:
a * b_2d

the above is equal to:

In [None]:
a * np.vstack([b_2d, b_2d])

Notice that the rules for broadcasting are based on the location of singleton dimensions. Singleton dimensions are implied to the front (to the left), not to the end (to the right). So, the first example here works but not the second:

    a      (2d array):  2 x 3
    b      (1d array):      3
    Result (2d array):  2 x 3

    a      (2d array):  2 x 3
    b      (1d array):      2
    Result error!

In [None]:
b = np.array([1,2])
b

In [None]:
a.shape

In [None]:
b.shape

In [None]:
# a * b # this will not work because this doesn't obey the broadcasting rule!

If you really want to multiply `a` with `b`, this problem can be fixed by creating new singleton dimensions in arrays. This can be done by putting `np.newaxis` in the appropriate space when indexing the array. For example:

In [None]:
b_new = b[:,np.newaxis]

In [None]:
b_new

In [None]:
b_new.shape

In [None]:
a.shape

In [None]:
a

In [None]:
b_new

In [None]:
a * b_new

the above is equal to:

In [None]:
a * np.hstack([b_new, b_new, b_new])


### *Exercise*

Can you make a * b work without using `newaxis`?

## Transposing arrays with  `T`

In [None]:
a = np.random.randint(low=1, high=10, size=(2,3))
a

In [None]:
# transpose
b = a.T
b

## Importing data

One of the basic commands in `numpy` for loading in data is the `loadtxt` command.

Let's read in the `CTD.txt` file

![CTD](img/CTD_screenshot.png)

In [None]:
# note that comments = '*' indicates we ignore those lines starting with *
data = np.loadtxt("data/CTD.txt", comments='*')
data

In [None]:
data.shape

In [None]:
data[:,2]    # a column of data representing temperature
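To see how the `comments` argument works without needing the data file, the sketch below passes `np.loadtxt` an in-memory text buffer instead of a path. The contents are invented for illustration; the real `CTD.txt` has different columns and values.

```python
import io
import numpy as np

# Hypothetical file contents; lines starting with * are treated as comments
text = """* header line, skipped by loadtxt
* pressure  temperature
1.0  20.5
2.0  20.1
3.0  19.8
"""

sample = np.loadtxt(io.StringIO(text), comments="*")

print(sample.shape)   # (3, 2): three data rows, two columns
print(sample[:, 1])   # the second column: 20.5, 20.1, 19.8
```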

---
### *Exercise*

> Read in the oceanographic data file 'data/CTD.txt' into an array. You can look at the data file itself to see what variables are stored in each column.

> Using this data, write a function to calculate the linear equation of state. This is an approximation of the density of water, as it depends on salinity, temperature, and some empirical constants. We will use the following form for the linear equation of state:

> $\rho = 1027[1+7.6\times 10^{-4}(S-35) -1.7\times 10^{-4}(T-25)]$

> where $\rho$ is the density, $S$ is the salinity, and $T$ is the temperature.

> This is a more free form than the homework, so you should set up all of the associated code to call the function, and write out the function yourself. Don't forget docstrings! Use your salinity and temperature contained in CTD.txt to calculate density. For a check, the first value of your density array in order should equal 1021.75199816 and the last should equal 1028.04713536.

---

## Polynomial fitting

The basic function for fitting a polynomial (e.g., a straight line) is `np.polyfit(x, y, deg)`. There are a number of other functions that let you find zeros (`np.roots`), and do other operations to polynomials.

In [None]:
x = np.random.uniform(size=100)
y = 5 + 3*x + 0.1*np.random.normal(size=(100))   # A straight line with some noise

In [None]:
# we will learn how to plot soon
import matplotlib.pyplot as plt
%matplotlib inline

# plot data
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(x, y, ".")

In [None]:
# let's fit the data
p = np.polyfit(x, y, 1)  # fit a straight line (order is 1)
print(p)  # The coefficients of the polynomial, with highest order first (i.e., [slope, intercept])

In [None]:
p[0] # this is the slope

In [None]:
p[1] # this is the intercept

In [None]:
p

Let's plot it to make sure this makes sense:

In [None]:
# plot data
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(x, y, ".")

# plot fitted line
x_new = np.linspace(0, 1, 10)
y_fit = p[0]*x_new + p[1]
ax.plot(x_new, y_fit)

ax.legend(('Data', 'Fitted line'))

Once you have the fit, you can use it to find other useful things, like the value of the fitted line at $x=1$:

In [None]:
# plot data
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(x, y, ".")

# plot fitted line
x_new = np.linspace(-2, 2, 10)
y_fit = p[0]*x_new + p[1]
ax.plot(x_new, y_fit)

ax.axvline(1)

ax.legend(('Data', 'Fitted line'))

In [None]:
np.polyval(p, 1)

In [None]:
# plot data
fig, ax = plt.subplots(figsize=(6,6))
ax.plot(x, y, ".")

# plot fitted line
x_new = np.linspace(-2, 2, 10)
y_fit = p[0]*x_new + p[1]
ax.plot(x_new, y_fit)

ax.axhline(0)

ax.legend(('Data', 'Fitted line'))

In [None]:
np.roots(p)
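As a sanity check on what `np.polyval` and `np.roots` return for a straight line, the sketch below compares them against the formulas you would write by hand. It builds its own noisy line with a seeded generator so that it runs on its own; the names `x_demo`, `y_demo` and `p_demo` are chosen just for this example.

```python
import numpy as np

# Rebuild a noisy straight line (seeded so the example is repeatable).
rng = np.random.default_rng(0)
x_demo = rng.uniform(size=100)
y_demo = 5 + 3 * x_demo + 0.1 * rng.normal(size=100)

p_demo = np.polyfit(x_demo, y_demo, 1)
slope, intercept = p_demo

# np.polyval(p, x) evaluates p[0]*x + p[1] for a degree-1 polynomial.
value_at_1 = np.polyval(p_demo, 1)
print(np.isclose(value_at_1, slope * 1 + intercept))

# The only root of a straight line is where slope*x + intercept = 0.
root = np.roots(p_demo)[0]
print(np.isclose(root, -intercept / slope))

# Evaluating the fitted line at its root should give (nearly) zero.
print(np.isclose(np.polyval(p_demo, root), 0.0))
```

All three checks should print `True`, which is the same information the `axvline` and `axhline` plots show graphically.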

## Vectorization

Array broadcasting and vectorization are two big reasons that `numpy` can be efficient and fast. With these tools, you can avoid writing explicit Python `for` loops, which are slow.

The best way to do mathematical operations on `numpy` arrays is to use vectorized operations. That is, mathematical operations are defined element by element and run in compiled code, which is much faster than looping in Python.

**As a rule of thumb, you should be very concerned if your code has more than one significant `for` loop in the numerical analysis section.**

Here is a way to multiply two big arrays using `for` loops, which is not how you should do it. The sum at the end is included for comparison with the subsequent approach.

In [None]:
a = np.arange(102400.0).reshape(4, 8, 1600, 2)   # a 4D array using sequential numbers
b = np.random.uniform(size=(4, 8, 1600, 2))      # a 4D array using random numbers

In [None]:
a.shape

In [None]:
b.shape

In [None]:
li, lj, lk, lm = b.shape  # size of b in each dimension
sol = np.zeros(b.shape)
for i in range(li):
    for j in range(lj):
        for k in range(lk):
            for m in range(lm):
                sol[i,j,k,m] = a[i,j,k,m]*b[i,j,k,m]
print(sol.sum())

The better way is to directly multiply the arrays together, taking advantage of the compiled C code that NumPy runs in the background.

In [None]:
# element-by-element multiplication. This operation is about as fast as it can be on your computer.
sol = a * b       
print(sol.sum())
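The multiplication above works on two arrays of the same shape. Broadcasting, mentioned at the start of this section, extends the same idea to arrays of different but compatible shapes. Here is a small sketch that subtracts the mean of each column from a 2D array, first with loops and then with broadcasting; the array and the variable names are made up for this example.

```python
import numpy as np

data_2d = np.arange(12.0).reshape(3, 4)   # 3 rows, 4 columns
col_means = data_2d.mean(axis=0)          # one mean per column, shape (4,)

# Loop version: subtract the column mean from every element.
centered_loop = np.zeros(data_2d.shape)
for i in range(data_2d.shape[0]):
    for j in range(data_2d.shape[1]):
        centered_loop[i, j] = data_2d[i, j] - col_means[j]

# Broadcasting version: the (4,) array is stretched across all 3 rows.
centered = data_2d - col_means

print(centered.shape)
print(np.allclose(centered, centered_loop))
```

The two results match, but the broadcasting version is one line and does its looping in compiled code.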

## Basic performance evaluation

We can do some very basic performance testing using the `%%time` (whole cell) or `%time` (single line of code) magic commands in Jupyter notebooks.

In [None]:
%%time

li, lj, lk, lm = b.shape  # size of b in each dimension
sol = np.zeros(b.shape)
for i in range(li):
    for j in range(lj):
        for k in range(lk):
            for m in range(lm):
                sol[i,j,k,m] = a[i,j,k,m]*b[i,j,k,m]
print(sol.sum())

In [None]:
%%time

sol = a * b       
print(sol.sum())

In [None]:
%time b = np.random.normal(size=(50, 20))

To time a block of code that is longer than a single line, you can also call the `time.time` function before and after the block and take the difference.

In [None]:
import time

t_start = time.time()
time.sleep(0.25)   # Do nothing for 0.25 seconds
t_stop = time.time()

print('{:6.4f} seconds have passed.'.format(t_stop-t_start))

In [None]:
import time

t_start = time.time()

# draw a 50x20 array of normal random numbers seven times
for _ in range(7):
    b = np.random.normal(size=(50, 20))


t_stop = time.time()

print('{:6.4f} seconds have passed.'.format(t_stop-t_start))
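A single timing of a very fast statement can be noisy. One possible alternative, sketched here, is the standard library's `time.perf_counter` (a clock intended for measuring short durations) together with `timeit.repeat`, which runs a statement many times. The repeat counts below are arbitrary choices for illustration, and the numbers you get will depend on your computer.

```python
import time
import timeit
import numpy as np

# Time one run with a high-resolution clock.
t_start = time.perf_counter()
b_demo = np.random.normal(size=(50, 20))
elapsed = time.perf_counter() - t_start
print('{:8.6f} seconds for a single run'.format(elapsed))

# Run the statement 1000 times, and repeat that measurement 5 times.
# timeit.repeat returns the total time in seconds for each of the 5 repeats.
n_calls = 1000
totals = timeit.repeat(lambda: np.random.normal(size=(50, 20)),
                       number=n_calls, repeat=5)

# The fastest repeat is usually the least disturbed by other activity.
best_per_call = min(totals) / n_calls
print('{:8.6f} seconds per call (best of 5 repeats)'.format(best_per_call))
```

In a notebook, the `%timeit` magic does a similar repeated measurement for you.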

## Linear algebra

Matrix multiplication is done using the `np.dot` function. In this case, the matrices do _not_ need to be the same shape, but their shapes must follow the rules of matrix multiplication. E.g., `np.dot(A, B)` with a 2x3 array `A` and a 3x4 array `B` results in a 2x4 array; i.e., the inner dimensions must match (technically the last dimension of the first array and the second-to-last dimension of the second, for arrays with more than two dimensions).

![matrix_mult](img/matrix_mult.png)

In [None]:
x = np.random.uniform(size=(2, 3))
y = np.random.uniform(size=(3, 4))

print(x.shape)
print()
print(y.shape)

In [None]:
res = np.dot(x, y)
print()
print(res)
print(res.shape)

# np.dot(y, x)  # This gives an error -- order is important.
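For 2D arrays, the `@` operator gives the same result as `np.dot`, and many people find it easier to read. The sketch below uses small arrays with known values (the names `mat_a` and `mat_b` are made up for this example) and also shows what happens when the inner dimensions do not match.

```python
import numpy as np

mat_a = np.arange(6.0).reshape(2, 3)    # shape (2, 3)
mat_b = np.arange(12.0).reshape(3, 4)   # shape (3, 4)

res_dot = np.dot(mat_a, mat_b)
res_at = mat_a @ mat_b                  # same matrix product for 2D arrays

print(res_at.shape)
print(np.array_equal(res_dot, res_at))

# Swapping the order gives (3, 4) times (2, 3): inner dimensions 4 and 2 differ.
try:
    np.dot(mat_b, mat_a)
except ValueError as err:
    print('ValueError:', err)
```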


One of the key elements of the `numpy` package is the `numpy.linalg` subpackage that contains a number of linear algebra functions that work efficiently on arrays.

In [None]:
a = np.random.normal(size=(3, 3))

# get the eigenvalues and eigenvectors (each column of eigen_vector is one eigenvector)
eigen_value, eigen_vector = np.linalg.eig(a)

In [None]:
eigen_value

In [None]:
eigen_vector
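One way to convince yourself that the output of `np.linalg.eig` makes sense is to check the defining relation $A v = \lambda v$ for each eigenvalue and eigenvector pair. This is a small sketch using its own seeded random matrix (the names `mat`, `vals` and `vecs` are made up for this example). Note that a general real matrix can have complex eigenvalues, and the check works in that case too.

```python
import numpy as np

rng = np.random.default_rng(0)
mat = rng.normal(size=(3, 3))

vals, vecs = np.linalg.eig(mat)

# Check one pair: column 0 of vecs goes with vals[0].
v0 = vecs[:, 0]
print(np.allclose(mat @ v0, vals[0] * v0))

# Check all pairs at once: broadcasting scales column i of vecs by vals[i].
print(np.allclose(mat @ vecs, vecs * vals))
```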