# Sanity Check

Let's first check a couple of packages we need in this class. Anaconda should come with all of the packages below (except for PyTorch) preinstalled.

In [1]:
import torch
import scipy
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt



Now let's check your Python version. Both 3.7 and 3.6 should work fine. If you have multiple Python environments installed, please pay attention to which one you are using. We highly recommend using Anaconda to manage your packages.

In [2]:
!python --version

Python 3.6.9


# Orientation to Google Colab

Please note that Google Colab only has temporary storage. Once the current session disconnects (or times out), all the files in the session storage will be discarded. However, the output in the notebook will be saved unless you specify otherwise. You can download and upload files with the GUI in the left panel. If you would like permanent storage (for datasets and results, say), you can mount your Google Drive in Colab by running the cell below. You will be prompted to link your account to Colab. **The code in this section only works on Colab.**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Run the cell below and you should see the new file appear in your Google Drive.

In [4]:
with open('/content/drive/My Drive/Colab Notebooks/boo.txt', 'w') as f:
  f.write('foo')

Don't forget to unmount at the end of your session.

In [5]:
drive.flush_and_unmount()

# Python Basics

Before diving into Python libraries, let's first take a look at Python's basic programming constructs.

## Basic Syntax

### Keywords

In [6]:
import keyword
keyword.kwlist

['False',
 'None',
 'True',
 'and',
 'as',
 'assert',
 'break',
 'class',
 'continue',
 'def',
 'del',
 'elif',
 'else',
 'except',
 'finally',
 'for',
 'from',
 'global',
 'if',
 'import',
 'in',
 'is',
 'lambda',
 'nonlocal',
 'not',
 'or',
 'pass',
 'raise',
 'return',
 'try',
 'while',
 'with',
 'yield']

### Identifiers


*   Identifiers in Python are case-sensitive.
*   The first character of any Python identifier can't be a digit. The other characters can be letters, digits, or underscores. Non-ASCII letters are also supported in Python 3.
*   Characters such as `!`, `@`, `#`, `$`, and `%` can't appear in identifiers.
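The rules above can be checked from Python itself. The snippet below is a small sketch: `is_usable_identifier` is a helper name made up for this example, built on the standard `str.isidentifier()` method and `keyword.iskeyword()`.

```python
import keyword

def is_usable_identifier(name):
    """Return True if `name` is a valid identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)

# A few candidates: valid names, a leading digit, a special symbol,
# a non-ASCII letter, and a keyword.
for candidate in ["course", "Course", "_tmp1", "1st_place", "grade$", "π", "class"]:
    print(f"{candidate!r:>12} -> {is_usable_identifier(candidate)}")
```

Note that `"class".isidentifier()` is `True` on its own, which is why the helper also rules out keywords.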





### Comments

In [7]:
# boo
'''
boo
foo
'''

"""
boo
foo
"""
print("Welcome to CS 361!")

Welcome to CS 361!


### Indentation
Python doesn't use `{}` for code blocks and is indentation-sensitive. This is America, you can use whatever number of tabs/spaces you want for each code block, but the indentation level has to be the same within the same code block.

In [8]:
if True:
  print("True")
else:
  print("Truth always rests with the minority")

True


In [9]:
if True:
  print("True")
else:
  pass
print("False")

True
False
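To see the indentation rule in action without crashing the notebook, the sketch below feeds two small source strings to the built-in `compile()`. The helper name `compiles` and the two source strings are made up for this illustration.

```python
# Different blocks may use different indentation widths (2 spaces, then 8 spaces).
consistent = "if True:\n  x = 1\nelse:\n        x = 2\n"
# Here the last line dedents to a level that matches no enclosing block.
inconsistent = "if True:\n    x = 1\n  y = 2\n"

def compiles(source):
    """Return True if the Python parser accepts `source`."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError as err:
        print("IndentationError:", err.msg)
        return False

print(compiles(consistent))    # True
print(compiles(inconsistent))  # False
```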


### Multi-line Expression

In [10]:
lets_make_this_line_super_super_super_super_super_super_super_super_super_long = 233
attempting_to_write_some_code_that_my_coworker_cant_maintain_so_that_i_could_keep_my_job = 666

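Outside of parentheses, brackets, and braces, a stand-alone multi-line expression needs a backslash at the end of each continued line. Inside them, no backslash is needed.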

In [11]:
lets_make_this_line_super_super_super_super_super_super_super_super_super_long + \
attempting_to_write_some_code_that_my_coworker_cant_maintain_so_that_i_could_keep_my_job

899

In [12]:
print(  
  lets_make_this_line_super_super_super_super_super_super_super_super_super_long +
  attempting_to_write_some_code_that_my_coworker_cant_maintain_so_that_i_could_keep_my_job   
)

899


In [13]:
[lets_make_this_line_super_super_super_super_super_super_super_super_super_long,
attempting_to_write_some_code_that_my_coworker_cant_maintain_so_that_i_could_keep_my_job]

[233, 666]

In [14]:
{lets_make_this_line_super_super_super_super_super_super_super_super_super_long,
attempting_to_write_some_code_that_my_coworker_cant_maintain_so_that_i_could_keep_my_job}

{233, 666}

## Data Types

### Built-in Numerical Data Types

* Integer

In [15]:
type(233)

int

* Float

In [16]:
type(233.666)

float

* Complex

In [17]:
type(233 + 666j)

complex

* Boolean: Be careful with compound boolean expressions!

In [18]:
type(True)

bool

In [19]:
True and False

False

In [20]:
True or False

True

In [21]:
not True

False
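Here is a short sketch of why compound boolean expressions deserve care: `and` binds tighter than `or`, and both operators short-circuit and return one of their operands. The variable names are only for illustration.

```python
# `and` binds tighter than `or`: this parses as True or (False and False).
without_parens = True or False and False
with_parens = (True or False) and False
print(without_parens, with_parens)   # True False

# `or` returns the first truthy operand (or the last operand);
# `and` returns the first falsy operand (or the last operand).
fallback = 0 or "default"
short_circuit = [] and "never evaluated"
print(fallback, short_circuit)       # default []

# Comparisons can be chained: 1 < 2 < 3 means (1 < 2) and (2 < 3).
chained = 1 < 2 < 3
print(chained)                       # True
```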

*   Equality check

In [22]:
boo = [1, 2, 3]
foo = [1, 2, 3]
boo is foo # This is similar to boo == foo in Java

False

In [23]:
boo == foo # This is similar to boo.equals(foo) in Java

True

Note that Python doesn't have a `===` operator like the ones in Kotlin and JavaScript/TypeScript.
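The difference between identity (`is`) and equality (`==`) matters most when two names refer to the same mutable object. A minimal sketch, with illustrative variable names:

```python
original = [1, 2, 3]
alias = original        # a second name for the same list object
clone = list(original)  # a new list with equal contents

print(alias is original, alias == original)  # True True
print(clone is original, clone == original)  # False True

alias.append(4)         # mutating through the alias also changes `original`
print(original)         # [1, 2, 3, 4]
print(clone)            # [1, 2, 3]

# `is` is the idiomatic way to compare against the singleton None.
result = None
print(result is None)   # True
```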

### String

*   `''` and `""` work the same for declaring string literals. If you use `""` to declare a string, you don't have to escape a `'` that appears inside the string literal.

In [24]:
"What's up?"

"What's up?"

In [25]:
'What\'s up?'

"What's up?"

*   Use `"""` or `'''` for multi-line strings

In [26]:
print(
  '''
  line 0
  line1
  '''
)


  line 0
  line1
  


*   String concatenation

In [27]:
"Welcome ""to ""CS 361"

'Welcome to CS 361'

In [28]:
"Welcome " + "to " + "CS 361"

'Welcome to CS 361'

*   String indexing: Python strings can be indexed from left to right starting with 0 and from right to left starting with -1

In [29]:
boo = "Welcome to CS 361!"

In [30]:
boo[0] # left to right

'W'

In [31]:
boo[-1] # right to left

'!'

*   String slicing [start:end:step]

In [32]:
boo[0:-1]

'Welcome to CS 361'

In [33]:
boo[0:7]

'Welcome'

In [34]:
boo[::2]

'Wloet S31'

In [35]:
boo[::-1]

'!163 SC ot emocleW'

In [36]:
boo[7::-1]

' emocleW'

In [37]:
boo[:7:-1]

'!163 SC ot'

In [38]:
boo[0:7:2]

'Wloe'

*   Byte-string: Please be aware of the existence of byte-strings. If any byte-strings end up in your dataset, please be careful when comparing strings for equality.

In [39]:
type(b'Hi!')

bytes

In [40]:
b'Hi!' == 'Hi!'

False
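One way to compare a byte-string with a regular string is to convert one side first. The sketch below assumes the bytes are UTF-8 encoded, which may not hold for your data.

```python
raw = b"caf\xc3\xa9"   # UTF-8 bytes, e.g. as read from a file opened in binary mode
text = "café"

direct = raw == text                        # bytes and str never compare equal
after_decode = raw.decode("utf-8") == text  # decode the bytes, then compare
after_encode = raw == text.encode("utf-8")  # or encode the str, then compare
print(direct, after_decode, after_encode)   # False True True
```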

*   String Interpolation: C Style

In [41]:
"This is %s" % "CS 361!"

'This is CS 361!'

In [42]:
"Planck's constant is %e m^2 kg/s" % (6.62607004e-34)

"Planck's constant is 6.626070e-34 m^2 kg/s"

In [43]:
"Format a number into scientific notation: %e" % 233333

'Format a number into scientific notation: 2.333330e+05'

In [44]:
from math import pi
"π is approximately %f" % (pi)

'π is approximately 3.141593'

In [45]:
"π is approximately %.2f" % (pi)

'π is approximately 3.14'

*   String Interpolation: f-string

In [46]:
course = "CS 361"
f"Welcome to {course}"

'Welcome to CS 361'

In [47]:
f"1 + 1 = {1 + 1}"

'1 + 1 = 2'

*   String Interpolation: format function

In [48]:
"π is approximately {:.2f}".format(pi)

'π is approximately 3.14'
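The format specifiers used with `format` (such as `.2f`, `e`, `<7`, and `>4`) also work inside f-strings, after a colon. A small sketch that reuses values from the cells above:

```python
from math import pi

planck = 6.62607004e-34
name, grade = "Alice", 90

as_fixed = f"π is approximately {pi:.2f}"
as_sci = f"Planck's constant is {planck:e} m^2 kg/s"
as_row = f"Name: {name:<7} | Grade: {grade:>4}"  # left-align name, right-align grade

print(as_fixed)  # π is approximately 3.14
print(as_sci)    # Planck's constant is 6.626070e-34 m^2 kg/s
print(as_row)    # Name: Alice   | Grade:   90
```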

Check out Python's [official documentation for strings](https://docs.python.org/3.8/library/string.html) for more details.

### List

*   List-indexing is very similar to string-indexing

In [49]:
boo = [0, 2, "four", 6, 8, "ten", 12, 14, 16, 18, 20]

In [50]:
boo[2]

'four'

In [51]:
boo[-1]

20

In [52]:
boo[-2]

18

In [53]:
boo[0:3]

[0, 2, 'four']

In [54]:
boo[::-2]

[20, 16, 12, 8, 'four', 0]

In [55]:
boo[-1:-6:-2]

[20, 16, 12]

*   Other List Tricks

In [56]:
"ten" in boo

True

In [57]:
boo.append("ha?")
boo

[0, 2, 'four', 6, 8, 'ten', 12, 14, 16, 18, 20, 'ha?']

In [58]:
boo.pop()

'ha?'

In [59]:
boo.remove("four")

In [60]:
len(boo)

10

Note that the two cells below repeat and concatenate lists. However, it's a different story in NumPy.

In [61]:
[1, 2, 3] * 2

[1, 2, 3, 1, 2, 3]

In [62]:
[1, 2, 3] + [4, 5, 6]

[1, 2, 3, 4, 5, 6]
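For comparison, here is a quick sketch of what the same operators do on NumPy arrays, where `*` and `+` work element-wise instead of repeating or concatenating. The variable names are illustrative.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list)

repeated = py_list * 2                 # list repetition
doubled = arr * 2                      # element-wise multiplication
concatenated = py_list + [4, 5, 6]     # list concatenation
summed = arr + np.array([4, 5, 6])     # element-wise addition

print(repeated)      # [1, 2, 3, 1, 2, 3]
print(doubled)       # [2 4 6]
print(concatenated)  # [1, 2, 3, 4, 5, 6]
print(summed)        # [5 7 9]
```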

In [63]:
foo = [[233, 666], [888, 999]]

In [64]:
foo[0][1]

666

*   List Comprehension: Saves you the effort of writing loops.

In [65]:
[x * 2 for x in boo]

[0, 4, 12, 16, 'tenten', 24, 28, 32, 36, 40]

In [66]:
[x ** 2 for x in boo if type(x) is int]

[0, 4, 36, 64, 144, 196, 256, 324, 400]

### Tuple

*   You can't modify a tuple once it's created.

In [67]:
foobar = ("boo", "foo")

*   [Destructuring assignment](https://blog.tecladocode.com/destructuring-in-python/)

In [68]:
(bar, baz) = foobar
bar

'boo'

In [69]:
baz

'foo'
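A short sketch of both tuple points above: assigning to a tuple element raises a `TypeError`, and destructuring also supports a starred target and swapping. The variable names are illustrative.

```python
pair = ("boo", "foo")
try:
    pair[0] = "bar"              # tuples are immutable
except TypeError as err:
    print("TypeError:", err)

# A starred target collects the remaining items into a list.
first, *rest = (1, 2, 3, 4)
print(first, rest)               # 1 [2, 3, 4]

# Destructuring makes swapping two variables a one-liner.
a, b = 1, 2
a, b = b, a
print(a, b)                      # 2 1
```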

### Set

*   Built-in functions

In [70]:
{1, 2, 3, "three", 4, 4, "five", "five"}

{1, 2, 3, 4, 'five', 'three'}

In [71]:
boo = {"a", "b", "b", "c"}
foo = {"a", "b", "b", "c"}

In [72]:
boo.add("d")
boo

{'a', 'b', 'c', 'd'}

In [73]:
"d" in boo

True

In [74]:
boo.union(foo)

{'a', 'b', 'c', 'd'}

In [75]:
boo.intersection(foo)

{'a', 'b', 'c'}

In [76]:
len(boo)

4

*   Set comprehension

In [77]:
{x * 2 for x in boo}

{'aa', 'bb', 'cc', 'dd'}

### Dictionary

In [78]:
grade_book = {"Alice": 90, "Bob": 93, "Charlie": 99, "Eve": 80}
grade_book

{'Alice': 90, 'Bob': 93, 'Charlie': 99, 'Eve': 80}

*  Loops

In [79]:
for name, grade in grade_book.items():
  print("Name: {:<7} | Grade: {:>4}".format(name, grade))

Name: Alice   | Grade:   90
Name: Bob     | Grade:   93
Name: Charlie | Grade:   99
Name: Eve     | Grade:   80


*   Dictionary Comprehension

In [80]:
{name: "A" if grade > 90 else "B" for name, grade in grade_book.items()}

{'Alice': 'B', 'Bob': 'A', 'Charlie': 'A', 'Eve': 'B'}

In [81]:
{x: x ** 2 for x in range(10) if x % 2 == 0}

{0: 0, 2: 4, 4: 16, 6: 36, 8: 64}

## Functions: First-Class Citizens in Python

*   Declare a function

In [82]:
def square(x):
  return x ** 2

In [83]:
square_lambda = lambda x: x**2

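* First-class citizen: A function is a value like any other, so it can be passed as an argument, returned from another function, or stored in a data structure.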

In [84]:
def apply(f, x):
  return f(x)

apply(square, 2)

4

In [85]:
{x: square(x) for x in range(10) if x % 2 == 0}

{0: 0, 2: 4, 4: 16, 6: 36, 8: 64}
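Two more places a function can show up, as a sketch: returned from another function and stored in a dictionary. `make_power` and `operations` are names made up for this example; `square` is repeated so the block runs on its own.

```python
def square(x):
    return x ** 2

def make_power(n):
    """Return a new function that raises its argument to the n-th power."""
    def power(x):
        return x ** n
    return power

cube = make_power(3)

# Functions stored as dictionary values and called in a comprehension.
operations = {"square": square, "cube": cube}
results = {name: f(2) for name, f in operations.items()}
print(results)                       # {'square': 4, 'cube': 8}

# Built-ins are functions too: here `len` is passed as the sort key.
by_length = sorted(["ccc", "a", "bb"], key=len)
print(by_length)                     # ['a', 'bb', 'ccc']
```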

*   Function Parameters

In [86]:
def greeting(name, loudness=0):
  return "HELLO, {}!!!".format(name.capitalize() * loudness) if loudness else "Hello, {}!".format(name)

In [87]:
greeting("Chenhui", 0)

'Hello, Chenhui!'

In [88]:
greeting("Chenhui")

'Hello, Chenhui!'

In [89]:
greeting("Chenhui", 10) # lol

'HELLO, ChenhuiChenhuiChenhuiChenhuiChenhuiChenhuiChenhuiChenhuiChenhuiChenhui!!!'
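Arguments can also be passed by keyword, in any order, while a parameter without a default value must always be supplied. The sketch below repeats `greeting` (restructured with an `if`, but with the same behavior) so that it runs on its own.

```python
def greeting(name, loudness=0):
    # Same behavior as the function above, repeated so this cell is self-contained.
    if loudness:
        return "HELLO, {}!!!".format(name.capitalize() * loudness)
    return "Hello, {}!".format(name)

loud = greeting(loudness=2, name="chenhui")  # keyword arguments, in any order
quiet = greeting(name="Chenhui")             # `loudness` falls back to its default
print(loud)    # HELLO, ChenhuiChenhui!!!
print(quiet)   # Hello, Chenhui!

try:
    greeting(loudness=3)                     # `name` has no default
except TypeError as err:
    print("TypeError:", err)
```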

# NumPy Basics

## Terminology
> Credit to NumPy's [official document](https://numpy.org/doc/stable/glossary.html).

*   Rank: Number of dimensions.
*   Shape: A tuple to describe the size of each dimension.
*   View: An array that does not own its data but refers to another array's data instead.
*   Slice: The selection of certain elements from a sequence.
*   Flatten: Collapsing a multi-dimensional array to a one-dimensional array.
*   Mask😷: A boolean array, used to select only certain elements for an operation.
*   Universal function: A fast element-wise, vectorized array operation. Examples include `add`, `sin` and `logical_or`.
*   Array-like: A sequence that can be interpreted as an ndarray, including nested lists, tuples, scalars and existing arrays.
*   Broadcast: The way NumPy treats arrays with different shapes, subject to certain constraints, during arithmetic operations.
*   Axis: Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Many operations can take place along one of these axes.
*   Vectorization: Optimizing a looping block by specialized code. NumPy uses vectorization to mean any optimization via specialized code performing the same operations on multiple elements, typically achieving speedups by avoiding some of the overhead in looking up and converting the elements. See the `Vectorization` section below for a very simple example.
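Several of the terms above can be seen on one small array. This is only an illustrative sketch with made-up variable names, not an exhaustive tour of the glossary.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
print(arr.ndim)                 # rank: 2
print(arr.shape)                # shape: (3, 4)

mask = arr > 5                  # mask: a boolean array with the same shape as arr
selected = arr[mask]
print(selected)                 # [ 6  7  8  9 10 11]

flat = arr.flatten()            # flatten: collapse to one dimension
print(flat.shape)               # (12,)

plus_one = np.add(arr, 1)       # universal function: element-wise addition
col_sums = arr.sum(axis=0)      # axis 0: sum down the rows, one result per column
row_sums = arr.sum(axis=1)      # axis 1: sum across the columns, one result per row
print(col_sums, row_sums)       # [12 15 18 21] [ 6 22 38]

view = arr[:, 1]                # view: a basic slice shares arr's data
view[0] = 100
print(arr[0, 1])                # 100
```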



## Ndarray


### Array Creation

A secret that I won't tell anyone: If you create a super big PyTorch tensor (on GPU) and do multiplication, Colab may temporarily upgrade your notebook environment (sometimes even with a V100 GPU). However, your session might get killed if you are unlucky.

*   From a Python list

In [90]:
np.array([233, 666, 999])

array([233, 666, 999])

In [91]:
np.array([[233, 666, 999], [233, 666, 999]])

array([[233, 666, 999],
       [233, 666, 999]])

*   Create constant arrays

In [92]:
np.zeros([100, 100, 100])

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0.

In [93]:
np.ones((2, 3))

array([[1., 1., 1.],
       [1., 1., 1.]])

In [94]:
np.full((2, 3, 3), 5)

array([[[5, 5, 5],
        [5, 5, 5],
        [5, 5, 5]],

       [[5, 5, 5],
        [5, 5, 5],
        [5, 5, 5]]])

Of course you can also specify [NumPy datatypes](https://numpy.org/doc/stable/user/basics.types.html).

In [95]:
np.array([2, 3, 3], dtype=np.float32)

array([2., 3., 3.], dtype=float32)

Remember the array-like sequence we mentioned above?

In [96]:
foo = [2, 3, 3] * 100
np.zeros_like(foo)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [97]:
boo = np.random.randint(-100, high=100, size=(10, 10))
np.ones_like(boo)

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [98]:
np.eye(10) # identity matrix

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

*  Would you like to create an arithmetic or geometric sequence? Here you go.

In [99]:
np.arange(start=0, stop=101, step=2)

array([  0,   2,   4,   6,   8,  10,  12,  14,  16,  18,  20,  22,  24,
        26,  28,  30,  32,  34,  36,  38,  40,  42,  44,  46,  48,  50,
        52,  54,  56,  58,  60,  62,  64,  66,  68,  70,  72,  74,  76,
        78,  80,  82,  84,  86,  88,  90,  92,  94,  96,  98, 100])

In [100]:
np.linspace(start=0., stop=100., num=51, retstep=True)

(array([  0.,   2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.,  20.,
         22.,  24.,  26.,  28.,  30.,  32.,  34.,  36.,  38.,  40.,  42.,
         44.,  46.,  48.,  50.,  52.,  54.,  56.,  58.,  60.,  62.,  64.,
         66.,  68.,  70.,  72.,  74.,  76.,  78.,  80.,  82.,  84.,  86.,
         88.,  90.,  92.,  94.,  96.,  98., 100.]), 2.0)

In [101]:
np.logspace(start=1, stop=5, num=5, base=2)

array([ 2.,  4.,  8., 16., 32.])
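A few details of these three functions are easy to trip over, so here is a hedged sketch: `np.arange` excludes `stop`, `np.linspace` includes it by default, and `np.logspace` takes exponents rather than endpoints. `np.geomspace`, which is not used elsewhere in this notebook, takes the endpoints directly.

```python
import numpy as np

half_open = np.arange(0, 10, 2)         # stop is excluded
closed = np.linspace(0, 10, num=6)      # stop is included by default
print(half_open)                        # [0 2 4 6 8]
print(closed)                           # [ 0.  2.  4.  6.  8. 10.]

# logspace(start, stop, base=b) is b ** linspace(start, stop).
powers = np.logspace(start=1, stop=5, num=5, base=2)
same_powers = 2.0 ** np.linspace(1, 5, num=5)
print(np.allclose(powers, same_powers))  # True

# geomspace takes the first and last values of the geometric sequence instead.
geometric = np.geomspace(2, 32, num=5)
print(np.allclose(powers, geometric))    # True
```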

### Array Properties

In [102]:
boo.size

100

In [103]:
boo.shape

(10, 10)

In [104]:
boo.dtype

dtype('int64')

### Array Reshape
![](https://forum.onefourthlabs.com/uploads/default/original/2X/d/dc52bf0fac7ecdf30c66cfd4625ccf58e74fba02.jpeg)

In [105]:
boo = np.linspace(start=1., stop=99., num=50)

In [106]:
bar = boo.reshape((10, 5))
bar

array([[ 1.,  3.,  5.,  7.,  9.],
       [11., 13., 15., 17., 19.],
       [21., 23., 25., 27., 29.],
       [31., 33., 35., 37., 39.],
       [41., 43., 45., 47., 49.],
       [51., 53., 55., 57., 59.],
       [61., 63., 65., 67., 69.],
       [71., 73., 75., 77., 79.],
       [81., 83., 85., 87., 89.],
       [91., 93., 95., 97., 99.]])

In [107]:
bar.T
bar.transpose()

array([[ 1., 11., 21., 31., 41., 51., 61., 71., 81., 91.],
       [ 3., 13., 23., 33., 43., 53., 63., 73., 83., 93.],
       [ 5., 15., 25., 35., 45., 55., 65., 75., 85., 95.],
       [ 7., 17., 27., 37., 47., 57., 67., 77., 87., 97.],
       [ 9., 19., 29., 39., 49., 59., 69., 79., 89., 99.]])

In [108]:
foo = np.array([[1, 2, 3]])
foo

array([[1, 2, 3]])

We can also specify the size of one dimension as `-1` (unknown) during reshaping. This will let NumPy figure out the size of that dimension.

In [109]:
foo.reshape(-1)

array([1, 2, 3])

In [110]:
np.ones((1, 2, 3)).flatten()

array([1., 1., 1., 1., 1., 1.])
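The sketch below shows `-1` in a 2-D reshape and one difference between the two ways of flattening used above: `flatten()` always returns a copy, while `reshape(-1)` returns a view when it can (as it does for this contiguous array). It uses `np.shares_memory` to check; the variable names are illustrative.

```python
import numpy as np

base = np.arange(12)
grid = base.reshape(3, -1)        # NumPy infers the second dimension: 12 / 3 = 4
print(grid.shape)                 # (3, 4)

flat_copy = grid.flatten()        # always a new array
flat_view = grid.reshape(-1)      # a view here, because grid is contiguous
print(np.shares_memory(grid, flat_copy))  # False
print(np.shares_memory(grid, flat_view))  # True

try:
    base.reshape(5, -1)           # 12 elements cannot be split into 5 equal rows
except ValueError as err:
    print("ValueError:", err)
```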

### Indexing and Slicing
*   Similar to what we did for Python lists, but with extra toppings.

In [111]:
boo = np.random.randn(225, 233, 361)
boo.shape

(225, 233, 361)

In [112]:
boo[:, :, :].shape # I want them all!

(225, 233, 361)

In [113]:
boo[0:-10:2, :, ...].shape # Too lazy to spell out the remaining dimensions? `...` (Ellipsis) covers all of them

(108, 233, 361)

In [114]:
boo[::-1, 1, 1].shape

(225,)

*   Fancy indexing (Yeah that's what people call it😂) with boolean mask: Note that boolean mask indexing creates a new ndarray instead of a view.

In [115]:
foo = np.random.randn(100)
foo[foo > 0]

array([1.68339489e+00, 8.94855537e-01, 4.85777866e-01, 1.08696284e+00,
       1.40309913e+00, 9.07868695e-01, 1.71675186e+00, 1.51317942e+00,
       1.81822228e+00, 5.75292362e-01, 4.16015064e-01, 1.22194665e-02,
       1.20974652e+00, 2.21792623e+00, 2.07676146e-03, 2.33204822e-01,
       1.31725095e+00, 7.54091528e-01, 1.11441265e+00, 4.86291657e-01,
       1.99178813e+00, 1.11434440e+00, 7.88104671e-01, 1.85536382e+00,
       1.36859416e+00, 1.25872278e+00, 2.03676107e+00, 1.49814405e+00,
       1.09287124e+00, 7.46711640e-01, 7.43880185e-01, 1.07364270e+00,
       1.89021374e-01, 1.66853580e+00, 4.64538095e-01, 1.24363174e+00,
       4.11649764e-01, 7.72097329e-01, 8.44206399e-02, 7.15574247e-01,
       1.76815981e-01, 1.45318359e+00, 7.25833133e-01, 5.83765584e-01])

In [116]:
boo[boo < 0]

array([-1.54451975, -0.24822301, -1.10994655, ..., -0.38990346,
       -0.99421244, -0.13257632])

*   Fancy indexing with integer array: Note that integer array indexing creates a new ndarray instead of a view.

In [117]:
foo = np.random.randint(0, 100, (10, 10))
foo

array([[95, 16, 66, 32, 73, 94, 36, 52, 64, 48],
       [52, 46, 89, 49, 27, 91, 96, 77, 64, 77],
       [14, 55,  2, 72, 76,  5, 68, 66, 49, 95],
       [51, 66, 88, 56, 56, 50, 87, 36, 79, 78],
       [93, 71, 64, 18, 74, 75, 60, 65, 25, 37],
       [97, 20, 76, 41, 96,  1, 41, 60,  1, 25],
       [21, 47, 74, 94, 43,  7, 33,  0, 44, 18],
       [70, 81,  2, 71, 43, 71,  0, 89, 90, 52],
       [74, 20, 31, 40, 19, 69, 58, 12,  2, 40],
       [16, 20, 73, 91, 32, 81, 80, 60, 58, 97]])

In [118]:
foo[[1, 2, 3], [1, 2, 3]] # Basically we are picking out numbers with position (1, 1), (2, 2), and (3, 3)

array([46,  2, 56])

Same story for higher-dimensional arrays.

In [119]:
bar = np.random.randint(0, 100, (10, 10, 10))
bar[[1, 2, 3], [1, 2, 3], [1, 2, 3]]

array([33, 22, 22])

We can even mix slicing with indexing.

In [120]:
foo[3:7, [0, 1, 2, 3]]

array([[51, 66, 88, 56],
       [93, 71, 64, 18],
       [97, 20, 76, 41],
       [21, 47, 74, 94]])
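The view-versus-copy distinction mentioned above has a practical consequence: writing into a slice changes the original array, while writing into the result of fancy indexing does not. A minimal sketch with illustrative names:

```python
import numpy as np

data = np.arange(10)
sliced = data[2:5]          # basic slicing: a view
fancy = data[[2, 3, 4]]     # integer-array indexing: a copy
masked = data[data > 6]     # boolean-mask indexing: a copy

sliced[0] = -1              # writes through to `data`
fancy[1] = -2               # only changes the copy
masked[0] = -3              # only changes the copy

print(data)                 # [ 0  1 -1  3  4  5  6  7  8  9]
print(np.shares_memory(data, sliced))  # True
print(np.shares_memory(data, fancy))   # False
```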

## NumPy Math/Statistics

### Statistics

In [121]:
boo = np.random.randn(3, 5)
boo

array([[ 0.13315673,  1.28355587,  2.26422345, -1.13109949,  0.03002014],
       [-0.03203126, -0.67702123,  0.36223797, -0.27586817,  0.86385367],
       [-0.50310435,  0.27807985, -0.21850545, -1.26513923,  2.43544275]])

In [122]:
boo.std() == boo.flatten().std() # Yes, my mom told me not to compare float numbers this way, but the point here is that std() by default flattens the array.

True

In [123]:
boo.std(axis=0) # std of each column (along the rows)

array([0.26957273, 0.80049031, 1.06033299, 0.43818369, 0.99728679])

In [124]:
boo.std(axis=1) # std of each row (along the columns)

array([1.16112957, 0.52939849, 1.24936029])

Watch out for the degrees of freedom, which depend on whether you want the population std (`ddof=0`, the default) or the sample std (`ddof=1`), and on your downstream task. Prof. Liu will talk about degrees of freedom later in the course.

In [125]:
boo.std(ddof=0)

1.0505937182721985
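To make the effect of `ddof` concrete, here is a small sketch on a made-up sample. It compares NumPy's result with the textbook formula, which divides the sum of squared deviations by `N - ddof`.

```python
import numpy as np

samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

population_std = samples.std(ddof=0)   # divide by N (NumPy's default)
sample_std = samples.std(ddof=1)       # divide by N - 1

# The same sample std computed by hand.
squared_deviations = (samples - samples.mean()) ** 2
manual_sample_std = np.sqrt(squared_deviations.sum() / (len(samples) - 1))

print(population_std)                             # 2.0
print(np.isclose(sample_std, manual_sample_std))  # True
print(sample_std > population_std)                # True
```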

In [126]:
boo.mean()

0.23652008238371441

In [127]:
boo.sum()

3.547801235755716

### Element-wise Operations

In [128]:
boo = np.arange(0, 20).reshape((4, 5))
boo

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [129]:
foo = np.arange(0, 5)
foo

array([0, 1, 2, 3, 4])

In [130]:
foo ** 2

array([ 0,  1,  4,  9, 16])

In [131]:
foo - 1

array([-1,  0,  1,  2,  3])

In [132]:
foo - np.arange(-5, 0)

array([5, 5, 5, 5, 5])

In [133]:
boo * foo

array([[ 0,  1,  4,  9, 16],
       [ 0,  6, 14, 24, 36],
       [ 0, 11, 24, 39, 56],
       [ 0, 16, 34, 54, 76]])

Wait... What? **PLEASE NOTE THAT THE CODE ABOVE IS NOT DOING MATRIX MULTIPLICATION.** ~Serious side effects may include but are not limited to... (OK..OK.. I can't make this tutorial a TV prescription drug ad.)~ The code in the cell below is how you do a matrix multiplication.

In [134]:
boo @ foo

array([ 30,  80, 130, 180])
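The two results are related: each entry of the matrix-vector product is the sum of one row of the element-wise product. The sketch below rebuilds `boo` and `foo` as defined above and checks this, along with the equivalent `np.dot` call.

```python
import numpy as np

boo = np.arange(0, 20).reshape((4, 5))
foo = np.arange(0, 5)

elementwise = boo * foo     # shape (4, 5): every row of boo times foo, entry by entry
product = boo @ foo         # shape (4,): matrix-vector product

print(elementwise.shape, product.shape)                   # (4, 5) (4,)
print(np.array_equal(product, elementwise.sum(axis=1)))   # True
print(np.array_equal(product, np.dot(boo, foo)))          # True
```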

But what happened when we did `boo * foo`? This is our next fairy tale.

## Broadcast

Let's say I'd like to subtract a constant from a very large ndarray `boo`. Without broadcasting, I would have to create an array of the same size as `boo` filled with that constant, which is not very efficient. This is the basic idea of broadcasting: making arithmetic operations work even when two arrays have different shapes, as long as those shapes are compatible. For the `boo * foo` example above, NumPy behaves as if `foo` (shape `(5,)`) were "copied" four times, once for each row of `boo` (shape `(4, 5)`), so that the two shapes match.

~The COVID testing took my entire morning away so I'll just copy the official documentation. (Sorry, that's not an excuse.)~

![Meme](https://imgs.xkcd.com/comics/compiling.png)\
Wait... Python does not compile... Never mind...

The rules for broadcasting go as below ([credit](https://cs231n.github.io/python-numpy-tutorial/#broadcasting)):
*   If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both shapes have the same length.
*   The two arrays are said to be compatible in a dimension if they have the same size in the dimension, or if one of the arrays has size 1 in that dimension.
*   The arrays can be broadcast together if they are compatible in all dimensions.
*   After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of shapes of the two input arrays.
*   In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension.
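The rules above translate almost line by line into code. The function below is a sketch written for this tutorial (`broadcast_shape` is a made-up name, not a NumPy function), and it ignores the corner case of zero-sized dimensions. The result is cross-checked against what NumPy actually produces.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rules to two shapes; raise ValueError if they clash."""
    rank = max(len(shape_a), len(shape_b))
    # Rule 1: prepend 1s to the lower-rank shape.
    a = (1,) * (rank - len(shape_a)) + tuple(shape_a)
    b = (1,) * (rank - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        # Rules 2 and 3: every dimension must match, or one of the sizes must be 1.
        if dim_a != dim_b and 1 not in (dim_a, dim_b):
            raise ValueError(f"incompatible shapes {shape_a} and {shape_b}")
        # Rule 4: the output takes the element-wise maximum.
        result.append(max(dim_a, dim_b))
    return tuple(result)

print(broadcast_shape((4, 5), (5,)))            # (4, 5)
print(broadcast_shape((3, 1), (1, 4)))          # (3, 4)
print((np.ones((4, 5)) * np.ones(5)).shape)     # (4, 5), NumPy agrees

try:
    broadcast_shape((4, 3), (4,))               # trailing dimensions 3 and 4 clash
except ValueError as err:
    print("ValueError:", err)
```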

Check out [this link](https://numpy.org/devdocs/user/theory.broadcasting.html) for the motivation behind this design and the related rules. Below are some examples.

![Broadcast 1](https://numpy.org/doc/stable/_images/theory.broadcast_1.gif)\
![Broadcast 2](https://numpy.org/doc/stable/_images/theory.broadcast_2.gif)\
![Broadcast 3](https://numpy.org/doc/stable/_images/theory.broadcast_3.gif)\
When the trailing dimensions of the arrays are unequal, broadcasting fails because it is impossible to align the values in the rows of the 1st array with the elements of the 2nd array for element-by-element addition.



## Vectorization

The basic idea of vectorization with NumPy is to convert vanilla loops into ndarray-based NumPy operations without explicit looping. We usually (at least for simple problems) generate all the random trials at once and then accumulate the results with highly optimized NumPy functions. We won't get into too much detail, but below is a naive example.

In [135]:
%%timeit
from random import randint
path = []
pos = 0
for i in range(int(1e5)):
  walk = randint(-100, 100)
  path.append(pos)
  pos += walk

10 loops, best of 3: 118 ms per loop


In [136]:
%%timeit
walks = np.random.randint(-100, high=100, size=int(1e5))
path = np.cumsum(walks)

1000 loops, best of 3: 1.41 ms per loop
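The two timed cells draw different random numbers, so their paths can't be compared directly. The sketch below feeds the same steps to both approaches to show that they compute the same thing, up to one detail: the loop above records the position before each step, so its path is the cumulative sum shifted by one. The variable names and the smaller step count are choices made for this example.

```python
import numpy as np

steps = np.random.randint(-100, high=100, size=1000)

# Loop version: record the position *before* taking each step, as in the cell above.
loop_path = []
pos = 0
for step in steps:
    loop_path.append(pos)
    pos += step

# Vectorized version: start at 0, then append every running total except the last.
vectorized_path = np.concatenate(([0], np.cumsum(steps)[:-1]))

print(np.array_equal(loop_path, vectorized_path))  # True
```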


See the difference? Vectorization could save you a huge amount of time on a later homework about simulation (and in a whole bunch of other scenarios). Side note: Packages like `Numba` can also bring vectorization and JIT compilation to CPython. However, please do watch out for any unintended behavior when using such packages. (I recommended `Numba` to a friend who was doing a research project and it messed up his experiment results...🤦)

In [137]:
from numba import jit
@jit(nopython=True)
def random_walk(steps):
  path = []
  pos = 0
  for i in range(int(steps)):
    walk = np.random.randint(-100, high=100)
    path.append(pos)
    pos += walk
  return path

In [138]:
%%timeit
random_walk(1e5) # Not bad!

The slowest run took 29.29 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 6.95 ms per loop


![Meme](https://i.redd.it/nklty63uzav41.png)

# Other Libraries by Examples
*  [Notebook1](https://colab.research.google.com/github/danielz02/CS361_Notebook_Collection/blob/master/Notebook1_Pandas_Histogram.ipynb): Matplotlib and Pandas
*  [Notebook2](https://github.com/danielz02/CS361_Notebook_Collection/blob/master/Notebook2_Time_Series_Boxplot_Corr.ipynb): Matplotlib and Pandas (Cont'd)
*  [Notebook3](https://colab.research.google.com/github/danielz02/CS361_Notebook_Collection/blob/master/Notebook3_Scatter_Corr.ipynb): Correlation Coefficient
*  [Notebook4](https://colab.research.google.com/github/danielz02/CS361_Notebook_Collection/blob/master/Notebook4_Simulation.ipynb): Simulation
*  [Notebook5](https://colab.research.google.com/github/danielz02/CS361_Notebook_Collection/blob/master/Notebook5_Covariance_PCA.ipynb): PCA
*  [Notebook6](https://colab.research.google.com/github/danielz02/CS361_Notebook_Collection/blob/master/Notebook6_PCA.ipynb): PCA (Cont'd)

# Reference
*   [What's NumPy](https://numpy.org/doc/stable/user/whatisnumpy.html)
*   [NumPy Indexing](https://numpy.org/doc/stable/user/basics.indexing.html)
*   [NumPy Glossary](https://numpy.org/doc/stable/glossary.html)
*   [Stanford CS 231n](https://cs231n.github.io/python-numpy-tutorial/)
*   [NumPy Cheatsheet](https://www.dataquest.io/blog/numpy-cheat-sheet/)
*   [NumPy Broadcasting](https://numpy.org/doc/stable/user/theory.broadcasting.html)
*   [Meme from Unknown Source](https://forum.onefourthlabs.com/t/numpy-reshaping/5813)