# Programming for Data Science and Artificial Intelligence

## NumPy

### Readings: 
- [VANDER] Ch2
- https://numpy.org/doc/stable/

NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.

A Python list carries overhead: each element is a full Python object whose dynamic type must be checked at run time before the underlying C-level operation can run. Unlike a Python list, a NumPy array is constrained to elements of a single type, which removes that overhead.
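
As a small illustration of the single-type constraint, the sketch below compares a mixed-type list with an array built from mixed numeric input. The variable names (`mixed_list`, `list_types`, `arr`) are made up for this example.

```python
import numpy as np

# A Python list can hold objects of different types; each element carries its own type.
mixed_list = [1, 2.5, "three"]
list_types = [type(item).__name__ for item in mixed_list]
print(list_types)   # ['int', 'float', 'str']

# A NumPy array stores a single dtype; the integers are upcast to float here.
arr = np.array([1, 2, 3.5])
print(arr.dtype)    # float64
```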

## Creation

### From list

### Creating using np built-in functions

### Creating with data types

### Creating multi-dimensional numpy array

### Creating randomized numpy arrays

### === Task 1 ===

1. Declare a list of 1 to 4 using range()

2. Continuing, create a NumPy array from this list, with dtype='float32'

3. Create a NumPy array of size 3 by 5 using np.zeros

4. Create a NumPy array of size 2 by 3 filled with the value 1 / 3

5. Create an array of 5 evenly spaced values from 0 to 10

6. Create an array of the numbers 0.001, 0.01, 0.1, 1 using np.logspace

7. Create a diagonal matrix from the list [1, 2, 3, 4]

8. Create a random array of size 4 by 5 filled with random values between 0 and 1

9. Create a random array of size 4 by 5 filled with random integer values between 2 and 5

10. Create a random array of size 4 by 5 filled with random float values between 2 and 5

11. Create a random array of size 4 by 5 with mean = 5 and std = 1, following a Gaussian (normal) distribution

12. Create an identity matrix of size 5 by 5

## Attributes

First let's discuss some useful array attributes. We'll start by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array. 

<center><img src="../figures/02.01-numpy-dimension.png" width=500 height=500 /></center>

We'll use NumPy's random number generator, which we will seed with a set value in order to ensure that the same random arrays are generated each time this code is run:
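
A minimal sketch of that setup, assuming a seed of 0 and small integer arrays (both are arbitrary choices for illustration). It also previews the attributes discussed in the subsections that follow.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the same arrays are generated on every run

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array

print(x3.ndim)   # 3 (number of dimensions)
print(x3.shape)  # (3, 4, 5) (size of each dimension)
print(len(x3))   # 3 (size of the first dimension only)
print(x3.size)   # 60 (total number of elements)
print(x3.dtype)  # an integer dtype; the exact width depends on the platform
```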

### Dim

### Shape

### Len

### Size

### Dtypes

## Indexing and Slicing

<center><img src="../figures/02.02-numpy-array-slice.png" width=500 height=500 /></center>

### Basic indexing

### 2D array access

<center><img src="../figures/02.03-numpy-matrix-indexing.png" width=500 height=500 /></center>

### 3D array access

### Modifying

### === Task 2 ===

1. Create a NumPy array of size 3 by 4 with random float values between 0 and 5

2. Continuing, print the shape of this array

3. Continuing, access the first row of the array

4. Continuing, access the first column of the array

5. Continuing, access the second row, and third column element

6. Continuing, access the first two columns

7. Continuing, access the first and third columns using a step

8. Continuing, print the whole matrix in reverse columns but not rows

9. Change the third row, fourth column element (i.e., the last element) to 999

### Very very important reminder - subarrays are views, not copies!
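
The sketch below illustrates the point with a small made-up array: writing through a slice changes the original, while an explicit `.copy()` does not.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

view = x[:2, :2]   # slicing returns a view onto the same data as x
view[0, 0] = 99    # this also changes x[0, 0]
print(x[0, 0])     # 99

copy = x[:2, :2].copy()  # an explicit copy owns its own data
copy[0, 0] = -1          # x is unaffected
print(x[0, 0])           # 99
```

When a subarray needs to be modified independently of its source, calling `.copy()` on the slice is the usual way to get that.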

## Reshaping

### Simple reshape

### 1d to 2d

<center><img src="../figures/02.04-np_reshape.png" width=500 height=500 /></center>

### np.newaxis

### Reshaping 2d to 3d is common for time-series models, such as LSTM

### Using -1 in reshape

### === Task 3 ===

1. Create a NumPy array of size 200 by 4 with random float values between 1 and 5 and name it <code>Data</code>

2. Split the array into two numpy arrays, X and y, where the X contains the first 3 columns and y contains the last column.

3. Continuing, take the first 70% of the 200 rows of X and y and call them <code>X_train</code> and <code>y_train</code>. Similarly, populate <code>X_test</code> and <code>y_test</code> using the same columns but the remaining 30% of the rows.

<img src="../figures/02.07-numpy-Task.png" width=1000 height=1000 />

4. Print the shape of the <code>X_train</code> and <code>y_train</code>.  The first array should have shape (0.7 * 200, 3); the second array is (0.7 * 200, 1)

5. Assign m = <code>X_train.shape[0]</code>, and n = <code>X_train.shape[1]</code>, where $m$ is number of samples, and $n$ is number of features

6. Randomly select one row of <code>X_train</code> by using <code>np.random.randint</code> to pick the row index, and call it <code>X_i</code>.  Reshape it so that it has shape <code>(1, n)</code>

7. Randomly select 50 contiguous rows from <code>X_train</code> by using <code>np.random.randint</code> to select a random starting row, and call the result <code>mini_batch_X</code>.  If the range runs past the last row, simply take whatever is left.

8. Write a for loop that breaks the <code>X_train</code> row-wise into 10 equal pieces without overlap, and simply print their shape

9. Create an <code>np.zeros</code> array called <code>theta</code> with shape of <code>(n, )</code>

10. Perform a **dot product** between <code>X_train</code> and <code>theta</code> and assign this value to a variable called <code>yhat</code>
Hint: https://www.guru99.com/numpy-dot-product.html

<center><img src="../figures/02.07-numpy-Task-dot-explained.png" width=500 height=500 /></center>

11. Create another variable called <code>y</code> with the same shape as <code>yhat</code>, and populate it with random values from 0 to 1.

12. Calculate the following 
    $$ \frac{\sum\limits_{i=1}^{m} (y_i - \hat{y}_i)^2}{m} $$
    
    For example, if yhat = [1, 2, 3] and y = [2, 3, 4], then $m = 3$ and the calculation is
    
    $$ \frac{(2-1)^2 + (3-2)^2 + (4-3)^2}{3} = 1 $$
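
The small worked example in step 12 can be checked with a few lines of NumPy. This is only a sketch on the three-element example vectors, not on the arrays built in the task; the name `mse` is an assumption for illustration.

```python
import numpy as np

yhat = np.array([1, 2, 3])
y = np.array([2, 3, 4])
m = y.shape[0]  # number of samples, 3 here

# element-wise difference, squared, summed, then divided by m
mse = np.sum((y - yhat) ** 2) / m
print(mse)  # 1.0
```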

## Concatenation, vstack/hstack

<center><img src="../figures/numpy-axis.jpg" width=500 height=500 /></center>

<center><img src="../figures/02.06-numpy-concatenate.png" width=500 height=500 /></center>

### Concatenation

### Error example

### More example

### Concatenating multiple lists

### Vstack and hstack

<center><img src="../figures/numpy-hstack.png" width=200 height=200 /><img src="../figures/numpy-vstack.png" width=120 height=200 /></center>

## Vectorization
<center><img src="../figures/numpy-vectorization.png" width=300 height=300 /></center>

### Vectorization basics

### Vectorization by scalars

### Vectorization using NumPy built-in functions

## Broadcasting

Broadcasting is NumPy's built-in mechanism for applying vectorized operations to arrays whose shapes differ.

Here are the broadcasting rules:

- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
- Rule 2: If the shapes of the two arrays do not match in some dimension, the array with size equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
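
The three rules can be traced on a few small arrays. The shapes below are arbitrary choices for illustration, and the flag `raised` exists only to record whether the error occurred.

```python
import numpy as np

a = np.ones((2, 3))
b = np.arange(3)                 # shape (3,)

# Rule 1: b is treated as shape (1, 3).
# Rule 2: its size-1 dimension is stretched, giving (2, 3).
print((a + b).shape)             # (2, 3)

c = np.arange(3).reshape(3, 1)   # shape (3, 1)
# Both operands are stretched: (3, 1) and (1, 3) become (3, 3).
print((c + b).shape)             # (3, 3)

# Rule 3: (3, 2) against (3,) is padded to (1, 3);
# the last dimension has sizes 2 and 3, and neither is 1.
d = np.ones((3, 2))
try:
    d + b
    raised = False
except ValueError as err:
    raised = True
    print("broadcast error:", err)
```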

A simple way to think about broadcasting is as follows:

<code>a = 5
b = np.array([1, 1, 1])  # shape (3, )
a + b  # array([6, 6, 6])</code>

Conceptually, broadcasting stretches a to [5, 5, 5] so that it matches the shape of b, which is (3, ).  This duplication does not actually take place in memory, but it is a useful mental model for thinking about broadcasting.
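
One way to see this mental model in code is `np.broadcast_to`, which returns the stretched array as a read-only view. The sketch below reuses `a` and `b` from above; the name `stretched` is made up for illustration.

```python
import numpy as np

a = 5
b = np.array([1, 1, 1])

# Stretch the scalar to the shape of b without duplicating any data.
stretched = np.broadcast_to(a, b.shape)
print(stretched)          # [5 5 5]
print(stretched.strides)  # (0,) : every element points at the same value in memory
print(a + b)              # [6 6 6]
```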

<center><img src="../figures/02.05-broadcasting.png" width=400 height=400 /></center>

### Example 1

### Example 2

### Example 3

### Example 4

### === Task 4 ===

1. Take <code>X_train</code> from the previous task and, using concatenation, add a column of 1s along axis=1; call it intercept

2. Create a <code>theta</code> of shape (n + 1, 4), with random values between 0 and 1

3. Perform a dot product between X_train and theta, and assign this value to a variable called <code>hot_encoded_yhat</code>

4. Continuing, using broadcasting and vectorization, apply the following calculation to each value in <code>hot_encoded_yhat</code> (shown here for the first value of a row, value1).

$$ \frac{\exp(value1)}{\exp(value1)+\exp(value2)+\exp(value3)+\exp(value4)} $$

For example, if my first row is [0.3, 0.5, 1.2, 3.1], then the first value 0.3 will change to

exp(0.3) / (exp(0.3) + exp(0.5) + exp(1.2) + exp(3.1))

The second value 0.5 will become

exp(0.5) / (exp(0.3) + exp(0.5) + exp(1.2) + exp(3.1))

5. Create a variable called <code>yhat</code> which is equal to the <code>np.argmax</code> of <code>hot_encoded_yhat</code> along axis=1.  That is, <code>yhat</code> has the shape of <code>(X_train.shape[0], )</code>.  For example, if the first row of <code>hot_encoded_yhat</code> is [0.1, 0.2, 0.3, 0.4], since the fourth value is the biggest, the value will be its index which is 3.

6. Create a variable called <code>y</code> with shape <code>(X_train.shape[0], )</code>, filled with random int values from [0, 1, 2, 3]

7. Assign a variable <code>n_classes</code> equal to the number of unique values in <code>y</code>
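
The row-wise normalization in step 4 can be sketched on the two example rows quoted in steps 4 and 5. This is a sketch on made-up input, not on the task's arrays; the names `scores`, `exp_scores`, and `probs` are assumptions for illustration.

```python
import numpy as np

# Two example rows taken from the task text, used here as raw input values.
scores = np.array([[0.3, 0.5, 1.2, 3.1],
                   [0.1, 0.2, 0.3, 0.4]])

exp_scores = np.exp(scores)  # vectorized: exp is applied to every element

# keepdims=True keeps the row sums at shape (2, 1),
# so they broadcast across the 4 columns of each row.
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

print(probs.sum(axis=1))         # each row sums to 1, up to rounding
print(np.argmax(probs, axis=1))  # [3 3]
```

Without `keepdims=True`, the row sums would have shape `(2, )`, which does not broadcast against shape `(2, 4)`.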

## Masking

Instead of writing if/while, we can use Boolean masks to elegantly extract desired values from numpy arrays.  You will love it!!


Examples

Hint: https://matteding.github.io/2019/04/12/numpy-masks/
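
A minimal sketch of Boolean masking on a small made-up array, touching each of the tools covered in the subsections below:

```python
import numpy as np

x = np.array([1, 5, 2, 8, 3, 9])

mask = x > 4                  # boolean array with the same shape as x
print(mask)                   # [False  True False  True False  True]
print(x[mask])                # [5 8 9]

# Combine conditions with & and |, and parenthesize each comparison.
print(x[(x > 2) & (x < 9)])   # [5 8 3]

print(np.any(x > 8))          # True: at least one element satisfies the condition
print(np.all(x > 0))          # True: every element satisfies the condition
print(np.argwhere(x > 4).ravel())  # [1 3 5]: indices where the condition holds
```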

### Argwhere

### np.any

### np.all

### Multiple conditions

## Fancy indexing

Instead of passing individual indices, we can pass an array of indices all at once, and NumPy returns an array of the corresponding elements.
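
A short sketch on made-up arrays; the index values are arbitrary choices for illustration.

```python
import numpy as np

x = np.arange(10, 20)          # [10 11 12 ... 19]

idx = [3, 7, 4]
print(x[idx])                  # [13 17 14]

# The result takes the shape of the index array, not of x.
idx2d = np.array([[3, 7], [4, 5]])
print(x[idx2d].shape)          # (2, 2)

# In two dimensions, row and column indices are paired element by element.
X = np.arange(12).reshape(3, 4)
rows = np.array([0, 1, 2])
cols = np.array([2, 1, 3])
print(X[rows, cols])           # [ 2  5 11]

# Fancy indexing can also be used on the left-hand side to modify values.
x[[0, 1]] = 0
print(x[:3])                   # [ 0  0 12]
```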

### Basic example

### Shape of fancy indexing

### Multiple dimension fancy indexing

### Combined indexing with indexing/slicing

### 3d

### Modifying using fancy indexing

### === Task 5 ===

1. Take the variables <code>yhat</code> and <code>y</code> from above and count the positions where their corresponding values are the same.  Then divide this count by $m$.  Call this variable <code>accuracy</code>

2. Let's practice masking.  Convert <code>y</code> into <code>hot_encoded_y</code> matrix of shape <code>(X_train.shape[0], 4)</code>, where the column will be one according to its value, and other columns will be zero.  For example, if the first row is 1, then it will become 0 1 0 0.  If the second row is 2, then it will become 0 0 1 0

3. Grab the data from <code>data = np.genfromtxt('../data/perceptron.csv', delimiter=',', skip_header=1)</code>
    - set X to be all columns except last
    - set y to be the last column
    - select X where its corresponding y is 0
    - select X where its corresponding y is 1
    
4. Grab the data from <code>iris = np.genfromtxt('../data/iris.csv', delimiter=',', encoding="utf-8", dtype=None)</code>
    - set iris_without_headers to all rows except the first row
    - set sepal_length to second column of iris_without_headers
    - set petal_length to fourth column of iris_without_headers
    - print the shape of iris_without_headers where sepal_length is less than 5, and petal_length is greater than 1.5

## Other useful stuff :)

### np.transpose

### ndarray.flatten

### np.squeeze

### np.argwhere

### np.argmax

### np.argsort