# Linear Algebra with Numpy
### NumPy

As we've seen in lecture, linear algebra is the branch of mathematics that studies vectors, vector spaces, and the linear maps between them. It is a core concept because many data science techniques rely on it (e.g. Linear Regression, Support Vector Machines, Recommender Systems, Neural Networks).

NumPy is a package for scientific computing, built around creating and manipulating N-dimensional array objects.

* `numpy` stands for numerical python
* the starring role is played by the `ndarray`, an N-dimensional array (a vector is the 1d case, a matrix the 2d case)
* `pandas` is built on `numpy`
* More documentation available at: http://www.numpy.org
* Good quick tutorial at: http://cs231n.github.io/python-numpy-tutorial/

In [None]:
import numpy
import numpy as np # Same as above, but numpy is aliased to np.

In [None]:
print(numpy.absolute(-10))
print(np.absolute(-10)) # Both ways work

### Arrays
We can create the vectors and matrices we learned about in class with `np.array`.

(Note for persons familiar with vector representations of mathematical geometries:
Henceforth, "dimensions" refers to the number of axes of the array, not to the physical dimensions that can be represented through multiple vectors in an m x n matrix.)

In [None]:
a = np.array([1, 2, 1]) #Create a 1 dimensional array having shape (3,)

In [None]:
a = np.array([ [1,2], [3,4], [5,6] ]) #Create a 2 dimensional array having shape 3 x 2

In [None]:
a = np.array([[1, 2, 3],[2, 4, 9],[4,6,4]]) # Create a 2 dimensional array having shape 3 x 3

In [None]:
# shape gives the size of each dimension.
# There are two dimensions, both having size 3.
a.shape

In [None]:
# We can also check how many dimensions the numpy array has
a.ndim

In [None]:
a[0]  # Index first row

In [None]:
a[0,:] #Index first row, but more explicit

In [None]:
a[ : , 1]  # Index second column

In [None]:
a[:2,1:3] #Slice the 0 and 1 rows, and the 1 and 2 columns

In [None]:
a[:2,1:3] = 0 # Can directly modify the elements 

Be careful about indexing/slicing. If you **index** a single row/column in a 2d matrix, you will get a 1d array back having shape (d,). If you **slice** a single row/column, you will get a 2d array back having shape (d,1), (1,d) or similar.

In [None]:
# The output numbers are the same,
# but the first is conceptually a vector.
# The second is conceptually a 1xd matrix
print(a[0,:].shape) #Indexed on 0th row. This is a 1d array
print(a[0:1,:].shape) #Sliced on 0th row. This is a 2d array having shape (1,3)

In [None]:
#If you're unclear how many dimensions there are, you can use ndim to check
print(a[0,:].ndim)
print(a[0:1,:].ndim)

#### Integer array indexing
You can index into numpy arrays using arrays of integers as well

In [None]:
a = np.array([ [1,2],[3,4],[5,6]])

In [None]:
a[ [0,1,2], [0,1,0] ] # If we wanted to grab the (0,0), (1,1), and (2,0) elements

In [None]:
np.array([ a[0,0], a[1,1], a[2,0]]) #Same as above, but create a new array from the elements
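
A common use of integer array indexing is picking one element from each row. The following is a minimal sketch of that idea; the names `rows`, `cols`, and `picked` are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Hypothetical helper arrays: one row index and one column index per element.
rows = np.arange(3)          # [0, 1, 2]
cols = np.array([0, 1, 0])   # column to pick in each row

picked = a[rows, cols]       # elements (0,0), (1,1), (2,0)
print(picked.tolist())       # [1, 4, 5]

# Integer array indexing returns a copy, so changing `picked` leaves `a` alone.
picked[0] = 99
print(a[0, 0])               # 1
```

This differs from slicing: assigning through a slice such as `a[:2, 1:3] = 0` writes into the original array.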

#### Boolean indexing
You can also index into numpy arrays with a boolean array (a mask), which is usually produced by a comparison

In [None]:
a = np.array([[1,2],[3,4],[5,100000000]])
a

In [None]:
bad_vals = a > 10
bad_vals

In [None]:
a[bad_vals] = 5
a
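
Besides assignment, a mask can be used to pull values out of an array. Here is a small sketch under the assumption that we only want to inspect the selected values; the names `mask`, `selected`, and `in_range` are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 100]])

mask = a > 3                       # boolean array with the same shape as a
print(mask.shape)                  # (3, 2)

selected = a[mask]                 # selecting with a mask gives a 1d array
print(selected.tolist())           # [4, 5, 100]

# Conditions can be combined element-wise with & (and) and | (or);
# the parentheses around each comparison are required.
in_range = a[(a > 1) & (a < 5)]
print(in_range.tolist())           # [2, 3, 4]
```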

### Datatypes
All elements in an np array share the same data type. Numpy tries to guess the best datatype from the values you pass in, but an array can be converted to another type afterwards. You can also explicitly force the type when creating the array.

In [None]:
x = np.array([1,2])
x.dtype

In [None]:
x = np.array([1.0,2.3])
x.dtype

In [None]:
#Force to float64 type, though we've seen these integers would normally be stored as an integer type (e.g. int64)
x = np.array([1,2],dtype='float64')
x.dtype
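
Converting an existing array to another type can be sketched with `astype`, which returns a new array rather than changing the original. The variable names below are illustrative.

```python
import numpy as np

x = np.array([1.7, 2.3])            # inferred as float64
as_int = x.astype('int64')          # astype returns a new, converted array
print(as_int.tolist())              # [1, 2] (the fractional part is dropped)
print(x.dtype, as_int.dtype)        # float64 int64

mixed = np.array([1, 2.5])          # ints and floats together are upcast
print(mixed.dtype)                  # float64
```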

### Matrix Operations

We can perform matrix operations such as:

In [None]:
data = np.array( [ [1,2,3], [4,5,6], [7,8,9] ])

In [None]:
data + data # Matrix Addition

In [None]:
np.add(data,data) # Same as above

In [None]:
np.subtract(data,data)

In [None]:
data * 2 # Multiplication of each element by a scalar

The dimensionality of np arrays is important.
If we multiply a 2d array by a 1d array with `np.dot`, it will implicitly do a matrix * vector operation, and return a 1d array.

In [None]:
a = np.array([1, 2, 1]) #Create a 1 dimensional array with size 3
np.dot(data, a) # Here we multiply the (3,3) array by a (3,) array

If we instead make `a` a 2d array of shape (1,3), the multiplication fails.
Because `data` and `a` are both 2d arrays, numpy follows the normal matrix multiplication rule of matching inner dimensions: (3,3) * (1,3) will not work since the number of columns in the first matrix (3) does not match the number of rows in the second matrix (1).

In [None]:
a = np.array([[1,2,1]]) #Explicitly create a 2 dimensional, 1x3 array
np.dot(data,a) # Here we multiply the (3,3) array by a (1,3) array. This raises a ValueError

If we instead shaped a to be a column array having size (3,1), then the dimensions match:
(3,3) * (3,1) will work because the number of columns in the first matrix (3), matches the number of rows in the second matrix (3)

In [None]:
vector = np.array([ [1], [2], [1] ])
np.dot(data,vector) # Here we multiply the (3,3) array by a (3,1) array
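
The three cases above can be put side by side. This is a sketch only; `vec_1d`, `row_2d`, and `col_2d` are names chosen here for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

vec_1d = np.array([1, 2, 1])            # shape (3,)
row_2d = np.array([[1, 2, 1]])          # shape (1, 3)
col_2d = np.array([[1], [2], [1]])      # shape (3, 1)

print(np.dot(data, vec_1d).shape)       # (3,)
print(np.dot(data, col_2d).shape)       # (3, 1)

try:
    np.dot(data, row_2d)                # inner dimensions 3 and 1 do not match
except ValueError as err:
    print("failed:", err)

# Transposing the row turns it into a (3, 1) column, so the product works.
print(np.dot(data, row_2d.T).shape)     # (3, 1)
```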

There is a difference between the `np.dot` function and `np.array * np.array`: `np.dot` performs matrix multiplication, while `*` multiplies element by element.

In [None]:
sqmat = np.array([ [1,2,3], [4,5,6], [7,8,9] ])

In [None]:
np.dot( sqmat, sqmat ) # sqmat (dot) sqmat, or, sqmat^2, this is matrix mult

In [None]:
sqmat * sqmat # Careful, this is very different from matrix multiplication!
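
A 2 x 2 case is small enough to check by hand. The matrix `m` below is a made-up example.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])

elementwise = m * m          # each entry multiplied by itself
matmul = np.dot(m, m)        # rows of m times columns of m

print(elementwise.tolist())  # [[1, 4], [9, 16]]
print(matmul.tolist())       # [[7, 10], [15, 22]]
```

For instance, the top-left entry of the matrix product is 1*1 + 2*3 = 7, while the element-wise product gives 1*1 = 1.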

We can transpose an array

In [None]:
print(sqmat)
print(sqmat.T)
print(np.transpose(sqmat))

In [None]:
np.linalg.inv(sqmat) # Very crazy numbers (or a LinAlgError, depending on rounding)! The matrix is singular
# For fun, you can check the rank by: np.linalg.matrix_rank(sqmat)
# This matrix is not full rank (its rank is 2), so it has no true inverse

In [None]:
sqmat = np.array( [ [1,5,4], [6,0,-2], [ 3.3, -.5, -.5] ])
np.linalg.inv(sqmat)
# For fun you can check the rank by: np.linalg.matrix_rank(sqmat) 
# This matrix should be full rank
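
One way to sanity-check an inverse is to multiply it back against the original matrix and compare the result with the identity. A minimal sketch, using `np.allclose` to allow for floating point rounding:

```python
import numpy as np

sqmat = np.array([[1, 5, 4], [6, 0, -2], [3.3, -.5, -.5]])
inv = np.linalg.inv(sqmat)

product = np.dot(sqmat, inv)              # should be (nearly) the identity
print(np.allclose(product, np.eye(3)))    # True
print(np.linalg.matrix_rank(sqmat))       # 3
```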

### Matrix Descriptions
Helper functions exist to help reshape arrays as well as determine the number of dimensions and size of the dimensions

In [None]:
a = np.array( [1,2,3,4,5,6,7,8,9,10,11,12]) #Create a 1d array with 12 elements, shape (12,)

In [None]:
print(np.shape(a)) # Get the size of each dimension of the array
print(np.ndim(a))  # Get how many dimensions there are

In [None]:
matA = a.reshape(3,4) #Reshape into a 3 x 4 matrix. Use elements row-wise

In [None]:
print(matA.shape)
print(matA.ndim)

In [None]:
matA = np.reshape(a,(3,4)) # Same as above, the shape must be passed as a tuple

In [None]:
#flatten does the reverse operation: it takes a matrix and returns its elements as a new 1d array
a = matA.flatten() 
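
The reshape and flatten steps can be checked as a round trip. The sketch below also shows what happens when the requested shape does not match the number of elements; the (5, 3) shape is just an example of a mismatch.

```python
import numpy as np

a = np.arange(1, 13)             # 12 elements, shape (12,)
matA = a.reshape(3, 4)           # filled row by row

print(matA[0].tolist())                      # [1, 2, 3, 4]
print(np.array_equal(matA.flatten(), a))     # True

try:
    a.reshape(5, 3)              # 15 slots for 12 elements
except ValueError as err:
    print("failed:", err)
```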

### Helper Functions to create Arrays

There are a number of helper functions to create arrays from scratch

In [None]:
a = np.zeros(4)
a

In [None]:
#Instead of a scalar, can pass in dimensions to create a multi-dimensional array
a = np.zeros( (4,4) ) 
a

In [None]:
#As mentioned above, a 1d array of size 4 is not the same as a 2d array of size 4x1
a = np.zeros(  4    )  #Create 1d array of size 4
b = np.zeros( (4,1) )  #Create 2d array of size 4 x 1
c = np.zeros( (1,4) )  #Create 2d array of size 1 x 4
print(a.shape, b.shape, c.shape)

In [None]:
a = np.ones( (4,4) )
a

`arange` returns a range similar to python's `range`, but enclosed in an np array. We can directly call reshape on the returned ndarray, just as we can on any other ndarray
(ndarray is the basic type of numpy arrays.)

In [None]:
a = np.arange(5**2).reshape(5,5)
a

In [None]:
#We can also flatten this matrix
a = np.arange(5**2).reshape(5,5)
a = a.flatten()

In [None]:
a = np.linspace(0,1,11) #Note that the boundaries ARE included in linspace
a 
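
The endpoint behaviour of `arange` and `linspace` differs, which a short comparison makes visible. The step and count below are chosen so that both produce quarter steps.

```python
import numpy as np

steps = np.arange(0, 1, 0.25)      # stop value 1 is excluded
points = np.linspace(0, 1, 5)      # stop value 1 is included

print(steps.tolist())              # [0.0, 0.25, 0.5, 0.75]
print(points.tolist())             # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Note that `arange` takes a step size, while the third argument of `linspace` is the number of points.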

In [None]:
np.random.random() #Gives a single number in the half-open interval [0.0, 1.0): 0.0 is possible, 1.0 is not

In [None]:
np.random.random(10) # Creates a vector of 10 elements

In [None]:
np.random.random( (3,4) ) #Creates a 3 x 4 matrix. Enclose dimensions in a tuple

In [None]:
#Creates a 2x2 matrix generated from normal distribution. Use varargs here, not tuple
np.random.randn( 2,2 ) 
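
The two calling conventions are easy to mix up, so here is a short sketch of both. `np.random.standard_normal` is shown as a tuple-based alternative to `randn`.

```python
import numpy as np

u = np.random.random((3, 4))            # shape passed as one tuple
n = np.random.randn(2, 2)               # shape passed as separate arguments
s = np.random.standard_normal((2, 2))   # tuple-based alternative to randn

print(u.shape, n.shape, s.shape)        # (3, 4) (2, 2) (2, 2)

try:
    np.random.randn((2, 2))             # a tuple is not accepted here
except TypeError as err:
    print("failed:", err)
```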

### Descriptive Stats
Through helper functions, you can obtain stats of arrays along multiple dimensions, as well as modify the data to your benefit

In [None]:
a = np.arange(5,10)

In [None]:
np.square(a)

In [None]:
np.sqrt(a)

In [None]:
print(np.mean(a), np.median(a), np.min(a), np.max(a), np.std(a))

In [None]:
print(a)
print(np.cumsum(a))

In [None]:
a = np.arange(25).reshape(5,5)
print(a.mean(axis=0)) #average per column
print(a.mean(axis=1)) #average per row
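
The `axis` argument is easier to follow on an array small enough to verify by hand. The 2 x 3 array `m` is a made-up example.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)

print(m.mean(axis=0).tolist())     # [2.5, 3.5, 4.5]  one value per column
print(m.mean(axis=1).tolist())     # [2.0, 5.0]       one value per row
print(m.mean())                    # 3.5              all elements
```

`axis=0` collapses the rows, leaving one value per column; `axis=1` collapses the columns, leaving one value per row.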

In [None]:
a = np.random.random(10) 
print(np.sort(a)) #Not an inplace sort. Returns a new np array
print(a)

In [None]:
a.sort() # Be careful, this will sort a in place
print(a)

In [None]:
np.random.shuffle(a) # This will shuffle your np array IN PLACE
print(a)

### Plotting with matplotlib
The `pyplot` submodule is located in the `matplotlib` package.
Remember to include `%matplotlib inline` if you want the graphs to plot from inside ipython notebook (else they get plotted to lala land)

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
a = np.random.random(10000)
f = plt.hist(a,bins=20)

In [None]:
a = np.random.randn(10000) #Normal distribution
f = plt.hist(a, bins=20)

In [None]:
print(a.mean(), a.var()) #Let's check again numerically. For a standard normal, mean ~ 0, var ~ 1
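
The same numerical check works for the uniform samples plotted first: a uniform distribution on [0, 1) has mean 0.5 and variance 1/12. This is a sketch that assumes a seeded `np.random.RandomState` for repeatability; the seed and the tolerance of 0.01 are arbitrary choices.

```python
import numpy as np

rng = np.random.RandomState(0)          # seeded generator (seed chosen arbitrarily)
u = rng.random_sample(100000)           # uniform samples in [0, 1)

print(u.mean(), u.var())                # close to 0.5 and 1/12
print(abs(u.mean() - 0.5) < 0.01)
print(abs(u.var() - 1.0 / 12) < 0.01)
```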

#### Stacking arrays
Often, you may need to append columns or add rows to an existing matrix.

`np.hstack` and `np.vstack` are functions to horizontally append columns and vertically add rows respectively

In [None]:
a = np.array([1,2,3])
b = np.array([4,5,6])
print(a)
print(b)

In [None]:
np.hstack( (a,b) ) #Define all the elements to stack in a tuple.

In [None]:
np.vstack( (a,b) )

In [None]:
a = np.array([ [1],[2],[3] ])
b = np.array([ [4],[5],[6] ])

In [None]:
np.hstack( (a,b) )

In [None]:
np.vstack( (a,b) )
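
The result shapes of the four stacking calls above can be summarised in one place. In this sketch, `col_a` and `col_b` are illustrative names for the column versions of `a` and `b`.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.hstack((a, b)).shape)            # (6,)   1d arrays are joined end to end
print(np.vstack((a, b)).shape)            # (2, 3) each 1d array becomes a row

col_a = a.reshape(3, 1)
col_b = b.reshape(3, 1)
print(np.hstack((col_a, col_b)).shape)    # (3, 2) columns side by side
print(np.vstack((col_a, col_b)).shape)    # (6, 1) one long column
```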

Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

# Exercise

**1a) Create a 4x5 array of integers between 2 and 21. Save it as `matA`**

In [None]:
import numpy as np
matA = np.arange(2,22).reshape(4,5)
matA

**1b) Create another 5x10 array of random numbers in the interval [0,1). Save it as `matB`**

In [None]:
#One direct way. Enclose your dimensions in a tuple: (5,10)
matB  = np.random.random( (5,10) )

#Another way. Generate 50 random # in a 1d array. Reshape into a 2d array of size 5 x 10
matB = np.random.random(50).reshape(5,10)

**1c) Obtain the dot product (matrix multiplication) of the two matrices. Save this as `matC`.**

In [None]:
matC = np.dot(matA,matB)
matC.shape

**1d) For some reason, analysts believe that `matB` and `matC` can be better used when their rows are combined. Vertically stack the two matrices, and save as `matD`**

In [None]:
matD = np.vstack( (matB, matC ) )
print(matB.shape, matC.shape, matD.shape)
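
The shapes in exercise 1 follow from the inner-dimension rule and from the fact that `vstack` needs matching column counts. Here is a sketch that traces them through, including the reversed product that does not work:

```python
import numpy as np

matA = np.arange(2, 22).reshape(4, 5)
matB = np.random.random((5, 10))

matC = np.dot(matA, matB)             # (4,5) . (5,10) -> (4,10)
matD = np.vstack((matB, matC))        # 5 rows + 4 rows, 10 columns each
print(matC.shape, matD.shape)         # (4, 10) (9, 10)

try:
    np.dot(matB, matA)                # (5,10) . (4,5): 10 != 4
except ValueError as err:
    print("failed:", err)
```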

**2a) Create a 6x6-matrix with ones on the diagonals - i.e. on (1,1), (2,2), (3,3) and zeroes everywhere else. Save as `matA`. Hint: Use eye**

In [None]:
matA = np.eye(6)
matA

**2b) Create another 6x6-matrix where all of the values are uniformly random values between 1 and 10. Save as `matB`**

In [None]:
#One way. Specify the low as 1, high (noninclusive) as 10, and the shape (6,6) in a tuple
#This draws integers from 1 through 9; pass 11 as the high if 10 should be included
matB = np.random.randint(1,10, (6,6) ) 

#Another way. i.e. Generate 36 random # between 1 and 10, and reshape the resulting ndarray
matB = np.random.randint(1,10,36).reshape(6,6)

**2c) Add `matA + matB`, save the result as `matC`, and determine the columnwise means, rowwise means, and overall mean of `matC`**

In [None]:
matC = matA + matB
colMeans = matC.mean(axis=0)
rowMeans = matC.mean(axis=1)
totMean  = matC.mean()
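
A quick way to check these results: adding the identity puts exactly one extra 1 in every row and every column, so each row mean and column mean rises by 1/6, and the overall mean rises by 6/36 = 1/6 as well. A sketch of that check, with `shift` as an illustrative name:

```python
import numpy as np

matA = np.eye(6)
matB = np.random.randint(1, 10, (6, 6))
matC = matA + matB

shift = 1.0 / 6    # one extra 1 spread over 6 entries per row/column
print(np.allclose(matC.mean(axis=0), matB.mean(axis=0) + shift))  # True
print(np.allclose(matC.mean(axis=1), matB.mean(axis=1) + shift))  # True
print(np.isclose(matC.mean(), matB.mean() + shift))               # True
```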

**3a) Create a 500x5 array through a normal distribution with a mean of 20 and variance of 100. Save it to a variable called  biggie**

In [None]:
#One way. np.random.normal takes the standard deviation, not the variance: sqrt(100) = 10
biggie = np.random.normal(20,10, (500,5) )

#Another way
biggie = np.random.normal(20,10,500*5).reshape(500,5)
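
The second argument of `np.random.normal` is the standard deviation (`scale`), so a variance of 100 corresponds to a scale of 10. This can be confirmed numerically; the sketch below assumes a seeded `np.random.RandomState`, and the name `sample` and the seed are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)             # seeded generator (seed chosen arbitrarily)
sample = rng.normal(20, 10, (500, 5))      # loc=20, scale=10 (standard deviation)

print(sample.shape)                        # (500, 5)
print(sample.mean(), sample.var())         # roughly 20 and roughly 100
```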

**3b) Plot a histogram of this distribution using 20 break points.**

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline
f = plt.hist(biggie.flatten(),bins=20) #Flatten first; a 2d input is plotted as one dataset per column

**3c) Determine the standard deviation of `biggie`. Now, for any value in `biggie` greater than `mean + 2 * std`, set it equal to `mean + 2 * std`. Any value less than `mean - 2 * std`, set it equal to `mean - 2 * std`.**

In [None]:
std = biggie.std()
std2Above = biggie.mean() + 2.0*std
std2Below = biggie.mean() - 2.0*std

biggie[ biggie > std2Above ] = std2Above
biggie[ biggie < std2Below ] = std2Below

#Be careful that you obtain std2Above and std2Below BEFORE modifying biggie.
#If you don't, then the mean and std will change as you calculate the limits.
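
As an alternative to the two boolean-mask assignments, `np.clip` applies both limits in one call. This is a sketch on freshly generated data, not the exercise's required solution; `values`, `lower`, `upper`, `masked`, and `clipped` are illustrative names.

```python
import numpy as np

rng = np.random.RandomState(0)             # seeded generator (seed chosen arbitrarily)
values = rng.normal(20, 10, (500, 5))

# Compute both limits before modifying anything.
lower = values.mean() - 2.0 * values.std()
upper = values.mean() + 2.0 * values.std()

# Approach from the exercise: boolean masks, applied to a copy.
masked = values.copy()
masked[masked > upper] = upper
masked[masked < lower] = lower

# Alternative: np.clip returns a new array limited to [lower, upper].
clipped = np.clip(values, lower, upper)
print(np.array_equal(masked, clipped))     # True
```

Because `np.clip` returns a new array by default, the original `values` array keeps its outliers.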

**3d) Replot a histogram of this distribution using 20 break points. What could a technique like 3c be used for in the real world?**

In [None]:
f = plt.hist(biggie.flatten(), bins=20) #This technique is called winsorising