## NumPy

NumPy and SciPy are modules in Python for scientific computing.  [NumPy](http://www.numpy.org) lets you do fast, vectorized operations on arrays.  Why use this module?  

* It gives you the performance of low-level code (e.g. C or Fortran) with the convenience of writing in an interpreted scripting language, so your code stays in native Python.
* It gives you a fast, memory-efficient multidimensional array called `ndarray`, which lets you perform vectorized operations and supports mathematical functions such as linear algebra and random number generation.
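
As a small illustration of the "vectorized" point above, the sketch below squares the same numbers once with a plain Python loop and once with a single array expression. The variable names are made up for this example.

```python
import numpy as np

values = list(range(1_000))

# Pure-Python approach: visit every element explicitly
squares_loop = [v * v for v in values]

# NumPy approach: one vectorized expression over the whole array
arr = np.array(values)
squares_vec = arr * arr

# Both approaches produce the same numbers
same = squares_vec.tolist() == squares_loop
print(same)
```

On large arrays the vectorized form is usually much faster, because the loop runs inside compiled code rather than in the Python interpreter.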

In [None]:
# Import NumPy
import numpy as np

To create a fast, multidimensional `ndarray` object, call the `np.array()` function on a Python `list` or `tuple`, or read the data from a file.
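
A minimal sketch of the tuple and file cases follows. It assumes whitespace-separated numeric text and uses `np.loadtxt`; the `io.StringIO` object is only a stand-in for a real file so the example is self-contained.

```python
import io
import numpy as np

# Build an array from a tuple instead of a list
from_tuple = np.array((1.5, 2.5, 3.5))

# StringIO stands in for a real file here; np.loadtxt also accepts a file path
fake_file = io.StringIO("1 2 3\n4 5 6")
from_file = np.loadtxt(fake_file)

print(from_tuple.shape, from_file.shape)
```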

In [None]:
x = np.array([1,2,3,4])
y = np.array([[1,2], [3,4]])
x

array([1, 2, 3, 4])

In [None]:
y

array([[1, 2],
       [3, 4]])

In [None]:
type(x)

numpy.ndarray

#### Properties of NumPy arrays
An `ndarray` object has a set of properties, such as its dimensions and its size.

Property | Description
--- | ----
`y.shape` (or `np.shape(y)`) | Shape of the array (its length along each dimension)
`y.size` (or `np.size(y)`) | Number of elements in the array
`y.ndim` | Number of dimensions


In [None]:
x.shape

(4,)

In [None]:
y.shape

(2, 2)
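
The remaining properties from the table can be checked the same way. This short sketch recreates `x` and `y` as defined earlier so that it runs on its own.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([[1, 2], [3, 4]])

# ndim: number of axes; size: total number of elements
print(x.ndim, x.size)  # 1 4
print(y.ndim, y.size)  # 2 4

# The function forms give the same answers as the attributes
print(np.shape(y), np.size(y))  # (2, 2) 4
```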

#### Other ways to generate NumPy arrays

Function | Description
--- | ---
`np.arange(start,stop,step)` | Create evenly spaced values from start up to (but not including) stop, in increments of step
`np.linspace(start,stop,num)` | Create num evenly spaced values between start and stop (both ends included)
`np.logspace(start, stop,num,base)` | Create num values evenly spaced on a log scale, from `base**start` to `base**stop`
`np.eye(n)` | Generate an n x n identity matrix

In [None]:
np.arange(0, 21, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])
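
The other generators in the table behave similarly. The following sketch uses small arguments chosen only for illustration.

```python
import numpy as np

# 5 evenly spaced points from 0 to 1, both ends included
lin = np.linspace(0, 1, 5)

# 4 points evenly spaced on a log scale: 10**0, 10**1, 10**2, 10**3
log = np.logspace(0, 3, 4, base=10)

# 3 x 3 identity matrix
identity = np.eye(3)

print(lin)
print(log)
print(identity)
```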

In addition, the `numpy.random` module can be used to create arrays of random numbers.

In [None]:
from numpy import random

Function | Description
--- | ---
`np.random.randint(a, b, N)` | Generate N random integers from a (inclusive) to b (exclusive)
`np.random.rand(n, m)` | Generate uniform random numbers in [0, 1) of dim n x m
`np.random.randn(n, m)` | Generate standard normal random numbers of dim n x m


In [None]:
np.random.randint(1, 100, 50)

array([96,  1, 37, 25, 19, 90, 56, 88, 78, 49, 53, 78, 17, 93, 25, 85, 56,
       96, 44, 97, 67, 79, 56, 82, 90, 25, 38, 23, 51, 41,  1, 37, 23, 62,
        5, 65, 61, 88, 27, 21,  8, 80, 72, 54, 62, 23,  2, 53, 76, 48])
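
Because these values are random, the output above will differ on every run. One way to make results repeatable is to seed the generator first, as in this sketch. The seed value `0` is arbitrary.

```python
import numpy as np

# Seeding makes the sequence of random numbers repeatable
np.random.seed(0)
a = np.random.rand(2, 3)

np.random.seed(0)
b = np.random.rand(2, 3)

print(np.array_equal(a, b))  # same seed, same numbers

# randint includes the lower bound and excludes the upper bound
ints = np.random.randint(1, 100, 50)
print(ints.min() >= 1, ints.max() <= 99)

# Standard normal samples with the requested shape
normals = np.random.randn(3, 2)
print(normals.shape)
```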

#### Reshaping, resizing and stacking NumPy arrays

To reshape an array, use `reshape()`:

In [None]:
z = np.random.rand(4,4)
z 

array([[0.32892669, 0.27762393, 0.9979168 , 0.95699732],
       [0.65075288, 0.4517447 , 0.46586364, 0.62372789],
       [0.10203712, 0.86865727, 0.97203646, 0.61119228],
       [0.97635129, 0.14012493, 0.65334615, 0.49616653]])

In [None]:
z.shape

(4, 4)

In [None]:
z.reshape((8,2)) # dim is now 8 x 2

array([[0.32892669, 0.27762393],
       [0.9979168 , 0.95699732],
       [0.65075288, 0.4517447 ],
       [0.46586364, 0.62372789],
       [0.10203712, 0.86865727],
       [0.97203646, 0.61119228],
       [0.97635129, 0.14012493],
       [0.65334615, 0.49616653]])

To flatten an array (convert a higher dimensional array into a vector), use `flatten()`:

In [None]:
z.flatten()

array([0.32892669, 0.27762393, 0.9979168 , 0.95699732, 0.65075288,
       0.4517447 , 0.46586364, 0.62372789, 0.10203712, 0.86865727,
       0.97203646, 0.61119228, 0.97635129, 0.14012493, 0.65334615,
       0.49616653])
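
Stacking is not shown above, so here is a brief sketch using `np.vstack` and `np.hstack`, together with two reshaping details: passing `-1` to `reshape()` lets NumPy infer that dimension, and `flatten()` returns a copy. The small array `a` is made up for this example.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))

# -1 asks NumPy to infer the missing dimension (here: 2)
b = a.reshape((3, -1))

# Stack copies of a on top of each other or side by side
v = np.vstack((a, a))  # shape (4, 3)
h = np.hstack((a, a))  # shape (2, 6)

# flatten() returns a copy, so changing it leaves a untouched
flat = a.flatten()
flat[0] = 99

print(b.shape, v.shape, h.shape, a[0, 0])
```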

## Operating on NumPy arrays

#### Assigning values
To assign a value to a specific element in an `ndarray`, use the assignment operator.

In [None]:
y = np.array([[1,2], [3,4]])
y.shape

(2, 2)

In [None]:
y[0,0] = 10
y 

array([[10,  2],
       [ 3,  4]])
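
Assignment also works on whole rows, columns, or slices at once. This sketch starts from a fresh copy of `y` so that it runs independently of the cells above.

```python
import numpy as np

y = np.array([[1, 2], [3, 4]])

# Set every element of the first row to the same value
y[0, :] = 0

# Replace the second column with new values
y[:, 1] = [7, 8]

print(y)
```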

#### Indexing and slicing arrays
To extract elements of a NumPy array, use the bracket operator and the slice (i.e. colon) operator. To slice specific elements in the array, use `dat[lower:upper:step]`. To extract the diagonal (and subdiagonal) elements, use `np.diag()`.

In [None]:
# random samples from a uniform distribution between 0 and 1
dat = np.random.rand(4,4)
dat

array([[0.12763063, 0.2188284 , 0.38157378, 0.3034226 ],
       [0.84403395, 0.25928363, 0.78458648, 0.0582864 ],
       [0.03810382, 0.00448945, 0.43105533, 0.22002795],
       [0.17735295, 0.49247637, 0.01372686, 0.56948032]])

In [None]:
dat[0, :] # row 1

array([0.12763063, 0.2188284 , 0.38157378, 0.3034226 ])

In [None]:
dat[:, 0] # column 1

array([0.12763063, 0.84403395, 0.03810382, 0.17735295])

In [None]:
dat[0:3:2, 0] # first and third elements in column 1

array([0.12763063, 0.03810382])

In [None]:
np.diag(dat) # diagonal

array([0.12763063, 0.25928363, 0.43105533, 0.56948032])
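
For the off-diagonal elements, `np.diag()` accepts an offset `k`. The sketch below uses a small integer matrix (an illustrative stand-in for `dat`) so the selected elements are easy to follow.

```python
import numpy as np

m = np.arange(16).reshape((4, 4))

main_diag = np.diag(m)         # k=0: main diagonal
super_diag = np.diag(m, k=1)   # one above the main diagonal
sub_diag = np.diag(m, k=-1)    # one below the main diagonal

print(main_diag)   # [ 0  5 10 15]
print(super_diag)  # [ 1  6 11]
print(sub_diag)    # [ 4  9 14]
```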

In [None]:
np.arange(32).reshape((8, 4)) # returns an 8 x 4 array

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [None]:
x[0] # returns the first element of the 1-D array x

1

#### Element-wise transformations on arrays
There are many vectorized wrappers around functions that take one or more scalars and produce one or more scalars (e.g. `np.exp()`, `np.sqrt()`). These element-wise array functions are also known as NumPy `ufuncs`.

Function | Description 
--- | --- 
`np.abs(x)` | absolute value of each element
`np.sqrt(x)` | square root of each element
`np.square(x)` | square of each element
`np.exp(x)` | exponential of each element
`np.maximum(x, y)` | element-wise maximum from two arrays x and y
`np.minimum(x,y)` | element-wise minimum
`np.sign(x)` | compute the sign of each element: 1 (pos), 0 (zero), -1 (neg)
`np.subtract(x, y)` | subtract elements in y from elements in x
`np.power(x, y)` | raise elements in first array x to powers in second array y
`np.where(cond, x, y)` | element-wise if-else: take the value from x where cond is true, otherwise from y
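
A few of the functions in the table, applied to two small arrays chosen for illustration:

```python
import numpy as np

x = np.array([-4.0, 0.0, 9.0])
y = np.array([1.0, 2.0, 3.0])

abs_x = np.abs(x)               # [4. 0. 9.]
sign_x = np.sign(x)             # [-1.  0.  1.]
root = np.sqrt(np.abs(x))       # [2. 0. 3.]
larger = np.maximum(x, y)       # [1. 2. 9.]

# Take x where it is positive, otherwise fall back to y
picked = np.where(x > 0, x, y)  # [1. 2. 9.]

print(abs_x, sign_x, root, larger, picked)
```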



In [None]:
import numpy as np
import pandas as pd
from sklearn import datasets

In [None]:
iris = datasets.load_iris()

In [None]:
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [None]:
X = iris.data
y = iris.target
X.shape, y.shape

((150, 4), (150,))

In [None]:
# Print target names
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [None]:
print(iris.DESCR)

In [None]:
# Print columns names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [None]:
iris.filename

'/usr/local/lib/python3.6/dist-packages/sklearn/datasets/data/iris.csv'

In [None]:
X.shape, type(X), y.shape, type(y)

((150, 4), numpy.ndarray, (150,), numpy.ndarray)

In [None]:
X

In [None]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [None]:
X[:5,1], X[:5,1].sum(), X[:5,1].mean(), np.median(X[:5,1])

(array([3.5, 3. , 3.2, 3.1, 3.6]), 16.4, 3.28, 3.2)
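
Reductions such as `sum()` and `mean()` can also run along one axis of a 2-D array. The sketch below hard-codes the first five rows of the iris data so that it does not depend on scikit-learn being installed.

```python
import numpy as np

# First five rows of the iris feature matrix
first_rows = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# axis=0 collapses the rows: one mean per column (feature)
col_means = first_rows.mean(axis=0)

# axis=1 collapses the columns: one sum per row (flower)
row_sums = first_rows.sum(axis=1)

print(col_means)
print(row_sums)
```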

In [None]:
df = pd.DataFrame(iris.data)
df.head()

|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [None]:
df.columns = iris.feature_names
df.head()

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
| --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [None]:
df['target'] = iris.target
df.head()

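|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |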


# Summary Statistics

Let's explore this data set.  We use `describe()` to get basic summary statistics for each of the columns. 

In [None]:
df.describe()

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| --- | --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |
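
The 25%, 50% and 75% rows are percentiles. As a rough sketch of how such values can be computed directly, the example below applies `np.percentile` to the first five sepal widths only, so the numbers differ from the full-data table above.

```python
import numpy as np

# Sepal widths of the first five flowers
sepal_width = np.array([3.5, 3.0, 3.2, 3.1, 3.6])

# Quartiles, using NumPy's default linear interpolation
q25, q50, q75 = np.percentile(sepal_width, [25, 50, 75])

print(sepal_width.min(), q25, q50, q75, sepal_width.max())
```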


$$\text{mean} = \frac{\text{sum of the terms}}{\text{number of terms}}$$

In [None]:
df.mean()

sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64

$$\text{variance} = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

In [None]:
df.var() # variance

sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
target               0.671141
dtype: float64
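
Note the `n-1` in the denominator: pandas `var()` computes the sample variance by default (`ddof=1`), whereas NumPy's `np.var()` divides by `n` unless told otherwise. The sketch below checks both against a manual calculation on the first five sepal widths, a small subset chosen for illustration.

```python
import numpy as np

sepal_width = np.array([3.5, 3.0, 3.2, 3.1, 3.6])
n = sepal_width.size

# Squared deviations from the sample mean
squared_dev = (sepal_width - sepal_width.mean()) ** 2

sample_var = squared_dev.sum() / (n - 1)  # matches the formula above
population_var = squared_dev.sum() / n    # NumPy's default behaviour

print(np.isclose(sample_var, np.var(sepal_width, ddof=1)))
print(np.isclose(population_var, np.var(sepal_width)))
```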

$$corr(x,y) = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\displaystyle\sum(x_i-\bar{x})^2\displaystyle\sum(y_i-\bar{y})^2}}$$

In [None]:
df.corr() # correlation

|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
| --- | --- | --- | --- | --- | --- |
| sepal length (cm) | 1.0 | -0.11757 | 0.871754 | 0.817941 | 0.782561 |
| sepal width (cm) | -0.11757 | 1.0 | -0.42844 | -0.366126 | -0.426658 |
| petal length (cm) | 0.871754 | -0.42844 | 1.0 | 0.962865 | 0.949035 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.0 | 0.956547 |
| target | 0.782561 | -0.426658 | 0.949035 | 0.956547 | 1.0 |
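
To connect the correlation formula to code, this sketch evaluates it term by term for two columns and compares the result with `np.corrcoef`. It uses only the first five rows of the data, so the value will not match the full-data table above.

```python
import numpy as np

sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0])
sepal_width = np.array([3.5, 3.0, 3.2, 3.1, 3.6])

# Deviations from each column's mean
dx = sepal_length - sepal_length.mean()
dy = sepal_width - sepal_width.mean()

# Numerator and denominator of the correlation formula
manual_corr = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is corr(x, y)
library_corr = np.corrcoef(sepal_length, sepal_width)[0, 1]

print(np.isclose(manual_corr, library_corr))
```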
