Python's built-in data structures are great for general-purpose programming, but they lack some specialized features we'd like for data analysis. For example, adding rows or columns of data element-wise and performing math operations on two-dimensional tables (matrices) are common tasks that Python's base data types don't readily support.

### Numpy and Array Basics
The numpy library is one of the core packages in Python's data science software stack. Many other Python data analysis libraries require numpy as a prerequisite, because they use its array data structure as a building block. The Kaggle Python environment has numpy available by default; if you are running Python locally, the Anaconda Python distribution comes with numpy as well.

Numpy implements a data structure called the N-dimensional array or ndarray. ndarrays are similar to lists in that they contain a collection of items that can be accessed via indexes. Unlike lists, ndarrays are homogeneous, meaning they can only contain objects of the same type, and they can be multi-dimensional, making it easy to store 2-dimensional tables or matrices.

To work with ndarrays, we need to load the numpy library. It is standard practice to load numpy with the alias "np" like so:

In [1]:
import numpy as np

Create an ndarray by passing a list to np.array():

In [3]:
my_list = [1,2,3,4]
my_array = np.array(my_list)
type(my_array)

numpy.ndarray

In [7]:
second_list = [5, 6, 7, 8]
two_d_array = np.array([my_list, second_list])
print(two_d_array)

[[1 2 3 4]
 [5 6 7 8]]

Check the dimensions, total number of elements and data type of an ndarray with the shape, size and dtype attributes:


In [9]:
two_d_array.shape

(2, 4)

In [11]:
two_d_array.size

8

In [13]:
two_d_array.dtype

dtype('int64')
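
Because ndarrays are homogeneous, numpy has to pick a single dtype when the input list mixes types. The following is a minimal sketch of that behavior; the variable names are illustrative only and do not appear elsewhere in this lesson.

```python
import numpy as np

# Mixing ints and a float: numpy upcasts every element to float64
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)       # float64

# Mixing numbers and a string: every element is stored as a string
mixed_types = np.array([1, "two", 3.0])
print(mixed_types.dtype.kind)    # U (a unicode string type)
```

If an array's dtype is not what you expect, checking the input for stray strings or floats is a reasonable first step.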

Numpy has a variety of special array creation functions. Some handy ones include:

In [16]:
# np.identity() to create a square 2d array with 1's across the diagonal

np.identity(n = 5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [28]:
# np.eye() to create a 2d array with 1's across a specified diagonal

np.eye(
    N = 3,   # Number of rows
    M = 5,   # Number of columns
    k = 0    # Index of the diagonal (main diagonal (0) is default)
)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.]])
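
The k argument is easier to understand with a non-default value. As a small sketch, setting k = 1 shifts the ones to the diagonal just above the main one (the variable name here is illustrative):

```python
import numpy as np

# k=1 moves the ones to the first diagonal above the main diagonal
upper_diagonal = np.eye(N=3, M=5, k=1)
print(upper_diagonal)
```

Negative values of k shift the ones below the main diagonal instead.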

In [30]:
# np.ones() to create an array filled with ones:

np.ones(shape= [2,4])

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [32]:
# np.zeros() to create an array filled with zeros:

np.zeros(shape= [4,6])

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

### Array Indexing and Slicing
Numpy ndarrays offer numbered indexing and slicing syntax that mirrors the syntax for Python lists:

In [37]:
one_d_array = np.array([1,2,3,4,5,6])

print(one_d_array[3])        # Get the item at index 3

4


In [39]:
one_d_array[3:]       # Get a slice from index 3 to the end

array([4, 5, 6])

In [41]:
one_d_array[::-1]     # Slice backwards to reverse the array

array([6, 5, 4, 3, 2, 1])

In [43]:
# Create a new 2d array
two_d_array = np.array([one_d_array, one_d_array + 6, one_d_array + 12])

print(two_d_array) 

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]
 [13 14 15 16 17 18]]


In [45]:
# Get the element at row index 1, column index 4

two_d_array[1, 4]

np.int64(11)

In [47]:
# Slice elements starting at row index 1 and column index 4

two_d_array[1:, 4:]

array([[11, 12],
       [17, 18]])

In [49]:
two_d_array[::-1, ::-1]     # Reverse both the rows and the columns

array([[18, 17, 16, 15, 14, 13],
       [12, 11, 10,  9,  8,  7],
       [ 6,  5,  4,  3,  2,  1]])
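
Although the slicing syntax mirrors lists, one behavioral difference is worth knowing: a basic slice of an ndarray is a view of the original data, while a list slice is an independent copy. The sketch below illustrates this with made-up variable names.

```python
import numpy as np

numbers_list = [1, 2, 3, 4, 5, 6]
numbers_array = np.array(numbers_list)

list_slice = numbers_list[3:]     # A list slice is a new, independent list
array_slice = numbers_array[3:]   # An array slice is a view of the same data

list_slice[0] = 100
array_slice[0] = 100

print(numbers_list)    # [1, 2, 3, 4, 5, 6]  (original list unchanged)
print(numbers_array)   # [  1   2   3 100   5   6]  (original array changed)

# Use .copy() when an independent array is needed
independent_slice = numbers_array[3:].copy()
```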

Reshape an array into a new array with the same data but a different structure with np.reshape():

In [53]:
np.reshape(two_d_array,   # Array to reshape
           (6, 3))        # Dimensions of the new array


array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])
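
A reshape only works when the new dimensions hold exactly the same number of elements as the original. As a hedged sketch, the example below also uses -1 for one dimension, which asks numpy to infer that dimension from the array's size; the array is rebuilt here so the block runs on its own.

```python
import numpy as np

two_d_array = np.array([[1, 2, 3, 4, 5, 6],
                        [7, 8, 9, 10, 11, 12],
                        [13, 14, 15, 16, 17, 18]])

# -1 lets numpy work out the missing dimension (18 elements / 2 rows = 9 columns)
inferred = np.reshape(two_d_array, (2, -1))
print(inferred.shape)    # (2, 9)

# 18 elements cannot fill a 4 x 5 array, so this raises a ValueError
try:
    np.reshape(two_d_array, (4, 5))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)    # True
```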

Unravel a multi-dimensional array into 1 dimension with np.ravel():

In [56]:
np.ravel(two_d_array,
         order='C')         # Use C-style unraveling (by rows)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])

Alternatively, use ndarray.flatten() to flatten a multi-dimensional array into 1 dimension and return a copy of the result:

In [59]:
two_d_array.flatten()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18])
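
The two functions give the same values, so the practical difference is whether the result shares memory with the original. A minimal sketch, using an illustrative array named grid:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

raveled = np.ravel(grid)      # Returns a view when the memory layout allows it
flattened = grid.flatten()    # Always returns a copy

print(np.shares_memory(grid, raveled))     # True
print(np.shares_memory(grid, flattened))   # False

raveled[0] = 99               # Also changes grid[0, 0]
print(grid[0, 0])             # 99
```

Prefer flatten() when you plan to modify the result and want the original array left alone.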

Get the transpose of an array with ndarray.T:

In [62]:
two_d_array.T

array([[ 1,  7, 13],
       [ 2,  8, 14],
       [ 3,  9, 15],
       [ 4, 10, 16],
       [ 5, 11, 17],
       [ 6, 12, 18]])

Flip an array vertically or horizontally with np.flipud() and np.fliplr() respectively:

In [65]:
np.flipud(two_d_array)

array([[13, 14, 15, 16, 17, 18],
       [ 7,  8,  9, 10, 11, 12],
       [ 1,  2,  3,  4,  5,  6]])
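
For completeness, here is a short sketch of the horizontal counterpart, np.fliplr(), which reverses the order of the columns instead of the rows. The array is rebuilt so the block runs on its own.

```python
import numpy as np

two_d_array = np.array([[1, 2, 3, 4, 5, 6],
                        [7, 8, 9, 10, 11, 12],
                        [13, 14, 15, 16, 17, 18]])

# Reverse the order of the columns (left/right flip)
flipped_horizontally = np.fliplr(two_d_array)
print(flipped_horizontally)
```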

Join arrays along an axis with np.concatenate():

In [72]:
array_to_join = np.array([[10,20,30],[40,50,60],[70,80,90]])

np.concatenate( (two_d_array,array_to_join),  # Arrays to join
               axis=1)                        # Axis to join upon

array([[ 1,  2,  3,  4,  5,  6, 10, 20, 30],
       [ 7,  8,  9, 10, 11, 12, 40, 50, 60],
       [13, 14, 15, 16, 17, 18, 70, 80, 90]])
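
Joining along axis 1 requires the arrays to have the same number of rows; joining along axis 0 requires the same number of columns. The sketch below shows the axis 0 case and what happens when the shapes do not line up. The names extra_row and too_narrow are illustrative.

```python
import numpy as np

two_d_array = np.array([[1, 2, 3, 4, 5, 6],
                        [7, 8, 9, 10, 11, 12],
                        [13, 14, 15, 16, 17, 18]])

# A 1 x 6 array can be stacked underneath a 3 x 6 array
extra_row = np.array([[19, 20, 21, 22, 23, 24]])
stacked = np.concatenate((two_d_array, extra_row), axis=0)
print(stacked.shape)     # (4, 6)

# A 3 x 3 array has the wrong number of columns for axis 0
too_narrow = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])
try:
    np.concatenate((two_d_array, too_narrow), axis=0)
    join_failed = False
except ValueError:
    join_failed = True
print(join_failed)       # True
```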

### Array Math Operations

Creating and manipulating arrays is nice, but the true power of numpy arrays is the ability to perform mathematical operations on many values quickly and easily. Unlike built-in Python lists, ndarrays support math operators like +, -, / and *, which perform basic math operations on every element:

In [76]:
two_d_array + 100    # Add 100 to each element

array([[101, 102, 103, 104, 105, 106],
       [107, 108, 109, 110, 111, 112],
       [113, 114, 115, 116, 117, 118]])

In [78]:
two_d_array * 2      # Multiply each element by 2

array([[ 2,  4,  6,  8, 10, 12],
       [14, 16, 18, 20, 22, 24],
       [26, 28, 30, 32, 34, 36]])

In [80]:
two_d_array % 2       # Take each element modulo 2

array([[1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0],
       [1, 0, 1, 0, 1, 0]])

In [82]:
small_array1 = np.array([[1,2],[3,4]])

small_array1 + small_array1    # Add two arrays of the same shape element-wise

array([[2, 4],
       [6, 8]])

In [84]:
small_array1 - small_array1    # Subtract element-wise

array([[0, 0],
       [0, 0]])

In [86]:
small_array1 ** small_array1   # Raise each element to the power of the matching element

array([[  1,   4],
       [ 27, 256]])
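
To make the contrast with built-in lists concrete, the sketch below applies the same operators to a list and to an ndarray holding the same values. The variable names are illustrative.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

print(values_list + values_list)     # [1, 2, 3, 1, 2, 3]  (concatenation)
print(values_array + values_array)   # [2 4 6]  (element-wise addition)

print(values_list * 2)               # [1, 2, 3, 1, 2, 3]  (repetition)
print(values_array * 2)              # [2 4 6]  (element-wise multiplication)
```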

Numpy also offers a variety of named math functions for ndarrays. There are too many to cover in detail here, so we'll just look at a selection of some of the most useful ones for data analysis:

In [89]:
# Get the mean of all the elements in an array with np.mean()

np.mean(two_d_array)

np.float64(9.5)

In [91]:
# Provide an axis argument to get means across a dimension

np.mean(two_d_array,
        axis = 1)     # Get means of each row

array([ 3.5,  9.5, 15.5])

In [93]:
# Use axis = 0 to get means down the other dimension

np.mean(two_d_array,
        axis = 0)     # Get means of each column

array([ 7.,  8.,  9., 10., 11., 12.])

In [95]:
# Get the standard deviation of all the elements in an array with np.std()

np.std(two_d_array)

np.float64(5.188127472091127)

In [97]:
# Provide an axis argument to get standard deviations across a dimension

np.std(two_d_array,
       axis = 0)      # Get stdev for each column

array([4.89897949, 4.89897949, 4.89897949, 4.89897949, 4.89897949,
       4.89897949])
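
Note that np.std() computes the population standard deviation by default, dividing by n. For a sample standard deviation, which divides by n - 1, pass ddof = 1. A small sketch using the first column of the array above (1, 7, 13):

```python
import numpy as np

column = np.array([1, 7, 13])

population_std = np.std(column)        # Divides by n (ddof=0 is the default)
sample_std = np.std(column, ddof=1)    # Divides by n - 1

print(population_std)   # Approximately 4.899, matching the column result above
print(sample_std)       # 6.0
```

Other libraries may use a different default, so it is worth checking which version you are getting when comparing results.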

In [99]:
np.sum(two_d_array,
       axis=0)        # Get the column sums

array([21, 24, 27, 30, 33, 36])

In [101]:
# Take the natural log of each element in an array with np.log()

np.log(two_d_array)

array([[0.        , 0.69314718, 1.09861229, 1.38629436, 1.60943791,
        1.79175947],
       [1.94591015, 2.07944154, 2.19722458, 2.30258509, 2.39789527,
        2.48490665],
       [2.56494936, 2.63905733, 2.7080502 , 2.77258872, 2.83321334,
        2.89037176]])

In [103]:
# Take the square root of each element with np.sqrt()

np.sqrt(two_d_array)

array([[1.        , 1.41421356, 1.73205081, 2.        , 2.23606798,
        2.44948974],
       [2.64575131, 2.82842712, 3.        , 3.16227766, 3.31662479,
        3.46410162],
       [3.60555128, 3.74165739, 3.87298335, 4.        , 4.12310563,
        4.24264069]])

In [105]:
# Take the vector dot product of row 0 and row 1

np.dot(two_d_array[0,0:],  # Slice row 0
       two_d_array[1,0:])  # Slice row 1

np.int64(217)

In [107]:
# Do a matrix multiply

np.dot(small_array1, small_array1)

array([[ 7, 10],
       [15, 22]])
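
Matrix multiplication is easy to confuse with the element-wise * operator shown earlier. As a brief sketch, the @ operator gives the same result as np.dot() for 2d arrays, while * multiplies matching elements:

```python
import numpy as np

small_array1 = np.array([[1, 2], [3, 4]])

matrix_product = small_array1 @ small_array1        # Same as np.dot() for 2d arrays
elementwise_product = small_array1 * small_array1   # Multiplies matching elements

print(matrix_product)        # Rows: [7, 10] and [15, 22]
print(elementwise_product)   # Rows: [1, 4] and [9, 16]
```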