Data can come in many forms. Here are some examples to think about:

* Collections of documents
* Collections of images
* Collections of sound clips
* Collections of measurements
* etc.

Despite this heterogeneity, all data can fundamentally be thought of as arrays of numbers.

* Image: a two-dimensional grid of pixels, with 3 (RGB) or 4 (ARGB) values per pixel
* Sound clip: a one-dimensional array of intensity versus time

**Important**: No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers. Efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science.

Reference: 
* https://jakevdp.github.io/PythonDataScienceHandbook
* https://github.com/jakevdp/PythonDataScienceHandbook

In [1]:
import numpy
numpy.__version__

'1.19.5'

In [3]:
# most people in the SciPy/PyData world will import NumPy using "np"
import numpy as np
np.__version__

'1.19.5'

In [3]:
# to get help on NumPy in IPython/Jupyter, use "np?"
np?

## Understanding Data Types in Python

This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this. 


In [5]:
# the beauty of a dynamically typed programming language
result = 0
for i in range(100):
  result += i

print(result)

# a variable's type is not fixed; it can change on reassignment
x = 0
x = 'four'

4950


### A Python Integer Is More Than Just an Integer

The standard Python implementation is written in C. This means that every Python object is simply a cleverly disguised C structure, which contains not only its value, but other information as well. 

This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C.
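
The overhead can be observed directly. The following is a minimal sketch, assuming the CPython interpreter; the exact byte counts depend on the platform and Python version, so none are hard-coded here. The variable names are illustrative only.

```python
import sys
import numpy as np

# Size in bytes of a complete Python integer object
# (the value plus type information, reference count, and so on)
py_int_size = sys.getsizeof(1)

# Size in bytes of one raw 64-bit integer as stored inside a NumPy array
np_int_size = np.dtype(np.int64).itemsize

# Python integers have arbitrary precision, so a larger value needs more space
big_int_size = sys.getsizeof(2**100)

print("Python int object:", py_int_size, "bytes")
print("NumPy int64 item: ", np_int_size, "bytes")
print("Python int 2**100:", big_int_size, "bytes")
```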

### A Python List Is More Than Just a List

Let’s consider now what happens when we use a Python data structure that holds many Python objects. The standard mutable multielement container in Python is the list.

In [7]:
L = list(range(100))
print(L)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


In [8]:
print(type(L))

<class 'list'>


In [10]:
print(type(L[0]))

<class 'int'>


In [11]:
L2 = [str(c) for c in L]
print(L2)
print(type(L2[0]))

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99']
<class 'str'>


In [12]:
# Because of Python’s dynamic typing, we can even create heterogeneous lists
L3 = [True, '2', 3.0, 4]
L3_type = [type(item) for item in L3]
print(L3)
print(L3_type)


[True, '2', 3.0, 4]
[<class 'bool'>, <class 'str'>, <class 'float'>, <class 'int'>]


But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information—that is, each item is a complete Python object.

So it can be much more efficient to store data in a fixed-type array.

A NumPy array is a *fixed-type* array, which makes it much more efficient.
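
To make the cost concrete, here is a rough sketch that compares the memory used by a list of integers with that of an equivalent NumPy array. It assumes CPython, and the list estimate is only approximate (it adds the size of the list's pointer table to the size of every element object).

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The list holds pointers; each element is a separate, complete Python object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# The array holds the raw values in one contiguous buffer: itemsize * size
array_bytes = np_arr.nbytes

print("list (approx.):", list_bytes, "bytes")
print("ndarray data:  ", array_bytes, "bytes")
```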

### Fixed-Type Arrays in Python

Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type:

In [14]:
import array

L = list(range(10))
A = array.array('i', L) # 'i' is a type code indicating the contents are integers

print(A)

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


Much more useful, however, is the ndarray object of the NumPy package. While Python’s array object provides efficient storage of array-based data, NumPy adds to this efficient operations on that data. 
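
A small sketch of that difference: with the built-in array, the `*` operator repeats the sequence (as it does for lists), while with a NumPy array it is applied to every element.

```python
import array
import numpy as np

A = array.array('i', range(5))
N = np.array(A)  # copy the same values into a NumPy array

# For array.array, * means repetition of the whole sequence
repeated = A * 2

# For a NumPy array, * is an element-wise arithmetic operation
doubled = N * 2

print(repeated)
print(doubled)
```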

### Creating Arrays from Python Lists

Start with the standard NumPy import, under the alias np:


In [17]:
import numpy as np

# integer array
np.array([1, 4, 2, 5, 3])

array([1, 4, 2, 5, 3])

Remember that unlike Python lists, NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will **upcast** if possible (here, integers are upcast to floating point):

In [16]:
np.array([3.14, 4, 2, 3])

array([3.14, 4.  , 2.  , 3.  ])
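
The upcasting can be checked by inspecting the `dtype` attribute of the result. This is a short sketch; the third array, which mixes numbers with a string, is an extra illustration that is not part of the original notes.

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])
with_str = np.array([1, 2.5, 'three'])

print(ints.dtype)      # an integer type; the exact width depends on the platform
print(mixed.dtype)     # float64: the integers were upcast to floating point
print(with_str.dtype)  # a Unicode string type: every value was converted to text
print(with_str)
```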

In [18]:
# explicitly set the data type of the resulting array
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

Finally, unlike Python lists, NumPy arrays can explicitly be multidimensional; here’s one way of initializing a multidimensional array using a list of lists:

In [20]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 6) for i in [2, 4, 6]])

array([[ 2,  3,  4,  5,  6,  7],
       [ 4,  5,  6,  7,  8,  9],
       [ 6,  7,  8,  9, 10, 11]])

### Creating Arrays from Scratch

Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.

In [21]:
# create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [22]:
# create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [23]:
# create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [24]:
# create an array filled with a linear sequence
# starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [25]:
# create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
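
The two routines above differ in how they treat the end of the interval. A brief sketch of the contrast, using the `endpoint` parameter of `np.linspace`:

```python
import numpy as np

# arange takes a step size and excludes the stop value
a = np.arange(0, 20, 2)

# linspace takes a number of points and includes the stop value by default
b = np.linspace(0, 1, 5)

# endpoint=False makes linspace exclude the stop value as well
c = np.linspace(0, 1, 5, endpoint=False)

print(a)
print(b)
print(c)
```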

In [26]:
# create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.75901071, 0.46595166, 0.9013545 ],
       [0.47276073, 0.4059979 , 0.6234351 ],
       [0.18812559, 0.09650019, 0.74786742]])

In [27]:
# create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[-0.20917477, -0.55849793,  0.71047601],
       [ 0.61411326, -0.80282626, -1.5417945 ],
       [ 0.87953122,  0.31011075,  1.1078033 ]])

In [28]:
np.random.normal(0, 10, (3, 3))

array([[ 2.25598989, 18.76435839, -2.90220591],
       [-3.74801385,  4.92189167, -1.58842967],
       [ 5.71005728, -4.49953657, -8.86847028]])

In [29]:
np.random.normal(0.25, 1, (3, 3))

array([[-0.6925484 , -1.00791755,  0.27267914],
       [ 0.21651026, -2.38423098, -0.41996225],
       [-0.53815537, -0.70858447, -1.00990208]])

In [32]:
# create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[0, 1, 2],
       [3, 8, 5],
       [4, 2, 1]])
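
The random arrays above change on every run. As a sketch of an alternative not used in these notes, NumPy 1.17 and later also offer the `Generator` interface via `np.random.default_rng`, which takes a seed so results can be reproduced. The seed value 42 below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen for illustration

uniform = rng.random((3, 3))            # uniform values in [0, 1)
normal = rng.normal(0, 1, (3, 3))       # mean 0, standard deviation 1
integers = rng.integers(0, 10, (3, 3))  # integers in [0, 10)

# A second generator with the same seed reproduces the same first draw
rng_again = np.random.default_rng(42)
uniform_again = rng_again.random((3, 3))

print(np.array_equal(uniform, uniform_again))
```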

In [33]:
# create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [34]:
# create an uninitialized array of three values (float64 by default);
# the values will be whatever happens to already exist at that
# memory location
np.empty(3)

array([1., 1., 1.])

### NumPy Standard Data Types

NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.

Reference: 
* https://numpy.org/devdocs/user/basics.types.html
* https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html


In [35]:
# specify them using a string:
np.zeros(10, dtype='int16')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)

In [36]:
# Or using the associated NumPy object
np.zeros(10, dtype=np.int16)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)
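
One limitation worth knowing is the fixed range of each integer type. The sketch below uses `np.iinfo` to query the range of `int16` and shows that array arithmetic beyond the maximum wraps around instead of growing, unlike Python's own integers.

```python
import numpy as np

info = np.iinfo(np.int16)
print(info.min, info.max)  # -32768 32767

top = np.array([info.max], dtype=np.int16)
one = np.array([1], dtype=np.int16)

# Adding 1 to the largest int16 value overflows and wraps to the smallest
wrapped = top + one
print(wrapped, wrapped.dtype)
```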

## The Basics of NumPy Arrays

Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array. 

*NumPy arrays are the building blocks of many other examples used throughout the book.*

### NumPy Array Attributes

Let's start by defining three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array. The random *seed* is set to **0** so that the same arrays are generated on every run.

In [3]:
import numpy as np

# seed for reproducibility
np.random.seed(0)

# one-dimensional array
x1 = np.random.randint(10, size=6)
print(x1)

[5 0 3 3 7 9]


In [4]:
# two-dimensional array
x2 = np.random.randint(10, size=(3, 4))
print(x2)

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]


In [5]:
# three-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))
print(x3)

[[[8 1 5 9 8]
  [9 4 3 0 3]
  [5 0 2 3 8]
  [1 3 3 3 7]]

 [[0 1 9 9 0]
  [4 7 3 2 7]
  [2 0 0 4 5]
  [5 6 8 4 1]]

 [[4 9 8 1 1]
  [7 9 9 3 6]
  [7 2 0 3 5]
  [9 4 4 6 4]]]


In [6]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes") # itemsize, which lists the size (in bytes) of each array element
print("nbytes:", x3.nbytes, "bytes") # nbytes, which lists the total size (in bytes) of the array

# nbytes is equal to itemsize times size

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60
dtype: int64
itemsize: 8 bytes
nbytes: 480 bytes
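
The relationship between these attributes can be verified directly. In this sketch the dtype is set explicitly, because the default integer width differs between platforms.

```python
import numpy as np

a = np.zeros((3, 4, 5), dtype=np.int64)
b = np.zeros((3, 4, 5), dtype=np.int32)

# size is the product of the shape; nbytes is itemsize times size
print(a.size, a.itemsize, a.nbytes)  # 60 8 480
print(b.size, b.itemsize, b.nbytes)  # 60 4 240
```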


In [7]:
# four-dimensional array
x4 = np.random.randint(10, size=(2, 3, 4, 5))
print(x4)

[[[[4 3 4 4 8]
   [4 3 7 5 5]
   [0 1 5 9 3]
   [0 5 0 1 2]]

  [[4 2 0 3 2]
   [0 7 5 9 0]
   [2 7 2 9 2]
   [3 3 2 3 4]]

  [[1 2 9 1 4]
   [6 8 2 3 0]
   [0 6 0 6 3]
   [3 8 8 8 2]]]


 [[[3 2 0 8 8]
   [3 8 2 8 4]
   [3 0 4 3 6]
   [9 8 0 8 5]]

  [[9 0 9 6 5]
   [3 1 8 0 4]
   [9 6 5 7 8]
   [8 9 2 8 6]]

  [[6 9 1 6 8]
   [8 3 2 3 6]
   [3 6 5 7 0]
   [8 4 6 5 8]]]]


In [8]:
print("x4 ndim: ", x4.ndim)
print("x4 shape:", x4.shape)
print("x4 size: ", x4.size)
print("dtype:", x4.dtype)
print("itemsize:", x4.itemsize, "bytes") # itemsize, which lists the size (in bytes) of each array element
print("nbytes:", x4.nbytes, "bytes") # nbytes, which lists the total size (in bytes) of the array

# nbytes is equal to itemsize times size

x4 ndim:  4
x4 shape: (2, 3, 4, 5)
x4 size:  120
dtype: int64
itemsize: 8 bytes
nbytes: 960 bytes


### Array Indexing: Accessing Single Elements

If you are familiar with Python’s standard list indexing, indexing in NumPy will feel quite familiar. In a one-dimensional array, you can access the i-th value (counting from zero) by specifying the desired index in square brackets, just as with Python lists:



In [9]:
x1

array([5, 0, 3, 3, 7, 9])

In [62]:
x1[0]

5

In [63]:
x1[4]

7

In [64]:
# To index from the end of the array, you can use negative indices
x1[-1]

9

In [65]:
x1[-2]

7

In [66]:
# In a multidimensional array, you access items using a comma-separated tuple of indices
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [67]:
x2[0,0]

3

In [68]:
x2[0][0]

3

In [69]:
x2[2, 0]

1

In [70]:
x2[2][0]

1

In [71]:
x2[2,-1]

7

In [72]:
x2[2][-1]

7

In [73]:
# modify values using any of the above index notation
x2[2,-1]

7

In [74]:
x2[2,-1] = 9

In [75]:
x2[2,-1]

9

In [76]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 9]])

In [77]:
x1[0] = 3.14159  # this will be truncated!

In [78]:
x1[0]

3

In [79]:
x1[0] = 5.865  # this will be truncated!

In [80]:
x1[0]

5
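
The truncation happens because the array keeps its integer dtype. One way to avoid it, sketched here, is to convert the array to a floating-point type with `astype` before assigning.

```python
import numpy as np

x = np.array([5, 0, 3, 3, 7, 9])

# Assigning floats to an integer array drops the fractional part
x[0] = 3.14159
x[1] = -2.7  # truncation is toward zero, not rounding down

# astype returns a new array with the requested dtype
xf = x.astype(float)
xf[0] = 3.14159

print(x)
print(xf)
```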

### Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:) character.

**x[start:stop:step]**

If any of these are unspecified, they default to the values 
- start=0
- stop=size of dimension
- step=1

In [11]:
x1[1:5]

array([0, 3, 3, 7])

#### One-Dimensional Subarrays

In [12]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
x[:5]

array([0, 1, 2, 3, 4])

In [14]:
x[5:]

array([5, 6, 7, 8, 9])

In [15]:
x[4:7]

array([4, 5, 6])

In [16]:
x[::2]

array([0, 2, 4, 6, 8])

In [17]:
x[1::2]

array([1, 3, 5, 7, 9])

A potentially confusing case is when the step value is negative. In this case, the defaults for start and stop are swapped. This becomes a convenient way to reverse an array:

In [18]:
x[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [19]:
x[5::-2]

array([5, 3, 1])
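
To make the defaults explicit, the sketch below writes out the full form of a few slices and compares them with the shorthand. The built-in `slice` object is used here only to show that `None` stands for "use the default".

```python
import numpy as np

x = np.arange(10)

# Positive step: start defaults to 0 and stop to the size of the dimension
forward = x[0:len(x):1]

# Negative step: start defaults to the last index and the slice runs to the front
backward = x[len(x) - 1::-1]

# The same slice as x[5::-2], written as a slice object
s = slice(5, None, -2)

print(forward)
print(backward)
print(x[s])
```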

#### Multi-Dimensional Subarrays

In [20]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [21]:
x2[:2, :3] # two rows, three columns

array([[3, 5, 2],
       [7, 6, 8]])

In [22]:
x2[:3, ::2] # all rows, every other column

array([[3, 2],
       [7, 8],
       [1, 7]])

In [23]:
x2[::-1, ::-1]

array([[7, 7, 6, 1],
       [8, 8, 6, 7],
       [4, 2, 5, 3]])

#### Accessing array rows and columns

One commonly needed routine is accessing single rows or columns of an array. You can do this by combining indexing and slicing, using an empty slice marked by a single colon (:).

In [24]:
x2[:, 0] # first column of x2

array([3, 7, 1])

In [25]:
x2[0, :] # first row of x2

array([3, 5, 2, 4])

In [26]:
x2[0] # equivalent to x2[0, :]

array([3, 5, 2, 4])

#### Subarrays as No-Copy Views

One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.

In [27]:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [28]:
x2_sub = x2[:2, :2]
x2_sub

array([[3, 5],
       [7, 6]])

In [30]:
# let's modify the value
x2_sub[0, 0] = 99
x2_sub

array([[99,  5],
       [ 7,  6]])

In [31]:
x2

array([[99,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])
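
For comparison, here is a short sketch of the same modification applied to a Python list slice and to a NumPy array slice. The list is left unchanged, while the array is modified through its view.

```python
import numpy as np

py_list = [3, 5, 2, 4]
list_slice = py_list[:2]  # a new list holding copies of the references
list_slice[0] = 99

arr = np.array([3, 5, 2, 4])
arr_slice = arr[:2]       # a view onto the same data buffer
arr_slice[0] = 99

print(py_list)  # [3, 5, 2, 4]
print(arr)      # [99  5  2  4]
```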

#### Creating Copies of Arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray.

In [32]:
x2_sub_copy = x2[:2, :2].copy()
x2_sub_copy

array([[99,  5],
       [ 7,  6]])

In [33]:
x2_sub_copy[0, 0] = 42
x2_sub_copy

array([[42,  5],
       [ 7,  6]])

In [34]:
x2

array([[99,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7,  7]])
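
Whether two arrays refer to the same underlying data can be checked with `np.shares_memory`. A minimal sketch, using a small example array defined just for this purpose:

```python
import numpy as np

grid = np.arange(12).reshape((3, 4))

view = grid[:2, :2]          # a slice is a view
copy = grid[:2, :2].copy()   # .copy() allocates new memory

view_shares = np.shares_memory(grid, view)
copy_shares = np.shares_memory(grid, copy)
print(view_shares, copy_shares)  # True False

# Changing the copy leaves the original untouched
copy[0, 0] = 42
print(grid[0, 0])  # 0
```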

### Reshaping of Arrays

Another useful type of operation is reshaping of arrays.

*Note that for this to work, the size of the initial array must match the size of the reshaped array.*

In [35]:
grid = np.arange(1, 10).reshape((3, 3))
grid

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
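
The size requirement can be seen in a small sketch: a matching shape works, a dimension given as `-1` is inferred from the remaining ones, and a mismatched shape raises a `ValueError`.

```python
import numpy as np

values = np.arange(12)

a = values.reshape((3, 4))   # 3 * 4 == 12, so this works
b = values.reshape((2, -1))  # -1 lets NumPy infer the missing dimension

error_message = None
try:
    values.reshape((5, 3))   # 5 * 3 != 12
except ValueError as exc:
    error_message = str(exc)

print(a.shape, b.shape)
print(error_message)
```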

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix. You can do this with the **reshape** method, or more easily by making use of the **newaxis** keyword within a slice operation.

In [4]:
x = np.array([1, 2, 3])

# row vector via reshape
x.reshape((1, 3))  # reshape((row_number, col_number)) => 1 row and 3 columns

array([[1, 2, 3]])

In [5]:
# row vector via newaxis
x[np.newaxis, :] # add a new "row" axis

array([[1, 2, 3]])

In [6]:
# column vector via reshape
x.reshape((3, 1))  # reshape((row_number, col_number)) => 3 rows and 1 column

array([[1],
       [2],
       [3]])

In [7]:
# column vector via newaxis
x[:, np.newaxis] # add a new "col" axis

array([[1],
       [2],
       [3]])
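
The effect of each form is easiest to see from the resulting shapes. The sketch below also shows that `np.newaxis` is simply an alias for `None`, so both spellings behave the same way.

```python
import numpy as np

x = np.array([1, 2, 3])

row = x[np.newaxis, :]  # new axis in front: shape (1, 3)
col = x[:, np.newaxis]  # new axis at the end: shape (3, 1)
col_none = x[:, None]   # identical, because np.newaxis is None

print(x.shape, row.shape, col.shape)
print(np.newaxis is None)
```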