# Numpy and How Computers Work

NumPy is short for **Numerical Python** and is a package built around the **NumPy array**.

The array is like the `list` container, except that it is much more efficient in both memory use and operation speed. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy well pays off in the long run.

In [5]:
import numpy as np

np.__version__

'1.18.5'

In [6]:
np? #Shows info of numpy

# Static and dynamic types

In the previous lecture, we saw how **file extensions** tell the computer how to interpret the data in a file.

Data types play the same role for values in memory. Since all data on a computer is 1's and 0's (binary numbers) anyway, we can get large speedups if we can assume the binary numbers are uniformly organised in a certain way.

This contrasts with Python's **dynamic typing**, where a variable can hold a value of any type:

In [79]:
x = 5
x = "cat" # perfectly valid type change

**Statically-typed** languages like C or Java require each variable to be explicitly declared, while a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:

```C
// C code
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

In python this would be:

In [80]:
result = 0
for i in range(100):
    result += i

## A Python Integer Is More Than Just an Integer

The standard Python implementation is written in C.
This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as ``x = 10000``, ``x`` is not just a "raw" integer. It's actually a pointer to a compound C structure, which contains several values.
Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python 3.4 actually contains four pieces:

- ``ob_refcnt``, a reference count that helps Python silently handle memory allocation and deallocation
- ``ob_type``, which encodes the type of the variable
- ``ob_size``, which specifies the size of the following data members
- ``ob_digit``, which contains the actual integer value that we expect the Python variable to represent.

This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:

Here ``PyObject_HEAD`` is the part of the structure containing the reference count, type code, and other pieces mentioned before.

Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value.
A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value.
This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically.
All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
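
As a rough illustration of this overhead, the sketch below compares the size of a Python integer object with the 8 bytes of a fixed-width 64-bit integer. The variable names are only illustrative, and the exact object size depends on the Python build, so it is printed rather than assumed.

```python
import sys
import numpy as np

x = 10000

# Size in bytes of the whole Python int object (header fields plus digits).
py_int_bytes = sys.getsizeof(x)

# A fixed-width 64-bit integer stores only the raw value: 8 bytes.
raw_bytes = np.dtype(np.int64).itemsize

print(py_int_bytes, raw_bytes)
```

On a typical 64-bit CPython build the first number is several times larger than the second.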

## A Python List Is More Than Just a List

Let's consider now what happens when we use a Python data structure that holds many Python objects.
The standard mutable multi-element container in Python is the list.
Because of Python's dynamic typing, a list can even hold elements of different types:

In [81]:
# build a list of the strings '0' to '9'
l = [str(c) for c in range(10)]
# convert the first three elements to int; the rest remain str
l[0 : 3] = [int(x) for x in l[0 : 3]]
l

[0, 1, 2, '3', '4', '5', '6', '7', '8', '9']

But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information–that is, each item is a complete Python object.
In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array.
The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:

![Array Memory Layout](array_vs_list.png)

At the implementation level, the array essentially contains a single pointer to one contiguous block of data.
The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier.
Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type.
Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
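
The storage difference described above can be estimated with a small sketch. It is an approximation under stated assumptions: `sys.getsizeof` counts the list's pointer block and each integer object separately, and CPython shares some small integers between objects, so the list total is only a rough upper estimate.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# List: one block of pointers, plus a separate full object per element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# Array: the raw values stored back to back in one contiguous buffer.
array_bytes = arr.nbytes  # n elements * 8 bytes each

print(list_bytes, array_bytes)
```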

# Numpy Intro

NumPy arrays are somewhat like native Python lists, except that

- Data *must be homogeneous* (all elements of the same type).  
- These types must be one of the [data types](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html) (`dtypes`) provided by NumPy:

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64``.| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128``.| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 


There are also dtypes to represent complex numbers, unsigned integers, etc.
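
The ranges and bit layouts in the table can be checked from Python with `np.iinfo` (integer types) and `np.finfo` (floating-point types). A minimal sketch:

```python
import numpy as np

int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)  # -128 127

uint16_info = np.iinfo(np.uint16)
print(uint16_info.max)  # 65535

# For floats, finfo reports the total bit width and the mantissa bits.
f32 = np.finfo(np.float32)
print(f32.bits, f32.nmant)  # 32 23
```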

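On modern machines, the default dtype for array-creation functions such as `np.zeros` and `np.ones` is `float64`, while `np.array` infers the dtype from the data it is given.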

First, we can use ``np.array`` to create arrays from Python lists:

In [82]:
import numpy as np

# integer array:
a = np.array([1, 4, 2, 5, 3])
print(a.dtype)
a

int64


array([1, 4, 2, 5, 3])

In [15]:
import numpy as np
a = np.array([1,2.5,3,4,5], dtype=np.float32)
print(a.dtype)
a

float32


array([1. , 2.5, 3. , 4. , 5. ], dtype=float32)

### Upcasting

Remember that unlike Python lists, NumPy is constrained to arrays that all contain the same type.
If types do not match, NumPy will upcast if possible (here, integers are up-cast to floating point):

In [83]:
a = np.array([3.14, 4, 2, 3])
print(a.dtype)
a

float64


array([3.14, 4.  , 2.  , 3.  ])

If we want to explicitly set the data type of the resulting array, we can use the ``dtype`` keyword:

In [84]:
np.array([1, 2, 3, 4], dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

The upcasting hierarchy looks like:

`bool < int < float < str < object`

with sub-hierarchies within each type, such as `int8 < int16 < int32 < int64`.
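
To see which type NumPy would pick without building an array by hand, `np.result_type` reports the common dtype of its arguments. A short sketch of the hierarchy in action (the array name `mixed` is only illustrative):

```python
import numpy as np

# Mixing a bool, an int and a float upcasts everything to float.
mixed = np.array([True, 2, 3.5])
print(mixed.dtype)  # float64

# np.result_type returns the common dtype for a set of dtypes.
small_ints = np.result_type(np.int8, np.int16)
int_and_float = np.result_type(np.int32, np.float64)
print(small_ints, int_and_float)  # int16 float64
```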

### PyObject fallback

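If NumPy fails to find a common type for the elements, it will fall back to the `object` type, at which point the array is basically a list. Note that mixing numbers and strings does not trigger the fallback: strings sit above numbers in the hierarchy, so everything is upcast to a string instead: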

In [85]:
np.array([1, 'cat', 3.5]) # <U21 is a string type (unicode)

array(['1', 'cat', '3.5'], dtype='<U21')
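
For contrast, here is a minimal sketch of an input that does end up with the `object` dtype. It assumes a list containing `None`, which has no numeric or string representation that NumPy will cast to:

```python
import numpy as np

# None cannot be upcast to a number or a string, so NumPy stores
# references to the original Python objects instead.
arr = np.array([1, 'cat', None])
print(arr.dtype)  # object

# Each element keeps its own Python type, just as in a list.
print([type(v).__name__ for v in arr])
```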

# Array Shape

Unlike Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:

In [86]:
# nested lists are multi-dimensional arrays
np.array([[2,3,4], [6,7,8], [9, 10, 11]])

array([[ 2,  3,  4],
       [ 6,  7,  8],
       [ 9, 10, 11]])
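
The dimensions of an array can be inspected through its `ndim`, `shape` and `size` attributes. A short sketch using the same values as above (the name `m` is only illustrative):

```python
import numpy as np

m = np.array([[2, 3, 4], [6, 7, 8], [9, 10, 11]])

print(m.ndim)   # number of dimensions: 2
print(m.shape)  # length along each dimension: (3, 3)
print(m.size)   # total number of elements: 9
```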

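However, if our matrix is not rectangular (that is, if the rows do not all have the same length), NumPy cannot build a true multi-dimensional array. In the version used here (1.18.5) it falls back to holding each row as an `object`, which is no better than Python's lists. NumPy 1.24 and later instead raise a `ValueError` unless `dtype=object` is passed explicitly.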

In [87]:
# Falls back to object
np.array([[2,3,4,5,6,7,8,9,1,2,3], 
          [6,7,8], 
          [9, 10, 11]]
)

array([list([2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3]), list([6, 7, 8]),
       list([9, 10, 11])], dtype=object)
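
If a ragged container really is wanted, requesting the `object` dtype explicitly states the intent. The sketch below assumes the same rows as above and should behave the same way across NumPy versions:

```python
import numpy as np

rows = [[2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3], [6, 7, 8], [9, 10, 11]]

# With dtype=object the result is a 1-D array whose elements are Python lists.
ragged = np.array(rows, dtype=object)

print(ragged.shape)     # (3,)
print(type(ragged[0]))  # <class 'list'>
```

The result is one-dimensional: NumPy sees three objects, not a 3-row matrix.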

# Array Creation



In [88]:
# Create a length-10 integer array filled with zeros
# (the available dtypes are listed in the table above)
# Conceptually, np.zeros reserves memory as np.empty does, then fills it with 0s
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [89]:
# Create a 3x5 floating-point array filled with ones
# The first argument is the shape (3 rows by 5 columns); dtype sets the element type
#
# Side notes:
# np.arange(10) returns a NumPy array directly, whereas Python's range
# has to be placed in a container such as a list first.
# np.empty(1000, dtype=np.int64) asks the operating system to reserve
# a block of RAM for the array without initialising it.
np.ones((3, 5), dtype=float)

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [90]:
# Create a 3x5 array filled with the number pi
np.full((3, 5), np.pi)

array([[3.14159265, 3.14159265, 3.14159265, 3.14159265, 3.14159265],
       [3.14159265, 3.14159265, 3.14159265, 3.14159265, 3.14159265],
       [3.14159265, 3.14159265, 3.14159265, 3.14159265, 3.14159265]])

numpy also has `np.arange` which is its analogue to python's `range`:

In [91]:
# Create an array filled with a linear sequence
# Starting at 3, stopping before 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(3, 20, 2)

array([ 3,  5,  7,  9, 11, 13, 15, 17, 19])

### np.empty

The function `np.empty` reserves a block of memory on your computer to be an array, but doesn't touch it.

Since the memory at that space could have been anything before we reserved it, the values in it can be any kind of garbage:

In [92]:
np.empty(10, dtype=np.int64)

array([8070450532247928832, 8070450532247928832,                   7,
                         0,                   0,                   0,
                         0,                   0,                   0,
                         0])
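
Because the initial contents are arbitrary, `np.empty` is only safe when every element is overwritten before it is read. A minimal sketch of that pattern (the name `out` is only illustrative):

```python
import numpy as np

out = np.empty(5, dtype=np.int64)  # contents are arbitrary at this point
out[:] = np.arange(5) ** 2         # overwrite every element before reading

print(out)  # [ 0  1  4  9 16]
```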

### Random arrays

In [93]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))
# see np.random.randint below for the integer counterpart

array([[0.45162259, 0.35771648, 0.03182097],
       [0.67652066, 0.63986053, 0.77741404],
       [0.41466468, 0.51059136, 0.05635465]])

In [94]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[0, 9, 3],
       [7, 1, 3],
       [1, 1, 6]])
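
The values above change on every run. If repeatable numbers are needed, one option is to seed the generator first. The sketch below uses `np.random.seed`, which belongs to the same legacy interface as the calls above; the seed value 0 is an arbitrary choice.

```python
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 10, (3, 3))

np.random.seed(0)  # resetting the seed replays the same sequence
b = np.random.randint(0, 10, (3, 3))

print(np.array_equal(a, b))  # True
```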

# Advanced Indexing

In [2]:
import numpy as np

x = np.array([
    [3, 5, 2, 4],
    [7, 6, 8, 8],
    [1, 6, 7, 7]]
)
# We index (row, column)
x[0, 0]
x[2, 1]

6

In [20]:
x = np.array([
    [3, 5, 2, 4],
    [7, 6, 8, 8],
    [1, 6, 7, 7]]
)
# We index (row, column)
x[2, 3] = 92  # modifies the element at that location
# if the array's dtype is int and we assign a float, it is stored as an int
x

array([[ 3,  5,  2,  4],
       [ 7,  6,  8,  8],
       [ 1,  6,  7, 92]])

In [96]:
x[2, -1]

7

In [97]:
x[0, 0] = 9999
x

array([[9999,    5,    2,    4],
       [   7,    6,    8,    8],
       [   1,    6,    7,    7]])


Keep in mind that, unlike Python lists, NumPy arrays have a fixed type.
This means, for example, that if you attempt to insert a floating-point value to an integer array, the value will be silently truncated. Don't be caught unaware by this behavior!

In [5]:
x[0, 0] = np.pi  # silently truncated!
x[0, 0]

3
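
A self-contained sketch of the truncation described above, using a small hypothetical array `ints`. It also shows one way around it: converting to a float array with `astype` before assigning.

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 3.99  # the fractional part is dropped, not rounded
print(ints)     # [3 2 3]

# astype returns a new array with the requested dtype.
floats = ints.astype(float)
floats[0] = 3.99
print(floats[0])  # 3.99
```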

Slicing works as `[start : stop : stride]`:

In [4]:
x = np.arange(50)
x[::3] # every third element

# x[::-1] returns the elements in reverse order

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48])

In [100]:
x[-25 : -5 : 2]

array([25, 27, 29, 31, 33, 35, 37, 39, 41, 43])

Negative steps give us an easy way to reverse an array:

In [101]:
x[::-1]

array([49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
       32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
       15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0])

For multi-dimensional arrays, the comma separates row indexing from column indexing:

`[row start : stop : step, col start : stop : step]`

In [27]:
x = np.random.randint(0, 10, (5, 10))
x

array([[3, 8, 7, 1, 2, 7, 8, 6, 2, 8],
       [8, 5, 3, 3, 6, 9, 6, 6, 0, 8],
       [8, 7, 6, 3, 0, 2, 4, 2, 3, 2],
       [5, 2, 2, 1, 7, 9, 2, 3, 7, 8],
       [8, 4, 7, 9, 6, 9, 5, 0, 7, 7]])

In [28]:
x[:2, :3] # two rows, three columns

array([[3, 8, 7],
       [8, 5, 3]])

In [29]:
x[:, ::2]  # all rows, every other column

array([[3, 7, 2, 8, 2],
       [8, 3, 6, 6, 0],
       [8, 6, 0, 4, 3],
       [5, 2, 7, 2, 7],
       [8, 7, 6, 5, 7]])

In [30]:
x[:, 4] # the column at index 4 (the fifth column)

array([2, 6, 0, 7, 6])

### Subarrays as no-copy views

One important–and extremely useful–thing to know about array slices is that they return *views* rather than *copies* of the array data.
This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.

Let's extract a $2 \times 2$ subarray from this:

In [31]:
x_sub = x[:2, :2]
x_sub

array([[3, 8],
       [8, 5]])

In [33]:
x

array([[3, 8, 7, 1, 2, 7, 8, 6, 2, 8],
       [8, 5, 3, 3, 6, 9, 6, 6, 0, 8],
       [8, 7, 6, 3, 0, 2, 4, 2, 3, 2],
       [5, 2, 2, 1, 7, 9, 2, 3, 7, 8],
       [8, 4, 7, 9, 6, 9, 5, 0, 7, 7]])

In [34]:
x_sub[0, 0] = 999999
x

array([[999999,      8,      7,      1,      2,      7,      8,      6,
             2,      8],
       [     8,      5,      3,      3,      6,      9,      6,      6,
             0,      8],
       [     8,      7,      6,      3,      0,      2,      4,      2,
             3,      2],
       [     5,      2,      2,      1,      7,      9,      2,      3,
             7,      8],
       [     8,      4,      7,      9,      6,      9,      5,      0,
             7,      7]])
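
The contrast with list slicing can be checked directly. In this sketch (all names are illustrative), `np.shares_memory` confirms that the array slice and the original use the same buffer, while the list slice is an independent copy:

```python
import numpy as np

arr = np.arange(6)
view = arr[:3]
view[0] = 100  # writes through to arr
print(arr[0])                       # 100
print(np.shares_memory(arr, view))  # True

lst = list(range(6))
part = lst[:3]
part[0] = 100  # lst is untouched
print(lst[0])  # 0
```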

### Copying arrays

Copying is done with `np.copy(a)` or the `a.copy()` method. Changes to the copy do not affect the original:

In [108]:
x2 = x.copy()
x2[:] = -1
x

array([[999999,      7,      8,      6,      8,      6,      1,      9,
             7,      9],
       [     7,      8,      2,      0,      8,      9,      8,      7,
             9,      3],
       [     4,      8,      0,      1,      6,      2,      9,      4,
             8,      8],
       [     0,      3,      3,      2,      7,      6,      6,      1,
             5,      6],
       [     2,      1,      4,      4,      0,      2,      6,      8,
             4,      6]])

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [109]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [110]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]


It can also be used for two-dimensional arrays:

In [1]:
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
# concatenate along the first axis
np.concatenate([grid, grid])

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])

For working with arrays of mixed dimensions, it can be clearer to use the ``np.vstack`` (vertical stack) and ``np.hstack`` (horizontal stack) functions:

In [None]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])



# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])
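
The stacking helpers can also be expressed with `np.concatenate` and its `axis` argument. The sketch below reuses the values of `x` and `grid` from above and shows the resulting shapes; the other names are only illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# axis=1 joins along the columns, like hstack for 2-D arrays.
side_by_side = np.concatenate([grid, grid], axis=1)
print(side_by_side.shape)  # (2, 6)

# vstack treats the 1-D x as a single row before joining along axis 0.
stacked = np.vstack([x, grid])
same = np.concatenate([x[np.newaxis, :], grid], axis=0)
print(stacked.shape)                  # (3, 3)
print(np.array_equal(stacked, same))  # True
```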

### Note: appending is slow

Adding elements to an existing array can be done in NumPy, but it requires re-allocating chunks of memory on the computer, so it is very slow.

If you care about performance, you should always allocate the whole array first, then fill it:
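
To make the two patterns concrete, the sketch below builds the same array both ways. `np.append` returns a brand-new array on every call, so the first loop copies all existing elements each time, while the second loop allocates once. The names `grown` and `filled` are only illustrative, and no timings are claimed here.

```python
import numpy as np

n = 1000

# Slow pattern: every np.append call allocates a new array and copies into it.
grown = np.array([], dtype=np.int64)
for i in range(n):
    grown = np.append(grown, i * i)

# Preferred pattern: allocate once, then fill in place.
filled = np.empty(n, dtype=np.int64)
for i in range(n):
    filled[i] = i * i

print(np.array_equal(grown, filled))  # True
```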

In [36]:
import numpy as np
def compute_reciprocals(v):
    out = np.empty(len(v))
    for i in range(len(v)):
        out[i] = 1.0 /v[i]
    return out

values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.14285714, 0.16666667, 0.25      , 0.25      , 0.33333333])

In [37]:
big_array = np.random.randint(1, 100, size=1_000_000)
%timeit compute_reciprocals(big_array)

2.22 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [38]:
%timeit 1 / big_array # vectorized division: much faster than the Python loop above

1.3 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
