# Introduction to Data Science L2
## Quick Introduction to Python 3.X
__Budapest 2021.09.16__ (*Fall Semester*).

**Notes:**
Dear (quick) learner, feel free to go over this notebook and run the cells on your own. Some cells contain several statements without calling the print function, so only the value of the last statement is displayed; I encourage you to comment and uncomment individual statements to check the output of each one within the same code block. Also note that some blocks throw Errors/Exceptions on purpose: read the error and try to understand what happened. ;)

Main Source: [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)

Additional Sources:
- [Doc from numpy.org](https://numpy.org/doc/stable/r)
- [Quick Start from numpy.org](https://numpy.org/doc/stable/user/quickstart.html)
- [datacamp.com NumPy Array Tutorial](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)
- [Numpy Book (old but gold)](https://web.mit.edu/dvp/Public/numpybook.pdf)
- [Some Examples](https://numpy.org/doc/stable/reference/routines.html)

Regards,<br/>Andrea Galloni<br/>andrea.galloni@inf.elte.hu


----

## NumPy
NumPy is at the core of nearly every scientific Python library/module/application. It provides a *"fast"* N-dimensional array data type that can be manipulated in vectorized form, which makes linear algebra operations easier than with Python's native type `list`.

### A Python List Is More Than Just a List
The standard Python data structure for holding many Python objects is the mutable list. As you know, we can create a list of integers as follows:

In [1]:
L = list(range(10))
L

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [2]:
type(L[0])

int

In [3]:
L2 = [str(c) for c in L]
L2

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [4]:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]

[bool, str, float, int]

But this **capability to hold heterogeneous types, and the flexibility it brings, comes at a cost**. To allow these different types, each item in the list must carry its own type info, reference count, and other information; that is, each item is a complete Python object. When all items share the same type, much of this information is redundant: it can be much more efficient to store the data in a fixed-type array:

![arrayVSlist](https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png)

The array essentially contains a single pointer to one contiguous block of data, while a Python list contains a pointer to a block of pointers, each of which in turn points to a full Python object such as an integer or a string.

**Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.**
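To make the storage difference concrete, here is a rough sketch that compares the memory footprint of a list of integers with an `int64` array of the same length. The variable names are illustrative, and the list figure relies on `sys.getsizeof`, whose values are specific to CPython and only approximate the real footprint.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# List: the container (a block of pointers) plus one full int object per item
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# ndarray: a single contiguous buffer of fixed-size items (8 bytes each here)
array_bytes = np_arr.nbytes

print("list  :", list_bytes, "bytes (approximate, CPython-specific)")
print("array :", array_bytes, "bytes")
```

The exact list figure varies between Python versions, but it should come out several times larger than the array's buffer.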

Python itself offers several different options for storing data in efficient, fixed-type data buffers. The **built-in `array` module (part of the standard library)** can be used to create dense arrays of a uniform type:

In [5]:
import array
L = list(range(10))
A = array.array('i', L)
print('L type:', type(L))
print('L values:', L)
print('A type:', type(A))
print('A values:', A)


L type: <class 'list'>
L values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
A type: <class 'array.array'>
A values: array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


Here 'i' is a type code indicating the contents are integers.
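As a small illustration of what "fixed-type" means here, the sketch below tries to append a float to an `'i'` array. The flag name `rejected` is just for this example.

```python
import array

A = array.array('i', range(5))

try:
    A.append(2.5)      # a float does not fit the 'i' (signed int) type code
    rejected = False
except TypeError as err:
    rejected = True
    print("TypeError:", err)

print(A)               # the array is left unchanged
```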

Much more useful, however, is the ndarray object of the NumPy package. While Python's array object provides *efficient storage* of array-based data, **NumPy adds to this *efficient operations* on that data**. 
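The difference between efficient storage and efficient operations shows up as soon as we try some arithmetic. In this minimal sketch (names are illustrative), `*` on an `array.array` repeats the sequence, as it would for a list, while on an `ndarray` it multiplies element-wise:

```python
import array
import numpy as np

A = array.array('i', [1, 2, 3])
N = np.array([1, 2, 3])

repeated = A * 2   # sequence repetition, exactly like a list
doubled = N * 2    # element-wise multiplication

print(repeated)    # array('i', [1, 2, 3, 1, 2, 3])
print(doubled)     # [2 4 6]
```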

## Let's dive into NumPy arrays!

In [6]:
import numpy as np

np_array = np.array([1, 4, 2, 5, 3])

print(type(np_array))
print(np_array)

np_array = np.array([1, 4.0, 2, 5, 3])
# What happens if we pass a heterogeneous list?
# If no dtype is passed, NumPy infers one and upcasts the items to a common type!
print(type(np_array))
print(np_array)
np_array


<class 'numpy.ndarray'>
[1 4 2 5 3]
<class 'numpy.ndarray'>
[1. 4. 2. 5. 3.]


array([1., 4., 2., 5., 3.])

In [7]:
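# DON'T GET FOOLED!
# Upcasting happens only at creation time; afterwards the dtype is fixed!
print(np_array)
np_array[0] = 3.14159  # kept as-is because np_array is float64; an integer array would silently truncate it to 3
print(np_array)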
print(np_array)

[1. 4. 2. 5. 3.]
[3.14159 4.      2.      5.      3.     ]
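For contrast, here is a short sketch of the same assignment on an array with an integer dtype: the float is silently truncated, because the dtype is fixed once the array exists. The names `int_array` and `float_array` are illustrative; `astype` is used to obtain a float copy when fractional values are needed.

```python
import numpy as np

int_array = np.array([1, 4, 2, 5, 3])   # integer dtype inferred at creation
int_array[0] = 3.14159                  # no error, no upcast
print(int_array)                        # [3 4 2 5 3]

# Convert explicitly when fractional values must be stored
float_array = int_array.astype(float)
float_array[0] = 3.14159
print(float_array)
```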


If we want to explicitly pass the type of the content of the array we can do the following:

In [8]:
np_array = np.array([1., 2., 3., 4.], dtype='float32')
print(type(np_array))
print(np_array)

# Upcast also in this case: the integers are converted to float32 (the result is not assigned)
np.array([1, 2, 3, 4], dtype='float32')
print(type(np_array))
np_array

<class 'numpy.ndarray'>
[1. 2. 3. 4.]
<class 'numpy.ndarray'>


array([1., 2., 3., 4.], dtype=float32)
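The chosen dtype determines how many bytes each element occupies. A minimal sketch (variable names are illustrative) that checks this for `float32` and for a `float64` copy made with `astype`:

```python
import numpy as np

a32 = np.array([1, 2, 3, 4], dtype='float32')   # a string and np.float32 are interchangeable here
a64 = a32.astype('float64')                     # astype returns a converted copy

print(a32.dtype == np.float32)     # True
print(a32.itemsize, a32.nbytes)    # 4 16
print(a64.itemsize, a64.nbytes)    # 8 32
```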

In [9]:
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])


array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

## Creating Arrays from Scratch


In [10]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [11]:
# Create a length-5 integer array filled with ones
np.ones(5, dtype=int)

array([1, 1, 1, 1, 1])

In [12]:
# Create a length-5 float array without initializing it (memory is only allocated, not filled)
np.empty(5, dtype=float)

array([3.14159, 4.     , 2.     , 5.     , 3.     ])

In [13]:
# Create an uninitialized array of three integers
# The values will be whatever happens to already exist at that memory location
np.empty(3, dtype=int)

array([ 94427911284448,               0, 139875906514288])

In [14]:
np.empty(10, dtype='int8')

array([-32, -26,  27, -74, -31,  85,   0,   0,   0,   0], dtype=int8)
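Since `np.empty` returns whatever bytes happen to be in memory, a common pattern is to overwrite every element before reading any of them. A small sketch with illustrative names:

```python
import numpy as np

buf = np.empty(5, dtype=float)   # contents are arbitrary at this point
buf[:] = 0.0                     # assign to every element before reading

counts = np.empty(3, dtype=int)
counts.fill(7)                   # in-place fill with a single value

print(buf)      # [0. 0. 0. 0. 0.]
print(counts)   # [7 7 7]
```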

In [15]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

array([[3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14, 3.14]])

In [16]:
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [17]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
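One difference between the two functions is easy to miss: `np.arange` excludes the stop value, while `np.linspace` includes it by default. The sketch below (illustrative names) shows both, plus the `endpoint=False` option of `np.linspace`:

```python
import numpy as np

a = np.arange(0, 20, 2)                    # stop (20) is excluded
b = np.linspace(0, 1, 5)                   # stop (1) is included by default
c = np.linspace(0, 1, 5, endpoint=False)   # stop excluded, step becomes 0.2

print(a[-1], len(a))   # 18 10
print(b[-1], len(b))   # 1.0 5
print(c)
```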

In [18]:
# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
np.random.random((3, 3))

array([[0.27475346, 0.56520962, 0.9738081 ],
       [0.18858928, 0.10483449, 0.977991  ],
       [0.95148072, 0.70790943, 0.65486326]])

In [19]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

array([[ 0.36393184,  1.68122968, -0.09536796],
       [-0.61231996,  3.94925404, -0.08740998],
       [-0.83130305,  0.29380353, -1.26685683]])

In [20]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

array([[2, 3, 3],
       [5, 3, 7],
       [5, 1, 2]])
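The three cells above use the legacy `np.random.*` functions. As an alternative, and assuming NumPy 1.17 or newer, the same arrays can be drawn from a `Generator` created with `np.random.default_rng`. This is a sketch, not a replacement for the cells above; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 3))             # uniform in [0, 1)
normal = rng.normal(0, 1, (3, 3))        # mean 0, standard deviation 1
integers = rng.integers(0, 10, (3, 3))   # integers in [0, 10)

# A generator built from the same seed reproduces the same stream
rng_again = np.random.default_rng(seed=42)
same = np.array_equal(uniform, rng_again.random((3, 3)))
print(same)   # True
```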

## NumPy Data Types:
| Data type     | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (alias removed in NumPy 2.0; use ``float64``).| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (alias removed in NumPy 2.0; use ``complex128``).| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 
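The ranges and bit layouts in the table do not need to be memorized: `np.iinfo` and `np.finfo` report them for integer and floating-point types respectively. A short sketch (variable names are illustrative):

```python
import numpy as np

i8 = np.iinfo(np.int8)
u16 = np.iinfo(np.uint16)
f32 = np.finfo(np.float32)
f64 = np.finfo(np.float64)

print(i8.min, i8.max)        # -128 127
print(u16.min, u16.max)      # 0 65535
print(f32.nexp, f32.nmant)   # 8 23  (exponent bits, mantissa bits)
print(f64.nexp, f64.nmant)   # 11 52
```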

## NumPy Array Attributes
NumPy arrays expose several attributes that are useful while coding. 

In [21]:
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

In [22]:
print("x3.ndim (number of dimensions): ", x3.ndim)
print("x3.shape (size of each dimension):", x3.shape)
print("x3.size (total number of elements 4*3*5): ", x3.size)

x3.ndim (number of dimensions):  3
x3.shape (size of each dimension): (3, 4, 5)
x3.size (total number of elements 4*3*5):  60


In [23]:
# another way to get data type of an array:
print("dtype:", x3.dtype)

dtype: int64


In [24]:
# Other attributes include itemsize, which lists the size 
# (in bytes) of each array element, and nbytes, which lists 
# the total size (in bytes) of the array:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes (8*60):", x3.nbytes, "bytes")
8*60

itemsize: 8 bytes
nbytes (8*60): 480 bytes


480
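These attributes are tied together by simple arithmetic: `size` is the product of `shape`, and `nbytes` is `size * itemsize`. A minimal sketch, using an explicit `int64` dtype so the numbers do not depend on the platform (the array `x` is a stand-in for `x3`):

```python
import math
import numpy as np

x = np.zeros((3, 4, 5), dtype=np.int64)

size_from_shape = math.prod(x.shape)    # 3 * 4 * 5
bytes_from_size = x.size * x.itemsize   # 60 * 8

print(x.ndim == len(x.shape))           # True
print(size_from_shape == x.size)        # True
print(bytes_from_size == x.nbytes)      # True
```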

In [25]:
# Indexing is pretty similar to Python's lists:
x1

array([5, 0, 3, 3, 7, 9])

In [26]:
x1[2]

3

In [27]:
x1[2:]

array([3, 3, 7, 9])

In [28]:
x1[2:10]

array([3, 3, 7, 9])
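As the last cell shows, slice bounds beyond the end of the array are clipped silently. A single out-of-range index is treated differently and raises an `IndexError`, just as with lists. A sketch, with `x1` redefined so the block runs on its own:

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])

clipped = x1[2:10]        # slice bounds are clipped to the array length
print(clipped)            # [3 3 7 9]

try:
    x1[10]                # a single index must be in range
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)
```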

In [29]:
x1[-1]

9

In [30]:
# In a multi-dimensional array, items can be accessed using a 
# comma-separated tuple of indices:
x2

array([[3, 5, 2, 4],
       [7, 6, 8, 8],
       [1, 6, 7, 7]])

In [31]:
x2[0,0]

3

In [32]:
x2[0]

array([3, 5, 2, 4])

In [33]:
# but also:
x2[0,:]

array([3, 5, 2, 4])

In [34]:
# get a specific column
x2[:,2]

array([2, 8, 7])

In [35]:
# get a range of columns
x2[:,1:3]

array([[5, 2],
       [6, 8],
       [6, 7]])
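Note the shapes of the two results above: an integer index drops the dimension, while a slice keeps it, even when the slice selects a single column. A short sketch with `x2` redefined for self-containment:

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

col_1d = x2[:, 2]     # integer index: the column axis is dropped
col_2d = x2[:, 2:3]   # slice of width 1: the result is still two-dimensional

print(col_1d.shape)   # (3,)
print(col_2d.shape)   # (3, 1)
```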

In [36]:
x2[:2, :3]  # two rows, three columns

array([[3, 5, 2],
       [7, 6, 8]])

In [37]:
# the syntax is x[start:stop:step]
x1[1::2]  # every other element, starting at index 1

array([0, 3, 9])

In [38]:
x2[:2, :3]  # two rows, three columns

array([[3, 5, 2],
       [7, 6, 8]])

In [39]:
x2[:3, ::2]  # all rows, every other column

array([[3, 2],
       [7, 8],
       [1, 7]])

In [40]:
# Subarray dimensions can even be reversed together:
x2[::-1, ::-1]

array([[7, 7, 6, 1],
       [8, 8, 6, 7],
       [4, 2, 5, 3]])

### Subarrays as no-copy views

One important and extremely useful thing to know about array slices is that they return *views* rather than *copies* of the array data.
This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.
Consider our two-dimensional array from before:

In [41]:
print(x2)

[[3 5 2 4]
 [7 6 8 8]
 [1 6 7 7]]


In [42]:
# Let's extract a 2 x 2 subarray from this:
x2_sub = x2[:2, :2]
print(x2_sub)

[[3 5]
 [7 6]]


In [43]:
x2_sub[0, 0] = 99
print(x2_sub)

[[99  5]
 [ 7  6]]


In [44]:
print(x2)

[[99  5  2  4]
 [ 7  6  8  8]
 [ 1  6  7  7]]


In [45]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)

[[99  5]
 [ 7  6]]
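Whether two arrays share data can also be checked directly rather than inferred from side effects. The sketch below uses `np.shares_memory` and the `base` attribute, and contrasts the behaviour with a plain list slice; all names are illustrative.

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

view = x2[:2, :2]           # slice: a view on x2's data
copy = x2[:2, :2].copy()    # explicit copy: independent data

print(np.shares_memory(x2, view))   # True
print(np.shares_memory(x2, copy))   # False
print(view.base is x2)              # True

copy[0, 0] = 42   # does not touch x2
view[0, 0] = 99   # writes through to x2
print(x2[0, 0])   # 99

# List slices, in contrast, are copies
L = [3, 5, 2, 4]
L_sub = L[:2]
L_sub[0] = 99
print(L[0])       # 3
```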


## Reshaping of Arrays

Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a 3x3 grid, you can do the following:

In [46]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [47]:
x = np.array([1, 2, 3])

# row vector via reshape
x.reshape((1, 3))

array([[1, 2, 3]])

In [48]:
# row vector via newaxis
x[np.newaxis, :]

array([[1, 2, 3]])

In [49]:
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [50]:
# column vector via newaxis
x[:, np.newaxis]

array([[1],
       [2],
       [3]])
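Two details are worth knowing when reshaping. First, one dimension may be given as `-1` and NumPy infers it; if the total number of elements does not match, `reshape` raises a `ValueError`. Second, `np.newaxis` is simply an alias for `None`. A minimal sketch (names are illustrative):

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, -1))   # -1 is inferred as 3
print(grid.shape)                          # (3, 3)

try:
    np.arange(1, 10).reshape((2, 5))       # 9 elements cannot fill a 2x5 grid
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

x = np.array([1, 2, 3])
col = x[:, np.newaxis]
print(col.shape)             # (3, 1)
print(np.newaxis is None)    # True
```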

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [51]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y]) # works also for several arrays to concatenate! 

array([1, 2, 3, 3, 2, 1])

In [52]:
# also works with more than two arrays, upcasting to a common dtype!
np.concatenate([x, y, np.array([1.,2.,3.])])

array([1., 2., 3., 3., 2., 1., 1., 2., 3.])

In [53]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

In [54]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])
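For two-dimensional arrays, `np.concatenate` also accepts an `axis` argument, and with it reproduces what `np.vstack` and `np.hstack` do. A short sketch with `grid` redefined for self-containment:

```python
import numpy as np

grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

rows = np.concatenate([grid, grid], axis=0)   # stack along rows, shape (4, 3)
cols = np.concatenate([grid, grid], axis=1)   # stack along columns, shape (2, 6)

print(rows.shape, cols.shape)                         # (4, 3) (2, 6)
print(np.array_equal(rows, np.vstack([grid, grid])))  # True
print(np.array_equal(cols, np.hstack([grid, grid])))  # True
```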

### Splitting of arrays

The opposite of concatenation is splitting, which is implemented by the functions ``np.split``, ``np.hsplit``, and ``np.vsplit``.  For each of these, we can pass a list of indices giving the split points:

In [55]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

[1 2 3] [99 99] [3 2 1]


Notice that *N* split points lead to *N + 1* subarrays.
The related functions ``np.hsplit`` and ``np.vsplit`` are similar:

In [56]:
grid = np.arange(16).reshape((4, 4))
grid

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [57]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

[[0 1 2 3]
 [4 5 6 7]]
[[ 8  9 10 11]
 [12 13 14 15]]


In [58]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

[[ 0  1]
 [ 4  5]
 [ 8  9]
 [12 13]]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]]
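Besides a list of split points, `np.split` also accepts an integer number of equal sections, and `np.array_split` relaxes the requirement that the sections be equal. The sketch below (illustrative names) also confirms that concatenating the pieces restores the original array:

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

parts = np.split(x, [3, 5])               # 2 split points -> 3 subarrays
roundtrip = np.concatenate(parts)         # joining the pieces restores x

thirds = np.split(np.arange(9), 3)        # 3 equal sections of length 3
uneven = np.array_split(np.arange(8), 3)  # sections may differ in length

print(len(parts))                 # 3
print([len(p) for p in uneven])   # [3, 3, 2]
```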


## The Slowness of Loops

Python's default implementation (CPython) performs some operations very slowly.
This is in part due to the dynamic, interpreted nature of the language: the fact that types are flexible, so that sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran.

The relative sluggishness of Python generally manifests itself in situations where many small operations are being repeated – for instance looping over arrays to operate on each element.
For example, imagine we have an array of values and we'd like to compute the reciprocal of each.
A straightforward approach might look like this:

In [59]:
def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output
        
values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.2       , 0.25      , 0.2       , 0.2       , 0.11111111])

In [60]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

1.28 s ± 34.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [61]:
print(compute_reciprocals(values))
print(1.0 / values)

[0.2        0.25       0.2        0.2        0.11111111]
[0.2        0.25       0.2        0.2        0.11111111]


In [62]:
%timeit (1.0 / big_array)

1.09 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
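`%timeit` is an IPython magic and only works inside a notebook or IPython shell. As a sketch of the same comparison in a plain Python script, the standard `timeit` module can be used instead. The array here is smaller (100,000 elements) to keep the run short, and the timings will of course vary by machine.

```python
import timeit
import numpy as np

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

rng = np.random.default_rng(0)
big_array = rng.integers(1, 100, size=100_000)

# Total time for 3 runs of each approach
loop_time = timeit.timeit(lambda: compute_reciprocals(big_array), number=3)
vec_time = timeit.timeit(lambda: 1.0 / big_array, number=3)

print(f"loop:       {loop_time:.4f} s for 3 runs")
print(f"vectorized: {vec_time:.4f} s for 3 runs")
```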


In [63]:
np.arange(5) / np.arange(1, 6)

array([0.        , 0.5       , 0.66666667, 0.75      , 0.8       ])

In [64]:
x = np.arange(9).reshape((3, 3))
2 ** x

array([[  1,   2,   4],
       [  8,  16,  32],
       [ 64, 128, 256]])

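## Array Arithmetic: Universal Functions (ufuncs)
The vectorized operations shown above are implemented through NumPy's universal functions (ufuncs), which apply an operation element-wise to whole arrays. Ufuncs exist in two flavors: *unary ufuncs*, which operate on a single input, and *binary ufuncs*, which operate on two inputs.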

In [65]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


In [66]:
print("-x     = ", -x) # negation 
print("x ** 2 = ", x ** 2) # exponentiation 
print("x % 2  = ", x % 2) # modulus

-x     =  [ 0 -1 -2 -3]
x ** 2 =  [0 1 4 9]
x % 2  =  [0 1 0 1]


In [67]:
# composition of ufuncs
-(0.5*x + 1) ** 2

array([-1.  , -2.25, -4.  , -6.25])
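In the expression above, `**` binds more tightly than the unary minus, so the square is taken first and the result is negated afterwards. Written out with explicit ufunc calls, the same composition looks like this (a sketch with illustrative names):

```python
import numpy as np

x = np.arange(4)

via_operators = -(0.5 * x + 1) ** 2
via_ufuncs = np.negative(np.power(np.add(np.multiply(0.5, x), 1), 2))

print(via_ufuncs)                                # [-1.   -2.25 -4.   -6.25]
print(np.array_equal(via_operators, via_ufuncs)) # True
```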

The following table lists the arithmetic operators implemented in NumPy:

| Operator      | Equivalent ufunc    | Description                           |
|---------------|---------------------|---------------------------------------|
|``+``          |``np.add``           |Addition (e.g., ``1 + 1 = 2``)         |
|``-``          |``np.subtract``      |Subtraction (e.g., ``3 - 2 = 1``)      |
|``-``          |``np.negative``      |Unary negation (e.g., ``-2``)          |
|``*``          |``np.multiply``      |Multiplication (e.g., ``2 * 3 = 6``)   |
|``/``          |``np.divide``        |Division (e.g., ``3 / 2 = 1.5``)       |
|``//``         |``np.floor_divide``  |Floor division (e.g., ``3 // 2 = 1``)  |
|``**``         |``np.power``         |Exponentiation (e.g., ``2 ** 3 = 8``)  |
|``%``          |``np.mod``           |Modulus/remainder (e.g., ``9 % 4 = 1``)|

Additionally there are Boolean/bitwise operators; we will explore these in [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb).
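The operator/ufunc correspondence in the table above can be checked directly. The sketch below compares each operator with its ufunc on a small array and uses the `nin` attribute to show how many inputs a ufunc takes; the list name `pairs` is illustrative.

```python
import numpy as np

x = np.arange(4)

pairs = [
    (x + 5,  np.add(x, 5)),
    (x - 5,  np.subtract(x, 5)),
    (-x,     np.negative(x)),
    (x * 2,  np.multiply(x, 2)),
    (x / 2,  np.divide(x, 2)),
    (x // 2, np.floor_divide(x, 2)),
    (x ** 2, np.power(x, 2)),
    (x % 2,  np.mod(x, 2)),
]
all_match = all(np.array_equal(a, b) for a, b in pairs)

print(all_match)                      # True
print(np.negative.nin, np.add.nin)    # 1 2  (unary vs binary)
```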

In [68]:
# absolute value
x = np.array([-2, -1, 0, 1, 2])
abs(x)          # Python's built-in abs; on an ndarray it dispatches to NumPy's implementation
np.absolute(x)  # the corresponding NumPy ufunc
np.abs(x)       # alias of np.absolute, same result

array([2, 1, 0, 1, 2])
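The same ufunc also handles complex data, in which case it returns the magnitude of each element. A short sketch (the array `z` is made up for illustration):

```python
import numpy as np

z = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
magnitudes = np.abs(z)

print(magnitudes)               # [5. 5. 2. 1.]
print(np.abs is np.absolute)    # True: two names for the same ufunc
```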