# Big Data Real-Time Analytics with Python and Spark

## Chapter 2 - Data Manipulation in Python with Numpy
- Documentation: https://numpy.org/
- How to import NumPy
- Ways to create a data structure with NumPy
- Data types
- Operations

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# To update a package: !pip install -U package_name
# To install a specific package version: !pip install package_name==version
# pip is the Python package installer; the leading ! runs the command in the operating system shell
# Use -q for a quiet installation and -U to upgrade the package if it is already installed

# After installing or updating a package, restart the Jupyter notebook kernel

# Install watermark package
# This package is used to record the versions of other packages used in this jupyter notebook
!pip install -q -U watermark

In [3]:
# I do not need to install NumPy because it is already included in my Anaconda packages
# Importing the NumPy module (it has to be loaded into my session)
import numpy as np

In [4]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversion

Author: Bianca Amorim

numpy: 1.21.5



## Different ways to create data structures with NumPy

In [5]:
# creating a one-dimensional array
array1 = np.array([10, 20, 30, 40])

In [6]:
print(array1)

[10 20 30 40]


In [7]:
# checking the shape of the array
array1.shape

(4,)

The result (4,) means that the array is a one-dimensional vector with 4 elements. The shape is a tuple with one entry per dimension; the trailing comma is simply how Python writes a one-element tuple, so there is only one dimension.

Note: in pure Python this would be a list, but in NumPy nomenclature it is called an array.

In [8]:
# checking the number of dimensions to confirm what I said above
array1.ndim

1

In [9]:
# creating a two-dimensional array
# note: wrap all the inner lists in one outer pair of []
array2 = np.array([[100, 83, 15], [42, 78, 0]])

In [10]:
print(array2)

[[100  83  15]
 [ 42  78   0]]


In [11]:
# checking the shape of the array
array2.shape

(2, 3)

In [12]:
# checking the number of dimensions to confirm it is two-dimensional
array2.ndim

2

### In Python everything is an Object, with Methods and Attributes
- **Method** is a function that belongs to the object; it is an action that I perform on the object
- **Attribute** is a feature (characteristic) of the object

In [13]:
# Method
array2.max()

100

In [14]:
# Attribute
array2.ndim

2
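
As a small sketch of the same distinction, the cell below reads a few attributes and calls a few methods on a copy of the array used above. The names `arr`, `col_max` and `row_max` are only illustrative.

```python
import numpy as np

arr = np.array([[100, 83, 15], [42, 78, 0]])

# Attributes: read without parentheses, they describe the array
print(arr.shape, arr.ndim, arr.size)    # (2, 3) 2 6

# Methods: called with parentheses, they compute something
print(arr.max(), arr.min(), arr.sum())  # 100 0 318

# Many methods accept an axis: axis=0 works down each column,
# axis=1 works across each row
col_max = arr.max(axis=0)   # largest value in each column
row_max = arr.max(axis=1)   # largest value in each row
print(col_max, row_max)
```

A quick way to tell them apart: if it needs `()`, it is a method; if it does not, it is an attribute.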

### Creating data structure with arange

In [15]:
# Create an array with the number of elements that I pass as a parameter (values start at 0)
array3 = np.arange(15)

In [16]:
array3

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [17]:
# I can customize it using the arguments (start, end (exclusive), step)
# If you want 15 in the result, the end has to be greater than 15, e.g. 16
array4 = np.arange(0, 15, 3)

In [18]:
array4

array([ 0,  3,  6,  9, 12])

### Creating data structure with linspace

In [19]:
# linspace creates evenly spaced values over an interval
# arguments: (start, end, number of elements); the end value is included by default
# a ";" after the first statement lets us display array5 on the same line
# the "." after each number is there because linspace returns floats by default (you can change it with dtype)
array5 = np.linspace(0, 3, 4); array5

array([0., 1., 2., 3.])
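
To make the difference between `arange` and `linspace` concrete, here is a minimal sketch over the same interval. The variable names are illustrative; `retstep` is a real `linspace` parameter that also returns the computed spacing.

```python
import numpy as np

# linspace fixes the NUMBER of points and includes the end value by default
points = np.linspace(0, 1, 5)
print(points)   # [0.   0.25 0.5  0.75 1.  ]

# arange fixes the STEP and excludes the end value
steps = np.arange(0, 1, 0.25)
print(steps)    # [0.   0.25 0.5  0.75]

# retstep=True also returns the spacing that linspace computed
_, spacing = np.linspace(0, 1, 5, retstep=True)
print(spacing)  # 0.25
```

As a rule of thumb, `linspace` is the safer choice when the end value must be included or when the step is a float.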

### Creating data structure with other functions

**numpy.zeros**

In [20]:
# 10x8 array of zeros
# 2-dimensional (rows, columns)
array6 = np.zeros((10, 8)); array6

array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.]])

It is very common to create an array filled with zeros when training an artificial neural network, where all the work is done through matrix operations on arrays. I create the array, fill it with 0, and update the values as the training goes on.
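
The idea of starting from zeros and updating in place can be sketched as follows. This is only a toy loop with a constant, made-up update; a real model would compute each update from the data. The names `weights` and `update` are hypothetical.

```python
import numpy as np

# Hypothetical example: a 3x2 "weight" array that starts at zero
weights = np.zeros((3, 2))

# Stand-in for a training loop: each step adds a small constant update.
# A real model would compute the update from data and gradients.
update = np.full((3, 2), 0.5)
for _ in range(4):
    weights += update   # in-place: the same array is modified at each step

print(weights)          # every entry is now 2.0
```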

**numpy.ones**

In [63]:
# 2x3x4 array of 1's
# 3-dimensional: 2 blocks, each with 3 rows and 4 columns
array7 = np.ones((2, 3, 4)); array7

array([[[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]],

       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]])

**numpy.eye**

In [22]:
# Create an identity array
# the argument is the number of rows (and columns), i.e. the number of 1's on the diagonal
array8 = np.eye(3); array8

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

**numpy.diag**

In [23]:
# Create a diagonal array
# argument is the diagonal values
array9 = np.diag((2,1,4,6)); array9

array([[2, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 4, 0],
       [0, 0, 0, 6]])

**numpy.random.rand**  

Here I call the function rand from NumPy's random package. It creates an array of random elements; the argument is the number of elements that I want in the array. Without a seed, each run produces different values.

In [24]:
# the function rand(n) produces n uniformly distributed numbers in the range [0, 1)
np.random.seed(100)  # First I set a seed
array10 = np.random.rand(5); array10

array([0.54340494, 0.27836939, 0.42451759, 0.84477613, 0.00471886])

**Why set a seed?** A random process produces different values on each run. Setting a seed fixes the starting state of the pseudo-random number generator, so the same sequence of values is generated every time. **The values still behave like random numbers, but the sequence is the same on every run.** This is important so that you can always reproduce the same result in a test or analysis. It is also very useful in machine learning, e.g. when you have to reproduce the same steps each time you run a model.
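
A short sketch of this behaviour: resetting the seed restarts the sequence, while calling `rand` again without resetting simply continues it. The variable names are illustrative.

```python
import numpy as np

np.random.seed(100)
first = np.random.rand(3)

np.random.seed(100)          # resetting the seed restarts the sequence
second = np.random.rand(3)

third = np.random.rand(3)    # no reset: the sequence simply continues

print(np.array_equal(first, second))  # True
print(np.array_equal(first, third))   # False
```

Newer NumPy versions also offer `np.random.default_rng(seed)`, which returns a generator object with its own state instead of relying on the global seed.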

**numpy.random.randn**  

This is useful when you initialize a machine learning model and want the coefficients to start out following a normal distribution, which is very common in some algorithms.

In [25]:
# np.random.randn
# the function randn(n) produces n numbers drawn from the standard normal (Gaussian) distribution (mean 0, standard deviation 1)
array11 = np.random.randn(5); array11

array([ 0.35467445, -0.78606433, -0.2318722 ,  0.20797568,  0.93580797])
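
Since `randn` draws from the standard normal distribution, a different mean and standard deviation can be obtained by shifting and scaling the result. The sketch below assumes made-up target values (`mean = 10`, `std = 2`) purely for illustration.

```python
import numpy as np

np.random.seed(100)

mean, std = 10.0, 2.0                       # hypothetical target parameters
samples = mean + std * np.random.randn(100_000)

# With many samples, the estimates should land close to the targets
sample_mean = samples.mean()
sample_std = samples.std()
print(round(sample_mean, 1), round(sample_std, 1))
```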

**numpy.empty**

Use it when you only need the structure, as when you create an array of zeros. Note: as you can see below, the array does not look empty. `np.empty` allocates memory without initializing it, so these numbers are whatever happened to be in that memory. But now I have the structure, and I just fill it with the elements that I need.

In [26]:
array12 = np.empty((3,2)); array12

array([[4.64305497e-310, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000]])
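
A minimal sketch of the intended workflow with `np.empty`: allocate the structure, then overwrite every position before reading anything back. The name `table` and the values written into it are only illustrative.

```python
import numpy as np

table = np.empty((3, 2))      # contents are arbitrary until assigned

# Overwrite every position before using the array
for row in range(table.shape[0]):
    table[row, :] = [row, row * 10]

print(table)
```

If any position might be left unassigned, `np.zeros` is the safer choice, because reading uninitialized values gives unpredictable results.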

**numpy.tile**

tile builds a new array by repeating the input array a given number of times.

In [27]:
np.array([[9, 4], [3, 7]])

array([[9, 4],
       [3, 7]])

In [28]:
# passing a scalar
# the argument 4 is the number of copies, placed side by side along the last axis
np.tile(np.array([[9, 4], [3, 7]]), 4)

array([[9, 4, 9, 4, 9, 4, 9, 4],
       [3, 7, 3, 7, 3, 7, 3, 7]])

In [29]:
# Passing the number of repetitions along each axis (rows, columns)
np.tile(np.array([[9, 4], [3, 7]]), (2,2))

array([[9, 4, 9, 4],
       [3, 7, 3, 7],
       [9, 4, 9, 4],
       [3, 7, 3, 7]])
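
`np.tile` is easy to confuse with `np.repeat`, so here is a small side-by-side sketch. Both are real NumPy functions; the variable names are illustrative.

```python
import numpy as np

base = np.array([9, 4])

tiled = np.tile(base, 2)       # repeats the whole array: [9 4 9 4]
repeated = np.repeat(base, 2)  # repeats each element:    [9 9 4 4]
print(tiled, repeated)

# With a tuple, tile takes (copies along rows, copies along columns)
grid = np.tile(np.array([[9, 4], [3, 7]]), (2, 1))
print(grid.shape)              # (4, 2)
```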

### Numpy Data types 

In [30]:
array13 = np.array([8, -3, 5, 9], dtype = 'float')

In [31]:
print(array13)

[ 8. -3.  5.  9.]


In [32]:
array13.dtype

dtype('float64')

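The difference between float64 and float32 is the number of bits used to store each value: float64 (double precision) keeps about 15 to 16 significant decimal digits, while float32 (single precision) keeps about 7 and uses half the memory.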
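
The precision and memory trade-off can be checked directly with `np.finfo` and the `nbytes` attribute. This is just an illustrative sketch; the array size of 1000 is arbitrary.

```python
import numpy as np

info32 = np.finfo(np.float32)
info64 = np.finfo(np.float64)

# bits = storage size, precision = approximate number of reliable decimal digits
print(info32.bits, info32.precision)   # 32 6
print(info64.bits, info64.precision)   # 64 15

# The same number of elements takes half the memory in float32
small = np.zeros(1000, dtype=np.float32)
large = np.zeros(1000, dtype=np.float64)
print(small.nbytes, large.nbytes)      # 4000 8000
```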

In [33]:
array14 = np.array([2, 4, 6, 8])

In [34]:
print(array14)

[2 4 6 8]


In [35]:
array14.dtype  # By default integers are stored as int64 (on most 64-bit platforms)

dtype('int64')

**Note:** It is important to always check that the array has the type you want.

In [36]:
array15 = np.array([2.0, 4, 6, 8])

In [37]:
array15.dtype

dtype('float64')

**Note:** If at least one element is a float, the whole array is created as float.

In [38]:
array16 = np.array(['Data', 'Science', 'Academy'])

In [39]:
array16.dtype

dtype('<U7')

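**Note:** `<U7` is a Unicode string type: `U` stands for Unicode and 7 is the length of the longest string in the array ('Science' and 'Academy' have 7 characters). To keep it simple, think of it as a string. In pure Python the same data would just be a list of strings.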
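
One practical consequence of a fixed-width string type is worth a quick sketch: assigning a longer string into an existing array silently cuts it to the array's width. The replacement word used here is arbitrary.

```python
import numpy as np

words = np.array(['Data', 'Science', 'Academy'])
print(words.dtype.kind)       # 'U' means Unicode string
print(words.dtype.itemsize)   # 28 bytes: 7 characters x 4 bytes each

# Assigning a string longer than 7 characters truncates it without warning
words[0] = 'Engineering'
print(words[0])               # Enginee
```

Creating the array with a wider dtype, for example `dtype='<U20'`, avoids the truncation.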

In [40]:
array17 = np.array([True, False, True])

In [41]:
array17.dtype

dtype('bool')

In [42]:
array18 = np.array([7, -3, 5.24])

In [43]:
array18.dtype

dtype('float64')

In [44]:
# Changing the data type
array18.astype(int)

array([ 7, -3,  5])

> **Warning:** `astype(int)` does not round; it truncates the decimal part (toward zero), so 5.24 becomes 5, and 5.9 would also become 5. You can lose information. Is that what you want?
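
To see how `astype(int)` treats the decimal part, the sketch below compares it with explicit rounding and flooring on a few made-up values.

```python
import numpy as np

values = np.array([7.0, -3.0, 5.24, 5.9, -5.9])

truncated = values.astype(int)            # drops the decimal part (toward zero)
rounded = np.round(values).astype(int)    # rounds to the nearest integer first
floored = np.floor(values).astype(int)    # always rounds down

print(truncated)  # [ 7 -3  5  5 -5]
print(rounded)    # [ 7 -3  5  6 -6]
print(floored)    # [ 7 -3  5  5 -6]
```

If nearest-integer behaviour is what you need, round explicitly before converting.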


### Operations with Arrays

With NumPy we can create the data structure and perform an operation in a single instruction.

In [45]:
# Array starting from 0, with 30 elements, multiplied by 5
array19 = np.arange(0, 30) * 5

In [46]:
print(array19)

[  0   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85
  90  95 100 105 110 115 120 125 130 135 140 145]


In [47]:
# create an array and raise to the fourth power
array20 = np.arange(5) ** 4

In [48]:
array20

array([  0,   1,  16,  81, 256])

In [49]:
# create an array and add a number to each element of the array
array21 = np.arange(0, 30) + 1

In [50]:
array21

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])

In [51]:
# create array22 with elements from 0 to 30 (exclusive), step 3, then add 3
# create array23 with elements from 1 to 11 (exclusive)
array22 = np.arange(0, 30, 3) + 3
array23 = np.arange(1, 11)

In [52]:
array22

array([ 3,  6,  9, 12, 15, 18, 21, 24, 27, 30])

In [53]:
array23

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [54]:
# Subtraction
array22 - array23

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [55]:
# Sum
array22 + array23

array([ 4,  8, 12, 16, 20, 24, 28, 32, 36, 40])

In [56]:
# Division
array22 / array23

array([3., 3., 3., 3., 3., 3., 3., 3., 3., 3.])

In [57]:
# Multiplication
array22 * array23

array([  3,  12,  27,  48,  75, 108, 147, 192, 243, 300])

In [58]:
# Compare arrays
array22 > array23

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True])

In [59]:
# Create two boolean arrays and
# apply the logical "and" element by element with a NumPy function
arr1 = np.array([True, False, True, False])
arr2 = np.array([False, False, True, False])
np.logical_and(arr1, arr2)

array([False, False,  True, False])

In [60]:
# Apply the logical "or" element by element with a NumPy function
np.logical_or(arr1, arr2)

array([ True, False,  True, False])
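
For boolean arrays, the operators `&`, `|` and `~` give the same element-wise results as `np.logical_and`, `np.logical_or` and `np.logical_not`. A small sketch, with illustrative variable names:

```python
import numpy as np

a = np.array([True, False, True, False])
b = np.array([False, False, True, False])

both = a & b         # same result as np.logical_and(a, b)
either = a | b       # same result as np.logical_or(a, b)
neither = ~(a | b)   # element-wise "not"

# When combining comparisons, parentheses are required
# because & binds more tightly than > and <
nums = np.arange(10)
mask = (nums > 2) & (nums < 7)
selected = nums[mask]   # keeps only the elements where mask is True
print(selected)         # [3 4 5 6]
```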

> We can build the same kind of data structure in pure Python without NumPy; however, NumPy is much faster for element-wise numerical operations like the one below

In [61]:
array_numpy = np.arange(1000)
%timeit array_numpy ** 4

2.49 µs ± 74.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [62]:
# In pure Python I have to loop over the elements, here with a list comprehension
array_python = range(1000)
%timeit [array_python[i] ** 4 for i in array_python]

227 µs ± 3.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


- **%timeit** is an IPython magic command, available in Jupyter notebooks
- **"µs"** means microseconds
- **arange** is a NumPy function and **range** is a Python built-in
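
Outside a notebook, where `%timeit` is not available, a similar comparison can be sketched with the standard library `timeit` module. The number of runs (1000) is an arbitrary choice, the timings depend on the machine, and `dtype=np.int64` is set explicitly as an assumption to avoid overflow on platforms where the default integer is 32 bits.

```python
import timeit

import numpy as np

# int64 is requested explicitly so that 999 ** 4 does not overflow
array_numpy = np.arange(1000, dtype=np.int64)
array_python = range(1000)

# Total time in seconds for 1000 runs of each version
numpy_time = timeit.timeit(lambda: array_numpy ** 4, number=1000)
python_time = timeit.timeit(lambda: [x ** 4 for x in array_python], number=1000)

print(f"NumPy: {numpy_time:.4f} s, pure Python: {python_time:.4f} s for 1000 runs")
```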

## The End