
The first thing we want to do is import numpy.

In [5]:
import numpy as np

Let us first define a Python list containing the ages of 6 people.

In [6]:
ages_list = [10, 5, 8, 32, 65, 43]
print(ages_list)

[10, 5, 8, 32, 65, 43]


There are three main ways to instantiate a NumPy ndarray object. One of these is to pass an existing Python collection to `np.array(<collection>)`.

In [7]:
ages = np.array(ages_list)
print(type(ages))
print(ages)

<class 'numpy.ndarray'>
[10  5  8 32 65 43]


In [8]:
print(ages)
print("Size:\t" , ages.size)
print("Shape:\t", ages.shape)

[10  5  8 32 65 43]
Size:	 6
Shape:	 (6,)


In [9]:
zeroArr = np.zeros(5)
print(zeroArr)

[0. 0. 0. 0. 0.]
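`np.zeros` above is another way of creating an array: instead of converting an existing collection, it allocates an array of a given shape filled with a constant. The sketch below illustrates the same idea with a 2-D shape and an explicit dtype. `np.ones` and `np.full` are sibling functions that are not used elsewhere in this notebook, and the variable names are only illustrative.

```python
import numpy as np

# A tuple shape gives a multi-dimensional array; the default dtype is float64.
zero_grid = np.zeros((2, 3))

# The dtype can be set explicitly, e.g. to get integers instead of floats.
one_ints = np.ones(4, dtype=int)

# np.full fills the array with any constant value.
sevens = np.full((2, 2), 7)

print(zero_grid.shape, zero_grid.dtype)
print(one_ints)
print(sevens)
```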


### Multi-dim

Now let us define a new list containing the weights of these 6 people.

In [10]:
weight_list = [32, 18, 26, 60, 55, 65]

Now, we define an ndarray containing all of this information, and again print the size and shape of the array.

In [11]:
people = np.array([ages_list, weight_list])

print("People:\t" , people)
print("Size:\t" , people.size)
print("Shape:\t", people.shape)

People:	 [[10  5  8 32 65 43]
 [32 18 26 60 55 65]]
Size:	 12
Shape:	 (2, 6)


In [12]:
people = people.reshape(12,1)
print("People:\t" , people)
print("Size:\t" , people.size)
print("Shape:\t", people.shape)

People:	 [[10]
 [ 5]
 [ 8]
 [32]
 [65]
 [43]
 [32]
 [18]
 [26]
 [60]
 [55]
 [65]]
Size:	 12
Shape:	 (12, 1)


###### Note: The new shape must have the same total "size" (number of elements) as the old shape; here, 12 × 1 = 2 × 6 = 12.
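To make the size rule concrete, here is a minimal sketch that rebuilds the `people` array and tries a few shapes. Passing `-1` for one dimension asks NumPy to infer it from the total size; `reshape_error` is a name introduced only for this illustration.

```python
import numpy as np

people = np.array([[10, 5, 8, 32, 65, 43],
                   [32, 18, 26, 60, 55, 65]])

# 12 elements fit into (3, 4), (4, 3), (6, 2), (1, 12), ... but not into (5, 2).
as_3x4 = people.reshape(3, 4)

# -1 lets NumPy work out the missing dimension: 12 / 4 = 3.
inferred = people.reshape(4, -1)
print(as_3x4.shape, inferred.shape)

reshape_error = None
try:
    people.reshape(5, 2)  # 5 * 2 = 10, which does not match 12
except ValueError as err:
    reshape_error = str(err)
    print("ValueError:", reshape_error)
```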

### Exercise

* Generate a 1D numpy array with the values [7, 9, 65, 33, 85, 99]

* Generate a matrix (2D numpy array) of the values:

\begin{align}
  \mathbf{A} =
  \begin{pmatrix}
    1 & 2 & 4 \\
    2 & 3 & 0 \\
    0 & 5 & 1
  \end{pmatrix}
\end{align}

* Change the dimensions of this array to another permitted shape

In [21]:
list1 = [7, 9, 65, 33, 85, 99]
one_dim_array = np.array(list1)
print(one_dim_array)

[ 7  9 65 33 85 99]


In [34]:
matrix = np.array([[1, 2, 4],
                   [2, 3, 0],
                   [0, 5, 1]])  # a plain 2D ndarray; np.matrix is no longer recommended
matrix

array([[1, 2, 4],
       [2, 3, 0],
       [0, 5, 1]])

## Array Generation

Instead of defining an array manually, we can ask numpy to do it for us.

The `np.arange()` function creates a range of numbers with a user-defined step between each. The stop value itself is excluded.

In [23]:
five_times_table = np.arange(0, 55, 5) # from 0 up to (but not including) 55, in steps of 5
five_times_table

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
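Because the stop value is excluded, the choice of stop decides whether the last multiple appears. The following sketch shows this, along with the one-argument form and a negative step; the variable names are illustrative.

```python
import numpy as np

# stop=50 ends at 45, while stop=51 (or 55) ends at 50.
up_to_45 = np.arange(0, 50, 5)
up_to_50 = np.arange(0, 51, 5)
print(up_to_45)
print(up_to_50)

# With a single argument, arange counts from 0 in steps of 1.
first_four = np.arange(4)
print(first_four)

# A negative step counts downwards; the stop value is still excluded.
countdown = np.arange(10, 0, -2)
print(countdown)
```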

The `np.linspace()` function produces a specified number of evenly spaced values between a start and an end value, both of which are included by default.

In [27]:
five_spaced = np.linspace(0, 50, 11) # eleven evenly spaced numbers from 0 to 50, endpoints included
print(five_spaced)

[ 0.  5. 10. 15. 20. 25. 30. 35. 40. 45. 50.]
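As a small sketch of how `np.linspace` relates to `np.arange`, the optional `retstep` and `endpoint` arguments can be used to inspect the spacing and to drop the final value. The variable names here are illustrative.

```python
import numpy as np

# retstep=True also returns the spacing that NumPy computed between samples.
values, step = np.linspace(0, 50, 11, retstep=True)
print(step)

# endpoint=False leaves out the stop value, so 10 samples match np.arange(0, 50, 5).
open_ended = np.linspace(0, 50, 10, endpoint=False)
print(open_ended)
```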


The `np.repeat()` function repeats the object you pass a specified number of times.

In [26]:
twoArr = np.repeat(2, 10)
print(twoArr)

[2 2 2 2 2 2 2 2 2 2]
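When the input is an array rather than a single number, `np.repeat` repeats each element in turn. The sketch below contrasts this with `np.tile`, a related function not used in this notebook, which repeats the array as a whole.

```python
import numpy as np

pattern = np.array([1, 2, 3])

# np.repeat duplicates each element before moving on to the next one.
repeated = np.repeat(pattern, 2)

# np.tile lays copies of the whole array end to end.
tiled = np.tile(pattern, 2)

print(repeated)
print(tiled)
```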


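The `np.eye()` function creates an array with ones on the main diagonal and zeros elsewhere. With a single argument, `np.eye(N)`, the result is the N×N identity matrix; with two arguments, as below, the result is rectangular.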

In [29]:
identity_matrix = np.eye(6, 2)
print(identity_matrix)

[[1. 0.]
 [0. 1.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
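For comparison, here is a short sketch of the square case and of the optional `k` argument, which shifts the diagonal. `np.identity` is an alternative that only builds square matrices; the sample matrix `a` is made up for the illustration.

```python
import numpy as np

# One argument gives a square identity matrix (float64 by default).
square = np.eye(3)
same = np.identity(3)

# k=1 places the ones on the first diagonal above the main one.
shifted = np.eye(3, k=1)
print(shifted)

# Multiplying by the identity matrix leaves a matrix unchanged.
a = np.array([[1, 2, 4], [2, 3, 0], [0, 5, 1]])
unchanged = a @ square
print(np.array_equal(unchanged, a))
```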


# Operations

There are many, many operations which we can perform on arrays. Below, we demonstrate a few.

What is happening in each line?

In [35]:
five_times_table

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50])

In [36]:
print("1:", 2 * five_times_table) # every element multiplied by 2
print("2:", 10 + five_times_table) # 10 added to every element
print("3:", five_times_table - 1) # 1 subtracted from every element
print("4:", five_times_table / 5) # every element divided by 5 (the result is float)
print("5:", five_times_table ** 2) # every element raised to the power of 2
print("6:", five_times_table < 20) # element-wise test: is each element smaller than 20?

1: [  0  10  20  30  40  50  60  70  80  90 100]
2: [10 15 20 25 30 35 40 45 50 55 60]
3: [-1  4  9 14 19 24 29 34 39 44 49]
4: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
5: [   0   25  100  225  400  625  900 1225 1600 2025 2500]
6: [ True  True  True  True False False False False False False False]
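The boolean array produced by a comparison such as line 6 can itself be used as an index, which keeps only the elements where the mask is `True`. A minimal sketch, with illustrative variable names:

```python
import numpy as np

five_times_table = np.arange(0, 55, 5)

# The comparison yields one True/False per element.
mask = five_times_table < 20

# Indexing with the mask keeps only the elements where it is True.
below_20 = five_times_table[mask]
print(below_20)

# True counts as 1, so summing the mask counts the matches.
count = int(mask.sum())
print(count)

# Conditions are combined with & (and) or | (or), each wrapped in parentheses.
between = five_times_table[(five_times_table >= 20) & (five_times_table <= 30)]
print(between)
```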


### Speed Test

If we compare the speed of these operations with the equivalent code in core Python, we will notice a substantial difference.

In [38]:
fives_list = list(range(0, 5001, 5))  # plain Python list holding 0, 5, ..., 5000
fives_list[:5]

[0, 5, 10, 15, 20]

In [39]:
five_times_table_lge = np.arange(0,5001,5)
five_times_table_lge

array([   0,    5,   10, ..., 4990, 4995, 5000])

In [40]:
%timeit five_times_table_lge + 5

The slowest run took 56.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 5: 1.22 µs per loop


In [41]:
%timeit [e + 5 for e in fives_list]

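`%timeit` is an IPython magic and only works inside a notebook. As a hedged alternative, the sketch below runs the same comparison with the standard-library `timeit` module, so it also works as a plain script. The absolute timings depend on the machine, so none are quoted here.

```python
import timeit

import numpy as np

fives_list = list(range(0, 5001, 5))
five_times_table_lge = np.arange(0, 5001, 5)

# Time 1,000 runs of each approach.
list_time = timeit.timeit(lambda: [e + 5 for e in fives_list], number=1000)
numpy_time = timeit.timeit(lambda: five_times_table_lge + 5, number=1000)

print(f"list comprehension: {list_time:.4f} s for 1000 runs")
print(f"NumPy vectorised:   {numpy_time:.4f} s for 1000 runs")

# Both approaches produce the same values.
same_result = [e + 5 for e in fives_list] == (five_times_table_lge + 5).tolist()
print(same_result)
```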

Boolean string operations can also be performed on ndarrays.

In [None]:
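words = np.array(["ten", "nine", "eight", "seven", "six"])

# np.isin compares whole elements with 'e', so every entry is False
print(np.isin(words, 'e'))

# likewise, `in` on an array asks whether any element equals "e"
print("e" in words)
# a list comprehension performs the substring test on each word
["e" in word for word in words]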
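For a vectorised substring test, one option is `np.char.find`, which is not used elsewhere in this notebook. It returns the index of the first match in each string, or -1 when there is none. A minimal sketch:

```python
import numpy as np

words = np.array(["ten", "nine", "eight", "seven", "six"])

# find() gives -1 for strings that do not contain the substring.
has_e = np.char.find(words, "e") >= 0
print(has_e)

# The boolean result can be used as a mask to keep the matching words.
words_with_e = words[has_e]
print(words_with_e)
```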

# Transpose

In [42]:
people = people.reshape(2, 6)  # back from (12, 1) to two rows of six
print(people, "\n")
print(people.T)

[[10  5  8 32 65 43]
 [32 18 26 60 55 65]] 

[[10 32]
 [ 5 18]
 [ 8 26]
 [32 60]
 [65 55]
 [43 65]]
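Transposing swaps the two axes, so element `[i, j]` of the transpose is element `[j, i]` of the original. The sketch below also shows that `.T` returns a view rather than a copy, so a write through the transpose changes the original array. The replacement value 11 is arbitrary.

```python
import numpy as np

people = np.array([[10, 5, 8, 32, 65, 43],
                   [32, 18, 26, 60, 55, 65]])

transposed = people.T
print(people.shape, "->", transposed.shape)

# Rows and columns are exchanged.
print(transposed[3, 1] == people[1, 3])

# .T shares memory with the original array.
shares = np.shares_memory(people, transposed)
transposed[0, 0] = 11
print(shares, people[0, 0])
```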


# Data Types

As previously mentioned, an ndarray can only have one data type. To inspect it, we use the `.dtype` attribute; to choose it when creating an array, we pass the `dtype` argument.

In [43]:
people.dtype

dtype('int64')

What is the data type of the below ndarray?

In [44]:
ages_with_strings = np.array([10, 5, 8, '32', '65', '43'])
ages_with_strings

array(['10', '5', '8', '32', '65', '43'], dtype='<U21')

What is the dtype of this array?

In [45]:
ages_with_strings = np.array([10, 5, 8, '32', '65', '43'], dtype='int32')
ages_with_strings

array([10,  5,  8, 32, 65, 43], dtype=int32)

What do you think has happened here?

In [46]:
ages_with_strings = np.array([10, 5, 8, '32', '65', '43'])
print(ages_with_strings)

['10' '5' '8' '32' '65' '43']


In [None]:
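# Assigning to .dtype reinterprets the raw bytes of the strings in place; it does not convert the values
ages_with_strings.dtype = 'int32'
print(ages_with_strings)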

In [47]:
ages_with_strings.size

6

In [48]:
ages_with_strings.size/21

0.2857142857142857

In [49]:
np.array([10, 5, 8, '32', '65', '43']).size

6
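Assigning a new dtype to an existing array does not convert its values; it reinterprets the same bytes. The sketch below reproduces the effect with `.view()`, which performs the same reinterpretation without modifying the original. It assumes explicit little-endian dtypes (`<U21`, `<i4`) so that the byte layout is the same on any machine; the variable names are illustrative.

```python
import numpy as np

ages_as_text = np.array(["10", "5", "8", "32", "65", "43"], dtype="<U21")

# Each '<U21' element reserves 21 characters of 4 bytes each.
print(ages_as_text.itemsize)

# view() reads the same bytes as 32-bit integers: 21 integers per string.
raw = ages_as_text.view("<i4")
print(raw.size)
print(raw.size / 21)

# The integers are character codes: '1', '0', then zero padding.
print(raw[:3])
```

This is why the size of the reinterpreted array is a multiple of 21, and why the printed numbers bear no relation to the ages.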

The correct way to change the data type of the ndarray is to use the `.astype()` method, which returns a converted copy, as demonstrated below.

In [None]:
ages_with_strings = np.array([10, 5, 8, '32', '65', '43'])
print(ages_with_strings)
print(ages_with_strings.astype('int32'))
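As a brief sketch of what `.astype()` does: it parses or converts each value into a new array and leaves the original untouched. The boolean example uses a made-up array to show that zero maps to `False` and any other number to `True`.

```python
import numpy as np

ages_with_strings = np.array(["10", "5", "8", "32", "65", "43"])

# astype returns a new array; the original keeps its string dtype.
ages_int = ages_with_strings.astype("int32")
print(ages_with_strings.dtype, "->", ages_int.dtype)

# Arithmetic now works on the converted values.
total = int(ages_int.sum())
print(total)

# Converting numbers to bool: zero becomes False, everything else True.
flags = np.array([0, 1, 2]).astype(bool)
print(flags)
```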

### Exercise

* Create an array of string numbers, but use dtype to make it an array of floats.
* Transpose the matrix, printing the new size and shape.
* Use the .astype() method to convert the array to boolean.

## Array Slicing Operations

As before, we can use square brackets and indices to access individual values, and the colon operator to slice the array.

In [None]:
five_times_table

In [None]:
five_times_table[0]

In [None]:
five_times_table[-1]

In [None]:
five_times_table[:4]

In [None]:
five_times_table[4:]

We can also slice an n-dimensional ndarray, specifying the slice operation across each axis.

In [None]:
print(people)
people[:3, :3]
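In a two-axis index the first entry selects rows and the second selects columns. A minimal sketch on the `people` array, mixing single indices and slices; the variable names are illustrative.

```python
import numpy as np

people = np.array([[10, 5, 8, 32, 65, 43],
                   [32, 18, 26, 60, 55, 65]])

# A single index on the first axis returns a whole row (all ages).
ages_row = people[0]

# A colon keeps every row; the 2 picks the third column (age and weight of person 3).
third_person = people[:, 2]

# Row 1 (weights), last two columns.
last_two_weights = people[1, -2:]

print(ages_row)
print(third_person)
print(last_two_weights)
```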

### Exercise

* Create a numpy array with 50 zeros
* Create a np array of 2 repeated 20 times
* Create a numpy array from 0 to 2 $\pi$ in steps of 0.1

For one of the arrays generated:
* Get the first five values
* Get the last 3 values
* Get the 4th value to the 7th value

We can reverse an array by using the `np.flip()` function or by slicing with a negative step (`[::-1]`).

In [None]:
reverse_five_times_table = np.flip(five_times_table)
reverse_five_times_table

In [None]:
reverse_five_times_table = five_times_table[-1::-1]
print(reverse_five_times_table)
five_times_table

We can also use the `start:stop:step` slice syntax to select every n-th element of the original array.

In [None]:
five_times_table[0::3] #Every 3rd element starting from 0
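The general form of a slice is `start:stop:step`, where any part may be omitted. The sketch below combines all three parts and uses a negative step to walk backwards; the variable names are illustrative.

```python
import numpy as np

five_times_table = np.arange(0, 55, 5)

# Indices 1, 3, 5, 7: start at 1, stop before 8, step 2.
odd_positions = five_times_table[1:8:2]
print(odd_positions)

# A step of -1 walks from the last element to the first.
reversed_table = five_times_table[::-1]
print(reversed_table)

# Every 3rd element, starting from the end.
backwards_by_3 = five_times_table[::-3]
print(backwards_by_3)
```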

### Exercise
Take one of the arrays you defined and
* Reverse it
* Only keep every 4th element.
* Get every 2nd element, starting from the last and moving backwards.

# Stats

In [None]:
np.array([1.65432, 5.98765]).round(2)

In [None]:
nums = np.arange(0, 4, 0.2555)

### Exercise

* Compute min, max, sum, mean, median, variance, and standard deviation of the above array, all to 2 decimal places.

In [None]:
print("min = ", np.min(nums).round(2))
print("max = ", np.max(nums).round(2))
print("sum = ", np.sum(nums).round(2))
print("mean = ", np.mean(nums).round(2))
print("median = ", np.median(nums).round(2))
print("var = ", np.var(nums).round(2))
print("std = ", np.std(nums).round(2))
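Two details of these functions are worth a short sketch. First, `np.var` and `np.std` divide by n by default (the population formulas); passing `ddof=1` divides by n - 1 instead. Second, on a 2-D array the `axis` argument computes one statistic per row or per column. The `people` array is rebuilt here so the block is self-contained.

```python
import numpy as np

nums = np.arange(0, 4, 0.2555)
n = nums.size

# Population variance (divide by n) versus sample variance (divide by n - 1).
population_var = np.var(nums)
sample_var = np.var(nums, ddof=1)
print(np.isclose(sample_var, population_var * n / (n - 1)))

people = np.array([[10, 5, 8, 32, 65, 43],
                   [32, 18, 26, 60, 55, 65]])

# axis=1 averages along each row: mean age, then mean weight.
row_means = people.mean(axis=1)
print(row_means.round(2))
```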

## Random

With `np.random`, we can draw random samples from a number of distributions, for example to create synthetic training data.

The code below simulates ten tosses of a fair coin.

In [None]:
flip = np.random.choice([0,1], 10)
flip

In [None]:
np.random.rand(10,20,9)
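The sketch below checks what the two cells above produce. `np.random.rand` takes the shape as separate arguments and draws uniform values in [0, 1), and a fair coin averages close to 0.5 over many tosses. It also shows the `p` argument of `choice` for unequal probabilities. The seeded generator from `np.random.default_rng`, the seed value, and the weather labels and weights are assumptions made only for this illustration.

```python
import numpy as np

# A seeded generator makes the draws repeatable from run to run.
rng = np.random.default_rng(0)

# Many fair tosses: the share of ones should be near 0.5.
flips = rng.choice([0, 1], 10_000)
print(flips.mean())

# rand(10, 20, 9) returns a 10 x 20 x 9 array of uniform values in [0, 1).
cube = np.random.rand(10, 20, 9)
print(cube.shape, cube.min() >= 0, cube.max() < 1)

# p sets the probability of each outcome; the weights must sum to 1.
weather = rng.choice(["sun", "cloud", "rain"], size=1000, p=[0.6, 0.3, 0.1])
labels, counts = np.unique(weather, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```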

We can produce 1000 data points from a normal distribution by using `np.random.normal()`.

In [None]:
mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
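As a quick sanity check, the sample mean and standard deviation should land close to `mu` and `sigma`. This sketch uses a seeded generator from `np.random.default_rng` (an assumption here, since the notebook calls `np.random.normal` directly) so that the same numbers are drawn on every run; the seed 42 is arbitrary.

```python
import numpy as np

mu, sigma = 0, 0.1  # mean and standard deviation

rng = np.random.default_rng(seed=42)
s = rng.normal(mu, sigma, 1000)

# The sample statistics approximate the parameters, but do not match them exactly.
print(s.mean().round(3), s.std().round(3))

# Re-creating the generator with the same seed reproduces the same draws.
s_again = np.random.default_rng(seed=42).normal(mu, sigma, 1000)
print(np.array_equal(s, s_again))
```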

### Exercise
* Simulate a six-sided die using numpy.random.choice(), and generate a list of the values you would obtain from 10 throws.
* Simulate a two-sided coin toss that is NOT fair: it is twice as likely to land heads as tails.
