## 6 Numpy - lecture

# Introduction to NumPy

Son Huynh
31.01.2020


**NumPy stands for Numerical Python**
- A library consisting of array objects and a collection of routines for processing arrays.
- NumPy is used for fast data generation and handling.
- NumPy relies on code written in low-level languages (C, Fortran) to combine Python's expressiveness with high performance.

## Numpy Documentation:
- Numpy math functions: https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html
- Array attributes and methods: https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html

## Table of contents:
* [1. Numpy Arrays](#1.-NumPy-Arrays)
* [2. Numpy Operations](#2.-NumPy-Operations)
* [3. Vectorization](#3.-Vectorization)
* [4. Multi-dimensional array](#4.-Multi-dimensional-array)
* [5. Indexing and slicing](#5.-Indexing-and-slicing-array)
* [6. Statistics with array](#6.-Statistics-with-array)

In [1]:
import numpy as np # convention to use np as alias for numpy

## 1. NumPy Arrays
- Arrays are the main building block of NumPy.
- If you know linear algebra from maths, the concepts below will be familiar.
- NumPy arrays are used a lot in data analysis and visualization.
- NumPy arrays come in different dimensions:
    - Vectors: one-dimensional arrays, basically like a list of elements
    - Matrices: two-dimensional arrays
    - Arrays with more than two dimensions are also possible

An array can hold different types of data, just like a list, if it is created with `dtype='object'`.

In [2]:
arr = np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)], dtype='object')
# np.array takes the content of the array as its first argument.
# Like a list, an array can hold different kinds of content.
# dtype='object' keeps each element as it is. Without it, NumPy prefers a single
# element type and would convert every element here to a string.
# Try the line without the dtype argument!


In [3]:
arr

array([['Rex', 9, 81.0],
       ['Fido', 3, 27.0]], dtype=object)

For computation purposes, it's optimal to store an array as one data type.
Like in Excel, where each column would usually hold only one data type, each NumPy array has a single data type.
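To make the single-data-type idea concrete, here is a small sketch of how NumPy picks one type when the input is mixed. The variable names are made up for illustration.

```python
import numpy as np

# Mixed ints and floats are promoted to one common float type.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)        # float64

# Mixing strings and numbers without dtype='object' turns everything into strings.
mixed_text = np.array(['Rex', 9, 81.0])
print(mixed_text.dtype.kind)      # U  (a Unicode string type)

# With dtype='object' the original Python objects are kept as they are.
as_objects = np.array(['Rex', 9, 81.0], dtype='object')
print(type(as_objects[1]))        # <class 'int'>
```

Object arrays are flexible, but they lose most of the speed advantage, which is why a single numeric type is preferred for computation.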

In [4]:
arr = np.array([1.1, 1.2, 3.5, 10.1])

In [5]:
arr.dtype

dtype('float64')

## 2. NumPy Operations

NumPy provides many mathematical operations, such as sin, cos, tan, exp, log, ...
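A minimal sketch of a few of these functions applied to whole arrays. The input values are chosen only for illustration.

```python
import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(angles)            # sine of each angle (in radians)
cosines = np.cos(angles)          # cosine of each angle

values = np.array([1.0, np.e, np.e ** 2])
logs = np.log(values)             # natural logarithm of each value
exps = np.exp(np.array([0.0, 1.0]))  # e raised to each value

print(sines)
print(cosines)
print(logs)
print(exps)
```

Each function is applied element by element, so the result has the same shape as the input. Results are floating-point numbers, so values such as sin(pi) come out very close to zero rather than exactly zero.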

#### Numpy Broadcasting: Different behaviours compared to Python's list

In [6]:
# Broadcasting = a maths operation with a single number is applied to each element in the array


lst = [1, 2, 3]

# How can we multiply each number in lst by 2?

In [7]:
print('New list: ', lst * 2) # Not what we want,
# because this only repeats the list twice

New list:  [1, 2, 3, 1, 2, 3]


In [8]:
lst2 = []
for i in lst:
    lst2.append(i*2)
print('New list: ', lst2) 
# Too many lines of code, and it could be slow when the data gets big.
# A list comprehension would make it shorter, but it is still a Python-level loop.

New list:  [2, 4, 6]


__Let's see how numpy array behaves__

In [9]:
arr = np.array(lst)

arr

# so, don't need to add any for loop here!

array([1, 2, 3])

In [10]:
arr * 2 # The number 2 is broadcast to every element of arr

array([2, 4, 6])

In [11]:
arr + 2

array([3, 4, 5])

In [12]:
arr**2

# For some reason, Son's output also showed an explicit dtype (dtype=int32);
# the default integer type depends on the platform.

array([1, 4, 9])

In [13]:
np.sqrt(arr)

# There is no square root symbol on the keyboard,
# so you use the function name sqrt.

# The math library also has a square root function:

# import math
# a = [1, 2, 3]
# math.sqrt(a)   # raises TypeError: math.sqrt only accepts a single number

# so NumPy is more convenient for arrays.

# You can also write your own functions to do things,
# but they will no longer be NumPy functions.

# When you call the square root function, you have to use the np alias.

array([1.        , 1.41421356, 1.73205081])
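The difference between `math.sqrt` and `np.sqrt` can be sketched as follows. The list `numbers` is made up for illustration.

```python
import math
import numpy as np

numbers = [1, 4, 9]

# math.sqrt works on one number at a time, so a list needs an explicit loop.
roots_with_math = [math.sqrt(n) for n in numbers]

# Passing the whole list to math.sqrt raises a TypeError.
try:
    math.sqrt(numbers)
    math_accepts_list = True
except TypeError:
    math_accepts_list = False

# np.sqrt handles the whole array in one call.
roots_with_numpy = np.sqrt(np.array(numbers))

print(roots_with_math)    # [1.0, 2.0, 3.0]
print(math_accepts_list)  # False
print(roots_with_numpy)   # [1. 2. 3.]
```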

In [14]:
arr == 1

# comparing an array with a number returns an array of booleans

array([ True, False, False])

### Practice

In [15]:
import random # Use the random module to generate some random numbers

random.seed(100)
lst1 = random.sample(range(1,6), 5) # Generate a list of the numbers 1 to 5 in random order (sampling without replacement)
lst2 = random.sample(range(1,6), 5) # Generate another list

print(lst1, lst2, sep='\n')


# Arda: "What is seed?" Son: Because you want a random list, but you don't want
# it to change every time you run the code. With a fixed seed the result stays the
# same, e.g. when I share the notebook with you.


[2, 4, 5, 1, 3]
[3, 4, 5, 1, 2]
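A short sketch of what the seed does: resetting it to the same value makes the random generator repeat the same sequence. The names `first_run` and `second_run` are made up for illustration.

```python
import random

random.seed(100)
first_run = random.sample(range(1, 6), 5)

random.seed(100)   # reset the generator to the same starting point
second_run = random.sample(range(1, 6), 5)

print(first_run == second_run)  # True: same seed, same "random" list
```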


In [16]:
# Convert lst1 and lst2 into numpy arrays

lst1_new = np.array(lst1)

lst2_new = np.array(lst2)

# print(type(lst1_new)) # this shows the type,
# confirming that it really is an array.

# Son's version
#lst1 = np.array(lst1)
#lst2 = np.array(lst2)
#lst1

In [17]:
# Use np.log() to find the NATURAL LOGARITHM of lst1
# Had to Google natural logarithm! Need to watch Toni's maths course website

np.log(lst1_new)

# Son's version
# np.log(lst1)

array([0.69314718, 1.38629436, 1.60943791, 0.        , 1.09861229])

In [18]:
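# For each pair of numbers in lst1 and lst2, add them together (element-wise sum)

# Use the arrays here: lst1 + lst2 on the plain lists would concatenate them instead.
# Son's version (where lst1 and lst2 had been overwritten with arrays): lst1 + lst2
lst1_new + lst2_new

array([ 5,  8, 10,  2,  5])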

In [19]:
# Compare each pair of numbers in lst1 and lst2 to see which pairs are equal

# Son's version (with lst1 and lst2 as arrays): lst1 == lst2
lst1_new == lst2_new

array([False,  True,  True,  True, False])

In [20]:
# Sonja: "What happens if you add a normal list to a NumPy array?"

# Son: NumPy converts the list to an array and adds the two element-wise.

#lst3 = [1,2,3,4,5]
#lst1_new + lst3
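Sonja's question can be checked with a small sketch. The values below are made up for illustration; the lengths have to match for the addition to work.

```python
import numpy as np

arr = np.array([2, 4, 5, 1, 3])
plain_list = [1, 2, 3, 4, 5]

# The list is treated as an array and added element by element.
result = arr + plain_list
print(type(result))             # <class 'numpy.ndarray'>
print(result)                   # [3 6 8 5 8]

# Two plain lists are still concatenated by +.
print(plain_list + plain_list)  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```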

## 3. Vectorization

Very important topic in Data Analysis

Rule of thumb: 
- Avoid native Python constructs (for-loops, if-else, ...) when iterating over large datasets. NumPy is built to make computation more efficient, and we should use it.
- Most problems have a vectorized solution. Custom functions or for-loops should be a last resort.
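As one possible illustration of replacing an if-else inside a loop, the sketch below uses `np.where`. The scores and the pass mark of 50 are made up for the example.

```python
import numpy as np

scores = np.array([35, 80, 52, 49, 91])   # made-up example data

# Loop version with if-else
labels_loop = []
for s in scores:
    if s >= 50:
        labels_loop.append('pass')
    else:
        labels_loop.append('fail')

# Vectorized version: np.where(condition, value_if_true, value_if_false)
labels_vec = np.where(scores >= 50, 'pass', 'fail')

print(labels_loop)  # ['fail', 'pass', 'pass', 'fail', 'pass']
print(labels_vec)   # ['fail' 'pass' 'pass' 'fail' 'pass']
```

Both versions give the same labels; the vectorized one evaluates the condition for the whole array at once.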

https://towardsdatascience.com/why-you-should-forget-for-loop-for-data-science-code-and-embrace-vectorization-696632622d5f

So, normally, you have a for loop, e.g. 'for x in ...'

Then you have tabular data, e.g.

name     age    grade1    grade2

But what if your data has one million rows and you want to add a new column? Your intuition would be to write a for loop, e.g.

for row in table:
    grade = row.grade1 + row.grade2

This would be bad practice, because Python would have to step through all one million rows one at a time, which is very slow.

Enter Numpy and VECTORISATION.

Python is a very high-level language, which makes plain Python loops slow. NumPy uses Python syntax, but its core is implemented in C. When you apply an operation to a whole array, the loop over the elements runs in compiled C code rather than in Python, so it runs faster and you don't have to write a for loop.

So, with the above table, when you import it into Pandas, each column is a vector (an array), and you can add the columns together directly.

Sometimes, however, you won't find a vectorised solution in NumPy, even after digging deeper, and you'll have to develop your own more complicated solution, such as a for loop.
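The grade example can be sketched with plain NumPy arrays standing in for the two columns. The data is randomly generated and the names `grade1`, `grade2`, `total_loop` and `total_vec` are assumptions for illustration. The timings depend on the machine, so none are quoted here.

```python
import time
import numpy as np

# Hypothetical grade columns with 100,000 rows each
rng = np.random.default_rng(0)
grade1 = rng.integers(0, 101, size=100_000)
grade2 = rng.integers(0, 101, size=100_000)

# Loop version: one Python-level addition per row
start = time.perf_counter()
total_loop = np.empty_like(grade1)
for i in range(len(grade1)):
    total_loop[i] = grade1[i] + grade2[i]
loop_seconds = time.perf_counter() - start

# Vectorized version: the whole column is added in one operation
start = time.perf_counter()
total_vec = grade1 + grade2
vec_seconds = time.perf_counter() - start

print(np.array_equal(total_loop, total_vec))  # True: same result
print(loop_seconds, vec_seconds)              # compare the two timings
```

On most machines the vectorized addition should be much faster, and the gap typically grows with the number of rows.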

## 4. Multi-dimensional array

In [21]:
# Create a 1D array
arr = np.arange(1, 13)

arr

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

In [22]:
# Turn the 1D array into 2D array (want to have 4 rows, 3 columns)
arr = arr.reshape(4, 3) # reshape method

arr

# len only considers the outermost dimension (the rows);
# use shape too if you want to know the number of columns

# Sönke and Seppo's question about
# arr.reshape(-1, 1) # see what this returns
# arr.shape

# Sonja: reshaping really changes the structure of the data. How do you get back to the previous shape?
# Son: try arr.T and experiment, I'm sure there's a way to do it.
# (Note: arr.T transposes the array; arr.reshape(-1) returns it to one dimension.)

# Son: machine learning models often require you to reshape data, so this is very useful to know!

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
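The questions raised in this cell can be explored with a short sketch. The names `flat`, `grid`, `column` and `back_to_flat` are made up for illustration.

```python
import numpy as np

flat = np.arange(1, 13)          # 1-D array with 12 elements
grid = flat.reshape(4, 3)        # 4 rows, 3 columns

# -1 asks NumPy to work out that dimension from the number of elements.
column = flat.reshape(-1, 1)
print(column.shape)              # (12, 1)

# Reshaping back to one dimension recovers the original array.
back_to_flat = grid.reshape(-1)
print(np.array_equal(back_to_flat, flat))  # True

# .T transposes (swaps rows and columns); it does not undo a reshape.
print(grid.T.shape)              # (3, 4)
```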

In [23]:
arr.shape # Shape of the array

# you can always add more dimensions, but in
# data analysis you'll rarely need more than two.

(4, 3)

In [24]:
len(arr) # Length of the outer-most dimension

4

In [25]:
arr.size # Number of elements in the array

12
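How `shape`, `len` and `size` relate to each other can be summarised in a small sketch, using the same 4 x 3 array as above.

```python
import numpy as np

arr = np.arange(1, 13).reshape(4, 3)
rows, cols = arr.shape

print(len(arr) == rows)          # True: len is the first entry of shape
print(arr.size == rows * cols)   # True: size is the product of the shape
print(arr.ndim)                  # 2: the number of dimensions
```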

In [26]:
# Broadcasting also works for multi-dimensional arrays
arr*10

array([[ 10,  20,  30],
       [ 40,  50,  60],
       [ 70,  80,  90],
       [100, 110, 120]])

## 5. Indexing and slicing array

In [27]:
arr = np.arange(20).reshape(4, 5)

arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [28]:
arr[0] # Get the first row as a whole (a 1-D array), not the individual elements in it

array([0, 1, 2, 3, 4])

In [29]:
arr[1,0] # Get the element at second row, first column

5

In [30]:
lst = [list(x) for x in arr]

lst[1][0] # If you want to do the same thing with a list of lists, you have to chain the indexing like this

5

In [31]:
arr[1][0]
# Chain indexing is also available for arrays, but it is BAD PRACTICE (not idiomatic):
# arr[1] first builds an intermediate row, which is then indexed again.
# Better to do it as in the "element at second row" example above: arr[1,0]

5

In [32]:
arr[:, 1] # Slicing: get every row, but only the second column

array([ 1,  6, 11, 16])
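One reason to prefer the comma form is that chained indexing does not always mean the same thing. The sketch below rebuilds the 4 x 5 array from this section to show the difference.

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

# For a single element, both forms return the same value.
print(arr[1, 0] == arr[1][0])   # True

# With slices they differ:
print(arr[:, 1])   # [ 1  6 11 16]  -> second column
print(arr[:][1])   # [5 6 7 8 9]    -> arr[:] is the whole array, so [1] is the second row
```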

In [33]:
arr[0,0] = 10 # Modify the value of the element at first row, first column

arr

array([[10,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

### Practice

In [34]:
arr = np.random.uniform(low=0, high=1, size=(15)) # 15 elements

arr

# So, experiment - now you can do some reshaping and slicing

array([0.8583106 , 0.97695349, 0.3123181 , 0.50327355, 0.78295586,
       0.90510779, 0.41148247, 0.29272242, 0.71848202, 0.74562768,
       0.19581586, 0.7620633 , 0.27957344, 0.91468768, 0.69851414])

In [35]:
# Reshape arr into a 2-D array of 5 rows and 3 columns

# see seven or eight cells up ^
arr = arr.reshape(5, 3)
# number one mistake in this course: if you want to reuse the result,
# remember to assign/store it in a variable

arr


# Arda: why does shape not have brackets after it like a function, if it's a function?
# Son: it is not a function, it's an attribute. Did you already learn about classes?
# When I was born, I was given a name (an attribute). I can also run and sing (like methods),
# but I only do those when I choose to run or sing.



array([[0.8583106 , 0.97695349, 0.3123181 ],
       [0.50327355, 0.78295586, 0.90510779],
       [0.41148247, 0.29272242, 0.71848202],
       [0.74562768, 0.19581586, 0.7620633 ],
       [0.27957344, 0.91468768, 0.69851414]])
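The attribute-versus-method analogy can be sketched with a toy class. `Person` is a hypothetical class written only for this illustration; it is not part of NumPy.

```python
import numpy as np

class Person:
    """Toy class mirroring the analogy; not part of NumPy."""

    def __init__(self, name):
        self.name = name           # attribute: a stored value

    def sing(self):                # method: an action, runs only when called
        return f"{self.name} is singing"

son = Person("Son")
print(son.name)      # Son             (no brackets: just read the value)
print(son.sing())    # Son is singing  (brackets: perform the action)

arr = np.arange(6).reshape(2, 3)
print(arr.shape)     # (2, 3)  attribute, no brackets
print(arr.sum())     # 15      method, called with brackets
```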

In [36]:
# Check the shape of arr

arr.shape

(5, 3)

In [37]:
# Check the number of elements in arr

arr.size

15

In [38]:
# Get all items from first row, first and second column

arr[0 , :2] # So, was this the nice Pythonic way instead of arr[0][:2]?

# Son's version: arr[0, 0:2]

array([0.8583106 , 0.97695349])

In [39]:
# Get all items from every odd-numbered row (rows 1, 3 and 5) and all columns

arr[::2, :] 
# The slice is start:stop:step. Start and stop are left empty here, so it runs from
# the beginning to the end in steps of 2. It's like list slicing, except that the comma
# separates the row slice from the column slice.

array([[0.8583106 , 0.97695349, 0.3123181 ],
       [0.41148247, 0.29272242, 0.71848202],
       [0.27957344, 0.91468768, 0.69851414]])
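A few more `start:stop:step` combinations, sketched on a small integer array so the selected rows and columns are easy to see. The array here is made up for illustration.

```python
import numpy as np

arr = np.arange(15).reshape(5, 3)   # rows 0..4, columns 0..2

every_other_row = arr[::2, :]       # rows at index 0, 2, 4
skipped_rows = arr[1::2, :]         # rows at index 1, 3
middle_block = arr[1:4, ::2]        # rows at index 1..3, columns 0 and 2

print(every_other_row)
print(skipped_rows)
print(middle_block)
```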

In [40]:
# Use np.round(array, decimals) to round the first row to 2 decimals.
# Use slicing to modify arr directly.


arr[0] = np.round(arr[0], 2)
arr

array([[0.86      , 0.98      , 0.31      ],
       [0.50327355, 0.78295586, 0.90510779],
       [0.41148247, 0.29272242, 0.71848202],
       [0.74562768, 0.19581586, 0.7620633 ],
       [0.27957344, 0.91468768, 0.69851414]])

## 6. Statistics with array

See the documentation on array methods for the full list of implemented methods.

In NumPy, you have some statistics functions, e.g. sum.




In [41]:
arr = np.arange(9).reshape(3, 3)

arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [42]:
arr.sum(axis=0) # the axis argument sets the direction of the sum: axis=0 adds down the rows, giving one total per column

array([ 9, 12, 15])

In [43]:
arr.mean(axis=1) # axis=1 aggregates across the columns, giving one mean per row

array([1., 4., 7.])
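The effect of the `axis` argument can be compared side by side on the same 3 x 3 array.

```python
import numpy as np

arr = np.arange(9).reshape(3, 3)

print(arr.sum())          # 36           no axis: sum of all elements
print(arr.sum(axis=0))    # [ 9 12 15]   one total per column
print(arr.sum(axis=1))    # [ 3 12 21]   one total per row
print(arr.mean(axis=0))   # [3. 4. 5.]   one mean per column
```

A way to remember it: the axis you name is the one that disappears from the result.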

**The previous course had problems understanding the difference between sum and mean.** 

**In data analysis, you should have a strong understanding of statistics,** otherwise
you won't know which statistical tool to use.

E.g. when comparing emissions between centuries, it would not make sense to use sum; you'd use mean.
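A sketch of why the choice matters, using made-up numbers and assuming the two periods contain different numbers of measurements. The arrays `period_a` and `period_b` are invented for illustration and are not real emissions data.

```python
import numpy as np

# Invented yearly values: period A has five measurements, period B only two.
period_a = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
period_b = np.array([20.0, 22.0])

print(period_a.sum(), period_b.sum())     # 60.0 42.0 -> the sum makes A look larger
print(period_a.mean(), period_b.mean())   # 12.0 21.0 -> the mean shows B is higher per year
```

The sum grows with the number of measurements, while the mean describes a typical value, so the mean is the fairer comparison here.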
