<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@acca-logo.jpg" alt="ACCA logo" style="width: 400px;"/>

# Python for data analysis
## Part 1 - Numerical computing with NumPy

* **Course:** __Machine learning with Python for finance professionals__ by ACCA
* **Instructor:** [Coefficient](https://coefficient.ai) / [@CoefficientData](https://twitter.com/CoefficientData)

---

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
NumPy
</h2><br>
</div>

<img src="https://courses.edx.org/asset-v1:ACCA+ML001+2T2021+type@asset+block@numpy.png" alt="NumPy" style="width: 300px;"/>

[NumPy](https://numpy.org) (Numerical Python, pronounced "num-pie") is an open source Python library for working with numerical data. Many other libraries use NumPy under the hood, including state-of-the-art libraries for [financial analysis](https://numpy.org/numpy-financial/), [image processing](https://scikit-image.org/), [psychology](https://www.psychopy.org/), [astronomy](https://www.astropy.org/), [statistical computing](https://github.com/statsmodels/statsmodels), [machine learning](https://scikit-learn.org/), [quantum computing](http://qutip.org/), [bioinformatics](https://biopython.org/) and [much more](https://numpy.org/).

Let's import NumPy. We could just write `import numpy`, but to save typing we'll use the shorthand alias `np`. You can alias any Python import this way, e.g. `import numpy as any_name_you_like`, but `np` is short and is the convention used throughout the NumPy documentation.

In [1]:
import numpy as np

In [2]:
# The core of NumPy is the NumPy array. It's similar to a normal Python list.
x = np.array([1, 2, 3, 4, 5])

In [3]:
# Select the element at position 0
x[0]

1

In [4]:
# Select the first 3 elements
x[:3]

array([1, 2, 3])

In [5]:
# Select the last two elements
x[-2:]

array([4, 5])
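
To make the comparison with a normal Python list concrete, here is a small sketch. The names `prices_list` and `prices_array` are illustrative and not part of the course material. Indexing and slicing look the same for both, but arithmetic behaves differently: multiplying a list repeats it, while multiplying an array works element by element.

```python
import numpy as np

# Illustrative names, not part of the course material
prices_list = [1, 2, 3, 4, 5]
prices_array = np.array(prices_list)

# Indexing and slicing use the same syntax for lists and arrays
first_three = prices_array[:3]      # same elements as prices_list[:3]
last_two = prices_array[-2:]        # same elements as prices_list[-2:]

# Arithmetic differs: the list is repeated, the array is multiplied element by element
doubled_list = prices_list * 2      # ten items: the original five, twice
doubled_array = prices_array * 2    # five items: 2, 4, 6, 8, 10

print(doubled_list)
print(doubled_array)
```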

In [6]:
# Let's use the built-in "array range" function, similar to Python's range() function
y = np.arange(start=1, stop=11)
y

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
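
As with Python's `range()`, the `stop` value is not included in the result, which is why `stop=11` produces numbers up to 10. The sketch below checks this and also shows the optional `step` argument. The variable names are illustrative.

```python
import numpy as np

# stop is exclusive, just like range(): this yields 1 to 10, not 1 to 11
y = np.arange(start=1, stop=11)
same_as_range = list(range(1, 11))

# An optional step controls the spacing between values
odd_numbers = np.arange(start=1, stop=11, step=2)

print(y.tolist() == same_as_range)
print(odd_numbers)
```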

In [7]:
# We can use the normal built-in Python functions with this array
sum(y)

55

In [8]:
# However, NumPy has its own "method" for calculating the sum
y.sum()

55

A "method" in Python is like a function that operates on whatever it's "connected to". Instead of `sum(x)` we do `x.sum()`. Methods are always accessed via a dot (`.`) and are executed with round brackets just like functions. Don't forget the brackets when calling methods!
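
A quick sketch of what happens when the brackets are left off. The names `total` and `not_called` are illustrative. Without brackets, Python hands back the method itself rather than running it, so no sum is calculated.

```python
import numpy as np

y = np.arange(start=1, stop=11)

total = y.sum()        # brackets: the method runs and returns the sum
not_called = y.sum     # no brackets: this is the method itself, not a number

print(total)
print(callable(not_called))   # the method is still waiting to be called
print(not_called())           # calling it now returns the same sum
```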

In [9]:
# NumPy also has methods for mean, min, max and more.
print("Mean = ", y.mean())
print("Min = ", y.min())
print("Max = ", y.max())
print("Std dev = ", y.std())  # standard deviation

Mean =  5.5
Min =  1
Max =  10
Std dev =  2.8722813232690143
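
For readers who want to see where the standard deviation figure comes from, here is a minimal sketch that reproduces it by hand. By default `.std()` divides by the number of values (the population standard deviation). Passing `ddof=1` divides by one fewer, which gives the sample standard deviation that some spreadsheet and statistics tools report instead. The variable names are illustrative.

```python
import numpy as np

y = np.arange(start=1, stop=11)

# Population standard deviation: square root of the mean squared deviation
deviations = y - y.mean()
std_by_hand = np.sqrt((deviations ** 2).mean())

# ddof=1 divides by (n - 1) instead of n, giving the sample standard deviation
sample_std = y.std(ddof=1)

print(std_by_hand)   # agrees with y.std()
print(sample_std)    # slightly larger than the population figure
```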


One of the benefits of using NumPy, and by extension Python libraries that use NumPy under the hood such as pandas, is computational speed. With NumPy you get much of the speed and power of languages like C or Fortran but with the simplicity of Python.

Let's see this speed in action.

First, we need to know how to time a cell's execution. We use `%%timeit` in the cell below. This is a handy "Jupyter Magic" tool for timing the execution of a code cell. It **must** be on the top line of a cell.

`%%timeit` runs the cell a number of times (depending on whether the cell is quick or slow to run) and displays information on how long the cell's code takes to run.

In [10]:
%%timeit

# Create a list of one million numbers in Python from 0 to 999999
python_vector = list(range(1000000))

# Use the built-in Python sum() function to add up the numbers
sum(python_vector)

66.3 ms ± 3.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


You should see something that looks like (your numbers will differ based on how powerful your computer is):

```
50.4 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

This means:
1. The cell takes 50.4 milliseconds to run on average (a millisecond is a thousandth of a second).
2. This number is the average over 70 executions of the cell (7 runs of 10 loops each).
3. The standard deviation across these timed runs was 1.1 milliseconds.
4. **Takeaway: this cell takes ~50 milliseconds to run**.
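
Outside Jupyter, a similar measurement can be sketched with Python's standard `timeit` module. This is an approximation of what `%%timeit` reports, not its exact implementation: the 7 runs and 10 loops are set by hand here, whereas `%%timeit` picks the loop count automatically. The function name `sum_python_list` is illustrative, and a smaller list of 100,000 numbers is used so the sketch runs quickly.

```python
import statistics
import timeit

def sum_python_list():
    # Build a list of numbers and sum it (100,000 here to keep the sketch quick)
    python_vector = list(range(100000))
    return sum(python_vector)

# 7 runs of 10 loops each, mirroring "7 runs, 10 loops each" in the output
run_totals = timeit.repeat(sum_python_list, repeat=7, number=10)

# Each entry is the total time for 10 loops, so divide by 10 for time per loop
per_loop_ms = [total / 10 * 1000 for total in run_totals]

mean_ms = statistics.mean(per_loop_ms)
std_ms = statistics.pstdev(per_loop_ms)
print(f"{mean_ms:.2f} ms ± {std_ms:.2f} ms per loop")
print("Total executions:", 7 * 10)
```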

Let's now repeat this using NumPy.

In [11]:
%%timeit

# Create an array of one million numbers using NumPy from 0 to 999999
numpy_vector = np.arange(1000000)

# Use the NumPy .sum() method to add up the numbers
numpy_vector.sum()

2.14 ms ± 84.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


You should see something that looks like:

```
2.21 ms ± 82.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

This suggests that, for this task, NumPy is over 20x faster than the "pure Python" approach (roughly 50 ms versus roughly 2 ms).
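
The arithmetic behind the comparison can be sketched as follows, using the two example timings quoted above (your own numbers will differ). The sketch also checks that both approaches give the same total. Passing `dtype=np.int64` is an assumption added here so that the sum cannot overflow on platforms where NumPy's default integer is 32-bit.

```python
import numpy as np

# Timings copied from the example outputs above (milliseconds per loop)
python_ms = 50.4
numpy_ms = 2.21

speedup = python_ms / numpy_ms
print(f"NumPy is roughly {speedup:.0f}x faster")

# Both approaches compute the same total
python_total = sum(range(1000000))
numpy_total = np.arange(1000000, dtype=np.int64).sum()
print(python_total == numpy_total)
```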

---

<div class="alert alert-block alert-info" style="background-color: #BA001E; border: 0px; -moz-border-radius: 10px; -webkit-border-radius: 10px;">
<h2 style="color: white">
Generating random numbers using NumPy
</h2><br>
</div>

NumPy has a `random` module with many handy functions. For example, this next cell can help you pick a lunch option. Every time you re-run the cell it will randomly choose an option from the list.

In [12]:
np.random.choice(['pizza', 'noodles', 'salad', 'pasta', 'burrito', 'dim sum'])

'noodles'
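
If you ever need the "random" choice to be repeatable, for example so that a colleague gets the same result, one option is a seeded generator. This is a minimal sketch using `np.random.default_rng()`, which is available in NumPy 1.17 and later. The seed value `42` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

lunch_options = ['pizza', 'noodles', 'salad', 'pasta', 'burrito', 'dim sum']

# The seed value 42 is an arbitrary choice for illustration
rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

# Two generators with the same seed make the same sequence of choices
picks_a = [str(rng_a.choice(lunch_options)) for _ in range(3)]
picks_b = [str(rng_b.choice(lunch_options)) for _ in range(3)]

print(picks_a)
print(picks_a == picks_b)
```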

> ### 🚩 Exercises
> 1. Check out the documentation for NumPy's `random.randint()` function by either typing `help(np.random.randint)` or [you can find the NumPy documentation online here](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html). Most Python libraries have great online documentation, and Google is a great tool for finding links to online documentation. Usually "numpy randint" would be enough to find this page.
> 2. Generate a random number between 1 and 100 using `np.random.randint()`.
> 3. Generate 100 random numbers between 1 and 100 using the optional argument `size`.
> 4. Generate 1,000,000 random numbers between 1 and 100 and add `.mean()` onto the end. This is an example of "method chaining".

In [13]:
# 1. Check out the documentation for NumPy's random.randint() function.

#    Running this cell will load up the "built-in" help documentation for this function.
#    This is useful as a quick reference, but we recommend comparing this with the online
#    documentation as this may also link to related examples or User Guide entries.

np.random.randint?

# N.B. You can close the help window in Jupyter Notebook by clicking the ✖️ icon.

In [14]:
# 2. Generate a random number between 1 and 100

# ✏️ ENTER YOUR SOLUTION HERE
# The upper bound is exclusive, so use 101 to include 100
np.random.randint(1, 101)



77

In [15]:
# 3. Generate 100 random numbers between 1 and 100

# ✏️ ENTER YOUR SOLUTION HERE
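# Note: randint(100, ...) draws from 0 to 99, because the range starts at 0 and the
# upper bound is exclusive. For 1 to 100 inclusive, use np.random.randint(1, 101, size=100).
np.random.randint(100, size=100)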



array([90, 81, 31, 28, 39, 17, 43, 28, 64, 67, 59, 68, 50, 95, 95, 82, 21,
       29, 88,  1, 35, 79, 52,  2, 39, 53, 11, 14, 52, 24, 95, 10, 43, 46,
       13, 83, 64, 51, 22, 38,  8,  4,  6, 90, 78, 43,  0, 83,  3, 90, 85,
       39, 95, 77, 29, 78, 43, 64, 23, 59, 52, 51,  4, 51, 65, 90, 51, 72,
       75,  1, 34, 98, 79, 20, 30,  6, 18, 31, 74, 78, 37, 24, 36, 90, 49,
       97, 25, 40, 63, 79, 21, 70, 17, 27, 34, 12, 61, 86, 17, 74])

In [16]:
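# 4. Average of 1,000,000 random numbers between 1 and 100

# ✏️ ENTER YOUR SOLUTION HERE

# Note: this solution draws 100,000 values from 0 to 99, so the mean is close to 49.5.
# To match the exercise exactly, use np.random.randint(1, 101, size=1000000).mean(),
# which gives a mean close to 50.5.
np.random.randint(100, size=100000).mean()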


49.42057
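
A point worth checking when working through these exercises is how `np.random.randint()` treats its bounds: the lower bound is included, the upper bound is excluded, and with a single argument the range starts at 0. The sketch below illustrates this. The variable names are illustrative, and the exact means will vary from run to run.

```python
import numpy as np

# low is inclusive, high is exclusive: this draws from 1, 2, ..., 100
samples = np.random.randint(1, 101, size=1000000)

print(samples.min(), samples.max())   # always within 1 to 100
print(samples.mean())                 # close to 50.5, the midpoint of 1 and 100

# With a single argument the range starts at 0: this draws from 0, 1, ..., 99
zero_based = np.random.randint(100, size=1000000)
print(zero_based.mean())              # close to 49.5
```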

---
More:
- Beginner's guide: https://numpy.org/doc/stable/user/absolute_beginners.html
- Docs: https://numpy.org/doc/stable/reference/index.html
---