In [1]:
%matplotlib inline

# Introduction to NumPy



## Why use NumPy?

NumPy brings fast arrays and fast numerical processing to Python.

Plain Python loops over numbers are slow by comparison, because the interpreter handles each element one at a time.

## How do I import NumPy?

In [1]:
import numpy as np

## How do you create a NumPy array?

You could start with a list, and then convert it:

In [4]:
x_age = [18, 22, 33, 41]

# x is an array: calculations on it are much faster than on the list
x = np.array(x_age)

In [6]:
x.mean()

28.5
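
To give a rough sense of the speed difference between a list and an array, the sketch below times the same sum of squares both ways. The array size and the names (`values`, `loop_total`, `numpy_total`) are arbitrary choices for this illustration, and the timings will vary by machine.

```python
import time
import numpy as np

values = list(range(1_000_000))
arr = np.array(values, dtype=np.int64)

# Plain Python: one interpreted operation per element
start = time.perf_counter()
loop_total = sum(v * v for v in values)
loop_seconds = time.perf_counter() - start

# NumPy: the element-wise work runs in compiled code
start = time.perf_counter()
numpy_total = int((arr * arr).sum())
numpy_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f}s")
print(f"numpy: {numpy_seconds:.4f}s")
```

Both approaches give the same total; only the time taken differs.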

You can also create NumPy arrays using specific utilities...

In [8]:
np.arange(0, 10, 2) # numbers from 0 up to (but not including) 10, in steps of 2

array([0, 2, 4, 6, 8])

In [11]:
np.repeat([0, 1], 5) # repeat each element of [0, 1] five times

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
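
Since `np.repeat` repeats each element rather than the whole sequence, it can help to see it next to `np.tile`, which is not used elsewhere in this notebook and is shown here only for contrast. A minimal sketch:

```python
import numpy as np

# repeat: each element is repeated in place
repeated = np.repeat([0, 1], 3)
print(repeated)   # [0 0 0 1 1 1]

# tile: the whole sequence is repeated end to end
tiled = np.tile([0, 1], 3)
print(tiled)      # [0 1 0 1 0 1]
```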

In [15]:
np.random.choice([1, 2, 3, 4, 5, 6], 10) # 10 rolls of a die

array([2, 1, 2, 3, 3, 5, 6, 5, 3, 3])

...here we are using the `random` module within NumPy to simulate experimental data (i.e., drawing a number from the set 1 to 6, 10 times).

## How do you compute with arrays?

Suppose we generate an array which represents the ages of 10 people (mean $30$, standard deviation $5$):

In [16]:
x_age = np.random.normal(30, 5, 10) 

x_age

array([26.75645146, 30.06000099, 33.07749316, 33.23508197, 32.8935573 ,
       25.18538857, 26.23845934, 40.78925534, 31.12088011, 20.20150136])

Suppose we need to compute $3x_{age} + 1$, then we write:

In [17]:
3 * x_age + 1

array([ 81.26935439,  91.18000296, 100.23247948, 100.7052459 ,
        99.68067189,  76.55616571,  79.71537801, 123.36776601,
        94.36264032,  61.60450407])

Notice that `3*` is run on every element, as is `+1`. 

This is called **vectorization**. 

Aside: note that this doesn't work with lists:

In [23]:
x = [1, 2, 3]
# x = np.array([1, 2, 3])

x + 1

TypeError: can only concatenate list (not "int") to list
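
For comparison, here is a small sketch of how the same arithmetic could be written for a list and for an array. The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3]

# Plain list: apply the operation to each element explicitly
list_result = [3 * v + 1 for v in values]

# NumPy array: the same arithmetic is applied to every element at once
array_result = 3 * np.array(values) + 1

# Note: `*` on a list repeats the list rather than multiplying its elements
repeated_list = values * 2

print(list_result)    # [4, 7, 10]
print(array_result)   # [ 4  7 10]
print(repeated_list)  # [1, 2, 3, 1, 2, 3]
```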

## What is a Sequence?

In [25]:
x_age

array([26.75645146, 30.06000099, 33.07749316, 33.23508197, 32.8935573 ,
       25.18538857, 26.23845934, 40.78925534, 31.12088011, 20.20150136])

The shape of this array defines how it is structured for calculations, $(10,)$ -- a sequence of 10 elements...

In [26]:
x_age.shape

(10,)

In [27]:
len(x_age)

10

### How do I index a sequence?

Just the same as Python lists...

In [28]:
x_age[0]

26.756451462116274

In [29]:
x_age[0:2]

array([26.75645146, 30.06000099])

In [30]:
x_age[-1]

20.20150135831472
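
A few more indexing patterns, sketched on a small hypothetical array (`ages` is made up for this example) so the results are easy to check by eye:

```python
import numpy as np

ages = np.array([26, 30, 33, 41, 20])  # hypothetical ages

first = ages[0]          # single element
first_two = ages[:2]     # positions 0 and 1 (the stop index is excluded)
last_two = ages[-2:]     # the final two elements
every_other = ages[::2]  # every second element, starting at position 0

print(first, first_two, last_two, every_other)
```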

## What is a Matrix?

A table of numbers...

In [46]:
M = np.array([
    (1000, 12, +1),  # e.g., Loan, Duration, Settle
    (2000, 9, -1),   # e.g., Loan, Duration, Settle
    (3000, 6, -1),   # e.g., Loan, Duration, Settle
])

In [47]:
M

array([[1000,   12,    1],
       [2000,    9,   -1],
       [3000,    6,   -1]])

### How do I index a matrix?

`M[row-index, col-index]`

Note, both indexes work like list indexes -- except now there are two. 

In [48]:
M[0, 0] # first row, first column

1000

In [49]:
M[1, 0] # second row, first column

2000

In [50]:
M[0:2, -1] # first two rows, last column

array([ 1, -1])
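
The sketch below rebuilds the same `M` and shows three kinds of selection side by side: a single cell, a whole row, and a rectangular block. The names `cell`, `row` and `block` are just labels for this example.

```python
import numpy as np

M = np.array([
    (1000, 12, +1),
    (2000, 9, -1),
    (3000, 6, -1),
])

cell = M[1, 2]        # one element: second row, third column
row = M[1, :]         # the whole second row, shape (3,)
block = M[0:2, 0:2]   # first two rows and first two columns, shape (2, 2)

print(cell)
print(row)
print(block)
```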

### What is a Vector?
A vector is a matrix of one column.

In [51]:
x_profit = np.array([
    [10],
    [11],
    [12]
])

x_profit

array([[10],
       [11],
       [12]])

In [52]:
x_profit[0, 0]

10

In [53]:
x_profit[0, 1]

IndexError: index 1 is out of bounds for axis 1 with size 1

## Why would we use a matrix of one column?

In machine learning libraries, our features ($X$) must generally be formatted as a matrix.

Each row of the feature matrix $X$ *must* be one complete observation. This is assumed in how these libraries process data. 
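
One common way to turn a sequence into a one-column matrix is `reshape`. This is a minimal sketch; the name `x_profit_flat` is introduced here for illustration.

```python
import numpy as np

x_profit_flat = np.array([10, 11, 12])        # a sequence, shape (3,)

# -1 asks NumPy to infer the number of rows; 1 fixes the number of columns
x_profit_col = x_profit_flat.reshape(-1, 1)   # a column, shape (3, 1)

print(x_profit_flat.shape)  # (3,)
print(x_profit_col.shape)   # (3, 1)
```

Each row of `x_profit_col` is then one observation with a single feature.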

## How do I select multiple elements?

In [54]:
M

array([[1000,   12,    1],
       [2000,    9,   -1],
       [3000,    6,   -1]])

`:2` means from `0` up to, but not including, `2`

In [55]:
M[:2, 0] # `:2` is shorthand for `0:2`

array([1000, 2000])

`:` - from the beginning to the end

In [56]:
M[:, 0]

array([1000, 2000, 3000])

NB. you can just read `:` as "all".

So, `M[:, 0]` means `M[all rows, first column]`

In [61]:
M[ [0, 2], :] # choose rows indexed [0, 2] and all columns

array([[1000,   12,    1],
       [3000,    6,   -1]])

Remember:  `label[index]`  <- always means FIND `index` in `label`

Remember: `[data,]` <- always means `list`
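
One detail worth checking when selecting columns is the shape of the result. As a sketch on the same `M`: indexing with a single integer drops that dimension, while indexing with a list keeps it.

```python
import numpy as np

M = np.array([
    [1000, 12, 1],
    [2000, 9, -1],
    [3000, 6, -1],
])

col_1d = M[:, 0]         # integer index: a sequence, shape (3,)
col_2d = M[:, [0]]       # list index: still a matrix, shape (3, 1)
two_cols = M[:, [0, 2]]  # first and last columns, shape (3, 2)

print(col_1d.shape, col_2d.shape, two_cols.shape)
```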


## How do I select elements by a condition?

Comparisons are also *vectorized*, meaning they run across every element:

In [65]:
x_age  < 30

array([ True, False, False, False, False,  True,  True, False, False,
        True])

`np.where` tells you the indexes of the `True` values...

In [67]:
np.where(x_age < 30)

(array([0, 5, 6, 9]),)

In [69]:
x_age[ np.where(x_age < 30)  ]  # here I select the elements which match this condition

array([26.75645146, 25.18538857, 26.23845934, 20.20150136])

In [71]:
x_age[ x_age < 30  ]  # FIND elements in x_age, WHERE  <30

array([26.75645146, 25.18538857, 26.23845934, 20.20150136])
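
Because `True` counts as 1 and `False` as 0, a boolean mask can also be summed or averaged. The sketch below uses a small hypothetical `ages` array so the numbers can be verified by hand.

```python
import numpy as np

ages = np.array([26, 30, 33, 25, 41, 20])  # hypothetical ages

mask = ages < 30                # boolean array, one entry per element
selected = ages[mask]           # elements where the mask is True
positions = np.where(mask)[0]   # integer positions of the True entries

n_under_30 = mask.sum()         # number of True values
share_under_30 = mask.mean()    # proportion of True values

print(selected, positions, n_under_30, share_under_30)
```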

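Aside: to do this in plain Python, we would use a loop and a condition, which is far slower on large arrays.

NOTE: the output below comes from a different random draw of `x_age`, so its values differ from those shown above.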

In [102]:
keep = []
for age in x_age:
    if age < 30:
        keep.append(age)
keep

[19.260181550953583,
 24.35850053422343,
 23.18504804125957,
 17.933264321074905,
 9.931122518399246,
 18.16283598114136,
 22.236638562348347,
 18.91942978375155,
 5.084342289695776,
 24.375355468023493,
 20.56660078848965]

## How do I combine conditions?

Recall, in python:

In [103]:
age = 18
email = "michael.burgess@qa.com"

(age <= 20) and ("@" in email)

True

The problem with using `and` (or `or`, `not`, etc.) with NumPy is that they only work on *single* `True`/`False` values, not on arrays of comparisons.

In [108]:
temp = np.array([19, 21, 23]) # eg., temp of a room 
hours = np.array([0, 0.5, 1]) # eg., duration of heating

To combine comparisons across an array we must use *vectorized* operators (i.e., ones which work with arrays).

* `&` and 
* `|` or
* `~` not

In [110]:
(temp > 20) & (hours < 0.75)

array([False,  True, False])

In [114]:
(temp > 20) | (hours >= 1)

array([False,  True,  True])

In [115]:
~((temp > 20) & (hours < 0.75))

array([ True, False,  True])
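
To see why `and` is not enough, the sketch below tries it on the same `temp` and `hours` arrays and catches the resulting error. The name `error_message` is introduced only for this illustration.

```python
import numpy as np

temp = np.array([19, 21, 23])
hours = np.array([0, 0.5, 1])

try:
    (temp > 20) and (hours < 0.75)
    error_message = None
except ValueError as err:
    # `and` needs a single True/False, which a multi-element array cannot give
    error_message = str(err)

print(error_message)

# The parentheses matter: `&` binds more tightly than `>` and `<`
combined = (temp > 20) & (hours < 0.75)
print(combined)  # [False  True False]
```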

## How do you simulate real-valued data?

10 random values drawn from a normal distribution with mean $30$ and standard deviation $5$, so they will be approximately $30$ on average and will typically vary from $30$ by about $5$...

In [166]:
np.random.normal(30, 5, 10)

array([19.42974668, 45.13651107, 20.74212228, 34.93130155, 29.081382  ,
       41.01384116, 30.09996855, 15.98946629, 31.61639624, 32.35979923])
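
With only 10 values the sample mean can land some way from 30. As a sketch, drawing a larger sample from a seeded generator makes the result repeatable and shows the mean and spread settling near the requested values. The use of `np.random.default_rng`, the seed, and the sample size are choices made for this example and are not part of the original cell.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed value is arbitrary

sample = rng.normal(loc=30, scale=5, size=10_000)

sample_mean = sample.mean()   # should be close to 30
sample_std = sample.std()     # should be close to 5

print(round(sample_mean, 2), round(sample_std, 2))
```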

## How do you simulate categorical data?

Categorical data is represented as *labels* (eg., die faces, cards, answers to questions, locations)....

In [143]:
x_like_film = np.random.choice(["YES", "NO"], 10)
x_like_film

array(['YES', 'YES', 'YES', 'YES', 'NO', 'YES', 'YES', 'YES', 'NO', 'YES'],
      dtype='<U3')

In numpy, `random.choice` is the easiest way to simulate a categorical variable (eg., `x_like_film`). 

A categorical variable *IS NOT* numerical in the ordinary sense, so if we wish to compute statistics on it, we typically convert it to a frequency distribution (i.e., we count the entries).

In [147]:
categories, counts = np.unique(x_like_film, return_counts=True)

counts

array([2, 8])

The rate of "NO" (i.e., $P(x=\text{NO})$). `np.unique` returns the categories in sorted order, so `counts[0]` is the count of "NO":

In [163]:
counts[0] / sum(counts)

0.2
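
A possible way to keep each count attached to its label is to pair the two arrays returned by `np.unique`. This sketch uses a short hypothetical `answers` array rather than the random draw above.

```python
import numpy as np

answers = np.array(["YES", "YES", "NO", "YES", "NO"])  # hypothetical answers

categories, counts = np.unique(answers, return_counts=True)

# np.unique returns categories in sorted order, so "NO" comes before "YES"
proportions = dict(zip(categories.tolist(), (counts / counts.sum()).tolist()))
print(proportions)

# Equivalent shortcut for a single category: average a boolean mask
p_no = (answers == "NO").mean()
print(p_no)
```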

## Exercise (10 min)

Consider the matrix below...

Import numpy

In [3]:
import numpy as np

In [9]:
np.column_stack(([1, 2, 3], [2, 4, 5]))

array([[1, 2],
       [2, 4],
       [3, 5]])

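In [54]:
import pandas as pd

# X is the feature matrix defined in the In [29] cell below
df = pd.DataFrame(X, columns=["Temp", "Power", "Window"])

# rows where the window is open; keep only the Temp and Power columns
df.loc[df["Window"] == 1, ["Temp", "Power"]]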

|   | Temp | Power |
|---|------|-------|
| 3 | 26   | 3000  |


$f(X; W, b) = W_0X_0 + W_1X_1 + b \dots $

In [29]:
X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])

X[X[:, 0] > 20, -1]   # rows where the first column is > 20,
                      # then the last column of those rows

array([0, 0, 1])

In [64]:
X.shape

(4, 3)

In [74]:
a = np.array([1, 2, 3, 4, 5])
a.shape

(5,)

$P(\text{Window}=\text{Open} \mid \text{Temp} > 21) = P(X_2=1 \mid X_0 > 21) = $

In [34]:
X[X[:, 0] > 21, 2].mean()

0.5

$P(\text{Window} = \text{Open}, \text{Temp} > 21) = P(X_2 = 1, X_0 > 21) = $

In [38]:
((X[:, 0] > 21) & (X[:, 2])).mean()

0.25
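
The two results above are linked by $P(A \mid B) = P(A, B) / P(B)$. A small sketch checking this on the same `X`, where the names `warm` and `window_open` are introduced here for readability:

```python
import numpy as np

X = np.array([
    [21, 1_000, False],  # temp, power, window_open
    [19, 1_000, False],
    [24, 3_000, False],
    [26, 3_000, True],
])

warm = X[:, 0] > 21          # B: temperature above 21
window_open = X[:, 2] == 1   # A: window is open

p_joint = (warm & window_open).mean()   # P(A, B)
p_warm = warm.mean()                    # P(B)
p_conditional = p_joint / p_warm        # P(A | B)

print(p_joint, p_warm, p_conditional)
```

Here the joint probability 0.25 divided by $P(\text{Temp} > 21) = 0.5$ recovers the conditional probability 0.5.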

## Exercise 1: Select Values
* the temperature column
    * HINT: all rows of column 0
* the power column
    * HINT: all rows of column 1
* the last column
    * HINT: all rows of column -1
* the first observation row
    * HINT: row 0 of all columns
* the last observation row
    * HINT: row -1 of all columns
* the temp and power of the first two observations
    * HINT: the first two rows of the first two columns
    * HINT: `0` until `2`
* the temp and power when the window is open
    * HINT: we want the first two columns with a *row* condition (ie., mask, test, ..)
    * HINT: the condition is that the third column `X[:, 2]` is `True`
* the power when the window is closed
    * HINT: as above, condition is that third column is `False`
    

Aside: note that a NumPy array uses a single type for the entire data structure. Here it has chosen an integer type, so `False` and `True` are stored as `0` and `1`.

In [6]:
X.dtype

dtype('int64')
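
As a sketch of how NumPy settles on one type, the arrays below mix different kinds of values. The exact integer width (for example `int64`) can depend on the platform, so the example inspects `dtype.kind` instead.

```python
import numpy as np

ints_and_bools = np.array([21, 1_000, False])   # booleans become integers
ints_and_floats = np.array([21, 0.5])           # integers become floats
bools_only = np.array([True, False])            # stays boolean

# kind is 'i' for integer, 'f' for float, 'b' for boolean
print(ints_and_bools.dtype.kind, ints_and_floats.dtype.kind, bools_only.dtype.kind)
```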

## Exercise 2 (EXTRA)

You are hired by a cinema to make film recommendations to customers as they speak to your front desk staff.

Your staff may observe each customer's age, budget, like_action, and like_comedy.

Note, $x : (age, budget, action, comedy) = (18, 10, +1, -1)$

Let's simulate some data:

$x_{age} \sim N(\mu=35, \sigma=5) \in \mathbb{R}^{25}$
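
One way to start the simulation, as a sketch only: the age draw follows the distribution stated above, while the seed and the equal-chance draw of $+1$ or $-1$ for `x_action` are assumptions made for this example. The distribution of the budget is not specified here, so it is left to the reader.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed, for repeatable results

n_customers = 25

x_age = rng.normal(loc=35, scale=5, size=n_customers)   # ages ~ N(35, 5)

# Assumed coding: +1 means "likes action", -1 means "does not"
x_action = rng.choice([+1, -1], size=n_customers)

print(x_age.shape, x_action.shape)
```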

### Q1. Import and Compute
* import the numpy library
    * recall, use `np`
    
* you are given a regression formula and a classification formula
* use numpy to compute the $y$ predictions for each person
* the regression formula computes likely spend on concessions (food counter)
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = 0.1x_{age} + 0.1x_{budget} + x_{action} - x_{comedy}$

* what is the expected (ie., average) spend for these customers?
    * HINT: `.mean()`
    
#### EXTRA
* a formula to compute whether they will like the blockbuster currently showing
    * $y = f(x_{age}, x_{budget}, x_{action}, x_{comedy}) = (x_{age} < 18) \text{ or } (x_{budget} > 10) \text{ and } x_{action}$
* HINT:
    * `(age < 18) | (budget > 10) & (action == 1)`
    
* what is $P(y=LikeFilm)$ ?
    * HINT: `.mean()`
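
A note on the hint `(age < 18) | (budget > 10) & (action == 1)`: `&` is evaluated before `|`, just as `and` is evaluated before `or`. The sketch below uses three made-up customers to show that the grouping changes the answer.

```python
import numpy as np

# Hypothetical values for three customers
age = np.array([16, 30, 40])
budget = np.array([5, 20, 20])
action = np.array([-1, 1, -1])

implicit = (age < 18) | (budget > 10) & (action == 1)
explicit = (age < 18) | ((budget > 10) & (action == 1))   # same grouping, spelled out
other = ((age < 18) | (budget > 10)) & (action == 1)      # a different grouping

print(implicit)  # [ True  True False]
print(other)     # [False  True False]
```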

### Q2. Select Elements & Describe

* Produce a report of the simulated data
    * show `.mean()` of all x
    * show `.std()` of all x
    * `.min()`, `.max()`
    
* Show sample observations
    * first, last
    * first two, last two
    * extra: the median
    
* EXTRA: Show the budget of people who are adults
    * HINT: `x_budget[ x_age ... ]`
    
    * and other conditions of interest...