# Calculations in the fast lane

<div "style="width:500px;">

When it comes to libraries, there are two particularly loyal companions to the data scientist.

The first of these is **NumPy** (short for *Numerical Python*).

Remember, computers are helpful only insofar as you can tell them to do repetitive things. We know how to add two numbers. Now how could you conveniently add two *lists* of numbers?

Let's first import the library and name it `np`.

</div>

In [1]:
import numpy as np

<div "style="width:500px;">

NumPy has a host of convenient functions for us to explore, but it requires us to first turn our data into a *NumPy array*, often called an *ndarray*.

If we create our own data as a list, then we simply pass our list to the NumPy function `array()`, accessed via the dot operator: `np.array()`.

</div>

In [3]:
data = [4,6,3,2,9,9,5,2]
arr = np.array(data)

<div "style="width:500px;">

**Extension:** For NumPy to perform the same operation on every element and save us work, it has to ensure that every element is of the same type. The `array()` function simply chooses the most general type among the elements. If there's a float nestled in among the integers, every element is stored as a float.
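
A minimal sketch of this type selection, using made-up values. The `dtype` attribute reports which element type NumPy settled on; the exact integer width depends on the platform, so it is not spelled out here.

```python
import numpy as np

# All integers: NumPy picks an integer type
ints = np.array([4, 6, 3])
print(ints.dtype)   # an integer type; the exact width is platform-dependent

# One float among the integers: every element is stored as a float
mixed = np.array([4, 6.5, 3])
print(mixed.dtype)  # float64
```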

## Batch arithmetic

Now for the real magic. With ndarrays, you can add two lists (element by element, or *element-wise*) just like you would add two numbers!

</div>

In [4]:
arr2 = np.array([6,4,7,8,1,1,5,8])

sumarr = arr + arr2
print(sumarr)

[10 10 10 10 10 10 10 10]


<div "style="width:500px;">

You didn't even need a function call! Just make sure they are of equal length.

You can probably predict what the following will do.

</div>

In [5]:
print(arr * arr)
print(arr2 - arr)
print(arr / 100)
print(arr ** 2 )

[16 36  9  4 81 81 25  4]
[ 2 -2  4  6 -8 -8  0  6]
[0.04 0.06 0.03 0.02 0.09 0.09 0.05 0.02]
[16 36  9  4 81 81 25  4]


<div "style="width:500px;">

We can also use functions in NumPy to transform elements on a one-by-one basis, for example to find the square root of each.

<br>

<div>
<img src="elementwise.png" alt="Variables" style="width:50%;height:50%;">
</div>

<br>

Here's just a teaser of the functions on offer:

| Purpose | NumPy function |
| ------------- |:-------------:|
| Compute absolute value | `abs()` |
| Compute square root | `sqrt()` |
| Compute square | `square()` |
| Compute cosine | `cos()` |
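
As a rough illustration, here are a few of these functions applied to a small made-up array. The absolute value is taken before the square root, because the square root of a negative number is not a real number.

```python
import numpy as np

values = np.array([-9, 4, -1, 16])

print(np.abs(values))           # [ 9  4  1 16]
print(np.square(values))        # [ 81  16   1 256]
print(np.sqrt(np.abs(values)))  # [3. 2. 1. 4.]
```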

Sometimes we wish to do more exciting pairwise operations, in which case there are functions that accept two equal-length arrays and return a new array of the same length.

<br>

<div>
<img src="binary_operation.png" alt="Variables" style="width:30%;height:30%;">
</div>

<br>

| Purpose | NumPy function |
| ------------- |:-------------:|
| Adding corresponding elements | `add()` |
| Subtracting corresponding elements | `subtract()` |
| Multiplying corresponding elements | `multiply()` |
| Finding the maximum of the two | `maximum()` |

(You're right - apart from the maximum function we could have used regular operators to achieve the same thing.)
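
A short sketch of the pairwise functions on two made-up arrays, with the equivalent operator noted where one exists:

```python
import numpy as np

a = np.array([4, 6, 3, 2])
b = np.array([6, 4, 7, 8])

print(np.add(a, b))       # [10 10 10 10], same as a + b
print(np.subtract(b, a))  # [ 2 -2  4  6], same as b - a
print(np.multiply(a, b))  # [24 24 21 16], same as a * b
print(np.maximum(a, b))   # [6 6 7 8], the larger element of each pair
```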

Sometimes you want an operation whose result is a single number. For example, what if you want to add up all the numbers *inside* the array, rather than add an external number to each of them? Or what if you wish to compute the mean?

<br>

<div>
<img src="unary_operation.png" alt="Variables" style="width:50%;height:50%;">
</div>

<br>

| Purpose | NumPy function |
| ------------- |:-------------:|
| Adding elements together | `sum()` |
| Find maximum value in array | `max()` |
| Find index of maximum value | `argmax()` |

As you can see, there are many potential confusions here. Be careful to distinguish `max()` from `maximum()`, `sum()` from `add()` and so on.
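
To make the distinction concrete, here is a small sketch with made-up arrays. The single-array functions collapse an array to one number, while their two-array cousins return one number per pair.

```python
import numpy as np

a = np.array([4, 6, 3, 2])
b = np.array([6, 4, 7, 8])

print(np.sum(a))         # 15: one number for the whole array
print(np.add(a, b))      # [10 10 10 10]: one number per pair

print(np.max(a))         # 6: the largest value in a
print(np.argmax(a))      # 1: the index at which that value sits
print(np.maximum(a, b))  # [6 6 7 8]: the larger of each pair
```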

Remember that, when we use aggregating NumPy functions like these on arrays, we have two options:

* Access it via the library, e.g. `np.sum(arr)`
* Access it via the array, e.g. `arr.sum()`

Make sure you feel comfortable using and recognising both.
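
Both spellings give the same answer, as this sketch on the array from earlier suggests. Not every NumPy function has a method counterpart, though: `np.sqrt(arr)` works, but there is no `arr.sqrt()`.

```python
import numpy as np

arr = np.array([4, 6, 3, 2, 9, 9, 5, 2])

# Library function on the left, array method on the right
print(np.sum(arr), arr.sum())        # 40 40
print(np.max(arr), arr.max())        # 9 9
print(np.argmax(arr), arr.argmax())  # 4 4 (the first of the two 9s)
```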


### Exercise

Now, equipped as you are with basic NumPy operations, brush up on your statistics and find ways of computing the mean of the array below using nothing but what you were just taught!

Remember, the mean is the sum of all the elements divided by the number of elements, N.

</div>

In [None]:
data = np.array([53,45,50,49,61,41,42,47,48,49])

###Your code here

<div "style="width:500px;">

## Accessing and modifying the data

</div>

<div "style="width:500px;">

Even when working with whole arrays, we sometimes want to access or modify particular elements. The location of a particular element is called its *index*. The first element has index 0 - it is the 0th element. This means that, if there are 6 elements, the last element will have index 5. NumPy has many handy shortcuts for accessing elements, most of which also work on regular lists (indexing with a list of indices, and the masking described further down, are NumPy-only). They are summarised below:

<br>

<div>
<img src="indexing.png" alt="Variables" style="width:50%;height:50%;float:right;">
</div>

<br>



| Aim | Syntax |
| ------------- |:-------------:|
| A single element at index i | `[i]` |
| A slice from index i up to, but not including, index j | `[i:j]` |
| All elements from index i onwards | `[i:]` |
| All elements before index j | `[:j]` |
| All elements | `[:]` |
| The i-th element counting from the end | `[-i]` |
| A list of arbitrary indices | `[[i,j,k,l]]` |
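
A quick sketch of each form on a small made-up array, chosen so that the values are easy to tell apart from the indices:

```python
import numpy as np

arr = np.array([10, 11, 12, 13, 14, 15])

print(arr[2])          # 12
print(arr[1:4])        # [11 12 13] (index 4 itself is excluded)
print(arr[3:])         # [13 14 15]
print(arr[:3])         # [10 11 12]
print(arr[-1])         # 15, the last element
print(arr[[0, 2, 5]])  # [10 12 15]
```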

#### Masking
One way of indexing deserves particular attention. Suppose you have an ndarray A of five elements, and an ndarray B of five elements. You want to access all elements of A that pass a Boolean condition. You may then place the condition *inside* the square brackets, for example `A[A > 3]`.

This works because the Boolean condition outputs a Boolean array, of the same length as the original array, with `True` wherever an element passed the criterion. Because that Boolean array (the *mask*) is an array in its own right, it can also be used to pick out the corresponding elements of B:

<div>
<img src="masking.png" alt="Variables" style="width:50%;height:50%;float:right;">
</div>


</div>

In [7]:
#Masking itself
numbers = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17])

#The condition
even_to_two = numbers % 2 == 0
print(even_to_two)

#Let's filter
numbers[even_to_two]

#Masking a different array
hundred = np.array([101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117])
hundred[even_to_two]

array([102, 104, 106, 108, 110, 112, 114, 116])

#### Modification

<div "style="width:500px;">

If we wish to alter parts of the array, we assign to them just as we assign values to normal variables.

We simply place the indexed array to the left of an equals sign, and the new value to the right. For example:

`arr[2] = 67`

</div>

In [9]:
arr = np.arange(6)
arr[1] = 67


<div "style="width:500px;">

One thing to remember here is that if we store a slice of the array and modify that slice, the original will - believe it or not - also be updated.

This is because when you slice with square brackets, you don't actually create a copy, but rather a *view* of the original. (Indexing with a list of indices or a Boolean mask, by contrast, does return a copy.)

<br>
<br>

<div>
<img src="views.png" alt="Variables" style="width:50%;height:50%;float:right;">
</div>

</div>

In [12]:
sec_third = arr[1:3]
sec_third[0] = 2
print(arr)

[0 2 2 3 4 5]
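
If the original needs to stay untouched, one option is to ask for an explicit copy of the slice with the array's `copy()` method. A minimal sketch (the variable name `independent` is just for illustration):

```python
import numpy as np

arr = np.arange(6)

# .copy() gives the slice its own data, detached from arr
independent = arr[1:3].copy()
independent[0] = 99

print(arr)          # [0 1 2 3 4 5], unchanged
print(independent)  # [99  2]
```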


# Introducing dataframes

<div "style="width:500px;">

Powerful as they seem, NumPy arrays are in reality quite restricted, because all elements of an array must share a single type, and they are designed mainly for numbers. In real-life data analyses, individual cases often have several related pieces of data, of different types. For example, a survey participant has a gender, an age, a response to Question 1 and so forth.

Enter **pandas**, a fundamental library built on top of NumPy to help process a broader range of data.

As usual, there is a conventional alias for importing it.

</div>

In [None]:
import pandas as pd

<div "style="width:500px;">

To store variably typed data, we make use of a pandas object called a **dataframe**.

A dataframe is, quite simply, a table. You are probably already familiar with such tables from spreadsheet applications.

More technically, it consists of several labelled columns called *Series*, each of which is built on a NumPy array.

As with a NumPy array, we can access its values, change its values and summarise its values.
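
To get a feel for the structure, here is a tiny dataframe built by hand from a dictionary. The survey columns and values are invented purely for illustration.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration
survey = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "age": [34, 27, 45],
    "q1": [4.5, 3.0, 5.0],
})

print(survey.dtypes)         # each column keeps its own type
print(type(survey["age"]))   # a single column is a pandas Series
print(survey["age"].mean())  # columns can be summarised much like arrays
```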

Creating our own dataframe is rarely done in practice. Instead, we usually read a data file directly into a dataframe.

</div>

In [None]:
df = pd.read_csv("traffic_data_glasgow.csv", sep=',')

<div "style="width:500px;">

You'll find in online examples that dataframes - when dealing with a single one - are almost always called `df`.

For starters, let us see which headers it has.

The file you just read in gives us a huge dataframe with 10 000+ rows and many columns, and we wouldn't want to print it all, as we might end up feeling lost in all those numbers.

What we could do instead is see what the columns are, then perhaps see the first few rows. 

We are going to use the following:

* `df.columns`: a property (attribute) that gives us the names of the columns
* `df.head()`: a method that gives us the first five rows

Remember, `.columns` is a property of the dataframe (thus no parentheses after it), while `.head()` is a method of the dataframe and therefore needs parentheses.
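
The difference is easiest to see on a toy example. The dataframe `df_small` below is a made-up stand-in, not the Glasgow file, and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a much larger dataframe
df_small = pd.DataFrame({
    "hour": range(8),
    "cars": [3, 1, 0, 2, 9, 30, 55, 41],
})

print(df_small.columns)  # attribute: no parentheses
print(df_small.head())   # method: parentheses, first five rows by default
print(df_small.head(3))  # an optional argument asks for a different number of rows
```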

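Neither expression returns a plain list: `df.columns` gives an `Index` object holding the column names, and `df.head()` gives a smaller dataframe. A notebook only displays the value of the last expression in a cell, so wrap each in a `print()` call to make sure it shows up. Have a go at it below.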

</div>