# Practical: NumPy and Pandas

**14th November 2019 – 11am to 1pm**

**Christopher Ingold Building G20** 

## Intro to NumPy

NumPy is a Python library for scientific computing. It provides a high-performance multidimensional data structure called an array. NumPy is usually imported like this:

In [None]:
import numpy as np

### Arrays

A NumPy array is a tensor of values that all have the same type. The number of dimensions is the rank of the array, and the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can create a one-dimensional NumPy array (aka a vector) from a Python list:

In [None]:
a = np.array([1, 2, 3, 4])
a

The previous code cell has created a NumPy array of rank 1. To check that this is a NumPy array we can use the function `type`:

In [None]:
type(a)

If we are given an array and we need to know its shape, we can use its attribute `shape`:

In [None]:
a.shape

What shape returns is a tuple containing the length of each dimension. In this case, we have only one dimension of size 4.
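
As a small sketch of how rank and shape relate, the cell below also prints the `ndim` and `size` attributes for the same kind of array. It rebuilds `a` so that it can run on its own:

```python
import numpy as np

# A rank-1 array with four elements, as in the cells above.
a = np.array([1, 2, 3, 4])

print(a.ndim)   # number of dimensions (the rank): 1
print(a.shape)  # size along each dimension: (4,)
print(a.size)   # total number of elements: 4
```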

To access the values of a one-dimensional array we index it just as we would a Python list:

In [None]:
a[0]

Let's now create a 2-dimensional array (aka a matrix in math):

In [None]:
a2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
a2

Now we use shape to get the length of the dimensions:

In [None]:
a2.shape

This is a matrix (with rank 2) with 2 rows and 4 columns.

To access the individual values of this matrix we give one index per dimension, separated by a comma:

In [None]:
a2[0, 0]

This will access the value positioned at the first row and first column. Try to access other locations until you get an error.
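
A minimal sketch of what happens at the edges of the matrix: negative indices count from the end, and an index past the last row raises an `IndexError`. The matrix is redefined here so the cell is self-contained:

```python
import numpy as np

a2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(a2[1, 3])    # last row, last column: 8
print(a2[-1, -1])  # the same element, using negative indices

try:
    a2[2, 0]       # there is no third row, so this raises
except IndexError as err:
    print("IndexError:", err)
```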

NumPy also provides some handy functions to create tensors that are commonly used in scientific applications. To use these functions we just need to pass a tuple defining the size of each dimension.

For example, use zeros to create a tensor filled with zeros:

In [None]:
np.zeros((4, 4))

Use ones to create a tensor filled with ones:

In [None]:
np.ones((4, 4))

Use random.random to create a tensor filled with random values $\in [0,1)$:

In [None]:
np.random.random((4, 4))

To create a matrix filled with a given value, we can use the function full and provide the value, create a tensor of ones and multiply it by the value, or create a tensor of zeros and add the value to it:

In [None]:
np.full((4, 4), 3)

In [None]:
np.ones((4, 4)) * 3

In [None]:
np.zeros((4, 4)) + 3

The two code cells above also demonstrate that in NumPy we can easily do arithmetic operations between tensors and scalars.
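
The three approaches give equal values, but not necessarily the same data type. As a sketch: `full` infers its dtype from the fill value, while `ones` and `zeros` default to floating point. The variable names below are only for illustration:

```python
import numpy as np

full = np.full((4, 4), 3)       # dtype inferred from the integer fill value
scaled = np.ones((4, 4)) * 3    # ones() defaults to float64
shifted = np.zeros((4, 4)) + 3  # zeros() defaults to float64

print(full.dtype, scaled.dtype, shifted.dtype)
print(np.array_equal(full, scaled))  # values compare equal despite the dtypes
```

If a particular type is needed, all three functions accept a `dtype` argument.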

To create a matrix with ones on the diagonal and zeros elsewhere (aka the identity matrix), we can use the function eye like this:

In [None]:
np.eye(4)

### Array Selection

We can select part of a tensor using Python's slicing syntax. In NumPy, a slice can be given for each dimension of a multi-dimensional tensor, separated by commas:

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
a

For example, to select only the 2x2 matrix on the top right, we can do:

In [None]:
a[:2, 2:]

If we want to select the last row:

In [None]:
a[2:]
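
Note that a slice keeps the row dimension, so the result above is still a matrix. A short sketch comparing it with integer indexing, which drops that dimension:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

as_slice = a[2:]  # slicing keeps the row dimension
as_index = a[-1]  # integer indexing drops it

print(as_slice.shape)  # (1, 4)
print(as_index.shape)  # (4,)
```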

If we want to select the last two columns:

In [None]:
a[:,2:]

We can also use lists of integers to extract only the elements we want. For example, to extract the values at the 4 corners of the matrix `a`, we pass the row indices and the column indices as two lists:

In [None]:
a[[0, 0, 2, 2],[0, 3, 0, 3]]

### Arithmetic Operators with Arrays

To add a scalar to a matrix:

In [None]:
a = np.eye(4) 
a + 3

To subtract a scalar:

In [None]:
a - 5

To multiply a matrix by a scalar:

In [None]:
a * 3

To divide by a scalar:

In [None]:
a / 3

To raise each element to a power:

In [None]:
(a + 1) ** 3

We can also do matrix operations such as the dot product, using the `@` operator:

In [None]:
a1 = np.array([4, 3])
a2 = np.array([[1, 2],[3, 4]])

a1 @ a2
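
It is easy to confuse `@` with `*`. As a sketch of the difference, `@` performs the matrix product, while `*` multiplies element by element (broadcasting the vector across the rows of the matrix). The variable names are only illustrative:

```python
import numpy as np

a1 = np.array([4, 3])
a2 = np.array([[1, 2], [3, 4]])

matrix_product = a1 @ a2  # same result as np.dot(a1, a2)
elementwise = a1 * a2     # a1 is broadcast across each row of a2

print(matrix_product)  # [13 20]
print(elementwise)     # [[ 4  6]
                       #  [12 12]]
```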

We can transpose a matrix like this:

In [None]:
a2.T

### Comparison Operators with Arrays

We can use the Python comparison operators with arrays. However, this will not return a Boolean value but an array of Boolean values:

In [None]:
a = np.array([1, 2, 3, 4])
a == 3

Or, greater than or equal to:

In [None]:
a >= 3
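
A Boolean array like this is mostly useful as a mask. The sketch below uses it to filter the array and to count matches:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
mask = a >= 3

print(a[mask])                 # keeps only the elements where mask is True
print(mask.sum())              # True counts as 1, so this counts the matches
print(mask.any(), mask.all())  # is any / is every element >= 3?
```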

### Extended Datatypes

NumPy's core is implemented in C. This allows arrays to use fixed-size C-style datatypes, such as 32-bit integers:

In [None]:
a = np.array([1, 2, 3, 4], dtype=np.int32)
a

We can find details about C datatypes [here](https://docs.scipy.org/doc/numpy/user/basics.types.html).
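
As a brief sketch, an array's datatype can be inspected with `dtype` and converted with `astype`, which returns a new array rather than modifying the original:

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.int32)
print(a.dtype, a.itemsize)  # int32, 4 bytes per element

b = a.astype(np.float64)    # astype returns a converted copy
print(b.dtype, b.itemsize)  # float64, 8 bytes per element
print(a.dtype)              # the original array is unchanged
```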

# Exercise 27

Compute the following sequence when n is equal to 5:

$
a_0 = 
\begin{bmatrix}
1&0\\
0&1\\
\end{bmatrix}
$

$
a_n = 
\begin{bmatrix}
1\\
2\\
\end{bmatrix} \cdot 
\begin{bmatrix}
2&1\\
\end{bmatrix} -
a_{n-1}$


The result should be:

$a_5 = \begin{bmatrix}
1&1\\
4&1\\
\end{bmatrix}$

## Intro to Pandas

Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Pandas is usually imported using the following import statement:

In [None]:
import pandas as pd

Pandas defines two data structures called DataFrame and Series. 

### Read a CSV

To read a CSV file we can use the read_csv function, which creates a DataFrame. We can pass it either a path to a CSV file stored locally, or a URL to a CSV file, which pandas will download and parse automatically.

For this tutorial we will work with a famous dataset called IRIS. This dataset describes the features of 3 different species of flowers. Each row is a flower and each column is a different measurement made on the flowers with the exception of the last column which represents the species of the flower:

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris
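
`read_csv` also accepts any file-like object, which is handy for trying things out without a network connection. The sketch below parses a small inline CSV. The three rows are illustrative values laid out like the IRIS columns, not the real dataset, and `csv_text` and `df` are names chosen for this example:

```python
import io

import pandas as pd

# Illustrative rows only, formatted like the IRIS CSV.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)             # (3, 5)
print(df.columns.tolist())  # column names come from the header line
```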

We can check the first 5 rows of the DataFrame by using head:

In [None]:
iris.head(5)

Or the last 5 rows by using tail:

In [None]:
iris.tail(5)

As with NumPy arrays, we can get the size of each dimension of the DataFrame using shape:

In [None]:
iris.shape

We can get summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns of the DataFrame using describe:

In [None]:
iris.describe()

To get the type of each column (Series) we can use dtypes:

In [None]:
iris.dtypes

### Select Columns (Series)

To access a column we can either use the name of the column as an attribute or use square brackets. Both return a Series.

In [None]:
iris.sepal_length

In [None]:
iris["sepal_length"]

To get a DataFrame back we need to put the name of the column in a list:

In [None]:
iris[["sepal_length"]]

We can also select two columns using a list:

In [None]:
iris[["sepal_length", "sepal_width"]]

To select the first row we need to use iloc, which selects by integer position:

In [None]:
iris.iloc[0]
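
`iloc` selects by position, while its counterpart `loc` selects by index label. A minimal sketch of the difference, using a small hand-made DataFrame with a hypothetical string index (the values are illustrative):

```python
import pandas as pd

# Illustrative data with string labels so position and label differ visibly.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0, 6.3],
        "species": ["setosa", "versicolor", "virginica"],
    },
    index=["a", "b", "c"],
)

first_by_position = df.iloc[0]  # first row, whatever its label is
first_by_label = df.loc["a"]    # row whose index label is "a"

print(first_by_position.equals(first_by_label))  # True
print(df.iloc[0:2])                              # first two rows by position
```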

### Select Rows

To select rows based on a condition we can index the DataFrame with a Boolean Series. For example, to select all the flowers that have a sepal length greater than or equal to 5:

In [None]:
iris[iris.sepal_length >= 5.0]
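
Conditions can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on a small hand-made DataFrame with illustrative values:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sepal_length": [4.9, 5.0, 6.3, 7.0],
        "species": ["setosa", "setosa", "virginica", "versicolor"],
    }
)

# Rows with sepal_length >= 5.0 that are not virginica.
selected = df[(df.sepal_length >= 5.0) & (df.species != "virginica")]
print(selected)
```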

### Changing Values

We can change the value of a column similarly to the way we access the DataFrame.

For example to change all the values of a column:

In [None]:
iris.sepal_length = 0
iris

To change the values of a row:

In [None]:
iris.iloc[0] = [0, 0, 0, 0, "ok"]
iris
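
To change only some values, one option is `loc` with a Boolean condition for the rows and a column name. This is a sketch on a small hand-made DataFrame with illustrative values, not part of the IRIS walkthrough above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sepal_length": [4.9, 5.0, 6.3],
        "species": ["setosa", "setosa", "virginica"],
    }
)

# Set sepal_length to 0.0 only for the setosa rows.
df.loc[df.species == "setosa", "sepal_length"] = 0.0
print(df)
```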

### Statistical Functions

We can use the methods mean, std, mode, max, min, etc. to compute any basic statistics we need:

In [None]:
iris.sepal_width.mean()

In [None]:
iris.sepal_width.mode()

In [None]:
iris.sepal_width.std()
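
The same methods also work on a whole DataFrame. The sketch below uses a small hand-made DataFrame with illustrative values, and shows two details worth knowing: `mode` returns a Series because several values can tie, and `std` computes the sample standard deviation (`ddof=1`) by default:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sepal_width": [3.0, 3.0, 3.6],
        "petal_width": [0.2, 0.4, 0.6],
        "species": ["setosa", "setosa", "virginica"],
    }
)

means = df.mean(numeric_only=True)  # skip the non-numeric species column
print(means)

# Several statistics at once for the chosen columns.
print(df[["sepal_width", "petal_width"]].agg(["mean", "std"]))

modes = df.sepal_width.mode()       # a Series, since ties are possible
print(modes)
```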

### Plotting with Pandas

Pandas uses matplotlib to plot series. For example we can plot a column using the plot method:

In [None]:
%matplotlib notebook

iris.sepal_width.plot()

We can also plot the histogram of the column like this:

In [None]:
iris.sepal_width.hist()

# Exercise 28

Reload the IRIS dataset and compute the standard error of each numerical column per flower species, and store them into the DataFrame.

The standard error of the mean is $\sqrt{\frac{\sigma^2}{n}}$. Assuming a normal distribution, the 95% confidence interval built from it is computed like this:

$$\mu \pm 1.96 \sqrt{\frac{\sigma^2}{n}}$$

where $\mu$ is the mean, $\sigma^2$ is the variance, 1.96 is the approximate value of the 97.5th percentile point of the standard normal distribution, and $n$ is the sample size.

# Exercise 29

Imagine you are given a NumPy matrix and you want to transform this matrix into a pandas DataFrame, how can you do this? (Use the Internet to answer this question).

What about when you have a pandas DataFrame and you need a NumPy matrix?