# Tutorial - NumPy and Pandas  

### NumPy arrays

In mathematics, a **vector** is a sequence of numbers, and a **matrix** is a rectangular arrangement of numbers. Operations with vectors and matrices are the subject of a branch of mathematics called linear algebra. In Python (and in many other languages), vectors are called one-dimensional (1d) **arrays**, while matrices are called two-dimensional (2d) arrays. Arrays of more than two dimensions can be managed in Python without difficulty.

Python arrays are not necessarily numeric. Indeed, vectors of dates and strings appear frequently in data science. In principle, all the terms of an ordinary array must have the same type, so that the array itself can have a type, although you can relax this constraint using mixed types, as we will see later. Arrays were already implemented in plain Python, but their functionality was extended by the NumPy library, which is intended to be the fundamental library for scientific computing in Python.

A typical way to load NumPy is:

In [1]:
import numpy as np

Creating a numeric 1d array in NumPy is easy:

In [2]:
x = np.array(range(10))
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

String arrays are created in the same way:

In [3]:
y = np.array(['Messi', 'Neymar'])
y

array(['Messi', 'Neymar'], dtype='<U6')

The terms of a 1d array can be extracted from a range, a list or another data container. The elements of a list can have different types, but they are converted to a common type when creating the array.
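
The conversion to a common type can be checked with the attribute `dtype`. The following is a minimal sketch; the names `a` and `b` are chosen here for illustration only.

```python
import numpy as np

# Integers and floats are upcast to a common floating-point type
a = np.array([1, 2.5, 3])
print(a.dtype)        # float64

# Numbers mixed with strings are all converted to strings
b = np.array([1, 'Messi'])
print(b.dtype.kind)   # U (Unicode string)
print(b[0] == '1')    # True: the integer 1 has become the string '1'
```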

A 2d array can be directly created from a list of lists of equal length. The terms are entered row-by-row:

In [4]:
z = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])
z

array([[ 0,  7,  2,  3],
       [ 3,  9, -5,  1]])

Although we visualize a vector as a column (or as a row) and a matrix as a rectangular arrangement, with rows and columns, it is not so in the computer. The vector is just a sequence of elements of the same type, neither horizontal nor vertical. It has one **axis**, which is the 0-axis.

In a similar way, a matrix is a sequence of vectors of the same length and type. It has two axes. When we visualize the matrix as rows and columns, `axis=0` means *across rows* (moving down, from one row to the next), while `axis=1` means *across columns* (moving right, from one column to the next).

The number of terms stored along an axis is the dimension of that axis. The dimensions are collected in the attribute `shape`:

In [5]:
x.shape

(10,)

In [6]:
z.shape

(2, 4)
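
One way to see what the two axes mean is to sum along each of them. This is only an illustrative sketch, using the same matrix `z` as above; the names `col_totals` and `row_totals` are not part of the tutorial.

```python
import numpy as np

z = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

# Summing along axis 0 moves across the rows: one total per column
col_totals = z.sum(axis=0)
print(col_totals.tolist())   # [3, 16, -3, 4]

# Summing along axis 1 moves across the columns: one total per row
row_totals = z.sum(axis=1)
print(row_totals.tolist())   # [12, 8]

# The axis that has been summed over disappears from the shape
print(col_totals.shape, row_totals.shape)   # (4,) (2,)
```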

NumPy incorporates vectorized forms of the mathematical functions of the package `math`. A **vectorized function** is one that, when applied to an array, returns an array with the same shape, whose terms are the values of the function on the corresponding terms of the original array. For instance, the square root function `numpy.sqrt` takes the square root of every term of a numeric array:

In [7]:
np.sqrt(x)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

The functions that are defined in terms of vectorized functions are automatically vectorized. For instance:

In [8]:
def f(t): return 1/(1 + np.exp(t))
f(z)

array([[5.00000000e-01, 9.11051194e-04, 1.19202922e-01, 4.74258732e-02],
       [4.74258732e-02, 1.23394576e-04, 9.93307149e-01, 2.68941421e-01]])
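
To make the idea of vectorization concrete, the sketch below compares `f` with an explicit element-by-element computation based on `math.exp`. The name `looped` is introduced here just for the comparison.

```python
import math
import numpy as np

z = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

def f(t):
    return 1 / (1 + np.exp(t))

# Element-by-element version, applying math.exp to one number at a time
looped = np.array([[1 / (1 + math.exp(v)) for v in row] for row in z])

print(f(z).shape == z.shape)       # True: the shape is preserved
print(np.allclose(f(z), looped))   # True: same values, term by term
```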

### Subsetting arrays

**Slicing** a one-dimensional array is done as for a list:

In [9]:
x[:3]

array([0, 1, 2])

The same applies to two-dimensional arrays, but we need two indexes within the square brackets. The first index selects the rows (`axis=0`), and the second index the columns (`axis=1`):

In [10]:
z[:1, 1:]

array([[7, 2, 3]])
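
A few more selections on the same matrix may help. This is a sketch with variable names chosen for illustration; it also shows that an integer index removes the corresponding axis, while a slice keeps it.

```python
import numpy as np

z = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

first_row = z[0, :]     # 1d array with the terms 0, 7, 2, 3
second_col = z[:, 1]    # 1d array with the terms 7, 9
last_term = z[1, -1]    # a single element, 1
every_other = z[:, ::2] # 2d array with columns 0 and 2

print(z[:1, 1:].shape)  # (1, 3): slicing the rows keeps two axes
print(z[0, 1:].shape)   # (3,): an integer row index drops the 0-axis
```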

Subsets of an array can also be extracted by means of expressions. First, note that a comparison involving an array returns a Boolean array with the same shape:

In [11]:
z > 2

array([[False,  True, False,  True],
       [ True,  True, False, False]])

When we write such an expression between the brackets, NumPy returns the terms for which the expression is true:

In [12]:
x[x > 3]

array([4, 5, 6, 7, 8, 9])

Note that, under the hood, NumPy creates a Boolean array of the same shape as `x` and then returns the elements of `x` for which that Boolean array equals `True`. You can also enter a Boolean array directly inside the brackets. A Boolean array that is used to extract a subarray is called a **Boolean mask**.
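
As a sketch of these ideas, the example below builds a mask by hand and then combines two conditions. The operators `&` (and) and `|` (or) work term by term on Boolean arrays, and each condition needs its own parentheses. The variable names are illustrative.

```python
import numpy as np

x = np.array(range(10))

# An explicit Boolean mask: True at the even positions
mask = np.array([True, False] * 5)
print(x[mask])      # [0 2 4 6 8]

# Two conditions combined term by term
both = x[(x > 3) & (x % 2 == 0)]
either = x[(x < 2) | (x > 7)]
print(both)         # [4 6 8]
print(either)       # [0 1 8 9]
```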

Boolean masks can also be used to filter out rows or columns of a matrix. For instance, you can select the rows of the matrix `z` for which the first column is positive, dropping that first column from the result:

In [13]:
z[z[:, 0] > 0, 1:]

array([[ 9, -5,  1]])
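
The row mask can also be used alone, and the same approach works for columns. A short sketch, with mask names that are assumptions of this example:

```python
import numpy as np

z = np.array([[0, 7, 2, 3], [3, 9, -5, 1]])

# Rows whose first term is positive, keeping all the columns
row_mask = z[:, 0] > 0
print(row_mask.tolist())          # [False, True]
print(z[row_mask].tolist())       # [[3, 9, -5, 1]]

# Columns whose term in the first row is greater than 2
col_mask = z[0] > 2
print(z[:, col_mask].tolist())    # [[7, 3], [9, 1]]
```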

### Pandas series

**Pandas** provides a wide range of data wrangling tools. It is typically imported as:

In [14]:
import pandas as pd

Pandas provides two data container classes, the series (one-dimensional) and the data frame (two-dimensional). An individual data vector is called a **series**. A series is like a one-dimensional array plus the **index**, which contains the names of the values of the series.

The data used in data science are usually imported from external data files, so you do not have to create the series by yourself. But a Pandas series can be created directly, for instance from a range, with the function `pandas.Series`.

In [15]:
s = pd.Series(range(10))
s

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

Instead of a range, a list or a vector can be used to specify the data points. The elements of a list can have different types but, like NumPy, Pandas stores them under a common data type. When numbers and strings are mixed, this common type is `object`, as shown in the following example.

In [17]:
s1 = pd.Series([1, 5, 'Messi'])
s1

0        1
1        5
2    Messi
dtype: object
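
The type `object` behaves differently from the string conversion that NumPy applies to the same list. The sketch below, in which the name `arr` is introduced for the comparison, checks the type of the individual elements.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 5, 'Messi'])
print(s1.dtype)                     # object

# In the series, every element keeps its original Python type
print(type(s1.iloc[0]).__name__)    # int
print(type(s1.iloc[2]).__name__)    # str

# In a NumPy array built from the same list, everything becomes a string
arr = np.array([1, 5, 'Messi'])
print(arr.dtype.kind)               # U (Unicode string)
```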

A Pandas series has three important attributes: `shape` and `dtype`, which work as in a NumPy array, and `index`, which makes the difference. The index of a series is a vector-like object that contains the names of the terms of the series. The index is printed on the left, as you can see in the preceding output. Since the index of `s1` was not specified when the series was created, Pandas assigned consecutive integers as index labels.

In [18]:
s1.index

RangeIndex(start=0, stop=3, step=1)

This index is automatically created, as a `RangeIndex`. We can also specify an index directly:

In [25]:
s1.index = ['a', 'b', 'c']
s1

a        1
b        5
c    Messi
dtype: object

Now the index is a plain `Index`:

In [24]:
s1.index

Index(['a', 'b', 'c'], dtype='object')
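
The index can also be supplied when the series is created, through the argument `index` of `pandas.Series`. In the sketch below, `s2` is a name introduced for illustration; once the labels are in place, the terms can be retrieved by name.

```python
import pandas as pd

# Same data as s1, with the labels given at creation time
s2 = pd.Series([1, 5, 'Messi'], index=['a', 'b', 'c'])

print(list(s2.index))   # ['a', 'b', 'c']
print(s2['c'])          # Messi (selected by label)
print(s2.iloc[0])       # 1 (selected by position)
```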

### Pandas data frames

A **data frame** is formed by one or several series with the same index (hence, with the same length). It can be seen as a two-dimensional array plus the column names and the index, which contains the row names. A difference between two-dimensional NumPy arrays and Pandas data frames is that a data frame does not have a single data type. Instead, each column has its own data type.

A Pandas data frame can be built in many ways, for instance from a dictionary of vector-like objects of the same length, as in:

In [26]:
df = pd.DataFrame({'v1': range(0, 5),
    'v2': ['a', 'b', 'c', 'd', 'e'],
    'v3': np.repeat(-1.3, 5)})
df

|   | v1 | v2 | v3   |
|---|----|----|------|
| 0 | 0  | a  | -1.3 |
| 1 | 1  | b  | -1.3 |
| 2 | 2  | c  | -1.3 |
| 3 | 3  | d  | -1.3 |
| 4 | 4  | e  | -1.3 |


Important attributes of a data frame are `shape`, `index`, `dtypes` and `columns`. `shape` works as in a two-dimensional NumPy array, and `index` as in a Pandas series. A data frame can be seen as a collection of series of the same length which share the index.
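
These attributes can be inspected directly. The following sketch rebuilds the data frame `df` so that it runs on its own; the data type reported for the string column `v2` may depend on the Pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

print(df.shape)           # (5, 3): five rows, three columns
print(list(df.columns))   # ['v1', 'v2', 'v3']
print(list(df.index))     # [0, 1, 2, 3, 4]
print(df.dtypes)          # one data type per column
```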

Besides the attributes that have already been mentioned, there are several methods for exploring a Pandas object. These methods are essential to data scientists, who frequently check whether their data frames contain what they are expected to contain.

First, the methods `head` and `tail` extract the first and the last rows of a data frame, respectively. The default number of rows extracted is 5, but you may pass a custom number.

In [28]:
df.head(2)

|   | v1 | v2 | v3   |
|---|----|----|------|
| 0 | 0  | a  | -1.3 |
| 1 | 1  | b  | -1.3 |
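
The method `tail` works in the same way, starting from the end. A minimal sketch, in which `df` is rebuilt so that the example is self-contained:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

last_two = df.tail(2)
print(list(last_two.index))   # [3, 4]: the original row labels are kept

# With no argument, up to 5 rows are returned (here, the whole data frame)
print(len(df.head()))         # 5
```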


The content of a data frame can be explored with the method `info`. It reports the dimensions of the data frame, together with the data type and the number of non-missing values of every column.

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   v1      5 non-null      int64  
 1   v2      5 non-null      object 
 2   v3      5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes


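The function `describe` returns a conventional statistical summary. Columns of type `object` are omitted, except when all the columns have that type. In that case, the report contains the count, the number of unique values, the most frequent value and its frequency, instead of the numeric statistics. This function also works for series.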

In [31]:
df.describe()

|       | v1       | v3   |
|-------|----------|------|
| count | 5.0      | 5.0  |
| mean  | 2.0      | -1.3 |
| std   | 1.581139 | 0.0  |
| min   | 0.0      | -1.3 |
| 25%   | 1.0      | -1.3 |
| 50%   | 2.0      | -1.3 |
| 75%   | 3.0      | -1.3 |
| max   | 4.0      | -1.3 |
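
Applied to a single column, `describe` returns a series. The sketch below shows a numeric column and a string column; the names `num_summary` and `txt_summary` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

# Numeric column: the usual statistics
num_summary = df['v1'].describe()
print(num_summary['mean'])        # 2.0

# String column: counts and frequencies instead of numeric statistics
txt_summary = df['v2'].describe()
print(list(txt_summary.index))    # ['count', 'unique', 'top', 'freq']
```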


### Subsetting in Pandas, first round

Pandas offers multiple ways for subsetting series and data frames. Suppose that you wish to select a subset of complete columns from a data frame. You can specify this with a list containing the names of those columns:

In [32]:
df[['v1', 'v2']]

|   | v1 | v2 |
|---|----|----|
| 0 | 0  | a  |
| 1 | 1  | b  |
| 2 | 2  | c  |
| 3 | 3  | d  |
| 4 | 4  | e  |
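
A detail worth checking is the difference between a column name and a list with a single column name inside the brackets. In this sketch, the names `col` and `sub` are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

col = df['v1']     # a column name returns a series
sub = df[['v1']]   # a list of names returns a data frame

print(type(col).__name__, col.shape)   # Series (5,)
print(type(sub).__name__, sub.shape)   # DataFrame (5, 1)
```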


To select a collection of complete rows, we specify a slice of row positions, as in a list or a 1d array:

In [33]:
df[1:3]

|   | v1 | v2 | v3   |
|---|----|----|------|
| 1 | 1  | b  | -1.3 |
| 2 | 2  | c  | -1.3 |


You can also use a Boolean mask to extract rows from a data frame. By entering an expression within the brackets, you select the rows for which the expression is true.

In [34]:
expr = df['v1'] > 2
df[expr]

|   | v1 | v2 | v3   |
|---|----|----|------|
| 3 | 3  | d  | -1.3 |
| 4 | 4  | e  | -1.3 |


Note that, as for a NumPy array, a comparison involving a Pandas object returns a Boolean Pandas object with the same shape:

In [35]:
expr

0    False
1    False
2    False
3     True
4     True
Name: v1, dtype: bool
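
Boolean series can be combined term by term, as with NumPy arrays, using `&` (and), `|` (or) and `~` (not). The following sketch uses illustrative conditions on `df`, and also the series method `isin` to test membership in a list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

# Each condition goes inside its own parentheses
mask = (df['v1'] > 0) & (df['v1'] < 4)
print(list(df[mask].index))     # [1, 2, 3]
print(list(df[~mask].index))    # [0, 4]

# Rows whose value of v2 is in a given list
in_list = df['v2'].isin(['a', 'e'])
print(list(df[in_list]['v1']))  # [0, 4]
```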

### .loc and .iloc subsetting

Besides the simple methods of the preceding section, you have two additional ways to carry out a selection: by label or by position. **Selection by label** is specified by adding `.loc` after the name of the data frame. For a **selection by position**, you add `.iloc`. In both cases, if you enter a single specification inside the brackets, it refers to the rows. If you enter two specifications, the first one refers to the rows and the second one to the columns.

In `.loc` subsetting (by label), the selection of the rows is based on the index, and the selection of the columns is based on the column names. Two examples follow.

In [36]:
df.loc[1:2]

|   | v1 | v2 | v3   |
|---|----|----|------|
| 1 | 1  | b  | -1.3 |
| 2 | 2  | c  | -1.3 |


In [37]:
df.loc[:2, :'v2']

|   | v1 | v2 |
|---|----|----|
| 0 | 0  | a  |
| 1 | 1  | b  |
| 2 | 2  | c  |


Note that, in the `.loc` selection, the numbers inside the brackets do not refer to the positions of the rows, but to their names, as given by the index. The selection includes the two rows whose labels are specified, together with all the intermediate rows. So, in this example, `df.loc[1:2]` is the same as `df[1:3]`. This is a source of confusion when the index of the data frame is the default range index, which is created automatically if no index has been specified.
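
The difference between labels and positions becomes visible as soon as the index is not the default one. The sketch below works on a copy of `df` with hypothetical integer labels (10, 20, ..., 50), introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

df2 = df.copy()
df2.index = [10, 20, 30, 40, 50]   # hypothetical row labels

# By label: both ends of the slice are included
by_label = df2.loc[20:30]
print(list(by_label['v2']))        # ['b', 'c']

# By position: the end of the slice is excluded
by_position = df2.iloc[1:3]
print(list(by_position['v2']))     # ['b', 'c']

# No label lies between 1 and 2, so this selection is empty
print(len(df2.loc[1:2]))           # 0
```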

Now, the `.iloc` selection for the same examples. Here the numbers are positions and, as in list slicing, the end of the slice is excluded.

In [38]:
df.iloc[1:3]

|   | v1 | v2 | v3   |
|---|----|----|------|
| 1 | 1  | b  | -1.3 |
| 2 | 2  | c  | -1.3 |


In [39]:
df.iloc[:3, :2]

|   | v1 | v2 |
|---|----|----|
| 0 | 0  | a  |
| 1 | 1  | b  |
| 2 | 2  | c  |
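
The row and column specifications do not have to be slices. As a sketch, `.iloc` also accepts lists of positions, and `.loc` accepts a Boolean mask for the rows together with a list of column names. The names `picked` and `filtered` are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': range(0, 5),
                   'v2': ['a', 'b', 'c', 'd', 'e'],
                   'v3': np.repeat(-1.3, 5)})

# First and last rows, first and third columns, by position
picked = df.iloc[[0, -1], [0, 2]]
print(list(picked.index), list(picked.columns))      # [0, 4] ['v1', 'v3']

# Rows with v1 > 2, columns v1 and v3, by mask and by label
filtered = df.loc[df['v1'] > 2, ['v1', 'v3']]
print(list(filtered.index), list(filtered.columns))  # [3, 4] ['v1', 'v3']
```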
