## Getting started

* Numpy is a huge package
* Most packages for data science build on numpy
* Documentation and further tutorials: https://numpy.org/doc/stable/

### Importing numpy

By convention, numpy is imported under the alias np:

In [1]:
import numpy as np

### Numpy arrays

Arrays are the basic objects in numpy.

An array is a multidimensional collection of data of the same type.

Arrays in numpy are referred to as ndarrays, short for n-dimensional arrays

**Creating arrays**

Basic syntax: *numpy.array(object, dtype=None)*

Let's create an array of the integers 1 to 3

In [2]:
np.array([1,2,3])

array([1, 2, 3])

Alternatively as floating point numbers

In [3]:
arr = np.array([1,2,3], dtype=float)
arr

array([1., 2., 3.])

**Properties of arrays**

All arrays have a shape and a datatype.

These can be found using *ndarray.shape* and *ndarray.dtype*

In [4]:
arr.shape

(3,)

In [5]:
arr.dtype

dtype('float64')
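
The same two properties on a two-dimensional array, as a small sketch. The array `grid` and its values are made up for illustration; note that mixing integers and a float still gives a single dtype.

```python
import numpy as np

# A made-up 2 rows by 3 columns array mixing integers and one float
grid = np.array([[1, 2, 3],
                 [4, 5, 6.5]])

print(grid.shape)  # (rows, columns)
print(grid.dtype)  # every element is stored with the same type
print(grid.ndim)   # number of dimensions
```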

Create a 2 rows by 2 columns array of zeros

Create a 3 rows by 4 columns array of ones

In [7]:
np.zeros((2,2))

array([[0., 0.],
       [0., 0.]])

In [8]:
np.ones((3,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Create a 2 by 2 array of fours
* Create an empty 2 by 2 array (*np.empty* does not initialise its values, so the zeros shown below are not guaranteed)
* Fill the array with the value 4

In [9]:
arr = np.empty((2, 2))
arr

array([[0., 0.],
       [0., 0.]])

In [10]:
arr.fill(4)
arr

array([[4., 4.],
       [4., 4.]])
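
The two steps can also be combined. A minimal sketch using np.full, which creates and fills the array in one call (the name `fours` is only illustrative):

```python
import numpy as np

# Create a 2 by 2 array already filled with the value 4.0
fours = np.full((2, 2), 4.0)
print(fours)
```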

## Reading / Writing data files

You have been provided with a csv file theoph.csv

In [11]:
theoph_file_location = 'theoph.csv'
theoph = np.genfromtxt(theoph_file_location, delimiter=',')
theoph

array([[  nan,   nan,   nan,   nan,   nan],
       [11.  , 79.6 ,  4.02,  0.  ,  0.74],
       [11.  , 79.6 ,  4.02,  0.25,  2.84],
       [11.  , 79.6 ,  4.02,  0.57,  6.57],
       [11.  , 79.6 ,  4.02,  1.12, 10.5 ],
       [11.  , 79.6 ,  4.02,  2.02,  9.66],
       [11.  , 79.6 ,  4.02,  3.82,  8.58],
       [11.  , 79.6 ,  4.02,  5.1 ,  8.36],
       [11.  , 79.6 ,  4.02,  7.03,  7.47],
       [11.  , 79.6 ,  4.02,  9.05,  6.89],
       [11.  , 79.6 ,  4.02, 12.12,  5.94],
       [11.  , 79.6 ,  4.02, 24.37,  3.28],
       [ 6.  , 72.4 ,  4.4 ,  0.  ,  0.  ],
       [ 6.  , 72.4 ,  4.4 ,  0.27,  1.72],
       [ 6.  , 72.4 ,  4.4 ,  0.52,  7.91],
       [ 6.  , 72.4 ,  4.4 ,  1.  ,  8.31],
       [ 6.  , 72.4 ,  4.4 ,  1.92,  8.33],
       [ 6.  , 72.4 ,  4.4 ,  3.5 ,  6.85],
       [ 6.  , 72.4 ,  4.4 ,  5.02,  6.08],
       [ 6.  , 72.4 ,  4.4 ,  7.03,  5.4 ],
       [ 6.  , 72.4 ,  4.4 ,  9.  ,  4.55],
       [ 6.  , 72.4 ,  4.4 , 12.  ,  3.01],
       [ 6.  , 72.4 ,  4.4 , 24.

Numpy arrays contain data of the same type

The first line in our csv file contains strings for the column names

Numpy has read those in as nan

The *skip_header* argument can be used to ignore lines at the start of a file

In [12]:
theoph = np.genfromtxt(theoph_file_location, delimiter=',', skip_header=1)
theoph

array([[11.  , 79.6 ,  4.02,  0.  ,  0.74],
       [11.  , 79.6 ,  4.02,  0.25,  2.84],
       [11.  , 79.6 ,  4.02,  0.57,  6.57],
       [11.  , 79.6 ,  4.02,  1.12, 10.5 ],
       [11.  , 79.6 ,  4.02,  2.02,  9.66],
       [11.  , 79.6 ,  4.02,  3.82,  8.58],
       [11.  , 79.6 ,  4.02,  5.1 ,  8.36],
       [11.  , 79.6 ,  4.02,  7.03,  7.47],
       [11.  , 79.6 ,  4.02,  9.05,  6.89],
       [11.  , 79.6 ,  4.02, 12.12,  5.94],
       [11.  , 79.6 ,  4.02, 24.37,  3.28],
       [ 6.  , 72.4 ,  4.4 ,  0.  ,  0.  ],
       [ 6.  , 72.4 ,  4.4 ,  0.27,  1.72],
       [ 6.  , 72.4 ,  4.4 ,  0.52,  7.91],
       [ 6.  , 72.4 ,  4.4 ,  1.  ,  8.31],
       [ 6.  , 72.4 ,  4.4 ,  1.92,  8.33],
       [ 6.  , 72.4 ,  4.4 ,  3.5 ,  6.85],
       [ 6.  , 72.4 ,  4.4 ,  5.02,  6.08],
       [ 6.  , 72.4 ,  4.4 ,  7.03,  5.4 ],
       [ 6.  , 72.4 ,  4.4 ,  9.  ,  4.55],
       [ 6.  , 72.4 ,  4.4 , 12.  ,  3.01],
       [ 6.  , 72.4 ,  4.4 , 24.3 ,  0.9 ],
       [ 5.  , 70.5 ,  4.53,  0.

To write files using numpy:

In [13]:
theoph_out_file = 'theoph_out.csv'
np.savetxt(theoph_out_file, theoph, delimiter=",")
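
savetxt does not write column names unless asked. The sketch below round-trips a small made-up array through a temporary file, writing a header line and then skipping it on the way back in. The column names, values and file name are assumptions for illustration, not the contents of theoph.csv.

```python
import os
import tempfile

import numpy as np

# Made-up data: two rows of (weight, dose)
data = np.array([[79.6, 4.02],
                 [72.4, 4.40]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")

    # comments="" stops numpy prefixing the header line with "# "
    np.savetxt(path, data, delimiter=",", header="Wt,Dose", comments="")

    # Skip the header line so that it is not read in as nan
    loaded = np.genfromtxt(path, delimiter=",", skip_header=1)

print(loaded)
```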

**About theoph.csv**

This contains data from a study by Dr. Robert Upton of the kinetics of the anti-asthmatic drug theophylline [1]. 

Subjects were given oral doses of theophylline then serum concentrations were measured at 11 time points over the next 25 hours.

The columns of the data are: 

* Subject, a reference number identifying each subject (1 to 12)
* Wt, weight of the subject in kg
* Dose, dose of theophylline given to the subject in mg/kg
* Time, time since the drug was administered when the sample was drawn in hours
* conc, theophylline concentration in mg/L

[1] *Boeckmann, A. J., Sheiner, L. B. and Beal, S. L. (1994), NONMEM Users Guide: Part V, NONMEM Project Group, University of California, San Francisco.*

## Indexing and slicing

Indexing and slicing work similarly to base python and to R

The difference is that we now have multiple dimensions

### Indexing

Indexing is ndarray[row, column]. *Don't forget that python is zero indexed*

So to get the value in the first row and column: 


In [14]:
theoph[0, 0]

11.0

To get all the values in the first row

In [15]:
theoph[0, :]

array([11.  , 79.6 ,  4.02,  0.  ,  0.74])

To get all the values in the last column

In [16]:
theoph[:, -1]

array([ 0.74,  2.84,  6.57, 10.5 ,  9.66,  8.58,  8.36,  7.47,  6.89,
        5.94,  3.28,  0.  ,  1.72,  7.91,  8.31,  8.33,  6.85,  6.08,
        5.4 ,  4.55,  3.01,  0.9 ,  0.  ,  4.4 ,  6.9 ,  8.2 ,  7.8 ,
        7.5 ,  6.2 ,  5.3 ,  4.9 ,  3.7 ,  1.05,  0.  ,  1.89,  4.6 ,
        8.6 ,  8.38,  7.54,  6.88,  5.78,  5.33,  4.19,  1.15,  0.  ,
        2.02,  5.63, 11.4 ,  9.33,  8.74,  7.56,  7.09,  5.9 ,  4.37,
        1.57,  0.  ,  1.29,  3.08,  6.44,  6.32,  5.53,  4.94,  4.02,
        3.46,  2.78,  0.92,  0.15,  0.85,  2.35,  5.02,  6.58,  7.09,
        6.66,  5.25,  4.39,  3.53,  1.15,  0.  ,  3.05,  3.05,  7.31,
        7.56,  6.59,  5.88,  4.73,  4.57,  3.  ,  1.25,  0.  ,  7.37,
        9.03,  7.14,  6.33,  5.66,  5.67,  4.24,  4.11,  3.16,  1.12,
        0.24,  2.89,  5.22,  6.41,  7.83, 10.21,  9.18,  8.02,  7.14,
        5.68,  2.42,  0.  ,  4.86,  7.24,  8.  ,  6.81,  5.87,  5.22,
        4.45,  3.62,  2.69,  0.86,  0.  ,  1.25,  3.96,  7.82,  9.72,
        9.75,  8.57,

### Slicing

To get a slice of data we use

ndarray[row_start:row_end, column_start:column_end]

*python is lower bound inclusive and upper bound exclusive*

To select the second, third and fourth rows

In [17]:
theoph[1:4, :]

array([[11.  , 79.6 ,  4.02,  0.25,  2.84],
       [11.  , 79.6 ,  4.02,  0.57,  6.57],
       [11.  , 79.6 ,  4.02,  1.12, 10.5 ]])

To select the first three columns

In [18]:
theoph[:, :3]

array([[11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [11.  , 79.6 ,  4.02],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 6.  , 72.4 ,  4.4 ],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 5.  , 70.5 ,  4.53],
       [ 7

To select the last eleven rows

In [19]:
theoph[-11:, :]

array([[ 9.  , 60.5 ,  5.3 ,  0.  ,  0.  ],
       [ 9.  , 60.5 ,  5.3 ,  0.25,  1.25],
       [ 9.  , 60.5 ,  5.3 ,  0.5 ,  3.96],
       [ 9.  , 60.5 ,  5.3 ,  1.  ,  7.82],
       [ 9.  , 60.5 ,  5.3 ,  2.  ,  9.72],
       [ 9.  , 60.5 ,  5.3 ,  3.52,  9.75],
       [ 9.  , 60.5 ,  5.3 ,  5.07,  8.57],
       [ 9.  , 60.5 ,  5.3 ,  7.07,  6.59],
       [ 9.  , 60.5 ,  5.3 ,  9.03,  6.11],
       [ 9.  , 60.5 ,  5.3 , 12.05,  4.57],
       [ 9.  , 60.5 ,  5.3 , 24.15,  1.17]])
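
The same slicing rules on a small made-up array, where it is easier to check the bounds by eye (the array `small` is an assumption for illustration):

```python
import numpy as np

# 4 rows by 3 columns holding the numbers 0 to 11
small = np.arange(12).reshape(4, 3)

middle_rows = small[1:3, :]    # rows 1 and 2; row 3 is excluded
first_cols = small[:, :2]      # columns 0 and 1
last_two_rows = small[-2:, :]  # a negative start counts from the end

print(middle_rows)
print(first_cols)
print(last_two_rows)
```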

Alternatively we can pass a list of the columns (or rows) we want

In [20]:
theoph[:, [1, 2, 4]]

array([[79.6 ,  4.02,  0.74],
       [79.6 ,  4.02,  2.84],
       [79.6 ,  4.02,  6.57],
       [79.6 ,  4.02, 10.5 ],
       [79.6 ,  4.02,  9.66],
       [79.6 ,  4.02,  8.58],
       [79.6 ,  4.02,  8.36],
       [79.6 ,  4.02,  7.47],
       [79.6 ,  4.02,  6.89],
       [79.6 ,  4.02,  5.94],
       [79.6 ,  4.02,  3.28],
       [72.4 ,  4.4 ,  0.  ],
       [72.4 ,  4.4 ,  1.72],
       [72.4 ,  4.4 ,  7.91],
       [72.4 ,  4.4 ,  8.31],
       [72.4 ,  4.4 ,  8.33],
       [72.4 ,  4.4 ,  6.85],
       [72.4 ,  4.4 ,  6.08],
       [72.4 ,  4.4 ,  5.4 ],
       [72.4 ,  4.4 ,  4.55],
       [72.4 ,  4.4 ,  3.01],
       [72.4 ,  4.4 ,  0.9 ],
       [70.5 ,  4.53,  0.  ],
       [70.5 ,  4.53,  4.4 ],
       [70.5 ,  4.53,  6.9 ],
       [70.5 ,  4.53,  8.2 ],
       [70.5 ,  4.53,  7.8 ],
       [70.5 ,  4.53,  7.5 ],
       [70.5 ,  4.53,  6.2 ],
       [70.5 ,  4.53,  5.3 ],
       [70.5 ,  4.53,  4.9 ],
       [70.5 ,  4.53,  3.7 ],
       [70.5 ,  4.53,  1.05],
       [72

We can also pass a list (or array) of boolean values to select rows of the array

This is known as a boolean mask

A common way to create a boolean mask is using the > < == operators on a column

Let's select the rows of the array with doses less than 4.1 mg/kg

In [21]:
mask = theoph[:, 2] < 4.1
mask

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [22]:
theoph[mask, :]

array([[11.  , 79.6 ,  4.02,  0.  ,  0.74],
       [11.  , 79.6 ,  4.02,  0.25,  2.84],
       [11.  , 79.6 ,  4.02,  0.57,  6.57],
       [11.  , 79.6 ,  4.02,  1.12, 10.5 ],
       [11.  , 79.6 ,  4.02,  2.02,  9.66],
       [11.  , 79.6 ,  4.02,  3.82,  8.58],
       [11.  , 79.6 ,  4.02,  5.1 ,  8.36],
       [11.  , 79.6 ,  4.02,  7.03,  7.47],
       [11.  , 79.6 ,  4.02,  9.05,  6.89],
       [11.  , 79.6 ,  4.02, 12.12,  5.94],
       [11.  , 79.6 ,  4.02, 24.37,  3.28],
       [ 1.  , 80.  ,  4.  ,  0.  ,  0.  ],
       [ 1.  , 80.  ,  4.  ,  0.27,  1.29],
       [ 1.  , 80.  ,  4.  ,  0.58,  3.08],
       [ 1.  , 80.  ,  4.  ,  1.15,  6.44],
       [ 1.  , 80.  ,  4.  ,  2.03,  6.32],
       [ 1.  , 80.  ,  4.  ,  3.57,  5.53],
       [ 1.  , 80.  ,  4.  ,  5.  ,  4.94],
       [ 1.  , 80.  ,  4.  ,  7.  ,  4.02],
       [ 1.  , 80.  ,  4.  ,  9.22,  3.46],
       [ 1.  , 80.  ,  4.  , 12.1 ,  2.78],
       [ 1.  , 80.  ,  4.  , 23.85,  0.92],
       [ 8.  , 86.4 ,  3.1 ,  0.
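
Masks can also be combined. A short sketch on made-up rows of (subject, weight, dose), using & for "and", | for "or" and ~ for "not". The variable names are illustrative.

```python
import numpy as np

# Made-up rows of (subject, weight in kg, dose in mg/kg)
rows = np.array([[1.0, 80.0, 4.00],
                 [6.0, 72.4, 4.40],
                 [8.0, 86.4, 3.10],
                 [9.0, 60.5, 5.30]])

low_dose = rows[:, 2] < 4.1
heavy = rows[:, 1] > 85

both = rows[low_dose & heavy, :]    # low dose AND heavy
either = rows[low_dose | heavy, :]  # low dose OR heavy
not_low = rows[~low_dose, :]        # ~ inverts the mask

print(both)
print(either)
print(not_low)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example `(rows[:, 2] < 4.1) & (rows[:, 1] > 85)`.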

## Vectorised operations

Numpy functions are vectorized. 

This means they are fast and efficient compared to base python

### Elementwise operations

For elementwise operators the same operation is applied to every element in the array

If we want to convert the weights from metric units to imperial (kg to lb)

We need to multiply every weight by 2.2

In base python we would write a loop or list comprehension

In numpy it is much simpler
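
To make the comparison concrete, here is a sketch of both approaches on a short made-up list of weights (the values and variable names are illustrative):

```python
import numpy as np

weights_kg = [79.6, 72.4, 70.5]

# Base python: a list comprehension visits each element in turn
weights_lb_list = [w * 2.2 for w in weights_kg]

# numpy: one expression applies the multiplication to every element
weights_lb_array = np.array(weights_kg) * 2.2

print(weights_lb_list)
print(weights_lb_array)
```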

In [23]:
theoph[:, 1] * 2.2

array([175.12, 175.12, 175.12, 175.12, 175.12, 175.12, 175.12, 175.12,
       175.12, 175.12, 175.12, 159.28, 159.28, 159.28, 159.28, 159.28,
       159.28, 159.28, 159.28, 159.28, 159.28, 159.28, 155.1 , 155.1 ,
       155.1 , 155.1 , 155.1 , 155.1 , 155.1 , 155.1 , 155.1 , 155.1 ,
       155.1 , 159.94, 159.94, 159.94, 159.94, 159.94, 159.94, 159.94,
       159.94, 159.94, 159.94, 159.94, 120.12, 120.12, 120.12, 120.12,
       120.12, 120.12, 120.12, 120.12, 120.12, 120.12, 120.12, 176.  ,
       176.  , 176.  , 176.  , 176.  , 176.  , 176.  , 176.  , 176.  ,
       176.  , 176.  , 142.12, 142.12, 142.12, 142.12, 142.12, 142.12,
       142.12, 142.12, 142.12, 142.12, 142.12, 155.1 , 155.1 , 155.1 ,
       155.1 , 155.1 , 155.1 , 155.1 , 155.1 , 155.1 , 155.1 , 155.1 ,
       190.08, 190.08, 190.08, 190.08, 190.08, 190.08, 190.08, 190.08,
       190.08, 190.08, 190.08, 128.04, 128.04, 128.04, 128.04, 128.04,
       128.04, 128.04, 128.04, 128.04, 128.04, 128.04, 143.  , 143.  ,
      

We can also multiply slices by slices

The dose is given in mg per kg.

To find out the mg given to each subject we multiply the dose column by the weight column

In [24]:
np.multiply(theoph[:, 1], theoph[:, 2])

array([319.992, 319.992, 319.992, 319.992, 319.992, 319.992, 319.992,
       319.992, 319.992, 319.992, 319.992, 318.56 , 318.56 , 318.56 ,
       318.56 , 318.56 , 318.56 , 318.56 , 318.56 , 318.56 , 318.56 ,
       318.56 , 319.365, 319.365, 319.365, 319.365, 319.365, 319.365,
       319.365, 319.365, 319.365, 319.365, 319.365, 319.88 , 319.88 ,
       319.88 , 319.88 , 319.88 , 319.88 , 319.88 , 319.88 , 319.88 ,
       319.88 , 319.88 , 319.956, 319.956, 319.956, 319.956, 319.956,
       319.956, 319.956, 319.956, 319.956, 319.956, 319.956, 320.   ,
       320.   , 320.   , 320.   , 320.   , 320.   , 320.   , 320.   ,
       320.   , 320.   , 320.   , 319.77 , 319.77 , 319.77 , 319.77 ,
       319.77 , 319.77 , 319.77 , 319.77 , 319.77 , 319.77 , 319.77 ,
       319.365, 319.365, 319.365, 319.365, 319.365, 319.365, 319.365,
       319.365, 319.365, 319.365, 319.365, 267.84 , 267.84 , 267.84 ,
       267.84 , 267.84 , 267.84 , 267.84 , 267.84 , 267.84 , 267.84 ,
       267.84 , 320.

np.multiply and the * operator are equivalent

Similarly there are functions for add, subtract, divide

And many more mathematical operations, e.g. reciprocal, power, log / exp, sin / cos / tan

https://numpy.org/doc/stable/reference/routines.math.html
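
A few of these functions on two small made-up arrays, as an illustrative sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 4.0])
b = np.array([2.0, 2.0, 2.0])

print(np.add(a, b))       # same as a + b
print(np.subtract(a, b))  # same as a - b
print(np.divide(a, b))    # same as a / b
print(np.power(a, b))     # same as a ** b
print(np.exp(np.log(a)))  # log and exp undo each other
```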

### Summary operations

There are many functions available that can summarise the data

These can be applied to the whole array

More commonly to every row or column

For example to find the minimum of each column, we specify axis=0

In [25]:
np.min(theoph, axis=0)

array([ 1. , 54.6,  3.1,  0. ,  0. ])

To find the mean of each row we specify axis=1

In [26]:
np.mean(theoph, axis=1)

array([19.072, 19.542, 20.352, 21.248, 21.26 , 21.404, 21.616, 21.824,
       22.112, 22.536, 24.454, 16.56 , 16.958, 18.246, 18.422, 18.61 ,
       18.63 , 18.78 , 19.046, 19.27 , 19.562, 21.6  , 16.006, 16.94 ,
       17.502, 17.85 , 17.97 , 18.23 , 18.262, 18.48 , 18.786, 19.176,
       21.05 , 16.82 , 17.268, 17.86 , 18.754, 18.922, 19.028, 19.2  ,
       19.38 , 19.69 , 20.054, 21.98 , 14.492, 14.956, 15.722, 16.972,
       16.762, 16.94 , 17.008, 17.314, 17.492, 17.766, 19.676, 17.   ,
       17.312, 17.732, 18.518, 18.67 , 18.82 , 18.988, 19.204, 19.536,
       19.976, 21.954, 14.34 , 14.53 , 14.88 , 15.518, 16.03 , 16.424,
       16.642, 16.756, 16.988, 17.426, 19.384, 15.606, 16.266, 16.32 ,
       17.264, 17.522, 17.63 , 17.792, 17.982, 18.334, 18.626, 20.68 ,
       19.5  , 21.034, 21.432, 21.138, 21.17 , 21.338, 21.638, 21.782,
       22.082, 22.452, 24.61 , 14.788, 15.392, 15.938, 16.226, 16.716,
       17.492, 17.586, 17.76 , 18.044, 18.296, 19.964, 14.784, 15.806,
      

To sum up the numbers in the whole array we leave out the axis argument

We can nest functions in numpy

Here we round to one decimal place

In [27]:
np.round(np.sum(theoph), 1)

12086.5

There are many more summary functions available, for example np.max, np.median and np.std
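
A sketch of how the axis argument changes the result, using a small made-up array so the sums can be checked by hand:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

column_sums = np.sum(table, axis=0)  # collapse the rows: one value per column
row_sums = np.sum(table, axis=1)     # collapse the columns: one value per row
total = np.sum(table)                # no axis: one value for the whole array

print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
print(total)        # 21
```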

### Linear algebra

numpy also contains functions for many common linear algebra operations

https://numpy.org/doc/stable/reference/routines.linalg.html

Matrix multiplication

In [28]:
np.matmul(theoph[:2, :], theoph[:5, :2])

array([[1048.96 , 7590.656],
       [1074.81 , 7777.716]])

Eigenvalues and eigenvectors

In [29]:
np.linalg.eig(theoph[:5, :])

(array([ 9.78998626e+01,  8.38897366e+00, -8.88836233e-01,  2.92535465e-15,
        -3.80290455e-16]),
 array([[-4.17368737e-01, -1.99891890e-01,  2.05229022e-01,
          9.74290006e-01, -3.36539687e-01],
        [-4.28646806e-01, -1.49324302e-02, -7.19385779e-03,
         -1.43413250e-01,  9.38267482e-02],
        [-4.48083801e-01,  3.03883884e-01, -4.99039276e-01,
          1.73757368e-01, -9.36983234e-01],
        [-4.69583906e-01,  6.56450108e-01, -8.20725225e-01,
          0.00000000e+00,  0.00000000e+00],
        [-4.69869252e-01,  6.60717864e-01,  1.87614504e-01,
         -3.70678206e-16,  8.99639735e-18]]))
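
np.linalg.eig returns the eigenvalues together with a matrix whose columns are the corresponding eigenvectors. A sketch that checks this on a small made-up symmetric matrix (the matrix is an assumption for illustration); the @ operator is shorthand for np.matmul.

```python
import numpy as np

matrix = np.array([[2.0, 1.0],
                   [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(matrix)

# For each column v of eigenvectors: matrix @ v equals eigenvalue * v
left = matrix @ eigenvectors
right = eigenvectors * eigenvalues  # scales column j by eigenvalue j

print(eigenvalues)
print(np.allclose(left, right))
```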

### Manipulating arrays

We can stack arrays together to add extra rows or columns

https://numpy.org/doc/stable/reference/routines.array-manipulation.html

In [30]:
new_array = np.hstack([theoph, theoph])
new_array

array([[11.  , 79.6 ,  4.02, ...,  4.02,  0.  ,  0.74],
       [11.  , 79.6 ,  4.02, ...,  4.02,  0.25,  2.84],
       [11.  , 79.6 ,  4.02, ...,  4.02,  0.57,  6.57],
       ...,
       [ 9.  , 60.5 ,  5.3 , ...,  5.3 ,  9.03,  6.11],
       [ 9.  , 60.5 ,  5.3 , ...,  5.3 , 12.05,  4.57],
       [ 9.  , 60.5 ,  5.3 , ...,  5.3 , 24.15,  1.17]])

In [31]:
new_array.shape

(132, 10)

In [32]:
new_array = np.vstack([theoph, theoph])
new_array.shape

(264, 5)

In [33]:
new_array = np.hstack([theoph, theoph[:, 1] * 2.2])

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)

In [34]:
theoph.shape

(132, 5)

In [None]:
theoph[:, 1].shape

In [None]:
theoph[:, 1:2].shape

In [35]:
new_array = np.hstack([theoph, theoph[:, 1:2] * 2.2])
new_array.shape

(132, 6)
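
The error arises because selecting a single column with an integer index gives a one-dimensional array, while slicing keeps two dimensions. A sketch on a small made-up array, also showing np.column_stack, which accepts the one-dimensional form directly (the names are illustrative):

```python
import numpy as np

base = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

one_d = base[:, 1]    # an integer index drops the column dimension
two_d = base[:, 1:2]  # a slice keeps it
print(one_d.shape, two_d.shape)

# hstack needs matching dimensions, so use the two-dimensional column
stacked = np.hstack([base, two_d * 2.2])

# column_stack turns one-dimensional inputs into columns for us
stacked_alt = np.column_stack([base, one_d * 2.2])

print(stacked.shape, stacked_alt.shape)
```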

# Pandas

Pandas works with tabular data similar to spreadsheets and databases

Pandas is another huge package with lots of functionality

https://pandas.pydata.org/pandas-docs/stable/reference/index.html

Similar to numpy, pandas is imported by convention as pd

In [26]:
%pip install pandas
import pandas as pd

Collecting pandas
  Downloading pandas-1.2.1-cp39-cp39-macosx_10_9_x86_64.whl (10.7 MB)
Collecting pytz>=2017.3
  Downloading pytz-2021.1-py2.py3-none-any.whl (510 kB)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.2.1 pytz-2021.1
Note: you may need to restart the kernel to use updated packages.


## Pandas series

The basic object in pandas is the pandas Series

A pandas series is a 1d numpy array with an index or axis label for each element

Let's create a series:

In [27]:
pd.Series(['a', 'b', 'c', 'c'])

0    a
1    b
2    c
3    c
dtype: object

## Pandas indexing

Indexes are shown on the left in the output above

Indexes do not have to be numbers

In [28]:
string_series = pd.Series(['a', 'b', 'c', 'c'], 
                          index=['row0', 'row1', 'row2', 'row3'])
string_series

row0    a
row1    b
row2    c
row3    c
dtype: object

Indexes are not a column

If we look at the shape we can see this

In [29]:
string_series.shape

(4,)

Pandas has two ways of indexing, label based and position based

Syntax for label based indexing is: 

*Series.loc[index]*

In [30]:
string_series.loc['row2']

'c'

Syntax for position based indexing is: 

*Series.iloc[position]*

In [31]:
string_series.iloc[2]

'c'

Having two ways of indexing can lead to confusion and bugs

Consider a series where the indexes are non-consecutive numbers

In [32]:
string_series = pd.Series(['a', 'b', 'c', 'c'], index=[0, 3, 5, 7])
string_series

0    a
3    b
5    c
7    c
dtype: object

In [33]:
string_series.loc[3]

'b'

In [34]:
string_series.iloc[3]

'c'

In [35]:
string_series[3]

'b'
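
Plain square brackets on a series with an integer index look values up by label, which is easy to mistake for a position. A sketch that spells out the difference on a series like the one above (the name `letters` is illustrative); using .loc or .iloc explicitly avoids the ambiguity.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'c'], index=[0, 3, 5, 7])

by_label = letters.loc[3]      # the element whose index label is 3
by_position = letters.iloc[3]  # the fourth element, whatever its label

print(by_label)
print(by_position)

# There is no label 1, although there is a position 1
print(1 in letters.index)
```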

## Pandas Dataframes

Series can be combined together into a pandas dataframe

Indexes are used when combining series into dataframes

A dataframe looks a lot like a table

More properly could be thought of as a dictionary of series

Each series in a dataframe must be the same length and have the same indexes

In [None]:
STRING_series = pd.Series(['C', 'B', 'C', 'A'], index=[5, 3, 7, 0])
STRING_series

In [None]:
string_dataframe = pd.concat((string_series, STRING_series), axis=1, sort=True)
string_dataframe
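
concat with axis=1 lines values up by index label rather than by position. A sketch with two made-up series whose labels are in different orders (the names `lower` and `upper` are illustrative):

```python
import pandas as pd

lower = pd.Series(['a', 'b', 'c'], index=[0, 3, 5], name='lower')
upper = pd.Series(['C', 'B', 'A'], index=[5, 3, 0], name='upper')

# Rows are matched on the index labels 0, 3 and 5, not on position
combined = pd.concat((lower, upper), axis=1, sort=True)
print(combined)
```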

Unlike numpy arrays, the Series in a dataframe can have different types

In [None]:
bool_series = pd.Series([True, False, True, False], index=[5, 3, 7, 0])
integer_series = pd.Series([1, 2, 3, 4], index=[0, 3, 5, 7])
string_dataframe = pd.concat((string_series, STRING_series, bool_series, integer_series), axis=1, sort=True)
string_dataframe

We can also name the columns in a dataframe

In [None]:
string_dataframe.columns = ['string', 'STRING', 'bool', 'integer']
string_dataframe

Indexing pandas dataframes has the same label and position options as the pandas Series

In [None]:
string_dataframe.loc[[0, 3], 'STRING']

In [None]:
string_dataframe.iloc[[0, 3], [1, 2]]

**Resetting the index**

We can reset the index to match the position

This will create a new column called index

Unless you set the drop=True argument

In [None]:
string_dataframe.reset_index()

In [None]:
string_df = string_dataframe.reset_index(drop=True)
string_df

Warning: mismatched indexes can lead to unexpected results when trying to combine series / dataframes

In [None]:
pd.concat((string_df, integer_series), axis=0)
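
A sketch of what can go wrong, using two made-up series whose index labels only partly overlap. Labels that are missing from one side are filled with NaN.

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=[0, 1, 2], name='first')
second = pd.Series([10, 20, 30], index=[1, 2, 3], name='second')

# Labels 0 and 3 each appear in only one of the series
side_by_side = pd.concat((first, second), axis=1)
print(side_by_side)

# Counting the gaps is a quick check before going further
missing = side_by_side.isna().sum().sum()
print(missing)
```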

## Reading / Writing csv files using pandas

Pandas can also read in csv files

In [None]:
theoph_pd = pd.read_csv(theoph_file_location)
theoph_pd

In [None]:
theoph_pd.to_csv(theoph_out_file, index=False)
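
Unlike genfromtxt, read_csv uses the first line of the file as column names by default. A self-contained sketch using an in-memory file, assuming the header uses the column names listed earlier; the two rows are example values, not the full file.

```python
import io

import pandas as pd

csv_text = (
    "Subject,Wt,Dose,Time,conc\n"
    "1,80.0,4.0,0.0,0.0\n"
    "1,80.0,4.0,0.27,1.29\n"
)

example = pd.read_csv(io.StringIO(csv_text))
print(example.columns.tolist())
print(example.dtypes)

# With no path given, to_csv returns the csv text as a string
round_trip = example.to_csv(index=False)
print(round_trip)
```

Passing index=False leaves the row index out of the written file, so reading it back gives the same columns.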

## Combining pandas and numpy

Pandas has many built in functions similar to numpy

https://pandas.pydata.org/pandas-docs/stable/reference/index.html

Most numpy functions can be used on pandas series and dataframes

So to add a column with weight in pounds

In [None]:
theoph_pd['Wt_lb'] = theoph_pd['Wt'].multiply(2.2)
theoph_pd
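
A sketch of a numpy function applied directly to a pandas series, using a few made-up weights and index labels. The result is still a series and keeps its index.

```python
import numpy as np
import pandas as pd

weights = pd.Series([79.6, 72.4, 70.5], index=['s11', 's6', 's5'])

log_weights = np.log(weights)  # numpy function, pandas result
print(type(log_weights))
print(log_weights)
```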

There are pandas functions for filtering and sorting.

Pandas can also be sliced using boolean masks

In [None]:
theoph_pd[theoph_pd['Wt'] < 69.58]
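
A sketch of filtering and sorting on a small dataframe. sort_values and query are standard pandas methods; the dataframe `subjects` is a cut-down illustration rather than the full dataset.

```python
import pandas as pd

subjects = pd.DataFrame({'Subject': [11, 6, 5, 9],
                         'Wt': [79.6, 72.4, 70.5, 60.5]})

light = subjects[subjects['Wt'] < 72.0]    # boolean mask
light_query = subjects.query('Wt < 72.0')  # the same filter written as a string
by_weight = subjects.sort_values('Wt')     # ascending by default

print(light)
print(by_weight)
```

Both filters keep the original index labels, so the rows can still be traced back to the full dataframe.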