# Pandas
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/davemor/scientific-python-tutorial/blob/master/PythonTutorialPart3-Pandas.ipynb)
- Pandas is a library for data manipulation.
- It is useful for reading, exploring, transforming, and writing table-like data whose columns have different data types.
- It’s also useful for time series data and has lots of tools for working with times.
- It’s similar to a spreadsheet, but it can also do many of the selection and transformation operations you might do with a database.
- Check out https://pandas.pydata.org/docs/index.html for detailed documentation.
- When importing Pandas, the convention is to alias it to pd, like so:

In [2]:
import pandas as pd
dir(pd)

['BooleanDtype',
 'Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NA',
 'NaT',
 'NamedAgg',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseDtype',
 'StringDtype',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__getattr__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_config',
 '_hashtable',
 '_is_numpy_dev',
 '_lib',
 '_libs',
 '_np_version_under1p16',
 '_np_version_under1p17',
 '_np_version_under1p18',
 '_testing',
 '_tslib',
 '_ty

## Series
- A Series is a one-dimensional array with a label for each of its elements. Together, the labels are called its index.
- Index labels can be any hashable value, such as a string or an integer. They default to 0, 1, 2, 3, …
- A Series can optionally have a name.
- We can declare a Series like so, passing in a value for each element:

In [3]:
pd.Series(['a', 'b', 'c', 'c'])

0    a
1    b
2    c
3    c
dtype: object

This will add a Range Index by default, that is, one that starts at 0 and goes up in increments of 1.

Let's say that we want to use a set of strings of our own invention as the index labels. We can do this by passing a list to the index keyword argument.

In [4]:
s = pd.Series(['a', 'b', 'c', 'c'],
              index=['row0', 'row1', 'row2', 'row3'])
s

row0    a
row1    b
row2    c
row3    c
dtype: object
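
As a minimal sketch of the optional name mentioned earlier, a Series can be given one through the name keyword argument. The variable `ages` and its labels are chosen here purely for illustration.

```python
import pandas as pd

# Illustrative Series: custom string labels plus an optional name.
ages = pd.Series([77, 77, 74],
                 index=['Nick', 'Roger', 'Dave'],
                 name='age')

print(ages.name)         # the name we supplied
print(list(ages.index))  # the labels we supplied
```

The name becomes the column name if the Series is later placed in a Data Frame.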

If we want to select a value from a Series, there are two main indexers:
- loc – retrieves the value with that label in the index
- iloc – retrieves the value at that integer position


In [5]:
s.loc['row2']

'c'

In [6]:
s.iloc[2]

'c'

In [7]:
s['row2']

'c'

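Plain square brackets, as in `s['row2']`, also select by label. This shorthand can be a source of bugs if the index holds integers that are non-consecutive, because `s[2]` then looks for the label 2 rather than the third element.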
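
The following sketch shows how an integer index can be confusing. The Series `s_int` and its labels 10, 20, 30 are hypothetical, chosen so that labels and positions differ.

```python
import pandas as pd

# Hypothetical Series whose integer labels are not 0, 1, 2, ...
s_int = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])

by_label = s_int.loc[20]     # the element labelled 20
by_position = s_int.iloc[1]  # the element at position 1 (the second one)
by_brackets = s_int[20]      # with an integer index, brackets use the label

# There is no label 1, so a label-based lookup of 1 fails.
try:
    s_int.loc[1]
    label_1_found = True
except KeyError:
    label_1_found = False

print(by_label, by_position, by_brackets, label_1_found)
```

Using loc and iloc explicitly makes it clear which kind of lookup is intended.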

## Data Frames
- A Data Frame is a two-dimensional data structure, used to represent table-like data consisting of cells, organised into rows and columns.
- Each column in a Data Frame is a Series.
- You can think of a Data Frame as a dictionary of Series.
- All Series in the Data Frame share the same index.
- Think of the index as defining the row names, and the dictionary key for each Series as defining its column name.
- They are declared using the DataFrame constructor. For example:

In [8]:
pink_floyd = pd.DataFrame({
    "name": ["Nick", "Roger", "Dave"],
    "instrument": ["drums", "base", "guitar"],
    "age": [77, 77, 74]
})
pink_floyd

| | name | instrument | age |
|---|---|---|---|
| 0 | Nick | drums | 77 |
| 1 | Roger | base | 77 |
| 2 | Dave | guitar | 74 |
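
To illustrate the "dictionary of Series" idea, here is a small sketch using a cut-down copy of the frame with only two columns. The variable names `df` and `ages` are illustrative.

```python
import pandas as pd

# Cut-down copy of the frame above, for illustration.
df = pd.DataFrame({
    "name": ["Nick", "Roger", "Dave"],
    "age": [77, 77, 74]
})

ages = df['age']  # selecting a column by its key gives a Series

print(type(ages).__name__)         # the column is a Series
print(ages.name)                   # its name is the column name
print(ages.index.equals(df.index)) # it shares the frame's row index
```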


You can declare an empty Data Frame with the column names using the columns keyword argument.

In [9]:
band = pd.DataFrame(columns=['name', 'instrument', 'age'])
band

| | name | instrument | age |
|---|---|---|---|


Data Frames have an attribute called `columns`. Note that it is of type `Index`, but it holds the column names, not the row labels.

In [10]:
band.columns

Index(['name', 'instrument', 'age'], dtype='object')

It can be useful to get a quick look at the start of any data. To do this we can use the `head` method to select as many rows at the start as we want.

In [11]:
pink_floyd.head(2)

| | name | instrument | age |
|---|---|---|---|
| 0 | Nick | drums | 77 |
| 1 | Roger | base | 77 |


Data Frames can be indexed by label with loc and by position with iloc, just like a Series. Each takes a row selection followed by a column selection.

In [12]:
pink_floyd.loc[[0, 2], ['name', 'age']]

| | name | age |
|---|---|---|
| 0 | Nick | 77 |
| 2 | Dave | 74 |


In [13]:
pink_floyd.iloc[[0, 2], [0, 1]]

| | name | instrument |
|---|---|---|
| 0 | Nick | drums |
| 2 | Dave | guitar |
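
Both indexers also accept slices, with one difference worth knowing: a loc slice includes its end label, while an iloc slice excludes its end position, as ordinary Python slicing does. A minimal sketch, using a cut-down illustrative frame `df`:

```python
import pandas as pd

# Cut-down illustrative frame with the default 0, 1, 2 index.
df = pd.DataFrame({
    "name": ["Nick", "Roger", "Dave"],
    "age": [77, 77, 74]
})

by_label = df.loc[0:1, 'name']  # labels 0 through 1, end label included
by_position = df.iloc[0:1, 0]   # position 0 only, end position excluded

print(list(by_label))
print(list(by_position))
```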


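We can also use a boolean condition (a.k.a. a predicate) to select specific rows. The comparison produces a Series of True and False values, and only the rows marked True are kept.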

In [14]:
predicate = pink_floyd['age'] > 75
over_75 = pink_floyd[predicate]
over_75

| | name | instrument | age |
|---|---|---|---|
| 0 | Nick | drums | 77 |
| 1 | Roger | base | 77 |
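
Predicates can be combined. The sketch below assumes a cut-down illustrative frame `df` and uses the element-wise operators `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses when written inline, because these operators bind more tightly than `>` or `==`.

```python
import pandas as pd

# Cut-down illustrative frame.
df = pd.DataFrame({
    "name": ["Nick", "Roger", "Dave"],
    "age": [77, 77, 74]
})

over_75 = df['age'] > 75

both = df[over_75 & (df['name'] == 'Nick')]    # both conditions hold
either = df[over_75 | (df['name'] == 'Dave')]  # at least one holds
not_over_75 = df[~over_75]                     # condition negated

print(list(both['name']))
print(list(either['name']))
print(list(not_over_75['name']))
```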


We can transform Data Frames in lots of ways. For example, let's say we wanted to add another column to the frame. We can do this by assigning a list with one value per row to a new column name, like so:

In [15]:
hometown = ['London', 'Great Bookham', 'Cambridge']
pink_floyd['hometown'] = hometown
pink_floyd

| | name | instrument | age | hometown |
|---|---|---|---|---|
| 0 | Nick | drums | 77 | London |
| 1 | Roger | base | 77 | Great Bookham |
| 2 | Dave | guitar | 74 | Cambridge |


Note that we can assign to a column with a single value and it will fill the column.
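
A minimal sketch of filling a column from a single value. The frame `df` and the column name `band` are made up for illustration.

```python
import pandas as pd

# Cut-down illustrative frame.
df = pd.DataFrame({
    "name": ["Nick", "Roger", "Dave"],
    "age": [77, 77, 74]
})

# A single value is repeated for every row of the new column.
df['band'] = 'Pink Floyd'

print(list(df['band']))
```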

## Reading and Writing CSV files using Pandas
Pandas can also read and write CSV files, using the `pd.read_csv` function and the `to_csv` method of a Data Frame.

In [16]:
theoph_pd = pd.read_csv('theoph.csv')
theoph_pd

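| | Subject | Wt | Dose | Time | conc |
|---|---|---|---|---|---|
| 0 | 11 | 79.6 | 4.02 | 0.00 | 0.74 |
| 1 | 11 | 79.6 | 4.02 | 0.25 | 2.84 |
| 2 | 11 | 79.6 | 4.02 | 0.57 | 6.57 |
| 3 | 11 | 79.6 | 4.02 | 1.12 | 10.50 |
| 4 | 11 | 79.6 | 4.02 | 2.02 | 9.66 |
| ... | ... | ... | ... | ... | ... |
| 127 | 9 | 60.5 | 5.30 | 5.07 | 8.57 |
| 128 | 9 | 60.5 | 5.30 | 7.07 | 6.59 |
| 129 | 9 | 60.5 | 5.30 | 9.03 | 6.11 |
| 130 | 9 | 60.5 | 5.30 | 12.05 | 4.57 |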


In [17]:
theoph_pd.to_csv('theoph_out.csv', index=False)
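
The `index=False` argument matters when the file is read back. The sketch below writes a small illustrative frame to a temporary file (the file name `band.csv` is hypothetical) with and without the index, then reads each version back.

```python
import os
import tempfile

import pandas as pd

# Small illustrative frame.
df = pd.DataFrame({
    "name": ["Nick", "Roger", "Dave"],
    "age": [77, 77, 74]
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'band.csv')  # hypothetical file name

    # By default the row index is written as an extra, unnamed first column.
    df.to_csv(path)
    with_index = pd.read_csv(path)

    # With index=False only the data columns are written.
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index.columns))     # includes an 'Unnamed: 0' column
print(list(without_index.columns))  # just the original columns
```

If the index carries no information, as with the default 0, 1, 2, … index, leaving it out avoids the extra column on the next read.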

## Getting Summary Statistics
Like in other similar libraries, it's possible to get summary statistics about your data using Pandas. For example:

In [18]:
# what was the mean concentration?
theoph_pd["conc"].mean()

4.960454545454546

In [19]:
# what was the median dose?
theoph_pd['Dose'].median()

4.53

You can get a set of summary statistics for specific columns using the `describe` method.

In [20]:
theoph_pd[['Dose', 'Time']].describe()

| | Dose | Time |
|---|---|---|
| count | 132.0 | 132.0 |
| mean | 4.625833 | 5.894621 |
| std | 0.718074 | 6.925952 |
| min | 3.1 | 0.0 |
| 25% | 4.305 | 0.595 |
| 50% | 4.53 | 3.53 |
| 75% | 5.0375 | 9.0 |
| max | 5.86 | 24.65 |
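
The same methods work on any numeric Series. As a small sketch that can be checked by hand, here they are applied to three illustrative ages. Note that the 50% row of `describe` is the median.

```python
import pandas as pd

# Three illustrative values, small enough to verify by hand.
ages = pd.Series([77, 77, 74], name='age')

mean_age = ages.mean()      # (77 + 77 + 74) / 3
median_age = ages.median()  # the middle value once sorted
summary = ages.describe()   # count, mean, std, min, quartiles, max

print(mean_age)
print(median_age)
print(summary['50%'] == median_age)
```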
