<center> 
# R406: Using Python for data analysis and modelling

<br> <br> 

## Lecture 11: Pandas - main data structures, indexing, selecting, filtering and sorting

<br>

<center> **Andrey Vassilev**

<br> 

<center> **2016/2017**
 

# Outline

1. An overview of Pandas
2. Main data structures
3. Basic operations on Pandas objects

# Main facts about Pandas

- Pandas is a Python package that offers rich data processing and analysis functionality.
- In particular, it can work with series of observations and with heterogeneous tabular data (think of a dataset consisting of several time series or of observations on different subjects).
- Pandas allows us to clean, transform, filter, sort etc. a dataset.
- Pandas also allows us to split, merge and extract various representations of our data.
- Pandas can read from and write to different data sources (e.g. CSV files, Excel spreadsheets and SQL databases).
- It has sophisticated date-time functionality.

# Pandas data structures

- The main data structures in Pandas are: 
    - `Series` 
    - `DataFrame` 
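    - `Panel` (deprecated in Pandas 0.20 and removed in Pandas 0.25)
- The key ones are the first two.
- These structures can be treated as nestable by dimension: 
   - The `Series` is 1D and can be used as a building block of a `DataFrame`.
   - The `DataFrame` is 2D and can serve as the building block of a `Panel`.
   - The `Panel` is 3D and is the most general (but least used) data structure. In versions of Pandas where it is no longer available, a `DataFrame` with a hierarchical index (`MultiIndex`) is the usual substitute.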
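
The nesting of a `Series` inside a `DataFrame` can be illustrated with a minimal sketch. The column names and the values below are made up purely for illustration.

```python
import pandas as pd

# Two Series sharing the same index (hypothetical data)
height = pd.Series([1.72, 1.80, 1.65], index=["ann", "bob", "eve"])
weight = pd.Series([65.0, 82.5, 58.0], index=["ann", "bob", "eve"])

# A DataFrame can be assembled from a dict of Series; each one becomes a column
df = pd.DataFrame({"height": height, "weight": weight})
print(df)

# Selecting a single column gives back a Series
col = df["height"]
print(type(col).__name__)   # Series
print(df.ndim, col.ndim)    # 2 1
```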

To start exploring the various Pandas structures we first import the relevant modules:

In [None]:
import pandas as pd # another established convention
import numpy as np

A `Series` can be created from a list.

In [None]:
s = pd.Series([1,4,-2,0,np.nan,3])
s

A `Series` object has several main characteristics.

It has an index.

In [None]:
s.index

This default index is trivial: it simply numbers the elements starting from 0, just like the familiar indexing for sequences. We can replace it with a more informative index:

In [None]:
dt = pd.date_range(start="2017-01-11", periods=len(s), freq="M") # Month-end frequency: the first date is 2017-01-31 (pandas 2.2+ spells this alias "ME")
print(dt)
s.index=dt
s
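
As a hedged sketch of what a date index makes possible, the example below rebuilds a similar series and selects observations by date. It passes a `pd.offsets.MonthEnd()` object instead of the `"M"` string; that is a choice made here so that the code behaves the same across pandas versions, and the name `ts` is illustrative.

```python
import numpy as np
import pandas as pd

# Month-end dates: the start date is rolled forward to the first month end
dt = pd.date_range(start="2017-01-11", periods=6, freq=pd.offsets.MonthEnd())
ts = pd.Series([1, 4, -2, 0, np.nan, 3], index=dt)

print(ts.index[0])   # the first date is 2017-01-31, not 2017-01-11

# Select a single observation by its date label
print(ts.loc[pd.Timestamp("2017-02-28")])   # 4.0

# Select a range of dates; label-based slices include both endpoints
print(ts.loc["2017-02-28":"2017-04-30"])
```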

You can inspect the contents of a `Series` by using the `head()` and `tail()` methods.

In [None]:
s.head()

In [None]:
s.head(3) # Try changing it to 2 or 4

In [None]:
s.tail()

In [None]:
s.values # You can extract the values as an array
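
A small sketch of the inspection tools above, using a made-up series `t`. With no argument, `head()` and `tail()` return five elements, and `values` gives a NumPy array.

```python
import numpy as np
import pandas as pd

t = pd.Series(np.arange(10) * 1.5)   # ten values: 0.0, 1.5, ..., 13.5

print(len(t.head()))              # 5, the default number of elements
print(len(t.tail(3)))             # 3
print(t.tail(3).index.tolist())   # [7, 8, 9]: the original index labels are kept

arr = t.values
print(type(arr).__name__)         # ndarray
```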

In [24]:
s.iloc[4] = 8 # assign by position: the fifth element (the NaN) becomes 8
s.describe()

count    6.000000
mean     2.333333
std      3.502380
min     -2.000000
25%      0.250000
50%      2.000000
75%      3.750000
max      8.000000
dtype: float64
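
The sketch below, on a made-up series `u`, shows two points related to the cell above: `describe()` skips missing values, and an element can be assigned either by position (`iloc`) or by label (`loc`).

```python
import numpy as np
import pandas as pd

u = pd.Series([1, 4, -2, 0, np.nan, 3], index=list("abcdef"))

# Missing values are skipped by describe(): only 5 of the 6 entries are counted
print(u.describe()["count"])   # 5.0

u.iloc[4] = 8      # by position: the fifth element
u.loc["a"] = 10    # by label
print(u.describe()["count"])   # 6.0
print(u.tolist())              # [10.0, 4.0, -2.0, 0.0, 8.0, 3.0]
```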

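A `Series` can be created from a dictionary. The dictionary keys will be used as the index. The output below comes from an older version of Pandas, which sorted the keys; recent versions keep the keys in insertion order.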

In [27]:
s = pd.Series({"a":1,"b":3,"f":4, "c":-2.2})
s

a    1.0
b    3.0
c   -2.2
f    4.0
dtype: float64
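
A brief sketch of working with a dictionary-based `Series`: elements are selected by key, and an explicit `index` argument picks and orders the keys. The key `"z"` is deliberately absent from the dictionary to show what happens with a missing key.

```python
import pandas as pd

d = {"a": 1, "b": 3, "f": 4, "c": -2.2}
s1 = pd.Series(d)

print(s1["f"])                          # 4.0, selected by label
print(s1.sort_index().index.tolist())   # ['a', 'b', 'c', 'f']

# An explicit index selects and orders the keys; missing keys yield NaN
s2 = pd.Series(d, index=["c", "a", "z"])
print(s2)
```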

You can also create a `Series` by passing the values and the index together.

In [34]:
s = pd.Series(np.random.rand(5),index = ["e"+str(i) for i in range(1,6)])
s

e1    0.384298
e2    0.152787
e3    0.222569
e4    0.598050
e5    0.931912
dtype: float64
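
The random numbers above change on every run. A minimal sketch of a reproducible variant follows; the seeded generator, the seed value and the names `rng`, `labels` and `r` are choices made here, not part of the original cell. It also shows that the values and the index must have the same length.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)           # seeded generator (seed chosen arbitrarily)
labels = [f"e{i}" for i in range(1, 6)]   # ['e1', 'e2', 'e3', 'e4', 'e5']
r = pd.Series(rng.random(5), index=labels)
print(r)

# Values and index must have the same length
try:
    pd.Series(np.arange(5), index=["e1", "e2"])
except ValueError as err:
    print("ValueError:", err)
```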