In [2]:
# Pandas intro

# `Series` objects
The `pandas` library contains these useful data structures:
* `Series` objects, which we will discuss now. A `Series` object is a 1D array, similar to a column in a spreadsheet (with a column name and row labels).
* `DataFrame` objects. This is a 2D table, similar to a spreadsheet (with column names and row labels).
* `Panel` objects. You can see a `Panel` as a dictionary of `DataFrame`s. These were rarely used and were removed from `pandas` in version 0.25, so we will not discuss them here.

In [5]:
import pandas as pd
import numpy as np

In [6]:
# Creating a Series
s = pd.Series([2, -3, 1, 5])
s

0    2
1   -3
2    1
3    5
dtype: int64

In [10]:
np.exp(s)

0      7.389056
1      0.049787
2      2.718282
3    148.413159
dtype: float64

In [12]:
s + [100,200,300,400]

0    102
1    197
2    301
3    405
dtype: int64

In [13]:
s + 100

0    102
1     97
2    101
3    105
dtype: int64

In [14]:
s < 0

0    False
1     True
2    False
3    False
dtype: bool

## Index labels
Each item in a `Series` object has an identifier called the *index label*. Labels are usually unique, although `pandas` does not require it.

By default, it is simply the rank of the item in the `Series` (starting at `0`) but you can also set the index labels manually:

In [15]:
s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s2

alice       68
bob         83
charles    112
darwin      68
dtype: int64

In [16]:
s2["bob"]

83

In [18]:
s2[2]  # integer used as a position here; recent pandas versions warn about this, prefer s2.iloc[2]

112

`loc` and `iloc`:
To make it clear whether you are accessing by label or by integer location, it is recommended to always use the `loc` attribute when accessing by label, and the `iloc` attribute when accessing by integer location:

In [21]:
s2.loc["bob"]

83

In [20]:
s2.iloc[1]

83

In [28]:
s2.iloc[2:3]  # position 2 is included, position 3 (the stop) is not

charles    112
dtype: int64
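
The two accessors also slice differently, which is easy to miss. The sketch below rebuilds `s2` as defined above and compares a positional slice with a label slice; the variable names `by_position` and `by_label` are only illustrative.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])

# Positional slice: the stop position (3) is excluded, as with Python lists.
by_position = s2.iloc[1:3]

# Label slice: the stop label ("charles") is included.
by_label = s2.loc["bob":"charles"]

print(by_position)
print(by_label)
```

Both slices select `bob` and `charles`, even though one stop value is excluded and the other is included.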

In [29]:
surprise = pd.Series([1000, 1001, 1002, 1003])
surprise

0    1000
1    1001
2    1002
3    1003
dtype: int64

In [30]:
surprise_slice = surprise[2:]
surprise_slice

2    1002
3    1003
dtype: int64

In [31]:
try:
    surprise_slice[0]
except KeyError as e:
    print("Key error:", e)

Key error: 0


In [32]:
surprise_slice.iloc[0]

1002
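
The `KeyError` above happens because a slice keeps the index labels of the original `Series`. The sketch below takes the same slice (written with `iloc` to make the positional intent explicit) and shows the label view and the position view side by side. Calling `reset_index(drop=True)` is one possible way to renumber the labels, assuming the original ones are no longer needed.

```python
import pandas as pd

surprise = pd.Series([1000, 1001, 1002, 1003])
surprise_slice = surprise.iloc[2:]

# The slice still carries the original labels 2 and 3; there is no label 0.
labels = list(surprise_slice.index)

# The first item can be reached by its label (2) or by its position (0).
first_by_label = surprise_slice.loc[2]
first_by_position = surprise_slice.iloc[0]

# Optional: discard the old labels and renumber from 0.
renumbered = surprise_slice.reset_index(drop=True)

print(labels)
print(renumbered)
```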

## Init from `dict`
You can create a `Series` object from a `dict`. The keys will be used as index labels:

In [35]:
weights = {"alice": 68, "bob": 83, "colin": 86, "darwin": 68}
s3 = pd.Series(weights)
s3

alice     68
bob       83
colin     86
darwin    68
dtype: int64

In [37]:
# again:
s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s2

alice       68
bob         83
charles    112
darwin      68
dtype: int64

In [38]:
# s2 is similar to s3 but not identical: s2 has "charles" (112) where s3 has "colin" (86)

You can control which elements you want to include in the `Series` and in what order by explicitly specifying the desired `index`:

In [39]:
s4 = pd.Series(weights, index=["colin", "alice"])
s4

colin    86
alice    68
dtype: int64
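
If the requested `index` contains a label that is not a key of the `dict`, the corresponding item is filled with `NaN` rather than raising an error. A minimal sketch, where `"zoe"` is a hypothetical label added only for illustration:

```python
import pandas as pd

weights = {"alice": 68, "bob": 83, "colin": 86, "darwin": 68}

# "zoe" is not a key of the dict, so its value will be missing (NaN).
s_missing = pd.Series(weights, index=["colin", "alice", "zoe"])

print(s_missing)
```

Because `NaN` is a floating-point value, the resulting `Series` has dtype `float64` instead of `int64`.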

## Automatic alignment
When an operation involves multiple `Series` objects, `pandas` automatically aligns items by matching index labels.

In [42]:
print(s2.keys())
print(s3.keys())

s3 + s2

Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
Index(['alice', 'bob', 'colin', 'darwin'], dtype='object')


alice      136.0
bob        166.0
charles      NaN
colin        NaN
darwin     136.0
dtype: float64

The resulting `Series` contains the union of index labels from `s2` and `s3`. Since `"colin"` is missing from `s2` and `"charles"` is missing from `s3`, these items have a `NaN` result value (`NaN` stands for Not-a-Number; here it means *missing*).
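
When a missing item should count as zero instead of producing `NaN`, one option is the `Series.add` method with its `fill_value` parameter. The sketch below rebuilds `s2` and `s3` from the cells above; whether `0` is a sensible default depends on the data, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

s2 = pd.Series([68, 83, 112, 68], index=["alice", "bob", "charles", "darwin"])
s3 = pd.Series({"alice": 68, "bob": 83, "colin": 86, "darwin": 68})

# Plain `+` leaves NaN wherever a label is missing from one side.
plain = s3 + s2

# `add` with fill_value=0 substitutes 0 for the missing side before adding.
filled = s3.add(s2, fill_value=0)

print(filled)
```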

Automatic alignment is very handy when working with data that comes from various sources with varying structure and missing items. But if you forget to set the right index labels, you can get surprising results:

In [44]:
s5 = pd.Series([1000, 1000, 1000, 1000])

s5.values

array([1000, 1000, 1000, 1000])

In [45]:
s2.values

array([ 68,  83, 112,  68])