In [1]:
import pandas as pd
import numpy as np

# Pandas Series
We'll start by analyzing "The Group of Seven" (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population, and for that we'll use a `pandas.Series` object.

In [2]:
# In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Someone might not know we're representing population in millions of inhabitants. A Series can have a name, which documents its purpose:

In [3]:
g7_pop.name = 'G7 Population in millions'
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

Series are pretty similar to numpy arrays:

In [4]:
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

They're actually backed by NumPy arrays:

In [5]:
type(g7_pop.values)

numpy.ndarray
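
As a minimal sketch of that relationship, the snippet below builds a shortened stand-in for the Series (the name `pop` and the three-value subset are illustrative, not part of the original notebook) and extracts the underlying array with `to_numpy()`, which recent pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for g7_pop (first three values only)
pop = pd.Series([35.467, 63.951, 80.940], name='G7 Population in millions')

arr = pop.to_numpy()   # same data that pop.values exposes, as a NumPy array
print(type(arr))       # <class 'numpy.ndarray'>
print(arr.dtype)       # float64
print(arr.shape)       # (3,)
```

Once extracted, the array carries no index labels or name; those live only on the Series.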

They look like simple Python lists or NumPy arrays, but they're actually more similar to Python `dict`s. A Series has an index, which by default resembles the automatic positions of a Python list:

In [6]:
g7_pop[0]

35.467

In [7]:
g7_pop[1]

63.951

In [8]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

But, in contrast to lists, we can explicitly define the index:

In [9]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]

g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

We can say that Series look like 'ordered dictionaries'. We can actually create Series out of dictionaries:

In [10]:
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.940,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [11]:
pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'],
    name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
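
The dictionary analogy goes beyond construction. The sketch below uses a shortened, illustrative Series named `pop` to show a few dict-style operations that a Series also supports.

```python
import pandas as pd

# Illustrative three-country subset of the G7 data
pop = pd.Series({'Canada': 35.467, 'France': 63.951, 'Germany': 80.940})

print('France' in pop)        # membership tests check the index labels -> True
print('Spain' in pop)         # False
print(list(pop.keys()))       # ['Canada', 'France', 'Germany']
print(pop.get('Spain', 0.0))  # like dict.get, falls back to the default -> 0.0

# Iterate over (label, value) pairs, as with dict.items()
for country, millions in pop.items():
    print(country, millions)
```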

You can also create a Series from another Series, specifying which index labels to keep. Labels that don't exist in the original, such as 'Spain' here, get `NaN`:

In [12]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
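
Passing an existing Series together with a new index behaves much like calling `reindex`. The sketch below, using an illustrative three-country Series named `pop`, shows that equivalence and two common ways of handling the resulting missing values.

```python
import pandas as pd

# Illustrative subset of the G7 data
pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})

labels = ['France', 'Italy', 'Spain']   # 'Spain' is not in pop
subset = pop.reindex(labels)            # the missing label becomes NaN

print(subset.isna())    # True only for 'Spain'
print(subset.dropna())  # keeps France and Italy

# Same result as building a new Series from pop with an explicit index
print(pd.Series(pop, index=labels).equals(subset))
```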

# Indexing

In [13]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [14]:
g7_pop['Canada']

35.467

In [15]:
g7_pop['Japan']

127.061

Numeric positions can also be used, with the `iloc` attribute:

In [16]:
g7_pop.iloc[0]

35.467

In [17]:
g7_pop.iloc[-1]

318.523
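
To make the label/position distinction explicit, pandas also offers `loc` as the label-based counterpart of `iloc`. A minimal sketch on an illustrative three-country Series (`pop` is a stand-in name):

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=['Canada', 'France', 'Germany'],
)

print(pop.loc['France'])   # by label    -> 63.951
print(pop.iloc[1])         # by position -> 63.951
print(pop.iloc[[0, -1]])   # several positions at once: Canada and Germany
```

Spelling out `loc` or `iloc` avoids ambiguity when the index itself contains integers.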

Selecting multiple elements at once:

In [18]:
g7_pop[['Germany', 'Japan']]

Germany     80.940
Japan      127.061
Name: G7 Population in millions, dtype: float64

# Conditional Selection (Boolean Arrays)
The same boolean array techniques we saw applied to numpy arrays can be used for Pandas `Series`:

In [19]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [20]:
g7_pop > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [21]:
g7_pop[g7_pop > 70]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [22]:
g7_pop.mean()

107.30257142857144

In [23]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [24]:
g7_pop.std()

97.24996987121581

In [25]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std()/2) | (g7_pop > g7_pop.mean() + g7_pop.std()/2)]

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
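
Boolean masks combine with `&` (and), `|` (or) and `~` (not). The sketch below uses an illustrative four-country Series named `pop`; the thresholds are arbitrary and chosen only to make each mask easy to check.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 127.061],
    index=['Canada', 'France', 'Germany', 'Japan'],
)

# Each comparison needs its own parentheses: & and | bind tighter than > and <
mid_sized = pop[(pop > 60) & (pop < 100)]   # AND: both conditions hold
extremes = pop[(pop < 60) | (pop > 100)]    # OR: at least one condition holds
not_large = pop[~(pop > 100)]               # NOT: invert the mask

print(list(mid_sized.index))   # ['France', 'Germany']
print(list(extremes.index))    # ['Canada', 'Japan']
print(list(not_large.index))   # ['Canada', 'France', 'Germany']
```

Note that in the cell above, any value greater than `mean + std/2` is also greater than `mean - std/2`, so the `|` there selects the same rows as the first condition alone.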

# Operations and Methods
Series also support vectorized operations and aggregation functions, just like NumPy arrays:

In [26]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [27]:
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64

In [28]:
g7_pop.mean()

107.30257142857144

In [29]:
np.log(g7_pop)

Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 Population in millions, dtype: float64
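
Vectorized operations return a new Series that keeps the original index, so results stay labeled. A small sketch with an illustrative three-country Series (`pop` and `share` are stand-in names):

```python
import numpy as np
import pandas as pd

pop = pd.Series([35.467, 63.951, 80.940], index=['Canada', 'France', 'Germany'])

share = pop / pop.sum() * 100   # element-wise arithmetic, labels preserved
print(share.round(1))

print(pop.idxmax())             # label of the largest value -> Germany

# NumPy universal functions also return a Series with the same index
print(np.sqrt(pop).index.equals(pop.index))   # True
```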

In [30]:
g7_pop['Germany':'Japan'].mean()

89.55533333333334
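
The slice above averages three countries because label-based slices include the end label, unlike positional slices. A minimal sketch of the difference, using an illustrative five-country Series named `pop`:

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan'],
)

by_label = pop.loc['Germany':'Japan']   # includes 'Japan'
by_position = pop.iloc[2:4]             # stops before position 4

print(list(by_label.index))      # ['Germany', 'Italy', 'Japan']
print(list(by_position.index))   # ['Germany', 'Italy']
```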