## `pandas`

In [4]:
import numpy as np
import pandas as pd

### Series
A `pandas.Series` is like a Python dictionary whose entries can be looked up by position as well as by key:

In [1]:
# pythonic way
capitals = { 'France': 'Paris', 'Sweden': 'Stockholm', 'Spain': 'Madrid'}
print(capitals['Spain'])

Madrid


In [2]:
print(capitals[1])

KeyError: 1

In [5]:
# pandas way
pd_capitals = pd.Series(capitals)
print(pd_capitals['Spain'])

Madrid


In [6]:
# select by position; recent pandas versions deprecate plain pd_capitals[1] here
print(pd_capitals.iloc[1])

Stockholm


Series are especially useful for dealing with time series.  We'll work with them much more a little later.
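
The label-versus-position distinction above can be made explicit with the `loc` and `iloc` indexers. This is a minimal sketch reusing the same `capitals` dictionary; the variable names `by_label`, `by_position`, `label_slice` and `position_slice` are just for illustration.

```python
import pandas as pd

capitals = {'France': 'Paris', 'Sweden': 'Stockholm', 'Spain': 'Madrid'}
pd_capitals = pd.Series(capitals)

by_label = pd_capitals.loc['Sweden']   # look up by index label
by_position = pd_capitals.iloc[1]      # look up by integer position

# label slices include both endpoints, position slices exclude the stop
label_slice = pd_capitals.loc['France':'Sweden']
position_slice = pd_capitals.iloc[0:2]

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Being explicit about which kind of lookup we mean avoids surprises when the index itself happens to contain integers.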

### DataFrames

Data frames represent two-dimensional tables of data.  Columns can hold different types and, by default, rows are indexed by integers starting at 0.  Let's create a simple one:

In [7]:
data = {'country': ['France', 'Sweden', 'Spain'],
        'capital': ['Paris', 'Stockholm', 'Madrid'],
        'population': [67, 10, 47]}

df_data = pd.DataFrame(data)
print(df_data)  # plain-text output, works anywhere
df_data  # as the last expression in a cell, Jupyter renders it as a table

  country    capital  population
0  France      Paris          67
1  Sweden  Stockholm          10
2   Spain     Madrid          47


|   | country | capital   | population |
|---|---------|-----------|------------|
| 0 | France  | Paris     | 67         |
| 1 | Sweden  | Stockholm | 10         |
| 2 | Spain   | Madrid    | 47         |
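
To see what "different types" means in practice, we can ask the data frame for its shape and per-column types. This is a small sketch on the same data; the exact name `pandas` reports for the text columns depends on the installed version, so no output is shown here.

```python
import pandas as pd

data = {'country': ['France', 'Sweden', 'Spain'],
        'capital': ['Paris', 'Stockholm', 'Madrid'],
        'population': [67, 10, 47]}
df_data = pd.DataFrame(data)

print(df_data.shape)        # (number of rows, number of columns)
print(df_data.dtypes)       # one dtype per column
print(list(df_data.index))  # the default integer row labels
```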


### Indexing Data Frames

We can select a single row or column from the data frame.  To select a single row by its index label, use the `loc` indexer (note the square brackets):

In [8]:
print(df_data.loc[0])

country       France
capital        Paris
population        67
Name: 0, dtype: object


To select a single column we can use its name directly, either in square brackets or (when the name is a valid Python identifier) as an attribute:

In [10]:
print(df_data['country'])
print(df_data.country)

0    France
1    Sweden
2     Spain
Name: country, dtype: object
0    France
1    Sweden
2     Spain
Name: country, dtype: object

`loc` also accepts a row label and a column name together, and `:` stands for "every row":


In [12]:
print(df_data.loc[1,'country'])


Sweden


In [13]:
print(df_data.loc[:,'country'])

0    France
1    Sweden
2     Spain
Name: country, dtype: object

`iloc` and `iat` work the same way but take integer positions instead of labels; `iat` only ever returns a single value:


In [14]:
print(df_data.iloc[1,1])
print(df_data.iat[1,1])

Stockholm
Stockholm
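
With the default integer index, labels and positions happen to coincide, which hides the difference between `loc` and `iloc`. Here is a sketch with a hypothetical variant of our table that uses the country names as row labels (the `index=` argument is our assumption, not part of the data frame above):

```python
import pandas as pd

df = pd.DataFrame(
    {'capital': ['Paris', 'Stockholm', 'Madrid'],
     'population': [67, 10, 47]},
    index=['France', 'Sweden', 'Spain'],  # hypothetical: countries as row labels
)

by_label = df.loc['Sweden', 'capital']      # row label, column label
by_position = df.iloc[1, 0]                 # row position, column position
scalar_label = df.at['Sweden', 'capital']   # single value by label
scalar_position = df.iat[1, 0]              # single value by position

print(by_label, by_position, scalar_label, scalar_position)
```

All four expressions pick out the same cell; `loc`/`at` go by label and `iloc`/`iat` go by position.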


## The `apply` method

Earlier we learnt a bit about functional programming in Python, and we can take advantage of it with `pandas` data frames too.  You may have noticed that our populations are in millions (or that I'm terrible at geography and/or at using Google).  Let's write a function to convert them into actual head counts.  We'll start with an anonymous function and then switch to a named one:

In [15]:
print(df_data['population'].apply(lambda x: x*1e6))

0    67000000.0
1    10000000.0
2    47000000.0
Name: population, dtype: float64


In [16]:
def convert_pop(x):
    return x*1e6

print(df_data['population'].apply(convert_pop))
print(df_data['population'])  # apply returned a new Series; the original is unchanged

0    67000000.0
1    10000000.0
2    47000000.0
Name: population, dtype: float64
0    67
1    10
2    47
Name: population, dtype: int64
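
For simple arithmetic like this, `apply` is not strictly needed: a Series supports element-wise operators directly. The sketch below, on a standalone copy of the population column, compares the two approaches; `applied` and `vectorised` are illustrative names.

```python
import pandas as pd

population = pd.Series([67, 10, 47], name='population')

applied = population.apply(lambda x: x * 1e6)  # calls the lambda once per element
vectorised = population * 1e6                  # one operation on the whole column

print(applied.equals(vectorised))
```

`apply` earns its keep when the per-element logic cannot be written as plain arithmetic on the column.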


### Adding New Columns
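Let's say we want to add a new column that contains the number of bicycles.  Assigning to a column name that doesn't exist yet creates it; the calculation here assumes 1.7 bicycles per person: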

In [17]:
df_data['bicycles'] = 1.7*1e6*df_data['population']

In [18]:
df_data['bicycles']

0    113900000.0
1     17000000.0
2     79900000.0
Name: bicycles, dtype: float64
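
Assigning to `df_data['bicycles']` changes the data frame in place. If we would rather leave the original alone, `DataFrame.assign` returns a new data frame with the extra column. A minimal sketch on a cut-down table, using the same assumed 1.7 bicycles per person:

```python
import pandas as pd

df = pd.DataFrame({'country': ['France', 'Sweden', 'Spain'],
                   'population': [67, 10, 47]})

# assign returns a new DataFrame; df itself is left unchanged
df_with_bikes = df.assign(bicycles=lambda d: 1.7 * 1e6 * d['population'])

print(list(df.columns))
print(list(df_with_bikes.columns))
```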

### Adding New Rows

In [19]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_data = pd.concat([df_data, pd.DataFrame([{'country': 'Finland'}])], ignore_index=True)
df_data

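|   | country | capital   | population | bicycles    |
|---|---------|-----------|------------|-------------|
| 0 | France  | Paris     | 67.0       | 113900000.0 |
| 1 | Sweden  | Stockholm | 10.0       | 17000000.0  |
| 2 | Spain   | Madrid    | 47.0       | 79900000.0  |
| 3 | Finland | NaN       | NaN        | NaN         |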


### Dealing with `NaN`

The Finland row has no values for the other columns, so `pandas` marks them as missing with `NaN`.  There should probably be a whole session on this.  I might give you one later.  For now though let's explore a couple of options.  You almost certainly don't want to do either of them (at least at this stage).

#### Option 1: Remove the Rows

In [20]:
df_without_nan = df_data.dropna()
df_without_nan

|   | country | capital   | population | bicycles    |
|---|---------|-----------|------------|-------------|
| 0 | France  | Paris     | 67.0       | 113900000.0 |
| 1 | Sweden  | Stockholm | 10.0       | 17000000.0  |
| 2 | Spain   | Madrid    | 47.0       | 79900000.0  |


In [None]:
df_data
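
Before dropping anything it helps to know where the gaps are, and `dropna` can be told which columns to look at. The sketch below uses made-up data (the missing Swedish capital is hypothetical, purely to make the two calls behave differently):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['France', 'Sweden', 'Finland'],
                   'capital': ['Paris', None, None],  # hypothetical gaps
                   'population': [67.0, 10.0, np.nan]})

missing_per_column = df.isna().sum()                   # count of missing values per column
drop_any = df.dropna()                                 # drop rows with any missing value
drop_on_population = df.dropna(subset=['population'])  # only check 'population'

print(missing_per_column)
print(len(df), len(drop_any), len(drop_on_population))
```

Neither call touches `df` itself; both return new data frames.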

#### Option 2: Replace the NaNs (probably don't do this, ever)

In [21]:
df_dodgy_clean = df_data.fillna(0.0)
print(df_dodgy_clean)

   country    capital  population     bicycles
0   France      Paris        67.0  113900000.0
1   Sweden  Stockholm        10.0   17000000.0
2    Spain     Madrid        47.0   79900000.0
3  Finland          0         0.0          0.0


Anyone spot a bit of `pandas` awkwardness here by the way?
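
One thing worth noticing in the output above is that the missing capital was filled with a number too. If we do go down this road, `fillna` also accepts a dictionary of per-column fill values. This is only a sketch on a two-row example, and the `'unknown'` placeholder is our own invention:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['Spain', 'Finland'],
                   'capital': ['Madrid', np.nan],
                   'population': [47.0, np.nan]})

# the dict maps column name -> fill value; columns not listed are left alone
filled = df.fillna({'capital': 'unknown'})

print(filled)
```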

## Important aside on copy behaviour
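Most `pandas` methods return a new Data Frame rather than modifying the original (not always: some accept `inplace=True`, for example).  BE CAREFUL though: plain assignment such as `df_b = df_a` copies nothing, it just gives the same object a second name.  To be safe and get an independent copy, ask for one explicitly: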

In [22]:
df_copy = df_data.copy(deep=True)

In [23]:
df_copy

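|   | country | capital   | population | bicycles    |
|---|---------|-----------|------------|-------------|
| 0 | France  | Paris     | 67.0       | 113900000.0 |
| 1 | Sweden  | Stockholm | 10.0       | 17000000.0  |
| 2 | Spain   | Madrid    | 47.0       | 79900000.0  |
| 3 | Finland | NaN       | NaN        | NaN         |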


In [None]:
df_data

In [None]:
df_copy['country'] = df_copy['country'].apply(lambda x: 'Narnia')

In [None]:
df_copy

In [None]:
df_data
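
The difference between a second name and a real copy can be checked directly. A minimal self-contained sketch, using a one-column stand-in for `df_data` and a hypothetical `df_alias` name for the plain assignment:

```python
import pandas as pd

df_data = pd.DataFrame({'country': ['France', 'Sweden', 'Spain']})

df_alias = df_data        # a second name for the same object
df_copy = df_data.copy()  # an independent copy (deep=True is the default)

df_copy['country'] = 'Narnia'            # only the copy changes
after_copy_edit = df_data['country'].tolist()

df_alias.loc[0, 'country'] = 'Narnia'    # df_data changes too
after_alias_edit = df_data['country'].tolist()

print(df_alias is df_data, df_copy is df_data)
print(after_copy_edit)
print(after_alias_edit)
```

Editing through `df_alias` shows up in `df_data` because both names refer to one data frame, whereas edits to `df_copy` stay in the copy.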