(c) 2016 - present. Enplus Advisors, Inc.

In [1]:
import numpy as np
import pandas as pd

# Programming with Data 
## Part II: Data Frames

# DataFrame

The workhorse of `pandas`

Similar to a table in SQL or a list of named records

Named after `data.frame` in the `R` language, which inspired it.

# DataFrame

Several ways to [create](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) a DataFrame:

* From a dict of lists

In [2]:
# From a dict of lists
_df = {
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'IBM', 'YHOO'],
    'date': ['2015-12-30', '2015-12-31', '2015-12-30', '2015-12-30', '2015-12-30'],
    'open': [426.23, 427.81, 42.3, 101.65, 35.53]
}
df = pd.DataFrame(_df)
df

,ticker,date,open
0,AAPL,2015-12-30,426.23
1,AAPL,2015-12-31,427.81
2,MSFT,2015-12-30,42.3
3,IBM,2015-12-30,101.65
4,YHOO,2015-12-30,35.53


# DataFrame


* From a `list` of `dict`s

In [3]:
_df = [
    {'ticker': 'AAPL', 'date': '2015-12-30', 'open': 426.23},
    {'ticker': 'AAPL', 'date': '2015-12-31', 'open': 427.81},
    {'ticker': 'MSFT', 'date': '2015-12-30', 'open': 42.3}]
pd.DataFrame(_df)


,date,open,ticker
0,2015-12-30,426.23,AAPL
1,2015-12-31,427.81,AAPL
2,2015-12-30,42.3,MSFT


# DataFrame


* From another DataFrame

In [4]:
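# `_df` is a list of records, so wrap it in a DataFrame first,
# then build a new DataFrame from that one
pd.DataFrame(pd.DataFrame(_df), columns=['ticker', 'date', 'open', 'close'])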

,ticker,date,open,close
0,AAPL,2015-12-30,426.23,
1,AAPL,2015-12-31,427.81,
2,MSFT,2015-12-30,42.3,


Notice that the `close` column is all `NaN`, because the input has no `close` values.
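
As a small sketch of how the `columns=` argument behaves (the `records` list below is illustrative and not part of the data above): listed names are kept in the given order, unlisted keys are dropped, and names with no matching data become all-`NaN` columns.

```python
import pandas as pd

# Illustrative records, not the notebook's data
records = [
    {'ticker': 'AAPL', 'date': '2015-12-30', 'open': 426.23},
    {'ticker': 'MSFT', 'date': '2015-12-30', 'open': 42.3},
]

# 'date' is not listed, so it is dropped; 'close' has no data, so it is NaN
subset = pd.DataFrame(records, columns=['open', 'ticker', 'close'])

print(list(subset.columns))          # ['open', 'ticker', 'close']
print(subset['close'].isna().all())  # True
```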

## Single Axis Selection

Uses `dict`-style lookup, e.g. `df['col1']`, or attribute lookup, e.g. `df.col1`.

Filters either rows or columns, but not both
(unless you pass a DataFrame to select individual cells).

Attribute and scalar indexing of columns returns a `Series`, NOT a `DataFrame`

## Single Axis Selection

Attribute access is based on column names. It only works for
valid attribute names (no spaces, no keywords) that do not clash
with an existing `DataFrame` attribute or method.

In [5]:
df.ticker

0    AAPL
1    AAPL
2    MSFT
3     IBM
4    YHOO
Name: ticker, dtype: object
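
One more limit of attribute access is worth a quick sketch: a column whose name matches an existing `DataFrame` method is shadowed by that method. The `prices` frame and its `count` column are hypothetical, chosen only to show the clash.

```python
import pandas as pd

# Hypothetical frame with a column named like the DataFrame.count method
prices = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'count': [10, 20]})

# Attribute lookup finds the method, not the column
print(callable(prices.count))    # True

# Dictionary-style lookup returns the column
print(prices['count'].tolist())  # [10, 20]
```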

## Single Axis Selection

Dictionary-style (`__getitem__`) lookup is more flexible and supports
names that aren't valid attributes.

In [6]:
df['ticker name'] = df['ticker']
df['ticker name']

0    AAPL
1    AAPL
2    MSFT
3     IBM
4    YHOO
Name: ticker name, dtype: object

In [7]:
del df['ticker name']

## Single Axis Selection

If you want to retrieve a DataFrame or multiple columns, you
must pass a `list` (or another list-like such as an array or `Index`).

In [8]:
df[['ticker']]

,ticker
0,AAPL
1,AAPL
2,MSFT
3,IBM
4,YHOO
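
To make the `Series` versus `DataFrame` distinction concrete, here is a minimal sketch on a small illustrative frame (`df_demo` is not part of the data above):

```python
import pandas as pd

df_demo = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [426.23, 42.3]})

as_series = df_demo['open']    # scalar key -> Series
as_frame = df_demo[['open']]   # list key -> DataFrame with one column

print(type(as_series).__name__)         # Series
print(type(as_frame).__name__)          # DataFrame
print(as_series.shape, as_frame.shape)  # (2,) (2, 1)
```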


## Single Axis Selection - Rows

Works similarly to `Series` logical indexing.

In [9]:
idx = [True, True, False, True, True]
df[idx]

,ticker,date,open
0,AAPL,2015-12-30,426.23
1,AAPL,2015-12-31,427.81
3,IBM,2015-12-30,101.65
4,YHOO,2015-12-30,35.53


## Single Axis Selection - Summary

When the index is:

* `list[bool]`: operates on rows
* `str` or `list[str]`: operates on columns
* `slice`: operates on rows
* `DataFrame`: operates cell-by-cell
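
A quick sketch of each case, using small illustrative frames (`df_demo` and `nums` are not part of the data above). Note that a plain slice such as `df_demo[0:2]` selects rows.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT', 'IBM'],
    'open': [426.23, 42.3, 101.65],
})

rows_by_mask = df_demo[[True, False, True]]  # list[bool] -> rows 0 and 2
one_column = df_demo['open']                 # str -> one column, as a Series
two_columns = df_demo[['ticker', 'open']]    # list[str] -> columns, as a DataFrame
rows_by_slice = df_demo[0:2]                 # slice -> rows 0 and 1

# A boolean DataFrame selects cell-by-cell; unselected cells become NaN
nums = pd.DataFrame({'a': [1, 5], 'b': [7, 2]})
cells = nums[nums > 4]

print(rows_by_mask.index.tolist())   # [0, 2]
print(rows_by_slice.index.tolist())  # [0, 1]
print(cells)
```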

## Multi-Axis Selection

The most common situation is logical indexing on the rows and
label indexing on the columns using `loc`.

In [10]:
idx = [True, True, False, True, True]
df.loc[idx, ['date', 'open']]

,date,open
0,2015-12-30,426.23
1,2015-12-31,427.81
3,2015-12-30,101.65
4,2015-12-30,35.53


## Multi-Axis Selection with Row Labels

`loc` can select by _label_ on both rows and columns.

We haven't set an index on `df` so it has the default integer index.
Let's set one now.

In [11]:
df

,ticker,date,open
0,AAPL,2015-12-30,426.23
1,AAPL,2015-12-31,427.81
2,MSFT,2015-12-30,42.3
3,IBM,2015-12-30,101.65
4,YHOO,2015-12-30,35.53


In [12]:
# Note that `set_index` returns a new DataFrame; `df` itself is unchanged
df1 = df.set_index('ticker')

## Multi-Axis Selection with Row Labels

In [14]:
df1

ticker,date,open
AAPL,2015-12-30,426.23
AAPL,2015-12-31,427.81
MSFT,2015-12-30,42.3
IBM,2015-12-30,101.65
YHOO,2015-12-30,35.53


The tickers are now the index and no longer part of the values of the `DataFrame`.

Consequently, we can use them for index lookups.
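
A minimal sketch of what moving a column into the index does, using a small illustrative frame (`quotes` is not part of the data above). It also shows `reset_index`, which reverses the operation.

```python
import pandas as pd

quotes = pd.DataFrame({'ticker': ['AAPL', 'MSFT'], 'open': [426.23, 42.3]})
indexed = quotes.set_index('ticker')

print(list(quotes.columns))         # ['ticker', 'open']: the original is untouched
print(list(indexed.columns))        # ['open']: 'ticker' moved into the index
print(indexed.loc['MSFT', 'open'])  # 42.3

# reset_index moves the index back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))       # ['ticker', 'open']
```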

## Multi-Axis Selection

In [15]:
# Select by row label
df1.loc['MSFT']

date    2015-12-30
open          42.3
Name: MSFT, dtype: object

`loc` defaults to all columns, but I prefer explicit selection:
it makes it easier to see what your code is doing.

In [16]:
# Same, but explicitly require all columns
df1.loc['MSFT', :]

date    2015-12-30
open          42.3
Name: MSFT, dtype: object

## Multi-Axis Gotcha

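With `loc`, a row selection may come back as either a `Series` or a `DataFrame`: a label that matches one row returns a `Series`, while a label that matches several rows returns a `DataFrame`.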

In [17]:
df1.loc['AAPL', :]

ticker,date,open
AAPL,2015-12-30,426.23
AAPL,2015-12-31,427.81


Be careful about whether a `Series` or a `DataFrame` is returned.
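
One way to avoid the ambiguity, sketched on a small illustrative frame (`quotes` is not part of the data above), is to pass a list of labels, which returns a `DataFrame` regardless of how many rows match.

```python
import pandas as pd

quotes = pd.DataFrame(
    {'open': [426.23, 427.81, 42.3]},
    index=pd.Index(['AAPL', 'AAPL', 'MSFT'], name='ticker'),
)

unique_label = quotes.loc['MSFT', :]    # one matching row -> Series
repeated_label = quotes.loc['AAPL', :]  # two matching rows -> DataFrame
always_frame = quotes.loc[['MSFT'], :]  # list of labels -> DataFrame

print(type(unique_label).__name__)    # Series
print(type(repeated_label).__name__)  # DataFrame
print(type(always_frame).__name__)    # DataFrame
```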

## Selecting and Assigning with DataFrames

In [18]:
df1[df1.date != '2015-12-31']

ticker,date,open
AAPL,2015-12-30,426.23
MSFT,2015-12-30,42.3
IBM,2015-12-30,101.65
YHOO,2015-12-30,35.53


## Selecting and Assigning with DataFrames

In [19]:
idx = df1.date != '2015-12-31'
df1[idx]

ticker,date,open
AAPL,2015-12-30,426.23
MSFT,2015-12-30,42.3
IBM,2015-12-30,101.65
YHOO,2015-12-30,35.53


## Selecting and Assigning with DataFrames

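Indexing a DataFrame (or Series) may return either a view or a copy of the original, and it is not always obvious which. Boolean indexing, as used here, returns a copy, which is why the assignment below triggers a warning.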

In [20]:
df1['close'] = df1['open']
df1_view = df1[df1.date != '2015-12-31']

# We'll come back to assignment in a second
df1_view['close'] = -5


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


## Selecting and Assigning with DataFrames

In [21]:
df1_view

ticker,date,open,close
AAPL,2015-12-30,426.23,-5
MSFT,2015-12-30,42.3,-5
IBM,2015-12-30,101.65,-5
YHOO,2015-12-30,35.53,-5


## Selecting and Assigning with DataFrames

The original `DataFrame` is unchanged.

In [22]:
df1

ticker,date,open,close
AAPL,2015-12-30,426.23,426.23
AAPL,2015-12-31,427.81,427.81
MSFT,2015-12-30,42.3,42.3
IBM,2015-12-30,101.65,101.65
YHOO,2015-12-30,35.53,35.53


## Selecting and Assigning with DataFrames

If you want to work with the subset, make a copy:

In [23]:
# What you want is
df2 = df1[idx].copy()
df2

ticker,date,open,close
AAPL,2015-12-30,426.23,426.23
MSFT,2015-12-30,42.3,42.3
IBM,2015-12-30,101.65,101.65
YHOO,2015-12-30,35.53,35.53


In [24]:
df2['close'] = 1
df2

ticker,date,open,close
AAPL,2015-12-30,426.23,1
MSFT,2015-12-30,42.3,1
IBM,2015-12-30,101.65,1
YHOO,2015-12-30,35.53,1


## Selecting and Assigning with DataFrames

If you want to assign to the original `DataFrame`, use an indexing attribute
like `loc` or `iloc`.

In [25]:
df1.loc[idx, 'close'] = np.nan
df1

ticker,date,open,close
AAPL,2015-12-30,426.23,
AAPL,2015-12-31,427.81,427.81
MSFT,2015-12-30,42.3,
IBM,2015-12-30,101.65,
YHOO,2015-12-30,35.53,
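
The same pattern as a self-contained sketch (the `quotes` frame is illustrative, not the data above): build the mask from the original frame and assign through `loc` in a single indexing step, so the original itself is modified.

```python
import numpy as np
import pandas as pd

quotes = pd.DataFrame(
    {'date': ['2015-12-30', '2015-12-31', '2015-12-30'],
     'open': [426.23, 427.81, 42.3]},
    index=pd.Index(['AAPL', 'AAPL', 'MSFT'], name='ticker'),
)
quotes['close'] = quotes['open']

mask = quotes['date'] != '2015-12-31'

# One indexing operation on the original: rows by mask, column by label
quotes.loc[mask, 'close'] = np.nan

print(quotes['close'].isna().tolist())  # [True, False, True]
```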


## Assignment Gotchas

In [26]:
# May create NAs with `np.nan`, None, or float('nan')
df2['close'] = np.nan

In [27]:
# Gotcha: assigning None yields an object column rather than a float NaN column
df2['close'] = None
df2.close.dtype

dtype('O')
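
A short sketch contrasting the two assignments on an illustrative frame (`frame` and its column names are made up for this example). Both columns count as missing for `isna`, but only the `np.nan` column is numeric.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'open': [426.23, 42.3]})

frame['with_nan'] = np.nan  # float column filled with NaN
frame['with_none'] = None   # column filled with None (object dtype in the output above)

print(frame['with_nan'].dtype)         # float64
print(frame['with_nan'].isna().all())  # True
print(frame['with_none'].isna().all()) # True
```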

## Assignment Gotchas

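Assigning a `Series` to a `DataFrame` column aligns on the index, like an implicit left join: labels missing from the `Series` become `NaN`, and labels that exist only in the `Series` (`SP5` here) are dropped.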

In [28]:
closes = pd.Series({'AAPL': 430.0, 'MSFT': 43.5, 'SP5': 1263.5})
df2['close'] = closes
df2

ticker,date,open,close
AAPL,2015-12-30,426.23,430.0
MSFT,2015-12-30,42.3,43.5
IBM,2015-12-30,101.65,
YHOO,2015-12-30,35.53,
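
One way to picture the alignment, sketched on an illustrative frame (`frame` is not the data above): the assignment gives the same result as reindexing the `Series` to the frame's index first.

```python
import pandas as pd

frame = pd.DataFrame(
    {'open': [426.23, 42.3, 101.65]},
    index=pd.Index(['AAPL', 'MSFT', 'IBM'], name='ticker'),
)
closes = pd.Series({'AAPL': 430.0, 'MSFT': 43.5, 'SP5': 1263.5})

# Assignment aligns the Series on the frame's index
frame['close'] = closes

# Reindexing by hand produces the same values
expected = closes.reindex(frame.index)

print(frame['close'].tolist())         # [430.0, 43.5, nan]
print(frame['close'].equals(expected)) # True
```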


## Assignment Gotchas

Use a sequence (`list`, `tuple`) or a `numpy` array if you don't
want automatic alignment. The values are then assigned by position,
so their length must match the number of rows.

In [29]:
x = pd.Series([1, 2, 3, 4], index=list('abcd'))
df2['close'] = x
df2

ticker,date,open,close
AAPL,2015-12-30,426.23,
MSFT,2015-12-30,42.3,
IBM,2015-12-30,101.65,
YHOO,2015-12-30,35.53,


In [30]:
df2['close'] = x.values
df2

ticker,date,open,close
AAPL,2015-12-30,426.23,1
MSFT,2015-12-30,42.3,2
IBM,2015-12-30,101.65,3
YHOO,2015-12-30,35.53,4
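
A sketch of the trade-off on an illustrative frame (`frame` and the column names are made up for this example): positional assignment ignores labels entirely, so a length mismatch raises an error instead of producing `NaN`. `to_numpy()` is used here in place of `.values`; both return the underlying array.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'open': [426.23, 42.3, 101.65]}, index=['AAPL', 'MSFT', 'IBM'])
x = pd.Series([1, 2, 3], index=list('abc'))

frame['aligned'] = x                # no shared labels -> all NaN
frame['positional'] = x.to_numpy()  # index ignored, values assigned in order

length_error = None
try:
    frame['too_short'] = np.array([1, 2])  # 2 values for 3 rows
except ValueError as err:
    length_error = err

print(frame['aligned'].isna().all())  # True
print(frame['positional'].tolist())   # [1, 2, 3]
print(type(length_error).__name__)    # ValueError
```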


## Sorting DataFrames

In [31]:
df2.sort_index()

ticker,date,open,close
AAPL,2015-12-30,426.23,1
IBM,2015-12-30,101.65,3
MSFT,2015-12-30,42.3,2
YHOO,2015-12-30,35.53,4


In [32]:
df2.sort_values('open')

ticker,date,open,close
YHOO,2015-12-30,35.53,4
MSFT,2015-12-30,42.3,2
IBM,2015-12-30,101.65,3
AAPL,2015-12-30,426.23,1
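
Two common variations, sketched on an illustrative frame (`frame` is not the data above): sorting in descending order and sorting by several columns. Both methods return a new `DataFrame` and leave the original order untouched.

```python
import pandas as pd

frame = pd.DataFrame(
    {'date': ['2015-12-30', '2015-12-31', '2015-12-30'],
     'open': [426.23, 427.81, 42.3]},
    index=pd.Index(['AAPL', 'AAPL', 'MSFT'], name='ticker'),
)

by_open_desc = frame.sort_values('open', ascending=False)
by_date_then_open = frame.sort_values(['date', 'open'])

print(by_open_desc['open'].tolist())       # [427.81, 426.23, 42.3]
print(by_date_then_open['open'].tolist())  # [42.3, 426.23, 427.81]
print(frame['open'].tolist())              # [426.23, 427.81, 42.3]
```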
