# Overview

Pandas supports two important data structures:

* `Series`, which is one-dimensional
* `DataFrame`, which is two-dimensional

Both support an index object, which carries the axis labels and other metadata.
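
As a minimal sketch of the two structures and their index objects (the variable names and values below are illustrative and not part of the original notes):

```python
import pandas as pd

# A Series is one-dimensional: values plus an index of labels.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# A DataFrame is two-dimensional: it has a row index and a column index.
frame = pd.DataFrame({'p': [1, 2, 3], 'q': [4, 5, 6]}, index=['x', 'y', 'z'])

print(s.ndim, frame.ndim)    # number of dimensions of each structure
print(list(s.index))         # row labels of the Series
print(list(frame.index))     # row labels of the DataFrame
print(list(frame.columns))   # column labels of the DataFrame
```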


# Create DataFrame

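A DataFrame can be created in several ways:

* Dictionary style from `Series` objects. Each `Series` becomes a column, and the keys of the outer dictionary become the column names. The keys used to build each `Series` (its index labels) become the row names.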

```
d1 = {'a': 1, 'b': 2}
col_1 = pd.Series(d1)
d2 = {'a': 3, 'b': 4}
col_2 = pd.Series(d2)
df = pd.DataFrame({'area': col_1, 'state': col_2})
```

* From a 2D NumPy array. Note how the index and column names can be set explicitly.

```
df = pd.DataFrame(np.random.rand(3,2), 
    columns=['foo', 'bar'],
    index=['a', 'b', 'c'])
```

* From a list of dictionaries. Each dictionary becomes a row, and its keys become the column names.

```
data=[{'a': i, 'b': 2*i} for i in range(3)]
pd.DataFrame(data)
```



In [3]:
import pandas as pd
import numpy as np

d1 = {'a': 1, 'b': 2}
col_1 = pd.Series(d1)
d2 = {'a': 3, 'b': 4}
col_2 = pd.Series(d2)
df = pd.DataFrame({'area': col_1, 'state': col_2})
df


   area  state
a     1      3
b     2      4


# Indexing

The key points about DataFrame indexing are:

* You can select a column by name, either dictionary style (`df['area']`) or attribute style (`df.area`).
* To select a row, you must use either `loc` for explicit (label-based) indexing or `iloc` for implicit (position-based) indexing, which behaves like NumPy indexing.
* You can get a NumPy array of the DataFrame's data through `df.values`, and index or slice it like any other array.
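
The access styles listed above can be sketched as follows. This rebuilds the small frame from the "Create DataFrame" section. The variable names are illustrative, and `df.to_numpy()` is used as the newer spelling of `df.values`:

```python
import pandas as pd

# Rebuild the small frame from the "Create DataFrame" section.
df = pd.DataFrame({'area': pd.Series({'a': 1, 'b': 2}),
                   'state': pd.Series({'a': 3, 'b': 4})})

col_by_key = df['area']      # dictionary-style column access
col_by_attr = df.area        # attribute-style column access
row_by_label = df.loc['a']   # explicit (label-based) row access
row_by_pos = df.iloc[0]      # implicit (position-based) row access
arr = df.to_numpy()          # NumPy array of the data

print(row_by_label['state'])  # value in row 'a', column 'state'
print(arr[1, 0])              # row position 1, column position 0
```

Attribute-style access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, so the dictionary style is the safer default.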






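When the index itself consists of integers, a bare integer could mean either a label or a position. For more precise handling, you can use:

* `iloc` for integer (position-based) indexing
* `loc` for label-based indexing

so that there is no ambiguity.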


In [8]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [7]:
ser.iloc[-1]

2.0
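
To make the difference concrete, here is a small sketch with a hypothetical Series (`ser2`, not part of the original notes) whose integer labels deliberately differ from the positions:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions.
ser2 = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

by_label = ser2.loc[0]      # the element whose label is 0
by_position = ser2.iloc[0]  # the element at position 0
print(by_label, by_position)
```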

# Function Application and Mapping

You can apply a function across a DataFrame's columns or rows.

By default, the function is applied down the rows, once per column.
To apply it across the columns, once per row, pass `axis='columns'`.



In [10]:
frame = pd.DataFrame(np.random.randn(4,3),columns=list('bde'), index=['utah', 'ohio', 'oregon', 'new york'])
frame

                 b         d         e
utah     -0.329207 -0.565209 -0.606694
ohio      1.372541 -2.288578  1.042503
oregon    1.628100  1.805881  0.718378
new york -1.911114  0.335547  0.584529


In [11]:
f = lambda x: x.max() - x.min()
frame.apply(f)  # apply across rows, the result is a Series, with index of columns

b    3.539214
d    4.094458
e    1.649197
dtype: float64

In [12]:
frame.apply(f, axis='columns') # apply across columns

utah        0.277488
ohio        3.661118
oregon      1.087503
new york    2.495644
dtype: float64

# Group Aggregation

The general pattern is "split-apply-combine".
See Wes@2017, page 288, for an illustrative figure.
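
As a rough sketch of what the three steps mean, the following spells them out by hand on a small made-up frame (`toy` and its values are illustrative only) and compares the result with the built-in aggregation:

```python
import pandas as pd

# Small illustrative frame (values chosen by hand, not from the notes).
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'val': [1.0, 2.0, 3.0, 6.0]})

# Split: iterating over a GroupBy yields (group key, sub-frame) pairs.
pieces = {key: sub for key, sub in toy.groupby('key')}

# Apply: compute one result per piece.
applied = {key: sub['val'].mean() for key, sub in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(applied, name='val')

# The built-in aggregation performs all three steps in one call.
builtin = toy.groupby('key')['val'].mean()
print(manual.to_dict())
print(builtin.to_dict())
```

The manual version is only meant to show the idea. In practice the one-line `groupby(...).mean()` form is the one to use.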






In [10]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
        'key2' : ['one', 'two', 'one', 'two', 'one'],
        'data1' : np.random.randn(5),
        'data2' : np.random.randn(5)})

df


  key1 key2     data1     data2
0    a  one -0.952102  0.612399
1    a  two  0.110007 -0.697778
2    b  one  0.042394  0.591722
3    b  two  0.374840  0.159767
4    a  one  0.346527 -0.594477


## apply to a single column

In [11]:
grouped = df['data1'].groupby(df['key1'])
grouped
grouped.mean()

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x120eabd68>

key1
a   -0.165189
b    0.208617
Name: data1, dtype: float64

## select a column based on grouping

In [18]:
grouped = df.groupby('key1')['data1']  # equivalent to df['data1'].groupby(df['key1'])
grouped
grouped.mean()

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x120f0aa20>

key1
a   -0.165189
b    0.208617
Name: data1, dtype: float64

## apply for all columns

In [13]:
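# key2 is not numeric, so restrict the mean to numeric columns (required in recent pandas versions)
df.groupby('key1').mean(numeric_only=True)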

         data1     data2
key1                    
a    -0.165189 -0.226619
b     0.208617  0.375745


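## group by two keys

Note: the outputs in this subsection come from a different run with different random data, so the values do not match the frame shown above.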

In [8]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.347751
      two    -0.623045
b     one     1.161160
      two    -0.973354
Name: data1, dtype: float64

In [9]:
means.unstack()

key2       one       two
key1                    
a     0.347751 -0.623045
b     1.161160 -0.973354


## calculate group size

In [16]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64

# Convert Time

In the following example, `stime` is a timestamp with microsecond (`us`) resolution,
and the example shows how pandas converts it back and forth.

Note that `ts.value` has nanosecond (`ns`) resolution, so there are three extra zeros at the end.

In [2]:
# handle time field
import pandas as pd
stime = 1426019558841509
ts = pd.to_datetime(stime, unit='us')
ts
ts.value


Timestamp('2015-03-10 20:32:38.841509')

1426019558841509000
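
The cell above only shows the forward direction. A minimal sketch of the round trip, assuming integer division by 1000 is an acceptable way to go from nanoseconds back to microseconds (the names `ns` and `back_to_us` are illustrative):

```python
import pandas as pd

stime = 1426019558841509               # microseconds since the Unix epoch
ts = pd.to_datetime(stime, unit='us')  # integer -> Timestamp

ns = ts.value                          # nanoseconds since the epoch
back_to_us = ns // 1000                # drop the three extra digits to recover microseconds

print(ts)
print(back_to_us == stime)
```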

In [4]:
L = ['ab', 'cd']
{ key:"NA" for key in L}

{'ab': 'NA', 'cd': 'NA'}

In [5]:
x = {}
x[1.0] = 3

In [6]:
x

{1.0: 3}

In [7]:
x.keys()

dict_keys([1.0])

In [12]:
x,y,z = (None*3)

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [13]:
None*3

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [14]:
x,y,z = None, None, None

In [15]:
x