In [1]:
import numpy as np
import pandas as pd

## Series

The basic way to create a Series:
s = pd.Series(data, index=index)

Here, data can be many different things:
- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels.

#### from ndarray

In [48]:
s = pd.Series(np.random.randn(5), index=list('abcde'))

In [3]:
s

a   -1.456247
b    0.784603
c    0.839344
d   -0.257462
e    1.068553
dtype: float64

In [4]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:
s.values

array([-1.45624725,  0.78460293,  0.83934363, -0.25746178,  1.06855334])

In [6]:
s = pd.Series(np.random.randn(5))

In [7]:
s

0   -0.139145
1   -0.535637
2    2.040784
3    0.455581
4   -1.245959
dtype: float64

#### from dict

In [8]:
s = pd.Series({'d': [0,1], 'b':1, 'c':2})

In [9]:
s

b         1
c         2
d    [0, 1]
dtype: object

In [10]:
s.values

array([1, 2, list([0, 1])], dtype=object)

In [11]:
type(s.values)

numpy.ndarray

In [12]:
s = pd.Series({'d':1,'b':2,'c':3}, index=list('bcd'))

In [13]:
s

b    2
c    3
d    1
dtype: int64

#### from a scalar

In [14]:
s = pd.Series(2, index=list('abcd'))

In [15]:
s

a    2
b    2
c    2
d    2
dtype: int64

### Series is ndarray-like

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

In [16]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -2.268410
b    0.445979
c    0.903601
d   -1.586158
e    0.978161
dtype: float64

In [17]:
s[0]

-2.2684096410417354

In [18]:
s[0:3]

a   -2.268410
b    0.445979
c    0.903601
dtype: float64

In [19]:
s.median()

0.4459791236269375

In [20]:
np.median(s)

0.4459791236269375

In [21]:
flag = s > s.median()
flag

a    False
b    False
c     True
d    False
e     True
dtype: bool

In [22]:
s[flag]

c    0.903601
e    0.978161
dtype: float64

In [23]:
s[[4,3,1]]

e    0.978161
d   -1.586158
b    0.445979
dtype: float64

In [24]:
np.exp(s)

a    0.103477
b    1.562019
c    2.468475
d    0.204711
e    2.659561
dtype: float64
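
The positional lookups in this section (`s[0]`, `s[0:3]`, `s[[4,3,1]]`) rely on integer keys being treated as positions even though the labels are strings. Recent pandas releases warn about that fallback for scalar and list keys, so the sketch below spells the same lookups with `.iloc`. The name `s_demo` and its values are hypothetical, chosen only to make the results reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with fixed values (the notebook's s is random).
s_demo = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=list("abcde"))

first = s_demo.iloc[0]            # first element by position
head = s_demo.iloc[0:3]           # positional slice; labels a, b, c come along
picked = s_demo.iloc[[4, 3, 1]]   # positional list keeps the requested order
total = np.sum(s_demo)            # NumPy reductions accept a Series directly
```

Using `.iloc` for positions and `.loc` (or plain `s['a']`) for labels keeps the two kinds of lookup unambiguous.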

### Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [25]:
s

a   -2.268410
b    0.445979
c    0.903601
d   -1.586158
e    0.978161
dtype: float64

In [26]:
s['a']

-2.2684096410417354

In [27]:
s['e'] = 0

In [28]:
s

a   -2.268410
b    0.445979
c    0.903601
d   -1.586158
e    0.000000
dtype: float64

In [29]:
'e' in s

True

In [30]:
s.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [31]:
type(s.keys())

pandas.core.indexes.base.Index
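
The dict analogy also covers missing labels: indexing with an absent label raises `KeyError`, while `Series.get` returns a default instead. The following is a minimal sketch; `s_demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
s_demo = pd.Series([1.0, 2.0, 3.0], index=list("abc"))

try:
    s_demo["z"]                       # absent label raises KeyError, like a dict
except KeyError:
    missing_raises = True
else:
    missing_raises = False

fallback = s_demo.get("z", np.nan)    # .get returns the default instead of raising
present = s_demo.get("a")             # existing labels behave like s_demo["a"]
```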

### Vectorized operations and label alignment with Series

In [32]:
s

a   -2.268410
b    0.445979
c    0.903601
d   -1.586158
e    0.000000
dtype: float64

In [33]:
s + s

a   -4.536819
b    0.891958
c    1.807201
d   -3.172315
e    0.000000
dtype: float64

In [34]:
s * 2

a   -4.536819
b    0.891958
c    1.807201
d   -3.172315
e    0.000000
dtype: float64

In [35]:
np.exp(s)

a    0.103477
b    1.562019
c    2.468475
d    0.204711
e    1.000000
dtype: float64

In [36]:
s[::-1] + s

a   -4.536819
b    0.891958
c    1.807201
d   -3.172315
e    0.000000
dtype: float64

A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.

In [55]:
# s[1:] + s[:-1]
s[:-1] + s[1:]

a         NaN
b    0.891958
c    1.807201
d   -3.172315
e         NaN
dtype: float64

The result of an operation between unaligned Series has the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN).
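
As a small sketch of the alignment rule, the example below adds two Series whose labels only partly overlap, then shows two common ways of handling the resulting NaN values. The names `left` and `right` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical Series that share only the labels b and c.
left = pd.Series([1.0, 2.0, 3.0], index=list("abc"))
right = pd.Series([10.0, 20.0, 30.0], index=list("bcd"))

aligned = left + right                   # index is the union a, b, c, d; a and d are NaN
filled = left.add(right, fill_value=0)   # treat a label missing on one side as 0
overlap = aligned.dropna()               # keep only labels present in both inputs
```

Whether to fill or drop depends on what a missing label means in the data; `fill_value=0` is only appropriate when absence really means zero.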

### Name attribute

In [37]:
s = pd.Series(np.random.randn(5), name='foo')

In [38]:
s

0   -0.230581
1   -0.718922
2    1.274442
3   -2.222643
4    1.222837
Name: foo, dtype: float64

In [39]:
s.name

'foo'

In [40]:
s2 = s.rename('bar')

In [41]:
s2.name

'bar'

## DataFrame

DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

### From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

In [57]:
d = {'one': pd.Series([1,2,3], index=list('abc')),
     'two': pd.Series([1,2,3,4], index=list('abcd'))}

In [65]:
df = pd.DataFrame(d, columns=['one','two', 'three'])
df

Unnamed: 0,one,two,three
a,1.0,1,
b,2.0,2,
c,3.0,3,
d,,4,


In [73]:
df.index = list('ABCD')
df

Unnamed: 0,one,two,three
A,1.0,1,
B,2.0,2,
C,3.0,3,
D,,4,


In [67]:
df.columns

Index(['one', 'two', 'three'], dtype='object')
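
The nested-dict case mentioned at the top of this section can be sketched in the same way: the outer keys become columns, the inner keys become row labels, and labels missing from one inner dict are filled with NaN. The name `nested` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical nested dict: outer keys -> columns, inner keys -> row labels.
nested = {"one": {"a": 1, "b": 2}, "two": {"b": 20, "c": 30}}
df_demo = pd.DataFrame(nested)

# Row labels are the union of the inner keys; "one" has no entry for "c".
missing = pd.isna(df_demo.loc["c", "one"])
```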

### From dict of ndarrays / lists

In [68]:
d = {'one': [1, 2, 3, 4],
     'two': [11, 22, 33, 44]}

In [70]:
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
0,1,11
1,2,22
2,3,33
3,4,44


In [76]:
df = pd.DataFrame(d, index=list('abdc'), columns=['two', 'one', 'three'])
df

Unnamed: 0,two,one,three
a,11,1,
b,22,2,
d,33,3,
c,44,4,


### From a list of dicts

In [77]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [79]:
df = pd.DataFrame(data2)
df

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


### From a dict of tuples

### From a Series

### Alternate Constructors

### Column selection, addition, deletion

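You can treat a DataFrame semantically like a dict of Series objects. Getting, setting, and deleting columns uses the same syntax as the corresponding dict operations: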

In [80]:
df

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [81]:
df['a']

0    1
1    5
Name: a, dtype: int64

In [82]:
df['c'] = df['a'] + df['b']
df

Unnamed: 0,a,b,c
0,1,2,3
1,5,10,15


In [83]:
df['d'] = df['a'] * df['b']
df

Unnamed: 0,a,b,c,d
0,1,2,3,2
1,5,10,15,50


In [84]:
del df['d']
df

Unnamed: 0,a,b,c
0,1,2,3
1,5,10,15


In [85]:
c = df.pop('c')
c

0     3
1    15
Name: c, dtype: int64

In [86]:
df

Unnamed: 0,a,b
0,1,2
1,5,10


When a scalar value is inserted, it is broadcast to fill the column:

In [87]:
df['c'] = 'zsh'
df

Unnamed: 0,a,b,c
0,1,2,zsh
1,5,10,zsh


When the inserted Series does not have the same index as the DataFrame, it is conformed to the DataFrame's index:

In [90]:
# df['d'] = pd.Series([111,222], index=['a', 'b'])
df['d'] = pd.Series([111,222])
df

Unnamed: 0,a,b,c,d
0,1,2,zsh,111
1,5,10,zsh,222
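
The cell above uses a Series whose default index already matches the DataFrame. As a sketch of the mismatched case, the example below assigns a Series whose labels only partly overlap; `df_demo` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the default index 0, 1.
df_demo = pd.DataFrame({"a": [1, 5], "b": [2, 10]})

# Only label 1 exists in both; labels 2 and 3 are dropped, row 0 gets NaN.
df_demo["d"] = pd.Series([111, 222, 333], index=[1, 2, 3])
```

The assignment never changes the number of rows: values are matched by label, not by position.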


You can insert raw ndarrays, but their length must match the length of the DataFrame's index.
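
A minimal sketch of the length requirement, using a hypothetical two-row frame `df_demo`: an array of matching length is accepted, and one of the wrong length raises `ValueError`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame.
df_demo = pd.DataFrame({"a": [1, 5], "b": [2, 10]})

df_demo["e"] = np.array([0.5, 1.5])       # length 2 matches the two rows

try:
    df_demo["f"] = np.array([1, 2, 3])    # length 3 does not match the index
except ValueError:
    length_mismatch = True
else:
    length_mismatch = False
```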

By default, columns are inserted at the end. The `insert` function can be used to insert a column at a specific position:
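
A short sketch of `DataFrame.insert`, which takes the target position, the new column name, and the values. The frame `df_demo` and the column name `a_copy` are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration.
df_demo = pd.DataFrame({"a": [1, 5], "b": [2, 10]})

# Place a copy of column "a" at position 1; insert works in place and returns None.
df_demo.insert(1, "a_copy", df_demo["a"])
```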

### Assigning New Columns in Method Chains

In [92]:
# iris = pd.read_csv('data/iris.data')
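
Since `data/iris.data` is not available here, the following is only a sketch of the method-chain pattern on a small hypothetical frame. The column names and values are assumptions, not taken from the iris file. `DataFrame.assign` returns a new frame, so it can follow a filter step in one chain without modifying the original.

```python
import pandas as pd

# Hypothetical stand-in for the iris data; column names are assumptions.
flowers = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 6.3], "sepal_width": [3.5, 3.0, 3.3]}
)

result = (
    flowers
    .query("sepal_length > 5")     # filter rows first
    .assign(                       # the lambda receives the filtered frame
        sepal_ratio=lambda x: x["sepal_width"] / x["sepal_length"]
    )
)
```

Passing a callable to `assign` matters in a chain: it is evaluated against the intermediate, filtered frame rather than against `flowers`.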


### Indexing / Selection

Selecting a row returns a Series whose index is the columns of the DataFrame.

In [93]:
df

Unnamed: 0,a,b,c,d
0,1,2,zsh,111
1,5,10,zsh,222


In [94]:
df['a']

0    1
1    5
Name: a, dtype: int64

In [95]:
df.loc[0]

a      1
b      2
c    zsh
d    111
Name: 0, dtype: object

In [96]:
df.iloc[1]

a      5
b     10
c    zsh
d    222
Name: 1, dtype: object

In [None]:
df[0:1]  # slicing with [] selects rows by position