# CHAPTER 5 
---
# Getting Started With pandas

In [1]:
import pandas as pd
from pandas import DataFrame, Series

In [2]:
import numpy as np
import matplotlib.pyplot as plt

## 5.1 Introduction to pandas Data Structures 

### Series
- Attributes
    - Series.values
    - Series.index
    - Series.name
    - Series.index.name
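
The four attributes above can be tried on a small Series. This is only a sketch: the population figures and the names `pop`, `values`, and `labels` are made up for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the attributes.
pop = pd.Series([35000, 71000, 16000], index=["Ohio", "Texas", "Oregon"])

# Both the Series and its index can carry a name.
pop.name = "population"
pop.index.name = "state"

values = pop.values        # the underlying NumPy array
labels = list(pop.index)   # the index labels as a plain list

print(values)
print(pop.index)
print(pop.name, pop.index.name)
```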

### DataFrame
- Attributes
    - DataFrame.T
- Possible data inputs to DataFrame constructor
    - 2D ndarray
    - dict of arrays, lists, or tuples
    - NumPy structured/record array
    - dict of Series
    - dict of dict
    - List of dicts or Series
    - List of list or tuples
    - Another DataFrame
    - NumPy MaskedArray
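
A few of the constructor inputs listed above, sketched with made-up data. All variable names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# dict of lists: each key becomes a column
from_dict = pd.DataFrame({"state": ["Ohio", "Texas"], "year": [2000, 2001]})

# dict of dicts: outer keys become columns, inner keys become the row index
from_nested = pd.DataFrame({"Ohio": {2000: 1.5, 2001: 1.7},
                            "Texas": {2000: 2.4, 2001: 2.9}})

# list of dicts: each dict becomes one row
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3, "b": 4}])

# 2D ndarray with explicit row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=["r1", "r2"], columns=["x", "y", "z"])

# .T swaps rows and columns
transposed = from_array.T
print(transposed)
```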

### Index Objects
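- Some Index methods and properties
    - append | Concatenate with another Index object, producing a new Index
    - difference | Compute the set difference as an Index
    - intersection | Compute the set intersection
    - union | Compute the set union
    - isin | Compute a Boolean array indicating whether each value is contained in the passed collection
    - delete | Compute a new Index with the element at position i deleted
    - drop | Compute a new Index by deleting the passed values
    - insert | Compute a new Index by inserting an element at position i
    - is_monotonic | True if each element is greater than or equal to the previous element (removed in pandas 2.0; use is_monotonic_increasing there)
    - is_unique | True if the Index has no duplicate values
    - unique | Compute the array of unique values in the Index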
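
The set-like behavior of these methods can be checked on two small Index objects. This is a minimal sketch; `left` and `right` are illustrative names, and `is_monotonic_increasing` is used instead of `is_monotonic` so the code also runs on recent pandas versions.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right))          # labels found in either Index
print(left.intersection(right))   # labels found in both
print(left.difference(right))     # labels found only in left
print(left.isin(["a", "c"]))      # Boolean mask, one entry per label

combined = left.append(right)     # plain concatenation, duplicates are kept
print(combined, combined.is_unique)

print(left.delete(0))             # remove the element at position 0
print(left.drop(["b"]))           # remove by label
print(left.insert(1, "z"))        # insert "z" at position 1
print(left.is_monotonic_increasing)
```

Every call returns a new object because an Index is immutable.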

## 5.2 Essential Functionality

### Reindexing
- `reindex` function arguments
    - `index=`
    - `method=`
    - `fill_value=`
    - `limit=`
    - `tolerance=`
    - `level=`
    - `copy=`
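
Two of the arguments above, `method=` and `fill_value=`, sketched on small Series. The data and the names `filled` and `padded` are assumptions for illustration.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# method="ffill" forward-fills: each new label takes the value of the
# closest existing label before it.
filled = obj.reindex(range(6), method="ffill")
print(filled)

nums = pd.Series([1.0, 2.0], index=["a", "c"])

# fill_value replaces the NaN that a newly introduced label would get.
padded = nums.reindex(["a", "b", "c"], fill_value=0.0)
print(padded)
```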

### Dropping Entries from an Axis
- `drop` function arguments
    - `axis=`
    - `inplace=`

### Indexing, Selection, and Filtering
- Indexing options with DataFrame
    - `df[val]`
    - `df.loc[val]`
    - `df.loc[:, val]`
    - `df.iloc[where]`
    - `df.iloc[:, where]`
    - `df.iloc[where_i, where_j]`
    - `df.at[label_i, label_j]`
    - `df.iat[i, j]`
    - `reindex()`
    - `get_value(), set_value()` (removed in pandas 1.0; use `at` or `iat` instead)
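
The scalar accessors `at` and `iat` from the list above can be sketched as follows, next to plain column and row selection. The frame reuses the `data` layout from the exercises; the other names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

column = data["two"]                 # df[val]: select a column
row = data.loc["Colorado"]           # df.loc[val]: select a row by label

by_label = data.at["Utah", "three"]  # single cell by row and column label
by_position = data.iat[2, 2]         # the same cell by integer position

# at also supports assignment to a single cell
data.at["Utah", "three"] = 100
print(by_label, by_position, data.iat[2, 2])
```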
    

### Integer Indexes
To keep things consistent, if an axis index contains integers, data selection
will always be label-oriented.  
For more precise handling, use `loc` (for labels) or `iloc` (for integer positions).
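
A small sketch of the difference, assuming a Series whose index holds the integers 10, 20, and 30 (made-up labels):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.), index=[10, 20, 30])

by_label = ser.loc[10]       # label 10 -> first element
by_position = ser.iloc[-1]   # position -1 -> last element

# Plain [] is label-oriented here: there is no label -1, so this fails.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

# Label slices include the end point, positional slices do not.
label_slice = ser.loc[10:20]
position_slice = ser.iloc[0:1]
print(by_label, by_position, raised, len(label_slice), len(position_slice))
```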

### Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes.  
When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the
index pairs.   
For users with database experience, this is similar to an automatic outer join on the index labels.

The internal data alignment introduces missing values in the label locations that don’t overlap.   
Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns.

Each of the flexible arithmetic methods listed below has a counterpart, starting with the letter r, that has its arguments flipped.

- Flexible arithmetic methods
    - add, radd
    - sub, rsub
    - div, rdiv
    - floordiv, rfloordiv
    - mul, rmul
    - pow, rpow
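
To illustrate what "arguments flipped" means, here is a minimal sketch on a made-up frame `df`: `df.rdiv(1)` computes `1 / df`, while `df.div(1)` computes `df / 1`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list("abc"))

reciprocal = 1 / df
flipped = df.rdiv(1)        # same as 1 / df
unchanged = df.div(1)       # same as df / 1

ten_minus = df.rsub(10)     # same as 10 - df
print(flipped)
print(ten_minus)
```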
    
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows.

If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union.

### Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects.  
- Function
    - apply
    - applymap
    - map
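
A rough sketch of how `apply` and `map` differ, using a small made-up frame. `applymap` does for every cell of a DataFrame what `map` does for a Series; it is not called here because recent pandas releases deprecate it in favor of `DataFrame.map`.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0], [3.5, 0.5]],
                     index=["Utah", "Ohio"], columns=["b", "d"])

def spread(x):
    """Range (max minus min) of a one-dimensional Series."""
    return x.max() - x.min()

col_range = frame.apply(spread)                  # called once per column
row_range = frame.apply(spread, axis="columns")  # called once per row

# Series.map applies a function to each element
formatted = frame["b"].map(lambda v: f"{v:.2f}")
print(col_range, row_range, formatted, sep="\n")
```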

### Sorting and Ranking
- Function
    - sort_index
        - axis='index' | 0 | 'columns' | 1
        - ascending=True | False
    - sort_values
        - by=
    - rank
        - axis='index' | 0 | 'columns' | 1
        - ascending=True | False
        - method=
            - 'average'
            - 'min'
            - 'max'
            - 'first'
            - 'dense'
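
The tie-breaking options of `rank` are easiest to see side by side. The sketch below uses a Series with two tied pairs, plus a made-up frame to show `sort_values` with `by=`.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                   # ties share the mean of their ranks
first = obj.rank(method="first")       # ties ranked in order of appearance
lowest = obj.rank(method="min")        # ties all get the lowest rank
dense = obj.rank(method="dense")       # like "min", but ranks have no gaps
desc_max = obj.rank(ascending=False, method="max")

# sort_values on a DataFrame takes the column name(s) via by=
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
by_a_then_b = frame.sort_values(by=["a", "b"])
print(average.tolist(), dense.tolist())
print(by_a_then_b)
```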

### Axis Indexes with Duplicate Labels
Data selection is one of the main things that behaves differently with duplicates.  
Indexing a label with multiple entries returns a Series, while single entries return a scalar value.  
This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.  
- Function
    - `index.is_unique`
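
A short sketch of the varying return type described above, using the same Series as the exercise for this section (`repeated` and `single` are illustrative names):

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

unique_labels = obj.index.is_unique   # False, since "a" and "b" repeat

repeated = obj["a"]   # label occurs twice -> a Series
single = obj["c"]     # label occurs once  -> a scalar
print(unique_labels, type(repeated).__name__, single)
```

Checking `index.is_unique` before indexing is one way to know which of the two result types to expect.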

## 5.3 Summarizing and Computing Descriptive Statistics

## Exercises

### Axis Indexes with Duplicate Labels

In [40]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int32

### Sorting and Ranking

In [31]:
obj = pd.Series(range(4), index = ['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int32

In [33]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d','a','b','c'])
frame.sort_index()

       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [36]:
frame.sort_index(axis='columns')

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [39]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.sum()

19

### Function Application and Mapping

In [3]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [4]:
np.abs(frame)

               b         d         e
Utah    0.656043  0.648626  1.505143
Ohio    0.194272  1.194864  1.330081
Texas   0.084046  0.099777  0.402620
Oregon  1.113690  0.889086  0.094969


In [5]:
frame.apply(lambda x: x.max() - x.min(), axis='index')

b    1.307961
d    2.083951
e    1.907764
dtype: float64

### Arithmetic and Data Alignment

In [6]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [7]:
# Adding these together yields:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

#### Arithmetic methods with fill values

In [8]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1 ,'b'] = np.nan

In [9]:
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [10]:
df1.radd(df2, fill_value=1)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   5.0
1   9.0   6.0  13.0  15.0  10.0
2  18.0  20.0  22.0  24.0  15.0
3  16.0  17.0  18.0  19.0  20.0


#### Operations between DataFrame and Series

In [11]:
arr = np.arange(12.).reshape((3, 4))
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

In [12]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohil', 'Texas', 'Oregon'])
series = frame.iloc[0]

frame - series

          b    d    e
Utah    0.0  0.0  0.0
Ohil    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


In [13]:
series2 = pd.Series(range(3),index=['b', 'e', 'f'])
frame + series2

          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohil    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


In [14]:
series3 = frame['d']
frame.sub(series3, axis='index')

          b    d    e
Utah   -1.0  0.0  1.0
Ohil   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0


### Integer Indexes

In [15]:
ser = pd.Series(np.arange(3.), index=['a', 'b', 'c']) ; ser

a    0.0
b    1.0
c    2.0
dtype: float64

In [16]:
ser[-1]

2.0

### Reindexing

In [17]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

In [18]:
frame.reindex(index = ['a', 'b', 'c', 'd'], columns = ['Texas', 'Utah', 'California'])

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0


In [19]:
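# Many users prefer to use loc exclusively for label-based selection.
# Note: pandas 1.0 and later raise KeyError when the lists passed to loc
# contain missing labels (here 'b' and 'Utah'); use reindex in that case.
frame.loc[['a', 'b', 'c', 'd'], ['Texas', 'Utah', 'California']]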

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0


### Dropping Entries from an Axis

In [20]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])

In [21]:
obj.drop('b', axis='index')

d    4.5
a   -5.3
c    3.6
dtype: float64

In [22]:
data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [23]:
data.drop(['one', 'two'], axis='columns')

          three  four
Ohio          2     3
Colorado      6     7
Utah         10    11
New York     14    15


### Indexing, Selection, and Filtering

In [24]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [25]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])

In [26]:
data.loc[data['three'] > 5, 'three'] = 5 ; data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      5     7
Utah        8    9      5    11
New York   12   13      5    15


In [27]:
data.loc['Colorado', ['two', 'three']]

two      5
three    5
Name: Colorado, dtype: int32

In [28]:
data.iloc[[1, 2], [3, 0, 1]]

          four  one  two
Colorado     7    4    5
Utah        11    8    9


In [29]:
data.loc[:'Utah', 'two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [30]:
data.iloc[:, :3][data.two > 5]

          one  two  three
Utah        8    9      5
New York   12   13      5
