# CHAPTER 5 
---
# Getting Started With pandas

pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops. 

>在这本书余下的大部分内容里，pandas 将成为一个贯穿始终的主要工具。它包含数据结构和数据操作，以及在Python中快速、方便地进行数据清理和分析的工具。pandas 经常和数值计算工具(类似 NumPy 和 SciPy)，分析库(类似 statsmodels 和 scikit-learn)，以及像 matplotlib 这样的数据可视化库一起使用 。pandas 采用了 NumPy 的惯用风格——基于数组(array-based)的计算方式，特别是基于数组的函数(array-based functions)，以及对不通过 for 循环对数据进行处理的偏爱。

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

>虽然 pandas 采用了许多 NumPy 的编码习惯，但最大的不同是，pandas 的设计目的是为了处理表格形或多样化、异性的数据。相比之下，NumPy 最适合于处理同类的规整的数值形数组数据。
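
A minimal sketch of that difference, using made-up records (the column names `state`, `year`, and `pop` are illustrative assumptions): a DataFrame keeps one dtype per column, while a NumPy array built from the same values has to settle on a single dtype for everything.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type records: a string, an integer, and a float per row.
records = [("Ohio", 2000, 1.5), ("Texas", 2001, 2.4)]

# pandas stores each column with its own dtype.
df = pd.DataFrame(records, columns=["state", "year", "pop"])
print(df.dtypes)

# NumPy needs one dtype for the whole array; with dtype=object every
# element is kept as a generic Python object.
arr = np.array(records, dtype=object)
print(arr.dtype, arr.shape)
```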

Since becoming an open source project in 2010, pandas has matured into a quite large library that’s applicable in a broad set of real-world use cases. The developer community has grown to over 800 distinct contributors, who’ve been helping build the project as they’ve used it to solve their day-to-day data problems. Throughout the rest of the book, I use the following import convention for pandas:

>自从2010年成为一个开源项目以来，pandas 已经成长为一个相当大的库，它适用于广泛的现实世界的用例。开发人员社区已经发展到超过800个不同的贡献者，他们一直在帮助构建这个项目，因为他们已经用它来解决他们的日常数据问题。在本书的剩余部分，我将使用以下的 pandas 导入惯例：

In [3]:
import pandas as pd

Thus, whenever you see pd. in code, it’s referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are so frequently used:

>因此，无论你在代码中何时看到 pd ，它指的是 pandas 。你可能还会发现，将 Series 和 DataFrame 导入本地命名空间会更好一些，因为它们经常被使用：

In [4]:
from pandas import Series, DataFrame

## 5.1 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

>开始使用 pandas 之前，你需要适应它的两个重要的数据结构: Series 和 DataFrame 。虽然它们不是解决所有问题的通用解决方案，但它们为大多数应用程序提供了可靠的、易于使用的基础。

### Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

> Series 是一个一维数组类的对象，包含一个值序列(与NumPy类型相似的类型)和与之相关的数据标签数组(称为它的索引)。最简单的 Series 仅仅是由一个数据序列组成的:

In [5]:
obj = pd.Series([4, 7, -5, 3]) ; obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

>上述 Series 显示的行，相对应地表示了左边的索引和右边的值。由于我们没有为数据指定一个索引，因此创建了一个默认值，其中包含了 0 到 N-1(其中N是数据的长度)的整数。您可以通过它的 `values` 和 `index` 属性分别获得该 Series 的数组表示的值和索引对象:

In [12]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [14]:
obj.index  # like range(4)

RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label:
>通常，我们希望为每一行数据创建一个具有标识的索引:

In [15]:
obj2 = pd.Series(data=[4, 7, -5, 3], index=['d', 'b', 'a', 'c']) ; obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [16]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:
>与 NumPy 数组相比，在选择单个值或一组值时，可以使用索引中的标签。

In [17]:
obj2['a']

-5

In [19]:
obj2['d'] = 6
obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.

>`['c', 'a', 'd']` 被解释为一个索引列表，尽管它包含的是字符串而不是整数。

Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

> 使用 NumPy 函数或 NumPy 风格的操作，例如使用 boolean array 进行过滤、标量乘法或应用数学函数，将保留索引和值之间链接:

In [21]:
obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [22]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [23]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

> 另一种理解 Series 的方法是固定长度的、有序的 dict，因为它是索引值到数据值的映射。它可以应用在很多原本使用 dict 的场景:

In [24]:
'b' in obj2

True

In [25]:
'e' in obj2

False
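
The dict analogy also works in the other direction: a Series can be built directly from a dict. The sketch below uses made-up population figures (the names `sdata`, `obj3`, and `obj4` are illustrative) and shows that the dict keys become the index, and that labels missing from the dict come out as NaN when an explicit index is passed.

```python
import pandas as pd

# Hypothetical data: state name -> population count.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000}

# The dict keys become the index labels.
obj3 = pd.Series(sdata)
print("Texas" in obj3)  # membership is tested against the index labels

# An explicit index selects and reorders; 'California' is not a key in
# sdata, so its value is missing (NaN), and 'Oregon' is left out.
obj4 = pd.Series(sdata, index=["California", "Ohio", "Texas"])
print(obj4.isna())
```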

In [6]:
import numpy as np
import matplotlib.pyplot as plt

In [7]:
import pandas_datareader.data as web

## 5.1 Introduction to pandas Data Structures 

### Series
- Attributes
    - Series.values
    - Series.index
    - Series.name
    - Series.index.name

### DataFrame
- Attributes
    - DataFrame.T
- Possible data inputs to DataFrame constructor
    - 2D ndarray
    - dict of arrays, lists, or tuples
    - NumPy structured/record array
    - dict of Series
    - dict of dict
    - List of dicts or Series
    - List of lists or tuples
    - Another DataFrame
    - NumPy MaskedArray
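
Two of the constructor inputs listed above, sketched with made-up data (the variable names are illustrative): a dict of lists, where each key becomes a column, and a dict of dicts, where the outer keys become columns and the inner keys become row labels. `T` then swaps rows and columns.

```python
import pandas as pd

# Dict of equal-length lists: each key becomes a column.
data = {"state": ["Ohio", "Ohio", "Nevada"],
        "year": [2000, 2001, 2001]}
frame = pd.DataFrame(data)

# Dict of dicts: outer keys -> columns, inner keys -> row index.
# Row labels missing for a column are filled with NaN.
nested = {"Nevada": {2001: 2.4, 2002: 2.9},
          "Ohio": {2000: 1.5, 2001: 1.7}}
frame2 = pd.DataFrame(nested)

# Transpose: rows become columns and vice versa.
transposed = frame2.T
print(frame2)
print(transposed)
```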

### Index Objects
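- Some Index methods and properties
    - append | Concatenate with another Index object, producing a new Index
    - difference | Compute the set difference as an Index
    - intersection | Compute the set intersection
    - union | Compute the set union
    - isin | Compute a boolean array indicating whether each value is contained in the passed collection
    - delete | Compute a new Index with the element at position i deleted
    - drop | Compute a new Index with the passed values deleted
    - insert | Compute a new Index with an element inserted at position i
    - is_monotonic | Returns True if each element is greater than or equal to the previous element (named is_monotonic_increasing in recent pandas releases)
    - is_unique | Returns True if the Index has no duplicate values
    - unique | Compute the array of unique values in the Index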
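
A short sketch of several of these methods on two small, made-up Index objects (`left` and `right` are illustrative names). Every call returns a new object, since an Index is immutable.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right))          # labels found in either Index
print(left.intersection(right))   # labels found in both
print(left.difference(right))     # labels only in `left`
print(left.isin(["a", "c"]))      # boolean array, one entry per label
print(left.insert(1, "z"))        # new Index with 'z' at position 1
print(left.delete(0))             # new Index without the element at position 0
print(left.drop(["b"]))           # new Index without the label 'b'
print(pd.Index(["a", "a", "b"]).is_unique)  # duplicates -> not unique
```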

## 5.2 Essential Functionality

### Reindexing
- `reindex` function arguments
    - `index=`
    - `method=`
    - `fill_value=`
    - `limit=`
    - `tolerance=`
    - `level=`
    - `copy=`
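
As a rough illustration of two of these arguments, the sketch below (with made-up data) uses `method="ffill"` to forward-fill labels that appear only in the new index, and `fill_value` to substitute a constant for them instead of NaN.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Labels 1, 3, and 5 are new; each takes the value of the previous label.
filled = obj.reindex(range(6), method="ffill")
print(filled)

nums = pd.Series([1.0, 2.0], index=["a", "c"])

# 'b' is new; fill_value replaces the NaN it would otherwise receive.
padded = nums.reindex(["a", "b", "c"], fill_value=0)
print(padded)
```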

### Dropping Entries from an Axis
- `drop` function arguments
    - `axis=`
    - `inplace=`

### Indexing, Selection, and Filtering
- Indexing options with DataFrame
    - `df[val]`
    - `df.loc[val]`
    - `df.loc[:, val]`
    - `df.iloc[where]`
    - `df.iloc[:, where]`
    - `df.iloc[where_i, where_j]`
    - `df.at[label_i, label_j]`
    - `df.iat[i, j]`
    - `reindex()`
    - `get_value(), set_value()`
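
The main options in this list, sketched on a small frame (the variable names on the left are illustrative): `loc` and `at` take labels, while `iloc` and `iat` take integer positions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

col = data["two"]                      # df[val]: select a column
row = data.loc["Utah"]                 # df.loc[val]: row by label
sub = data.loc[:, ["one", "three"]]    # df.loc[:, val]: columns by label
row_pos = data.iloc[2]                 # df.iloc[where]: row by position
block = data.iloc[[1, 2], [3, 0]]      # df.iloc[where_i, where_j]
cell_label = data.at["Utah", "two"]    # df.at: one scalar, by labels
cell_pos = data.iat[2, 1]              # df.iat: one scalar, by positions
print(cell_label, cell_pos)
```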
    

### Integer Indexes
To keep things consistent, if you have an axis index containing integers, data selection
will always be label-oriented.  
For more precise handling, use loc (for labels) or iloc (for integers).
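
A small sketch of why the distinction matters, using a made-up Series whose integer labels run in the opposite order to the positions:

```python
import numpy as np
import pandas as pd

# Labels 2, 1, 0 sit at positions 0, 1, 2.
ser = pd.Series(np.arange(3.0), index=[2, 1, 0])

plain = ser[0]              # label-oriented: the value labeled 0
by_label = ser.loc[0]       # explicit label lookup, same element
by_position = ser.iloc[0]   # explicit position lookup, first element

# Label slices include the endpoint; position slices exclude it.
label_slice = ser.loc[:1]
position_slice = ser.iloc[:1]
print(plain, by_label, by_position)
```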

### Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes.  
When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the
index pairs.   
For users with database experience, this is similar to an automatic outer join on the index labels.

The internal data alignment introduces missing values in the label locations that don’t overlap.   
Missing values will then propagate in further arithmetic computations.

In the case of DataFrame, alignment is performed on both the rows and the columns.

Each of the flexible arithmetic methods listed below has a counterpart, starting with the letter r, that has its arguments flipped.

- Flexible arithmetic methods
    - add, radd
    - sub, rsub
    - div, rdiv
    - floordiv, rfloordiv
    - mul, rmul
    - pow, rpow
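
A brief sketch of the flipped-argument idea on a made-up frame: `df.rdiv(1)` computes `1 / df`, whereas `df.div(1)` computes `df / 1`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1.0, 5.0).reshape((2, 2)), columns=list("ab"))

reciprocal = 1 / df        # operator form
flipped = df.rdiv(1)       # method form with arguments flipped: 1 / df
unchanged = df.div(1)      # df / 1
print(flipped)
```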
    
By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows.

If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union.

### Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects.  
- Function
    - apply
    - applymap
    - map
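
A minimal sketch of these tools on a small, made-up frame: a ufunc applied to the whole frame, `apply` called once per column or once per row, and `Series.map` for element-wise work on a single column. `applymap` is left out on purpose, since recent pandas releases deprecate it in favor of `DataFrame.map`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0], [3.5, 0.5]],
                     columns=["b", "d"], index=["Utah", "Ohio"])

abs_frame = np.abs(frame)          # ufunc: element-wise on the whole frame

def spread(x):
    # x is one column (or one row) passed in as a Series
    return x.max() - x.min()

col_spread = frame.apply(spread)                   # once per column
row_spread = frame.apply(spread, axis="columns")   # once per row

# Element-wise function on a single column.
formatted = frame["b"].map(lambda v: f"{v:.2f}")
print(col_spread, row_spread, formatted, sep="\n")
```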

### Sorting and Ranking
- Function
    - sort_index
        - axis='index' | 0 | 'columns' | 1
        - ascending=True | False
    - sort_values
        - by=
    - rank
        - axis='index' | 0 | 'columns' | 1
        - ascending=True | False
        - method=
            - 'average'
            - 'min'
            - 'max'
            - 'first'
            - 'dense'
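
The sketch below, on made-up data, sorts a frame by two columns with `sort_values(by=...)` and compares how the tie-breaking options of `rank` treat the two tied values of 7.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by column 'a' first, then by 'b' within equal values of 'a'.
by_cols = frame.sort_values(by=["a", "b"])
print(by_cols)

obj = pd.Series([7, -5, 7, 4, 9])

average = obj.rank()                  # ties share the mean of their ranks
first = obj.rank(method="first")      # ties ranked in order of appearance
lowest = obj.rank(method="min")       # ties share the lowest rank in the group
dense = obj.rank(method="dense")      # like 'min', but ranks never skip
descending = obj.rank(ascending=False, method="max")
print(average.tolist(), dense.tolist())
```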

### Axis Indexes with Duplicate Labels
Data selection is one of the main things that behaves differently with duplicates.  
Indexing a label with multiple entries returns a Series, while single entries return a scalar value.  
This can make your code more complicated, as the output type from indexing can vary based on whether a label is repeated or not.  
- Function
    - `index.is_unique`
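
A short sketch of that type difference, using a Series with repeated labels:

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

print(obj.index.is_unique)   # the labels 'a' and 'b' are repeated

repeated = obj["a"]   # repeated label -> a Series with both entries
single = obj["c"]     # unique label -> a scalar
print(type(repeated).__name__, single)
```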

## 5.3 Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods.  
Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.  
Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data.
- Descriptive and summary statistics
    - count
    - describe
    - min, max
    - argmin, argmax
    - quantile
    - sum
    - mean
    - median
    - mad
    - prod
    - var
    - std
    - skew
    - kurt
    - cumsum
    - cummin, cummax
    - cumprod
    - diff
    - pct_change
    
- Options for reduction methods
    - axis
    - skipna
    - level
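
A sketch of the `axis` and `skipna` options on a small frame that contains missing values. By default NaN is skipped; with `skipna=False`, any NaN in a row or column makes that result NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

col_sums = df.sum()                                  # per column, NaN skipped
row_means = df.mean(axis="columns", skipna=False)    # per row, NaN propagates
counts = df.count()                                  # non-NaN values per column
print(col_sums, row_means, counts, sep="\n")
```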

### Correlation and Covariance
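
A minimal sketch of the correlation and covariance methods, using a tiny synthetic frame instead of downloaded price data. The values and the column names `x`, `y`, and `z` are made up so that the results are easy to check by eye: `y` is exactly twice `x`, and `z` is the negative of `x`.

```python
import pandas as pd

# Synthetic "returns": y = 2 * x and z = -x.
returns = pd.DataFrame({"x": [0.01, -0.02, 0.03, 0.00],
                        "y": [0.02, -0.04, 0.06, 0.00],
                        "z": [-0.01, 0.02, -0.03, 0.00]})

corr_xy = returns["x"].corr(returns["y"])   # Series-to-Series correlation
cov_xy = returns["x"].cov(returns["y"])     # Series-to-Series covariance
corr_matrix = returns.corr()                # full pairwise correlation matrix
corr_with_x = returns.corrwith(returns["x"])  # each column against one Series
print(corr_matrix)
```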

## Exercises

### Correlation and Covariance

In [8]:
all_data = {ticker: web.get_data_yahoo(ticker)
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1262275200&period2=1513871999&interval=1d&events=history&crumb=Etibcxsuc9%5Cu002F

In [None]:
list(all_data)  # all_data is a dict, which cannot be sliced with [:]; list its ticker keys instead

## 5.3 Summarizing and Computing Descriptive Statistics

In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a','b','c','d'],
                  columns=['one', 'two'])

In [None]:
df

In [None]:
df.sum()

In [None]:
df.sum(axis='columns')

In [None]:
df['one']

In [None]:
df.cumsum()

In [None]:
df.loc['a'].pct_change()

### Axis Indexes with Duplicate Labels

In [None]:
obj = pd.Series(range(5), index=['a','a','b','b','c'])
obj

### Sorting and Ranking

In [None]:
obj = pd.Series(range(4), index = ['d', 'a', 'b', 'c'])
obj.sort_index()

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['three', 'one'],
                     columns=['d','a','b','c'])
frame.sort_index()

In [None]:
frame.sort_index(axis='columns')

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.sum()

### Function Application and Mapping

In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [None]:
np.abs(frame)

In [None]:
frame.apply(lambda x: x.max() - x.min(), axis='index')

### Arithmetic and Data Alignment

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [None]:
# Adding these together yields:
s1 + s2

#### Arithmetic methods with fill values

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1 ,'b'] = np.nan

In [None]:
df1.add(df2, fill_value=0)

In [None]:
df1.radd(df2, fill_value=1)

#### Operations between DataFrame and Series

In [None]:
arr = np.arange(12.).reshape((3, 4))
arr - arr[0]

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

frame - series

In [None]:
series2 = pd.Series(range(3),index=['b', 'e', 'f'])
frame + series2

In [None]:
series3 = frame['d']
frame.sub(series3, axis='index')

### Integer Indexes

In [None]:
ser = pd.Series(np.arange(3.), index=['a', 'b', 'c']) ; ser

In [None]:
ser[-1]

### Reindexing

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

In [None]:
frame.reindex(index = ['a', 'b', 'c', 'd'], columns = ['Texas', 'Utah', 'California'])

In [None]:
# many users prefer label-indexing with loc; recent pandas raises KeyError
# for labels that do not exist ('b', 'Utah'), so select existing labels only:
frame.loc[['a', 'c', 'd'], ['Texas', 'California']]

### Dropping Entries from an Axis

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])

In [None]:
obj.drop('b', axis='index')

In [None]:
data.drop(['Colorado', 'Ohio'])

In [None]:
data.drop(['one', 'two'], axis='columns')

### Indexing, Selection, and Filtering

In [None]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj[[1, 3]]

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns = ['one', 'two', 'three', 'four'])

In [None]:
data.loc[data['three'] > 5, 'three'] = 5 ; data

In [None]:
data.loc['Colorado', ['two', 'three']]

In [None]:
data.iloc[[1, 2], [3, 0, 1]]

In [None]:
data.loc[:'Utah', 'two']

In [None]:
data.iloc[:, :3][data.two > 5]