# Pandas Tutorial

### 2017 七月在线 (July Online) Python Data Analysis Fundamentals, julyedu.com
by 褚则伟

pandas is a Python library built specifically for data analysis.

## Introduction to [Pandas](http://pandas.pydata.org/)
- a Python library for data analysis
- built on numpy (it operates on ndarrays)
- feels a bit like doing Excel/SQL/R in Python

## Contents
- Series
- DataFrame

## The Series data structure

In [1]:
import numpy as np
import pandas as pd

### Constructing and initializing a Series

In [5]:
s = pd.Series([7, "Beijing", 2.17, -12345, "Happy"])

In [15]:
s[1]

'Beijing'

A Series is a one-dimensional data structure. Below are several ways to initialize one.

In [16]:
s = pd.Series([7, "Beijing", 2.17, -12345, "Happy"])
s

0          7
1    Beijing
2       2.17
3     -12345
4      Happy
dtype: object

By default pandas uses the integers 0 to n-1 as the index of a Series, but we can also specify the index ourselves. The index can be thought of as the keys of a dict.

In [17]:
s = pd.Series([7, "Beijing", 2.17, -12345, "Happy"], index=["A", "B", "C", "D", "E"])
s

A          7
B    Beijing
C       2.17
D     -12345
E      Happy
dtype: object
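To make the dict analogy concrete, here is a minimal sketch that builds the same labeled Series and looks values up by label. The variable name `labeled` is only for illustration.

```python
import pandas as pd

labeled = pd.Series([7, "Beijing", 2.17, -12345, "Happy"],
                    index=["A", "B", "C", "D", "E"])

# Labels behave like dict keys.
print(labeled["B"])          # Beijing
# The labels live in the index, separately from the values.
print(list(labeled.index))   # ['A', 'B', 'C', 'D', 'E']
# Membership tests check the index, not the values.
print("C" in labeled)        # True
```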

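A Series can also be built from a dictionary, since a Series is essentially a collection of key-value pairs. (The output below comes from an older pandas release that sorted the dict keys alphabetically; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order.)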

In [20]:
cities = {"Beijing": 55000, "Shanghai": 60000, "Shenzhen": 50000, "Hangzhou": 20000, "Guangzhou": 20000, "Suzhou": None}
apts = pd.Series(cities, name="price")
apts

Beijing      55000.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64
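Notice that the `None` entry for Suzhou shows up as `NaN` and that the whole Series has dtype `float64`, even though the other prices were given as integers. The sketch below, using a smaller hypothetical dict, checks this behaviour.

```python
import numpy as np
import pandas as pd

# A shortened version of the cities dict, for illustration only.
prices = pd.Series({"Beijing": 55000, "Suzhou": None}, name="price")

print(prices.dtype)                 # float64
print(np.isnan(prices["Suzhou"]))   # True
print(prices.name)                  # price
```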

A Series can also be built from a numpy ndarray.

In [21]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.590158
b    0.880828
c   -0.251413
d   -1.447042
e    1.318019
dtype: float64

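### Selecting data

We can treat a Series much like a list. For integer positions on a label-indexed Series, `.iloc` is the explicit positional accessor; plain `[]` with an integer position is deprecated in recent pandas releases.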

In [22]:
apts

Beijing      55000.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

In [23]:
apts.iloc[3]

60000.0

In [24]:
apts.iloc[[3, 4, 1]]

Shanghai     60000.0
Shenzhen     50000.0
Guangzhou    20000.0
Name: price, dtype: float64

In [25]:
apts[1:]

Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

In [26]:
apts[:-1]

Beijing      55000.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Name: price, dtype: float64

Why does the following produce two NaNs?

In [27]:
apts[1:] + apts[:-1]

Beijing           NaN
Guangzhou     40000.0
Hangzhou      40000.0
Shanghai     120000.0
Shenzhen     100000.0
Suzhou            NaN
Name: price, dtype: float64
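The reason is that arithmetic between two Series aligns on index labels, not on positions. A label that is missing from either operand, or whose value is already NaN, yields NaN in the result. The sketch below reproduces the effect on a three-element Series; the names `prices`, `tail` and `head` are illustrative.

```python
import numpy as np
import pandas as pd

prices = pd.Series([55000.0, 20000.0, np.nan],
                   index=["Beijing", "Guangzhou", "Suzhou"])

tail = prices.iloc[1:]    # Guangzhou, Suzhou
head = prices.iloc[:-1]   # Beijing, Guangzhou

# Addition matches rows by label: Beijing is absent from `tail`,
# and Suzhou is absent from `head` (and is NaN in `tail` anyway).
total = tail + head
print(total)
```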

A Series also behaves like a dict: the index defined earlier is what we use to select data.

In [28]:
apts["Hangzhou"]

20000.0

In [29]:
apts[["Hangzhou", "Beijing", "Shenzhen"]]

Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
Name: price, dtype: float64

In [30]:
"Hangzhou" in apts

True

In [31]:
"Chongqing" in apts

False

In [32]:
apts["Chongqing"]

KeyError: 'Chongqing'

In [34]:
print(apts.get("Chongqing"))

None


In [35]:
print(apts.get("Shenzhen"))

50000.0
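As with a dict, `get` also accepts a fallback value that is returned when the label is absent. A minimal sketch on a hypothetical two-city Series:

```python
import pandas as pd

prices = pd.Series({"Beijing": 55000.0, "Shenzhen": 50000.0})

print(prices.get("Shenzhen"))         # 50000.0
print(prices.get("Chongqing"))        # None
print(prices.get("Chongqing", 0.0))   # 0.0, the fallback value
```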


Boolean indexing works much like it does in numpy.

In [36]:
apts[apts < 50000]

Guangzhou    20000.0
Hangzhou     20000.0
Name: price, dtype: float64

In [37]:
apts.median()

50000.0

In [38]:
apts.mean()

41000.0

In [39]:
apts.min()

20000.0

In [40]:
apts.max()

60000.0

Let's look in more detail at how boolean indexing works.

In [42]:
less_than_50000 = apts < 50000

In [45]:
less_than_50000

Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
Name: price, dtype: bool

In [46]:
apts[less_than_50000]

Guangzhou    20000.0
Hangzhou     20000.0
Name: price, dtype: float64

In [47]:
apts[ apts > apts.mean() ]

Beijing     55000.0
Shanghai    60000.0
Shenzhen    50000.0
Name: price, dtype: float64
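Boolean masks can be combined with `&`, `|` and `~`, as in numpy. Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`. The following sketch uses a hypothetical subset of the prices; comparisons against NaN evaluate to False, so Suzhou never passes a `<` or `>` test.

```python
import numpy as np
import pandas as pd

prices = pd.Series({"Beijing": 55000.0, "Guangzhou": 20000.0,
                    "Shanghai": 60000.0, "Suzhou": np.nan})

# Both conditions must hold.
mid_range = prices[(prices > 30000) & (prices < 58000)]
print(mid_range)

# Known prices that are NOT above 30000.
cheap_known = prices[prices.notnull() & ~(prices > 30000)]
print(cheap_known)
```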

### Assigning to Series elements

Elements of a Series can be assigned new values.

In [49]:
print("Old price of Shenzhen: {}".format(apts["Shenzhen"]))
apts["Shenzhen"] = 70000
print("New price of Shenzhen: {}".format(apts["Shenzhen"]))

Old price of Shenzhen: 50000.0
New price of Shenzhen: 70000.0


In [50]:
apts

Beijing      55000.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: price, dtype: float64

The boolean indexing introduced earlier can also be used for assignment.

In [51]:
apts[apts < 50000] = 40000

In [52]:
apts

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: price, dtype: float64
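Suzhou stays NaN above because `NaN < 50000` is False, so the mask never selects it. The same assignment can be written with `.loc`, which makes the label-based intent explicit. A small sketch with illustrative values:

```python
import numpy as np
import pandas as pd

prices = pd.Series({"Guangzhou": 20000.0, "Shanghai": 60000.0, "Suzhou": np.nan})

# Only rows where the mask is True are overwritten.
prices.loc[prices < 50000] = 40000
print(prices)
```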

### Arithmetic

Next, some basic arithmetic operations.

In [53]:
apts / 2

Beijing      27500.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     30000.0
Shenzhen     35000.0
Suzhou           NaN
Name: price, dtype: float64

In [54]:
apts ** 2

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     4.900000e+09
Suzhou                NaN
Name: price, dtype: float64

numpy functions can be applied directly to pandas objects.

In [57]:
np.log(apts)

Beijing      10.915088
Guangzhou    10.596635
Hangzhou     10.596635
Shanghai     11.002100
Shenzhen     11.156251
Suzhou             NaN
Name: price, dtype: float64

Let's define another Series and add the two together.

In [59]:
cars = pd.Series({"Beijing": 300000, "Shanghai": 350000, "Shenzhen": 300000,
                  "Tianjin": 200000, "Guangzhou": 200000, "Chongqing": 150000})

In [60]:
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     350000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [66]:
expense = cars + apts * 100
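Because the two Series do not share all of their labels, the sum contains NaN for every city that appears in only one of them. The sketch below shows this on smaller hypothetical inputs. It also shows `Series.add` with `fill_value`, which treats a value missing on one side as the given number. Whether 0 is a sensible stand-in depends on the data, so take it as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

apt_prices = pd.Series({"Beijing": 55000.0, "Hangzhou": 40000.0, "Suzhou": np.nan})
car_prices = pd.Series({"Beijing": 300000, "Tianjin": 200000})

# Plain addition: only Beijing exists on both sides with a real value.
total = car_prices + apt_prices * 100
print(total)

# fill_value=0 substitutes 0 when exactly one side is missing;
# Suzhou is missing on both sides, so it stays NaN.
total_filled = car_prices.add(apt_prices * 100, fill_value=0)
print(total_filled)
```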

### Missing data

[reference](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

In [62]:
"Hangzhou" in apts

True

In [63]:
"Hangzhou" in cars

False

In [64]:
apts.notnull()

Beijing       True
Guangzhou     True
Hangzhou      True
Shanghai      True
Shenzhen      True
Suzhou       False
Name: price, dtype: bool

In [65]:
apts.isnull()

Beijing      False
Guangzhou    False
Hangzhou     False
Shanghai     False
Shenzhen     False
Suzhou        True
Name: price, dtype: bool

In [73]:
expense[expense.isnull()] = expense.mean()

In [74]:
expense

Beijing      5800000.0
Chongqing    5912500.0
Guangzhou    4200000.0
Hangzhou     5912500.0
Shanghai     6350000.0
Shenzhen     7300000.0
Suzhou       5912500.0
Tianjin      5912500.0
dtype: float64
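An equivalent way to fill the gaps is `fillna`, which returns a new Series and leaves the original untouched. `mean()` skips NaN by default, so the fill value is the average of the known entries. A minimal sketch with illustrative numbers:

```python
import numpy as np
import pandas as pd

costs = pd.Series({"Beijing": 5800000.0, "Chongqing": np.nan, "Guangzhou": 4200000.0})

# mean() ignores the NaN: (5800000 + 4200000) / 2
filled = costs.fillna(costs.mean())
print(filled)
```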

## The [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) data structure

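A DataFrame is a table. Where a Series represents a one-dimensional array, a DataFrame is two-dimensional and can be compared to an Excel spreadsheet. A DataFrame can also be viewed as a collection of Series.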

### Creating a DataFrame

A DataFrame can be constructed from a dictionary.

In [75]:
data = {"City": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen", "Hangzhou", "Chongqing"],
       "year": [2016,  2017, 2016, 2017, 2016, 2016],
       "population": [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)

|   | City | population | year |
|---|---|---|---|
| 0 | Beijing | 2100 | 2016 |
| 1 | Shanghai | 2300 | 2017 |
| 2 | Guangzhou | 1000 | 2016 |
| 3 | Shenzhen | 700 | 2017 |
| 4 | Hangzhou | 500 | 2016 |
| 5 | Chongqing | 500 | 2016 |
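The view of a DataFrame as a collection of Series can be checked directly: selecting one column, or one row, gives back a Series. A short sketch on a two-row hypothetical table:

```python
import pandas as pd

table = pd.DataFrame({"City": ["Beijing", "Shanghai"],
                      "population": [2100, 2300]})

column = table["population"]   # one column, indexed by row label
row = table.loc[0]             # one row, indexed by column name

print(type(column).__name__)   # Series
print(table.shape)             # (2, 2)
print(row["City"])             # Beijing
```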


The column names and their order can be specified.

In [77]:
pd.DataFrame(data, columns = ["year", "City", "population"])

|   | year | City | population |
|---|---|---|---|
| 0 | 2016 | Beijing | 2100 |
| 1 | 2017 | Shanghai | 2300 |
| 2 | 2016 | Guangzhou | 1000 |
| 3 | 2017 | Shenzhen | 700 |
| 4 | 2016 | Hangzhou | 500 |
| 5 | 2016 | Chongqing | 500 |


In [79]:
pd.DataFrame(data, columns=["year", "City", "population"],
             index=["one", "two", "three", "four", "five", "six"])

|   | year | City | population |
|---|---|---|---|
| one | 2016 | Beijing | 2100 |
| two | 2017 | Shanghai | 2300 |
| three | 2016 | Guangzhou | 1000 |
| four | 2017 | Shenzhen | 700 |
| five | 2016 | Hangzhou | 500 |
| six | 2016 | Chongqing | 500 |
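When building from a dict, `columns` also acts as a selector: keys that are not listed are dropped, and a listed name that is not a key in the dict becomes a column of NaN. The sketch below illustrates this with a hypothetical `area` column that does not exist in the data.

```python
import pandas as pd

data = {"City": ["Beijing", "Shanghai"], "year": [2016, 2017]}

# "area" is not a key in `data`, so that column is filled with NaN.
table = pd.DataFrame(data, columns=["year", "City", "area"],
                     index=["one", "two"])
print(table)
```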


A DataFrame can also be built from several Series.

In [81]:
apts

Beijing      55000.0
Guangzhou    40000.0
Hangzhou     40000.0
Shanghai     60000.0
Shenzhen     70000.0
Suzhou           NaN
Name: price, dtype: float64

In [82]:
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     350000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [83]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df

|   | apts | cars |
|---|---|---|
| Beijing | 55000.0 | 300000.0 |
| Chongqing | NaN | 150000.0 |
| Guangzhou | 40000.0 | 200000.0 |
| Hangzhou | 40000.0 | NaN |
| Shanghai | 60000.0 | 350000.0 |
| Shenzhen | 70000.0 | 300000.0 |
| Suzhou | NaN | NaN |
| Tianjin | NaN | 200000.0 |
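The rows of the result are the union of the two indexes. Cells with no source value become NaN, which is also why the integer `cars` column is shown as floats. The sketch below reproduces this on smaller hypothetical Series and uses `dropna` to keep only the rows that are complete.

```python
import pandas as pd

apt_prices = pd.Series({"Beijing": 55000.0, "Hangzhou": 40000.0})
car_prices = pd.Series({"Beijing": 300000, "Tianjin": 200000})

table = pd.DataFrame({"apts": apt_prices, "cars": car_prices})
print(table)

# Keep only rows that have a value in every column.
complete = table.dropna()
print(complete)
```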


A DataFrame can also be built from a list of dicts.

In [84]:
data = [{"July": 999999, "Han": 50000, "Zewei": 1000}, {"July": 99999, "Han": 8000, "Zewei": 200}]
pd.DataFrame(data)

|   | Han | July | Zewei |
|---|---|---|---|
| 0 | 50000 | 999999 | 1000 |
| 1 | 8000 | 99999 | 200 |


In [85]:
pd.DataFrame(data, index=["salary", "bonus"])

|   | Han | July | Zewei |
|---|---|---|---|
| salary | 50000 | 999999 | 1000 |
| bonus | 8000 | 99999 | 200 |
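Each dict in the list becomes one row, and the dict keys become the columns. The dicts do not need identical keys: a key missing from one dict simply yields NaN in that row. A small sketch with made-up records:

```python
import pandas as pd

# The two records deliberately have different keys.
records = [{"Han": 50000, "Zewei": 1000},
           {"Han": 8000, "July": 99999}]

table = pd.DataFrame(records, index=["salary", "bonus"])
print(table)
```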


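## Exercise

- Build three Series holding, for a set of products, the unit price, the unit of measure, and the quantity. Which products and units to use is up to you.
- Then combine the three Series into one DataFrame.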