# Pandas Tutorial

### 2017 July Online (七月在线) Python Data Analysis Fundamentals, julyedu.com
by 褚则伟 

pandas is a Python library built specifically for data analysis.

## Introduction to [Pandas](http://pandas.pydata.org/)
- A Python library for data analysis
- Built on NumPy (operations on ndarray)
- Feels like doing Excel/SQL/R work in Python

## Contents
- Series
- DataFrame

## The Series data structure

### Constructing and initializing a Series

In [2]:
import pandas as pd
import numpy as np

A Series is a one-dimensional data structure. Below are several ways to initialize one.

In [4]:
s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'])
s

0                  7
1            Beijing
2               2.17
3             -12344
4    Happy Birthday!
dtype: object

By default pandas uses the integers 0 to n-1 as the index of a Series, but we can also specify the index ourselves. You can think of the index as the keys of a dict.

In [5]:
s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'],
             index=['A', 'B', 'C', 'D', 'E'])
s

A                  7
B            Beijing
C               2.17
D             -12344
E    Happy Birthday!
dtype: object
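To make the dict analogy concrete, here is a minimal sketch that pulls the labels and values apart and converts the Series back into a plain dict. The variable names `labels`, `values`, and `as_dict` are illustrative only.

```python
import pandas as pd

s = pd.Series([7, 'Beijing', 2.17, -12344, 'Happy Birthday!'],
              index=['A', 'B', 'C', 'D', 'E'])

labels = list(s.index)    # the index labels, playing the role of dict keys
values = list(s.values)   # the underlying data, in the same order
as_dict = s.to_dict()     # round-trip to an ordinary Python dict

print(labels)
print(as_dict['B'])       # same lookup as s['B']
```

Looking up `s['B']` and `as_dict['B']` gives the same element, which is the sense in which the index behaves like dict keys.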

A Series can also be built from a dictionary, since a Series is essentially a collection of key-value pairs.

In [6]:
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000, 'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None}
# apts = pd.Series(cities)
apts = pd.Series(cities, name="price")
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64
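In the output above, the `None` given for Suzhou is displayed as NaN, pandas' marker for a missing value. A small sketch of checking for this, using a shortened dictionary that is only for illustration:

```python
import pandas as pd

# Shortened version of the cities dict, for illustration only.
cities = {'Beijing': 55000, 'Shanghai': 60000, 'Suzhou': None}
apts = pd.Series(cities, name="price")

print(apts.dtype)    # the dtype pandas chose for these values
print(apts.isna())   # True only for the missing Suzhou entry
print(apts.name)     # the name passed to the constructor
```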

A Series can also be built from a NumPy ndarray.

In [7]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.513368
b    1.573003
c    0.493720
d   -1.441517
e    0.359131
dtype: float64

### Selecting data

A Series can be treated much like a list.

In [8]:
apts.iloc[[4, 3, 1]]

Shenzhen     50000.0
Shanghai     60000.0
Guangzhou    25000.0
Name: price, dtype: float64

In [9]:
apts[1:]

Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

In [10]:
apts[:-1]

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Name: price, dtype: float64
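The cells above select by position. As a hedged sketch, position-based and label-based selection can be spelled out explicitly with `.iloc` and `.loc`. Two assumptions to keep in mind: recent pandas versions warn when integers inside plain `[]` are treated as positions on a label-indexed Series, and on current Python and pandas the dict's insertion order is kept, so the row order may differ from the outputs shown above.

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000,
                  'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': None},
                 name="price")

by_position = apts.iloc[[4, 3, 1]]            # 5th, 4th and 2nd rows
tail = apts.iloc[1:]                          # everything except the first row
by_label = apts.loc['Shanghai':'Hangzhou']    # label slices include both endpoints

print(by_position)
print(by_label)
```

Note that a label slice with `.loc` includes its end label, unlike a positional slice.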

Why does the following produce two NaN values?

In [14]:
apts[1:] + apts[:-1]

Beijing           NaN
Guangzhou     50000.0
Hangzhou      40000.0
Shanghai     120000.0
Shenzhen     100000.0
Suzhou            NaN
dtype: float64
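The reason is that arithmetic between two Series aligns on index labels, not on position. Beijing is missing from `apts[1:]` and Suzhou is missing from `apts[:-1]` (and is NaN anyway), so neither label has a value on both sides. A minimal sketch with a three-city Series, chosen here only for illustration:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000.0, 'Guangzhou': 25000.0, 'Hangzhou': 20000.0})

left = apts.iloc[1:]     # Guangzhou, Hangzhou (no Beijing)
right = apts.iloc[:-1]   # Beijing, Guangzhou (no Hangzhou)

total = left + right     # labels are matched up before adding
print(total)
```

Only Guangzhou appears on both sides, so it is the only label with a numeric result. The other two labels come out as NaN.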

A Series also behaves like a dict: the index defined earlier is what we use to select data.

In [6]:
apts["Hangzhou"]

20000.0

In [11]:
apts[["Hangzhou", "Beijing", "Shenzhen"]]

Hangzhou    20000.0
Beijing     55000.0
Shenzhen    50000.0
Name: price, dtype: float64

In [12]:
"Hangzhou" in apts

True

In [14]:
"Chongqing" in apts

False

In [15]:
apts.get("Chongqing")

In [20]:
apts["Chongqing"]

KeyError: 'Chongqing'

In [16]:
apts.get("Hangzhou")

20000.0
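As with a dict, `get` returns `None` for a missing label instead of raising `KeyError`, and it accepts a fallback value. A short sketch on a reduced Series (the fallback of 0 is just an example):

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000, 'Hangzhou': 20000}, name="price")

missing = apts.get("Chongqing")        # no such label, so None is returned
fallback = apts.get("Chongqing", 0)    # explicit default instead of None
present = apts.get("Hangzhou")         # existing label, returns its value

print(missing, fallback, present)
```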

Boolean indexing works much as it does in NumPy.

In [8]:
apts[apts < 50000]

Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64

In [6]:
apts.median()

50000.0

In [5]:
apts[apts > apts.median()]

Beijing     55000.0
Shanghai    60000.0
dtype: float64

Let's look in more detail at how this boolean indexing works.

In [10]:
less_than_50000 = apts < 50000
print(less_than_50000)

Beijing      False
Guangzhou     True
Hangzhou      True
Shanghai     False
Shenzhen     False
Suzhou       False
dtype: bool


In [11]:
print(apts[less_than_50000])

Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64
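Boolean masks can also be combined. The sketch below assumes the same prices as above and uses `&` (and) and `|` (or). Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`.

```python
import pandas as pd
import numpy as np

apts = pd.Series({'Beijing': 55000, 'Shanghai': 60000, 'Shenzhen': 50000,
                  'Hangzhou': 20000, 'Guangzhou': 25000, 'Suzhou': np.nan},
                 name="price")

mid_range = apts[(apts > 20000) & (apts < 60000)]    # strictly between the bounds
extremes = apts[(apts <= 20000) | (apts >= 60000)]   # at or beyond either bound

print(mid_range)
print(extremes)
```

Comparisons against NaN evaluate to False, so Suzhou is left out of both selections.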


### Assigning to Series elements

The elements of a Series can be assigned new values.

In [13]:
print("Old value: ", apts['Shenzhen'])
apts['Shenzhen'] = 55000
print("New value: ", apts['Shenzhen'])

Old value:  50000.0
New value:  55000.0


The boolean indexing described earlier can also be used for assignment.

In [14]:
print(apts[apts < 50000])
print()
apts[apts <= 50000] = 40000
print(apts[apts < 50000])

Guangzhou    25000.0
Hangzhou     20000.0
dtype: float64

Guangzhou    40000.0
Hangzhou     40000.0
dtype: float64


### Arithmetic

Next, some basic arithmetic operations.

In [15]:
apts / 2

Beijing      27500.0
Guangzhou    20000.0
Hangzhou     20000.0
Shanghai     30000.0
Shenzhen     27500.0
Suzhou           NaN
dtype: float64

In [19]:
apts ** 2

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
dtype: float64

NumPy operations can be applied to pandas objects as well.

In [18]:
np.square(apts)

Beijing      3.025000e+09
Guangzhou    1.600000e+09
Hangzhou     1.600000e+09
Shanghai     3.600000e+09
Shenzhen     3.025000e+09
Suzhou                NaN
Name: price, dtype: float64

Now let's define a second Series and add the two together.

In [20]:
cars = pd.Series({'Beijing': 300000, 'Shanghai': 400000, 'Shenzhen': 300000,
                  'Tianjin': 200000, 'Guangzhou': 200000, 'Chongqing': 150000})
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [22]:
print(cars + apts * 100)

Beijing      5800000.0
Chongqing          NaN
Guangzhou    4200000.0
Hangzhou           NaN
Shanghai     6400000.0
Shenzhen     5800000.0
Suzhou             NaN
Tianjin            NaN
dtype: float64
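Every city that appears in only one of the two Series ends up as NaN, because there is nothing to add on the other side. If a missing entry should instead count as zero, one option is the `add` method with `fill_value`. A minimal sketch on two small Series made up for illustration:

```python
import pandas as pd

cars = pd.Series({'Beijing': 300000, 'Tianjin': 200000})
apts = pd.Series({'Beijing': 55000.0, 'Hangzhou': 20000.0})

plain = cars + apts * 100                      # unmatched labels become NaN
filled = cars.add(apts * 100, fill_value=0)    # unmatched labels are treated as 0

print(plain)
print(filled)
```

Whether treating a missing value as zero is appropriate depends on what the data means, so this is a choice rather than a default.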


### Missing data

[reference](https://pandas.pydata.org/pandas-docs/stable/missing_data.html)

In [23]:
print('Hangzhou' in apts)
print('Hangzhou' in cars)

True
False


In [24]:
apts.notnull()

Beijing       True
Guangzhou     True
Hangzhou      True
Shanghai      True
Shenzhen      True
Suzhou       False
dtype: bool

In [29]:
print(apts.isnull())
print()

Beijing      False
Guangzhou    False
Hangzhou     False
Shanghai     False
Shenzhen     False
Suzhou        True
dtype: bool



In [30]:
print(apts[apts.isnull()])

Suzhou   NaN
dtype: float64


In [19]:
print(apts[apts.notnull()])

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Name: price, dtype: float64
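Besides filtering with a boolean mask, pandas provides `dropna` and `fillna` for handling missing values. A short sketch on a reduced Series. Filling with the mean is only one possible choice, used here for illustration:

```python
import pandas as pd
import numpy as np

apts = pd.Series({'Beijing': 55000.0, 'Guangzhou': 25000.0, 'Suzhou': np.nan},
                 name="price")

cleaned = apts.dropna()               # drop entries whose value is missing
filled = apts.fillna(apts.mean())     # mean() skips NaN by default

print(cleaned)
print(filled)
```

Both methods return a new Series and leave `apts` itself unchanged.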


## The [DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) data structure

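A DataFrame is a table. Where a Series represents a one-dimensional array, a DataFrame is two-dimensional, comparable to an Excel spreadsheet. A DataFrame can also be viewed as a collection of Series that share one index.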

### Creating a DataFrame

A DataFrame can be constructed from a dictionary.

In [33]:
data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
       'year': [2016,2017,2016,2017,2016, 2016],
       'population': [2100, 2300, 1000, 700, 500, 500]}
print(pd.DataFrame(data))

        city  population  year
0    Beijing        2100  2016
1   Shanghai        2300  2017
2  Guangzhou        1000  2016
3   Shenzhen         700  2017
4   Hangzhou         500  2016
5  Chongqing         500  2016


The names and order of the columns can be specified.

In [34]:
print(pd.DataFrame(data, columns=['year', 'city', 'population']))

   year       city  population
0  2016    Beijing        2100
1  2017   Shanghai        2300
2  2016  Guangzhou        1000
3  2017   Shenzhen         700
4  2016   Hangzhou         500
5  2016  Chongqing         500


In [37]:
frame2 = pd.DataFrame(data,
                      columns=['year', 'city', 'population', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)

       year       city  population debt
one    2016    Beijing        2100  NaN
two    2017   Shanghai        2300  NaN
three  2016  Guangzhou        1000  NaN
four   2017   Shenzhen         700  NaN
five   2016   Hangzhou         500  NaN
six    2016  Chongqing         500  NaN


A DataFrame can also be built from several Series.

In [18]:
apts

Beijing      55000.0
Guangzhou    25000.0
Hangzhou     20000.0
Shanghai     60000.0
Shenzhen     50000.0
Suzhou           NaN
Name: price, dtype: float64

In [21]:
cars

Beijing      300000
Chongqing    150000
Guangzhou    200000
Shanghai     400000
Shenzhen     300000
Tianjin      200000
dtype: int64

In [22]:
df = pd.DataFrame({"apts": apts, "cars": cars})
df

              apts      cars
Beijing    55000.0  300000.0
Chongqing      NaN  150000.0
Guangzhou  25000.0  200000.0
Hangzhou   20000.0       NaN
Shanghai   60000.0  400000.0
Shenzhen   50000.0  300000.0
Suzhou         NaN       NaN
Tianjin        NaN  200000.0
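The resulting rows are the union of the two Series' indexes, and each column read back out is again a Series. A minimal sketch with two small Series made up for illustration:

```python
import pandas as pd

apts = pd.Series({'Beijing': 55000.0, 'Hangzhou': 20000.0})
cars = pd.Series({'Beijing': 300000, 'Tianjin': 200000})

df = pd.DataFrame({"apts": apts, "cars": cars})

col = df["apts"]              # a single column is a Series
print(type(col).__name__)
print(df.index.tolist())      # union of both indexes
print(df["cars"].dtype)       # integer data becomes float once NaN is present
```

This also explains why the `cars` values above are printed as floats: Hangzhou and Suzhou have no car price, and NaN forces the column to a floating-point dtype.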


A DataFrame can also be built from a list of dicts.

In [24]:
data = [{"July": 999999, "Han": 50000, "Zewei": 1000}, {"July": 99999, "Han": 8000, "Zewei": 200}]
pd.DataFrame(data)

     Han    July  Zewei
0  50000  999999   1000
1   8000   99999    200


In [25]:
pd.DataFrame(data, index=["salary", "bonus"])

          Han    July  Zewei
salary  50000  999999   1000
bonus    8000   99999    200
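Each dict becomes one row and its keys become column names. The dicts do not need identical keys: as a sketch, the hypothetical records below each omit one name, and the gaps are filled with NaN.

```python
import pandas as pd

# Hypothetical records: each row is missing one of the three keys.
records = [{"July": 999999, "Han": 50000},
           {"July": 99999, "Zewei": 200}]

df = pd.DataFrame(records, index=["salary", "bonus"])
print(df)
```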


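## Exercise

- Build three Series holding the unit price, unit of measure, and quantity for a set of goods. Which goods and which units you use is up to you.
- Then combine the three Series into one DataFrame.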

In [24]:
price = pd.Series([20, 2, 3, 50, 40],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
unit = pd.Series(["kg", "each", "each", "each", "kg"],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
amount = pd.Series([5, 10, 6, 1, 2],
             index=["Apple", "Banana", "Orange", "Watermelon", "Strawberry"])
fruit_df = pd.DataFrame({"price": price, "unit": unit, "amount": amount}, columns=["price", "unit", "amount"])
fruit_df

            price  unit  amount
Apple          20    kg       5
Banana          2  each      10
Orange          3  each       6
Watermelon     50  each       1
Strawberry     40    kg       2
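As an optional extension of the exercise (not part of the original task), the arithmetic shown earlier for Series also works on DataFrame columns. The sketch below adds a hypothetical `total` column equal to price times amount.

```python
import pandas as pd

items = ["Apple", "Banana", "Orange", "Watermelon", "Strawberry"]
price = pd.Series([20, 2, 3, 50, 40], index=items)
unit = pd.Series(["kg", "each", "each", "each", "kg"], index=items)
amount = pd.Series([5, 10, 6, 1, 2], index=items)

fruit_df = pd.DataFrame({"price": price, "unit": unit, "amount": amount})

# Multiplying two columns works element-wise, matched by row label.
fruit_df["total"] = fruit_df["price"] * fruit_df["amount"]

print(fruit_df)
print(fruit_df["total"].sum())   # 268
```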
