# CH7 Numeric Operations (Chopping and Prepping the Ingredients)

In [36]:
import pandas as pd
import numpy as np

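## 7.1 Value Replacement

Value replacement means exactly what the name suggests: swapping value A for value B. In practice it is used for handling outliers and filling in missing values. There are three main kinds of replacement:

1. One-to-one replacement
2. Many-to-one replacement
3. Many-to-many replacement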

### 7.1.1 One-to-One Replacement

In pandas, a single value is replaced with the replace() method.

In [37]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","赵恒"],
             "唯一识别码":[101,102,103,104,104],
             "年龄":[31,45,23,240,240],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-12"]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间
0,A1,张通,101,31,2018-8-8
1,A2,李谷,102,45,2018-8-9
2,A3,孙凤,103,23,2018-8-10
3,A4,赵恒,104,240,2018-8-11
4,A5,赵恒,104,240,2018-8-12


In [38]:
df["年龄"] = df["年龄"].replace(240,33)  # assign the result back; inplace=True on a selected column is unreliable in recent pandas
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间
0,A1,张通,101,31,2018-8-8
1,A2,李谷,102,45,2018-8-9
2,A3,孙凤,103,23,2018-8-10
3,A4,赵恒,104,33,2018-8-11
4,A5,赵恒,104,33,2018-8-12


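The operation above targets only the 年龄 (age) column. When applied to the whole table to replace missing values, replace() behaves like fillna().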

In [39]:
data_dict = {"订单编号":["A1","A2",np.nan,"A4"],
             "年龄":[54,16,np.nan,41],
             "注册时间":["2018-8-8","2018-8-9",np.nan,"2018-8-11"]}
df = pd.DataFrame(data_dict)
df.replace(np.NaN,0)

Unnamed: 0,订单编号,年龄,注册时间
0,A1,54.0,2018-8-8
1,A2,16.0,2018-8-9
2,0,0.0,0
3,A4,41.0,2018-8-11
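
The equivalence between replace() and fillna() for missing values can be checked directly. The following is a minimal sketch on a hypothetical all-numeric table (the 积分 column and the variable names are illustrative and not part of the original data):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric table with one missing value per column
scores = pd.DataFrame({"年龄": [54, 16, np.nan, 41],      # age
                       "积分": [3.5, np.nan, 1.0, 2.0]})  # points (illustrative column)

via_replace = scores.replace(np.nan, 0)  # generic value replacement
via_fillna = scores.fillna(0)            # dedicated missing-value filler

# Both calls return new tables and leave `scores` untouched
print(via_replace.equals(via_fillna))
print(scores.isna().sum().sum())  # the original still has its 2 missing values
```

Because neither call modifies the original table, the result has to be assigned to a name if it is needed later.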


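### 7.1.2 Many-to-One Replacement

Many-to-one replacement maps several different values to one single value, again with the replace() method.<br>
The difference from one-to-one replacement is that the first argument (to the left of the comma) is a list.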

In [40]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5","A6"],
             "客户姓名":["张通","李谷","孙凤","赵恒","赵恒","王丹"],
             "唯一识别码":[101,102,103,104,104,105],
             "年龄":[31,45,23,240,260,280],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-12","2018-8-12"]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间
0,A1,张通,101,31,2018-8-8
1,A2,李谷,102,45,2018-8-9
2,A3,孙凤,103,23,2018-8-10
3,A4,赵恒,104,240,2018-8-11
4,A5,赵恒,104,260,2018-8-12
5,A6,王丹,105,280,2018-8-12


In [41]:
df["年龄"] = df["年龄"].replace([240,260,280],33)  # assign the result back instead of using inplace=True
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间
0,A1,张通,101,31,2018-8-8
1,A2,李谷,102,45,2018-8-9
2,A3,孙凤,103,23,2018-8-10
3,A4,赵恒,104,33,2018-8-11
4,A5,赵恒,104,33,2018-8-12
5,A6,王丹,105,33,2018-8-12


### 7.1.3 Many-to-Many Replacement

Many-to-many replacement also uses replace(), but the old-to-new mapping is passed as a dictionary.

In [42]:
df = pd.DataFrame(data_dict)  # rebuild the table so that 240, 260 and 280 are present again
df["年龄"] = df["年龄"].replace({240:32,260:33,280:34})
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间
0,A1,张通,101,31,2018-8-8
1,A2,李谷,102,45,2018-8-9
2,A3,孙凤,103,23,2018-8-10
3,A4,赵恒,104,32,2018-8-11
4,A5,赵恒,104,33,2018-8-12
5,A6,王丹,105,34,2018-8-12
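
When the replacement should be limited to one column but called on the whole table, replace() also accepts a nested dictionary. The sketch below uses a small hypothetical table (the variable names and values are illustrative) in which 240 is an error in one column but a valid value in another:

```python
import pandas as pd

orders = pd.DataFrame({"年龄": [31, 240, 260],          # age, with two implausible values
                       "唯一识别码": [240, 102, 103]})   # id; 240 here is a legitimate value

# Outer key = column name, inner dict = old value -> new value
cleaned = orders.replace({"年龄": {240: 32, 260: 33}})

print(cleaned["年龄"].tolist())        # [31, 32, 33]
print(cleaned["唯一识别码"].tolist())  # [240, 102, 103]
```

Scoping the mapping to a column avoids accidentally rewriting the same number where it is valid.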


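## 7.2 Sorting Values

Sorting is a routine operation in data analysis. There are two common orders:

1. Descending: from largest to smallest
2. Ascending: from smallest to largest

In pandas, sorting is done with the sort_values(by = ["Col1","Col2"],ascending = [True, False]) method.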

### 7.2.1 Sorting by a Single Column

In [43]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [44]:
df.sort_values(by = ["销售ID"]) # ascending defaults to True

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
2,A3,孙凤,103,23,2018-8-10,1
1,A2,李谷,102,45,2018-8-9,2
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [45]:
df.sort_values(by = ["销售ID"], ascending = False)

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
4,A5,王娜,105,21,2018-8-11,3
1,A2,李谷,102,45,2018-8-9,2
3,A4,赵恒,104,36,2018-8-11,2
0,A1,张通,101,31,2018-8-8,1
2,A3,孙凤,103,23,2018-8-10,1


### 7.2.2 Sorting by a Column with Missing Values

When the table contains missing values, the na_position parameter controls whether they are placed first or last.

In [46]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,np.nan,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1.0
1,A2,李谷,102,45,2018-8-9,2.0
2,A3,孙凤,103,23,2018-8-10,
3,A4,赵恒,104,36,2018-8-11,2.0
4,A5,王娜,105,21,2018-8-11,3.0


In [47]:
df.sort_values(by = ["销售ID"]) # na_position defaults to "last"

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1.0
1,A2,李谷,102,45,2018-8-9,2.0
3,A4,赵恒,104,36,2018-8-11,2.0
4,A5,王娜,105,21,2018-8-11,3.0
2,A3,孙凤,103,23,2018-8-10,


In [48]:
df.sort_values(by = ["销售ID"], na_position= "first")

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
2,A3,孙凤,103,23,2018-8-10,
0,A1,张通,101,31,2018-8-8,1.0
1,A2,李谷,102,45,2018-8-9,2.0
3,A4,赵恒,104,36,2018-8-11,2.0
4,A5,王娜,105,21,2018-8-11,3.0


### 7.2.3 Sorting by Multiple Columns

In [49]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [50]:
df.sort_values(by = ["销售ID","成交时间"], ascending = [True, False])

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
2,A3,孙凤,103,23,2018-8-10,1
1,A2,李谷,102,45,2018-8-9,2
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3
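
One caveat with this multi-column sort: 成交时间 (deal time) is stored as text, so it is compared character by character rather than chronologically. The following sketch, which assumes that converting the column with pd.to_datetime is acceptable for the analysis, shows how the two orders differ on the same data:

```python
import pandas as pd

deals = pd.DataFrame({"销售ID": [1, 2, 1, 2, 3],
                      "成交时间": ["2018-8-8", "2018-8-9", "2018-8-10",
                                   "2018-8-11", "2018-8-11"]})

# As text, "2018-8-8" sorts after "2018-8-10" because "8" > "1"
by_text = deals.sort_values(by=["销售ID", "成交时间"], ascending=[True, False])

# Convert to real dates first so the comparison is chronological
deals["成交时间"] = pd.to_datetime(deals["成交时间"])
by_date = deals.sort_values(by=["销售ID", "成交时间"], ascending=[True, False])

print(by_text.index.tolist())  # [0, 2, 1, 3, 4]
print(by_date.index.tolist())  # [2, 0, 3, 1, 4]
```

With real dates, the most recent deal of each salesperson comes first, which is usually what a descending sort on a time column is meant to give.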


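## 7.3 Ranking Values

Ranking is the counterpart of sorting: rather than reordering the rows, it produces a rank for each value, starting from 1, which is usually stored in a new column.<br>
In pandas this is done with the rank() method.

Its method parameter decides how tied values are ranked:

1. average: tied values share the mean of the ranks they occupy; for example, two values tied for 1st and 2nd place both get 1.5
2. first: tied values are ranked by their order of appearance in the data
3. min: every tied value gets the smallest (best) rank in its group
4. max: the opposite of min; every tied value gets the largest rank in its group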

In [51]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [52]:
df["销售ID"].rank(method = "average")

0    1.5
1    3.5
2    1.5
3    3.5
4    5.0
Name: 销售ID, dtype: float64

In [53]:
df["销售ID"].rank(method = "first")

0    1.0
1    3.0
2    2.0
3    4.0
4    5.0
Name: 销售ID, dtype: float64

In [54]:
df["销售ID"].rank(method = "min")

0    1.0
1    3.0
2    1.0
3    3.0
4    5.0
Name: 销售ID, dtype: float64

In [55]:
df["销售ID"].rank(method = "max")

0    2.0
1    4.0
2    2.0
3    4.0
4    5.0
Name: 销售ID, dtype: float64
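
Since rank() only returns the ranks, they still have to be written into the table. A minimal sketch, assuming a descending ranking is wanted and using an illustrative column name 排名 (rank) that does not appear in the original:

```python
import pandas as pd

sales = pd.DataFrame({"销售ID": [1, 2, 1, 2, 3]})

# Rank from largest to smallest; tied values all get the best rank in their group
sales["排名"] = sales["销售ID"].rank(method="min", ascending=False).astype(int)

print(sales["排名"].tolist())  # [4, 2, 4, 2, 1]
```

The astype(int) step is only safe here because the column has no missing values; rank() returns floats by default.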

## 7.4 Deleting Data

Deleting data means removing rows or columns that are no longer needed from a table.

In [61]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


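### 7.4.1 Deleting Columns

In pandas, columns are deleted with the drop() method.<br>
Pass the names of the columns to delete together with axis = 1, which indicates that columns are being dropped.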

In [62]:
df.drop(["销售ID","成交时间"], axis = 1)

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄
0,A1,张通,101,31
1,A2,李谷,102,45
2,A3,孙凤,103,23
3,A4,赵恒,104,36
4,A5,王娜,105,21


The drop() method also accepts columns selected by position (here through df.columns[[4,5]]); the axis parameter is still required.

In [64]:
df.drop(df.columns[[4,5]], axis = 1)

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄
0,A1,张通,101,31
1,A2,李谷,102,45
2,A3,孙凤,103,23
3,A4,赵恒,104,36
4,A5,王娜,105,21


If the columns parameter is used directly, the axis parameter is not needed.

In [65]:
df.drop(columns = ["销售ID","成交时间"])

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄
0,A1,张通,101,31
1,A2,李谷,102,45
2,A3,孙凤,103,23
3,A4,赵恒,104,36
4,A5,王娜,105,21


### 7.4.2 Deleting Rows

Rows are deleted with drop() as well, by passing the row labels together with axis = 0.

In [67]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df.index = ["0a","1b","2c","3d","4e"]
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0a,A1,张通,101,31,2018-8-8,1
1b,A2,李谷,102,45,2018-8-9,2
2c,A3,孙凤,103,23,2018-8-10,1
3d,A4,赵恒,104,36,2018-8-11,2
4e,A5,王娜,105,21,2018-8-11,3


In [68]:
df.drop(["0a","1b"], axis = 0)

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
2c,A3,孙凤,103,23,2018-8-10,1
3d,A4,赵恒,104,36,2018-8-11,2
4e,A5,王娜,105,21,2018-8-11,3


Rows can also be selected by position (here through df.index[[0,1]]); the axis parameter is again required and set to 0.

In [72]:
df.drop(df.index[[0,1]], axis = 0)

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
2c,A3,孙凤,103,23,2018-8-10,1
3d,A4,赵恒,104,36,2018-8-11,2
4e,A5,王娜,105,21,2018-8-11,3


In [74]:
df.drop(index = ["0a", "1b"])

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
2c,A3,孙凤,103,23,2018-8-10,1
3d,A4,赵恒,104,36,2018-8-11,2
4e,A5,王娜,105,21,2018-8-11,3


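### 7.4.3 Deleting Specific Rows

To delete the rows where 年龄 (age) is greater than 40, work in reverse: select the rows where age is at most 40 and use the result as the new data source.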

In [75]:
df[df["年龄"]<=40]

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0a,A1,张通,101,31,2018-8-8,1
2c,A3,孙凤,103,23,2018-8-10,1
3d,A4,赵恒,104,36,2018-8-11,2
4e,A5,王娜,105,21,2018-8-11,3
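
The same result can also be reached by dropping the matching rows explicitly. The following sketch uses a hypothetical three-row table that includes an age of exactly 40, to make the boundary case visible:

```python
import pandas as pd

people = pd.DataFrame({"客户姓名": ["张通", "李谷", "孙凤"],
                       "年龄": [31, 45, 40]},
                      index=["0a", "1b", "2c"])

kept_by_filter = people[people["年龄"] <= 40]                  # keep what should stay
kept_by_drop = people.drop(people[people["年龄"] > 40].index)  # drop what should go

print(kept_by_filter.equals(kept_by_drop))
print(kept_by_filter.index.tolist())  # ['0a', '2c']
```

Filtering is usually shorter; the drop() form can read more naturally when the condition describes the rows to remove.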


## 7.5 Counting Values

Value counting reports how many times each distinct value occurs in a column.

In [76]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [77]:
df["销售ID"].value_counts()

2    2
1    2
3    1
Name: 销售ID, dtype: int64

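The normalize and sort parameters can also be passed: normalize returns proportions instead of counts, and sort controls whether the result is ordered by count.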

In [78]:
df["销售ID"].value_counts(normalize = True, sort = False)

1    0.4
2    0.4
3    0.2
Name: 销售ID, dtype: float64
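
The order in which tied counts are listed can differ between pandas versions, so when a stable order matters it may help to sort by the values themselves. A minimal sketch, assuming ordering by the 销售ID value is what is wanted:

```python
import pandas as pd

sales_id = pd.Series([1, 2, 1, 2, 3], name="销售ID")

# Proportions per value, ordered by the value rather than by frequency
shares = sales_id.value_counts(normalize=True).sort_index()

print(shares.index.tolist())  # [1, 2, 3]
print(shares.tolist())        # [0.4, 0.4, 0.2]
```

The proportions add up to 1, so they can be read directly as the share of each salesperson.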

## 7.6 Getting Unique Values

Getting unique values returns what remains of a series of values after the duplicates are removed.

In [57]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [58]:
df["销售ID"].unique()

array([1, 2, 3], dtype=int64)
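
Two related methods are often useful alongside unique(). This sketch contrasts them on the same values; nunique() and drop_duplicates() are standard pandas methods, but they are not used in the original text:

```python
import pandas as pd

sales_id = pd.Series([1, 2, 1, 2, 3], name="销售ID")

print(sales_id.unique().tolist())                 # [1, 2, 3], in order of first appearance
print(sales_id.nunique())                         # 3
print(sales_id.drop_duplicates().index.tolist())  # [0, 1, 4], first occurrence of each value
```

unique() returns a plain array, whereas drop_duplicates() keeps the original row labels, which matters when the result has to be matched back to the table.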

## 7.7 Looking Up Values

Value lookup checks whether the data in a table contains a given value or set of values.

In [59]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


In [60]:
df["年龄"].isin([31,21])

0     True
1    False
2    False
3    False
4     True
Name: 年龄, dtype: bool
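
The boolean result of isin() is typically used as a filter. A short sketch on a reduced version of the table (the variable names are illustrative):

```python
import pandas as pd

people = pd.DataFrame({"客户姓名": ["张通", "李谷", "孙凤", "赵恒", "王娜"],
                       "年龄": [31, 45, 23, 36, 21]})

mask = people["年龄"].isin([31, 21])

print(people[mask]["客户姓名"].tolist())   # ['张通', '王娜']
print(people[~mask]["客户姓名"].tolist())  # ['李谷', '孙凤', '赵恒']
print(mask.any())                          # True
```

The ~ operator inverts the mask, which gives a "not in" filter, and any() answers the yes/no question of whether the column contains at least one of the values.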

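## 7.8 Binning into Intervals

Interval binning divides a series of values into several groups. For example, dividing the ages below into 3 intervals is a binning operation.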

In [82]:
data_dict = {"年龄":np.arange(1,11)}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,年龄
0,1
1,2
2,3
3,4
4,5
5,6
6,7
7,8
8,9
9,10


In [84]:
pd.cut(df["年龄"],bins = [0,3,6,10])

0     (0, 3]
1     (0, 3]
2     (0, 3]
3     (3, 6]
4     (3, 6]
5     (3, 6]
6    (6, 10]
7    (6, 10]
8    (6, 10]
9    (6, 10]
Name: 年龄, dtype: category
Categories (3, interval[int64]): [(0, 3] < (3, 6] < (6, 10]]

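A similar method to cut() is qcut(). Instead of explicit bin edges it takes the number of bins and picks the edges from quantiles, so that each bin holds roughly the same number of values.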

In [89]:
pd.qcut(df["年龄"],3)

0    (0.999, 4.0]
1    (0.999, 4.0]
2    (0.999, 4.0]
3    (0.999, 4.0]
4      (4.0, 7.0]
5      (4.0, 7.0]
6      (4.0, 7.0]
7     (7.0, 10.0]
8     (7.0, 10.0]
9     (7.0, 10.0]
Name: 年龄, dtype: category
Categories (3, interval[float64]): [(0.999, 4.0] < (4.0, 7.0] < (7.0, 10.0]]
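
The difference between the two methods shows up in how many values land in each bin. A minimal sketch using the same ages; labels=False, which returns bin numbers instead of intervals, is used here only to keep the output compact:

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.arange(1, 11), name="年龄")

# labels=False returns the bin number (0, 1, 2) instead of the interval itself
cut_codes = pd.cut(ages, bins=[0, 3, 6, 10], labels=False)
qcut_codes = pd.qcut(ages, 3, labels=False)

print(cut_codes.value_counts().sort_index().tolist())   # [3, 3, 4]
print(qcut_codes.value_counts().sort_index().tolist())  # [4, 3, 3]
```

With cut() the bin sizes follow from the chosen edges; with qcut() the edges follow from the data.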

The intervals can also be recorded in a new column by assigning the labels manually.

In [98]:
df.loc[(df["年龄"] >0) & (df["年龄"] <=3),"区间"] = "(0,3]"
df.loc[(df["年龄"] >3) & (df["年龄"] <=6),"区间"] = "(3,6]"
df.loc[(df["年龄"] >6) & (df["年龄"] <=9),"区间"] = "(6,9]"
df.loc[df["年龄"] >9,"区间"] = "大于9"
df

Unnamed: 0,年龄,区间
0,1,"(0,3]"
1,2,"(0,3]"
2,3,"(0,3]"
3,4,"(3,6]"
4,5,"(3,6]"
5,6,"(3,6]"
6,7,"(6,9]"
7,8,"(6,9]"
8,9,"(6,9]"
9,10,大于9
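
The same labelled column can probably be produced in one step by giving cut() a labels argument. This is a sketch of an alternative, not the approach used above; the open-ended last bin is expressed with np.inf:

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"年龄": np.arange(1, 11)})

# One label per bin; the last bin (9, inf] catches everything above 9
ages["区间"] = pd.cut(ages["年龄"],
                      bins=[0, 3, 6, 9, np.inf],
                      labels=["(0,3]", "(3,6]", "(6,9]", "大于9"])

print(ages["区间"].astype(str).tolist())
```

One difference from the manual version is that the resulting column is categorical rather than plain text, so the labels keep their order when sorted or grouped.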


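## 7.9 Inserting New Rows or Columns

The insert() method adds a column at a given position; its arguments are the position, the new column name and the values.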

In [100]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df.insert(2,"商品类别",["cat01","cat02","cat03","cat04","cat05"])
df

Unnamed: 0,订单编号,客户姓名,商品类别,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,cat01,101,31,2018-8-8,1
1,A2,李谷,cat02,102,45,2018-8-9,2
2,A3,孙凤,cat03,103,23,2018-8-10,1
3,A4,赵恒,cat04,104,36,2018-8-11,2
4,A5,王娜,cat05,105,21,2018-8-11,3


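A column can also be added by plain index assignment: set the new column name equal to a list of values. A new name is appended as the last column; here 商品类别 (product category) already exists, so it is overwritten in place.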

In [103]:
df["商品类别"] = ["cat01","cat02","cat03","cat04","cat05"]
df

Unnamed: 0,订单编号,客户姓名,商品类别,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,cat01,101,31,2018-8-8,1
1,A2,李谷,cat02,102,45,2018-8-9,2
2,A3,孙凤,cat03,103,23,2018-8-10,1
3,A4,赵恒,cat04,104,36,2018-8-11,2
4,A5,王娜,cat05,105,21,2018-8-11,3
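
The examples in this section only insert columns. For rows, one common option is pd.concat; the sketch below is an assumption about how a row might be added, with a hypothetical new order A3, and is not taken from the original text:

```python
import pandas as pd

orders = pd.DataFrame({"订单编号": ["A1", "A2"],
                       "年龄": [31, 45]})

# Hypothetical new order, built as a one-row table with the same columns
new_row = pd.DataFrame({"订单编号": ["A3"], "年龄": [23]})

# ignore_index=True renumbers the rows 0..n-1 after stacking the tables
orders = pd.concat([orders, new_row], ignore_index=True)

print(orders["订单编号"].tolist())  # ['A1', 'A2', 'A3']
print(orders.index.tolist())        # [0, 1, 2]
```

Unlike insert(), this appends at the end; placing a row at a specific position would need an additional reordering step.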


## 7.10 Swapping Rows and Columns

Swapping rows and columns is also called transposing. In pandas it is done with the T attribute.

In [104]:
data_dict = {"订单编号":["A1","A2","A3","A4","A5"],
             "客户姓名":["张通","李谷","孙凤","赵恒","王娜"],
             "唯一识别码":[101,102,103,104,105],
             "年龄":[31,45,23,36,21],
             "成交时间":["2018-8-8","2018-8-9","2018-8-10","2018-8-11","2018-8-11"],
             "销售ID":[1,2,1,2,3]}
df = pd.DataFrame(data_dict)
df.T

Unnamed: 0,0,1,2,3,4
订单编号,A1,A2,A3,A4,A5
客户姓名,张通,李谷,孙凤,赵恒,王娜
唯一识别码,101,102,103,104,105
年龄,31,45,23,36,21
成交时间,2018-8-8,2018-8-9,2018-8-10,2018-8-11,2018-8-11
销售ID,1,2,1,2,3


Transposing twice in a row gives back the original table.

In [105]:
df.T.T

Unnamed: 0,订单编号,客户姓名,唯一识别码,年龄,成交时间,销售ID
0,A1,张通,101,31,2018-8-8,1
1,A2,李谷,102,45,2018-8-9,2
2,A3,孙凤,103,23,2018-8-10,1
3,A4,赵恒,104,36,2018-8-11,2
4,A5,王娜,105,21,2018-8-11,3


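## 7.11 Reshaping the Index

Reshaping the index restructures the existing index. In pandas this is done with the stack() method, which moves the column labels into an inner level of the row index.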

In [107]:
data_dict = {"C1":[1,4],
             "C2":[2,5],
             "C3":[3,6]}
df = pd.DataFrame(data_dict)
df.index = ["S1","S2"]  # assign row labels; df.set_index = [...] would overwrite the method instead
df

Unnamed: 0,C1,C2,C3
S1,1,2,3
S2,4,5,6


In [108]:
df.stack()

S1  C1    1
    C2    2
    C3    3
S2  C1    4
    C2    5
    C3    6
dtype: int64
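
The stacked result is a Series with a two-level index, and unstack() reverses the operation. A minimal sketch, assuming row labels S1 and S2 (the variable names are illustrative):

```python
import pandas as pd

table = pd.DataFrame({"C1": [1, 4], "C2": [2, 5], "C3": [3, 6]},
                     index=["S1", "S2"])

stacked = table.stack()       # Series indexed by (row label, column label)
print(stacked.loc[("S2", "C3")])   # 6
print(stacked.index.nlevels)       # 2

restored = stacked.unstack()  # moves the inner level back to the columns
print(restored.values.tolist())    # [[1, 2, 3], [4, 5, 6]]
```

Individual cells of the stacked form are addressed with a (row, column) pair, which is what makes it convenient as a long format.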

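## 7.12 Converting Between Long and Wide Tables

Converting between long and wide tables is a necessary preparation step before data analysis. One thing must be kept in mind:<br>
the table needs to have common columns that identify each record.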

### 7.12.1 Wide to Long

Converting a wide table to a long one within Excel is quite tedious.<br>
In pandas there are two ways to do it:

1. The stack() method
2. The melt() method

In [113]:
data_dict = {"Company":["Apple","Google","Facebook"],
             "Name":["苹果","谷歌","脸书"],
             "Sales2013":[5000,3500,2300],
             "Sales2014":[5050,3800,2900],
             "Sales2015":[5050,3800,2900],
             "Sales2016":[5050,3800,2900]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,Company,Name,Sales2013,Sales2014,Sales2015,Sales2016
0,Apple,苹果,5000,5050,5050,5050
1,Google,谷歌,3500,3800,3800,3800
2,Facebook,脸书,2300,2900,2900,2900


In [118]:
df2 = df.set_index(["Company","Name"]).stack()
df2.reset_index()

Unnamed: 0,Company,Name,level_2,0
0,Apple,苹果,Sales2013,5000
1,Apple,苹果,Sales2014,5050
2,Apple,苹果,Sales2015,5050
3,Apple,苹果,Sales2016,5050
4,Google,谷歌,Sales2013,3500
5,Google,谷歌,Sales2014,3800
6,Google,谷歌,Sales2015,3800
7,Google,谷歌,Sales2016,3800
8,Facebook,脸书,Sales2013,2300
9,Facebook,脸书,Sales2014,2900


In [120]:
df.melt(id_vars = ["Company","Name"], var_name = "Year", value_name = "Sales")

Unnamed: 0,Company,Name,Year,Sales
0,Apple,苹果,Sales2013,5000
1,Google,谷歌,Sales2013,3500
2,Facebook,脸书,Sales2013,2300
3,Apple,苹果,Sales2014,5050
4,Google,谷歌,Sales2014,3800
5,Facebook,脸书,Sales2014,2900
6,Apple,苹果,Sales2015,5050
7,Google,谷歌,Sales2015,3800
8,Facebook,脸书,Sales2015,2900
9,Apple,苹果,Sales2016,5050


### 7.12.2 Long to Wide

This is essentially the inverse of the wide-to-long conversion. The usual tool for it is the pivot table.

See CH10 for details.
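
As a brief preview only, the idea can be sketched with DataFrame.pivot on a small hypothetical long table. This is an illustration under the assumption that each Company and Year pair occurs once; it is not necessarily the approach taken in CH10:

```python
import pandas as pd

long_table = pd.DataFrame({"Company": ["Apple", "Google", "Apple", "Google"],
                           "Year": ["Sales2013", "Sales2013", "Sales2014", "Sales2014"],
                           "Sales": [5000, 3500, 5050, 3800]})

# Each distinct Year becomes a column; Company is the common column that identifies a row
wide_table = long_table.pivot(index="Company", columns="Year", values="Sales")

print(wide_table.loc["Apple", "Sales2014"])  # 5050
print(wide_table.columns.tolist())           # ['Sales2013', 'Sales2014']
```

If a pair occurred more than once, pivot() would raise an error and an aggregating pivot table would be needed instead.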

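## 7.13 The apply() and applymap() Functions

These are two of the more advanced pandas functions. Called on a column, apply() runs a function on each of its elements, while applymap() runs a function on every element of the whole table. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().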

In [109]:
data_dict = {"C1":[1,4,7],
             "C2":[2,5,8],
             "C3":[3,6,9]}
df = pd.DataFrame(data_dict)
df

Unnamed: 0,C1,C2,C3
0,1,2,3
1,4,5,6
2,7,8,9


In [110]:
df["C1"].apply(lambda x:x+1)

0    2
1    5
2    8
Name: C1, dtype: int64

In [112]:
df.applymap(lambda x:x+1)

Unnamed: 0,C1,C2,C3
0,2,3,4
1,5,6,7
2,8,9,10
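
For simple arithmetic like adding 1, ordinary vectorized operations give the same result. The sketch below compares the two and also shows apply() working row by row; the hasattr check is a defensive assumption so that the snippet runs on both older and newer pandas versions:

```python
import pandas as pd

table = pd.DataFrame({"C1": [1, 4, 7], "C2": [2, 5, 8], "C3": [3, 6, 9]})

# Vectorized arithmetic gives the same values without calling a Python function per element
print((table["C1"] + 1).tolist() == table["C1"].apply(lambda x: x + 1).tolist())

# DataFrame.map is the newer name for applymap (pandas 2.1+); fall back on older versions
elementwise = table.map if hasattr(pd.DataFrame, "map") else table.applymap
plus_one = elementwise(lambda x: x + 1)
print(plus_one.values.tolist() == (table + 1).values.tolist())

# apply() on a DataFrame works column by column (axis=0) or row by row (axis=1)
row_sums = table.apply(lambda row: row.sum(), axis=1)
print(row_sums.tolist())  # [6, 15, 24]
```

apply() and applymap() earn their place when the function has no vectorized equivalent, such as custom string handling or conditional logic.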
