# Summary functions & Maps
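In the previous section we covered how to read and index data. In practice, however, data rarely arrives in exactly the form we need, and we then have to reformat it ourselves.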

In [1]:
import pandas as pd

We again load the wine review dataset used in the previous chapter.

In [2]:
wine_reviews = pd.read_csv('winemag-data-50k-v2.csv', index_col = 0)
wine_reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


### Pandas provides many simple, handy summary functions, such as **describe**:

In [3]:
wine_reviews.points.describe()

count    50000.000000
mean        88.454700
std          3.048189
min         80.000000
25%         86.000000
50%         88.000000
75%         91.000000
max        100.000000
Name: points, dtype: float64

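As you can see, **describe** produces a high-level summary of the selected field (the column with that label), and the content of the summary depends on the data type. This column is numeric, so the summary consists of descriptive statistics.

If the selected field holds strings instead, the result looks like this: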

In [4]:
wine_reviews.taster_name.describe()

count          39750
unique            19
top       Roger Voss
freq            9923
Name: taster_name, dtype: object
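To see both kinds of summary side by side without loading the full dataset, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine dataset
toy = pd.DataFrame({
    "points": [87, 90, 87, 95],
    "taster": ["Ann", "Bob", "Ann", None],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy["points"].describe()

# String column: count (non-missing), unique, top (most frequent), freq
text_summary = toy["taster"].describe()

print(numeric_summary)
print(text_summary)
```

Note that `count` ignores missing values, which is why the string summary of the wine data reports 39750 rather than 50000.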

If you only need a simple statistic, pandas offers the same kind of mathematical functions as NumPy. For example, the mean of the scores:

In [5]:
wine_reviews.points.mean()

88.4547
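One difference from plain NumPy is worth keeping in mind: pandas reductions skip missing values by default, while `np.mean` on an array propagates NaN. The sketch below uses a made-up price series (the values are assumptions for illustration) to show this.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.Series([15.0, np.nan, 14.0, 13.0, 65.0])

pandas_mean = prices.mean()            # NaN is skipped: (15 + 14 + 13 + 65) / 4
pandas_median = prices.median()        # median of the four non-missing values
pandas_max = prices.max()

numpy_mean = np.mean(prices.to_numpy())        # NaN propagates
numpy_nanmean = np.nanmean(prices.to_numpy())  # NaN-aware NumPy variant

print(pandas_mean, pandas_median, pandas_max)
print(numpy_mean, numpy_nanmean)
```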

Or, to list the unique values that occur in a column:

In [6]:
wine_reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

Next, let's see how many times each of these unique values occurs in the dataset:

In [7]:
wine_reviews.taster_name.value_counts()

Roger Voss            9923
Michael Schachner     5893
Kerin O’Keefe         4151
Paul Gregutt          3734
Virginie Boone        3600
Matt Kettmann         2320
Joe Czerwinski        2021
Sean P. Sullivan      1819
Anna Lee C. Iijima    1666
Jim Gordon            1543
Anne Krebiehl MW      1387
Lauren Buzzeo          695
Susan Kostrzewa        449
Jeff Jenssen           180
Mike DeSimone          165
Alexander Peartree     151
Carrie Dykes            39
Fiona Adams             11
Christina Pickard        3
Name: taster_name, dtype: int64
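Notice that `unique()` above lists `nan` among its values, while `value_counts()` leaves missing entries out by default. The following sketch, on a short made-up series, shows how the `dropna` and `normalize` parameters change the result.

```python
import pandas as pd

# Hypothetical sample of taster names with one missing entry
tasters = pd.Series(["Roger Voss", "Paul Gregutt", "Roger Voss", None, "Roger Voss"])

counts = tasters.value_counts()                          # missing values are excluded
counts_with_missing = tasters.value_counts(dropna=False) # missing values get their own row
shares = tasters.value_counts(normalize=True)            # fractions of the non-missing rows

print(counts)
print(counts_with_missing)
print(shares)
```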

## Maps
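A **map** can be understood as a function or method that turns one set of values into another set of values. In everyday data processing we often need to operate on a DataFrame row by row, column by column, or element by element; pandas' **map** and **apply** cover the vast majority of these needs.

### Let's start with map()
map() applies a custom function to every element of a Series.
Suppose the wine shop is running a big sale with everything 50% off, and we need to update the price field in the dataset so that every price is halved. How?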

In [8]:
wine_reviews['price'] = wine_reviews['price'].map(lambda x: x * 0.5)
wine_reviews.head(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,7.5,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,7.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,6.5,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,32.5,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


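**map()** accepts any function we need. The lambda used above is a common Python idiom, but many people find the syntax confusing. Written as an ordinary Python function, it looks like this: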

In [9]:
def function_g(x):
    return x * 0.5

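Easier to follow this way, isn't it? Here lambda shortens the function definition and makes the code more concise, whereas a named function is more explicit and easier to read. To achieve the same effect as above, we can equally write: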

In [10]:
wine_reviews['price'] = wine_reviews['price'].map(function_g)
wine_reviews.head(5)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,3.75,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,3.5,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,3.25,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,16.25,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


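You can see that the price field has been halved once more, because this cell ran on the prices that were already discounted. When using map this way, pass just the name of the function you defined inside the parentheses (`function_g`, not `function_g()`).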
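For plain arithmetic like this, a lambda, a named function, and ordinary vectorized arithmetic all give the same result. The sketch below checks that on a made-up price series and writes the discount to a separate variable, so re-running the cell does not halve the prices again. The name `half_price` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a missing value
prices = pd.Series([15.0, 14.0, np.nan, 65.0])

def half_price(x):
    """Return half of a single price."""
    return x * 0.5

via_lambda = prices.map(lambda x: x * 0.5)   # element by element, anonymous function
via_function = prices.map(half_price)        # element by element, named function
vectorized = prices * 0.5                    # whole-Series arithmetic, no Python-level loop

# The original Series is untouched, so this cell can be re-run safely
print(via_lambda.equals(via_function), via_function.equals(vectorized))
print(prices.tolist())
```

The vectorized form is usually the fastest of the three; map() becomes more useful when the per-element logic cannot be expressed as simple arithmetic.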

Here is another common, very simple **map** example. Suppose we have a dataset recording the PE scores of the boys and girls in a class:

In [11]:
data = pd.DataFrame({
    "scores":[67,56,90,87],
    "gender":[1,1,2,2],
})
data

Unnamed: 0,scores,gender
0,67,1
1,56,1
2,90,2
3,87,2


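To save time when recording the data, the teacher wrote 1 for boys and 2 for girls. To make the table easier to read, we want to turn 1 and 2 back into the labels they stand for, "male" and "female":

In [12]:
data['gender'] = data['gender'].map({1: "male", 2: "female"})
data

Unnamed: 0,scores,gender
0,67,male
1,56,male
2,90,female
3,87,female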
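One thing to watch when passing a dict to map(): any value that has no matching key becomes NaN rather than being left as it was. A minimal sketch, assuming a made-up code 3 that the dict does not cover:

```python
import pandas as pd

# Hypothetical gender codes; 3 is not covered by the mapping
gender_codes = pd.Series([1, 2, 3])
mapping = {1: "male", 2: "female"}

labels = gender_codes.map(mapping)                       # unmapped code becomes NaN
labels_filled = gender_codes.map(mapping).fillna("unknown")  # one way to supply a default

print(labels.isna().tolist())
print(labels_filled.tolist())
```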


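### Now let's look at apply()
The **apply** method works much like **map**: it applies a custom function to each row or each column of a DataFrame.

Going back to our wine dataset, this time we want to center the points field (x' = x - μ):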

In [13]:
def remean_points(row):
    row.points = row.points - review_points_mean
    return row

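review_points_mean = wine_reviews.points.mean()  # compute the mean of the points field first

# axis='columns' (equivalent to axis=1) calls the function once per row;
# axis='index' (equivalent to axis=0) would call it once per column instead
wine_reviews.apply(remean_points, axis='columns')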

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,-1.4547,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,-1.4547,3.75,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,-1.4547,3.50,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,-1.4547,3.25,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,-1.4547,16.25,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks
...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,US,This is one of the best California Chardonnays...,Estate Vineyards,1.5453,5.00,California,Chalk Hill,Sonoma,,,Rodney Strong 2010 Estate Vineyards Chardonnay...,Chardonnay,Rodney Strong
49996,Italy,"Powerful, thick and concentrated, this has plu...",,1.5453,15.00,Veneto,Amarone della Valpolicella,,,,San Cassiano 2006 Amarone della Valpolicella,"Corvina, Rondinella, Molinara",San Cassiano
49997,US,"Lush and generous, this has a center of blackb...",,1.5453,6.25,Washington,Yakima Valley,Columbia Valley,Paul Gregutt,@paulgwine,Sheridan Vineyard 2010 Cabernet Sauvignon (Yak...,Cabernet Sauvignon,Sheridan Vineyard
49998,Italy,The beautiful thing about this wine is the ric...,Le Rive,1.5453,11.25,Veneto,Soave Classico,,,,Suavia 2008 Le Rive (Soave Classico),Garganega,Suavia
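The `axis` argument is easy to mix up: `axis='columns'` hands the function one row at a time, and `axis='index'` hands it one column at a time. The sketch below illustrates this on a tiny made-up DataFrame and also shows that simple centering can be done without apply() at all.

```python
import pandas as pd

# Hypothetical two-column frame
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis='columns': the function receives each row, so we get one result per row
row_sums = df.apply(lambda row: row["a"] + row["b"], axis="columns")

# axis='index': the function receives each column, so we get one result per column
col_sums = df.apply(lambda col: col.sum(), axis="index")

# Centering a single column with vectorized arithmetic instead of apply()
centered = df["a"] - df["a"].mean()

print(row_sums.tolist())
print(col_sums.tolist())
print(centered.tolist())
```

On a frame as large as the wine dataset, the vectorized form is typically much faster than a row-wise apply(), since apply() calls the Python function once per row.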


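Note that **map()** and **apply()** return a new object; neither of them modifies the original data. (The price column changed earlier only because we assigned the result of map() back to it.)

If we look at the first row of the original data, we can see that the value of the points field has not changed.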

In [14]:
wine_reviews.head(1)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
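To keep a transformed result, it has to be assigned somewhere. A minimal sketch on made-up scores, where the column name `points_centered` is an assumption for illustration:

```python
import pandas as pd

# Hypothetical scores
df = pd.DataFrame({"points": [87, 90, 93]})
mean_points = df["points"].mean()

# map() returns a new Series; df itself is not modified by this line
centered = df["points"].map(lambda p: p - mean_points)
print(df["points"].tolist())

# Store the result explicitly, here in a new column so the raw scores are kept
df["points_centered"] = centered
print(df)
```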
