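#### How do you apply a function to each group in Pandas?
#### Background: Pandas GroupBy follows the split-apply-combine pattern
Here, split is done by pandas' groupby, the apply step is a function we write ourselves, and pandas combines the values that function returns into the final result.
#### GroupBy.apply(function)
+ The first argument of function is a DataFrame holding the rows of one group
+ The return value of function can be a DataFrame, a Series, or a single value; it does not even have to be related to the input DataFrame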
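
The three return shapes can be sketched on a small made-up DataFrame. The column names `key` and `val` and the three helper functions are hypothetical and exist only for illustration; they are not part of the datasets used in the examples.

```python
import pandas as pd

# Toy data; `key` and `val` are hypothetical names used only for illustration
toy = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 3, 10, 30]})

def as_scalar(g):
    # One value per group: pandas combines them into a Series indexed by key
    return g["val"].max() - g["val"].min()

def as_series(g):
    # One Series per group: pandas combines them into a DataFrame, one row per key
    return pd.Series({"lo": g["val"].min(), "hi": g["val"].max()})

def as_frame(g):
    # One DataFrame per group: pandas concatenates the pieces
    return g[g["val"] == g["val"].max()]

# Selecting [["val"]] passes only the value column to each function
grouped = toy.groupby("key")[["val"]]
spread = grouped.apply(as_scalar)
bounds = grouped.apply(as_series)
top_rows = grouped.apply(as_frame)

print(spread)
print(bounds)
print(top_rows)
```

In each case the split and combine steps are handled by pandas; only the per-group function changes.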

#### Examples in this walkthrough:
1. How do you normalize a numeric column within each group?
2. How do you get the top N rows of each group?

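#### Example 1: How do you normalize a numeric column within each group?
Normalization maps numeric columns with different ranges onto the interval [0, 1]:
+ It makes columns easier to compare side by side, for example a price column ranging from hundreds to thousands against a growth-rate column ranging from 0 to 100
+ Machine learning models often train faster and fit better on normalized features

The normalization formula:

$$X_{\text{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$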
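
As a quick worked example of the formula, the sketch below normalizes a made-up price column. The values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical prices, only for illustrating the formula
prices = pd.Series([200.0, 500.0, 1200.0, 2200.0])

# (X - X_min) / (X_max - X_min)
normalized = (prices - prices.min()) / (prices.max() - prices.min())
print(normalized.tolist())  # [0.0, 0.15, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.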

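#### Demo: normalizing users' movie ratings
Users rate on different scales: optimists tend to rate high and pessimists low, so the ratings are normalized per user.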

In [1]:
import pandas as pd

In [2]:
ratings=pd.read_csv('./files/ml-latest-small/ratings.csv')
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
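# Group by userId, then min-max normalize the rating column within each group
def ratings_norm(df):
    # df is the sub-DataFrame holding one user's ratings
    min_value=df['rating'].min()
    max_value=df['rating'].max()
    # (x - min) / (max - min); the result is NaN for a user whose ratings are all identical
    df['rating_norm']=(df['rating']-min_value)/(max_value-min_value)
    return df

# group_keys=False keeps the original row index instead of adding userId as an index level
ratings=ratings.groupby('userId', group_keys=False).apply(ratings_norm)

In [9]:
ratings[ratings['userId']==1].head()

,userId,movieId,rating,timestamp,rating_norm
0,1,1,4.0,964982703,0.75
1,1,3,4.0,964981247,0.75
2,1,6,4.0,964982224,0.75
3,1,47,5.0,964983815,1.0
4,1,50,5.0,964982931,1.0


For userId==1, whose ratings span 1 to 5, a rating of 4 normalizes to 0.75 and a rating of 5 to 1.0.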
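
Per-group min-max normalization can also be written with `GroupBy.transform`, which returns values aligned to the original rows and avoids modifying the group inside the function. The sketch below runs on made-up ratings; the `toy_ratings` data is hypothetical and not taken from ratings.csv.

```python
import pandas as pd

# Hypothetical ratings for two users, for illustration only
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "rating": [1.0, 4.0, 5.0, 2.0, 3.0],
})

by_user = toy_ratings.groupby("userId")["rating"]
# transform broadcasts each user's min and max back onto that user's rows
user_min = by_user.transform("min")
user_max = by_user.transform("max")
toy_ratings["rating_norm"] = (toy_ratings["rating"] - user_min) / (user_max - user_min)

print(toy_ratings["rating_norm"].tolist())  # [0.0, 0.75, 1.0, 0.0, 1.0]
```

Whether `apply` or `transform` reads better is a matter of taste; `transform` tends to fit when the result has one value per input row.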

#### Example 2: How do you get the top N rows of each group?

In [11]:
fpath='./files/austin_weather.csv'
df=pd.read_csv(fpath)
df.head()

,Date,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,...,SeaLevelPressureAvgInches,SeaLevelPressureLowInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches,Events
0,2013-12-21,74,60,45,67,49,43,93,75,57,...,29.68,29.59,10,7,2,20,4,31,0.46,"Rain , Thunderstorm"
1,2013-12-22,56,48,39,43,36,28,93,68,43,...,30.13,29.87,10,10,5,16,6,25,0,
2,2013-12-23,58,45,32,31,27,23,76,52,27,...,30.49,30.41,10,10,10,8,3,12,0,
3,2013-12-24,61,46,31,36,28,21,89,56,22,...,30.45,30.3,10,10,7,12,4,20,0,
4,2013-12-25,58,50,41,44,40,36,86,71,56,...,30.33,30.27,10,10,7,10,2,16,T,


In [14]:
df.dtypes

Date                          object
TempHighF                      int64
TempAvgF                       int64
TempLowF                       int64
DewPointHighF                 object
DewPointAvgF                  object
DewPointLowF                  object
HumidityHighPercent           object
HumidityAvgPercent            object
HumidityLowPercent            object
SeaLevelPressureHighInches    object
SeaLevelPressureAvgInches     object
SeaLevelPressureLowInches     object
VisibilityHighMiles           object
VisibilityAvgMiles            object
VisibilityLowMiles            object
WindHighMPH                   object
WindAvgMPH                    object
WindGustMPH                   object
PrecipitationSumInches        object
Events                        object
dtype: object

Goal: for each month in the data, get the 2 days with the highest temperature.

In [15]:
# Add a month column (the first 7 characters of Date, i.e. YYYY-MM)
df['month']=df['Date'].str[:7]
df.head()

,Date,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,...,SeaLevelPressureLowInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches,Events,month
0,2013-12-21,74,60,45,67,49,43,93,75,57,...,29.59,10,7,2,20,4,31,0.46,"Rain , Thunderstorm",2013-12
1,2013-12-22,56,48,39,43,36,28,93,68,43,...,29.87,10,10,5,16,6,25,0,,2013-12
2,2013-12-23,58,45,32,31,27,23,76,52,27,...,30.41,10,10,10,8,3,12,0,,2013-12
3,2013-12-24,61,46,31,36,28,21,89,56,22,...,30.3,10,10,7,12,4,20,0,,2013-12
4,2013-12-25,58,50,41,44,40,36,86,71,56,...,30.27,10,10,7,10,2,16,T,,2013-12
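
Slicing the first 7 characters works because Date is stored as a `YYYY-MM-DD` string. As a hedged alternative, the sketch below parses the strings as dates and truncates them to a monthly period; the `dates` Series is made-up sample data in the same layout.

```python
import pandas as pd

# Hypothetical date strings in the same YYYY-MM-DD layout
dates = pd.Series(["2013-12-21", "2013-12-31", "2014-01-01"])

# String slicing: the first 7 characters are the year and month
month_by_slice = dates.str[:7]

# Datetime parsing: validates the dates, then truncates to a monthly period
month_by_period = pd.to_datetime(dates).dt.to_period("M").astype(str)

print(month_by_slice.tolist())   # ['2013-12', '2013-12', '2014-01']
print(month_by_period.tolist())  # ['2013-12', '2013-12', '2014-01']
```

Both approaches give the same grouping key here; parsing has the side benefit of raising an error on malformed dates.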


In [16]:
def getTempTopN(df, topn):
    # df here is the sub-DataFrame for one month group; sort ascending and keep the last topn rows
    return df.sort_values(by='TempHighF')[['Date','TempHighF']][-topn:]

df.groupby('month').apply(getTempTopN, topn=2).head()

month,,Date,TempHighF
2013-12,8,2013-12-29,64
2013-12,0,2013-12-21,74
2014-01,41,2014-01-31,80
2014-01,30,2014-01-20,82
2014-02,64,2014-02-23,86
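
A top-N per group can also be taken without a custom function, by sorting first and then keeping the first N rows of each group with `GroupBy.head`. The sketch below uses a small made-up frame; `toy_weather` and its values are hypothetical and only mimic the Date and TempHighF columns.

```python
import pandas as pd

# Hypothetical daily highs, for illustration only
toy_weather = pd.DataFrame({
    "Date": ["2013-12-21", "2013-12-22", "2013-12-29",
             "2014-01-20", "2014-01-30", "2014-01-31"],
    "TempHighF": [74, 56, 64, 82, 70, 80],
})
toy_weather["month"] = toy_weather["Date"].str[:7]

# Sort hottest-first, keep the first 2 rows of every month,
# then order the result by month for readability
top2 = (
    toy_weather.sort_values("TempHighF", ascending=False)
    .groupby("month")
    .head(2)
    .sort_values(["month", "TempHighF"], ascending=[True, False])
)
print(top2[["month", "Date", "TempHighF"]].to_string(index=False))
```

Unlike the `apply` version, this keeps the original flat row index and lists each month's rows from hottest to coolest.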
