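### How to add new columns in Pandas
In data analysis, you often need to create new columns based on some condition and then analyze them further.

1. Direct assignment
2. The df.apply method
3. The df.assign method
4. Selecting groups by condition and assigning to each group separately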

In [2]:
import pandas as pd

#### Read the CSV data into a DataFrame

In [64]:
fpath = "C:/Users/pxpxz_ct9p1p3/Downloads/Fortune_1000_Companies_by_Revenue.csv"
df = pd.read_csv(fpath)

In [65]:
df.head()

Unnamed: 0,rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
0,1,Walmart,"$572,754",2.40%,"$13,673",1.20%,"$244,860","$409,795",-,2300000
1,2,Amazon,"$469,822",21.70%,"$33,364",56.40%,"$420,549","$1,658,807.30",-,1608000
2,3,Apple,"$365,817",33.30%,"$94,680",64.90%,"$351,002","$2,849,537.60",-,154000
3,4,CVS Health,"$292,111",8.70%,"$7,910",10.20%,"$232,999","$132,839.20",-,258000
4,5,UnitedHealth Group,"$287,597",11.80%,"$17,285",12.20%,"$212,206","$479,830.30",-,350000
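Several columns in the output above use a lone "-" as a placeholder. As a minimal sketch, assuming that "-" marks a missing value (the source does not say so), `pd.read_csv` can convert it to NaN at load time through `na_values`. The two-row sample below is hypothetical and only mimics the layout of the real file.

```python
import io
import pandas as pd

# Hypothetical sample mimicking two columns of the CSV above
sample_csv = io.StringIO(
    'name,profits_percent_change\n'
    'Walmart,1.20%\n'
    'Ashland Global Holdings,-\n'
)

# na_values=['-'] only matches cells that are exactly "-";
# values such as "-5.2%" are left untouched
sample = pd.read_csv(sample_csv, na_values=['-'])
print(sample)
```

The column still holds strings because of the "%" suffix, so it needs the same cleaning as in the cells that follow.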


#### 1. Direct assignment
Example: clean the revenues and profits columns and convert them to a numeric type

In [66]:
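# Strip the "$" prefix and the thousands separators, then convert to float.
# regex=False treats "$" as a literal character rather than a regex anchor.
df['revenues'] = df['revenues'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('float')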


In [67]:
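# Strip "$", thousands separators, and parentheses; replace "-" with "0", then convert to float.
# Caution: replacing "-" also discards the sign of any negative value.
df['profits'] = df['profits'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False).str.replace('-', '0', regex=False).astype('float')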


In [68]:
# Inspect the cleaned revenues and profits columns
df.head()

Unnamed: 0,rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees
0,1,Walmart,572754.0,2.40%,13673.0,1.20%,"$244,860","$409,795",-,2300000
1,2,Amazon,469822.0,21.70%,33364.0,56.40%,"$420,549","$1,658,807.30",-,1608000
2,3,Apple,365817.0,33.30%,94680.0,64.90%,"$351,002","$2,849,537.60",-,154000
3,4,CVS Health,292111.0,8.70%,7910.0,10.20%,"$232,999","$132,839.20",-,258000
4,5,UnitedHealth Group,287597.0,11.80%,17285.0,12.20%,"$212,206","$479,830.30",-,350000
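The assets and market_value columns in this output are still strings. The cleaning shown above could be wrapped in a reusable helper, sketched below on hypothetical sample values. `currency_to_float` is an illustrative name, not part of pandas, and unlike the profits cell above it maps a lone "-" to NaN rather than 0.

```python
import pandas as pd

def currency_to_float(col: pd.Series) -> pd.Series:
    """Convert strings such as "$1,076.70" to floats (hypothetical helper)."""
    cleaned = (
        col.str.replace('$', '', regex=False)
           .str.replace(',', '', regex=False)
    )
    # errors='coerce' turns anything unparseable (e.g. a lone "-") into NaN
    return pd.to_numeric(cleaned, errors='coerce')

# Hypothetical sample values in the same format as the assets column
demo = pd.DataFrame({'assets': ['$244,860', '$1,076.70', '-']})
demo['assets_num'] = currency_to_float(demo['assets'])
print(demo)
```

Keeping the cleaned values in a new column, as here, leaves the raw strings available for checking.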


#### Example: compute expenses = revenues - profits

In [69]:
df.loc[:, 'expenses'] = df['revenues'] - df['profits']

In [70]:
df.head()

Unnamed: 0,rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees,expenses
0,1,Walmart,572754.0,2.40%,13673.0,1.20%,"$244,860","$409,795",-,2300000,559081.0
1,2,Amazon,469822.0,21.70%,33364.0,56.40%,"$420,549","$1,658,807.30",-,1608000,436458.0
2,3,Apple,365817.0,33.30%,94680.0,64.90%,"$351,002","$2,849,537.60",-,154000,271137.0
3,4,CVS Health,292111.0,8.70%,7910.0,10.20%,"$232,999","$132,839.20",-,258000,284201.0
4,5,UnitedHealth Group,287597.0,11.80%,17285.0,12.20%,"$212,206","$479,830.30",-,350000,270312.0


#### 2. The df.apply method
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame's index (axis=0) or the DataFrame's columns (axis=1).
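To make the axis distinction concrete, here is a small sketch on a hypothetical toy DataFrame (`toy` and its columns are made up for illustration).

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series indexed by column labels
row_sums = toy.apply(lambda row: row['a'] + row['b'], axis=1)

print(col_sums)
print(row_sums)
```

With axis=1 the function can look up fields by column name, which is what the example in this section relies on.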

Example: add a growth-speed column based on revenue_percent_change:
1. 20 or higher is fast
2. Below 10 is slow
3. Anything else is normal

In [71]:
# Convert the revenue_percent_change column to a numeric type
df['revenue_percent_change'] = df['revenue_percent_change'].str.replace('%','')
# Caution: removing "-" also drops the sign of any negative change
df['revenue_percent_change'] = df['revenue_percent_change'].str.replace('-','')
# Parse the cleaned strings in revenue_percent_change as numbers
df['revenue_percent_change'] = pd.to_numeric(df['revenue_percent_change'])


In [72]:
# Convert the profits_percent_change column to a numeric type
df['profits_percent_change'] = df['profits_percent_change'].str.replace('%','')
# Caution: removing "-" also drops the sign of any negative change
df['profits_percent_change'] = df['profits_percent_change'].str.replace('-','')
# Parse the cleaned strings in profits_percent_change as numbers
df['profits_percent_change'] = pd.to_numeric(df['profits_percent_change'])

In [73]:
# Convert the employees column to a numeric type
df['employees'] = df['employees'].str.replace(',','')
df['employees'] = df['employees'].str.replace('-','1')

In [74]:
df['employees'] = df['employees'].astype(int)

In [75]:
def get_change_type(row):
    # `row` is a single DataFrame row, passed in as a Series
    if row['revenue_percent_change'] >= 20:
        return 'fast'
    if row['revenue_percent_change'] < 10:
        return 'slow'
    return 'normal'

# Tip: axis=1 is required here, so that each row is passed as a Series whose index is the column labels
df.loc[:, 'change_type'] = df.apply(get_change_type, axis=1)

# Count how often each change_type value occurs
df['change_type'].value_counts()

fast      416
slow      296
normal    288
Name: change_type, dtype: int64
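Row-wise `apply` calls a Python function once per row. As an alternative sketch (not the method used in this notebook), the same thresholds can be expressed in vectorized form with `np.select`. The toy values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including a boundary case (20.0) and a missing value
toy = pd.DataFrame({'revenue_percent_change': [2.4, 21.7, 11.8, 20.0, np.nan]})

conditions = [
    toy['revenue_percent_change'] >= 20,
    toy['revenue_percent_change'] < 10,
]
choices = ['fast', 'slow']

# Rows matching no condition (including NaN) fall through to the default
toy['change_type'] = np.select(conditions, choices, default='normal')
print(toy)
```

Conditions are checked in order and the first match wins, mirroring the order of the `if` statements in `get_change_type`.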

#### 3. The df.assign method
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones.

In [76]:
# Several new columns can be added in a single call
df.assign(
          revenue_change = lambda df:df['revenue_percent_change']/100,
          profit_change = lambda df:df['profits_percent_change']/100
         )

Unnamed: 0,rank,name,revenues,revenue_percent_change,profits,profits_percent_change,assets,market_value,change_in_rank,employees,expenses,change_type,revenue_change,profit_change
0,1,Walmart,572754.0,2.4,13673.0,1.2,"$244,860","$409,795",-,2300000,559081.0,slow,0.024,0.012
1,2,Amazon,469822.0,21.7,33364.0,56.4,"$420,549","$1,658,807.30",-,1608000,436458.0,fast,0.217,0.564
2,3,Apple,365817.0,33.3,94680.0,64.9,"$351,002","$2,849,537.60",-,154000,271137.0,fast,0.333,0.649
3,4,CVS Health,292111.0,8.7,7910.0,10.2,"$232,999","$132,839.20",-,258000,284201.0,slow,0.087,0.102
4,5,UnitedHealth Group,287597.0,11.8,17285.0,12.2,"$212,206","$479,830.30",-,350000,270312.0,normal,0.118,0.122
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,Vizio Holding,2124.0,4.0,39.4,138.4,$935.80,"$1,705.10",-,800,2084.6,slow,0.040,1.384
996,997,1-800-Flowers.com,2122.2,42.5,118.7,101.1,"$1,076.70",$830,-,4800,2003.5,fast,0.425,1.011
997,998,Cowen,2112.8,30.2,295.6,36.6,"$8,748.80",$744.10,-,1534,1817.2,fast,0.302,0.366
998,999,Ashland Global Holdings,2111.0,11.2,220.0,,"$6,612","$5,601.90",-130,4100,1891.0,normal,0.112,


In [77]:
# assign does not modify df itself: the two new columns do not appear in .dtypes
df.dtypes

rank                       object
name                       object
revenues                  float64
revenue_percent_change    float64
profits                   float64
profits_percent_change    float64
assets                     object
market_value               object
change_in_rank             object
employees                   int32
expenses                  float64
change_type                object
dtype: object
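To keep the columns that `assign` creates, the returned DataFrame has to be captured. A minimal sketch on a hypothetical toy DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({'revenue_percent_change': [2.4, 21.7]})

# Calling assign without capturing the result leaves `toy` unchanged
toy.assign(revenue_change=lambda d: d['revenue_percent_change'] / 100)
unchanged_cols = list(toy.columns)

# Rebinding the name to the returned DataFrame keeps the new column
toy = toy.assign(revenue_change=lambda d: d['revenue_percent_change'] / 100)

print(unchanged_cols)
print(list(toy.columns))
```

Because each call returns a new DataFrame, `assign` also works well inside method chains.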

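#### 4. Selecting groups by condition and assigning to each group separately
First select the rows that match a condition, then assign the new column's value for those rows.
Example: if revenues/employees is less than 5, labor_density is intensive; otherwise it is not intensive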

In [78]:
# First create an empty column (this uses the first method, direct assignment)
df['labor_density'] = ''  # Broadcasting: a single value assigned to a column is copied to every row
df.loc[df['revenues']/df['employees']<5, 'labor_density'] = 'intensive'
df.loc[df['revenues']/df['employees']>=5, 'labor_density'] = 'not intensive'

In [79]:
df['labor_density'].value_counts()

intensive        962
not intensive     38
Name: labor_density, dtype: int64
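When there are only two groups, the two `.loc` assignments can be collapsed into one `np.where` call. This is an alternative sketch rather than the method shown above, and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so that each group appears once
toy = pd.DataFrame({
    'revenues': [572754.0, 50000.0],
    'employees': [2300000, 1000],
})

ratio = toy['revenues'] / toy['employees']

# np.where(condition, value_if_true, value_if_false), evaluated element-wise
toy['labor_density'] = np.where(ratio < 5, 'intensive', 'not intensive')
print(toy)
```

One difference to keep in mind: rows where the ratio is NaN fail the `ratio < 5` test and therefore get 'not intensive' here, whereas the two `.loc` assignments would leave them as the empty string.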