#### Pandas string processing
We have already used a string-processing function earlier:
df['revenues'].str.replace('$','').astype('float')
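##### String processing in Pandas:
1. Usage: first get the Series' str accessor, then call the method on that accessor;
2. It works only on string columns, not on numeric columns;
3. A DataFrame has no str accessor or string methods;
4. Series.str is not Python's built-in str but its own set of methods, although most of them closely mirror the built-in str methods;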
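
The rules above can be checked on a tiny frame. This is a minimal sketch: the `demo` DataFrame and its values are hypothetical stand-ins, not read from the Fortune 1000 file.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the rules above
demo = pd.DataFrame({"name": ["Amazon", "Apple"], "employees": [1608000, 154000]})

# A string column exposes the .str accessor
upper_names = demo["name"].str.upper()

# A numeric column does not: accessing .str raises AttributeError
try:
    demo["employees"].str
    numeric_has_str = True
except AttributeError:
    numeric_has_str = False

# A DataFrame itself has no .str accessor
frame_has_str = hasattr(demo, "str")

print(upper_names.tolist(), numeric_has_str, frame_has_str)
```

To apply a string method to a numeric column, convert it first, for example with `astype(str)`.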

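##### What this section demonstrates:
1. Getting the Series' str accessor and using various string methods
2. Using the boolean Series returned by str methods such as startswith and contains for conditional filtering
3. Chaining several str operations
4. Processing with regular expressions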

#### Loading the data

In [2]:
import pandas as pd

In [3]:
fpath = "C:/Users/pxpxz_ct9p1p3/Downloads/Fortune_1000_Companies_by_Revenue.csv"
df = pd.read_csv(fpath)

#### 1. Getting the Series' str accessor and using various string methods

In [4]:
df['revenues'].str

<pandas.core.strings.accessor.StringMethods at 0x1f97c51e160>

In [None]:
# String replacement
df['revenues'].str.replace('$','')

In [5]:
# Check whether each value consists only of numeric characters
df['revenues'].str.isnumeric()

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Name: revenues, Length: 1000, dtype: bool
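
Every row is False because `isnumeric` is only True when all characters in the string are numeric, and the raw values contain `$`, `,`, or `.`. A small sketch with hypothetical values in the same format:

```python
import pandas as pd

# Hypothetical values in the same format as the revenues column
samples = pd.Series(["$469,822", "469822", "2122.20"])

# Only the string made up entirely of digits passes the check
numeric_flags = samples.str.isnumeric()
print(numeric_flags.tolist())
```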

In [6]:
df['revenues'].str.len() # cannot be used on numeric columns

0       9
1       9
2       9
3       9
4       9
       ..
995     7
996    10
997    10
998     7
999    10
Name: revenues, Length: 1000, dtype: int64

#### 2. Using the boolean Series from str methods such as startswith and contains for conditional filtering

In [7]:
condition = df['name'].str.startswith('A')

In [8]:
condition

0      False
1       True
2       True
3      False
4      False
       ...  
995    False
996    False
997    False
998     True
999    False
Name: name, Length: 1000, dtype: bool

In [9]:
df[condition].head()

| | rank | name | revenues | revenue_percent_change | profits | profits_percent_change | assets | market_value | change_in_rank | employees |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Amazon | $469,822 | 21.70% | $33,364 | 56.40% | $420,549 | $1,658,807.30 | - | 1608000 |
| 2 | 3 | Apple | $365,817 | 33.30% | $94,680 | 64.90% | $351,002 | $2,849,537.60 | - | 154000 |
| 7 | 8 | Alphabet | $257,637 | 41.20% | $76,033 | 88.80% | $359,268 | $1,842,326.10 | 1 | 156500 |
| 9 | 10 | AmerisourceBergen | $213,988.80 | 12.70% | $1,539.90 | - | $57,337.80 | $32,355.70 | -2 | 40000 |
| 12 | 13 | AT&T | $168,864 | -1.70% | $20,081 | - | $551,622 | $169,262.40 | -2 | 202600 |
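
Boolean Series from different str methods can be combined with `&` and `|` before filtering. The sketch below uses a hypothetical `names` Series. Passing `na=False` is an assumption that missing names should count as non-matches.

```python
import pandas as pd

# Hypothetical names, including one missing value
names = pd.Series(["Walmart", "Amazon", "Apple", "AT&T", None])

# na=False makes missing values evaluate to False instead of NaN
starts_with_a = names.str.startswith("A", na=False)
# regex=False treats the pattern as a plain substring
contains_pp = names.str.contains("pp", regex=False, na=False)

# Combine the two conditions and filter
result = names[starts_with_a & contains_pp]
print(result.tolist())
```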


#### 3. Chaining several str operations
How do we extract the pure number, e.g. $234,567 --> 234567?

In [None]:
df['profits'].str.replace('$','').str.replace(',','').str.replace('-','1').str.replace('(','').str.replace(')','').astype(float)
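
As a variation on the chain above, the `-` placeholder can be turned into a missing value instead of being replaced by a digit. This is a sketch on hypothetical values. It assumes `-` alone means "no data", and it swaps in `pd.to_numeric(..., errors="coerce")` for the final `astype(float)`.

```python
import pandas as pd

# Hypothetical values in the same format as the profits column
raw = pd.Series(["$33,364", "$1,539.90", "-"])

# Strip the currency symbol and thousands separators as literal text
stripped = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .str.strip()
)

# Anything that still is not a number (here the lone "-") becomes NaN
cleaned = pd.to_numeric(stripped, errors="coerce")
print(cleaned.tolist())
```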

In [11]:
# Each method call returns a new Series
df['name'].str.replace('b','B').slice(0,6) # Raises AttributeError: 'Series' object has no attribute 'slice'

AttributeError: 'Series' object has no attribute 'slice'

In [12]:
# So each string method call must go through the Series' str accessor again
df['name'].str.replace('b','B').str.slice(0,6)

0      Walmar
1      Amazon
2       Apple
3      CVS He
4      United
        ...  
995    Vizio 
996    1-800-
997     Cowen
998    Ashlan
999    DocuSi
Name: name, Length: 1000, dtype: object

In [13]:
# slice is equivalent to slicing syntax, which can be used directly on str
df['name'].str.replace('a','A').str[0:6]

0      WAlmAr
1      AmAzon
2       Apple
3      CVS He
4      United
        ...  
995    Vizio 
996    1-800-
997     Cowen
998    AshlAn
999    DocuSi
Name: name, Length: 1000, dtype: object
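
The equivalence of the two spellings can be verified directly. A minimal sketch on a hypothetical `names` Series:

```python
import pandas as pd

# Hypothetical company names
names = pd.Series(["Walmart", "Amazon", "Apple"])

# .str.slice(start, stop) and .str[start:stop] return the same result
via_slice = names.str.slice(0, 6)
via_brackets = names.str[0:6]
same = via_slice.equals(via_brackets)

# A single integer index picks one character per row
first_chars = names.str[0]
print(via_slice.tolist(), same, first_chars.tolist())
```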

#### 4. Processing with regular expressions

#### Question: how do we remove the $ and , from '$123,456'?

In [None]:
# Method 1: chained replace
df['revenues'].str.replace('$','').str.replace(',','')

In [14]:
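# Method 2: regular-expression replacement
# regex=True must be passed explicitly: since pandas 2.0 the default is regex=False
df['revenues'].str.replace('[$,]','', regex=True)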



0       572754 
1       469822 
2       365817 
3       292111 
4       287597 
         ...   
995       2124 
996    2122.20 
997    2112.80 
998       2111 
999    2107.20 
Name: revenues, Length: 1000, dtype: object
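
The output above still has dtype object and a trailing space in each value. A sketch of the remaining steps on hypothetical values in the same format, which also shows how `regex=False` treats the pattern as literal text:

```python
import pandas as pd

# Hypothetical values with the same trailing space as the real column
revenues = pd.Series(["$572,754 ", "$2,122.20 "])

# With regex=True the character class [$,] matches either symbol
cleaned = revenues.str.replace(r"[$,]", "", regex=True).str.strip()
as_float = cleaned.astype(float)

# With regex=False the pattern is the literal text "[$,]", so nothing matches
literal = revenues.str.replace("[$,]", "", regex=False)
unchanged = literal.equals(revenues)

print(cleaned.tolist(), as_float.tolist(), unchanged)
```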