Text handling comes up often in pandas,
from cleaning up column names to processing actual text data.

The string methods in pandas mirror the interface of Python's built-in `str`:

#### Standard methods that mirror the string interface:
* lower()
* upper()
* len()
* strip()
* rstrip()
* lstrip()
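
The three strip variants in the list differ only in which side they trim. A minimal sketch, using a made-up `padded` Series that is not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample data with whitespace on different sides
padded = pd.Series(['  dog ', 'cat  ', '  bird'])

both = padded.str.strip()    # trims both ends
left = padded.str.lstrip()   # trims the left side only
right = padded.str.rstrip()  # trims the right side only
```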

In [1]:
import pandas as pd
import numpy as np
pd.__version__
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [4]:
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

In [5]:
s.str.upper()

0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

In [6]:
s.str.len()

0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64

In [16]:
df = pd.DataFrame(np.random.randn(3, 2), columns=[' Column A ', ' Column B '], index=range(3))

In [17]:
df

    Column A    Column B 
0   0.218645   -0.533317
1  -0.740273   -0.166863
2   0.476363    0.46436


In [20]:
df.columns

Index([u' Column A ', u' Column B '], dtype='object')

In [21]:
df.columns.str.strip()
# The official docs use this call, but Index has no .str accessor in the older pandas used here

AttributeError: 'Index' object has no attribute 'str'

In pandas >= 0.11, convert the index to a Series first:

df.index.to_series().str.contains('foo')

*In this pandas version, these methods exist only on Series objects.*

In [23]:
df.columns.to_series().str.strip()

 Column A     Column A
 Column B     Column B
dtype: object

In [24]:
df.columns.to_series().str.lower()

 Column A      column a 
 Column B      column b 
dtype: object

In [25]:
df.columns = df.columns.to_series().str.strip().str.lower().str.replace(' ', '_')

In [26]:
df

   column_a   column_b
0  0.218645  -0.533317
1 -0.740273  -0.166863
2  0.476363   0.46436
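
On newer pandas releases, `Index` has its own `.str` accessor, so the same cleanup can probably be written without `to_series()`. A minimal sketch under that assumption, with fixed values standing in for the random data:

```python
import pandas as pd

# Fixed values instead of np.random.randn so the result is reproducible
df = pd.DataFrame([[1, 2], [3, 4]], columns=[' Column A ', ' Column B '])

# Assumes a pandas release in which Index exposes the .str accessor
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
cleaned_names = list(df.columns)
```

If this raises the `AttributeError` shown earlier, the installed pandas predates the accessor and the `to_series()` form is the one to use.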


#### Splitting and Replacing

In [4]:
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
s2.str.split('_')

0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object

Each split result is an ordinary Python list, so individual elements can be retrieved with `get` or `[]`.

In [28]:
s2.str.split('_').str.get(1)

0      b
1      d
2    NaN
3      g
dtype: object

In [29]:
s2.str.split('_').str[1]

0      b
1      d
2    NaN
3      g
dtype: object

Use the `expand` keyword to split the result into a DataFrame.

New in version 0.16.1.

In [30]:
s2.str.split('_', expand=True)

TypeError: split() got an unexpected keyword argument 'expand'

This error comes from an older pandas version. After upgrading:

In [2]:
pd.__version__

'0.16.2'

In [5]:
s2.str.split('_', expand=True)

     0    1    2
0    a    b    c
1    c    d    e
2  NaN  NaN  NaN
3    f    g    h


In [6]:
s2.str.split('_', expand=True, n=1)
# n limits the number of splits

     0    1
0    a  b_c
1    c  d_e
2  NaN  NaN
3    f  g_h


In [7]:
s2.str.rsplit('_', expand=True, n=1)

     0    1
0  a_b    c
1  c_d    e
2  NaN  NaN
3  f_g    h


The `replace` and `findall` methods accept regular expressions.

In [9]:
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', '', np.nan, 'CABA', 'dog', 'cat'])

In [10]:
s3.str.replace('^.a|dog', 'XX-XX ', case=False)

0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6         NaN
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: object
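
Only `replace` is demonstrated above, so here is a small sketch of `findall` for comparison. The shortened Series and the pattern `[ab]` are illustrative choices, not taken from the original notebook:

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'Aaba', 'Baca', np.nan, 'CABA', 'dog'])

# For each element, collect every lowercase 'a' or 'b' as a list;
# elements with no hit give an empty list, missing values stay NaN
found = s.str.findall('[ab]')
```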

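Because the pattern is treated as a regular expression, check whether any character you want to replace is a regex metacharacter, and escape it if so.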
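
One way to handle this is to escape the literal text with `re.escape`. The sketch below uses a hypothetical `prices` Series and assumes a recent pandas in which `str.replace` accepts the `regex` keyword; that keyword does not exist in the 0.16.x release used in this notebook.

```python
import re
import pandas as pd

# Hypothetical data: '$' is a regex anchor, not a literal character
prices = pd.Series(['$10', '$250', 'free'])

# re.escape turns '$' into a pattern that matches a literal dollar sign
escaped = re.escape('$')

# regex=True is an assumption about the installed pandas version
cleaned = prices.str.replace(escaped, '', regex=True)
```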

In [11]:
pattern = r'[a-z][0-9]'

In [13]:
pd.Series(['1', 'a7', '3a', '3b', '03c']).str.contains(pattern)

0    False
1     True
2    False
3    False
4    False
dtype: bool

In [14]:
pd.Series(['1', 'a7', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)

0    False
1     True
2    False
3    False
4    False
dtype: bool

The distinction between match and contains is strictness: 

match relies on strict re.match, while contains relies on re.search.

search ⇒ find something anywhere in the string and return a match object.

match ⇒ find something at the beginning of the string and return a match object.
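
The difference is easiest to see with the standard `re` module directly. A minimal sketch, where the `values` list is made up for illustration:

```python
import re

pattern = r'[a-z][0-9]'
values = ['1', 'a7', '3a7', 'b2x']

# re.search scans the whole string for the pattern
searched = [bool(re.search(pattern, v)) for v in values]

# re.match only succeeds if the pattern matches at position 0
matched = [bool(re.match(pattern, v)) for v in values]
```

`'3a7'` is the interesting case: `search` finds `a7` in the middle, while `match` fails because the string starts with a digit.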

In [15]:
s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s4.str.contains('A', na=False)

0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: bool

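`contains` searches the whole string (it relies on `re.search`) and is case-sensitive by default, which is why 'CABA' matches here but 'Baca' does not. `na=False` turns the missing value into `False`.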
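
To see how case affects the result, the same call can be repeated with `case=False`. A short sketch reusing the `s4` data from above:

```python
import numpy as np
import pandas as pd

s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# case=False ignores case; na=False reports missing values as False
flags = s4.str.contains('A', case=False, na=False)
```

With case ignored, 'Baca' and 'cat' now match as well, even though the 'a' is not at the start of either string.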