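### The Series.str accessor
The pandas Series.str accessor applies a string operation to every text element of a Series in one call. The result is usually a Series of the same length, or a DataFrame for methods that produce several columns.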

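Code|Purpose
---|---
Series.str.lower() / Series.str.upper()|Convert to lowercase / uppercase
Series.str.contains(pat, regex=True)|Test whether each text contains pat; returns booleans. pat is treated as a regular expression by default
Series.str.len()|Length of each text
Series.str.strip()|Remove leading and trailing whitespace from each text
Series.str.findall(pat)|Find all matches of the regular expression pat in each text; equivalent to applying re.findall() to every element
Series.str.replace(pat, repl, regex=False)|Replace pat with repl. The boolean regex parameter says whether pat is a regular expression; its default is False since pandas 2.0, so it is safest to pass it explicitly
Series.str.split(pat, expand=False, regex=None)|Split each text on pat. expand (default False) decides whether the result is a Series of lists or a DataFrame. regex (pandas 1.4+, default None) says whether pat is a regular expression; with None, a single-character pat is treated literally and a longer one as a regular expression
Series.str.extract(pat, expand=True)|Extract the capture groups of the regular expression pat. expand (default True) decides whether the result is a DataFrame or, for a single group with expand=False, a Series
Series.str.get_dummies(sep)|Split each text on sep (a vertical bar by default) and return a DataFrame of 0/1 indicator (dummy) columns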
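
The regex parameter mentioned in the table is easy to get wrong, so here is a minimal sketch of its effect. The sample Series s is hypothetical and is not part of the cells in this section.

```python
import pandas as pd

# Hypothetical sample data, not taken from the cells in this section
s = pd.Series(['a.b', 'a-b', ' c.d '])

# regex=False: '.' is a literal dot
literal = s.str.replace('.', '_', regex=False)

# regex=True: '[.-]' is a character class matching a dot or a hyphen
pattern = s.str.replace(r'[.-]', '_', regex=True)

# contains() treats pat as a regex by default;
# regex=False turns it into a plain substring test
has_dot = s.str.contains('.', regex=False)

# strip() removes leading and trailing whitespace only
stripped = s.str.strip()
```

Passing regex explicitly keeps the behavior the same across pandas versions, since the default of Series.str.replace changed in pandas 2.0.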

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'From_To': [' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


Select the From_To column; the result is a Series.

In [7]:
df.From_To

0         LoNDon_paris 
1         MAdrid_miLAN 
2     londON_StockhOlm 
3        Budapest_PaRis
4       Brussels_londOn
Name: From_To, dtype: object

### Series.str.upper/lower
Convert every item in the From_To column to uppercase.

In [8]:
df.From_To.str.upper()  

0         LONDON_PARIS 
1         MADRID_MILAN 
2     LONDON_STOCKHOLM 
3        BUDAPEST_PARIS
4       BRUSSELS_LONDON
Name: From_To, dtype: object

Convert every item in the From_To column to lowercase.

In [9]:
df.From_To.str.lower()  

0         london_paris 
1         madrid_milan 
2     london_stockholm 
3        budapest_paris
4       brussels_london
Name: From_To, dtype: object

### Series.str.len
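Compute the length of every item in the From_To column. The counts include the leading and trailing spaces that some items carry.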

In [10]:
df.From_To.str.len() 

0    14
1    14
2    18
3    14
4    15
Name: From_To, dtype: int64
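
Because the first three items are padded with spaces, the lengths above are larger than the visible text. A short sketch, rebuilding the From_To values as a standalone Series (the variable names are only for illustration), compares the raw and stripped lengths:

```python
import pandas as pd

from_to = pd.Series([' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                     'Budapest_PaRis', 'Brussels_londOn'], name='From_To')

# Length of each item exactly as stored, padding included
raw_len = from_to.str.len()

# Length after removing leading and trailing whitespace
clean_len = from_to.str.strip().str.len()
```

Here raw_len is 14, 14, 18, 14, 15 and clean_len is 12, 12, 16, 14, 15.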

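### Series.str.split
Split every item in the From_To column on "_". Note the expand parameter: with the default expand=False the result is a Series of lists, while expand=True returns a DataFrame with one column per piece.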

In [3]:
df.From_To.str.split('_')

0        [ LoNDon, paris ]
1        [ MAdrid, miLAN ]
2    [ londON, StockhOlm ]
3        [Budapest, PaRis]
4       [Brussels, londOn]
Name: From_To, dtype: object

In [4]:
df.From_To.str.split('_', expand=True)

Unnamed: 0,0,1
0,LoNDon,paris
1,MAdrid,miLAN
2,londON,StockhOlm
3,Budapest,PaRis
4,Brussels,londOn
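
A common follow-up is to give the split columns names and tidy their casing. The sketch below assumes the column names From and To, which do not appear in the original, and strips the padding before splitting.

```python
import pandas as pd

from_to = pd.Series([' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                     'Budapest_PaRis', 'Brussels_londOn'], name='From_To')

# Strip the padding first, then split into two columns
parts = from_to.str.strip().str.split('_', expand=True)

# Hypothetical column names chosen for this example
parts.columns = ['From', 'To']

# capitalize() upper-cases the first character and lower-cases the rest
parts['From'] = parts['From'].str.capitalize()
parts['To'] = parts['To'].str.capitalize()
```

After this, parts holds London/Paris, Madrid/Milan, London/Stockholm, Budapest/Paris and Brussels/London.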


### Series.str.contains
Check whether each item in the From_To column contains the substring 'Brussels'; the result is a boolean Series.

In [15]:
df.From_To.str.contains('Brussels')

0    False
1    False
2    False
3    False
4     True
Name: From_To, dtype: bool
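
The boolean result is typically used as a row filter. The sketch below also shows the case and na parameters of contains; the Series with_missing is a hypothetical example added to illustrate missing values.

```python
import pandas as pd

from_to = pd.Series([' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                     'Budapest_PaRis', 'Brussels_londOn'], name='From_To')

# case=False makes the match case-insensitive
mask = from_to.str.contains('london', case=False)

# Use the boolean Series to keep only the matching rows
london_routes = from_to[mask].str.strip()

# Hypothetical data with a missing value: na=False reports it as "no match"
with_missing = pd.Series(['Brussels_londOn', None])
safe_mask = with_missing.str.contains('Brussels', na=False)
```

Without na=False, a missing value yields a missing result rather than False, which cannot be used directly as a filter.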

### Series.str.startswith
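Check whether each item in the From_To column starts with 'B'; the result is a boolean Series. The column is stripped first because some items begin with a space.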

In [16]:
df.From_To.str.strip().str.startswith('B')

0    False
1    False
2    False
3     True
4     True
Name: From_To, dtype: bool

### Series.str.findall
The RecentDelays column holds lists rather than strings:

In [18]:
df.RecentDelays

0        [23, 47]
1              []
2    [24, 43, 87]
3            [13]
4        [67, 32]
Name: RecentDelays, dtype: object

### Series.str.extract
Clean the Airline column so that each item contains only letters and spaces. First, look at the data.

In [47]:
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


In [48]:
# Raw string, so that \s reaches the regex engine unchanged
df.Airline.str.extract(r'([a-zA-Z\s]+)')

Unnamed: 0,0
0,KLM
1,Air France
2,British Airways
3,Air France
4,Swiss Air
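
The pattern can also match a leading space (as in '12. Air France'), so a strip step is a reasonable addition. A minimal sketch, using expand=False to get a Series instead of a one-column DataFrame:

```python
import pandas as pd

airline = pd.Series(['KLM(!)', '<Air France> (12)', '(British Airways. )',
                     '12. Air France', '"Swiss Air"'], name='Airline')

# expand=False with a single capture group returns a Series;
# strip() then removes any whitespace captured at either end
cleaned = airline.str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
```

extract keeps only the first match in each item, which is why 'KLM(!)' becomes 'KLM'.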


Use a regular expression with two capture groups to extract the origin and the destination from the From_To column.

In [56]:
# One capture group per output column; raw string for the \s escape
df.From_To.str.extract(r'([a-zA-Z\s]+)_([a-zA-Z\s]+)')

Unnamed: 0,0,1
0,LoNDon,paris
1,MAdrid,miLAN
2,londON,StockhOlm
3,Budapest,PaRis
4,Brussels,londOn
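
Named capture groups give the resulting columns readable names instead of 0 and 1. In this sketch the group names origin and dest are assumptions made for illustration.

```python
import pandas as pd

from_to = pd.Series([' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                     'Budapest_PaRis', 'Brussels_londOn'], name='From_To')

# (?P<name>...) names a group; extract() uses the names as column labels.
# The padding is stripped first, so the pattern only needs letters.
routes = from_to.str.strip().str.extract(
    r'(?P<origin>[a-zA-Z]+)_(?P<dest>[a-zA-Z]+)')
```

The result has the columns origin and dest, with the original casing preserved.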


### Series.str.findall(pat)
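Find all occurrences of pat in each item of the Series. Each result is a list of matches, which is empty when the item does not contain pat.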

In [19]:
s = pd.Series(['Lion', 'Monkey', 'Rabbit'])
s

0      Lion
1    Monkey
2    Rabbit
dtype: object

In [22]:
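# Helper that returns the first match of a findall() result, or None
# when the list is empty. It is defined here but not used in this cell.
def maper(into):
    if into:
        return into[0]
    else:
        return None

s.str.findall('Monkey')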

0          []
1    [Monkey]
2          []
dtype: object
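
Since findall returns a list per item, a small helper can reduce each list to its first match. The sketch below is one way to do that with Series.map; the function name first_or_none is hypothetical.

```python
import pandas as pd

s = pd.Series(['Lion', 'Monkey', 'Rabbit'])

def first_or_none(matches):
    """Return the first match of a findall() result, or None if it is empty."""
    return matches[0] if matches else None

# Reduce each list of matches to a single value
first = s.str.findall('Monkey').map(first_or_none)

# pat is a regular expression, so character classes work as well
vowels = s.str.findall('[aeiou]')
```

Items without a match end up as missing values in first.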

First, take another look at df.

In [65]:
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


Find 'paris' in every item of the From_To column, lowercasing first so that the match ignores case.

In [68]:
df.From_To.str.lower().str.findall('paris')

0    [paris]
1         []
2         []
3    [paris]
4         []
Name: From_To, dtype: object
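
Lowercasing first discards the original casing of the match. As an alternative sketch, the flags parameter of findall accepts re.IGNORECASE, which keeps the text as it was written:

```python
import re

import pandas as pd

from_to = pd.Series([' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                     'Budapest_PaRis', 'Brussels_londOn'], name='From_To')

# Case-insensitive search that returns the matches in their original casing
matches = from_to.str.findall('paris', flags=re.IGNORECASE)

# count() gives the number of matches per item instead of the matches
counts = from_to.str.count('paris', flags=re.IGNORECASE)
```

Row 3 now yields ['PaRis'] rather than ['paris'].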

### Series.str.replace
Replace every "_" in the From_To column with " > ".

In [72]:
df.From_To.str.replace('_', ' > ')

0         LoNDon > paris 
1         MAdrid > miLAN 
2     londON > StockhOlm 
3        Budapest > PaRis
4       Brussels > londOn
Name: From_To, dtype: object
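
A short sketch of the two modes of replace: a literal replacement with regex=False, and a regular-expression replacement that uses backreferences to swap the two halves. The swap is an illustration added here, not a step from the original.

```python
import pandas as pd

from_to = pd.Series([' LoNDon_paris ', ' MAdrid_miLAN ', ' londON_StockhOlm ',
                     'Budapest_PaRis', 'Brussels_londOn'], name='From_To')
clean = from_to.str.strip()

# Literal replacement: '_' is not interpreted as a pattern
arrow = clean.str.replace('_', ' > ', regex=False)

# Regex replacement: \1 and \2 refer to the first and second capture group
swapped = clean.str.replace(r'([A-Za-z]+)_([A-Za-z]+)', r'\2_\1', regex=True)
```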

Get the element at the last position of each text item in the Series s1.
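
The original gives no code for this step and does not define s1, so the sketch below uses hypothetical data. It relies on indexing through the .str accessor, which works on strings and on the lists produced by split.

```python
import pandas as pd

# Hypothetical contents for s1; the original does not define it
s1 = pd.Series(['Lion', 'Monkey', 'Rabbit'])

# .str[-1] takes the last character of each string
last_char = s1.str[-1]

# After split(), .str[-1] takes the last piece of each list instead
last_piece = pd.Series(['a_b_c', 'd_e']).str.split('_').str[-1]
```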

### Series.str.get_dummies (dummy variables)

In [3]:
import pandas as pd

names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

df = pd.DataFrame({'Name': names,
                   'Info': ['B|C|D', 'B|D', 'A|C',
                            'B|D', 'B|C', 'B|C|D']})
df

Unnamed: 0,Name,Info
0,Graham Chapman,B|C|D
1,John Cleese,B|D
2,Terry Gilliam,A|C
3,Eric Idle,B|D
4,Terry Jones,B|C
5,Michael Palin,B|C|D


In [7]:
df.Info.str.get_dummies('|')

Unnamed: 0,A,B,C,D
0,0,1,1,1
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,1,0
5,0,1,1,1


In [9]:
import pandas as pd

names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

df = pd.DataFrame({'Name': names,
                   'Info': ['B,C,D', 'B,D', 'A,C',
                            'B,D', 'B,C', 'B,C,D']})
df.Info.str.get_dummies(',')

Unnamed: 0,A,B,C,D
0,0,1,1,1
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,1,0
5,0,1,1,1


In [13]:
pd.concat([df, df.Info.str.get_dummies(',')], 
          axis = 'columns')

Unnamed: 0,Name,Info,A,B,C,D
0,Graham Chapman,"B,C,D",0,1,1,1
1,John Cleese,"B,D",0,1,0,1
2,Terry Gilliam,"A,C",1,0,1,0
3,Eric Idle,"B,D",0,1,0,1
4,Terry Jones,"B,C",0,1,1,0
5,Michael Palin,"B,C,D",0,1,1,1
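
Once the indicator columns exist, they can be aggregated like any numeric data. The sketch below counts how often each label occurs and attaches the columns with DataFrame.join, as an alternative to pd.concat; the variable names are chosen for illustration.

```python
import pandas as pd

names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
df = pd.DataFrame({'Name': names,
                   'Info': ['B,C,D', 'B,D', 'A,C',
                            'B,D', 'B,C', 'B,C,D']})

# One 0/1 column per distinct label found after splitting on ','
dummies = df['Info'].str.get_dummies(sep=',')

# Column sums give the number of rows that carry each label
counts = dummies.sum()

# join() aligns on the index, which both frames share here
combined = df.join(dummies)
```

For this data, counts is A: 1, B: 5, C: 4, D: 4.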
