## Chapter 2: Strings and Text
#### Splitting Strings on Any of Multiple Delimiters

In [6]:
line = 'asd"fdf/gh;fgh'
import re
re.split(r'["/;]',line)

['asd', 'fdf', 'gh', 'fgh']
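
If the delimiters themselves are needed later, one option is to wrap the pattern in a capture group so that re.split() also returns the matched separators. This is a minimal sketch reusing the line value from above; the names fields, values, and delimiters are illustrative only.

```python
import re

line = 'asd"fdf/gh;fgh'

# A capture group makes re.split() keep the separators in the result
fields = re.split(r'(["/;])', line)

values = fields[::2]       # every even index is a field
delimiters = fields[1::2]  # every odd index is a separator

print(values)
print(delimiters)
```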

#### Matching Text at the Start or End of a String

In [11]:
filename = 'hello.py'
filename.startswith('hello')
filename.endswith('.py')

import re
url = 'http://www.python.org'
re.match('http:|https:|ftp:',url)

<_sre.SRE_Match object; span=(0, 5), match='http:'>
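
For fixed prefixes like these, str.startswith() also accepts a tuple of candidates, which avoids the regular expression altogether. A small sketch follows; the schemes tuple and the has_known_scheme name are illustrative, not part of the original notes.

```python
url = 'http://www.python.org'

# startswith() requires a tuple here; passing a list raises TypeError
schemes = ('http:', 'https:', 'ftp:')
has_known_scheme = url.startswith(schemes)

print(has_known_scheme)
```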

#### Matching Strings Using Shell Wildcard Patterns

In [18]:
from fnmatch import fnmatch,fnmatchcase

fnmatch('foo.txt','*.txt')
fnmatch('foo.txt','?oo.txt')
fnmatch('Dat45.csv','Dat[0-9]*')

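# macOS: False; Windows: True (fnmatch() follows the OS's case-sensitivity rules)
fnmatch('foo.txt','*.TXT')
# macOS: False; Windows: False (fnmatchcase() is always case-sensitive)
fnmatchcase('foo.txt','*.TXT')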

False

* The matching done by fnmatch sits somewhere between simple string methods and full regular expressions
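
Because fnmatchcase() behaves the same on every platform, it can also be used to filter arbitrary strings, not only file names. A minimal sketch, assuming a hypothetical list called names:

```python
from fnmatch import fnmatchcase

# Hypothetical sample data, not from the original notes
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py', 'DAT3.CSV']

# Case-sensitive wildcard match, independent of the operating system
csv_files = [name for name in names if fnmatchcase(name, 'Dat*.csv')]

print(csv_files)
```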

#### Matching and Searching for Text Patterns
* Simple cases: startswith(), endswith(), find()

In [19]:
test = 'Hello world!'
test.find('wo')

6

* More complex cases: the re module

In [5]:
text1 = '10/30/2017'
text2 = 'Oct 30, 2017'

import re
if re.match(r'\d+/\d+/\d+',text1):
    print('yes')
else:
    print('no')

if re.match(r'\d+/\d+/\d+',text2):
    print('yes')
else:
    print('no')

yes
no


* If the same pattern will be matched many times, compile it into a pattern object first

In [12]:
datepat = re.compile(r'\d+/\d+/\d+')

text1 = 'a10/30/2017a10/30/2017'
if datepat.match(text1):
    print('yes')
else:
    print('no')

if datepat.match(text2):
    print('yes')
else:
    print('no')

no
no


* match() always tries to find a match at the start of the string; to search the whole text for every match, use findall() instead

In [13]:
datepat.findall(text1)

['10/30/2017', '10/30/2017']
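
When the position of each match matters, finditer() may be a better fit than findall(), since it yields match objects rather than plain strings. A short sketch reusing the pattern and text from above; the spans name is illustrative.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text1 = 'a10/30/2017a10/30/2017'

# finditer() yields one match object per hit, so span() is available
spans = [m.span() for m in datepat.finditer(text1)]

print(spans)
```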

* Introducing capture groups

In [20]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
m = datepat.match('10/31/2017')
m.group(0)
m.group(3)
m.groups()

('10', '31', '2017')

In [24]:
text = 'Today is 11/31/2017.PyCon starts 03/13/2013'
datepat.findall(text)

for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(year, month, day))

2017-11-31
2013-03-13
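
Numbered groups become hard to follow as patterns grow. One alternative is named groups via the (?P<name>...) syntax, sketched below; named_datepat and the group names month, day, and year are choices made for this illustration.

```python
import re

# Same date pattern as above, with each group given a name
named_datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

m = named_datepat.match('10/31/2017')
parts = m.groupdict()  # maps group name -> matched text

iso = '{year}-{month}-{day}'.format(**parts)
print(iso)
```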


#### Searching and Replacing Text
* For simple cases, use str.replace()

In [30]:
text = 'Hello world!'

text.replace('Hello','Hi')

'Hi world!'

* For more complex patterns, use re.sub()

In [33]:
text = 'Today is 11/31/2017.PyCon starts 03/13/2013'

import re
re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',text)

'Today is 2017-11-31.PyCon starts 2013-03-13'

* For even more complex cases, a substitution callback function can be supplied

In [37]:
from calendar import month_abbr
def change_date(m):
    mon_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2),mon_name,m.group(3))

datepat.sub(change_date,text)

'Today is 31 Nov 2017.PyCon starts 13 Mar 2013'
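
To find out how many substitutions were made, subn() can be used in place of sub(); it returns the new string together with a count. A minimal sketch with the same text and pattern as above; newtext and count are illustrative names.

```python
import re

text = 'Today is 11/31/2017.PyCon starts 03/13/2013'
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# subn() returns a (new_string, number_of_substitutions) tuple
newtext, count = datepat.subn(r'\3-\1-\2', text)

print(newtext)
print(count)
```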

#### Searching and Replacing Text Case-Insensitively

In [44]:
text = 'UPDATE PYTHON, lower python, Mixed Python'

import re
re.findall('python',text)
re.sub('python', 'snake', text, flags=re.IGNORECASE)

'UPDATE snake, lower snake, Mixed snake'

* The result has a small flaw: the replacement text does not keep the case of the text it replaces. Fixing this requires a support function

In [49]:
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)

'UPDATE SNAKE, lower snake, Mixed Snake'

#### Writing a Regular Expression for the Shortest Match

In [52]:
str_pat = re.compile(r'\"(.*)\"')
text1 = 'Hello "world"~'
str_pat.findall(text1)

text2 = 'One "world" One "dream"'
str_pat.findall(text2)

['world" One "dream']

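* The * operator in a regular expression is greedy, so matching is based on the longest possible match. Appending ? to it (as in *?) makes the match non-greedy, so the shortest possible match is used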

In [54]:
str_pat = re.compile(r'\"(.*?)\"')

text2 = 'One "world" One "dream"'
str_pat.findall(text2)

['world', 'dream']
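
Another common way to get the same result, shown here only as an alternative sketch, is a negated character class that cannot run past the closing quote. The quoted name is illustrative.

```python
import re

text2 = 'One "world" One "dream"'

# [^"]* matches anything except a double quote, so it stops at the first closing quote
quoted = re.compile(r'"([^"]*)"')
found = quoted.findall(text2)

print(found)
```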

#### Normalizing Unicode Text to a Standard Form

In [61]:
s1 = 'What\'s your \u00f1ame'
s2 = 'What\'s your n\u0303ame'
print(s1,s2)
s1 == s2

What's your ñame What's your ñame


False

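* Here the text 'ñame' appears in two forms: the first uses the fully composed character ñ (U+00F1), while the second uses the letter n followed by a combining tilde (U+0303)
* The unicodedata module can be used to normalize the text to a single form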

In [66]:
import unicodedata
t1 = unicodedata.normalize('NFC',s1)
t2 = unicodedata.normalize('NFC',s2)
t1 == t2

t1 = unicodedata.normalize('NFD',s1)
t2 = unicodedata.normalize('NFD',s2)
t1 == t2

True
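
Normalization to NFD also makes it possible to strip diacritical marks, because each accent becomes a separate combining character. A small sketch under that assumption; decomposed and stripped are illustrative names.

```python
import unicodedata

s1 = 'What\'s your \u00f1ame'

# NFD splits the composed character into a base letter plus a combining mark
decomposed = unicodedata.normalize('NFD', s1)

# Drop every combining character, keeping only the base letters
stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))

print(stripped)
```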

#### Stripping Unwanted Characters from Strings
* For characters at either end of a string, use strip(), lstrip(), rstrip()
* For characters inside a string, use replace()
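
The two bullets above can be illustrated with a short sketch. The sample strings s and t are made up for this example.

```python
s = '  hello world  \n'
t = '-----hello====='

both = s.strip()                   # whitespace removed from both ends
left_only = s.lstrip()             # trailing whitespace is kept
no_dashes = t.lstrip('-')          # strip a specific character from the left
bare = t.strip('-=')               # strip any of several characters from both ends
no_spaces = both.replace(' ', '')  # strip() never touches the middle; replace() does

print(repr(both), repr(no_dashes), repr(bare), repr(no_spaces))
```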
#### Aligning Text Strings
* ljust(), rjust(), center()
* format()

In [74]:
text = 'Hello world'
format(text,'>20')
format(text,'<20')
format(text,'^20')

'{:>10s}{:>10s}'.format('Hello','world')

'     Hello     world'
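
For comparison, the string methods listed above give the same kind of alignment, and both approaches accept a fill character. A minimal sketch; the widths and fill characters are arbitrary choices.

```python
text = 'Hello world'

right = text.rjust(20)            # pad with spaces on the left
left = text.ljust(20, '-')        # pad with a custom fill character on the right
centered = format(text, '*^21')   # fill character goes before the alignment code

print(repr(right))
print(repr(left))
print(repr(centered))
```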

#### Combining and Concatenating Strings

In [80]:
parts = ['Is','Beijing','Not','Shanghai?']
' '.join(parts)
','.join(parts)

a = 'hello'
b = 'world'
a + b
'{} {}'.format(a,b)

'hello world'

In [83]:
print(a + ':' + b)     #Ugly
print(':'.join([a,b])) #Still Ugly
print(a, b, sep=':')   #Better

hello:world
hello:world
hello:world
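
join() only accepts strings, so mixed data has to be converted first. A generator expression is one compact way to do that, as in this sketch; the data list is a made-up example.

```python
# Hypothetical mixed-type record
data = ['ACME', 50, 91.1]

# Convert each item to str on the fly, without building a temporary list
row = ','.join(str(d) for d in data)

print(row)
```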


#### Interpolating Variables in Strings

In [86]:
s = '{name} has {n} messages.'
s.format(name = 'JackMa',n=88)

'JackMa has 88 messages.'

* Alternatively, if the values to substitute are already held in variables, format_map() can be combined with vars()

In [98]:
name = 'TomMa'
n = 66
s.format_map(vars())

'TomMa has 66 messages.'

* However, neither format() nor format_map() handles a missing value gracefully

In [99]:
s.format(name='Koera')

KeyError: 'n'

* One way to avoid this is to define a separate dict subclass with a __missing__() method

In [100]:
class safesub(dict):
    def __missing__(self,key):
        return '{' + key + '}'

del n
s.format_map(safesub(vars()))

'TomMa has {n} messages.'
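
The standard library's string.Template offers a comparable safety net through safe_substitute(), at the cost of a different placeholder syntax ($name rather than {name}). This is shown only as an alternative sketch, not as the approach used in these notes.

```python
from string import Template

# Template uses $-style placeholders instead of braces
t = Template('$name has $n messages.')

full = t.substitute(name='TomMa', n=66)

# safe_substitute() leaves unknown placeholders untouched instead of raising KeyError
partial = t.safe_substitute(name='TomMa')

print(full)
print(partial)
```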

#### Reformatting Text to a Fixed Number of Columns

In [114]:
with open('test.txt','r') as f:
    s = f.read()

import textwrap
print(textwrap.fill(s,70,initial_indent='    '))
print('-'*80)
print(textwrap.fill(s,70,subsequent_indent='    '))

    The Zen of Python, by Tim Peters  Beautiful is better than ugly.
Explicit is better than implicit. Simple is better than complex.
Complex is better than complicated. Flat is better than nested. Sparse
is better than dense. Readability counts. Special cases aren't special
enough to break the rules. Although practicality beats purity. Errors
should never pass silently. Unless explicitly silenced. In the face of
ambiguity, refuse the temptation to guess. There should be one-- and
preferably only one --obvious way to do it. Although that way may not
be obvious at first unless you're Dutch. Now is better than never.
Although never is often better than *right* now. If the implementation
is hard to explain, it's a bad idea. If the implementation is easy to
explain, it may be a good idea. Namespaces are one honking great idea
-- let's do more of those!
--------------------------------------------------------------------------------
The Zen of Python, by Tim Peters  Beautiful is better than

In [117]:
import os
os.get_terminal_size().columns

61
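
The terminal width can then be fed into textwrap.fill(). os.get_terminal_size() raises OSError when no terminal is attached, so the sketch below uses shutil.get_terminal_size() with a fallback instead; the sample text and the 80-column fallback are assumptions made for illustration.

```python
import shutil
import textwrap

# Short stand-in for the contents of test.txt
sample = ('Beautiful is better than ugly. Explicit is better than implicit. '
          'Simple is better than complex. Complex is better than complicated.')

# Falls back to 80 columns when the size cannot be determined
width = shutil.get_terminal_size(fallback=(80, 24)).columns

wrapped = textwrap.fill(sample, width)
print(wrapped)
```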

#### Handling HTML and XML Entities in Text

In [121]:
s = 'Elements are written as "<tag>text</tag>"'

import html
print(s)
print(html.escape(s))
print(html.escape(s,quote=False))


Elements are written as "<tag>text</tag>"
Elements are written as &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;
Elements are written as "&lt;tag&gt;text&lt;/tag&gt;"
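
Going the other way, html.unescape() turns entities back into plain characters. A minimal sketch; the escaped sample string is made up for this example.

```python
import html

# Hypothetical input containing named and numeric entities
escaped = 'Spicy &quot;Jalape&#241;o&quot; &lt;tag&gt;'

# unescape() handles both named (&quot;) and numeric (&#241;) references
plain = html.unescape(escaped)

print(plain)
```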
