# The string module
## The capwords function
The result is equivalent to calling split(), capitalizing each resulting word, and then calling join().

In [22]:
import string

s = "The quick brown fox jumped over the lazy dog."

print s
print string.capwords(s)

The quick brown fox jumped over the lazy dog.
The Quick Brown Fox Jumped Over The Lazy Dog.
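
The equivalence between capwords() and split(), capitalize(), join() can be checked with a short sketch. It is written for Python 3 (the cells in these notes use Python 2 `print` statements), and the name `manual` is only illustrative.

```python
import string

s = "The quick brown fox jumped over the lazy dog."

# Split on whitespace, capitalize each word, then join with single spaces
manual = " ".join(word.capitalize() for word in s.split())

print(manual)
print(manual == string.capwords(s))
```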


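## The maketrans function
maketrans() creates a translation table that can be combined with the translate() method to map one set of characters to another. This is more efficient than calling replace() repeatedly.
In Python 2, string.maketrans() must be given two arguments.
In Python 3, the function is the static method str.maketrans(). If it is given only one argument, that argument must be a dictionary, and each string key must have length 1.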

In [21]:
import string

leet = string.maketrans({"e": "1"})
s = "The quick brown fox jumped over the lazy dog."
print s
print s.translate(leet)

TypeError: maketrans() takes exactly 2 arguments (1 given)

In [20]:
"read this short txt".translate(None, "aeiou")

'rd ths shrt txt'

In [19]:
import string

leet = string.maketrans("abegiloprstz", "463611092572")
s = "The quick brown fox jumped over the lazy dog."
print s
print s.translate(leet)

The quick brown fox jumped over the lazy dog.
Th3 qu1ck 620wn f0x jum93d 0v32 7h3 142y d06.
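
For comparison, here is a sketch of the same translations in Python 3, where the table is built with `str.maketrans()`. The variable names `e_to_1`, `drop_vowels` and `no_vowels` are illustrative; the three-argument form is used here as the Python 3 counterpart of `translate(None, "aeiou")`.

```python
# Python 3: maketrans() is a static method of str
leet = str.maketrans("abegiloprstz", "463611092572")
s = "The quick brown fox jumped over the lazy dog."
translated = s.translate(leet)

# One-argument form: a dict whose string keys are single characters
e_to_1 = str.maketrans({"e": "1"})
only_e = s.translate(e_to_1)

# Three-argument form: characters in the third argument are deleted
drop_vowels = str.maketrans("", "", "aeiou")
no_vowels = "read this short txt".translate(drop_vowels)

print(translated)
print(only_e)
print(no_vowels)
```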


## The Template class
To set a variable name apart from the text on either side of it, wrap it in braces: ${var}.

In [18]:
import string

values = { "var": "foo" }

t = string.Template("""
Variable        : $var
Escape          : $$
Variable in text: ${var}iable
""")

print "TEMPLATE: ", t.substitute(values)

s = """
Variable        : %(var)s
Escape          : %%
Variable in text: %(var)siable
"""

print "INTERPOLATION: ", s % values

TEMPLATE:  
Variable        : foo
Escape          : $
Variable in text: fooiable

INTERPOLATION:  
Variable        : foo
Escape          : %
Variable in text: fooiable



Use the safe_substitute method to avoid the exception that may be raised when not all of the values required by the template are provided.

In [26]:
import string

values = { "var" : "foo" }

t = string.Template("$var is here but $missing is not provided")

try:
    print "substitute: ", t.substitute(values)
except KeyError, error:
    print "ERROR: ", error

print "safe_substitute: ", t.safe_substitute(values)

substitute:  ERROR:  'missing'
safe_substitute:  foo is here but $missing is not provided
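
The difference between the two methods can also be sketched in Python 3, where the exception clause is spelled `except KeyError as error`. The names `missing_key` and `safe` are illustrative.

```python
import string

values = {"var": "foo"}
t = string.Template("$var is here but $missing is not provided")

try:
    t.substitute(values)
except KeyError as error:
    # substitute() raises KeyError for the first name it cannot resolve
    missing_key = error.args[0]
    print("ERROR:", missing_key)

# safe_substitute() leaves unresolved placeholders untouched
safe = t.safe_substitute(values)
print(safe)
```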


Override the class attributes delimiter and idpattern in a Template subclass to adjust the regular expression used to find variable names in the template.

In [30]:
import string

template_txt = """
Delimiter : %%
Replaced  : %with_underscore
Ignored   : %notunderscore
"""

d = {
"with_underscore" : "replaced",
"notunderscore" : "not replaced"
}

class MyTemplate(string.Template):
    delimiter = "%"
    idpattern = "[a-z]+_[a-z]+"

t = MyTemplate(template_txt)
print "Modified ID pattern: "
print t.safe_substitute(d)
                


Modified ID pattern: 

Delimiter : %
Replaced  : replaced
Ignored   : %notunderscore
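
A compact Python 3 sketch of the same idea follows. The one-line template and the name `result` are assumptions made for illustration only.

```python
import string

class MyTemplate(string.Template):
    delimiter = "%"                 # placeholders start with % instead of $
    idpattern = "[a-z]+_[a-z]+"     # names must contain an underscore

t = MyTemplate("%with_underscore / %notunderscore / %%")
result = t.safe_substitute({
    "with_underscore": "replaced",
    "notunderscore": "not replaced",
})
print(result)
```

Because `notunderscore` does not satisfy `idpattern`, it is never treated as a variable name, even though the dictionary contains a value for it.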



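For more complex changes, define an entirely new regular expression by overriding the pattern attribute. The pattern must contain four named groups: escaped (the escaped delimiter), named (a named variable), braced (a variable name wrapped in braces), and invalid (an ill-formed delimiter).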

In [56]:
import string
import re

t = string.Template("$var")
print t.pattern.pattern
s = """
Variable        : $var
Escape          : $$
Variable in text: ${var}iable
"""
print t.pattern.findall(s)



    \$(?:
      (?P<escaped>\$) |   # Escape sequence of two delimiters
      (?P<named>[_a-z][_a-z0-9]*)      |   # delimiter and a Python identifier
      {(?P<braced>[_a-z][_a-z0-9]*)}   |   # delimiter and a braced identifier
      (?P<invalid>)              # Other ill-formed delimiter exprs
    )
    
[('', 'var', '', ''), ('$', '', '', ''), ('', '', 'var', '')]
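
The four named groups can also be listed programmatically. The following Python 3 sketch relies on `groupindex`, an attribute of compiled pattern objects; `group_names` and `found` are illustrative names.

```python
import string

t = string.Template("$var")

# groupindex maps each named group of the compiled pattern to its number
group_names = sorted(t.pattern.groupindex)
print(group_names)

# Each findall() tuple is ordered as (escaped, named, braced, invalid)
found = t.pattern.findall("$var $$ ${var}iable")
print(found)
```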


This example uses `{{` as the escape sequence, implemented with `(?P<escaped>\{\{)`, and uses `%%` to set a variable name apart from the surrounding characters, implemented with `\%\%(?P<braced>[_a-z][_a-z0-9]*)\%\%`.

In [72]:
import re
import string

class MyTemplate(string.Template):
    delimiter = "{{"
    pattern = r"""
    \{\{(?:
    (?P<escaped>\{\{) |
    (?P<named>[_a-z][_a-z0-9]*)\}\} |
    \%\%(?P<braced>[_a-z][_a-z0-9]*)\%\% |
    (?P<invalid>)
    )
    """

t = MyTemplate("""
{{{{
{{%%var%%iable
""")

print t.template
print "MATCHES: ", t.pattern.findall(t.template)
print "SUBSTITUTED: ", t.safe_substitute(var="replacement")


{{{{
{{%%var%%iable

MATCHES:  [('{{', '', '', ''), ('', '', 'var', '')]
SUBSTITUTED:  
{{
replacementiable



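# textwrap: formatting text paragraphs

## The fill() function
fill() produces left-justified output. The first line keeps its indentation, while the leading spaces of the remaining lines are embedded in the paragraph, which is why extra whitespace still appears before "situations".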

In [81]:
import textwrap

sample_txt = """
    The testwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors.
"""

print "no dedent: \n"
print textwrap.fill(sample_txt, width = 50)

no dedent: 

     The testwrap module can be used to format
text for output in     situations where pretty-
printing is desired. It offers     programmatic
functionality similar to the paragraph wrapping
or filling features found in many text editors.


## The dedent() function
dedent() removes the leading whitespace that is common to every line.

In [82]:
import textwrap

sample_txt = """
    The testwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors.
"""

dedent_text = textwrap.dedent(sample_txt)
print "dedent: \n"
print dedent_text

dedent: 


The testwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors.



However, if one line is indented more than the others, some of its whitespace is not removed.

In [83]:
import textwrap

sample_txt = """
 Line one.

   Line two.
 Line three.
"""

dedent_text = textwrap.dedent(sample_txt)
print "dedent: \n"
print dedent_text


dedent: 


Line one.

  Line two.
Line three.
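
The common-prefix rule can be made visible by inspecting the result with `repr()`. This is a Python 3 sketch; `dedented` is an illustrative name.

```python
import textwrap

sample_txt = """
 Line one.

   Line two.
 Line three.
"""

# Only the one-space prefix shared by all non-blank lines is removed
dedented = textwrap.dedent(sample_txt)
print(repr(dedented))
```

"Line two." keeps two of its three leading spaces because only one space is common to every non-blank line.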



## Combining fill and dedent

In [84]:
import textwrap

sample_txt = """
    The testwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors.
"""
dedent_text = textwrap.dedent(sample_txt).strip()
for width in [45, 60]:
    print "%d columns: \n" % width
    print textwrap.fill(dedent_text, width = width)
    print 

45 columns: 

The testwrap module can be used to format
text for output in situations where pretty-
printing is desired. It offers programmatic
functionality similar to the paragraph
wrapping or filling features found in many
text editors.

60 columns: 

The testwrap module can be used to format text for output in
situations where pretty-printing is desired. It offers
programmatic functionality similar to the paragraph wrapping
or filling features found in many text editors.



## Hanging indent: controlling the indentation of the first line separately

In [88]:
import textwrap

sample_txt = """
    The testwrap module can be used to format text for output in
    situations where pretty-printing is desired. It offers
    programmatic functionality similar to the paragraph wrapping
    or filling features found in many text editors.
"""
dedent_text = textwrap.dedent(sample_txt).strip()
print textwrap.fill(dedent_text,
                    initial_indent = "",
                    subsequent_indent = " " * 4,
                    width = 50
                    )

The testwrap module can be used to format text for
    output in situations where pretty-printing is
    desired. It offers programmatic functionality
    similar to the paragraph wrapping or filling
    features found in many text editors.
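
A Python 3 sketch of a hanging indent follows, with simple checks on the resulting lines. The shortened sample text and the name `lines` are assumptions for illustration.

```python
import textwrap

text = ("The textwrap module can be used to format text for output in "
        "situations where pretty-printing is desired.")

wrapped = textwrap.fill(text,
                        initial_indent="",
                        subsequent_indent=" " * 4,
                        width=50)
lines = wrapped.splitlines()

print(wrapped)
# The first line is flush left; every later line starts with four spaces
print(all(line.startswith("    ") for line in lines[1:]))
```

The width limit includes the indent, so the indented lines carry at most 46 characters of text.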


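# The re module

## The search() function
search() returns a Match object if the pattern is found, and None otherwise.
A Match object holds information about the nature of the match, such as the original input string, the regular expression used, and the position in the original string where the pattern occurs.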

In [96]:
import re

pattern = "this"
text = "Does this text match the pattern?"
 
match = re.search(pattern, text)
print match

s = match.start()
e = match.end()

print "Found '%s'\nin '%s'\nfrom %d to %d ('%s')" % \
      (match.re.pattern, match.string, s, e, text[s: e])

<_sre.SRE_Match object at 0x1076344a8>
Found 'this'
in 'Does this text match the pattern?'
from 5 to 9 ('this')
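
Since search() may return None, it is usually safest to test the result before calling methods on it. A Python 3 sketch, with `no_match` and `span` as illustrative names:

```python
import re

text = "Does this text match the pattern?"

match = re.search("this", text)
if match is not None:
    # span() returns the (start, end) pair in one call
    span = match.span()
    print(match.re.pattern, match.string[span[0]:span[1]], span)

# A pattern that does not occur yields None rather than an exception
no_match = re.search("that", text)
print(no_match)
```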


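## Compiling expressions
The module-level functions maintain a cache of compiled expressions, but the size of that cache is limited, and using compiled expressions directly avoids the cache lookup overhead. Another advantage is that by precompiling all expressions when the module is loaded, the compilation work is shifted to application start time instead of happening when the program responds to a user action.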

In [97]:
import re

regexes = [re.compile(p) for p in ["this", "that"]]
text = "Does this text match the pattern?"

print "Text: %r\n" % text

for regex in regexes:
    print "Seeking '%s' ->" % regex.pattern

    if regex.search(text):
        print "Match!"
    else:
        print "No match!"

Text: 'Does this text match the pattern?'

Seeking 'this' ->
Match!
Seeking 'that' ->
No match!
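
One way to shift compilation to load time is to keep the compiled objects in a module-level table. This is a Python 3 sketch under that assumption; `REGEXES` and `results` are hypothetical names.

```python
import re

# Compiled once, when the module is imported
REGEXES = {name: re.compile(name) for name in ("this", "that")}

text = "Does this text match the pattern?"

# Later calls reuse the compiled objects without any cache lookup
results = {name: bool(regex.search(text)) for name, regex in REGEXES.items()}
print(results)
```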


## Multiple matches

### findall() returns all non-overlapping substrings of the input that match the pattern, as strings.

In [99]:
import re

text = "abbaaabbbbaaaaa"

pattern = "ab"

for match in re.findall(pattern, text):
    print "Found: %s" % match
    print type(match)

Found: ab
<type 'str'>
Found: ab
<type 'str'>


### finditer() returns an iterator that yields Match instances, rather than strings as findall() does.

In [101]:
import re

text = "abbaaabbbbaaaaa"

pattern = "ab"

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print "Found: %s at %d: %d" % (text[s:e], s, e) 
    print type(match)


Found: ab at 0: 2
<type '_sre.SRE_Match'>
Found: ab at 5: 7
<type '_sre.SRE_Match'>
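
The two functions can be compared side by side in a short Python 3 sketch. The names `strings` and `spans` are illustrative.

```python
import re

text = "abbaaabbbbaaaaa"

# findall() gives the matched text only
strings = re.findall("ab", text)

# finditer() gives Match objects, so positions are available as well
spans = [m.span() for m in re.finditer("ab", text)]

print(strings)
print(spans)
```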


## Pattern syntax

In [95]:
import re

def test_patterns(text, patterns = []):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("abbaaabbbbaaaaa",
                 [("ab", "'a' followed by 'b'"),
                 ])

pattern 'ab' ('a' followed by 'b')

  'abbaaabbbbaaaaa'
  'ab'
  .....'ab'



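| Repetition | Meaning |
| ------------ | ------------ |
| *   | zero or more times |
| +   | one or more times |
| ?   | zero or one time |
| {m} | exactly m times |
| {m,n} | at least m and at most n times; there must be no space between m and n, otherwise the pattern does not match as intended |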

In [94]:
import re

def test_patterns(text, patterns = []):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("abbaabbba",
                 [("ab*", "'a' followed by zero or more 'b'"),
                  ("ab+", "'a' followed by one or more 'b'"),
                  ("ab?", "'a' followed by zero or one 'b'"),
                  ("ab{3}", "'a' followed by three 'b'"),
                  ("ab{2,3}", "'a' followed by two to three 'b'"),
                 ])


pattern 'ab*' ('a' followed by zero or more 'b')

  'abbaabbba'
  'abb'
  ...'a'
  ....'abbb'
  ........'a'

pattern 'ab+' ('a' followed by one or more 'b')

  'abbaabbba'
  'abb'
  ....'abbb'

pattern 'ab?' ('a' followed by zero or one 'b')

  'abbaabbba'
  'ab'
  ...'a'
  ....'ab'
  ........'a'

pattern 'ab{3}' ('a' followed by three 'b')

  'abbaabbba'
  ....'abbb'

pattern 'ab{2,3}' ('a' followed by two to three 'b')

  'abbaabbba'
  'abb'
  ....'abbb'
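
The same repetition results can be collected more compactly with findall(). This Python 3 sketch uses the text and patterns from the cell above; `found` is an illustrative name.

```python
import re

text = "abbaabbba"
patterns = ["ab*", "ab+", "ab?", "ab{3}", "ab{2,3}"]

# Map each pattern to the list of substrings it matches
found = {p: re.findall(p, text) for p in patterns}

for p in patterns:
    print("%-8s %s" % (p, found[p]))
```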



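## Turning off greedy matching: append `?` to a repetition instruction
For patterns that allow b to occur zero times, turning off greedy matching means the matched substring contains no b characters at all.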

In [92]:
import re

def test_patterns(text, patterns = []):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("abbaabbba",
                 [("ab*?", "'a' followed by zero or more 'b'"),
                  ("ab+?", "'a' followed by one or more 'b'"),
                  ("ab??", "'a' followed by zero or one 'b'"),
                  ("ab{3}?", "'a' followed by three 'b'"),
                  ("ab{2,3}?", "'a' followed by two to three 'b'"),
                 ])



pattern 'ab*?' ('a' followed by zero or more 'b')

  'abbaabbba'
  'a'
  ...'a'
  ....'a'
  ........'a'

pattern 'ab+?' ('a' followed by one or more 'b')

  'abbaabbba'
  'ab'
  ....'ab'

pattern 'ab??' ('a' followed by zero or one 'b')

  'abbaabbba'
  'a'
  ...'a'
  ....'a'
  ........'a'

pattern 'ab{3}?' ('a' followed by three 'b')

  'abbaabbba'
  ....'abbb'

pattern 'ab{2,3}?' ('a' followed by two to three 'b')

  'abbaabbba'
  'abb'
  ....'abb'
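
A common place where the difference matters is matching delimited items. The Python 3 sketch below uses a made-up input string, `"<a><b>"`, which is not part of the original notes.

```python
import re

text = "<a><b>"

# Greedy: .* consumes as much as possible, so both tags are swallowed
greedy = re.findall("<.*>", text)

# Non-greedy: .*? stops at the first closing bracket
lazy = re.findall("<.*?>", text)

# Same effect as in the cell above: b* may match zero times, so it does
no_b = re.findall("ab*?", "abbaabbba")

print(greedy, lazy, no_b)
```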



## Character sets
A character set is a group of characters, any one of which can match at the corresponding position in the pattern.

In [90]:
import re

def test_patterns(text, patterns = []):
    """Given source text and a list of patterns, look for
    matches for each pattern within the text and print
    them to stdout.
    """
    # Look for each pattern in the text and print the results
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("abbaabbba",
                 [("[ab]", "either a or b"),
                  ("a[ab]+", "a followed by 1 or more a or b"),
                  ("a[ab]+?", "a followed by 1 or more a or b, not greedy"),
                 ])

pattern '[ab]' (either a or b)

  'abbaabbba'
  'a'
  .'b'
  ..'b'
  ...'a'
  ....'a'
  .....'b'
  ......'b'
  .......'b'
  ........'a'

pattern 'a[ab]+' (a followed by 1 or more a or b)

  'abbaabbba'
  'abbaabbba'

pattern 'a[ab]+?' (a followed by 1 or more a or b, not greedy)

  'abbaabbba'
  'ab'
  ...'aa'



In [89]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("This is some text -- with punctuation.",
                 [("[^-. ]+", "sequences without -, ., or space"),])


pattern '[^-. ]+' (sequences without -, ., or space)

  'This is some text -- with punctuation.'
  'This'
  .....'is'
  ........'some'
  .............'text'
  .....................'with'
  ..........................'punctuation'



In [88]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("This is some text -- with punctuation.",
                 [("[a-z]+", "sequences of lowercase letters."),
                  ("[A-Z]+", "sequences of uppercase letters."),
                  ("[a-zA-Z]+", "sequences of lowercase or uppercase letters."),
                  ("[A-Z][a-z]+", "one uppercase followed by lowercase.")])



pattern '[a-z]+' (sequences of lowercase letters.)

  'This is some text -- with punctuation.'
  .'his'
  .....'is'
  ........'some'
  .............'text'
  .....................'with'
  ..........................'punctuation'

pattern '[A-Z]+' (sequences of uppercase letters.)

  'This is some text -- with punctuation.'
  'T'

pattern '[a-zA-Z]+' (sequences of lowercase or uppercase letters.)

  'This is some text -- with punctuation.'
  'This'
  .....'is'
  ........'some'
  .............'text'
  .....................'with'
  ..........................'punctuation'

pattern '[A-Z][a-z]+' (one uppercase followed by lowercase.)

  'This is some text -- with punctuation.'
  'This'
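
The three forms used in this section (a plain set, a negated set starting with `^`, and ranges written with `-`) can be summarized in one Python 3 sketch. The variable names are illustrative.

```python
import re

text = "This is some text -- with punctuation."

# Negated set: runs of characters that are not '-', '.', or a space
words = re.findall("[^-. ]+", text)

# Ranges: one uppercase letter followed by lowercase letters
capitalized = re.findall("[A-Z][a-z]+", text)

# Plain set: every single 'a' or 'b'
letters = re.findall("[ab]", "abbaabbba")

print(words)
print(capitalized)
print(len(letters))
```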



## The escape character: backslash \
In the example below, "\\\\.\\+" is equivalent to r"\\.\+": r"\\" matches the character "\", and r"\+" matches the character "+".

In [85]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns(r"\d+ \D+ \s+",
                 [("\\\\.\\+", "backslash, any character, plus sign"),
                  (r"\\.\+", "backslash, any character, plus sign"),])

pattern '\\\\.\\+' (backslash, any character, plus sign)

  '\\d+ \\D+ \\s+'
  '\\d+'
  .....'\\D+'
  ..........'\\s+'

pattern '\\\\.\\+' (backslash, any character, plus sign)

  '\\d+ \\D+ \\s+'
  '\\d+'
  .....'\\D+'
  ..........'\\s+'



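## Anchoring
| Anchor code | Meaning |
| ------------ | ------------ |
| ^   | start of string, or of a line |
| $   | end of string, or of a line |
| \A  | start of string |
| \Z  | end of string |
| \b  | empty string at the beginning or end of a word |
| \B  | empty string not at the beginning or end of a word |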


In [3]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("This is some text -- with punctuation.",
                 [(r"^\w+", "word at start of string"),
                  (r"\A\w+", "word at start of string"),
                  (r"\w+\S*$", "word near end of string, skip punctuation"),
                  (r"\w+\S*\Z", "word near end of string, skip punctuation"),
                  (r"\w*t\w*", "word containing t"),
                  (r"\bt\w+", "t at start of word"),
                  (r"\w+t\b", "t at end of word"),
                  (r"\Bt\B", "t not at start or end of word")])

pattern '^\\w+' (word at start of string)

  'This is some text -- with punctuation.'
  'This'

pattern '\\A\\w+' (word at start of string)

  'This is some text -- with punctuation.'
  'This'

pattern '\\w+\\S*$' (word near end of string, skip punctuation)

  'This is some text -- with punctuation.'
  ..........................'punctuation.'

pattern '\\w+\\S*\\Z' (word near end of string, skip punctuation)

  'This is some text -- with punctuation.'
  ..........................'punctuation.'

pattern '\\w*t\\w*' (word containing t)

  'This is some text -- with punctuation.'
  .............'text'
  .....................'with'
  ..........................'punctuation'

pattern '\\bt\\w+' (t at start of word)

  'This is some text -- with punctuation.'
  .............'text'

pattern '\\w+t\\b' (t at end of word)

  'This is some text -- with punctuation.'
  .............'text'

pattern '\\Bt\\B' (t not at start or end of word)

  'This is some text -- with punctuation.'
  .......................'t'
  ..............................'t'
  .................................'t'

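One detail in this example: to match the last word of the string, use r"\w+\S*\Z" rather than r"\w+\Z". The latter matches nothing here, because the string ends with "." and \w does not match punctuation. By contrast, r"\W+\Z" matches only the final punctuation mark ".".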

## The difference between ^ and \A, and between $ and \Z

In [96]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        regex = re.compile(pattern, re.MULTILINE)
        print regex.findall(text)

test_patterns("zzz\nabc",
              [(r"^abc", "abc at start of a line"),
               (r"\Aabc", "abc at start of string"),
               (r"\w+$", "word at end of a line"),
               (r"\w+\Z", "word at end of string")])

pattern '^abc' (abc at start of a line)

  'zzz\nabc'
['abc']
pattern '\\Aabc' (abc at start of string)

  'zzz\nabc'
[]
pattern '\\w+$' (word at end of a line)

  'zzz\nabc'
['zzz', 'abc']
pattern '\\w+\\Z' (word at end of string)

  'zzz\nabc'
['abc']
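
The difference only shows up when re.MULTILINE is set, as it is in the cell above. The Python 3 sketch below runs each pattern with and without the flag; `results` is an illustrative name.

```python
import re

text = "zzz\nabc"
results = {}

for pattern in (r"^abc", r"\Aabc", r"\w+$", r"\w+\Z"):
    default = re.findall(pattern, text)
    multiline = re.findall(pattern, text, re.MULTILINE)
    # ^ and $ change meaning under MULTILINE; \A and \Z never do
    results[pattern] = (default, multiline)
    print("%-8s default=%s multiline=%s" % (pattern, default, multiline))
```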


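Also note the difference between r"\n" and "\n".
r"\n" is a two-character string containing "\" and "n".
"\n" is a one-character string containing a newline.
Overlooking this difference can easily lead to unexpected results.
The following example modifies the previous one to demonstrate this.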

In [97]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        regex = re.compile(pattern, re.MULTILINE)
        print regex.findall(text)

test_patterns(r"zzz\nabc",
              [(r"^abc", "abc at start of a line"),
               (r"\Aabc", "abc at start of string"),
               (r"\w+$", "word at end of a line"),
               (r"\w+\Z", "word at end of string")])

pattern '^abc' (abc at start of a line)

  'zzz\\nabc'
[]
pattern '\\Aabc' (abc at start of string)

  'zzz\\nabc'
[]
pattern '\\w+$' (word at end of a line)

  'zzz\\nabc'
['nabc']
pattern '\\w+\\Z' (word at end of string)

  'zzz\\nabc'
['nabc']


The results clearly deviate from what was expected and differ considerably from the earlier output.

In [4]:
import re

text = "This is some text -- with punctuation."
pattern = "is"
print "Text    :", text
print "pattern :", pattern

m = re.match(pattern, text)
print "Match  :", m
s = re.search(pattern, text)
print "Search :", s

Text    : This is some text -- with punctuation.
pattern : is
Match  : None
Search : <_sre.SRE_Match object at 0x10e321920>


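## Limiting the search
The search() method of a compiled expression also accepts optional start and end position arguments (pos and endpos), which limit the search to a subset of the input.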

In [12]:
import re

text = "This is some text -- with punctuation."
pattern = r"\b\w*is\w*\b"
regex = re.compile(pattern)

pos = 0
while True:
    match = regex.search(text, pos)
    if not match:
        break
    s = match.start()
    e = match.end()
    print "%2d : %2d, '%s'" % (s, e-1, text[s:e])
    pos = e

 0 :  3, 'This'
 5 :  6, 'is'
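
The effect of the two position arguments can be seen in isolation in the Python 3 sketch below. The names `first`, `after` and `limited` are illustrative.

```python
import re

text = "This is some text -- with punctuation."
regex = re.compile(r"\b\w*is\w*\b")

first = regex.search(text)          # whole string: finds 'This'
after = regex.search(text, 4)       # start at index 4: finds 'is'
limited = regex.search(text, 0, 3)  # only 'Thi' is visible: no match

print(first.span(), after.span(), limited)
```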


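## Dissecting matches with groups
Any complete regular expression can be turned into a group and nested inside a larger expression. All repetition modifiers can be applied to a group as a whole, requiring the entire group pattern to repeat.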

In [13]:
import re

def test_patterns(text, patterns=[]):
    for pattern, desc in patterns:
        print "pattern %r (%s)\n" % (pattern, desc)
        print "  %r" % text
        for match in re.finditer(pattern, text):
            s = match.start()
            e = match.end()
            substr = text[s:e]
            n_backslashed = text[:s].count("\\")
            prefix = "." * (s + n_backslashed)
            print "  %s%r" % (prefix, substr)
        print
    return

if __name__ == "__main__":
    test_patterns("abbaaabbbbaaaaa",
                 [(r"a(ab)", "a followed by literal ab"),
                  (r"a(a*b*)", "a followed by 0-n a and 0-n b"),
                  (r"a(ab)*", "a followed by 0-n ab"),
                  (r"a(ab)+", "a followed by 1-n ab"),]) 

pattern 'a(ab)' (a followed by literal ab)

  'abbaaabbbbaaaaa'
  ....'aab'

pattern 'a(a*b*)' (a followed by 0-n a and 0-n b)

  'abbaaabbbbaaaaa'
  'abb'
  ...'aaabbbb'
  ..........'aaaaa'

pattern 'a(ab)*' (a followed by 0-n ab)

  'abbaaabbbbaaaaa'
  'a'
  ...'a'
  ....'aab'
  ..........'a'
  ...........'a'
  ............'a'
  .............'a'
  ..............'a'

pattern 'a(ab)+' (a followed by 1-n ab)

  'abbaaabbbbaaaaa'
  ....'aab'
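
The text captured by a group can be read from the Match object with group(). The Python 3 sketch below also shows that findall() returns the group contents, not the whole match, when the pattern contains a single group. The names `whole`, `inner` and `inner_parts` are illustrative.

```python
import re

text = "abbaaabbbbaaaaa"

match = re.search(r"a(ab)+", text)
whole = match.group(0)   # the entire match
inner = match.group(1)   # the last repetition captured by the group

# With one group in the pattern, findall() yields what the group captured
inner_parts = re.findall(r"a(a*b*)", text)

print(whole, inner, match.span())
print(inner_parts)
```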

