# Introduction

This chapter covers the basics of how strings work and how to create them by hand, but its main focus is ***Regular Expressions***, or ***Regexps*** for short.

In this chapter, we mainly use the [re](https://docs.python.org/zh-cn/3/library/re.html) package.

In [2]:
import re
import numpy as np
import pandas as pd
import pydoc
import os

# String Basics

Create strings with either single quotes or double quotes.

In [9]:
string1 = "This is a string"
string2 = 'If I want to include a "quote" inside a string, I use single quotes'

In [10]:
print(string1)
print(string2)

This is a string
If I want to include a "quote" inside a string, I use single quotes


To include a literal single or double quote in a string you can use \ to “escape” it.

In [11]:
print("\"")

"


In [12]:
print('\'')

'


If you want to include a literal backslash, you’ll need to double it up: "\\\\".

In [13]:
print('\\')

\


There are a handful of other special characters.

In [14]:
# \n newline
print('abc\ndef')

abc
def


In [15]:
# \t tab
print('abc\tdef')

abc	def


In [33]:
# A Unicode escape for a non-ASCII character works the same way on all platforms
print('\u00b6')

¶


## String Length

In [6]:
# Use iteration to compute each string's length
str = ["a", "R for data science", "np.nan"]
print(len(str))
print([len(x) for x in str])

3
[1, 18, 6]


In [7]:
# Vectorization
pd.Series(str).str.len()

0     1
1    18
2     6
dtype: int64

## Combining Strings

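* Use join(), which also lets you control the separator placed between the pieces.
* Use '+'.
* Place string literals next to each other; Python concatenates adjacent literals automatically.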

In [100]:
parts = ['Is', 'Chicago', 'Not', 'Chicago?']
print(' '.join(parts))
print(','.join(parts))
print(''.join(parts))

Is Chicago Not Chicago?
Is,Chicago,Not,Chicago?
IsChicagoNotChicago?


In [19]:
a = 'Is Chicago'
b = 'Not Chicago?'
print(a + ' ' + b)

Is Chicago Not Chicago?


In [20]:
a = 'Hello'' ''World''!'
print(a)

Hello World!


In [12]:
# Use iteration to add the same prefix and suffix to each element
str = ['abc', 'def', 'ghi']
print(['prefix-'+x+'-suffix' for x in str])

# Vectorization
print('prefix-'+pd.Series(str)+'-suffix')

['prefix-abc-suffix', 'prefix-def-suffix', 'prefix-ghi-suffix']
0    prefix-abc-suffix
1    prefix-def-suffix
2    prefix-ghi-suffix
dtype: object


## Subsetting Strings

In [29]:
# Extract parts of a string
str = ["Apple", "Banana", "Pear"]

# The first 3 letters
print([a[0:3] for a in str])
print(pd.Series(str).str[0:3])  # Vectorization

# The last 3 letters
print([a[-3:] for a in str])

# Slicing past the end returns as much as is available
print([a[0:8] for a in str])

# Modify specific parts, lowercase the first letter
print([a[0].lower()+a[1:] for a in str])

['App', 'Ban', 'Pea']
0    App
1    Ban
2    Pea
dtype: object
['ple', 'ana', 'ear']
['Apple', 'Banana', 'Pear']
['apple', 'banana', 'pear']


## Locales

Different languages have different rules for changing case. 

In [23]:
str = "i,ı"
str.upper()

'I,I'

Turkish has two i's: with and without a dot, and it has a different rule for capitalising them.

In [24]:
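# Python's str.upper() ignores the locale, so setting a Turkish locale does not change the result above.
#import locale
#locale.setlocale(locale.LC_ALL, 'tr_TR.utf8')

#"i,ı".upper()
# Under Turkish rules the expected result would be:
#> "İ" "I"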

Different languages also have different rules for sorting.

In [25]:
str = ['apple', 'eggplant', 'banana']
print(sorted(str))  # English

# Hawaiian
#> "apple" "eggplant" "banana"

['apple', 'banana', 'eggplant']
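
Sorting rules also matter within a single language. The sketch below, which uses a hypothetical word list, shows that Python's default sort compares Unicode code points (so uppercase letters come first), while passing `key=str.casefold` gives a case-insensitive order. Fully locale-aware sorting is possible with `locale.strxfrm` as the key, but the result depends on which locales are installed on the machine, so it is not shown here.

```python
# Hypothetical word list mixing upper and lower case
fruits = ['apple', 'Banana', 'cherry']

# Default sort compares Unicode code points, so 'B' sorts before 'a'
by_code_point = sorted(fruits)

# Case-insensitive sort: compare the case-folded form of each string
by_casefold = sorted(fruits, key=str.casefold)

print(by_code_point)
print(by_casefold)
```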


# Matching Patterns With Regular Expressions

## Basic Matches

Match **exact** strings using the [***re***](https://docs.python.org/zh-cn/3.7/library/re.html) package.

In [20]:
str = ['apple','bananananan','pear']

In [27]:
# re.match() only matches at the beginning of the string; re.search() is usually the better choice
print([re.match('a', x) for x in str])

[<re.Match object; span=(0, 1), match='a'>, None, None]


In [92]:
# re.search() scans the whole string but returns only the first match
print([re.search('anan', x) for x in str])

[None, <re.Match object; span=(1, 5), match='anan'>, None]


In [29]:
# re.findall() returns all non-overlapping matches as a list
print([re.findall('a', x) for x in str])

[['a'], ['a', 'a', 'a', 'a', 'a'], ['a']]


In [30]:
# re.finditer() also returns all matches, but as an iterator of Match objects
for i in str:
    for m in re.finditer("an", i):
        print('%02d-%02d: %s' % (m.start(),m.end(),m.group(0)))

01-03: an
03-05: an
05-07: an
07-09: an
09-11: an


Use “.” to match **any** character (except a newline).

In [31]:
print([re.findall('.a.', x) for x in str])

[[], ['ban', 'nan', 'nan'], ['ear']]


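To match a literal '.', escape it with a backslash; written as an ordinary string, the pattern is "\\\\.".

Alternatively, use ***raw string notation (r"text")***, which leaves backslashes untouched so the regular expression reads exactly as written.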

In [32]:
str = ['abc','a.c','bef','a.b']
# Find 'a.' followed by any single character
print([re.findall('a\\..', x) for x in str])
print([re.findall(r'a\..', x) for x in str]) # Raw string notation

[[], ['a.c'], [], ['a.b']]
[[], ['a.c'], [], ['a.b']]


To match a literal \\ (which is written as '\\\\' in a string), you need to write the pattern as "\\\\\\\\".

In [33]:
str = ['a\\b','b\\c','d\\e']
print([re.findall('.\\\\.', x) for x in str])
print([re.findall(r'.\\.', x) for x in str]) # Raw string notation

[['a\\b'], ['b\\c'], ['d\\e']]
[['a\\b'], ['b\\c'], ['d\\e']]


The ***re*** functions work on one string at a time; np.vectorize() wraps a function so that it can be applied to a whole array.

In [34]:
array_of_strings = ["3a1", "1b2", "1c", "d"]
ElimAlphaArr = np.vectorize(re.sub)
print(ElimAlphaArr("[a-zA-Z]", " ", array_of_strings))

['3 1' '1 2' '1 ' ' ']


## Anchors

In [34]:
str = ['apple','applepie','applecake','banana','bananapie']

In [35]:
# '^' to match the beginning
print([re.findall('^apple', x) for x in str])

[['apple'], ['apple'], ['apple'], [], []]


In [36]:
# '$' to match the end
print([re.findall('pie$', x) for x in str])

[[], ['pie'], [], [], ['pie']]


In [37]:
# \b to match the boundary between words
print([re.findall('pie\\b', x) for x in str])

[[], ['pie'], [], [], ['pie']]
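
In the examples above, '$' and '\b' happen to give the same result because every 'pie' sits at the very end of its string. The following sketch, using a hypothetical list of phrases, shows where the two anchors differ: '$' only matches at the end of the string, while '\b' matches at any boundary between a word character and a non-word character.

```python
import re

# Hypothetical phrases chosen so that '$' and '\b' disagree
phrases = ['applepie', 'pie chart', 'magpie!', 'pies']

# '$' only matches at the very end of the string
at_end = [bool(re.search(r'pie$', p)) for p in phrases]

# '\b' matches wherever a word character meets a non-word character or an edge
at_boundary = [bool(re.search(r'pie\b', p)) for p in phrases]

print(at_end)
print(at_boundary)
```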


## Character Classes and Alternatives

There are special patterns that match more than one character.
* \d: matches any digit.
* \s: matches any whitespace (e.g. space, tab, newline).
* [abc]: matches a, b, or c.
* [^abc]: matches anything except a, b, or c.

To create a regular expression containing \d or \s in an ordinary string, you need to type "\\\\d" or "\\\\s"; in a raw string, r"\d" and r"\s" are enough.

In [38]:
str = ["abc1","a.c","a*c","a c"]

In [39]:
# \d matches a digit
print([re.findall('c\\d', x) for x in str])

[['c1'], [], [], []]


In [40]:
# \s matches a whitespace character
print([re.findall('\\sc', x) for x in str])

[[], [], [], [' c']]


In [41]:
# [ab] matches a or b
print([re.findall('[ab]', x) for x in str])

[['a', 'b'], ['a'], ['a'], ['a']]


In [42]:
# [^ab] matches anything except a or b
print([re.findall('[^ab]', x) for x in str])

[['c', '1'], ['.', 'c'], ['*', 'c'], [' ', 'c']]


A character class containing a single character, like [.], is a nice alternative to backslash escapes. It works for $ . | ? * + ( ) [ {, but not for ] \ ^ and -, which have a special meaning inside a character class.

In [43]:
# Find a.c
print([re.findall('a[.]c', x) for x in str])

[[], ['a.c'], [], []]


In [44]:
# Find any character followed by '*c'
print([re.findall('.[*]c', x) for x in str])

[[], [], ['a*c'], []]
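
When the text to match comes from a variable, escaping each special character by hand is error-prone. A minimal sketch of an alternative is `re.escape()`, which returns a pattern that matches its argument literally. The variable names here are illustrative only.

```python
import re

strings = ['abc1', 'a.c', 'a*c', 'a c']
literal = 'a*c'

# Used directly as a pattern, 'a*c' means "zero or more 'a', then 'c'"
unescaped = [re.findall(literal, s) for s in strings]

# re.escape() backslash-escapes the special characters for us
escaped = [re.findall(re.escape(literal), s) for s in strings]

print(unescaped)
print(escaped)
```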


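Use alternation '|' to pick between two or more alternative patterns. Note that '|' must sit inside a group, not inside square brackets, where it would be treated as a literal character.

In [45]:
print([re.findall('gr(?:e|a|y)y', x) for x in ['grey','gray','gryy','grxy']])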

[['grey'], ['gray'], ['gryy'], []]
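
The '|' operator has low precedence, so without parentheses it splits the whole pattern in two. The sketch below uses `re.fullmatch()` on a few hypothetical test strings to make the difference visible: 'gre|ay' means "gre" or "ay", whereas 'gr(?:e|a)y' means "grey" or "gray". The group is written as non-capturing, `(?:...)`, because `re.findall()` would otherwise return only the text of the group.

```python
import re

candidates = ['grey', 'gray', 'gre', 'ay']

# Without a group, '|' separates the entire left side from the entire right side
ungrouped = [bool(re.fullmatch(r'gre|ay', w)) for w in candidates]

# With a (non-capturing) group, only the middle letter alternates
grouped = [bool(re.fullmatch(r'gr(?:e|a)y', w)) for w in candidates]

print(ungrouped)
print(grouped)
```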


## Repetition

Control how many times a pattern matches.
* ?: 0 or 1. For example, colou?r matches both the American and the British spelling.
* +: 1 or more
* *: 0 or more

In [46]:
str = "1888 is the longest year in Roman numerals: MDCCCCCCLXXXVIII"

In [47]:
re.findall("CC?", str)

['CC', 'CC', 'CC']

In [48]:
re.findall("CC+", str)

['CCCCCC']

In [49]:
re.findall("C[LX]+", str)

['CLXXX']

You can also specify the number of matches precisely:

* {n}: exactly n
* {n,}: n or more
* {,m}: at most m
* {n,m}: between n and m

In [50]:
re.findall("C{2}", str)

['CC', 'CC', 'CC']

In [51]:
re.findall("C{2,}", str)

['CCCCCC']

In [52]:
re.findall("C{2,4}", str)

['CCCC', 'CC']

By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible by putting a ? after them.

In [53]:
re.findall("C{2,4}?", str)

['CC', 'CC', 'CC']

In [54]:
re.findall("C[LX]+?", str)

['CL']
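
The difference between greedy and lazy matching is easiest to see when a delimiter occurs more than once. As an illustration on a hypothetical snippet of markup, the greedy pattern runs from the first '<' to the last '>', while the lazy pattern stops at the first '>' it can.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: '.+' grabs as much as possible, so one match spans the whole string
greedy = re.findall(r'<.+>', html)

# Lazy: '.+?' grabs as little as possible, so each tag is matched on its own
lazy = re.findall(r'<.+?>', html)

print(greedy)
print(lazy)
```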

## Grouping and Backreferences

Parentheses create a numbered capturing group (number 1, 2 etc.). You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.

In [55]:
fruit = ['banana','coconut','cucumber','jujube','papaya','salal_berry']
print([re.findall("(..)\\1", x) for x in fruit])

[['an'], ['co'], ['cu'], ['ju'], ['pa'], ['al']]
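
Note that `re.findall()` returns only the text of the capturing group when the pattern contains one, which is why the output above shows two letters rather than four. A small sketch of how to recover the whole repeated pair is to use `re.search()` and ask the match object for group 0 (the entire match) or group 1 (the captured pair).

```python
import re

fruit = ['banana', 'coconut', 'cucumber', 'jujube', 'papaya', 'salal_berry']
repeated_pair = re.compile(r'(..)\1')

# group(0) is the entire match; group(1) is the first capturing group
whole = [repeated_pair.search(x).group(0) for x in fruit]
pair = [repeated_pair.search(x).group(1) for x in fruit]

print(whole)
print(pair)
```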


# Tools

Now it's time to learn how to apply them to real problems.
* Determine which strings match a pattern.
* Find the positions of matches.
* Extract the content of matches.
* Replace matches with new values.
* Split a string based on a match.

Because regular expressions are so powerful, it’s easy to try and solve every problem with a single regular expression. In the words of Jamie Zawinski:

*Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.*

As a cautionary tale, here is a regular expression that checks whether an email address is valid:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

See the stackoverflow discussion at http://stackoverflow.com/a/201378 for more details.

In [39]:
email = r'(^[A-Za-z0-9\u4e00-\u9fa5]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+$)'  # raw string, so '\.' is not an invalid string escape
print(re.match(email, '2017302010@whu.edu.cn'))

<re.Match object; span=(0, 21), match='2017302010@whu.edu.cn'>


In [112]:
# The range \u4e00-\u9fa5 used in the pattern above covers common CJK ideographs
print('\u4e00,\u4e01,\u4e02,\u9fa4,\u9fa5')

一,丁,丂,龤,龥


## Detect matches

In [114]:
# Determine whether each string matches a pattern
str = ["apple","banana","pear"]
print([bool(re.search('p', x)) for x in str])

[True, False, True]


In a numeric context, True counts as 1 and False as 0, so sum() and mean() can answer questions about matches across a larger list such as ***words***.

In [41]:
df = pd.read_csv('words.csv', names=['words'])
words = df.to_string(header=None, index=False, columns=['words']).replace(' ','').split('\n')
print(words)

['a', 'able', 'about', 'absolute', 'accept', 'account', 'achieve', 'across', 'act', 'active', 'actual', 'add', 'address', 'admit', 'advertise', 'affect', 'afford', 'after', 'afternoon', 'again', 'against', 'age', 'agent', 'ago', 'agree', 'air', 'all', 'allow', 'almost', 'along', 'already', 'alright', 'also', 'although', 'always', 'america', 'amount', 'and', 'another', 'answer', 'any', 'apart', 'apparent', 'appear', 'apply', 'appoint', 'approach', 'appropriate', 'area', 'argue', 'arm', 'around', 'arrange', 'art', 'as', 'ask', 'associate', 'assume', 'at', 'attend', 'authority', 'available', 'aware', 'away', 'awful', 'baby', 'back', 'bad', 'bag', 'balance', 'ball', 'bank', 'bar', 'base', 'basis', 'be', 'bear', 'beat', 'beauty', 'because', 'become', 'bed', 'before', 'begin', 'behind', 'believe', 'benefit', 'best', 'bet', 'between', 'big', 'bill', 'birth', 'bit', 'black', 'bloke', 'blood', 'blow', 'blue', 'board', 'boat', 'body', 'book', 'both', 'bother', 'bottle', 'bottom', 'box', 'boy', '

In [43]:
# How many 'a's in words?
print(sum([x.count('a') for x in words]))

# How many common words start with t?
print(sum([bool(re.search('^t', x)) for x in words]))

# What proportion of common words end with a vowel?
print(np.mean([bool(re.search('[aeiou]$', x)) for x in words]))

385
65
0.27653061224489794


When you have complex logical conditions, it’s easier to do with logical operators, rather than trying to create a single regular expression.

In [60]:
print(sum([bool(re.search('^[^t]', x)) for x in words])) # Harder
print(sum([not bool(re.search('^t', x)) for x in words]))  # Easier

915
915
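
The same idea applies to other conditions. As a further sketch on a small hypothetical word list, finding the words that contain no vowels can be written either as one anchored regular expression or as the negation of a much simpler search; both give the same answer, but the second is easier to read.

```python
import re

# Hypothetical sample; the real `words` list is loaded from words.csv
sample = ['by', 'dry', 'apple', 'sky', 'tree']

# One regular expression: the whole word consists of non-vowels
single_regex = [w for w in sample if re.search(r'^[^aeiou]+$', w)]

# Simple search plus a logical operator: no vowel is found anywhere
negated = [w for w in sample if not re.search(r'[aeiou]', w)]

print(single_regex)
print(negated)
```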


Directly select the elements that match a pattern.

In [61]:
print([x for x in words if re.search('x$', x)])

['box', 'sex', 'six', 'tax']


In [62]:
# An easier way for a DataFrame column
print(df[df['words'].str.contains("x$")])

    words
107   box
746   sex
771   six
840   tax


Note that matches never overlap. 

In [63]:
print('abababa'.count('aba'))
print(re.findall('aba','abababa'))

2
['aba', 'aba']
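
If overlapping matches are needed, one common workaround is a lookahead. A lookahead matches at a position without consuming any characters, so the search can restart one character later. The sketch below wraps the pattern in `(?=(...))` so that `re.findall()` returns the captured text at each position.

```python
import re

text = 'abababa'

# Ordinary matching consumes characters, so matches cannot overlap
non_overlapping = re.findall(r'aba', text)

# A lookahead is zero-width; the capturing group inside it records the text
overlapping = re.findall(r'(?=(aba))', text)

# Start positions of the overlapping matches
starts = [m.start() for m in re.finditer(r'(?=aba)', text)]

print(non_overlapping)
print(overlapping)
print(starts)
```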


## Extract Matches

To extract the actual text of a match, we need a more complicated example, the [Harvard sentences](https://en.wikipedia.org/wiki/Harvard_sentences).

In [64]:
df = pd.read_csv('sentences.csv', names=['sentences'])
sentences = df.to_string(header=None, index=False, columns=['sentences']).replace('  ','').split('\n')
print(sentences)

['The birch canoe slid on the smooth planks.', ' Glue the sheet to the dark blue background.', "It's easy to tell the depth of a well.", 'These days a chicken leg is a rare dish.', 'Rice is often served in round bowls.', ' The juice of lemons makes fine punch.', ' The box was thrown beside the parked truck.', ' The hogs were fed chopped corn and garbage.', ' Four hours of steady work faced us.', 'Large size in stockings is hard to sell.', 'The boy was there when the sun rose.', ' A rod is used to catch pink salmon.', ' The source of the huge river is the clear spring.', 'Kick the ball straight and follow through.', 'Help the woman get back to her feet.', ' A pot of tea helps to pass the evening.', 'Smoky fires lack flame and heat.', "The soft cushion broke the man's fall.", ' The salt breeze came across from the sea.', ' The girl at the booth sold fifty bonds.', 'The small pup gnawed a hole in the sock.', ' The fish twisted and turned on the bent hook.', ' Press the pants and sew a but

In [65]:
# How many lines in sentences
print(len(sentences))

720


Find all sentences that contain a colour.

In [66]:
colours = ["red", "orange", "yellow", "green", "blue", "purple"]
colour_match = '|'.join(colours)
print(colour_match)

red|orange|yellow|green|blue|purple


In [67]:
# Find out the sentences that contain a colour
print([x for x in sentences if re.search(colour_match, x)])

[' Glue the sheet to the dark blue background.', ' Two blue fish swam in the tank.', ' The colt reared and threw the tall rider.', ' The wide road shimmered in the hot sun.', 'See the cat glaring at the scared mouse.', ' A wisp of cloud hung in the blue air.', ' Leaves turn brown and yellow in the fall.', 'He ordered peach pie with ice cream.', ' Pure bred poodles have curls.', 'The spot on the blotter was made by green ink.', 'Mud was spattered on the front of his white shirt.', 'The sofa cushion is red and of light weight.', ' The sky that morning was clear and bright blue.', ' Torn scraps littered the stone floor.', 'The doctor cured him with these pills.', ' The new girl was fired today at noon.', ' The third act was dull and tired the players.', ' A blue crane is a tall wading bird.', 'Lire wires should be kept covered.', 'It is hard to erase blue or red ink.', 'The wreck occurred by the bank on Main Street.', ' The lamp shone with a steady green flame.', 'The box is held by a bri

In [68]:
# Find out all the sentences that have more than 1 match
more_match = [x for x in sentences if sum([x.count(y) for y in colours])>1]
print(more_match)

['It is hard to erase blue or red ink.', ' The green light in the brown box flickered.', 'The sky in the west is tinged with orange red.']


In [69]:
# Also find out the corresponding color words in each sentence
print([re.findall(colour_match,x) for x in more_match])

[['blue', 'red'], ['green', 'red'], ['orange', 'red']]
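
The second result contains 'red' because the pattern also matches inside 'flickered'. One possible refinement, sketched below on two of the sentences shown above, is to wrap the alternation in word boundaries so that only whole words count as colours.

```python
import re

colours = ["red", "orange", "yellow", "green", "blue", "purple"]

# \b on both sides restricts the match to whole words
colour_word = re.compile(r'\b(?:' + '|'.join(colours) + r')\b')

samples = ['It is hard to erase blue or red ink.',
           'The green light in the brown box flickered.']
found = [colour_word.findall(s) for s in samples]

print(found)
```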


## Grouped Matches

Earlier in this chapter we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match.

In [116]:
# Heuristic for finding nouns: take the word that follows 'a' or 'the'
noun = "(a|the) ([^ .]+)"
print([re.findall(noun,x) for x in sentences])

[[('the', 'smooth')], [('the', 'sheet'), ('the', 'dark')], [('the', 'depth'), ('a', 'well')], [('a', 'chicken'), ('a', 'rare')], [], [], [('the', 'parked')], [], [], [], [('the', 'sun')], [], [('the', 'huge'), ('the', 'clear')], [('the', 'ball')], [('the', 'woman')], [('a', 'helps'), ('the', 'evening')], [], [('the', "man's")], [('the', 'sea')], [('the', 'booth')], [('a', 'hole'), ('the', 'sock')], [('the', 'bent')], [('the', 'pants'), ('a', 'button'), ('the', 'vest')], [], [('the', 'view'), ('the', 'young')], [('the', 'tank')], [], [('the', 'tall')], [('the', 'same')], [], [('the', 'load')], [('the', 'winding'), ('the', 'lake')], [('the', 'size'), ('the', 'gas')], [('the', 'grease')], [('the', 'coat')], [], [], [], [('the', 'bell')], [], [('the', 'state'), ('the', 'early')], [('the', 'sharp')], [('the', 'third')], [('the', 'hot')], [('the', 'cool')], [('the', 'square'), ('the', 'fence')], [('the', 'seven')], [('the', 'fence')], [('the', 'drug')], [], [('the', 'coat')], [('the', 'mouse
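
When only the first match is needed, `re.search()` returns a match object whose groups can be read individually. The sketch below also gives the groups names with `(?P<name>...)`; the names `article` and `word` are chosen here for illustration and are not part of the original pattern.

```python
import re

# Same pattern as above, with illustrative group names added
noun = re.compile(r'(?P<article>a|the) (?P<word>[^ .]+)')

sentence = 'The birch canoe slid on the smooth planks.'
m = noun.search(sentence)

whole = m.group(0)            # the entire match
parts = m.groups()            # all groups as a tuple
article = m.group('article')  # access by name
word = m.group('word')

print(whole, parts, article, word)
```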

## Replacing Matches

The simplest use is to replace a pattern with a fixed string.

In [46]:
str = ["apple", "pear", "banana"]
print([re.sub('[aeiou]','-',x) for x in str])

['-ppl-', 'p--r', 'b-n-n-']


In [49]:
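# Or use pandas Series.str.replace(); regex=True treats the pattern as a regular expression
# (newer pandas versions default to literal matching)
x=pd.Series(str)
x.str.replace('[aeiou]','-', regex=True)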

0     -ppl-
1      p--r
2    b-n-n-
dtype: object

In [72]:
# Perform multiple replacements
str = ["1 house", "2 cars", "3 people"]
rep = {"1": "one", "2": "two","3":"three"} 
print([x.translate(x.maketrans(rep)) for x in str])

['one house', 'two cars', 'three people']


In [54]:
# Or using replace()
x=pd.Series(['4 houses','2 cars','3 people'])
rep=pd.Series(['four','two','three'],index=['4','2','3'])
x.replace(rep,regex=True)

0     four houses
1        two cars
2    three people
dtype: object
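
Replacements do not have to be fixed strings. As a sketch of two further options in `re.sub()`: the replacement can refer back to captured groups with \1, \2 and so on, or it can be a function that receives the match object and returns the new text. The sample sentence and the `numbers` dictionary are illustrative.

```python
import re

# Backreferences in the replacement: swap the second and third words
sentence = 'The birch canoe slid on the smooth planks.'
swapped = re.sub(r'(\w+) (\w+) (\w+)', r'\1 \3 \2', sentence, count=1)

# A function as the replacement: look each digit up in a dictionary,
# leaving digits without an entry unchanged
numbers = {'1': 'one', '2': 'two', '3': 'three'}
items = ['1 house', '2 cars', '3 people']
spelled = [re.sub(r'\d', lambda m: numbers.get(m.group(0), m.group(0)), s)
           for s in items]

print(swapped)
print(spelled)
```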

## Splitting

Use split() to split a string up into pieces.

In [73]:
str = "This is string example....Wow!!!"
print(str.split())
print(str.split('i', 1))   # the second argument (maxsplit) limits the number of splits to 1
print(str.split('w'))     # Case sensitive: 'W' is not treated as a separator

['This', 'is', 'string', 'example....Wow!!!']
['Th', 's is string example....Wow!!!']
['This is string example....Wo', '!!!']


In [74]:
# Uppercase the whole string
print(str.upper().split())
# Lowercase the whole string
print(str.lower().split())

['THIS', 'IS', 'STRING', 'EXAMPLE....WOW!!!']
['this', 'is', 'string', 'example....wow!!!']


In [75]:
# Capitalize the first letter of the whole string (the rest is lowercased)
print(str.capitalize().split())

['This', 'is', 'string', 'example....wow!!!']


In [76]:
# Capitalize the first letter of each word
print(str.title().split())

['This', 'Is', 'String', 'Example....Wow!!!']
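
The string method split() only accepts a fixed separator. To split on a pattern, `re.split()` can be used instead. A minimal sketch on the same example string: splitting on runs of spaces, dots, and exclamation marks leaves an empty string at the end because the text finishes with a separator, so the empty pieces are filtered out afterwards.

```python
import re

text = 'This is string example....Wow!!!'

# Split on one or more spaces, dots, or exclamation marks
pieces = re.split(r'[ .!]+', text)

# Drop the empty string produced by the trailing separator
words_only = [p for p in pieces if p]

print(pieces)
print(words_only)
```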


## Find Matches

Use re.findall() and re.finditer(), already introduced in the Basic Matches section above.

# Other Types of Pattern

When you pass a pattern as a plain string, ***re*** compiles it with the default settings, that is, without any flags.

In [77]:
# The common situation
str = ["banana", "Banana", "BANANA"]
print([re.findall('banana',x) for x in str])

[['banana'], [], []]


<img src="img/flags.jpg" style="zoom:50%" />

In [78]:
# Control details of the match

## Ignore the uppercase or lowercase forms
print([re.findall('banana', x, flags=re.I) for x in str])

[['banana'], ['Banana'], ['BANANA']]


In [79]:
## Allow ^ and $ to match the start and end of each line rather than the start and end of the complete string
str = "Line 1\nLine 2\nLine 3"
print(re.findall('^Line', str))
print(re.findall('^Line', str, flags=re.M))

['Line']
['Line', 'Line', 'Line']


In [80]:
## Make the dot character ‘.’ match any character, including a newline
str = """once upon a time,
there lived a king"""
print(re.findall(r".+", str))
print(re.findall(r".+", str, flags=re.S))

['once upon a time,', 'there lived a king']
['once upon a time,\nthere lived a king']
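
Flags can also be combined. The sketch below shows two equivalent ways to do so on a hypothetical multi-line string: joining flag constants with '|', or writing them inline at the start of the pattern as `(?im)`.

```python
import re

text = "line 1\nLINE 2\nLine 3"

# Combine flags with '|': ignore case AND treat each line separately
combined = re.findall(r'^line \d$', text, flags=re.I | re.M)

# The same flags written inline at the start of the pattern
inline = re.findall(r'(?im)^line \d$', text)

print(combined)
print(inline)
```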


In [81]:
## Allow comments and whitespace inside the regex; whitespace is ignored except in a character class or after an unescaped backslash
phone1 = re.compile("""\\(?  # optional opening parens
                    (\\d{3}) # area code, three numbers
                    [) -]?   # optional closing parens, space, or dash
                    (\\d{3}) # another three numbers
                    [ -]?    # optional space or dash
                    (\\d{4}) # four more numbers""", re.X)
phone2 = re.compile("\\(?(\\d{3})[) -]?(\\d{3})[ -]?(\\d{4})")    # phone2 is the same as phone1

print(re.findall(phone1, "514-791-8141"))
print(re.findall(phone2, "514-791-8141"))

[('514', '791', '8141')]
[('514', '791', '8141')]


# Other Uses of Regular Expressions

* pydoc.apropos() searches all available modules and prints each one whose name or one-line summary contains the given keyword.

In [82]:
pydoc.apropos('replace')

_dummy_thread - Drop-in replacement for the thread module.
idlelib.idle_test.mock_tk - Classes that replace tkinter gui objects used by an object being tested.
idlelib.idle_test.test_replace 
idlelib.replace - Replace dialog for IDLE. Inherits SearchDialogBase for GUI.
lib2to3.fixes.fix_asserts - Fixer that replaces deprecated unittest method names.
astropy.coordinates.tests.test_atc_replacements - Test replacements for ERFA functions atciqz and aticq.
bottleneck.tests.nonreduce_test - Test replace().
ipykernel.displayhook - Replacements for sys.displayhook that publish over ZMQ.
joblib.test.test_numpy_pickle - Test the numpy pickler as a replacement of the standard pickler.
jupyterlab.tests.test_registry - Test yarn registry replacement
notebook.tests.selenium.test_find_and_replace 
pandas.tests.arrays.categorical.test_replace 
pandas.tests.frame.methods.test_replace 
pandas.tests.series.methods.test_replace 
spyder.widgets.findreplace - Find/Replace widget
sympy.core.rules - Replacem

* os.listdir(path) lists all the entries (files and subdirectories) in a directory.

In [56]:
# cwd represents the current working directory
cwd = os.getcwd()
os.listdir(cwd)

['Chapter15.ipynb',
 '.DS_Store',
 'Chapter14 Strings.ipynb',
 'words.csv',
 'Chapter16.ipynb',
 'sentences.csv',
 'img',
 '.ipynb_checkpoints',
 'gss_cat.csv']

* Filter files whose names contain a given substring.

In [58]:
def findfiles(path, name):
    # Walk the directory tree and print every file whose name contains `name`
    if os.path.isdir(path):
        for dirpath, dirnames, filenames in os.walk(path):
            for file in filenames:
                if name in file:
                    result = [file]
                    print(result)
    else:
        print("The path does not exist!")

# Define the path and file name pattern
path = cwd
name = ".csv"
findfiles(path,name)

['words.csv']
['sentences.csv']
['gss_cat.csv']
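
The substring test `name in file` also accepts names such as 'notes.csv.bak'. A possible refinement, sketched here on a hypothetical list of file names rather than a real directory, is to filter with a regular expression anchored at the end of the name. The standard library modules fnmatch and glob offer shell-style wildcards as an alternative.

```python
import re

# Hypothetical directory listing (in practice this could come from os.listdir())
filenames = ['Chapter15.ipynb', '.DS_Store', 'words.csv', 'sentences.csv',
             'img', 'gss_cat.csv', 'notes.csv.bak']

# Substring test: also picks up 'notes.csv.bak'
by_substring = [f for f in filenames if '.csv' in f]

# Regular expression: the name must END with '.csv'
csv_pattern = re.compile(r'\.csv$')
by_regex = [f for f in filenames if csv_pattern.search(f)]

print(by_substring)
print(by_regex)
```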


# Stringi

In R, ***stringr*** is built on top of the ***stringi*** package.