### Problem
Process text with regular expressions.

### Solution
#### Using Python's `re` library

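##### 1 Basic flags: I, L, M, S, U, X
- `re.I`: ignore case when matching.
- `re.L`: make `\w`, `\W`, `\b`, `\B` and case-insensitive matching depend on the current locale (in Python 3 this is only valid with bytes patterns).
- `re.M`: multi-line mode; `^` and `$` match at the start and end of every line, not only of the whole string.
- `re.S`: make `.` match any character, including a newline.
- `re.U`: Unicode matching (already the default for `str` patterns in Python 3).
- `re.X`: verbose mode; whitespace and comments are allowed inside the pattern, which makes it easier to read.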
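
The flags above can be tried on a small string. The following is a minimal sketch; the sample text and variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical two-line sample text
text = "Hello World\nhello python"

# re.I: case-insensitive matching
hello_hits = re.findall(r"hello", text, flags=re.I)

# re.M: ^ matches at the start of every line
line_starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: "." also matches the newline between the two lines
across_lines = re.search(r"World.hello", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+       # integer part
    (\.\d+)?  # optional fractional part
""", flags=re.X)

print(hello_hits)                            # ['Hello', 'hello']
print(line_starts)                           # ['Hello', 'hello']
print(across_lines is not None)              # True
print(number.fullmatch("3.14") is not None)  # True
```

Flags can also be combined with `|`, for example `re.I | re.M`.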

##### 2 What regular expressions can match
- Either of the characters a or b: `Regex: [ab]`
- Any character except a and b: `Regex: [^ab]`
- Any character in the range a to z: `Regex: [a-z]`
- Any character outside the range a to z: `Regex: [^a-z]`
- Any character in the ranges a to z or A to Z: `Regex: [a-zA-Z]`
- Any whitespace character: `Regex: \s`
- Any non-whitespace character: `Regex: \S`
- Any digit: `Regex: \d`
- Any non-digit: `Regex: \D`
- Any non-word character: `Regex: \W`
- Any word character (letter, digit, or underscore): `Regex: \w`
- Either a or b: `Regex: (a|b)`
- Zero or one occurrence of a, never more: `Regex: a?`
- Zero or more occurrences of a: `Regex: a*`
- One or more occurrences of a: `Regex: a+`
- Exactly three occurrences of a: `Regex: a{3}`
- Three or more consecutive occurrences of a: `Regex: a{3,}`
- Between three and six consecutive occurrences of a: `Regex: a{3,6}`
- Start of the string: `Regex: ^`
- Word boundary: `Regex: \b`
- Non-word boundary: `Regex: \B`
- End of the string: `Regex: $`
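
A few of the patterns listed above, applied with `re.findall`. This is a minimal sketch; the sample strings are invented for illustration.

```python
import re

# Hypothetical sample string
sample = "aaa bb 42 cat_9 aaaaab"

either = re.findall(r"[ab]", "abc")          # characters a or b
neither = re.findall(r"[^ab]", "abc")        # anything except a and b
digits = re.findall(r"\d+", sample)          # runs of digits
runs = re.findall(r"a{3,6}", sample)         # 3 to 6 consecutive a's
words = re.findall(r"\bcat\b", "cat concat cat_9")  # "cat" as a whole word
alternation = re.findall(r"(?:a|b)+", "ab ba c")    # runs of a or b

print(either)       # ['a', 'b']
print(neither)      # ['c']
print(digits)       # ['42', '9']
print(runs)         # ['aaa', 'aaaaa']
print(words)        # ['cat']
print(alternation)  # ['ab', 'ba']
```

Note that `cat_9` does not count as the whole word `cat`, because the underscore is a word character and so there is no boundary after the `t`.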

##### 3 `re.match()`: 
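`re.match()` only checks for a match at the beginning of the string. If the pattern is found at the start of the input string, it returns a match object; otherwise it returns `None`.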
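
A minimal sketch of this behaviour, using made-up input strings:

```python
import re

# The pattern is found at the very start, so a match object is returned
m = re.match(r"\d+", "2023 was a good year")
print(m.group())  # 2023
print(m.span())   # (0, 4)

# The digits are not at the start, so re.match returns None
no_match = re.match(r"\d+", "year 2023")
print(no_match)   # None
```

Because the result can be `None`, it is safer to test it before calling `.group()`.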

##### 4 `re.search()`:
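`re.search()` scans the whole string and returns a match object for the first location where the pattern matches, or `None` if it matches nowhere. To collect every occurrence, use `re.findall()` or `re.finditer()`.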
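
The difference between finding the first match and finding all of them can be sketched as follows; the sample sentence is an assumption for illustration.

```python
import re

# Hypothetical sample text containing two numbers
text = "order 66 shipped, order 77 pending"

# re.search stops at the first match anywhere in the string
first = re.search(r"\d+", text)
print(first.group(), first.start())  # 66 6

# re.findall returns every non-overlapping match
all_numbers = re.findall(r"\d+", text)
print(all_numbers)  # ['66', '77']
```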

##### 5 `re.split()`

In [12]:
import re
# Split the sentence on runs of whitespace
re.split(r'\s+', 'I like this book.')

['I', 'like', 'this', 'book.']
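
`re.split` also accepts a character class as the separator and a `maxsplit` argument. A short sketch, with invented input strings:

```python
import re

# Split on a comma or semicolon followed by optional whitespace
parts = re.split(r"[,;]\s*", "red, green;blue,  yellow")
print(parts)  # ['red', 'green', 'blue', 'yellow']

# maxsplit=1 splits only at the first run of whitespace
head, rest = re.split(r"\s+", "I like this book.", maxsplit=1)
print(head)  # I
print(rest)  # like this book.
```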

##### 5 Extracting email addresses

In [14]:
# Define a sentence that contains email addresses
doc = "For more details please mail us at: xyz@abc.com,pqr@mno.com"

# Extract the email addresses
addresses = re.findall(r'[\w\.-]+@[\w\.-]+',doc)
for address in addresses:
    print(address)

xyz@abc.com
pqr@mno.com
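
If the user name and the domain are needed separately, capturing groups can be added to the same pattern. This is a sketch of one possible variant, not the only way to do it; when a pattern contains groups, `re.findall` returns tuples of the groups instead of the whole match.

```python
import re

doc = "For more details please mail us at: xyz@abc.com,pqr@mno.com"

# Group 1 captures the user name, group 2 the domain
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", doc)
for user, domain in pairs:
    print(user, domain)
# xyz abc.com
# pqr mno.com
```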


##### 6 Replacing email addresses

In [15]:
# Define a sentence that contains an email address
doc = "For more details please mail us at xyz@abc.com"

# Replace the address with re.sub
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)',r'pqr@mno.com', doc)
print(new_email_address)

For more details please mail us at pqr@mno.com
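
The pattern above captures two groups but the replacement does not use them. As a sketch of how they could be used, a backreference such as `\2` can keep the domain while masking the user name; the masking scheme here is only an illustration.

```python
import re

doc = "For more details please mail us at xyz@abc.com"

# \2 refers to the second captured group (the domain)
masked = re.sub(r"([\w.-]+)@([\w.-]+)", r"***@\2", doc)
print(masked)  # For more details please mail us at ***@abc.com
```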


##### 7 Reading data from an e-book

In [16]:
# Import library
import re
import requests

# URL of the book to download
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

# Download the book and keep only part of its text
def get_book(url):
    # Send an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text

    # Discard the metadata at the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()

    # Stop at the first occurrence of "II" in the raw text; this is a crude
    # cut that keeps only the opening of the book rather than removing the
    # trailing metadata
    stop = re.search(r"II", raw).start()
   
    # Keeps the relevant text
    text = raw[start:stop]
    return text

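# Replace every run of characters other than letters, digits, and periods
# with a single space, then lowercase the result
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

# Call the functions defined above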
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)

 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fou
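
`get_book` calls `.end()` and `.start()` directly on the result of `re.search`, which raises `AttributeError` if a marker is missing. The following is a minimal offline sketch of the same slicing idea with that case handled. The helper `trim_between`, the sample text, and the `*** END` marker are hypothetical and not part of the original notebook.

```python
import re

def trim_between(raw, start_pattern, stop_pattern):
    """Return the text after start_pattern and before stop_pattern."""
    start_match = re.search(start_pattern, raw)
    # Fall back to the beginning of the text if the start marker is missing
    start = start_match.end() if start_match else 0

    # Look for the stop marker only after the start position
    stop_match = re.compile(stop_pattern).search(raw, start)
    # Fall back to the end of the text if the stop marker is missing
    stop = stop_match.start() if stop_match else len(raw)
    return raw[start:stop]

# Hypothetical stand-in for a downloaded book
raw = (
    "header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "body text\n"
    "*** END ***\n"
    "footer"
)
body = trim_between(
    raw,
    r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*",
    r"\*\*\* END",
)
print(body.strip())  # body text
```

Searching for the stop marker from `start` onward avoids an empty slice when the stop pattern also happens to appear in the header.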

##### 8 Analyzing the data with regular expressions

In [18]:
# Count occurrences of "the" (as a substring, so "them" and "other" also count)
len(re.findall(r'the', processed_book))

302
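
Since the pattern `the` also matches inside longer words, the count above is higher than the number of times the word itself appears. A sketch of the difference on a made-up sentence, using `\b` word boundaries:

```python
import re

# Hypothetical sample sentence
sample = "the other day they met at the theatre"

# Substring count: also hits "other", "they", and "theatre"
substring_hits = len(re.findall(r"the", sample))

# Whole-word count: \b requires a word boundary on both sides
whole_word_hits = len(re.findall(r"\bthe\b", sample))

print(substring_hits)   # 5
print(whole_word_hits)  # 2
```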

In [19]:
# Replace a standalone "i" (surrounded by whitespace) with "I"
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)

 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fou
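
The pattern `\si\s` only matches an "i" with whitespace on both sides, so an "i" followed by punctuation (such as the `i.` in the output above) is left unchanged. One possible alternative, sketched here on an invented sentence, is to use word boundaries instead:

```python
import re

# Hypothetical sample sentence
sample = "when i arrived i. was told i should wait"

# Whitespace on both sides is required, so "i." is skipped
spaced = re.sub(r"\si\s", " I ", sample)
print(spaced)   # when I arrived i. was told I should wait

# \b matches at any word boundary, including before punctuation
bounded = re.sub(r"\bi\b", "I", sample)
print(bounded)  # when I arrived I. was told I should wait
```

Whether the punctuated case should be capitalized depends on the data; in the book text `i.` is a chapter numeral, so either choice may be reasonable.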

In [20]:
# Find all occurrences of text in the format "abc--xyz"
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',
 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',
 'do--by',
 'know--my',
 'il