Parsing Text Using Regular Expressions
----------------------------------------------------------

In this notebook, we learn by coding how regular expressions help when working with text data. They are especially useful for raw data from the web, which often contains HTML tags, long passages, and
repeated text. We usually do not want such noise while developing an
application or in its output.

We can do all sorts of basic and advanced data cleaning using regular expressions.

Problem
------------
You want to parse text data using regular expressions.

Solution
------------
A straightforward way to do this is with the built-in “re” module in Python.

How It Works
-------------------
Let’s look at some of the ways we can use regular expressions for our tasks.

Basic flags: the basic flags are I, L, M, S, U, X:

• re.I: Ignores case when matching.

• re.L: Makes \w, \W, \b, \B, and case-insensitive matching depend on the current locale (only valid for bytes patterns).

• re.M: Makes ^ and $ match at the start and end of every line, not just of the whole string.

• re.S: Makes the dot (.) match any character, including a newline.

• re.U: Enables Unicode matching; this is already the default for str patterns in Python 3.

• re.X: Allows whitespace and comments inside the pattern, so the regex can be written in a more readable format.
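
The flags above can be tried on a small string. The following is a minimal sketch; the sample text and variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical two-line sample text
text = "Hello World\nhello regex"

# re.I: case-insensitive matching
case_hits = re.findall(r"hello", text, flags=re.I)

# re.M: ^ also matches at the start of each line
line_starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches the newline character
dot_default = re.search(r"World.hello", text)
dot_all = re.search(r"World.hello", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", flags=re.X)
number_hit = number.search("pi is about 3.14").group()

print(case_hits)      # ['Hello', 'hello']
print(line_starts)    # ['Hello', 'hello']
print(dot_default)    # None
print(dot_all is not None)  # True
print(number_hit)     # 3.14
```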

Regular expressions’ functionality:
-------------------------------------------------
• Match a single character that is either a or b:
 [ab]

• Match any single character except a and b:
 [^ab]

• Match any single character in the range a to z:
 [a-z]

• Match any single character outside the range a to z:
 [^a-z]

• Match any single character from a to z or A to Z:
 [a-zA-Z]

• Any whitespace character:
 \s

• Any non-whitespace character:
 \S

• Any digit:
 \d

• Any non-digit:
 \D

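• Any non-word character (anything other than a letter, digit, or underscore):
 \W

• Any word character (letter, digit, or underscore):
 \w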

• Either match a or b:
 (a|b)

• Match zero or one occurrence of a, but not more:
 a?

• Match zero or more occurrences of a:
 a*

• Match one or more occurrences of a:
 a+

• Match exactly three consecutive occurrences of a:
 a{3}
 
• Match three or more consecutive occurrences of a:
  a{3,}

• Match between three and six consecutive occurrences of a:
  a{3,6}

• Start of the string:
  ^

• End of the string:
  $
  
• Match word boundary:
 \b

• Non-word boundary:
 \B
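
Several of the patterns listed above can be checked against one string. This is only an illustrative sketch; the sample sentence and variable names are invented for the example.

```python
import re

# Hypothetical sample sentence
sample = "Order 66 shipped to aaa-Street 12b, zip 560034."

digits = re.findall(r"\d+", sample)                # runs of digits
capitalized = re.findall(r"[A-Z][a-z]+", sample)   # character ranges
triple_a = re.findall(r"a{3}", sample)             # exactly three a's
two_digit = re.findall(r"\b\d{2}\b", sample)       # word boundaries
either = re.findall(r"(Order|zip)", sample)        # alternation
punctuation = re.findall(r"[^\w\s]", sample)       # negated class
starts_with_order = bool(re.search(r"^Order", sample))  # start anchor
ends_with_period = bool(re.search(r"\.$", sample))      # end anchor

print(digits)        # ['66', '12', '560034']
print(capitalized)   # ['Order', 'Street']
print(triple_a)      # ['aaa']
print(two_digit)     # ['66']
print(either)        # ['Order', 'zip']
print(punctuation)   # ['-', ',', '.']
print(starts_with_order, ends_with_period)  # True True
```

Note that \b\d{2}\b skips "12b" and "560034" because the digits there are not bounded by non-word characters on both sides.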

The re.match() and re.search() functions find patterns in text; the resulting matches can then be processed according to the requirements of the application.

> Note the differences between re.match() and re.search():

> • re.match(): Checks for a match only at the beginning of the string. If the pattern is found there, it returns a match object; otherwise, it returns None.

> • re.search(): Checks for a match anywhere in the string. It returns a match object for the first occurrence of the pattern in the input, or None if the pattern does not occur.
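
The difference between the two functions can be seen on a short string. This is a minimal sketch with a made-up sample sentence.

```python
import re

# Hypothetical sample sentence
text = "regex makes parsing text easier"

at_start = re.match(r"regex", text)      # pattern is at position 0
not_at_start = re.match(r"text", text)   # pattern exists, but not at position 0
anywhere = re.search(r"text", text)      # first occurrence anywhere

print(at_start.group())     # regex
print(not_at_start)         # None
print(anywhere.group(), anywhere.span())  # text (20, 24)
```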

In [1]:
# Tokenizing
# You want to split the sentence into words – tokenize. 
# One of the ways to do this is by using re.split.

# Import library
import re

# Run the split query (the raw string avoids an invalid-escape warning for \s)
re.split(r'\s+', 'I like this book.')

['I', 'like', 'this', 'book.']
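
Splitting on whitespace leaves punctuation attached to the last token ('book.'). As a possible variation, not taken from the original notebook, the sketch below compares splitting on non-word characters with collecting word characters directly.

```python
import re

sentence = "I like this book."

# Splitting on runs of non-word characters drops the period,
# but leaves an empty string after the final separator
split_tokens = re.split(r"\W+", sentence)

# Collecting runs of word characters avoids the empty string
word_tokens = re.findall(r"\w+", sentence)

print(split_tokens)  # ['I', 'like', 'this', 'book', '']
print(word_tokens)   # ['I', 'like', 'this', 'book']
```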

In [2]:
# Extracting email IDs
# The simplest way to do this is by using re.findall.

doc = "For more details please mail us at: xyz@abc.com, p-q.r@mno.com"
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)

for address in addresses:
    print(address)

xyz@abc.com
p-q.r@mno.com


In [3]:
# Replacing email IDs
# Here we replace email IDs in sentences or documents with another
# email ID. The simplest way to do this is by using re.sub.

doc = "For more details please mail us at xyz@abc.com"

new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)',r'pqr@mno.com', doc)
print(new_email_address)

For more details please mail us at pqr@mno.com
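
The pattern above defines two capture groups (the part before and the part after the @), although the replacement does not use them. As a hedged illustration of what the groups allow, the sketch below masks only the user part and keeps the domain through the \2 backreference; the masking format is an assumption chosen for the example.

```python
import re

doc = "For more details please mail us at xyz@abc.com"

# \2 in the replacement refers to the second capture group (the domain)
masked = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'***@\2', doc)
print(masked)  # For more details please mail us at ***@abc.com
```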


In [6]:
# Extract data from the ebook and apply regex

# Import libraries
import re
import requests

# URL of the book to download
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'


# Function to download the book and keep only its opening part
def get_book(url):
    # Send an HTTP request to get the text from Project Gutenberg
    response = requests.get(url)
    # The file is UTF-8; set it explicitly so characters such as ’ decode correctly
    response.encoding = 'utf-8'
    raw = response.text

    # Discard the metadata at the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()

    # Stop at the first occurrence of "II" in the file,
    # which here keeps only the opening chapter
    stop = re.search(r"II", raw).start()

    # Keep the relevant text
    text = raw[start:stop]
    return text

# Preprocessing: replace every run of characters other than letters,
# digits, and periods with a single space, then lowercase
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Call the functions above
book = get_book(url)
processed_book = preprocess(book)
print(book)








Produced by Martin Adamson, David Widger, with corrections by Andrew Sly










THE IDIOT

By Fyodor Dostoyevsky


Translated by Eva Martin




PART I

I.

Towards the end of November, during a thaw, at nine o’clock one morning,
a train on the Warsaw and Petersburg railway was approaching the latter
city at full speed. The morning was so damp and misty that it was only
with great difficulty that the day succeeded in breaking; and it was
impossible to distinguish anything more than a few yards away from the
carriage windows.

Some of the passengers by this particular train were returning from
abroad; but the third-class carriages were the best filled, chiefly with
insignificant persons of various occupations and degrees, picked up at
the different stations nearer town. All of them seemed weary, and
most of them had sleepy eyes and a shivering expression, while their
complexions generally appeared to have taken on the colour of the fog
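
Note that re.search returns None when a marker is missing, so calling .end() or .start() on the result would raise an AttributeError. The sketch below shows one possible defensive variant on an in-memory string; slice_between and the sample markers are hypothetical names used only for illustration, not part of the notebook.

```python
import re

def slice_between(raw, start_pattern, stop_pattern):
    """Return the text between two regex markers, or None if either is missing."""
    start_match = re.search(start_pattern, raw)
    if start_match is None:
        return None
    # Look for the stop marker only after the end of the start marker
    stop_match = re.compile(stop_pattern).search(raw, start_match.end())
    if stop_match is None:
        return None
    return raw[start_match.end():stop_match.start()]

# Hypothetical sample with made-up markers
sample = "header *** START *** Chapter one text. *** END *** footer"
body = slice_between(sample, r"\*\*\* START \*\*\*", r"\*\*\* END \*\*\*")
missing = slice_between(sample, r"\*\*\* BEGIN \*\*\*", r"\*\*\* END \*\*\*")

print(repr(body))  # ' Chapter one text. '
print(missing)     # None
```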

In [7]:
# Perform some exploratory data analysis on this data using regex
# Count the number of times "the" appears in the book
# (this also counts "the" inside longer words such as "them" or "other")
len(re.findall(r'the', processed_book))

302
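
The pattern r'the' counts every occurrence of those three letters, including inside longer words. If only the standalone word is wanted, word boundaries can be added. A small sketch on a made-up sentence:

```python
import re

# Hypothetical sample sentence
sample = "the other day they said the theme was there"

substring_hits = len(re.findall(r"the", sample))       # matches inside words too
whole_word_hits = len(re.findall(r"\bthe\b", sample))  # standalone "the" only

print(substring_hits)   # 6
print(whole_word_hits)  # 2
```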

In [8]:
# Replace a standalone "i" (surrounded by whitespace) with "I"
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)

 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fou

In [9]:
# Find all occurrences of text in the format "abc--xyz"
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',
 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',
 'do--by',
 'know--my',
 'il
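
The results above include one-sided matches such as 'I--' and '--though' because * allows zero characters on either side of the dashes. As an illustration on a made-up string, using + instead requires at least one character on both sides:

```python
import re

# Hypothetical sample string
sample = "yes--I know, but-- wait --though"

loose = re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', sample)
strict = re.findall(r'[a-zA-Z0-9]+--[a-zA-Z0-9]+', sample)

print(loose)   # ['yes--I', 'but--', '--though']
print(strict)  # ['yes--I']
```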

Self to-do regex 

In [11]:
text = 'Suven,ML is fun, isnt it?'

# [A-Za-z] matches letters only; the range [A-z] would also match [ \ ] ^ _ `
re.findall(r'([A-Za-z]+),ML', text)

['Suven']

In [13]:
import re
# Pattern: a five-character string that starts with "a" and ends with "s"
pattern = '^a...s$'
test_list = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

for test_string in test_list:
    result=re.match(pattern,test_string)
    if result:
        print(test_string,'-> match found ')
    else:
        print(test_string,'-> no match found')

abs -> no match found
alias -> match found 
abyss -> match found 
Alias -> no match found
An abacus -> no match found


3> This week's assignment: complete the babyname.py files
