# Manipulating text using Regex

A regular expression (regex) is a pattern that you give to a regex processor along with some source data. The processor parses
the source data using that pattern and returns the matching chunks of text for further manipulation. Regexes are mainly used:

1. To check whether a pattern exists in the data.
2. To get every instance of a complex pattern that exists in the data.
3. To clean the source data using a pattern.

For more detail, see the Python `re` documentation: https://docs.python.org/3/library/re.html
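As a minimal sketch of the three uses listed above, the snippet below checks for a pattern, extracts every occurrence of it, and cleans up whitespace. The `sample` string and the variable names are hypothetical and do not come from the notebook's datasets.

```python
import re

# Hypothetical sample data, not part of the original notebook.
sample = "Order  #123 shipped;   order #456 pending."

# 1. Check whether a pattern exists in the data.
has_order = re.search(r"#\d+", sample) is not None

# 2. Extract every occurrence of the pattern.
order_ids = re.findall(r"#\d+", sample)

# 3. Clean the data: collapse runs of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", sample)

print(has_order)   # True
print(order_ids)   # ['#123', '#456']
print(cleaned)     # Order #123 shipped; order #456 pending.
```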

In [10]:
import re

In [11]:
text = "Today is a good day"

if re.search("good",text):
    print("WonderFul!")
else:
    print("Nothing")

WonderFul!


In [14]:
text = "John works diligently. John gets good grades. Our student John is succesful."

re.split("John",text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is succesful.']

In [15]:
re.findall("John",text)

['John', 'John', 'John']

##### `^` (caret) anchors a match to the start of the string and `$` (dollar sign) anchors it to the end

In [20]:
re.findall("^John",text)

['John']

In [22]:
re.findall("succesful.$",text)

['succesful.']
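One detail worth illustrating: in the pattern above the `.` is unescaped, so it matches any character, not only a full stop. The sketch below contrasts the unescaped and escaped forms, and also shows the `re.MULTILINE` flag, which makes `^` and `$` match at every line boundary. The strings `other` and `lines` are made up for illustration.

```python
import re

text = "John works diligently. John gets good grades. Our student John is succesful."
other = "Our student John is succesful!"  # hypothetical variant ending in "!"

# An unescaped "." matches any character, so the "!" is accepted too.
loose = re.findall("succesful.$", other)

# Escaping the dot restricts the match to a literal full stop.
strict = re.findall(r"succesful\.$", other)
strict_ok = re.findall(r"succesful\.$", text)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
lines = "John works\nMary studies\nJohn rests"
starts = re.findall(r"^John", lines, re.MULTILINE)

print(loose)      # ['succesful!']
print(strict)     # []
print(strict_ok)  # ['succesful.']
print(starts)     # ['John', 'John']
```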

## Patterns and Character Classes

In [71]:
grades="ACAAAABCBCBAA"

# How many B grades are there?

re.findall('B',grades)

['B', 'B', 'B']

In [72]:
# How many grades are either A or B?

re.findall('[AB]',grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

In [66]:
# Find every A that is immediately followed by a B or a C

re.findall("[A][B-C]",grades)

['AC', 'AB']

In [29]:
# We can use the pipe operator (|), which means OR, to find grades with AC or AB

re.findall("AB|AC",grades)

['AC', 'AB']

In [33]:
# We can use the caret inside the set operator to negate the set. Here we parse out the grades which were not 'A'

re.findall("[^A]",grades)

['C', 'B', 'C', 'B', 'C', 'B']

In [34]:
# Check whether the grades string begins with something other than A (it begins with A, so the result is empty)

re.findall("^[^A]",grades)

[]
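Since `re.findall` returns a list, the "how many" questions above can be answered by taking its length. The following sketch also puts the two meanings of the caret side by side; the variable names are illustrative only.

```python
import re

grades = "ACAAAABCBCBAA"

# findall returns a list, so len() gives the number of matches.
count_b = len(re.findall("B", grades))
count_a_or_b = len(re.findall("[AB]", grades))

# Inside a set, a leading ^ negates the set.
not_a = re.findall("[^A]", grades)
# Outside a set, ^ anchors the match to the start of the string.
starts_with_a = re.findall("^[A]", grades)

print(count_b)        # 3
print(count_a_or_b)   # 10
print(not_a)          # ['C', 'B', 'C', 'B', 'C', 'B']
print(starts_with_a)  # ['A']
```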

## Quantifiers

A quantifier specifies how many times a pattern should be matched. The basic form is e{m,n}, where e is the expression or character we are matching, m is the minimum
number of times it must be matched, and n is the maximum number of times it may be matched.
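A short sketch of the common `e{m,n}` variants, using the same `grades` string as the rest of this section. Quantifiers are greedy by default, which is why `A{1,3}` takes three A's from the run of four before matching the leftover one.

```python
import re

grades = "ACAAAABCBCBAA"

exactly_two = re.findall("A{2}", grades)    # exactly 2 A's
two_or_more = re.findall("A{2,}", grades)   # at least 2 A's, no upper bound
one_to_three = re.findall("A{1,3}", grades) # between 1 and 3 A's, greedy

print(exactly_two)   # ['AA', 'AA', 'AA']
print(two_or_more)   # ['AAAA', 'AA']
print(one_to_three)  # ['A', 'AAA', 'A', 'AA']
```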

In [36]:
# How many times has the student scored back-to-back A grades?

re.findall("A{2,10}",grades) # we'll use 2 as our min, but ten as our max

['AAAA', 'AA']

In [37]:
re.findall("A{1,1}A{1,1}",grades) 

['AA', 'AA', 'AA']

In [38]:
re.findall("A{2}A{2}",grades) 

['AAAA']

In [39]:
re.findall("A{2}",grades) 

['AA', 'AA', 'AA']

In [40]:
# Using this, we could find a decreasing trend in a student's grades
re.findall("A{1,10}B{1,10}C{1,10}",grades)

['AAAABC']

In [41]:
# We also have the asterisk * to match zero or more times, the question mark ? to
# match zero or one time, and the plus sign + to match one or more times. Let's look at a more complex example
# and load some data scraped from Wikipedia
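Before moving to the larger dataset, here is a small sketch of the three shorthand quantifiers on the `grades` string: `*` (zero or more), `?` (zero or one) and `+` (one or more). The variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# + : one or more A's in a row
one_or_more = re.findall("A+", grades)

# ? : an A optionally followed by a single B
optional_b = re.findall("AB?", grades)

# * : zero or more A's followed by a B
zero_or_more = re.findall("A*B", grades)

print(one_or_more)   # ['A', 'AAAA', 'AA']
print(optional_b)    # ['A', 'A', 'A', 'A', 'AB', 'A', 'A']
print(zero_or_more)  # ['AAAAB', 'B', 'B']
```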

In [4]:
with open("datasets/ferpa.txt","r") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

'Overview[edit]\nFERPA gives parents access to their child\'s education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student\'s consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.\n\nOther regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student\'s personally identifiable information without the student\'s consent.[2]\n\nExamples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student\'s grades o

In [75]:
re.findall(r"[a-zA-Z]{1,100}\[edit\]",wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [76]:
# \w is a metacharacter that matches any letter, digit, or underscore. There
# are a number of other metacharacters listed in the documentation. For instance, \s matches any
# whitespace character.
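A quick sketch of a few metacharacters on a made-up string (the `sample` text is hypothetical). Note that `\w` also matches the underscore, and that the patterns are written as raw strings (`r"..."`) so Python does not try to interpret the backslashes itself.

```python
import re

sample = "user_id = 42\tscore = 9"  # hypothetical sample text

words = re.findall(r"\w+", sample)   # letters, digits and underscores
digits = re.findall(r"\d+", sample)  # digits only
pieces = re.split(r"\s+", sample)    # split on any run of whitespace

print(words)   # ['user_id', '42', 'score', '9']
print(digits)  # ['42', '9']
print(pieces)  # ['user_id', '=', '42', 'score', '=', '9']
```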

In [78]:
re.findall(r"[\w]*\[edit\]",wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [79]:
re.findall(r"[\w ]*\[edit\]",wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

In [81]:
for title in re.findall(r"[\w ]*\[edit\]",wiki):
    print(re.split(r"\[",title)[0])

Overview
Access to public records
Student medical records


#### We can group with parentheses `()` instead of splitting each match in a loop

In [83]:
re.findall(r"([\w ]*)(\[edit\])",wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]
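When a pattern has more than one group, `re.findall` returns tuples, as above. If only the title is needed, one option is to make the second group non-capturing with `(?:...)`, so that `findall` returns plain strings. This is a sketch on a short hypothetical `sample` string that imitates the structure of the `wiki` text.

```python
import re

# Hypothetical two-line sample shaped like the section headers in `wiki`.
sample = "Overview[edit]\nAccess to public records[edit]\n"

# Two capturing groups: findall returns a list of tuples.
both = re.findall(r"([\w ]*)(\[edit\])", sample)

# (?:...) groups without capturing, so only the title group is returned.
titles = re.findall(r"([\w ]*)(?:\[edit\])", sample)

print(both)    # [('Overview', '[edit]'), ('Access to public records', '[edit]')]
print(titles)  # ['Overview', 'Access to public records']
```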

In [84]:
for item in re.finditer(r"([\w ]*)(\[edit\])",wiki):
    print(item.groups())

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')


In [86]:
for item in re.finditer(r"([\w ]*)(\[edit\])",wiki):
    print(item.group(1))

Overview
Access to public records
Student medical records


In [89]:
# To give the groups names we use the syntax (?P<name>...), where the parenthesis
# starts the group, the ?P indicates that this is an extension to basic regexes, and <name> is the dictionary
# key we want to use, wrapped in <>.

for item in re.finditer(r"(?P<title>[\w ]*)(?P<edit_link>\[edit\])",wiki):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict()['title'])

Overview
Access to public records
Student medical records
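Named groups can also be read directly from the match object, without going through `.groupdict()`. The sketch below uses a short hypothetical `sample` string in place of the `wiki` file and collects the titles together with their positions.

```python
import re

# Hypothetical sample shaped like the section headers in `wiki`.
sample = "Overview[edit]\nStudent medical records[edit]\n"
pattern = r"(?P<title>[\w ]+)(?P<edit_link>\[edit\])"

titles = []
spans = []
for m in re.finditer(pattern, sample):
    # m.group("title"), m["title"] and m.groupdict()["title"] are equivalent.
    titles.append(m.group("title"))
    # span() accepts a group name and returns (start, end) offsets.
    spans.append(m.span("title"))

print(titles)  # ['Overview', 'Student medical records']
print(spans)   # [(0, 8), (15, 38)]
```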


## Look-ahead & Look-behind

In [91]:
re.findall(r"([\w ]*)(?=\[edit\])",wiki)

['Overview', '', 'Access to public records', '', 'Student medical records', '']
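The empty strings in this result appear because `*` allows a zero-length match: right after each title, the empty string is also followed by `[edit]`, so the look-ahead succeeds again. Switching to `+` is one way to avoid this. The sketch below shows that on a short hypothetical sample, along with a look-behind `(?<=...)`, which checks the text before the match without consuming it; the price string is made up for illustration.

```python
import re

# Hypothetical sample shaped like the section headers in `wiki`.
sample = "Overview[edit]\nAccess to public records[edit]\n"

# * permits zero-length matches, which show up as empty strings.
with_star = re.findall(r"[\w ]*(?=\[edit\])", sample)

# + requires at least one character, so the empty matches disappear.
with_plus = re.findall(r"[\w ]+(?=\[edit\])", sample)

# Look-behind: match digits only when they are preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "Cost: $25, weight: 30")

print(with_star)  # ['Overview', '', 'Access to public records', '']
print(with_plus)  # ['Overview', 'Access to public records']
print(prices)     # ['25']
```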

## Example of Wikipedia data

In [94]:
with open("datasets/buddhist.txt",encoding="utf8") as file:
    # we'll read that into a variable called wiki
    wiki=file.read()
# and let's print that variable out to the screen
wiki

'Buddhist universities and colleges in the United States\nFrom Wikipedia, the free encyclopedia\nJump to navigationJump to search\n\nThis article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.\nFind sources: "Buddhist universities and colleges in the United States" – news · newspapers · books · scholar · JSTOR (December 2009) (Learn how and when to remove this template message)\nThere are several Buddhist universities in the United States. Some of these have existed for decades and are accredited. Others are relatively new and are either in the process of being accredited or else have no formal accreditation. The list includes:\n\nDhammakaya Open University – located in Azusa, California, part of the Thai Wat Phra Dhammakaya[1]\nDharmakirti College – located in Tucson, Arizona Now called Awam Tibetan Buddhist Institute (http://awaminstitute.org/)\nDharma Realm Buddh

In [115]:
pattern = r"""
(?P<title>.*)
(–\ located\ in\ )
(?P<city>[\w]+)
(,\ )
(?P<state>[\w]+)
"""

In [116]:
for item in re.finditer(pattern,wiki,re.VERBOSE):
    # We can get the dictionary returned for the item with .groupdict()
    print(item.groupdict())

{'title': 'Dhammakaya Open University ', 'city': 'Azusa', 'state': 'California'}
{'title': 'Dharmakirti College ', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University ', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute ', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Institute of Buddhist Studies ', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College ', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'University of the West ', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies ', 'city': 'Glenside', 'state': 'Pennsylvania'}


## Tweets

In [5]:
with open("datasets/nytimeshealth.txt",encoding="utf8") as file:
    # We'll read everything into a variable and take a look at it
    health=file.read()
health



In [129]:
re.findall(r"#[\w]+",health)

['#askwell',
 '#pregnancy',
 '#Ayotzinap',
 '#Colorado',
 '#3',
 '#VegetarianThanksgiving',
 '#BrittanyMaynard',
 '#FallPrevention',
 '#Ebola',
 '#Ebola',
 '#ebola',
 '#Ebola',
 '#Ebola',
 '#EbolaHysteria',
 '#AskNYT',
 '#Ebola',
 '#Ebola',
 '#Ebola',
 '#Liberia',
 '#Excalibur',
 '#ebola',
 '#Ebola',
 '#dallas',
 '#nobelprize2014',
 '#ebola',
 '#ebola',
 '#monrovia',
 '#ebola',
 '#nobelprize2014',
 '#ebola',
 '#nobelprize2014',
 '#Medicine',
 '#Ebola',
 '#Monrovia',
 '#Liberia',
 '#Ebola',
 '#smell',
 '#Ebola',
 '#Ebola',
 '#Ebola',
 '#Monrovia',
 '#Ebola',
 '#ebola',
 '#monrovia',
 '#liberia',
 '#benzos',
 '#Alzheimers',
 '#ClimateChange',
 '#Whole',
 '#Wheat',
 '#Focaccia',
 '#Tomatoes',
 '#Olives',
 '#Recipes',
 '#Health',
 '#Ebola',
 '#Ebola',
 '#Monrovia',
 '#Liberia',
 '#Liberia',
 '#Ebola',
 '#Ebola',
 '#Liberia',
 '#Ebola',
 '#blood',
 '#Ebola',
 '#organtrafficking',
 '#org',
 '#EbolaOutbreak',
 '#SierraLeone',
 '#Freetown',
 '#SierraLeone',
 '#ebolaoutbreak',
 '#kenema',
 '#eb