In [0]:
# https://www.datacamp.com/courses/regular-expressions-in-python

In [0]:
# Reverse the order of a list (test is defined here so the cell runs on its own)
test = list(range(9))
test[::-1]

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [0]:
from datetime import datetime
from string import Template
import re

**Course Description**

As a data scientist, you will encounter many situations where you will need to extract key information from huge corpora of text, clean messy data containing strings, or detect and match patterns to find useful words. All of these situations are part of text mining and are an important step before applying machine learning algorithms. This course will take you through understanding compelling concepts about string manipulation and regular expressions. You will learn how to split strings, join them back together, interpolate them, as well as detect, extract, replace, and match strings using regular expressions. On the journey to master these skills, you will work with datasets containing movie reviews or streamed tweets that can be used to determine opinion, as well as with raw text scraped from the web.

# 1. Basic Concepts of String Manipulation

Start your journey into the regular expression world! From slicing and concatenating, adjusting the case, and removing spaces, to finding and replacing strings. You will learn how to master basic operations for string manipulation using a movie review dataset.

### Introduction to string manipulation


In [0]:
movie = 'fox and kelley soon become bitter rivals because the new fox books store is opening up right across the block from the small business .'

In [0]:
# Find characters in movie variable
length_string = len(movie)
length_string

135

In [0]:
to_string = str(length_string)
to_string

'135'

In [0]:
# Predefined variable
statement = "Number of characters in this review:"

# Concatenate strings and print result
print(statement+" "+to_string)

Number of characters in this review: 135


In [0]:
movie1 = 'the most significant tension of _election_ is the potential relationship between a teacher and his student .'

In [0]:
movie2 = 'the most significant tension of _rushmore_ is the potential relationship between a teacher and his student .'

- Select the first 32 characters of the variable movie1 and assign it to the variable first_part.

In [0]:
# Select the first 32 characters of movie1
first_part = movie1[:32]
first_part

'the most significant tension of '

- Select the substring going from the 43rd character to the end of movie1. Assign it to the variable last_part.

In [0]:
# Select from 43rd character to the end of movie1
last_part = movie1[42:]
last_part

' is the potential relationship between a teacher and his student .'

- Select the substring going from the 33rd to the 42nd character of movie2. Assign it to the variable middle_part.

In [0]:
# Select from 33rd to the 42nd character
middle_part = movie2[32:42]
middle_part

'_rushmore_'

- Print the concatenation of the variables first_part, middle_part and last_part in that order. Print the variable movie2 and compare them.

In [0]:
# Print concatenation and movie2 variable
print(first_part+middle_part+last_part) 

the most significant tension of _rushmore_ is the potential relationship between a teacher and his student .


In [0]:
print(movie2)

the most significant tension of _rushmore_ is the potential relationship between a teacher and his student .


#### Palindromes


A **palindrome** is a sequence of characters which can be read the same backward as forward, for example: Madam or No lemon, no melon. 

You want to make a list of all movie titles that are funny palindromes but you will start by analyzing one example.

In Python, you can also specify a step by using a third index. If you omit the first and second indices and the third one is negative, the slice returns the characters in reverse order, moving back by the size of the step.
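
As a minimal sketch of the step index, the snippet below reverses a string with `[::-1]` and uses it for a palindrome check. The helper name `is_palindrome` and the decision to ignore case, spaces, and punctuation are assumptions made for illustration; without that normalization, a phrase such as No lemon, no melon would not compare equal to its reverse.

```python
def is_palindrome(text):
    """Return True if text reads the same backward, ignoring case and non-alphanumerics."""
    # Keep only letters and digits, lowercased (an illustrative normalization choice)
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    # A negative step with no start or stop walks the string backward
    return cleaned == cleaned[::-1]


print(is_palindrome("No lemon, no melon"))
print(is_palindrome("Madam"))
print(is_palindrome("regular expression"))

# Other step sizes skip characters in either direction
print("abcdefgh"[::2])
print("abcdefgh"[::-2])
```

The exact comparison used in the cells that follow works without normalization only because the extracted title is already symmetric character by character.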

In [0]:
movie = 'oh my God! desserts I stressed was an ugly movie'

In [0]:
# Get the word
movie_title = movie[11:30]
movie_title

'desserts I stressed'

In [0]:
# Obtain the palindrome
palindrome = movie_title[::-1]
palindrome

'desserts I stressed'

In [0]:
# Print the word if it's a palindrome
if movie_title == palindrome:
	print(movie_title)

desserts I stressed


In [0]:
test = list(range(9))
test

[0, 1, 2, 3, 4, 5, 6, 7, 8]

In [0]:
test[::-1]

[8, 7, 6, 5, 4, 3, 2, 1, 0]

### String operations


In [0]:
movie = '$I supposed that coming from MTV Films I should expect no less$'

In [0]:
# Convert to lowercase and print the result
movie_lower = movie.lower()
print(movie_lower)

$i supposed that coming from mtv films i should expect no less$


- Remove the $ that occur at the start and at the end of the string contained in movie_lower. 

In [0]:
# Remove specified character and print the result
movie_no_space = movie_lower.strip("$")
print(movie_no_space)

i supposed that coming from mtv films i should expect no less


In [0]:
# Split the string into substrings and print the result
movie_split = movie_no_space.split()
print(movie_split)

['i', 'supposed', 'that', 'coming', 'from', 'mtv', 'films', 'i', 'should', 'expect', 'no', 'less']


In [0]:
# Select root word and print the result
word_root = movie_split[1][:-1]
print(word_root)

suppose


In [0]:
movie = 'the film,however,is all good<\\i>'

In [0]:
# Remove tags happening at the end and print results
# (the backslash is escaped so that "\i" is not read as an invalid escape sequence)
movie_tag = movie.rstrip("<\\i>")
print(movie_tag)

the film,however,is all good
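
Note that `rstrip()` treats its argument as a set of characters to remove, not as a literal suffix. The sketch below uses a made-up string (`review` is not part of the dataset) to show how this can strip more than intended, and contrasts it with removing the exact suffix by slicing.

```python
review = "the film is all good, says ii<\\i>"  # hypothetical example string
suffix = "<\\i>"

# rstrip removes any trailing run of the characters <, \, i and >
stripped = review.rstrip(suffix)
print(repr(stripped))

# Removing the exact suffix: check for it, then slice it off
trimmed = review[:-len(suffix)] if review.endswith(suffix) else review
print(repr(trimmed))
```

In the movie example above the two approaches give the same result only because the text before the tag ends in a character that is not in the set.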


In [0]:
# Split the string using commas and print results
movie_no_comma = movie_tag.split(",")
print(movie_no_comma)

['the film', 'however', 'is all good']


In [0]:
# Join back together and print results
movie_join = " ".join(movie_no_comma)
print(movie_join)

the film however is all good


#### Split lines or split the line?

In [0]:
file = 'mtv films election, a high school comedy, is a current example\nfrom there, director steven spielberg wastes no time, taking us into the water on a midnight swim'

In [0]:
# Split string at line boundaries
file_split = file.splitlines()

# Print file_split
print(file_split)

['mtv films election, a high school comedy, is a current example', 'from there, director steven spielberg wastes no time, taking us into the water on a midnight swim']


In [0]:
# Complete for-loop to split by commas
for substring in file_split:
    substring_split = substring.split(",")
    print(substring_split)

['mtv films election', ' a high school comedy', ' is a current example']
['from there', ' director steven spielberg wastes no time', ' taking us into the water on a midnight swim']


**Observation:**

The difference between `split()` and `splitlines()` is that
- `splitlines()` breaks a string by line boundaries while 
- `split()` uses the separating element to break a string into pieces.
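
The observation above can be checked on a small made-up string (`text` is an illustrative assumption, not part of the dataset). The sketch also shows that `split()` with no argument breaks on any run of whitespace, including line breaks.

```python
text = "first line, part a\nsecond line, part b\r\nthird line"

# splitlines() breaks at line boundaries (\n, \r\n, ...) and drops them
print(text.splitlines())

# split() with no argument breaks on any run of whitespace
print(text.split())

# split(sep) breaks only on the given separator; line breaks stay in the pieces
print(text.split(","))
```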

#### Finding and replacing


In [0]:
movies_df = pd.read_csv("short_movies.csv")
movies_df.iloc[199:203]

Unnamed: 0,id,tag,html,sent id,text,target
199,0,cv006,15448,15,the reasons for why he becomes a skinhead are ...,pos
200,0,cv006,15448,16,it's clear that he's passionate about his beli...,pos
201,0,cv006,15448,17,I believe you I always said that the actor act...,pos
202,0,cv006,15448,18,it's astonishing how frightening the actor act...,pos


In [0]:
movies = movies_df.iloc[200:203]
movies = movies['text']
movies

200    it's clear that he's passionate about his beli...
201    I believe you I always said that the actor act...
202    it's astonishing how frightening the actor act...
Name: text, dtype: object

In [0]:
for movie in movies:
    # Find if actor occurs between 37 and 41 inclusive
    if movie.find("actor", 37, 42) == -1:
        print("Word not found")
    # Count occurrences and replace two by one
    elif movie.count("actor") == 2:
        print(movie.replace("actor actor", "actor"))
    else:
        # Replace three occurrences by one
        print(movie.replace("actor actor actor", "actor"))

Word not found
I believe you I always said that the actor is amazing in every movie he has played
it's astonishing how frightening the actor norton looks with a shaved head and a swastika on his chest.


In [0]:
movies = movies_df.iloc[137:139]
movies = movies['text']
movies

137    heck , jackie doesn't even have enough money f...
138    in condor , chan plays the same character he's...
Name: text, dtype: object

- Find the index where money occurs between characters with index 12 and 50. If not found, the method should return -1.

In [0]:
for movie in movies:
  # Find the first occurrence of word
  print(movie.find("money", 12, 51))

39
-1


- Find the index where money occurs between characters with index 12 and 50. If not found, it should raise an error.

In [0]:
for movie in movies:
    try:
        # Find the first occurrence of word
        print(movie.index("money", 12, 51))
    except ValueError:
        print("substring not found")

39
substring not found


#### Replacing negations
In order to keep working with your prediction project, your next task is to figure out how to handle negations that occur in your dataset. Some algorithms for prediction do not work well with negations, so a good way to handle this is to remove either not or n't, and to replace the next word by its antonym.

In [0]:
movies = "the rest of the story isn't important because all it does is serve as a mere backdrop for the two stars to share the screen ."

- Replace the substring isn't with the word is.

In [0]:
# Replace negations 
movies_no_negation = movies.replace("isn't", "is")
movies_no_negation

'the rest of the story is important because all it does is serve as a mere backdrop for the two stars to share the screen .'

- Replace the substring important with the word insignificant.

- Print out the result contained in the variable movies_antonym.

In [0]:
# Replace important
movies_antonym = movies_no_negation.replace("important", "insignificant")

# Print out
print(movies_antonym)

the rest of the story is insignificant because all it does is serve as a mere backdrop for the two stars to share the screen .


# 2. Formatting Strings

Following your journey, you will learn the main approaches that can be used to format or interpolate strings in Python using a dataset containing information scraped from the web. You will explore the advantages and disadvantages of using positional formatting, embedding expressions inside string constants, and using the Template class.

#### Positional formatting


In [0]:
wikipedia_article = 'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.'

In [0]:
my_list = []

The text of one article has already been saved in the variable wikipedia_article. Also, the empty list my_list is already defined. 

- Assign the substrings going from the 4th to the 19th character, and from the 22nd to the 44th character of wikipedia_article to the variables first_pos and second_pos, respectively. Adjust the strings so they are lowercase.

In [0]:
# Assign the substrings to the variables
first_pos = wikipedia_article[3:19].lower()
second_pos = wikipedia_article[21:44].lower()

In [0]:
first_pos

'computer science'

In [0]:
second_pos

'artificial intelligence'

- Define a string with the text "The tool is used in" **adding placeholders** after the word tool and the word in for future positional formatting. Append it to the list `my_list`.

In [0]:
# Define string with placeholders 
my_list.append("The tool {} is used in {}")
my_list

['The tool {} is used in {}']

In [0]:
# Define string with rearranged placeholders
my_list.append("The tool {1} is used in {0}")
my_list

['The tool {} is used in {}', 'The tool {1} is used in {0}']

- Complete the for-loop so that it uses the .format() method and the variables first_pos and second_pos to print out every string in my_list.

In [0]:
# Use format to print strings
for my_string in my_list:
    print(my_string.format(first_pos, second_pos))

The tool computer science is used in artificial intelligence
The tool artificial intelligence is used in computer science


#### Calling by its name

Task: you want to create a template email with a standard message changing the different tools and corresponding field name.

First, you want to try doing this with just one example as a proof of concept. You use positional formatting and named placeholders to call the variables in a dictionary.

In [0]:
courses = ['artificial intelligence', 'neural networks']

- The variable courses containing one tool and one field name has been saved. 

In [0]:
# Create a dictionary
plan = {
    "field": courses[0],
    "tool": courses[1],
}

In [0]:
plan

{'field': 'artificial intelligence', 'tool': 'neural networks'}

In [0]:
plan['field']

'artificial intelligence'

In [0]:
{plan['tool']}

{'neural networks'}

In [0]:
# Define string with placeholders
my_message = "If you are interested in {plan[field]}, you can take the course related to {plan[tool]}"

In [0]:
# Use dictionary to replace placeholders
print(my_message.format(plan=plan))

If you are interested in artificial intelligence, you can take the course related to neural networks


To access elements of a dictionary with the `str.format()` method, write `dict[index]` without quotes around `index`. The method treats the unquoted `index` as the string key `"index"` when it looks it up in the `dict`.
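
A minimal sketch of this rule, reusing the `plan` dictionary from above with shortened message text (the variable names `with_format`, `with_fstring`, and `with_unpacking` are illustrative). It contrasts the unquoted key in `str.format()` with the quoted key that an f-string requires, and shows dictionary unpacking as a third option.

```python
plan = {"field": "artificial intelligence", "tool": "neural networks"}

# str.format(): the key inside the brackets is written without quotes
with_format = "Field: {plan[field]}, tool: {plan[tool]}".format(plan=plan)

# f-string: the braces hold a real Python expression, so the key needs quotes
with_fstring = f"Field: {plan['field']}, tool: {plan['tool']}"

# Unpacking the dictionary lets the placeholders use the keys directly
with_unpacking = "Field: {field}, tool: {tool}".format(**plan)

print(with_format)
print(with_format == with_fstring == with_unpacking)
```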

In [0]:
# Import datetime 
from datetime import datetime

In [0]:
# Assign date to get_date
get_date = datetime.now()
get_date

datetime.datetime(2019, 9, 27, 4, 16, 44, 193679)

In [0]:
# Add named placeholders with format specifiers
message = "Good morning. Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

# Format date
print(message.format(today=get_date))

Good morning. Today is September 27, 2019. It's 04:16 ... time to work!


### Formatted string literal


In [0]:
field1 = 'sexiest job'

In [0]:
field2 = 'data is produced daily'

In [0]:
field3 = 'Individuals'

In [0]:
fact1 = 21

In [0]:
fact2 = 2500000000000000000

In [0]:
fact3 = 72.41415415151

In [0]:
fact4 = 1.09

In [0]:
# Complete the f-string
print(f"Data science is considered {field1!r} in the {fact1:d}st century")

Data science is considered 'sexiest job' in the 21st century


In [0]:
# Complete the f-string
print(f"About {fact2:e} of {field2} in the world")

About 2.500000e+18 of data is produced daily in the world


In [0]:
# Complete the f-string
print(f"{field3} create around {fact3:.2f}% of the data but only {fact4:.1f}% is analyzed")

Individuals create around 72.41% of the data but only 1.1% is analyzed


In [0]:
number1 = 120

In [0]:
number2 = 7

In [0]:
string1 = 'httpswww.datacamp.com'

In [0]:
list_links = ['www.news.com',
 'www.google.com',
 'www.yahoo.com',
 'www.bbc.com',
 'www.msn.com',
 'www.facebook.com',
 'www.news.google.com']

In [0]:
# Include both variables and the result of dividing them 
print(f"{number1} tweets were downloaded in {number2} minutes indicating a speed of {number1/number2:.1f} tweets per min")

120 tweets were downloaded in 7 minutes indicating a speed of 17.1 tweets per min


In [0]:
# Replace the substring https by an empty string
print(f"{string1.replace('https', '')}")

www.datacamp.com


In [0]:
# Express the length of the list as a percentage of 120, rounded to two decimals
print(f"Only {(len(list_links)*100/120):.2f}% of the posts contain links")

In [0]:
import datetime  # rebinds the name datetime from the class imported earlier to the module

In [0]:
east = {'date': datetime.datetime(2007, 4, 20, 0, 0), 'price': 1232443}

In [0]:
west = {'date': datetime.datetime(2006, 5, 26, 0, 0), 'price': 1432673}

In [0]:
# Access values of date and price in east dictionary
print(f"The price for a house in the east neighborhood was\n ${east['price']} in {east['date']:%m-%d-%Y}")

The price for a house in the east neighborhood was
 $1232443 in 04-20-2007


In [0]:
# Access values of date and price in west dictionary
print(f"The price for a house in the west neighborhood was\n ${west['price']} in {west['date']:%m-%d-%Y}.")

The price for a house in the west neighborhood was
 $1432673 in 05-26-2006.


### Template method


In [0]:
tool1 = 'Natural Language Toolkit'

In [0]:
tool2 = 'TextBlob'

In [0]:
tool3 = 'Gensim'

In [0]:
description1 = 'suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.'

In [0]:
description2 = 'Python library for processing textual data. It provides a simple API for diving into common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.'

In [0]:
description3 = 'Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.'

In [0]:
# Import Template
from string import Template

In [0]:
# Create a template
wikipedia = Template("$tool is a $description")
wikipedia

<string.Template at 0x7f28d17d7630>

In [0]:
# Substitute variables in template
print(wikipedia.substitute(tool=tool1, description=description1))
print(wikipedia.substitute(tool=tool2, description=description2))
print(wikipedia.substitute(tool=tool3, description=description3))

Natural Language Toolkit is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Gensim is a Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.


In [0]:
tools = ['Natural Language Toolkit', '20', 'month']

In [0]:
# Import template
from string import Template

In [0]:
# Select variables
our_tool = tools[0]
our_fee = tools[1]
our_pay = tools[2]

In [0]:
# Create template
course = Template("We are offering a 3-month beginner course on $tool just for $$ $fee ${pay}ly")

# Substitute identifiers with three variables
print(course.substitute(tool=our_tool, fee=our_fee, pay=our_pay))

We are offering a 3-month beginner course on Natural Language Toolkit just for $ 20 monthly


In [0]:
answers = {'answer1': 'I really like the app. But there are some features that can be improved'}

In [0]:
# Import template
from string import Template

# Complete template string using identifiers
the_answers = Template("Check your answer 1: $answer1, and your answer 2: $answer2")

- The answer of one user has been stored in the dictionary answers. You can use the print() function to view the variables in the IPython Shell.

In [0]:
# Use substitute to replace identifiers
try:
    print(the_answers.substitute(answers))
except KeyError:
    print("Missing information")

Missing information


- Use the method .safe_substitute() to replace the identifiers with the values in answers in the predefined template.

In [0]:
# Use safe_substitute to replace identifiers
try:
    print(the_answers.safe_substitute(answers))
except KeyError:
    print("Missing information")

Check your answer 1: I really like the app. But there are some features that can be improved, and your answer 2: $answer2


# 3. Regular Expressions for Pattern Matching

Time to discover the fundamental concepts of regular expressions! In this key chapter, you will learn to understand the basic concepts of regular expression syntax. Using a real dataset with tweets meant for sentiment analysis, you will learn how to apply pattern matching using normal and special characters, and greedy and lazy quantifiers.

### Introduction to regular expressions


#### Are they bots?

You write down some helpful metacharacters to help you later:

- \d: digit
- \w: word character
- \W: non-word character
- \s: whitespace
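
The four metacharacters listed above can be tried on a short made-up string (`sample` is an assumption for illustration, not one of the tweets).

```python
import re

sample = "user_42 paid $9!"

print(re.findall(r"\d", sample))   # digits
print(re.findall(r"\w+", sample))  # runs of word characters (letters, digits, underscore)
print(re.findall(r"\W", sample))   # non-word characters, including spaces
print(re.findall(r"\s", sample))   # whitespace
```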

In [0]:
sentiment_analysis = '@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%'

- The text of one tweet was saved in the variable sentiment_analysis. 

In [0]:
# Import the re module
import re

# Write the regex
regex = r"@robot\d\W"

# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']


In [0]:
sentiment_analysis = "Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7"

In [0]:
# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

['User_mentions:2']


In [0]:
# Write a regex to obtain number of likes
print(re.findall(r"likes:\s\d", sentiment_analysis))

['likes: 9']


In [0]:
# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

['number of retweets: 7']


#### Match and split

Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. 

In [0]:
sentiment_analysis = 'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

- Write a regex that matches the pattern separating the sentences in sentiment_analysis, e.g. &4break!.

In [0]:
# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"

In [0]:
# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)

In [0]:
sentiment_sub

'He#newHis%newTin love with$newPscrappy.  He is&newYmissing him@newLalready'

In [0]:
sentiment_analysis

'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

In [0]:
# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

In [0]:
# Replace the regex_words and print the result
sentiment_final = re.sub(regex_words, " ", sentiment_sub)
print(sentiment_final)

He is in love with scrappy.  He is missing him already


### Repetitions


In [0]:
tweets = pd.read_csv("short_tweets.csv")
tweets.iloc[545:548]

Unnamed: 0,target,id,date,flag,user,text
545,0,1467962938,Mon Apr 06 23:01:04 PDT 2009,NO_QUERY,jess___x,Boredd. Colddd @blueKnight39 Internet keeps st...
546,0,1467963418,Mon Apr 06 23:01:14 PDT 2009,NO_QUERY,Zimily,I had a horrible nightmare last night @anitaLo...
547,0,1467963477,Mon Apr 06 23:01:15 PDT 2009,NO_QUERY,Augustina22,im lonely keep me company @YourBestCompany! @...


In [0]:
sentiment_analysis = tweets.iloc[545:548]['text']
sentiment_analysis

545    Boredd. Colddd @blueKnight39 Internet keeps st...
546    I had a horrible nightmare last night @anitaLo...
547    im lonely  keep me company @YourBestCompany! @...
Name: text, dtype: object

- Write a regex to find all the matches of http links appearing in each tweet in sentiment_analysis. Print out the result.
- Write a regex to find all the matches of user mentions appearing in each tweet in sentiment_analysis. Print out the result.

In [0]:
# Import re module
import re

for tweet in sentiment_analysis:
    # Write regex to match http links and print out result
    print(re.findall(r"http\S+", tweet))

    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))

['https://www.tellyourstory.com']
['@blueKnight39']
[]
['@anitaLopez98', '@MyredHat31']
['https://radio.foxnews.com']
['@YourBestCompany', '@foxRadio']


In [0]:
sentiment_analysis = tweets.iloc[232:235]['text']
sentiment_analysis

232    I would like to apologize for the repeated Vid...
233    @zaydia but i cant figure out how to get there...
234    FML: So much for seniority, bc of technologica...
Name: text, dtype: object

- Complete the for-loop with a regex that finds all dates in a format similar to 27 minutes ago or 4 hours ago.

In [0]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
	print(re.findall(r"\d{1,2}\s\w+\sago", date))

['32 minutes ago']
[]
[]


- Complete the for-loop with a regex that finds all dates in a format similar to 23rd june 2018.

In [0]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
	print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}", date))

[]
['1st May 2019']
['23rd June 2018']


- Complete the for-loop with a regex that finds all dates in a format similar to 1st september 2019 17:25.

In [0]:
# Complete the for loop with a regex to find dates
for date in sentiment_analysis:
	print(re.findall(r"\d{1,2}\w+\s\w+\s\d{4}\s\d{1,2}:\d{2}", date))

[]
[]
['23rd June 2018 17:54']


In [0]:
sentiment_analysis = 'ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever'

In [0]:
# Write a regex matching the hashtag pattern
regex = r"#\w+"

In [0]:
# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)

In [0]:
# Get tokens by splitting text
print(re.split(r"\s+", no_hashtag))

['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']
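
The trailing empty string in the output above appears because removing the hashtags leaves whitespace at the end of the text, and `re.split()` returns an empty piece after the final separator. One possible way to avoid it, sketched here, is to strip the text before splitting; `str.split()` with no argument gives the same tokens.

```python
import re

text = "ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever"
no_hashtag = re.sub(r"#\w+", "", text)

# Strip leading and trailing whitespace first so no empty token is produced
tokens = re.split(r"\s+", no_hashtag.strip())
print(tokens)

# str.split() with no argument also discards leading/trailing whitespace
print(no_hashtag.split() == tokens)
```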


### Regex metacharacters


In [0]:
sentiment_analysis = tweets.iloc[780:782]['text']
sentiment_analysis

780    AIshadowhunters.txt aaaaand back to my literat...
781    ouMYTAXES.txt I am worried that I won't get my...
Name: text, dtype: object

- Write a regex that matches the pattern of the text file names, e.g. aemyfile.txt.
- Find all matches of the regex in the elements of sentiment_analysis. Print out the result.
- Replace all matches of the regex with an empty string "". Print out the result.

In [0]:
# Write a regex to match text file name
regex = r"^[aeiouAEIOU]{2,3}.+txt"

for text in sentiment_analysis:
	# Find all matches of the regex
	print(re.findall(regex, text))
    
	# Replace all matches with empty string
	print(re.sub(regex, "", text))

['AIshadowhunters.txt']
 aaaaand back to my literature review. At least i have a friendly cup of coffee to keep me company
['ouMYTAXES.txt']
 I am worried that I won't get my $900 even though I paid tax last year


#### Give me your email

The company puts some rules in place to verify that the given email address is valid:

- The first part can contain:
 - Upper A-Z and lowercase letters a-z
 - Numbers
 - Characters: !, #, %, &, *, $, .
- Must have @
- Domain:
 - Can contain any word characters
 - But only .com ending is allowed

The project consists of writing a script that checks whether each email address follows the correct pattern. Your colleague gave you a list of email addresses as examples to test.

In [0]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

The list emails as well as the re module are loaded in your session. 

In [0]:
# Write a regex to match a valid email address
regex = r"[A-Za-z0-9!#%&*\$\.]+@\w+\.com"

In [0]:
for example in emails:
    # Match the regex to the string
    if re.match(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid


#### Invalid password

The company also puts some rules in place to verify valid passwords:

- It can contain lowercase a-z and uppercase letters A-Z
- It can contain numbers
- It can contain the symbols: *, #, $, %, !, &, .
- It must be at least 8 characters long but not more than 20

Your colleague also gave you a list of passwords as examples to test.

In [0]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']

The list passwords and the module re are loaded in your session. 

In [0]:
# Write a regex to match a valid password
regex = r"[A-Za-z0-9!#%&*\$\.]{8,20}" 

for example in passwords:
    # Scan the strings to find a match
    if re.search(regex, example):
        # Complete the format method to print out the result
        print("The password {pass_example} is a valid password".format(pass_example=example))
    else:
        print("The password {pass_example} is invalid".format(pass_example=example))

The password Apple34!rose is a valid password
The password My87hou#4$ is a valid password
The password abc123 is invalid
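
Because `re.search()` succeeds as soon as any part of the string matches, the check above would also accept passwords that are too long or that contain disallowed characters, as long as they include a run of 8 allowed characters. A stricter sketch, assuming the whole password must satisfy the rules, uses `re.fullmatch()`; the last two test strings are made up for illustration.

```python
import re

regex = r"[A-Za-z0-9!#%&*\$\.]{8,20}"

# Hypothetical test cases: the last two break the rules but contain a valid run
candidates = ["Apple34!rose", "abc123", "Apple34!rose-with-a-dash", "A" * 25]

for candidate in candidates:
    found_anywhere = bool(re.search(regex, candidate))
    whole_string = bool(re.fullmatch(regex, candidate))
    print(f"{candidate!r}: search={found_anywhere}, fullmatch={whole_string}")
```

The same consideration applies to the email check, where `re.match()` anchors only the start of the string.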


#### Greedy vs. non-greedy matching



#### Understanding the difference

You realize that there are some HTML tags present. You need to remove them but keep the inside content as they are useful for analysis.

In [0]:
string = 'I want to see that <strong>amazing show</strong> again!'

In [0]:
# Import re
import re

# Write a regex to eliminate tags
string_notags = re.sub(r"<.+?>", "", string)

# Print out the result
print(string_notags)

I want to see that amazing show again!


#### Greedy matching

Next, you see that numbers still appear in the text of the tweets. So, you decide to find all of them.

In [0]:
sentiment_analysis = 'Was intending to finish editing my 536-page novel manuscript tonight, but that will probably not happen. And only 12 pages are left '

Let's imagine that you want to extract the number contained in the sentence I was born on April 24th. 
- A **lazy quantifier** will make the regex return 2 and 4, because it matches as few characters as needed.

In [0]:
# Write a lazy regex expression 
numbers_found_lazy = re.findall(r"[0-9]+?", sentiment_analysis)

# Print out the result
print(numbers_found_lazy)

['5', '3', '6', '1', '2']


- However, a **greedy quantifier** will return the entire 24 due to its need to match as much as possible.

In [0]:
# Write a greedy regex expression 
numbers_found_greedy = re.findall(r"[0-9]+", sentiment_analysis)

# Print out the result
print(numbers_found_greedy)

['536', '12']
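
The April 24th sentence mentioned above can be checked directly with the same two quantifiers. This is a small sketch using `\d` in place of `[0-9]`; the variable name `sentence` is illustrative.

```python
import re

sentence = "I was born on April 24th"

# Lazy: +? stops after the first digit, so each digit is its own match
lazy = re.findall(r"\d+?", sentence)
print(lazy)

# Greedy: + keeps consuming digits for as long as it can
greedy = re.findall(r"\d+", sentence)
print(greedy)
```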


#### Lazy approach

In [0]:
sentiment_analysis = "Put vacation photos online (They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying). "

- Use a greedy quantifier to match text that appears within parentheses in the variable sentiment_analysis.

In [0]:
# Write a greedy regex expression to match 
sentences_found_greedy = re.findall(r"\(.*\)", sentiment_analysis)

# Print out the result
print(sentences_found_greedy)

["(They were so cute) a few yrs ago. PC crashed, and now I forget the name of the site (I'm crying)"]


- Now, use a lazy quantifier to match text that appears within parentheses in the variable sentiment_analysis.

In [0]:
# Write a lazy regex expression
sentences_found_lazy = re.findall(r"\(.*?\)", sentiment_analysis)

# Print out the results
print(sentences_found_lazy)

['(They were so cute)', "(I'm crying)"]


# 4. Advanced Regular Expression Concepts

In the last step of your journey, you will learn more complex methods of pattern matching using parentheses to group strings together or to match the same text as matched previously. Also, you will get an idea of how you can look around expressions.

### Capturing groups


In [0]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
 'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
 'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']

- Complete the regex to match the email capturing only the name part. The name part appears before the @.

In [0]:
# Write a regex that matches email
regex_email = r"([A-Za-z0-9]+)@\S+"

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']
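
When a pattern contains exactly one capturing group, `re.findall` returns only the text of that group, which is why the domains do not appear in the output. A minimal sketch of how to see both the captured name and the full match, using `re.finditer` on a shortened copy of the first tweet:

```python
import re

# Shortened copy of the first tweet from the list above
tweet = "Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices"
regex_email = r"([A-Za-z0-9]+)@\S+"

# findall returns only the captured group when the pattern has one group
names_only = re.findall(regex_email, tweet)

# finditer yields match objects: group(1) is the capture, group(0) the full match
pairs = [(m.group(1), m.group(0)) for m in re.finditer(regex_email, tweet)]

print(names_only)
print(pairs)
```

Note that `\S+` also swallows the sentence-ending period after `hotmail.com`, so the full match would need extra cleaning before being used as an address.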


In [0]:
# Import re
import re

In [0]:
flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

- Complete the regular expression to match and capture all the flight information required.

In [0]:
# Write regex to capture information of the flight
regex = r"([A-Z]{2})(\d{4})\s([A-Z]{3})-([A-Z]{3})\s(\d{2}[A-Z]{3})"

In [0]:
# Find all matches of the flight information
flight_matches = re.findall(regex, flight)

In [0]:
#Print the matches
print("Airline: {} Flight number: {}".format(flight_matches[0][0], flight_matches[0][1]))
print("Departure: {} Destination: {}".format(flight_matches[0][2], flight_matches[0][3]))
print("Date: {}".format(flight_matches[0][4]))

Airline: IB Flight number: 3723
Departure: AMS Destination: MAD
Date: 06OCT
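
Indexing `flight_matches[0][0]`, `flight_matches[0][1]`, and so on works, but the positions are easy to mix up. One option, sketched here as an alternative rather than the course solution, is Python's named-group syntax `(?P<name>...)` together with `groupdict()`. The group names are chosen for this illustration only.

```python
import re

flight = 'Subject: You are now ready to fly. Here you have your boarding pass IB3723 AMS-MAD 06OCT'

# Same pattern as above, with illustrative names attached to each group
regex_named = (
    r"(?P<airline>[A-Z]{2})(?P<number>\d{4})\s"
    r"(?P<departure>[A-Z]{3})-(?P<destination>[A-Z]{3})\s"
    r"(?P<date>\d{2}[A-Z]{3})"
)

match = re.search(regex_named, flight)

# groupdict() maps each group name to the text it captured
info = match.groupdict() if match else {}
print(info)
```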


###Alternation and non-capturing groups


In [0]:
sentiment_analysis = ['I totally love the concert The Book of Souls World Tour. It kinda amazing!',
 'I enjoy the movie Wreck-It Ralph. I watched with my boyfriend.',
 "I still like the movie Wish Upon a Star. Too bad Disney doesn't show it anymore."]

- Complete the regular expression to capture the words love or like or enjoy. Match and capture the words movie or concert. Match and capture anything appearing until the period (`.`).

In [0]:
# Write a regex that matches sentences with the optional words
regex_positive = r"(love|like|enjoy).+?(movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    positive_matches = re.findall(regex_positive, tweet)
    
    # Complete format to print out the results
    print("Positive comments found {}".format(positive_matches))

Positive comments found [('love', 'concert', 'The Book of Souls World Tour')]
Positive comments found [('enjoy', 'movie', 'Wreck-It Ralph')]
Positive comments found [('like', 'movie', 'Wish Upon a Star')]


In [0]:
sentiment_analysis = ['That was horrible! I really dislike the movie The cabin and the ant. So boring.',
 "I disapprove the movie Honest with you. It's full of cliches.",
 'I dislike very much the concert After twelve Tour. The sound was horrible.']

- Complete the regular expression to capture the words hate or dislike or disapprove. Match but don't capture the words movie or concert. Match and capture anything appearing until the period (`.`).

In [0]:
# Write a regex that matches sentences with the optional words
regex_negative = r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\."

for tweet in sentiment_analysis:
    # Find all matches of regex in tweet
    negative_matches = re.findall(regex_negative, tweet)
    
    # Complete format to print out the results
    print("Negative comments found {}".format(negative_matches))

Negative comments found [('dislike', 'The cabin and the ant')]
Negative comments found [('disapprove', 'Honest with you')]
Negative comments found [('dislike', 'After twelve Tour')]
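
To make the effect of `(?:...)` visible, the sketch below runs the capturing and the non-capturing variant of the same pattern against one of the tweets. The only difference in the result is whether movie or concert shows up in the tuple.

```python
import re

tweet = 'I dislike very much the concert After twelve Tour. The sound was horrible.'

# (movie|concert) is a capturing group: it adds an element to each tuple
with_capture = re.findall(r"(hate|dislike|disapprove).+?(movie|concert)\s(.+?)\.", tweet)

# (?:movie|concert) still has to match, but is left out of the result
without_capture = re.findall(r"(hate|dislike|disapprove).+?(?:movie|concert)\s(.+?)\.", tweet)

print(with_capture)     # [('dislike', 'concert', 'After twelve Tour')]
print(without_capture)  # [('dislike', 'After twelve Tour')]
```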


###Backreferences


In [0]:
contract = 'Provider will invoice Client for Services performed within 30 days of performance.  Client will pay Provider as set forth in each Statement of Work within 30 days of receipt and acceptance of such invoice. It is understood that payments to Provider for services rendered shall be made in full as agreed, without any deductions for taxes of any kind whatsoever, in conformity with Provider’s status as an independent contractor. Signed on 03/25/2001.'

- Write a regex that captures the month, day, and year in which the contract was signed. Scan contract for matches.

In [0]:
# Write regex and scan contract to capture the dates described
regex_dates = r"Signed\son\s(\d{2})/(\d{2})/(\d{4})"
dates = re.search(regex_dates, contract)

- Assign each captured group to the corresponding keys in the dictionary.

In [0]:
# Assign to each key the corresponding match
signature = {
    "day": dates.group(2),
    "month": dates.group(1),
    "year": dates.group(3)
}

In [0]:
# Complete the format method to print-out
print("Our first contract is dated back to {data[year]}. Particularly, the day {data[day]} of the month {data[month]}.".format(data=signature))

Our first contract is dated back to 2001. Particularly, the day 25 of the month 03.
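
The regex only checks that the date has the right number of digits, so it would also accept something like 25/03/2001 and silently swap day and month. A hedged sketch of one way to guard against that: unpack the groups with `groups()` and pass them to `datetime`, which raises `ValueError` for an impossible date. `contract_tail` is a shortened stand-in for the full contract text.

```python
import re
from datetime import datetime

# Shortened stand-in for the end of the contract string
contract_tail = "Signed on 03/25/2001."

dates = re.search(r"Signed\son\s(\d{2})/(\d{2})/(\d{4})", contract_tail)

# groups() returns all captures in pattern order: month, day, year
month, day, year = dates.groups()

# datetime raises ValueError if, for example, the month were 25
signed = datetime(int(year), int(month), int(day))
signed_iso = signed.date().isoformat()
print(signed_iso)  # 2001-03-25
```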


####Close the tag, please!

You need to write a short script that checks whether every opened HTML tag has a matching closing tag.

You have an example of a string containing HTML tags:

`<title>The Data Science Company</title>`

You learn that an opening HTML tag is always at the beginning of the string. It appears inside <>. A closing tag also appears inside <>, but it is preceded by /.

In [0]:
html_tags = ['<body>Welcome to our course! It would be an awesome experience</body>',
 '<article>To be a data scientist, you need to have knowledge in statistics and mathematics</article>',
 '<nav>About me Links Contact me!']

- Complete the regex in order to match closed HTML tags. Find if there is a match in each string of the list html_tags. Assign the result to match_tag.

In [0]:
for string in html_tags:
    # Complete the regex and check whether it matches a closed HTML tag
    match_tag = re.match(r"<(\w+)>.*?</\1>", string)
 
    if match_tag:
        # If it matches print the first group capture
        print("Your tag {} is closed".format(match_tag.group(1))) 
    else:
        # If it doesn't match capture only the tag 
        notmatch_tag = re.match(r"<(\w+)>", string)
        # Print the first group capture
        print("Close your {} tag!".format(notmatch_tag.group(1)))

Your tag body is closed
Your tag article is closed
Close your nav tag!
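
The loop above assumes every string starts with an opening tag. If one does not, the second `re.match` returns `None` and `.group(1)` fails. The following sketch wraps the same logic in a function with a guard for that case and uses a named backreference `(?P=tag)` in place of `\1`. The function name `check_tag` and the fallback message are made up for this example.

```python
import re

def check_tag(string):
    """Report whether the tag that opens `string` is closed (illustrative helper)."""
    # (?P<tag>...) names the group; (?P=tag) must repeat exactly the same text
    closed = re.match(r"<(?P<tag>\w+)>.*?</(?P=tag)>", string)
    if closed:
        return "Your tag {} is closed".format(closed.group("tag"))

    # No closed pair: look for just the opening tag
    opened = re.match(r"<(\w+)>", string)
    if opened:
        return "Close your {} tag!".format(opened.group(1))

    # Guard for strings that do not start with a tag at all
    return "No opening tag found"

print(check_tag('<body>Welcome to our course!</body>'))
print(check_tag('<nav>About me Links Contact me!'))
print(check_tag('About me Links Contact me!'))
```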


####Reeepeated characters
Back to your sentiment analysis! Your next task is to replace elongated words that appear in the tweets. 

In [0]:
sentiment_analysis = ['@marykatherine_q i know! I heard it this morning and wondered the same thing. Moscooooooow is so behind the times',
 'Staying at a friends house...neighborrrrrrrs are so loud-having a party',
 'Just woke up an already have read some e-mail']

To find a match for Awesoooome, you first need to match Awes. Then capture o and reference the same character back with `\1`, and finally match me.



In [0]:
# Complete the regex to match an elongated word
regex_elongated = r"\w*(\w)\1\w*"

for tweet in sentiment_analysis:
    # Find if there is a match in each tweet
    match_elongated = re.search(regex_elongated, tweet)

    if match_elongated:
        # Group zero is the whole matched word
        elongated_word = match_elongated.group(0)

        # Complete the format method to print the word
        print("Elongated word found: {word}".format(word=elongated_word))
    else:
        print("No elongated word found")

Elongated word found: Moscooooooow
Elongated word found: neighborrrrrrrs
No elongated word found
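
The pattern `\w*(\w)\1\w*` fires on any doubled character, so an ordinary word such as hello also counts as elongated. Since the stated task is to replace elongated words, here is a rough sketch of one way to do so with `re.sub`, under the assumption that three or more repeats signal elongation. The threshold, the variable names, and the sample tweets are choices made for this illustration.

```python
import re

# The course pattern also matches a normal word with a double letter
normal_word = re.search(r"\w*(\w)\1\w*", "hello, I agree").group(0)
print(normal_word)  # hello

# Assumption: a character followed by at least two more copies is an elongation
regex_repeated = r"(\w)\1{2,}"

tweets = ["Moscooooooow is so behind the times",
          "neighborrrrrrrs are so loud",
          "hello, I agree"]

# Replace each run with a single copy of the repeated character
collapsed = [re.sub(regex_repeated, r"\1", tweet) for tweet in tweets]
print(collapsed)
```

Collapsing to a single character is a simplification: a word that legitimately contains a double letter (for example cooool) would come out as col, so a real pipeline would probably need a dictionary check afterwards.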


###Lookaround


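Positive lookahead `(?=...)` makes sure that the first part of the expression is followed by the lookahead expression, without including the lookahead text in the match. Positive lookbehind `(?<=...)` returns all matches that are preceded by the specified pattern, again without including that pattern in the match.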
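
A small sketch of both assertions on a made-up sentence (`text` is not part of the course data). In each case the symbol used as context is checked but does not appear in the result.

```python
import re

# Hypothetical sample sentence
text = "Tickets cost $25 today, which is 10% less than usual"

# Positive lookahead: digits that are immediately followed by "%"
before_percent = re.findall(r"\d+(?=%)", text)

# Positive lookbehind: digits that are immediately preceded by "$"
after_dollar = re.findall(r"(?<=\$)\d+", text)

print(before_percent)  # ['10']
print(after_dollar)    # ['25']
```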

In [0]:
sentiment_analysis = 'You need excellent python skills to be a data scientist. Must be! Excellent python'

- Get all the words that are followed by the word python in sentiment_analysis. Print out the word found.

In [0]:
# Positive lookahead
look_ahead = re.findall(r"\w+(?=\spython)", sentiment_analysis)

# Print out
print(look_ahead)

['excellent', 'Excellent']


- Get all the words that are preceded by the word python or Python in sentiment_analysis. Print out the words found.

In [0]:
# Positive lookbehind
look_behind = re.findall(r"(?<=[Pp]ython\s)\w+", sentiment_analysis)

# Print out
print(look_behind)

['skills']


####Filtering phone numbers

The phone numbers in the list have the structure:

- Optional area code: 3 digits
- Prefix: 4 digits
- Line number: 6 digits
- Optional extension: 2 digits

E.g. 654-8764-439434-01.
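
The structure listed above can also be written down as a single pattern with optional non-capturing groups. This is a sketch for illustration, separate from the lookaround exercise: `regex_phone` and `parse_phone` are invented names, and `re.fullmatch` is used so that the whole string has to fit the structure.

```python
import re

# Optional area code, prefix, line number, optional extension
regex_phone = r"(?:(\d{3})-)?(\d{4})-(\d{6})(?:-(\d{2}))?"

def parse_phone(phone):
    """Return (area, prefix, line, extension); missing optional parts are None."""
    match = re.fullmatch(regex_phone, phone)
    return match.groups() if match else None

print(parse_phone('654-8764-439434-01'))
print(parse_phone('4564-646464-01'))
print(parse_phone('345-5785-544245'))
```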

In [0]:
cellphones = ['4564-646464-01', '345-5785-544245', '6476-579052-01']

In [0]:
for phone in cellphones:
    # Get all phone numbers not preceded by area code
    number = re.findall(r"(?<!\d{3}-)\d{4}-\d{6}-\d{2}", phone)
    print(number)

['4564-646464-01']
[]
['6476-579052-01']


In [0]:
for phone in cellphones:
    # Get all phone numbers not followed by optional extension
    number = re.findall(r"\d{3}-\d{4}-\d{6}(?!-\d{2})", phone)
    print(number)

[]
['345-5785-544245']
[]
