# Regular Expressions

## Introduction
A regular expression (regex) is a sequence of characters that describes a search pattern. We can use regular expressions to search for and extract data.

    Instructions
- In the code cell, assign to the variable regex a regular expression that's four characters long and matches every string in the list strings.

In [1]:
strings = ["data science", "big data", "metadata"]
regex = "data"

## Wildcards in Regular Expressions
The special character "." is a wildcard: it matches any single character in that position (by default, any character except a newline).
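As a minimal sketch of the wildcard, the snippet below tests the pattern "c.t" against a few made-up words. The list `words` and the name `wildcard_matches` are illustrative assumptions, not part of the lesson's data.

```python
import re

# Hypothetical sample words, not taken from the lesson's data set.
words = ["cat", "cot", "ct", "scatter"]

# "c.t" needs exactly one character between "c" and "t",
# and re.search() looks for it anywhere in the string.
wildcard_matches = [w for w in words if re.search("c.t", w) is not None]
print(wildcard_matches)  # ['cat', 'cot', 'scatter']
```

"ct" is rejected because the dot must consume one character, while "scatter" is accepted because it contains "cat".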

    Instructions
- Assign a regular expression that is three characters long and matches every string in strings to the variable regex.

In [51]:
strings = ["bat", "robotics", "megabyte"]
regex = "b.t"

## Searching the Beginnings and Endings of Strings
We can use the caret symbol ("^") to match the beginning of a string, and the dollar sign ("$") to match the end of a string.
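A short sketch of both anchors, using made-up strings (the names `samples`, `starts_with_data`, and `ends_with_data` are assumptions for illustration):

```python
import re

# Hypothetical sample strings, not taken from the lesson's data set.
samples = ["data science", "big data", "metadata store"]

# "^data" only matches when the string begins with "data".
starts_with_data = [s for s in samples if re.search("^data", s) is not None]

# "data$" only matches when the string ends with "data".
ends_with_data = [s for s in samples if re.search("data$", s) is not None]

print(starts_with_data)  # ['data science']
print(ends_with_data)    # ['big data']
```

"metadata store" contains "data" but matches neither pattern, because the substring sits in the middle of the string.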

    Instructions
- Assign a regular expression that's seven characters long and matches every string in strings (except for bad_string) to the variable regex.

In [5]:
strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = "^b.tter"

## Reading and Printing the Data Set
Let's use the csv module to read and print our data file, "askreddit_2015.csv".

    Instructions

- Use the csv module to read our data set and assign it to posts_with_header.
- Use list slicing to exclude the first row, which represents the column names. Assign this sliced data set to posts.
- Use a for loop and string slicing to print the first 5 rows. See if you notice any patterns in this sample of the data set.


In [50]:
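import csv

# Read the whole file into a list of rows; the with block closes the file for us.
with open("/home/aida/Desktop/Dataquest/python-intermediate/data/askreddit_2015.csv", "r") as f:
    posts_with_header = list(csv.reader(f))

# Drop the header row. Later cells keep using the name posts_with_header
# for this header-free list.
posts_with_header = posts_with_header[1:]
for row in posts_with_header[:5]:
    print(row)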

['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195']
["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479']
['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055']
["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']
['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325']
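The read-then-slice pattern can also be tried without the file on disk. The sketch below is an assumption-laden stand-in: it feeds `csv.reader` a small in-memory CSV whose column names and rows are invented, since the real header of askreddit_2015.csv is not shown here.

```python
import csv
import io

# Hypothetical two-row sample; the column names are assumptions,
# not the real header of askreddit_2015.csv.
sample_csv = io.StringIO(
    "title,score,time,gold,comments\n"
    '"What is a good book, really?",120,1433213314.0,0,45\n'
    "What is your favorite video?,98,1434205517.0,1,30\n"
)

rows = list(csv.reader(sample_csv))

# Split off the header with list slicing.
header, sample_posts = rows[0], rows[1:]

# String slicing keeps the printed titles short.
for row in sample_posts:
    print(row[0][:20])
```

Note that `csv.reader` keeps the comma inside the quoted first title, which a plain `split(",")` would not.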


## Counting Simple Matches in the Data Set with re.search()
With re.search(regex, string), we can check whether regex matches anywhere in string. If it does, the call returns a match object. If it doesn't, it returns None.
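To make the two return values concrete, here is a small sketch on a made-up title (`title`, `hit`, and `miss` are illustrative names, not part of the data set):

```python
import re

# Hypothetical title, used only for illustration.
title = "Doctors of Reddit, what is your best advice?"

hit = re.search("of Reddit", title)    # pattern occurs in the title
miss = re.search("of Twitter", title)  # pattern does not occur

print(hit is not None)  # True
print(hit.group(0))     # of Reddit
print(miss)             # None
```

Because a failed search returns None, comparing the result with `is not None` is a simple way to turn it into a yes/no test.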

    Instructions

- Count the number of posts in our data set that match the regex "of Reddit". Assign the count to of_reddit_count.


In [52]:
import re

of_reddit_count = 0
for row in posts_with_header:
    posts = row[0]
    if re.search("of Reddit", posts) is not None:
        of_reddit_count += 1

In [53]:
of_reddit_count

76

## Using Square Brackets to Match Multiple Characters
We use square brackets in a regex to indicate that any character within them can fill the space. 
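A brief sketch of a bracketed character set, run on invented titles (the names `titles` and `bracket_matches` are assumptions):

```python
import re

# Hypothetical titles, not taken from the lesson's data set.
titles = ["Teachers of Reddit", "pilots of reddit", "Best of Geddit"]

# "[Rr]" stands for exactly one character: either "R" or "r".
bracket_matches = [t for t in titles if re.search("of [Rr]eddit", t) is not None]
print(bracket_matches)  # ['Teachers of Reddit', 'pilots of reddit']
```

The third title is rejected because "G" is not one of the characters listed inside the brackets.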

    Instructions


- Use square bracket notation to make the code account for both capitalizations of "Reddit", and count how many posts contain "of Reddit" or "of reddit" in the title.

- Assign the resulting count to of_reddit_count.

In [14]:
of_reddit_count = 0  # reset so this cell doesn't add to the earlier count
for row in posts_with_header:
    posts = row[0]
    if re.search("of [Rr]eddit", posts) is not None:
        of_reddit_count += 1

## Escaping Special Characters
In regular expressions, escaping a character means indicating that you don't want the character to do anything special, and that the interpreter should treat it just like any other character. We use the backslash ("\") to escape characters in a regex.
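The difference escaping makes can be seen in a small sketch. The strings `tagged` and `untagged` are invented for illustration, and the `r"..."` raw-string prefix is used so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

# Hypothetical strings, used only for illustration.
tagged = "[Serious] What changed your mind?"
untagged = "Sea levels and you"

# Unescaped, "[Serious]" is a character set: it matches any ONE of the
# characters S, e, r, i, o, u, s, so the untagged string matches too.
unescaped_hit = re.search("[Serious]", untagged) is not None

# Escaped, the brackets are literal, so only the real tag matches.
escaped_hit_tagged = re.search(r"\[Serious\]", tagged) is not None
escaped_hit_untagged = re.search(r"\[Serious\]", untagged) is not None

print(unescaped_hit, escaped_hit_tagged, escaped_hit_untagged)  # True True False
```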

    Instructions


- Escape the square bracket characters to count the number of posts in our data set that contain the "[Serious]" tag.

- Assign the count to serious_count.

In [21]:
serious_count = 0
for row in posts_with_header:
    posts = row[0]
    if re.search(r"\[Serious\]", posts) is not None:  # raw string keeps the backslashes for the regex
        serious_count += 1
serious_count

69

## Combining Escaped Characters and Multiple Matches

    Instructions

- Refine the code to count how many posts have either "[Serious]" or "[serious]" in the title.
- Assign the count to serious_count.


In [20]:
serious_count = 0
for row in posts_with_header:
    posts = row[0]
    if re.search(r"\[[Ss]erious\]", posts) is not None:
        serious_count += 1
serious_count

77

## Adding More Complexity to Your Regular Expression

    Instructions

- Refine the code so that it counts how many posts have the serious tag enclosed in either square brackets or parentheses.
- Assign the count to serious_count.


In [23]:
serious_count = 0
for row in posts_with_header:
    posts = row[0]
    if re.search(r"[\[\(][Ss]erious[\]\)]", posts) is not None:  # opening set closed before [Ss]
        serious_count += 1
serious_count

80

## Combining Multiple Regular Expressions
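To combine regular expressions, we use the "|" character. The combined regex matches if either the expression on its left or the expression on its right matches.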
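A small sketch of alternation with anchors, on invented titles (`titles`, `pattern`, and `edge_matches` are illustrative names):

```python
import re

# Hypothetical titles, not taken from the lesson's data set.
titles = ["[Serious] Why?", "Why? [Serious]", "Why [Serious] though?"]

# Left alternative: tag at the start. Right alternative: tag at the end.
pattern = r"^\[Serious\]|\[Serious\]$"

edge_matches = [t for t in titles if re.search(pattern, t) is not None]
print(edge_matches)  # ['[Serious] Why?', 'Why? [Serious]']
```

"|" binds more loosely than the other operators used here, so "^" applies only to the left alternative and "$" only to the right one. The third title has the tag in the middle and matches neither.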

    Instructions


- Use the "^" character to count how many posts include the serious tag at the beginning of the title. Assign this count to serious_start_count.

- Use the '$' character to count how many posts include the serious tag at the end of the title. Assign this count to serious_end_count.

- Use the "|" character to count how many posts include the serious tag at either the beginning or end of the title. Assign this count to serious_count_final.


In [33]:
serious_start_count = 0
serious_end_count = 0
serious_count_final = 0
for row in posts_with_header:
    posts = row[0]
    if re.search(r"^[\[\(][Ss]erious[\]\)]", posts) is not None:
        serious_start_count += 1
    if re.search(r"[\[\(][Ss]erious[\]\)]$", posts) is not None:
        serious_end_count += 1
    if re.search(r"^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$", posts) is not None:
        serious_count_final += 1
print(serious_start_count, serious_end_count, serious_count_final)

69 11 80


## Using Regular Expressions to Substitute Strings

The re module provides a sub() function that takes the following parameters (in order):
- pattern: The regex to match
- repl: The string that should replace the substring matches
- string: The string to search for matches of the pattern
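The argument order can be sketched on a made-up string (`original` and `masked` are illustrative names, and the digit-masking task is an assumption chosen only to show the call):

```python
import re

# Hypothetical input, used only for illustration.
original = "Call 555-0199"

# pattern, repl, string: replace every single digit with "#".
masked = re.sub("[0-9]", "#", original)

print(masked)    # Call ###-####
print(original)  # Call 555-0199
```

re.sub() returns a new string and leaves the input untouched, so the result has to be assigned somewhere to be kept.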


    Instructions
- Replace "[serious]", "(Serious)", and "(serious)" with "[Serious]" for all of the titles in posts.
- You should only need to use one call to sub(), and one regex.
- Recall that the repl argument is an ordinary string. It's not a regex, so you don't need to escape characters like "[".


In [36]:
for row in posts_with_header:
    row[0] = re.sub(r"[\[\(][Ss]erious[\]\)]", "[Serious]", row[0])

## Matching Years with Regular Expressions

    Instructions
- Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999.
- Store every string that contains a year in year_strings. The list method .append() will help here.

In [41]:
strings = ['War of 1812', 'There are 5280 feet to a mile', 'Happy New Year 2016!'] 
year_strings = []
for string in strings:
    if re.search("[1-2][0-9][0-9][0-9]", string) is not None:
        year_strings.append(string)
year_strings

['War of 1812', 'Happy New Year 2016!']

## Repeating Characters in Regular Expressions
We can use curly brackets ("{" and "}") to indicate that a pattern should repeat. To match any four-digit number, for example, we could repeat the pattern "[0-9]" four times by writing "[0-9]{4}".
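A quick sketch of the repetition syntax on invented strings (`samples` and `four_digits` are illustrative names):

```python
import re

# Hypothetical sample strings, not taken from the lesson's data set.
samples = ["1812", "181", "Year 2016!", "12345"]

# "[0-9]{4}" means four digits in a row, found anywhere in the string.
four_digits = [s for s in samples if re.search("[0-9]{4}", s) is not None]
print(four_digits)  # ['1812', 'Year 2016!', '12345']
```

"181" is too short to match. "12345" does match, because re.search() is satisfied by any run of four digits inside a longer number.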

    Instructions

- Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999. Use a regex that takes advantage of curly brackets.
- Store every string that contains a year in year_strings. The list method .append() will help here.


In [43]:
strings = ['War of 1812', 'There are 5280 feet to a mile', 'Happy New Year 2016!'] 
year_strings = []
for string in strings:
    if re.search("[1-2][0-9]{3}", string) is not None:
        year_strings.append(string)
year_strings

['War of 1812', 'Happy New Year 2016!']

## Challenge: Extracting all Years

The re module contains a findall() function that returns a list of all non-overlapping substrings matching the regex. re.findall("[a-z]", "abc123") would return ["a", "b", "c"], because those are the substrings that match the regex.
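As a small sketch on a made-up string (`text`, `letters`, and `digit_pairs` are illustrative names), the calls below show findall() collecting every match rather than stopping at the first:

```python
import re

# Hypothetical input, used only for illustration.
text = "abc123 de45"

letters = re.findall("[a-z]", text)         # one entry per lowercase letter
digit_pairs = re.findall("[0-9]{2}", text)  # non-overlapping pairs of digits

print(letters)      # ['a', 'b', 'c', 'd', 'e']
print(digit_pairs)  # ['12', '45']
```

In "123" the scan consumes "12" and then finds the lone "3" too short to match, which is why matches never overlap.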

    Instructions
- Use re.findall() to generate a list of all years between 1000 and 2999 in the string years_string.

- Assign the result to years.


In [49]:
years_string = '2015 was a good year, but 2016 will be better!'
years = re.findall("[1-2][0-9]{3}", years_string)
years

['2015', '2016']