# **Putting our Python to work:**
# **Using the** `sqlite3` **library to use SQL with Python**

We can connect directly to our SQLite3 database using `sqlite3`, a library that ships with Python for exactly this purpose. This lets us run SQL scripts outside SQLiteStudio and work with the results directly in Python, which opens the door to more advanced work such as plotting the results with code (to come in later classes).

There are 7 main steps to using SQLite3 with Python:
1. import the `sqlite3` library
2. connect to your database file
3. create a cursor - which is a database object that allows you to run queries
4. execute a SQL query
5. store the results and column names in separate variables
6. close your database cursor and connection
7. clean up column names and combine with results in a single variable for further work
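
The seven steps can be sketched end to end as follows. This is a minimal illustration rather than the class workflow itself: it assumes an in-memory database (`":memory:"`) and a made-up table `demo_returns` with invented rows, so that it runs without the course database file.

```python
# Step 1: import the library
import sqlite3

# Step 2: connect (":memory:" creates a throwaway database, used here only for illustration)
demo_connect = sqlite3.connect(":memory:")

# Step 3: create a cursor
demo_cursor = demo_connect.cursor()

# Hypothetical table and rows, created only so the query has something to return
demo_cursor.execute(
    "CREATE TABLE demo_returns (year TEXT, zipcode TEXT, num_returns INTEGER)"
)
demo_cursor.executemany(
    "INSERT INTO demo_returns VALUES (?, ?, ?)",
    [("2012", "10128", 100), ("2012", "10003", 80), ("2013", "10128", 110)],
)

# Step 4: execute a SQL query
demo_cursor.execute(
    "SELECT * FROM demo_returns WHERE year = '2012' ORDER BY zipcode"
)

# Step 5: store the results and the column descriptions in separate variables
demo_data = demo_cursor.fetchall()
demo_headers = demo_cursor.description

# Step 6: close the cursor and the connection
demo_cursor.close()
demo_connect.close()

# Step 7: keep only the column names and combine them with the rows
demo_column_names = tuple(header[0] for header in demo_headers)
demo_final = [demo_column_names] + demo_data

for row in demo_final:
    print(row)
```

Each step is covered in more detail in the sections that follow.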

### Step 1: Import the library

In [None]:
# Import the SQLite3 library
import sqlite3

### Step 2: Connect to your database

In [None]:
# point to my local directory
import os
os.chdir(r'C:\Users\colling\!dwd_spring2019\classes\class8')

# setup location variables for the database
db_location = r'nyc_film_db_final.db'

# Create the connection
db_connect = sqlite3.connect(db_location)

### Step 3: Create a cursor to execute the SQL script

In [None]:
# create cursor
db_cursor = db_connect.cursor()

### Steps 4 and 5: Execute a SQL query and Store Results - SQL statement as string

In [None]:
# execute query using cursor
db_cursor.execute("SELECT * FROM irs_nyc_tax_returns WHERE year = '2012' AND zipcode = '10128';")

# retrieve results 
results_data = db_cursor.fetchall()

# retrieve column headers
results_headers = db_cursor.description 

# print both
print("results column names ==>","\n",results_headers,"\n")
print("results data ==>","\n",results_data)
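
When the year and zip code come from Python variables, the `sqlite3` library lets us pass them separately from the SQL text using `?` placeholders, instead of pasting them into the string. The sketch below assumes a small made-up table (`demo_returns`) in an in-memory database, not the course database.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical table, for illustration only
param_connect = sqlite3.connect(":memory:")
param_cursor = param_connect.cursor()
param_cursor.execute("CREATE TABLE demo_returns (year TEXT, zipcode TEXT)")
param_cursor.executemany(
    "INSERT INTO demo_returns VALUES (?, ?)",
    [("2012", "10128"), ("2011", "10128"), ("2012", "10003")],
)

# The values to filter on live in ordinary Python variables
wanted_year = "2012"
wanted_zipcode = "10128"

# Each ? is filled, in order, from the tuple passed as the second argument
param_cursor.execute(
    "SELECT * FROM demo_returns WHERE year = ? AND zipcode = ?",
    (wanted_year, wanted_zipcode),
)
param_rows = param_cursor.fetchall()
print(param_rows)

param_cursor.close()
param_connect.close()
```

Letting the library substitute the values avoids quoting mistakes in the SQL string.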

### Steps 4 and 5: Execute a SQL query and Store Results - SQL script in files

In [None]:
# point to my local directory
import os
os.chdir(r'C:\Users\colling\!dwd_spring2019\classes\class8')

# retrieve the SQL script from a file
script_location = 'analyze_data.sql'
file_handle = open(script_location)
sql_script = file_handle.read()
file_handle.close()

# execute query using cursor
db_cursor.execute(sql_script)

# retrieve results 
results_data = db_cursor.fetchall()

# retrieve column headers
results_headers = db_cursor.description 

# print both
print("results column names ==>","\n",results_headers,"\n")
print("results data ==>","\n",results_data,"\n")

### Step 6: Close your cursor and database connections

In [None]:
db_cursor.close()
db_connect.close()

### Step 7: Clean up column headers and combine with results into single variable

In [None]:
# clean-up column headers
column_names = []
for header in results_headers:
    column_names.append(header[0])
results_headers = tuple(column_names)

# add headers to results (build a new list so that results_data itself is left unchanged)
results_final = [results_headers] + results_data

# print each row
for row in results_final:
    print(row)
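
Another way to combine the headers with the results is to pair every row with the column names, so that each value can be looked up by name. A minimal sketch, using made-up header and row values in place of the real query results:

```python
# Hypothetical stand-ins for the cleaned-up headers and the fetched rows
example_headers = ("year", "zipcode", "num_returns")
example_rows = [("2012", "10128", 100), ("2012", "10003", 80)]

# zip() pairs each column name with the value in the same position of the row
example_records = [dict(zip(example_headers, row)) for row in example_rows]

for record in example_records:
    print(record["zipcode"], record["num_returns"])
```

This form is convenient when later code needs a specific column and should not depend on column order.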

Regular Expressions
-------------------

Regular expressions (regexes or re’s) constitute an extremely powerful, flexible and concise language for matching elements in text ranging from a few characters to complex patterns. While mastering the syntax of the regular expression language does require climbing a learning curve, this learning curve is not particularly steep, and a newcomer can find herself performing useful tasks with regular expressions almost immediately. Efforts spent learning regular expressions quickly pay off--tasks that are well suited for regular expressions abound. Indeed, regular expressions are one of the most useful computer skills, and an absolutely critical tool for data scientists. 

This document will present basic regular expression syntax and cover common use cases for regular expressions: pattern matching, filtering, data extraction, and string replacement. We will present examples using grep - a Unix command to find lines of a text file with a given string in them. We create a Python version of grep to work with.

In [33]:
# The code below is written in Python to replicate the behavior of grep, the UNIX utility
# We will examine the details of how the code works in a subsequent notebook.
# For now, just execute the code, and use the function grep(regex_expression, file_name) as-is

import re

def printMatches(text, regex_expression):
    BACKGROUND_YELLOW = '\x1b[43m'
    COLOR_RESET  = "\x1b[0m"
    regex= re.compile(regex_expression)
    matches = regex.finditer(text)
    for m in matches:
        highlighted  = text[:m.start()] # the string before the regex match
        highlighted += BACKGROUND_YELLOW + text[m.start():m.end()] + COLOR_RESET 
        highlighted += text[m.end():] # the string after the regex match
        print(highlighted)

def grep(regex_expression, file_name):
    f = open(file_name, "r")
    content = f.read()
    f.close()
    for line in content.split("\n"):
        printMatches(line, regex_expression)
        
# for the lesson - let's be sure to point to the directory where the data is.
import os
os.chdir('C:\\Users\\colling\\!dwd_spring2019\\classes\\class8')

### NYC Restaurant Names Data

In this notebook, we will demonstrate the various regular expressions using the set of restaurant names from `restaurant-names.txt`.

Let's take a peek at the contents using our text reader.

Now, let's see if there are any restaurants with the string 'PANO' in them:

In [34]:
grep('PANO', "restaurant-names.txt")

BUFFALO WILD WINGS,PEETS COOFEE &TEA, [43mPANO[0mPOLIS BAKERY & CAFE
CAFE ES[43mPANO[0mL
EL CHARRO ES[43mPANO[0mL
EL POTE ES[43mPANO[0mL
LA CANDELA ES[43mPANO[0mLA
PAM[43mPANO[0m
[43mPANO[0mRAMA OF MY SILENCE-HEART
[43mPANO[0mRAMA RESTAURANT
TIGIN IRISH PUB,PEETS COFFEE&TEA,[43mPANO[0mPOLIS BAKERY&CAFE


What can we do if we want to search for something more complex than a fixed string? Regular expressions solve exactly this problem. 

### The atoms

The simplest regular expressions are a sequence of `atoms`. An atom can be any of the following:
* single character, 
* a dot,
* a bracket expression, 
* an anchor.

#### Single character atom

A single character atom matches itself.

#### The `.` character atom

A dot atom matches any single character (except for a new line character `\n`).

Example: Using single character atoms, and the `.` atom, let's find all restaurant names that contain the characters `AB`, followed by any character (`.`) and then the character `D`:

In [35]:
grep('AB.D', 'restaurant-names.txt')

[43mABID[0mE BROOKLYN PITA
JJ PE[43mABOD[0mY'S
L[43mABAD[0mEE MANOIR
NEW KAB[43mAB D[0mINER
RESTAURANT [43mABID[0mJAN


#### Bracket expression atom

A bracket expression (written with square brackets []) defines a set of characters. It matches exactly one character, which can be any of the characters in the set. Example: [ABL] matches either A, B, or L.

Now, let's use a bracket expression: We want to find restaurants that contain one of the letters A,B,C,X,Y,Z followed by two digits. We specify the set of letters as `[ABCXYZ]` and the set of digits as `[0123456789]`, which we repeat once for each digit.  

In [36]:
grep('[ABCXYZ][0123456789][0123456789]', 'restaurant-names.txt')

[43mB66[0m CLUB
B[43mA10[0m02 BAR
B[43mA10[0m19 BAR
B[43mA61[0m10 BAR
B[43mC81[0m40 BAR AT THE GARDEN
C[43mB80[0m30 SAUSAGE CONCESSION
CIBO MARKET (GATE [43mC65[0m)
COTTO MARKET-GATE [43mC30[0m
F[43mA80[0m70 HOT DOG CONCESSION
F[43mB10[0m14 HOT DOG CONCESSION
F[43mB80[0m20 PIZZA CONCESSION
F[43mB90[0m90 HOT DOG CONCESSION
F[43mB91[0m10 HOT DOG CONCESSION
F[43mB91[0m20 HOT DOG CONCESSION
HOT DOG CONCESSION [43mA80[0m3-1
JFK FUEL BAR [43mB27[0m
MADISON CLUB (B[43mB71[0m84)
RUNWA[43mY69[0m
YOGURT [43mY23[0m INC


##### Brackets and ranges

Instead of typing long lists of characters in a bracket expression, we can use the range character: [0-9] is equivalent to [0123456789]. Similarly [A-Z] is equivalent to [ABCDEFGHIJKLMNOPQRSTUVWXYZ]. And [D-T] is equivalent to [DEFGHIJKLMNOPQRST]. (You get the idea.) You can also combine multiple ranges: [a-e1-9] is equivalent to [abcde123456789]. 

Finally, you can specify characters to be excluded from the set by placing the caret (^) immediately after the opening bracket. For example, [^0-9] matches any character other than a digit.

Let's find restaurants that contain a letter, followed by a digit, and then followed by a character that is not a digit:

In [37]:
grep('[A-Z][0-9][^0-9]', 'restaurant-names.txt')

[43mA1 [0mOCHA SUSHI
A[43mH2 [0mICE TEA
[43mB4 [0mNYC
B[43mT3 [0mBAR
B[43mT4 [0mBAR
[43mC2 [0mCAFE
CAF[43mE1 [0m& CAFE 4 (AMERICAN MUSEUM OF NATURAL HISTORY)
[43mF1 [0mLOUNGE AND GRILL
ILLY/VELOCITY BAR (E[43mC2)[0m
[43mJ4 [0mHOOKAH LOUNGE
JUIC[43mE4U[0m
[43mM1-[0m5
[43mM2M[0m MART
[43mM2N[0m BUFFET
NINET[43mY9 [0m& UP DINER
N[43mO1 [0mCHINESE RESTAURANT
[43mQ2 [0mTHAI RESTAURANT
[43mT2 [0m- GO
TERMINA[43mL1 [0mEMPLOYEE CAFETERIA
THE NEW YORK PALACE HOTEL ([43mC1 [0mLEVEL CAFETERIA)
TW[43mO8T[0mWO BAR & BURGER
US FRIED CHICKEN & [43mP1Z[0mZA


Hm, we do not want to get results that have a space after the number, so let's also exclude the space character:

In [38]:
grep('[A-Z][0-9][^0-9 ]', 'restaurant-names.txt') 

ILLY/VELOCITY BAR (E[43mC2)[0m
JUIC[43mE4U[0m
[43mM1-[0m5
[43mM2M[0m MART
[43mM2N[0m BUFFET
TW[43mO8T[0mWO BAR & BURGER
US FRIED CHICKEN & [43mP1Z[0mZA


Let's try another example: a number, followed by a character that's not a letter, nor number, nor space, followed by a number.

In [39]:
# Digit, not letter not digit not space, digit
grep('[0-9][^A-Z0-9 ][0-9]', 'restaurant-names.txt') 

$[43m1.2[0m5 PIZZA
[43m1.5[0m GALBI CORP
10[43m4-0[0m1 FOSTER AVENUE COFFEE SHOP(UPS)
3[43m6-0[0m2 DITMARS COFFEE CORP.
4[43m0/4[0m0 CLUB
4[43m0/4[0m0 CLUB BAR
44 [43m1/2[0m CAFE
83 [43m1/2[0m
BRASSERIE 8 [43m1/2[0m
FOOD DEPOT 1[43m2*4[0m
HOT DOG CONCESSION A80[43m3-1[0m
M[43m1-5[0m
PRB 2[43m4-7[0m
THE BEST $[43m1.0[0m0 PIZZA


Now - what about restaurants with five digits?

In [40]:
# Restaurants with five digits
grep('[0-9][0-9][0-9][0-9][0-9]', 'restaurant-names.txt') 

CAFE [43m11231[0m
COFFEE [43m11238[0m
MCDONALDS (#[43m11542[0m)
MCDONALDS [43m17754[0m
PIZZA HUT  # [43m29782[0m
PIZZA HUT #[43m29773[0m
PIZZA HUT [43m29531[0m
PIZZA HUT# [43m28256[0m
STARBUCKS # [43m14840[0m
STARBUCKS (STORE [43m16628[0m)
STARBUCKS [43m22420[0m
STARBUCKS COFFEE  #[43m16608[0m
STARBUCKS COFFEE # [43m15440[0m
STARBUCKS COFFEE #[43m14240[0m
STARBUCKS COFFEE #[43m18509[0m
STARBUCKS COFFEE #[43m20679[0m
STARBUCKS COFFEE #[43m21514[0m
STARBUCKS COFFEE #[43m22596[0m
STARBUCKS COFFEE #[43m23266[0m
STARBUCKS COFFEE #[43m23267[0m
STARBUCKS COFFEE (#[43m19890[0m)
STARBUCKS COFFEE (STORE #[43m13539[0m)
STARBUCKS COFFEE (STORE [43m17478[0m)
STARBUCKS COFFEE (STORE#[43m11650[0m)
STARBUCKS COFFEE (STORE#[43m20161[0m)
STARBUCKS COFFEE COMPANY #[43m22560[0m
SUBWAY (STORE #[43m27610[0m)
SUBWAY (STORE #[43m38550[0m)
SUBWAY STORE [43m46555[0m
SUBWAY#[43m50497[0m (CARDINAL HAYES HIGH SCHOOL)
TEAVANA #[43m22994[0m
TEAVANA#[43m2

#### Anchor

Anchor atoms are special characters, used to define the location of a regex within a line. 

The anchor `^` specifies the *beginning of a line*, the anchor `$` specifies the end of a line. The anchor `\b` specifies the word boundary. (Note that the `^` is used differently in regular expressions based on its context!)

Example: Find restaurant names that start with the characters `BAL`

In [41]:
grep('^BAL', 'restaurant-names.txt')

[43mBAL[0mABOOSTA
[43mBAL[0mADE
[43mBAL[0mBOA RESTAURANT.
[43mBAL[0mCON QUITENO RESTAURANT
[43mBAL[0mDOR SPECIALTY FOODS
[43mBAL[0mDUCCI'S
[43mBAL[0mI NUSA INDONESIAN RESTAURANT
[43mBAL[0mILO DELI
[43mBAL[0mIMAYA RESTAURANT
[43mBAL[0mKANIKA
[43mBAL[0mKH SHISH KABAB HOUSE
[43mBAL[0mL PARK HOT DOG
[43mBAL[0mLARO
[43mBAL[0mLATO'S RESTAURANT
[43mBAL[0mLFIELDS CAFE
[43mBAL[0mLI DELI & SALAD BAR
[43mBAL[0mLY TOTAL FITNESS
[43mBAL[0mLY'S SPORT CLUB
[43mBAL[0mNDIE'S PLACE, INC
[43mBAL[0mON
[43mBAL[0mTHAZAR BAKERY
[43mBAL[0mTHAZAR RESTAURANT
[43mBAL[0mUCHI
[43mBAL[0mUCHI'S
[43mBAL[0mUCHI'S FRESH
[43mBAL[0mUCHI'S INDIAN FOOD
[43mBAL[0mVANERA
[43mBAL[0mZEM


Example: Find restaurant names that end with the characters `NORTH`

In [42]:
grep('NORTH$', 'restaurant-names.txt')

AQUEDUCT [43mNORTH[0m
BOURGEOIS PIG [43mNORTH[0m
PRATT INSTITUTE [43mNORTH[0m


In [43]:
# All restaurants that end with 4 digits
grep('[0-9][0-9][0-9][0-9]$', 'restaurant-names.txt')

CAFE 1[43m1231[0m
CAFE [43m1853[0m
CANTINA [43m1436[0m
CBRE-[43m1540[0m
CHIPOTLE MEXICAN GRILL # [43m2135[0m
CHIPOTLE MEXICAN GRILL #[43m1394[0m
CHIPOTLE MEXICAN GRILL #[43m1962[0m
CHIPOTLE MEXICAN GRILL #[43m1968[0m
CHIPOTLE MEXICAN GRILL #[43m2090[0m
CHIPOTLE MEXICAN GRILL #[43m2123[0m
CHIPOTLE MEXICAN GRILL#[43m1766[0m
COFFEE 1[43m1238[0m
DOMINO'S PIZZA #[43m3647[0m
DOMINO'S PIZZA [43m3537[0m
DOMINO'S PIZZA [43m3657[0m
DOMINOS PIZZA # [43m3448[0m
EMPIRE RESTAURANT OF [43m1635[0m
GALLAGHER'S [43m2000[0m
JACQUES [43m1534[0m
KAFFE [43m1668[0m
LABETTI'S POST # [43m2159[0m
LONGHORN STEAKHOUSE #[43m5453[0m
MCDONALD'S RESTAURANT #[43m3391[0m
MCDONALDS 1[43m7754[0m
MIDTOWN [43m1015[0m
OUTBACK STEAKHOUSE [43m3330[0m
OUTBACK STEAKHOUSE [43m3332[0m
PANDA EXPRESS #[43m2634[0m
PANDA RESTAURANT [43m2807[0m
PETER'S SINCE [43m1969[0m
PIZZA HUT  # 2[43m9782[0m
PIZZA HUT #2[43m9773[0m
PIZZA HUT 2[43m9531[0m
PIZZA HUT# 2[43m8256[0m
RE

Example: Let's try to find restaurants containing the word `COLUMBIA`:

In [44]:
grep('COLUMBIA', 'restaurant-names.txt')

BROWNIE'S CAFE AT [43mCOLUMBIA[0m
CAFE 212/[43mCOLUMBIA[0m CATERING KITCHEN - ALFRED LERNER HALL
[43mCOLUMBIA[0m UNIVERSITY MEDICAL CENTER BOOKSTORE CAFE
EL PUNTO [43mCOLUMBIA[0mNO RESTAURANTE BAKERY
LA GATA GOLOSA [43mCOLUMBIA[0mN FOOD
PARAISO [43mCOLUMBIA[0mNO RESTAURANT
THE FACULTY CLUB ([43mCOLUMBIA[0m UNIVERSITY)
THE SCHOOL AT [43mCOLUMBIA[0m UNIVERSITY
TIERRAS [43mCOLUMBIA[0mNAS


Hm, something is wrong. We also get COLUMBIANO, COLUMBIANAS, and other words. We want only the word COLUMBIA, so we add the word anchors:

In [45]:
# The r'....' is a "raw" string, and allows us to enter
# backslash without having to "escape" the backslash.
# Otherwise Python will interpret \b as a single special
# character, and not as two characters \b that are part of the regex
grep(r'\bCOLUMBIA\b', 'restaurant-names.txt')

BROWNIE'S CAFE AT [43mCOLUMBIA[0m
CAFE 212/[43mCOLUMBIA[0m CATERING KITCHEN - ALFRED LERNER HALL
[43mCOLUMBIA[0m UNIVERSITY MEDICAL CENTER BOOKSTORE CAFE
THE FACULTY CLUB ([43mCOLUMBIA[0m UNIVERSITY)
THE SCHOOL AT [43mCOLUMBIA[0m UNIVERSITY


#### Basic Patterns

* `a, X, 9, ....`: -- ordinary characters just match themselves exactly. 
* `. ^ $ * + ? { [ ] \ | ( )`: The **meta-characters** which do not match themselves because they have special meanings (more info below)
* `.` (a period) -- matches any single character except newline '\n'
* `\t, \n, \r`: Special characters, tab, newline, return
* `^` = start, `$` = end -- match the start or end of the string
* `\`: inhibit the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, `\@`, to make sure it is treated just as a character.
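
To see escaping in action, here is a small sketch using Python's `re` library directly. The sample strings are made up for illustration; `re.escape()` is a standard-library helper that adds the backslashes for us.

```python
import re

price_text = "THE BEST $1.00 PIZZA"

# Unescaped, "." matches any character, so this pattern also matches "1x00"
loose_match = re.search(r'1.00', "1x00")
print(loose_match is not None)

# Escaping "$" and "." with a backslash makes them match literally
escaped_match = re.search(r'\$1\.00', price_text)
print(escaped_match.group())

# re.escape() escapes every special character in a plain string
auto_pattern = re.escape("$1.00")
auto_match = re.search(auto_pattern, price_text)
print(auto_match.group())
```

`re.escape()` is handy when the text to search for comes from a variable and may contain meta-characters.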

#### Shortcuts

A few of the bracket expressions that we discussed above occur very often. For this reason, we have shortcuts for them:

* `\d`: matches the digits: `[0-9]`.
* `\D`: matches anything but `\d`: `[^0-9]`.
* `\w`: matches any alphanumeric character plus underscore: `[A-Za-z0-9_]`.
* `\W`: matches anything but `\w`: `[^A-Za-z0-9_]`
* `\s`: matches any "whitespace" character (space, tab, newline, etc): `[ \t\n\r\f\v]`.
* `\S`: matches anything but `\s`: `[^ \t\n\r\f\v]` .
* `\b`: matches the breaks between alphanumeric and non-alphanumeric characters (an empty string), the boundary between `\w` and `\W`. Useful for ensuring that what you match is actually a word.
* `\B`: the opposite of `\b`: matches the empty string at any position that is not a word boundary. Useful for ensuring your match is in the middle of a word.
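
The shortcuts can be tried out with `re.findall()`, which returns every match as a list. The sample strings below are invented for illustration.

```python
import re

sample = "PIZZA HUT #29773, EST. 1958"

# \d+ : runs of digits
digit_runs = re.findall(r'\d+', sample)
print(digit_runs)

# \w+ : runs of letters, digits, and underscores
word_runs = re.findall(r'\w+', sample)
print(word_runs)

# \S+ : runs of anything that is not whitespace
non_space_runs = re.findall(r'\S+', sample)
print(non_space_runs)

# \b : HUT must be a whole word, so the HUT inside HUTCH is skipped
whole_words = re.findall(r'\bHUT\b', "PIZZA HUT AND HUTCH")
print(whole_words)
```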



#### In class exercises using Console

Write a regular expression for:

* Match any character
* Match the end of line
* Match any digit
* Find all characters that are not digits
* Find all words with four letters
* Find every line that starts with a digit
* Find all empty lines
* Find all lines with 4 characters


In [46]:
# match any character
# do not run against a large dataset
ex1 = r'.'

# match the end of line
# do not run against a large dataset
ex2 = r'$\n'

# match any digit
# do not run against a large dataset
ex3 = r'\d'

# match any "not" digit
# do not run against a large dataset
ex4 = r'\D'

# four letter words
ex5 = r'\b[A-Za-z][A-Za-z][A-Za-z][A-Za-z]\b'

# lines that start with a digit
ex6 = r'^\d'

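# empty lines (nothing between the start and the end of the line)
ex7 = r'^$'

# four character lines (any four characters between the start and the end of the line)
ex8 = r'^....$'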

### Regular Expressions: Operators

#### Alternation |

The alternation operator `|` separates two or more alternative regular expressions. The string matches if any one of the alternatives matches. 

For example, if we are looking for names that contain either the word `GREEK` or the word `RUSSIAN`, we issue the following command: 

In [47]:
grep('GREEK|RUSSIAN', 'restaurant-names.txt')

ANTHI'S [43mGREEK[0m FOOD
AVLEE  [43mGREEK[0m KITCHEN
AVLEE [43mGREEK[0m KITCHEN
AVLI THE LITTLE [43mGREEK[0m TAVERN
ETHOS [43mGREEK[0m CUISINE
[43mGREEK[0m EXPRESS
[43mGREEK[0m FAMILY KITCHEN
[43mGREEK[0m GARDENS GRILL
[43mGREEK[0m GRILL
[43mGREEK[0m ISLANDS
GRK FRESH [43mGREEK[0m
GYRO [43mGREEK[0m STYLE
MEDITERRANEAN GRILL [43mGREEK[0m TARVERNA
OKEANOS [43mGREEK[0m SEAFOOD
OPA! [43mGREEK[0m RESTAURANT
RAFINA [43mGREEK[0m CUISINE
[43mRUSSIAN[0m BATHS
[43mRUSSIAN[0m SAMOVAR
[43mRUSSIAN[0m TURKISH BATHS
SOMETHING [43mGREEK[0m
SYMPOSIUM [43mGREEK[0m RESTAURANT
THE [43mGREEK[0m
THE [43mGREEK[0m CORNER
THE [43mGREEK[0m KITCHEN  CLINTON BAKERY CAFE
THE [43mRUSSIAN[0m TEA ROOM
VILLAGE TAVERNA [43mGREEK[0m GRILL
ZENON TAVERNA [43mGREEK[0m RESTAURANT


#### Repetition {m,n}

A repetition operator specifies how many times the atom or expression immediately before it may be repeated: `{m,n}` means at least `m` and at most `n` times. For example, if we are looking for restaurants that contain the letter I three to five times in a row:  

In [48]:
grep('I{3,5}', 'restaurant-names.txt')

ANTIQUE CAFE & BAKERY [43mIII[0m INC
AZOGUENITA BAKERY & RESTAURANT [43mIII[0m
BAGEL EXPRESS [43mIII[0m
BARZOLA'S RESTAURANT [43mIII[0m
BREAD BROTHERS [43mIII[0m
CESTRA'S PIZZA [43mIII[0m
EL CHIVITO D'ORO [43mIII[0m
EL PACHANGON [43mIII[0m RESTAURANT & BAR
EL POLLO [43mIII[0m
EL REY [43mIII[0m
EMILIO [43mIII[0m BAR
ESTRELLITA POBLANA [43mIII[0m
GOLDEN DRAGON [43mIII[0m
KNAPP PIZZA [43mIII[0m
LAS NUEVAS EMPANADAS MONUMENTAL [43mIII[0m
LITTLE ITALY PIZZA [43mIII[0m
LOS POLLITOS [43mIII[0m
MIRACALI [43mIII[0m
NEW CHINA [43mIII[0m
NEW FRESCO TORTILLOS [43mIII[0m
NEW WIN HING [43mIII[0m CHINESE RESTAURANT
ROCCO PIZZA [43mIII[0m
SAKURA [43mIII[0m
SHINJU [43mIII[0m SUSHI
SUSHI TATSU JAPANESE RESTAURANT [43mIII[0m
THI [43mIII[0m NEW YORK
WARD [43mIII[0m


Now, let's find all the restaurants that have a name length from 50 to 55 characters:

In [49]:
grep('^.{50,55}$', 'restaurant-names.txt')

[43mBRASSIERIE 1605/BROADWAY 49 BAR & LOUNGE (MAIN KITCHEN)[0m
[43mBROOKLYN CHILDREN'S MUSEUM CAFE/FOREST CITY RATNER CAFE[0m
[43mCAFE 212/COLUMBIA CATERING KITCHEN - ALFRED LERNER HALL[0m
[43mCAFE1 & CAFE 4 (AMERICAN MUSEUM OF NATURAL HISTORY)[0m
[43mCARIBBEAN CONNECTION CATERING SERVICES INC RESTAURANT[0m
[43mCHARTWELLS AT COLLEGE OF MOUNT ST. VINCENT-BENEDICT[0m
[43mCOURTYARD & RESIDENCE INN BY MARRIOTT CENTRAL PARK[0m
[43mFORDHAM UNIVERSITY/MCGINLEY CENTER/RAMSKELLER KITCHEN[0m
[43mGREEN AND ACKERMAN KOSHER DAIRY RESTAURANT & PIZZA[0m
[43mHOMESTYLE FOOD SERVICES (ST. BARNABAS HIGH SCHOOL)[0m
[43mLOBBY LOUNGE AND TROUBLE'S TRUST @ THE PALACE HOTEL[0m
[43mNATURAL TOFU & NOODLES RESTAURANT (BOOK CHANG DONG)[0m
[43mNEW YORK BOTANICAL GARDENS TERRACE CAFE ( GARDEN CAFE )[0m
[43mNEW YORK UNIVERSITY - KIMMEL STUDENT CENTER CAFETERIA[0m
[43mPYRAMID COFFEE COMPANY HOSPITAL FOR SPECIAL SURGERY[0m
[43mQ.B.COMM.COLLEGE-MAIN KITCHEN/TIGER BITES PIZZA SECTION[0m


In the repetition operator {m,n}, we can skip putting the upper limit if we want to say, "anything with m matches and above". For example, let's find all the restaurants that have a name length 60 characters and above:

In [50]:
grep('^.{60,}$', 'restaurant-names.txt')

[43m(PUBLIC FARE) 81ST STREET AND CENTRAL PARK WEST (DELACORTE THEATRE)[0m
[43mBUFFALO WILD WINGS,PEETS COOFEE &TEA, PANOPOLIS BAKERY & CAFE[0m
[43mCENTER PLATE- CONCOURSE CAFE-JACOB K JAVITS CONVENTION CENTER[0m
[43mCENTERPLATE-EMPLOYEE CAFETERIA-JACOB K JAVITS CONVENTION CENTER[0m
[43mCENTRA`L MARKET ALL AMERICAN GRILL ( STATEN ISLAND FERRY TERMINAL)[0m
[43mDELTA SKY CLUB (BARTENDER SERVICE TERMINAL D DELTA DEPARTURE)[0m
[43mDUNKIN DONUTS (INSIDE GULF GAS STATION ON NORTH SIDE OF MAJ. DEEGAN EXWY- AFTER EXIT 13 - 233 ST.)[0m
[43mFASHION INSTITUTE OF TECHNOLOGY DAVID DUBINSKY STUDENT CENTER[0m
[43mGREATER NEW YORK SOCIAL AND HEALTH ADULT DAY CARE CENTER LLC[0m
[43mHOMEWOOD SUITES BY HILTON NEW YORK MIDTOWN MANHATTAN TIMES SQUARE[0m
[43mHONG KONG CAFE / FRESH SANDWICH BAKERY (BASEMENT FOOD COURT RESTAURANT & 1ST FL BAKERY)[0m
[43mMARLIN BAR AT TOMMY BAHAMA AND TOMMY BAHAMA RESTAURANT AND B[0m
[43mNEW WAI LING CHINESE RESTAURANT/NEW FRESCO TORTILLAS II TACO[0m


##### Repetition shortcuts (very common!): 

* `* = {0,}`. The `*` character means match the previous atom zero or more times
* `+ = {1,}`. The `+` character means match the previous atom one or more times
* `? = {0,1}`. The `?` character means match the previous atom zero or one times
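
A quick way to compare the three shortcuts is to test them on a few made-up spellings. The sketch below uses `re.fullmatch()`, which succeeds only when the whole string matches the pattern.

```python
import re

# Hypothetical spellings with zero, one, two, and three F's
spellings = ["CAE", "CAFE", "CAFFE", "CAFFFE"]

# F* : zero or more F's
star_matches = [s for s in spellings if re.fullmatch(r'CAF*E', s)]
print(star_matches)

# F+ : one or more F's
plus_matches = [s for s in spellings if re.fullmatch(r'CAF+E', s)]
print(plus_matches)

# F? : zero or one F
optional_matches = [s for s in spellings if re.fullmatch(r'CAF?E', s)]
print(optional_matches)
```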






Find all restaurants that start with one or more digits, followed by a space.

In [51]:
grep('^[0-9]+ ', 'restaurant-names.txt')

[43m002 [0mMERCURY TACOS LLC
[43m1 [0m2 3 BURGER SHOT BEER
[43m1 [0mBANANA QUEEN
[43m1 [0mBUEN SABOR
[43m1 [0mDARBAR
[43m1 [0mEAST 66TH STREET KITCHEN
[43m1 [0mOAK
[43m1 [0mOR 8
[43m1 [0mSTOP PATTY SHOP
[43m10 [0mDEVOE
[43m10 [0mPOINTS KTV
[43m100 [0mFUN
[43m1001 [0mNIGHTS
[43m1001 [0mNIGHTS CAFE
[43m1005 [0mCATERING
[43m101 [0mCAFE
[43m101 [0mDELI
[43m101 [0mRESTAURANT AND BAR
[43m102 [0mNOODLES TOWN RESTAURANT
[43m1020 [0mBAR
[43m1028 [0mBAR & RESTAURANT EL SALVADORENO 
[43m1061 [0mCATERING LLC
[43m107 [0mWEST RESTAURANT
[43m108 [0mFAST FOOD CORP
[43m108 [0mLOUNGE - CLUB 108
[43m1081 [0mFULTON
[43m11 [0mSTREET CAFE
[43m111 [0mRESTAURANT
[43m1174 [0mFULTON CUISINE, HALAL FOOD
[43m12 [0mCHAIRS
[43m12 [0mCHAIRS CAFE
[43m12 [0mCORAZONES RESTAURANT & BAR
[43m12 [0mCORNERS
[43m12 [0mCORNERS COFFEE INC
[43m12 [0mSTREET ALE HOUSE
[43m120 [0mBAY CAFE
[43m1200 [0mMILES
[43m121 [0mFULTON STREET
[43m123 [0mNIKKO
[43m1

Find all restaurants that start with a letter, followed by one or more digits, followed by a space.

In [52]:
grep('^[A-Z][0-9]+ ', 'restaurant-names.txt')

[43mA1 [0mOCHA SUSHI
[43mB4 [0mNYC
[43mB66 [0mCLUB
[43mC2 [0mCAFE
[43mF1 [0mLOUNGE AND GRILL
[43mH20 [0mLOUNGE AND RESTAURANT
[43mJ4 [0mHOOKAH LOUNGE
[43mQ2 [0mTHAI RESTAURANT
[43mT2 [0m- GO
[43mT49 [0mCAFE


In [53]:
# Find all restaurants
# Beginning with one or more letters // ^[A-Z]+
# followed by one or more digits // [0-9]+
# Followed by any number of characters // .*
# and ending with BAR  // BAR$
grep('^[A-Z]+[0-9]+.*BAR$', 'restaurant-names.txt')

[43mBA1002 BAR[0m
[43mBA1019 BAR[0m
[43mBA6110 BAR[0m
[43mBT3 BAR[0m
[43mBT4 BAR[0m


Find all restaurants that contain the word STARBUCKS, followed by any number of characters, and then one or more digits.

In [54]:
grep('STARBUCKS.*[0-9]+', 'restaurant-names.txt')

[43mSTARBUCKS # 14840[0m
[43mSTARBUCKS (JFK TERMINAL 5[0m-POST SECURITY DEPARTURE)
[43mSTARBUCKS (STORE 16628[0m)
[43mSTARBUCKS 22420[0m
[43mSTARBUCKS COFFEE  #16608[0m
[43mSTARBUCKS COFFEE # 15440[0m
[43mSTARBUCKS COFFEE # 7463[0m
[43mSTARBUCKS COFFEE # 7540[0m
[43mSTARBUCKS COFFEE #14240[0m
[43mSTARBUCKS COFFEE #18509[0m
[43mSTARBUCKS COFFEE #20679[0m
[43mSTARBUCKS COFFEE #21514[0m
[43mSTARBUCKS COFFEE #22596[0m
[43mSTARBUCKS COFFEE #23266[0m
[43mSTARBUCKS COFFEE #23267[0m
[43mSTARBUCKS COFFEE #3438[0m
[43mSTARBUCKS COFFEE #7344[0m
[43mSTARBUCKS COFFEE #7358[0m
[43mSTARBUCKS COFFEE #7416[0m
[43mSTARBUCKS COFFEE #7682[0m
[43mSTARBUCKS COFFEE #7826[0m
[43mSTARBUCKS COFFEE #9282[0m
[43mSTARBUCKS COFFEE #9722[0m
[43mSTARBUCKS COFFEE (#19890[0m)
[43mSTARBUCKS COFFEE (#2785[0m)
[43mSTARBUCKS COFFEE (STORE #13539[0m)
[43mSTARBUCKS COFFEE (STORE #7216[0m)
[43mSTARBUCKS COFFEE (STORE #7555[0m)
[43mSTARBUCKS COFFEE (STORE #7577[0m)
[43

#### Grouping ()

The group operator `()` encloses part of a regular expression in parentheses, so that the operator that follows applies to the whole group and not only to the single atom before it. 

For example: Find all the restaurants that start (`^`) with 8 or more repetitions (`{8,}`) of the `\w+ ` pattern (alphanumeric characters followed by space):

In [55]:
grep(r'^(\w+ ){8,}', 'restaurant-names.txt')

[43mGREATER NEW YORK SOCIAL AND HEALTH ADULT DAY CARE CENTER [0mLLC
[43mHOMEWOOD SUITES BY HILTON NEW YORK MIDTOWN MANHATTAN TIMES [0mSQUARE
[43mMARLIN BAR AT TOMMY BAHAMA AND TOMMY BAHAMA RESTAURANT AND [0mB


#### In class exercises

What do these regular expressions match?

1. `b (cd)*`
2. `j? k+`
3. `(cd){2,5}`
4. `Panos|Ipeirotis`

#### In class exercises (advanced)

Write down the regular expressions for the following:

1. A telephone number (e.g., 212-555-0921)
2. A zip+4 code (e.g., 10012-1809)
3. Dollar amount with optional cents (e.g., \$0.33, \$784)
4. Match URLs only of the form http://www.alphanumeric.com


### Group references

Sometimes it is handy to be able to refer to a match that was made earlier in a regex. This is done with **backreferences**, which refer to groups. `\k` is the backreference specifier, where `k` is a number that refers to the text matched by the `k`-th group *enclosed in parentheses*.

For example, find the lines whose first three (or more) characters are repeated at the very end of the line:


In [56]:
grep(r'^(.{3,}).*\1$', 'restaurant-names.txt')

[43m108 LOUNGE - CLUB 108[0m
[43mANTEK RESTAURANT[0m
[43mANTOJITOS RETAURANT[0m
[43mANTONIO'S RESTAURANT[0m
[43mARRIBA ARRIBA[0m
[43mBARCELONA BAR[0m
[43mBARRACUDA BAR[0m
[43mBERONBERON[0m
[43mBINGO BINGO BINGO[0m
[43mBUMBLE AND BUMBLE[0m
[43mBURGER BURGER[0m
[43mCENTER PLATE- CONCOURSE CAFE-JACOB K JAVITS CONVENTION CENTER[0m
[43mCENTERPLATE-EMPLOYEE CAFETERIA-JACOB K JAVITS CONVENTION CENTER[0m
[43mCHARLES SALLY & CHARLES[0m
[43mCHEEBURGER CHEEBURGER[0m
[43mCHEN MOMMY KITCHEN[0m
[43mCHEN'S KITCHEN[0m
[43mCHOP CHOP[0m
[43mCREPE SUCRE[0m
[43mDIP DIP[0m
[43mETCETERA ETCETERA[0m
[43mGAJI GAJI[0m
[43mGIT-IT-N-GIT[0m
[43mGONZALEZ Y GONZALEZ[0m
[43mGUDE GUDE[0m
[43mHALF AND HALF[0m
[43mHOME SWEET HOME[0m
[43mJANCHI JANCHI[0m
[43mKENEDY FRIED CHICKEN[0m
[43mKENNDY FRIED CHICKEN[0m
[43mKENNEDY  FRIED CHICKEN[0m
[43mKENNEDY FRIED CHICKEN[0m
[43mKENNEDY GRILL & FRIED CHICKEN[0m
[43mKENNEDY GRILL AND FRIED CHICKEN[0m
[43mKENNED

Or find all the restaurant names that consist of a sequence of one or more letters repeated exactly twice.

In [57]:
grep(r'^([A-Z]+)\1$', 'restaurant-names.txt')

[43mBERONBERON[0m
[43mCOCO[0m
[43mISIS[0m
[43mMANGOMANGO[0m
[43mNONO[0m
[43mVIVI[0m


Find all names that contain the same digit three times in a row

In [58]:
grep(r'([0-9])\1\1', 'restaurant-names.txt')

[43m111[0m RESTAURANT
[43m444[0m MADISON COFFEE SHOP
[43m555[0m VIVACAFE
[43m777[0m THEATER BAR
[43m888[0m KITCHEN
CAFE 2[43m000[0m CORONA
CHEN BROTHERS [43m888[0m RESTAURANT, INC.
GALLAGHER'S 2[43m000[0m
LEGENDS [43m000[0m
MEXICO 2[43m000[0m DELI RESTAURANT
NEW [43m888[0m CHINA EXPRESS
OUTBACK STEAKHOUSE [43m333[0m0
OUTBACK STEAKHOUSE [43m333[0m2
STARBUCKS COFFEE (STORE #7[43m555[0m)
SUBWAY STORE 46[43m555[0m
TULCINGO DELI [43m111[0m


As we are going to see, these backreferences will also be of tremendous use for extraction purposes.

In [59]:
#### Naming groups
# The group that follows the term "DOUBLE" is named "doublewhat"; inside the pattern we can refer back to it as (?P=doublewhat)
grep(r'DOUBLE (?P<doublewhat>\w+)', 'restaurant-names.txt')


2647 [43mDOUBLE DRAGON[0m CHINESE RESTAURANT
BEST [43mDOUBLE DRAGON[0m RESTAURANT
D & B [43mDOUBLE CHINESE[0m RESTAURANT
[43mDOUBLE CRISPY[0m BAKERY
[43mDOUBLE DELIGHT[0m CHINESE RESTAURANT
[43mDOUBLE DOWN[0m SALOON
[43mDOUBLE DRAGON[0m
[43mDOUBLE DRAGON[0m CHINESE RESTAURANT
[43mDOUBLE DRAGON[0m RESTAURANT
[43mDOUBLE DUTCH[0m ESPRESSO
[43mDOUBLE HAPPY[0m KITCHEN
[43mDOUBLE HAPPY[0m RESTAURANT
[43mDOUBLE RAINBOW[0m
[43mDOUBLE WIDE[0m BAR
[43mDOUBLE WINDSOR[0m
GINGERS ([43mDOUBLE TREE[0m HOTEL)
NEW [43mDOUBLE CHINESE[0m RESTAURANT
NEW [43mDOUBLE DRAGON[0m
NEW [43mDOUBLE DRAGON[0m CHINESE RESTAURANT
NY [43mDOUBLE CHINESE[0m RESTAURANT
THE MET GRILL/[43mDOUBLE TREE[0m HOTEL


### More Advanced Regular Expressions

And the ones below get a little bit more advanced:

* `*?`, `+?`: ordinarily, `*`, `+` and `?` are **greedy**. This means they are matching the longest possible string that satisfies the regular expression. Adding the `?` to any of these makes it non-greedy, instead matching the shortest possible expression. 
* `(?: )`: A non-capturing group. This works just as `()`, but doesn’t hold on to the matched contents.
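* `(?<=x)`: A positive lookbehind. Matches at a position only if it is immediately preceded by x, without including x in the match. In Python's `re` library, x must be a pattern that matches strings of a fixed length.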
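
The three constructs above can be sketched with Python's `re` library as follows. The sample strings are chosen only for illustration.

```python
import re

text = "Prof. Donna Datastrom, Dealing with Data"

# Greedy: .* grabs as much as possible before the final "Data"
greedy_match = re.search(r'D.*Data', text).group()
print(greedy_match)

# Non-greedy: .*? stops at the first "Data" it can reach
lazy_match = re.search(r'D.*?Data', text).group()
print(lazy_match)

# Non-capturing group: (?:NEW )? is optional but is not returned by findall,
# so only the captured word after DOUBLE comes back
captured_words = re.findall(r'(?:NEW )?DOUBLE (\w+)', "NEW DOUBLE DRAGON")
print(captured_words)

# Lookbehind: digits that come right after "#", without matching the "#" itself
store_numbers = re.findall(r'(?<=#)\d+', "STARBUCKS COFFEE #7344")
print(store_numbers)
```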


## Implementing Regular Expressions in Python

In the introduction to RegEx, we hid the mechanics of how to implement regular expressions in Python behind our `grep` user-defined function (UDF). Now we dive a bit deeper and cover the common use cases for regular expressions in Python: pattern matching, filtering, data extraction, and string replacement. 

We will present examples using python’s standard [re regular expression library](http://docs.python.org/library/re.html).

You may also want to look at this [*excellent* tutorial from Google](https://developers.google.com/edu/python/regular-expressions).


### Searching strings using regexes

In [63]:
# first import the library
import re

In [64]:
# Regular expressions are compiled into pattern objects
regex = re.compile(r'D.*Data')
text = "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu"

There are a few methods to retrieve your matched text: 
* `.findall()` : returns a list of ALL non-overlapping matches of your regex (the matched strings, or the captured groups when the regex contains groups)
* `.search()` : returns the first match of your regex as a match object that enables you to access each group (or `None` if there is no match)
* `.finditer()` : returns an iterator over match objects, one per match, which enables you to work through every match in a for loop

Let's compare these.

In [65]:
# .findall()
match = regex.findall(text)
print(match)

['Donna Datastrom, Dealing with Data']


In [66]:
# .search()
match = regex.search(text)
print(match)
print(match.group())

<re.Match object; span=(6, 40), match='Donna Datastrom, Dealing with Data'>
Donna Datastrom, Dealing with Data


In [67]:
matches = regex.finditer(text)
print(matches)
for match in matches:
    print(match.group())

<callable_iterator object at 0x0000025FA94897B8>
Donna Datastrom, Dealing with Data


In [68]:
# We will now try to match an email address. What is wrong in our regex? 
# Can you fix it? Try to use \w as a shorthand
regex = re.compile(r'\w+@\w+')
text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

matches = regex.finditer(text)
for match in matches:
    print(match.group())

brandenburger@stern


In [69]:
# FIXED -- We will now try to match an email address. What is wrong in our regex? 
# Can you fix it? Try to use \w as a shorthand
regex = re.compile(r'\w+@[\w\.]+')
text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

matches = regex.finditer(text)
for match in matches:
    print(match.group())

brandenburger@stern.nyu.edu.
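
The fixed pattern still has two quirks: it drops `adam.` before the `@`, and it picks up the period that ends the sentence. One possible tightening (a sketch, not the only reasonable email pattern) allows dots before the `@` and only accepts a dot after the `@` when more word characters follow it. The names `email_regex` and `emails_found` are made up for this example.

```python
import re

text = "My email is adam.brandenburger@stern.nyu.edu. You can email me."

# [\w.]+        : part before the @, word characters and dots
# \w+(?:\.\w+)+ : domain, word characters separated by dots; a dot only
#                 counts when word characters follow it, so the final
#                 sentence-ending "." is left out
email_regex = re.compile(r'[\w.]+@\w+(?:\.\w+)+')
emails_found = email_regex.findall(text)
print(emails_found)
```

The `(?: )` parentheses group without capturing, so `findall` still returns the whole address.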


In [70]:
# We are looking for binary numbers
regex = re.compile(r'[10]+')
text = "asddf1101110100011abd1111panos0000"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

1101110100011
1111
0000


In [71]:
# We look for money figures, either integers, or with 1 or 2 decimal
# digits
regex = re.compile(r'\$\d+(\.\d\d?)?')
text = '$1200.23 is the price today. $1200 was the price yesterday'
matches = regex.finditer(text)
for match in matches:
    print(match.group())

$1200.23
$1200


In [72]:
# This code is going to generate no matches
regex = re.compile(r'Ra*nd.*m R[egex]')
text =  "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

### Flags for regexes: Case-sensitivity and multiline searches

Regular expressions are typically case-sensitive. 

In [73]:
# Regular expressions are compiled into pattern objects
# Regular expressions are case-sensitive
regex = re.compile(r'D.*STR')
text = "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

But we can make them case-insensitive by passing the flag `re.IGNORECASE`.

In [74]:
# Unless we specify that they are case-insensitive, using the flag re.IGNORECASE
regex = re.compile('D.*STR',re.IGNORECASE)
text = "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu"
matches = regex.finditer(text)
for match in matches:
    print(match.group())

Donna Datastr


 For a full list of available flags, please see the [re documentation](http://docs.python.org/library/re.html).
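
The heading above also mentions multiline searches. As a minimal sketch (the sample text is made up), `re.MULTILINE` makes `^` and `$` match at the start and end of every line rather than only at the start and end of the whole string:

```python
import re

# Hypothetical three-line string, made up for this illustration
lines = "Data is fun\nBig Data\nData wrangling"

# Without the flag, ^ only matches at the very start of the string
without_flag = re.findall(r'^Data', lines)
print(without_flag)

# With re.MULTILINE, ^ also matches right after each newline
with_flag = re.findall(r'^Data', lines, re.MULTILINE)
print(with_flag)
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.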

### Multiple matches in a string

The `search` command scans through the string and stops at the first location where the regex matches. It does not look for the longest match or for any later ones. For example, we will not get the second phone number:

In [75]:
# The search command scans the string and returns only the first match;
# it does not continue to the second phone number
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu, 646-555-5555"
matches = regex.search(text)
print(matches.group())

212-998-0803


If we want to find multiple matches within the string, then we use the `finditer` command, which returns an iterator of match objects.

In [76]:
# The finditer command returns an iterator containing "match" objects, which have a variety of attributes
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
text = "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu, 646-555-5555"
matches = regex.finditer(text)
for m in matches:
    print("Starts at:", m.start(), 
    "Ends at:", m.end(),
    "Content:", m.group())

Starts at: 42 Ends at: 54 Content: 212-998-0803
Starts at: 72 Ends at: 84 Content: 646-555-5555


### Extracting Data -- where regexes start to get really cool

#### Defining groups within regexes

In addition to simple matching and filtering, many regular expression implementations, including Python’s re, provide a powerful mechanism for extracting meaningful data from raw text. Through capturing, the strings that satisfy a particular part of a regular expression are extracted from the text being matched and returned to the program processing the raw data. 

**The portion of regular expressions that should be captured is surrounded by parentheses, `"( )"`.**

Then, provided the regular expression containing the capturing statement is satisfied, the resulting match object exposes the captured text through its `group` method, which returns the portions of the input text matched by the capturing statements. The groups are indexed from one: to get the portion of the text matched by the first capturing statement, you would query `result.group(1)`; the second pair of parentheses has its match stored in `result.group(2)`, and so on. The value stored at `result.group(0)` is the entire portion of the input string matched by the regular expression, not just the portion satisfying the capturing parentheses.
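
A minimal sketch of this indexing, using a made-up pattern with two capturing groups (the names `parts_regex` and `result` are chosen just for this example):

```python
import re

# Two capturing groups: the part before the @ and the part after it
parts_regex = re.compile(r'(\w+)@([\w.]+)')
result = parts_regex.search("Contact: msdata@nyu.edu")

print(result.group(0))  # the entire match
print(result.group(1))  # text matched by the first pair of parentheses
print(result.group(2))  # text matched by the second pair of parentheses
print(result.groups())  # all captured groups together, as a tuple
```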

As an example of data extraction using capturing regular expressions, say we’re scanning some raw text for phone numbers that we wish to retain for later processing. We might try something like:

In [77]:
import re
# Find phone numbers: 
# Three digits \d{3}
# followed by zero or more non-digits \D*
# followed by three digits \d{3}
# followed by zero or more non-digits \D*
# followed by four digits \d{4}

# The re.VERBOSE flag at the end allows us to write the regex as a multiline string 
# and allows for comments (after the # character)
# In this mode, any whitespace character is ignored, unless explicitly added as part
# of a bracketed expression or when preceded by an unescaped backslash

regex = re.compile(r"""(\d{3}) # The first three digits / area code
                       \D*     # Followed by zero or more non-digits
                       (\d{3}) # The first three digits of the "local" part 
                       \D*     # Followed by zero or more non-digits
                       (\d{4}) # The last four digits of the phone number
                       """, re.VERBOSE)
text = "Prof. Donna Datastrom, Dealing with Data, 212-998-0803, msdata@nyu.edu, 646-555-5555"

matches = regex.finditer(text)
for match in matches:
    print(match.group())
    print("Formatted:", match.group(1),"-", match.group(2), "-", match.group(3))
    print("===========")

212-998-0803
Formatted: 212 - 998 - 0803
===========
646-555-5555
Formatted: 646 - 555 - 5555
===========


Now we will try to extract and format all phone numbers that appear in a larger block of text:

In [78]:
raw_text = """
512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

In [79]:
raw_text

'\n512-234-5234\nfoo\nbar\n124-512-5555\nbiz\n125-555-5785\n679-397-5255\n2126660921\n212-998-0902\n888-888-2222\n801-555-1211\n802 555 1212\n803.555.1213\n(804) 555-1214\n1-805-555-1215\n1(806)555-1216\n807-555-1217-1234\n808-555-1218x1\n809-555-1219 ext. 1234\nwork 1-(810) 555.1220 #1234\n'

In [80]:
# Notice now that each part of the phone number is enclosed in parentheses,
# allowing us to grab the individual parts of the phone number
regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')
matches = regex.finditer(raw_text)

phones = list()
for m in matches:
    area_code = m.group(1)
    first_three_digits = m.group(2)
    last_four_digits =  m.group(3)
    
    phone = "(" + area_code + ")" + first_three_digits + "-" + last_four_digits
            
    phones.append(phone)

# Notice that our list does not include numbers with invalid area codes (e.g., 124, 125)
phones

['(512)234-5234',
 '(679)397-5255',
 '(212)666-0921',
 '(212)998-0902',
 '(888)888-2222',
 '(801)555-1211',
 '(802)555-1212',
 '(803)555-1213',
 '(804)555-1214',
 '(805)555-1215',
 '(806)555-1216',
 '(807)555-1217',
 '(808)555-1218',
 '(809)555-1219',
 '(810)555-1220']

### String Replacement

In addition to matching and extraction, regular expressions can be used to change data--especially unstructured text--in very powerful ways.  In particular, regexes allow you to find specific patterns and then replace them with specified strings. 

As a data scientist, this is useful when trying to get data formatted correctly as input to a specific system, such as when doing data cleanup.

In Python’s re library, the function used for replacement is `sub()` (think "substitute"). 

The pattern for invoking `sub()` is 

`re.sub(regex, replacement, text)`

This will return a version of text where all instances of the regex have been substituted with replacement.
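
A minimal sketch of that call pattern, using a made-up string: every run of whitespace is replaced by a single space. The optional `count` argument of `re.sub` limits how many replacements are made.

```python
import re

# Hypothetical messy string (the \t is a tab character)
messy = "Dealing   with \t  Data"

# Replace each run of one or more whitespace characters with one space
cleaned = re.sub(r'\s+', ' ', messy)
print(cleaned)

# count=1 stops after the first replacement
cleaned_once = re.sub(r'\s+', ' ', messy, count=1)
print(repr(cleaned_once))
```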

Imagine we want to conceal all phone numbers in a document. We could use the following call to `sub()`:

In [81]:
raw_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

regex = re.compile(r'([2-9]\d{2})\D*(\d{3})\D*(\d{4})')

newstring = re.sub(regex, "XXX-XXX-XXXX", raw_text)

print(newstring)

XXX-XXX-XXXX
foo
bar
124-512-5555
biz
125-555-5785
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
XXX-XXX-XXXX
(XXX-XXX-XXXX
1-XXX-XXX-XXXX
1(XXX-XXX-XXXX
XXX-XXX-XXXX-1234
XXX-XXX-XXXXx1234
XXX-XXX-XXXX ext. 1234
work 1-(XXX-XXX-XXXX #1234



When performing substitution, matches found using the capturing mechanism are available to the replacement using `\1`, `\2`, and so on, as shortcuts to `group(1)`, `group(2)`, etc. 
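
As a small sketch with a made-up name string, the replacement below swaps "Last, First" into "First Last" by referring back to the two captured groups:

```python
import re

name = "Datastrom, Donna"

# Group 1 captures the last name, group 2 the first name;
# the replacement is a raw string so \1 and \2 reach re.sub unchanged
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', name)
print(swapped)
```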

In order to use this back-referencing capability, we write the replacement as a raw string (prefixed with `r`), so that Python passes `\1`, `\2`, etc. to `sub()` unchanged instead of treating them as escape sequences. For instance, if we want to make sure all phone numbers are normalized and all area codes are surrounded by parentheses, we can use:

In [82]:
print(re.sub(regex, r"(\1)-\2-\3", raw_text))

(512)-234-5234
foo
bar
124-512-5555
biz
125-555-5785
(679)-397-5255
(212)-666-0921
(212)-998-0902
(888)-888-2222
(801)-555-1211
(802)-555-1212
(803)-555-1213
((804)-555-1214
1-(805)-555-1215
1((806)-555-1216
(807)-555-1217-1234
(808)-555-1218x1234
(809)-555-1219 ext. 1234
work 1-((810)-555-1220 #1234



#### Exercise 1

The webpage at `http://www.stern.nyu.edu/faculty/search_name_form/` contains the contact emails for all the Stern faculty members. Write code that will allow you to extract all the emails that appear in the page. Just for your convenience, the code below will fetch the page, and store the HTML source in the variable `html`.

Then you will need to write the right regex and write the code that finds emails in the retrieved html.

In [None]:
import requests
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
response = requests.get(url)
html = response.text
html

In [84]:
# Find occurrences of the pattern in the HTML source

# You want to write a regular expression that will find all the email addresses that appear in the html
# variable, and store the emails in a list. You may also want to write the list of emails in a text file.
pattern = r'YOUR PATTERN HERE'
regex = re.compile(pattern)
matches = regex.finditer(html)
for m in matches:
    ... #YOUR CODE HERE



#### Solution for Exercise 1

In [None]:
# Email regex
regex = re.compile(r'\w+@(\w+\.)+\w+')

# We can create either a list or a set, but let's avoid duplicates
emails = set()

# Fetch the HTML source
url = 'http://www.stern.nyu.edu/faculty/search_name_form'
html = requests.get(url).text

# Find matches
matches = regex.finditer(html)
# Go through matches and add them in our result set
for m in matches:
    emails.add(m.group())

sorted(emails)
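
The exercise also suggests saving the emails to a text file. Here is a minimal sketch, assuming a small hard-coded HTML snippet in place of the fetched page. The snippet, the addresses in it, the variable names, and the file name `emails.txt` are all made up for illustration.

```python
import re

# Hypothetical stand-in for the fetched page source
sample_html = ('<a href="mailto:jdoe@stern.nyu.edu">jdoe@stern.nyu.edu</a> '
               '<p>asmith@nyu.edu</p>')

email_regex = re.compile(r'\w+@(\w+\.)+\w+')

# A set removes the duplicate that comes from the mailto: link
found_emails = {m.group() for m in email_regex.finditer(sample_html)}

# Write one email per line, in sorted order
with open('emails.txt', 'w') as f:
    for email in sorted(found_emails):
        f.write(email + '\n')
```

Swapping `sample_html` for the `html` variable from the solution would write the real results instead.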

#### Exercise 2

* The webpage at `http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A` contains the list of all tickers at the NASDAQ exchange that start with the letter `A`. Inspect the HTML, and figure out what the pattern for referring to the ticker is (hint: you will see URLs of the form `http://www.nasdaq.com/symbol/....`). 
* Write regular expressions to extract the tickers that appear in a web page
* Write code for iterating over all pages of NASDAQ for all the different letters
* Write code for going over multiple pages within the same letter. (optional)

In [86]:
import requests
import re
# Fetch the HTML from the page
url = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter=A'
html = requests.get(url).text
# Find occurrences of the pattern in the HTML source
pattern = r'YOUR PATTERN HERE'
regex = re.compile(pattern)
matches = regex.finditer(html)
for m in matches:
    ... #YOUR CODE HERE

#### Solution for exercise 2

In [None]:
import requests
import re

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
tickers = set()
for letter in alphabet:
    url = 'http://www.nasdaq.com/screening/companies-by-name.aspx?letter='+letter
    print(url)
    html = requests.get(url).text
    
    # The code below extracts the number of pages for each letter
    # of the alphabet. Potentially we can use that number to
    # iterate over all the pages in NASDAQ. Left as an exercise
    # for the interested reader :-)
    pages_regex = r'Displaying.*of.*<b>(\d+)</b>.*results'
    pregex = re.compile(pages_regex)
    page_matches = pregex.finditer(html)
    for m in page_matches:
        print("Results:", m.group(1))
        # assumes 50 results per page
        num_pages = int(int(m.group(1))/50+1)
        print("Letter", letter, "needs", str(num_pages), "pages")
    
    ticker_regex = r'http://www.nasdaq.com/symbol/(\w+)'
    regex = re.compile(ticker_regex)
    matches = regex.finditer(html)
    for m in matches:
        ticker = m.group(1).upper()
        #print("URL:", m.group())
        #print("Ticker:", ticker)
        tickers.add(ticker)
    print("We have ", len(tickers), "tickers")

tickers
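
One detail in the page-count arithmetic above: dividing by 50, adding 1, and truncating reports one page too many whenever the number of results is an exact multiple of 50. A possible alternative, still assuming 50 results per page as the solution does, is to round up with `math.ceil`. The helper name `pages_needed` is made up for this sketch.

```python
import math

RESULTS_PER_PAGE = 50  # assumption carried over from the solution above

def pages_needed(num_results, per_page=RESULTS_PER_PAGE):
    """Return how many pages are needed to show num_results items."""
    return math.ceil(num_results / per_page)

# Compare the rounded-up count with the formula used in the solution
for n in (49, 50, 51, 100):
    print(n, "results ->", pages_needed(n), "pages;",
          "original formula gives", int(n / 50 + 1))
```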

In [None]:
## Commas
In our last class, we looked at the order of replacement operations for commas and fou

str1 = "value1,value2,,value3,,,value4,value5,,,,value6,,,,,value7"

# replace 2x , before 3x ,
str2 = str1.replace(",,",",")
str2.replace(",,,",",")

# replace 3x , before 2x ,
str2 = str1.replace(",,,",",")
str3 = str2.replace(",,",",")
str3.replace(",,",",")

import re
# collapse every run of one or more commas into a single comma
comma_pattern = r'(\w*)(,+)'
comma_regex = re.compile(comma_pattern)
comma_regex.sub(r'\1,', str1)