# Basic Regex Homework

In [1]:
# use print only as a function
from __future__ import print_function

## Homework 1: FAA tower closures

A list of FAA tower closures has been copied from a [PDF](http://www.faa.gov/news/media/fct_closed.pdf) into the file **`faa.txt`**, which is stored in the **`data`** directory of the course repository.

In [1]:
# read the file into a single string
with open('faa.txt',encoding='utf-8') as f:
    data = f.read()

In [2]:
# check the number of characters
len(data)

5574

In [3]:
# examine the first 500 characters
print(data[0:500])

FAA Contract Tower Closure List
(149 FCTs)
3‐22‐2013
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
TCL TUSCALOOSA RGNL TUSCALOOSA AL
FYV DRAKE FIELD FAYETTEVILLE AR
TXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR
GEU GLENDALE MUNI GLENDALE AZ
GYR PHOENIX GOODYEAR GOODYEAR AZ
IFP LAUGHLIN/BULLHEAD INTL BULLHEAD CITY AZ
RYN RYAN FIELD TUCSON AZ
FUL FULLERTON MUNI FULLERTON CA
MER CASTLE ATWATER CA
OXR OXNARD OXNARD CA
RAL RIVERSIDE MUNI RIVERSIDE CA
RNM RAMONA RAMONA CA
SAC SACRAMENTO EXECU


In [5]:
# examine the last 500 characters
print(data[-500:])

 YAKIMA WA
CWA CENTRAL WISCONSIN MOSINEE WI
EAU CHIPPEWA VALLEY RGNL EAU CLAIRE WI
ENW KENOSHA RGNL KENOSHA WI
Page 3 of 4
FAA Contract Tower Closure List
(149 FCTs)
3‐22‐2013
LOC
ID Facility Name City State
JVL SOUTHERN WISCONSIN RGNL JANESVILLE WI
LSE LA CROSSE MUNI LA CROSSE WI
MWC LAWRENCE J TIMMERMAN MILWAUKEE WI
OSH WITTMAN RGNL OSHKOSH WI
UES WAUKESHA COUNTY WAUKESHA WI
HLG WHEELING OHIO CO WHEELING WV
LWB GREENBRIER VALLEY LEWISBURG WV
PKB MID-OHIO VALLEY RGNL PARKERSBURG WV
Page 4 of 4



Your assignment is to **create a list of tuples** containing the **tower IDs** and the **states** they are located in.

Here is the **expected output:**

> `faa = [('DHN', 'AL'), ('TCL', 'AL'), ..., ('PKB', 'WV')]`

In [6]:
import re

In [7]:
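# three capital letters (tower ID), a space, the rest of the record (greedy),
# then a space and two capital letters (state)
re.findall(r'([A-Z][A-Z][A-Z]) .+ ([A-Z][A-Z])', data)[0:10]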

[('DHN', 'AL'),
 ('TCL', 'AL'),
 ('FYV', 'AR'),
 ('TXK', 'AR'),
 ('GEU', 'AZ'),
 ('GYR', 'AZ'),
 ('IFP', 'AZ'),
 ('RYN', 'AZ'),
 ('FUL', 'CA'),
 ('MER', 'CA')]

In [8]:
# same pattern written with {n} quantifiers; keep the full list of matches
closures = re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', data)
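
The pattern above works here because no header or footer line happens to contain three capitals, a space, and later a space followed by two capitals. A minimal sketch of a stricter variant, assuming each tower record occupies exactly one line, anchors the match with `^` and `$` under `re.MULTILINE`. The `sample` string is copied from the printed file contents above; `strict_pattern` and `sample_closures` are names introduced only for this illustration.

```python
import re

# a few lines copied from the printed file contents, including non-record lines
sample = """FAA Contract Tower Closure List
(149 FCTs)
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
IFP LAUGHLIN/BULLHEAD INTL BULLHEAD CITY AZ
HLG WHEELING OHIO CO WHEELING WV
Page 4 of 4
"""

# with re.MULTILINE, ^ and $ match at the start and end of every line
strict_pattern = re.compile(r'^([A-Z]{3}) .+ ([A-Z]{2})$', re.MULTILINE)
sample_closures = strict_pattern.findall(sample)
print(sample_closures)
```

Only the three record lines match; the title, count, column header, and page footer are skipped because they do not both start with a three-letter ID and end with a two-letter state.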

As a **bonus task**, use regular expressions to extract the **number of closures** listed in the second line of the file (149), and then use an **assertion** to check that the number of closures is equal to the length of the `faa` list (named `closures` in the solution code).

In [9]:
# capture the digits inside "(149 FCTs)" and convert them to an integer
num_closures = int(re.search(r'\((\d+) FCTs\)', data).group(1))
num_closures

149

In [10]:
# raises AssertionError if the stated count and the number of matches differ
assert num_closures == len(closures)
len(closures)

149
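
`re.search` returns `None` when nothing matches, so chaining `.group(1)` directly raises `AttributeError` on unexpected input. A small sketch of a more defensive version follows; the helper name `extract_closure_count` is hypothetical and not part of the assignment.

```python
import re

def extract_closure_count(text):
    """Return the integer N from the first '(N FCTs)' in text, or None if absent."""
    match = re.search(r'\((\d+) FCTs\)', text)
    if match is None:
        return None
    return int(match.group(1))

# header text copied from the first two lines of the file
header = "FAA Contract Tower Closure List\n(149 FCTs)\n"
count = extract_closure_count(header)
missing = extract_closure_count("no count here")
print(count, missing)
```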

## Homework 2: Stack Overflow reputation

I have downloaded my **Stack Overflow reputation history** into the file **`reputation.txt`**, which is stored in the **`data`** directory of the course repository. (If you are a Stack Overflow user with a reputation of 10 or more, you should be able to [download your own reputation history](http://stackoverflow.com/reputation).)

We are only interested in the lines that **begin with two dashes**, such as:

> `-- 2012-08-30 rep +5    = 6`

That line can be interpreted as follows: "On 2012-08-30, my reputation increased by 5, bringing my reputation total to 6."
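
The same reading can be sketched in code with named groups, which label each field of the line. This is only an illustration on the single example line; the group names `date`, `change`, and `total` are assumptions chosen for readability.

```python
import re

line = '-- 2012-08-30 rep +5    = 6'

# named groups label the date, the signed change, and the running total
entry = re.match(
    r'-- (?P<date>\d{4}-\d{2}-\d{2}) rep (?P<change>[+-]?\d+)\s+= (?P<total>\d+)',
    line,
)
fields = entry.groupdict()
print(fields)
```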

Your assignment is to **create a list of tuples** containing only these dated entries, including the **date**, **reputation change** (regardless of whether it is positive/negative/zero), and **running total**.

Here is the **expected output:**

> `rep = [('2012-08-30', '+5', '6'), ('2012-12-11', '+10', '16'), ...,  ('2015-10-14', '-1', '317')]`

In [11]:
# read the file into a single string and examine the first 500 characters
with open('reputation.txt') as f:
    data = f.read()
print(data[0:500])

total votes: 36
 2  12201376 (5)
-- 2012-08-30 rep +5    = 6         
 2  13822612 (10)
-- 2012-12-11 rep +10   = 16        
 2  13822612 (10)
-- 2013-03-20 rep +10   = 26        
-- 2013-12-05 rep 0     = 26        
-- 2014-01-25 rep 0     = 26        
 16  7141669 (2)
-- 2014-03-19 rep +2    = 28        
 1  12202249 (2)
-- 2014-05-11 rep +2    = 30        
 16 23599806 (2)
 2  23597220 (10)
-- 2014-05-12 rep +12   = 42        
 2  13822612 (10)
-- 2014-06-12 rep +10   = 52        
 2  2359722


In [14]:
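# capture the date, the change (with optional sign), and the running total
rep = re.findall(r'-- (\d{4}-\d{2}-\d{2}) rep ([+-]?\d+)\s+=\s(\d+)', data)
# show the second through tenth entries
rep[1:10]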

[('2012-12-11', '+10', '16'),
 ('2013-03-20', '+10', '26'),
 ('2013-12-05', '0', '26'),
 ('2014-01-25', '0', '26'),
 ('2014-03-19', '+2', '28'),
 ('2014-05-11', '+2', '30'),
 ('2014-05-12', '+12', '42'),
 ('2014-06-12', '+10', '52'),
 ('2014-06-26', '+10', '62')]

As a **bonus task**, convert this list of tuples into a **pandas DataFrame**. It should have appropriate column names, and the second and third columns should be of type integer (rather than string/object).

In [28]:
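import pandas as pd
# build the DataFrame first, then cast only the two numeric columns to integer
# (passing dtype='int' to the constructor can fail on the date strings in newer pandas)
rep_df = pd.DataFrame(rep, columns=['date', 'reputation change', 'running total'])
rep_df = rep_df.astype({'reputation change': int, 'running total': int})
print(rep_df.dtypes)
rep_df.head(10)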

date                 object
reputation change     int32
running total         int32
dtype: object


         date  reputation change  running total
0  2012-08-30                  5              6
1  2012-12-11                 10             16
2  2013-03-20                 10             26
3  2013-12-05                  0             26
4  2014-01-25                  0             26
5  2014-03-19                  2             28
6  2014-05-11                  2             30
7  2014-05-12                 12             42
8  2014-06-12                 10             52
9  2014-06-26                 10             62
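
Outside pandas, the same type conversion can be sketched with the standard library alone. The tuples in `rep_sample` are taken from the expected output and the results shown earlier; `rep_sample` and `typed` are names introduced for this illustration, and `date.fromisoformat` assumes Python 3.7 or later.

```python
from datetime import date

# a few (date, change, total) string tuples copied from the results above
rep_sample = [('2012-08-30', '+5', '6'), ('2012-12-11', '+10', '16'),
              ('2013-12-05', '0', '26'), ('2015-10-14', '-1', '317')]

# int() accepts an optional leading sign; date.fromisoformat parses YYYY-MM-DD
typed = [(date.fromisoformat(d), int(change), int(total))
         for d, change, total in rep_sample]
print(typed[0])
```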
