# Lecture 4.6 - Basics of cleaning messy text files 
## Part 2 - Grouping blocks of data and extracting information

In this lecture, we will go over a number of cases of messy data, and how to use Python to fix these problems.  This includes

1. Removing unwanted lines.
2. Parsing lines with regular expressions.
3. Working with data blocks spread across multiple lines.

## Reading in current progress

In [1]:
with open('911_Deaths_Grouped.csv') as f:
    content = f.read()
content[:500]

&quot;Gordon M. Aamoth, Jr., 32, Sandler O&#39;Neill + Partners, World Trade Center.\nEdelmiro Abad, 54, Brooklyn, N.Y., Fiduciary Trust Company International, World Trade Center.\nMarie Rose Abad, 49, Keefe, Bruyette&amp;Woods, Inc., World Trade Center.\nAndrew Anthony Abate, 37, Melville, N.Y., Cantor Fitzgerald, World Trade Center.\nVincent Paul Abate, 40, Brooklyn, N.Y., Cantor Fitzgerald, World Trade Center.\nLaurence Christopher Abel, 37, New York City, Cantor Fitzgerald, World Trade Center.\nAlona Abraham, 3&quot;

In [2]:
grouped_lines = content.split('\n')
grouped_lines

antor Fitzgerald, World Trade Center.&#39;,
 &#39;Anthony J. Fallone, Jr., 39, New York City, Cantor Fitzgerald, World Trade Center.&#39;,
 &#39;Dolores Brigitte Fanelli, 38, Farmingville, N.Y., Marsh&amp;McLennan Companies, Inc., World Trade Center.&#39;,
 &#39;Robert John Fangman, 33, Chelsea, Mass., Flight Crew, United 175, World Trade Center.&#39;,
 &#39;John Joseph Fanning, 54, West Hempstead, N.Y., New York City Fire Department, World Trade Center.&#39;,
 &#39;Kathleen Anne Faragher, 33, Risk Waters Group conference attendee from Janus Capital Group, World Trade Center.&#39;,
 &#39;Thomas James Farino, 37, Bohemia, N.Y., New York City Fire Department, World Trade Center.&#39;,
 &#39;Nancy C. Doloszycki Farley, 45, Jersey City, N.J., Reinsurance Solutions, World Trade Center.&#39;,
 &#39;Paige Marie Farley-Hackel, 46, Newton, Mass., Passenger, United 11, World Trade Center.&#39;,
 &#39;Elizabeth Ann Farmer, 62, Cantor Fitzgerald contractor, World Trade Center.&#39;,
 &#39;Douglas 

## Preprocessing 

Below I have transfered over the preprocessing functions and applied them to the data.

In [3]:
# Imports
from composable import pipeable
from composable.strict import map

In [5]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))

In [6]:
(grouped_lines
>> map(add_missing_period)
>> map(fix_world_trade)
)

antor Fitzgerald, World Trade Center.&#39;,
 &#39;Anthony J. Fallone, Jr., 39, New York City, Cantor Fitzgerald, World Trade Center.&#39;,
 &#39;Dolores Brigitte Fanelli, 38, Farmingville, N.Y., Marsh&amp;McLennan Companies, Inc., World Trade Center.&#39;,
 &#39;Robert John Fangman, 33, Chelsea, Mass., Flight Crew, United 175, World Trade Center.&#39;,
 &#39;John Joseph Fanning, 54, West Hempstead, N.Y., New York City Fire Department, World Trade Center.&#39;,
 &#39;Kathleen Anne Faragher, 33, Risk Waters Group conference attendee from Janus Capital Group, World Trade Center.&#39;,
 &#39;Thomas James Farino, 37, Bohemia, N.Y., New York City Fire Department, World Trade Center.&#39;,
 &#39;Nancy C. Doloszycki Farley, 45, Jersey City, N.J., Reinsurance Solutions, World Trade Center.&#39;,
 &#39;Paige Marie Farley-Hackel, 46, Newton, Mass., Passenger, United 11, World Trade Center.&#39;,
 &#39;Elizabeth Ann Farmer, 62, Cantor Fitzgerald contractor, World Trade Center.&#39;,
 &#39;Douglas 

In [7]:
# For convenience I will give these a name
prepped_lines = (grouped_lines 
                >> map(add_missing_period)
                >> map(fix_world_trade)
                )
prepped_lines

antor Fitzgerald, World Trade Center.&#39;,
 &#39;Anthony J. Fallone, Jr., 39, New York City, Cantor Fitzgerald, World Trade Center.&#39;,
 &#39;Dolores Brigitte Fanelli, 38, Farmingville, N.Y., Marsh&amp;McLennan Companies, Inc., World Trade Center.&#39;,
 &#39;Robert John Fangman, 33, Chelsea, Mass., Flight Crew, United 175, World Trade Center.&#39;,
 &#39;John Joseph Fanning, 54, West Hempstead, N.Y., New York City Fire Department, World Trade Center.&#39;,
 &#39;Kathleen Anne Faragher, 33, Risk Waters Group conference attendee from Janus Capital Group, World Trade Center.&#39;,
 &#39;Thomas James Farino, 37, Bohemia, N.Y., New York City Fire Department, World Trade Center.&#39;,
 &#39;Nancy C. Doloszycki Farley, 45, Jersey City, N.J., Reinsurance Solutions, World Trade Center.&#39;,
 &#39;Paige Marie Farley-Hackel, 46, Newton, Mass., Passenger, United 11, World Trade Center.&#39;,
 &#39;Elizabeth Ann Farmer, 62, Cantor Fitzgerald contractor, World Trade Center.&#39;,
 &#39;Douglas 

## Regular expression from lab 2

Below I have attempted to combine all of the regular expressions from lab 2

In [29]:
import re
line_parts = re.compile('^(.+), (\?\?|\d{1,3}),(.*?)( Passenger,| Flight Crew,)?( United \d{2,3},| American \d{2,3},)?( World Trade Center| Pentagon| Shanksville, Pa)(, died \d{1,2}/\d{1,2}/\d{1,2})?\.$')

In [31]:
prepped_lines[2402]

&#39;Jesus Sanchez, 45, Flight Crew, United 175, World Trade Center.&#39;

In [30]:
line_parts.search(prepped_lines[2402]).groups()

(&#39;Jesus Sanchez&#39;,
 &#39;45&#39;,
 &#39;&#39;,
 &#39; Flight Crew,&#39;,
 &#39; United 175,&#39;,
 &#39; World Trade Center&#39;,
 None)

#### Always check for non-matches

In [32]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

## Verbose regular expressions

**Pros:**
* Spread over multiple lines
* Allow comments

**Cons:**
* Ignore white space outside `()`
* Require escaping spaces `\ `

In [12]:
# Without Using VERBOSE 
regex_email = re.compile(r'^([a-z0-9_\.-]+)@([0-9a-z\.-]+)\.([a-z\.]{2, 6})$')

In [13]:
# Using VERBOSE 
regex_email = re.compile(r""" 
                        ^([a-z0-9_\.-]+)			 # local Part 
                        @							 # single @ sign 
                        ([0-9a-z\.-]+)			 	 # Domain name 
                        \.						 	 # single Dot . 
                        ([a-z]{2,6})$				 # Top level Domain 
                        """,re.VERBOSE)

## Another example.

This example, from the Python docs, shows how to space out an OR section across multiple lines.

In [14]:
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

## Cleaning up our regular expr

<h2> <font color="red"> Exercise 4.6.1 - Clean up the regular expression </font> </h2>

To clean up the regular expression, 

1. Replace all spaces with `\ ` or `\s` (I prefer the second)
2. Turn the string into a multi-line string.
3. Spread the parts over many lines
4. Add comments.

In [38]:
my_line_parts = re.compile("""
('^(.+),\s
(
    \?\?|\d{1,3}
),
(.*?)
(
    \sPassenger,
    |\sFlight\sCrew,
)?
(
    \sUnited\s\d{2,3},
    |\sAmerican\s\d{2,3},
)?
(
    \sWorld\sTrade\sCenter
    |\sPentagon
    |\sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}
)?
\.$
')
;
""", re.VERBOSE)

In [39]:
[(i, l) for i, l in enumerate(prepped_lines) if not my_line_parts.search(l)]

o, Jr., 41, Freeport, N.Y., New York City Police Department, World Trade Center.&#39;),
 (790,
  &#39;Ronald Carl Fazio, Sr., 57, Closter, N.J., Aon Corporation, World Trade Center.&#39;),
 (791,
  &#39;William M. Feehan, 71, Flushing, N.Y., New York City Fire Department, World Trade Center.&#39;),
 (792,
  &#39;Francis Jude Feely, 41, Marsh&amp;McLennan Companies, Inc., World Trade Center.&#39;),
 (793,
  &#39;Garth Erin Feeney, 25, New York City, Risk Waters Group conference attendee from DataSynapse, World Trade Center.&#39;),
 (794,
  &#39;Sean Bernard Fegan, 34, New York City, Fred Alger Management, Inc., World Trade Center.&#39;),
 (795,
  &#39;Lee S. Fehling, 28, Wantagh, N.Y., New York City Fire Department, World Trade Center.&#39;),
 (796,
  &#39;Peter Adam Feidelberg, 34, Hoboken, N.J., Aon Corporation, World Trade Center.&#39;),
 (797,
  &#39;Alan D. Feinberg, 48, Marlboro, N.J., New York City Fire Department, World Trade Center.&#39;),
 (798,
  &#39;Rosa Maria Feliciano, 30

> Describe the bug here

In [553]:
# Your fix here

## Progress so far

In [40]:
# Imports
from composable import pipeable
from composable.strict import map

In [41]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [42]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
# New
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))

In [43]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [44]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(get_line_parts)
                )
split_lines

reddo&#39;,
  &#39; 45&#39;,
  &#39; Manalapan, N.J., Cantor Fitzgerald contractor from International Brotherhood of Electrical Workers,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Darlene E. Flagg&#39;,
  &#39; ??&#39;,
  &#39; Passenger, American 77,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; Pentagon&#39;,
  &#39;&#39;),
 (&#39;Wilson F. Flagg&#39;,
  &#39; 62&#39;,
  &#39; Millwood, Va., Passenger, American 77,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; Pentagon&#39;,
  &#39;&#39;),
 (&#39;Christina Donovan Flannery&#39;,
  &#39; 26&#39;,
  &quot; Middle Village, N.Y., Sandler O&#39;Neill + Partners,&quot;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Eileen Flecha&#39;,
  &#39; 33&#39;,
  &#39; Queens, N.Y., Fiduciary Trust Company International,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Andre G. Fletcher&#39;,
  &#39; 37&#39;,
  &#39; New York City Fire 

## Pulling out and cleaning up names

Sometimes it is useful to pull the various columns apart and clean them up separately.  To illustrate, will will pull out and clean up the names. We can do this using the `get` function from `toolz.curried` which *gets* the value from a list at a given index.

In [45]:
from toolz.curried import get

In [46]:
(split_lines
>> map(get(0))
)

Lawrence Ira Beck&#39;,
 &#39;Manette Marie Beckles&#39;,
 &#39;Carl John Bedigian&#39;,
 &#39;Michael Ernest Beekman&#39;,
 &#39;Maria A. Behr&#39;,
 &#39;Max J. Beilke&#39;,
 &#39;Yelena Belilovsky&#39;,
 &#39;Nina Patrice Bell&#39;,
 &#39;Debbie S. Bellows&#39;,
 &#39;Stephen Elliot Belson&#39;,
 &#39;Paul M. Benedetti&#39;,
 &#39;Denise Lenore Benedetto&#39;,
 &#39;Bryan Craig Bennett&#39;,
 &#39;Eric L. Bennett&#39;,
 &#39;Oliver Bennett&#39;,
 &#39;Margaret L. Benson&#39;,
 &#39;Dominick J. Berardi&#39;,
 &#39;James Patrick Berger&#39;,
 &#39;Steven Howard Berger&#39;,
 &#39;John P. Bergin&#39;,
 &#39;Alvin Bergsohn&#39;,
 &#39;Daniel David Bergstein&#39;,
 &#39;Graham Andrew Berkeley&#39;,
 &#39;Michael J. Berkeley&#39;,
 &#39;Donna M. Bernaerts&#39;,
 &#39;David W. Bernard&#39;,
 &#39;William H. Bernstein&#39;,
 &#39;David M. Berray&#39;,
 &#39;David Shelby Berry&#39;,
 &#39;Joseph John Berry&#39;,
 &#39;William Reed Bethke&#39;,
 &#39;Yeneneh Betru&#39;,
 &#39;Timothy D. Bette

Now we can clean up these name by removing commas.

In [47]:
remove_commas = lambda s: s.replace(',', '')

(split_lines
>> map(get(0))
>> map(remove_commas)
)

i&#39;,
 &#39;Jane S. Beatty&#39;,
 &#39;Alan Anthony Beaven&#39;,
 &#39;Lawrence Ira Beck&#39;,
 &#39;Manette Marie Beckles&#39;,
 &#39;Carl John Bedigian&#39;,
 &#39;Michael Ernest Beekman&#39;,
 &#39;Maria A. Behr&#39;,
 &#39;Max J. Beilke&#39;,
 &#39;Yelena Belilovsky&#39;,
 &#39;Nina Patrice Bell&#39;,
 &#39;Debbie S. Bellows&#39;,
 &#39;Stephen Elliot Belson&#39;,
 &#39;Paul M. Benedetti&#39;,
 &#39;Denise Lenore Benedetto&#39;,
 &#39;Bryan Craig Bennett&#39;,
 &#39;Eric L. Bennett&#39;,
 &#39;Oliver Bennett&#39;,
 &#39;Margaret L. Benson&#39;,
 &#39;Dominick J. Berardi&#39;,
 &#39;James Patrick Berger&#39;,
 &#39;Steven Howard Berger&#39;,
 &#39;John P. Bergin&#39;,
 &#39;Alvin Bergsohn&#39;,
 &#39;Daniel David Bergstein&#39;,
 &#39;Graham Andrew Berkeley&#39;,
 &#39;Michael J. Berkeley&#39;,
 &#39;Donna M. Bernaerts&#39;,
 &#39;David W. Bernard&#39;,
 &#39;William H. Bernstein&#39;,
 &#39;David M. Berray&#39;,
 &#39;David Shelby Berry&#39;,
 &#39;Joseph John Berry&#39;,
 &#39;W

## Pulling out and cleaning up ages

NExt, we will pull out and clean the ages.  In this case, we should replace the missing values, currently `'??'`, to blanks.

In [48]:
remove_quest_mark = lambda s: s.replace('??', '')

(split_lines
>> map(get(1))
>> map(remove_quest_mark)
)

[&#39; 32&#39;,
 &#39; 54&#39;,
 &#39; 49&#39;,
 &#39; 37&#39;,
 &#39; 40&#39;,
 &#39; 37&#39;,
 &#39; 30&#39;,
 &#39; 55&#39;,
 &#39; 42&#39;,
 &#39; 38&#39;,
 &#39; 29&#39;,
 &#39; 37&#39;,
 &#39; 28&#39;,
 &#39; 61&#39;,
 &#39; 25&#39;,
 &#39; 51&#39;,
 &#39; 62&#39;,
 &#39; 28&#39;,
 &#39; 22&#39;,
 &#39; 36&#39;,
 &#39; 48&#39;,
 &#39; 32&#39;,
 &#39; 37&#39;,
 &#39; 36&#39;,
 &#39; 37&#39;,
 &#39; 35&#39;,
 &#39; 46&#39;,
 &#39; 30&#39;,
 &#39; 43&#39;,
 &#39; 74&#39;,
 &#39; 27&#39;,
 &#39; 47&#39;,
 &#39; 30&#39;,
 &#39; 33&#39;,
 &#39; 37&#39;,
 &#39; 37&#39;,
 &#39; 41&#39;,
 &#39; 39&#39;,
 &#39; 46&#39;,
 &#39; 25&#39;,
 &#39; 46&#39;,
 &#39; 57&#39;,
 &#39; 43&#39;,
 &#39; 51&#39;,
 &#39; 44&#39;,
 &#39; 39&#39;,
 &#39; 31&#39;,
 &#39; 30&#39;,
 &#39; 36&#39;,
 &#39; 48&#39;,
 &#39; 41&#39;,
 &#39; 31&#39;,
 &#39; 23&#39;,
 &#39; 38&#39;,
 &#39; 25&#39;,
 &#39; 60&#39;,
 &#39; 40&#39;,
 &#39; 60&#39;,
 &#39; 43&#39;,
 &#39; 41&#39;,
 &#39; 32&#39;,
 &#39; 29&#39;,
 &#39; 2

## Progress so far

In [49]:
# Imports
from composable import pipeable
from composable.strict import map

In [50]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [51]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
# New
remove_commas = lambda s: s.replace(',', '')
remove_quest_mark = lambda s: s.replace('??', '')

In [52]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [53]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(get_line_parts)
                )
split_lines

reddo&#39;,
  &#39; 45&#39;,
  &#39; Manalapan, N.J., Cantor Fitzgerald contractor from International Brotherhood of Electrical Workers,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Darlene E. Flagg&#39;,
  &#39; ??&#39;,
  &#39; Passenger, American 77,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; Pentagon&#39;,
  &#39;&#39;),
 (&#39;Wilson F. Flagg&#39;,
  &#39; 62&#39;,
  &#39; Millwood, Va., Passenger, American 77,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; Pentagon&#39;,
  &#39;&#39;),
 (&#39;Christina Donovan Flannery&#39;,
  &#39; 26&#39;,
  &quot; Middle Village, N.Y., Sandler O&#39;Neill + Partners,&quot;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Eileen Flecha&#39;,
  &#39; 33&#39;,
  &#39; Queens, N.Y., Fiduciary Trust Company International,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Andre G. Fletcher&#39;,
  &#39; 37&#39;,
  &#39; New York City Fire 

In [54]:
names =  (split_lines
        >> map(get(0))
        >> map(remove_commas)
        )
names

i&#39;,
 &#39;Jane S. Beatty&#39;,
 &#39;Alan Anthony Beaven&#39;,
 &#39;Lawrence Ira Beck&#39;,
 &#39;Manette Marie Beckles&#39;,
 &#39;Carl John Bedigian&#39;,
 &#39;Michael Ernest Beekman&#39;,
 &#39;Maria A. Behr&#39;,
 &#39;Max J. Beilke&#39;,
 &#39;Yelena Belilovsky&#39;,
 &#39;Nina Patrice Bell&#39;,
 &#39;Debbie S. Bellows&#39;,
 &#39;Stephen Elliot Belson&#39;,
 &#39;Paul M. Benedetti&#39;,
 &#39;Denise Lenore Benedetto&#39;,
 &#39;Bryan Craig Bennett&#39;,
 &#39;Eric L. Bennett&#39;,
 &#39;Oliver Bennett&#39;,
 &#39;Margaret L. Benson&#39;,
 &#39;Dominick J. Berardi&#39;,
 &#39;James Patrick Berger&#39;,
 &#39;Steven Howard Berger&#39;,
 &#39;John P. Bergin&#39;,
 &#39;Alvin Bergsohn&#39;,
 &#39;Daniel David Bergstein&#39;,
 &#39;Graham Andrew Berkeley&#39;,
 &#39;Michael J. Berkeley&#39;,
 &#39;Donna M. Bernaerts&#39;,
 &#39;David W. Bernard&#39;,
 &#39;William H. Bernstein&#39;,
 &#39;David M. Berray&#39;,
 &#39;David Shelby Berry&#39;,
 &#39;Joseph John Berry&#39;,
 &#39;W

In [55]:
ages =  (split_lines
        >> map(get(1))
        >> map(remove_quest_mark)
        )
ages

[&#39; 32&#39;,
 &#39; 54&#39;,
 &#39; 49&#39;,
 &#39; 37&#39;,
 &#39; 40&#39;,
 &#39; 37&#39;,
 &#39; 30&#39;,
 &#39; 55&#39;,
 &#39; 42&#39;,
 &#39; 38&#39;,
 &#39; 29&#39;,
 &#39; 37&#39;,
 &#39; 28&#39;,
 &#39; 61&#39;,
 &#39; 25&#39;,
 &#39; 51&#39;,
 &#39; 62&#39;,
 &#39; 28&#39;,
 &#39; 22&#39;,
 &#39; 36&#39;,
 &#39; 48&#39;,
 &#39; 32&#39;,
 &#39; 37&#39;,
 &#39; 36&#39;,
 &#39; 37&#39;,
 &#39; 35&#39;,
 &#39; 46&#39;,
 &#39; 30&#39;,
 &#39; 43&#39;,
 &#39; 74&#39;,
 &#39; 27&#39;,
 &#39; 47&#39;,
 &#39; 30&#39;,
 &#39; 33&#39;,
 &#39; 37&#39;,
 &#39; 37&#39;,
 &#39; 41&#39;,
 &#39; 39&#39;,
 &#39; 46&#39;,
 &#39; 25&#39;,
 &#39; 46&#39;,
 &#39; 57&#39;,
 &#39; 43&#39;,
 &#39; 51&#39;,
 &#39; 44&#39;,
 &#39; 39&#39;,
 &#39; 31&#39;,
 &#39; 30&#39;,
 &#39; 36&#39;,
 &#39; 48&#39;,
 &#39; 41&#39;,
 &#39; 31&#39;,
 &#39; 23&#39;,
 &#39; 38&#39;,
 &#39; 25&#39;,
 &#39; 60&#39;,
 &#39; 40&#39;,
 &#39; 60&#39;,
 &#39; 43&#39;,
 &#39; 41&#39;,
 &#39; 32&#39;,
 &#39; 29&#39;,
 &#39; 2

<h2> <font color="red"> Exercise 4.6.2 - Separating and cleaning other columns. </font> </h2>

To clean up the following columns 

1. Grab the date of death and replace the missing values with `9/11/2001`
2. Grab the locations (e.g. `World Trade Center`) and remove the comma from `'Shanksville, Pa.`
3. Grab the flights.
4. Grab the passenger status.

**Note:** Be sure to strip whitespace from all of them.

In [68]:
# Your fix here
missing_deaths = pipeable(lambda line: (', died 9/11/2001') if len(line) == 0 else line.strip())
death = (split_lines
        >> map(get(-1))
        >> map(missing_deaths)
        )
death

9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;,
 &#39;, died 9/11/2001&#39;

In [72]:
remove_commas = pipeable(lambda line: line.replace(',', '').strip() if ',' in line else line.strip())
locations = (split_lines
            >> map(get(-2))
            >> map(remove_commas)
            )
locations

ade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;Pentagon&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 &#39;World Trade Center&#39;,
 

In [75]:
remove_whitespace = pipeable(lambda line: line.strip())
flights = (split_lines
            >> map(get(4))
            >> map(remove_whitespace)
            )
flights

[&#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;United 175,&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;United 93,&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;United 11,&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;U

In [78]:
passenger_status = (split_lines 
                    >> map(get(3))
                    >> map(remove_whitespace)
                    )
passenger_status

[&#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;Passenger,&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;Passenger,&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;Passenger,&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;&#39;,
 &#39;Pa

## Grabbing the troubling bit

We have made significant progress, but still need to work on the third entry, which contains the hometown and employment information.  Again, we can do this using the `get` function from `toolz.curried` which *gets* the value from a list at a given index.

In [79]:
troubling_bit = (split_lines
                >> map(get(2))
                )
troubling_bit

nt,&#39;,
 &quot; Summit, N.J., Sandler O&#39;Neill + Partners,&quot;,
 &quot; Rockville Centre, N.Y., Sandler O&#39;Neill + Partners,&quot;,
 &#39; Rutherford, N.J., Aon Corporation,&#39;,
 &quot; New York City, Sandler O&#39;Neill + Partners,&quot;,
 &#39; Carr Futures, Inc.,&#39;,
 &#39; Jersey City, N.J., Cantor Fitzgerald,&#39;,
 &#39; Glen Rock, N.J., Chuo Mitsui Trust and Banking Company, Ltd.,&#39;,
 &#39; Woodstock, N.Y., Fiduciary Trust Company International,&#39;,
 &#39; Summit Security Services, Inc.,&#39;,
 &#39; Wilmot, N.H.,&#39;,
 &#39; Glen Gardner, N.J., Cantor Fitzgerald,&#39;,
 &#39; Port Washington, N.Y., Risk Waters Group,&#39;,
 &#39; Staten Island, N.Y., New York City Fire Department,&#39;,
 &#39; Scarsdale, N.Y., Cantor Fitzgerald,&#39;,
 &#39;&#39;,
 &#39; Manasquan, N.J., Cantor Fitzgerald,&#39;,
 &#39; Princeton Junction, N.J., Euro Brokers,&#39;,
 &#39; Staten Island, N.Y., New York City Fire Department,&#39;,
 &#39; Cantor Fitzgerald,&#39;,
 &#39; Norwalk,

## Progressively filtering out states

We will start by matching two of the most common states, NY and NJ.

In [80]:
state = re.compile(', (N\.Y\.|N\.J\.),?')
# Rows that match
[(l, state.search(l)) for l in troubling_bit]

y Fire Department,&#39;,
  &lt;re.Match object; span=(14, 21), match=&#39;, N.Y.,&#39;&gt;),
 (&#39; Middletown, N.J., Aon Corporation,&#39;,
  &lt;re.Match object; span=(11, 18), match=&#39;, N.J.,&#39;&gt;),
 (&#39; Jersey City, N.J., Cantor Fitzgerald,&#39;,
  &lt;re.Match object; span=(12, 19), match=&#39;, N.J.,&#39;&gt;),
 (&#39; Marsh&amp;McLennan Companies, Inc.,&#39;, None),
 (&#39; Brooklyn, N.Y., Aon Corporation,&#39;,
  &lt;re.Match object; span=(9, 16), match=&#39;, N.Y.,&#39;&gt;),
 (&#39; Cedar Grove, N.J., Windows on the World visitor,&#39;,
  &lt;re.Match object; span=(12, 19), match=&#39;, N.J.,&#39;&gt;),
 (&#39; Aon Corporation,&#39;, None),
 (&#39; Marsh&amp;McLennan Companies, Inc.,&#39;, None),
 (&#39; New York City Fire Department,&#39;, None),
 (&#39; South Huntington, N.Y., New York City Police Department,&#39;,
  &lt;re.Match object; span=(17, 24), match=&#39;, N.Y.,&#39;&gt;),
 (&#39; Cantor Fitzgerald,&#39;, None),
 (&#39; North Brunswick, N.J., Cantor Fitz

and inspecting all rows that don't match for additional states or problems

In [81]:
[(i, l) for i, l in enumerate(troubling_bit) if not state.search(l)]

9, &#39; Arlington, Va., United States Navy Civilian,&#39;),
 (1210, &#39; Fred Alger Management, Inc.,&#39;),
 (1216, &#39; New York, Keefe, Bruyette&amp;Woods, Inc.,&#39;),
 (1217, &#39; Stamford, Conn., Marsh&amp;McLennan Companies, Inc.,&#39;),
 (1222, &#39; New York City, Marsh&amp;McLennan, Advantage Security,&#39;),
 (1223, &#39;&#39;),
 (1224, &#39; Norwalk, Conn., Euro Brokers,&#39;),
 (1227, &#39; Springfield, Va., United States Army Civilian,&#39;),
 (1229, &#39; Burke, Va., United States Army,&#39;),
 (1230, &#39; Lake Ridge, Va., Defense Intelligence Agency,&#39;),
 (1231, &#39; Norwalk, Conn., Thomson Financial/Vestek,&#39;),
 (1234, &#39; New Jersey, Cantor Fitzgerald,&#39;),
 (1235, &#39; New York City Fire Department,&#39;),
 (1237, &#39; Cantor Fitzgerald,&#39;),
 (1241, &#39; Fiduciary Trust Company International,&#39;),
 (1243, &#39; Cantor Fitzgerald,&#39;),
 (1245, &quot; New York City, Sandler O&#39;Neill + Partners,&quot;),
 (1246, &quot; New York City, Sandler 

## Fixing a common problem.

Notice that many rows simply contain ` New York City,` without the state.  Let's fix this problem in our preprocessing step.

In [82]:
grouped_lines[41]

&#39;David D. Alger, 57, New York City, Fred Alger Management, Inc., World Trade Center.&#39;

In [83]:
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,'))
grouped_lines[41] >> fix_nyc

&#39;David D. Alger, 57, New York City, N.Y., Fred Alger Management, Inc., World Trade Center.&#39;

## Progress so far

In [84]:
# Imports
from composable import pipeable
from composable.strict import map

In [85]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [86]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '')
# New
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,'))

In [87]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [88]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                )
split_lines

d of Electrical Workers,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Darlene E. Flagg&#39;,
  &#39; ??&#39;,
  &#39; Passenger, American 77,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; Pentagon&#39;,
  &#39;&#39;),
 (&#39;Wilson F. Flagg&#39;,
  &#39; 62&#39;,
  &#39; Millwood, Va., Passenger, American 77,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; Pentagon&#39;,
  &#39;&#39;),
 (&#39;Christina Donovan Flannery&#39;,
  &#39; 26&#39;,
  &quot; Middle Village, N.Y., Sandler O&#39;Neill + Partners,&quot;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Eileen Flecha&#39;,
  &#39; 33&#39;,
  &#39; Queens, N.Y., Fiduciary Trust Company International,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Andre G. Fletcher&#39;,
  &#39; 37&#39;,
  &#39; New York City Fire Department,&#39;,
  &#39;&#39;,
  &#39;&#39;,
  &#39; World Trade Center&#39;,
  &#39;&#39;),
 (&#39;Carl M. Fli

In [89]:
names =  (split_lines
        >> map(get(0))
        >> map(remove_commas)
        )
names

i&#39;,
 &#39;Jane S. Beatty&#39;,
 &#39;Alan Anthony Beaven&#39;,
 &#39;Lawrence Ira Beck&#39;,
 &#39;Manette Marie Beckles&#39;,
 &#39;Carl John Bedigian&#39;,
 &#39;Michael Ernest Beekman&#39;,
 &#39;Maria A. Behr&#39;,
 &#39;Max J. Beilke&#39;,
 &#39;Yelena Belilovsky&#39;,
 &#39;Nina Patrice Bell&#39;,
 &#39;Debbie S. Bellows&#39;,
 &#39;Stephen Elliot Belson&#39;,
 &#39;Paul M. Benedetti&#39;,
 &#39;Denise Lenore Benedetto&#39;,
 &#39;Bryan Craig Bennett&#39;,
 &#39;Eric L. Bennett&#39;,
 &#39;Oliver Bennett&#39;,
 &#39;Margaret L. Benson&#39;,
 &#39;Dominick J. Berardi&#39;,
 &#39;James Patrick Berger&#39;,
 &#39;Steven Howard Berger&#39;,
 &#39;John P. Bergin&#39;,
 &#39;Alvin Bergsohn&#39;,
 &#39;Daniel David Bergstein&#39;,
 &#39;Graham Andrew Berkeley&#39;,
 &#39;Michael J. Berkeley&#39;,
 &#39;Donna M. Bernaerts&#39;,
 &#39;David W. Bernard&#39;,
 &#39;William H. Bernstein&#39;,
 &#39;David M. Berray&#39;,
 &#39;David Shelby Berry&#39;,
 &#39;Joseph John Berry&#39;,
 &#39;W

In [90]:
troubling_bit = (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                >> map(get(2))
                )
troubling_bit

Sandler O&#39;Neill + Partners,&quot;,
 &#39; Carr Futures, Inc.,&#39;,
 &#39; Jersey City, N.J., Cantor Fitzgerald,&#39;,
 &#39; Glen Rock, N.J., Chuo Mitsui Trust and Banking Company, Ltd.,&#39;,
 &#39; Woodstock, N.Y., Fiduciary Trust Company International,&#39;,
 &#39; Summit Security Services, Inc.,&#39;,
 &#39; Wilmot, N.H.,&#39;,
 &#39; Glen Gardner, N.J., Cantor Fitzgerald,&#39;,
 &#39; Port Washington, N.Y., Risk Waters Group,&#39;,
 &#39; Staten Island, N.Y., New York City Fire Department,&#39;,
 &#39; Scarsdale, N.Y., Cantor Fitzgerald,&#39;,
 &#39;&#39;,
 &#39; Manasquan, N.J., Cantor Fitzgerald,&#39;,
 &#39; Princeton Junction, N.J., Euro Brokers,&#39;,
 &#39; Staten Island, N.Y., New York City Fire Department,&#39;,
 &#39; Cantor Fitzgerald,&#39;,
 &#39; Norwalk, Conn., Aon Corporation visitor,&#39;,
 &#39; Boston, Mass. and Paris, France,&#39;,
 &#39; Staten Island, N.Y., Cantor Fitzgerald,&#39;,
 &#39; Santa Monica, Calif.,&#39;,
 &#39; Medford, N.Y., New York City Poli

## Adding more states

Next, we will start adding start to our pattern, and again looking for additional states/problems.  For example, let's add the `Mass.` and `D.C.` patterns.

In [100]:
state = re.compile(', (N\.Y\.|N\.J\.|Mass\.|D\.C\.),?')
[(l, state.search(l)) for l in troubling_bit if state.search(l)]

&lt;re.Match object; span=(8, 15), match=&#39;, N.J.,&#39;&gt;),
 (&#39; Fairlawn, N.J., Risk Waters Group conference attendee from Compaq Computer Corporation,&#39;,
  &lt;re.Match object; span=(9, 16), match=&#39;, N.J.,&#39;&gt;),
 (&#39; New York City, N.Y., Marsh&amp;McLennan, Advantage Security,&#39;,
  &lt;re.Match object; span=(14, 21), match=&#39;, N.Y.,&#39;&gt;),
 (&quot; Middletown, N.J., Sandler O&#39;Neill + Partners,&quot;,
  &lt;re.Match object; span=(11, 18), match=&#39;, N.J.,&#39;&gt;),
 (&#39; South Hempstead, N.Y., New York City Fire Department,&#39;,
  &lt;re.Match object; span=(16, 23), match=&#39;, N.Y.,&#39;&gt;),
 (&#39; Roslyn, N.Y., Carr Futures, Inc.,&#39;,
  &lt;re.Match object; span=(7, 14), match=&#39;, N.Y.,&#39;&gt;),
 (&#39; Belle Harbor, N.Y., New York City Fire Department,&#39;,
  &lt;re.Match object; span=(13, 20), match=&#39;, N.Y.,&#39;&gt;),
 (&#39; Hoboken, N.J., Marsh&amp;McLennan Companies, Inc.,&#39;,
  &lt;re.Match object; span=(8, 15), mat

In [97]:
[(i, l) for i, l in enumerate(troubling_bit) if not state.search(l)]

y Police Department,&#39;),
 (1519, &#39; Baseline Financial Services,&#39;),
 (1521, &#39;&#39;),
 (1522, &#39; Fairfield, Conn., Keefe, Bruyette&amp;Woods, Inc.,&#39;),
 (1523, &#39; Culpeper, Va., Flight Crew, American 77,&#39;),
 (1524, &#39; Culpeper, Va., Flight Crew, American 77,&#39;),
 (1525, &#39; Port Authority of New York and New Jersey,&#39;),
 (1531, &#39; Forestville, Md., United States Army Civilian,&#39;),
 (1532, &#39; Cantor Fitzgerald,&#39;),
 (1536, &#39; Chicago, Ill., Aon Corporation contractor from Keane Inc.,&#39;),
 (1537, &#39; Frank W. Lin&amp;Co.,&#39;),
 (1540, &#39; New York City Fire Department,&#39;),
 (1546, &#39; Pitney Bowes Inc.,&#39;),
 (1548, &#39; New Jersey, Washington Group International,&#39;),
 (1555, &#39; Empire BlueCross BlueShield,&#39;),
 (1557, &#39; United States Army,&#39;),
 (1561, &#39; Fiduciary Trust Company International,&#39;),
 (1562, &#39; Aramark Corporation,&#39;),
 (1563, &#39;&#39;),
 (1565,
  &#39; Langhorne, Pa., Marsh&a

<h2> <font color="red"> Exercise 4.6.2 - Continue the process. </font> </h2>

Now it is your turn.  You should

1. Keep adding states to the pattern.
2. Add preprocessing steps to fix any issues.

In [117]:
# Your code here
fix_ny = pipeable(lambda line: line.replace('New York,', 'N.Y.').replace('New Jersey,', 'N.J.'))
fixed = (troubling_bit
        >> map(fix_ny)
        )

my_state = re.compile("""
', 
(
    N\.Y\.
    | N\.J\.
    | Mass\.
    | D\.C\.
    | Calif\.
    | N\.H\.
    | Conn\.
    | Md\.
    | Va\.
    | Mo\.
    | Ky\.
    | Pa\.
    | Ill\.
),?
;
'
""", re.VERBOSE)

In [118]:
[(i, l) for i, l in enumerate(fixed) if not my_state.search(l)]

ral Park, N.Y., Marsh&amp;McLennan Companies, Inc.,&#39;),
 (601, &#39; Washington Group International,&#39;),
 (602, &#39; Westbury, N.Y., New York City Fire Department,&#39;),
 (603, &#39;&#39;),
 (604, &#39; Babylon, N.Y., Marsh&amp;McLennan Companies, Inc.,&#39;),
 (605, &#39; Upper Marlboro, Md., Passenger, American 77,&#39;),
 (606, &#39; Farmingdale, N.Y., Cantor Fitzgerald,&#39;),
 (607, &#39; Manalapan, N.J., Cantor Fitzgerald,&#39;),
 (608, &#39; Ridgewood, N.Y., Cantor Fitzgerald,&#39;),
 (609, &#39; Alexandria, Va., United States Navy,&#39;),
 (610, &#39; Mohegan Lake, N.Y., ABM Industries Inc.,&#39;),
 (611, &#39; Staten Island, N.Y., Cantor Fitzgerald,&#39;),
 (612, &#39; Framingham, Mass.,&#39;),
 (613, &quot; Fresh Meadows, N.Y., Sandler O&#39;Neill + Partners,&quot;),
 (614, &#39; Brooklyn, N.Y., Cantor Fitzgerald,&#39;),
 (615, &#39; Bronx, N.Y., New York City Fire Department,&#39;),
 (616, &#39; Allendale, N.J., Keefe, Bruyette&amp;Woods, Inc.,&#39;),
 (617, &#39; St

<h2> <font color="red"> Exercise 4.6.3 - Make your solution verbose </font> </h2>

Now make your solution to the last problem verbose.  Also reorder the cases so that similar cases are close and add comments.  Finally, change the regular expression to capture the parts before and after the state.

In [528]:
# Your code here

## Splitting the troubling bit

Now that we have a way to identify rows that have home addresses (through matching the state), we will split up this data.  We will do this by considering three cases.

1. Blank entry become three blanks (for town, state, employer).
2. Lines that match the states regex will get split by this pattern.
3. The remaining lines hold only the employer and become `'','',entry`

In [534]:
def split_troubling_bit(entry):
    if len(entry) == 0:
        return ('', '', '')
    elif state.search(entry):
        return state.search(entry).groups(default='')
    else:
        return ('', '', entry)

In [535]:
( troubling_bit
 >> map(split_troubling_bit)
)

[('', '', " Sandler O'Neill + Partners,"),
 (' Brooklyn', 'N.Y.', ' Fiduciary Trust Company International,'),
 ('', '', ' Keefe, Bruyette&Woods, Inc.,'),
 (' Melville', 'N.Y.', ' Cantor Fitzgerald,'),
 (' Brooklyn', 'N.Y.', ' Cantor Fitzgerald,'),
 (' New York City', 'N.Y.', ' Cantor Fitzgerald,'),
 ('', '', ' Ashdod, Israel,'),
 (' Westchester County', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 ('', '', ' Marsh&McLennan Companies, Inc.,'),
 ('', '', ' Aon Corporation,'),
 (' Glen Rock', 'N.J.', ' Cantor Fitzgerald,'),
 ('', '', ''),
 ('', '', ' Cantor Fitzgerald,'),
 ('', '', ' Fuji Bank, Ltd. security,'),
 ('', '', ' Cantor Fitzgerald,'),
 (' New York City', 'N.Y.', ' Windows on the World,'),
 (' Bronx', 'N.Y.', ' New York Metropolitan Transportation Council,'),
 (' New Hyde Park', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 (' New York City', 'N.Y.', ' Fred Alger Management, Inc.,'),
 (' Bronx', 'N.Y.', ' Windows on the World,'),
 ('', '', ' Cantor Fitzgerald,'),
 (' Manalapan'

## Progress so far

In [423]:
# Imports
from composable import pipeable
from composable.strict import map

In [480]:
# Reg Ex for a line
line_parts = re.compile(r'''^(.+),
(
      \s\?\?                          # ??
    | \s\d{1,3}                       # or age
),
(.*?)                                 # Includes hometown and 
(
        \sPassenger,                  # Optional flight status
    |   \sFlightsCrew,
)?
(
      \sUnited\s\d{2,3},              # Optional flight
    | \sAmericans\d{2,3},
)?
(
       \sWorld\sTrade\sCenter         # Location
    |  \sPentagon
    |  \sShanksville,\sPa
)
(
    ,\sdied\s\d{1,2}/\d{1,2}/\d{1,2}  # Optional date of death
)?
\.$''', re.VERBOSE)

In [500]:
# Helper functions
add_missing_period = pipeable(lambda line: line if line.endswith('.') else line + '.' )
fix_world_trade = pipeable(lambda line: line.replace('WorldTrade', 'World Trade'))
get_line_parts = pipeable(lambda line: line_parts.search(line).groups(default=''))
remove_commas = lambda s: s.replace(',', '')
# New
fix_nyc = pipeable(lambda line: line.replace(', New York City,', ', New York City, N.Y.,'))

In [501]:
[(i, l) for i, l in enumerate(prepped_lines) if not line_parts.search(l)]

[]

In [502]:
split_lines =  (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                )
split_lines

[('Gordon M. Aamoth, Jr.',
  ' 32',
  " Sandler O'Neill + Partners,",
  '',
  '',
  ' World Trade Center',
  ''),
 ('Edelmiro Abad',
  ' 54',
  ' Brooklyn, N.Y., Fiduciary Trust Company International,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Marie Rose Abad',
  ' 49',
  ' Keefe, Bruyette&Woods, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Andrew Anthony Abate',
  ' 37',
  ' Melville, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Vincent Paul Abate',
  ' 40',
  ' Brooklyn, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Laurence Christopher Abel',
  ' 37',
  ' New York City, N.Y., Cantor Fitzgerald,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Alona Abraham',
  ' 30',
  ' Ashdod, Israel,',
  ' Passenger,',
  ' United 175,',
  ' World Trade Center',
  ''),
 ('William F. Abrahamson',
  ' 55',
  ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
  '',
  '',
  ' World Trade Center',
  ''),
 ('Richard Anth

In [506]:
names =  (split_lines
        >> map(get(0))
        >> map(remove_commas)
        )
names

['Gordon M. Aamoth Jr.',
 'Edelmiro Abad',
 'Marie Rose Abad',
 'Andrew Anthony Abate',
 'Vincent Paul Abate',
 'Laurence Christopher Abel',
 'Alona Abraham',
 'William F. Abrahamson',
 'Richard Anthony Aceto',
 'Heinrich Bernhard Ackermann',
 'Paul Acquaviva',
 'Christian Adams',
 'Donald LaRoy Adams',
 'Patrick Adams',
 'Shannon Lewis Adams',
 'Stephen George Adams',
 'Ignatius Udo Adanga',
 'Christy A. Addamo',
 'Terence Edward Adderley Jr.',
 'Sophia B. Addo',
 'Lee Adler',
 'Daniel Thomas Afflitto',
 'Emmanuel Akwasi Afuakwah',
 'Alok Agarwal',
 'Mukul Kumar Agarwala',
 'Joseph Agnello',
 'David Scott Agnes',
 'Joao Alberto da Fonseca Aguiar Jr.',
 'Brian G. Ahearn',
 'Jeremiah Joseph Ahern',
 'Joanne Marie Ahladiotis',
 'Shabbir Ahmed',
 'Terrance Andre Aiken',
 'Godwin O. Ajala',
 'Trudi M. Alagero',
 'Andrew Alameno',
 'Margaret Ann Alario',
 'Gary M. Albero',
 'Jon Leslie Albert',
 'Peter Craig Alderman',
 'Jacquelyn Delaine Aldridge-Frederick',
 'David D. Alger',
 'Ernest Ali

In [507]:
troubling_bit = (grouped_lines
                >> map(add_missing_period)
                >> map(fix_world_trade)
                >> map(fix_nyc)
                >> map(get_line_parts)
                >> map(get(2))
                )
troubling_bit

[" Sandler O'Neill + Partners,",
 ' Brooklyn, N.Y., Fiduciary Trust Company International,',
 ' Keefe, Bruyette&Woods, Inc.,',
 ' Melville, N.Y., Cantor Fitzgerald,',
 ' Brooklyn, N.Y., Cantor Fitzgerald,',
 ' New York City, N.Y., Cantor Fitzgerald,',
 ' Ashdod, Israel,',
 ' Westchester County, N.Y., Marsh&McLennan Companies, Inc.,',
 ' Marsh&McLennan Companies, Inc.,',
 ' Aon Corporation,',
 ' Glen Rock, N.J., Cantor Fitzgerald,',
 '',
 ' Cantor Fitzgerald,',
 ' Fuji Bank, Ltd. security,',
 ' Cantor Fitzgerald,',
 ' New York City, N.Y., Windows on the World,',
 ' Bronx, N.Y., New York Metropolitan Transportation Council,',
 ' New Hyde Park, N.Y., Marsh&McLennan Companies, Inc.,',
 ' New York City, N.Y., Fred Alger Management, Inc.,',
 ' Bronx, N.Y., Windows on the World,',
 ' Cantor Fitzgerald,',
 ' Manalapan, N.J., Cantor Fitzgerald,',
 ' Windows on the World,',
 ' Cantor Fitzgerald,',
 ' Fiduciary Trust Company International,',
 ' Belle Harbor, N.Y., New York City Fire Department,',

In [537]:
state = re.compile('''
^(.*?)
,?\s                    # Optional comman
(
       N\.Y\.           
    |  N\.J\.
    |  D\.C\.
    |  N\.H\.
    |  N\.M\.
    |  N\.C\.
    |  R.I.
    |  Md\.
    |  Pa\.
    |  Va\.
    |  Ga\.
    |  La\.
    |  Mass\.
    |  Calif\.
    |  Ariz\.
    |  Fla\.
    |  Ill\.
    |  Conn\.
    |  Hawaii
    |  Iowa
    |  Maine
    |  New\sHampshire
    |  New\sJersey
    |  New\sYork
    |  Ohio
    |  Pennsylvania
    |  Texas
    |  Utah
    |  Virginia
    |  Japan
    |  India
    |  Germany
    |  Manitoba,\sCanada
    |  New\sSouth\sWales,\sAustralia
    |  England,\sUnited\sKingdom
)
,
(.*?)$
''', re.VERBOSE)

In [538]:
( troubling_bit
 >> map(split_troubling_bit)
)

[('', '', " Sandler O'Neill + Partners,"),
 (' Brooklyn', 'N.Y.', ' Fiduciary Trust Company International,'),
 ('', '', ' Keefe, Bruyette&Woods, Inc.,'),
 (' Melville', 'N.Y.', ' Cantor Fitzgerald,'),
 (' Brooklyn', 'N.Y.', ' Cantor Fitzgerald,'),
 (' New York City', 'N.Y.', ' Cantor Fitzgerald,'),
 ('', '', ' Ashdod, Israel,'),
 (' Westchester County', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 ('', '', ' Marsh&McLennan Companies, Inc.,'),
 ('', '', ' Aon Corporation,'),
 (' Glen Rock', 'N.J.', ' Cantor Fitzgerald,'),
 ('', '', ''),
 ('', '', ' Cantor Fitzgerald,'),
 ('', '', ' Fuji Bank, Ltd. security,'),
 ('', '', ' Cantor Fitzgerald,'),
 (' New York City', 'N.Y.', ' Windows on the World,'),
 (' Bronx', 'N.Y.', ' New York Metropolitan Transportation Council,'),
 (' New Hyde Park', 'N.Y.', ' Marsh&McLennan Companies, Inc.,'),
 (' New York City', 'N.Y.', ' Fred Alger Management, Inc.,'),
 (' Bronx', 'N.Y.', ' Windows on the World,'),
 ('', '', ' Cantor Fitzgerald,'),
 (' Manalapan'

<h2> <font color="red"> Exercise 4.5.4 </font> </h2>

Clean up each part of the troubling bits, then comma join this section into 1 string.

**Hint:** Be sure to remove any problematic commas.

In [539]:
# Your code here

## Combining the parts back together.

We can combine the parts back together using the `zip` function.

In [589]:
from composable.strict import zipOnto
from composable.list import to_list
(zip(names, ages, fixed_troubling_bits)
 >> to_list
 >> map(comma_join)
)

["Gordon M. Aamoth Jr., 32,,, Sandler O'Neill + Partners",
 'Edelmiro Abad, 54, Brooklyn,N.Y., Fiduciary Trust Company International',
 'Marie Rose Abad, 49,,, Keefe Bruyette&Woods Inc.',
 'Andrew Anthony Abate, 37, Melville,N.Y., Cantor Fitzgerald',
 'Vincent Paul Abate, 40, Brooklyn,N.Y., Cantor Fitzgerald',
 'Laurence Christopher Abel, 37, New York City,N.Y., Cantor Fitzgerald',
 'Alona Abraham, 30,,, Ashdod Israel',
 'William F. Abrahamson, 55, Westchester County,N.Y., Marsh&McLennan Companies Inc.',
 'Richard Anthony Aceto, 42,,, Marsh&McLennan Companies Inc.',
 'Heinrich Bernhard Ackermann, 38,,, Aon Corporation',
 'Paul Acquaviva, 29, Glen Rock,N.J., Cantor Fitzgerald',
 'Christian Adams, 37,,,',
 'Donald LaRoy Adams, 28,,, Cantor Fitzgerald',
 'Patrick Adams, 61,,, Fuji Bank Ltd. security',
 'Shannon Lewis Adams, 25,,, Cantor Fitzgerald',
 'Stephen George Adams, 51, New York City,N.Y., Windows on the World',
 'Ignatius Udo Adanga, 62, Bronx,N.Y., New York Metropolitan Transport

<h2> <font color="red"> Exercise 4.5.4 </font> </h2>

Use `zip` to combine all part of the data and write the result out to a file called `911_Deaths_Fixed.csv` 

In [539]:
# Your code here