## Parsing text

## Lecture objectives

1. Demonstrate how to parse unstructured text data

Let's start by loading in the data that we saved at the end of the previous lecture. We can read a pandas dataframe using the `read_pickle()` function. (If we had saved it as a .csv, we'd use `read_csv()`.)

In [2]:
import pandas as pd
smalldf = pd.read_pickle('scratch/Seattle_permits.pandas')
smalldf.head()

Unnamed: 0,permitnum,permitclass,permitclassmapped,permittypemapped,description,statuscurrent,originaladdress1,originalcity,originalstate,originalzip,...,housingunitsremoved,housingunitsadded,applieddate,issueddate,expiresdate,decisiondate,permittypedesc,contractorcompanyname,estprojectcost,newdescription
0,3001212-LU,Single Family/Duplex,Residential,Master Use Permit,PROJECT CANCELLED 12/8/2010 -- This short plat...,Canceled,6519 S BANGOR ST,SEATTLE,WA,98178,...,,,,,,,,,,PROJECT CANCELLED 12/8/2010 -- This short plat...
1,3001271-LU,Single Family/Duplex,Residential,Master Use Permit,Land Use Permit to adjust the boundary between...,Completed,4226 1ST AVE NW,SEATTLE,WA,98107,...,0.0,0.0,2005-12-16,2006-05-15,2007-11-15,2006-05-10,,,,Land Use Permit to adjust the boundary between...
2,3001310-LU,Single Family/Duplex,Residential,Master Use Permit,Land use application to adjust the boundary be...,Completed,941 23RD AVE S,SEATTLE,WA,98144,...,,,2007-02-14,2008-08-28,2011-08-14,2008-08-13,,,,Land use application to adjust the boundary be...
3,3001312-LU,,,Master Use Permit,Cancelled due to no activity for more than 9 y...,Canceled,3131 E MADISON ST,SEATTLE,WA,98112,...,,,,,,,,,,Cancelled due to no activity for more than 9 y...
4,3001440-LU,Commercial,Non-Residential,Master Use Permit,PROJECT CANCELLED 5/23/2011 -- Project On Hold...,Canceled,9030 13TH AVE NW,SEATTLE,WA,98117,...,,,2005-08-12,,,,,,,PROJECT CANCELLED 5/23/2011 -- Project On Hold...


We've already scraped the description for each project. 

Suppose we want to extract a particular piece of information. For example, how do we get the number of parking spaces? That depends on how consistently the city words its descriptions.

You'll need to design a set of rules that cover different possibilities. For example, the description might say "2 parking spaces" or "TWO PARKING SPACES" or "1 uncovered and 1 covered parking space." Looking at your data is key.
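One way to look at the data before writing any rules is to pull out every description that mentions parking and read the phrasing. The sketch below assumes a small, invented dataframe (the descriptions are hypothetical, not taken from the Seattle permits) and uses the pandas `str.contains()` method.

```python
import pandas as pd

# Hypothetical sample descriptions, invented for illustration only
sample = pd.DataFrame({
    'description': [
        'Construct duplex with 2 parking spaces',
        'ESTABLISH USE AS OFFICE. TWO PARKING SPACES PROVIDED.',
        'Land use application to adjust the boundary',
        None,
    ]
})

# case=False ignores capitalization
# na=False treats missing descriptions as non-matches
mentions_parking = sample['description'].str.contains('parking', case=False, na=False)

# print the matching descriptions in full so the phrasing is visible
for text in sample.loc[mentions_parking, 'description']:
    print(text)
```

Reading through the matches should suggest which phrases your rules need to cover.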

For starters, let's take the simplest case. We'll add a column to our dataframe that indicates whether there is "no parking" in the project description.

In [3]:
# import the numpy library, which underlies pandas
# we'll use its nan (null) value to indicate missing data
import numpy as np

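def noparking(description):
    # missing descriptions (NaN or None) have no .lower() method, so skip them
    if not isinstance(description, str):
        return np.nan
    # convert the description to lower case
    text = description.lower()
    # either phrase means that the project provides no parking
    if 'no parking' in text or 'zero parking' in text:
        return True
    # any other mention of parking suggests that spaces are provided
    elif 'parking' in text:
        return False
    else:
        # capture all other possibilities
        return np.nan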

# Now apply our function
smalldf['noparking'] = smalldf.description.apply(noparking)

In [4]:
# look at the output (just the noparking column)
smalldf.noparking

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9    False
Name: noparking, dtype: object

This is a brute-force method of parsing text. We list every phrase that might indicate that there are no parking spaces, and check whether any of them appears in the description.
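As the list of phrases grows, one possible refinement is to keep the phrases in a list and test them all with `any()`. This is a minimal sketch, not a replacement for the function above; the names `NO_PARKING_PHRASES` and `noparking_from_phrases` are made up for illustration.

```python
import numpy as np

# Phrases that signal no parking; extend this list after inspecting the data
NO_PARKING_PHRASES = ['no parking', 'zero parking']

def noparking_from_phrases(description, phrases=NO_PARKING_PHRASES):
    # missing descriptions cannot be searched
    if not isinstance(description, str):
        return np.nan
    text = description.lower()
    # True if any one of the phrases occurs in the text
    if any(phrase in text for phrase in phrases):
        return True
    # parking is mentioned, but not in a way that signals "none"
    if 'parking' in text:
        return False
    return np.nan

print(noparking_from_phrases('NO PARKING PROPOSED'))
print(noparking_from_phrases('2 parking spaces'))
```

Adding a new phrase then means editing the list rather than adding another `elif` branch.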

We'll see more sophisticated ways of analyzing text later in the course, but this type of approach is often the simplest and most robust.

<div class="alert alert-block alert-info">
<strong>Thought exercise:</strong> If you want to get the number of parking spaces for each project, what would be your next step? In principle, how might you do that?
</div>
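One possible direction, sketched under strong assumptions, is a regular expression that looks for a number (digits or a spelled-out word) shortly before the word "parking". The names `NUMBER_WORDS`, `PATTERN`, and `parking_count` are hypothetical, and the pattern has not been checked against the real permit descriptions.

```python
import re
import numpy as np

# Small lookup for spelled-out numbers; extend as the data requires
NUMBER_WORDS = {'zero': 0, 'no': 0, 'one': 1, 'two': 2,
                'three': 3, 'four': 4, 'five': 5}

# A number (digits or a known word), then at most one intervening word
# such as "covered" or "surface", then the word "parking"
PATTERN = re.compile(
    r'\b(\d+|' + '|'.join(NUMBER_WORDS) + r')\s+(?:[a-z-]+\s+)?parking\b'
)

def parking_count(description):
    # missing descriptions cannot be searched
    if not isinstance(description, str):
        return np.nan
    match = PATTERN.search(description.lower())
    if match is None:
        return np.nan
    token = match.group(1)
    # digits convert directly; words go through the lookup table
    return int(token) if token.isdigit() else NUMBER_WORDS[token]

print(parking_count('2 parking spaces'))
print(parking_count('TWO PARKING SPACES'))
```

This sketch returns only the first count it finds, so a description such as "1 uncovered and 1 covered parking space" would need extra rules to total the two numbers. It can also be fooled by unrelated numbers that happen to precede "parking", which is another reason to check the results against the data.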

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>The simplest way to parse text is to look for a particular string within a longer string.</li>
  <li>Converting to lower case (or upper case) reduces the number of possibilities that you'll have to search for.</li>
  <li>Python's <strong>in</strong> operator tests whether one string occurs inside another, which makes it the most useful tool here.</li>
</ul>
</div>