## Parsing text

## Lecture objectives

1. Demonstrate how to parse unstructured text data

Let's start by loading in the data that we saved at the end of the previous lecture. We can read a pandas dataframe using the `read_pickle()` function. (If we had saved it as a .csv, we'd use `read_csv()`.)

In [1]:
import pandas as pd
smalldf = pd.read_pickle('../scratch/Seattle_permits.pandas')
smalldf.head()

Unnamed: 0,permitnum,permitclass,permitclassmapped,permittypemapped,permittypedesc,description,estprojectcost,statuscurrent,originaladdress1,originalcity,...,longitude,location1,housingunitsremoved,housingunitsadded,applieddate,issueddate,expiresdate,decisiondate,contractorcompanyname,newdescription
0,3001064-EG,Commercial,Non-Residential,Early Design Guidance,Design Review,Early Design Guidance for: Land Use Applicatio...,16000000.0,Completed,2200 E MADISON ST,SEATTLE,...,-122.30341969,"{'latitude': '47.61885483', 'longitude': '-122...",,,,,,,,Early Design Guidance for: Land Use Applicatio...
1,3001064-LU,Commercial,Non-Residential,Master Use Permit,,"Land Use Application to allow one, 6-story bui...",16000000.0,Completed,2200 E MADISON ST,SEATTLE,...,-122.30341969,"{'latitude': '47.61885483', 'longitude': '-122...",0.0,103.0,2011-06-21,2012-04-06,2015-02-17,2012-02-02,,"Land Use Application to allow one, 6-story bui..."
2,3001095-LU,Single Family/Duplex,Residential,Master Use Permit,,Land Use Application to subdivide one parcel i...,,Completed,5414 21ST AVE SW,SEATTLE,...,-122.35930088,"{'latitude': '47.55326217', 'longitude': '-122...",,,2012-04-25,2012-09-05,2015-08-10,2012-07-26,,Land Use Application to subdivide one parcel i...
3,3001121-LU,,,Master Use Permit,,Unit Lot Subdivision,,Canceled,103 30TH AVE,SEATTLE,...,-122.29413641,"{'latitude': '47.60184595', 'longitude': '-122...",,,,,,,,Unit Lot Subdivision
4,3001139-LU,Multifamily,Residential,Master Use Permit,,Cancel per customer request 4/15/08 log #4507\...,,Canceled,3649 S MORGAN ST,SEATTLE,...,-122.28533406,"{'latitude': '47.54411404', 'longitude': '-122...",,,2008-03-14,,,,,Cancel per customer request 4/15/08 log #4507\...


We've already scraped the description for each project. 

Suppose we want to extract a particular piece of information. For example, how do we get the number of parking spaces? That depends on whether the city uses consistent terminology.

You'll need to design a set of rules that cover different possibilities. For example, the description might say "2 parking spaces" or "TWO PARKING SPACES" or "1 uncovered and 1 covered parking space." Looking at your data is key.
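One way to look at the data before writing any rules is to pull out the text around each mention of a keyword. The sketch below is only an illustration: the helper name `keyword_context` and its `width` parameter are hypothetical, not part of pandas or of the lecture code.

```python
def keyword_context(description, keyword='parking', width=30):
    """Return the lower-cased text surrounding each occurrence of keyword."""
    text = description.lower()
    snippets = []
    start = text.find(keyword)
    while start != -1:
        # keep up to `width` characters on either side of the keyword
        left = max(0, start - width)
        right = start + len(keyword) + width
        snippets.append(text[left:right])
        # continue searching after the current occurrence
        start = text.find(keyword, start + 1)
    return snippets

print(keyword_context('1 uncovered and 1 covered parking space.'))
```

Assuming the descriptions are strings, something like `smalldf.description.dropna().apply(keyword_context)` would let you scan the snippets and see which phrasings the city actually uses.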

For starters, let's take the simplest case. We'll add a column to our dataframe that indicates whether there is "no parking" in the project description.

In [2]:
# import the numpy library, which underlies pandas
# we'll use its nan (null) value to indicate missing data
import numpy as np

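def noparking(description):
    # a missing description is not a string, so treat it as unknown
    if not isinstance(description, str):
        return np.nan
    # convert the description to lower case
    text = description.lower()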
    if 'no parking' in text:
        return True
    elif 'zero parking' in text:
        return True
    elif 'parking' in text:
        return False
    else:
        # capture all other possibilities
        return np.nan

# Now apply our function
smalldf['noparking'] = smalldf.description.apply(noparking)

In [3]:
# look at the output (just the noparking column)
smalldf.noparking

0    False
1    False
2      NaN
3      NaN
4      NaN
5      NaN
6    False
7    False
8      NaN
9      NaN
Name: noparking, dtype: object

This is a brute-force method of parsing text: we list the phrases that might indicate that there are no parking spaces, and check whether any of them appears in the string.
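As the list of phrases grows, a chain of `elif` branches gets unwieldy. A minimal sketch of the same idea keeps the phrases in a list and tests them with `any()`. The function name `noparking_from_list` and the extra phrase `'without parking'` are assumptions made for illustration, and `math.nan` stands in for `np.nan` so that the sketch needs only the standard library.

```python
import math

# Hypothetical phrase list; extend it after inspecting your own data
NO_PARKING_PHRASES = ['no parking', 'zero parking', 'without parking']

def noparking_from_list(description, phrases=NO_PARKING_PHRASES):
    # a missing description is not a string, so treat it as unknown
    if not isinstance(description, str):
        return math.nan
    text = description.lower()
    # True if any of the listed phrases appears in the text
    if any(phrase in text for phrase in phrases):
        return True
    # parking is mentioned, but not with a "no parking" phrase
    if 'parking' in text:
        return False
    # parking is not mentioned at all
    return math.nan
```

It can be applied the same way as the original function, for example `smalldf.description.apply(noparking_from_list)`.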

We'll see more sophisticated ways of analyzing text later in the course, but this type of approach is often the simplest and most robust.

<div class="alert alert-block alert-info">
<strong>Thought exercise:</strong> If you want to get the number of parking spaces for each project, what would be your next step? In principle, how might you do that?
</div>

<div class="alert alert-block alert-success">
<strong>One possible answer:</strong> You can look for numerals in the text, or define a set of spelled-out numbers to search for. You can then extract those numbers to get the number of parking spaces.
</div>
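The idea in the answer above can be sketched with a regular expression. This is one possible approach under stated assumptions, not the lecture's own solution: the names `WORD_TO_NUMBER`, `PARKING_PATTERN` and `parking_spaces` are hypothetical, the word list stops at ten, and the pattern allows at most one word between the number and "parking".

```python
import re

# Hypothetical lookup for spelled-out numbers; extend it to match your data
WORD_TO_NUMBER = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
                  'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10}

# A number (digits or a listed word), at most one word in between, then "parking"
PARKING_PATTERN = re.compile(
    r'\b(\d+|' + '|'.join(WORD_TO_NUMBER) + r')\s+(?:\w+\s+)?parking\b'
)

def parking_spaces(description):
    # a missing description is not a string, so there is nothing to parse
    if not isinstance(description, str):
        return None
    # findall returns the captured number token from each match
    matches = PARKING_PATTERN.findall(description.lower())
    if not matches:
        return None
    # convert each token to an integer and add them up
    return sum(int(m) if m.isdigit() else WORD_TO_NUMBER[m] for m in matches)
```

Both "2 parking spaces" and "TWO PARKING SPACES" give 2 with this sketch. "1 uncovered and 1 covered parking space." gives 1 rather than 2, because only the second number sits next to "parking". Cases like that are why you need to check the rules against your data and add more as you find new phrasings.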

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>The simplest way to parse text is to look for a particular string within a longer string.</li>
  <li>Converting to lower case (or upper case) reduces the number of possibilities that you'll have to search for.</li>
  <li>The <strong>in</strong> operator is most useful here.</li>
</ul>
</div>