In [2]:
# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')

# This is necessary to show lots of columns in pandas 0.12. 
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000) 
pd.set_option('display.max_columns', 60)

plt.rcParams['figure.figsize'] = (15, 5)

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the 311 service requests from [NYC Open Data](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). 

In [3]:
complaints = pd.read_csv('../../data/service-requests.zip', compression='zip')

Depending on your pandas version, you might see a warning like "DtypeWarning: Columns (8) have mixed types". This means pandas could not settle on a single type for that column while reading our data. In this case it almost certainly means that some of the column's entries are strings and some are integers.

For now we're going to ignore it and hope we don't run into a problem, but in the long run we'd need to investigate this warning.
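
When the time comes to investigate, one option is to tell `read_csv` which type a suspect column should have. The sketch below uses a tiny made-up CSV, not the real 311 file, in which a zip-code column mixes plain numbers with a ZIP+4 entry, and reads it as text through the `dtype` argument. Which column actually triggers the warning in the real data is not established here; using `Incident Zip` is only an assumption for illustration.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file: the zip column mixes
# plain numbers with a ZIP+4 entry that cannot be parsed as a number.
raw = io.StringIO(
    "Unique Key,Incident Zip\n"
    "1,10037\n"
    "2,11377-2345\n"
    "3,11377\n"
)

# Reading the column as str keeps every entry as text, so pandas does not
# have to guess a type for it.
sample = pd.read_csv(raw, dtype={"Incident Zip": str})
print(sample["Incident Zip"].tolist())
```

Reading a column as text is a conservative choice: nothing is lost, and the values can be cleaned and converted later once we know what is in them.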

### What's even in it? (the summary)

When you print a large dataframe, pandas only shows a truncated view of it. To look at just the first 5 rows, we can call `.head()`.

In [4]:
complaints.head()

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Description,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,School Name,School Number,School Region,School Code,School Phone Number,School Address,School City,School State,School Zip,School Not Found,School or Citywide Complaint,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
0,33526864,06/06/2016 05:01:00 PM,06/08/2016 03:41:00 PM,DOT,Department of Transportation,Street Light Condition,Street Light Out,,,,,,,GRAND CONCOURSE,206 ST E,INTERSECTION,,,,Closed,,Service Request status for this request is ava...,06/08/2016 03:41:00 PM,Unspecified BRONX,BRONX,,,Unspecified,BRONX,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,,,
1,33527101,06/06/2016 01:18:00 PM,06/07/2016 01:18:00 PM,DEP,Department of Environmental Protection,Lead,Lead Kit Request (Residential) (L10),,10037.0,1960 PARK AVENUE,PARK AVENUE,HARLEM RIVER DRIVE SB EN PRK,HARLEM RIVER DRIVE EXIT 20,,,ADDRESS,NEW YORK,,,Closed,,The Department of Environmental Protection inv...,06/07/2016 01:18:00 PM,11 MANHATTAN,MANHATTAN,1002035.0,234251.0,Unspecified,MANHATTAN,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,40.809623,-73.935754,"(40.80962280597619, -73.93575360303117)"
2,33527148,06/07/2016 12:29:18 AM,06/07/2016 01:46:13 AM,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10030.0,101 WEST 141 STREET,WEST 141 STREET,LENOX AVENUE,7 AVENUE,,,ADDRESS,NEW YORK,,Precinct,Closed,06/07/2016 08:29:18 AM,The Police Department responded to the complai...,06/07/2016 01:46:13 AM,10 MANHATTAN,MANHATTAN,1001328.0,237302.0,Unspecified,MANHATTAN,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,N,,,,,,,,,,,,40.817998,-73.9383,"(40.817998342152926, -73.93829980045645)"
3,33527176,06/06/2016 10:12:00 PM,,DEP,Department of Environmental Protection,Air Quality,"Air: Dust, Construction/Demolition (AE4)",,,,,,,26 ST,ROBERT F KENNEDY BRIDGE,INTERSECTION,,,,Open,,,,Unspecified QUEENS,QUEENS,,,Unspecified,QUEENS,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,,,
4,33527223,06/07/2016 12:39:55 AM,06/08/2016 12:45:08 AM,HPD,Department of Housing Preservation and Develop...,HPD Literature Request,The ABCs of Housing,,,,,,,,,,,,,Closed,06/08/2016 12:39:55 AM,The literature will be emailed within 24 hours...,06/07/2016 12:44:24 AM,0 Unspecified,Unspecified,,,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,N,,,,,,,,,,,,,,


## Selecting columns and rows

To select a column, we index with the name of the column, like this:

In [None]:
complaints['Complaint Type']

To get the first 5 rows of a dataframe, we can use a slice: `df[:5]`.

This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.

In [5]:
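complaints[:5]       # positional slice: rows 0-4
complaints.iloc[:5]  # explicit positional indexing: rows 0-4
# complaints.ix[:5] no longer works: .ix was removed in pandas 1.0
# .loc slices by label and includes the end label, so this returns rows 0-5
complaints.loc[:5]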

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,Street Name,Cross Street 1,Cross Street 2,Intersection Street 1,Intersection Street 2,Address Type,City,Landmark,Facility Type,Status,Due Date,Resolution Description,Resolution Action Updated Date,Community Board,Borough,X Coordinate (State Plane),Y Coordinate (State Plane),Park Facility Name,Park Borough,School Name,School Number,School Region,School Code,School Phone Number,School Address,School City,School State,School Zip,School Not Found,School or Citywide Complaint,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
0,33526864,06/06/2016 05:01:00 PM,06/08/2016 03:41:00 PM,DOT,Department of Transportation,Street Light Condition,Street Light Out,,,,,,,GRAND CONCOURSE,206 ST E,INTERSECTION,,,,Closed,,Service Request status for this request is ava...,06/08/2016 03:41:00 PM,Unspecified BRONX,BRONX,,,Unspecified,BRONX,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,,,
1,33527101,06/06/2016 01:18:00 PM,06/07/2016 01:18:00 PM,DEP,Department of Environmental Protection,Lead,Lead Kit Request (Residential) (L10),,10037.0,1960 PARK AVENUE,PARK AVENUE,HARLEM RIVER DRIVE SB EN PRK,HARLEM RIVER DRIVE EXIT 20,,,ADDRESS,NEW YORK,,,Closed,,The Department of Environmental Protection inv...,06/07/2016 01:18:00 PM,11 MANHATTAN,MANHATTAN,1002035.0,234251.0,Unspecified,MANHATTAN,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,40.809623,-73.935754,"(40.80962280597619, -73.93575360303117)"
2,33527148,06/07/2016 12:29:18 AM,06/07/2016 01:46:13 AM,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10030.0,101 WEST 141 STREET,WEST 141 STREET,LENOX AVENUE,7 AVENUE,,,ADDRESS,NEW YORK,,Precinct,Closed,06/07/2016 08:29:18 AM,The Police Department responded to the complai...,06/07/2016 01:46:13 AM,10 MANHATTAN,MANHATTAN,1001328.0,237302.0,Unspecified,MANHATTAN,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,N,,,,,,,,,,,,40.817998,-73.9383,"(40.817998342152926, -73.93829980045645)"
3,33527176,06/06/2016 10:12:00 PM,,DEP,Department of Environmental Protection,Air Quality,"Air: Dust, Construction/Demolition (AE4)",,,,,,,26 ST,ROBERT F KENNEDY BRIDGE,INTERSECTION,,,,Open,,,,Unspecified QUEENS,QUEENS,,,Unspecified,QUEENS,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,,,
4,33527223,06/07/2016 12:39:55 AM,06/08/2016 12:45:08 AM,HPD,Department of Housing Preservation and Develop...,HPD Literature Request,The ABCs of Housing,,,,,,,,,,,,,Closed,06/08/2016 12:39:55 AM,The literature will be emailed within 24 hours...,06/07/2016 12:44:24 AM,0 Unspecified,Unspecified,,,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,N,,,,,,,,,,,,,,
5,33527355,06/06/2016 03:41:18 PM,06/13/2016 09:36:00 AM,DOT,Department of Transportation,Street Condition,Pothole,,11377.0,,,,,51 AVENUE,70 STREET,INTERSECTION,Woodside,,,Closed,,The Department of Transportation inspected thi...,06/13/2016 09:36:00 AM,02 QUEENS,QUEENS,1013566.0,207275.0,Unspecified,QUEENS,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,,,,,,,,,,,,,40.73555,-73.894217,"(40.735550059675916, -73.89421681891704)"
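
The output above has six rows (0 to 5) rather than five because only the last expression in a cell is displayed, and `.loc` slices by label with the end label included. A minimal sketch of the difference, using a made-up eight-row frame in place of `complaints`:

```python
import pandas as pd

# Made-up frame with the default integer index 0..7.
df = pd.DataFrame({"Complaint Type": list("abcdefgh")})

by_position = df.iloc[:5]  # positions 0-4, end excluded
by_label = df.loc[:5]      # labels 0-5, end included

print(len(by_position), len(by_label))
```

With the default integer index the two look almost alike, which is why the off-by-one is easy to miss. `df[:5]` behaves like `.iloc[:5]`.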


We can combine these to get the first 5 rows of a column:

In [6]:
complaints['Complaint Type'][:5]

0     Street Light Condition
1                       Lead
2    Noise - Street/Sidewalk
3                Air Quality
4     HPD Literature Request
Name: Complaint Type, dtype: object

and it doesn't matter which order we do it in:

In [None]:
complaints[:5]['Complaint Type']
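
A single `.loc` call can also pick the rows and the column in one step. The sketch below, on a small made-up frame rather than the real data, checks that all three forms give the same result. Note that `.loc` slices by label with the end included, so `:4` is what selects five rows here.

```python
import pandas as pd

# Made-up stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": ["Lead", "Air Quality", "Pothole", "Noise", "Lead", "Noise"],
    "Borough": ["MANHATTAN", "QUEENS", "QUEENS", "BRONX", "BRONX", "BROOKLYN"],
})

column_then_rows = df["Complaint Type"][:5]
rows_then_column = df[:5]["Complaint Type"]
one_step = df.loc[:4, "Complaint Type"]  # rows labelled 0-4, one column

print(column_then_rows.equals(rows_then_column))
print(column_then_rows.equals(one_step))
```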

## Selecting multiple columns

What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with a list of the columns you want.

In [None]:
complaints[['Complaint Type', 'Borough']]

That shows us a summary of the whole selection. To look at just the first 10 rows, we can add a slice:

In [None]:
complaints[['Complaint Type', 'Borough']][:10]
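
It can help to see what kind of object each form of indexing returns. In this sketch, on a made-up three-row frame, a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one name in it.

```python
import pandas as pd

# Made-up stand-in for the complaints dataframe.
df = pd.DataFrame({
    "Complaint Type": ["Lead", "Noise", "Pothole"],
    "Borough": ["MANHATTAN", "BRONX", "QUEENS"],
    "Status": ["Closed", "Open", "Closed"],
})

one_column = df["Borough"]                       # a Series
two_columns = df[["Complaint Type", "Borough"]]  # a DataFrame with 2 columns
still_a_frame = df[["Borough"]]                  # a DataFrame with 1 column

print(type(one_column).__name__, type(two_columns).__name__, two_columns.shape)
```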

## What's the most common complaint type?

This is a really easy question to answer! There's a `.value_counts()` method that we can use:

In [None]:
complaints['Complaint Type'].value_counts()

If we just want the 10 most common complaints, we can do this:

In [None]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
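
Slicing the first entries works because `.value_counts()` returns its result sorted with the most frequent value first. Here is a small sketch on made-up complaint types (not the real data) that also shows `.head()` as an alternative way to take the top of the list:

```python
import pandas as pd

# Made-up complaint types standing in for the real column.
complaint_types = pd.Series(
    ["Noise", "Lead", "Noise", "Pothole", "Noise", "Lead"],
    name="Complaint Type",
)

counts = complaint_types.value_counts()  # sorted, most frequent first
top_two = counts.head(2)                 # same rows as counts[:2]

print(top_two)
```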

But it gets better! We can plot them!

In [None]:
complaint_counts[:10].plot(kind='bar')

<style>
    @font-face {
        font-family: "Computer Modern";
        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');
    }
    div.cell{
        width:800px;
        margin-left:16% !important;
        margin-right:auto;
    }
    h1 {
        font-family: Helvetica, serif;
    }
    h4{
        margin-top:12px;
        margin-bottom: 3px;
       }
    div.text_cell_render{
        font-family: Computer Modern, "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;
        line-height: 145%;
        font-size: 130%;
        width:800px;
        margin-left:auto;
        margin-right:auto;
    }
    .CodeMirror{
            font-family: "Source Code Pro", source-code-pro,Consolas, monospace;
    }
    .text_cell_render h5 {
        font-weight: 300;
        font-size: 22pt;
        color: #4057A1;
        font-style: italic;
        margin-bottom: .5em;
        margin-top: 0.5em;
        display: block;
    }
    
    .warning{
        color: rgb( 240, 20, 20 )
        }  