<a href="https://colab.research.google.com/github/afeld/python-public-policy/blob/master/lecture_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# a bit of setup
import numpy as np
np.random.seed(5)

# **NYU Wagner - Python Coding for Public Policy**
# Class 1: Intro to Pandas


# LECTURE

## Python review

Hold off on coding until the in-class exercise.

### Strings

In [2]:
"hello" + " " + "world"

'hello world'

### Booleans

In [1]:
False and True

False

In [2]:
True or False

True

In [42]:
12 > 6

True

Booleans are a special type, not to be confused with strings: `True` and `False` are written without quotes.
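A quick sketch of the difference, using only built-in Python (the variable name `is_equal` is just for illustration):

```python
# The boolean True and the string "True" are different types.
print(type(True))    # <class 'bool'>
print(type("True"))  # <class 'str'>

# Comparing a boolean to a string with the same spelling is False.
is_equal = True == "True"
print(is_equal)
```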

### Numbers

In [3]:
5 + 6

11

### Conditionals

In [4]:
3 > 11

False

### Variables

In [5]:
subtotal = 550 + 12
tax = subtotal * 0.0875
total = subtotal + tax
total

611.175

### Lists

In [6]:
planets = ['Mercury', 'Venus', 'Earth', 'Mars', 'Jupiter', 'Saturn', 'Uranus', 'Neptune']
planets[:3]

['Mercury', 'Venus', 'Earth']

# Pandas

- A Python package (bundled up code that you can reuse)
- Very common for data science in Python
- [A lot like R](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html), in that both are built around data frames as their core concept

## Start by importing necessary packages

In [2]:
import pandas as pd

## Read and save 311 Service Requests dataset as a pandas dataframe

This will take a while (~30 seconds).

In [3]:
df = pd.read_csv('https://nyu.box.com/shared/static/x3zfnpsva4kwcqj6amfchheszdv269lq.zip', low_memory=False)

## Today's goal

Learn which 311 complaints are most common and which agencies are responsible for handling them. But first, let's take a look at the data, then clean it up!

## Preview the data contents

In [4]:
df.head() # defaults to the first 5 rows if you don't specify a number

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,39888071,08/01/2018 12:00:10 AM,08/01/2018 01:52:46 AM,DHS,Operations Unit - Department of Homeless Services,Homeless Person Assistance,,Other,10029,200 EAST 109 STREET,...,,,,,,,,40.793339,-73.942942,"(40.79333937834769, -73.9429417746998)"
1,39889166,08/01/2018 12:00:26 AM,08/18/2018 10:46:43 AM,HPD,Department of Housing Preservation and Develop...,DOOR/WINDOW,DOOR,RESIDENTIAL BUILDING,10031,528 WEST 136 STREET,...,,,,,,,,40.820124,-73.953071,"(40.82012422332215, -73.9530712339799)"
2,39882869,08/01/2018 12:00:54 AM,08/01/2018 12:49:55 AM,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11216,761 LINCOLN PLACE,...,,,,,,,,40.670809,-73.951399,"(40.67080917938279, -73.9513990916184)"
3,39894246,08/01/2018 12:01:00 AM,08/02/2018 10:30:00 PM,DEP,Department of Environmental Protection,Noise,Noise: Construction Before/After Hours (NM1),,10010,,...,,,,,,,,40.740262,-73.990517,"(40.74026158873342, -73.99051651686905)"
4,39881329,08/01/2018 12:01:00 AM,08/05/2018 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,11413,121-28 198 STREET,...,,,,,,,,40.688144,-73.75099,"(40.68814402968042, -73.75098958473612)"


In [5]:
df.tail(10) # last 10 records in the dataframe

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
2859190,43619254,08/24/2019 01:58:34 AM,,NYPD,New York City Police Department,Noise - Residential,Loud Talking,Residential Building/House,11226.0,1509 NOSTRAND AVENUE,...,,,,,,,,40.649391,-73.949411,"(40.649390802790116, -73.94941102713294)"
2859191,43623130,08/24/2019 01:58:58 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,11234.0,FLATLANDS AVENUE,...,,,,,,,,40.62621,-73.927659,"(40.62621042607075, -73.92765889759266)"
2859192,43618563,08/24/2019 01:59:20 AM,,NYPD,New York City Police Department,Illegal Parking,Posted Parking Sign Violation,Street/Sidewalk,11413.0,146 ROAD,...,,,,,,,,40.661635,-73.762651,"(40.66163483733922, -73.76265143072415)"
2859193,43624826,08/24/2019 01:59:38 AM,,NYPD,New York City Police Department,Noise - Vehicle,Car/Truck Music,Street/Sidewalk,11218.0,3403 14 AVENUE,...,,,,,,,,40.641957,-73.981313,"(40.64195655592916, -73.98131320909779)"
2859194,43626512,08/24/2019 01:59:40 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10034.0,65 PAYSON AVENUE,...,,,,,,,,40.867405,-73.927765,"(40.867405209755695, -73.9277653007795)"
2859195,43619262,08/24/2019 02:00:14 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10463.0,3308 BAILEY AVENUE,...,,,,,,,,40.879701,-73.901057,"(40.87970084316435, -73.9010571445122)"
2859196,43622052,08/24/2019 02:00:20 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,11237.0,265 STOCKHOLM STREET,...,,,,,,,,40.701783,-73.920555,"(40.70178323733244, -73.92055493919345)"
2859197,43625918,08/24/2019 02:00:27 AM,,NYPD,New York City Police Department,Noise - Vehicle,Car/Truck Music,Street/Sidewalk,11233.0,560 RALPH AVENUE,...,,,,,,,,40.669992,-73.922489,"(40.66999185896368, -73.9224889475533)"
2859198,43622055,08/24/2019 02:00:54 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,,,...,,,,,,,,,,
2859199,43623704,08/24/2019 02:00:56 AM,,NYPD,New York City Police Department,Noise - Vehicle,Car/Truck Music,Street/Sidewalk,10303.0,2806 RICHMOND TERRACE,...,,,,,,,,40.636794,-74.153979,"(40.63679359185655, -74.15397942810047)"


In [6]:
df.sample(5) # random sample of size determined by you

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
661317,40641428,10/24/2018 09:18:00 PM,10/24/2018 09:32:58 PM,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,APARTMENT ONLY,RESIDENTIAL BUILDING,11210,2047 NOSTRAND AVENUE,...,,,,,,,,40.635422,-73.947916,"(40.63542192081139, -73.94791557833476)"
1543137,41750921,02/20/2019 09:07:21 PM,02/20/2019 09:59:28 PM,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,10468,2825 WEBB AVENUE,...,,,,,,,,40.872803,-73.900811,"(40.872803217936394, -73.9008106846054)"
759089,40766103,11/07/2018 06:39:38 AM,11/11/2018 04:24:06 PM,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,10460,576 MORRIS PARK AVENUE,...,,,,,,,,40.84364,-73.870137,"(40.84364049664065, -73.8701367224693)"
1210524,41368688,01/08/2019 11:58:32 AM,01/10/2019 04:25:51 PM,DOT,Department of Transportation,Sidewalk Condition,Blocked - Construction,Sidewalk,10466,3966 BRONX BOULEVARD,...,,,,,,,,40.888888,-73.864258,"(40.88888833332773, -73.86425798841263)"
2466446,43040873,06/21/2019 10:21:00 AM,06/26/2019 12:00:00 AM,DSNY,Department of Sanitation,Request Large Bulky Item Collection,Request Large Bulky Item Collection,Sidewalk,10308,58 LINDENWOOD ROAD,...,,,,,,,,40.548965,-74.152539,"(40.54896455500829, -74.1525392169111)"


## Pandas data structures

<!-- source: https://docs.google.com/document/d/1HGw2BdbuXSIwcgDWXkzZGPXYr5yJ_WEM3Gw-nLoHzCo/edit#heading=h.7z4rqdvodt9j -->

![Diagram showing a DataFrame, Series, labels, and indexes](img/data_structures-1.png)

## How many records are in the dataset?

### `size` attribute

How many cells are there in the data table?

In [7]:
df.size

117227200

What if I only care about how many rows there are? You can use a column name in square brackets, like an index into a list, to get one column from the dataframe. The `size` of a single column is the number of rows.

`.size` includes `null` (empty) values.

In [8]:
df['Facility Type'].size

2859200

### `count()` method

You can also use the `count()` function, which gives the count of values per column. `count()` doesn't include `null` (empty) values.

In [9]:
df.count()

Unique Key                        2859200
Created Date                      2859200
Closed Date                       2722927
Agency                            2859200
Agency Name                       2859200
Complaint Type                    2859200
Descriptor                        2816522
Location Type                     2242215
Incident Zip                      2746912
Incident Address                  2483931
Street Name                       2483750
Cross Street 1                    1719330
Cross Street 2                    1712706
Intersection Street 1              613896
Intersection Street 2              611987
Address Type                      2579813
City                              2725223
Landmark                           185198
Facility Type                      769320
Status                            2859200
Due Date                           978779
Resolution Description            2615253
Resolution Action Updated Date    2795696
Community Board                   

To just get the count in the "unique key" column:

In [10]:
df['Unique Key'].count()

2859200
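To see the difference between `.size` and `.count()` on a small scale, here is a sketch using a made-up three-row dataframe (the name `toy` and its values are hypothetical, not taken from the 311 data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns, one missing value
toy = pd.DataFrame({
    "agency": ["NYPD", "HPD", "DOT"],
    "zip": ["10029", np.nan, "11216"],
})

total_cells = toy.size              # rows * columns, missing values included
row_count = len(toy)                # number of rows
zip_cells = toy["zip"].size         # cells in one column, missing values included
zip_non_null = toy["zip"].count()   # non-missing values only

print(total_cells, row_count, zip_cells, zip_non_null)
```

`len(df)` is another common way to get the number of rows without picking a column.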

### `info()` method

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2859200 entries, 0 to 2859199
Data columns (total 41 columns):
 #   Column                          Dtype  
---  ------                          -----  
 0   Unique Key                      int64  
 1   Created Date                    object 
 2   Closed Date                     object 
 3   Agency                          object 
 4   Agency Name                     object 
 5   Complaint Type                  object 
 6   Descriptor                      object 
 7   Location Type                   object 
 8   Incident Zip                    object 
 9   Incident Address                object 
 10  Street Name                     object 
 11  Cross Street 1                  object 
 12  Cross Street 2                  object 
 13  Intersection Street 1           object 
 14  Intersection Street 2           object 
 15  Address Type                    object 
 16  City                            object 
 17  Landmark                   

## What are the distinct sets of values in columns that seem most useful?

### `set()` function for getting a set of unique values

Let's look at the "status" column. What are the status options for these 311 complaints?

In [12]:
set(df['Status'])

{'Assigned',
 'Closed',
 'Email Sent',
 'In Progress',
 'Open',
 'Pending',
 'Started',
 'Unspecified'}

In [13]:
set(df['Open Data Channel Type'])


{'MOBILE', 'ONLINE', 'OTHER', 'PHONE', 'UNKNOWN'}

In [14]:
set(df['Agency'])

{'ACS',
 'COIB',
 'DCA',
 'DCAS',
 'DCP',
 'DEP',
 'DFTA',
 'DHS',
 'DOB',
 'DOE',
 'DOF',
 'DOHMH',
 'DOITT',
 'DOT',
 'DPR',
 'DSNY',
 'DVS',
 'EDC',
 'FDNY',
 'HPD',
 'HRA',
 'MOC',
 'NYCEM',
 'NYPD',
 'OMB',
 'TAT',
 'TAX',
 'TLC'}

In [15]:
set(df['Complaint Type'])

{'"-->\'-->`-->&...',
 '$(sleep 11)',
 '${3804*3137}',
 '%2e%2e%2f%2e%2e%2f%2e%2e%2f...',
 '%2e%2e%5c%2e%2e%5c%2e%2e%5c...',
 '%2fetc%2fpasswd',
 '%E5%98%8A%E5%98%8DX-Injecti...',
 '%c0%ae/%c0%ae/%c0%ae/%c0%ae...',
 '%c0%ae/%c0%ae/%c0%ae/WEB-IN...',
 '%c0%ae/%c0%ae/WEB-INF/web.xml',
 '%c0%ae/WEB-INF/web.xml',
 '%{(#dm=@ognl.OgnlContext@DE...',
 '%{4761*8506}',
 '() { :;}; /bin/sleep 0',
 '() { :;}; /bin/sleep 11',
 '() { _; } >_',
 '(select extractvalue(xmltyp...',
 "(select load_file('\\\\\\\\615h...",
 '*)(!(objectClass=*)',
 '*)(objectClass=*',
 '.../....///.../....///.../....',
 '.../...//.../...//.../...//...',
 '..././..././..././..././......',
 '.../.\\.../.\\.../.\\.../.\\......',
 '.../Misc. Comments',
 '...\\./...\\./...\\./...\\./......',
 '...\\.\\...\\.\\...\\.\\...\\.\\......',
 '../../../../../../../../../...',
 '../../../../WEB-INF/web.xml',
 '../../../WEB-INF/web.xml',
 '../../../WEB-INF/web.xml;x=',
 '../../WEB-INF/web.xml',
 '../../WEB-INF/web.xml;x=',
 '../WEB-INF/w

FYI you can also use `df['column_name'].unique()` to get an array of unique values.
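A small sketch comparing the two approaches, assuming a made-up Series of status values (the name `statuses` is hypothetical):

```python
import pandas as pd

# Hypothetical toy column with repeated values
statuses = pd.Series(["Open", "Closed", "Open", "Pending", "Closed"])

as_set = set(statuses)        # a Python set; its order is not guaranteed
as_array = statuses.unique()  # an array, in order of first appearance

print(sorted(as_set))         # sort for a stable, readable listing
print(list(as_array))
```

Both give the same distinct values. The main practical difference is the type of the result and whether order is preserved.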

## In-class exercise

[Create a copy of the Homework 1 starter notebook.](https://colab.research.google.com/github/afeld/python-public-policy/blob/master/hw_1.ipynb)

## Excluding bad records from the dataframe

First, let's remind ourselves what the invalid complaint types are by getting the distinct list of all complaint types.

In [16]:
# the set() function returns all unique values in a column
set(df['Complaint Type'])

{'"-->\'-->`-->&...',
 '$(sleep 11)',
 '${3804*3137}',
 '%2e%2e%2f%2e%2e%2f%2e%2e%2f...',
 '%2e%2e%5c%2e%2e%5c%2e%2e%5c...',
 '%2fetc%2fpasswd',
 '%E5%98%8A%E5%98%8DX-Injecti...',
 '%c0%ae/%c0%ae/%c0%ae/%c0%ae...',
 '%c0%ae/%c0%ae/%c0%ae/WEB-IN...',
 '%c0%ae/%c0%ae/WEB-INF/web.xml',
 '%c0%ae/WEB-INF/web.xml',
 '%{(#dm=@ognl.OgnlContext@DE...',
 '%{4761*8506}',
 '() { :;}; /bin/sleep 0',
 '() { :;}; /bin/sleep 11',
 '() { _; } >_',
 '(select extractvalue(xmltyp...',
 "(select load_file('\\\\\\\\615h...",
 '*)(!(objectClass=*)',
 '*)(objectClass=*',
 '.../....///.../....///.../....',
 '.../...//.../...//.../...//...',
 '..././..././..././..././......',
 '.../.\\.../.\\.../.\\.../.\\......',
 '.../Misc. Comments',
 '...\\./...\\./...\\./...\\./......',
 '...\\.\\...\\.\\...\\.\\...\\.\\......',
 '../../../../../../../../../...',
 '../../../../WEB-INF/web.xml',
 '../../../WEB-INF/web.xml',
 '../../../WEB-INF/web.xml;x=',
 '../../WEB-INF/web.xml',
 '../../WEB-INF/web.xml;x=',
 '../WEB-INF/w

Let's see how frequently these invalid Complaint Type values appear in the data.

Use `.groupby().size()` to get the count of 311 requests per complaint type value. This is very similar to a pivot table in Excel.

In [17]:
# remember .size gives you the count of cells across all columns in the dataframe
df.size

117227200

In [18]:
# to just get the total count of records in the dataset, we should get the size of the 'Unique Key' column
df['Unique Key'].size

2859200

In [50]:
with pd.option_context("display.max_rows", None):
    display(df.groupby('Complaint Type').size())

Complaint Type
"-->'-->`-->&...                                  1
$(sleep 11)                                       1
${3804*3137}                                      1
%2e%2e%2f%2e%2e%2f%2e%2e%2f...                    1
%2e%2e%5c%2e%2e%5c%2e%2e%5c...                    1
%2fetc%2fpasswd                                   1
%E5%98%8A%E5%98%8DX-Injecti...                    1
%c0%ae/%c0%ae/%c0%ae/%c0%ae...                    1
%c0%ae/%c0%ae/%c0%ae/WEB-IN...                    1
%c0%ae/%c0%ae/WEB-INF/web.xml                     1
%c0%ae/WEB-INF/web.xml                            1
%{(#dm=@ognl.OgnlContext@DE...                    1
%{4761*8506}                                      1
() { :;}; /bin/sleep 0                            1
() { :;}; /bin/sleep 11                           1
() { _; } >_                                      1
(select extractvalue(xmltyp...                    1
(select load_file('\\\\615h...                    1
*)(!(objectClass=*)                              

```python
with pd.option_context("display.max_rows", None):
    display(...)
```

What this code is doing: showing all rows of the result, instead of a truncated preview, with [rich output](https://ipython.readthedocs.io/en/stable/interactive/plotting.html#rich-outputs).
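The `with` block only changes the display setting temporarily. A minimal sketch that checks this (the variable names are just for illustration):

```python
import pandas as pd

# Remember the current setting so we can compare afterwards
default_max_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", None):
    # Inside the block, None means "no limit on displayed rows"
    inside = pd.get_option("display.max_rows")

# Outside the block, the previous setting is restored
after = pd.get_option("display.max_rows")
print(inside, after)
```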

```python
df.groupby('Complaint Type').size()
```

What this code is doing:

1. Group the records in the dataset based on their `Complaint Type` value
1. Count the records that have been grouped together by their shared `Complaint Type` value

Watch out! `.groupby().size()` doesn't work the same way as `.size`. The former returns the number of rows in each group; the latter returns the total number of cells in the dataframe.
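Here is a sketch of that difference on a made-up six-row dataframe (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 6 rows, 2 columns
toy = pd.DataFrame({
    "Complaint Type": ["Noise", "Noise", "Heat", "Noise", "Heat", "Rodent"],
    "Agency": ["NYPD", "NYPD", "HPD", "NYPD", "HPD", "DOHMH"],
})

cell_count = toy.size                            # 6 rows * 2 columns
per_type = toy.groupby("Complaint Type").size()  # number of rows in each group

print(cell_count)
print(per_type)
```

The group sizes always add up to the number of rows in the original dataframe.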

It looks like most invalid complaint types only have a few records. Try excluding all complaint types with 4 or fewer records, on the assumption that any complaint type appearing that rarely is a bad data entry.

Why 4? It's arbitrary. We're looking for trends in the data, so in this case we don't care about low-frequency entries.

Create a dataframe that captures the count of records per `Complaint Type` value.

In [47]:
counts = df.groupby('Complaint Type').size().reset_index(name='count')
counts
# .reset_index(name='count') allows us to name the new column that contains the count of rows

Unnamed: 0,Complaint Type,count
0,"""-->'-->`-->&...",1
1,$(sleep 11),1
2,${3804*3137},1
3,%2e%2e%2f%2e%2e%2f%2e%2e%2f...,1
4,%2e%2e%5c%2e%2e%5c%2e%2e%5c...,1
5,%2fetc%2fpasswd,1
6,%E5%98%8A%E5%98%8DX-Injecti...,1
7,%c0%ae/%c0%ae/%c0%ae/%c0%ae...,1
8,%c0%ae/%c0%ae/%c0%ae/WEB-IN...,1
9,%c0%ae/%c0%ae/WEB-INF/web.xml,1


You can also use `.count()` but [the output is a little different](https://stackoverflow.com/questions/33346591/what-is-the-difference-between-size-and-count-in-pandas).
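As a rough illustration of how the two outputs differ after a `groupby`, assuming a made-up dataframe with one missing value (the names `toy`, `sizes`, and `non_null_counts` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Descriptor
toy = pd.DataFrame({
    "Complaint Type": ["Noise", "Noise", "Heat"],
    "Descriptor": ["Loud Music/Party", np.nan, "ENTIRE BUILDING"],
})

# size() returns a Series: the number of rows per group
sizes = toy.groupby("Complaint Type").size()

# count() returns a DataFrame: non-missing values per group, for each other column
non_null_counts = toy.groupby("Complaint Type").count()

print(sizes)
print(non_null_counts)
```

For `Noise`, `size()` reports 2 rows, while `count()` reports 1 in the `Descriptor` column because one value is missing.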

Create a "series" that only lists the `Complaint Type` values that have record counts > 4. (Remember: A single column from a pandas dataframe is called a series. It's essentially a list containing all the values in the column.) 

In [48]:
valid_complaint_types = counts['Complaint Type'][counts['count'] > 4]
valid_complaint_types

45                                     APPLIANCE
46                             Abandoned Vehicle
48                               Advocate - Lien
49                              Advocate - Other
50                               Advocate - RPIE
52                 Advocate-Co-opCondo Abatement
53                Advocate-Commercial Exemptions
55                  Advocate-Personal Exemptions
57                 Advocate-Prop Refunds/Credits
58                       Advocate-Property Value
60                                   Air Quality
61                              Alzheimer's Care
62                                  Animal Abuse
63                   Animal Facility - No Permit
64                              Animal in a Park
65                                  Animal-Abuse
67                                      Asbestos
68                              BEST/Site Safety
69                    Beach/Pool/Sauna Complaint
70                      Benefit Card Replacement
71                  

Filter our `df` dataframe to only keep the rows where the `Complaint Type` value is in the `valid_complaint_types` series we created in the previous step. Save the result in a new dataframe.

In [23]:
df_cleaned = df[df['Complaint Type'].isin(valid_complaint_types)]

How can we make sure this worked? Let's check how many records there were originally in `df` vs how many are in `df_cleaned`.

Before:

In [24]:
df['Unique Key'].size

2859200

After:

In [25]:
df_cleaned['Unique Key'].size

2859011

We can also look at a random sample of `Complaint Type` values from our cleaned dataframe to make sure they look correct.

In [26]:
df_cleaned['Complaint Type'].sample(10)

880261               Noise - Residential
1144831             Sanitation Condition
626307                    HEAT/HOT WATER
2313101    Electronics Waste Appointment
929041               Noise - Residential
2780651                     Damaged Tree
227911                  Blocked Driveway
2159680                          GENERAL
486414                  BEST/Site Safety
668845           Overgrown Tree/Branches
Name: Complaint Type, dtype: object

Great, now those invalid records will be excluded from our analysis!
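The whole count-then-filter pattern can be sketched end to end on a tiny made-up dataframe (the names `toy`, `valid_types`, and `toy_cleaned` are hypothetical, and the threshold of 4 mirrors the arbitrary choice above):

```python
import pandas as pd

# Hypothetical toy data: two frequent types and one junk entry
toy = pd.DataFrame({
    "Complaint Type": ["Noise"] * 5 + ["Heat"] * 6 + ["$(sleep 11)"],
})

# Step 1: count the rows per complaint type
counts = toy.groupby("Complaint Type").size().reset_index(name="count")

# Step 2: keep only the types that appear more than 4 times
valid_types = counts["Complaint Type"][counts["count"] > 4]

# Step 3: keep only the rows whose type is in that list
toy_cleaned = toy[toy["Complaint Type"].isin(valid_types)]

print(len(toy), len(toy_cleaned))
```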

Another approach to excluding those invalid records would be to use ["regex" (regular expressions)](https://www.w3schools.com/python/python_regex.asp) to find records with weird characters.
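A minimal sketch of the regex idea, under a loud assumption: the allow-list pattern below (letters, digits, spaces, and a few punctuation marks) is made up for illustration and has not been checked against the full list of real complaint types, so it could wrongly reject some legitimate values.

```python
import pandas as pd

# Hypothetical sample of complaint type values
types = pd.Series([
    "Noise - Residential",
    "HEAT/HOT WATER",
    "$(sleep 11)",
    "../../WEB-INF/web.xml",
    "Alzheimer's Care",
])

# Assumed allow-list: the whole value must consist of these characters only
pattern = r"^[A-Za-z0-9 /'&-]+$"
looks_valid = types.str.contains(pattern, regex=True)

print(types[looks_valid].tolist())
```

The resulting boolean Series can be used to filter rows in the same way as the `.isin()` result above.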

## Filtering rows

Slicing and dicing is done through [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html).

![DataFrame](img/data_structures-2.jpg)

### Boolean indexing

![DataFrame and Series](img/data_structures-3.jpg)

#### How it works

In [2]:
from IPython.display import Video
Video("img/boolean-indexing.mp4")
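As a minimal sketch of the idea, assuming a made-up dataframe (the names `toy`, `mask`, and `nypd_only` are hypothetical): comparing a column to a value produces a Series of `True`/`False` values, and passing that Series in square brackets keeps only the rows where it is `True`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Agency": ["NYPD", "HPD", "NYPD", "DOT"],
    "Complaint Type": ["Noise", "Heat", "Illegal Parking", "Street Condition"],
})

# Step 1: build a boolean Series, one True/False per row
mask = toy["Agency"] == "NYPD"

# Step 2: use it to select only the rows where the mask is True
nypd_only = toy[mask]

print(mask.tolist())
print(nypd_only)
```

Note that the selected rows keep their original index labels (0 and 2 here).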

## Done with clean up! Time for the actual analysis: 
### Which 311 complaints are most common and which agencies are responsible for handling them?

#### Which complaints are the most common?

In [27]:
df_cleaned.groupby('Complaint Type').size().nlargest(15).reset_index(name='count')

# .reset_index(name='count') isn't necessary but it's helpful to include because it allows us to name the new column that contains the count of rows

Unnamed: 0,Complaint Type,count
0,Noise - Residential,236350
1,HEAT/HOT WATER,222722
2,Illegal Parking,195159
3,Request Large Bulky Item Collection,177175
4,Blocked Driveway,145446
5,Street Condition,97178
6,Noise - Street/Sidewalk,95977
7,UNSANITARY CONDITION,85904
8,Street Light Condition,77755
9,Water System,74139
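To see what each step of that chain contributes, here is a sketch on a made-up dataframe (the names `toy` and `top_two` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 3 Noise, 2 Heat, 1 Rodent
toy = pd.DataFrame({
    "Complaint Type": ["Noise"] * 3 + ["Heat"] * 2 + ["Rodent"],
})

top_two = (
    toy.groupby("Complaint Type")  # group rows by complaint type
    .size()                        # count rows per group (a Series)
    .nlargest(2)                   # keep the 2 biggest counts, largest first
    .reset_index(name="count")     # back to a dataframe with a named count column
)

print(top_two)
```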


#### Which agencies are responsible for handling these complaint categories?

In [28]:
df_cleaned.groupby(['Agency', 'Complaint Type']).size().nlargest(15).reset_index(name='count')

Unnamed: 0,Agency,Complaint Type,count
0,NYPD,Noise - Residential,236350
1,HPD,HEAT/HOT WATER,222722
2,NYPD,Illegal Parking,195159
3,DSNY,Request Large Bulky Item Collection,177175
4,NYPD,Blocked Driveway,145446
5,DOT,Street Condition,97178
6,NYPD,Noise - Street/Sidewalk,95977
7,HPD,UNSANITARY CONDITION,85904
8,DOT,Street Light Condition,77755
9,DEP,Water System,74139


#### Which agencies receive the most total 311 requests?

In [29]:
df_cleaned.groupby('Agency').size().nlargest(15).reset_index(name='count')

Unnamed: 0,Agency,count
0,NYPD,850077
1,HPD,603043
2,DSNY,420165
3,DOT,298879
4,DEP,207280
5,DOB,149656
6,DPR,119970
7,DOHMH,72253
8,DOF,41436
9,TLC,35730


#### What is the most frequent request per agency?

First, create a dataframe that contains the count of complaints per `Agency` per `Complaint Type`.

In [33]:
agency_counts = df_cleaned.groupby(['Agency', 'Complaint Type']).size().reset_index(name='count')
agency_counts.head(20)

Unnamed: 0,Agency,Complaint Type,count
0,ACS,Forms,364
1,COIB,Forms,7
2,DCA,Consumer Complaint,16385
3,DCA,DCA / DOH New License Application Request,1052
4,DCAS,Comments,68
5,DCAS,Question,788
6,DCP,Research Questions,22
7,DEP,Air Quality,8333
8,DEP,Asbestos,1909
9,DEP,FATF,118


Sort by `count` in descending order, then use `drop_duplicates()` to keep only the first row per `Agency`, which is the one with the highest count.

In [35]:
agency_counts.sort_values('count', ascending=False).drop_duplicates('Agency').sort_values('Agency')

Unnamed: 0,Agency,Complaint Type,count
0,ACS,Forms,364
1,COIB,Forms,7
2,DCA,Consumer Complaint,16385
5,DCAS,Question,788
6,DCP,Research Questions,22
18,DEP,Water System,74139
26,DFTA,Housing - Low Income Senior,4375
30,DHS,Homeless Person Assistance,20991
44,DOB,General Construction/Plumbing,54939
55,DOE,School Maintenance,2372


Another way, only sorting it once:

In [38]:
agency_counts.sort_values(['Agency', 'count']).drop_duplicates('Agency', keep='last')

Unnamed: 0,Agency,Complaint Type,count
0,ACS,Forms,364
1,COIB,Forms,7
2,DCA,Consumer Complaint,16385
5,DCAS,Question,788
6,DCP,Research Questions,22
18,DEP,Water System,74139
26,DFTA,Housing - Low Income Senior,4375
30,DHS,Homeless Person Assistance,20991
44,DOB,General Construction/Plumbing,54939
55,DOE,School Maintenance,2372
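Both approaches can be compared on a tiny made-up table of counts (the name `toy_counts` and its numbers are hypothetical, not real 311 figures):

```python
import pandas as pd

# Hypothetical counts: two complaint types for each of two agencies
toy_counts = pd.DataFrame({
    "Agency": ["DEP", "DEP", "HPD", "HPD"],
    "Complaint Type": ["Air Quality", "Water System", "HEAT/HOT WATER", "PAINT/PLASTER"],
    "count": [10, 50, 90, 20],
})

# Approach 1: sort by count descending, keep the first row seen per agency
top_a = (
    toy_counts.sort_values("count", ascending=False)
    .drop_duplicates("Agency")
    .sort_values("Agency")
)

# Approach 2: sort by agency, then count ascending, and keep the last row per agency
top_b = toy_counts.sort_values(["Agency", "count"]).drop_duplicates("Agency", keep="last")

print(top_a)
print(top_a.equals(top_b))
```

In both cases `drop_duplicates()` just keeps one row per `Agency`; the sorting beforehand is what decides which row that is.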


# HOMEWORK 1

Continue in the notebook you created.