# Lecture 1 demo solutions

## Load data

In [1]:
import pandas as pd

requests = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip"
)


## Analysis

### Which complaints are most common?

In [13]:
requests["Complaint Type"].value_counts().head()

Complaint Type
Noise - Residential                    41311
HEAT/HOT WATER                         39095
Illegal Parking                        34297
Request Large Bulky Item Collection    30939
Blocked Driveway                       25530
Name: count, dtype: int64

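Equivalent to the following, since `nlargest()` returns the top 5 values by default, just as `head()` returns the first 5 rows: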

In [14]:
requests.groupby("Complaint Type").size().nlargest()

Complaint Type
Noise - Residential                    41311
HEAT/HOT WATER                         39095
Illegal Parking                        34297
Request Large Bulky Item Collection    30939
Blocked Driveway                       25530
dtype: int64

### What's the most frequent request per agency?

In [4]:
requests.groupby(["Agency", "Complaint Type"]).size().to_frame(name="count")

Agency,Complaint Type,count
ACS,Forms,56
COIB,Forms,1
DCA,Consumer Complaint,2892
DCA,DCA / DOH New License Application Request,186
DCAS,Comments,13
...,...,...
TLC,Lost Property,472
TLC,Taxi Complaint,2416
TLC,Taxi Compliment,41
TLC,Taxi Licensee Complaint,5
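
The table above lists every agency and complaint type pair, so it doesn't yet single out the top complaint for each agency. One way to get there, sketched on a small made-up DataFrame (the `toy` rows are hypothetical, not taken from the 311 data), is to sort by count and keep the first row per agency:

```python
import pandas as pd

# Hypothetical stand-in for the `requests` DataFrame
toy = pd.DataFrame(
    {
        "Agency": ["DCA", "DCA", "DCA", "TLC", "TLC", "TLC", "TLC"],
        "Complaint Type": [
            "Consumer Complaint",
            "Consumer Complaint",
            "DCA / DOH New License Application Request",
            "Taxi Complaint",
            "Taxi Complaint",
            "Taxi Complaint",
            "Lost Property",
        ],
    }
)

# Count rows per (Agency, Complaint Type) pair, as in the cell above
by_agency = toy.groupby(["Agency", "Complaint Type"]).size().reset_index(name="count")

# Put the largest counts first, then keep only the first row seen for each agency
top_per_agency = (
    by_agency.sort_values("count", ascending=False)
    .drop_duplicates("Agency")
    .sort_values("Agency")
    .reset_index(drop=True)
)
print(top_per_agency)
```

If two complaint types tie for first place within an agency, this sketch keeps only one of them.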


## Exclude bad records from the DataFrame

Let's look at the complaint types.

In [5]:
requests["Complaint Type"].unique()

array(['Street Condition', 'HEAT/HOT WATER', 'Noise - Residential',
       'Illegal Parking', 'Request Large Bulky Item Collection', 'Noise',
       'Noise - Street/Sidewalk', 'Electronics Waste Appointment',
       'Blocked Driveway', 'Dirty Conditions', 'Curb Condition',
       'Noise - Commercial', 'General Construction/Plumbing',
       'Traffic Signal Condition', 'Street Light Condition', 'Lead',
       'Street Sign - Damaged', 'Noise - Vehicle', 'New Tree Request',
       'Sanitation Condition', 'Mosquitoes', 'WATER LEAK',
       'UNSANITARY CONDITION', 'Root/Sewer/Sidewalk Condition',
       'Dead/Dying Tree', 'Derelict Vehicles', 'Collection Truck Noise',
       'Sewer', 'GENERAL', 'Overflowing Litter Baskets', 'Vacant Lot',
       'Sidewalk Condition', 'PAINT/PLASTER', 'Building/Use',
       'Street Sign - Dangling', 'Construction Safety Enforcement',
       'PLUMBING', 'Derelict Vehicle', 'Homeless Person Assistance',
       'ELECTRIC', 'Water System', 'Damaged Tree',
       

Let's make that a bit easier to read:

In [6]:
complaints = requests["Complaint Type"].unique()
complaints.sort()
list(complaints)

['$(sleep 11)',
 '(select extractvalue(xmltyp...',
 '..././..././..././..././......',
 '../../../../../../../../../...',
 '../WEB-INF/web.xml',
 '@(9313*3464)',
 'APPLIANCE',
 'Abandoned Vehicle',
 'Advocate - Other',
 'Advocate - RPIE',
 'Advocate-Business Tax',
 'Advocate-Co-opCondo Abatement',
 'Advocate-Personal Exemptions',
 'Advocate-Prop Refunds/Credits',
 'Air Quality',
 "Alzheimer's Care",
 'Animal Abuse',
 'Animal Facility - No Permit',
 'Animal in a Park',
 'Animal-Abuse',
 'Appliance',
 'Asbestos',
 'BEST/Site Safety',
 'Beach/Pool/Sauna Complaint',
 'Benefit Card Replacement',
 'Bereavement Support Group',
 'Bike Rack Condition',
 'Bike/Roller/Skate Chronic',
 'Blocked Driveway',
 'Boilers',
 'Borough Office',
 'Bottled Water',
 'Bridge Condition',
 'Broken Parking Meter',
 'Building Marshals office',
 'Building/Use',
 'Bus Stop Shelter Complaint',
 'Bus Stop Shelter Placement',
 'Calorie Labeling',
 'Case Management Agency Complaint',
 'Collection Truck Noise',
 'Comments

Let's see how frequently these invalid Complaint Type values appear in the data.

Use `.groupby().size()` to get the count of 311 requests per complaint type value. This is very similar to [pivot tables](https://support.google.com/docs/answer/1272900) in spreadsheets. See also: [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html).

In [7]:
with pd.option_context("display.max_rows", 1000):
    display(requests.groupby("Complaint Type").size())

Complaint Type
$(sleep 11)                                      1
(select extractvalue(xmltyp...                   1
..././..././..././..././......                   1
../../../../../../../../../...                   2
../WEB-INF/web.xml                               1
@(9313*3464)                                     1
APPLIANCE                                     2539
Abandoned Vehicle                             1655
Advocate - Other                                26
Advocate - RPIE                                  1
Advocate-Business Tax                            2
Advocate-Co-opCondo Abatement                    4
Advocate-Personal Exemptions                    24
Advocate-Prop Refunds/Credits                   39
Air Quality                                   1457
Alzheimer's Care                                25
Animal Abuse                                  1282
Animal Facility - No Permit                     10
Animal in a Park                               456
Animal-Abuse    

```python
with pd.option_context("display.max_rows", 1000):
    display(...)
```

What this code is doing: temporarily raising the display limit to 1,000 rows so the full result is shown instead of a truncated preview, and rendering it with [rich output](https://ipython.readthedocs.io/en/stable/interactive/plotting.html#rich-outputs).

```python
requests.groupby('Complaint Type').size()
```

What this code is doing:

1. Group the records in the dataset based on their `Complaint Type` value
1. Count the records that have been grouped together by their shared `Complaint Type` value

How should we find junk records to delete?

It looks like most invalid complaint types only have a few records. Try excluding all complaint type categories with < 3 records, assuming that all complaint type categories with < 3 instances in the data are bad data entries.

Why 3? It's arbitrary. We're looking for trends in the data, and in this case we don't care about low frequency entries.

Create a DataFrame that captures the count of records per `Complaint Type` value.

In [8]:
# .reset_index(name="count") turns the result into a DataFrame and names the new column that contains the count of rows
counts = requests.groupby("Complaint Type").size().reset_index(name="count")
counts

,Complaint Type,count
0,$(sleep 11),1
1,(select extractvalue(xmltyp...,1
2,..././..././..././..././......,1
3,../../../../../../../../../...,2
4,../WEB-INF/web.xml,1
...,...,...
243,eval(compile('for x in rang...,1
244,file:///c:/windows/win.ini,1
245,idexf3mrb7)(!(objectClass=*),1
246,qfix4${695*589}lixaf,1


**Let's keep only the `Complaint Type` values that have record counts >= 3.**

The result of this filter is still a DataFrame. Remember: a single column from a pandas DataFrame is called a Series. It's essentially a labeled list containing all the values in the column.

In [9]:
valid_complaint_counts = counts[counts["count"] >= 3]
valid_complaint_counts

,Complaint Type,count
6,APPLIANCE,2539
7,Abandoned Vehicle,1655
8,Advocate - Other,26
11,Advocate-Co-opCondo Abatement,4
12,Advocate-Personal Exemptions,24
...,...,...
236,WATER LEAK,6641
237,Water Conservation,853
238,Water Quality,332
239,Water System,12949


Pull the `Complaint Type` column out of `valid_complaint_counts` as a Series named `valid_complaint_types`, then filter our `requests` DataFrame to only keep the rows where the `Complaint Type` value is in that Series. Save the result in a new DataFrame.

In [10]:
valid_complaint_types = valid_complaint_counts["Complaint Type"]
is_valid_complaint = requests["Complaint Type"].isin(valid_complaint_types)
requests_clean = requests[is_valid_complaint]
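
The same filter can also be written without the intermediate `counts` DataFrame. Here is a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical) that uses `groupby().transform()` to attach each row's group size to the row itself:

```python
import pandas as pd

# Hypothetical stand-in for `requests`: two real-looking types and one junk value
toy = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise",
            "Noise",
            "Noise",
            "../WEB-INF/web.xml",
            "Sewer",
            "Sewer",
            "Sewer",
            "Sewer",
        ]
    }
)

# For every row, the number of non-null rows that share its Complaint Type.
# transform() returns a Series aligned with the original rows, not one row per group.
rows_per_type = toy.groupby("Complaint Type")["Complaint Type"].transform("count")

# Keep rows whose complaint type appears at least 3 times
toy_clean = toy[rows_per_type >= 3]
print(toy_clean["Complaint Type"].unique())
```

The two-step version above is easier to inspect along the way, which is why it can be the better choice while exploring.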

How can we make sure this worked? Let's check how many records there were originally in `requests` vs how many are in `requests_clean`.

Before:

In [11]:
requests["Unique Key"].size

500000

After:

In [12]:
requests_clean["Unique Key"].size

499958

Great, now those invalid records will be excluded from our analysis!
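
A stricter check than comparing totals by eye is to confirm that the number of rows removed equals the sum of the counts that fell below the threshold. The sketch below runs that check on made-up data (the `toy` DataFrame is hypothetical); it assumes no missing `Complaint Type` values, since `groupby()` skips those by default while `isin()` would drop them.

```python
import pandas as pd

# Hypothetical stand-in for `requests`
toy = pd.DataFrame(
    {
        "Complaint Type": ["Noise"] * 3
        + ["$(sleep 11)"]
        + ["Sewer"] * 4
        + ["@(9313*3464)"] * 2
    }
)

# Same steps as in the cells above
counts = toy.groupby("Complaint Type").size().reset_index(name="count")
valid_types = counts.loc[counts["count"] >= 3, "Complaint Type"]
toy_clean = toy[toy["Complaint Type"].isin(valid_types)]

# Rows we expect to lose: every row belonging to a type with fewer than 3 records
expected_removed = counts.loc[counts["count"] < 3, "count"].sum()
actual_removed = len(toy) - len(toy_clean)

print(expected_removed, actual_removed)
```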

Another approach to excluding those invalid records would be to use [regular expressions ("RegExes")](https://www.w3schools.com/python/python_regex.asp) to find records with weird characters.
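
As a rough sketch of that idea, the example below flags values containing characters that show up in the junk entries above but are unlikely in a real complaint type. The `suspicious` pattern and the `toy` DataFrame are assumptions made for illustration. The pattern is not exhaustive: it would miss an entry like `file:///c:/windows/win.ini`, for example.

```python
import pandas as pd

# Hypothetical sample mixing real-looking complaint types with junk values
toy = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise - Residential",
            "HEAT/HOT WATER",
            "Alzheimer's Care",
            "$(sleep 11)",
            "../WEB-INF/web.xml",
            "@(9313*3464)",
            "qfix4${695*589}lixaf",
        ]
    }
)

# Assumed pattern: any of these symbols, or a "../" path traversal sequence
suspicious = r"[$@{}*=;<>]|\.\./"

# na=False treats missing values as "not suspicious" instead of returning NaN
looks_invalid = toy["Complaint Type"].str.contains(suspicious, regex=True, na=False)

toy_clean = toy[~looks_invalid]
print(toy_clean)
```

Unlike the count threshold, this approach keeps rare but legitimate complaint types, at the cost of having to maintain the pattern.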