In [1]:
# a bit of setup just for lecture; ignore me
import numpy as np

np.random.seed(5)

# Lecture 1: Working with data

_Please sign attendance sheet_

- How was the homework? 👍/👎
- Questions?
- Reminder about the [between-class participation](https://python-public-policy.afeld.me/en/columbia/syllabus.html#participation)

## Additional programming concepts

### Functions

- Functions == methods
- Arguments == parameters

The terms in each pair differ slightly in strict usage, but for simplicity we'll use them interchangeably.
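
For reference, here is a minimal sketch of where the terms technically differ. The `full_name` function is made up for illustration and isn't part of the course materials.

```python
# A function is defined on its own and called by name.
# `first` and `last` are its parameters (the names in the definition).
def full_name(first, last):
    return f"{first} {last}"


# "Ada" and "Lovelace" are the arguments (the values passed in the call).
name = full_name("Ada", "Lovelace")

# A method is a function attached to an object, called with a dot.
shouted = name.upper()

print(name)
print(shouted)
```

In everyday conversation, people mix these terms freely, which is why we do too.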

### Packages

- a.k.a. "libraries"
- Developers create them to make code/functionality reusable and easy to share
- Software plugins that you `import`
- Main packages we’ll use:
    - `pandas`
    - `plotly`

> A module is a file containing Python definitions and statements.

https://docs.python.org/3/tutorial/modules.html
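
As a rough sketch, these are the common ways to `import` from a module. It uses modules from Python's standard library (`math`, `statistics`) so that nothing needs to be installed; `pandas` and `plotly` are imported the same way.

```python
import math  # import a whole module
from statistics import mean  # import a single name from a module
import statistics as stats  # import a module under a shorter alias

root = math.sqrt(16)
average = mean([1, 2, 3, 4])
middle = stats.median([1, 2, 3, 4, 100])

print(root, average, middle)
```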

## Working with files in Python

We'll open the file, then read it line by line.

```python
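# open the file; the `with` block closes it automatically when done
with open("moby_dick.txt") as file:
    # looping over a file object gives one line at a time
    for line in file:
        # each line already ends with a newline, so don't print a second one
        print(line, end="")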
```

## Working with CSVs in pure Python

We will use Python's CSV [DictReader](https://docs.python.org/3/library/csv.html#csv.DictReader). We'll open the file, parse it as a CSV, then operate row by row.

### Example

```python
import csv


# set up the file object (the csv docs recommend newline="" when opening CSV files)
with open("people.csv", newline="") as csvfile:
    # set up the reader
    reader = csv.DictReader(csvfile)
    # loop through the rows
    for row in reader:
        # access the data in various columns
        first = row["first_name"]
        last = row["last_name"]

        print(f"{first} {last}")
```
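
To see what each `row` looks like without needing a file on disk, here is a small self-contained sketch. The CSV text is a made-up stand-in for `people.csv`, and `io.StringIO` is used only so that a string can be read as if it were a file.

```python
import csv
import io

# Hypothetical stand-in for the contents of people.csv
csv_text = "first_name,last_name\nAda,Lovelace\nGrace,Hopper\n"

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Each row is a dictionary keyed by the column names from the header line
print(rows[0])

names = [f"{row['first_name']} {row['last_name']}" for row in rows]
print(names)
```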

### [In-class exercise](https://python-public-policy.afeld.me/en/columbia/lecture_1_exercise.html)

## 311 requests

Who's called 311 before?

[NYC 311 homepage](https://portal.311.nyc.gov/)

### [311 data](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)

## Today's goal

- Which 311 complaints are most common?
- Which agencies are responsible for handling them?

## Pandas

- A Python package (bundled up code that you can reuse)
- Very common for data science in Python
- [A lot like R](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_r.html)
   - Both organize around "data frames"

### Start by importing necessary packages

In [2]:
import pandas as pd

### Read and save 311 Service Requests dataset as a pandas DataFrame

We're using a sample of the full dataset to make it easier/faster to work with. Loading it will still take a while (~30 seconds).

In [3]:
requests = pd.read_csv(
    "https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip"
)

Ignore the `DtypeWarning` for now; we'll come back to it.

## Preview the data contents

In [4]:
requests.head()  # returns the first 5 rows unless you specify a number

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
0,39885889,08/01/2018 12:05:13 AM,08/01/2018 12:05:13 AM,DOT,Department of Transportation,Street Condition,Pothole,,11235,3143 SHORE PARKWAY,...,,,,,,,,40.585156,-73.959119,"(40.585155533520144, -73.95911915841708)"
1,39886470,08/01/2018 12:06:05 AM,08/01/2018 12:06:05 AM,DOT,Department of Transportation,Street Condition,Pothole,,11235,3153 SHORE PARKWAY,...,,,,,,,,40.585218,-73.958608,"(40.58521848090658, -73.95860788382927)"
2,39893543,08/01/2018 12:06:16 AM,08/03/2018 02:03:55 PM,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,RESIDENTIAL BUILDING,11221,729 LAFAYETTE AVENUE,...,,,,,,,,40.690733,-73.943964,"(40.69073285353906, -73.943963521266)"
3,39886233,08/01/2018 12:06:29 AM,08/01/2018 02:54:24 AM,NYPD,New York City Police Department,Noise - Residential,Banging/Pounding,Residential Building/House,11693,82-01 BEACH CHANNEL DRIVE,...,,,,,,,,40.589931,-73.808896,"(40.58993080750793, -73.80889570815852)"
4,39880309,08/01/2018 12:06:51 AM,08/01/2018 04:54:26 AM,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11216,64 HERKIMER STREET,...,,,,,,,,40.679716,-73.951234,"(40.67971590505359, -73.95123396494363)"


In [5]:
requests.tail(10)  # last 10 records in the DataFrame

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
499990,43622247,08/24/2019 01:33:43 AM,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11385.0,578 FAIRVIEW AVENUE,...,,,,,,,,40.707576,-73.907325,"(40.70757578135031, -73.90732527364065)"
499991,43620877,08/24/2019 01:34:32 AM,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11201.0,160 NAVY WALK,...,,,,,,,,40.693967,-73.98021,"(40.6939671536727, -73.98020958205214)"
499992,43619232,08/24/2019 01:38:44 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,11238.0,981 DEAN STREET,...,,,,,,,,40.67803,-73.95762,"(40.67803039848778, -73.95762012074778)"
499993,43626613,08/24/2019 01:43:57 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Talking,Street/Sidewalk,10023.0,WEST 65 STREET,...,,,,,,,,40.775372,-73.98771,"(40.7753720958196, -73.98770974232366)"
499994,43619756,08/24/2019 01:44:27 AM,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,11208.0,211 NICHOLS AVENUE,...,,,,,,,,40.685495,-73.868876,"(40.68549515215576, -73.8688762456483)"
499995,43622302,08/24/2019 01:46:09 AM,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10009.0,431 EAST 9 STREET,...,,,,,,,,40.727536,-73.983295,"(40.72753608835362, -73.98329522742081)"
499996,43619709,08/24/2019 01:49:49 AM,,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10304.0,191 BROAD STREET,...,,,,,,,,40.624157,-74.081006,"(40.62415703282506, -74.08100614362155)"
499997,43623124,08/24/2019 01:56:35 AM,,NYPD,New York City Police Department,Noise - Street/Sidewalk,Loud Music/Party,Street/Sidewalk,10031.0,534 WEST 153 STREET,...,,,,,,,,40.830718,-73.945006,"(40.83071800761314, -73.94500557250639)"
499998,43625595,08/24/2019 01:56:40 AM,,NYPD,New York City Police Department,Noise - Commercial,Loud Music/Party,Club/Bar/Restaurant,10452.0,EAST 170 STREET,...,,,,,,,,40.839882,-73.916783,"(40.839882158779105, -73.91678321635897)"
499999,43622817,08/24/2019 01:57:58 AM,,NYPD,New York City Police Department,Noise - Commercial,Loud Music/Party,Store/Commercial,10033.0,247 AUDUBON AVENUE,...,,,,,,,,40.846376,-73.934048,"(40.84637632367179, -73.93404825809533)"


In [6]:
requests.sample(5)  # random sample; you choose the size

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Vehicle Type,Taxi Company Borough,Taxi Pick Up Location,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Latitude,Longitude,Location
311150,42044618,03/25/2019 10:16:38 PM,03/25/2019 11:47:09 PM,NYPD,New York City Police Department,Noise - Residential,Loud Talking,Residential Building/House,10453.0,1800 POPHAM AVENUE,...,,,,,,,,40.851389,-73.917651,"(40.85138859491611, -73.91765079814148)"
32142,40107788,08/25/2018 10:24:40 AM,08/25/2018 04:30:08 PM,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,11419.0,133-05 107 AVENUE,...,,,,,,,,40.687721,-73.811536,"(40.6877211401399, -73.8115361385404)"
120849,40692417,10/29/2018 07:23:28 AM,11/01/2018 09:52:41 PM,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,APARTMENT ONLY,RESIDENTIAL BUILDING,10040.0,660 FT WASHINGTON AVENUE,...,,,,,,,,40.856431,-73.936092,"(40.856431325229096, -73.93609157070478)"
471582,43386392,07/26/2019 11:17:21 PM,07/27/2019 02:36:26 AM,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10303.0,351 HARBOR ROAD,...,,,,,,,,40.62751,-74.160446,"(40.62751009515846, -74.16044618563123)"
455213,43253780,07/11/2019 03:23:28 PM,07/12/2019 11:25:43 AM,NYPD,New York City Police Department,Illegal Parking,Double Parked Blocking Traffic,Street/Sidewalk,11204.0,61 STREET,...,,,,,,,,,,
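
The three preview methods can also be tried on a tiny DataFrame, where it's easier to see which rows come back. This is only a sketch: the data below is a made-up miniature stand-in for the 311 dataset, and `random_state` is an optional argument used here so the random pick is repeatable.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 311 data
df = pd.DataFrame(
    {
        "Agency": ["DOT", "DOT", "HPD", "NYPD", "NYPD", "NYPD"],
        "Complaint Type": [
            "Street Condition",
            "Street Condition",
            "HEAT/HOT WATER",
            "Noise - Residential",
            "Noise - Residential",
            "Illegal Parking",
        ],
    }
)

first_two = df.head(2)  # first 2 rows
last_three = df.tail(3)  # last 3 rows
random_two = df.sample(2, random_state=0)  # 2 rows picked at random

print(first_two)
print(last_three)
print(random_two)
```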


## Pandas data structures

<!-- source: https://docs.google.com/document/d/1HGw2BdbuXSIwcgDWXkzZGPXYr5yJ_WEM3Gw-nLoHzCo/edit#heading=h.7z4rqdvodt9j -->

![Diagram showing a DataFrame, Series, labels, and indexes](extras/img/data_structures-1.png)
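
The pieces named in the diagram can be poked at in code. A minimal sketch, using a small made-up DataFrame rather than the 311 data:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {
        "Agency": ["DOT", "HPD", "NYPD"],
        "Incident Zip": ["11235", "11221", "11693"],
    }
)

# Selecting a single column by its label gives a Series
agencies = df["Agency"]
print(type(df).__name__)
print(type(agencies).__name__)

# Column labels and the row index
column_labels = list(df.columns)
row_index = list(df.index)
print(column_labels)
print(row_index)

# The Series keeps the DataFrame's index, so rows can be looked up by label
second_agency = agencies.loc[1]
print(second_agency)
```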

## DataFrame information

In [7]:
requests.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 41 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   Unique Key                      500000 non-null  int64  
 1   Created Date                    500000 non-null  object 
 2   Closed Date                     476156 non-null  object 
 3   Agency                          500000 non-null  object 
 4   Agency Name                     500000 non-null  object 
 5   Complaint Type                  500000 non-null  object 
 6   Descriptor                      492534 non-null  object 
 7   Location Type                   392590 non-null  object 
 8   Incident Zip                    480411 non-null  object 
 9   Incident Address                434544 non-null  object 
 10  Street Name                     434519 non-null  object 
 11  Cross Street 1                  300838 non-null  object 
 12  Cross Street 2  

## Demo

### Analysis

#### Which complaints are most common?

In [11]:
# code goes here
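
One possible approach, sketched on made-up data rather than the real `requests` DataFrame: `value_counts()` counts how often each value appears in a column and sorts the result from most to least frequent.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 311 data
complaints = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise - Residential",
            "Illegal Parking",
            "Noise - Residential",
            "HEAT/HOT WATER",
            "Noise - Residential",
            "Illegal Parking",
        ]
    }
)

# Count occurrences of each complaint type, most common first
counts = complaints["Complaint Type"].value_counts()
print(counts)
```

With the real data, the same call would be made on `requests["Complaint Type"]`.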

#### What's the most frequent request per agency?

In [12]:
# code goes here

- [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#grouping) is similar to [pivot tables](https://support.google.com/docs/answer/1272900) in spreadsheets
- [`to_frame()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html)
- [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html)
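
Here is a sketch of how these three methods can fit together to find the most frequent request per agency. It runs on made-up data and is one approach among several, not necessarily the one used in the demo.

```python
import pandas as pd

# Hypothetical miniature stand-in for the 311 data
df = pd.DataFrame(
    {
        "Agency": ["DOT", "DOT", "HPD", "NYPD", "NYPD", "NYPD"],
        "Complaint Type": [
            "Street Condition",
            "Street Condition",
            "HEAT/HOT WATER",
            "Noise - Residential",
            "Noise - Residential",
            "Illegal Parking",
        ],
    }
)

# Count rows for each (agency, complaint type) pair; the result is a Series
sizes = df.groupby(["Agency", "Complaint Type"]).size()

# to_frame() turns the Series into a one-column DataFrame named "count";
# reset_index() moves the group labels out of the index into regular columns
counts = sizes.to_frame("count").reset_index()

# Keep only the most frequent complaint type within each agency
top = (
    counts.sort_values("count", ascending=False)
    .drop_duplicates("Agency")
    .sort_values("Agency")
    .reset_index(drop=True)
)
print(top)
```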

### Exclude bad records from the DataFrame

Let's look at the complaint types.

In [9]:
# code goes here

How should we go about cleaning those up?

In [10]:
# code goes here
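
A hedged sketch of one way to look at the values and drop bad ones. The data and the `bad_values` list are invented for illustration; which values count as "bad" in the real dataset is a judgment call made by looking at it.

```python
import pandas as pd

# Hypothetical data with a couple of junk entries mixed in
df = pd.DataFrame(
    {
        "Complaint Type": [
            "Noise - Residential",
            "Illegal Parking",
            "!!junk!!",
            "Noise - Residential",
            "12345",
        ]
    }
)

# List the distinct values to spot entries that don't look like real complaint types
print(df["Complaint Type"].unique())

# Hypothetical list of values judged to be invalid
bad_values = ["!!junk!!", "12345"]

# isin() gives True/False for each row; ~ flips it, keeping rows NOT in the list
is_bad = df["Complaint Type"].isin(bad_values)
cleaned = df[~is_bad]
print(cleaned)
```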

## [Best practices](https://python-public-policy.afeld.me/en/columbia/assignments.html#tips)

## [Homework 1](https://python-public-policy.afeld.me/en/columbia/hw_1.html)