In [1]:
import pandas as pd

# Reading and writing data

Pandas can read data from many formats.

A common format is 'csv', short for comma-separated values. A csv file is just what the name says: tabular data stored as plain text, one row per line, with the values in each row separated by commas.
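To make the format concrete, here is a minimal sketch that parses a small csv string held in memory. The sample rows are made up for illustration, and `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical sample data: a header row, then one line per record.
csv_text = """name,measure,data_value
Ozone (O3),Mean,30.2
Nitrogen dioxide (NO2),Mean,10.8
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
```

Each comma in the text becomes a column boundary, and the first line supplies the column names.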

Read a csv file with the `pd.read_csv()` function:

1. from an online source:

In [2]:
pd.read_csv('https://data.cityofnewyork.us/resource/c3uy-2p5r.csv')

Unnamed: 0,unique_id,indicator_id,name,measure,measure_info,geo_type_name,geo_join_id,geo_place_name,time_period,start_date,data_value,message
0,827029,386,Ozone (O3),Mean,ppb,CD,208,Riverdale and Fieldston (CD8),Summer 2022,2022-06-01T00:00:00.000,30.2,
1,827082,386,Ozone (O3),Mean,ppb,UHF34,102,Northeast Bronx,Summer 2022,2022-06-01T00:00:00.000,32.2,
2,827136,386,Ozone (O3),Mean,ppb,UHF42,201,Greenpoint,Summer 2022,2022-06-01T00:00:00.000,32.4,
3,823412,365,Fine particles (PM 2.5),Mean,mcg/m3,UHF34,211,Williamsburg - Bushwick,Summer 2022,2022-06-01T00:00:00.000,7.0,
4,827080,386,Ozone (O3),Mean,ppb,UHF34,104,Pelham - Throgs Neck,Summer 2022,2022-06-01T00:00:00.000,33.3,
...,...,...,...,...,...,...,...,...,...,...,...,...
995,741041,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,204,East New York,Summer 2021,2021-06-01T00:00:00.000,10.8,
996,741083,375,Nitrogen dioxide (NO2),Mean,ppb,UHF34,403,Flushing - Clearview,Summer 2021,2021-06-01T00:00:00.000,11.7,
997,742442,365,Fine particles (PM 2.5),Mean,mcg/m3,CD,205,Fordham and University Heights (CD5),Summer 2021,2021-06-01T00:00:00.000,8.7,
998,743709,386,Ozone (O3),Mean,ppb,UHF42,211,Williamsburg - Bushwick,Summer 2021,2021-06-01T00:00:00.000,29.3,


2. from a file on your computer, specified by the absolute (full) file path:

In [3]:
pd.read_csv("C:/Users/dlevine/Downloads/Air_Quality_20250311.csv")

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,179772,640,Boiler Emissions- Total SO2 Emissions,Number per km2,number,UHF42,409.0,Southeast Queens,2015,01/01/2015,0.3,
1,221956,386,Ozone (O3),Mean,ppb,UHF34,305307.0,Upper East Side-Gramercy,Summer 2014,06/01/2014,24.9,
2,221806,386,Ozone (O3),Mean,ppb,UHF34,103.0,Fordham - Bronx Pk,Summer 2014,06/01/2014,30.7,
3,221836,386,Ozone (O3),Mean,ppb,UHF34,204.0,East New York,Summer 2014,06/01/2014,32.0,
4,221812,386,Ozone (O3),Mean,ppb,UHF34,104.0,Pelham - Throgs Neck,Summer 2014,06/01/2014,31.9,
...,...,...,...,...,...,...,...,...,...,...,...,...
18020,816914,643,Annual vehicle miles traveled,Million miles,per square mile,CD,503.0,Tottenville and Great Kills (CD3),2019,01/01/2019,12.9,
18021,816913,643,Annual vehicle miles traveled,Million miles,per square mile,CD,503.0,Tottenville and Great Kills (CD3),2010,01/01/2010,14.7,
18022,816872,643,Annual vehicle miles traveled,Million miles,per square mile,UHF42,208.0,Canarsie - Flatlands,2010,01/01/2010,43.4,
18023,816832,643,Annual vehicle miles traveled,Million miles,per square mile,UHF42,407.0,Southwest Queens,2010,01/01/2010,65.8,


3. from a file on your computer, specified by the _relative_ path:

In [4]:
pd.read_csv('../Data/Source Data/Air_Quality_20250311.csv')

Unnamed: 0,Unique ID,Indicator ID,Name,Measure,Measure Info,Geo Type Name,Geo Join ID,Geo Place Name,Time Period,Start_Date,Data Value,Message
0,179772,640,Boiler Emissions- Total SO2 Emissions,Number per km2,number,UHF42,409.0,Southeast Queens,2015,01/01/2015,0.3,
1,221956,386,Ozone (O3),Mean,ppb,UHF34,305307.0,Upper East Side-Gramercy,Summer 2014,06/01/2014,24.9,
2,221806,386,Ozone (O3),Mean,ppb,UHF34,103.0,Fordham - Bronx Pk,Summer 2014,06/01/2014,30.7,
3,221836,386,Ozone (O3),Mean,ppb,UHF34,204.0,East New York,Summer 2014,06/01/2014,32.0,
4,221812,386,Ozone (O3),Mean,ppb,UHF34,104.0,Pelham - Throgs Neck,Summer 2014,06/01/2014,31.9,
...,...,...,...,...,...,...,...,...,...,...,...,...
18020,816914,643,Annual vehicle miles traveled,Million miles,per square mile,CD,503.0,Tottenville and Great Kills (CD3),2019,01/01/2019,12.9,
18021,816913,643,Annual vehicle miles traveled,Million miles,per square mile,CD,503.0,Tottenville and Great Kills (CD3),2010,01/01/2010,14.7,
18022,816872,643,Annual vehicle miles traveled,Million miles,per square mile,UHF42,208.0,Canarsie - Flatlands,2010,01/01/2010,43.4,
18023,816832,643,Annual vehicle miles traveled,Million miles,per square mile,UHF42,407.0,Southwest Queens,2010,01/01/2010,65.8,


(Relative paths are resolved from the current working directory, which in a notebook is usually the folder the notebook is in. They make it possible to package up all your code and data for someone else to run on their own computer: with the complete folder, including the same file structure, anyone else can run the same code referencing the same files.)
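As a rough sketch of what "relative" means here, the standard-library `pathlib` module can show where a relative path points on the current machine. The result depends on the working directory, so the printed location will differ from one computer to another.

```python
from pathlib import Path

# The relative path used above; '..' means "one folder up" from the working directory.
relative = Path('../Data/Source Data/Air_Quality_20250311.csv')

# Combine it with the current working directory to see the full location
# it refers to on this machine (the file does not need to exist for this).
absolute = (Path.cwd() / relative).resolve()

print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True
print(absolute)
```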

### Some other data formats:

#### json

- Short for [JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON), but don't memorize that.
- Follows a structure similar to Python lists and dicts, e.g.:
```json
{
  "first_name": "John",
  "last_name": "Smith",
  "is_alive": true,
  "age": 27,
  "address": {
    "street_address": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postal_code": "10021-3100"
  },
  "phone_numbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
    "Catherine",
    "Thomas",
    "Trevor"
  ],
  "spouse": null
}
```

- A common format for data passed around websites.
- Not necessarily tabular (i.e. not neatly 2-dimensional), but can represent tabular data, e.g. https://data.cityofnewyork.us/resource/c3uy-2p5r.json
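The sketch below illustrates both cases with small made-up samples: a list of flat records, which `pd.read_json()` turns directly into rows and columns, and a nested object, which `pd.json_normalize()` flattens into a single row. This is one possible approach under those assumptions, not the only way to load json.

```python
import io
import pandas as pd

# A list of flat records maps directly onto rows and columns.
# (Hypothetical sample values for illustration.)
records_json = """[
  {"name": "Ozone (O3)", "data_value": 30.2},
  {"name": "Nitrogen dioxide (NO2)", "data_value": 10.8}
]"""
flat = pd.read_json(io.StringIO(records_json))

# A nested object is not tabular; json_normalize flattens the inner dict
# into columns with dotted names such as 'address.city'.
nested = {"first_name": "John", "address": {"city": "New York", "state": "NY"}}
flattened = pd.json_normalize(nested)

print(flat)
print(flattened)
```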

#### Excel

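- Excel files bundle the data, formulas, layouts and formatting of multiple sheets in a single file that is not human-readable as plain text.
- Pandas can read them with `pd.read_excel()`, but you should specify the sheet (`sheet_name`) and the range to read (for example with `header`, `usecols`, `skiprows` or `nrows`). Otherwise pandas reads the first sheet from the top, including any title rows and footnotes.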

In [7]:
pd.read_excel('../Data/Source Data/Arraignments_thruJan2025.xlsx')

Unnamed: 0,2025,Arraignments (cases continued at arraignment),Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28
0,,Misd- missing,Misd- remand,Misd-$1 bail,Misd- money bail,Misd-ROR,Misd- Sup Rel,Misd Total,NVFO- missing,NVFO- remand,...,VFO-ROR,VFO- Sup Rel,VFO Total,Arraignments- missing,Arraignments- remand,Arraignments- $1 bail,Arraignments-money bail set,Arraignments-ROR,Arraignments- Sup Rel,Arraignments Total
1,January,2,0,314,265,5373,1549,7503,1,13,...,346,490,1688,17,96,394,1343,6361,2732,10943
2,February,,,,,,,,,,...,,,,,,,,,,
3,March,,,,,,,,,,...,,,,,,,,,,
4,April,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
102,December,4,17,149,142,4025,239,4576,1,14,...,367,162,1010,7,69,221,745,5129,611,6782
103,,,,,,,,,,,...,,,,,,,,,,
104,Source:,,,,,,,,,,...,,,,,,,,,,
105,CJA,,,,,,,,,,...,,,,,,,,,,


In [8]:
pd.read_excel(
    '../Data/Source Data/Arraignments_thruJan2025.xlsx',
    sheet_name='Arraignments HISTORICAL',
    header=1
)

Unnamed: 0,Year,Misdemeanor/ Violation,Nonviolent felony,Violent Felony,Remand,Bail Set,ROR,Supervised Release,Unnamed: 8,Unnamed: 9,Unnamed: 10,Source
0,1993,61121,53514,36445,0.02,0.46,0.52,,,,,*1993-2019 https://www.nycja.org/publications/...
1,1994,77741,57191,36125,0.02,0.46,0.53,,,,,"2020 - 2024:calculated from monthly tab, origi..."
2,1995,82152,58089,34588,0.02,0.44,0.54,,,,,
3,1996,87933,57551,30398,0.02,0.46,0.52,,,,,
4,1997,96113,54698,28820,0.02,0.43,0.55,,,,,
5,1998,92091,54909,27117,0.02,0.44,0.55,,,,,
6,1999,89762,47088,23147,0.02,0.43,0.55,,,,,
7,2000,102020,43538,21518,0.02,0.42,0.57,,,,,
8,2001,93190,37728,20350,0.02,0.41,0.57,,,,,
9,2002,94435,37771,19931,0.02,0.41,0.57,,,,,


## Writing data

A dataframe's `.to_csv()` method writes its data out to a csv file.

- pass an absolute or relative file path to save to
- by default the index column is saved as well; pass `index=False` to skip it
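To see what `index=False` changes, here is a minimal sketch using a small made-up dataframe. It relies on the fact that `.to_csv()` returns the csv text as a string when no path is given, so nothing is written to disk.

```python
import pandas as pd

# Hypothetical two-row dataframe for illustration.
df = pd.DataFrame({'year': [1993, 1994], 'ror': [0.52, 0.53]})

# With no path, to_csv returns the csv text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # header starts with an empty, unnamed index column
print(without_index)  # header contains only 'year' and 'ror'
```

Reading back a file saved with the index is what produces an extra `Unnamed: 0` column, which is why `index=False` is often the more convenient choice.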

The `.to_clipboard()` method copies the data to the clipboard (to paste into Excel, Google Sheets or a document).

The `.to_excel()` method saves the data to an Excel file.

- formatting can be added as well, for example to automate a process that creates a report from a template