In [1]:
import pandas as pd

The first step is importing the `pandas` package. Imports bring in new functionality to extend the base Python language. Pandas gives us tools for working with data.

By convention, `pandas` is abbreviated to `pd` using the `as` keyword.

(The entire pandas package is now available in your notebook. You can access `pandas` functions with the `.` accessor, as seen below.)
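
As a small illustration of the `.` accessor, the sketch below calls two functions that live in the pandas package. The dates are made up for the example and are not taken from any dataset used in this notebook.

```python
import pandas as pd

# pd.to_datetime converts a date string into a pandas Timestamp
start = pd.to_datetime("2023-02-17")

# pd.date_range builds a sequence of consecutive dates (daily by default)
three_days = pd.date_range("2023-02-17", periods=3)

print(start.year)       # 2023
print(len(three_days))  # 3
```

Both calls follow the same pattern: the package alias `pd`, a dot, then the function name.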

## The `DataFrame`

The primary object used with the `pandas` package is the `DataFrame`. 

This is a flexible container for storing rows and columns of data.

For starters, you can think of a DataFrame as being like an Excel spreadsheet (though we'll see that DataFrames are more flexible and powerful).

First, we'll create a toy DataFrame and assign it to the variable name `toy_dataframe` (with the `=` for assignment).

In [2]:
toy_dataframe = pd.DataFrame(
    data={
        'color':['red','blue','yellow'],
        'number':[3,6,5],
        'flavor':['vanilla','chocolate','strawberry']
    }
)

(Don't worry too much about the syntax here because you'll hardly ever construct DataFrames this way.)

To show a DataFrame in a notebook, enter its name:

In [3]:
toy_dataframe

Unnamed: 0,color,number,flavor
0,red,3,vanilla
1,blue,6,chocolate
2,yellow,5,strawberry


A simple table, with rows and columns. 

Now let's get some real data.

In [4]:
film_permits = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/tg4x-b46p/rows.csv?accessType=DOWNLOAD',
    parse_dates=['StartDateTime','EndDateTime','EnteredOn'],
    date_format='%m/%d/%Y %H:%M:%S %p'
)

## DataFrame methods

Pandas DataFrames make it easy to work with data because DataFrames come with many helpful, built-in methods to view and transform the data.

(What's a _method_?

It's a pre-defined function that acts on an object.

_What??_ Because the data was constructed as a `pd.DataFrame`, that DataFrame automatically comes with all the data-transforming functions that the pandas package provides.

_What???_ In practice, this means that you don't need to do much programming from scratch: you can use pre-defined methods for almost all common data tasks.)

For example, DataFrames have a `.head()` method that returns the first 5 rows of a DataFrame.

In [5]:
film_permits.head()

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,696255,Shooting Permit,2023-02-17 09:00:00,2023-02-18 12:00:00,2023-02-14 10:47:33,Mayor's Office of Media & Entertainment,KINGSLAND AVENUE between DEAD END and GREENPOI...,Brooklyn,1,94,Television,Cable-episodic,United States of America,11222
1,714139,Shooting Permit,2023-05-12 01:00:00,2023-05-13 05:00:00,2023-05-04 02:27:51,Mayor's Office of Media & Entertainment,WEST 26 STREET between 12 AVENUE and 11 AVEN...,Manhattan,4,10,Television,Episodic series,United States of America,10001
2,705334,Shooting Permit,2023-04-10 09:00:00,2023-04-10 10:00:00,2023-03-30 05:17:29,Mayor's Office of Media & Entertainment,SOUTH STREET between BROAD STREET and OLD SLIP...,Manhattan,"1, 3","1, 5",Film,Feature,United States of America,"10002, 10004, 10005"
3,746696,Shooting Permit,2023-11-07 06:00:00,2023-11-07 10:00:00,2023-10-31 12:08:00,Mayor's Office of Media & Entertainment,NORTH HENRY STREET between GREENPOINT AVENUE a...,Brooklyn,1,94,Commercial,Commercial,United States of America,11222
4,717328,Theater Load in and Load Outs,2023-05-31 12:01:00,2023-06-01 06:00:00,2023-05-17 11:55:51,Mayor's Office of Media & Entertainment,WEST 55 STREET between 11 AVENUE and 12 AVEN...,Manhattan,4,18,Theater,Theater,United States of America,10019


(Methods are invoked by following a DataFrame reference with a dot (`.`), the name of the method, and then parentheses, as in `.head()`.)
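
To make the role of the parentheses concrete, here is a minimal sketch on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration only).

```python
import pandas as pd

toy = pd.DataFrame({"color": ["red", "blue", "yellow"], "number": [3, 6, 5]})

# With parentheses, the method runs and returns a DataFrame
first_rows = toy.head()

# Without parentheses, you only get a reference to the method itself
method_reference = toy.head

print(type(first_rows).__name__)   # DataFrame
print(callable(method_reference))  # True
```

If a cell prints something like `<bound method ...>` instead of a table, the parentheses are probably missing.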

Methods can take _arguments_ to further specify their behavior.

For example, `head()` accepts an argument, `n`, for the number of rows to return:

In [6]:
film_permits.head(n=2)

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,696255,Shooting Permit,2023-02-17 09:00:00,2023-02-18 12:00:00,2023-02-14 10:47:33,Mayor's Office of Media & Entertainment,KINGSLAND AVENUE between DEAD END and GREENPOI...,Brooklyn,1,94,Television,Cable-episodic,United States of America,11222
1,714139,Shooting Permit,2023-05-12 01:00:00,2023-05-13 05:00:00,2023-05-04 02:27:51,Mayor's Office of Media & Entertainment,WEST 26 STREET between 12 AVENUE and 11 AVEN...,Manhattan,4,10,Television,Episodic series,United States of America,10001


The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) for the `head()` method explains exactly what it does and what arguments it takes. It is a good habit to check the documentation for a method to understand how it works.
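
As a quick sketch of how the `n` argument behaves, the example below uses a small made-up DataFrame (the name `toy` and its values are assumptions for illustration). It shows the default value and the two ways of passing the argument.

```python
import pandas as pd

toy = pd.DataFrame({"number": [3, 6, 5, 8, 1, 9, 2]})

default_rows = toy.head()    # n defaults to 5
by_keyword = toy.head(n=2)   # argument passed by name
by_position = toy.head(2)    # same argument passed by position

print(len(default_rows))               # 5
print(by_keyword.equals(by_position))  # True
```

In a notebook, running `help(pd.DataFrame.head)` prints the same documentation without leaving the page.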

## Viewing and manipulating data

Almost all methods called on DataFrames return a new result, a temporary _view_ of the data; they do not change the underlying data itself.

This is very helpful when you are exploring data, and especially when you are starting out and learning how methods and functions work. You can test out transformations and see the results without making any permanent changes to the underlying data.

In [10]:
(
    film_permits
    .query('Borough == "Brooklyn"')
    .sort_values('EventID')
    [['StartDateTime','Category','EventType']]
    .tail(10)
)

Unnamed: 0,StartDateTime,Category,EventType
1883,2024-10-24 02:00:00,WEB,Shooting Permit
10134,2024-10-28 09:00:00,Film,Rigging Permit
9601,2024-10-25 07:00:00,Television,Shooting Permit
1197,2024-10-24 07:00:00,Television,Shooting Permit
10125,2024-10-28 06:00:00,Commercial,Shooting Permit
4889,2024-10-26 06:00:00,Commercial,Shooting Permit
10138,2024-10-28 06:00:00,Commercial,Shooting Permit
2435,2024-10-24 12:30:00,Theater,Theater Load in and Load Outs
8344,2024-10-27 06:00:00,Commercial,Shooting Permit
10135,2024-10-28 06:00:00,Television,Shooting Permit


For instance, this chain of methods filters the data to Brooklyn, sorts it by `EventID`, and shows just a few selected columns for the last 10 rows, that is, the 10 Brooklyn permits with the highest `EventID`. But this is just a view. We still have all the original data:

In [7]:
film_permits

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,696255,Shooting Permit,2023-02-17 09:00:00,2023-02-18 12:00:00,2023-02-14 10:47:33,Mayor's Office of Media & Entertainment,KINGSLAND AVENUE between DEAD END and GREENPOI...,Brooklyn,1,94,Television,Cable-episodic,United States of America,11222
1,714139,Shooting Permit,2023-05-12 01:00:00,2023-05-13 05:00:00,2023-05-04 02:27:51,Mayor's Office of Media & Entertainment,WEST 26 STREET between 12 AVENUE and 11 AVEN...,Manhattan,4,10,Television,Episodic series,United States of America,10001
2,705334,Shooting Permit,2023-04-10 09:00:00,2023-04-10 10:00:00,2023-03-30 05:17:29,Mayor's Office of Media & Entertainment,SOUTH STREET between BROAD STREET and OLD SLIP...,Manhattan,"1, 3","1, 5",Film,Feature,United States of America,"10002, 10004, 10005"
3,746696,Shooting Permit,2023-11-07 06:00:00,2023-11-07 10:00:00,2023-10-31 12:08:00,Mayor's Office of Media & Entertainment,NORTH HENRY STREET between GREENPOINT AVENUE a...,Brooklyn,1,94,Commercial,Commercial,United States of America,11222
4,717328,Theater Load in and Load Outs,2023-05-31 12:01:00,2023-06-01 06:00:00,2023-05-17 11:55:51,Mayor's Office of Media & Entertainment,WEST 55 STREET between 11 AVENUE and 12 AVEN...,Manhattan,4,18,Theater,Theater,United States of America,10019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,816501,Shooting Permit,2024-10-21 07:00:00,2024-10-21 07:00:00,2024-10-16 05:05:58,Mayor's Office of Media & Entertainment,WASHINGTON STREET between MORRIS STREET and BA...,Manhattan,1,1,Film,Feature,United States of America,"10004, 10006"
9995,816574,Shooting Permit,2024-10-21 06:00:00,2024-10-21 11:59:00,2024-10-17 10:51:34,Mayor's Office of Media & Entertainment,MONITOR STREET between GREENPOINT AVENUE and N...,Brooklyn,1,94,Television,Episodic series,United States of America,11222
9996,816430,Shooting Permit,2024-10-20 08:00:00,2024-10-21 06:00:00,2024-10-16 01:42:26,Mayor's Office of Media & Entertainment,5 AVENUE between EAST 88 STREET and EAST 8...,Manhattan,"64, 8","19, 22",Commercial,Commercial,United States of America,10128
9997,816596,Shooting Permit,2024-10-21 06:00:00,2024-10-21 10:00:00,2024-10-17 11:52:56,Mayor's Office of Media & Entertainment,"NOBLE STREET between WEST STREET and DEAD END,...",Brooklyn,1,94,Television,Cable-episodic,United States of America,11222


If you _do_ want to store the result of a transformation, just assign it to a variable. By assigning the result to a _new_ variable, you will avoid overwriting the source data.

In [None]:
last_10_brooklyn_permits = (
    film_permits
    .query('Borough == "Brooklyn"')
    .sort_values('EventID')
    [['StartDateTime','Category','EventType']]
    .tail(10)
)

(A good reminder: if your cell shows a DataFrame as its result, you are looking at only a temporary view. If you assign the result to a variable, as above, the cell will _not_ display anything.)
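
The same behavior can be checked on a small made-up DataFrame (the names `toy` and `sorted_toy` are assumptions for illustration). This sketch sorts the data, stores the result under a new name, and confirms that the original keeps its row order.

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "blue", "yellow"],
    "number": [3, 6, 5],
})

# Sorting returns a new DataFrame; `toy` keeps its original row order
sorted_toy = toy.sort_values("number")

print(toy["number"].tolist())         # [3, 6, 5]
print(sorted_toy["number"].tolist())  # [3, 5, 6]
```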

## Series

DataFrames are composed of Series. You can think of a DataFrame as a table, and a Series as a single row or column.

In [11]:
film_permits['Category']

0        Television
1        Television
2              Film
3        Commercial
4           Theater
            ...    
10136    Television
10137       Theater
10138    Commercial
10139    Television
10140          Film
Name: Category, Length: 10141, dtype: object

This column is a Series. You can select a single column as a Series by putting the column name in square brackets.

In [12]:
film_permits.iloc[10]

EventID                                                         693518
EventType                                              Shooting Permit
StartDateTime                                      2023-02-02 07:00:00
EndDateTime                                        2023-02-02 11:00:00
EnteredOn                                          2023-01-30 11:09:47
EventAgency                    Mayor's Office of Media & Entertainment
ParkingHeld          CALYER STREET between DIAMOND STREET and JEWEL...
Borough                                                       Brooklyn
CommunityBoard(s)                                                    1
PolicePrecinct(s)                                                   94
Category                                                    Television
SubCategoryName                                        Episodic series
Country                                       United States of America
ZipCode(s)                                                       11222
Name: 

This row is also a Series. You can select a row using the `.loc` or `.iloc` indexers. We'll get there later.

Mostly, Series behave just like one-dimensional DataFrames and have most of the same methods.
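
A minimal sketch of these ideas on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration): a column and a row are both Series, and a Series accepts familiar methods such as `sort_values()` and `head()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "blue", "yellow"],
    "number": [3, 6, 5],
})

column = toy["number"]  # one column, selected by name
row = toy.iloc[0]       # one row, selected by position

print(type(column).__name__)  # Series
print(type(row).__name__)     # Series

# Series share many methods with DataFrames
print(column.sort_values().head(2).tolist())  # [3, 5]
```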

## Columns and Index

DataFrames have labeled rows and columns. As in the example above, `Category` is the name of one column in the DataFrame. To show all column names, use:

In [None]:
film_permits.columns

Index(['EventID', 'EventType', 'StartDateTime', 'EndDateTime', 'EnteredOn',
       'EventAgency', 'ParkingHeld', 'Borough', 'CommunityBoard(s)',
       'PolicePrecinct(s)', 'Category', 'SubCategoryName', 'Country',
       'ZipCode(s)'],
      dtype='object')

This DataFrame has not had row labels set, so by default the rows are labeled with incremental numbers:

In [None]:
film_permits.index

RangeIndex(start=0, stop=9999, step=1)
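
For contrast, here is a sketch of what happens when row labels are set. It uses a small made-up DataFrame and the `set_index()` method; the names `toy` and `by_color` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "blue", "yellow"],
    "number": [3, 6, 5],
})

# Default row labels are incremental numbers
print(toy.index)  # RangeIndex(start=0, stop=3, step=1)

# set_index returns a new DataFrame whose row labels come from a column
by_color = toy.set_index("color")

print(by_color.index.tolist())    # ['red', 'blue', 'yellow']
print(by_color.columns.tolist())  # ['number']
```

Note that `color` moves out of the columns and into the index, and that `toy` itself is left unchanged.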

## Selecting columns

To select more than one column, put a list of column names inside the square brackets, which gives double brackets:

In [13]:
film_permits[['Category','SubCategoryName']]

Unnamed: 0,Category,SubCategoryName
0,Television,Cable-episodic
1,Television,Episodic series
2,Film,Feature
3,Commercial,Commercial
4,Theater,Theater
...,...,...
10136,Television,Cable-episodic
10137,Theater,Theater
10138,Commercial,Commercial
10139,Television,Episodic series
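
The difference between single and double brackets can be sketched on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration): a single name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "blue", "yellow"],
    "number": [3, 6, 5],
    "flavor": ["vanilla", "chocolate", "strawberry"],
})

single = toy["color"]              # a column name gives a Series
double = toy[["color", "flavor"]]  # a list of names gives a DataFrame

print(type(single).__name__)    # Series
print(type(double).__name__)    # DataFrame
print(double.columns.tolist())  # ['color', 'flavor']
```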


# Tasks

1. Sort `film_permits` by `EventID` and display the result below.

In [None]:
### your code

2. Select the columns `Category`, `SubCategoryName`, and `Borough` from `film_permits` and store them as a new DataFrame called `borough_category`.

In [None]:
### your code

3. [Extra credit] Use a method to count the number of permits in each `Borough` and display the result below.

In [None]:
### your code