# Day 3 -- Python for Researchers

## Today's Goals:
   * Learn the basics of pandas 

**Introduction to Pandas**

Pandas is the industry-standard data analysis library for Python. It lets us convert raw datasets (typically a .csv file) into something called a *dataframe*. A dataframe looks like a spreadsheet, but it is optimized to let you handle large datasets quickly and efficiently. We can use pandas to clean, analyze, and even visualize our data.

In pandas you work with two kinds of objects: a dataframe, which is a 2D structure (something with rows and columns), and a series, which is a 1D structure (a single column). Each of these objects comes with its own built-in functions, which we will explore throughout today's lesson.
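
To make the dataframe/series distinction concrete, here is a minimal sketch. The dataframe `toy_df` and its values are made up for illustration; they are not part of the Film Permits dataset.

```python
import pandas as pd

# A tiny, made-up dataframe: two columns, three rows
toy_df = pd.DataFrame({
    "Borough": ["Manhattan", "Bronx", "Queens"],
    "Permits": [3, 1, 2],
})

# Pulling out a single column gives us a series
borough_series = toy_df["Borough"]

print(type(toy_df).__name__)          # DataFrame
print(type(borough_series).__name__)  # Series
print(toy_df.shape)                   # (3, 2) -> 3 rows, 2 columns
print(borough_series.shape)           # (3,)   -> 3 values, one dimension
```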

Pandas is a really powerful tool for researchers. Oftentimes, the way you really learn to master it is by looking at examples and reading documentation. The pandas documentation includes a "10 Minutes to Pandas" guide that we recommend you return to after this week: https://pandas.pydata.org/docs/user_guide/10min.html

#### Section 1: Our first dataset

To start our exploration of pandas, we're going to use a Film Permits dataset from the NYC Open Data site (https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p/about_data). That dataset is already in our Day 3 folder. Please note: when you gather a dataset online, it typically comes with its own internal logic (especially government datasets), so you will want to study any documentation the publisher provides to better understand your data.

In order to use pandas, you always have to import it first. The common way to do this is to "import pandas as pd"; pd is just a shorter name to type.

To create your dataframe, run the code below. Our dataframe is stored in the variable permits_df. Take a second to study the output. What can we learn from it?

In [6]:
import pandas as pd
permits_df = pd.read_csv("Film_Permits_20250518.csv")

permits_df

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,753784,Theater Load in and Load Outs,12/24/2023 12:01:00 AM,12/31/2023 11:59:00 PM,12/18/2023 12:59:21 PM,Mayor's Office of Media & Entertainment,WEST 62 STREET between COLUMBUS AVENUE and A...,Manhattan,7,20,Theater,Theater,United States of America,10023
1,752706,Shooting Permit,12/13/2023 06:00:00 AM,12/13/2023 08:00:00 PM,12/06/2023 09:01:38 AM,Mayor's Office of Media & Entertainment,HUDSON STREET between MORTON STREET and BARROW...,Manhattan,"1, 2, 3","5, 6, 7, 84, 90",Film,Feature,United States of America,"10002, 10014, 10038, 11201, 11211"
2,753230,Shooting Permit,12/18/2023 07:00:00 AM,12/18/2023 11:59:00 PM,12/12/2023 10:09:19 AM,Mayor's Office of Media & Entertainment,ROSE FEISS BOULEVARD between EAST 139 STREET ...,Bronx,1,40,Television,Episodic series,United States of America,10454
3,752951,Shooting Permit,12/14/2023 07:00:00 AM,12/14/2023 09:00:00 PM,12/08/2023 09:29:50 AM,Mayor's Office of Media & Entertainment,ATLANTIC AVENUE between WASHINGTON AVENUE and ...,Brooklyn,8,77,Television,Episodic series,United States of America,11238
4,752181,Shooting Permit,12/13/2023 06:00:00 AM,12/13/2023 11:59:00 PM,12/01/2023 09:38:19 AM,Mayor's Office of Media & Entertainment,GRAND AVENUE between 64 STREET and FLUSHING AV...,Queens,"1, 5","104, 94",Television,Episodic series,United States of America,"11222, 11378"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11576,835577,Shooting Permit,02/15/2025 12:00:00 AM,02/15/2025 01:00:00 PM,02/14/2025 09:47:11 AM,Mayor's Office of Media & Entertainment,WEST 48 STREET between 6 AVENUE and 7 AVENUE,Manhattan,5,18,Television,News,United States of America,"10036, 10105"
11577,834355,Theater Load in and Load Outs,02/15/2025 12:01:00 AM,02/15/2025 11:59:00 PM,02/04/2025 01:57:53 PM,Mayor's Office of Media & Entertainment,EAST 11 STREET between 3 AVENUE and 4 AVENUE...,Manhattan,"11, 3","23, 9",Theater,Theater,United States of America,"10003, 10029"
11578,834236,Theater Load in and Load Outs,02/15/2025 12:01:00 AM,02/16/2025 06:00:00 AM,02/03/2025 04:23:45 PM,Mayor's Office of Media & Entertainment,DEKALB AVENUE between FLATBUSH AVENUE EXTENSIO...,Brooklyn,2,88,Theater,Theater,United States of America,11201
11579,834362,Theater Load in and Load Outs,02/16/2025 12:01:00 AM,02/17/2025 06:00:00 AM,02/04/2025 02:09:39 PM,Mayor's Office of Media & Entertainment,EAST 11 STREET between 3 AVENUE and 4 AVENUE...,Manhattan,"11, 3","23, 9",Theater,Theater,United States of America,"10003, 10029"
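
If you want to experiment with read_csv() without touching the real file, one option is to feed it a small piece of CSV text. This is only a sketch: the three rows below are invented, and `csv_text` and `toy_df` are hypothetical names.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """EventID,EventType,Borough
1,Shooting Permit,Manhattan
2,Shooting Permit,Bronx
3,Rigging Permit,Queens
"""

# read_csv accepts a file path or a file-like object such as StringIO
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (3, 3)
print(list(toy_df.columns))  # ['EventID', 'EventType', 'Borough']
```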


**Learning about our dataset**

We're going to use four techniques to learn about our dataset:

  * info() -- tells you how many rows and columns there are; the names of each column; the different kinds of datatype contained in the columns; and how many non-null values there are in each column
  * head() -- shows the top five rows by default; you can pass an optional parameter (a number) to show a different number of rows
  * tail() -- same as head(), except it shows you the bottom rows
  * .dtypes -- tells you what each column's data type is



In [None]:
permits_df.info()

In [None]:
permits_df.head()

In [None]:
permits_df.tail()
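
Here is a small sketch of the optional number you can pass to head() and tail(). It uses a made-up ten-row dataframe (`toy_df`) rather than permits_df so that it runs on its own.

```python
import pandas as pd

# A made-up dataframe with ten rows
toy_df = pd.DataFrame({"EventID": range(100, 110)})

first_three = toy_df.head(3)  # top three rows
last_two = toy_df.tail(2)     # bottom two rows

print(first_three["EventID"].tolist())  # [100, 101, 102]
print(last_two["EventID"].tolist())     # [108, 109]
print(len(toy_df.head()))               # 5 (the default)
```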

Note that there are seven datatypes you will run into most often in pandas, some of which correspond to the datatypes in Python:

* object (usually text, roughly equivalent to a Python string)
* int64 (integer)
* float64 (float aka decimal)
* bool (Boolean, True/False values)
* datetime64 (date and time values)
* timedelta64[ns] (differences between two date and time values)
* category (finite list of text values, corresponds to categorical variables in statistics)

It's important to know what types of data you are manipulating, because they all come with unique abilities and limitations.
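
As a rough illustration of why datatypes matter, the sketch below converts text columns into dates and a category. The two rows are made up (they only borrow the date style used in our dataset), and the `Duration` column is a hypothetical addition for this example.

```python
import pandas as pd

# Made-up rows; every column starts out as plain text
toy_df = pd.DataFrame({
    "StartDateTime": ["12/13/2023 06:00:00 AM", "12/14/2023 07:00:00 AM"],
    "EndDateTime": ["12/13/2023 08:00:00 PM", "12/14/2023 09:00:00 PM"],
    "Borough": ["Manhattan", "Brooklyn"],
})

# Convert the text into real date and time values
date_format = "%m/%d/%Y %I:%M:%S %p"
toy_df["StartDateTime"] = pd.to_datetime(toy_df["StartDateTime"], format=date_format)
toy_df["EndDateTime"] = pd.to_datetime(toy_df["EndDateTime"], format=date_format)

# Convert the repeated text labels into a category
toy_df["Borough"] = toy_df["Borough"].astype("category")

# Subtracting two datetime columns only works after the conversion;
# the result is a timedelta column
toy_df["Duration"] = toy_df["EndDateTime"] - toy_df["StartDateTime"]

print(toy_df.dtypes)
print(toy_df["Duration"].tolist())
```

Subtracting the original text columns would raise an error, which is one example of the "abilities and limitations" each datatype carries.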

What kind of datatypes does our dataframe contain? 

In [None]:
permits_df.dtypes

We can also access specific information in our dataframe. The most frequent way you will do this is to isolate a column. 

The preferred syntax for this is to use brackets and pass the column name as a string. You can isolate one column at a time, or isolate several at once by passing their names in a list.

In [None]:
#isolating one column
permits_df["Borough"]

In [None]:
#isolating multiple columns
permits_df[["EventID", "Borough", "Category"]]

Unnamed: 0,EventID,Borough,Category
0,753784,Manhattan,Theater
1,752706,Manhattan,Film
2,753230,Bronx,Television
3,752951,Brooklyn,Television
4,752181,Queens,Television
...,...,...,...
11576,835577,Manhattan,Television
11577,834355,Manhattan,Theater
11578,834236,Brooklyn,Theater
11579,834362,Manhattan,Theater
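
One detail worth seeing for yourself: single brackets return a series, while a list inside brackets returns a dataframe, even if the list holds just one name. A minimal sketch with a made-up dataframe (`toy_df` is a hypothetical name):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "EventID": [1, 2, 3],
    "Borough": ["Manhattan", "Bronx", "Queens"],
    "Category": ["Film", "Television", "Theater"],
})

one_column = toy_df["Borough"]                 # a string -> series
one_column_df = toy_df[["Borough"]]            # a list with one name -> dataframe
two_columns = toy_df[["EventID", "Borough"]]   # a list with two names -> dataframe

print(type(one_column).__name__)     # Series
print(type(one_column_df).__name__)  # DataFrame
print(list(two_columns.columns))     # ['EventID', 'Borough']
```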


If you want to isolate very specific information, you can use .loc and .iloc:

 * Use .loc to select by label (row labels and column names), e.g., the first example below isolates the first six rows of the EventType column. Notice that it uses inclusive slicing: the row labeled 5 is included.
 * Use .iloc to select by position (iloc stands for “integer location”), e.g., the second example below isolates the first five rows using index numbers. Notice that it uses exclusive slicing: position 5 is left out.


In [15]:
#example of loc

permits_df.loc[:5, "EventType"]

0    Theater Load in and Load Outs
1                  Shooting Permit
2                  Shooting Permit
3                  Shooting Permit
4                  Shooting Permit
5    Theater Load in and Load Outs
Name: EventType, dtype: object

In [17]:
#example of iloc

permits_df.iloc[:5]

Unnamed: 0,EventID,EventType,StartDateTime,EndDateTime,EnteredOn,EventAgency,ParkingHeld,Borough,CommunityBoard(s),PolicePrecinct(s),Category,SubCategoryName,Country,ZipCode(s)
0,753784,Theater Load in and Load Outs,12/24/2023 12:01:00 AM,12/31/2023 11:59:00 PM,12/18/2023 12:59:21 PM,Mayor's Office of Media & Entertainment,WEST 62 STREET between COLUMBUS AVENUE and A...,Manhattan,7,20,Theater,Theater,United States of America,10023
1,752706,Shooting Permit,12/13/2023 06:00:00 AM,12/13/2023 08:00:00 PM,12/06/2023 09:01:38 AM,Mayor's Office of Media & Entertainment,HUDSON STREET between MORTON STREET and BARROW...,Manhattan,"1, 2, 3","5, 6, 7, 84, 90",Film,Feature,United States of America,"10002, 10014, 10038, 11201, 11211"
2,753230,Shooting Permit,12/18/2023 07:00:00 AM,12/18/2023 11:59:00 PM,12/12/2023 10:09:19 AM,Mayor's Office of Media & Entertainment,ROSE FEISS BOULEVARD between EAST 139 STREET ...,Bronx,1,40,Television,Episodic series,United States of America,10454
3,752951,Shooting Permit,12/14/2023 07:00:00 AM,12/14/2023 09:00:00 PM,12/08/2023 09:29:50 AM,Mayor's Office of Media & Entertainment,ATLANTIC AVENUE between WASHINGTON AVENUE and ...,Brooklyn,8,77,Television,Episodic series,United States of America,11238
4,752181,Shooting Permit,12/13/2023 06:00:00 AM,12/13/2023 11:59:00 PM,12/01/2023 09:38:19 AM,Mayor's Office of Media & Entertainment,GRAND AVENUE between 64 STREET and FLUSHING AV...,Queens,"1, 5","104, 94",Television,Episodic series,United States of America,"11222, 11378"
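
To compare the two side by side, here is a small sketch on a made-up four-row dataframe (`toy_df` and its values are assumptions for illustration). It shows the same slice written with .loc and .iloc, and how to pick out a single cell.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "EventType": ["Theater", "Shooting", "Shooting", "Rigging"],
    "Borough": ["Manhattan", "Bronx", "Queens", "Brooklyn"],
})

# .loc slices by label and INCLUDES the end label: rows 0, 1, 2
by_label = toy_df.loc[:2, "EventType"]

# .iloc slices by position and EXCLUDES the end position: rows 0, 1
by_position = toy_df.iloc[:2, 0]

# To get the same three rows with .iloc, the slice has to stop at 3
same_as_label = toy_df.iloc[:3, 0]

# A single cell: row position 1, column position 1
single_value = toy_df.iloc[1, 1]

print(len(by_label))                   # 3
print(len(by_position))                # 2
print(by_label.equals(same_as_label))  # True
print(single_value)                    # Bronx
```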


**Preparing our dataset for analysis**

Here's our research question: *How many permits of each type were distributed in each borough?* Because this dataset was produced by NYC, it's pretty clean. We just want to drop the columns that don't help us answer our research question so that we can save memory. It's not a huge dataset, so this isn't necessary, but it is useful to know how to do.

Here are all our columns (I turned them into a list for easy reading).

In [8]:
list(permits_df.columns)

['EventID',
 'EventType',
 'StartDateTime',
 'EndDateTime',
 'EnteredOn',
 'EventAgency',
 'ParkingHeld',
 'Borough',
 'CommunityBoard(s)',
 'PolicePrecinct(s)',
 'Category',
 'SubCategoryName',
 'Country',
 'ZipCode(s)']

We probably don't care about a number of those particular columns, so let's drop them.

It's best practice to make a copy of your dataframe before you start changing it, so we'll start with that: 

In [9]:
permits_df_copy = permits_df.copy()
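
Why bother with copy()? A plain assignment does not make a new dataframe; it just gives the same dataframe a second name. The sketch below shows the difference on a made-up dataframe (`original_df`, `alias_df`, and `copy_df` are hypothetical names).

```python
import pandas as pd

original_df = pd.DataFrame({
    "Borough": ["Manhattan", "Bronx"],
    "Country": ["United States of America", "United States of America"],
})

alias_df = original_df        # a second name for the SAME dataframe
copy_df = original_df.copy()  # an independent copy

# Changing the copy leaves the original alone
copy_df.drop("Country", axis=1, inplace=True)
columns_after_copy_change = list(original_df.columns)
print(columns_after_copy_change)   # ['Borough', 'Country']

# Changing the alias changes the original too
alias_df.drop("Country", axis=1, inplace=True)
columns_after_alias_change = list(original_df.columns)
print(columns_after_alias_change)  # ['Borough']
```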

Then we can drop several of those columns using drop(). Here's some info about what we're doing:
 * drop() eliminates columns (or rows)
 * You have to tell it which column names you want to drop; these can be passed as a list
 * You need to tell it you're dropping a column (not a row), which is axis = 1 (axis = 0 is for rows)
 * inplace = True makes the change to the dataframe itself; otherwise, drop() leaves the dataframe untouched and returns a new dataframe with the columns removed

I've dropped two columns below. Try adding another column name that we don't need to the list!

In [11]:
permits_df_copy.drop(["CommunityBoard(s)", "PolicePrecinct(s)"], axis = 1, inplace = True)
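
To see what inplace changes, here is a minimal sketch on a made-up dataframe (`toy_df` and its values are assumptions). It also shows the `columns=` keyword, which is another way of writing axis = 1.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "EventID": [1, 2],
    "Borough": ["Manhattan", "Bronx"],
    "Country": ["United States of America", "United States of America"],
})

# Without inplace, drop() returns a new dataframe and toy_df is unchanged
trimmed_df = toy_df.drop(["Country"], axis=1)
columns_before = list(toy_df.columns)
print(columns_before)            # ['EventID', 'Borough', 'Country']
print(list(trimmed_df.columns))  # ['EventID', 'Borough']

# columns=... means the same thing as axis=1
same_result = toy_df.drop(columns=["Country"])
print(same_result.equals(trimmed_df))  # True

# With inplace=True, toy_df itself is modified and drop() returns None
returned = toy_df.drop(["Country"], axis=1, inplace=True)
print(returned)              # None
print(list(toy_df.columns))  # ['EventID', 'Borough']
```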

Ok, now let's try to answer our RQ. What we will have to do is:
  * isolate the two columns that have our information (EventType and Borough)
  * use value_counts(), which counts how often each combination of values appears
  * convert this into its own dataframe using reset_index()
  * sort the rows for easier viewing using sort_values()

In [38]:
#isolate our two columns:

rq_df = permits_df_copy[["EventType", "Borough"]]
rq_df

Unnamed: 0,EventType,Borough
0,Theater Load in and Load Outs,Manhattan
1,Shooting Permit,Manhattan
2,Shooting Permit,Bronx
3,Shooting Permit,Brooklyn
4,Shooting Permit,Queens
...,...,...
11576,Shooting Permit,Manhattan
11577,Theater Load in and Load Outs,Manhattan
11578,Theater Load in and Load Outs,Brooklyn
11579,Theater Load in and Load Outs,Manhattan


In [None]:
#count the values
#this LOOKS like it is a dataframe, but it's actually a series

rq_df = rq_df.value_counts()
rq_df

EventType                      Borough      
Shooting Permit                Manhattan        3945
                               Brooklyn         3176
Theater Load in and Load Outs  Manhattan        1648
Shooting Permit                Queens           1535
Theater Load in and Load Outs  Brooklyn          560
Shooting Permit                Bronx             331
Rigging Permit                 Manhattan         132
                               Brooklyn           74
Shooting Permit                Staten Island      63
DCAS Prep/Shoot/Wrap Permit    Manhattan          50
Rigging Permit                 Queens             28
                               Bronx              17
DCAS Prep/Shoot/Wrap Permit    Bronx               6
                               Brooklyn            5
Theater Load in and Load Outs  Bronx               5
DCAS Prep/Shoot/Wrap Permit    Queens              3
Rigging Permit                 Staten Island       1
DCAS Prep/Shoot/Wrap Permit    Staten Island       1

In [None]:
#reset the index to transform it into a dataframe

rq_df = rq_df.reset_index()
#rq_df.columns = ["Event", "Borough", "Count"]
rq_df

Unnamed: 0,EventType,Borough,count
0,Shooting Permit,Manhattan,3945
1,Shooting Permit,Brooklyn,3176
2,Theater Load in and Load Outs,Manhattan,1648
3,Shooting Permit,Queens,1535
4,Theater Load in and Load Outs,Brooklyn,560
5,Shooting Permit,Bronx,331
6,Rigging Permit,Manhattan,132
7,Rigging Permit,Brooklyn,74
8,Shooting Permit,Staten Island,63
9,DCAS Prep/Shoot/Wrap Permit,Manhattan,50


In [None]:
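#sort by permit type; the column is still named "EventType" because the rename above is commented out
rq_df.sort_values("EventType", inplace = True, ascending = False)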
rq_df
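
Putting the four steps together, here is a compact sketch of the whole pipeline on a made-up five-row dataframe (`toy_df`, `counts`, and `counts_df` are hypothetical names). Passing name="count" to reset_index() is a choice made here so the count column gets the same name regardless of which pandas version you have installed.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "EventType": ["Shooting Permit", "Shooting Permit", "Rigging Permit",
                  "Shooting Permit", "Rigging Permit"],
    "Borough": ["Manhattan", "Manhattan", "Queens", "Bronx", "Queens"],
})

# Steps 1 and 2: isolate the two columns and count each combination (a series)
counts = toy_df[["EventType", "Borough"]].value_counts()

# Step 3: turn the series into a dataframe with a "count" column
counts_df = counts.reset_index(name="count")

# Step 4: sort by permit type (A to Z), then by count (largest first)
counts_df = counts_df.sort_values(["EventType", "count"], ascending=[True, False])

print(counts_df)
```

Sorting on two columns keeps each permit type grouped together while still putting the busiest boroughs first within each group.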