# Class 9: pandas Series and DataFrames

Today we will continue our exploration of pandas DataFrames, which allow us to analyze data tables.

### Downloading the data for today's class

Please run the code in the cell below to download the data for today's class.

In [20]:
import YData

# YData.download.download_class_code(9)       # get class code    
# YData.download.download_class_code(9, True) # get the code with the answers 
# YData.download_homework(4)  # download the homework 

YData.download.download_data("dow.csv")
YData.download_data("nyc23_flights.csv")
YData.download_data("nyc23_airlines.csv")
YData.download_data("nyc23_weather.csv")


The file `dow.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `nyc23_flights.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `nyc23_airlines.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `nyc23_weather.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


In [21]:
## If you are using Google Colab, install the YData package and mount your Google Drive by uncommenting and running the code below.

# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')


In [22]:
# import the packages we will use: numpy, pandas, and matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## DataFrames (and flight delays) continued...

Let's continue our exploration of pandas DataFrames by continuing to analyze our flight delays dataset. 

The code below loads the data into a pandas DataFrame named `flights` and sets the Index to be the airplane's [tail number](https://en.wikipedia.org/wiki/Tail_number). Some variables of interest in this DataFrame are:

- `year`, `month`, `day`: Date of departure 
- `dep_time`, `arr_time`: Actual departure and arrival times, [UTC](https://en.wikipedia.org/wiki/UTC_offset) 
- `sched_dep_time`, `sched_arr_time`: Scheduled departure and arrival times, UTC 
- `dep_delay`, `arr_delay`: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- `hour`, `minute`: Time of scheduled departure broken into hour and minutes.
- `carrier`: Two letter carrier abbreviation. See get_airlines to get the full name.
- `flight`: Flight number.
- `origin`, `dest`: Origin and destination airport. See get_airports for additional metadata.
- `air_time`: Amount of time spent in the air, in minutes.
- `distance`: Distance between airports, in miles.
- `time_hour`: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

The first 3 rows of this DataFrame are shown below. 



In [23]:
#import YData
#YData.download_data("nyc23_flights.csv")

flights_all = pd.read_csv("nyc23_flights.csv", index_col="tailnum") #, parse_dates=[18])

flights = flights_all[['arr_delay', 'dep_delay', 'carrier', 'flight', 'arr_time', 'dep_time',  'origin', 'dest', 'air_time', 'distance', 'time_hour']]

flights.head(3)


tailnum,arr_delay,dep_delay,carrier,flight,arr_time,dep_time,origin,dest,air_time,distance,time_hour
N25201,205.0,203.0,UA,628,328.0,1.0,EWR,SMF,367.0,2500,2023-01-01 20:00:00
N830DN,53.0,78.0,DL,393,228.0,18.0,JFK,ATL,108.0,760,2023-01-01 23:00:00
N807JB,34.0,47.0,B6,371,500.0,31.0,JFK,BQN,190.0,1576,2023-01-01 23:00:00


### Selecting columns from a DataFrame

We can select columns from a DataFrame using the square brackets; e.g., `my_df["my_col"]`

If we'd like to select multiple columns we can pass a list; e.g., `my_df[["col1", "col2"]]`
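The difference between single and double brackets is easiest to see on a tiny table. The sketch below uses a small made-up DataFrame named `toy` (the name and the three rows are illustrative, not loaded from the flights file) to show which selections return a Series and which return a DataFrame.

```python
import pandas as pd

# Small made-up table (illustrative values, not read from nyc23_flights.csv)
toy = pd.DataFrame(
    {
        "arr_delay": [205.0, 53.0, 34.0],
        "dep_delay": [203.0, 78.0, 47.0],
        "carrier": ["UA", "DL", "B6"],
    },
    index=["N25201", "N830DN", "N807JB"],
)

as_series = toy["arr_delay"]               # single brackets -> Series
also_series = toy.arr_delay                # attribute access -> the same Series
as_frame = toy[["arr_delay"]]              # list with one name -> DataFrame
two_cols = toy[["arr_delay", "carrier"]]   # list of names -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```

Attribute access such as `toy.arr_delay` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so square brackets are the safer default.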


In [24]:
# Get just the arrival delay
# Be careful: if you just use a ["Col_name"] it will return it as a Series!





In [25]:
# we can also get a single column using the .col_name 





In [26]:
# if you want to get a single column as a DataFrame, pass a list in the [] brackets





In [27]:
# get multiple columns as a DataFrame





### Getting a subset of rows from a DataFrame

Similar to pandas Series, we can get particular rows from a DataFrame using:

- `.loc`:  Get rows by Index values - and by Boolean masks
- `.iloc`:  Get rows by their integer position (row number)
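As a minimal sketch of these two accessors, the example below uses a made-up four-row DataFrame called `toy` (illustrative values only). Note that an Index value can repeat, in which case `.loc` returns every matching row.

```python
import pandas as pd

# Made-up table; the Index holds tail numbers and "N25201" appears twice
toy = pd.DataFrame(
    {
        "origin": ["EWR", "JFK", "JFK", "LGA"],
        "arr_delay": [205.0, 53.0, 34.0, -5.0],
    },
    index=["N25201", "N830DN", "N807JB", "N25201"],
)

by_label = toy.loc["N25201"]        # all rows whose Index value is "N25201"
first_three = toy.iloc[0:3]         # rows at positions 0, 1, 2 (the end is excluded)

jfk_mask = toy["origin"] == "JFK"   # Boolean Series with one True/False per row
jfk_rows = toy.loc[jfk_mask]        # keep only the rows where the mask is True

print(jfk_rows)
```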



In [29]:
# Extract rows based on the Index name "N25201"




In [30]:
# Extract a row based on the row number (get row 0 to 3)



In [31]:
# We can get multiple rows that meet particular conditions using Boolean masking

# flights leaving JFK





In [32]:
# extract the rows that correspond to JFK




In [33]:
import matplotlib.pyplot as plt

# visualize JFK flight delays as a histogram




### Sorting values in a DataFrame

We can sort values in a DataFrame using `.sort_values("col_name")`

We can sort from highest to lowest by setting the argument `ascending = False`
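Here is a short sketch of both sort directions on a made-up three-row table (the `toy` name and its values are illustrative). `.sort_values()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame(
    {"carrier": ["UA", "DL", "B6"], "arr_delay": [34.0, 205.0, 53.0]}
)

lowest_first = toy.sort_values("arr_delay")                    # ascending is the default
highest_first = toy.sort_values("arr_delay", ascending=False)  # largest delay first

# After sorting from highest to lowest, the first row holds the longest delay
longest_delay = highest_first["arr_delay"].iloc[0]
print(longest_delay)
```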


In [34]:
# Sort the data by arrival delay



In [35]:
# What is the longest arrival delay? 



### Adding new columns to a DataFrame

We can add a column to a DataFrame using square brackets. For example:

- `my_df["new col"] = my_df["col1"] + my_df["col2"]`




Let's add a column called "madeup_time" which has the reduction in delay from when the flight left (`dep_delay`) to when it arrived (`arr_delay`).
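One way this could look on a made-up table is sketched below. The sketch assumes that "reduction in delay" means `dep_delay - arr_delay`, so a positive value means the flight made up time in the air; the `toy` data is illustrative only.

```python
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame(
    {"dep_delay": [203.0, 78.0, 47.0], "arr_delay": [205.0, 53.0, 34.0]}
)

# Work on a copy so the original DataFrame is left untouched
toy2 = toy.copy()

# Assumed sign convention: positive values mean time was made up in flight
toy2["madeup_time"] = toy2["dep_delay"] - toy2["arr_delay"]

print(toy2)
```

Copying first is a deliberate choice here: assigning a column to a DataFrame that was itself sliced from another DataFrame can trigger a `SettingWithCopyWarning` in pandas.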

In [36]:
# copy the data 



# calculate how many minutes were made up in flight



# add the madeup_time column




In [37]:
# sort the values



# sort the data from largest to smallest




We can rename columns by:
1. Creating a dictionary, `rename_dictionary`, that maps old column names to new column names
2. Passing this dictionary to the `my_df.rename(columns = rename_dictionary)` method
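A minimal sketch of these two steps, using a made-up one-row table and a hypothetical new column name (`minutes_made_up`):

```python
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame({"dep_delay": [3.0], "madeup_time": [10.0]})

# Step 1: map old column names to new ones (the new name is hypothetical)
rename_dictionary = {"madeup_time": "minutes_made_up"}

# Step 2: rename() returns a new DataFrame; toy keeps its original column names
renamed = toy.rename(columns=rename_dictionary)

print(list(renamed.columns))
```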

In [38]:
# Rename the madeup_time column





### Getting aggregate statistics by group

We can get aggregate statistics by group using the `groupby()` and `agg()` methods with the following syntax:

`my_df.groupby("col_name").agg("agg_function_name")`
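To illustrate the syntax, the sketch below computes a mean per group on a made-up table that contains only the grouping column and numeric columns. In recent pandas versions, asking for the mean of a non-numeric column raises an error, so it helps to select the numeric columns of interest before aggregating.

```python
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame(
    {
        "carrier": ["AA", "AA", "B6", "B6"],
        "arr_delay": [10.0, 20.0, -5.0, 15.0],
        "dep_delay": [0.0, 4.0, 2.0, 6.0],
    }
)

# One row per carrier; each numeric column holds that carrier's mean
mean_by_carrier = toy.groupby("carrier").agg("mean")

print(mean_by_carrier)
```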

Can you get the average delay for each airline? 


In [39]:
# What was the average delay for each airline? 





There are several ways to get multiple statistics by group. Perhaps the most useful way is to use the syntax:

<pre>
my_df.groupby("group_col_name").agg(
   new_col1 = ('col_name', 'statistic_name1'),
   new_col2 = ('col_name', 'statistic_name2'),
   new_col3 = ('col_name', 'statistic_name3')
)
</pre>


Let's create a DataFrame that has for each carrier:
1. The number of flights 
2. The max departure delay
3. The median arrival delay
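The named-aggregation syntax can be sketched on a made-up table as follows. The output column names (`n_flights`, `max_dep_delay`, `median_arr_delay`) are hypothetical, and `"size"` is used here to count rows per group, including rows with missing values.

```python
import pandas as pd

# Made-up table (illustrative values only)
toy = pd.DataFrame(
    {
        "carrier": ["AA", "AA", "AA", "B6", "B6"],
        "dep_delay": [3.0, 40.0, -2.0, 41.0, -10.0],
        "arr_delay": [-7.0, 30.0, 5.0, 44.0, 4.0],
    }
)

# Each keyword is a new column name: (source column, statistic)
summary = toy.groupby("carrier").agg(
    n_flights=("dep_delay", "size"),
    max_dep_delay=("dep_delay", "max"),
    median_arr_delay=("arr_delay", "median"),
)

print(summary)
```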

![grumpy](http://www.quickmeme.com/img/17/1702cb8d3730013bdff1203920324ab55a244f0061cfaa118af059b683e2d275.jpg)

## "Joining" DataFrames by Index

To explore joining DataFrames, let's load the airline names into a DataFrame called `airline_names`. 

Let's also set the Index for both the `airline_names` and `flights` to be the airline carrier code. 

For demonstration purposes, let's also do the following: 

1. Reduce the `flights` DataFrame to only have information on American Airlines (AA), JetBlue (B6), and United Air Lines (UA), and save it to the name `flights_3_carriers`.

2. Reduce `airline_names` to the first 10 entries (thus removing United Air Lines), and save it to the name `airline_names_reduced`.



In [40]:
flights_3_carriers = flights.reset_index().set_index("carrier")

# just get flights from American Airlines (AA), JetBlue (B6) and United (UA)
flights_3_carriers = flights_3_carriers.loc[["AA", "B6", "UA"]].sort_values("time_hour")

flights_3_carriers.head()


carrier,tailnum,arr_delay,dep_delay,flight,arr_time,dep_time,origin,dest,air_time,distance,time_hour
AA,N925AN,-7.0,3.0,499,808.0,503.0,EWR,MIA,154.0,1085,2023-01-01 05:00:00
B6,N948JB,44.0,41.0,646,923.0,611.0,EWR,FLL,163.0,1065,2023-01-01 05:00:00
B6,N639JB,4.0,-10.0,800,905.0,549.0,JFK,PBI,164.0,1028,2023-01-01 05:00:00
UA,N13113,68.0,17.0,206,926.0,537.0,EWR,IAH,258.0,1400,2023-01-01 05:00:00
B6,N2043J,-1.0,10.0,996,948.0,520.0,JFK,BQN,192.0,1576,2023-01-01 05:00:00


In [41]:
airline_names = pd.read_csv("nyc23_airlines.csv", index_col = "carrier")
airline_names_reduced = airline_names.iloc[0:10]

airline_names_reduced

carrier,name
9E,Endeavor Air Inc.
AA,American Airlines Inc.
AS,Alaska Airlines Inc.
B6,JetBlue Airways
DL,Delta Air Lines Inc.
F9,Frontier Airlines Inc.
G4,Allegiant Air
HA,Hawaiian Airlines Inc.
MQ,Envoy Air
NK,Spirit Air Lines


When two DataFrames have the same Index values, we can use the `.join()` method to join them.
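The sketch below shows how `how="left"` and `how="right"` differ when the two Indexes only partly overlap. Both tables are made up (`toy_flights` and `toy_names` are illustrative names): `UA` appears only in the flights table and `DL` only in the names table.

```python
import pandas as pd

# Made-up tables that share an Index named "carrier"
toy_flights = pd.DataFrame(
    {"flight": [499, 646, 206]},
    index=pd.Index(["AA", "B6", "UA"], name="carrier"),
)
toy_names = pd.DataFrame(
    {"name": ["American Airlines Inc.", "JetBlue Airways", "Delta Air Lines Inc."]},
    index=pd.Index(["AA", "B6", "DL"], name="carrier"),
)

# Left join: keep every row of toy_flights; UA has no match, so its name is NaN
left = toy_flights.join(toy_names, how="left")

# Right join: keep every row of toy_names; DL has no flights, so its flight is NaN
right = toy_flights.join(toy_names, how="right")

print(left)
print(right)
```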

In [42]:
# Let's do a left join by setting how = "left"




In [43]:
# Let's do a right join by setting how = "right"  




### "Merging" DataFrames by column values

If we want to join by value in a column rather than by Index value we can use the `.merge()` method (which is very similar to the `.join()` method). 
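As a minimal sketch, the example below merges two made-up tables on a shared `carrier` column. By default `.merge()` performs an inner join, which keeps only the keys found in both tables; passing `how="left"` keeps every row of the left table.

```python
import pandas as pd

# Made-up tables with "carrier" as an ordinary column rather than the Index
toy_flights = pd.DataFrame(
    {"carrier": ["AA", "B6", "UA"], "flight": [499, 646, 206]}
)
toy_names = pd.DataFrame(
    {
        "carrier": ["AA", "B6", "DL"],
        "name": ["American Airlines Inc.", "JetBlue Airways", "Delta Air Lines Inc."],
    }
)

inner = toy_flights.merge(toy_names, on="carrier")                   # default: how="inner"
left_merge = toy_flights.merge(toy_names, on="carrier", how="left")  # keep all flights

print(inner)
print(left_merge)
```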


In [44]:
# reset the index of flights_3_carriers
flights_3_carriers2 = flights_3_carriers.reset_index()
flights_3_carriers2.head(3)

,carrier,tailnum,arr_delay,dep_delay,flight,arr_time,dep_time,origin,dest,air_time,distance,time_hour
0,AA,N925AN,-7.0,3.0,499,808.0,503.0,EWR,MIA,154.0,1085,2023-01-01 05:00:00
1,B6,N948JB,44.0,41.0,646,923.0,611.0,EWR,FLL,163.0,1065,2023-01-01 05:00:00
2,B6,N639JB,4.0,-10.0,800,905.0,549.0,JFK,PBI,164.0,1028,2023-01-01 05:00:00


In [45]:
# reset the index of airline_names_reduced
airline_names_reduced2 = airline_names_reduced.reset_index()
airline_names_reduced2.head(3)

,carrier,name
0,9E,Endeavor Air Inc.
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.


In [46]:
# use the .merge() method to join the DataFrames




#### Merging with different column names

What if the columns we want to join on have different names? In that case, we can use the `left_on` and `right_on` arguments to specify which columns (i.e., keys) should be used to align the two DataFrames.
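A short sketch of `left_on` and `right_on` with two made-up tables whose key columns are named differently. Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.

```python
import pandas as pd

# Made-up tables; the key is "Airline Code" on the left and "carrier" on the right
toy_flights = pd.DataFrame({"Airline Code": ["AA", "B6"], "flight": [499, 646]})
toy_names = pd.DataFrame(
    {"carrier": ["AA", "B6"], "name": ["American Airlines Inc.", "JetBlue Airways"]}
)

merged = toy_flights.merge(toy_names, left_on="Airline Code", right_on="carrier")

# Optional: drop the duplicate key column
merged_clean = merged.drop(columns="carrier")

print(merged_clean)
```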

In [47]:
flights_3_carriers3 = flights_3_carriers2.rename(columns = {"carrier": "Airline Code"})
flights_3_carriers3.head(3)

,Airline Code,tailnum,arr_delay,dep_delay,flight,arr_time,dep_time,origin,dest,air_time,distance,time_hour
0,AA,N925AN,-7.0,3.0,499,808.0,503.0,EWR,MIA,154.0,1085,2023-01-01 05:00:00
1,B6,N948JB,44.0,41.0,646,923.0,611.0,EWR,FLL,163.0,1065,2023-01-01 05:00:00
2,B6,N639JB,4.0,-10.0,800,905.0,549.0,JFK,PBI,164.0,1028,2023-01-01 05:00:00


In [48]:
# merge the DataFrames specifying the column names to join on




#### Example: Spelling out names of airlines with the longest delays

Please try to create a DataFrame where the Index name is the full airline name, and the columns are:
- `mean_delay`: Has the mean arrival delay for each airline
- `median_delay`: Has the median arrival delay for each airline
- `count`: The number of flights that went into these averages

To do this, start with the `flights` and `airline_names` DataFrames and go from there. Also, be sure your results are sorted from the largest mean delay to the smallest mean delay.
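One possible shape for this pipeline is sketched below on made-up data: aggregate by carrier code, merge in the names, move the name into the Index, and sort. It is only one of several reasonable orderings of these steps, and the toy tables stand in for `flights` and `airline_names`.

```python
import pandas as pd

# Made-up stand-ins for the flights and airline_names DataFrames
toy_flights = pd.DataFrame(
    {
        "carrier": ["AA", "AA", "B6", "B6", "B6"],
        "arr_delay": [10.0, 30.0, 5.0, 1.0, 3.0],
    }
)
toy_names = pd.DataFrame(
    {"carrier": ["AA", "B6"], "name": ["American Airlines Inc.", "JetBlue Airways"]}
)

# Step 1: per-carrier statistics ("count" counts non-missing arrival delays)
stats = toy_flights.groupby("carrier").agg(
    mean_delay=("arr_delay", "mean"),
    median_delay=("arr_delay", "median"),
    count=("arr_delay", "count"),
)

# Step 2: attach full names, use them as the Index, and sort by mean delay
result = (
    stats.reset_index()
    .merge(toy_names, on="carrier")
    .set_index("name")
    .drop(columns="carrier")
    .sort_values("mean_delay", ascending=False)
)

print(result)
```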


## Further flight delay explorations

See if you can calculate (and visualize) how the mean delay is affected by:

- The hour of the day a flight leaves?
- The month of the year?
- The different airports?

As a more challenging question, see if you can calculate and visualize the mean delay as a function of the wind speed.

If you are interested in exploring further, you can also check out the following data:

- `nyc23_planes.csv`: Information about different airplanes 
- `nyc23_airports.csv`: The names of different airports

All these data sets can be downloaded using the `YData.download_data()` function. A codebook that contains information on the variables in these DataFrames can be found at: https://cran.r-project.org/web/packages/nycflights23/nycflights23.pdf 

#### Q1: Calculate and visualize how the average delay differs by the hour of the day a flight leaves

In [49]:
# reload the data so we have all the columns
flights = pd.read_csv("nyc23_flights.csv", index_col="tailnum") #, parse_dates=[18])


# calculate and visualize the mean departure delay as a function of the hour the flight left










#### Q2: Calculate and visualize how the average delay differs by the month of the year

#### Q3: Calculate and visualize how the average delay differs depending on the airport a flight leaves from

#### Q4: Calculate and visualize how the average delay differs depending on wind speed

The data on the weather for each day/time is loaded below. Answering this question takes multiple steps, so I've created multiple cells for you to do your work.
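As a rough sketch of the steps involved, the example below merges two made-up tables on both `origin` and `time_hour` and then averages the delay for each wind speed. It assumes the weather table has `origin`, `time_hour`, and `wind_speed` columns and that `time_hour` is stored in the same format in both tables; check the codebook and the loaded data before relying on this.

```python
import pandas as pd

# Made-up stand-ins; column names in toy_weather are assumptions
toy_flights = pd.DataFrame(
    {
        "origin": ["EWR", "JFK", "EWR"],
        "time_hour": ["2023-01-01 05:00:00", "2023-01-01 05:00:00", "2023-01-01 06:00:00"],
        "dep_delay": [3.0, -10.0, 20.0],
    }
)
toy_weather = pd.DataFrame(
    {
        "origin": ["EWR", "JFK", "EWR"],
        "time_hour": ["2023-01-01 05:00:00", "2023-01-01 05:00:00", "2023-01-01 06:00:00"],
        "wind_speed": [5.0, 10.0, 10.0],
    }
)

# Step 1: match each flight to the weather at its airport and scheduled hour
joined = toy_flights.merge(toy_weather, on=["origin", "time_hour"], how="left")

# Step 2: mean departure delay for each observed wind speed
delay_by_wind = joined.groupby("wind_speed").agg(mean_dep_delay=("dep_delay", "mean"))

print(delay_by_wind)
```

With real data, wind speeds take many distinct values, so rounding or binning them before grouping may give a smoother picture.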


In [50]:

#YData.download_data("nyc23_weather.csv")

weather = pd.read_csv("nyc23_weather.csv")
