# Pandas Basics

## Start working with Pandas

To start working with Pandas, we must first install the library and then import it in our code like so.

In [1]:
import pandas as pd

## I/O (Reading and Writing)

A Pandas DataFrame can be thought of as a spreadsheet made up of rows and columns.

When we work with DataFrames we work with data sets, and even though it is possible to create our own, more often than not we'll be reading them from either a file or a database.

Pandas comes with many functions to read from different sources, such as:

- read_csv()
- read_excel()
- read_json()
- read_html()
- read_parquet()
- read_sql()
- read_table()

In [2]:
df = pd.read_csv('Combined_Flights_2022.csv')

When we read from a file some of the most common parameters to use are:

- delimiter= , the character used to separate fields in the file (e.g. a blank space, a comma, etc.).
- usecols= , to use a subset of columns.
- parse_dates= , to automatically parse dates present in the file.
- chunksize= , to read a large file in chunks.
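
As a rough illustration of the parameters listed above, the sketch below reads a small in-memory CSV instead of a real file. The sample data and the names `csv_text`, `flights` and `chunk_sizes` are made up for this example and are not part of the flights dataset.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk.
csv_text = (
    "FlightDate;Airline;Origin;Dest\n"
    "2022-04-04;UA;GJT;DEN\n"
    "2022-04-04;UA;HRL;IAH\n"
    "2022-04-19;WN;BNA;BOS\n"
)

flights = pd.read_csv(
    io.StringIO(csv_text),
    delimiter=";",                                # separator used in the data
    usecols=["FlightDate", "Airline", "Origin"],  # keep only these columns
    parse_dates=["FlightDate"],                   # parse this column as dates
)
print(flights.dtypes)

# With chunksize, read_csv yields DataFrames of at most that many rows
# instead of returning a single DataFrame.
chunks = pd.read_csv(io.StringIO(csv_text), delimiter=";", chunksize=2)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [2, 1]
```

Reading in chunks is mostly useful when the whole file does not fit comfortably in memory, since each chunk can be processed and discarded before the next one is loaded.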

Once we have read a data set into a DataFrame object, there are matching methods to write it out in a given format, such as:

- to_clipboard()
- to_csv()
- to_dict()
- to_excel()
- to_html()
- to_json()
- to_sql()

Most of these methods take a file path as the first parameter. Another common parameter is:

- index= , whether or not to save the index. When the index carries no relevant information, for example when saving with to_csv(), it is common not to save it.
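
The effect of index= can be seen in a small sketch like the one below. It uses a made-up two-row DataFrame called `small` and writes to strings rather than files (to_csv() returns the CSV text when no path is given).

```python
import io

import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame({"Origin": ["GJT", "HRL"], "Dest": ["DEN", "IAH"]})

with_index = small.to_csv()                # index saved as an unnamed first column
without_index = small.to_csv(index=False)  # only the data columns are saved

print(with_index.splitlines()[0])     # ,Origin,Dest
print(without_index.splitlines()[0])  # Origin,Dest

# Reading the first version back adds a column named "Unnamed: 0".
read_back = pd.read_csv(io.StringIO(with_index))
print(read_back.columns.tolist())  # ['Unnamed: 0', 'Origin', 'Dest']
```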

## Quickview Methods

There are several methods that help us understand what the data looks like. One of them is the .head() method, which by default lets us see the first 5 rows of a DataFrame. The .head() method can also be given a number of rows to display.

Depending on the number of columns present in the dataset, the output may hide the columns in the middle if they exceed the screen width, but we can override this by setting a value for the display.max_columns pandas option.

In [3]:
pd.set_option('display.max_columns', 500)
df.head()

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False,False,1133,1123.0,0.0,-10.0,1228.0,0.0,40.0,72.0,65.0,212.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4301.0,C5,20445.0,C5,N21144,4301.0,11921.0,1192102.0,31921.0,"Grand Junction, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,17.0,1140.0,1220.0,8.0,1245.0,-17.0,0.0,-2.0,1200-1259,1.0,0.0
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False,False,732,728.0,0.0,-4.0,848.0,0.0,55.0,77.0,80.0,295.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4299.0,C5,20445.0,C5,N16170,4299.0,12206.0,1220605.0,32206.0,"Harlingen/San Benito, TX",TX,48.0,Texas,74.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0700-0759,16.0,744.0,839.0,9.0,849.0,-1.0,0.0,-1.0,0800-0859,2.0,0.0
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4296.0,C5,20445.0,C5,N11184,4296.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,11973.0,1197302.0,31973.0,"Gulfport/Biloxi, MS",MS,28.0,Mississippi,53.0,0.0,-1.0,1400-1459,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,1600-1659,2.0,0.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0
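
If changing the option for the whole session is not desired, one possible alternative is pd.option_context(), which applies the setting only inside a with block. The sketch below assumes a made-up one-row DataFrame called `wide` with 30 columns.

```python
import pandas as pd

# Hypothetical wide DataFrame: one row, 30 columns.
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

before = pd.get_option("display.max_columns")

# Inside the block there is no column limit (None means "show all").
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide.head(1))  # head() with an explicit number of rows

# After the block the previous value is restored.
after = pd.get_option("display.max_columns")
print(before == after)  # True
```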


There is also a .tail() method that shows the last 5 rows by default. It can also be given a number of rows to display.

In [4]:
df.tail(3)

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
18324,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1955,1953.0,0.0,-2.0,2307.0,0.0,118.0,145.0,134.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,388.0,WN,19393.0,WN,N7840A,388.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,-1.0,1900-1959,12.0,2005.0,2303.0,4.0,2320.0,-13.0,0.0,-1.0,2300-2359,4.0,0.0
18325,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1010,1010.0,0.0,0.0,1322.0,0.0,120.0,150.0,132.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2775.0,WN,19393.0,WN,N290WN,2775.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,0.0,1000-1059,8.0,1018.0,1318.0,4.0,1340.0,-18.0,0.0,-2.0,1300-1359,4.0,0.0
18326,2022-04-19,Southwest Airlines Co.,BNA,BWI,False,False,635,633.0,0.0,-2.0,907.0,0.0,76.0,105.0,94.0,587.0,20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


The .sample() method returns a random subset of rows. It can be given an integer number of rows to return.

Because .sample() is random, it also accepts a random_state parameter that helps with reproducibility by maintaining the same sample between executions.

In [5]:
df.sample(3, random_state=529)

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
11660,2022-04-10,Southwest Airlines Co.,MCI,BNA,False,False,755,851.0,56.0,56.0,1012.0,42.0,65.0,95.0,81.0,491.0,2022,2.0,4.0,10.0,7.0,WN,WN,19393.0,WN,1678.0,WN,19393.0,WN,N291WN,1678.0,13198.0,1319801.0,33198.0,"Kansas City, MO",MO,29.0,Missouri,64.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,1.0,3.0,0700-0759,10.0,901.0,1006.0,6.0,930.0,42.0,1.0,2.0,0900-0959,2.0,0.0
8347,2022-04-02,Southwest Airlines Co.,STL,FLL,False,False,840,1014.0,94.0,94.0,1401.0,101.0,144.0,160.0,167.0,1057.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,3139.0,WN,19393.0,WN,N7869A,3139.0,15016.0,1501606.0,31123.0,"St. Louis, MO",MO,29.0,Missouri,64.0,11697.0,1169706.0,32467.0,"Fort Lauderdale, FL",FL,12.0,Florida,33.0,1.0,6.0,0800-0859,19.0,1033.0,1357.0,4.0,1220.0,101.0,1.0,6.0,1200-1259,5.0,0.0
8938,2022-04-03,Southwest Airlines Co.,BUR,LAS,False,False,1720,1731.0,11.0,11.0,1828.0,8.0,37.0,60.0,57.0,223.0,2022,2.0,4.0,3.0,7.0,WN,WN,19393.0,WN,1249.0,WN,19393.0,WN,N720WN,1249.0,10800.0,1080003.0,32575.0,"Burbank, CA",CA,6.0,California,91.0,12889.0,1288903.0,32211.0,"Las Vegas, NV",NV,32.0,Nevada,85.0,0.0,0.0,1700-1759,13.0,1744.0,1821.0,7.0,1820.0,8.0,0.0,0.0,1800-1859,1.0,0.0
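
The reproducibility that random_state gives can be checked with a short sketch. The DataFrame `numbers` below is made up for the example.

```python
import pandas as pd

# Hypothetical data: 100 rows with a single column.
numbers = pd.DataFrame({"value": range(100)})

first = numbers.sample(3, random_state=529)
second = numbers.sample(3, random_state=529)

# Same random_state, same rows in the same order.
print(first.equals(second))  # True
```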


.sample() also accepts a frac value, which returns a sample whose size is that fraction of the entire dataset.

In [6]:
df.sample(frac=0.1)

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
16745,2022-04-11,Southwest Airlines Co.,STL,MIA,False,False,1445,1538.0,53.0,53.0,1916.0,61.0,139.0,150.0,158.0,1068.0,2022,2.0,4.0,11.0,1.0,WN,WN,19393.0,WN,18.0,WN,19393.0,WN,N738CB,18.0,15016.0,1501606.0,31123.0,"St. Louis, MO",MO,29.0,Missouri,64.0,13303.0,1330303.0,32467.0,"Miami, FL",FL,12.0,Florida,33.0,1.0,3.0,1400-1459,16.0,1554.0,1913.0,3.0,1815.0,61.0,1.0,4.0,1800-1859,5.0,0.0
2564,2022-04-01,Southwest Airlines Co.,DAL,DCA,False,False,1710,1818.0,68.0,68.0,2147.0,52.0,135.0,165.0,149.0,1184.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,4444.0,WN,19393.0,WN,N7738A,4444.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,11278.0,1127805.0,30852.0,"Washington, DC",VA,51.0,Virginia,38.0,1.0,4.0,1700-1759,9.0,1827.0,2142.0,5.0,2055.0,52.0,1.0,3.0,2000-2059,5.0,0.0
13113,2022-04-10,Southwest Airlines Co.,STL,AUS,False,False,1655,1720.0,25.0,25.0,1920.0,20.0,106.0,125.0,120.0,721.0,2022,2.0,4.0,10.0,7.0,WN,WN,19393.0,WN,2217.0,WN,19393.0,WN,N953WN,2217.0,15016.0,1501606.0,31123.0,"St. Louis, MO",MO,29.0,Missouri,64.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,1.0,1.0,1600-1659,10.0,1730.0,1916.0,4.0,1900.0,20.0,1.0,1.0,1900-1959,3.0,0.0
4899,2022-04-01,Southwest Airlines Co.,SJC,KOA,False,False,935,933.0,0.0,-2.0,1150.0,0.0,305.0,340.0,317.0,2384.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,976.0,WN,19393.0,WN,N8327A,976.0,14831.0,1483106.0,32457.0,"San Jose, CA",CA,6.0,California,91.0,12758.0,1275804.0,32758.0,"Kona, HI",HI,15.0,Hawaii,2.0,0.0,-1.0,0900-0959,9.0,942.0,1147.0,3.0,1215.0,-25.0,0.0,-2.0,1200-1259,10.0,0.0
11937,2022-04-10,Southwest Airlines Co.,MDW,PIT,False,False,2205,2207.0,2.0,2.0,24.0,4.0,56.0,75.0,77.0,402.0,2022,2.0,4.0,10.0,7.0,WN,WN,19393.0,WN,928.0,WN,19393.0,WN,N416WN,928.0,13232.0,1323202.0,30977.0,"Chicago, IL",IL,17.0,Illinois,41.0,14122.0,1412202.0,30198.0,"Pittsburgh, PA",PA,42.0,Pennsylvania,23.0,0.0,0.0,2200-2259,17.0,2224.0,20.0,4.0,20.0,4.0,0.0,0.0,0001-0559,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14544,2022-04-11,Southwest Airlines Co.,FLL,ORD,False,False,1150,1152.0,2.0,2.0,1441.0,26.0,167.0,205.0,229.0,1182.0,2022,2.0,4.0,11.0,1.0,WN,WN,19393.0,WN,1507.0,WN,19393.0,WN,N940WN,1507.0,11697.0,1169706.0,32467.0,"Fort Lauderdale, FL",FL,12.0,Florida,33.0,13930.0,1393007.0,30977.0,"Chicago, IL",IL,17.0,Illinois,41.0,0.0,0.0,1100-1159,12.0,1204.0,1351.0,50.0,1415.0,26.0,1.0,1.0,1400-1459,5.0,0.0
3809,2022-04-01,Southwest Airlines Co.,MCO,ISP,False,False,650,717.0,27.0,27.0,933.0,3.0,111.0,160.0,136.0,971.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1658.0,WN,19393.0,WN,N8676A,1658.0,13204.0,1320402.0,31454.0,"Orlando, FL",FL,12.0,Florida,33.0,12391.0,1239103.0,31703.0,"Islip, NY",NY,36.0,New York,22.0,1.0,1.0,0600-0659,22.0,739.0,930.0,3.0,930.0,3.0,0.0,0.0,0900-0959,4.0,0.0
9483,2022-04-09,Southwest Airlines Co.,SLC,SJC,False,False,1105,1112.0,7.0,7.0,1154.0,4.0,90.0,105.0,102.0,584.0,2022,2.0,4.0,9.0,6.0,WN,WN,19393.0,WN,1919.0,WN,19393.0,WN,N914WN,1919.0,14869.0,1486903.0,34614.0,"Salt Lake City, UT",UT,49.0,Utah,87.0,14831.0,1483106.0,32457.0,"San Jose, CA",CA,6.0,California,91.0,0.0,0.0,1100-1159,9.0,1121.0,1151.0,3.0,1150.0,4.0,0.0,0.0,1100-1159,3.0,0.0
17931,2022-04-18,Southwest Airlines Co.,STL,LAS,False,False,2145,2303.0,78.0,78.0,28.0,78.0,189.0,205.0,205.0,1371.0,2022,2.0,4.0,18.0,1.0,WN,WN,19393.0,WN,361.0,WN,19393.0,WN,N8306H,361.0,15016.0,1501606.0,31123.0,"St. Louis, MO",MO,29.0,Missouri,64.0,12889.0,1288903.0,32211.0,"Las Vegas, NV",NV,32.0,Nevada,85.0,1.0,5.0,2100-2159,12.0,2315.0,24.0,4.0,2310.0,78.0,1.0,5.0,2300-2359,6.0,0.0
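
To make the size of a frac sample concrete, the following sketch uses a made-up 50-row DataFrame called `numbers`. Passing frac=1 is a common way to shuffle all the rows.

```python
import pandas as pd

# Hypothetical data: 50 rows with a single column.
numbers = pd.DataFrame({"value": range(50)})

tenth = numbers.sample(frac=0.1, random_state=0)
print(len(tenth))  # 5, that is 10% of 50 rows

# frac=1 returns every row, in random order.
shuffled = numbers.sample(frac=1, random_state=0)
print(len(shuffled))  # 50
```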


We can access the column names of a DataFrame by using the .columns property.

In [7]:
df.columns

Index(['FlightDate', 'Airline', 'Origin', 'Dest', 'Cancelled', 'Diverted',
       'CRSDepTime', 'DepTime', 'DepDelayMinutes', 'DepDelay', 'ArrTime',
       'ArrDelayMinutes', 'AirTime', 'CRSElapsedTime', 'ActualElapsedTime',
       'Distance', 'Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek',
       'Marketing_Airline_Network', 'Operated_or_Branded_Code_Share_Partners',
       'DOT_ID_Marketing_Airline', 'IATA_Code_Marketing_Airline',
       'Flight_Number_Marketing_Airline', 'Operating_Airline',
       'DOT_ID_Operating_Airline', 'IATA_Code_Operating_Airline',
       'Tail_Number', 'Flight_Number_Operating_Airline', 'OriginAirportID',
       'OriginAirportSeqID', 'OriginCityMarketID', 'OriginCityName',
       'OriginState', 'OriginStateFips', 'OriginStateName', 'OriginWac',
       'DestAirportID', 'DestAirportSeqID', 'DestCityMarketID', 'DestCityName',
       'DestState', 'DestStateFips', 'DestStateName', 'DestWac', 'DepDel15',
       'DepartureDelayGroups', 'DepTimeBlk', 'TaxiOu

Similarly, we can use the .index property to see the index values in the DataFrame.

In [8]:
df.index

RangeIndex(start=0, stop=18327, step=1)

## Data Summary Methods

There are some methods that help us quickly get a top-down view of what the data looks like. One of these methods is .info(), which gives information about the DataFrame including its size, the name and data type of each column, and the memory usage of the dataset. Optionally, we can pass the verbose=False argument for a shorter summary.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18327 entries, 0 to 18326
Data columns (total 61 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   FlightDate                               18327 non-null  object 
 1   Airline                                  18327 non-null  object 
 2   Origin                                   18327 non-null  object 
 3   Dest                                     18327 non-null  object 
 4   Cancelled                                18327 non-null  bool   
 5   Diverted                                 18327 non-null  bool   
 6   CRSDepTime                               18327 non-null  int64  
 7   DepTime                                  17456 non-null  float64
 8   DepDelayMinutes                          17456 non-null  float64
 9   DepDelay                                 17456 non-null  float64
 10  ArrTime                                  17418
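
A minimal sketch of the shorter summary, assuming a made-up three-row DataFrame called `small`. The output is captured in a text buffer through the buf parameter so that it can be inspected as a string.

```python
import io

import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame(
    {"Origin": ["GJT", "HRL", "DRO"], "Distance": [212.0, 295.0, 251.0]}
)

buffer = io.StringIO()
small.info(verbose=False, buf=buffer)  # write the summary to the buffer
summary = buffer.getvalue()

# The short form omits the per-column listing but keeps the
# index range, the dtype counts and the memory usage.
print(summary)
```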

The .describe() method gives us descriptive statistics about the data, including the count, mean, standard deviation (std), min, max, and some percentiles (25%, 50%, 75%) for every numeric column.

In [10]:
df.describe()

Unnamed: 0,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,DOT_ID_Marketing_Airline,Flight_Number_Marketing_Airline,DOT_ID_Operating_Airline,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginStateFips,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestStateFips,DestWac,DepDel15,DepartureDelayGroups,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,DistanceGroup,DivAirportLandings
count,18327.0,17456.0,17456.0,17456.0,17418.0,17397.0,17397.0,18327.0,17397.0,18327.0,18327.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,18326.0,17455.0,17455.0,17436.0,17436.0,17417.0,17417.0,18326.0,17396.0,17396.0,17396.0,18326.0,18326.0
mean,1332.078573,1352.415101,25.221242,23.971127,1427.834539,23.021038,97.940507,124.317564,117.854803,702.141485,2021.890762,2.0,4.0,6.531376,4.628779,19451.030339,2060.58005,19481.795154,2060.39916,12732.565044,1273260.0,31669.021772,24.96153,61.3399,12752.970752,1275301.0,31695.546491,24.684219,61.337444,0.386594,1.021942,13.450734,1371.877151,1426.933513,6.463455,1461.874495,18.047252,0.344619,0.669349,3.297665,0.003056
std,496.392971,518.010874,45.748876,46.494965,573.626425,45.654885,55.539409,55.886119,57.211014,459.034779,14.788304,0.0,0.0,5.305208,2.372645,174.7108,1362.72961,275.300385,1362.486708,1610.343704,161034.3,1260.110449,17.020854,24.871066,1516.63642,151663.6,1229.445849,16.83565,24.869995,0.486983,2.717628,9.071916,518.477489,569.160232,7.177778,539.942295,48.658571,0.475258,2.833992,1.841128,0.104426
min,500.0,1.0,0.0,-20.0,1.0,0.0,16.0,35.0,28.0,67.0,20.0,2.0,4.0,1.0,1.0,19393.0,1.0,19393.0,1.0,10135.0,1013506.0,30135.0,1.0,2.0,10135.0,1013506.0,30135.0,1.0,2.0,0.0,-2.0,3.0,1.0,1.0,1.0,5.0,-49.0,0.0,-2.0,1.0,0.0
25%,910.0,915.0,0.0,-2.0,1022.25,0.0,57.0,83.0,76.0,358.0,2022.0,2.0,4.0,2.0,1.0,19393.0,886.0,19393.0,886.0,11259.0,1125904.0,30559.0,8.0,37.0,11278.0,1127805.0,30693.0,8.0,38.0,0.0,-1.0,9.0,929.0,1020.0,3.0,1045.0,-9.0,0.0,-1.0,2.0,0.0
50%,1325.0,1337.0,7.0,7.0,1445.0,3.0,84.0,113.0,104.0,602.0,2022.0,2.0,4.0,3.0,5.0,19393.0,1928.0,19393.0,1928.0,12892.0,1289208.0,31205.0,22.0,73.0,12892.0,1289208.0,31453.0,21.0,73.0,0.0,0.0,11.0,1350.0,1443.0,5.0,1500.0,3.0,0.0,0.0,3.0,0.0
75%,1745.0,1805.0,30.0,30.0,1909.0,26.0,124.0,150.0,145.0,925.0,2022.0,2.0,4.0,11.0,7.0,19393.0,2912.0,19393.0,2912.0,14107.0,1410702.0,32575.0,44.0,82.0,14107.0,1410702.0,32575.0,42.0,82.0,1.0,2.0,15.0,1818.0,1908.0,7.0,1915.0,26.0,1.0,1.0,4.0,0.0
max,2255.0,2400.0,659.0,659.0,2400.0,697.0,404.0,425.0,474.0,2979.0,2022.0,2.0,4.0,29.0,7.0,19977.0,6665.0,20500.0,6665.0,15919.0,1591904.0,35412.0,72.0,93.0,15919.0,1591904.0,35412.0,72.0,93.0,1.0,12.0,168.0,2400.0,2400.0,186.0,2355.0,697.0,1.0,12.0,11.0,9.0


.describe() can also handle non-numeric columns. For those it shows the count, the number of unique values, the top occurring value and its frequency.

In [11]:
df[['Airline']].describe()

Unnamed: 0,Airline
count,18327
unique,5
top,Southwest Airlines Co.
freq,16506
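
Two other parameters of .describe() may be useful here: percentiles= to choose which percentiles are reported, and include="all" to summarize numeric and non-numeric columns together. The sketch below uses a made-up four-row DataFrame called `small`.

```python
import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame(
    {
        "Airline": ["WN", "WN", "UA", "WN"],
        "Distance": [942.0, 587.0, 212.0, 942.0],
    }
)

# Report the 10th and 90th percentiles of the numeric columns.
custom = small.describe(percentiles=[0.1, 0.9])
print(custom)

# Summarize every column; statistics that do not apply are shown as NaN.
everything = small.describe(include="all")
print(everything)
```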


The .shape property gives the number of rows and columns present in the dataset, in that order.

In [12]:
df.shape

(18327, 61)

We could also use len(df) to obtain the number of rows.

In [13]:
len(df)

18327
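
Since .shape is a plain tuple, it can be unpacked into two variables. A small sketch, using a made-up DataFrame called `small`:

```python
import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame(
    {"Origin": ["GJT", "HRL", "DRO"], "Dest": ["DEN", "IAH", "DEN"]}
)

n_rows, n_cols = small.shape  # (rows, columns)
print(n_rows, n_cols)         # 3 2
print(len(small) == n_rows)   # True
print(small.size)             # 6, the total number of cells
```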

## Subsetting a DataFrame

We can subset a DataFrame to a set of columns by providing a list of column names like so.

In [14]:
df[['FlightDate', 'Airline', 'Origin']]

Unnamed: 0,FlightDate,Airline,Origin
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO
...,...,...,...
18322,2022-04-19,Southwest Airlines Co.,BNA
18323,2022-04-19,Southwest Airlines Co.,BNA
18324,2022-04-19,Southwest Airlines Co.,BNA
18325,2022-04-19,Southwest Airlines Co.,BNA


We could also use a combination of the .columns property and the slice syntax to obtain a list of column names and pass that within the brackets [ ] to obtain a subset.

In [15]:
df[df.columns[:5]]

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False
...,...,...,...,...,...
18322,2022-04-19,Southwest Airlines Co.,BNA,AUS,False
18323,2022-04-19,Southwest Airlines Co.,BNA,BDL,False
18324,2022-04-19,Southwest Airlines Co.,BNA,BOS,False
18325,2022-04-19,Southwest Airlines Co.,BNA,BOS,False


We could also use a list comprehension to obtain a subset of, for example, all the columns whose names contain the word 'Time'.

In [16]:
df[[c for c in df.columns if 'Time' in c]]

Unnamed: 0,CRSDepTime,DepTime,ArrTime,AirTime,CRSElapsedTime,ActualElapsedTime,DepTimeBlk,CRSArrTime,ArrTimeBlk
0,1133,1123.0,1228.0,40.0,72.0,65.0,1100-1159,1245.0,1200-1259
1,732,728.0,848.0,55.0,77.0,80.0,0700-0759,849.0,0800-0859
2,1529,1514.0,1636.0,47.0,70.0,82.0,1500-1559,1639.0,1600-1659
3,1435,1430.0,1547.0,57.0,90.0,77.0,1400-1459,1605.0,1600-1659
4,1135,1135.0,1251.0,49.0,70.0,76.0,1100-1159,1245.0,1200-1259
...,...,...,...,...,...,...,...,...,...
18322,2215,2206.0,10.0,110.0,140.0,124.0,2200-2259,35.0,0001-0559
18323,1335,1336.0,1640.0,109.0,135.0,124.0,1300-1359,1650.0,1600-1659
18324,1955,1953.0,2307.0,118.0,145.0,134.0,1900-1959,2320.0,2300-2359
18325,1010,1010.0,1322.0,120.0,150.0,132.0,1000-1059,1340.0,1300-1359
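
The same selection can also be written without a list comprehension. The sketch below compares it with two alternatives, .filter(like=...) and a boolean mask built with .columns.str.contains(), on a made-up one-row DataFrame called `small`.

```python
import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame(
    {
        "DepTime": [1123.0],
        "ArrTime": [1228.0],
        "Origin": ["GJT"],
        "AirTime": [40.0],
    }
)

by_comprehension = small[[c for c in small.columns if "Time" in c]]
by_filter = small.filter(like="Time")                         # substring match on column names
by_mask = small.loc[:, small.columns.str.contains("Time")]    # boolean mask over the columns

print(by_filter.columns.tolist())  # ['DepTime', 'ArrTime', 'AirTime']
```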


We can also obtain a subset of columns based on the data type they contain. For this we can use the .select_dtypes() method.

In [17]:
df.select_dtypes('float')

Unnamed: 0,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Quarter,Month,DayofMonth,DayOfWeek,DOT_ID_Marketing_Airline,Flight_Number_Marketing_Airline,DOT_ID_Operating_Airline,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginStateFips,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestStateFips,DestWac,DepDel15,DepartureDelayGroups,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,DistanceGroup,DivAirportLandings
0,1123.0,0.0,-10.0,1228.0,0.0,40.0,72.0,65.0,212.0,2.0,4.0,4.0,1.0,19977.0,4301.0,20445.0,4301.0,11921.0,1192102.0,31921.0,8.0,82.0,11292.0,1129202.0,30325.0,8.0,82.0,0.0,-1.0,17.0,1140.0,1220.0,8.0,1245.0,-17.0,0.0,-2.0,1.0,0.0
1,728.0,0.0,-4.0,848.0,0.0,55.0,77.0,80.0,295.0,2.0,4.0,4.0,1.0,19977.0,4299.0,20445.0,4299.0,12206.0,1220605.0,32206.0,48.0,74.0,12266.0,1226603.0,31453.0,48.0,74.0,0.0,-1.0,16.0,744.0,839.0,9.0,849.0,-1.0,0.0,-1.0,2.0,0.0
2,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2.0,4.0,4.0,1.0,19977.0,4298.0,20445.0,4298.0,11413.0,1141307.0,30285.0,8.0,82.0,11292.0,1129202.0,30325.0,8.0,82.0,0.0,-1.0,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,2.0,0.0
3,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2.0,4.0,4.0,1.0,19977.0,4296.0,20445.0,4296.0,12266.0,1226603.0,31453.0,48.0,74.0,11973.0,1197302.0,31973.0,28.0,53.0,0.0,-1.0,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,2.0,0.0
4,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2.0,4.0,4.0,1.0,19977.0,4295.0,20445.0,4295.0,11413.0,1141307.0,30285.0,8.0,82.0,11292.0,1129202.0,30325.0,8.0,82.0,0.0,0.0,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18322,2206.0,0.0,-9.0,10.0,0.0,110.0,140.0,124.0,756.0,2.0,4.0,19.0,2.0,19393.0,2019.0,19393.0,2019.0,10693.0,1069302.0,30693.0,47.0,54.0,10423.0,1042302.0,30423.0,48.0,74.0,0.0,-1.0,11.0,2217.0,7.0,3.0,35.0,-25.0,0.0,-2.0,4.0,0.0
18323,1336.0,1.0,1.0,1640.0,0.0,109.0,135.0,124.0,852.0,2.0,4.0,19.0,2.0,19393.0,694.0,19393.0,694.0,10693.0,1069302.0,30693.0,47.0,54.0,10529.0,1052906.0,30529.0,9.0,11.0,0.0,0.0,12.0,1348.0,1637.0,3.0,1650.0,-10.0,0.0,-1.0,4.0,0.0
18324,1953.0,0.0,-2.0,2307.0,0.0,118.0,145.0,134.0,942.0,2.0,4.0,19.0,2.0,19393.0,388.0,19393.0,388.0,10693.0,1069302.0,30693.0,47.0,54.0,10721.0,1072102.0,30721.0,25.0,13.0,0.0,-1.0,12.0,2005.0,2303.0,4.0,2320.0,-13.0,0.0,-1.0,4.0,0.0
18325,1010.0,0.0,0.0,1322.0,0.0,120.0,150.0,132.0,942.0,2.0,4.0,19.0,2.0,19393.0,2775.0,19393.0,2775.0,10693.0,1069302.0,30693.0,47.0,54.0,10721.0,1072102.0,30721.0,25.0,13.0,0.0,0.0,8.0,1018.0,1318.0,4.0,1340.0,-18.0,0.0,-2.0,4.0,0.0
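
.select_dtypes() also accepts include= and exclude= arguments. A minimal sketch on a made-up one-row DataFrame called `small`:

```python
import pandas as pd

# Hypothetical data with one column per data type.
small = pd.DataFrame(
    {
        "Origin": ["GJT"],
        "Cancelled": [False],
        "CRSDepTime": [1133],
        "DepTime": [1123.0],
    }
)

# "number" matches integer and float columns, but not booleans.
numeric = small.select_dtypes(include="number")
print(numeric.columns.tolist())  # ['CRSDepTime', 'DepTime']

# exclude= keeps everything except the given types.
not_float = small.select_dtypes(exclude="float")
print(not_float.columns.tolist())  # ['Origin', 'Cancelled', 'CRSDepTime']
```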


One thing to notice is that when we select a single column by passing its name within single brackets, we obtain a pandas Series object instead of a DataFrame object.

In [18]:
col = df['Airline']
col

0        Commutair Aka Champlain Enterprises, Inc.
1        Commutair Aka Champlain Enterprises, Inc.
2        Commutair Aka Champlain Enterprises, Inc.
3        Commutair Aka Champlain Enterprises, Inc.
4        Commutair Aka Champlain Enterprises, Inc.
                           ...                    
18322                       Southwest Airlines Co.
18323                       Southwest Airlines Co.
18324                       Southwest Airlines Co.
18325                       Southwest Airlines Co.
18326                       Southwest Airlines Co.
Name: Airline, Length: 18327, dtype: object

In [19]:
type(col)

pandas.core.series.Series

If we want the column returned as a DataFrame we must use double brackets.

In [20]:
col = df[['Airline']]
col

Unnamed: 0,Airline
0,"Commutair Aka Champlain Enterprises, Inc."
1,"Commutair Aka Champlain Enterprises, Inc."
2,"Commutair Aka Champlain Enterprises, Inc."
3,"Commutair Aka Champlain Enterprises, Inc."
4,"Commutair Aka Champlain Enterprises, Inc."
...,...
18322,Southwest Airlines Co.
18323,Southwest Airlines Co.
18324,Southwest Airlines Co.
18325,Southwest Airlines Co.


In [21]:
type(col)

pandas.core.frame.DataFrame
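
When we already hold one of the two objects, it is also possible to convert between them. The sketch below uses a made-up DataFrame called `small`, and .to_frame() to turn a Series into a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame({"Airline": ["UA", "WN"], "Origin": ["GJT", "BNA"]})

as_series = small["Airline"]   # single brackets: Series
as_frame = small[["Airline"]]  # double brackets: one-column DataFrame

promoted = as_series.to_frame()  # Series -> DataFrame
demoted = as_frame["Airline"]    # DataFrame -> Series

print(type(promoted).__name__, type(demoted).__name__)  # DataFrame Series
```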

To filter a DataFrame based on its rows we use either .iloc or .loc. Both allow us to access elements in the DataFrame, but .iloc uses integer positions and .loc uses labels (index values and column names).

The first value that .iloc receives is a row index number (or subset of row numbers), the second value is the column number (or subset of column numbers).

In [22]:
df.iloc[1, 3]

'IAH'

In [23]:
df.iloc[:5, :5]

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False


If we provide a single value we select only that row and obtain a pandas Series object.

In [24]:
df.iloc[5]

FlightDate                                           2022-04-04
Airline               Commutair Aka Champlain Enterprises, Inc.
Origin                                                      DEN
Dest                                                        TUL
Cancelled                                                 False
                                        ...                    
ArrDel15                                                    0.0
ArrivalDelayGroups                                         -1.0
ArrTimeBlk                                            1200-1259
DistanceGroup                                               3.0
DivAirportLandings                                          0.0
Name: 5, Length: 61, dtype: object

If we put it in a list we obtain a single row as a DataFrame instead.

In [25]:
df.iloc[[5]]

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
5,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DEN,TUL,False,False,955,952.0,0.0,-3.0,1238.0,0.0,77.0,105.0,106.0,541.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4294.0,C5,20445.0,C5,N11191,4294.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,15370.0,1537002.0,34653.0,"Tulsa, OK",OK,40.0,Oklahoma,73.0,0.0,-1.0,0900-0959,25.0,1017.0,1234.0,4.0,1240.0,-2.0,0.0,-1.0,1200-1259,3.0,0.0
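
The different kinds of values that .iloc accepts can be summarized in one small sketch. The four-row DataFrame `small` is made up for this example.

```python
import pandas as pd

# Hypothetical data, only for illustration.
small = pd.DataFrame(
    {
        "Airline": ["UA", "UA", "WN", "WN"],
        "Origin": ["GJT", "HRL", "STL", "BNA"],
        "Dest": ["DEN", "IAH", "FLL", "BOS"],
    }
)

cell = small.iloc[1, 2]               # single value: row 1, column 2
block = small.iloc[:2, :2]            # slices: first two rows and columns
last_row = small.iloc[-1]             # single row as a Series
picked = small.iloc[[0, 3], [0, 2]]   # lists: rows 0 and 3, columns 0 and 2

print(cell)                     # IAH
print(picked.columns.tolist())  # ['Airline', 'Dest']
```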


If we want to filter only to columns we could use the : character for the rows value to say 'give me all the rows', and pass a column number like so. Here we are selecting all the rows for the second column (1).

In [26]:
df.iloc[:, 1]

0        Commutair Aka Champlain Enterprises, Inc.
1        Commutair Aka Champlain Enterprises, Inc.
2        Commutair Aka Champlain Enterprises, Inc.
3        Commutair Aka Champlain Enterprises, Inc.
4        Commutair Aka Champlain Enterprises, Inc.
                           ...                    
18322                       Southwest Airlines Co.
18323                       Southwest Airlines Co.
18324                       Southwest Airlines Co.
18325                       Southwest Airlines Co.
18326                       Southwest Airlines Co.
Name: Airline, Length: 18327, dtype: object

If we put the column value in a list we obtain it as a DataFrame, which helps with visualization.

In [27]:
df.iloc[:, [1]]

Unnamed: 0,Airline
0,"Commutair Aka Champlain Enterprises, Inc."
1,"Commutair Aka Champlain Enterprises, Inc."
2,"Commutair Aka Champlain Enterprises, Inc."
3,"Commutair Aka Champlain Enterprises, Inc."
4,"Commutair Aka Champlain Enterprises, Inc."
...,...
18322,Southwest Airlines Co.
18323,Southwest Airlines Co.
18324,Southwest Airlines Co.
18325,Southwest Airlines Co.


If we want to obtain a subset now using .loc instead of .iloc, we would use the column names instead of the column index. The result is exactly the same.

In [28]:
df.loc[:,['Airline']]

Unnamed: 0,Airline
0,"Commutair Aka Champlain Enterprises, Inc."
1,"Commutair Aka Champlain Enterprises, Inc."
2,"Commutair Aka Champlain Enterprises, Inc."
3,"Commutair Aka Champlain Enterprises, Inc."
4,"Commutair Aka Champlain Enterprises, Inc."
...,...
18322,Southwest Airlines Co.
18323,Southwest Airlines Co.
18324,Southwest Airlines Co.
18325,Southwest Airlines Co.
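
As a minimal sketch of the difference between the two selectors, the snippet below uses a tiny DataFrame invented for illustration (the `flights` name and its three rows are assumptions, not part of the flights file). It shows that a scalar position returns a Series, while a list of positions or a list of labels returns a DataFrame.

```python
import pandas as pd

# Toy data invented for illustration; not read from the flights file.
flights = pd.DataFrame({
    "FlightDate": ["2022-04-01", "2022-04-02", "2022-04-02"],
    "Airline": ["Southwest Airlines Co.", "Mesa Airlines Inc.", "Southwest Airlines Co."],
    "DepTime": [1037.0, 952.0, 1823.0],
})

by_position = flights.iloc[:, 1]            # scalar position -> Series
by_position_df = flights.iloc[:, [1]]       # list of positions -> DataFrame
by_label_df = flights.loc[:, ["Airline"]]   # list of labels -> DataFrame

print(type(by_position).__name__)
print(type(by_position_df).__name__)
print(by_position_df.equals(by_label_df))
```

Position-based selection breaks silently if the column order changes, so label-based selection with .loc is usually the safer choice when the column names are known.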


One of the advantages of .loc is that it lets us filter not only by label but also by boolean expression.

For this to work, we must first create an expression that is evaluated against one or more columns and returns a Series of True and False values.

In [29]:
df['Airline'] == 'Southwest Airlines Co.'

0        False
1        False
2        False
3        False
4        False
         ...  
18322     True
18323     True
18324     True
18325     True
18326     True
Name: Airline, Length: 18327, dtype: bool

If we pass that expression to .loc, we obtain only the rows where the expression evaluates to True.

In [30]:
df.loc[df['Airline'] == 'Southwest Airlines Co.']

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
1821,2022-04-01,Southwest Airlines Co.,ABQ,AUS,False,False,1035,1037.0,2.0,2.0,1303.0,0.0,75.0,100.0,86.0,619.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,236.0,WN,19393.0,WN,N232WN,236.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,0.0,1000-1059,8.0,1045.0,1300.0,3.0,1315.0,-12.0,0.0,-1.0,1300-1359,3.0,0.0
1822,2022-04-01,Southwest Airlines Co.,ABQ,BUR,False,False,1750,1823.0,33.0,33.0,1914.0,24.0,96.0,120.0,111.0,672.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1238.0,WN,19393.0,WN,N248WN,1238.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10800.0,1080003.0,32575.0,"Burbank, CA",CA,6.0,California,91.0,1.0,2.0,1700-1759,13.0,1836.0,1912.0,2.0,1850.0,24.0,1.0,1.0,1800-1859,3.0,0.0
1823,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,645,648.0,3.0,3.0,922.0,0.0,79.0,105.0,94.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,100.0,WN,19393.0,WN,N206WN,100.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,0.0,0600-0659,13.0,701.0,920.0,2.0,930.0,-8.0,0.0,-1.0,0900-0959,3.0,0.0
1824,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1150,1146.0,0.0,-4.0,1439.0,4.0,76.0,105.0,113.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,869.0,WN,19393.0,WN,N8689C,869.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,-1.0,1100-1159,9.0,1155.0,1411.0,28.0,1435.0,4.0,0.0,0.0,1400-1459,3.0,0.0
1825,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1730,1757.0,27.0,27.0,2042.0,32.0,83.0,100.0,105.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1256.0,WN,19393.0,WN,N774SW,1256.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,1.0,1700-1759,13.0,1810.0,2033.0,9.0,2010.0,32.0,1.0,2.0,2000-2059,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18322,2022-04-19,Southwest Airlines Co.,BNA,AUS,False,False,2215,2206.0,0.0,-9.0,10.0,0.0,110.0,140.0,124.0,756.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2019.0,WN,19393.0,WN,N407WN,2019.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,-1.0,2200-2259,11.0,2217.0,7.0,3.0,35.0,-25.0,0.0,-2.0,0001-0559,4.0,0.0
18323,2022-04-19,Southwest Airlines Co.,BNA,BDL,False,False,1335,1336.0,1.0,1.0,1640.0,0.0,109.0,135.0,124.0,852.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,694.0,WN,19393.0,WN,N749SW,694.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10529.0,1052906.0,30529.0,"Hartford, CT",CT,9.0,Connecticut,11.0,0.0,0.0,1300-1359,12.0,1348.0,1637.0,3.0,1650.0,-10.0,0.0,-1.0,1600-1659,4.0,0.0
18324,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1955,1953.0,0.0,-2.0,2307.0,0.0,118.0,145.0,134.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,388.0,WN,19393.0,WN,N7840A,388.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,-1.0,1900-1959,12.0,2005.0,2303.0,4.0,2320.0,-13.0,0.0,-1.0,2300-2359,4.0,0.0
18325,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1010,1010.0,0.0,0.0,1322.0,0.0,120.0,150.0,132.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2775.0,WN,19393.0,WN,N290WN,2775.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,0.0,1000-1059,8.0,1018.0,1318.0,4.0,1340.0,-18.0,0.0,-2.0,1300-1359,4.0,0.0
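
The same idea can be sketched on a small, made-up DataFrame (the `flights` data and the variable names below are assumptions for illustration). Storing the boolean Series in a variable makes the filter reusable, and .loc can take a column list as a second argument to filter rows and select columns in one step.

```python
import pandas as pd

# Toy data invented for illustration; not read from the flights file.
flights = pd.DataFrame({
    "FlightDate": ["2022-04-01", "2022-04-02", "2022-04-02"],
    "Airline": ["Southwest Airlines Co.", "Mesa Airlines Inc.", "Southwest Airlines Co."],
    "DepTime": [1037.0, 952.0, 1823.0],
})

# Boolean Series: True where the row matches the condition.
is_southwest = flights["Airline"] == "Southwest Airlines Co."

# Rows only.
southwest = flights.loc[is_southwest]

# Rows and columns at once.
southwest_times = flights.loc[is_southwest, ["FlightDate", "DepTime"]]

print(southwest)
print(southwest_times)
```

Note that the filtered result keeps the original index labels of the matching rows.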


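We can combine expressions with the & (and) and | (or) operators to filter rows on more than one condition. Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as ==.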

In [31]:
df.loc[(df['Airline'] == 'Southwest Airlines Co.') & (df['FlightDate'] == '2022-04-02')]

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
5333,2022-04-02,Southwest Airlines Co.,ABQ,AUS,False,False,940,1007.0,27.0,27.0,1255.0,35.0,81.0,100.0,108.0,619.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,3466.0,WN,19393.0,WN,N955WN,3466.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,1.0,1.0,0900-0959,10.0,1017.0,1238.0,17.0,1220.0,35.0,1.0,2.0,1200-1259,3.0,0.0
5334,2022-04-02,Southwest Airlines Co.,ABQ,BUR,False,False,1725,1726.0,1.0,1.0,1817.0,0.0,100.0,120.0,111.0,672.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,2275.0,WN,19393.0,WN,N237WN,2275.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10800.0,1080003.0,32575.0,"Burbank, CA",CA,6.0,California,91.0,0.0,0.0,1700-1759,10.0,1736.0,1816.0,1.0,1825.0,-8.0,0.0,-1.0,1800-1859,3.0,0.0
5335,2022-04-02,Southwest Airlines Co.,ABQ,BWI,False,False,1030,1025.0,0.0,-5.0,1607.0,2.0,193.0,215.0,222.0,1670.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,3965.0,WN,19393.0,WN,N410WN,3965.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10821.0,1082106.0,30852.0,"Baltimore, MD",MD,24.0,Maryland,35.0,0.0,-1.0,1000-1059,11.0,1036.0,1549.0,18.0,1605.0,2.0,0.0,0.0,1600-1659,7.0,0.0
5336,2022-04-02,Southwest Airlines Co.,ABQ,DAL,False,False,1055,1114.0,19.0,19.0,1348.0,13.0,80.0,100.0,94.0,580.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,3500.0,WN,19393.0,WN,N227WN,3500.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,1.0,1000-1059,10.0,1124.0,1344.0,4.0,1335.0,13.0,0.0,0.0,1300-1359,3.0,0.0
5337,2022-04-02,Southwest Airlines Co.,ABQ,DAL,False,False,1325,1413.0,48.0,48.0,1700.0,50.0,81.0,105.0,107.0,580.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,4010.0,WN,19393.0,WN,N405WN,4010.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,3.0,1300-1359,13.0,1426.0,1647.0,13.0,1610.0,50.0,1.0,3.0,1600-1659,3.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8516,2022-04-02,Southwest Airlines Co.,VPS,BWI,False,False,1045,1210.0,85.0,85.0,1503.0,68.0,102.0,130.0,113.0,819.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,2494.0,WN,19393.0,WN,N720WN,2494.0,15624.0,1562404.0,31504.0,"Valparaiso, FL",FL,12.0,Florida,33.0,10821.0,1082106.0,30852.0,"Baltimore, MD",MD,24.0,Maryland,35.0,1.0,5.0,1000-1059,8.0,1218.0,1500.0,3.0,1355.0,68.0,1.0,4.0,1300-1359,4.0,0.0
8517,2022-04-02,Southwest Airlines Co.,VPS,DAL,False,False,950,1127.0,97.0,97.0,1318.0,68.0,100.0,140.0,111.0,630.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,3010.0,WN,19393.0,WN,N7824A,3010.0,15624.0,1562404.0,31504.0,"Valparaiso, FL",FL,12.0,Florida,33.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,6.0,0900-0959,7.0,1134.0,1314.0,4.0,1210.0,68.0,1.0,4.0,1200-1259,3.0,0.0
8518,2022-04-02,Southwest Airlines Co.,VPS,DAL,False,False,1420,1555.0,95.0,95.0,1745.0,65.0,97.0,140.0,110.0,630.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,3106.0,WN,19393.0,WN,N952WN,3106.0,15624.0,1562404.0,31504.0,"Valparaiso, FL",FL,12.0,Florida,33.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,6.0,1400-1459,7.0,1602.0,1739.0,6.0,1640.0,65.0,1.0,4.0,1600-1659,3.0,0.0
8519,2022-04-02,Southwest Airlines Co.,VPS,MDW,False,False,1525,1646.0,81.0,81.0,1850.0,35.0,110.0,170.0,124.0,782.0,2022,2.0,4.0,2.0,6.0,WN,WN,19393.0,WN,1390.0,WN,19393.0,WN,N936WN,1390.0,15624.0,1562404.0,31504.0,"Valparaiso, FL",FL,12.0,Florida,33.0,13232.0,1323202.0,30977.0,"Chicago, IL",IL,17.0,Illinois,41.0,1.0,5.0,1500-1559,9.0,1655.0,1845.0,5.0,1815.0,35.0,1.0,2.0,1800-1859,4.0,0.0


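One important note is that we can easily take the inverse of an expression by adding the ~ character in front of it and then using .loc on it like so. This gives us every row that does not satisfy both conditions at once, that is, all the rows where the Airline name is not 'Southwest Airlines Co.' or the FlightDate is not '2022-04-02'. Southwest flights on other dates are therefore still included, as the last rows of the output show.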

In [32]:
df.loc[~((df['Airline'] == 'Southwest Airlines Co.') & (df['FlightDate'] == '2022-04-02'))]

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False,False,1133,1123.0,0.0,-10.0,1228.0,0.0,40.0,72.0,65.0,212.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4301.0,C5,20445.0,C5,N21144,4301.0,11921.0,1192102.0,31921.0,"Grand Junction, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,17.0,1140.0,1220.0,8.0,1245.0,-17.0,0.0,-2.0,1200-1259,1.0,0.0
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False,False,732,728.0,0.0,-4.0,848.0,0.0,55.0,77.0,80.0,295.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4299.0,C5,20445.0,C5,N16170,4299.0,12206.0,1220605.0,32206.0,"Harlingen/San Benito, TX",TX,48.0,Texas,74.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0700-0759,16.0,744.0,839.0,9.0,849.0,-1.0,0.0,-1.0,0800-0859,2.0,0.0
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4296.0,C5,20445.0,C5,N11184,4296.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,11973.0,1197302.0,31973.0,"Gulfport/Biloxi, MS",MS,28.0,Mississippi,53.0,0.0,-1.0,1400-1459,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,1600-1659,2.0,0.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18322,2022-04-19,Southwest Airlines Co.,BNA,AUS,False,False,2215,2206.0,0.0,-9.0,10.0,0.0,110.0,140.0,124.0,756.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2019.0,WN,19393.0,WN,N407WN,2019.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,-1.0,2200-2259,11.0,2217.0,7.0,3.0,35.0,-25.0,0.0,-2.0,0001-0559,4.0,0.0
18323,2022-04-19,Southwest Airlines Co.,BNA,BDL,False,False,1335,1336.0,1.0,1.0,1640.0,0.0,109.0,135.0,124.0,852.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,694.0,WN,19393.0,WN,N749SW,694.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10529.0,1052906.0,30529.0,"Hartford, CT",CT,9.0,Connecticut,11.0,0.0,0.0,1300-1359,12.0,1348.0,1637.0,3.0,1650.0,-10.0,0.0,-1.0,1600-1659,4.0,0.0
18324,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1955,1953.0,0.0,-2.0,2307.0,0.0,118.0,145.0,134.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,388.0,WN,19393.0,WN,N7840A,388.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,-1.0,1900-1959,12.0,2005.0,2303.0,4.0,2320.0,-13.0,0.0,-1.0,2300-2359,4.0,0.0
18325,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1010,1010.0,0.0,0.0,1322.0,0.0,120.0,150.0,132.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2775.0,WN,19393.0,WN,N290WN,2775.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,0.0,1000-1059,8.0,1018.0,1318.0,4.0,1340.0,-18.0,0.0,-2.0,1300-1359,4.0,0.0
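
To make the effect of negating a combined condition concrete, here is a small sketch on invented data (the four-row `flights` DataFrame is an assumption for illustration). It compares ~(A & B) with (~A | ~B) and with (~A & ~B); the first two select the same rows, while the third is stricter.

```python
import pandas as pd

# Toy data invented for illustration: every airline/date combination once.
flights = pd.DataFrame({
    "Airline": ["Southwest Airlines Co.", "Southwest Airlines Co.",
                "Mesa Airlines Inc.", "Mesa Airlines Inc."],
    "FlightDate": ["2022-04-02", "2022-04-19", "2022-04-02", "2022-04-19"],
})

is_southwest = flights["Airline"] == "Southwest Airlines Co."
on_april_2 = flights["FlightDate"] == "2022-04-02"

# Negating the combined condition: drop only rows where BOTH are True.
not_both = flights.loc[~(is_southwest & on_april_2)]

# Equivalent form: at least one condition is False.
either_fails = flights.loc[~is_southwest | ~on_april_2]

# Stricter filter: both conditions are False.
neither = flights.loc[~is_southwest & ~on_april_2]

print(len(not_both), len(neither))
print(not_both.equals(either_fails))
```

If the goal is to exclude every Southwest flight and every flight on that date, the stricter `neither` form is the one to use.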


An alternate way to query our DataFrame based on boolean expressions is by using the .query() method. This method takes a string representation of the boolean expression we wish to filter on.

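As an example, we can filter for departure times later than 11:30. The DepTime column appears to store the time as a number in hhmm form, so 11:30 is written as 1130.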

In [33]:
df.query('DepTime > 1130')

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4296.0,C5,20445.0,C5,N11184,4296.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,11973.0,1197302.0,31973.0,"Gulfport/Biloxi, MS",MS,28.0,Mississippi,53.0,0.0,-1.0,1400-1459,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,1600-1659,2.0,0.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0
6,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,LCH,False,False,2139,2136.0,0.0,-3.0,2218.0,0.0,26.0,52.0,42.0,127.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4293.0,C5,20445.0,C5,N14143,4293.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,12915.0,1291503.0,31205.0,"Lake Charles, LA",LA,22.0,Louisiana,72.0,0.0,-1.0,2100-2159,11.0,2147.0,2213.0,5.0,2231.0,-13.0,0.0,-1.0,2200-2259,1.0,0.0
8,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,AEX,False,False,1424,1414.0,0.0,-10.0,1513.0,0.0,37.0,60.0,59.0,190.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4291.0,C5,20445.0,C5,N33182,4291.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,10185.0,1018502.0,30185.0,"Alexandria, LA",LA,22.0,Louisiana,72.0,0.0,-1.0,1400-1459,16.0,1430.0,1507.0,6.0,1524.0,-11.0,0.0,-1.0,1500-1559,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18319,2022-04-19,Southwest Airlines Co.,BNA,AUS,False,False,1230,1228.0,0.0,-2.0,1431.0,0.0,109.0,140.0,123.0,756.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,301.0,WN,19393.0,WN,N747SA,301.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,-1.0,1200-1259,9.0,1237.0,1426.0,5.0,1450.0,-19.0,0.0,-2.0,1400-1459,4.0,0.0
18321,2022-04-19,Southwest Airlines Co.,BNA,AUS,False,False,1520,1519.0,0.0,-1.0,1737.0,0.0,111.0,140.0,138.0,756.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,1704.0,WN,19393.0,WN,N8611F,1704.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,-1.0,1500-1559,10.0,1529.0,1720.0,17.0,1740.0,-3.0,0.0,-1.0,1700-1759,4.0,0.0
18322,2022-04-19,Southwest Airlines Co.,BNA,AUS,False,False,2215,2206.0,0.0,-9.0,10.0,0.0,110.0,140.0,124.0,756.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2019.0,WN,19393.0,WN,N407WN,2019.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,-1.0,2200-2259,11.0,2217.0,7.0,3.0,35.0,-25.0,0.0,-2.0,0001-0559,4.0,0.0
18323,2022-04-19,Southwest Airlines Co.,BNA,BDL,False,False,1335,1336.0,1.0,1.0,1640.0,0.0,109.0,135.0,124.0,852.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,694.0,WN,19393.0,WN,N749SW,694.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10529.0,1052906.0,30529.0,"Hartford, CT",CT,9.0,Connecticut,11.0,0.0,0.0,1300-1359,12.0,1348.0,1637.0,3.0,1650.0,-10.0,0.0,-1.0,1600-1659,4.0,0.0


The query can combine multiple expressions too. Unquoted names in the query are treated as column names, so string values must be wrapped in quotes (here, double quotes inside the single-quoted query string).

In [34]:
df.query('(DepTime > 1130) & (Origin == "DRO")')

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0
531,2022-04-03,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1524.0,0.0,-5.0,1633.0,0.0,50.0,70.0,69.0,251.0,2022,2.0,4.0,3.0,7.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N11191,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,9.0,1533.0,1623.0,10.0,1639.0,-6.0,0.0,-1.0,1600-1659,2.0,0.0
533,2022-04-03,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1134.0,0.0,-1.0,1247.0,2.0,43.0,70.0,73.0,251.0,2022,2.0,4.0,3.0,7.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N31131,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,19.0,1153.0,1236.0,11.0,1245.0,2.0,0.0,0.0,1200-1259,2.0,0.0
1003,2022-04-02,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1644.0,75.0,75.0,1757.0,78.0,46.0,70.0,73.0,251.0,2022,2.0,4.0,2.0,6.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N16151,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,1.0,5.0,1500-1559,15.0,1659.0,1745.0,12.0,1639.0,78.0,1.0,5.0,1600-1659,2.0,0.0
1065,2022-04-02,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1740,1730.0,0.0,-10.0,1903.0,10.0,43.0,73.0,93.0,251.0,2022,2.0,4.0,2.0,6.0,UA,UA_CODESHARE,19977.0,UA,4225.0,C5,20445.0,C5,N11150,4225.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1700-1759,17.0,1747.0,1830.0,33.0,1853.0,10.0,0.0,0.0,1800-1859,2.0,0.0
1535,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1520.0,0.0,-9.0,1640.0,1.0,49.0,70.0,80.0,251.0,2022,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N11150,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,26.0,1546.0,1635.0,5.0,1639.0,1.0,0.0,0.0,1600-1659,2.0,0.0


We can also use external Python variables inside the expression by prefixing the variable name with the @ symbol.

In [35]:
min_time = 1130
df.query('(DepTime > @min_time) & (Origin == "DRO")')

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0
531,2022-04-03,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1524.0,0.0,-5.0,1633.0,0.0,50.0,70.0,69.0,251.0,2022,2.0,4.0,3.0,7.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N11191,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,9.0,1533.0,1623.0,10.0,1639.0,-6.0,0.0,-1.0,1600-1659,2.0,0.0
533,2022-04-03,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1134.0,0.0,-1.0,1247.0,2.0,43.0,70.0,73.0,251.0,2022,2.0,4.0,3.0,7.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N31131,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,19.0,1153.0,1236.0,11.0,1245.0,2.0,0.0,0.0,1200-1259,2.0,0.0
1003,2022-04-02,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1644.0,75.0,75.0,1757.0,78.0,46.0,70.0,73.0,251.0,2022,2.0,4.0,2.0,6.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N16151,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,1.0,5.0,1500-1559,15.0,1659.0,1745.0,12.0,1639.0,78.0,1.0,5.0,1600-1659,2.0,0.0
1065,2022-04-02,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1740,1730.0,0.0,-10.0,1903.0,10.0,43.0,73.0,93.0,251.0,2022,2.0,4.0,2.0,6.0,UA,UA_CODESHARE,19977.0,UA,4225.0,C5,20445.0,C5,N11150,4225.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1700-1759,17.0,1747.0,1830.0,33.0,1853.0,10.0,0.0,0.0,1800-1859,2.0,0.0
1535,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1520.0,0.0,-9.0,1640.0,1.0,49.0,70.0,80.0,251.0,2022,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N11150,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,26.0,1546.0,1635.0,5.0,1639.0,1.0,0.0,0.0,1600-1659,2.0,0.0
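
As a rough check that .query() and .loc express the same filter, the sketch below runs both on a small invented DataFrame (the `flights` data is an assumption for illustration) and compares the results.

```python
import pandas as pd

# Toy data invented for illustration; not read from the flights file.
flights = pd.DataFrame({
    "Origin": ["DRO", "DRO", "IAH", "DEN"],
    "DepTime": [1514.0, 728.0, 1430.0, 952.0],
})

min_time = 1130

# String expression; @min_time refers to the Python variable above.
via_query = flights.query('(DepTime > @min_time) & (Origin == "DRO")')

# The same filter written as a boolean mask.
via_loc = flights.loc[(flights["DepTime"] > min_time) & (flights["Origin"] == "DRO")]

print(via_query)
print(via_query.equals(via_loc))
```

Which form to use is mostly a matter of readability: .query() is shorter for long conditions, while boolean masks can be stored in variables and reused.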


## Summarizing Data

There are several statistical methods that can be run on columns. Some of them are:

- .mean(): computes the mean of the values in a Series or DataFrame.
- .max(): finds the max value in a Series or each column in a DataFrame.
- .min(): finds the min value in a Series or each column in a DataFrame.
- .std(): computes the standard deviation in a Series or DataFrame, which is the square root of the variance.
- .var(): computes the variance in a Series or DataFrame, which is a measure of how the data varies in relation to the mean value.
- .count(): counts the number of non-null values in a Series or DataFrame.
- .sum(): computes the sum of the values in a Series or DataFrame.
- .quantile(): computes the given quantile in a Series or DataFrame. This method can receive a single value between 0 and 1, or can take a list of values to calculate more than one quantile at once.

In [36]:
print(f"Mean: {df['DepTime'].mean()}")
print(f"Max: {df['DepTime'].max()}")
print(f"Min: {df['DepTime'].min()}")
print(f"Std: {df['DepTime'].std()}")
print(f"Var: {df['DepTime'].var()}")
print(f"Count: {df['DepTime'].count()}")
print(f"Sum: {df['DepTime'].sum()}")
print(f"Quant 0.5: {df['DepTime'].quantile(0.5)}")
print(df['DepTime'].quantile([0.25, 0.75]))

Mean: 1352.4151008249312
Max: 2400.0
Min: 1.0
Std: 518.0108743008974
Var: 268335.2658939801
Count: 17456
Sum: 23607758.0
Quant 0.5: 1337.0
0.25     915.0
0.75    1805.0
Name: DepTime, dtype: float64
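
A short sketch on an invented Series (the values below are assumptions chosen for illustration) shows how these methods treat missing values: .count() only counts non-null entries, and the other statistics skip them by default.

```python
import pandas as pd

# Toy Series invented for illustration, with one missing value.
dep_time = pd.Series([915.0, 1337.0, None, 1805.0], name="DepTime")

n_rows = len(dep_time)          # counts every row, including the missing one
n_valid = dep_time.count()      # counts only non-null values
mean = dep_time.mean()          # missing values are skipped by default
median = dep_time.quantile(0.5)
quartiles = dep_time.quantile([0.25, 0.75])
std, var = dep_time.std(), dep_time.var()

print(n_rows, n_valid)
print(mean, median)
print(quartiles)
print(std ** 2, var)            # the variance is the square of the standard deviation
```

This is why the count reported for a column can be smaller than the number of rows in the DataFrame.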


When these methods are run on multiple columns, the returned value is a Series object whose index holds the column names and whose values are the corresponding statistics.

In [37]:
s = df[['DepTime', 'DepDelay', 'ArrTime','ArrDelay']].min()
print(s)
print(type(s))

DepTime      1.0
DepDelay   -20.0
ArrTime      1.0
ArrDelay   -49.0
dtype: float64
<class 'pandas.core.series.Series'>


If we want to run several statistics at once, we can use the .agg() method, which accepts a list of aggregation names.

In [38]:
df[['DepTime', 'DepDelay', 'ArrTime', 'ArrDelay']].agg(['mean', 'max', 'min'])

Unnamed: 0,DepTime,DepDelay,ArrTime,ArrDelay
mean,1352.415101,23.971127,1427.834539,18.047252
max,2400.0,659.0,2400.0,697.0
min,1.0,-20.0,1.0,-49.0


The .agg() method can also take a dictionary, where the keys are the column names and the values are lists of aggregations. Only the specified aggregations are computed for each column key in the dictionary; combinations that were not requested are left empty (NaN) in the result.

In [39]:
df[['DepTime', 'DepDelay', 'ArrTime','ArrDelay']].agg(
    {'DepTime': ['min',  'max'],
     'DepDelay': ['mean'],
     'ArrTime': ['min', 'max']
    }
)

Unnamed: 0,DepTime,DepDelay,ArrTime
min,1.0,,1.0
max,2400.0,,2400.0
mean,,23.971127,
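
Both forms of .agg() can be sketched on a small invented DataFrame (the `delays` data is an assumption for illustration). The list form applies every aggregation to every column; the dictionary form applies only the listed aggregations per column and fills the rest with NaN.

```python
import pandas as pd

# Toy data invented for illustration; not read from the flights file.
delays = pd.DataFrame({
    "DepDelay": [-3.0, 33.0, 0.0],
    "ArrDelay": [-2.0, 24.0, 5.0],
})

# List form: every aggregation for every column.
summary = delays.agg(["mean", "max", "min"])

# Dictionary form: only the requested aggregations per column.
per_column = delays.agg({"DepDelay": ["min", "max"], "ArrDelay": ["mean"]})

print(summary)
print(per_column)
```

In both cases the aggregation names become the row index of the result and the column names stay as columns.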


Another useful method for categorical columns is .unique(), which returns an array of all the unique values in a column. Similarly, the .nunique() method (short for "number of uniques") returns the number of unique values.

In [40]:
df['Airline'].unique()

array(['Commutair Aka Champlain Enterprises, Inc.',
       'GoJet Airlines, LLC d/b/a United Express',
       'Air Wisconsin Airlines Corp', 'Mesa Airlines Inc.',
       'Southwest Airlines Co.'], dtype=object)

In [41]:
df['Airline'].nunique()

5

The .value_counts() method returns the number of occurrences of each value in a column.

In [42]:
df['Airline'].value_counts()

Southwest Airlines Co.                       16506
Air Wisconsin Airlines Corp                    777
Commutair Aka Champlain Enterprises, Inc.      647
GoJet Airlines, LLC d/b/a United Express       395
Mesa Airlines Inc.                               2
Name: Airline, dtype: int64

Setting the normalize=True parameter of the .value_counts() method gives us the fraction of occurrences instead.

In [43]:
df['Airline'].value_counts(normalize=True)

Southwest Airlines Co.                       0.900638
Air Wisconsin Airlines Corp                  0.042396
Commutair Aka Champlain Enterprises, Inc.    0.035303
GoJet Airlines, LLC d/b/a United Express     0.021553
Mesa Airlines Inc.                           0.000109
Name: Airline, dtype: float64

Running the .value_counts() method on several columns gives us the count of each combination of values in those columns, much like a group by followed by a count. The returned value is a Series object with a multi-level index (MultiIndex).

In [44]:
df[['Airline', 'Origin']].value_counts()

Airline                                    Origin
Southwest Airlines Co.                     DAL       858
                                           PHX       841
                                           DEN       826
                                           BWI       819
                                           LAS       755
                                                    ... 
Commutair Aka Champlain Enterprises, Inc.  AUS         1
                                           BNA         1
Air Wisconsin Airlines Corp                STL         1
GoJet Airlines, LLC d/b/a United Express   GSP         1
                                           ALB         1
Length: 246, dtype: int64
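
The relation between a multi-column count and a group by can be sketched on a toy DataFrame. The values below are invented for illustration and are not taken from the flights dataset. Counting combinations with .value_counts() should give the same numbers as grouping by those columns and calling .size(); only the ordering of the result differs.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Airline': ['A', 'A', 'A', 'B', 'B'],
    'Origin': ['DAL', 'DAL', 'PHX', 'DAL', 'DAL'],
})

# Count each (Airline, Origin) combination, most frequent first
vc = toy[['Airline', 'Origin']].value_counts()

# Same counts through a group by, ordered by the group keys instead
gb = toy.groupby(['Airline', 'Origin']).size()

print(vc)
print(gb)
```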

If we reset this index with the .reset_index() method, a DataFrame is returned instead, where every combination is in its own row and the count for each combination is in a new column.

In [45]:
df[['Airline', 'Origin']].value_counts().reset_index()

Unnamed: 0,Airline,Origin,0
0,Southwest Airlines Co.,DAL,858
1,Southwest Airlines Co.,PHX,841
2,Southwest Airlines Co.,DEN,826
3,Southwest Airlines Co.,BWI,819
4,Southwest Airlines Co.,LAS,755
...,...,...,...
241,"Commutair Aka Champlain Enterprises, Inc.",AUS,1
242,"Commutair Aka Champlain Enterprises, Inc.",BNA,1
243,Air Wisconsin Airlines Corp,STL,1
244,"GoJet Airlines, LLC d/b/a United Express",GSP,1
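
The count column in the output above is labelled 0, and the label may differ between pandas versions. As a small sketch on invented toy data, the name parameter of Series.reset_index() lets us choose the label ourselves. The column name n_flights is hypothetical and used only for this example.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Airline': ['A', 'A', 'A', 'B', 'B'],
    'Origin': ['DAL', 'DAL', 'PHX', 'DAL', 'DAL'],
})

counts = (
    toy[['Airline', 'Origin']]
    .value_counts()
    .reset_index(name='n_flights')  # hypothetical name for the count column
)
print(counts)
```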


## Advanced Column Methods

There are some useful methods that can be used when doing group by operations.

The .rank() method computes the numerical rank of each value according to some defined criteria. We can choose the axis to rank along (down the rows, which is the default, or sideways across the columns), whether to rank in ascending or descending order, and the method used to rank values that appear more than once. For example, if the same value appears in two rows, it occupies two consecutive rank positions. By default the method parameter is 'average', which assigns both rows the average of those two positions, but we could instead use the lowest position ('min'), the highest position ('max'), or the order of appearance ('first').

In [46]:
df[['CRSDepTime']].rank(axis=0, method='average', ascending=True)

Unnamed: 0,CRSDepTime
0,7171.5
1,2836.5
2,11326.5
3,10439.5
4,7212.0
...,...
18322,18030.5
18323,9350.0
18324,16082.5
18325,5588.5
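
To make the tie-breaking options concrete, here is a minimal sketch on a toy Series (values invented for illustration) in which the value 20 appears twice. Each column shows the ranks produced by one choice of the method parameter.

```python
import pandas as pd

# Toy values with a tie: 20 appears twice
s = pd.Series([10, 20, 20, 30])

ranks = pd.DataFrame({
    'value': s,
    'average': s.rank(method='average'),  # tied rows share the mean of their positions
    'min': s.rank(method='min'),          # tied rows share the lowest position
    'max': s.rank(method='max'),          # tied rows share the highest position
    'first': s.rank(method='first'),      # ties are broken by order of appearance
})
print(ranks)
```

The two tied rows occupy positions 2 and 3, so 'average' gives both 2.5, 'min' gives both 2, 'max' gives both 3, and 'first' gives 2 and 3.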


The .shift() method shifts all the values in a column by the given number of rows. This number can be positive (shift down) or negative (shift up). One thing to note is that after shifting, the rows at the beginning or at the end of the Series are left as missing values (NaN).

In [47]:
df[['CRSDepTime']].shift(3)

Unnamed: 0,CRSDepTime
0,
1,
2,
3,1133.0
4,732.0
...,...
18322,1230.0
18323,755.0
18324,1520.0
18325,2215.0


To prevent this we can provide a value through the fill_value parameter. It is used for the rows left empty by the shift; missing values that were already in the data are not changed.

In [48]:
df[['CRSDepTime']].shift(3, fill_value=0)

Unnamed: 0,CRSDepTime
0,0
1,0
2,0
3,1133
4,732
...,...
18322,1230
18323,755
18324,1520
18325,2215
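
A common use of shifting is to compare each row with its neighbour. The following is a minimal sketch on a small toy Series, showing a positive shift, a negative shift, and the change from the previous row.

```python
import pandas as pd

# Small toy Series, used only for illustration
s = pd.Series([1133, 732, 1529, 1435])

prev = s.shift(1)          # value from the previous row; first row becomes NaN
nxt = s.shift(-1)          # value from the next row; last row becomes NaN
change = s - s.shift(1)    # difference with the previous row

print(pd.DataFrame({'value': s, 'prev': prev, 'next': nxt, 'change': change}))
```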


We can run the cumulative sum over a DataFrame or a Series object by using the .cumsum() method. The cumulative sum is the sum of each value with all the previous values. For example, if we have a column with values like [0, 1, 2, ...], the cumulative sum will begin as 0, then (0 + 1), then (0 + 1 + 2), and so on, so the value in each row is the sum up to and including that row. In this example we are running the cumulative sum for the 'CRSDepTime' column.

In [49]:
df[['CRSDepTime']].cumsum()

Unnamed: 0,CRSDepTime
0,1133
1,1865
2,3394
3,4829
4,5964
...,...
18322,24408069
18323,24409404
18324,24411359
18325,24412369


There are also .cummax() and .cummin() methods, which return the maximum and minimum values seen up to each row of the column, respectively.

In [50]:
df[['CRSDepTime']].cummax()

Unnamed: 0,CRSDepTime
0,1133
1,1133
2,1529
3,1529
4,1529
...,...
18322,2255
18323,2255
18324,2255
18325,2255


In [51]:
df[['CRSDepTime']].cummin()

Unnamed: 0,CRSDepTime
0,1133
1,732
2,732
3,732
4,732
...,...
18322,500
18323,500
18324,500
18325,500
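
The three cumulative methods are easier to compare side by side on a handful of values. This is a small sketch on an invented toy Series, with the expected running values noted in the comments.

```python
import pandas as pd

# Toy values, invented for illustration only
s = pd.Series([3, 1, 4, 1, 5])

running = pd.DataFrame({
    'value': s,
    'cumsum': s.cumsum(),  # 3, 4, 8, 9, 14
    'cummax': s.cummax(),  # 3, 3, 4, 4, 5
    'cummin': s.cummin(),  # 3, 1, 1, 1, 1
})
print(running)
```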


## Rolling Methods

Rolling methods can be especially useful with time-series data. On a numeric column we can use the .rolling() method and provide a window size, which is the number of rows to look at when running our aggregations. This returns a Rolling object.

In [52]:
df[['DepDelayMinutes']].rolling(window=5)

Rolling [window=5,center=False,axis=0,method=single]

If we chain an aggregation onto the previous object, we obtain that aggregation applied over each window. Remember that a window is a subset of consecutive values. The window slides down one row at a time: in this example, each row holds the mean of the DepDelayMinutes value in that row and in the 4 rows before it. The values at indexes 0 through 3 are shown as NaN because the window is set to 5, so we need at least 5 values before the first mean can be calculated.

In [53]:
df[['DepDelayMinutes']].rolling(window=5).agg('mean')

Unnamed: 0,DepDelayMinutes
0,
1,
2,
3,
4,0.0
...,...
18322,0.2
18323,0.4
18324,0.4
18325,0.2
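
The sliding behaviour is easier to follow on a few values. Below is a minimal sketch on an invented toy Series with a window of 3. It also shows the min_periods parameter of .rolling(), which lets the first rows be computed from fewer values instead of returning NaN.

```python
import pandas as pd

# Toy values, invented for illustration only
s = pd.Series([0, 10, 20, 30, 40])

# Window of 3: each row averages itself and the 2 rows before it
full = s.rolling(window=3).mean()

# min_periods=1 lets the first rows use whatever values are available
partial = s.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({'value': s, 'window_3': full, 'min_periods_1': partial}))
```

With the full window, the first two rows are NaN and the rest are 10, 20 and 30. With min_periods=1, the first two rows become 0 and 5.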


Another useful method for numeric values is .clip(). The .clip() method limits the values of a Series to the range between a lower and an upper boundary. Any value below the lower boundary is replaced with the lower boundary, and any value above the upper boundary is replaced with the upper boundary.

In [54]:
df['DepTime'].clip(1000,2000)

0        1123.0
1        1000.0
2        1514.0
3        1430.0
4        1135.0
          ...  
18322    2000.0
18323    1336.0
18324    1953.0
18325    1010.0
18326    1000.0
Name: DepTime, Length: 18327, dtype: float64
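
The boundaries can also be passed by keyword, and either one can be left out. This is a small sketch on invented toy values.

```python
import pandas as pd

# Toy values, invented for illustration only
s = pd.Series([5, 950, 1500, 2100])

both = s.clip(lower=1000, upper=2000)  # 1000, 1000, 1500, 2000
only_lower = s.clip(lower=1000)        # 1000, 1000, 1500, 2100

print(both.tolist())
print(only_lower.tolist())
```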

## Groupby Methods

These methods are useful when working with DataFrames that have categorical columns.

With the .groupby() method we can provide a single column name, or several column names in a list, and then perform further aggregations. In this example we group the DepDelay values by Airline and calculate the mean departure delay for each one. Selecting the column with single brackets returns a Series, while double brackets return a DataFrame.

In [55]:
df.groupby('Airline')['DepDelay'].mean()

Airline
Air Wisconsin Airlines Corp                   1.603093
Commutair Aka Champlain Enterprises, Inc.    11.000000
GoJet Airlines, LLC d/b/a United Express     12.114213
Mesa Airlines Inc.                           30.000000
Southwest Airlines Co.                       25.912026
Name: DepDelay, dtype: float64

In [56]:
df.groupby('Airline')[['DepDelay']].mean()

Unnamed: 0_level_0,DepDelay
Airline,Unnamed: 1_level_1
Air Wisconsin Airlines Corp,1.603093
"Commutair Aka Champlain Enterprises, Inc.",11.0
"GoJet Airlines, LLC d/b/a United Express",12.114213
Mesa Airlines Inc.,30.0
Southwest Airlines Co.,25.912026


Here we are applying multiple aggregations over multiple columns. Notice that the result is a DataFrame with multi-index columns.

In [57]:
df_agg = df.groupby('Airline')[['DepDelay', 'ArrDelay']].agg(['mean', 'max', 'min'])
df_agg

Unnamed: 0_level_0,DepDelay,DepDelay,DepDelay,ArrDelay,ArrDelay,ArrDelay
Unnamed: 0_level_1,mean,max,min,mean,max,min
Airline,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Air Wisconsin Airlines Corp,1.603093,659.0,-19.0,-3.895349,641.0,-45.0
"Commutair Aka Champlain Enterprises, Inc.",11.0,518.0,-20.0,5.099688,522.0,-40.0
"GoJet Airlines, LLC d/b/a United Express",12.114213,387.0,-13.0,7.543147,372.0,-42.0
Mesa Airlines Inc.,30.0,40.0,20.0,19.0,28.0,10.0
Southwest Airlines Co.,25.912026,648.0,-14.0,19.935896,697.0,-49.0


In [58]:
df_agg.columns

MultiIndex([('DepDelay', 'mean'),
            ('DepDelay',  'max'),
            ('DepDelay',  'min'),
            ('ArrDelay', 'mean'),
            ('ArrDelay',  'max'),
            ('ArrDelay',  'min')],
           )

To flatten the column names we could use the to_flat_index() method, but another option is a list comprehension like the one shown below. The join() method joins the values inside each column-name tuple with the specified character _, producing a single-level name for each aggregated column.

In [59]:
df_agg.columns = ['_'.join(c) for c in df_agg.columns]
df_agg

Unnamed: 0_level_0,DepDelay_mean,DepDelay_max,DepDelay_min,ArrDelay_mean,ArrDelay_max,ArrDelay_min
Airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Air Wisconsin Airlines Corp,1.603093,659.0,-19.0,-3.895349,641.0,-45.0
"Commutair Aka Champlain Enterprises, Inc.",11.0,518.0,-20.0,5.099688,522.0,-40.0
"GoJet Airlines, LLC d/b/a United Express",12.114213,387.0,-13.0,7.543147,372.0,-42.0
Mesa Airlines Inc.,30.0,40.0,20.0,19.0,28.0,10.0
Southwest Airlines Co.,25.912026,648.0,-14.0,19.935896,697.0,-49.0
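
For comparison, here is a minimal sketch of the to_flat_index() route on an invented toy DataFrame. It turns the two-level column index into a flat index of tuples, which can then be joined into strings in the same way.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Airline': ['A', 'A', 'B'],
    'DepDelay': [5.0, 15.0, -2.0],
})

agg = toy.groupby('Airline')[['DepDelay']].agg(['mean', 'max'])

# to_flat_index() converts the MultiIndex columns into a flat Index of tuples
flat = agg.columns.to_flat_index()
print(list(flat))

# Joining each tuple gives single-level column names
agg.columns = ['_'.join(c) for c in flat]
print(agg)
```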


## New Columns

We can create new columns by performing any operation on a single or multiple columns.

In this example we take the values for the DepTime column and divide them by 60. We then assign the result values to a new column called DepTime2.

In [60]:
df['DepTime2'] = df['DepTime'] / 60
df.head()

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings,DepTime2
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False,False,1133,1123.0,0.0,-10.0,1228.0,0.0,40.0,72.0,65.0,212.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4301.0,C5,20445.0,C5,N21144,4301.0,11921.0,1192102.0,31921.0,"Grand Junction, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,17.0,1140.0,1220.0,8.0,1245.0,-17.0,0.0,-2.0,1200-1259,1.0,0.0,18.716667
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False,False,732,728.0,0.0,-4.0,848.0,0.0,55.0,77.0,80.0,295.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4299.0,C5,20445.0,C5,N16170,4299.0,12206.0,1220605.0,32206.0,"Harlingen/San Benito, TX",TX,48.0,Texas,74.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0700-0759,16.0,744.0,839.0,9.0,849.0,-1.0,0.0,-1.0,0800-0859,2.0,0.0,12.133333
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0,25.233333
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4296.0,C5,20445.0,C5,N11184,4296.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,11973.0,1197302.0,31973.0,"Gulfport/Biloxi, MS",MS,28.0,Mississippi,53.0,0.0,-1.0,1400-1459,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,1600-1659,2.0,0.0,23.833333
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0,18.916667


Another way of creating new columns is the .assign() method. This method can be chained with other operations and returns a new DataFrame object with all the original columns plus the new ones.

To use it we provide a new column name and what that column should be equal to.

In [61]:
df = df.assign(DepTime3 = df['DepTime'] / 60)
df.head()

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings,DepTime2,DepTime3
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False,False,1133,1123.0,0.0,-10.0,1228.0,0.0,40.0,72.0,65.0,212.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4301.0,C5,20445.0,C5,N21144,4301.0,11921.0,1192102.0,31921.0,"Grand Junction, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,17.0,1140.0,1220.0,8.0,1245.0,-17.0,0.0,-2.0,1200-1259,1.0,0.0,18.716667,18.716667
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False,False,732,728.0,0.0,-4.0,848.0,0.0,55.0,77.0,80.0,295.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4299.0,C5,20445.0,C5,N16170,4299.0,12206.0,1220605.0,32206.0,"Harlingen/San Benito, TX",TX,48.0,Texas,74.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0700-0759,16.0,744.0,839.0,9.0,849.0,-1.0,0.0,-1.0,0800-0859,2.0,0.0,12.133333,12.133333
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0,25.233333,25.233333
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4296.0,C5,20445.0,C5,N11184,4296.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,11973.0,1197302.0,31973.0,"Gulfport/Biloxi, MS",MS,28.0,Mississippi,53.0,0.0,-1.0,1400-1459,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,1600-1659,2.0,0.0,23.833333,23.833333
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0,18.916667,18.916667
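
Chaining is where .assign() is most convenient. The sketch below uses a toy DataFrame and assumes, for illustration only, that DepTime is stored in HHMM form. The column names DepHour and IsEarly are hypothetical. Passing a lambda lets each new column be computed from the DataFrame as it exists at that point in the chain.

```python
import pandas as pd

# Toy data, used only for illustration
toy = pd.DataFrame({
    'DepTime': [1123.0, 728.0, 1514.0],
    'DepDelay': [-10.0, -4.0, -15.0],
})

result = (
    toy
    .assign(
        # Hypothetical column names, used only for this example
        DepHour=lambda d: d['DepTime'] // 100,  # assumes HHMM-style values
        IsEarly=lambda d: d['DepDelay'] < 0,
    )
    .query('DepHour >= 11')  # filter on the column created just above
)
print(result)
```

The original toy DataFrame is left unchanged, because .assign() returns a new object.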


## Sorting Data

We can sort by a specific column or a list of columns using the .sort_values() method. By default the sorting is in ascending order, but we can set the ascending parameter to False. Regardless of the sorting order, missing values are placed at the bottom by default; this can be changed with the na_position parameter.

In [62]:
df[['FlightDate', 'Airline', 'ArrDelay']].sort_values('ArrDelay', ascending=False)

Unnamed: 0,FlightDate,Airline,ArrDelay
8466,2022-04-02,Southwest Airlines Co.,697.0
755,2022-04-03,Air Wisconsin Airlines Corp,641.0
8472,2022-04-02,Southwest Airlines Co.,627.0
8432,2022-04-02,Southwest Airlines Co.,578.0
1468,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",522.0
...,...,...,...
17635,2022-04-18,Southwest Airlines Co.,
17687,2022-04-18,Southwest Airlines Co.,
17761,2022-04-18,Southwest Airlines Co.,
17856,2022-04-18,Southwest Airlines Co.,
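
Sorting by several columns and controlling where missing values go can be sketched on an invented toy DataFrame. The ascending parameter accepts one flag per column, and na_position='first' moves missing values to the top.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Airline': ['B', 'A', 'A', 'B'],
    'ArrDelay': [5.0, np.nan, 12.0, -3.0],
})

# Sort by Airline ascending, then by ArrDelay descending within each airline
by_two = toy.sort_values(['Airline', 'ArrDelay'], ascending=[True, False])

# Put missing values at the top instead of the bottom
na_first = toy.sort_values('ArrDelay', na_position='first')

print(by_two)
print(na_first)
```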


We can chain a .reset_index() call to the sorting to have the index start from 0 again. Passing drop=True discards the old index instead of keeping it as a new column.

In [63]:
df[['FlightDate', 'Airline', 'ArrDelay']].sort_values('ArrDelay', ascending=False).reset_index(drop=True)

Unnamed: 0,FlightDate,Airline,ArrDelay
0,2022-04-02,Southwest Airlines Co.,697.0
1,2022-04-03,Air Wisconsin Airlines Corp,641.0
2,2022-04-02,Southwest Airlines Co.,627.0
3,2022-04-02,Southwest Airlines Co.,578.0
4,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",522.0
...,...,...,...
18322,2022-04-18,Southwest Airlines Co.,
18323,2022-04-18,Southwest Airlines Co.,
18324,2022-04-18,Southwest Airlines Co.,
18325,2022-04-18,Southwest Airlines Co.,


There is also a .sort_index() method, which sorts the rows by their index values.

In [64]:
df[['FlightDate', 'Airline', 'ArrDelay']].sort_index()

Unnamed: 0,FlightDate,Airline,ArrDelay
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-17.0
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-1.0
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-3.0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-18.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",6.0
...,...,...,...
18322,2022-04-19,Southwest Airlines Co.,-25.0
18323,2022-04-19,Southwest Airlines Co.,-10.0
18324,2022-04-19,Southwest Airlines Co.,-13.0
18325,2022-04-19,Southwest Airlines Co.,-18.0


## Handling Missing Data

The .isna() method returns a boolean value for each value in our DataFrame, which is True when the value is missing.

In [65]:
df[['FlightDate', 'Airline', 'ArrDelay']].isna()

Unnamed: 0,FlightDate,Airline,ArrDelay
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
...,...,...,...
18322,False,False,False
18323,False,False,False
18324,False,False,False
18325,False,False,False


Because boolean values are essentially 1 or 0, we can chain a .sum() method to get the count of missing values in each column.

In [66]:
df[['FlightDate', 'Airline', 'ArrDelay']].isna().sum()

FlightDate      0
Airline         0
ArrDelay      931
dtype: int64

The .dropna() method drops any row containing missing values. We can also provide a subset parameter so that only missing values in the given columns cause a row to be dropped. The subset parameter receives a list of column names.

In [67]:
df[['FlightDate', 'Airline', 'ArrDelay']].dropna(subset=['ArrDelay'])

Unnamed: 0,FlightDate,Airline,ArrDelay
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-17.0
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-1.0
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-3.0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-18.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",6.0
...,...,...,...
18321,2022-04-19,Southwest Airlines Co.,-3.0
18322,2022-04-19,Southwest Airlines Co.,-25.0
18323,2022-04-19,Southwest Airlines Co.,-10.0
18324,2022-04-19,Southwest Airlines Co.,-13.0


The .fillna() method will fill any missing value with the given filler value.

In [68]:
df[['FlightDate', 'Airline', 'ArrDelay']].fillna(0)

Unnamed: 0,FlightDate,Airline,ArrDelay
0,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-17.0
1,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-1.0
2,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-3.0
3,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",-18.0
4,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",6.0
...,...,...,...
18322,2022-04-19,Southwest Airlines Co.,-25.0
18323,2022-04-19,Southwest Airlines Co.,-10.0
18324,2022-04-19,Southwest Airlines Co.,-13.0
18325,2022-04-19,Southwest Airlines Co.,-18.0
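
The effect of these methods is easier to see when the missing values are visible. Here is a minimal sketch on an invented toy DataFrame with gaps in both columns. It also shows that .fillna() accepts a dictionary, so each column can get its own filler value.

```python
import numpy as np
import pandas as pd

# Toy data with missing values, invented for illustration only
toy = pd.DataFrame({
    'Airline': ['A', 'B', None, 'A'],
    'ArrDelay': [-17.0, np.nan, 6.0, np.nan],
})

missing_per_column = toy.isna().sum()            # Airline: 1, ArrDelay: 2
any_missing_dropped = toy.dropna()               # keeps only row 0
delay_present = toy.dropna(subset=['ArrDelay'])  # keeps rows 0 and 2

# A dictionary fills each column with its own value
filled = toy.fillna({'Airline': 'Unknown', 'ArrDelay': 0})

print(missing_per_column)
print(delay_present)
print(filled)
```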


## Combining Data

To demonstrate data combination we will first create 2 new independent DataFrames. When we create new DataFrames and wish to make them independent of the original DataFrame we must use the .copy() method.

In [69]:
df1 = df.query('Airline == "Southwest Airlines Co."').copy()
df2 = df.query('Airline == "Commutair Aka Champlain Enterprises, Inc."').copy()

To stack these 2 DataFrames on top of each other we can use the pandas concat() function and provide a list of the DataFrames we want to concatenate.

In [70]:
df_stack = pd.concat([df1, df2])
df_stack

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings,DepTime2,DepTime3
1821,2022-04-01,Southwest Airlines Co.,ABQ,AUS,False,False,1035,1037.0,2.0,2.0,1303.0,0.0,75.0,100.0,86.0,619.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,236.0,WN,19393.0,WN,N232WN,236.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,0.0,1000-1059,8.0,1045.0,1300.0,3.0,1315.0,-12.0,0.0,-1.0,1300-1359,3.0,0.0,17.283333,17.283333
1822,2022-04-01,Southwest Airlines Co.,ABQ,BUR,False,False,1750,1823.0,33.0,33.0,1914.0,24.0,96.0,120.0,111.0,672.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1238.0,WN,19393.0,WN,N248WN,1238.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10800.0,1080003.0,32575.0,"Burbank, CA",CA,6.0,California,91.0,1.0,2.0,1700-1759,13.0,1836.0,1912.0,2.0,1850.0,24.0,1.0,1.0,1800-1859,3.0,0.0,30.383333,30.383333
1823,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,645,648.0,3.0,3.0,922.0,0.0,79.0,105.0,94.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,100.0,WN,19393.0,WN,N206WN,100.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,0.0,0600-0659,13.0,701.0,920.0,2.0,930.0,-8.0,0.0,-1.0,0900-0959,3.0,0.0,10.800000,10.800000
1824,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1150,1146.0,0.0,-4.0,1439.0,4.0,76.0,105.0,113.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,869.0,WN,19393.0,WN,N8689C,869.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,-1.0,1100-1159,9.0,1155.0,1411.0,28.0,1435.0,4.0,0.0,0.0,1400-1459,3.0,0.0,19.100000,19.100000
1825,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1730,1757.0,27.0,27.0,2042.0,32.0,83.0,100.0,105.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1256.0,WN,19393.0,WN,N774SW,1256.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,1.0,1700-1759,13.0,1810.0,2033.0,9.0,2010.0,32.0,1.0,2.0,2000-2059,3.0,0.0,29.283333,29.283333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",HSV,IAH,False,False,615,608.0,0.0,-7.0,804.0,0.0,96.0,134.0,116.0,595.0,2022,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4228.0,C5,20445.0,C5,N14177,4228.0,12217.0,1221702.0,30255.0,"Huntsville, AL",AL,1.0,Alabama,51.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0600-0659,15.0,623.0,759.0,5.0,829.0,-25.0,0.0,-2.0,0800-0859,3.0,0.0,10.133333,10.133333
1596,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",IAH,HSV,False,False,1020,1021.0,1.0,1.0,1159.0,0.0,76.0,115.0,98.0,595.0,2022,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4227.0,C5,20445.0,C5,N14158,4227.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,12217.0,1221702.0,30255.0,"Huntsville, AL",AL,1.0,Alabama,51.0,0.0,0.0,1000-1059,15.0,1036.0,1152.0,7.0,1215.0,-16.0,0.0,-2.0,1200-1259,3.0,0.0,17.016667,17.016667
1597,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",TUL,IAH,False,False,1325,1318.0,0.0,-7.0,1455.0,0.0,70.0,105.0,97.0,429.0,2022,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4226.0,C5,20445.0,C5,N23139,4226.0,15370.0,1537002.0,34653.0,"Tulsa, OK",OK,40.0,Oklahoma,73.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,1300-1359,14.0,1332.0,1442.0,13.0,1510.0,-15.0,0.0,-1.0,1500-1559,2.0,0.0,21.966667,21.966667
1598,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",CPR,DEN,False,False,1738,1734.0,0.0,-4.0,1840.0,0.0,39.0,77.0,66.0,230.0,2022,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4225.0,C5,20445.0,C5,N21154,4225.0,11122.0,1112205.0,31122.0,"Casper, WY",WY,56.0,Wyoming,88.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1700-1759,13.0,1747.0,1826.0,14.0,1855.0,-15.0,0.0,-1.0,1800-1859,1.0,0.0,28.900000,28.900000


The concat() function takes an axis parameter, which is 0 by default and means the DataFrames are stacked on top of each other. If we set it to 1, it aligns the DataFrames on their index values and puts them next to each other. Since the indexes of the DataFrames often aren't unique or don't match, this can cause problems: in the example below the two DataFrames share no index values, so each row is filled with NaN on one side.

In [71]:
df_side = pd.concat([df1, df2], axis=1)
df_side

Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings,DepTime2,DepTime3,FlightDate.1,Airline.1,Origin.1,Dest.1,Cancelled.1,Diverted.1,CRSDepTime.1,DepTime.1,DepDelayMinutes.1,DepDelay.1,ArrTime.1,ArrDelayMinutes.1,AirTime.1,CRSElapsedTime.1,ActualElapsedTime.1,Distance.1,Year.1,Quarter.1,Month.1,DayofMonth.1,DayOfWeek.1,Marketing_Airline_Network.1,Operated_or_Branded_Code_Share_Partners.1,DOT_ID_Marketing_Airline.1,IATA_Code_Marketing_Airline.1,Flight_Number_Marketing_Airline.1,Operating_Airline.1,DOT_ID_Operating_Airline.1,IATA_Code_Operating_Airline.1,Tail_Number.1,Flight_Number_Operating_Airline.1,OriginAirportID.1,OriginAirportSeqID.1,OriginCityMarketID.1,OriginCityName.1,OriginState.1,OriginStateFips.1,OriginStateName.1,OriginWac.1,DestAirportID.1,DestAirportSeqID.1,DestCityMarketID.1,DestCityName.1,DestState.1,DestStateFips.1,DestStateName.1,DestWac.1,DepDel15.1,DepartureDelayGroups.1,DepTimeBlk.1,TaxiOut.1,WheelsOff.1,WheelsOn.1,TaxiIn.1,CRSArrTime.1,ArrDelay.1,ArrDel15.1,ArrivalDelayGroups.1,ArrTimeBlk.1,DistanceGroup.1,DivAirportLandings.1,DepTime2.1,DepTime3.1
1821,2022-04-01,Southwest Airlines Co.,ABQ,AUS,False,False,1035.0,1037.0,2.0,2.0,1303.0,0.0,75.0,100.0,86.0,619.0,2022.0,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,236.0,WN,19393.0,WN,N232WN,236.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,0.0,1000-1059,8.0,1045.0,1300.0,3.0,1315.0,-12.0,0.0,-1.0,1300-1359,3.0,0.0,17.283333,17.283333,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1822,2022-04-01,Southwest Airlines Co.,ABQ,BUR,False,False,1750.0,1823.0,33.0,33.0,1914.0,24.0,96.0,120.0,111.0,672.0,2022.0,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1238.0,WN,19393.0,WN,N248WN,1238.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10800.0,1080003.0,32575.0,"Burbank, CA",CA,6.0,California,91.0,1.0,2.0,1700-1759,13.0,1836.0,1912.0,2.0,1850.0,24.0,1.0,1.0,1800-1859,3.0,0.0,30.383333,30.383333,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1823,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,645.0,648.0,3.0,3.0,922.0,0.0,79.0,105.0,94.0,580.0,2022.0,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,100.0,WN,19393.0,WN,N206WN,100.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,0.0,0600-0659,13.0,701.0,920.0,2.0,930.0,-8.0,0.0,-1.0,0900-0959,3.0,0.0,10.800000,10.800000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1824,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1150.0,1146.0,0.0,-4.0,1439.0,4.0,76.0,105.0,113.0,580.0,2022.0,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,869.0,WN,19393.0,WN,N8689C,869.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,-1.0,1100-1159,9.0,1155.0,1411.0,28.0,1435.0,4.0,0.0,0.0,1400-1459,3.0,0.0,19.100000,19.100000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1825,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1730.0,1757.0,27.0,27.0,2042.0,32.0,83.0,100.0,105.0,580.0,2022.0,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1256.0,WN,19393.0,WN,N774SW,1256.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,1.0,1700-1759,13.0,1810.0,2033.0,9.0,2010.0,32.0,1.0,2.0,2000-2059,3.0,0.0,29.283333,29.283333,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1595,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",HSV,IAH,False,False,615.0,608.0,0.0,-7.0,804.0,0.0,96.0,134.0,116.0,595.0,2022.0,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4228.0,C5,20445.0,C5,N14177,4228.0,12217.0,1221702.0,30255.0,"Huntsville, AL",AL,1.0,Alabama,51.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0600-0659,15.0,623.0,759.0,5.0,829.0,-25.0,0.0,-2.0,0800-0859,3.0,0.0,10.133333,10.133333
1596,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",IAH,HSV,False,False,1020.0,1021.0,1.0,1.0,1159.0,0.0,76.0,115.0,98.0,595.0,2022.0,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4227.0,C5,20445.0,C5,N14158,4227.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,12217.0,1221702.0,30255.0,"Huntsville, AL",AL,1.0,Alabama,51.0,0.0,0.0,1000-1059,15.0,1036.0,1152.0,7.0,1215.0,-16.0,0.0,-2.0,1200-1259,3.0,0.0,17.016667,17.016667
1597,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",TUL,IAH,False,False,1325.0,1318.0,0.0,-7.0,1455.0,0.0,70.0,105.0,97.0,429.0,2022.0,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4226.0,C5,20445.0,C5,N23139,4226.0,15370.0,1537002.0,34653.0,"Tulsa, OK",OK,40.0,Oklahoma,73.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,1300-1359,14.0,1332.0,1442.0,13.0,1510.0,-15.0,0.0,-1.0,1500-1559,2.0,0.0,21.966667,21.966667
1598,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2022-04-01,"Commutair Aka Champlain Enterprises, Inc.",CPR,DEN,False,False,1738.0,1734.0,0.0,-4.0,1840.0,0.0,39.0,77.0,66.0,230.0,2022.0,2.0,4.0,1.0,5.0,UA,UA_CODESHARE,19977.0,UA,4225.0,C5,20445.0,C5,N21154,4225.0,11122.0,1112205.0,31122.0,"Casper, WY",WY,56.0,Wyoming,88.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1700-1759,13.0,1747.0,1826.0,14.0,1855.0,-15.0,0.0,-1.0,1800-1859,1.0,0.0,28.900000,28.900000


To solve this we can reset the index of the DataFrames first and try again.

In [72]:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df_side = pd.concat([df1, df2], axis=1)
df_side



Unnamed: 0,FlightDate,Airline,Origin,Dest,Cancelled,Diverted,CRSDepTime,DepTime,DepDelayMinutes,DepDelay,ArrTime,ArrDelayMinutes,AirTime,CRSElapsedTime,ActualElapsedTime,Distance,Year,Quarter,Month,DayofMonth,DayOfWeek,Marketing_Airline_Network,Operated_or_Branded_Code_Share_Partners,DOT_ID_Marketing_Airline,IATA_Code_Marketing_Airline,Flight_Number_Marketing_Airline,Operating_Airline,DOT_ID_Operating_Airline,IATA_Code_Operating_Airline,Tail_Number,Flight_Number_Operating_Airline,OriginAirportID,OriginAirportSeqID,OriginCityMarketID,OriginCityName,OriginState,OriginStateFips,OriginStateName,OriginWac,DestAirportID,DestAirportSeqID,DestCityMarketID,DestCityName,DestState,DestStateFips,DestStateName,DestWac,DepDel15,DepartureDelayGroups,DepTimeBlk,TaxiOut,WheelsOff,WheelsOn,TaxiIn,CRSArrTime,ArrDelay,ArrDel15,ArrivalDelayGroups,ArrTimeBlk,DistanceGroup,DivAirportLandings,DepTime2,DepTime3,FlightDate.1,Airline.1,Origin.1,Dest.1,Cancelled.1,Diverted.1,CRSDepTime.1,DepTime.1,DepDelayMinutes.1,DepDelay.1,ArrTime.1,ArrDelayMinutes.1,AirTime.1,CRSElapsedTime.1,ActualElapsedTime.1,Distance.1,Year.1,Quarter.1,Month.1,DayofMonth.1,DayOfWeek.1,Marketing_Airline_Network.1,Operated_or_Branded_Code_Share_Partners.1,DOT_ID_Marketing_Airline.1,IATA_Code_Marketing_Airline.1,Flight_Number_Marketing_Airline.1,Operating_Airline.1,DOT_ID_Operating_Airline.1,IATA_Code_Operating_Airline.1,Tail_Number.1,Flight_Number_Operating_Airline.1,OriginAirportID.1,OriginAirportSeqID.1,OriginCityMarketID.1,OriginCityName.1,OriginState.1,OriginStateFips.1,OriginStateName.1,OriginWac.1,DestAirportID.1,DestAirportSeqID.1,DestCityMarketID.1,DestCityName.1,DestState.1,DestStateFips.1,DestStateName.1,DestWac.1,DepDel15.1,DepartureDelayGroups.1,DepTimeBlk.1,TaxiOut.1,WheelsOff.1,WheelsOn.1,TaxiIn.1,CRSArrTime.1,ArrDelay.1,ArrDel15.1,ArrivalDelayGroups.1,ArrTimeBlk.1,DistanceGroup.1,DivAirportLandings.1,DepTime2.1,DepTime3.1
0,2022-04-01,Southwest Airlines Co.,ABQ,AUS,False,False,1035,1037.0,2.0,2.0,1303.0,0.0,75.0,100.0,86.0,619.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,236.0,WN,19393.0,WN,N232WN,236.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,0.0,1000-1059,8.0,1045.0,1300.0,3.0,1315.0,-12.0,0.0,-1.0,1300-1359,3.0,0.0,17.283333,17.283333,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",GJT,DEN,False,False,1133.0,1123.0,0.0,-10.0,1228.0,0.0,40.0,72.0,65.0,212.0,2022.0,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4301.0,C5,20445.0,C5,N21144,4301.0,11921.0,1192102.0,31921.0,"Grand Junction, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1100-1159,17.0,1140.0,1220.0,8.0,1245.0,-17.0,0.0,-2.0,1200-1259,1.0,0.0,18.716667,18.716667
1,2022-04-01,Southwest Airlines Co.,ABQ,BUR,False,False,1750,1823.0,33.0,33.0,1914.0,24.0,96.0,120.0,111.0,672.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1238.0,WN,19393.0,WN,N248WN,1238.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,10800.0,1080003.0,32575.0,"Burbank, CA",CA,6.0,California,91.0,1.0,2.0,1700-1759,13.0,1836.0,1912.0,2.0,1850.0,24.0,1.0,1.0,1800-1859,3.0,0.0,30.383333,30.383333,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",HRL,IAH,False,False,732.0,728.0,0.0,-4.0,848.0,0.0,55.0,77.0,80.0,295.0,2022.0,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4299.0,C5,20445.0,C5,N16170,4299.0,12206.0,1220605.0,32206.0,"Harlingen/San Benito, TX",TX,48.0,Texas,74.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,0.0,-1.0,0700-0759,16.0,744.0,839.0,9.0,849.0,-1.0,0.0,-1.0,0800-0859,2.0,0.0,12.133333,12.133333
2,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,645,648.0,3.0,3.0,922.0,0.0,79.0,105.0,94.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,100.0,WN,19393.0,WN,N206WN,100.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,0.0,0600-0659,13.0,701.0,920.0,2.0,930.0,-8.0,0.0,-1.0,0900-0959,3.0,0.0,10.800000,10.800000,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1529.0,1514.0,0.0,-15.0,1636.0,0.0,47.0,70.0,82.0,251.0,2022.0,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4298.0,C5,20445.0,C5,N21144,4298.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,-1.0,1500-1559,21.0,1535.0,1622.0,14.0,1639.0,-3.0,0.0,-1.0,1600-1659,2.0,0.0,25.233333,25.233333
3,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1150,1146.0,0.0,-4.0,1439.0,4.0,76.0,105.0,113.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,869.0,WN,19393.0,WN,N8689C,869.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,0.0,-1.0,1100-1159,9.0,1155.0,1411.0,28.0,1435.0,4.0,0.0,0.0,1400-1459,3.0,0.0,19.100000,19.100000,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",IAH,GPT,False,False,1435.0,1430.0,0.0,-5.0,1547.0,0.0,57.0,90.0,77.0,376.0,2022.0,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4296.0,C5,20445.0,C5,N11184,4296.0,12266.0,1226603.0,31453.0,"Houston, TX",TX,48.0,Texas,74.0,11973.0,1197302.0,31973.0,"Gulfport/Biloxi, MS",MS,28.0,Mississippi,53.0,0.0,-1.0,1400-1459,16.0,1446.0,1543.0,4.0,1605.0,-18.0,0.0,-2.0,1600-1659,2.0,0.0,23.833333,23.833333
4,2022-04-01,Southwest Airlines Co.,ABQ,DAL,False,False,1730,1757.0,27.0,27.0,2042.0,32.0,83.0,100.0,105.0,580.0,2022,2.0,4.0,1.0,5.0,WN,WN,19393.0,WN,1256.0,WN,19393.0,WN,N774SW,1256.0,10140.0,1014005.0,30140.0,"Albuquerque, NM",NM,35.0,New Mexico,86.0,11259.0,1125904.0,30194.0,"Dallas, TX",TX,48.0,Texas,74.0,1.0,1.0,1700-1759,13.0,1810.0,2033.0,9.0,2010.0,32.0,1.0,2.0,2000-2059,3.0,0.0,29.283333,29.283333,2022-04-04,"Commutair Aka Champlain Enterprises, Inc.",DRO,DEN,False,False,1135.0,1135.0,0.0,0.0,1251.0,6.0,49.0,70.0,76.0,251.0,2022.0,2.0,4.0,4.0,1.0,UA,UA_CODESHARE,19977.0,UA,4295.0,C5,20445.0,C5,N17146,4295.0,11413.0,1141307.0,30285.0,"Durango, CO",CO,8.0,Colorado,82.0,11292.0,1129202.0,30325.0,"Denver, CO",CO,8.0,Colorado,82.0,0.0,0.0,1100-1159,19.0,1154.0,1243.0,8.0,1245.0,6.0,0.0,0.0,1200-1259,2.0,0.0,18.916667,18.916667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16501,2022-04-19,Southwest Airlines Co.,BNA,AUS,False,False,2215,2206.0,0.0,-9.0,10.0,0.0,110.0,140.0,124.0,756.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2019.0,WN,19393.0,WN,N407WN,2019.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10423.0,1042302.0,30423.0,"Austin, TX",TX,48.0,Texas,74.0,0.0,-1.0,2200-2259,11.0,2217.0,7.0,3.0,35.0,-25.0,0.0,-2.0,0001-0559,4.0,0.0,36.766667,36.766667,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16502,2022-04-19,Southwest Airlines Co.,BNA,BDL,False,False,1335,1336.0,1.0,1.0,1640.0,0.0,109.0,135.0,124.0,852.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,694.0,WN,19393.0,WN,N749SW,694.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10529.0,1052906.0,30529.0,"Hartford, CT",CT,9.0,Connecticut,11.0,0.0,0.0,1300-1359,12.0,1348.0,1637.0,3.0,1650.0,-10.0,0.0,-1.0,1600-1659,4.0,0.0,22.266667,22.266667,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16503,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1955,1953.0,0.0,-2.0,2307.0,0.0,118.0,145.0,134.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,388.0,WN,19393.0,WN,N7840A,388.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,-1.0,1900-1959,12.0,2005.0,2303.0,4.0,2320.0,-13.0,0.0,-1.0,2300-2359,4.0,0.0,32.550000,32.550000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
16504,2022-04-19,Southwest Airlines Co.,BNA,BOS,False,False,1010,1010.0,0.0,0.0,1322.0,0.0,120.0,150.0,132.0,942.0,2022,2.0,4.0,19.0,2.0,WN,WN,19393.0,WN,2775.0,WN,19393.0,WN,N290WN,2775.0,10693.0,1069302.0,30693.0,"Nashville, TN",TN,47.0,Tennessee,54.0,10721.0,1072102.0,30721.0,"Boston, MA",MA,25.0,Massachusetts,13.0,0.0,0.0,1000-1059,8.0,1018.0,1318.0,4.0,1340.0,-18.0,0.0,-2.0,1300-1359,4.0,0.0,16.833333,16.833333,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We can compare the shapes of our DataFrames to confirm how the axis parameter affects pd.concat().

The df_stack DataFrame has the same number of columns (63) as the original DataFrame, but its row count is the sum of the rows in df1 and df2 (16506 + 647 = 17153). The df_side DataFrame has double the number of columns (126) and as many rows as the longer DataFrame (i.e. df1, with 16506 rows).

In [73]:
print(f'df1 shape = {df1.shape}')
print(f'df2 shape = {df2.shape}')
print(f'df_stack shape = {df_stack.shape}')
print(f'df_side shape = {df_side.shape}')

df1 shape = (16506, 63)
df2 shape = (647, 63)
df_stack shape = (17153, 63)
df_side shape = (16506, 126)
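
The same effect can be reproduced on a much smaller scale. The following is a minimal sketch that uses two made-up DataFrames (a and b are illustrative names, not part of the flights data set) to show how the axis parameter changes the resulting shape.

```python
import pandas as pd

# Two small, made-up DataFrames with the same columns but different lengths.
a = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
b = pd.DataFrame({'x': [7, 8], 'y': [9, 10]})

# axis=0 appends rows: 3 + 2 rows, still 2 columns.
stacked = pd.concat([a, b], axis=0)

# axis=1 appends columns: rows are aligned on the index, so the result
# has as many rows as the longer DataFrame and 2 + 2 columns.
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape)       # (5, 2)
print(side_by_side.shape)  # (3, 4)
```

The last row of side_by_side holds NaN in the columns that come from b, because b has no row with index 2.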


## Merge Data

As seen above, concatenating with an axis of 1 is a bit confusing, as we end up with duplicated column names. For this kind of side-by-side combination it is better to use the .merge() method instead.

To illustrate this we will again create two new DataFrames: df1 will hold the average departure delay and df2 the average arrival delay, both per airline and flight date.

In [74]:
df1 = df.groupby(['Airline', 'FlightDate'])[['DepDelay']].mean().reset_index()
df2 = df.groupby(['Airline', 'FlightDate'])[['ArrDelay']].mean().reset_index()

In [75]:
df1

Unnamed: 0,Airline,FlightDate,DepDelay
0,Air Wisconsin Airlines Corp,2022-04-01,4.584211
1,Air Wisconsin Airlines Corp,2022-04-02,0.190476
2,Air Wisconsin Airlines Corp,2022-04-03,5.163265
3,Air Wisconsin Airlines Corp,2022-04-04,-3.358209
4,"Commutair Aka Champlain Enterprises, Inc.",2022-04-01,9.225
5,"Commutair Aka Champlain Enterprises, Inc.",2022-04-02,4.581633
6,"Commutair Aka Champlain Enterprises, Inc.",2022-04-03,21.160622
7,"Commutair Aka Champlain Enterprises, Inc.",2022-04-04,4.555556
8,"GoJet Airlines, LLC d/b/a United Express",2022-04-01,10.56962
9,"GoJet Airlines, LLC d/b/a United Express",2022-04-02,3.628571


In [76]:
df2

Unnamed: 0,Airline,FlightDate,ArrDelay
0,Air Wisconsin Airlines Corp,2022-04-01,-1.074074
1,Air Wisconsin Airlines Corp,2022-04-02,-2.708995
2,Air Wisconsin Airlines Corp,2022-04-03,-0.353846
3,Air Wisconsin Airlines Corp,2022-04-04,-11.099502
4,"Commutair Aka Champlain Enterprises, Inc.",2022-04-01,3.42
5,"Commutair Aka Champlain Enterprises, Inc.",2022-04-02,-2.387755
6,"Commutair Aka Champlain Enterprises, Inc.",2022-04-03,16.09375
7,"Commutair Aka Champlain Enterprises, Inc.",2022-04-04,-0.592593
8,"GoJet Airlines, LLC d/b/a United Express",2022-04-01,7.120253
9,"GoJet Airlines, LLC d/b/a United Express",2022-04-02,-1.642857


We can call .merge() on df1 and pass the second DataFrame to merge, i.e. df2. By default, the merge uses the columns whose names appear in both DataFrames (here Airline and FlightDate) as keys and places the remaining columns side by side.

Optionally, we can choose a different type of merge with the how parameter, setting it to 'left', 'right', 'inner' (the default), or 'outer'.

In [77]:
df1.merge(df2, how='inner')

Unnamed: 0,Airline,FlightDate,DepDelay,ArrDelay
0,Air Wisconsin Airlines Corp,2022-04-01,4.584211,-1.074074
1,Air Wisconsin Airlines Corp,2022-04-02,0.190476,-2.708995
2,Air Wisconsin Airlines Corp,2022-04-03,5.163265,-0.353846
3,Air Wisconsin Airlines Corp,2022-04-04,-3.358209,-11.099502
4,"Commutair Aka Champlain Enterprises, Inc.",2022-04-01,9.225,3.42
5,"Commutair Aka Champlain Enterprises, Inc.",2022-04-02,4.581633,-2.387755
6,"Commutair Aka Champlain Enterprises, Inc.",2022-04-03,21.160622,16.09375
7,"Commutair Aka Champlain Enterprises, Inc.",2022-04-04,4.555556,-0.592593
8,"GoJet Airlines, LLC d/b/a United Express",2022-04-01,10.56962,7.120253
9,"GoJet Airlines, LLC d/b/a United Express",2022-04-02,3.628571,-1.642857
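
In the example above every key exists in both DataFrames, so all four merge types would give the same rows. The difference only shows up when some keys are missing on one side. The sketch below assumes two small, made-up DataFrames (left and right, with a hypothetical key column) whose keys only partly overlap.

```python
import pandas as pd

# Made-up data: 'a' exists only on the left, 'd' only on the right.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'dep': [1.0, 2.0, 3.0]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'arr': [20.0, 30.0, 40.0]})

inner = left.merge(right, how='inner')       # keys present in both: b, c
left_join = left.merge(right, how='left')    # all left keys: a, b, c
right_join = left.merge(right, how='right')  # all right keys: b, c, d
outer = left.merge(right, how='outer')       # every key: a, b, c, d

row_counts = {
    'inner': len(inner),
    'left': len(left_join),
    'right': len(right_join),
    'outer': len(outer),
}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}
```

Rows whose key has no match on the other side are filled with NaN in the columns that come from that side.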


Alternatively, we could use the top-level pd.merge() function, which takes the left and right DataFrames as arguments.

Both forms accept an optional on parameter, which takes the column or list of columns to merge on.

In [78]:
pd.merge(df1, df2, on=['Airline', 'FlightDate'])

Unnamed: 0,Airline,FlightDate,DepDelay,ArrDelay
0,Air Wisconsin Airlines Corp,2022-04-01,4.584211,-1.074074
1,Air Wisconsin Airlines Corp,2022-04-02,0.190476,-2.708995
2,Air Wisconsin Airlines Corp,2022-04-03,5.163265,-0.353846
3,Air Wisconsin Airlines Corp,2022-04-04,-3.358209,-11.099502
4,"Commutair Aka Champlain Enterprises, Inc.",2022-04-01,9.225,3.42
5,"Commutair Aka Champlain Enterprises, Inc.",2022-04-02,4.581633,-2.387755
6,"Commutair Aka Champlain Enterprises, Inc.",2022-04-03,21.160622,16.09375
7,"Commutair Aka Champlain Enterprises, Inc.",2022-04-04,4.555556,-0.592593
8,"GoJet Airlines, LLC d/b/a United Express",2022-04-01,10.56962,7.120253
9,"GoJet Airlines, LLC d/b/a United Express",2022-04-02,3.628571,-1.642857


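If we merge on only a subset of the shared columns, the remaining columns that have the same name in both DataFrames are kept separately, and pandas automatically adds the suffixes _x and _y to tell them apart. Each row is also paired with every row of the other DataFrame that has the same key, which is why the result below has more rows than before.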

In [79]:
pd.merge(df1, df2, on=['Airline'])

Unnamed: 0,Airline,FlightDate_x,DepDelay,FlightDate_y,ArrDelay
0,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-01,-1.074074
1,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-02,-2.708995
2,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-03,-0.353846
3,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-04,-11.099502
4,Air Wisconsin Airlines Corp,2022-04-02,0.190476,2022-04-01,-1.074074
...,...,...,...,...,...
528,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-25,8.000000
529,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-26,32.000000
530,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-27,-3.000000
531,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-28,-8.000000
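
The growth in row count is easier to see on a tiny example. The following sketch uses made-up values (dep and arr are illustrative stand-ins for df1 and df2) and compares merging on one key column with merging on both.

```python
import pandas as pd

# Made-up stand-ins for df1 and df2: airline 'A' has two dates, 'B' has one.
dep = pd.DataFrame({
    'Airline': ['A', 'A', 'B'],
    'FlightDate': ['2022-04-01', '2022-04-02', '2022-04-01'],
    'DepDelay': [1.0, 2.0, 3.0],
})
arr = pd.DataFrame({
    'Airline': ['A', 'A', 'B'],
    'FlightDate': ['2022-04-01', '2022-04-02', '2022-04-01'],
    'ArrDelay': [10.0, 20.0, 30.0],
})

# Merging on Airline only pairs every 'A' row on the left with every
# 'A' row on the right: 2 * 2 rows for 'A' plus 1 * 1 row for 'B'.
by_airline = pd.merge(dep, arr, on=['Airline'])

# Merging on both keys matches each (Airline, FlightDate) pair once.
by_both = pd.merge(dep, arr, on=['Airline', 'FlightDate'])

print(by_airline.shape)          # (5, 5)
print(by_both.shape)             # (3, 4)
print(list(by_airline.columns))  # ['Airline', 'FlightDate_x', 'DepDelay', 'FlightDate_y', 'ArrDelay']
```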


We can choose the suffixes ourselves with the suffixes parameter, passing a pair of strings like so.

In [80]:
pd.merge(df1, df2, on=['Airline'], suffixes=['_dep', '_arr'])

Unnamed: 0,Airline,FlightDate_dep,DepDelay,FlightDate_arr,ArrDelay
0,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-01,-1.074074
1,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-02,-2.708995
2,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-03,-0.353846
3,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-04,-11.099502
4,Air Wisconsin Airlines Corp,2022-04-02,0.190476,2022-04-01,-1.074074
...,...,...,...,...,...
528,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-25,8.000000
529,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-26,32.000000
530,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-27,-3.000000
531,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-28,-8.000000


Using the on parameter is straightforward when the key columns have the same name in both DataFrames. When the names differ, we can use the left_on and right_on parameters to name the key column(s) of each DataFrame separately. In the example below both keys happen to be named Airline, so the result matches on=['Airline'].

In [81]:
pd.merge(df1, df2, left_on=['Airline'], right_on=['Airline'])

Unnamed: 0,Airline,FlightDate_x,DepDelay,FlightDate_y,ArrDelay
0,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-01,-1.074074
1,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-02,-2.708995
2,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-03,-0.353846
3,Air Wisconsin Airlines Corp,2022-04-01,4.584211,2022-04-04,-11.099502
4,Air Wisconsin Airlines Corp,2022-04-02,0.190476,2022-04-01,-1.074074
...,...,...,...,...,...
528,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-25,8.000000
529,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-26,32.000000
530,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-27,-3.000000
531,Southwest Airlines Co.,2022-04-29,38.000000,2022-04-28,-8.000000
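
To see left_on and right_on with key columns that really do have different names, here is a minimal sketch. It assumes a made-up second DataFrame whose key column is called Carrier (a hypothetical name that does not appear in the flights data set).

```python
import pandas as pd

# Made-up data: the same key is called 'Airline' on the left
# and 'Carrier' (hypothetical name) on the right.
dep = pd.DataFrame({'Airline': ['A', 'B'], 'DepDelay': [1.0, 3.0]})
arr = pd.DataFrame({'Carrier': ['A', 'B'], 'ArrDelay': [10.0, 30.0]})

merged = pd.merge(dep, arr, left_on='Airline', right_on='Carrier')

# Both key columns are kept in the result.
print(list(merged.columns))  # ['Airline', 'DepDelay', 'Carrier', 'ArrDelay']

# The redundant key can be dropped afterwards if it is not needed.
cleaned = merged.drop(columns='Carrier')
print(list(cleaned.columns))  # ['Airline', 'DepDelay', 'ArrDelay']
```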
