### Data Transformation

Visualization is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. You will often need to create some new variables or summaries, rename variables, or reorder observations for the data to be easier to manage.

To explore the data manipulation verbs of pandas, we’ll use the `flights` data frame. It contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics.

**As always, import the necessary libraries and load the dataset you will need at the start of your notebook.**

In [1]:
import pandas as pd
import numpy as np

flights = pd.read_csv("../Data/nycflights13_flights.csv", index_col=0)
flights.reset_index(drop=True, inplace=True) # reset index to default integer index
flights.head()
# flights.info()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,3,25,1929.0,1905,24.0,2236.0,2217,19.0,UA,1471,N37298,EWR,RSW,169.0,1068,19,5,2013-03-25 19:00:00
1,2013,4,26,956.0,1000,-4.0,1257.0,1334,-37.0,DL,1765,N717TW,JFK,SFO,337.0,2586,10,0,2013-04-26 10:00:00
2,2013,5,21,1320.0,1309,11.0,1430.0,1414,16.0,EV,4129,N11536,EWR,DCA,39.0,199,13,9,2013-05-21 13:00:00
3,2013,7,18,1222.0,1230,-8.0,1357.0,1419,-22.0,EV,5796,N13958,EWR,CLT,77.0,529,12,30,2013-07-18 12:00:00
4,2013,8,29,540.0,545,-5.0,921.0,921,0.0,B6,939,N535JB,JFK,BQN,198.0,1576,5,45,2013-08-29 05:00:00


We are going to learn seven key pandas functions and object methods. Object methods are things an object can do. For example, a pandas data frame knows how to tell you its shape, and the pandas object knows how to concatenate two data frames together. The way we tell an object we want it to do something is with the ‘dot operator’, as in `flights.head()`. We will refer to these operations as functions or methods. The methods below let you solve the vast majority of your data manipulation challenges:

| Python pandas function | What it does |
|-------------------------|--------------|
| `query()`              | Pick observations (rows) by their values |
| `sort_values()`        | Reorder the rows based on one or more columns |
| `filter()` or `loc[]`  | Pick variables (columns) by their names |
| `rename()`             | Rename columns or index labels |
| `assign()`             | Create new variables (columns) using functions of existing variables |
| `groupby()`            | Split data into groups based on values of one or more keys |
| `agg()`                | Collapse many values down to a single summary (e.g., mean, sum, count) |
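
The dot operator described above can be tried on a small example. This is only a sketch: `jan` and `feb` are hypothetical stand-in frames, not the flights data.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the flights dataset)
jan = pd.DataFrame({"month": [1, 1], "dep_delay": [2.0, 4.0]})
feb = pd.DataFrame({"month": [2], "dep_delay": [-1.0]})

# An attribute: the data frame reports its own (rows, columns)
print(jan.shape)

# A method: called on the object with the dot operator
print(jan.head(1))

# A function that lives on the pandas module rather than on a data frame
both = pd.concat([jan, feb], ignore_index=True)
print(both.shape)
```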

To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. Python provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal).
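
As a rough illustration of what these operators do in pandas: applied to a column, each comparison returns a Boolean Series with one `True`/`False` per row, which can then be used to pick rows. The `toy` frame below is a hypothetical stand-in for `flights`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy = pd.DataFrame({"month": [1, 8, 9], "dep_delay": [5.0, -3.0, 40.0]})

late = toy.dep_delay > 0      # Boolean Series, one value per row
not_jan = toy.month != 1      # "not equal" comparison

print(late)
print(toy[late])                    # keep rows where the mask is True
print(toy.query("dep_delay > 0"))   # same rows, written as a query string
```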

**Filter rows with `.query()`**

`.query()` allows you to subset observations based on their values. It takes a single string argument: a Boolean expression written in terms of the column names. Every row for which the expression is true is kept. The same rows can also be selected by passing a Boolean Series to `[]`. For example, we can select all flights on January 16th:

In [2]:
flights.query('month == 1 & day == 16')  # expression string
flights[(flights.month == 1) & (flights.day == 16)]  # equivalent Boolean mask

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
445,2013,1,16,1444.0,1320,84.0,1648.0,1508,100.0,EV,4628,N11192,EWR,STL,163.0,872,13,20,2013-01-16 13:00:00
1057,2013,1,16,954.0,957,-3.0,1219.0,1125,54.0,EV,4094,N14562,EWR,BNA,154.0,748,9,57,2013-01-16 09:00:00
1159,2013,1,16,1526.0,1530,-4.0,1642.0,1642,0.0,EV,3267,N11184,EWR,ORF,54.0,284,15,30,2013-01-16 15:00:00
1179,2013,1,16,1459.0,1436,23.0,1620.0,1543,37.0,EV,4372,N16183,EWR,DCA,58.0,199,14,36,2013-01-16 14:00:00
1306,2013,1,16,830.0,755,35.0,1049.0,930,79.0,WN,3935,N444WN,LGA,MDW,143.0,725,7,55,2013-01-16 07:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200430,2013,1,16,1450.0,1450,0.0,1649.0,1645,4.0,EV,4381,N13956,EWR,DTW,100.0,488,14,50,2013-01-16 14:00:00
200520,2013,1,16,918.0,900,18.0,1237.0,1133,64.0,UA,1643,N48127,EWR,DEN,244.0,1605,9,0,2013-01-16 09:00:00
200926,2013,1,16,1004.0,855,69.0,1235.0,1101,94.0,EV,4185,N17138,EWR,DTW,99.0,488,8,55,2013-01-16 08:00:00
201754,2013,1,16,1347.0,1345,2.0,1701.0,1703,-2.0,UA,1164,N53442,EWR,FLL,162.0,1065,13,45,2013-01-16 13:00:00


Several conditions inside a `.query()` expression can be joined with `&` (or `and`): every condition must be true for a row to be included in the output. For other operations you may need the other Boolean operators: `&` is “and”, `|` is “or”, and `~` (or `not`) is “not”.

![Boolean operators](Boolean_operators.png)

The following code finds all flights that departed in August or September:
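
A minimal sketch, using a small hypothetical `toy` frame in place of `flights` so that it runs on its own. With the real data, the call would be `flights.query("month == 8 | month == 9")`.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy = pd.DataFrame({"month": [7, 8, 9, 10], "day": [1, 16, 2, 3]})

# "or" inside a query expression
aug_sep = toy.query("month == 8 | month == 9")

# Shorter alternative: test membership in a list of values
aug_sep_in = toy.query("month in [8, 9]")

# The same selection written as a Boolean mask
aug_sep_mask = toy[toy.month.isin([8, 9])]

print(aug_sep)
```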

**Arrange or sort rows with `.sort_values()`**

`.sort_values()` works similarly to `.query()`, except that instead of selecting rows, it changes their order.

In [None]:
flights.sort_values(by=['year', 'month', 'day'])

If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns. Rows are sorted in ascending order by default.
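
The tie-breaking behaviour can be seen on a small example. This is a sketch on a hypothetical `toy` frame. The `ascending=False` argument, which reverses the order, is not used elsewhere in this section.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy = pd.DataFrame({
    "month": [2, 1, 1, 2],
    "day": [3, 9, 4, 1],
    "dep_delay": [10.0, -2.0, 30.0, 5.0],
})

# Sort by month first; day breaks ties within the same month
by_date = toy.sort_values(by=["month", "day"])
print(by_date)

# Largest departure delay first
most_delayed = toy.sort_values(by="dep_delay", ascending=False)
print(most_delayed)
```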

**Select columns with `filter()` or `loc[]`**

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in.

In [3]:
flights.filter(['year', 'month', 'day'])

Unnamed: 0,year,month,day
0,2013,3,25
1,2013,4,26
2,2013,5,21
3,2013,7,18
4,2013,8,29
...,...,...,...
202061,2013,1,27
202062,2013,8,8
202063,2013,1,30
202064,2013,3,18
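
The `loc[]` alternative from the table can be sketched as follows, on a hypothetical `toy` frame. `loc[]` takes a row selector and a column selector, and `:` keeps all rows. The `like=` argument of `.filter()` is an extra convenience not used elsewhere in this section.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy = pd.DataFrame({
    "year": [2013, 2013],
    "month": [3, 4],
    "day": [25, 26],
    "dep_delay": [24.0, -4.0],
    "arr_delay": [19.0, -37.0],
})

by_list = toy.loc[:, ["year", "month", "day"]]   # explicit list of names
by_slice = toy.loc[:, "year":"day"]              # label slice, both ends included
by_filter = toy.filter(["year", "month", "day"]) # same result with filter()

# filter() can also match on part of a column name
delays = toy.filter(like="delay")
print(delays)
```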


**Rename columns with `.rename()`**

Use `.rename()` to rename one or more columns.

In [4]:
flights.rename(columns = {'year': 'Year'})

Unnamed: 0,Year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,3,25,1929.0,1905,24.0,2236.0,2217,19.0,UA,1471,N37298,EWR,RSW,169.0,1068,19,5,2013-03-25 19:00:00
1,2013,4,26,956.0,1000,-4.0,1257.0,1334,-37.0,DL,1765,N717TW,JFK,SFO,337.0,2586,10,0,2013-04-26 10:00:00
2,2013,5,21,1320.0,1309,11.0,1430.0,1414,16.0,EV,4129,N11536,EWR,DCA,39.0,199,13,9,2013-05-21 13:00:00
3,2013,7,18,1222.0,1230,-8.0,1357.0,1419,-22.0,EV,5796,N13958,EWR,CLT,77.0,529,12,30,2013-07-18 12:00:00
4,2013,8,29,540.0,545,-5.0,921.0,921,0.0,B6,939,N535JB,JFK,BQN,198.0,1576,5,45,2013-08-29 05:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202061,2013,1,27,1311.0,1315,-4.0,1451.0,1504,-13.0,US,1895,N192UW,EWR,CLT,77.0,529,13,15,2013-01-27 13:00:00
202062,2013,8,8,2145.0,1800,225.0,8.0,2039,209.0,DL,61,N685DA,LGA,ATL,104.0,762,18,0,2013-08-08 18:00:00
202063,2013,1,30,2248.0,2135,73.0,149.0,36,73.0,B6,11,N809JB,JFK,FLL,166.0,1069,21,35,2013-01-30 21:00:00
202064,2013,3,18,957.0,1000,-3.0,1242.0,1234,8.0,DL,1847,N397DA,LGA,ATL,112.0,762,10,0,2013-03-18 10:00:00


**Add new variables with `.assign()`**
Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. 
`.assign()` always adds new columns at the end of your dataset.

In [None]:
flights.assign(
    gain = lambda x: x.arr_delay - x.dep_delay,
    speed = lambda x: x.distance / x.air_time * 60
)

# Equivalent here: pass the computed Series directly instead of a lambda

flights.assign(
    gain = flights.arr_delay - flights.dep_delay,
    speed = flights.distance / flights.air_time * 60
)

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,tailnum,origin,dest,air_time,distance,hour,minute,time_hour,gain,speed
0,2013,3,25,1929.0,1905,24.0,2236.0,2217,19.0,UA,...,N37298,EWR,RSW,169.0,1068,19,5,2013-03-25 19:00:00,-5.0,379.171598
1,2013,4,26,956.0,1000,-4.0,1257.0,1334,-37.0,DL,...,N717TW,JFK,SFO,337.0,2586,10,0,2013-04-26 10:00:00,-33.0,460.415430
2,2013,5,21,1320.0,1309,11.0,1430.0,1414,16.0,EV,...,N11536,EWR,DCA,39.0,199,13,9,2013-05-21 13:00:00,5.0,306.153846
3,2013,7,18,1222.0,1230,-8.0,1357.0,1419,-22.0,EV,...,N13958,EWR,CLT,77.0,529,12,30,2013-07-18 12:00:00,-14.0,412.207792
4,2013,8,29,540.0,545,-5.0,921.0,921,0.0,B6,...,N535JB,JFK,BQN,198.0,1576,5,45,2013-08-29 05:00:00,5.0,477.575758
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
202061,2013,1,27,1311.0,1315,-4.0,1451.0,1504,-13.0,US,...,N192UW,EWR,CLT,77.0,529,13,15,2013-01-27 13:00:00,-9.0,412.207792
202062,2013,8,8,2145.0,1800,225.0,8.0,2039,209.0,DL,...,N685DA,LGA,ATL,104.0,762,18,0,2013-08-08 18:00:00,-16.0,439.615385
202063,2013,1,30,2248.0,2135,73.0,149.0,36,73.0,B6,...,N809JB,JFK,FLL,166.0,1069,21,35,2013-01-30 21:00:00,0.0,386.385542
202064,2013,3,18,957.0,1000,-3.0,1242.0,1234,8.0,DL,...,N397DA,LGA,ATL,112.0,762,10,0,2013-03-18 10:00:00,11.0,408.214286
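
The lambda form matters mostly inside a chain of operations, where the intermediate data frame has no name of its own. The sketch below runs on a hypothetical `toy` frame. The `hours` and `gain_per_hour` columns are illustrative only. In recent pandas versions, later arguments to `.assign()` can refer to columns created by earlier ones.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy = pd.DataFrame({
    "dep_delay": [24.0, -4.0, 11.0],
    "arr_delay": [19.0, -37.0, 16.0],
    "distance": [1068, 2586, 199],
    "air_time": [169.0, 337.0, 39.0],
})

result = (
    toy.query("dep_delay > 0")   # x below is this filtered frame, not toy
    .assign(
        gain=lambda x: x.arr_delay - x.dep_delay,
        hours=lambda x: x.air_time / 60,
        gain_per_hour=lambda x: x.gain / x.hours,  # uses the two new columns
    )
)
print(result)
```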


**Grouped summaries or aggregations with `.agg()`**

The last key verb is `.agg()`. It collapses each column to a single summary value, or to one row per summary function when you pass a list of functions:

In [11]:
flights.agg({"dep_delay": "mean"})
flights.agg({"dep_delay": ["min", 'max']})

flights.agg({"dep_delay": ["mean", "min", 'max'], "arr_delay": ['mean', 'min', 'max']})

Unnamed: 0,dep_delay,arr_delay
mean,12.623232,6.944759
min,-33.0,-79.0
max,1301.0,1272.0


`.agg()` is not terribly useful unless we pair it with `.groupby()`. This changes the unit of analysis from the complete dataset to individual groups.

In [None]:
# reset_index() turns the group keys (year, month, day) from a MultiIndex back into ordinary columns
flights.groupby(["year", "month", "day"]).agg(delay = ("dep_delay", "mean")).reset_index()

Unnamed: 0,year,month,day,delay
0,2013,1,1,10.900407
1,2013,1,2,13.898396
2,2013,1,3,10.500000
3,2013,1,4,9.415730
4,2013,1,5,6.162100
...,...,...,...,...
360,2013,12,27,9.987868
361,2013,12,28,9.138430
362,2013,12,29,23.005629
363,2013,12,30,10.164666
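
To see what `reset_index()` does in this pattern, the sketch below compares the grouped result before and after the call, on a hypothetical `toy` frame. The `as_index=False` argument of `.groupby()` is an alternative not used elsewhere in this section.

```python
import pandas as pd

# Hypothetical stand-in for the flights data
toy = pd.DataFrame({
    "month": [1, 1, 2],
    "day": [1, 1, 5],
    "dep_delay": [10.0, 20.0, 4.0],
})

# Without reset_index(): the group keys become a MultiIndex
grouped = toy.groupby(["month", "day"]).agg(delay=("dep_delay", "mean"))
print(grouped.index.names)

# With reset_index(): the keys are ordinary columns again
flat = grouped.reset_index()
print(flat)

# Alternative: ask groupby() not to move the keys into the index
flat_direct = toy.groupby(["month", "day"], as_index=False).agg(
    delay=("dep_delay", "mean")
)
```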


**Combining Multiple Operations**

Imagine you wanted to find out how flight delays vary by destination airport. You also want to see how many flights go to each destination, what the average distance is, and the average arrival delay. However, you don’t want destinations with very few flights or very long flights to distort the results. With longer operations, it is usually clearer to assign the result to a variable using `=`.

In [19]:
delays = flights.groupby('dest').agg(
    count = ("distance", "count"),
    dist = ("distance", "mean"),
    delay = ("arr_delay", "mean")
).query('count > 20 & dest != "HNL"')
delays

dest,count,dist,delay
ABQ,145,1826.000000,7.958621
ACK,142,199.000000,5.524823
ALB,295,143.000000,11.989362
ATL,10321,757.058812,11.132765
AUS,1448,1514.237569,4.274126
...,...,...,...
TPA,4532,1003.864519,7.831215
TUL,186,1215.000000,40.545977
TVC,60,652.616667,16.109091
TYS,388,638.381443,23.864789


#### Exercise - Average speed by destination

Imagine you wanted to know how fast flights travel on average depending on the destination. You decide to create a new variable called speed (in miles per hour) and then summarize by destination.

Task:

Create a new column `speed = distance / (air_time / 60)`.

For each destination (`dest`), keeping in mind that `speed` contains missing values (`NaN`), calculate:

`count`: number of flights

`avg_speed`: average speed (mph)

Keep only destinations with more than 20 flights and sort them by `avg_speed` from slowest to fastest.

In [None]:
speeds = (
    flights.assign(speed = flights.distance / (flights.air_time / 60))
    .groupby("dest")
    .agg(count = ("speed", "size"), avg_speed = ("speed", "mean"))  # "size" counts every flight; "mean" skips NaN
    .query("count > 20")
    .sort_values("avg_speed")  # ascending: slowest first
)
speeds


dest,count,avg_speed
PHL,969,175.643066
ALB,295,272.213844
BDL,270,276.773209
DCA,5898,280.234279
BWI,1069,282.751364
...,...,...
STT,289,478.681312
PSE,229,481.060702
HNL,428,483.518542
SJU,3521,485.500289


### All Done!