<a href="https://colab.research.google.com/github/bitprj/DigitalHistory/blob/master/Week6-Advanced-Data-Wrangling-using-Pandas/Advanced-Data-Wrangling-with-Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/bitproject.png?raw=1" width="200" align="left"> 
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/data-science.jpg?raw=1" width="300" align="right">

# <div align="center">Advanced Data Wrangling with Pandas</div>

## Table of Contents

- Series
    - Creating a Series
    - Sorting a Series
        - **1.0 - Now Try This**
    - Accessing an element in a Series
        - **2.0 - Now Try This**
        - **3.0 - Now Try This**
    - Binary Operations
        - **4.0 - Now Try This**

- Advanced Dataframe
    - Creating new columns
    - Replacing NaN values
    - Sorting the dataset
    - Selecting subset of data
        - **5.0 - Now Try This**
    - Dropping entities
    - Unique values in columns
    - Joining dataset
        - Merge
        - Join

- Practical Exercise: Flight Delay data
    - Join dataset
    - Groupby
    - Filtering based on condition
    - **6.0 - Now Try This**
    - **7.0 - Now Try This**
    - **8.0 - Now Try This**
    - **9.0 - Now Try This**


## Dataframe vs. Series
There are two main data structures you need to get comfortable with: **DataFrame** and **Series**. (Remember: we covered the basics of DataFrame in Week 3.)

To refresh your memory, **DataFrame** is a tabular, column-oriented data structure with both row and column labels. 

**Series** is a one-dimensional labelled array that can hold data of any type (integer, string, float, Python objects, etc.). Its labels are called the index. You can think of a Series as an ordered dictionary, as it is a mapping of index values to data values.
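
As a small illustration of how the two structures relate, selecting a single column of a DataFrame gives back a Series that shares the DataFrame's row labels. The table and its labels below are made up for this sketch.

```python
import pandas as pd

# Hypothetical two-column table, used only for illustration.
df = pd.DataFrame({"name": ["Daisy", "Matt"], "age": [31, 24]},
                  index=["r1", "r2"])

ages = df["age"]             # selecting one column gives a Series
print(type(df).__name__)     # DataFrame
print(type(ages).__name__)   # Series
print(ages["r2"])            # look up a value by its index label, like a dict
```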


## Series

- Creating a Series
- Sorting a Series
- Accessing an element in a Series
- Binary Operations (add, sub, mul, etc.)


### Creating a Series

A Series can be created from a
1. scalar
2. list (array)
3. dictionary

First and foremost, let's import the Pandas library.

In [None]:
import pandas as pd

#### 1. Series from a scalar value

In [None]:
pd.Series(5, index =[0, 1, 2, 3, 4, 5])

#### 2-1. Series from a list

By default, the index starts at 0 and ends at len(array)-1.

In this example, as we didn't specify the index, the first index is 0 and the last index is 3, which is len(array)-1.

In [None]:
import numpy as np
array = np.array(['a','b','c','d'])
pd.Series(array)

0    a
1    b
2    c
3    d
dtype: object

#### 2-2. Series from a list with index

If we want to assign a different index instead of the default ones, we can specify the index as well.

In [None]:
array = np.array(['a','b','c','d'])
pd.Series(array,index=[1,2,3,4])

Using the example below, we can note that a Series is NOT automatically sorted by value or by index.

Instead, the elements stay in the order in which the indices were specified: 105 --> 103 --> 102 --> 109 --> 104 --> 110

In [None]:
names = np.array(['Daisy', 'Matt', 'Kelly', 'Mike', 'Ashley', 'Kyle'])
pd.Series(names, index=[105,103,102,109,104,110])

Then, the question is: do the indices have to be unique?
Let's check it out!

In [None]:
names = np.array(['Daisy', 'Matt', 'Kelly', 'Mike', 'Ashley', 'Kyle'])
pd.Series(names, index=[100,100,102,102,103,110])

The indices don't have to be unique!
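
Duplicate labels change what a lookup returns, which is worth seeing once. The sketch below uses a shortened, made-up version of the Series above and checks ```index.is_unique``` before looking up a repeated and a unique label.

```python
import pandas as pd

names = pd.Series(["Daisy", "Matt", "Kelly"], index=[100, 100, 102])

print(names.index.is_unique)   # False, because label 100 appears twice
dupes = names.loc[100]         # a repeated label returns a Series of all matches
print(list(dupes))
single = names.loc[102]        # a unique label returns a single value
print(single)
```

Because a repeated label no longer identifies exactly one element, unique indices are usually easier to work with.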

Can we use non-numerical indices?
Let's check it out!

In [None]:
names = np.array(['Daisy', 'Matt', 'Kelly', 'Mike'])
cities=["Atlanta", "San Francisco", "New York", "Seattle"]

pd.Series(names, index=cities)

We found out that the indices don't have to be numbers; they can be of other types, too!

In [None]:
names = np.array(['Daisy', 'Matt', 'Kelly', 'Mike'])
pd.Series(names, index=[0.79, [0.8, 1.4], 13, "Hello"])

We used a float, a list, an integer, and a string as indices and it worked!

This example only demonstrates that mixed types are accepted. The pandas documentation asks for index values to be hashable, so a list is a poor choice of label, and in the real world it's more common to keep the datatype consistent for indices.

#### 3. Series from a dict
We can create a Series from a **dict**! Let's see how it works!

In [None]:
# aDict stores how many fruits we have.
aDict = {'Apple':3, 'Banana':5, 'Cherry': 2, 'Mango': 13, 'Peach': 10}
pd.Series(aDict)

When we are passing a dict, the order in the dict is preserved.

In [None]:
aDict = {'Mango': 13, 'Banana':5, 'Cherry':2, 'Apple': 3, 'Peach': 10}
pd.Series(aDict)

If we want to know the datatype of a Series, we can check its ```dtype``` attribute.

In [None]:
aDict = {'Mango': 13, 'Banana':5, 'Cherry':2, 'Apple': 3, 'Peach': 10}
pd.Series(aDict).dtype

Now, let's change the value for "Mango" to 13.45 and see what ```dtype``` returns! Because one value is now a float, pandas stores the whole Series as floats.

In [None]:
aDict = {'Mango': 13.45, 'Banana':5, 'Cherry':2, 'Apple': 3, 'Peach': 10}
pd.Series(aDict).dtype

If we want to print out the indices only, we can check the ```index``` property.

In [None]:
aDict = {'Mango': 13.45, 'Banana':5, 'Cherry':2, 'Apple': 3, 'Peach': 10}
pd.Series(aDict).index

Let's get only the values from a Series. Similarly, we can use the ```values``` property of a Series.

In [None]:
pd.Series(aDict).values

### Sorting a Series

If we want to sort the Series in the order of "index", we can use the ```sort_index()``` function.

In [None]:
s = pd.Series(aDict)
s.sort_index()

Similarly, if we want to sort the Series based on the values, we can use the ```sort_values()``` method. By default, it sorts the elements in ascending order.

In [None]:
s = pd.Series(aDict)
s.sort_values()

Let's sort the series in a descending order.

In [None]:
s.sort_values(ascending=False)

### 1.0 - Now Try This
Sort the series by index in a descending order.

### Accessing an element in a Series

We can access elements in a Series in the following ways:
- by index number
- by index label

In [None]:
array = np.array(['a','b','c','d','e','f','g','h'])
s = pd.Series(array)

#### 1. By index number

We can access the first element by index number 0.

In [None]:
s[0]

In [None]:
s[5]

We can access the first 3 elements with the slice ```[:3]```.

In [None]:
s[:3]

### 2.0 - Now Try This

Access the items in the index of 3-5 (inclusive) using index operation.

#### 2. By index label


We can retrieve a single element using index label.

In [None]:
aDict = {'Apple':3, 'Banana':5, 'Cherry': 2, 'Peach': 10}
s = pd.Series(aDict)
s['Apple']

In [None]:
s['Peach']
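
The two access styles can also be written explicitly with ```.iloc``` (position) and ```.loc``` (label). This is a minimal sketch on the same fruit data; the variable names are chosen for illustration only. Note the difference at the end of a slice: positions exclude the end, labels include it.

```python
import pandas as pd

fruit = pd.Series({"Apple": 3, "Banana": 5, "Cherry": 2, "Peach": 10})

first = fruit.iloc[0]                       # by position
by_label = fruit.loc["Peach"]               # by label
pos_slice = fruit.iloc[0:2]                 # positions 0 and 1 (end excluded)
label_slice = fruit.loc["Apple":"Cherry"]   # Apple through Cherry (end included)

print(first, by_label)
print(list(pos_slice.index))
print(list(label_slice.index))
```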

### 3.0 - Now Try This



In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

Retrieve an element that corresponds to index 'd'

### Binary Operations of Series
- Add
- Sub
- Mul

Here, we have created two Series with numbers.

In [None]:
# creating a first series
s1 = pd.Series([1,5,6,2])
 
# creating a second series
s2 = pd.Series([4,1,3,5])

Let's add the two Series!

In [None]:
# answer = s1 + s2
answer = s1.add(s2)
print(answer)

We can also subtract one Series from another using the ```sub()``` method.

In [None]:
# answer = s1 - s2
answer = s1.sub(s2)
print(answer)

In [None]:
answer = s1.mul(s2)
print(answer)

Using a simpler approach, we can use mathematical notation and it will result in the same output.

In [None]:
s1 + s2

In [None]:
s1 - s2

In [None]:
s1 * s2

Now, let's create two Series with indices. Note that the order of the index labels differs between the two Series.

In [None]:
# creating a first series
s1 = pd.Series([1,5,6,2], index=['b','c','d','a'])
 
# creating a second series
s2 = pd.Series([4,1,3,5], index=['a','b','c','d'])

In [None]:
s1 + s2

The value for label 'a' is 6, which was obtained by s1['a'] + s2['a'] = 2 + 4. All the other values are calculated in the same way: pandas matches elements by index label, not by position.
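
Label matching also decides what happens when a label exists in only one Series. The sketch below uses two made-up Series (named ```left``` and ```right``` to avoid clashing with ```s1``` and ```s2```) whose labels only partly overlap. The plain ```+``` gives ```NaN``` for unmatched labels, while ```add(..., fill_value=0)``` treats the missing partner as 0.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

plain = left + right                    # 'a' and 'd' have no partner, so NaN
filled = left.add(right, fill_value=0)  # a missing partner counts as 0

print(plain)
print(filled)
```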

### 4.0 - Now Try This

Write down what this code will return on a piece of paper.
Compare your answer with the output of the code.

In [None]:
s1 * s2

## Advanced Dataframe

Now, let's move on to some more advanced ways to use a DataFrame. To demonstrate this, we'll be using a dataframe on people's income.

In [None]:
import numpy as np

data = {'First Name': ['Jennifer', 'Chris', 'John', 'Annie', 'Chloe', 'Jeremy', 'Allison', 'Hannah'], 
        'Last Name': ['Brown', 'Smith', 'Williams', 'Wong', 'Anderson', 'Scott', 'Kim', 'Mills'], 
        'Age': [16, 32, 21, 35, 27, 42, 28, 23],
        'Hourly Wage': [8,14,60,44,80,54,32,16],
        'Hours per week': [20,28,40,40,32,40,40,20],
        'Years of experience': [0,2,np.nan,9,np.nan,4,5,np.nan]
        } 

emp = pd.DataFrame(data)
emp

### Creating new columns

#### String Manipulation

We want to have a column displaying the full names of the people in the dataset. To do this, we can combine First Name and Last Name and create a new column called "Full Name."

The ```' '``` in the middle is adding a space between the first name and last name. 

In [None]:
# string manipulation
emp['Full Name'] = emp['First Name'] + ' ' + emp['Last Name']
emp

#### Numerical column

Let's calculate weekly salary for each person and add a column called ```Weekly Salary```. We need to do some simple math to get the weekly salary.

In [None]:
# multiplication of two numerical columns
emp['Weekly Salary'] = emp['Hourly Wage'] * emp['Hours per week']
emp

Now, we've added two columns "Full Name" and "Weekly Salary."

#### List Comprehension

Let's create a new column called ```High Income``` based on the ```Weekly Salary```. We will use a list comprehension for this (you can refer to Week 2 materials if you can't recall list comprehension.)

```High Income``` will be ```True``` if ```Weekly Salary``` is > 2000, otherwise False.

In [None]:
# list comprehension
emp['High Income'] = [True if x > 2000 else False for x in emp['Weekly Salary']]
emp
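
The same kind of column can be built without a loop, since comparing a column with a number is applied element by element. This is a sketch on a small stand-in table (```pay``` is a made-up name), showing that both approaches give the same values.

```python
import pandas as pd

# Small stand-in for the Weekly Salary column.
pay = pd.DataFrame({"Weekly Salary": [160, 2400, 1760, 2560]})

from_loop = [x > 2000 for x in pay["Weekly Salary"]]   # list comprehension
vectorized = pay["Weekly Salary"] > 2000               # element-wise comparison

print(vectorized.tolist())
print(from_loop == vectorized.tolist())
```

The comparison form is shorter and usually faster on large tables, while the list comprehension is handy when the rule is more involved.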

### Fill in NaN values

There are lots of ```NaN``` values (missing values) in ```Years of experience```.

Let's fill in the missing values with 0's using ```fillna(0)```.

In [None]:
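# assign the result back to the column; recent pandas versions warn that
# calling fillna(..., inplace=True) on a selected column may not update emp
emp['Years of experience'] = emp['Years of experience'].fillna(0)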

If we want to replace all missing values in the dataset, we can call ```fillna(0)``` on the dataframe.

In [None]:
emp.fillna(0, inplace=True)
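
Before filling, it can help to count the missing values, and different columns often need different replacements. The sketch below uses a made-up table (```people```) to show ```isna().sum()``` and ```fillna``` with one fill value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in two columns.
people = pd.DataFrame({
    "Years of experience": [0, np.nan, 9],
    "Company": ["Disney", None, "Adidas"],
})

missing_per_column = people.isna().sum()   # number of missing values per column
filled = people.fillna({"Years of experience": 0, "Company": "UNKNOWN"})

print(missing_per_column)
print(filled)
```

```fillna``` returns a new dataframe here, so ```people``` itself still contains the missing values.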

### Sorting the dataset

Let's sort the data by 'Weekly Salary'.

In [None]:
emp.sort_values(by='Weekly Salary')

Now, all of the high income workers are located at the bottom of the sorted result.

### Selecting subset of data based on condition

Let's review how we can select a subset of data based on a condition.

Selecting rows where 'High Income' is True:

In [None]:
emp[emp['High Income'] == True]

### 5.0 - Now Try This

Let's find people who are younger than 30 years old AND have high income.

(Hint: Make sure to use ```()``` on each condition.)

### Dropping rows

There are two ways of dropping rows.

1. ```df.drop(index #)```
2. ```df.drop(df[<some boolean condition>].index)```

The first form is used when we know exactly which rows we want to drop from the dataframe.
With the second form, we can drop the rows that meet a certain condition.

Let's drop rows where 'High Income' is False.

In [None]:
emp.drop(emp[emp['High Income'] == False].index)

In [None]:
emp

Even though we dropped the rows where High Income == False, emp remains unchanged.

This is because ```drop``` returns a new dataframe and we didn't update emp!

We can update the dataframe in place when we drop the rows by setting ```inplace=True```.

In [None]:
emp.drop(emp[emp['High Income'] == False].index, inplace=True)

In [None]:
emp

Here we go! emp has been updated!
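
Dropping the rows that match a condition is the mirror image of keeping the rows that match the opposite condition. The sketch below, on a made-up table (```workers```), shows that both routes give the same result.

```python
import pandas as pd

workers = pd.DataFrame({
    "Name": ["John", "Annie", "Chloe"],
    "High Income": [True, False, True],
})

# Route 1: drop the rows where the flag is False.
dropped = workers.drop(workers[workers["High Income"] == False].index)
# Route 2: keep the rows where the flag is True.
kept = workers[workers["High Income"]]

print(kept.equals(dropped))
```

Neither line changes ```workers```; to keep the result, assign it to a variable.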

### Company Dataframe

Here, we have the dataset with the company name each person in ```emp``` works for.

In [None]:
data = {'First Name': ['Jennifer', 'Chris', 'John', 'Annie', 'Chloe', 'Jeremy', 'Allison', 'Hannah'], 
        'Last Name': ['Brown', 'Smith', 'Williams', 'Wong', 'Anderson', 'Scott', 'Kim', 'Mills'], 
        'Company': ['Home Depot', 'Bit Project', 'Microsoft', 'Bit Project', 'Disney', 'Adidas', 'Home Depot', np.nan]
        } 

companies = pd.DataFrame(data)

In [None]:
companies

| | First Name | Last Name | Company |
|---|---|---|---|
| 0 | Jennifer | Brown | Home Depot |
| 1 | Chris | Smith | Bit Project |
| 2 | John | Williams | Microsoft |
| 3 | Annie | Wong | Bit Project |
| 4 | Chloe | Anderson | Disney |
| 5 | Jeremy | Scott | Adidas |
| 6 | Allison | Kim | Home Depot |
| 7 | Hannah | Mills | NaN |


### Fill in NaN values

In the dataframe, Hannah's company is ```NaN```. Let's replace it with a string "UNKNOWN".

In [None]:
# assign the result back to the column instead of using inplace=True on a selected column
companies['Company'] = companies['Company'].fillna('UNKNOWN')

### Unique values in a column

Let's find out the unique company names in the company dataframe.

```df[[column_name]].drop_duplicates()```

This code will literally drop duplicates in the ```column_name```!

In [None]:
companies[["Company"]].drop_duplicates()

| | Company |
|---|---|
| 0 | Home Depot |
| 1 | Bit Project |
| 2 | Microsoft |
| 4 | Disney |
| 5 | Adidas |
| 7 | UNKNOWN |


#### Count unique values

With ```groupby()``` and ```size()``` functions, we can count the occurrences of a unique value in the ```groupby``` column.

In [None]:
companies.groupby(['Company']).size()

Company
Adidas         1
Bit Project    2
Disney         1
Home Depot     2
Microsoft      1
UNKNOWN        1
dtype: int64

The code above groups by ```Company``` and returns the number of times each company appears in the ```Company``` column.

If we want to sort by the size or count, we can use ```sort_values```.

In [None]:
companies.groupby(['Company']).size().sort_values()

Company
Adidas         1
Disney         1
Microsoft      1
UNKNOWN        1
Bit Project    2
Home Depot     2
dtype: int64

#### The number of unique values

To find out the number of unique values in a column, we can use ```nunique()```.

```nunique()``` can be interpreted as **n**umber of **unique** values.

In [None]:
companies["Company"].nunique()

6

These three functions will be very helpful when working with a large, complicated dataset!
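
For a single column, ```unique()``` and ```value_counts()``` are closely related shortcuts. This sketch uses a made-up Series (```company```) rather than the dataframe above.

```python
import pandas as pd

company = pd.Series(["Home Depot", "Bit Project", "Microsoft",
                     "Bit Project", "Home Depot", "Home Depot"])

distinct = company.unique()       # distinct values, in order of first appearance
counts = company.value_counts()   # occurrences per value, most frequent first

print(list(distinct))
print(counts)
print(company.nunique())
```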

### Joining Dataset

Let's join the two tables ```emp``` and ```companies``` to find out which companies those people work for.

### 1. Merge

These are the parameters we can use when we merge two dataframes.

```DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)```

- how: {‘inner’, ‘left’, ‘right’, ‘outer’}, default ‘inner’
- on: Column or index level names to join on. These must be found in both DataFrames.
- left_on: Column or index level names to join on in the left DataFrame.
- right_on: Column or index level names to join on in the right DataFrame.
- left_index: Use the index from the left DataFrame as the join key.
- right_index: Use the index from the right DataFrame as the join key
- suffixes: A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. 

We will cover the most fundamental and commonly used ones here. Feel free to check for more details at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html 


#### Merge - How

```how``` indicates a type of ```join``` and there are 4: ```inner```, ```left```, ```right```, and ```outer```.

![inner_join.png](https://github.com/bitprj/DigitalHistory/blob/master/Week6-Advanced-Data-Wrangling-using-Pandas/assets/inner_join.png?raw=1)

```inner join``` keeps only the rows whose join-key values appear in both dataframes.

![left_join.png](https://github.com/bitprj/DigitalHistory/blob/master/Week6-Advanced-Data-Wrangling-using-Pandas/assets/left_join.png?raw=1)

```left join``` keeps every row from the left dataframe and adds the columns from the right dataframe where there's a match; rows without a match get ```NaN``` in those columns.

![right_join.png](https://github.com/bitprj/DigitalHistory/blob/master/Week6-Advanced-Data-Wrangling-using-Pandas/assets/right_join.png?raw=1)

Similar to ```left join```, ```right join``` keeps every row from the right dataframe and adds the columns from the left dataframe where there's a match.

![full_join.png](https://github.com/bitprj/DigitalHistory/blob/master/Week6-Advanced-Data-Wrangling-using-Pandas/assets/full_join.png?raw=1)

```full join``` (```how='outer'``` in pandas) keeps every row from both the left and the right dataframe, even if there's no match between the two.

Understanding the different types of ```join``` is very important, as it helps us determine which type of ```join``` to perform on a dataframe.
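
To make the four types concrete, here is a minimal sketch with two made-up tables that share a ```key``` column. Only ```b``` and ```c``` appear in both, so each ```how``` value keeps a different set of keys.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = merged["key"].tolist()   # which keys survive this join
    print(how, keys_by_how[how])
```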

#### Merge - How Example:

We can merge ```emp``` that has weekly salary information and ```companies``` that has the name of the company each person works for.

The code below will return a dataframe of two merged objects.

In [None]:
pd.merge(emp, companies)

Yay, it worked! 

We dropped the rows from ```emp``` where the employee's weekly salary is $2000 or less, so the updated ```emp``` has 3 rows.

By default, ```merge``` performs an ```inner join``` on the columns the two dataframes share (here ```First Name``` and ```Last Name```), and that's why it's showing fewer rows than the ```companies``` dataframe.

Now, let's try a ```right join``` (```how='right'```) and see how it works!

In [None]:
pd.merge(emp, companies, how='right')

As our right dataframe is ```companies``` with 8 entities, it shows all of the 8 employees in ```companies``` dataframe.

But, as we've dropped the rows that do not meet our salary threshold in ```emp```, the result of the right join shows ```NaN``` for ```Age```, ```Hourly Wage```, ```Hours per week```, etc. for those employees.
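
When a join leaves gaps like this, the ```indicator``` parameter of ```merge``` can show where each row came from. The sketch below uses two small made-up tables (```salaries``` and ```employers```); the added ```_merge``` column reads ```both```, ```left_only```, or ```right_only```.

```python
import pandas as pd

salaries = pd.DataFrame({"Name": ["John", "Chloe"],
                         "Weekly Salary": [2400, 2560]})
employers = pd.DataFrame({"Name": ["John", "Annie", "Chloe"],
                          "Company": ["Microsoft", "Bit Project", "Disney"]})

checked = pd.merge(salaries, employers, on="Name", how="right", indicator=True)
unmatched = checked[checked["_merge"] == "right_only"]   # rows with no salary record

print(checked)
print(unmatched["Name"].tolist())
```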

#### Merge - On, left_on, right_on

```on```, ```left_on```, and ```right_on``` indicate a column or index level names to join on.

- ```on``` is used when the same column name exists in both dataframes.
- ```left_on``` and ```right_on``` are used when the column names are different in the two dataframes.

#### Merge - On Example

Let's join the two dataframes using common column names, ```First Name``` and ```Last Name```.

In [None]:
pd.merge(emp, companies, on=['First Name', 'Last Name'])

The resulting dataset shows three entities, since ```inner join``` was performed by default.

#### Merge - left_on, right_on Example

It might happen that the two dataframes we want to merge have different column names.
In this case, we will have to specify the left column name and the right column name.

Let's join the two dataframes using a different column name. 

- ```emp``` column name: ```Full Name``` 
- ```companies``` column name: ```Name```

In [None]:
companies['Name'] = companies['First Name'] + ' ' + companies['Last Name']

In [None]:
pd.merge(emp, companies, left_on='Full Name', right_on='Name')

As mentioned earlier, by default, ```inner join``` has been applied and the resulting dataframe shows 3 entities!

In the result dataset, ```First Name_x``` is from the first table, which is ```emp``` while ```First Name_y``` is from the second table, which is ```companies```.

#### Merge - Suffixes

Suffixes indicate the suffix to add to overlapping column names in left and right respectively.

In the previous example, by default, suffixes ```_x``` and ```_y``` were added respectively.

We can specify our own suffixes by passing a pair of strings, e.g. ```suffixes=('_1', '_2')```.

#### Merge - Suffixes Example

Let's add ```_1``` to the first dataframe and ```_2``` to the second dataframe.

In [None]:
pd.merge(emp, companies, left_on='Full Name', right_on='Name', suffixes=['_1', '_2'])

Since inner join was performed, only the 3 common entities show up, so the ```_1``` and ```_2``` columns hold identical values and the suffixes tell us little here.

Let's try right join and see how dataframe looks different!

In [None]:
pd.merge(emp, companies, how = 'right', left_on='Full Name', right_on='Name', suffixes=['_emp', '_companies'])

Because we specified right join, it shows all of the entities from ```companies``` dataframe and matching entities from ```emp``` dataframe.

The ```First Name_emp``` and ```Last Name_emp``` show three entities only while ```First Name_companies``` and ```Last Name_companies``` show all entities!

### 2. Join

These are the parameters we can use when we join two dataframes.

```DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)```

- other: DataFrame, Series, or list of DataFrame to join
- on: Column or index level name(s) in the calling DataFrame to join on the index in other. If omitted, the join is index-on-index.
- how: {‘inner’, ‘left’, ‘right’, ‘outer’}, default ‘left’
- lsuffix: Suffix to use from left frame’s overlapping columns.
- rsuffix: Suffix to use from right frame’s overlapping columns.

We have covered ```on``` and ```how``` in detail in ```merge```.

So, we will try one example of ```join``` using ```lsuffix``` and ```rsuffix```.

#### Join - lsuffix and rsuffix

This is similar to ```suffixes``` in ```merge```.

In [None]:
emp.join(companies, lsuffix='_1', rsuffix='_2')
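
Because ```join``` matches rows on the index by default, a common pattern is to move the shared key into the index first. A minimal sketch with made-up tables (```pay``` and ```firms```):

```python
import pandas as pd

pay = pd.DataFrame({"Name": ["John", "Chloe"],
                    "Weekly Salary": [2400, 2560]})
firms = pd.DataFrame({"Name": ["Chloe", "John", "Annie"],
                      "Company": ["Disney", "Microsoft", "Bit Project"]})

# join() matches on the index, so make Name the index of both tables.
joined = pay.set_index("Name").join(firms.set_index("Name"))
print(joined)
```

With the default ```how='left'```, only the names in ```pay``` appear in the result, and the row order of ```firms``` does not matter.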

## Practical Exercise - Flight delay data 

https://www.kaggle.com/divyansh22/flight-delay-prediction

Now, we will dive into a more complicated data analysis using Pandas.

<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week6-Advanced-Data-Wrangling-using-Pandas/assets/icons/flight.jpg?raw=1" width="300" align="right"> 

This dataset is the flight delay prediction for the month of January. 

This data is collected from the Bureau of Transportation Statistics, Govt. of the USA. This data is open-sourced under U.S. Govt. Works. This dataset contains all the flights in the month of January 2020. There are more than 400,000 flights in the month of January itself throughout the United States. The features were manually chosen to do a primary time series analysis. There are several other features available on their website.

This data could well be used to predict the flight delay at the destination airport specifically for the month of January in upcoming years as the data is for January only.

In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week6-Advanced-Data-Wrangling-using-Pandas/data/Jan_2020_ontime.csv'
df = pd.read_csv(url)
df

As always, the first thing we need to do is get an idea of what data looks like!
We can call three functions here:
- head()
- describe()
- info()

In [None]:
df

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.info()

### Columns to use:
- Day_of_week: Day of Week starting from Monday
  - Ex: Monday = 1, Sunday = 7
- Op_unique_carrier: Unique Carrier Code (DL, WN, AA, 9E, OO, etc.)
- Origin: Origin Airport (JFK, ATL, SFO, LAX, etc.)
- Dest: Destination Airport (JFK, ATL, SFO, LAX, etc.)
- Dep_time: Actual Departure Time (local time: hhmm)
  - Ex: 1848 is 18:48 in local time
- Dep_del15: Departure Delay Indicator, 15 Minutes or More (1=Yes, 0=No)
- Arr_time: Actual Arrival Time (local time: hhmm)
- Arr_del15: Arrival Delay Indicator, 15 Minutes or More (1=Yes, 0=No)
- Cancelled: Cancelled Flight Indicator (1=Yes, 0=No)
- Diverted: Diverted Flight Indicator (1=Yes, 0=No)
- Distance: Distance between airports (miles)

We are going to select the columns to use using ```[]```.

In [None]:
flight = df[["DAY_OF_WEEK", "OP_UNIQUE_CARRIER", "ORIGIN", "DEST", "DEP_TIME", "DEP_DEL15", "ARR_TIME", "ARR_DEL15", "CANCELLED", "DIVERTED", "DISTANCE"]]
flight

### 6.0 - Now Try This

Let's retrieve unique airline codes (```OP_UNIQUE_CARRIER```) from the dataframe.

**Q: We know DL (Delta Air Lines), AA (American Airlines) and a few more major airlines. But, do we know all of the unique carrier codes?**

**A: No, we don't! Let's add airline names from another dataset!**

### Join datasets
- ```flight``` dataframe has ```OP_UNIQUE_CARRIER```, but it is missing airline names.
- ```airline_names.xlsx``` has airline codes as well as airline names.

Let's join ```flight``` dataframe with ```airline_names.xlsx``` to get airline names.

In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week6-Advanced-Data-Wrangling-using-Pandas/data/airline_names.xlsx'
airline = pd.read_excel(url)
airline

### 7.0 - Now Try This

- Use ```merge``` function to join ```airline``` dataframe to ```flight``` dataframe.
- Name the resulting dataframe as ```flight_airline```.
- Print ```flight_airline``` and make sure that this dataframe has airline codes AND airline names!

(Hint: what are the common column names between two dataframes? Are the column names same?)

Now, ```flight_airline``` dataframe has all of the delay, cancellation info as well as airline codes and names!

## Groupby
- Which airlines experience most delays? (# delays by airline)
- Which day of the week experiences most delays? (# delays by the day of a week)
- What origins experience most delays? (# delays by origin)

### Which airlines experience most delays?

This question can be reworded as "# of delays by airline"

We can use ```groupby``` function to get the number of delays by each airline!

In [None]:
flight_airline.groupby(['carrier','carrier_name'])['DEP_DEL15'].sum().sort_values(ascending=False)

If the threshold is 10,000 delays in January, it seems that WN, AA, and OO are the top three airlines that experience the most delays.
But oftentimes, passengers care more about **arrival on time**, and that's what the airline industry focuses on more.

### 8.0 - Now Try This

Calculate sum of ARR_DEL15 using the groupby function and sort the dataset in a descending order.

If we use the same threshold of 10,000 delays, AA, OO, OH, and WN are airlines that experience most delays in regards to **arrival**.

We've found the top airlines with most delays but this result does not take into account "Total number of flights."

So, it will be more accurate if we can find the percentage of delayed flights for each airline.

```delayed flight % = (delayed flights) / (total flights)```

Let's calculate **delayed flight %**!

In [None]:
# this code returns the total number of flights
flight_airline.groupby(['carrier','carrier_name'])['ARR_DEL15'].count()

Using this formula, ```delayed flight % = (delayed flights) / (total flights)```

we will divide the number of delayed flights by total flights.

In [None]:
# this code returns the percentage of delayed flights of each airline.
flight_airline.groupby(['carrier','carrier_name'])['ARR_DEL15'].sum() / flight_airline.groupby(['carrier','carrier_name'])['ARR_DEL15'].count()

Now, in order to see the top airlines with highest delay percentages, let's sort the result in a descending order!

In [None]:
# sort_values(ascending = False) has been added
(flight_airline.groupby(['carrier','carrier_name'])['ARR_DEL15'].sum() / flight_airline.groupby(['carrier','carrier_name'])['ARR_DEL15'].count()).sort_values(ascending = False)

Tada, we've come up with a completely different result!

OH, AS, and G4 have the highest delay percentages, while 9E, DL, HA, and WN have relatively low delay percentages.

#### agg()

The code above is correct, but it's very lengthy because we calculate ```sum()``` and ```count()``` separately.

We will learn a shortcut to calculate multiple aggregate functions at once using the ```agg()``` method.

Everything remains the same, except that we replace ```sum()``` or ```count()``` with ```agg(['sum','count'])```.

In [None]:
new_df = flight_airline.groupby(['carrier','carrier_name'])['ARR_DEL15'].agg(['sum','count'])
new_df

Now, ```new_df``` has two columns, ```sum``` and ```count```, while ```carrier``` and ```carrier_name``` form its index.

We will create another column called ```delay %``` using sum and count columns!

In [None]:
new_df['delay %'] = new_df['sum'] / new_df['count']

In [None]:
new_df

Tada! We've successfully added a ```delay %``` column! Let's sort the dataframe in a descending order to find out airlines with highest delay %!

In [None]:
new_df.sort_values(by=['delay %'], ascending=False)

### Which day of the week experiences most delays? 

Let's find out which day of the week experiences most delays!

In the data world, this question can be answered if we can find ```# delays by the day of a week```.

To answer this question, we can use groupby function, but should we use ```sum```? ```count```? or ```mean```?

- ```sum```: The first thing that came to my mind is that we can use ```sum```. But similar to the question above, each day of the week has a different number of flights. (Ex: there might be more flights scheduled on weekends.) ```sum``` is not the most appropriate function to use here.


- ```count```: We can count the number of flights and divide ```sum(delayed flights)``` by ```count``` (similar to what we just did).


- ```mean```: What if we use ```mean```? Since ```ARR_DEL15``` is 1 for a delayed flight and 0 otherwise, its mean is the share of delayed flights on each day, and this is what we are looking for!



In [None]:
flight_airline.groupby(['DAY_OF_WEEK'])['ARR_DEL15'].mean()

To confirm that ```mean``` returns same value as ```sum / count```, let's check ```sum / count``` real quick.

In [None]:
flight_airline.groupby(['DAY_OF_WEEK'])['ARR_DEL15'].sum() / flight_airline.groupby(['DAY_OF_WEEK'])['ARR_DEL15'].count()

Yay! It returns same answer as above!

Saturday has the highest delay percentage while Wednesday has the lowest delay percentage!
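
The match between ```mean``` and ```sum / count``` also holds when some values are missing, because both ```mean``` and ```count``` skip ```NaN```. The sketch below uses made-up numbers and assumes that a flight without an arrival record shows up as ```NaN``` in ```ARR_DEL15```.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "DAY_OF_WEEK": [1, 1, 1, 2, 2],
    "ARR_DEL15": [1, 0, np.nan, 1, 1],   # NaN: assumed to mean "no arrival record"
})

grouped = sample.groupby("DAY_OF_WEEK")["ARR_DEL15"]
by_mean = grouped.mean()                     # NaN values are skipped
by_ratio = grouped.sum() / grouped.count()   # count() also skips NaN

print(by_mean)
print(by_ratio)
```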

### What origins experience most delays? 

Let's find out which origins experience the most delays.

Knowing this might help when we book a flight later! :)

Since we are dealing with **Origin** airport this time, it's appropriate to use ```DEP_DEL15```

In [None]:
flight_airline.groupby(['ORIGIN'])['DEP_DEL15'].mean()

In [None]:
flight_airline.groupby(['ORIGIN'])['DEP_DEL15'].mean().sort_values(ascending=False)

The resulting Series shows the origin airports and the corresponding delay %.

Since we don't know all of the airport names from the codes, we can join this Series to the dataset that has airport codes and airport names.

## Filtering based on condition

- Which airports have delay percentages of 30% or more?

### Which origin airports have delay percentages of 30% or more?

```airline_delay_causes.csv``` has lots of useful information, but we will extract airport codes and airport names only.

In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week6-Advanced-Data-Wrangling-using-Pandas/data/airline_delay_causes.csv'
airport_codes = pd.read_csv(url)
airport_codes

In [None]:
airport_codes.head()

### 9.0 - Now Try This

- Using ```drop_duplicates()``` function, find a unique set of airport codes and airport names 
- Save the dataframe as ```airport_names```.

The Series below is what we have found earlier -- delay % by each origin airport.

In [None]:
series = flight_airline.groupby(['ORIGIN'])['DEP_DEL15'].mean().sort_values(ascending=False)

Let's convert the result of groupby aggregation to ```DataFrame``` using ```to_frame()``` function.

In [None]:
origin_delays = series.to_frame()

In [None]:
origin_delays

Now, let's take a closer look at the question and break it down to smaller subproblems!

**Which airports have delay percentages of 30% or more?**
1. Find airport codes with a delay percentage of 30% or more.
2. Join the resulting dataframe to ```airport_names``` to get airport names.

In [None]:
airports_most_delays = origin_delays[origin_delays['DEP_DEL15'] >= 0.3]
airports_most_delays

In [None]:
pd.merge(airports_most_delays, airport_names, left_on="ORIGIN", right_on = "airport")

We've found that ADK, OTH, OGD, PPG, and ASE have highest delay percentages in the US!

### Pop-up Question

What if we want to find the delay percentage of a specific airport?

Then we can select the origin airport code by using ```df[df[column_name]=="ABC"]```

Let's check out the delay % of San Francisco airport.

In [None]:
origin_delays[origin_delays['ORIGIN'] == 'SFO']

The code above did not work! Why? It's because ```ORIGIN``` is not a column, but the index.

When the values we want to filter on are in the index, we can compare against ```origin_delays.index``` instead.

In [None]:
origin_delays[origin_delays.index == 'SFO']
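
Two other common ways to reach a row by its index value are ```.loc``` and ```reset_index()```. This sketch uses made-up delay rates in a small frame (```delays```) whose index is named ```ORIGIN```, like the groupby result.

```python
import pandas as pd

# Made-up delay rates; the index is named ORIGIN as in the groupby result.
delays = pd.DataFrame({"DEP_DEL15": [0.21, 0.35, 0.12]},
                      index=pd.Index(["SFO", "ASE", "ATL"], name="ORIGIN"))

row = delays.loc[["SFO"]]           # label lookup on the index, kept as a DataFrame
as_column = delays.reset_index()    # turn ORIGIN back into an ordinary column
sfo = as_column[as_column["ORIGIN"] == "SFO"]

print(row)
print(sfo)
```

After ```reset_index()```, the column-based filter from the first attempt works as written.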

## Resources
- [Python Pandas Series](https://www.geeksforgeeks.org/python-pandas-series/)
- [Pandas API Reference Merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
- [Join types](https://www.codespot.org/sql-join/)
- [Flight Delay Prediction data](https://www.kaggle.com/divyansh22/flight-delay-prediction)
- [Airline Delay Causes](https://www.kaggle.com/anshuls235/airline-delay-causes)

## Homework
1. Find the five largest airlines by the number of flights in January
2. Calculate the cancellation percentages of the top 5 airlines
3. Find the percentage of flights that arrived less than 15 minutes late even though they departed 15 minutes or more past the scheduled departure time (Hint: you may use ```DEP_DEL15 == 1``` and ```ARR_DEL15 == 0```)
4. Convert a categorical column (airline code / name) to a numerical column and plot data visualizations for each airline
