# Chapter 8: Data Wrangling: Join, Combine, and Reshape

In many applications, data may be spread across a number of files or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine, join, and rearrange data.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## I. Merging Datasets

### 1. Default merge operation for data frames

In [2]:
# Generate two data frames
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [3]:
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


`df1.merge(df2)` merges df1 with df2:

In [4]:
df1.merge(df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [5]:
# df2.merge(df1) matches the same rows; only the row and column order differ
df2.merge(df1)

Unnamed: 0,key,data2,data1
0,a,0,2
1,a,0,4
2,a,0,5
3,b,1,0
4,b,1,1
5,b,1,6


In [6]:
pd.merge(df2, df1)

Unnamed: 0,key,data2,data1
0,a,0,2
1,a,0,4
2,a,0,5
3,b,1,0
4,b,1,1
5,b,1,6


Q: Can you identify the rule followed by merge?

- a row from df1 is merged with a row from df2 whenever they have the same key value
- the `key` column is chosen because it is the only column shared between the two data frames
- if a row from df1 has a key that does not appear in df2, this row will not be included
- if a row from df2 has a key that does not appear in df1, this row will not be included
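
The rule above can be checked directly. The following is a minimal sketch that rebuilds `df1` and `df2` and compares the keys in the merged result with the keys the two frames share; the names `shared_keys` and `merged_keys` are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

merged = df1.merge(df2)

# Keys present in both frames; 'c' (only in df1) and 'd' (only in df2) drop out
shared_keys = set(df1['key']) & set(df2['key'])
merged_keys = set(merged['key'])
print(shared_keys == merged_keys)
```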

In [7]:
df3 = pd.DataFrame({'key': ['a', 'b', 'b'],
                    'data2': range(3)})
df3

Unnamed: 0,key,data2
0,a,0
1,b,1
2,b,2


In [9]:
# Can you predict the resulting data frame?
df1.merge(df3)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,2
2,b,1,1
3,b,1,2
4,b,6,1
5,b,6,2
6,a,2,0
7,a,4,0
8,a,5,0
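
When a key is repeated in both frames, every matching pair of rows appears in the result, so the row count for each key is the product of its counts on the two sides. The sketch below checks this for `df1` and `df3`; `rows_per_key` and `is_many_to_one` are illustrative names, and the `validate` argument is shown as one possible way to catch an unexpected many-to-many join.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b'],
                    'data2': range(3)})

counts1 = df1['key'].value_counts()
counts3 = df3['key'].value_counts()

# Multiply per-key counts; keys missing from one side become NaN and are dropped
rows_per_key = (counts1 * counts3).dropna().astype(int)
print(rows_per_key.sum(), len(df1.merge(df3)))

# validate= raises MergeError when the join does not have the declared shape
try:
    df1.merge(df3, on='key', validate='many_to_one')
    is_many_to_one = True
except pd.errors.MergeError:
    is_many_to_one = False
print(is_many_to_one)
```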


**It is good practice to specify explicitly which column(s) to join on.**

In [8]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [10]:
df1.merge(df3, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,2
2,b,1,1
3,b,1,2
4,b,6,1
5,b,6,2
6,a,2,0
7,a,4,0
8,a,5,0


### 2. What if the column to join has different names in the two data frames?

In [11]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare'],
    'Hw1': [100, 90, 80],
    'Hw2': [60, 70, 80]
})
homework

Unnamed: 0,Name,Hw1,Hw2
0,Alice,100,60
1,Bob,90,70
2,Clare,80,80


In [12]:
exam = pd.DataFrame({
    "Full Name": ['Alice', 'Bob', 'Clare'],
    "Midterm": [70, 80, 90],
    "Final": [85, 65, 75]
})
exam

Unnamed: 0,Full Name,Midterm,Final
0,Alice,70,85
1,Bob,80,65
2,Clare,90,75


In [13]:
pd.merge(homework, exam, left_on="Name", right_on="Full Name")

Unnamed: 0,Name,Hw1,Hw2,Full Name,Midterm,Final
0,Alice,100,60,Alice,70,85
1,Bob,90,70,Bob,80,65
2,Clare,80,80,Clare,90,75
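
With `left_on` and `right_on`, both key columns are kept in the result. If a single key column is preferred, one option (a sketch, not the only approach) is to rename the column in one frame before merging so that `on` can be used.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare'],
    'Hw1': [100, 90, 80],
    'Hw2': [60, 70, 80]
})
exam = pd.DataFrame({
    'Full Name': ['Alice', 'Bob', 'Clare'],
    'Midterm': [70, 80, 90],
    'Final': [85, 65, 75]
})

# rename() returns a copy, so the original exam frame is left unchanged
merged = pd.merge(homework, exam.rename(columns={'Full Name': 'Name'}), on='Name')
print(merged.columns.tolist())
```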


### 3. What if the column to join has different values?

In [14]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
homework

Unnamed: 0,Name,Hw1,Hw2
0,Alice,100,60
1,Bob,90,70
2,Clare,80,80
3,David,70,90


In [15]:
exam = pd.DataFrame({
    "Full Name": ['Alice', 'Bob', 'Clare', 'Eli'],
    "Midterm": [70, 80, 90, 100],
    "Final": [85, 65, 75, 55]
})
exam

Unnamed: 0,Full Name,Midterm,Final
0,Alice,70,85
1,Bob,80,65
2,Clare,90,75
3,Eli,100,55


In [16]:
pd.merge(homework, exam, left_on="Name", right_on="Full Name")

Unnamed: 0,Name,Hw1,Hw2,Full Name,Midterm,Final
0,Alice,100,60,Alice,70,85
1,Bob,90,70,Bob,80,65
2,Clare,80,80,Clare,90,75


Different join types are selected with the `how` argument:
- inner (the default): use only the key combinations observed in both tables
- outer: use all key combinations observed in either table
- left: use all keys found in the first data frame
- right: use all keys found in the second data frame

In [17]:
pd.merge(homework, exam, left_on="Name", right_on="Full Name",
         how='outer')

Unnamed: 0,Name,Hw1,Hw2,Full Name,Midterm,Final
0,Alice,100.0,60.0,Alice,70.0,85.0
1,Bob,90.0,70.0,Bob,80.0,65.0
2,Clare,80.0,80.0,Clare,90.0,75.0
3,David,70.0,90.0,,,
4,,,,Eli,100.0,55.0
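
The left and right joins from the list above can be sketched on the same two frames. The `indicator=True` argument adds a `_merge` column recording where each row came from, which can help when checking which keys failed to match.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
exam = pd.DataFrame({
    'Full Name': ['Alice', 'Bob', 'Clare', 'Eli'],
    'Midterm': [70, 80, 90, 100],
    'Final': [85, 65, 75, 55]
})

# Left join: keeps David (no exam record), drops Eli
left = pd.merge(homework, exam, left_on='Name', right_on='Full Name', how='left')

# Right join: keeps Eli (no homework record), drops David
right = pd.merge(homework, exam, left_on='Name', right_on='Full Name', how='right')

# Outer join with a column marking 'both', 'left_only', or 'right_only'
outer = pd.merge(homework, exam, left_on='Name', right_on='Full Name',
                 how='outer', indicator=True)
print(outer[['Name', 'Full Name', '_merge']])
```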


### 4. What if we want to join on multiple columns?

In [18]:
homework = pd.DataFrame({
    'Semester': ['Fall 2018', 'Fall 2018', 'Fall 2019', 'Fall 2019'],
    'Name': ['Alice', 'Bob', 'Clare', 'Alice'],
    'Hw1': [50, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
homework

Unnamed: 0,Semester,Name,Hw1,Hw2
0,Fall 2018,Alice,50,60
1,Fall 2018,Bob,90,70
2,Fall 2019,Clare,80,80
3,Fall 2019,Alice,70,90


In [19]:
exam = pd.DataFrame({
    'When': ['Fall 2018', 'Fall 2018', 'Fall 2019', 'Fall 2019'],
    "Name": ['Alice', 'Bob', 'Clare', 'Alice'],
    "Midterm": [60, 80, 90, 100],
    "Final": [45, 65, 75, 55]
})
exam

Unnamed: 0,When,Name,Midterm,Final
0,Fall 2018,Alice,60,45
1,Fall 2018,Bob,80,65
2,Fall 2019,Clare,90,75
3,Fall 2019,Alice,100,55


In [20]:
pd.merge(homework, exam, on='Name')

Unnamed: 0,Semester,Name,Hw1,Hw2,When,Midterm,Final
0,Fall 2018,Alice,50,60,Fall 2018,60,45
1,Fall 2018,Alice,50,60,Fall 2019,100,55
2,Fall 2019,Alice,70,90,Fall 2018,60,45
3,Fall 2019,Alice,70,90,Fall 2019,100,55
4,Fall 2018,Bob,90,70,Fall 2018,80,65
5,Fall 2019,Clare,80,80,Fall 2019,90,75


In [21]:
pd.merge(homework, exam, left_on=['Semester', 'Name'],
         right_on=['When', 'Name'])

Unnamed: 0,Semester,Name,Hw1,Hw2,When,Midterm,Final
0,Fall 2018,Alice,50,60,Fall 2018,60,45
1,Fall 2018,Bob,90,70,Fall 2018,80,65
2,Fall 2019,Clare,80,80,Fall 2019,90,75
3,Fall 2019,Alice,70,90,Fall 2019,100,55


### 5. What if there are overlapping columns?

In [22]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
})
homework

Unnamed: 0,Name,Hw1,Hw2,Average
0,Alice,100,60,80
1,Bob,90,70,80
2,Clare,80,80,80
3,David,70,90,80


In [23]:
exam = pd.DataFrame({
    "Name": ['Alice', 'Bob', 'Clare', 'Eva'],
    "Midterm": [60, 80, 90, 100],
    "Final": [45, 65, 75, 55],
    "Average": [52.5, 72.5, 82.5, 77.5]
})
exam

Unnamed: 0,Name,Midterm,Final,Average
0,Alice,60,45,52.5
1,Bob,80,65,72.5
2,Clare,90,75,82.5
3,Eva,100,55,77.5


In [24]:
pd.merge(homework, exam, on='Name', how='outer')

Unnamed: 0,Name,Hw1,Hw2,Average_x,Midterm,Final,Average_y
0,Alice,100.0,60.0,80.0,60.0,45.0,52.5
1,Bob,90.0,70.0,80.0,80.0,65.0,72.5
2,Clare,80.0,80.0,80.0,90.0,75.0,82.5
3,David,70.0,90.0,80.0,,,
4,Eva,,,,100.0,55.0,77.5


In [25]:
pd.merge(homework, exam, on='Name', suffixes=('_hw', '_ex'), how='outer')

Unnamed: 0,Name,Hw1,Hw2,Average_hw,Midterm,Final,Average_ex
0,Alice,100.0,60.0,80.0,60.0,45.0,52.5
1,Bob,90.0,70.0,80.0,80.0,65.0,72.5
2,Clare,80.0,80.0,80.0,90.0,75.0,82.5
3,David,70.0,90.0,80.0,,,
4,Eva,,,,100.0,55.0,77.5


### 6. What if we want to merge on index?

In [27]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
}, index=[111, 222, 333, 444])
homework

Unnamed: 0,Name,Hw1,Hw2,Average
111,Alice,100,60,80
222,Bob,90,70,80
333,Clare,80,80,80
444,David,70,90,80


In [28]:
exam = pd.DataFrame({
    "Name": ['Alice', 'Bob', 'Clare', 'Eva'],
    "Midterm": [60, 80, 90, 100],
    "Final": [45, 65, 75, 55],
    "Average": [52.5, 72.5, 82.5, 77.5]
})
exam = exam.set_index('Name')
exam

Name,Midterm,Final,Average
Alice,60,45,52.5
Bob,80,65,72.5
Clare,90,75,82.5
Eva,100,55,77.5


In [29]:
pd.merge(homework, exam, left_on='Name', right_index=True)

Unnamed: 0,Name,Hw1,Hw2,Average_x,Midterm,Final,Average_y
111,Alice,100,60,80,60,45,52.5
222,Bob,90,70,80,80,65,72.5
333,Clare,80,80,80,90,75,82.5
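
`DataFrame.join` offers a shorter spelling for the same kind of column-to-index merge. This is a sketch under the assumption that an inner join is wanted, as in the cell above; the suffix `_ex` is an arbitrary choice for the overlapping `Average` column.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
}, index=[111, 222, 333, 444])

exam = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'Eva'],
    'Midterm': [60, 80, 90, 100],
    'Final': [45, 65, 75, 55],
    'Average': [52.5, 72.5, 82.5, 77.5]
}).set_index('Name')

# on='Name' matches homework's Name column against exam's index
joined = homework.join(exam, on='Name', how='inner', rsuffix='_ex')
print(joined)
```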


## II. Concatenations

### 1. Concatenating NumPy Arrays
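My personal favorite methods are `np.hstack()` for horizontal concatenation and `np.vstack()` for vertical concatenation. The arrays must have the same number of rows for `np.hstack()` and the same number of columns for `np.vstack()`.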

In [30]:
arr1 = np.arange(12).reshape([3, 4])
print(arr1)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [31]:
arr2 = np.arange(10, 90, 10).reshape([2, 4])
print(arr2)

[[10 20 30 40]
 [50 60 70 80]]


In [32]:
print(np.vstack([arr1, arr2]))

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [10 20 30 40]
 [50 60 70 80]]


In [33]:
arr3 = np.arange(100, 10, -10).reshape([3, 3])
print(arr3)

[[100  90  80]
 [ 70  60  50]
 [ 40  30  20]]


In [34]:
print(np.hstack([arr1, arr3]))

[[  0   1   2   3 100  90  80]
 [  4   5   6   7  70  60  50]
 [  8   9  10  11  40  30  20]]
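
For 2-D arrays, the same results can be obtained with `np.concatenate` and an explicit `axis`. A short sketch comparing the two spellings:

```python
import numpy as np

arr1 = np.arange(12).reshape([3, 4])
arr2 = np.arange(10, 90, 10).reshape([2, 4])
arr3 = np.arange(100, 10, -10).reshape([3, 3])

# axis=0 stacks rows (like vstack); axis=1 stacks columns (like hstack)
stacked_rows = np.concatenate([arr1, arr2], axis=0)
stacked_cols = np.concatenate([arr1, arr3], axis=1)

print(np.array_equal(stacked_rows, np.vstack([arr1, arr2])))
print(np.array_equal(stacked_cols, np.hstack([arr1, arr3])))
```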


### 2. Concatenating Data Frames

In [35]:
spring_records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Homework': [60, 70, 80, 90],
    'Exam': [65, 75, 85, 95]
})
spring_records

Unnamed: 0,Name,Homework,Exam
0,Alice,60,65
1,Bob,70,75
2,Clare,80,85
3,David,90,95


In [36]:
fall_records = pd.DataFrame({
    'Name': ['Alice', 'Eva', 'Fred', 'Gabriel'],
    'Homework': [66, 77, 88, 99],
    'Exam': [69, 79, 89, 99]
})
fall_records

Unnamed: 0,Name,Homework,Exam
0,Alice,66,69
1,Eva,77,79
2,Fred,88,89
3,Gabriel,99,99


In [37]:
pd.concat([spring_records, fall_records])

Unnamed: 0,Name,Homework,Exam
0,Alice,60,65
1,Bob,70,75
2,Clare,80,85
3,David,90,95
0,Alice,66,69
1,Eva,77,79
2,Fred,88,89
3,Gabriel,99,99
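
The concatenated frame above repeats the index labels 0 to 3. Two common ways to handle this are sketched below: `ignore_index=True` renumbers the rows, while `keys` labels each block so the source of every row stays visible. The labels `'spring'` and `'fall'` are chosen here for illustration.

```python
import pandas as pd

spring_records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Homework': [60, 70, 80, 90],
    'Exam': [65, 75, 85, 95]
})
fall_records = pd.DataFrame({
    'Name': ['Alice', 'Eva', 'Fred', 'Gabriel'],
    'Homework': [66, 77, 88, 99],
    'Exam': [69, 79, 89, 99]
})

# Option 1: discard the old index and number the rows 0..7
renumbered = pd.concat([spring_records, fall_records], ignore_index=True)

# Option 2: add an outer index level naming the source of each block
labeled = pd.concat([spring_records, fall_records], keys=['spring', 'fall'])

print(renumbered.index.tolist())
print(labeled.loc['fall'])
```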


In [38]:
pd.concat([spring_records, fall_records], axis=1)

Unnamed: 0,Name,Homework,Exam,Name,Homework,Exam
0,Alice,60,65,Alice,66,69
1,Bob,70,75,Eva,77,79
2,Clare,80,85,Fred,88,89
3,David,90,95,Gabriel,99,99


## Example: Analyzing Airport Operations

Download `airports.csv`, `airport-frequencies.csv`, `countries.csv`, and `regions.csv` from [OurAirports.com](https://ourairports.com/data/).

#### 1. Select data with multiple conditions

In [11]:
# Find the region code for New York from region data frame.
regions = pd.read_csv("regions.csv")
regions.sample(5)

Unnamed: 0,id,code,local_code,name,continent,iso_country,wikipedia_link,keywords
1095,303855,GM-C,C,Central River Division,AF,GM,https://en.wikipedia.org/wiki/Central_River_Di...,Central River Division
803,303594,DO-23,23,San Pedro de Macorís Province,,DO,https://en.wikipedia.org/wiki/San_Pedro_de_Mac...,
1645,304402,KR-27,27,Daegu Gwang'yeogsi,AS,KR,https://en.wikipedia.org/wiki/Daegu_Gwang'yeogsi,
517,303330,CF-KG,KG,Kémo,AF,CF,https://en.wikipedia.org/wiki/K%C3%A9mo,
1470,304246,IR-U-A,U-A,(unassigned),AS,IR,,


In [19]:
ny_filter1 = regions["name"] == "New York"

regions[ny_filter1]

Unnamed: 0,id,code,local_code,name,continent,iso_country,wikipedia_link,keywords
3704,306110,US-NY,NY,New York,,US,https://en.wikipedia.org/wiki/New_York,


In [21]:
# Extract all large airports in New York state from the airports data frame

airports = pd.read_csv("airports.csv")
airports.head()


Unnamed: 0,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
0,6523,00A,heliport,Total Rf Heliport,40.070801,-74.933601,11.0,,US,US-PA,Bensalem,no,00A,,00A,,,
1,323361,00AA,small_airport,Aero B Ranch Airport,38.704022,-101.473911,3435.0,,US,US-KS,Leoti,no,00AA,,00AA,,,
2,6524,00AK,small_airport,Lowell Field,59.9492,-151.695999,450.0,,US,US-AK,Anchor Point,no,00AK,,00AK,,,
3,6525,00AL,small_airport,Epps Airpark,34.864799,-86.770302,820.0,,US,US-AL,Harvest,no,00AL,,00AL,,,
4,6526,00AR,closed,Newport Hospital & Clinic Heliport,35.6087,-91.254898,237.0,,US,US-AR,Newport,no,,,,,,00AR


In [45]:
ny_filter2 = (airports["iso_region"] == "US-NY") & (airports["type"] == "large_airport")
large_ny_airports = airports[ny_filter2]
large_ny_airports


Unnamed: 0,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
26527,3431,KBUF,large_airport,Buffalo Niagara International Airport,42.940498,-78.732201,728.0,,US,US-NY,Buffalo,yes,KBUF,BUF,BUF,,https://en.wikipedia.org/wiki/Buffalo_Niagara_...,
27925,3622,KJFK,large_airport,John F Kennedy International Airport,40.639801,-73.7789,13.0,,US,US-NY,New York,yes,KJFK,JFK,JFK,http://www.panynj.gov/CommutingTravel/airports...,https://en.wikipedia.org/wiki/John_F._Kennedy_...,"Manhattan, New York City, NYC, Idlewild"
28065,3643,KLGA,large_airport,La Guardia Airport,40.777199,-73.872597,21.0,,US,US-NY,New York,yes,KLGA,LGA,LGA,http://www.panynj.gov/CommutingTravel/airports...,https://en.wikipedia.org/wiki/LaGuardia_Airport,"Manhattan, New York City, NYC, Glenn H. Curtis..."
30050,3855,KROC,large_airport,Greater Rochester International Airport,43.1189,-77.672401,559.0,,US,US-NY,Rochester,yes,KROC,ROC,ROC,,https://en.wikipedia.org/wiki/Greater_Rocheste...,
30299,3913,KSYR,large_airport,Syracuse Hancock International Airport,43.111198,-76.1063,421.0,,US,US-NY,Syracuse,yes,KSYR,SYR,SYR,http://www.syrairport.org/,https://en.wikipedia.org/wiki/Syracuse_Hancock...,


In [48]:
# Extract the name, identification code, and municipality of 
# all airports with ISO region "US-NY" and type "large_airport"

large_ny_airports2 = large_ny_airports.loc[:, ["name", "id", "municipality"]]
large_ny_airports2


Unnamed: 0,name,id,municipality
26527,Buffalo Niagara International Airport,3431,Buffalo
27925,John F Kennedy International Airport,3622,New York
28065,La Guardia Airport,3643,New York
30050,Greater Rochester International Airport,3855,Rochester
30299,Syracuse Hancock International Airport,3913,Syracuse
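
The row filter and the column selection can also be combined into a single `.loc` call. The sketch below uses a small stand-in frame built from a few rows shown earlier instead of the full `airports.csv`, and it selects the `ident` column (for example `KJFK`) rather than the numeric `id`; both choices are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for airports.csv with only the columns needed here
airports = pd.DataFrame({
    'ident': ['00A', '00AK', 'KBUF', 'KJFK', 'KLGA'],
    'type': ['heliport', 'small_airport', 'large_airport',
             'large_airport', 'large_airport'],
    'name': ['Total Rf Heliport', 'Lowell Field',
             'Buffalo Niagara International Airport',
             'John F Kennedy International Airport', 'La Guardia Airport'],
    'iso_region': ['US-PA', 'US-AK', 'US-NY', 'US-NY', 'US-NY'],
    'municipality': ['Bensalem', 'Anchor Point', 'Buffalo', 'New York', 'New York']
})

# Each condition needs its own parentheses because & binds tighter than ==
mask = (airports['iso_region'] == 'US-NY') & (airports['type'] == 'large_airport')

# Rows selected by the mask, columns selected by the list
result = airports.loc[mask, ['name', 'ident', 'municipality']]
print(result)
```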


#### 2. Sorting

In [32]:
# From airport_freq, extract all communication frequencies for KJFK,
# with frequencies sorted in ascending order

airport_freq = pd.read_csv("airport-frequencies.csv")
airport_freq.head()

Unnamed: 0,id,airport_ref,airport_ident,type,description,frequency_mhz
0,70518,6528,00CA,CTAF,CTAF,122.9
1,307581,6589,01FL,ARCAL,,122.9
2,75239,6589,01FL,CTAF,CEDAR KNOLL TRAFFIC,122.8
3,60191,6756,04CA,CTAF,CTAF,122.9
4,59287,6779,04MS,UNIC,UNICOM,122.8


In [35]:
airport_freq = airport_freq.sort_values(by = "frequency_mhz")
airport_freq.head(10)

Unnamed: 0,id,airport_ref,airport_ident,type,description,frequency_mhz
23155,328044,40554,TT-TT01,Flight Planning,Piarco Tower Ops,0.0
19689,308254,308253,NO-0033,123.5,Lillehammer Mjøsisen,0.0
1956,298892,4970,DN56,131.7,ESCRAVOS TOWER,0.0
19428,313875,5333,MP24,122.8 unicom,,0.0
2160,310110,28519,EDCB,GE,Ballenstedt Info,0.0
23902,75418,26568,VILD,watch hours 1030,,0.0
5111,301042,301040,IN-0070,NDB,Vijayanagar VN,0.217
5113,301047,301046,IN-0072,NDB,Raigarh NDB -RG,0.247
5112,301044,301043,IN-0071,NDB,Koppal NDB BAK,0.307
5096,314718,41380,ID-KWB,NDB,KJ,0.336


In [38]:
kjfk_filter = airport_freq["airport_ident"] == "KJFK"
asc_freq_kjfk = airport_freq[kjfk_filter]["frequency_mhz"]
asc_freq_kjfk.head(10)

11616    115.10
11620    115.90
11621    119.10
11619    121.90
11622    122.95
11613    125.70
11614    127.40
11615    132.40
11617    135.05
11618    135.90
Name: frequency_mhz, dtype: float64

In [39]:
# From airport_freq, extract all communication frequencies for KJFK,
# with frequencies sorted in descending order

airport_freq = airport_freq.sort_values(by = "frequency_mhz", ascending = False)
airport_freq.head(10)

Unnamed: 0,id,airport_ref,airport_ident,type,description,frequency_mhz
4398,57161,2877,FHAW,ACC,ATLANTICO FIR,1795.5
19695,71814,4976,NSFA,INFO,RDO,1790.4
19692,71811,4976,NSFA,APP,APP,1790.4
19698,71817,4976,NSFA,TWR,TWR,1790.4
22621,51506,6104,SKBO,OPS,MILGP RDO OPS,1395.0
24326,53626,26782,WAPK,RDO,RDO,1340.0
4620,51120,3043,FZAI,A/G,A/G VOICE RDO,1330.4
21780,55004,5656,RKSS,A/G,SEOUL RDO,1330.3
19541,51624,4839,MUHA,A/G,BOYEROS RDO INTL,1329.7
4902,51919,2383,HE44,RDO,UN ISMAILIYAH OPS,1325.7


In [40]:
kjfk_filter = airport_freq["airport_ident"] == "KJFK"
desc_freq_kjfk = airport_freq[kjfk_filter]["frequency_mhz"]
desc_freq_kjfk.head(10)

11618    135.90
11617    135.05
11615    132.40
11614    127.40
11613    125.70
11622    122.95
11619    121.90
11621    119.10
11620    115.90
11616    115.10
Name: frequency_mhz, dtype: float64
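
The cells above sort the whole table and then filter it. An alternative sketch filters first and sorts only the selected values, which leaves `airport_freq` itself in its original order. The frame below is a small stand-in that reuses a few values shown above, not the full file.

```python
import pandas as pd

# Stand-in for airport-frequencies.csv
airport_freq = pd.DataFrame({
    'airport_ident': ['00CA', 'KJFK', 'KJFK', '01FL', 'KJFK', 'KJFK'],
    'frequency_mhz': [122.9, 125.7, 127.4, 122.8, 115.1, 135.9]
})

# Select the KJFK frequencies as a Series
kjfk_freq = airport_freq.loc[airport_freq['airport_ident'] == 'KJFK', 'frequency_mhz']

# sort_values returns a new Series; the original is not modified
asc_freq_kjfk = kjfk_freq.sort_values()
desc_freq_kjfk = kjfk_freq.sort_values(ascending=False)

print(asc_freq_kjfk.tolist())
print(desc_freq_kjfk.tolist())
```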

#### 3. Filter on a list of values

In [42]:
# Extract all communication frequencies used for a large airport in New York state
airport_freq = pd.read_csv("airport-frequencies.csv")
airport_freq.head()

Unnamed: 0,id,airport_ref,airport_ident,type,description,frequency_mhz
0,70518,6528,00CA,CTAF,CTAF,122.9
1,307581,6589,01FL,ARCAL,,122.9
2,75239,6589,01FL,CTAF,CEDAR KNOLL TRAFFIC,122.8
3,60191,6756,04CA,CTAF,CTAF,122.9
4,59287,6779,04MS,UNIC,UNICOM,122.8


In [50]:
large_ny_airports_comm = pd.merge(large_ny_airports, airport_freq, left_on = "ident", right_on = "airport_ident" )
large_ny_airports_comm.head()

Unnamed: 0,id_x,ident,type_x,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,...,local_code,home_link,wikipedia_link,keywords,id_y,airport_ref,airport_ident,type_y,description,frequency_mhz
0,3431,KBUF,large_airport,Buffalo Niagara International Airport,42.940498,-78.732201,728.0,,US,US-NY,...,BUF,,https://en.wikipedia.org/wiki/Buffalo_Niagara_...,,69857,3431,KBUF,A/D,APP/DEP,126.15
1,3431,KBUF,large_airport,Buffalo Niagara International Airport,42.940498,-78.732201,728.0,,US,US-NY,...,BUF,,https://en.wikipedia.org/wiki/Buffalo_Niagara_...,,69858,3431,KBUF,ATIS,ATIS,135.35
2,3431,KBUF,large_airport,Buffalo Niagara International Airport,42.940498,-78.732201,728.0,,US,US-NY,...,BUF,,https://en.wikipedia.org/wiki/Buffalo_Niagara_...,,69859,3431,KBUF,CLD,CLNC DEL,124.7
3,3431,KBUF,large_airport,Buffalo Niagara International Airport,42.940498,-78.732201,728.0,,US,US-NY,...,BUF,,https://en.wikipedia.org/wiki/Buffalo_Niagara_...,,69860,3431,KBUF,GND,GND,121.9
4,3431,KBUF,large_airport,Buffalo Niagara International Airport,42.940498,-78.732201,728.0,,US,US-NY,...,BUF,,https://en.wikipedia.org/wiki/Buffalo_Niagara_...,,69861,3431,KBUF,RDO,RDO,122.6


In [55]:
large_ny_airports_freq = large_ny_airports_comm["frequency_mhz"]
large_ny_airports_freq.head()

0    126.15
1    135.35
2    124.70
3    121.90
4    122.60
Name: frequency_mhz, dtype: float64
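
When only the frequencies are needed, filtering on a list of values with `isin` is a lighter alternative to a full merge. This is a sketch on small stand-in frames; it keeps every row of `airport_freq` whose `airport_ident` appears among the large New York airports.

```python
import pandas as pd

# Stand-ins for the two frames used above
large_ny_airports = pd.DataFrame({'ident': ['KBUF', 'KJFK']})
airport_freq = pd.DataFrame({
    'airport_ident': ['00CA', 'KBUF', 'KBUF', 'KJFK', '01FL'],
    'frequency_mhz': [122.9, 126.15, 135.35, 115.1, 122.8]
})

# True for rows whose airport_ident is one of the listed identifiers
mask = airport_freq['airport_ident'].isin(large_ny_airports['ident'])
large_ny_airports_freq = airport_freq.loc[mask, 'frequency_mhz']
print(large_ny_airports_freq.tolist())
```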

#### 4. Grouping

In [64]:
# Calculate the number of large airports for each country
large_airports_filter = airports["type"] == "large_airport"
large_airports = airports[large_airports_filter]
value_counts = large_airports["iso_country"].value_counts()
value_counts

US    170
CN     35
GB     27
RU     19
IT     17
DE     17
TR     14
JP     12
IN     11
BR     11
CA     10
ES     10
FR      9
MX      9
KR      8
ID      8
PL      7
ZA      6
AU      6
NO      6
PH      6
PT      5
IR      5
SA      5
AE      4
SE      4
NG      4
UA      4
TH      4
KZ      3
     ... 
PA      1
RW      1
AZ      1
CL      1
GP      1
BN      1
AM      1
LV      1
RS      1
SV      1
MR      1
MG      1
KW      1
TD      1
UZ      1
BW      1
SD      1
UG      1
ET      1
JM      1
PF      1
AL      1
BF      1
SI      1
UY      1
BD      1
FI      1
ZW      1
AO      1
JO      1
Name: iso_country, Length: 147, dtype: int64

In [61]:
# Find the 5 countries with the largest number of large airports
large_airports["iso_country"].value_counts()[0:5]

US    170
CN     35
GB     27
RU     19
IT     17
Name: iso_country, dtype: int64
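
The same counts can be produced with an explicit `groupby`, which generalizes to other aggregations. The sketch below uses toy data (the country codes and counts are made up for illustration) and `nlargest` to pick the top entries.

```python
import pandas as pd

# Toy stand-in for the airports frame
airports = pd.DataFrame({
    'iso_country': ['US', 'US', 'US', 'CN', 'CN', 'GB', 'US'],
    'type': ['large_airport', 'large_airport', 'large_airport', 'large_airport',
             'large_airport', 'large_airport', 'small_airport']
})

large_airports = airports[airports['type'] == 'large_airport']

# Number of rows in each country group
by_group = large_airports.groupby('iso_country').size()

# value_counts gives the same numbers, sorted from most to least frequent
by_value_counts = large_airports['iso_country'].value_counts()

top2 = by_group.nlargest(2)
print(top2)
```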

#### 5. Merging

In [63]:
# Merge the above result with countries data frame to find the name of the countries
countries = pd.read_csv("countries.csv")
countries.head()

Unnamed: 0,id,code,name,continent,wikipedia_link,keywords
0,302672,AD,Andorra,EU,https://en.wikipedia.org/wiki/Andorra,
1,302618,AE,United Arab Emirates,AS,https://en.wikipedia.org/wiki/United_Arab_Emir...,"UAE,مطارات في الإمارات العربية المتحدة"
2,302619,AF,Afghanistan,AS,https://en.wikipedia.org/wiki/Afghanistan,
3,302722,AG,Antigua and Barbuda,,https://en.wikipedia.org/wiki/Antigua_and_Barbuda,
4,302723,AI,Anguilla,,https://en.wikipedia.org/wiki/Anguilla,


In [69]:
value_counts = pd.DataFrame(value_counts)
value_counts.head()

Unnamed: 0,iso_country
US,170
CN,35
GB,27
RU,19
IT,17


In [82]:
ans = pd.merge(countries, value_counts, left_on="code", right_index=True)
ans = ans.sort_values(by="iso_country", ascending=False)
ans.head()

Unnamed: 0,id,code,name,continent,wikipedia_link,keywords,iso_country
228,302755,US,United States,,https://en.wikipedia.org/wiki/United_States,America,170
45,302627,CN,China,AS,https://en.wikipedia.org/wiki/China,中国的机场,35
74,302688,GB,United Kingdom,EU,https://en.wikipedia.org/wiki/United_Kingdom,Great Britain,27
187,302714,RU,Russia,EU,https://en.wikipedia.org/wiki/Russia,"Soviet, Sovietskaya, Sovetskaya, Аэропорты России",19
106,302697,IT,Italy,EU,https://en.wikipedia.org/wiki/Italy,Aeroporti d'Italia,17


In [86]:
ans.rename(columns = {'iso_country':'num_large_airports'}, inplace = True)
ans.head(15)

Unnamed: 0,id,code,name,continent,wikipedia_link,keywords,num_large_airports
228,302755,US,United States,,https://en.wikipedia.org/wiki/United_States,America,170
45,302627,CN,China,AS,https://en.wikipedia.org/wiki/China,中国的机场,35
74,302688,GB,United Kingdom,EU,https://en.wikipedia.org/wiki/United_Kingdom,Great Britain,27
187,302714,RU,Russia,EU,https://en.wikipedia.org/wiki/Russia,"Soviet, Sovietskaya, Sovetskaya, Аэропорты России",19
106,302697,IT,Italy,EU,https://en.wikipedia.org/wiki/Italy,Aeroporti d'Italia,17
54,302681,DE,Germany,EU,https://en.wikipedia.org/wiki/Germany,Flughäfen in Deutschland,17
220,302667,TR,Turkey,AS,https://en.wikipedia.org/wiki/Turkey,Türkiye havaalanları,14
110,302639,JP,Japan,AS,https://en.wikipedia.org/wiki/Japan,"Nippon, 日本の空港",12
101,302634,IN,India,AS,https://en.wikipedia.org/wiki/India,,11
29,302791,BR,Brazil,SA,https://en.wikipedia.org/wiki/Brazil,"Brasil, Brasilian",11
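
The name of the count column produced by wrapping `value_counts()` in `pd.DataFrame` may depend on the pandas version. One way to avoid relying on it, sketched here with toy data, is to name the key and the count column explicitly before merging; `num_large_airports` matches the name used above.

```python
import pandas as pd

# Toy stand-ins for the two frames
large_airports = pd.DataFrame({'iso_country': ['US', 'US', 'CN', 'GB', 'US']})
countries = pd.DataFrame({
    'code': ['AD', 'CN', 'GB', 'US'],
    'name': ['Andorra', 'China', 'United Kingdom', 'United States']
})

# Turn the counts into a two-column frame with explicit column names
counts = (large_airports['iso_country']
          .value_counts()
          .rename_axis('code')
          .reset_index(name='num_large_airports'))

# Inner merge on the shared 'code' column, largest counts first
ans = (countries.merge(counts, on='code')
       .sort_values(by='num_large_airports', ascending=False))
print(ans)
```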


In [67]:
# Append the full country name to each airport.
countries_lite = countries.loc[:, ["code", "name"]]
countries_lite.rename(columns = {'name':'country_name'}, inplace = True) 
pd.merge(countries_lite, airports, left_on = "code", right_on = "iso_country")


Unnamed: 0,code,country_name,id,ident,type,name,latitude_deg,longitude_deg,elevation_ft,continent,iso_country,iso_region,municipality,scheduled_service,gps_code,iata_code,local_code,home_link,wikipedia_link,keywords
0,AD,Andorra,318235,AD-0001,heliport,Camí Heliport,42.546257,1.519160,,EU,AD,AD-04,La Massana,no,,,,,,
1,AD,Andorra,41841,AD-ALV,heliport,Andorra la Vella Heliport,42.511174,1.533551,3450.0,EU,AD,AD-07,Andorra La Vella,no,,ALV,,,http://pl.wikipedia.org/wiki/Heliport_Andorra_...,
2,AE,United Arab Emirates,44426,AE-0002,heliport,Burj al Arab Resort Helipad,25.141327,55.185496,689.0,AS,AE,AE-DU,Dubai,no,,,,http://www.jumeirah.com/en/Hotels-and-Resorts/...,,
3,AE,United Arab Emirates,300977,AE-0003,small_airport,Dubai Skydive,25.089874,55.136626,,AS,AE,AE-DU,,no,,,,http://www.skydivedubai.ae/facilities/index.htm,,
4,AE,United Arab Emirates,307257,AE-0004,heliport,Sheikh Sultan Bin Khalifa bin Zayed Al Nahyan ...,25.122566,55.174681,,AS,AE,AE-DU,,no,,,,,,
5,AE,United Arab Emirates,307258,AE-0005,heliport,Kempinski Emirates Palace Twin Heliport,24.462268,54.320590,,AS,AE,AE-UQ,,no,,,,,https://en.wikipedia.org/wiki/Emirates_Palace,
6,AE,United Arab Emirates,313546,AE-0006,seaplane_base,Jebel Ali Seaplane Base,24.988967,55.023796,0.0,AS,AE,AE-DU,Jebel Ali,no,,DJH,,,,
7,AE,United Arab Emirates,313547,AE-0007,heliport,Al Ghuwaifat Border Post helipad,24.120421,51.600595,,AS,AE,AE-DU,,no,,,,,,
8,AE,United Arab Emirates,313548,AE-0008,heliport,Al Ghuwaifat Customs Post helipad,24.128339,51.616767,,AS,AE,AE-DU,,no,,,,,,
9,AE,United Arab Emirates,315508,AE-0009,heliport,Delma Hospital Helipad,24.475600,52.310100,17.0,AS,AE,AE-AZ,Delma Island,no,,,,,,
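
A region name could be attached with the same pattern by chaining a second merge against `regions`. The sketch below uses small stand-in frames; `how='left'` is an assumption made so that airports without a matching region are kept with a missing `region_name` instead of being dropped.

```python
import pandas as pd

# Stand-ins for the three frames
airports = pd.DataFrame({
    'ident': ['00A', 'KJFK', 'AD-ALV'],
    'name': ['Total Rf Heliport', 'John F Kennedy International Airport',
             'Andorra la Vella Heliport'],
    'iso_country': ['US', 'US', 'AD'],
    'iso_region': ['US-PA', 'US-NY', 'AD-07']
})
countries = pd.DataFrame({'code': ['US', 'AD'],
                          'name': ['United States', 'Andorra']})
regions = pd.DataFrame({'code': ['US-NY', 'US-PA'],
                        'name': ['New York', 'Pennsylvania']})

# Rename so the key columns match and the name columns do not collide
countries_lite = countries[['code', 'name']].rename(
    columns={'code': 'iso_country', 'name': 'country_name'})
regions_lite = regions[['code', 'name']].rename(
    columns={'code': 'iso_region', 'name': 'region_name'})

enriched = (airports
            .merge(countries_lite, on='iso_country', how='left')
            .merge(regions_lite, on='iso_region', how='left'))
print(enriched[['ident', 'country_name', 'region_name']])
```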
