# Design One
##### The goal of this design is to analyse and visualise the flight network of airlines, airports and routes using the OpenFlights database. The design will focus on understanding the relationships between different airports, airlines, and routes and their geographical distribution.

#### Data Abstraction:
The design will use five databases provided by OpenFlights:
- Airport database - Information about over 10,000 airports, train stations, and ferry terminals around the world, including their unique identifier, name, city, country, IATA and ICAO codes, latitude and longitude, altitude, timezone, daylight saving time, and type.
- Airline database - Information about 5,888 airlines, including their unique identifier, name, alias, IATA and ICAO codes, callsign, country, and active status.
- Route database - Information about 67,663 routes between 3,321 airports, including the airline, source airport, destination airport, and number of stops.
- Plane database - Information about 173 passenger aircraft and their IATA and ICAO codes.
- Countries database - A list of ISO 3166-1 country codes, which can be used to look up the human-readable country names for the codes used in the Airline and Airport tables.


#### Task Abstraction:

The design will perform the following tasks:
- Data Cleaning and Preprocessing - The data from the OpenFlights database will be preprocessed to drop columns that are not needed, remove rows with missing values, and keep the data consistent.
- Network Analysis - The design will analyse the relationships between the airports, airlines, and routes and create a network graph to visualise the connections. The design will also calculate network metrics such as the number of nodes, the number of edges, and the degree of each node.
- Geographical Distribution - The design will visualise the geographical distribution of the airports and routes on a map. The latitude and longitude of the airports will be used to plot their locations on the map.
- Airline Analysis - The design will analyse the airlines and their connections to the airports and routes. The design will visualise the distribution of airlines across different countries and their number of routes.
- Route Analysis - The design will analyse the routes and their connections to the airlines and airports. The design will visualise the distribution of routes between different countries and the number of stops in each route.
- Plane Analysis - The design will analyse the types of planes used by the airlines and their distribution. The design will also visualise the relationship between different types of planes and the airlines that operate them.
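
As a rough illustration of the network metrics named in the task list, the sketch below treats airports as nodes and routes as directed edges. The five airport pairs are the first rows of `routes.dat` as printed later in this notebook; the variable names are illustrative, and the full analysis may well use a graph library instead.

```python
from collections import Counter

# (source airport, destination airport) pairs from the first rows of routes.dat
route_pairs = [
    ('AER', 'KZN'),
    ('ASF', 'KZN'),
    ('ASF', 'MRV'),
    ('CEK', 'KZN'),
    ('CEK', 'OVB'),
]

# Each distinct pair is one directed edge; every airport that appears is a node
edges = set(route_pairs)
nodes = {airport for pair in edges for airport in pair}

# Degree here counts both incoming and outgoing routes of an airport
degree = Counter()
for source, destination in edges:
    degree[source] += 1
    degree[destination] += 1

print('nodes:', len(nodes))
print('edges:', len(edges))
print('degree of KZN:', degree['KZN'])
```

On this small sample, KZN has the highest degree because three of the five routes end there.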

# Data Cleaning and Preprocessing

## airlines.dat

In [19]:
import pandas as pd
airlines = pd.read_csv('airlines.dat', delimiter=',', names=['airline_id', 'name', 'alias', 'iata', 'icao', 'callsign', 'country', 'active'])
print(airlines.head())
print(airlines.isna().sum())

   airline_id                                          name alias iata icao  \
0          -1                                       Unknown    \N    -  NaN   
1           1                                Private flight    \N    -  NaN   
2           2                                   135 Airways    \N  NaN  GNL   
3           3                                 1Time Airline    \N   1T  RNX   
4           4  2 Sqn No 1 Elementary Flying Training School    \N  NaN  WYT   

  callsign         country active  
0       \N              \N      Y  
1      NaN             NaN      Y  
2  GENERAL   United States      N  
3  NEXTIME    South Africa      Y  
4      NaN  United Kingdom      N  
airline_id       0
name             0
alias          506
iata          4627
icao            87
callsign       808
country         15
active           0
dtype: int64


In [20]:
# Drop unwanted columns
airlines.drop(['alias', 'iata', 'callsign'], axis=1, inplace=True)

# Remove rows with missing data in icao and country columns
airlines.dropna(subset=['icao', 'country'], inplace=True)


In [21]:
print(airlines.head())

   airline_id                                          name icao  \
2           2                                   135 Airways  GNL   
3           3                                 1Time Airline  RNX   
4           4  2 Sqn No 1 Elementary Flying Training School  WYT   
5           5                               213 Flight Unit  TFU   
6           6                 223 Flight Unit State Airline  CHD   

          country active  
2   United States      N  
3    South Africa      Y  
4  United Kingdom      N  
5          Russia      N  
6          Russia      N  
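
The first output above shows `\N` appearing as a literal string in the `alias`, `callsign`, and `country` columns, so `isna()` and `dropna()` do not count those cells as missing. One possible way to handle this, sketched on two sample rows modelled on that output (not on the full file), is to pass the placeholder to `read_csv` through `na_values`:

```python
import io
import pandas as pd

AIRLINE_COLUMNS = ['airline_id', 'name', 'alias', 'iata', 'icao',
                   'callsign', 'country', 'active']

# Two illustrative rows modelled on the printed output (assumed raw format)
sample = io.StringIO(
    '-1,"Unknown",\\N,"-","N/A",\\N,\\N,"Y"\n'
    '3,"1Time Airline",\\N,"1T","RNX","NEXTIME","South Africa","Y"\n'
)

# Treat the \N placeholder as a missing value while reading
airlines_sample = pd.read_csv(sample, names=AIRLINE_COLUMNS, na_values=['\\N'])

missing_per_column = airlines_sample.isna().sum()
print(missing_per_column)
```

With this option, a later `dropna(subset=['icao', 'country'])` would also remove rows whose country is `\N`.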


## airports.dat

In [22]:
airports = pd.read_csv('airports.dat', delimiter=',', names=['Airport ID', 'Name', 'City', 'Country', 'IATA', 'ICAO', 'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST', 'Tz', 'Type'])
print(airports.head())
print(airports.isna().sum())


                                    Airport ID          Name  \
1                               Goroka Airport        Goroka   
2                               Madang Airport        Madang   
3                 Mount Hagen Kagamuga Airport   Mount Hagen   
4                               Nadzab Airport        Nadzab   
5  Port Moresby Jacksons International Airport  Port Moresby   

               City Country  IATA      ICAO    Latitude  Longitude Altitude  \
1  Papua New Guinea     GKA  AYGA -6.081690  145.391998       5282       10   
2  Papua New Guinea     MAG  AYMD -5.207080  145.789001         20       10   
3  Papua New Guinea     HGU  AYMH -5.826790  144.296005       5388       10   
4  Papua New Guinea     LAE  AYNZ -6.569803  146.725977        239       10   
5  Papua New Guinea     POM  AYPY -9.443380  147.220001        146       10   

  Timezone                   DST       Tz         Type  
1        U  Pacific/Port_Moresby  airport  OurAirports  
2        U  Pacific/Port_M

In [23]:
# Drop unwanted columns
airports.drop('Name', axis=1, inplace=True)


In [24]:
print(airports.head())


                                    Airport ID              City Country  \
1                               Goroka Airport  Papua New Guinea     GKA   
2                               Madang Airport  Papua New Guinea     MAG   
3                 Mount Hagen Kagamuga Airport  Papua New Guinea     HGU   
4                               Nadzab Airport  Papua New Guinea     LAE   
5  Port Moresby Jacksons International Airport  Papua New Guinea     POM   

   IATA      ICAO    Latitude  Longitude Altitude Timezone  \
1  AYGA -6.081690  145.391998       5282       10        U   
2  AYMD -5.207080  145.789001         20       10        U   
3  AYMH -5.826790  144.296005       5388       10        U   
4  AYNZ -6.569803  146.725977        239       10        U   
5  AYPY -9.443380  147.220001        146       10        U   

                    DST       Tz         Type  
1  Pacific/Port_Moresby  airport  OurAirports  
2  Pacific/Port_Moresby  airport  OurAirports  
3  Pacific/Port_Moresby  a
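
The printed output suggests that every column is shifted by one position: `Airport ID` holds the airport name, `Country` holds the IATA code, and `Type` holds `OurAirports`. This is what pandas does when a row has one more field than the list passed to `names`, because the first field is then used as the index. A minimal sketch of a read with a fourteenth column name follows; the name `Source` is an assumption, and the single sample row is rebuilt from the values printed above.

```python
import io
import pandas as pd

# Fourteen names; 'Source' is an assumed label for the last field
AIRPORT_COLUMNS = ['Airport ID', 'Name', 'City', 'Country', 'IATA', 'ICAO',
                   'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST',
                   'Tz', 'Type', 'Source']

# One illustrative row rebuilt from the printed output (assumed raw format)
sample = io.StringIO(
    '1,"Goroka Airport","Goroka","Papua New Guinea","GKA","AYGA",'
    '-6.081690,145.391998,5282,10,"U","Pacific/Port_Moresby",'
    '"airport","OurAirports"\n'
)

airports_sample = pd.read_csv(sample, names=AIRPORT_COLUMNS)
first = airports_sample.iloc[0]
print(first[['Airport ID', 'Name', 'City', 'Latitude', 'Longitude']])
```

With the columns aligned this way, `Latitude` and `Longitude` hold the coordinates needed for plotting, and dropping `Name` removes the airport name rather than the city.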

## routes.dat


In [25]:
routes = pd.read_csv('routes.dat', delimiter=',', names=['Airline', 'Airline ID', 'Source airport', 'Source airport ID', 'Destination airport', 'Destination airport ID', 'Codeshare', 'Stops', 'Equipment'])
print(routes.head())
print(routes.isna().sum())


  Airline Airline ID Source airport Source airport ID Destination airport  \
0      2B        410            AER              2965                 KZN   
1      2B        410            ASF              2966                 KZN   
2      2B        410            ASF              2966                 MRV   
3      2B        410            CEK              2968                 KZN   
4      2B        410            CEK              2968                 OVB   

  Destination airport ID Codeshare  Stops Equipment  
0                   2990       NaN      0       CR2  
1                   2990       NaN      0       CR2  
2                   2962       NaN      0       CR2  
3                   2990       NaN      0       CR2  
4                   4078       NaN      0       CR2  
Airline                       0
Airline ID                    0
Source airport                0
Source airport ID             0
Destination airport           0
Destination airport ID        0
Codeshare            

In [26]:
# Drop unwanted columns
routes = routes.drop('Codeshare', axis=1)

# Remove rows with missing data in Equipment column
routes.dropna(subset=['Equipment'], inplace=True)

In [27]:
print(routes.head())


  Airline Airline ID Source airport Source airport ID Destination airport  \
0      2B        410            AER              2965                 KZN   
1      2B        410            ASF              2966                 KZN   
2      2B        410            ASF              2966                 MRV   
3      2B        410            CEK              2968                 KZN   
4      2B        410            CEK              2968                 OVB   

  Destination airport ID  Stops Equipment  
0                   2990      0       CR2  
1                   2990      0       CR2  
2                   2962      0       CR2  
3                   2990      0       CR2  
4                   4078      0       CR2  
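
The Route Analysis task needs numeric IDs and a count of stops per route. If the ID columns contain the `\N` placeholder seen in `airlines.dat`, pandas reads them as strings. The sketch below, on a small sample in which the last row is made up, shows one way to convert such a column and to summarise stops and routes per airline.

```python
import pandas as pd

# Illustrative rows in the shape of the cleaned routes table;
# the 'XX' row is hypothetical and only there to show the \N case
routes_sample = pd.DataFrame({
    'Airline': ['2B', '2B', 'XX'],
    'Airline ID': ['410', '410', '\\N'],
    'Stops': [0, 0, 1],
})

# Convert IDs to numbers; non-numeric entries such as \N become NaN
routes_sample['Airline ID'] = pd.to_numeric(routes_sample['Airline ID'],
                                            errors='coerce')

# Distribution of stops, and number of routes per airline code
stops_distribution = routes_sample['Stops'].value_counts()
routes_per_airline = routes_sample['Airline'].value_counts()

print(stops_distribution)
print(routes_per_airline)
```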


## countries.dat



In [28]:
countries = pd.read_csv('countries.dat', delimiter=',', names=['name', 'iso_code', 'dafif_code'])
print(countries.head())
print(countries.isna().sum())


                                name iso_code dafif_code
0  Bonaire, Saint Eustatius and Saba       BQ        NaN
1                              Aruba       AW         AA
2                Antigua and Barbuda       AG         AC
3               United Arab Emirates       AE         AE
4                        Afghanistan       AF         AF
name          0
iso_code      1
dafif_code    1
dtype: int64


In [29]:
# Remove rows with missing data in iso_code and dafif_code columns
countries.dropna(subset=['iso_code','dafif_code'], inplace=True)


In [30]:
print(countries.head())


                   name iso_code dafif_code
1                 Aruba       AW         AA
2   Antigua and Barbuda       AG         AC
3  United Arab Emirates       AE         AE
4           Afghanistan       AF         AF
5               Algeria       DZ         AG
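
One thing worth checking before dropping rows here: pandas treats the string `NA` as a missing value by default, and `NA` is also the ISO 3166-1 code for Namibia. If the single missing `iso_code` reported above is that row, the `dropna` call removes a valid country. The sketch below uses two illustrative rows (the Namibia line is an assumption about the file's contents) to compare the default parse with one that only treats `\N` as missing.

```python
import io
import pandas as pd

COUNTRY_COLUMNS = ['name', 'iso_code', 'dafif_code']

# Illustrative rows; the Namibia line is assumed, not copied from the file
text = '"Aruba","AW","AA"\n"Namibia","NA","WA"\n'

# Default parsing: the code "NA" is read as a missing value
default_parse = pd.read_csv(io.StringIO(text), names=COUNTRY_COLUMNS)

# Only \N counts as missing, so "NA" survives as a string
strict_parse = pd.read_csv(io.StringIO(text), names=COUNTRY_COLUMNS,
                           keep_default_na=False, na_values=['\\N'])

print(default_parse['iso_code'].isna().sum())
print(strict_parse['iso_code'].tolist())
```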


## planes.dat

In [31]:
planes = pd.read_csv('planes.dat', delimiter=',', names=['Name', 'IATA code', 'ICAO code'])
print(planes.head())
print(planes.isna().sum())

                                           Name IATA code ICAO code
0                       Aerospatiale (Nord) 262       ND2      N262
1  Aerospatiale (Sud Aviation) Se.210 Caravelle       CRV      S210
2                  Aerospatiale SN.601 Corvette       NDC      S601
3                Aerospatiale/Alenia ATR 42-300       AT4      AT43
4                Aerospatiale/Alenia ATR 42-500       AT5      AT45
Name         0
IATA code    0
ICAO code    0
dtype: int64
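
For the Plane Analysis task, the `Equipment` column of the routes table has to be matched against `IATA code` in the planes table. The sketch below assumes that `Equipment` can hold several space-separated codes per route. The two plane rows come from the output above, while the airline codes and routes are hypothetical.

```python
import pandas as pd

# Two plane rows taken from the printed planes output
planes_sample = pd.DataFrame({
    'Name': ['Aerospatiale/Alenia ATR 42-300', 'Aerospatiale/Alenia ATR 42-500'],
    'IATA code': ['AT4', 'AT5'],
})

# Hypothetical routes; 'AA' and 'BB' are placeholder airline codes
routes_sample = pd.DataFrame({
    'Airline': ['AA', 'BB'],
    'Equipment': ['AT4 AT5', 'AT5'],
})

# Split multi-code cells and give each plane code its own row
equipment = routes_sample.assign(
    Equipment=routes_sample['Equipment'].str.split()
).explode('Equipment')

# Attach the plane name to each (airline, plane code) pair
planes_per_airline = equipment.merge(
    planes_sample, left_on='Equipment', right_on='IATA code', how='left'
)

# Count how many routes of each airline use each plane type
usage = planes_per_airline.groupby(['Airline', 'Name']).size()
print(usage)
```

A left merge keeps routes whose equipment code has no match in the planes table, which makes unmatched codes easy to spot as missing names.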


# Visualization step

In [38]:
import altair as alt

# Keep at most 5,000 rows per table (Altair's default row limit)
airports = airports.head(5000)
routes = routes.head(5000)

# Plot one circle per airport, positioned by its coordinates
chart = alt.Chart(airports).mark_circle().encode(
    longitude='Longitude',
    latitude='Latitude',
    color=alt.Color('Country', legend=None),
    # 'Name' was dropped from airports in cell In [23], so it is left out of the tooltip
    tooltip=['City', 'Country']
)

# Ending the cell with the chart object renders it in the notebook;
# this avoids the altair_viewer package, which was not installed
chart