# The Sky Team Project: Analysis

## This notebook explains what this project is, what it means, and what we did with it.

### First, some background on the name...

You might recognize our team name, "Sky Team," as being very similar to that of one of the world's leading airline alliances, SkyTeam. This isn't unintentional. We felt this was a great name for our project, given our mutual love of aviation and our interest in exploring the massive amount of data collected by the San Francisco Airport Commission.

### About that data...

It's located [here](https://catalog.data.gov/dataset/air-traffic-passenger-statistics), and it's from San Francisco International Airport. The City of San Francisco, through its SFO Airport Commission, is responsible for the data we used.

### What's the point?

What we have here is an aviation enthusiast's (or data scientist's) dream. We have over 17,500 rows of data, spanning over 12 years! At SFO alone, the airline industry has changed so much over that period that plenty of trends should be observable in the metrics SFOAC collected in this databank. There are even many airlines on that list that (sadly) no longer grace our skies...the consolidation of the airline industry from 2009 to 2015 has really left its mark.

### Import sqlite3, Pandas, and the ipython-sql (%sql) Jupyter extension...and a few more.

Like all great Jupyter notebooks, our project revolves around several Python modules that were essential for the proper execution of our code. These were sqlite3, pandas, NumPy, Matplotlib, Seaborn, GeoPy, Basemap, and math, along with Matplotlib's rgb2hex, Polygon, ScalarMappable, and ColorbarBase.

In [9]:
# Here are all the modules we used.
import sqlite3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import rgb2hex
from matplotlib.patches import Polygon
from matplotlib.cm import ScalarMappable
from matplotlib.colorbar import ColorbarBase
import matplotlib as mpl
import math

%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### Connecting to the database

In [2]:
%sql sqlite:///airpassenger.db

'Connected: None@airpassenger.db'

## The Terminals

As you might imagine, terminal information is a crucially important part of air travel.

In [3]:
airportterminal_rs = %sql None@airpassenger.db SELECT * FROM AIRPORTDIMENSION;
airportterminal = airportterminal_rs.DataFrame().set_index('AirportID')
airportterminal

Done.


AirportID,Terminal,BoardingArea
1,Terminal 1,B
2,International,G
3,International,A
4,Terminal 3,E
5,Terminal 1,C
6,Terminal 1,A
7,Terminal 3,F
8,Other,Other
9,Terminal 2,D


In [None]:
# I was thinking here would be a good place to show the utilization of terminals over the entire dataset, maybe in a histogram. However, I have had nothing but trouble doing this. I will keep trying in the meantime.

In [None]:
# We could then show the terminal allocation from 2005-2011.

In [None]:
# Then, we could show 2011-2017 terminal allocation.

In [13]:
"Create a histogram that shows the distribution of terminals among flights."
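# Terminals are categories, so a bar chart of per-terminal passenger totals stands in for a histogram.
terminal_rs = %sql None@airpassenger.db SELECT Terminal, SUM(PassengerCount) AS Passengers FROM AIRPORTDIMENSION JOIN PASSENGERFACT ON (AIRPORTDIMENSION.AirportID = PASSENGERFACT.AirportID) GROUP BY Terminal;
terminal_totals = terminal_rs.DataFrame().set_index('Terminal')['Passengers']
plt.bar(terminal_totals.index, terminal_totals.values, alpha=.40, label='Passengers', ec="k")
plt.xlabel('Terminal')
plt.ylabel('Number of Passengers')
plt.title('Distribution of Passengers Among Terminals')

# Add legend
plt.legend()

# Show the figure
plt.show()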

In [4]:
passenger_rs = %sql None@airpassenger.db SELECT * FROM PASSENGERFACT;
passenger = passenger_rs.DataFrame().set_index('EntryID')
passenger

Done.


EntryID,OperatingID,AirportID,GeoID,ActivityID,TimeID,PublishedID,PassengerCount
1,1,1,1,1,1,1,27271
2,1,1,1,2,1,1,29131
3,1,1,1,3,1,1,5415
4,1,1,1,1,2,1,27472
5,1,1,1,2,2,1,26535
6,1,1,1,3,2,1,5712
7,1,1,1,1,3,1,17341
8,1,1,1,2,3,1,18541
9,1,1,1,3,3,1,4412
10,1,1,1,1,5,1,17698


In [None]:
# I was then hoping to create a bar chart showing the distribution of places the flights went to.

In [None]:
# We could then do one that shows what % of the flights were "Low Fare."

In [None]:
# Then after that, a pie chart showing the % breakdown of flights enplaned, in-transit, and deplaned.

In [None]:
# Once we have that, it should be relatively easy to make one for 2005-2011.

In [None]:
# Then, the same for 2011-2017. This will help emphasize the industry shift. I will write descriptions in Markdown for all of them.
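One possible way to split the data into the two periods described above is sketched here. This is only an illustration under stated assumptions: the `Year` column and the sample rows are hypothetical (in the notebook they would come from joining PASSENGERFACT to a time dimension), and 2011 is assigned to the later period so that the two ranges do not overlap.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from the database.
sample = pd.DataFrame({
    "Year": [2006, 2009, 2010, 2011, 2014, 2017],
    "ActivityType": ["Enplaned", "Deplaned", "Enplaned",
                     "Enplaned", "Deplaned", "Enplaned"],
    "PassengerCount": [100, 200, 50, 300, 400, 150],
})

# Assumption: 2011 belongs to the later period.
# Bins are right-inclusive: (2004, 2010] and (2010, 2017].
sample["Period"] = pd.cut(
    sample["Year"],
    bins=[2004, 2010, 2017],
    labels=["2005-2010", "2011-2017"],
).astype(str)

# One row per period, one column per activity type.
period_totals = (
    sample.groupby(["Period", "ActivityType"])["PassengerCount"]
    .sum()
    .unstack(fill_value=0)
)
print(period_totals)
```

Calling `period_totals.plot.bar()` on a table shaped like this would draw the two periods side by side, which may make the industry shift easier to see.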

In [10]:
%%sql
select terminal, sum(passengercount)
from AIRPORTDIMENSION JOIN PASSENGERFACT ON (AIRPORTDIMENSION.AirportID = PASSENGERFACT.AirportID)
group by terminal;

Done.


Terminal,sum(passengercount)
International,1805337136
Other,228
Terminal 1,407671524
Terminal 2,48724642
Terminal 3,559782518
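
The same join-and-aggregate pattern can also be run without the `%sql` magic, using `sqlite3` and `pandas.read_sql_query` directly. The sketch below uses an in-memory database as a stand-in for `airpassenger.db`. The table and column names follow the notebook, but the rows and the `TotalPassengers` alias are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for airpassenger.db; the rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AIRPORTDIMENSION (
    AirportID INTEGER PRIMARY KEY, Terminal TEXT, BoardingArea TEXT);
CREATE TABLE PASSENGERFACT (
    EntryID INTEGER PRIMARY KEY, AirportID INTEGER, PassengerCount INTEGER);
INSERT INTO AIRPORTDIMENSION VALUES
    (1, 'Terminal 1', 'B'), (2, 'International', 'G'), (3, 'International', 'A');
INSERT INTO PASSENGERFACT VALUES
    (1, 1, 100), (2, 2, 250), (3, 3, 50), (4, 1, 25);
""")

# Join each fact row to its terminal, then total the passengers per terminal.
query = """
SELECT a.Terminal, SUM(p.PassengerCount) AS TotalPassengers
FROM AIRPORTDIMENSION AS a
JOIN PASSENGERFACT AS p ON a.AirportID = p.AirportID
GROUP BY a.Terminal
ORDER BY TotalPassengers DESC;
"""
terminal_totals = pd.read_sql_query(query, conn, index_col="Terminal")
conn.close()
print(terminal_totals)
```

Aliasing the aggregate gives the resulting column a plain name, which is easier to reference in later plotting code than `sum(passengercount)`.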


In [17]:
airportterminal_rs = %sql None@airpassenger.db select ActivityType, sum(passengercount) from ACTIVITYDIMENSION JOIN PASSENGERFACT ON (ACTIVITYDIMENSION.ActivityID = PASSENGERFACT.ActivityID) group by ActivityType;
airportterminal = airportterminal_rs.DataFrame()
airportterminal


Done.


Unnamed: 0,ActivityType,sum(passengercount)
0,Deplaned,1412383233
1,Enplaned,1399418473
2,Thru / Transit,9714342
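
The totals above can be turned into percentage shares, which is the input a pie chart of enplaned, deplaned, and in-transit passengers would need. A minimal sketch, with the three totals copied from the output above and the variable names chosen only for illustration:

```python
import pandas as pd

# Totals copied from the query output above.
activity_totals = pd.Series(
    {
        "Deplaned": 1412383233,
        "Enplaned": 1399418473,
        "Thru / Transit": 9714342,
    },
    name="Passengers",
)

# Each activity type's share of all passengers, in percent.
activity_share = (activity_totals / activity_totals.sum() * 100).round(2)
print(activity_share)
```

In the notebook, `activity_share.plot.pie(autopct="%.1f%%")` should be enough to draw the chart, although the transit slice will be very thin.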


In [None]:
%%sql
select ActivityType, sum(passengercount)
from ACTIVITYDIMENSION JOIN PASSENGERFACT ON (ACTIVITYDIMENSION.ActivityID = PASSENGERFACT.ActivityID)
group by ActivityType;

In [None]:
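# Bar chart of total passengers by activity type, using the DataFrame built in In [17].
ax = sns.barplot(data=airportterminal, x='ActivityType', y='sum(passengercount)')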

In [None]:
# Lastly, a histogram of passenger volumes by year would be helpful. Since there are so many years, that should be granular enough.

# Once I know how to do this, it would be good to choose date ranges that separate passenger volumes by season, so we can see the busiest times/years to fly and compare them.

# It would also be good to make a histogram of UNIQUE airlines, so we can see the consolidation that took place with the mergers and eliminated a lot of choice in the market over this timespan.
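
The three ideas above (volume by year, volume by season, and unique airlines per year) could be approached with pandas `groupby`, roughly as sketched below. Everything here is an assumption made for illustration: the `Year`, `Month`, and `Airline` columns, the placeholder airline names, the sample rows, and the choice of meteorological seasons (December through February as winter, and so on).

```python
import pandas as pd

# Hypothetical sample; the real frame would be built from the database.
sample = pd.DataFrame({
    "Year": [2008, 2008, 2008, 2012, 2012, 2012],
    "Month": [1, 7, 7, 1, 7, 10],
    "Airline": ["Airline A", "Airline B", "Airline C",
                "Airline A", "Airline A", "Airline B"],
    "PassengerCount": [100, 300, 200, 150, 500, 250],
})


def month_to_season(month):
    """Map a month number (1-12) to a meteorological season name."""
    seasons = ["Winter", "Spring", "Summer", "Fall"]
    # December wraps around to index 0 so that Dec, Jan, Feb are all Winter.
    return seasons[(month % 12) // 3]


sample["Season"] = sample["Month"].map(month_to_season)

# Total passengers per year.
passengers_by_year = sample.groupby("Year")["PassengerCount"].sum()

# Total passengers per year and season.
passengers_by_season = sample.groupby(["Year", "Season"])["PassengerCount"].sum()

# Number of distinct airlines seen in each year.
airlines_by_year = sample.groupby("Year")["Airline"].nunique()

print(passengers_by_year)
print(passengers_by_season)
print(airlines_by_year)
```

A falling `airlines_by_year` count over time would be one simple way to show the consolidation, and each of these Series can be drawn with `.plot.bar()`.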