# COGS 108 - Final Project 

# Overview

Our team, MARCK, is working with datasets of weather (temperature, humidity, air pressure, months, major cities) and flights (passenger traffic and date).  Based on current knowledge, we assume that the most measures of successful trip include pleasant weather.  We know that these aspects of weather fluctuate over the year, with some seasons not being optimal for travel for certain cities.

# Names

- Robert Ball
- Adam Kabbara 
- Karen Thai
- Ching-Han Tu
- Marjorie Tolentino

# Group Members IDs

- A12727981
- A14723936
- A15503591
- A14528237
- A13649532

# Research Question

To what extent does bad weather impact flight bookings? Specifically, we want to identify a potential threshold of some factor(s), such as temperature, humidity, natural disasters, etc. in common travel destinations that causes deceleration or negative growth of flights booked to that destination. We believe that by looking at flights booked over the course of 10 years to major American cities, namley Los Angeles, Seattle, Chicago, New York City and Orlando and by comparing them to changes in such as temperature, humidity, and air pressure, we can identify at what point of factors of weather may reliably make a significant impact on booked flights.

## Background and Prior Work

Vacation is fun -- when weather permits. Throughout the year, there are peak and off-peak time for traveling. People would comment: “ I don’t want to go to New York now because it is too cold” or “I don’t want to go to Seattle because it’s going to rain.” However, these places are still remaining as popular locations to travel. Hence, we are interested in learning more about what meteorological factors play into booking flights.

There are many reasons why people decide not to travel. Whether it is the fact the flights get canceled/delayed too often at that time of year or that their destination is not suited with the weather they desire. The Bureau of Transportation Statistics reports the reason why flights get canceled. It is important to understand the reason why flights cancel for a few reasons. We will be able to determine that the weather at the destination is not that great. This will be able to tell us whether there were fewer travelers to that destination using another dataset alongside the dataset with the one that has the information about the flight cancellations. Extreme Weather is defined

Business Insider’s “Why cold winter weather cancels roughly 60,000 flights a year in the US” also explores this idea from the context of what happens on the ground.  Although planes can physically fly well in cold weather, there is the issue of uneven runways or visibility issues.  Iced passenger airplanes take turns at defrosting stations, which can cause delays.  The combination of bad road conditions that have caused passengers to miss their flights and delays causing passengers to miss connecting flights lead to a domino effect of cancellations as airports do not find it beneficial to fly with few passengers.
	
The Federal Aviation Administration addressed several questions regarding weather delays. It noted that weather is the largest delay in the National Air System and that New York and San Francisco airports tend to have the most weather-related delays.  However, air traffic delays differ over the course of the year depending on the type of weather. During winter, surface winds and low ceiling and visibility are the main causes of weather delays, while in the summer the main causes are convective weather and low ceiling and visibility. It also explained that when flights encounter thunderstorms, aircrafts may have to divert to other airports, which causes large passenger delay and high costs.  It would be more desirable to predict the weather before a flight’s departure in order to determine if it should be cancelled or not for the airport.

References (include links):
- Causes of flight delays and cancellations with one cause is weather 

https://www.bts.gov/topics/airlines-and-airports/understanding-reporting-causes-flight-delays-and-cancellations

Describes 50% flight were delayed due to weather (Extreme weather and NAS weather)

- Cold weather leads to the cancellation of 60,000 Flights per year

https://www.businessinsider.com/flight-cancellation-cold-weather-storm-blizzard-closing-airports-2019-2

Cold weather can inhibit the plane from take-off when there ice on the plane or on the runway

- Weather is the main cause of flight delays

https://www.faa.gov/nextgen/programs/weather/faq/

Shows the different types of weather conditions and how many delays they cause in winter and summer

# Hypothesis


We hypothesized that as the weather gets more unfriendly, which is when the temperature highly increases and decreases, precipitation increases, or when the atmospheric pressure decreases, the booked flights will decrease accordingly. This is because an unfriendly weather would affect the schedule negatively. Unbearable high temperature and low temperature would decrease tourists’ will for visiting that location. Rainy days would reduce the available events and locations to visit.  Moreover, atmospheric pressure can lead to cloudiness, wind, and precipitation, when it’s low, which can decrease tourists’ willingness of visiting the location. In contrast, high pressure can bring calm weather, which is favorable when touring a city. Hence, the more the weather is friendly, the more the flights will be booked. 


# Dataset(s)


Dataset Name: Welcome Aboard : USA Airport Dataset EDA

Link to the dataset: https://www.kaggle.com/flashgordon/welcome-aboard-usa-airport-dataset-eda/data

Number of observations: 39,674,833

Description: This dataset contains over 3.6 million flights and information about their origin city/airport/population, destination city/airport/population, number of passengers, seats, and distance of flight. It also contains the flight dates but only the month and year.  It is restricted to only flights coming to and from cities in the United States. 


Dataset Name: 'Timeanddate.com' <b>This was created with  Beautiful Soup webscraping</b>

Link to the dataset: Timeanddate.com (Note: an example link from this website would be https://www.timeanddate.com/weather/usa/los-angeles/historic?month=9&year=2009, but we used information from over one thousand links) 

Number of observations: 2,240 

Description: This website has data on past temperature, humidity, and pressure for hundreds of cities around the world. We used BeautifulSoup to webscrape this website for this data and build a table of weather conditions for the average Temperature, Humidity, and Air Pressure each month (we had to retain weather conditions to average month data because the exact dates of flights was not listed in the previous dataset)

For the five citites of observation, Los Angeles, Seattle, Chicago, New York City, and Orlando, we used these two datasets to compare monthly weather conditions with their corresponding flight data in order to see what relationships exist between the available variabes.

# Setup

In [11]:
# Imports used the the project

# Display plots directly in the notebook instead of in a new window
%matplotlib inline

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import patsy
import statsmodels.api as sm
from scipy.stats import pearsonr, norm, ttest_ind
import requests
import bs4
from bs4 import BeautifulSoup

In [12]:
# Configure libraries
# The seaborn library makes plots look nicer
sns.set()
sns.set_context('talk')

# Don't display too many rows/cols of DataFrames
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8

# Round decimals when displaying DataFrames
pd.set_option('precision', 2)

Loading the airport data

In [13]:
df_airports = pd.read_csv("Airports.csv")

In [14]:
df_airports.head()

Unnamed: 0,Origin_airport,Destination_airport,Origin_city,Destination_city,...,Distance,Fly_date,Origin_population,Destination_population
0,MHK,AMW,"Manhattan, KS","Ames, IA",...,254,200810,122049,86219
1,EUG,RDM,"Eugene, OR","Bend, OR",...,103,199011,284093,76034
2,EUG,RDM,"Eugene, OR","Bend, OR",...,103,199012,284093,76034
3,EUG,RDM,"Eugene, OR","Bend, OR",...,103,199010,284093,76034
4,MFR,RDM,"Medford, OR","Bend, OR",...,156,199002,147300,76034


Webscraping the weather data from Timeanddate.com 

In [16]:
#creating lists to iterate through for weather years and months
years = list(range(2010,2019))
months = list(range(1,13))

#creating 'temprorary' lists for Temperature, Humidity, and Pressure for creating the weather tables (by city)
lst_T = []
lst_H = []
lst_P = []

In [17]:
#this function takes in a url (that should follow the format of the example link given above from Timeanddate.com) 
#It appends the 3 weather conditions to the above lists and returns a dataframe for the weather conditions 
def WeatherAppend(site):
    page = requests.get(site)
    soup = BeautifulSoup(page.content)
    avg_table = soup.find('div',{'class':'eight columns'})
    td_avg_table = avg_table.find_all("td")
    T = td_avg_table[6]
    H = td_avg_table[7]
    P = td_avg_table[8]
    lst_T.append(T.string)
    lst_H.append(H.string)
    lst_P.append(P.string)
    data = {'Temperature':lst_T, "Humidity":lst_H, "Pressure":lst_P}
    city = pd.DataFrame(data)
    return city

In [18]:
#example of webscraping, iterating through the month and year ranges starting with January 2010 in Los Angeles
site = 'https://www.timeanddate.com/weather/usa/los-angeles/historic?month=1&year=2010'    
for ind in years:
    site = site.replace(site[site.find("year=")+5:],str(years[ind-2011]),1)
    for i in months:
        losangeles = WeatherAppend(site.replace(site[site.find("=")+1:site.find("&")],str(months[i-1]),1))  

In [20]:
losangeles

Unnamed: 0,Temperature,Humidity,Pressure
0,60 °F,59%,"30.07 ""Hg"
1,58 °F,53%,"30.04 ""Hg"
2,59 °F,71%,"30.03 ""Hg"
...,...,...,...
105,72 °F,56%,"29.89 ""Hg"
106,65 °F,63%,"30.01 ""Hg"
107,59 °F,48%,"30.08 ""Hg"


Looks great, but wait, a problem with the data!

Unfortunately, we had not realized until later exploring the airport data that the flight dates, which are represetned in the dataset as YearMonth (e.g. 200910), only go to the end of 2009. We could not find any other datasets that had the airport data we needed for 2010-2019. Timeanddate.com only has past weather data that goes back to September of 2009. Futher research showed that weather data traditionally was obtained by www.wunderground.com, until access was discontinued (https://apicommunity.wunderground.com/weatherapi/topics/end-of-service-for-the-weather-underground-api). We tried to run similar code to iterate through their website for past weather data but found that it would be impossible or illegal to do so because the html restricted weather data from webscraping.

Because we did not want to risk any legal reprecussions in obtaining weather data, we decided it would be best to use the overlap our datasets did include, despite how small. As a result, we could only work with four months of data, September-December of 2009. This made it very difficult to create a holistic answer to our hypothesis. However, across the five cities, there were still 9,505 rows of flight data that fell into this small date range we could work with.

# Data Cleaning

Describe your data cleaning steps here.

In [4]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

All the data is publically available. The datasets regarding air and passenger traffic has anonymized the data and removed any personally identifying information.  However, our study reflects trends that can influence plane ticket purchases.  A possible unintended consequence of our work is that travelers may be able to see what months have lower/higher traffic in number of flights to a certain destination, and this could affect their plans in terms of when to go.  For example, seeing a decrease in traffic to a city one passenger could infer lower hotel rates due to lower demand or low amount of travelers.  Due to the change of plans, this could have an economic impact in the area for any local business hoping to attract travelers (rental car service, restaurants, tourist activities, etc.)  There could also be unintentional harm that affects the traveler’s experience, for there would be a tradeoff between pleasing travel weather or conditions of less traffic (less busy areas, waiting in lines, parking etc.)


Possible biases in our data come from our team’s perception of “popular, major” cities based on our pre-existing knowledge.  However, these places are not completely representative of major cities throughout the U.S. (variety in climate, compactness, landmarks to visit).

The data is limited so it should not influence people’s travelling decision that much. The data for flights only provide the month and year. As we are limited to the month and the year instead of the exact date, it decreases the accuracy of the analysis by reducing the data of changes we can use. In this case, based on the result of our analysis, we can see that there was overall a small correlation and should not be too significantly used by these parties (airlines, tourist agencies, customers, etc.) because of the risk if our data were to be taken as fact.


# Conclusion & Discussion

In conclusion, the result does not match with our hypothesis. We hypothesized that as the weather gets more unfriendly, which is when the temperature highly increases and decreases, precipitation increases, or when the atmospheric pressure decreases, the booked flights will decrease accordingly. However, the result of analysis shows that 