<a href="https://colab.research.google.com/github/Applied-Machine-Learning-2022/project-1-data-aeds-uark/blob/ahmed-2/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2019 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Exploration


For this project you will be given a dataset and an associated problem. Over the course of the day, you will explore the dataset and train the best model you can in order to solve the problem. At the end of the day, you will give a short presentation about your model and solution.

### Deliverables

1. A **copy of this Colab notebook** containing your code and responses to the ethical considerations below.
1. At the end of the day, we will ask you and your group to stand in front of the class and give a brief **presentation about what you have done**. 

## Team

Please enter your team members' names in the placeholders in this text area:

*   Ahmed Moustafa
*   Ellion Dison
*   Devin Hill
*   Santiago Dorado



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing US airline on-time statistics and delay data](https://www.kaggle.com/giovamata/airlinedelaycauses) from the [US Department of Transportation's Bureau of Transportation Statistics (BTS)](https://www.bts.gov/). In this project, we will **use flight statistics data to gain insights into US airports' and airlines' flights in 2008.**

You are free to use any toolkit we've covered in class to solve the problem (e.g. Pandas, Matplotlib, Seaborn).

Demonstrations of competency:
1. Get the data into a Python object.
1. Inspect the data for each column's data type and summary statistics.
1. Explore the data programmatically and visually.
1. Produce an answer and visualization, where applicable, for at least three questions from the list below, and discuss any relevant insights. Feel free to generate and answer some of your own questions. 

  * Which U.S. airport is the busiest airport? You can decide how you'd like to measure "busyness" (e.g., annually, monthly, daily).
  * Of the 2008 flights that are *actually delayed*, think about:
    * Which 10 U.S. airlines have the most delays?
    * Which 10 U.S. airlines have the longest average delay time?
    * Which 10 U.S. airports have the most delays?
    * Which 10 U.S. airports have the longest average delay time?
  * More analysis:
    * Are there patterns on how flight delays are distributed across different hours of the day?
    * How about across months or seasons? Can you think of any reasons for these seasonal delays?
    * If you look at average delay time or number of delays by airport, does the data show linearity? Does any subset of the data show linearity?
    * Add reason for delay to your delay analysis above.
    * Examine flight frequencies, delays, time of day or year, etc. for a specific airport, airline or origin-arrival airport pair.

### Student Solution



1.  **Get the data into a Python object**



In [None]:
# Getting Airline Data and Inspecting - Ahmed

# Imports
import pandas as pd
import zipfile
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Download the dataset from Kaggle and store it as a DataFrame
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'
! kaggle datasets download giovamata/airlinedelaycauses --force

with zipfile.ZipFile('airlinedelaycauses.zip','r') as z:
  z.extractall('./')

df = pd.read_csv('DelayedFlights.csv')

2. **Inspect the data for each column's data type and summary statistics**

In [None]:
# Printing the data types of our columns and getting statistics for all columns - Ahmed
print(df.dtypes, end='\n----------------------------')

pd.set_option('display.max_columns', None)
df.describe()

3. **Explore the data programmatically and visually**

In [None]:
# Show a heatmap of the correlation between columns after dropping missing values & constant data - Ahmed
sns.set(rc = {'figure.figsize':(10,8)})
plot = sns.heatmap(df.dropna().drop(['Unnamed: 0','Year', 'Cancelled', 'Diverted'], axis=1).corr(numeric_only=True), cmap='coolwarm').set(title='Delayed Flights Correlation')

The heatmap above shows the correlation between the numeric columns in our DataFrame that carry significance, i.e. are not constant (for example, *Year* is 2008 throughout the dataset).

From the heatmap, we can see that there is a fairly strong positive correlation between Departure Delay and Arrival Delay, which makes sense, as a plane that departs late is more likely to arrive at its destination late.
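
To make the correlation claim concrete, the sketch below computes the Pearson correlation between the two delay columns directly. The small DataFrame is made-up illustrative data, not values from the Kaggle dataset; only the column names `DepDelay` and `ArrDelay` match ours.

```python
import pandas as pd

# Hypothetical toy data (NOT from the real dataset): delays in minutes
toy = pd.DataFrame({
    'DepDelay': [5, 12, 30, 45, 60, 90],
    'ArrDelay': [2, 15, 25, 50, 58, 95],
})

# Pearson correlation between departure delay and arrival delay
dep_arr_corr = toy['DepDelay'].corr(toy['ArrDelay'])

# The same value appears in the full correlation matrix that the heatmap draws
corr_matrix = toy.corr()

print(round(dep_arr_corr, 3))
```

A value close to 1 corresponds to the dark red cell at the intersection of the two columns in the heatmap.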

3. **Explore the data programmatically and visually (continued)**

In [None]:
# What airline has the most unique planes that are delayed flights? - Ahmed

# Create a DataFrame containing the unique Airlines and their respective # of Planes
# (counted as the number of distinct FlightNum values per UniqueCarrier)
flight_counts = df.groupby('UniqueCarrier')['FlightNum'].nunique()
plane_df = pd.DataFrame({'Airlines': flight_counts.index, '# of Planes': flight_counts.values})

# Plot the number of planes associated with each airline on a Bar Chart
colors = ['#0000FF' for _ in range(len(plane_df))]
colors[-1] = '#FF0000'
plt.bar('Airlines', '# of Planes',data=plane_df.sort_values(by='# of Planes', ascending=True), color=colors)
plt.xlabel('Airlines')
plt.ylabel('# of Planes')
plt.title('# of Planes per Airline')
plt.show()

The bar chart above shows how many unique Flight Numbers each Airline/Unique Carrier has associated with it, which we use as a rough proxy for its number of planes. We determined this by counting the distinct Flight Numbers that appear for each Unique Carrier. Note that a flight number identifies a scheduled flight rather than a physical aircraft, and the same flight number can be associated with multiple Airlines: the sum of all of our bars is larger than the number of unique Flight Numbers in the dataset.

We can discern that the Airline with the most unique planes that appear in the delayed dataset is WN (Southwest Airlines). This plot can be useful in further understanding total delay counts for Airlines and eliminating some biasing factors as we can see there may be a large difference in the # of unique planes that each Airline has access to within this delayed flight dataset.
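
The counting logic, and the reason the bars can sum to more than the number of unique Flight Numbers, can be sketched on a toy table. The rows below are made up for illustration; only the column names `UniqueCarrier` and `FlightNum` come from our dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT from the real dataset)
toy = pd.DataFrame({
    'UniqueCarrier': ['WN', 'WN', 'WN', 'AA', 'AA', 'DL'],
    'FlightNum':     [100,  100,  200,  100,  300,  400],
})

# Distinct flight numbers per carrier; a repeated (carrier, number) pair counts once
per_carrier = toy.groupby('UniqueCarrier')['FlightNum'].nunique()

# Distinct flight numbers across the whole table
overall = toy['FlightNum'].nunique()

print(per_carrier.to_dict())
print(per_carrier.sum(), overall)
```

Here flight number 100 is used by both WN and AA, so the per-carrier counts add up to more than the overall number of distinct flight numbers.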

## Busiest Airport

In [None]:
# Which U.S. airport is the busiest airport? - Ahmed
# You can decide how you'd like to measure "busyness" (e.g., annually, monthly, daily).

## Annually (2008)
# Series containing the # of flights originating from each airport
year_origin = df.groupby('Origin')['Origin'].count()

# Series containing the # of flights entering each airport
year_dest = df.groupby('Dest')['Dest'].count()

# Create a new series of the sorted sum of the previous series
year_flights = year_origin.add(year_dest, fill_value=0).astype('int64').sort_values(ascending=False)

year_flights.head()

Annually, the busiest airport is **ATL**, with 238511 incoming and outgoing flights in our dataset in 2008. We chose to analyze annually because it lets us look at total busyness rather than at seasonal spikes or dips.

Note that we determined this by summing the counts of the Origin and Dest columns of our dataset, which tell us the airport each plane leaves from and goes to. Under our definition of "busyness", a flight contributes to both its Origin and its Destination airport. This counts every flight twice across the dataset, which is logically sound because every delayed flight has 2 different airports associated with it.
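
The origin-plus-destination counting can be sketched on a few made-up flights. The airport pairs below are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy flights (NOT from the real dataset)
toy = pd.DataFrame({
    'Origin': ['ATL', 'ATL', 'ORD', 'DFW'],
    'Dest':   ['ORD', 'DFW', 'ATL', 'XNA'],
})

departures = toy['Origin'].value_counts()
arrivals = toy['Dest'].value_counts()

# fill_value=0 keeps airports that appear in only one of the two columns
traffic = departures.add(arrivals, fill_value=0).astype('int64').sort_values(ascending=False)

print(traffic.to_dict())
```

Because each flight is counted once at its origin and once at its destination, the totals sum to twice the number of rows. Without `fill_value=0`, an airport that only appears as a destination (XNA here) would end up as `NaN`.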

## 2008 Flight Delays

In [None]:
# Which 10 U.S. airlines have the most delays? - Devin

# Create a new DataFrame after dropping missing Delay values
df1 = df.dropna(subset=['ArrDelay'])

# Delete the rows at indices where the Delay is negative or 0 (arrived early) from df1
df1 = df1[df1['ArrDelay'] > 0]

# Create a new Dataframe of the total count of Delays grouping by UniqueCarrier (Airline)
unique_carrier_df = df1.groupby(['UniqueCarrier'], as_index=False)['ArrDelay'].count()[['UniqueCarrier', 'ArrDelay']].sort_values('ArrDelay', ascending=False)

unique_carrier_df.head(10)

In 2008, the following U.S. airlines had the highest number of delays: **WN, AA, MQ, UA, OO, DL, XE, US, CO, and EV**. WN had the most, with **324717**.

In [None]:
# Which 10 U.S. airlines have the longest average delay time? - Santiago

# Create a new DataFrame after dropping missing Delay values
temp_df = df.dropna(subset=['ArrDelay'])

# Create a new Dataframe of the mean of Delays grouping by UniqueCarrier (Airline)
delays_df = temp_df.groupby('UniqueCarrier')[['ArrDelay']].mean().sort_values(by='ArrDelay', ascending = False)

delays_df.head(10)

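The 10 airlines with the longest average arrival delay are: **YV, B6, OH, XE, UA, EV, 9E, AA, OO, and MQ**. Unlike the delay counts above, this average is taken over every flight with a recorded `ArrDelay`, not only those with a positive arrival delay.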
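
How much the choice of rows matters for an average can be sketched as follows: averaging over all recorded arrival delays versus only late arrivals (`ArrDelay > 0`). The numbers are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT from the real dataset); a negative value means an early arrival
toy = pd.DataFrame({
    'UniqueCarrier': ['AA', 'AA', 'AA', 'WN', 'WN', 'WN'],
    'ArrDelay':      [-10,  20,   50,   5,    10,   15],
})

# Mean over every flight with a recorded arrival delay
mean_all = toy.groupby('UniqueCarrier')['ArrDelay'].mean()

# Mean over flights that actually arrived late
late = toy[toy['ArrDelay'] > 0]
mean_late = late.groupby('UniqueCarrier')['ArrDelay'].mean()

print(mean_all.to_dict())
print(mean_late.to_dict())
```

In this toy example the early arrival pulls AA's overall mean down, while WN is unaffected, so rankings can differ depending on which definition is used.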

In [None]:
# Which 10 U.S. airports have the most delays? - Devin

# Create a new DataFrame after dropping missing Delay values
df2 = df.dropna(subset=['ArrDelay'])

# Delete the rows at indices where the Delay is negative or 0 (arrived early or on time) from df2
df2 = df2[df2['ArrDelay'] > 0]

# Create a new Dataframe of the total count of Delays grouping by Origin Airport (as it's related to Arrival Delay)
origin_airport_df = df2.groupby('Origin')['Origin'].count().sort_values(ascending= False)

origin_airport_df.head(10)

In 2008, the top 10 origin airports with the most delays were: **ATL, ORD, DFW, DEN, LAX, IAH, PHX, EWR, LAS, and DTW.**

In [None]:
# Which 10 U.S. airports have the longest average delay time? - Santiago

# Drop rows with a missing arrival delay value
temp_df1 = df.dropna(subset=['ArrDelay'])

# Create a new Dataframe of the mean of Delays grouping by Origin Airport (as it's related to Arrival Delay)
delays_df1 = temp_df1.groupby('Origin')[['ArrDelay']].mean().sort_values(by='ArrDelay', ascending = False)

delays_df1.head(10)

The top 10 airports with the longest average delay time are: **CMX, PLN, SPI, MQT, ALO, MOT, HHH, EGE, LMT, and PUB.**

## More Analysis

In [None]:
# Are there patterns on how flight delays are distributed across different hours of the day? - Ahmed

# Cleaning up data
dfHourlyDelays = df.dropna(subset=['ArrDelay', 'ArrTime'])

# Extract the arrival hour (0-23) from the HHMM-formatted ArrTime; 2400 wraps around to hour 0
arrivalHour = (dfHourlyDelays['ArrTime'] // 100).astype(int) % 24

# Calculates total number of delays per arrival hour, keeping all 24 hours in order
delaysPerHour = dfHourlyDelays.groupby(arrivalHour)[['ArrDelay']].count().reindex(range(24), fill_value=0)

# Create list of hours and plotting the # of Delays
hours = ['12 AM','1 AM','2 AM','3 AM','4 AM','5 AM','6 AM','7 AM','8 AM', '9 AM','10 AM','11 AM',
         '12 PM','1 PM','2 PM','3 PM','4 PM','5 PM','6 PM','7 PM','8 PM', '9 PM','10 PM','11 PM',]
plt.plot(hours, delaysPerHour['ArrDelay'])
plt.xticks(rotation = 45)
plt.title('Number of Delays by Arrival Hour')
plt.ylabel('# of Delays')
plt.xlabel('Arrival Hours')
plt.show()

Based on the line graph above, we can see that there is an almost cyclic pattern in the # of delayed flights by hour of arrival. The **peak** in delayed flights occurs around 8PM (*2000*), whilst the **trough** occurs around 4AM (*0400*). These numbers make sense: the trend matches the congestion of arriving flights, and that congestion might be directly linked to flight delays.

In [None]:
# Are there patterns on how flight delays are distributed across months or seasons? Can you think of any reasons for these seasonal delays? - Ellion

# Cleaning up data
dfMonthlyDelays = df.dropna(subset= ['ArrDelay'])

# Calculates total number of delays per month
delaysPerMonth = dfMonthlyDelays.groupby('Month',as_index=False)[['ArrDelay']].count()

# Bar graph visualizing data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.bar(months, delaysPerMonth['ArrDelay'])
plt.title('Number of Delays Per Month')
plt.ylabel('# of Delays')
plt.xlabel('Months')
plt.show()

**Can you think of any reasons for these seasonal delays?**

The highest number of delays occurs during December. During this month, people are usually traveling for Christmas and students are going home from college. As a result, there is an increased number of flights, which can cause more delays. There is also an increased chance of weather delays due to harsh winter weather. From June to July the delays are higher because those months are known within the airline industry as the busiest travel months of the year, likely due to a lot of people traveling for summer vacations.
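
To look at seasons rather than individual months, the `Month` column can be mapped to a season before counting. This is a minimal sketch on made-up rows; the month-to-season mapping is our own assumption (Northern Hemisphere meteorological seasons) and is not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT from the real dataset): one row per delayed flight
toy = pd.DataFrame({'Month': [12, 12, 1, 2, 6, 6, 7, 4, 10]})

# Assumed mapping of calendar months to Northern Hemisphere meteorological seasons
season_of_month = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Fall', 10: 'Fall', 11: 'Fall'}

toy['Season'] = toy['Month'].map(season_of_month)

# Number of delayed flights per season
delays_per_season = toy['Season'].value_counts()

print(delays_per_season.to_dict())
```

The same `map` plus `value_counts` pattern could be applied to `dfMonthlyDelays` to compare winter and summer totals directly.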

## Exercise 2: Ethical Implications

Even the most basic of data manipulations has the potential to affect segments of the population in different ways. It is important to consider how your code might positively and negatively affect different types of users.

In this section of the project, you will reflect on the ethical implications of your analysis.

### Student Solution

**Positive Impact**

Your analysis is trying to solve a problem. Think about who will benefit if the problem is solved, and write a brief narrative about how the model will help.

> This analysis would benefit people who travel a lot, whether for personal reasons or for business, where timing is more important. It can help target where the delays are happening and why they are happening. With this information, airlines and airports can determine how to reduce these delays so people have a better chance of leaving on time and arriving at their destination on time. It will also help airlines reduce the extra capital costs, reallocation of flight crews and aircraft, and additional crew expenses that result from delays. This data could also be used to predict flight delays in order to improve airline operations and passenger satisfaction, which would have a positive impact on the economy.






**Negative Impact**

Solutions usually don't have a universal benefit. Think about who might be negatively impacted by your analysis. This person or persons might not be directly considered in the analysis, but they might be impacted indirectly.

> Our analysis might hurt airports and airlines economically. Based on our analysis, consumers can see which airlines have the highest average delays, which might become a deterrent for some of them. Airlines and other entities can also see which airports have the highest average delays, which might become a deterrent for future business and flights. Beyond the economic standpoint, some jobs in the airline industry may take a hit from conclusions drawn from this analysis, which may have inherent biases.





**Bias**

Data analysis can be biased for many reasons. The bias can come from the data itself (e.g. sampling, data collection methods, available sources), and from the interpretation of the analysis outcome.

Think of at least two ways that bias might have been introduced to your analysis and explain them below.

> Our Flight Delay dataset **only samples data from 2008**, which makes analyses and conclusions related to annual trends biased. For example, if you were to try to create a model or regression based on this dataset to predict trends in 2023 flight delays, it probably would not be accurate, as the dataset is limited and possibly specific to 2008.

> The dataset **only contains flights that are delayed**, meaning we have no data for non-delayed flights to compare against or to make realistic conclusions about Airlines, Airports, etc. An example of this bias is if someone were to see that the ATL airport has the most delays (120336) and assume that it was the worst airport with regard to delayed flights. The problem here is that they come to this conclusion without taking into account that ATL might have far more flights than other airports, which means it may not be as bad proportionally in terms of delays.

> Another possible bias is that a lot of flights which may be subject to delays are international but this **dataset only has national data**. If we were to come to any conclusions about Airline companies that have international flights, then it's very possible that those conclusions are not accurate for the company as a whole since the Airline's data is limited to the US.
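
The difference between raw delay counts and proportional delay rates, mentioned in the second bias above, can be illustrated with a small calculation. All airports and numbers below are hypothetical; our dataset has no on-time flights, so a `TotalFlights` column like this would have to come from another source.

```python
import pandas as pd

# Hypothetical numbers (NOT from the real dataset, which contains only delayed flights)
airports = pd.DataFrame({
    'Airport':        ['BIG', 'SMALL'],
    'DelayedFlights': [1200,  300],
    'TotalFlights':   [10000, 1000],
}).set_index('Airport')

# Share of each airport's flights that were delayed
airports['DelayRate'] = airports['DelayedFlights'] / airports['TotalFlights']

most_delays = airports['DelayedFlights'].idxmax()
highest_rate = airports['DelayRate'].idxmax()

print(most_delays, highest_rate)
```

In this made-up example the larger airport has the most delays in absolute terms, but the smaller airport delays a larger share of its flights.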




**Changing the Dataset to Mitigate Bias**

The most common way that an analysis is biased is when the dataset itself is biased. Look back at the input data that you used for your analysis. Think about how you might change something about the data to reduce bias in your model.

What changes could you make to make your dataset less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.


> Some changes that we could make to make our dataset less biased:
* We could get more data over the course of multiple years, especially more recent years to make current predictions/models more accurate. This would help eliminate our 2008 sampling bias.
* We could include the non-delayed flights with a column denoting whether flights are delayed or not so we can actually try to find realistic relations between features (such as airlines/airports) and whether a flight is delayed rather than making unsubstantiated conclusions. 
* Include international flights in our dataset to increase the quantity of data for unique carriers so that conclusions and analyses related to their delayed flights are more accurate.



**Changing the Analysis Questions to Mitigate Bias**

Are there any ways to reduce bias by changing the analysis itself? This could include modifying the choice of questions you ask, the approach you take to answer the questions, etc.

Write a brief summary of any changes that you could make to help reduce bias in your analysis.

> When we ask questions related to data analysis, it's important to avoid bias in the questions themselves. An example of a question related to this dataset that might have some inherent bias is "what is the worst airline when it comes to delays?". There are a lot of problems with this question: it is very vague and doesn't specify which metric we should use to reach a conclusion, and it's very possible that an airline had many delayed flights in 2008 due to the recession. This bias in the data, combined with a muddled question, can lead to a biased conclusion about an airline.



> Along the lines of the example above, it's important to state your assumptions for questions where they exist. For reference, we do this in the Busiest Airport question, where we state our definition of "busyness" and the reason why the counts appear as they do. Another crucial part of answering questions in a way that reduces bias is mentioning, throughout the analysis, information that may lead to bias. You can see a brief example of this in the 2nd part of the programmatic and visual exploration of the data, where we show the total unique planes associated with each airline. That information is useful when analyzing delayed flights for a specific airline and drawing conclusions about it, as some airlines have a large number of delayed flights with a smaller number of planes compared to others, which could be a negative sign related to management.







**Mitigating Bias Downstream**

While analysis can point to suggestions, it is people who make decisions based on them. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your analysis to reduce the bias? Describe these below.



> Machines can treat similarly situated people and objects differently. Some algorithms run the risk of replicating and even amplifying human biases, particularly those affecting minority groups. For example, algorithms powering commercially available facial recognition systems have failed to recognize darker-skinned complexions. To mitigate biases like these, there should be rules in place to make sure the selection of training data is transparent. This will help ensure the training data is diverse and includes all types of outcomes or people. Biased models most likely worked only in controlled environments, so machine learning models should be required to simulate real-world application as much as possible. To enforce these rules, a diverse group of people should analyze the data to ensure there isn't an unconscious bias and to make sure the rules in place to mitigate bias are being followed.

> With regards to our analysis, there should be some **public** and **transparent** committee/board in place that can be trusted to approve the validity of the flight data provided as well as prevent conclusions that do not properly reference valid correlations.





