In [None]:
library(dplyr)
library(plotly)

In [None]:
df <- read.csv("../input/2008.csv.bz2")

In [None]:
head(df)

**1. Find the top-10 carriers in terms of the number of completed flights (_UniqueCarrier_ column).**

**Which of the carriers listed below is _not_ in your top-10 list?**
- DL
- AA
- OO
- **EV**

In [None]:
# Count only completed (non-cancelled) flights per carrier.
df_n <- table(df$UniqueCarrier[df$Cancelled == 0])
sort(df_n, decreasing=TRUE)[1:10]

**2. Plot distributions of flight cancellation reasons (_CancellationCode_).**

**What is the most frequent reason for flight cancellation? (Use this [link](https://www.transtats.bts.gov/Fields.asp?Table_ID=236) to translate codes into reasons)**
- carrier
- **weather conditions** 
- National Air System
- security reasons

In [None]:
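# Count codes for cancelled flights only, so the empty code of completed flights is excluded.
df_n <- table(as.character(df$CancellationCode[df$Cancelled == 1]))
sort(df_n, decreasing=TRUE)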
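
The question asks for a plot, while the cell above prints a table. A minimal sketch of the plotting step with base R `barplot` follows. The `codes` vector is made-up toy data, not values from `2008.csv`, and it assumes that non-cancelled flights carry an empty code. The code-to-reason mapping is the one used in question 7 of this notebook.

```r
# Toy cancellation codes (illustrative only); "" marks flights that were not cancelled.
codes <- c("A", "B", "B", "", "C", "B", "A", "", "D", "B")

# Drop the empty code so non-cancelled flights do not dominate the counts.
cancelled <- codes[codes != ""]

# Translate codes into readable reasons (same mapping as in question 7).
reasons <- c(A = "carrier", B = "weather conditions",
             C = "National Air System", D = "security reasons")
reason_counts <- sort(table(reasons[cancelled]), decreasing = TRUE)

# One bar per reason, most frequent first.
barplot(reason_counts, las = 2, ylab = "Cancelled flights")
```

On the real data, `codes` would be replaced by `df$CancellationCode`.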

**3. Which route is the most frequent, in terms of the number of flights?**

(Take a look at _'Origin'_ and _'Dest'_ features. Consider _A->B_ and _B->A_ directions as _different_ routes) 

 - New-York – Washington
 - **San-Francisco – Los-Angeles**
 - San-Jose – Dallas
 - New-York – San-Francisco

In [None]:
df_n <- df %>% group_by(Origin, Dest) %>% summarise(count = n()) %>% arrange(desc(count))
df_n[1,]
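
Treating _A->B_ and _B->A_ as different routes means the order of origin and destination has to be part of the grouping key. A small base R sketch on toy data shows the idea. The airport pairs and counts are illustrative, not results from the dataset.

```r
# Toy flights (illustrative airport pairs, not counts from the dataset).
flights <- data.frame(
    Origin = c("SFO", "LAX", "SFO", "SFO", "LAX"),
    Dest   = c("LAX", "SFO", "LAX", "LAX", "SFO"),
    stringsAsFactors = FALSE
)

# Paste origin and destination in order, so direction is part of the key.
route <- paste(flights$Origin, flights$Dest, sep = "->")

# Count flights per directed route, most frequent first.
route_counts <- sort(table(route), decreasing = TRUE)
route_counts
```

Grouping by both `Origin` and `Dest`, as the cell above does, has the same effect without building a string key.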

**4. Find top-5 delayed routes (count how many times they were delayed on departure). From all flights on these 5 routes, count all flights with weather conditions contributing to a delay.**

- 449 
- 539 
- 549 
- **668** 

In [None]:
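# Rank routes by the number of delayed departures, then sum the weather-delayed flights on the top 5.
df %>% filter(DepDelay > 0) %>% group_by(Origin, Dest) %>%
    summarize(count = n(), w = sum(WeatherDelay > 0, na.rm = TRUE)) %>%
    ungroup() %>% arrange(desc(count)) %>% .[1:5, 'w'] %>% sum()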
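
The computation has three steps: keep delayed departures, rank routes by how many there are, and count weather-affected flights on the top routes. The sketch below walks through them on made-up data, keeping the top 2 routes instead of 5. Like the cell above, it counts weather delays among the delayed flights only, which is one reading of the question.

```r
# Toy data (made-up values): one row per flight.
flights <- data.frame(
    Origin       = c("A", "A", "A", "B", "B", "C", "C", "C"),
    Dest         = c("B", "B", "B", "A", "A", "A", "A", "A"),
    DepDelay     = c(10,   5,  -2,  30,  NA,  15,   0,  20),
    WeatherDelay = c( 0,  12,  NA,   7,  NA,  NA,   0,   3),
    stringsAsFactors = FALSE
)

# Step 1: keep flights that left late (rows with a missing DepDelay are dropped).
delayed <- flights[!is.na(flights$DepDelay) & flights$DepDelay > 0, ]
delayed$route <- paste(delayed$Origin, delayed$Dest, sep = "->")

# Step 2: rank routes by number of delayed departures and keep the top 2.
top_routes <- names(sort(table(delayed$route), decreasing = TRUE))[1:2]

# Step 3: among delayed flights on those routes, count positive weather delays.
on_top <- delayed[delayed$route %in% top_routes, ]
weather_count <- sum(on_top$WeatherDelay > 0, na.rm = TRUE)
weather_count
```

Passing `na.rm = TRUE` matters because `WeatherDelay` can be missing, and a single `NA` would otherwise turn the whole sum into `NA`.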

**5. Examine the hourly distribution of departure times. For that, create a new series from DepTime, removing missing values.**

**Choose all correct statements:**
 - Flights are normally distributed within time interval [0-23] (Search for: Normal distribution, bell curve).
 - Flights are uniformly distributed within time interval [0-23].
 - In the period from 0 am to 4 am there are considerably fewer flights than from 7 pm to 8 pm.

In [None]:

df_n <- df[!is.na(df$DepTime), c('DepTime','Year')]
# Extract the hour with vectorized arithmetic; %% 24 maps 2400 to hour 0.
df_n$DepTime <- (df_n$DepTime %/% 100) %% 24
graph <- df_n %>% group_by(DepTime) %>% summarise(count = n()) %>% arrange(desc(count))
plot_ly(graph, x = ~DepTime, y = ~count, type = "bar")
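
The hour extraction relies on `DepTime` being an integer in hhmm form, which is what the integer division by 100 in the cell above implies. A short sketch on toy values shows the decoding and how to keep hours with no flights in the counts. The input values are illustrative.

```r
# Toy departure times in hhmm form (illustrative), with one missing value.
dep_time <- c(5, 630, 1530, 1959, 2400, NA)

# Remove missing values, then take the hour; 2400 wraps around to hour 0.
dep_hour <- (dep_time[!is.na(dep_time)] %/% 100) %% 24
dep_hour

# Count flights per hour over the full 0-23 range, keeping empty hours as zeros.
hour_counts <- table(factor(dep_hour, levels = 0:23))
hour_counts
```

Fixing the factor levels to `0:23` keeps the x-axis complete even when some hours have no departures.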

**6. Show how the number of flights changes through time (on a daily/weekly/monthly basis) and interpret the findings.**

**Choose all correct statements:**
- **The number of flights during weekends is lower than during weekdays (working days).**
- The lowest number of flights is on Sunday.
- **There are fewer flights during winter than during summer.**

In [None]:
barplot(table(df['DayOfWeek']))

In [None]:

barplot(table(df['Month']))

**7. Examine the distribution of cancellation reasons over time. Make a bar plot of cancellation reasons aggregated by month.**

**Choose all correct statements:**
- **December has the highest rate of cancellations due to weather.** 
- The highest rate of cancellations in September is due to Security reasons.
- **April's top cancellation reason is carriers.**
- Flight cancellations due to National Air System are more frequent than those due to carriers.

In [None]:

# Map cancellation codes to readable reasons.
reasons <- c(A='carrier', B='weather conditions', C='National Air System', D='security reasons')
# Keep only cancelled flights and count them per reason and month.
temp <- df %>%
    filter(Cancelled == 1) %>%
    count(CancellationCode, Month) %>%
    mutate(CancellationCode = unname(reasons[as.character(CancellationCode)]))
# One bar trace per reason; a month with no cancellations for a reason simply has no bar,
# so no manual zero-filling or positional alignment of the traces is needed.
fig <- plot_ly(temp, x = ~Month, y = ~n, color = ~CancellationCode, type = "bar") %>%
    layout(yaxis = list(title = 'Count'), barmode = 'group')
fig

**8. Which month has the greatest number of cancellations due to Carrier?** 
- May
- January
- September
- **April**

In [None]:

cancel <- filter(df, CancellationCode == 'A')
cancel %>% group_by(Month, CancellationCode) %>% summarize(count = n()) %>% arrange(desc(count)) %>% .[1,]

**9. Identify the carrier with the greatest number of cancellations due to carrier in the corresponding month from the previous question.**

- 9E
- EV
- HA
- **AA**

In [None]:

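# Restrict to April cancellations caused by the carrier (code 'A'), not all April cancellations.
carrier <- filter(df, Month == 4 & CancellationCode == 'A')
carrier %>% group_by(UniqueCarrier) %>% summarize(count = n()) %>% arrange(desc(count)) %>% .[1,]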

**10. Examine median arrival and departure delays (in time) by carrier. Which carrier has the lowest median delay time for both arrivals and departures? Keep only non-negative values of delay times ('ArrDelay', 'DepDelay').
[Boxplots](https://seaborn.pydata.org/generated/seaborn.boxplot.html) can be helpful in this exercise, and it might be a good idea to remove outliers in order to build nice graphs. You can exclude delay time values higher than the corresponding .95 percentile.**

- EV
- OO
- AA
- **AQ**

In [None]:

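# Keep non-negative arrival delays (>= 0, as the question states), then take the median per carrier.
df %>% filter(ArrDelay >= 0) %>% group_by(UniqueCarrier) %>% summarize(n = median(ArrDelay)) %>% arrange(n) %>% .[1, ]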

In [None]:
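# Keep non-negative departure delays (>= 0, as the question states), then take the median per carrier.
df %>% filter(DepDelay >= 0) %>% group_by(UniqueCarrier) %>% summarize(n = median(DepDelay)) %>% arrange(n) %>% .[1, ]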
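
The question also suggests boxplots with values above the .95 percentile removed. The sketch below shows one way to do that in base R on made-up data. The carrier codes `XX` and `YY` are hypothetical, and the single global cutoff is an assumption; a separate cutoff per carrier is an equally plausible reading of "corresponding".

```r
# Toy delays in minutes (made-up values) for two hypothetical carriers.
delays <- data.frame(
    UniqueCarrier = rep(c("XX", "YY"), each = 6),
    ArrDelay      = c(0, 3, 5, 8, 12, 300,   -4, 10, 20, 25, 40, NA),
    stringsAsFactors = FALSE
)

# Keep non-negative delays; this also drops rows with a missing delay.
kept <- delays[!is.na(delays$ArrDelay) & delays$ArrDelay >= 0, ]

# Drop values above the 95th percentile so a few extreme delays do not flatten the plot.
cutoff <- quantile(kept$ArrDelay, 0.95)
trimmed <- kept[kept$ArrDelay <= cutoff, ]

# Median delay per carrier, then a boxplot of the trimmed values.
medians <- tapply(trimmed$ArrDelay, trimmed$UniqueCarrier, median)
medians
boxplot(ArrDelay ~ UniqueCarrier, data = trimmed, ylab = "Arrival delay (min)")
```

The same steps apply to `DepDelay`. Trimming is mainly for readable plots, since medians are already robust to a handful of extreme values.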