# Final - _Due Friday, October 21_
---
## Note: this is the final. It is _not_ a paired programming assignment. **You must complete this lab _on your own_**. 
---
### We'll be exploring the "Airplane Crashes Since 1908" dataset from [Kaggle](http://www.kaggle.com).
### Full history of airplane crashes throughout the world, from 1908-present.
The dataset was downloaded from https://www.kaggle.com/saurograndi/airplane-crashes-since-1908. 

After loading and cleaning the data:

+ **Hypothesis Testing**: 
    
+ **Linear Regression**: 
    
+ **Time Series**: 

***
## Setup & clean the data
### First, load the packages that will be used in this notebook.

In [2]:
%pylab inline

import pandas as pd
import seaborn as sns
import statsmodels.api as sm

from sqlalchemy import create_engine
from scipy import stats

random.seed(1234)
sns.set(font_scale=1.5)

Populating the interactive namespace from numpy and matplotlib


### Next, load the csv file into a DataFrame and look at the first few lines.

In [116]:
df = pd.read_csv('Airplane_Crashes_and_Fatalities_Since_1908.csv')
df.head()

Unnamed: 0,Date,Time,Location,Operator,Flight #,Route,Type,Registration,cn/In,Aboard,Fatalities,Ground,Summary
0,09/17/1908,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1.0,2.0,1.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,07/12/1912,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,08/06/1913,,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1.0,1.0,0.0,The first fatal airplane accident in Canada oc...
3,09/09/1913,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20.0,14.0,0.0,The airship flew into a thunderstorm and encou...
4,10/17/1913,10:30,"Near Johannisthal, Germany",Military - German Navy,,,Zeppelin L-2 (airship),,,30.0,30.0,0.0,Hydrogen gas which was being vented was sucked...


Here are the variable descriptions (not available on the Kaggle website; these reflect my own understanding of the data):

|Variable|Description|
|:-:|:--|
|**Date**|Date of the accident|
|**Time**|Time of the accident|
|**Location**|Where the accident happened|
|**Operator**|The name of the airline or operator of the aircraft|
|**Flight #**|The airline flight number|
|**Route**|The route the flight was following|
|**Type**|The type of aircraft that had the accident|
|**Registration**|An alphanumeric string to identify the aircraft|
|**cn/ln**|Serial number of the aircraft|
|**Aboard**|Number of people who were on the aircraft|
|**Fatalities**|Number of people aboard who died|
|**Ground**|Number of people killed on the ground|
|**Summary**|A free text field that summarizes and describes the accident|

Note that the field "Registration" should be unique (per NAA regulations), but after running a SQL query, we found duplicated records.
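The SQL query used for the duplicate check is not shown here. As a rough equivalent, a pandas check could look like the sketch below. The sample rows and registration values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the registration values are invented for illustration.
sample = pd.DataFrame({
    'registration': ['N100AA', 'N200BB', 'N100AA', None, 'N300CC'],
    'fatalities': [3.0, 10.0, 1.0, 5.0, 0.0],
})

# value_counts() ignores missing values, so only real registrations are counted.
registration_counts = sample['registration'].value_counts()

# Keep only the registrations that appear more than once.
duplicated_registrations = registration_counts[registration_counts > 1]
print(duplicated_registrations)
```

Missing registrations are left out of the count on purpose, since a missing value is not evidence of a duplicated aircraft.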

We can also modify the column names to get rid of periods and spaces. The column names should also be lowercased for use with SQLite.

In [117]:
# Lowercase the column names and strip periods & spaces from them
df.columns = [col.replace('.', '').replace(' ', '').lower() for col in df.columns]

print(df.columns)

Index(['date', 'time', 'location', 'operator', 'flight#', 'route', 'type',
       'registration', 'cn/in', 'aboard', 'fatalities', 'ground', 'summary'],
      dtype='object')


In [5]:
df.count()

date            5268
time            3049
location        5248
operator        5250
flight#         1069
route           3562
type            5241
registration    4933
cn/in           4040
aboard          5246
fatalities      5256
ground          5246
summary         4878
dtype: int64

In [6]:
df.size

68484

In [7]:
df.describe()



Unnamed: 0,aboard,fatalities,ground
count,5246.0,5256.0,5246.0
mean,27.554518,20.068303,1.608845
std,43.076711,33.199952,53.987827
min,0.0,0.0,0.0
25%,,,
50%,,,
75%,,,
max,644.0,583.0,2750.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
date            5268 non-null object
time            3049 non-null object
location        5248 non-null object
operator        5250 non-null object
flight#         1069 non-null object
route           3562 non-null object
type            5241 non-null object
registration    4933 non-null object
cn/in           4040 non-null object
aboard          5246 non-null float64
fatalities      5256 non-null float64
ground          5246 non-null float64
summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB


We can see that the numeric variables above (aboard, fatalities, ground) have missing data: their non-null counts are below 5268, and the NaN quartiles in `describe()` are a symptom of it. We will therefore create a new dataset without missing values.

### Next, load the data into SQL

First we'll need to create a database, then create a table in our new database.

In [10]:
# Create air_crashes table in air_crashes database (only run this once!)
engine = create_engine('sqlite:///air_crashes.db')
conn = engine.connect()
df.to_sql('air_crashes', conn)
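The "only run this once" caveat exists because `to_sql` raises an error by default when the table already exists. One possible way to make the cell re-runnable is the `if_exists='replace'` option, sketched below with a made-up two-row DataFrame and an in-memory SQLite database (both are assumptions for illustration, not the notebook's actual setup).

```python
import sqlite3

import pandas as pd

# Hypothetical two-row sample standing in for the full dataset.
sample = pd.DataFrame({'date': ['09/17/1908', '07/12/1912'],
                       'fatalities': [1.0, 5.0]})

conn = sqlite3.connect(':memory:')  # in-memory database, discarded on close

# Writing twice is safe: if_exists='replace' drops and recreates the table.
for _ in range(2):
    sample.to_sql('air_crashes', conn, if_exists='replace', index=False)

row_count = pd.read_sql_query(
    'SELECT COUNT(*) AS n FROM air_crashes', conn)['n'].iloc[0]
print(row_count)
conn.close()
```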

In [11]:
%load_ext sql

  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


In [12]:
%sql sqlite:///air_crashes.db

'Connected: None@air_crashes.db'

In [13]:
%%sql
SELECT name FROM sqlite_master WHERE type = "table";

Done.


name
air_crashes


In [118]:
# Reload the data, but skip rows where type, location, aboard, fatalities, ground, date or time is NULL
df_not_null = %sql SELECT * FROM air_crashes WHERE type IS NOT NULL AND location IS NOT NULL AND aboard IS NOT NULL AND fatalities IS NOT NULL AND ground IS NOT NULL AND date IS NOT NULL AND time IS NOT NULL;
df = df_not_null.DataFrame()

Done.
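The same filter can also be expressed directly in pandas. The sketch below uses `dropna(subset=...)` on a small made-up DataFrame; the rows and the shortened list of required columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in some of the required columns.
sample = pd.DataFrame({
    'type': ['Dirigible', None, 'Zeppelin L-1 (airship)'],
    'time': ['06:30', '18:30', None],
    'aboard': [5.0, 20.0, np.nan],
    'summary': [None, 'text', 'text'],
})

required = ['type', 'time', 'aboard']

# Keep rows where every required column is present; other columns may be null.
complete = sample.dropna(subset=required).reset_index(drop=True)
print(complete)
```

Only the first row survives: its missing `summary` does not matter because `summary` is not in the required list.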


In [119]:
# The date column is not in datetime format, so we convert it.
df['date'] = pd.to_datetime(df['date'])

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3042 entries, 0 to 3041
Data columns (total 14 columns):
index           3042 non-null int64
date            3042 non-null datetime64[ns]
time            3042 non-null object
location        3042 non-null object
operator        3039 non-null object
flight#         996 non-null object
route           2566 non-null object
type            3042 non-null object
registration    2946 non-null object
cn/in           2440 non-null object
aboard          3042 non-null float64
fatalities      3042 non-null float64
ground          3042 non-null float64
summary         3026 non-null object
dtypes: datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 332.8+ KB


In [17]:
df.describe()

Unnamed: 0,index,aboard,fatalities,ground
count,3042.0,3042.0,3042.0,3042.0
mean,3032.58021,34.038133,24.760026,2.601907
std,1458.323473,51.708634,40.529903,70.833732
min,0.0,0.0,0.0,0.0
25%,1836.25,5.0,3.0,0.0
50%,3091.5,16.0,10.0,0.0
75%,4377.75,39.0,27.0,0.0
max,5266.0,644.0,583.0,2750.0


In [120]:
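# Derive calendar fields from the date with the vectorized .dt accessor
df.insert(1, 'year', df['date'].dt.year)
df.insert(2, 'month', df['date'].dt.month)
df.insert(3, 'dayofweek', df['date'].dt.dayofweek)  # Monday=0 ... Sunday=6
# The hour is the first two characters of the 'HH:MM' time string
df.insert(4, 'hour', df['time'].str[:2])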
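As a small self-contained illustration of these derived columns, the sketch below builds them for the first two dates in the dataset. Converting the hour to an integer is an extra step assumed here for convenience; the notebook keeps it as a string.

```python
import pandas as pd

# The first two records' date and time, as shown in df.head() above.
sample = pd.DataFrame({'date': ['09/17/1908', '07/12/1912'],
                       'time': ['17:18', '06:30']})
sample['date'] = pd.to_datetime(sample['date'], format='%m/%d/%Y')

sample['year'] = sample['date'].dt.year
sample['month'] = sample['date'].dt.month
sample['dayofweek'] = sample['date'].dt.dayofweek  # Monday=0 ... Sunday=6
sample['hour'] = sample['time'].str[:2].astype(int)  # 'HH:MM' -> HH as int
print(sample)
```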

In [121]:
df.head()

Unnamed: 0,index,year,month,dayofweek,hour,date,time,location,operator,flight#,route,type,registration,cn/in,aboard,fatalities,ground,summary
0,0,1908,9,3,17,1908-09-17,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1.0,2.0,1.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,1,1912,7,4,6,1912-07-12,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,3,1913,9,1,18,1913-09-09,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20.0,14.0,0.0,The airship flew into a thunderstorm and encou...
3,4,1913,10,4,10,1913-10-17,10:30,"Near Johannisthal, Germany",Military - German Navy,,,Zeppelin L-2 (airship),,,30.0,30.0,0.0,Hydrogen gas which was being vented was sucked...
4,5,1915,3,4,1,1915-03-05,01:00,"Tienen, Belgium",Military - German Navy,,,Zeppelin L-8 (airship),,,41.0,21.0,0.0,Crashed into trees while attempting to land af...


## Hypothesis Testing

We will be using a significance level of 0.05 for all tests.

In [122]:
antonov_df = df[df['type'].str[:3] == ('Ant')]

In [123]:
airbus_df = df[df['type'].str[:6] == ('Airbus')]

In [124]:
boeing_df = df[df['type'].str[:6] == ('Boeing')]

#### The aircraft manufacturer Antonov claims the mean number of fatalities per crash of its airplanes is no more than 22 passengers.

We want to know if the mean number of fatalities is more than 22.

$H_0: \mu \leq 22$

$H_a: \mu > 22$

$t_{stat} = \frac{\bar{X} - \mu_0}{s/ \sqrt{n}} $

In [125]:
t_stat_antonov = (antonov_df.fatalities.mean() - 22)/(antonov_df.fatalities.std()/len(antonov_df)**0.5)
t_stat_antonov

1.236639642819578

In [126]:
# Right-tailed test: the p-value is the upper-tail probability, 1 - cdf
p_value_antonov = 1 - stats.t(len(antonov_df)-1).cdf(t_stat_antonov)
p_value_antonov

0.10946419002281327

The p-value is more than 0.05, therefore we fail to reject the null hypothesis. There is not enough evidence to support that the mean number of fatalities in an aircraft manufactured by Antonov is more than 22.

A Type I error would be to claim that Antonov aircraft average more than 22 fatalities when they actually average 22 or fewer.

A Type II error would be to fail to detect that Antonov aircraft average more than 22 fatalities when they actually do.
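To make the direction of the test concrete, here is a minimal sketch of a right-tailed one-sample t-test. The fatality counts are made up for illustration. For $H_a: \mu > \mu_0$ the p-value is the upper-tail area, which `sf` (the survival function, `1 - cdf`) returns directly.

```python
import numpy as np
from scipy import stats

# Hypothetical fatality counts, invented for illustration only.
fatalities = np.array([30.0, 12.0, 45.0, 8.0, 27.0, 33.0, 19.0, 41.0])
mu_0 = 22

n = len(fatalities)
# ddof=1 gives the sample standard deviation, matching pandas' .std() default.
t_stat = (fatalities.mean() - mu_0) / (fatalities.std(ddof=1) / np.sqrt(n))

# Right-tailed alternative: p-value = P(T >= t_stat) with n - 1 degrees of freedom.
p_value = stats.t(n - 1).sf(t_stat)
print(t_stat, p_value)
```

When the t-statistic is positive, this value is half of the two-sided p-value reported by `stats.ttest_1samp`.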

#### The aircraft manufacturer Airbus claims the proportion of people aboard who die in crashes of its airplanes is no more than 20%.

We want to know if the proportion of fatalities is more than 20%.

$H_0: p \leq 20\%$

$H_a: p > 20\%$

In [127]:
percent_fatalities = airbus_df.fatalities/airbus_df.aboard

In [128]:
z_stat_airbus = (percent_fatalities.mean() - 0.20)/((0.20 * (1 - 0.20)/len(percent_fatalities)))**0.5
z_stat_airbus

6.022520746991066

In [129]:
# Right-tailed test: the p-value is the upper-tail probability, 1 - cdf
p_value_airbus = 1 - stats.norm.cdf(z_stat_airbus)
p_value_airbus

8.5860674e-10

The p-value is far less than 0.05, therefore we reject the null hypothesis. There is enough evidence to support that the percentage of fatalities in an aircraft manufactured by Airbus is more than 20%.

A Type I error would be to claim that Airbus aircraft have more than 20% fatalities when they actually have 20% or less.

A Type II error would be to fail to detect that Airbus aircraft have more than 20% fatalities when they actually do.
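The sketch below repeats the z-test on made-up per-crash fatality proportions, to show how the tail is chosen. For $H_a: p > p_0$ a large positive z-statistic must yield a small p-value, so the upper tail (`stats.norm.sf`) is the one to use. The proportions are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical per-crash fatality proportions (fatalities / aboard).
proportions = np.array([1.0, 0.0, 0.5, 1.0, 0.25, 1.0, 0.8, 0.1, 1.0, 0.6])
p_0 = 0.20

n = len(proportions)
# Same statistic as in the notebook: (mean - p_0) / sqrt(p_0 * (1 - p_0) / n)
z_stat = (proportions.mean() - p_0) / np.sqrt(p_0 * (1 - p_0) / n)

# Right-tailed alternative: p-value = P(Z >= z_stat).
p_value = stats.norm.sf(z_stat)
print(z_stat, p_value)
```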

#### There's no difference in the mean number of fatalities between Airbus and Boeing aircraft.

We want to know if the means are different.

$H_0: \mu_1 = \mu_2 \text{ or } \mu_1 - \mu_2 = 0$  

$H_a: \mu_1 \neq \mu_2 \text{ or } \mu_1 - \mu_2 \neq 0$

In [130]:
# Arguments: mean, standard deviation and sample size for each group
stats.ttest_ind_from_stats(
    airbus_df.fatalities.mean(), airbus_df.fatalities.std(), airbus_df.fatalities.count(),
    boeing_df.fatalities.mean(), boeing_df.fatalities.std(), boeing_df.fatalities.count())

Ttest_indResult(statistic=2.2072303579014232, pvalue=0.028014126502163825)

The p-value is less than 0.05, therefore we reject the null hypothesis. There is enough evidence to conclude that the average fatalities for Airbus and Boeing are different.
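By default `stats.ttest_ind_from_stats` pools the two variances, which assumes both groups have the same spread. When the sample sizes and standard deviations differ a lot, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on made-up summary statistics; the numbers are hypothetical and not taken from the dataset.

```python
from scipy import stats

# Hypothetical summary statistics: mean, standard deviation, sample size.
mean_a, std_a, n_a = 60.0, 80.0, 40
mean_b, std_b, n_b = 35.0, 55.0, 400

# Pooled-variance (Student) t-test, the default.
pooled = stats.ttest_ind_from_stats(mean_a, std_a, n_a, mean_b, std_b, n_b)

# Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind_from_stats(mean_a, std_a, n_a, mean_b, std_b, n_b,
                                   equal_var=False)
print(pooled)
print(welch)
```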

### Are the fatalities independent of aircraft type Antonov, Airbus and Boeing?

$H_0$: fatalities are independent of aircraft type (Antonov, Airbus or Boeing).

$H_a$: fatalities are dependent on aircraft type (Antonov, Airbus or Boeing).

In [141]:
accidents_df = pd.DataFrame(index=['Fatalities'], columns=['Antonov','Airbus','Boeing'])
accidents_df.head()

Unnamed: 0,Antonov,Airbus,Boeing
Fatalities,,,


In [142]:
accidents_df['Antonov'] = [antonov_df.fatalities.sum()]
accidents_df['Airbus'] = [airbus_df.fatalities.sum()]
accidents_df['Boeing'] = [boeing_df.fatalities.sum()]

In [143]:
accidents_df

Unnamed: 0,Antonov,Airbus,Boeing
Fatalities,2645.0,2971.0,17459.0


In [144]:
# stats.chisquare needs a NumPy array or list, so convert the DataFrame row first.
# (.iloc replaces the .ix indexer, which has been removed from pandas.)
fatalities_total_array = np.asarray(accidents_df.iloc[0, :])
fatalities_total_array

array([  2645.,   2971.,  17459.])

In [145]:
expected_fatalities = [accidents_df.sum().sum()/3]*3
expected_fatalities

[7691.666666666667, 7691.666666666667, 7691.666666666667]

In [146]:
stats.chisquare(fatalities_total_array, expected_fatalities)

Power_divergenceResult(statistic=18611.61499458288, pvalue=0.0)

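The p-value is less than 0.05, therefore we reject the null hypothesis that aircraft fatalities are independent of the aircraft type (Antonov, Airbus, Boeing). Note that `stats.chisquare` with equal expected counts is a goodness-of-fit test: it shows that total fatalities are not evenly split across the three manufacturers, without accounting for how many aircraft or flights each manufacturer has.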
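A test of independence in the stricter sense needs a two-way table, for example fatalities versus survivors for each manufacturer. The sketch below uses `stats.chi2_contingency` on made-up counts; the numbers are hypothetical and only illustrate the shape of the analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: one row per manufacturer, columns = (fatalities, survivors).
observed = np.array([
    [120, 80],
    [150, 150],
    [400, 600],
])

# Returns the statistic, the p-value, the degrees of freedom and the
# expected counts under independence.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)
```

Here the expected counts come from the row and column totals rather than from an equal split, so manufacturers with more people aboard are not penalized for having larger totals.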