# Final Frontier

By Shay Altshue & Yvonne King

Quick Notebook Reference

1. Project Plan
2. Acquire Data
3. Prepare Data
4. Exploration
5. Modeling
6. Conclusions

## Project Plan

**Acquisition, Prep, and Initial Exploration**
- Collect all files
- Create a dataframe using pandas for each file
- Clean and prepare the data to perform aggregations and merge each dataframe together
- Remove/repair erroneous data
- Look at shape of data

**Exploration**
- Answer the following questions:
> some questions

**Main Hypotheses**
- $H_0$
- $H_a$

## Imports

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

#Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

#Hypothesis testing
from math import sqrt
from scipy import stats

import src.wrangle
import src.features
import src.preprocessing

## Wrangle

- The ```wrangle.py``` file contains the functions that load our data and handle the following:
    - Replaces null values with zero
    - Sets the date/time column as the index
    - Creates a unique ID for each space mission by combining the company name with the original index number
    - Renames columns
    - Creates numerical codes for mission_status
    - Creates a country column by extracting the information from the location column
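The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of ```wrangle.py```: the raw column names (`Company Name`, `Datum`, `Rocket`, and so on), the `mission_id` column name, and the `wrangle_sketch` helper are all assumptions made for the example, and the two toy rows are only stand-ins for the real file.

```python
import pandas as pd

# Toy stand-in for the raw file; the column names here are assumptions.
raw = pd.DataFrame({
    "Company Name": ["SpaceX", "CASC"],
    "Location": [
        "LC-39A, Kennedy Space Center, Florida, USA",
        "LC-16, Taiyuan Satellite Launch Center, China",
    ],
    "Datum": ["2017-03-30 22:27:00+00:00", "2019-11-13 06:35:00+00:00"],
    "Rocket": [62.0, None],
    "Status Mission": ["Success", "Success"],
})


def wrangle_sketch(raw):
    """Hypothetical helper mirroring the bullet list above."""
    # Rename columns to snake_case names like those used in this notebook.
    df = raw.rename(columns={
        "Company Name": "company_name",
        "Location": "location",
        "Datum": "date_time",
        "Rocket": "mission_cost",
        "Status Mission": "mission_status",
    })
    # Replace missing costs with zero.
    df["mission_cost"] = df["mission_cost"].fillna(0)
    # Build a unique ID from the company name and the original row index.
    df["mission_id"] = [
        f"{name}_{i}" for i, name in zip(df.index, df["company_name"])
    ]
    # Take the text after the last comma of the location as the country.
    df["country"] = df["location"].str.split(",").str[-1].str.strip()
    # Parse the timestamp, derive the year, and use the timestamp as index.
    df["date_time"] = pd.to_datetime(df["date_time"], utc=True)
    df["year"] = df["date_time"].dt.year
    return df.set_index("date_time")


out = wrangle_sketch(raw)
print(out[["mission_id", "mission_cost", "year", "country"]])
```

The sketch takes the last comma-separated piece of the location as the country. The sample output further down shows normalized names such as "United States of America", so the real function presumably applies an additional mapping that is not reproduced here.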

In [2]:
#bring in complete dataframe
df = src.wrangle.get_space_data()

In [3]:
#take a peek at the data
df.sample(5)

| date_time | company_name | location | rocket_type | rocket_status | mission_cost | mission_status | year | country |
|---|---|---|---|---|---|---|---|---|
| 1986-10-15 09:29:00+00:00 | RVSN USSR | Site 41/1, Plesetsk Cosmodrome, Russia | Molniya-M /Block 2BL \| Cosmos 1785 | retired | 0.0 | Success | 1986 | Russia |
| 2019-11-13 06:35:00+00:00 | CASC | LC-16, Taiyuan Satellite Launch Center, China | Long March 6 \| Ningxia-1 (x5) | active | 0.0 | Success | 2019 | China |
| 2017-03-30 22:27:00+00:00 | SpaceX | LC-39A, Kennedy Space Center, Florida, USA | Falcon 9 Block 3 \| SES-10 | retired | 62.0 | Success | 2017 | United States of America |
| 2006-02-15 22:34:00+00:00 | Sea Launch | LP Odyssey, Kiritimati Launch Area, Pacific Ocean | Zenit-3 SL \| EchoStar-X | active | 0.0 | Success | 2006 | United States of America |
| 1994-04-13 06:04:00+00:00 | Lockheed | SLC-36B, Cape Canaveral AFS, Florida, USA | Atlas I \| GOES-I | retired | 0.0 | Success | 1994 | United States of America |


In [4]:
#look at the shape of the data
df.shape

(4324, 8)

In [5]:
#look at data types and counts
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4324 entries, 2020-08-07 05:12:00+00:00 to 1957-10-04 19:28:00+00:00
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   company_name    4324 non-null   object
 1   location        4324 non-null   object
 2   rocket_type     4324 non-null   object
 3   rocket_status   4324 non-null   object
 4   mission_cost    4324 non-null   object
 5   mission_status  4324 non-null   object
 6   year            4324 non-null   int64 
 7   country         4324 non-null   object
dtypes: int64(1), object(7)
memory usage: 304.0+ KB


In [6]:
#make sure there are no nulls
df.isnull().sum()

company_name      0
location          0
rocket_type       0
rocket_status     0
mission_cost      0
mission_status    0
year              0
country           0
dtype: int64

# Modeling

## Preprocessing the Data

In [11]:
# Split and encode the data
train, test = src.preprocessing.preprocesses_data_for_modeling(df)

In [12]:
print('train size:', train.shape)
print('test size:', test.shape)

train size: (3459, 9)
test size: (865, 9)
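The internals of `preprocesses_data_for_modeling` are not shown in this notebook. As a rough sketch of what a split-and-encode step can look like, the example below integer-encodes a categorical column and makes an 80/20 split with scikit-learn. The toy data, the use of category codes, the definition of `mission_result` as 1 for "Success" and 0 otherwise, and the 0.2 test fraction are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same number of rows as the real data (4 * 1081 = 4324).
# The values are made up; only the row count mirrors the notebook.
toy = pd.DataFrame({
    "company_name": ["SpaceX", "CASC", "RVSN USSR", "Lockheed"] * 1081,
    "mission_status": ["Success", "Success", "Success", "Failure"] * 1081,
})

encoded = toy.copy()
# Encode the categorical feature as integer codes (one possible encoding).
encoded["company_name"] = encoded["company_name"].astype("category").cat.codes
# Binary target: 1 for a successful mission, 0 for anything else (assumed).
encoded["mission_result"] = (encoded["mission_status"] == "Success").astype(int)

# An 80/20 split with a fixed seed so the result is reproducible.
train, test = train_test_split(encoded, test_size=0.2, random_state=123)

print("train size:", train.shape)
print("test size:", test.shape)
```

With 4324 rows, a 0.2 test fraction gives the same 3459 and 865 row counts reported above, which is consistent with (but does not prove) an 80/20 split in the real function.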


In [13]:
X = ['company_name', 'location', 'rocket_type']
y = 'mission_result'

In [14]:
X_train, y_train = train[X], train[y]
X_test, y_test = test[X], test[y]

## Models

## Baseline

In [15]:
df.mission_result.value_counts()

1    3879
0     445
Name: mission_result, dtype: int64
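The counts above imply a simple majority-class baseline: always predict the most common class. A minimal sketch of that arithmetic, using the counts copied from the output above (computed on the full dataset rather than on the training split alone):

```python
# Class counts copied from the value_counts() output: 1 = 3879, 0 = 445.
counts = {1: 3879, 0: 445}

# The baseline always predicts the most frequent class.
majority_class = max(counts, key=counts.get)

# Its accuracy is the share of rows that belong to that class.
baseline_accuracy = counts[majority_class] / sum(counts.values())

print("majority class:", majority_class)
print("baseline accuracy:", round(baseline_accuracy, 4))
```

This works out to roughly 0.897, the accuracy a model has to beat to add anything over always predicting class 1.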

In [16]:
X_train = train[['company_name', 'location', 'rocket_type', 'mission_cost']]
y_train = train.mission_result

X_test = test[['company_name', 'location', 'rocket_type', 'mission_cost']]
y_test = test.mission_result

## Logistic Regression

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
log = LogisticRegression(random_state=123)

In [19]:
log.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=123, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
log.score(X_train, y_train)

0.897080080948251

In [21]:
log.score(X_test, y_test)

0.8971098265895954

## Decision Tree

In [22]:
from sklearn.tree import DecisionTreeClassifier

In [23]:
tree = DecisionTreeClassifier(max_depth=3, random_state=123)

In [24]:
tree.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')

In [25]:
tree.score(X_train, y_train)

0.8993928881179531

In [26]:
tree.score(X_test, y_test)

0.8971098265895954

## Random Forest

In [27]:
from sklearn.ensemble import RandomForestClassifier

In [28]:
forest = RandomForestClassifier(n_estimators=500, max_depth=3, random_state=123)

In [29]:
forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=123,
                       verbose=0, warm_start=False)

In [30]:
forest.score(X_train, y_train)

0.8993928881179531

In [31]:
forest.score(X_test, y_test)

0.8971098265895954

## K Nearest Neighbors

In [32]:
from sklearn.neighbors import KNeighborsClassifier

In [33]:
knn = KNeighborsClassifier(n_neighbors=7)

In [34]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')

In [35]:
knn.score(X_train, y_train)

0.9193408499566349

In [36]:
knn.score(X_test, y_test)

0.8647398843930636
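To put the scores above in context, the sketch below compares each model's reported test accuracy with the majority-class baseline implied by the class counts in the Baseline section. The scores are copied from the cell outputs above; the dictionary keys are labels invented for this example.

```python
# Majority-class baseline from the class counts (3879 ones, 445 zeros).
baseline = 3879 / (3879 + 445)

# Test accuracies copied from the outputs above; the keys are just labels.
test_scores = {
    "logistic_regression": 0.8971098265895954,
    "decision_tree": 0.8971098265895954,
    "random_forest": 0.8971098265895954,
    "knn_7": 0.8647398843930636,
}

# Print each score next to its signed difference from the baseline.
for name, score in test_scores.items():
    print(f"{name:20s} {score:.4f} {score - baseline:+.4f}")
```

The first three models land within a rounding error of the baseline, and K Nearest Neighbors falls below it on the test set despite its higher training score.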