# <center>________________________________________________________________</center>

# <center>LANDING PREDICTION FOR THE SPACEX FALCON 9 ROCKET</center>

# <center>Part 2 - Data Wrangling</center>

# <center>________________________________________________________________</center>

# Introduction
***

In this project, we will predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of \\$62 million, while other providers charge upward of \\$165 million each. Much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch. This information can be useful to a competing company that wants to bid against SpaceX for a rocket launch.

In this part, we will perform data wrangling and define the label that will be used to train supervised models.

In the data set, there are several different cases where the booster did not land successfully. Sometimes a landing was attempted but failed due to an accident. We will therefore convert the outcomes into training labels, where `1` means the booster landed successfully and `0` means it did not.

# Libraries
***

In [None]:
#!pip install pandas
#!pip install numpy

In [1]:
import pandas as pd
import numpy as np

# Data Acquisition
***

We will load the SpaceX dataset that we created in the previous part, which was collected through the SpaceX REST API:

In [2]:
#df=pd.read_csv("falcon9_api.csv") # to load the data from directory
df=pd.read_csv("https://github.com/efeyemez/Portfolio/raw/main/Datasets/SpaceX_Falcon_9/falcon9_api.csv")
print(df.shape)
df.head(10)

(173, 17)


Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,,LEO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B0005,-80.577366,28.561857
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B0007,-80.577366,28.561857
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1.0,False,False,False,,1.0,0.0,B1003,-120.610829,34.632093
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B1004,-80.577366,28.561857
5,6,2014-01-06,Falcon 9,3325.0,GTO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B1005,-80.577366,28.561857
6,7,2014-04-18,Falcon 9,2296.0,ISS,CCSFS SLC 40,True Ocean,1.0,False,False,True,,1.0,0.0,B1006,-80.577366,28.561857
7,8,2014-07-14,Falcon 9,1316.0,LEO,CCSFS SLC 40,True Ocean,1.0,False,False,True,,1.0,0.0,B1007,-80.577366,28.561857
8,9,2014-08-05,Falcon 9,4535.0,GTO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B1008,-80.577366,28.561857
9,10,2014-09-07,Falcon 9,4428.0,GTO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B1011,-80.577366,28.561857


# Missing Values
***

We can see below that some of the rows are missing values in our dataset:

In [3]:
df.isnull().sum()

FlightNumber       0
Date               0
BoosterVersion     0
PayloadMass       24
Orbit              1
LaunchSite         0
Outcome            0
Flights            5
GridFins           5
Reused             5
Legs               5
LandingPad        31
Block              5
ReusedCount        5
Serial             5
Longitude          0
Latitude           0
dtype: int64

Before we can continue we must deal with these missing values. The <code>LandingPad</code> column will retain empty values to represent when landing pads were not used. For the <code>PayloadMass</code> column, we will use the average value of the column to fill the empty records. For the other features, we will remove the records that have empty values:

In [4]:
# Calculate the mean value of the PayloadMass column
nu = round(df["PayloadMass"].mean(), 0)
print(nu)
# Replace the missing values with the column mean
# (assigning the result back avoids a chained in-place update, which recent pandas versions warn about)
df["PayloadMass"] = df["PayloadMass"].fillna(nu)

# Drop the rows that have missing values in the other columns
df.dropna(subset=["Orbit", "Flights", "GridFins", "Reused", "Legs", "Block", "ReusedCount", "Serial"], axis=0, inplace=True)

# Reset the index and Flight Numbers
df.reset_index(drop=True, inplace=True)
df = df.assign(FlightNumber = list(range(1, df.shape[0]+1)))

print(df.shape)
df.head()

8184.0
(167, 17)


Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,8184.0,LEO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B0005,-80.577366,28.561857
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B0007,-80.577366,28.561857
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1.0,False,False,False,,1.0,0.0,B1003,-120.610829,34.632093
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1.0,False,False,False,,1.0,0.0,B1004,-80.577366,28.561857


In [5]:
df.isnull().sum()

FlightNumber       0
Date               0
BoosterVersion     0
PayloadMass        0
Orbit              0
LaunchSite         0
Outcome            0
Flights            0
GridFins           0
Reused             0
Legs               0
LandingPad        26
Block              0
ReusedCount        0
Serial             0
Longitude          0
Latitude           0
dtype: int64

We see that the number of missing values has dropped to zero for every column except <code>LandingPad</code>.
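
The same strategy (fill one numeric column with its mean, drop rows that lack a required feature, and leave one column's gaps in place) can be tried in isolation on a small frame. The following is a minimal sketch; the frame <code>toy</code>, its values, and the pad name <code>pad_a</code> are hypothetical and are not taken from the SpaceX dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the SpaceX dataset
toy = pd.DataFrame({
    "PayloadMass": [500.0, np.nan, 1500.0, 1000.0],
    "Orbit": ["LEO", "GTO", None, "ISS"],
    "LandingPad": [None, "pad_a", None, None],
})

# Fill the numeric gap with the column mean (mean() skips NaN by default)
payload_mean = round(toy["PayloadMass"].mean(), 0)
toy["PayloadMass"] = toy["PayloadMass"].fillna(payload_mean)

# Drop rows that lack a required feature, but leave LandingPad untouched
toy = toy.dropna(subset=["Orbit"]).reset_index(drop=True)

print(payload_mean)
print(toy.isnull().sum())
```

Restricting <code>dropna</code> with <code>subset</code> is what allows the gaps in <code>LandingPad</code> to survive the cleanup.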

# Data Types
***

We identify which columns are numerical and which are categorical:


In [6]:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights           float64
GridFins           object
Reused             object
Legs               object
LandingPad         object
Block             float64
ReusedCount       float64
Serial             object
Longitude         float64
Latitude          float64
dtype: object

We will change some of the datatypes as follows:

In [7]:
# Convert data types to proper format
df[["Flights"]] = df[["Flights"]].astype("int64")

df[["GridFins", "Reused", "Legs"]] = df[["GridFins", "Reused", "Legs"]].astype("bool")

df[["ReusedCount"]] = df[["ReusedCount"]].astype("int64")
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
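
The <code>Date</code> column is still stored as <code>object</code> in the output above. If date arithmetic is needed in a later step, one option (not part of the original workflow, so treat it as an assumption) is to parse the strings with <code>pd.to_datetime</code>. A minimal sketch, using three launch dates from the table shown earlier:

```python
import pandas as pd

# Three launch dates from the table shown earlier
dates = pd.Series(["2010-06-04", "2012-05-22", "2013-03-01"], name="Date")

# Parse ISO-formatted strings into datetime values
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Once parsed, the <code>.dt</code> accessor exposes components such as the launch year.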

# Data Standardization
***

We will check whether the "Orbit" feature needs standardization. The values "SSO" and "SO" both mean "Sun-synchronous orbit", and sometimes both are used in the same dataset. We can check that:

In [8]:
df["Orbit"].value_counts().keys()

Index(['VLEO', 'ISS', 'GTO', 'LEO', 'PO', 'SSO', 'MEO', 'GEO', 'TLI', 'ES-L1',
       'HEO', 'SO'],
      dtype='object')

We see that there are records for both values. We will standardize it and use SSO for both cases:

In [9]:
# Assign the result back instead of using a chained in-place replace
df["Orbit"] = df["Orbit"].replace("SO", "SSO")
df["Orbit"].value_counts().keys()

Index(['VLEO', 'ISS', 'GTO', 'LEO', 'PO', 'SSO', 'MEO', 'GEO', 'TLI', 'ES-L1',
       'HEO'],
      dtype='object')
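
If more aliases turn up later, the same standardization can be driven by a dictionary. This is a minimal sketch on a hypothetical sample of orbit codes; the names <code>orbits</code> and <code>orbit_aliases</code> are introduced here only for illustration:

```python
import pandas as pd

# Hypothetical sample of orbit codes
orbits = pd.Series(["SSO", "SO", "LEO", "SO", "GTO"])

# Map each alias to its canonical code; values not listed pass through unchanged
orbit_aliases = {"SO": "SSO"}
standardized = orbits.replace(orbit_aliases)

print(standardized.value_counts())
```

Keeping the aliases in one dictionary means a new variant only requires one more entry.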

# Creating the Label "Class"
***

From the <code>Outcome</code> column, we will count how often each landing outcome occurs and store the result in <code>landing_outcomes</code>:

In [10]:
landing_outcomes = df["Outcome"].value_counts()

print(landing_outcomes)

True ASDS      109
True RTLS       22
None None       19
False ASDS       7
True Ocean       5
False Ocean      2
None ASDS        2
False RTLS       1
Name: Outcome, dtype: int64


<code>True Ocean</code> means the booster landed successfully in a specific region of the ocean, while <code>False Ocean</code> means the ocean landing was unsuccessful. <code>True RTLS</code> means the booster landed successfully on a ground pad, while <code>False RTLS</code> means the ground-pad landing was unsuccessful. <code>True ASDS</code> means the booster landed successfully on a drone ship, while <code>False ASDS</code> means the drone-ship landing was unsuccessful. <code>None ASDS</code> and <code>None None</code> represent a failure to land.

In [11]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 True RTLS
2 None None
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


We create a set of outcomes where the first stage did not land successfully:


In [12]:
bad_outcomes = set(['None None', 'False ASDS', 'False Ocean', 'None ASDS', 'False RTLS'])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}

Using the <code>Outcome</code> column, we will create a list in which each element is 0 if the corresponding row in <code>Outcome</code> is in the set <code>bad_outcomes</code>, and 1 otherwise. We will assign it to the variable <code>landing_class</code>:

In [13]:
landing_class = []
for outcome in df["Outcome"]:
    # landing_class = 0 if the outcome is in bad_outcomes
    if outcome in bad_outcomes:
        landing_class.append(0)
    # landing_class = 1 otherwise
    else:
        landing_class.append(1)

In [14]:
print(len(landing_class))
print(landing_class)

167
[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


This variable is the classification label for each launch: 0 means the first stage did not land successfully, and 1 means the first stage landed successfully.

In [15]:
df['Class']=landing_class
df.head(5)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,Falcon 9,8184.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,Falcon 9,525.0,LEO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,Falcon 9,677.0,ISS,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCSFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857,0
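
The label can also be derived without an explicit loop. The sketch below is an alternative to the cell above, not the approach used in this notebook; it runs on a small sample of outcome strings, and the name <code>landing_class_vectorized</code> is introduced only for illustration:

```python
import pandas as pd

# A small sample of outcome strings of the kind listed above
outcomes = pd.Series(["None None", "True ASDS", "False Ocean", "True RTLS", "None ASDS"])

bad_outcomes = {"None None", "False ASDS", "False Ocean", "None ASDS", "False RTLS"}

# isin() flags the bad outcomes; negating and casting to int gives 1 for a successful landing
landing_class_vectorized = (~outcomes.isin(bad_outcomes)).astype(int)

print(landing_class_vectorized.tolist())
```

On the full dataset, the same expression applied to <code>df["Outcome"]</code> should produce the same values as the loop.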


We can use the following line of code to determine the overall success rate:

In [16]:
# Select the column as a Series so that mean() returns a single number
print("Success rate: %.3f%%" % (100 * df["Class"].mean()))

Success rate: 81.437%
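
This figure can be cross-checked by hand against the outcome counts printed earlier. The sketch below copies those counts into a dictionary (the name <code>outcome_counts</code> is introduced here for illustration) and treats every outcome that starts with <code>True</code> as a success:

```python
# Outcome counts copied from the value_counts() output earlier in this notebook
outcome_counts = {
    "True ASDS": 109, "True RTLS": 22, "None None": 19, "False ASDS": 7,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
}

# Successful landings are the outcomes that start with "True"
successes = sum(n for outcome, n in outcome_counts.items() if outcome.startswith("True"))
total = sum(outcome_counts.values())

print(successes, total)
print("%.3f%%" % (100 * successes / total))
```

The counts give 136 successful landings out of 167 launches, which matches the rate computed from the <code>Class</code> column.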


We can now export our dataset into a CSV file:

In [17]:
df.to_csv('falcon9_wrangled.csv', index=False)

# <center>________________________________________________________________</center>