# **SpaceX Falcon 9 First Stage Landing Prediction**


 ## Lab 2: Data wrangling 


Estimated time needed: **60** minutes


In this lab, we will perform some **Exploratory Data Analysis** (EDA) to find some patterns in the data and determine what would be the **label for training supervised models**. 

In the data set, there are several different cases where the **booster did not land successfully**. Sometimes a landing was *attempted but failed* due to an accident.

For example $\rightarrow$ the `Outcome` column in the dataset indicates whether the booster landed successfully or not.

|Category|Meaning|
|:---:|:---:|
|`True Ocean`|**Successfully** landed on a specific region of the **ocean**|
|`False Ocean`|**Unsuccessfully** landed on a specific region of the **ocean**|
|`True RTLS`|**Successfully** landed on a **ground pad**|
|`False RTLS`|**Unsuccessfully** landed on a **ground pad**|
|`True ASDS`|**Successfully** landed on a **drone ship**|
|`False ASDS`|**Unsuccessfully** landed on a **drone ship**|
|`None ASDS`|**Failed** to land|
|`None None`|**Failed** to land|

In this lab we will mainly convert those **outcomes into Training Labels** as follows:
|Label|Meaning|
|:---:|:---:|
|`1`|Booster was **successfully** landed|
|`0`|**Failed** to land booster successfully|
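
The conversion from outcome strings to labels can be sketched as follows. This is only a minimal illustration, assuming every `Outcome` value is a string whose first word is `True`, `False`, or `None`; the helper name `outcome_to_label` is hypothetical and is not part of the lab.

```python
def outcome_to_label(outcome: str) -> int:
    """Return 1 for a successful landing, 0 otherwise.

    Assumes the outcome string starts with 'True', 'False', or 'None'.
    """
    return 1 if outcome.split()[0] == "True" else 0


# A few outcome strings of the kind listed above
examples = ["True ASDS", "False Ocean", "None None", "True RTLS"]
labels = [outcome_to_label(o) for o in examples]
print(labels)  # [1, 0, 0, 1]
```

Only outcomes that begin with `True` map to `1`; everything else is treated as a failed landing.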

Example of a **successful landing** $\rightarrow$

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/landing_1.gif)


Example of an **unsuccessful landing** $\rightarrow$

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/crash.gif)


---

## Objectives
1. Perform Exploratory Data Analysis
2. Determine Training Labels

----


## Import Libraries and Define Auxiliary Functions


In [1]:
# for data manipulation and analysis
import pandas as pd
# for numerical computation
import numpy as np

### Data Analysis 


Load the SpaceX dataset from the previous section.


In [36]:
# local file path
local_path = r"D:\IBM Professional Certification\10_Data Science Capstone Project\2_Data wrangling\data\dataset_part_1.csv"

In [37]:
# load dataset as dataframe
df = pd.read_csv(local_path)

In [38]:
# check first 10 rows of dataset
df.head(10)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857
2,3,2013-03-01,Falcon 9,677.0,ISS,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857
5,6,2014-01-06,Falcon 9,3325.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1005,-80.577366,28.561857
6,7,2014-04-18,Falcon 9,2296.0,ISS,CCAFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1006,-80.577366,28.561857
7,8,2014-07-14,Falcon 9,1316.0,LEO,CCAFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1007,-80.577366,28.561857
8,9,2014-08-05,Falcon 9,4535.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1008,-80.577366,28.561857
9,10,2014-09-07,Falcon 9,4428.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1011,-80.577366,28.561857


Identify and calculate the percentage of missing values in each attribute:


In [39]:
# percentage of missing values per column, to decide how to deal with NaN entries
# total number of rows
total_rows = len(df)
# percentage of NaN rows per column/feature
(df.isnull().sum(axis=0) / total_rows) * 100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        28.888889
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64

#### NOTE:
1. `LandingPad` column has $\approx29\%$ missing values
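
The same percentage can also be obtained in one step, because the mean of a boolean mask is the fraction of `True` entries. The sketch below uses a small hypothetical frame (`toy`, with made-up pad names) rather than the lab dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry in `LandingPad`
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4],
    "LandingPad": ["pad_a", np.nan, "pad_b", "pad_a"],
})

# isna() gives booleans; their column mean is the fraction of missing rows
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

Here one of four `LandingPad` rows is missing, so that column reports 25 percent and `FlightNumber` reports 0.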

Identify which columns are numerical and categorical:


In [40]:
# data types of dataset columns
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object

#### NOTE
1. **Categorical** columns with `object` dtype
    - `BoosterVersion`
    - `Orbit`
    - `LaunchSite`
    - `Outcome`
    - `LandingPad`
2. **Categorical** columns with `bool` dtype
    - `GridFins`
    - `Reused`
    - `Legs`
3. **Categorical** columns with `float64` dtype
    - `Block`
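4. **Numeric** columns
    - `FlightNumber`
    - `PayloadMass`
    - `Flights`
    - `ReusedCount`
    - `Longitude`
    - `Latitude`
5. Remaining `object` columns, not treated as categorical
    - `Date`
    - `Serial`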
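
A grouping by dtype like the one above can be approximated with `select_dtypes`. The sketch below runs on a small hypothetical frame (`toy`) that reuses four of the column names; it is an illustration, not the lab's method.

```python
import pandas as pd

# Toy frame (hypothetical values) reusing a few column names from the dataset
toy = pd.DataFrame({
    "Orbit": pd.Series(["LEO", "GTO"], dtype="object"),
    "GridFins": [False, True],
    "Block": [1.0, 5.0],
    "PayloadMass": [525.0, 3170.0],
})

# Split the columns by storage dtype
object_cols = toy.select_dtypes(include="object").columns.tolist()
bool_cols = toy.select_dtypes(include="bool").columns.tolist()
number_cols = toy.select_dtypes(include="number").columns.tolist()

print(object_cols, bool_cols, number_cols)
```

Note that `Block` ends up among the numeric columns: the dtype alone cannot tell that a `float64` column is categorical, so that judgment still has to be made by inspecting the values.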

---

### TASK 1: Calculate the number of launches on each site

The data contains several SpaceX launch facilities:
1. **CCAFS SLC 40**: <a href='https://en.wikipedia.org/wiki/List_of_Cape_Canaveral_and_Merritt_Island_launch_sites'>Cape Canaveral Space</a> Launch Complex 40
2. **VAFB SLC 4E**: Vandenberg Air Force Base Space Launch Complex 4E **(SLC-4E)**
3. **KSC LC 39A**: Kennedy Space Center Launch Complex 39A

The location of each launch is stored in the column `LaunchSite`.

Next, let's see the number of launches for each site.

Use the method  <code>value_counts()</code> on the column <code>LaunchSite</code> to determine the number of launches  on each site: 


In [41]:
# Apply value_counts() on column LaunchSite
df["LaunchSite"].value_counts()

LaunchSite
CCAFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: count, dtype: int64
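
To read these counts as shares of all launches, they can be divided by their total. The sketch below rebuilds the counts by hand from the output above (the name `site_counts` is introduced here for illustration); on the real column, `value_counts(normalize=True)` returns the fractions directly.

```python
import pandas as pd

# Launch counts copied from the value_counts() output above
site_counts = pd.Series(
    {"CCAFS SLC 40": 55, "KSC LC 39A": 22, "VAFB SLC 4E": 13},
    name="count",
)

# Share of launches per site, in percent
site_share = site_counts / site_counts.sum() * 100
print(site_share.round(1))
```

With 90 launches in total, CCAFS SLC 40 accounts for roughly 61 percent of them.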

Each launch targets a dedicated orbit. Here are some common orbit types:


|Orbit|Meaning|Description|
|:---:|:---:|:---:|
|`LEO`|Low Earth orbit|Earth-centred orbit with an altitude of 2,000 km (1,200 mi) or less (approximately one-third of the radius of Earth), or with at least 11.25 periods per day (an orbital period of 128 minutes or less) and an eccentricity less than 0.25. Most of the manmade objects in outer space are in LEO.|
|`VLEO`|Very Low Earth Orbits|Orbits with a mean altitude below 450 km. Operating in these orbits can provide a number of benefits to Earth observation spacecraft as the spacecraft operates closer to the observation.|
|`GTO`|Geosynchronous Transfer Orbit|Elliptical orbit used to transfer a satellite to geosynchronous orbit, a high Earth orbit that allows satellites to match Earth's rotation. Located at 22,236 miles (35,786 kilometers) above Earth's equator, the geosynchronous position is a valuable spot for monitoring weather, communications and surveillance. Because a satellite there orbits at the same speed that the Earth is turning, it seems to stay in place over a single longitude, though it may drift north to south.|
|`SSO`|Sun-synchronous Orbit/Heliosynchronous Orbit|Polar orbit around a planet, in which the satellite passes over any given point of the planet's surface at the same local mean solar time.|
|`ES-L1`|Earth-Sun Lagrange Point-1|At Lagrange points the gravitational forces of the two large bodies cancel out in such a way that a small object placed in orbit there is in equilibrium relative to the center of mass of the large bodies. L1 is one such point between the sun and the earth.|
|`HEO`|Highly Elliptical Orbit|Elliptic orbit with high eccentricity, usually referring to one around Earth.|
|`ISS`|International Space Station|Modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project between five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada).|
|`MEO`|Medium Earth Orbit/Intermediate Circular Orbit|Geocentric orbits ranging in altitude from 2,000 km (1,200 mi) to just below geosynchronous orbit at 35,786 kilometers (22,236 mi). These are most commonly at 20,200 kilometers (12,600 mi), or 20,650 kilometers (12,830 mi), with an orbital period of 12 hours.|
|`HEO`|High Earth Orbit|Geocentric orbits above the altitude of geosynchronous orbit (35,786 km or 22,236 mi).|
|`GEO`|Geo-synchronous Earth Orbit|Circular geosynchronous orbit 35,786 kilometres (22,236 miles) above Earth's equator and following the direction of Earth's rotation.|
|`PO`|Polar Orbit|A type of orbit in which a satellite passes above or nearly above both poles of the body being orbited (_usually a planet such as the Earth_).|

![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/Orbits.png)


---

### TASK 2: Calculate the number and occurrence of each orbit


 Use the method  <code>.value_counts()</code> to determine the number and occurrence of each orbit in the  column <code>Orbit</code>


In [42]:
# Apply value_counts on Orbit column
df["Orbit"].value_counts()

Orbit
GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64

### TASK 3: Calculate the number and occurrence of each mission outcome


Use the method <code>.value_counts()</code> on the column <code>Outcome</code> to determine the number of landing outcomes. Then assign it to the variable <code>landing_outcomes</code>.


In [43]:
# landing_outcomes = values on Outcome column
landing_outcomes = df["Outcome"].value_counts()

In [44]:
# check landing outcomes of launches
landing_outcomes

Outcome
True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64

#### Interpreting the Landing Outcomes
|Category|Meaning|
|:---:|:---:|
|`True Ocean`|**Successfully** landed on a specific region of the **ocean**|
|`False Ocean`|**Unsuccessfully** landed on a specific region of the **ocean**|
|`True RTLS`|**Successfully** landed on a **ground pad**|
|`False RTLS`|**Unsuccessfully** landed on a **ground pad**|
|`True ASDS`|**Successfully** landed on a **drone ship**|
|`False ASDS`|**Unsuccessfully** landed on a **drone ship**|
|`None ASDS`|**Failed** to land|
|`None None`|**Failed** to land|

In [45]:
# get indices of landing outcomes which were failures
bad_indices = [] ## to store index positions of failed outcomes
for i, outcome in enumerate(landing_outcomes.keys()):
    if (("False" in outcome) or ("None" in outcome)): 
        print(f"Index {i}: {outcome} ")
        bad_indices.append(i)

Index 1: None None 
Index 3: False ASDS 
Index 5: False Ocean 
Index 6: None ASDS 
Index 7: False RTLS 


We create a set of outcomes where the first stage did not land successfully:


In [46]:
# extract only the failed categories
bad_outcomes = set(landing_outcomes.keys()[bad_indices])
# check failed landing attempt categories
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
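
The same set can be built without tracking index positions, by filtering the category names directly. This is a sketch under the assumption that every successful category starts with `True`; the counts are copied by hand from the `landing_outcomes` output above.

```python
import pandas as pd

# Outcome counts copied from the output of landing_outcomes above
landing_outcomes = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Keep every category that does not start with "True"
bad_outcomes = {o for o in landing_outcomes.index if not o.startswith("True")}
print(sorted(bad_outcomes))
```

Filtering on names rather than positions keeps the result stable if the counts, and therefore the ordering of `value_counts()`, change.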

### TASK 4: Create a landing outcome label from Outcome column


Using the <code>Outcome</code> column, create a list where the element is zero if the corresponding row in <code>Outcome</code> is in the set <code>bad_outcomes</code>; otherwise, it's one. Then assign it to the variable <code>landing_class</code>:


In [47]:
# landing_class = 0 if bad_outcome
# landing_class = 1 otherwise
landing_class = [
    0 if outcome in bad_outcomes else 1
    for outcome in df["Outcome"].values
]

This variable is the classification label for the outcome of each launch. If the value is zero, the first stage did not land successfully; one means the first stage landed successfully.
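
As an alternative to the list comprehension, the label can be computed in a vectorized way with `isin`. The sketch below uses a small hypothetical frame (`toy`) and restates the set of failed categories so that it runs on its own.

```python
import pandas as pd

# Failed categories, as found in the previous step
bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# Toy frame (hypothetical rows) with a mix of failed and successful outcomes
toy = pd.DataFrame({"Outcome": ["None None", "False Ocean", "True Ocean", "True ASDS"]})

# isin() marks failed rows; negating and casting to int gives 1 for success
toy["Class"] = (~toy["Outcome"].isin(bad_outcomes)).astype(int)
print(toy["Class"].tolist())  # [0, 0, 1, 1]
```

Both approaches yield the same labels; the vectorized form simply avoids a Python-level loop.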


In [48]:
# create a column `Class` (binary label) representing whether the landing outcome was a success (1) or a failure (0)
df['Class'] = landing_class

In [50]:
# check the dataframe
df.head(8)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,Falcon 9,677.0,ISS,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,Falcon 9,500.0,PO,VAFB SLC 4E,False Ocean,1,False,False,False,,1.0,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,Falcon 9,3170.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1004,-80.577366,28.561857,0
5,6,2014-01-06,Falcon 9,3325.0,GTO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B1005,-80.577366,28.561857,0
6,7,2014-04-18,Falcon 9,2296.0,ISS,CCAFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1006,-80.577366,28.561857,1
7,8,2014-07-14,Falcon 9,1316.0,LEO,CCAFS SLC 40,True Ocean,1,False,False,True,,1.0,0,B1007,-80.577366,28.561857,1


We can use the following line of code to determine  the success rate:


In [51]:
# success rate of the launches (successful landing)
df["Class"].mean()

0.6666666666666666
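
This value can be cross-checked against the outcome counts from TASK 3: the successful categories sum to 41 + 14 + 5 = 60 out of 90 launches. The sketch below repeats that arithmetic, with the counts copied by hand from the earlier output.

```python
import pandas as pd

# Outcome counts copied from the landing_outcomes output in TASK 3
landing_outcomes = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Successful landings are the categories whose name starts with "True"
successes = landing_outcomes[landing_outcomes.index.str.startswith("True")].sum()
total = landing_outcomes.sum()
print(successes, total, successes / total)
```

The ratio 60 / 90 matches the mean of the `Class` column, since the mean of a 0/1 column is the fraction of ones.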

We can now export it to a CSV for the next section, but to make the answers consistent, in the next lab we will provide data in a pre-selected date range.


<code>df.to_csv("dataset_part_2.csv", index=False)</code>


In [52]:
# save dataframe to local storage
local_path_save = r"D:\IBM Professional Certification\10_Data Science Capstone Project\2_Data wrangling\data\dataset_part_1_processed_1.csv"

# save as csv
df.to_csv(local_path_save, index=False)
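
To confirm that an export round-trips cleanly, the file can be written and read back. The sketch below does this with a small hypothetical frame (`toy`) in a temporary directory, so it does not touch the real data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical rows) standing in for the processed dataset
toy = pd.DataFrame({"Outcome": ["True ASDS", "None None"], "Class": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "dataset_part_2.csv"
    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Because the index is not written, the reloaded frame has only the original columns and no extra unnamed index column.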