# **SpaceX Falcon 9 Data Understanding and Wrangling**
IBM's Applied Data Science Capstone Project


### Objectives
Perform exploratory data analysis and determine training labels:

- Exploratory data analysis
- Determine training labels


----


In [1]:
import pandas as pd
import numpy as np

### Data Analysis 


In [2]:
df=pd.read_csv("/Users/fanzhaoting/Brian_code/coursera_dspythonclass/falcon/data/Falcon_dataset.csv")
pd.set_option('display.max_columns', None)
df.head(10)

Unnamed: 0,FlightNumber,Date,BoosterVersion,LaunchSite,Payload,PayloadMass,Orbit,Customer,LaunchOutcome,LandingOutcome,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,F9 v1.0,CCSFS SLC 40,Dragon Spacecraft Qualification Unit,6124,LEO,SpaceX,Success,Failure,None None,1,False,False,False,,1,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,F9 v1.0,CCSFS SLC 40,SpaceX COTS Demo Flight 2,525,LEO,NASA,Success,No attempt,None None,1,False,False,False,,1,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,F9 v1.0,CCSFS SLC 40,SpaceX CRS-2,677,ISS,NASA,Success,No attempt,None None,1,False,False,False,,1,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,F9 v1.1,VAFB SLC 4E,CASSIOPE,500,PO,MDA,Success,Uncontrolled,False Ocean,1,False,False,False,,1,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,F9 v1.1,CCSFS SLC 40,SES-8,3170,GTO,SES,Success,No attempt,None None,1,False,False,False,,1,0,B1004,-80.577366,28.561857,0
5,6,2014-01-06,F9 v1.1,CCSFS SLC 40,Thaicom 6,3325,GTO,Thaicom,Success,No attempt,None None,1,False,False,False,,1,0,B1005,-80.577366,28.561857,0
6,7,2014-04-18,F9 v1.1,CCSFS SLC 40,SpaceX CRS-3,2296,ISS,NASA,Success,Controlled,True Ocean,1,False,False,True,,1,0,B1006,-80.577366,28.561857,1
7,8,2014-07-14,F9 v1.1,CCSFS SLC 40,Orbcomm-OG2,1316,LEO,Orbcomm,Success,Controlled,True Ocean,1,False,False,True,,1,0,B1007,-80.577366,28.561857,1
8,9,2014-08-05,F9 v1.1,CCSFS SLC 40,AsiaSat 8,4535,GTO,AsiaSat,Success,No attempt,None None,1,False,False,False,,1,0,B1008,-80.577366,28.561857,0
9,10,2014-09-07,F9 v1.1,CCSFS SLC 40,AsiaSat 6,4428,GTO,AsiaSat,Success,No attempt,None None,1,False,False,False,,1,0,B1011,-80.577366,28.561857,0


Replace newline characters in every column

In [3]:
df = df.replace('\n', ' ', regex=True)

Clean column `BoosterVersion` for aggregation

In [4]:
# Define the pattern to extract (e.g., "F9 v1.0", "F9 v1.1", etc.)
patterns = ["F9 v1.0", "F9 v1.1","F9 FT", "F9 B4","F9 B5"]  # Add more patterns as needed

# Function to extract the matching pattern from a cell
def extract_pattern(cell):
    for pattern in patterns:
        if pattern in cell:
            return pattern
    return None  # Return None if no match is found

# Apply the extraction function to the specified column
df['BoosterVersion'] = df['BoosterVersion'].apply(extract_pattern)
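
The same clean-up can also be sketched with a single vectorized regular expression instead of a Python-level loop. The sample values below are hypothetical, and the behaviour differs slightly from `extract_pattern`: the regex returns the leftmost match in the string rather than the first entry of `patterns` found anywhere in the cell, and missing cells stay missing instead of raising an error.

```python
import re
import pandas as pd

# Hypothetical raw values; the real ones come from the BoosterVersion column.
sample = pd.Series(["F9 v1.0B0003.1", "F9 FT B1021.1", "F9 B5 B1046.1", "Unknown"])

patterns = ["F9 v1.0", "F9 v1.1", "F9 FT", "F9 B4", "F9 B5"]

# Build one alternation regex; re.escape keeps the dot in "v1.0" literal.
regex = "(" + "|".join(re.escape(p) for p in patterns) + ")"

# expand=False returns a Series; cells without a match become NaN.
cleaned = sample.str.extract(regex, expand=False)
print(cleaned.tolist())
```

Whether this is preferable depends on the data: with only 89 rows the speed difference is negligible, so the main benefit is tolerance of missing values.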

Identify and calculate the percentage of the missing values in each attribute


In [5]:
# Percentage of missing values per column, relative to the total number of rows
df.isnull().sum()/len(df)*100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
LaunchSite         0.000000
Payload            0.000000
PayloadMass        0.000000
Orbit              0.000000
Customer           1.123596
LaunchOutcome      0.000000
LandingOutcome     0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        29.213483
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
Class              0.000000
dtype: float64
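
As a cross-check, the share of missing values can be sketched with `isna().mean()`, which always divides by the total number of rows. The toy frame below uses hypothetical values; it also shows why the denominator matters, since dividing by the non-null count gives the ratio of missing to present values rather than a percentage of all rows.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 4 rows, one missing LandingPad entry.
toy = pd.DataFrame({
    "LaunchSite": ["A", "B", "A", "C"],
    "LandingPad": ["pad1", np.nan, "pad2", "pad3"],
})

# Mean of the boolean mask = missing cells / total rows.
missing_pct = toy.isna().mean() * 100

# Ratio of missing to non-missing cells, shown for comparison only.
missing_to_present = toy.isna().sum() / toy.count() * 100

print(missing_pct)
print(missing_to_present)
```

Here one missing value out of four rows is 25% of the column, while the missing-to-present ratio is one in three.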

Identify which columns are numerical and categorical:


In [6]:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
LaunchSite         object
Payload            object
PayloadMass         int64
Orbit              object
Customer           object
LaunchOutcome      object
LandingOutcome     object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block               int64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
Class               int64
dtype: object
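
Rather than reading the split off the `dtypes` listing by eye, the columns can be grouped programmatically with `select_dtypes`. This is a minimal sketch on a toy frame with hypothetical rows; the variable names are illustrative only.

```python
import pandas as pd

# Toy frame mirroring a few of the dataset's column types.
toy = pd.DataFrame({
    "FlightNumber": [1, 2],
    "Orbit": ["LEO", "ISS"],
    "GridFins": [False, True],
    "Longitude": [-80.577366, -120.610829],
})

# "number" covers integer and float columns but not booleans.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
boolean_cols = toy.select_dtypes(include="bool").columns.tolist()

# Everything that is neither numeric nor boolean is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, boolean_cols, categorical_cols)
```

Note that a numeric dtype does not guarantee a numeric meaning: in the real data, `Class` and `Block` are stored as integers but behave like categories.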

### Calculate the number of launches on each site

The data contains several SpaceX launch facilities: <a href='https://en.wikipedia.org/wiki/List_of_Cape_Canaveral_and_Merritt_Island_launch_sites?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01'>Cape Canaveral Space</a> Launch Complex 40 <b>(CCSFS SLC 40)</b>, Vandenberg Air Force Base Space Launch Complex 4E <b>(VAFB SLC 4E)</b>, and Kennedy Space Center Launch Complex 39A <b>(KSC LC 39A)</b>. The location of each launch is stored in the column <code>LaunchSite</code>.


In [7]:
# Apply value_counts() on column LaunchSite
launch_counts = df['LaunchSite'].value_counts()
print(launch_counts)

CCSFS SLC 40    54
KSC LC 39A      22
VAFB SLC 4E     13
Name: LaunchSite, dtype: int64
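
The raw counts can be turned into shares of all launches, which is often easier to compare. The sketch below rebuilds the counts printed above as a small Series so that it runs on its own; on the real frame, `df['LaunchSite'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Launch counts per site, copied from the value_counts() output above.
launch_counts = pd.Series({"CCSFS SLC 40": 54, "KSC LC 39A": 22, "VAFB SLC 4E": 13})

# Convert counts to percentages of all launches.
launch_share = launch_counts / launch_counts.sum() * 100
print(launch_share.round(2))
```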


Each launch targets a dedicated orbit. Here are some common orbit types:




* <b>LEO</b>: Low Earth orbit (LEO) is an Earth-centred orbit with an altitude of 2,000 km (1,200 mi) or less. Most of the manmade objects in outer space are in LEO <a href='https://en.wikipedia.org/wiki/Low_Earth_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01'>[1]</a>.

* <b>VLEO</b>: Very Low Earth Orbits (VLEO) can be defined as the orbits with a mean altitude below 450 km. Operating in these orbits can provide a number of benefits to Earth observation spacecraft as the spacecraft operates closer to the observation<a href='https://www.researchgate.net/publication/271499606_Very_Low_Earth_Orbit_mission_concepts_for_Earth_Observation_Benefits_and_challenges?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01'>[2]</a>.


* <b>GTO</b>: A geosynchronous transfer orbit (GTO) is an elliptical orbit used to carry a satellite up to geosynchronous orbit, a high Earth orbit that allows satellites to match Earth's rotation. Located at 22,236 miles (35,786 kilometers) above Earth's equator, the geosynchronous position is a valuable spot for monitoring weather, communications and surveillance. “Because the satellite orbits at the same speed that the Earth is turning, the satellite seems to stay in place over a single longitude, though it may drift north to south,” NASA wrote on its Earth Observatory website <a  href="https://www.space.com/29222-geosynchronous-orbit.html?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01" >[3] </a>.


* <b>SSO (or SO)</b>: A Sun-synchronous orbit, also called a heliosynchronous orbit, is a nearly polar orbit around a planet in which the satellite passes over any given point of the planet's surface at the same local mean solar time <a href="https://en.wikipedia.org/wiki/Sun-synchronous_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01">[4] </a>.
    
    
    
* <b>ES-L1</b>: At the Lagrange points, the gravitational forces of the two large bodies cancel out in such a way that a small object placed in orbit there is in equilibrium relative to the center of mass of the large bodies. L1 is one such point between the Sun and the Earth
<a href="https://en.wikipedia.org/wiki/Lagrange_point?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01#L1_point">[5] </a>.
    
    
* <b>HEO</b>: A highly elliptical orbit is an elliptic orbit with high eccentricity, usually referring to one around Earth <a href="https://en.wikipedia.org/wiki/Highly_elliptical_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01">[6]</a>.


* <b> ISS </b> A modular space station (habitable artificial satellite) in low Earth orbit. It is a multinational collaborative project between five participating space agencies: NASA (United States), Roscosmos (Russia), JAXA (Japan), ESA (Europe), and CSA (Canada)<a href="https://en.wikipedia.org/wiki/International_Space_Station?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01"> [7] </a>


* <b> MEO </b> Geocentric orbits ranging in altitude from 2,000 km (1,200 mi) to just below geosynchronous orbit at 35,786 kilometers (22,236 mi). Also known as an intermediate circular orbit. These are most commonly at 20,200 kilometers (12,600 mi), or 20,650 kilometers (12,830 mi), with an orbital period of 12 hours <a href="https://en.wikipedia.org/wiki/List_of_orbits?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01"> [8] </a>


* <b> HEO </b> The abbreviation is also used for high Earth orbit: geocentric orbits above the altitude of geosynchronous orbit (35,786 km or 22,236 mi) <a href="https://en.wikipedia.org/wiki/List_of_orbits?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01"> [9] </a>


* <b> GEO </b> It is a circular geosynchronous orbit 35,786 kilometres (22,236 miles) above Earth's equator and following the direction of Earth's rotation <a href="https://en.wikipedia.org/wiki/Geostationary_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01"> [10] </a>


* <b> PO </b> A polar orbit is one in which a satellite passes above or nearly above both poles of the body being orbited (usually a planet such as the Earth) <a href="https://en.wikipedia.org/wiki/Polar_orbit?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDS0321ENSkillsNetwork865-2023-01-01"> [11] </a>

Some of these orbits are shown in the following plot:


<div>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/Images/Orbits.png" width="400"/>
</div>


### Calculate the number and occurrence of each orbit


In [8]:
# Determine the number of occurrences of each orbit in the Orbit column
orbit_counts = df['Orbit'].value_counts()
print(orbit_counts)

GTO      26
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: Orbit, dtype: int64


### Calculate the number and occurrence of each mission outcome


In [9]:
# landing_outcomes = values on Outcome column
landing_outcomes = df['Outcome'].value_counts()
print(landing_outcomes)

True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       1
False RTLS      1
Name: Outcome, dtype: int64


There are several different cases where the booster did not land successfully. Sometimes a landing was attempted but failed due to an accident.

- <code>True Ocean</code> means the booster landed successfully in a specific region of the ocean, while <code>False Ocean</code> means the attempted landing in a specific region of the ocean was unsuccessful.
- <code>True RTLS</code> means the booster landed successfully on a ground pad, while <code>False RTLS</code> means the attempted landing on a ground pad was unsuccessful.
- <code>True ASDS</code> means the booster landed successfully on a drone ship, while <code>False ASDS</code> means the attempted landing on a drone ship was unsuccessful.
- <code>None ASDS</code> and <code>None None</code> represent a failure to land.
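
Each `Outcome` label therefore combines two pieces of information: a landing status and a landing location. A minimal sketch of separating them, using a hypothetical sample of labels and illustrative column names (`landed`, `location`) that are not part of the dataset:

```python
import pandas as pd

# A few outcome labels of the form "<status> <location>".
outcomes = pd.Series(["True ASDS", "None None", "False Ocean", "True RTLS"])

# Split on the first space into two columns.
parts = outcomes.str.split(" ", n=1, expand=True)
parts.columns = ["landed", "location"]
print(parts)
```

Both pieces remain strings after the split, so `"True"` here is text rather than a boolean.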


In [10]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


In [11]:
#Create a landing outcome label from Outcome column
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
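
Selecting the bad outcomes by position depends on the order in which `value_counts()` happens to return the labels, which can change if the data changes. One possible alternative, sketched here under the assumption that every successful landing label starts with `True`, is to derive the set from the label text itself:

```python
# Outcome labels as listed by the enumeration above.
outcome_labels = [
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "False Ocean", "None ASDS", "False RTLS",
]

# Treat every label that does not start with "True" as a bad outcome.
bad_outcomes = {label for label in outcome_labels if not label.startswith("True")}
print(sorted(bad_outcomes))
```

For the labels in this dataset, this yields the same five entries as the index-based selection.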

### Create a `Class` column that represents the Outcome

In [12]:
landing_class = [0 if outcome in bad_outcomes else 1 for outcome in df['Outcome']]
print(landing_class)

[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
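
The list comprehension can also be written as a vectorized pandas expression. This is a sketch on a toy frame with hypothetical rows, not a replacement the lab requires; `isin` marks rows whose outcome is in the bad set, and negating that mask gives the label.

```python
import pandas as pd

# Toy frame with a handful of outcome labels.
toy = pd.DataFrame({"Outcome": ["None None", "True Ocean", "False ASDS", "True RTLS"]})

bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# 1 where the outcome is not a bad outcome, 0 otherwise.
toy["Class"] = (~toy["Outcome"].isin(bad_outcomes)).astype(int)
print(toy)
```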


In [13]:
# Store the classification variable that represents the landing outcome of each launch
df['Class']=landing_class
df[['Class']].head(8)

Unnamed: 0,Class
0,0
1,0
2,0
3,0
4,0
5,0
6,1
7,1


In [14]:
df.head(10)

Unnamed: 0,FlightNumber,Date,BoosterVersion,LaunchSite,Payload,PayloadMass,Orbit,Customer,LaunchOutcome,LandingOutcome,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,F9 v1.0,CCSFS SLC 40,Dragon Spacecraft Qualification Unit,6124,LEO,SpaceX,Success,Failure,None None,1,False,False,False,,1,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,F9 v1.0,CCSFS SLC 40,SpaceX COTS Demo Flight 2,525,LEO,NASA,Success,No attempt,None None,1,False,False,False,,1,0,B0005,-80.577366,28.561857,0
2,3,2013-03-01,F9 v1.0,CCSFS SLC 40,SpaceX CRS-2,677,ISS,NASA,Success,No attempt,None None,1,False,False,False,,1,0,B0007,-80.577366,28.561857,0
3,4,2013-09-29,F9 v1.1,VAFB SLC 4E,CASSIOPE,500,PO,MDA,Success,Uncontrolled,False Ocean,1,False,False,False,,1,0,B1003,-120.610829,34.632093,0
4,5,2013-12-03,F9 v1.1,CCSFS SLC 40,SES-8,3170,GTO,SES,Success,No attempt,None None,1,False,False,False,,1,0,B1004,-80.577366,28.561857,0
5,6,2014-01-06,F9 v1.1,CCSFS SLC 40,Thaicom 6,3325,GTO,Thaicom,Success,No attempt,None None,1,False,False,False,,1,0,B1005,-80.577366,28.561857,0
6,7,2014-04-18,F9 v1.1,CCSFS SLC 40,SpaceX CRS-3,2296,ISS,NASA,Success,Controlled,True Ocean,1,False,False,True,,1,0,B1006,-80.577366,28.561857,1
7,8,2014-07-14,F9 v1.1,CCSFS SLC 40,Orbcomm-OG2,1316,LEO,Orbcomm,Success,Controlled,True Ocean,1,False,False,True,,1,0,B1007,-80.577366,28.561857,1
8,9,2014-08-05,F9 v1.1,CCSFS SLC 40,AsiaSat 8,4535,GTO,AsiaSat,Success,No attempt,None None,1,False,False,False,,1,0,B1008,-80.577366,28.561857,0
9,10,2014-09-07,F9 v1.1,CCSFS SLC 40,AsiaSat 6,4428,GTO,AsiaSat,Success,No attempt,None None,1,False,False,False,,1,0,B1011,-80.577366,28.561857,0


Because `Class` is 1 for a successful landing and 0 otherwise, its mean is the success rate:


In [15]:
df["Class"].mean()

0.6741573033707865
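
The same idea extends to subgroups: averaging `Class` within each group gives a per-group success rate. The sketch below uses a toy frame with hypothetical rows, so its numbers say nothing about the real launch sites.

```python
import pandas as pd

# Toy data: two launches from one site, one each from two others.
toy = pd.DataFrame({
    "LaunchSite": ["CCSFS SLC 40", "CCSFS SLC 40", "VAFB SLC 4E", "KSC LC 39A"],
    "Class": [0, 1, 0, 1],
})

# Mean of a 0/1 column per group = share of successful landings per site.
rate_by_site = toy.groupby("LaunchSite")["Class"].mean()
print(rate_by_site)
```

On the real frame, the analogous call would be `df.groupby("LaunchSite")["Class"].mean()`; groups with few launches give noisy rates and should be read with care.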

We can now export the data to a CSV file for the next section. To keep the answers consistent, the next lab will provide data in a pre-selected date range.


In [16]:
df.to_csv("data/Falcon_dataset.csv", index=False)
