# SpaceX Falcon 9 First Stage Landing Prediction
## Part 2: Data Wrangling

In this lab, we will carry out some Exploratory Data Analysis to uncover patterns in the data and determine the labels for training supervised models.
 
In the dataset we have been building, several labels indicate whether a landing was successful. For example, `True Ocean` means the booster successfully landed in a specific region of the ocean, and `True RTLS` means it successfully landed on a ground pad. There are therefore multiple possible labels for a successful landing, as well as for an unsuccessful one.

In this lab, we will convert these outcomes into training labels: `1` if the booster landed successfully and `0` if it did not.

----

Importing our libraries:

In [1]:
import pandas as pd
import numpy as np

## Data Analysis
Loading our dataset:

In [4]:
df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(2)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857


Calculating the percentage of missing values in each column:

In [6]:
df.isnull().sum()/len(df)*100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        28.888889
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
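As a small illustration of the calculation above, the sketch below uses a hypothetical four-row frame (`toy` is not part of the lab data) to show that the column-wise mean of `isnull()` gives the same percentages as dividing the null counts by the row count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the launch dataset.
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4],
    "LandingPad": [np.nan, np.nan, "pad_a", "pad_b"],
})

# isnull() yields a boolean frame; summing counts the missing cells per column.
pct_via_sum = toy.isnull().sum() / len(toy) * 100

# The mean of a boolean column is the fraction of True values,
# so this is an equivalent, slightly shorter form.
pct_via_mean = toy.isnull().mean() * 100

print(pct_via_mean)
```

Either form works; the `mean()` variant just avoids referring to the frame's length explicitly.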

Identifying categorical and numerical columns

In [7]:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
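Rather than reading the groups off the `dtypes` listing by eye, the columns can also be separated programmatically. The following is a minimal sketch using `select_dtypes` on a hypothetical toy frame; the variable names (`toy`, `numerical_cols`, and so on) are illustrative assumptions, not part of the lab.

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind seen in the dataset.
toy = pd.DataFrame({
    "FlightNumber": [1, 2],
    "PayloadMass": [6104.96, 525.0],
    "Orbit": ["LEO", "LEO"],
    "GridFins": [False, False],
})

# "number" matches integer and float columns but not booleans.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()

# Boolean flags are collected separately.
boolean_cols = toy.select_dtypes(include="bool").columns.tolist()

# Everything that is neither numeric nor boolean is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numerical_cols, boolean_cols, categorical_cols)
```

Whether boolean columns such as `GridFins` should count as categorical is a modelling choice; this sketch simply keeps them in their own list.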

### 1. Calculating the number of launches at each site

In [10]:
df['LaunchSite'].value_counts()

CCAFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: LaunchSite, dtype: int64
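To express the same counts as shares of all launches, `value_counts` accepts `normalize=True`. The sketch below rebuilds a stand-in `LaunchSite` column from the counts printed above (the Series `sites` is constructed for illustration only) so that it runs without downloading the dataset.

```python
import pandas as pd

# Stand-in column rebuilt from the counts shown in the output above.
sites = pd.Series(
    ["CCAFS SLC 40"] * 55 + ["KSC LC 39A"] * 22 + ["VAFB SLC 4E"] * 13,
    name="LaunchSite",
)

counts = sites.value_counts()                # absolute number of launches per site
shares = sites.value_counts(normalize=True)  # fraction of all launches per site

print((shares * 100).round(1))
```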

Note that each launch targets a dedicated orbit.

### 3. Calculating the number and occurrence of mission outcomes

In [13]:
landing_outcomes = df['Outcome'].value_counts()
landing_outcomes

True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: Outcome, dtype: int64

Outcomes that start with `False` are failed landings, and outcomes that start with `None` are also treated as unsuccessful.

In [14]:
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS


Let's create a set of outcomes where the first stage did not land successfully.

In [15]:
bad_outcomes = set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
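Selecting by position (`[1,3,5,6,7]`) depends on the order in which `value_counts` happens to return the outcomes. As a hedged alternative, the same set can be built from the label text itself, assuming (as in the listing above) that every successful outcome starts with `True`. The list `outcomes` below is copied from that listing for illustration.

```python
# Outcome labels as printed by the enumeration above.
outcomes = [
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "False Ocean", "None ASDS", "False RTLS",
]

# Assumption: any label that does not start with "True" counts as unsuccessful.
bad_outcomes = {label for label in outcomes if not label.startswith("True")}

print(sorted(bad_outcomes))
```

This version gives the same five labels and keeps working if the counts, and therefore the ordering, change.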

### 4. Create landing outcome label from Outcome column
Creating a list where each element is `0` if the corresponding row in `Outcome` is in the set `bad_outcomes`, and `1` otherwise. The list is then assigned to `landing_class`.

In [18]:
landing_class = []
for oc in df['Outcome']:
    if oc in bad_outcomes:
        landing_class.append(0)
    else:
        landing_class.append(1)
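The loop above can also be written as a single vectorized expression with `Series.isin`. This is a sketch on a hypothetical four-row frame (`toy`), not a replacement prescribed by the lab; it produces the same 0/1 labels as the loop.

```python
import pandas as pd

bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# Hypothetical sample of the Outcome column.
toy = pd.DataFrame({"Outcome": ["None None", "True ASDS", "False RTLS", "True Ocean"]})

# isin() marks rows whose outcome is in bad_outcomes; negating and casting
# to int gives 1 for a successful landing and 0 otherwise.
landing_class = (~toy["Outcome"].isin(bad_outcomes)).astype(int).tolist()

# Same result using the explicit loop, for comparison.
loop_class = [0 if oc in bad_outcomes else 1 for oc in toy["Outcome"]]

print(landing_class)
```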

Now let's add a new column, `Class`, to represent these newly defined outcomes.

In [25]:
df['Class'] = landing_class
df.head(2)

Unnamed: 0,FlightNumber,Date,BoosterVersion,PayloadMass,Orbit,LaunchSite,Outcome,Flights,GridFins,Reused,Legs,LandingPad,Block,ReusedCount,Serial,Longitude,Latitude,Class
0,1,2010-06-04,Falcon 9,6104.959412,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0003,-80.577366,28.561857,0
1,2,2012-05-22,Falcon 9,525.0,LEO,CCAFS SLC 40,None None,1,False,False,False,,1.0,0,B0005,-80.577366,28.561857,0


Let's determine the success rate.

In [26]:
df['Class'].mean()

0.6666666666666666
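Because `Class` holds only 0s and 1s, its mean is the fraction of successful landings. As a cross-check, the same figure can be recovered from the outcome counts listed earlier; the Series `counts` below is typed in from that output for illustration.

```python
import pandas as pd

# Outcome counts as printed by value_counts() earlier in this notebook.
counts = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Successful landings are the outcomes whose label starts with "True".
successes = counts[counts.index.str.startswith("True")].sum()
total = counts.sum()
success_rate = successes / total

print(successes, total, success_rate)
```

Here 41 + 14 + 5 = 60 successful landings out of 90 launches, which matches the mean of `Class`.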

Lastly, let's save our progress and export our modified dataset to a CSV file.

In [27]:
df.to_csv('dataset_part_2.csv', index=False)