# Space missions feature engineering and predictions

Who does not love space?
This dataset covers space missions since the beginning of the Space Race (1957). In this notebook we prepare its features and train a model to predict whether a mission succeeds.


### Importing Libraries


In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

### Importing the dataset


In [2]:
df = pd.read_csv("../datasets/Space_Exploration_Cleaned.csv")

## Feature Engineering


When we build a model, we generally can't pass null values to it. We need to fill those values in before feeding the data to the model.


In [3]:
df.isnull().sum()

Company Name         0
Location             0
Datum                0
Detail               0
Status Rocket        0
Rocket            3360
Status Mission       0
Country              0
DateTime             0
Year                 0
Launch_Site          0
Count                0
Month                0
dtype: int64

So the Rocket column is missing 3360 of its 4324 values. We fill them with the column mean.


In [4]:
df["Rocket"] = df["Rocket"].fillna(df["Rocket"].mean())
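
Mean imputation is the simplest fill, but with 3360 of 4324 values missing, most rows end up sharing one number. As a hedged alternative, the sketch below keeps a flag for which rows were originally missing and fills with the median instead. The toy values and the `Rocket_missing` column name are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Rocket" column; the values are made up for illustration.
toy = pd.DataFrame({"Rocket": [50.0, np.nan, 65.0, np.nan, 145.0]})

# Record which rows were missing before filling, so a model can tell
# observed values from imputed ones ("Rocket_missing" is a hypothetical name).
toy["Rocket_missing"] = toy["Rocket"].isna().astype(int)

# The median is less sensitive to a few very large values than the mean.
median_value = toy["Rocket"].median()
toy["Rocket"] = toy["Rocket"].fillna(median_value)

print(toy)
```

Whether the flag or the median helps on the real data is something to check by comparing model scores, not something this sketch shows.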

In [5]:
df.isnull().sum()

Company Name      0
Location          0
Datum             0
Detail            0
Status Rocket     0
Rocket            0
Status Mission    0
Country           0
DateTime          0
Year              0
Launch_Site       0
Count             0
Month             0
dtype: int64

There are no more null values.


In [6]:
df.head()

Unnamed: 0,Company Name,Location,Datum,Detail,Status Rocket,Rocket,Status Mission,Country,DateTime,Year,Launch_Site,Count,Month
0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA","Fri Aug 07, 2020 05:12 UTC",Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success,USA,2020-08-07 05:12:00+00:00,2020,"LC-39A, Kennedy Space Center, Florida",1,Aug
1,CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...","Thu Aug 06, 2020 04:01 UTC",Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success,China,2020-08-06 04:01:00+00:00,2020,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...",1,Aug
2,SpaceX,"Pad A, Boca Chica, Texas, USA","Tue Aug 04, 2020 23:57 UTC",Starship Prototype | 150 Meter Hop,StatusActive,153.792199,Success,USA,2020-08-04 23:57:00+00:00,2020,"Pad A, Boca Chica, Texas",1,Aug
3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan","Thu Jul 30, 2020 21:25 UTC",Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success,Kazakhstan,2020-07-30 21:25:00+00:00,2020,"Site 200/39, Baikonur Cosmodrome",1,Jul
4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA","Thu Jul 30, 2020 11:50 UTC",Atlas V 541 | Perseverance,StatusActive,145.0,Success,USA,2020-07-30 11:50:00+00:00,2020,"SLC-41, Cape Canaveral AFS, Florida",1,Jul


Next we need to decide which columns are useful for training. For example, Detail and Datum do not need to be included in the training data.


In [7]:
df = df.drop(
    ["Location", "Datum", "Detail", "DateTime", "Launch_Site", "Month", "Count"], axis=1
)

In [8]:
df.head()

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Country,Year
0,SpaceX,StatusActive,50.0,Success,USA,2020
1,CASC,StatusActive,29.75,Success,China,2020
2,SpaceX,StatusActive,153.792199,Success,USA,2020
3,Roscosmos,StatusActive,65.0,Success,Kazakhstan,2020
4,ULA,StatusActive,145.0,Success,USA,2020


Another important point is that we can't pass string values to a model for training. We have to convert them to a numerical form the model can work with.


In [9]:
df["Status Mission"].value_counts()

Status Mission
Success              3879
Failure               339
Partial Failure       102
Prelaunch Failure       4
Name: count, dtype: int64

What we intend to predict is whether a mission will fail or not, so we have to reduce the four unique values to two.


In [10]:
df["Status Mission"] = df["Status Mission"].apply(
    lambda x: x if x == "Success" else "Failure"
)
df["Status Mission"].value_counts()

Status Mission
Success    3879
Failure     445
Name: count, dtype: int64

Now we have to convert those values to numerical form. The simplest way is to map Success to 1 and Failure to 0. LabelEncoder does this for us: it assigns integer codes in sorted label order, so Failure becomes 0 and Success becomes 1.


In [11]:
encoder = LabelEncoder()
df["Status Mission"] = encoder.fit_transform(df["Status Mission"])
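
To confirm which label got which integer, the fitted encoder's `classes_` attribute can be inspected. This is a small sketch on a toy list; `demo_encoder`, `labels`, and `mapping` are illustrative names, not variables from this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the "Status Mission" column.
labels = ["Success", "Failure", "Success", "Failure"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(labels)

# classes_ is sorted; the position of each label is its integer code.
mapping = {str(label): code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)
```

The same check on the real `encoder` would show the mapping before it is overwritten for the next column.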

In [12]:
df[:10]

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Country,Year
0,SpaceX,StatusActive,50.0,1,USA,2020
1,CASC,StatusActive,29.75,1,China,2020
2,SpaceX,StatusActive,153.792199,1,USA,2020
3,Roscosmos,StatusActive,65.0,1,Kazakhstan,2020
4,ULA,StatusActive,145.0,1,USA,2020
5,CASC,StatusActive,64.68,1,China,2020
6,Roscosmos,StatusActive,48.5,1,Kazakhstan,2020
7,CASC,StatusActive,153.792199,1,China,2020
8,SpaceX,StatusActive,50.0,1,USA,2020
9,JAXA,StatusActive,90.0,1,Japan,2020


In [13]:
df["Status Mission"].value_counts()

Status Mission
1    3879
0     445
Name: count, dtype: int64

Similarly, we convert Status Rocket to numerical form.


In [14]:
encoder = LabelEncoder()
df["Status Rocket"] = encoder.fit_transform(df["Status Rocket"])

In [15]:
df.head()

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Country,Year
0,SpaceX,0,50.0,1,USA,2020
1,CASC,0,29.75,1,China,2020
2,SpaceX,0,153.792199,1,USA,2020
3,Roscosmos,0,65.0,1,Kazakhstan,2020
4,ULA,0,145.0,1,USA,2020


In [16]:
df["Status Rocket"].value_counts()

Status Rocket
1    3534
0     790
Name: count, dtype: int64

We could use both the company and country columns as features, but I decided to drop the Country column.


In [17]:
df = df.drop(["Country"], axis=1)

In [18]:
df.head()

Unnamed: 0,Company Name,Status Rocket,Rocket,Status Mission,Year
0,SpaceX,0,50.0,1,2020
1,CASC,0,29.75,1,2020
2,SpaceX,0,153.792199,1,2020
3,Roscosmos,0,65.0,1,2020
4,ULA,0,145.0,1,2020


### One hot encoding the Company Name Column


In [19]:
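def onehot_encode(data, column):
    # Build one indicator column per unique value in `column`.
    dummies = pd.get_dummies(data[column])
    # Append the indicator columns, then remove the original text column.
    data = pd.concat([data, dummies], axis=1)
    data = data.drop(column, axis=1)
    return data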

In [20]:
df = onehot_encode(df, "Company Name")
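
For comparison, `pd.get_dummies` can also encode and drop the column in one call through its `columns` argument. The sketch below uses a toy frame. The `Company` prefix and integer dtype are choices made here for readability, not what the notebook uses.

```python
import pandas as pd

# Toy frame with a few company names from the dataset.
toy = pd.DataFrame(
    {"Company Name": ["SpaceX", "CASC", "SpaceX", "ULA"], "Year": [2020] * 4}
)

# columns=[...] encodes only the listed column and removes the original;
# prefix keeps the new column names recognisable; dtype=int gives 0/1 values.
encoded = pd.get_dummies(toy, columns=["Company Name"], prefix="Company", dtype=int)

print(encoded.columns.tolist())
```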

Next we separate the X and y values: given the feature columns in X, we want to predict y. So y holds only the target column, and X must not contain it.


In [21]:
df.head()

Unnamed: 0,Status Rocket,Rocket,Status Mission,Year,AEB,AMBA,ASI,Arianespace,Arme de l'Air,Blue Origin,...,SpaceX,Starsem,ULA,US Air Force,US Navy,UT,VKS RF,Virgin Orbit,Yuzhmash,i-Space
0,0,50.0,1,2020,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
1,0,29.75,1,2020,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0,153.792199,1,2020,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
3,0,65.0,1,2020,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0,145.0,1,2020,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


In [22]:
X = df.drop("Status Mission", axis=1)
y = df["Status Mission"]

In [23]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Status Mission, dtype: int64

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
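
The split above is random on every run, and failures are only about 10% of the rows (445 of 4324). A hedged variant passes `stratify` to keep the class ratio equal in both parts and `random_state` to make the split repeatable. The sketch uses made-up toy arrays, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 successes (1) and 10 failures (0).
y_toy = np.array([1] * 90 + [0] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 90/10 ratio in both train and test parts;
# random_state fixes the shuffle so the split is the same on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(int((y_te == 0).sum()), "failures in a test set of", len(y_te))
```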

In [25]:
X_train.head()

Unnamed: 0,Status Rocket,Rocket,Year,AEB,AMBA,ASI,Arianespace,Arme de l'Air,Blue Origin,Boeing,...,SpaceX,Starsem,ULA,US Air Force,US Navy,UT,VKS RF,Virgin Orbit,Yuzhmash,i-Space
4158,1,153.792199,1962,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
630,1,164.0,2012,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
801,0,153.792199,2008,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3118,1,153.792199,1973,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1605,1,450.0,1993,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [26]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
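
Failures are the minority class here (445 of 4324 rows), so one option worth trying is class weighting. The following is a minimal sketch on synthetic data, not this notebook's method. `class_weight` and `random_state` are real `RandomForestClassifier` parameters, but the data from `make_classification` and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with roughly a 10/90 class split (not the missions data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, weights=[0.1, 0.9], random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency;
# random_state makes the fitted forest reproducible.
weighted_rfc = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
weighted_rfc.fit(X_demo, y_demo)
demo_pred = weighted_rfc.predict(X_demo)
```

Whether weighting improves recall on the failure class for the real data would need to be measured on a held-out test set.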

In [27]:
rfc_pred = rfc.predict(X_test)

In [28]:
accuracy_score(y_test, rfc_pred)

0.8959537572254336

In [29]:
print(confusion_matrix(y_test, rfc_pred))

[[ 13  65]
 [ 25 762]]


In [30]:
print(classification_report(y_test, rfc_pred))

              precision    recall  f1-score   support

           0       0.34      0.17      0.22        78
           1       0.92      0.97      0.94       787

    accuracy                           0.90       865
   macro avg       0.63      0.57      0.58       865
weighted avg       0.87      0.90      0.88       865
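
The accuracy looks high, but it is worth reading it against the confusion matrix. The sketch below redoes the arithmetic from the matrix printed above, taking class 0 as Failure and class 1 as Success, and compares the model with a baseline that always predicts Success. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true classes, columns are
# predicted classes, in the order 0 (Failure), 1 (Success).
cm = [[13, 65],
      [25, 762]]

total = sum(sum(row) for row in cm)           # 865 test rows
accuracy = (cm[0][0] + cm[1][1]) / total      # correct predictions / all rows

# A baseline that always predicts Success gets every true Success right.
baseline_accuracy = sum(cm[1]) / total

# Of the true failures, how many did the model catch?
failure_recall = cm[0][0] / sum(cm[0])
# Of the predicted failures, how many were real failures?
failure_precision = cm[0][0] / (cm[0][0] + cm[1][0])

print(f"model accuracy:    {accuracy:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"failure recall:    {failure_recall:.2f}")
print(f"failure precision: {failure_precision:.2f}")
```

On this split the always-Success baseline scores slightly higher than the model, and the model catches only 13 of the 78 failures, so accuracy alone overstates how well failures are predicted.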

