# Exploratory Data Analysis (EDA) and Feature Engineering

In this notebook, we explore Falcon 9 launch data to identify patterns and relationships that may affect the success of first-stage landings. We then select and encode features to prepare the dataset for modeling.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# set_theme is the current name for the older sns.set alias
sns.set_theme(style="whitegrid")

## Data Loading

We begin by loading the cleaned dataset from the previous step.


In [None]:
df = pd.read_csv("../data/raw/dataset_part_2.csv")
df.head()

## Data Overview

Review the structure and completeness of the dataset to validate the inputs before visual analysis.

In [None]:
# Column dtypes and non-null counts
df.info()
# Percentage of missing values per column
df.isnull().sum() / len(df) * 100
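
The completeness check can also be collected into one table that shows counts and percentages side by side. The following is a minimal sketch on an invented toy frame: the column names mirror the dataset, but the values and the `missing_summary` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the launch data; all values are invented.
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4],
    "PayloadMass": [6104.0, 525.0, np.nan, 500.0],
    "LandingPad": [np.nan, np.nan, np.nan, "pad-A"],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    is_missing = frame.isnull()
    summary = pd.DataFrame({
        "n_missing": is_missing.sum(),
        "pct_missing": is_missing.mean() * 100,
    })
    # List the most incomplete columns first.
    return summary.sort_values("pct_missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would make it easier to see which columns need imputation before modeling.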

## Payload vs Flight Number

Plot payload mass against flight number, colored by landing outcome (`Class`), to see whether landing success changes with payload mass and with later flights.

In [None]:
sns.catplot(data=df, x="FlightNumber", y="PayloadMass", hue="Class", aspect=5)
plt.xlabel("Flight Number", fontsize=14)
plt.ylabel("Payload Mass (kg)", fontsize=14)
plt.title("Payload vs Flight Number by Landing Success")
plt.show()

## Launch Site Analysis

Explore how landing success varies across different launch sites.

In [None]:
sns.catplot(data=df, x="FlightNumber", y="LaunchSite", hue="Class", aspect=5)
plt.xlabel("Flight Number", fontsize=14)
plt.ylabel("Launch Site", fontsize=14)
plt.title("Launch Site vs Flight Number by Success")
plt.show()

## Payload Mass by Launch Site

Next, we examine how payload mass is distributed across launch sites and whether it relates to landing success.

In [None]:
sns.catplot(data=df, x="PayloadMass", y="LaunchSite", hue="Class", aspect=5)
plt.xlabel("Payload Mass (kg)", fontsize=14)
plt.ylabel("Launch Site", fontsize=14)
plt.title("Payload Mass vs Launch Site by Success")
plt.show()

## Success Rate by Orbit

We compute the landing success rate for each orbit type as the mean of the `Class` column (1 for a successful landing, 0 otherwise).

In [None]:
orbit_success = df.groupby("Orbit")["Class"].mean().reset_index()
sns.catplot(data=orbit_success, x="Orbit", y="Class", kind="bar")
plt.xlabel("Orbit Type", fontsize=14)
plt.ylabel("Success Rate", fontsize=14)
plt.title("Landing Success Rate by Orbit")
plt.show()
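
A rate computed from only a few launches can be misleading, so it may help to report the number of launches next to each rate. This is a minimal sketch on invented toy data (the orbit labels and outcomes are assumptions, not values from the dataset):

```python
import pandas as pd

# Invented outcomes; 1 = successful landing, 0 = failure.
toy = pd.DataFrame({
    "Orbit": ["LEO", "LEO", "GTO", "GTO", "GTO", "ISS"],
    "Class": [1, 0, 0, 0, 1, 1],
})

# Named aggregation: one column for the rate, one for the sample size.
orbit_stats = (
    toy.groupby("Orbit")["Class"]
    .agg(success_rate="mean", launches="count")
    .reset_index()
    .sort_values("success_rate", ascending=False)
)
print(orbit_stats)
```

Here the toy `ISS` row has a perfect rate but rests on a single launch, which is the kind of case the `launches` column is meant to expose.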

## Orbit Type by Flight Number

We now look at how different orbits were used across the flight timeline.

In [None]:
sns.catplot(data=df, x="FlightNumber", y="Orbit", hue="Class", aspect=5)
plt.xlabel("Flight Number", fontsize=14)
plt.ylabel("Orbit", fontsize=14)
plt.title("Orbit vs Flight Number by Success")
plt.show()

## Success Rate Over Time

We extract the launch year from the date column and analyze the trend of successful landings over the years.

In [None]:
df["Year"] = pd.to_datetime(df["Date"]).dt.year
yearly_success = df.groupby("Year")["Class"].mean()

sns.lineplot(x=yearly_success.index, y=yearly_success.values)
plt.xlabel("Year", fontsize=14)
plt.ylabel("Success Rate", fontsize=14)
plt.title("Landing Success Rate Over Time")
plt.show()
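
The year extraction depends on how the date strings are parsed. The sketch below assumes ISO `YYYY-MM-DD` strings, which is an assumption since the format of the `Date` column is not stated here. Passing an explicit `format` makes a string in a different layout raise an error instead of being guessed at. The dates and outcomes are invented.

```python
import pandas as pd

# Invented dates and outcomes; the ISO format is an assumption.
toy = pd.DataFrame({
    "Date": ["2014-01-06", "2014-09-21", "2015-04-14", "2016-05-06", "2016-08-14"],
    "Class": [0, 1, 0, 1, 1],
})

# Parse with an explicit format, then keep only the calendar year.
toy["Year"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d").dt.year

# Mean of the 0/1 outcome per year is the yearly success rate.
yearly = toy.groupby("Year")["Class"].mean()
print(yearly)
```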

# Feature Engineering

In this section, we prepare the dataset for classification modeling by selecting the input features and encoding the categorical ones.

In [None]:
features = df[[
    'FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights',
    'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block',
    'ReusedCount', 'Serial'
]]
features.head()

## One-Hot Encoding

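We transform the categorical variables (`Orbit`, `LaunchSite`, `LandingPad`, `Serial`) into numerical format using one-hot encoding, so that each category becomes its own indicator column. We then cast every column to `float64` so the feature matrix has a single numeric dtype that classification models can consume.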

In [None]:
features_encoded = pd.get_dummies(features, columns=['Orbit', 'LaunchSite', 'LandingPad', 'Serial'])
features_encoded = features_encoded.astype('float64')
features_encoded.head()
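
To make the effect of the encoding concrete, here is a minimal sketch on an invented three-row frame (the values are assumptions; only the column names come from the dataset):

```python
import pandas as pd

# Invented rows; one numeric, one categorical, and one boolean column.
toy = pd.DataFrame({
    "PayloadMass": [500.0, 3170.0, 6104.0],
    "Orbit": ["LEO", "GTO", "LEO"],
    "GridFins": [False, True, True],
})

# Only "Orbit" is expanded; the other columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["Orbit"])

# Cast indicator and boolean columns to float so all dtypes match.
encoded = encoded.astype("float64")

print(encoded.columns.tolist())
print(encoded)
```

The original `Orbit` column is replaced by one column per category, named with the `Orbit_` prefix. A high-cardinality column such as `Serial` would add many such columns in the same way.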

## Save Processed Dataset

We save the final processed dataset for use in the next step of model building.

In [None]:
# Path written relative to the notebook, matching the "../data/" prefix used when loading
features_encoded.to_csv("../data/processed/dataset_part_3.csv", index=False)
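
`to_csv` does not create missing directories, so the save fails if the target folder does not exist yet. A minimal sketch of a safer save follows. It uses a temporary directory as a stand-in for the project root and a tiny invented frame, both of which are assumptions for illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Tiny invented frame standing in for the encoded features.
encoded = pd.DataFrame({"PayloadMass": [500.0, 3170.0], "Orbit_LEO": [1.0, 0.0]})

# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as root:
    out_path = Path(root) / "data" / "processed" / "dataset_part_3.csv"
    # Create the folder tree if needed; do nothing if it already exists.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    encoded.to_csv(out_path, index=False)
    # Read the file back to confirm the columns survived the round trip.
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```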