In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

# Data processing and manipulation
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
import re
import unicodedata
import datetime

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Widgets and display
import ipywidgets as widgets
from ipywidgets import VBox
from IPython.display import display, clear_output

# Machine Learning
from sklearn import preprocessing
from sklearn.model_selection import (train_test_split, GridSearchCV, StratifiedKFold)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



![d](https://i.imgur.com/aHykSXG.png)


# Table of Contents
---
1. [Introduction](#introduction)
2. [Data Collection and Preprocessing](#data-collect)
   - [SpaceX REST API](#spacex-api)
   - [Webscraping](#webscrape)
3. [Exploratory Data Analysis](#eda)
   - [Initial EDA](#initeda)
   - [EDA Visualization](#edavis)
4. [Model Building](#model-building)
   - [Feature Engineering](#feature-eng)
   - [Logistic Regression](#log-regreg)
   - [Support Vector Machine](#svm)
   - [Decision Trees](#decision-trees)
   - [kNN](#knn)
5. [Evaluation](#evaluation)
   - [Confusion Matrices](#confusion)
   - [Best Model and Parameters](#best-params)
6. [Final Model](#final-model)
7. [Conclusion](#conclusion)



# **Introduction** <a name="introduction"></a> 

This project aims to predict whether the first stage of SpaceX's Falcon 9 will land successfully. SpaceX advertises Falcon 9 launches on its official website for 62 million dollars, significantly less than other providers, who charge upwards of 165 million dollars. A major reason for the lower price is that SpaceX can reuse the rocket's first stage. If we can predict whether the first stage will land, we can estimate the cost of a launch. This analysis is useful for potential competitors who want to bid against SpaceX for rocket launches. In this study, we gather and preprocess data from SpaceX's REST API, ensuring it is in the appropriate format. 

# **Data Collection and Preprocessing** <a name="data-collect"></a> 
-----

In the data collection phase, we use the SpaceX REST API to gather data on the launch and landing sites, the booster types, and the landing outcome of each booster. We also scrape the historical Falcon 9 launch records from the Wikipedia page [List of Falcon 9 and Falcon Heavy launches](https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches).

---
![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png)


##  SpaceX REST API <a name="spacex-api"></a>
---
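
Before calling the live API, it can help to see what `pd.json_normalize` does to launch-style JSON. The sketch below is only an illustration: the records are hand-written, and the IDs and `example.com` URLs are made-up placeholders rather than real API output. It shows that nested dictionaries become dotted column names while lists stay as Python lists, which is why the notebook later filters on `map(len)`.

```python
import pandas as pd

# Hypothetical records that only mimic the nesting of a launch entry;
# the IDs and URLs are placeholders, not real API data.
sample_launches = [
    {
        "flight_number": 1,
        "rocket": "rocket-id-a",
        "payloads": ["payload-id-a"],
        "cores": [{"core": "core-id-a", "landing_success": None}],
        "links": {"webcast": "https://example.com/a"},
    },
    {
        "flight_number": 2,
        "rocket": "rocket-id-b",
        "payloads": ["payload-id-b", "payload-id-c"],
        "cores": [{"core": "core-id-b", "landing_success": True}],
        "links": {"webcast": "https://example.com/b"},
    },
]

# Nested dicts become dotted columns; lists are kept as Python lists in a cell
flat = pd.json_normalize(sample_launches)

# Keep only launches with exactly one payload, as the notebook does
single_payload = flat[flat["payloads"].map(len) == 1]

print(flat.columns.tolist())
print(single_payload["flight_number"].tolist())
```

Only the first record survives the single-payload filter, because the second one lists two payload IDs.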

In [None]:
# Helper functions that use the identification numbers in the launch data to query the API.
# Each helper appends its results to the global lists defined in a later cell.

# Takes the dataset and uses the rocket column to call the API and append the data to the list
def getBoosterVersion(data):
    for x in data['rocket']:
        if x:
            response = requests.get("https://api.spacexdata.com/v4/rockets/"+str(x)).json()
            BoosterVersion.append(response['name'])

    return BoosterVersion

# Takes the dataset and uses the launchpad column to call the API and append the data to the lists
def getLaunchSite(data):
    for x in data['launchpad']:
        if x:
            response = requests.get("https://api.spacexdata.com/v4/launchpads/"+str(x)).json()
            Longitude.append(response['longitude'])
            Latitude.append(response['latitude'])
            LaunchSite.append(response['name'])

# Takes the dataset and uses the payloads column to call the API and append the data to the lists
def getPayload(data):
    for load in data['payloads']:
        if load:
            response = requests.get("https://api.spacexdata.com/v4/payloads/"+load).json()
            PayloadMass.append(response['mass_kg'])
            Orbit.append(response['orbit'])

# Takes the dataset and uses the cores column to call the API and append the data to the lists
def getCore(data):
    for core in data['cores']:
        if core['core'] is not None:
            response = requests.get("https://api.spacexdata.com/v4/cores/"+core['core']).json()
            Block.append(response['block'])
            ReusedCount.append(response['reuse_count'])
            Serial.append(response['serial'])
        else:
            # No core ID: keep the lists aligned with placeholder values
            Block.append(None)
            ReusedCount.append(None)
            Serial.append(None)
        Outcome.append(str(core['landing_success'])+' '+str(core['landing_type']))
        Flights.append(core['flight'])
        GridFins.append(core['gridfins'])
        Reused.append(core['reused'])
        Legs.append(core['legs'])
        LandingPad.append(core['landpad'])


In [None]:
# Obtaining JSON file and normalizing it
spacex_url="https://api.spacexdata.com/v4/launches/past"
response = requests.get(spacex_url)
data = pd.json_normalize(response.json())
data.head()

In [None]:
# Take a subset of the dataframe, keeping only the features we want plus flight_number and date_utc.
data = data[['rocket', 'payloads', 'launchpad', 'cores', 'flight_number', 'date_utc']]

# Remove rows with multiple cores (Falcon Heavy launches, which fly two extra boosters) and rows with multiple payloads on a single rocket.
data = data[data['cores'].map(len)==1]
data = data[data['payloads'].map(len)==1]

# Since payloads and cores are now lists of size 1, extract the single value from each list and replace the feature.
data['cores'] = data['cores'].map(lambda x : x[0])
data['payloads'] = data['payloads'].map(lambda x : x[0])

# Convert date_utc to a datetime and keep only the date, dropping the time
data['date'] = pd.to_datetime(data['date_utc']).dt.date

# Restrict the launches to those on or before 2020-11-13
data = data[data['date'] <= datetime.date(2020, 11, 13)]
data.head()

In [None]:
# Lists used to build the new DataFrame
BoosterVersion = []
PayloadMass = []
Orbit = []
LaunchSite = []
Outcome = []
Flights = []
GridFins = []
Reused = []
Legs = []
LandingPad = []
Block = []
ReusedCount = []
Serial = []
Longitude = []
Latitude = []

In [None]:
# Call the helper functions defined above to populate the lists
getBoosterVersion(data)
getLaunchSite(data)
getPayload(data)
getCore(data)

In [None]:
# Constructing our dataset with the data we have obtained
launch_dict = {'FlightNumber': list(data['flight_number']),
'Date': list(data['date']),
'BoosterVersion':BoosterVersion,
'PayloadMass':PayloadMass,
'Orbit':Orbit,
'LaunchSite':LaunchSite,
'Outcome':Outcome,
'Flights':Flights,
'GridFins':GridFins,
'Reused':Reused,
'Legs':Legs,
'LandingPad':LandingPad,
'Block':Block,
'ReusedCount':ReusedCount,
'Serial':Serial,
'Longitude': Longitude,
'Latitude': Latitude}
#checking for length of values
for key, value in launch_dict.items():
    print(f"{key}: {len(value)}")

In [None]:
launch_df = pd.DataFrame(launch_dict)
launch_df.head()

In [None]:
# Filter out Falcon 1 launches and renumber the remaining flights
data_falcon9 = launch_df[launch_df['BoosterVersion'] != 'Falcon 1'].copy()
data_falcon9.loc[:, 'FlightNumber'] = list(range(1, data_falcon9.shape[0] + 1))
data_falcon9.head()

In [None]:
# Dealing with missing data: PayloadMass gaps are filled with the column mean.
# The LandingPad column keeps its None values, which represent launches where no landing pad was used.

payload_mass_mean = data_falcon9['PayloadMass'].mean()
# Assign the result back instead of using inplace=True on a column selection
data_falcon9['PayloadMass'] = data_falcon9['PayloadMass'].fillna(payload_mass_mean)
data_falcon9.isnull().sum()

#Writing out our first dataset
data_falcon9.to_csv('dataset_part_1.csv', index=False)

## Webscraping Data <a name="webscrape"></a>
---
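
The helper functions in the next cell work on individual table cells. As a rough illustration of the overall idea, the sketch below parses a tiny hand-written HTML table with Beautiful Soup. The table contents (`Site A`, `Site B`) are invented placeholders, and the real Wikipedia markup is more irregular than this, which is why the notebook needs dedicated helpers.

```python
from bs4 import BeautifulSoup

# Hypothetical table, loosely shaped like a launch table; not real Wikipedia markup
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th scope="col">Flight No.</th><th scope="col">Launch site</th><th scope="col">Launch outcome</th></tr>
  <tr><th scope="row">1</th><td><a href="#">Site A</a></td><td>Success</td></tr>
  <tr><th scope="row">2</th><td><a href="#">Site B</a></td><td>Failure</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")
rows = table.find_all("tr")

# The first row holds the column names
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    # The flight number sits in the row header (<th>), the rest in <td> cells
    flight_no = row.th.get_text(strip=True)
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    records.append(dict(zip(headers, [flight_no, *cells])))

print(records)
```

Building each row in the same order as `headers` before zipping keeps every value under the right column name.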

In [None]:
# Below we will define a series of helper functions that will help process web scraped HTML tables

def date_time(table_cells):
    """Return the date and time strings from a date/time table cell."""
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """Return the booster version text from a table cell."""
    out = ''.join([s for i, s in enumerate(table_cells.strings) if i % 2 == 0][0:-1])
    return out

def landing_status(table_cells):
    """Return the landing status (the first string) from a table cell."""
    out = [i for i in table_cells.strings][0]
    return out

def get_mass(table_cells):
    """Return the payload mass text up to and including 'kg', or 0 if the cell is empty."""
    mass = unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        new_mass = mass[0:mass.find("kg")+2]
    else:
        new_mass = 0
    return new_mass

def extract_column_from_header(row):
    """Return the cleaned column name from a <th> element (None for purely numeric headers)."""
    if row.br:
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()

    column_name = ' '.join(row.contents)

    # Filter out the digit-only names
    if not column_name.strip().isdigit():
        column_name = column_name.strip()
        return column_name


In [None]:
# We will now use Beautiful Soup to extract all columns and variable names from the HTML table header
# obtaining HTML from the website
static_url = "https://en.wikipedia.org/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches&oldid=1027686922"
response = requests.get(static_url)
html = response.content
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")
html_tables = tables
first_launch_table = html_tables[2]
# checking our work here to make sure it pulls in the HTML
#print(first_launch_table)


In [None]:
# Iterate through the <th> elements and apply the provided extract_column_from_header() to extract column name one by one
column_names = []
# Apply find_all() function with `th` element on first_launch_table
first_row = first_launch_table.find_all("th")
# Iterate each th element and apply the provided extract_column_from_header() to get a column name
for row in first_row:
    column_name = extract_column_from_header(row)
    if column_name:
        column_names.append(column_name)
#print(column_names)

In [None]:
# Prepare the dict for DataFrame creation
launch_dict = dict.fromkeys(column_names)

# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each launch_dict value as an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Added some new columns
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]

In [None]:
# Dict creation
extracted_row = 0
headers = ['Flight No.', 'Date', 'Time', 'Version Booster', 'Launch site', 'Payload', 
           'Payload mass', 'Orbit', 'Customer', 'Launch outcome', 'Booster landing']

# Initialize the dictionary
launch_dict = {header: [] for header in headers}

def extract_data_from_row(row):
    flight_number = row.th.string.strip() if row.th and row.th.string and row.th.string.strip().isdigit() else None
    if not flight_number:
        return

    row_data = row.find_all('td')
    date_time_list = date_time(row_data[0])

    # Build the values in the same order as `headers` so zip() pairs them correctly
    data = [
        flight_number,  # Flight No.
        date_time_list[0].strip(',') if date_time_list else None,  # Date
        date_time_list[1] if len(date_time_list) > 1 else None,  # Time
        booster_version(row_data[1]) or row_data[1].a.string,  # Version Booster
        row_data[2].a.string if row_data[2].a else None,  # Launch site
        row_data[3].a.string if row_data[3].a else None,  # Payload
        get_mass(row_data[4]),  # Payload mass
        row_data[5].a.string if row_data[5].a else None,  # Orbit
        row_data[6].a.string if row_data[6].a else None,  # Customer
        list(row_data[7].strings)[0],  # Launch outcome
        landing_status(row_data[8]).strip('\n'),  # Booster landing
    ]

    # Append data to the dictionary
    for key, value in zip(headers, data):
        launch_dict[key].append(value)

for table in soup.find_all('table', "wikitable plainrowheaders collapsible"):
    for row in table.find_all("tr"):
        extract_data_from_row(row)
        extracted_row += 1

#print(launch_dict)


In [None]:
#DataFrame creation
df= pd.DataFrame({ key:pd.Series(value) for key, value in launch_dict.items() })
df.replace("Success\n","Success",inplace=True)
df.replace("No attempt\n","No attempt",inplace=True)
df.head(10)

In [None]:
df.to_csv('spacex_web_scraped.csv', index=False)

# **Exploratory Data Analysis and Data Wrangling**<a name="eda"></a>
---

## Initial EDA <a name="initeda"></a>
----

In [None]:
# Identifying which columns are numerical and categorical
df=pd.read_csv("/kaggle/working/dataset_part_1.csv")
df.dtypes

In [None]:
#Calculating the numbers of launches on each site and orbit counts
#Then storing landing outcomes into a variable
display(df['LaunchSite'].value_counts())
display(df['Orbit'].value_counts())
landing_outcomes = df['Outcome'].value_counts()
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)


Each outcome combines a success flag with a landing type:

- <code>True Ocean</code> means the first stage landed successfully in a specific region of the ocean, while <code>False Ocean</code> means the ocean landing was unsuccessful.
- <code>True RTLS</code> means the first stage landed successfully on a ground pad, while <code>False RTLS</code> means the ground pad landing was unsuccessful.
- <code>True ASDS</code> means the first stage landed successfully on a drone ship, while <code>False ASDS</code> means the drone ship landing was unsuccessful.
- <code>None ASDS</code> and <code>None None</code> represent a failure to land.
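
The cell above picks the bad outcomes by their position in `value_counts()`, which depends on how often each outcome occurs. As an alternative sketch (not the notebook's method), the same rule can be expressed from the outcome string itself: only outcomes that start with `True` count as a successful landing. The function name `landing_class_from_outcome` and the sample series are hypothetical.

```python
import pandas as pd

def landing_class_from_outcome(outcome: str) -> int:
    """Return 1 if the outcome string starts with 'True', otherwise 0."""
    return int(str(outcome).split(" ")[0] == "True")

# Hypothetical sample covering every outcome label described above
outcomes = pd.Series([
    "True ASDS", "None None", "False Ocean", "True RTLS",
    "None ASDS", "True Ocean", "False ASDS", "False RTLS",
])

classes = outcomes.map(landing_class_from_outcome)
print(pd.DataFrame({"Outcome": outcomes, "Class": classes}))
```

Because the rule reads the label rather than its rank, it gives the same classes even if the frequency order of the outcomes changes.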


In [None]:
# Create the landing outcome label from the Outcome column:
# landing_class = 0 if the outcome is in bad_outcomes, 1 otherwise
landing_class = df['Outcome'].apply(lambda x: 0 if x in bad_outcomes else 1)
df['Class']=landing_class
display(df.head())
#Calculate mean/success rate over all
print("Success rate "+ str(df["Class"].mean()))

In [None]:
#write new data set to csv
df.to_csv('dataset_part_2.csv', index=False)

## EDA Visualization <a name="edavis"></a>
----

In [None]:
# reading in part 2 of data set
df=pd.read_csv("/kaggle/working/dataset_part_2.csv")

In [None]:
#Visualizing Flight Number vs Payload Mass
sns.catplot(y="PayloadMass", x="FlightNumber", hue="Class", data=df, aspect = 5)
plt.title("Payload mass of different Falcon9 Flights",fontsize=20,fontweight='bold')
plt.xlabel("Flight Number",fontsize=20, fontweight='bold')
plt.ylabel("Payload Mass (kg)",fontsize=20, fontweight='bold')
plt.show()

## *Observations*
1. Payload Mass Significance: The variation in payload mass across flights likely reflects the diverse range of missions Falcon 9 rockets undertook. Some missions carry heavier instruments or more cargo, leading to a higher payload mass, while others are lighter.
2. Flight Capability: Both low and high payload masses appear throughout the flight numbers, which suggests that Falcon 9's capability to handle a wide range of payload masses has been consistent over time.

In [None]:
#Visualizing Flight Number vs Launch Site
sns.catplot(y="LaunchSite", x="FlightNumber", hue="Class", data=df, aspect = 5)
plt.title("Flight Number Vs Launch Site",fontsize=20,fontweight='bold')
plt.xlabel("Flight Number",fontsize=20,fontweight='bold')
plt.ylabel("Launch Site",fontsize=20, fontweight='bold')
plt.show()

## *Observations*
1. Most flights are launched from CCSFS SLC 40, followed by KSC LC 39A, and then VAFB SLC 4E.
2. The flight numbers range from 1 to above 80, showing the progression of Falcon 9 flights over time.
3. There is no evident pattern suggesting that later flights (higher flight numbers) have a higher or lower success rate. Both classes are well distributed across flight numbers.

In [None]:
#Visualizing Payload vs Launch Site
sns.catplot(y="LaunchSite", x="PayloadMass", hue="Class", data=df, aspect = 5)
plt.title("Payload mass vs. Launch Site by class",fontsize=20,fontweight='bold')
plt.xlabel("Payload Mass (kg)",fontsize=20, fontweight='bold')
plt.ylabel("Launch Site",fontsize=20, fontweight='bold')
plt.show()

## *Observations*
1. Light Payloads (below 5,000 kg): Most launches across all sites seem successful (Class 1). There are only a few Class 0 instances in this range, especially at the VAFB SLC 4E site.
2. Intermediate Payloads (5,000 kg to 10,000 kg): A more even mix of both classes is observed, particularly at the CCSFS SLC 40 and KSC LC 39A sites.
3. Heavy Payloads (above 10,000 kg): Most launches seem successful (Class 1). Notably, the heaviest payloads, which are close to or above 15,000 kg, are all launched from the CCSFS SLC 40 site and appear successful.
4. The CCSFS SLC 40 site appears to be versatile, accommodating both light and heavy payloads with a good success rate.

In [None]:
#Visualizing Success Rate by Orbit Type
orbit_mean = df.groupby(['Orbit'])['Class'].mean().reset_index()
orbit_mean = orbit_mean.sort_values(['Class'], ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(y="Class", x="Orbit", data=orbit_mean)
plt.title("Success Rate by Orbit Type",fontsize=25,fontweight='bold')
plt.xlabel("Orbit Type",fontsize=20)
plt.ylabel("Success Rate",fontsize=20)
plt.xticks(rotation=45)
plt.show()

## *Observations*
1. Orbits 'ES-L1', 'GEO', and 'HEO' have high success rates, nearing 1.0 or 100%. As we move towards 'SO', the success rate decreases, with 'SO' having the lowest among the displayed orbits.
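
A success rate near 100% means little if it comes from one or two launches, so it can be useful to show the launch count next to the mean. The sketch below does this with a small made-up table (the orbit labels and classes are illustrative values, not the notebook's data), using the same `Orbit` and `Class` column names.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    "Orbit": ["GEO", "LEO", "LEO", "LEO", "SO", "ISS", "ISS"],
    "Class": [1, 1, 0, 1, 0, 1, 1],
})

# Success rate and number of launches per orbit, highest rate first
summary = (
    toy.groupby("Orbit")["Class"]
    .agg(success_rate="mean", launches="count")
    .sort_values("success_rate", ascending=False)
)
print(summary)
```

In this toy table, `GEO` has a perfect rate from a single launch, while `LEO` has a lower rate backed by three launches.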

In [None]:
# Visualizing Flight Number and Orbit Type
sns.set_palette("husl",2)
sns.scatterplot(y="Orbit", x="FlightNumber", hue="Class", data=df)
plt.title("Flight Number vs Orbit",fontsize=25,fontweight='bold')
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

## *Observations*
1. Initial Challenges with New Missions: Failures (orange dots) are concentrated in the earlier flight numbers for several orbit types, suggesting that initial missions faced challenges. As the flight numbers increase, there is a noticeable shift towards more successful missions.
2. In summary, most orbit types show a trend of improvement over time. Orbits such as 'GEO', 'SO', and 'ES-L1' are sparsely populated, so their success rates rest on very few launches and should be read with caution.

In [None]:
# Visualizing Payload and Orbit Type
sns.scatterplot(y="Orbit", x="PayloadMass", hue="Class", data=df) 
plt.title("Payload vs Orbit",fontsize=25,fontweight='bold')
plt.xlabel("Payload",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()






In [None]:
df['Year'] = df['Date'].str.split('-').str[0]

# Calculate the success rate for each year
success_rate_by_year = df.groupby('Year')['Class'].mean()

# Reset the index to make 'Year' a regular column
success_rate_by_year = success_rate_by_year.reset_index()

# Plot a line chart with x-axis as the year and y-axis as the success rate
sns.lineplot(x='Year', y='Class', data=success_rate_by_year)
plt.title("Success Rate by Year",fontsize=25,fontweight='bold')
plt.xlabel("Year", fontsize=20)
plt.ylabel("Success Rate", fontsize=20)
plt.show()

## *Observations*
1. Success rates show an upward trend over the years.

## Dashboard for Launches and Payloads
*Note: this dashboard does not work in the published notebook, only while editing it. I may add a link to an interactive version.*

In [None]:
spacex_df = pd.read_csv("/kaggle/input/spacex-launches-dash/spacex_launch_dash.csv")


# Widgets
site_dropdown = widgets.Dropdown(
    options=['ALL', 'CCAFS LC-40', 'CCAFS SLC-40', 'KSC LC-39A', 'VAFB SLC-4E'],
    value='ALL',
    description='Launch Site:'
)

payload_slider = widgets.IntRangeSlider(
    value=[spacex_df['Payload Mass (kg)'].min(), spacex_df['Payload Mass (kg)'].max()],
    min=0,
    max=10000,
    step=1000,
    description='Payload (kg):',
    continuous_update=False
)

update_button = widgets.Button(description="Update Visuals")

# Output widget
plot_output = widgets.Output()

# Update function
def update_visuals(button=None):
    selected_site = site_dropdown.value
    low_payload, high_payload = payload_slider.value

    # Filtering the data
    filtered_df = spacex_df[(spacex_df['Payload Mass (kg)'] >= low_payload) & (spacex_df['Payload Mass (kg)'] <= high_payload)]
    if selected_site != 'ALL':
        filtered_df = filtered_df[filtered_df['Launch Site'] == selected_site]

    # Creating the pie chart
    if selected_site == 'ALL':
        pie_fig = px.pie(filtered_df[filtered_df['class'] == 1], names='Launch Site', title='Total Successful Launches By Site')
    else:
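        # px's `labels` argument renames columns, not values, so map 0/1 to readable slice names instead
        pie_df = filtered_df.assign(Outcome=filtered_df['class'].map({0: 'Failure', 1: 'Success'}))
        pie_fig = px.pie(pie_df, names='Outcome', title=f"Success vs Failure for site {selected_site}")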

    # Creating the scatter plot
    scatter_fig = px.scatter(
        filtered_df, 
        x='Payload Mass (kg)', 
        y='class', 
        color='Booster Version Category',
        title=f"Correlation between Payload and Success for site {selected_site}"
    )
    
    with plot_output:
        clear_output(wait=True)
        pie_fig.show()
        scatter_fig.show()

update_button.on_click(update_visuals)

# Display the widgets
display(VBox([site_dropdown, payload_slider, update_button, plot_output]))
update_visuals()  # Display the initial visuals



## *Observations*

1. B4 stands out as the most versatile booster, capable of handling a wide range of payloads, from very light to heavy (up to 10,000 kg), and maintaining a high success rate.
2. Other booster versions, such as **B5, FT, v1.0, and v1.1**, have been used for varying payload masses but seem more concentrated in the 0 to 6,000 kg range.
3. All booster versions consistently exhibit a high success rate across different payload masses, indicating reliable performance.
4. While present in the higher payload range, the FT booster does not handle payloads as heavy as the B4.
5. None of the other booster versions (except B4) are shown to manage the heaviest payloads (close to 10,000 kg).

In summary, while all booster versions demonstrate reliable performance across different payload masses, the B4 booster stands out in terms of versatility and capability to manage a broader spectrum of payloads, including the heaviest ones.


# **Model Building** <a name="model-building"></a> 
-----

## **Feature Engineering** <a name="feature-eng"></a> 
 ---

In [None]:
# Select the variables that may affect the landing success rate
features = df[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block', 'ReusedCount', 'Serial']]
features.head()

In [None]:
# one hot encoding
features_one_hot = pd.get_dummies(features, columns=['LaunchSite', 'Orbit', 'LandingPad', 'Serial'])
features_one_hot.head()

In [None]:
features_one_hot = features_one_hot.astype('float64')
#features_one_hot.dtypes
features_one_hot.to_csv('dataset_part_3.csv', index=False)

## *Splitting Data up for training* <a name="log-greg"></a> 

In [None]:
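data = pd.read_csv('/kaggle/working/dataset_part_2.csv')
X = pd.read_csv('/kaggle/working/dataset_part_3.csv')
Y = data['Class'].to_numpy()
# Stratify so the train and test sets keep the original class ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2, stratify=Y)
# Fit the scaler on the training set only, so no test-set statistics leak into training
transform = preprocessing.StandardScaler()
X_train = transform.fit_transform(X_train)
X_test = transform.transform(X_test)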


In [None]:
print(pd.Series(Y_train).value_counts())
print(pd.Series(Y_test).value_counts())

*Learned something new here:*
After applying stratified sampling, the training set contains 48 samples of class 1 and 24 samples of class 0, while the test set contains 12 samples of class 1 and 6 samples of class 0. Both sets therefore keep the 2:1 ratio of class 1 to class 0 found in the original dataset.
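
The effect can be reproduced in isolation. The sketch below uses synthetic labels with the same 2:1 ratio (60 ones and 30 zeros, an assumption chosen to match the counts above) and a placeholder feature column, then counts the classes on each side of a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 2:1 class ratio (60 ones, 30 zeros)
y = np.array([1] * 60 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature; its values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

# np.bincount returns [count of class 0, count of class 1]
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)
```

Without `stratify=y`, the class counts in an 18-sample test set could drift away from 2:1 purely by chance.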

## *Setting Parameters for GridCV sampling*

In [None]:
# The grid searches below are slow; this cell may be commented out in the final version so that the notebook runs faster.


# Defining hyperparameter tuning parameters for each model
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

# parameters for Decision Tree
tree_parameters = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [2*n for n in range(1, 10)],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'class_weight': [None, 'balanced'],
    'ccp_alpha': np.linspace(0, 0.035, 10)
}

# parameters for K-Nearest Neighbors
knn_parameters = {
    'n_neighbors': list(range(1, 11)),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2],
    'leaf_size': list(range(20, 41, 5)),
    'n_jobs': [-1]
}

# parameters for Logistic Regression
logreg_parameters = {
    "C": [0.01, 0.1, 1],
    "penalty": ['l2'],
    "solver": ['lbfgs'],
    "fit_intercept": [True, False],
    "class_weight": [None, 'balanced'],
    "max_iter": [100, 200, 500],
    "tol": [1e-4, 1e-3, 1e-2],
    "multi_class": ['ovr', 'multinomial']
}

# parameters for Support Vector Machine (SVM)
svm_parameters = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': np.logspace(-3, 3, 5),
    'gamma': np.logspace(-3, 3, 5),
    'degree': [2, 3, 4],
    'coef0': np.linspace(-1, 1, 5),
    'shrinking': [True, False],
    'probability': [True, False]
}

# Create lists to store model results
model_names = []
accuracies = []
test_accuracies = []
best_params_list = []

# Decision Tree
tree = DecisionTreeClassifier()
tree_cv = GridSearchCV(tree, tree_parameters, cv=stratified_kfold)
tree_cv.fit(X_train, Y_train)
best_params_tree = tree_cv.best_params_
model_names.append('Decision Tree')
accuracies.append(tree_cv.best_score_)
test_accuracies.append(tree_cv.score(X_test, Y_test))
best_params_list.append(best_params_tree)

# K-Nearest Neighbors
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, knn_parameters, cv=stratified_kfold)
knn_cv.fit(X_train, Y_train)
best_params_knn = knn_cv.best_params_
model_names.append('K-Nearest Neighbors')
accuracies.append(knn_cv.best_score_)
test_accuracies.append(knn_cv.score(X_test, Y_test))
best_params_list.append(best_params_knn)

# Logistic Regression
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, logreg_parameters, cv=stratified_kfold)
logreg_cv.fit(X_train, Y_train)
best_params_logreg = logreg_cv.best_params_
model_names.append('Logistic Regression')
accuracies.append(logreg_cv.best_score_)
test_accuracies.append(logreg_cv.score(X_test, Y_test))
best_params_list.append(best_params_logreg)

# Support Vector Machine (SVM)
svm = SVC()
svm_cv = GridSearchCV(svm, svm_parameters, cv=stratified_kfold)
svm_cv.fit(X_train, Y_train)
best_params_svm = svm_cv.best_params_
model_names.append('Support Vector Machine')
accuracies.append(svm_cv.best_score_)
test_accuracies.append(svm_cv.score(X_test, Y_test))
best_params_list.append(best_params_svm)

# Save best parameters to json if needed later
best_params = {
    'tree': best_params_tree,
    'knn': best_params_knn,
    'logreg': best_params_logreg,
    'svm': best_params_svm
}
with open('/kaggle/working/best_params.json', 'w') as f:
    json.dump(best_params, f)

# **Evaluation** <a name="evaluation"></a> 
 ---

## *Confusion matrix plot of each model* <a name="confusion"></a> 

In [None]:
# Create a grid of confusion matrices with titles
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Confusion Matrices for Different Models', fontsize=16)

for i, model_name in enumerate(model_names):
    row, col = divmod(i, 2)
    ax = axes[row, col]

    model = None
    if model_name == 'Decision Tree':
        model = DecisionTreeClassifier(**best_params_list[i])  # Use the best hyperparameters
    elif model_name == 'K-Nearest Neighbors':
        model = KNeighborsClassifier(**best_params_list[i])  # Use the best hyperparameters
    elif model_name == 'Logistic Regression':
        model = LogisticRegression(**best_params_list[i])  # Use the best hyperparameters
    elif model_name == 'Support Vector Machine':
        model = SVC(**best_params_list[i])  # Use the best hyperparameters

    model.fit(X_train, Y_train)
    y_pred = model.predict(X_test)

    # Calculate the confusion matrix
    cm = confusion_matrix(Y_test, y_pred)

    # Plot the confusion matrix with title
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=ax)
    ax.set_title(f'Confusion Matrix - {model_name}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust the layout
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

## *Best model and parameters* <a name="best-params"></a> 

In [None]:
# Create a DataFrame to store the results
data = {'Model': model_names, 'Accuracy': accuracies, 'Test Accuracy': test_accuracies, 'Best Params': best_params_list}
df = pd.DataFrame(data)

# Identify the best-performing model
best_model = df.loc[df['Accuracy'].idxmax()]

print('Best model is', best_model['Model'], 'with an accuracy of', best_model['Accuracy'])
print('Best hyperparameters are:', best_model['Best Params'])

# Plot the model accuracies
plt.figure(figsize=(10, 5))
sns.barplot(x='Model', y='Accuracy', data=df)
plt.title('Models Accuracy')
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.xticks(rotation=45)
plt.show()
df

# **Final Model Training** <a name="final-model"></a> 

In [None]:
# Initialize the Decision Tree Classifier with the best hyperparameters
final_model = DecisionTreeClassifier(
    ccp_alpha=0.019444444444444445,
    class_weight=None,
    criterion='gini',
    max_depth=16,
    max_features='sqrt',
    min_samples_leaf=1,
    min_samples_split=10,
    splitter='random'
)

# Train the model on the full training set
final_model.fit(X_train, Y_train)

# Evaluate the model on the test set
test_accuracy = final_model.score(X_test, Y_test)
print(f"Final Model Test Accuracy: {test_accuracy * 100:.2f}%")


# **Conclusion** <a name="conclusion"></a> 
----

***Trend Analysis*:**
Landing success rates have risen over the years covered by the data. This consistent upward trajectory likely reflects improvements in technology, processes, and expertise.

***Orbital Landings*:**
A significant achievement has been successfully landing all first stages of rockets designated for ES-L1, GEO, HEO, and SSO orbits. This showcases SpaceX's reusable rocket technology's efficiency and reliability across various mission types.

***Versatility of the B4 Booster*:**
The B4 booster has proven to be a workhorse for SpaceX. Its adaptability is noteworthy, with the capability to handle a vast spectrum of payloads. Whether it's light cargo or heavy equipment weighing up to 10,000 kg, the B4 maintains a commendable success rate, highlighting its engineering excellence.

***Launch Site Success Rates*:**
Among all the launch sites SpaceX uses, the KSC LC-39A has the highest success rate. This could be attributed to multiple factors such as location advantages, infrastructure, or the specific missions launched from this site.

***Machine Learning Insights*:** 
Of the algorithms evaluated for predicting landing success, the Decision Tree classifier achieved the highest cross-validated accuracy and was chosen as the final model. This suggests that landing outcomes follow identifiable patterns and rules that a decision tree can capture.