# Introduction

<div style="background: linear-gradient(135deg, #f3f8ff, #d7eaff); border: 3px dashed #00bfff; padding: 20px; border-radius: 20px; font-family: 'Permanent Marker', cursive, sans-serif; color: #444444; box-shadow: 0px 5px 15px rgba(0, 0, 0, 0.1);">
    <h1 style="color: #ff1493; text-shadow: 2px 2px #ffa07a; text-align: center; font-size: 2.8rem;">🌈 🚀 Rocket Launch Delays 🚀 🌈</h1>
    <p style="font-size: 1.3rem; line-height: 1.8; color: #444444; text-align: justify;">
        <strong>Welcome to My Notebook</strong> – the most electrifying rocket launch challenge presented by the <span style="color: #00bfff;">Ishan Purohit</span> program. 
        This notebook takes you on a dazzling ride through the wonders of space exploration, inspired by the iconic Space Race of the 20th century!
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        The Space Race was a legendary contest between the <span style="color: #ff4500;">Soviet Union (USSR)</span> and the <span style="color: #1e90ff;">United States (US)</span> during the Cold War, 
        pushing the limits of science and technology. Some milestones that shaped history include:
    </p>
    <ul style="list-style-type: '🌟 '; font-size: 1.2rem; margin-left: 20px; color: #333333;">
        <li><strong>4 October 1957:</strong> The launch of <em>Sputnik 1</em>, the first Earth-orbiting satellite.</li>
        <li><strong>3 November 1957:</strong> The launch of <em>Sputnik 2</em>, carrying <strong>Laika</strong>, the first animal to orbit the Earth.</li>
    </ul>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #444444;">
        <strong>SkyCast</strong> is all about recreating this spirit of daring innovation and curiosity. Get ready to aim for the stars and showcase 
        your genius in this exciting challenge!
    </p>
    <div style="text-align: center; margin-top: 20px;">
        <img src="https://w0.peakpx.com/wallpaper/654/570/HD-wallpaper-girl-seeing-rocket-launches-artist-artwork-artstation.jpg" alt="SkyCast Rocket Launch Challenge" 
             style="border: 5px solid #ff1493; border-radius: 15px; width: 90%; max-width: 800px; box-shadow: 0px 5px 20px rgba(255, 20, 147, 0.3);">
    </div>
    <p style="text-align: center; font-size: 1.1rem; margin-top: 15px; color: #00bfff; text-shadow: 1px 1px #ff4500;">
        🌟✨ Let’s light up the sky with your creativity! ✨🌟
    </p>
</div>


# Importing Libraries

<div style="background: linear-gradient(135deg, #fff7e6, #ffe4b5); border: 3px dotted #ff7f50; padding: 20px; border-radius: 15px; font-family: 'Comic Sans MS', cursive, sans-serif; color: #333333; box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.2);">
    <h2 style="color: #ff4500; text-shadow: 2px 2px #ffa07a; text-align: center; font-size: 2rem;">📚 Importing Libraries 🚀</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; text-align: justify;">
        To kickstart our journey, we’ll first import all the essential libraries required for this project. 
        These libraries will provide us with the tools for data manipulation, visualization, and advanced computations. 
    </p>
    <p style="font-size: 1.1rem; line-height: 1.8; color: #333333;">
        Let’s load up these libraries and unlock their power to make this project a success! 🚀
    </p>
    <p style="text-align: center; font-size: 1rem; margin-top: 10px; color: #ff6347; text-shadow: 1px 1px #ffa07a;">
        🛠️ Time to code like a pro! 🛠️
    </p>
</div>


In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier, BaggingClassifier
import lightgbm as lgb
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score

# Checking Dataset

<div style="background: linear-gradient(135deg, #f0f9ff, #e0ffff); border: 3px dashed #4682b4; padding: 20px; border-radius: 15px; font-family: 'Baloo Bhai', cursive, sans-serif; color: #333333; box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.2);">
    <h2 style="color: #1e90ff; text-shadow: 2px 2px #87cefa; text-align: center; font-size: 2.5rem;">🔍 Exploring the Dataset 🌟</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; text-align: justify;">
        Before we dive into the rocket science, let's take a closer look at our dataset! 
        Inspecting the dataset helps us understand the structure, dimensions, and types of data we’re dealing with. 
        This step is crucial for ensuring everything is in tip-top shape before moving forward.
    </p>
    <p style="font-size: 1.1rem; line-height: 1.8; color: #333333;">
        Let’s dive in and make sure our data is as ready as we are for this epic journey! 🚀✨
    </p>
    <p style="text-align: center; font-size: 1rem; margin-top: 10px; color: #4682b4; text-shadow: 1px 1px #b0e0e6;">
        🗂️ Time to explore and analyze! 🗂️
    </p>
</div>


In [None]:
data = pd.read_csv('/kaggle/input/sky-cast-margazhi-25/train.csv')
data_test = pd.read_csv('/kaggle/input/sky-cast-margazhi-25/test.csv')

In [None]:
data.head()

In [None]:
data_test.head()

<div style="background-color: #f0f8ff; border: 1px solid #dcdcdc; padding: 10px; border-radius: 5px; color: #333333; font-family: Arial, sans-serif; font-size: 14px;">
    <p style="margin: 0;">The unnamed columns can be dropped, since they only repeat the <strong>serial numbers</strong>.</p>
</div>


In [None]:
data.columns

In [None]:
data.drop(columns = ['Unnamed: 0.1', 'Unnamed: 0'], inplace = True)
data_test.drop(columns = ['Unnamed: 0.1', 'Unnamed: 0'], inplace = True)

In [None]:
data.info()

In [None]:
data_test.info()

# Exploratory Data Analysis

<div style="background: #c1f0f6; border: 3px solid #1e88e5; padding: 20px; border-radius: 15px; font-family: 'Comic Sans MS', sans-serif; color: #333333; box-shadow: 4px 4px 20px rgba(0, 0, 0, 0.2);">
    <h2 style="color: #1e88e5; text-align: center; font-size: 2.5rem; font-weight: bold;">🔍 Exploratory Data Analysis (EDA) 🚀</h2>
    <p style="font-size: 1.4rem; line-height: 1.8; color: #333333; text-align: center;">
        Buckle up, because we’re about to embark on an <em>explosive</em> adventure through the world of data! 🌎 With EDA, we start by understanding our rocket dataset and unraveling the secrets hidden inside. 🌠
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        Through visuals and statistics, we dive deep into the dataset’s distribution, correlations, and missing pieces. We find out which columns are flying high 🚀 and which need a bit more TLC 🛠️. The goal? To uncover insights, spot patterns, and understand what makes our data tick! 🔎
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        From identifying trends 📈 to discovering hidden relationships 🔗, this is the stage where we get to know our data like never before! We’ll spot outliers, correlations, and maybe even some data quirks! It’s the perfect way to prep for the next steps, where our models get ready for liftoff 🚀
    </p>
    <h3 style="color: #1e88e5; text-align: center; font-size: 2rem; font-weight: bold;">Data Visualization to the Rescue! 📊</h3>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333; text-align: center;">
        And of course, we can’t forget the power of visualizations! 🎨 Charts, plots, and graphs come to the rescue, providing a vibrant, colorful view of the data and making complex insights crystal clear. With EDA, we don’t just look at numbers; we let the data speak to us in <em>all its glory</em>! 💫
    </p>
    <h3 style="color: #1e88e5; text-align: center; font-size: 2rem; font-weight: bold;">Let the Data Journey Begin! 🌟</h3>
    <p style="font-size: 1.4rem; line-height: 1.8; color: #333333; text-align: center;">
        So, let’s start the journey by exploring the data, uncovering mysteries, and setting the stage for the <em>ultimate model-building adventure</em>! 💥 Ready to analyze? Let’s go! 🚀🔍
    </p>
</div>


### Company Name

In [None]:
companies = data.groupby(['Company Name'])['Detail'].count().sort_values(ascending=False).reset_index()

plt.figure(figsize=(25,7))
bar = sns.barplot(x='Company Name',y='Detail',data=companies[1:],palette='rocket')
b = bar.set_xticklabels(bar.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.ylabel('No of launches')
t=plt.title('Company vs launches')

### Status

In [None]:
status = data['Status Rocket'].value_counts()

fig = make_subplots(
    rows=1, cols=2, specs=[[{"type": "xy"}, {"type": "domain"}]]
)

fig.add_trace(
    go.Bar(
        x=status.index,
        y=status.values,
        text=status.values.tolist(),
        textposition='auto',
        marker_color='#003786',
        name='Status',
    ),
    row=1, col=1
)

fig.add_trace(
    go.Pie(
        labels=status.index,
        values=status.values,
        textposition='inside',
        textinfo='percent+label',
        marker={'colors': ['rgb(178,24,43)', 'rgb(253,219,199)']}
    ),
    row=1, col=2
)


fig.update_layout(
    title_text='Status of Rockets',
    font_size=10,
    autosize=False,
    width=800,
    height=400
)

fig.show()

In [None]:
data_ussr = data[data['Company Name'] == "RVSN USSR"]

bar = data_ussr['Status Mission'].value_counts().plot(kind='bar', figsize=(13, 4))
bar.set_xticklabels(bar.get_xticklabels(), rotation=0)

for p in bar.patches:
    bar.annotate(int(p.get_height()), 
                 (p.get_x() + p.get_width()/2, p.get_height()), ha='center', va='center', 
                 xytext=(0, 5), textcoords='offset points')

plt.xlabel('Status')
plt.title('Status of Rocket Launch')

plt.tight_layout()
plt.show()

### Status Mission

In [None]:
m = data['Status Mission'].value_counts()
mf = go.Figure([go.Pie(
    labels=m.keys(), values=m.values,
    textposition='inside', textinfo='percent+label',
    # Hex colors need a leading '#' to be valid CSS color strings
    marker={'colors': ["#0e58a8", "rgb(215,48,39)", "rgb(112,164,148)", "#e2d9e2"]}
)])
mf.update_layout(title_text='Status of Launch', font_size=10, autosize=False, width=700, height=400)

<div style="background: #fff3e6; border: 2px solid #ff9933; padding: 20px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333;">
    <h3 style="color: #ff9933; text-align: center;">Imbalance in Mission Success and Failure</h3>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        In the dataset, there is a noticeable imbalance as only <strong>10.3%</strong> of the missions have failed. 
        This creates an unequal distribution of success and failure, which may affect how the model learns patterns and generalizes predictions.
        With such a low proportion of failed missions, models might be biased towards predicting success, which could impact the accuracy and performance of failure predictions.
    </p>
</div>
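
One common way to keep the rare failure class represented when data is later split for validation is a stratified split. The sketch below is illustrative only: it uses a small synthetic label vector (not the competition data) with 10% failures, and the variable names are assumptions. It checks that `train_test_split` with `stratify=y` keeps the failure share the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 1 = success, 0 = failure (10% failures)
y = np.array([0] * 10 + [1] * 90)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions the same in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_failure_share = (y_train == 0).mean()
val_failure_share = (y_val == 0).mean()
print(f"failure share: train={train_failure_share:.2f}, validation={val_failure_share:.2f}")
```

Without `stratify`, a small validation set could end up with very few failures, or none at all, which makes failure-related metrics unreliable.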


### Companies with Successful and Unsuccessful Missions

In [None]:
plt.figure(figsize=(20,5))
cmp = data.groupby(['Company Name','Status Mission']).count()['Detail'].reset_index()
cmp = cmp[cmp['Status Mission'] == 1].sort_values('Detail', ascending=False)
sns.barplot(x='Company Name', y='Detail', data=cmp[1:20])
plt.ylabel('No of successful missions')
t = plt.title('Company vs Successful Missions')


In [None]:
plt.figure(figsize=(20,5))
cmp = data.groupby(['Company Name','Status Mission']).count()['Detail'].reset_index()
cmp = cmp[cmp['Status Mission'] == 0].sort_values('Detail', ascending=False)
sns.barplot(x='Company Name', y='Detail', data=cmp[1:20])
plt.ylabel('No of unsuccessful missions')
t = plt.title('Company vs Unsuccessful Missions')


# Data Processing

<div style="background: linear-gradient(135deg, #e6ffee, #d1f7d9); border: 4px groove #32cd32; padding: 25px; border-radius: 15px; font-family: 'Pacifico', cursive, sans-serif; color: #2e8b57; box-shadow: 5px 5px 15px rgba(0, 0, 0, 0.2);">
    <h2 style="color: #006400; text-shadow: 3px 3px #7fff00; text-align: center; font-size: 2.5rem;">🛠️ Data Processing & Engineering 🌟</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; text-align: justify;">
        Now that we’ve explored our dataset, it's time to clean, process, and transform the data! 🧹
        Data processing tidies up the raw records, while data engineering prepares them for meaningful analysis and modeling.
    </p>
    <p style="font-size: 1.1rem; line-height: 1.8; text-align: justify; color: #228b22;">
        With clean and well-engineered data, we set the foundation for high-quality insights and robust model performance. 
        Let’s make the data shine like a star! 🌟
    </p>
    <p style="text-align: center; font-size: 1rem; margin-top: 10px; color: #2e8b57; text-shadow: 1px 1px #98fb98;">
        🌟 Let’s engineer some magic with our data! 🌟
    </p>
</div>


<div style="background: #f9f9f9; border: 2px solid #1e90ff; padding: 15px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 3px 3px 10px rgba(0, 0, 0, 0.1);">
    <h3 style="color: #1e90ff; text-align: center; font-size: 1.5rem;">📍 Extracting 'Center' from 'Location' Column</h3>
    <p style="font-size: 1.1rem; line-height: 1.6; color: #333333;">
        To extract the <strong>'center'</strong> from the <strong>'Location'</strong> column, we split the string based on the comma and select the second part, which represents the center of the location. This is done using the <code>split(',')[1]</code> method on each location value.
    </p>
    <p style="font-size: 1.1rem; line-height: 1.6; color: #333333;">
        This extracted <strong>'center'</strong> will be useful in our model as it allows us to focus on specific geographical regions or zones that may have different characteristics. By isolating the center, we can better understand location-based patterns, which can be valuable for tasks like clustering, prediction, and optimization within our model.
    </p>
</div>
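
As a small illustration of the split described above, the sketch below applies the same idea to a few made-up location strings. The sample values and the helper name `extract_center` are hypothetical. Stripping whitespace and falling back to `'Unknown'` when there is no comma are additions for robustness and are not part of the notebook's own code.

```python
import pandas as pd

# Hypothetical location strings in a "pad, center, region, country" style
locations = pd.Series([
    "LC-39A, Kennedy Space Center, Florida, USA",
    "Site 1/5, Baikonur Cosmodrome, Kazakhstan",
    "Kodiak",  # no comma, so there is no second field
])

def extract_center(location):
    """Return the second comma-separated field, stripped, or 'Unknown'."""
    parts = [part.strip() for part in location.split(",")]
    return parts[1] if len(parts) > 1 else "Unknown"

centers = locations.apply(extract_center)
print(centers.tolist())
```

A plain `split(',')[1]` keeps the leading space of each field and raises an `IndexError` on strings without a comma, which is why the sketch guards both cases.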


In [None]:
data['center'] = data['Location'].apply(lambda x:x.split(',')[1])
data.groupby('center').count()['Detail'].sort_values()[-10:].plot(kind='barh')
plt.xlabel('Number of Launches')
t=plt.title('center with number of launches')

In [None]:
data['center'] = data['Location'].apply(lambda x:x.split(',')[1])
data_test['center'] = data_test['Location'].apply(lambda x:x.split(',')[1])

data['center'] = data['center'].astype(str)
data_test['center'] = data_test['center'].astype(str)

<div style="background: #f0f8ff; border: 3px solid #20b2aa; padding: 20px; border-radius: 12px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 4px 4px 15px rgba(0, 0, 0, 0.1);">
    <h2 style="color: #20b2aa; text-align: center; font-size: 2rem;">🔄 Encoding Company Names with 'Status Mission'</h2>
    <p style="font-size: 1.1rem; line-height: 1.6; color: #333333;">
        <strong>Why is this Useful?</strong>
    </p>
    <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
        <li><strong>Numerical Encoding:</strong> The encoding of <strong>'Company Name'</strong> as the average of <strong>'Status Mission'</strong> provides a numerical representation for categorical data. This is useful for machine learning algorithms that require numerical inputs.</li>
        <li><strong>Feature Importance:</strong> By encoding companies with their mission status, we can capture how each company's historical success or failure rate affects model predictions. It allows the model to recognize patterns or tendencies associated with specific companies.</li>
        <li><strong>Better Predictions:</strong> This feature might improve model performance by adding a meaningful encoded feature that represents company-specific trends, making predictions more accurate.</li>
    </ul>
</div>
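
The idea can be sketched on a toy example. The frames below are made up for illustration, and the fallback to the global mean for companies that never appear in the training data is an assumption of this sketch, not something the notebook itself does (a plain `map` leaves such rows as `NaN`).

```python
import pandas as pd

# Toy training data: company "A" succeeds half the time, "B" always
train = pd.DataFrame({
    "Company Name": ["A", "A", "B", "B", "B"],
    "Status Mission": [1, 0, 1, 1, 1],
})
# Toy test data: company "C" was never seen during training
test = pd.DataFrame({"Company Name": ["A", "B", "C"]})

# Mean target per company, computed on the training data only
company_means = train.groupby("Company Name")["Status Mission"].mean()
global_mean = train["Status Mission"].mean()

# Map the means onto the test set; unseen companies get the global mean
test["Company_Name_Encoded"] = (
    test["Company Name"].map(company_means).fillna(global_mean)
)
print(test)
```

Because the encoding is built from the target itself, it should always be computed from training rows only, as above, so that no test labels leak into the feature.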


In [None]:
company_target_enc = data.groupby('Company Name')['Status Mission'].mean()

data['Company_Name_Encoded'] = data['Company Name'].map(company_target_enc)
data_test['Company_Name_Encoded'] = data_test['Company Name'].map(company_target_enc)

In [None]:
company_rank = data.groupby('Company Name')['Status Mission'].mean().rank()

data['Company_Rank'] = data['Company Name'].map(company_rank)
data_test['Company_Rank'] = data_test['Company Name'].map(company_rank)

<div style="background: #f5f5f5; border: 2px solid #008b8b; padding: 20px; border-radius: 12px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 4px 4px 12px rgba(0, 0, 0, 0.1);">
    <h2 style="color: #008b8b; text-align: center; font-size: 2rem;">📅 Feature Extraction from Datum and Location</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        In this step, we are extracting various components from the <strong>'Datum'</strong> column, which contains date and time information, and creating additional features. Additionally, we are extracting the <strong>'Country'</strong> feature from the <strong>'Location'</strong> column, which provides information about the geographical location of each record.
    </p>
    <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
        <li><strong>Extracting Date and Time Components:</strong> The <strong>'Datum'</strong> column is first converted into a <em>datetime</em> format using <code>pd.to_datetime()</code>. Once this is done, we extract various components:
            <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
                <li><strong>weekday_number:</strong> The day of the week (0=Monday, 6=Sunday).</li>
                <li><strong>hour:</strong> The hour of the day (0-23).</li>
                <li><strong>minute:</strong> The minute of the hour (0-59).</li>
                <li><strong>year:</strong> The year of the date.</li>
                <li><strong>month:</strong> The numeric month (1-12).</li>
                <li><strong>day:</strong> The day of the month (1-31).</li>
                <li><strong>day_of_year:</strong> The day of the year (1-366).</li>
                <li><strong>week_of_year:</strong> The ISO week number of the year (1-53).</li>
            </ul>
        </li>
        <li><strong>Creating the 'Country' Feature:</strong> The <strong>'Location'</strong> column contains location information in a string format, which is split by commas. We extract the last part of the split string, representing the country, and assign it to a new <strong>'Country'</strong> feature.</li>
    </ul>
    <p style="font-size: 1.1rem; line-height: 1.8; color: #333333;">
        These extracted features help in understanding temporal patterns, such as when rockets are more likely to be launched, and provide insights into the geographical distribution of rocket statuses.
    </p>
</div>
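
To make the extracted components concrete, here is a minimal sketch on two made-up timestamps (they are not taken from the dataset). It uses the same pandas accessors as the list above and also shows that the ISO week number can reach 53 at the end of some years.

```python
import pandas as pd

# Hypothetical timestamps; the second one falls in ISO week 53 of 2020
datum = pd.to_datetime(pd.Series(["2020-08-07 05:12:00", "2020-12-31 23:59:00"]))

features = pd.DataFrame({
    "weekday_number": datum.dt.weekday,       # 0 = Monday, 6 = Sunday
    "hour": datum.dt.hour,                    # 0-23
    "day_of_year": datum.dt.dayofyear,        # 1-366
    # isocalendar() returns a DataFrame with year, week, and day columns
    "week_of_year": datum.dt.isocalendar().week.astype(int),
})
print(features)
```

Note that `isocalendar().week` returns a nullable `UInt32` column, so casting it to a plain integer (as done here) only works when no dates are missing.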


In [None]:
data['Datum'] = pd.to_datetime(data['Datum'], errors='coerce')

# Extract numeric components
data['weekday_number'] = data['Datum'].dt.weekday      # 0=Monday, 6=Sunday
data['hour'] = data['Datum'].dt.hour                  # Hour of the day (0-23)
data['minute'] = data['Datum'].dt.minute              # Minutes
data['year'] = data['Datum'].dt.year                  # Year
data['month'] = data['Datum'].dt.month                # Numeric month (1-12)
data['day'] = data['Datum'].dt.day                    # Day of the month (1-31)
data['day_of_year'] = data['Datum'].dt.dayofyear      # Day of the year (1-366)
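data['week_of_year'] = data['Datum'].dt.isocalendar().week  # ISO week of the year (1-53)

rocket_status_dict = {'StatusRetired': 1, 'StatusActive': 2}

# Assign the result back: replace(..., inplace=True) on a selected column may not update the DataFrame in recent pandas
data['Status Rocket'] = data['Status Rocket'].replace(rocket_status_dict)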

data['Country'] = data['Location'].apply(lambda x: x.split(',')[-1])

In [None]:
data_test['Datum'] = pd.to_datetime(data_test['Datum'], errors='coerce')

data_test['weekday_number'] = data_test['Datum'].dt.weekday      # 0=Monday, 6=Sunday
data_test['hour'] = data_test['Datum'].dt.hour                  # Hour of the day (0-23)
data_test['minute'] = data_test['Datum'].dt.minute              # Minutes
data_test['year'] = data_test['Datum'].dt.year                  # Year
data_test['month'] = data_test['Datum'].dt.month                # Numeric month (1-12)
data_test['day'] = data_test['Datum'].dt.day                    # Day of the month (1-31)
data_test['day_of_year'] = data_test['Datum'].dt.dayofyear      # Day of the year (1-366)
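data_test['week_of_year'] = data_test['Datum'].dt.isocalendar().week  # ISO week of the year (1-53)

rocket_status_dict = {'StatusRetired': 1, 'StatusActive': 2}

# Assign the result back: replace(..., inplace=True) on a selected column may not update the DataFrame in recent pandas
data_test['Status Rocket'] = data_test['Status Rocket'].replace(rocket_status_dict)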

data_test['Country'] = data_test['Location'].apply(lambda x: x.split(',')[-1])

In [None]:
fig = make_subplots(rows=3, cols=1)

for i, period in enumerate(['year', 'month', 'weekday_number']):
  
    total_counts = data[period].value_counts().sort_index()
    failure_counts = data[data['Status Mission'] == 0][period].value_counts().sort_index()
    
    failure_rate = (failure_counts / total_counts) * 100.0
    
    mean_failure_rate = failure_rate.mean()
    
    if period == 'year':
        x = list(failure_rate.index)
    elif period == 'month':
        x = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    else:
        x = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    
    trace1 = go.Scatter(
        x=x, 
        y=list(failure_rate.values), 
        mode='lines', 
        text=list(failure_rate.keys()), 
        name=f'Failure Rate by {period}', 
        connectgaps=False
    )
    
    trace2 = go.Scatter(
        x=x, 
        y=[mean_failure_rate] * len(failure_rate), 
        mode='lines', 
        showlegend=False, 
        name=f'Mean Failure Rate by {period}', 
        line={'dash': 'dash', 'color': 'grey'}
    )
    
    # add_trace replaces the deprecated append_trace
    fig.add_trace(trace1, row=i+1, col=1)
    fig.add_trace(trace2, row=i+1, col=1)


fig.update_layout(
    template='simple_white',
    height=600,
    title={'text': '<b>Failure Rate as a Percentage of Total Missions by Year, Month, and Weekday</b>', 'x': 0.5}
)

for i in range(1, 4):
    fig.update_yaxes(title_text='<b>Failure Rate (%)</b>', row=i, col=1)

fig.show()


<div style="background: #f9f9f9; border: 2px solid #dddddd; padding: 20px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333;">
    <h3 style="color: #4CAF50; text-align: center;">Insights from the "Failure Rate as a Percentage of Total Missions by Year, Month, and Weekday" Graph</h3><br>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        <strong>Yearly Pattern:</strong> <br>
        The failure rate shows a general downward trend over the years in the top graph. This indicates significant improvements in mission reliability, suggesting that advancements in technology and better operational practices have contributed to fewer mission failures over time.
    </p>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        <strong>Seasonal Variations:</strong> <br>
        The middle graph highlights a noticeable peak in failure rates during the summer months (June - August). This could be attributed to environmental factors such as higher temperatures, increased humidity, or other seasonal challenges that may affect mission performance.
    </p>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        <strong>Weekday Patterns:</strong> <br>
        The bottom graph shows that Monday has the lowest failure rate. One possible explanation, which this analysis does not test, is that more rigorous maintenance checks and preparations are carried out over the weekend, so missions at the start of the week are more thoroughly prepared.
    </p>
</div>


In [None]:
sun = data.groupby(["Country","Company Name","Status Mission"])["Datum"].count().reset_index()
sun = sun[(sun.Country == " USA") | (sun.Country == " China") | (sun.Country == " Russian Federation") | (sun.Country == " France")]
fig = px.sunburst(sun, path = ["Country", "Company Name", "Status Mission"], values = "Datum", title = "Sunburst Chart for some Countries")
fig.show()

In [None]:
data_ussr = data[data['Company Name'] == "RVSN USSR"]

x = data_ussr.groupby('year').count()['Detail'].plot(kind='bar', figsize=(14, 4))

plt.ylabel('Number of missions')
plt.title('Number of Missions by RVSN USSR per Year')

plt.tight_layout()
plt.show()

<div style="background: #e9f7f1; border: 2px solid #009966; padding: 20px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333;">
    <h3 style="color: #009966; text-align: center;">RVSN USSR's Mission Journey</h3>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        All rockets of <strong>RVSN USSR</strong> are now retired, and the company no longer produces rockets. 
        The company began its journey in <strong>1957</strong>, producing a significant number of rockets and achieving many successful missions. 
        In fact, it had the highest number of successful missions among all companies until <strong>1998</strong>. 
        After 1998, however, RVSN USSR ceased launching any missions.
    </p>
</div>


In [None]:
year_wise = data.groupby(['Company Name','year']).count()['Detail'].reset_index()
year_wise = year_wise[year_wise['Company Name'].isin(companies['Company Name'][:20])]

fig = go.Figure(data=go.Heatmap(
        z=year_wise['Detail'],
        x=year_wise['year'],
        y=year_wise['Company Name'],
        colorscale='Viridis'))

fig.update_layout(
    title='Company wise launches per year',
    xaxis_nticks=36)

fig.show()


<div style="background: #f3f8ff; border: 2px solid #007acc; padding: 20px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333;">
    <h3 style="color: #007acc; text-align: center;">Space Industry Evolution</h3>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        <strong>RVSN USSR</strong> was the first company to enter the space industry and had remarkable performance, 
        with an increasing number of launches year after year. It remains the only company to have launched <strong>97 missions</strong> in a single year (<strong>1977</strong>), 
        a record set before the Soviet Union collapsed in <strong>1991</strong>.
    </p>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        At the beginning, in <strong>1957</strong>, there were very few companies in space research. However, by <strong>2020</strong>, 
        the number of companies had grown tremendously. Companies like <strong>SpaceX</strong>, <strong>VKS RF</strong>, 
        <strong>Arianespace</strong>, and <strong>ISRO</strong> not only ventured into space exploration but also maintained a 
        consistent number of launches each year.
    </p>
</div>


In [None]:
mission_counts_by_year = data["year"].value_counts().reset_index()
mission_counts_by_year.columns = ['year', 'count']  
fig = px.bar(mission_counts_by_year, x="year", y="count", title="Number of Missions by Year")
fig.show()

<div style="background: #fafafa; border: 2px solid #32cd32; padding: 20px; border-radius: 12px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.1);">
    <h2 style="color: #32cd32; text-align: center; font-size: 2rem;">🌍 Identifying Top Countries for Successful Missions</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        In this step, we create a new feature called <strong>'Top_Country'</strong> which identifies the country with the most successful rocket launches in each year. This feature is crucial for understanding which countries have been the most successful in their space missions and how this information can be leveraged in modeling and analysis.
    </p>
    <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
        <li><strong>Identifying Top Countries:</strong> 
            First, we filter the data to include only successful missions by selecting rows where the <strong>'Status Mission'</strong> equals 1 (successful). Then, we group the data by <strong>'year'</strong> and <strong>'Country'</strong> and count the number of successful missions for each group. We sort the results by year and mission count to find the countries with the most successful launches per year.
        </li>
        <li><strong>Assigning Top Country to Each Year:</strong>
            For each year, we extract the top country with the most successful launches and create a new dataframe <strong>'top_country_per_year'</strong> which contains the <strong>'year'</strong>, the <strong>'Top_Country'</strong>, and the number of successful launches for that country.
        </li>
        <li><strong>Feature Merge and Flagging Top Country:</strong>
            After merging the <strong>'Top_Country'</strong> information with the main dataset, we create a new feature called <strong>'Is_Top_Country'</strong>. This feature is a binary indicator that flags whether the country in each record is the top country for that year. If the country in the record matches the top country for that year, the value of <strong>'Is_Top_Country'</strong> will be 1, otherwise, it will be 0.
        </li>
        <li><strong>Data Cleanup:</strong>
            After creating the new feature, we clean up the dataset by removing unnecessary columns like <strong>'Top_Country'</strong> from the training and test sets, keeping only the relevant columns for modeling.
        </li>
    </ul>
    <p style="font-size: 1.1rem; line-height: 1.8; color: #333333;">
        This feature allows us to capture the distinction between countries that have been the most successful in space missions and those that have not. It is an important factor in understanding the global trends in rocket launches and could significantly enhance our model’s predictive power.
    </p>
</div>
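
The steps above can be sketched on a tiny made-up table. The data and the `launches`, `successes`, `top`, and `flagged` names are illustrative assumptions. The sketch picks the top row per year with `idxmax`, which is one possible alternative to sorting and taking the first row of each group.

```python
import pandas as pd

# Toy launch log: 1 = successful mission, 0 = failed mission
launches = pd.DataFrame({
    "year": [1957, 1957, 1957, 1958, 1958, 1958],
    "Country": ["Kazakhstan", "Kazakhstan", "USA", "USA", "USA", "Kazakhstan"],
    "Status Mission": [1, 1, 0, 1, 1, 1],
})

# Count successful missions per (year, country)
successes = (
    launches[launches["Status Mission"] == 1]
    .groupby(["year", "Country"])
    .size()
    .reset_index(name="Successful_Launches")
)

# Keep the row with the most successes in each year
top = successes.loc[successes.groupby("year")["Successful_Launches"].idxmax()]
top = top.rename(columns={"Country": "Top_Country"})

# Merge back and flag records whose country is that year's top country
flagged = launches.merge(top[["year", "Top_Country"]], on="year", how="left")
flagged["Is_Top_Country"] = (flagged["Country"] == flagged["Top_Country"]).astype(int)
print(flagged[["year", "Country", "Is_Top_Country"]])
```

Years that have no successful mission in the training data get no top country after the left merge, so every record from such a year is flagged 0.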


In [None]:
ds = data[data['Status Mission'] == 1]

ds_grouped = ds.groupby(['year', 'Country'])['Status Mission'].count().reset_index()

ds_sorted = ds_grouped.sort_values(['year', 'Status Mission'], ascending=[True, False])

top_country_per_year = pd.concat([group[1].head(1) for group in ds_sorted.groupby('year')])

top_country_per_year.columns = ['year', 'Top_Country', 'Successful_Launches']

combined_data = pd.concat([data.assign(dataset="train"), data_test.assign(dataset="test")], ignore_index=True)

combined_data = combined_data.merge(top_country_per_year[['year', 'Top_Country']], on='year', how='left')

combined_data['Is_Top_Country'] = (combined_data['Country'] == combined_data['Top_Country']).astype(int)

data = combined_data[combined_data['dataset'] == "train"].drop(columns=['dataset'])
data_test = combined_data[combined_data['dataset'] == "test"].drop(columns=['dataset'])

data.drop(columns=['Top_Country'], inplace=True)
data_test.drop(columns=['Top_Country', 'Status Mission'], inplace=True)

In [None]:
country_status_counts = data.groupby(['Country', 'Status Mission']).size().reset_index(name='count')

top_10_countries = data['Country'].value_counts().head(10).index

filtered_data = country_status_counts[country_status_counts['Country'].isin(top_10_countries)]

fig = px.bar(filtered_data, 
             x="Status Mission", y="count", color="Status Mission", 
             facet_col="Country", facet_col_wrap=4,  
             title="Top 10 Countries: Success and Failure Count by Country",
             color_discrete_sequence=["green", "red"])  

fig.update_layout(
    showlegend=False,  
    height=800,  
    title_x=0.5, 
    title_font=dict(size=20, family="Arial, sans-serif"), 
    margin=dict(t=60, b=40, l=40, r=40) 
)

fig.show()

<div style="background: #fafafa; border: 2px solid #32cd32; padding: 20px; border-radius: 12px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.1);">
    <h2 style="color: #32cd32; text-align: center; font-size: 2rem;">🌍 Country Encoding Based on Mission Success Rate</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        In this step, we create a new feature called <strong>'Country_Encoded'</strong> which encodes the country based on the average mission success rate. This feature helps capture the relationship between a country's overall success in space missions and the outcomes of individual missions. It is particularly useful for models that need to understand country-specific patterns of mission success.
    </p>
    <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
        <li><strong>Country Encoding:</strong> 
            We calculate the average success rate of missions for each country by grouping the data by <strong>'Country'</strong> and calculating the mean of the <strong>'Status Mission'</strong> column. This gives us a measure of how successful missions from each country are, on average.
        </li>
        <li><strong>Mapping Encoded Values:</strong>
            Once the average success rate is calculated for each country, we create a new feature <strong>'Country_Encoded'</strong> in the dataset by mapping the calculated average success rate for each country. This means that for each record, the <strong>'Country_Encoded'</strong> value will represent the average mission success rate of that country.
        </li>
    </ul>
    <p style="font-size: 1.1rem; line-height: 1.8; color: #333333;">
        This feature summarizes each country's track record in a single number, which could improve the model’s ability to predict mission outcomes from the country of origin. Because the rate is computed from the training labels, countries with only a few launches get noisy values, and a country that never appears in the training data maps to a missing value in the test set.
    </p>
</div>
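
The encoding described above can be sketched on toy data as follows. The fallback to the overall training success rate for unseen countries is an assumption added for illustration, not something the notebook does; the frames `train` and `test` and their values are made up.

```python
import pandas as pd

# Toy data; column names follow the notebook, values are made up.
train = pd.DataFrame({
    "Country": ["USA", "USA", "USSR", "USSR", "USSR", "Japan"],
    "Status Mission": [1, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"Country": ["USA", "Japan", "Brazil"]})

# Per-country success rate, learned from the training rows only.
country_rate = train.groupby("Country")["Status Mission"].mean()

# Assumed fallback for countries that never appear in training:
# the overall training success rate.
global_rate = train["Status Mission"].mean()

train["Country_Encoded"] = train["Country"].map(country_rate)
test["Country_Encoded"] = test["Country"].map(country_rate).fillna(global_rate)

print(test)
```

Without the fallback, "Brazil" would receive a missing value, which a later imputation step would then have to handle.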


In [None]:
country_enc = data.groupby('Country')['Status Mission'].mean()

data['Country_Encoded'] = data['Country'].map(country_enc)
data_test['Country_Encoded'] = data_test['Country'].map(country_enc)

In [None]:
fig = px.treemap(data, path=['Status Mission', 'Country', 'Company Name'])
fig.update_layout(template='ggplot2', margin=dict(l=80, r=80, t=50, b=10),
                  title={'text': '<b>Mission Status, Countries and Companies</b>', 'x': 0.5},
                  font_family='Fira Code', title_font_color='#ff6767')
fig.show()

<div style="background: #f0f8ff; border: 2px solid #66ccff; padding: 20px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333;">
    <h3 style="color: #66ccff; text-align: center;">RVSN USSR Launch Locations</h3>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        From the treemap, I noticed that 'RVSN USSR' had launch locations in both Kazakhstan and Russia, which raised a question I wanted to dig into. After some research, I found the explanation: the Baikonur Cosmodrome, a major Soviet launch site, lies in what is now Kazakhstan.
    </p>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        Soviet launches from Baikonur are therefore listed under Kazakhstan, while launches from sites on Russian territory are listed under Russia, which is why RVSN USSR appears under both countries.
    </p>
</div>


<div style="background: #f0f8ff; border: 2px solid #ff6347; padding: 20px; border-radius: 12px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.1);">
    <h2 style="color: #ff6347; text-align: center; font-size: 2rem;">⏰ Feature Creation: Time of Day, Weekend, and Season</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        In this step, we create multiple time-based features from the <strong>'weekday_number'</strong>, <strong>'hour'</strong>, and <strong>'month'</strong> columns. These features help the model understand how different time factors (like day of the week, time of day, and season) influence mission outcomes.
    </p>
    <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
        <li><strong>Weekend Feature:</strong>
            We create a new feature <strong>'is_weekend'</strong> that indicates whether the day is a weekend (Saturday or Sunday). This is achieved by checking if the <strong>'weekday_number'</strong> is either 5 (Saturday) or 6 (Sunday), and assigning a value of 1 for weekends and 0 for weekdays. This feature helps capture patterns related to weekend behavior.
        </li><br>
        <li><strong>Time of Day:</strong>
            We create a feature <strong>'time_of_day'</strong> by categorizing the <strong>'hour'</strong> column into four periods: 
            <ul>
                <li><strong>Morning</strong>: From 5 AM to 11 AM</li>
                <li><strong>Afternoon</strong>: From 12 PM to 4 PM</li>
                <li><strong>Evening</strong>: From 5 PM to 8 PM</li>
                <li><strong>Night</strong>: From 9 PM to 4 AM</li>
            </ul>
            This categorization helps the model identify time-related patterns in the mission data.
        </li><br>
        <li><strong>Season Feature:</strong>
            We create a feature <strong>'season'</strong> that categorizes the <strong>'month'</strong> column into four seasons: 
            <ul>
                <li><strong>Winter</strong>: December, January, February</li>
                <li><strong>Spring</strong>: March, April, May</li>
                <li><strong>Summer</strong>: June, July, August</li>
                <li><strong>Autumn</strong>: September, October, November</li>
            </ul>
            This feature helps account for seasonal variations in mission outcomes, which could be important for predicting the status of missions.
        </li><br>
    </ul>
    <p style="font-size: 1.1rem; line-height: 1.8; color: #333333;">
        These features, <strong>'is_weekend'</strong>, <strong>'time_of_day'</strong>, and <strong>'season'</strong>, add valuable temporal context to the dataset, potentially improving the model’s ability to predict mission success based on the time-related factors of when a mission took place.
    </p>
</div>
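
The three features above can also be built without row-wise `apply`. The sketch below is one vectorized variant, assuming integer hours (0 to 23) and months (1 to 12); the helper `add_time_features`, the lookup `SEASON_BY_MONTH`, and the demo frame are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

# Month-to-season lookup, matching the seasons listed above.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_time_features(df):
    """Return a copy of df with is_weekend, time_of_day and season columns."""
    out = df.copy()
    hour = out["hour"]
    # Saturday is 5 and Sunday is 6 in the notebook's weekday numbering.
    out["is_weekend"] = out["weekday_number"].isin([5, 6]).astype(int)
    # between() is inclusive on both ends; anything unmatched is Night.
    out["time_of_day"] = np.select(
        [hour.between(5, 11), hour.between(12, 16), hour.between(17, 20)],
        ["Morning", "Afternoon", "Evening"],
        default="Night",
    )
    out["season"] = out["month"].map(SEASON_BY_MONTH)
    return out


demo = pd.DataFrame({
    "weekday_number": [0, 5, 6, 2],
    "hour": [4, 5, 16, 21],
    "month": [1, 4, 7, 11],
})
featured = add_time_features(demo)
print(featured)
```

Wrapping the logic in one function keeps the training and test sets on exactly the same rules.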


In [None]:
def categorize_time_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

def categorize_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Autumn'

data['is_weekend'] = data['weekday_number'].isin([5, 6]).astype(int)
data['time_of_day'] = data['hour'].apply(categorize_time_of_day)
data['season'] = data['month'].apply(categorize_season)

data_test['is_weekend'] = data_test['weekday_number'].isin([5, 6]).astype(int)
data_test['time_of_day'] = data_test['hour'].apply(categorize_time_of_day)
data_test['season'] = data_test['month'].apply(categorize_season)

<div style="background: #e6f7ff; border: 2px solid #00b3b3; padding: 20px; border-radius: 12px; font-family: 'Arial', sans-serif; color: #333333; box-shadow: 4px 4px 10px rgba(0, 0, 0, 0.1);">
    <h2 style="color: #00b3b3; text-align: center; font-size: 2rem;">🚀 Error in 'Rocket' Column and Correction</h2>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        In the dataset, we encountered a small issue in the <strong>'Rocket'</strong> column, where some values were incorrectly formatted with extra spaces and commas (e.g., <strong>'5,000.0 '</strong> and <strong>'1,160.0 '</strong>). These extra spaces and formatting issues could lead to errors during data analysis and model training.
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        To address this, we performed the following steps:
    </p>
    <ul style="font-size: 1.1rem; line-height: 1.6; color: #333333; margin-left: 20px;">
        <li><strong>Fixing the malformed values:</strong> We replaced the problematic values <strong>'5,000.0 '</strong> with <strong>5000</strong> and <strong>'1,160.0 '</strong> with <strong>1160</strong>, removing the thousands separator and the trailing space while keeping the original magnitude.</li>
        <li><strong>Converting to numeric format:</strong> After replacing the erroneous values, we converted the <strong>'Rocket'</strong> column to the <strong>float64</strong> data type to ensure it's correctly recognized as a numeric feature.</li>
    </ul>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        This correction ensures that the <strong>'Rocket'</strong> column contains valid numeric data, allowing the model to correctly interpret and use this feature for analysis and prediction.
    </p>
</div>


In [None]:
# The thousands separator in these two values prevents a direct float conversion
rocket_fixes = {'5,000.0 ': 5000.0, '1,160.0 ': 1160.0}

data[' Rocket'] = data[' Rocket'].replace(rocket_fixes).astype('float64')
data_test[' Rocket'] = data_test[' Rocket'].replace(rocket_fixes).astype('float64')

# Training rows with a known rocket cost
data_n = data[~data[' Rocket'].isna()].copy()
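
A more general alternative to replacing individual values is to strip thousands separators and whitespace from every entry before converting. The sketch below shows this on a made-up series; `raw` and `cleaned` are illustrative names, and the approach assumes commas are only ever used as thousands separators.

```python
import pandas as pd

# Made-up cost strings in the same style as the ' Rocket' column.
raw = pd.Series(["50.0 ", "1,160.0 ", None, "5,000.0 ", "29.75"])

# Remove commas and surrounding whitespace, then convert to float.
# errors="coerce" turns anything still unparsable into NaN.
cleaned = pd.to_numeric(
    raw.str.replace(",", "", regex=False).str.strip(),
    errors="coerce",
)

print(cleaned)
```

This avoids listing each malformed value by hand, at the cost of silently turning unexpected strings into missing values.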

# Imputing Rocket Missing Value

<div style="background-color: #f0f4f8; padding: 30px; border-radius: 15px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
    <h1 style="color: #2d3e50; background-color: #a7c4e4; padding: 15px; border-radius: 8px; font-family: 'Arial', sans-serif;">Imputing Missing Values for 'Rocket' Column</h1>
    <p style="font-size: 16px; color: #4e5b6e; font-family: 'Arial', sans-serif;">The goal is to predict and impute the missing values in the <strong>'Rocket'</strong> column using the available features in the dataset.</p>

<h2 style="color: #1abc9c; font-family: 'Arial', sans-serif;">Process Overview</h2>
    <p style="font-size: 14px; color: #555; line-height: 1.6; font-family: 'Arial', sans-serif;">The approach involves using the other features in the dataset to train a machine learning model, specifically XGBoost, to predict the missing values in the <strong>'Rocket'</strong> column. This method leverages relationships between features that are available in the rows with non-missing <strong>'Rocket'</strong> values.</p>

<h2 style="color: #6a5acd; font-family: 'Arial', sans-serif;">Key Insights</h2>
    <ul style="font-size: 14px; color: #2c3e50; background-color: #ecf2f9; padding: 20px; border-radius: 10px; list-style-type: disc; line-height: 1.6; font-family: 'Arial', sans-serif;">
        <li><strong>Handling Missing Data:</strong> Instead of simply dropping rows with missing <strong>'Rocket'</strong> values, a predictive model is trained to fill these gaps, leading to a more informed imputation.</li>
        <li><strong>Feature Transformation:</strong> Categorical features like <strong>'center'</strong>, <strong>'time_of_day'</strong>, and <strong>'season'</strong> are encoded into numerical values, enabling the model to learn from these features effectively.</li>
        <li><strong>XGBoost's Effectiveness:</strong> After experimenting with different algorithms, XGBoost was selected as the best performer due to its ability to handle complex relationships between features, providing accurate imputation of the missing values.</li>
    </ul>

<p style="font-size: 14px; color: #555; font-family: 'Arial', sans-serif;">This approach of using a machine learning model like XGBoost to predict missing values offers a more sophisticated alternative to traditional imputation methods, leading to potentially better model performance and more accurate data for analysis.</p>
</div>
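
The imputation pattern described above (fit on rows where the target is known, predict the rows where it is missing) can be sketched on synthetic data. To keep the example light it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost; the column names `cost`, `year`, and `center` and all values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic frame: the cost depends on the other two columns plus noise.
frame = pd.DataFrame({
    "year": rng.integers(1960, 2021, size=n),
    "center": rng.integers(0, 5, size=n),
})
frame["cost"] = (
    10 + 0.5 * (frame["year"] - 1960) + 5 * frame["center"] + rng.normal(0, 1, n)
)

# Hide roughly 20% of the costs to mimic missing values.
hidden = rng.random(n) < 0.2
frame.loc[hidden, "cost"] = np.nan

features = ["year", "center"]
known = frame["cost"].notna()

# Fit only on rows where the cost is known.
model = GradientBoostingRegressor(random_state=0)
model.fit(frame.loc[known, features], frame.loc[known, "cost"])

# Fill the gaps with predictions; known values stay untouched.
imputed = frame.copy()
imputed.loc[~known, "cost"] = model.predict(frame.loc[~known, features])

print(imputed["cost"].isna().sum())
```

Using a boolean mask with `.loc` keeps the predictions aligned with the rows they belong to, whatever the index looks like.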


In [None]:
columns = ['Serial Number', 'Location', 'Datum', 'Detail', 'Country', 'Company Name']

data.drop(columns, axis=1, inplace=True)
data_test.drop(columns, axis=1, inplace=True)

In [None]:
all_centers = pd.concat([data['center'], data_test['center']]).unique()

le = LabelEncoder()
le.fit(all_centers)

data['center'] = le.transform(data['center'])
data_test['center'] = le.transform(data_test['center'])

In [None]:
le = LabelEncoder()

data['time_of_day'] = le.fit_transform(data['time_of_day'])
data_test['time_of_day'] = le.transform(data_test['time_of_day'])

data['season'] = le.fit_transform(data['season'])
data_test['season'] = le.transform(data_test['season'])

In [None]:
combined_data = pd.concat([data, data_test], axis=0).drop(['Status Mission'], axis = 1)

combined_data = combined_data[~combined_data[' Rocket'].isna()]

In [None]:
X = combined_data.drop([' Rocket'], axis = 1)
y = combined_data[' Rocket']

train_x,test_x,train_y,test_y = train_test_split(X,y)

In [None]:
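xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
# No early stopping is configured, so eval_set only tracks validation error and does not change the fit;
# the whole held-out split is used rather than its first 20 rows
xgb_model.fit(train_x, train_y, eval_set=[(test_x, test_y)], verbose=False)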

In [None]:
Xm = data[data[' Rocket'].isna()].drop([' Rocket', 'Status Mission'], axis=1)
xgb_predict = xgb_model.predict(Xm)
missing_index = data[data[' Rocket'].isna()][' Rocket'].index.to_list()
data.loc[missing_index,' Rocket'] = xgb_predict

Xm = data_test[data_test[' Rocket'].isna()].drop([' Rocket'], axis=1)
xgb_predict = xgb_model.predict(Xm)
missing_index = data_test[data_test[' Rocket'].isna()][' Rocket'].index.to_list()
data_test.loc[missing_index,' Rocket'] = xgb_predict

## Analysing Budget

<div style="background: #f0f8ff; border: 2px solid #66ccff; padding: 20px; border-radius: 10px; font-family: 'Arial', sans-serif; color: #333333;">
    <h3 style="color: #66ccff; text-align: center;">Creating the 'Average Budget per Year' Feature</h3>
    <p style="font-size: 1rem; color: #333333; line-height: 1.6;">
        To better understand how launch budgets changed over time, I created a new feature called <strong>'Average Budget per Year'</strong> (stored as <strong>avg_budget_per_year</strong>).
        It is calculated by grouping the combined training and test data by <strong>year</strong> and taking the mean of the <strong>budget</strong> (the ' Rocket' cost column) for each year. 
        The result gives every launch a reference point for the typical cost level in its year, which helps us analyze financial trends in the space industry.
    </p>
</div>
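
When the yearly mean only needs to be computed within a single frame, `groupby(...).transform("mean")` gives the same per-row result without a separate merge. A small sketch on made-up numbers (the frame `launches` is illustrative):

```python
import pandas as pd

# Made-up costs; ' Rocket' keeps the notebook's column name, including the leading space.
launches = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    " Rocket": [40.0, 60.0, 30.0, 90.0, 60.0],
})

# transform returns one value per original row, aligned to the frame's index.
launches["avg_budget_per_year"] = launches.groupby("year")[" Rocket"].transform("mean")

print(launches)
```

The notebook's merge-based version is still the right choice when the means come from the combined train and test data and must be attached to each frame separately.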


In [None]:
combined_data = pd.concat([data, data_test], ignore_index=True)

average_budget_per_year = combined_data.groupby('year')[' Rocket'].mean().reset_index()
average_budget_per_year.columns = ['year', 'avg_budget_per_year']

data = data.merge(average_budget_per_year, on='year', how='left')
data_test = data_test.merge(average_budget_per_year, on='year', how='left')

In [None]:
df_d = data[data.loc[:, " Rocket"]<1000]
plt.figure(figsize = (22,6))
sns.histplot(data = df_d, x = " Rocket", hue = "Status Rocket")
plt.show()

In [None]:
plt.figure(figsize = (22,6))
sns.histplot(data = df_d, x = " Rocket", hue = "Status Mission")
plt.show()

# Modeling

<div style="background-color: #f0f8ff; padding: 20px; border-radius: 10px; font-family: Arial, sans-serif; font-size: 18px;">
  <h3 style="color: #2d87f0; font-size: 24px;">✨ Model Training and Experimentation</h3>
  <p><strong>🔧 Experimenting with Multiple Algorithms:</strong> I tested several machine learning algorithms including <strong>AdaBoost</strong>, <strong>LightGBM</strong>, <strong>XGBoost</strong>, and a <strong>Voting Classifier</strong>. Each model was evaluated to determine which one provided the best performance on the given task.</p>
  
  <p><strong>⚙️ Hyperparameter Tuning:</strong> I applied hyperparameter tuning to these algorithms, particularly focusing on <strong>AdaBoost</strong>, <strong>LightGBM</strong>, and <strong>XGBoost</strong>, using optimization techniques to fine-tune their performance and make them more efficient.</p>
  
  <p><strong>💡 Final Model Choice:</strong> Of all the models tested, <strong>Random Forest</strong> produced the highest <em>F1 score</em>, so it is the only model kept in this final notebook. The other models were run separately in Google Colab.</p>

  <p><strong>🌟 Best Score:</strong> After comparing the algorithms and fine-tuning their parameters, <strong>Random Forest</strong> remained the most reliable model for our task.</p>
</div>
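
One way to run this kind of comparison is cross-validated F1 on the same folds for every candidate. The sketch below uses synthetic data from `make_classification` and only two scikit-learn models; it is not the experiment run for this notebook, and `candidates`, `f1_by_model`, and `best_name` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the launch features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

# Mean F1 over 5 folds for each candidate.
f1_by_model = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

best_name = max(f1_by_model, key=f1_by_model.get)
print(f1_by_model, best_name)
```

LightGBM and XGBoost classifiers follow the scikit-learn estimator interface, so they could be added to `candidates` in the same way.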


<div style="background-color: #f0f8ff; padding: 20px; border-radius: 10px; font-family: Arial, sans-serif; font-size: 18px;">
  <h3 style="color: #2d87f0; font-size: 24px;">✨ Model Training</h3>
  <p><strong>🔧 Handling Missing Values:</strong> I started by <em>dropping rows with null values</em> from the training dataset to ensure clean data for model training. Missing values in the test set are filled with each column's mode instead, since every test row needs a prediction.</p>
  
  <p><strong>⚙️ Hyperparameter Tuning:</strong> I tuned the <em>Random Forest model</em> with a grid search (kept below as a commented-out cell) and use the best parameters it found in the final model.</p>

  
</div>


In [None]:
data = data.dropna()

In [None]:
X = data.drop(['Status Mission'], axis = 1)

y = data['Status Mission']

X_test = data_test

In [None]:
for column in X_test.columns:
    if X_test[column].isnull().sum() > 0:
        mode_value = X_test[column].mode()[0]
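        # Assign the result back: fillna(inplace=True) on a column selection
        # triggers warnings in newer pandas versions and may not update the frame
        X_test[column] = X_test[column].fillna(mode_value)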

In [None]:
# rf_clf = RandomForestClassifier()


# param_grid = {
#     'model__n_estimators': [50, 100, 200],
#     'model__max_depth': [None, 10, 20, 30],
#     'model__min_samples_split': [2, 5, 10],
#     'model__min_samples_leaf': [1, 2, 4],
#     'model__bootstrap': [True, False]
# }

# rf_pipe = Pipeline(steps=[('model', rf_clf)])

# # Create GridSearchCV object
# grid_search = GridSearchCV(rf_pipe, param_grid, cv=5, scoring='f1', verbose=2)

# grid_search.fit(X, y)

# print("Best parameters found: ", grid_search.best_params_)
# print("Best F1 score found: ", grid_search.best_score_)

In [None]:
rf = RandomForestClassifier(bootstrap = True, max_depth = 20, min_samples_leaf = 2, min_samples_split = 2, n_estimators = 50)

rf.fit(X,y)

y_pred = rf.predict(X_test)

In [None]:
sample_submission = pd.read_csv("/kaggle/input/sky-cast-margazhi-25/sample_submission.csv")

sample_submission['Status Mission'] = y_pred

sample_submission.to_csv('submission.csv', index=False)

<div style="background: #ffebcc; border: 3px solid #ff6600; padding: 20px; border-radius: 15px; font-family: 'Comic Sans MS', sans-serif; color: #333333; box-shadow: 4px 4px 20px rgba(0, 0, 0, 0.2);">
    <h2 style="color: #ff6600; text-align: center; font-size: 2.5rem; font-weight: bold;">🚀 And That's a Wrap! 💥</h2>
    <p style="font-size: 1.4rem; line-height: 1.8; color: #333333; text-align: center;">
        Well, well, well... we've taken this launch to the next level! 🚀 After all the crazy feature engineering, data wrangling, and rocket-fueled transformations, it's time to say: Our model's journey is complete! 🚀💫
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        Now, let’s talk about the real hero of this challenge: the Random Forest! 🌳🌲 This model builds an ensemble of decision trees, each trained on a random sample of the data and considering random subsets of features at each split. It then combines the votes of all those trees to make predictions that are more stable and usually more accurate than any single tree could achieve.
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        What makes Random Forest work best for us here is its robustness and flexibility! 🎯 Whether we’re dealing with outliers, overfitting, or tricky feature interactions, Random Forest handles it like a pro. It also gives us the power to measure feature importance, meaning we can understand which factors are truly driving those rocket launches and what’s making those missions a success. 🎉
    </p>
    <p style="font-size: 1.2rem; line-height: 1.8; color: #333333;">
        It’s like having a team of rocket scientists working together, each bringing in their own unique perspective. And when they all agree, BOOM – we’ve got a top-performing model that gives us the edge we need! 🚀💡
    </p>
    <h3 style="color: #ff6600; text-align: center; font-size: 2rem; font-weight: bold;">Ready for liftoff! 🔥🚀</h3>
    <p style="font-size: 1.4rem; line-height: 1.8; color: #333333; text-align: center;">
        So, the next time you’re ready to predict the success of a rocket launch, you know exactly which model to call on. Random Forest's got your back! 🔥 Let’s continue to explore the skies, one prediction at a time! 🌌
    </p>
</div>
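
The feature importance mentioned above can be read from a fitted forest through its `feature_importances_` attribute. The sketch below uses synthetic data and made-up feature names, so the ranking it prints says nothing about the launch dataset itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort so the strongest feature comes first.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favor high-cardinality features, so a permutation-based check is a useful cross-check when the ranking matters.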
