<img src="https://images.unsplash.com/photo-1446776709462-d6b525c57bd3?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D" height=500>

###### Source: NASA via Unsplash

# Analyzing and Visualizing the Space Race

### 0. Setup

#### 0.1 Import Statements

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import statistics
from scipy.stats import skew

#### 0.2 Notebook Formatting

In [13]:
pd.options.display.float_format = '{:,.2f}'.format

### 1. Understanding The Problem

#### 1.1 Historical Context

The Space Race, which began in the 1950s between the United States and the Soviet Union, holds significant historical relevance. This period marked a pivotal point in human achievement, where technological and scientific advancements pushed the boundaries of exploration and innovation. The drive to explore space was not only a pursuit of scientific discovery but also a critical demonstration of national power and technological supremacy amid Cold War tensions.

From a societal perspective, the Space Race fueled global fascination and inspired generations to pursue careers in science, technology, engineering, and mathematics (STEM). Economically, the competition led to substantial government investments in research and development, catalyzing advancements in computing, telecommunications, and materials science, which laid the groundwork for modern industries. Politically, space achievements were leveraged to project ideological superiority, demonstrating the geopolitical power of each nation. This technological rivalry eventually fostered international collaborations, including the formation of space agencies and joint missions, reflecting a shift from competition to cooperation in the post-Cold War era.

### 2. Data Collection And Cleaning

The main data source for this analysis is a dataset published on Kaggle by LaCla3D (https://www.kaggle.com/datasets/agirlcoding/all-space-missions-from-1957), which was web-scraped from https://nextspaceflight.com/launches/past/?page=1. The dataset was downloaded and stored in the project folder as "space_race_data.csv".

#### 2.1 Load Data

In [14]:
data = pd.read_csv('./space_race_data.csv')

#### 2.2 Understanding, Exploring And Adjusting The Dataset

I checked the first few rows to see what kind of data is available:

In [15]:
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Company Name,Location,Datum,Detail,Status Rocket,Rocket,Status Mission
0,0,0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA","Fri Aug 07, 2020 05:12 UTC",Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success
1,1,1,CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...","Thu Aug 06, 2020 04:01 UTC",Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success
2,2,2,SpaceX,"Pad A, Boca Chica, Texas, USA","Tue Aug 04, 2020 23:57 UTC",Starship Prototype | 150 Meter Hop,StatusActive,,Success
3,3,3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan","Thu Jul 30, 2020 21:25 UTC",Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success
4,4,4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA","Thu Jul 30, 2020 11:50 UTC",Atlas V 541 | Perseverance,StatusActive,145.0,Success


I checked the original number of rows and columns.

In [16]:
print(f'The dataset has {data.shape[0]} rows and {data.shape[1]} columns.')

The dataset has 4324 rows and 9 columns.


I checked the original data type of each column.

In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4324 entries, 0 to 4323
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0.1    4324 non-null   int64 
 1   Unnamed: 0      4324 non-null   int64 
 2   Company Name    4324 non-null   object
 3   Location        4324 non-null   object
 4   Datum           4324 non-null   object
 5   Detail          4324 non-null   object
 6   Status Rocket   4324 non-null   object
 7    Rocket         964 non-null    object
 8   Status Mission  4324 non-null   object
dtypes: int64(2), object(7)
memory usage: 304.2+ KB
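
The `info()` summary above reports 964 non-null values for ` Rocket` out of 4324 rows, so roughly 78% of that column is missing. The following is a minimal sketch of how the missing share per column could be tabulated. The `toy` frame and its values are made up for illustration and only mimic two of the real column names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    'Company Name': ['SpaceX', 'CASC', 'SpaceX', 'ULA'],
    ' Rocket': ['50.0', np.nan, np.nan, '145.0'],
})

# isna() marks missing cells; the mean of the booleans per column
# is the share of missing values in that column.
missing_share = toy.isna().mean()
print(missing_share)
```

On the real data, the same call would be `data.isna().mean()`.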


##### Key observations on required cleaning
- The first two columns appear to be copies of the dataset's index, so they can be dropped.
- Column labels can be simplified.
- Data types need to be adjusted for some columns: for example, `Datum` and ` Rocket` are stored as `object`.
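
Before dropping the leftover index columns, it may be worth confirming that they really duplicate the index. The sketch below does this on a small hypothetical frame (`toy`, with illustrative values); it is one possible check, not necessarily the one used in this notebook.

```python
import pandas as pd

# Toy frame mimicking the leftover index columns (values are illustrative).
toy = pd.DataFrame({
    'Unnamed: 0.1': [0, 1, 2],
    'Unnamed: 0': [0, 1, 2],
    'Company Name': ['SpaceX', 'CASC', 'ULA'],
})

# A column whose values equal the row index carries no extra information.
index_values = toy.index.to_series()
redundant = [col for col in toy.columns if toy[col].equals(index_values)]

# Drop only the columns that were confirmed to be redundant.
cleaned = toy.drop(columns=redundant)
print(redundant)
```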

##### Dropping Columns with Repeated Index

In [18]:
data.drop(labels=['Unnamed: 0.1','Unnamed: 0'], axis=1, inplace=True)
data.head()

Unnamed: 0,Company Name,Location,Datum,Detail,Status Rocket,Rocket,Status Mission
0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA","Fri Aug 07, 2020 05:12 UTC",Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success
1,CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...","Thu Aug 06, 2020 04:01 UTC",Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success
2,SpaceX,"Pad A, Boca Chica, Texas, USA","Tue Aug 04, 2020 23:57 UTC",Starship Prototype | 150 Meter Hop,StatusActive,,Success
3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan","Thu Jul 30, 2020 21:25 UTC",Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success
4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA","Thu Jul 30, 2020 11:50 UTC",Atlas V 541 | Perseverance,StatusActive,145.0,Success


##### Rename Columns

In [31]:
data.rename(columns={'Company Name':'company',
                     'Location':'location',
                     'Datum':'date',
                     'Detail':'rocket_name',
                     'Status Rocket':'rocket_status',
                     ' Rocket':'mission_cost',
                     'Status Mission':'mission_status'}, inplace=True)
data.head()

Unnamed: 0,company,location,date,rocket_name,rocket_status,mission_cost,mission_status
0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA","Fri Aug 07, 2020 05:12 UTC",Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success
1,CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...","Thu Aug 06, 2020 04:01 UTC",Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success
2,SpaceX,"Pad A, Boca Chica, Texas, USA","Tue Aug 04, 2020 23:57 UTC",Starship Prototype | 150 Meter Hop,StatusActive,,Success
3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan","Thu Jul 30, 2020 21:25 UTC",Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success
4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA","Thu Jul 30, 2020 11:50 UTC",Atlas V 541 | Perseverance,StatusActive,145.0,Success
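
The rename above relies on the key `' Rocket'` including the leading space that appears in the `info()` output. One way to avoid depending on that detail is to normalize the labels first. This is a sketch on a hypothetical one-row frame (`toy`), assuming stray whitespace is the only irregularity in the labels.

```python
import pandas as pd

# Toy frame with the same stray leading space in one label.
toy = pd.DataFrame({'Company Name': ['SpaceX'], ' Rocket': ['50.0']})

# Strip surrounding whitespace from every column label before renaming.
toy.columns = toy.columns.str.strip()

# The mapping can now use the clean label 'Rocket'.
toy = toy.rename(columns={'Company Name': 'company',
                          'Rocket': 'mission_cost'})
print(list(toy.columns))
```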


##### Description of the dataset's columns
- **company** : The name of the company that launched the rocket.
- **location** : The location where the launch took place.
- **date** : The date when the launch took place.
- **rocket_name** : The name of the rocket, followed by the mission or payload name after the `|` separator.
- **rocket_status** : Indicates whether the rocket is still active or not.
- **mission_cost** : The cost of the launch.
- **mission_status** : Indicates whether the rocket launch succeeded or failed.
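
Since `info()` reports the cost column as `object`, it has to be converted before any numeric analysis. The sketch below shows one way to do that. It assumes that some values may carry thousands separators such as `'1,160.0'`; that assumption and the sample values are illustrative and have not been verified against the file.

```python
import numpy as np
import pandas as pd

# Illustrative sample; the '1,160.0' entry is an assumed format.
cost = pd.Series(['50.0', '29.75', np.nan, '1,160.0'], name='mission_cost')

# Remove thousands separators, then convert to float.
# errors='coerce' turns anything unparseable into NaN instead of raising.
cost_numeric = pd.to_numeric(cost.str.replace(',', '', regex=False),
                             errors='coerce')
print(cost_numeric)
```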

##### Convert date to datetime

In [60]:
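# Parse the mixed-format date strings as UTC timestamps, then keep only
# the calendar date. Note that .dt.strftime returns strings, so the
# column ends up with object dtype rather than datetime64.
data['date'] = pd.to_datetime(data['date'],
                              format='mixed',
                              utc=True).dt.strftime('%Y-%m-%d')
data.head()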


Unnamed: 0,company,location,date,rocket_name,rocket_status,mission_cost,mission_status
0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA",2020-08-07,Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success
1,CASC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...",2020-08-06,Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success
2,SpaceX,"Pad A, Boca Chica, Texas, USA",2020-08-04,Starship Prototype | 150 Meter Hop,StatusActive,,Success
3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan",2020-07-30,Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success
4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA",2020-07-30,Atlas V 541 | Perseverance,StatusActive,145.0,Success
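
Formatting with `strftime` produces strings, so datetime accessors such as `.dt.year` are no longer available on the column. If a true datetime column is wanted, one alternative is to drop the timezone and the time of day instead. This is a sketch on two sample strings taken from the output above; it uses an explicit format that fits these samples, whereas the notebook's `format='mixed'` also covers entries in other layouts.

```python
import pandas as pd

raw = pd.Series(['Fri Aug 07, 2020 05:12 UTC',
                 'Thu Jul 30, 2020 21:25 UTC'])

# Explicit format for this sample only.
parsed = pd.to_datetime(raw, format='%a %b %d, %Y %H:%M UTC', utc=True)

# Remove the timezone and truncate to midnight; the dtype stays datetime64.
dates = parsed.dt.tz_localize(None).dt.normalize()

# Datetime accessors keep working on the result.
years = dates.dt.year
print(dates.dtype, years.tolist())
```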
