<a href="https://www.kaggle.com/code/commandante/data-cleaning-all-space-missions-from-1957?scriptVersionId=142271915" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Data Cleaning | Exploring Space Missions from 1957

Hey there, I'm *Kaustav* and I'm excited to present this notebook where I've taken on the task of cleaning and organizing space mission data from 1957 onwards. The goal is to provide you with a clearer and more insightful perspective on these missions.

With a blend of primary and secondary research, the data has been meticulously refined, setting the stage for a more user-friendly experience. Leveraging the power of Python, I've conducted transformative cleaning and structuring, paving the way for an engaging data exploration.

As we journey through the cosmos of data, the focus will soon shift towards Exploratory Data Analysis and the art of storytelling through Tableau. Join me on this voyage as we unearth the tales hidden within the numbers and charts.

If you find my work intriguing, I invite you to upvote, share, and connect with me on LinkedIn. Let's discuss insights, collaborate, and continue exploring the endless possibilities that data holds!

Let's embark on this exciting odyssey together!

Connect with me on [LinkedIn](linkedin.com/in/commandantekaustav) 🚀


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/all-space-missions-from-1957/Space_Corrected.csv


# Importing Data

In [2]:
df = pd.read_csv('/kaggle/input/all-space-missions-from-1957/Space_Corrected.csv')

# Exploring Data

In [3]:
print("Columns in our data")
for x in list(df.columns):
    print(x)

Columns in our data
Unnamed: 0.1
Unnamed: 0
Company Name
Location
Datum
Detail
Status Rocket
 Rocket
Status Mission


The `Rocket` column name has a leading space (` Rocket`), which makes it awkward to reference, so let's rename it.

In [4]:
# Fixing column name
df.rename(columns={' Rocket': 'Rocket'}, inplace=True)
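If other column names might carry stray whitespace too, a more general option is to strip every column name at once. Here is a minimal sketch on a hypothetical `demo` frame that reuses three of the column names printed above:

```python
import pandas as pd

# Hypothetical empty frame with the same kind of column names as the dataset
demo = pd.DataFrame(columns=["Status Rocket", " Rocket", "Status Mission"])

# Strip leading/trailing whitespace from every column name in one step
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

This leaves already-clean names untouched, so it is safe to re-run.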

In [5]:
# Dropping useless columns; errors='ignore' keeps the cell safe to re-run
df.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore', inplace=True)
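Columns named `Unnamed: 0` typically appear when a DataFrame is saved with its index and then read back. The sketch below reproduces this on a tiny hypothetical frame (`demo`, `reloaded` and `clean` are illustrative names) and shows that `index=False` avoids it:

```python
import io
import pandas as pd

demo = pd.DataFrame({"Company Name": ["SpaceX", "CASC"]})

# Default to_csv writes the index as an unnamed first column
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))

# Writing without the index keeps the round trip clean
buf_clean = io.StringIO()
demo.to_csv(buf_clean, index=False)
buf_clean.seek(0)
clean = pd.read_csv(buf_clean)
print(list(clean.columns))
```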

## Exploring `Company Name`
**Unique Values**

In [6]:
print("Number of Unique companies:", df['Company Name'].nunique())

Number of Unique companies: 56


In [7]:
# Printing all the unique values in 'Company Name'
df['Company Name'].unique()

array(['SpaceX', 'CASC', 'Roscosmos', 'ULA', 'JAXA', 'Northrop', 'ExPace',
       'IAI', 'Rocket Lab', 'Virgin Orbit', 'VKS RF', 'MHI', 'IRGC',
       'Arianespace', 'ISA', 'Blue Origin', 'ISRO', 'Exos', 'ILS',
       'i-Space', 'OneSpace', 'Landspace', 'Eurockot', 'Land Launch',
       'CASIC', 'KCST', 'Sandia', 'Kosmotras', 'Khrunichev', 'Sea Launch',
       'KARI', 'ESA', 'NASA', 'Boeing', 'ISAS', 'SRC', 'MITT', 'Lockheed',
       'AEB', 'Starsem', 'RVSN USSR', 'EER', 'General Dynamics',
       'Martin Marietta', 'Yuzhmash', 'Douglas', 'ASI', 'US Air Force',
       'CNES', 'CECLES', 'RAE', 'UT', 'OKB-586', 'AMBA',
       "Arm??e de l'Air", 'US Navy'], dtype=object)

***Replacing `Arm??e de l'Air` with `L'Armée de l'Air`***

In [8]:
# Assigning the result back avoids relying on inplace changes to a column selection
df['Company Name'] = df['Company Name'].replace("Arm??e de l'Air",
                                                "L'Armée de l'Air")

In [9]:
print("Number of empty values in Company Names: ",df['Company Name'].isna().sum())

Number of empty values in Company Names:  0


## Exploring `Location`

In [10]:
print("Number of unique Locations: ",df['Location'].nunique())

Number of unique Locations:  137


In [11]:
# Printing all the unique values in 'Location'
df['Location'].unique()

array(['LC-39A, Kennedy Space Center, Florida, USA',
       'Site 9401 (SLS-2), Jiuquan Satellite Launch Center, China',
       'Pad A, Boca Chica, Texas, USA',
       'Site 200/39, Baikonur Cosmodrome, Kazakhstan',
       'SLC-41, Cape Canaveral AFS, Florida, USA',
       'LC-9, Taiyuan Satellite Launch Center, China',
       'Site 31/6, Baikonur Cosmodrome, Kazakhstan',
       'LC-101, Wenchang Satellite Launch Center, China',
       'SLC-40, Cape Canaveral AFS, Florida, USA',
       'LA-Y1, Tanegashima Space Center, Japan',
       'LP-0B, Wallops Flight Facility, Virginia, USA',
       'Site 95, Jiuquan Satellite Launch Center, China',
       'LC-3, Xichang Satellite Launch Center, China',
       'Pad 1, Palmachim Airbase, Israel',
       'Rocket Lab LC-1A, M?\x81hia Peninsula, New Zealand',
       'LC-2, Xichang Satellite Launch Center, China',
       'Xichang Satellite Launch Center, China',
       'Cosmic Girl, Mojave Air and Space Port, California, USA',
       'Site 43/4, Plese

As we can observe, the location details are written in the following approximate form.

1. Launcher Name (Launchpad, Carrier Vessel, Spaceport etc.)
2. Facility / Space Station
3. State / Province
4. Country
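
Since this form is only approximate, it can help to count how many comma-separated parts each location has before indexing into them. This is a small sketch using three locations from the output above (`locations` and `n_parts` are illustrative names):

```python
import pandas as pd

# Three locations copied from the unique values printed above
locations = pd.Series([
    "LC-39A, Kennedy Space Center, Florida, USA",
    "Site 200/39, Baikonur Cosmodrome, Kazakhstan",
    "Xichang Satellite Launch Center, China",
])

# Number of comma-separated parts per location
n_parts = locations.str.split(",").str.len()
print(n_parts.value_counts().sort_index())
```

Locations with only two parts have no separate launcher, so `split(",")[1]` returns the country for them.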


**Country**:

In [12]:
df['Country'] = df['Location'].apply(lambda x: x.split(",")[-1].strip())

In [13]:
# All unique values in Country
print("Number of unique countries : ", df['Country'].nunique())
df['Country'].unique()

Number of unique countries :  22


array(['USA', 'China', 'Kazakhstan', 'Japan', 'Israel', 'New Zealand',
       'Russia', 'Shahrud Missile Test Site', 'France', 'Iran', 'India',
       'New Mexico', 'Yellow Sea', 'North Korea',
       'Pacific Missile Range Facility', 'Pacific Ocean', 'South Korea',
       'Barents Sea', 'Brazil', 'Gran Canaria', 'Kenya', 'Australia'],
      dtype=object)

As we can see, some of these values aren't valid countries:
1. `Shahrud Missile Test Site` is a secret military base situated in `Iran`.
2. `Yellow Sea` : The `Tai Rui Barge` is a mobile launch platform used by the China Aerospace Science and Technology Corporation and China Aerospace Science and Industry Corporation for launching small satellites. It is located in the Yellow Sea, off the coast of `China`.
3. `Pacific Missile Range Facility` : The `Pacific Missile Range Facility` Barking Sands (PMRF) is a United States Navy (USN) installation located on the west side of `Kauai, Hawaii`, so the country is `USA`.
4. `Pacific Ocean` : The `Kiritimati Launch Area`, also known as the `Sea Launch Launch Site`, is a floating launch platform located in the Pacific Ocean off the coast of `Kiribati`. It was used by Sea Launch, a multinational joint venture that included Boeing, to launch commercial satellites into orbit.
5. `Barents Sea` : The `Barents Sea Launch Area` is a former submarine-launched ballistic missile (SLBM) launch area located in the Barents Sea, off the coast of `Russia`. It was used by the `Soviet Union` and `Russia` to launch SLBMs into the Arctic Ocean.
6. `Gran Canaria` : The `Gando Air Base` is an air base of the Spanish Air Force located in Gando, on the island of `Gran Canaria, Spain`. It is one of the largest and most important air bases in Spain and is unique for the wide variety of aircraft it operates.

~~**Then**, Every entry has a space in the beginning.~~

In [14]:
# Map values that aren't countries to the country they belong to.
# replace() leaves already-fixed values untouched, so the cell is safe to re-run.
replacements_country = {'Shahrud Missile Test Site' : 'Iran',
                        'Yellow Sea' : 'China',
                        'Pacific Missile Range Facility' : 'USA',
                        'Pacific Ocean' : 'Kiribati',
                        'Barents Sea' : 'Russia',
                        'Gran Canaria' : 'Spain'}
df['Country'] = df['Country'].replace(replacements_country)
print("Fixed all the invalid entries ")
print("Number of unique countries : ", df['Country'].nunique())
for country in df['Country'].unique():
    print(country, end ='\t')

Fixed all the invalid entries 
Number of unique countries :  18
USA	China	Kazakhstan	Japan	Israel	New Zealand	Russia	Iran	France	India	New Mexico	North Korea	Kiribati	South Korea	Brazil	Spain	Kenya	Australia	
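
Note that `New Mexico` still appears in the list above, although it is a US state rather than a country. Below is a hedged sketch of how leftovers like this could be flagged and fixed, assuming those rows refer to launches from New Mexico, USA. `known_countries` is a hypothetical, hand-written allow-list and `demo_country` stands in for `df['Country']`:

```python
import pandas as pd

# Stand-in for df['Country'] with a few values from the output above
demo_country = pd.Series(["USA", "China", "New Mexico", "Kiribati", "New Mexico"])

# Hypothetical allow-list of values accepted as countries
known_countries = {"USA", "China", "Kiribati"}

# Anything outside the allow-list needs a manual look
leftovers = sorted(set(demo_country.unique()) - known_countries)
print(leftovers)

# Assumption: the New Mexico rows belong to the USA
demo_country = demo_country.replace({"New Mexico": "USA"})
print(demo_country.nunique())
```

Applied to the real column, this mapping would bring the number of unique countries down from 18 to 17, since `USA` is already present.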

**Launcher Name (Launchpad, Carrier Vessel, Spaceport etc.)**

In [15]:
# Launch-pad / Carrier-vessel
df['Launcher'] = df['Location'].apply(lambda x: x.split(",")[0])

In [16]:
print("Number of unique Launch-pads/sites/carriers: ",df['Launcher'].nunique())
print("Unique values are:")
for launcher in df['Launcher'].unique():
    print(launcher)

Number of unique Launch-pads/sites/carriers:  130
Unique values are:
LC-39A
Site 9401 (SLS-2)
Pad A
Site 200/39
SLC-41
LC-9
Site 31/6
LC-101
SLC-40
LA-Y1
LP-0B
Site 95
LC-3
Pad 1
Rocket Lab LC-1A
LC-2
Xichang Satellite Launch Center
Cosmic Girl
Site 43/4
LA-Y2
Launch Plateform
LC-201
Site 43/3
ELA-3
LP-0A
Imam Khomeini Spaceport
Site 133/3
Site 81/24
ELS
Blue Origin Launch Site
First Launch Pad
Taiyuan Satellite Launch Center
Second Launch Pad
LC-16
Vertical Launch Area
Stargazer
Site 1/5
SLC-37B
ELV-1 (SLV)
Site 1S
SLC-4E
Tai Rui Barge
SLC-6
Mu Pad
SLC-2W
SLC-3E
Uchinoura Space Center
Site 45/1
SLC-576E
SLC-46
Site 901 (SLS-1)
LP-41
Site 370/13
Site 35/1
Site 175/59
LP Odyssey
Jiuquan Satellite Launch Center
LC-1
LP-1
SLC-17B
SLC-8
Site 16/2
Site 109/95
Site 132/1
LC-39B
SLC-17A
Omelek Island
Site 32/2
LC-7
Site 90/20
K-84 Submarine
Svobodny Cosmodrome
K-496 Submarine
SLC-36B
SLC-36A
SLC-4W
VLS Pad
ELA-2
Site 32/1
Site 107/1
Site 41/1
K-407 Submarine
Site 81/23
Site 138 (LA-2B)
Site 1

1. `Rocket Lab LC-1A` : `Rocket Lab` is a private aerospace company based in New Zealand that develops and manufactures small launch vehicles. The company's launch complex in Mahia Peninsula is called `Launch Complex 1 (LC-1)`. 

In [17]:
print("Number of empty cells: ", df['Launcher'].isna().sum())

Number of empty cells:  0


**Facility**:

In [18]:
df['Facility'] = df['Location'].apply(lambda x: x.split(",")[1].strip())

In [19]:
print("Number of unique Facilities: ",df['Facility'].nunique())
print("Unique values are:")
for facility in df['Facility'].unique():
    print(facility)

Number of unique Facilities:  44
Unique values are:
Kennedy Space Center
Jiuquan Satellite Launch Center
Boca Chica
Baikonur Cosmodrome
Cape Canaveral AFS
Taiyuan Satellite Launch Center
Wenchang Satellite Launch Center
Tanegashima Space Center
Wallops Flight Facility
Xichang Satellite Launch Center
Palmachim Airbase
M?hia Peninsula
China
Mojave Air and Space Port
Plesetsk Cosmodrome
Shahrud Missile Test Site
Guiana Space Centre
Semnan Space Center
West Texas
Satish Dhawan Space Centre
Spaceport America
Vostochny Cosmodrome
Vandenberg AFB
Yellow Sea
Uchinoura Space Center
Japan
Sohae Satellite Launching Station
Kauai
Yasny Cosmodrome
Kiritimati Launch Area
Naro Space Center
Ronald Reagan Ballistic Missile Defense Test Site
Pacific Spaceport Complex
Tonghae Satellite Launching Ground
Barents Sea Launch Area
Russia
Alc?›ntara Launch Center
Kapustin Yar
Base Aerea de Gando
Edwards AFB
San Marco Launch Platform
RAAF Woomera Range Complex
Hammaguir
Naval Air Station Point Mugu


As we can see, some of these values aren't valid facilities:
1. `M?\x81hia Peninsula` (printed as `M?hia Peninsula`) : `Rocket Lab` is a private aerospace company based in `New Zealand` that develops and manufactures small launch vehicles. The company's launch complex on the `Mahia Peninsula` is called `Launch Complex 1 (LC-1)`.
2. `Alc?›ntara Launch Center` : The `Alcântara Launch Center (CLA)` is a spaceport located in the municipality of `Alcântara`, in the state of `Maranhão, Brazil`. It is the closest launch site to the equator in the world, making it attractive for launches of geostationary satellites. The CLA was officially opened in 1990 and has since been used to launch a variety of satellites, including Brazilian, Russian, and Ukrainian satellites.
3. `Boca Chica` : `SpaceX`'s `Starbase` facility in `Boca Chica`, Texas is a spaceport, production, and development facility for Starship rockets. It has been under construction since the late 2010s and is currently the site of SpaceX's development and testing of the Starship and Super Heavy rockets.
4. `Kauai` : The `Pacific Missile Range Facility` (PMRF) on Kauai was used to launch rockets and missiles from 1959 to 1995.
5. `West Texas` : Jeff Bezos's space company `Blue Origin` has a launch site in West Texas called `Launch Site One`. It is located in the desert near the town of Van Horn. The launch site has been used to launch a number of New Shepard suborbital rockets, including the one that Jeff Bezos took on his first space flight in 2021.
6. `Japan` : The only location whose second part is `Japan` is `Uchinoura Space Center`, which is written in `Facility, Country` form.
7. `Yellow Sea` : The only location whose second part is `Yellow Sea` is `Tai Rui Barge`, which is written in `Facility, Country` form.
8. `Russia` : The only location whose second part is `Russia` is `Svobodny Cosmodrome`, which is written in `Facility, Country` form.
9. `China` : This one is trickier, and I haven't been able to work out a solution. `China` has two entries written in the aforementioned form: `Taiyuan Satellite Launch Center` and `Xichang Satellite Launch Center`. My idea is to set `df['Facility']` to `df['Launcher']` wherever `df['Facility'] == 'China'`, but for some reason it's not working. I'd really appreciate some help. :)
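
For the `China` case in item 9, one possible approach is a boolean mask with `.loc`, which copies `Launcher` into `Facility` only on the matching rows. One likely reason a plain `if df['Facility'] == 'China'` fails is that Python's `if` cannot evaluate a whole column at once; that diagnosis is a guess, since the failing code isn't shown. The sketch uses a hypothetical `demo` frame built from three locations in the output above:

```python
import pandas as pd

demo = pd.DataFrame({
    "Location": [
        "LC-39A, Kennedy Space Center, Florida, USA",
        "Xichang Satellite Launch Center, China",
        "Taiyuan Satellite Launch Center, China",
    ]
})

# Same extraction logic as in the notebook
demo["Launcher"] = demo["Location"].apply(lambda x: x.split(",")[0])
demo["Facility"] = demo["Location"].apply(lambda x: x.split(",")[1].strip())

# Rows where the "facility" is really the country
is_country = demo["Facility"] == "China"

# Copy the launcher name into Facility only on those rows
demo.loc[is_country, "Facility"] = demo.loc[is_country, "Launcher"]

print(demo["Facility"].tolist())
```

On the real data, the same two lines with `df` in place of `demo` should work. The same mask could be widened with `isin([...])` if other country names turn up in `Facility`.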

In [20]:
# Map mis-encoded or too-generic values to proper facility names.
# replace() leaves already-fixed values untouched, so the cell is safe to re-run.
replacements_facility = {'M?\x81hia Peninsula' : 'Rocket Lab',
                         'Alc?›ntara Launch Center' : 'Alcântara Launch Center',
                         'Boca Chica' : 'Starbase',
                         'Japan' : 'Uchinoura Space Center',
                         'Kauai' : 'Pacific Missile Range Facility',
                         'West Texas' : 'Launch Site One',
                         'Yellow Sea' : 'Tai Rui Barge',
                         'Russia' : 'Svobodny Cosmodrome'}
df['Facility'] = df['Facility'].replace(replacements_facility)
print("Fixed all the invalid entries ")
print("Number of unique facilities : ", df['Facility'].nunique())
for facility in df['Facility'].unique():
    print(facility, end ='\t')

Fixed all the invalid entries 
Number of unique facilities :  43
Kennedy Space Center	Jiuquan Satellite Launch Center	Starbase	Baikonur Cosmodrome	Cape Canaveral AFS	Taiyuan Satellite Launch Center	Wenchang Satellite Launch Center	Tanegashima Space Center	Wallops Flight Facility	Xichang Satellite Launch Center	Palmachim Airbase	Rocket Lab	China	Mojave Air and Space Port	Plesetsk Cosmodrome	Shahrud Missile Test Site	Guiana Space Centre	Semnan Space Center	Launch Site One	Satish Dhawan Space Centre	Spaceport America	Vostochny Cosmodrome	Vandenberg AFB	Tai Rui Barge	Uchinoura Space Center	Sohae Satellite Launching Station	Pacific Missile Range Facility	Yasny Cosmodrome	Kiritimati Launch Area	Naro Space Center	Ronald Reagan Ballistic Missile Defense Test Site	Pacific Spaceport Complex	Tonghae Satellite Launching Ground	Barents Sea Launch Area	Svobodny Cosmodrome	Alcântara Launch Center	Kapustin Yar	Base Aerea de Gando	Edwards AFB	San Marco Launch Platform	RAAF Woomera Range Complex	Hammagui

Since the analysis is mainly about comparing countries with each other, further granularity in the locations isn't needed.

In [21]:
# index=False avoids writing the index as an extra 'Unnamed: 0' column
df.to_csv('Enhanced Space.csv', index=False)

## Exploring `Datum`

In [22]:
# def remove_utc(datum):
#     return datum[:-4]

# # Apply the remove_utc() function to the Datum column
# df['Datum'] = df['Datum'].apply(remove_utc)
# df['Datum']

In [23]:
# df['Date'] = df['Datum'].apply(lambda x: x[:17])

In [24]:
# df['Date'] = pd.to_datetime(df['Datum'])
# df['Date']

In [25]:
# df['Datum'] = pd.to_datetime(df['Datum'], format='%a %b %d, %Y %H:%M', errors='coerce')

In [26]:
# df['Datum'].isna().sum()
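
The commented-out attempts above suggest that `Datum` holds strings such as `Fri Aug 07, 2020 05:12 UTC`. That shape is an assumption inferred from the format string `'%a %b %d, %Y %H:%M'` and the `[:-4]` slice, not something shown in this notebook. Under that assumption, here is a minimal sketch that strips the `UTC` suffix, parses rows with a time, and falls back to a date-only format. The sample values, including the row without a time, are hypothetical:

```python
import pandas as pd

# Hypothetical sample values; the real column may look different
datum = pd.Series([
    "Fri Aug 07, 2020 05:12 UTC",
    "Sat Feb 04, 1967",      # assumed case: no time given
    "not a date",
])

# Drop a trailing "UTC" marker and surrounding whitespace
cleaned = datum.str.replace(r"\s*UTC$", "", regex=True).str.strip()

# First try the full format, then the date-only format for the rest
with_time = pd.to_datetime(cleaned, format="%a %b %d, %Y %H:%M", errors="coerce")
date_only = pd.to_datetime(cleaned, format="%a %b %d, %Y", errors="coerce")
parsed = with_time.fillna(date_only)

print(parsed)
print("Unparsed rows:", parsed.isna().sum())
```

Checking `parsed.isna().sum()` afterwards shows how many rows matched neither format and still need attention.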