# 🚀 Project Title: NASA Space Object Index Data Extraction + EDA 

## 📌 One-Liner
Automates the extraction, cleaning, and exploratory analysis of launch and mission data from a NASA infinite-scroll web resource to support space-tech market intelligence and infrastructure strategy.

---

## TL;DR Executive Summary
**3 Findings**:  
1. Successfully built an automated scraper for NASA’s mission listings using infinite-scroll handling.  
2. Cleaned and structured the dataset into consistent formats for mission names, launch dates, locations, and mission objectives.  
3. Conducted exploratory data analysis revealing patterns in mission frequency, geographic launch distribution, and thematic mission types.

**2 Implications**:  
1. Enables ongoing, low-effort tracking of NASA’s mission portfolio for competitive intelligence.  
2. Establishes a reusable pipeline for other space agencies’ open portals.

**1 Recommendation**:  
Extend this workflow to include launch success metrics, payload mass, and satellite type for richer strategic correlation.

---

## 🎯 Problem Statement & Decision Context
- **Business Question**: How can we systematically collect, clean, and analyze launch mission data from NASA to identify patterns relevant for market positioning in the space-tech and Earth observation (EO) sectors?  
- **Scope**: NASA missions page, infinite scroll, structured tabular extraction, CSV/GeoJSON storage, EDA in Python.  
- **Out of Scope**: Real-time mission tracking, orbital mechanics calculations, and EO raster integration (covered in later projects).  
- **Success Criteria**: Fully automated data extraction script + cleaned dataset + EDA visualizations revealing at least three actionable patterns.

---

## 👥 Stakeholders & Use Cases
- **Primary Stakeholders**:  
  - Aerospace startups (Skyroot, Pixxel) for competitive benchmarking  
  - Space policy think tanks for mission diversity analysis  
  - Infrastructure planners for launch site capacity planning

- **Use Cases**:  
  - Regular reporting on NASA mission pipeline  
  - Comparative analysis with other agencies  
  - Foundation for EO mission overlay and downstream analytics

---

## 🗂 Data Card
- **Source**: NASA Launch/Mission website (infinite scroll endpoint)  
- **Method**: Selenium/Python requests with dynamic content loading handling  
- **License**: Public domain (US Government works)  
- **Update Frequency**: Daily/Weekly (can be scheduled)  
- **Key Attributes**:  
  - `mission_name` (string)  
  - `launch_date` (datetime)  
  - `launch_location` (string)  
  - `mission_type` (categorical)  
  - `mission_summary` (text)

- **Known Limitations**:  
  - Data may omit classified missions  
  - Inconsistent mission type labels require standardization

---

## 🔍 Method Overview
1. **Data Extraction**  
   - Automated infinite-scroll loading until all records are loaded  
   - HTML parsing & structured field extraction  
2. **Data Cleaning**  
   - Standardizing dates, normalizing location names, deduplicating records  
3. **Exploratory Data Analysis**  
   - Launch frequency by year/quarter  
   - Launch site distribution mapping  
   - Mission type breakdown  
4. **Output Preparation**  
   - CSV for tabular use  
   - GeoJSON for GIS integration
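The "load until all records are loaded" step can be sketched as a loop that keeps requesting batches until one adds nothing new. This is a minimal sketch, not the project's scraper: `fetch_batch`, the `record_id` field, and the fake pages are hypothetical stand-ins for whatever the real page or endpoint returns.

```python
from typing import Callable, Dict, List


def collect_all_records(
    fetch_batch: Callable[[int], List[Dict[str, str]]],
    key: str = "record_id",  # hypothetical unique field per record
    max_batches: int = 1000,  # safety cap so the loop always terminates
) -> List[Dict[str, str]]:
    """Request batches until one contributes no unseen records."""
    seen = set()
    records: List[Dict[str, str]] = []
    for batch_index in range(max_batches):
        added = 0
        for row in fetch_batch(batch_index):
            if row[key] not in seen:
                seen.add(row[key])
                records.append(row)
                added += 1
        if added == 0:  # empty or fully repeated batch: nothing left to load
            break
    return records


# Stand-in for the real page: three batches, the last one only repeats a record.
_fake_pages = [
    [{"record_id": "a"}, {"record_id": "b"}],
    [{"record_id": "c"}],
    [{"record_id": "c"}],
]


def fake_fetch(batch_index: int) -> List[Dict[str, str]]:
    """Return a canned batch, or an empty list once the pages run out."""
    return _fake_pages[batch_index] if batch_index < len(_fake_pages) else []


rows = collect_all_records(fake_fetch)
print(len(rows))
```

With Selenium, `fetch_batch` would typically scroll the page and return the rows that are currently rendered; the stopping rule stays the same.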

---

## ⚙️ Environment & Reproducibility
- **Python Version**: 3.10+  
- **Key Libraries**: pandas, requests, BeautifulSoup, Selenium, geopandas, matplotlib/seaborn  
- **Runtime**: ~5–8 minutes end-to-end  
- **File Structure**:


In [2]:
import pandas as pd
df = pd.read_csv("unoosa_objects.csv")

df.head()







|   | Name_of_Space_Object | International_Designator | National_Designator_s1 | Launch_Date | Launching_State | Launch_Site | Launch_Vehicle | Basic_Perigee | Basic_Apogee | Basic_Inclination | Current_Function | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | 2025-085Q |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 1 |  | 2025-085S |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 2 |  | 2025-085T |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 3 |  | 2025-085U |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 4 |  | 2025-085V |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21289 entries, 0 to 21288
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Name_of_Space_Object      17263 non-null  object 
 1   International_Designator  21289 non-null  object 
 2   National_Designator_s1    7576 non-null   object 
 3   Launch_Date               21289 non-null  object 
 4   Launching_State           0 non-null      float64
 5   Launch_Site               0 non-null      float64
 6   Launch_Vehicle            0 non-null      float64
 7   Basic_Perigee             0 non-null      float64
 8   Basic_Apogee              0 non-null      float64
 9   Basic_Inclination         0 non-null      float64
 10  Current_Function          0 non-null      float64
 11  Status                    21288 non-null  object 
dtypes: float64(7), object(5)
memory usage: 1.9+ MB


In [4]:

df.describe()

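|   | Launching_State | Launch_Site | Launch_Vehicle | Basic_Perigee | Basic_Apogee | Basic_Inclination | Current_Function |
|---|---|---|---|---|---|---|---|
| count | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean |  |  |  |  |  |  |  |
| std |  |  |  |  |  |  |  |
| min |  |  |  |  |  |  |  |
| 25% |  |  |  |  |  |  |  |
| 50% |  |  |  |  |  |  |  |
| 75% |  |  |  |  |  |  |  |
| max |  |  |  |  |  |  |  |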


In [5]:
df.isnull().sum()

Name_of_Space_Object         4026
International_Designator        0
National_Designator_s1      13713
Launch_Date                     0
Launching_State             21289
Launch_Site                 21289
Launch_Vehicle              21289
Basic_Perigee               21289
Basic_Apogee                21289
Basic_Inclination           21289
Current_Function            21289
Status                          1
dtype: int64
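Read as shares of the 21289 rows, these counts mean roughly 18.9% of object names and 64.4% of national designators are missing, while seven columns are empty throughout. A minimal sketch of computing those shares directly, using a small made-up frame in place of the loaded `df`:

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this would be the loaded `df`.
sample = pd.DataFrame(
    {
        "International_Designator": ["2025-085Q", "2025-085S", "2025-085T"],
        "Name_of_Space_Object": [None, "MOLNIYA 1", None],
        "Launch_Site": [None, None, None],
    }
)

# Mean of the boolean mask gives the fraction of missing values per column.
missing_share = sample.isna().mean().sort_values(ascending=False)

# Columns with a share of exactly 1.0 carry no information at all.
fully_empty = missing_share[missing_share == 1.0].index.tolist()

print(missing_share.round(3))
print("Fully empty columns:", fully_empty)
```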

In [6]:
# Drop all rows where all columns are NaN
df = df.dropna(how='all')

# Print the shape and show the first rows to confirm
print(df.shape)
df.head()

(21289, 12)


|   | Name_of_Space_Object | International_Designator | National_Designator_s1 | Launch_Date | Launching_State | Launch_Site | Launch_Vehicle | Basic_Perigee | Basic_Apogee | Basic_Inclination | Current_Function | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | 2025-085Q |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 1 |  | 2025-085S |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 2 |  | 2025-085T |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 3 |  | 2025-085U |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |
| 4 |  | 2025-085V |  | 2025-04-28 |  |  |  |  |  |  |  | in orbit |


In [7]:
# Drop all columns where all values are NaN
df = df.dropna(axis=1, how='all')

# Print the shape and show the first rows to confirm
print(df.shape)
df.head()

(21289, 5)


|   | Name_of_Space_Object | International_Designator | National_Designator_s1 | Launch_Date | Status |
|---|---|---|---|---|---|
| 0 |  | 2025-085Q |  | 2025-04-28 | in orbit |
| 1 |  | 2025-085S |  | 2025-04-28 | in orbit |
| 2 |  | 2025-085T |  | 2025-04-28 | in orbit |
| 3 |  | 2025-085U |  | 2025-04-28 | in orbit |
| 4 |  | 2025-085V |  | 2025-04-28 | in orbit |


In [8]:
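# List columns where all values are NaN.
# Note: this runs after the all-NaN columns were already dropped, so the list is
# expected to be empty; to record what was removed, compute it before dropna.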
removed_columns = df.columns[df.isna().all()].tolist()
print("Removed columns:", removed_columns)

Removed columns: []
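The empty list is expected here, because the check runs on a frame whose all-NaN columns are already gone. To actually record which columns get removed, the names have to be collected before the drop. A minimal sketch on a made-up frame (`raw`, `empty_columns`, and `trimmed` are illustrative names, not from the notebook):

```python
import pandas as pd

# Made-up stand-in for the frame before any columns are dropped.
raw = pd.DataFrame(
    {
        "International_Designator": ["2025-085Q", "2025-085S"],
        "Launch_Site": [None, None],
        "Status": ["in orbit", "in orbit"],
    }
)

# Identify the empty columns first, then drop exactly those.
empty_columns = raw.columns[raw.isna().all()].tolist()
trimmed = raw.drop(columns=empty_columns)

print("Removed columns:", empty_columns)
print("Remaining columns:", trimmed.columns.tolist())
```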


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21289 entries, 0 to 21288
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Name_of_Space_Object      17263 non-null  object
 1   International_Designator  21289 non-null  object
 2   National_Designator_s1    7576 non-null   object
 3   Launch_Date               21289 non-null  object
 4   Status                    21288 non-null  object
dtypes: object(5)
memory usage: 831.7+ KB


# Exploratory Data Analysis (EDA) Steps

1. **Overview of Data**
   - Display the first few rows of the DataFrame.
   - Show the shape (number of rows and columns).
   - Display column names and data types.

2. **Missing Values**
   - Count missing values in each column.
   - Visualize missing data if needed.

3. **Convert Launch Date to DateTime**
   - Convert the `Launch_Date` column to a datetime object for time-based analysis.

4. **Replace NaN with `None`**
   - Replace NaN values in the DataFrame with Python's `None`, the counterpart of `null` in formats such as JSON.

5. **Summary Statistics**
   - Generate summary statistics for each column.
   - For categorical columns, show value counts.

6. **Label-based Analysis**
   - Analyze the distribution of the `Status` column.
   - Analyze the distribution of the `Name_of_Space_Object` and other key labels.

7. **Time-based Analysis**
   - Analyze launches over time using the `Launch_Date` column.

8. **Visualizations**
   - Plot distributions and trends for key columns.

---

The following code cells will implement these steps.
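Step 8 has no dedicated cell in this notebook. As a minimal sketch of one such plot, launches per year can be drawn with matplotlib as follows; the sample dates and the output file name are made up for illustration, and in the notebook the series would come from `df['Launch_Date']` after conversion.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample; replace with the notebook's converted Launch_Date column.
sample = pd.DataFrame(
    {"Launch_Date": pd.to_datetime(["1957-10-04", "2021-06-30", "2021-02-20", "2025-04-28"])}
)

# Count rows per launch year and order the result chronologically.
launches_per_year = sample["Launch_Date"].dt.year.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(launches_per_year.index, launches_per_year.values)
ax.set_xlabel("Launch year")
ax.set_ylabel("Registered objects")
ax.set_title("Objects per launch year (sample data)")
fig.tight_layout()
fig.savefig("launches_per_year.png")  # hypothetical output file name
```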


In [10]:
# 1. Overview of Data
display(df.head())
print('Shape:', df.shape)
print('Columns:', df.columns.tolist())
print('Data types:')
print(df.dtypes)

# 2. Missing Values
print('\nMissing values per column:')
print(df.isnull().sum())

# 3. Convert Launch_Date to datetime; values that cannot be parsed become NaT
df['Launch_Date'] = pd.to_datetime(df['Launch_Date'], errors='coerce')

# 4. Replace NaN with None (null)
df = df.where(pd.notnull(df), None)

# 5. Summary Statistics
print('\nSummary statistics:')
display(df.describe(include='all'))

# 6. Label-based Analysis
print('\nStatus value counts:')
print(df['Status'].value_counts(dropna=False))
print('\nName_of_Space_Object value counts (top 10):')
print(df['Name_of_Space_Object'].value_counts(dropna=False).head(10))

# 7. Time-based Analysis
print('\nLaunches per year:')
print(df['Launch_Date'].dt.year.value_counts().sort_index())


|   | Name_of_Space_Object | International_Designator | National_Designator_s1 | Launch_Date | Status |
|---|---|---|---|---|---|
| 0 |  | 2025-085Q |  | 2025-04-28 | in orbit |
| 1 |  | 2025-085S |  | 2025-04-28 | in orbit |
| 2 |  | 2025-085T |  | 2025-04-28 | in orbit |
| 3 |  | 2025-085U |  | 2025-04-28 | in orbit |
| 4 |  | 2025-085V |  | 2025-04-28 | in orbit |


Shape: (21289, 5)
Columns: ['Name_of_Space_Object', 'International_Designator', 'National_Designator_s1', 'Launch_Date', 'Status']
Data types:
Name_of_Space_Object        object
International_Designator    object
National_Designator_s1      object
Launch_Date                 object
Status                      object
dtype: object

Missing values per column:
Name_of_Space_Object         4026
International_Designator        0
National_Designator_s1      13713
Launch_Date                     0
Status                          1
dtype: int64

Summary statistics:


|   | Name_of_Space_Object | International_Designator | National_Designator_s1 | Launch_Date | Status |
|---|---|---|---|---|---|
| count | 17263 | 21289 | 7576 | 21280 | 21288 |
| unique | 16737 | 20527 | 5128 |  | 24 |
| top | MOLNIYA 1 | `2018-092*` | `------` |  | in orbit |
| freq | 90 | 106 | 2185 |  | 13643 |
| mean |  |  |  | 2010-11-23 14:40:42.857142784 |  |
| min |  |  |  | 1957-10-04 00:00:00 |  |
| 25% |  |  |  | 2000-04-24 06:00:00 |  |
| 50% |  |  |  | 2021-02-20 00:00:00 |  |
| 75% |  |  |  | 2023-06-12 00:00:00 |  |
| max |  |  |  | 2025-04-28 00:00:00 |  |



Status value counts:
Status
in orbit                       13643
decayed                         3498
recovered                       1424
in GSO                          1262
deorbited                       1111
in disposal/graveyard orbit       98
heliocentric                      79
on Moon                           59
selenocentric                     27
on Mars                           18
areocentric                       16
on Venus                          15
on Ryugu                          12
in Sun L1                          8
in Sun L2                          5
interstellar                       4
on Comet 67P                       2
None                               1
on Dimorphos                       1
in Moon L2                         1
orbiting Bennu                     1
orbiting Venus                     1
orbiting Ceres                     1
on Eros                            1
barycentric                        1
Name: count, dtype: int64

Name_of_Space_Objec
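In the summary statistics above, `Launch_Date` has a count of 21280 although the frame has 21289 rows and the column had no missing values before conversion, so 9 values appear to have been turned into `NaT` by `errors='coerce'`. A minimal sketch of listing such values, assuming the original strings are still available (here in a made-up `sample` frame):

```python
import pandas as pd

# Made-up sample with one value that cannot be parsed as a date.
sample = pd.DataFrame({"Launch_Date": ["2025-04-28", "1957-10-04", "not a date"]})

# Parse into a separate series so the original strings stay available.
parsed = pd.to_datetime(sample["Launch_Date"], errors="coerce")

# Rows that had a value before parsing but are NaT afterwards failed to parse.
failed = sample.loc[parsed.isna() & sample["Launch_Date"].notna(), "Launch_Date"]

print("Unparseable values:", failed.tolist())
```

Inspecting these strings before overwriting the column shows whether they are partial dates, placeholders, or typos, which determines how they should be handled.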

In [11]:
# Step 6: Label-based Analysis

# 1. Status column analysis
print("Status value counts (including NaN):")
print(df['Status'].value_counts(dropna=False))
print("\nStatus unique values:", df['Status'].unique())

# 2. Name_of_Space_Object analysis
print("\nTop 10 most common space object names:")
print(df['Name_of_Space_Object'].value_counts(dropna=False).head(10))
print("\nNumber of unique space object names:", df['Name_of_Space_Object'].nunique())

# 3. National_Designator_s1 analysis (if present)
if 'National_Designator_s1' in df.columns:
    print("\nTop 10 National Designators:")
    print(df['National_Designator_s1'].value_counts(dropna=False).head(10))
    print("Number of unique National Designators:", df['National_Designator_s1'].nunique())

# 4. International_Designator analysis (if present)
if 'International_Designator' in df.columns:
    print("\nNumber of unique International Designators:", df['International_Designator'].nunique())

# 5. Launch year distribution (if Launch_Date is datetime)
if pd.api.types.is_datetime64_any_dtype(df['Launch_Date']):
    print("\nLaunches per year:")
    print(df['Launch_Date'].dt.year.value_counts().sort_index())

# 6. Missing label analysis
print("\nMissing values in key label columns:")
label_cols = ['Name_of_Space_Object', 'International_Designator', 'National_Designator_s1', 'Status']
for col in label_cols:
    if col in df.columns:
        print(f"{col}: {df[col].isna().sum()} missing")

Status value counts (including NaN):
Status
in orbit                       13643
decayed                         3498
recovered                       1424
in GSO                          1262
deorbited                       1111
in disposal/graveyard orbit       98
heliocentric                      79
on Moon                           59
selenocentric                     27
on Mars                           18
areocentric                       16
on Venus                          15
on Ryugu                          12
in Sun L1                          8
in Sun L2                          5
interstellar                       4
on Comet 67P                       2
None                               1
on Dimorphos                       1
in Moon L2                         1
orbiting Bennu                     1
orbiting Venus                     1
orbiting Ceres                     1
on Eros                            1
barycentric                        1
Name: count, dtype: int64

Stat
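The most frequent `National_Designator_s1` value in the summary statistics is `------` (2185 occurrences), which looks like a placeholder rather than a real designator. One option, assuming dash-only strings carry no information, is to treat them as missing. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up sample: two dash-only placeholders, one missing value, two real-looking values.
sample = pd.Series(
    ["------", "1957-001", None, " -- ", "SJ-21"],
    name="National_Designator_s1",
)

# True where the value consists only of dashes (optionally padded by whitespace);
# na=False makes missing entries count as "not a placeholder".
is_placeholder = sample.str.fullmatch(r"\s*-+\s*", na=False)

# Replace placeholders with a missing value so isna() picks them up.
cleaned = sample.mask(is_placeholder)

print(cleaned.isna().sum())
```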

In [12]:
# Show records where Status is None (null/NaN)
missing_status_df = df[df['Status'].isna()]

# Display the first few such records
display(missing_status_df.head())

# Optionally, show how many records have missing Status
print("Number of records with Status = None:", missing_status_df.shape[0])

|   | Name_of_Space_Object | International_Designator | National_Designator_s1 | Launch_Date | Status |
|---|---|---|---|---|---|
| 9711 | SpaceBEE 098 | 2021-059BY |  | 2021-06-30 |  |


Number of records with Status = None: 1
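The record above also illustrates the structure of the international designator: `2021-059BY` encodes the launch year (2021), the launch number within that year (059), and a piece identifier (BY), and the year agrees with the launch date 2021-06-30. A minimal sketch of parsing that structure for consistency checks; the pattern and the helper name `parse_designator` are assumptions, and values that do not fit, such as `2018-092*`, are returned as `None` rather than guessed at.

```python
import re
from typing import Optional, Tuple

# Assumed pattern: four-digit year, three-digit launch number, one to three piece letters.
DESIGNATOR_PATTERN = re.compile(r"^(\d{4})-(\d{3})([A-Z]{1,3})$")


def parse_designator(value: str) -> Optional[Tuple[int, int, str]]:
    """Return (year, launch_number, piece), or None if the value does not match."""
    match = DESIGNATOR_PATTERN.match(value.strip())
    if match is None:
        return None
    year, number, piece = match.groups()
    return int(year), int(number), piece


print(parse_designator("2021-059BY"))
print(parse_designator("2018-092*"))
```

Comparing the parsed year with `Launch_Date.dt.year` is one way to flag rows where either field may be wrong.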


In [13]:
df.to_csv("cleaned_satellite_data.csv", index=False)
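CSV does not store column types, so `Launch_Date` comes back as plain text when the file is read again unless it is parsed explicitly, and `None` values are written as empty fields. A minimal round-trip sketch, using an in-memory buffer and a made-up two-row frame in place of the real file:

```python
import io

import pandas as pd

# Made-up stand-in for the cleaned frame.
cleaned = pd.DataFrame(
    {
        "International_Designator": ["2021-059BY", "2025-085Q"],
        "Launch_Date": pd.to_datetime(["2021-06-30", "2025-04-28"]),
        "Status": [None, "in orbit"],
    }
)

buffer = io.StringIO()  # in-memory stand-in for cleaned_satellite_data.csv
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV cannot carry.
reloaded = pd.read_csv(buffer, parse_dates=["Launch_Date"])

print(reloaded.dtypes)
print(reloaded["Status"].isna().sum())
```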