# 🚀 Project Title: NASA Space Object Index Data Extraction + EDA 

## 📌 One-Liner
Automates the extraction, cleaning, and exploratory analysis of launch and mission data from a NASA infinite-scroll web resource to support space-tech market intelligence and infrastructure strategy.

---

## TL;DR Executive Summary
**3 Findings**:  
1. Successfully built an automated scraper for NASA’s mission listings using infinite-scroll handling.  
2. Cleaned and structured the dataset into consistent formats for mission names, launch dates, locations, and mission objectives.  
3. Conducted exploratory data analysis revealing patterns in mission frequency, geographic launch distribution, and thematic mission types.

**2 Implications**:  
1. Enables ongoing, low-effort tracking of NASA’s mission portfolio for competitive intelligence.  
2. Establishes a reusable pipeline for other space agencies’ open portals.

**1 Recommendation**:  
Extend this workflow to include launch success metrics, payload mass, and satellite type for richer strategic correlation.

---

## 🎯 Problem Statement & Decision Context
- **Business Question**: How can we systematically collect, clean, and analyze launch mission data from NASA to identify patterns relevant for market positioning in the space-tech and EO sectors?  
- **Scope**: NASA missions page, infinite scroll, structured tabular extraction, CSV/GeoJSON storage, EDA in Python.  
- **Out of Scope**: Real-time mission tracking, orbital mechanics calculations, and EO raster integration (covered in later projects).  
- **Success Criteria**: Fully automated data extraction script + cleaned dataset + EDA visualizations revealing at least three actionable patterns.

---

## 👥 Stakeholders & Use Cases
- **Primary Stakeholders**:  
  - Aerospace startups (Skyroot, Pixxel) for competitive benchmarking  
  - Space policy think tanks for mission diversity analysis  
  - Infrastructure planners for launch site capacity planning

- **Use Cases**:  
  - Regular reporting on NASA mission pipeline  
  - Comparative analysis with other agencies  
  - Foundation for EO mission overlay and downstream analytics

---

## 🗂 Data Card
- **Source**: NASA Launch/Mission website (infinite scroll endpoint)  
- **Method**: Selenium/Python requests with dynamic content loading handling  
- **License**: Public domain (US Government works)  
- **Update Frequency**: Daily/Weekly (can be scheduled)  
- **Key Attributes**:  
  - `mission_name` (string)  
  - `launch_date` (datetime)  
  - `launch_location` (string)  
  - `mission_type` (categorical)  
  - `mission_summary` (text)

- **Known Limitations**:  
  - Data may omit classified missions  
  - Inconsistent mission type labels require standardization

---

## 🔍 Method Overview
1. **Data Extraction**  
   - Automated infinite-scroll loading until all records loaded  
   - HTML parsing & structured field extraction  
2. **Data Cleaning**  
   - Standardizing dates, normalizing location names, deduplicating records  
3. **Exploratory Data Analysis**  
   - Launch frequency by year/quarter  
   - Launch sites distribution mapping  
   - Mission type breakdown  
4. **Output Preparation**  
   - CSV for tabular use  
   - GeoJSON for GIS integration

---

## ⚙️ Environment & Reproducibility
- **Python Version**: 3.10+  
- **Key Libraries**: pandas, requests, BeautifulSoup, Selenium, geopandas, matplotlib/seaborn  
- **Runtime**: ~5–8 minutes end-to-end  
- **File Structure**:


In [6]:
import pandas as pd
df_unoosa = pd.read_csv("/Users/aaeush/Desktop/Drive/Drive/Academics/Py Project/MyCode/OrbitIQ/exports/raw_unoosa_index_of_objects_launched_into_space.csv")

df_unoosa.head()

Unnamed: 0,id,uri,values.object.internationalDesignator_s1,values.object.internationalDesignator@official_s1,values.object.nationalDesignator_s1,values.object.nameOfSpaceObjectIno_s1,values.object.nameOfSpaceObjectO_s1,values.object.launch.stateOfRegistry_s1,values.object.launch.stateOfRegistry@official_s1,values.object.launch.dateOfLaunch_s1,...,values.object.launch.dateOfLaunch@official_s1,values.object.status.dateOfDecay@official_s1,values.object.functionOfSpaceObject_s1,values.object.remark_s1,values.object.status.webSite_s1,values.object.unRegistration.registrationDocuments.document@uri_s,values.object.unRegistration.registrationDocuments.document..document.symbol_s,values.object.status.gsoLocation@official_s1,values.object.unRegistration.decayDocuments.document@uri_s,values.object.unRegistration.decayDocuments.document..document.symbol_s
0,"102,en,/osoindex/data/objects/2025/2025-085q_2...",/osoindex/data/objects/2025/2025-085q_24495.html,2025-085Q,False,,STARLINK 33861,,USA,False,2025-04-28,...,False,False,------,Not registered with the United Nations. Date o...,,,,,,
1,"102,en,/osoindex/data/objects/2025/2025-085s_2...",/osoindex/data/objects/2025/2025-085s_24497.html,2025-085S,False,,STARLINK 33887,,USA,False,2025-04-28,...,False,False,------,Not registered with the United Nations. Date o...,,,,,,
2,"102,en,/osoindex/data/objects/2025/2025-085t_2...",/osoindex/data/objects/2025/2025-085t_24498.html,2025-085T,False,,STARLINK 33886,,USA,False,2025-04-28,...,False,False,------,Not registered with the United Nations. Date o...,,,,,,
3,"102,en,/osoindex/data/objects/2025/2025-085u_2...",/osoindex/data/objects/2025/2025-085u_24499.html,2025-085U,False,,STARLINK 33840,,USA,False,2025-04-28,...,False,False,------,Not registered with the United Nations. Date o...,,,,,,
4,"102,en,/osoindex/data/objects/2025/2025-085v_2...",/osoindex/data/objects/2025/2025-085v_24500.html,2025-085V,False,,STARLINK 33851,,USA,False,2025-04-28,...,False,False,------,Not registered with the United Nations. Date o...,,,,,,


In [7]:
unoosa_rename_map = {
    "id": "id",
    "uri": "uri",

    "values.object.internationalDesignator_s1": "international_designator", #ID of object
    "values.object.internationalDesignator@official_s1": "international_designator_off", #True or False
    "values.object.nationalDesignator_s1": "national_designator",

    "values.object.nameOfSpaceObjectIno_s1": "space_object_name",

    "values.object.nameOfSpaceObjectO_s1": "space_object_name_2",

    "values.object.launch.stateOfRegistry_s1": "state_of_registry",
    "values.object.launch.stateOfRegistry@official_s1": "state_of_registry_off",

    "values.object.launch.dateOfLaunch_s1": "date_of_launch",
    "values.object.status.gsoLocation_s1": "gso_location",
    "values.object.unRegistration.unRegistered_s1": "un_registered",
    "values.en#object.status.objectStatus_s1": "status",
    "values.object.status@official_s1": "status_off",
    "values.object.status.dateOfDecay_s1": "date_of_decay",

    "values.object.launch.dateOfLaunch@official_s1":"date_of_launch_off" ,
    "values.object.status.dateOfDecay@official_s1":"date_of_decay_off" ,

    "values.object.functionOfSpaceObject_s1": "function",
    "values.object.remark_s1": "remarks",

    "values.object.status.webSite_s1": "external_website",

    "values.object.unRegistration.registrationDocuments.document@uri_s": "registration_doc",
    
    "values.object.unRegistration.registrationDocuments.document..document.symbol_s": "values.object.unRegistration.registrationDocuments.document..document.symbol_s",
    "values.object.status.gsoLocation@official_s1": "gso_location_off",
    "values.object.unRegistration.decayDocuments.document@uri_s": "decay_document_uri",
    "values.object.unRegistration.decayDocuments.document..document.symbol_s": "symbol",
}

In [8]:
df_unoosa.rename(columns=unoosa_rename_map, inplace=True)
print("Successully mapped columns")
print(list(df_unoosa.columns))

Successully mapped columns
['id', 'uri', 'international_designator', 'international_designator_off', 'national_designator', 'space_object_name', 'space_object_name_2', 'state_of_registry', 'state_of_registry_off', 'date_of_launch', 'gso_location', 'un_registered', 'status', 'status_off', 'date_of_decay', 'date_of_launch_off', 'date_of_decay_off', 'function', 'remarks', 'external_website', 'registration_doc', 'values.object.unRegistration.registrationDocuments.document..document.symbol_s', 'gso_location_off', 'decay_document_uri', 'symbol']


In [9]:
df_unoosa.isnull().sum()

id                                                                                    0
uri                                                                                   0
international_designator                                                              0
international_designator_off                                                          0
national_designator                                                               13713
space_object_name                                                                 15463
space_object_name_2                                                                4026
state_of_registry                                                                     0
state_of_registry_off                                                                 0
date_of_launch                                                                        0
gso_location                                                                      15133
un_registered                   

All rows have registered international designator
7576 rows have national designator
15133 missing gso locations
Most dont have date of decay
2018-092 appeared 106 times
Most international designators are official 
Duplicate object names
Most states of registry are official
Most launches on 2021-01-24 at 131
Generic functions
Website names very few
Documentation available for some space objects
Very few objects have decay documentation


In [10]:
unoosa_delete_list = ["values.object.unRegistration.registrationDocuments.document..document.symbol_s", "decay_document_uri"]

In [11]:
# Remove specified columns if present
cols_to_drop = [c for c in unoosa_delete_list if c in df_unoosa.columns]
df_unoosa.drop(columns=cols_to_drop, inplace=True)
print("Removed columns:", cols_to_drop)
df_unoosa.columns

Removed columns: ['values.object.unRegistration.registrationDocuments.document..document.symbol_s', 'decay_document_uri']


Index(['id', 'uri', 'international_designator', 'international_designator_off',
       'national_designator', 'space_object_name', 'space_object_name_2',
       'state_of_registry', 'state_of_registry_off', 'date_of_launch',
       'gso_location', 'un_registered', 'status', 'status_off',
       'date_of_decay', 'date_of_launch_off', 'date_of_decay_off', 'function',
       'remarks', 'external_website', 'registration_doc', 'gso_location_off',
       'symbol'],
      dtype='object')

In [12]:
df_unoosa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21289 entries, 0 to 21288
Data columns (total 23 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   id                            21289 non-null  object
 1   uri                           21289 non-null  object
 2   international_designator      21289 non-null  object
 3   international_designator_off  21289 non-null  object
 4   national_designator           7576 non-null   object
 5   space_object_name             5826 non-null   object
 6   space_object_name_2           17263 non-null  object
 7   state_of_registry             21289 non-null  object
 8   state_of_registry_off         21289 non-null  bool  
 9   date_of_launch                21289 non-null  object
 10  gso_location                  6156 non-null   object
 11  un_registered                 21289 non-null  object
 12  status                        21288 non-null  object
 13  status_off      

In [13]:
# Columns you want to coerce to boolean
to_bool = ['gso_location', 'date_of_decay_off', 'gso_location_off', 'international_designator_off', 'un_registered']  # adjust as needed

truthy = {"true", "t", "1", "y", "yes", "on"}
falsy  = {"false", "f", "0", "n", "no", "off"}
mapper = {**{v: True for v in truthy}, **{v: False for v in falsy}}

for col in to_bool:
    if col not in df_unoosa.columns:
        continue
    s_norm = df_unoosa[col].astype(str).str.strip().str.lower()
    df_unoosa[col] = s_norm.map(mapper).astype("bool")  # pandas nullable boolean
    # If you want 1/0 instead:
    # df_unoosa[col] = df_unoosa[col].map({True: 1, False: 0}).astype("Int8")

In [14]:

# 3. Convert Launch_Date to datetime object
df_unoosa['date_of_launch'] = pd.to_datetime(df_unoosa['date_of_launch'], errors='coerce')


In [15]:

# 5. Summary Statistics
print('\nSummary statistics:')
display(df_unoosa.describe(include='all'))



Summary statistics:


Unnamed: 0,id,uri,international_designator,international_designator_off,national_designator,space_object_name,space_object_name_2,state_of_registry,state_of_registry_off,date_of_launch,...,status_off,date_of_decay,date_of_launch_off,date_of_decay_off,function,remarks,external_website,registration_doc,gso_location_off,symbol
count,21289,21289,21289,21289,7576,5826,17263,21289,21289,21280,...,21287,6116,21289,21289,21184,21243,1753,18929,21289,5554
unique,21289,21289,20527,2,5128,5033,16737,260,2,,...,6,4131,2,2,2254,1501,839,1573,2,1221
top,"102,en,/osoindex/data/objects/2025/2025-085q_2...",/osoindex/data/objects/2025/2025-085q_24495.html,2018-092*,True,------,USA,MOLNIYA 1,USA,True,,...,true,2019-03-22,True,True,Spacecraft engaged in practical applications a...,------,https://www.oneweb.world/,"[""/osoindex/data/documents/us/st/stsgser.e942....",True,"[""ST/SG/SER.E/1227""]"
freq,1,1,106,14388,2185,342,90,12719,18929,,...,14152,104,18758,12422,10346,13441,420,298,20826,102
mean,,,,,,,,,,2010-11-23 14:40:42.857142784,...,,,,,,,,,,
min,,,,,,,,,,1957-10-04 00:00:00,...,,,,,,,,,,
25%,,,,,,,,,,2000-04-24 06:00:00,...,,,,,,,,,,
50%,,,,,,,,,,2021-02-20 00:00:00,...,,,,,,,,,,
75%,,,,,,,,,,2023-06-12 00:00:00,...,,,,,,,,,,
max,,,,,,,,,,2025-04-28 00:00:00,...,,,,,,,,,,


Oldest recorded launched object is from 1957 there is a distribution here
Most recent is from April 28 
Investigate status_off
Some dates are empty - why

In [16]:

# 4. Replace NaN with None (null)
df_unoosa = df_unoosa.where(pd.notnull(df_unoosa), None)


In [17]:
df_unoosa.head()

Unnamed: 0,id,uri,international_designator,international_designator_off,national_designator,space_object_name,space_object_name_2,state_of_registry,state_of_registry_off,date_of_launch,...,status_off,date_of_decay,date_of_launch_off,date_of_decay_off,function,remarks,external_website,registration_doc,gso_location_off,symbol
0,"102,en,/osoindex/data/objects/2025/2025-085q_2...",/osoindex/data/objects/2025/2025-085q_24495.html,2025-085Q,False,,STARLINK 33861,,USA,False,2025-04-28,...,False,,False,False,------,Not registered with the United Nations. Date o...,,,True,
1,"102,en,/osoindex/data/objects/2025/2025-085s_2...",/osoindex/data/objects/2025/2025-085s_24497.html,2025-085S,False,,STARLINK 33887,,USA,False,2025-04-28,...,False,,False,False,------,Not registered with the United Nations. Date o...,,,True,
2,"102,en,/osoindex/data/objects/2025/2025-085t_2...",/osoindex/data/objects/2025/2025-085t_24498.html,2025-085T,False,,STARLINK 33886,,USA,False,2025-04-28,...,False,,False,False,------,Not registered with the United Nations. Date o...,,,True,
3,"102,en,/osoindex/data/objects/2025/2025-085u_2...",/osoindex/data/objects/2025/2025-085u_24499.html,2025-085U,False,,STARLINK 33840,,USA,False,2025-04-28,...,False,,False,False,------,Not registered with the United Nations. Date o...,,,True,
4,"102,en,/osoindex/data/objects/2025/2025-085v_2...",/osoindex/data/objects/2025/2025-085v_24500.html,2025-085V,False,,STARLINK 33851,,USA,False,2025-04-28,...,False,,False,False,------,Not registered with the United Nations. Date o...,,,True,


## EDA Insights Summary

### 1. Launch Date Patterns
- **Oldest launch:** 1957.
- **Most recent launch:** April 28 2025.
- **Peak launch activity:** 2021-01-24 with 131 recorded launches.

### 2. Data Completeness
- Some `date_of_launch` fields are empty.
- GSO location missing for ~15,133 records — likely due to non-GEO orbits.
- Most objects lack a `date_of_decay`.
- Very few records have decay documentation.

### 3. Identifiers & Registrations
- 100% of records have an international designator.
- 7,576 have a national designator; remainder missing.
- Most designators and registry states use official standardized forms.

### 4. Duplication & Repetition
- International designator `2018-092` appears 106 times — likely a large multi-satellite deployment.
- Duplicate object names exist in dataset.

### 5. Metadata Availability
- Very few objects have a website name field filled.
- Some objects have linked documentation.
- Common “Generic functions” classification indicates limited detail for many records.

### 6. Special Columns of Interest
- `status_off` column should be investigated to determine the reasons for various statuses.

---
**Next Steps**
- Verify cause of missing launch dates.
- Confirm if repeated designators indicate multiple payloads per launch.
- Assess if “Generic functions” can be refined into specific mission categories.
- Explore each column and its values


In [18]:
import pandas as pd
from pandas.api.types import is_object_dtype, is_categorical_dtype, is_string_dtype

# Identify categorical-like columns (object, string, or pandas categorical)
cat_cols = [
    col for col in df_unoosa.columns
    if is_categorical_dtype(df_unoosa[col])
    or is_object_dtype(df_unoosa[col])
    or is_string_dtype(df_unoosa[col])
]

# Map each categorical column to its list of unique non-null values
cat_uniques = {col: df_unoosa[col].dropna().unique().tolist() for col in cat_cols}

# Print
for col, vals in cat_uniques.items():
    print(f"{col} ({len(vals)} unique):")
    print(vals)
    print("-" * 60)

# Optional: treat low-cardinality numeric columns as categorical too (uncomment and adjust threshold)
# low_card_cols = [c for c in df_unoosa.columns
#                  if df_unoosa[c].nunique(dropna=True) <= 50]  # threshold
# for col in sorted(set(low_card_cols) - set(cat_cols)):
#     vals = df_unoosa[col].dropna().unique().tolist()
#     print(f"{col} ({len(vals)} unique):")
#     print(vals)
#     print("-" * 60)

id (21289 unique):
['102,en,/osoindex/data/objects/2025/2025-085q_24495.html', '102,en,/osoindex/data/objects/2025/2025-085s_24497.html', '102,en,/osoindex/data/objects/2025/2025-085t_24498.html', '102,en,/osoindex/data/objects/2025/2025-085u_24499.html', '102,en,/osoindex/data/objects/2025/2025-085v_24500.html', '102,en,/osoindex/data/objects/2025/2025-085p_24494.html', '102,en,/osoindex/data/objects/2025/2025-085r_24496.html', '102,en,/osoindex/data/objects/2025/2025-085k_24490.html', '102,en,/osoindex/data/objects/2025/2025-085m_24492.html', '102,en,/osoindex/data/objects/2025/2025-085e_24485.html', '102,en,/osoindex/data/objects/2025/2025-085f_24486.html', '102,en,/osoindex/data/objects/2025/2025-085g_24487.html', '102,en,/osoindex/data/objects/2025/2025-085h_24488.html', '102,en,/osoindex/data/objects/2025/2025-085j_24489.html', '102,en,/osoindex/data/objects/2025/2025-085l_24491.html', '102,en,/osoindex/data/objects/2025/2025-085n_24493.html', '102,en,/osoindex/data/objects/2025/

  if is_categorical_dtype(df_unoosa[col])


id, uri, international_designator, national_designator, space_object_name, space_object_name_2 can be particular to the object, and may uniquely identify them
functions and remarks can be word clouds

In [20]:
df_unoosa.to_csv("/Users/aaeush/Desktop/Drive/Drive/Academics/Py Project/MyCode/OrbitIQ/exports/df_unoosa.csv", index=False)