# Space Mission Data - Web Scraping from the Next Space Flight Website

**Team Member - Ashina Neema**

**Task Goal**: 
The aim of this task is to collect information about past space missions from https://nextspaceflight.com/. I'm interested in gathering details such as the mission name, launch date, launch location, mission status, the organization behind the mission, rocket status, and the price of the mission.

**Learning & Credits**: To learn how to scrape data from websites, I took guidance from Professor Mark Isken's courses:
1. PCDA - http://www.sba.oakland.edu/faculty/isken/courses/mis5470_f23/index.html

2. Advanced Analytics with Python - http://www.sba.oakland.edu/faculty/isken/courses/aap/

Additionally, I gained valuable insights into web scraping techniques from the Udemy course "100 Days of Code": https://www.udemy.com/course/100-days-of-code/

I learned fundamental web scraping concepts, including HTML parsing and using the BeautifulSoup library for data extraction. It was challenging but worth it!


# Task 1: Web Scraping

In [3]:
from bs4 import BeautifulSoup
import requests
import csv


total_pages = 221
limit = 30
complete_data = []
for page_number in range(1, total_pages + 1):
    url = f"https://nextspaceflight.com/launches/past/?page={page_number}"
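    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page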
    space_page = response.text

    soup = BeautifulSoup(space_page, "html.parser")

    space_mission = soup.find_all(name="h5", class_="header-style")
    space_mission_texts = []
    for space_mission_tag in space_mission:
        space_mission_text = space_mission_tag.getText()
        space_mission_texts.append(space_mission_text)

    space_mission_list = [mission.strip() for mission in space_mission_texts]

    launch_date_org_tag = [item.getText() for item in soup.find_all(class_="mdl-card__supporting-text")]
    cleaned_launch_date_org_tag = [item.strip() for item in launch_date_org_tag]

    date_list = [item.split('\n')[0].strip() for item in cleaned_launch_date_org_tag]
    location_list = [item.split('\n')[1].strip() for item in cleaned_launch_date_org_tag]

    # Use zip to combine the three lists
    space_mission_data = [{'Space Mission': mission, 'Date': date, 'Location': location}
                          for mission, date, location in zip(space_mission_list, date_list, location_list)]

    # Details scraped from each mission's own page, in listing order
    results_list = []

    # Count processed missions so the loop can stop at `limit`
    counter = 0

    for button_element in soup.find_all('button', {'class': 'mdc-button'}):
        onclick_value = button_element.get('onclick')
        if onclick_value and 'location.href' in onclick_value:

            # Split only on the first '=' so a path containing '=' stays intact
            link = onclick_value.split('=', 1)[1].strip().strip("'")
            Base_url = 'https://nextspaceflight.com'
            details_url = Base_url + link
            details_response = requests.get(details_url, timeout=30)
            details_html_content = details_response.text

            details_soup = BeautifulSoup(details_html_content, 'html.parser')

            # The status badge may be missing on some pages, so guard each lookup
            h6_element = details_soup.find(name='h6', class_='rcorners status')
            span_element = h6_element.find('span') if h6_element else None
            mission_status = span_element.text.strip() if span_element else None

            # The first matching cell is taken to be the organization name
            org_div = details_soup.find(name='div', class_='mdl-cell mdl-cell--6-col-desktop mdl-cell--12-col-tablet')
            org = org_div.text.strip() if org_div else None

            # Find the div containing 'Rocket Status:'
            rocket_status_div = details_soup.find(name='div', class_='mdl-cell mdl-cell--6-col-desktop mdl-cell--12-col-tablet',
                                           string=lambda t: t and 'Status:' in t)
            rocket_status = rocket_status_div.text.replace('Status:', '').strip() if rocket_status_div else None

            # Find the div containing 'Price:'
            price_div = details_soup.find(name='div', class_='mdl-cell mdl-cell--6-col-desktop mdl-cell--12-col-tablet',
                                            string=lambda t: t and 'Price:' in t)
            price = price_div.text.replace('Price:', '').replace('$', '').replace('million', '').strip() if price_div else None

            result_dict = {
                'Mission Status': mission_status,
                'Organization': org,
                'Rocket Status': rocket_status,
                'Price': price
            }

            results_list.append(result_dict)

            # Increment the counter
            counter += 1

            # Check if the limit is reached
            if counter >= limit:
                break

    # Iterate over both lists simultaneously
    for space_dict, other_dict in zip(space_mission_data, results_list):
        # Merge dictionaries for corresponding missions
        merged_dict = {**space_dict, **other_dict}
        complete_data.append(merged_dict)

# Print or use the complete_data list of dictionaries
for data in complete_data:
    print(data)

# Specify the CSV file name
csv_file_name = 'Space_Missions.csv'

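# Specify the field names explicitly so the write does not fail when complete_data is empty
field_names = ['Space Mission', 'Date', 'Location', 'Mission Status',
               'Organization', 'Rocket Status', 'Price']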

# Write to CSV with 'utf-8' encoding
with open(csv_file_name, mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=field_names)

    # Write header
    writer.writeheader()

    # Write data
    for row in complete_data:
        writer.writerow(row)


{'Space Mission': 'Long March 2D | Yaogan 39 Group 05', 'Date': 'Sun Dec 10, 2023 01:58 UTC', 'Location': 'LC-3, Xichang Satellite Launch Center, China', 'Mission Status': 'Success', 'Organization': 'CASC', 'Rocket Status': 'Active', 'Price': '29.75'}
{'Space Mission': 'ZhuQue-2 | Flight 3', 'Date': 'Fri Dec 08, 2023 23:39 UTC', 'Location': 'Site 96, Jiuquan Satellite Launch Center, China', 'Mission Status': 'Success', 'Organization': 'Landspace', 'Rocket Status': 'Active', 'Price': None}
{'Space Mission': 'Falcon 9 Block 5 | Starlink Group 7-8', 'Date': 'Fri Dec 08, 2023 08:03 UTC', 'Location': 'SLC-4E, Vandenberg SFB, California, USA', 'Mission Status': 'Success', 'Organization': 'SpaceX', 'Rocket Status': 'Active', 'Price': '67.0'}
{'Space Mission': 'Falcon 9 Block 5 | Starlink Group 6-33', 'Date': 'Thu Dec 07, 2023 05:07 UTC', 'Location': 'SLC-40, Cape Canaveral SFS, Florida, USA', 'Mission Status': 'Success', 'Organization': 'SpaceX', 'Rocket Status': 'Active', 'Price': '67.0'}

## Explanation of Code

Data Collection Scope: I targeted a total of 221 pages on the website, each containing data for approximately 30 space missions.

For scraping, I used Python's BeautifulSoup library for data extraction and the requests module for page access.

The code iterates through each of the 221 pages by generating the URL for each page number.
It uses the requests.get() method to fetch the HTML content of each page.
BeautifulSoup then parses the HTML and targets specific elements (such as mission names, launch dates, and locations) using methods like soup.find_all() and item.getText().
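
The listing-page step can be tried without touching the network. The sketch below is only an illustration: `SAMPLE_LISTING_HTML` is a hypothetical, simplified stand-in for one listing page (the real markup may differ), and `page_url` and `parse_listing` are helper names introduced here, not part of the original code. Unlike the original, it keeps only the non-empty lines of each card, so a blank line between the date and the location does not shift the fields.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one listing page (assumption).
SAMPLE_LISTING_HTML = """
<div class="mdl-card">
  <h5 class="header-style"> Falcon 9 Block 5 | Starlink Group 7-8 </h5>
  <div class="mdl-card__supporting-text">
    Fri Dec 08, 2023 08:03 UTC
    <br>
    SLC-4E, Vandenberg SFB, California, USA
  </div>
</div>
"""


def page_url(page_number):
    """Build the URL of one page of past launches."""
    return f"https://nextspaceflight.com/launches/past/?page={page_number}"


def parse_listing(html):
    """Return one dict per mission card found in the listing HTML."""
    soup = BeautifulSoup(html, "html.parser")
    names = [tag.get_text(strip=True)
             for tag in soup.find_all("h5", class_="header-style")]
    cards = soup.find_all(class_="mdl-card__supporting-text")

    rows = []
    for name, card in zip(names, cards):
        # Keep only non-empty lines: first is the date, second the location.
        lines = [line.strip() for line in card.get_text().splitlines()
                 if line.strip()]
        rows.append({
            "Space Mission": name,
            "Date": lines[0] if lines else None,
            "Location": lines[1] if len(lines) > 1 else None,
        })
    return rows


rows = parse_listing(SAMPLE_LISTING_HTML)
print(page_url(1))
print(rows)
```

Separating URL building from parsing makes it possible to check the parsing logic on a saved page before running all 221 requests.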

Once basic information is extracted, the code drills deeper into each mission's specific page by extracting the onclick attribute of buttons for additional details. This involves building the URL for each mission's page and accessing its HTML content.
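
As a rough sketch of this step, the helper below turns an `onclick` value into a full details URL. The function name `details_url_from_onclick`, the regular expression, and the example path `/launches/details/7286` are assumptions made for illustration. It swaps the original `split('=')` for a regular expression, which tolerates extra spaces and either quote style and does not break if the path itself contains an `=`.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://nextspaceflight.com"

# Matches: location.href = '/some/path'  (single or double quotes)
_HREF_PATTERN = re.compile(r"location\.href\s*=\s*['\"]([^'\"]+)['\"]")


def details_url_from_onclick(onclick_value, base_url=BASE_URL):
    """Return the absolute details URL, or None if no link is present."""
    if not onclick_value:
        return None
    match = _HREF_PATTERN.search(onclick_value)
    if match is None:
        return None
    # urljoin handles the slash between the base URL and the relative path.
    return urljoin(base_url, match.group(1))


# Hypothetical onclick value; the real path format is an assumption.
example_url = details_url_from_onclick("location.href = '/launches/details/7286'")
print(example_url)
```

Returning `None` for buttons without a link lets the caller skip them explicitly instead of raising an `IndexError`.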

Within these individual mission pages, the code identifies and extracts more specific details such as mission status, organization involved, rocket status, and rocket price using methods like details_soup.find() and text parsing.
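
The detail-page extraction can be sketched on a small HTML sample. Everything below is illustrative: `SAMPLE_DETAILS_HTML` is a hypothetical, simplified stand-in for a mission page, and `labelled_value`, `parse_price_millions`, and `parse_details` are names introduced here. Two choices differ from the original code: labelled cells are matched by their text prefix rather than by a `string=` filter, and the price is converted to a float (in millions of dollars) instead of being kept as a string.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one mission details page (assumption).
SAMPLE_DETAILS_HTML = """
<h6 class="rcorners status"><span>Success</span></h6>
<div class="mdl-cell mdl-cell--6-col-desktop mdl-cell--12-col-tablet">SpaceX</div>
<div class="mdl-cell mdl-cell--6-col-desktop mdl-cell--12-col-tablet">Status: Active</div>
<div class="mdl-cell mdl-cell--6-col-desktop mdl-cell--12-col-tablet">Price: $67.0 million</div>
"""


def labelled_value(soup, label):
    """Return the text after `label` in the first cell starting with it."""
    for div in soup.find_all("div", class_="mdl-cell"):
        text = div.get_text(strip=True)
        if text.startswith(label):
            return text[len(label):].strip()
    return None


def parse_price_millions(raw):
    """Convert text like '$67.0 million' to 67.0; None if not parseable."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace("million", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_details(html):
    """Extract status, organization, rocket status, and price from a page."""
    soup = BeautifulSoup(html, "html.parser")

    status_tag = soup.find("h6", class_="status")
    span = status_tag.find("span") if status_tag else None

    # Like the original code, treat the first cell as the organization name.
    org_div = soup.find("div", class_="mdl-cell")

    return {
        "Mission Status": span.get_text(strip=True) if span else None,
        "Organization": org_div.get_text(strip=True) if org_div else None,
        "Rocket Status": labelled_value(soup, "Status:"),
        "Price": parse_price_millions(labelled_value(soup, "Price:")),
    }


details = parse_details(SAMPLE_DETAILS_HTML)
print(details)
```

Every lookup falls back to `None`, so a page with a missing field yields an incomplete row rather than stopping the whole scrape.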

The information extracted from each mission's page is structured into dictionaries (result_dict) to organize related details for each mission. The data is then appended to results_list and merged with the corresponding basic mission information gathered earlier.
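
Because the merge pairs rows purely by position, a skipped button or an early `break` could silently pair a mission with another mission's details, and `zip` would simply drop the unmatched rows. A minimal sketch of a stricter merge is shown below; `merge_rows` is a hypothetical helper, and raising on a length mismatch is one possible policy, not the behaviour of the original code.

```python
def merge_rows(basic_rows, detail_rows):
    """Merge listing rows with detail rows pairwise, by position."""
    if len(basic_rows) != len(detail_rows):
        # zip() would silently truncate here, so fail loudly instead.
        raise ValueError(
            f"row count mismatch: {len(basic_rows)} listing rows "
            f"vs {len(detail_rows)} detail rows"
        )
    return [{**basic, **detail} for basic, detail in zip(basic_rows, detail_rows)]


basic = [{"Space Mission": "ZhuQue-2 | Flight 3",
          "Date": "Fri Dec 08, 2023 23:39 UTC"}]
extra = [{"Mission Status": "Success", "Organization": "Landspace"}]

merged = merge_rows(basic, extra)
print(merged)
```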


Finally, the collected and structured data is written into a CSV file named "Space_Missions.csv." This file serves as an organized repository of extensive space mission details for subsequent analysis or reference.
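
The CSV step can be checked in memory before writing a real file. In this sketch, `write_missions` and `FIELD_NAMES` are names introduced for illustration, and an `io.StringIO` buffer stands in for `Space_Missions.csv`. The column order is fixed up front rather than taken from the first row, and `restval=""` fills any field a row happens to lack.

```python
import csv
import io

FIELD_NAMES = ["Space Mission", "Date", "Location", "Mission Status",
               "Organization", "Rocket Status", "Price"]


def write_missions(rows, csv_file, field_names=FIELD_NAMES):
    """Write mission dicts to an open text file as CSV with a header row."""
    writer = csv.DictWriter(csv_file, fieldnames=field_names, restval="")
    writer.writeheader()
    writer.writerows(rows)


sample_rows = [{
    "Space Mission": "ZhuQue-2 | Flight 3",
    "Date": "Fri Dec 08, 2023 23:39 UTC",
    "Location": "Site 96, Jiuquan Satellite Launch Center, China",
    "Mission Status": "Success",
    "Organization": "Landspace",
    "Rocket Status": "Active",
    "Price": None,  # the csv module writes None as an empty field
}]

# In-memory stand-in for open('Space_Missions.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO(newline="")
write_missions(sample_rows, buffer)
print(buffer.getvalue())
```

Fields that contain commas, such as the location, are quoted automatically by the `csv` module, so they read back as a single value.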

This web scraping code systematically navigates through multiple web pages, extracts specific details from each page, structures the data, and consolidates it into a structured CSV file for further analysis or use.
