# **SpaceX Falcon 9 Data Collection (Web Scraping)**
IBM's Applied Data Science Capstone Project


### Objectives:
Web scrape Falcon 9 launch records with `BeautifulSoup`:
- Extract the Falcon 9 and Falcon Heavy launch records HTML table from Wikipedia
- Clean the data



The source is the Wikipedia page titled `List of Falcon 9 and Falcon Heavy launches`:

https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches 



The launch records are stored in an HTML table:

<div>
<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/labs/module_1_L2/images/falcon9-launches-wiki.png" width="500"/>
</div>

In [1]:
import sys
import requests
from bs4 import BeautifulSoup
import re
import unicodedata
import pandas as pd

### 1. Request and scrape the Falcon 9 launch Wiki page from its URL


In [2]:
static_url = "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches_(2010%E2%80%932019)"
response = requests.get(static_url)
soup = BeautifulSoup(response.text, 'html5lib')

Print the page title to verify that the `BeautifulSoup` object was created properly.


In [3]:
# Use soup.title attribute
page_title = soup.title
if page_title:
    print("Page Title:", page_title.text)
else:
    print("Page title not found.")

Page Title: List of Falcon 9 and Falcon Heavy launches (2010–2019) - Wikipedia


### 2. Extract all column/variable names from the HTML table header


In [4]:
html_tables = soup.find_all('table') 

In [5]:
html_tables

[<table class="multicol" role="presentation" style="border-collapse: collapse; padding: 0; border: 0; background:transparent; width:100%;">
 
 <tbody><tr>
 <td style="text-align: left; vertical-align: top;">
 <h3><span class="mw-headline" id="Rocket_configurations">Rocket configurations</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=List_of_Falcon_9_and_Falcon_Heavy_launches_(2010%E2%80%932019)&amp;action=edit&amp;section=2" title="Edit section: Rocket configurations">edit</a><span class="mw-editsection-bracket">]</span></span></h3>
 <div class="chart noresize" style="margin-top:1em;max-width:480px;">
 <div style="position:relative;min-height:320px;min-width:480px;max-width:480px;">
 <div style="float:right;position:relative;min-height:240px;min-width:380px;max-width:380px;border-left:1px black solid;border-bottom:1px black solid;">
 <div style="position:absolute;left:4px;top:224px;height:15px;min-width:28px;max-width:28px;ba

The first tables on the page are layout and summary tables. The table at index 3 (the fourth table on the page) is the first one that contains actual launch records.


In [6]:
# Print the table at index 3 and check its content
first_launch_table = html_tables[3]
print(first_launch_table)

<table class="wikitable plainrowheaders collapsible" style="width: 100%;">
<tbody><tr>
<th scope="col">Flight No.
</th>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>
<th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_5-0"><a href="#cite_note-booster-5">[a]</a></sup>
</th>
<th scope="col">Launch site
</th>
<th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_6-0"><a href="#cite_note-Dragon-6">[b]</a></sup>
</th>
<th scope="col">Payload mass
</th>
<th scope="col">Orbit
</th>
<th scope="col">Customer
</th>
<th scope="col">Launch<br/>outcome
</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th></tr>
<tr id="F9-001">
<th rowspan="2" scope="row" style="text-align:center;">1
</
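The positional lookup `html_tables[3]` depends on how many tables precede the launch records. As a minimal sketch, the launch tables can instead be selected by their class string, which is the same string the parsing loop later in this notebook passes to `find_all`. The miniature HTML snippet below is made up for illustration, and `html.parser` is used only so the sketch needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one layout table and one launch table
sample_html = """
<table class="multicol"><tr><td>layout</td></tr></table>
<table class="wikitable plainrowheaders collapsible">
  <tr><th scope="col">Flight No.</th></tr>
  <tr><th scope="row">1</th><td>4 June 2010</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Positional access depends on how many tables come before the target
all_tables = sample_soup.find_all("table")

# Matching the full class string keeps working if layout tables are added or removed
launch_tables = sample_soup.find_all(
    "table", class_="wikitable plainrowheaders collapsible"
)

print(len(all_tables), len(launch_tables))
```

Selecting by class trades one assumption (table position) for another (the class string staying the same), so it is worth printing the first match either way.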

In [7]:
column_names = []
th_elements = first_launch_table.find_all('th', scope='col')
print(th_elements)

[<th scope="col">Flight No.
</th>, <th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
</th>, <th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_5-0"><a href="#cite_note-booster-5">[a]</a></sup>
</th>, <th scope="col">Launch site
</th>, <th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_6-0"><a href="#cite_note-Dragon-6">[b]</a></sup>
</th>, <th scope="col">Payload mass
</th>, <th scope="col">Orbit
</th>, <th scope="col">Customer
</th>, <th scope="col">Launch<br/>outcome
</th>, <th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests" title="Falcon 9 first-stage landing tests">Booster<br/>landing</a>
</th>]


Iterate through the `<th>` elements and apply `extract_column_from_header()` to extract each column name.


In [8]:
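def extract_column_from_header(row):
    """Return the plain-text column name of a <th> header cell."""
    # Remove the first <br>, <a>, and <sup> tag so only text nodes remain
    for tag_name in ['br','a','sup']:
        tag = row.find(tag_name)
        if tag:
            tag.extract()
    column_name = ' '.join(row.contents).strip()
    return column_name

for th_element in th_elements:
    name = extract_column_from_header(th_element)
    # Skip headers whose text was entirely inside a removed tag
    if name is not None and len(name)>0:
        column_names.append(name)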

print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
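The list above has 8 names although the table has 10 header cells: the text of `Version, Booster` and `Booster landing` sits inside an `<a>` tag, which the function removes, so both come back empty and are skipped. This notebook adds those two columns by hand in the next step. As an alternative sketch, removing only the `<sup>` footnotes and reading the remaining text with `get_text()` keeps every header. `header_text` is a hypothetical helper, and the header cells below are shortened copies of the ones printed earlier.

```python
from bs4 import BeautifulSoup

# Three header cells, shortened from the table printed earlier
sample_header = """
<table><tr>
<th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time">UTC</a>)</th>
<th scope="col"><a href="/wiki/Falcon_9_first-stage_landing_tests">Booster<br/>landing</a></th>
<th scope="col">Payload<sup class="reference"><a href="#cite_note-Dragon-6">[b]</a></sup></th>
</tr></table>
"""

def header_text(th):
    """Hypothetical helper: return the visible header text without footnotes."""
    # Footnote markers such as [b] live in <sup> tags; drop them entirely
    for sup in th.find_all("sup"):
        sup.decompose()
    # Strip each text fragment and join the fragments with single spaces
    return th.get_text(" ", strip=True)

header_soup = BeautifulSoup(sample_header, "html.parser")
names = [header_text(th) for th in header_soup.find_all("th", scope="col")]
print(names)
```

With this approach the link text (for example `UTC`) stays in the name, so the resulting keys differ from the ones used in the rest of this notebook.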


### 3. Create a dataframe by parsing the launch HTML tables


In [9]:
# Create an empty dictionary with keys from the extracted column names.
launch_dict= dict.fromkeys(column_names)

# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each value in launch_dict as an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []
# Add columns that were not captured from the header
launch_dict['Version booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]


In [10]:
launch_dict

{'Flight No.': [],
 'Launch site': [],
 'Payload': [],
 'Payload mass': [],
 'Orbit': [],
 'Customer': [],
 'Launch outcome': [],
 'Version booster': [],
 'Booster landing': [],
 'Date': [],
 'Time': []}

HTML tables on Wikipedia pages are likely to contain unexpected annotations and other kinds of noise, such as reference links (`B0004.1[8]`), missing values (`N/A [e]`), and inconsistent formatting.
This step extracts the useful information from each row.
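To illustrate the kind of noise involved, here is a minimal sketch that strips bracketed footnote markers from a cell's text with a regular expression. `strip_annotations` is a hypothetical helper, not part of this notebook's pipeline, and it assumes that markers contain only letters or digits between the brackets.

```python
import re

def strip_annotations(text):
    """Hypothetical helper: drop bracketed footnote markers such as [8] or [e]."""
    # Remove "[...]" groups that hold only letters or digits, plus leading spaces
    cleaned = re.sub(r"\s*\[[0-9A-Za-z]+\]", "", text)
    return cleaned.strip()

booster = strip_annotations("B0004.1[8]")
missing = strip_annotations("N/A [e]")
print(booster, "|", missing)
```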


The helper functions below process the cells of the scraped HTML table.


In [11]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: the <td> element that holds the launch date and time
    """
    # Example cell: 7 January 2020<br/>02:19:21<sup class="reference" id="cite_ref-18"><a href="#cite_note-18">[13]</a></sup>
    return [data_time.strip() for data_time in list(table_cells.strings)][0:2]


def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: the <td> element that holds the version and booster
    """
    # Keep every second text fragment and drop the last one
    out=''.join([booster_version for i,booster_version in enumerate( table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: the <td> element that holds the booster landing outcome
    """
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    """
    Return the payload mass text up to and including "kg".
    Input: the <td> element that holds the payload mass
    """
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        # If "kg" is absent, find() returns -1 and only the first character is kept
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass
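When a cell has no "kg" in it, `str.find` returns -1, so the slice in `get_mass` becomes `mass[0:1]` and keeps only the first character. That would explain the `N` and `U` values in the `Payload mass` column of the DataFrame preview further down. A possible alternative, sketched here under the assumption that the mass is a number written directly before "kg", returns a float or `None`. `parse_mass_kg` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re
import unicodedata

def parse_mass_kg(cell_text):
    """Hypothetical alternative to get_mass: return kilograms as a float, or None."""
    # NFKD turns non-breaking spaces into ordinary spaces
    text = unicodedata.normalize("NFKD", cell_text).strip()
    # A number with optional thousands separators, directly before "kg"
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*kg", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))

print(parse_mass_kg("4,700\xa0kg (10,400 lb)"))
print(parse_mass_kg("Unknown"))
```

Returning `None` instead of a partial string makes missing masses easy to find later with `DataFrame.isna()`.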


In [12]:
# Parse every launch table on the page
extracted_row = 0
for table_number,table in enumerate(soup.find_all('table',"wikitable plainrowheaders collapsible")):
    # Get each table row
    for rows in table.find_all("tr"):
        # Check whether the first table heading is a number, i.e. a flight number

        # The <tr> element defines a table row, the <th> element defines a table header,
        # and the <td> element defines a table cell.
        if rows.th:
            if rows.th.string: # check that it is not empty
                flight_number=rows.th.string.strip()
                flag=flight_number.isdigit()
        else:
            flag=False
        # Get the table cells of this row
        row=rows.find_all('td')
        # If the heading is a flight number, save the cells in the dictionary
        if flag:
            extracted_row += 1
            # Flight Number value
            # Append the flight_number into launch_dict with key `Flight No.`
            #print(flight_number)
            launch_dict['Flight No.'].append(flight_number)
            print(flight_number)

            datatimelist=date_time(row[0])
            # Date value
            # Strip any trailing comma, then append the date into launch_dict with key `Date`
            date = datatimelist[0].strip(',')
            launch_dict["Date"].append(date)
            print(date)
            # Time value
            # Append the time into launch_dict with key `Time`
            time = datatimelist[1]
            launch_dict["Time"].append(time)
            print(time)
            # Booster version
            # Append the bv into launch_dict with key `Version booster`

            bv=booster_version(row[1])
            if not(bv):
                bv=row[1].a.string
            launch_dict["Version booster"].append(bv)
            print(bv)
            
            # Launch Site
            # Append the launch_site into launch_dict with key `Launch site`
            launch_site = row[2].a.string
            launch_dict["Launch site"].append(launch_site)
            print(launch_site)
            
            # Payload
            # Append the payload into launch_dict with key `Payload`
            payload = row[3].a.string
            launch_dict["Payload"].append(payload)
            print(payload)
            
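            # Payload Mass
            # Append the payload_mass into launch_dict with key `Payload mass`
            payload_mass = get_mass(row[4])
            launch_dict["Payload mass"].append(payload_mass)
            # Note: this prints the payload name again, not payload_mass,
            # which is why each payload appears twice in the output below
            print(payload)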
            
            # Orbit
            # Append the orbit into launch_dict with key `Orbit`
            orbit = row[5].a.string
            launch_dict["Orbit"].append(orbit)
            print(orbit)
            
            # Customer
            # Append the customer into launch_dict with key `Customer`
            if row[6].find('a'):
                customer = row[6].a.string
            else:
                customer = "Various"
            launch_dict["Customer"].append(customer)
            print(customer)
            
            # Launch outcome
            # Append the launch_outcome into launch_dict with key `Launch outcome`
            launch_outcome = list(row[7].strings)[0]
            launch_dict['Launch outcome'].append(launch_outcome)
            print(launch_outcome)
            
            # Booster landing
            # Append the booster_landing into launch_dict with key `Booster landing`
            booster_landing = landing_status(row[8])
            launch_dict['Booster landing'].append(booster_landing)
            print(booster_landing)

1
4 June 2010
18:45
F9 v1.0 
CCSFS
Dragon Spacecraft Qualification Unit
Dragon Spacecraft Qualification Unit
LEO
SpaceX
Success

Failure
2
8 December 2010
15:43
F9 v1.0 
CCSFS
SpaceX COTS Demo Flight 1
SpaceX COTS Demo Flight 1
LEO
NASA
Success
Failure
3
22 May 2012
07:44
F9 v1.0 
CCSFS
SpaceX COTS Demo Flight 2
SpaceX COTS Demo Flight 2
LEO
NASA
Success
No attempt

4
8 October 2012
00:35
F9 v1.0 
CCSFS
SpaceX CRS-1
SpaceX CRS-1
LEO
NASA
Success

No attempt
5
1 March 2013
15:10
F9 v1.0 
CCSFS
SpaceX CRS-2
SpaceX CRS-2
LEO
NASA
Success

No attempt

6
29 September 2013
16:00
F9 v1.1 
VSFB
CASSIOPE
CASSIOPE
Polar orbit
MDA
Success
Uncontrolled
7
3 December 2013
22:41
F9 v1.1
CCSFS
SES-8
SES-8
GTO
SES
Success
No attempt
8
6 January 2014
22:06
F9 v1.1
CCSFS
Thaicom 6
Thaicom 6
GTO
Thaicom
Success
No attempt
9
18 April 2014
19:25
F9 v1.1
CCSFS
SpaceX CRS-3
SpaceX CRS-3
LEO
NASA
Success

Controlled
10
14 July 2014
15:15
F9 v1.1
CCSFS
Orbcomm-OG2
Orbcomm-OG2
LEO
Orbcomm
Success
Controlled
11
5
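In the loop above, `flag` is only reassigned when a row has a `<th>` with a non-empty `.string`, or no `<th>` at all. For a row whose `<th>` has no single string (for example, one that also holds a footnote tag), `flag` keeps the value from the previous row. A minimal sketch of a per-row check that always returns a fresh value is shown below. `is_launch_row` is a hypothetical helper and the sample rows are made up for illustration.

```python
from bs4 import BeautifulSoup

def is_launch_row(tr):
    """Hypothetical helper: True if the row's first <th> holds a flight number."""
    th = tr.th
    # No header cell, or a header cell without a single text node
    if th is None or th.string is None:
        return False
    return th.string.strip().isdigit()

sample_rows = BeautifulSoup(
    "<table>"
    "<tr><th scope='col'>Flight No.</th></tr>"
    "<tr><th scope='row'>1</th><td>4 June 2010</td></tr>"
    "<tr><td colspan='2'>Remarks row without a header cell</td></tr>"
    "</table>",
    "html.parser",
)

flags = [is_launch_row(tr) for tr in sample_rows.find_all("tr")]
print(flags)
```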

In [13]:
# Check that all columns have the same number of values
def count_values_per_key(dictionary):
    count_per_key = {}
    
    for key, values in dictionary.items():
        if isinstance(values, list):
            count_per_key[key] = len(values)
        else:
            count_per_key[key] = 1
    
    return count_per_key


result = count_values_per_key(launch_dict)

for key, count in result.items():
    print(f"Key '{key}' has {count} value(s)")

Key 'Flight No.' has 77 value(s)
Key 'Launch site' has 77 value(s)
Key 'Payload' has 77 value(s)
Key 'Payload mass' has 77 value(s)
Key 'Orbit' has 77 value(s)
Key 'Customer' has 77 value(s)
Key 'Launch outcome' has 77 value(s)
Key 'Version booster' has 77 value(s)
Key 'Booster landing' has 77 value(s)
Key 'Date' has 77 value(s)
Key 'Time' has 77 value(s)
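Equal counts matter because `pd.DataFrame` raises a `ValueError` when the lists in a dictionary differ in length. As a compact sketch of the same check, the function below fails early with a message that names the lengths. `assert_equal_lengths` is a hypothetical helper, and `toy` stands in for `launch_dict`.

```python
def assert_equal_lengths(columns):
    """Hypothetical check: raise if the lists in a dict differ in length."""
    lengths = {key: len(values) for key, values in columns.items()}
    # More than one distinct length means at least one column is off
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths

# Toy dictionary standing in for launch_dict
toy = {"Flight No.": ["1", "2"], "Orbit": ["LEO", "LEO"]}
toy_lengths = assert_equal_lengths(toy)
print(toy_lengths)
```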


In [14]:
df=pd.DataFrame(launch_dict)

In [15]:
df.head()

| | Flight No. | Launch site | Payload | Payload mass | Orbit | Customer | Launch outcome | Version booster | Booster landing | Date | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CCSFS | Dragon Spacecraft Qualification Unit | N | LEO | SpaceX | Success\n | F9 v1.0 | Failure | 4 June 2010 | 18:45 |
| 1 | 2 | CCSFS | SpaceX COTS Demo Flight 1 | U | LEO | NASA | Success | F9 v1.0 | Failure | 8 December 2010 | 15:43 |
| 2 | 3 | CCSFS | SpaceX COTS Demo Flight 2 | 525 kg | LEO | NASA | Success | F9 v1.0 | No attempt\n | 22 May 2012 | 07:44 |
| 3 | 4 | CCSFS | SpaceX CRS-1 | 4,700 kg | LEO | NASA | Success\n | F9 v1.0 | No attempt | 8 October 2012 | 00:35 |
| 4 | 5 | CCSFS | SpaceX CRS-2 | 4,877 kg | LEO | NASA | Success\n | F9 v1.0 | No attempt\n | 1 March 2013 | 15:10 |
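Some values in the preview above still end with a newline (`Success\n`, `No attempt\n`). As a sketch of one possible cleaning step, the text columns can be stripped with pandas string methods before export. The toy frame below only imitates two of the columns; applying the same loop to `df` is an assumption about how the cleaning could be done, not a step this notebook performs.

```python
import pandas as pd

# Toy frame reproducing the trailing newlines seen in the preview
toy_df = pd.DataFrame({
    "Launch outcome": ["Success\n", "Success"],
    "Booster landing": ["Failure", "No attempt\n"],
})

# Strip surrounding whitespace, including newlines, column by column
for col in ["Launch outcome", "Booster landing"]:
    toy_df[col] = toy_df[col].str.strip()

print(toy_df["Launch outcome"].tolist())
print(toy_df["Booster landing"].tolist())
```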


In [16]:
# Export the data (the data/ directory must already exist)
df.to_csv('data/falcon_webscrape.csv', index=False)