# <center>________________________________________________________________</center>

# <center>LANDING PREDICTION FOR THE SPACEX FALCON 9 ROCKET</center>

# <center>Part 1.2 - Data Acquisition with Web Scraping</center>

# <center>________________________________________________________________</center>

# Introduction
***

In this project, we will predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of \\$62 million, while other providers charge upward of \\$165 million per launch. Much of the savings comes from SpaceX's ability to reuse the first stage. Therefore, if we can determine whether the first stage will land, we can estimate the cost of a launch. This information is useful to a competing company that wants to bid against SpaceX for a rocket launch.

In this part, as an alternative to the SpaceX API, we will use web scraping to collect historical Falcon 9 launch records from the Wikipedia page titled "List of Falcon 9 and Falcon Heavy launches". We will cover the launches through the end of 2022.

<b>Data Sources:</b><br>

2010 - 2019: [https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches_(2010%E2%80%932019)](https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches_(2010%E2%80%932019))<br>
2020 - Now: [https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches](https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches)

# Libraries
***

In [None]:
#!pip3 install requests
#!pip3 install pandas
# unicodedata is part of the Python standard library and does not need to be installed
#!pip3 install beautifulsoup4

In [1]:
import requests
import pandas as pd

import unicodedata
from bs4 import BeautifulSoup

# Auxiliary Functions
***

We will use some helper functions to process the cells of the scraped HTML tables:

In [2]:
def date_time(table_cells):
    """
    Return the date and time from an HTML table cell.
    Input: the <td> element of a table data cell
    """
    return [text.strip() for text in list(table_cells.strings)][0:2]

def booster_version(table_cells):
    """
    Return the booster version from an HTML table cell.
    Input: the <td> element of a table data cell
    """
    # Keep every other string in the cell, drop the last one, and join the rest
    out=''.join([text for i,text in enumerate(table_cells.strings) if i%2==0][0:-1])
    return out

def landing_status(table_cells):
    """
    Return the landing status from an HTML table cell.
    Input: the <td> element of a table data cell
    """
    out=[i for i in table_cells.strings][0]
    return out


def get_mass(table_cells):
    """
    Return the payload mass (up to and including the kg unit) from an HTML table cell.
    Input: the <td> element of a table data cell
    """
    mass=unicodedata.normalize("NFKD", table_cells.text).strip()
    if mass:
        # Keep the text up to and including "kg"; if "kg" is absent, find()
        # returns -1 and only the first character of the text is kept
        new_mass=mass[0:mass.find("kg")+2]
    else:
        new_mass=0
    return new_mass


def extract_column_from_header(row):
    """
    Return the cleaned column name from an HTML table header cell.
    Input: the <th> element of a table header cell
    """
    if (row.br):
        row.br.extract()
    if row.a:
        row.a.extract()
    if row.sup:
        row.sup.extract()
        
    column_name = ' '.join(row.contents)
    
    # Skip purely numeric names; empty names are filtered out by the caller
    if not(column_name.strip().isdigit()):
        column_name = column_name.strip()
        return column_name    
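
To make the behaviour of `get_mass()` easier to follow, here is a minimal string-only sketch of the same idea. The function `extract_mass` and the sample strings are hypothetical and exist only for illustration. Unlike `get_mass()`, this variant returns `None` when no "kg" is found instead of keeping the first character.

```python
import unicodedata

def extract_mass(cell_text):
    """Hypothetical string-only variant of get_mass(), for illustration."""
    # NFKD turns non-breaking spaces (\xa0) into ordinary spaces
    text = unicodedata.normalize("NFKD", cell_text).strip()
    end = text.find("kg")
    if end == -1:
        # No mass given in kilograms
        return None
    # Keep everything up to and including "kg"
    return text[:end + 2]

samples = ["4,700\xa0kg (10,400\xa0lb)", "525 kg", "Not released"]
results = [extract_mass(sample) for sample in samples]
print(results)
```

Returning `None` for cells without a mass makes missing values explicit once the data is loaded into a dataframe.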


# HTML Request and Webscraping
***

## 2010 - 2019
***

In [3]:
url = "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches_(2010%E2%80%932019)"

# Request the page with requests.get()
html_req = requests.get(url)

# Store the raw response body (bytes) in a variable
html = html_req.content

print(html_req.headers["Content-Type"])
html[0:100]

text/html; charset=UTF-8


b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

We will create a `BeautifulSoup` object from the HTML `response`:

In [4]:
soup = BeautifulSoup(html, "html5lib")

# Print the page title to verify if the BeautifulSoup object was created properly
soup.title

<title>List of Falcon 9 and Falcon Heavy launches (2010–2019) - Wikipedia</title>

Next, we want to collect all relevant column names from the HTML table header. To do so, we first find all the tables on the wiki page:

In [5]:
html_tables = soup.find_all(name="table")

The launch records start from the third table on the page, so we take it as our target table:


In [6]:
launch_table = html_tables[2]

Next, we iterate through the `<th>` elements and apply `extract_column_from_header()` to extract the column names one by one:


In [7]:
table_headers = launch_table.find_all(name="th")
table_headers[0:5]

[<th scope="col">Flight No.
 </th>,
 <th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
 </th>,
 <th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>Booster</a> <sup class="reference" id="cite_ref-booster_5-0"><a href="#cite_note-booster-5">[a]</a></sup>
 </th>,
 <th scope="col">Launch site
 </th>,
 <th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_6-0"><a href="#cite_note-Dragon-6">[b]</a></sup>
 </th>]

In [8]:
column_names = []

for row in table_headers:
    
    column_name = extract_column_from_header(row)
    
    if column_name is not None and len(column_name) > 0:
        
        column_names.append(column_name)
        
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']
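
The empty parentheses in `'Date and time ( )'` appear because `extract_column_from_header()` removes the `<a>` element that held "UTC". For comparison, the sketch below applies BeautifulSoup's `get_text()` to a hand-written header cell, which keeps the link text. The HTML string is a simplified stand-in for the live page, and this is an alternative illustration rather than the method used in this notebook.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one header cell of the launch table
html_snippet = (
    '<table><tr><th scope="col">Date and<br/>time '
    '(<a href="/wiki/Coordinated_Universal_Time">UTC</a>)</th></tr></table>'
)

header_cell = BeautifulSoup(html_snippet, "html.parser").th

# Join all text fragments with a space and strip each fragment
header_text = header_cell.get_text(" ", strip=True)
print(header_text)
```

With this approach the link and footnote text stay in the name, so footnote markers such as `[a]` would still need to be removed separately.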


We will create an empty dictionary with keys from the extracted column names. Later, this dictionary will be converted into a dataframe:

In [9]:
launch_dict= dict.fromkeys(column_names)

# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each remaining key with an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []

# Add columns that are missing from the extracted header names
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]
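
`dict.fromkeys(column_names)` sets every value to `None`, which is why each key is then assigned its own empty list. Passing a list as the default value to `dict.fromkeys()` would not help, because all keys would share a single list object. The sketch below, using a shortened column list chosen for illustration, shows the pitfall and a dict-comprehension alternative:

```python
column_names = ["Flight No.", "Launch site", "Payload"]

# Pitfall: every key points to the same list object
shared = dict.fromkeys(column_names, [])
shared["Payload"].append("SpaceX CRS-1")
print(shared["Flight No."])  # the append is visible under every key

# Alternative: one fresh list per key
independent = {name: [] for name in column_names}
independent["Payload"].append("SpaceX CRS-1")
print(independent["Flight No."])  # other keys stay empty
```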

Next, we fill `launch_dict` with the launch records extracted from the table rows.


HTML tables on Wikipedia pages often contain unexpected annotations and other noise, such as reference links (`B0004.1[8]`), missing values (`N/A [e]`), and inconsistent formatting.
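
As a rough illustration of how such noise could be removed, the sketch below strips bracketed footnote markers and surrounding whitespace with a regular expression. The helper `clean_cell` is hypothetical and is not used by the extraction loop in this notebook.

```python
import re

# Matches bracketed footnote markers such as [8] or [e]
FOOTNOTE = re.compile(r"\[\w+\]")

def clean_cell(text):
    """Hypothetical helper: drop footnote markers and outer whitespace."""
    return FOOTNOTE.sub("", text).strip()

cleaned = [clean_cell(raw) for raw in ["B0004.1[8]", "N/A [e]", "Success\n"]]
print(cleaned)
```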


In [10]:
extracted_row = 0

# Go through every launch table on the page
for table_number,table in enumerate(soup.find_all('table',"wikitable plainrowheaders collapsible")):
    
    # Go through every row of the table
    for rows in table.find_all("tr"):
        
        # A launch row starts with a <th> cell holding a numeric flight number.
        # Reset the flag for every row so that a stale value is never reused.
        flag=False
        if rows.th and rows.th.string:
            flight_number=rows.th.string.strip()
            flag=flight_number.isdigit()
            
        # Get the data cells of the row
        row=rows.find_all('td')
        
        # If the row is a launch record, save its cells in the dictionary
        if flag:
            extracted_row += 1                     
            
            # Flight Number value            
            launch_dict['Flight No.'].append(flight_number)            
            #print(flight_number)            
            
            # Date value
            datatimelist=date_time(row[0])
            date = datatimelist[0].strip(',')
            
            launch_dict['Date'].append(date)            
            #print(date)                      
            
            # Time value
            time = datatimelist[1]
            
            launch_dict['Time'].append(time)                       
            #print(time)                
                
            # Booster version
            bv=booster_version(row[1])
            if not(bv):
                bv=row[1].a.string
            
            launch_dict['Version Booster'].append(bv)            
            #print(bv)                                  
            
            # Launch Site
            launch_site = row[2].a.string
            
            launch_dict['Launch site'].append(launch_site)            
            #print(launch_site)                                   
            
            # Payload
            payload = row[3].a.string
            
            launch_dict['Payload'].append(payload)            
            #print(payload)                                    
            
            # Payload Mass
            payload_mass = get_mass(row[4])
            
            launch_dict['Payload mass'].append(payload_mass)           
            #print(payload_mass)
            
            # Orbit
            orbit = row[5].a.string          
            
            launch_dict['Orbit'].append(orbit)                        
            #print(orbit)                    
            
            # Customer
            if (row[6].a is not None):
                customer=row[6].a.string
            else:
                customer=row[6].string          
            
            launch_dict['Customer'].append(customer)                        
            #print(customer)                                    
            
            # Launch outcome
            launch_outcome = list(row[7].strings)[0]
            
            launch_dict['Launch outcome'].append(launch_outcome)                        
            #print(launch_outcome)            
            
            # Booster landing
            booster_landing = landing_status(row[8])        
            
            launch_dict['Booster landing'].append(booster_landing)                        
            #print(booster_landing)            
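
The row filter in the loop above relies on `str.isdigit()`: only rows whose first header cell is a purely numeric flight number are stored. A small sketch with made-up header values (assumed for illustration, not taken from the page) shows which labels pass the check:

```python
# Illustrative header-cell texts; only purely numeric labels count as launch rows
header_texts = ["77", "FH 3", " 78\n", "Flight No."]

kept = [text.strip() for text in header_texts if text.strip().isdigit()]
print(kept)
```

Any label that mixes letters, spaces, or punctuation with digits is skipped, so rows numbered in a different scheme do not end up in `launch_dict`.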

After filling `launch_dict` with the parsed launch records, we can create a dataframe from it:

In [11]:
df1=pd.DataFrame(launch_dict)
print(df1.shape)
df1.head()

(77, 11)


Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCSFS,Dragon Spacecraft Qualification Unit,N,LEO,SpaceX,Success\n,F9 v1.0,Failure,4 June 2010,18:45
1,2,CCSFS,SpaceX COTS Demo Flight 1,U,LEO,NASA,Success,F9 v1.0,Failure,8 December 2010,15:43
2,3,CCSFS,SpaceX COTS Demo Flight 2,525 kg,LEO,NASA,Success,F9 v1.0,No attempt\n,22 May 2012,07:44
3,4,CCSFS,SpaceX CRS-1,"4,700 kg",LEO,NASA,Success\n,F9 v1.0,No attempt,8 October 2012,00:35
4,5,CCSFS,SpaceX CRS-2,"4,877 kg",LEO,NASA,Success\n,F9 v1.0,No attempt\n,1 March 2013,15:10


## 2020 - Now
***

We simply repeat the above process for the other URL:

In [12]:
url = "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches"

# Request the page with requests.get()
html_req = requests.get(url)

# Store the raw response body (bytes) in a variable
html = html_req.content

print(html_req.headers["Content-Type"])
html[0:100]

text/html; charset=UTF-8


b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-la'

In [13]:
soup = BeautifulSoup(html, "html5lib")

# Print the page title to verify if the BeautifulSoup object was created properly
soup.title

<title>List of Falcon 9 and Falcon Heavy launches - Wikipedia</title>

In [14]:
html_tables = soup.find_all(name="table")
launch_table = html_tables[2]
table_headers = launch_table.find_all(name="th")
table_headers[0:5]

[<th scope="col">Flight No.
 </th>,
 <th scope="col">Date and<br/>time (<a href="/wiki/Coordinated_Universal_Time" title="Coordinated Universal Time">UTC</a>)
 </th>,
 <th scope="col"><a href="/wiki/List_of_Falcon_9_first-stage_boosters" title="List of Falcon 9 first-stage boosters">Version,<br/>booster</a><sup class="reference" id="cite_ref-booster_17-0"><a href="#cite_note-booster-17">[b]</a></sup>
 </th>,
 <th scope="col">Launch<br/>site
 </th>,
 <th scope="col">Payload<sup class="reference" id="cite_ref-Dragon_18-0"><a href="#cite_note-Dragon-18">[c]</a></sup>
 </th>]

In [15]:
column_names = []

for row in table_headers:
    
    column_name = extract_column_from_header(row)
    
    if column_name is not None and len(column_name) > 0:
        
        column_names.append(column_name)
        
print(column_names)

['Flight No.', 'Date and time ( )', 'Launch site', 'Payload', 'Payload mass', 'Orbit', 'Customer', 'Launch outcome']


In [16]:
launch_dict= dict.fromkeys(column_names)

# Remove an irrelevant column
del launch_dict['Date and time ( )']

# Initialize each remaining key with an empty list
launch_dict['Flight No.'] = []
launch_dict['Launch site'] = []
launch_dict['Payload'] = []
launch_dict['Payload mass'] = []
launch_dict['Orbit'] = []
launch_dict['Customer'] = []
launch_dict['Launch outcome'] = []

# Add columns that are missing from the extracted header names
launch_dict['Version Booster']=[]
launch_dict['Booster landing']=[]
launch_dict['Date']=[]
launch_dict['Time']=[]

In [17]:
extracted_row = 0

# Go through every launch table on the page
for table_number,table in enumerate(soup.find_all('table',"wikitable plainrowheaders collapsible")):
    
    # Go through every row of the table
    for rows in table.find_all("tr"):
        
        # A launch row starts with a <th> cell holding a numeric flight number.
        # Reset the flag for every row so that a stale value is never reused.
        flag=False
        if rows.th and rows.th.string:
            flight_number=rows.th.string.strip()
            flag=flight_number.isdigit()
            
        # Get the data cells of the row
        row=rows.find_all('td')
        
        # If the row is a launch record, save its cells in the dictionary
        if flag:
            extracted_row += 1                     
            
            # Flight Number value            
            launch_dict['Flight No.'].append(flight_number)            
            #print(flight_number)            
            
            # Date value
            datatimelist=date_time(row[0])
            date = datatimelist[0].strip(',')
            
            launch_dict['Date'].append(date)            
            #print(date)                      
            
            # Time value
            time = datatimelist[1]
            
            launch_dict['Time'].append(time)                       
            #print(time)                
                
            # Booster version
            bv=booster_version(row[1])
            if not(bv):
                bv=row[1].a.string
            
            launch_dict['Version Booster'].append(bv)            
            #print(bv)                                  
            
            # Launch Site
            launch_site = row[2].a.string
            
            launch_dict['Launch site'].append(launch_site)            
            #print(launch_site)                                   
            
            # Payload
            payload = row[3].a.string
            
            launch_dict['Payload'].append(payload)            
            #print(payload)                                    
            
            # Payload Mass
            payload_mass = get_mass(row[4])
            
            launch_dict['Payload mass'].append(payload_mass)           
            #print(payload_mass)
            
            # Orbit
            orbit = row[5].a.string          
            
            launch_dict['Orbit'].append(orbit)                        
            #print(orbit)                    
            
            # Customer
            if (row[6].a is not None):
                customer=row[6].a.string
            else:
                customer=row[6].string          
            
            launch_dict['Customer'].append(customer)                        
            #print(customer)                                    
            
            # Launch outcome
            launch_outcome = list(row[7].strings)[0]
            
            launch_dict['Launch outcome'].append(launch_outcome)                        
            #print(launch_outcome)            
            
            # Booster landing
            booster_landing = landing_status(row[8])        
            
            launch_dict['Booster landing'].append(booster_landing)                        
            #print(booster_landing)            

In [18]:
df2=pd.DataFrame(launch_dict)
print(df2.shape)
df2.head()

(149, 11)


Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,78,CCSFS,Starlink,"15,600 kg",LEO,SpaceX,Success\n,F9 B5,Success,7 January 2020,02:19:21
1,79,KSC,Crew Dragon in-flight abort test,"12,050 kg",Sub-orbital,NASA,Success\n,F9 B5,No attempt\n,19 January 2020,15:30
2,80,CCSFS,Starlink,"15,600 kg",LEO,SpaceX,Success\n,F9 B5,Success,29 January 2020,14:07
3,81,CCSFS,Starlink,"15,600 kg",LEO,SpaceX,Success\n,F9 B5,Failure,17 February 2020,15:05
4,82,CCSFS,SpaceX CRS-20,"1,977 kg",LEO,NASA,Success\n,F9 B5,Success,7 March 2020,04:50


## Consolidation
***

Now we will consolidate the two dataframes into one:

In [19]:
df = pd.concat([df1, df2], axis=0)
df.reset_index(inplace=True, drop=True)
df

Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCSFS,Dragon Spacecraft Qualification Unit,N,LEO,SpaceX,Success\n,F9 v1.0,Failure,4 June 2010,18:45
1,2,CCSFS,SpaceX COTS Demo Flight 1,U,LEO,NASA,Success,F9 v1.0,Failure,8 December 2010,15:43
2,3,CCSFS,SpaceX COTS Demo Flight 2,525 kg,LEO,NASA,Success,F9 v1.0,No attempt\n,22 May 2012,07:44
3,4,CCSFS,SpaceX CRS-1,"4,700 kg",LEO,NASA,Success\n,F9 v1.0,No attempt,8 October 2012,00:35
4,5,CCSFS,SpaceX CRS-2,"4,877 kg",LEO,NASA,Success\n,F9 v1.0,No attempt\n,1 March 2013,15:10
...,...,...,...,...,...,...,...,...,...,...,...
221,222,VSFB,Starlink Group 2-9,"15,900 kg",LEO,SpaceX,Success\n,F9 B5B1075.3,Success,10 May 2023,20:09
222,223,CCSFS,Starlink Group 5-9,"~17,400 kg",LEO,SpaceX,Success\n,F9 B5,Success,14 May 2023,05:03
223,224,CCSFS,Starlink Group 6-3,"~17,600 kg",LEO,SpaceX,Success\n,F9 B5,Success,19 May 2023,06:19
224,225,VSFB,Iridium-NEXT,"~6,600 kg",Polar,Iridium,Success\n,F9 B5,Success,20 May 2023,13:16
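
As a toy version of the consolidation step, the sketch below stacks two small dataframes. The frames `first` and `second` are made up for illustration. Passing `ignore_index=True` to `pd.concat()` renumbers the rows in one call, which is an alternative to calling `reset_index()` afterwards.

```python
import pandas as pd

# Two small stand-in frames with the same column
first = pd.DataFrame({"Flight No.": ["1", "2"]})
second = pd.DataFrame({"Flight No.": ["78"]})

# Stack the rows and renumber the index from 0
combined = pd.concat([first, second], ignore_index=True)
print(combined.index.tolist())
```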


After the consolidation, we can filter out the launches after 31/12/2022:

In [20]:
# Change the data type of the column "Date" to "datetime"
df['Date'] = pd.to_datetime(df['Date'], format='%d %B %Y')
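
The format string `'%d %B %Y'` expects a day number, a full month name, and a four-digit year. The standard-library sketch below uses `datetime.strptime`, which accepts the same directives, on two date strings from the table above. It assumes an English locale for the month names and shows how a parsed value compares against a cutoff date.

```python
from datetime import datetime

DATE_FORMAT = "%d %B %Y"

# Parse scraped date strings into datetime objects
early = datetime.strptime("4 June 2010", DATE_FORMAT)
late = datetime.strptime("10 May 2023", DATE_FORMAT)

# Launches before this date are kept
cutoff = datetime(2023, 1, 1)
print(early.date(), early < cutoff)
print(late.date(), late < cutoff)
```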

In [21]:
# Keep launches before 1 January 2023 (ISO date string avoids day/month ambiguity)
df = df[df['Date'] < "2023-01-01"]
print(df.shape)
df.head()

(194, 11)


Unnamed: 0,Flight No.,Launch site,Payload,Payload mass,Orbit,Customer,Launch outcome,Version Booster,Booster landing,Date,Time
0,1,CCSFS,Dragon Spacecraft Qualification Unit,N,LEO,SpaceX,Success\n,F9 v1.0,Failure,2010-06-04,18:45
1,2,CCSFS,SpaceX COTS Demo Flight 1,U,LEO,NASA,Success,F9 v1.0,Failure,2010-12-08,15:43
2,3,CCSFS,SpaceX COTS Demo Flight 2,525 kg,LEO,NASA,Success,F9 v1.0,No attempt\n,2012-05-22,07:44
3,4,CCSFS,SpaceX CRS-1,"4,700 kg",LEO,NASA,Success\n,F9 v1.0,No attempt,2012-10-08,00:35
4,5,CCSFS,SpaceX CRS-2,"4,877 kg",LEO,NASA,Success\n,F9 v1.0,No attempt\n,2013-03-01,15:10


We can now export our dataset into a CSV file:

In [22]:
df.to_csv('falcon9_webscraping.csv', index=False)

# <center>________________________________________________________________</center>