# <ins>Analysis on the Potential of Life on Exoplanets</ins>

## By: Ethan Barr and Tim Freerksen

## <ins>Introduction</ins>

Is there life in space? This question has been asked for many years, with no real evidence to back up the claim that there is. Our goal in this project is to take the exoplanets already discovered and estimate what the probability of life on these planets really is. We hope that a good number of the exoplanets discovered so far fit the criteria for being able to hold life.

Throughout this tutorial, we will try to see how many of the exoplanets discovered so far have the right conditions to harbor life, and then we will estimate the probability that a future exoplanet will end up holding life.

### <ins>Required Tools</ins>

The following libraries are used for the project:

    1. pandas
    2. re (Python's built-in regular expression module)

If you have issues with Python 3 or pandas, we recommend the following websites for more information:

    1. https://docs.python.org/3/
    2. https://pandas.pydata.org/pandas-docs/stable/install.html

### <ins>1. Data Collection</ins>

This is the first part of the data life cycle. In this part we search various websites for data that both matches our topic and gives enough information for us to perform an analysis later on.

For an exoplanet database, we found that https://exoplanetarchive.ipac.caltech.edu/cgi-bin/TblView/nph-tblView?app=ExoTbls&config=PS gave the most complete information, covering 1989 to 2020. To retrieve this data, we first downloaded the online database as a CSV file that we could then read and manipulate.

The following tools were used for the data collection:

    1. pandas

In [56]:
import pandas as pd # used to read the csv file and convert it into a DataFrame
import re           # used to easily gather columns of similar aspects

Since the Exoplanet Archive website allows the database to be downloaded as a CSV file, there were not many steps needed to access the entire database. It was sufficient to download the database in CSV format and add it as one of the files in this project. We could then access this file with the pandas read_csv function, which converts the entire CSV file into a flexible and readable DataFrame.

A <ins>DataFrame</ins> is a table whose rows and columns correspond to certain pieces of data. Using DataFrames gives access to many pandas functions that make manipulating this data much more flexible. If you are interested in learning more about DataFrames, check out the pandas documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

The only issue with the CSV when we first read it was that the data came with its own numbering for each row. To combat this, we decided to keep the original numbering and use that column, titled loc_rowid, as the row index.
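As a small aside, the same result can likely be reached in one step by passing `index_col` to `read_csv`. The sketch below uses a tiny inline CSV string in place of `PS.csv`; the name `sample_csv` and its three rows are illustrative assumptions, not the real file.

```python
import io
import pandas as pd

# A tiny stand-in for PS.csv; these rows are illustrative, not the full file
sample_csv = io.StringIO(
    "loc_rowid,pl_name,hostname,disc_year\n"
    "1,11 Com b,11 Com,2007\n"
    "2,11 Com b,11 Com,2007\n"
    "3,11 UMi b,11 UMi,2009\n"
)

# index_col tells pandas to use the file's own numbering as the row index while reading
sample = pd.read_csv(sample_csv, index_col='loc_rowid')
print(sample.head())
```

Reading and indexing in one call avoids a separate `set_index` step, though the two-step version works just as well.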



In [57]:
data = pd.read_csv('PS.csv')
data.set_index('loc_rowid', inplace=True)
data.head()

loc_rowid,pl_name,hostname,default_flag,sy_snum,sy_pnum,discoverymethod,disc_year,disc_facility,soltype,pl_controv_flag,...,sy_vmagerr2,sy_kmag,sy_kmagerr1,sy_kmagerr2,sy_gaiamag,sy_gaiamagerr1,sy_gaiamagerr2,rowupdate,pl_pubdate,releasedate
1,11 Com b,11 Com,0,2,1,Radial Velocity,2007,Xinglong Station,Published Confirmed,0,...,-0.023,2.282,0.346,-0.346,4.44038,0.003848,-0.003848,2014-07-23,2011-08,2014-07-23
2,11 Com b,11 Com,1,2,1,Radial Velocity,2007,Xinglong Station,Published Confirmed,0,...,-0.023,2.282,0.346,-0.346,4.44038,0.003848,-0.003848,2014-05-14,2008-01,2014-05-14
3,11 UMi b,11 UMi,0,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,Published Confirmed,0,...,-0.005,1.939,0.27,-0.27,4.56216,0.003903,-0.003903,2018-04-25 14:08:01,2009-10,2014-05-14
4,11 UMi b,11 UMi,0,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,Published Confirmed,0,...,-0.005,1.939,0.27,-0.27,4.56216,0.003903,-0.003903,2018-04-25 14:08:01,2011-08,2014-07-23
5,11 UMi b,11 UMi,1,1,1,Radial Velocity,2009,Thueringer Landessternwarte Tautenburg,Published Confirmed,0,...,-0.005,1.939,0.27,-0.27,4.56216,0.003903,-0.003903,2018-09-04 16:14:36,2017-03,2018-09-06


### <ins>2. Data Processing</ins>

After you have retrieved the data and loaded it into a DataFrame that you can manipulate, you move on to this next step. Here we tidy up the data that we just read in. This step is important because it makes the data much easier to read and understand. In our case we will alter the structure of the DataFrame through tidying and/or data wrangling.

You can learn more about:

    1. tidying data: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
    2. data wrangling: https://www.elderresearch.com/blog/what-is-data-wrangling-and-why-does-it-take-so-long/#:~:text=Data%20wrangling%20is%20the%20process,20%25%20for%20exploration%20and%20modeling.

We will now go through the steps of tidying up our DataFrame so that any columns we do not need for our analysis are discarded. We also want to add a few columns to the DataFrame so that, when we perform calculations, the values are no longer expressed in comparison units (example: 10 Earth Masses, which says that the planet has 10 times the mass of Earth). Another issue with this DataFrame is that, with so many columns, the names may not be very intuitive, so we want to rename the columns so that they can be understood quickly without a cheat sheet.

Because quantities in space are so hard to measure (we can't just take out a ruler), astronomers typically record the value they believe is correct together with the upper and lower limits of what that value could be. We will use only the believed value and ignore the upper and lower limits, which makes calculating certain properties of the exoplanets and host stars much easier.
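To illustrate how the limit columns can be told apart from the value columns by name alone, here is a minimal sketch. The names `sy_kmag`, `sy_kmagerr1`, and `sy_kmagerr2` appear in the table above; `pl_orbperlim` is an assumed example of a limit-flag column name, and the rule only works as long as no column we want to keep contains a 1, a 2, or `lim` in its name.

```python
import re

# Column names in the archive's style; 'pl_orbperlim' is an assumed limit-flag name
column_names = ['sy_kmag', 'sy_kmagerr1', 'sy_kmagerr2', 'pl_orbper', 'pl_orbperlim']

# A name marks a limit column if it contains a 1, a 2, or the text 'lim'
pattern = re.compile(r'[12]|lim')
kept = [name for name in column_names if not pattern.search(name)]
dropped = [name for name in column_names if pattern.search(name)]

print(kept)     # ['sy_kmag', 'pl_orbper']
print(dropped)  # ['sy_kmagerr1', 'sy_kmagerr2', 'pl_orbperlim']
```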

The following tools were used for Data Processing:

    1. re (Python's built-in regular expression module)

In [58]:
# Collect the upper and lower limit columns (names containing a 1 or a 2, such as err1/err2)
# as well as the limit flags attached to certain values (names containing 'lim'),
# then drop them all in a single call
limit_columns = [columnName for columnName in data.columns if re.search('[12]|lim', columnName)]
data.drop(columns=limit_columns, inplace=True)

# Remove the columns that are not needed but were not captured by the pattern above
unused_columns = [
    'default_flag', 'discoverymethod', 'disc_facility', 'soltype', 'pl_controv_flag',
    'pl_refname', 'ttv_flag', 'rowupdate', 'pl_pubdate', 'releasedate', 'sy_refname',
    'rastr', 'ra', 'decstr', 'dec', 'sy_gaiamag', 'sy_kmag', 'sy_vmag', 'st_refname',
    'pl_orbeccen', 'pl_insol', 'pl_orbsmax',
]
data.drop(columns=unused_columns, inplace=True)

data.head()


loc_rowid,pl_name,hostname,sy_snum,sy_pnum,disc_year,pl_orbper,pl_rade,pl_radj,pl_bmasse,pl_bmassj,pl_bmassprov,pl_eqt,st_teff,st_rad,st_mass,st_met,st_metratio,st_logg,sy_dist
1,11 Com b,11 Com,2,1,2007,,,,5434.7,17.1,Msini,,,,2.6,,,,93.1846
2,11 Com b,11 Com,2,1,2007,326.03,,,6165.6,19.4,Msini,,4742.0,19.0,2.7,-0.35,[Fe/H],2.31,93.1846
3,11 UMi b,11 UMi,1,1,2009,516.22,,,3337.07,10.5,Msini,,4340.0,24.08,1.8,0.04,[Fe/H],1.6,125.321
4,11 UMi b,11 UMi,1,1,2009,,,,3432.4,10.8,Msini,,,,1.7,,,,125.321
5,11 UMi b,11 UMi,1,1,2009,516.21997,,,4684.8142,14.74,Msini,,4213.0,29.79,2.78,-0.02,[Fe/H],1.93,125.321


After removing the columns that aren't useful or necessary for our calculations, we can move to the next step of tidying our data: removing the rows that have missing data. We don't have to delete every row that is missing data, because some columns are interchangeable. For example, the radius and mass of a planet are each given in both Earth units and Jupiter units; we can use either value in a formula, since in the end we can convert it to whatever radius/mass units we need. We also want to keep as much data as we can so that we get a better analysis.
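The idea of interchangeable columns can be sketched on a made-up three-row table: a row is unusable only when both radius columns are empty. The planet names and values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'pl_name': ['A b', 'B b', 'C b'],
    'pl_rade': [1.9, np.nan, np.nan],   # radius in Earth radii
    'pl_radj': [np.nan, 0.92, np.nan],  # radius in Jupiter radii
})

# A row is unusable only when BOTH radius columns are missing
missing_radius = toy['pl_rade'].isna() & toy['pl_radj'].isna()
usable = toy[~missing_radius]

print(usable['pl_name'].tolist())  # ['A b', 'B b']
```

Planet `B b` survives even though its Earth-unit radius is missing, because the Jupiter-unit radius can stand in for it.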

To get more information regarding the data that we are using, check out this website:

    1. http://exoplanetarchive.ipac.caltech.edu

In [59]:
# Flag the rows that lack a value we need for the analysis
missing_radius = data['pl_rade'].isna() & data['pl_radj'].isna()      # neither radius value was recorded for the planet
missing_mass = data['pl_bmasse'].isna() & data['pl_bmassj'].isna()    # neither mass value was recorded for the planet
required_columns = [
    'pl_orbper',    # orbital period: how long the planet takes to go around its host star
    'st_teff',      # stellar effective temperature of the host star
    'st_rad',       # radius of the host star
    'st_mass',      # mass of the host star
    'disc_year',    # year the planet was discovered
]
missing_required = data[required_columns].isna().any(axis=1)

# Drop every flagged row in a single call
data.drop(index=data.index[missing_radius | missing_mass | missing_required], inplace=True)

data.head()

loc_rowid,pl_name,hostname,sy_snum,sy_pnum,disc_year,pl_orbper,pl_rade,pl_radj,pl_bmasse,pl_bmassj,pl_bmassprov,pl_eqt,st_teff,st_rad,st_mass,st_met,st_metratio,st_logg,sy_dist
36,2MASS J21402931+1625183 A b,2MASS J21402931+1625183 A,1,1,2009,7336.5,10.31,0.92,6657.48,20.95,Mass,2075.0,2300.0,0.12,0.08,,,,
82,55 Cnc e,55 Cnc,2,5,2004,0.736539,1.91,0.17,8.08,0.02542,Mass,,5250.0,0.96,0.9,0.35,[M/H],4.42,12.5855
83,55 Cnc e,55 Cnc,2,5,2004,0.736546,2.173,0.194,8.37,0.026,Mass,,5250.0,0.96,0.9,0.35,[M/H],4.42,12.5855
85,55 Cnc e,55 Cnc,2,5,2004,0.737,1.897,0.169,7.74,0.02435,Mass,,5172.0,0.95,0.87,0.35,[Fe/H],4.43,12.5855
88,55 Cnc e,55 Cnc,2,5,2004,0.736544,2.08,0.186,7.81,0.02457,Mass,1958.0,5234.0,0.94,0.91,0.31,[Fe/H],4.45,12.5855


Now that we have cleaned up most of the DataFrame, we can see that the loc_rowid values no longer count up consecutively, so we should reset the index using the pandas function .reset_index. By default, this function moves loc_rowid into its own column and replaces it with a fresh index starting at 0. Since loc_rowid no longer serves a purpose, we pass drop=True to remove it altogether.

In [60]:
data.reset_index(inplace=True, drop=True)
data.head()

Unnamed: 0,pl_name,hostname,sy_snum,sy_pnum,disc_year,pl_orbper,pl_rade,pl_radj,pl_bmasse,pl_bmassj,pl_bmassprov,pl_eqt,st_teff,st_rad,st_mass,st_met,st_metratio,st_logg,sy_dist
0,2MASS J21402931+1625183 A b,2MASS J21402931+1625183 A,1,1,2009,7336.5,10.31,0.92,6657.48,20.95,Mass,2075.0,2300.0,0.12,0.08,,,,
1,55 Cnc e,55 Cnc,2,5,2004,0.736539,1.91,0.17,8.08,0.02542,Mass,,5250.0,0.96,0.9,0.35,[M/H],4.42,12.5855
2,55 Cnc e,55 Cnc,2,5,2004,0.736546,2.173,0.194,8.37,0.026,Mass,,5250.0,0.96,0.9,0.35,[M/H],4.42,12.5855
3,55 Cnc e,55 Cnc,2,5,2004,0.737,1.897,0.169,7.74,0.02435,Mass,,5172.0,0.95,0.87,0.35,[Fe/H],4.43,12.5855
4,55 Cnc e,55 Cnc,2,5,2004,0.736544,2.08,0.186,7.81,0.02457,Mass,1958.0,5234.0,0.94,0.91,0.31,[Fe/H],4.45,12.5855
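To see what `drop=True` changes, the sketch below resets the index of a made-up frame both ways. The frame `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Made-up frame whose index has gaps, like loc_rowid after rows were dropped
toy = pd.DataFrame(
    {'pl_name': ['A b', 'B b', 'C b']},
    index=pd.Index([36, 82, 83], name='loc_rowid'),
)

kept_as_column = toy.reset_index()       # old index becomes a 'loc_rowid' column
discarded = toy.reset_index(drop=True)   # old index is thrown away

print(kept_as_column.columns.tolist())  # ['loc_rowid', 'pl_name']
print(discarded.columns.tolist())       # ['pl_name']
print(discarded.index.tolist())         # [0, 1, 2]
```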


Great! Now all our rows are numbered consecutively again and we have only the columns that we need for our analysis. The next step is to convert the columns whose units are Earth Radius, Earth Mass, Jupiter Radius, and Jupiter Mass into units that are easier to use in the calculations we will do later. We will convert them into kilometers (km) and kilograms (kg), as these are common units for such calculations.

If you would like to learn more about the Astronomical System of Units, we recommend this website:

    1. https://en.wikipedia.org/wiki/Astronomical_system_of_units

Now that we know what we are going to do we need to figure out how we are going to accomplish this. Based on a quick search we know:
    
    Earth Mass = 5.972 * 10 ^ 24 kg 
    Earth Radius = 6,371 km

    Jupiter Mass = 1.898 * 10 ^ 27 kg
    Jupiter Radius = 69,911 km

With this information we can go through the columns and multiply each value by the matching constant to get the mass of the planet in kilograms and the radius of the planet in kilometers.

In [None]:
                                            ############ NOTE #############
#
#   'CSV Info.txt' lists the columns we are working with and the columns we deleted,
#   so that we can keep track of them as we write the later code.
#

# Need to convert the correct columns for Earth Radius, Jupiter Radius, Solar Radius, Solar Mass to kg and km

# Also need to rename the above columns, since we will most likely use those the most
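One possible way to carry out the planet conversion described above is sketched here. The helper `add_si_columns`, the new column names `pl_mass_kg` and `pl_radius_km`, and the rule of preferring the Earth-unit value and falling back to the Jupiter-unit value are all assumptions for illustration, not the authors' final implementation.

```python
import numpy as np
import pandas as pd

# Conversion factors listed above
EARTH_MASS_KG = 5.972e24
EARTH_RADIUS_KM = 6371.0
JUPITER_MASS_KG = 1.898e27
JUPITER_RADIUS_KM = 69911.0

def add_si_columns(df):
    """Return a copy of df with planet mass in kg and planet radius in km.

    Uses the Earth-unit value when present and falls back to the
    Jupiter-unit value otherwise (an assumed rule).
    """
    out = df.copy()
    out['pl_mass_kg'] = (out['pl_bmasse'] * EARTH_MASS_KG).fillna(
        out['pl_bmassj'] * JUPITER_MASS_KG)
    out['pl_radius_km'] = (out['pl_rade'] * EARTH_RADIUS_KM).fillna(
        out['pl_radj'] * JUPITER_RADIUS_KM)
    return out

# First row uses values from a 55 Cnc e row shown earlier;
# the second row is made up to exercise the Jupiter-unit fallback
toy = pd.DataFrame({
    'pl_rade': [1.91, np.nan],
    'pl_radj': [0.17, 1.0],
    'pl_bmasse': [8.08, np.nan],
    'pl_bmassj': [0.02542, 1.0],
})
converted = add_si_columns(toy)
print(converted[['pl_mass_kg', 'pl_radius_km']])
```

The host star columns (`st_rad`, `st_mass`) are given in solar units, so they would need their own conversion factors, which are not listed above.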