# Exploring Exoplanets with Type Providers and Plotly

This is a new three-part episode of the course [Make F# your first functional programming language](https://github.com/fcolavecchia/fp-course-public). In this first part we review the use of a Type Provider (thanks to [`FSharp.Data`](https://fsprojects.github.io/FSharp.Data/)), before exploring how to plot the data using [Plotly](https://plotly.com/fsharp/). This is a typical data science workflow.


## Getting the data

We are going to space! 

Yes, the Earth is not alone in the Universe, since there are thousands of planets orbiting stars in our galaxy, not so far away. These planets, called exoplanets, were first discovered in 1992. For this episode, we are going to use the data from the [NASA Exoplanet Archive](https://exoplanetarchive.ipac.caltech.edu/index.html). The data is stored in a database that can be accessed using an API, for example with `wget` or directly in the browser. I downloaded a curated version of that data to play with, in a [csv file](data/consolidatedExoPlanets.csv). 

Let us recall that a Type Provider is an F# component that creates a `type` from some structured data read from a file. This data can be in HTML, CSV, JSON or XML format, all of which are ubiquitous on the web. To this end, we need to install the package in this notebook and open it:

In [1]:
#r "nuget: FSharp.Data"

open FSharp.Data

The type provider needs a structured data source to build the type. In our case the data is in a file:

In [2]:
[<Literal>]
let exoplanetsFile = "../data/exoplanets.csv"

Creating the type is as easy as

In [3]:
type ExoPlanetTypeProvider = FSharp.Data.CsvProvider<exoplanetsFile, HasHeaders=true>

As usual in data science, one takes a glimpse of the data to get an overall feel for what it is about, while leaving the details to the code. The file has the column names in the first row, which is why we use the argument `HasHeaders=true`.

> Check more details on how the Type Provider works in [this episode](https://github.com/fcolavecchia/fp-course-public/blob/main/en/80_TypeProviders.ipynb).

Now we effectively create the data and the type with:

In [4]:
let exoplanets = ExoPlanetTypeProvider.GetSample()

Let us see what we have inside:

In [5]:
exoplanets.Headers

Value: [ pl_name, soltype, disc_refname, hd_name, pl_masse, pl_orbper, discoverymethod, cb_flag, sy_dist, pl_insol ]


This gives us an `Option` value that holds the column names. Let us print them more clearly by iterating over the `seq`uence of headers:

In [6]:
exoplanets.Headers
|> Option.iter (fun headers ->
    headers
    // Print each header with a 1-based index
    |> Seq.iteri (fun i name -> printfn "Item %d: %s" (i+1) name))

Item 1: pl_name
Item 2: soltype
Item 3: disc_refname
Item 4: hd_name
Item 5: pl_masse
Item 6: pl_orbper
Item 7: discoverymethod
Item 8: cb_flag
Item 9: sy_dist
Item 10: pl_insol


There are ten columns that correspond to the following information, according to the NASA site:

- `pl_name`: This is the exoplanet name
- `soltype`: The status of the exoplanet entry (for example, confirmed or candidate)
- `disc_refname`: An HTML piece with the url to the published reference of the discovery
- `hd_name`: The name of the star that hosts the planet in the Henry Draper (HD) catalog, when it has one
- `pl_masse`: The planetary mass, measured in units of the mass of the Earth (i.e.: `pl_masse` of Earth is equal to one)
- `pl_orbper`: The orbital period (that is, the duration of the exoplanet's year) measured in Earth days
- `discoverymethod`: The method used in the discovery
- `cb_flag`: Whether the planet orbits a binary system (now that would be a view!)
- `sy_dist`: Distance to the planetary system in units of parsecs (one parsec is about 3.26 light years)
- `pl_insol`: Insolation flux, the amount of energy the planet receives from the hosting star, given in units relative to the flux measured for the Earth from the Sun.

These are the main features of an exoplanet. The idea behind this research is to find Earth-like planets that can host life as we know it. Therefore, it is important to know the mass of the planet (large planets tend to be gaseous ones like Jupiter or Saturn); the distance from the hosting star (too far is too cold, too close would be too hot) and the amount of energy the planet receives from the star (stars can be really big and bright, so even though the planet can be far away, it could still receive a lot of light from the hosting star, preventing the formation of life as we know it).
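
Since several of these columns carry physical units, it can help to make the units explicit in the code. The following is only a minimal sketch using F# units of measure: the measure names `pc` and `ly` and the helper `toLightYears` are hypothetical, and the only number taken from the text is the approximate factor of 3.26 light years per parsec.

```fsharp
// Hypothetical measure names, introduced only for this illustration
[<Measure>] type pc   // parsec
[<Measure>] type ly   // light year

// Approximate conversion factor quoted in the column list above
let lightYearsPerParsec = 3.26<ly/pc>

// The compiler checks that parsecs go in and light years come out
let toLightYears (d: float<pc>) : float<ly> = d * lightYearsPerParsec

// A made-up distance of 10 parsecs
printfn "%.1f light years" (float (toLightYears 10.0<pc>))   // prints "32.6 light years"
```

With this in place, mixing up a distance in parsecs with one in light years becomes a compile-time error rather than a silent bug.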

Good! Remember also that the provider returns the data as a sequence in the `.Rows` property:

In [7]:
exoplanets.Rows 
|> Seq.take 2
|> Seq.iteri (fun i s ->  printfn $"{i}: %A{s}")

0: ("OGLE-TR-10 b", "Published Confirmed",
 "<a refstr=KONACKI_ET_AL__2005 href=https://ui.adsabs.harvard.edu/abs/2005ApJ...624..372K/abstract target=ref> Konacki et al. 2005 </a>",
 "", 197.046, 3.101278, "Transit", false, 1344.97, nan)
1: ("HD 210702 b", "Published Confirmed",
 "<a refstr=JOHNSON_ET_AL__2007 href=https://ui.adsabs.harvard.edu/abs/2007ApJ...665..785J/abstract target=ref> Johnson et al. 2007 </a>",
 "HD 210702", nan, 354.29, "Radial Velocity", false, 54.1963, nan)


Let us extract the first one into a value:

In [8]:
let exo0 = exoplanets.Rows |> Seq.item 0

Recall also that although the column name in the file is, for example, `pl_name`, one accesses that value for a particular planet through the `Pl_name` field of the type. Here the F# compiler helps you find the name of each field of the current row: just write down the value followed by a dot, and the possible fields will pop up:

<img src="../data/Fields of Exoplanet type.png" alt="" width="400"/>


Going back to the data, it looks like some columns are read as `nan`! (A data science classic...). Not to worry, the type provider lets us change those fields to an `Option` type, giving `Some value` when there is one, and `None` instead of `nan`. However, we need to recreate the type with the option `PreferOptionals=true`:

In [9]:
type ExoPlanetType = FSharp.Data.CsvProvider<exoplanetsFile, HasHeaders=true, PreferOptionals=true>

In [10]:
let exoplanets2 = ExoPlanetType.GetSample()

> We could have used the same name for the value here, like `let exoplanets = ExoPlanetType.GetSample()`, because the notebook allows us to do so. However, to keep things cleaner, I use `exoplanets2`.
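
To get a feeling for what the change from `nan` to `Option` means, here is a small sketch in plain F#. It does not show how the provider works internally. The helper `toOption` and the list `rawMasses` are hypothetical names with toy values, used only to illustrate the idea of replacing a missing number with `None`.

```fsharp
open System

// Hypothetical helper, not part of FSharp.Data:
// turn a possibly missing float into an option
let toOption (x: float) : float option =
    if Double.IsNaN x then None else Some x

// Toy values; nan marks a missing measurement
let rawMasses = [ 197.046; nan; 1.0 ]

rawMasses
|> List.map toOption
|> List.iter (printfn "%A")
```

The advantage of the `Option` version is that the compiler forces us to handle the missing case, while a `nan` silently propagates through any arithmetic.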

Now our first exoplanet in the list is:

In [11]:
let ogleTR10b = 
    exoplanets2.Rows 
    |> Seq.item 0

printfn "Name: %A" ogleTR10b.Pl_name
printfn "Insolation: %A" ogleTR10b.Pl_insol

Name: "OGLE-TR-10 b"
Insolation: None


Now we are talking! Let us see the types of each field in the `ExoPlanetType`:

In [12]:
ogleTR10b.GetType().GetProperties()
|> Seq.iter (fun p -> printfn $"{p.PropertyType}")

System.String
System.String
System.String
Microsoft.FSharp.Core.FSharpOption`1[System.String]
Microsoft.FSharp.Core.FSharpOption`1[System.Decimal]
Microsoft.FSharp.Core.FSharpOption`1[System.Decimal]
System.String
System.Tuple`3[System.Boolean,Microsoft.FSharp.Core.FSharpOption`1[System.Decimal],Microsoft.FSharp.Core.FSharpOption`1[System.Decimal]]


In this interactive environment there is a nice function to see the data as a table, `.DisplayTable()`:

In [13]:
exoplanets2.Rows
|> Seq.take 4
|> fun r -> r.DisplayTable()

You see that `DisplayTable()` shows the `None` values printed as `<null>`. 
The second column is the status of discovery of the exoplanet, and we can count how many entries the list has for each status type:

In [14]:
exoplanets2.Rows 
|> Seq.countBy (fun x -> x.Soltype)
|> Seq.iter (fun (k,v) -> printfn $"{k}: {v}")

Published Confirmed: 17420
Kepler Project Candidate (q1_q8_koi): 2310
TESS Project Candidate: 877
Published Candidate: 776
Kepler Project Candidate (q1_q17_dr24_koi): 2705
Kepler Project Candidate (q1_q12_koi): 2683
Kepler Project Candidate (q1_q16_koi): 2725
Kepler Project Candidate (q1_q17_dr25_koi): 2719
Kepler Project Candidate (q1_q17_dr25_sup_koi): 2736


Let us work with the confirmed exoplanets only, creating a sequence by filtering the original data:

In [15]:
let confirmed = 
    exoplanets2.Rows 
    |> Seq.filter (fun x -> x.Soltype = "Published Confirmed")

confirmed |> Seq.length    

This gives the 17420 entries counted above for `Published Confirmed`. This is weird, since the NASA site talks about 5 thousand-ish planets. There must be some data that is repeated. Let us group the data by the planet name, which can be assumed to be a good unique key:

In [16]:
confirmed
|> Seq.groupBy (fun x -> x.Pl_name)
|> Seq.length

That is OK! (for August 2023...) So we need to see what is going on with the repetitions. Let us group the entries by planet name and take the planet that is repeated the most:

In [17]:
let exoWithMaxEntriesName, exoWithMaxEntries =
    confirmed
    |> Seq.groupBy (fun x -> x.Pl_name) // Group by name
    |> Seq.map (fun (name, entries) -> name, Seq.length entries, entries) // Map into a tuple of name, count and entries
    |> Seq.maxBy (fun (_, count, _) -> count) // Find the tuple with the highest count
    |> fun (name, _, entries) -> name, entries // Return the name and its entries

printfn $"Planet {exoWithMaxEntriesName} has {exoWithMaxEntries |> Seq.length} entries"



Planet TrES-2 b has 25 entries
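
The same group-and-pick pattern can be tried on a handful of toy values before trusting it on thousands of rows. This is just a sketch: the names and masses in `measurements` are made up, and the intermediate count tuple is skipped by computing the length directly inside `Seq.maxBy`.

```fsharp
// Toy data: (planet name, measured mass); names and values are made up
let measurements =
    [ "A b", 1.0; "B c", 2.0; "A b", 1.2; "A b", 0.9; "B c", 2.1 ]

let mostRepeatedName, mostRepeatedEntries =
    measurements
    |> Seq.groupBy fst                               // group the rows by name
    |> Seq.maxBy (fun (_, rows) -> Seq.length rows)  // keep the largest group

printfn "%s has %d entries" mostRepeatedName (Seq.length mostRepeatedEntries)
// prints "A b has 3 entries"
```

Since `Seq.groupBy` already yields pairs of key and grouped rows, the result of `Seq.maxBy` can be destructured directly into two values.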


We created two values at once (a tuple) from processing the sequence of `confirmed` planets. The first value, `exoWithMaxEntriesName`, contains the name of the exoplanet that is most repeated, while the second one, `exoWithMaxEntries`, is the sequence of the different rows corresponding to that planet. It looks like the planet _TrES-2 b_ has 25 entries! Let us see what that data looks like:

In [18]:
exoWithMaxEntries.DisplayTable()


It looks like the numeric values for `pl_masse`, `pl_orbper` and (maybe) `pl_insol` can be different for all entries of a given planet. A possible way to deal with this situation is to average each of them.

> I am not an exoplanet expert, so maybe there is another proper way to handle this data...

Notice that, for example, the values for `pl_masse` are `Option`s:

In [19]:
exoWithMaxEntries
|> Seq.map (fun p -> p.Pl_masse)

We average those values, taking into account only those that are in fact a measurement of the mass (given by the `Some` option), while discarding the nonexistent ones (the `None`s). Instead of going straight to the data, it can be useful to work out the problem of averaging a list of `Option` values with a minimal example: 

In [20]:
let optionsList = 
    [ Some 2.0m; Some 5.0m; None ; None ; Some 2.0m; Some 1.0m]  // decimal option list
    |> List.toSeq

let avg (data: decimal option seq) = 
    let values = 
        data  
        |> Seq.choose id // Discards the None values and keeps the Some values
    if Seq.isEmpty values then None else Some (values |> Seq.average)        

avg optionsList

Value: 2.5


The application of `Seq.choose id` removes the `None`s and extracts the values from the `Some` options. We also prevent taking the average of an empty sequence of data with the `if..then..else` construct (remember that in F# everything returns a value, and the `if` is used as such).
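
To see why the guard matters, the following sketch compares what happens with and without it when there is nothing to average. The `avg` function is repeated only to keep the snippet self-contained, and the names `unguarded` and `allMissing` are introduced just for this illustration.

```fsharp
// Re-declared here so the sketch is self-contained; same logic as `avg` above
let avg (data: decimal option seq) =
    let values = data |> Seq.choose id
    if Seq.isEmpty values then None else Some (values |> Seq.average)

// Without the guard, averaging an empty sequence raises an ArgumentException,
// which is caught here only to show the failure
let unguarded =
    try
        Some (Seq.average (Seq.empty<decimal>))
    with :? System.ArgumentException -> None

// A planet with no measurement at all
let allMissing : decimal option list = [ None; None ]

printfn "unguarded: %A" unguarded
printfn "guarded:   %A" (avg allMissing)
```

Returning `None` for "no measurements at all" keeps the result in the same `decimal option` shape as the input values.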

Note that we are building a function for a `decimal option seq` because that is the data we are getting from the provider for those numerical values. The `m` suffix turns a numeric literal into a `decimal`. For the masses of our *TrES-2 b* planet we have:

In [21]:
exoWithMaxEntries
|> Seq.map (fun p -> p.Pl_masse)
|> avg

Value: 398.6152305882353


Now we need to map our current sequence of data for _one_ planet into a single entry of the confirmed planets. Let us build a function that does exactly what we need, and then, we `Seq.map` over our sequence of planets.

In [22]:
let collapse (planet: seq<ExoPlanetType.Row>) = 
    let masses =
        planet 
        |> Seq.map (fun v -> v.Pl_masse)
        |> avg 

    let orbper =
        planet 
        |> Seq.map (fun v -> v.Pl_orbper)
        |> avg

    let insol =
        planet 
        |> Seq.map (fun v -> v.Pl_insol)
        |> avg

    let planetData = planet |> Seq.head   
        
    let row = ExoPlanetType.Row(
        plName = planetData.Pl_name,
        soltype = planetData.Soltype,
        discRefname = planetData.Disc_refname,
        hdName = planetData.Hd_name,
        plMasse = masses ,
        plOrbper = orbper,
        discoverymethod = planetData.Discoverymethod,
        cbFlag = planetData.Cb_flag,
        syDist = planetData.Sy_dist,
        plInsol = insol
    )

    row


There are some points to note:

First, the input argument of the function `collapse` is a sequence of rows for a given planet that has many entries in our original data, as we saw with `TrES-2 b`. Each row has the type `ExoPlanetType.Row`, which is the row type created by the Type Provider. Second, in the function we compute the average of the mass `.Pl_masse`, the orbital period `.Pl_orbper` and the insolation flux `.Pl_insol`. Then, since all the entries in the sequence share the rest of the information, we extract it from the first entry of the sequence, with `planet |> Seq.head`. Finally, we use the constructor `ExoPlanetType.Row` to build the new row.

> Let us clarify some possible confusion about the names used for each field in the Type Provider. For example, let us take the planet name. The header of the column is `pl_name`. This is translated to the field `Pl_name` in the type created by the provider, which can be accessed with the `.Pl_name` notation. But, to create a new row of the type, the constructor argument `plName = planetData.Pl_name` is used. Fortunately, the F# compiler always helps us: just remember to hover with the mouse over `ExoPlanetType.Row` to see how to map each field of the type in the constructor.

> The point of this notebook is to try to use the type created by the Provider as much as possible. However, one can avoid the preceding caveats by creating our own type and transforming the `ExoPlanetType.Row` into it. That will depend on what the data will be used for; we will see an example in the second part of this episode.
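
As a rough idea of what the "own type" alternative could look like, here is a sketch with a plain F# record. It is not the version used in the second part. The record `Planet`, its field names, and the functions `averageSome` and `collapseRecords` are all hypothetical, and the entries are made-up values.

```fsharp
// Hypothetical record; the field names are illustrative only
type Planet =
    { Name: string
      Mass: decimal option
      OrbitalPeriod: decimal option }

// Average only the measurements that are present
let averageSome (xs: decimal option seq) =
    let values = xs |> Seq.choose id |> Seq.toList
    if List.isEmpty values then None else Some (List.average values)

// Collapse several entries of one planet into a single record:
// copy the first entry and replace the numeric fields with averages
let collapseRecords (entries: Planet list) =
    let first = List.head entries
    { first with
        Mass = entries |> Seq.map (fun p -> p.Mass) |> averageSome
        OrbitalPeriod = entries |> Seq.map (fun p -> p.OrbitalPeriod) |> averageSome }

// Made-up entries for a single planet
let entries =
    [ { Name = "X b"; Mass = Some 2.0m; OrbitalPeriod = Some 10.0m }
      { Name = "X b"; Mass = None;      OrbitalPeriod = Some 12.0m }
      { Name = "X b"; Mass = Some 4.0m; OrbitalPeriod = None } ]

printfn "%A" (collapseRecords entries)
```

With a record, the copy-and-update expression `{ first with ... }` replaces the long constructor call, and the field names are the same when reading and when building a value.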

Let us try `collapse` on `exoWithMaxEntries` and see what we get:


In [23]:
collapse exoWithMaxEntries

Now we can go back to our full list of (possibly repeated) planets, group them and collapse them into one entry per named planet:

In [24]:
let planets = 
    confirmed
    |> Seq.groupBy (fun p -> p.Pl_name)
    |> Seq.map (fun (name, entries) -> collapse entries)
    

In [25]:
planets |> Seq.length

Good! Now we have the correct number of planets! We can even use the Type Provider to write the new, consolidated data into a file:

In [26]:
open System.IO // needed for File.WriteAllText

let myCsv = new ExoPlanetType(planets)
let file = myCsv.SaveToString()
File.WriteAllText("../data/consolidatedExoplanets.csv", file)

Wonderful, we have curated our input list of exoplanets, using only the Type Provider and some helpful functions! In the next part, we will read the consolidated data and plot it...