# ExoStat Lab 12: Mass-Radius Relation of Exoplanets

**Administrative details:**

- This lab will not be turned in for credit 
- Data are downloaded from the NASA Exoplanet Archive (https://exoplanetarchive.ipac.caltech.edu)
- Collaborating on the ExoStat Labs is encouraged. If you get stuck for a while on a question, feel free to ask a neighbor or come to the instructor's or TF's office hours for additional help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) Please don't just share answers, though.

This term we will be using Piazza for class discussion. Find our class page [here](https://piazza.com/yale/spring2019/sds170/home).

You can read more about course policies on our [Canvas site](https://canvas.yale.edu).

#### Today's ExoStat Lab

#### Power-law Model

<center>$M = CR^\gamma \Rightarrow \log_{10} M = \log_{10} C + \gamma \log_{10} R$</center>


Step 1: Estimate the power-law model using linear regression

Step 2: Classify the data into three clusters using KMeans

Step 3: Fit three separate linear regressions, one for each cluster

Step 4: Compare the results from Step 1 and Step 3
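
To see why the log transform turns the power-law fit into a linear regression, here is a minimal sketch on synthetic, noise-free data. The values of `C_true` and `gamma_true` are made up for illustration, and `np.polyfit` stands in for the regression functions used later in the lab.

```python
import numpy as np

# Made-up power-law parameters, for illustration only
C_true, gamma_true = 2.5, 1.8
radius = np.linspace(0.5, 10.0, 50)
mass = C_true * radius ** gamma_true  # M = C * R**gamma

# In log space the relation is a straight line:
# log10(M) = log10(C) + gamma * log10(R)
log_r = np.log10(radius)
log_m = np.log10(mass)

# A degree-1 polynomial fit returns (slope, intercept)
gamma_hat, log_c_hat = np.polyfit(log_r, log_m, 1)
C_hat = 10 ** log_c_hat  # undo the log to recover C

print(gamma_hat, C_hat)
```

The slope of the fitted line estimates $\gamma$, and 10 raised to the intercept estimates $C$.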


Let's begin by running the cell below.

In [None]:
# Run this cell, but please don't change it.
import numpy as np
from datascience import *
from sklearn.cluster import KMeans

# These lines do some fancy plotting magic
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

## Step 0: Look at Data

In [None]:
kepler_data = Table.read_table("MR-kepler.csv")   # read data

In [None]:
kepler_data.show(4)  # print data

In [None]:
kepler_data.scatter(5, 4)  # plot mass and radius
plots.xlabel('Radius(R_Earth)')
plots.ylabel('Mass(M_Earth)')

**Question**: Do you think a power law is a reasonable assumption for this dataset?

[Add your response here]

In [None]:
# convert data into log scale
# copy first so that kepler_data keeps its original (unlogged) values
log_kepler_data = kepler_data.copy()
log_kepler_data['st_mass'][:] = np.log10(kepler_data['st_mass'][:])
log_kepler_data['st_rad'][:] = np.log10(kepler_data['st_rad'][:])

In [None]:
log_kepler_data.scatter('st_rad', 'st_mass')
plots.xlabel('log_10 Radius(R_Earth)')
plots.ylabel('log_10 Mass(M_Earth)')

**Question**: Do you think the relation between $\log_{10} R$ and $\log_{10}M$ is linear?

[Add your response here]

## Part 1: Fit a regression line

In [None]:
# functions to fit the regression line
def standard_units(arr):
    return (arr - np.average(arr))/np.std(arr)

def correlation(t, x, y):
    x_standard = standard_units(t.column(x))
    y_standard = standard_units(t.column(y))
    return np.average(x_standard * y_standard)

def slope(t, x, y):
    r = correlation(t, x, y)
    y_sd = np.std(t.column(y))
    x_sd = np.std(t.column(x))
    return r * y_sd / x_sd

def intercept(t, x, y):
    x_mean = np.mean(t.column(x))
    y_mean = np.mean(t.column(y))
    return y_mean - slope(t, x, y)*x_mean

def fitted_values(t, x, y):
    """Return an array of the regression estimates at all the x values"""
    a = slope(t, x, y)
    b = intercept(t, x, y)
    return a*t.column(x) + b

def predict_y(t, x, y, x_val):
    """
    Predict y at x_val using nearest neighbors: average the y values
    of the rows in table t whose x value lies within 0.25 of x_val
    """
    nearby_points = t.where(x, are.between(x_val - 0.25, x_val + 0.25))
    return np.mean(nearby_points.column(y))

#### Obtain regression slope and intercept of the log-transformed data.
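
As a sanity check on the formulas used by `slope` and `intercept`, the sketch below compares them with NumPy's least-squares fit. It uses plain arrays of synthetic data (the numbers are invented) rather than the lab's Table, so the variable names here are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for log10 R (x) and log10 M (y)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.5, size=200)
y = 0.4 + 1.7 * x + rng.normal(0, 0.3, size=200)

def standard_units(arr):
    return (arr - np.mean(arr)) / np.std(arr)

# Slope and intercept from the correlation coefficient
r = np.mean(standard_units(x) * standard_units(y))
slope_from_r = r * np.std(y) / np.std(x)
intercept_from_r = np.mean(y) - slope_from_r * np.mean(x)

# Least-squares line from NumPy, for comparison
slope_ls, intercept_ls = np.polyfit(x, y, 1)

print(slope_from_r, slope_ls)
print(intercept_from_r, intercept_ls)
```

The two approaches give the same line up to floating-point rounding, because the correlation-based formulas are the least-squares solution.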

In [None]:
regression_slope = ...
regression_intercept = ...
(regression_slope, regression_intercept)

#### Obtain predicted value and errors
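
The sketch below illustrates what the errors (residuals) of a least-squares line look like. It uses synthetic arrays and `np.polyfit`, both chosen for illustration. For a line fitted with an intercept, the residuals average to zero and are uncorrelated with x, so any visible pattern in a residual plot points to a problem with the linear model rather than with the arithmetic.

```python
import numpy as np

# Synthetic data: a straight line plus noise (values are invented)
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=100)

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept
errors = y - predicted  # actual minus predicted

mean_error = np.mean(errors)
corr_with_x = np.corrcoef(x, errors)[0, 1]
print(mean_error, corr_with_x)
```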

In [None]:
# obtain the predicted values, the actual values, and the errors
predicted = ...
actual = ...
errors = ...

#### Make a residual plot

In [None]:
...

**Question:** Do you think the regression line is a good fit? Why or why not?

[Add your response here]

#### Root Mean Square Error
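
A minimal sketch of the RMSE calculation on a small set of made-up errors. Because the regression is fitted to $\log_{10} M$, the RMSE is in $\log_{10}$ units, and `10 ** rmse` converts it to a typical multiplicative error in mass.

```python
import numpy as np

# Made-up residuals in log10(M), for illustration only
errors = np.array([0.5, -0.5, 1.0, -1.0])

# Square, average, then take the square root
rmse = np.sqrt(np.mean(errors ** 2))

# Typical multiplicative error in mass implied by a log10 RMSE
factor = 10 ** rmse
print(rmse, factor)
```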

In [None]:
# calculate the root mean square error
rmse_1 = ...
rmse_1

## Part 2: Clustering using K-means (with 3 clusters)

We discussed the k-means clustering method during lecture.  To implement it, we will use `KMeans` in the Python module `sklearn.cluster`. (This was loaded in the setup cell above.)  You can learn more about `KMeans` [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

To use `KMeans`, the data need to be in a NumPy array.  The code below takes the log mass and log radius columns of our `log_kepler_data` table and combines them into a 2D array with one row per planet.  Use `data_array` in `KMeans`.
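
As a small illustration of the array layout, the sketch below stacks two made-up 1D arrays into a 2D array with one row per data point and one column per variable. The array names are hypothetical.

```python
import numpy as np

# Made-up values standing in for two table columns
log_mass = np.array([0.1, 0.5, 1.2, 2.0])
log_radius = np.array([0.0, 0.2, 0.6, 1.0])

# Each input array becomes one column of the result
X = np.column_stack((log_mass, log_radius))
print(X.shape)  # (4, 2)
```

Column 0 holds the first array and column 1 the second, which is why later cells index the array with `[:, 0]` and `[:, 1]`.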

In [None]:
# column 0 of data_array is log mass, column 1 is log radius
data_array = np.column_stack((log_kepler_data.column(4), log_kepler_data.column(5)))
print(type(data_array))
print(data_array.shape)

Run the k-means clustering method with $k = 3$.
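
To make the idea behind k-means concrete, here is a minimal NumPy sketch of the alternating "assign points, then update centers" loop on synthetic data. It is not the `sklearn` implementation: the function `kmeans_sketch` is hypothetical, and the deterministic farthest-point initialization is an assumption made so the example is reproducible.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=50):
    """Illustrative k-means loop (not sklearn's KMeans)."""
    # Assumed initialization: start from the first point, then
    # repeatedly add the point farthest from the chosen centers
    centers = [X[0]]
    for _ in range(1, k):
        chosen = np.array(centers)
        d = np.linalg.norm(X[:, None, :] - chosen[None, :, :], axis=2)
        centers.append(X[np.argmax(d.min(axis=1))])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: label each point with its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: move each center to the mean of its points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Three well-separated synthetic blobs of 40 points each
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.3, size=(40, 2)) for c in blob_centers])

labels, centers = kmeans_sketch(X, k=3)
print(np.bincount(labels))
```

On real data the clusters are rarely this well separated, and the result can depend on the starting centers.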

In [None]:
nclust = ...
kmeans = ...
kmeans

In [None]:
# Cluster label (0, 1, or 2) assigned to each data point
kmeans_labels = kmeans.labels_
kmeans_labels

**Question:** Make a scatter plot of the log mass and log radius, using three different colors to distinguish data points from different clusters. Do you think the result is reasonable?

[Add your response here]

In [None]:
plots.scatter(data_array[kmeans_labels == 0,1], data_array[kmeans_labels == 0,0], s=25, c='blue', label='Cluster 1')
plots.scatter(data_array[kmeans_labels == 1,1], data_array[kmeans_labels == 1,0], s=25, c='green', label='Cluster 2')
plots.scatter(data_array[kmeans_labels == 2,1], data_array[kmeans_labels == 2,0], s=25, c='brown', label='Cluster 3')
plots.xlabel('log_10 Radius(R_Earth)')
plots.ylabel('log_10 Mass(M_Earth)')
plots.legend()  # show which color corresponds to which cluster

## Part 3: Fit three separate regression lines

Based on the k-means results in Part 2, we now split the data into three datasets.
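
The split relies on boolean masks. The sketch below shows the mechanics on a tiny made-up array with hypothetical labels, and also counts the points in each cluster, which is worth checking because a regression fitted to very few points is unreliable.

```python
import numpy as np

# Made-up (log mass, log radius) rows and cluster labels
data = np.array([[0.1, 0.0],
                 [1.5, 0.6],
                 [0.2, 0.1],
                 [2.5, 1.0],
                 [1.4, 0.5]])
labels = np.array([0, 1, 0, 2, 1])

# labels == j is a boolean mask that selects the rows of cluster j
group0 = data[labels == 0, :]
group1 = data[labels == 1, :]
group2 = data[labels == 2, :]

# Number of points in each cluster
sizes = np.bincount(labels)
print(sizes)  # [2 2 1]
```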

In [None]:
data_cluster1 = data_array[kmeans_labels == 0,:]
data_cluster2 = data_array[kmeans_labels == 1,:]
data_cluster3 = data_array[kmeans_labels == 2,:]

Recall that the data are stored as arrays. To use the regression functions from Part 1, we need to convert each array to a Table. The code below converts the arrays to Tables.

In [None]:
# Convert array to Table
log_kepler_data_c1 = Table().with_columns("st_mass", data_cluster1[:, 0],
                                         "st_rad", data_cluster1[:, 1])

log_kepler_data_c2 = Table().with_columns("st_mass", data_cluster2[:, 0],
                                         "st_rad", data_cluster2[:, 1])

log_kepler_data_c3 = Table().with_columns("st_mass", data_cluster3[:, 0],
                                         "st_rad", data_cluster3[:, 1])

#### Fit three regression lines separately to three datasets

For each dataset, obtain the predicted values and errors. Plot the errors from all three datasets on a single plot, with $\log_{10} R$ on the x-axis and the errors on the y-axis. Compare this plot with the one in Part 1. Which one looks better?
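
One way to organize the per-cluster fits is a loop over the cluster labels. The sketch below uses a synthetic piecewise-linear dataset with invented slopes and `np.polyfit` in place of the lab's regression functions. The residuals are written back into a single array in the original order, so they can be plotted against x in one plot.

```python
import numpy as np

# Synthetic data: three line segments with different slopes
x = np.linspace(0.0, 3.0, 31)
labels = np.where(x < 1.0, 0, np.where(x < 2.0, 1, 2))
true_slopes = np.array([0.5, 2.0, 0.1])
true_intercepts = np.array([0.0, -1.5, 2.3])
y = true_slopes[labels] * x + true_intercepts[labels]

fits = {}
residuals = np.empty_like(y)
for j in range(3):
    mask = labels == j
    # Fit a separate line to the points in cluster j
    slope_j, intercept_j = np.polyfit(x[mask], y[mask], 1)
    fits[j] = (slope_j, intercept_j)
    # Store this cluster's errors at their original positions
    residuals[mask] = y[mask] - (slope_j * x[mask] + intercept_j)

print(fits)
```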

In [None]:
#First get the slope and intercept for each dataset
regression_slope_c1 = ...
regression_intercept_c1 = ...

regression_slope_c2 = ...
regression_intercept_c2 = ...

regression_slope_c3 = ...
regression_intercept_c3 = ...

In [None]:
#Calculate the predicted values
predicted_c1 = ...
predicted_c2 = ...
predicted_c3 = ...

In [None]:
#Now get the errors
actual_c1 = ...
errors_c1 = actual_c1 - predicted_c1

actual_c2 = ...
errors_c2 = actual_c2 - predicted_c2

actual_c3 = ...
errors_c3 = actual_c3 - predicted_c3

To make the residual plot, we can combine the three datasets into one large dataset using `.append()`. See the example below:

In [None]:
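# copy first: .append() modifies a table in place, so this keeps log_kepler_data_c1 unchanged
data_new = log_kepler_data_c1.copy()
data_new.append(log_kepler_data_c2)
data_new.append(log_kepler_data_c3)
data_new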

In [None]:
# plot the errors against log_10 R and compare with the residual plot in Part 1


#### Root Mean Square Error

To compare the results with part 1, let's calculate the root mean square errors. 

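To calculate the root mean square error for the whole dataset, pool the errors from the three datasets: sum the squared errors across all three datasets, divide by the total number of data points, and take the square root. Equivalently, average the three mean square errors weighted by the number of points in each dataset, then take the square root. Simply adding the three mean square errors would overstate the error and would not be comparable to the RMSE from Part 1.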
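
A minimal sketch of combining errors from several groups into one RMSE that is comparable to a single-model RMSE. The error values are made up. Pooling all squared errors gives the same answer as a size-weighted average of the per-group mean square errors, whereas an unweighted sum of the mean square errors gives a larger number.

```python
import numpy as np

# Made-up errors for three groups of different sizes
errors_a = np.array([0.1, -0.2, 0.3])
errors_b = np.array([0.5, -0.5])
errors_c = np.array([0.05, -0.05, 0.1, -0.1, 0.0])
groups = [errors_a, errors_b, errors_c]

# Pooled RMSE: treat all errors as one dataset
all_errors = np.concatenate(groups)
rmse_pooled = np.sqrt(np.mean(all_errors ** 2))

# Same value from per-group MSEs weighted by group size
sizes = np.array([len(e) for e in groups])
mses = np.array([np.mean(e ** 2) for e in groups])
rmse_weighted = np.sqrt(np.sum(sizes * mses) / np.sum(sizes))

# Unweighted sum of MSEs, shown only for contrast
rmse_unweighted_sum = np.sqrt(np.sum(mses))

print(rmse_pooled, rmse_weighted, rmse_unweighted_sum)
```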

In [None]:
...

### Which method is better (Part 1 or Part 3)?

Compare the results from Part 1 and Part 3. Which results look better?

- Plot the best-fit regression lines

Which results would you use in practice? Why?
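
When interpreting a fitted line, it can help to convert it back to the original power-law form. The sketch below uses a made-up slope and intercept and shows that the line in log space and the power law $M = CR^\gamma$ with $C = 10^{\text{intercept}}$ describe the same curve.

```python
import numpy as np

# Hypothetical fitted values in log10 space, for illustration only
slope, intercept = 1.5, 0.3

radius = np.logspace(-0.5, 1.5, 5)  # radii in original (linear) units

# Power-law form: gamma is the slope, C is 10**intercept
C = 10 ** intercept
mass_from_power_law = C * radius ** slope

# Same prediction computed from the line in log space
mass_from_log_line = 10 ** (intercept + slope * np.log10(radius))

print(mass_from_power_law)
print(mass_from_log_line)
```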


In [None]:
# Plot the best fit regression line in Part 1
...

In [None]:
# Plot the Kmeans best fit regression lines in Part 3 in a single plot
...