## Exoplanet Clustering 

### Set up environment 

In [1]:
# set up environment 
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans # kmeans
from sklearn.cluster import DBSCAN # dbscan
from scipy.cluster.hierarchy import dendrogram, linkage # hierarchical clustering

### Read in data

In [2]:
# read in data 
planets = pd.read_csv('exoplanets.csv', delimiter='\t')

# preview sample data 
planets.sample(10)

Unnamed: 0,NAME,LIGHT-YEARS FROM EARTH,PLANET MASS,STELLAR MAGNITUDE,DISCOVERY DATE
107,HIP 105854 b,257.0,8.2 Jupiters,5.64,2014
82,HD 219134 h,21.0,0.34 Jupiters,5.56911,2015
143,MOA-2010-BLG-353L b,20975.0,0.27 Jupiters,,2015
132,KMT-2018-BLG-1990L b,3154.0,0.348 Jupiters,,2019
214,PSR B1257+12 b,1957.0,0.02 Earths,,1994
187,OGLE-2015-BLG-1670L b,21855.0,17.9 Earths,,2019
191,OGLE-2016-BLG-1067L b,12167.0,0.43 Jupiters,,2019
192,OGLE-2016-BLG-1190L b,22084.0,13.38 Jupiters,,2017
154,MOA-2015-BLG-337L b,23160.0,33.7 Earths,,2018
126,KMT-2016-BLG-2397L b,15919.0,2.63 Jupiters,,2020


### Data Exploration 

Now that our data has been imported, we can do some data exploration. First, let's explore our columns and dtypes:

In [3]:
planets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   NAME                    234 non-null    object 
 1   LIGHT-YEARS FROM EARTH  230 non-null    float64
 2   PLANET MASS             234 non-null    object 
 3   STELLAR MAGNITUDE       138 non-null    float64
 4   DISCOVERY DATE          234 non-null    int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 9.3+ KB


As we can see from this and the sample above, the 'PLANET MASS' column is stored as text (dtype 'object') because each value carries a unit, either 'Earths' or 'Jupiters'. Let's count how many rows use each unit:

In [4]:
# earths 
planets['PLANET MASS'].str.match(r'(.*Earths$)').sum()

56

In [5]:
# jupiters 
planets['PLANET MASS'].str.match(r'(.*Jupiters$)').sum()

178

The two counts add up to 56 + 178 = 234, so together they cover every row in our dataframe. Now we need to normalize the column so that all values share the same unit.
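The coverage check can also be done explicitly rather than by adding the counts by hand. The following is a minimal sketch on a small stand-in Series (the values are copied from the sample rows above; `mass`, `is_earth`, and `is_jupiter` are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Small stand-in for planets['PLANET MASS']; values taken from the sample rows above.
mass = pd.Series(['8.2 Jupiters', '0.34 Jupiters', '0.02 Earths', '17.9 Earths'])

is_earth = mass.str.endswith('Earths')
is_jupiter = mass.str.endswith('Jupiters')

# Every row should carry exactly one of the two units.
n_unmatched = int((~(is_earth | is_jupiter)).sum())
n_both = int((is_earth & is_jupiter).sum())
print(int(is_earth.sum()), int(is_jupiter.sum()), n_unmatched, n_both)
```

If `n_unmatched` were nonzero on the real column, some rows would use a third unit (or no unit) and would need separate handling before conversion.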

First, let's store a boolean mask marking the rows measured in Earths:

In [6]:
# earth indices 
eindex = planets['PLANET MASS'].str.match(r'(.*Earths$)')
eindex 

0      False
1      False
2      False
3      False
4      False
       ...  
229    False
230    False
231    False
232    False
233    False
Name: PLANET MASS, Length: 234, dtype: bool

With our Earth indices saved, we can remove 'Earths' and 'Jupiters' from the 'PLANET MASS' column. We also need to change its dtype from 'object' to 'float'.

In [7]:
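# remove the unit label (and the space before it) from the column
# assign the result back rather than using a chained inplace=True, which newer pandas versions do not apply to the dataframe
planets['PLANET MASS'] = planets['PLANET MASS'].str.replace(r'\s*[A-Z][a-z]+$', '', regex=True)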

# to numeric 
planets['PLANET MASS'] = pd.to_numeric(planets['PLANET MASS'])

# check for string removal 
planets 

Unnamed: 0,NAME,LIGHT-YEARS FROM EARTH,PLANET MASS,STELLAR MAGNITUDE,DISCOVERY DATE
0,11 Comae Berenices b,304.0,19.4000,4.72307,2007
1,11 Ursae Minoris b,409.0,14.7400,5.01300,2009
2,14 Andromedae b,246.0,4.8000,5.23133,2008
3,18 Delphini b,249.0,10.3000,5.51048,2008
4,24 Bootis b,313.0,0.9100,5.59000,2018
...,...,...,...,...,...
229,Upsilon Andromedae b,44.0,0.6876,4.09565,1996
230,Upsilon Andromedae c,44.0,1.9810,4.09565,1999
231,Upsilon Andromedae d,44.0,4.1320,4.09565,1999
232,WISEP J121756.91+162640.2 A b,33.0,22.0000,,2012


In [8]:
# check for float64 dtype 
planets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   NAME                    234 non-null    object 
 1   LIGHT-YEARS FROM EARTH  230 non-null    float64
 2   PLANET MASS             234 non-null    float64
 3   STELLAR MAGNITUDE       138 non-null    float64
 4   DISCOVERY DATE          234 non-null    int64  
dtypes: float64(3), int64(1), object(1)
memory usage: 9.3+ KB


Now we need to convert the values measured in Earth masses to Jupiter masses, using:

- Earth mass = 5.972 x 10^24 kg
- Jupiter mass = 1.898 x 10^27 kg

In [9]:
# earths per jupiter 
(1.898 * 10**27) / (5.972 * 10**24)

317.8164768921634

From our calculation, one Jupiter has the mass of about 317.816 Earths. To convert Earth masses to Jupiter masses, all we need to do is divide the Earth values by 317.816:

In [10]:
# earths to jupiters conversion 
planets.loc[eindex, 'PLANET MASS'] = planets.loc[eindex, 'PLANET MASS'] / 317.816

In [11]:
# check 
planets.loc[eindex][:10]

Unnamed: 0,NAME,LIGHT-YEARS FROM EARTH,PLANET MASS,STELLAR MAGNITUDE,DISCOVERY DATE
15,55 Cancri e,41.0,0.02514,5.95084,2004
17,61 Virginis b,28.0,0.016047,4.6955,2009
18,61 Virginis c,28.0,0.057266,4.6955,2009
19,61 Virginis d,28.0,0.072054,4.6955,2009
43,HD 102365 b,30.0,0.050344,4.89,2010
51,HD 136352 b,48.0,0.014537,5.65,2019
52,HD 136352 c,48.0,0.035524,5.65,2019
53,HD 136352 d,48.0,0.033038,5.65,2019
61,HD 160691 d,51.0,0.03321,5.12,2004
63,HD 16417 b,83.0,0.069537,5.78,2008
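The strip, convert, and rescale steps above can also be folded into a single helper. This is only a sketch under assumptions: `to_jupiter_masses` and the constant names are hypothetical, and it assumes every value looks like `'<number> Earths'` or `'<number> Jupiters'`, as in the sample rows.

```python
import pandas as pd

# Masses in kg, as quoted above.
EARTH_KG = 5.972e24
JUPITER_KG = 1.898e27
EARTHS_PER_JUPITER = JUPITER_KG / EARTH_KG  # about 317.8


def to_jupiter_masses(raw: pd.Series) -> pd.Series:
    """Convert strings like '8.2 Jupiters' or '17.9 Earths' to floats in Jupiter masses."""
    parts = raw.str.extract(r'^\s*(?P<value>[\d.]+)\s*(?P<unit>Earths|Jupiters)\s*$')
    value = pd.to_numeric(parts['value'])
    # Jupiter rows pass through unchanged; Earth rows are divided by the mass ratio.
    return value.where(parts['unit'] != 'Earths', value / EARTHS_PER_JUPITER)


# Stand-in values copied from the sample rows above.
sample = pd.Series(['8.2 Jupiters', '17.9 Earths', '0.02 Earths'])
converted = to_jupiter_masses(sample)
print(converted)
```

Parsing the unit and the number together avoids having to save a mask before the strings are stripped, and any value that does not match the pattern comes back as NaN instead of being silently treated as Jupiters.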


Before doing any clustering on our data, let's create some initial visualizations. Since our values span several orders of magnitude, we plot both axes on a log scale.

In [12]:
alt.Chart(planets).mark_circle().encode(
    x = alt.X('PLANET MASS', scale=alt.Scale(type='log')), 
    y = alt.Y('LIGHT-YEARS FROM EARTH', scale=alt.Scale(type='log')), 
    tooltip = ['NAME']
    ).properties(title='Planet Mass vs Light-Years from Earth')

### Clustering 

#### KMeans

KMeans works on distances between points, so we first pull our columns of interest, drop rows with missing values, and log transform the data:

In [35]:
# pull columns of interest 
cluster_df = planets[['LIGHT-YEARS FROM EARTH', 'PLANET MASS']].dropna()

# log transform 
cluster_log = pd.DataFrame() 
cluster_log['X'] = cluster_df['LIGHT-YEARS FROM EARTH'].apply(np.log)
cluster_log['Y'] = cluster_df['PLANET MASS'].apply(np.log)

# check new df 
cluster_log.sample(10)

Unnamed: 0,X,Y
56,4.442651,1.667707
146,10.06416,-2.928259
150,9.757826,2.128232
118,4.584967,0.693147
38,4.867534,2.172476
218,8.272315,0.182322
216,7.579168,-4.400496
72,2.995732,-4.768221
99,3.526361,0.832909
140,10.043119,-1.551169
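As a small illustration of the transform, `np.log` (the natural log) can be applied to a whole dataframe at once. The sketch below uses two made-up rows with values copied from the sample table at the top; `demo` and `demo_log` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Two stand-in rows; masses are already in Jupiter masses.
demo = pd.DataFrame({
    'LIGHT-YEARS FROM EARTH': [257.0, 20975.0],
    'PLANET MASS': [8.2, 0.27],
})

# np.log works element-wise on the whole frame; rename to the X/Y names used above.
demo_log = np.log(demo).rename(columns={'LIGHT-YEARS FROM EARTH': 'X', 'PLANET MASS': 'Y'})
print(demo_log)
```

Note that masses below one Jupiter give negative Y values, and a value of exactly zero would map to negative infinity, so it is worth checking for zeros before transforming.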


Now that our data has been log transformed, we can start clustering! From our initial visualization, we can see three distinct clusters, so let's set k = 3.

In [36]:
# create kmeans model  
km = KMeans(n_clusters=3, random_state=4) 

# create column and run model 
cluster_df['kmeans_cluster'] = km.fit_predict(cluster_log)

In [38]:
alt.Chart(cluster_df).mark_circle().encode(
    x = alt.X('PLANET MASS', scale=alt.Scale(type='log')), 
    y = alt.Y('LIGHT-YEARS FROM EARTH', scale=alt.Scale(type='log')), 
    color = alt.Color('kmeans_cluster:N', legend=alt.Legend(title='CLUSTER'))
).properties(title='Planet Mass vs Light-Years from Earth')
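The choice of k = 3 was made by eye. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below is not part of the original analysis: it runs on synthetic blobs from `make_blobs` as a stand-in for `cluster_log`, and the blob centers and spread are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for cluster_log: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [6, -2]],
    cluster_std=0.8,
    random_state=4,
)

# Fit KMeans for a range of k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best-separated clustering under this metric.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

To apply this to the planet data, `X` would be replaced by `cluster_log`. A higher silhouette score means tighter, better-separated clusters, but it is a guide rather than a definitive answer.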

#### DBSCAN

Our KMeans model split our data into three groups. Now let's see how DBSCAN clustering does:

In [39]:
# create dbscan model 
dbscan = DBSCAN(eps=1.0, min_samples=15)

# create column and run model 
cluster_df['dbscan_cluster'] = dbscan.fit_predict(cluster_log) 

In [44]:
alt.Chart(cluster_df).mark_circle().encode(
    x = alt.X('PLANET MASS', scale=alt.Scale(type='log')), 
    y = alt.Y('LIGHT-YEARS FROM EARTH', scale=alt.Scale(type='log')), 
    color = alt.Color('dbscan_cluster:N', legend=alt.Legend(title='CLUSTER'))
).properties(title='Planet Mass vs Light-Years from Earth')

An advantage of DBSCAN over KMeans is that it flags outliers in addition to assigning clusters. As we can see from our visualization, DBSCAN classified the data into four clusters, with the points labeled -1 being the outliers.
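To make the role of the -1 label concrete, the sketch below counts clusters and noise points separately. It uses synthetic data rather than the planet dataframe: the two groups and the three stray points are assumptions for illustration, while `eps` and `min_samples` match the values used above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)

# Two tight synthetic groups plus a few far-away points that should be treated as noise.
group_a = rng.normal(loc=[0, 0], scale=0.2, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.2, size=(50, 2))
stragglers = np.array([[10.0, -10.0], [-10.0, 10.0], [20.0, 20.0]])
X = np.vstack([group_a, group_b, stragglers])

labels = DBSCAN(eps=1.0, min_samples=15).fit_predict(X)

# -1 is DBSCAN's noise label, so it is excluded when counting clusters.
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels) - {-1})
print(n_clusters, n_noise)
```

On the planet data, the same two lines applied to `cluster_df['dbscan_cluster']` would report how many true clusters were found and how many points were set aside as noise.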