# Exoplanet Hunting in Deep Space
A machine learning project on preprocessed data from the NASA Kepler space telescope. The dataset is hosted on Kaggle: https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data
<br>
The data is labeled and cleaned. Each row represents one solar system, and the columns hold the flux emitted by that system over time. The data gives no information about the time between two measurements. The `LABEL` column indicates whether exoplanets are present in the system: 1 means there are no confirmed exoplanets, 2 means there is at least one confirmed exoplanet in the system.

Trainset:

- 5087 rows or observations.
- 3198 columns: 1 label column and 3197 flux features.
- Column 1 is the label vector. Columns 2 to 3198 are the flux values over time.
- 37 confirmed exoplanet-stars and 5050 non-exoplanet-stars.

Testset:

- 570 rows or observations.
- 3198 columns: 1 label column and 3197 flux features.
- Column 1 is the label vector. Columns 2 to 3198 are the flux values over time.
- 5 confirmed exoplanet-stars and 565 non-exoplanet-stars.
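
The layout described above (one label column followed by flux columns) can be illustrated on a small toy frame. This is only a sketch: the toy values and the helper name `split_label_and_flux` are assumptions made for illustration and are not part of the dataset or the original notebook.

```python
import pandas as pd

def split_label_and_flux(df):
    """Return (has_exoplanet, flux) for a frame laid out like the Kepler CSV files."""
    # LABEL uses 1 = no confirmed exoplanet, 2 = at least one confirmed exoplanet
    has_exoplanet = df["LABEL"] == 2
    flux = df.drop(columns="LABEL")
    return has_exoplanet, flux

# Toy stand-in data (made-up values, for illustration only)
toy = pd.DataFrame({
    "LABEL": [2, 1, 1],
    "FLUX.1": [1.0, 2.0, 3.0],
    "FLUX.2": [1.5, 2.5, 3.5],
})

has_exoplanet, flux = split_label_and_flux(toy)
n_pos = int(has_exoplanet.sum())     # systems with a confirmed exoplanet
n_neg = int((~has_exoplanet).sum())  # systems without one
print(f"exoplanet systems: {n_pos}, non-exoplanet systems: {n_neg}")
```

Applied to the real training file, the same two counts should reproduce the class sizes listed above.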

In [53]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## Preprocessing
### Loading Data and first inspection

In [3]:
#read training data
df_train = pd.read_csv('./exoTrain.csv')
display(df_train.head(3))
display(df_train.info())

| | LABEL | FLUX.1 | FLUX.2 | FLUX.3 | FLUX.4 | FLUX.5 | FLUX.6 | FLUX.7 | FLUX.8 | FLUX.9 | ... | FLUX.3188 | FLUX.3189 | FLUX.3190 | FLUX.3191 | FLUX.3192 | FLUX.3193 | FLUX.3194 | FLUX.3195 | FLUX.3196 | FLUX.3197 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 93.85 | 83.81 | 20.1 | -26.98 | -39.56 | -124.71 | -135.18 | -96.27 | -79.89 | ... | -78.07 | -102.15 | -102.15 | 25.13 | 48.57 | 92.54 | 39.32 | 61.42 | 5.08 | -39.54 |
| 1 | 2 | -38.88 | -33.83 | -58.54 | -40.09 | -79.31 | -72.81 | -86.55 | -85.33 | -83.97 | ... | -3.28 | -32.21 | -32.21 | -24.89 | -4.86 | 0.76 | -11.7 | 6.46 | 16.0 | 19.93 |
| 2 | 2 | 532.64 | 535.92 | 513.73 | 496.92 | 456.45 | 466.0 | 464.5 | 486.39 | 436.56 | ... | -71.69 | 13.31 | 13.31 | -29.89 | -20.88 | 5.06 | -11.8 | -28.91 | -70.02 | -96.67 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5087 entries, 0 to 5086
Columns: 3198 entries, LABEL to FLUX.3197
dtypes: float64(3197), int64(1)
memory usage: 124.1 MB


None

Flux values are float64; only the label is int64.
- Transform the label into bool

In [15]:
#Check if there are really only two labels
print(df_train['LABEL'].unique())

[2 1]


In [19]:
#transform into bool (minus 1 to give 0 and 1 as labels)
df_train['LABEL'] = (df_train['LABEL'] - 1).astype('bool')
df_train['LABEL'].unique()

array([ True, False])

### Missing values

In [21]:
#Are there any missing values at all?
print(f"There are: {df_train.isna().sum().sum()} missing values")

There are: 0 missing values


### Value range and outliers

In [52]:
#print min / max / mean / median values
#note: df_train[1:] slices rows (it skips the first system), so the LABEL column is still part of these statistics
print(f"Min    flux value: {df_train.min().min()}")
print(f"Max    flux value:  {df_train.max().max()}")
print(f"Mean   flux value:  {df_train[1:].mean().mean()}")
print(f"Median flux value: {df_train[1:].median().median()}")

Min    flux value: -2385019.12
Max    flux value:  4299288.0
Mean   flux value:  130.39963966677414
Median flux value: -0.2549999999999387
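
In pandas, `df_train[1:]` selects rows from position 1 onward rather than columns, so the statistics above still include the label column. A minimal sketch of restricting the summary to the flux columns is shown here on a toy frame; the values are made up, and selecting columns with `filter(like="FLUX")` is one possible choice, not necessarily what the notebook intended.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train (made-up values): one label column, three flux columns
toy = pd.DataFrame({
    "LABEL": [True, False, False],
    "FLUX.1": [10.0, -4.0, 0.5],
    "FLUX.2": [12.0, -6.0, 0.0],
    "FLUX.3": [-200.0, 3.0, 1.5],
})

# Select columns by name so the label never enters the statistics
flux = toy.filter(like="FLUX")
values = flux.to_numpy()

flux_min = values.min()
flux_max = values.max()
flux_mean = values.mean()        # mean over all individual measurements
flux_median = np.median(values)  # median over all individual measurements

print(f"Min: {flux_min}, Max: {flux_max}, Mean: {flux_mean:.2f}, Median: {flux_median}")
```

Unlike `.median().median()`, which takes the median of the per-column medians, this sketch computes each statistic over all measurements at once.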


- Wide range of values
    - the data needs normalisation
- If the model performs very badly, consider removing outlier systems
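
One possible normalisation, sketched below, standardises each system's light curve separately so that every row has zero mean and unit standard deviation. The notebook does not say which scheme is used, so the per-row approach, the helper name `standardize_rows`, and the toy values are all assumptions.

```python
import numpy as np
import pandas as pd

def standardize_rows(flux):
    """Scale each row (one light curve) to zero mean and unit standard deviation."""
    mean = flux.mean(axis=1)
    std = flux.std(axis=1, ddof=0)
    # Guard against constant rows, which would otherwise cause a division by zero
    std = std.replace(0, 1.0)
    return flux.sub(mean, axis=0).div(std, axis=0)

# Toy flux values (made up): two rows on very different scales and one constant row
toy_flux = pd.DataFrame(
    [[100.0, 200.0, 300.0], [-1.0, 0.0, 1.0], [5.0, 5.0, 5.0]],
    columns=["FLUX.1", "FLUX.2", "FLUX.3"],
)

scaled = standardize_rows(toy_flux)
print(scaled.round(3))
```

After scaling, the first two rows become identical even though their raw values differ by two orders of magnitude, which is the effect normalisation is meant to achieve here.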

### Plot of the first four systems

In [96]:
#Transpose data
df_train_trans_data = df_train.drop(columns=(['LABEL'])).T

In [97]:
#extract labels
df_train_label = df_train['LABEL']

In [105]:
#plot the first four systems (columns 0 to 3 after transposing) in one figure
#reassigning fig for each system would display only the last one
fig = px.line(df_train_trans_data[[0, 1, 2, 3]])
fig.show()
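
To compare a star with a confirmed exoplanet against one without, one system per class can be drawn at random instead of using fixed indices. The following is a minimal sketch on toy data; the helper name `sample_one_per_class`, the seed, and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def sample_one_per_class(flux, labels, seed=0):
    """Pick one random row per label value and return them as columns for plotting."""
    rng = np.random.default_rng(seed)
    picked = [rng.choice(labels.index[labels == value]) for value in (True, False)]
    # Transpose so each selected system becomes one column (one line in a plot)
    curves = flux.loc[picked].T
    curves.columns = [f"system {i} (exoplanet={labels[i]})" for i in picked]
    return curves

# Toy stand-ins for the flux values and the boolean label vector (made-up values)
toy_flux = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [0.0, 0.5, 1.0]],
    columns=["FLUX.1", "FLUX.2", "FLUX.3"],
)
toy_labels = pd.Series([True, False, False, True], name="LABEL")

curves = sample_one_per_class(toy_flux, toy_labels)
print(curves)
```

The resulting frame has one column per selected system, so it can be passed to `px.line` in the same way as the transposed data above.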