# **ML Project: Exoplanet Hunting in Deep Space**
**Alken Rrokaj r0772839, Fatjon Barçi r0732033**

### Motivation:
Exoplanet hunting in deep space is done by tracking a star over several months or years to observe whether there is a regular 'dimming' of the flux (the light intensity). Such periodic dimming is evidence that there may be an orbiting body around the star, such as a planet. The star can then be considered a 'candidate' system for more in-depth observation, for example by a satellite that captures light at a different wavelength, which could strengthen the belief that the candidate can in fact be 'confirmed'. With thousands of stars to screen, a machine learning model is a practical way to make this tedious task feasible.

### Dataset Description: 
[Exoplanet Hunting in Deep Space](https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data)

### Trainset:
* 5087 rows or observations.
* 3198 columns: 1 label column and 3197 flux features. This is a large number of features, so downsampling is worth trying.
* Column 1 is the label vector. Columns 2 - 3198 are the flux values over time.
* 37 confirmed exoplanet-stars and 5050 non-exoplanet-stars.
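
With 3197 flux values per star, one possible way to reduce the feature count is to average adjacent samples into bins. The sketch below is an illustration only and is not used in the rest of this notebook; the helper name `downsample_flux`, the bin size of 10 and the toy array are assumptions made for the example.

```python
import numpy as np

def downsample_flux(flux, factor):
    """Average every `factor` consecutive flux samples (hypothetical helper)."""
    flux = np.asarray(flux, dtype=float)
    n_bins = flux.shape[-1] // factor
    # Drop the trailing samples that do not fill a complete bin
    trimmed = flux[..., :n_bins * factor]
    binned = trimmed.reshape(flux.shape[:-1] + (n_bins, factor))
    return binned.mean(axis=-1)

# Toy stand-in for two stars with 3197 flux samples each (not real data)
toy_flux = np.arange(2 * 3197, dtype=float).reshape(2, 3197)
reduced = downsample_flux(toy_flux, factor=10)
print(reduced.shape)  # (2, 319)
```

Averaging smooths noise but may also flatten very short dips, so the bin size would need to be chosen with the expected transit duration in mind.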

### Testset:
* 570 rows or observations.
* 3198 columns: 1 label column and 3197 flux features.
* Column 1 is the label vector. Columns 2 - 3198 are the flux values over time.
* 5 confirmed exoplanet-stars and 565 non-exoplanet-stars.
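
The class counts above imply that plain accuracy is a weak yardstick for this dataset. As a small worked example, the snippet below computes the accuracy of a trivial classifier that always answers "no exoplanet"; the function name `majority_baseline` is introduced here for illustration.

```python
# Class counts taken from the dataset description above
train_exo, train_non = 37, 5050
test_exo, test_non = 5, 565

def majority_baseline(n_minority, n_majority):
    """Accuracy of a classifier that always predicts the majority class."""
    return n_majority / (n_minority + n_majority)

train_baseline = majority_baseline(train_exo, train_non)
test_baseline = majority_baseline(test_exo, test_non)
print(f"train: {train_baseline:.4f}, test: {test_baseline:.4f}")
# train: 0.9927, test: 0.9912
```

Any model scoring around 99% accuracy is therefore not necessarily better than one that never finds a planet, which is why the classes are balanced later on.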
### References:
Let’s find planets beyond our solar system & milky way … . Available at: https://medium.datadriveninvestor.com/lets-find-planets-beyond-our-solar-system-milky-way-galaxy-with-the-help-of-905dcfc95d3d (Accessed: November 7, 2022). 



## **Importing the Data**

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy
from imblearn.over_sampling import SMOTE 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import os

In [None]:
# read the train data
train_df = pd.read_csv('./exoTrain.csv')
train_df.head()

## **Data Analysis**
Extracting Features

In [None]:
# summary statistics (count, mean, std, quartiles) per star, excluding the label column
train_df.drop(columns='LABEL').T.describe().T.head()

Separate the data into two arrays, based on whether the star has exoplanets or not.

In [None]:
# LABEL 2 marks an exoplanet star; every other row is a non-exoplanet star
flux = train_df.drop(columns='LABEL')
is_exo = train_df['LABEL'] == 2

# one row per star, one column per flux measurement (all rows are kept)
exoplanets = flux[is_exo].to_numpy(dtype=float)
no_exoplanets = flux[~is_exo].to_numpy(dtype=float)

In [None]:
print('no_exoplanets')
pd.DataFrame(no_exoplanets).T.describe().T.head()

In [None]:
print('exoplanets')
pd.DataFrame(exoplanets).T.describe().T.head()

## **Data Visualization**

In [None]:
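def plotTheData(data_name, title, ylimit, xlimit, rang=10):
    """Plot the first `rang` light curves in data_name (rang=0 plots all of them)."""
    plt.figure(figsize=(15,5))
    if rang == 0:
        rang = len(data_name)
    for i in range(min(rang, len(data_name))):
        plt.plot(data_name[i])
    # a limit of 0 means: let matplotlib choose the axis range
    if ylimit != 0:
        plt.ylim([-ylimit, ylimit])
    if xlimit != 0:
        plt.xlim(0, xlimit)
    plt.title(label=title)
    plt.show()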

In [None]:
plotTheData(exoplanets,'Exoplanet Stars\' Light Intensity vs Time', 5000,0,10)

In [None]:
plotTheData(no_exoplanets,'No Exoplanet Stars\' Light Intensity vs Time', 5000,0,20)

Now we normalize each star's light curve to zero mean and unit standard deviation. This puts stars of very different absolute brightness on a comparable scale.

In [None]:
def featureNormalizeArray(X_array):
    normalized_array = []
    for X in X_array:
        normalized_array.append(featureNormalize(X))
    return np.array(normalized_array)
        
# Code Assignment starts here
# EX1. Optional Exercises: 
# 3.1 Feature Normalization     
def featureNormalize(X):
    """
    Normalizes the features in X. Returns a normalized version of X where
    the mean value of each feature is 0 and the standard deviation
    is 1. This is often a good preprocessing step to do when working with
    learning algorithms.
    
    Parameters
    ----------
    X : array_like
        The dataset of shape (m x n).
    
    Returns
    -------
    X_norm : array_like
        The normalized dataset of shape (m x n).
    """
    # You need to set these values correctly
    X_norm = X.copy()
    mu = np.zeros(X.shape)
    sigma = np.zeros(X.shape)

    # =========================== YOUR CODE HERE =====================
    mu = np.mean(X, axis=0)
    X_norm = X - mu

    sigma = np.std(X_norm, axis=0, ddof=1)
    X_norm /= sigma
    # ================================================================
    return X_norm
# Code Assignment stops here

In [None]:
normalized_exo = featureNormalizeArray(exoplanets)
normalized_no_exo = featureNormalizeArray(no_exoplanets)
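
To illustrate why per-star standardization makes light curves comparable, the sketch below applies the same operation as `featureNormalize` (subtract the mean, divide by the sample standard deviation) to two made-up series that differ only in brightness scale. The helper name `standardize` and the toy values are assumptions for the example.

```python
import numpy as np

def standardize(series):
    """Z-score one flux series, mirroring featureNormalize (sample std, ddof=1)."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std(ddof=1)

# Two toy light curves with the same shape but very different brightness (made-up values)
bright = np.array([1000.0, 1010.0, 990.0, 1005.0, 995.0])
faint = bright / 100.0

z_bright = standardize(bright)
z_faint = standardize(faint)

# After standardization the two curves coincide up to floating-point error
print(np.allclose(z_bright, z_faint))
```

Because each star is scaled by its own statistics, the absolute depth of a dip is lost; only its size relative to that star's variability remains.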

In [None]:
ones = np.ones(len(normalized_no_exo), dtype=int)
twos = np.full(len(normalized_exo),2,dtype=int)

first = pd.DataFrame(normalized_exo)
first.insert(0,'LABEL',twos)

second = pd.DataFrame(normalized_no_exo)
second.insert(0,'LABEL',ones)      

normalized_whole = pd.concat([first, second], axis=0, ignore_index=True)

In [None]:
print(first.head())

In [None]:
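# pass the flux rows as an array; indexing the DataFrame with data_name[i] would select columns, not stars
plotTheData(normalized_whole.drop(columns='LABEL').to_numpy(),'Normalized Stars\' Light Intensity vs Time',20,0)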

In [None]:
plotTheData(normalized_exo,'Normalized Exoplanet Stars\' Light Intensity vs Time',20,0)

In [None]:
plotTheData(normalized_no_exo,'Normalized No Exoplanet Stars\' Light Intensity vs Time',20,0)

# **Data Balancing**
In this part of the code we balance the classes using SMOTE (Synthetic Minority Over-sampling Technique).

In [None]:
smote = SMOTE(random_state=69)


First we separate the data and the labels.


In [None]:
x_train = normalized_whole.drop(['LABEL'], axis=1)
y_train = normalized_whole['LABEL']

The next cell applies the SMOTE oversampling technique to the training data stored in x_train and y_train. The fit_resample method fits SMOTE to the training data and generates synthetic minority-class samples, which are combined with the original training data; the oversampled dataset is stored back in x_train and y_train. Finally, because the provided test set is also very imbalanced, the oversampled data is split into a new training set and a new test set.

In [None]:
x_train, y_train = smote.fit_resample(x_train, y_train.to_numpy())
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.3, random_state=1)
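
The core idea behind SMOTE is to create a synthetic minority sample on the line segment between a real minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration of that idea in plain NumPy, not imblearn's implementation; the function name `smote_like_sample`, the neighbor count `k=2` and the toy points are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(minority, k=2, rng=rng):
    """Create one synthetic point between a minority sample and a near neighbor.

    Simplified illustration of the SMOTE idea, not imblearn's implementation.
    """
    minority = np.asarray(minority, dtype=float)
    i = rng.integers(len(minority))
    # Euclidean distances from the chosen sample to every minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Indices of the k nearest neighbors, skipping the sample itself
    neighbors = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbors)
    lam = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

# Four made-up minority points on the corners of the unit square
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(toy_minority)
print(synthetic)  # a point on one edge of the unit square
```

Since every synthetic point is built from real neighbors, splitting after oversampling can put a synthetic point in one split and the real points it was derived from in the other, so scores on that split may look better than they would on unseen stars.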

# **Machine Learning**

## **Logistic Regression**

In [None]:
# Code assignment starts here
def sigmoid(z):
    """
    Compute sigmoid function given the input z.
    
    Parameters
    ----------
    z : array_like
        The input to the sigmoid function. This can be a 1-D vector 
        or a 2-D matrix. 
    
    Returns
    -------
    g : array_like
        The computed sigmoid function. g has the same shape as z, since
        the sigmoid is computed element-wise on z.
    """    
    z = np.array(z)
    
    # You need to return the following variables correctly 
    g = np.zeros(z.shape)

    # ====================== YOUR CODE HERE ======================
    g = 1 / (1 + np.exp(-z))
    # =============================================================
    return g

# Code assignment ends here
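
The direct formula `1 / (1 + np.exp(-z))` gives the right limit for very negative `z`, but `np.exp` overflows on the way and NumPy emits a RuntimeWarning. If those warnings become a nuisance, a piecewise variant such as the sketch below could be used instead; `stable_sigmoid` is a name introduced here and is not part of the assignment code.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that keeps the argument of np.exp non-positive (illustrative variant)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)) so the exponent stays negative
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(stable_sigmoid([-1000.0, 0.0, 1000.0]))
```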

In [None]:
# Code assignment starts here
def costFunction(theta, X, y):
    """
    Compute cost and gradient for logistic regression. 
    
    Parameters
    ----------
    theta : array_like
        The parameters for logistic regression. This is a vector
        of shape (n+1, ).
    
    X : array_like
        The input dataset of shape (m x n+1) where m is the total number
        of data points and n is the number of features. We assume the 
        intercept has already been added to the input.
    
    y : array_like
        Labels for the input. This is a vector of shape (m, ).
    
    Returns
    -------
    J : float
        The computed value for the cost function. 
    
    grad : array_like
        A vector of shape (n+1, ) which is the gradient of the cost
        function with respect to theta, at the current values of theta.
    """
    # Initialize some useful values
    m = y.size  # number of training examples

    # You need to return the following variables correctly 
    J = 0
    grad = np.zeros(theta.shape)

    # ====================== YOUR CODE HERE ======================
    h = sigmoid(X.dot(theta.T))
    
    J = (1 / m) * np.sum(-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h)))
    grad = (1 / m) * (h - y).dot(X)
    
    
    # =============================================================
    return J, grad
# Code assignment ends here
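
For reference, the quantities computed in `costFunction` can be written as follows, with $X$ the $m \times (n+1)$ input matrix, $y$ the label vector and $h$ the vector of predicted probabilities. This is the standard logistic regression formulation, which the code matches term by term. It assumes labels of 0 and 1, whereas the LABEL column of this dataset uses 1 and 2.

```latex
h = \sigma(X\theta), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h^{(i)} - \left(1 - y^{(i)}\right) \log\left(1 - h^{(i)}\right) \right],
\qquad y^{(i)} \in \{0, 1\}

\nabla_\theta J(\theta) = \frac{1}{m} X^{T} (h - y)
```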

In [None]:
# Code assignment starts here
def predict(theta, X):
    """
    Predict the class label using learned logistic regression.
    Computes the predictions for X using a threshold at 0.5 
    (i.e., if sigmoid(theta.T*x) >= 0.5, predict the positive class)
    
    Parameters
    ----------
    theta : array_like
        Parameters for logistic regression. A vector of shape (n+1, ).
    
    X : array_like
        The data to use for computing predictions. The number of rows is
        the number of points to predict, and the number of columns is the
        number of features.

    Returns
    -------
    p : array_like
        Predictions, 1 or 2 for each row in X (0/1 shifted to match the LABEL column).
    
    Instructions
    ------------
    Complete the following code to make predictions using your learned 
    logistic regression parameters. You should set p to a vector of 1's and 2's
    """
    m = X.shape[0] # Number of training examples

    # You need to return the following variables correctly
    p = np.zeros(m)

    # ====================== YOUR CODE HERE ======================

    p = (sigmoid(X.dot(theta.T)) >= 0.5).astype(int)
    
    # === Adjustment to code since our labels are 1 and 2, not 0 and 1 ===
    
    p = p + 1
    
    # ============================================================
    
    return p
# Code assignment ends here

In [None]:
test_df = pd.read_csv('./exoTest.csv')
df_test_x = test_df.drop(['LABEL'], axis=1)
df_test_y = test_df['LABEL']

In [None]:
from scipy import optimize

In [None]:
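m, n = x_train.shape
initial_theta = np.zeros(n)
# costFunction expects 0/1 labels, so shift the 1/2 labels down by one
res = optimize.minimize(costFunction,
                        initial_theta,
                        (np.asarray(x_train, dtype=float), np.asarray(y_train) - 1),
                        jac=True,
                        method='TNC',
                        options={'maxiter': 400})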
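
Passing `jac=True` tells `optimize.minimize` that the objective returns both the cost and its gradient, so the optimizer trusts the analytic gradient. A cheap sanity check is to compare that gradient with central finite differences on a small problem. The sketch below restates the cost formulas from `costFunction` so that it runs on its own; `cost_and_grad`, `numerical_grad` and the random toy data are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Logistic regression cost and gradient, same formulas as costFunction."""
    m = y.size
    h = sigmoid(X.dot(theta))
    J = (1 / m) * (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h)))
    grad = (1 / m) * (h - y).dot(X)
    return J, grad

def numerical_grad(f, theta, eps=1e-6):
    """Central finite differences of the scalar function f at theta."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Small random problem with labels in {0, 1} (made-up data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = rng.integers(0, 2, size=20).astype(float)
theta_toy = rng.normal(size=3)

_, analytic = cost_and_grad(theta_toy, X_toy, y_toy)
numeric = numerical_grad(lambda t: cost_and_grad(t, X_toy, y_toy)[0], theta_toy)
print(np.max(np.abs(analytic - numeric)))  # should be very close to zero
```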

In [None]:
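theta = res.x
# apply the same per-star normalization that was used on the training data
test_x_norm = featureNormalizeArray(df_test_x.to_numpy(dtype=float))
results = predict(theta, test_x_norm)
accuracy = accuracy_score(df_test_y, results)
print("The accuracy of the model is: " + str(accuracy))
conf = confusion_matrix(df_test_y, results)
print(conf)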
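
With only 5 exoplanet stars among 570 test rows, accuracy alone says little about how many planets were actually found. Recall and precision for the exoplanet class are more telling; `classification_report` from `sklearn.metrics`, already imported above, reports both per class. The sketch below shows the same quantities computed by hand on made-up labels; `recall_precision` and the toy arrays are assumptions for illustration and are not results from this notebook.

```python
import numpy as np

def recall_precision(y_true, y_pred, positive=2):
    """Recall and precision for the `positive` label (2 = exoplanet star here)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    # Guard against empty denominators when a class is never present or predicted
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Made-up labels: 8 non-exoplanet stars (1) and 2 exoplanet stars (2)
y_true_toy = np.array([1, 1, 1, 1, 1, 1, 1, 1, 2, 2])
# A classifier that always answers "no exoplanet"
y_pred_toy = np.ones(10, dtype=int)

accuracy = np.mean(y_true_toy == y_pred_toy)
recall, precision = recall_precision(y_true_toy, y_pred_toy)
print(accuracy, recall, precision)  # 0.8 0.0 0.0
```

In this toy case the accuracy is 80% even though no exoplanet star is detected, which is the failure mode the confusion matrix above helps to expose.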
