<a href="https://colab.research.google.com/github/chanda-04/AIMLMonth2023/blob/main/Major_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exoplanet Detection and Classification using Machine Learning**

# **Introduction**
**Exoplanet Detection**:

Exoplanet detection involves identifying the presence of planets orbiting stars based on various observable signals, such as changes in brightness (transits) or radial velocity variations. Machine learning can enhance the efficiency and accuracy of detecting exoplanets in large datasets.

**The Transit Method of Detecting Extrasolar Planets**

When a planet passes in front of a star as viewed from Earth, the event is called a “transit”. On Earth, we can observe an occasional Venus or Mercury transit. These events are seen as a small black dot creeping across the Sun—Venus or Mercury blocks sunlight as the planet moves between the Sun and us. Kepler finds planets by looking for tiny dips in the brightness of a star when a planet crosses in front of it—we say the planet transits the star.

Once detected, the planet's orbital size can be calculated from the period (how long it takes the planet to orbit once around the star) and the mass of the star using Kepler's Third Law of planetary motion. The size of the planet is found from the depth of the transit (how much the brightness of the star drops) and the size of the star. From the orbital size and the temperature of the star, the planet's characteristic temperature can be calculated. From this the question of whether or not the planet is habitable (not necessarily inhabited) can be answered.
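
As a rough illustration of the three calculations described above, the sketch below applies Kepler's Third Law, the transit-depth relation, and a simple equilibrium-temperature estimate to Sun-like and Earth-like values. The function names, the rounded physical constants, and the albedo of 0.3 are assumptions chosen for this example; they do not come from the dataset or from the Kepler pipeline.

```python
import math

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30       # solar mass, kg
R_SUN = 6.957e8        # solar radius, m
R_EARTH = 6.371e6      # Earth radius, m
AU = 1.496e11          # astronomical unit, m


def orbital_radius(period_s, star_mass_kg):
    """Semi-major axis from Kepler's Third Law: a^3 = G * M * P^2 / (4 * pi^2)."""
    return (G * star_mass_kg * period_s ** 2 / (4 * math.pi ** 2)) ** (1 / 3)


def planet_radius(transit_depth, star_radius_m):
    """Planet radius from the fractional brightness drop: depth = (Rp / Rs)^2."""
    return star_radius_m * math.sqrt(transit_depth)


def equilibrium_temperature(star_temp_k, star_radius_m, orbit_m, albedo=0.3):
    """Characteristic temperature, assuming the planet re-radiates uniformly."""
    return star_temp_k * math.sqrt(star_radius_m / (2 * orbit_m)) * (1 - albedo) ** 0.25


period = 365.25 * 24 * 3600                 # one year in seconds
a = orbital_radius(period, M_SUN)
depth = (R_EARTH / R_SUN) ** 2              # dip caused by an Earth-sized planet
r_p = planet_radius(depth, R_SUN)
t_eq = equilibrium_temperature(5772, R_SUN, a)

print(f"orbital radius: {a / AU:.3f} AU")
print(f"transit depth:  {depth:.2e}")
print(f"planet radius:  {r_p / R_EARTH:.2f} Earth radii")
print(f"temperature:    {t_eq:.0f} K")
```

For an Earth-sized planet around a Sun-like star the depth comes out below 0.01% of the star's brightness, which is why transits show up only as tiny dips.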

# **Problem statement**
ML algorithms can be trained to identify transit-like patterns in light curves. These algorithms learn to differentiate between actual transits and noise, helping to detect exoplanet candidates more effectively.

**About the dataset:**

The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1, where 2 indicates that the star is confirmed to have at least one exoplanet in orbit; some observations are in fact multi-planet systems.

As you can imagine, planets themselves do not emit light, but the stars that they orbit do. If said star is watched over several months or years, there may be a regular 'dimming' of the flux (the light intensity). This is evidence that there may be an orbiting body around the star; such a star could be considered to be a 'candidate' system. Further study of our candidate system, for example by a satellite that captures light at a different wavelength, could solidify the belief that the candidate can in fact be 'confirmed'.

link: https://www.kaggle.com/datasets/keplersmachines/kepler-labelled-time-series-data?resource=download&select=exoTrain.csv

In [None]:
# Import packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

In [None]:
# Load the training data and preview the first rows
train_data = pd.read_csv('/exoTrain.csv.zip')
train_data.head()

In [None]:
#shape of dataset
train_data.shape

This dataset has 5087 stars. For each star we have 3197 flux values recorded at successive time steps; together with the LABEL column this gives 3198 columns.

Because the data were collected for transit-based detection, these flux values will be used to determine whether a star has one or more exoplanets.

In [None]:
# Count the null values in each column
train_data.isnull().sum()

In [None]:
sns.heatmap(train_data.isnull())

There are no missing values.

In [None]:
#checking how many labels are present in the dataset
train_data['LABEL'].unique()

Hence there are 2 labels (the list number is the label value):
1.   Star with no confirmed exoplanets
2.   Star with at least one confirmed exoplanet

In [None]:
#Replacing the label values
train_data = train_data.replace({'LABEL' : {1:0, 2:1}})
train_data['LABEL'].unique()



We have replaced the labels 2 and 1 with 1 and 0 respectively, since 0/1 is the conventional encoding for binary classification (1 = has exoplanets, 0 = does not).

In [None]:
print("no. of stars with exoplanets=",len(train_data[train_data['LABEL']==1]))
print("no. of stars without exoplanets=",len(train_data[train_data['LABEL']==0]))
#plotting a countplot
column_name = 'LABEL'
plt.figure(figsize=(3, 10))

sns.countplot(data=train_data, x=column_name)

# Add labels and title
plt.xlabel(column_name)
plt.ylabel('Count')
plt.title(f'Countplot of {column_name}')

# Show the plot
plt.show()


**Visualising the light curves in this data**

When a planet passes between the observer and the star, the measured flux decreases, so the light curve of a star with exoplanets shows dips.
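
To make the idea of a dip concrete, here is a minimal sketch that builds a synthetic light curve with a box-shaped transit. Every number in it (period, duration, depth, noise level) is made up for illustration and is not taken from the Kepler data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_points = 3197        # same length as one row of flux values
period = 400           # hypothetical: time steps between transits
duration = 15          # hypothetical: time steps one transit lasts
depth = 50.0           # hypothetical: flux drop during a transit
noise_sigma = 5.0      # hypothetical: measurement noise

t = np.arange(n_points)
in_transit = (t % period) < duration             # True while the planet blocks light
flux = rng.normal(0.0, noise_sigma, n_points)    # flat star plus noise
flux[in_transit] -= depth                        # periodic dimming

print("mean flux in transit:    ", flux[in_transit].mean())
print("mean flux out of transit:", flux[~in_transit].mean())
```

Plotting `t` against `flux` with `plt.plot` should show regularly spaced dips of the kind the real light curves are inspected for.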


In [None]:
# Drop the LABEL column, since it is not needed for plotting the light curves
plot_train=train_data.drop(["LABEL"],axis=1)
plot_train

In [None]:
# Plot a star with confirmed exoplanets (label 1) to see what its light curve looks like
x=range(1,3198)
y=plot_train.iloc[16,:].values
plt.figure(figsize=(12, 6))
plt.plot(x,y,linewidth=0.5)

The dips in the graph above represent dimming of the flux (the light intensity). This is evidence that there may be an orbiting body, i.e. an exoplanet, around the star. Light curves with such dips are typical of stars with confirmed exoplanets.

In [None]:
# Plot a star with no exoplanets (label 0) to see what its light curve looks like
x=range(1,3198)
y=plot_train.iloc[40,:].values
plt.figure(figsize=(10, 6))
plt.plot(x,y,linewidth=2)

There are no comparable dips in the graph above, and the curve is 'flatter' than the previous one. This suggests that no orbiting body, i.e. no exoplanet, is dimming the star's flux.

Because the target is a binary label (a star either has exoplanets or it does not), predicting it from flux values is a classification problem rather than a regression problem.

# **Data preprocessing**

In [None]:
#Extracting independent (x) and dependent (y) features from our dataset
x=train_data.drop(["LABEL"], axis=1)
y=train_data.LABEL

**Handling the imbalance in the data:**

Our dataset is imbalanced: one class (label=0) has significantly more samples than the other, so a classifier can become biased towards the majority class.
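
To see why imbalance matters, consider a classifier that always predicts the majority class. The class counts below are hypothetical and chosen only to illustrate the effect; they are not the counts of this dataset.

```python
# Hypothetical class counts for illustration (not the real dataset counts)
n_negative = 990   # stars without exoplanets (label 0)
n_positive = 10    # stars with exoplanets (label 1)

y_true = [0] * n_negative + [1] * n_positive
y_pred = [0] * len(y_true)          # always predict the majority class

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = found / n_positive         # share of exoplanet stars that were detected

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

The accuracy looks excellent even though not a single exoplanet star is found, so accuracy alone can be misleading on imbalanced data.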

The Random Over-Sampling technique focuses on the minority class (the class with fewer samples) and aims to balance the class distribution by randomly duplicating instances from the minority class until its size matches the size of the majority class.

The Random Over-Sampling process works as follows:

1.   Identify the minority class that you want to balance.
2.   Randomly select instances from the minority class with replacement (allowing the same instance to be selected multiple times), adding these instances to the dataset.
3.   Repeat step 2 until the size of the minority class reaches the desired level of balance or matches the size of the majority class.
4.   Use the balanced dataset for training your machine learning model.
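
The steps above can be sketched in NumPy for the two-class case. The helper `random_oversample` is a hypothetical name written for this illustration; the notebook itself relies on imbalanced-learn's `RandomOverSampler`.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]            # step 1: find the minority class
    n_extra = counts.max() - counts.min()            # rows needed to reach balance
    minority_idx = np.flatnonzero(y == minority)
    # steps 2-3: sample with replacement until the class sizes match
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]                          # step 4: the balanced dataset


# Toy data: 8 majority rows and 2 minority rows
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

No new information is created: every added row is an exact copy of an existing minority row.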

In [None]:

from imblearn.over_sampling import RandomOverSampler
from collections import Counter
# Fix the seed so that the resampled dataset is reproducible
ros = RandomOverSampler(random_state=0)
x_ros, y_ros = ros.fit_resample(x, y)
print(f"Before sampling:- {Counter(y)}")
print(f"After sampling:- {Counter(y_ros)}")


In [None]:
# Visualizing
y_ros.value_counts().plot(kind='bar', title='After applying RandomOverSampler')

Split the resampled data 70:30 into training and test sets for model development.
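
One caveat with oversampling before splitting: copies of the same star can land in both the training and the test set, which may inflate the test scores. The sketch below uses made-up row indices rather than the real data to count such overlaps, and compares the result with splitting first and oversampling only the training part. The helper names `oversample_indices` and `split` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: rows 0-9 are the minority class, rows 10-199 the majority
labels = np.array([1] * 10 + [0] * 190)
rows = np.arange(len(labels))


def oversample_indices(idx, y, rng):
    """Return idx plus random duplicates of minority rows until classes balance."""
    minority = idx[y[idx] == 1]
    majority = idx[y[idx] == 0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    return np.concatenate([idx, extra])


def split(idx, rng, test_size=0.3):
    """Shuffle idx and cut it into a training part and a test part."""
    shuffled = rng.permutation(idx)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Order used in this notebook: oversample first, then split
train_a, test_a = split(oversample_indices(rows, labels, rng), rng)
leaked_a = int(np.isin(test_a, train_a).sum())

# Alternative: split first, then oversample only the training rows
train_b, test_b = split(rows, rng)
train_b = oversample_indices(train_b, labels, rng)
leaked_b = int(np.isin(test_b, train_b).sum())

print("test rows also in train (oversample, then split):", leaked_a)
print("test rows also in train (split, then oversample):", leaked_b)
```

With scikit-learn and imbalanced-learn, the alternative order would mean calling `train_test_split` on `x` and `y` first (optionally with `stratify=y`) and then `fit_resample` on the training part only.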

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_ros, y_ros, test_size = 0.3, random_state = 0)


Apply feature scaling so that all features are on a similar scale.
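
What `StandardScaler` does can be sketched in NumPy: it learns each column's mean and standard deviation from the training data and reuses those statistics for the test data. The toy arrays below are made up for illustration.

```python
import numpy as np

# Toy data: 4 training rows and 2 test rows with 2 features each
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])
test = np.array([[2.5, 250.0],
                 [5.0, 500.0]])

mean = train.mean(axis=0)        # learned from the training set only
std = train.std(axis=0)          # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std    # reuse the training statistics

print(train_scaled.mean(axis=0))     # approximately 0 for every column
print(train_scaled.std(axis=0))      # approximately 1 for every column
print(test_scaled)
```

Fitting on the training set only keeps information about the test set out of the preprocessing step.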

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

# **Model selection**

In this project we demonstrate 3 machine learning classification methods:

1.   K-Nearest Neighbors (KNN)
2.   Random forest classifier
3.   Decision tree classifier

and compare them to find which one gives the best results.
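
Once each model has produced its predictions, the comparison itself can be as simple as ranking the models by one metric. The sketch below ranks by F1 score for the positive class. The labels and predictions are made up to keep the example self-contained, so the resulting order says nothing about how the real models perform; `f1_score_binary` is a hypothetical helper.

```python
def f1_score_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions standing in for y_test and the y_pred_* arrays
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = {
    "KNN":           [1, 1, 1, 0, 0, 0, 0, 1],
    "Random forest": [1, 1, 1, 0, 0, 0, 0, 0],
    "Decision tree": [1, 0, 1, 0, 0, 1, 0, 0],
}

scores = {name: f1_score_binary(y_true, pred) for name, pred in predictions.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} F1 = {score:.3f}")
```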



# **K-Nearest Neighbors (KNN)**

**Train the K-Nearest Neighbors (KNN) model:**

1.   Import the KNeighborsClassifier class from scikit-learn.
2.   Create an instance of the K-Nearest Neighbors (KNN) classifier.
3.   Use the fit method to train the model on the training data.
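
For K = 1 with Euclidean distance, the setting used in this notebook, prediction reduces to finding the closest training row and copying its label. A minimal NumPy sketch of that idea on toy data follows; the helper `predict_1nn` is hypothetical, and the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np


def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its closest training row."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = dists.argmin(axis=1)          # index of the nearest neighbour
    return y_train[nearest]


# Toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])
queries = np.array([[0.2, 0.4], [5.5, 4.8]])

predicted = predict_1nn(X_toy, y_toy, queries)
print(predicted)
```

Because each training row is its own nearest neighbour, a K = 1 model effectively memorises the training set, so only the score on held-out data is informative.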

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNC

In [None]:
# Choosing K = 1
knn_classifier = KNC(n_neighbors=1,metric='minkowski',p=2)
# metric='minkowski' with p=2 (the scikit-learn defaults) gives Euclidean distances
# Fit the model
knn_classifier.fit(X_train_sc, y_train)


**Make Predictions**
    
1. Use the trained knn classifier to make predictions on the testing data.
2. Use the predict method to obtain the predicted labels.



In [None]:
# Predict
y_pred_knn = knn_classifier.predict(X_test_sc)

**Evaluate the Model**
1.   Assess the performance of the KNN model using evaluation metrics such as accuracy, precision, recall, and F1 score.
2.   Compare the predicted labels (y_pred_knn) with the actual labels (y_test).
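
All four metrics can be derived from the four cells of a binary confusion matrix. The counts below are hypothetical and serve only to show the arithmetic.

```python
# Hypothetical confusion-matrix counts, for illustration only
tn, fp = 90, 10    # actual negatives: correctly rejected / falsely flagged
fn, tp = 5, 45     # actual positives: missed / correctly detected

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # of the predicted positives, how many are real
recall = tp / (tp + fn)         # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For labels 0 and 1, scikit-learn's `confusion_matrix` returns the cells in the layout `[[tn, fp], [fn, tp]]`, which matches the heatmap plotted in the next cell.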



In [None]:
# Results
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print('Validation accuracy of KNN is', accuracy_score(y_test,y_pred_knn))
print ("\nClassification report :\n",(classification_report(y_test,y_pred_knn)))

#Confusion matrix
cm = confusion_matrix(y_test, y_pred_knn)
plt.figure(figsize=(5, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["Actual Negative", "Actual Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# **Random forest classifier**

**Train the random forest classifier model:**

1.   Import the RandomForestClassifier class from scikit-learn's ensemble module (sklearn.ensemble).
2.   Create an instance of the random forest classifier.
3.   Use the fit method to train the model on the training data.
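
A random forest combines many decision trees, each trained on a bootstrap sample of the rows. scikit-learn's implementation averages the trees' predicted class probabilities rather than counting hard votes. The sketch below shows only that combination step; the per-tree probabilities are made-up numbers, not output from any fitted model.

```python
import numpy as np

# Hypothetical P(label=1) from 5 trees (rows) for 3 stars (columns)
tree_probs = np.array([
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [1.0, 0.3, 0.7],
    [0.7, 0.0, 0.3],
    [0.9, 0.4, 0.6],
])

forest_prob = tree_probs.mean(axis=0)            # average over the trees
forest_label = (forest_prob > 0.5).astype(int)   # predict 1 when P(label=1) exceeds 0.5

print(forest_prob)
print(forest_label)
```

Averaging smooths out the errors of individual trees, which is why the third star is still labelled 1 although two of the five trees lean the other way.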

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Fix the seed so that the fitted forest is reproducible
forest = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=0)
forest.fit(X_train_sc, y_train)

**Make Predictions**
    
1. Use the trained Random forest classifier to make predictions on the testing data.
2. Use the predict method to obtain the predicted labels.



In [None]:
# Predicting on the test set
y_pred_rf = forest.predict(X_test_sc)

**Evaluate the Model**
1.   Assess the performance of the random forest model using evaluation metrics such as accuracy, precision, recall, and F1 score.
2.   Compare the predicted labels (y_pred_rf) with the actual labels (y_test).



In [None]:
# Results
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print('\nValidation accuracy of RandomForestClassifier  is', accuracy_score(y_test,y_pred_rf))
print ("\nClassification report :\n",(classification_report(y_test,y_pred_rf)))

#Confusion matrix
cm = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(5, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["Actual Negative", "Actual Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

# **Decision tree classifier**

**Train the decision tree classifier model:**

1.   Import the DecisionTreeClassifier class from scikit-learn's tree module (sklearn.tree).
2.   Create an instance of the decision tree classifier.
3.   Use the fit method to train the model on the training data.
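
A decision tree grows by choosing, at each node, the split that most reduces an impurity measure; scikit-learn's `DecisionTreeClassifier` uses Gini impurity by default. Here is a small sketch of that calculation on made-up labels. The helper names `gini` and `split_impurity` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted impurity of the two child nodes after a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]    # a hypothetical candidate split

parent_impurity = gini(parent)
child_impurity = split_impurity(left, right)

print(parent_impurity)   # 0.5 for a perfectly mixed binary node
print(child_impurity)    # lower than the parent, so this split helps
```

A pure node (all labels equal) has impurity 0, and the tree stops splitting such nodes.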

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(random_state=0)

# Train the classifier on the training data
dt_classifier.fit(X_train_sc, y_train)



**Make Predictions**
    
1. Use the trained decision tree classifier to make predictions on the testing data.
2. Use the predict method to obtain the predicted labels.



In [None]:
y_pred_dt = dt_classifier.predict(X_test_sc)

**Evaluate the Model**
1.   Assess the performance of the decision tree classifier model using evaluation metrics such as accuracy, precision, recall, and F1 score.
2.   Compare the predicted labels (y_pred_dt) with the actual labels (y_test).



In [None]:
# Results
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

print('\nValidation accuracy of DecisionTreeClassifier  is', accuracy_score(y_test,y_pred_dt))
print ("\nClassification report :\n",(classification_report(y_test,y_pred_dt)))

#Confusion matrix
cm = confusion_matrix(y_test, y_pred_dt)
plt.figure(figsize=(5, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Predicted Negative", "Predicted Positive"],
            yticklabels=["Actual Negative", "Actual Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

**Future Scope**

It's important to note that while machine learning offers powerful tools for exoplanet detection and classification, it also presents challenges related to dataset quality, overfitting, and model interpretability. Collaborations between astrophysicists, data scientists, and machine learning experts are crucial for developing robust and reliable models for exoplanet research. As the field continues to evolve, machine learning will likely play an increasingly integral role in our understanding of exoplanetary systems.