# Hunting Exoplanets In Space - Pandas DataFrame


In the process of creating Pandas DataFrames, we will see how NASA finds exoplanets. The search for exoplanets draws on deep physical and mathematical theories.

---



### Finding Exoplanets Principle

There are billions of galaxies in the universe. These galaxies contain millions, and often billions, of stars. One such galaxy is the Milky Way, in which our solar system exists. The solar system has a star called the Sun, which emits its own light, and 8 planets orbit around it. Similarly, other stars may have planets revolving around them. A planet that orbits a star other than the Sun is called an exoplanet.

In 2009, NASA launched a space telescope called Kepler. This telescope measured the brightness of distant stars in our galaxy.


<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/kepler-space-telescope.jpg' width="800">

*Image credits: https://www.nasa.gov/feature/ames/kepler/nasa-s-kepler-confirms-100-exoplanets-during-its-k2-mission*

Whenever a planet, while orbiting its star, passes between the telescope and the star, it blocks a small part of the star's light, so the brightness recorded by the telescope is lower. When the planet is not in front of the star (for example, when it moves behind the star), the recorded brightness returns to its higher level.

This method of detecting exoplanets around distant stars through the brightness of the light emitted by a star is called the **Transit Method**.

Essentially, if we plot the brightness on the vertical axis and the time on the horizontal axis, we will see that the brightness of the star recorded by the telescope decreases and increases periodically. Thus, in the graph, we will notice a repeating pattern of dips. Such a pattern strongly suggests that the star has at least one planet, although other effects can also change a star's brightness.
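The periodic pattern described above can be imitated with a few lines of NumPy. This is only a minimal sketch: the period, transit duration, and dip depth are hypothetical values chosen for illustration, and a real Kepler light curve is far noisier.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
period = 50       # time steps between the start of consecutive transits
duration = 5      # time steps for which the planet blocks the star
depth = 0.01      # fractional drop in brightness during a transit
n_steps = 200     # total number of brightness measurements

time = np.arange(n_steps)
flux = np.ones(n_steps)                    # normalised brightness of the star
in_transit = (time % period) < duration    # True while the planet is in front
flux[in_transit] -= depth                  # the star looks dimmer in transit

# A dip starts where the planet is in front now but was not one step earlier.
dip_starts = time[in_transit & ~np.roll(in_transit, 1)]
print(dip_starts)
```

With these values, the dips begin at time steps 0, 50, 100 and 150, which is the regular spacing we look for in a brightness-versus-time plot.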

<img src = 'https://s3-whjr-v2-prod-bucket.whjr.online/99a90115-148e-45c6-b9b0-4ac4a5db4e18.gif' width=500 >



The image below shows some of the exoplanets (Kepler 4b to Kepler 8b) discovered by the Kepler space telescope. We can see the brightness level radiated by the star for each planet. The Flux values on the vertical axis represent the brightness level of the star.

<img src = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/transit-method.jpg' width='800'>

*Image credits: https://www.nasa.gov/content/light-curves-of-keplers-first-5-discoveries*

As we can see in the image above, the bigger the planet (Kepler 6b), the deeper the dip in the brightness level. And the longer the orbital period of a planet, the broader the dip (Kepler 7b). Among these 5 planets, Kepler 7b has the longest orbital period, 4.9 days.

So, this is how NASA finds a planet beyond our solar system. Now, let's use the Kepler space telescope dataset to create a Pandas DataFrame in order to find out which stars beyond our solar system have a planet.
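To see why a bigger planet produces a deeper dip, we can use a common approximation that is not part of this notebook's dataset: the fraction of starlight blocked is roughly the ratio of the planet's disc area to the star's disc area, $(R_{planet}/R_{star})^2$. The sketch below applies it with rounded radii for the Sun, Jupiter, and Earth; the function name `transit_depth` is hypothetical.

```python
# Approximate mean radii in kilometres (rounded reference values).
R_SUN = 695_700
R_JUPITER = 69_911
R_EARTH = 6_371


def transit_depth(planet_radius, star_radius):
    """Approximate fraction of starlight blocked while the planet crosses the star."""
    return (planet_radius / star_radius) ** 2


jupiter_depth = transit_depth(R_JUPITER, R_SUN)
earth_depth = transit_depth(R_EARTH, R_SUN)

print(f"Jupiter-sized planet: {jupiter_depth:.4%}")
print(f"Earth-sized planet:   {earth_depth:.4%}")
```

Under this approximation, a Jupiter-sized planet dims a Sun-like star by about 1%, while an Earth-sized planet dims it by less than 0.01%, which is why small planets are much harder to detect.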

---

In [None]:
from google.colab import files
file_to_load = files.upload()

#### Loading CSV File



In [None]:
# Teacher Action: Read a 'csv' file using the 'read_csv()' function. Also, display the first 5 rows of the DataFrame using the 'head()' function.
# First of all we have to import the Pandas module with pd as an alias (or nickname).
import pandas as pd

exo_train_df = pd.read_csv('exoTrain.csv')
exo_train_df.head()

We have created a Pandas DataFrame for the `exoTrain.csv` file and stored it in the `exo_train_df` variable.

Now, we will create a DataFrame for the `exoTest.csv` file and store it in a variable called `exo_test_df`.

In [None]:
# Read the 'exoTest.csv' file and display its first 5 rows using the 'head()' function.
exo_test_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv')
exo_test_df.head()

The two DataFrames have the same structure: the same columns holding the same type of data.
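One way to confirm this programmatically, rather than by eye, is to compare the column labels and the column data types. The sketch below uses two tiny stand-in DataFrames with made-up values so that it runs on its own; the column name `LABEL` is an assumption about the dataset. In the notebook, the same comparisons could be applied to `exo_train_df` and `exo_test_df`.

```python
import pandas as pd

# Tiny stand-in DataFrames with made-up values (not the real Kepler data).
# The 'LABEL' column name is an assumption about the dataset.
train_df = pd.DataFrame({"LABEL": [2, 1], "FLUX.1": [93.85, -38.88], "FLUX.2": [83.81, -33.83]})
test_df = pd.DataFrame({"LABEL": [1], "FLUX.1": [119.88], "FLUX.2": [100.21]})

# Same column labels in the same order?
same_columns = train_df.columns.equals(test_df.columns)

# Same data type in each corresponding column?
same_dtypes = list(train_df.dtypes) == list(test_df.dtypes)

print(same_columns, same_dtypes)
```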


In [None]:
# Find the number of rows and columns in the 'exo_train_df' DataFrame.
exo_train_df.shape

So, there are 5087 rows and 3198 columns in the `exo_train_df` DataFrame.

In [None]:
# Find the number of rows and columns in the 'exo_test_df' DataFrame.
exo_test_df.shape

There are 570 rows and 3198 columns in the `exo_test_df` DataFrame.

---

#### Check For The Missing Values

In most cases, we do not get complete datasets. They either have some values missing from the rows and columns or they do not have standardized values.

So, before going ahead with the analysis, it is a good idea to check whether the dataset has any missing values.

In [None]:
# Check for the missing values using the 'isnull()' function.
exo_train_df.isnull()

There are $5087\times3198=16268226$ values in the DataFrame. It is not feasible to check so many values manually. So, we need a better approach to check for missing values.
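The arithmetic above can be checked in Python. As a small sketch, the block also shows that the `size` attribute of a DataFrame equals its number of rows multiplied by its number of columns; `small_df` is a made-up DataFrame used only for illustration.

```python
import pandas as pd

# Number of rows and columns reported by 'exo_train_df.shape'.
rows, columns = 5087, 3198
total_values = rows * columns
print(total_values)

# The 'size' attribute gives rows * columns directly.
small_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(small_df.size == small_df.shape[0] * small_df.shape[1])
```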

We can call the `sum()` function on the result of `exo_train_df.isnull()`. It returns the number of `True` values (that is, missing values) in every column of the DataFrame.

In [None]:
# Use the 'sum()' function to find the total number of True values in each column.
exo_train_df.isnull().sum()

We can see that many columns have `0` missing values. Still, we cannot confirm by eye that every column is free of missing values, because the output is too long to be displayed in full in this notebook. There are `3198` columns to check.
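One possible shortcut is to keep only the columns whose missing-value count is greater than zero, so that the long list shrinks to the columns that actually need attention. The sketch below uses a small made-up DataFrame (`toy_df`) so that the result is easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Made-up data: 'FLUX.2' has 1 missing value and 'FLUX.3' has 2.
toy_df = pd.DataFrame({
    "FLUX.1": [1.0, 2.0, 3.0],
    "FLUX.2": [4.0, np.nan, 6.0],
    "FLUX.3": [np.nan, np.nan, 9.0],
})

# Count the missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

If the filtered series is empty, no column contains a missing value.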

In [None]:
# View all the columns in the 'exo_train_df' DataFrame.
exo_train_df.columns


Again, we need a better approach. We will create a variable called `num_missing_values` to store the total number of missing values. Then, we will iterate through each column and, within each column, through each item returned by `isnull()`. If the item is `True` (the value is missing), we will increase `num_missing_values` by `1`; otherwise, we will do nothing.

In [None]:
# Iterate through the 'exo_train_df' DataFrame to find the total number of missing values.
num_missing_values = 0
# Here, we have created the num_missing_values which will store all the number of missing values in the DataFrame.

# Now, we will iterate through every column in the DataFrame, then will iterate through every item in each column.
for column in exo_train_df.columns:
  for item in exo_train_df[column].isnull():
    if item == True:
      num_missing_values += 1

num_missing_values

As seen, there are no missing values in the DataFrame because the final value of the `num_missing_values` is `0`.

Now, let's find the number of non-missing values by replacing `True` with `False` in the above code, and store it in a variable called `non_missing_values`.

In [None]:
# In the above code replace 'True' with 'False' and get the number of non missing values.
non_missing_values = 0
for column in exo_train_df.columns:
  for item in exo_train_df[column].isnull():
    if item == False:
      non_missing_values += 1

non_missing_values

As we can see, the output is 16,268,226. It is the count of all the values that are `False`. That means there are no missing values, because the total number of values in `exo_train_df` is 16,268,226, which is exactly the same as the number of non-missing values.
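The two nested loops make the logic explicit, but they visit every value one at a time. As an alternative sketch, the same two totals can be obtained by summing twice: the first `sum()` counts per column and the second adds the column counts together. The block uses a small made-up DataFrame (`toy_df`) so that the totals can be checked by hand.

```python
import numpy as np
import pandas as pd

# Made-up data: 6 values in total, 2 of them missing.
toy_df = pd.DataFrame({
    "FLUX.1": [1.0, np.nan],
    "FLUX.2": [3.0, 4.0],
    "FLUX.3": [np.nan, 6.0],
})

# First sum() counts per column, second sum() adds the column counts.
num_missing = int(toy_df.isnull().sum().sum())
num_non_missing = int(toy_df.notnull().sum().sum())

print(num_missing, num_non_missing, toy_df.size)
```

The missing and non-missing counts always add up to the total number of values, which gives a quick consistency check.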

---

#### Slicing A DataFrame Using The `iloc[]` Indexer

We want to draw scatter plots and line plots for 6 stars. For each of these stars, we will create a Pandas series holding its brightness levels from `FLUX.1` to `FLUX.3197`.

Effectively, we need to create 6 Pandas series.

Let's create a Pandas series for the first star in `exo_train_df` and store it in a variable called `star_0`. To do this, we use the `iloc[]` indexer: `iloc[0, :]` selects the row at position `0` together with all of its columns.

In [None]:
# Create a Pandas series from a Pandas DataFrame using the 'iloc[]' function.
star_0 = exo_train_df.iloc[0, :]
star_0

In [None]:
# Using 'iloc[]', create a Pandas series for the second star and store it in a variable called 'star_1'.
star_1 = exo_train_df.iloc[1, :]
star_1.head()

In [None]:
# Using 'iloc[]', create a Pandas series for the third star and store it in a variable called 'star_2'.
star_2 = exo_train_df.iloc[2, :]
star_2.head()

We have created a Pandas series for each of the first three stars. Now, let's create the same for each of the last three stars in the DataFrame.

In [None]:
# Display the last 5 rows of the 'exo_train_df' DataFrame.
exo_train_df.tail()

In [None]:
# Create a Pandas series for the last star.
star_5086 = exo_train_df.iloc[5086, :]
star_5086.head()

In [None]:
# Create a Pandas series for the second-last star.
star_5085 = exo_train_df.iloc[5085, :]
star_5085.head()

In [None]:
# Create a Pandas series for the third-last star.
star_5084 = exo_train_df.iloc[5084, :]
star_5084.head()
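The positions `5086`, `5085` and `5084` were read off the output of `tail()`. As a hedged alternative, `iloc[]` also accepts negative positions that count from the end, so the last rows can be selected without knowing the number of rows. The sketch below uses a small made-up DataFrame; the assumption that the first column is a label column named `LABEL` is ours, and the column slice `1:` shows how that first column could be skipped so that only the brightness values remain.

```python
import pandas as pd

# Made-up data: 4 "stars"; the 'LABEL' column name is an assumption.
toy_df = pd.DataFrame({
    "LABEL": [2, 1, 1, 1],
    "FLUX.1": [10.0, 20.0, 30.0, 40.0],
    "FLUX.2": [11.0, 21.0, 31.0, 41.0],
})

# Position -1 is the last row, so this selects the same row as toy_df.iloc[3, :].
last_star = toy_df.iloc[-1, :]

# The column slice '1:' skips column 0, keeping only the brightness columns.
last_star_flux = toy_df.iloc[-1, 1:]

# Several series can be collected in a dictionary keyed by row position.
first_and_last = {pos: toy_df.iloc[pos, 1:] for pos in (0, -1)}

print(last_star_flux)
```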