# Lesson 17: Pandas DataFrame

---

### Teacher-Student Activities

In the last class, we learned about the Pandas series. In this lesson, we will learn about the Pandas DataFrame, which is a collection of Pandas series that share a common index. In other words, a Pandas DataFrame is a two-dimensional, table-like data structure made of rows and columns.

While learning about Pandas DataFrames, we will also learn how NASA finds exoplanets in the universe. There are deep physical and mathematical theories behind the search for exoplanets, but we will not go through all of them. For now, we just need to understand the basic principle behind these theories to see how exoplanets are found.

---



### The Principle Behind Finding Exoplanets

Imagine that you are in your room during the daytime with the window curtains open. The room probably would be well-lit. Now, imagine that you close the curtains of the window and block the sunlight from entering the room. In this situation, the room would be darker and the visibility would be low.

So, whenever the curtains are open, the brightness of the light is higher, and whenever the curtains are closed, the brightness is lower. We can measure the brightness of light using an instrument called a photometer.

The same principle is applied in searching for an exoplanet. There are billions of galaxies in the universe, and each galaxy contains millions or even billions of stars. One such galaxy is the Milky Way, in which our solar system exists. In astronomy, a star is a celestial body that emits its own light. The star of our solar system is the Sun, and 8 planets orbit around it. Similarly, many other stars are likely to have planets revolving around them. A planet that orbits a star other than the Sun is called an exoplanet.

In 2009, NASA launched a space telescope called the Kepler telescope. This telescope was used to measure the brightness of distant stars in our galaxy.
 

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/kepler-space-telescope.jpg' width="800">

*Image credits: https://www.nasa.gov/feature/ames/kepler/nasa-s-kepler-confirms-100-exoplanets-during-its-k2-mission*

Whenever a planet, while orbiting its star, passes between the telescope and the star, the brightness of the star recorded by the telescope is lower. When the planet moves out of the way, for example behind the star, the recorded brightness returns to its higher level.

This method of detecting exoplanets around distant stars through periodic dips in the brightness of the light emitted by a star is called the **Transit Method**. You can read about it by clicking on the link provided in the **Activities** section under the title **How Do Astronomers Find Exoplanets?**

Essentially, if you plot the brightness on the vertical axis and the time on the horizontal axis, then you will see that the brightness of the star recorded by the telescope decreases and increases periodically. Thus, in the graph, you will notice a repeating, wave-like pattern. A regularly repeating dip suggests that the star has at least one planet.

<img src = 'https://s3-whjr-v2-prod-bucket.whjr.online/99a90115-148e-45c6-b9b0-4ac4a5db4e18.gif' width=500 >



The image below shows some of the exoplanets (Kepler 4b to Kepler 8b) discovered by the Kepler space telescope. You can see the brightness level radiated by the star for each planet. The Flux values on the vertical axis represent the brightness level of the star.

<img src = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/kepler-exoplanets-dataset/transit-method.jpg' width='800'>

*Image credits: https://www.nasa.gov/content/light-curves-of-keplers-first-5-discoveries*

As you can see in the image above, the bigger the planet (Kepler 6b), the deeper the dip in the brightness level. And the longer the orbital period of a planet, the wider the dip (Kepler 7b). Among these 5 planets, Kepler 7b has the longest orbital period: 4.9 days.
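To make the idea of a dip concrete, here is a minimal sketch that builds a made-up light curve in plain Python. The function `box_transit_flux()` and all its parameters are hypothetical names chosen for illustration. The dip is drawn as a simple box whose depth and duration are picked by hand, whereas real transit dips are smoother and come from actual measurements.

```python
# Hypothetical helper: builds a box-shaped light curve (an illustration, not real data).
def box_transit_flux(n_points, period, duration, depth, baseline=1.0):
    """Return a list of brightness values with a periodic box-shaped dip."""
    flux = []
    for t in range(n_points):
        # The planet is in front of the star for 'duration' steps in every 'period' steps.
        in_transit = (t % period) < duration
        flux.append(baseline - depth if in_transit else baseline)
    return flux

# A smaller planet blocks less light (shallow dip); a bigger one blocks more (deep dip).
small_planet = box_transit_flux(n_points=60, period=20, duration=2, depth=0.01)
big_planet = box_transit_flux(n_points=60, period=20, duration=4, depth=0.05)

print(min(small_planet), min(big_planet))
```

Plotting either list against its position (time) would show a flat line with evenly spaced dips, which is the pattern described above.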

So, this is how NASA finds a planet beyond our solar system. Now, let's use the Kepler space telescope dataset to create a Pandas DataFrame to find out which stars beyond our solar system have a planet.

---

#### Task 1: Loading a CSV File

Generally, we store data in different kinds of files such as text (`txt` format) files, comma-separated value (`csv` format) files, tab-separated value (`tsv` format) files, etc. We can read the contents of these files through Python.

A comma-separated value (`csv`) file is used most commonly to store data.

To load or read the contents of a `csv` file, we can use the `read_csv()` function in Pandas. The data is read in the form of a two-dimensional array called a **Pandas DataFrame**.

As an input to the `read_csv()` function, we need to provide the full location (a file path or a URL) of the `csv` file that we wish to read, written as a string. The file that we wish to read is stored in cloud storage.

Here's the link to the file:

https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv


**Dataset Credits:** https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data

In the above link, `exoTrain.csv` is the name of the `csv` file. We will create a Pandas DataFrame using this file:


In [None]:
# S1.1: Read a 'csv' file using the 'read_csv()' function. Also, display the first 5 rows of the DataFrame using the 'head()' function.
# First of all we have to import the Pandas module with pd as an alias (or nickname).

import pandas as pd

exo_train_df = pd.read_csv("https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTrain.csv")
exo_train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,-207.47,-154.88,-173.71,-146.56,-120.26,-102.85,-98.71,-48.42,-86.57,-0.84,-25.85,-67.39,-36.55,-87.01,-97.72,-131.59,-134.8,-186.97,-244.32,-225.76,-229.6,-253.48,-145.74,-145.74,30.47,-173.39,-187.56,-192.88,-182.76,...,-167.69,-56.86,7.56,37.4,-81.13,-20.1,-30.34,-320.48,-320.48,-287.72,-351.25,-70.07,-194.34,-106.47,-14.8,63.13,130.03,76.43,131.9,-193.16,-193.16,-89.26,-17.56,-17.31,125.62,68.87,100.01,-9.6,-25.39,-16.51,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,-86.51,-74.97,-73.15,-86.13,-76.57,-61.27,-37.23,-48.53,-30.96,-8.14,-5.54,15.79,45.71,10.61,40.66,16.7,15.18,11.98,-203.7,19.13,19.13,19.13,19.13,19.13,17.02,-8.5,-13.87,-29.1,-34.29,...,-36.75,-15.49,-13.24,20.46,-1.47,-0.4,27.8,-58.2,-58.2,-72.04,-58.01,-30.92,-13.42,-13.98,-5.43,8.71,1.8,36.59,-9.8,-19.53,-19.53,-24.32,-23.88,-33.07,-9.03,3.75,11.61,-12.66,-5.69,12.53,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,484.39,469.66,462.3,492.23,441.2,483.17,481.28,535.31,554.34,562.8,540.14,576.34,551.67,556.69,550.86,577.33,562.08,577.97,530.67,553.27,538.33,527.17,532.5,273.66,273.66,292.39,298.44,252.64,233.58,171.41,...,-51.09,-33.3,-61.53,-89.61,-69.17,-86.47,-140.91,-84.2,-84.2,-89.09,-55.44,-61.05,-29.17,-63.8,-57.61,2.7,-31.25,-47.09,-6.53,14.0,14.0,-25.05,-34.98,-32.08,-17.06,-27.77,7.86,-70.77,-64.44,-83.83,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,323.33,311.14,326.19,313.11,313.89,317.96,330.92,341.1,360.58,370.29,369.71,339.0,336.24,319.31,321.56,308.02,296.82,279.34,275.78,289.67,281.33,285.37,281.87,88.75,88.75,67.71,74.46,69.34,76.51,80.26,...,-2.75,14.29,-14.18,-25.14,-13.43,-14.74,2.24,-31.07,-31.07,-50.27,-39.22,-51.33,-18.53,-1.99,10.43,-1.97,-15.32,-23.38,-27.71,-36.12,-36.12,-15.65,6.63,10.66,-8.57,-8.29,-21.9,-25.8,-29.86,7.42,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,-933.3,-889.49,-888.66,-853.95,-800.91,-754.48,-717.24,-649.34,-605.71,-575.62,-526.37,-490.12,-458.73,-447.76,-419.54,-410.76,-404.1,-425.38,-397.29,-412.73,-446.49,-413.46,-1006.21,-1006.21,-973.29,-986.01,-975.88,-982.2,-953.73,...,-694.76,-705.01,-625.24,-604.16,-668.26,-742.18,-820.55,-874.76,-874.76,-853.68,-808.62,-777.88,-712.62,-694.01,-655.74,-599.74,-617.3,-602.98,-539.29,-672.71,-672.71,-594.49,-597.6,-560.77,-501.95,-461.62,-468.59,-513.24,-504.7,-521.95,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


As you can see, we have created a Pandas DataFrame for the `exoTrain.csv` file and stored it in the `exo_train_df` variable.
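If you want to try `read_csv()` without downloading anything, the sketch below reads a tiny, made-up CSV text from memory. The values and the variable names `csv_text` and `toy_df` are assumptions for illustration only; they merely imitate the layout of the real file.

```python
import io
import pandas as pd

# Made-up CSV text that mimics the layout of 'exoTrain.csv' (not real Kepler data).
csv_text = """LABEL,FLUX.1,FLUX.2,FLUX.3
2,93.85,83.81,20.10
1,-38.88,-33.83,-58.54
"""

# 'read_csv()' accepts a file path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df.head())
print(type(toy_df))
```

The first line of the CSV text becomes the column names, and every following line becomes one row of the DataFrame.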

Now, you create a DataFrame for the `exoTest.csv` file and store it in a variable called `exo_test_df`. Here is the link to the file.

https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv

**Dataset Credits:** https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data

In [None]:
# S1.2: Read the 'exoTest.csv' file and display its first 5 rows using the 'head()' function.
exo_test_df = pd.read_csv("https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/kepler-exoplanets-dataset/exoTest.csv")
exo_test_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,-21.97,-23.17,-29.26,-33.99,-6.25,-28.12,-27.24,-32.28,-12.29,-16.57,-23.86,-5.69,9.24,35.52,81.2,116.49,133.99,148.97,174.15,187.77,215.3,246.8,-56.68,-56.68,-56.68,-52.05,-31.52,-31.15,-48.53,-38.93,...,-2.55,12.26,-7.06,-23.53,2.54,30.21,38.87,-22.86,-22.86,-4.37,2.27,-16.27,-30.84,-7.21,-4.27,13.6,15.62,31.96,49.89,86.93,86.93,42.99,48.76,22.82,32.79,30.76,14.55,10.92,22.68,5.91,14.52,19.29,14.44,-1.62,13.33,45.5,31.93,35.78,269.43,57.72
1,2,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,5458.8,5329.39,5191.38,5031.39,4769.89,4419.66,4218.92,3924.73,3605.3,3326.55,3021.2,2800.61,2474.48,2258.33,1951.69,1749.86,1585.38,1575.48,1568.41,1661.08,1977.33,2425.62,2889.61,3847.64,3847.64,3741.2,3453.47,3202.61,2923.73,2694.84,...,-3470.75,-4510.72,-5013.41,-3636.05,-2324.27,-2688.55,-2813.66,-586.22,-586.22,-756.8,-1090.23,-1388.61,-1745.36,-2015.28,-2359.06,-2516.66,-2699.31,-2777.55,-2732.97,1167.39,1167.39,1368.89,1434.8,1360.75,1148.44,1117.67,714.86,419.02,57.06,-175.66,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,150.46,85.49,-20.12,-35.88,-65.59,-15.12,16.6,-25.7,61.88,53.18,64.32,72.38,100.35,67.26,14.71,-16.41,-147.46,-231.27,-320.29,-407.82,-450.48,-146.99,-146.99,-146.99,-146.99,-166.3,-139.9,-96.41,-23.49,13.59,...,-35.24,-70.13,-35.3,-56.48,-74.6,-115.18,-8.91,-37.59,-37.59,-37.43,-104.23,-101.45,-107.35,-109.82,-126.27,-170.32,-117.85,-32.3,-70.18,314.29,314.29,314.29,149.71,54.6,12.6,-133.68,-78.16,-52.3,-8.55,-19.73,17.82,-51.66,-48.29,-59.99,-82.1,-174.54,-95.23,-162.68,-36.79,30.63
3,2,-826.0,-827.31,-846.12,-836.03,-745.5,-784.69,-791.22,-746.5,-709.53,-679.56,-706.03,-720.56,-631.12,-659.16,-672.03,-665.06,-667.94,-660.84,-672.75,-644.91,-680.53,-620.5,-570.34,-530.0,-537.88,-578.38,-532.34,-532.38,-491.03,-485.03,-427.19,-380.84,-329.5,-286.91,-283.81,-298.19,-271.03,-268.5,-209.56,...,16.5,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-1286.59,-14.94,64.09,8.38,45.31,100.72,91.53,46.69,20.34,30.94,-36.81,-33.28,-69.62,-208.0,-280.28,-340.41,-337.41,-268.03,-245.0,-230.62,-129.59,-35.47,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,2,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.9,-45.2,-5.04,14.62,-19.52,-11.43,-49.8,25.84,11.62,3.18,-9.59,14.49,8.82,32.32,-28.9,-28.9,-14.09,-30.87,-18.99,-38.6,-27.79,9.65,29.6,7.88,42.87,27.59,27.05,20.26,29.48,9.71,22.84,25.99,-667.55,...,-122.12,-32.01,-47.15,-56.45,-41.71,-34.13,-43.12,-53.63,-53.63,-53.63,-24.29,22.29,25.18,1.84,-22.29,-26.43,-12.12,-33.05,-21.66,-228.32,-228.32,-228.32,-187.35,-166.23,-115.54,-50.18,-37.96,-22.37,-4.74,-35.82,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84


The two DataFrames have the same columns and hold the same type of data.

**Q: Why do we have a training dataset and a test dataset?**

Ans: We use the training dataset to train a machine learning model so that the computer can learn from that data and then make predictions based on what it has learned. While learning, the computer tries to find patterns in the data, such as how one piece of information affects another and which pieces of information are the most critical.

The test dataset is used to test the accuracy of the model that you have built. The higher the accuracy on the test dataset, the better the model is at making predictions on data it has not seen before.

After creating the DataFrames, it is good practice to find out the number of rows and columns in each DataFrame. You can do this by using the `shape` attribute:

In [None]:
# S1.3: Find the number of rows and columns in the 'exo_train_df' DataFrame.
exo_train_df.shape

(5087, 3198)

So, there are 5087 rows and 3198 columns in the `exo_train_df` DataFrame.

In [None]:
# S1.4: How many rows and columns are there in the 'exo_test_df'?
exo_test_df.shape

(570, 3198)

There are 570 rows and 3198 columns in the `exo_test_df` DataFrame. 
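Since `shape` is a tuple of the form `(rows, columns)`, you can unpack it into two variables. Here is a small sketch on a made-up DataFrame; `toy_df`, `num_rows` and `num_cols` are illustrative names, not part of the dataset.

```python
import pandas as pd

# A tiny made-up DataFrame with 3 rows and 2 columns.
toy_df = pd.DataFrame({
    "LABEL": [2, 1, 1],
    "FLUX.1": [93.85, -38.88, 532.64],
})

# 'shape' is an attribute, so it is written without parentheses.
num_rows, num_cols = toy_df.shape
print(num_rows, num_cols)

# The same numbers can be obtained with 'len()'.
print(len(toy_df), len(toy_df.columns))
```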

---

#### Task 2: Check for the Missing Values

In most cases, we do not get complete datasets. They either have some values missing from the rows and columns or they do not have standardized values.

For example: If there is a date column in a dataset, then there is a huge chance that some of the dates are entered in the `DD-MM-YYYY` format, some in the `MM-DD-YYYY` format, and so on.

So, before going ahead with the analysis, it is a good idea to check whether the dataset has any missing values.

To check for missing values in a DataFrame, use the `isnull()` function. It returns a DataFrame of the same shape in which each value is `True` if the corresponding value in the original DataFrame is missing and `False` otherwise.

In [None]:
# S2.1: Check for the missing values using the 'isnull()' function.
exo_train_df.isnull()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5082,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5083,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5084,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5085,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False




As you can see, the `isnull()` function has returned the DataFrame with a lot of `False` values as an output. There are $5087\times3198=16268226$ values in the DataFrame. It is not feasible to check so many values manually. So, we need a better approach to check for missing values.
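Because the Kepler output contains only `False` values, it may help to see what `isnull()` returns when a value really is missing. The sketch below uses a made-up DataFrame (`toy_df` is an illustrative name) in which one value is deliberately set to `NaN`.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame in which one value is deliberately missing (NaN).
toy_df = pd.DataFrame({
    "FLUX.1": [93.85, np.nan, 532.64],
    "FLUX.2": [83.81, -33.83, 535.92],
})

# 'isnull()' returns a DataFrame of True/False values with the same shape.
missing_mask = toy_df.isnull()
print(missing_mask)

# A single yes/no answer: is at least one value missing anywhere?
print(missing_mask.to_numpy().any())
```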

We can call the `sum()` function on the output of `exo_train_df.isnull()`. It returns the number of `True` values in every column of the DataFrame. If a column does not have any missing values, its count is `0`; otherwise, it is a number greater than `0`:

In [None]:
# S2.2: Use the 'sum()' function to find the total number of True values in each column.
print(exo_train_df.isnull().sum())

LABEL        0
FLUX.1       0
FLUX.2       0
FLUX.3       0
FLUX.4       0
            ..
FLUX.3193    0
FLUX.3194    0
FLUX.3195    0
FLUX.3196    0
FLUX.3197    0
Length: 3198, dtype: int64


We can see that the columns displayed above have `0` missing values. However, the output is truncated because there are `3198` columns, so we cannot confirm just by looking at it that every column is free of missing values. If there were only a few columns, we would not need to go any further to check for the missing values.
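One possible way to shorten a long list of per-column counts is to keep only the columns whose count is greater than `0`. This is a sketch on a made-up DataFrame; the names `toy_df`, `missing_per_column` and `columns_with_gaps` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame: 'FLUX.1' has 1 missing value, 'FLUX.3' has 2, 'FLUX.2' has none.
toy_df = pd.DataFrame({
    "FLUX.1": [1.0, np.nan, 3.0],
    "FLUX.2": [4.0, 5.0, 6.0],
    "FLUX.3": [np.nan, np.nan, 9.0],
})

# Number of missing values in every column.
missing_per_column = toy_df.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
print(len(columns_with_gaps))
```

If the filtered result is empty, no column has a missing value.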

The `columns` attribute returns an index (an array-like object) of all the column names in a DataFrame. To get a column from a DataFrame, write the name of the DataFrame followed by the name of the column, in single or double quotes, inside square brackets.
For example: If you want to get the values of the `FLUX.1` column from `exo_train_df`, then write `exo_train_df['FLUX.1']`.
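The same two ideas can be tried on a small, made-up DataFrame. In this sketch, `toy_df`, `column_names` and `flux_1` are illustrative names only.

```python
import pandas as pd

# A tiny made-up DataFrame with two columns.
toy_df = pd.DataFrame({
    "LABEL": [2, 1],
    "FLUX.1": [93.85, -38.88],
})

# 'columns' holds the column names; 'list()' turns them into a plain Python list.
column_names = list(toy_df.columns)
print(column_names)

# Square brackets with a column name return that column as a Pandas series.
flux_1 = toy_df["FLUX.1"]
print(type(flux_1))
```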

In [None]:
# S2.3: View all the columns in the 'exo_train_df' DataFrame.
exo_train_df.columns

Index(['LABEL', 'FLUX.1', 'FLUX.2', 'FLUX.3', 'FLUX.4', 'FLUX.5', 'FLUX.6',
       'FLUX.7', 'FLUX.8', 'FLUX.9',
       ...
       'FLUX.3188', 'FLUX.3189', 'FLUX.3190', 'FLUX.3191', 'FLUX.3192',
       'FLUX.3193', 'FLUX.3194', 'FLUX.3195', 'FLUX.3196', 'FLUX.3197'],
      dtype='object', length=3198)


We, again, need a better approach. We will create a variable called `num_missing_values` to store the total number of values that are missing. Then, we will iterate through each column, and within each column, we will iterate through each item of the `isnull()` output. If an item is `True`, we will increase the value of `num_missing_values` by `1`; otherwise, we will not do anything. First, let's practise getting a single column from the DataFrame:

In [None]:
# S2.4: Get the values of the 'FLUX.1' column from a DataFrame.
exo_train_df['FLUX.1']

0         93.85
1        -38.88
2        532.64
3        326.52
4      -1107.21
         ...   
5082     -91.91
5083     989.75
5084     273.39
5085       3.82
5086     323.28
Name: FLUX.1, Length: 5087, dtype: float64

In [None]:
# S2.5: Get all the values of the 'FLUX.2' column.
exo_train_df['FLUX.2']

0         83.81
1        -33.83
2        535.92
3        347.39
4      -1112.59
         ...   
5082     -92.97
5083     891.01
5084     278.00
5085       2.09
5086     306.36
Name: FLUX.2, Length: 5087, dtype: float64

Now, using the `columns` attribute and the square-brackets method of getting all the items in a column, we will check for the missing values in the entire DataFrame:

In [None]:
# S2.6: Iterate through the 'exo_train_df' DataFrame to find the total number of missing values.
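num_missing_values = 0
for column in exo_train_df.columns:
    # 'isnull()' gives one True/False value for every item in the column.
    for is_missing in exo_train_df[column].isnull():
        if is_missing == True:
            num_missing_values += 1

print(num_missing_values)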

0


As you can see, there are no missing values in the DataFrame because the final value of the `num_missing_values` is `0`.

Now, let's find the number of non-missing values by replacing `True` with `False` in the above code and storing the result in a variable called `non_missing_values`:

In [None]:
# S2.7: In the above code replace 'True' with 'False' and get the number of non missing values.
# The output should be 16268226.
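non_missing_values = 0
for column in exo_train_df.columns:
    # Count every item for which 'isnull()' is False, i.e., the value is present.
    for is_missing in exo_train_df[column].isnull():
        if is_missing == False:
            non_missing_values += 1

print(non_missing_values)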

16268226


As we can see, the output is 16,268,226, which is the count of all the `False` values. This matches the total number of values in `exo_train_df` (16,268,226), so there are no missing values in the DataFrame.
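The nested loops above are useful for practising iteration, but the same totals can also be obtained without writing a loop. The sketch below is one possible shortcut, shown on a made-up DataFrame (`toy_df` is an illustrative name): calling `sum()` twice first adds up each column and then adds up the column totals.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame with 6 values, 2 of which are missing.
toy_df = pd.DataFrame({
    "FLUX.1": [1.0, np.nan, 3.0],
    "FLUX.2": [4.0, 5.0, np.nan],
})

# First 'sum()': per-column counts. Second 'sum()': grand total.
num_missing_values = int(toy_df.isnull().sum().sum())
non_missing_values = int(toy_df.notnull().sum().sum())

# 'size' is the total number of values (rows multiplied by columns).
print(num_missing_values, non_missing_values, toy_df.size)
```

The two counts always add up to `size`, which is the same check we did by hand for `exo_train_df`.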

---

#### Task 3: Slicing a DataFrame Using the `iloc[]` Function

We want to create scatter plots and line plots for 6 stars. For each of these stars, we will create a Pandas series that holds the values of one row, i.e., the `LABEL` value followed by the brightness levels from `FLUX.1` to `FLUX.3197`.

Effectively, we need to create 6 Pandas series.

Let's create a Pandas series for the first star in the `exo_train_df` DataFrame and store it in a variable called `star_zero`. To do this, we need to use the `iloc[]` function:

In [None]:
# S3.1: Create a Pandas series from a Pandas DataFrame using the 'iloc[]' function.
star_zero = exo_train_df.iloc[0,:]
star_zero

LABEL         2.00
FLUX.1       93.85
FLUX.2       83.81
FLUX.3       20.10
FLUX.4      -26.98
             ...  
FLUX.3193    92.54
FLUX.3194    39.32
FLUX.3195    61.42
FLUX.3196     5.08
FLUX.3197   -39.54
Name: 0, Length: 3198, dtype: float64

Inside the square brackets of the `iloc[]` function, the digit `0` indicates the first row (located at position `0`) of the `exo_train_df` DataFrame, and the colon (`:`) symbol means: collect the values from all the columns, from the first column to the last, i.e., from `LABEL` to `FLUX.3197`.

**Syntax:** 

`dataframe_name.iloc[row_position_start : row_position_end, column_position_start : column_position_end]`

In this syntax: 

- `row_position_start`: The position of the first row whose values you want to include in the new Pandas series or DataFrame.
- `row_position_end`: The position of the row at which the selection stops. The row at this position is **not** included.
- `column_position_start`: The position of the first column whose values you want to include in the new Pandas series or DataFrame.
- `column_position_end`: The position of the column at which the selection stops. The column at this position is **not** included.

Positions start at `0`. If you leave out a start or an end position, `iloc[]` starts from the first position or goes up to the last position, respectively. If you write a single integer instead of a range, only the row or column at that position is selected.
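The syntax above can be tried out on a small, made-up DataFrame. In this sketch, `toy_df`, `first_row` and `sub_df` are illustrative names, and the numbers are not real Kepler values.

```python
import pandas as pd

# A made-up DataFrame with 3 rows and 4 columns.
toy_df = pd.DataFrame(
    [[2, 10.0, 11.0, 12.0],
     [1, 20.0, 21.0, 22.0],
     [1, 30.0, 31.0, 32.0]],
    columns=["LABEL", "FLUX.1", "FLUX.2", "FLUX.3"],
)

# A single row position and all the columns: the result is a Pandas series.
first_row = toy_df.iloc[0, :]
print(first_row)

# Rows at positions 0 and 1, columns at positions 1 and 2.
# The end positions (2 for rows, 3 for columns) are not included.
sub_df = toy_df.iloc[0:2, 1:3]
print(sub_df)
```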

You can verify manually whether we have extracted the values from the first row or not by viewing the first 5 rows of the DataFrame using the `head()` function:

In [None]:
# S3.2: Compare the values of the 'star_zero' Pandas series with the first row in the 'exo_train_df' DataFrame.
exo_train_df.head()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,-160.17,-207.47,-154.88,-173.71,-146.56,-120.26,-102.85,-98.71,-48.42,-86.57,-0.84,-25.85,-67.39,-36.55,-87.01,-97.72,-131.59,-134.8,-186.97,-244.32,-225.76,-229.6,-253.48,-145.74,-145.74,30.47,-173.39,-187.56,-192.88,-182.76,...,-167.69,-56.86,7.56,37.4,-81.13,-20.1,-30.34,-320.48,-320.48,-287.72,-351.25,-70.07,-194.34,-106.47,-14.8,63.13,130.03,76.43,131.9,-193.16,-193.16,-89.26,-17.56,-17.31,125.62,68.87,100.01,-9.6,-25.39,-16.51,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,-73.38,-86.51,-74.97,-73.15,-86.13,-76.57,-61.27,-37.23,-48.53,-30.96,-8.14,-5.54,15.79,45.71,10.61,40.66,16.7,15.18,11.98,-203.7,19.13,19.13,19.13,19.13,19.13,17.02,-8.5,-13.87,-29.1,-34.29,...,-36.75,-15.49,-13.24,20.46,-1.47,-0.4,27.8,-58.2,-58.2,-72.04,-58.01,-30.92,-13.42,-13.98,-5.43,8.71,1.8,36.59,-9.8,-19.53,-19.53,-24.32,-23.88,-33.07,-9.03,3.75,11.61,-12.66,-5.69,12.53,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,484.39,469.66,462.3,492.23,441.2,483.17,481.28,535.31,554.34,562.8,540.14,576.34,551.67,556.69,550.86,577.33,562.08,577.97,530.67,553.27,538.33,527.17,532.5,273.66,273.66,292.39,298.44,252.64,233.58,171.41,...,-51.09,-33.3,-61.53,-89.61,-69.17,-86.47,-140.91,-84.2,-84.2,-89.09,-55.44,-61.05,-29.17,-63.8,-57.61,2.7,-31.25,-47.09,-6.53,14.0,14.0,-25.05,-34.98,-32.08,-17.06,-27.77,7.86,-70.77,-64.44,-83.83,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,323.33,311.14,326.19,313.11,313.89,317.96,330.92,341.1,360.58,370.29,369.71,339.0,336.24,319.31,321.56,308.02,296.82,279.34,275.78,289.67,281.33,285.37,281.87,88.75,88.75,67.71,74.46,69.34,76.51,80.26,...,-2.75,14.29,-14.18,-25.14,-13.43,-14.74,2.24,-31.07,-31.07,-50.27,-39.22,-51.33,-18.53,-1.99,10.43,-1.97,-15.32,-23.38,-27.71,-36.12,-36.12,-15.65,6.63,10.66,-8.57,-8.29,-21.9,-25.8,-29.86,7.42,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,-970.88,-933.3,-889.49,-888.66,-853.95,-800.91,-754.48,-717.24,-649.34,-605.71,-575.62,-526.37,-490.12,-458.73,-447.76,-419.54,-410.76,-404.1,-425.38,-397.29,-412.73,-446.49,-413.46,-1006.21,-1006.21,-973.29,-986.01,-975.88,-982.2,-953.73,...,-694.76,-705.01,-625.24,-604.16,-668.26,-742.18,-820.55,-874.76,-874.76,-853.68,-808.62,-777.88,-712.62,-694.01,-655.74,-599.74,-617.3,-602.98,-539.29,-672.71,-672.71,-594.49,-597.6,-560.77,-501.95,-461.62,-468.59,-513.24,-504.7,-521.95,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


Now, let's verify that the object we created using the `iloc[]` function is indeed a Pandas series:

In [None]:
# S3.4: Verify whether the series created is a Pandas series using the 'type()' function.
type(star_zero)

pandas.core.series.Series

As you can see, `star_zero` is a Pandas series.
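Note that a series created with `iloc[0, :]` also contains the `LABEL` value as its first item. If you want only the brightness levels, one option is to start the column range at position `1`. This is a sketch on a made-up DataFrame, assuming `LABEL` is the first column as in our dataset; `toy_df`, `star_with_label` and `star_flux_only` are illustrative names.

```python
import pandas as pd

# A made-up DataFrame in which 'LABEL' is the first column.
toy_df = pd.DataFrame(
    [[2, 93.85, 83.81, 20.10],
     [1, -38.88, -33.83, -58.54]],
    columns=["LABEL", "FLUX.1", "FLUX.2", "FLUX.3"],
)

# All the columns of the first row, including 'LABEL'.
star_with_label = toy_df.iloc[0, :]

# Columns from position 1 onwards, i.e., only the FLUX columns.
star_flux_only = toy_df.iloc[0, 1:]
print(star_flux_only)
print(len(star_with_label), len(star_flux_only))
```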

Similarly, create a Pandas series for the second star:

In [None]:
# S3.5: Using the 'iloc[]' function, create a Pandas series for the second star and store it in a variable called 'star_one'.
# Tip: 'star_one.head()' would display only the first 5 values of the series.
star_one = exo_train_df.iloc[1,:]
star_one

LABEL         2.00
FLUX.1      -38.88
FLUX.2      -33.83
FLUX.3      -58.54
FLUX.4      -40.09
             ...  
FLUX.3193     0.76
FLUX.3194   -11.70
FLUX.3195     6.46
FLUX.3196    16.00
FLUX.3197    19.93
Name: 1, Length: 3198, dtype: float64

In [None]:
# S3.6: Using the 'iloc[]' function, create a Pandas series for the third star and store it in a variable called 'star_two'.
star_two = exo_train_df.iloc[2,:]
star_two

LABEL          2.00
FLUX.1       532.64
FLUX.2       535.92
FLUX.3       513.73
FLUX.4       496.92
              ...  
FLUX.3193      5.06
FLUX.3194    -11.80
FLUX.3195    -28.91
FLUX.3196    -70.02
FLUX.3197    -96.67
Name: 2, Length: 3198, dtype: float64

We have created a Pandas series for each of the first three stars. Now, let's do the same for each of the last three stars in the DataFrame. First, display the last 5 rows to find their row indices:

In [None]:
# S3.7: Display the last 5 rows of the 'exo_train_df' DataFrame using the 'tail()' function.
exo_train_df.tail()

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,FLUX.10,FLUX.11,FLUX.12,FLUX.13,FLUX.14,FLUX.15,FLUX.16,FLUX.17,FLUX.18,FLUX.19,FLUX.20,FLUX.21,FLUX.22,FLUX.23,FLUX.24,FLUX.25,FLUX.26,FLUX.27,FLUX.28,FLUX.29,FLUX.30,FLUX.31,FLUX.32,FLUX.33,FLUX.34,FLUX.35,FLUX.36,FLUX.37,FLUX.38,FLUX.39,...,FLUX.3158,FLUX.3159,FLUX.3160,FLUX.3161,FLUX.3162,FLUX.3163,FLUX.3164,FLUX.3165,FLUX.3166,FLUX.3167,FLUX.3168,FLUX.3169,FLUX.3170,FLUX.3171,FLUX.3172,FLUX.3173,FLUX.3174,FLUX.3175,FLUX.3176,FLUX.3177,FLUX.3178,FLUX.3179,FLUX.3180,FLUX.3181,FLUX.3182,FLUX.3183,FLUX.3184,FLUX.3185,FLUX.3186,FLUX.3187,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
5082,1,-91.91,-92.97,-78.76,-97.33,-68.0,-68.24,-75.48,-49.25,-30.92,-11.88,-4.85,3.88,16.85,26.54,36.7,36.93,38.64,57.02,59.46,78.27,101.61,75.4,115.64,130.04,148.42,190.33,203.23,234.36,272.32,299.24,279.73,344.61,-88.61,-90.03,-68.04,-59.67,-47.47,-33.73,-26.52,...,123.45,93.07,80.64,-96.47,-99.12,-79.91,-46.36,-24.5,15.23,37.81,84.92,123.14,147.16,170.86,-95.59,-87.71,-78.36,-51.03,-56.14,-36.15,-26.43,-7.16,-2.19,24.14,41.41,26.69,41.38,61.37,81.24,103.68,139.95,147.26,156.95,155.64,156.36,151.75,-24.45,-17.0,3.23,19.28
5083,1,989.75,891.01,908.53,851.83,755.11,615.78,595.77,458.87,492.84,384.34,288.95,257.42,208.06,224.73,160.31,53.22,61.89,91.62,15.27,-4.7,9.75,37.2,46.91,43.0,55.41,175.08,133.64,218.98,277.05,270.98,112.98,562.45,182.81,166.41,182.28,138.97,196.53,112.61,79.7,...,-230.39,-225.14,-271.2,116.92,92.22,49.7,53.14,66.41,91.87,29.14,-83.09,-158.39,-278.08,-194.89,13.12,55.98,68.2,69.8,51.73,106.22,-1.19,165.2,83.97,59.61,42.58,86.58,84.11,103.98,88.31,36.64,-26.5,-4.84,-76.3,-37.84,-153.83,-136.16,38.03,100.28,-45.64,35.58
5084,1,273.39,278.0,261.73,236.99,280.73,264.9,252.92,254.88,237.6,238.51,225.68,199.75,177.53,211.27,190.35,226.61,204.55,222.45,204.51,196.45,130.41,155.12,108.21,92.93,99.46,76.12,73.34,29.25,10.76,22.68,46.29,-7.08,158.47,176.38,164.44,124.96,114.69,95.18,100.21,...,49.68,-52.3,-33.87,-14.22,-51.89,2.48,19.21,38.83,53.35,76.6,7.28,-54.26,-60.81,-14.06,16.64,29.17,35.81,28.45,48.44,47.64,37.64,77.5,61.58,18.71,22.32,60.58,25.0,7.96,-33.64,-23.42,-26.82,-53.89,-48.71,30.99,15.96,-3.47,65.73,88.42,79.07,79.43
5085,1,3.82,2.09,-3.29,-2.88,1.66,-0.75,3.85,-0.03,3.28,6.29,-4.33,5.12,-2.24,-3.27,-7.51,-4.22,-0.82,-1.34,-6.76,-9.87,-2.18,6.43,-6.42,-6.75,-3.84,-0.56,-5.66,-4.3,-7.31,-5.81,-11.12,-4.53,4.29,-0.64,3.72,-4.25,3.12,8.85,-2.78,...,2.25,7.69,2.57,-7.28,-6.67,-8.64,-4.62,-2.87,-1.23,-3.89,-5.0,-1.68,-7.25,-0.65,0.04,-5.86,-7.83,-9.63,-12.7,-0.65,-8.66,-2.84,-8.58,-3.63,-7.44,-4.98,-3.6,-12.21,-6.65,-5.05,10.86,-3.23,-5.1,-4.61,-9.82,-1.5,-4.65,-14.55,-6.41,-2.55
5086,1,323.28,306.36,293.16,287.67,249.89,218.3,188.86,178.93,118.93,130.68,104.5,63.03,72.07,198.89,570.46,208.08,26.42,44.18,39.85,71.55,81.54,48.87,61.1,49.82,38.5,28.64,20.1,15.07,33.55,36.0,-29.34,-47.82,186.07,112.91,98.15,79.33,55.77,25.82,10.99,...,-32.79,-17.46,-4.6,168.11,22.56,-34.79,-0.85,-5.64,-15.34,27.73,31.34,15.93,1.88,-5.05,7.5,27.73,-22.82,-40.24,-26.11,-39.12,-26.63,31.11,24.86,42.61,30.88,17.34,-9.08,23.18,22.94,13.89,71.19,0.97,55.2,-1.63,-5.5,-25.33,-41.31,-16.72,-14.09,27.82


In [None]:
# S3.8: Using the 'iloc[]' function, create a Pandas series for the last star and store it in a variable called 'star_5086'.
star_5086 = exo_train_df.iloc[5086,:]
star_5086

LABEL          1.00
FLUX.1       323.28
FLUX.2       306.36
FLUX.3       293.16
FLUX.4       287.67
              ...  
FLUX.3193    -25.33
FLUX.3194    -41.31
FLUX.3195    -16.72
FLUX.3196    -14.09
FLUX.3197     27.82
Name: 5086, Length: 3198, dtype: float64

In [None]:
# S3.9: Using the 'iloc[]' function, create a Pandas series for the second-last star and store it in a variable called 'star_5085'.
star_5085 = exo_train_df.iloc[5085,:]
star_5085

LABEL         1.00
FLUX.1        3.82
FLUX.2        2.09
FLUX.3       -3.29
FLUX.4       -2.88
             ...  
FLUX.3193    -1.50
FLUX.3194    -4.65
FLUX.3195   -14.55
FLUX.3196    -6.41
FLUX.3197    -2.55
Name: 5085, Length: 3198, dtype: float64

In [None]:
# S3.10: Using the 'iloc[]' function, create a Pandas series for the third-last star and store it in a variable called 'star_5084'.
star_5084 = exo_train_df.iloc[5084,:]
star_5084

LABEL          1.00
FLUX.1       273.39
FLUX.2       278.00
FLUX.3       261.73
FLUX.4       236.99
              ...  
FLUX.3193     -3.47
FLUX.3194     65.73
FLUX.3195     88.42
FLUX.3196     79.07
FLUX.3197     79.43
Name: 5084, Length: 3198, dtype: float64
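
The positions 5086, 5085 and 5084 used above were read from the output of the `tail()` function. The `iloc[]` function also accepts negative positions that count from the end, so the last rows can be selected without knowing the number of rows in advance. Here is a minimal sketch, assuming a toy DataFrame `toy_df` with made-up values in place of `exo_train_df`:

```python
import pandas as pd

# Toy stand-in for 'exo_train_df'. The values are invented for illustration only.
toy_df = pd.DataFrame({
    "LABEL": [2, 2, 1, 1, 1],
    "FLUX.1": [1.5, -2.0, 3.25, 4.0, -5.5],
})

# Position of the last row: the number of rows minus 1 (4 for 5 rows).
last_position = toy_df.shape[0] - 1
last_star = toy_df.iloc[last_position, :]

# The same row, selected with a negative position counted from the end.
also_last_star = toy_df.iloc[-1, :]

# The third-last row.
third_last_star = toy_df.iloc[-3, :]

print(last_star.equals(also_last_star))  # both selections give the same series
print(third_last_star.name)              # row index of the third-last row
```

With this approach, `exo_train_df.iloc[-1, :]` would select the same star as `exo_train_df.iloc[5086, :]`, provided that 5086 is the position of the last row, as the `tail()` output suggests.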

In this class, we learned about the transit method of detecting exoplanets, and how to import data from a CSV file to create a Pandas DataFrame.

We also learned how to check for missing values and how to slice a DataFrame using the `iloc[]` function.
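
As a quick recap, both ideas can be tried out on a small example. This is only a sketch: the DataFrame `toy_df` and its values are made up, and counting missing values with `isnull()` and `sum()` is one common way to do it, which may differ from the exact cells used earlier in the lesson:

```python
import pandas as pd

# Toy DataFrame with one missing value. The values are invented for illustration only.
toy_df = pd.DataFrame({
    "LABEL": [2, 1, 1],
    "FLUX.1": [1.5, float("nan"), 3.0],
    "FLUX.2": [0.5, 2.5, -1.0],
})

# Count the missing values in each column, then add them up.
missing_per_column = toy_df.isnull().sum()
total_missing = missing_per_column.sum()
print(total_missing)

# Slice the first two rows and all the columns. The end position (2) is excluded.
first_two_rows = toy_df.iloc[0:2, :]
print(first_two_rows.shape)
```

Note that a single position such as `iloc[0, :]` returns a Pandas series, whereas a slice such as `iloc[0:2, :]` returns a DataFrame.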

In the next class, we will learn how to create a scatter plot and a line plot to visualise data.

---