# Wrangle (Acquire and Prepare)

This notebook contains all steps and decisions in the data acquisition and data preparation phases of the pipeline.

## The Required Modules

Below are all the modules needed to run the code cells in this notebook.

In [1]:
import pandas as pd

## Data Acquisition

### Download With Kaggle API

The next cell downloads the data using the Kaggle API. The download is a .zip file, so we unzip it and then read the resulting .csv file into a pandas DataFrame.

This assumes that the Kaggle API is installed on the local machine. To download the data without the Kaggle API, skip ahead to the "Download Without Kaggle API" section. To install the Kaggle API, follow the instructions in the Kaggle API repository README [here](https://github.com/Kaggle/kaggle-api) or follow the steps below:
- On the command line, run:

```bash
    pip install kaggle
```

- Log in to Kaggle, go to Your Profile -> Account -> API, and click "Create New API Token".
- Move the downloaded kaggle.json file to ~/.kaggle/kaggle.json:

```bash
    mkdir -p ~/.kaggle
    mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json
```

- Restrict the file's permissions so that other users on the machine cannot read the token. On the command line, run:

```bash
    chmod 600 ~/.kaggle/kaggle.json
```
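
Before running the download, it can help to confirm that the steps above took effect. The following is a minimal sketch, not part of the original notebook: the function name `check_kaggle_setup` and its `config_dir` parameter are hypothetical, and it only checks the default `~/.kaggle/kaggle.json` location described above.

```python
import os
import shutil
import stat
from pathlib import Path


def check_kaggle_setup(config_dir=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    # Hypothetical helper: defaults to the ~/.kaggle directory used in the steps above.
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token = config_dir / "kaggle.json"
    problems = []

    # The `kaggle` executable should be on PATH after `pip install kaggle`.
    if shutil.which("kaggle") is None:
        problems.append("kaggle command not found on PATH")

    if not token.is_file():
        problems.append(f"API token not found at {token}")
    elif os.name == "posix":
        # chmod 600 leaves no permission bits set for group or others.
        mode = stat.S_IMODE(token.stat().st_mode)
        if mode & 0o077:
            problems.append(f"{token} is accessible to other users (mode {oct(mode)})")

    return problems


for problem in check_kaggle_setup():
    print(problem)
```

If nothing is printed, the command-line tool and the token file are both where the steps above put them.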

Now we can run the commands below to download the data.

In [2]:
# Download the source data, unzip the downloaded file, and remove the zip file.

!kaggle datasets download nasa/kepler-exoplanet-search-results -f cumulative.csv
!unzip cumulative.csv.zip
!rm cumulative.csv.zip

Downloading cumulative.csv.zip to /Users/peanutbutterandchocolate/Repositories/kepler-exoplanet-analysis
 86%|████████████████████████████████▊     | 1.00M/1.16M [00:00<00:00, 6.39MB/s]
100%|██████████████████████████████████████| 1.16M/1.16M [00:00<00:00, 7.15MB/s]
Archive:  cumulative.csv.zip
  inflating: cumulative.csv          


In [3]:
kepler = pd.read_csv('cumulative.csv')
kepler.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 50 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              9564 non-null   int64  
 1   kepid              9564 non-null   int64  
 2   kepoi_name         9564 non-null   object 
 3   kepler_name        2294 non-null   object 
 4   koi_disposition    9564 non-null   object 
 5   koi_pdisposition   9564 non-null   object 
 6   koi_score          8054 non-null   float64
 7   koi_fpflag_nt      9564 non-null   int64  
 8   koi_fpflag_ss      9564 non-null   int64  
 9   koi_fpflag_co      9564 non-null   int64  
 10  koi_fpflag_ec      9564 non-null   int64  
 11  koi_period         9564 non-null   float64
 12  koi_period_err1    9110 non-null   float64
 13  koi_period_err2    9110 non-null   float64
 14  koi_time0bk        9564 non-null   float64
 15  koi_time0bk_err1   9110 non-null   float64
 16  koi_time0bk_err2   9110 

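Now we have our exoplanet data, and we can already see that some cleaning is needed: several columns contain null values (for example, kepler_name has only 2294 non-null entries out of 9564 rows). We'll discuss this more in the preparation section.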

### Download Without Kaggle API

For reproducibility we must consider that some people may not have the Kaggle API installed and may not want to install and set it up. So let's go through the steps of downloading the data without the Kaggle API.

<b><i>Pending:
    Include here either instructions for downloading the data, or find a way to automate the process. Ideally the process will be automated, but if it proves too cumbersome then it may have to be expected that the data is manually downloaded.</i></b>
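
Until that is settled, one possible stopgap is to look for a manually downloaded file and fail with a clear message if it is missing. This is only a sketch under assumptions: the helper name `require_local_csv` is hypothetical, the default file name mirrors the `cumulative.csv` used above, and the URL is the manual-download page referenced in the error message later in this notebook. A file downloaded from that page may not have the same layout as the Kaggle copy.

```python
from pathlib import Path

# Manual-download page referenced elsewhere in this notebook.
MANUAL_DOWNLOAD_URL = "https://exoplanetarchive.ipac.caltech.edu/docs/data.html"


def require_local_csv(path="cumulative.csv"):
    """Return the path to a manually downloaded CSV, or raise with instructions."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} was not found. Download the data manually from "
            f"{MANUAL_DOWNLOAD_URL} and save it as {csv_path}."
        )
    return csv_path
```

With the file in place, `pd.read_csv(require_local_csv())` would load it the same way as the cells above.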

### Automate The Download Procedure

Now let's automate the acquisition procedure for ease of use.

In [9]:
# We'll need a way to run the Kaggle API commands from Python. We can use the os module for this.

import os

In [5]:
# # This cell can be executed to remove the existing cumulative.csv file.
# !rm cumulative.csv

In [6]:
# This code will download the kepler data, unzip the file, remove the zip file, and read the data into a dataframe

# os.system('kaggle datasets download nasa/kepler-exoplanet-search-results -f cumulative.csv')
# os.system('unzip cumulative.csv.zip')
# os.system('rm cumulative.csv.zip')
# kepler = pd.read_csv('cumulative.csv')
# kepler.info()

100%|██████████| 1.16M/1.16M [00:00<00:00, 8.87MB/s]


Downloading cumulative.csv.zip to /Users/whoami/Repositories/kepler-exoplanet-analysis

Archive:  cumulative.csv.zip
  inflating: cumulative.csv          
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 50 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              9564 non-null   int64  
 1   kepid              9564 non-null   int64  
 2   kepoi_name         9564 non-null   object 
 3   kepler_name        2294 non-null   object 
 4   koi_disposition    9564 non-null   object 
 5   koi_pdisposition   9564 non-null   object 
 6   koi_score          8054 non-null   float64
 7   koi_fpflag_nt      9564 non-null   int64  
 8   koi_fpflag_ss      9564 non-null   int64  
 9   koi_fpflag_co      9564 non-null   int64  
 10  koi_fpflag_ec      9564 non-null   int64  
 11  koi_period         9564 non-null   float64
 12  koi_period_err1    9110 non-null   float64
 13  koi_period_er

That works, but what if the Kaggle API is not installed? How can we ensure that a helpful error message is raised?

In [25]:
# os.system will return 0 if the command was successful. We will need to check if the return value was 0,
# and otherwise raise an exception.

output = os.system('invalid command')
if output != 0:
    raise SystemError('''
        An error occurred when running "kaggle datasets download".
        You must either follow the instructions for installing the Kaggle API
        here https://github.com/Kaggle/kaggle-api or manually download the 
        data from here https://exoplanetarchive.ipac.caltech.edu/docs/data.html
    ''')

sh: invalid: command not found


SystemError: 
        An error occurred when running "kaggle datasets download".
        You must either follow the instructions for installing the Kaggle API
        here https://github.com/Kaggle/kaggle-api or manually download the 
        data from here https://exoplanetarchive.ipac.caltech.edu/docs/data.html
    

In [28]:
# and if we use a valid command ...

output = os.system('echo hello')
if output != 0:
    raise SystemError('''
        An error occurred when running "kaggle datasets download".
        You must either follow the instructions for installing the Kaggle API
        here https://github.com/Kaggle/kaggle-api or manually download the 
        data from here https://exoplanetarchive.ipac.caltech.edu/docs/data.html
    ''')

hello
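
An alternative to checking the return value of `os.system` is the standard library's `subprocess.run`, which can raise on failure and capture the command's error output. The following is a minimal sketch, not the approach used in the rest of this notebook; the helper name `run_command` is hypothetical.

```python
import subprocess


def run_command(args):
    """Run a command given as a list of arguments and return its standard output.

    Raises RuntimeError if the executable is missing or exits with a non-zero status.
    """
    try:
        completed = subprocess.run(args, check=True, capture_output=True, text=True)
    except FileNotFoundError as error:
        # Raised when the executable itself cannot be found (e.g. kaggle is not installed).
        raise RuntimeError(f"Command not found: {args[0]}") from error
    except subprocess.CalledProcessError as error:
        # Raised by check=True when the command exits with a non-zero status.
        raise RuntimeError(
            f"Command {args} exited with status {error.returncode}: {error.stderr.strip()}"
        ) from error
    return completed.stdout
```

For the download step, this could be called as `run_command(['kaggle', 'datasets', 'download', 'nasa/kepler-exoplanet-search-results', '-f', 'cumulative.csv'])`. Passing the arguments as a list avoids going through the shell, and a missing executable is reported separately from a command that ran but failed.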


Now let's use the Acquire class to wrap the acquisition process in its own object for ease of use.

In [4]:
from acquire import Acquire

In [6]:
# # This cell can be executed to remove the existing cumulative.csv file.
# !rm cumulative.csv

In [7]:
# This class will allow us to easily acquire the data from wherever we need it.

class AcquireKeplerData(Acquire):
    # The Acquire class handles most of the work for us. We only need to override the
    # read_from_source method.
    def read_from_source(self):
        shell_output = os.system('kaggle datasets download nasa/kepler-exoplanet-search-results -f cumulative.csv')
        if shell_output != 0:
            raise SystemError('''
                An error occurred when running "kaggle datasets download".
                You must either follow the instructions for installing the Kaggle API
                here https://github.com/Kaggle/kaggle-api or manually download the 
                data from here https://exoplanetarchive.ipac.caltech.edu/docs/data.html
            ''')
        
        os.system('unzip cumulative.csv.zip')
        os.system('rm cumulative.csv.zip')
        df = pd.read_csv('cumulative.csv')
        os.system('rm cumulative.csv')
        
        return df

In [10]:
# Let's test it.

kepler = AcquireKeplerData('kepler.csv').get_data()
kepler.info()

100%|██████████| 1.16M/1.16M [00:00<00:00, 12.1MB/s]


Downloading cumulative.csv.zip to /Users/peanutbutterandchocolate/Repositories/kepler-exoplanet-analysis

Archive:  cumulative.csv.zip
  inflating: cumulative.csv          
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Data columns (total 50 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   rowid              9564 non-null   int64  
 1   kepid              9564 non-null   int64  
 2   kepoi_name         9564 non-null   object 
 3   kepler_name        2294 non-null   object 
 4   koi_disposition    9564 non-null   object 
 5   koi_pdisposition   9564 non-null   object 
 6   koi_score          8054 non-null   float64
 7   koi_fpflag_nt      9564 non-null   int64  
 8   koi_fpflag_ss      9564 non-null   int64  
 9   koi_fpflag_co      9564 non-null   int64  
 10  koi_fpflag_ec      9564 non-null   int64  
 11  koi_period         9564 non-null   float64
 12  koi_period_err1    9110 non-null   float64


Now we can move this code into its own file to use in the final report or anywhere else it might be needed.
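
One portability note for that file: the `unzip` and `rm` shell commands used above are not available on every system. The same two steps can be done with Python's standard library, as in the sketch below. The helper name `extract_and_remove` is hypothetical and is not part of the Acquire class.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir, delete the archive,
    and return the names of the extracted members."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    # Equivalent of `rm cumulative.csv.zip` once extraction has succeeded.
    zip_path.unlink()
    return names
```

Calling `extract_and_remove('cumulative.csv.zip')` after the download would leave `cumulative.csv` in the working directory, just as the shell commands do.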

## Data Preparation