## Project 1: Predicting Exoplanets  

### David Kinney - DSC 680 - Spring 2021 - Professor Catherine Williams  

*********************************************

You’ve been working for about a week now. What’s happened?  

You will be submitting a 500-word check-in letting us know how things stand. This should include a discussion of each of the 5 items from the proposal (revisiting these topics), with 100 words each.

Topics:  

* Any surprises from your domain from these data?  
* Is the dataset what you thought it was?  
* Have you had to adjust your approach or research questions?  
* Is your method working?  
* What challenges are you having?    

*****************************************

### Table of Contents

* [Data Wrangling](#chapter1)
* [Dimensionality Reduction](#chapter2)
    * [Reducing Features: PCA](#section_2_1)
* [Exploratory Data Analysis](#chapter3)
     
****************************************************

In [1]:
# Import libraries
import seaborn as sns
import pandas as pd
pd.set_option('display.max_rows', None)
%matplotlib inline

In [2]:
# Read the Kepler Objects of Interest (KOI) dataset
df_koi = pd.read_csv('./data/cumulative_2021.03.16_17.10.21.csv')
print(df_koi.shape)
print(df_koi[1:2].T)

(9564, 141)
                                                                    1
rowid                                                               2
kepid                                                        10797460
kepoi_name                                                  K00752.02
kepler_name                                              Kepler-227 c
koi_disposition                                             CONFIRMED
koi_vet_stat                                                     Done
koi_vet_date                                                8/16/2018
koi_pdisposition                                            CANDIDATE
koi_score                                                       0.969
koi_fpflag_nt                                                       0
koi_fpflag_ss                                                       0
koi_fpflag_co                                                       0
koi_fpflag_ec                                                       0
koi_disp

***************************************************
### Data Wrangling <a class="anchor" id="chapter1"></a>

* Remove all empty columns  
* Remove every `err1` and `err2` (uncertainty) column as well. I *suspect* these columns add little predictive signal, and dropping them shrinks the feature set further.
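One rough way to probe the suspicion about the `err1`/`err2` columns is to compare each uncertainty with the value it belongs to. The sketch below is only an illustration: the toy frame, its values, and the helper name `relative_uncertainty` are assumptions made for this example, not part of the KOI dataset or of the notebook's pipeline. A small median ratio suggests that the uncertainties are tiny relative to the measurements, but it does not prove that they carry no predictive signal.

```python
import pandas as pd

# Toy stand-in for the KOI table; the values are illustrative, not real KOI rows.
toy = pd.DataFrame({
    "koi_period": [9.488, 54.418, 19.899, 1.737],
    "koi_period_err1": [2.8e-05, 2.5e-04, 1.5e-05, 2.6e-07],
    "koi_period_err2": [-2.8e-05, -2.5e-04, -1.5e-05, -2.6e-07],
})


def relative_uncertainty(df, suffix="_err1"):
    """Median of |error / value| for each column that has a matching error column."""
    result = {}
    for err_col in df.filter(regex=f"{suffix}$").columns:
        base_col = err_col[: -len(suffix)]
        if base_col in df.columns:
            ratio = (df[err_col] / df[base_col]).abs()
            result[base_col] = ratio.median()
    return pd.Series(result, dtype=float)


rel = relative_uncertainty(toy)
print(rel)
```

On the real data the same helper could be called as `relative_uncertainty(df_koi)`; columns whose ratio is not small would be candidates to keep rather than drop.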

In [3]:
df_describe = df_koi.describe()
df_describe.T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| rowid | 9564.0 | 4782.5 | 2761.033 | 1.0 | 2391.75 | 4782.5 | 7173.25 | 9564.0 |
| kepid | 9564.0 | 7690628.0 | 2653459.0 | 757450.0 | 5556034.0 | 7906892.0 | 9873066.0 | 12935140.0 |
| koi_score | 8054.0 | 0.4808294 | 0.4769285 | 0.0 | 0.0 | 0.334 | 0.998 | 1.0 |
| koi_fpflag_nt | 9564.0 | 0.2085947 | 4.76729 | 0.0 | 0.0 | 0.0 | 0.0 | 465.0 |
| koi_fpflag_ss | 9564.0 | 0.2327478 | 0.4226049 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| koi_fpflag_co | 9564.0 | 0.1975115 | 0.3981423 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| koi_fpflag_ec | 9564.0 | 0.1200335 | 0.3250176 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| koi_period | 9564.0 | 75.67136 | 1334.744 | 0.2418425 | 2.733684 | 9.752831 | 40.71518 | 129995.8 |
| koi_period_err1 | 9110.0 | 0.002148471 | 0.008242604 | 0.0 | 5.35e-06 | 3.52e-05 | 0.000276 | 0.173 |
| koi_period_err2 | 9110.0 | -0.002148471 | 0.008242604 | -0.173 | -0.000276 | -3.52e-05 | -5.35e-06 | 0.0 |


In [4]:
df_koi.count()

rowid                 9564
kepid                 9564
kepoi_name            9564
kepler_name           2360
koi_disposition       9564
koi_vet_stat          9564
koi_vet_date          9564
koi_pdisposition      9564
koi_score             8054
koi_fpflag_nt         9564
koi_fpflag_ss         9564
koi_fpflag_co         9564
koi_fpflag_ec         9564
koi_disp_prov         9564
koi_comment           8355
koi_period            9564
koi_period_err1       9110
koi_period_err2       9110
koi_time0bk           9564
koi_time0bk_err1      9110
koi_time0bk_err2      9110
koi_time0             9564
koi_time0_err1        9110
koi_time0_err2        9110
koi_eccen             9201
koi_eccen_err1           0
koi_eccen_err2           0
koi_longp                0
koi_longp_err1           0
koi_longp_err2           0
koi_impact            9201
koi_impact_err1       9110
koi_impact_err2       9110
koi_duration          9564
koi_duration_err1     9110
koi_duration_err2     9110
koi_ingress              0
k

In [5]:
# Remove variables with no data
df_koi_cleaned = df_koi.dropna(axis=1, how='all')

In [6]:
# Remove the err columns
df_koi_cleaned = df_koi_cleaned.drop(
    columns=df_koi_cleaned.filter(regex='_err').columns)

In [7]:
df_koi_cleaned.shape

(9564, 78)

*********************************
### Dimensionality Reduction <a class="anchor" id="chapter2"></a>

#### Reducing Features using Principal Components Analysis (PCA) <a class="anchor" id="section_2_1"></a>

#### Approach  

The **Kepler Objects of Interest (KOI)** dataset contains eight categories of variables. The first three categories--Identification, Exoplanet Archive Information and Project Disposition--are descriptive. The remaining five are measures used to identify an object as an exoplanet.

* Transit Properties  
* Threshold-Crossing Event (TCE) Information  
* Stellar Parameters  
* KIC Parameters  
* Pixel-Based KOI Vetting Statistics

In light of this, I will take two passes at **dimensionality reduction**. The first pass splits the data into sub-datasets by category and applies `PCA` to each one. The second pass merges those results back into a single dataset and applies `PCA` again to the (hopefully) smaller merged dataset.
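The two-pass idea can be sketched on synthetic data before it is tried on the KOI columns. Everything in this example is an assumption made for illustration: the helper name `reduce_block`, the two synthetic "categories", and the 99% variance threshold. It shows the shape of the workflow, not a tuned implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reduce_block(data, variance=0.99):
    """Standardize one block of numeric columns, then keep enough
    principal components to explain `variance` of the total variance."""
    scaled = StandardScaler().fit_transform(data)
    # A fractional n_components requires svd_solver="full" in scikit-learn.
    return PCA(n_components=variance, svd_solver="full").fit_transform(scaled)


rng = np.random.default_rng(0)
n = 200

# Two synthetic "categories". Each is generated from a few latent signals,
# so the columns inside a category are strongly correlated.
block_a = rng.normal(size=(n, 2)) @ rng.normal(size=(2, 5))  # 5 columns, rank 2
block_b = rng.normal(size=(n, 1)) @ rng.normal(size=(1, 4))  # 4 columns, rank 1

# Pass 1: reduce each category on its own, then merge the component scores.
merged = np.hstack([reduce_block(block) for block in (block_a, block_b)])

# Pass 2: reduce the merged component scores once more.
final = reduce_block(merged)

print("Original columns:", block_a.shape[1] + block_b.shape[1])
print("After pass 1:", merged.shape[1])
print("After pass 2:", final.shape[1])
```

One design caveat: the components from pass 1 are uncorrelated within each category, so pass 2 can only remove redundancy that exists between categories.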

In [8]:
transit_columns = ['koi_period', 'koi_time0bk', 'koi_time0', 'koi_eccen', 'koi_impact', 
                   'koi_duration', 'koi_depth', 'koi_ror', 'koi_srho', 'koi_fittype', 
                   'koi_prad', 'koi_sma', 'koi_incl', 'koi_teq', 'koi_insol', 'koi_dor', 
                   'koi_limbdark_mod', 'koi_ldm_coeff4', 'koi_ldm_coeff3', 'koi_ldm_coeff2', 
                   'koi_ldm_coeff1', 'koi_parm_prov']
tce_columns = ['koi_max_sngle_ev', 'koi_max_mult_ev', 'koi_model_snr', 'koi_count', 
               'koi_num_transits', 'koi_tce_plnt_num', 'koi_tce_delivname', 'koi_quarters', 
               'koi_trans_mod', 'koi_datalink_dvr', 'koi_datalink_dvs']
stellar_columns = ['koi_steff', 'koi_slogg', 'koi_smet', 'koi_srad', 'koi_smass', 'koi_sparprov']
kic_columns = ['ra', 'dec', 'koi_kepmag', 'koi_gmag', 'koi_rmag', 'koi_imag', 'koi_zmag', 
               'koi_jmag', 'koi_hmag', 'koi_kmag']
pixel_columns = ['koi_fwm_sra', 'koi_fwm_sdec', 'koi_fwm_srao', 'koi_fwm_sdeco', 'koi_fwm_prao', 
                 'koi_fwm_pdeco', 'koi_fwm_stat_sig', 'koi_dicco_mra', 'koi_dicco_mdec', 
                 'koi_dicco_msky', 'koi_dikco_mra', 'koi_dikco_mdec', 'koi_dikco_msky']


df_transit = df_koi_cleaned[transit_columns]
# print(df_transit.sample(2))
df_tce = df_koi_cleaned[tce_columns]
# print(df_tce.sample(2))
df_stellar = df_koi_cleaned[stellar_columns]
# print(df_stellar.sample(2))
df_kic = df_koi_cleaned[kic_columns]
# print(df_kic.sample(2))
df_pixel = df_koi_cleaned[pixel_columns]
# print(df_pixel.sample(2))

In [None]:
sns.pairplot(df_transit, height=1.5)

In [18]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

ModuleNotFoundError: No module named 'sklearn'

In [None]:
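# PCA function
def pca(df):
    # Keep numeric columns and complete rows only: StandardScaler cannot
    # scale string columns and PCA rejects NaN values
    numeric = df.select_dtypes(include='number').dropna()

    # Standardize the features matrix
    features = StandardScaler().fit_transform(numeric)

    # Create a PCA that retains 99% of the variance; a fractional
    # n_components requires svd_solver='full'
    pca_model = PCA(n_components=0.99, whiten=True, svd_solver='full')
    features_pca = pca_model.fit_transform(features)

    return features, features_pca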
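To make the "retains 99% of the variance" setting more concrete, the sketch below compares the component count that scikit-learn picks for a fractional `n_components` with the count read off the cumulative explained-variance ratio. The synthetic matrix `X` and its deliberately redundant columns are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=300)  # near-duplicate of column 0
X[:, 4] = X[:, 1] - X[:, 2]                      # exact linear combination

scaled = StandardScaler().fit_transform(X)

# Fit with all components and accumulate the explained-variance ratios.
full = PCA(svd_solver="full").fit(scaled)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 99%.
k = int(np.searchsorted(cumulative, 0.99, side="right") + 1)

# Let scikit-learn choose the count from the same threshold.
auto = PCA(n_components=0.99, svd_solver="full").fit(scaled)

print("Cumulative explained variance:", np.round(cumulative, 4))
print("Components chosen by hand:", k)
print("Components chosen by PCA:", auto.n_components_)
```

Because two of the six columns are redundant, both counts should come out below six. Inspecting `cumulative` on the real sub-datasets is a quick way to see how much each category can actually be compressed.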

In [19]:
features, features_pca = pca(df_transit)
print('Original number of features: {}'.format(features.shape[1]))
print('Reduced number of features: {}'.format(features_pca.shape[1]))

NameError: name 'pca' is not defined

*****************************
### Exploratory Data Analysis <a class="anchor" id="chapter3"></a>



