**DOCUMENTATION LINKS**

sklearn Decision Trees User Guide:

https://scikit-learn.org/stable/modules/tree.html#tree

sklearn.tree.DecisionTreeClassifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

sklearn.model_selection.train_test_split:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html






In [0]:
# Import relevant libraries
from google.colab import files
import pandas as pd

In [5]:
# Image and tabular data can be downloaded from: 
# https://portal.nersc.gov/project/dessn/autoscan/

# Upload image data as zipped folder and unzip in Colab

#image_data = !unzip "Y1 2013 Data.zip"

# Open tabular data using pandas

tab_data = pd.read_csv("autoscan_features.2.csv", comment = '#')
len(tab_data)

898963

In [0]:
# Column Descriptions
  # ID : Linked with image thumbnails.
  # OBJECT_TYPE : 0 for artifact (is a SNe), 1 for non-artifact (not a SNe).
  # AMP : Amplitude of fit that produces Gauss.
  # A_IMAGE : Semi-major axis of object from SExtractor.
  # A_REF : Semi-major axis of the nearest source in the galaxy coadd catalog, 
  #         if one exists within 5". Else imputed.
  # BAND : ??????????
  # B_IMAGE : Semi-minor axis of object from SExtractor on I^d.
  # B_REF : Semi-minor axis of the nearest source in the galaxy coadd catalog, 
  #         if one exists within 5". Else imputed.
  # CCDID : The numerical ID of the CCD on which the detection was registered.
  # COLMEDS : The maximum of the median pixel values of each colum on B^d.
  # DIFFSUMRN : ??? The sum of the matrix elements in a 5 x 5 element box 
  #         centered on the detection location on R^d.
  # ELLIPTICITY : The ellipicity of the detection on I^d using a_image and 
  #         b_image from SExtractor.
  # FLAGS : Numerical representation of SExtractor extraction flags on I^d.
  # FLUX_RATIO : Ratio of the flux in a 5-pxel circular aperture centered on the
  #         location of the detection of I^d to the absolute value of the flux
  #         in a 5-pixel circular at the same location on I^d.
  # GAUSS : X^2 from fitting a spherical 2D Gaussian to a 15 x 15 pixel cutout
  #         around the detection on B^d.
  # GFLUX : ?????????
  # L1 : Equation that can be found on the paper in "Table 2-continued".
  # LACOSMIC : Another equation that can be found on the paper in 
  #         "Table 2-continued".
  # MAG : The magnitude of the object from SExtractor on I^d.
  # MAGDIFF : If a source is found within 5” of the location of the object in 
  #         the galaxy coadd catalog, the difference between mag and the 
  #         magnitude of the nearby source. Else, the difference between mag and
  #         the limiting magnitude of the parent image from which the I^d
  #         cutout was generated.
  # MAGLIM : True if there is no nearby galaxy coadd source, false otherwise.
  # MAG_FROM_LIMIT : Limiting magnitude of the parent image from which the I^d 
  #         cutout was generated minus mag.
  # MAG_REF : The magnitude of the nearest source in the galaxy coadd catalog, 
  #         if one exists within 5” of the detection on I^d. Else imputed.
  # MASKFRAC : The fraction of I^d that is masked.
  # MIN_DISTANCE_TO_EDGE_IN_NEW : Distance in pixels to the nearest edge of the 
  #         detector array on the parent image from which the I^d cutout was 
  #         generated.
  # N2SIG3 : Number of matrix elements in a 5×5 element block centered on the 
  #         detection on R^d with values less than -2.
  # N2SIG3SHIFT	: The number of matrix elements with values greater than or
  #         equal to 2 in the central 5 × 5 element block of R^d minus the number
  #         of matrix elements with values greater than or equal to 2 in the 
  #         central 5 × 5 element block of R^t.
  # N2SIG5 : Number of matrix elements in a 7×7 element block centered on the 
  #         detection on R^d with values less than -2.
  # N2SIG5SHIFT : The number of matrix elements with values greater than or
  #         equal to 2 in the central 7 × 7 element block of R^d minus the number
  #         of matrix elements with values greater than or equal to 2 in the 
  #         central 7 × 7 element block of R^t.
  # N3SIG3 : Number of matrix elements in a 5×5 element block centered on the 
  #         detection on R^d with values less than -3.
  # N3SIG3SHIFT : The number of matrix elements with values greater than or
  #         equal to 3 in the central 5 × 5 element block of R^d minus the number
  #         of matrix elements with values greater than or equal to 3 in the 
  #         central 5 × 5 element block of R^t.
  # N3SIG5 : Number of matrix elements in a 7×7 element block centered on the 
  #         detection on R^d with values less than -3
  # N3SIG5SHIFT	: The number of matrix elements with values greater than or
  #         equal to 3 in the central 7 × 7 element block of R^d minus the number
  #         of matrix elements with values greater than or equal to 3 in the 
  #         central 7 × 7 element block of R^t.
  # NN_DIST_RENORM : The distance from the detection to the nearest source in the
  #         galaxy coadd catalog, if one exists within 5”. Else imputed.
  # NUMNEGRN : The number of negative matrix elements in a 7 × 7 element box 
  #         centered on the detection in R^d.
  # SCALE : Scale parameter of fit that produced gauss.
  # SNR	: The flux from a 35 × 35-pixel PSF model-fit to the object on I^d 
  #         divided by the uncertainty from the fit.
  # SPREADERR_MODEL : Uncertainty on spread model.
  # SPREAD_MODEL : SPREAD MODEL output parameter from SExtractor on I^d.

#print(tab_data.head(10))
#print(tab_data['OBJECT_TYPE'])

In [7]:
# Import libraries
import sklearn as skl
from sklearn.model_selection import train_test_split
from sklearn import tree
import numpy as np

# Read Excel file using pandas and exclude columns 'ID' and 'BAND'
dat = pd.read_csv("autoscan_features.2.csv", comment = '#')
data = dat.drop(['ID', 'BAND'], axis=1)
#print(data.head(10))

# Shuffle data set using pandas (frac=1 lets you consider whole set as a sample)
shuffled_dat = data.sample(frac = 1)

# Reduce size of data being fed to classifier. Take first 100 values of shuffled
# data
small_dat = shuffled_dat.head(100)

# Define class labels for training sample
y_dat = small_dat['OBJECT_TYPE']
#print(y_dat, len(y_dat))

# Define training samples for classifier
x_dat = small_dat.drop(['OBJECT_TYPE'], axis=1)
#print(x_dat, len(x_dat))

# Split data into training and testing samples, with defined proportions of dataset
x_train, x_test, y_train, y_test = train_test_split(x_dat, y_dat, train_size = 0.3, test_size = 0.7)
#print(x_train, x_test, y_train, y_test)

# Replaces NaN values in data with zeroes
where_are_nans = np.isnan(x_train)
x_train[where_are_nans] = 0
#np.nan_to_num(x_train)
#print(x_train)

print(x_train.dtypes) #Changed here because it was crashing

#where_are_strings = np.where(x_train.dtype.type = np.string_)
#x_train[where_are_strings] = 0

# Build a decision tree classifier and fit data to decision tree
dtc = tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth='none',
                                min_samples_split=2, min_samples_leaf=1, 
                                min_weight_fraction_leaf=0.0, max_features=None,
                                max_leaf_nodes=None, min_impurity_decrease=0.0, 
                                min_impurity_split=None, class_weight=None, presort=False)
dtc = dtc.fit(x_train, y_train)

# why is it not working
# Error:  Input contains NaN, infinity or a value too large for dtype('float32').
# Issue with values in training data
# Assuming it is large values
# incorrect dtype?

# Once we do get it to fit, how do we know what columns weigh more / are more
# important? 
# Do we need to visualize using graphviz in order to do this?
  # Would be surprised if it didn't output in an array - probably dont have to visualize



# We need to consider that dataset changes every time we run it.
# for loop, change all strings to float
# figure out where test data go
# check for other code to see how they do it
# if we do boosted we can compare the accuracy 




AMP                            float64
A_IMAGE                        float64
A_REF                          float64
B_IMAGE                        float64
B_REF                          float64
CCDID                            int64
COLMEDS                        float64
DIFFSUMRN                      float64
ELLIPTICITY                    float64
FLAGS                            int64
FLUX_RATIO                     float64
GAUSS                          float64
GFLUX                          float64
L1                             float64
LACOSMIC                       float64
MAG                            float64
MAGDIFF                        float64
MAGLIM                           int64
MAG_FROM_LIMIT                 float64
MAG_REF                        float64
MAG_REF_ERR                    float64
MASKFRAC                       float64
MIN_DISTANCE_TO_EDGE_IN_NEW    float64
N2SIG3                           int64
N2SIG3SHIFT                      int64
N2SIG5                   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._where(-key, value, inplace=True)


TypeError: ignored