# Intro

**Sloan Digital Sky Survey - DR18**

In this machine learning project we classify observations from Data Release (DR) 18 of the Sloan Digital Sky Survey (SDSS). Each observation is described by 17 feature columns and 1 class column, which labels the observation as one of:
* a STAR
* a GALAXY
* a QSO (quasi-stellar object, also called a quasar)

We will use **XGBClassifier** from the xgboost library.

# Load packages

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.decomposition import PCA
from xgboost import XGBClassifier, plot_importance
from sklearn.metrics import accuracy_score, classification_report
import time
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

# Load the data

In [2]:
dataset = pd.read_csv('/home/vignesh-nadar/vikky/My Work/TempPro/data/DR18.csv')

# Data Exploration and Analysis

In [3]:
dataset.head()

,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1.237665e+18,214.775024,26.231389,18.91452,17.82512,17.40111,17.12318,16.99686,4649,301,3,203,2.394797e+18,GALAXY,0.015198,2127,53859,27
1,1.237665e+18,190.979819,29.630396,18.93076,17.07127,16.13052,15.68037,15.3483,4649,301,3,61,2.51867e+18,GALAXY,0.103481,2237,53828,117
2,1.237665e+18,222.9017,24.065934,18.96398,18.00796,17.67222,17.4277,17.2847,4649,301,3,254,2.414085e+18,GALAXY,0.039642,2144,53770,567
3,1.237665e+18,211.236033,27.007872,19.07693,18.07678,17.46506,17.30033,16.90555,4649,301,3,181,2.39035e+18,GALAXY,0.157561,2123,53793,235
4,1.237662e+18,219.069523,47.597442,17.68583,16.78106,16.49471,16.39366,16.37046,3918,301,4,154,1.884871e+18,STAR,-0.0003,1674,53464,416


In [4]:
# Show number of rows and columns (m, n)
dataset.shape

(100000, 18)

**Check for null or missing values in the data**

In [5]:
# Largest per-column count of missing values; 0 means no column has any
null = dataset.isnull().sum().max()

if null == 0:
    print('There are no missing values')
else:
    print('There are missing values')

There are no missing values
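
The check above only reports whether anything is missing at all. A minimal sketch of a per-column report follows; the frame `toy` and its values are invented for illustration and are not taken from DR18.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only.
toy = pd.DataFrame({
    "ra": [214.7, np.nan, 222.9],
    "redshift": [0.015, 0.103, np.nan],
    "class": ["GALAXY", "GALAXY", "STAR"],
})

# Count missing values per column and keep only the columns that have any.
missing_per_column = toy.isnull().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

On the real data the same two lines could be applied to `dataset` to see which columns, if any, need imputation.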




Let's print a concise summary of the dataset with the info() method: the index range, each column's dtype and non-null count, and the memory usage.

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 18 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   objid      100000 non-null  float64
 1   ra         100000 non-null  float64
 2   dec        100000 non-null  float64
 3   u          100000 non-null  float64
 4   g          100000 non-null  float64
 5   r          100000 non-null  float64
 6   i          100000 non-null  float64
 7   z          100000 non-null  float64
 8   run        100000 non-null  int64  
 9   rerun      100000 non-null  int64  
 10  camcol     100000 non-null  int64  
 11  field      100000 non-null  int64  
 12  specobjid  100000 non-null  float64
 13  class      100000 non-null  object 
 14  redshift   100000 non-null  float64
 15  plate      100000 non-null  int64  
 16  mjd        100000 non-null  int64  
 17  fiberid    100000 non-null  int64  
dtypes: float64(10), int64(7), object(1)
memory usage: 13.7+ MB



Next, the describe() method gives a quick statistical summary of the numeric columns: count, mean, standard deviation, minimum, maximum and quartiles. Missing (NaN) values are skipped automatically, so the result gives a good first picture of how each column is distributed.

In [7]:
dataset.describe()

,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,redshift,plate,mjd,fiberid
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,1.237663e+18,177.373612,25.093908,18.637564,17.407615,16.882776,16.627404,16.469266,3974.35278,301.0,3.27411,186.30117,2.919558e+18,0.17113,2593.00446,53917.13331,341.859
std,7270949000000.0,78.078701,20.577267,0.832058,0.986055,1.133157,1.210276,1.281926,1692.899859,0.0,1.620975,140.801444,2.495879e+18,0.438558,2216.771531,1549.002722,217.118311
min,1.237646e+18,0.013061,-19.495456,10.61181,9.668339,9.005167,8.848403,8.947795,109.0,301.0,1.0,11.0,2.994897e+17,-0.004268,266.0,51608.0,1.0
25%,1.237658e+18,136.201282,6.760243,18.21135,16.851275,16.195873,15.86504,15.62123,2826.0,301.0,2.0,85.0,1.335398e+18,0.0,1186.0,52734.0,160.0
50%,1.237662e+18,180.324646,24.050795,18.87241,17.51611,16.891335,16.600505,16.43053,3900.0,301.0,3.0,152.0,2.355424e+18,0.045669,2092.0,53727.5,327.0
75%,1.237667e+18,224.603842,40.420272,19.27337,18.056393,17.586313,17.34585,17.23527,5061.0,301.0,5.0,248.0,3.276507e+18,0.09541,2910.0,54586.0,502.0
max,1.237681e+18,359.999615,84.490494,19.59995,19.97499,31.9901,32.14147,29.38374,8162.0,301.0,6.0,982.0,1.412681e+19,7.011245,12547.0,58932.0,1000.0
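
In the summary above, `rerun` has a standard deviation of 0 and the same minimum and maximum (301), so it carries no information for a classifier. Here is a small sketch of how such constant columns could be detected programmatically; the frame `toy` is a made-up stand-in, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the dataset.
toy = pd.DataFrame({
    "rerun": [301, 301, 301, 301],
    "camcol": [3, 3, 4, 6],
    "redshift": [0.015198, 0.103481, 0.039642, -0.0003],
})

# A column with a single distinct value is constant.
n_unique = toy.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)
```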


In [8]:
dataset.columns.values

array(['objid', 'ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'rerun',
       'camcol', 'field', 'specobjid', 'class', 'redshift', 'plate',
       'mjd', 'fiberid'], dtype=object)

In [9]:
# Drop identifier and observing-run metadata columns (note that 'field' is kept)
dataset.drop(['objid', 'run', 'rerun', 'camcol', 'plate', 'mjd', 'specobjid', 'fiberid'], axis=1, inplace=True)
dataset.head(1)

,ra,dec,u,g,r,i,z,field,class,redshift
0,214.775024,26.231389,18.91452,17.82512,17.40111,17.12318,16.99686,203,GALAXY,0.015198


**Target Column**

In [10]:
dataset['class'].value_counts()  # returns a Series containing counts of unique values.

class
GALAXY    51141
STAR      38227
QSO       10632
Name: count, dtype: int64

Let's use **LabelEncoder** to encode the target labels as integers between 0 and n_classes-1. Here the target (class) has three unique values: GALAXY, STAR and QSO.

In [11]:
encoder = LabelEncoder()
dataset['class'] = encoder.fit_transform(dataset['class'])

In [12]:
dataset['class'].value_counts()

class
0    51141
2    38227
1    10632
Name: count, dtype: int64
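
`LabelEncoder` assigns codes in sorted order of the class names, which is consistent with the counts above (0 = GALAXY, 1 = QSO, 2 = STAR). Below is a short sketch, on a made-up list of labels, of how the mapping can be inspected and reversed. The names `toy_labels` and `toy_encoder` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not rows from the dataset.
toy_labels = ["GALAXY", "STAR", "QSO", "GALAXY", "STAR"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(toy_labels)

# classes_ holds the original labels; position i is the label encoded as i.
mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels.
decoded = list(toy_encoder.inverse_transform(codes))
print(decoded)
```

Keeping the fitted encoder around makes it possible to report predictions as class names rather than integers.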

In [13]:
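# Reduce the five magnitudes (u, g, r, i, z) to three principal components
pca = PCA(n_components=3)
ugriz = pca.fit_transform(dataset[['u', 'g', 'r', 'i', 'z']])

# Append the components to the dataset and drop the original magnitude columns
df_fe = pd.concat((dataset, pd.DataFrame(ugriz)), axis=1)
df_fe.rename({0: 'PCA_1', 1: 'PCA_2', 2: 'PCA_3'}, axis=1, inplace = True)
df_fe.drop(['u', 'g', 'r', 'i', 'z'], axis=1, inplace=True)
df_fe.head()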

,ra,dec,field,class,redshift,PCA_1,PCA_2,PCA_3
0,214.775024,26.231389,203,0,0.015198,1.020793,0.024649,-0.015076
1,190.979819,29.630396,61,0,0.103481,-1.4988,0.819651,-0.005017
2,222.9017,24.065934,254,0,0.039642,1.545204,-0.068397,-0.048019
3,211.236033,27.007872,181,0,0.157561,1.236576,0.23242,-0.159778
4,219.069523,47.597442,154,2,-0.0003,-0.854628,-0.884592,0.009633
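
The cell above compresses the five magnitudes into three principal components. Whether three components are enough can be checked with the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below uses synthetic columns that share one underlying signal as a stand-in for the magnitudes (the data and the names `base`, `bands` and `toy_pca` are assumptions). On the real data one would inspect `pca.explained_variance_ratio_` instead.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: five columns sharing one signal plus small independent noise.
base = rng.normal(loc=17.0, scale=1.0, size=(500, 1))
bands = base + rng.normal(scale=0.1, size=(500, 5))

toy_pca = PCA(n_components=3)
components = toy_pca.fit_transform(bands)

# Fraction of the total variance captured by each component, largest first.
ratios = toy_pca.explained_variance_ratio_
print(ratios, ratios.sum())
```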


# Data preprocessing

In [14]:
scaler = MinMaxScaler()
sdss = scaler.fit_transform(df_fe.drop('class', axis=1))

In [15]:
X_train, X_test, y_train, y_test = train_test_split(sdss, df_fe['class'], test_size=0.33)
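
In the cells above the scaler is fitted on all rows before the split, so statistics of the test rows influence the training features, and the split itself is not seeded. One possible variant, sketched here on synthetic data, splits first, fits `MinMaxScaler` on the training part only, and fixes `random_state`. The `stratify` argument keeps the class proportions similar in both parts. The arrays, the seed 42 and the class proportions are illustrative assumptions, not the notebook's actual run.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic features and imbalanced three-class labels (assumed for illustration).
X = rng.normal(size=(300, 4))
y = rng.choice([0, 1, 2], size=300, p=[0.5, 0.1, 0.4])

# Split first, with a fixed seed and stratification on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

toy_scaler = MinMaxScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # reuse them; test values may fall outside [0, 1]
```

Applying this to the notebook would change the reported accuracy slightly, since the model would no longer see test-set statistics during preprocessing.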

# The Model

**XGBoost** ("Extreme Gradient Boosting") is an optimized, distributed gradient boosting library designed for efficient and scalable training of machine learning models. It is an ensemble method: it combines the predictions of many weak models (decision trees) into a stronger prediction, and it has become one of the most widely used machine learning libraries.
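
To illustrate the idea of combining weak models, here is a minimal hand-rolled gradient boosting sketch for squared-error regression, in which each depth-1 tree is fitted to the residuals of the current ensemble. It is a simplified teaching example, not XGBoost's actual algorithm, which adds regularization, second-order gradient information and many optimizations. The data, learning rate and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic one-dimensional regression problem (assumed for illustration).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
n_rounds = 50

# Start from a constant prediction, then add one weak learner per round.
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - prediction  # negative gradient of 0.5 * (y - prediction)^2
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(X, residuals)
    prediction += learning_rate * stump.predict(X)
    trees.append(stump)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}, after: {final_mse:.4f}")
```

Each stump alone is a poor model, but their shrunken sum tracks the target much more closely, which is the sense in which boosting turns weak learners into a strong one.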

In [16]:
xgb = XGBClassifier(n_estimators=100)

# Time the training step
training_start = time.perf_counter()
xgb.fit(X_train, y_train)
training_end = time.perf_counter()

# Time the prediction step
prediction_start = time.perf_counter()
preds = xgb.predict(X_test)
prediction_end = time.perf_counter()

# Accuracy as the percentage of correctly classified test observations
acc_xgb = accuracy_score(y_test, preds) * 100
xgb_train_time = training_end - training_start
xgb_prediction_time = prediction_end - prediction_start
print("XGBoost's prediction accuracy is: %3.2f%%" % acc_xgb)
print("Time consumed for training: %4.3f seconds" % xgb_train_time)
print("Time consumed for prediction: %6.5f seconds" % xgb_prediction_time)

XGBoost's prediction accuracy is: 99.19%
Time consumed for training: 27.287 seconds
Time consumed for prediction: 0.18119 seconds
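
Because the classes are imbalanced (QSO makes up roughly 11% of the rows), overall accuracy can hide weak performance on the rarest class. `classification_report` and `confusion_matrix` from scikit-learn give per-class numbers. The sketch below uses small hand-made label arrays, not the notebook's predictions. In the notebook one would pass `y_test` and `preds` instead.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels using the notebook's encoding: 0 = GALAXY, 1 = QSO, 2 = STAR.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Per-class precision, recall and F1 score.
report = classification_report(y_true, y_pred, target_names=["GALAXY", "QSO", "STAR"])
print(report)
```

In this toy example the overall accuracy is 80%, yet only one of the two QSO rows is recovered, which is exactly the kind of detail a single accuracy figure does not show.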


In [17]:
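# A notebook cell displays only its last expression, so preds[:10] is not shown;
# the output below is the first ten true labels from y_test.
preds[:10]
y_test[:10]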

76838    2
44417    0
8178     2
89930    2
38473    0
1751     0
75815    2
58133    0
29149    1
71084    0
Name: class, dtype: int64

In [18]:
importances = pd.DataFrame({
    'Feature': df_fe.drop('class', axis=1).columns,
    'Importance': xgb.feature_importances_
})
importances = importances.sort_values(by='Importance', ascending=False)
importances = importances.set_index('Feature')
importances

Feature,Importance
redshift,0.956204
PCA_3,0.014687
PCA_2,0.008868
PCA_1,0.007052
ra,0.004561
dec,0.004415
field,0.004213


In [19]:
import pickle

# Save the trained model to a pickle file; the with-block closes the file reliably
with open("/home/vignesh-nadar/vikky/My Work/TempPro/model.pkl", "wb") as f:
    pickle.dump(xgb, f)
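
A saved model is only useful if it can be restored later. Below is a minimal round-trip sketch with `pickle`, using a plain dictionary as a stand-in for the trained classifier and a temporary directory instead of the notebook's path (both are assumptions for illustration). Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model object.
stand_in_model = {"name": "xgb", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")

    # Write the object to disk.
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back into a new object.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```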