# Purpose
### Feature Selection by Optimizing the FSJaya Algorithm

The challenge is to modify the existing algorithm so that it can be used for feature selection.
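
For orientation, here is a minimal sketch of one binary Jaya update step. It assumes the usual Jaya rule (move toward the best candidate and away from the worst) followed by a sigmoid transfer that turns continuous positions into 0/1 feature masks. The names `jaya_step` and `binarize` are illustrative; the actual logic lives in `project_part_1` and `project_part_2`, which are not shown in this notebook.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def jaya_step(population, scores, rng):
    """One Jaya update: X' = X + r1*(best - |X|) - r2*(worst - |X|).

    population: (n_candidates, n_features) array of continuous positions.
    scores: objective value per candidate; lower is better (e.g. error rate).
    """
    best = population[np.argmin(scores)]
    worst = population[np.argmax(scores)]
    r1 = rng.random(population.shape)
    r2 = rng.random(population.shape)
    abs_pop = np.abs(population)
    return population + r1 * (best - abs_pop) - r2 * (worst - abs_pop)

def binarize(positions, rng):
    """Sigmoid transfer: keep feature j when sigmoid(x_j) exceeds a random draw."""
    return (sigmoid(positions) > rng.random(positions.shape)).astype(int)

rng = np.random.default_rng(0)
population = rng.normal(size=(10, 6))   # 10 candidates, 6 features
scores = rng.random(10)                 # placeholder objective values
new_population = jaya_step(population, scores, rng)
masks = binarize(new_population, rng)   # 1 = feature selected, 0 = dropped
```

In the standard Jaya scheme a candidate is replaced only if its new position scores better; that greedy acceptance step is left out here for brevity.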

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from project_part_2 import sigmoid_function
import project_part_1 as pj1
import classifiers as cl

## Step 1: Load Dataset

The datasets selected for evaluation are:
- madelon
- musk

Both datasets are accessible from [add the name of the website]

### Madelon Dataset

- Data type: non-sparse
- Number of features: 500
- Number of examples and check-sums:

               Pos_ex  Neg_ex  Tot_ex      Check_sum
      Train      1000    1000    2000   488083511.00
      Valid       300     300     600   146395833.00
      Test        900     900    1800   439209553.00
      All        2200    2200    4400  1073688897.00

In [2]:
# Training split; the files are space-separated and have no header row.
madelon = pd.read_csv('madelon/madelon_train.data', sep=' ', header=None)
madelon.info()

# Test split.
madelon_test = pd.read_csv('madelon/madelon_test.data', sep=' ', header=None)
madelon_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Columns: 501 entries, 0 to 500
dtypes: float64(1), int64(500)
memory usage: 7.6 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Columns: 501 entries, 0 to 500
dtypes: float64(1), int64(500)
memory usage: 6.9 MB
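
`info()` reports 501 columns although Madelon has 500 features. A likely cause, assuming each line in the `.data` files ends with a trailing space, is that `sep=' '` makes pandas parse one extra, empty column (the single `float64` column). The sketch below reproduces this on a tiny made-up sample and drops the empty column.

```python
import io
import pandas as pd

# Two made-up rows of three integer features; each line ends with a trailing space.
sample = "485 477 537 \n483 458 460 \n"

df = pd.read_csv(io.StringIO(sample), sep=' ', header=None)
print(df.shape)            # one more column than there are features
print(df.dtypes.tolist())  # the last column holds only NaN, stored as float64

# Drop every column that contains only NaN values.
features = df.dropna(axis=1, how='all')
print(features.shape)
```

This is why the later cells drop column 500 before training.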


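### Musk Dataset

The description below matches the `clean1` file (476 conformations), which this notebook loads as the test set; the training set is loaded from `clean2` (6598 rows).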

4. Relevant Information:
   This dataset describes a set of **92** molecules of which **47** are judged
   by human experts to be musks and the remaining 45 molecules are
   judged to be non-musks.  The goal is to learn to predict whether
   new molecules will be musks or non-musks.  However, the 166 features
   that describe these molecules depend upon the exact shape, or
   conformation, of the molecule.  Because bonds can rotate, a single
   molecule can adopt many different shapes.  To generate this data
   set, the low-energy conformations of the molecules were generated
   and then filtered to remove highly similar conformations. This left
   476 conformations.  Then, a feature vector was extracted that
   describes each conformation.

   This many-to-one relationship between feature vectors and molecules
   is called the **"multiple instance problem"**.  When learning a
   classifier for this data, the classifier should classify a molecule
   as "musk" if ANY of its conformations is classified as a musk.  A
   molecule should be classified as "non-musk" if NONE of its
   conformations is classified as a musk.

5. Number of Instances  **476**

6. Number of Attributes **168** plus the class.

7. For Each Attribute:
   
   Attribute:           Description:
   molecule_name:       Symbolic name of each molecule.  Musks have names such
                        as MUSK-188.  Non-musks have names such as
                        NON-MUSK-jp13.
   conformation_name:   Symbolic name of each conformation.  These
                        have the format MOL_ISO+CONF, where MOL is the
                        molecule number, ISO is the stereoisomer
                        number (usually 1), and CONF is the
                        conformation number. 
   f1 through f162:     These are "distance features" along rays (see
                        paper cited above).  The distances are
                        measured in hundredths of Angstroms.  The
                        distances may be negative or positive, since
                        they are actually measured relative to an
                        origin placed along each ray.  The origin was
                        defined by a "consensus musk" surface that is
                        no longer used.  Hence, any experiments with
                        the data should treat these feature values as
                        lying on an arbitrary continuous scale.  In
                        particular, the algorithm should not make any
                        use of the zero point or the sign of each
                        feature value. 
   f163:                This is the distance of the oxygen atom in the
                        molecule to a designated point in 3-space.
                        This is also called OXY-DIS.
   f164:                OXY-X: X-displacement from the designated
                        point.
   f165:                OXY-Y: Y-displacement from the designated
                        point.
   f166:                OXY-Z: Z-displacement from the designated
                        point. 
   class:               0 => non-musk, 1 => musk

   Please note that the molecule_name and conformation_name attributes
   should not be used to predict the class.

8. Missing Attribute Values: none.

9. Class Distribution: 
   Musks:     47
   Non-musks: 45
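
The multiple-instance rule described in item 4 (a molecule is a musk if ANY of its conformations is classified as a musk) can be applied on top of per-conformation predictions. A minimal sketch, using made-up predictions rather than values from the dataset:

```python
import pandas as pd

# Hypothetical per-conformation predictions (1 = musk, 0 = non-musk).
conformations = pd.DataFrame({
    'molecule_name': ['MUSK-188', 'MUSK-188', 'NON-MUSK-jp13', 'NON-MUSK-jp13'],
    'predicted':     [0,          1,          0,               0],
})

# A molecule is labelled musk if any of its conformations is predicted musk,
# which for 0/1 predictions is the group-wise maximum.
molecule_pred = conformations.groupby('molecule_name')['predicted'].max()
print(molecule_pred)
```

The cells in this notebook score each conformation separately. Molecule-level scoring would require setting column 0 (the molecule name) aside before it is dropped from the feature matrix.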

In [3]:
# clean2 (6598 conformations) is used as the training set.
musk = pd.read_csv('musk/clean2.data/clean2.data', header=None)
musk.info()

# clean1 (476 conformations) is used as the test set.
musk_test = pd.read_csv('musk/clean1.data/clean1.data', header=None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6598 entries, 0 to 6597
Columns: 169 entries, 0 to 168
dtypes: float64(1), int64(166), object(2)
memory usage: 8.5+ MB


## Step 2: Prepare Data for Classifiers

According to the shared paper, "A jaya algorithm based wrapper method for optimal feature selection in supervised classification", the following classifiers are to be trained on the data:

- NB (Naive Bayes)
- KNN (K-Nearest Neighbors)
- LDA (Linear Discriminant Analysis)
- RT (Regression Tree)
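
The `classifiers` module (`cl`) is not shown in this notebook. As a rough sketch of how these four classifiers could be trained and scored with scikit-learn, the example below assumes default hyperparameters and uses synthetic data in place of Madelon or Musk; the helper name `evaluate_all` is hypothetical. The tree is a `DecisionTreeRegressor` because that is what the printed outputs report, so its predictions are rounded to 0/1 before scoring.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor

def evaluate_all(train_x, train_y, test_x, test_y):
    """Fit NB, KNN, LDA and RT on the training split; return test accuracy per model."""
    models = {
        'NB': GaussianNB(),
        'KNN': KNeighborsClassifier(),
        'LDA': LinearDiscriminantAnalysis(),
        'RT': DecisionTreeRegressor(random_state=0),
    }
    results = {}
    for name, model in models.items():
        model.fit(train_x, train_y)
        # Round so the regressor's output can be scored as a class label.
        pred = np.rint(model.predict(test_x)).astype(int)
        results[name] = accuracy_score(test_y, pred)
    return results

# Synthetic stand-in data: 300 samples, 20 features, binary labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
scores = evaluate_all(X[:200], y[:200], X[200:], y[200:])
print(scores)
```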

#### Step 2a: Separating features and labels into X and Y variables (training dataset)

In [4]:
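# Musk: columns 0 and 1 hold the molecule and conformation names, column 168 the class.
Musk_X = musk.drop([0, 1, 168], axis=1)
Musk_X.columns = np.arange(len(Musk_X.columns))  # renumber the features from 0
Musk_Y = musk[168]

# Madelon: column 500 is the extra column that info() reports as float64.
Madelon_X = madelon.drop(500, axis=1)
# NOTE: the labels are assumed to be 1000 positives followed by 1000 negatives;
# this ordering is not checked against a labels file.
Y_labels = np.hstack([np.ones(1000), np.zeros(1000)])
Madelon_Y = pd.Series(Y_labels)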
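
The label vector above is built by assuming that the first 1000 training rows are positive and the last 1000 negative. The Madelon distribution normally ships its labels in a separate file (for example `madelon_train.labels`, with values +1 and -1), so it is worth checking that assumption against the file. A minimal sketch, where the helper `load_binary_labels` and the file path in the usage comment are assumptions and not part of the notebook:

```python
import io
import numpy as np
import pandas as pd

def load_binary_labels(source):
    """Read one label per line (+1 / -1) and map it to 1.0 / 0.0."""
    raw = np.loadtxt(source, dtype=int, ndmin=1)
    return pd.Series((raw == 1).astype(float))

# Demonstration on an in-memory stand-in for a labels file.
demo_labels = load_binary_labels(io.StringIO("1\n-1\n-1\n1\n"))
print(demo_labels.tolist())

# Usage with a real file (hypothetical path, adjust as needed):
# Madelon_Y = load_binary_labels('madelon/madelon_train.labels')
# assert len(Madelon_Y) == len(Madelon_X)
```

If the labels in the file are not ordered as assumed, classifiers trained on the synthetic label vector would be expected to perform near chance level.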


#### Step 2b: Separating features and labels into X and Y variables (test dataset)

In [5]:
# Same column handling as for the training split.
Musk_x = musk_test.drop([0, 1, 168], axis=1)
Musk_x.columns = np.arange(len(Musk_x.columns))
Musk_y = musk_test[168]

Madelon_x = madelon_test.drop(500, axis=1)
# NOTE: as for training, the label order (900 positives, then 900 negatives) is assumed.
y_labels = np.hstack([np.ones(900), np.zeros(900)])
Madelon_y = pd.Series(y_labels)

# Features and labels must have the same number of rows.
assert len(Musk_x) == len(Musk_y)

print("Madelon TRAINING \n dataset: # of features: " , Madelon_X.shape[1], 
      '\n Number of Measurements: ', Madelon_X.shape[0], 
      'Y_shape: ', Madelon_Y.shape)

print("Madelon TEST \n dataset: # of features: " , Madelon_x.shape[1], 
      '\n Number of Measurements: ', Madelon_x.shape[0], 
      'Y_shape: ', Madelon_y.shape)

print("Musk TRAINING \n dataset: # of features: " , Musk_X.shape[1], 
      '\n Number of Measurements: ', Musk_X.shape[0],
      'Y_shape: ', Musk_Y.shape)

print("Musk TEST \n dataset: # of features: " , Musk_x.shape[1], 
      '\n Number of Measurements: ', Musk_x.shape[0], 
      'Y_shape: ', Musk_y.shape)

Madelon TRAINING 
 dataset: # of features:  500 
 Number of Measurements:  2000 Y_shape:  (2000,)
Madelon TEST 
 dataset: # of features:  500 
 Number of Measurements:  1800 Y_shape:  (1800,)
Musk TRAINING 
 dataset: # of features:  166 
 Number of Measurements:  6598 Y_shape:  (6598,)
Musk TEST 
 dataset: # of features:  166 
 Number of Measurements:  476 Y_shape:  (476,)


## Step 3: Classification

#### Step 3a: Train Classifiers without Feature Selection

In [6]:
madelon_return = cl.data_classification(train_x=Madelon_X, train_y=Madelon_Y,
                                        test_x=Madelon_x, test_y=Madelon_y)

GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.4959    0.4667    0.4808       900
         1.0     0.4963    0.5256    0.5105       900

    accuracy                         0.4961      1800
   macro avg     0.4961    0.4961    0.4957      1800
weighted avg     0.4961    0.4961    0.4957      1800

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.5085    0.4978    0.5031       900
         1.0     0.5082    0.5189    0.5135       900

    accuracy                         0.5083      1800
   macro avg     0.5083

In [7]:
musk_return = cl.data_classification(train_x=Musk_X, train_y=Musk_Y,
                                     test_x=Musk_x, test_y=Musk_y)


GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.6724    0.8773    0.7613       269
         1.0     0.7360    0.4444    0.5542       207

    accuracy                         0.6891       476
   macro avg     0.7042    0.6609    0.6578       476
weighted avg     0.7000    0.6891    0.6712       476

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.7477    0.6059    0.6694       269
         1.0     0.5891    0.7343    0.6538       207

    accuracy                         0.6618       476
   macro avg     0.6684

## Testing the Proposed Strategy
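
The cells below call `pj1.feature_optimization`, whose source is not part of this notebook. As a hedged illustration of what a wrapper objective with `obj_function='ER'` might compute, assuming `'ER'` stands for classification error rate, the sketch below scores a binary feature mask by the test error of a simple 1-nearest-neighbour classifier on the selected columns. The helper names, the toy data, and the choice of 1-NN are assumptions made for this example.

```python
import numpy as np

def one_nn_predict(train_x, train_y, test_x):
    """Predict each test row's label from its nearest training row (Euclidean)."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[d2.argmin(axis=1)]

def error_rate(mask, train_x, train_y, test_x, test_y):
    """Error rate (1 - accuracy) using only the features where mask == 1."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:  # an empty subset cannot classify anything
        return 1.0
    pred = one_nn_predict(train_x[:, cols], train_y, test_x[:, cols])
    return float(np.mean(pred != test_y))

# Toy data: feature 0 separates the classes, feature 1 is large-scale noise.
rng = np.random.default_rng(0)
y_train = np.repeat([0, 1], 50)
y_test = np.repeat([0, 1], 20)
X_train = np.column_stack([y_train + 0.1 * rng.normal(size=100),
                           50 * rng.normal(size=100)])
X_test = np.column_stack([y_test + 0.1 * rng.normal(size=40),
                          50 * rng.normal(size=40)])

er_all = error_rate(np.array([1, 1]), X_train, y_train, X_test, y_test)
er_informative = error_rate(np.array([1, 0]), X_train, y_train, X_test, y_test)
print(er_all, er_informative)
```

A wrapper method evaluates many such masks and keeps the ones with the lowest objective value, which is why dropping a noisy feature can lower the error.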

In [9]:
# Repeat the feature optimization 10 times on the Musk data.
for a in range(10):
    print('================================\n -----------RESULT # ------- ', a)
    dsp = pj1.feature_optimization(trainX=Musk_X,
                                   trainY=Musk_Y,
                                   testX=Musk_x,
                                   testY=Musk_y,
                                   population=10,
                                   obj_function='ER')

    # Calling the optimizer returns the reduced training and test feature sets.
    MuskX_OP, Muskx_OP = dsp()
    print('Number of Features ', MuskX_OP.shape[1])
    musk_return = cl.data_classification(train_x=MuskX_OP, train_y=Musk_Y,
                                         test_x=Muskx_OP, test_y=Musk_y)



 -----------RESULT # -------  0
Processing .....
Only one population left
Number of Features  80
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.7085    0.8401    0.7687       269
         1.0     0.7261    0.5507    0.6264       207

    accuracy                         0.7143       476
   macro avg     0.7173    0.6954    0.6975       476
weighted avg     0.7161    0.7143    0.7068       476

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.8485    0.8327    0.8405       269
         1.0     0.7877    0.8068  

Processing .....
Only one population left
Number of Features  153
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.7170    0.8476    0.7768       269
         1.0     0.7405    0.5652    0.6411       207

    accuracy                         0.7248       476
   macro avg     0.7287    0.7064    0.7090       476
weighted avg     0.7272    0.7248    0.7178       476

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.7509    0.7844    0.7673       269
         1.0     0.7026    0.6618    0.6816       207

    accurac

                           solver='svd', store_covariance=False, tol=0.0001)
classifier LDA                precision    recall  f1-score   support

         0.0     0.7708    0.8625    0.8140       269
         1.0     0.7886    0.6667    0.7225       207

    accuracy                         0.7773       476
   macro avg     0.7797    0.7646    0.7683       476
weighted avg     0.7785    0.7773    0.7742       476

 -----------RESULT # -------  8
Processing .....
Only one population left
Number of Features  83
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.6966    0.8364    0.7601       269
         1.0     0.7124    0.5266    0.6056       207

    accuracy                         0.7017       476
   macro avg     0.7045    0.6815    0.6828       476
weighted avg     0.7035    0.7017    0.6929       476

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      

In [10]:
# Repeat the feature optimization twice on the Madelon data, with a larger population.
for a in range(2):
    print('================================\n -----------RESULT # ------- ', a)
    dsp = pj1.feature_optimization(trainX=Madelon_X,
                                   trainY=Madelon_Y,
                                   testX=Madelon_x,
                                   testY=Madelon_y,
                                   population=30,
                                   obj_function='ER')

    # Calling the optimizer returns the reduced training and test feature sets.
    MadelonX_OP, Madelonx_OP = dsp()
    print('Number of Features ', MadelonX_OP.shape[1])
    Madelon_return = cl.data_classification(train_x=MadelonX_OP, train_y=Madelon_Y,
                                            test_x=Madelonx_OP, test_y=Madelon_y)

 -----------RESULT # -------  0
Processing .....
Only one population left
Number of Features  353
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.5006    0.4767    0.4883       900
         1.0     0.5005    0.5244    0.5122       900

    accuracy                         0.5006      1800
   macro avg     0.5006    0.5006    0.5003      1800
weighted avg     0.5006    0.5006    0.5003      1800

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.5094    0.5133    0.5113       900
         1.0     0.5095    0.5056 
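
To compare runs at a glance, the Naive Bayes accuracies printed above for the Musk test set can be collected into one table. The numbers are copied from the reports in this notebook (the baseline with all 166 features, and three feature-optimization runs that selected 80, 153 and 83 features); the column and row names are illustrative.

```python
import pandas as pd

# NB accuracies on the Musk test set, copied from the outputs above.
nb_musk = pd.DataFrame({
    'n_features': [166, 80, 153, 83],
    'accuracy':   [0.6891, 0.7143, 0.7248, 0.7017],
}, index=['baseline', 'run 0', 'run (header truncated)', 'run 8'])

# Change in accuracy relative to the baseline with all features.
baseline_acc = nb_musk.loc['baseline', 'accuracy']
nb_musk['delta_vs_baseline'] = nb_musk['accuracy'] - baseline_acc
print(nb_musk.round(4))
```

In these three runs the NB accuracy on Musk is about 1.3 to 3.6 percentage points above the baseline, while the Madelon NB accuracy stays near 0.50 (0.4961 with 500 features, 0.5006 with 353).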