# Purpose 
### Feature Selection by Optimizing the FSJaya Algorithm

The challenge is to adapt the existing algorithm so that it performs feature selection.
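For orientation, the standard Jaya update moves each candidate toward the current best solution and away from the current worst one. One common way to use it for feature selection is to pass the updated position through a sigmoid and threshold the result into a 0/1 feature mask. The sketch below illustrates that idea only. The names `jaya_step` and `to_mask`, the random-threshold binarization, and the toy population are assumptions made for illustration; they are not taken from `project_part_1` or `project_part_2`.

```python
import numpy as np

def sigmoid(x):
    """Logistic transfer function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def jaya_step(population, fitness, rng):
    """One Jaya update; lower fitness is better (for example an error rate).

    population: (n_candidates, n_features) real-valued positions.
    fitness:    (n_candidates,) objective value per candidate.
    """
    best = population[np.argmin(fitness)]
    worst = population[np.argmax(fitness)]
    r1 = rng.random(population.shape)
    r2 = rng.random(population.shape)
    # Move toward the best candidate and away from the worst one.
    return (population
            + r1 * (best - np.abs(population))
            - r2 * (worst - np.abs(population)))

def to_mask(position, rng):
    """Keep a feature when sigmoid(position) exceeds a uniform random draw."""
    return sigmoid(position) > rng.random(position.shape)

# Hypothetical toy population: 5 candidates over 8 features.
rng = np.random.default_rng(0)
population = rng.normal(size=(5, 8))
fitness = rng.random(5)

updated = jaya_step(population, fitness, rng)
masks = to_mask(updated, rng)
print(masks.astype(int))
```

In the usual Jaya scheme an updated candidate replaces the old one only if its objective value improves; that acceptance step is omitted here to keep the sketch short.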

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
from project_part_2 import sigmoid_function
import project_part_1 as pj1
import classifiers as cl

## Step 1: Load Dataset

The datasets selected for evaluation are:
- madelon
- musk

Both datasets are accessible from [add the name of the website]

### Madelon Dataset

- Data type: non-sparse
- Number of features: 500
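- Number of examples and check-sums:

| Split | Pos_ex | Neg_ex | Tot_ex | Check_sum |
|-------|-------:|-------:|-------:|--------------:|
| Train | 1000 | 1000 | 2000 | 488083511.00 |
| Valid | 300 | 300 | 600 | 146395833.00 |
| Test | 900 | 900 | 1800 | 439209553.00 |
| All | 2200 | 2200 | 4400 | 1073688897.00 |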

In [2]:
# Training features: space-separated values, no header row.
madelon = pd.read_csv('madelon/madelon_train.data', sep=' ', header=None)
madelon.info()
madelon.head()

# Test features, same layout as the training file.
madelon_test = pd.read_csv('madelon/madelon_test.data', sep=' ', header=None)
madelon_test.info()
#madelon_test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Columns: 501 entries, 0 to 500
dtypes: float64(1), int64(500)
memory usage: 7.6 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Columns: 501 entries, 0 to 500
dtypes: float64(1), int64(500)
memory usage: 6.9 MB


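### Musk Dataset

The description below refers to the smaller Musk file (476 conformations), which this notebook loads as the test set. The file loaded for training contains 6598 conformations with the same column layout.

4. Relevant Information: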
   This dataset describes a set of **92** molecules of which **47** are judged
   by human experts to be musks and the remaining 45 molecules are
   judged to be non-musks.  The goal is to learn to predict whether
   new molecules will be musks or non-musks.  However, the 166 features
   that describe these molecules depend upon the exact shape, or
   conformation, of the molecule.  Because bonds can rotate, a single
   molecule can adopt many different shapes.  To generate this data
   set, the low-energy conformations of the molecules were generated
   and then filtered to remove highly similar conformations. This left
   476 conformations.  Then, a feature vector was extracted that
   describes each conformation.

   This many-to-one relationship between feature vectors and molecules
   is called the **"multiple instance problem"**.  When learning a
   classifier for this data, the classifier should classify a molecule
   as "musk" if ANY of its conformations is classified as a musk.  A
   molecule should be classified as "non-musk" if NONE of its
   conformations is classified as a musk.

5. Number of Instances  **476**

6. Number of Attributes **168** plus the class.

7. For Each Attribute:

   | Attribute | Description |
   |-----------|-------------|
   | molecule_name | Symbolic name of each molecule. Musks have names such as MUSK-188. Non-musks have names such as NON-MUSK-jp13. |
   | conformation_name | Symbolic name of each conformation. These have the format MOL_ISO+CONF, where MOL is the molecule number, ISO is the stereoisomer number (usually 1), and CONF is the conformation number. |
   | f1 through f162 | These are "distance features" along rays (see paper cited above). The distances are measured in hundredths of Angstroms. The distances may be negative or positive, since they are actually measured relative to an origin placed along each ray. The origin was defined by a "consensus musk" surface that is no longer used. Hence, any experiments with the data should treat these feature values as lying on an arbitrary continuous scale. In particular, the algorithm should not make any use of the zero point or the sign of each feature value. |
   | f163 | This is the distance of the oxygen atom in the molecule to a designated point in 3-space. This is also called OXY-DIS. |
   | f164 | OXY-X: X-displacement from the designated point. |
   | f165 | OXY-Y: Y-displacement from the designated point. |
   | f166 | OXY-Z: Z-displacement from the designated point. |
   | class | 0 => non-musk, 1 => musk |

   Please note that the molecule_name and conformation_name attributes
   should not be used to predict the class.

8. Missing Attribute Values: none.

9. Class Distribution: 
   Musks:     47
   Non-musks: 45
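
The multiple-instance rule in item 4 (a molecule is a musk if ANY of its conformations is classified as a musk) can be sketched as a group-wise maximum over conformation-level predictions. The frame below is a hypothetical toy example, not output from this notebook, and the column names `molecule_name` and `predicted` are chosen for illustration. Note that the classifiers in this notebook are scored per conformation, not per molecule.

```python
import pandas as pd

# Hypothetical conformation-level predictions (1 = musk, 0 = non-musk).
preds = pd.DataFrame({
    "molecule_name": ["MUSK-188", "MUSK-188", "NON-MUSK-jp13", "NON-MUSK-jp13"],
    "predicted": [0, 1, 0, 0],
})

# A molecule is labelled musk if any of its conformations is predicted musk,
# which for 0/1 labels is the maximum within each molecule group.
molecule_pred = preds.groupby("molecule_name")["predicted"].max()
print(molecule_pred)
```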

In [3]:
# The larger file (6598 rows) is used for training.
musk = pd.read_csv('musk/clean2.data/clean2.data', header=None)
musk.info()
musk.head()

# The smaller file (476 rows) is used for testing.
musk_test = pd.read_csv('musk/clean1.data/clean1.data', header=None)
#musk_test

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6598 entries, 0 to 6597
Columns: 169 entries, 0 to 168
dtypes: float64(1), int64(166), object(2)
memory usage: 8.5+ MB


## Step 2: Prepare Data for Classifiers

According to the reference paper, *A Jaya algorithm based wrapper method for optimal feature selection in supervised classification*, the data is evaluated with the following classifiers:

- NB (Naive Bayes)
- KNN (K Nearest Neighbor)
- LDA (Linear Discriminant Analysis)
- RT (Regression Tree)
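
The `classifiers` module imported at the top of this notebook is not shown, so the following is only a minimal sketch of how the four models could be fitted and scored with scikit-learn. The function name `evaluate_classifiers`, the default hyperparameters, the rounding of the regression tree's output, and the toy data are all assumptions for illustration and may differ from what `cl.data_classification` does.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score

def evaluate_classifiers(train_x, train_y, test_x, test_y):
    """Fit each model on the training split and return test accuracy per model."""
    models = {
        "NB": GaussianNB(),
        "KNN": KNeighborsClassifier(),
        "LDA": LinearDiscriminantAnalysis(),
        "RT": DecisionTreeRegressor(random_state=0),
    }
    scores = {}
    for name, model in models.items():
        model.fit(train_x, train_y)
        # The regression tree returns real values, so round them to 0/1 labels.
        pred = np.rint(model.predict(test_x)).astype(int)
        scores[name] = accuracy_score(test_y, pred)
    return scores

# Toy data: two well-separated Gaussian blobs (not the Madelon or Musk data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
idx = rng.permutation(100)
train, test = idx[:70], idx[70:]

scores = evaluate_classifiers(X[train], y[train], X[test], y[test])
print(scores)
```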

#### Step 2a: Separating features and labels as X, Y variables (Training Dataset)

In [4]:
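# Per the attribute description, columns 0 and 1 hold molecule_name and
# conformation_name, and column 168 holds the class label.
Musk_X = musk.drop(columns=[0, 1, 168])
Musk_X.columns = np.arange(len(Musk_X.columns))
Musk_Y = musk[168]


# Column 500 is the single float64 column reported by madelon.info(); it is not a feature.
Madelon_X = madelon.drop(columns=500)
# Assumption: the first 1000 rows are positive and the last 1000 are negative.
# If the dataset provides a separate labels file, the labels should be read
# from it instead, because a wrong ordering would leave the classifiers near chance.
Y_labels = np.hstack([np.ones(1000), np.zeros(1000)])
Madelon_Y = pd.Series(Y_labels)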


#### Step 2b: Separating features and labels as X, Y variables (Test Dataset)

In [5]:
# Same column layout as the training frame: drop the two name columns and the class label.
Musk_x = musk_test.drop(columns=[0, 1, 168])
Musk_x.columns = np.arange(len(Musk_x.columns))
Musk_y = musk_test[168]


# Assumption (as for the training labels): the first 900 test rows are
# positive and the last 900 are negative.
Madelon_x = madelon_test.drop(columns=500)
y_labels = np.hstack([np.ones(900), np.zeros(900)])
Madelon_y = pd.Series(y_labels)

print("Madelon TRAINING \n dataset: # of features: " , Madelon_X.shape[1], 
      '\n Number of Measurements: ', Madelon_X.shape[0], 
      'Y_shape: ', Madelon_Y.shape)

print("Madelon TEST \n dataset: # of features: " , Madelon_x.shape[1], 
      '\n Number of Measurements: ', Madelon_x.shape[0], 
      'Y_shape: ', Madelon_y.shape)

print("Musk TRAINING \n dataset: # of features: " , Musk_X.shape[1], 
      '\n Number of Measurements: ', Musk_X.shape[0],
      'Y_shape: ', Musk_Y.shape)

print("Musk TEST \n dataset: # of features: " , Musk_x.shape[1], 
      '\n Number of Measurements: ', Musk_x.shape[0], 
      'Y_shape: ', Musk_y.shape)

Madelon TRAINING 
 dataset: # of features:  500 
 Number of Measurements:  2000 Y_shape:  (2000,)
Madelon TEST 
 dataset: # of features:  500 
 Number of Measurements:  1800 Y_shape:  (1800,)
Musk TRAINING 
 dataset: # of features:  166 
 Number of Measurements:  6598 Y_shape:  (6598,)
Musk TEST 
 dataset: # of features:  166 
 Number of Measurements:  476 Y_shape:  (476,)


## Step 3: Classification

#### Step 3a: Train Classifiers without Feature Selection

In [6]:
madelon_return = cl.data_classification(train_x=Madelon_X, train_y=Madelon_Y,
                                        test_x=Madelon_x, test_y=Madelon_y)

GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.4959    0.4667    0.4808       900
         1.0     0.4963    0.5256    0.5105       900

    accuracy                         0.4961      1800
   macro avg     0.4961    0.4961    0.4957      1800
weighted avg     0.4961    0.4961    0.4957      1800

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.5057    0.4967    0.5011       900
         1.0     0.5055    0.5144    0.5099       900

    accuracy                         0.5056      1800
   macro avg     0.5056

In [7]:
musk_return = cl.data_classification(train_x=Musk_X, train_y=Musk_Y,
                                     test_x=Musk_x, test_y=Musk_y)


GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.6724    0.8773    0.7613       269
         1.0     0.7360    0.4444    0.5542       207

    accuracy                         0.6891       476
   macro avg     0.7042    0.6609    0.6578       476
weighted avg     0.7000    0.6891    0.6712       476

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.7580    0.4424    0.5587       269
         1.0     0.5298    0.8164    0.6426       207

    accuracy                         0.6050       476
   macro avg     0.6439

## TESTING PROPOSED STRATEGY
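
The cells in this section call `pj1.feature_optimization` with `obj_function='ER'`; its source is not shown here. Assuming `'ER'` stands for classification error rate, a wrapper objective scores a candidate feature mask by training a classifier on the selected columns only and measuring its error on held-out data. The sketch below uses a plain 1-nearest-neighbour rule written in NumPy as a stand-in classifier; the function name `error_rate`, the choice of 1-NN, and the toy data are assumptions for illustration only.

```python
import numpy as np

def error_rate(mask, train_x, train_y, test_x, test_y):
    """Error rate of a 1-nearest-neighbour rule restricted to the masked features."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 1.0  # an empty feature subset is treated as the worst case
    a, b = train_x[:, mask], test_x[:, mask]
    # Squared Euclidean distance from every test row to every training row.
    dists = ((b[:, None, :] - a[None, :, :]) ** 2).sum(axis=2)
    pred = train_y[dists.argmin(axis=1)]
    return float(np.mean(pred != test_y))

# Toy data: feature 0 carries the class signal, feature 1 is high-variance noise.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 30)
X = np.column_stack([y + rng.normal(0, 0.1, 60), rng.normal(0, 10, 60)])
train_x, test_x, train_y, test_y = X[:40], X[40:], y[:40], y[40:]

er_signal = error_rate([True, False], train_x, train_y, test_x, test_y)
er_noise = error_rate([False, True], train_x, train_y, test_x, test_y)
print(er_signal, er_noise)
```

A search procedure such as Jaya would call an objective like this once per candidate mask and keep the masks with the lowest error.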

In [8]:
for a in range(10):
    print('================================\n -----------RESULT # ------- ', a)
    # Build the optimizer for the Musk data; 'ER' selects the objective function.
    dsp = pj1.feature_optimization(trainX=Musk_X,
                                   trainY=Musk_Y,
                                   testX=Musk_x,
                                   testY=Musk_y,
                                   population=10,
                                   obj_function='ER')

    # Calling the optimizer returns the reduced training and test feature sets.
    MuskX_OP, Muskx_OP = dsp()
    print('Number of Features ', MuskX_OP.shape[1])
    musk_return = cl.data_classification(train_x=MuskX_OP, train_y=Musk_Y,
                                         test_x=Muskx_OP, test_y=Musk_y)



 -----------RESULT # -------  0
Processing .....
Only one population left
Number of Features  121
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.7079    0.8290    0.7637       269
         1.0     0.7143    0.5556    0.6250       207

    accuracy                         0.7101       476
   macro avg     0.7111    0.6923    0.6943       476
weighted avg     0.7107    0.7101    0.7034       476

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.8113    0.7993    0.8052       269
         1.0     0.7441    0.7585 

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)
classifier LDA                precision    recall  f1-score   support

         0.0     0.7641    0.8550    0.8070       269
         1.0     0.7771    0.6570    0.7120       207

    accuracy                         0.7689       476
   macro avg     0.7706    0.7560    0.7595       476
weighted avg     0.7698    0.7689    0.7657       476

 -----------RESULT # -------  4
Processing .....
Only one population left
Number of Features  74
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.7165    0.8550    0.7797       269
         1.0     0.7484    0.5604    0.6409       207

    accuracy                         0.7269       476
   macro avg     0.7324    0.7077    0.7103       476
weighted avg     0.7304    0.7269    0.7193       476

DecisionTreeRegressor

classifier KNN                precision    recall  f1-score   support

         0.0     0.8664    0.9405    0.9020       269
         1.0     0.9130    0.8116    0.8593       207

    accuracy                         0.8845       476
   macro avg     0.8897    0.8761    0.8806       476
weighted avg     0.8867    0.8845    0.8834       476

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)
classifier LDA                precision    recall  f1-score   support

         0.0     0.7616    0.8550    0.8056       269
         1.0     0.7759    0.6522    0.7087       207

    accuracy                         0.7668       476
   macro avg     0.7687    0.7536    0.7571       476
weighted avg     0.7678    0.7668    0.7634       476

 -----------RESULT # -------  8
Processing .....
Only one population left
Number of Features  67
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB       

In [9]:
for a in range(2):
    print('================================\n -----------RESULT # ------- ', a)
    # Build the optimizer for the Madelon data; 'ER' selects the objective function.
    dsp = pj1.feature_optimization(trainX=Madelon_X,
                                   trainY=Madelon_Y,
                                   testX=Madelon_x,
                                   testY=Madelon_y,
                                   population=30,
                                   obj_function='ER')

    # Calling the optimizer returns the reduced training and test feature sets.
    MadelonX_OP, Madelonx_OP = dsp()
    print('Number of Features ', MadelonX_OP.shape[1])
    Madelon_return = cl.data_classification(train_x=MadelonX_OP, train_y=Madelon_Y,
                                            test_x=Madelonx_OP, test_y=Madelon_y)

 -----------RESULT # -------  0
Processing .....
Only one population left
Number of Features  220
GaussianNB(priors=None, var_smoothing=1e-09)
classifier NB                precision    recall  f1-score   support

         0.0     0.5076    0.4811    0.4940       900
         1.0     0.5069    0.5333    0.5198       900

    accuracy                         0.5072      1800
   macro avg     0.5072    0.5072    0.5069      1800
weighted avg     0.5072    0.5072    0.5069      1800

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
classifier RT                precision    recall  f1-score   support

         0.0     0.5010    0.5389    0.5193       900
         1.0     0.5012    0.4633 