# <span style="color:#FFCC00"> Toxicity in CRY1 Molecules </span>

The dataset includes 171 molecules designed for functional domains of a core clock protein, CRY1, responsible for generating circadian rhythm. Of these, 56 are toxic and the remaining 115 are non-toxic.

#### <span style="color:#FFFF00"> Information: </span>
| Dataset Characteristics | Subject Area | Associated Tasks | Attribute Type | Instances | Attributes |
|---------|:-------------|:--------------:|:-------------|:-------------|--------------:|
| Tabular | Life Sciences | Classification | - | 171 | 1203 |

<b>What do the instances in this dataset represent?</b>

Small molecules

<b>Was there any data preprocessing performed?</b>

The data consist of a complete set of 1203 molecular descriptors and need feature selection before classification, since some of the features are redundant. We used Recursive Feature Elimination (RFE) together with a Decision Tree Classifier (DTC) to get the best set of molecular descriptors for the DTC. The subsetted data with 13 features is included as a supplementary file.
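
The feature-selection step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the dataset authors' actual code: the `make_classification` settings, `step=50`, and the random seeds are assumptions, and only the 171 x 1203 shape and the target of 13 features come from the description.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the toxicity data (assumption).
X_demo, y_demo = make_classification(
    n_samples=171, n_features=1203, n_informative=20, random_state=0
)

# Recursively drop the least important descriptors until 13 remain.
# step=50 removes up to 50 features per iteration to keep the run short (assumption).
rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=13,
    step=50,
)
rfe.fit(X_demo, y_demo)

X_reduced = rfe.transform(X_demo)
print(X_reduced.shape)  # (171, 13)
```

`rfe.support_` is a boolean mask over the original columns, so the names of the retained descriptors could be recovered with `df.columns[:-1][rfe.support_]` if the real data were used.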

#### <span style="color:gold"> Task </span>

Implement an SVM algorithm on any benchmark dataset of your choosing; the only condition is that the data must have 1000 or more columns (features). Then use a Genetic Algorithm (GA) to perform dimensionality reduction by feature selection.

#### <span style="color:#FFFF00"> Import libraries and read Toxicity Dataset </span>

In [16]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import random

import warnings
warnings.filterwarnings('ignore')

In [17]:
# Forward slashes work on every platform and avoid the invalid "\d" escape sequence.
df = pd.read_csv("toxicity-dataset/data.csv")

#### <span style="color:#FFFF00"> Perform Exploratory Data Analysis </span>

In [18]:
df.shape

(171, 1204)

In [19]:
df.head()

Unnamed: 0,MATS3v,nHBint10,MATS3s,MATS3p,nHBDon_Lipinski,minHBint8,MATS3e,MATS3c,minHBint2,MATS3m,...,WTPT-4,WTPT-5,ETA_EtaP_L,ETA_EtaP_F,ETA_EtaP_B,nT5Ring,SHdNH,ETA_dEpsilon_C,MDEO-22,Class
0,0.0908,0,0.0075,0.0173,0,0.0,-0.0436,0.0409,0.0,0.1368,...,0.0,0.0,0.178,1.5488,0.0088,0,0.0,-0.0868,0.0,NonToxic
1,0.0213,0,0.1144,-0.041,0,0.0,0.1231,-0.0316,0.0,0.1318,...,8.866,19.3525,0.1739,1.3718,0.0048,2,0.0,-0.081,0.25,NonToxic
2,0.0018,0,-0.0156,-0.0765,2,0.0,-0.1138,-0.1791,0.0,0.0615,...,5.2267,27.8796,0.1688,1.4395,0.0116,2,0.0,-0.1004,0.0,NonToxic
3,-0.0251,0,-0.0064,-0.0894,3,0.0,-0.0747,-0.1151,0.0,0.0361,...,7.7896,24.7336,0.1702,1.4654,0.0133,2,0.0,-0.101,0.0,NonToxic
4,0.0135,0,0.0424,-0.0353,0,0.0,-0.0638,0.0307,0.0,0.0306,...,12.324,19.7486,0.1789,1.4495,0.012,2,0.0,-0.1071,0.0,NonToxic


In [20]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MATS3v,171.0,-0.031244,0.063559,-0.3115,-0.06670,-0.0325,0.00485,0.1411
nHBint10,171.0,0.315789,0.762918,0.0000,0.00000,0.0000,0.00000,4.0000
MATS3s,171.0,-0.001001,0.063928,-0.1846,-0.03600,-0.0020,0.02900,0.2181
MATS3p,171.0,-0.061501,0.072891,-0.3485,-0.09955,-0.0594,-0.01710,0.1290
nHBDon_Lipinski,171.0,0.994152,1.108773,0.0000,0.00000,1.0000,2.00000,6.0000
...,...,...,...,...,...,...,...,...
ETA_EtaP_B,171.0,0.011316,0.005482,0.0014,0.00755,0.0107,0.01390,0.0346
nT5Ring,171.0,1.467836,1.013361,0.0000,1.00000,1.0000,2.00000,5.0000
SHdNH,171.0,0.004820,0.044475,0.0000,0.00000,0.0000,0.00000,0.4292
ETA_dEpsilon_C,171.0,-0.085088,0.029273,-0.2027,-0.09950,-0.0824,-0.06635,-0.0073


In [21]:
df.info()  # summary of dtypes and memory usage for the full dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Columns: 1204 entries, MATS3v to Class
dtypes: float64(1003), int64(200), object(1)
memory usage: 1.6+ MB


Missing Values

In [22]:
# Count and fraction of missing values per column, keeping only columns that have any.
n_missing = df.isnull().sum()
has_missing = n_missing != 0
total = n_missing[has_missing].sort_values(ascending=False)
percent = (n_missing / len(df))[has_missing].sort_values(ascending=False)
missing = pd.concat([total, percent], axis = 1, keys = ['Total', 'Percent'])
print(missing)

Empty DataFrame
Columns: [Total, Percent]
Index: []


Data redundancy

In [23]:
# Check for duplicates across all columns
duplicated = df.duplicated()

# Print the number of duplicated instances
print("Number of duplicated instances:", duplicated.sum())

# Print the duplicated instances
print(df[duplicated])

Number of duplicated instances: 0
Empty DataFrame
Columns: [MATS3v, nHBint10, MATS3s, MATS3p, nHBDon_Lipinski, minHBint8, MATS3e, MATS3c, minHBint2, MATS3m, minHBint6, minHBint7, minHBint4, MATS3i, VR3_Dt, SpMax8_Bhi, SdsN, SpMax8_Bhm, SpMax8_Bhe, ECCEN, MDEC-14, SpMax8_Bhs, SpMax8_Bhp, SpMax8_Bhv, MDEC-11, MDEC-12, MDEC-13, VR2_Dt, BIC5, ATS7s, ATS7p, ATS7v, ATS7i, ATS7m, ATS7e, mintN, nHsNH2, khs.sssCH, minHBint3, maxdssC, nT6Ring, minHBint5, nF8Ring, minssCH2, SpMax_DzZ, ETA_EtaP, nHsOH, SpMin1_Bhe, maxHother, nHBAcc_Lipinski, StN, khs.aaS, khs.aaO, khs.aaN, Sare, SHAvin, SpMax3_Bhv, SpMax3_Bhp, SpMax3_Bhs, SpMax3_Bhe, SpMin6_Bhi, SpMax3_Bhm, SpMax3_Bhi, ETA_EtaP_F_L, mindCH2, AATSC2e, AATSC2c, AATSC2m, AATSC2i, nsBr, AATS5p, AATSC2v, AATSC2p, AATSC2s, VABC, maxdNH, khs.ddsN, RotBtFrac, ATS4e, ATS4m, nFRing, ATS4i, EE_DzZ, ATS4s, ATS4p, ETA_Alpha, khs.sssN, EE_Dzi, MAXDN, EE_Dzm, EE_Dze, EE_Dzs, EE_Dzp, EE_Dzv, ATS8e, maxsOH, minssssNp, maxsOm, MDEC-23, MDEC-22, ...]
Index: []

Convert the categorical class label to a numerical feature

In [24]:
# One-hot encode the label column (the last column); drop_first keeps a single indicator column.
df = pd.get_dummies(df, columns=[df.columns[-1]], drop_first=True)
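
To make the effect of this encoding concrete, here is a small sketch on a made-up frame (the `toy` values are illustrative, not taken from the dataset). With `drop_first=True`, pandas drops the alphabetically first level, so only an indicator for `Toxic` remains.

```python
import pandas as pd

# Toy frame mimicking the label column (values are made up for illustration).
toy = pd.DataFrame({"x": [0.1, 0.2, 0.3], "Class": ["NonToxic", "Toxic", "NonToxic"]})

# drop_first=True drops the alphabetically first level ("NonToxic"),
# leaving a single indicator column for "Toxic".
encoded = pd.get_dummies(toy, columns=["Class"], drop_first=True)

print(list(encoded.columns))                        # ['x', 'Class_Toxic']
print(encoded["Class_Toxic"].astype(int).tolist())  # [0, 1, 0]
```

The indicator column may be boolean or integer depending on the pandas version, which is why the sketch casts it with `astype(int)` before printing.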

Preprocess the data by splitting it into training and testing sets, and scaling the features.

In [25]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### <span style="color:#FFFF00"> Implement an SVM algorithm </span>

Train an SVM classifier on the original dataset and evaluate its performance on the testing set.

In [26]:
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy on original dataset:", accuracy)

Accuracy on original dataset: 0.5428571428571428
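
For context when reading this accuracy: with 56 toxic and 115 non-toxic molecules, a model that always predicts "non-toxic" would already score about 67% on the full dataset (the class balance of the held-out split may differ slightly). The sketch below rebuilds the label counts from the dataset description; the label array itself is synthetic.

```python
import numpy as np

# Labels rebuilt from the dataset description: 56 toxic (1), 115 non-toxic (0).
y_all = np.array([1] * 56 + [0] * (171 - 56))

# Accuracy obtained by always predicting the majority class.
majority_class = np.bincount(y_all).argmax()
baseline_accuracy = np.mean(y_all == majority_class)

print(majority_class)               # 0
print(round(baseline_accuracy, 4))  # 0.6725
```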


#### <span style="color:#FFFF00"> Implement a Genetic Algorithm </span>

Use a genetic algorithm to perform feature selection. First, define a fitness function that evaluates the performance of the SVM classifier on a subset of features.

In [27]:
def fitness_function(features):
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train[:, features], y_train)
    y_pred = clf.predict(X_test[:, features])
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
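
Because this fitness function scores candidates on `X_test`, the test set influences which features are chosen, and the final test accuracy can be optimistic. One common alternative is to score each candidate with cross-validation on the training data only. The following is a minimal sketch under that assumption; `cv_fitness`, the synthetic data, and `cv=5` are illustrative choices and not part of the original notebook.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

def cv_fitness(features, X_tr, y_tr, cv=5):
    """Mean cross-validated accuracy of a linear SVM on the given columns.

    Only training data is used, so the held-out test set stays unseen.
    """
    if len(features) == 0:
        return 0.0  # an empty feature subset cannot be fitted
    clf = svm.SVC(kernel='linear')
    scores = cross_val_score(clf, X_tr[:, features], y_tr, cv=cv)
    return scores.mean()

# Synthetic stand-in for the scaled training split (assumption).
X_demo, y_demo = make_classification(n_samples=136, n_features=50, random_state=0)
subset = np.array([0, 3, 7, 11])
score = cv_fitness(subset, X_demo, y_demo)
print(0.0 <= score <= 1.0)  # True
```

In the notebook, such a function could take the place of `fitness_function` by passing `X_train` and `y_train`, at the cost of fitting several models per candidate.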


Define the genetic algorithm functions, such as the initialization, selection, crossover, and mutation.

In [28]:
# Define the genetic algorithm parameters, such as the population size, mutation rate, and number of generations.
POPULATION_SIZE = 40
MUTATION_RATE = 0.1
NUM_GENERATIONS = 10
NUM_FEATURES = X_train.shape[1]

def create_individual():
    return np.random.randint(2, size=NUM_FEATURES)

def create_population():
    return [create_individual() for _ in range(POPULATION_SIZE)]

def fitness(individual):
    return fitness_function(np.where(individual == 1)[0])

def selection(population):
    fitness_scores = [fitness(individual) for individual in population]
    return [population[i] for i in np.argsort(fitness_scores)[-int(len(population)*0.3):]]

def crossover(parent1, parent2):
    crossover_point = random.randint(1, len(parent1)-2)
    child1 = np.concatenate([parent1[:crossover_point], parent2[crossover_point:]])
    child2 = np.concatenate([parent2[:crossover_point], parent1[crossover_point:]])
    return child1, child2

def mutation(individual):
    for i in range(len(individual)):
        if random.random() < MUTATION_RATE:
            individual[i] = 1 - individual[i]
    return individual


In [29]:
# Run the genetic algorithm to select the best subset of features.
population = create_population()

for generation in range(NUM_GENERATIONS):
    selected_population = selection(population)
    next_population = []
    while len(next_population) < POPULATION_SIZE:
        parent1, parent2 = random.sample(selected_population, 2)
        child1, child2 = crossover(parent1, parent2)
        mutation(child1)
        mutation(child2)
        next_population.append(child1)
        next_population.append(child2)
    population = next_population
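
The loop above ends with whatever the last generation contains and does not record which individual scored best. A hedged variant that also remembers the best individual seen so far is sketched below. To stay self-contained it uses a toy "count the ones" fitness instead of the SVM; `toy_fitness`, the seeds, and the small population settings are assumptions for illustration.

```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)

N_BITS, POP_SIZE, N_GEN, MUT_RATE = 30, 20, 15, 0.02  # toy settings (assumptions)

def toy_fitness(individual):
    # Stand-in for the SVM-based fitness: number of bits set to 1.
    return int(individual.sum())

def crossover(p1, p2):
    # Single-point crossover, as in the notebook.
    point = random.randint(1, len(p1) - 2)
    return (np.concatenate([p1[:point], p2[point:]]),
            np.concatenate([p2[:point], p1[point:]]))

def mutate(individual):
    # Flip each bit independently with probability MUT_RATE.
    flips = np.random.rand(len(individual)) < MUT_RATE
    individual[flips] = 1 - individual[flips]
    return individual

population = [np.random.randint(2, size=N_BITS) for _ in range(POP_SIZE)]
best = max(population, key=toy_fitness).copy()
initial_best_score = toy_fitness(best)

for _ in range(N_GEN):
    scores = [toy_fitness(ind) for ind in population]
    order = np.argsort(scores)
    parents = [population[i] for i in order[-int(POP_SIZE * 0.3):]]  # top 30%
    # Remember the best individual seen in any generation.
    if scores[order[-1]] > toy_fitness(best):
        best = population[order[-1]].copy()
    children = []
    while len(children) < POP_SIZE:
        p1, p2 = random.sample(parents, 2)
        c1, c2 = crossover(p1, p2)
        children += [mutate(c1), mutate(c2)]
    population = children

# Also check the final generation before reporting.
final_best = max(population, key=toy_fitness)
if toy_fitness(final_best) > toy_fitness(best):
    best = final_best.copy()

print(toy_fitness(best) >= initial_best_score)  # True
```

Scoring each generation once and reusing the scores for both selection and best-tracking also avoids refitting the classifier more often than necessary when the fitness is expensive.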


In [30]:
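# Take the first individual of the final population as the feature selector.
selected_features = population[0]

# Note: selected_features is a 0/1 integer array, so NumPy reads it as column
# positions here (columns 0 and 1, repeated), not as a boolean mask.
X_selected = X[:, selected_features]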

# Split the dataset into training and test sets using the selected features
X_train_sel, X_test_sel, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Train the SVM model using only the selected features
svm_sel = svm.SVC()
svm_sel.fit(X_train_sel, y_train)

# Make predictions on the test set using the selected features
y_pred_sel = svm_sel.predict(X_test_sel)

# Evaluate the accuracy of the model using the selected features
accuracy_sel = accuracy_score(y_test, y_pred_sel)
print(f"Accuracy of SVM with GA-selected features: {accuracy_sel}")

Accuracy of SVM with GA-selected features: 0.6857142857142857
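
One detail of the cell above is worth illustrating: `population[0]` is a 0/1 integer array, and NumPy interprets an integer array in `X[:, idx]` as column positions, so the indexing picks columns 0 and 1 repeatedly instead of keeping the flagged features. The toy example below (made-up array values) contrasts the two behaviours.

```python
import numpy as np

X_toy = np.arange(12).reshape(3, 4)   # 3 samples, 4 features (toy data)
individual = np.array([1, 0, 0, 1])   # GA chromosome: keep features 0 and 3

# Integer indexing: picks columns 1, 0, 0, 1 (not the intended mask).
as_positions = X_toy[:, individual]

# Boolean mask: keeps exactly the columns flagged with 1.
as_mask = X_toy[:, individual.astype(bool)]

print(as_positions.shape)  # (3, 4)
print(as_mask.shape)       # (3, 2)
print(as_mask.tolist())    # [[0, 3], [4, 7], [8, 11]]
```

In the notebook, this would correspond to choosing the fittest individual, for example `best = max(population, key=fitness)`, and indexing with `best.astype(bool)`. Keeping the same kernel and scaling as the baseline model would also make the two accuracies directly comparable.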


#### <span style="color:#FFFF00"> Conclusion </span>

The SVM algorithm is a powerful technique for classification problems, but its performance can be improved by selecting a relevant subset of features.

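In this study, we performed an exploratory data analysis and found that the dataset consists of 1003 continuous (float) and 200 discrete (integer) features, with no missing values or duplicated instances. We implemented an SVM using Python's scikit-learn library and used a GA for feature selection. Test accuracy rose from 54.3% with all features to 68.6% after the GA step, measured on a 35-molecule test set. This comparison should be read with caution: the GA fitness function scores candidates on the same test set, the final model uses the default RBF kernel on unscaled data whereas the baseline uses a linear kernel on standardized data, and the final cell indexes columns with a 0/1 integer array rather than a boolean mask. The result therefore suggests, but does not establish, that feature selection helps on this dataset.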

In summary, combining an SVM with GA-based feature selection is a promising approach for classification problems that have many more features than samples. The results of this study illustrate how feature selection can affect the accuracy of SVM models and point to GA as one candidate technique for it.

#### <span style="color:#FFFF00"> References </span>

UC Irvine Machine Learning Repository. (n.d.). UC Irvine Machine Learning Repository. [https://archive-beta.ics.uci.edu/dataset/728/toxicity-2](https://archive-beta.ics.uci.edu/dataset/728/toxicity-2)