## Test smells classification

**Task:** For each of two test smells, Conditional Test Logic and Exception Handling, classify whether a Java class exhibits the smell<br>
**Data:** Experiment data from https://testsmells.github.io/pages/research/experimentdata.html<br>
**Model:** LSTM neural network

### Load vectorized data

In [22]:
import pandas as pd

# Read input data
df1 = pd.read_csv('data/ConditionalTestLogic_vectors.csv')
df2 = pd.read_csv('data/ExceptionCatchingThrowing_vectors.csv')

# Drop the first three columns, which are not used as features
df1.drop(df1.columns[[0, 1, 2]], axis=1, inplace=True)
df2.drop(df2.columns[[0, 1, 2]], axis=1, inplace=True)

# Preprocess: drop the last character of each vector string,
# then parse the space-separated values into a list of ints
df1['Vector'] = [x[:-1] for x in df1['Vector']]
df2['Vector'] = [x[:-1] for x in df2['Vector']]
df1['Vector'] = [list(map(int, x.split(' '))) for x in df1['Vector']]
df2['Vector'] = [list(map(int, x.split(' '))) for x in df2['Vector']]
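
To make the preprocessing step concrete, the sketch below parses one vector string into a list of integers. The helper name `parse_vector` and the sample string are hypothetical. It assumes the `Vector` column holds space-separated integers followed by a trailing space, which is what the slicing in the cell above suggests but the notebook does not state. Calling `str.split()` with no argument is a swapped-in alternative to `x[:-1]` followed by `split(' ')`: it ignores leading, trailing, and repeated whitespace.

```python
def parse_vector(raw):
    """Parse a whitespace-separated string of integers into a list of ints.

    Hypothetical helper, not part of the original notebook.
    """
    # split() with no argument drops empty tokens, so a trailing space is harmless
    return [int(token) for token in raw.split()]


sample = "12 7 3 "  # made-up example with a trailing space
print(parse_vector(sample))  # [12, 7, 3]
```

If the trailing character in the real data is something other than whitespace, the original slicing is still needed.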

### Train/predict procedure

In [49]:
from keras import Sequential
from keras.layers import Embedding, SpatialDropout1D, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split


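# Model parameters
MAX_NB_WORDS = 50000        # Embedding vocabulary size (input values must be below this)
MAX_SEQUENCE_LENGTH = 500   # Vectors are padded or truncated to this length
EMBEDDING_DIM = 100         # Size of each embedding vector
EPOCHS = 10
BATCH_SIZE = 64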


def split_dataset(df):
    """
    Split the dataset into training and test sets

    :param df: The dataset as a Pandas DataFrame
    :return A tuple of format (X_train, X_test, Y_train, Y_test)
    """

    X = pad_sequences(df['Vector'].values, MAX_SEQUENCE_LENGTH)
    Y = pd.get_dummies(df['Smell']).values
    return train_test_split(X, Y, test_size=0.3, random_state=42)
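
With its default settings (`padding='pre'`, `truncating='pre'`), `pad_sequences` pads and truncates at the front of each sequence: short vectors gain leading zeros, and long vectors keep only their last `MAX_SEQUENCE_LENGTH` entries. The pure-Python sketch below mimics that default behavior for a single sequence so the effect is easy to see. `pad_pre` is a hypothetical illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults (padding='pre', truncating='pre') for one sequence."""
    # Keep only the last `maxlen` items when the sequence is too long
    trimmed = list(seq)[-maxlen:]
    # Left-pad with `value` when the sequence is too short
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Because the tail is kept, the beginning of any vector longer than `MAX_SEQUENCE_LENGTH` is discarded.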


def train_LSTM(X_train, Y_train, Xs):
    """
    Train a LSTM model for the given training dataset

    :param X_train: The training feature vector
    :param Y_train: The labels vector
    :param Xs: The shape of the training feature vector
    :return The trained model
    """

    model = Sequential()
    # Map each integer input value to a dense vector of size EMBEDDING_DIM
    model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=Xs))
    model.add(SpatialDropout1D(0.2))
    model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
    # Two output units, one per one-hot label column
    model.add(Dense(2, activation='sigmoid'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Train model
    model.fit(X_train, Y_train, epochs=EPOCHS, batch_size=BATCH_SIZE)
    return model
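
The output layer above pairs a `sigmoid` activation with `categorical_crossentropy`. With one-hot labels, the more common pairing is `softmax`, whose outputs form a probability distribution over the classes, whereas independent sigmoids need not sum to 1. The sketch below only illustrates that numerical difference on made-up logits. It does not reproduce the notebook's model and makes no claim about how the reported results would change.

```python
import math


def sigmoid(z):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5]  # made-up pre-activation values for the two output units
print(sum(sigmoid(logits)))  # greater than 1 for these logits
print(sum(softmax(logits)))  # 1 up to floating-point rounding
```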

In [52]:
# For ConditionalTestLogic
X_train, X_test, Y_train, Y_test = split_dataset(df1)
model = train_LSTM(X_train, Y_train, X_train.shape[1])
accr = model.evaluate(X_test, Y_test)
print('\nTest Set Performance\nLoss: {:0.3f}\nAccuracy: {:0.3f}'.format(accr[0], accr[1]))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Test Set Performance
Loss: 0.381
Accuracy: 0.850


In [53]:
# For ExceptionCatchingThrowing
X_train, X_test, Y_train, Y_test = split_dataset(df2)
model = train_LSTM(X_train, Y_train, X_train.shape[1])
accr = model.evaluate(X_test, Y_test)
print('\nTest Set Performance\nLoss: {:0.3f}\nAccuracy: {:0.3f}'.format(accr[0], accr[1]))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Test Set Performance
Loss: 0.349
Accuracy: 0.858
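
Accuracy alone can hide poor performance on one class if the labels are imbalanced. As a sketch, per-class precision and recall can be derived from one-hot labels and predicted scores, such as the rows returned by `model.predict(X_test)`. The data below is made up, `precision_recall` is a hypothetical helper, and treating column 1 as the positive class is an assumption, since the column order produced by `pd.get_dummies` depends on the label values.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def precision_recall(y_true, y_score, positive=1):
    """Precision and recall for one class, given one-hot labels and per-class scores."""
    true_idx = [argmax(r) for r in y_true]
    pred_idx = [argmax(r) for r in y_score]
    pairs = list(zip(true_idx, pred_idx))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and scores for four samples
y_true = [[0, 1], [1, 0], [0, 1], [1, 0]]
y_score = [[0.2, 0.8], [0.4, 0.6], [0.7, 0.3], [0.9, 0.1]]
print(precision_recall(y_true, y_score))  # (0.5, 0.5)
```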
