# Modeling

Based on our data analysis in the exploration.ipynb notebook, we will now build a model to predict the target label based on input features. We will use a neural network model.

We'll also build a preprocessing pipeline to transform the data before feeding it into the model.

In [1]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd

TRAIN_CSV = 'data/train.csv'
TEST_CSV = 'data/test.csv'

drop_cols = ["Vicuna", "Wallaby", "Turkey", "Tick"]

def proc_pipeline(filename: str, cols=None):
    df = pd.read_csv(filename)
    df = df.drop(columns=drop_cols)
    df = df.astype({'Tiglon': 'string'})
    df.loc[df['Tiglon'].isnull(), 'Tiglon'] = 'True'

    numerical_cols = df.select_dtypes(include='number').columns.drop('target')
    categorical_cols = df.select_dtypes(exclude='number').columns

    # One-hot encoding
    df_encoded = pd.get_dummies(df, columns=categorical_cols)

    # merge numerical and ohe columns
    df = pd.concat([df[numerical_cols], df_encoded], axis=1)

    # drop columns that are not in training set
    if cols is not None:
        extra_cols = set(df.columns) - set(cols)
        df = df.drop(columns=extra_cols)


    x = tf.convert_to_tensor(df.drop(columns=['target']).values, dtype=tf.float32)
    y = tf.convert_to_tensor(df['target'].values, dtype=tf.int32)
    return x, y, df.columns

X_train, y_train, train_cols = proc_pipeline(TRAIN_CSV)

2023-11-13 03:44:41.101311: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-13 03:44:41.161276: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-13 03:44:41.161312: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-13 03:44:41.161401: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-13 03:44:41.172978: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-11-13 03:44:41.174170: I tensorflow/core/platform/cpu_feature_guard.cc:182] This Tens

### Training

In [2]:
# define a model
model = tf.keras.Sequential([
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=['accuracy'],
    run_eagerly=True
)

model.fit(X_train, y_train, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7fb882b53280>

### Evaluation

In [3]:
X_test, y_test, test_cols = proc_pipeline(TEST_CSV, train_cols)
model.evaluate(X_test, y_test)



[7.425314903259277, 0.5266731381416321]

From the evaluation metrics, we can see that the model didn't do very well on the test set. This could be because training and test sets are not representative of the population.