BERTable: Universal Representation Learning for Tabular data

Requirements

Python >= 3.7
Numpy >= 1.17.4
PyTorch >= 1.13.0
tqdm >= 4.40.2

Usage

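import pandas as pd

from BERTable import BERTable

# Read dataset
df = pd.read_csv('dataset.csv', header=None)
# One entry per column, in column order: 'numerical', 'categorical' or 'vector'
column_type = ['numerical', 'categorical', 'numerical', 'numerical', 'categorical']  # ... and so on
# Convert the DataFrame to a list of rows
df = df.values.tolist()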

# Initialization
bertable = BERTable(
    df, column_type,
    embedding_dim=5, n_layers=5, dim_feedforward=100, n_head=5,
    dropout=0.15, ns_exponent=0.75, share_category=False, use_pos=False)

# Start self-supervised Pretraining
bertable.fit(
    df,
    max_epochs=3, lr=1e-4,
    lr_weight={'numerical': 0.33, 'categorical': 0.33, 'vector': 0.33},
    loss_clip=[0, 100],
    n_sample=5, mask_rate=0.15, replace_rate=0.8,
    batch_size=256, shuffle=True, num_workers=10)

# Feature Extraction
df_t = bertable.transform(df, batch_size=256, num_workers=10)
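
The column_type list has to line up with the columns of df. One way to draft it is sketched below, assuming that numeric Python values mean 'numerical', strings mean 'categorical', and lists or tuples mean 'vector'. infer_column_type is a hypothetical helper, not part of BERTable, and its output should be reviewed by hand, since integer-coded categories would be labelled 'numerical'.

```python
from numbers import Number


def infer_column_type(rows):
    """Guess a BERTable-style column_type list from a list of rows.

    Hypothetical helper, not part of BERTable. Assumes every row has the
    same number of columns and that the first row is representative.
    """
    if not rows:
        raise ValueError("rows must contain at least one row")

    column_type = []
    for value in rows[0]:
        if isinstance(value, (list, tuple)):
            column_type.append('vector')
        elif isinstance(value, Number) and not isinstance(value, bool):
            column_type.append('numerical')
        else:
            # Strings, booleans and anything else are treated as categories.
            column_type.append('categorical')
    return column_type


rows = [
    [0.5, 'red', 3, [0.1, 0.2]],
    [1.5, 'blue', 7, [0.3, 0.4]],
]
print(infer_column_type(rows))
```

The rows argument has the same shape as df after df.values.tolist() in the usage above.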

Parameters

BERTable.BERTable

df (list, required)

The data used for training.
column_type (list, required)

The type of each column, in column order: 'numerical', 'categorical' or 'vector'.
embedding_dim (int, default: 5)

Embedding dimension.
n_layers (int, default: 5)

Number of transformer encoder layers.
dim_feedforward (int, default: 100)

Hidden dimension of transformer encoder layers.
n_head (int, default: 5)

The number of heads in the multi-head attention layers.
dropout (float, default: 0.15)

The dropout value.
ns_exponent (float, default: 0.75)

The exponent used to shape the negative sampling distribution.
share_category (bool, default: False)

If True, categorical values with the same name in different columns are treated as the same object.
use_pos (bool, default: False)

Whether or not to add positional embedding.
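
ns_exponent is described only as shaping the negative sampling distribution. A minimal sketch of the usual word2vec-style reading follows, assuming negatives for a categorical column are drawn with probability proportional to count ** ns_exponent. Whether BERTable does exactly this is not stated here, and negative_sampling_probs is a hypothetical name.

```python
from collections import Counter
import random


def negative_sampling_probs(values, ns_exponent=0.75):
    """Return {category: probability}, proportional to count ** ns_exponent.

    Hypothetical helper illustrating the assumed meaning of ns_exponent.
    """
    counts = Counter(values)
    weights = {cat: count ** ns_exponent for cat, count in counts.items()}
    total = sum(weights.values())
    return {cat: weight / total for cat, weight in weights.items()}


column = ['a'] * 8 + ['b'] * 1 + ['c'] * 1
probs = negative_sampling_probs(column, ns_exponent=0.75)

# Draw 5 negatives for a cell whose true value is 'a' (never sample 'a' itself).
rng = random.Random(0)
candidates = [cat for cat in probs if cat != 'a']
negatives = rng.choices(candidates, weights=[probs[c] for c in candidates], k=5)
print(probs)
print(negatives)
```

Under this reading, an exponent of 1.0 reproduces the raw category frequencies and 0.0 gives a uniform distribution; values in between, such as the default 0.75, give rare categories a larger share than their raw frequency.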

BERTable.BERTable.fit

df (list, required)

The data used for training.
max_epochs (int, default: 3)

Number of epochs to train.
lr (float, default: 1e-4)

Learning rate for the optimizer.
lr_weight (dict, default: {'numerical': 0.33, 'categorical': 0.33, 'vector': 0.33})

Learning rate weight for each data type.
loss_clip (list, default: [0, 100])

Loss clipping for numerical data.
n_sample (int, default: 4)

Number of negative samples to use.
mask_rate (float, default: 0.15)

The masking probability.
replace_rate (float, default: 0.8)

The replacement probability for entries selected for masking.
batch_size (int, default: 32)

The batch size.
shuffle (bool, default: True)

Whether or not to shuffle data.
num_workers (int, default: 1)

Number of workers.
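
mask_rate and replace_rate read like the BERT masking scheme. The sketch below shows that reading as an assumption: each cell is selected with probability mask_rate, and a selected cell is swapped for a mask token with probability replace_rate, otherwise it is left unchanged but still used as a prediction target. MASK and mask_row are hypothetical names, and BERTable's actual procedure may differ.

```python
import random

MASK = '[MASK]'  # hypothetical placeholder token


def mask_row(row, mask_rate=0.15, replace_rate=0.8, rng=random):
    """Return (masked_row, targets) for one row.

    targets[i] holds the original value for cells the model should predict
    and None elsewhere. Illustrates an assumed BERT-style scheme only.
    """
    masked_row, targets = [], []
    for value in row:
        if rng.random() < mask_rate:
            # Cell selected for prediction.
            targets.append(value)
            if rng.random() < replace_rate:
                masked_row.append(MASK)
            else:
                masked_row.append(value)
        else:
            # Cell left alone and not predicted.
            targets.append(None)
            masked_row.append(value)
    return masked_row, targets


rng = random.Random(0)
row = [0.5, 'red', 3, 'large', 1.2]
masked_row, targets = mask_row(row, rng=rng)
print(masked_row)
print(targets)
```

With the defaults, roughly 15% of cells become targets and about 80% of those are hidden behind the mask token.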

Experiments

See the exp folder for the detailed implementation of the experiments.
