dwaydwaydway/BERTable

BERTable: Universal Representation Learning for Tabular data

Requirements

  • Python >= 3.7
  • Numpy >= 1.17.4
  • PyTorch >= 1.13.0
  • tqdm >= 4.40.2

Usage

import pandas as pd

from BERTable import BERTable

# Read the dataset and convert it to a list of rows
df = pd.read_csv('dataset.csv', header=None)
# One entry per column: 'numerical', 'categorical' or 'vector'
column_type = ['numerical', 'categorical', 'numerical', 'numerical', 'categorical', ...]
df = df.values.tolist()

# Initialization
bertable = BERTable(
    df, column_type,
    embedding_dim=5, n_layers=5, dim_feedforward=100, n_head=5,
    dropout=0.15, ns_exponent=0.75, share_category=False, use_pos=False)

# Start self-supervised pretraining
bertable.fit(
    df,
    max_epochs=3, lr=1e-4,
    lr_weight={'numerical': 0.33, 'categorical': 0.33, 'vector': 0.33},
    loss_clip=[0, 100],
    n_sample=5, mask_rate=0.15, replace_rate=0.8,
    batch_size=256, shuffle=True, num_workers=10)

# Feature extraction
df_t = bertable.transform(df, batch_size=256, num_workers=10)
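
Writing column_type by hand gets tedious for wide tables. The sketch below guesses it from the rows. `infer_column_types` is a hypothetical helper, not part of BERTable, and it rests on assumptions: list or tuple cells are treated as 'vector', numeric cells as 'numerical', and everything else as 'categorical'.

```python
from numbers import Number


def infer_column_types(rows):
    """Guess a column_type list from a list of rows (hypothetical helper)."""
    if not rows:
        return []
    types = []
    # zip(*rows) walks the table column by column
    for col in zip(*rows):
        if all(isinstance(v, (list, tuple)) for v in col):
            types.append('vector')
        elif all(isinstance(v, Number) and not isinstance(v, bool) for v in col):
            types.append('numerical')
        else:
            types.append('categorical')
    return types


rows = [
    [1.5, 'red', 3, [0.1, 0.2]],
    [2.0, 'blue', 7, [0.3, 0.4]],
]
column_type = infer_column_types(rows)
print(column_type)
```

Integer-coded categories would be guessed as 'numerical' here, so review the result before passing it to BERTable.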

Parameters

BERTable.BERTable

  • df (list, required)

    The data used for training.

  • column_type (list, required)

    The type of each column: 'numerical', 'categorical' or 'vector'.

  • embedding_dim (int, default: 5)

    Embedding dimension.

  • n_layers (int, default: 5)

    Number of transformer encoder layers.

  • dim_feedforward (int, default: 100)

    Hidden dimension of transformer encoder layers.

  • n_head (int, default: 5)

    The number of heads in the multi-head attention layers.

  • dropout (float, default: 0.15)

    The dropout value.

  • ns_exponent (float, default: 0.75)

    The exponent used to shape the negative sampling distribution.

  • share_category (bool, default: False)

    If True, same categorical data in different columns that share the same name will be treated as the same object.

  • use_pos (bool, default: False)

    Whether or not to add positional embedding.
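
The README does not say how ns_exponent is applied internally. A common convention, popularized by word2vec, is to draw negative samples with probability proportional to each value's frequency raised to the exponent. The sketch below assumes that convention; `negative_sampling_weights` is a hypothetical name, not a BERTable function.

```python
from collections import Counter
import random


def negative_sampling_weights(values, ns_exponent=0.75):
    """Return sampling probabilities proportional to count ** ns_exponent."""
    counts = Counter(values)
    shaped = {v: c ** ns_exponent for v, c in counts.items()}
    total = sum(shaped.values())
    return {v: w / total for v, w in shaped.items()}


# A categorical column dominated by one value
column = ['a'] * 8 + ['b'] + ['c']
probs = negative_sampling_weights(column, ns_exponent=0.75)

# Draw 5 negative samples from the shaped distribution
rng = random.Random(0)
negatives = rng.choices(list(probs), weights=list(probs.values()), k=5)
```

Under this assumption, an exponent of 1.0 reproduces the raw frequencies, 0.0 gives a uniform distribution, and values in between shift probability mass from frequent categories toward rare ones.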

BERTable.BERTable.fit

  • df (list, required)

    The data used for training.

  • max_epochs (int, default: 3)

    Number of epochs to train.

  • lr (float, default: 1e-4)

    Learning rate for the optimizer.

  • lr_weight (dict, default: {'numerical': 0.33, 'categorical': 0.33, 'vector': 0.33})

    Learning rate weight for each data type.

  • loss_clip (list, default: [0, 100])

    Loss clipping for numerical data.

  • n_sample (int, default: 4)

    Number of negative samples to use.

  • mask_rate (float, default: 0.15)

    The masking probability.

  • replace_rate (float, default: 0.8)

    The replacement probability, presumably applied to cells already selected for masking (as in BERT-style pretraining).

  • batch_size (int, default: 32)

    The batch size.

  • shuffle (bool, default: True)

    Whether or not to shuffle data.

  • num_workers (int, default: 1)

    Number of workers.
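
How mask_rate and replace_rate interact is not spelled out above. One plausible reading, borrowed from BERT-style pretraining, is sketched below. It assumes each cell is selected as a prediction target with probability mask_rate, and a selected cell is swapped for a mask token with probability replace_rate (otherwise it keeps its value). `mask_row` and the `MASK` token are hypothetical and may not match BERTable's internals.

```python
import random

MASK = '<MASK>'  # hypothetical placeholder token


def mask_row(row, mask_rate=0.15, replace_rate=0.8, rng=random):
    """Return (masked_row, is_target) under the assumed masking scheme."""
    masked, is_target = [], []
    for value in row:
        # Step 1: decide whether this cell becomes a prediction target
        selected = rng.random() < mask_rate
        is_target.append(selected)
        # Step 2: a selected cell is replaced only with probability replace_rate
        if selected and rng.random() < replace_rate:
            masked.append(MASK)
        else:
            masked.append(value)
    return masked, is_target


row = [1.5, 'red', 3, 0.2, 'blue']
masked_row, is_target = mask_row(row, rng=random.Random(0))
```

With the defaults, roughly 15% of cells become targets and about 80% of those are hidden from the model under this reading.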

Experiments

See the exp folder for the detailed implementation of the experiments.
