Punctuation

Punctuation restoration using TensorFlow

Introduction

This repository was inspired by github.com/ottokart/punctuator2. I did not want to use Python for serving the model, so I ported the implementation from Theano to TensorFlow. The Attention and Late Fusion layers were the biggest challenges, and I am not sure they are 100% correct. I compared the results with punctuator2, and they looked very similar.

NEW: Feature representation of words

Added support for Kaldi-like word features. Word hashing was originally described in "Learning deep structured semantic models for web search using clickthrough data" by Po-Sen Huang et al., 2013. It makes it possible to use a nearly unlimited vocabulary instead of a shortlist for NN training. Samples for data preparation and training have been added.
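
To illustrate the idea, here is a minimal sketch of letter-trigram word hashing in the spirit of Huang et al. It is not the code used in this repository: the function names (`letter_ngrams`, `build_ngram_index`, `hash_word`) and the `#` boundary marker are assumptions chosen for illustration. The point is that a word absent from the training vocabulary still maps to known sub-word features.

```python
from collections import Counter


def letter_ngrams(word, n=3, boundary="#"):
    """Split a word into letter n-grams after adding boundary markers."""
    marked = f"{boundary}{word.lower()}{boundary}"
    if len(marked) < n:
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def build_ngram_index(words, n=3):
    """Assign a feature id to every n-gram seen in the training words."""
    index = {}
    for word in words:
        for gram in letter_ngrams(word, n):
            index.setdefault(gram, len(index))
    return index


def hash_word(word, ngram_index, n=3):
    """Represent a word as a sparse {feature_id: count} vector.

    N-grams that never occurred in training are dropped, so an unseen
    word is described by the n-grams it shares with known words.
    """
    counts = Counter(letter_ngrams(word, n))
    return {ngram_index[g]: c for g, c in counts.items() if g in ngram_index}


if __name__ == "__main__":
    index = build_ngram_index(["good", "food"])
    print(letter_ngrams("good"))
    print(hash_word("goody", index))  # "goody" was never seen as a whole word
```

The number of distinct letter trigrams grows far more slowly than the number of distinct words, which is why the input layer can stay small while the effective vocabulary is open.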

Requirements

The implementation was tested with Python 3.10. To create a conda environment:

conda create --name punct python=3.10

The training code requires the following modules:

pip install -r requirements.txt
pip install tqdm

Hyperparameter optimization additionally requires:

pip install parameter-sherpa
pip install keras

Data preparation

For the initial data requirements, see github.com/ottokart/punctuator2. You can configure the punctuation set and the vocabulary size in ptf/data/data.py. To prepare the data for training:

python ptf/punctuator2/data.py <initialDataDir> <dataDir>

Alternatively, see the egs directory for sample scripts.

Training

To train a single model, first configure the model parameters in ptf/train.py. Training can then be started with:

mkdir model1 && cd model1
python ../ptf/train.py <dataDir> <modelPrefix>

The trained model is saved in Keras HDF5 format in the working folder model1.

Optimization

The Python script optimize/optimize.py optimizes the hyperparameters of a model using the sherpa tool. To start an optimization:

mkdir optim1 && cd optim1
python ../optimize/optimize.py <dataDir>

All models are saved in the working folder optim1.

Prediction, error calculation

To predict punctuation for a test text:

python ptf/predict.py <testTextFile> <vocabulary> <hd5ModelFile> <predictedOutputFile>

To calculate the error scores of the prediction:

python ptf/punctuator2/error_calculator.py <testTextFile> <predictedOutputFile>
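
For orientation, the sketch below shows how per-punctuation precision, recall and F1 could be computed from a reference text and a predicted text. It is not the code of error_calculator.py. It assumes the punctuator2-style token convention (`,COMMA`, `.PERIOD`, `?QUESTIONMARK`) and that both texts contain the same words in the same order. The names `PUNCTUATION`, `NO_PUNCT`, `labels_after_words` and `punctuation_scores` are hypothetical.

```python
# Assumed punctuation tokens, following the punctuator2 data convention.
PUNCTUATION = {",COMMA", ".PERIOD", "?QUESTIONMARK"}
NO_PUNCT = "_SPACE"  # label for "no punctuation after this word"


def labels_after_words(tokens):
    """Return one label per word: the punctuation token that follows it."""
    labels = []
    for token in tokens:
        if token in PUNCTUATION:
            if labels:  # ignore punctuation before the first word
                labels[-1] = token
        else:
            labels.append(NO_PUNCT)
    return labels


def punctuation_scores(reference, predicted):
    """Compute (precision, recall, F1) for each punctuation token."""
    ref = labels_after_words(reference.split())
    hyp = labels_after_words(predicted.split())
    if len(ref) != len(hyp):
        raise ValueError("reference and prediction differ in word count")
    scores = {}
    for punct in sorted(PUNCTUATION):
        tp = sum(r == punct and h == punct for r, h in zip(ref, hyp))
        fp = sum(r != punct and h == punct for r, h in zip(ref, hyp))
        fn = sum(r == punct and h != punct for r, h in zip(ref, hyp))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[punct] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    reference = "hello ,COMMA world .PERIOD how are you ?QUESTIONMARK"
    predicted = "hello world .PERIOD how are you .PERIOD"
    for punct, (p, r, f) in punctuation_scores(reference, predicted).items():
        print(f"{punct}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```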

Saving as a pure TensorFlow model

During training, all models are saved in Keras format. To save a model in pure TensorFlow format, use the script:

python ptf/save_as_tf.py <hd5ModelFile> <tfModelOutputDir>

Loading model with Go

Sample Go code that loads the trained TensorFlow model is in examples/goload/loadtf.go. To compile it, you need to install the TensorFlow library and configure LD_LIBRARY_PATH. See https://www.tensorflow.org/install/lang_go

Author

Airenas Vaičiūnas

License

Please also see the license of Ottokar Tilk.
