Big Data and ML - Spring 2021

Brad Windsor (bw1879), Kevin Choi (kc2296)

Thistle

Final project proposal
First, download the pretrained BERT model from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/ and convert it into a Rust-compatible format. Run the following commands:

mkdir -p models/bert-base-nli-stsb-mean-tokens

wget -P models https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/bert-base-nli-stsb-mean-tokens.zip

unzip models/bert-base-nli-stsb-mean-tokens.zip -d models/bert-base-nli-stsb-mean-tokens

python3 -m venv thistle-env

source thistle-env/bin/activate

pip install torch

python3 utils/convert_model.py "$(pwd)/models/bert-base-nli-stsb-mean-tokens/0_BERT/pytorch_model.bin"
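
Before running the conversion, it may help to confirm that the archive unpacked to the path the script is given. The helper below is a minimal sketch; `require_file` is a hypothetical name and not part of this repository.

```shell
# Hypothetical helper (not part of the repository): succeed only if the
# given path is an existing regular file, otherwise report it on stderr.
require_file() {
  if [ ! -f "$1" ]; then
    echo "missing file: $1" >&2
    return 1
  fi
}

# Usage, from the repository root:
#   require_file models/bert-base-nli-stsb-mean-tokens/0_BERT/pytorch_model.bin
```

If the check fails, re-running the `unzip` step above is the first thing to try.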

Modifying Rust

This project uses some Rust features that are not yet available on the stable toolchain. To use the nightly toolchain, run:

rustup toolchain install nightly

rustup default nightly
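
`rustup default nightly` changes the toolchain for every project on the machine. As a sketch of an alternative, `rustup override set nightly` limits the change to the current directory. The function names below are hypothetical. `is_nightly_version` relies on `rustc --version` including the word `nightly` on nightly builds.

```shell
# Hypothetical helper: install nightly and pin it for the current
# directory only, leaving the global default toolchain unchanged.
use_nightly_here() {
  rustup toolchain install nightly && rustup override set nightly
}

# Hypothetical helper: succeed if a `rustc --version` string reports a
# nightly build, e.g. "rustc 1.54.0-nightly (...)".
is_nightly_version() {
  case "$1" in
    *-nightly*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, from the repository root:
#   use_nightly_here
#   is_nightly_version "$(rustc --version)" && echo "nightly is active"
```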

Integration testing

cargo test

Running MS MARCO dataset

Data preparation

mkdir data
cd data
wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz
tar -xf triples.train.small.tar.gz
export SIZE=10000 # or any other size
head -n $SIZE triples.train.small.tsv > data.tsv
LC_ALL=C tr -dc '\0-\177' <data.tsv >data_cleaned.tsv
cd ..
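
The `tr` step deletes every byte outside the ASCII range, so each multi-byte UTF-8 character is removed entirely, while tabs and newlines are kept. The sketch below checks the result. It assumes each row of the triples file has three tab-separated fields (query, positive passage, negative passage). The function names are hypothetical.

```shell
# Hypothetical helper: print the number of rows that do not have
# exactly three tab-separated fields.
count_malformed_rows() {
  awk -F'\t' 'NF != 3 { bad++ } END { print bad + 0 }' "$1"
}

# Hypothetical helper: print the number of non-ASCII bytes in a file
# by deleting every ASCII byte and counting what is left.
count_non_ascii_bytes() {
  LC_ALL=C tr -d '\0-\177' < "$1" | wc -c | tr -d ' '
}

# Usage, from the repository root:
#   count_malformed_rows data/data_cleaned.tsv
#   count_non_ascii_bytes data/data_cleaned.tsv
```

On a correctly prepared `data_cleaned.tsv`, both counts would be expected to be zero under the three-column assumption.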

To run:

# see run_eval.rs
cargo run > output100.txt
