
Commit
Merge pull request #1 from bookbot-hive/g2p
Implemented G2p
w11wo committed Jun 6, 2023
2 parents 90cdd5d + a8fc5e9 commit 29f2ee7
Showing 28 changed files with 20,583 additions and 30 deletions.
47 changes: 22 additions & 25 deletions README.md
@@ -52,16 +52,13 @@ To get a lexicon where phonemes are normalized (diacritics removed, digraphs spl
### Phonemization

```py
->>> from transformers import pipeline
->>> import torch
->>> g2p = pipeline(
-...     model="bookbot/byt5-small-wikipron-eng-latn-us-broad",
-...     device=0 if torch.cuda.is_available() else -1,
-... )
->>> g2p("phonemizing", max_length=200)[0]['generated_text']
-'f o ʊ n ə m a ɪ z ɪ ŋ'
->>> g2p("imposimpable", max_length=200)[0]['generated_text']
-'ɪ m p ə z ɪ m p ə b ə l'
+>>> from lexikos import G2p
+>>> g2p = G2p(lang="en-us")
+>>> g2p("Hello there! $100 is not a lot of money in 2023.")
+['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
+>>> g2p = G2p(lang="en-au")
+>>> g2p("Hi there mate! Have a g'day!")
+['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
```
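The new `G2p` call above reads as a lexicon-first lookup with a model fallback for out-of-vocabulary words. A rough sketch of that pattern (hypothetical code, not the actual lexikos internals — the toy lexicon entries and the character-level fallback are made up, and number verbalization like "$100" → "one hundred dollars" is skipped here):

```python
import re


class SimpleG2p:
    """Lexicon-first G2P: look words up in a pronouncing dictionary,
    fall back to a grapheme-level guesser for unknown words."""

    def __init__(self, lexicon, fallback):
        self.lexicon = lexicon    # word -> space-separated phoneme string
        self.fallback = fallback  # callable(word) -> phoneme string

    def __call__(self, text):
        # Lowercase and keep only word-like tokens, as the README examples do.
        words = re.findall(r"[a-z']+", text.lower())
        return [self.lexicon.get(w) or self.fallback(w) for w in words]


# Toy lexicon (illustrative entries only).
lexicon = {"hello": "h ɛ l o ʊ", "there": "ð ɛ ə ɹ"}
fallback = lambda w: " ".join(w)  # stand-in for a neural G2P model call
g2p = SimpleG2p(lexicon, fallback)
print(g2p("Hello there mate"))
```

A real fallback would call one of the byt5 G2P models listed in the tables below rather than spelling the word letter by letter.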

## Dictionaries & Models
@@ -102,24 +99,24 @@ To get a lexicon where phonemes are normalized (diacritics removed, digraphs spl

### English `(en-CA)`

-| Language       | Dictionary | Phone Set | Corpus                                                 | G2P Model |
-| -------------- | ---------- | --------- | ------------------------------------------------------ | --------- |
-| en-CA (Broad)  | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_ca_broad.tsv)  |           |
-| en-CA (Narrow) | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_ca_narrow.tsv) |           |
+| Language       | Dictionary | Phone Set | Corpus                                                 | G2P Model                                                                                                             |
+| -------------- | ---------- | --------- | ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
+| en-CA (Broad)  | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_ca_broad.tsv)  | [bookbot/byt5-small-wikipron-eng-latn-ca-broad](https://huggingface.co/bookbot/byt5-small-wikipron-eng-latn-ca-broad) |
+| en-CA (Narrow) | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_ca_narrow.tsv) |                                                                                                                       |

### English `(en-NZ)`

-| Language       | Dictionary | Phone Set | Corpus                                                 | G2P Model |
-| -------------- | ---------- | --------- | ------------------------------------------------------ | --------- |
-| en-NZ (Broad)  | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_nz_broad.tsv)  |           |
-| en-NZ (Narrow) | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_nz_narrow.tsv) |           |
+| Language       | Dictionary | Phone Set | Corpus                                                 | G2P Model                                                                                                             |
+| -------------- | ---------- | --------- | ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
+| en-NZ (Broad)  | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_nz_broad.tsv)  | [bookbot/byt5-small-wikipron-eng-latn-nz-broad](https://huggingface.co/bookbot/byt5-small-wikipron-eng-latn-nz-broad) |
+| en-NZ (Narrow) | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_nz_narrow.tsv) |                                                                                                                       |

### English `(en-IN)`

-| Language       | Dictionary | Phone Set | Corpus                                                 | G2P Model |
-| -------------- | ---------- | --------- | ------------------------------------------------------ | --------- |
-| en-IN (Broad)  | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_in_broad.tsv)  |           |
-| en-IN (Narrow) | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_in_narrow.tsv) |           |
+| Language       | Dictionary | Phone Set | Corpus                                                 | G2P Model                                                                                                             |
+| -------------- | ---------- | --------- | ------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
+| en-IN (Broad)  | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_in_broad.tsv)  | [bookbot/byt5-small-wikipron-eng-latn-in-broad](https://huggingface.co/bookbot/byt5-small-wikipron-eng-latn-in-broad) |
+| en-IN (Narrow) | Wikipron   | IPA       | [Link](./lexikos/dict/wikipron/eng_latn_in_narrow.tsv) |                                                                                                                       |


## Training G2P Model
@@ -223,11 +220,11 @@ python eval.py \
| East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
| European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
| Mexican English | en-MX | Mexico | | |
-| New Zealand English | en-NZ | New Zealand || |
-| North American | en-CA, en-US | Canada, United States || 🚧 |
+| New Zealand English | en-NZ | New Zealand || |
+| North American | en-CA, en-US | Canada, United States || |
| Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
| Southeast Asian | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
-| South Asian English | en-IN | India || |
+| South Asian English | en-IN | India || |

## Resources

4 changes: 2 additions & 2 deletions examples/eval.py
@@ -22,10 +22,10 @@ def log_results(result: Dataset, args):
logging_dir = f"{args.logging_dir}/{model_id}"
os.makedirs(logging_dir, exist_ok=True)

-    with open(f"{logging_dir}/metrics_{args.source}_{args.target}.txt", "w") as f:
+    with open(f"{logging_dir}/metrics.txt", "w") as f:
f.write(result_str)

-    with open(f"{logging_dir}/log_{args.source}_{args.target}.json", "w") as f:
+    with open(f"{logging_dir}/log.json", "w") as f:
data = [
{"prediction": p, "target": t}
for p, t in zip(result["prediction"], result["target"])
36 changes: 36 additions & 0 deletions examples/export_onnx.py
@@ -0,0 +1,36 @@
from pathlib import Path
import argparse
import shutil

from optimum.onnxruntime import ORTModelForSeq2SeqLM

def parse_args():
parser = argparse.ArgumentParser()

parser.add_argument(
"--model_name", type=str, required=True, help="HuggingFace Hub model name."
)
parser.add_argument(
"--hub_model_id", type=str, required=True, help="HuggingFace Hub model ID for pushing"
)
return parser.parse_args()

def main(args):
if "/" in args.model_name:
_, model_name = args.model_name.split("/")
else:
model_name = args.model_name

save_dir = Path(f"onnx-{model_name}")

ort_model = ORTModelForSeq2SeqLM.from_pretrained(args.model_name, export=True)
ort_model.save_pretrained(save_dir)

ort_model.push_to_hub(str(save_dir), repository_id=args.hub_model_id)

# remove local repository after finish
shutil.rmtree(save_dir)

if __name__ == "__main__":
args = parse_args()
main(args)
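For reference, the local directory name the new script derives from the Hub model name can be sketched as a small helper (hypothetical — it mirrors the split/prefix logic above, using `split("/")[-1]` so names without a namespace also work):

```python
from pathlib import Path


def local_onnx_dir(model_name: str) -> Path:
    # Mirror export_onnx.py: drop any "org/" prefix, then prepend "onnx-".
    return Path(f"onnx-{model_name.split('/')[-1]}")


print(local_onnx_dir("bookbot/byt5-small-wikipron-eng-latn-us-broad"))
# → onnx-byt5-small-wikipron-eng-latn-us-broad
```

The script itself removes this directory with `shutil.rmtree` once the ONNX weights have been pushed to the Hub, so only the remote copy persists.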
