Easy, low-code clustering and labeling for Tibetan language text.

billingsmoore/bocluster

Text Clustering

This repository provides tools to embed and cluster Tibetan-language texts, label the resulting clusters, and visualize the labeled clusters.

Install

Install the library to get started:

pip install --upgrade bocluster

Usage

The code block below runs the full pipeline: load a dataset, fit the classifier, optionally reassign outliers to clusters, and display the results.

from datasets import load_dataset
from bocluster.cluster import BoClusterClassifier

# load a Tibetan language text dataset
ds = load_dataset('billingsmoore/LotsawaHouse-bo-en', split='train')

# initialize a BoClusterClassifier object
bcc = BoClusterClassifier()

# fit the classifier on the first 1,000 Tibetan texts in the dataset
bcc.fit(ds['bo'][:1000])

# optional: assign every data point to a cluster so that none are treated as outliers
bcc.classify_outliers()

# show a visualization of results
bcc.show()
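
To illustrate what treating every data point as a cluster member can mean, here is a minimal, library-independent sketch. It is not bocluster's implementation, and how `classify_outliers()` works internally is not described here. The sketch assumes that outliers carry the label -1 (a common convention in density-based clustering) and reassigns each one to the cluster with the nearest centroid. The function names `centroids` and `assign_outliers` and the sample data are hypothetical.

```python
import math


def centroids(points, labels):
    """Return the mean vector of each cluster, ignoring outliers (label -1)."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        if label == -1:
            continue
        if label not in sums:
            sums[label] = [0.0] * len(point)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], point)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def assign_outliers(points, labels):
    """Return a copy of labels with each -1 replaced by the nearest centroid's label."""
    centers = centroids(points, labels)
    if not centers:
        # No clusters exist, so there is nothing to assign outliers to.
        return list(labels)
    result = []
    for point, label in zip(points, labels):
        if label == -1:
            label = min(centers, key=lambda c: math.dist(point, centers[c]))
        result.append(label)
    return result


# Hypothetical 2-D embeddings: two clusters plus two outliers (label -1).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (1.0, 1.0), (9.0, 9.0)]
labels = [0, 0, 1, 1, -1, -1]

new_labels = assign_outliers(points, labels)
print(new_labels)
```

Nearest-centroid assignment is only one possible choice; it is used here because it is simple to follow. Real text embeddings have far more than two dimensions, but the same logic applies.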
