Some useful scripts for speech and language research.

FrancisLawlor/HLT

HLT

lexical_diversity_calculator.py

Calculates lexical diversity as the type/token ratio: the number of distinct word types divided by the total number of tokens.

Preprocessing involves tokenising the input, removing stopwords and using nltk's Porter Stemmer to reduce words to their stems.

Run:

python3 lexical_diversity_calculator.py -n SampleTexts/EdSheeranLyrics.txt

Output:

EdSheeranLyrics.txt lexical diversity: 0.2112
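
The calculation described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the regex tokeniser, the tiny `STOPWORDS` set and the function names `preprocess` and `type_token_ratio` are all assumptions made for the example. The `stem` argument defaults to doing nothing; a real stemmer such as the `stem` method of nltk's `PorterStemmer` could be passed in instead.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for this sketch.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


def preprocess(text, stem=lambda word: word):
    """Lowercase, tokenise on runs of letters/apostrophes, drop stopwords, stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


def type_token_ratio(tokens):
    """Distinct tokens (types) divided by total tokens; 0.0 for empty input."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


tokens = preprocess("The cat sat on the mat and the cat slept")
print(round(type_token_ratio(tokens), 4))  # 4 types / 5 tokens -> 0.8
```

A ratio close to 1 means few words are repeated; a low ratio, as in the lyrics sample above, indicates heavy repetition.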

word_proportions.py

Finds the proportions of adjectives, verbs, nouns and adverbs in a text. Categorises remaining types as 'other'.

Preprocessing involves tokenisation of input and removal of stopwords.

Uses nltk's part-of-speech (POS) tagger to assign parts of speech to the input tokens. Because nltk's POS tagger was trained on the Treebank Corpus, it uses the Treebank tag set. The script maps the Treebank tags to WordNet tags before giving the proportions as output.

Run:

python3 word_proportions.py -n SampleTexts/GulliversTravels.txt

Output:

Adjectives: 7.75 %
Verbs: 17.18 %
Nouns: 22.76 %
Adverbs: 5.6 %
Other: 46.7 %
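
The tag mapping and proportion step described above might look something like the sketch below. It is an illustration under stated assumptions, not the script's own code: the names `TREEBANK_TO_WORDNET`, `wordnet_pos` and `pos_proportions` are hypothetical, and the input is a hand-written list of `(token, tag)` pairs of the kind `nltk.pos_tag` returns, so the example runs without nltk installed. The mapping relies on Treebank adjective, verb, noun and adverb tags starting with `JJ`, `VB`, `NN` and `RB`, and on WordNet's single-letter POS codes `a`, `v`, `n` and `r`.

```python
from collections import Counter

# First two letters of a Treebank tag -> WordNet POS code.
TREEBANK_TO_WORDNET = {"JJ": "a", "VB": "v", "NN": "n", "RB": "r"}
# WordNet POS code -> label used in the report (names assumed for this sketch).
LABELS = {"a": "Adjectives", "v": "Verbs", "n": "Nouns", "r": "Adverbs"}


def wordnet_pos(treebank_tag):
    """Return the WordNet POS code for a Treebank tag, or None if unmapped."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:2])


def pos_proportions(tagged_tokens):
    """Percentage of tokens per category; unmapped tags count as 'Other'."""
    counts = Counter(
        LABELS.get(wordnet_pos(tag), "Other") for _, tag in tagged_tokens
    )
    total = sum(counts.values())
    categories = [*LABELS.values(), "Other"]
    return {c: (100 * counts[c] / total if total else 0.0) for c in categories}


# Hand-tagged example input standing in for the tagger's output.
tagged = [("big", "JJ"), ("dog", "NN"), ("runs", "VBZ"),
          ("quickly", "RB"), ("the", "DT")]
for label, pct in pos_proportions(tagged).items():
    print(f"{label}: {pct:.2f} %")
```

Matching on the two-letter prefix keeps the table short, since variants such as `NNS`, `VBD` or `JJR` fall into the same category as their base tag.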
