English character bigram frequencies and transitional frequencies

For use in, e.g., creating controlled stimuli for artificial language learning experiments with English-speaking participants, or computing corpus-based transitional probabilities between English bigrams.

Created by Elizabeth Pankratz in 2022 based on web corpus data from ENCOW16A-NANO.

Illustration

This is an excerpt from the matrix containing transitional frequencies between character bigrams:

	bu	mn	ke	mo
zr	0	0	0	0
bu	3	1	25	6
mn	5	0	0	1
ke	100	1	16	266
hv	0	0	0	1
mo	2	2	210	12
aa	22	3	8	44

For the transition from bigram $i$ to bigram $j$, this matrix shows the transitional frequency in cell $[i, j]$ (row $i$, column $j$). For instance, the transition from "mo" to "ke" occurs 210 times, e.g., in words like "smoke".
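
As a minimal sketch of how such a matrix could be queried in Python, the following parses the excerpt above and looks up one cell. It assumes a tab-separated layout like the excerpt; the actual layout of data/bigram_freq_mtx.csv may differ. The helper names `load_matrix` and `transitional_probability` are hypothetical. The second helper divides a cell by its row total to estimate a transitional probability, so on this four-column excerpt the result is only illustrative.

```python
import csv
import io

# Excerpt copied from the table above: rows are the first bigram (i),
# columns are the second bigram (j).
EXCERPT = (
    "\tbu\tmn\tke\tmo\n"
    "zr\t0\t0\t0\t0\n"
    "bu\t3\t1\t25\t6\n"
    "mn\t5\t0\t0\t1\n"
    "ke\t100\t1\t16\t266\n"
    "hv\t0\t0\t0\t1\n"
    "mo\t2\t2\t210\t12\n"
    "aa\t22\t3\t8\t44\n"
)


def load_matrix(text, delimiter="\t"):
    """Parse a labelled frequency matrix into {row_bigram: {col_bigram: count}}."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)[1:]  # skip the empty corner cell
    matrix = {}
    for row in reader:
        if not row:
            continue
        matrix[row[0]] = {col: int(val) for col, val in zip(header, row[1:])}
    return matrix


def transitional_probability(matrix, i, j):
    """Estimate P(j | i) as freq(i -> j) / sum of row i; None if the row sums to zero."""
    total = sum(matrix[i].values())
    return matrix[i][j] / total if total else None


mtx = load_matrix(EXCERPT)
print(mtx["mo"]["ke"])                            # cell [mo, ke]
print(transitional_probability(mtx, "mo", "ke"))  # share of row "mo" going to "ke"
```

On the full matrix, the row total would cover all following bigrams rather than just the four columns shown here.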

What is this good for?

If you are running artificial language learning experiments with English-speaking participants and want to limit the influence of their prior linguistic knowledge on your task, then it may be useful to ensure that your artificial language does not contain bigrams or bigram transitions that resemble those in English.

With the frequency list of bigrams, you can restrict your choice of bigrams to a particular frequency range.
With the matrix of transitional frequencies, you can create words using bigram sequences that appear together with particular frequencies.
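
The two selection steps above could be sketched as follows. This is an illustration under stated assumptions, not the procedure used by the author: the counts in `TOY_FREQS` are invented placeholders (not values from data/bigram_freqs.csv), the transition counts are copied from the excerpt in the Illustration section, and the function names and thresholds are hypothetical.

```python
from itertools import permutations

# Invented placeholder counts, NOT taken from data/bigram_freqs.csv.
TOY_FREQS = {"bu": 900, "mn": 40, "ke": 1500, "mo": 1200}

# Transitional frequencies copied from the excerpt in the Illustration section.
TOY_TRANSITIONS = {
    "bu": {"bu": 3, "mn": 1, "ke": 25, "mo": 6},
    "mn": {"bu": 5, "mn": 0, "ke": 0, "mo": 1},
    "ke": {"bu": 100, "mn": 1, "ke": 16, "mo": 266},
    "mo": {"bu": 2, "mn": 2, "ke": 210, "mo": 12},
}


def bigrams_in_range(freqs, low, high):
    """Return the bigrams whose frequency lies in [low, high], sorted alphabetically."""
    return sorted(b for b, n in freqs.items() if low <= n <= high)


def candidate_words(bigrams, transitions, max_transition):
    """Join ordered pairs of distinct bigrams whose transition count is at most max_transition."""
    words = []
    for first, second in permutations(bigrams, 2):
        if transitions[first][second] <= max_transition:
            words.append(first + second)
    return words


pool = bigrams_in_range(TOY_FREQS, 500, 2000)
words = candidate_words(pool, TOY_TRANSITIONS, max_transition=10)
print(pool)   # ['bu', 'ke', 'mo']
print(words)  # ['bumo', 'mobu']
```

Note that this only checks the transition between the two chosen bigrams; a joined word such as "bumo" also contains the overlapping bigram "um", which would need its own check.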

Contents

code/1_get_encow_sents.py: A Python 2 script run on the SeaCOW server in fall 2022. Queries the web corpus for all sentences that belong to boilerplate class a or b (that is, actual content and not just the peripheral junk that's often found on websites) and that occur in documents of badness class a or b (that is, well-formed standard English).
- In: Nothing.
- Out: data/encow_sents.csv (gitignored due to size and copyright; contains 351,392 sentences)

code/2_count_bigrams.ipynb: Gets all character bigrams (both within and between words) in the corpus data. Counts them and all transitions between them.
- In: data/encow_sents.csv
- Out:
  - data/bigram_freqs.csv
  - data/bigram_freq_mtx.csv
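
The counting step could look roughly like the sketch below. It is not the notebook's actual code, and it rests on two labelled assumptions: that text is lowercased and reduced to letters so that bigrams can span word boundaries, and that a transition links a bigram to the non-overlapping bigram starting two characters later (consistent with "mo" to "ke" in "smoke"). The function name `count_bigrams_and_transitions` is hypothetical.

```python
from collections import Counter


def count_bigrams_and_transitions(sentences):
    """Count character bigrams and bigram-to-bigram transitions in a list of sentences."""
    bigram_counts = Counter()
    transition_counts = Counter()
    for sentence in sentences:
        # Assumption: lowercase and keep letters only, so bigrams can span word boundaries.
        chars = "".join(ch for ch in sentence.lower() if ch.isalpha())
        # Sliding window of width 2 over the character string.
        bigrams = [chars[k:k + 2] for k in range(len(chars) - 1)]
        bigram_counts.update(bigrams)
        # Assumption: a transition pairs each bigram with the one two characters later.
        transition_counts.update(zip(bigrams, bigrams[2:]))
    return bigram_counts, transition_counts


freqs, trans = count_bigrams_and_transitions(["Smoke rises."])
print(freqs["mo"])          # occurrences of the bigram "mo"
print(trans[("mo", "ke")])  # occurrences of the transition "mo" -> "ke"
```

The transition counts are keyed by `(first, second)` tuples here; writing them out as a square matrix like the one in the Illustration section would be a separate step.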
