This script identifies the two-character sequences (i.e., bigraphs) that appear in the sample from ENCOW16A-NANO, both within words and across word boundaries.
It counts and saves the frequencies of these bigraphs.
It also creates a co-occurrence matrix over all observed bigraphs (plus any unseen CV- and VC-shaped ones) that contains the transitional frequencies from each bigraph to each other one.

In:
- `../data/encow_sents.csv` (gitignored due to size)

Out:
- `../data/bigram_freqs.csv`
- `bigram_freq_mtx.csv` (written to the working directory)

In [2]:
import pandas as pd
import nltk
import re
import itertools

In [1]:
def tidy_sent(sent):
    """
    Arg:
        sent: string containing a sentence
    Returns:
        string; sent lowercased, with all non-alphabetic characters (including whitespace) removed.
    """
    sent = re.sub(r'[^A-Za-z]+', '', sent)
    return sent.lower()

Read in the sentences, collapse them to a long string of alphabetic characters, and get the component character bigrams.
This lets us look at the transitions not just within words, but also between them; we don't treat them differently.

In [3]:
# Read in data and tidy.
encow = pd.read_csv('../data/encow_sents.csv')
encow['sent_proc'] = encow['sent'].map(tidy_sent)

# Concatenate all sentences into a single string with no spaces. (Approx 10s runtime)
chars = ''.join(encow['sent_proc'])  # n characters: 40,799,735

In [4]:
# Get all the bigrams in this long string. (Approx 12s runtime.)
all_char_bigrams = ["".join(tup) for tup in nltk.bigrams(chars)]  # len: 40,799,734
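
As a small illustration of the step above, here is a sketch on two made-up sentences. It shows that once whitespace is stripped, bigrams also span word and sentence boundaries. The helper `char_bigrams` and the toy sentences are assumptions for this example only; `char_bigrams` uses plain `zip` as a stand-in for `nltk.bigrams`.

```python
import re

def tidy_sent(sent):
    # Same cleaning as above: keep letters only, then lowercase.
    return re.sub(r'[^A-Za-z]+', '', sent).lower()

def char_bigrams(text):
    # Hypothetical helper: overlapping two-character substrings, in order.
    return [a + b for a, b in zip(text, text[1:])]

toy_sents = ["The cat sat.", "It ran!"]  # made-up data, not from ENCOW
toy_chars = ''.join(tidy_sent(s) for s in toy_sents)
toy_bigrams = char_bigrams(toy_chars)

print(toy_chars)
print(toy_bigrams)
```

In this toy string, `'ec'` comes from the boundary between "the" and "cat", and `'ti'` from the boundary between the two sentences.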

In [7]:
# Count the bigram frequencies and save distribution. (Approx 8s runtime.)
# Name the columns explicitly so the result does not depend on the pandas version's default column names.
bigram_freqs = pd.Series(all_char_bigrams).value_counts().rename_axis('bigram').reset_index(name='freq')
print('There are {0} bigram types.'.format(len(bigram_freqs)))

bigram_freqs.to_csv('../data/bigram_freqs.csv', index = False)

There are 673 bigram types.


Now let's look at the transitions between these bigrams.
For example, for the input `['ma', 'ar', 'rk', 'ke', 'et']` (the bigrams in "market"), we want to get the transitions `[('ma', 'rk'), ('ar', 'ke'), ('rk', 'et')]`.

This is not the same as taking the bigrams of the bigrams: that would give us `('ma', 'ar'), ('rk', 'ke')`, etc., in which the two members of each pair share a letter.
We don't want that.

In [9]:
def get_bigram_transitions(list_of_bigrams):
    """
    For input ['ma', 'ar', 'rk', 'ke', 'et'], 
    returns [('ma', 'rk'), ('ar', 'ke'), ('rk', 'et')]

    Arg:
        list_of_bigrams: a list of strings (character bigrams)
    Returns:
        list of tuples, each a transition between the original bigraphs
    """
    return list(zip(list_of_bigrams[:-2], list_of_bigrams[2:]))

# for the string "marketing has been taking"
get_bigram_transitions(all_char_bigrams[:21])

[('ma', 'rk'),
 ('ar', 'ke'),
 ('rk', 'et'),
 ('ke', 'ti'),
 ('et', 'in'),
 ('ti', 'ng'),
 ('in', 'gh'),
 ('ng', 'ha'),
 ('gh', 'as'),
 ('ha', 'sb'),
 ('as', 'be'),
 ('sb', 'ee'),
 ('be', 'en'),
 ('ee', 'nt'),
 ('en', 'ta'),
 ('nt', 'ak'),
 ('ta', 'ki'),
 ('ak', 'in'),
 ('ki', 'ng')]
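
To make the contrast with plain bigrams-of-bigrams concrete, the following sketch computes both on the "market" example. The variable names `adjacent` and `transitions` are introduced here for illustration only.

```python
bigrams = ['ma', 'ar', 'rk', 'ke', 'et']

# Bigrams of bigrams: each pair overlaps by one letter.
adjacent = list(zip(bigrams, bigrams[1:]))

# Offset of two: the pairs cover four distinct consecutive characters.
transitions = list(zip(bigrams[:-2], bigrams[2:]))

print(adjacent)
print(transitions)
```

Every pair in `adjacent` has the last letter of its first member equal to the first letter of its second member, whereas each pair in `transitions` concatenates to a four-character substring of "market".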

Now count how many times every transition appears in the data.

In [10]:
# Runtime approx 40s.
all_transitions = get_bigram_transitions(all_char_bigrams)
trans_freqs = pd.Series(all_transitions).value_counts()
trans_freqs = trans_freqs.reset_index().rename(columns = {0:'trans_freq'})

# Bigraphs currently saved as a tuple in the index column; we'll split them up.
trans_freqs[['bigraph1', 'bigraph2']] = pd.DataFrame(trans_freqs['index'].tolist(), index = trans_freqs.index)
trans_freqs.head()

Unnamed: 0,index,trans_freq,bigraph1,bigraph2
0,"(ti, on)",208659,ti,on
1,"(at, io)",118088,at,io
2,"(nt, he)",112217,nt,he
3,"(th, at)",111215,th,at
4,"(th, er)",108323,th,er


We'll convert these into a matrix in which cell $i, j$ contains the frequency of the transition from bigraph $i$ to bigraph $j$.
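
Stated a little more formally (the symbols $b_k$, $N$, and $M$ are notation introduced here, not taken from the script): if $b_1, \dots, b_N$ is the sequence of character bigrams in corpus order, then

$$M_{ij} = \left|\{\, k : 1 \le k \le N-2,\ b_k = i,\ b_{k+2} = j \,\}\right|$$

The offset of two in $b_{k+2}$ matches the slicing in `get_bigram_transitions`.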

In [18]:
# Now we'll compute the co-occurrence matrix that contains the transitional frequencies of all bigraphs to all other bigraphs.

def coocc_mtx(df):
    """
    Creates a matrix with the transitional frequencies from the bigraphs 
    in the df index to the bigraphs in the col.

    Args:
        df: pandas dataframe containing minimally columns 'bigraph1', 'bigraph2', 'trans_freq'
    Returns:
        pd df, a co-occurrence matrix of how many times every bigraph in bigraphs 
        appears with every other one in df (zero if never).
    """
    # Begin the set of bigraphs that we'll be looking at: all bigraphs in df.
    bigraphs = set(df['bigraph1']).union(set(df['bigraph2']))

    # Add to this set all the other possible syllables we might want to consider:
    # anything CV and VC that we haven't seen yet.
    cons = 'bcdfghjklmnpqrstvwxz'
    vows = 'aeiou'
    cv = set(["".join(chars) for chars in list(itertools.product(cons, vows))])
    vc = set(["".join(chars) for chars in list(itertools.product(vows, cons))])
    bigraphs.update(cv)
    bigraphs.update(vc)

    # Convert to list and use to create indices and headers for mtx.
    bigraphs = list(bigraphs)
    mtx = pd.DataFrame(index=range(len(bigraphs)),columns=range(len(bigraphs)))

    # Make a co-occurrence matrix of all elements as bigraph1 (rows) and bigraph2 (cols) in df.
    for b1_idx in range(len(bigraphs)):  # b1 is rows
        for b2_idx in range(len(bigraphs)):  # b2 is cols
            b1 = bigraphs[b1_idx]
            b2 = bigraphs[b2_idx]

            # Get the frequency of these bigrams co-occurring in the corpus data.
            # If the result is the empty list, it means they co-occur zero times.
            coocc_freq = list(df[(df['bigraph1'] == b1) & (df['bigraph2'] == b2)]['trans_freq'])
            if len(coocc_freq) == 1:
                mtx.iloc[b1_idx, b2_idx] = coocc_freq[0]
            elif len(coocc_freq) == 0:
                mtx.iloc[b1_idx, b2_idx] = 0

    mtx.index = bigraphs
    mtx.columns = bigraphs
    return mtx

Unnamed: 0,ab,ew,ca,og,be,uk,iq,ar,fi,bu,...,ma,ed,nt,hi,as,ha,qe,xa,fe,ub
ab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ew,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ca,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
og,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
be,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ha,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
qe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
xa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fe,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
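
The nested loop in `coocc_mtx` filters the whole DataFrame once per cell, which is probably why the full run is slow. Below is a sketch of a possible vectorized alternative using `pivot_table` and `reindex`. It is not the method used in this notebook: the function name `coocc_mtx_pivot` and the toy data are assumptions for this example. It also assumes each (`bigraph1`, `bigraph2`) pair occurs at most once in `df`, as is the case after `value_counts`; duplicate pairs would be summed. Rows and columns come out sorted alphabetically rather than in set order.

```python
import itertools
import pandas as pd

def coocc_mtx_pivot(df):
    # Hypothetical alternative to coocc_mtx: same cell values, built without a Python loop.
    seen = set(df['bigraph1']) | set(df['bigraph2'])
    cons = 'bcdfghjklmnpqrstvwxz'
    vows = 'aeiou'
    cv = {c + v for c, v in itertools.product(cons, vows)}
    vc = {v + c for v, c in itertools.product(vows, cons)}
    bigraphs = sorted(seen | cv | vc)

    # Spread the long table into a wide one: rows are bigraph1, columns are bigraph2.
    wide = df.pivot_table(index='bigraph1', columns='bigraph2',
                          values='trans_freq', aggfunc='sum', fill_value=0)

    # Add rows and columns for bigraphs that never occur in that position.
    return wide.reindex(index=bigraphs, columns=bigraphs, fill_value=0).astype(int)

# Toy input (made-up frequencies) in the same shape as trans_freqs.
toy = pd.DataFrame({'bigraph1': ['ti', 'at'],
                    'bigraph2': ['on', 'io'],
                    'trans_freq': [5, 3]})
toy_mtx = coocc_mtx_pivot(toy)
print(toy_mtx.loc['ti', 'on'], toy_mtx.loc['on', 'ti'])
```

On real data it would be worth checking a handful of cells against `coocc_mtx` before relying on this version.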


Now we'll create the full proper matrix with all of the transitional frequencies and save it.

Runtime notes: at least 217 min; 433 min 1.3 sec.

In [19]:
# Runtime: approx X.
mtx = coocc_mtx(trans_freqs)
mtx

mtx.to_csv('bigram_freq_mtx.csv')