## Rough-Level Alignment: Locate Subset in Open Subtitle

In the rough alignment phase, we try to align each utterance in a TBBT episode with an index in the Open Subtitle dataset. Specifically, we do the following:

1. Clean the strings and then perform matching.
2. Assign an index to each utterance, so that each utterance is associated with a set of subtitle records of the form [index, segment, en_subtitle, zh_subtitle]:
   - index is the subtitle's index in the Open Subtitle dataset
   - segment is the utterance segment used for alignment
   - en_subtitle and zh_subtitle are the whole subtitle in English and Chinese, respectively
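
To make the record layout concrete, here is a minimal sketch of grouping such records by subtitle index. The rows are invented placeholders for illustration and do not come from the dataset; `records` and `by_index` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical records in the form [index, segment, en_subtitle, zh_subtitle].
# The values are placeholders, not real dataset entries.
records = [
    [1042, "segment a", "en subtitle A", "zh subtitle A"],
    [1042, "segment b", "en subtitle A", "zh subtitle A"],
    [1050, "segment c", "en subtitle B", "zh subtitle B"],
]

# Group the remaining fields under their subtitle index.
by_index = defaultdict(list)
for index, segment, en_sub, zh_sub in records:
    by_index[index].append([segment, en_sub, zh_sub])

print(sorted(by_index))     # the distinct subtitle indices
print(len(by_index[1042]))  # number of segments that hit index 1042
```

Several utterance segments can point at the same subtitle index, which is why each index maps to a list.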


In [131]:
import pickle as pkl
import json
from collections import defaultdict
import jiwer

In [13]:
"""
Group the search results of one episode by subtitle index.
Input: list of [index, segment, en_subtitle, zh_subtitle] records
Output: {index: [[segment, en_subtitle, zh_subtitle], ...]}
"""
def get_index_dict(episode):
    # defaultdict(list) creates the empty list on first access
    index_dict = defaultdict(list)
    for idx, segment, en_sub, zh_sub in episode:
        index_dict[idx].append([segment, en_sub, zh_sub])
    return index_dict

In [51]:
# Organize the data as {season: {episode: index_dict}}
def organize_by_seasons(all_data):
    res = {}
    for epi in all_data:
        # Keys are assumed to look like 'S01E01'
        season = int(epi[1:3])
        episode = int(epi[-2:])
        # Process the data in one episode
        res.setdefault(season, {})[episode] = get_index_dict(all_data[epi])
    return res

In [94]:
"""
Get all the indices within one episode and the gaps between them
Input: {index: [[segment, en_subtitle, zh_subtitle], ...]}
Output: sorted index list, list of gaps between neighbouring indices
"""
def get_epi_indexs_gaps(episode):
    idx_list = sorted(episode)
    # Calculate gaps
    gaps = calculate_gaps(idx_list)
    return idx_list, gaps

In [95]:
"""
Calculate the gaps between neighbouring elements of a list of integers
"""
def calculate_gaps(idx_list):
    # Sort a copy so that the caller's list is not modified
    idx_list = sorted(idx_list)
    return [idx_list[i+1]-idx_list[i] for i in range(len(idx_list)-1)]

In [96]:
"""
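Locate continuous subsets in which the gap between two neighbouring indices
is no larger than the gap threshold
Input: sorted indices, gaps, minimum subset length, maximum gap
Output: list of continuous subsets, each a list of indices
"""
def find_all_continuous_subsets(idx_list, gaps, len_threshold, gap_threshold):
    res = []
    if not idx_list:
        return res
    path = [idx_list[0]]
    for i in range(len(gaps)):
        if gaps[i]<=gap_threshold:
            path.append(idx_list[i+1])
        else:
            if len(path)>=len_threshold:
                res.append(path)
            path = [idx_list[i+1]]
    # Keep the final run as well; no large gap follows it to close it
    if len(path)>=len_threshold:
        res.append(path)
    return res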
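
As a worked illustration of the gap-based grouping idea, the sketch below runs it on a toy index list. `find_runs` is a hypothetical stand-alone helper written for this example, and the numbers are invented. Unlike a loop that only closes a run when it meets a large gap, this version also considers the run that is still open at the end of the list.

```python
def find_runs(indices, len_threshold, gap_threshold):
    """Split sorted indices into runs whose neighbouring gaps are small."""
    indices = sorted(indices)
    runs = []
    current = indices[:1]
    for prev, nxt in zip(indices, indices[1:]):
        if nxt - prev <= gap_threshold:
            current.append(nxt)
        else:
            # A large gap closes the current run
            if len(current) >= len_threshold:
                runs.append(current)
            current = [nxt]
    # The last run has no following gap, so check it explicitly
    if len(current) >= len_threshold:
        runs.append(current)
    return runs

# Toy data: one stray hit, one dense cluster, one short cluster
toy = [5, 900, 903, 905, 910, 3000, 3002]
runs = find_runs(toy, len_threshold=3, gap_threshold=100)
print(runs)
```

Only the dense cluster survives here: the stray hit and the two-element cluster are shorter than `len_threshold`.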

In [56]:
# Load search result
with open('episode_indexs_transformed.pkl', 'rb') as f:
    temp = pkl.load(f)
results = organize_by_seasons(temp)

In [103]:
# Check all continuous subsets in each episode
for season_id in sorted(results):
    season = results[season_id]
    for episode_id in sorted(season):
        idx_list, gaps = get_epi_indexs_gaps(season[episode_id])
        subsets = find_all_continuous_subsets(idx_list, gaps, 6, 100)
        if subsets:
            # Report the gaps inside the last subset found
            subset_gaps = calculate_gaps(subsets[-1])
            print(subset_gaps)
            print("Season:", season_id, "|Episode:", episode_id, "|Subset Length:", len(subsets[-1]), "|Sum:", sum(subset_gaps), "|Maximum:", max(subset_gaps))
        else:
            # No subset met the length threshold
            print("Season:", season_id, "Episode:", episode_id, "Subset Length:", subsets)
        print('=='*50)

[3, 2, 1, 2, 2, 13, 1, 2, 1, 1, 1, 1, 4, 1, 7, 4, 10, 7, 25, 1, 1, 7, 3, 4, 1, 2, 7, 2, 1, 1, 9, 7, 2, 1, 1, 2, 1, 1, 1, 3, 7, 7, 1, 1, 1, 5, 14, 1, 10, 1, 1, 1, 20, 1, 2, 5, 4, 1, 1, 3, 1, 1, 4, 3, 13, 1, 1, 4, 1, 2, 3, 3, 8, 1, 14, 13, 16, 2, 2, 1, 4, 1, 1, 3, 2]
Season: 1 |Episode: 1 |Subset Length: 86 |Sum: 338 |Maximum: 25
[1, 17, 1, 1, 6, 2, 3, 3, 44, 2, 2, 1, 23, 4, 4, 14, 1, 2, 7, 1, 8, 1, 8, 2, 5, 1, 3, 1, 2, 1, 7, 2, 2, 1, 5, 2, 4, 1, 2, 1, 1, 1, 8, 2, 46, 1, 1, 2, 7]
Season: 1 |Episode: 2 |Subset Length: 50 |Sum: 267 |Maximum: 46
[1, 5, 3, 2, 8, 3, 5, 2, 5, 7, 10, 5, 1, 5, 1, 2, 27, 1, 1, 1, 16, 15, 1, 1, 1, 1, 19, 1, 4, 9, 4, 10, 2, 2, 3, 14, 2, 1, 3, 2, 1, 1, 1, 1, 1, 3, 2, 7, 12, 1, 4, 2, 3, 1, 1, 5, 1, 1, 7, 8, 4, 11, 13, 1, 1, 8, 6]
Season: 1 |Episode: 3 |Subset Length: 68 |Sum: 313 |Maximum: 27
[2, 14, 1, 8, 3, 7, 1, 6, 3, 18, 12, 10, 2, 8, 8, 8, 2, 1, 1, 2, 3, 2, 1, 2, 20, 11, 9, 2, 5, 1, 1, 3, 7, 41, 1, 1, 9, 6, 2, 6, 1, 4, 4, 9, 1, 16]
Season: 1 |Episode: 4 |Subset 

## Fine-Level Alignment: Search within the Open Subtitle Subset

In [113]:
# Load Open Subtitle
with open('en_subtitles.pkl', 'rb') as f:
    en_subtitle = pkl.load(f)
with open('zh_subtitles.pkl', 'rb') as f:
    zh_subtitle = pkl.load(f)

In [123]:
# Load Memor Dataset
with open('memor/data.json') as f:
    tbbt = json.load(f)

In [130]:
"""
Experiment within one episode: Season 1, Episode 1
"""
idx_list, gaps = get_epi_indexs_gaps(results[1][1])
# Use the last continuous subset found for this episode
subsets = find_all_continuous_subsets(idx_list, gaps, 6, 100)[-1]
# Calculate gaps within the subset
gaps_subsets = calculate_gaps(subsets)

# Prepare Subtitle Subset, padded by `bias` subtitles on each side
bias = 200
# Clamp at 0 so that a negative start does not wrap around when slicing
start = max(0, subsets[0]-bias)
end = subsets[-1]+bias
en_subset = en_subtitle[start: end]
zh_subset = zh_subtitle[start: end]

# Prepare utterances of one episode
tbbt_episode = []
for item in tbbt:
    if item.strip().split('_')[0]=='S01E01':
        sentences = tbbt[item]['sentences']
        speakers = tbbt[item]['speakers']
        for sentence, speaker in zip(sentences, speakers):
            tbbt_episode.append([sentence, speaker])

In [132]:
# Define sentence transformation
transformation = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.ExpandCommonEnglishContractions(),
    jiwer.RemovePunctuation(),
    jiwer.Strip()
])

In [194]:
"""
Use cleaned Open Subtitle sentences to match substrings of the utterances in an episode
Each subtitle is split into non-overlapping segments of 6 tokens;
subtitles with fewer than 6 tokens are skipped
Output: {subtitle position in en_subset: [utterance positions in episode]}
"""
def string_match(en_subset, episode):
    # Clean every utterance once instead of once per subtitle
    cleaned_utts = [transformation(utt) for utt, speaker in episode]
    res = {}
    for i, subtitle in enumerate(en_subset):
        subtitle = transformation(subtitle)
        subtitle_tokens = subtitle.strip().split(" ")
        if len(subtitle_tokens)<6:
            continue
        # Build subtitle segments
        subtitle_segments = []
        num_iter = len(subtitle_tokens) // 6
        for j in range(num_iter):
            subtitle_segments.append(" ".join(subtitle_tokens[j*6: j*6+6]))

        # Use the `episode` argument rather than the global tbbt_episode
        for j, utt in enumerate(cleaned_utts):
            for sub_seg in subtitle_segments:
                if sub_seg in utt:
                    res.setdefault(i, []).append(j)
    return res
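
The segment-and-search step can be seen in isolation in the sketch below. `six_token_segments` is a hypothetical helper, and the strings are already-cleaned examples taken from the Season 1, Episode 1 output in this notebook.

```python
def six_token_segments(text, size=6):
    """Split cleaned text into non-overlapping segments of `size` tokens.

    A trailing remainder shorter than `size` is dropped, which matches
    iterating len(tokens) // size times.
    """
    tokens = text.split()
    return [
        " ".join(tokens[k:k + size])
        for k in range(0, len(tokens) - size + 1, size)
    ]

subtitle = "it will not have gone through both slits"
utterances = [
    "agreed what is your point",
    "but before it hits its target it will not have gone through both slits",
]

segments = six_token_segments(subtitle)
# Positions of the utterances that contain at least one segment
matches = [
    j for j, utt in enumerate(utterances)
    if any(seg in utt for seg in segments)
]
print(segments)
print(matches)
```

Because the remainder is dropped, the last two tokens of this eight-token subtitle play no part in the match.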

In [197]:
alignment = string_match(en_subset, tbbt_episode)

In [201]:
"""
Print the subtitles in en_subset that string_match did not align
This function does not compute a word error rate; the returned dict is empty
Relies on the global `alignment` computed above
"""
def string_wer(en_subset, episode):
    res = {}
    for i, subtitle in enumerate(en_subset):
        if i not in alignment:
            print(subtitle)
    return res
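
The name `string_wer` and the `jiwer` import suggest a word-error-rate comparison, which the function above does not compute. Purely as an illustration, the sketch below computes a word-level edit-distance ratio with the standard library. `word_error_rate` is a hypothetical helper, not part of jiwer, and it is not necessarily the scoring this notebook intends.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

same = word_error_rate("agreed what is your point", "agreed what is your point")
one_sub = word_error_rate("a b c d", "a x c d")
print(same, one_sub)
```

A score like this could rank near matches that exact substring search misses, at the cost of choosing a cut-off threshold.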

In [202]:
temp = string_wer(en_subset, tbbt_episode)

Thank you very much. Good day to you.
Good day to you.
Come and buy a dresser!
The years with Iisakki passed quickly.
Before I knew it, I was all grown up, with a beard and all.
The village had grown.
There were so many new children - that me and Iisakki could not keep count.
But we had a secret helper.
Nikolas.
-Eemeli.
Long time no see. You should come more often.
I've been busy. Iisakki is no longer young.
Do you have the list?
Well, I'll be... So many new children.
As a matter of fact, one name is missing from that list. Elsa?
Is that...
-A girl, three months.
Let's add her to the list.
What is the name of this little princess?
Aada.
Aada?
Hello, Aada.
Nikolas, meet Henrik and Hermanni.
My sons.
I sought them out and asked them here.
We were wrong when we...
We want to make it up to our father.
We came to take him to live with us.
-To live with you? Where?
Away from here. Father is too old to be living in arctic conditions.
You'll get the house and the workshop. You've earned them.

In [204]:
print(transformation("- Agreed. What's your point?"))
print(transformation("agreed, what's your point?"))

agreed what is your point
agreed what is your point


In [205]:
print(transformation("Twenty-six across is MCM."))
print(transformation(" twenty-six across is mcm,"))

twentysix across is mcm
twentysix across is mcm
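
The two checks above show differently punctuated strings collapsing to the same cleaned form. The standard-library sketch below approximates that cleaning for readers who want to see the steps. It is not jiwer's implementation: `normalize` is a hypothetical name, and the contraction table is a small illustrative subset. It collapses whitespace after removing punctuation, so a removed dash does not leave a double space behind.

```python
import string

# Hypothetical, deliberately small contraction table for illustration
CONTRACTIONS = {
    "what's": "what is",
    "it's": "it is",
    "there's": "there is",
    "that's": "that is",
}

# Translation table that deletes every ASCII punctuation character
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)

def normalize(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(_DROP_PUNCTUATION)
    # Collapse runs of whitespace and trim both ends
    return " ".join(text.split())

a = normalize("- Agreed. What's your point?")
b = normalize("agreed, what's your point?")
c = normalize("Twenty-six across is MCM.")
print(a)
print(b)
print(c)
```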


In [191]:
# Collect the matched indices of both subtitles and utterances
all_tbbt_indexs = set()
for subtitle_id in temp:
    for utt_id in temp[subtitle_id]:
        all_tbbt_indexs.add(utt_id)
# print(all_tbbt_indexs)
all_subtitle_indexs = set(temp.keys())
# print(all_subtitle_indexs)

In [192]:
for i, (utt, speaker) in enumerate(tbbt_episode):
    if i in all_tbbt_indexs:
        print("||||", utt)
    else:
        print(utt)

|||| if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. if it's unobserved it will, however, if it's observed after it's left the plane but before it hits its target, it will not have gone through both slits.
agreed, what's your point?
|||| there's no point, i just think it's a good idea for a t-shirt.
|||| one across is aegean, eight down is nabakov, twenty-six across is mcm, fourteen down is... move your finger... phylum, which makes fourteen across port-au-prince. see, papa doc's capital idea, that's port-au-prince. haiti.
|||| no. we are committing genetic fraud. there's no guarantee that our sperm is going to generate high iq offspring, think about that. i have a sister with the same basic dna mix who hostesses at fuddruckers.
|||| sheldon, this was your idea. a little extra money to get fractional t1 bandwidth in the apartment.
|||| i know, and i do yearn for faster downloads, but there's some poor woman is 

In [193]:
for i, subtitle in enumerate(en_subset):
    if i in temp:
        print("||||", subtitle)
    else:
        print(subtitle)

Thank you very much. Good day to you.
Good day to you.
Come and buy a dresser!
The years with Iisakki passed quickly.
Before I knew it, I was all grown up, with a beard and all.
The village had grown.
There were so many new children - that me and Iisakki could not keep count.
But we had a secret helper.
Nikolas.
-Eemeli.
Long time no see. You should come more often.
I've been busy. Iisakki is no longer young.
Do you have the list?
Well, I'll be... So many new children.
As a matter of fact, one name is missing from that list. Elsa?
Is that...
-A girl, three months.
Let's add her to the list.
What is the name of this little princess?
Aada.
Aada?
Hello, Aada.
Nikolas, meet Henrik and Hermanni.
My sons.
I sought them out and asked them here.
We were wrong when we...
We want to make it up to our father.
We came to take him to live with us.
-To live with you? Where?
Away from here. Father is too old to be living in arctic conditions.
You'll get the house and the workshop. You've earned them.

In [141]:
# Step 1: Check whether a cleaned subtitle is a substring of a cleaned utterance
for subtitle in en_subset:
    subtitle = transformation(subtitle)
    for utt, speaker in tbbt_episode:
        utt = transformation(utt)
        if subtitle in utt:
            print("Subtitle:", subtitle)
            print("Utterance:", utt)
            print("=="*50)

Subtitle: is that
Utterance: i think what sheldon is trying to say is that sagittarius would not have been our first guess
Subtitle: is that
Utterance: you want to know the most pathetic part even though i hate his lying cheating guts i still love him is that crazy
Subtitle: red
Utterance: two hundred pound transvestite with a skin condition yes she is
Subtitle: red
Utterance: two hundred pound transvestite with a skin condition yes she is
Subtitle: i will show you
Utterance: come on i will show you the trick with the shower
Subtitle: that is nice
Utterance: oh that is nice
Subtitle: well
Utterance: well what do you want to do
Subtitle: well
Utterance: great well bye
Subtitle: well
Utterance: well
Subtitle: well
Utterance: well that is interesting leonard can not process corn
Subtitle: well
Utterance: well that is interesting leonard can not process corn
Subtitle: well
Utterance: well if that was a movie i would go see it
Subtitle: well
Utterance: well it sounds wonderful
Subtitle: wel
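
The `red` match inside `two hundred pound ...` above happens because `in` compares characters, not words. One possible remedy, sketched under the assumption that both strings are already cleaned and space-separated, is to compare token sequences instead. `contains_tokens` is a hypothetical helper.

```python
def contains_tokens(utterance, subtitle):
    """Return True if the subtitle's tokens occur contiguously in the utterance."""
    utt = utterance.split()
    sub = subtitle.split()
    if not sub:
        return False
    width = len(sub)
    return any(utt[k:k + width] == sub for k in range(len(utt) - width + 1))

utt_a = "two hundred pound transvestite with a skin condition yes she is"
utt_b = "come on i will show you the trick with the shower"

print("red" in utt_a)                             # character-level substring test
print(contains_tokens(utt_a, "red"))              # token-level test
print(contains_tokens(utt_b, "i will show you"))  # token-level test
```

Token-level matching still accepts very short subtitles such as `well`, so a minimum length filter remains useful.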

In [144]:
for utt, speaker in tbbt_episode:
    utt = transformation(utt)
    for subtitle in en_subset:
        if len(subtitle.strip().split(" "))<6:
            continue
        subtitle = transformation(subtitle)
        if subtitle in utt:
            print("Utterance:", utt)
            print("Subtitle:", subtitle)
            print("=="*50)

Utterance: if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits if it is unobserved it will however if it is observed after it is left the plane but before it hits its target it will not have gone through both slits
Subtitle: it will not have gone through both slits
Utterance: there is no point i just think it is a good idea for a tshirt
Subtitle: there is no point i just think it is a good idea for a tshirt
Utterance: one across is aegean eight down is nabakov twentysix across is mcm fourteen down is move your finger phylum which makes fourteen across portauprince see papa doc is capital idea that is portauprince haiti
Subtitle: fourteen down is move your finger
Utterance: one across is aegean eight down is nabakov twentysix across is mcm fourteen down is move your finger phylum which makes fourteen across portauprince see papa doc is capital idea that is portauprince haiti
Subtitle: see papa doc is capital idea tha

In [142]:
for subtitle in en_subset:
    subtitle = transformation(subtitle)
    print(subtitle)

thank you very much good day to you
good day to you
come and buy a dresser
the years with iisakki passed quickly
before i knew it i was all grown up with a beard and all
the village had grown
there were so many new children  that me and iisakki could not keep count
but we had a secret helper
nikolas
eemeli
long time no see you should come more often
i have been busy iisakki is no longer young
do you have the list
well i will be so many new children
as a matter of fact one name is missing from that list elsa
is that
a girl three months
let us add her to the list
what is the name of this little princess
aada
aada
hello aada
nikolas meet henrik and hermanni
my sons
i sought them out and asked them here
we were wrong when we
we want to make it up to our father
we came to take him to live with us
to live with you where
away from here father is too old to be living in arctic conditions
you will get the house and the workshop you have earned them
wait nikolas
nikolas
nikolas try to understand

In [140]:
for utt, speaker in tbbt_episode:
    print(utt)
    print(transformation(utt))
    print('=='*50)


if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits. if it's unobserved it will, however, if it's observed after it's left the plane but before it hits its target, it will not have gone through both slits.
if a photon is directed through a plane with two slits in it and either slit is observed it will not go through both slits if it is unobserved it will however if it is observed after it is left the plane but before it hits its target it will not have gone through both slits
agreed, what's your point?
agreed what is your point
there's no point, i just think it's a good idea for a t-shirt.
there is no point i just think it is a good idea for a tshirt
one across is aegean, eight down is nabakov, twenty-six across is mcm, fourteen down is... move your finger... phylum, which makes fourteen across port-au-prince. see, papa doc's capital idea, that's port-au-prince. haiti.
one across is aegean eight down is nabakov twen

In [None]:
for i in range(start, end):
    if i in subsets:
        print("||||", en_subtitle[i])
        print("||||", zh_subtitle[i])
    else:
        print(en_subtitle[i])
        print(zh_subtitle[i])
    print('=='*50)