### Filter corpus
- This notebook filters the corpus files for the three languages (English, Korean, and Chinese).
- Currently, the number of talks (documents) and the number of paragraphs per talk differ across the three corpora.
- To build a parallel corpus, these numbers must match.
- The current details about the corpus are in this [link](https://github.ubc.ca/iameleve/COLX_523_Group2/blob/master/Milestone_2/corpus_readme.md#5-corpus-details)

### Prepare corpora in dataframes

In [1]:
import json
import pandas as pd

In [2]:
en_file_path = "../transcripts/en/annotated/annotated_ted_talks_en.json"
ko_file_path = "../transcripts/ko/annotated/annotated_ted_talks_ko.json"
cn_file_path = "../transcripts/zh-cn/annotated/annotated_ted_talks_cn.json"

In [3]:
with open(en_file_path, "r", encoding="utf-8") as f:
    en_corpus = json.load(f)

In [4]:
with open(ko_file_path, "r", encoding="utf-8") as f:
    ko_corpus = json.load(f)

In [5]:
with open(cn_file_path, "r", encoding="utf-8") as f:
    cn_corpus = json.load(f)

In [6]:
en_df = pd.DataFrame(en_corpus)
ko_df = pd.DataFrame(ko_corpus)
cn_df = pd.DataFrame(cn_corpus)

### First, filter talks
- We keep only the talks that appear in all three corpora.
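
The same intersection can also be computed with Python sets, which avoids the duplicate rows that chained merges can produce. This is a minimal sketch on toy data; the `common_titles` helper and the sample titles are hypothetical and not part of the notebook.

```python
import pandas as pd


def common_titles(*frames):
    """Return the sorted titles that occur in every DataFrame."""
    title_sets = [set(frame["title"]) for frame in frames]
    return sorted(set.intersection(*title_sets))


# Toy stand-ins for en_df, ko_df and cn_df (hypothetical titles)
toy_en = pd.DataFrame({"title": ["talk_a", "talk_b", "talk_c", "talk_a"]})
toy_ko = pd.DataFrame({"title": ["talk_b", "talk_a"]})
toy_cn = pd.DataFrame({"title": ["talk_a", "talk_d", "talk_b"]})

shared = common_titles(toy_en, toy_ko, toy_cn)
print(shared)
```

Because a set holds each title once, no `drop_duplicates()` call is needed on the result.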

In [7]:
en_title = en_df[["title"]]
ko_title = ko_df[["title"]]
cn_title = cn_df[["title"]]

In [10]:
en_title["title"][775]

'bill_joy_what_i_m_worried_about_what_i_m_excited_about'

In [81]:
x = en_title.merge(ko_title, on="title", how="inner").drop_duplicates()

In [82]:
y = x.merge(cn_title, on="title", how="inner").drop_duplicates()

In [83]:
common_corpus = y["title"].to_list()

In [84]:
len(common_corpus)

519

In [85]:
# .copy() gives independent dataframes, so adding columns later does not raise SettingWithCopyWarning
en_df_filtered = en_df[en_df["title"].isin(common_corpus)].drop_duplicates("title").copy()
ko_df_filtered = ko_df[ko_df["title"].isin(common_corpus)].drop_duplicates("title").copy()
cn_df_filtered = cn_df[cn_df["title"].isin(common_corpus)].drop_duplicates("title").copy()

In [86]:
assert len(en_df_filtered) == len(ko_df_filtered) == len(cn_df_filtered) == len(common_corpus)

### Second, keep only the talks whose number of paragraphs is identical in all three dataframes

In [70]:
# add a column that identifies the number of paragraphs in each talk

In [87]:
en_df_filtered["num_paras"] = en_df_filtered["text"].apply(len)
ko_df_filtered["num_paras"] = ko_df_filtered["text"].apply(len)
cn_df_filtered["num_paras"] = cn_df_filtered["text"].apply(len)

In [110]:
en_paras = en_df_filtered[["title", "num_paras"]].reset_index()
ko_paras = ko_df_filtered[["title", "num_paras"]].reset_index()
cn_paras = cn_df_filtered[["title", "num_paras"]].reset_index()

In [113]:
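# NOTE: after reset_index(), the columns below are aligned by row position, not by title;
# this assumes the three filtered dataframes list the talks in the same order
en_paras["num_paras_ko"] = ko_paras["num_paras"]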

In [114]:
en_paras["num_paras_cn"] = cn_paras["num_paras"]

In [118]:
para_common_list = en_paras.query("num_paras == num_paras_ko == num_paras_cn")["title"].to_list()

In [119]:
len(para_common_list)

330
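
The comparison above lines up the three dataframes by row position. If the talks were stored in a different order in each corpus, matching on `title` would be safer. The following is a sketch of that alternative on toy data, not the notebook's method; `titles_with_equal_counts` and the sample rows are hypothetical.

```python
import pandas as pd


def titles_with_equal_counts(frames, count_col="num_paras"):
    """Return titles whose count_col value is the same in every frame.

    Rows are matched on "title", so row order does not matter.
    """
    merged = None
    for i, frame in enumerate(frames):
        part = frame[["title", count_col]].rename(columns={count_col: f"n_{i}"})
        merged = part if merged is None else merged.merge(part, on="title", how="inner")
    count_cols = [col for col in merged.columns if col != "title"]
    # A row qualifies when all of its count columns hold a single distinct value
    same = merged[count_cols].nunique(axis=1) == 1
    return merged.loc[same, "title"].to_list()


# Toy frames with the same talks in different row orders (hypothetical data)
toy_en = pd.DataFrame({"title": ["talk_a", "talk_b", "talk_c"], "num_paras": [3, 5, 2]})
toy_ko = pd.DataFrame({"title": ["talk_c", "talk_a", "talk_b"], "num_paras": [2, 3, 4]})
toy_cn = pd.DataFrame({"title": ["talk_b", "talk_c", "talk_a"], "num_paras": [5, 2, 3]})

matching = titles_with_equal_counts([toy_en, toy_ko, toy_cn])
print(matching)
```

Here `talk_b` is dropped because its Korean version has 4 paragraphs instead of 5, even though the rows appear in a different order in each frame.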

In [120]:
en_df_filtered_paras = en_df_filtered[en_df_filtered["title"].isin(para_common_list)].drop_duplicates("title")
ko_df_filtered_paras = ko_df_filtered[ko_df_filtered["title"].isin(para_common_list)].drop_duplicates("title")
cn_df_filtered_paras = cn_df_filtered[cn_df_filtered["title"].isin(para_common_list)].drop_duplicates("title")

In [121]:
assert len(en_df_filtered_paras) == len(ko_df_filtered_paras) == len(cn_df_filtered_paras) == len(para_common_list)

### Finally, let's take a look at the corpus:
- How many documents are in each corpus?
- How many paragraphs are in each corpus?
- What is the average number of paragraphs per talk?
- How many tokens are in each corpus?
- How many characters are in each corpus?
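
These questions can also be answered in one pass per corpus. The sketch below is one possible helper, shown on toy data; `corpus_summary` is a hypothetical name, and it assumes each row's `text` is a list of paragraph dicts with a `"text"` key, as in the corpora used in this notebook.

```python
import pandas as pd


def corpus_summary(df):
    """Compute document, paragraph, token and character counts for one corpus."""
    paras_per_doc = df["text"].apply(len)
    # Tokens are whitespace-separated chunks, summed over a talk's paragraphs
    tokens = df["text"].apply(lambda paras: sum(len(p["text"].split()) for p in paras))
    chars = df["text"].apply(lambda paras: sum(len(p["text"]) for p in paras))
    return {
        "num_docs": len(df),
        "num_paras": int(paras_per_doc.sum()),
        "avg_paras": float(paras_per_doc.mean()),
        "num_tokens": int(tokens.sum()),
        "num_chars": int(chars.sum()),
    }


# Toy corpus with two talks (hypothetical data)
toy = pd.DataFrame({
    "title": ["talk_a", "talk_b"],
    "text": [
        [{"text": "Hello world"}, {"text": "How are you"}],
        [{"text": "One paragraph only"}],
    ],
})
summary = corpus_summary(toy)
print(summary)
```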

In [122]:
# num_docs
print(len(en_df_filtered_paras))
print(len(ko_df_filtered_paras))
print(len(cn_df_filtered_paras))

330
330
330


In [126]:
# num_paras (identical in all three corpora after filtering, so only the English count is shown)
print(en_df_filtered_paras["num_paras"].sum())

7660


In [127]:
# avg. num_paras per talk
print(en_df_filtered_paras["num_paras"].sum() / len(en_df_filtered_paras))

23.21212121212121


In [152]:
# number of whitespace-separated tokens in a talk
def count_tokens(text):
    """Sum the whitespace-separated token counts over a talk's paragraphs."""
    num_tokens = 0
    for para in text:
        tokens = para["text"].split()
        num_tokens += len(tokens)
    return num_tokens

In [153]:
en_df_filtered_paras["num_tokens"] = en_df_filtered_paras["text"].apply(count_tokens)

In [167]:
# num_tokens in eng corpus
en_df_filtered_paras["num_tokens"].sum()

689614

In [168]:
ko_df_filtered_paras["num_tokens"] = ko_df_filtered_paras["text"].apply(count_tokens)

In [169]:
# num_tokens in kor corpus
ko_df_filtered_paras["num_tokens"].sum()

478551
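
`count_tokens` relies on `str.split()`, which separates on whitespace only. That works for English and Korean, but it does not segment text written without spaces between words, as Chinese usually is. The sketch below illustrates the difference on hypothetical sample strings and shows a rough size proxy (non-whitespace characters); it is not a real word segmenter, and both helper names are made up for this example.

```python
def count_whitespace_tokens(text):
    """Count tokens by splitting on whitespace, as count_tokens does per paragraph."""
    return len(text.split())


def count_non_space_chars(text):
    """Rough size proxy for unsegmented text: characters that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


english = "I like music"  # hypothetical sample
chinese = "我喜欢音乐"  # hypothetical sample, no spaces between words

print(count_whitespace_tokens(english))  # 3
print(count_whitespace_tokens(chinese))  # 1
print(count_non_space_chars(chinese))    # 5
```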

In [170]:
def count_chars(text):
    """Sum the character counts (including spaces) over a talk's paragraphs."""
    num_chars = 0
    for para in text:
        num_chars += len(para["text"])
    return num_chars

In [171]:
# num_chars in eng_corpus
en_df_filtered_paras["num_chars"] = en_df_filtered_paras["text"].apply(count_chars)
en_df_filtered_paras["num_chars"].sum()

3792843

In [172]:
# num_chars in kor_corpus
ko_df_filtered_paras["num_chars"] = ko_df_filtered_paras["text"].apply(count_chars)
ko_df_filtered_paras["num_chars"].sum()

2027123

In [173]:
# num_chars in cn_corpus
cn_df_filtered_paras["num_chars"] = cn_df_filtered_paras["text"].apply(count_chars)
cn_df_filtered_paras["num_chars"].sum()

1264013

### Save the filtered corpora to json files

In [174]:
en_filtered_file_path = "../transcripts/en/filtered/filtered_annotated_ted_talks_en.json"
ko_filtered_file_path = "../transcripts/ko/filtered/filtered_annotated_ted_talks_ko.json"
cn_filtered_file_path = "../transcripts/zh-cn/filtered/filtered_annotated_ted_talks_cn.json"

In [176]:
def convert_corpus_to_list(corpus):
    """Convert a dataframe into a list of dicts, one dict per talk (row)."""
    return corpus.apply(lambda row: row.to_dict(), axis=1).to_list()

In [177]:
en_corpus_list = convert_corpus_to_list(en_df_filtered_paras)
ko_corpus_list = convert_corpus_to_list(ko_df_filtered_paras)
cn_corpus_list = convert_corpus_to_list(cn_df_filtered_paras)

In [178]:
def write_json(corpus, file_path):
    """Write the corpus (a list of dicts) to file_path as indented JSON."""
    with open(file_path, 'w', encoding='utf-8') as json_file:
        json.dump(corpus, json_file, indent=4, separators=(',', ':'))

In [179]:
write_json(en_corpus_list, en_filtered_file_path)
write_json(ko_corpus_list, ko_filtered_file_path)
write_json(cn_corpus_list, cn_filtered_file_path)
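
By default `json.dump` and `json.dumps` escape non-ASCII characters (`ensure_ascii=True`), so Korean and Chinese text is stored as `\uXXXX` sequences unless `ensure_ascii=False` is passed. Both forms load back to the same data; the difference only matters if the files should be readable in a text editor. A small illustration with a hypothetical record:

```python
import json

# Hypothetical record shaped like one talk in the corpus
sample = [{"title": "toy_talk", "text": [{"text": "안녕하세요"}]}]

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keeps Hangul as-is

print(escaped)
print(readable)

# Both strings decode to the same Python objects
assert json.loads(escaped) == json.loads(readable) == sample
```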

### Let's check that our filtered corpora are parallel at the paragraph level

In [180]:
def load_parallel_corpus(en_df, ko_df, cn_df, title, para_id):
    """Return paragraph para_id of the talk named title in English, Korean, and Chinese."""
    def get_para(df):
        # a boolean mask avoids the quoting problems query() has when a title contains a quote
        return df.loc[df["title"] == title, "text"].iloc[0][para_id]["text"]
    return get_para(en_df), get_para(ko_df), get_para(cn_df)

In [184]:
with open(en_filtered_file_path, "r", encoding="utf-8") as f:
    en_corpus = json.load(f)

with open(ko_filtered_file_path, "r", encoding="utf-8") as f:
    ko_corpus = json.load(f)

with open(cn_filtered_file_path, "r", encoding="utf-8") as f:
    cn_corpus = json.load(f)

In [185]:
en_df = pd.DataFrame(en_corpus)
ko_df = pd.DataFrame(ko_corpus)
cn_df = pd.DataFrame(cn_corpus)

In [187]:
load_parallel_corpus(en_df, ko_df, cn_df, para_common_list[6], 5)

('One day in New York, I was on the street and I saw some kids playing baseball between stoops and cars and fire hydrants. And a tough, slouchy kid got up to bat, and he took a swing and really connected. And he watched the ball fly for a second, and then he went, "Dah dadaratatatah. Brah dada dadadadah." And he ran around the bases. And I thought, go figure. How did this piece of 18th century Austrian aristocratic entertainment turn into the victory crow of this New York kid? How was that passed on? How did he get to hear Mozart?',
 '하루는 제가 뉴욕에서 길을 걷고 있었는데 꼬마들이 현관과 자동차와 소화전들 사이에서 야구를 하고 있더군요 구부정하고 억세 보이는 한 꼬마가 자기 차례가 되자 야구배트를 휘둘렀는데, 정확히 딱 맞혔어요 그리고는 그 아이가 날아가는 공을 잠시 보더니 이러더군요 "따따 따따 라따따" \'따따 따따 따따따따따;" 이러며 베이스 사이를 뛰어다녔습니다 보고 있던 전 생각했죠 어떻게 18세기 오스트리아 귀족의 놀이 음악이 이 뉴욕 꼬마의 승리 행진곡이 될 수 있었을까? 그 음악이 어떻게 전해졌을까? 저 꼬마는 어디서 모차르트 음악을 들은 걸까?',
 '有一天，我在纽约的街头 我看到一些小孩子在门廊，汽车和消防栓之间打棒球。 一个强壮的，无精打采的孩子准备击球， 他甩开球棒，真的击到了球。 然后他看着球飞了一会儿， 然后就唱起来，”达 达达……（音乐旋律）。“ ”巴 达达 达……“ 然后他绕着球场跑起来。 我就想，试着猜猜看吧。 这首18世纪奥地利的贵族音

In [190]:
load_parallel_corpus(en_df, ko_df, cn_df, para_common_list[10], 1)

("If you understand the difference between 'the world' and 'my world,' you understand the difference between logos and mythos. 'The world' is objective, logical, universal, factual, scientific. 'My world' is subjective. It's emotional. It's personal. It's perceptions, thoughts, feelings, dreams. It is the belief system that we carry. It's the myth that we live in.",
 '"세계"와 "세상"의 차이를 이해하신다면 이성과 신화의 차이를 이해한 것입니다. "세계"는 객관적이고 논리적이고 보편적이며 사실적이고 과학적입니다. "내 세상"은 주관적이고 감성적이며 개인적입니다. 그것은 지각이고 생각이고 느낌이며 꿈입니다. 그것이 우리가 가지는 신념 체계입니다. 그것이 우리가 살고 있는 신화의 틀입니다.',
 '如果你明白了"外在世界"和"内在世界"的区别，你就明白了理性和神话的区别 如果你明白了"外在世界"和"内在世界"的区别，你就明白了理性和神话的区别 “外在世界”是客观公正，理性逻辑，普遍适用，实实在在，符合科学的 “外在世界”是客观公正，理性逻辑，普遍适用，实实在在，符合科学的 “外在世界”是客观公正，理性逻辑，普遍适用，实实在在，符合科学的 “内在世界”是主观的，感性的，个人的，它是你的观念、思想、感觉、梦想 “内在世界”是主观的，感性的，个人的，它是你的观念、思想、感觉、梦想 “内在世界”是主观的，感性的，个人的，它是你的观念、思想、感觉、梦想 “内在世界”是我们的信仰，是我们的世界观，是属于我们的神话世界 “内在世界”是我们的信仰，是我们的世界观，是属于我们的神话世界')

Now the paragraphs seem to be aligned across the three languages!
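
The spot checks above look at two paragraphs only. A structural check over every talk could complement them. The sketch below uses a hypothetical `check_alignment` helper on toy data; it verifies matching titles and paragraph counts, which is necessary for alignment but does not prove that the paragraph contents are translations of each other.

```python
import pandas as pd


def check_alignment(*frames):
    """Raise AssertionError unless all frames share titles and paragraph counts."""
    reference = frames[0].set_index("title")["text"].apply(len)
    for frame in frames[1:]:
        counts = frame.set_index("title")["text"].apply(len)
        assert set(counts.index) == set(reference.index), "title sets differ"
        # Reindex so the comparison is by title rather than by row position
        assert (counts.reindex(reference.index) == reference).all(), "paragraph counts differ"
    return True


# Toy frames with the same talks in a different row order (hypothetical data)
toy_en = pd.DataFrame({
    "title": ["talk_a", "talk_b"],
    "text": [[{"text": "a"}, {"text": "b"}], [{"text": "c"}]],
})
toy_ko = pd.DataFrame({
    "title": ["talk_b", "talk_a"],
    "text": [[{"text": "다"}], [{"text": "가"}, {"text": "나"}]],
})

print(check_alignment(toy_en, toy_ko))
```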