In [9]:
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
import re
from pathlib import Path
from tqdm import tqdm
import json

# Step 1: Look at available Chinese dialog corpora

**CallFriend**
- available at `/corpora/LDC/LDC96S55`
- `/corpora/LDC/LDC96S55/doc/spkrinfo.doc`

In [None]:
%ls /corpora/LDC/LDC96S55/cf_man_m/evltest/

In [None]:
%ls /corpora/LDC/LDC96S55/doc

In [None]:
!head -n 20 /corpora/LDC/LDC96S55/doc/spkrinfo.doc

In [None]:
%less /corpora/LDC/LDC96S55/doc/spkrinfo.tbl

In [None]:
%cat /corpora/LDC/LDC96S55/doc/cf_man_m.doc

**Notes** <br/>
- This release contains only audio files and their descriptions, no transcripts
- It still has decent demographic information about the speakers
- [LDC2018S18](https://catalog.ldc.upenn.edu/LDC2018S18) looks like the version that includes transcripts
 

In [None]:
%ls /corpora/LDC/ | grep 'LDC2018' 

It looks like the LDC2018S18 corpus is not in UW's directory.

In [None]:
# Brandon added it under /corpora/LDC/LDC2018S18

In [None]:
%ls /corpora/LDC/LDC2018S18/data/transcripts/mandarin

In [None]:
!head -n 10 /corpora/LDC/LDC2018S18/data/transcripts/mandarin/ma_4160.txt

See if you can get any more info on A and B speakers in ma_4160

In [None]:
%ls /corpora/LDC/LDC2018S18/docs

In [None]:
!cat /corpora/LDC/LDC2018S18/docs/spkrinfo.tbl

In [None]:
#some interesting variables available in /corpora/LDC/LDC2018S18/docs/callinfo.tbl
!head -n 10 /corpora/LDC/LDC2018S18/docs/callinfo.tbl

`spkrinfo.tbl` also contains some details about the *caller* only, so I wonder whether it is possible to identify at least the caller of each call from that file.

In [None]:
!head -n 5 /corpora/LDC/LDC2018S18/docs/spkrinfo.tbl

In [None]:
with open('/corpora/LDC/LDC2018S18/docs/callinfo.tbl','r') as infile:
    callfriend_raw = infile.read()

In [None]:
#get counts of ages (raw string so that \s reaches the regex engine unchanged)
all_ages = re.findall(r'(?<=age=).*?(?=\s)',callfriend_raw)
Counter(all_ages)

In [None]:
#get counts of all sexes ([MF] rather than [M|F], which would also match a literal '|')
all_sexes = re.findall(r'(?<=sex=)[MF]',callfriend_raw)
Counter(all_sexes)

In [None]:
#Find stuff in directories

p = Path('/corpora/LDC/LDC2018S18/data/transcripts/mandarin/')
files = [f for f in p.iterdir() if f.is_file()]


In [None]:
ni_count = 0
nin_count = 0

for path in tqdm(files):
    with open(path,'r') as infile:
        raw = infile.read()

    ni_count += raw.count('你')
    nin_count += raw.count('您')

    #print the transcripts that contain at least one 您
    if '您' in raw:
        print(str(path.absolute()))

In [None]:
print(f'{round(100*(nin_count/(nin_count+ni_count)),2)}% of "you"s are 您')
print(f'{nin_count} instances out of {nin_count+ni_count}')
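
The percentage above divides by the total number of "you"s, which fails if a set of transcripts contains neither character. A minimal sketch of a guarded version follows; the function name `count_you` and the toy strings are made up for illustration and are not taken from the corpus.

```python
def count_you(texts):
    """Count 你 and 您 over an iterable of strings and return the share of 您."""
    ni = sum(text.count('你') for text in texts)
    nin = sum(text.count('您') for text in texts)
    total = ni + nin
    # Guard against division by zero when neither character occurs
    share = nin / total if total else 0.0
    return ni, nin, share

# Toy input, invented for this example
toy_texts = ['你好，你吃了吗', '您好', '今天天气不错']
ni, nin, share = count_you(toy_texts)
print(f'{round(100 * share, 2)}% of "you"s are 您 ({nin} out of {ni + nin})')
```

The same function could be applied to the list of raw transcript strings once they are read from disk.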

In [None]:
%less /corpora/LDC/LDC2018S18/data/transcripts/mandarin/ma_5930.txt

### Explore relations between CallFriend speakers

Question: how can I link the call ids to the speaker ids?

In [None]:
!head -n 5 /corpora/LDC/LDC96S55/doc/callinfo.tbl

In [None]:
!head -n 20  /corpora/LDC/LDC96S55/doc/spkrinfo.tbl

For example, a row of `spkrinfo.tbl` such as <br/>
`ma_4160,M,30,22,Anhui,203778bah` means: <br/>
- Call-ID = ma_4160
- Gender of caller = Male
- Age of caller = 30
- Years of education completed by caller = 22
- Where the caller grew up (typically a state name) = Anhui
- Area-code plus first three digits of telephone number dialed (last four digits of number are encrypted as three letters) = 203778bah <br/> <br/>
Note that there seem to be plenty of missing values
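
The field layout above can be turned into a small parser. This is a minimal sketch, assuming every row of `spkrinfo.tbl` has exactly these six comma-separated fields and that missing values show up as empty fields. The column names and the second sample row are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical column names for the six fields described above
SPKRINFO_COLUMNS = ["call_id", "sex", "age", "education_years", "region", "phone"]

def parse_spkrinfo(text):
    """Parse comma-separated spkrinfo rows into a DataFrame; empty fields become NaN."""
    return pd.read_csv(io.StringIO(text), header=None, names=SPKRINFO_COLUMNS)

# First row is the example from the notes; the second row is invented
sample = "ma_4160,M,30,22,Anhui,203778bah\nma_0000,F,,,,\n"
spkrinfo = parse_spkrinfo(sample)

# Number of missing values per column
missing_per_column = spkrinfo.isna().sum()
print(missing_per_column)
```

Counting missing values per column this way would quantify how sparse the demographic fields really are.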

In [None]:
%ls /corpora/LDC/LDC96S55/cf_man_m/devtest/

In [None]:
!head -n 10 /corpora/LDC/LDC96S55/cf_man_m/devtest/ma_4559.sph

In [None]:
%less /corpora/LDC/LDC96S55/doc/cf_man_m.doc

In [None]:
with open('/corpora/LDC/LDC96S55/doc/callinfo.tbl','r') as infile:
    raw= infile.read().split('\n')

In [None]:
import re
call_ids = [re.findall('ma_[0-9]*',item)[0] for item in raw if len(item)>0]
call_ids = sorted(call_ids)
len(call_ids)

In [None]:
call_ids[0:5]

In [None]:
with open('/corpora/LDC/LDC96S55/doc/spkrinfo.tbl','r') as infile:
    raw= infile.read().split('\n')
speaker_ids = [re.findall('ma_[0-9]*',item)[0] for item in raw if len(item)>0]
speaker_ids = sorted(speaker_ids)
len(speaker_ids)

In [None]:
len(set(speaker_ids).intersection(set(call_ids)))

The speaker IDs and the call IDs overlap, so how does that work? Maybe the first ID in the call table is actually a speaker ID and the PINs are the call IDs. If so, each PIN should appear twice, with two different `ma_` numbers.

In [None]:
with open('/corpora/LDC/LDC96S55/doc/callinfo.tbl','r') as infile:
    raw= infile.read().split('\n')


In [None]:
test = raw[0]
test

In [None]:
#the original lookahead (?=|) was unescaped and therefore matched everywhere, so it is dropped
PINs = [re.findall(r'(?<=PIN=)[0-9]*',item)[0] for item in raw if len(item)>0]
len(PINs)

In [None]:
len(set(PINs))

There are no duplicated PINs either. My guess is that only the demographics of the caller are recorded, and not those of the recipient.
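
The overlap and duplicate checks above can be bundled into one helper. This is only a sketch: `compare_ids` is a hypothetical name, and the ID lists below are toy values rather than the contents of the two tables.

```python
from collections import Counter

def compare_ids(call_ids, speaker_ids):
    """Summarise how two ID lists overlap and which IDs occur more than once."""
    calls, speakers = set(call_ids), set(speaker_ids)
    return {
        "shared": len(calls & speakers),
        "only_in_calls": sorted(calls - speakers),
        "only_in_speakers": sorted(speakers - calls),
        "duplicated_calls": sorted(k for k, n in Counter(call_ids).items() if n > 1),
        "duplicated_speakers": sorted(k for k, n in Counter(speaker_ids).items() if n > 1),
    }

# Toy ID lists, for illustration only
toy_call_ids = ["ma_4160", "ma_4559", "ma_5930", "ma_4559"]
toy_speaker_ids = ["ma_4160", "ma_4559"]
summary = compare_ids(toy_call_ids, toy_speaker_ids)
print(summary)
```

If both tables really describe the same calls, `only_in_calls`, `only_in_speakers`, and both duplicate lists should come back empty.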

Trying **2005 NIST Speaker Recognition Evaluation Training Data** <br/>
- `/corpora/LDC/LDC11S01`

In [None]:
%ls /corpora/LDC/LDC11S01/nist_2005_sre_tr_d1

In [None]:
!head -n 5 /corpora/LDC/LDC11S01/nist_2005_sre_tr_d1/index.html

In [None]:
%ls /corpora/LDC/LDC11S01/nist_2005_sre_tr_d1/data/

In [None]:
!head -n 5 /corpora/LDC/LDC11S01/nist_2005_sre_tr_d1/data/asr_tran/jaab.ctm

In [None]:
%less /corpora/LDC/LDC11S01/nist_2005_sre_tr_d1/doc/file.tbl

Trying **MAGICDATA Mandarin Chinese Conversational Speech Corpus: MDT2021S003** <br/>
- https://www.openslr.org/123/

In [None]:
%ls MDT2021S003/

In [None]:
!head -n 5 MDT2021S003/SPKINFO.txt

In [None]:
import pandas as pd
with open('MDT2021S003/SPKINFO.txt','r') as infile:
    df = pd.read_table(infile)
df.head(5)

In [None]:
from collections import Counter
Counter(df.AGE)

In [None]:
Counter(df.GENDER)

Problem with this data: very little demographic information available for speakers.

Trying **HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus** <br/>
- https://www.researchgate.net/profile/David-Graff-5/publication/220758477_HKUSTMTS_A_very_large_scale_Mandarin_telephone_speech_corpus/links/004635346a94cf2642000000/HKUST-MTS-A-very-large-scale-Mandarin-telephone-speech-corpus.pdf
- LDC2005T32
- Not downloaded on Patas, although [this page](https://cldb.ling.washington.edu/live/livesearch-corpus-form.php) says it is installed

In [None]:
%ls /corpora/LDC/ | grep 'LDC200' 

According to Brandon, this is under `/corpora/LDC/LDC05T32`

In [None]:
%ls /corpora/LDC/LDC05T32/docs

In [None]:
!head -n 5 /corpora/LDC/LDC05T32/docs/calldata-train.txt

In [None]:
%less /corpora/LDC/LDC05T32/docs/file.tbl

Interesting **notes** from the readme: <br/>
- Most subjects did not previously know each other.
- Topics were proposed to subjects.
- Subjects were asked to provide several pieces of demographic information, including gender, age, native language/dialect, birthplace, education, occupation, phone type, etc.

In [None]:
%ls /corpora/LDC/LDC05T32/data/trans/train

Do some exploratory statistics on the demographics of training data

In [None]:

with open('/corpora/LDC/LDC05T32/docs/calldata-train.txt','r') as infile:
    demographics = pd.read_table(infile)

In [None]:
demographics.head(5)

In [None]:
#rename all columns to have underscores instead of spaces
demographics = demographics.rename(columns=lambda name: name.replace(' ','_'))
demographics.columns

In [None]:
#Count the ages of all participants (both sides of each call); decade buckets follow below
all_ages = list(demographics.Age_A)+list(demographics.Age_B)
sorted(Counter(all_ages).items())


In [None]:
print(max(Counter(all_ages).keys()))
print(min(Counter(all_ages).keys()))

In [None]:
def get_decade(a)->str:
    """Map an age to a decade label such as '20s'; missing ages map to ''."""
    if pd.isna(a):
        return ''
    #integer division generalises the if/elif chain, which returned '' for ages of 70 and above
    return f'{int(a)//10*10}s'

In [None]:
demographics['decade_A'] = demographics.Age_A.apply(get_decade)
demographics['decade_B'] = demographics.Age_B.apply(get_decade)

In [None]:
demographics.head(5)

In [None]:
#graph a histogram of decades, sorted by decade label
all_decades = list(demographics.decade_A)+list(demographics.decade_B)

decade_counts = sorted(Counter(all_decades).items())
plt.bar([item[0] for item in decade_counts],[item[1] for item in decade_counts])
plt.suptitle("Distribution of participant ages by decade")

In [None]:
#build a label for the (sorted) pair of age decades in each conversation

age_of_call = []
for a, b in zip(demographics.decade_A, demographics.decade_B):
    pair = sorted([a,b])
    age_of_call.append(pair[0]+' and '+pair[1])


In [None]:
age_pairs = Counter(age_of_call).most_common()

In [None]:
plt.bar([item[0] for item in age_pairs],[item[1] for item in age_pairs])
plt.suptitle("Distribution of age pairs")
plt.xticks(rotation=90)

**Notes**
- most participants are in their 20s
- most calls are between two people who are both in their 20s
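
One way to put a number on the second note is to count the decade pairs directly. The sketch below assumes a frame with the `Age_A` and `Age_B` columns used above; the helper names and the toy ages are made up for illustration.

```python
from collections import Counter
import pandas as pd

def decade_label(age):
    """Return a decade label such as '20s' for an age."""
    return f"{int(age) // 10 * 10}s"

def decade_pair_counts(df, col_a="Age_A", col_b="Age_B"):
    """Count unordered decade pairs per call, skipping calls with a missing age."""
    pairs = Counter()
    for a, b in zip(df[col_a], df[col_b]):
        if pd.isna(a) or pd.isna(b):
            continue
        younger, older = sorted([int(a), int(b)])
        pairs[f"{decade_label(younger)} and {decade_label(older)}"] += 1
    return pairs

# Toy ages, invented for this example
toy = pd.DataFrame({"Age_A": [23, 35, 27, 41], "Age_B": [25, 22, 29, 24]})
pair_counts = decade_pair_counts(toy)
share_both_20s = pair_counts["20s and 20s"] / sum(pair_counts.values())
print(pair_counts.most_common(), share_both_20s)
```

On the real table, `share_both_20s` would show whether calls between two people in their 20s are an outright majority or only the largest group.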

In [None]:
#Get distributions of topics discussed
with open('/corpora/LDC/LDC05T32/docs/topic-list.txt','r') as infile:
    topics_raw = infile.read()

In [None]:
%less /corpora/LDC/LDC05T32/docs/topic-list.txt

In [None]:
#raw strings keep the backslashes in the patterns intact
raw_list = re.findall(r'[0-9]{1,2}\..*(?=\n)',topics_raw)
topic_dict = Counter()
for row in raw_list:
    topic_key = int(re.findall(r'[0-9]*(?=\.)',row)[0])
    topic = re.findall(r'(?<=[0-9]\. ).*',row)[0]
    topic_dict[topic_key]=topic
    
topic_dict

In [None]:
def index_to_topic(a:int) -> str:
    return topic_dict[a]


In [None]:
demographics['Topic_Name'] = demographics.Topic_ID.apply(index_to_topic)

In [None]:
Counter(demographics.Topic_Name).most_common()

In [None]:
#get a simple distribution of 你 and 您 across training data

In [None]:
%ls /corpora/LDC/LDC05T32/data/trans/train/

In [None]:
!head -n 5 /corpora/LDC/LDC05T32/data/trans/train/20040527_210939_A901153_B901154.txt

In [None]:
!file /corpora/LDC/LDC05T32/data/trans/train/20040527_210939_A901153_B901154.txt

Important lesson learned from [this thread](https://stackoverflow.com/questions/23731176/how-to-print-chinese-characters-stored-in-a-file-with-charset-iso-8859-1-in-py): "file is probably wrong, since it only 'guesses' the encoding. Using gb18030 as the encoding gives the correct result"
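
The effect is easy to reproduce without the corpus. In this minimal sketch, `decode_with_candidates` is a hypothetical helper and the sample string is invented. It tries strict encodings in order and deliberately leaves out iso-8859-1, which accepts any byte sequence and therefore never signals a wrong guess.

```python
def decode_with_candidates(data, encodings=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# Invented sample, encoded the way the transcripts appear to be
sample_bytes = "你好，您好".encode("gb18030")

text, used_encoding = decode_with_candidates(sample_bytes)

# iso-8859-1 decodes the same bytes without complaint, but the result is garbled
garbled = sample_bytes.decode("iso-8859-1")
print(used_encoding, text, garbled)
```

A successful decode is still not proof that the encoding is right, so a quick check for expected characters such as 你 remains worthwhile.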

In [None]:
#reading as iso-8859-1 never raises an error, but the Chinese characters come out garbled
with open('/corpora/LDC/LDC05T32/data/trans/train/20040527_210939_A901153_B901154.txt','r',encoding='iso-8859-1') as infile:
    raw = infile.read()

In [None]:
#gb18030 is the encoding that gives readable Chinese text for these transcripts
with open('/corpora/LDC/LDC05T32/data/trans/train/20040527_210939_A901153_B901154.txt','r',encoding='gb18030') as infile:
    raw = infile.read()

In [None]:
#Find stuff in directories

p = Path('/corpora/LDC/LDC05T32/data/trans/train/')
files = [f for f in p.iterdir() if f.is_file()]


In [None]:
ni_count = 0
nin_count = 0

for path in tqdm(files):
    with open(path,'r',encoding='gb18030') as infile:
        raw = infile.read()

    ni_count += raw.count('你')
    nin_count += raw.count('您')
        

In [None]:
print(f'{round(100*(nin_count/(nin_count+ni_count)),2)}% of "you"s are 您')
print(f'{nin_count} instances out of {nin_count+ni_count}')

Trying **CallHome**
- /corpora/LDC/LDC96S34 

In [None]:
%ls /corpora/LDC/LDC96S34

In [None]:
%less /corpora/LDC/LDC96S34/0readme.1st

In [None]:
%ls /corpora/LDC/LDC96T16

Trying **CRECIL**
- [paper](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.250.pdf)
- [Github](https://github.com/bistu-nlp-lab/CRECIL)

In [None]:
%ls ../CRECIL/Final_Data/

In [None]:
%less ../CRECIL/Final_Data/train.json

In [12]:
#note: despite its name, train_df is a nested list parsed from JSON, not a DataFrame
with open('../CRECIL/Final_Data/train.json','r',encoding='utf-8') as infile:
    train_df = json.load(infile)

In [18]:
len(train_df[0])

2

In [27]:
train_df[1][0]

['S 1: 考察培养一个学生，和考察培养一个干部，性质是完全一样的，我多年从事这方面的工作，经验是有的，教训也是有的，今天我不想多谈，只想谈三年，第一点……',
 'S 2: 啊志国志国',
 'S 1: 嗯',
 'S 2: 不就是帮着圆圆总结个优缺点么，哪那么些三点四点的，我今天呀，倒要看看你是怎么转换的',
 'S 1: 爸，您别着急呀，慢工出巧匠，总结也好，鉴定也好，都关系到一个人的前途，所以一定要慎之又慎',
 'S 3: 你别说的那么玄乎成不成啊，不就给圆圆找几条缺点么',
 'S 1: 缺点就更不能小看了，如今我们考察看什么呀？主要就是看缺点，优点谁没有啊，全心全意为人民服务，密切联系群众，艰苦奋斗任劳任怨，拒腐蚀永不沾，这都是一个国家工作人员应该具备的么，至于缺点，每个人都不一样啊，只有缺点，才能反映出一个人的特点',
 'S 4: 爸，您这说的跟我没什么关系呀',
 'S 2: 是啊，这不越说越远了么',
 'S 3: 我觉得也是',
 'S 1: 是什么是，现在说的就是圆圆，圆圆，你自己先说说，你这两天，从昨天到现在，你想好自己的缺点没有啊？',
 'S 4: 我想了一晚上，我也想不出来，我哪有缺点呀……',
 'S 2: 不能正确认识自己的缺点，这就是你的缺点',
 'S 3: 志国！这得算一条吧这个',
 'S 1: 你怎么你？啊？就算我爸岁数大了，一时糊涂',
 'S 2: 啊？！',
 'S 1: 圆圆年纪太小，又不太懂事',
 'S 4: 嗯？',
 'S 1: 你这不老不小的你怎么也跟着希里马虎的呀啊？噢，不能正确认识自己的缺点',
 'S 3: 啊',
 'S 1: 这条缺点一写上，不光她所有的优点都没了，而且把她所有的缺点都包括进去了',
 'S 3: 啊？',
 'S 1: 我昨天反复强调的什么？转换转化，明白不？要写好缺点',
 'S 4: 爸，那要写缺点，那就写我锻炼身体不够，这能好缺点么？',
 'S 2: 这已经就是避重就轻了',
 'S 1: 啊？这还轻啊，最重就属这条了，三好学生的第一条是什么？身体好，你不锻炼身体能好么？身体不好，还谈什么现在的学习和将来的工作啊？就冲这条，重点中学就不能要你',
 'S 3: 哎哎，照你这么说您这标准咱上哪儿给圆圆找好缺点去呀这个',
 'S 1: 上哪儿找？上我这儿找啊，昨天晚

In [28]:
train_df[0][1]

[{'x': 'S 1', 'y': 'S 2', 'r': ['per:spouse'], 'rid': [23]},
 {'x': 'S 1', 'y': 'S 3', 'r': ['per:children-in-law'], 'rid': [16]},
 {'x': 'S 1', 'y': 'S 4', 'r': ['unanswerable'], 'rid': [31]},
 {'x': 'S 1', 'y': '志国', 'r': ['per:spouse'], 'rid': [23]},
 {'x': 'S 2', 'y': 'S 1', 'r': ['per:spouse'], 'rid': [23]},
 {'x': 'S 2', 'y': 'S 3', 'r': ['per:children'], 'rid': [1]},
 {'x': 'S 2', 'y': 'S 4', 'r': ['unanswerable'], 'rid': [31]},
 {'x': 'S 2', 'y': '志国', 'r': ['per:alternate_name'], 'rid': [0]},
 {'x': 'S 3', 'y': 'S 1', 'r': ['per:parents-in-law'], 'rid': [15]},
 {'x': 'S 3', 'y': 'S 2', 'r': ['per:parents'], 'rid': [2]},
 {'x': 'S 3', 'y': 'S 4', 'r': ['unanswerable'], 'rid': [31]},
 {'x': 'S 3', 'y': '志国', 'r': ['per:parents'], 'rid': [2]},
 {'x': 'S 4', 'y': 'S 1', 'r': ['unanswerable'], 'rid': [31]},
 {'x': 'S 4', 'y': 'S 2', 'r': ['unanswerable'], 'rid': [31]},
 {'x': 'S 4', 'y': 'S 3', 'r': ['unanswerable'], 'rid': [31]},
 {'x': 'S 4', 'y': '志国', 'r': ['unanswerable'], 'ri

In [25]:
len(train_df[0][1])

20
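
Judging from the outputs above, each record appears to be a pair of a list of turns (`'S 1: ...'` strings) and a list of relation dictionaries with the keys `x`, `y`, `r` and `rid`. Under that assumption, the sketch below counts 你 and 您 per speaker and keeps only the relations between two speakers. The helper names are hypothetical, and the toy turns and relations are copied from two different records shown above, so they are not meant to be read together.

```python
from collections import Counter

def pronouns_by_speaker(turns):
    """Count 你 and 您 per speaker for turns formatted as 'S 1: text'."""
    counts = {}
    for turn in turns:
        speaker, _, text = turn.partition(': ')
        speaker_counts = counts.setdefault(speaker, Counter())
        speaker_counts['你'] += text.count('你')
        speaker_counts['您'] += text.count('您')
    return counts

def speaker_relations(relations):
    """Keep relations whose two arguments are both speakers ('S n'), keyed by (x, y)."""
    return {
        (rel['x'], rel['y']): rel['r']
        for rel in relations
        if rel['x'].startswith('S ') and rel['y'].startswith('S ')
    }

# Turns copied from the train_df[1][0] output above
toy_turns = [
    'S 2: 啊志国志国',
    'S 4: 爸，您这说的跟我没什么关系呀',
    'S 3: 你别说的那么玄乎成不成啊，不就给圆圆找几条缺点么',
]
# Relations copied from the train_df[0][1] output above
toy_relations = [
    {'x': 'S 1', 'y': 'S 2', 'r': ['per:spouse'], 'rid': [23]},
    {'x': 'S 1', 'y': '志国', 'r': ['per:spouse'], 'rid': [23]},
    {'x': 'S 3', 'y': 'S 2', 'r': ['per:parents'], 'rid': [2]},
]

pronoun_counts = pronouns_by_speaker(toy_turns)
relations = speaker_relations(toy_relations)
print(pronoun_counts, relations)
```

Joining the two results for the same record would link each speaker's use of 您 to the labelled relation with the other speakers, although the turns alone do not say which speaker is being addressed.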