# Acoustic Token Distribution Similarity

![](ATDS.png)

## Indic languages

### Target: Punjabi

Data derivation walkthrough

#### 1. Load the clustered frame-wise embeddings for each audio file (from the various `*embeddings-clustered.parquet` files)

In [1]:
import ATDS

all_clusters_df = ATDS.make_all_clusters_df("indic")

all_clusters_df

Unnamed: 0,wav_file,cluster_id,lang,cluster_char
0,844424932349094-1174-m.wav,282,bengali,ļ
1,844424932349094-1174-m.wav,148,bengali,¶
2,844424932349094-1174-m.wav,225,bengali,ă
3,844424932349094-1174-m.wav,389,bengali,Ƨ
4,844424932349094-1174-m.wav,206,bengali,ð
...,...,...,...,...
740416,844424930470011-1158-f.wav,58,punjabi,\
740417,844424930470011-1158-f.wav,278,punjabi,ĸ
740418,844424930470011-1158-f.wav,278,punjabi,ĸ
740419,844424930470011-1158-f.wav,335,punjabi,ű
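
In the rows displayed above, each `cluster_char` is the Unicode character whose code point is `cluster_id + 34` (for example, cluster 58 appears as `\` and cluster 148 as `¶`). The sketch below reproduces that mapping. The helper name `cluster_id_to_char` and the constant `OFFSET` are hypothetical, and whether `ATDS` derives the characters this way is an assumption based only on the displayed rows.

```python
# Assumption: offset inferred from the rows displayed in the table above,
# not taken from the ATDS source code.
OFFSET = 34


def cluster_id_to_char(cluster_id: int) -> str:
    """Map a cluster ID to a single Unicode character (hypothetical helper)."""
    return chr(cluster_id + OFFSET)


# Cluster IDs copied from the displayed rows.
shown_ids = [282, 148, 225, 389, 206, 58, 278, 335]
shown_chars = "".join(cluster_id_to_char(i) for i in shown_ids)
```

Representing each cluster as one character lets every audio file be written as a plain string, which is what the later text-tokenization step operates on.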


#### 2. Combine the cluster characters of each audio file and deduplicate them to get pseudo-text for each utterance

In [2]:
all_utts_df = ATDS.make_all_utts_df(all_clusters_df)

all_utts_df

Unnamed: 0,lang,wav_file,cluster_char
0,bengali,844424930403302-711-f.wav,śƂȕŝĻUȅz+øƗĘ¥ưĪǫƄĘŶňŔǸǗǍǏǷǨB°ŁƆǸȏóúŤeèŝĩ-Äø...
1,bengali,844424930403323-711-f.wav,śðuìn¢^tÂĆÁƗŁN5ÊȇǏǨ°ŁyØǞǀǗǸEjćǜǱĹħÝǂRȅūŢÍÈĬŮ...
2,bengali,844424930442076-1037-f.wav,ƂȕTïŅÊaI`ĭŵǭǸUȅYƗŽńůøĘŶňŗdųjȇ`ħƗǣňÂĆǣŔUƢû...
3,bengali,844424930442110-1037-f.wav,ĈƫYƕȑ+ƗŁǸŗs¼cĨȌøĘňnňŔxY*ûäøĘì¢ưîǐĪǫƄĘŶňŔǻų...
4,bengali,844424930442118-1037-f.wav,"ƪ¾ăƽƹðu¢n¢ŗƼ¼ÄĦāǣŝƍǈĨáȌÑèTŔĻUYƳÈ£ȏŽúǴƟǗǸǥž&á,..."
...,...,...,...
26262,urdu,844424932975151-264-m.wav,¶ă¿õëøĘŶQŃĽǝÀƗȄN5;f³¡ƹÿTŝ=aǳŎÑōŒ$ȏµáĖĜĘŶmêOÃ...
26263,urdu,844424932975913-835-m.wav,"Ù¶ăƽúKǴƟŜǸ5ǱBǬęļôǹŸŵŁNǍa;fâ`ħƄƆ¥ưƼîȎǫ|ƿ""ǠT8TŇ..."
26264,urdu,844424932976710-264-m.wav,śðŜĩX¼EŀĪƄĘŶňŅ;ǏȍÑōŔǃǮĨløȓȐǥĂþǵǿľĤȆ_ī1ǌƷÎƔpŌ...
26265,urdu,844424932977992-835-m.wav,"śŵǸǍÕǏǨB|ƣƿ""T¸ZęƖêOƆſġFæȊƖ(Ƿ.`ħƆ¥ưŴíƭŖǶļıļ..."

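A minimal sketch of this step, assuming that "deduplicate" means collapsing consecutive repeats of the same cluster character (such as the two adjacent `ĸ` frames in the step 1 output). The names `collapse_repeats` and `make_utts` are hypothetical and are not the `ATDS` implementation.

```python
from itertools import groupby

import pandas as pd


def collapse_repeats(chars) -> str:
    """Join characters, keeping one character per run of consecutive repeats."""
    return "".join(key for key, _ in groupby(chars))


def make_utts(clusters_df: pd.DataFrame) -> pd.DataFrame:
    """Build one pseudo-text string per (lang, wav_file) pair."""
    return (
        clusters_df.groupby(["lang", "wav_file"], sort=False)["cluster_char"]
        .agg(collapse_repeats)
        .reset_index()
    )


# Toy frame-level data: two files, with repeated frames in each.
toy_clusters = pd.DataFrame(
    {
        "wav_file": ["a.wav"] * 4 + ["b.wav"] * 3,
        "lang": ["punjabi"] * 4 + ["hindi"] * 3,
        "cluster_char": list("xyyz") + list("aab"),
    }
)
toy_utts = make_utts(toy_clusters)
```

Collapsing runs removes the effect of how long a sound is held, so the pseudo-text reflects the sequence of acoustic units rather than their durations.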

#### 3. Train a SentencePiece model on the target language and use it to derive tokens for all languages

In [3]:
all_utts_df = ATDS.train_and_encode_spm(all_utts_df, "punjabi")

all_utts_df

Unnamed: 0,lang,wav_file,cluster_char,utt_piece_ids
0,bengali,844424930403302-711-f.wav,śƂȕŝĻUȅz+øƗĘ¥ưĪǫƄĘŶňŔǸǗǍǏǷǨB°ŁƆǸȏóúŤeèŝĩ-Äø...,"[531, 123, 1283, 29, 305, 78, 382, 21, 96, 6, ..."
1,bengali,844424930403323-711-f.wav,śðuìn¢^tÂĆÁƗŁN5ÊȇǏǨ°ŁyØǞǀǗǸEjćǜǱĹħÝǂRȅūŢÍÈĬŮ...,"[1288, 600, 264, 106, 2729, 1387, 85, 242, 546..."
2,bengali,844424930442076-1037-f.wav,ƂȕTïŅÊaI`ĭŵǭǸUȅYƗŽńůøĘŶňŗdųjȇ`ħƗǣňÂĆǣŔUƢû...,"[4749, 286, 820, 217, 370, 72, 146, 825, 667, ..."
3,bengali,844424930442110-1037-f.wav,ĈƫYƕȑ+ƗŁǸŗs¼cĨȌøĘňnňŔxY*ûäøĘì¢ưîǐĪǫƄĘŶňŔǻų...,"[150, 314, 792, 1118, 361, 259, 29, 6988, 223,..."
4,bengali,844424930442118-1037-f.wav,"ƪ¾ăƽƹðu¢n¢ŗƼ¼ÄĦāǣŝƍǈĨáȌÑèTŔĻUYƳÈ£ȏŽúǴƟǗǸǥž&á,...","[195, 39, 467, 5492, 26, 8, 26, 2517, 84, 151,..."
...,...,...,...,...
26262,urdu,844424932975151-264-m.wav,¶ă¿õëøĘŶQŃĽǝÀƗȄN5;f³¡ƹÿTŝ=aǳŎÑōŒ$ȏµáĖĜĘŶmêOÃ...,"[326, 224, 269, 189, 109, 1363, 8102, 981, 78,..."
26263,urdu,844424932975913-835-m.wav,"Ù¶ăƽúKǴƟŜǸ5ǱBǬęļôǹŸŵŁNǍa;fâ`ħƄƆ¥ưƼîȎǫ|ƿ""ǠT8TŇ...","[1589, 98, 467, 2993, 5111, 85, 5510, 515, 56,..."
26264,urdu,844424932976710-264-m.wav,śðŜĩX¼EŀĪƄĘŶňŅ;ǏȍÑōŔǃǮĨløȓȐǥĂþǵǿľĤȆ_ī1ǌƷÎƔpŌ...,"[1486, 736, 84, 577, 1, 539, 420, 851, 327, 10..."
26265,urdu,844424932977992-835-m.wav,"śŵǸǍÕǏǨB|ƣƿ""T¸ZęƖêOƆſġFæȊƖ(Ƿ.`ħƆ¥ưŴíƭŖǶļıļ...","[2144, 3253, 562, 601, 321, 73, 167, 322, 8706..."

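A rough sketch of how this step could look with the `sentencepiece` Python package. The function name `train_and_encode`, the unigram model type, the vocabulary size of 10,000, and the full character coverage are all assumptions for illustration; the settings that `ATDS.train_and_encode_spm` actually uses are not shown in this notebook.

```python
import tempfile
from pathlib import Path


def train_and_encode(utts_df, target_lang, vocab_size=10_000):
    """Train on the target language only, then encode every utterance."""
    # Imported here so the helper can be defined without the dependency.
    import sentencepiece as spm

    with tempfile.TemporaryDirectory() as tmp:
        # One pseudo-text utterance per line, target language only.
        corpus = Path(tmp) / "target.txt"
        target_text = utts_df.loc[utts_df["lang"] == target_lang, "cluster_char"]
        corpus.write_text("\n".join(target_text), encoding="utf-8")

        prefix = str(Path(tmp) / "spm_target")
        spm.SentencePieceTrainer.train(
            input=str(corpus),
            model_prefix=prefix,
            vocab_size=vocab_size,    # assumed value
            model_type="unigram",     # assumed model type
            character_coverage=1.0,   # keep every cluster character
        )
        sp = spm.SentencePieceProcessor(model_file=prefix + ".model")

        # Apply the target-language model to utterances from all languages.
        encoded = utts_df.copy()
        encoded["utt_piece_ids"] = [
            sp.encode(text, out_type=int) for text in encoded["cluster_char"]
        ]
    return encoded
```

Training on the target language alone means the token inventory describes the target's recurring acoustic patterns, and the other languages are then measured by how often they use those same tokens.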

#### 4. Derive token frequencies for each language (1.0 = most frequent within each language/column)

In [4]:
piece_freqs_matrix = ATDS.make_piece_freqs_matrix(all_utts_df, "punjabi")

piece_freqs_matrix

utt_piece_ids,punjabi,bengali,gujarati,hindi,malayalam,marathi,odia,tamil,urdu
1,1.000000,1.000000,1.000000,1.000000,0.795799,0.691673,1.000000,0.474374,1.000000
2,0.946369,0.443672,0.995691,0.965529,0.845628,1.000000,0.906939,0.745195,0.876662
3,0.640968,0.284809,0.478311,0.665135,0.767465,0.353818,0.435302,0.616482,0.472025
4,0.633892,0.340217,0.666475,0.446159,0.743039,0.478585,0.668073,0.481654,0.463060
5,0.614153,0.464936,0.521689,0.613263,0.869565,0.398510,0.238631,0.574549,0.801546
...,...,...,...,...,...,...,...,...,...
9991,0.000745,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9992,0.000745,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9993,0.000745,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9994,0.000745,0.001022,0.000575,0.000985,0.000000,0.000000,0.003282,0.000000,0.000000
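
The normalization described in the heading can be sketched as follows: count how often each piece ID occurs per language, then divide each language column by its own maximum so that the most frequent piece scores 1.0. The function name `piece_freqs` and the toy data are hypothetical; this is an illustration under those assumptions, not the `ATDS.make_piece_freqs_matrix` code.

```python
import pandas as pd


def piece_freqs(utts_df: pd.DataFrame, target_lang: str) -> pd.DataFrame:
    """Per-language piece counts scaled so each column's maximum is 1.0."""
    counts = (
        utts_df[["lang", "utt_piece_ids"]]
        .explode("utt_piece_ids")            # one row per token occurrence
        .groupby(["utt_piece_ids", "lang"])
        .size()
        .unstack("lang", fill_value=0)       # rows: piece IDs, columns: languages
    )
    freqs = counts / counts.max(axis=0)      # scale each language column
    others = [c for c in freqs.columns if c != target_lang]
    return freqs[[target_lang] + others]     # target language first


toy_utts = pd.DataFrame(
    {
        "lang": ["punjabi", "punjabi", "hindi"],
        "utt_piece_ids": [[1, 1, 2], [1, 3], [2, 2, 1]],
    }
)
toy_freqs = piece_freqs(toy_utts, "punjabi")
```

Pieces that never occur in a language get a frequency of 0.0, as in the last rows of the table above.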


#### 5. Get pairwise similarities using cosine similarity between each pair of language vectors

In [5]:
ATDS_matrix = ATDS.make_ATDS_matrix(piece_freqs_matrix)

ATDS_matrix

Unnamed: 0,ref_lang,punjabi,bengali,gujarati,hindi,malayalam,marathi,odia,tamil,urdu
0,punjabi,,0.89834,0.92537,0.960345,0.889211,0.917747,0.867011,0.85545,0.926529
0,bengali,0.89834,,0.915768,0.912807,0.837936,0.883071,0.907864,0.81156,0.886872
0,gujarati,0.92537,0.915768,,0.937169,0.871127,0.942984,0.936393,0.876194,0.89494
0,hindi,0.960345,0.912807,0.937169,,0.876207,0.936571,0.882,0.856358,0.958859
0,malayalam,0.889211,0.837936,0.871127,0.876207,,0.887913,0.838508,0.90777,0.881012
0,marathi,0.917747,0.883071,0.942984,0.936571,0.887913,,0.908061,0.881482,0.922264
0,odia,0.867011,0.907864,0.936393,0.882,0.838508,0.908061,,0.868832,0.852086
0,tamil,0.85545,0.81156,0.876194,0.856358,0.90777,0.881482,0.868832,,0.826908
0,urdu,0.926529,0.886872,0.89494,0.958859,0.881012,0.922264,0.852086,0.826908,
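
For two language vectors u and v, cosine similarity is the dot product of u and v divided by the product of their Euclidean norms. The sketch below computes it for every pair of columns in a frequency matrix and blanks the diagonal, matching the empty diagonal in the table above. The function name `cosine_similarity_matrix` and the toy input are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(freqs) -> np.ndarray:
    """Pairwise cosine similarity between the columns of a frequency matrix."""
    vectors = np.asarray(freqs, dtype=float).T   # one row per language
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T                          # dot products of unit vectors
    np.fill_diagonal(sim, np.nan)                # self-similarity is not reported
    return sim


# Toy matrix: two pieces (rows) by two languages (columns).
toy_sim = cosine_similarity_matrix([[1.0, 1.0], [0.0, 1.0]])
```

Because every frequency is non-negative, the similarities fall between 0 and 1, and the resulting matrix is symmetric.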


#### 6. Get best donors for target language (ATDS in descending order)

In [6]:
best_donors_for_punjabi = ATDS.get_best_donors_by_ATDS(ATDS_matrix, "punjabi")

# Write out indic similarities for comparison with SpeechBrain and lang2vec similarities
best_donors_for_punjabi.to_csv("/workspace/data/artefacts/ATDS/indic_atds.csv", index=False)

best_donors_for_punjabi

Unnamed: 0,target_lang,donor_lang,atds
27,punjabi,hindi,0.96
72,punjabi,urdu,0.93
18,punjabi,gujarati,0.93
45,punjabi,marathi,0.92
9,punjabi,bengali,0.9
36,punjabi,malayalam,0.89
54,punjabi,odia,0.87
63,punjabi,tamil,0.86
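
Ranking donors amounts to taking the target language's row of the similarity matrix, dropping the target itself, and sorting in descending order. The sketch below does this for a four-language subset of the values from step 5; the function name `best_donors` is hypothetical and the output layout only approximates the table above.

```python
import pandas as pd


def best_donors(sim_df: pd.DataFrame, target_lang: str) -> pd.DataFrame:
    """Donor languages for target_lang, most similar first."""
    row = sim_df.loc[target_lang].drop(target_lang).sort_values(ascending=False)
    return pd.DataFrame(
        {
            "target_lang": target_lang,
            "donor_lang": row.index,
            "atds": row.round(2).to_numpy(),
        }
    )


# Values copied from the pairwise similarity table in step 5.
nan = float("nan")
langs = ["punjabi", "hindi", "urdu", "gujarati"]
sim = pd.DataFrame(
    [
        [nan, 0.960345, 0.926529, 0.92537],
        [0.960345, nan, 0.958859, 0.937169],
        [0.926529, 0.958859, nan, 0.89494],
        [0.92537, 0.937169, 0.89494, nan],
    ],
    index=langs,
    columns=langs,
)
donors = best_donors(sim, "punjabi")
```

Sorting before rounding keeps Urdu ahead of Gujarati even though both display as 0.93.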


## Other targets

### Galician

In [7]:
west_iberian_atds_matrix, best_donors_for_galician = ATDS.run_all("west-iberian", "galician")

best_donors_for_galician

Unnamed: 0,target_lang,donor_lang,atds
6,galician,spanish,0.96
3,galician,portuguese,0.89


### Iban

In [8]:
malayic_atds_matrix, best_donors_for_iban = ATDS.run_all("malayic", "iban")

best_donors_for_iban

Unnamed: 0,target_lang,donor_lang,atds
6,iban,malay,0.91
3,iban,indonesian,0.88


### Setswana

In [9]:
sotho_tswana_atds_matrix, best_donors_for_setswana = ATDS.run_all("sotho-tswana", "setswana")

best_donors_for_setswana

Unnamed: 0,target_lang,donor_lang,atds
6,setswana,sesotho,0.96
3,setswana,sepedi,0.88
