## Number of distinct words in function names

Paths to the vocab files with the targets (function names and how often each occurs):

In [1]:
TRAIN_VOCAB_PATH = "../dataset/java-small/java-small.train.functions.vocab"
TEST_VOCAB_PATH = "../dataset/java-small/java-small.test.functions.vocab"
VALIDATION_VOCAB_PATH = "../dataset/java-small/java-small.validation.functions.vocab"

Load all vocab files into pandas DataFrames:

In [63]:
import pandas as pd

train_df = pd.read_csv(TRAIN_VOCAB_PATH, sep=' ', names=["Function", "Frequency"])
test_df = pd.read_csv(TEST_VOCAB_PATH, sep=' ', names=["Function", "Frequency"])
validation_df = pd.read_csv(VALIDATION_VOCAB_PATH, sep=' ', names=["Function", "Frequency"])

display(train_df)

Unnamed: 0,Function,Frequency
0,set|replace,1
1,add|current|invocation|context|factory,6
2,test|register|notification|listener|for|non|ex...,1
3,assert|after|test|method|with|transactional|te...,1
4,get|keystore|path,1
...,...,...
40646,add|component|interceptor,2
40647,existing|transaction|with|participation|using|...,1
40648,set|target|field,1
40649,set|authenticator,2
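
To make the file layout concrete, here is a minimal sketch that parses a tiny in-memory sample with the same `read_csv` arguments. The sample contents are made up for illustration. The assumption is that each line holds a `|`-joined function name, a single space, and an integer count.

```python
import io

import pandas as pd

# Hypothetical sample in the assumed vocab layout: "<name> <count>" per line.
sample = io.StringIO("get|keystore|path 1\nset|authenticator 2\n")

# Same arguments as the cell above: space separator, no header row in the file.
sample_df = pd.read_csv(sample, sep=" ", names=["Function", "Frequency"])
print(sample_df)
```

Because the files carry no header row, `names=` supplies the column labels; without it the first function would be consumed as a header.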


Concatenate the three tables, group duplicate function names, and sum their frequencies:

In [64]:
conc = pd.concat([train_df, test_df, validation_df], ignore_index=True)
conc = conc.groupby('Function').sum().reset_index()

display(conc)

Unnamed: 0,Function,Frequency
0,$,5
1,_,67
2,a,165
3,aabb|expand,1
4,aabb|overlap,1
...,...,...
68387,zip|dir,1
68388,zk|do|with|retries,1
68389,zone|id,1
68390,zone|id|from|resolver,1
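
The effect of the concatenate-then-group step can be checked on a toy example. This is only a sketch: the two small frames and their values are invented, and `part_a`, `part_b`, and `merged` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy splits with made-up values; "set|name" appears in both.
part_a = pd.DataFrame({"Function": ["set|name", "get|id"], "Frequency": [3, 1]})
part_b = pd.DataFrame({"Function": ["set|name", "zoom"], "Frequency": [2, 4]})

# Stack the rows, then collapse duplicate names by summing their counts.
merged = pd.concat([part_a, part_b], ignore_index=True)
merged = merged.groupby("Function").sum().reset_index()
print(merged)
```

`groupby` sorts the keys by default, which is why the combined table above starts with `$`, `_`, and `a`.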


Save the resulting data:

In [65]:
CONCATENATED_DF_PATH = "../dataset/java-small/java-small.dataset.functions.vocab"

conc.to_csv(CONCATENATED_DF_PATH, index=False)
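
Note that `to_csv(..., index=False)` writes a comma-separated file with a header row, which differs from the space-separated, header-less input files. A minimal round-trip sketch, using an in-memory buffer and made-up values instead of the real path:

```python
import io

import pandas as pd

# Toy table with invented values, standing in for the combined vocab.
toy = pd.DataFrame({"Function": ["get|id", "set|name"], "Frequency": [1, 5]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # same call as above, but to an in-memory target
buffer.seek(0)

# The saved file has a header and comma separators, so the defaults suffice.
restored = pd.read_csv(buffer)
print(restored)
```

Reading the saved file back therefore needs neither `sep=' '` nor `names=`.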

Split each function name into its words:

In [70]:
words = conc['Function'].apply(lambda x: str.split(x, '|'))

display(words)

0                               [$]
1                               [_]
2                               [a]
3                    [aabb, expand]
4                   [aabb, overlap]
                    ...            
68387                    [zip, dir]
68388       [zk, do, with, retries]
68389                    [zone, id]
68390    [zone, id, from, resolver]
68391                        [zoom]
Name: Function, Length: 68392, dtype: object
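
As an alternative to `apply` with a lambda, pandas offers the vectorized `Series.str.split`. The sketch below runs it on two names taken from the output above; `names` and `tokens` are illustrative variable names.

```python
import pandas as pd

names = pd.Series(["zone|id|from|resolver", "zoom"], name="Function")

# A single-character pattern is treated as a literal, not as a regex.
tokens = names.str.split("|")
print(tokens.tolist())
```

Unlike the lambda version, `str.split` passes missing values through as missing instead of raising an error.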

In [87]:
from collections import Counter

# Count word occurrences across the distinct function names (each name once).
cnt = Counter()
for seq in words:
    cnt.update(seq)

# Sort by count, keep the 25 most frequent words, fold the rest into one bucket.
srt = cnt.most_common()
srt_labels, srt_freqs = map(list, zip(*srt))
srt_labels = srt_labels[:25] + ['other words']
srt_freqs = srt_freqs[:25] + [sum(srt_freqs[25:])]
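
The "top words plus one bucket for the rest" logic is easier to follow on a small input. The sketch below uses invented word lists and a cutoff of 2 instead of 25; `toy_words`, `top_k`, `labels`, and `freqs` are illustrative names.

```python
from collections import Counter

# Made-up tokenized function names.
toy_words = [["get", "id"], ["set", "id"], ["get", "name"], ["zoom"]]

toy_cnt = Counter()
for seq in toy_words:
    toy_cnt.update(seq)

top_k = 2  # the notebook keeps 25
ranked = toy_cnt.most_common()  # (word, count) pairs, highest count first

# Keep the top_k words and sum everything else into a single bucket.
labels = [word for word, _ in ranked[:top_k]] + ["other words"]
freqs = [count for _, count in ranked[:top_k]] + [sum(count for _, count in ranked[top_k:])]
print(labels, freqs)
```

The bucketed frequencies still add up to the total number of word occurrences, so the pie chart shares remain proportions of the whole.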

### Distribution of words in function names, ignoring how often each function occurs

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # apply seaborn's default theme

plt.figure(figsize=(10, 10))
plt.pie(srt_freqs, labels=srt_labels)
plt.legend()
plt.show()
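
This distribution counts every distinct function name once. For comparison, a frequency-weighted count could be sketched as follows. This is an assumption about how such a variant might look, not code from the notebook; the toy table and the name `weighted` are made up.

```python
from collections import Counter

import pandas as pd

# Toy table with invented values.
toy = pd.DataFrame({"Function": ["get|id", "set|id", "zoom"], "Frequency": [3, 1, 2]})

# Each word is credited with the frequency of the function it appears in.
weighted = Counter()
for name, freq in zip(toy["Function"], toy["Frequency"]):
    for word in name.split("|"):
        weighted[word] += int(freq)

print(weighted.most_common())
```

Here `id` ends up ahead of `get` because it occurs in two functions whose frequencies add up.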