## Steps to reproduce

1. Compile a list of repositories to clone.
2. Clone the selected repositories.
3. Parse every .py file in each cloned repository, producing a pair of .src and .ast files per parsed file.
4. Merge the parsed pairs into two large files (train.src, train.ast).
5. Remove duplicate lines from the .src file, along with the aligned lines in the .ast file.

### Step 1

In [2]:
import pandas as pd

def name_to_url(name: str) -> str:
    return f"https://github.com/{name}.git"

In [4]:
repositories = pd.read_json("/workspace/data/repositories/top_18k.jsonl", lines=True)

In [30]:
selected_repositories = repositories.sort_values("size")[:200]
selected_repositories

| | full_name | language | commits | stargazers_count | watchers_count | forks_count | size | archived | fork |
|---|---|---|---|---|---|---|---|---|---|
| 16948 | kuangliu/pytorch-groupnorm | Python | -1 | 100 | 100 | 23 | 0 | False | False |
| 12100 | mitmul/caltech-pedestrian-dataset-converter | Python | -1 | 146 | 146 | 64 | 0 | False | False |
| 9868 | standupmaths/rolling_shutter | Python | -1 | 184 | 184 | 33 | 0 | False | False |
| 14552 | breenmachine/JavaUnserializeExploits | Python | -1 | 120 | 120 | 235 | 0 | False | False |
| 17972 | callmefeifei/SvnHack | Python | -1 | 93 | 93 | 41 | 1 | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18086 | chrizator/netattack | Python | -1 | 92 | 92 | 51 | 8 | False | False |
| 17295 | raulmur/evaluate_ate_scale | Python | -1 | 97 | 97 | 52 | 8 | False | False |
| 5943 | smilli/py-corenlp | Python | -1 | 327 | 327 | 74 | 8 | False | False |
| 16285 | voice32/stock_market_indicators | Python | -1 | 105 | 105 | 52 | 8 | False | False |


In [31]:
# Write one clone URL per line; Step 2 reads this file via --repo_file.
urls = [name_to_url(name) for name in selected_repositories["full_name"]]
urls = "\n".join(urls)
with open("/workspace/tmp/code2ast_large/repo_list.txt", mode="w") as file:
    file.write(urls)
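
Many repositories share the same `size` value (the table above ends on a run of size 8), so which of the tied rows fall inside the first 200 depends on the sort algorithm. A minimal sketch of a repeatable selection, using a toy DataFrame whose names and sizes are made up for illustration:

```python
import pandas as pd


def name_to_url(name: str) -> str:
    return f"https://github.com/{name}.git"


# Toy stand-in for top_18k.jsonl; the repository names and sizes are invented.
toy = pd.DataFrame(
    {
        "full_name": ["owner-a/big", "owner-b/small", "owner-c/medium", "owner-d/small2"],
        "size": [500, 3, 40, 3],
    }
)

# A stable sort (mergesort) keeps tied sizes in their original order,
# so rerunning the cell selects the same rows every time.
smallest = toy.sort_values("size", kind="mergesort").head(3)
toy_urls = [name_to_url(name) for name in smallest["full_name"]]
```

With the default sort kind the result is still sorted by size, but the order within ties is not guaranteed.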

### Step 2

In [32]:
!cd /workspace && python -m src.clone_repository \
    --repo_file /workspace/tmp/code2ast_large/repo_list.txt \
    --output /workspace/tmp/code2ast_large/repositories \
    --clear_before 1

100%|██████████████████████████████████████████| 200/200 [03:09<00:00, 1.06it/s]
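
The internals of `src.clone_repository` are not shown here. As a rough sketch of what cloning one entry of `repo_list.txt` could look like, the helper below builds a shallow `git clone` command. The function name `clone_command`, the `--depth 1` choice, and the `owner@repo` directory naming (inferred from the file names in the Step 4 log) are all assumptions, not the module's actual behavior.

```python
from pathlib import Path
from typing import List


def clone_command(url: str, output_dir: str) -> List[str]:
    """Build a shallow `git clone` command for one repository URL (hypothetical helper)."""
    trimmed = url[:-4] if url.endswith(".git") else url
    owner, repo = trimmed.split("/")[-2:]
    # Assumed layout: one directory per repository, named "owner@repo".
    target = Path(output_dir) / f"{owner}@{repo}"
    return ["git", "clone", "--depth", "1", url, str(target)]


example_command = clone_command(
    "https://github.com/kuangliu/pytorch-groupnorm.git", "repositories"
)
```

Each command could then be passed to `subprocess.run(command, check=True)`; a shallow clone is enough when only the current .py files are needed.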


### Step 3

In [33]:
!cd /workspace && python -m src.ast_dataset_prepare parse-nodes --rule-all \
    --library-path=/workspace/tmp/code2ast_large/langs.so \
    --language=python \
    --language-ext=py \
    --root-input-path=/workspace/tmp/code2ast_large/repositories \
    --output-path=/workspace/tmp/code2ast_large/_parsed_files \
    --extensions="src, ast"

Executing
100%|██████████████████████████████████████████| 449/449 [00:05<00:00, 78.1it/s]
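
To illustrate the kind of aligned output this step produces, here is a sketch that swaps in Python's standard `ast` module for the compiled grammar loaded from `langs.so`. It emits one source line and one AST line per function definition; the per-function granularity, the newline escaping, and the `ast.dump` format are assumptions made for the example and will differ from what `parse-nodes --rule-all` writes.

```python
import ast
from typing import List, Tuple


def function_pairs(source: str) -> List[Tuple[str, str]]:
    """Return one (source line, AST line) pair per function definition."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node)
            # Escape newlines so each sample occupies exactly one line,
            # which keeps line N of .src aligned with line N of .ast.
            src_line = segment.replace("\n", "\\n")
            ast_line = ast.dump(node)  # single-line dump by default
            pairs.append((src_line, ast_line))
    return pairs


example_pairs = function_pairs("def f(x):\n    return x + 1\n")
```

The one-sample-per-line layout is what makes the later merge and deduplication steps possible with plain line-based tools.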


### Step 4

In [39]:
!cd /workspace && python -m src.merge_files merge-pairs \
        --input-path=/workspace/tmp/code2ast_large/_parsed_files \
        --output-prefix=/workspace/tmp/code2ast_large/_parsed_files_merged/all \
        --extensions="src, ast" \
#         --remove-files

{'--extensions': 'src, ast',
 '--input-path': '/workspace/tmp/code2ast_large/_parsed_files',
 '--output-path': None,
 '--output-prefix': '/workspace/tmp/code2ast_large/_parsed_files_merged/all',
 '--remove-files': False,
 'merge-jsonl': False,
 'merge-pairs': True} False
source: /workspace/tmp/code2ast_large/_parsed_files/_workspace_tmp_code2ast_large_repositories_MalwareTech@TrickBot-Toolkit_includes_BotConfig.src
target: /workspace/tmp/code2ast_large/_parsed_files/_workspace_tmp_code2ast_large_repositories_MalwareTech@TrickBot-Toolkit_includes_BotConfig.ast
source: /workspace/tmp/code2ast_large/_parsed_files/_workspace_tmp_code2ast_large_repositories_wuchong@scrapy-dynamic-configurable_pipelines.src
target: /workspace/tmp/code2ast_large/_parsed_files/_workspace_tmp_code2ast_large_repositories_wuchong@scrapy-dynamic-configurable_pipelines.ast
source: /workspace/tmp/code2ast_large/_parsed_files/_workspace_tmp_code2ast_large_repositories_ambionics@magento-exploits_magento-sqli.src
targe
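
The log above pairs each .src file with the .ast file of the same name. A minimal sketch of such a merge, assuming both files of a pair have the same number of lines; `merge_pairs` and `read_lines` are hypothetical helpers, not the functions in `src.merge_files`:

```python
import tempfile
from pathlib import Path
from typing import List


def read_lines(path: Path) -> List[str]:
    """Read a text file as a list of lines without trailing newlines."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def merge_pairs(input_dir: Path, output_prefix: Path) -> int:
    """Append every matching .src/.ast pair to <prefix>.src and <prefix>.ast."""
    output_prefix.parent.mkdir(parents=True, exist_ok=True)
    out_src = Path(f"{output_prefix}.src")
    out_ast = Path(f"{output_prefix}.ast")
    total = 0
    with out_src.open("w", encoding="utf-8") as merged_src, \
            out_ast.open("w", encoding="utf-8") as merged_ast:
        for src_path in sorted(input_dir.glob("*.src")):
            ast_path = src_path.with_suffix(".ast")
            if not ast_path.exists():
                continue  # no target for this source; skip it
            src_lines = read_lines(src_path)
            ast_lines = read_lines(ast_path)
            if len(src_lines) != len(ast_lines):
                # Merging a misaligned pair would shift every later sample.
                raise ValueError(f"misaligned pair: {src_path.name}")
            merged_src.writelines(line + "\n" for line in src_lines)
            merged_ast.writelines(line + "\n" for line in ast_lines)
            total += len(src_lines)
    return total


# Demo on throwaway files (names and contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    parsed = Path(tmp) / "parsed"
    parsed.mkdir()
    (parsed / "a.src").write_text("x = 1\ny = 2\n", encoding="utf-8")
    (parsed / "a.ast").write_text("(assign x 1)\n(assign y 2)\n", encoding="utf-8")
    (parsed / "b.src").write_text("z = 3\n", encoding="utf-8")
    (parsed / "b.ast").write_text("(assign z 3)\n", encoding="utf-8")
    merged_count = merge_pairs(parsed, Path(tmp) / "merged" / "all")
    merged_src_lines = read_lines(Path(tmp) / "merged" / "all.src")
    merged_ast_lines = read_lines(Path(tmp) / "merged" / "all.ast")
```

Checking the line counts per pair before writing is cheap and catches alignment problems at the file where they start.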

### Step 5

In [48]:
!cd /workspace && python -m src.remove_duplicates \
    --reference-filepath=/workspace/tmp/code2ast_large/_parsed_files_merged/all.src \
    --aligned-filepath=/workspace/tmp/code2ast_large/_parsed_files_merged/all.ast \
    --destination-path=/workspace/tmp/code2ast_large/_parsed_files_merged_dedup

In [49]:
!cd /workspace && python -m src.remove_duplicates \
    --reference-filepath=/workspace/tmp/ast_test/code2ast_medium/train.src \
    --aligned-filepath=/workspace/tmp/ast_test/code2ast_medium/train.ast \
    --destination-path=/workspace/tmp/ast_test/dedup
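
The core idea of this step, dropping repeated lines from the reference file together with the lines at the same positions in the aligned file, can be sketched as follows. `dedup_aligned` is a hypothetical helper, and keeping the first occurrence of each duplicate is an assumption about how `src.remove_duplicates` behaves.

```python
from typing import Iterable, List, Tuple


def dedup_aligned(
    reference: Iterable[str], aligned: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Keep the first occurrence of each reference line and its aligned partner."""
    seen = set()
    kept_reference: List[str] = []
    kept_aligned: List[str] = []
    for ref_line, aligned_line in zip(reference, aligned):
        if ref_line in seen:
            continue  # duplicate source line: drop both sides of the pair
        seen.add(ref_line)
        kept_reference.append(ref_line)
        kept_aligned.append(aligned_line)
    return kept_reference, kept_aligned


# Invented sample data: line 3 repeats line 1.
sample_src = ["a = 1", "b = 2", "a = 1"]
sample_ast = ["(assign a 1)", "(assign b 2)", "(assign a 1)"]
dedup_src, dedup_ast = dedup_aligned(sample_src, sample_ast)
```

Deduplicating on the .src side only means two different ASTs for the same source text cannot both survive, which keeps the mapping from source to target unambiguous.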