This notebook demonstrates the pre-processing of the data used for our experiments, in addition to the CodeSearchNet datasets. It is inspired by the CodeSearchNet notebook. For simplicity, our experiment is performed on the Python dataset.

In [1]:
import json
import pandas as pd
from pathlib import Path
pd.set_option('max_colwidth',300)
from pprint import pprint

# Part 1: Preview the Dataset

Before working on the complete dataset, it is useful to look at a small sample to understand the format and structure of the data.

First we download the dataset from our GitHub Repo. We have three different datasets that we want to look at:
1. CodeSearchNet Corpus
2. CodeXGLUE dataset
3. SemanticCodeSearch Corpus (generated from CodeSearchNet Challenge)

## Part 1.1 CodeSearchNet Corpus


In [3]:
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

--2023-03-18 17:37:00--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.226.200, 52.217.75.134, 52.217.200.240, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.226.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 940909997 (897M) [application/zip]
Saving to: ‘python.zip.1’


2023-03-18 17:37:34 (26.4 MB/s) - ‘python.zip.1’ saved [940909997/940909997]



In [18]:
!unzip python.zip -d codesearchnet

Archive:  python.zip
   creating: codesearchnet/python/
   creating: codesearchnet/python/final/
   creating: codesearchnet/python/final/jsonl/
   creating: codesearchnet/python/final/jsonl/train/
  inflating: codesearchnet/python/final/jsonl/train/python_train_9.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_12.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_10.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_0.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_6.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_2.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_4.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_8.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_11.jsonl.gz  
  inflating: codesearchnet/python/final/jsonl/train/python_train_5.jsonl.gz  
  inflating: codesea

In [19]:
# decompress this gzip file
!gzip -d codesearchnet/python/final/jsonl/test/python_test_0.jsonl.gz


Read the file and display the first row. The data is stored in JSON Lines format, where each line is a separate JSON object.
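As a minimal sketch of what JSON Lines means in practice: each line is a standalone JSON object, so a file can be streamed record by record, and `gzip.open` can read a `.jsonl.gz` file without decompressing it on disk first. The helper name `iter_jsonl` and the tiny file written here are illustrative assumptions, not part of the CodeSearchNet tooling.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl or .jsonl.gz file."""
    path = Path(path)
    # gzip.open in text mode decompresses transparently while reading.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical two-record file, written only to make the sketch runnable.
records = [
    {"func_name": "f", "language": "python"},
    {"func_name": "g", "language": "python"},
]
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.jsonl.gz"
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

loaded = list(iter_jsonl(sample_path))
print(loaded[0])
```

Streaming this way avoids holding every raw line in memory, which matters more for the training shards than for the single test file previewed here.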

In [20]:
with open('codesearchnet/python/final/jsonl/test/python_test_0.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]

'{"repo": "soimort/you-get", "path": "src/you_get/extractors/youtube.py", "func_name": "YouTube.get_vid_from_url", "original_string": "def get_vid_from_url(url):\\n        \\"\\"\\"Extracts video ID from URL.\\n        \\"\\"\\"\\n        return match1(url, r\'youtu\\\\.be/([^?/]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/embed/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/v/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/watch/([^/?]+)\') or \\\\\\n          parse_query_param(url, \'v\') or \\\\\\n          parse_query_param(parse_query_param(url, \'u\'), \'v\')", "language": "python", "code": "def get_vid_from_url(url):\\n        \\"\\"\\"Extracts video ID from URL.\\n        \\"\\"\\"\\n        return match1(url, r\'youtu\\\\.be/([^?/]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/embed/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/v/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/watch/([^/?]+)\'

Pretty-print the first record of the file.

In [21]:
pprint(json.loads(sample_file[0]))

{'code': 'def get_vid_from_url(url):\n'
         '        """Extracts video ID from URL.\n'
         '        """\n'
         "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
         "          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
         "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
         "          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
         "          parse_query_param(url, 'v') or \\\n"
         "          parse_query_param(parse_query_param(url, 'u'), 'v')",
 'code_tokens': ['def',
                 'get_vid_from_url',
                 '(',
                 'url',
                 ')',
                 ':',
                 'return',
                 'match1',
                 '(',
                 'url',
                 ',',
                 "r'youtu\\.be/([^?/]+)'",
                 ')',
                 'or',
                 'match1',
                 '(',
                 'url',
                 ',',
        

The fields in the file are documented in the README.md at https://github.com/github/CodeSearchNet/tree/master

## Exploring The Full Dataset
For simplicity, we focus only on the Python dataset.
To make analysis of the dataset easier, we load all the data into a pandas dataframe.

In [77]:
# Define a function to load JSONL files into a dataframe
%cd /localhome/local-dineshr/cs224n/cs224n-project/demos/codesearchnet
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

/localhome/local-dineshr/cs224n/cs224n-project/demos/codesearchnet


In [74]:
python_files = sorted(Path('python/').glob('**/*.jsonl'))
print(f"python_files:{len(python_files)}")

python_files:16


In [78]:
pydf = jsonl_list_to_dataframe(python_files)

In [79]:
pydf.head(3)

Unnamed: 0,repo,path,url,code,code_tokens,docstring,docstring_tokens,language,partition
0,soimort/you-get,src/you_get/extractors/youtube.py,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143,"def get_vid_from_url(url):\n """"""Extracts video ID from URL.\n """"""\n return match1(url, r'youtu\.be/([^?/]+)') or \\n match1(url, r'youtube\.com/embed/([^/?]+)') or \\n match1(url, r'youtube\.com/v/([^/?]+)') or \\n match1(url, r'youtube\.com/watch/...","[def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...",Extracts video ID from URL.,"[Extracts, video, ID, from, URL, .]",python,test
1,soimort/you-get,src/you_get/extractors/miomio.py,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.chil...","[def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...",str->list\n Convert XML to URL List.\n From Biligrab.,"[str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .]",python,test
2,soimort/you-get,src/you_get/extractors/fc2video.py,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17,"def makeMimi(upid):\n """"""From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110""""""\n strSeed = ""gGddgPfeaf_gzyr""\n prehash = upid + ""_"" + strSeed\n return md5(prehash.encode('utf-8')).hexdigest()","[def, makeMimi, (, upid, ), :, strSeed, =, ""gGddgPfeaf_gzyr"", prehash, =, upid, +, ""_"", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]",From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110,"[From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]",python,test


Two columns that will be heavily used in this dataset are code_tokens and docstring_tokens, which form a parallel corpus that can be used for tasks such as information retrieval (for example, retrieving a code snippet from its docstring). You can find more information about the definitions of the above columns in the README of this repo.
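To make the "parallel corpus" idea concrete, here is a toy retrieval baseline that ranks code snippets by token overlap (Jaccard similarity) with a docstring used as the query. This is only an illustrative sketch and not the method used in our experiments; the function names and the two-entry corpus are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query_tokens, corpus, top_k=1):
    """Rank corpus entries by token overlap with the query; return top_k indices."""
    q = [t.lower() for t in query_tokens]
    scores = [
        jaccard(q, [t.lower() for t in entry["code_tokens"]])
        for entry in corpus
    ]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical miniature corpus shaped like the dataframe rows.
toy_corpus = [
    {"code_tokens": ["def", "read", "file", "(", "path", ")", ":"],
     "docstring_tokens": ["Read", "a", "file", "from", "path", "."]},
    {"code_tokens": ["def", "add", "numbers", "(", "a", ",", "b", ")", ":"],
     "docstring_tokens": ["Add", "two", "numbers", "."]},
]

# Use each docstring as the query and check whether its own code ranks first.
hits = [retrieve(e["docstring_tokens"], toy_corpus)[0] == i
        for i, e in enumerate(toy_corpus)]
print(hits)
```

A lexical baseline like this only works when the docstring and the code share vocabulary, which is the gap that learned code-search models try to close.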

## Summary Statistics
### Row Counts
By Partition

In [81]:
pydf.partition.value_counts()

train    412178
valid     23107
test      22176
Name: partition, dtype: int64

### Token lengths

In [82]:
pydf['code_len'] = pydf.code_tokens.apply(len)
pydf['query_len'] = pydf.docstring_tokens.apply(len)

### Code Length Percentile
For example, the 80th percentile length for Python code tokens is 155.

In [83]:
code_len_summary = pydf.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

language,quantile,code_len
python,0.5,72.0
python,0.7,114.0
python,0.8,155.0
python,0.9,237.0
python,0.95,341.0


### Query Length Percentile

In [85]:
query_len_summary = pydf.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

language,quantile,query_len
python,0.5,10.0
python,0.7,15.0
python,0.8,20.0
python,0.9,33.0
python,0.95,48.0
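One common use of these percentiles is choosing a maximum sequence length for a model: a cutoff at the p-th percentile leaves roughly p percent of the examples untruncated. The sketch below shows that relationship on made-up lengths. The helper names and the toy values are assumptions, and the notebook does not state which limits were used in the experiments.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is <= max_len (i.e. not truncated)."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


def truncate(tokens, max_len):
    """Keep at most max_len leading tokens."""
    return tokens[:max_len]


# Hypothetical token lengths; the real values would come from pydf['code_len'].
toy_lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 400]

for cutoff in (50, 90, 400):
    print(cutoff, coverage(toy_lengths, cutoff))
```

Read against the tables above, a limit of 237 code tokens and 33 query tokens would each keep about 90% of the Python examples untruncated.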


## Part 1.2 CodeXGLUE Dataset
The dataset is the same as above, with some additional preprocessing applied on top of it.

In [13]:
!wget https://semanticcodesearch.s3.us-west-2.amazonaws.com/dataset.zip

--2023-03-18 18:19:17--  https://semanticcodesearch.s3.us-west-2.amazonaws.com/dataset.zip
Resolving semanticcodesearch.s3.us-west-2.amazonaws.com (semanticcodesearch.s3.us-west-2.amazonaws.com)... 52.92.226.90, 52.218.218.233, 52.92.165.106, ...
Connecting to semanticcodesearch.s3.us-west-2.amazonaws.com (semanticcodesearch.s3.us-west-2.amazonaws.com)|52.92.226.90|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25115627 (24M) [application/zip]
Saving to: ‘dataset.zip’


2023-03-18 18:19:17 (78.8 MB/s) - ‘dataset.zip’ saved [25115627/25115627]



In [29]:
# download and link the codesearchnet data
!unzip dataset.zip -d CodeXGLUE
!pwd
%cd ./CodeXGLUE/dataset
!pwd
!ln -s ../../codesearchnet codesearchnet

/localhome/local-dineshr/cs224n/cs224n-project/demos
/localhome/local-dineshr/cs224n/cs224n-project/demos/CodeXGLUE/dataset
/localhome/local-dineshr/cs224n/cs224n-project/demos/CodeXGLUE/dataset


In [54]:
# preprocess.py
import os
language='python'

train=[]
for root, dirs, files in os.walk('codesearchnet/'+language+'/final'):
    for file in files:
        temp=os.path.join(root,file)
        if '.jsonl' in temp:
            if 'train' in temp:
                train.append(temp)

data={}                    
for file in train:
    if '.gz' in file:
        os.system("gzip -d {}".format(file))
        file=file.replace('.gz','')
    with open(file) as f:
        for line in f:
            line=line.strip()
            js=json.loads(line)
            data[js['url']]=js
cont=0                    
with open('train.jsonl','w') as f, open("train.txt") as f1:
    for line in f1:
        line=line.strip()
        js=data[line].copy()
        js['idx']=cont
        cont+=1
        f.write(json.dumps(js)+'\n')
train_count = cont
print(f"training data count: {train_count}")

training data count: 251820
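The cell above follows a simple pattern: index every record by its `url`, then walk a list of URLs (`train.txt`) and write out the matching records with a running `idx`. The sketch below isolates that pattern on in-memory data. The function name `select_by_url`, the example URLs, and the use of `io.StringIO` in place of a real file are illustrative assumptions.

```python
import io


def select_by_url(records, url_lines, start_idx=0):
    """Pick records in the order given by url_lines and attach a running 'idx'."""
    # Later records with a duplicate URL overwrite earlier ones, as in the cell above.
    by_url = {rec["url"]: rec for rec in records}
    selected = []
    idx = start_idx
    for line in url_lines:
        url = line.strip()
        if not url:
            continue
        rec = dict(by_url[url])  # shallow copy so the source record is untouched
        rec["idx"] = idx
        idx += 1
        selected.append(rec)
    return selected, idx


toy_records = [
    {"url": "https://example.com/a", "code": "def a(): pass"},
    {"url": "https://example.com/b", "code": "def b(): pass"},
    {"url": "https://example.com/c", "code": "def c(): pass"},
]
toy_url_file = io.StringIO("https://example.com/c\nhttps://example.com/a\n")

train_subset, next_idx = select_by_url(toy_records, toy_url_file)
print([(r["idx"], r["url"]) for r in train_subset], next_idx)
```

Returning the next free index makes it easy to continue numbering in the following split, which is what the `cont` counter does in the notebook. A URL that is missing from the index raises `KeyError` here, just as it would in the original cell.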


In [45]:
# printing first line of the train file
with open("train.jsonl", "r") as f:
    train_file = f.readlines()

pprint(json.loads(train_file[0]))

{'code': 'def split_phylogeny(p, level="s"):\n'
         '    """\n'
         '    Return either the full or truncated version of a QIIME-formatted '
         'taxonomy string.\n'
         '\n'
         '    :type p: str\n'
         '    :param p: A QIIME-formatted taxonomy string: k__Foo; p__Bar; '
         '...\n'
         '\n'
         '    :type level: str\n'
         '    :param level: The different level of identification are kingdom '
         '(k), phylum (p),\n'
         '                  class (c),order (o), family (f), genus (g) and '
         'species (s). If level is\n'
         '                  not provided, the default level of identification '
         'is species.\n'
         '\n'
         '    :rtype: str\n'
         '    :return: A QIIME-formatted taxonomy string up to the '
         'classification given\n'
         '            by param level.\n'
         '    """\n'
         '    level = level+"__"\n'
         '    result = p.split(level)\n'
         '    retur

In [55]:
data={}                    
with open('test_code.jsonl') as f:
    for line in f:
        line=line.strip()
        js=json.loads(line)
        data[js['url']]=js


In [56]:
if cont == 0:
    assert train_count != 0
    cont = train_count
with open('valid.jsonl','w') as f, open("valid.txt") as f1:
    for line in f1:
        line=line.strip()
        js=data[line].copy()
        js['idx']=cont
        cont+=1
        f.write(json.dumps(js)+'\n')
valid_count = cont - train_count
print(f"Valid count: {valid_count}")

Valid count: 9604


In [50]:
# printing first line of the valid file
with open("valid.jsonl", "r") as f:
    valid_file = f.readlines()

pprint(json.loads(valid_file[0]))

{'argument_list': '',
 'docstring': 'Helper which expand_dims `is_accepted` then applies tf.where.',
 'docstring_summary': 'Helper which expand_dims `is_accepted` then applies '
                      'tf.where.',
 'docstring_tokens': ['Helper',
                      'which',
                      'expand_dims',
                      'is_accepted',
                      'then',
                      'applies',
                      'tf',
                      '.',
                      'where',
                      '.'],
 'function': 'def Func(arg_0, arg_1, arg_2, arg_3=None):\n'
             '  """Helper which expand_dims `is_accepted` then applies '
             'tf.where."""\n'
             '  if not is_namedtuple_like(arg_1):\n'
             '    return _Func_base_case(arg_0, arg_1, arg_2, arg_3=arg_3)\n'
             '  if not isinstance(arg_1, type(arg_2)):\n'
             "    raise TypeError('Type of `accepted` ({}) must be identical "
             "to '\n"
             "      

In [57]:
if cont == 0:
    assert train_count != 0
    assert valid_count != 0 
    cont = train_count + valid_count

with open('test.jsonl','w') as f, open("test.txt") as f1:
    for line in f1:
        line=line.strip()
        js=data[line].copy()
        js['idx']=cont
        cont+=1
        f.write(json.dumps(js)+'\n')

test_count = cont - (train_count + valid_count)
print(f"test count: {test_count}")

test count: 19210


In [52]:
# printing first line of the test file
with open("test.jsonl", "r") as f:
    test_file = f.readlines()

pprint(json.loads(test_file[0]))

{'argument_list': '',
 'docstring': 'Try loading given cache file.',
 'docstring_summary': 'Try loading given cache file.',
 'docstring_tokens': ['Try', 'loading', 'given', 'cache', 'file', '.'],
 'function': 'def Func(arg_0, arg_1, *arg_2, **arg_3):\n'
             '        """Try loading given cache file."""\n'
             '        try:\n'
             '            arg_4 = shelve.open(arg_1)\n'
             '            return arg_0(arg_1, arg_4, *arg_2, **arg_3)\n'
             '        except OSError as e:\n'
             '            logger.debug("Loading {0} failed".format(arg_1))\n'
             '            raise e',
 'function_tokens': ['def',
                     'Func',
                     '(',
                     'arg_0',
                     ',',
                     'arg_1',
                     ',',
                     '*',
                     'arg_2',
                     ',',
                     '**',
                     'arg_3',
                     ')',
      

In [58]:
assert train_count != 0
assert valid_count != 0
assert test_count  != 0
print(f"Total Count: {train_count + valid_count + test_count}")

Total Count: 280634
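Because the `cont` counter carries over from one split to the next, the `idx` values should run contiguously from 0 across train, valid, and test. A quick check along these lines can catch cells that were run out of order. The function name and the toy splits are hypothetical; on the real data the inputs would be the parsed records of the three `.jsonl` files.

```python
def check_contiguous(*splits):
    """Return True if the 'idx' values of all splits, concatenated, are 0..N-1."""
    all_idx = [rec["idx"] for split in splits for rec in split]
    return all_idx == list(range(len(all_idx)))


# Hypothetical splits with the same numbering scheme as the cells above.
toy_train = [{"idx": 0}, {"idx": 1}, {"idx": 2}]
toy_valid = [{"idx": 3}]
toy_test = [{"idx": 4}, {"idx": 5}]

print(check_contiguous(toy_train, toy_valid, toy_test))
```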


## Part 1.3 SemanticCodeSearch Corpus

In [62]:
%cd /localhome/local-dineshr/cs224n/cs224n-project/demos/
!wget https://semanticcodesearch.s3.us-west-2.amazonaws.com/query_dataset.zip
!unzip query_dataset.zip -d semantic_search
%cd semantic_search

/localhome/local-dineshr/cs224n/cs224n-project/demos
--2023-03-19 00:08:25--  https://semanticcodesearch.s3.us-west-2.amazonaws.com/query_dataset.zip
Resolving semanticcodesearch.s3.us-west-2.amazonaws.com (semanticcodesearch.s3.us-west-2.amazonaws.com)... 52.92.227.178, 52.92.250.146, 3.5.79.14, ...
Connecting to semanticcodesearch.s3.us-west-2.amazonaws.com (semanticcodesearch.s3.us-west-2.amazonaws.com)|52.92.227.178|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 236336 (231K) [application/zip]
Saving to: ‘query_dataset.zip’


2023-03-19 00:08:25 (3.50 MB/s) - ‘query_dataset.zip’ saved [236336/236336]

Archive:  query_dataset.zip
  inflating: semantic_search/train.jsonl  
  inflating: semantic_search/valid.jsonl  
  inflating: semantic_search/test.jsonl  
  inflating: semantic_search/queries.csv  
/localhome/local-dineshr/cs224n/cs224n-project/demos/semantic_search


In [66]:
# Dump example train file
with open("train.jsonl", "r") as f:
    train_file = f.readlines()

pprint(json.loads(train_file[0]))
print(f"no of training samples: {len(train_file)}")

{'code': 'def header_without_lines(header, remove):\n'
         '    """"""\n'
         '    remove = set(remove)\n'
         '    lines = []\n'
         '    for line in header.lines:\n'
         "        if hasattr(line, 'mapping'):\n"
         "            if (line.key, line.mapping.get('ID', None)) in remove:\n"
         '                continue\n'
         '        elif (line.key, line.value) in remove:\n'
         '            continue\n'
         '        lines.append(line)\n'
         '    return Header(lines, header.samples)\n',
 'code_tokens': ['header',
                 'without',
                 'lines',
                 'header',
                 'remove',
                 'remove',
                 'set',
                 'remove',
                 'lines',
                 'for',
                 'line',
                 'in',
                 'header',
                 'lines',
                 'if',
                 'hasattr',
                 'line',
               

In [67]:
# Dump example valid file
with open("valid.jsonl", "r") as f:
    valid_file = f.readlines()

pprint(json.loads(valid_file[0]))
print(f"no of valid samples: {len(valid_file)}")

{'code': 'def to_bool(self, value):\n'
         '    if value == None:\n'
         '        return False\n'
         '    elif isinstance(value, bool):\n'
         '        return value\n'
         "    elif str(value).lower() in ['true', '1', 'yes']:\n"
         '        return True\n'
         '    else:\n'
         '        return False\n',
 'code_tokens': ['to',
                 'bool',
                 'self',
                 'value',
                 'if',
                 'value',
                 'none',
                 'return',
                 'false',
                 'elif',
                 'isinstance',
                 'value',
                 'bool',
                 'return',
                 'value',
                 'elif',
                 'str',
                 'value',
                 'lower',
                 'in',
                 'true',
                 '1',
                 'yes',
                 'return',
                 'true',
                 'els

In [68]:
# Dump example test file
with open("test.jsonl", "r") as f:
    test_file = f.readlines()

pprint(json.loads(test_file[0]))
print(f"no of test samples: {len(test_file)}")

{'code': 'def pack_unsigned_int(number, size, le):\n'
         '    if not isinstance(number, int):\n'
         "        raise StructError('argument for i,I,l,L,q,Q,h,H must be "
         "integer')\n"
         '    if number < 0:\n'
         '        raise TypeError("can\'t convert negative long to unsigned")\n'
         '    if number > (1 << 8 * size) - 1:\n'
         "        raise OverflowError('Number:%i too large to convert' % "
         'number)\n'
         '    return pack_int(number, size, le)\n',
 'code_tokens': ['pack',
                 'unsigned',
                 'int',
                 'number',
                 'size',
                 'le',
                 'if',
                 'not',
                 'isinstance',
                 'number',
                 'int',
                 'raise',
                 'structerror',
                 'argument',
                 'for',
                 'must',
                 'be',
                 'integer',
                 '