# Data Exploration

This notebook explores the preprocessed data and shows some basic summary statistics.

In [1]:
import json

import pandas as pd
from pathlib import Path
pd.set_option('display.max_colwidth', 300)
from pprint import pprint

## Part 1: Preview The Dataset
    
Before downloading the entire dataset, it helps to explore a small sample to understand the format and structure of the data. The full dataset can be downloaded automatically with the `/script/setup` script in this repo, but a single-language subset can also be downloaded directly from S3.

The S3 links follow this pattern:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip

For example, the link for the `python` dataset is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
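
The pattern above can also be expanded programmatically. The following is a minimal sketch; the names `LANGUAGES`, `BASE_URL`, and `dataset_url` are illustrative and are not defined anywhere in this repo.

```python
# Language names and base URL are copied from the link pattern above.
# LANGUAGES, BASE_URL and dataset_url are hypothetical names for this sketch.
LANGUAGES = ["python", "java", "go", "php", "ruby", "javascript"]
BASE_URL = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2"


def dataset_url(language):
    """Return the S3 URL of the zip archive for a single language."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{BASE_URL}/{language}.zip"


# Build one URL per language and show the Python one.
urls = {lang: dataset_url(lang) for lang in LANGUAGES}
print(urls["python"])
```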

First we download and decompress this dataset:

In [2]:
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

--2019-06-14 01:05:08--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.77
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218813834 (209M) [application/zip]
Saving to: ‘python.zip’


2019-06-14 01:05:11 (63.9 MB/s) - ‘python.zip’ saved [218813834/218813834]



In [3]:
!unzip python.zip

Archive:  python.zip
   creating: python/
   creating: python/final/
   creating: python/final/jsonl/
   creating: python/final/jsonl/valid/
  inflating: python/final/jsonl/valid/python_valid_0.jsonl.gz  
   creating: python/final/jsonl/test/
  inflating: python/final/jsonl/test/python_test_0.jsonl.gz  
   creating: python/final/jsonl/train/
  inflating: python/final/jsonl/train/python_train_7.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_6.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_12.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_13.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_0.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_1.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_4.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_5.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_9.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_8.jsonl.gz  
  inflating: p

Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:

In [4]:
# decompress this gzip file
!gzip -d python/final/jsonl/test/python_test_0.jsonl.gz

Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format.

In [5]:
with open('python/final/jsonl/test/python_test_0.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]

'{"repo": "soimort/you-get", "path": "src/you_get/extractors/youtube.py", "func_name": "YouTube.get_vid_from_url", "original_string": "def get_vid_from_url(url):\\n        \\"\\"\\"Extracts video ID from URL.\\n        \\"\\"\\"\\n        return match1(url, r\'youtu\\\\.be/([^?/]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/embed/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/v/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/watch/([^/?]+)\') or \\\\\\n          parse_query_param(url, \'v\') or \\\\\\n          parse_query_param(parse_query_param(url, \'u\'), \'v\')", "language": "python", "code": "def get_vid_from_url(url):\\n        \\"\\"\\"Extracts video ID from URL.\\n        \\"\\"\\"\\n        return match1(url, r\'youtu\\\\.be/([^?/]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/embed/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/v/([^/?]+)\') or \\\\\\n          match1(url, r\'youtube\\\\.com/watch/([^/?]+)\'

Because each line in the file is valid JSON, we can parse the first row and display it in a more human-readable form:

In [6]:
pprint(json.loads(sample_file[0]))

{'code': 'def get_vid_from_url(url):\n'
         '        """Extracts video ID from URL.\n'
         '        """\n'
         "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
         "          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
         "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
         "          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
         "          parse_query_param(url, 'v') or \\\n"
         "          parse_query_param(parse_query_param(url, 'u'), 'v')",
 'code_tokens': ['def',
                 'get_vid_from_url',
                 '(',
                 'url',
                 ')',
                 ':',
                 'return',
                 'match1',
                 '(',
                 'url',
                 ',',
                 "r'youtu\\.be/([^?/]+)'",
                 ')',
                 'or',
                 'match1',
                 '(',
                 'url',
                 ',',
        

Definitions of each of the above fields are in the README.md file in the root of this repository.
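
As an alternative to decompressing with `gzip -d`, a file that is still compressed can be read line by line with Python's standard `gzip` module. The sketch below assumes nothing about the real data beyond "one JSON object per line". The helper name `read_jsonl_gz` and the tiny demo file are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path, limit=None):
    """Yield one parsed record per line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)


# Hypothetical demo file with two records, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps({"func_name": "f", "language": "python"}) + "\n")
    f.write(json.dumps({"func_name": "g", "language": "python"}) + "\n")

# Read only the first record, mirroring the "display the first row" step.
records = list(read_jsonl_gz(demo_path, limit=1))
print(records)
```

Because the function is a generator, only the lines actually requested are parsed, which keeps memory use low when previewing large files.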

## Part 2: Exploring The Full Dataset

You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding.

The data is located in `/resources/data`, which contains approximately 2.1 million (code, comment) pairs across the train, validation, and test partitions (the row counts computed later in this notebook sum to 2,070,536). You can learn more about the directory structure and associated files by viewing `/resources/README.md`.

The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format. First, we can get a list of all these files for further inspection:

In [7]:
python_files = sorted(Path('../resources/data/python/').glob('**/*.gz'))
java_files = sorted(Path('../resources/data/java/').glob('**/*.gz'))
go_files = sorted(Path('../resources/data/go/').glob('**/*.gz'))
php_files = sorted(Path('../resources/data/php/').glob('**/*.gz'))
javascript_files = sorted(Path('../resources/data/javascript/').glob('**/*.gz'))
ruby_files = sorted(Path('../resources/data/ruby/').glob('**/*.gz'))
all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files
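
The `**/*.gz` pattern passed to `Path.glob` matches recursively, so files in nested `train`, `valid`, and `test` directories are all found. The following sketch illustrates this on a throwaway directory tree; the file names are invented and only imitate the layout shown in the `unzip` output earlier. Note that `sorted` orders paths lexicographically, so `python_train_10` sorts before `python_train_2`.

```python
import tempfile
from pathlib import Path

# Build a small, hypothetical directory tree in a temporary location.
root = Path(tempfile.mkdtemp())
for rel in [
    "final/jsonl/train/python_train_2.jsonl.gz",
    "final/jsonl/train/python_train_10.jsonl.gz",
    "final/jsonl/test/python_test_0.jsonl.gz",
    "final/jsonl/test/notes.txt",  # not a .gz file, so the glob skips it
]:
    p = root / rel
    p.parent.mkdir(parents=True, exist_ok=True)
    p.touch()

# Recursive glob, sorted the same way as in the cell above.
found = sorted(root.glob("**/*.gz"))
names = [p.name for p in found]
print(names)
```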

In [8]:
print(f'Total number of files: {len(all_files):,}')

Total number of files: 78


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [9]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

This is what the python dataset looks like:

In [10]:
pydf = jsonl_list_to_dataframe(python_files)

In [11]:
pydf.head(3)

Unnamed: 0,repo,path,url,code,code_tokens,docstring,docstring_tokens,language,partition
0,soimort/you-get,src/you_get/extractors/youtube.py,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143,"def get_vid_from_url(url):\n """"""Extracts video ID from URL.\n """"""\n return match1(url, r'youtu\.be/([^?/]+)') or \\n match1(url, r'youtube\.com/embed/([^/?]+)') or \\n match1(url, r'youtube\.com/v/([^/?]+)') or \\n match1(url, r'youtube\.com/watch/...","[def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...",Extracts video ID from URL.,"[Extracts, video, ID, from, URL, .]",python,test
1,soimort/you-get,src/you_get/extractors/miomio.py,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.chil...","[def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...",str->list\n Convert XML to URL List.\n From Biligrab.,"[str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .]",python,test
2,soimort/you-get,src/you_get/extractors/fc2video.py,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17,"def makeMimi(upid):\n """"""From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110""""""\n strSeed = ""gGddgPfeaf_gzyr""\n prehash = upid + ""_"" + strSeed\n return md5(prehash.encode('utf-8')).hexdigest()","[def, makeMimi, (, upid, ), :, strSeed, =, ""gGddgPfeaf_gzyr"", prehash, =, upid, +, ""_"", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]",From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110,"[From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]",python,test


Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which form a parallel corpus that can be used for tasks such as information retrieval (for example, retrieving a code snippet given its docstring). You can find the definitions of the above columns in the README of this repo.
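
To make the retrieval idea concrete, here is a deliberately naive sketch that ranks functions by token overlap (Jaccard similarity) between a query and each function's `docstring_tokens`. It is only a stand-in to show how the parallel columns can be used, not a method taken from this repository. The names `jaccard` and `rank_by_overlap` are hypothetical, and the two-entry corpus is copied from the rows displayed above.

```python
def jaccard(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_overlap(query_tokens, corpus):
    """Sort (name, docstring_tokens) pairs by similarity to the query, best first."""
    return sorted(corpus,
                  key=lambda item: jaccard(query_tokens, item[1]),
                  reverse=True)


# Docstring tokens taken from the first two rows shown above.
corpus = [
    ("sina_xml_to_url_list",
     ["str", "-", ">", "list", "Convert", "XML", "to", "URL", "List", ".",
      "From", "Biligrab", "."]),
    ("get_vid_from_url",
     ["Extracts", "video", "ID", "from", "URL", "."]),
]

ranked = rank_by_overlap(["video", "ID", "URL"], corpus)
print(ranked[0][0])
```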

Next, we will read all of the data for a limited subset of these columns into memory so that we can compute summary statistics. **Warning:** This step takes approximately 20 minutes.

In [12]:
all_df = jsonl_list_to_dataframe(all_files, columns_short_list)

## Summary Statistics

### Row Counts

#### By Partition

In [13]:
all_df.partition.value_counts()

train    1880853
test      100529
valid      89154
Name: partition, dtype: int64

#### By Language

In [14]:
all_df.language.value_counts()

php           578118
java          496688
python        457461
go            346365
javascript    138625
ruby           53279
Name: language, dtype: int64

#### By Partition & Language

In [15]:
all_df.groupby(['partition', 'language'])['code_tokens'].count()

partition  language  
test       go             14291
           java           26909
           javascript      6483
           php            28391
           python         22176
           ruby            2279
train      go            317832
           java          454451
           javascript    123889
           php           523712
           python        412178
           ruby           48791
valid      go             14242
           java           15328
           javascript      8253
           php            26015
           python         23107
           ruby            2209
Name: code_tokens, dtype: int64
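
If loading everything into a DataFrame is too slow or uses too much memory, counts like the ones above can also be computed by streaming the files one line at a time. The following is a sketch under the assumption that every record has `partition` and `language` fields, as in the column lists used earlier. The function name and the demo file are hypothetical.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def count_by_partition_and_language(paths):
    """Count records per (partition, language) pair without building a DataFrame."""
    counts = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                counts[(record["partition"], record["language"])] += 1
    return counts


# Hypothetical demo file with three made-up records.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for partition, language in [("train", "python"), ("train", "python"), ("test", "go")]:
        f.write(json.dumps({"partition": partition, "language": language}) + "\n")

counts = count_by_partition_and_language([demo_path])
for key, n in sorted(counts.items()):
    print(key, n)
```

With the real data, the list of paths would be something like `all_files` from the earlier cell.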

### Token Lengths By Language

In [16]:
# Number of tokens in each code snippet and in each docstring (used as the query).
all_df['code_len'] = all_df.code_tokens.apply(len)
all_df['query_len'] = all_df.docstring_tokens.apply(len)

#### Code Length Percentile By Language

For example, the 80th percentile of code length for Python is 72 tokens.

In [17]:
code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

| language | quantile | code_len |
|----------|----------|----------|
| go       | 0.5      | 61.0     |
| go       | 0.7      | 100.0    |
| go       | 0.8      | 138.0    |
| go       | 0.9      | 217.0    |
| go       | 0.95     | 319.0    |
| java     | 0.5      | 66.0     |
| java     | 0.7      | 104.0    |
| java     | 0.8      | 142.0    |
| java     | 0.9      | 224.0    |
| java     | 0.95     | 331.0    |
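
The percentiles above come from pandas' `quantile`, which interpolates linearly between neighboring values by default. The following pure-Python sketch shows that calculation on a toy list; the function name `linear_quantile` and the toy lengths are made up for illustration.

```python
import math


def linear_quantile(values, q):
    """Quantile q (0..1) of values, interpolating linearly between neighbors."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the quantile within the sorted values.
    pos = (len(ordered) - 1) * q
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Toy token lengths, not taken from the dataset.
toy_code_lens = [10, 20, 30, 40, 50]
for q in [.5, .7, .8, .9, .95]:
    print(q, linear_quantile(toy_code_lens, q))
```

Read this way, an 80th percentile of 138 for Go means that about 80% of Go functions have at most 138 code tokens.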


#### Query Length Percentile By Language

For example, the 80th percentile of query (docstring) length for Python is 19 tokens.

In [18]:
query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| language | quantile | query_len |
|----------|----------|-----------|
| go       | 0.5      | 12.0      |
| go       | 0.7      | 19.0      |
| go       | 0.8      | 28.0      |
| go       | 0.9      | 49.0      |
| go       | 0.95     | 92.0      |
| java     | 0.5      | 11.0      |
| java     | 0.7      | 18.0      |
| java     | 0.8      | 25.0      |
| java     | 0.9      | 39.0      |
| java     | 0.95     | 61.0      |


#### Query Length All Languages

In [19]:
query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| quantile | query_len |
|----------|-----------|
| 0.5      | 10.0      |
| 0.7      | 15.0      |
| 0.8      | 20.0      |
| 0.9      | 32.0      |
| 0.95     | 50.0      |
