# Data Exploration

This notebook explores the pre-processed data and shows some basic statistics that may be useful.

In [3]:
import json
from pathlib import Path
from pprint import pprint

import pandas as pd

pd.set_option('display.max_colwidth', 300)

## Part 1: Preview The Dataset
    
Before downloading the entire dataset, it helps to explore a small sample to understand the format and structure of the data. The full dataset can be downloaded automatically with the `/script/setup` script in this repo; alternatively, we can download a subset of the data from S3.

The S3 links follow this pattern:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip

For example, the link for `python` is:

> https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

First we download and decompress this dataset:
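The download step can be sketched with the standard library alone. This is a minimal sketch, not the repo's own setup script: the helper name `download_and_extract` and the choice of destination directory are assumptions made for illustration, and the URL is the `python` link shown above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the pattern above; swap "python" for another language as needed.
DATASET_URL = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip"


def download_and_extract(url: str, dest_dir: str = ".") -> Path:
    """Download a zip archive (unless already present) and extract it into dest_dir.

    Hypothetical helper for illustration; not part of the repo's scripts.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]  # e.g. python.zip
    if not archive.exists():
        # Network access happens only when the archive is missing locally.
        urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest


# Uncomment to fetch and unpack the Python subset (this is a large download):
# download_and_extract(DATASET_URL)
```

Skipping the download when the archive already exists makes the cell safe to re-run without fetching the data again.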

Finally, we can inspect `python/final/jsonl/test/python_test_0.jsonl.gz` to see its contents:
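One way to peek at a compressed file without unpacking it first is Python's `gzip` module. The sketch below assumes the archive was extracted into the working directory so that the path above exists; `read_first_record` is a hypothetical helper name.

```python
import gzip
import json


def read_first_record(path: str) -> dict:
    """Return the first JSON object from a gzip-compressed JSON Lines file."""
    # "rt" opens the compressed stream in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.loads(f.readline())


# Example (assumes the extracted path exists locally):
# record = read_first_record("python/final/jsonl/test/python_test_0.jsonl.gz")
# print(sorted(record))  # field names of the first record
```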

Read in the file and display the first row.  The data is stored in [JSON Lines](http://jsonlines.org/) format.

In [4]:
with open('python.jsonl', 'r') as f:
    sample_file = f.readlines()
sample_file[0]

'{"code": "def JX8UK2VA7O LKEPLLYWHP WCJITHXL2F\\n 7VR7BUBWE7 TNOS1AYQOM F9AYYUCNL7 72YZYQNAXF GLEMJMWJDW\\n", "code_tokens": ["JX8UK2VA7O", "LKEPLLYWHP", "WCJITHXL2F", "."], "comment_tokens": [], "doc_id": 12345, "docstring": "JX8UK2VA7O LKEPLLYWHP WCJITHXL2F 7VR7BUBWE7.", "docstring_tokens": ["JX8UK2VA7O", "LKEPLLYWHP", "WCJITHXL2F", "."], "func_name": "JX8UK2VA7O", "hash_key": "XXXX/YYYY:ZZZZ/AAAA.py", "hash_val": 12345, "language": "python", "lineno": 1, "original_string": "JX8UK2VA7O LKEPLLYWHP WCJITHXL2F\\n 7VR7BUBWE7 TNOS1AYQOM F9AYYUCNL7 72YZYQNAXF GLEMJMWJDW TZGGUVUOLA I7TA56BF84\\n", "partition": "test", "path": "ZZZZ/AAAA.py", "repo": "XXXX/YYYY", "sha": ""}\n'

In [5]:
sample_file[2]

'{"code": "def JX8UK2VA7O LKEPLLYWHP WCJITHXL2F\\n 7VR7BUBWE7 TNOS1AYQOM F9AYYUCNL7 72YZYQNAXF GLEMJMWJDW\\n", "code_tokens": ["JX8UK2VA7O", "LKEPLLYWHP", "WCJITHXL2F", "."], "comment_tokens": [], "doc_id": 12345, "docstring": "JX8UK2VA7O LKEPLLYWHP WCJITHXL2F 7VR7BUBWE7.", "docstring_tokens": ["JX8UK2VA7O", "LKEPLLYWHP", "WCJITHXL2F", "."], "func_name": "JX8UK2VA7O", "hash_key": "XXXX/YYYY:ZZZZ/AAAA.py", "hash_val": 12345, "language": "python", "lineno": 1, "original_string": "JX8UK2VA7O LKEPLLYWHP WCJITHXL2F\\n 7VR7BUBWE7 TNOS1AYQOM F9AYYUCNL7 72YZYQNAXF GLEMJMWJDW TZGGUVUOLA I7TA56BF84\\n", "partition": "test", "path": "ZZZZ/AAAA.py", "repo": "XXXX/YYYY", "sha": ""}\n'

We can use the fact that each line in the file is valid JSON to display the first row in a more human-readable form:

In [6]:
pprint(json.loads(sample_file[0]))

{'code': 'def JX8UK2VA7O LKEPLLYWHP WCJITHXL2F\n'
         ' 7VR7BUBWE7 TNOS1AYQOM F9AYYUCNL7 72YZYQNAXF GLEMJMWJDW\n',
 'code_tokens': ['JX8UK2VA7O', 'LKEPLLYWHP', 'WCJITHXL2F', '.'],
 'comment_tokens': [],
 'doc_id': 12345,
 'docstring': 'JX8UK2VA7O LKEPLLYWHP WCJITHXL2F 7VR7BUBWE7.',
 'docstring_tokens': ['JX8UK2VA7O', 'LKEPLLYWHP', 'WCJITHXL2F', '.'],
 'func_name': 'JX8UK2VA7O',
 'hash_key': 'XXXX/YYYY:ZZZZ/AAAA.py',
 'hash_val': 12345,
 'language': 'python',
 'lineno': 1,
 'original_string': 'JX8UK2VA7O LKEPLLYWHP WCJITHXL2F\n'
                    ' 7VR7BUBWE7 TNOS1AYQOM F9AYYUCNL7 72YZYQNAXF GLEMJMWJDW '
                    'TZGGUVUOLA I7TA56BF84\n',
 'partition': 'test',
 'path': 'ZZZZ/AAAA.py',
 'repo': 'XXXX/YYYY',
 'sha': ''}


In [14]:
type(sample_file[0])

<class 'str'>

In [24]:
# Parsing a line with json.loads turns the raw string into a Python dict.
type(json.loads(sample_file[0]))


<class 'dict'>
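Since every line parses to a `dict`, the whole file can be converted into a list of records in one pass. The sketch below uses two made-up lines in place of `sample_file` so that it runs on its own; `parse_jsonl` is a hypothetical helper name.

```python
import json

# Two made-up JSON Lines records standing in for sample_file (illustrative only).
lines = [
    '{"func_name": "foo", "language": "python", "partition": "test"}\n',
    '{"func_name": "bar", "language": "python", "partition": "test"}\n',
]


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


records = parse_jsonl(lines)
print(type(records[0]).__name__, len(records))  # prints: dict 2
```

A list of dicts like `records` can be passed directly to `pd.DataFrame(records)` for tabular inspection.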