# code2vec using bag-of-words (BOW) approaches

### As a baseline for learning representations

VECTORIZE - Create code vectors using different strategies:

1. Word Vectors
    * Count
    * TF-IDF
    * Frequency (ratio)
    * Binary (Presence / Absence)
2. Token Vectors
    * Count
3. AST Vectors

TRAIN - Learn representations from these code vectors so that a model can distinguish between:

- correct code (the submission passes the test cases)
- wrong code (the submission fails the test cases)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
font = {'weight': 'bold', 'size': 20}
plt.rc('font', **font)

## VECTORIZE

**Programming data**: Students develop programs locally for the laboratory sheets of computer programming courses at our university. They submit these programs to an automatic grading platform, which runs test cases specified by the lecturer on each program. For each submission, the platform stores a JSON record with the test case output, whether the program passed, and the code itself.

In [3]:
!ls data/raw/programming_data.json

data/raw/programming_data.json


In [4]:
dataframe = pd.read_json('data/raw/programming_data.json')

In [5]:
dataframe.head(2)

Unnamed: 0,academic_year_0,academic_year_1,correct,date,extension,ip,module,task,upload,user
0,2016,2017,True,2016-09-19 14:11:41,py,,ca277,add.py,#!/usr/bin/env python\n\na = int(raw_input())\...,b9e7e608-6036-4d44-8770-a7036176b53c
1,2016,2017,True,2016-09-19 14:17:33,py,,ca277,concat-lines.py,#!/usr/bin/env python\n\na = str(raw_input())\...,b9e7e608-6036-4d44-8770-a7036176b53c


In [6]:
'{:,}'.format(len(dataframe))

'591,707'

Grab only code submissions from Python modules:

In [7]:
PYTHON_MODULES = [
    'ca116', 
    'ca117', 
    'ca177', 
    'ca277', 
    'ca278',
]

In [8]:
dataframe = dataframe[dataframe['module'].isin(PYTHON_MODULES)]

In [9]:
'{:,}'.format(len(dataframe))

'490,820'

Target value:

In [10]:
dataframe.correct.value_counts()

False    296369
True     194451
Name: correct, dtype: int64

Remove comments:

In [11]:
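import re
def remove_comments(text):
    # Delete everything from '#' up to and including the end of the line.
    # Note: this also matches '#' inside string literals.
    return re.sub(r'#.*?\n', '', text)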
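
To make the behaviour of this regular expression concrete, the sketch below applies the same pattern to a few hand-made inputs. It is written for Python 3 (the notebook itself runs on Python 2), and both the sample strings and the `strip_comments_keep_newline` variant are illustrative assumptions rather than part of the original pipeline.

```python
import re

def remove_comments(text):
    # Same pattern as the notebook: from '#' up to and including the next newline
    return re.sub(r'#.*?\n', '', text)

def strip_comments_keep_newline(text):
    # Hypothetical variant: drop the comment text but keep the line break
    return re.sub(r'#[^\n]*', '', text)

# A shebang line is removed together with its newline
shebang = remove_comments("#!/usr/bin/env python\n\na = 1\n")
# An inline comment takes its newline with it, so two lines are joined
merged = remove_comments("x = 1  # note\ny = 2\n")
# A '#' inside a string literal is treated as a comment
in_string = remove_comments("s = '#tag'\n")
# A comment on the last line without a trailing newline is not matched
no_newline = remove_comments("x = 1\n# last")
# The variant keeps the statements on separate lines
kept = strip_comments_keep_newline("x = 1  # note\ny = 2\n")
```

For a bag-of-words baseline these edge cases may be acceptable, but joined lines such as `x = 1  y = 2` are no longer valid Python, which can matter for the AST-based vectors later on.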

Grab docs (code submissions) and labels (correct or not):

In [12]:
def get_docs_and_labels(df):
    _docs = []
    _labels = []
    for index in df.index:
        # Program
        code = remove_comments(
            df.at[index, 'upload']
        )
        _docs.append(code)
        # Label
        label = int(df.at[index, 'correct'])
        _labels.append(label)
    return _docs, _labels

In [13]:
docs, labels = get_docs_and_labels(dataframe)

In [14]:
'{:,}'.format(len(docs))

'490,820'

In [15]:
docs[0]

u'\na = int(raw_input())\nb = int(raw_input())\n\nprint a + b\n\n\n'

In [16]:
labels[0]

1

## 1) Programs as word vectors

In [17]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [18]:
NUM_WORDS = 2000 # Originally 231,659 words

In [72]:
t = Tokenizer(num_words=NUM_WORDS, 
              filters='\t\n', 
              lower=True, 
              split=' ', 
              char_level=False)

In [73]:
t.fit_on_texts(docs)

In [74]:
# word_counts: a dictionary of words and their counts.
t.word_counts['if'] # word count

552539

In [75]:
ordered_words = sorted(t.word_counts.iteritems(), key=lambda (k,v): (v,k), reverse=True)

In [76]:
N = 5
for key, value in ordered_words[:N]:
    print "%s: %s" % (key, value)

=: 2440154
i: 910221
+: 575607
if: 552539
def: 522536


In [77]:
# document_count: an integer count of the total number of documents that were used to fit the Tokenizer.
'Number docs: {:,}'.format(t.document_count)

'Number docs: 490,820'

In [78]:
# word_index: a dictionary of words and their uniquely assigned integers.
t.word_index['if'] # index

4

In [79]:
# word_docs: a dictionary of words and how many documents each appeared in.
t.word_docs['if']

298487
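
The three dictionaries inspected above can be reproduced on a toy corpus with plain Python. This is a minimal sketch, assuming the tokenizer settings used here (`lower=True`, `filters='\t\n'`, `split=' '`); the two toy documents, the helper name `split_words`, and the alphabetical tie-breaking between equally frequent words are assumptions made for illustration.

```python
from collections import Counter

def split_words(doc, filters='\t\n'):
    # Lowercase, turn filtered characters into spaces, split on single spaces
    doc = doc.lower()
    for ch in filters:
        doc = doc.replace(ch, ' ')
    return [w for w in doc.split(' ') if w]

toy_docs = ["a = 1\nif a:\n\tprint a", "if b: pass"]

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
for doc in toy_docs:
    words = split_words(doc)
    word_counts.update(words)
    word_docs.update(set(words))

# Most frequent word gets index 1; index 0 is never assigned to a word
ranked = sorted(word_counts, key=lambda w: (-word_counts[w], w))
word_index = {word: i for i, word in enumerate(ranked, start=1)}
```

Here `if` occurs twice and in both documents, while `a` also occurs twice but only in the first document, which is the distinction between `word_counts` and `word_docs`.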

### i) Count: count of each word in the document

In [25]:
encoded_docs_count = t.texts_to_matrix(docs, mode='count')

In [26]:
encoded_docs_count[0]

array([ 0.,  2.,  0., ...,  0.,  0.,  0.])

In [27]:
encoded_docs_count.shape

(490820, 2000)
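
The shape of one row can be illustrated without Keras. The sketch below assumes, as we understand the Keras `Tokenizer`, that column 0 is reserved and that only words whose index is below `num_words` are kept; the small `word_index` and the helper names are hypothetical, and the document is `docs[0]` from above.

```python
def split_words(doc, filters='\t\n'):
    # Turn filtered characters into spaces, then split on single spaces
    for ch in filters:
        doc = doc.replace(ch, ' ')
    return [w for w in doc.lower().split(' ') if w]

def count_vector(doc, word_index, num_words):
    # One row of the count matrix: position i holds the count of the word with index i
    row = [0.0] * num_words
    for word in split_words(doc):
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            row[idx] += 1.0
    return row

# Hypothetical vocabulary; only '=' having index 1 is taken from the notebook output
word_index = {'=': 1, 'a': 2, 'b': 3, 'print': 4, '+': 5, 'int(raw_input())': 6}
doc = '\na = int(raw_input())\nb = int(raw_input())\n\nprint a + b\n\n\n'
row = count_vector(doc, word_index, num_words=5)
```

With `num_words=5` the words at index 5 and 6 are dropped, and position 1 holds the two occurrences of `=`, matching the `2.` seen in `encoded_docs_count[0]`.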

In [28]:
filename = 'data/processed/word_count'
np.save(filename, encoded_docs_count)

In [29]:
np.save('data/processed/word_labels', labels)

### ii) TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) score for each word in the document

In [80]:
encoded_docs_tfidf = t.texts_to_matrix(docs, mode='tfidf')

In [81]:
encoded_docs_tfidf[0]

array([ 0.        ,  1.25768178,  0.        , ...,  0.        ,
        0.        ,  0.        ])
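
The score for a single word can be sketched as follows, assuming the weighting that the Keras `Tokenizer` applies in `tfidf` mode as we understand it: `(1 + ln(count)) * ln(1 + N / (1 + df))`, where `N` is the number of documents and `df` the number of documents containing the word. The function name and the toy numbers are hypothetical.

```python
import math

def keras_style_tfidf(count, document_count, doc_freq):
    # Words that do not occur in the document keep a score of zero
    if count == 0:
        return 0.0
    tf = 1.0 + math.log(count)                               # dampened term frequency
    idf = math.log(1.0 + document_count / (1.0 + doc_freq))  # smoothed inverse document frequency
    return tf * idf

# Toy corpus of 9 documents, 2 of which contain the word
score_once = keras_style_tfidf(1, 9, 2)
score_twice = keras_style_tfidf(2, 9, 2)
```

Because the term-frequency part grows logarithmically, doubling the count from 1 to 2 multiplies the score by `1 + ln 2` (about 1.69) rather than by 2.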

In [82]:
np.save('data/processed/word_tfidf', encoded_docs_tfidf)

### iii) Frequency: frequency of each word as a ratio of words within each document.

In [83]:
encoded_docs_freq = t.texts_to_matrix(docs, mode='freq')

In [84]:
encoded_docs_freq[0]

array([ 0. ,  0.2,  0. , ...,  0. ,  0. ,  0. ])
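
The `0.2` in position 1 can be checked by hand on `docs[0]`. This Python 3 sketch assumes that all ten words of the document fall inside the `NUM_WORDS` vocabulary; position 1 belongs to `=`, the most frequent word in the corpus.

```python
doc = '\na = int(raw_input())\nb = int(raw_input())\n\nprint a + b\n\n\n'

# Same splitting rule as the tokenizer: tabs and newlines become spaces
words = [w for w in doc.replace('\t', ' ').replace('\n', ' ').split(' ') if w]

# Frequency of '=' as a share of all words in this document
freq_eq = words.count('=') / len(words)
```

The document has ten words, two of which are `=`, giving 2 / 10 = 0.2.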

In [85]:
np.save('data/processed/word_freq', encoded_docs_freq)

### iv) Binary: Whether or not each word is present in the document

In [86]:
encoded_docs_binary = t.texts_to_matrix(docs, mode='binary')

In [87]:
encoded_docs_binary[0]

array([ 0.,  1.,  0., ...,  0.,  0.,  0.])

In [89]:
np.save('data/processed/word_binary', encoded_docs_binary)

## 2) Programs as tokens

In [44]:
from tokenize import generate_tokens
from StringIO import StringIO

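Generate token IDs and token words. Token IDs are the numeric token types assigned by the tokenizer (such as names, operators or strings), so they are more general than token words (see the "Program Vectors" notebook).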

In [45]:
sample_code = '''print("Hello World!")'''

In [46]:
# Use `tok`, not `t`: in Python 2 the loop variable leaks and would overwrite the Tokenizer `t`
[(tok[0], tok[1]) for tok in generate_tokens(StringIO(sample_code).readline)]

[(1, 'print'), (51, '('), (3, '"Hello World!"'), (51, ')'), (0, '')]
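
The numeric IDs depend on the Python version. In Python 2.7, which this notebook uses, 0 is `ENDMARKER`, 1 is `NAME`, 2 is `NUMBER`, 3 is `STRING`, 4 is `NEWLINE`, 51 is `OP` and 54 is `NL`. A version-independent way to read them is to map each ID through `tokenize.tok_name`; the following is a small sketch for Python 3, so the raw numbers it sees differ from the ones above.

```python
import io
import tokenize

sample_code = 'print("Hello World!")'

# Each token carries a numeric type and the matched string
tokens = list(tokenize.generate_tokens(io.StringIO(sample_code).readline))

# Translate numeric types into readable names such as NAME, OP, STRING
named = [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]
type_names = [name for name, _ in named]
```

The first four entries are `NAME`, `OP`, `STRING` and `OP`, and the sequence ends with `ENDMARKER`, mirroring the `1, 51, 3, 51, 0` pattern shown above.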

In [47]:
token_ids = []
token_words = []
token_labels = []
for doc, label in zip(docs, labels):
    try:
        tokens = [(tok[0], tok[1]) for tok in generate_tokens(StringIO(doc).readline)]
    except Exception:
        # Skip submissions that cannot be tokenized
        continue
    # Token IDs (token types)
    token_ids.append([token_id for token_id, _ in tokens])
    # Token words (token strings)
    token_words.append([word for _, word in tokens])
    token_labels.append(label)

### a) Token categories

In [48]:
[tok[0] for tok in generate_tokens(StringIO(sample_code).readline)]

[1, 51, 3, 51, 0]

In [49]:
token_ids[0]

[54,
 1,
 51,
 1,
 51,
 1,
 51,
 51,
 51,
 4,
 1,
 51,
 1,
 51,
 1,
 51,
 51,
 51,
 4,
 54,
 1,
 1,
 51,
 1,
 4,
 54,
 54,
 0]

In [50]:
'Number encoded docs: {:,}'.format(len(token_ids))

'Number encoded docs: 472,087'

In [51]:
token_cat_docs = [
    ' '.join([str(item) for item in array]) for array in token_ids
]

In [52]:
token_cat_docs[0]

'54 1 51 1 51 1 51 51 51 4 1 51 1 51 1 51 51 51 4 54 1 1 51 1 4 54 54 0'

In [53]:
'Number token category docs: {:,}'.format(len(token_cat_docs))

'Number token category docs: 472,087'

In [54]:
cat_t = Tokenizer(num_words=NUM_WORDS,
                  split=' ',
                  char_level=False)

In [55]:
cat_t.fit_on_texts(token_cat_docs)

In [56]:
ordered_categories = sorted(cat_t.word_counts.iteritems(), key=lambda (k,v): (v,k), reverse=True)
N = 5
for key, value in ordered_categories[:N]:
    print "%s: %s" % (key, value)

51: 20368593
1: 18075194
4: 5886806
2: 2317086
54: 1996531


### i) Count: count of each token category in the document

In [51]:
encoded_category_tokens_count = cat_t.texts_to_matrix(token_cat_docs, mode='count')

In [52]:
encoded_category_tokens_count[0]

array([  0.,  11.,   9., ...,   0.,   0.,   0.])

In [53]:
np.save('data/processed/token_cat_count', encoded_category_tokens_count)

In [54]:
np.save('data/processed/token_labels', token_labels)

### b) Token words

In [58]:
[tok[1] for tok in generate_tokens(StringIO(sample_code).readline)]

['print', '(', '"Hello World!"', ')', '']

In [59]:
token_words[0]

[u'\n',
 u'a',
 u'=',
 u'int',
 u'(',
 u'raw_input',
 u'(',
 u')',
 u')',
 u'\n',
 u'b',
 u'=',
 u'int',
 u'(',
 u'raw_input',
 u'(',
 u')',
 u')',
 u'\n',
 u'\n',
 u'print',
 u'a',
 u'+',
 u'b',
 u'\n',
 u'\n',
 u'\n',
 '']

In [60]:
'Number encoded docs: {:,}'.format(len(token_words))

'Number encoded docs: 472,087'

In [61]:
token_docs = [
    ' '.join(array) for array in token_words
]

In [62]:
token_docs[0]

u'\n a = int ( raw_input ( ) ) \n b = int ( raw_input ( ) ) \n \n print a + b \n \n \n '

In [63]:
'Number token docs: {:,}'.format(len(token_docs))

'Number token docs: 472,087'

In [64]:
token_t = Tokenizer(num_words=NUM_WORDS, 
                    filters='\t\n', 
                    lower=False, 
                    split=' ', 
                    char_level=False)

In [65]:
token_t.fit_on_texts(token_docs)

In [66]:
ordered_tokens = sorted(token_t.word_counts.iteritems(), key=lambda (k,v): (v,k), reverse=True)
N = 5
for key, value in ordered_tokens[:N]:
    print "%s: %s" % (key, value)

): 3556931
(: 3556907
=: 2581991
:: 2248901
.: 2011442


### i) Count: count of each token word in the document

In [63]:
encoded_word_tokens_count = token_t.texts_to_matrix(token_docs, mode='count')

In [64]:
encoded_word_tokens_count[0]

array([ 0.,  4.,  4., ...,  0.,  0.,  0.])

In [65]:
np.save('data/processed/token_word_count', encoded_word_tokens_count)

## 3) Programs as Abstract Syntax Trees

Extract a representation of the AST with a breadth-first search (BFS):

In [19]:
import ast

def _strip_docstring(body):
    first = body[0]
    if isinstance(first, ast.Expr) and isinstance(first.value, ast.Str):
        return body[1:]
    return body

def get_ast_repr(node):
    
    visited = set()
    queue = [ [node, None, None, False] ]
    output = []
    
    while queue:
        
        vertex, value, name, end = queue.pop(0)
        
        # OUTPUT
        output.append(get_leaf(vertex, value, name, end))
        
        if vertex not in visited:
            
            visited.add(vertex)
            
            if hasattr(vertex, '_fields'):
                
                for field_name, field_value in zip(vertex._fields, 
                                               (getattr(vertex, attr) for attr in vertex._fields)):

                    if isinstance(field_value, ast.AST):
                        queue.append([field_value, field_name, vertex, False])
                    
                    elif isinstance(field_value, list):

                        if field_name == 'body':
                            # Drop a leading docstring from the list of statements
                            field_value = _strip_docstring(field_value)
                        for item in field_value:
                            if isinstance(item, ast.AST):
                                queue.append([item, field_name, vertex, False])
                            else:
                                queue.append([item, field_name, vertex, True])

                    else:
                        queue.append((field_value, field_name, vertex, True))
                   
    return output

def get_leaf(node, value, parent, end):
    
    node_name = node.__class__.__name__
    node_name = next(node_name) if node_name is None else node_name
    if node_name == 'str': node_name = str(node)

    return {
        'Node': node, 
        'Parent': parent, 
        'Name': node_name,
        'Value': value,
        'End': end,
    }

def get_ast_pairs(tree):
    names = {}
    tuples = []
    for leaf in tree:
        # Values
        node = leaf['Node']
        node_name = leaf['Name']
        parent = leaf['Parent']
        value = leaf['Value']
        # Save name
        names[node] = node_name
        if node is None or parent is None:
            continue
        # Parent name
        parent_name = names[parent]
        # Add tuple
        tuples.append((parent_name, node_name))
    return tuples

In [22]:
get_ast_pairs(get_ast_repr(ast.parse(sample_code)))

[('Module', 'Print'),
 ('Print', 'Str'),
 ('Print', 'bool'),
 ('Str', 'Hello World!')]
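
The breadth-first traversal can also be sketched more compactly with the standard `ast.iter_child_nodes` helper. This is a simplified Python 3 illustration, not the notebook's implementation: the function name `ast_type_pairs` is hypothetical, and unlike `get_ast_repr` it only visits AST nodes, so leaf values such as identifiers and string contents are not emitted.

```python
import ast
from collections import deque

def ast_type_pairs(source):
    """Breadth-first list of (parent type, child type) pairs."""
    root = ast.parse(source)
    queue = deque([root])
    pairs = []
    while queue:
        node = queue.popleft()  # FIFO order gives breadth-first traversal
        for child in ast.iter_child_nodes(node):
            pairs.append((type(node).__name__, type(child).__name__))
            queue.append(child)
    return pairs

pairs = ast_type_pairs("x = 1")
```

For `x = 1` the first pair is `('Module', 'Assign')`, followed by the children of `Assign` and then the `Store` context of the `Name` node. Using `deque.popleft()` avoids the linear cost of `list.pop(0)` on large trees.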

In [23]:
# flatten = lambda l: [item for sublist in l for item in sublist]

In [24]:
# flatten(get_ast_pairs(get_ast_repr(ast.parse(sample_code))))

Take the children only:

In [25]:
[pair[1] for pair in get_ast_pairs(get_ast_repr(ast.parse(sample_code)))]

['Print', 'Str', 'bool', 'Hello World!']

In [26]:
ast_docs = []
ast_labels = []
for doc, label in zip(docs, labels):
    try:
        # Keep only the child of each (parent, child) pair
        children = [pair[1] for pair in get_ast_pairs(get_ast_repr(ast.parse(doc)))]
    except Exception:
        # Skip submissions that cannot be parsed (e.g. syntax errors)
        continue
    ast_docs.append(children)
    ast_labels.append(label)

In [27]:
'Number AST docs: {:,}'.format(len(ast_docs))

'Number AST docs: 421,267'

In [28]:
ast_docs = [
    ' '.join(array) for array in ast_docs
]

In [29]:
ast_docs[0]

'Assign Assign Print Name Call Name Call BinOp bool a Store Name Call b Store Name Call Name Add Name int Load Name int Load Name a Load b Load raw_input Load raw_input Load'

In [30]:
ast_t = Tokenizer(num_words=NUM_WORDS, 
                  lower=False, 
                  split=' ', 
                  char_level=False)

In [31]:
ast_t.fit_on_texts(ast_docs)

In [67]:
ordered_ast_nodes = sorted(ast_t.word_counts.iteritems(), key=lambda (k,v): (v,k), reverse=True)
N = 5
for key, value in ordered_ast_nodes[:N]:
    print "%s: %s" % (key, value)

Name: 10005368
Load: 9607682
Store: 2665169
Call: 2205672
Assign: 2186523


### i) Count: count of each AST node in the document

In [32]:
encoded_ast_count = ast_t.texts_to_matrix(ast_docs, mode='count')

In [33]:
encoded_ast_count[0]

array([ 0.,  8.,  6., ...,  0.,  0.,  0.])

In [34]:
np.save('data/processed/ast_count', encoded_ast_count)

In [35]:
np.save('data/processed/ast_labels', ast_labels)