# code2vec

VECTORIZE - Create code vectors using different strategies:

1. Word Vectors
    * Count / Frequency
    * TF-IDF
    * Binary (Presence / Absence)
2. Token Vectors
3. AST Vectors

TRAIN - Learn representations from combinations of these code vectors so that a model can distinguish between:

- correct code (the submission passes the test cases)
- wrong code (the submission fails the test cases)

In [1]:
import pandas as pd

In [2]:
!ls data/raw/programming_data.json

data/raw/programming_data.json


## VECTORIZE

In [3]:
dataframe = pd.read_json('data/raw/programming_data.json')

In [4]:
dataframe.head(2)

| Unnamed: 0 | academic_year_0 | academic_year_1 | correct | date | extension | ip | module | task | upload | user |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | 2017 | True | 2016-09-19 14:11:41 | py | | ca277 | add.py | #!/usr/bin/env python\n\na = int(raw_input())\... | b9e7e608-6036-4d44-8770-a7036176b53c |
| 1 | 2016 | 2017 | True | 2016-09-19 14:17:33 | py | | ca277 | concat-lines.py | #!/usr/bin/env python\n\na = str(raw_input())\... | b9e7e608-6036-4d44-8770-a7036176b53c |


In [5]:
'{:,}'.format(len(dataframe))

'591,707'

Grab only code submissions from Python modules:

In [6]:
PYTHON_MODULES = [
    'ca116', 
    'ca117', 
    'ca177', 
    'ca277', 
    'ca278',
]

In [7]:
dataframe = dataframe[dataframe['module'].isin(PYTHON_MODULES)]

In [8]:
'{:,}'.format(len(dataframe))

'490,820'

Target value:

In [9]:
dataframe.correct.value_counts()

False    296369
True     194451
Name: correct, dtype: int64
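
The classes are imbalanced, so it helps to know what a trivial model would score before training anything. A minimal sketch of the majority-class baseline, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {False: 296369, True: 194451}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / float(total)

print('total submissions: {:,}'.format(total))
print('majority class: {}'.format(majority_label))
print('baseline accuracy: {:.3f}'.format(baseline_accuracy))
```

Any classifier trained on these labels should be compared against this figure rather than against 50%.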

Remove comments:

In [10]:
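import re

def remove_comments(text):
    # Strip everything from '#' to the end of the line, newline included.
    # Note: the regex is not string-aware, so '#' inside a string literal is stripped too.
    return re.sub(r'#.*?\n', '', text)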
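
The regex above cannot tell a comment from a `#` inside a string literal. One possible alternative, sketched here for Python 3, is to let the standard `tokenize` module locate the comments. The function name `remove_comments_tokenize` is hypothetical, and unlike the regex version this sketch keeps the newline that follows each comment:

```python
import io
import tokenize

def remove_comments_tokenize(source):
    """Cut COMMENT tokens out of source, leaving '#' inside strings untouched."""
    lines = source.splitlines(True)  # keep line endings
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # rows are 1-based, columns 0-based
            line = lines[row - 1]
            newline = '\n' if line.endswith('\n') else ''
            # A comment always runs to the end of its line, so cut from col onward.
            lines[row - 1] = line[:col].rstrip() + newline
    return ''.join(lines)

src = 'x = "a # b"  # trailing\n# full line\ny = 1\n'
cleaned = remove_comments_tokenize(src)
print(repr(cleaned))
```

`tokenize` raises an exception on some malformed programs, so on student submissions a fallback to the regex version would probably be needed.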

Grab docs (code submissions) and labels (correct or not):

In [11]:
def get_docs_and_labels(df):
    _docs = []
    _labels = []
    for index in df.index:
        # Program
        code = remove_comments(
            df.at[index, 'upload']
        )
        _docs.append(code)
        # Label
        label = int(df.at[index, 'correct'])
        _labels.append(label)
    return _docs, _labels

In [12]:
docs, labels = get_docs_and_labels(dataframe)

In [13]:
'{:,}'.format(len(docs))

'490,820'

In [14]:
docs[0]

u'\na = int(raw_input())\nb = int(raw_input())\n\nprint a + b\n\n\n'

In [15]:
labels[0]

1

## 1) Programs as word vectors

In [16]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [17]:
t = Tokenizer(num_words=None, 
              filters='\t\n', 
              lower=True, 
              split=' ', 
              char_level=False)

In [18]:
t.fit_on_texts(docs)

In [19]:
# word_counts: a dictionary of words and their counts.
t.word_counts['if'] # word count

552539

In [20]:
# document_count: an integer count of the total number of documents that were used to fit the Tokenizer.
'Number docs: {:,}'.format(t.document_count)

'Number docs: 490,820'

In [21]:
# word_index: a dictionary of words and their uniquely assigned integers.
t.word_index['if'] # index

4

In [22]:
# word_docs: a dictionary of words and how many documents each appeared in.
t.word_docs['if']

298487
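
To make the three dictionaries concrete, here is a small pure-Python sketch that mimics them on a toy corpus. It does not call Keras. It assumes the same settings as above (lowercasing, tabs and newlines treated as separators, splitting on spaces), and `word_sequence` and `toy_docs` are illustrative names:

```python
from collections import Counter

def word_sequence(text):
    # Lowercase, turn tab/newline into spaces, split on spaces, drop empty strings.
    for ch in '\t\n':
        text = text.replace(ch, ' ')
    return [w for w in text.lower().split(' ') if w]

toy_docs = ['if a:\n    print a', 'print b', 'if a: if b: print a']

word_counts = Counter()  # total occurrences of each word
word_docs = Counter()    # number of documents containing each word
for doc in toy_docs:
    words = word_sequence(doc)
    word_counts.update(words)
    word_docs.update(set(words))

# Indices start at 1 and follow descending frequency; tie order is an assumption here.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {word: i + 1 for i, (word, _) in enumerate(ranked)}

print(word_counts['if'], word_docs['if'], word_index['if'])
```

Because only whitespace separates words, `a:` and `a` end up as different vocabulary entries.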

### a) Count: count of each word in the document

In [23]:
encoded_docs_count = t.texts_to_matrix(docs, mode='count')

In [24]:
encoded_docs_count[0]

array([ 0.,  2.,  0., ...,  0.,  0.,  0.])
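
Each row has one column per vocabulary index, and column 0 stays at zero because word indices start at 1. A minimal sketch of how one such row could be built, with a hypothetical three-word vocabulary and a hypothetical helper `count_row`:

```python
def count_row(words, word_index):
    # One slot per index, plus slot 0, which is never assigned to a word.
    row = [0.0] * (len(word_index) + 1)
    for w in words:
        idx = word_index.get(w)
        if idx is not None:  # words outside the vocabulary are skipped
            row[idx] += 1.0
    return row

word_index = {'print': 1, 'if': 2, 'a': 3}  # hypothetical toy vocabulary
row = count_row(['if', 'a', 'if', 'print', 'zzz'], word_index)
print(row)
```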

### b) TF-IDF: Term Frequency-Inverse Document Frequency (TF-IDF) scoring for each word in the document

In [25]:
# encoded_docs_tfidf = t.texts_to_matrix(docs, mode='tfidf')

In [26]:
# encoded_docs_tfidf[0]
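
TF-IDF down-weights words that appear in most documents. The sketch below uses the weighting tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)). This is my understanding of what `mode='tfidf'` computes, but treat it as an assumption and check the Keras source for the installed version. The helper name `tfidf_weight` and the rare-word figures are hypothetical:

```python
import math

def tfidf_weight(count, doc_freq, n_docs):
    """Weight for one word in one document (assumed formula, see lead-in)."""
    if count == 0:
        return 0.0
    tf = 1.0 + math.log(count)
    idf = math.log(1.0 + n_docs / (1.0 + doc_freq))
    return tf * idf

# 'if' appears in 298,487 of 490,820 documents (figures from the cells above).
common = tfidf_weight(count=2, doc_freq=298487, n_docs=490820)
# A hypothetical word that appears in only 10 documents.
rare = tfidf_weight(count=2, doc_freq=10, n_docs=490820)

print('common word: {:.2f}, rare word: {:.2f}'.format(common, rare))
```

With the same in-document count, the rare word receives a much larger weight than `if`.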

### c) Frequency: frequency of each word as a ratio of words within each document.

In [27]:
# encoded_docs_freq = t.texts_to_matrix(docs, mode='freq')

In [28]:
# encoded_docs_freq[0]

### d) Binary: Whether or not each word is present in the document

In [29]:
# encoded_docs_binary = t.texts_to_matrix(docs, mode='binary')

In [30]:
# encoded_docs_binary[0]
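
The frequency and binary encodings can both be derived from a count row. A minimal sketch, starting from a hypothetical count vector and assuming every word of the document is in the vocabulary:

```python
counts = [0.0, 1.0, 2.0, 1.0]  # hypothetical count row; slot 0 is unused

total = sum(counts)  # number of words in the document

# Frequency: each count as a share of the document's words.
freq = [c / total for c in counts]

# Binary: 1.0 if the word occurs at all, else 0.0.
binary = [1.0 if c > 0 else 0.0 for c in counts]

print(freq)
print(binary)
```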

## 2) Programs as tokens

In [31]:
from tokenize import generate_tokens
from StringIO import StringIO

### a) Tokens

In [32]:
sample_code = '''print("Hello World!")'''

In [33]:
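# Use `tok`, not `t`: a Python 2 list comprehension leaks its loop variable and would overwrite the Tokenizer.
[tok[0] for tok in generate_tokens(StringIO(sample_code).readline)]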

[1, 51, 3, 51, 0]
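
The numeric token types are hard to read and depend on the interpreter version. A small sketch, written for Python 3, that maps them to names through the standard `tokenize.tok_name` table (the variable `named` is illustrative):

```python
import io
import tokenize

sample_code = 'print("Hello World!")'

tokens = list(tokenize.generate_tokens(io.StringIO(sample_code).readline))

# Pair each token's symbolic type name with its text.
named = [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]

for name, text in named:
    print('{:<10} {!r}'.format(name, text))
```

Encoding programs by type name rather than by number keeps the vectors comparable across Python versions.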

In [34]:
# encoded_docs_tokens = [[tok[0] for tok in generate_tokens(StringIO(doc).readline)] for doc in docs]

### b) Token words

In [35]:
[tok[1] for tok in generate_tokens(StringIO(sample_code).readline)]

['print', '(', '"Hello World!"', ')', '']

In [36]:
# encoded_docs_token_words = [[tok[1] for tok in generate_tokens(StringIO(doc).readline)] for doc in docs]

## 3) Programs as Abstract Syntax Trees

In [None]:
# TODO
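
One possible starting point for this section, not the notebook's method, is to count AST node types per program with the standard `ast` module. The sketch below is written for Python 3, and `ast_node_counts` is a hypothetical helper. Submissions written in Python 2 syntax (for example `print a + b`) will not parse under Python 3, so they come back as `None` here:

```python
import ast
from collections import Counter

def ast_node_counts(source):
    """Return a Counter of AST node type names, or None if the code does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None
    return Counter(type(node).__name__ for node in ast.walk(tree))

counts = ast_node_counts('a = int(input())\nprint(a + 1)\n')
print(counts['Call'], counts['BinOp'], counts['Assign'])
```

Fixing an ordering of node type names would turn these counters into vectors of equal length.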

## TRAIN

In [37]:
from sklearn.model_selection import train_test_split

In [38]:
from sklearn.naive_bayes import MultinomialNB

In [39]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
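def train(X, y):
    """Fit a Multinomial Naive Bayes classifier and report held-out performance."""
    # Hold out 20% of the submissions for evaluation; the seed is fixed for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=0)

    nb = MultinomialNB()
    nb.fit(X_train, y_train)

    preds = nb.predict(X_test)

    print(confusion_matrix(y_test, preds))
    print('\n')
    print(classification_report(y_test, preds))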

In [None]:
train(encoded_docs_count, labels)
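
When reading the output of `train`, note that `confusion_matrix` puts true labels on the rows and predicted labels on the columns. A pure-Python sketch of the same layout on hypothetical toy labels (`confusion_counts` is an illustrative helper, not part of scikit-learn):

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    # matrix[i][j] = number of samples with true label i and predicted label j
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0 for _ in labels] for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[position[t]][position[p]] += 1
    return matrix

y_true = [1, 0, 0, 1, 0, 1]  # hypothetical labels: 1 = correct, 0 = wrong
y_pred = [1, 0, 1, 1, 0, 0]

matrix = confusion_counts(y_true, y_pred)
accuracy = (matrix[0][0] + matrix[1][1]) / float(len(y_true))

print(matrix)
print('accuracy: {:.2f}'.format(accuracy))
```

With labels 0 and 1, the top-left cell counts wrong submissions predicted as wrong and the bottom-right cell counts correct submissions predicted as correct.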