# Use Deep Learning To Detect Programming Languages

**`10-17-2018`**<br>
**`BlueOptima Data challenge - Dhiraj Tripathi, Rutgers University , Masters of IT & Analytics 2017-18`**<br>
**`email: dhiraj.tripathi@rutgers.edu`**<br>


## Introduction 

This notebook introduces a way to use deep learning to detect programming languages. Take the following code as an example.

``` Python
def test():
    print("something")

```
Running the detector introduced below on this snippet returns `Python`, which is the correct answer. In a preliminary test, the accuracy of the program is more than 90% (about 94.6% in the run logged later in this notebook).

## Project Structure

Let's first have a rough idea of the project structure.

**Neural_Network/resources/code/train**:

This folder holds the training data. The name of each subfolder represents a programming language. There are around 100 code files in each subfolder for Java, C, JavaScript, and Python. The data in this folder is used to train the neural network model to identify the programming language.

*The data was sourced from the links given in the problem statement*

**Neural_Network/resources/code/test**:

This folder represents the test data. There are around 30 files per programming language. The data in this folder will be used to test the accuracy of our neural network model.

**Neural_Network/src/config.py**: 

Some constants used in the program

**Neural_Network/src/neural_network_trainer.py**:

Code used to train the model.

**Neural_Network/src/detector.py**: 

Code used to load the model and detect the programming language.
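
Before training, it can help to confirm that the data folders follow the one-subfolder-per-language layout described above. The following is a minimal sketch, not part of the project: `count_files_per_language` is a hypothetical helper, and the temporary directory only mimics `resources/code/train` on a tiny scale.

```python
import os
import tempfile
from collections import Counter


def count_files_per_language(data_dir):
    """Count files in each immediate subfolder of data_dir (one subfolder per language)."""
    counts = Counter()
    for entry in sorted(os.listdir(data_dir)):
        sub_dir = os.path.join(data_dir, entry)
        if os.path.isdir(sub_dir):
            counts[entry] = sum(
                os.path.isfile(os.path.join(sub_dir, name)) for name in os.listdir(sub_dir)
            )
    return dict(counts)


# Throwaway layout standing in for the real training folder (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    for language, n_files in [("Python", 2), ("C", 1)]:
        os.makedirs(os.path.join(tmp, language))
        for i in range(n_files):
            with open(os.path.join(tmp, language, f"sample_{i}.txt"), "w") as f:
                f.write("placeholder")
    layout = count_files_per_language(tmp)

print(layout)  # {'C': 1, 'Python': 2}
```

Pointing the helper at the real train and test folders should show roughly 100 and 30 files per language respectively, if the layout matches the description.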

**Order of Execution** : 

1. Run the config.py file
2. Run the neural_network_trainer.py file
3. Run the detector.py file

With the default input set in detector.py, shown below, the detector reports that the code is Python:
``` Python
def test():
    print("something")
```
Of course, you can edit detector.py to detect any of the four supported programming languages: Java, JavaScript, C, and Python. Type a few lines of code into detector.py and run the files in the order of execution to detect the programming language.




## Execution:

Let's start by installing the required packages from the command line:

```conda install -c anaconda gensim```

```conda install -c conda-forge keras```

``` pip install tensorflow==1.3.0 ```

After installing the required packages, follow the order of execution, starting with config.py:



### config.py

In [1]:
import os
os.chdir("F://GIT//demos//Neural_Network")
os.getcwd()


'F:\\GIT\\demos\\Neural_Network'

In [2]:
project_dir = "F://GIT//demos//Neural_Network"  # project root; adjust to your own checkout
data_dir = os.path.join(project_dir, "resources", "code")

train_data_dir = os.path.join(data_dir, "train")  # path to the training data
test_data_dir = os.path.join(data_dir, "test")  # path to the test data
vocab_location = os.path.join(project_dir, "resources", "vocab.txt")
vocab_tokenizer_location = os.path.join(project_dir, "resources", "vocab_tokenizer")
word2vec_location = os.path.join(project_dir, "resources", "word2vec.txt")
model_file_location = os.path.join(project_dir, "resources", "models", "model.json")
weights_file_location = os.path.join(project_dir, "resources", "models", "model.h5")

input_length = 500
word2vec_dimension = 100

Running the config file declares the global constants that are used in the rest of the script. Now let's take a look at the **neural network model**:

### neural_network_trainer.py



### 1. Construct Vocabulary:

Let's start by importing the required packages and then write some functions to build the model.

In [3]:
import logging
import re
from collections import Counter
import numpy as np
import os
from gensim.models import Word2Vec
from keras import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense
from keras.models import model_from_json
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from numpy import asarray, zeros
import pickle

from src import config
from src.config import input_length

all_languages = ["Python", "C", "Java", "Javascript"]


Using TensorFlow backend.


We scan all the code in resources/code/train and keep the words that occur often enough. Those common words make up our vocabulary. The key code is as follows.

In [26]:
def build_vocab(train_data_dir):
    vocabulary = Counter()
    files = get_files(train_data_dir)
    for f in files:
        words = load_words_from_file(f)
        vocabulary.update(words)

    # remove rare words
    min_count = 5
    vocabulary = [word for word, count in vocabulary.items() if count >= min_count]
    return vocabulary
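
To see what the `min_count` filter does, here is a minimal sketch on in-memory data. It assumes the words have already been extracted from each file; `build_vocab_from_word_lists` and the toy word lists are illustrative names, not part of the project.

```python
from collections import Counter


def build_vocab_from_word_lists(word_lists, min_count):
    """Keep every word that occurs at least min_count times across all files."""
    counts = Counter()
    for words in word_lists:
        counts.update(words)
    return [word for word, count in counts.items() if count >= min_count]


# Each inner list stands for the words extracted from one training file.
word_lists = [
    ["def", "test", "print", "a"],
    ["def", "other", "print", "b"],
    ["int", "main", "return", "0"],
]
toy_vocab = build_vocab_from_word_lists(word_lists, min_count=2)
print(toy_vocab)  # ['def', 'print']
```

Only `def` and `print` appear twice, so every other word is dropped as rare.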

Now run the build_vocab function on the training data and look at the first 20 items of the resulting list:

#### Turn off the warning messages by clicking on the toggle button below this piece of code.

In [64]:
from IPython.display import HTML
HTML('''<script>
code_show_err=false; 
function code_toggle_err() {
 if (code_show_err){
 $('div.output_stderr').hide();
 } else {
 $('div.output_stderr').show();
 }
 code_show_err = !code_show_err
} 
$( document ).ready(code_toggle_err);
</script>
To toggle on/off output_stderr, click <a href="javascript:code_toggle_err()">here</a>.''')


In [61]:
vocab = build_vocab(config.train_data_dir)
vocab[:20]



['SPDX',
 'License',
 'Identifier',
 'GPL',
 '2',
 '0',
 'Helper',
 'function',
 'for',
 'splitting',
 'a',
 'string',
 'into',
 'an',
 'argv',
 'like',
 'array',
 'include',
 'linux',
 'kernel']

### 2. Build the vocab_tokenizer

We use the Tokenizer class provided by Keras to build vocab_tokenizer.

In [52]:
def build_vocab_tokenizer_from_set(vocab):
    vocab_tokenizer = Tokenizer(lower=False, filters="")
    vocab_tokenizer.fit_on_texts(vocab)
    return vocab_tokenizer

Then we save this vocab_tokenizer as a file, to be used later.

In [53]:
def save_vocab_tokenizer(vocab_tokenizer_location, vocab_tokenizer):
    with open(vocab_tokenizer_location, 'wb') as f:
        pickle.dump(vocab_tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)
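
Conceptually, the tokenizer is a lookup table from words to integer IDs. The sketch below is a simplified stand-in written in plain Python, not the Keras implementation: `build_word_index` and `encode` are hypothetical names, and IDs are assigned in order of first appearance, whereas the Keras Tokenizer orders its `word_index` by word frequency. In both cases the IDs start at 1, which leaves 0 free for padding.

```python
def build_word_index(vocab):
    """Assign each distinct word an integer ID starting at 1; 0 is left free for padding."""
    word_index = {}
    for word in vocab:
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index


def encode(words, word_index):
    """Drop out-of-vocabulary words and map the rest to their IDs."""
    return [word_index[w] for w in words if w in word_index]


word_index = build_word_index(["def", "print", "return"])
encoded = encode(["def", "test", "print", "something"], word_index)
print(word_index)  # {'def': 1, 'print': 2, 'return': 3}
print(encoded)     # [1, 2]
```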

### 3. Build Word Vectors

Each word in the vocabulary is mapped to a dense numeric vector, called a word vector. The basic steps are below:
1. Load all the training data and extract the words that are in the vocabulary.
2. Feed those words, one list per file, into the Word2Vec library and obtain one vector per word.
3. Later, when building the embedding layer, use the numbers assigned by vocab_tokenizer to place each vector in the matching row of the weight matrix.

In [65]:
def build_word2vec(train_data_dir, vocab_tokenizer):
    all_words = []
    files = get_files(train_data_dir)
    for f in files:
        words = load_words_from_file(f)
        all_words.append([word for word in words if is_in_vocab(word, vocab_tokenizer)])
    # gensim 3.x API; gensim 4 renames size to vector_size and index2word to index_to_key
    model = Word2Vec(all_words, size=100, window=5, workers=8, min_count=1)
    return {word: model.wv[word] for word in model.wv.index2word}

### 4. Build the Neural Network

The input of the neural network is the sequence of numbers that the words were mapped to, and the output is the probability that the code belongs to each programming language.

Now that we know the input and output of the neural network, the model is built from the following layers:

1. **Embedding Layer**: maps each word number to its word vector.
2. **Conv1D, MaxPooling1D**: a classic convolution and pooling pair. Put simply, it extracts local patterns from the word sequence and downsamples them.
3. **Flatten, Dense**: converts the multi-dimensional array into a one-dimensional one and outputs the prediction.


In [55]:
def build_model(train_data_dir, vocab_tokenizer, word2vec):
    weight_matrix = build_weight_matrix(vocab_tokenizer, word2vec)

    # build the embedding layer
    input_dim = len(vocab_tokenizer.word_index) + 1
    output_dim = get_word2vec_dimension(word2vec)
    x_train, y_train = load_data(train_data_dir, vocab_tokenizer)

    embedding_layer = Embedding(input_dim, output_dim, weights=[weight_matrix], input_length=input_length,
                                trainable=False)
    model = Sequential()
    model.add(embedding_layer)
    model.add(Conv1D(filters=128, kernel_size=5, activation="relu"))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
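    # sigmoid + binary_crossentropy scores each language independently; softmax with
    # categorical_crossentropy is the more common pairing for single-label classification
    model.add(Dense(len(all_languages), activation="sigmoid"))
    # model.summary() prints the table itself and returns None, so this also logs "None"
    logging.info(model.summary())
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])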
    model.fit(x_train, y_train, epochs=10, verbose=2)
    return model

Next, let's write a function that runs the neural network on the test code and reports its accuracy.

In [56]:
def evaluate_model(test_data_dir, vocab_tokenizer, model):
    x_test, y_test = load_data(test_data_dir, vocab_tokenizer)
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    logging.info('Test Accuracy: %f' % (acc * 100))
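
One common way to score a single-label classifier is top-1 accuracy: a prediction counts as correct when the highest-scoring output matches the true language. The sketch below illustrates that idea on made-up numbers. `argmax` and `top1_accuracy` are hypothetical helpers, and the result is not necessarily the same figure that Keras logs for a sigmoid and binary_crossentropy setup.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_accuracy(y_true, y_proba):
    """Fraction of samples whose highest-scoring class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_proba))
    return hits / len(y_true)


# Made-up labels and scores for four samples and four languages.
y_true = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
y_proba = [
    [0.9, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.7, 0.3],
    [0.6, 0.3, 0.1, 0.1],  # wrong: highest score at index 0, true label at index 1
    [0.1, 0.1, 0.2, 0.8],
]
print(top1_accuracy(y_true, y_proba))  # 0.75
```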


As noted earlier, the test accuracy is around 94% to 95%, which is good enough for our purpose. Let's save the neural network to files so that we can load it at detection time.

In [57]:
def save_model(model, model_file_location, weights_file_location):
    os.makedirs(os.path.dirname(model_file_location), exist_ok=True)
    with open(model_file_location, "w") as f:
        f.write(model.to_json())
    model.save_weights(weights_file_location)

The remaining functions below are helpers for the neural network and are referenced throughout the script.

In [4]:
def load_words_from_string(s):
    contents = " ".join(s.splitlines())
    result = re.split(r"[{}()\[\]\'\":.*\s,#=_/\\><;?\-|+]", contents)

    # remove empty elements
    result = [word for word in result if word.strip() != ""]

    return result
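
To make the splitting rule concrete, the sketch below copies the function above into a standalone snippet and runs it on two short inputs. Note that the underscore is one of the delimiters, so identifiers such as `EXPORT_SYMBOL` are broken into several words.

```python
import re


def load_words_from_string(s):
    # Join lines, split on punctuation and whitespace, then drop empty pieces.
    contents = " ".join(s.splitlines())
    result = re.split(r"[{}()\[\]\'\":.*\s,#=_/\\><;?\-|+]", contents)
    return [word for word in result if word.strip() != ""]


python_words = load_words_from_string('def test():\n    print("something")\n')
c_words = load_words_from_string("EXPORT_SYMBOL(__ashldi3);")
print(python_words)  # ['def', 'test', 'print', 'something']
print(c_words)       # ['EXPORT', 'SYMBOL', 'ashldi3']
```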


In [6]:
def load_vocab_tokenizer(vocab_tokenizer_location):
    with open(vocab_tokenizer_location, 'rb') as f:
        tokenizer = pickle.load(f)
    return tokenizer

In [7]:
def evaluate_saved_data(x_file_name, y_file_name, model):
    x = np.loadtxt(x_file_name)
    y = np.loadtxt(y_file_name)
    loss, accuracy = model.evaluate(x, y, verbose=2)
    print(f"loss: {loss}, accuracy: {accuracy}")

In [8]:
def to_binary_list(i, count):
    result = [0] * count
    result[i] = 1
    return result

In [9]:
def get_lang_sequence(lang):
    for i in range(len(all_languages)):
        if all_languages[i] == lang:
            return to_binary_list(i, len(all_languages))
    raise Exception(f"Language {lang} is not supported.")
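
The two helpers above turn a language name into a one-hot label. A compact sketch of the same idea follows; `one_hot` and `ALL_LANGUAGES` are illustrative names, with the language order taken from `all_languages` in this notebook.

```python
ALL_LANGUAGES = ["Python", "C", "Java", "Javascript"]


def one_hot(language, languages=ALL_LANGUAGES):
    """Return a list with a 1 at the language's position and 0 elsewhere."""
    if language not in languages:
        raise ValueError(f"Language {language} is not supported.")
    vector = [0] * len(languages)
    vector[languages.index(language)] = 1
    return vector


print(one_hot("Java"))  # [0, 0, 1, 0]
```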

In [10]:
def encode_sentence(sentence, vocab_tokenizer):
    encoded_sentence = vocab_tokenizer.texts_to_sequences(sentence.split())
    return [word[0] for word in encoded_sentence if len(word) != 0]

In [11]:
def load_vocab(vocab_location):
    with open(vocab_location) as f:
        words = f.read().splitlines()
    return set(words)

In [12]:
def load_word2vec(word2vec_location):
    result = dict()
    with open(word2vec_location, "r", encoding="utf-8") as f:
        lines = f.readlines()[1:]
    for line in lines:
        parts = line.split()
        result[parts[0]] = asarray(parts[1:], dtype="float32")
    return result
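
The loader above skips the first line of the file. That is consistent with the plain-text word2vec format, where the first line holds the number of words and the vector dimension, and each following line holds a word and its values. The sketch below assumes that format; `parse_word2vec_text` is a hypothetical helper and the sample content is made up.

```python
import io


def parse_word2vec_text(handle):
    """Parse a header line 'count dimension' followed by 'word v1 v2 ...' lines."""
    header = handle.readline().split()
    count, dimension = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dimension, vectors


# Made-up file content with two words and three dimensions.
sample = io.StringIO("2 3\ndef 0.1 0.2 0.3\nprint 0.4 0.5 0.6\n")
count, dimension, vectors = parse_word2vec_text(sample)
print(count, dimension)  # 2 3
print(vectors["print"])  # [0.4, 0.5, 0.6]
```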


In [13]:
def load_model(model_file_location, weights_file_location):
    with open(model_file_location) as f:
        model = model_from_json(f.read())
    model.load_weights(weights_file_location)
    return model

In [15]:
def get_files(data_dir):
    result = []
    for depth, (root, sub_folders, files) in enumerate(os.walk(data_dir)):
        # skip the top-level folder itself; only files inside subfolders are collected
        if depth == 0:
            continue
        result.extend([os.path.join(root, f) for f in files])
    return result

In [16]:
def load_words_from_file(file_name):
    try:
        with open(file_name, "r") as f:
            contents = f.read()
    except UnicodeDecodeError:
        logging.warning(f"Encountered UnicodeDecodeError, ignore file {file_name}.")
        return []
    return load_words_from_string(contents)

In [17]:
def get_languages(ext_lang_dict):
    languages = set()
    for ext, language in ext_lang_dict.items():
        if type(language) is str:
            languages.update([language])
        elif type(language) is list:
            languages.update(language)
    return languages

In [19]:
def save_vocabulary(vocabulary, file_location):
    with open(file_location, "w+") as f:
        for word in vocabulary:
            f.write(word + "\n")


In [20]:
def is_in_vocab(word, vocab_tokenizer):
    return word in vocab_tokenizer.word_counts.keys()

In [21]:
def concatenate_qualified_words(words, vocab_tokenizer):
    return " ".join([word for word in words if is_in_vocab(word, vocab_tokenizer)])

In [22]:
def load_sentence_from_file(file_name, vocab_tokenizer):
    words = load_words_from_file(file_name)
    return concatenate_qualified_words(words, vocab_tokenizer)

In [23]:
def load_sentence_from_string(s, vocab_tokenizer):
    words = load_words_from_string(s)
    return concatenate_qualified_words(words, vocab_tokenizer)

In [24]:
def load_encoded_sentence_from_file(file_name, vocab_tokenizer):
    sentence = load_sentence_from_file(file_name, vocab_tokenizer)
    return encode_sentence(sentence, vocab_tokenizer)


def load_encoded_sentence_from_string(s, vocab_tokenizer):
    sentence = load_sentence_from_string(s, vocab_tokenizer)
    return encode_sentence(sentence, vocab_tokenizer)


In [25]:
def load_data(data_dir, vocab_tokenizer):
    files = get_files(data_dir)
    x = []
    y = []
    for f in files:
        language = os.path.dirname(f).split(os.path.sep)[-1]
        x.append(load_encoded_sentence_from_file(f, vocab_tokenizer))
        y.append(get_lang_sequence(language))
    return pad_sequences(x, maxlen=input_length), asarray(y)
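
Every file must end up with exactly `input_length` numbers. By default, `pad_sequences` pads short sequences with zeros at the front and, for long sequences, keeps only the last `maxlen` items. The sketch below imitates that default behavior for a single sequence in plain Python; `pad_pre` is a hypothetical helper and assumes a positive `maxlen`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


print(pad_pre([7, 8, 9], maxlen=5))           # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=5))  # [2, 3, 4, 5, 6]
```

With `input_length = 500`, this means only the last 500 in-vocabulary words of a long file reach the network.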

In [28]:
def get_word2vec_dimension(word2vec):
    first_vector = list(word2vec.values())[0]
    return len(first_vector)


In [29]:
def build_weight_matrix(vocab_tokenizer, word2vec):
    vocab_size = len(vocab_tokenizer.word_index) + 1
    word2vec_dimension = get_word2vec_dimension(word2vec)
    weight_matrix = zeros((vocab_size, word2vec_dimension))
    for word, index in vocab_tokenizer.word_index.items():
        weight_matrix[index] = word2vec[word]
    return weight_matrix
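
The weight matrix lines up tokenizer IDs with word vectors: row `i` holds the vector of the word whose ID is `i`, and row 0 stays at zero because ID 0 is used for padding. Here is a small sketch of that layout using plain lists; `build_weight_rows` and the toy values are illustrative only.

```python
def build_weight_rows(word_index, word_vectors, dimension):
    """Row i holds the vector of the word with ID i; row 0 stays all zeros for padding."""
    rows = [[0.0] * dimension for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        rows[index] = list(word_vectors[word])
    return rows


toy_word_index = {"def": 1, "print": 2}
toy_word_vectors = {"def": [0.1, 0.2, 0.3], "print": [0.4, 0.5, 0.6]}
rows = build_weight_rows(toy_word_index, toy_word_vectors, dimension=3)
print(rows)  # [[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
```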

In [32]:
def build_and_save_vocab_tokenizer(train_data_dir, vocab_tokenizer_location):
    vocab = build_vocab(train_data_dir)
    vocab_tokenizer = build_vocab_tokenizer_from_set(vocab)
    save_vocab_tokenizer(vocab_tokenizer_location, vocab_tokenizer)
    return vocab_tokenizer


In [34]:
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    vocab_tokenizer = build_and_save_vocab_tokenizer(config.train_data_dir, config.vocab_tokenizer_location)
    word2vec = build_word2vec(config.train_data_dir, vocab_tokenizer)

    model = build_model(config.train_data_dir, vocab_tokenizer, word2vec)
    evaluate_model(config.test_data_dir, vocab_tokenizer, model)

    save_model(model, config.model_file_location, config.weights_file_location)


INFO:gensim.models.word2vec:collecting all words and their counts
INFO:gensim.models.word2vec:PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO:gensim.models.word2vec:collected 8712 word types from a corpus of 543284 raw words and 480 sentences
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:effective_min_count=1 retains 8712 unique words (100% of original 8712, drops 0)
INFO:gensim.models.word2vec:effective_min_count=1 leaves 543284 word corpus (100% of original 543284, drops 0)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 8712 items
INFO:gensim.models.word2vec:sample=0.001 downsamples 45 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 472089 word corpus (86.9% of prior 543284)
INFO:gensim.models.base_any2vec:estimated required memory for 8712 words and 100 dimensions: 11325600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.base_any2vec:training model 

INFO:root:None


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 100)          871300    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 496, 128)          64128     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 248, 128)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 31744)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 126980    
Total params: 1,062,408
Trainable params: 191,108
Non-trainable params: 871,300
_________________________________________________________________
Epoch 1/10
 - 4s - loss: 0.5713 - acc: 0.7880
Epoch 2/10
 - 3s - loss: 0.1465 - acc: 0.9500
Epoch 3/10
 - 3s - loss: 0.0761 - ac

INFO:root:Test Accuracy: 94.600001
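
The layer shapes and parameter counts in the summary above can be reproduced with a few lines of arithmetic. This is a sketch that assumes the tokenizer vocabulary has the 8712 entries reported in the gensim log, and that the convolution uses "valid" padding with a stride of 1 (the Keras defaults for Conv1D). The variable names are illustrative.

```python
vocab_size = 8712            # word types reported in the gensim log
input_length = 500           # from config.py
embedding_dim = 100          # word2vec_dimension in config.py
filters, kernel_size = 128, 5
pool_size = 2
num_languages = 4

embedding_params = (vocab_size + 1) * embedding_dim            # +1 row for padding ID 0
conv_length = input_length - kernel_size + 1                   # valid padding, stride 1
conv_params = kernel_size * embedding_dim * filters + filters  # weights plus biases
pooled_length = conv_length // pool_size
flat_size = pooled_length * filters
dense_params = flat_size * num_languages + num_languages       # weights plus biases

print(embedding_params, conv_length, conv_params)   # 871300 496 64128
print(pooled_length, flat_size, dense_params)       # 248 31744 126980
print(embedding_params + conv_params + dense_params)  # 1062408
print(conv_params + dense_params)                     # 191108
```

The last two figures match the total and trainable parameter counts in the summary; the embedding weights are the non-trainable part because the layer is created with `trainable=False`.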


### detector.py

### 5. Load the Neural Network For Detection

In this part, we need to load vocab_tokenizer and the neural network for detection. The code is as follows.

In [73]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from src import config
from src.config import input_length
from src.neural_network_trainer import load_model, \
    load_vocab_tokenizer, load_encoded_sentence_from_string, all_languages

vocab_tokenizer = load_vocab_tokenizer(config.vocab_tokenizer_location)
model = load_model(config.model_file_location, config.weights_file_location)

In [74]:
def to_language(binary_list):
    i = np.argmax(binary_list)
    return all_languages[i]


def get_neural_network_input(code):
    encoded_sentence = load_encoded_sentence_from_string(code, vocab_tokenizer)
    return pad_sequences([encoded_sentence], maxlen=input_length)


def detect(code):
    y_proba = model.predict(get_neural_network_input(code))
    return to_language(y_proba)
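
`detect` returns only the top language. When debugging a surprising result, it can be useful to look at all four scores. The sketch below ranks the languages by score; `rank_languages` is a hypothetical helper and the scores are made up. With the real model, the input would be one row of the array returned by `model.predict`, for example `y_proba[0]`.

```python
ALL_LANGUAGES = ["Python", "C", "Java", "Javascript"]


def rank_languages(probabilities, languages=ALL_LANGUAGES):
    """Pair each language with its score and sort from most to least likely."""
    return sorted(zip(languages, probabilities), key=lambda pair: pair[1], reverse=True)


ranking = rank_languages([0.05, 0.10, 0.80, 0.20])
print(ranking[0][0])  # Java
print(ranking)        # [('Java', 0.8), ('Javascript', 0.2), ('C', 0.1), ('Python', 0.05)]
```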


In [75]:
code = """
   

package com.google.common.base;

import static com.google.common.base.Preconditions.checkNotNull;

import com.google.common.annotations.Beta;
import com.google.common.annotations.GwtCompatible;
import java.io.Serializable;
import java.util.Iterator;
import java.util.Set;
import org.checkerframework.checker.nullness.qual.Nullable;


  public abstract T or(T defaultValue);

  /**
   * Returns this {@code Optional} if it has a value present; {@code secondChoice} otherwise.
   *
   * <p><b>Comparison to {@code java.util.Optional}:</b> this method has no equivalent in Java 8's
   * {@code Optional} class; write {@code thisOptional.isPresent() ? thisOptional : secondChoice}
   * instead.
   */
 
   * @throws NullPointerException if this optional's value is absent and the supplier returns {@code
   *     null}
   */
  @Beta
 ct <V> Optional<V> transform(Function<? super T, V> function);

  /**
   * Returns {@code true} if {@code object} is an {@code Optional} instance, and either the
   * contained references are {@linkplain Object#equals equal} to each other or both are absent.
   * Note that {@code Optional} instances of differing parameterized types can be equal.
   *
   * <p><b>Comparison to {@code java.util.Optional}:</b> no differences.
   */
  @Override
  public abstract boolean equals(@Nullable Object object);

 

  private static final long serialVersionUID = 0;
}


"""

In [70]:
type(code)

str

In [67]:
print(detect(code))

Java


The detected language is Java. You can replace the string `code` with a chunk of code from any of the four programming languages: Java, JavaScript, C, and Python.

Let's test the model on a different programming language.

In [85]:
code = """
   


#include <linux/export.h>

#include <linux/libgcc.h>

long long notrace __ashldi3(long long u, word_type b)
{
	DWunion uu, w;
	word_type bm;

	if (b == 0)
		return u;

	uu.ll = u;
	bm = 32 - b;

	if (bm <= 0) {
		w.s.low = 0;
		w.s.high = (unsigned int) uu.s.low << -bm;
	} else {
		const unsigned int carries = (unsigned int) uu.s.low >> bm;

		w.s.low = (unsigned int) uu.s.low << b;
		w.s.high = ((unsigned int) uu.s.high << b) | carries;
	}

	return w.ll;
}
EXPORT_SYMBOL(__ashldi3);


"""

In [84]:
print(detect(code))

C


## Summary:

Training this neural network model involves five conceptual steps:
1. Build vocabulary.
2. Build vocab_tokenizer using vocabulary, which is used to convert words into numbers.
3. Load words into Word2Vec to build word vectors.
4. Load word vectors into the neural network as part of the input layer.
5. Load all the training data, extract words that are in the vocabulary, convert them into numbers using vocab_tokenizer, load them into the neural network for training.

#### Three steps for detection:
1. Extract words in the code and remove those that are not in the vocabulary.
2. Convert those words into numbers through vocab_tokenizer, and load them into the neural network.
3. Choose the language with the highest probability, which is the answer we want. The input to the neural network is the sequence of numbers (mapped by vocab_tokenizer), and the output is the probability that the code is written in each programming language.

## Possible Enhancements:

The detector demonstrated above requires the user to paste code manually into the last part of the notebook before it can detect the programming language. Given more time, I would like to automate this step so that the script opens a folder, reads each file, and outputs the detected programming language without any manual input.
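
As a starting point for that automation, here is a minimal sketch under stated assumptions. `detect_folder` is a hypothetical helper that walks a folder and calls a detection function on every readable text file. `fake_detect` is a placeholder so the sketch runs without a trained model; in practice the `detect` function from detector.py would be passed instead.

```python
import os
import tempfile


def detect_folder(folder, detect):
    """Run `detect` on every readable text file under `folder`; return {path: language}."""
    results = {}
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            try:
                with open(path, "r", encoding="utf-8") as f:
                    code = f.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            results[path] = detect(code)
    return results


def fake_detect(code):
    # Placeholder for detector.detect, used only to keep this sketch self-contained.
    return "Python" if "def " in code else "C"


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("def test():\n    print('something')\n")
    with open(os.path.join(tmp, "b.txt"), "w", encoding="utf-8") as f:
        f.write("int main(void) { return 0; }\n")
    detected = detect_folder(tmp, fake_detect)
    labels = [detected[os.path.join(tmp, n)] for n in ("a.txt", "b.txt")]

print(labels)  # ['Python', 'C']
```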