# Final project for DH

## Fetching the data

First we will need to dowload a ton of Python code. We chose the ActiveState repository, as it contains a hell lot of code that is mostly high quality.

In [1]:
! git clone https://github.com/ActiveState/code.git

Cloning into 'code'...
remote: Counting objects: 68352, done.[K
remote: Total 68352 (delta 0), reused 0 (delta 0), pack-reused 68352[K
Receiving objects: 100% (68352/68352), 12.05 MiB | 22.68 MiB/s, done.
Resolving deltas: 100% (24509/24509), done.
Checking connectivity... done.


## Inspecting the data

Before we begin with our actual work, it is always beneficial to take a look at the data first, so you now what you will be dealing with. This is especially important in our case as we are working with a programming language, which looks way different than natural language.

### Gather all source files

We will traverse the code directory and absorb every file that ends with .py, which should be a good enough indicator that this is indeed python source code.

In [2]:
import os

python_files = []
for root, dirs, files in os.walk('./code'):
    for file in files:
        if file.endswith('.py'):
            python_files += [ root + "/" + file ]

print(str(len(python_files)) + " files loaded")

4597 files loaded


That wasn't so hard, right? Now that we have a list of all relevant files we will read all of them in and save the whole text as one giant string.

In [3]:
blob = ""
for file in python_files:
    with open(file) as f:
        blob += f.read()

all_words = blob.split()

print("Read all files")

Read all files


### Cleaning the data

We will now use the Counter class to generate a list of all words and their corresponding frequency in the whole corpus. Python will do this more or less automatically for us, so we don't have to write too much source code here. After that, let's look at some of the most frequent tokens.

In [4]:
from collections import Counter

word_freqs = Counter(all_words)

for w in word_freqs.most_common(20):
    print(w)

('=', 91988)
('#', 43096)
('if', 33328)
('def', 30766)
('the', 29026)
('in', 23738)
('return', 20957)
('for', 20362)
('to', 14870)
('a', 14759)
('of', 12784)
('==', 12583)
('print', 12504)
('and', 12445)
('is', 12422)
('+', 12064)
('import', 10869)
('"""', 8902)
('not', 8452)
('1', 7927)


Since we are dealing with a programming language, it is no big surprise that something as trivial as an equal sign is on top of the list. But we can also see that the corpus contains a lot of 'normal' english words. This may pose a problem later because we don't want to help the person in front of the screen to write better prose, but nice and clean source code. On the other hand, python contains a lot of english words like _for_ or _if_ as reserved key words. Of cource we wouldn't want to kick these out of our data.

Fortunately, Python offers a way to get out of such situation. Take a look at this:

In [5]:
import keyword

print(keyword.iskeyword('def'))
print(keyword.kwlist[:6])

True
['False', 'None', 'True', 'and', 'as', 'assert']


How nice! We will use this a little bit later on. For now, let's focus on cleaning up the data by removing anything that isn't that important (think of comments for example).

In [6]:
# Step 1 - remove all comments, including inline ones, while preserving the source code in front of them

import re

comments = re.compile(r'#.*')

cleaned_text = comments.sub('', blob)

Great! Now we will the special characters dot, comma and end-of-line a special treatment so they stand out a little bit more.

In [7]:
# Step 2 - Replace dots and commas by special symbols, so things like "a," and "a ," are both recognized
# as an 'a' followed by a comma

commas = re.compile(r',')
cleaned_text = commas.sub(' <COMMA> ', cleaned_text)

dots = re.compile(r'\.')
cleaned_text = dots.sub(' <DOT> ', cleaned_text)

# Line breaks can also cause trouble if the next line is not indented

cr = re.compile(r'\n')
cleaned_text = cr.sub(' <EOL> ', cleaned_text)

And now it's time to seperate brackets and operators from things like variable names by putting some whitespace between them.

In [8]:
# Now just add some spaces to all kinds of brackets
cleaned_text = re.sub(r'\(', ' ( ', cleaned_text)
cleaned_text = re.sub(r'\[', ' [ ', cleaned_text)
cleaned_text = re.sub(r'\{', ' { ', cleaned_text)
cleaned_text = re.sub(r'\)', ' ) ', cleaned_text)
cleaned_text = re.sub(r'\]', ' ] ', cleaned_text)
cleaned_text = re.sub(r'\}', ' } ', cleaned_text)

# Equal signs are also special. We don't want something like "a=b" clobber our dictionary,
# so we add spaces. Then we undo this transformation on the comparison operator "==" and some special
# assignments
cleaned_text = re.sub(r'=', ' = ', cleaned_text)
cleaned_text = re.sub(r'=  =', '==', cleaned_text)
cleaned_text = re.sub(r'\+ =', '\+=', cleaned_text)
cleaned_text = re.sub(r'- =', '-=', cleaned_text)
cleaned_text = re.sub(r'\* =', '\*=', cleaned_text)
cleaned_text = re.sub(r'/ =', '/=', cleaned_text)

# Let's do the same with all the operators
cleaned_text = re.sub(r'\+', ' \+ ', cleaned_text)
cleaned_text = re.sub(r'-', ' - ', cleaned_text)
cleaned_text = re.sub(r'\*', ' \* ', cleaned_text)
cleaned_text = re.sub(r'/', ' / ', cleaned_text)

cleaned_text = re.sub(r'\+ =', '\+=', cleaned_text)
cleaned_text = re.sub(r'- =', '-=', cleaned_text)
cleaned_text = re.sub(r'\* =', '\*=', cleaned_text)
cleaned_text = re.sub(r'/ =', '/=', cleaned_text)


print("Done cleaning!")

Done cleaning!


## Transforming the data

The next step is to turn the words into numbers. Therefore we will create two dictionaries, one that translates words into numbers and another one that translates numbers back into words (for performance reasons).

In [9]:
def create_dicts(word_list):
    w2n_dict = { }
    n2w_dict = { }
    
    # We want certain things to always be in our dictionary, no matter what
    keep_list = keyword.kwlist + ['<DOT>', '<COMMA>', '<EOL>']
    cleaned_word_list = [ w.strip() for w in word_list ]
    
    full_word_list = list(set(cleaned_word_list + keep_list))
    for i,w in enumerate(full_word_list, len(w2n_dict)):
        w2n_dict[w] = i
        n2w_dict[i] = w
    return w2n_dict, n2w_dict

all_words = cleaned_text.split(" ")
w2n_dict, n2w_dict = create_dicts(all_words)

Let's see if it works:

In [10]:
n2w_dict[w2n_dict['return']]

'return'

Looks like the lookup is working in both directions. Let's see how big our dictionary has gotten:

In [11]:
print(len(w2n_dict))

151441


This is still a bit much. Let's see what's inside:

In [12]:
print(list(w2n_dict.keys())[:50])

['', 'simultaneously', 'typechecking', "'22'", 'FSList', 'od6WRehbe55W86jMV9"', '6OqqUbfd\\', 'call;', 'LruDict', '"sum"', "'SPSS_FMT_RBHEX'", 'items"%cnt', 'outparams', '"v8DT3Eb1LFFSGxolbeFRdl1Gc161X5WPO8qMQyYmiYsSfwzPsVFSQnxQmRwlAwlRUp8YZf"', '"write"', 'openchar:', 'getRows', 'OwnerId', '7116948092557', 'end_of_line', '"vBy8HgG4', '"Unpickleable', "'NOFILE'", 'complex1:bar', '"mechanize', "'fB'", '3NO5', "'''it", 'sgi9cVGc45pKDpmf0qJY7NNSM', 'Rendering"', '2014"', 'Qs', 'qmHFlvneAe', '__slot__', 'tickers', '``cls``', '"price"', 'TarFileCompat', 'TAOUP', "'convolve", "'_utilizes_abilities'", 'StreamResult', 'PNG"', 'playerIdentities:', 'Convert', 'manager"', 'ft&0xFFFFFFFFL', "'u3411'", 'PyNumber_InPlaceAnd', 'freevars']


Yikes, what a mess this still is!

One thing that is commonly used in NLP (natural language processing) is cutting of the lower end of your word distribution. That is, take all the words that occur less then a certain threshold and just throw them overboard. Let's do this!

In [13]:
# How often should a word appear to escape it's doom
min_count = 50

# Cut off words that are simply too rare
all_words = cleaned_text.split(" ")
c = Counter(all_words)

reduced_words = [ w for w in all_words if c[w] >= min_count ]

reduced_w2n_dict, reduced_n2w_dict = create_dicts(reduced_words)

print("Before: " + str(len(w2n_dict)))
print("After:  " + str(len(reduced_w2n_dict)))

Before: 151441
After:  2791


## Making the data ready for training

Ok so now that we have greatly reduced our dictionary size, we still have one problem to solve. The words that were filtered out are actually still inside the text and appear alongside the words that we are interested in. So when creating the input label pairs, what should we do?

In our case, we will just silently ignore these cases and return nothing. So let's get to it.

First we will transform every line into a series of IDs, masking all words that we don't care about with a -1.

In [14]:
def sentence_to_id(sent, ldict):
    transformed = []
    for w in sent.split(" "):
        if w in ldict.keys():
            transformed.append(ldict[w])
        else:
            transformed.append(-1)
    return transformed

lines = [ sentence_to_id(s, reduced_w2n_dict) for s in cleaned_text.split("<EOL>") ]

print(lines[20:30])

[[0, 0, 0, 0, 0, -1, 2627, 996, 0, 1506, 0, 1506, 2739, 2100, 623, -1, 0, 1506, 0, 1506, 839, 1506, 1257, -1, 2481, 2723, 966, 0, 1506, 1890, 2481, 878, 966, 0, 1506, 1890, 2481, 878, 966, 1693, 0, 1506, 0, 1506, 2569, 1506, 1257, -1, 1506, -1, 1506, -1, 0], [0, 0, 0], [0, 0, 1558, 0, 1558, 2550, 1444, 2670, 1745, 2032, 396, 1270, 1558, 0, 1558, 0, 0], [0, 0, 0], [0, 0, 0, 0, 0, -1, 2627, 996, 0, 1506, 0, 1506, 2739, 2100, 623, -1, 0, 1506, 0, 1506, 839, 1506, 1257, -1, 2481, 2723, 966, 0, 1506, 1890, 2481, 878, 966, 0, 1506, 1890, 2481, 878, 966, 1693, 0, 1506, 0, 1506, 2569, 1506, 1257, -1, 1506, -1, 1506, -1, 0, 1506, 0, 1506, 2032, 1506, 396, 0], [0, 0, 0], [0, 0, 1558, 0, 1558, -1, 475, 2551, 219, 302, 680, 1872, 1506, 715, -1, 1558, 0, 1558, 0, 0], [0, 0, 0], [0, 0, 0, 0, 0, -1, 2627, 996, 0, 1506, 1066, 2100, 623, -1, 0, 1506, 648, -1, 2481, 2723, 966, 0, 1506, 1890, 2481, 878, 966, 0, 1506, 1890, 2481, 878, 966, 1693, 0, 1506, 2348, -1, 1506, -1, 1506, -1, 0, 1506, 421, 0], [0,

So far we got all sentences tokenized and words that are irrelevant marked as such. Now we can start turning these into input target pairs, while throwing every pair out that has a '-1' in it.

In [15]:
def sent_to_samples(sentence):
    
    # ls stands for labeled sentence (meaning that the
    # end of the sentence is marked with an <EOL>)
    ls = sentence + [ reduced_w2n_dict['<EOL>'] ]
    
    x_vals = [ [ls[i]] for i in range(len(sentence)) ]
    y_vals = [ ls[i+1] for i in range(len(sentence)) ]
    
    cleaned_x = []
    cleaned_y = []
    
    for idx in range(len(x_vals)):
        if x_vals[idx][0] != -1 and y_vals[idx] != -1:
            cleaned_x.append(x_vals[idx])
            cleaned_y.append(y_vals[idx])
    
    return cleaned_x, cleaned_y

In [16]:
inputs = []
labels = []

for line in lines:
    x, y = sent_to_samples(line)
    inputs += x
    labels += y

In [17]:
print(inputs[20:30])
print(labels[20:30])

[[2487], [932], [1449], [903], [2239], [2199], [1275], [2207], [1695], [2733]]
[932, 1449, 903, 2239, 2199, 1275, 680, 1695, 2733, 932]


In [18]:
# We have more than plenty of data, so we will keep a random sample of it
keep_prob = 0.1

import random

training_inputs = []
training_labels = []

for idx in range(len(inputs)):
    if random.random() < keep_prob:
        training_inputs.append(inputs[idx])
        training_labels.append(labels[idx])
        
print("Kept {} out of {} samples".format(len(training_inputs), len(inputs)) )

Kept 708665 out of 7082116 samples


## Time to introduce the neural network

Now that we got our data set up, we can cast it into a numpy array and one-hot-encode the labels.

In [19]:
import numpy as np
from keras.utils import to_categorical

training_inputs = np.array(training_inputs)
training_labels = to_categorical(training_labels)

Using TensorFlow backend.


Here we set some parameters that influence how the network will perform.

In [20]:
learning_rate = 0.0001
embedding_dim = 256
mem_size = 256

# --- Auto computed
vocab_size = len(reduced_w2n_dict)

In [21]:
from keras.models import Sequential

from keras.layers import Dense, GRU, Embedding

from keras.optimizers import Adam

model = Sequential()
model.add( Embedding(vocab_size, embedding_dim, input_length=1) )
model.add( GRU(mem_size) )
model.add( Dense(vocab_size) )

model.compile(optimizer=Adam(learning_rate), loss='mse')
model.summary()

model.fit(x = training_inputs, y = training_labels, epochs = 5, batch_size=32, verbose=1, validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 256)            714496    
_________________________________________________________________
gru_1 (GRU)                  (None, 256)               393984    
_________________________________________________________________
dense_1 (Dense)              (None, 2791)              717287    
Total params: 1,825,767
Trainable params: 1,825,767
Non-trainable params: 0
_________________________________________________________________
Train on 566932 samples, validate on 141733 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb19265f8d0>

## Testing the network

Let's throw some words at the network and see what it predicts.

In [22]:
test_words = ['if', 'import', 'def', 'range', '(']
top_n = 5

for word in test_words:
    test_int = reduced_w2n_dict[word]
    
    output = model.predict(np.array([test_int]))
    
    candidates = output.argsort()[0][::-1]
    
    for idx in range(top_n):
        print(word, "->", reduced_n2w_dict[candidates[idx]])

if -> not
if -> __name__
if -> self
if -> len
if -> isinstance
import -> os
import -> sys
import -> random
import -> time
import -> re
def -> __init__
def -> main
def -> __repr__
def -> run
def -> __str__
range -> (
range -> <DOT>
range -> bytes
range -> 
range -> '%s
( -> 
( -> self
( -> 1
( -> x
( -> 0


Looking good!

Now we should save the model to reuse it later.

In [23]:
import time
timestr = time.strftime("%Y-%m-%d--%H-%M-%S")

filename = timestr + "--model.h5"
model.save(filename)